# TokenWeave
Efficient Compute-Communication Overlap for Distributed LLM Inference
[Paper](https://arxiv.org/abs/2505.11329)
## Overview
**TokenWeave** is a system designed to reduce communication overhead during distributed inference of large language models (LLMs). Even with high-speed interconnects like NVLink, distributed inference can incur up to *20% performance overhead* due to communication bottlenecks.
TokenWeave addresses this by introducing a **coarse-grained compute-communication overlap** mechanism that significantly improves inference efficiency. TokenWeave is currently integrated with `Llama-3.3-70B`, `Qwen2.5-72B`, and `Mixtral-8x22B`, but it can easily be extended to other similar models by modifying the model file. See how we modify `llama.py` for the steps required to integrate TokenWeave into an existing model file, and see `csrc/tokenweave_fused_kernels.cu` for the fused kernels used to implement the compute-communication overlap.
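The idea can be illustrated with a rough PyTorch sketch (illustration only; the actual TokenWeave implementation uses the fused CUDA kernels in `csrc/tokenweave_fused_kernels.cu` together with Multimem-NVLS collectives): the token batch is split into chunks, and the all-reduce of one chunk is issued on a separate communication stream while another chunk's compute runs on the default stream. The `mlp` module and `comm_stream` names below are placeholders introduced for this example.

```python
# Minimal sketch of coarse-grained compute-communication overlap.
# Illustration only; TokenWeave's real implementation uses fused CUDA
# kernels (csrc/tokenweave_fused_kernels.cu) and Multimem-NVLS collectives.
import torch
import torch.distributed as dist

def overlapped_layer(hidden: torch.Tensor, mlp: torch.nn.Module,
                     comm_stream: torch.cuda.Stream) -> torch.Tensor:
    # Split the token batch into two chunks along the token dimension.
    first, second = hidden.chunk(2, dim=0)

    # Compute the first chunk on the default stream.
    out_first = mlp(first)

    # Overlap: all-reduce the first chunk on the communication stream
    # while the second chunk's compute runs on the default stream.
    comm_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(comm_stream):
        dist.all_reduce(out_first)

    out_second = mlp(second)

    # All-reduce the second chunk once its compute has been issued.
    comm_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(comm_stream):
        dist.all_reduce(out_second)

    # The default stream waits for both all-reduces before continuing.
    torch.cuda.current_stream().wait_stream(comm_stream)
    return torch.cat([out_first, out_second], dim=0)
```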
### TokenWeave NVIDIA Nsight Systems (nsys) profile
## Prerequisites
- **Compilation**: CUDA 12.4
- **Runtime environment**: Python 3.12, PyTorch 2.6.0, Ubuntu 22.04
- **Hardware**: 8×H100 DGX system with NVLink interconnects
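A quick, generic way to sanity-check the runtime environment against these requirements (not part of the repository):

```python
# Generic sanity check of the runtime environment; not part of TokenWeave.
import torch

print("PyTorch:", torch.__version__)          # expect 2.6.0
print("CUDA (build):", torch.version.cuda)    # expect 12.4
print("GPUs:", torch.cuda.device_count())     # expect 8 on a DGX H100
if torch.cuda.is_available():
    print("Device 0:", torch.cuda.get_device_name(0))
```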
## Installation
To ease the setup, we recommend using either of these two Docker images:
- `pytorch/pytorch:2.6.0-cuda12.4-cudnn9-devel` or
- `vllm/vllm-openai:v0.8.5`
```bash
apt-get update; apt-get upgrade -y; apt-get install kmod git build-essential tmux -y
git clone https://github.com/microsoft/tokenweave.git
cd tokenweave
# Install miniconda; skip if already installed
make install_miniconda # 30 seconds
make create_env
bash # Refresh shell and activate
conda activate tokenweave
make install # 18 minutes
# or alternatively:
pip3 install -v -e .
make install_dependencies # 17 seconds
```
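Since TokenWeave is built on top of vLLM, a simple import check can confirm that the editable install is visible in the environment (assuming the steps above completed successfully):

```python
# Quick check that the editable install is importable.
# TokenWeave builds on vLLM, so the top-level package is `vllm`.
import vllm
print(vllm.__version__)
```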
## Quick Start
To get started with TokenWeave:
```bash
huggingface-cli login --token HF_TOKEN
# Run offline inference examples
make run_qwen2
make run_mixtral
make run_llama3
# NOTE: If Llama 3 gets stuck during the model downloading stage,
# please kill the process and start it again — that should resolve the issue.
# Note: vLLM version 0.8.5.post1 may also hang during model downloading, depending
# on the environment setup.
```
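The make targets above wrap offline-inference example scripts. A rough sketch of such a run with the vLLM-style API is shown below; the model id and sampling settings are illustrative, so check the scripts behind the make targets for the exact arguments.

```python
# Rough sketch of an offline inference run using the vLLM-style API that
# TokenWeave builds on. Model id and sampling settings are illustrative;
# see the scripts behind `make run_llama3` for the exact setup.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",  # assumed HF model id
    tensor_parallel_size=8,                      # 8xH100 with NVLink
)
params = SamplingParams(temperature=0.0, max_tokens=128)
outputs = llm.generate(["Explain compute-communication overlap."], params)
for out in outputs:
    print(out.outputs[0].text)
```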
**To Generate Tokenweave Configs (Optional)**
If you want to generate TokenWeave configs for a new model, use the configs_generator script and modify it as needed. Configs for `Llama-3.3-70B`, `Qwen2.5-72B`, and `Mixtral-8x22B` on 8×H100 are already provided.
```bash
cd artifact
tmux new -s tokenweave_session # Start a new tmux session
conda activate tokenweave # Activate the conda environment
# Run the following command in the tmux session to generate configs for
# `LLaMA-3.3-70B`, `Qwen2.5-72B`, and `Mixtral-8x22B`
make configs_generator # Takes approximately 1 day
cd .. # Go back to the tokenweave directory
```
**To profile using nsys**
```bash
# install nsys
wget https://developer.nvidia.com/downloads/assets/tools/secure/nsight-systems/2024_4/NsightSystems-linux-cli-public-2024.4.1.61-3431596.deb
dpkg -i NsightSystems-linux-cli-public-2024.4.1.61-3431596.deb
# run (append the command you want to profile, e.g. one of the offline inference examples)
nsys profile -o report.nsys-rep --trace-fork-before-exec=true --cuda-graph-trace=node <command to profile>
```
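When inspecting the resulting `report.nsys-rep`, it can help to mark regions of interest with NVTX ranges so they appear on the nsys timeline. A generic (not TokenWeave-specific) way to do this from Python:

```python
# Generic NVTX annotation so custom regions show up in the nsys timeline.
# Not TokenWeave-specific; wrap whatever region you want to inspect.
import torch

torch.cuda.nvtx.range_push("decode_step")
# ... run one decode iteration here ...
torch.cuda.nvtx.range_pop()
```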
## Benchmarks
Our evaluation includes two types of experiments:
- Microbenchmark performance (Figures 1, 3, 4, 5, 6, 7, and 10)
- End-to-end LLM performance (Figures 11, 12, and 13)
To reproduce the results, use the `Makefile` in the `artifact/` directory:
```bash
cd artifact
tmux new -s tokenweave_session # start a new tmux session
conda activate tokenweave # activate the conda environment
# run the following commands in the tmux session
make clean
make correctness_check # check output/ directory for the raw text generated
make all # ~10 hours 48 minutes
# To generate the figures piece-wise
make figure_5_6_7 # 20 minutes
make figure_4_10 # 1 hour 25 minutes
make figure_9 # 8 minutes
make figure_1_3 # 3 hours 25 minutes
make figure_2_11 # 1 hour 10 minutes
make figure_12 # 2 hours 34 minutes
make figure_13 # 1 hour 52 minutes
```
The artifact scripts redirect the raw output numbers and logs to the `output/` folder, while the plotted graphs are stored
in the `graphs/` folder. CSV files for the figures can be found in the `csvs/` directory. Results may show minor runtime
variations compared to those reported in the paper, but the general trends should remain consistent.
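If you want to inspect the raw numbers outside the provided plotting scripts, a generic sketch for loading the generated CSVs (the actual file names under `csvs/` depend on the figure) could look like this:

```python
# Generic helper to inspect the generated CSVs; actual file names under
# csvs/ depend on the figure being reproduced.
import glob
import pandas as pd

for path in sorted(glob.glob("csvs/*.csv")):
    df = pd.read_csv(path)
    print(path, df.shape)
    print(df.head())
```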
## Citation
If you use our work, please consider citing our [paper](https://arxiv.org/abs/2505.11329):
```bibtex
@misc{gond2025tokenweave,
  title={TokenWeave: Efficient Compute-Communication Overlap for Distributed LLM Inference},
  author={Raja Gond and Nipun Kwatra and Ramachandran Ramjee},
  year={2025},
  url={https://arxiv.org/abs/2505.11329}
}
```
## Acknowledgment
This repository originally started as a fork of the [vLLM project](https://github.com/rajagond/vllm/tree/87aaadef73543ab3e63eea933c39cee42c418e90) (commit `87aaade`). The Multimem-NVLS communication collective kernels in TokenWeave are built on top of the [PyTorch implementation](https://github.com/pytorch/pytorch/blob/f6275bf0fe198f7f27569776ec221eb040a4cfa2/torch/csrc/distributed/c10d/CUDASymmetricMemoryOps.cu).