# ByteTransformer: Optimized BERT Transformer Inference on NVIDIA GPUs

## Introduction

ByteTransformer is a high-performance inference library for BERT-like transformers that offers the following features:

* Provides Python and C++ APIs; the PyTorch plugin lets users accelerate transformer inference with just a few lines of Python code (a usage sketch appears at the end of this README).
* Supports both fixed-length and variable-length input sequences.
* Includes end-to-end, architecture-aware optimizations of the padding-free algorithm across the core BERT routines: QKV encoding, softmax, the feed-forward network, activation, layer normalization, and multi-head attention.

ByteTransformer has been widely deployed to improve in-house transformer inference serving systems at ByteDance, delivering superior performance over other transformer implementations for both fixed-length and variable-length inputs. The technical details have been published at IEEE IPDPS 2023.

## Cite Us

If you use our library, please cite our research paper.

```
@article{zhai2022bytetransformer,
  title={ByteTransformer: A High-Performance Transformer Boosted for Variable-Length Inputs},
  author={Zhai, Yujia and Jiang, Chengquan and Wang, Leyuan and Jia, Xiaoying and Zhang, Shang and Chen, Zizhong and Liu, Xin and Zhu, Yibo},
  journal={arXiv preprint arXiv:2210.03052},
  year={2022}
}
```

## Performance and Speedup

We compared ByteTransformer with PyTorch, TensorFlow, FasterTransformer, and DeepSpeed on an A100 GPU. The benchmark script is available in [benchmark/bert_bench.sh](https://github.com/bytedance/ByteTransformer/blob/main/benchmark/bert_bench.sh).
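If you prefer to launch the same configurations by hand rather than through the benchmark script, the sketch below drives `bert_transformer_test.py` (documented in "Getting Started with Unit Tests" below) over the sequence lengths used in the batch-size-16 table, with the average sequence length set to 0.6 * the maximum. The head configuration (12 heads of size 64) is taken from the unit-test example and is an assumption about the benchmark setup; run the sweep from the `build` directory and rely on whatever timing output the script itself prints.

```python
import subprocess

# Assumed BERT-base-like head configuration (12 heads x 64), matching the
# unit-test example later in this README; adjust if your model differs.
batch_size, head_num, head_size = 16, 12, 64

# Maximum sequence lengths 64, 128, ..., 1024, as in the tables below.
for max_seqlen in range(64, 1025, 64):
    avg_seqlen = int(0.6 * max_seqlen)  # average length = 0.6 * maximum
    subprocess.run(
        ["python3", "bert_transformer_test.py",
         str(batch_size), str(max_seqlen), str(head_num), str(head_size),
         "--avg_seqlen", str(avg_seqlen), "--dtype", "fp16"],
        check=True,  # stop the sweep if any configuration fails
    )
```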
**1. Standard BERT, batch size = 1, average sequence length = 0.6 * the maximum, execution time in milliseconds:**

| Max. sequence length | PyTorch | TensorFlow | FasterTransformer | FasterTransformer (remove padding) | DeepSpeed | ByteTransformer |
|------|---------|------------|-------------------|------------------------------------|-----------|-----------------|
| 64   | 2.93 | 2.46 | 1.05 | 1.23 | 1.17 | 0.90 |
| 128  | 3.18 | 2.6  | 1.10 | 1.43 | 1.28 | 0.97 |
| 192  | 3.18 | 2.81 | 1.26 | 1.43 | 1.40 | 1.36 |
| 256  | 2.81 | 2.9  | 1.35 | 1.55 | 1.51 | 1.43 |
| 320  | 3.11 | 3.24 | 1.63 | 1.66 | 1.84 | 1.69 |
| 384  | 2.87 | 3.43 | 1.64 | 1.64 | 1.95 | 1.72 |
| 448  | 2.99 | 3.61 | 2.26 | 2.35 | 2.23 | 1.86 |
| 512  | 2.89 | 3.74 | 2.28 | 2.43 | 2.37 | 2.00 |
| 576  | 2.99 | 4.03 | 2.51 | 2.59 | 2.70 | 2.19 |
| 640  | 2.99 | 4.54 | 2.85 | 2.83 | 3.17 | 2.23 |
| 704  | 3.21 | 4.67 | 3.16 | 3.44 | 3.32 | 2.47 |
| 768  | 3.33 | 4.88 | 3.26 | 3.63 | 3.46 | 2.51 |
| 832  | 3.78 | 5.39 | 3.75 | 3.87 | 3.97 | 2.80 |
| 896  | 3.86 | 5.81 | 4.08 | 4.95 | 4.37 | 2.86 |
| 960  | 4.02 | 6.27 | 4.30 | 5.23 | 4.66 | 3.12 |
| 1024 | 4.2  | 6.37 | 4.51 | 4.96 | 4.86 | 3.16 |

**2. Standard BERT, batch size = 16, average sequence length = 0.6 * the maximum, execution time in milliseconds:**

| Max. sequence length | PyTorch | TensorFlow | FasterTransformer | FasterTransformer (remove padding) | DeepSpeed | ByteTransformer |
|------|---------|------------|-------------------|------------------------------------|-----------|-----------------|
| 64   | 3.2   | 4.57  | 2.24  | 1.93  | 2.81  | 2.09  |
| 128  | 4.97  | 6.97  | 3.62  | 3.33  | 4.54  | 3.18  |
| 192  | 7.65  | 9.37  | 5.26  | 5.29  | 6.68  | 5.08  |
| 256  | 9.56  | 12.17 | 6.77  | 5.49  | 9.03  | 6.85  |
| 320  | 13.21 | 15.87 | 8.85  | 6.47  | 12.81 | 7.49  |
| 384  | 15.01 | 18.56 | 10.37 | 7.05  | 15.19 | 8.44  |
| 448  | 19.06 | 23.01 | 15.97 | 12.54 | 18.83 | 8.89  |
| 512  | 21    | 26.03 | 18.03 | 13.79 | 21.55 | 9.22  |
| 576  | 24.33 | 31.24 | 21.11 | 17.65 | 26.2  | 10.15 |
| 640  | 28.03 | 35.07 | 24.52 | 20.34 | 30.24 | 12.04 |
| 704  | 32.33 | 41.43 | 28.94 | 24.52 | 34.65 | 13.55 |
| 768  | 35.31 | 44.62 | 32.09 | 28.21 | 37.95 | 16.3  |
| 832  | 40.75 | 51.87 | 36.33 | 31.69 | 45.32 | 16.92 |
| 896  | 44.47 | 55.65 | 42.17 | 38.05 | 49.48 | 20.67 |
| 960  | 49.72 | 63.59 | 47.01 | 42.98 | 55.72 | 23.27 |
| 1024 | 53.21 | 65.94 | 50.28 | 45.22 | 59.96 | 24.70 |

## Supported Models

Currently, only the standard BERT transformer encoder is available in this repository.

## Environment Requirements

* CUDA: 11.6
* CMake: >= 3.13
* PyTorch: >= 1.8
* GPU compute capability: 7.0 (V100), 7.5 (T4), or 8.0 (A100)
* Python: >= 3.7

Tested on: A100 + CUDA 11.6 + PyTorch 1.13.0+cu116 + Python 3.9.16

## Building from Source

To build from source, run the following commands:

```bash
git submodule update --init
mkdir build && cd build
cmake -DTORCH_CUDA_ARCH_LIST="8.0" -DDataType=FP16 -DBUILD_THS=ON -DCUDAARCHS="80" ..
make
```

## Getting Started with Unit Tests

### Unit Tests in C++

To generate test data, run the following commands:

```bash
cd build
# batch size = 16, seqlen = 64, head num = 12, head size = 64, avg seqlen = 32
python3 bert_transformer_test.py 16 64 12 64 --avg_seqlen 32 --dtype fp16 --export_data
```

Here, `16`, `64`, `12`, and `64` are the batch size, sequence length, number of heads, and head size, respectively. `--avg_seqlen 32` sets the average sequence length, `--dtype fp16` sets the data type, and `--export_data` writes the test data to disk.

Once the test data has been generated (`*.in` and `*.out` files saved under the current directory), run:

```bash
./bin/bert_transformer_test 16 64 12 64
```

The arguments have the same meaning as when generating the test data.

### Unit Tests for the PyTorch Plugin in Python

To run the unit tests through the PyTorch plugin, use the same script as for C++ but without the `--export_data` flag:

```bash
# batch size = 16, seqlen = 64, head num = 12, head size = 64, avg seqlen = 32
python3 bert_transformer_test.py 16 64 12 64 --avg_seqlen 32 --dtype fp16
```

Again, the arguments have the same meaning as above.

## Benchmark

```bash
cd build
../benchmark/bert_bench.sh
```
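For users who want to call the optimized encoder directly from PyTorch rather than through the test and benchmark scripts, the sketch below shows roughly what loading the TorchScript extension built with `-DBUILD_THS=ON` and preparing FP16 GPU inputs might look like. The library path, op namespace, and call signature shown here are assumptions for illustration only; `bert_transformer_test.py` demonstrates the actual interface, including the per-layer weights the op expects.

```python
import torch

# Load the TorchScript extension produced by the CMake build with -DBUILD_THS=ON.
# NOTE: the path and filename below are assumptions; point this at the .so your
# build actually generates.
torch.ops.load_library("build/lib/libths_bytetransformer.so")

# BERT-base-like configuration, matching the unit-test example above.
batch_size, seq_len, head_num, head_size = 16, 64, 12, 64
hidden_dim = head_num * head_size  # 768

# FP16 activations on the GPU; variable-length batches are described by a mask.
hidden_states = torch.randn(batch_size, seq_len, hidden_dim,
                            dtype=torch.float16, device="cuda")
attention_mask = torch.ones(batch_size, seq_len, seq_len,
                            dtype=torch.float16, device="cuda")

# The op name and argument list below are hypothetical placeholders; consult
# bert_transformer_test.py for the registered name, the weight tensors, and the
# layer hyper-parameters the real entry point requires.
# output = torch.ops.bytetransformer.bert_transformer(hidden_states, attention_mask, ...)
```

Treat the commented call purely as a placeholder; the shipped unit test is the authoritative reference for how the plugin is invoked.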