# FasterTransformer4CodeFuse


| [**English**](README.md) |
## Introduction

Provides relatively high-performance model inference, mainly supporting Ant Group's CodeFuse model.

Compared with the [original FasterTransformer](https://github.com/NVIDIA/FasterTransformer), this repo adds:

- :white_check_mark: int8 quantization for the CodeFuse model
- :white_check_mark: prompts no longer need to end on a complete word
- :white_check_mark: a Python API
- :white_check_mark: streaming output in Python
- :white_check_mark: faster model loading
- :white_check_mark: assorted bug fixes

## Performance

> Batch size: 1
**Model**: CodeFuse 13B. **Metric**: inference latency (ms), with single-GPU A100 vs. two A100s running model-parallel.

| Input / output length | Single A100, fp16 | Single A100, int8 | 2× A100, fp16 | 2× A100, int8 |
|---|---|---|---|---|
| 16 / 8 | 160 | 195 | 238 | 84 |
| 64 / 32 | 608 | 369 | 373 | 295 |
| 256 / 128 | 2650 | 1530 | 1492 | 1130 |
| 1024 / 512 | 10776 | 7054 | 6786 | 5415 |
| **Tokens per second** | 48 | 75 | 77 | 98 |
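As a rough reading aid, the throughput row is broadly in line with what the longest case implies: for fp16 on a single A100, 512 generated tokens in 10.776 s works out to about 48 tokens per second.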
## How to Run

The runtime environment we use: `nvcr.io/nvidia/pytorch:22.09-py3`.

#### 1. Install dependencies

```
pip install --no-cache-dir pybind11==2.6.2 transformers accelerate sentencepiece
echo "export pybind11_DIR=/opt/conda/lib/python3.8/site-packages/pybind11/share/cmake/pybind11/" >> ~/.bashrc
export pybind11_DIR=/opt/conda/lib/python3.8/site-packages/pybind11/share/cmake/pybind11/
```

#### 2. Build

```
mkdir build ; cd build
export TORCH_PYTHON_LIBRARIES=/opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so
cmake -DCMAKE_BUILD_TYPE=Release -DSM="80;75" -DBUILD_PYT=ON -DSPARSITY_SUPPORT=OFF -DMEASURE_BUILD_TIME=ON \
      -DBUILD_CUTLASS_MIXED_GEMM=ON -DBUILD_MULTI_GPU=ON -DBUILD_TRT=OFF \
      -DENABLE_FP8=OFF -DBUILD_PYBIND=ON -DTORCH_PYTHON_LIBRARIES=${TORCH_PYTHON_LIBRARIES} ..
make -j"$(grep -c ^processor /proc/cpuinfo)"
```

#### 3. Use the model

Use the `examples/pytorch/codefuse/huggingface_convert.py` script to convert a Hugging Face Transformers model into model files this project can load. For example:

```
export MODEL_NAME=codefuse
export TENSOR_PARA_SIZE=2

python ../examples/pytorch/codefuse/huggingface_convert.py \
       -o ../models/${MODEL_NAME}/fastertransformer \
       -i ../models/${MODEL_NAME}/transformers \
       -infer_gpu_num ${TENSOR_PARA_SIZE} \
       -processes 20 \
       -weight_data_type fp16 \
       -model_name gptneox
```

Use the `examples/pytorch/codefuse/quant_and_save.py` script to convert fp16 or fp32 model files into int8-quantized model files. The quantized files are smaller and load faster in int8 mode.

```
export MODEL_NAME=codefuse
export TENSOR_PARA_SIZE=2

python ../examples/pytorch/codefuse/quant_and_save.py \
       --in_dir ../models/${MODEL_NAME}/fastertransformer/${TENSOR_PARA_SIZE}-gpu \
       --out_dir ../models/${MODEL_NAME}/fastertransformer/${TENSOR_PARA_SIZE}-gpu_int8 \
       --lib_path ../build/lib/libth_common.so \
       --tensor_para_size ${TENSOR_PARA_SIZE} \
       --use_gptj_residual \
       --data_type fp16
```

Use `examples/pytorch/codefuse/codefuse_example.py` to load the model and run inference.

```
export MODEL_NAME=codefuse

# fp16 1gpu
python ../examples/pytorch/codefuse/codefuse_example.py \
       --ckpt_path ../models/${MODEL_NAME}/fastertransformer/1-gpu \
       --tokenizer_path ../models/${MODEL_NAME}/transformers

# int8 1gpu
python ../examples/pytorch/codefuse/codefuse_example.py \
       --ckpt_path ../models/${MODEL_NAME}/fastertransformer/1-gpu_int8 \
       --tokenizer_path ../models/${MODEL_NAME}/transformers \
       --int8_mode 1 \
       --enable_int8_weights 1

# fp16 2gpus
torchrun --nproc_per_node 2 ../examples/pytorch/codefuse/codefuse_example.py \
         --world_size 2 \
         --ckpt_path ../models/${MODEL_NAME}/fastertransformer/2-gpu \
         --tokenizer_path ../models/${MODEL_NAME}/transformers

# int8 2gpus
torchrun --nproc_per_node 2 ../examples/pytorch/codefuse/codefuse_example.py \
         --world_size 2 \
         --ckpt_path ../models/${MODEL_NAME}/fastertransformer/2-gpu_int8 \
         --tokenizer_path ../models/${MODEL_NAME}/transformers \
         --int8_mode 1 \
         --enable_int8_weights 1
```
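The `-DSM="80;75"` value passed to cmake in step 2 should match the compute capability of the GPUs you will run on (80 for A100, 75 for T4/Turing, 86 for A10/RTX 30xx, and so on). A minimal sketch for checking it from inside the container with the standard PyTorch API (this script is not shipped with the repo):

```python
# check_sm.py - print the compute capability ("SM") of each visible GPU,
# e.g. "80" for A100, to decide what to pass to cmake via -DSM.
import torch

for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    print(f"GPU {i}: {torch.cuda.get_device_name(i)} -> SM {major}{minor}")
```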
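The conversion step expects a Hugging Face Transformers checkpoint under `../models/${MODEL_NAME}/transformers`. A minimal sketch for laying out that directory with the standard `transformers` API; the hub id `codefuse-ai/CodeFuse-13B` is an assumption here, so substitute whichever checkpoint you actually use:

```python
# prepare_hf_model.py - download a CodeFuse checkpoint and save it where
# huggingface_convert.py expects it (models/codefuse/transformers).
# The hub id below is an assumption; point it at your own checkpoint if needed.
from transformers import AutoModelForCausalLM, AutoTokenizer

hub_id = "codefuse-ai/CodeFuse-13B"          # assumed hub id
out_dir = "../models/codefuse/transformers"  # matches the -i path used above

tokenizer = AutoTokenizer.from_pretrained(hub_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(hub_id, trust_remote_code=True, torch_dtype="auto")

tokenizer.save_pretrained(out_dir)
model.save_pretrained(out_dir)
```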