# AutoSmoothQuant

AutoSmoothQuant is an easy-to-use package for applying SmoothQuant to LLMs. AutoSmoothQuant speeds up model inference under various workloads. It builds on and improves upon the [original work](https://github.com/mit-han-lab/smoothquant) from MIT.

## Install

### Prerequisites
- Your GPU(s) must have Compute Capability 8.0 or higher. Ampere and later architectures are supported.
- CUDA 11.4 or later is required.

### Build from source
Currently this repo only supports building from source. A prebuilt package will be released soon.
```bash
git clone https://github.com/AniZpZ/AutoSmoothQuant.git
cd AutoSmoothQuant
pip install -e .
```

## Usage

### Quantize a model
First, add a config file named `quant_config.json` to the model path. For currently supported models, the config should look like:
```json
{
  "qkv": "per-tensor",
  "out": "per-tensor",
  "fc1": "per-tensor",
  "fc2": "per-tensor"
}
```
`qkv` refers to the QKV matmul of attention and `out` to the output matmul of attention. `fc1` and `fc2` are the FFN layers, which may be called `gate_up` and `down` in Llama-like models. Set each value to `per-tensor` or `per-token` to choose the quantization granularity you want (see the appendix at the end of this README for a sketch of the difference).

Once the config is set, generate scales and quantize the model with the following command:
```bash
cd autosmoothquant/examples
python3 smoothquant_model.py --model-path=/path/to/model --quantize-model=True --generate-scale=True --dataset-path=/path/to/dataset
```
Use the following command for more information:
```bash
python smoothquant_model.py --help
```

### Inference
- Inference with vLLM: coming soon (this [PR](https://github.com/vllm-project/vllm/pull/1508) can serve as a reference).
- Inference in this repo:
```bash
cd autosmoothquant/examples
python3 test_model.py --model-path=/path/to/model --tokenizer-path=/path/to/tokenizer --model-class=llama --prompt="something to say"
```

### Benchmark
Coming soon (this [PR](https://github.com/vllm-project/vllm/pull/1508) can serve as a reference).

## Supported models

| Models     | Sizes          |
| ---------- | -------------- |
| LLaMA-2    | 7B/13B/70B     |
| LLaMA      | 7B/13B/30B/65B |
| Mixtral    | 8x7B           |
| OPT        | 6.7B/13B/30B   |
| Baichuan-2 | 7B/13B         |
| Baichuan   | 7B/13B         |

## Performance and inference efficiency

Detailed data coming soon.

Cases:
- [codellama-13b with A40](https://github.com/vllm-project/vllm/pull/1508#issuecomment-1824133140), tested with vLLM
- [llama-13b with A100](https://github.com/vllm-project/vllm/pull/1508#issuecomment-1853826414), tested with vLLM

## Reference

If you find SmoothQuant useful or relevant to your research, please cite their paper:

```bibtex
@InProceedings{xiao2023smoothquant,
  title     = {{S}mooth{Q}uant: Accurate and Efficient Post-Training Quantization for Large Language Models},
  author    = {Xiao, Guangxuan and Lin, Ji and Seznec, Mickael and Wu, Hao and Demouth, Julien and Han, Song},
  booktitle = {Proceedings of the 40th International Conference on Machine Learning},
  year      = {2023}
}
```
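
## Appendix: per-tensor vs. per-token quantization (sketch)

The `per-tensor` / `per-token` options in `quant_config.json` control how many INT8 scales are computed for a tensor. Below is a minimal, self-contained PyTorch sketch of the underlying idea, assuming the smoothing rule from the SmoothQuant paper cited above; it is not AutoSmoothQuant's actual API, and every function name in it is illustrative.

```python
# Illustrative sketch of SmoothQuant-style W8A8 quantization.
# NOT AutoSmoothQuant's API: all names here are hypothetical.
import torch

def smooth_scales(act_absmax, weight, alpha=0.5):
    # Per-input-channel smoothing factor from the SmoothQuant paper:
    # s_j = max|X_j|^alpha / max|W_j|^(1-alpha).
    # Activations are divided by s and weights multiplied by s, so the
    # layer output is unchanged but activation outliers shrink.
    w_absmax = weight.abs().amax(dim=0)          # max per input channel
    s = act_absmax.pow(alpha) / w_absmax.pow(1 - alpha)
    return s.clamp(min=1e-5)

def quantize_per_tensor(x):
    # One scale for the whole tensor ("per-tensor" in quant_config.json).
    scale = x.abs().amax() / 127.0
    return torch.clamp((x / scale).round(), -128, 127).to(torch.int8), scale

def quantize_per_token(x):
    # One scale per token/row ("per-token"): finer-grained dequantization.
    scale = x.abs().amax(dim=-1, keepdim=True) / 127.0
    return torch.clamp((x / scale).round(), -128, 127).to(torch.int8), scale

# Toy calibration: activations X [tokens, in], weight W [out, in].
X = torch.randn(4, 16) * (torch.rand(16) * 8)   # simulate outlier channels
W = torch.randn(32, 16)

s = smooth_scales(X.abs().amax(dim=0), W)       # from calibration statistics
X_s, W_s = X / s, W * s                         # mathematically equivalent layer

q_x, sx = quantize_per_token(X_s)               # per-token activations
q_w, sw = quantize_per_tensor(W_s)              # per-tensor weights

# Emulate the INT8 GEMM in int32, then dequantize with both scales.
y = (q_x.int() @ q_w.t().int()).float() * (sx * sw)
print((y - X @ W.t()).abs().max())              # quantization error is small
```

Per-token scales track each token's dynamic range and usually preserve more accuracy; per-tensor uses a single scale, which is simpler to fuse into an INT8 GEMM epilogue and is typically cheaper at inference time.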