From 18aed7ad35216dfdbf5217f1b78b09e1ef534c2c Mon Sep 17 00:00:00 2001
From: sunyu-xuan
Date: Thu, 29 Aug 2024 17:00:55 +0800
Subject: [PATCH] add quantization doc

---
 .../mindformers/usage/quantization.md | 246 +++++++++++++++++-
 1 file changed, 245 insertions(+), 1 deletion(-)

diff --git a/docs/mindspore/source_zh_cn/mindformers/usage/quantization.md b/docs/mindspore/source_zh_cn/mindformers/usage/quantization.md
index 635ebf432d..1987e5e78a 100644
--- a/docs/mindspore/source_zh_cn/mindformers/usage/quantization.md
+++ b/docs/mindspore/source_zh_cn/mindformers/usage/quantization.md
@@ -1,3 +1,247 @@
# Quantization

[![View Source File](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source.svg)](https://gitee.com/mindspore/docs/blob/master/docs/mindspore/source_zh_cn/mindformers/usage/quantization.md)

## Overview

Quantization is an important compression technique for large models: it converts a model's floating-point parameters into low-precision integer parameters, compressing the parameters. As model parameter counts and sizes keep growing, quantization effectively reduces model storage space and loading time during deployment and improves inference performance.

MindFormers integrates the MindSpore Golden Stick toolkit to provide a unified quantized-inference workflow, so that users can run high-quality quantized models out of the box.

## Installation

**Install MindSpore Golden Stick**

[MindSpore Golden Stick installation guide](https://gitee.com/mindspore/golden-stick#%E5%AE%89%E8%A3%85)

Download the source code, then change into the `golden_stick` directory.

```shell
bash build.sh
pip install output/mindspore_gs-0.6.0-py3-none-any.whl
```

Run the following command to verify the installation; if importing the Python module raises no error, the installation succeeded:

```python
import mindspore_gs  # no error on import means the installation succeeded
```

## Basic Quantization Workflow

In practice, quantization can be broken down into the following steps:

1. **Select a model:**

    Choose a pretrained language model, such as Llama2.

2. **Download the model weights:**

    Download the weights of the chosen model from the HuggingFace model hub and convert them to ckpt format by following the [weight conversion](https://gitee.com/mindspore/docs/blob/master/docs/mindspore/source_zh_cn/mindformers/function/weight_conversion.md) guide.

3. **Convert the quantized model weights:**

    (To be added.)

4. **Prepare the quantization configuration file:**

    Use the quantized-inference configuration file that MindFormers ships for the model; the quantization-related settings live under `model.model_config.quantization_config`.

    Taking the `llama2_13b_w8a16` quantized model as an example, the default quantization configuration is as follows (a hypothetical Smooth-Quant variant is sketched after this list):

    ```yaml
    quantization_config:
      quant_method: 'ptq'
      weight_dtype: 'int8'
      activation_dtype: None
      kvcache_dtype: None
      modules_to_not_convert: ['lm_head']
      algorithm_args: {}
    ```

    | Parameter | Required | Description | Type | Range |
    | ---------------------- | -------- | :------------------------------------------------------------ | --------- | --------- |
    | quant_method | Required | Quantization algorithm; currently only the PTQ and Smooth-Quant algorithms are supported | str | ptq/sq |
    | weight_dtype | Required | Quantized weight type; currently only int8 is supported | str | int8 |
    | activation_dtype | Required | Activation type of the parameters; None keeps the network's original compute type (compute_dtype) unchanged | str | int8/None |
    | kvcache_dtype | Optional | KVCache quantization type; None or leaving the field unset keeps the original KVCache data type unchanged | str | int8/None |
    | modules_to_not_convert | Required | Layers excluded from quantization | List[str] | \ |
    | algorithm_args | Required | Algorithm-specific settings passed to Golden Stick, e.g. the smooth_quant algorithm requires alpha=0.5 | Dict | \ |

5. **Run the inference task:**

    Implement an inference script based on the `generate` interface; running the script yields the inference results.
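The table above notes that the smooth_quant algorithm takes `alpha=0.5` through `algorithm_args`. Purely as an illustration, and not as a shipped configuration, a Smooth-Quant variant of the default block above might look like the sketch below; the exact value string accepted for `quant_method` (the table gives the range as `ptq/sq`) may vary between versions, so check the YAML files shipped with MindFormers before relying on these keys:

```yaml
# Hypothetical Smooth-Quant variant -- illustrative only, not a shipped configuration.
quantization_config:
  quant_method: 'sq'                    # 'sq' per the table's range; the actual value string may differ
  weight_dtype: 'int8'                  # only int8 weights are currently supported
  activation_dtype: 'int8'              # also quantize activations; None would keep compute_dtype
  kvcache_dtype: None                   # keep the original KVCache data type
  modules_to_not_convert: ['lm_head']   # leave the output head unquantized
  algorithm_args: {alpha: 0.5}          # smooth_quant requires alpha=0.5 per the table above
```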
## Practice: w8a16 Quantized Inference with the Llama2-13b Model

### Select a Model

This walkthrough uses the Llama2-13b model for quantized inference.

The model is instantiated with `AutoModel.from_pretrained()`, which is passed the path of the directory holding the model configuration and weights, so create that directory first.

```shell
mkdir ./path/llama2_13b_w8a16_dir
```

> Note: The `AutoModel.from_pretrained()` interface does not currently support instantiating a quantized model by passing its name.

- Single-card directory layout

```shell
llama2_13b_w8a16_dir
  ├── predict_llama2_13b_w8a16.yaml
  └── llama2_13b_w8a16.ckpt
```

- Multi-card directory layout

```shell
llama2_13b_w8a16_dir
  ├── predict_llama2_13b_w8a16.yaml
  └── checkpoint
      ├── rank_0
      │   └── llama2_13b_w8a16_0.ckpt
      ├── ...
      └── rank_x
          └── llama2_13b_w8a16_x.ckpt
```

### Download the Model Weights

MindFormers provides pre-converted pretrained weights and vocabulary files for pretraining, fine-tuning, and inference.

To start from the HuggingFace model hub instead, download the corresponding weights and convert them to ckpt format by following the [weight conversion](https://gitee.com/mindspore/docs/blob/master/docs/mindspore/source_zh_cn/mindformers/function/weight_conversion.md) guide.

| Model name | MindSpore weights |
| :--------- | :----------------------------------------------------------: |
| llama2-13b | [Link](https://ascend-repo-modelzoo.obs.cn-east-2.myhuaweicloud.com/MindFormers/llama2/llama2-13b-fp16.ckpt) |

### Model Weight Conversion

To be added.

### Prepare the Quantization Configuration File

MindFormers already provides the [predict_llama2_13b_w8a16.yaml configuration file](https://gitee.com/mindspore/mindformers/blob/dev/configs/llama2/predict_llama2_13b_w8a16.yaml); copy it into the `llama2_13b_w8a16_dir` directory.

```shell
cp configs/llama2/predict_llama2_13b_w8a16.yaml ./path/llama2_13b_w8a16_dir
```

> Note: The preset predict_llama2_13b_w8a16.yaml is a single-card configuration. For multi-card execution, set `use_parallel: True` in the file.
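Before launching, it can help to confirm that the copied YAML carries the expected quantization settings. Below is a minimal sanity-check sketch; it assumes the directory created above and that `MindFormerConfig` exposes nested keys as attributes, as the inference script later in this document does:

```python
"""Minimal sanity check of the copied quantization YAML (illustrative sketch)."""
from mindformers import MindFormerConfig

# Path assumes the directory layout created earlier in this walkthrough.
config = MindFormerConfig("./path/llama2_13b_w8a16_dir/predict_llama2_13b_w8a16.yaml")

# Quantization settings live under model.model_config.quantization_config.
quant_cfg = config.model.model_config.quantization_config
print(quant_cfg.quant_method)   # expected: 'ptq'
print(quant_cfg.weight_dtype)   # expected: 'int8'
print(config.use_parallel)      # set use_parallel: True in the YAML for multi-card runs
```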
### Run the Inference Task

1. **Example script**

    Replace the [run_llama2_generate.py](https://gitee.com/mindspore/mindformers/blob/dev/scripts/examples/llama2/run_llama2_generate.py) script under MindFormers with the following code.

    This walkthrough instantiates the quantized model through the `AutoModel.from_pretrained()` interface, so adjust that call's argument to the directory created earlier. Inference results are obtained by calling the `generate` interface. For detailed parameter descriptions, see the [AutoModel]() and [generate]() API documentation.

    ```python
    """llama2 predict example."""
    import argparse
    import os

    import mindspore as ms
    from mindspore import Tensor, Model
    from mindspore.common import initializer as init

    from mindformers import AutoModel
    from mindformers import MindFormerConfig, logger
    from mindformers.core.context import build_context
    from mindformers.core.parallel_config import build_parallel_config
    from mindformers.models.llama import LlamaTokenizer
    from mindformers.trainer.utils import transform_and_load_checkpoint


    def main(config_path, use_parallel, load_checkpoint):
        # Build the inputs
        inputs = ["I love Beijing, because",
                  "LLaMA is a",
                  "Huawei is a company that"]
        batch_size = len(inputs)

        # Build the model configuration from the yaml file
        config = MindFormerConfig(config_path)
        config.use_parallel = use_parallel
        device_num = os.getenv('MS_WORKER_NUM')
        logger.info(f"Use device number: {device_num}, it will override config.model_parallel.")
        config.parallel_config.model_parallel = int(device_num) if device_num else 1
        config.parallel_config.data_parallel = 1
        config.parallel_config.pipeline_stage = 1
        config.load_checkpoint = load_checkpoint

        # Initialize the environment
        build_context(config)
        build_parallel_config(config)
        model_name = config.trainer.model_name

        # Instantiate the tokenizer
        tokenizer = LlamaTokenizer.from_pretrained(model_name)
        # Instantiate the model from the directory created earlier
        network = AutoModel.from_pretrained("/path/llama2_13b_w8a16_dir",
                                            download_checkpoint=False)
        model = Model(network)

        # Load the checkpoint
        if config.load_checkpoint:
            logger.info("----------------Transform and load checkpoint----------------")
            seq_length = config.model.model_config.seq_length
            input_ids = Tensor(shape=(batch_size, seq_length), dtype=ms.int32, init=init.One())
            infer_data = network.prepare_inputs_for_predict_layout(input_ids)
            transform_and_load_checkpoint(config, model, network, infer_data, do_predict=True)

        inputs_ids = tokenizer(inputs, max_length=config.model.model_config.seq_length, padding="max_length")["input_ids"]

        outputs = network.generate(inputs_ids,
                                   max_length=config.model.model_config.max_decode_length,
                                   do_sample=config.model.model_config.do_sample,
                                   top_k=config.model.model_config.top_k,
                                   top_p=config.model.model_config.top_p)
        for output in outputs:
            print(tokenizer.decode(output))


    if __name__ == "__main__":
        parser = argparse.ArgumentParser()
        parser.add_argument('--config_path', default='predict_llama2_7b.yaml', type=str,
                            help='model config file path.')
        parser.add_argument('--use_parallel', action='store_true',
                            help='if run model prediction in parallel mode.')
        parser.add_argument('--load_checkpoint', type=str,
                            help='load model checkpoint path or directory.')

        args = parser.parse_args()
        main(
            args.config_path,
            args.use_parallel,
            args.load_checkpoint
        )

    # Multi-batch outputs
    # I love Beijing, because it is a city that is constantly changing. I have been living here for 10 years ...
    # LLaMA is a large-scale, open-source, multimodal, multilingual, multitask, and multimodal pretrained ...
    # Huawei is a company that has been around for a long time. ...
    ```

2. **Launch the script**

    MindFormers provides a quick inference script for the `Llama2` model that supports single-card, multi-card, and multi-batch inference.

    ```shell
    # Script usage
    bash scripts/examples/llama2/run_llama2_predict.sh PARALLEL CONFIG_PATH CKPT_PATH DEVICE_NUM
    # Parameters
    PARALLEL:    whether to run multi-card inference; 'single' for single-card inference, 'parallel' for multi-card inference
    CONFIG_PATH: path to the model configuration file
    CKPT_PATH:   path to the model weight file
    DEVICE_NUM:  number of cards to use; takes effect only when multi-card inference is enabled
    ```

    - Single-card inference

    ```shell
    bash scripts/examples/llama2/run_llama2_predict.sh single /path/llama2_13b_w8a16_dir/predict_llama2_13b_w8a16.yaml /path/llama2_13b_w8a16_dir/llama2_13b_w8a16.ckpt
    ```

    - Multi-card inference

    ```shell
    bash scripts/examples/llama2/run_llama2_predict.sh parallel /path/llama2_13b_w8a16_dir/predict_llama2_13b_w8a16.yaml /path/llama2_13b_w8a16_dir/llama2_13b_w8a16.ckpt 2
    ```
--
Gitee