diff --git a/docs/mindformers/docs/source_zh_cn/usage/mindie_deployment.md b/docs/mindformers/docs/source_zh_cn/usage/mindie_deployment.md
index 6c2be98dcc149d263adf6a0a79f4f4fb324789c9..d545821f291232cbbba427ca2d4e2f2d9543aefe 100644
--- a/docs/mindformers/docs/source_zh_cn/usage/mindie_deployment.md
+++ b/docs/mindformers/docs/source_zh_cn/usage/mindie_deployment.md
@@ -1,3 +1,303 @@
 # MindIE Service Deployment
 
-[![View Source](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source.svg)](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/usage/mindie_deployment.md)
\ No newline at end of file
+[![View Source](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source.svg)](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/usage/mindie_deployment.md)
+
+## Introduction to MindIE
+
+MindIE (Mind Inference Engine) is Huawei Ascend's inference acceleration suite for AI workloads across all scenarios. For details, see the [official introduction](https://www.hiascend.com/software/mindie).
+
+MindFormers is hosted in MindIE-LLM, the model application layer, so the LLMs in MindFormers can be deployed through MindIE-Service.
+
+## Environment Setup
+
+### Software Installation
+
+1. **Install MindFormers**
+
+   Follow the [MindFormers installation guide](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/quick_start/install.html).
+
+2. **Install MindIE**
+
+   Install the dependencies as described in the [MindIE dependency installation document](https://www.hiascend.com/document/detail/zh/mindie/10RC2/envdeployment/instg/mindie_instg_0010.html). Then download the software package from the [MindIE download center](https://www.hiascend.com/developer/download/community/result?module=ie%2Bpt%2Bcann) and install it:
+
+   ```bash
+   bash Ascend-mindie_1.0.RC2_linux-aarch64.run --install --install-path=/path/to/mindie/
+   ```
+
+   If the following message is printed after installation, the software was installed successfully:
+
+   ```bash
+   xxx install success
+   ```
+
+   **xxx** stands for the actual name of the installed package.
+
+### Environment Variables
+
+If the components were installed to the default paths, run the following script to initialize their environment variables:
+
+```bash
+# Ascend
+source /usr/local/Ascend/ascend-toolkit/set_env.sh
+# MindIE
+source /usr/local/Ascend/mindie/latest/mindie-llm/set_env.sh
+source /usr/local/Ascend/mindie/latest/mindie-service/set_env.sh
+# MindSpore
+export RUN_MODE=predict
+export MS_ENABLE_TRACE_MEMORY=1  # optimize memory allocation and release
+export MS_ENABLE_LCCL=1
+export LCAL_IF_PORT=8129
+# Networking
+export MS_SCHED_HOST=127.0.0.1
+export MS_SCHED_PORT=8090  # change as needed, as long as the port is free
+```
+
+## Basic Workflow of Inference Service Deployment
+
+### Modify the MindIE Startup Configuration
+
+Open config.json in mindie-service and modify the server-related settings:
+
+```bash
+cd {MindIE installation directory}/
+cd mindie-service/conf
+vim config.json
+```
+
+`modelWeightPath` and `backendType` must be changed as follows:
+
+```json
+"modelWeightPath": "/path/to/mf_model"
+"backendType": "ms"
+```
+
+`modelWeightPath` is the model configuration directory holding the model, tokenizer, and related files; `backendType` is the backend type and must be `ms`.
+
+Other relevant parameters are as follows:
+
+| Optional parameter | Type | Range | Description |
+| ------------------- | -------- | -------------------- | ------------------------------------------------------------ |
+| maxSeqLen | int32 | user-defined, > 0 | Maximum sequence length. Input length + output length <= maxSeqLen; choose maxSeqLen according to your inference scenario. |
+| npuDeviceIds | list | model-dependent | Device IDs to use. |
+| worldSize | int32 | model-dependent | Total number of cards to use. |
+| npuMemSize | int32 | depends on NPU memory | Upper limit (GB) of NPU memory that can be allocated for the KV cache. It can be estimated from the deployed model: npuMemSize = (total free memory - weight size / mp count) * coefficient, where the coefficient is 0.8 (see the first sketch after this table). Recommended value: 8. |
+| cpuMemSize | int32 | depends on host memory | Upper limit (GB) of CPU memory that can be allocated for the KV cache; related to the swap feature. When cpuMemSize runs out, cached blocks are released and recomputed. Recommended value: 5. |
+| maxPrefillBatchSize | int32 | [1, maxBatchSize] | Maximum prefill batch size; whichever of this and maxPrefillTokens is reached first takes effect (see the second sketch after this table). The two parameters are used together: while a batch's total token count stays within maxPrefillTokens, at most this many requests are batched for prefill. |
+| maxPrefillTokens | int32 | [5120, 409600] | Maximum number of prefill tokens; whichever of this and maxPrefillBatchSize is reached first takes effect. |
+| maxBatchSize | int32 | [1, 5000] | Maximum decode batch size, estimated from the model size and NPU memory. |
+| maxIterTimes | int32 | [1, maxSeqLen-1] | Number of decode iterations allowed, i.e. the maximum length that can be generated for one response. A request may also carry a max_output_length parameter; maxIterTimes is a global setting, and the smaller of the two is the effective maximum output length. |
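+
+As a worked example of the npuMemSize formula in the table above, the following minimal sketch estimates the value for a tensor-parallel deployment. The helper name and the sample figures are illustrative assumptions, not values shipped with MindIE.
+
+```python
+# Minimal sketch: evaluate npuMemSize = (free memory - weights / mp count) * 0.8.
+# All figures below are illustrative assumptions, not MindIE defaults.
+
+def estimate_npu_mem_size(free_mem_gb: float, weight_size_gb: float,
+                          model_parallel: int, coefficient: float = 0.8) -> int:
+    """Estimate the KV cache budget (GB) per the formula in the table above."""
+    per_card_weight_gb = weight_size_gb / model_parallel
+    return int((free_mem_gb - per_card_weight_gb) * coefficient)
+
+# Example: 64 GB free per NPU and ~144 GB of half-precision weights for a
+# 72B model sharded across 4 cards -> (64 - 36) * 0.8 = 22 GB.
+print(estimate_npu_mem_size(free_mem_gb=64, weight_size_gb=144, model_parallel=4))
+```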
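+
+The "whichever is reached first" rule for maxPrefillBatchSize and maxPrefillTokens can likewise be made concrete. The function below is a hypothetical illustration of that batching rule, not MindIE's actual scheduler.
+
+```python
+# Hypothetical illustration: a prefill batch is closed once it holds
+# maxPrefillBatchSize requests, or once adding the next request would
+# exceed maxPrefillTokens -- whichever limit is reached first.
+
+def take_prefill_batch(queued_prompt_lengths, max_prefill_batch_size,
+                       max_prefill_tokens):
+    """Return how many queued requests fit into the next prefill batch."""
+    batch, tokens = 0, 0
+    for length in queued_prompt_lengths:
+        if batch == max_prefill_batch_size or tokens + length > max_prefill_tokens:
+            break
+        batch += 1
+        tokens += length
+    return batch
+
+# Five queued prompts of 4096 tokens each: the token cap (8192) is hit
+# before the batch-size cap (16), so only two requests are batched.
+print(take_prefill_batch([4096] * 5, 16, 8192))  # -> 2
+```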
+
+For the full list of configuration parameters, see the [MindIE Service Developer Guide - Quick Start - Configuration Parameters](https://www.hiascend.com/document/detail/zh/mindie/10RC2/mindieservice/servicedev/mindie_service0004.html).
+
+### Start the Service
+
+```bash
+cd /path/to/mindie/latest/mindie-service
+nohup ./bin/mindieservice_daemon > output.log 2>&1 &
+tail -f output.log
+```
+
+When `Daemon start success!` appears in the log, the service has started successfully.
+
+### View Logs
+
+mindie-service logs:
+
+```bash
+tail -f /path/to/mindie/latest/mindie-service/output.log
+```
+
+Python logs:
+
+```bash
+tail -f /path/to/mindie/latest/mindie-llm/logs/pythonlog.log
+```
+
+## MindIE Deployment and Inference Example
+
+In the following example, every component is installed to the default path `/usr/local/Ascend/`, and the model is Qwen1.5-72B.
+
+### Modify the MindIE Startup Configuration
+
+Open the config.json file in mindie-service and modify the server-related settings:
+
+```bash
+vim /usr/local/Ascend/mindie/1.0.RC2/mindie-service/conf/config.json
+```
+
+Pay attention to the following fields:
+
+1. `ModelDeployParam.ModelParam.backendType`
+
+   This is the backend type and must be set to "ms":
+
+   ```json
+   "backendType": "ms"
+   ```
+
+2. `ModelDeployParam.ModelParam.modelWeightPath`
+
+   This is the model configuration directory, which holds the model, tokenizer, and related files.
+
+   Taking Qwen1.5-72B as an example, `modelWeightPath` is organized as follows (a layout check for this directory is sketched after this list):
+
+   ```reStructuredText
+   mf_model
+    └── qwen1_5_72b
+        ├── config.json                 # model json configuration file
+        ├── vocab.json                  # model vocab file, downloaded from the matching model on Hugging Face
+        ├── merges.txt                  # model merges file, downloaded from the matching model on Hugging Face
+        ├── predict_qwen1_5_72b.yaml    # model yaml configuration file
+        ├── qwen1_5_tokenizer.py        # model tokenizer file, copied from the matching model under the research directory of the mindformers repository
+        └── qwen1_5_72b_ckpt_dir        # folder of distributed model weights
+   ```
+
+   Pay attention to the following settings in predict_qwen1_5_72b.yaml:
+
+   ```yaml
+   load_checkpoint: '/mf_model/qwen1_5_72b/qwen1_5_72b_ckpt_dir'  # path of the folder holding the distributed model weights
+   use_parallel: True
+   auto_trans_ckpt: False  # whether to enable automatic weight conversion; set to False for offline-sharded weights
+   parallel_config:
+     data_parallel: 1
+     model_parallel: 4  # model sharding for multi-card inference, usually equal to the number of cards used
+     pipeline_parallel: 1
+   processor:
+     tokenizer:
+       vocab_file: "/mf_model/qwen1_5_72b/vocab.json"  # path of the vocab file
+       merges_file: "/mf_model/qwen1_5_72b/merges.txt"  # path of the merges file
+   ```
+
+   The model's config.json file can be generated with the `save_pretrained` interface, for example:
+
+   ```python
+   from mindformers import AutoConfig
+
+   model_config = AutoConfig.from_pretrained("/mf_model/qwen1_5_72b/predict_qwen1_5_72b.yaml")
+   model_config.save_pretrained(save_directory="./json/qwen1_5_72b/", save_json=True)
+   ```
+
+   For downloading and converting model weights, see the [weight format conversion guide](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/function/weight_conversion.html).
+
+   Once the model configuration directory is ready, set `modelWeightPath` to that directory:
+
+   ```json
+   "modelWeightPath": "/mf_model/qwen1_5_72b"
+   ```
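+
+Before starting the service, it can save a restart cycle to verify that the directory really contains every file listed above. The sketch below is a minimal layout check; the expected names mirror the Qwen1.5-72B example and should be adjusted for other models.
+
+```python
+# Minimal sanity check for the modelWeightPath layout described above.
+# The expected names follow the Qwen1.5-72B example; adjust for other models.
+from pathlib import Path
+
+MODEL_DIR = Path("/mf_model/qwen1_5_72b")  # value of modelWeightPath
+EXPECTED = [
+    "config.json",
+    "vocab.json",
+    "merges.txt",
+    "predict_qwen1_5_72b.yaml",
+    "qwen1_5_tokenizer.py",
+    "qwen1_5_72b_ckpt_dir",
+]
+
+missing = [name for name in EXPECTED if not (MODEL_DIR / name).exists()]
+if missing:
+    raise SystemExit(f"modelWeightPath is missing: {missing}")
+print("modelWeightPath layout looks complete.")
+```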
"interNodeKmcKsfStandby": "tools/pmt/standby/ksfb" + } + }, + "WorkFlowParam": { + "TemplateParam": { + "templateType": "Standard", + "templateName": "Standard_llama" + } + }, + "ModelDeployParam": { + "engineName": "mindieservice_llm_engine", + "modelInstanceNumber": 1, + "tokenizerProcessNumber": 8, + "maxSeqLen": 16384, + "npuDeviceIds": [ + [ + 4, + 5, + 6, + 7 + ] + ], + "multiNodesInferEnabled": false, + "ModelParam": [ + { + "modelInstanceType": "Standard", + "modelName": "Qwen1.5-72B-Chat", + "modelWeightPath": "/mf_model/qwen1_5_72b", + "worldSize": 4, + "cpuMemSize": 16, + "npuMemSize": 16, + "backendType": "ms", + "pluginParams": "" + } + ] + }, + "ScheduleParam": { + "maxPrefillBatchSize": 1, + "maxPrefillTokens": 16384, + "prefillTimeMsPerReq": 60, + "prefillPolicyType": 0, + "decodeTimeMsPerReq": 60, + "decodePolicyType": 0, + "maxBatchSize": 128, + "maxIterTimes": 8192, + "maxPreemptCount": 0, + "supportSelectBatch": true, + "maxQueueDelayMicroseconds": 500 + } +} +``` + +### 启动服务 + +```bash +cd /usr/local/Ascend/mindie/1.0.RC2/mindie-service +nohup ./bin/mindieservice_daemon > output.log 2>&1 & +tail -f output.log +``` + +打印如下信息,启动成功。 + +```json +Daemon start success! +``` + +### 请求测试 + +服务启动成功后,可使用curl命令发送请求验证,样例如下: + +```bash +curl -w "\ntime_total=%{time_total}\n" -H "Accept: application/json" -H "Content-type: application/json" -X POST -d '{"inputs": "I love Beijing, because","stream": false}' http://127.0.0.1:1025/generate +``` + +返回推理结果验证成功: + +```json +{"generated_text":" it is a city with a long history and rich culture....."} +``` \ No newline at end of file