Inference test fails after deploying W8A8-quantized DeepSeek-R1 across two 800I A2 machines
#IBY3OH · Bug · Status: TODO · Created by Mouse on 2025-04-01 22:33
The error is as follows:

```
debug/ security/
[root@pm-93f10002 /]# cat ~/mindie/log/debug/mindie-server_238_202504012210.log
[2025-04-01 22:10:47.211+08:00] [238] [239] [mindie-server] ===init log===
[2025-04-01 22:10:51.277+08:00] [238] [239] [mindie-server] [ERROR] [infer_backend_manager.cpp:236] : [MIE04E06011C] [infer_backend_manager] Failed to init engine: mindieservice_llm_engine
[2025-04-01 22:10:51.277+08:00] [238] [239] [mindie-server] [ERROR] [endpoint.cpp:53] : [MIE04E02011C] [endpoint] Failed to init engine! Please check in the mindservice.log, pythonlog.log or console output.
[root@pm-93f10002 /]#
```

Environment:
npu-smi 24.1.0 Version: 24.1.0
Image: swr.cn-south-1.myhuaweicloud.com/ascendhub/mindie:2.0.T3.1-800I-A2-py311-openeuler24.03-lts

Launch commands

```
docker run -itd --privileged --restart always --name=deepseek-r1 --net=host \
  --shm-size 500g \
  -e ATB_LLM_HCCL_ENABLE=1 \
  -e ATB_LLM_COMM_BACKEND="hccl" \
  -e HCCL_CONNECT_TIMEOUT=7200 \
  -e WORLD_SIZE=16 \
  -e HCCL_EXEC_TIMEOUT=0 \
  -e PYTORCH_NPU_ALLOC_CONF=expandable_segments:True \
  -e MIES_CONTAINER_IP=117.68.86.165 \
  -e RANKTABLEFILE=/workspace/rank_table_file.json \
  -e OMP_NUM_THREADS=1 \
  -e NPU_MEMORY_FRACTION=0.95 \
  --device=/dev/davinci0 \
  --device=/dev/davinci1 \
  --device=/dev/davinci2 \
  --device=/dev/davinci3 \
  --device=/dev/davinci4 \
  --device=/dev/davinci5 \
  --device=/dev/davinci6 \
  --device=/dev/davinci7 \
  --device=/dev/davinci_manager \
  --device=/dev/hisi_hdc \
  --device /dev/devmm_svm \
  -v /usr/local/dcmi:/usr/local/dcmi \
  -v /usr/bin/hccn_tool:/usr/bin/hccn_tool \
  -v /usr/local/sbin:/usr/local/sbin \
  -v /usr/local/sbin/npu-smi:/usr/local/sbin/npu-smi \
  -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
  -v /usr/local/Ascend/firmware:/usr/local/Ascend/firmware \
  -v /nfs-server/model/config.json:/usr/local/Ascend/mindie/latest/mindie-service/conf/config.json \
  -v /etc/hccn.conf:/etc/hccn.conf \
  -v /nfs-server/model:/workspace \
  swr.cn-south-1.myhuaweicloud.com/ascendhub/mindie:2.0.T3.1-800I-A2-py311-openeuler24.03-lts \
  bash -c "source /usr/local/Ascend/ascend-toolkit/set_env.sh && source /usr/local/Ascend/nnal/atb/set_env.sh && source /usr/local/Ascend/atb-models/set_env.sh && source /usr/local/Ascend/mindie/set_env.sh && /usr/local/Ascend/mindie/latest/mindie-service/bin/mindieservice_daemon"

docker run -itd --privileged --restart always --name=deepseek-r1 --net=host \
  --shm-size 500g \
  -e ATB_LLM_HCCL_ENABLE=1 \
  -e ATB_LLM_COMM_BACKEND="hccl" \
  -e HCCL_CONNECT_TIMEOUT=7200 \
  -e WORLD_SIZE=16 \
  -e HCCL_EXEC_TIMEOUT=0 \
  -e PYTORCH_NPU_ALLOC_CONF=expandable_segments:True \
  -e MIES_CONTAINER_IP=117.68.86.166 \
  -e RANKTABLEFILE=/workspace/rank_table_file.json \
  -e OMP_NUM_THREADS=1 \
  -e NPU_MEMORY_FRACTION=0.95 \
  --device=/dev/davinci0 \
  --device=/dev/davinci1 \
  --device=/dev/davinci2 \
  --device=/dev/davinci3 \
  --device=/dev/davinci4 \
  --device=/dev/davinci5 \
  --device=/dev/davinci6 \
  --device=/dev/davinci7 \
  --device=/dev/davinci_manager \
  --device=/dev/hisi_hdc \
  --device /dev/devmm_svm \
  -v /usr/local/dcmi:/usr/local/dcmi \
  -v /usr/bin/hccn_tool:/usr/bin/hccn_tool \
  -v /usr/local/sbin:/usr/local/sbin \
  -v /usr/local/sbin/npu-smi:/usr/local/sbin/npu-smi \
  -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
  -v /usr/local/Ascend/firmware:/usr/local/Ascend/firmware \
  -v /etc/hccn.conf:/etc/hccn.conf \
  -v /nfs-server/model:/workspace \
  -v /nfs-server/model/config.json:/usr/local/Ascend/mindie/latest/mindie-service/conf/config.json \
  swr.cn-south-1.myhuaweicloud.com/ascendhub/mindie:2.0.T3.1-800I-A2-py311-openeuler24.03-lts \
  bash -c "source /usr/local/Ascend/ascend-toolkit/set_env.sh && source /usr/local/Ascend/nnal/atb/set_env.sh && source /usr/local/Ascend/atb-models/set_env.sh && source /usr/local/Ascend/mindie/set_env.sh && /usr/local/Ascend/mindie/latest/mindie-service/bin/mindieservice_daemon"

{
    "server_count": "2",
    "server_list": [
        {
            "device": [
                {"device_id": "0", "device_ip": "100.97.0.145", "rank_id": "0"},
                {"device_id": "1", "device_ip": "100.97.0.146", "rank_id": "1"},
                {"device_id": "2", "device_ip": "100.97.0.147", "rank_id": "2"},
                {"device_id": "3", "device_ip": "100.97.0.148", "rank_id": "3"},
                {"device_id": "4", "device_ip": "100.97.0.149", "rank_id": "4"},
                {"device_id": "5", "device_ip": "100.97.0.150", "rank_id": "5"},
                {"device_id": "6", "device_ip": "100.97.0.151", "rank_id": "6"},
                {"device_id": "7", "device_ip": "100.97.0.152", "rank_id": "7"}
            ],
            "server_id": "117.68.86.166",
            "container_ip": "117.68.86.166"
        },
        {
            "device": [
                {"device_id": "0", "device_ip": "100.97.0.137", "rank_id": "8"},
                {"device_id": "1", "device_ip": "100.97.0.138", "rank_id": "9"},
                {"device_id": "2", "device_ip": "100.97.0.139", "rank_id": "10"},
                {"device_id": "3", "device_ip": "100.97.0.140", "rank_id": "11"},
                {"device_id": "4", "device_ip": "100.97.0.141", "rank_id": "12"},
                {"device_id": "5", "device_ip": "100.97.0.142", "rank_id": "13"},
                {"device_id": "6", "device_ip": "100.97.0.143", "rank_id": "14"},
                {"device_id": "7", "device_ip": "100.97.0.144", "rank_id": "15"}
            ],
            "server_id": "117.68.86.165",
            "container_ip": "117.68.86.165"
        }
    ],
    "status": "completed",
    "version": "1.0"
}

{
    "Version" : "1.0.0",
    "LogConfig" : {
        "logLevel" : "Info",
        "logFileSize" : 20,
        "logFileNum" : 20,
        "logPath" : "logs/mindie-server.log"
    },
    "ServerConfig" : {
        "ipAddress" : "117.68.86.166",
        "managementIpAddress" : "117.68.86.166",
        "port" : 11025,
        "managementPort" : 11026,
        "metricsPort" : 11027,
        "allowAllZeroIpListening" : false,
        "maxLinkNum" : 1000,
        "httpsEnabled" : false,
        "fullTextEnabled" : false,
        "tlsCaPath" : "security/ca/",
        "tlsCaFile" : ["ca.pem"],
        "tlsCert" : "security/certs/server.pem",
        "tlsPk" : "security/keys/server.key.pem",
        "tlsPkPwd" : "security/pass/key_pwd.txt",
        "tlsCrlPath" : "security/certs/",
        "tlsCrlFiles" : ["server_crl.pem"],
        "managementTlsCaFile" : ["management_ca.pem"],
        "managementTlsCert" : "security/certs/management/server.pem",
        "managementTlsPk" : "security/keys/management/server.key.pem",
        "managementTlsPkPwd" : "security/pass/management/key_pwd.txt",
        "managementTlsCrlPath" : "security/management/certs/",
        "managementTlsCrlFiles" : ["server_crl.pem"],
        "kmcKsfMaster" : "tools/pmt/master/ksfa",
        "kmcKsfStandby" : "tools/pmt/standby/ksfb",
        "inferMode" : "standard",
        "interCommTLSEnabled" : false,
        "interCommPort" : 11121,
        "interCommTlsCaPath" : "security/grpc/ca/",
        "interCommTlsCaFiles" : ["ca.pem"],
        "interCommTlsCert" : "security/grpc/certs/server.pem",
        "interCommPk" : "security/grpc/keys/server.key.pem",
        "interCommPkPwd" : "security/grpc/pass/key_pwd.txt",
        "interCommTlsCrlPath" : "security/grpc/certs/",
        "interCommTlsCrlFiles" : ["server_crl.pem"],
        "openAiSupport" : "vllm"
    },
    "BackendConfig" : {
        "backendName" : "mindieservice_llm_engine",
        "modelInstanceNumber" : 1,
        "npuDeviceIds" : [[0,1,2,3,4,5,6,7]],
        "tokenizerProcessNumber" : 8,
        "multiNodesInferEnabled" : true,
        "multiNodesInferPort" : 11120,
        "interNodeTLSEnabled" : false,
        "interNodeTlsCaPath" : "security/grpc/ca/",
        "interNodeTlsCaFiles" : ["ca.pem"],
        "interNodeTlsCert" : "security/grpc/certs/server.pem",
        "interNodeTlsPk" : "security/grpc/keys/server.key.pem",
        "interNodeTlsPkPwd" : "security/grpc/pass/mindie_server_key_pwd.txt",
        "interNodeTlsCrlPath" : "security/grpc/certs/",
        "interNodeTlsCrlFiles" : ["server_crl.pem"],
        "interNodeKmcKsfMaster" : "tools/pmt/master/ksfa",
        "interNodeKmcKsfStandby" : "tools/pmt/standby/ksfb",
        "ModelDeployConfig" : {
            "maxSeqLen" : 10000,
            "maxInputTokenLen" : 2048,
            "truncation" : false,
            "ModelConfig" : [
                {
                    "modelInstanceType" : "Standard",
                    "modelName" : "deepseek-r1",
                    "modelWeightPath" : "/workspace/DeepSeek-R1-W8A8/",
                    "worldSize" : 8,
                    "cpuMemSize" : 5,
                    "npuMemSize" : -1,
                    "backendType" : "atb",
                    "trustRemoteCode" : false
                }
            ]
        },
        "ScheduleConfig" : {
            "templateType" : "Standard",
            "templateName" : "Standard_LLM",
            "cacheBlockSize" : 128,
            "maxPrefillBatchSize" : 50,
            "maxPrefillTokens" : 8192,
            "prefillTimeMsPerReq" : 150,
            "prefillPolicyType" : 0,
            "decodeTimeMsPerReq" : 50,
            "decodePolicyType" : 0,
            "maxBatchSize" : 200,
            "maxIterTimes" : 512,
            "maxPreemptCount" : 0,
            "supportSelectBatch" : false,
            "maxQueueDelayMicroseconds" : 5000
        }
    }
}
```
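The log in this report only says that the engine failed to initialize, so before digging into `pythonlog.log` it may help to cross-check the posted rank table, `config.json`, and `docker run` environment against each other. The following is a minimal sketch of such a check, not a diagnosis of this issue. The helper names `check_rank_table` and `check_node_config` are hypothetical, and the rules they apply (rank IDs covering `0..WORLD_SIZE-1`, `worldSize` matching the length of `npuDeviceIds[0]`, and `ServerConfig.ipAddress` matching the node's own `MIES_CONTAINER_IP`) are assumptions about what a consistent two-node setup looks like, not requirements confirmed by this report. The sample data reproduces only the fields the checks read.

```python
def check_rank_table(rank_table, world_size):
    """Return a list of inconsistencies found in a rank table (empty if none)."""
    problems = []
    servers = rank_table.get("server_list", [])
    if int(rank_table.get("server_count", -1)) != len(servers):
        problems.append("server_count does not match the number of server_list entries")
    devices = [d for s in servers for d in s.get("device", [])]
    rank_ids = sorted(int(d["rank_id"]) for d in devices)
    if rank_ids != list(range(world_size)):
        problems.append(f"rank_id values are not exactly 0..{world_size - 1}")
    device_ips = [d["device_ip"] for d in devices]
    if len(set(device_ips)) != len(device_ips):
        problems.append("duplicate device_ip entries")
    return problems


def check_node_config(config, rank_table, container_ip):
    """Compare one node's config.json with the rank table entry for container_ip."""
    problems = []
    backend = config["BackendConfig"]
    device_ids = backend["npuDeviceIds"][0]
    world_size = backend["ModelDeployConfig"]["ModelConfig"][0]["worldSize"]
    # Assumption: worldSize in config.json counts the NPUs of this node only.
    if world_size != len(device_ids):
        problems.append(f"worldSize {world_size} != {len(device_ids)} entries in npuDeviceIds[0]")
    matches = [s for s in rank_table["server_list"] if s.get("container_ip") == container_ip]
    if len(matches) != 1:
        problems.append(f"{len(matches)} rank table servers have container_ip {container_ip}")
    elif len(matches[0]["device"]) != len(device_ids):
        problems.append("rank table device count differs from npuDeviceIds[0]")
    # Assumption: each node should list its own address in ServerConfig.ipAddress.
    ip_address = config["ServerConfig"]["ipAddress"]
    if ip_address != container_ip:
        problems.append(f"ServerConfig.ipAddress is {ip_address}, not {container_ip}")
    return problems


def make_server(container_ip, first_octet, first_rank):
    """Build one server_list entry with 8 devices, mirroring the posted rank table."""
    return {
        "server_id": container_ip,
        "container_ip": container_ip,
        "device": [
            {"device_id": str(i),
             "device_ip": f"100.97.0.{first_octet + i}",
             "rank_id": str(first_rank + i)}
            for i in range(8)
        ],
    }


# Values taken from the rank table and config.json posted in this issue.
rank_table = {
    "server_count": "2",
    "server_list": [make_server("117.68.86.166", 145, 0),
                    make_server("117.68.86.165", 137, 8)],
    "status": "completed",
    "version": "1.0",
}
config = {
    "ServerConfig": {"ipAddress": "117.68.86.166"},
    "BackendConfig": {
        "npuDeviceIds": [[0, 1, 2, 3, 4, 5, 6, 7]],
        "ModelDeployConfig": {"ModelConfig": [{"worldSize": 8}]},
    },
}

if __name__ == "__main__":
    print("rank table:", check_rank_table(rank_table, world_size=16))
    # Both containers mount the same /nfs-server/model/config.json.
    for ip in ("117.68.86.166", "117.68.86.165"):
        print(ip, check_node_config(config, rank_table, ip))
```

With the posted values, the only item this sketch flags is on the 117.68.86.165 node: both containers mount the same `config.json`, so its `ipAddress` stays at 117.68.86.166. Whether that actually matters depends on the ipAddress assumption above, so treat it as something to verify rather than a confirmed cause.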