From 8260330b2b5859d61930a60494e8bf009e0027df Mon Sep 17 00:00:00 2001 From: zhangpengrui Date: Mon, 25 Aug 2025 11:12:46 +0800 Subject: [PATCH] =?UTF-8?q?=E5=88=B7=E6=96=B0=20=E4=B8=AD=E8=8B=B1?= =?UTF-8?q?=E6=96=87=20readme=20=E6=96=87=E6=A1=A3=E3=80=81=E6=B7=BB?= =?UTF-8?q?=E5=8A=A0=20license=20=E6=96=87=E4=BB=B6=E3=80=81=E6=96=B0?= =?UTF-8?q?=E5=A2=9EDeepSeek=20=E9=83=A8=E7=BD=B2=E6=8C=87=E5=8D=97?= =?UTF-8?q?=E8=8B=B1=E6=96=87=E7=89=88=E3=80=81=E4=B8=80=E9=94=AE=E5=BC=8F?= =?UTF-8?q?=E9=83=A8=E7=BD=B2=20openEuler=20intelligence=20=E8=8B=B1?= =?UTF-8?q?=E6=96=87=20readme=20=E6=96=87=E6=A1=A3?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Signed-off-by: zhangpengrui --- License/LICENSE | 127 +++++ README.md | 145 +++--- README_en.md | 150 +++--- .../DeepSeek-V3&R1Deployment Guide_en.md | 461 ++++++++++++++++++ script/mindspore-intelligence/README_en.md | 30 ++ 5 files changed, 767 insertions(+), 146 deletions(-) create mode 100755 License/LICENSE create mode 100644 doc/deepseek/DeepSeek-V3&R1Deployment Guide_en.md create mode 100644 script/mindspore-intelligence/README_en.md diff --git a/License/LICENSE b/License/LICENSE new file mode 100755 index 0000000..ee58399 --- /dev/null +++ b/License/LICENSE @@ -0,0 +1,127 @@ + 木兰宽松许可证, 第2版 + + 木兰宽松许可证, 第2版 + 2020年1月 http://license.coscl.org.cn/MulanPSL2 + + + 您对“软件”的复制、使用、修改及分发受木兰宽松许可证,第2版(“本许可证”)的如下条款的约束: + + 0. 定义 + + “软件”是指由“贡献”构成的许可在“本许可证”下的程序和相关文档的集合。 + + “贡献”是指由任一“贡献者”许可在“本许可证”下的受版权法保护的作品。 + + “贡献者”是指将受版权法保护的作品许可在“本许可证”下的自然人或“法人实体”。 + + “法人实体”是指提交贡献的机构及其“关联实体”。 + + “关联实体”是指,对“本许可证”下的行为方而言,控制、受控制或与其共同受控制的机构,此处的控制是指有受控方或共同受控方至少50%直接或间接的投票权、资金或其他有价证券。 + + 1. 授予版权许可 + + 每个“贡献者”根据“本许可证”授予您永久性的、全球性的、免费的、非独占的、不可撤销的版权许可,您可以复制、使用、修改、分发其“贡献”,不论修改与否。 + + 2. 授予专利许可 + + 每个“贡献者”根据“本许可证”授予您永久性的、全球性的、免费的、非独占的、不可撤销的(根据本条规定撤销除外)专利许可,供您制造、委托制造、使用、许诺销售、销售、进口其“贡献”或以其他方式转移其“贡献”。前述专利许可仅限于“贡献者”现在或将来拥有或控制的其“贡献”本身或其“贡献”与许可“贡献”时的“软件”结合而将必然会侵犯的专利权利要求,不包括对“贡献”的修改或包含“贡献”的其他结合。如果您或您的“关联实体”直接或间接地,就“软件”或其中的“贡献”对任何人发起专利侵权诉讼(包括反诉或交叉诉讼)或其他专利维权行动,指控其侵犯专利权,则“本许可证”授予您对“软件”的专利许可自您提起诉讼或发起维权行动之日终止。 + + 3. 无商标许可 + + “本许可证”不提供对“贡献者”的商品名称、商标、服务标志或产品名称的商标许可,但您为满足第4条规定的声明义务而必须使用除外。 + + 4. 分发限制 + + 您可以在任何媒介中将“软件”以源程序形式或可执行形式重新分发,不论修改与否,但您必须向接收者提供“本许可证”的副本,并保留“软件”中的版权、商标、专利及免责声明。 + + 5. 免责声明与责任限制 + + “软件”及其中的“贡献”在提供时不带任何明示或默示的担保。在任何情况下,“贡献者”或版权所有者不对任何人因使用“软件”或其中的“贡献”而引发的任何直接或间接损失承担责任,不论因何种原因导致或者基于何种法律理论,即使其曾被建议有此种损失的可能性。 + + 6. 语言 + “本许可证”以中英文双语表述,中英文版本具有同等法律效力。如果中英文版本存在任何冲突不一致,以中文版为准。 + + 条款结束 + + 如何将木兰宽松许可证,第2版,应用到您的软件 + + 如果您希望将木兰宽松许可证,第2版,应用到您的新软件,为了方便接收者查阅,建议您完成如下三步: + + 1, 请您补充如下声明中的空白,包括软件名、软件的首次发表年份以及您作为版权人的名字; + + 2, 请您在软件包的一级目录下创建以“LICENSE”为名的文件,将整个许可证文本放入该文件中; + + 3, 请将如下声明文本放入每个源文件的头部注释中。 + + Copyright (c) [Year] [name of copyright holder] + [Software Name] is licensed under Mulan PSL v2. + You can use this software according to the terms and conditions of the Mulan PSL v2. + You may obtain a copy of Mulan PSL v2 at: + http://license.coscl.org.cn/MulanPSL2 + THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY OR FIT FOR A PARTICULAR PURPOSE. + See the Mulan PSL v2 for more details. + + + Mulan Permissive Software License,Version 2 + + Mulan Permissive Software License,Version 2 (Mulan PSL v2) + January 2020 http://license.coscl.org.cn/MulanPSL2 + + Your reproduction, use, modification and distribution of the Software shall be subject to Mulan PSL v2 (this License) with the following terms and conditions: + + 0. 
Definition + + Software means the program and related documents which are licensed under this License and comprise all Contribution(s). + + Contribution means the copyrightable work licensed by a particular Contributor under this License. + + Contributor means the Individual or Legal Entity who licenses its copyrightable work under this License. + + Legal Entity means the entity making a Contribution and all its Affiliates. + + Affiliates means entities that control, are controlled by, or are under common control with the acting entity under this License, ‘control’ means direct or indirect ownership of at least fifty percent (50%) of the voting power, capital or other securities of controlled or commonly controlled entity. + + 1. Grant of Copyright License + + Subject to the terms and conditions of this License, each Contributor hereby grants to you a perpetual, worldwide, royalty-free, non-exclusive, irrevocable copyright license to reproduce, use, modify, or distribute its Contribution, with modification or not. + + 2. Grant of Patent License + + Subject to the terms and conditions of this License, each Contributor hereby grants to you a perpetual, worldwide, royalty-free, non-exclusive, irrevocable (except for revocation under this Section) patent license to make, have made, use, offer for sale, sell, import or otherwise transfer its Contribution, where such patent license is only limited to the patent claims owned or controlled by such Contributor now or in future which will be necessarily infringed by its Contribution alone, or by combination of the Contribution with the Software to which the Contribution was contributed. The patent license shall not apply to any modification of the Contribution, and any other combination which includes the Contribution. If you or your Affiliates directly or indirectly institute patent litigation (including a cross claim or counterclaim in a litigation) or other patent enforcement activities against any individual or entity by alleging that the Software or any Contribution in it infringes patents, then any patent license granted to you under this License for the Software shall terminate as of the date such litigation or activity is filed or taken. + + 3. No Trademark License + + No trademark license is granted to use the trade names, trademarks, service marks, or product names of Contributor, except as required to fulfill notice requirements in Section 4. + + 4. Distribution Restriction + + You may distribute the Software in any medium with or without modification, whether in source or executable forms, provided that you provide recipients with a copy of this License and retain copyright, patent, trademark and disclaimer statements in the Software. + + 5. Disclaimer of Warranty and Limitation of Liability + + THE SOFTWARE AND CONTRIBUTION IN IT ARE PROVIDED WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED. IN NO EVENT SHALL ANY CONTRIBUTOR OR COPYRIGHT HOLDER BE LIABLE TO YOU FOR ANY DAMAGES, INCLUDING, BUT NOT LIMITED TO ANY DIRECT, OR INDIRECT, SPECIAL OR CONSEQUENTIAL DAMAGES ARISING FROM YOUR USE OR INABILITY TO USE THE SOFTWARE OR THE CONTRIBUTION IN IT, NO MATTER HOW IT’S CAUSED OR BASED ON WHICH LEGAL THEORY, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. + + 6. Language + + THIS LICENSE IS WRITTEN IN BOTH CHINESE AND ENGLISH, AND THE CHINESE VERSION AND ENGLISH VERSION SHALL HAVE THE SAME LEGAL EFFECT. IN THE CASE OF DIVERGENCE BETWEEN THE CHINESE AND ENGLISH VERSIONS, THE CHINESE VERSION SHALL PREVAIL. 
+ + END OF THE TERMS AND CONDITIONS + + How to Apply the Mulan Permissive Software License,Version 2 (Mulan PSL v2) to Your Software + + To apply the Mulan PSL v2 to your work, for easy identification by recipients, you are suggested to complete following three steps: + + i Fill in the blanks in following statement, including insert your software name, the year of the first publication of your software, and your name identified as the copyright owner; + + ii Create a file named “LICENSE” which contains the whole context of this License in the first directory of your software package; + + iii Attach the statement to the appropriate annotated syntax at the beginning of each source file. + + + Copyright (c) [Year] [name of copyright holder] + [Software Name] is licensed under Mulan PSL v2. + You can use this software according to the terms and conditions of the Mulan PSL v2. + You may obtain a copy of Mulan PSL v2 at: + http://license.coscl.org.cn/MulanPSL2 + THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY OR FIT FOR A PARTICULAR PURPOSE. + See the Mulan PSL v2 for more details. diff --git a/README.md b/README.md index 7e92cd4..c1c1911 100644 --- a/README.md +++ b/README.md @@ -1,131 +1,131 @@ # openEuler开源全栈AI推理解决方案(Intelligence BooM) -**如果您的使用场景符合以下形态,您也可以直接下载以下 3 种镜像来开启使用之旅!** +**如果您的使用场景符合以下形态,您也可以直接下载以下 3 种镜像来开启使用之旅!** -**①** **CPU+NPU(800I A2)** +**①** **CPU+NPU(800I A2)** -•**硬件规格:**支持单机、双机、四机、大集群 +•**硬件规格:** 支持单机、双机、四机、大集群 -•**镜像地址:**hub.oepkgs.net/oedeploy/openeuler/aarch64/intelligence_boom:0.1.0-aarch64-800I-A2-openeuler24.03-lts-sp2 hub.oepkgs.net/oedeploy/openeuler/x86_64/intelligence_boom:0.1.0-x86_64-800I-A2-openeuler24.03-lts-sp2 +•**镜像地址:** hub.oepkgs.net/oedeploy/openeuler/aarch64/intelligence_boom:0.1.0-aarch64-800I-A2-openeuler24.03-lts-sp2 hub.oepkgs.net/oedeploy/openeuler/x86_64/intelligence_boom:0.1.0-x86_64-800I-A2-openeuler24.03-lts-sp2 -**②CPU+NPU(300I Duo)** +**②CPU+NPU(300I Duo)** -•**硬件规格:**支持单机、双机 +•**硬件规格:** 支持单机、双机 -•**镜像地址:**hub.oepkgs.net/oedeploy/openeuler/aarch64/intelligence_boom:0.1.0-aarch64-300I-Duo-openeuler24.03-lts-sp2 hub.oepkgs.net/oedeploy/openeuler/x86_64/intelligence_boom:0.1.0-x86_64-300I-Duo-openeuler24.03-lts-sp2 +•**镜像地址:** hub.oepkgs.net/oedeploy/openeuler/aarch64/intelligence_boom:0.1.0-aarch64-300I-Duo-openeuler24.03-lts-sp2 hub.oepkgs.net/oedeploy/openeuler/x86_64/intelligence_boom:0.1.0-x86_64-300I-Duo-openeuler24.03-lts-sp2 -**③** **CPU+GPU(NVIDIA A100)** +**③** **CPU+GPU(NVIDIA A100)** -•**硬件规格:**支持单机单卡、单机多卡 +•**硬件规格:** 支持单机单卡、单机多卡 -•**镜像地址:** hub.oepkgs.net/oedeploy/openeuler/aarch64/intelligence_boom:0.1.0-aarch64-A100-openeuler24.03-lts-sp2 hub.oepkgs.net/oedeploy/openeuler/aarch64/intelligence_boom:0.1.0-aarch64-syshax-openeuler24.03-lts-sp2- +•**镜像地址:** hub.oepkgs.net/oedeploy/openeuler/aarch64/intelligence_boom:0.1.0-aarch64-A100-openeuler24.03-lts-sp2 hub.oepkgs.net/oedeploy/openeuler/aarch64/intelligence_boom:0.1.0-aarch64-syshax-openeuler24.03-lts-sp2- -**我们的愿景:**基于 openEuler 构建开源的 AI 基础软件事实标准,推动企业智能应用生态的繁荣。 +**我们的愿景:** 基于 openEuler 构建开源的 AI 基础软件事实标准,推动企业智能应用生态的繁荣。 -**当大模型遇见产业落地,我们为何需要全栈方案?** +**当大模型遇见产业落地,我们为何需要全栈方案?** DeepSeek创新降低大模型落地门槛,AI进入“杰文斯悖论”时刻,需求大幅增加、多模态交互突破硬件限制、低算力需求重构部署逻辑,标志着AI从“技术验证期”迈入“规模落地期”。然而,产业实践中最核心的矛盾逐渐显现: -**产业痛点​** +**产业痛点​** -**适配难​:**不同行业(如金融、制造、医疗)的业务场景对推理延迟、算力成本、多模态支持的要求差异极大,单一模型或工具链难以覆盖多样化需求; +**适配难​:** 不同行业(如金融、制造、医疗)的业务场景对推理延迟、算力成本、多模态支持的要求差异极大,单一模型或工具链难以覆盖多样化需求; 
-**成本高​:**从模型训练到部署,需跨框架(PyTorch/TensorFlow/MindSpore)、跨硬件(CPU/GPU/NPU)、跨存储(关系型数据库/向量数据库)协同,硬件资源利用率低,运维复杂度指数级上升; +**成本高​:** 从模型训练到部署,需跨框架(PyTorch/TensorFlow/MindSpore)、跨硬件(CPU/GPU/NPU)、跨存储(关系型数据库/向量数据库)协同,硬件资源利用率低,运维复杂度指数级上升; -**​生态割裂​:**硬件厂商(如昇腾、英伟达)、框架厂商(如华为、Meta、Google)、云厂商(如K8s/RAY)的工具链互不兼容,“拼凑式”部署导致开发周期长、迭代效率低。 +**​生态割裂​:** 硬件厂商(如华为、英伟达)、框架厂商(Meta、Google)的工具链互不兼容,“拼凑式”部署导致开发周期长、迭代效率低。 ​技术挑战 -**推理效率瓶颈​:**大模型参数规模突破万亿级,传统推理引擎对动态计算图、稀疏激活、混合精度支持不足,算力浪费严重; +**推理效率瓶颈​:** 大模型参数规模突破万亿级,传统推理引擎对动态计算图、稀疏激活、混合精度支持不足,算力浪费严重; -**资源协同低效​:**CPU/GPU/NPU异构算力调度依赖人工经验,内存/显存碎片化导致资源闲置; +**资源协同低效​:** CPU/GPU/NPU异构算力调度依赖人工经验,内存/显存碎片化导致资源闲置; 为了解决以上问题,我们通过开源社区协同,加速开源推理方案Intelligence BooM成熟。 ## 技术架构 -![](doc/deepseek/asserts/IntelligenceBoom.png) +![](/Users/项目/llm方案/xin/llm_solution/doc/deepseek/asserts/IntelligenceBoom.png) -#### **智能应用平台:让您的业务快速“接轨”AI​** +#### **智能应用平台:让您的业务快速“接轨”AI​** -**组件构成 :** openHermes(智能体引擎,利用平台公共能力,Agent应用货架化,提供行业典型应用案例、多模态交互中间件,轻量框架,业务流编排、提示词工程等能力)、deeplnsight(业务洞察平台,提供多模态识别、Deep Research能力) +**组件构成 :** openHermes(智能体引擎,利用平台公共能力,Agent应用货架化,提供行业典型应用案例、多模态交互中间件,轻量框架,业务流编排、提示词工程等能力)、deeplnsight(业务洞察平台,提供多模态识别、Deep Research能力) 【deepInsight开源地址】https://gitee.com/openeuler/deepInsight -**核心价值** +**核心价值** -**低代码开发:** openHermes提供自然语言驱动的任务编排能力,业务人员可通过对话式交互生成AI应用原型; +**低代码开发:** openHermes提供自然语言驱动的任务编排能力,业务人员可通过对话式交互生成AI应用原型; -**效果追踪​:** deeplnsight实时监控模型推理效果(如准确率、延迟、成本),结合业务指标(如转化率、故障率)给出优化建议,实现“数据-模型-业务”闭环。 +**效果追踪​:** deeplnsight实时监控模型推理效果(如准确率、延迟、成本),结合业务指标(如转化率、故障率)给出优化建议,实现“数据-模型-业务”闭环。 -#### **推理服务:让模型“高效跑起来”** +#### **推理服务:让模型“高效跑起来”** -**组件构成​:** vLLM(高性能大模型推理框架)、SGLang(多模态推理加速库) +**组件构成​:** vLLM(高性能大模型推理框架)、SGLang(多模态推理加速库) 【vLLM开源地址】https://vllm.hyper.ai/docs -**核心价值​** +**核心价值​** -**动态扩缩容:** vLLM支持模型按需加载,结合K8s自动扩缩容策略,降低70%以上空闲算力成本; +**动态扩缩容:** vLLM支持模型按需加载,结合K8s自动扩缩容策略,降低70%以上空闲算力成本; -**大模型优化​: **vLLM通过PagedAttention、连续批处理等技术,将万亿参数模型的推理延迟降低50%,吞吐量提升3倍; +**大模型优化​:** vLLM通过PagedAttention、连续批处理等技术,将万亿参数模型的推理延迟降低50%,吞吐量提升3倍; -#### **加速层:让推理“快人一步”​​** +#### **加速层:让推理“快人一步”​​** -**组件构成​:** sysHAX、expert-kit、ktransformers +**组件构成​:** sysHAX、expert-kit、ktransformers 【sysHAX开源地址】https://gitee.com/openeuler/sysHAX 【expert-kit开源地址】https://gitee.com/openeuler/expert-kit -**核心价值​** +**核心价值​** -**异构算力协同分布式推理加速引擎:** 整合CPU、NPU、GPU等不同架构硬件的计算特性,通过动态任务分配实现"专用硬件处理专用任务"的优化,将分散的异构算力虚拟为统一资源池,实现细粒度分配与弹性伸缩; +**异构算力协同分布式推理加速引擎:** 整合CPU、NPU、GPU等不同架构硬件的计算特性,通过动态任务分配实现"专用硬件处理专用任务"的优化,将分散的异构算力虚拟为统一资源池,实现细粒度分配与弹性伸缩; -#### **框架层:让模型“兼容并蓄”** +#### **框架层:让模型“兼容并蓄”** -**组件构成​:** MindSpore(全场景框架)、PyTorch(Meta通用框架)、TensorFlow(Google工业框架) +**组件构成​:** MindSpore(全场景框架)、PyTorch(Meta通用框架)、TensorFlow(Google工业框架) 【MindSpore开源地址】https://gitee.com/mindspore -**核心价值​** +**核心价值​** -**多框架兼容: **通过统一API接口,支持用户直接调用任意框架训练的模型,无需重写代码; +**多框架兼容:** 通过统一API接口,支持用户直接调用任意框架训练的模型,无需重写代码; -**动态图优化​:** 针对大模型的动态控制流(如条件判断、循环),提供图优化能力,推理稳定性提升30%; -​**社区生态复用​:** 完整继承PyTorch/TensorFlow的生态工具(如Hugging Face模型库),降低模型迁移成本。 +**动态图优化​:** 针对大模型的动态控制流(如条件判断、循环),提供图优化能力,推理稳定性提升30%; +​**社区生态复用​:** 完整继承PyTorch/TensorFlow的生态工具(如Hugging Face模型库),降低模型迁移成本。 -#### **数据工程、向量检索、数据融合分析:从原始数据到推理燃料的转化​** +#### **数据工程、向量检索、数据融合分析:从原始数据到推理燃料的转化​** -**组件构成​:** DataJuicer、Oasis、九天计算引擎、PG Vector、Milvus、GuassVector、Lotus、融合分析引擎 +**组件构成​:** DataJuicer、Oasis、九天计算引擎、PG Vector、Milvus、GuassVector、Lotus、融合分析引擎 -**核心价值​** +**核心价值​** -**多模态数据高效处理与管理:** 多模态数据的统一接入、清洗、存储与索引,解决推理场景中数据类型复杂、规模庞大的管理难题,为上层智能应用提供标准化数据底座。 +**多模态数据高效处理与管理:** 多模态数据的统一接入、清洗、存储与索引,解决推理场景中数据类型复杂、规模庞大的管理难题,为上层智能应用提供标准化数据底座。 -**高效检索与实时响应支撑: 
**实现海量高维数据的快速匹配与实时查询,满足推理场景中对数据时效性和准确性的严苛要求,缩短数据到推理结果的链路延迟,为智能问答、智能运维等实时性应用提供底层性能保障。 +**高效检索与实时响应支撑:** 实现海量高维数据的快速匹配与实时查询,满足推理场景中对数据时效性和准确性的严苛要求,缩短数据到推理结果的链路延迟,为智能问答、智能运维等实时性应用提供底层性能保障。 -#### **任务管理平台:让资源“聪明调度”​​** +#### **任务管理平台:让资源“聪明调度”​​** -**组件构成​: **openFuyao(任务编排引擎)、K8S(容器编排)、RAY(分布式计算)、oeDeploy(一键部署工具) +**组件构成​:** openFuyao(任务编排引擎)、K8S(容器编排)、RAY(分布式计算)、oeDeploy(一键部署工具) 【openFuyao开源地址】https://gitcode.com/openFuyao @@ -133,62 +133,62 @@ DeepSeek创新降低大模型落地门槛,AI进入“杰文斯悖论”时刻 【oeDeploy开源地址】https://gitee.com/openeuler/oeDeploy -**核心价值​** +**核心价值​** -**端边云协同:** 根据任务类型(如实时推理/离线批处理)和硬件能力(如边缘侧NPU/云端GPU),自动分配执行节点; +**端边云协同:** 根据任务类型(如实时推理/离线批处理)和硬件能力(如边缘侧NPU/云端GPU),自动分配执行节点; -**全生命周期管理​: **从模型上传、版本迭代、依赖安装到服务启停,提供“一站式”运维界面; -​**故障自愈​:** 实时监控任务状态,自动重启异常进程、切换备用节点,保障服务高可用性。 +**全生命周期管理​:** 从模型上传、版本迭代、依赖安装到服务启停,提供“一站式”运维界面; +​**故障自愈​:** 实时监控任务状态,自动重启异常进程、切换备用节点,保障服务高可用性。 -#### **编译器:让代码“更懂硬件”​​** +#### **编译器:让代码“更懂硬件”​​** -**组件构成​:**异构融合编译器(Bisheng) +**组件构成​:** 异构融合编译器(Bisheng) -**核心价值** +**核心价值** -**跨硬件优化: **针对CPU(x86/ARM)、GPU(CUDA)、NPU(昇腾/CANN)的指令集差异,自动转换计算逻辑,算力利用率大幅提升%; +**跨硬件优化:** 针对CPU(x86/ARM)、GPU(CUDA)、NPU(昇腾/CANN)的指令集差异,自动转换计算逻辑,算力利用率大幅提升%; -**混合精度支持​: **动态调整FP32/FP16/INT8精度,在精度损失可控的前提下,推理速度大幅提升; -​**内存优化​:**通过算子融合、内存复用等技术,减少30%显存/内存占用,降低硬件成本。 +**混合精度支持​:** 动态调整FP32/FP16/INT8精度,在精度损失可控的前提下,推理速度大幅提升; +​**内存优化​:** 通过算子融合、内存复用等技术,减少30%显存/内存占用,降低硬件成本。 -#### **操作系统:让全栈“稳如磐石”** +#### **操作系统:让全栈“稳如磐石”** -**组件构成​:** openEuler(开源欧拉操作系统) +**组件构成​:** openEuler(开源欧拉操作系统) 【openEuler开源地址】https://gitee.com/openeuler -**核心价值** +**核心价值** -**异构资源管理:** 原生支持CPU/GPU/NPU的统一调度,提供硬件状态监控、故障隔离等能力; +**异构资源管理:** 原生支持CPU/GPU/NPU的统一调度,提供硬件状态监控、故障隔离等能力; -**安全增强​:** 集成国密算法、权限隔离、漏洞扫描模块,满足金融、政务等行业的合规要求。 +**安全增强​:** 集成国密算法、权限隔离、漏洞扫描模块,满足金融、政务等行业的合规要求。 -#### **硬件使能与硬件层:让算力“物尽其用”** +#### **硬件使能与硬件层:让算力“物尽其用”** -**组件构成​:**CANN(昇腾AI使能套件)、CUDA(英伟达计算平台)、CPU(x86/ARM)、NPU(昇腾)、GPU(英伟达/国产GPU) +**组件构成​:** CANN(昇腾AI使能套件)、CUDA(英伟达计算平台)、CPU(x86/ARM)、NPU(昇腾)、GPU(英伟达/国产GPU) -**核心价值** +**核心价值** -**硬件潜能释放:**CANN针对昇腾NPU的达芬奇架构优化矩阵运算、向量计算,算力利用率大幅提升;CUDA提供成熟的GPU并行计算框架,支撑通用AI任务; +**硬件潜能释放:** CANN针对昇腾NPU的达芬奇架构优化矩阵运算、向量计算,算力利用率大幅提升;CUDA提供成熟的GPU并行计算框架,支撑通用AI任务; -**异构算力融合​:**通过统一编程接口(如OpenCL),实现CPU/NPU/GPU的协同计算,避免单一硬件性能瓶颈; +**异构算力融合​:** 通过统一编程接口(如OpenCL),实现CPU/NPU/GPU的协同计算,避免单一硬件性能瓶颈; ​ -#### **互联技术:让硬件“高速对话”​​** +#### **互联技术:让硬件“高速对话”​​** -**组件构成​:**UB(通用总线)、CXL(计算与内存扩展)、NvLink(英伟达高速互联)、SUE +**组件构成​:** UB(通用总线)、CXL(计算与内存扩展)、NvLink(英伟达高速互联)、SUE -**核心价值** +**核心价值** -**低延迟通信:**CXL/NvLink提供内存级互联带宽(>1TB/s),减少跨设备数据拷贝开销 +**低延迟通信:** CXL/NvLink提供内存级互联带宽(>1TB/s),减少跨设备数据拷贝开销 -**灵活扩展:**支持从单机(多GPU)到集群(跨服务器)的无缝扩展,适配不同规模企业的部署需求。 +**灵活扩展:** 支持从单机(多GPU)到集群(跨服务器)的无缝扩展,适配不同规模企业的部署需求。 @@ -201,6 +201,8 @@ DeepSeek创新降低大模型落地门槛,AI进入“杰文斯悖论”时刻 参考[部署指南](https://gitee.com/openeuler/llm_solution/blob/master/doc/deepseek/DeepSeek-V3&R1%E9%83%A8%E7%BD%B2%E6%8C%87%E5%8D%97.md),使用一键式部署脚本,20min完成推理服务拉起。 + + ### 一键式部署DeepSeek 模型和openEuler Intelligence智能应用 参考[一键式部署openEuler Intelligence ](https://gitee.com/openeuler/llm_solution/tree/master/script/mindspore-intelligence),搭建本地知识库并协同DeepSeek大模型完成智能调优、智能运维等应用; @@ -245,4 +247,5 @@ DeepSeek创新降低大模型落地门槛,AI进入“杰文斯悖论”时刻 ## 参与贡献 欢迎通过issue方式提出您宝贵的建议,共建开箱即优、性能领先的全栈开源国产化推理解决方案 + # llm_solution diff --git a/README_en.md b/README_en.md index c52e9ec..345ac2e 100644 --- a/README_en.md +++ b/README_en.md @@ -1,164 +1,164 @@ -# OpenEuler Open-Source Full-Stack AI inference Solution (Intelligence BooM) # +# openEuler Open-Source Full-Stack AI inference Solution (Intelligence BooM) # **If your application scenario meets the following requirements, you can 
also download the following three images to start the use journey:** -**1CPU+NPU (800I A2)** +**CPU+NPU (800I A2)** -· \* \* Hardware specifications: \* \* Supports single-node system, two-node cluster, four-node cluster, and large cluster. + **Hardware specifications:** Supports single-node system, two-node cluster, four-node cluster, and large cluster. -· \* \* image path: \*\*hub.oepkgs.net/oedeploy/openeuler/aarch64/intelligence_boom:0.1.0-aarch64-800I-A2-openeuler24.03-lts-sp2 hub.oepkgs.net/oedeploy/openeuler/x86_64/intelligence_boom:0.1.0-x86_64-800I-A2-openeuler24.03-lts-sp2 +**image path:** hub.oepkgs.net/oedeploy/openEuler/aarch64/intelligence_boom:0.1.0-aarch64-800I-A2-openEuler24.03-lts-sp2 hub.oepkgs.net/oedeploy/openEuler/x86_64/intelligence_boom:0.1.0-x86_64-800I-A2-openEuler24.03-lts-sp2 -**2CPU+NPU (300I Duo)** +**CPU+NPU (300I Duo)** -· \* \* Hardware specifications: \* \* Single-node system and two-node cluster are supported. +**Hardware specifications:** Single-node system and two-node cluster are supported. -· \* \* image path: \*\*hub.oepkgs.net/oedeploy/openeuler/aarch64/intelligence_boom:0.1.0-aarch64-300I-Duo-openeuler24.03-lts-sp2 hub.oepkgs.net/oedeploy/openeuler/x86_64/intelligence_boom:0.1.0-x86_64-300I-Duo-openeuler24.03-lts-sp2 +**image path:** hub.oepkgs.net/oedeploy/openEuler/aarch64/intelligence_boom:0.1.0-aarch64-300I-Duo-openEuler24.03-lts-sp2 hub.oepkgs.net/oedeploy/openEuler/x86_64/intelligence_boom:0.1.0-x86_64-300I-Duo-openEuler24.03-lts-sp2 -**3CPU+GPU (NVIDIA A100)** +**CPU+GPU (NVIDIA A100)** -· \* \* Hardware specifications: \* \* Supports single-node single-card and single-node multi-card. +·**Hardware specifications:** Supports single-node single-card and single-node multi-card. -· image path: hub.oepkgs.net/oedeploy/openeuler/aarch64/intelligence_boom:0.1.0-aarch64-A100-openeuler24.03-lts-sp2 hub.oepkgs.net/oedeploy/openeuler/aarch64/intelligence_boom:0.1.0-aarch64-syshax-openeuler24.03-lts-sp2- +**image path:** hub.oepkgs.net/oedeploy/openEuler/aarch64/intelligence_boom:0.1.0-aarch64-A100-openEuler24.03-lts-sp2 hub.oepkgs.net/oedeploy/openEuler/aarch64/intelligence_boom:0.1.0-aarch64-syshax-openEuler24.03-lts-sp2- -\*\*Our vision: \* \* Build open-source AI basic software de facto standards based on openEuler to promote the prosperity of the enterprise intelligent application ecosystem. +**Our vision:** Build open-source AI basic software de facto standards based on openEuler to promote the prosperity of the enterprise intelligent application ecosystem. -**When big models meet industry implementation, why do we need a full-stack solution?** +**When big models meet industry implementation, why do we need a full-stack solution?** DeepSeek innovation lowers the threshold for implementing large models. AI enters the "Jervins Paradox" moment. Requirements increase significantly, multi-modal interaction breaks through hardware restrictions, and low computing power requirements reconstruct deployment logic, marking the transition from the "technical verification period" to the "scale implementation period". However, the core contradictions in industry practice gradually emerged: -**Industry pain points** +**Industry pain points** -\*\*Difficult adaptation: \*\*The requirements for inference delay, computing cost, and multi-modal support vary greatly in different industries (such as finance, manufacturing, and healthcare). A single model or tool chain cannot cover diversified requirements. 
+**Difficult adaptation:** The requirements for inference delay, computing cost, and multi-modal support vary greatly in different industries (such as finance, manufacturing, and healthcare). A single model or tool chain cannot cover diversified requirements. -\*\* High cost: \*\*From model training to deployment, collaboration between (PyTorch/TensorFlow/MindSpore), hardware (CPU/GPU/NPU), and storage (relational database/vector database) is required. Hardware resource utilization is low and O&M complexity increases exponentially. +**High cost:** From model training to deployment, collaboration between (PyTorch/TensorFlow/MindSpore), hardware (CPU/GPU/NPU), and storage (relational database/vector database) is required. Hardware resource utilization is low and O&M complexity increases exponentially. -\*\* Ecosystem fragmentation: Tool chains of hardware vendors (such as Ascend and NVIDIA), framework vendors (such as Huawei, Meta, and Google), and cloud vendors (such as K8s and RAY) are incompatible with each other. Patchwork deployment leads to long development cycles and inefficient iterations. Technical challenges +**Ecosystem fragmentation:** Tool chains of hardware vendors (such as Huawei and NVIDIA), framework vendors (such as Meta, and Google) are incompatible with each other. Patchwork deployment leads to long development cycles and inefficient iterations. Technical challenges -\*\*Inference efficiency bottleneck: \* \*The parameter scale of large models exceeds trillions. Traditional inference engines do not support dynamic graph calculation, sparse activation, and hybrid precision, causing a serious waste of computing power. +**Inference efficiency bottleneck:** The parameter scale of large models exceeds trillions. Traditional inference engines do not support dynamic graph calculation, sparse activation, and hybrid precision, causing a serious waste of computing power. -\*\*Inefficient resource collaboration: \*\*Heterogeneous computing power scheduling of CPUs, GPUs, and NPUs depends on manual experience. Memory and video memory fragmentation leads to idle resources. +**Inefficient resource collaboration:** Heterogeneous computing power scheduling of CPUs, GPUs, and NPUs depends on manual experience. Memory and video memory fragmentation leads to idle resources. To solve the preceding problems, we collaborate with the open source community to accelerate the maturity of the open source inference solution Intelligence BooM. ## Technical Architecture ## -![Image](doc/deepseek/asserts/IntelligenceBoom.png) +![Image](doc/deepseek/asserts/IntelligenceBoom_en.png} -#### **Intelligent Application Platform: Quickly Connect Your Business to AI** #### +#### **Intelligent Application Platform: Quickly Connect Your Business to AI** #### -**Component: openHermes (Agent-Tone Engine, which uses the public capabilities of the platform and provides the following capabilities: Typical application cases, multi-modal interaction middleware, lightweight framework, service flow orchestration, and prompt word engineering.) and deeplnsight (Service insight platform, providing multi-modal identification and deep research capabilities)** +**Component:** openHermes (Agent-Tone Engine, which uses the public capabilities of the platform and provides the following capabilities: Typical application cases, multi-modal interaction middleware, lightweight framework, service flow orchestration, and prompt word engineering.) 
and deeplnsight (Service insight platform, providing multi-modal identification and deep research capabilities) -\[DeepInsight open source address\] https://gitee.com/openeuler/deepInsight +\[DeepInsight open source address\] https://gitee.com/openEuler/deepInsight -**Core Value** +**Core Value** -**Low-code development: OpenHermes provides the natural language-driven task orchestration capability, allowing service personnel to generate AI application prototypes through dialogue-based interaction.** +**Low-code development:** OpenHermes provides the natural language-driven task orchestration capability, allowing service personnel to generate AI application prototypes through dialogue-based interaction. -**Effect tracking: Deeplnsight monitors the model inference effect (such as accuracy, delay, and cost) in real time, and provides optimization suggestions based on service indicators (such as conversion rate and failure rate), implementing closed-loop management of data, models, and services.** +**Effect tracking:** Deeplnsight monitors the model inference effect (such as accuracy, delay, and cost) in real time, and provides optimization suggestions based on service indicators (such as conversion rate and failure rate), implementing closed-loop management of data, models, and services. -#### **Inference service: enabling models to run efficiently** #### +#### **Inference service: enabling models to run efficiently** #### -**Components: vLLM (high-performance large model inference framework) and SGLang (multi-modal inference acceleration library)** +**Components:** vLLM (high-performance large model inference framework) and SGLang (multi-modal inference acceleration library) -\[vLLM open source address\] https://vllm.hyper.ai/docs +\[vLLM open source address\] https://docs.vllm.ai/en/latest/ -**Core Value** +**Core Value** -**Dynamic scaling: vLLM supports on-demand model loading and uses the Kubernetes automatic scaling policy to reduce the idle computing power cost by more than 70%.** +**Dynamic scaling:** vLLM supports on-demand model loading and uses the Kubernetes automatic scaling policy to reduce the idle computing power cost by more than 70%. -\*\*Big model optimization: \*\*VLLM uses technologies such as PagedAttention and continuous batch processing to reduce the inference delay of trillion-parameter models by 50% and improve the throughput by three times. +**Big model optimization:** VLLM uses technologies such as PagedAttention and continuous batch processing to reduce the inference delay of trillion-parameter models by 50% and improve the throughput by three times. 
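+
+Once the service is running, it can be smoke-tested through the vLLM OpenAI-compatible HTTP API. The sketch below assumes the endpoint listens on port 8000 (the default `llm_port` in the deployment guide) and that the model name equals the configured model path; adjust both to your environment. A fuller request example is given in section 6.2 of the deployment guide.
+
+```
+# List the models served by the endpoint (assumes the default port 8000)
+curl http://localhost:8000/v1/models
+
+# Minimal completion request; the model value must match the model_path used at startup
+curl http://localhost:8000/v1/completions -H "Content-Type: application/json" \
+  -d '{"model": "/workspace/deepseekv3", "prompt": "Hello", "max_tokens": 32}'
+```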
-#### **Acceleration layer: Make reasoning "one step faster"** #### +#### **Acceleration layer: Make reasoning "one step faster"** #### -**Components: sysHAX, expert-kit, and ktransformers** +**Components:** sysHAX, expert-kit, and ktransformers -\[sysHAX open source address\] https://gitee.com/openeuler/sysHAX +\[sysHAX open source address\] https://gitee.com/openEuler/sysHAX -\[Expert-Kit open source address\] https://gitee.com/openeuler/expert-kit +\[Expert-Kit open source address\] https://gitee.com/openEuler/expert-kit -**Core Value** +**Core Value** -**Heterogeneous computing power collaboration distributed inference acceleration engine: Integrates the computing features of different architecture hardware such as CPU, NPU, and GPU, optimizes dedicated hardware processing dedicated tasks through dynamic task allocation, virtualizes scattered heterogeneous computing power into a unified resource pool, implementing fine-grained allocation and elastic scaling.** +**Heterogeneous computing power collaboration distributed inference acceleration engine:** Integrates the computing features of different architecture hardware such as CPU, NPU, and GPU, optimizes dedicated hardware processing dedicated tasks through dynamic task allocation, virtualizes scattered heterogeneous computing power into a unified resource pool, implementing fine-grained allocation and elastic scaling. -#### **Framework layer: Make the model "inclusive"** #### +#### **Framework layer: Make the model "inclusive"** #### -**Components: MindSpore (all-scenario framework), PyTorch (Meta general framework), and TensorFlow (Google industrial framework)** +**Components:** MindSpore (all-scenario framework), PyTorch (Meta general framework), and TensorFlow (Google industrial framework) \[MindSpore open source address\] https://gitee.com/mindspore -**Core Value** +**Core Value** -\*\*Multi-framework compatibility: \* \* Unified APIs allow users to directly invoke models trained by any framework without rewriting code. +**Multi-framework compatibility:** Unified APIs allow users to directly invoke models trained by any framework without rewriting code. -**Dynamic graph optimization: For dynamic control flows (such as condition judgment and loop) of large models, the graph optimization capability is provided, improving the inference stability by 30%. Community ecosystem reuse: Inherit ecosystem tools (such as Hugging Face model library) of PyTorch/TensorFlow, reducing model migration costs.** +**Dynamic graph optimization:** For dynamic control flows (such as condition judgment and loop) of large models, the graph optimization capability is provided, improving the inference stability by 30%. Community ecosystem reuse: Inherit ecosystem tools (such as Hugging Face model library) of PyTorch/TensorFlow, reducing model migration costs. 
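+
+In practice, switching frameworks in this stack is a configuration choice rather than a code change. As one illustration, the MindSpore path in the deployment guide below is selected by pointing vLLM at the MindFormers backend; the variable names here are taken from the guide's configuration example and appendix, and the YAML path is only a placeholder.
+
+```
+# Select the MindSpore/MindFormers backend for vLLM (names as listed in the deployment guide appendix)
+export vLLM_MODEL_BACKEND=MindFormers
+# Model YAML consumed by MindFormers (placeholder path)
+export MindFormers_MODEL_CONFIG=/path/to/deepseek_r1.yaml
+```
+
+The same choice appears as `backend_type: MindFormers` in the one-click deployment's config.yaml.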
-#### **Data engineering, vector retrieval, and data fusion analysis: transformation from raw data to inference fuel** #### +#### **Data engineering, vector retrieval, and data fusion analysis:** transformation from raw data to inference fuel #### -**Components: DataJuicer, Oasis, nine-day computing engine, PG Vector, Milvus, GuassVector, Lotus, and converged analysis engine** +**Components:** DataJuicer, Oasis, nine-day computing engine, PG Vector, Milvus, GuassVector, Lotus, and converged analysis engine -**Core Value** +**Core Value** -**Efficient processing and management of multi-modal data: Unified access, cleaning, storage, and indexing of multi-modal data solves complex data types and large-scale management problems in inference scenarios and provides standardized data foundation for upper-layer intelligent applications.** +**Efficient processing and management of multi-modal data:** Unified access, cleaning, storage, and indexing of multi-modal data solves complex data types and large-scale management problems in inference scenarios and provides standardized data foundation for upper-layer intelligent applications. -\*\*Effective search and real-time response support: \*\*Quick matching and real-time query of massive high-dimensional data meet the strict requirements on data timeliness and accuracy in inference scenarios and shorten the link delay from data to inference results. Provides underlying performance assurance for real-time applications such as intelligent Q&A and intelligent O&M. +**Effective search and real-time response support:** Quick matching and real-time query of massive high-dimensional data meet the strict requirements on data timeliness and accuracy in inference scenarios and shorten the link delay from data to inference results. Provides underlying performance assurance for real-time applications such as intelligent Q&A and intelligent O&M. -#### **Task management platform: smart resource scheduling** #### +#### **Task management platform: smart resource scheduling** #### -\*\*Components: \*\*OpenFuyao (task orchestration engine), K8S (container orchestration), RAY (distributed computing), and oeDeploy (one-click deployment tool) +**Components:** OpenFuyao (task orchestration engine), K8S (container orchestration), RAY (distributed computing), and oeDeploy (one-click deployment tool) \[OpenFuyao open source address\] https://gitcode.com/openFuyao -\[Ray open source address\] https://gitee.com/src-openeuler/ray +\[Ray open source address\] https://gitee.com/src-openEuler/ray -\[Open source address of the oeDeploy\] https://gitee.com/openeuler/oeDeploy +\[Open source address of the oeDeploy\] https://gitee.com/openEuler/oeDeploy -**Core Value** +**Core Value** -**Device-edge-cloud synergy: Automatically allocates execution nodes based on task types (such as real-time inference and offline batch processing) and hardware capabilities (such as edge NPUs and cloud GPUs).** +**Device-edge-cloud synergy:** Automatically allocates execution nodes based on task types (such as real-time inference and offline batch processing) and hardware capabilities (such as edge NPUs and cloud GPUs). -**Full-lifecycle management: provides a one-stop O&M interface, including model upload, version iteration, dependency installation, and service startup and shutdown. 
Fault self-healing: Monitors the task status in real time, automatically restarts abnormal processes, and switches services to the standby node, ensuring high service availability.** +**Full-lifecycle management:** provides a one-stop O&M interface, including model upload, version iteration, dependency installation, and service startup and shutdown. Fault self-healing: Monitors the task status in real time, automatically restarts abnormal processes, and switches services to the standby node, ensuring high service availability. #### **Compiler: Making Code "More Hardware-Savvy"** #### -\*\*Component composition: \*\*Heterogeneous integration compiler (Bisheng) +**Component composition:** Heterogeneous integration compiler (Bisheng) -**Core Value** +**Core Value** -\*\* Cross-hardware optimization: \* \* Automatically converts computing logic based on instruction set differences between CPU (x86/ARM), GPU (CUDA), and NPU (Ascend/CANN), greatly improving computing power utilization by%. +**Cross-hardware optimization:** Automatically converts computing logic based on instruction set differences between CPU (x86/ARM), GPU (CUDA), and NPU (Ascend/CANN), greatly improving computing power utilization by%. -\*\* Mixed precision support: Dynamically adjust the FP32/FP16/INT8 precision, greatly improving the inference speed while the precision loss is controllable. Memory optimization: \* \* Reduces the video memory and memory usage by 30% and reduces hardware costs by using technologies such as operator convergence and memory overcommitment. +**Mixed precision support:** Dynamically adjust the FP32/FP16/INT8 precision, greatly improving the inference speed while the precision loss is controllable. Memory optimization:**Reduces the video memory and memory usage by 30% and reduces hardware costs by using technologies such as operator convergence and memory overcommitment. -#### **Operating System: Make the Full Stack "Stand as a Rock"** #### +#### **Operating System: Make the Full Stack "Stand as a Rock"** #### -**Component: openEuler (open-source EulerOS)** +**Component:** openEuler (open-source EulerOS) -\[OpenEuler open source address\] https://gitee.com/openeuler +\[openEuler open source address\] https://gitee.com/openEuler **Core Value** -**Heterogeneous resource management: Supports unified scheduling of CPUs, GPUs, and NPUs, and provides capabilities such as hardware status monitoring and fault isolation.** +**Heterogeneous resource management:** Supports unified scheduling of CPUs, GPUs, and NPUs, and provides capabilities such as hardware status monitoring and fault isolation. -**Security enhancement: Integrates the Chinese national cryptographic algorithm, permission isolation, and vulnerability scanning modules to meet compliance requirements of industries such as finance and government.** +**Security enhancement:** Integrates the Chinese national cryptographic algorithm, permission isolation, and vulnerability scanning modules to meet compliance requirements of industries such as finance and government. 
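+
+As a concrete example of the hardware status monitoring this layer exposes, the deployment guide below checks Ascend NPU health and driver/firmware versions directly from the openEuler host; the commands assume the Ascend HDK driver is already installed.
+
+```
+# List NPU devices and their health status
+npu-smi info
+
+# Query the installed driver and firmware versions (also used in section 3.1.1 of the deployment guide)
+npu-smi info -t board -i 1 | egrep -i "software|firmware"
+```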
-#### **Hardware Enablement and Hardware Layer: Make the Most of Computing Power** #### +#### **Hardware Enablement and Hardware Layer: Make the Most of Computing Power** #### -\*\* Components: \*\*CANN (Ascend AI enablement suite), CUDA (Nvidia computing platform), CPU (x86/ARM), NPU (Ascend), GPU (Nvidia/GPU in China) +**Components:** CANN (Ascend AI enablement suite), CUDA (Nvidia computing platform), CPU (x86/ARM), NPU (Ascend), GPU (Nvidia/GPU in China) -**Core Value** +**Core Value** -\*\* Hardware potential release: \*\*CANN optimizes matrix computing and vector computing for Ascend NPU Da Vinci architecture, greatly improving computing power utilization. CUDA provides a mature GPU parallel computing framework to support common AI tasks. +**Hardware potential release:** CANN optimizes matrix computing and vector computing for Ascend NPU Da Vinci architecture, greatly improving computing power utilization. CUDA provides a mature GPU parallel computing framework to support common AI tasks. -\*\*Heterogeneous computing power convergence: \*\*Using unified programming interfaces (such as OpenCL) to implement collaborative computing among CPUs, NPUs, and GPUs, avoiding performance bottlenecks of a single hardware. +**Heterogeneous computing power convergence:** Using unified programming interfaces (such as OpenCL) to implement collaborative computing among CPUs, NPUs, and GPUs, avoiding performance bottlenecks of a single hardware. -#### **Connected technology: "high-speed conversation" with hardware** #### +#### **Connected technology: "high-speed conversation" with hardware** #### -\*\*Component composition: \*\*UB (universal bus), CXL (computing and memory expansion), NVLink (Nvidia high-speed interconnection), SUE +**Component composition:** UB (universal bus), CXL (computing and memory expansion), NVLink (Nvidia high-speed interconnection), SUE -**Core Values** +**Core Values** -\*\*Low latency communication: \*\*CXL/NvLink provides memory-class interconnect bandwidth (>1 TB/s) to reduce cross-device data copy overhead +**Low latency communication:** CXL/NvLink provides memory-class interconnect bandwidth (>1 TB/s) to reduce cross-device data copy overhead -\*\*Flexible expansion: \*\*Supports seamless expansion from a single-node system (multi-GPU) to a cluster (cross-server) to meet the deployment requirements of enterprises of different scales. +**Flexible expansion:** Supports seamless expansion from a single-node system (multi-GPU) to a cluster (cross-server) to meet the deployment requirements of enterprises of different scales. ## Full-Stack Solution Deployment Tutorial ## @@ -166,11 +166,11 @@ Currently, the solution supports more than 50 mainstream models, such as DeepSee ### DeepSeek V3 and R1 deployment ### -Reference[Deployment Guide](https://gitee.com/openeuler/llm_solution/blob/master/doc/deepseek/DeepSeek-V3&R1%E9%83%A8%E7%BD%B2%E6%8C%87%E5%8D%97.md) Use the one-click deployment script to start the inference service within 20 minutes. +Reference[Deployment Guide](https://gitee.com/openEuler/llm_solution/blob/master/doc/deepseek/DeepSeek-V3%26R1Deployment%20Guide_en.md) Use the one-click deployment script to start the inference service within 20 minutes. 
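+
+For a quick preview, the whole flow reduces to a few commands on the controller node. This is only a sketch of what the deployment guide covers in detail in section 4.1; the oedp package URL and paths are the ones used there, and config.yaml must be edited with your node IPs, weight path, node count, and NIC name before the final step.
+
+```
+# Install the oedp deployment tool on the controller node (URL as given in the deployment guide)
+wget https://repo.oepkgs.net/openEuler/rpm/openEuler-24.03-LTS/contrib/oedp/aarch64/Packages/oedp-1.0.1-1.oe2503.aarch64.rpm
+yum localinstall oedp-*
+
+# Fetch the deployment plug-in and run the one-click install
+git clone https://gitee.com/openeuler/llm_solution.git
+cd llm_solution/script/mindspore-deepseek
+# Edit config.yaml (ansible_host, model_path, node_num, ray_device) before this step
+oedp run install
+```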
### One-click deployment of the DeepSeek model and openEuler Intelligence intelligent application ### -Reference[One-click deployment of openEuler Intelligence](https://gitee.com/openeuler/llm_solution/tree/master/script/mindspore-intelligence) Build a local knowledge base and collaborate with the DeepSeek big model to complete applications such as intelligent optimization and intelligent O&M. +Reference[One-click deployment of openEuler Intelligence](https://gitee.com/openEuler/llm_solution/tree/master/script/mindspore-intelligence/README_en.md) Build a local knowledge base and collaborate with the DeepSeek big model to complete applications such as intelligent optimization and intelligent O&M. ## Performance ## diff --git a/doc/deepseek/DeepSeek-V3&R1Deployment Guide_en.md b/doc/deepseek/DeepSeek-V3&R1Deployment Guide_en.md new file mode 100644 index 0000000..c323978 --- /dev/null +++ b/doc/deepseek/DeepSeek-V3&R1Deployment Guide_en.md @@ -0,0 +1,461 @@ +# DeepSeek-V3&R1 Deployment Guide # + +## 1. Hardware requirements and networking ## + +This document uses DeepSeek-R1 as an example. DeepSeek-V3 and R1 have the same model structure, parameter quantity, and deployment mode as R1. + +### 1.1 Single-Node Deployment ### + +One Atlas 800I A2 (8 x 64 GB) server is required for deploying the DeepSeek-R1 quantization model (A16W4). + +### 1.2 Multi-node deployment ### + +At least two Atlas 800I A2 (8 x 64 GB) servers are required for deploying the DeepSeek-R1 quantization model (W8A8). + +It is recommended that the NPU direct connection mode be used. That is, all the NPUs of the two servers are connected through the switch, and the network ports are Up. + +## 2. Obtain the model weight. ## + +| No. | Check Item | Detailed Description | +| --- | --------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| 2.A | Model weight storage space | When downloading A16W4/W8A8 weights, ensure that the storage space in the or mounted disk is greater than 400 GB or 700 GB. | +| 2.B | CPU-side memory | Ensure that the CPU memory can release the corresponding weight. For example, the W8A8 weight requires about 500 GB memory. You can run the free -h command to check the free CPU memory. Calculation method: free_mem ≥ (Weight/Number of machines) x 1.3 (This calculation method needs to be verified, but the memory must be sufficient.) | +| 2.C | Select the number of inference cards based on the weight. | W8A8 requires at least two 800I/T A2 64G. | +| 2.D | Weight Correctness Check | Ensure that the weights are correct and compare the MD5 or SHA256 values of the weights and tokenizer files with those of the source files. | + +### 2.1 Downloading Quantified Model Weights ### + +A16W4:modelers.cn[IPADS/DeepSeek-R1-A16W4 \| Magic Music Community](https://modelers.cn/models/IPADS/DeepSeek-R1-A16W4) + +W8A8:modelers.cn[MindSpore-Lab/DeepSeek-R1-W8A8 \| Magic Community](https://modelers.cn/models/MindSpore-Lab/DeepSeek-R1-W8A8) + +#### 2.1.1 Weight Placement \[Important\] #### + +Weights must be placed on all machines and placed in the same path. This path will be used as a configuration item and will be added to the configuration file of the one-click deployment script. 
For details, see section 4.1. + +For example, the model weight of the master node is stored in /home/ds/deepseek-r1 and that of the slave node is stored in /home/ds/deepseek-r1. + +## 3. Driver and firmware preparation ## + +### 3.1 Recommended Version ### + +| Component | Community edition | +| ------------------- | ----------------- | +| Ascend HDK Driver | 24.1.rc3 | +| Ascend HDK Firmware | 7.5.0.1.129 | + +#### 3.1.1 HDK Download Mode #### + +**Link for downloading the community version: https://www.hiascend.com/hardware/firmware-drivers/community?product=4&model=32&cann=8.0.0.alpha002&driver=1.0.RC2** + +The kernel version must be 5.10. Check the kernel version before installation. + +``` +#You can run the following command to obtain the driver and firmware versions in the environment: +npu-smi info -t board -i 1 | egrep -i "software|firmware" +``` + +![image-20250318153035798](./asserts/image1.png) + +Note: Before installing drivers and firmware, install the kernel-devel and kernel-headers packages in advance and ensure that the versions are the same as the server kernel versions. + +``` +#Installing kernel-devel & kernel-headers +yum install -y kernel-devel-$(uname -r) kernel-headers-$(uname -r) +``` + +#### 3.1.2 Driver and Firmware Installation #### + +If the Ascend driver and firmware are not available in the environment, perform the following steps to install the Ascend driver and firmware for the first time: + +``` +#Method 1: +#Driver installation +./Ascend-hdk--npu-driver__linux-.run --full --install-for-all +#Firmware Installation +./Ascend-hdk--npu-firmware_.run --full + +#Method 2: Download the deployment plug-in package. The script is contained in the plug-in package. For details, see section 4.1. +sh mindspore-deepseek/workspace/roles/prepare/files/lib/ascend_prepare.sh +#Restart the system after installation. +``` + +Skip this step if the Ascend driver and firmware already exist in the environment. + +## 4. Introduction to One-Click Deployment Scripts ## + +| No. | Check Item | Detailed Description | +| --- | --------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| 4.A | Weight Check | Before performing the one-click deployment, ensure that the weights obtained in section 2 are obtained. | +| 4.B | Driver Firmware Check | Before performing the one-click deployment, ensure that the Ascend HDK driver & firemare has been installed on all host machines. If not, see section 3.1.2. | +| 4.C | Networking and connectivity check | Before performing the one-click deployment, ensure the network connectivity. You are advised to check the network by referring to section 5.3. | + +Use the one-click deployment script to deploy a single-node or dual-node cluster and start the deepseek service based on the configuration. + +### 4.1 Using Deployment Scripts ### + +You are advised to run the one-click deployment script on an independent controller node. The controller node must be able to access each inference node in SSH mode. Note: In single-node deployment, you need to modify the configuration file in step 2 based on the comments. + +**Step 1: Download the oedeploy tool to the control node.** + +``` +#Download and install the OEDP tool. 
For example: +wget https://repo.oepkgs.net/openEuler/rpm/openEuler-24.03-LTS/contrib/oedp/aarch64/Packages/oedp-1.0.1-1.oe2503.aarch64.rpm + +yum localinstall oedp-* +#Download plug-in package +git clone https://gitee.com/openeuler/llm_solution.git + +cd llm_solution/script/mindspore-deepseek +``` + +**Step 2: Adjust the oedeploy configuration file.** + +``` +#Adjust the config.yaml file in the mindspore-deepseek directory. +#Note: W8a8 and int4 have different weight deployment modes and use different image tags. You can modify the image tags as follows: +(base) [root@910b-3 mindspore-deepseek]#cat config.yaml +all: + children: + masters: + hosts: + master1: + ansible_host: 1.2.3.4 #IP address of the active node + ansible_port: 22 + ansible_user: root #You must be started by the root user or have the permission. + ansible_password: "密码" #Password of the active node + + workers: #The worker node is deployed in a single-node system. The worker node is not required. Comment out or delete the worker node. + hosts: + worker1: + ansible_host: 1.2.3.5 #IP address of the slave node + ansible_port: 22 + ansible_user: root + ansible_password: "密码" #Password of the slave node + + vars: + #Container image + #If an image has been loaded to the local Docker, change the values to image_name and image_tag of the Docker image. + image_name: hub.oepkgs.net/oedeploy/openeuler/aarch64/intelligence_boom + image_tag: 0.1.0-aarch64-800I-A2-mindspore2.7-openeuler24.03-lts-sp2 + #Name of the inference container to be started + container_name: openeuler_ds #Docker name after startup, which cannot be the same as the name of an existing image. + #Model Path + model_path: /workspace/deepseekv3 #Modify the weight path based on the actual model file and weight path (for storing weight paths in section 2.2.1). + #Ports opened by the ray. + ray_port: 6379 #The script reads the port from here. + #Number of nodes. If the system is deployed in a single-node system, change the value to 1. + node_num: 2 + #Indicates whether to stop other containers before starting the service. + is_stop_other_container: 0 #0: Do not stop other containers. 1: stop other containers. + #Inference service port. + llm_port: 8000 #Ensure that the port is idle. + #NIC used by the ray cluster + ray_device: enp67s0f0np0 #run the ifconfig command to query the name of the network adapter corresponding to the IP address. + #Model Weight Type + model_type: safetensors + #Backend type + backend_type: MindFormers + #Skip the SSH check. (To disable this function, comment out the following configuration items:) + ansible_ssh_common_args: '-o StrictHostKeyChecking=no' +``` + +Start the master node as the master node and the worker node as the slave node. Before deploying the master node, change the IP address, user name, and password of the corresponding node. + +If the controller node can communicate with the two nodes using keys, the ansible_password variable can be left empty. + +Ensure that the value of node_num is the same as the number of configured IP addresses. + +Ensure that the NICs used on each node are the same and are configured in the ray_device variable. + +In the configuration file, ray_device indicates the name of the network adapter used by the ray cluster. You can run the ifconfig command to view the name. 
For example: + +![Image](./asserts/image2.png) + +**Step 3: Run the one-click deployment script.** + +``` +#Run the following command in the mindspore-deepseek directory: +oedp run install +``` + +Note: During the one-click deployment process, the container image will be pulled from the network and the Docker command will be installed. If the network cannot be connected, You can install Docker in advance and load the corresponding Docker image to the server. During the running of the deployment script, if the same image already exists, the step of pulling the image from the network is skipped. + +## 5. Semi-automated on-demand deployment ## + +In addition to using the one-click deployment script, you can also manually execute step-by-step scripts. + +You need to transfer the workspace/roles/prepare/files/lib directory in the mindspore-deepseek directory to all inference nodes. + +Before transmitting the script, modify the config.cfg configuration file. + +### 5.1 Modifying the Configuration File ### + +``` +#Copy the template config. The example is stored in the lib folder. +cp example_config config.cfg +#Modify the config.cfg file. +``` + +Note that the name of the copied file must be config.cfg. + +### 5.2 Loading the Inference Image ### + +``` +#Run the container pulling script on all nodes. The script contains two steps: container image pulling and container starting. +./lib/start_docker.sh +``` + +Before pulling a container, the environment is checked. If an image with the same name exists in the environment, the container will not be pulled. + +Before starting a container, the system checks the environment. If a container with the same name exists in the environment (whether in the running or stopped state), the existing container is used. + +### 5.3 Check the network. (Skip this step in single-node deployment.) ### + +Invoke the network detection script on each node. + +``` +./lib/net_check.sh +``` + +This script checks the link status and network health status of the npu card on each node. + +You can also perform the following steps: + +#### 5.3.1 Checking the Network Status of the Host #### + +**Step 1: Check the physical connection.** + +``` +for i in {0..7}; do hccn_tool -i $i -lldp -g | grep Ifname; done +``` + +![Image](./asserts/image3.png) + +**Step 2: Check the connection.** + +``` +for i in {0..7}; do hccn_tool -i $i -link -g ; done +``` + +![Image](./asserts/image4.png) + +**Step 3: Obtain the IP address of each card.** + +``` +for i in {0..7};do hccn_tool -i $i -ip -g; done +``` + +![Image](./asserts/image5.png) + +If the NPU has no IP address, run the following command to configure the IP address and netmask: Note that the IP addresses of the NPUs on the primary/secondary node must be different. + +``` +#Primary node +for i in {0..7} +do +hccn_tool -i $i -ip -s address 10.10.0.1$i netmask 255.255.0.0 +done +#Slave node +for i in {0..7} +do +hccn_tool -i $i -ip -s address 10.10.0.2$i netmask 255.255.0.0 +done +``` + +You can also use the auxiliary script npu_net_config_simple.sh in the plug-in package to configure the IP address. + +``` +cd lib +sh npu_set_config_simple.sh 1 #Primary node +sh npu_set_config_simple.sh 2 #Slave node +``` + +If only one host A has an IP address, refer to the existing IP network segment and netmask configuration commands. Do not configure the IP address. + +**Step 4: Check the TLS behavior consistency at the bottom layer of the NPU. The value must be the same on each host. 
It is recommended that the value be all 0.** + +``` +for i in {0..7}; do hccn_tool -i $i -tls -g ; done | grep switch +``` + +![Image](./asserts/image6.png) + +\*\*Step 5: \*\*Set the TLS verification behavior at the bottom layer of the NPU to 0. + +``` +for i in {0..7};do hccn_tool -i $i -tls -s enable 0;done +``` + +![Image](./asserts/image7.png) + +Run the npu-smi info command to check whether the NIC status is OK. If the NIC status is abnormal, reset the NPU. + +![Image](./asserts/image8.png) + +To reset the NPU, run the following command: + +``` +npu-smi set -t reset -i $id -c $chip_id +``` + +#### 5.3.2 Checking Interconnection Between Hosts \[Host Machine\] #### + +To check the interconnection between hosts, ping the IP addresses of the NP cards of other hosts by using each NP card on the local host. If the IP addresses can be pinged, the connection is normal. + +Scripts can be used to check machine interconnection. + +``` +./net_check.sh --check-connection +``` + +### 5.4 Environment Configuration ### + +This step is performed in the container and must be performed on all nodes. + +**Step 1: Use the environment configuration script to set environment variables and adjust configuration files by one click.** + +``` +#This script needs to be executed on all nodes. The script modification includes the steps of modifying configuration files and setting environment variables. +./lib/set_env.sh +``` + +When the script is executed, environment variables are written to /root/.bashrc. If the openeuler_deepseek_env_config field exists in the file, the environment variables already exist and the environment variable setting process is skipped. + +### 5.5 Ray Cluster Start (Skip this step for single-node deployment.) ### + +Perform this step in the container and on all nodes. + +**Step 1: Use the ray startup script to start the ray by one click and set MASTER_IP to the IP address of the primary node.** + +``` +#Run the following command on the active node: +./lib/ray_start.sh +#Run the following command after the slave node: +./lib/ray_start.sh $MASTER_IP +``` + +Before running the script, ensure that the port corresponding to the RAY_PORT item in the config.cfg configuration file on the active node is idle. + +### 5.6 Service Startup ### + +This step is performed in the container and on the primary/secondary node. + +**Step 1: Before using the config.cfg file, ensure that the configuration file is complete.** + +``` +#Log generated when the active node executes the script to start the service, which is stored in the ds.log file. +./lib/start_ds.sh $MASTER_IP +#Logs about running the script on the slave node to start the service, which is stored in the ds.log file. +./lib/start_ds.sh $MASTER_IP 1 +``` + +## 6. Servitization test ## + +### 6.1 Using the Benchmark Test \[In Container\] ### + +Use the ascend-vllm performance test tool. + +``` +python benchmark_parallel.py --backend openai --host [主节点IP] --port [服务端口] --tokenizer [模型路径] --epochs 1 --parallel-num 192 --prompt-tokens 256 --output-tokens 256 +``` + +Note: The model path specified by --tokenizer must be the same as model_path when the inference service is started. + +You can also use the vllm open-source performance test tool. + +https://github.com/vllm-project/vllm/tree/main/benchmarks + +### 6.2 Use the curl request test. ### + +Deepseek-R1 request example (Reconfigure the request port based on the configured LLM_PORT variable.) 
+ +``` +curl http://localhost:$LLM_PORT/v1/completions -H "Content-Type: application/json" -d '{"model": "'$MODEL_PATH'", "prompt": "You are a helpful assistant.<|User|> Categorizes text as neutral, negative, or positive. \nText: I think this vacation was ok. \nEmotion: <|Assistant|>\n", "max_tokens": 512, "temperature": 0, "top_p": 1.0, "top_k": 1, "repetition_penalty":1.0}' +``` + +Result: + +## 7. FAQ ## + +### 7.1 Decompression fail Is Displayed When the Ascend Driver Is Installed ### + +The tar command does not exist. You need to install the tar package. + +### 7.2 When the oedp run install command is run, an error message is displayed, indicating that the library cannot be found. ### + +> ![C:\\Users\\l00646674\\AppData\\Roaming\\eSpace_Desktop\\UserData\\l00646674\\imagefiles\\originalImgfiles\\5CFB8744-544E-424F-AEF0-75098B97D8AB.png](./asserts/image9.png) + +Oedp compatibility bug. The cause is that multiple py files are installed in the environment, which interferes with each other. You are advised to install Oedp in the environment with only one python. + +Alternatively, modify the /usr/bin/oedp file and change the python3 version to the specific python version. + +### 7.3 Insufficient Space When Running Deployment ### + +> ![Image](./asserts/image10.png) + +Ensure that the root directory of each node has sufficient space. + +### 7.4 Display CUDA Paged Attrntion kernel only supports block sizes up to 32. during model inference ### + +Backward compatibility restriction of MindSpore: The block size was limited to 32 in earlier versions. This restriction has been removed since March 30, 25. You are advised to use the container image whose tag time is later than 250330. + +## Appendixes ## + +### Description of environment variables: ### + +| **Environment variables** | **Function Description** | +| --------------------------------------------------- | --------------------------------------------------- | +| MS_ENABLE_LCCL=off | Disable the multi-device lccl. | +| HCCL_OP_EXPANSION_MODE=AIV | Communication delivery optimization | +| vLLM_MODEL_BACKEND=MindFormers | Specify the use of mindformers backend | +| vLLM_MODEL_MEMORY_USE_GB=50 | Performance Optimization Related | +| MS_DEV_RUNTIME_CONF="parallel_dispatch_kernel:True" | Performance-related | +| MS_ALLOC_CONF="enable_vmm:False" | Telecom-specific | +| ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 | Specify the available ascending cards | +| GLOO_SOCKET_IFNAME=Network adapter name | Required by the ray network | +| TP_SOCKET_IFNAME=Network adapter name | Required by the ray network | +| MindFormers_MODEL_CONFIG=yaml path | Specifies the model YAML to be used by mindformers. | +| HCCL_CONNECT_TIMEOUT=7200 | Setting the hccl timeout period | + +### Directory structure of the Mindspore-deepseek plug-in package ### + +``` +[root@localhost mindspore-deepseek]#tree -l +. 
+├── config.yaml +├── main.yaml +├── README.md +└── workspace + ├── install.yml + └── roles + ├── prepare + │ ├── files + │ │ ├── lib + │ │ │ ├── ascend_prepare.sh + │ │ │ ├── example_config + │ │ │ ├── fine-grained-bind-cann.py + │ │ │ ├── net_check.sh + │ │ │ ├── npu_net_config_simple.sh + │ │ │ ├── ray_start.sh + │ │ │ ├── set_env.sh + │ │ │ ├── start_docker.sh + │ │ │ └── start_ds.sh + │ │ └── prepare.sh + │ ├── tasks + │ │ └── main.yml + │ └── templates + │ └── config.cfg.j2 + └── start + ├── deepseek + │ └── tasks + │ └── main.yml + ├── ray-master + │ └── tasks + │ └── main.yml + └── ray-worker + └── tasks + └── main.yml +14 directories, 19 files +``` + diff --git a/script/mindspore-intelligence/README_en.md b/script/mindspore-intelligence/README_en.md new file mode 100644 index 0000000..d9fe142 --- /dev/null +++ b/script/mindspore-intelligence/README_en.md @@ -0,0 +1,30 @@ +# Script for deploying MindSpore-DeepSeek. # + +This deployment script is used to automatically deploy VLM+Mindspore DeepSeek R1 and V3. Scripts are written in Python and cover the entire process from image pulling, dependency installation, container deployment, and service starting. The scripts are designed to simplify the deployment process and improve the deployment efficiency and accuracy. + +### 1. Environmental requirements ### + +1. **Operating system: OpenEuler22.03 LTS SP4 or later** +2. **Software dependency:** + + * `docker`Image download and container management; + * `Python3`\: used for script execution. + * `oedp`\: quick application installation and deployment platform; + +### 2. Script execution ### + +Configure the config.yaml file by referring to the DeepSeekV3&R1 Deployment Guide-Chapter 4 and run the following command: + +``` +oedp run install +``` + +### 3. FAQ ### + +1. The downloaded weight is in the CKPT format. By default, the weight in the script is in the safetensor format. How to modify the weight? + +``` +#Modifying the Weight Type of the config.yaml Model +model_type: ckpt +``` + -- Gitee