From 09502b1072668e2a708602394e9899b042793387 Mon Sep 17 00:00:00 2001
From: liu lili
Date: Fri, 29 Aug 2025 14:09:05 +0800
Subject: [PATCH 1/6] lll: solve parallel docs

---
 .../model_infer/ms_infer/ms_infer_parallel_infer.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tutorials/source_zh_cn/model_infer/ms_infer/ms_infer_parallel_infer.md b/tutorials/source_zh_cn/model_infer/ms_infer/ms_infer_parallel_infer.md
index 65b08ca795..7d9b513b1d 100644
--- a/tutorials/source_zh_cn/model_infer/ms_infer/ms_infer_parallel_infer.md
+++ b/tutorials/source_zh_cn/model_infer/ms_infer/ms_infer_parallel_infer.md
@@ -12,7 +12,7 @@
 在对模型进行并行切分前,需要先根据模型的结构特征来进行并行分析,确认网络中哪些层可以并行,以及如何切分能够获得比较好的性能加速。为了要能够获得好的加速效果,并行切分的部分就需要尽可能的独立计算互不影响。以Qwen2模型结构为例,我们对其主要的网络结构进行并行分析:
 
-- **Embedding**:Embedding层实际上是一个gather操作,不管是按hidden_dim还是num_embeddings维度切分,都可以比较好地进行并行计算。由于按照num_embedding可以更好地进行all_reduce(减少数据排布的开销),此处我们按照num_embeddings维度进行切分。
+- **Embedding**:Embedding层实际上是一个gather操作,不管是按hidden_dim还是num_embeddings维度切分,都可以比较好地进行并行计算。由于按照num_embeddings可以更好地进行all_reduce(减少数据排布的开销),此处我们按照num_embeddings维度进行切分。
 
 - **Attention**:Qwen2模型使用了GQA的Attention计算方法,即有多个独立的Attention计算,因此我们可以按照列维度将query、key、value切分开来单独计算,但是需要保证切分能够被Attention的head数整除。
 
@@ -937,7 +937,7 @@ class Qwen2ForCausalLM(nn.Cell):
         for path in glob(weight_path + "/*.safetensors"):
             weight_dict.update(ms.load_checkpoint(path, format="safetensors"))
 
-- ms.load_param_into_net(self, weight_dict, strict_load=False)
+- load_param_into_net(self, weight_dict, strict_load=False)
 + param_dict = self.parameters_dict()
 +
 + for (name, weight) in weight_dict.items():
--
Gitee

From f066b626f68cae75d1e66ca403d53e90e6122da1 Mon Sep 17 00:00:00 2001
From: liu lili
Date: Fri, 29 Aug 2025 14:11:28 +0800
Subject: [PATCH 2/6] lll: solve parallel docs

---
 .../model_infer/ms_infer/ms_infer_parallel_infer.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/tutorials/source_en/model_infer/ms_infer/ms_infer_parallel_infer.md b/tutorials/source_en/model_infer/ms_infer/ms_infer_parallel_infer.md
index 27832555e0..74bfbce8d1 100644
--- a/tutorials/source_en/model_infer/ms_infer/ms_infer_parallel_infer.md
+++ b/tutorials/source_en/model_infer/ms_infer/ms_infer_parallel_infer.md
@@ -12,7 +12,7 @@ The pressure on graphics memory makes it challenging for a single device to comp
 Before performing model sharding and parallelism, you need to analyze the parallelism based on the model structure to determine which layers can be parallelized and how to divide the model to achieve better performance acceleration. To achieve better acceleration, the parallelized part needs to be computed separately, minimizing the impact on other parts. The following uses the Qwen2 model structure as an example to analyze the parallelism of the main network structure:
 
-- **Embedding**: The embedding layer is actually a gather operation and can be parallelized properly regardless of the sharding dimension (**hidden_dim** or **num_embeddings**). Because **all_reduce** (reducing overheads of data arrangement) can be better performed based on **num_embedding**, sharding is performed based on the **num_embeddings** dimension.
+- **Embedding**: The embedding layer is actually a gather operation and can be parallelized properly regardless of the sharding dimension (**hidden_dim** or **num_embeddings**). Because **all_reduce** (reducing overheads of data arrangement) can be better performed based on **num_embeddings**, sharding is performed based on the **num_embeddings** dimension.
 
 - **Attention**: The Qwen2 model uses the attention computation method of GQA, that is, multiple independent attention computations. Therefore, the query, key, and value can be parallelized separately by column. However, the number of shards must be exactly divided by the number of attention heads.
 
@@ -739,7 +739,7 @@ class Qwen2MLP(nn.Cell):
             param_dtype=config.param_dtype,
             bias=False
         )
-- self.qgate_proj = Qwen2Linear(
+- self.gate_proj = Qwen2Linear(
 + self.gate_proj = Qwen2ColParallelLinear(
         input_size=config.hidden_size,
         output_size=config.intermediate_size,
@@ -923,7 +923,7 @@ class Qwen2ForCausalLM(nn.Cell):
         for path in glob(weight_path + "/*.safetensors"):
             weight_dict.update(ms.load_checkpoint(path, format="safetensors"))
 
-- ms.load_param_into_net(self, weight_dict, strict_load=False)
+- load_param_into_net(self, weight_dict, strict_load=False)
 + param_dict = self.parameters_dict()
 +
 + for (name, weight) in weight_dict.items():
--
Gitee

From e723c9795e5bc7ea2dd9ecd38f19d9be45ecaedc Mon Sep 17 00:00:00 2001
From: liu lili
Date: Fri, 29 Aug 2025 14:27:51 +0800
Subject: [PATCH 3/6] lll: solve model infer docs

---
 .../model_infer/ms_infer/ms_infer_model_infer.rst    | 4 ++--
 .../model_infer/ms_infer/ms_infer_network_develop.md | 2 +-
 .../model_infer/ms_infer/ms_infer_model_infer.rst    | 9 +++++----
 .../model_infer/ms_infer/ms_infer_network_develop.md | 2 +-
 4 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/tutorials/source_en/model_infer/ms_infer/ms_infer_model_infer.rst b/tutorials/source_en/model_infer/ms_infer/ms_infer_model_infer.rst
index 95d2ab72b7..a3b991a905 100644
--- a/tutorials/source_en/model_infer/ms_infer/ms_infer_model_infer.rst
+++ b/tutorials/source_en/model_infer/ms_infer/ms_infer_model_infer.rst
@@ -123,7 +123,7 @@ MindSpore LLM inference with the framework mainly depends on the MindSpore open-
 
 You can also install the Python package adapted to your environment by referring to the official installation document. For details, see `MindSpore Installation `_.
 
-MindSpore inference mainly runs on the Ascend AI Processor environment. You need to install the corresponding Ascend development environment. For details, see the following:
+MindSpore inference mainly runs on the Ascend AI Processor environment. You need to install the corresponding Ascend development environment. For details, see `CANN Software Installation `_:
 
 .. code:: shell
 
@@ -418,7 +418,7 @@ The MindSpore LLM supports the following quantization technologies to improve th
 
 To quantize a model using golden-stick, perform the following steps:
 
-1. **Weight quantization**: Use a quantization algorithm to convert the model weight data from float16 to int8.
+1. **Model quantization**: Use a quantization algorithm to convert the model data type from a high-bit type (e.g., float16) to a low-bit type (e.g., int8 or int4).
 
 2. **Model inference**: Load the standard model, quantize the model network (by inserting corresponding quantization operators), load the quantized weight, and call the model inference.
diff --git a/tutorials/source_en/model_infer/ms_infer/ms_infer_network_develop.md b/tutorials/source_en/model_infer/ms_infer/ms_infer_network_develop.md
index 86ca7cb2ac..a2c2d1d7b7 100644
--- a/tutorials/source_en/model_infer/ms_infer/ms_infer_network_develop.md
+++ b/tutorials/source_en/model_infer/ms_infer/ms_infer_network_develop.md
@@ -606,7 +606,7 @@ class Qwen2Model(nn.Cell):
 
 ### KVCacheManager
 
-Since KVCache is usually used to optimize LLMs, to use KVCache with FlashAttention and lashPagedAttention provided by MindSpore, some parameters need to be specified additionally, including:
+Since KVCache is usually used to optimize LLMs, to use KVCache with FlashAttention and PagedAttention provided by MindSpore, some parameters need to be specified additionally, including:
 
 - **k_cache & v_cache**: The kv_cache object can be considered as a cache table, which is used to store the keys and values in the previous iteration. In the next iteration, these values can be directly read, avoiding repeated computation of the keys and values of the first *n* words, thereby improving performance.
 
diff --git a/tutorials/source_zh_cn/model_infer/ms_infer/ms_infer_model_infer.rst b/tutorials/source_zh_cn/model_infer/ms_infer/ms_infer_model_infer.rst
index c614dcf231..42d08aef6b 100644
--- a/tutorials/source_zh_cn/model_infer/ms_infer/ms_infer_model_infer.rst
+++ b/tutorials/source_zh_cn/model_infer/ms_infer/ms_infer_model_infer.rst
@@ -123,7 +123,7 @@ MindSpore大语言模型带框架推理主要依赖MindSpore开源软件,用
 
 同时,用户也可以参考官方安装文档来安装自己环境适配的Python包,具体见 `MindSpore安装 `_。
 
-由于MindSpore推理主要支持Ascend芯片环境上运行,还需要安装相应的Ascend开发环境,具体可以参考:
+由于MindSpore推理主要支持Ascend芯片环境上运行,还需要安装相应的Ascend开发环境,具体可以参考 `快速安装CANN `:
 
 .. code:: shell
 
@@ -157,6 +157,7 @@ MindSpore大语言模型带框架推理主要依赖MindSpore开源软件,用
 
     ls
     |- config.json
+    |- generation_config.json
     |- LICENSE
     |- merges.txt
     |- model-00001-of-00004.safetensors
@@ -401,7 +402,7 @@
    2048]矩阵乘法,可以切分成2个[1024, 4096]和[4096, 1024]的矩阵乘法。而不同的切分可能带来不同的并行计算性能。对于Qwen、LLAMA这类大语言模型而言,其切分主要包含在Attention中query、key、value这些数据的linear操作上。
 
-2. **权重适配**:除了模型结构的并行化改造外,由于模型计算中的权重也被切分了,因此在模型加载的时候,相关的权重也要进行切分,以尽量减少不必要权重加载占用显存。对于大语言模型而言,主要的权重都集中在embbeding和linear两个网络层中,因此权重加载的适配主要涉及这两个模块改造。
+2. **权重适配**:除了模型结构的并行化改造外,由于模型计算中的权重也被切分了,因此在模型加载的时候,相关的权重也要进行切分,以尽量减少不必要权重加载占用显存。对于大语言模型而言,主要的权重都集中在embedding和linear两个网络层中,因此权重加载的适配主要涉及这两个模块改造。
 
 3. **模型推理**:和单卡推理不同,多卡推理需要同时启动多个进程来并行进行推理,因此在启动模型推理时,相比于直接运行脚本,多卡推理需要一次运行多组相关进程。MindSpore框架为用户提供了msrun的并行运行工具,具体使用方法可以参考 `构建可并行的大语言模型网络 <./ms_infer_parallel_infer.md>`_。
 
@@ -416,9 +417,9 @@ MindSpore大语言模型支持以下量化技术,来提升模型推理性能
 
 - **KVCache量化**:在大语言模型推理场景下,除了模型权重以外,KVCache也占用了大量显存,因此对KVCache进行量化,降低其显存消耗,也能够有效提升整体的吞吐量。MindSpore大语言模型支持对KVCache做float16到int8的量化,通过FA和PA适配,将量化和反量化融合到算子内部,降低量化带来的开销,实现整体吞吐量提升。
 
-使用golen-stick进行模型量化主要分为以下两步:
+使用golden-stick进行模型量化主要分为以下两步:
 
-1. **权重量化**:利用量化算法,将模型的权重数据从float16转化成int8数据。
+1. **模型量化**:利用量化算法,将模型的数据类型从高bit类型(如float16)转化成低bit类型(如int8或int4)。
 
 2. **模型推理**:加载标准模型,将模型网络进行量化改造(插入相应量化算子),加载量化后的权重,调用模型推理。
diff --git a/tutorials/source_zh_cn/model_infer/ms_infer/ms_infer_network_develop.md b/tutorials/source_zh_cn/model_infer/ms_infer/ms_infer_network_develop.md
index a6803e42f4..9bd6a3f856 100644
--- a/tutorials/source_zh_cn/model_infer/ms_infer/ms_infer_network_develop.md
+++ b/tutorials/source_zh_cn/model_infer/ms_infer/ms_infer_network_develop.md
@@ -606,7 +606,7 @@ class Qwen2Model(nn.Cell):
 
 ### KVCacheManager
 
-由于大语言模型通常会使用KVCache优化,MindSpore提供的FlashAttention和lashPagedAttention需要和KVCache配合使用,需要额外传入一些参数,其中主要包括:
+由于大语言模型通常会使用KVCache优化,MindSpore提供的FlashAttention和PagedAttention需要和KVCache配合使用,需要额外传入一些参数,其中主要包括:
 
 - **k_cache & v_cache**:kv_cache对象可以理解为是一个缓存表,用于保存上一次迭代中的key和value值。在下一次迭代时,可以直接读取这些值,从而避免重复计算前n个词的key和value,以提升性能。
 
--
Gitee

From 4ef979a20f5d34902683180a1ddd0a7c4c2acff6 Mon Sep 17 00:00:00 2001
From: liu lili
Date: Fri, 29 Aug 2025 15:15:48 +0800
Subject: [PATCH 4/6] lll: solve model infer docs

---
 .../model_infer/ms_infer/ms_infer_parallel_infer.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tutorials/source_zh_cn/model_infer/ms_infer/ms_infer_parallel_infer.md b/tutorials/source_zh_cn/model_infer/ms_infer/ms_infer_parallel_infer.md
index 7d9b513b1d..62e4f0ea6c 100644
--- a/tutorials/source_zh_cn/model_infer/ms_infer/ms_infer_parallel_infer.md
+++ b/tutorials/source_zh_cn/model_infer/ms_infer/ms_infer_parallel_infer.md
@@ -751,7 +751,7 @@ class Qwen2MLP(nn.Cell):
             param_dtype=config.param_dtype,
             bias=False
         )
-- self.qgate_proj = Qwen2Linear(
+- self.gate_proj = Qwen2Linear(
 + self.gate_proj = Qwen2ColParallelLinear(
         input_size=config.hidden_size,
         output_size=config.intermediate_size,
--
Gitee

From b318c45f41ca8b70764539b0545db567ad22cc51 Mon Sep 17 00:00:00 2001
From: TingWang
Date: Fri, 29 Aug 2025 08:37:34 +0000
Subject: [PATCH 5/6] Update
 tutorials/source_zh_cn/model_infer/ms_infer/ms_infer_model_infer.rst

accept
---
 .../source_zh_cn/model_infer/ms_infer/ms_infer_model_infer.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tutorials/source_zh_cn/model_infer/ms_infer/ms_infer_model_infer.rst b/tutorials/source_zh_cn/model_infer/ms_infer/ms_infer_model_infer.rst
index 42d08aef6b..d1ce35d661 100644
--- a/tutorials/source_zh_cn/model_infer/ms_infer/ms_infer_model_infer.rst
+++ b/tutorials/source_zh_cn/model_infer/ms_infer/ms_infer_model_infer.rst
@@ -123,7 +123,7 @@ MindSpore大语言模型带框架推理主要依赖MindSpore开源软件,用
 
 同时,用户也可以参考官方安装文档来安装自己环境适配的Python包,具体见 `MindSpore安装 `_。
 
-由于MindSpore推理主要支持Ascend芯片环境上运行,还需要安装相应的Ascend开发环境,具体可以参考 `快速安装CANN `:
+由于MindSpore推理主要支持Ascend芯片环境上运行,还需要安装相应的Ascend开发环境,具体可以参考 `快速安装CANN `_ :
 
 .. code:: shell
 
--
Gitee

From fe0c0746f24c382cbb7b115dd2b3aaf58aca5b41 Mon Sep 17 00:00:00 2001
From: liu lili
Date: Fri, 29 Aug 2025 16:40:57 +0800
Subject: [PATCH 6/6] lll: solve model infer docs

---
 .../source_en/model_infer/ms_infer/ms_infer_model_infer.rst | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tutorials/source_en/model_infer/ms_infer/ms_infer_model_infer.rst b/tutorials/source_en/model_infer/ms_infer/ms_infer_model_infer.rst
index a3b991a905..b3978f6a16 100644
--- a/tutorials/source_en/model_infer/ms_infer/ms_infer_model_infer.rst
+++ b/tutorials/source_en/model_infer/ms_infer/ms_infer_model_infer.rst
@@ -157,6 +157,7 @@ After the download is complete, the following file tree structure should be disp
 
     ls
     |- config.json
+    |- generation_config.json
     |- LICENSE
     |- merges.txt
     |- model-00001-of-00004.safetensors
--
Gitee
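
The weight-loading hunks in patches 1 and 2 stop at `+ for (name, weight) in weight_dict.items():`, so the per-parameter loop that replaces the single `load_param_into_net` call is not visible in the diffs themselves. Below is a minimal sketch of that pattern, assuming column-parallel weights are sliced along dimension 0 for the local rank; `load_sharded_weights` and `col_parallel_suffixes` are illustrative names, not identifiers from the tutorial, and only `ms.load_checkpoint`, `parameters_dict`, `set_data`, `get_rank`, and `get_group_size` are actual MindSpore APIs.

```python
# Sketch only: mirrors the pattern the hunks above move toward, not the
# tutorial's exact code. Assumes mindspore.communication.init() has been
# called (e.g., the process was launched with msrun).
from glob import glob

import mindspore as ms
from mindspore.communication import get_group_size, get_rank


def load_sharded_weights(net, weight_path, col_parallel_suffixes=("gate_proj.weight",)):
    """Match each checkpoint weight to a parameter of `net`, slicing the
    weights of column-parallel layers for the local rank before set_data."""
    rank, world_size = get_rank(), get_group_size()

    # Merge all safetensors shards into one name -> weight dict,
    # exactly as the context lines of the hunks do.
    weight_dict = {}
    for path in glob(weight_path + "/*.safetensors"):
        weight_dict.update(ms.load_checkpoint(path, format="safetensors"))

    param_dict = net.parameters_dict()
    for name, weight in weight_dict.items():
        if name not in param_dict:
            continue  # weight renamed, tied, or unused by this network
        data = weight.asnumpy()
        if any(name.endswith(suffix) for suffix in col_parallel_suffixes):
            # Column-parallel linear layers keep only this rank's rows:
            # the output dimension is split evenly across the devices.
            shard = data.shape[0] // world_size
            data = data[rank * shard:(rank + 1) * shard]
        param = param_dict[name]
        param.set_data(ms.Tensor(data, dtype=param.dtype))
```

A row-parallel layer would slice dimension 1 instead and pair its matmul with an all_reduce, matching the Embedding and MLP sharding analysis in the tutorials these patches edit.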