From 98eff8d829ccccf85f99d27cb7adc5db2745eb07 Mon Sep 17 00:00:00 2001
From: Yule100
Date: Wed, 20 Aug 2025 15:23:09 +0800
Subject: [PATCH] inference yaml_config_inference

---
 .../yaml_config_inference.md                 | 66 +++++++++++++++++++
 .../example/yaml/inference_template.yaml     | 36 ++++++++++
 .../docs/source_en/guide/inference.md        |  8 ++-
 docs/mindformers/docs/source_en/index.rst    |  2 +
 .../yaml_config_inference.md                 | 66 +++++++++++++++++++
 .../example/yaml/inference_template.yaml     | 36 ++++++++++
 .../docs/source_zh_cn/guide/inference.md     |  8 ++-
 docs/mindformers/docs/source_zh_cn/index.rst |  2 +
 8 files changed, 220 insertions(+), 4 deletions(-)
 create mode 100644 docs/mindformers/docs/source_en/advanced_development/yaml_config_inference.md
 create mode 100644 docs/mindformers/docs/source_en/example/yaml/inference_template.yaml
 create mode 100644 docs/mindformers/docs/source_zh_cn/advanced_development/yaml_config_inference.md
 create mode 100644 docs/mindformers/docs/source_zh_cn/example/yaml/inference_template.yaml

diff --git a/docs/mindformers/docs/source_en/advanced_development/yaml_config_inference.md b/docs/mindformers/docs/source_en/advanced_development/yaml_config_inference.md
new file mode 100644
index 0000000000..066114ab17
--- /dev/null
+++ b/docs/mindformers/docs/source_en/advanced_development/yaml_config_inference.md
@@ -0,0 +1,66 @@
+# Guide to Using the Inference Configuration Template
+
+[![View Source On Gitee](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source_en.svg)](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_en/advanced_development/yaml_config_inference.md)
+
+## Overview
+
+Mcore-architecture models can now read a Hugging Face model directory to instantiate the model during inference. MindSpore Transformers has therefore streamlined the model YAML configuration files: instead of a separate YAML file for each model and each specification, they are unified into a single YAML configuration template. For online inference with models of different specifications, you only need to apply the configuration template, point it at the model directory downloaded from Hugging Face or ModelScope, and modify a few necessary configurations.
+
+## Usage Method
+
+When running inference with the configuration template, some of its configurations need to be modified according to the actual situation.
+
+### Configurations that Must Be Modified (Required)
+
+The configuration template contains no model configuration; it instantiates the model by reading the model configuration from Hugging Face or ModelScope. Therefore, the following configuration must be modified:
+
+| Configuration Item | Configuration Description | Modification Method |
+|----|----|--------|
+|pretrained_model_dir|Path to the model directory|Change it to the folder path of the model files downloaded from Hugging Face or ModelScope.|
+
+### Optional Scenario-Based Configurations (Optional)
+
+The following usage scenarios require modifications to some configurations:
+
+#### Default Scenario (single card, 64GB device memory)
+
+The inference configuration template defaults to a single-card, 64GB device memory scenario, in which no additional configuration changes are needed. Note that if the model is too large for the memory of a single card, multi-card inference is required.
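+
+For instance, a minimal sketch of this default scenario changes only the required item and leaves the rest of the template untouched (the path below is a hypothetical download location, not a fixed convention):
+
+```yaml
+# Default single-card scenario: only the model directory needs to be set.
+# Replace the placeholder path with wherever you downloaded the model.
+pretrained_model_dir: "/data/models/Qwen3-8B"
+```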
+
+#### Distributed Scenario
+
+Distributed multi-card inference requires enabling the parallel switch in the configuration and adjusting the model parallel strategy. The configurations to modify are as follows:
+
+| Configuration Item | Configuration Description | Modification Method |
+|----|----|--------|
+|use_parallel|Parallel switch|Must be set to True for distributed inference|
+|parallel_config|Parallel strategy|Currently, online inference only supports model parallelism. Set model_parallel to the number of cards used|
+
+#### Scenarios with Other Device Memory Specifications
+
+On devices without 64GB of device memory (on-chip memory), the maximum memory that MindSpore may occupy needs to be adjusted. The configurations to modify are as follows:
+
+| Configuration Item | Configuration Description | Modification Method |
+|----|----|--------|
+|max_device_memory|The maximum device memory that MindSpore can occupy|Part of the memory must be reserved for communication. In general, configure less than 60GB on 64GB devices and less than 30GB on 32GB devices. With a large number of cards, it may need to be reduced further according to the actual situation.|
+
+## Usage Example
+
+MindSpore Transformers provides a YAML configuration template for the Qwen3 series models, [predict_qwen3.yaml](https://gitee.com/mindspore/mindformers/blob/master/configs/qwen3/predict_qwen3.yaml). Qwen3 models of different specifications can run inference tasks with this template after modifying the relevant configurations.
+
+Taking Qwen3-32B as an example, the inference YAML is modified as follows:
+
+1. Change pretrained_model_dir to the folder path of the Qwen3-32B model files:
+
+    ```yaml
+    pretrained_model_dir: "path/to/Qwen3-32B"
+    ```
+
+2. Qwen3-32B requires at least 4 cards, so the parallel configuration needs to be modified:
+
+    ```yaml
+    use_parallel: True
+    parallel_config:
+      model_parallel: 4
+    ```
+
+For the subsequent steps to run the inference task, refer to [Qwen3's README](https://gitee.com/mindspore/mindformers/blob/master/configs/qwen3/README.md#%E6%8E%A8%E7%90%86%E6%A0%B7%E4%BE%8B).
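+
+For reference, a 4-card launch with the modified YAML typically goes through the msrun launcher shipped with MindSpore Transformers. The invocation below is only a sketch; treat the flag names and prompt text as assumptions, and check the README above for the authoritative command:
+
+```shell
+# Hypothetical 4-card launch: run_mindformer.py reads the modified YAML,
+# and the trailing "4" is the number of devices passed to the launcher.
+bash scripts/msrun_launcher.sh "run_mindformer.py \
+ --config configs/qwen3/predict_qwen3.yaml \
+ --run_mode predict \
+ --predict_data 'Give me a short introduction to large language models.'" 4
+```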
diff --git a/docs/mindformers/docs/source_en/example/yaml/inference_template.yaml b/docs/mindformers/docs/source_en/example/yaml/inference_template.yaml
new file mode 100644
index 0000000000..23093e4fec
--- /dev/null
+++ b/docs/mindformers/docs/source_en/example/yaml/inference_template.yaml
@@ -0,0 +1,36 @@
+use_legacy: False  # Control whether to use the old architecture
+
+# HuggingFace file directory
+pretrained_model_dir: '/path/hf_dir'
+model:
+  model_config:
+    compute_dtype: "bfloat16"  # Linear layer compute dtype
+    layernorm_compute_dtype: "bfloat16"  # LayerNorm compute dtype
+    softmax_compute_dtype: "float32"  # Data type for computing softmax during attention computation
+    rotary_dtype: "bfloat16"  # Rotary position embedding compute dtype
+    params_dtype: "bfloat16"  # Data type for initializing parameters such as weights
+
+use_parallel: False  # Whether to enable parallel mode
+parallel_config:
+  data_parallel: 1  # Set the degree of data parallelism
+  model_parallel: 1  # Set the degree of model parallelism
+
+# mindspore context init config
+context:
+  mode: 0  # 0--Graph Mode; 1--Pynative Mode
+  max_device_memory: "59GB"  # Set the maximum memory available to the device in the format "xxGB"
+  device_id: 0  # Set the execution device ID
+  device_target: "Ascend"  # Set the backend execution device
+
+run_mode: 'predict'  # Set the running mode of the model: train, finetune, eval or predict
+seed: 0  # Set the global seed
+output_dir: './output'  # Set the path where checkpoint, log, strategy, etc. files are saved
+load_checkpoint: ''  # File or folder path of the weights to load
+load_ckpt_format: "safetensors"  # The format of the loaded checkpoint, either ckpt or safetensors
+
+# parallel context config
+parallel:
+  parallel_mode: "MANUAL_PARALLEL"  # Set parallel mode
+
+trainer:  # trainer config
+  type: CausalLanguageModelingTrainer
\ No newline at end of file
diff --git a/docs/mindformers/docs/source_en/guide/inference.md b/docs/mindformers/docs/source_en/guide/inference.md
index b93a61f563..f4fad23350 100644
--- a/docs/mindformers/docs/source_en/guide/inference.md
+++ b/docs/mindformers/docs/source_en/guide/inference.md
@@ -12,13 +12,17 @@ The inference process can be categorized into the following steps:
 
 ### 1. Models of Selective Inference
 
-Depending on the required inference task, different models are chosen, e.g. for text generation one can choose Qwen3-8B, etc.
+Depending on the required inference task, different models are chosen, e.g. for text generation one can choose Qwen3, etc.
 
 ### 2. Preparing Model Files
 
 Obtain the Hugging Face model file: weights, configurations, and tokenizers. Store the downloaded files in the same folder directory for convenient subsequent use.
 
-### 3. Executing Inference Tasks
+### 3. Preparing the YAML Configuration File
+
+The user needs to prepare a YAML file that defines all the configurations of the task. MindSpore Transformers provides a YAML configuration template; users can customize it according to the actual scenario. For details, see the [Guide to Using the Inference Configuration Template](https://www.mindspore.cn/mindformers/docs/en/master/advanced_development/yaml_config_inference.html).
+
+### 4. Executing Inference Tasks
 
 Use the unified script `run_mindformer` to execute inference tasks.
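+
+As an illustrative sketch (the flag names follow the script's common usage and should be verified against your installed version, and the config path is a placeholder), a single-card run can look like:
+
+```shell
+# Hypothetical single-card invocation of the unified script
+python run_mindformer.py \
+ --config path/to/your_inference.yaml \
+ --run_mode predict \
+ --predict_data 'Hello!'
+```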
diff --git a/docs/mindformers/docs/source_en/index.rst b/docs/mindformers/docs/source_en/index.rst
index c3c883a2dc..5aaeedb0c8 100644
--- a/docs/mindformers/docs/source_en/index.rst
+++ b/docs/mindformers/docs/source_en/index.rst
@@ -117,6 +117,7 @@ Advanced developing with MindSpore Transformers
 - Model Development
 
   - `Development Migration `_
+  - `Guide to Using the Inference Configuration Template `_
 
 - Accuracy Comparison
 
@@ -192,6 +193,7 @@ FAQ
    advanced_development/precision_optimization
    advanced_development/performance_optimization
    advanced_development/dev_migration
+   advanced_development/yaml_config_inference
    advanced_development/accuracy_comparison
    advanced_development/api
 
diff --git a/docs/mindformers/docs/source_zh_cn/advanced_development/yaml_config_inference.md b/docs/mindformers/docs/source_zh_cn/advanced_development/yaml_config_inference.md
new file mode 100644
index 0000000000..abd8cf775f
--- /dev/null
+++ b/docs/mindformers/docs/source_zh_cn/advanced_development/yaml_config_inference.md
@@ -0,0 +1,66 @@
+# 推理配置模板使用指南
+
+[![查看源文件](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source.svg)](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/advanced_development/yaml_config_inference.md)
+
+## 概述
+
+当前Mcore架构模型在推理时,支持读取Hugging Face模型目录来实例化模型,因此MindSpore Transformers精简了模型的YAML配置文件,从原先每个模型、每个规格都有一份YAML,统一成一份YAML配置模板。不同规格模型在在线推理时,只需要套用配置模板,配置好从Hugging Face或ModelScope下载的模型目录,再修改少数必要配置,即可进行推理。
+
+## 使用方法
+
+使用推理配置模板进行推理时,需要根据实际情况,修改其中的部分配置。
+
+### 必须修改的配置(Required)
+
+配置模板不包含模型的配置,依赖读取Hugging Face或ModelScope的模型配置来实例化模型,因此必须修改如下配置:
+
+|配置项|配置说明|修改方法|
+|----|----|--------|
+|pretrained_model_dir|模型目录的路径|修改为从Hugging Face或ModelScope下载的模型文件的文件夹路径|
+
+### 可选的场景化配置(Optional)
+
+以下不同使用场景需要对部分配置进行修改:
+
+#### 默认场景(单卡、64GB显存)
+
+推理配置模板默认为单卡64GB显存场景的配置,此时无需额外修改配置。需注意如果模型规模过大,单卡显存无法支持时,需要进行多卡推理。
+
+#### 分布式场景
+
+分布式的多卡推理场景需要在配置中打开并行配置,并调整模型并行策略,需要修改的配置如下:
+
+|配置项|配置说明|修改方法|
+|----|----|--------|
+|use_parallel|并行开关|分布式推理时需要设置为True|
+|parallel_config|并行策略|当前在线推理仅支持模型并行,设置model_parallel为使用的卡数|
+
+#### 其他显存规格场景
+
+非64GB显存(片上内存)的设备上,需要调整MindSpore占用的最大显存大小,需要修改的配置如下:
+
+|配置项|配置说明|修改方法|
+|----|----|--------|
+|max_device_memory|MindSpore可占用的最大显存|需要为通信预留部分显存,一般情况下64GB显存的设备配置为<60GB,32GB显存的设备配置为<30GB。卡数比较多时可能还需根据实际减小。|
+
+## 使用样例
+
+MindSpore Transformers提供了Qwen3系列模型的YAML配置文件模板[predict_qwen3.yaml](https://gitee.com/mindspore/mindformers/blob/master/configs/qwen3/predict_qwen3.yaml),不同规格的Qwen3模型可以通过修改相关配置,使用该模板执行推理任务。
+
+以Qwen3-32B为例,按照如下步骤修改YAML配置文件:
+
+1. 修改pretrained_model_dir为Qwen3-32B的模型文件的文件夹路径:
+
+    ```yaml
+    pretrained_model_dir: "path/to/Qwen3-32B"
+    ```
+
+2. Qwen3-32B至少需要4卡,需要修改并行配置:
+
+    ```yaml
+    use_parallel: True
+    parallel_config:
+      model_parallel: 4
+    ```
+
+关于执行推理任务的后续操作,详细可见[Qwen3的README](https://gitee.com/mindspore/mindformers/blob/master/configs/qwen3/README.md#%E6%8E%A8%E7%90%86%E6%A0%B7%E4%BE%8B)。
diff --git a/docs/mindformers/docs/source_zh_cn/example/yaml/inference_template.yaml b/docs/mindformers/docs/source_zh_cn/example/yaml/inference_template.yaml
new file mode 100644
index 0000000000..23093e4fec
--- /dev/null
+++ b/docs/mindformers/docs/source_zh_cn/example/yaml/inference_template.yaml
@@ -0,0 +1,36 @@
+use_legacy: False  # Control whether to use the old architecture
+
+# HuggingFace file directory
+pretrained_model_dir: '/path/hf_dir'
+model:
+  model_config:
+    compute_dtype: "bfloat16"  # Linear layer compute dtype
+    layernorm_compute_dtype: "bfloat16"  # LayerNorm compute dtype
+    softmax_compute_dtype: "float32"  # Data type for computing softmax during attention computation
+    rotary_dtype: "bfloat16"  # Rotary position embedding compute dtype
+    params_dtype: "bfloat16"  # Data type for initializing parameters such as weights
+
+use_parallel: False  # Whether to enable parallel mode
+parallel_config:
+  data_parallel: 1  # Set the degree of data parallelism
+  model_parallel: 1  # Set the degree of model parallelism
+
+# mindspore context init config
+context:
+  mode: 0  # 0--Graph Mode; 1--Pynative Mode
+  max_device_memory: "59GB"  # Set the maximum memory available to the device in the format "xxGB"
+  device_id: 0  # Set the execution device ID
+  device_target: "Ascend"  # Set the backend execution device
+
+run_mode: 'predict'  # Set the running mode of the model: train, finetune, eval or predict
+seed: 0  # Set the global seed
+output_dir: './output'  # Set the path where checkpoint, log, strategy, etc. files are saved
+load_checkpoint: ''  # File or folder path of the weights to load
+load_ckpt_format: "safetensors"  # The format of the loaded checkpoint, either ckpt or safetensors
+
+# parallel context config
+parallel:
+  parallel_mode: "MANUAL_PARALLEL"  # Set parallel mode
+
+trainer:  # trainer config
+  type: CausalLanguageModelingTrainer
\ No newline at end of file
diff --git a/docs/mindformers/docs/source_zh_cn/guide/inference.md b/docs/mindformers/docs/source_zh_cn/guide/inference.md
index bfd36ff4fa..79123dd7c9 100644
--- a/docs/mindformers/docs/source_zh_cn/guide/inference.md
+++ b/docs/mindformers/docs/source_zh_cn/guide/inference.md
@@ -12,13 +12,17 @@ MindSpore Transformers 提供了大模型推理能力,用户可以执行 `run_
 
 ### 1. 选择推理的模型
 
-根据需要的推理任务,选择不同的模型,如文本生成可以选择Qwen3-8B等。
+根据需要的推理任务,选择不同的模型,如文本生成可以选择Qwen3等。
 
 ### 2. 准备模型文件
 
 获取Hugging Face模型文件:权重、配置与分词器,将下载的文件存放在同一个文件夹目录,方便后续使用。
 
-### 3. 执行推理任务
+### 3. 准备YAML配置文件
+
+用户需要配置一份YAML文件,来定义任务的所有配置。MindSpore Transformers提供了一份YAML配置模板,用户可以基于模板,根据实际场景自定义配置,详细可见[推理配置模板使用指南](https://www.mindspore.cn/mindformers/docs/zh-CN/master/advanced_development/yaml_config_inference.html)。
+
+### 4. 执行推理任务
 
 使用 `run_mindformer` 统一脚本执行推理任务。
diff --git a/docs/mindformers/docs/source_zh_cn/index.rst b/docs/mindformers/docs/source_zh_cn/index.rst
index a615d23055..11f0e1651a 100644
--- a/docs/mindformers/docs/source_zh_cn/index.rst
+++ b/docs/mindformers/docs/source_zh_cn/index.rst
@@ -144,6 +144,7 @@ MindSpore Transformers功能特性说明
 - 模型开发
 
   - `开发迁移 `_
+  - `推理配置模板使用指南 `_
 
 - 精度对比
 
@@ -219,6 +220,7 @@ FAQ
    advanced_development/precision_optimization
    advanced_development/performance_optimization
    advanced_development/dev_migration
+   advanced_development/yaml_config_inference
    advanced_development/accuracy_comparison
    advanced_development/api
 
--
Gitee