From 1f69c8ebab3a0e6c3c95c110180a4e3e8fefad87 Mon Sep 17 00:00:00 2001 From: Yule100 <2538776509@qq.com> Date: Tue, 17 Jun 2025 20:59:06 +0800 Subject: [PATCH] docs_code inference --- .../docs/source_en/guide/inference.md | 315 ++++-------------- .../docs/source_zh_cn/guide/inference.md | 311 ++++------------- 2 files changed, 134 insertions(+), 492 deletions(-) diff --git a/docs/mindformers/docs/source_en/guide/inference.md b/docs/mindformers/docs/source_en/guide/inference.md index 071cea651f..b252a158d7 100644 --- a/docs/mindformers/docs/source_en/guide/inference.md +++ b/docs/mindformers/docs/source_en/guide/inference.md @@ -4,7 +4,7 @@ ## Overview -MindSpore Transformers provides the foundation model inference capability. Users can run the unified script `run_mindformer` or write a script to call the high-level API to start inference. If the unified script `run_mindformer` is used, you can directly start the system through the configuration file without writing code. +MindSpore Transformers offers large model inference capabilities. Users can execute the `run_mindformer` unified script for inference. By using the `run_mindformer` unified script, users can start the process directly through configuration files without writing any code, making it very convenient to use. ## Basic Process @@ -12,38 +12,18 @@ The inference process can be categorized into the following steps: ### 1. Models of Selective Inference -Depending on the required inference task, different models are chosen, e.g. for text generation one can choose Llama2, etc. +Depending on the required inference task, different models are chosen, e.g. for text generation one can choose `Qwen2.5-7B`, etc. ### 2. Preparing Model Weights -Model weights can be categorized into two types: complete weights and distributed weights, and the following instructions should be referred to when using them. +Currently, the inference weights can be loaded online to perform inference with the complete weights. The weights can be obtained through the following two methods: -#### 2.1 Complete Weights - -Complete weights can be obtained in two ways: - -1. After downloading the open source weights of the corresponding model from the HuggingFace model library, refer to [Weight Format Conversion](https://www.mindspore.cn/mindformers/docs/en/dev/feature/ckpt.html) to convert them to the ckpt format. -2. Pre-trained or fine-tuned distributed weights are used to generate a complete weight by [merging](https://www.mindspore.cn/mindformers/docs/en/dev/feature/ckpt.html). - -#### 2.2 Distributed Weights - -Distributed weights are typically obtained by pre-training or after fine-tuning and are stored by default in the `./output/checkpoint_network` directory, which needs to be converted to single-card or multi-card weights before performing single-card or multi-card inference. - -If the inference uses a weight slicing that is different from the model slicing provided in the inference task, such as in these cases below, the weights need to be additionally converted to a slice that matches the slicing of the model in the actual inference task. - -1. The weights obtained from multi-card training are reasoned on a single card; -2. The weights of the eight-card training are reasoned over two cards; -3. Already sliced distributed weights are reasoned on a single card, and so on. - -The command samples in the following contents are all used in the way of online autoslicing. 
It is recommended to use online autoslicing by setting the command parameters `--auto_trans_ckpt` to `-True` and `-src_strategy_path_or_dir` to the weighted slicing strategy file or directory path (which is saved by default after training under `./output/strategy`) are automatically sliced in the inference task. Details can be found in [Distributed Weight Slicing and Merging](https://www.mindspore.cn/mindformers/docs/en/dev/feature/ckpt.html). - -> Since both the training and inference tasks use `./output` as the default output path, when using the strategy file output by the training task as the source weight strategy file for the inference task, you need to move the strategy file directory under the default output path to another location to avoid it being emptied by the process of the inference task, for example: -> -> ```mv ./output/strategy/ ./strategy``` +1. Download the complete open-source weights of the corresponding model from the Hugging Face model library. +2. Pre-trained or fine-tuned distributed weights through [merger](https://www.mindspore.cn/mindformers/docs/en/dev/feature/safetensors.html#weight-merging) Generate a complete weight. ### 3. Executing Inference Tasks -Call the high-level API or use the unified script `run_mindformer` to execute inference tasks. +Use the unified script `run_mindformer` to execute inference tasks. ## Inference Based on the run_mindformer Script @@ -65,107 +45,82 @@ The arguments to run_mindformer.py are described below: msrun_launcher.sh includes the run_mindformer.py command and the number of inference cards as two parameters. -The following will describe the usage of single and multi-card inference using Llama2 as an example, with the recommended configuration of the [predict_llama2_7b.yaml](https://gitee.com/mindspore/mindformers/blob/dev/configs/llama2/predict_llama2_7b.yaml) file. - -> During inference, the vocabulary file `tokenizer.model` required for the Llama2 model will be automatically downloaded (ensuring smooth network connectivity). If the file exists locally, you can place it in the `./checkpoint_download/Llama2/` directory in advance. +The following will describe the usage of single and multi-card inference using `Qwen2.5-7B` as an example, with the recommended configuration of the [predict_qwen2_5_7b_instruct.yaml](https://gitee.com/mindspore/mindformers/blob/dev/research/qwen2_5/predict_qwen2_5_7b_instruct.yaml) file. -### Single-Device Inference +### Configuration Modification -When using complete weight inference, the following command is executed to start the inference task: +The configuration related to weights is modified as follows: -```shell -python run_mindformer.py \ ---config configs/llama2/predict_llama2_7b.yaml \ ---run_mode predict \ ---use_parallel False \ ---load_checkpoint path/to/checkpoint.ckpt \ ---predict_data 'I love Beijing, because' -``` - -If you use distributed weight files for inference, you need to add the `--auto_trans_ckpt` and `-src_strategy_path_or_dir` entries, with the following startup commands: - -```shell -python run_mindformer.py \ ---config configs/llama2/predict_llama2_7b.yaml \ ---run_mode predict \ ---use_parallel False \ ---auto_trans_ckpt True \ ---src_strategy_path_or_dir ./strategy \ ---load_checkpoint path/to/checkpoint.ckpt \ ---predict_data 'I love Beijing, because' -``` - -The following result appears, proving that the inference was successful. The inference result is also saved to the `text_generation_result.txt` file in the current directory. 
The detailed log can be viewed in the `./output/msrun_log` directory.

-```text
-'text_generation_text': [I love Beijing, because it is a city that is constantly constantly changing. I have been living here for ......]
+```yaml
+load_checkpoint: "path/to/Qwen2_5_7b_instruct/"
+load_ckpt_format: 'safetensors'
+auto_trans_ckpt: True
 ```

-### Multi-Card Inference
-
-The configuration requirements for multi-card inference differ from those of single card, and you need to refer to the following instructions to modify the [predict_llama2_7b.yaml](https://gitee.com/mindspore/mindformers/blob/dev/configs/llama2/predict_llama2_7b.yaml) configuration.
-
-1. The configuration of model_parallel and the number of cards used need to be consistent. The following use case is 2-card inference, and model_parallel needs to be set to 2;
-2. The current version of multi-card inference does not support data parallelism, you need to set data_parallel to 1.
-
-**Configuration before modification:**
+The default configuration is the single-card inference configuration. The parallelism-related configuration is modified as follows:

 ```yaml
+use_parallel: False
 parallel_config:
-  data_parallel: 8
+  data_parallel: 1
   model_parallel: 1
   pipeline_stage: 1
 ```

-**Configuration after modifications:**
+The `tokenizer`-related configuration is modified as follows:

 ```yaml
-parallel_config:
-  data_parallel: 1
-  model_parallel: 2
-  pipeline_stage: 1
+processor:
+  tokenizer:
+    vocab_file: "path/to/vocab.json"
+    merges_file: "path/to/merges.txt"
 ```

-When full weight inference is used, you need to enable the online slicing mode to load weights. For details, see the following command:
+For specific configuration instructions, please refer to [yaml Configuration Instructions](https://www.mindspore.cn/mindformers/docs/en/dev/feature/configuration.html).
+
+### Single-Device Inference
+
+When using complete weights for inference, it is recommended to use the default configuration and execute the following command to start the inference task:

 ```shell
-bash scripts/msrun_launcher.sh "python run_mindformer.py \
---config configs/llama2/predict_llama2_7b.yaml \
+python run_mindformer.py \
+--register_path /path/to/research/qwen2_5/ \
+--config /path/to/research/qwen2_5/predict_qwen2_5_7b_instruct.yaml \
 --run_mode predict \
---use_parallel True \
---auto_trans_ckpt True \
---load_checkpoint path/to/checkpoint.ckpt \
---predict_data 'I love Beijing, because'" \
-2
+--use_parallel False \
+--predict_data '帮助我制定一份去上海的旅游攻略'
 ```

-Refer to the following commands when distributed weight inference is used and the slicing strategy for the weights is the same as the slicing strategy for the model:
+If the following result appears, the inference has succeeded. The inference result is also saved to the `text_generation_result.txt` file in the current directory.

-```shell
-bash scripts/msrun_launcher.sh "python run_mindformer.py \
---config configs/llama2/predict_llama2_7b.yaml \
---run_mode predict \
---use_parallel True \
---load_checkpoint path/to/checkpoint_dir \
---predict_data 'I love Beijing, because'" \
-2
+```text
+'text_generation_text': [帮助我制定一份去上海的旅游攻略,包括景点、美食、住宿等信息...]
 ```

-When distributed weight inference is used and the slicing strategy of the weights is not consistent with the slicing strategy of the model, you need to enable the online slicing function to load the weights.
Refer to the following command: +### Multi-Card Inference + +The configuration requirements for multi-card inference are different from those for single-card inference. Please refer to the following for configuration modification: + +1. The configuration of model_parallel and the number of cards used need to be consistent. The following use case is 4-card inference, and model_parallel needs to be set to 4; +2. The current version of multi-card inference does not support data parallelism. data_parallel needs to be set to 1. + +When using full weight reasoning, it is necessary to enable the online splitting mode to load the weights. Refer to the following command: ```shell -bash scripts/msrun_launcher.sh "python run_mindformer.py \ ---config configs/llama2/predict_llama2_7b.yaml \ ---run_mode predict \ ---use_parallel True \ ---auto_trans_ckpt True \ ---src_strategy_path_or_dir ./strategy \ ---load_checkpoint path/to/checkpoint_dir \ ---predict_data 'I love Beijing, because'" \ -2 +bash scripts/msrun_launcher.sh "run_mindformer.py \ + --register_path /path/to/research/qwen2_5 \ + --config /path/to/research/qwen2_5/qwen2_5_72b/predict_qwen2_5_72b_instruct.yaml \ + --run_mode predict \ + --use_parallel True \ + --auto_trans_ckpt True \ + --predict_data '帮助我制定一份去上海的旅游攻略'" 4 ``` -Inference results are viewed in the same way as single-card inference. +The following results appear, proving the success of the reasoning. The reasoning results will also be saved to the text_generation_result.txt file in the current directory. Detailed logs can be viewed through the directory `./output/msrun_log`. + +```text +'text_generation_text': [帮助我制定一份去上海的旅游攻略,包括景点、美食、住宿等信息...] +``` ### Multi-Device Multi-Batch Inference @@ -174,27 +129,26 @@ Multi-card multi-batch inference is initiated in the same way as [multi-card inf The content and format of the `input_predict_data.txt` file is an input each line, and the number of questions is the same as the `predict_batch_size`, which can be found in the following format: ```text -I love Beijing, because -I love Beijing, because -I love Beijing, because -I love Beijing, because +帮助我制定一份去上海的旅游攻略 +帮助我制定一份去上海的旅游攻略 +帮助我制定一份去上海的旅游攻略 +帮助我制定一份去上海的旅游攻略 ``` -Refer to the following commands to perform inference tasks, taking the full weight inference as an example: +Take full weight reasoning as an example. The reasoning task can be started by referring to the following command: ```shell -bash scripts/msrun_launcher.sh "python run_mindformer.py \ ---config configs/llama2/predict_llama2_7b.yaml \ ---run_mode predict \ ---predict_batch_size 4 \ ---use_parallel True \ ---auto_trans_ckpt True \ ---load_checkpoint path/to/checkpoint.ckpt \ ---predict_data path/to/input_predict_data.txt" \ -2 +bash scripts/msrun_launcher.sh "run_mindformer.py \ + --register_path /path/to/research/qwen2_5 \ + --config /path/to/research/qwen2_5/qwen2_5_72b/predict_qwen2_5_72b_instruct.yaml \ + --run_mode predict \ + --predict_batch_size 4 \ + --use_parallel True \ + --auto_trans_ckpt True \ + --predict_data '帮助我制定一份去上海的旅游攻略'" 4 ``` -Inference results are viewed in the same way as single-card inference. +Inference results are viewed in the same way as multi-card inference. ### Multimodal Inference @@ -223,139 +177,6 @@ python run_mindformer.py \ --load_checkpoint /{path}/cogvlm2-image-llama3-chat.ckpt ``` -## Inference Based on High-level Interface - -> For security reasons, it is not recommended to use high-level interfaces for inference. This chapter will be deprecated in the next version. 
If you have any questions or suggestions, please submit feedback through [Community Issue](https://gitee.com/mindspore/mindformers/issues/new). Thank you for your understanding and support! - -MindSpore Transformers not only provides a unified script for `run_mindformer` inference, but also supports user-defined calls to high-level interfaces such as `pipeline` or `chat` for implementation. - -### Pipeline Interface - -Customized text generation inference task flow based on `pipeline` interface, supporting single card inference and multi-card inference. About how to use `pipeline` interface to start the task and output the result, you can refer to the following implementation. The specific parameter description can be viewed [pipeline interface API documentation](https://www.mindspore.cn/mindformers/docs/en/dev/mindformers/mindformers.pipeline.html#mindformers.pipeline). - -#### Incremental Inference - -```python -from mindformers import build_context -from mindformers import AutoModel, AutoTokenizer, pipeline, TextStreamer - -# Construct the input content. -inputs = ["I love Beijing, because", "LLaMA is a", "Huawei is a company that"] - -# Initialize the environment. -build_context({'context': {'mode': 0}, 'run_mode': 'predict', 'parallel': {}, 'parallel_config': {}}) - -# Instantiate a tokenizer. -tokenizer = AutoTokenizer.from_pretrained('llama2_7b') - -# Instantiate a model. -# Modify the path to the local weight path. -model = AutoModel.from_pretrained('llama2_7b', checkpoint_name_or_path="path/to/llama2_7b.ckpt", use_past=True) -# Model instantiation is also supported from modelers.cn.Given repo id which format is MindSpore-Lab/model_name -# model = AutoModel.from_pretrained('MindSpore-Lab/qwen1_5_7b-chat') - -# Start a non-stream inference task in the pipeline. -text_generation_pipeline = pipeline(task="text_generation", model=model, tokenizer=tokenizer) -outputs = text_generation_pipeline(inputs, max_length=512, do_sample=False, top_k=3, top_p=1) -for output in outputs: - print(output) -``` - -Save the example to `pipeline_inference.py`, modify the path for loading the weight, and run the `pipeline_inference.py` script. - -```shell -python pipeline_inference.py -``` - -The inference result is as follows: - -```text -'text_generation_text': [I love Beijing, because it is a city that is constantly constantly changing. I have been living here for ......] -'text_generation_text': [LLaMA is a large-scale, open-source, multimodal, multilingual, multitask, and multimodal pretrained language model. It is ......] -'text_generation_text': [Huawei is a company that has been around for a long time. ......] -``` - -#### Stream Inference - -```python -from mindformers import build_context -from mindformers import AutoModel, AutoTokenizer, pipeline, TextStreamer - -# Construct the input content. -inputs = ["I love Beijing, because", "LLaMA is a", "Huawei is a company that"] - -# Initialize the environment. -build_context({'context': {'mode': 0}, 'run_mode': 'predict', 'parallel': {}, 'parallel_config': {}}) - -# Instantiate a tokenizer. -tokenizer = AutoTokenizer.from_pretrained('llama2_7b') - -# Instantiate a model. -# Modify the path to the local weight path. 
-model = AutoModel.from_pretrained('llama2_7b', checkpoint_name_or_path="path/to/llama2_7b.ckpt", use_past=True) -# Model instantiation is also supported from modelers.cn.Given repo id which format is MindSpore-Lab/model_name -# model = AutoModel.from_pretrained('MindSpore-Lab/qwen1_5_7b-chat') - -# Start a stream inference task in the pipeline. -streamer = TextStreamer(tokenizer) -text_generation_pipeline = pipeline(task="text_generation", model=model, tokenizer=tokenizer, streamer=streamer) -_ = text_generation_pipeline(inputs, max_length=512, do_sample=False, top_k=3, top_p=1) -``` - -Save the example to `pipeline_inference.py`, modify the path for loading the weight, and run the `pipeline_inference.py` script. - -```shell -python pipeline_inference.py -``` - -The inference result is as follows: - -```text -'text_generation_text': [I love Beijing, because it is a city that is constantly constantly changing. I have been living here for ......] -'text_generation_text': [LLaMA is a large-scale, open-source, multimodal, multilingual, multitask, and multimodal pretrained language model. It is ......] -'text_generation_text': [Huawei is a company that has been around for a long time. ......] -``` - -### chat Interface - -Based on the `chat` interface, the process of generating dialogue text inference tasks involves adding chat templates through the provided tokenizer to infer user queries. You can refer to the following implementation methods, and specific parameter descriptions can be viewed [chat interface API documentation](https://www.mindspore.cn/mindformers/docs/en/dev/generation/mindformers.generation.GenerationMixin.html#mindformers.generation.GenerationMixin.chat). - -```python -from mindformers import build_context -from mindformers import AutoModel, AutoTokenizer - -# Construct the input content. -query = "Hello!" - -# Initialize the environment. -build_context({'context': {'mode': 0}, 'run_mode': 'predict', 'parallel': {}, 'parallel_config': {}}) - -# Instantiate a tokenizer. -tokenizer = AutoTokenizer.from_pretrained('llama2_7b') - -# Instantiate a model. -# Modify the path to the local weight path. -model = AutoModel.from_pretrained('llama2_7b', checkpoint_name_or_path="path/to/llama2_7b.ckpt", use_past=True) -# Model instantiation is also supported from modelers.cn.Given repo id which format is MindSpore-Lab/model_name -# model = AutoModel.from_pretrained('MindSpore-Lab/qwen1_5_7b-chat') - -# Start a stream inference task with chat. -response, history = model.chat(tokenizer=tokenizer, query=query, max_length=32) -print(response) -``` - -Save the example to `chat_inference.py`, modify the path for loading the weight, and run the `chat_inference.py` script. - -```shell -python chat_inference.py -``` - -The inference result is as follows: - -```text -Thanks, sir. -``` - ## More Information For more inference examples of different models, see [the models supported by MindSpore Transformers](https://www.mindspore.cn/mindformers/docs/en/dev/introduction/models.html). 
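As a supplement to the multi-device multi-batch example above: when the prompts are supplied through an `input_predict_data.txt` file (one prompt per line, with the line count matching `predict_batch_size`) rather than as a literal string, `--predict_data` can point at that file instead. The following is a minimal sketch under the same assumptions as the multi-card example above (complete weights, online splitting, 4 cards); the file path is illustrative and should be replaced with a real location:

```shell
# Sketch: multi-card, multi-batch inference reading prompts from a file.
# Assumes input_predict_data.txt contains predict_batch_size (here 4) prompts,
# one per line, as described in the multi-device multi-batch section above.
bash scripts/msrun_launcher.sh "run_mindformer.py \
 --register_path /path/to/research/qwen2_5 \
 --config /path/to/research/qwen2_5/qwen2_5_72b/predict_qwen2_5_72b_instruct.yaml \
 --run_mode predict \
 --predict_batch_size 4 \
 --use_parallel True \
 --auto_trans_ckpt True \
 --predict_data path/to/input_predict_data.txt" 4
```

The inference results are viewed in the same way as in the multi-card inference section.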
diff --git a/docs/mindformers/docs/source_zh_cn/guide/inference.md b/docs/mindformers/docs/source_zh_cn/guide/inference.md index 2423ce69f5..e415405812 100644 --- a/docs/mindformers/docs/source_zh_cn/guide/inference.md +++ b/docs/mindformers/docs/source_zh_cn/guide/inference.md @@ -4,7 +4,7 @@ ## 概述 -MindSpore Transformers 提供了大模型推理能力,用户可以执行 `run_mindformer` 统一脚本,或者编写代码调用高阶接口进行推理。使用 `run_mindformer` 统一脚本可以不编写代码,直接通过配置文件启动,用法更便捷。 +MindSpore Transformers 提供了大模型推理能力,用户可以执行 `run_mindformer` 统一脚本进行推理。用户使用 `run_mindformer` 统一脚本可以不编写代码,直接通过配置文件启动,用法便捷。 ## 基本流程 @@ -12,38 +12,18 @@ MindSpore Transformers 提供了大模型推理能力,用户可以执行 `run_ ### 1. 选择推理的模型 -根据需要的推理任务,选择不同的模型,如文本生成可以选择 Llama2 等。 +根据需要的推理任务,选择不同的模型,如文本生成可以选择 `Qwen2.5-7B` 等。 ### 2. 准备模型权重 -模型权重可分为完整权重和分布式权重两种,使用时需参考以下说明。 +目前推理权重可以在线加载完整权重进行推理,权重可以通过以下两种方式获得: -#### 2.1 完整权重 - -完整权重可以通过以下两种方式获得: - -1. 从HuggingFace模型库中下载相应模型的开源权重后,参考[权重格式转换](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/ckpt.html)将其转换为ckpt格式。 -2. 预训练或者微调后的分布式权重,通过[合并](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/ckpt.html)生成一个完整权重。 - -#### 2.2 分布式权重 - -分布式权重一般通过预训练或者微调后获得,默认保存在`./output/checkpoint_network`目录,需要先转换为单卡或多卡权重,再进行单卡或多卡推理。 - -如果推理使用的权重切分方式,与推理任务中提供的模型切分方式不同,例如以下这几种情况,则需要额外对权重进行切分方式的转换,以匹配实际推理任务中模型的切分方式。 - -1. 多卡训练得到的权重在单卡上推理; -2. 8卡训练的权重在2卡上推理; -3. 已经切分好的分布式权重在单卡上推理等。 - -下文的命令示例均采用了在线自动切分的方式,通过设置参数 `--auto_trans_ckpt` 为 `True` 和 `--src_strategy_path_or_dir` 为权重的切分策略文件或目录路径(预训练或者微调后,默认保存在`./output/strategy`下)在推理任务中自动完成切分。更多用法可参考[分布式权重的合并和切分](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/ckpt.html)。 - -> 由于训练和推理任务都使用 `./output` 作为默认输出路径,当使用训练任务所输出的策略文件,作为推理任务的源权重策略文件时,需要将默认输出路径下的策略文件目录移动到其他位置,避免被推理任务的进程清空,如: -> -> ```mv ./output/strategy/ ./strategy``` +1. 从Hugging Face模型库中下载相应模型的开源的完整权重。 +2. 预训练或者微调后的分布式权重,通过[合并](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/safetensors.html#%E6%9D%83%E9%87%8D%E5%90%88%E5%B9%B6)生成一个完整权重。 ### 3. 
执行推理任务 -使用 `run_mindformer` 统一脚本或调用高阶接口执行推理任务。 +使用 `run_mindformer` 统一脚本执行推理任务。 ## 使用 run_mindformer 一键启动脚本推理 @@ -65,108 +45,83 @@ run_mindformer.py的参数说明如下: msrun_launcher.sh包括run_mindformer.py命令和推理卡数两个参数。 -下面将以 Llama2 为例介绍单卡和多卡推理的用法,推荐配置为[predict_llama2_7b.yaml](https://gitee.com/mindspore/mindformers/blob/dev/configs/llama2/predict_llama2_7b.yaml)文件。 +下面将以 `Qwen2.5-7B` 为例介绍单卡和多卡推理的用法,推荐配置为[predict_qwen2_5_7b_instruct.yaml](https://gitee.com/mindspore/mindformers/blob/dev/research/qwen2_5/predict_qwen2_5_7b_instruct.yaml)文件。 -> 推理时会自动下载Llama2模型所需的词表文件 `tokenizer.model` (需要保障网络畅通)。如果本地有这个文件,可以提前把它放在 `./checkpoint_download/llama2/` 目录下。 +### 配置修改 -### 单卡推理 +权重相关配置修改如下: -当使用完整权重推理时,执行以下命令即可启动推理任务: +```yaml +load_checkpoint: "path/to/Qwen2_5_7b_instruct/" +load_ckpt_format: 'safetensors' +auto_trans_ckpt: True +``` -```shell -python run_mindformer.py \ ---config configs/llama2/predict_llama2_7b.yaml \ ---run_mode predict \ ---use_parallel False \ ---load_checkpoint path/to/checkpoint.ckpt \ ---predict_data 'I love Beijing, because' +默认配置是单卡推理配置,并行相关配置修改如下: + +```yaml +use_parallel: False +parallel_config: + data_parallel: 1 + model_parallel: 1 + pipeline_stage: 1 ``` -当使用分布式权重推理时,需要增加 ``--auto_trans_ckpt`` 和 ``--src_strategy_path_or_dir`` 的入参,启动命令如下: +`tokenizer`相关配置修改如下: + +```yaml +processor: + tokenizer: + vocab_file: "path/to/vocab.json" + merges_file: "path/to/merges.txt" +``` + +具体配置说明均可参考[yaml配置说明](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/configuration.html)。 + +### 单卡推理 + +当使用完整权重推理时,推荐使用默认配置,执行以下命令即可启动推理任务: ```shell python run_mindformer.py \ ---config configs/llama2/predict_llama2_7b.yaml \ +--register_path /path/to/research/qwen2_5/ \ +--config /path/to/research/qwen2_5/predict_qwen2_5_7b_instruct \ --run_mode predict \ --use_parallel False \ ---auto_trans_ckpt True \ ---src_strategy_path_or_dir ./strategy \ ---load_checkpoint path/to/checkpoint_dir \ ---predict_data 'I love Beijing, because' +--predict_data '帮助我制定一份去上海的旅游攻略' ``` -出现如下结果,证明推理成功。推理结果也会保存到当前目录下的 `text_generation_result.txt` 文件中。详细日志可通过`./output/msrun_log` 目录查看。 +出现如下结果,证明推理成功。推理结果也会保存到当前目录下的 `text_generation_result.txt` 文件中。 ```text -'text_generation_text': [I love Beijing, because it is a city that is constantly constantly changing. I have been living here for ......] +'text_generation_text': [帮助我制定一份去上海的旅游攻略,包括景点、美食、住宿等信息...] ``` ### 多卡推理 -多卡推理的配置要求与单卡存在差异,需参考如下说明修改[predict_llama2_7b.yaml](https://gitee.com/mindspore/mindformers/blob/dev/configs/llama2/predict_llama2_7b.yaml)配置。 +多卡推理的配置要求与单卡存在差异,需参考下面修改配置: -1. 模型并行model_parallel的配置和使用的卡数需保持一致,下文用例为2卡推理,需将model_parallel设置成2; +1. 模型并行model_parallel的配置和使用的卡数需保持一致,下文用例为4卡推理,需将model_parallel设置成4; 2. 
当前版本的多卡推理不支持数据并行,需将data_parallel设置为1。 -**修改前的配置:** - -```yaml -parallel_config: - data_parallel: 8 - model_parallel: 1 - pipeline_stage: 1 -``` - -**修改后的配置:** - -```yaml -parallel_config: - data_parallel: 1 - model_parallel: 2 - pipeline_stage: 1 -``` - 当使用完整权重推理时,需要开启在线切分方式加载权重,参考以下命令: ```shell -bash scripts/msrun_launcher.sh "python run_mindformer.py \ ---config configs/llama2/predict_llama2_7b.yaml \ ---run_mode predict \ ---use_parallel True \ ---auto_trans_ckpt True \ ---load_checkpoint path/to/checkpoint.ckpt \ ---predict_data 'I love Beijing, because'" \ -2 -``` - -当使用分布式权重推理,且权重的切分策略与模型的切分策略一致时,参考以下命令: - -```shell -bash scripts/msrun_launcher.sh "python run_mindformer.py \ ---config configs/llama2/predict_llama2_7b.yaml \ ---run_mode predict \ ---use_parallel True \ ---load_checkpoint path/to/checkpoint_dir \ ---predict_data 'I love Beijing, because'" \ -2 +bash scripts/msrun_launcher.sh "run_mindformer.py \ + --register_path /path/to/research/qwen2_5 \ + --config /path/to/research/qwen2_5/qwen2_5_72b/predict_qwen2_5_72b_instruct.yaml \ + --run_mode predict \ + --use_parallel True \ + --auto_trans_ckpt True \ + --predict_data '帮助我制定一份去上海的旅游攻略'" 4 ``` -当使用分布式权重推理,且权重的切分策略与模型的切分策略不一致时,需要打开在线切分功能加载权重,参考以下命令: +出现如下结果,证明推理成功。推理结果也会保存到当前目录下的 text_generation_result.txt 文件中。详细日志可通过`./output/msrun_log`目录查看。 -```shell -bash scripts/msrun_launcher.sh "python run_mindformer.py \ ---config configs/llama2/predict_llama2_7b.yaml \ ---run_mode predict \ ---use_parallel True \ ---auto_trans_ckpt True \ ---src_strategy_path_or_dir ./strategy \ ---load_checkpoint path/to/checkpoint_dir \ ---predict_data 'I love Beijing, because'" \ -2 +```text +'text_generation_text': [帮助我制定一份去上海的旅游攻略,包括景点、美食、住宿等信息...] ``` -推理结果查看方式,与单卡推理相同。 - ### 多卡多batch推理 多卡多batch推理的启动方式可参考上述[多卡推理](#多卡推理),但是需要增加`predict_batch_size`的入参,并修改`predict_data`的入参。 @@ -174,27 +129,26 @@ bash scripts/msrun_launcher.sh "python run_mindformer.py \ `input_predict_data.txt`文件的内容和格式是每一行都是一个输入,问题的个数与`predict_batch_size`一致,可以参考以下格式: ```text -I love Beijing, because -I love Beijing, because -I love Beijing, because -I love Beijing, because +帮助我制定一份去上海的旅游攻略 +帮助我制定一份去上海的旅游攻略 +帮助我制定一份去上海的旅游攻略 +帮助我制定一份去上海的旅游攻略 ``` 以完整权重推理为例,可以参考以下命令启动推理任务: ```shell -bash scripts/msrun_launcher.sh "python run_mindformer.py \ ---config configs/llama2/predict_llama2_7b.yaml \ ---run_mode predict \ ---predict_batch_size 4 \ ---use_parallel True \ ---auto_trans_ckpt True \ ---load_checkpoint path/to/checkpoint.ckpt \ ---predict_data path/to/input_predict_data.txt" \ -2 +bash scripts/msrun_launcher.sh "run_mindformer.py \ + --register_path /path/to/research/qwen2_5 \ + --config /path/to/research/qwen2_5/qwen2_5_72b/predict_qwen2_5_72b_instruct.yaml \ + --run_mode predict \ + --predict_batch_size 4 \ + --use_parallel True \ + --auto_trans_ckpt True \ + --predict_data '帮助我制定一份去上海的旅游攻略'" 4 ``` -推理结果查看方式,与单卡推理相同。 +推理结果查看方式,与多卡推理相同。 ### 多模态推理 @@ -223,139 +177,6 @@ python run_mindformer.py \ --load_checkpoint /{path}/cogvlm2-image-llama3-chat.ckpt ``` -## 基于高阶接口推理 - -> 基于安全性考虑,当前暂不推荐使用高阶接口进行推理,本章节将于下个版本下线。如有任何问题或建议,请通过[社区Issue](https://gitee.com/mindspore/mindformers/issues/new)提交反馈。感谢您的理解与支持! 
- -MindSpore Transformers除了提供 `run_mindformer` 统一脚本进行推理外,也支持用户自定义调用高阶接口`pipeline`或`chat`接口实现。 - -### Pipeline接口 - -基于 `pipeline` 接口的自定义文本生成推理任务流程,支持单卡推理和多卡推理。关于如何使用 `pipeline` 接口启动任务并输出结果,可以参考以下实现方式,具体参数说明可以查看 [pipeline 接口的API文档](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/mindformers/mindformers.pipeline.html#mindformers.pipeline)。 - -#### 增量推理 - -```python -from mindformers import build_context -from mindformers import AutoModel, AutoTokenizer, pipeline, TextStreamer - -# 构造输入 -inputs = ["I love Beijing, because", "LLaMA is a", "Huawei is a company that"] - -# 初始化环境 -build_context({'context': {'mode': 0}, 'run_mode': 'predict', 'parallel': {}, 'parallel_config': {}}) - -# 实例化tokenizer -tokenizer = AutoTokenizer.from_pretrained('llama2_7b') - -# 模型实例化 -# 修改成本地的权重路径 -model = AutoModel.from_pretrained('llama2_7b', checkpoint_name_or_path="path/to/llama2_7b.ckpt", use_past=True) -# 模型实例化可使用魔乐社区模型在线加载,传入仓库名,格式为MindSpore-Lab/model_name -# model = AutoModel.from_pretrained('MindSpore-Lab/qwen1_5_7b-chat') - -# pipeline启动非流式推理任务 -text_generation_pipeline = pipeline(task="text_generation", model=model, tokenizer=tokenizer) -outputs = text_generation_pipeline(inputs, max_length=512, do_sample=False, top_k=3, top_p=1) -for output in outputs: - print(output) -``` - -通过将示例保存到 `pipeline_inference.py` 中,并且修改加载权重的路径,然后直接执行 `pipeline_inference.py` 脚本。 - -```shell -python pipeline_inference.py -``` - -执行以上命令的推理结果如下: - -```text -'text_generation_text': [I love Beijing, because it is a city that is constantly constantly changing. I have been living here for ......] -'text_generation_text': [LLaMA is a large-scale, open-source, multimodal, multilingual, multitask, and multimodal pretrained language model. It is ......] -'text_generation_text': [Huawei is a company that has been around for a long time. ......] -``` - -#### 流式推理 - -```python -from mindformers import build_context -from mindformers import AutoModel, AutoTokenizer, pipeline, TextStreamer - -# 构造输入 -inputs = ["I love Beijing, because", "LLaMA is a", "Huawei is a company that"] - -# 初始化环境 -build_context({'context': {'mode': 0}, 'run_mode': 'predict', 'parallel': {}, 'parallel_config': {}}) - -# 实例化tokenizer -tokenizer = AutoTokenizer.from_pretrained('llama2_7b') - -# 模型实例化 -# 修改成本地的权重路径 -model = AutoModel.from_pretrained('llama2_7b', checkpoint_name_or_path="path/to/llama2_7b.ckpt", use_past=True) -# 模型实例化可使用魔乐社区模型在线加载,传入模型名为Repo_id,格式为MindSpore-Lab/model_name -# model = AutoModel.from_pretrained('MindSpore-Lab/qwen1_5_7b-chat') - -# pipeline启动流式推理任务 -streamer = TextStreamer(tokenizer) -text_generation_pipeline = pipeline(task="text_generation", model=model, tokenizer=tokenizer, streamer=streamer) -_ = text_generation_pipeline(inputs, max_length=512, do_sample=False, top_k=3, top_p=1) -``` - -通过将示例保存到 `pipeline_inference.py` 中,并且修改加载权重的路径,然后直接执行 `pipeline_inference.py` 脚本。 - -```shell -python pipeline_inference.py -``` - -执行以上命令的推理结果如下: - -```text -'text_generation_text': [I love Beijing, because it is a city that is constantly constantly changing. I have been living here for ......] -'text_generation_text': [LLaMA is a large-scale, open-source, multimodal, multilingual, multitask, and multimodal pretrained language model. It is ......] -'text_generation_text': [Huawei is a company that has been around for a long time. ......] 
-``` - -### chat接口 - -基于 `chat` 接口的对话文本生成推理任务流程,通过提供的分词器添加聊天模板后,对用户的查询进行推断。可以参考以下实现方式,具体参数说明可以查看 [chat 接口的API文档](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/generation/mindformers.generation.GenerationMixin.html#mindformers.generation.GenerationMixin.chat)。 - -```python -from mindformers import build_context -from mindformers import AutoModel, AutoTokenizer - -# 构造输入 -query = "Hello!" - -# 初始化环境 -build_context({'context': {'mode': 0}, 'run_mode': 'predict', 'parallel': {}, 'parallel_config': {}}) - -# 实例化tokenizer -tokenizer = AutoTokenizer.from_pretrained('llama2_7b') - -# 模型实例化 -# 修改成本地的权重路径 -model = AutoModel.from_pretrained('llama2_7b', checkpoint_name_or_path="path/to/llama2_7b.ckpt", use_past=True) -# 模型实例化可使用魔乐社区模型在线加载,传入仓库名,格式为MindSpore-Lab/model_name -# model = AutoModel.from_pretrained('MindSpore-Lab/qwen1_5_7b-chat') - -# 调用chat接口启动推理任务 -response, history = model.chat(tokenizer=tokenizer, query=query, max_length=32) -print(response) -``` - -通过将示例保存到 `chat_inference.py` 中,并且修改加载权重的路径,然后直接执行 `chat_inference.py` 脚本。 - -```shell -python chat_inference.py -``` - -执行以上命令的推理结果如下: - -```text -Thanks, sir. -``` - ## 更多信息 更多关于不同模型的推理示例,请访问[MindSpore Transformers 已支持模型库](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/introduction/models.html)。 \ No newline at end of file -- Gitee