From 3baea95f1502f3f678d7a8c576933d2333e6f155 Mon Sep 17 00:00:00 2001 From: horcam Date: Tue, 12 Aug 2025 09:57:39 +0800 Subject: [PATCH 1/7] delete env vars --- .../source_en/getting_started/installation/installation.md | 2 -- .../docs/source_en/getting_started/quick_start/quick_start.md | 4 ---- .../tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md | 4 ---- .../tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md | 4 ---- .../user_guide/supported_features/benchmark/benchmark.md | 4 ---- .../supported_features/quantization/quantization.md | 2 -- .../source_zh_cn/getting_started/installation/installation.md | 2 -- .../source_zh_cn/getting_started/quick_start/quick_start.md | 4 ---- .../tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md | 4 ---- .../tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md | 4 ---- .../user_guide/supported_features/benchmark/benchmark.md | 4 ---- .../supported_features/quantization/quantization.md | 2 -- 12 files changed, 40 deletions(-) diff --git a/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md b/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md index 73851332d6..a6789dccde 100644 --- a/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md +++ b/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md @@ -153,9 +153,7 @@ docker exec -it $DOCKER_NAME bash User can verify the installation with a simple offline inference test. First, user need to configure the environment variables with the following command: ```bash -export ASCEND_TOTAL_MEMORY_GB=64 # Please use `npu-smi info` to check the memory. export vLLM_MODEL_BACKEND=MindFormers # use MindSpore Transformers as model backend. -export vLLM_MODEL_MEMORY_USE_GB=32 # Memory reserved for model execution. Set according to the model's maximum usage, with the remaining environment used for kvcache allocation export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model's YAML file. ``` diff --git a/docs/vllm_mindspore/docs/source_en/getting_started/quick_start/quick_start.md b/docs/vllm_mindspore/docs/source_en/getting_started/quick_start/quick_start.md index c0eaf16c34..6b129509a8 100644 --- a/docs/vllm_mindspore/docs/source_en/getting_started/quick_start/quick_start.md +++ b/docs/vllm_mindspore/docs/source_en/getting_started/quick_start/quick_start.md @@ -131,17 +131,13 @@ git clone https://huggingface.co/Qwen/Qwen2.5-7B-Instruct Before launching the model, user need to set the following environment variables: ```bash -export ASCEND_TOTAL_MEMORY_GB=64 # Please use `npu-smi info` to check the memory. export vLLM_MODEL_BACKEND=MindFormers # use MindSpore Transformers as model backend. -export vLLM_MODEL_MEMORY_USE_GB=32 # Memory reserved for model execution. Set according to the model's maximum usage, with the remaining environment used for kvcache allocation. export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model's YAML file. ``` Here is an explanation of these environment variables: -- `ASCEND_TOTAL_MEMORY_GB`: The memory size of each card. User can check the memory by using `npu-smi info`, where the value corresponds to `HBM-Usage(MB)` in the query results. - `vLLM_MODEL_BACKEND`: The backend of the model to run. User could find supported models and backends for vLLM MindSpore in the [Model Support List](../../user_guide/supported_models/models_list/models_list.md). 
-- `vLLM_MODEL_MEMORY_USE_GB`: The memory reserved for model loading. Adjust this value if insufficient memory error occurs during model loading. - `MINDFORMERS_MODEL_CONFIG`: The model configuration file. Additionally, users need to ensure that MindSpore Transformers is installed. Users can add it by running the following command: diff --git a/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md b/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md index 24d4d4a2ca..7f55f75080 100644 --- a/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md +++ b/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md @@ -127,17 +127,13 @@ For [Qwen2.5-32B](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct), the followi ```bash #set environment variables -export ASCEND_TOTAL_MEMORY_GB=64 # Use `npu-smi info` to check the memory. export vLLM_MODEL_BACKEND=MindFormers # Use MindSpore TransFormers as the model backend. -export vLLM_MODEL_MEMORY_USE_GB=32 # Memory reserved for model execution. Adjust based on the model's maximum usage, with the remaining allocated for KV cache. export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model YAML file. ``` Here is an explanation of these environment variables: -- `ASCEND_TOTAL_MEMORY_GB`: The memory size of each compute card. Query using `npu-smi info`, corresponding to `HBM-Usage(MB)` in the results. - `vLLM_MODEL_BACKEND`: The model backend. Currently supported models and backends are listed in the [Model Support List](../../../user_guide/supported_models/models_list/models_list.md). -- `vLLM_MODEL_MEMORY_USE_GB`: Memory reserved for model loading. Adjust this if encountering insufficient memory. - `MINDFORMERS_MODEL_CONFIG`: Model configuration file. User can find the corresponding YAML file in the [MindSpore Transformers repository](https://gitee.com/mindspore/mindformers/tree/r1.5.0/research/qwen2_5). For Qwen2.5-32B, the YAML file is [predict_qwen2_5_32b_instruct.yaml](https://gitee.com/mindspore/mindformers/blob/r1.5.0/research/qwen2_5/predict_qwen2_5_32b_instruct.yaml). Users can check memory usage with `npu-smi info` and set the NPU cards for inference using the following example (assuming cards 4,5,6,7 are used): diff --git a/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md b/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md index 2c360e28f8..f6daacb52d 100644 --- a/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md +++ b/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md @@ -127,17 +127,13 @@ For [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct), the following ```bash #set environment variables -export ASCEND_TOTAL_MEMORY_GB=64 # Please use `npu-smi info` to check the memory. export vLLM_MODEL_BACKEND=MindFormers # use MindSpore TransFormers as model backend. -export vLLM_MODEL_MEMORY_USE_GB=32 # Memory reserved for model execution. Set according to the model's maximum usage, with the remaining environment used for kvcache allocation export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model's YAML file. 
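+# Example only (hypothetical local path): $YAML_PATH should point to the model's MindSpore Transformers YAML, e.g.
+# export MINDFORMERS_MODEL_CONFIG=/path/to/mindformers/research/qwen2_5/predict_qwen2_5_7b_instruct.yaml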
``` Here is an explanation of these variables: -- `ASCEND_TOTAL_MEMORY_GB`: The memory size of each compute card. Query using `npu-smi info`, corresponding to `HBM-Usage(MB)` in the results. - `vLLM_MODEL_BACKEND`: The model backend. Currently supported models and backends are listed in the [Model Support List](../../../user_guide/supported_models/models_list/models_list.md). -- `vLLM_MODEL_MEMORY_USE_GB`: Memory reserved for model loading. Adjust this if encountering insufficient memory. - `MINDFORMERS_MODEL_CONFIG`: Model configuration file. User can find the corresponding YAML file in the [MindSpore Transformers repository](https://gitee.com/mindspore/mindformers/tree/master/research/qwen2_5). For Qwen2.5-7B, the YAML file is [predict_qwen2_5_7b_instruct.yaml](https://gitee.com/mindspore/mindformers/blob/master/research/qwen2_5/predict_qwen2_5_7b_instruct.yaml). User can check memory usage with `npu-smi info` and set the compute card for inference using: diff --git a/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/benchmark/benchmark.md b/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/benchmark/benchmark.md index b704a49f44..9ab03ff8ac 100644 --- a/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/benchmark/benchmark.md +++ b/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/benchmark/benchmark.md @@ -9,9 +9,7 @@ The benchmark tool of vLLM MindSpore is inherited from vLLM. You can refer to th For single-card inference, we take [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) as an example. You can prepare the environment by following the guide [Single-Card Inference (Qwen2.5-7B)](../../../getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md#online-inference), set the environment variables: ```bash -export ASCEND_TOTAL_MEMORY_GB=64 # Please use `npu-smi info` to check the memory. export vLLM_MODEL_BACKEND=MindFormers # use MindSpore Transformers as model backend. -export vLLM_MODEL_MEMORY_USE_GB=32 # Memory reserved for model execution. Set according to the model's maximum usage, with the remaining environment used for kvcache allocation export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model's YAML file. ``` @@ -104,9 +102,7 @@ P99 ITL (ms): .... For offline performance benchmark, take [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) as an example. Prepare the environment by following the guide [Single-Card Inference (Qwen2.5-7B)](../../../getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md#offline-inference). User need to set the environment variables: ```bash -export ASCEND_TOTAL_MEMORY_GB=64 # Please use `npu-smi info` to check the memory. export vLLM_MODEL_BACKEND=MindFormers # use MindSpore Transformers as model backend. -export vLLM_MODEL_MEMORY_USE_GB=32 # Memory reserved for model execution. Set according to the model's maximum usage, with the remaining environment used for kvcache allocation export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model's YAML file. 
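+# Optional sanity check (example only): confirm the variables are visible in the current shell before benchmarking
+# env | grep -E 'vLLM_MODEL_BACKEND|MINDFORMERS_MODEL_CONFIG'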
``` diff --git a/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/quantization/quantization.md b/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/quantization/quantization.md index 401768ca1a..33c39b583d 100644 --- a/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/quantization/quantization.md +++ b/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/quantization/quantization.md @@ -27,9 +27,7 @@ After obtaining the DeepSeek-R1 W8A8 weights, ensure they are stored in the rela Refer to the [Installation Guide](../../../getting_started/installation/installation.md) to set up the vLLM MindSpore environment. User need to set the following environment variables: ```bash -export ASCEND_TOTAL_MEMORY_GB=64 # Please use `npu-smi info` to check the memory. export vLLM_MODEL_BACKEND=MindFormers # use MindSpore Transformers as model backend. -export vLLM_MODEL_MEMORY_USE_GB=32 # Memory reserved for model execution. Set according to the model's maximum usage, with the remaining environment used for kvcache allocation export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model's YAML file. ``` diff --git a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md index 8121916024..2243f2eaea 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md @@ -152,9 +152,7 @@ docker exec -it $DOCKER_NAME bash 用户可以创建一个简单的离线推理场景,验证安装是否成功。下面以[Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) 为例。首先用户需要执行以下命令,设置环境变量: ```bash -export ASCEND_TOTAL_MEMORY_GB=64 # Please use `npu-smi info` to check the memory. export vLLM_MODEL_BACKEND=MindFormers # use MindSpore Transformers as model backend. -export vLLM_MODEL_MEMORY_USE_GB=32 # Memory reserved for model execution. Set according to the model's maximum usage, with the remaining environment used for kvcache allocation export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model's YAML file. ``` diff --git a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/quick_start/quick_start.md b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/quick_start/quick_start.md index 2bf629908a..a17c89b0fa 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/quick_start/quick_start.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/quick_start/quick_start.md @@ -131,17 +131,13 @@ git clone https://huggingface.co/Qwen/Qwen2.5-7B-Instruct 用户在拉起模型前,需设置以下环境变量: ```bash -export ASCEND_TOTAL_MEMORY_GB=64 # Please use `npu-smi info` to check the memory. export vLLM_MODEL_BACKEND=MindFormers # use MindSpore Transformers as model backend. -export vLLM_MODEL_MEMORY_USE_GB=32 # Memory reserved for model execution. Set according to the model's maximum usage, with the remaining environment used for kvcache allocation export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model's YAML file. 
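+# Example only (hypothetical path): set YAML_PATH to the Qwen2.5-7B MindSpore Transformers config before the export above, e.g.
+# YAML_PATH=/path/to/mindformers/research/qwen2_5/predict_qwen2_5_7b_instruct.yaml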
``` 以下是对上述环境变量的解释: -- `ASCEND_TOTAL_MEMORY_GB`: 每一张计算卡的显存大小。用户可使用`npu-smi info`命令进行查询,该值对应查询结果中的`HBM-Usage(MB)`; - `vLLM_MODEL_BACKEND`:所运行的模型后端。目前vLLM MindSpore所支持的模型与模型后端,可在[模型支持列表](../../user_guide/supported_models/models_list/models_list.md)中进行查询; -- `vLLM_MODEL_MEMORY_USE_GB`:模型加载时所用空间,根据用户所使用的模型进行设置。若用户在模型加载过程中遇到显存不足时,可适当增大该值并重试; - `MINDFORMERS_MODEL_CONFIG`:模型配置文件。 另外,用户需要确保MindSpore Transformers已安装。用户可通过 diff --git a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md index 811009f647..5db8bd6d63 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md @@ -128,17 +128,13 @@ git clone https://huggingface.co/Qwen/Qwen2.5-32B-Instruct ```bash #set environment variables -export ASCEND_TOTAL_MEMORY_GB=64 # Please use `npu-smi info` to check the memory. export vLLM_MODEL_BACKEND=MindFormers # use MindSpore TransFormers as model backend. -export vLLM_MODEL_MEMORY_USE_GB=32 # Memory reserved for model execution. Set according to the model's maximum usage, with the remaining environment used for kvcache allocation export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model's YAML file. ``` 以下是对上述环境变量的解释: -- `ASCEND_TOTAL_MEMORY_GB`: 每一张计算卡的显存大小。用户可使用`npu-smi info`命令进行查询,该值对应查询结果中的`HBM-Usage(MB)`。 - `vLLM_MODEL_BACKEND`:所运行的模型后端。目前vLLM MindSpore所支持的模型与模型后端,可在[模型支持列表](../../../user_guide/supported_models/models_list/models_list.md)中进行查询。 -- `vLLM_MODEL_MEMORY_USE_GB`:模型加载时所用空间,根据用户所使用的模型进行设置。若用户在模型加载过程中遇到显存不足时,可适当增大该值并重试。 - `MINDFORMERS_MODEL_CONFIG`:模型配置文件。用户可以在[MindSpore Transformers工程](https://gitee.com/mindspore/mindformers/tree/r1.5.0/research/qwen2_5)中,找到对应模型的yaml文件。以Qwen2.5-32B为例,则其yaml文件为[predict_qwen2_5_32b_instruct.yaml](https://gitee.com/mindspore/mindformers/blob/r1.5.0/research/qwen2_5/predict_qwen2_5_32b_instruct.yaml) 。 用户可通过`npu-smi info`查看显存占用情况,并可以使用如下环境变量,设置用于推理的计算卡。以下例子为假设用户使用4,5,6,7卡进行推理: diff --git a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md index ffc82071b2..90a49c065e 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md @@ -128,17 +128,13 @@ git clone https://huggingface.co/Qwen/Qwen2.5-7B-Instruct ```bash #set environment variables -export ASCEND_TOTAL_MEMORY_GB=64 # Please use `npu-smi info` to check the memory. export vLLM_MODEL_BACKEND=MindFormers # use MindSpore TransFormers as model backend. -export vLLM_MODEL_MEMORY_USE_GB=32 # Memory reserved for model execution. Set according to the model's maximum usage, with the remaining environment used for kvcache allocation export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model's YAML file. 
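+# Optional check (example only): verify that the configured YAML file exists
+# test -f "$MINDFORMERS_MODEL_CONFIG" && echo 'YAML found' || echo 'YAML missing'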
``` 以下是对上述环境变量的解释: -- `ASCEND_TOTAL_MEMORY_GB`: 每一张计算卡的显存大小。用户可使用`npu-smi info`命令进行查询,该值对应查询结果中的`HBM-Usage(MB)`; - `vLLM_MODEL_BACKEND`:所运行的模型后端。目前vLLM MindSpore所支持的模型与模型后端,可在[模型支持列表](../../../user_guide/supported_models/models_list/models_list.md)中进行查询; -- `vLLM_MODEL_MEMORY_USE_GB`:模型加载时所用空间,根据用户所使用的模型进行设置。若用户在模型加载过程中遇到显存不足时,可适当增大该值并重试; - `MINDFORMERS_MODEL_CONFIG`:模型配置文件。用户可以在[MindSpore Transformers工程](https://gitee.com/mindspore/mindformers/tree/r1.5.0/research/qwen2_5)中,找到对应模型的yaml文件。以Qwen2.5-7B为例,则其yaml文件为[predict_qwen2_5_7b_instruct.yaml](https://gitee.com/mindspore/mindformers/blob/r1.5.0/research/qwen2_5/predict_qwen2_5_7b_instruct.yaml) 。 用户可通过`npu-smi info`查看显存占用情况,并可以使用如下环境变量,设置用于推理的计算卡: diff --git a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/benchmark/benchmark.md b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/benchmark/benchmark.md index 15f1040699..6d28a6ff7c 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/benchmark/benchmark.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/benchmark/benchmark.md @@ -9,9 +9,7 @@ vLLM MindSpore的性能测试能力,继承自vLLM所提供的性能测试能 若用户使用单卡推理,以[Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct)为例,可按照文档[单卡推理(Qwen2.5-7B)](../../../getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md#在线推理)进行环境准备,设置以下环境变量: ```bash -export ASCEND_TOTAL_MEMORY_GB=64 # Please use `npu-smi info` to check the memory. export vLLM_MODEL_BACKEND=MindFormers # use MindSpore Transformers as model backend. -export vLLM_MODEL_MEMORY_USE_GB=32 # Memory reserved for model execution. Set according to the model's maximum usage, with the remaining environment used for kvcache allocation export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model's YAML file. ``` @@ -104,9 +102,7 @@ P99 ITL (ms): .... 用户使用离线性能测试时,以[Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct)为例,可按照文档[单卡推理(Qwen2.5-7B)](../../../getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md#离线推理)进行环境准备,设置以下环境变量: ```bash -export ASCEND_TOTAL_MEMORY_GB=64 # Please use `npu-smi info` to check the memory. export vLLM_MODEL_BACKEND=MindFormers # use MindSpore Transformers as model backend. -export vLLM_MODEL_MEMORY_USE_GB=32 # Memory reserved for model execution. Set according to the model's maximum usage, with the remaining environment used for kvcache allocation export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model's YAML file. ``` diff --git a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/quantization/quantization.md b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/quantization/quantization.md index 54ad35032d..71667d5f1e 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/quantization/quantization.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/quantization/quantization.md @@ -27,9 +27,7 @@ 用户可以参考[安装指南](../../../getting_started/installation/installation.md),进行vLLM MindSpore的环境搭建。用户需设置以下环境变量: ```bash -export ASCEND_TOTAL_MEMORY_GB=64 # Please use `npu-smi info` to check the memory. export vLLM_MODEL_BACKEND=MindFormers # use MindSpore Transformers as model backend. -export vLLM_MODEL_MEMORY_USE_GB=32 # Memory reserved for model execution. 
Set according to the model's maximum usage, with the remaining environment used for kvcache allocation export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model's YAML file. ``` -- Gitee From 5f6d2506fdf8172dd8ce9472d908ef92ae85a522 Mon Sep 17 00:00:00 2001 From: horcam Date: Tue, 12 Aug 2025 10:50:59 +0800 Subject: [PATCH 2/7] update ds dir, check model name and model args --- .../quick_start/quick_start.md | 4 ++- .../deepseek_r1_671b_w8a8_dp4_tp4_ep4.md | 27 +++++++++++-------- .../qwen2.5_32b_multiNPU.md | 8 +++--- .../qwen2.5_7b_singleNPU.md | 6 +++-- .../quick_start/quick_start.md | 4 +-- .../deepseek_r1_671b_w8a8_dp4_tp4_ep4.md | 21 +++++++++------ .../qwen2.5_32b_multiNPU.md | 6 ++--- .../supported_features/profiling/profiling.md | 2 +- 8 files changed, 47 insertions(+), 31 deletions(-) diff --git a/docs/vllm_mindspore/docs/source_en/getting_started/quick_start/quick_start.md b/docs/vllm_mindspore/docs/source_en/getting_started/quick_start/quick_start.md index 6b129509a8..506dc04152 100644 --- a/docs/vllm_mindspore/docs/source_en/getting_started/quick_start/quick_start.md +++ b/docs/vllm_mindspore/docs/source_en/getting_started/quick_start/quick_start.md @@ -198,7 +198,7 @@ Use the model `Qwen/Qwen2.5-7B-Instruct` and start the vLLM service with the fol python3 -m vllm_mindspore.entrypoints vllm.entrypoints.openai.api_server --model "Qwen/Qwen2.5-7B-Instruct" ``` -If the service starts successfully, similar output will be obtained: +User can also set the local model path by `--model` argument. If the service starts successfully, similar output will be obtained: ```text INFO: Started server process [6363] @@ -220,6 +220,8 @@ Use the following command to send a request, where `prompt` is the model input: curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "Qwen/Qwen2.5-7B-Instruct", "prompt": "I am", "max_tokens": 15, "temperature": 0}' ``` +User needs to ensure that the `"model"` field matches the `--model` in the service startup, and the request can successfully match the model. + If the request is processed successfully, the following inference result will be returned: ```text diff --git a/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md b/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md index 7a3a9fb4a8..8386c3d11c 100644 --- a/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md +++ b/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md @@ -241,18 +241,20 @@ Execution example: ```bash # Master node: -vllm-mindspore serve --model="/path/to/save/deepseek_r1_w8a8" --trust-remote-code --max-num-seqs=256 --max_model_len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 16 --distributed-executor-backend=ray +vllm-mindspore serve --model="MindSpore-Lab/DeepSeek-R1-W8A8" --trust-remote-code --max-num-seqs=256 --max_model_len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 16 --distributed-executor-backend=ray ``` -In tensor parallel scenarios, the `--tensor-parallel-size` parameter overrides the `model_parallel` configuration in the model YAML file. 
+In tensor parallel scenarios, the `--tensor-parallel-size` parameter overrides the `model_parallel` configuration in the model YAML file. User can also set the local model path by `--model` argument. #### Sending Requests Use the following command to send requests, where `prompt` is the model input: ```bash -curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "/path/to/save/deepseek_r1_w8a8", "prompt": "I am", "max_tokens": 20, "temperature": 0, "top_p": 1.0, "top_k": 1, "repetition_penalty": 1.0}' -``` +curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "MindSpore-Lab/DeepSeek-R1-W8A8", "prompt": "I am", "max_tokens": 20, "temperature": 0, "top_p": 1.0, "top_k": 1, "repetition_penalty": 1.0}' +``` + +User needs to ensure that the `"model"` field matches the `--model` in the service startup, and the request can successfully match the model. ## Hybrid Parallel Inference @@ -301,6 +303,7 @@ parallel_config: ### Online Inference +#### Starting the Service `vllm-mindspore` can deploy online inference using the OpenAI API protocol. Below is the workflow for launching the service: ```bash @@ -321,22 +324,24 @@ vllm-mindspore serve --data-parallel-address [Master node communication IP] --data-parallel-rpc-port [Master node communication port] --enable-expert-parallel # Enable expert parallelism -``` +``` -Execution example: +User can also set the local model path by `--model` argument. The following is an execution example: ```bash # Master node: -vllm-mindspore serve --model="/path/to/save/deepseek_r1_w8a8" --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 0 --data-parallel-address 192.10.10.10 --data-parallel-rpc-port 12370 --enable-expert-parallel +vllm-mindspore serve --model="MindSpore-Lab/DeepSeek-R1-W8A8" --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 0 --data-parallel-address 192.10.10.10 --data-parallel-rpc-port 12370 --enable-expert-parallel # Worker node: -vllm-mindspore serve --headless --model="/path/to/save/deepseek_r1_w8a8" --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 2 --data-parallel-address 192.10.10.10 --data-parallel-rpc-port 12370 --enable-expert-parallel -``` +vllm-mindspore serve --headless --model="MindSpore-Lab/DeepSeek-R1-W8A8" --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 2 --data-parallel-address 192.10.10.10 --data-parallel-rpc-port 12370 --enable-expert-parallel +``` -## Sending Requests +#### Sending Requests Use the following command to send requests, where `prompt` is the model input: ```bash -curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "/path/to/save/deepseek_r1_w8a8", "prompt": "I am", "max_tokens": 20, "temperature": 0}' +curl http://localhost:8000/v1/completions -H 
"Content-Type: application/json" -d '{"model": "MindSpore-Lab/DeepSeek-R1-W8A8", "prompt": "I am", "max_tokens": 20, "temperature": 0}' ``` + +User needs to ensure that the `"model"` field matches the `--model` in the service startup, and the request can successfully match the model. diff --git a/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md b/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md index 7f55f75080..0451257e2b 100644 --- a/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md +++ b/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md @@ -156,7 +156,7 @@ export MAX_MODEL_LEN=1024 python3 -m vllm_mindspore.entrypoints vllm.entrypoints.openai.api_server --model "Qwen/Qwen2.5-32B-Instruct" --trust_remote_code --tensor-parallel-size $TENSOR_PARALLEL_SIZE --max-model-len $MAX_MODEL_LEN ``` -Here, `TENSOR_PARALLEL_SIZE` specifies the number of NPU cards, and `MAX_MODEL_LEN` sets the maximum output token length. +Here, `TENSOR_PARALLEL_SIZE` specifies the number of NPU cards, and `MAX_MODEL_LEN` sets the maximum output token length. User can also set the local model path by `--model` argument. If the service starts successfully, similar output will be obtained: @@ -177,16 +177,18 @@ Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 Use the following command to send a request, where `prompt` is the model input: ```bash -curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "Qwen2.5-32B-Instruct", "prompt": "I am", "max_tokens": 20, "temperature": 0}' +curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "Qwen/Qwen2.5-32B-Instruct", "prompt": "I am", "max_tokens": 20, "temperature": 0}' ``` +User needs to ensure that the `"model"` field matches the `--model` in the service startup, and the request can successfully match the model. + If processed successfully, the inference result will be: ```text { "id":"cmpl-11fe2898c77d4ff18c879f57ae7aa9ca","object":"text_completion", "create":1748568696, - "model":"Qwen2.5-32B-Instruct", + "model":"Qwen/Qwen2.5-32B-Instruct", "choices":[ { "index":0, diff --git a/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md b/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md index f6daacb52d..e1216d12c9 100644 --- a/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md +++ b/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md @@ -192,7 +192,7 @@ Use the model `Qwen/Qwen2.5-7B-Instruct` and start the vLLM service with the fol python3 -m vllm_mindspore.entrypoints vllm.entrypoints.openai.api_server --model "Qwen/Qwen2.5-7B-Instruct" ``` -If the service starts successfully, similar output will be obtained: +User can also set the local model path by `--model` argument. 
If the service starts successfully, similar output will be obtained: ```text INFO: Started server process [6363] @@ -214,13 +214,15 @@ Use the following command to send a request, where `prompt` is the model input: curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "Qwen/Qwen2.5-7B-Instruct", "prompt": "I am", "max_tokens": 15, "temperature": 0}' ``` +User needs to ensure that the `"model"` field matches the `--model` in the service startup, and the request can successfully match the model. + If the request is processed successfully, the following inference result will be returned: ```text { "id":"cmpl-5e6e314861c24ba79fea151d86c1b9a6","object":"text_completion", "create":1747398389, - "model":"Qwen2.5-7B-Instruct", + "model":"Qwen/Qwen2.5-7B-Instruct", "choices":[ { "index":0, diff --git a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/quick_start/quick_start.md b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/quick_start/quick_start.md index a17c89b0fa..cb8960fed7 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/quick_start/quick_start.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/quick_start/quick_start.md @@ -198,7 +198,7 @@ vLLM MindSpore可使用OpenAI的API协议,进行在线推理部署。以下是 python3 -m vllm_mindspore.entrypoints vllm.entrypoints.openai.api_server --model "Qwen/Qwen2.5-7B-Instruct" ``` -若服务成功拉起,则可以获得类似的执行结果: +用户可以通过`--model`参数,指定模型保存的本地路径。若服务成功拉起,则可以获得类似的执行结果: ```text INFO: Started server process [6363] @@ -220,7 +220,7 @@ Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg gereration throughput: 0.0 curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "Qwen/Qwen2.5-7B-Instruct", "prompt": "I am", "max_tokens": 20, "temperature": 0}' ``` -若请求处理成功,将获得以下的推理结果: +其中,用户需确认`"model"`字段与启动服务中`--model`一致,请求才能成功匹配到模型。可以通过请求处理成功,将获得以下的推理结果: ```text { diff --git a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md index 047e7b4aad..2ad616469f 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md @@ -284,19 +284,21 @@ vllm-mindspore serve ```bash # 主节点: -vllm-mindspore serve --model="/path/to/save/deepseek_r1_w8a8" --trust-remote-code --max-num-seqs=256 --max_model_len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 16 --distributed-executor-backend=ray +vllm-mindspore serve --model="MindSpore-Lab/DeepSeek-R1-W8A8" --trust-remote-code --max-num-seqs=256 --max_model_len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 16 --distributed-executor-backend=ray ``` -张量并行场景下,`--tensor-parallel-size`参数会覆盖模型yaml文件中`parallel_config`的`model_parallel`配置。 +张量并行场景下,`--tensor-parallel-size`参数会覆盖模型yaml文件中`parallel_config`的`model_parallel`配置。用户可以通过`--model`参数,指定模型保存的本地路径。 #### 发起请求 使用如下命令发送请求。其中`prompt`字段为模型输入: ```bash -curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "/path/to/save/deepseek_r1_w8a8", "prompt": "I am", "max_tokens": 20, "temperature": 0, "top_p": 1.0, "top_k": 1, "repetition_penalty": 1.0}' +curl http://localhost:8000/v1/completions -H 
"Content-Type: application/json" -d '{"model": "MindSpore-Lab/DeepSeek-R1-W8A8", "prompt": "I am", "max_tokens": 20, "temperature": 0, "top_p": 1.0, "top_k": 1, "repetition_penalty": 1.0}' ``` +用户需确认`"model"`字段与启动服务中`--model`一致,请求才能成功匹配到模型。 + ## 混合并行推理 vLLM 通过 Ray 对多个节点资源进行管理和运行。该样例对应以下并行策略场景: @@ -344,6 +346,7 @@ parallel_config: ### 在线推理 +#### 启动服务 `vllm-mindspore`可使用OpenAI的API协议部署在线推理。以下是在线推理的拉起流程: ```bash @@ -366,20 +369,22 @@ vllm-mindspore serve --enable-expert-parallel # 使能专家并行 ``` -执行示例: +用户可以通过`--model`参数,指定模型保存的本地路径。以下为执行示例: ```bash # 主节点: -vllm-mindspore serve --model="/path/to/save/deepseek_r1_w8a8" --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 0 --data-parallel-address 192.10.10.10 --data-parallel-rpc-port 12370 --enable-expert-parallel +vllm-mindspore serve --model="MindSpore-Lab/DeepSeek-R1-W8A8" --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 0 --data-parallel-address 192.10.10.10 --data-parallel-rpc-port 12370 --enable-expert-parallel # 从节点: -vllm-mindspore serve --headless --model="/path/to/save/deepseek_r1_w8a8" --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 2 --data-parallel-address 192.10.10.10 --data-parallel-rpc-port 12370 --enable-expert-parallel +vllm-mindspore serve --headless --model="MindSpore-Lab/DeepSeek-R1-W8A8" --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 2 --data-parallel-address 192.10.10.10 --data-parallel-rpc-port 12370 --enable-expert-parallel ``` -## 发送请求 +#### 发送请求 使用如下命令发送请求。其中`prompt`字段为模型输入: ```bash -curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "/path/to/save/deepseek_r1_w8a8", "prompt": "I am, "max_tokens": 120, "temperature": 0}' +curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "MindSpore-Lab/DeepSeek-R1-W8A8", "prompt": "I am, "max_tokens": 120, "temperature": 0}' ``` + +用户需确认`"model"`字段与启动服务中`--model`一致,请求才能成功匹配到模型。 diff --git a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md index 5db8bd6d63..72e4bad00a 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md @@ -159,7 +159,7 @@ python3 -m vllm_mindspore.entrypoints vllm.entrypoints.openai.api_server --model 其中,`TENSOR_PARALLEL_SIZE`为用户指定的卡数,`MAX_MODEL_LEN`为模型最大输出token数。 -若服务成功拉起,则可以获得类似的执行结果: +用户可以通过`--model`参数,指定模型保存的本地路径。若服务成功拉起,则可以获得类似的执行结果: ```text INFO: Started server process [6363] @@ -178,10 +178,10 @@ Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg gereration throughput: 0.0 
使用如下命令发送请求。其中`prompt`字段为模型输入: ```bash -curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "Qwen2.5-32B-Instruct", "prompt": "I am", "max_tokens": 20, "temperature": 0}' +curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "Qwen/Qwen2.5-32B-Instruct", "prompt": "I am", "max_tokens": 20, "temperature": 0}' ``` -若请求处理成功,将获得以下的推理结果: +其中,用户需确认`"model"`字段与启动服务中`--model`一致,请求才能成功匹配到模型。可以通过请求处理成功,将获得以下的推理结果: ```text { diff --git a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/profiling/profiling.md b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/profiling/profiling.md index 4dc4f2ccee..eb90283188 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/profiling/profiling.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/profiling/profiling.md @@ -40,7 +40,7 @@ curl -X POST http://127.0.0.1:8000/start_profile curl http://localhost:8000/v1/completions \ -H "Content-Type: application/json" \ -d '{ - "model": "/home/DeepSeekV3", + "model": "Qwen/Qwen2.5-32B-Instruct", "prompt": "San Francisco is a", "max_tokens": 7, "temperature": 0 -- Gitee From af28e12112606bf2d766f1fb222e2680bd4671c7 Mon Sep 17 00:00:00 2001 From: horcam Date: Tue, 12 Aug 2025 11:14:01 +0800 Subject: [PATCH 3/7] update introduction of manual install --- .../installation/installation.md | 47 +++++++++++++++++++ .../installation/installation.md | 46 ++++++++++++++++++ 2 files changed, 93 insertions(+) diff --git a/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md b/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md index a6789dccde..cadf178a7a 100644 --- a/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md +++ b/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md @@ -148,6 +148,53 @@ docker exec -it $DOCKER_NAME bash export PYTHONPATH=$MF_PATH:$PYTHONPATH ``` +- **Manual Component Installation** + + If user need to modify the components or use other versions, components need to be manually installed in a specific order. vLLM MindSpore requires the following installation sequence: + + 1. Install vLLM + + ```bash + pip install /path/to/vllm-*.whl + ``` + + 2. Uninstall Torch-related components + + ```bash + pip uninstall torch torch-npu torchvision torchaudio -y + ``` + + 3. Install MindSpore + + ```bash + pip install /path/to/mindspore-*.whl + ``` + + 4. Clone the MindSpore Transformers repository and add it to `PYTHONPATH` + + ```bash + git clone https://gitee.com/mindspore/mindformers.git + export PYTHONPATH=`realpath mindformers`:$PYTHONPATH + ``` + + 5. Install Golden Stick + + ```bash + pip install /path/to/mindspore_gs-*.whl + ``` + + 6. Install MSAdapter + + ```bash + pip install /path/to/msadapter-*.whl + ``` + + 7. Install vLLM MindSpore + + ```bash + pip install . + ``` + ### Quick Verification User can verify the installation with a simple offline inference test. 
First, user need to configure the environment variables with the following command: diff --git a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md index 2243f2eaea..2331d3db9e 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md @@ -147,6 +147,52 @@ docker exec -it $DOCKER_NAME bash export PYTHONPATH=$MF_PATH:$PYTHONPATH ``` +- **组件手动安装** + + 若用户对组件有修改,或者需使用其他版本,则用户需要按照特定顺序,手动安装组件。vLLM MindSpore对组件的安装顺序要求如下: + 1. 安装vLLM + + ```bash + pip install /path/to/vllm-*.whl + ``` + + 2. 卸载torch相关组件 + + ```bash + pip uninstall torch torch-npu torchvision torchaudio -y + ``` + + 3. 安装MindSpore + + ```bash + pip install /path/to/mindspore-*.whl + ``` + + 4. 引入MindSpore Transformers仓,加入到`PYTHONPATH`中 + + ```bash + git clone https://gitee.com/mindspore/mindformers.git + export PYTHONPATH=`realpath mindformers`:$PYTHONPATH + ``` + + 5. 安装Golden Stick + + ```bash + pip install /path/to/mindspore_gs-*.whl + ``` + + 6. 安装MSAdapter + + ```bash + pip install /path/to/msadapter-*.whl + ``` + + 7. 安装vLLM MindSpore + + ```bash + pip install . + ``` + ### 快速验证 用户可以创建一个简单的离线推理场景,验证安装是否成功。下面以[Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) 为例。首先用户需要执行以下命令,设置环境变量: -- Gitee From ab80875d6bc68507d06efdf46d8dfecb707b909c Mon Sep 17 00:00:00 2001 From: horcam Date: Tue, 12 Aug 2025 11:25:21 +0800 Subject: [PATCH 4/7] update mindformers config introduction --- .../docs/source_en/getting_started/quick_start/quick_start.md | 2 +- .../source_zh_cn/getting_started/quick_start/quick_start.md | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/vllm_mindspore/docs/source_en/getting_started/quick_start/quick_start.md b/docs/vllm_mindspore/docs/source_en/getting_started/quick_start/quick_start.md index 506dc04152..91a88e814d 100644 --- a/docs/vllm_mindspore/docs/source_en/getting_started/quick_start/quick_start.md +++ b/docs/vllm_mindspore/docs/source_en/getting_started/quick_start/quick_start.md @@ -138,7 +138,7 @@ export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Tra Here is an explanation of these environment variables: - `vLLM_MODEL_BACKEND`: The backend of the model to run. User could find supported models and backends for vLLM MindSpore in the [Model Support List](../../user_guide/supported_models/models_list/models_list.md). -- `MINDFORMERS_MODEL_CONFIG`: The model configuration file. +- `MINDFORMERS_MODEL_CONFIG`: The model configuration file. User can find the corresponding YAML file in the [MindSpore Transformers repository](https://gitee.com/mindspore/mindformers/tree/master/research/qwen2_5). For Qwen2.5-7B, the YAML file is [predict_qwen2_5_7b_instruct.yaml](https://gitee.com/mindspore/mindformers/blob/master/research/qwen2_5/predict_qwen2_5_7b_instruct.yaml). Additionally, users need to ensure that MindSpore Transformers is installed. 
Users can add it by running the following command: diff --git a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/quick_start/quick_start.md b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/quick_start/quick_start.md index cb8960fed7..82d5534bda 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/quick_start/quick_start.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/quick_start/quick_start.md @@ -138,7 +138,7 @@ export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Tra 以下是对上述环境变量的解释: - `vLLM_MODEL_BACKEND`:所运行的模型后端。目前vLLM MindSpore所支持的模型与模型后端,可在[模型支持列表](../../user_guide/supported_models/models_list/models_list.md)中进行查询; -- `MINDFORMERS_MODEL_CONFIG`:模型配置文件。 +- `MINDFORMERS_MODEL_CONFIG`:模型配置文件。用户可以在[MindSpore Transformers工程](https://gitee.com/mindspore/mindformers/tree/r1.5.0/research/qwen2_5)中,找到对应模型的yaml文件。以Qwen2.5-7B为例,则其yaml文件为[predict_qwen2_5_7b_instruct.yaml](https://gitee.com/mindspore/mindformers/blob/r1.5.0/research/qwen2_5/predict_qwen2_5_7b_instruct.yaml) 。 另外,用户需要确保MindSpore Transformers已安装。用户可通过 -- Gitee From c984755479a9857cc421e46302de66df5902b570 Mon Sep 17 00:00:00 2001 From: horcam Date: Tue, 12 Aug 2025 16:07:42 +0800 Subject: [PATCH 5/7] update env variable and pkgs version --- .../installation/installation.md | 18 ++++----- .../environment_variables.md | 37 +++++++++++++++++-- .../installation/installation.md | 18 ++++----- .../environment_variables.md | 32 +++++++++++++++- 4 files changed, 83 insertions(+), 22 deletions(-) diff --git a/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md b/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md index cadf178a7a..9c3f34359f 100644 --- a/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md +++ b/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md @@ -14,15 +14,15 @@ This document describes the steps to install the vLLM MindSpore environment. 
Thr - Python: 3.9 / 3.10 / 3.11 - Software version compatibility - | Software | Version | Corresponding Branch | - | -------- | ------- | -------------------- | - | [CANN](https://www.hiascend.com/developer/download/community/result?module=cann) | 8.1 | - | - | [MindSpore](https://www.mindspore.cn/install/) | 2.7 | master | - | [MSAdapter](https://git.openi.org.cn/OpenI/MSAdapter) | 0.2 | master | - | [MindSpore Transformers](https://gitee.com/mindspore/mindformers) | 1.6 | dev | - | [Golden Stick](https://gitee.com/mindspore/golden-stick) | 1.1.0 | r1.1.0 | - | [vLLM](https://github.com/vllm-project/vllm) | 0.9.1 | v0.9.1 | - | [vLLM MindSpore](https://gitee.com/mindspore/vllm-mindspore) | 0.3 | master | + | Software | Version | + | ----- | ----- | + |[CANN](https://www.hiascend.com/developer/download/community/result?module=cann) | 8.1 | + |[MindSpore](https://www.mindspore.cn/install/) | 2.7.0 | + |[MSAdapter](https://git.openi.org.cn/OpenI/MSAdapter)| 0.0.1 | + |[MindSpore Transformers](https://gitee.com/mindspore/mindformers)| 1.6.0 | + |[Golden Stick](https://gitee.com/mindspore/golden-stick)| 1.2.0 | + |[vLLM](https://github.com/vllm-project/vllm) | 0.8.3 | + |[vLLM MindSpore](https://gitee.com/mindspore/vllm-mindspore) | 0.3.0 | ## Environment Setup diff --git a/docs/vllm_mindspore/docs/source_en/user_guide/environment_variables/environment_variables.md b/docs/vllm_mindspore/docs/source_en/user_guide/environment_variables/environment_variables.md index c1b7161626..35129a6215 100644 --- a/docs/vllm_mindspore/docs/source_en/user_guide/environment_variables/environment_variables.md +++ b/docs/vllm_mindspore/docs/source_en/user_guide/environment_variables/environment_variables.md @@ -11,6 +11,37 @@ | `HCCL_SOCKET_IFNAME` | Specifies the network interface name for inter-machine communication using HCCL. | String | Interface name (e.g., `enp189s0f0`). | Used in multi-machine scenarios. The interface name can be found via `ifconfig` by matching the IP address. | | `ASCEND_RT_VISIBLE_DEVICES` | Specifies which devices are visible to the current process, supporting one or multiple Device IDs. | String | Device IDs as a comma-separated string (e.g., `"0,1,2,3,4,5,6,7"`). | Recommended for Ray usage scenarios. | | `HCCL_BUFFSIZE` | Controls the buffer size for data sharing between two NPUs. | int | Buffer size in MB (e.g., `2048`). | Usage reference: [HCCL_BUFFSIZE](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/81RC1beta1/maintenref/envvar/envref_07_0080.html). Example: For DeepSeek hybrid parallelism (Data Parallel: 32, Expert Parallel: 32) with `max-num-batched-tokens=256`, set `export HCCL_BUFFSIZE=2048`. | -| MS_MEMPOOL_BLOCK_SIZE | Set the size of the memory pool block in PyNative mode for devices | String | String of positive number, and the unit is GB. | | -| vLLM_USE_NPU_ADV_STEP_FLASH_OP | Whether to use Ascend operation `adv_step_flash` | String | `on`: Use;`off`:Not use | If the variable is set to `off`, model will use the implement of small operations. | -| VLLM_TORCH_PROFILER_DIR | Enables profiling data collection and takes effect when a data save path is configured. | String | The path to save profiling data. | | +| `MS_MEMPOOL_BLOCK_SIZE` | Set the size of the memory pool block in PyNative mode for devices | String | String of positive number, and the unit is GB. 
| | +| `vLLM_USE_NPU_ADV_STEP_FLASH_OP` | Whether to use Ascend operation `adv_step_flash` | String | `on`: Use;`off`:Not use | If the variable is set to `off`, model will use the implement of small operations. | +| `VLLM_TORCH_PROFILER_DIR` | Enables profiling data collection and takes effect when a data save path is configured. | String | The path to save profiling data. | | + +The following environment variables are automatically registered by vLLM MindSpore: + +| **Environment Variable** | **Function** | **Type** | **Values** | **Description** | +|------------------------|-------------|----------|------------|----------------| +| `USE_TORCH` | Transformer runtime depends on this variable. | String | Default: `"False"` | vLLM MindSpore does not use Torch as the backend. | +| `USE_TF` | Transformer runtime depends on this variable. | String | Default: `"False"` | vLLM MindSpore does not use TensorFlow as the backend. | +| `RUN_MODE` | Execution mode. | String | Default: `"predict"` | **This variable will be removed in future versions.** Required by MindFormers. | +| `CUSTOM_MATMUL_SHUFFLE` | Enables or disables custom matrix shuffling algorithm . | String | `on`: Enable shuffling. `off`: Disable shuffling. Default: `on`. | | +| `HCCL_DETERMINISTIC` | Enables or disables deterministic computation for reduction-type communication operators (e.g., AllReduce, ReduceScatter, Reduce). | String | `true`: Enable deterministic mode. `false`: Disable deterministic mode. Default: `false`. | | +| `ASCEND_LAUNCH_BLOCKING` | Controls whether operators run in synchronous mode during training or online inference. | Integer | `1`: Force synchronous execution. `0`: Do not force synchronous execution. Default: `0`. | | +| `TE_PARALLEL_COMPILER` | Maximum number of parallel compilation processes for operators. Parallel compilation is enabled if greater than 1. | Integer | Positive integer; Max = CPU cores * 80% / # of Ascend AI processors. Range: 1~32. Default: `0`. | | +| `LCCL_DETERMINISTIC` | Controls whether LCCL deterministic AllReduce (ordered addition) is enabled. | Integer | `1`: Enable deterministic mode. `0`: Disable deterministic mode. Default: `0`. | | +| `MS_ENABLE_GRACEFUL_EXIT` | Enables graceful process termination. | Integer | `1`: Enable graceful exit. `Other values`: Disable graceful exit. Default: `0`. | | +| `CPU_AFFINITY` | Optimizes CPU core binding for MindSpore inference. | String | `True`: Enable core binding. `False`: Disable core binding. Default: `True`. | **This variable will be removed in future versions.** Replaced by `set_cpu_affinity` API. | +| `MS_ENABLE_INTERNAL_BOOST` | Enables or disables MindSpore framework's internal acceleration. | String | `on`: Enable acceleration. `off`: Disable acceleration. Default: `on`. | | +| `MS_ENABLE_LCCL` | Controls whether the LCCL communication library is used. | Integer | `1`: Enable. `0`: Disable. Default: `0`. | | +| `HCCL_EXEC_TIMEOUT` | Controls the synchronization timeout for inter-device execution. | Integer | Range: (0, 17340] (seconds). Default: `7200`. | | +| `DEVICE_NUM_PER_NODE` | Number of devices per node. | Integer | Default: `16`. | | +| `HCCL_OP_EXPANSION_MODE` | Configures the expansion location for communication algorithms. | String | `AI_CPU`: Expands on AI CPU compute units. `AIV`: Expands on AI Vector Core compute units. Default: `AIV`. | | +| `MS_JIT_MODULES` | Specifies modules to be JIT-compiled in static graph mode. | String | Module names (top-level imports). Multiple modules should be comma-separated. 
Default: `"vllm_mindspore,research"`. | | +| `GLOG_v` | Controls log level. | Integer | `0`: DEBUG. `1`: INFO. `2`: WARNING. `3`: ERROR (logs errors, may not terminate). `4`: CRITICAL (logs critical errors, terminates execution). Default: `3`. | | +| `RAY_CGRAPH_get_timeout` | Timeout for `ray.get()` method (seconds). | Integer | Default: `360`. | | +| `MS_NODE_TIMEOUT` | Node heartbeat timeout (seconds). | Integer | Default: `180`. | | + +More environment variable information can be referred in the following link: + + - [CANN Environment Variable List](https://www.hiascend.com/document/detail/en/CANNCommunityEdition/81RC1beta1/index/index.html) + - [MindSpore Environment Variable List](https://www.mindspore.cn/docs/en/r2.7.0rc1/api_python/env_var_list.html) + - [MindSpore Transformers Environment Variable List](https://www.mindspore.cn/mindformers/docs/en/r1.6.0/index.html) + - [vLLM Environment Variable List](https://docs.vllm.ai/en/v0.8.4/serving/env_vars.html) diff --git a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md index 2331d3db9e..8d4eb6a8d6 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md @@ -13,15 +13,15 @@ - Python:3.9 / 3.10 / 3.11 - 软件版本配套 - | 软件 | 版本 | 对应分支 | - | ----- | ----- | ----- | - |[CANN](https://www.hiascend.com/developer/download/community/result?module=cann) | 8.1 | - | - |[MindSpore](https://www.mindspore.cn/install/) | 2.7 | master | - |[MSAdapter](https://git.openi.org.cn/OpenI/MSAdapter)| 0.2 | master | - |[MindSpore Transformers](https://gitee.com/mindspore/mindformers)|1.6 | dev | - |[Golden Stick](https://gitee.com/mindspore/golden-stick)|1.1.0 | r1.1.0 | - |[vLLM](https://github.com/vllm-project/vllm) | 0.9.1 | v0.9.1 | - |[vLLM MindSpore](https://gitee.com/mindspore/vllm-mindspore) | 0.3 | master | + | 软件 | 版本 | + | ----- | ----- | + |[CANN](https://www.hiascend.com/developer/download/community/result?module=cann) | 8.1 | + |[MindSpore](https://www.mindspore.cn/install/) | 2.7.0 | + |[MSAdapter](https://git.openi.org.cn/OpenI/MSAdapter)| 0.0.1 | + |[MindSpore Transformers](https://gitee.com/mindspore/mindformers)| 1.6.0 | + |[Golden Stick](https://gitee.com/mindspore/golden-stick)| 1.2.0 | + |[vLLM](https://github.com/vllm-project/vllm) | 0.8.3 | + |[vLLM MindSpore](https://gitee.com/mindspore/vllm-mindspore) | 0.3.0 | ## 配置环境 diff --git a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/environment_variables/environment_variables.md b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/environment_variables/environment_variables.md index 7fd53b3ff3..835946b67d 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/environment_variables/environment_variables.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/environment_variables/environment_variables.md @@ -10,7 +10,37 @@ | TP_SOCKET_IFNAME | 用于多机之间使用TP通信时的网口名称。 | String | 网口名称,例如enp189s0f0。 | 多机场景使用,可通过`ifconfig`查找ip对应网卡的网卡名。 | | HCCL_SOCKET_IFNAME | 用于多机之间使用HCCL通信时的网口名称。 | String | 网口名称,例如enp189s0f0。 | 多机场景使用,可通过`ifconfig`查找ip对应网卡的网卡名。 | | ASCEND_RT_VISIBLE_DEVICES | 指定哪些Device对当前进程可见,支持一次指定一个或多个Device ID。 | String | 为Device ID,逗号分割的字符串,例如"0,1,2,3,4,5,6,7"。 | ray使用场景建议使用。 | -| HCCL_BUFFSIZE | 此环境变量用于控制两个NPU之间共享数据的缓存区大小。 | int | 缓存区大小,大小为MB。例如:`2048`。 | 
使用方法参考:[HCCL_BUFFSIZE](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/81RC1beta1/maintenref/envvar/envref_07_0080.html)。例如DeepSeek 混合并行(数据并行数为32,专家并行数为32),且`max-num-batched-tokens`为256时,则`export HCCL_BUFFSIZE=2048`。 | +| HCCL_BUFFSIZE | 此环境变量用于控制两个NPU之间共享数据的缓存区大小。 | Integer | 缓存区大小,大小为MB。例如:`2048`。 | 使用方法参考:[HCCL_BUFFSIZE](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/81RC1beta1/maintenref/envvar/envref_07_0080.html)。例如DeepSeek 混合并行(数据并行数为32,专家并行数为32),且`max-num-batched-tokens`为256时,则`export HCCL_BUFFSIZE=2048`。 | | MS_MEMPOOL_BLOCK_SIZE | 设置PyNative模式下设备内存池的块大小。 | String | 正整数string,单位为GB。 | | | vLLM_USE_NPU_ADV_STEP_FLASH_OP | 是否使用昇腾`adv_step_flash`算子。 | String | `on`: 使用;`off`:不使用 | 取值为`off`时,将使用小算子实现替代`adv_step_flash`算子。 | | VLLM_TORCH_PROFILER_DIR | 开启profiling采集数据,当配置了采集数据保存路径后生效 | String | Profiling数据保存路径。| | + +以下环境变量由vLLM MindSpore自动注册: + +| 环境变量 | 功能 | 类型 | 取值 | 说明 | +| ------ | ------- | ------ | ------ | ------ | +| USE_TORCH | Transformer运行时依赖该环境变量 | String | 默认值为"False" | vLLM MindSpore 不使用torch 作为后端 | +| USE_TF | Transformer运行时依赖该环境变量 | String | 默认值为"False" | vLLM MindSpore 不使用TensorFlow 作为后端 | +| RUN_MODE | 执行模式为推理 | String | 默认值为"predict" | **该环境变量在后续版本会被移除。** 为MindFormers依赖的环境变量 | +| CUSTOM_MATMUL_SHUFFLE | 开启或关闭自定义矩阵算法的洗牌操作 | String | `on`:开启矩阵洗牌。`off`:关闭矩阵洗牌。默认值为`on`。 | | +| HCCL_DETERMINISTIC | 开启或关闭归约类通信算子的确定性计算,其中归约类通信算子包括 AllReduce、ReduceScatter、Reduce。 | String | `true`:打开 HCCL 确定性开关;`false`:关闭 HCCL 确定性开关。默认值为`false`。 | | +| ASCEND_LAUNCH_BLOCKING | 训练或在线推理场景,可通过此环境变量控制算子执行时是否启动同步模式。 | Integer | `1`:强制算子采用同步模式运行;`0`:不强制算子采用同步模式运行。默认值为`0`。 | | +| TE_PARALLEL_COMPILER | 算子最大并行编译进程数,当大于 1 时开启并行编译。 | Integer | 取值为正整数;最大不超过 cpu 核数*80%/昇腾 AI 处理器个数,取值范围 1~32。默认值是 `0`。 | | +| LCCL_DETERMINISTIC | 设置 LCCL 确定性算子 AllReduce(保序加)是否开启。 | Integer | `1`:打开 LCCL 确定性开关;`0`:关闭 LCCL 确定性开关。默认值是 `0`。 | | +| MS_ENABLE_GRACEFUL_EXIT | 设置使能进程优雅退出 | Integer | `1`:使用进程优雅退出功能。`不设置或者其他值`: 不使用进程优雅退出功能。默认值为`0` | | +| CPU_AFFINIITY | MindSpore推理绑核优化 | String | `True`:开启绑核;`True`:不开启绑核。默认值为`True` | **该环境变量在后续版本会被移除。** 将使用`set_cpu_affinity`接口。 | +| MS_ENABLE_INTERNAL_BOOST | 是否打开 MindSpore 框架的内部加速功能。 | String | `on`:开启 MindSpore 内部加速;`off`:关闭 MindSpore 内部加速。默认值为`on` | | +| MS_ENABLE_LCCL | 是否使用LCCL通信库。 | Integer | `1`:开启,`0`:关闭。默认值为`0`。 | | +| HCCL_EXEC_TIMEOUT | 通过该环境变量可控制设备间执行时同步等待的时间,在该配置时间内各设备进程等待其他设备执行通信同步。 | Integer | 取值范围为:(0, 17340],单位为 s。 默认值为 7200。 | | +| DEVICE_NUM_PER_NODE | 节点上的设备数 | Integer | 默认值为16。 | | +| HCCL_OP_EXPANSION_MODE | 用于配置通信算法的编排展开位置。 | String | `AI_CPU`:通信算法的编排展开位置为Device侧的AI CPU计算单元;`AIV`:通信算法的编排展开位置为Device侧的AI Vector Core计算单元。默认值为`AIV`。 | | +| MS_JIT_MODULES | 指定静态图模式下哪些模块需要JIT静态编译,其函数方法会被编译成静态计算图 | String | 模块名,对应import导入的顶层模块的名称。如果有多个,使用英文逗号分隔。默认值为`"vllm_mindspore,research"`。 | | +| GLOG_v | 控制日志的级别 | Integer | `0`:DEBUG;`1`:INFO;`2`:WARNING;`3`:ERROR,表示程序执行出现报错,输出错误日志,程序可能不会终止;`4`:CRITICAL,表示程序执行出现异常,将会终止执行程序。默认值为`3`。 | | +| RAY_CGRAPH_get_timeout | `ray.get()`方法的超时时间。 | Integer | 默认值为`360`。 | | +| MS_NODE_TIMEOUT | 节点心跳超时时间,单位:秒。 | Integer | 默认值为`180`。 | | + +更多的环境变量信息,请查看: + - [CANN 环境变量列表](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/81RC1beta1/index/index.html) + - [MindSpore 环境变量列表](https://www.mindspore.cn/docs/zh-CN/r2.7.0rc1/api_python/env_var_list.html) + - [MindSpore Transformers 环境变量列表](https://www.mindspore.cn/mindformers/docs/zh-CN/r1.6.0/index.html) + - [vLLM 环境变量列表](https://docs.vllm.ai/en/v0.8.4/serving/env_vars.html) -- Gitee From 10cecf67266ac2a31684ed427bd13febb662c7f6 Mon Sep 17 00:00:00 
2001 From: horcam Date: Tue, 12 Aug 2025 16:14:39 +0800 Subject: [PATCH 6/7] update others --- .../user_guide/supported_features/profiling/profiling.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/profiling/profiling.md b/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/profiling/profiling.md index b24b541e59..897f1ec0c8 100644 --- a/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/profiling/profiling.md +++ b/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/profiling/profiling.md @@ -40,7 +40,7 @@ curl -X POST http://127.0.0.1:8000/start_profile curl http://localhost:8000/v1/completions \ -H "Content-Type: application/json" \ -d '{ - "model": "/home/DeepSeekV3", + "model": "Qwen/Qwen2.5-32B-Instruct", "prompt": "San Francisco is a", "max_tokens": 7, "temperature": 0 -- Gitee From 242892400a1ac78a4a22a45f1b0cb59b80c9146d Mon Sep 17 00:00:00 2001 From: horcam Date: Wed, 13 Aug 2025 16:19:23 +0800 Subject: [PATCH 7/7] fix for comment --- .../installation/installation.md | 93 +++++++++------- .../deepseek_r1_671b_w8a8_dp4_tp4_ep4.md | 19 ++-- .../qwen2.5_32b_multiNPU.md | 4 +- .../qwen2.5_7b_singleNPU.md | 2 +- .../environment_variables.md | 34 +----- .../quantization/quantization.md | 2 +- .../models_list/models_list.md | 2 +- .../installation/installation.md | 105 ++++++++++-------- .../quick_start/quick_start.md | 4 +- .../deepseek_r1_671b_w8a8_dp4_tp4_ep4.md | 17 +-- .../qwen2.5_32b_multiNPU.md | 4 +- .../qwen2.5_7b_singleNPU.md | 6 +- .../environment_variables.md | 53 +++------ .../quantization/quantization.md | 2 +- .../models_list/models_list.md | 2 +- 15 files changed, 160 insertions(+), 189 deletions(-) diff --git a/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md b/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md index 9c3f34359f..7299d810a1 100644 --- a/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md +++ b/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md @@ -4,8 +4,7 @@ This document describes the steps to install the vLLM MindSpore environment. Three installation methods are provided: -- [Docker Installation](#docker-installation): Suitable for quick deployment scenarios. -- [Pip Installation](#pip-installation): Suitable for scenarios requiring specific versions. +- [Docker Installation](#docker-installation): Suitable for quick deployment scenarios. - [Source Code Installation](#source-code-installation): Suitable for incremental development of vLLM MindSpore. ## Version Compatibility @@ -14,19 +13,19 @@ This document describes the steps to install the vLLM MindSpore environment. 
Thr - Python: 3.9 / 3.10 / 3.11 - Software version compatibility - | Software | Version | + | Software | Version And Links | | ----- | ----- | - |[CANN](https://www.hiascend.com/developer/download/community/result?module=cann) | 8.1 | - |[MindSpore](https://www.mindspore.cn/install/) | 2.7.0 | - |[MSAdapter](https://git.openi.org.cn/OpenI/MSAdapter)| 0.0.1 | - |[MindSpore Transformers](https://gitee.com/mindspore/mindformers)| 1.6.0 | - |[Golden Stick](https://gitee.com/mindspore/golden-stick)| 1.2.0 | - |[vLLM](https://github.com/vllm-project/vllm) | 0.8.3 | - |[vLLM MindSpore](https://gitee.com/mindspore/vllm-mindspore) | 0.3.0 | + |[CANN](https://www.hiascend.com/developer/download/community/result?module=cann) | [8.1.RC1](https://www.hiascend.com/document/detail/zh/canncommercial/81RC1/softwareinst/instg/instg_0000.html?Mode=PmIns&InstallType=local&OS=Debian&Software=cannToolKit) | + |[MindSpore](https://www.mindspore.cn/install/) | [2.7.0](https://repo.mindspore.cn/mindspore/mindspore/version/202508/20250814/master_20250814091143_7548abc43af03319bfa528fc96d0ccd3917fcc9c_newest/unified/) | + |[MSAdapter](https://git.openi.org.cn/OpenI/MSAdapter)| [0.5.0](https://repo.mindspore.cn/mindspore/msadapter/version/202508/20250814/master_20250814010018_4615051c43eef898b6bbdc69768656493b5932f8_newest/any/) | + |[MindSpore Transformers](https://gitee.com/mindspore/mindformers)| [1.6.0](https://gitee.com/mindspore/mindformers) | + |[Golden Stick](https://gitee.com/mindspore/golden-stick)| [1.2.0](https://repo.mindspore.cn/mindspore/golden-stick/version/202508/20250814/master_20250814010017_2713821db982330b3bcd6d84d85a3b337d555f27_newest/any/) | + |[vLLM](https://github.com/vllm-project/vllm) | [0.9.1](https://repo.mindspore.cn/mirrors/vllm/version/202505/20250514/v0.8.4.dev0_newest/any/) | + |[vLLM MindSpore](https://gitee.com/mindspore/vllm-mindspore) | [0.3.0](https://gitee.com/mindspore/vllm-mindspore/) | ## Environment Setup -This section introduces three installation methods: [Docker Installation](#docker-installation), [Pip Installation](#pip-installation), [Source Code Installation](#source-code-installation), and [Quick Verification](#quick-verification) example to check the installation. +This section introduces two installation methods: [Docker Installation](#docker-installation), [Source Code Installation](#source-code-installation), and [Quick Verification](#quick-verification) example to check the installation. ### Docker Installation @@ -106,51 +105,55 @@ docker exec -it $DOCKER_NAME bash ### Source Code Installation -- **CANN Installation** +#### CANN Installation - For CANN installation methods and environment configuration, please refer to [CANN Community Edition Installation Guide](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/82RC1alpha002/softwareinst/instg/instg_0001.html?Mode=PmIns&OS=openEuler&Software=cannToolKit). If you encounter any issues during CANN installation, please consult the [Ascend FAQ](https://www.hiascend.com/document/detail/zh/AscendFAQ/ProduTech/CANNFAQ/cannfaq_000.html) for troubleshooting. +For CANN installation methods and environment configuration, please refer to [CANN Community Edition Installation Guide](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/82RC1alpha002/softwareinst/instg/instg_0001.html?Mode=PmIns&OS=openEuler&Software=cannToolKit). 
If you encounter any issues during CANN installation, please consult the [Ascend FAQ](https://www.hiascend.com/document/detail/zh/AscendFAQ/ProduTech/CANNFAQ/cannfaq_000.html) for troubleshooting. - The default installation path for CANN is `/usr/local/Ascend`. After completing CANN installation, configure the environment variables with the following commands: +The default installation path for CANN is `/usr/local/Ascend`. After completing CANN installation, configure the environment variables with the following commands: - ```bash - LOCAL_ASCEND=/usr/local/Ascend # the root directory of run package - source ${LOCAL_ASCEND}/ascend-toolkit/set_env.sh - export ASCEND_CUSTOM_PATH=${LOCAL_ASCEND}/ascend-toolkit - ``` +```bash +LOCAL_ASCEND=/usr/local/Ascend # the root directory of run package +source ${LOCAL_ASCEND}/ascend-toolkit/set_env.sh +export ASCEND_CUSTOM_PATH=${LOCAL_ASCEND}/ascend-toolkit +``` -- **vLLM Prerequisites Installation** +#### vLLM Prerequisites Installation For vLLM environment configuration and installation methods, please refer to the [vLLM Installation Guide](https://docs.vllm.ai/en/v0.9.1/getting_started/installation/cpu.html). In vLLM installation, `gcc/g++ >= 12.3.0` is required, and it could be installed by the following command: - ```bash - yum install -y gcc gcc-c++ - ``` +```bash +yum install -y gcc gcc-c++ +``` -- **vLLM MindSpore Installation** +#### vLLM MindSpore Installation - To install vLLM MindSpore, user needs to pull the vLLM MindSpore source code and then runs the following command to install the dependencies: +vLLM MindSpore can be installed in the following two ways. **vLLM MindSpore One-click Installation** is suitable for scenarios where users need quick deployment and usage. **vLLM MindSpore Manual Installation** is suitable for scenarios where users require custom modifications to the components. - ```bash - git clone https://gitee.com/mindspore/vllm-mindspore.git - cd vllm-mindspore - bash install_depend_pkgs.sh - ``` +- **vLLM MindSpore One-click Installation** - Compile and install vLLM MindSpore: + To install vLLM MindSpore, users need to pull the vLLM MindSpore source code and then run the following command to install the dependencies: - ```bash - pip install . - ``` + ```bash + git clone https://gitee.com/mindspore/vllm-mindspore.git + cd vllm-mindspore + bash install_depend_pkgs.sh + ``` - After executing the above commands, `mindformers` folder will be generated in the `vllm-mindspore/install_depend_pkgs` directory. Add this folder to the environment variables: + Compile and install vLLM MindSpore: - ```bash - export PYTHONPATH=$MF_PATH:$PYTHONPATH - ``` + ```bash + pip install . + ``` -- **Manual Component Installation** + After executing the above commands, `mindformers` folder will be generated in the `vllm-mindspore/install_depend_pkgs` directory. Add this folder to the environment variables: - If user need to modify the components or use other versions, components need to be manually installed in a specific order. vLLM MindSpore requires the following installation sequence: + ```bash + export PYTHONPATH=$MF_PATH:$PYTHONPATH + ``` + +- **vLLM MindSpore Manual Installation** + + If users need to modify the components or use other versions, components need to be manually installed in a specific order. Version compatibility of vLLM MindSpore can be found in [Version Compatibility](#version-compatibility), and vLLM MindSpore requires the following installation sequence: 1. 
Install vLLM @@ -174,7 +177,7 @@ docker exec -it $DOCKER_NAME bash ```bash git clone https://gitee.com/mindspore/mindformers.git - export PYTHONPATH=`realpath mindformers`:$PYTHONPATH + export PYTHONPATH=$MF_PATH:$PYTHONPATH ``` 5. Install Golden Stick @@ -189,13 +192,17 @@ docker exec -it $DOCKER_NAME bash pip install /path/to/msadapter-*.whl ``` - 7. Install vLLM MindSpore + 7. Install vLLM MindSpore + + Users need to pull the vLLM MindSpore source code and then run the installation: ```bash + git clone https://gitee.com/mindspore/vllm-mindspore.git + cd vllm-mindspore pip install . ``` -### Quick Verification +## Quick Verification User can verify the installation with a simple offline inference test. First, user need to configure the environment variables with the following command: diff --git a/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md b/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md index 8386c3d11c..9c369446a2 100644 --- a/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md +++ b/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md @@ -52,8 +52,8 @@ Execute the following Python script to download the MindSpore-compatible DeepSee ```python from openmind_hub import snapshot_download -snapshot_download(repo_id="MindSpore-Lab/DeepSeek-R1-W8A8", - local_dir="/path/to/save/deepseek_r1_w8a8", +snapshot_download(repo_id="MindSpore-Lab/DeepSeek-R1-0528-A8W8", + local_dir="/path/to/save/deepseek_r1_0528_a8w8", local_dir_use_symlinks=False) ``` @@ -78,7 +78,7 @@ If the tool is unavailable, install [git-lfs](https://git-lfs.com) first. Refer Once confirmed, download the weights by executing the following command: ```shell -git clone https://modelers.cn/MindSpore-Lab/DeepSeek-R1-W8A8.git +git clone https://modelers.cn/models/MindSpore-Lab/DeepSeek-R1-0528-A8W8.git ``` ## TP16 Tensor Parallel Inference @@ -241,17 +241,17 @@ Execution example: ```bash # Master node: -vllm-mindspore serve --model="MindSpore-Lab/DeepSeek-R1-W8A8" --trust-remote-code --max-num-seqs=256 --max_model_len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 16 --distributed-executor-backend=ray +vllm-mindspore serve --model="MindSpore-Lab/DeepSeek-R1-0528-A8W8" --trust-remote-code --max-num-seqs=256 --max_model_len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 16 --distributed-executor-backend=ray ``` -In tensor parallel scenarios, the `--tensor-parallel-size` parameter overrides the `model_parallel` configuration in the model YAML file. User can also set the local model path by `--model` argument. +In tensor parallel scenarios, the `--tensor-parallel-size` parameter overrides the `model_parallel` configuration in the model YAML file. Users can also set the local model path via the `--model` argument. 
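Before moving on to sending requests, it can help to confirm that the service has finished loading the weights and is accepting connections. The following is a minimal sketch, not part of the original tutorial, that polls the OpenAI-compatible `/v1/models` endpoint with the Python standard library; the address `127.0.0.1:8000` and the timeout values are assumptions and should be adjusted to the actual `--host`/`--port` settings:

```python
import json
import time
from urllib.error import URLError
from urllib.request import urlopen

# Assumed service address; adjust if the server was started with a different --host/--port.
BASE_URL = "http://127.0.0.1:8000"


def wait_for_server(timeout_s: int = 1800) -> None:
    """Poll /v1/models until the service responds or the timeout expires."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            with urlopen(f"{BASE_URL}/v1/models", timeout=5) as resp:
                data = json.load(resp)
                print("Service ready, served models:", [m["id"] for m in data.get("data", [])])
                return
        except (URLError, OSError):
            time.sleep(10)  # Large checkpoints such as DeepSeek-R1 can take a long time to load.
    raise TimeoutError("vLLM MindSpore service did not become ready in time.")


if __name__ == "__main__":
    wait_for_server()
```

Once the endpoint responds, requests can be sent as described below.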
#### Sending Requests Use the following command to send requests, where `prompt` is the model input: ```bash -curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "MindSpore-Lab/DeepSeek-R1-W8A8", "prompt": "I am", "max_tokens": 20, "temperature": 0, "top_p": 1.0, "top_k": 1, "repetition_penalty": 1.0}' +curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "MindSpore-Lab/DeepSeek-R1-0528-A8W8", "prompt": "I am", "max_tokens": 20, "temperature": 0, "top_p": 1.0, "top_k": 1, "repetition_penalty": 1.0}' ``` User needs to ensure that the `"model"` field matches the `--model` in the service startup, and the request can successfully match the model. @@ -304,6 +304,7 @@ parallel_config: ### Online Inference #### Starting the Service + `vllm-mindspore` can deploy online inference using the OpenAI API protocol. Below is the workflow for launching the service: ```bash @@ -330,10 +331,10 @@ User can also set the local model path by `--model` argument. The following is a ```bash # Master node: -vllm-mindspore serve --model="MindSpore-Lab/DeepSeek-R1-W8A8" --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 0 --data-parallel-address 192.10.10.10 --data-parallel-rpc-port 12370 --enable-expert-parallel +vllm-mindspore serve --model="MindSpore-Lab/DeepSeek-R1-0528-A8W8" --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 0 --data-parallel-address 192.10.10.10 --data-parallel-rpc-port 12370 --enable-expert-parallel # Worker node: -vllm-mindspore serve --headless --model="MindSpore-Lab/DeepSeek-R1-W8A8" --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 2 --data-parallel-address 192.10.10.10 --data-parallel-rpc-port 12370 --enable-expert-parallel +vllm-mindspore serve --headless --model="MindSpore-Lab/DeepSeek-R1-0528-A8W8" --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 2 --data-parallel-address 192.10.10.10 --data-parallel-rpc-port 12370 --enable-expert-parallel ``` #### Sending Requests @@ -341,7 +342,7 @@ vllm-mindspore serve --headless --model="MindSpore-Lab/DeepSeek-R1-W8A8" --trust Use the following command to send requests, where `prompt` is the model input: ```bash -curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "MindSpore-Lab/DeepSeek-R1-W8A8", "prompt": "I am", "max_tokens": 20, "temperature": 0}' +curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "MindSpore-Lab/DeepSeek-R1-0528-A8W8", "prompt": "I am", "max_tokens": 20, "temperature": 0}' ``` User needs to ensure that the `"model"` field matches the `--model` in the service startup, and the request can successfully match the model. 
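Where a scripted client is more convenient than curl, the sketch below sends an equivalent completion request through the same OpenAI-compatible endpoint using only the Python standard library. The model name, address, and sampling parameters simply mirror the curl examples above and must match the values used when the service was started:

```python
import json
from urllib.request import Request, urlopen

# These values mirror the curl example above; adjust them to the actual service configuration.
URL = "http://localhost:8000/v1/completions"
payload = {
    "model": "MindSpore-Lab/DeepSeek-R1-0528-A8W8",
    "prompt": "I am",
    "max_tokens": 20,
    "temperature": 0,
}

request = Request(
    URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urlopen(request) as response:
    result = json.load(response)

# Print only the generated text of the first returned choice.
print(result["choices"][0]["text"])
```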
diff --git a/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md b/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md index 0451257e2b..40ad4a1597 100644 --- a/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md +++ b/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md @@ -134,7 +134,7 @@ export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Tra Here is an explanation of these environment variables: - `vLLM_MODEL_BACKEND`: The model backend. Currently supported models and backends are listed in the [Model Support List](../../../user_guide/supported_models/models_list/models_list.md). -- `MINDFORMERS_MODEL_CONFIG`: Model configuration file. User can find the corresponding YAML file in the [MindSpore Transformers repository](https://gitee.com/mindspore/mindformers/tree/r1.5.0/research/qwen2_5). For Qwen2.5-32B, the YAML file is [predict_qwen2_5_32b_instruct.yaml](https://gitee.com/mindspore/mindformers/blob/r1.5.0/research/qwen2_5/predict_qwen2_5_32b_instruct.yaml). +- `MINDFORMERS_MODEL_CONFIG`: Model configuration file. User can find the corresponding YAML file in the [MindSpore Transformers repository](https://gitee.com/mindspore/mindformers/tree/master/research/qwen2_5). For Qwen2.5-32B, the YAML file is [predict_qwen2_5_32b_instruct.yaml](https://gitee.com/mindspore/mindformers/blob/master/research/qwen2_5/predict_qwen2_5_32b_instruct.yaml). Users can check memory usage with `npu-smi info` and set the NPU cards for inference using the following example (assuming cards 4,5,6,7 are used): @@ -188,7 +188,7 @@ If processed successfully, the inference result will be: { "id":"cmpl-11fe2898c77d4ff18c879f57ae7aa9ca","object":"text_completion", "create":1748568696, - "model":"Qwen/Qwen2.5-32B-Instruct", + "model":"Qwen2.5-32B-Instruct", "choices":[ { "index":0, diff --git a/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md b/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md index e1216d12c9..79ed73f812 100644 --- a/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md +++ b/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md @@ -222,7 +222,7 @@ If the request is processed successfully, the following inference result will be { "id":"cmpl-5e6e314861c24ba79fea151d86c1b9a6","object":"text_completion", "create":1747398389, - "model":"Qwen/Qwen2.5-7B-Instruct", + "model":"Qwen2.5-7B-Instruct", "choices":[ { "index":0, diff --git a/docs/vllm_mindspore/docs/source_en/user_guide/environment_variables/environment_variables.md b/docs/vllm_mindspore/docs/source_en/user_guide/environment_variables/environment_variables.md index 35129a6215..036834798e 100644 --- a/docs/vllm_mindspore/docs/source_en/user_guide/environment_variables/environment_variables.md +++ b/docs/vllm_mindspore/docs/source_en/user_guide/environment_variables/environment_variables.md @@ -15,33 +15,9 @@ | `vLLM_USE_NPU_ADV_STEP_FLASH_OP` | Whether to use Ascend operation `adv_step_flash` | String | `on`: Use;`off`:Not use | If the variable is set to `off`, model will use the implement of small operations. 
| | `VLLM_TORCH_PROFILER_DIR` | Enables profiling data collection and takes effect when a data save path is configured. | String | The path to save profiling data. | | -The following environment variables are automatically registered by vLLM MindSpore: +More environment variable information can be referred in the following links: -| **Environment Variable** | **Function** | **Type** | **Values** | **Description** | -|------------------------|-------------|----------|------------|----------------| -| `USE_TORCH` | Transformer runtime depends on this variable. | String | Default: `"False"` | vLLM MindSpore does not use Torch as the backend. | -| `USE_TF` | Transformer runtime depends on this variable. | String | Default: `"False"` | vLLM MindSpore does not use TensorFlow as the backend. | -| `RUN_MODE` | Execution mode. | String | Default: `"predict"` | **This variable will be removed in future versions.** Required by MindFormers. | -| `CUSTOM_MATMUL_SHUFFLE` | Enables or disables custom matrix shuffling algorithm . | String | `on`: Enable shuffling. `off`: Disable shuffling. Default: `on`. | | -| `HCCL_DETERMINISTIC` | Enables or disables deterministic computation for reduction-type communication operators (e.g., AllReduce, ReduceScatter, Reduce). | String | `true`: Enable deterministic mode. `false`: Disable deterministic mode. Default: `false`. | | -| `ASCEND_LAUNCH_BLOCKING` | Controls whether operators run in synchronous mode during training or online inference. | Integer | `1`: Force synchronous execution. `0`: Do not force synchronous execution. Default: `0`. | | -| `TE_PARALLEL_COMPILER` | Maximum number of parallel compilation processes for operators. Parallel compilation is enabled if greater than 1. | Integer | Positive integer; Max = CPU cores * 80% / # of Ascend AI processors. Range: 1~32. Default: `0`. | | -| `LCCL_DETERMINISTIC` | Controls whether LCCL deterministic AllReduce (ordered addition) is enabled. | Integer | `1`: Enable deterministic mode. `0`: Disable deterministic mode. Default: `0`. | | -| `MS_ENABLE_GRACEFUL_EXIT` | Enables graceful process termination. | Integer | `1`: Enable graceful exit. `Other values`: Disable graceful exit. Default: `0`. | | -| `CPU_AFFINITY` | Optimizes CPU core binding for MindSpore inference. | String | `True`: Enable core binding. `False`: Disable core binding. Default: `True`. | **This variable will be removed in future versions.** Replaced by `set_cpu_affinity` API. | -| `MS_ENABLE_INTERNAL_BOOST` | Enables or disables MindSpore framework's internal acceleration. | String | `on`: Enable acceleration. `off`: Disable acceleration. Default: `on`. | | -| `MS_ENABLE_LCCL` | Controls whether the LCCL communication library is used. | Integer | `1`: Enable. `0`: Disable. Default: `0`. | | -| `HCCL_EXEC_TIMEOUT` | Controls the synchronization timeout for inter-device execution. | Integer | Range: (0, 17340] (seconds). Default: `7200`. | | -| `DEVICE_NUM_PER_NODE` | Number of devices per node. | Integer | Default: `16`. | | -| `HCCL_OP_EXPANSION_MODE` | Configures the expansion location for communication algorithms. | String | `AI_CPU`: Expands on AI CPU compute units. `AIV`: Expands on AI Vector Core compute units. Default: `AIV`. | | -| `MS_JIT_MODULES` | Specifies modules to be JIT-compiled in static graph mode. | String | Module names (top-level imports). Multiple modules should be comma-separated. Default: `"vllm_mindspore,research"`. | | -| `GLOG_v` | Controls log level. | Integer | `0`: DEBUG. `1`: INFO. `2`: WARNING. 
`3`: ERROR (logs errors, may not terminate). `4`: CRITICAL (logs critical errors, terminates execution). Default: `3`. | | -| `RAY_CGRAPH_get_timeout` | Timeout for `ray.get()` method (seconds). | Integer | Default: `360`. | | -| `MS_NODE_TIMEOUT` | Node heartbeat timeout (seconds). | Integer | Default: `180`. | | - -More environment variable information can be referred in the following link: - - - [CANN Environment Variable List](https://www.hiascend.com/document/detail/en/CANNCommunityEdition/81RC1beta1/index/index.html) - - [MindSpore Environment Variable List](https://www.mindspore.cn/docs/en/r2.7.0rc1/api_python/env_var_list.html) - - [MindSpore Transformers Environment Variable List](https://www.mindspore.cn/mindformers/docs/en/r1.6.0/index.html) - - [vLLM Environment Variable List](https://docs.vllm.ai/en/v0.8.4/serving/env_vars.html) +- [CANN Environment Variable List](https://www.hiascend.com/document/detail/en/CANNCommunityEdition/81RC1beta1/index/index.html) +- [MindSpore Environment Variable List](https://www.mindspore.cn/docs/en/master/api_python/env_var_list.html) +- [MindSpore Transformers Environment Variable List](https://www.mindspore.cn/mindformers/docs/en/master/index.html) +- [vLLM Environment Variable List](https://docs.vllm.ai/en/v0.8.4/serving/env_vars.html) diff --git a/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/quantization/quantization.md b/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/quantization/quantization.md index 33c39b583d..fa1b8f89c3 100644 --- a/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/quantization/quantization.md +++ b/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/quantization/quantization.md @@ -16,7 +16,7 @@ We employ [MindSpore Golden Stick's PTQ algorithm](https://gitee.com/mindspore/g ### Downloading Quantized Weights -We have uploaded the quantized DeepSeek-R1 to [ModelArts Community](https://modelers.cn): [MindSpore-Lab/DeepSeek-R1-W8A8](https://modelers.cn/models/MindSpore-Lab/DeepSeek-R1-W8A8). Refer to the [ModelArts Community documentation](https://modelers.cn/docs/en/openmind-hub-client/0.9/basic_tutorial/download.html) to download the weights locally. +We have uploaded the quantized DeepSeek-R1 to [ModelArts Community](https://modelers.cn): [MindSpore-Lab/DeepSeek-R1-0528-A8W8](https://modelers.cn/models/MindSpore-Lab/DeepSeek-R1-0528-A8W8). Refer to the [ModelArts Community documentation](https://modelers.cn/docs/en/openmind-hub-client/0.9/basic_tutorial/download.html) to download the weights locally. 
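As one possible way to fetch the weights, the openMind Hub client shown in the DeepSeek tutorial of this documentation set can also be called directly; the local directory below is only a placeholder path:

```python
from openmind_hub import snapshot_download

# Download the quantized DeepSeek-R1 weights; the local_dir value is a placeholder.
snapshot_download(repo_id="MindSpore-Lab/DeepSeek-R1-0528-A8W8",
                  local_dir="/path/to/save/deepseek_r1_0528_a8w8",
                  local_dir_use_symlinks=False)
```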
## Quantized Model Inference diff --git a/docs/vllm_mindspore/docs/source_en/user_guide/supported_models/models_list/models_list.md b/docs/vllm_mindspore/docs/source_en/user_guide/supported_models/models_list/models_list.md index ba825bcae5..3d9b49bd6e 100644 --- a/docs/vllm_mindspore/docs/source_en/user_guide/supported_models/models_list/models_list.md +++ b/docs/vllm_mindspore/docs/source_en/user_guide/supported_models/models_list/models_list.md @@ -6,7 +6,7 @@ |-------| --------- | ---- | | DeepSeek-V3 | Supported | [DeepSeek-V3](https://modelers.cn/models/MindSpore-Lab/DeepSeek-V3) | | DeepSeek-R1 | Supported | [DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-V3) | -| DeepSeek-R1 W8A8 | Supported | [Deepseek-R1-W8A8](https://modelers.cn/models/MindSpore-Lab/DeepSeek-r1-w8a8) | +| DeepSeek-R1 W8A8 | Supported | [Deepseek-R1-W8A8](https://modelers.cn/models/MindSpore-Lab/DeepSeek-R1-0528-A8W8) | | Qwen2.5 | Supported | [Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct), [Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct), [Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct), [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct), [Qwen2.5-14B-Instruct](https://huggingface.co/Qwen/Qwen2.5-14B-Instruct), [Qwen2.5-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct), [Qwen2.5-72B-Instruct](https://huggingface.co/Qwen/Qwen2.5-72B-Instruct) | | Qwen3-32B | Supported | [Qwen3-32B](https://modelers.cn/models/MindSpore-Lab/Qwen3-32B) | | Qwen3-235B-A22B | Supported | [Qwen3-235B-A22B](https://huggingface.co/Qwen/Qwen3-235B-A22B) | diff --git a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md index 8d4eb6a8d6..67269d9419 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md @@ -13,19 +13,19 @@ - Python:3.9 / 3.10 / 3.11 - 软件版本配套 - | 软件 | 版本 | + | 软件 | 配套版本与下载链接 | | ----- | ----- | - |[CANN](https://www.hiascend.com/developer/download/community/result?module=cann) | 8.1 | - |[MindSpore](https://www.mindspore.cn/install/) | 2.7.0 | - |[MSAdapter](https://git.openi.org.cn/OpenI/MSAdapter)| 0.0.1 | - |[MindSpore Transformers](https://gitee.com/mindspore/mindformers)| 1.6.0 | - |[Golden Stick](https://gitee.com/mindspore/golden-stick)| 1.2.0 | - |[vLLM](https://github.com/vllm-project/vllm) | 0.8.3 | - |[vLLM MindSpore](https://gitee.com/mindspore/vllm-mindspore) | 0.3.0 | + |[CANN](https://www.hiascend.com/developer/download/community/result?module=cann) | [8.1.RC1](https://www.hiascend.com/document/detail/zh/canncommercial/81RC1/softwareinst/instg/instg_0000.html?Mode=PmIns&InstallType=local&OS=Debian&Software=cannToolKit) | + |[MindSpore](https://www.mindspore.cn/install/) | [2.7.0](https://repo.mindspore.cn/mindspore/mindspore/version/202508/20250814/master_20250814091143_7548abc43af03319bfa528fc96d0ccd3917fcc9c_newest/unified/) | + |[MSAdapter](https://git.openi.org.cn/OpenI/MSAdapter)| [0.5.0](https://repo.mindspore.cn/mindspore/msadapter/version/202508/20250814/master_20250814010018_4615051c43eef898b6bbdc69768656493b5932f8_newest/any/) | + |[MindSpore Transformers](https://gitee.com/mindspore/mindformers)| [1.6.0](https://gitee.com/mindspore/mindformers) | + |[Golden Stick](https://gitee.com/mindspore/golden-stick)| 
[1.2.0](https://repo.mindspore.cn/mindspore/golden-stick/version/202508/20250814/master_20250814010017_2713821db982330b3bcd6d84d85a3b337d555f27_newest/any/) | + |[vLLM](https://github.com/vllm-project/vllm) | [0.9.1](https://repo.mindspore.cn/mirrors/vllm/version/202505/20250514/v0.8.4.dev0_newest/any/) | + |[vLLM MindSpore](https://gitee.com/mindspore/vllm-mindspore) | [0.3.0](https://gitee.com/mindspore/vllm-mindspore/) | ## 配置环境 -在本章节中,我们将介绍[docker安装](#docker安装)、[pip安装](#pip安装)、[源码安装](#源码安装)三种安装方式,以及[快速验证](#快速验证)用例,用于验证安装是否成功。 +在本章节中,我们将介绍[docker安装](#docker安装)、[源码安装](#源码安装)两种安装方式,以及[快速验证](#快速验证)用例,用于验证安装是否成功。 ### docker安装 @@ -105,29 +105,33 @@ docker exec -it $DOCKER_NAME bash ### 源码安装 -- **CANN安装** +#### CANN安装 - CANN安装方法与环境配套,请参考[CANN社区版软件安装](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/82RC1alpha002/softwareinst/instg/instg_0001.html?Mode=PmIns&OS=openEuler&Software=cannToolKit),若用户在安装CANN过程中遇到问题,可参考[昇腾常见问题](https://www.hiascend.com/document/detail/zh/AscendFAQ/ProduTech/CANNFAQ/cannfaq_000.html)进行解决。 +CANN安装方法与环境配套,请参考[CANN社区版软件安装](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/82RC1alpha002/softwareinst/instg/instg_0001.html?Mode=PmIns&OS=openEuler&Software=cannToolKit),若用户在安装CANN过程中遇到问题,可参考[昇腾常见问题](https://www.hiascend.com/document/detail/zh/AscendFAQ/ProduTech/CANNFAQ/cannfaq_000.html)进行解决。 - CANN默认安装路径为`/usr/local/Ascend`。用户在安装CANN完毕后,使用如下命令,为CANN配置环境变量: +CANN默认安装路径为`/usr/local/Ascend`。用户在安装CANN完毕后,使用如下命令,为CANN配置环境变量: - ```bash - LOCAL_ASCEND=/usr/local/Ascend # the root directory of run package - source ${LOCAL_ASCEND}/ascend-toolkit/set_env.sh - export ASCEND_CUSTOM_PATH=${LOCAL_ASCEND}/ascend-toolkit - ``` +```bash +LOCAL_ASCEND=/usr/local/Ascend # the root directory of run package +source ${LOCAL_ASCEND}/ascend-toolkit/set_env.sh +export ASCEND_CUSTOM_PATH=${LOCAL_ASCEND}/ascend-toolkit +``` -- **vLLM前置依赖安装** +#### vLLM前置依赖安装 vLLM的环境配置与安装方法,请参考[vLLM安装教程](https://docs.vllm.ai/en/v0.9.1/getting_started/installation/cpu.html)。其依赖`gcc/g++ >= 12.3.0`版本,可通过以下命令完成安装: - ```bash - yum install -y gcc gcc-c++ - ``` +```bash +yum install -y gcc gcc-c++ +``` + +#### vLLM MindSpore安装 -- **vLLM MindSpore安装** +vLLM MindSpore有以下两种安装方式。**vLLM MindSpore一键式安装**适用于用户快速使用与部署的场景。**vLLM MindSpore手动安装**适用于用户对组件有自定义修改的场景。 - 安装vLLM MindSpore,需要在拉取vLLM MindSpore源码后,执行以下命令,安装依赖包: +- **vLLM MindSpore一键式安装** + + 采用一键式安装脚本来安装vLLM MindSpore,需要在拉取vLLM MindSpore源码后,执行以下命令,安装依赖包: ```bash git clone https://gitee.com/mindspore/vllm-mindspore.git @@ -147,53 +151,58 @@ docker exec -it $DOCKER_NAME bash export PYTHONPATH=$MF_PATH:$PYTHONPATH ``` -- **组件手动安装** +- **vLLM MindSpore手动安装** + + 若用户对组件有修改,或者需使用其他版本,则用户需要按照特定顺序,手动安装组件。vLLM MindSpore软件配套下载地址可以参考[版本配套](#版本配套),且对组件的安装顺序要求如下: - 若用户对组件有修改,或者需使用其他版本,则用户需要按照特定顺序,手动安装组件。vLLM MindSpore对组件的安装顺序要求如下: 1. 安装vLLM - ```bash - pip install /path/to/vllm-*.whl - ``` + ```bash + pip install /path/to/vllm-*.whl + ``` 2. 卸载torch相关组件 - ```bash - pip uninstall torch torch-npu torchvision torchaudio -y - ``` + ```bash + pip uninstall torch torch-npu torchvision torchaudio -y + ``` 3. 安装MindSpore - ```bash - pip install /path/to/mindspore-*.whl - ``` + ```bash + pip install /path/to/mindspore-*.whl + ``` 4. 引入MindSpore Transformers仓,加入到`PYTHONPATH`中 - ```bash - git clone https://gitee.com/mindspore/mindformers.git - export PYTHONPATH=`realpath mindformers`:$PYTHONPATH - ``` + ```bash + git clone https://gitee.com/mindspore/mindformers.git + export PYTHONPATH=$MF_PATH:$PYTHONPATH + ``` 5. 
安装Golden Stick - ```bash - pip install /path/to/mindspore_gs-*.whl - ``` + ```bash + pip install /path/to/mindspore_gs-*.whl + ``` 6. 安装MSAdapter - ```bash - pip install /path/to/msadapter-*.whl - ``` + ```bash + pip install /path/to/msadapter-*.whl + ``` 7. 安装vLLM MindSpore - ```bash - pip install . - ``` + 需要先拉取vLLM MindSpore源码,再执行安装 + + ```bash + git clone https://gitee.com/mindspore/vllm-mindspore.git + cd vllm-mindspore + pip install . + ``` -### 快速验证 +## 快速验证 用户可以创建一个简单的离线推理场景,验证安装是否成功。下面以[Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) 为例。首先用户需要执行以下命令,设置环境变量: diff --git a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/quick_start/quick_start.md b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/quick_start/quick_start.md index 82d5534bda..addd3951d0 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/quick_start/quick_start.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/quick_start/quick_start.md @@ -138,7 +138,7 @@ export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Tra 以下是对上述环境变量的解释: - `vLLM_MODEL_BACKEND`:所运行的模型后端。目前vLLM MindSpore所支持的模型与模型后端,可在[模型支持列表](../../user_guide/supported_models/models_list/models_list.md)中进行查询; -- `MINDFORMERS_MODEL_CONFIG`:模型配置文件。用户可以在[MindSpore Transformers工程](https://gitee.com/mindspore/mindformers/tree/r1.5.0/research/qwen2_5)中,找到对应模型的yaml文件。以Qwen2.5-7B为例,则其yaml文件为[predict_qwen2_5_7b_instruct.yaml](https://gitee.com/mindspore/mindformers/blob/r1.5.0/research/qwen2_5/predict_qwen2_5_7b_instruct.yaml) 。 +- `MINDFORMERS_MODEL_CONFIG`:模型配置文件。用户可以在[MindSpore Transformers工程](https://gitee.com/mindspore/mindformers/tree/master/research/qwen2_5)中,找到对应模型的yaml文件。以Qwen2.5-7B为例,则其yaml文件为[predict_qwen2_5_7b_instruct.yaml](https://gitee.com/mindspore/mindformers/blob/master/research/qwen2_5/predict_qwen2_5_7b_instruct.yaml)。 另外,用户需要确保MindSpore Transformers已安装。用户可通过 @@ -220,7 +220,7 @@ Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg gereration throughput: 0.0 curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "Qwen/Qwen2.5-7B-Instruct", "prompt": "I am", "max_tokens": 20, "temperature": 0}' ``` -其中,用户需确认`"model"`字段与启动服务中`--model`一致,请求才能成功匹配到模型。可以通过请求处理成功,将获得以下的推理结果: +其中,用户需确认`"model"`字段与启动服务中`--model`一致,请求才能成功匹配到模型。若请求处理成功,将获得以下推理结果: ```text { diff --git a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md index 2ad616469f..813a0dd588 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md @@ -94,8 +94,8 @@ docker exec -it $DOCKER_NAME bash ```python from openmind_hub import snapshot_download -snapshot_download(repo_id="MindSpore-Lab/DeepSeek-R1-W8A8", - local_dir="/path/to/save/deepseek_r1_w8a8", +snapshot_download(repo_id="MindSpore-Lab/DeepSeek-R1-0528-A8W8", + local_dir="/path/to/save/deepseek_r1_0528_a8w8", local_dir_use_symlinks=False) ``` @@ -120,7 +120,7 @@ Git LFS initialized. 
工具确认可用后,执行以下命令,下载权重: ```shell -git clone https://modelers.cn/MindSpore-Lab/DeepSeek-R1-W8A8.git +git clone https://modelers.cn/models/MindSpore-Lab/DeepSeek-R1-0528-A8W8.git ``` ## TP16 张量并行推理 @@ -284,7 +284,7 @@ vllm-mindspore serve ```bash # 主节点: -vllm-mindspore serve --model="MindSpore-Lab/DeepSeek-R1-W8A8" --trust-remote-code --max-num-seqs=256 --max_model_len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 16 --distributed-executor-backend=ray +vllm-mindspore serve --model="MindSpore-Lab/DeepSeek-R1-0528-A8W8" --trust-remote-code --max-num-seqs=256 --max_model_len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 16 --distributed-executor-backend=ray ``` 张量并行场景下,`--tensor-parallel-size`参数会覆盖模型yaml文件中`parallel_config`的`model_parallel`配置。用户可以通过`--model`参数,指定模型保存的本地路径。 @@ -294,7 +294,7 @@ vllm-mindspore serve --model="MindSpore-Lab/DeepSeek-R1-W8A8" --trust-remote-cod 使用如下命令发送请求。其中`prompt`字段为模型输入: ```bash -curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "MindSpore-Lab/DeepSeek-R1-W8A8", "prompt": "I am", "max_tokens": 20, "temperature": 0, "top_p": 1.0, "top_k": 1, "repetition_penalty": 1.0}' +curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "MindSpore-Lab/DeepSeek-R1-0528-A8W8", "prompt": "I am", "max_tokens": 20, "temperature": 0, "top_p": 1.0, "top_k": 1, "repetition_penalty": 1.0}' ``` 用户需确认`"model"`字段与启动服务中`--model`一致,请求才能成功匹配到模型。 @@ -347,6 +347,7 @@ parallel_config: ### 在线推理 #### 启动服务 + `vllm-mindspore`可使用OpenAI的API协议部署在线推理。以下是在线推理的拉起流程: ```bash @@ -373,10 +374,10 @@ vllm-mindspore serve ```bash # 主节点: -vllm-mindspore serve --model="MindSpore-Lab/DeepSeek-R1-W8A8" --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 0 --data-parallel-address 192.10.10.10 --data-parallel-rpc-port 12370 --enable-expert-parallel +vllm-mindspore serve --model="MindSpore-Lab/DeepSeek-R1-0528-A8W8" --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 0 --data-parallel-address 192.10.10.10 --data-parallel-rpc-port 12370 --enable-expert-parallel # 从节点: -vllm-mindspore serve --headless --model="MindSpore-Lab/DeepSeek-R1-W8A8" --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 2 --data-parallel-address 192.10.10.10 --data-parallel-rpc-port 12370 --enable-expert-parallel +vllm-mindspore serve --headless --model="MindSpore-Lab/DeepSeek-R1-0528-A8W8" --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 2 --data-parallel-address 192.10.10.10 --data-parallel-rpc-port 12370 --enable-expert-parallel ``` #### 发送请求 @@ -384,7 +385,7 @@ vllm-mindspore serve --headless --model="MindSpore-Lab/DeepSeek-R1-W8A8" --trust 使用如下命令发送请求。其中`prompt`字段为模型输入: ```bash -curl 
http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "MindSpore-Lab/DeepSeek-R1-W8A8", "prompt": "I am, "max_tokens": 120, "temperature": 0}' +curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "MindSpore-Lab/DeepSeek-R1-0528-A8W8", "prompt": "I am", "max_tokens": 120, "temperature": 0}' ``` 用户需确认`"model"`字段与启动服务中`--model`一致,请求才能成功匹配到模型。 diff --git a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md index 72e4bad00a..f0901e2e0f 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md @@ -135,7 +135,7 @@ export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Tra 以下是对上述环境变量的解释: - `vLLM_MODEL_BACKEND`:所运行的模型后端。目前vLLM MindSpore所支持的模型与模型后端,可在[模型支持列表](../../../user_guide/supported_models/models_list/models_list.md)中进行查询。 -`MINDFORMERS_MODEL_CONFIG`:模型配置文件。用户可以在[MindSpore Transformers工程](https://gitee.com/mindspore/mindformers/tree/r1.5.0/research/qwen2_5)中,找到对应模型的yaml文件。以Qwen2.5-32B为例,则其yaml文件为[predict_qwen2_5_32b_instruct.yaml](https://gitee.com/mindspore/mindformers/blob/r1.5.0/research/qwen2_5/predict_qwen2_5_32b_instruct.yaml) 。 +- `MINDFORMERS_MODEL_CONFIG`:模型配置文件。用户可以在[MindSpore Transformers工程](https://gitee.com/mindspore/mindformers/tree/master/research/qwen2_5)中,找到对应模型的yaml文件。以Qwen2.5-32B为例,则其yaml文件为[predict_qwen2_5_32b_instruct.yaml](https://gitee.com/mindspore/mindformers/blob/master/research/qwen2_5/predict_qwen2_5_32b_instruct.yaml) 。 用户可通过`npu-smi info`查看显存占用情况,并可以使用如下环境变量,设置用于推理的计算卡。以下例子为假设用户使用4,5,6,7卡进行推理: @@ -181,7 +181,7 @@ Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg gereration throughput: 0.0 curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "Qwen/Qwen2.5-32B-Instruct", "prompt": "I am", "max_tokens": 20, "temperature": 0}' ``` -其中,用户需确认`"model"`字段与启动服务中`--model`一致,请求才能成功匹配到模型。可以通过请求处理成功,将获得以下的推理结果: +其中,用户需确认`"model"`字段与启动服务中`--model`一致,请求才能成功匹配到模型。若请求处理成功,将获得以下推理结果: ```text { diff --git a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md index 90a49c065e..c7d8426f6d 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md @@ -135,7 +135,7 @@ export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Tra 以下是对上述环境变量的解释: - `vLLM_MODEL_BACKEND`:所运行的模型后端。目前vLLM MindSpore所支持的模型与模型后端,可在[模型支持列表](../../../user_guide/supported_models/models_list/models_list.md)中进行查询; -`MINDFORMERS_MODEL_CONFIG`:模型配置文件。用户可以在[MindSpore Transformers工程](https://gitee.com/mindspore/mindformers/tree/r1.5.0/research/qwen2_5)中,找到对应模型的yaml文件。以Qwen2.5-7B为例,则其yaml文件为[predict_qwen2_5_7b_instruct.yaml](https://gitee.com/mindspore/mindformers/blob/r1.5.0/research/qwen2_5/predict_qwen2_5_7b_instruct.yaml) 。 +- `MINDFORMERS_MODEL_CONFIG`:模型配置文件。用户可以在[MindSpore 
Transformers工程](https://gitee.com/mindspore/mindformers/tree/master/research/qwen2_5)中,找到对应模型的yaml文件。以Qwen2.5-7B为例,则其yaml文件为[predict_qwen2_5_7b_instruct.yaml](https://gitee.com/mindspore/mindformers/blob/master/research/qwen2_5/predict_qwen2_5_7b_instruct.yaml) 。 用户可通过`npu-smi info`查看显存占用情况,并可以使用如下环境变量,设置用于推理的计算卡: @@ -194,7 +194,7 @@ vLLM MindSpore可使用OpenAI的API协议,部署为在线推理。以下是以 python3 -m vllm_mindspore.entrypoints vllm.entrypoints.openai.api_server --model "Qwen/Qwen2.5-7B-Instruct" ``` -若服务成功拉起,则可以获得类似的执行结果: +用户可以通过`--model`参数,指定模型保存的本地路径。若服务成功拉起,则可以获得类似的执行结果: ```text INFO: Started server process [6363] @@ -216,7 +216,7 @@ Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg gereration throughput: 0.0 curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "Qwen/Qwen2.5-7B-Instruct", "prompt": "I am", "max_tokens": 20, "temperature": 0}' ``` -若请求处理成功,将获得以下的推理结果: +其中,用户需确认`"model"`字段与启动服务中`--model`一致,请求才能成功匹配到模型。若请求处理成功,将获得以下推理结果: ```text { diff --git a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/environment_variables/environment_variables.md b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/environment_variables/environment_variables.md index 835946b67d..b5e2aefd2d 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/environment_variables/environment_variables.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/environment_variables/environment_variables.md @@ -4,43 +4,20 @@ | 环境变量 | 功能 | 类型 | 取值 | 说明 | | ------ | ------- | ------ | ------ | ------ | -| vLLM_MODEL_BACKEND | 用于指定模型后端。使用vLLM MindSpore原生模型后端时无需指定;使用模型为vLLM MindSpore外部后端时则需要指定。 | String | `MindFormers`: 模型后端为MindSpore Transformers。 | 原生模型后端当前支持Qwen2.5系列;MindSpore Transformers模型后端支持Qwen系列、DeepSeek、Llama系列模型,使用时需配置环境变量:`export PYTHONPATH=/path/to/mindformers/:$PYTHONPATH`。 | -| MINDFORMERS_MODEL_CONFIG | MindSpore Transformers模型的配置文件。使用Qwen2.5系列、DeepSeek系列模型时,需要配置文件路径。 | String | 模型配置文件路径。 | **该环境变量在后续版本会被移除。** 样例:`export MINDFORMERS_MODEL_CONFIG=/path/to/research/deepseek3/deepseek_r1_671b/predict_deepseek_r1_671b_w8a8.yaml`。 | -| GLOO_SOCKET_IFNAME | 用于多机之间使用gloo通信时的网口名称。 | String | 网口名称,例如enp189s0f0。 | 多机场景使用,可通过`ifconfig`查找ip对应网卡的网卡名。 | -| TP_SOCKET_IFNAME | 用于多机之间使用TP通信时的网口名称。 | String | 网口名称,例如enp189s0f0。 | 多机场景使用,可通过`ifconfig`查找ip对应网卡的网卡名。 | -| HCCL_SOCKET_IFNAME | 用于多机之间使用HCCL通信时的网口名称。 | String | 网口名称,例如enp189s0f0。 | 多机场景使用,可通过`ifconfig`查找ip对应网卡的网卡名。 | -| ASCEND_RT_VISIBLE_DEVICES | 指定哪些Device对当前进程可见,支持一次指定一个或多个Device ID。 | String | 为Device ID,逗号分割的字符串,例如"0,1,2,3,4,5,6,7"。 | ray使用场景建议使用。 | -| HCCL_BUFFSIZE | 此环境变量用于控制两个NPU之间共享数据的缓存区大小。 | Integer | 缓存区大小,大小为MB。例如:`2048`。 | 使用方法参考:[HCCL_BUFFSIZE](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/81RC1beta1/maintenref/envvar/envref_07_0080.html)。例如DeepSeek 混合并行(数据并行数为32,专家并行数为32),且`max-num-batched-tokens`为256时,则`export HCCL_BUFFSIZE=2048`。 | -| MS_MEMPOOL_BLOCK_SIZE | 设置PyNative模式下设备内存池的块大小。 | String | 正整数string,单位为GB。 | | -| vLLM_USE_NPU_ADV_STEP_FLASH_OP | 是否使用昇腾`adv_step_flash`算子。 | String | `on`: 使用;`off`:不使用 | 取值为`off`时,将使用小算子实现替代`adv_step_flash`算子。 | -| VLLM_TORCH_PROFILER_DIR | 开启profiling采集数据,当配置了采集数据保存路径后生效 | String | Profiling数据保存路径。| | - -以下环境变量由vLLM MindSpore自动注册: - -| 环境变量 | 功能 | 类型 | 取值 | 说明 | -| ------ | ------- | ------ | ------ | ------ | -| USE_TORCH | Transformer运行时依赖该环境变量 | String | 默认值为"False" | vLLM MindSpore 不使用torch 作为后端 | -| USE_TF | Transformer运行时依赖该环境变量 | String | 默认值为"False" | vLLM MindSpore 不使用TensorFlow 作为后端 | -| RUN_MODE | 执行模式为推理 | String | 默认值为"predict" | 
**该环境变量在后续版本会被移除。** 为MindFormers依赖的环境变量 | -| CUSTOM_MATMUL_SHUFFLE | 开启或关闭自定义矩阵算法的洗牌操作 | String | `on`:开启矩阵洗牌。`off`:关闭矩阵洗牌。默认值为`on`。 | | -| HCCL_DETERMINISTIC | 开启或关闭归约类通信算子的确定性计算,其中归约类通信算子包括 AllReduce、ReduceScatter、Reduce。 | String | `true`:打开 HCCL 确定性开关;`false`:关闭 HCCL 确定性开关。默认值为`false`。 | | -| ASCEND_LAUNCH_BLOCKING | 训练或在线推理场景,可通过此环境变量控制算子执行时是否启动同步模式。 | Integer | `1`:强制算子采用同步模式运行;`0`:不强制算子采用同步模式运行。默认值为`0`。 | | -| TE_PARALLEL_COMPILER | 算子最大并行编译进程数,当大于 1 时开启并行编译。 | Integer | 取值为正整数;最大不超过 cpu 核数*80%/昇腾 AI 处理器个数,取值范围 1~32。默认值是 `0`。 | | -| LCCL_DETERMINISTIC | 设置 LCCL 确定性算子 AllReduce(保序加)是否开启。 | Integer | `1`:打开 LCCL 确定性开关;`0`:关闭 LCCL 确定性开关。默认值是 `0`。 | | -| MS_ENABLE_GRACEFUL_EXIT | 设置使能进程优雅退出 | Integer | `1`:使用进程优雅退出功能。`不设置或者其他值`: 不使用进程优雅退出功能。默认值为`0` | | -| CPU_AFFINIITY | MindSpore推理绑核优化 | String | `True`:开启绑核;`True`:不开启绑核。默认值为`True` | **该环境变量在后续版本会被移除。** 将使用`set_cpu_affinity`接口。 | -| MS_ENABLE_INTERNAL_BOOST | 是否打开 MindSpore 框架的内部加速功能。 | String | `on`:开启 MindSpore 内部加速;`off`:关闭 MindSpore 内部加速。默认值为`on` | | -| MS_ENABLE_LCCL | 是否使用LCCL通信库。 | Integer | `1`:开启,`0`:关闭。默认值为`0`。 | | -| HCCL_EXEC_TIMEOUT | 通过该环境变量可控制设备间执行时同步等待的时间,在该配置时间内各设备进程等待其他设备执行通信同步。 | Integer | 取值范围为:(0, 17340],单位为 s。 默认值为 7200。 | | -| DEVICE_NUM_PER_NODE | 节点上的设备数 | Integer | 默认值为16。 | | -| HCCL_OP_EXPANSION_MODE | 用于配置通信算法的编排展开位置。 | String | `AI_CPU`:通信算法的编排展开位置为Device侧的AI CPU计算单元;`AIV`:通信算法的编排展开位置为Device侧的AI Vector Core计算单元。默认值为`AIV`。 | | -| MS_JIT_MODULES | 指定静态图模式下哪些模块需要JIT静态编译,其函数方法会被编译成静态计算图 | String | 模块名,对应import导入的顶层模块的名称。如果有多个,使用英文逗号分隔。默认值为`"vllm_mindspore,research"`。 | | -| GLOG_v | 控制日志的级别 | Integer | `0`:DEBUG;`1`:INFO;`2`:WARNING;`3`:ERROR,表示程序执行出现报错,输出错误日志,程序可能不会终止;`4`:CRITICAL,表示程序执行出现异常,将会终止执行程序。默认值为`3`。 | | -| RAY_CGRAPH_get_timeout | `ray.get()`方法的超时时间。 | Integer | 默认值为`360`。 | | -| MS_NODE_TIMEOUT | 节点心跳超时时间,单位:秒。 | Integer | 默认值为`180`。 | | +| `vLLM_MODEL_BACKEND` | 用于指定模型后端。使用vLLM MindSpore原生模型后端时无需指定;使用模型为vLLM MindSpore外部后端时则需要指定。 | String | `MindFormers`: 模型后端为MindSpore Transformers。 | 原生模型后端当前支持Qwen2.5系列;MindSpore Transformers模型后端支持Qwen系列、DeepSeek、Llama系列模型,使用时需配置环境变量:`export PYTHONPATH=/path/to/mindformers/:$PYTHONPATH`。 | +| `MINDFORMERS_MODEL_CONFIG` | MindSpore Transformers模型的配置文件。使用Qwen2.5系列、DeepSeek系列模型时,需要配置文件路径。 | String | 模型配置文件路径。 | **该环境变量在后续版本会被移除。** 样例:`export MINDFORMERS_MODEL_CONFIG=/path/to/research/deepseek3/deepseek_r1_671b/predict_deepseek_r1_671b_w8a8.yaml`。 | +| `GLOO_SOCKET_IFNAME` | 用于多机之间使用gloo通信时的网口名称。 | String | 网口名称,例如enp189s0f0。 | 多机场景使用,可通过`ifconfig`查找ip对应网卡的网卡名。 | +| `TP_SOCKET_IFNAME` | 用于多机之间使用TP通信时的网口名称。 | String | 网口名称,例如enp189s0f0。 | 多机场景使用,可通过`ifconfig`查找ip对应网卡的网卡名。 | +| `HCCL_SOCKET_IFNAME` | 用于多机之间使用HCCL通信时的网口名称。 | String | 网口名称,例如enp189s0f0。 | 多机场景使用,可通过`ifconfig`查找ip对应网卡的网卡名。 | +| `ASCEND_RT_VISIBLE_DEVICES` | 指定哪些Device对当前进程可见,支持一次指定一个或多个Device ID。 | String | 为Device ID,逗号分割的字符串,例如"0,1,2,3,4,5,6,7"。 | ray使用场景建议使用。 | +| `HCCL_BUFFSIZE` | 此环境变量用于控制两个NPU之间共享数据的缓存区大小。 | Integer | 缓存区大小,大小为MB。例如:`2048`。 | 使用方法参考:[HCCL_BUFFSIZE](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/81RC1beta1/maintenref/envvar/envref_07_0080.html)。例如DeepSeek 混合并行(数据并行数为32,专家并行数为32),且`max-num-batched-tokens`为256时,则`export HCCL_BUFFSIZE=2048`。 | +| `MS_MEMPOOL_BLOCK_SIZE` | 设置PyNative模式下设备内存池的块大小。 | String | 正整数string,单位为GB。 | | +| `vLLM_USE_NPU_ADV_STEP_FLASH_OP` | 是否使用昇腾`adv_step_flash`算子。 | String | `on`: 使用;`off`:不使用 | 取值为`off`时,将使用小算子实现替代`adv_step_flash`算子。 | +| `VLLM_TORCH_PROFILER_DIR` | 开启profiling采集数据,当配置了采集数据保存路径后生效 | String | Profiling数据保存路径。| | 更多的环境变量信息,请查看: 
- - [CANN 环境变量列表](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/81RC1beta1/index/index.html) - - [MindSpore 环境变量列表](https://www.mindspore.cn/docs/zh-CN/r2.7.0rc1/api_python/env_var_list.html) - - [MindSpore Transformers 环境变量列表](https://www.mindspore.cn/mindformers/docs/zh-CN/r1.6.0/index.html) - - [vLLM 环境变量列表](https://docs.vllm.ai/en/v0.8.4/serving/env_vars.html) + +- [CANN 环境变量列表](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/81RC1beta1/index/index.html) +- [MindSpore 环境变量列表](https://www.mindspore.cn/docs/zh-CN/master/api_python/env_var_list.html) +- [MindSpore Transformers 环境变量列表](https://www.mindspore.cn/mindformers/docs/zh-CN/master/index.html) +- [vLLM 环境变量列表](https://docs.vllm.ai/en/v0.8.4/serving/env_vars.html) diff --git a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/quantization/quantization.md b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/quantization/quantization.md index 71667d5f1e..22a83475ef 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/quantization/quantization.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/quantization/quantization.md @@ -16,7 +16,7 @@ ### 直接下载量化权重 -我们已经将量化好的DeepSeek-R1上传到[魔乐社区](https://modelers.cn):[MindSpore-Lab/DeepSeek-R1-W8A8](https://modelers.cn/models/MindSpore-Lab/DeepSeek-R1-W8A8),可以参考[魔乐社区文档](https://modelers.cn/docs/zh/openmind-hub-client/0.9/basic_tutorial/download.html)将权重下载到本地。 +我们已经将量化好的DeepSeek-R1上传到[魔乐社区](https://modelers.cn):[MindSpore-Lab/DeepSeek-R1-0528-A8W8](https://modelers.cn/models/MindSpore-Lab/DeepSeek-R1-0528-A8W8),可以参考[魔乐社区文档](https://modelers.cn/docs/zh/openmind-hub-client/0.9/basic_tutorial/download.html)将权重下载到本地。 ## 量化模型推理 diff --git a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_models/models_list/models_list.md b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_models/models_list/models_list.md index 2e504c0fec..c64725c9e1 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_models/models_list/models_list.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_models/models_list/models_list.md @@ -6,7 +6,7 @@ |-------| --------- | ---- | | DeepSeek-V3 | 已支持 | [DeepSeek-V3](https://modelers.cn/models/MindSpore-Lab/DeepSeek-V3) | | DeepSeek-R1 | 已支持 | [DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-V3) | -| DeepSeek-R1 W8A8 | 已支持 | [Deepseek-R1-W8A8](https://modelers.cn/models/MindSpore-Lab/DeepSeek-r1-w8a8) | +| DeepSeek-R1 W8A8 | 已支持 | [Deepseek-R1-W8A8](https://modelers.cn/models/MindSpore-Lab/DeepSeek-R1-0528-A8W8) | | Qwen2.5 | 已支持 | [Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct)、[Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct)、[Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct)、 [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct)、[Qwen2.5-14B-Instruct](https://huggingface.co/Qwen/Qwen2.5-14B-Instruct)、[Qwen2.5-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct)、[Qwen2.5-72B-Instruct](https://huggingface.co/Qwen/Qwen2.5-72B-Instruct) | | Qwen3-32B | 已支持 | [Qwen3-32B](https://modelers.cn/models/MindSpore-Lab/Qwen3-32B) | | Qwen3-235B-A22B | 已支持 | [Qwen3-235B-A22B](https://huggingface.co/Qwen/Qwen3-235B-A22B) | -- Gitee