From 3baea95f1502f3f678d7a8c576933d2333e6f155 Mon Sep 17 00:00:00 2001 From: horcam Date: Tue, 12 Aug 2025 09:57:39 +0800 Subject: [PATCH 1/7] delete env vars --- .../source_en/getting_started/installation/installation.md | 2 -- .../docs/source_en/getting_started/quick_start/quick_start.md | 4 ---- .../tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md | 4 ---- .../tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md | 4 ---- .../user_guide/supported_features/benchmark/benchmark.md | 4 ---- .../supported_features/quantization/quantization.md | 2 -- .../source_zh_cn/getting_started/installation/installation.md | 2 -- .../source_zh_cn/getting_started/quick_start/quick_start.md | 4 ---- .../tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md | 4 ---- .../tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md | 4 ---- .../user_guide/supported_features/benchmark/benchmark.md | 4 ---- .../supported_features/quantization/quantization.md | 2 -- 12 files changed, 40 deletions(-) diff --git a/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md b/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md index 73851332d6..a6789dccde 100644 --- a/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md +++ b/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md @@ -153,9 +153,7 @@ docker exec -it $DOCKER_NAME bash User can verify the installation with a simple offline inference test. First, user need to configure the environment variables with the following command: ```bash -export ASCEND_TOTAL_MEMORY_GB=64 # Please use `npu-smi info` to check the memory. export vLLM_MODEL_BACKEND=MindFormers # use MindSpore Transformers as model backend. -export vLLM_MODEL_MEMORY_USE_GB=32 # Memory reserved for model execution. Set according to the model's maximum usage, with the remaining environment used for kvcache allocation export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model's YAML file. ``` diff --git a/docs/vllm_mindspore/docs/source_en/getting_started/quick_start/quick_start.md b/docs/vllm_mindspore/docs/source_en/getting_started/quick_start/quick_start.md index c0eaf16c34..6b129509a8 100644 --- a/docs/vllm_mindspore/docs/source_en/getting_started/quick_start/quick_start.md +++ b/docs/vllm_mindspore/docs/source_en/getting_started/quick_start/quick_start.md @@ -131,17 +131,13 @@ git clone https://huggingface.co/Qwen/Qwen2.5-7B-Instruct Before launching the model, user need to set the following environment variables: ```bash -export ASCEND_TOTAL_MEMORY_GB=64 # Please use `npu-smi info` to check the memory. export vLLM_MODEL_BACKEND=MindFormers # use MindSpore Transformers as model backend. -export vLLM_MODEL_MEMORY_USE_GB=32 # Memory reserved for model execution. Set according to the model's maximum usage, with the remaining environment used for kvcache allocation. export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model's YAML file. ``` Here is an explanation of these environment variables: -- `ASCEND_TOTAL_MEMORY_GB`: The memory size of each card. User can check the memory by using `npu-smi info`, where the value corresponds to `HBM-Usage(MB)` in the query results. - `vLLM_MODEL_BACKEND`: The backend of the model to run. User could find supported models and backends for vLLM MindSpore in the [Model Support List](../../user_guide/supported_models/models_list/models_list.md). 
-- `vLLM_MODEL_MEMORY_USE_GB`: The memory reserved for model loading. Adjust this value if insufficient memory error occurs during model loading. - `MINDFORMERS_MODEL_CONFIG`: The model configuration file. Additionally, users need to ensure that MindSpore Transformers is installed. Users can add it by running the following command: diff --git a/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md b/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md index 24d4d4a2ca..7f55f75080 100644 --- a/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md +++ b/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md @@ -127,17 +127,13 @@ For [Qwen2.5-32B](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct), the followi ```bash #set environment variables -export ASCEND_TOTAL_MEMORY_GB=64 # Use `npu-smi info` to check the memory. export vLLM_MODEL_BACKEND=MindFormers # Use MindSpore TransFormers as the model backend. -export vLLM_MODEL_MEMORY_USE_GB=32 # Memory reserved for model execution. Adjust based on the model's maximum usage, with the remaining allocated for KV cache. export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model YAML file. ``` Here is an explanation of these environment variables: -- `ASCEND_TOTAL_MEMORY_GB`: The memory size of each compute card. Query using `npu-smi info`, corresponding to `HBM-Usage(MB)` in the results. - `vLLM_MODEL_BACKEND`: The model backend. Currently supported models and backends are listed in the [Model Support List](../../../user_guide/supported_models/models_list/models_list.md). -- `vLLM_MODEL_MEMORY_USE_GB`: Memory reserved for model loading. Adjust this if encountering insufficient memory. - `MINDFORMERS_MODEL_CONFIG`: Model configuration file. User can find the corresponding YAML file in the [MindSpore Transformers repository](https://gitee.com/mindspore/mindformers/tree/r1.5.0/research/qwen2_5). For Qwen2.5-32B, the YAML file is [predict_qwen2_5_32b_instruct.yaml](https://gitee.com/mindspore/mindformers/blob/r1.5.0/research/qwen2_5/predict_qwen2_5_32b_instruct.yaml). Users can check memory usage with `npu-smi info` and set the NPU cards for inference using the following example (assuming cards 4,5,6,7 are used): diff --git a/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md b/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md index 2c360e28f8..f6daacb52d 100644 --- a/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md +++ b/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md @@ -127,17 +127,13 @@ For [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct), the following ```bash #set environment variables -export ASCEND_TOTAL_MEMORY_GB=64 # Please use `npu-smi info` to check the memory. export vLLM_MODEL_BACKEND=MindFormers # use MindSpore TransFormers as model backend. -export vLLM_MODEL_MEMORY_USE_GB=32 # Memory reserved for model execution. Set according to the model's maximum usage, with the remaining environment used for kvcache allocation export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model's YAML file. 
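+# Example only (hypothetical local path): $YAML_PATH should point to the model's MindSpore Transformers YAML, e.g.
+# export MINDFORMERS_MODEL_CONFIG=/path/to/mindformers/research/qwen2_5/predict_qwen2_5_7b_instruct.yaml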
``` Here is an explanation of these variables: -- `ASCEND_TOTAL_MEMORY_GB`: The memory size of each compute card. Query using `npu-smi info`, corresponding to `HBM-Usage(MB)` in the results. - `vLLM_MODEL_BACKEND`: The model backend. Currently supported models and backends are listed in the [Model Support List](../../../user_guide/supported_models/models_list/models_list.md). -- `vLLM_MODEL_MEMORY_USE_GB`: Memory reserved for model loading. Adjust this if encountering insufficient memory. - `MINDFORMERS_MODEL_CONFIG`: Model configuration file. User can find the corresponding YAML file in the [MindSpore Transformers repository](https://gitee.com/mindspore/mindformers/tree/master/research/qwen2_5). For Qwen2.5-7B, the YAML file is [predict_qwen2_5_7b_instruct.yaml](https://gitee.com/mindspore/mindformers/blob/master/research/qwen2_5/predict_qwen2_5_7b_instruct.yaml). User can check memory usage with `npu-smi info` and set the compute card for inference using: diff --git a/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/benchmark/benchmark.md b/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/benchmark/benchmark.md index b704a49f44..9ab03ff8ac 100644 --- a/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/benchmark/benchmark.md +++ b/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/benchmark/benchmark.md @@ -9,9 +9,7 @@ The benchmark tool of vLLM MindSpore is inherited from vLLM. You can refer to th For single-card inference, we take [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) as an example. You can prepare the environment by following the guide [Single-Card Inference (Qwen2.5-7B)](../../../getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md#online-inference), set the environment variables: ```bash -export ASCEND_TOTAL_MEMORY_GB=64 # Please use `npu-smi info` to check the memory. export vLLM_MODEL_BACKEND=MindFormers # use MindSpore Transformers as model backend. -export vLLM_MODEL_MEMORY_USE_GB=32 # Memory reserved for model execution. Set according to the model's maximum usage, with the remaining environment used for kvcache allocation export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model's YAML file. ``` @@ -104,9 +102,7 @@ P99 ITL (ms): .... For offline performance benchmark, take [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) as an example. Prepare the environment by following the guide [Single-Card Inference (Qwen2.5-7B)](../../../getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md#offline-inference). User need to set the environment variables: ```bash -export ASCEND_TOTAL_MEMORY_GB=64 # Please use `npu-smi info` to check the memory. export vLLM_MODEL_BACKEND=MindFormers # use MindSpore Transformers as model backend. -export vLLM_MODEL_MEMORY_USE_GB=32 # Memory reserved for model execution. Set according to the model's maximum usage, with the remaining environment used for kvcache allocation export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model's YAML file. 
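+# Optional sanity check (example only): confirm the variables are visible in the current shell before benchmarking
+# env | grep -E 'vLLM_MODEL_BACKEND|MINDFORMERS_MODEL_CONFIG'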
``` diff --git a/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/quantization/quantization.md b/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/quantization/quantization.md index 401768ca1a..33c39b583d 100644 --- a/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/quantization/quantization.md +++ b/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/quantization/quantization.md @@ -27,9 +27,7 @@ After obtaining the DeepSeek-R1 W8A8 weights, ensure they are stored in the rela Refer to the [Installation Guide](../../../getting_started/installation/installation.md) to set up the vLLM MindSpore environment. User need to set the following environment variables: ```bash -export ASCEND_TOTAL_MEMORY_GB=64 # Please use `npu-smi info` to check the memory. export vLLM_MODEL_BACKEND=MindFormers # use MindSpore Transformers as model backend. -export vLLM_MODEL_MEMORY_USE_GB=32 # Memory reserved for model execution. Set according to the model's maximum usage, with the remaining environment used for kvcache allocation export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model's YAML file. ``` diff --git a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md index 8121916024..2243f2eaea 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md @@ -152,9 +152,7 @@ docker exec -it $DOCKER_NAME bash 用户可以创建一个简单的离线推理场景,验证安装是否成功。下面以[Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) 为例。首先用户需要执行以下命令,设置环境变量: ```bash -export ASCEND_TOTAL_MEMORY_GB=64 # Please use `npu-smi info` to check the memory. export vLLM_MODEL_BACKEND=MindFormers # use MindSpore Transformers as model backend. -export vLLM_MODEL_MEMORY_USE_GB=32 # Memory reserved for model execution. Set according to the model's maximum usage, with the remaining environment used for kvcache allocation export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model's YAML file. ``` diff --git a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/quick_start/quick_start.md b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/quick_start/quick_start.md index 2bf629908a..a17c89b0fa 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/quick_start/quick_start.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/quick_start/quick_start.md @@ -131,17 +131,13 @@ git clone https://huggingface.co/Qwen/Qwen2.5-7B-Instruct 用户在拉起模型前,需设置以下环境变量: ```bash -export ASCEND_TOTAL_MEMORY_GB=64 # Please use `npu-smi info` to check the memory. export vLLM_MODEL_BACKEND=MindFormers # use MindSpore Transformers as model backend. -export vLLM_MODEL_MEMORY_USE_GB=32 # Memory reserved for model execution. Set according to the model's maximum usage, with the remaining environment used for kvcache allocation export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model's YAML file. 
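+# Example only (hypothetical path): set YAML_PATH to the Qwen2.5-7B MindSpore Transformers config before the export above, e.g.
+# YAML_PATH=/path/to/mindformers/research/qwen2_5/predict_qwen2_5_7b_instruct.yaml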
``` 以下是对上述环境变量的解释: -- `ASCEND_TOTAL_MEMORY_GB`: 每一张计算卡的显存大小。用户可使用`npu-smi info`命令进行查询,该值对应查询结果中的`HBM-Usage(MB)`; - `vLLM_MODEL_BACKEND`:所运行的模型后端。目前vLLM MindSpore所支持的模型与模型后端,可在[模型支持列表](../../user_guide/supported_models/models_list/models_list.md)中进行查询; -- `vLLM_MODEL_MEMORY_USE_GB`:模型加载时所用空间,根据用户所使用的模型进行设置。若用户在模型加载过程中遇到显存不足时,可适当增大该值并重试; - `MINDFORMERS_MODEL_CONFIG`:模型配置文件。 另外,用户需要确保MindSpore Transformers已安装。用户可通过 diff --git a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md index 811009f647..5db8bd6d63 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md @@ -128,17 +128,13 @@ git clone https://huggingface.co/Qwen/Qwen2.5-32B-Instruct ```bash #set environment variables -export ASCEND_TOTAL_MEMORY_GB=64 # Please use `npu-smi info` to check the memory. export vLLM_MODEL_BACKEND=MindFormers # use MindSpore TransFormers as model backend. -export vLLM_MODEL_MEMORY_USE_GB=32 # Memory reserved for model execution. Set according to the model's maximum usage, with the remaining environment used for kvcache allocation export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model's YAML file. ``` 以下是对上述环境变量的解释: -- `ASCEND_TOTAL_MEMORY_GB`: 每一张计算卡的显存大小。用户可使用`npu-smi info`命令进行查询,该值对应查询结果中的`HBM-Usage(MB)`。 - `vLLM_MODEL_BACKEND`:所运行的模型后端。目前vLLM MindSpore所支持的模型与模型后端,可在[模型支持列表](../../../user_guide/supported_models/models_list/models_list.md)中进行查询。 -- `vLLM_MODEL_MEMORY_USE_GB`:模型加载时所用空间,根据用户所使用的模型进行设置。若用户在模型加载过程中遇到显存不足时,可适当增大该值并重试。 - `MINDFORMERS_MODEL_CONFIG`:模型配置文件。用户可以在[MindSpore Transformers工程](https://gitee.com/mindspore/mindformers/tree/r1.5.0/research/qwen2_5)中,找到对应模型的yaml文件。以Qwen2.5-32B为例,则其yaml文件为[predict_qwen2_5_32b_instruct.yaml](https://gitee.com/mindspore/mindformers/blob/r1.5.0/research/qwen2_5/predict_qwen2_5_32b_instruct.yaml) 。 用户可通过`npu-smi info`查看显存占用情况,并可以使用如下环境变量,设置用于推理的计算卡。以下例子为假设用户使用4,5,6,7卡进行推理: diff --git a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md index ffc82071b2..90a49c065e 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md @@ -128,17 +128,13 @@ git clone https://huggingface.co/Qwen/Qwen2.5-7B-Instruct ```bash #set environment variables -export ASCEND_TOTAL_MEMORY_GB=64 # Please use `npu-smi info` to check the memory. export vLLM_MODEL_BACKEND=MindFormers # use MindSpore TransFormers as model backend. -export vLLM_MODEL_MEMORY_USE_GB=32 # Memory reserved for model execution. Set according to the model's maximum usage, with the remaining environment used for kvcache allocation export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model's YAML file. 
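+# Optional check (example only): verify that the configured YAML file exists
+# test -f "$MINDFORMERS_MODEL_CONFIG" && echo 'YAML found' || echo 'YAML missing'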
``` 以下是对上述环境变量的解释: -- `ASCEND_TOTAL_MEMORY_GB`: 每一张计算卡的显存大小。用户可使用`npu-smi info`命令进行查询,该值对应查询结果中的`HBM-Usage(MB)`; - `vLLM_MODEL_BACKEND`:所运行的模型后端。目前vLLM MindSpore所支持的模型与模型后端,可在[模型支持列表](../../../user_guide/supported_models/models_list/models_list.md)中进行查询; -- `vLLM_MODEL_MEMORY_USE_GB`:模型加载时所用空间,根据用户所使用的模型进行设置。若用户在模型加载过程中遇到显存不足时,可适当增大该值并重试; - `MINDFORMERS_MODEL_CONFIG`:模型配置文件。用户可以在[MindSpore Transformers工程](https://gitee.com/mindspore/mindformers/tree/r1.5.0/research/qwen2_5)中,找到对应模型的yaml文件。以Qwen2.5-7B为例,则其yaml文件为[predict_qwen2_5_7b_instruct.yaml](https://gitee.com/mindspore/mindformers/blob/r1.5.0/research/qwen2_5/predict_qwen2_5_7b_instruct.yaml) 。 用户可通过`npu-smi info`查看显存占用情况,并可以使用如下环境变量,设置用于推理的计算卡: diff --git a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/benchmark/benchmark.md b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/benchmark/benchmark.md index 15f1040699..6d28a6ff7c 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/benchmark/benchmark.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/benchmark/benchmark.md @@ -9,9 +9,7 @@ vLLM MindSpore的性能测试能力,继承自vLLM所提供的性能测试能 若用户使用单卡推理,以[Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct)为例,可按照文档[单卡推理(Qwen2.5-7B)](../../../getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md#在线推理)进行环境准备,设置以下环境变量: ```bash -export ASCEND_TOTAL_MEMORY_GB=64 # Please use `npu-smi info` to check the memory. export vLLM_MODEL_BACKEND=MindFormers # use MindSpore Transformers as model backend. -export vLLM_MODEL_MEMORY_USE_GB=32 # Memory reserved for model execution. Set according to the model's maximum usage, with the remaining environment used for kvcache allocation export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model's YAML file. ``` @@ -104,9 +102,7 @@ P99 ITL (ms): .... 用户使用离线性能测试时,以[Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct)为例,可按照文档[单卡推理(Qwen2.5-7B)](../../../getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md#离线推理)进行环境准备,设置以下环境变量: ```bash -export ASCEND_TOTAL_MEMORY_GB=64 # Please use `npu-smi info` to check the memory. export vLLM_MODEL_BACKEND=MindFormers # use MindSpore Transformers as model backend. -export vLLM_MODEL_MEMORY_USE_GB=32 # Memory reserved for model execution. Set according to the model's maximum usage, with the remaining environment used for kvcache allocation export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model's YAML file. ``` diff --git a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/quantization/quantization.md b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/quantization/quantization.md index 54ad35032d..71667d5f1e 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/quantization/quantization.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/quantization/quantization.md @@ -27,9 +27,7 @@ 用户可以参考[安装指南](../../../getting_started/installation/installation.md),进行vLLM MindSpore的环境搭建。用户需设置以下环境变量: ```bash -export ASCEND_TOTAL_MEMORY_GB=64 # Please use `npu-smi info` to check the memory. export vLLM_MODEL_BACKEND=MindFormers # use MindSpore Transformers as model backend. -export vLLM_MODEL_MEMORY_USE_GB=32 # Memory reserved for model execution. 
Set according to the model's maximum usage, with the remaining environment used for kvcache allocation export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model's YAML file. ``` -- Gitee From 5f6d2506fdf8172dd8ce9472d908ef92ae85a522 Mon Sep 17 00:00:00 2001 From: horcam Date: Tue, 12 Aug 2025 10:50:59 +0800 Subject: [PATCH 2/7] update ds dir, check model name and model args --- .../quick_start/quick_start.md | 4 ++- .../deepseek_r1_671b_w8a8_dp4_tp4_ep4.md | 27 +++++++++++-------- .../qwen2.5_32b_multiNPU.md | 8 +++--- .../qwen2.5_7b_singleNPU.md | 6 +++-- .../quick_start/quick_start.md | 4 +-- .../deepseek_r1_671b_w8a8_dp4_tp4_ep4.md | 21 +++++++++------ .../qwen2.5_32b_multiNPU.md | 6 ++--- .../supported_features/profiling/profiling.md | 2 +- 8 files changed, 47 insertions(+), 31 deletions(-) diff --git a/docs/vllm_mindspore/docs/source_en/getting_started/quick_start/quick_start.md b/docs/vllm_mindspore/docs/source_en/getting_started/quick_start/quick_start.md index 6b129509a8..506dc04152 100644 --- a/docs/vllm_mindspore/docs/source_en/getting_started/quick_start/quick_start.md +++ b/docs/vllm_mindspore/docs/source_en/getting_started/quick_start/quick_start.md @@ -198,7 +198,7 @@ Use the model `Qwen/Qwen2.5-7B-Instruct` and start the vLLM service with the fol python3 -m vllm_mindspore.entrypoints vllm.entrypoints.openai.api_server --model "Qwen/Qwen2.5-7B-Instruct" ``` -If the service starts successfully, similar output will be obtained: +User can also set the local model path by `--model` argument. If the service starts successfully, similar output will be obtained: ```text INFO: Started server process [6363] @@ -220,6 +220,8 @@ Use the following command to send a request, where `prompt` is the model input: curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "Qwen/Qwen2.5-7B-Instruct", "prompt": "I am", "max_tokens": 15, "temperature": 0}' ``` +User needs to ensure that the `"model"` field matches the `--model` in the service startup, and the request can successfully match the model. + If the request is processed successfully, the following inference result will be returned: ```text diff --git a/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md b/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md index 7a3a9fb4a8..8386c3d11c 100644 --- a/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md +++ b/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md @@ -241,18 +241,20 @@ Execution example: ```bash # Master node: -vllm-mindspore serve --model="/path/to/save/deepseek_r1_w8a8" --trust-remote-code --max-num-seqs=256 --max_model_len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 16 --distributed-executor-backend=ray +vllm-mindspore serve --model="MindSpore-Lab/DeepSeek-R1-W8A8" --trust-remote-code --max-num-seqs=256 --max_model_len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 16 --distributed-executor-backend=ray ``` -In tensor parallel scenarios, the `--tensor-parallel-size` parameter overrides the `model_parallel` configuration in the model YAML file. 
+In tensor parallel scenarios, the `--tensor-parallel-size` parameter overrides the `model_parallel` configuration in the model YAML file. User can also set the local model path by `--model` argument. #### Sending Requests Use the following command to send requests, where `prompt` is the model input: ```bash -curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "/path/to/save/deepseek_r1_w8a8", "prompt": "I am", "max_tokens": 20, "temperature": 0, "top_p": 1.0, "top_k": 1, "repetition_penalty": 1.0}' -``` +curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "MindSpore-Lab/DeepSeek-R1-W8A8", "prompt": "I am", "max_tokens": 20, "temperature": 0, "top_p": 1.0, "top_k": 1, "repetition_penalty": 1.0}' +``` + +User needs to ensure that the `"model"` field matches the `--model` in the service startup, and the request can successfully match the model. ## Hybrid Parallel Inference @@ -301,6 +303,7 @@ parallel_config: ### Online Inference +#### Starting the Service `vllm-mindspore` can deploy online inference using the OpenAI API protocol. Below is the workflow for launching the service: ```bash @@ -321,22 +324,24 @@ vllm-mindspore serve --data-parallel-address [Master node communication IP] --data-parallel-rpc-port [Master node communication port] --enable-expert-parallel # Enable expert parallelism -``` +``` -Execution example: +User can also set the local model path by `--model` argument. The following is an execution example: ```bash # Master node: -vllm-mindspore serve --model="/path/to/save/deepseek_r1_w8a8" --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 0 --data-parallel-address 192.10.10.10 --data-parallel-rpc-port 12370 --enable-expert-parallel +vllm-mindspore serve --model="MindSpore-Lab/DeepSeek-R1-W8A8" --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 0 --data-parallel-address 192.10.10.10 --data-parallel-rpc-port 12370 --enable-expert-parallel # Worker node: -vllm-mindspore serve --headless --model="/path/to/save/deepseek_r1_w8a8" --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 2 --data-parallel-address 192.10.10.10 --data-parallel-rpc-port 12370 --enable-expert-parallel -``` +vllm-mindspore serve --headless --model="MindSpore-Lab/DeepSeek-R1-W8A8" --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 2 --data-parallel-address 192.10.10.10 --data-parallel-rpc-port 12370 --enable-expert-parallel +``` -## Sending Requests +#### Sending Requests Use the following command to send requests, where `prompt` is the model input: ```bash -curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "/path/to/save/deepseek_r1_w8a8", "prompt": "I am", "max_tokens": 20, "temperature": 0}' +curl http://localhost:8000/v1/completions -H 
"Content-Type: application/json" -d '{"model": "MindSpore-Lab/DeepSeek-R1-W8A8", "prompt": "I am", "max_tokens": 20, "temperature": 0}' ``` + +User needs to ensure that the `"model"` field matches the `--model` in the service startup, and the request can successfully match the model. diff --git a/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md b/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md index 7f55f75080..0451257e2b 100644 --- a/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md +++ b/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md @@ -156,7 +156,7 @@ export MAX_MODEL_LEN=1024 python3 -m vllm_mindspore.entrypoints vllm.entrypoints.openai.api_server --model "Qwen/Qwen2.5-32B-Instruct" --trust_remote_code --tensor-parallel-size $TENSOR_PARALLEL_SIZE --max-model-len $MAX_MODEL_LEN ``` -Here, `TENSOR_PARALLEL_SIZE` specifies the number of NPU cards, and `MAX_MODEL_LEN` sets the maximum output token length. +Here, `TENSOR_PARALLEL_SIZE` specifies the number of NPU cards, and `MAX_MODEL_LEN` sets the maximum output token length. User can also set the local model path by `--model` argument. If the service starts successfully, similar output will be obtained: @@ -177,16 +177,18 @@ Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 Use the following command to send a request, where `prompt` is the model input: ```bash -curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "Qwen2.5-32B-Instruct", "prompt": "I am", "max_tokens": 20, "temperature": 0}' +curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "Qwen/Qwen2.5-32B-Instruct", "prompt": "I am", "max_tokens": 20, "temperature": 0}' ``` +User needs to ensure that the `"model"` field matches the `--model` in the service startup, and the request can successfully match the model. + If processed successfully, the inference result will be: ```text { "id":"cmpl-11fe2898c77d4ff18c879f57ae7aa9ca","object":"text_completion", "create":1748568696, - "model":"Qwen2.5-32B-Instruct", + "model":"Qwen/Qwen2.5-32B-Instruct", "choices":[ { "index":0, diff --git a/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md b/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md index f6daacb52d..e1216d12c9 100644 --- a/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md +++ b/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md @@ -192,7 +192,7 @@ Use the model `Qwen/Qwen2.5-7B-Instruct` and start the vLLM service with the fol python3 -m vllm_mindspore.entrypoints vllm.entrypoints.openai.api_server --model "Qwen/Qwen2.5-7B-Instruct" ``` -If the service starts successfully, similar output will be obtained: +User can also set the local model path by `--model` argument. 
If the service starts successfully, similar output will be obtained: ```text INFO: Started server process [6363] @@ -214,13 +214,15 @@ Use the following command to send a request, where `prompt` is the model input: curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "Qwen/Qwen2.5-7B-Instruct", "prompt": "I am", "max_tokens": 15, "temperature": 0}' ``` +User needs to ensure that the `"model"` field matches the `--model` in the service startup, and the request can successfully match the model. + If the request is processed successfully, the following inference result will be returned: ```text { "id":"cmpl-5e6e314861c24ba79fea151d86c1b9a6","object":"text_completion", "create":1747398389, - "model":"Qwen2.5-7B-Instruct", + "model":"Qwen/Qwen2.5-7B-Instruct", "choices":[ { "index":0, diff --git a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/quick_start/quick_start.md b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/quick_start/quick_start.md index a17c89b0fa..cb8960fed7 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/quick_start/quick_start.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/quick_start/quick_start.md @@ -198,7 +198,7 @@ vLLM MindSpore可使用OpenAI的API协议,进行在线推理部署。以下是 python3 -m vllm_mindspore.entrypoints vllm.entrypoints.openai.api_server --model "Qwen/Qwen2.5-7B-Instruct" ``` -若服务成功拉起,则可以获得类似的执行结果: +用户可以通过`--model`参数,指定模型保存的本地路径。若服务成功拉起,则可以获得类似的执行结果: ```text INFO: Started server process [6363] @@ -220,7 +220,7 @@ Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg gereration throughput: 0.0 curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "Qwen/Qwen2.5-7B-Instruct", "prompt": "I am", "max_tokens": 20, "temperature": 0}' ``` -若请求处理成功,将获得以下的推理结果: +其中,用户需确认`"model"`字段与启动服务中`--model`一致,请求才能成功匹配到模型。可以通过请求处理成功,将获得以下的推理结果: ```text { diff --git a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md index 047e7b4aad..2ad616469f 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md @@ -284,19 +284,21 @@ vllm-mindspore serve ```bash # 主节点: -vllm-mindspore serve --model="/path/to/save/deepseek_r1_w8a8" --trust-remote-code --max-num-seqs=256 --max_model_len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 16 --distributed-executor-backend=ray +vllm-mindspore serve --model="MindSpore-Lab/DeepSeek-R1-W8A8" --trust-remote-code --max-num-seqs=256 --max_model_len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 16 --distributed-executor-backend=ray ``` -张量并行场景下,`--tensor-parallel-size`参数会覆盖模型yaml文件中`parallel_config`的`model_parallel`配置。 +张量并行场景下,`--tensor-parallel-size`参数会覆盖模型yaml文件中`parallel_config`的`model_parallel`配置。用户可以通过`--model`参数,指定模型保存的本地路径。 #### 发起请求 使用如下命令发送请求。其中`prompt`字段为模型输入: ```bash -curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "/path/to/save/deepseek_r1_w8a8", "prompt": "I am", "max_tokens": 20, "temperature": 0, "top_p": 1.0, "top_k": 1, "repetition_penalty": 1.0}' +curl http://localhost:8000/v1/completions -H 
"Content-Type: application/json" -d '{"model": "MindSpore-Lab/DeepSeek-R1-W8A8", "prompt": "I am", "max_tokens": 20, "temperature": 0, "top_p": 1.0, "top_k": 1, "repetition_penalty": 1.0}' ``` +用户需确认`"model"`字段与启动服务中`--model`一致,请求才能成功匹配到模型。 + ## 混合并行推理 vLLM 通过 Ray 对多个节点资源进行管理和运行。该样例对应以下并行策略场景: @@ -344,6 +346,7 @@ parallel_config: ### 在线推理 +#### 启动服务 `vllm-mindspore`可使用OpenAI的API协议部署在线推理。以下是在线推理的拉起流程: ```bash @@ -366,20 +369,22 @@ vllm-mindspore serve --enable-expert-parallel # 使能专家并行 ``` -执行示例: +用户可以通过`--model`参数,指定模型保存的本地路径。以下为执行示例: ```bash # 主节点: -vllm-mindspore serve --model="/path/to/save/deepseek_r1_w8a8" --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 0 --data-parallel-address 192.10.10.10 --data-parallel-rpc-port 12370 --enable-expert-parallel +vllm-mindspore serve --model="MindSpore-Lab/DeepSeek-R1-W8A8" --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 0 --data-parallel-address 192.10.10.10 --data-parallel-rpc-port 12370 --enable-expert-parallel # 从节点: -vllm-mindspore serve --headless --model="/path/to/save/deepseek_r1_w8a8" --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 2 --data-parallel-address 192.10.10.10 --data-parallel-rpc-port 12370 --enable-expert-parallel +vllm-mindspore serve --headless --model="MindSpore-Lab/DeepSeek-R1-W8A8" --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 2 --data-parallel-address 192.10.10.10 --data-parallel-rpc-port 12370 --enable-expert-parallel ``` -## 发送请求 +#### 发送请求 使用如下命令发送请求。其中`prompt`字段为模型输入: ```bash -curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "/path/to/save/deepseek_r1_w8a8", "prompt": "I am, "max_tokens": 120, "temperature": 0}' +curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "MindSpore-Lab/DeepSeek-R1-W8A8", "prompt": "I am, "max_tokens": 120, "temperature": 0}' ``` + +用户需确认`"model"`字段与启动服务中`--model`一致,请求才能成功匹配到模型。 diff --git a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md index 5db8bd6d63..72e4bad00a 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md @@ -159,7 +159,7 @@ python3 -m vllm_mindspore.entrypoints vllm.entrypoints.openai.api_server --model 其中,`TENSOR_PARALLEL_SIZE`为用户指定的卡数,`MAX_MODEL_LEN`为模型最大输出token数。 -若服务成功拉起,则可以获得类似的执行结果: +用户可以通过`--model`参数,指定模型保存的本地路径。若服务成功拉起,则可以获得类似的执行结果: ```text INFO: Started server process [6363] @@ -178,10 +178,10 @@ Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg gereration throughput: 0.0 
使用如下命令发送请求。其中`prompt`字段为模型输入: ```bash -curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "Qwen2.5-32B-Instruct", "prompt": "I am", "max_tokens": 20, "temperature": 0}' +curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "Qwen/Qwen2.5-32B-Instruct", "prompt": "I am", "max_tokens": 20, "temperature": 0}' ``` -若请求处理成功,将获得以下的推理结果: +其中,用户需确认`"model"`字段与启动服务中`--model`一致,请求才能成功匹配到模型。可以通过请求处理成功,将获得以下的推理结果: ```text { diff --git a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/profiling/profiling.md b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/profiling/profiling.md index 4dc4f2ccee..eb90283188 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/profiling/profiling.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/profiling/profiling.md @@ -40,7 +40,7 @@ curl -X POST http://127.0.0.1:8000/start_profile curl http://localhost:8000/v1/completions \ -H "Content-Type: application/json" \ -d '{ - "model": "/home/DeepSeekV3", + "model": "Qwen/Qwen2.5-32B-Instruct", "prompt": "San Francisco is a", "max_tokens": 7, "temperature": 0 -- Gitee From af28e12112606bf2d766f1fb222e2680bd4671c7 Mon Sep 17 00:00:00 2001 From: horcam Date: Tue, 12 Aug 2025 11:14:01 +0800 Subject: [PATCH 3/7] update introduction of manual install --- .../installation/installation.md | 47 +++++++++++++++++++ .../installation/installation.md | 46 ++++++++++++++++++ 2 files changed, 93 insertions(+) diff --git a/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md b/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md index a6789dccde..cadf178a7a 100644 --- a/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md +++ b/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md @@ -148,6 +148,53 @@ docker exec -it $DOCKER_NAME bash export PYTHONPATH=$MF_PATH:$PYTHONPATH ``` +- **Manual Component Installation** + + If user need to modify the components or use other versions, components need to be manually installed in a specific order. vLLM MindSpore requires the following installation sequence: + + 1. Install vLLM + + ```bash + pip install /path/to/vllm-*.whl + ``` + + 2. Uninstall Torch-related components + + ```bash + pip uninstall torch torch-npu torchvision torchaudio -y + ``` + + 3. Install MindSpore + + ```bash + pip install /path/to/mindspore-*.whl + ``` + + 4. Clone the MindSpore Transformers repository and add it to `PYTHONPATH` + + ```bash + git clone https://gitee.com/mindspore/mindformers.git + export PYTHONPATH=`realpath mindformers`:$PYTHONPATH + ``` + + 5. Install Golden Stick + + ```bash + pip install /path/to/mindspore_gs-*.whl + ``` + + 6. Install MSAdapter + + ```bash + pip install /path/to/msadapter-*.whl + ``` + + 7. Install vLLM MindSpore + + ```bash + pip install . + ``` + ### Quick Verification User can verify the installation with a simple offline inference test. 
First, user need to configure the environment variables with the following command: diff --git a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md index 2243f2eaea..2331d3db9e 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md @@ -147,6 +147,52 @@ docker exec -it $DOCKER_NAME bash export PYTHONPATH=$MF_PATH:$PYTHONPATH ``` +- **组件手动安装** + + 若用户对组件有修改,或者需使用其他版本,则用户需要按照特定顺序,手动安装组件。vLLM MindSpore对组件的安装顺序要求如下: + 1. 安装vLLM + + ```bash + pip install /path/to/vllm-*.whl + ``` + + 2. 卸载torch相关组件 + + ```bash + pip uninstall torch torch-npu torchvision torchaudio -y + ``` + + 3. 安装MindSpore + + ```bash + pip install /path/to/mindspore-*.whl + ``` + + 4. 引入MindSpore Transformers仓,加入到`PYTHONPATH`中 + + ```bash + git clone https://gitee.com/mindspore/mindformers.git + export PYTHONPATH=`realpath mindformers`:$PYTHONPATH + ``` + + 5. 安装Golden Stick + + ```bash + pip install /path/to/mindspore_gs-*.whl + ``` + + 6. 安装MSAdapter + + ```bash + pip install /path/to/msadapter-*.whl + ``` + + 7. 安装vLLM MindSpore + + ```bash + pip install . + ``` + ### 快速验证 用户可以创建一个简单的离线推理场景,验证安装是否成功。下面以[Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) 为例。首先用户需要执行以下命令,设置环境变量: -- Gitee From ab80875d6bc68507d06efdf46d8dfecb707b909c Mon Sep 17 00:00:00 2001 From: horcam Date: Tue, 12 Aug 2025 11:25:21 +0800 Subject: [PATCH 4/7] update mindformers config introduction --- .../docs/source_en/getting_started/quick_start/quick_start.md | 2 +- .../source_zh_cn/getting_started/quick_start/quick_start.md | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/vllm_mindspore/docs/source_en/getting_started/quick_start/quick_start.md b/docs/vllm_mindspore/docs/source_en/getting_started/quick_start/quick_start.md index 506dc04152..91a88e814d 100644 --- a/docs/vllm_mindspore/docs/source_en/getting_started/quick_start/quick_start.md +++ b/docs/vllm_mindspore/docs/source_en/getting_started/quick_start/quick_start.md @@ -138,7 +138,7 @@ export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Tra Here is an explanation of these environment variables: - `vLLM_MODEL_BACKEND`: The backend of the model to run. User could find supported models and backends for vLLM MindSpore in the [Model Support List](../../user_guide/supported_models/models_list/models_list.md). -- `MINDFORMERS_MODEL_CONFIG`: The model configuration file. +- `MINDFORMERS_MODEL_CONFIG`: The model configuration file. User can find the corresponding YAML file in the [MindSpore Transformers repository](https://gitee.com/mindspore/mindformers/tree/master/research/qwen2_5). For Qwen2.5-7B, the YAML file is [predict_qwen2_5_7b_instruct.yaml](https://gitee.com/mindspore/mindformers/blob/master/research/qwen2_5/predict_qwen2_5_7b_instruct.yaml). Additionally, users need to ensure that MindSpore Transformers is installed. 
Users can add it by running the following command: diff --git a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/quick_start/quick_start.md b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/quick_start/quick_start.md index cb8960fed7..82d5534bda 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/quick_start/quick_start.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/quick_start/quick_start.md @@ -138,7 +138,7 @@ export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Tra 以下是对上述环境变量的解释: - `vLLM_MODEL_BACKEND`:所运行的模型后端。目前vLLM MindSpore所支持的模型与模型后端,可在[模型支持列表](../../user_guide/supported_models/models_list/models_list.md)中进行查询; -- `MINDFORMERS_MODEL_CONFIG`:模型配置文件。 +- `MINDFORMERS_MODEL_CONFIG`:模型配置文件。用户可以在[MindSpore Transformers工程](https://gitee.com/mindspore/mindformers/tree/r1.5.0/research/qwen2_5)中,找到对应模型的yaml文件。以Qwen2.5-7B为例,则其yaml文件为[predict_qwen2_5_7b_instruct.yaml](https://gitee.com/mindspore/mindformers/blob/r1.5.0/research/qwen2_5/predict_qwen2_5_7b_instruct.yaml) 。 另外,用户需要确保MindSpore Transformers已安装。用户可通过 -- Gitee From c984755479a9857cc421e46302de66df5902b570 Mon Sep 17 00:00:00 2001 From: horcam Date: Tue, 12 Aug 2025 16:07:42 +0800 Subject: [PATCH 5/7] update env variable and pkgs version --- .../installation/installation.md | 18 ++++----- .../environment_variables.md | 37 +++++++++++++++++-- .../installation/installation.md | 18 ++++----- .../environment_variables.md | 32 +++++++++++++++- 4 files changed, 83 insertions(+), 22 deletions(-) diff --git a/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md b/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md index cadf178a7a..9c3f34359f 100644 --- a/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md +++ b/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md @@ -14,15 +14,15 @@ This document describes the steps to install the vLLM MindSpore environment. 
Thr - Python: 3.9 / 3.10 / 3.11 - Software version compatibility - | Software | Version | Corresponding Branch | - | -------- | ------- | -------------------- | - | [CANN](https://www.hiascend.com/developer/download/community/result?module=cann) | 8.1 | - | - | [MindSpore](https://www.mindspore.cn/install/) | 2.7 | master | - | [MSAdapter](https://git.openi.org.cn/OpenI/MSAdapter) | 0.2 | master | - | [MindSpore Transformers](https://gitee.com/mindspore/mindformers) | 1.6 | dev | - | [Golden Stick](https://gitee.com/mindspore/golden-stick) | 1.1.0 | r1.1.0 | - | [vLLM](https://github.com/vllm-project/vllm) | 0.9.1 | v0.9.1 | - | [vLLM MindSpore](https://gitee.com/mindspore/vllm-mindspore) | 0.3 | master | + | Software | Version | + | ----- | ----- | + |[CANN](https://www.hiascend.com/developer/download/community/result?module=cann) | 8.1 | + |[MindSpore](https://www.mindspore.cn/install/) | 2.7.0 | + |[MSAdapter](https://git.openi.org.cn/OpenI/MSAdapter)| 0.0.1 | + |[MindSpore Transformers](https://gitee.com/mindspore/mindformers)| 1.6.0 | + |[Golden Stick](https://gitee.com/mindspore/golden-stick)| 1.2.0 | + |[vLLM](https://github.com/vllm-project/vllm) | 0.8.3 | + |[vLLM MindSpore](https://gitee.com/mindspore/vllm-mindspore) | 0.3.0 | ## Environment Setup diff --git a/docs/vllm_mindspore/docs/source_en/user_guide/environment_variables/environment_variables.md b/docs/vllm_mindspore/docs/source_en/user_guide/environment_variables/environment_variables.md index c1b7161626..35129a6215 100644 --- a/docs/vllm_mindspore/docs/source_en/user_guide/environment_variables/environment_variables.md +++ b/docs/vllm_mindspore/docs/source_en/user_guide/environment_variables/environment_variables.md @@ -11,6 +11,37 @@ | `HCCL_SOCKET_IFNAME` | Specifies the network interface name for inter-machine communication using HCCL. | String | Interface name (e.g., `enp189s0f0`). | Used in multi-machine scenarios. The interface name can be found via `ifconfig` by matching the IP address. | | `ASCEND_RT_VISIBLE_DEVICES` | Specifies which devices are visible to the current process, supporting one or multiple Device IDs. | String | Device IDs as a comma-separated string (e.g., `"0,1,2,3,4,5,6,7"`). | Recommended for Ray usage scenarios. | | `HCCL_BUFFSIZE` | Controls the buffer size for data sharing between two NPUs. | int | Buffer size in MB (e.g., `2048`). | Usage reference: [HCCL_BUFFSIZE](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/81RC1beta1/maintenref/envvar/envref_07_0080.html). Example: For DeepSeek hybrid parallelism (Data Parallel: 32, Expert Parallel: 32) with `max-num-batched-tokens=256`, set `export HCCL_BUFFSIZE=2048`. | -| MS_MEMPOOL_BLOCK_SIZE | Set the size of the memory pool block in PyNative mode for devices | String | String of positive number, and the unit is GB. | | -| vLLM_USE_NPU_ADV_STEP_FLASH_OP | Whether to use Ascend operation `adv_step_flash` | String | `on`: Use;`off`:Not use | If the variable is set to `off`, model will use the implement of small operations. | -| VLLM_TORCH_PROFILER_DIR | Enables profiling data collection and takes effect when a data save path is configured. | String | The path to save profiling data. | | +| `MS_MEMPOOL_BLOCK_SIZE` | Set the size of the memory pool block in PyNative mode for devices | String | String of positive number, and the unit is GB. 
| | +| `vLLM_USE_NPU_ADV_STEP_FLASH_OP` | Whether to use Ascend operation `adv_step_flash` | String | `on`: Use;`off`:Not use | If the variable is set to `off`, model will use the implement of small operations. | +| `VLLM_TORCH_PROFILER_DIR` | Enables profiling data collection and takes effect when a data save path is configured. | String | The path to save profiling data. | | + +The following environment variables are automatically registered by vLLM MindSpore: + +| **Environment Variable** | **Function** | **Type** | **Values** | **Description** | +|------------------------|-------------|----------|------------|----------------| +| `USE_TORCH` | Transformer runtime depends on this variable. | String | Default: `"False"` | vLLM MindSpore does not use Torch as the backend. | +| `USE_TF` | Transformer runtime depends on this variable. | String | Default: `"False"` | vLLM MindSpore does not use TensorFlow as the backend. | +| `RUN_MODE` | Execution mode. | String | Default: `"predict"` | **This variable will be removed in future versions.** Required by MindFormers. | +| `CUSTOM_MATMUL_SHUFFLE` | Enables or disables custom matrix shuffling algorithm . | String | `on`: Enable shuffling. `off`: Disable shuffling. Default: `on`. | | +| `HCCL_DETERMINISTIC` | Enables or disables deterministic computation for reduction-type communication operators (e.g., AllReduce, ReduceScatter, Reduce). | String | `true`: Enable deterministic mode. `false`: Disable deterministic mode. Default: `false`. | | +| `ASCEND_LAUNCH_BLOCKING` | Controls whether operators run in synchronous mode during training or online inference. | Integer | `1`: Force synchronous execution. `0`: Do not force synchronous execution. Default: `0`. | | +| `TE_PARALLEL_COMPILER` | Maximum number of parallel compilation processes for operators. Parallel compilation is enabled if greater than 1. | Integer | Positive integer; Max = CPU cores * 80% / # of Ascend AI processors. Range: 1~32. Default: `0`. | | +| `LCCL_DETERMINISTIC` | Controls whether LCCL deterministic AllReduce (ordered addition) is enabled. | Integer | `1`: Enable deterministic mode. `0`: Disable deterministic mode. Default: `0`. | | +| `MS_ENABLE_GRACEFUL_EXIT` | Enables graceful process termination. | Integer | `1`: Enable graceful exit. `Other values`: Disable graceful exit. Default: `0`. | | +| `CPU_AFFINITY` | Optimizes CPU core binding for MindSpore inference. | String | `True`: Enable core binding. `False`: Disable core binding. Default: `True`. | **This variable will be removed in future versions.** Replaced by `set_cpu_affinity` API. | +| `MS_ENABLE_INTERNAL_BOOST` | Enables or disables MindSpore framework's internal acceleration. | String | `on`: Enable acceleration. `off`: Disable acceleration. Default: `on`. | | +| `MS_ENABLE_LCCL` | Controls whether the LCCL communication library is used. | Integer | `1`: Enable. `0`: Disable. Default: `0`. | | +| `HCCL_EXEC_TIMEOUT` | Controls the synchronization timeout for inter-device execution. | Integer | Range: (0, 17340] (seconds). Default: `7200`. | | +| `DEVICE_NUM_PER_NODE` | Number of devices per node. | Integer | Default: `16`. | | +| `HCCL_OP_EXPANSION_MODE` | Configures the expansion location for communication algorithms. | String | `AI_CPU`: Expands on AI CPU compute units. `AIV`: Expands on AI Vector Core compute units. Default: `AIV`. | | +| `MS_JIT_MODULES` | Specifies modules to be JIT-compiled in static graph mode. | String | Module names (top-level imports). Multiple modules should be comma-separated. 
Default: `"vllm_mindspore,research"`. | | +| `GLOG_v` | Controls log level. | Integer | `0`: DEBUG. `1`: INFO. `2`: WARNING. `3`: ERROR (logs errors, may not terminate). `4`: CRITICAL (logs critical errors, terminates execution). Default: `3`. | | +| `RAY_CGRAPH_get_timeout` | Timeout for `ray.get()` method (seconds). | Integer | Default: `360`. | | +| `MS_NODE_TIMEOUT` | Node heartbeat timeout (seconds). | Integer | Default: `180`. | | + +More environment variable information can be referred in the following link: + + - [CANN Environment Variable List](https://www.hiascend.com/document/detail/en/CANNCommunityEdition/81RC1beta1/index/index.html) + - [MindSpore Environment Variable List](https://www.mindspore.cn/docs/en/r2.7.0rc1/api_python/env_var_list.html) + - [MindSpore Transformers Environment Variable List](https://www.mindspore.cn/mindformers/docs/en/r1.6.0/index.html) + - [vLLM Environment Variable List](https://docs.vllm.ai/en/v0.8.4/serving/env_vars.html) diff --git a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md index 2331d3db9e..8d4eb6a8d6 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md @@ -13,15 +13,15 @@ - Python:3.9 / 3.10 / 3.11 - 软件版本配套 - | 软件 | 版本 | 对应分支 | - | ----- | ----- | ----- | - |[CANN](https://www.hiascend.com/developer/download/community/result?module=cann) | 8.1 | - | - |[MindSpore](https://www.mindspore.cn/install/) | 2.7 | master | - |[MSAdapter](https://git.openi.org.cn/OpenI/MSAdapter)| 0.2 | master | - |[MindSpore Transformers](https://gitee.com/mindspore/mindformers)|1.6 | dev | - |[Golden Stick](https://gitee.com/mindspore/golden-stick)|1.1.0 | r1.1.0 | - |[vLLM](https://github.com/vllm-project/vllm) | 0.9.1 | v0.9.1 | - |[vLLM MindSpore](https://gitee.com/mindspore/vllm-mindspore) | 0.3 | master | + | 软件 | 版本 | + | ----- | ----- | + |[CANN](https://www.hiascend.com/developer/download/community/result?module=cann) | 8.1 | + |[MindSpore](https://www.mindspore.cn/install/) | 2.7.0 | + |[MSAdapter](https://git.openi.org.cn/OpenI/MSAdapter)| 0.0.1 | + |[MindSpore Transformers](https://gitee.com/mindspore/mindformers)| 1.6.0 | + |[Golden Stick](https://gitee.com/mindspore/golden-stick)| 1.2.0 | + |[vLLM](https://github.com/vllm-project/vllm) | 0.8.3 | + |[vLLM MindSpore](https://gitee.com/mindspore/vllm-mindspore) | 0.3.0 | ## 配置环境 diff --git a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/environment_variables/environment_variables.md b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/environment_variables/environment_variables.md index 7fd53b3ff3..835946b67d 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/environment_variables/environment_variables.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/environment_variables/environment_variables.md @@ -10,7 +10,37 @@ | TP_SOCKET_IFNAME | 用于多机之间使用TP通信时的网口名称。 | String | 网口名称,例如enp189s0f0。 | 多机场景使用,可通过`ifconfig`查找ip对应网卡的网卡名。 | | HCCL_SOCKET_IFNAME | 用于多机之间使用HCCL通信时的网口名称。 | String | 网口名称,例如enp189s0f0。 | 多机场景使用,可通过`ifconfig`查找ip对应网卡的网卡名。 | | ASCEND_RT_VISIBLE_DEVICES | 指定哪些Device对当前进程可见,支持一次指定一个或多个Device ID。 | String | 为Device ID,逗号分割的字符串,例如"0,1,2,3,4,5,6,7"。 | ray使用场景建议使用。 | -| HCCL_BUFFSIZE | 此环境变量用于控制两个NPU之间共享数据的缓存区大小。 | int | 缓存区大小,大小为MB。例如:`2048`。 | 
使用方法参考:[HCCL_BUFFSIZE](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/81RC1beta1/maintenref/envvar/envref_07_0080.html)。例如DeepSeek 混合并行(数据并行数为32,专家并行数为32),且`max-num-batched-tokens`为256时,则`export HCCL_BUFFSIZE=2048`。 | +| HCCL_BUFFSIZE | 此环境变量用于控制两个NPU之间共享数据的缓存区大小。 | Integer | 缓存区大小,大小为MB。例如:`2048`。 | 使用方法参考:[HCCL_BUFFSIZE](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/81RC1beta1/maintenref/envvar/envref_07_0080.html)。例如DeepSeek 混合并行(数据并行数为32,专家并行数为32),且`max-num-batched-tokens`为256时,则`export HCCL_BUFFSIZE=2048`。 | | MS_MEMPOOL_BLOCK_SIZE | 设置PyNative模式下设备内存池的块大小。 | String | 正整数string,单位为GB。 | | | vLLM_USE_NPU_ADV_STEP_FLASH_OP | 是否使用昇腾`adv_step_flash`算子。 | String | `on`: 使用;`off`:不使用 | 取值为`off`时,将使用小算子实现替代`adv_step_flash`算子。 | | VLLM_TORCH_PROFILER_DIR | 开启profiling采集数据,当配置了采集数据保存路径后生效 | String | Profiling数据保存路径。| | + +以下环境变量由vLLM MindSpore自动注册: + +| 环境变量 | 功能 | 类型 | 取值 | 说明 | +| ------ | ------- | ------ | ------ | ------ | +| USE_TORCH | Transformer运行时依赖该环境变量 | String | 默认值为"False" | vLLM MindSpore 不使用torch 作为后端 | +| USE_TF | Transformer运行时依赖该环境变量 | String | 默认值为"False" | vLLM MindSpore 不使用TensorFlow 作为后端 | +| RUN_MODE | 执行模式为推理 | String | 默认值为"predict" | **该环境变量在后续版本会被移除。** 为MindFormers依赖的环境变量 | +| CUSTOM_MATMUL_SHUFFLE | 开启或关闭自定义矩阵算法的洗牌操作 | String | `on`:开启矩阵洗牌。`off`:关闭矩阵洗牌。默认值为`on`。 | | +| HCCL_DETERMINISTIC | 开启或关闭归约类通信算子的确定性计算,其中归约类通信算子包括 AllReduce、ReduceScatter、Reduce。 | String | `true`:打开 HCCL 确定性开关;`false`:关闭 HCCL 确定性开关。默认值为`false`。 | | +| ASCEND_LAUNCH_BLOCKING | 训练或在线推理场景,可通过此环境变量控制算子执行时是否启动同步模式。 | Integer | `1`:强制算子采用同步模式运行;`0`:不强制算子采用同步模式运行。默认值为`0`。 | | +| TE_PARALLEL_COMPILER | 算子最大并行编译进程数,当大于 1 时开启并行编译。 | Integer | 取值为正整数;最大不超过 cpu 核数*80%/昇腾 AI 处理器个数,取值范围 1~32。默认值是 `0`。 | | +| LCCL_DETERMINISTIC | 设置 LCCL 确定性算子 AllReduce(保序加)是否开启。 | Integer | `1`:打开 LCCL 确定性开关;`0`:关闭 LCCL 确定性开关。默认值是 `0`。 | | +| MS_ENABLE_GRACEFUL_EXIT | 设置使能进程优雅退出 | Integer | `1`:使用进程优雅退出功能。`不设置或者其他值`: 不使用进程优雅退出功能。默认值为`0` | | +| CPU_AFFINIITY | MindSpore推理绑核优化 | String | `True`:开启绑核;`True`:不开启绑核。默认值为`True` | **该环境变量在后续版本会被移除。** 将使用`set_cpu_affinity`接口。 | +| MS_ENABLE_INTERNAL_BOOST | 是否打开 MindSpore 框架的内部加速功能。 | String | `on`:开启 MindSpore 内部加速;`off`:关闭 MindSpore 内部加速。默认值为`on` | | +| MS_ENABLE_LCCL | 是否使用LCCL通信库。 | Integer | `1`:开启,`0`:关闭。默认值为`0`。 | | +| HCCL_EXEC_TIMEOUT | 通过该环境变量可控制设备间执行时同步等待的时间,在该配置时间内各设备进程等待其他设备执行通信同步。 | Integer | 取值范围为:(0, 17340],单位为 s。 默认值为 7200。 | | +| DEVICE_NUM_PER_NODE | 节点上的设备数 | Integer | 默认值为16。 | | +| HCCL_OP_EXPANSION_MODE | 用于配置通信算法的编排展开位置。 | String | `AI_CPU`:通信算法的编排展开位置为Device侧的AI CPU计算单元;`AIV`:通信算法的编排展开位置为Device侧的AI Vector Core计算单元。默认值为`AIV`。 | | +| MS_JIT_MODULES | 指定静态图模式下哪些模块需要JIT静态编译,其函数方法会被编译成静态计算图 | String | 模块名,对应import导入的顶层模块的名称。如果有多个,使用英文逗号分隔。默认值为`"vllm_mindspore,research"`。 | | +| GLOG_v | 控制日志的级别 | Integer | `0`:DEBUG;`1`:INFO;`2`:WARNING;`3`:ERROR,表示程序执行出现报错,输出错误日志,程序可能不会终止;`4`:CRITICAL,表示程序执行出现异常,将会终止执行程序。默认值为`3`。 | | +| RAY_CGRAPH_get_timeout | `ray.get()`方法的超时时间。 | Integer | 默认值为`360`。 | | +| MS_NODE_TIMEOUT | 节点心跳超时时间,单位:秒。 | Integer | 默认值为`180`。 | | + +更多的环境变量信息,请查看: + - [CANN 环境变量列表](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/81RC1beta1/index/index.html) + - [MindSpore 环境变量列表](https://www.mindspore.cn/docs/zh-CN/r2.7.0rc1/api_python/env_var_list.html) + - [MindSpore Transformers 环境变量列表](https://www.mindspore.cn/mindformers/docs/zh-CN/r1.6.0/index.html) + - [vLLM 环境变量列表](https://docs.vllm.ai/en/v0.8.4/serving/env_vars.html) -- Gitee From 10cecf67266ac2a31684ed427bd13febb662c7f6 Mon Sep 17 00:00:00 
2001 From: horcam Date: Tue, 12 Aug 2025 16:14:39 +0800 Subject: [PATCH 6/7] update others --- .../user_guide/supported_features/profiling/profiling.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/profiling/profiling.md b/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/profiling/profiling.md index b24b541e59..897f1ec0c8 100644 --- a/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/profiling/profiling.md +++ b/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/profiling/profiling.md @@ -40,7 +40,7 @@ curl -X POST http://127.0.0.1:8000/start_profile curl http://localhost:8000/v1/completions \ -H "Content-Type: application/json" \ -d '{ - "model": "/home/DeepSeekV3", + "model": "Qwen/Qwen2.5-32B-Instruct", "prompt": "San Francisco is a", "max_tokens": 7, "temperature": 0 -- Gitee From 242892400a1ac78a4a22a45f1b0cb59b80c9146d Mon Sep 17 00:00:00 2001 From: horcam Date: Wed, 13 Aug 2025 16:19:23 +0800 Subject: [PATCH 7/7] fix for comment --- .../installation/installation.md | 93 +++++++++------- .../deepseek_r1_671b_w8a8_dp4_tp4_ep4.md | 19 ++-- .../qwen2.5_32b_multiNPU.md | 4 +- .../qwen2.5_7b_singleNPU.md | 2 +- .../environment_variables.md | 34 +----- .../quantization/quantization.md | 2 +- .../models_list/models_list.md | 2 +- .../installation/installation.md | 105 ++++++++++-------- .../quick_start/quick_start.md | 4 +- .../deepseek_r1_671b_w8a8_dp4_tp4_ep4.md | 17 +-- .../qwen2.5_32b_multiNPU.md | 4 +- .../qwen2.5_7b_singleNPU.md | 6 +- .../environment_variables.md | 53 +++------ .../quantization/quantization.md | 2 +- .../models_list/models_list.md | 2 +- 15 files changed, 160 insertions(+), 189 deletions(-) diff --git a/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md b/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md index 9c3f34359f..7299d810a1 100644 --- a/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md +++ b/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md @@ -4,8 +4,7 @@ This document describes the steps to install the vLLM MindSpore environment. Three installation methods are provided: -- [Docker Installation](#docker-installation): Suitable for quick deployment scenarios. -- [Pip Installation](#pip-installation): Suitable for scenarios requiring specific versions. +- [Docker Installation](#docker-installation): Suitable for quick deployment scenarios. - [Source Code Installation](#source-code-installation): Suitable for incremental development of vLLM MindSpore. ## Version Compatibility @@ -14,19 +13,19 @@ This document describes the steps to install the vLLM MindSpore environment. 
Thr - Python: 3.9 / 3.10 / 3.11 - Software version compatibility - | Software | Version | + | Software | Version And Links | | ----- | ----- | - |[CANN](https://www.hiascend.com/developer/download/community/result?module=cann) | 8.1 | - |[MindSpore](https://www.mindspore.cn/install/) | 2.7.0 | - |[MSAdapter](https://git.openi.org.cn/OpenI/MSAdapter)| 0.0.1 | - |[MindSpore Transformers](https://gitee.com/mindspore/mindformers)| 1.6.0 | - |[Golden Stick](https://gitee.com/mindspore/golden-stick)| 1.2.0 | - |[vLLM](https://github.com/vllm-project/vllm) | 0.8.3 | - |[vLLM MindSpore](https://gitee.com/mindspore/vllm-mindspore) | 0.3.0 | + |[CANN](https://www.hiascend.com/developer/download/community/result?module=cann) | [8.1.RC1](https://www.hiascend.com/document/detail/zh/canncommercial/81RC1/softwareinst/instg/instg_0000.html?Mode=PmIns&InstallType=local&OS=Debian&Software=cannToolKit) | + |[MindSpore](https://www.mindspore.cn/install/) | [2.7.0](https://repo.mindspore.cn/mindspore/mindspore/version/202508/20250814/master_20250814091143_7548abc43af03319bfa528fc96d0ccd3917fcc9c_newest/unified/) | + |[MSAdapter](https://git.openi.org.cn/OpenI/MSAdapter)| [0.5.0](https://repo.mindspore.cn/mindspore/msadapter/version/202508/20250814/master_20250814010018_4615051c43eef898b6bbdc69768656493b5932f8_newest/any/) | + |[MindSpore Transformers](https://gitee.com/mindspore/mindformers)| [1.6.0](https://gitee.com/mindspore/mindformers) | + |[Golden Stick](https://gitee.com/mindspore/golden-stick)| [1.2.0](https://repo.mindspore.cn/mindspore/golden-stick/version/202508/20250814/master_20250814010017_2713821db982330b3bcd6d84d85a3b337d555f27_newest/any/) | + |[vLLM](https://github.com/vllm-project/vllm) | [0.9.1](https://repo.mindspore.cn/mirrors/vllm/version/202505/20250514/v0.8.4.dev0_newest/any/) | + |[vLLM MindSpore](https://gitee.com/mindspore/vllm-mindspore) | [0.3.0](https://gitee.com/mindspore/vllm-mindspore/) | ## Environment Setup -This section introduces three installation methods: [Docker Installation](#docker-installation), [Pip Installation](#pip-installation), [Source Code Installation](#source-code-installation), and [Quick Verification](#quick-verification) example to check the installation. +This section introduces two installation methods: [Docker Installation](#docker-installation), [Source Code Installation](#source-code-installation), and [Quick Verification](#quick-verification) example to check the installation. ### Docker Installation @@ -106,51 +105,55 @@ docker exec -it $DOCKER_NAME bash ### Source Code Installation -- **CANN Installation** +#### CANN Installation - For CANN installation methods and environment configuration, please refer to [CANN Community Edition Installation Guide](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/82RC1alpha002/softwareinst/instg/instg_0001.html?Mode=PmIns&OS=openEuler&Software=cannToolKit). If you encounter any issues during CANN installation, please consult the [Ascend FAQ](https://www.hiascend.com/document/detail/zh/AscendFAQ/ProduTech/CANNFAQ/cannfaq_000.html) for troubleshooting. +For CANN installation methods and environment configuration, please refer to [CANN Community Edition Installation Guide](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/82RC1alpha002/softwareinst/instg/instg_0001.html?Mode=PmIns&OS=openEuler&Software=cannToolKit). 
If you encounter any issues during CANN installation, please consult the [Ascend FAQ](https://www.hiascend.com/document/detail/zh/AscendFAQ/ProduTech/CANNFAQ/cannfaq_000.html) for troubleshooting. - The default installation path for CANN is `/usr/local/Ascend`. After completing CANN installation, configure the environment variables with the following commands: +The default installation path for CANN is `/usr/local/Ascend`. After completing CANN installation, configure the environment variables with the following commands: - ```bash - LOCAL_ASCEND=/usr/local/Ascend # the root directory of run package - source ${LOCAL_ASCEND}/ascend-toolkit/set_env.sh - export ASCEND_CUSTOM_PATH=${LOCAL_ASCEND}/ascend-toolkit - ``` +```bash +LOCAL_ASCEND=/usr/local/Ascend # the root directory of run package +source ${LOCAL_ASCEND}/ascend-toolkit/set_env.sh +export ASCEND_CUSTOM_PATH=${LOCAL_ASCEND}/ascend-toolkit +``` -- **vLLM Prerequisites Installation** +#### vLLM Prerequisites Installation For vLLM environment configuration and installation methods, please refer to the [vLLM Installation Guide](https://docs.vllm.ai/en/v0.9.1/getting_started/installation/cpu.html). In vLLM installation, `gcc/g++ >= 12.3.0` is required, and it could be installed by the following command: - ```bash - yum install -y gcc gcc-c++ - ``` +```bash +yum install -y gcc gcc-c++ +``` -- **vLLM MindSpore Installation** +#### vLLM MindSpore Installation - To install vLLM MindSpore, user needs to pull the vLLM MindSpore source code and then runs the following command to install the dependencies: +vLLM MindSpore can be installed in the following two ways. **vLLM MindSpore One-click Installation** is suitable for scenarios where users need quick deployment and usage. **vLLM MindSpore Manual Installation** is suitable for scenarios where users require custom modifications to the components. - ```bash - git clone https://gitee.com/mindspore/vllm-mindspore.git - cd vllm-mindspore - bash install_depend_pkgs.sh - ``` +- **vLLM MindSpore One-click Installation** - Compile and install vLLM MindSpore: + To install vLLM MindSpore, users need to pull the vLLM MindSpore source code and then run the following command to install the dependencies: - ```bash - pip install . - ``` + ```bash + git clone https://gitee.com/mindspore/vllm-mindspore.git + cd vllm-mindspore + bash install_depend_pkgs.sh + ``` - After executing the above commands, `mindformers` folder will be generated in the `vllm-mindspore/install_depend_pkgs` directory. Add this folder to the environment variables: + Compile and install vLLM MindSpore: - ```bash - export PYTHONPATH=$MF_PATH:$PYTHONPATH - ``` + ```bash + pip install . + ``` -- **Manual Component Installation** + After executing the above commands, `mindformers` folder will be generated in the `vllm-mindspore/install_depend_pkgs` directory. Add this folder to the environment variables: - If user need to modify the components or use other versions, components need to be manually installed in a specific order. vLLM MindSpore requires the following installation sequence: + ```bash + export PYTHONPATH=$MF_PATH:$PYTHONPATH + ``` + +- **vLLM MindSpore Manual Installation** + + If users need to modify the components or use other versions, components need to be manually installed in a specific order. Version compatibility of vLLM MindSpore can be found in [Version Compatibility](#version-compatibility), and vLLM MindSpore requires the following installation sequence: 1. 
Install vLLM @@ -174,7 +177,7 @@ docker exec -it $DOCKER_NAME bash ```bash git clone https://gitee.com/mindspore/mindformers.git - export PYTHONPATH=`realpath mindformers`:$PYTHONPATH + export PYTHONPATH=$MF_PATH:$PYTHONPATH ``` 5. Install Golden Stick @@ -189,13 +192,17 @@ docker exec -it $DOCKER_NAME bash pip install /path/to/msadapter-*.whl ``` - 7. Install vLLM MindSpore + 7. Install vLLM MindSpore + + Users need to pull the vLLM MindSpore source code and then run the installation: ```bash + git clone https://gitee.com/mindspore/vllm-mindspore.git + cd vllm-mindspore pip install . ``` -### Quick Verification +## Quick Verification User can verify the installation with a simple offline inference test. First, user need to configure the environment variables with the following command: diff --git a/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md b/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md index 8386c3d11c..9c369446a2 100644 --- a/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md +++ b/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md @@ -52,8 +52,8 @@ Execute the following Python script to download the MindSpore-compatible DeepSee ```python from openmind_hub import snapshot_download -snapshot_download(repo_id="MindSpore-Lab/DeepSeek-R1-W8A8", - local_dir="/path/to/save/deepseek_r1_w8a8", +snapshot_download(repo_id="MindSpore-Lab/DeepSeek-R1-0528-A8W8", + local_dir="/path/to/save/deepseek_r1_0528_a8w8", local_dir_use_symlinks=False) ``` @@ -78,7 +78,7 @@ If the tool is unavailable, install [git-lfs](https://git-lfs.com) first. Refer Once confirmed, download the weights by executing the following command: ```shell -git clone https://modelers.cn/MindSpore-Lab/DeepSeek-R1-W8A8.git +git clone https://modelers.cn/models/MindSpore-Lab/DeepSeek-R1-0528-A8W8.git ``` ## TP16 Tensor Parallel Inference @@ -241,17 +241,17 @@ Execution example: ```bash # Master node: -vllm-mindspore serve --model="MindSpore-Lab/DeepSeek-R1-W8A8" --trust-remote-code --max-num-seqs=256 --max_model_len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 16 --distributed-executor-backend=ray +vllm-mindspore serve --model="MindSpore-Lab/DeepSeek-R1-0528-A8W8" --trust-remote-code --max-num-seqs=256 --max_model_len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 16 --distributed-executor-backend=ray ``` -In tensor parallel scenarios, the `--tensor-parallel-size` parameter overrides the `model_parallel` configuration in the model YAML file. User can also set the local model path by `--model` argument. +In tensor parallel scenarios, the `--tensor-parallel-size` parameter overrides the `model_parallel` configuration in the model YAML file. Users can also set the local model path via the `--model` argument. 
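Before moving on to sending requests, it can help to confirm that the service has finished loading the weights and is accepting connections. The following is a minimal sketch, not part of the original tutorial, that polls the OpenAI-compatible `/v1/models` endpoint with the Python standard library; the address `127.0.0.1:8000` and the timeout values are assumptions and should be adjusted to the actual `--host`/`--port` settings:

```python
import json
import time
from urllib.error import URLError
from urllib.request import urlopen

# Assumed service address; adjust if the server was started with a different --host/--port.
BASE_URL = "http://127.0.0.1:8000"


def wait_for_server(timeout_s: int = 1800) -> None:
    """Poll /v1/models until the service responds or the timeout expires."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            with urlopen(f"{BASE_URL}/v1/models", timeout=5) as resp:
                data = json.load(resp)
                print("Service ready, served models:", [m["id"] for m in data.get("data", [])])
                return
        except (URLError, OSError):
            time.sleep(10)  # Large checkpoints such as DeepSeek-R1 can take a long time to load.
    raise TimeoutError("vLLM MindSpore service did not become ready in time.")


if __name__ == "__main__":
    wait_for_server()
```

Once the endpoint responds, requests can be sent as described below.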
#### Sending Requests Use the following command to send requests, where `prompt` is the model input: ```bash -curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "MindSpore-Lab/DeepSeek-R1-W8A8", "prompt": "I am", "max_tokens": 20, "temperature": 0, "top_p": 1.0, "top_k": 1, "repetition_penalty": 1.0}' +curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "MindSpore-Lab/DeepSeek-R1-0528-A8W8", "prompt": "I am", "max_tokens": 20, "temperature": 0, "top_p": 1.0, "top_k": 1, "repetition_penalty": 1.0}' ``` User needs to ensure that the `"model"` field matches the `--model` in the service startup, and the request can successfully match the model. @@ -304,6 +304,7 @@ parallel_config: ### Online Inference #### Starting the Service + `vllm-mindspore` can deploy online inference using the OpenAI API protocol. Below is the workflow for launching the service: ```bash @@ -330,10 +331,10 @@ User can also set the local model path by `--model` argument. The following is a ```bash # Master node: -vllm-mindspore serve --model="MindSpore-Lab/DeepSeek-R1-W8A8" --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 0 --data-parallel-address 192.10.10.10 --data-parallel-rpc-port 12370 --enable-expert-parallel +vllm-mindspore serve --model="MindSpore-Lab/DeepSeek-R1-0528-A8W8" --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 0 --data-parallel-address 192.10.10.10 --data-parallel-rpc-port 12370 --enable-expert-parallel # Worker node: -vllm-mindspore serve --headless --model="MindSpore-Lab/DeepSeek-R1-W8A8" --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 2 --data-parallel-address 192.10.10.10 --data-parallel-rpc-port 12370 --enable-expert-parallel +vllm-mindspore serve --headless --model="MindSpore-Lab/DeepSeek-R1-0528-A8W8" --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 2 --data-parallel-address 192.10.10.10 --data-parallel-rpc-port 12370 --enable-expert-parallel ``` #### Sending Requests @@ -341,7 +342,7 @@ vllm-mindspore serve --headless --model="MindSpore-Lab/DeepSeek-R1-W8A8" --trust Use the following command to send requests, where `prompt` is the model input: ```bash -curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "MindSpore-Lab/DeepSeek-R1-W8A8", "prompt": "I am", "max_tokens": 20, "temperature": 0}' +curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "MindSpore-Lab/DeepSeek-R1-0528-A8W8", "prompt": "I am", "max_tokens": 20, "temperature": 0}' ``` User needs to ensure that the `"model"` field matches the `--model` in the service startup, and the request can successfully match the model. 
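Where a scripted client is more convenient than curl, the sketch below sends an equivalent completion request through the same OpenAI-compatible endpoint using only the Python standard library. The model name, address, and sampling parameters simply mirror the curl examples above and must match the values used when the service was started:

```python
import json
from urllib.request import Request, urlopen

# These values mirror the curl example above; adjust them to the actual service configuration.
URL = "http://localhost:8000/v1/completions"
payload = {
    "model": "MindSpore-Lab/DeepSeek-R1-0528-A8W8",
    "prompt": "I am",
    "max_tokens": 20,
    "temperature": 0,
}

request = Request(
    URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urlopen(request) as response:
    result = json.load(response)

# Print only the generated text of the first returned choice.
print(result["choices"][0]["text"])
```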
diff --git a/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md b/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md index 0451257e2b..40ad4a1597 100644 --- a/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md +++ b/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md @@ -134,7 +134,7 @@ export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Tra Here is an explanation of these environment variables: - `vLLM_MODEL_BACKEND`: The model backend. Currently supported models and backends are listed in the [Model Support List](../../../user_guide/supported_models/models_list/models_list.md). -- `MINDFORMERS_MODEL_CONFIG`: Model configuration file. User can find the corresponding YAML file in the [MindSpore Transformers repository](https://gitee.com/mindspore/mindformers/tree/r1.5.0/research/qwen2_5). For Qwen2.5-32B, the YAML file is [predict_qwen2_5_32b_instruct.yaml](https://gitee.com/mindspore/mindformers/blob/r1.5.0/research/qwen2_5/predict_qwen2_5_32b_instruct.yaml). +- `MINDFORMERS_MODEL_CONFIG`: Model configuration file. User can find the corresponding YAML file in the [MindSpore Transformers repository](https://gitee.com/mindspore/mindformers/tree/master/research/qwen2_5). For Qwen2.5-32B, the YAML file is [predict_qwen2_5_32b_instruct.yaml](https://gitee.com/mindspore/mindformers/blob/master/research/qwen2_5/predict_qwen2_5_32b_instruct.yaml). Users can check memory usage with `npu-smi info` and set the NPU cards for inference using the following example (assuming cards 4,5,6,7 are used): @@ -188,7 +188,7 @@ If processed successfully, the inference result will be: { "id":"cmpl-11fe2898c77d4ff18c879f57ae7aa9ca","object":"text_completion", "create":1748568696, - "model":"Qwen/Qwen2.5-32B-Instruct", + "model":"Qwen2.5-32B-Instruct", "choices":[ { "index":0, diff --git a/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md b/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md index e1216d12c9..79ed73f812 100644 --- a/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md +++ b/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md @@ -222,7 +222,7 @@ If the request is processed successfully, the following inference result will be { "id":"cmpl-5e6e314861c24ba79fea151d86c1b9a6","object":"text_completion", "create":1747398389, - "model":"Qwen/Qwen2.5-7B-Instruct", + "model":"Qwen2.5-7B-Instruct", "choices":[ { "index":0, diff --git a/docs/vllm_mindspore/docs/source_en/user_guide/environment_variables/environment_variables.md b/docs/vllm_mindspore/docs/source_en/user_guide/environment_variables/environment_variables.md index 35129a6215..036834798e 100644 --- a/docs/vllm_mindspore/docs/source_en/user_guide/environment_variables/environment_variables.md +++ b/docs/vllm_mindspore/docs/source_en/user_guide/environment_variables/environment_variables.md @@ -15,33 +15,9 @@ | `vLLM_USE_NPU_ADV_STEP_FLASH_OP` | Whether to use Ascend operation `adv_step_flash` | String | `on`: Use;`off`:Not use | If the variable is set to `off`, model will use the implement of small operations. 
| | `VLLM_TORCH_PROFILER_DIR` | Enables profiling data collection and takes effect when a data save path is configured. | String | The path to save profiling data. | | -The following environment variables are automatically registered by vLLM MindSpore: +More environment variable information can be referred in the following links: -| **Environment Variable** | **Function** | **Type** | **Values** | **Description** | -|------------------------|-------------|----------|------------|----------------| -| `USE_TORCH` | Transformer runtime depends on this variable. | String | Default: `"False"` | vLLM MindSpore does not use Torch as the backend. | -| `USE_TF` | Transformer runtime depends on this variable. | String | Default: `"False"` | vLLM MindSpore does not use TensorFlow as the backend. | -| `RUN_MODE` | Execution mode. | String | Default: `"predict"` | **This variable will be removed in future versions.** Required by MindFormers. | -| `CUSTOM_MATMUL_SHUFFLE` | Enables or disables custom matrix shuffling algorithm . | String | `on`: Enable shuffling. `off`: Disable shuffling. Default: `on`. | | -| `HCCL_DETERMINISTIC` | Enables or disables deterministic computation for reduction-type communication operators (e.g., AllReduce, ReduceScatter, Reduce). | String | `true`: Enable deterministic mode. `false`: Disable deterministic mode. Default: `false`. | | -| `ASCEND_LAUNCH_BLOCKING` | Controls whether operators run in synchronous mode during training or online inference. | Integer | `1`: Force synchronous execution. `0`: Do not force synchronous execution. Default: `0`. | | -| `TE_PARALLEL_COMPILER` | Maximum number of parallel compilation processes for operators. Parallel compilation is enabled if greater than 1. | Integer | Positive integer; Max = CPU cores * 80% / # of Ascend AI processors. Range: 1~32. Default: `0`. | | -| `LCCL_DETERMINISTIC` | Controls whether LCCL deterministic AllReduce (ordered addition) is enabled. | Integer | `1`: Enable deterministic mode. `0`: Disable deterministic mode. Default: `0`. | | -| `MS_ENABLE_GRACEFUL_EXIT` | Enables graceful process termination. | Integer | `1`: Enable graceful exit. `Other values`: Disable graceful exit. Default: `0`. | | -| `CPU_AFFINITY` | Optimizes CPU core binding for MindSpore inference. | String | `True`: Enable core binding. `False`: Disable core binding. Default: `True`. | **This variable will be removed in future versions.** Replaced by `set_cpu_affinity` API. | -| `MS_ENABLE_INTERNAL_BOOST` | Enables or disables MindSpore framework's internal acceleration. | String | `on`: Enable acceleration. `off`: Disable acceleration. Default: `on`. | | -| `MS_ENABLE_LCCL` | Controls whether the LCCL communication library is used. | Integer | `1`: Enable. `0`: Disable. Default: `0`. | | -| `HCCL_EXEC_TIMEOUT` | Controls the synchronization timeout for inter-device execution. | Integer | Range: (0, 17340] (seconds). Default: `7200`. | | -| `DEVICE_NUM_PER_NODE` | Number of devices per node. | Integer | Default: `16`. | | -| `HCCL_OP_EXPANSION_MODE` | Configures the expansion location for communication algorithms. | String | `AI_CPU`: Expands on AI CPU compute units. `AIV`: Expands on AI Vector Core compute units. Default: `AIV`. | | -| `MS_JIT_MODULES` | Specifies modules to be JIT-compiled in static graph mode. | String | Module names (top-level imports). Multiple modules should be comma-separated. Default: `"vllm_mindspore,research"`. | | -| `GLOG_v` | Controls log level. | Integer | `0`: DEBUG. `1`: INFO. `2`: WARNING. 
`3`: ERROR (logs errors, may not terminate). `4`: CRITICAL (logs critical errors, terminates execution). Default: `3`. | | -| `RAY_CGRAPH_get_timeout` | Timeout for `ray.get()` method (seconds). | Integer | Default: `360`. | | -| `MS_NODE_TIMEOUT` | Node heartbeat timeout (seconds). | Integer | Default: `180`. | | - -More environment variable information can be referred in the following link: - - - [CANN Environment Variable List](https://www.hiascend.com/document/detail/en/CANNCommunityEdition/81RC1beta1/index/index.html) - - [MindSpore Environment Variable List](https://www.mindspore.cn/docs/en/r2.7.0rc1/api_python/env_var_list.html) - - [MindSpore Transformers Environment Variable List](https://www.mindspore.cn/mindformers/docs/en/r1.6.0/index.html) - - [vLLM Environment Variable List](https://docs.vllm.ai/en/v0.8.4/serving/env_vars.html) +- [CANN Environment Variable List](https://www.hiascend.com/document/detail/en/CANNCommunityEdition/81RC1beta1/index/index.html) +- [MindSpore Environment Variable List](https://www.mindspore.cn/docs/en/master/api_python/env_var_list.html) +- [MindSpore Transformers Environment Variable List](https://www.mindspore.cn/mindformers/docs/en/master/index.html) +- [vLLM Environment Variable List](https://docs.vllm.ai/en/v0.8.4/serving/env_vars.html) diff --git a/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/quantization/quantization.md b/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/quantization/quantization.md index 33c39b583d..fa1b8f89c3 100644 --- a/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/quantization/quantization.md +++ b/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/quantization/quantization.md @@ -16,7 +16,7 @@ We employ [MindSpore Golden Stick's PTQ algorithm](https://gitee.com/mindspore/g ### Downloading Quantized Weights -We have uploaded the quantized DeepSeek-R1 to [ModelArts Community](https://modelers.cn): [MindSpore-Lab/DeepSeek-R1-W8A8](https://modelers.cn/models/MindSpore-Lab/DeepSeek-R1-W8A8). Refer to the [ModelArts Community documentation](https://modelers.cn/docs/en/openmind-hub-client/0.9/basic_tutorial/download.html) to download the weights locally. +We have uploaded the quantized DeepSeek-R1 to [ModelArts Community](https://modelers.cn): [MindSpore-Lab/DeepSeek-R1-0528-A8W8](https://modelers.cn/models/MindSpore-Lab/DeepSeek-R1-0528-A8W8). Refer to the [ModelArts Community documentation](https://modelers.cn/docs/en/openmind-hub-client/0.9/basic_tutorial/download.html) to download the weights locally. 
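As one possible way to fetch the weights, the openMind Hub client shown in the DeepSeek tutorial of this documentation set can also be called directly; the local directory below is only a placeholder path:

```python
from openmind_hub import snapshot_download

# Download the quantized DeepSeek-R1 weights; the local_dir value is a placeholder.
snapshot_download(repo_id="MindSpore-Lab/DeepSeek-R1-0528-A8W8",
                  local_dir="/path/to/save/deepseek_r1_0528_a8w8",
                  local_dir_use_symlinks=False)
```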
## Quantized Model Inference diff --git a/docs/vllm_mindspore/docs/source_en/user_guide/supported_models/models_list/models_list.md b/docs/vllm_mindspore/docs/source_en/user_guide/supported_models/models_list/models_list.md index ba825bcae5..3d9b49bd6e 100644 --- a/docs/vllm_mindspore/docs/source_en/user_guide/supported_models/models_list/models_list.md +++ b/docs/vllm_mindspore/docs/source_en/user_guide/supported_models/models_list/models_list.md @@ -6,7 +6,7 @@ |-------| --------- | ---- | | DeepSeek-V3 | Supported | [DeepSeek-V3](https://modelers.cn/models/MindSpore-Lab/DeepSeek-V3) | | DeepSeek-R1 | Supported | [DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-V3) | -| DeepSeek-R1 W8A8 | Supported | [Deepseek-R1-W8A8](https://modelers.cn/models/MindSpore-Lab/DeepSeek-r1-w8a8) | +| DeepSeek-R1 W8A8 | Supported | [Deepseek-R1-W8A8](https://modelers.cn/models/MindSpore-Lab/DeepSeek-R1-0528-A8W8) | | Qwen2.5 | Supported | [Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct), [Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct), [Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct), [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct), [Qwen2.5-14B-Instruct](https://huggingface.co/Qwen/Qwen2.5-14B-Instruct), [Qwen2.5-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct), [Qwen2.5-72B-Instruct](https://huggingface.co/Qwen/Qwen2.5-72B-Instruct) | | Qwen3-32B | Supported | [Qwen3-32B](https://modelers.cn/models/MindSpore-Lab/Qwen3-32B) | | Qwen3-235B-A22B | Supported | [Qwen3-235B-A22B](https://huggingface.co/Qwen/Qwen3-235B-A22B) | diff --git a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md index 8d4eb6a8d6..67269d9419 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md @@ -13,19 +13,19 @@ - Python:3.9 / 3.10 / 3.11 - 软件版本配套 - | 软件 | 版本 | + | 软件 | 配套版本与下载链接 | | ----- | ----- | - |[CANN](https://www.hiascend.com/developer/download/community/result?module=cann) | 8.1 | - |[MindSpore](https://www.mindspore.cn/install/) | 2.7.0 | - |[MSAdapter](https://git.openi.org.cn/OpenI/MSAdapter)| 0.0.1 | - |[MindSpore Transformers](https://gitee.com/mindspore/mindformers)| 1.6.0 | - |[Golden Stick](https://gitee.com/mindspore/golden-stick)| 1.2.0 | - |[vLLM](https://github.com/vllm-project/vllm) | 0.8.3 | - |[vLLM MindSpore](https://gitee.com/mindspore/vllm-mindspore) | 0.3.0 | + |[CANN](https://www.hiascend.com/developer/download/community/result?module=cann) | [8.1.RC1](https://www.hiascend.com/document/detail/zh/canncommercial/81RC1/softwareinst/instg/instg_0000.html?Mode=PmIns&InstallType=local&OS=Debian&Software=cannToolKit) | + |[MindSpore](https://www.mindspore.cn/install/) | [2.7.0](https://repo.mindspore.cn/mindspore/mindspore/version/202508/20250814/master_20250814091143_7548abc43af03319bfa528fc96d0ccd3917fcc9c_newest/unified/) | + |[MSAdapter](https://git.openi.org.cn/OpenI/MSAdapter)| [0.5.0](https://repo.mindspore.cn/mindspore/msadapter/version/202508/20250814/master_20250814010018_4615051c43eef898b6bbdc69768656493b5932f8_newest/any/) | + |[MindSpore Transformers](https://gitee.com/mindspore/mindformers)| [1.6.0](https://gitee.com/mindspore/mindformers) | + |[Golden Stick](https://gitee.com/mindspore/golden-stick)| 
[1.2.0](https://repo.mindspore.cn/mindspore/golden-stick/version/202508/20250814/master_20250814010017_2713821db982330b3bcd6d84d85a3b337d555f27_newest/any/) | + |[vLLM](https://github.com/vllm-project/vllm) | [0.9.1](https://repo.mindspore.cn/mirrors/vllm/version/202505/20250514/v0.8.4.dev0_newest/any/) | + |[vLLM MindSpore](https://gitee.com/mindspore/vllm-mindspore) | [0.3.0](https://gitee.com/mindspore/vllm-mindspore/) | ## 配置环境 -在本章节中,我们将介绍[docker安装](#docker安装)、[pip安装](#pip安装)、[源码安装](#源码安装)三种安装方式,以及[快速验证](#快速验证)用例,用于验证安装是否成功。 +在本章节中,我们将介绍[docker安装](#docker安装)、[源码安装](#源码安装)两种安装方式,以及[快速验证](#快速验证)用例,用于验证安装是否成功。 ### docker安装 @@ -105,29 +105,33 @@ docker exec -it $DOCKER_NAME bash ### 源码安装 -- **CANN安装** +#### CANN安装 - CANN安装方法与环境配套,请参考[CANN社区版软件安装](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/82RC1alpha002/softwareinst/instg/instg_0001.html?Mode=PmIns&OS=openEuler&Software=cannToolKit),若用户在安装CANN过程中遇到问题,可参考[昇腾常见问题](https://www.hiascend.com/document/detail/zh/AscendFAQ/ProduTech/CANNFAQ/cannfaq_000.html)进行解决。 +CANN安装方法与环境配套,请参考[CANN社区版软件安装](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/82RC1alpha002/softwareinst/instg/instg_0001.html?Mode=PmIns&OS=openEuler&Software=cannToolKit),若用户在安装CANN过程中遇到问题,可参考[昇腾常见问题](https://www.hiascend.com/document/detail/zh/AscendFAQ/ProduTech/CANNFAQ/cannfaq_000.html)进行解决。 - CANN默认安装路径为`/usr/local/Ascend`。用户在安装CANN完毕后,使用如下命令,为CANN配置环境变量: +CANN默认安装路径为`/usr/local/Ascend`。用户在安装CANN完毕后,使用如下命令,为CANN配置环境变量: - ```bash - LOCAL_ASCEND=/usr/local/Ascend # the root directory of run package - source ${LOCAL_ASCEND}/ascend-toolkit/set_env.sh - export ASCEND_CUSTOM_PATH=${LOCAL_ASCEND}/ascend-toolkit - ``` +```bash +LOCAL_ASCEND=/usr/local/Ascend # the root directory of run package +source ${LOCAL_ASCEND}/ascend-toolkit/set_env.sh +export ASCEND_CUSTOM_PATH=${LOCAL_ASCEND}/ascend-toolkit +``` -- **vLLM前置依赖安装** +#### vLLM前置依赖安装 vLLM的环境配置与安装方法,请参考[vLLM安装教程](https://docs.vllm.ai/en/v0.9.1/getting_started/installation/cpu.html)。其依赖`gcc/g++ >= 12.3.0`版本,可通过以下命令完成安装: - ```bash - yum install -y gcc gcc-c++ - ``` +```bash +yum install -y gcc gcc-c++ +``` + +#### vLLM MindSpore安装 -- **vLLM MindSpore安装** +vLLM MindSpore有以下两种安装方式。**vLLM MindSpore一键式安装**适用于用户快速使用与部署的场景。**vLLM MindSpore手动安装**适用于用户对组件有自定义修改的场景。 - 安装vLLM MindSpore,需要在拉取vLLM MindSpore源码后,执行以下命令,安装依赖包: +- **vLLM MindSpore一键式安装** + + 采用一键式安装脚本来安装vLLM MindSpore,需要在拉取vLLM MindSpore源码后,执行以下命令,安装依赖包: ```bash git clone https://gitee.com/mindspore/vllm-mindspore.git @@ -147,53 +151,58 @@ docker exec -it $DOCKER_NAME bash export PYTHONPATH=$MF_PATH:$PYTHONPATH ``` -- **组件手动安装** +- **vLLM MindSpore手动安装** + + 若用户对组件有修改,或者需使用其他版本,则用户需要按照特定顺序,手动安装组件。vLLM MindSpore软件配套下载地址可以参考[版本配套](#版本配套),且对组件的安装顺序要求如下: - 若用户对组件有修改,或者需使用其他版本,则用户需要按照特定顺序,手动安装组件。vLLM MindSpore对组件的安装顺序要求如下: 1. 安装vLLM - ```bash - pip install /path/to/vllm-*.whl - ``` + ```bash + pip install /path/to/vllm-*.whl + ``` 2. 卸载torch相关组件 - ```bash - pip uninstall torch torch-npu torchvision torchaudio -y - ``` + ```bash + pip uninstall torch torch-npu torchvision torchaudio -y + ``` 3. 安装MindSpore - ```bash - pip install /path/to/mindspore-*.whl - ``` + ```bash + pip install /path/to/mindspore-*.whl + ``` 4. 引入MindSpore Transformers仓,加入到`PYTHONPATH`中 - ```bash - git clone https://gitee.com/mindspore/mindformers.git - export PYTHONPATH=`realpath mindformers`:$PYTHONPATH - ``` + ```bash + git clone https://gitee.com/mindspore/mindformers.git + export PYTHONPATH=$MF_PATH:$PYTHONPATH + ``` 5. 
安装Golden Stick - ```bash - pip install /path/to/mindspore_gs-*.whl - ``` + ```bash + pip install /path/to/mindspore_gs-*.whl + ``` 6. 安装MSAdapter - ```bash - pip install /path/to/msadapter-*.whl - ``` + ```bash + pip install /path/to/msadapter-*.whl + ``` 7. 安装vLLM MindSpore - ```bash - pip install . - ``` + 需要先拉取vLLM MindSpore源码,再执行安装 + + ```bash + git clone https://gitee.com/mindspore/vllm-mindspore.git + cd vllm-mindspore + pip install . + ``` -### 快速验证 +## 快速验证 用户可以创建一个简单的离线推理场景,验证安装是否成功。下面以[Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) 为例。首先用户需要执行以下命令,设置环境变量: diff --git a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/quick_start/quick_start.md b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/quick_start/quick_start.md index 82d5534bda..addd3951d0 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/quick_start/quick_start.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/quick_start/quick_start.md @@ -138,7 +138,7 @@ export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Tra 以下是对上述环境变量的解释: - `vLLM_MODEL_BACKEND`:所运行的模型后端。目前vLLM MindSpore所支持的模型与模型后端,可在[模型支持列表](../../user_guide/supported_models/models_list/models_list.md)中进行查询; -- `MINDFORMERS_MODEL_CONFIG`:模型配置文件。用户可以在[MindSpore Transformers工程](https://gitee.com/mindspore/mindformers/tree/r1.5.0/research/qwen2_5)中,找到对应模型的yaml文件。以Qwen2.5-7B为例,则其yaml文件为[predict_qwen2_5_7b_instruct.yaml](https://gitee.com/mindspore/mindformers/blob/r1.5.0/research/qwen2_5/predict_qwen2_5_7b_instruct.yaml) 。 +- `MINDFORMERS_MODEL_CONFIG`:模型配置文件。用户可以在[MindSpore Transformers工程](https://gitee.com/mindspore/mindformers/tree/master/research/qwen2_5)中,找到对应模型的yaml文件。以Qwen2.5-7B为例,则其yaml文件为[predict_qwen2_5_7b_instruct.yaml](https://gitee.com/mindspore/mindformers/blob/master/research/qwen2_5/predict_qwen2_5_7b_instruct.yaml)。 另外,用户需要确保MindSpore Transformers已安装。用户可通过 @@ -220,7 +220,7 @@ Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg gereration throughput: 0.0 curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "Qwen/Qwen2.5-7B-Instruct", "prompt": "I am", "max_tokens": 20, "temperature": 0}' ``` -其中,用户需确认`"model"`字段与启动服务中`--model`一致,请求才能成功匹配到模型。可以通过请求处理成功,将获得以下的推理结果: +其中,用户需确认`"model"`字段与启动服务中`--model`一致,请求才能成功匹配到模型。若请求处理成功,将获得以下推理结果: ```text { diff --git a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md index 2ad616469f..813a0dd588 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md @@ -94,8 +94,8 @@ docker exec -it $DOCKER_NAME bash ```python from openmind_hub import snapshot_download -snapshot_download(repo_id="MindSpore-Lab/DeepSeek-R1-W8A8", - local_dir="/path/to/save/deepseek_r1_w8a8", +snapshot_download(repo_id="MindSpore-Lab/DeepSeek-R1-0528-A8W8", + local_dir="/path/to/save/deepseek_r1_0528_a8w8", local_dir_use_symlinks=False) ``` @@ -120,7 +120,7 @@ Git LFS initialized. 
工具确认可用后,执行以下命令,下载权重: ```shell -git clone https://modelers.cn/MindSpore-Lab/DeepSeek-R1-W8A8.git +git clone https://modelers.cn/models/MindSpore-Lab/DeepSeek-R1-0528-A8W8.git ``` ## TP16 张量并行推理 @@ -284,7 +284,7 @@ vllm-mindspore serve ```bash # 主节点: -vllm-mindspore serve --model="MindSpore-Lab/DeepSeek-R1-W8A8" --trust-remote-code --max-num-seqs=256 --max_model_len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 16 --distributed-executor-backend=ray +vllm-mindspore serve --model="MindSpore-Lab/DeepSeek-R1-0528-A8W8" --trust-remote-code --max-num-seqs=256 --max_model_len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 16 --distributed-executor-backend=ray ``` 张量并行场景下,`--tensor-parallel-size`参数会覆盖模型yaml文件中`parallel_config`的`model_parallel`配置。用户可以通过`--model`参数,指定模型保存的本地路径。 @@ -294,7 +294,7 @@ vllm-mindspore serve --model="MindSpore-Lab/DeepSeek-R1-W8A8" --trust-remote-cod 使用如下命令发送请求。其中`prompt`字段为模型输入: ```bash -curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "MindSpore-Lab/DeepSeek-R1-W8A8", "prompt": "I am", "max_tokens": 20, "temperature": 0, "top_p": 1.0, "top_k": 1, "repetition_penalty": 1.0}' +curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "MindSpore-Lab/DeepSeek-R1-0528-A8W8", "prompt": "I am", "max_tokens": 20, "temperature": 0, "top_p": 1.0, "top_k": 1, "repetition_penalty": 1.0}' ``` 用户需确认`"model"`字段与启动服务中`--model`一致,请求才能成功匹配到模型。 @@ -347,6 +347,7 @@ parallel_config: ### 在线推理 #### 启动服务 + `vllm-mindspore`可使用OpenAI的API协议部署在线推理。以下是在线推理的拉起流程: ```bash @@ -373,10 +374,10 @@ vllm-mindspore serve ```bash # 主节点: -vllm-mindspore serve --model="MindSpore-Lab/DeepSeek-R1-W8A8" --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 0 --data-parallel-address 192.10.10.10 --data-parallel-rpc-port 12370 --enable-expert-parallel +vllm-mindspore serve --model="MindSpore-Lab/DeepSeek-R1-0528-A8W8" --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 0 --data-parallel-address 192.10.10.10 --data-parallel-rpc-port 12370 --enable-expert-parallel # 从节点: -vllm-mindspore serve --headless --model="MindSpore-Lab/DeepSeek-R1-W8A8" --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 2 --data-parallel-address 192.10.10.10 --data-parallel-rpc-port 12370 --enable-expert-parallel +vllm-mindspore serve --headless --model="MindSpore-Lab/DeepSeek-R1-0528-A8W8" --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 2 --data-parallel-address 192.10.10.10 --data-parallel-rpc-port 12370 --enable-expert-parallel ``` #### 发送请求 @@ -384,7 +385,7 @@ vllm-mindspore serve --headless --model="MindSpore-Lab/DeepSeek-R1-W8A8" --trust 使用如下命令发送请求。其中`prompt`字段为模型输入: ```bash -curl 
http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "MindSpore-Lab/DeepSeek-R1-W8A8", "prompt": "I am, "max_tokens": 120, "temperature": 0}' +curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "MindSpore-Lab/DeepSeek-R1-0528-A8W8", "prompt": "I am", "max_tokens": 120, "temperature": 0}' ``` 用户需确认`"model"`字段与启动服务中`--model`一致,请求才能成功匹配到模型。 diff --git a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md index 72e4bad00a..f0901e2e0f 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md @@ -135,7 +135,7 @@ export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Tra 以下是对上述环境变量的解释: - `vLLM_MODEL_BACKEND`:所运行的模型后端。目前vLLM MindSpore所支持的模型与模型后端,可在[模型支持列表](../../../user_guide/supported_models/models_list/models_list.md)中进行查询。 -`MINDFORMERS_MODEL_CONFIG`:模型配置文件。用户可以在[MindSpore Transformers工程](https://gitee.com/mindspore/mindformers/tree/r1.5.0/research/qwen2_5)中,找到对应模型的yaml文件。以Qwen2.5-32B为例,则其yaml文件为[predict_qwen2_5_32b_instruct.yaml](https://gitee.com/mindspore/mindformers/blob/r1.5.0/research/qwen2_5/predict_qwen2_5_32b_instruct.yaml) 。 +- `MINDFORMERS_MODEL_CONFIG`:模型配置文件。用户可以在[MindSpore Transformers工程](https://gitee.com/mindspore/mindformers/tree/master/research/qwen2_5)中,找到对应模型的yaml文件。以Qwen2.5-32B为例,则其yaml文件为[predict_qwen2_5_32b_instruct.yaml](https://gitee.com/mindspore/mindformers/blob/master/research/qwen2_5/predict_qwen2_5_32b_instruct.yaml) 。 用户可通过`npu-smi info`查看显存占用情况,并可以使用如下环境变量,设置用于推理的计算卡。以下例子为假设用户使用4,5,6,7卡进行推理: @@ -181,7 +181,7 @@ Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg gereration throughput: 0.0 curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "Qwen/Qwen2.5-32B-Instruct", "prompt": "I am", "max_tokens": 20, "temperature": 0}' ``` -其中,用户需确认`"model"`字段与启动服务中`--model`一致,请求才能成功匹配到模型。可以通过请求处理成功,将获得以下的推理结果: +其中,用户需确认`"model"`字段与启动服务中`--model`一致,请求才能成功匹配到模型。若请求处理成功,将获得以下推理结果: ```text { diff --git a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md index 90a49c065e..c7d8426f6d 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md @@ -135,7 +135,7 @@ export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Tra 以下是对上述环境变量的解释: - `vLLM_MODEL_BACKEND`:所运行的模型后端。目前vLLM MindSpore所支持的模型与模型后端,可在[模型支持列表](../../../user_guide/supported_models/models_list/models_list.md)中进行查询; -`MINDFORMERS_MODEL_CONFIG`:模型配置文件。用户可以在[MindSpore Transformers工程](https://gitee.com/mindspore/mindformers/tree/r1.5.0/research/qwen2_5)中,找到对应模型的yaml文件。以Qwen2.5-7B为例,则其yaml文件为[predict_qwen2_5_7b_instruct.yaml](https://gitee.com/mindspore/mindformers/blob/r1.5.0/research/qwen2_5/predict_qwen2_5_7b_instruct.yaml) 。 +- `MINDFORMERS_MODEL_CONFIG`:模型配置文件。用户可以在[MindSpore 
Transformers工程](https://gitee.com/mindspore/mindformers/tree/master/research/qwen2_5)中,找到对应模型的yaml文件。以Qwen2.5-7B为例,则其yaml文件为[predict_qwen2_5_7b_instruct.yaml](https://gitee.com/mindspore/mindformers/blob/master/research/qwen2_5/predict_qwen2_5_7b_instruct.yaml) 。 用户可通过`npu-smi info`查看显存占用情况,并可以使用如下环境变量,设置用于推理的计算卡: @@ -194,7 +194,7 @@ vLLM MindSpore可使用OpenAI的API协议,部署为在线推理。以下是以 python3 -m vllm_mindspore.entrypoints vllm.entrypoints.openai.api_server --model "Qwen/Qwen2.5-7B-Instruct" ``` -若服务成功拉起,则可以获得类似的执行结果: +用户可以通过`--model`参数,指定模型保存的本地路径。若服务成功拉起,则可以获得类似的执行结果: ```text INFO: Started server process [6363] @@ -216,7 +216,7 @@ Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg gereration throughput: 0.0 curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "Qwen/Qwen2.5-7B-Instruct", "prompt": "I am", "max_tokens": 20, "temperature": 0}' ``` -若请求处理成功,将获得以下的推理结果: +其中,用户需确认`"model"`字段与启动服务中`--model`一致,请求才能成功匹配到模型。若请求处理成功,将获得以下推理结果: ```text { diff --git a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/environment_variables/environment_variables.md b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/environment_variables/environment_variables.md index 835946b67d..b5e2aefd2d 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/environment_variables/environment_variables.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/environment_variables/environment_variables.md @@ -4,43 +4,20 @@ | 环境变量 | 功能 | 类型 | 取值 | 说明 | | ------ | ------- | ------ | ------ | ------ | -| vLLM_MODEL_BACKEND | 用于指定模型后端。使用vLLM MindSpore原生模型后端时无需指定;使用模型为vLLM MindSpore外部后端时则需要指定。 | String | `MindFormers`: 模型后端为MindSpore Transformers。 | 原生模型后端当前支持Qwen2.5系列;MindSpore Transformers模型后端支持Qwen系列、DeepSeek、Llama系列模型,使用时需配置环境变量:`export PYTHONPATH=/path/to/mindformers/:$PYTHONPATH`。 | -| MINDFORMERS_MODEL_CONFIG | MindSpore Transformers模型的配置文件。使用Qwen2.5系列、DeepSeek系列模型时,需要配置文件路径。 | String | 模型配置文件路径。 | **该环境变量在后续版本会被移除。** 样例:`export MINDFORMERS_MODEL_CONFIG=/path/to/research/deepseek3/deepseek_r1_671b/predict_deepseek_r1_671b_w8a8.yaml`。 | -| GLOO_SOCKET_IFNAME | 用于多机之间使用gloo通信时的网口名称。 | String | 网口名称,例如enp189s0f0。 | 多机场景使用,可通过`ifconfig`查找ip对应网卡的网卡名。 | -| TP_SOCKET_IFNAME | 用于多机之间使用TP通信时的网口名称。 | String | 网口名称,例如enp189s0f0。 | 多机场景使用,可通过`ifconfig`查找ip对应网卡的网卡名。 | -| HCCL_SOCKET_IFNAME | 用于多机之间使用HCCL通信时的网口名称。 | String | 网口名称,例如enp189s0f0。 | 多机场景使用,可通过`ifconfig`查找ip对应网卡的网卡名。 | -| ASCEND_RT_VISIBLE_DEVICES | 指定哪些Device对当前进程可见,支持一次指定一个或多个Device ID。 | String | 为Device ID,逗号分割的字符串,例如"0,1,2,3,4,5,6,7"。 | ray使用场景建议使用。 | -| HCCL_BUFFSIZE | 此环境变量用于控制两个NPU之间共享数据的缓存区大小。 | Integer | 缓存区大小,大小为MB。例如:`2048`。 | 使用方法参考:[HCCL_BUFFSIZE](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/81RC1beta1/maintenref/envvar/envref_07_0080.html)。例如DeepSeek 混合并行(数据并行数为32,专家并行数为32),且`max-num-batched-tokens`为256时,则`export HCCL_BUFFSIZE=2048`。 | -| MS_MEMPOOL_BLOCK_SIZE | 设置PyNative模式下设备内存池的块大小。 | String | 正整数string,单位为GB。 | | -| vLLM_USE_NPU_ADV_STEP_FLASH_OP | 是否使用昇腾`adv_step_flash`算子。 | String | `on`: 使用;`off`:不使用 | 取值为`off`时,将使用小算子实现替代`adv_step_flash`算子。 | -| VLLM_TORCH_PROFILER_DIR | 开启profiling采集数据,当配置了采集数据保存路径后生效 | String | Profiling数据保存路径。| | - -以下环境变量由vLLM MindSpore自动注册: - -| 环境变量 | 功能 | 类型 | 取值 | 说明 | -| ------ | ------- | ------ | ------ | ------ | -| USE_TORCH | Transformer运行时依赖该环境变量 | String | 默认值为"False" | vLLM MindSpore 不使用torch 作为后端 | -| USE_TF | Transformer运行时依赖该环境变量 | String | 默认值为"False" | vLLM MindSpore 不使用TensorFlow 作为后端 | -| RUN_MODE | 执行模式为推理 | String | 默认值为"predict" | 
**该环境变量在后续版本会被移除。** 为MindFormers依赖的环境变量 | -| CUSTOM_MATMUL_SHUFFLE | 开启或关闭自定义矩阵算法的洗牌操作 | String | `on`:开启矩阵洗牌。`off`:关闭矩阵洗牌。默认值为`on`。 | | -| HCCL_DETERMINISTIC | 开启或关闭归约类通信算子的确定性计算,其中归约类通信算子包括 AllReduce、ReduceScatter、Reduce。 | String | `true`:打开 HCCL 确定性开关;`false`:关闭 HCCL 确定性开关。默认值为`false`。 | | -| ASCEND_LAUNCH_BLOCKING | 训练或在线推理场景,可通过此环境变量控制算子执行时是否启动同步模式。 | Integer | `1`:强制算子采用同步模式运行;`0`:不强制算子采用同步模式运行。默认值为`0`。 | | -| TE_PARALLEL_COMPILER | 算子最大并行编译进程数,当大于 1 时开启并行编译。 | Integer | 取值为正整数;最大不超过 cpu 核数*80%/昇腾 AI 处理器个数,取值范围 1~32。默认值是 `0`。 | | -| LCCL_DETERMINISTIC | 设置 LCCL 确定性算子 AllReduce(保序加)是否开启。 | Integer | `1`:打开 LCCL 确定性开关;`0`:关闭 LCCL 确定性开关。默认值是 `0`。 | | -| MS_ENABLE_GRACEFUL_EXIT | 设置使能进程优雅退出 | Integer | `1`:使用进程优雅退出功能。`不设置或者其他值`: 不使用进程优雅退出功能。默认值为`0` | | -| CPU_AFFINIITY | MindSpore推理绑核优化 | String | `True`:开启绑核;`True`:不开启绑核。默认值为`True` | **该环境变量在后续版本会被移除。** 将使用`set_cpu_affinity`接口。 | -| MS_ENABLE_INTERNAL_BOOST | 是否打开 MindSpore 框架的内部加速功能。 | String | `on`:开启 MindSpore 内部加速;`off`:关闭 MindSpore 内部加速。默认值为`on` | | -| MS_ENABLE_LCCL | 是否使用LCCL通信库。 | Integer | `1`:开启,`0`:关闭。默认值为`0`。 | | -| HCCL_EXEC_TIMEOUT | 通过该环境变量可控制设备间执行时同步等待的时间,在该配置时间内各设备进程等待其他设备执行通信同步。 | Integer | 取值范围为:(0, 17340],单位为 s。 默认值为 7200。 | | -| DEVICE_NUM_PER_NODE | 节点上的设备数 | Integer | 默认值为16。 | | -| HCCL_OP_EXPANSION_MODE | 用于配置通信算法的编排展开位置。 | String | `AI_CPU`:通信算法的编排展开位置为Device侧的AI CPU计算单元;`AIV`:通信算法的编排展开位置为Device侧的AI Vector Core计算单元。默认值为`AIV`。 | | -| MS_JIT_MODULES | 指定静态图模式下哪些模块需要JIT静态编译,其函数方法会被编译成静态计算图 | String | 模块名,对应import导入的顶层模块的名称。如果有多个,使用英文逗号分隔。默认值为`"vllm_mindspore,research"`。 | | -| GLOG_v | 控制日志的级别 | Integer | `0`:DEBUG;`1`:INFO;`2`:WARNING;`3`:ERROR,表示程序执行出现报错,输出错误日志,程序可能不会终止;`4`:CRITICAL,表示程序执行出现异常,将会终止执行程序。默认值为`3`。 | | -| RAY_CGRAPH_get_timeout | `ray.get()`方法的超时时间。 | Integer | 默认值为`360`。 | | -| MS_NODE_TIMEOUT | 节点心跳超时时间,单位:秒。 | Integer | 默认值为`180`。 | | +| `vLLM_MODEL_BACKEND` | 用于指定模型后端。使用vLLM MindSpore原生模型后端时无需指定;使用模型为vLLM MindSpore外部后端时则需要指定。 | String | `MindFormers`: 模型后端为MindSpore Transformers。 | 原生模型后端当前支持Qwen2.5系列;MindSpore Transformers模型后端支持Qwen系列、DeepSeek、Llama系列模型,使用时需配置环境变量:`export PYTHONPATH=/path/to/mindformers/:$PYTHONPATH`。 | +| `MINDFORMERS_MODEL_CONFIG` | MindSpore Transformers模型的配置文件。使用Qwen2.5系列、DeepSeek系列模型时,需要配置文件路径。 | String | 模型配置文件路径。 | **该环境变量在后续版本会被移除。** 样例:`export MINDFORMERS_MODEL_CONFIG=/path/to/research/deepseek3/deepseek_r1_671b/predict_deepseek_r1_671b_w8a8.yaml`。 | +| `GLOO_SOCKET_IFNAME` | 用于多机之间使用gloo通信时的网口名称。 | String | 网口名称,例如enp189s0f0。 | 多机场景使用,可通过`ifconfig`查找ip对应网卡的网卡名。 | +| `TP_SOCKET_IFNAME` | 用于多机之间使用TP通信时的网口名称。 | String | 网口名称,例如enp189s0f0。 | 多机场景使用,可通过`ifconfig`查找ip对应网卡的网卡名。 | +| `HCCL_SOCKET_IFNAME` | 用于多机之间使用HCCL通信时的网口名称。 | String | 网口名称,例如enp189s0f0。 | 多机场景使用,可通过`ifconfig`查找ip对应网卡的网卡名。 | +| `ASCEND_RT_VISIBLE_DEVICES` | 指定哪些Device对当前进程可见,支持一次指定一个或多个Device ID。 | String | 为Device ID,逗号分割的字符串,例如"0,1,2,3,4,5,6,7"。 | ray使用场景建议使用。 | +| `HCCL_BUFFSIZE` | 此环境变量用于控制两个NPU之间共享数据的缓存区大小。 | Integer | 缓存区大小,大小为MB。例如:`2048`。 | 使用方法参考:[HCCL_BUFFSIZE](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/81RC1beta1/maintenref/envvar/envref_07_0080.html)。例如DeepSeek 混合并行(数据并行数为32,专家并行数为32),且`max-num-batched-tokens`为256时,则`export HCCL_BUFFSIZE=2048`。 | +| `MS_MEMPOOL_BLOCK_SIZE` | 设置PyNative模式下设备内存池的块大小。 | String | 正整数string,单位为GB。 | | +| `vLLM_USE_NPU_ADV_STEP_FLASH_OP` | 是否使用昇腾`adv_step_flash`算子。 | String | `on`: 使用;`off`:不使用 | 取值为`off`时,将使用小算子实现替代`adv_step_flash`算子。 | +| `VLLM_TORCH_PROFILER_DIR` | 开启profiling采集数据,当配置了采集数据保存路径后生效 | String | Profiling数据保存路径。| | 更多的环境变量信息,请查看: 
- - [CANN 环境变量列表](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/81RC1beta1/index/index.html) - - [MindSpore 环境变量列表](https://www.mindspore.cn/docs/zh-CN/r2.7.0rc1/api_python/env_var_list.html) - - [MindSpore Transformers 环境变量列表](https://www.mindspore.cn/mindformers/docs/zh-CN/r1.6.0/index.html) - - [vLLM 环境变量列表](https://docs.vllm.ai/en/v0.8.4/serving/env_vars.html) + +- [CANN 环境变量列表](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/81RC1beta1/index/index.html) +- [MindSpore 环境变量列表](https://www.mindspore.cn/docs/zh-CN/master/api_python/env_var_list.html) +- [MindSpore Transformers 环境变量列表](https://www.mindspore.cn/mindformers/docs/zh-CN/master/index.html) +- [vLLM 环境变量列表](https://docs.vllm.ai/en/v0.8.4/serving/env_vars.html) diff --git a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/quantization/quantization.md b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/quantization/quantization.md index 71667d5f1e..22a83475ef 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/quantization/quantization.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/quantization/quantization.md @@ -16,7 +16,7 @@ ### 直接下载量化权重 -我们已经将量化好的DeepSeek-R1上传到[魔乐社区](https://modelers.cn):[MindSpore-Lab/DeepSeek-R1-W8A8](https://modelers.cn/models/MindSpore-Lab/DeepSeek-R1-W8A8),可以参考[魔乐社区文档](https://modelers.cn/docs/zh/openmind-hub-client/0.9/basic_tutorial/download.html)将权重下载到本地。 +我们已经将量化好的DeepSeek-R1上传到[魔乐社区](https://modelers.cn):[MindSpore-Lab/DeepSeek-R1-0528-A8W8](https://modelers.cn/models/MindSpore-Lab/DeepSeek-R1-0528-A8W8),可以参考[魔乐社区文档](https://modelers.cn/docs/zh/openmind-hub-client/0.9/basic_tutorial/download.html)将权重下载到本地。 ## 量化模型推理 diff --git a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_models/models_list/models_list.md b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_models/models_list/models_list.md index 2e504c0fec..c64725c9e1 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_models/models_list/models_list.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_models/models_list/models_list.md @@ -6,7 +6,7 @@ |-------| --------- | ---- | | DeepSeek-V3 | 已支持 | [DeepSeek-V3](https://modelers.cn/models/MindSpore-Lab/DeepSeek-V3) | | DeepSeek-R1 | 已支持 | [DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-V3) | -| DeepSeek-R1 W8A8 | 已支持 | [Deepseek-R1-W8A8](https://modelers.cn/models/MindSpore-Lab/DeepSeek-r1-w8a8) | +| DeepSeek-R1 W8A8 | 已支持 | [Deepseek-R1-W8A8](https://modelers.cn/models/MindSpore-Lab/DeepSeek-R1-0528-A8W8) | | Qwen2.5 | 已支持 | [Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct)、[Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct)、[Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct)、 [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct)、[Qwen2.5-14B-Instruct](https://huggingface.co/Qwen/Qwen2.5-14B-Instruct)、[Qwen2.5-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct)、[Qwen2.5-72B-Instruct](https://huggingface.co/Qwen/Qwen2.5-72B-Instruct) | | Qwen3-32B | 已支持 | [Qwen3-32B](https://modelers.cn/models/MindSpore-Lab/Qwen3-32B) | | Qwen3-235B-A22B | 已支持 | [Qwen3-235B-A22B](https://huggingface.co/Qwen/Qwen3-235B-A22B) | -- Gitee