From 538f8fc773554af930a9ab2bff3b96b70581a65f Mon Sep 17 00:00:00 2001 From: horcam Date: Tue, 12 Aug 2025 09:57:39 +0800 Subject: [PATCH 01/12] delete env vars --- .../source_en/getting_started/installation/installation.md | 2 -- .../docs/source_en/getting_started/quick_start/quick_start.md | 4 ---- .../tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md | 4 ---- .../tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md | 4 ---- .../user_guide/supported_features/benchmark/benchmark.md | 4 ---- .../supported_features/quantization/quantization.md | 2 -- .../source_zh_cn/getting_started/installation/installation.md | 2 -- .../source_zh_cn/getting_started/quick_start/quick_start.md | 4 ---- .../tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md | 4 ---- .../tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md | 4 ---- .../user_guide/supported_features/benchmark/benchmark.md | 4 ---- .../supported_features/quantization/quantization.md | 2 -- 12 files changed, 40 deletions(-) diff --git a/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md b/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md index 73851332d6..a6789dccde 100644 --- a/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md +++ b/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md @@ -153,9 +153,7 @@ docker exec -it $DOCKER_NAME bash User can verify the installation with a simple offline inference test. First, user need to configure the environment variables with the following command: ```bash -export ASCEND_TOTAL_MEMORY_GB=64 # Please use `npu-smi info` to check the memory. export vLLM_MODEL_BACKEND=MindFormers # use MindSpore Transformers as model backend. -export vLLM_MODEL_MEMORY_USE_GB=32 # Memory reserved for model execution. Set according to the model's maximum usage, with the remaining environment used for kvcache allocation export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model's YAML file. ``` diff --git a/docs/vllm_mindspore/docs/source_en/getting_started/quick_start/quick_start.md b/docs/vllm_mindspore/docs/source_en/getting_started/quick_start/quick_start.md index c0eaf16c34..6b129509a8 100644 --- a/docs/vllm_mindspore/docs/source_en/getting_started/quick_start/quick_start.md +++ b/docs/vllm_mindspore/docs/source_en/getting_started/quick_start/quick_start.md @@ -131,17 +131,13 @@ git clone https://huggingface.co/Qwen/Qwen2.5-7B-Instruct Before launching the model, user need to set the following environment variables: ```bash -export ASCEND_TOTAL_MEMORY_GB=64 # Please use `npu-smi info` to check the memory. export vLLM_MODEL_BACKEND=MindFormers # use MindSpore Transformers as model backend. -export vLLM_MODEL_MEMORY_USE_GB=32 # Memory reserved for model execution. Set according to the model's maximum usage, with the remaining environment used for kvcache allocation. export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model's YAML file. ``` Here is an explanation of these environment variables: -- `ASCEND_TOTAL_MEMORY_GB`: The memory size of each card. User can check the memory by using `npu-smi info`, where the value corresponds to `HBM-Usage(MB)` in the query results. - `vLLM_MODEL_BACKEND`: The backend of the model to run. User could find supported models and backends for vLLM MindSpore in the [Model Support List](../../user_guide/supported_models/models_list/models_list.md). 
-- `vLLM_MODEL_MEMORY_USE_GB`: The memory reserved for model loading. Adjust this value if insufficient memory error occurs during model loading. - `MINDFORMERS_MODEL_CONFIG`: The model configuration file. Additionally, users need to ensure that MindSpore Transformers is installed. Users can add it by running the following command: diff --git a/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md b/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md index 24d4d4a2ca..7f55f75080 100644 --- a/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md +++ b/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md @@ -127,17 +127,13 @@ For [Qwen2.5-32B](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct), the followi ```bash #set environment variables -export ASCEND_TOTAL_MEMORY_GB=64 # Use `npu-smi info` to check the memory. export vLLM_MODEL_BACKEND=MindFormers # Use MindSpore TransFormers as the model backend. -export vLLM_MODEL_MEMORY_USE_GB=32 # Memory reserved for model execution. Adjust based on the model's maximum usage, with the remaining allocated for KV cache. export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model YAML file. ``` Here is an explanation of these environment variables: -- `ASCEND_TOTAL_MEMORY_GB`: The memory size of each compute card. Query using `npu-smi info`, corresponding to `HBM-Usage(MB)` in the results. - `vLLM_MODEL_BACKEND`: The model backend. Currently supported models and backends are listed in the [Model Support List](../../../user_guide/supported_models/models_list/models_list.md). -- `vLLM_MODEL_MEMORY_USE_GB`: Memory reserved for model loading. Adjust this if encountering insufficient memory. - `MINDFORMERS_MODEL_CONFIG`: Model configuration file. User can find the corresponding YAML file in the [MindSpore Transformers repository](https://gitee.com/mindspore/mindformers/tree/r1.5.0/research/qwen2_5). For Qwen2.5-32B, the YAML file is [predict_qwen2_5_32b_instruct.yaml](https://gitee.com/mindspore/mindformers/blob/r1.5.0/research/qwen2_5/predict_qwen2_5_32b_instruct.yaml). Users can check memory usage with `npu-smi info` and set the NPU cards for inference using the following example (assuming cards 4,5,6,7 are used): diff --git a/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md b/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md index 2c360e28f8..f6daacb52d 100644 --- a/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md +++ b/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md @@ -127,17 +127,13 @@ For [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct), the following ```bash #set environment variables -export ASCEND_TOTAL_MEMORY_GB=64 # Please use `npu-smi info` to check the memory. export vLLM_MODEL_BACKEND=MindFormers # use MindSpore TransFormers as model backend. -export vLLM_MODEL_MEMORY_USE_GB=32 # Memory reserved for model execution. Set according to the model's maximum usage, with the remaining environment used for kvcache allocation export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model's YAML file. 
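# Illustrative example only: the path below is an editor's assumption. Point
# MINDFORMERS_MODEL_CONFIG at the YAML actually downloaded; for Qwen2.5-7B this is
# typically research/qwen2_5/predict_qwen2_5_7b_instruct.yaml from the MindSpore Transformers repository.
# export MINDFORMERS_MODEL_CONFIG=/path/to/research/qwen2_5/predict_qwen2_5_7b_instruct.yaml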
``` Here is an explanation of these variables: -- `ASCEND_TOTAL_MEMORY_GB`: The memory size of each compute card. Query using `npu-smi info`, corresponding to `HBM-Usage(MB)` in the results. - `vLLM_MODEL_BACKEND`: The model backend. Currently supported models and backends are listed in the [Model Support List](../../../user_guide/supported_models/models_list/models_list.md). -- `vLLM_MODEL_MEMORY_USE_GB`: Memory reserved for model loading. Adjust this if encountering insufficient memory. - `MINDFORMERS_MODEL_CONFIG`: Model configuration file. User can find the corresponding YAML file in the [MindSpore Transformers repository](https://gitee.com/mindspore/mindformers/tree/master/research/qwen2_5). For Qwen2.5-7B, the YAML file is [predict_qwen2_5_7b_instruct.yaml](https://gitee.com/mindspore/mindformers/blob/master/research/qwen2_5/predict_qwen2_5_7b_instruct.yaml). User can check memory usage with `npu-smi info` and set the compute card for inference using: diff --git a/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/benchmark/benchmark.md b/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/benchmark/benchmark.md index b704a49f44..9ab03ff8ac 100644 --- a/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/benchmark/benchmark.md +++ b/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/benchmark/benchmark.md @@ -9,9 +9,7 @@ The benchmark tool of vLLM MindSpore is inherited from vLLM. You can refer to th For single-card inference, we take [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) as an example. You can prepare the environment by following the guide [Single-Card Inference (Qwen2.5-7B)](../../../getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md#online-inference), set the environment variables: ```bash -export ASCEND_TOTAL_MEMORY_GB=64 # Please use `npu-smi info` to check the memory. export vLLM_MODEL_BACKEND=MindFormers # use MindSpore Transformers as model backend. -export vLLM_MODEL_MEMORY_USE_GB=32 # Memory reserved for model execution. Set according to the model's maximum usage, with the remaining environment used for kvcache allocation export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model's YAML file. ``` @@ -104,9 +102,7 @@ P99 ITL (ms): .... For offline performance benchmark, take [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) as an example. Prepare the environment by following the guide [Single-Card Inference (Qwen2.5-7B)](../../../getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md#offline-inference). User need to set the environment variables: ```bash -export ASCEND_TOTAL_MEMORY_GB=64 # Please use `npu-smi info` to check the memory. export vLLM_MODEL_BACKEND=MindFormers # use MindSpore Transformers as model backend. -export vLLM_MODEL_MEMORY_USE_GB=32 # Memory reserved for model execution. Set according to the model's maximum usage, with the remaining environment used for kvcache allocation export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model's YAML file. 
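# Optional sanity check (assumes a POSIX shell): confirm the variables above are
# visible to the session that will run the benchmark before launching it.
env | grep -E "vLLM_MODEL_BACKEND|MINDFORMERS_MODEL_CONFIG"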
``` diff --git a/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/quantization/quantization.md b/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/quantization/quantization.md index 401768ca1a..33c39b583d 100644 --- a/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/quantization/quantization.md +++ b/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/quantization/quantization.md @@ -27,9 +27,7 @@ After obtaining the DeepSeek-R1 W8A8 weights, ensure they are stored in the rela Refer to the [Installation Guide](../../../getting_started/installation/installation.md) to set up the vLLM MindSpore environment. User need to set the following environment variables: ```bash -export ASCEND_TOTAL_MEMORY_GB=64 # Please use `npu-smi info` to check the memory. export vLLM_MODEL_BACKEND=MindFormers # use MindSpore Transformers as model backend. -export vLLM_MODEL_MEMORY_USE_GB=32 # Memory reserved for model execution. Set according to the model's maximum usage, with the remaining environment used for kvcache allocation export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model's YAML file. ``` diff --git a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md index 8121916024..2243f2eaea 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md @@ -152,9 +152,7 @@ docker exec -it $DOCKER_NAME bash 用户可以创建一个简单的离线推理场景,验证安装是否成功。下面以[Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) 为例。首先用户需要执行以下命令,设置环境变量: ```bash -export ASCEND_TOTAL_MEMORY_GB=64 # Please use `npu-smi info` to check the memory. export vLLM_MODEL_BACKEND=MindFormers # use MindSpore Transformers as model backend. -export vLLM_MODEL_MEMORY_USE_GB=32 # Memory reserved for model execution. Set according to the model's maximum usage, with the remaining environment used for kvcache allocation export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model's YAML file. ``` diff --git a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/quick_start/quick_start.md b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/quick_start/quick_start.md index 2bf629908a..a17c89b0fa 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/quick_start/quick_start.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/quick_start/quick_start.md @@ -131,17 +131,13 @@ git clone https://huggingface.co/Qwen/Qwen2.5-7B-Instruct 用户在拉起模型前,需设置以下环境变量: ```bash -export ASCEND_TOTAL_MEMORY_GB=64 # Please use `npu-smi info` to check the memory. export vLLM_MODEL_BACKEND=MindFormers # use MindSpore Transformers as model backend. -export vLLM_MODEL_MEMORY_USE_GB=32 # Memory reserved for model execution. Set according to the model's maximum usage, with the remaining environment used for kvcache allocation export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model's YAML file. 
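# 示例(以下路径仅为示意,属编者假设;请替换为实际下载的Qwen2.5-7B对应yaml文件路径):
# export MINDFORMERS_MODEL_CONFIG=/path/to/research/qwen2_5/predict_qwen2_5_7b_instruct.yaml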
``` 以下是对上述环境变量的解释: -- `ASCEND_TOTAL_MEMORY_GB`: 每一张计算卡的显存大小。用户可使用`npu-smi info`命令进行查询,该值对应查询结果中的`HBM-Usage(MB)`; - `vLLM_MODEL_BACKEND`:所运行的模型后端。目前vLLM MindSpore所支持的模型与模型后端,可在[模型支持列表](../../user_guide/supported_models/models_list/models_list.md)中进行查询; -- `vLLM_MODEL_MEMORY_USE_GB`:模型加载时所用空间,根据用户所使用的模型进行设置。若用户在模型加载过程中遇到显存不足时,可适当增大该值并重试; - `MINDFORMERS_MODEL_CONFIG`:模型配置文件。 另外,用户需要确保MindSpore Transformers已安装。用户可通过 diff --git a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md index 811009f647..5db8bd6d63 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md @@ -128,17 +128,13 @@ git clone https://huggingface.co/Qwen/Qwen2.5-32B-Instruct ```bash #set environment variables -export ASCEND_TOTAL_MEMORY_GB=64 # Please use `npu-smi info` to check the memory. export vLLM_MODEL_BACKEND=MindFormers # use MindSpore TransFormers as model backend. -export vLLM_MODEL_MEMORY_USE_GB=32 # Memory reserved for model execution. Set according to the model's maximum usage, with the remaining environment used for kvcache allocation export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model's YAML file. ``` 以下是对上述环境变量的解释: -- `ASCEND_TOTAL_MEMORY_GB`: 每一张计算卡的显存大小。用户可使用`npu-smi info`命令进行查询,该值对应查询结果中的`HBM-Usage(MB)`。 - `vLLM_MODEL_BACKEND`:所运行的模型后端。目前vLLM MindSpore所支持的模型与模型后端,可在[模型支持列表](../../../user_guide/supported_models/models_list/models_list.md)中进行查询。 -- `vLLM_MODEL_MEMORY_USE_GB`:模型加载时所用空间,根据用户所使用的模型进行设置。若用户在模型加载过程中遇到显存不足时,可适当增大该值并重试。 - `MINDFORMERS_MODEL_CONFIG`:模型配置文件。用户可以在[MindSpore Transformers工程](https://gitee.com/mindspore/mindformers/tree/r1.5.0/research/qwen2_5)中,找到对应模型的yaml文件。以Qwen2.5-32B为例,则其yaml文件为[predict_qwen2_5_32b_instruct.yaml](https://gitee.com/mindspore/mindformers/blob/r1.5.0/research/qwen2_5/predict_qwen2_5_32b_instruct.yaml) 。 用户可通过`npu-smi info`查看显存占用情况,并可以使用如下环境变量,设置用于推理的计算卡。以下例子为假设用户使用4,5,6,7卡进行推理: diff --git a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md index ffc82071b2..90a49c065e 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md @@ -128,17 +128,13 @@ git clone https://huggingface.co/Qwen/Qwen2.5-7B-Instruct ```bash #set environment variables -export ASCEND_TOTAL_MEMORY_GB=64 # Please use `npu-smi info` to check the memory. export vLLM_MODEL_BACKEND=MindFormers # use MindSpore TransFormers as model backend. -export vLLM_MODEL_MEMORY_USE_GB=32 # Memory reserved for model execution. Set according to the model's maximum usage, with the remaining environment used for kvcache allocation export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model's YAML file. 
``` 以下是对上述环境变量的解释: -- `ASCEND_TOTAL_MEMORY_GB`: 每一张计算卡的显存大小。用户可使用`npu-smi info`命令进行查询,该值对应查询结果中的`HBM-Usage(MB)`; - `vLLM_MODEL_BACKEND`:所运行的模型后端。目前vLLM MindSpore所支持的模型与模型后端,可在[模型支持列表](../../../user_guide/supported_models/models_list/models_list.md)中进行查询; -- `vLLM_MODEL_MEMORY_USE_GB`:模型加载时所用空间,根据用户所使用的模型进行设置。若用户在模型加载过程中遇到显存不足时,可适当增大该值并重试; - `MINDFORMERS_MODEL_CONFIG`:模型配置文件。用户可以在[MindSpore Transformers工程](https://gitee.com/mindspore/mindformers/tree/r1.5.0/research/qwen2_5)中,找到对应模型的yaml文件。以Qwen2.5-7B为例,则其yaml文件为[predict_qwen2_5_7b_instruct.yaml](https://gitee.com/mindspore/mindformers/blob/r1.5.0/research/qwen2_5/predict_qwen2_5_7b_instruct.yaml) 。 用户可通过`npu-smi info`查看显存占用情况,并可以使用如下环境变量,设置用于推理的计算卡: diff --git a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/benchmark/benchmark.md b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/benchmark/benchmark.md index 15f1040699..6d28a6ff7c 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/benchmark/benchmark.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/benchmark/benchmark.md @@ -9,9 +9,7 @@ vLLM MindSpore的性能测试能力,继承自vLLM所提供的性能测试能 若用户使用单卡推理,以[Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct)为例,可按照文档[单卡推理(Qwen2.5-7B)](../../../getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md#在线推理)进行环境准备,设置以下环境变量: ```bash -export ASCEND_TOTAL_MEMORY_GB=64 # Please use `npu-smi info` to check the memory. export vLLM_MODEL_BACKEND=MindFormers # use MindSpore Transformers as model backend. -export vLLM_MODEL_MEMORY_USE_GB=32 # Memory reserved for model execution. Set according to the model's maximum usage, with the remaining environment used for kvcache allocation export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model's YAML file. ``` @@ -104,9 +102,7 @@ P99 ITL (ms): .... 用户使用离线性能测试时,以[Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct)为例,可按照文档[单卡推理(Qwen2.5-7B)](../../../getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md#离线推理)进行环境准备,设置以下环境变量: ```bash -export ASCEND_TOTAL_MEMORY_GB=64 # Please use `npu-smi info` to check the memory. export vLLM_MODEL_BACKEND=MindFormers # use MindSpore Transformers as model backend. -export vLLM_MODEL_MEMORY_USE_GB=32 # Memory reserved for model execution. Set according to the model's maximum usage, with the remaining environment used for kvcache allocation export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model's YAML file. ``` diff --git a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/quantization/quantization.md b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/quantization/quantization.md index 54ad35032d..71667d5f1e 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/quantization/quantization.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/quantization/quantization.md @@ -27,9 +27,7 @@ 用户可以参考[安装指南](../../../getting_started/installation/installation.md),进行vLLM MindSpore的环境搭建。用户需设置以下环境变量: ```bash -export ASCEND_TOTAL_MEMORY_GB=64 # Please use `npu-smi info` to check the memory. export vLLM_MODEL_BACKEND=MindFormers # use MindSpore Transformers as model backend. -export vLLM_MODEL_MEMORY_USE_GB=32 # Memory reserved for model execution. 
Set according to the model's maximum usage, with the remaining environment used for kvcache allocation export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model's YAML file. ``` -- Gitee From b1fc4254748cf6aadc8040d9f0afcab6d34627a7 Mon Sep 17 00:00:00 2001 From: horcam Date: Tue, 12 Aug 2025 10:50:59 +0800 Subject: [PATCH 02/12] update ds dir, check model name and model args --- .../quick_start/quick_start.md | 4 ++- .../deepseek_r1_671b_w8a8_dp4_tp4_ep4.md | 27 +++++++++++-------- .../qwen2.5_32b_multiNPU.md | 8 +++--- .../qwen2.5_7b_singleNPU.md | 6 +++-- .../quick_start/quick_start.md | 4 +-- .../deepseek_r1_671b_w8a8_dp4_tp4_ep4.md | 21 +++++++++------ .../qwen2.5_32b_multiNPU.md | 6 ++--- .../supported_features/profiling/profiling.md | 2 +- 8 files changed, 47 insertions(+), 31 deletions(-) diff --git a/docs/vllm_mindspore/docs/source_en/getting_started/quick_start/quick_start.md b/docs/vllm_mindspore/docs/source_en/getting_started/quick_start/quick_start.md index 6b129509a8..506dc04152 100644 --- a/docs/vllm_mindspore/docs/source_en/getting_started/quick_start/quick_start.md +++ b/docs/vllm_mindspore/docs/source_en/getting_started/quick_start/quick_start.md @@ -198,7 +198,7 @@ Use the model `Qwen/Qwen2.5-7B-Instruct` and start the vLLM service with the fol python3 -m vllm_mindspore.entrypoints vllm.entrypoints.openai.api_server --model "Qwen/Qwen2.5-7B-Instruct" ``` -If the service starts successfully, similar output will be obtained: +User can also set the local model path by `--model` argument. If the service starts successfully, similar output will be obtained: ```text INFO: Started server process [6363] @@ -220,6 +220,8 @@ Use the following command to send a request, where `prompt` is the model input: curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "Qwen/Qwen2.5-7B-Instruct", "prompt": "I am", "max_tokens": 15, "temperature": 0}' ``` +User needs to ensure that the `"model"` field matches the `--model` in the service startup, and the request can successfully match the model. + If the request is processed successfully, the following inference result will be returned: ```text diff --git a/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md b/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md index 7a3a9fb4a8..8386c3d11c 100644 --- a/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md +++ b/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md @@ -241,18 +241,20 @@ Execution example: ```bash # Master node: -vllm-mindspore serve --model="/path/to/save/deepseek_r1_w8a8" --trust-remote-code --max-num-seqs=256 --max_model_len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 16 --distributed-executor-backend=ray +vllm-mindspore serve --model="MindSpore-Lab/DeepSeek-R1-W8A8" --trust-remote-code --max-num-seqs=256 --max_model_len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 16 --distributed-executor-backend=ray ``` -In tensor parallel scenarios, the `--tensor-parallel-size` parameter overrides the `model_parallel` configuration in the model YAML file. 
+In tensor parallel scenarios, the `--tensor-parallel-size` parameter overrides the `model_parallel` configuration in the model YAML file. User can also set the local model path by `--model` argument. #### Sending Requests Use the following command to send requests, where `prompt` is the model input: ```bash -curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "/path/to/save/deepseek_r1_w8a8", "prompt": "I am", "max_tokens": 20, "temperature": 0, "top_p": 1.0, "top_k": 1, "repetition_penalty": 1.0}' -``` +curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "MindSpore-Lab/DeepSeek-R1-W8A8", "prompt": "I am", "max_tokens": 20, "temperature": 0, "top_p": 1.0, "top_k": 1, "repetition_penalty": 1.0}' +``` + +User needs to ensure that the `"model"` field matches the `--model` in the service startup, and the request can successfully match the model. ## Hybrid Parallel Inference @@ -301,6 +303,7 @@ parallel_config: ### Online Inference +#### Starting the Service `vllm-mindspore` can deploy online inference using the OpenAI API protocol. Below is the workflow for launching the service: ```bash @@ -321,22 +324,24 @@ vllm-mindspore serve --data-parallel-address [Master node communication IP] --data-parallel-rpc-port [Master node communication port] --enable-expert-parallel # Enable expert parallelism -``` +``` -Execution example: +User can also set the local model path by `--model` argument. The following is an execution example: ```bash # Master node: -vllm-mindspore serve --model="/path/to/save/deepseek_r1_w8a8" --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 0 --data-parallel-address 192.10.10.10 --data-parallel-rpc-port 12370 --enable-expert-parallel +vllm-mindspore serve --model="MindSpore-Lab/DeepSeek-R1-W8A8" --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 0 --data-parallel-address 192.10.10.10 --data-parallel-rpc-port 12370 --enable-expert-parallel # Worker node: -vllm-mindspore serve --headless --model="/path/to/save/deepseek_r1_w8a8" --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 2 --data-parallel-address 192.10.10.10 --data-parallel-rpc-port 12370 --enable-expert-parallel -``` +vllm-mindspore serve --headless --model="MindSpore-Lab/DeepSeek-R1-W8A8" --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 2 --data-parallel-address 192.10.10.10 --data-parallel-rpc-port 12370 --enable-expert-parallel +``` -## Sending Requests +#### Sending Requests Use the following command to send requests, where `prompt` is the model input: ```bash -curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "/path/to/save/deepseek_r1_w8a8", "prompt": "I am", "max_tokens": 20, "temperature": 0}' +curl http://localhost:8000/v1/completions -H 
"Content-Type: application/json" -d '{"model": "MindSpore-Lab/DeepSeek-R1-W8A8", "prompt": "I am", "max_tokens": 20, "temperature": 0}' ``` + +User needs to ensure that the `"model"` field matches the `--model` in the service startup, and the request can successfully match the model. diff --git a/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md b/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md index 7f55f75080..0451257e2b 100644 --- a/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md +++ b/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md @@ -156,7 +156,7 @@ export MAX_MODEL_LEN=1024 python3 -m vllm_mindspore.entrypoints vllm.entrypoints.openai.api_server --model "Qwen/Qwen2.5-32B-Instruct" --trust_remote_code --tensor-parallel-size $TENSOR_PARALLEL_SIZE --max-model-len $MAX_MODEL_LEN ``` -Here, `TENSOR_PARALLEL_SIZE` specifies the number of NPU cards, and `MAX_MODEL_LEN` sets the maximum output token length. +Here, `TENSOR_PARALLEL_SIZE` specifies the number of NPU cards, and `MAX_MODEL_LEN` sets the maximum output token length. User can also set the local model path by `--model` argument. If the service starts successfully, similar output will be obtained: @@ -177,16 +177,18 @@ Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 Use the following command to send a request, where `prompt` is the model input: ```bash -curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "Qwen2.5-32B-Instruct", "prompt": "I am", "max_tokens": 20, "temperature": 0}' +curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "Qwen/Qwen2.5-32B-Instruct", "prompt": "I am", "max_tokens": 20, "temperature": 0}' ``` +User needs to ensure that the `"model"` field matches the `--model` in the service startup, and the request can successfully match the model. + If processed successfully, the inference result will be: ```text { "id":"cmpl-11fe2898c77d4ff18c879f57ae7aa9ca","object":"text_completion", "create":1748568696, - "model":"Qwen2.5-32B-Instruct", + "model":"Qwen/Qwen2.5-32B-Instruct", "choices":[ { "index":0, diff --git a/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md b/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md index f6daacb52d..e1216d12c9 100644 --- a/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md +++ b/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md @@ -192,7 +192,7 @@ Use the model `Qwen/Qwen2.5-7B-Instruct` and start the vLLM service with the fol python3 -m vllm_mindspore.entrypoints vllm.entrypoints.openai.api_server --model "Qwen/Qwen2.5-7B-Instruct" ``` -If the service starts successfully, similar output will be obtained: +User can also set the local model path by `--model` argument. 
If the service starts successfully, similar output will be obtained: ```text INFO: Started server process [6363] @@ -214,13 +214,15 @@ Use the following command to send a request, where `prompt` is the model input: curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "Qwen/Qwen2.5-7B-Instruct", "prompt": "I am", "max_tokens": 15, "temperature": 0}' ``` +User needs to ensure that the `"model"` field matches the `--model` in the service startup, and the request can successfully match the model. + If the request is processed successfully, the following inference result will be returned: ```text { "id":"cmpl-5e6e314861c24ba79fea151d86c1b9a6","object":"text_completion", "create":1747398389, - "model":"Qwen2.5-7B-Instruct", + "model":"Qwen/Qwen2.5-7B-Instruct", "choices":[ { "index":0, diff --git a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/quick_start/quick_start.md b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/quick_start/quick_start.md index a17c89b0fa..cb8960fed7 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/quick_start/quick_start.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/quick_start/quick_start.md @@ -198,7 +198,7 @@ vLLM MindSpore可使用OpenAI的API协议,进行在线推理部署。以下是 python3 -m vllm_mindspore.entrypoints vllm.entrypoints.openai.api_server --model "Qwen/Qwen2.5-7B-Instruct" ``` -若服务成功拉起,则可以获得类似的执行结果: +用户可以通过`--model`参数,指定模型保存的本地路径。若服务成功拉起,则可以获得类似的执行结果: ```text INFO: Started server process [6363] @@ -220,7 +220,7 @@ Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg gereration throughput: 0.0 curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "Qwen/Qwen2.5-7B-Instruct", "prompt": "I am", "max_tokens": 20, "temperature": 0}' ``` -若请求处理成功,将获得以下的推理结果: +其中,用户需确认`"model"`字段与启动服务中`--model`一致,请求才能成功匹配到模型。可以通过请求处理成功,将获得以下的推理结果: ```text { diff --git a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md index 047e7b4aad..2ad616469f 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md @@ -284,19 +284,21 @@ vllm-mindspore serve ```bash # 主节点: -vllm-mindspore serve --model="/path/to/save/deepseek_r1_w8a8" --trust-remote-code --max-num-seqs=256 --max_model_len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 16 --distributed-executor-backend=ray +vllm-mindspore serve --model="MindSpore-Lab/DeepSeek-R1-W8A8" --trust-remote-code --max-num-seqs=256 --max_model_len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 16 --distributed-executor-backend=ray ``` -张量并行场景下,`--tensor-parallel-size`参数会覆盖模型yaml文件中`parallel_config`的`model_parallel`配置。 +张量并行场景下,`--tensor-parallel-size`参数会覆盖模型yaml文件中`parallel_config`的`model_parallel`配置。用户可以通过`--model`参数,指定模型保存的本地路径。 #### 发起请求 使用如下命令发送请求。其中`prompt`字段为模型输入: ```bash -curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "/path/to/save/deepseek_r1_w8a8", "prompt": "I am", "max_tokens": 20, "temperature": 0, "top_p": 1.0, "top_k": 1, "repetition_penalty": 1.0}' +curl http://localhost:8000/v1/completions -H 
"Content-Type: application/json" -d '{"model": "MindSpore-Lab/DeepSeek-R1-W8A8", "prompt": "I am", "max_tokens": 20, "temperature": 0, "top_p": 1.0, "top_k": 1, "repetition_penalty": 1.0}' ``` +用户需确认`"model"`字段与启动服务中`--model`一致,请求才能成功匹配到模型。 + ## 混合并行推理 vLLM 通过 Ray 对多个节点资源进行管理和运行。该样例对应以下并行策略场景: @@ -344,6 +346,7 @@ parallel_config: ### 在线推理 +#### 启动服务 `vllm-mindspore`可使用OpenAI的API协议部署在线推理。以下是在线推理的拉起流程: ```bash @@ -366,20 +369,22 @@ vllm-mindspore serve --enable-expert-parallel # 使能专家并行 ``` -执行示例: +用户可以通过`--model`参数,指定模型保存的本地路径。以下为执行示例: ```bash # 主节点: -vllm-mindspore serve --model="/path/to/save/deepseek_r1_w8a8" --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 0 --data-parallel-address 192.10.10.10 --data-parallel-rpc-port 12370 --enable-expert-parallel +vllm-mindspore serve --model="MindSpore-Lab/DeepSeek-R1-W8A8" --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 0 --data-parallel-address 192.10.10.10 --data-parallel-rpc-port 12370 --enable-expert-parallel # 从节点: -vllm-mindspore serve --headless --model="/path/to/save/deepseek_r1_w8a8" --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 2 --data-parallel-address 192.10.10.10 --data-parallel-rpc-port 12370 --enable-expert-parallel +vllm-mindspore serve --headless --model="MindSpore-Lab/DeepSeek-R1-W8A8" --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 2 --data-parallel-address 192.10.10.10 --data-parallel-rpc-port 12370 --enable-expert-parallel ``` -## 发送请求 +#### 发送请求 使用如下命令发送请求。其中`prompt`字段为模型输入: ```bash -curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "/path/to/save/deepseek_r1_w8a8", "prompt": "I am, "max_tokens": 120, "temperature": 0}' +curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "MindSpore-Lab/DeepSeek-R1-W8A8", "prompt": "I am, "max_tokens": 120, "temperature": 0}' ``` + +用户需确认`"model"`字段与启动服务中`--model`一致,请求才能成功匹配到模型。 diff --git a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md index 5db8bd6d63..72e4bad00a 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md @@ -159,7 +159,7 @@ python3 -m vllm_mindspore.entrypoints vllm.entrypoints.openai.api_server --model 其中,`TENSOR_PARALLEL_SIZE`为用户指定的卡数,`MAX_MODEL_LEN`为模型最大输出token数。 -若服务成功拉起,则可以获得类似的执行结果: +用户可以通过`--model`参数,指定模型保存的本地路径。若服务成功拉起,则可以获得类似的执行结果: ```text INFO: Started server process [6363] @@ -178,10 +178,10 @@ Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg gereration throughput: 0.0 
使用如下命令发送请求。其中`prompt`字段为模型输入: ```bash -curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "Qwen2.5-32B-Instruct", "prompt": "I am", "max_tokens": 20, "temperature": 0}' +curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "Qwen/Qwen2.5-32B-Instruct", "prompt": "I am", "max_tokens": 20, "temperature": 0}' ``` -若请求处理成功,将获得以下的推理结果: +其中,用户需确认`"model"`字段与启动服务中`--model`一致,请求才能成功匹配到模型。可以通过请求处理成功,将获得以下的推理结果: ```text { diff --git a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/profiling/profiling.md b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/profiling/profiling.md index 4dc4f2ccee..eb90283188 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/profiling/profiling.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/profiling/profiling.md @@ -40,7 +40,7 @@ curl -X POST http://127.0.0.1:8000/start_profile curl http://localhost:8000/v1/completions \ -H "Content-Type: application/json" \ -d '{ - "model": "/home/DeepSeekV3", + "model": "Qwen/Qwen2.5-32B-Instruct", "prompt": "San Francisco is a", "max_tokens": 7, "temperature": 0 -- Gitee From d6e1eca1924f9e7e91e719cd33bd0768bfe2e802 Mon Sep 17 00:00:00 2001 From: horcam Date: Tue, 12 Aug 2025 11:14:01 +0800 Subject: [PATCH 03/12] update introduction of manual install --- .../installation/installation.md | 47 +++++++++++++++++++ .../installation/installation.md | 46 ++++++++++++++++++ 2 files changed, 93 insertions(+) diff --git a/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md b/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md index a6789dccde..cadf178a7a 100644 --- a/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md +++ b/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md @@ -148,6 +148,53 @@ docker exec -it $DOCKER_NAME bash export PYTHONPATH=$MF_PATH:$PYTHONPATH ``` +- **Manual Component Installation** + + If user need to modify the components or use other versions, components need to be manually installed in a specific order. vLLM MindSpore requires the following installation sequence: + + 1. Install vLLM + + ```bash + pip install /path/to/vllm-*.whl + ``` + + 2. Uninstall Torch-related components + + ```bash + pip uninstall torch torch-npu torchvision torchaudio -y + ``` + + 3. Install MindSpore + + ```bash + pip install /path/to/mindspore-*.whl + ``` + + 4. Clone the MindSpore Transformers repository and add it to `PYTHONPATH` + + ```bash + git clone https://gitee.com/mindspore/mindformers.git + export PYTHONPATH=`realpath mindformers`:$PYTHONPATH + ``` + + 5. Install Golden Stick + + ```bash + pip install /path/to/mindspore_gs-*.whl + ``` + + 6. Install MSAdapter + + ```bash + pip install /path/to/msadapter-*.whl + ``` + + 7. Install vLLM MindSpore + + ```bash + pip install . + ``` + ### Quick Verification User can verify the installation with a simple offline inference test. 
First, user need to configure the environment variables with the following command: diff --git a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md index 2243f2eaea..2331d3db9e 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md @@ -147,6 +147,52 @@ docker exec -it $DOCKER_NAME bash export PYTHONPATH=$MF_PATH:$PYTHONPATH ``` +- **组件手动安装** + + 若用户对组件有修改,或者需使用其他版本,则用户需要按照特定顺序,手动安装组件。vLLM MindSpore对组件的安装顺序要求如下: + 1. 安装vLLM + + ```bash + pip install /path/to/vllm-*.whl + ``` + + 2. 卸载torch相关组件 + + ```bash + pip uninstall torch torch-npu torchvision torchaudio -y + ``` + + 3. 安装MindSpore + + ```bash + pip install /path/to/mindspore-*.whl + ``` + + 4. 引入MindSpore Transformers仓,加入到`PYTHONPATH`中 + + ```bash + git clone https://gitee.com/mindspore/mindformers.git + export PYTHONPATH=`realpath mindformers`:$PYTHONPATH + ``` + + 5. 安装Golden Stick + + ```bash + pip install /path/to/mindspore_gs-*.whl + ``` + + 6. 安装MSAdapter + + ```bash + pip install /path/to/msadapter-*.whl + ``` + + 7. 安装vLLM MindSpore + + ```bash + pip install . + ``` + ### 快速验证 用户可以创建一个简单的离线推理场景,验证安装是否成功。下面以[Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) 为例。首先用户需要执行以下命令,设置环境变量: -- Gitee From 8733fb088f5b12723fa781c7d225ce019de613a1 Mon Sep 17 00:00:00 2001 From: horcam Date: Tue, 12 Aug 2025 11:25:21 +0800 Subject: [PATCH 04/12] update mindformers config introduction --- .../docs/source_en/getting_started/quick_start/quick_start.md | 2 +- .../source_zh_cn/getting_started/quick_start/quick_start.md | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/vllm_mindspore/docs/source_en/getting_started/quick_start/quick_start.md b/docs/vllm_mindspore/docs/source_en/getting_started/quick_start/quick_start.md index 506dc04152..91a88e814d 100644 --- a/docs/vllm_mindspore/docs/source_en/getting_started/quick_start/quick_start.md +++ b/docs/vllm_mindspore/docs/source_en/getting_started/quick_start/quick_start.md @@ -138,7 +138,7 @@ export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Tra Here is an explanation of these environment variables: - `vLLM_MODEL_BACKEND`: The backend of the model to run. User could find supported models and backends for vLLM MindSpore in the [Model Support List](../../user_guide/supported_models/models_list/models_list.md). -- `MINDFORMERS_MODEL_CONFIG`: The model configuration file. +- `MINDFORMERS_MODEL_CONFIG`: The model configuration file. User can find the corresponding YAML file in the [MindSpore Transformers repository](https://gitee.com/mindspore/mindformers/tree/master/research/qwen2_5). For Qwen2.5-7B, the YAML file is [predict_qwen2_5_7b_instruct.yaml](https://gitee.com/mindspore/mindformers/blob/master/research/qwen2_5/predict_qwen2_5_7b_instruct.yaml). Additionally, users need to ensure that MindSpore Transformers is installed. 
Users can add it by running the following command: diff --git a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/quick_start/quick_start.md b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/quick_start/quick_start.md index cb8960fed7..82d5534bda 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/quick_start/quick_start.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/quick_start/quick_start.md @@ -138,7 +138,7 @@ export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Tra 以下是对上述环境变量的解释: - `vLLM_MODEL_BACKEND`:所运行的模型后端。目前vLLM MindSpore所支持的模型与模型后端,可在[模型支持列表](../../user_guide/supported_models/models_list/models_list.md)中进行查询; -- `MINDFORMERS_MODEL_CONFIG`:模型配置文件。 +- `MINDFORMERS_MODEL_CONFIG`:模型配置文件。用户可以在[MindSpore Transformers工程](https://gitee.com/mindspore/mindformers/tree/r1.5.0/research/qwen2_5)中,找到对应模型的yaml文件。以Qwen2.5-7B为例,则其yaml文件为[predict_qwen2_5_7b_instruct.yaml](https://gitee.com/mindspore/mindformers/blob/r1.5.0/research/qwen2_5/predict_qwen2_5_7b_instruct.yaml) 。 另外,用户需要确保MindSpore Transformers已安装。用户可通过 -- Gitee From 4975ebb7f19789bf666d82e4d54ff39a22de5979 Mon Sep 17 00:00:00 2001 From: horcam Date: Tue, 12 Aug 2025 16:07:42 +0800 Subject: [PATCH 05/12] update env variable and pkgs version --- .../installation/installation.md | 18 ++++----- .../environment_variables.md | 37 +++++++++++++++++-- .../installation/installation.md | 18 ++++----- .../environment_variables.md | 32 +++++++++++++++- 4 files changed, 83 insertions(+), 22 deletions(-) diff --git a/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md b/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md index cadf178a7a..9c3f34359f 100644 --- a/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md +++ b/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md @@ -14,15 +14,15 @@ This document describes the steps to install the vLLM MindSpore environment. 
Thr - Python: 3.9 / 3.10 / 3.11 - Software version compatibility - | Software | Version | Corresponding Branch | - | -------- | ------- | -------------------- | - | [CANN](https://www.hiascend.com/developer/download/community/result?module=cann) | 8.1 | - | - | [MindSpore](https://www.mindspore.cn/install/) | 2.7 | master | - | [MSAdapter](https://git.openi.org.cn/OpenI/MSAdapter) | 0.2 | master | - | [MindSpore Transformers](https://gitee.com/mindspore/mindformers) | 1.6 | dev | - | [Golden Stick](https://gitee.com/mindspore/golden-stick) | 1.1.0 | r1.1.0 | - | [vLLM](https://github.com/vllm-project/vllm) | 0.9.1 | v0.9.1 | - | [vLLM MindSpore](https://gitee.com/mindspore/vllm-mindspore) | 0.3 | master | + | Software | Version | + | ----- | ----- | + |[CANN](https://www.hiascend.com/developer/download/community/result?module=cann) | 8.1 | + |[MindSpore](https://www.mindspore.cn/install/) | 2.7.0 | + |[MSAdapter](https://git.openi.org.cn/OpenI/MSAdapter)| 0.0.1 | + |[MindSpore Transformers](https://gitee.com/mindspore/mindformers)| 1.6.0 | + |[Golden Stick](https://gitee.com/mindspore/golden-stick)| 1.2.0 | + |[vLLM](https://github.com/vllm-project/vllm) | 0.8.3 | + |[vLLM MindSpore](https://gitee.com/mindspore/vllm-mindspore) | 0.3.0 | ## Environment Setup diff --git a/docs/vllm_mindspore/docs/source_en/user_guide/environment_variables/environment_variables.md b/docs/vllm_mindspore/docs/source_en/user_guide/environment_variables/environment_variables.md index c1b7161626..35129a6215 100644 --- a/docs/vllm_mindspore/docs/source_en/user_guide/environment_variables/environment_variables.md +++ b/docs/vllm_mindspore/docs/source_en/user_guide/environment_variables/environment_variables.md @@ -11,6 +11,37 @@ | `HCCL_SOCKET_IFNAME` | Specifies the network interface name for inter-machine communication using HCCL. | String | Interface name (e.g., `enp189s0f0`). | Used in multi-machine scenarios. The interface name can be found via `ifconfig` by matching the IP address. | | `ASCEND_RT_VISIBLE_DEVICES` | Specifies which devices are visible to the current process, supporting one or multiple Device IDs. | String | Device IDs as a comma-separated string (e.g., `"0,1,2,3,4,5,6,7"`). | Recommended for Ray usage scenarios. | | `HCCL_BUFFSIZE` | Controls the buffer size for data sharing between two NPUs. | int | Buffer size in MB (e.g., `2048`). | Usage reference: [HCCL_BUFFSIZE](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/81RC1beta1/maintenref/envvar/envref_07_0080.html). Example: For DeepSeek hybrid parallelism (Data Parallel: 32, Expert Parallel: 32) with `max-num-batched-tokens=256`, set `export HCCL_BUFFSIZE=2048`. | -| MS_MEMPOOL_BLOCK_SIZE | Set the size of the memory pool block in PyNative mode for devices | String | String of positive number, and the unit is GB. | | -| vLLM_USE_NPU_ADV_STEP_FLASH_OP | Whether to use Ascend operation `adv_step_flash` | String | `on`: Use;`off`:Not use | If the variable is set to `off`, model will use the implement of small operations. | -| VLLM_TORCH_PROFILER_DIR | Enables profiling data collection and takes effect when a data save path is configured. | String | The path to save profiling data. | | +| `MS_MEMPOOL_BLOCK_SIZE` | Set the size of the memory pool block in PyNative mode for devices | String | String of positive number, and the unit is GB. 
| | +| `vLLM_USE_NPU_ADV_STEP_FLASH_OP` | Whether to use Ascend operation `adv_step_flash` | String | `on`: Use;`off`:Not use | If the variable is set to `off`, model will use the implement of small operations. | +| `VLLM_TORCH_PROFILER_DIR` | Enables profiling data collection and takes effect when a data save path is configured. | String | The path to save profiling data. | | + +The following environment variables are automatically registered by vLLM MindSpore: + +| **Environment Variable** | **Function** | **Type** | **Values** | **Description** | +|------------------------|-------------|----------|------------|----------------| +| `USE_TORCH` | Transformer runtime depends on this variable. | String | Default: `"False"` | vLLM MindSpore does not use Torch as the backend. | +| `USE_TF` | Transformer runtime depends on this variable. | String | Default: `"False"` | vLLM MindSpore does not use TensorFlow as the backend. | +| `RUN_MODE` | Execution mode. | String | Default: `"predict"` | **This variable will be removed in future versions.** Required by MindFormers. | +| `CUSTOM_MATMUL_SHUFFLE` | Enables or disables custom matrix shuffling algorithm . | String | `on`: Enable shuffling. `off`: Disable shuffling. Default: `on`. | | +| `HCCL_DETERMINISTIC` | Enables or disables deterministic computation for reduction-type communication operators (e.g., AllReduce, ReduceScatter, Reduce). | String | `true`: Enable deterministic mode. `false`: Disable deterministic mode. Default: `false`. | | +| `ASCEND_LAUNCH_BLOCKING` | Controls whether operators run in synchronous mode during training or online inference. | Integer | `1`: Force synchronous execution. `0`: Do not force synchronous execution. Default: `0`. | | +| `TE_PARALLEL_COMPILER` | Maximum number of parallel compilation processes for operators. Parallel compilation is enabled if greater than 1. | Integer | Positive integer; Max = CPU cores * 80% / # of Ascend AI processors. Range: 1~32. Default: `0`. | | +| `LCCL_DETERMINISTIC` | Controls whether LCCL deterministic AllReduce (ordered addition) is enabled. | Integer | `1`: Enable deterministic mode. `0`: Disable deterministic mode. Default: `0`. | | +| `MS_ENABLE_GRACEFUL_EXIT` | Enables graceful process termination. | Integer | `1`: Enable graceful exit. `Other values`: Disable graceful exit. Default: `0`. | | +| `CPU_AFFINITY` | Optimizes CPU core binding for MindSpore inference. | String | `True`: Enable core binding. `False`: Disable core binding. Default: `True`. | **This variable will be removed in future versions.** Replaced by `set_cpu_affinity` API. | +| `MS_ENABLE_INTERNAL_BOOST` | Enables or disables MindSpore framework's internal acceleration. | String | `on`: Enable acceleration. `off`: Disable acceleration. Default: `on`. | | +| `MS_ENABLE_LCCL` | Controls whether the LCCL communication library is used. | Integer | `1`: Enable. `0`: Disable. Default: `0`. | | +| `HCCL_EXEC_TIMEOUT` | Controls the synchronization timeout for inter-device execution. | Integer | Range: (0, 17340] (seconds). Default: `7200`. | | +| `DEVICE_NUM_PER_NODE` | Number of devices per node. | Integer | Default: `16`. | | +| `HCCL_OP_EXPANSION_MODE` | Configures the expansion location for communication algorithms. | String | `AI_CPU`: Expands on AI CPU compute units. `AIV`: Expands on AI Vector Core compute units. Default: `AIV`. | | +| `MS_JIT_MODULES` | Specifies modules to be JIT-compiled in static graph mode. | String | Module names (top-level imports). Multiple modules should be comma-separated. 
Default: `"vllm_mindspore,research"`. | | +| `GLOG_v` | Controls log level. | Integer | `0`: DEBUG. `1`: INFO. `2`: WARNING. `3`: ERROR (logs errors, may not terminate). `4`: CRITICAL (logs critical errors, terminates execution). Default: `3`. | | +| `RAY_CGRAPH_get_timeout` | Timeout for `ray.get()` method (seconds). | Integer | Default: `360`. | | +| `MS_NODE_TIMEOUT` | Node heartbeat timeout (seconds). | Integer | Default: `180`. | | + +More environment variable information can be referred in the following link: + + - [CANN Environment Variable List](https://www.hiascend.com/document/detail/en/CANNCommunityEdition/81RC1beta1/index/index.html) + - [MindSpore Environment Variable List](https://www.mindspore.cn/docs/en/r2.7.0rc1/api_python/env_var_list.html) + - [MindSpore Transformers Environment Variable List](https://www.mindspore.cn/mindformers/docs/en/r1.6.0/index.html) + - [vLLM Environment Variable List](https://docs.vllm.ai/en/v0.8.4/serving/env_vars.html) diff --git a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md index 2331d3db9e..8d4eb6a8d6 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md @@ -13,15 +13,15 @@ - Python:3.9 / 3.10 / 3.11 - 软件版本配套 - | 软件 | 版本 | 对应分支 | - | ----- | ----- | ----- | - |[CANN](https://www.hiascend.com/developer/download/community/result?module=cann) | 8.1 | - | - |[MindSpore](https://www.mindspore.cn/install/) | 2.7 | master | - |[MSAdapter](https://git.openi.org.cn/OpenI/MSAdapter)| 0.2 | master | - |[MindSpore Transformers](https://gitee.com/mindspore/mindformers)|1.6 | dev | - |[Golden Stick](https://gitee.com/mindspore/golden-stick)|1.1.0 | r1.1.0 | - |[vLLM](https://github.com/vllm-project/vllm) | 0.9.1 | v0.9.1 | - |[vLLM MindSpore](https://gitee.com/mindspore/vllm-mindspore) | 0.3 | master | + | 软件 | 版本 | + | ----- | ----- | + |[CANN](https://www.hiascend.com/developer/download/community/result?module=cann) | 8.1 | + |[MindSpore](https://www.mindspore.cn/install/) | 2.7.0 | + |[MSAdapter](https://git.openi.org.cn/OpenI/MSAdapter)| 0.0.1 | + |[MindSpore Transformers](https://gitee.com/mindspore/mindformers)| 1.6.0 | + |[Golden Stick](https://gitee.com/mindspore/golden-stick)| 1.2.0 | + |[vLLM](https://github.com/vllm-project/vllm) | 0.8.3 | + |[vLLM MindSpore](https://gitee.com/mindspore/vllm-mindspore) | 0.3.0 | ## 配置环境 diff --git a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/environment_variables/environment_variables.md b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/environment_variables/environment_variables.md index 7fd53b3ff3..835946b67d 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/environment_variables/environment_variables.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/environment_variables/environment_variables.md @@ -10,7 +10,37 @@ | TP_SOCKET_IFNAME | 用于多机之间使用TP通信时的网口名称。 | String | 网口名称,例如enp189s0f0。 | 多机场景使用,可通过`ifconfig`查找ip对应网卡的网卡名。 | | HCCL_SOCKET_IFNAME | 用于多机之间使用HCCL通信时的网口名称。 | String | 网口名称,例如enp189s0f0。 | 多机场景使用,可通过`ifconfig`查找ip对应网卡的网卡名。 | | ASCEND_RT_VISIBLE_DEVICES | 指定哪些Device对当前进程可见,支持一次指定一个或多个Device ID。 | String | 为Device ID,逗号分割的字符串,例如"0,1,2,3,4,5,6,7"。 | ray使用场景建议使用。 | -| HCCL_BUFFSIZE | 此环境变量用于控制两个NPU之间共享数据的缓存区大小。 | int | 缓存区大小,大小为MB。例如:`2048`。 | 
使用方法参考:[HCCL_BUFFSIZE](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/81RC1beta1/maintenref/envvar/envref_07_0080.html)。例如DeepSeek 混合并行(数据并行数为32,专家并行数为32),且`max-num-batched-tokens`为256时,则`export HCCL_BUFFSIZE=2048`。 |
+| HCCL_BUFFSIZE | 此环境变量用于控制两个NPU之间共享数据的缓存区大小。 | Integer | 缓存区大小,大小为MB。例如:`2048`。 | 使用方法参考:[HCCL_BUFFSIZE](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/81RC1beta1/maintenref/envvar/envref_07_0080.html)。例如DeepSeek 混合并行(数据并行数为32,专家并行数为32),且`max-num-batched-tokens`为256时,则`export HCCL_BUFFSIZE=2048`。 |
 | MS_MEMPOOL_BLOCK_SIZE | 设置PyNative模式下设备内存池的块大小。 | String | 正整数string,单位为GB。 | |
 | vLLM_USE_NPU_ADV_STEP_FLASH_OP | 是否使用昇腾`adv_step_flash`算子。 | String | `on`: 使用;`off`:不使用 | 取值为`off`时,将使用小算子实现替代`adv_step_flash`算子。 |
 | VLLM_TORCH_PROFILER_DIR | 开启profiling采集数据,当配置了采集数据保存路径后生效 | String | Profiling数据保存路径。| |
+
+以下环境变量由vLLM MindSpore自动注册:
+
+| 环境变量 | 功能 | 类型 | 取值 | 说明 |
+| ------ | ------- | ------ | ------ | ------ |
+| USE_TORCH | Transformer运行时依赖该环境变量 | String | 默认值为"False" | vLLM MindSpore 不使用torch 作为后端 |
+| USE_TF | Transformer运行时依赖该环境变量 | String | 默认值为"False" | vLLM MindSpore 不使用TensorFlow 作为后端 |
+| RUN_MODE | 执行模式为推理 | String | 默认值为"predict" | **该环境变量在后续版本会被移除。** 为MindFormers依赖的环境变量 |
+| CUSTOM_MATMUL_SHUFFLE | 开启或关闭自定义矩阵算法的洗牌操作 | String | `on`:开启矩阵洗牌。`off`:关闭矩阵洗牌。默认值为`on`。 | |
+| HCCL_DETERMINISTIC | 开启或关闭归约类通信算子的确定性计算,其中归约类通信算子包括 AllReduce、ReduceScatter、Reduce。 | String | `true`:打开 HCCL 确定性开关;`false`:关闭 HCCL 确定性开关。默认值为`false`。 | |
+| ASCEND_LAUNCH_BLOCKING | 训练或在线推理场景,可通过此环境变量控制算子执行时是否启动同步模式。 | Integer | `1`:强制算子采用同步模式运行;`0`:不强制算子采用同步模式运行。默认值为`0`。 | |
+| TE_PARALLEL_COMPILER | 算子最大并行编译进程数,当大于 1 时开启并行编译。 | Integer | 取值为正整数;最大不超过 cpu 核数*80%/昇腾 AI 处理器个数,取值范围 1~32。默认值是 `0`。 | |
+| LCCL_DETERMINISTIC | 设置 LCCL 确定性算子 AllReduce(保序加)是否开启。 | Integer | `1`:打开 LCCL 确定性开关;`0`:关闭 LCCL 确定性开关。默认值是 `0`。 | |
+| MS_ENABLE_GRACEFUL_EXIT | 设置使能进程优雅退出 | Integer | `1`:使用进程优雅退出功能。`不设置或者其他值`: 不使用进程优雅退出功能。默认值为`0` | |
+| CPU_AFFINITY | MindSpore推理绑核优化 | String | `True`:开启绑核;`False`:不开启绑核。默认值为`True` | **该环境变量在后续版本会被移除。** 将使用`set_cpu_affinity`接口。 |
+| MS_ENABLE_INTERNAL_BOOST | 是否打开 MindSpore 框架的内部加速功能。 | String | `on`:开启 MindSpore 内部加速;`off`:关闭 MindSpore 内部加速。默认值为`on` | |
+| MS_ENABLE_LCCL | 是否使用LCCL通信库。 | Integer | `1`:开启,`0`:关闭。默认值为`0`。 | |
+| HCCL_EXEC_TIMEOUT | 通过该环境变量可控制设备间执行时同步等待的时间,在该配置时间内各设备进程等待其他设备执行通信同步。 | Integer | 取值范围为:(0, 17340],单位为 s。 默认值为 7200。 | |
+| DEVICE_NUM_PER_NODE | 节点上的设备数 | Integer | 默认值为16。 | |
+| HCCL_OP_EXPANSION_MODE | 用于配置通信算法的编排展开位置。 | String | `AI_CPU`:通信算法的编排展开位置为Device侧的AI CPU计算单元;`AIV`:通信算法的编排展开位置为Device侧的AI Vector Core计算单元。默认值为`AIV`。 | |
+| MS_JIT_MODULES | 指定静态图模式下哪些模块需要JIT静态编译,其函数方法会被编译成静态计算图 | String | 模块名,对应import导入的顶层模块的名称。如果有多个,使用英文逗号分隔。默认值为`"vllm_mindspore,research"`。 | |
+| GLOG_v | 控制日志的级别 | Integer | `0`:DEBUG;`1`:INFO;`2`:WARNING;`3`:ERROR,表示程序执行出现报错,输出错误日志,程序可能不会终止;`4`:CRITICAL,表示程序执行出现异常,将会终止执行程序。默认值为`3`。 | |
+| RAY_CGRAPH_get_timeout | `ray.get()`方法的超时时间。 | Integer | 默认值为`360`。 | |
+| MS_NODE_TIMEOUT | 节点心跳超时时间,单位:秒。 | Integer | 默认值为`180`。 | |
+
+更多的环境变量信息,请查看:
+ - [CANN 环境变量列表](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/81RC1beta1/index/index.html)
+ - [MindSpore 环境变量列表](https://www.mindspore.cn/docs/zh-CN/r2.7.0rc1/api_python/env_var_list.html)
+ - [MindSpore Transformers 环境变量列表](https://www.mindspore.cn/mindformers/docs/zh-CN/r1.6.0/index.html)
+ - [vLLM 环境变量列表](https://docs.vllm.ai/en/v0.8.4/serving/env_vars.html)
-- Gitee From b441fd837a6205c8ba087f9e5a11de48307a5906 Mon Sep 17 00:00:00
2001 From: horcam Date: Tue, 12 Aug 2025 16:14:39 +0800 Subject: [PATCH 06/12] update others --- .../user_guide/supported_features/profiling/profiling.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/profiling/profiling.md b/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/profiling/profiling.md index b24b541e59..897f1ec0c8 100644 --- a/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/profiling/profiling.md +++ b/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/profiling/profiling.md @@ -40,7 +40,7 @@ curl -X POST http://127.0.0.1:8000/start_profile curl http://localhost:8000/v1/completions \ -H "Content-Type: application/json" \ -d '{ - "model": "/home/DeepSeekV3", + "model": "Qwen/Qwen2.5-32B-Instruct", "prompt": "San Francisco is a", "max_tokens": 7, "temperature": 0 -- Gitee From ac3084d7412c75845b495d71fe625496cb36586a Mon Sep 17 00:00:00 2001 From: horcam Date: Wed, 13 Aug 2025 16:19:23 +0800 Subject: [PATCH 07/12] fix for comment --- .../installation/installation.md | 95 +++++++++------- .../deepseek_r1_671b_w8a8_dp4_tp4_ep4.md | 19 ++-- .../qwen2.5_32b_multiNPU.md | 4 +- .../qwen2.5_7b_singleNPU.md | 2 +- .../environment_variables.md | 34 +----- .../quantization/quantization.md | 2 +- .../models_list/models_list.md | 2 +- .../installation/installation.md | 107 ++++++++++-------- .../quick_start/quick_start.md | 4 +- .../deepseek_r1_671b_w8a8_dp4_tp4_ep4.md | 17 +-- .../qwen2.5_32b_multiNPU.md | 4 +- .../qwen2.5_7b_singleNPU.md | 6 +- .../environment_variables.md | 53 +++------ .../quantization/quantization.md | 2 +- .../models_list/models_list.md | 2 +- 15 files changed, 162 insertions(+), 191 deletions(-) diff --git a/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md b/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md index 9c3f34359f..dd8fab4ee0 100644 --- a/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md +++ b/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md @@ -4,8 +4,7 @@ This document describes the steps to install the vLLM MindSpore environment. Three installation methods are provided: -- [Docker Installation](#docker-installation): Suitable for quick deployment scenarios. -- [Pip Installation](#pip-installation): Suitable for scenarios requiring specific versions. +- [Docker Installation](#docker-installation): Suitable for quick deployment scenarios. - [Source Code Installation](#source-code-installation): Suitable for incremental development of vLLM MindSpore. ## Version Compatibility @@ -14,19 +13,19 @@ This document describes the steps to install the vLLM MindSpore environment. 
Thr - Python: 3.9 / 3.10 / 3.11 - Software version compatibility - | Software | Version | + | Software | Version And Links | | ----- | ----- | - |[CANN](https://www.hiascend.com/developer/download/community/result?module=cann) | 8.1 | - |[MindSpore](https://www.mindspore.cn/install/) | 2.7.0 | - |[MSAdapter](https://git.openi.org.cn/OpenI/MSAdapter)| 0.0.1 | - |[MindSpore Transformers](https://gitee.com/mindspore/mindformers)| 1.6.0 | - |[Golden Stick](https://gitee.com/mindspore/golden-stick)| 1.2.0 | - |[vLLM](https://github.com/vllm-project/vllm) | 0.8.3 | - |[vLLM MindSpore](https://gitee.com/mindspore/vllm-mindspore) | 0.3.0 | + |[CANN](https://www.hiascend.com/developer/download/community/result?module=cann) | [8.1.RC1](https://www.hiascend.com/document/detail/zh/canncommercial/81RC1/softwareinst/instg/instg_0000.html?Mode=PmIns&InstallType=local&OS=Debian&Software=cannToolKit) | + |[MindSpore](https://www.mindspore.cn/install/) | [2.7.0](https://repo.mindspore.cn/mindspore/mindspore/version/202508/20250814/master_20250814091143_7548abc43af03319bfa528fc96d0ccd3917fcc9c_newest/unified/) | + |[MSAdapter](https://git.openi.org.cn/OpenI/MSAdapter)| [0.5.0](https://repo.mindspore.cn/mindspore/msadapter/version/202508/20250814/master_20250814010018_4615051c43eef898b6bbdc69768656493b5932f8_newest/any/) | + |[MindSpore Transformers](https://gitee.com/mindspore/mindformers)| [1.6.0](https://gitee.com/mindspore/mindformers) | + |[Golden Stick](https://gitee.com/mindspore/golden-stick)| [1.2.0](https://repo.mindspore.cn/mindspore/golden-stick/version/202508/20250814/master_20250814010017_2713821db982330b3bcd6d84d85a3b337d555f27_newest/any/) | + |[vLLM](https://github.com/vllm-project/vllm) | [0.9.1](https://repo.mindspore.cn/mirrors/vllm/version/202505/20250514/v0.8.4.dev0_newest/any/) | + |[vLLM MindSpore](https://gitee.com/mindspore/vllm-mindspore) | [0.3.0](https://gitee.com/mindspore/vllm-mindspore/) | ## Environment Setup -This section introduces three installation methods: [Docker Installation](#docker-installation), [Pip Installation](#pip-installation), [Source Code Installation](#source-code-installation), and [Quick Verification](#quick-verification) example to check the installation. +This section introduces two installation methods: [Docker Installation](#docker-installation), [Source Code Installation](#source-code-installation), and [Quick Verification](#quick-verification) example to check the installation. ### Docker Installation @@ -106,51 +105,55 @@ docker exec -it $DOCKER_NAME bash ### Source Code Installation -- **CANN Installation** +#### CANN Installation - For CANN installation methods and environment configuration, please refer to [CANN Community Edition Installation Guide](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/82RC1alpha002/softwareinst/instg/instg_0001.html?Mode=PmIns&OS=openEuler&Software=cannToolKit). If you encounter any issues during CANN installation, please consult the [Ascend FAQ](https://www.hiascend.com/document/detail/zh/AscendFAQ/ProduTech/CANNFAQ/cannfaq_000.html) for troubleshooting. +For CANN installation methods and environment configuration, please refer to [CANN Community Edition Installation Guide](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/82RC1alpha002/softwareinst/instg/instg_0001.html?Mode=PmIns&OS=openEuler&Software=cannToolKit). 
If you encounter any issues during CANN installation, please consult the [Ascend FAQ](https://www.hiascend.com/document/detail/zh/AscendFAQ/ProduTech/CANNFAQ/cannfaq_000.html) for troubleshooting. - The default installation path for CANN is `/usr/local/Ascend`. After completing CANN installation, configure the environment variables with the following commands: +The default installation path for CANN is `/usr/local/Ascend`. After completing CANN installation, configure the environment variables with the following commands: - ```bash - LOCAL_ASCEND=/usr/local/Ascend # the root directory of run package - source ${LOCAL_ASCEND}/ascend-toolkit/set_env.sh - export ASCEND_CUSTOM_PATH=${LOCAL_ASCEND}/ascend-toolkit - ``` +```bash +LOCAL_ASCEND=/usr/local/Ascend # the root directory of run package +source ${LOCAL_ASCEND}/ascend-toolkit/set_env.sh +export ASCEND_CUSTOM_PATH=${LOCAL_ASCEND}/ascend-toolkit +``` -- **vLLM Prerequisites Installation** +#### vLLM Prerequisites Installation - For vLLM environment configuration and installation methods, please refer to the [vLLM Installation Guide](https://docs.vllm.ai/en/v0.9.1/getting_started/installation/cpu.html). In vllM installation, `gcc/g++ >= 12.3.0` is required, and it could be installed by the following command: +For vLLM environment configuration and installation methods, please refer to the [vLLM Installation Guide](https://docs.vllm.ai/en/v0.9.1/getting_started/installation/cpu.html). In vllM installation, `gcc/g++ >= 12.3.0` is required, and it could be installed by the following command: - ```bash - yum install -y gcc gcc-c++ - ``` +```bash +yum install -y gcc gcc-c++ +``` -- **vLLM MindSpore Installation** +#### vLLM MindSpore Installation - To install vLLM MindSpore, user needs to pull the vLLM MindSpore source code and then runs the following command to install the dependencies: +vLLM MindSpore can be installed in the following two ways. **vLLM MindSpore One-click Installation** is suitable for scenarios where users need quick deployment and usage. **vLLM MindSpore Manual Installation** is suitable for scenarios where users require custom modifications to the components. - ```bash - git clone https://gitee.com/mindspore/vllm-mindspore.git - cd vllm-mindspore - bash install_depend_pkgs.sh - ``` +- **vLLM MindSpore One-click Installation** - Compile and install vLLM MindSpore: + To install vLLM MindSpore, user needs to pull the vLLM MindSpore source code and then runs the following command to install the dependencies: - ```bash - pip install . - ``` + ```bash + git clone https://gitee.com/mindspore/vllm-mindspore.git + cd vllm-mindspore + bash install_depend_pkgs.sh + ``` - After executing the above commands, `mindformers` folder will be generated in the `vllm-mindspore/install_depend_pkgs` directory. Add this folder to the environment variables: + Compile and install vLLM MindSpore: - ```bash - export PYTHONPATH=$MF_PATH:$PYTHONPATH - ``` + ```bash + pip install . + ``` -- **Manual Component Installation** + After executing the above commands, `mindformers` folder will be generated in the `vllm-mindspore/install_depend_pkgs` directory. Add this folder to the environment variables: - If user need to modify the components or use other versions, components need to be manually installed in a specific order. 
vLLM MindSpore requires the following installation sequence: + ```bash + export PYTHONPATH=$MF_PATH:$PYTHONPATH + ``` + +- **vLLM MindSpore Manual Installation** + + If user need to modify the components or use other versions, components need to be manually installed in a specific order. Version compatibility of vLLM MindSpore can be found [Version Compatibility](#version-compatibility), abd vLLM MindSpore requires the following installation sequence: 1. Install vLLM @@ -174,7 +177,7 @@ docker exec -it $DOCKER_NAME bash ```bash git clone https://gitee.com/mindspore/mindformers.git - export PYTHONPATH=`realpath mindformers`:$PYTHONPATH + export PYTHONPATH=$MF_PATH:$PYTHONPATH ``` 5. Install Golden Stick @@ -189,13 +192,17 @@ docker exec -it $DOCKER_NAME bash pip install /path/to/msadapter-*.whl ``` - 7. Install vLLM MindSpore + 7. Install vLLM MindSpore + + User needs to pull source of vLLM MindSpore, and run installation. ```bash + git clone https://gitee.com/mindspore/vllm-mindspore.git + cd vllm-mindspore pip install . ``` -### Quick Verification +## Quick Verification User can verify the installation with a simple offline inference test. First, user need to configure the environment variables with the following command: diff --git a/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md b/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md index 8386c3d11c..9c369446a2 100644 --- a/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md +++ b/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md @@ -52,8 +52,8 @@ Execute the following Python script to download the MindSpore-compatible DeepSee ```python from openmind_hub import snapshot_download -snapshot_download(repo_id="MindSpore-Lab/DeepSeek-R1-W8A8", - local_dir="/path/to/save/deepseek_r1_w8a8", +snapshot_download(repo_id="MindSpore-Lab/DeepSeek-R1-0528-A8W8", + local_dir="/path/to/save/deepseek_r1_0528_a8w8", local_dir_use_symlinks=False) ``` @@ -78,7 +78,7 @@ If the tool is unavailable, install [git-lfs](https://git-lfs.com) first. Refer Once confirmed, download the weights by executing the following command: ```shell -git clone https://modelers.cn/MindSpore-Lab/DeepSeek-R1-W8A8.git +git clone https://modelers.cn/models/MindSpore-Lab/DeepSeek-R1-0528-A8W8.git ``` ## TP16 Tensor Parallel Inference @@ -241,17 +241,17 @@ Execution example: ```bash # Master node: -vllm-mindspore serve --model="MindSpore-Lab/DeepSeek-R1-W8A8" --trust-remote-code --max-num-seqs=256 --max_model_len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 16 --distributed-executor-backend=ray +vllm-mindspore serve --model="MindSpore-Lab/DeepSeek-R1-0528-A8W8" --trust-remote-code --max-num-seqs=256 --max_model_len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 16 --distributed-executor-backend=ray ``` -In tensor parallel scenarios, the `--tensor-parallel-size` parameter overrides the `model_parallel` configuration in the model YAML file. User can also set the local model path by `--model` argument. +In tensor parallel scenarios, the `--tensor-parallel-size` parameter overrides the `model_parallel` configuration in the model YAML file. User can also set the local model path by `--model` argument. 
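For example, if the weights were downloaded with `snapshot_download` as described above, `--model` can point at that local directory instead of the hub ID. The path below is illustrative and should match the actual download location:

```bash
# Launch the TP16 service from a local weight directory (illustrative path).
vllm-mindspore serve --model="/path/to/save/deepseek_r1_0528_a8w8" --trust-remote-code --max-num-seqs=256 --max_model_len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 16 --distributed-executor-backend=ray
```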
#### Sending Requests Use the following command to send requests, where `prompt` is the model input: ```bash -curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "MindSpore-Lab/DeepSeek-R1-W8A8", "prompt": "I am", "max_tokens": 20, "temperature": 0, "top_p": 1.0, "top_k": 1, "repetition_penalty": 1.0}' +curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "MindSpore-Lab/DeepSeek-R1-0528-A8W8", "prompt": "I am", "max_tokens": 20, "temperature": 0, "top_p": 1.0, "top_k": 1, "repetition_penalty": 1.0}' ``` User needs to ensure that the `"model"` field matches the `--model` in the service startup, and the request can successfully match the model. @@ -304,6 +304,7 @@ parallel_config: ### Online Inference #### Starting the Service + `vllm-mindspore` can deploy online inference using the OpenAI API protocol. Below is the workflow for launching the service: ```bash @@ -330,10 +331,10 @@ User can also set the local model path by `--model` argument. The following is a ```bash # Master node: -vllm-mindspore serve --model="MindSpore-Lab/DeepSeek-R1-W8A8" --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 0 --data-parallel-address 192.10.10.10 --data-parallel-rpc-port 12370 --enable-expert-parallel +vllm-mindspore serve --model="MindSpore-Lab/DeepSeek-R1-0528-A8W8" --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 0 --data-parallel-address 192.10.10.10 --data-parallel-rpc-port 12370 --enable-expert-parallel # Worker node: -vllm-mindspore serve --headless --model="MindSpore-Lab/DeepSeek-R1-W8A8" --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 2 --data-parallel-address 192.10.10.10 --data-parallel-rpc-port 12370 --enable-expert-parallel +vllm-mindspore serve --headless --model="MindSpore-Lab/DeepSeek-R1-0528-A8W8" --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 2 --data-parallel-address 192.10.10.10 --data-parallel-rpc-port 12370 --enable-expert-parallel ``` #### Sending Requests @@ -341,7 +342,7 @@ vllm-mindspore serve --headless --model="MindSpore-Lab/DeepSeek-R1-W8A8" --trust Use the following command to send requests, where `prompt` is the model input: ```bash -curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "MindSpore-Lab/DeepSeek-R1-W8A8", "prompt": "I am", "max_tokens": 20, "temperature": 0}' +curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "MindSpore-Lab/DeepSeek-R1-0528-A8W8", "prompt": "I am", "max_tokens": 20, "temperature": 0}' ``` User needs to ensure that the `"model"` field matches the `--model` in the service startup, and the request can successfully match the model. 
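If the registered model name is uncertain, the OpenAI-compatible `/v1/models` endpoint exposed by the service can be queried first; the `id` field in the response is the value to use in the `"model"` field (this assumes the default vLLM OpenAI API routes):

```bash
# List the model names the server exposes; use the returned "id" in the "model" field.
curl http://localhost:8000/v1/models
```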
diff --git a/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md b/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md index 0451257e2b..40ad4a1597 100644 --- a/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md +++ b/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md @@ -134,7 +134,7 @@ export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Tra Here is an explanation of these environment variables: - `vLLM_MODEL_BACKEND`: The model backend. Currently supported models and backends are listed in the [Model Support List](../../../user_guide/supported_models/models_list/models_list.md). -- `MINDFORMERS_MODEL_CONFIG`: Model configuration file. User can find the corresponding YAML file in the [MindSpore Transformers repository](https://gitee.com/mindspore/mindformers/tree/r1.5.0/research/qwen2_5). For Qwen2.5-32B, the YAML file is [predict_qwen2_5_32b_instruct.yaml](https://gitee.com/mindspore/mindformers/blob/r1.5.0/research/qwen2_5/predict_qwen2_5_32b_instruct.yaml). +- `MINDFORMERS_MODEL_CONFIG`: Model configuration file. User can find the corresponding YAML file in the [MindSpore Transformers repository](https://gitee.com/mindspore/mindformers/tree/master/research/qwen2_5). For Qwen2.5-32B, the YAML file is [predict_qwen2_5_32b_instruct.yaml](https://gitee.com/mindspore/mindformers/blob/master/research/qwen2_5/predict_qwen2_5_32b_instruct.yaml). Users can check memory usage with `npu-smi info` and set the NPU cards for inference using the following example (assuming cards 4,5,6,7 are used): @@ -188,7 +188,7 @@ If processed successfully, the inference result will be: { "id":"cmpl-11fe2898c77d4ff18c879f57ae7aa9ca","object":"text_completion", "create":1748568696, - "model":"Qwen/Qwen2.5-32B-Instruct", + "model":"Qwen2.5-32B-Instruct", "choices":[ { "index":0, diff --git a/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md b/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md index e1216d12c9..79ed73f812 100644 --- a/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md +++ b/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md @@ -222,7 +222,7 @@ If the request is processed successfully, the following inference result will be { "id":"cmpl-5e6e314861c24ba79fea151d86c1b9a6","object":"text_completion", "create":1747398389, - "model":"Qwen/Qwen2.5-7B-Instruct", + "model":"Qwen2.5-7B-Instruct", "choices":[ { "index":0, diff --git a/docs/vllm_mindspore/docs/source_en/user_guide/environment_variables/environment_variables.md b/docs/vllm_mindspore/docs/source_en/user_guide/environment_variables/environment_variables.md index 35129a6215..036834798e 100644 --- a/docs/vllm_mindspore/docs/source_en/user_guide/environment_variables/environment_variables.md +++ b/docs/vllm_mindspore/docs/source_en/user_guide/environment_variables/environment_variables.md @@ -15,33 +15,9 @@ | `vLLM_USE_NPU_ADV_STEP_FLASH_OP` | Whether to use Ascend operation `adv_step_flash` | String | `on`: Use;`off`:Not use | If the variable is set to `off`, model will use the implement of small operations. 
| | `VLLM_TORCH_PROFILER_DIR` | Enables profiling data collection and takes effect when a data save path is configured. | String | The path to save profiling data. | | -The following environment variables are automatically registered by vLLM MindSpore: +More environment variable information can be referred in the following links: -| **Environment Variable** | **Function** | **Type** | **Values** | **Description** | -|------------------------|-------------|----------|------------|----------------| -| `USE_TORCH` | Transformer runtime depends on this variable. | String | Default: `"False"` | vLLM MindSpore does not use Torch as the backend. | -| `USE_TF` | Transformer runtime depends on this variable. | String | Default: `"False"` | vLLM MindSpore does not use TensorFlow as the backend. | -| `RUN_MODE` | Execution mode. | String | Default: `"predict"` | **This variable will be removed in future versions.** Required by MindFormers. | -| `CUSTOM_MATMUL_SHUFFLE` | Enables or disables custom matrix shuffling algorithm . | String | `on`: Enable shuffling. `off`: Disable shuffling. Default: `on`. | | -| `HCCL_DETERMINISTIC` | Enables or disables deterministic computation for reduction-type communication operators (e.g., AllReduce, ReduceScatter, Reduce). | String | `true`: Enable deterministic mode. `false`: Disable deterministic mode. Default: `false`. | | -| `ASCEND_LAUNCH_BLOCKING` | Controls whether operators run in synchronous mode during training or online inference. | Integer | `1`: Force synchronous execution. `0`: Do not force synchronous execution. Default: `0`. | | -| `TE_PARALLEL_COMPILER` | Maximum number of parallel compilation processes for operators. Parallel compilation is enabled if greater than 1. | Integer | Positive integer; Max = CPU cores * 80% / # of Ascend AI processors. Range: 1~32. Default: `0`. | | -| `LCCL_DETERMINISTIC` | Controls whether LCCL deterministic AllReduce (ordered addition) is enabled. | Integer | `1`: Enable deterministic mode. `0`: Disable deterministic mode. Default: `0`. | | -| `MS_ENABLE_GRACEFUL_EXIT` | Enables graceful process termination. | Integer | `1`: Enable graceful exit. `Other values`: Disable graceful exit. Default: `0`. | | -| `CPU_AFFINITY` | Optimizes CPU core binding for MindSpore inference. | String | `True`: Enable core binding. `False`: Disable core binding. Default: `True`. | **This variable will be removed in future versions.** Replaced by `set_cpu_affinity` API. | -| `MS_ENABLE_INTERNAL_BOOST` | Enables or disables MindSpore framework's internal acceleration. | String | `on`: Enable acceleration. `off`: Disable acceleration. Default: `on`. | | -| `MS_ENABLE_LCCL` | Controls whether the LCCL communication library is used. | Integer | `1`: Enable. `0`: Disable. Default: `0`. | | -| `HCCL_EXEC_TIMEOUT` | Controls the synchronization timeout for inter-device execution. | Integer | Range: (0, 17340] (seconds). Default: `7200`. | | -| `DEVICE_NUM_PER_NODE` | Number of devices per node. | Integer | Default: `16`. | | -| `HCCL_OP_EXPANSION_MODE` | Configures the expansion location for communication algorithms. | String | `AI_CPU`: Expands on AI CPU compute units. `AIV`: Expands on AI Vector Core compute units. Default: `AIV`. | | -| `MS_JIT_MODULES` | Specifies modules to be JIT-compiled in static graph mode. | String | Module names (top-level imports). Multiple modules should be comma-separated. Default: `"vllm_mindspore,research"`. | | -| `GLOG_v` | Controls log level. | Integer | `0`: DEBUG. `1`: INFO. `2`: WARNING. 
`3`: ERROR (logs errors, may not terminate). `4`: CRITICAL (logs critical errors, terminates execution). Default: `3`. | | -| `RAY_CGRAPH_get_timeout` | Timeout for `ray.get()` method (seconds). | Integer | Default: `360`. | | -| `MS_NODE_TIMEOUT` | Node heartbeat timeout (seconds). | Integer | Default: `180`. | | - -More environment variable information can be referred in the following link: - - - [CANN Environment Variable List](https://www.hiascend.com/document/detail/en/CANNCommunityEdition/81RC1beta1/index/index.html) - - [MindSpore Environment Variable List](https://www.mindspore.cn/docs/en/r2.7.0rc1/api_python/env_var_list.html) - - [MindSpore Transformers Environment Variable List](https://www.mindspore.cn/mindformers/docs/en/r1.6.0/index.html) - - [vLLM Environment Variable List](https://docs.vllm.ai/en/v0.8.4/serving/env_vars.html) +- [CANN Environment Variable List](https://www.hiascend.com/document/detail/en/CANNCommunityEdition/81RC1beta1/index/index.html) +- [MindSpore Environment Variable List](https://www.mindspore.cn/docs/en/master/api_python/env_var_list.html) +- [MindSpore Transformers Environment Variable List](https://www.mindspore.cn/mindformers/docs/en/master/index.html) +- [vLLM Environment Variable List](https://docs.vllm.ai/en/v0.8.4/serving/env_vars.html) diff --git a/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/quantization/quantization.md b/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/quantization/quantization.md index 33c39b583d..fa1b8f89c3 100644 --- a/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/quantization/quantization.md +++ b/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/quantization/quantization.md @@ -16,7 +16,7 @@ We employ [MindSpore Golden Stick's PTQ algorithm](https://gitee.com/mindspore/g ### Downloading Quantized Weights -We have uploaded the quantized DeepSeek-R1 to [ModelArts Community](https://modelers.cn): [MindSpore-Lab/DeepSeek-R1-W8A8](https://modelers.cn/models/MindSpore-Lab/DeepSeek-R1-W8A8). Refer to the [ModelArts Community documentation](https://modelers.cn/docs/en/openmind-hub-client/0.9/basic_tutorial/download.html) to download the weights locally. +We have uploaded the quantized DeepSeek-R1 to [ModelArts Community](https://modelers.cn): [MindSpore-Lab/DeepSeek-R1-0528-A8W8](https://modelers.cn/models/MindSpore-Lab/DeepSeek-R1-0528-A8W8). Refer to the [ModelArts Community documentation](https://modelers.cn/docs/en/openmind-hub-client/0.9/basic_tutorial/download.html) to download the weights locally. 
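For example, with `git-lfs` installed, the quantized weights can be pulled the same way as in the DeepSeek tutorial:

```bash
# Clone the quantized weights from the Modelers community (requires git-lfs).
git clone https://modelers.cn/models/MindSpore-Lab/DeepSeek-R1-0528-A8W8.git
```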
## Quantized Model Inference diff --git a/docs/vllm_mindspore/docs/source_en/user_guide/supported_models/models_list/models_list.md b/docs/vllm_mindspore/docs/source_en/user_guide/supported_models/models_list/models_list.md index ba825bcae5..3d9b49bd6e 100644 --- a/docs/vllm_mindspore/docs/source_en/user_guide/supported_models/models_list/models_list.md +++ b/docs/vllm_mindspore/docs/source_en/user_guide/supported_models/models_list/models_list.md @@ -6,7 +6,7 @@ |-------| --------- | ---- | | DeepSeek-V3 | Supported | [DeepSeek-V3](https://modelers.cn/models/MindSpore-Lab/DeepSeek-V3) | | DeepSeek-R1 | Supported | [DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-V3) | -| DeepSeek-R1 W8A8 | Supported | [Deepseek-R1-W8A8](https://modelers.cn/models/MindSpore-Lab/DeepSeek-r1-w8a8) | +| DeepSeek-R1 W8A8 | Supported | [Deepseek-R1-W8A8](https://modelers.cn/models/MindSpore-Lab/DeepSeek-R1-0528-A8W8) | | Qwen2.5 | Supported | [Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct), [Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct), [Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct), [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct), [Qwen2.5-14B-Instruct](https://huggingface.co/Qwen/Qwen2.5-14B-Instruct), [Qwen2.5-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct), [Qwen2.5-72B-Instruct](https://huggingface.co/Qwen/Qwen2.5-72B-Instruct) | | Qwen3-32B | Supported | [Qwen3-32B](https://modelers.cn/models/MindSpore-Lab/Qwen3-32B) | | Qwen3-235B-A22B | Supported | [Qwen3-235B-A22B](https://huggingface.co/Qwen/Qwen3-235B-A22B) | diff --git a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md index 8d4eb6a8d6..1a442ef00c 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md @@ -13,19 +13,19 @@ - Python:3.9 / 3.10 / 3.11 - 软件版本配套 - | 软件 | 版本 | + | 软件 | 配套版本与下载链接 | | ----- | ----- | - |[CANN](https://www.hiascend.com/developer/download/community/result?module=cann) | 8.1 | - |[MindSpore](https://www.mindspore.cn/install/) | 2.7.0 | - |[MSAdapter](https://git.openi.org.cn/OpenI/MSAdapter)| 0.0.1 | - |[MindSpore Transformers](https://gitee.com/mindspore/mindformers)| 1.6.0 | - |[Golden Stick](https://gitee.com/mindspore/golden-stick)| 1.2.0 | - |[vLLM](https://github.com/vllm-project/vllm) | 0.8.3 | - |[vLLM MindSpore](https://gitee.com/mindspore/vllm-mindspore) | 0.3.0 | + |[CANN](https://www.hiascend.com/developer/download/community/result?module=cann) | [8.1.RC1](https://www.hiascend.com/document/detail/zh/canncommercial/81RC1/softwareinst/instg/instg_0000.html?Mode=PmIns&InstallType=local&OS=Debian&Software=cannToolKit) | + |[MindSpore](https://www.mindspore.cn/install/) | [2.7.0](https://repo.mindspore.cn/mindspore/mindspore/version/202508/20250814/master_20250814091143_7548abc43af03319bfa528fc96d0ccd3917fcc9c_newest/unified/) | + |[MSAdapter](https://git.openi.org.cn/OpenI/MSAdapter)| [0.5.0](https://repo.mindspore.cn/mindspore/msadapter/version/202508/20250814/master_20250814010018_4615051c43eef898b6bbdc69768656493b5932f8_newest/any/) | + |[MindSpore Transformers](https://gitee.com/mindspore/mindformers)| [1.6.0](https://gitee.com/mindspore/mindformers) | + |[Golden Stick](https://gitee.com/mindspore/golden-stick)| 
[1.2.0](https://repo.mindspore.cn/mindspore/golden-stick/version/202508/20250814/master_20250814010017_2713821db982330b3bcd6d84d85a3b337d555f27_newest/any/) | + |[vLLM](https://github.com/vllm-project/vllm) | [0.9.1](https://repo.mindspore.cn/mirrors/vllm/version/202505/20250514/v0.8.4.dev0_newest/any/) | + |[vLLM MindSpore](https://gitee.com/mindspore/vllm-mindspore) | [0.3.0](https://gitee.com/mindspore/vllm-mindspore/) | ## 配置环境 -在本章节中,我们将介绍[docker安装](#docker安装)、[pip安装](#pip安装)、[源码安装](#源码安装)三种安装方式,以及[快速验证](#快速验证)用例,用于验证安装是否成功。 +在本章节中,我们将介绍[docker安装](#docker安装)、[源码安装](#源码安装)两种安装方式,以及[快速验证](#快速验证)用例,用于验证安装是否成功。 ### docker安装 @@ -105,29 +105,33 @@ docker exec -it $DOCKER_NAME bash ### 源码安装 -- **CANN安装** +#### CANN安装 - CANN安装方法与环境配套,请参考[CANN社区版软件安装](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/82RC1alpha002/softwareinst/instg/instg_0001.html?Mode=PmIns&OS=openEuler&Software=cannToolKit),若用户在安装CANN过程中遇到问题,可参考[昇腾常见问题](https://www.hiascend.com/document/detail/zh/AscendFAQ/ProduTech/CANNFAQ/cannfaq_000.html)进行解决。 +CANN安装方法与环境配套,请参考[CANN社区版软件安装](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/82RC1alpha002/softwareinst/instg/instg_0001.html?Mode=PmIns&OS=openEuler&Software=cannToolKit),若用户在安装CANN过程中遇到问题,可参考[昇腾常见问题](https://www.hiascend.com/document/detail/zh/AscendFAQ/ProduTech/CANNFAQ/cannfaq_000.html)进行解决。 - CANN默认安装路径为`/usr/local/Ascend`。用户在安装CANN完毕后,使用如下命令,为CANN配置环境变量: +CANN默认安装路径为`/usr/local/Ascend`。用户在安装CANN完毕后,使用如下命令,为CANN配置环境变量: - ```bash - LOCAL_ASCEND=/usr/local/Ascend # the root directory of run package - source ${LOCAL_ASCEND}/ascend-toolkit/set_env.sh - export ASCEND_CUSTOM_PATH=${LOCAL_ASCEND}/ascend-toolkit - ``` +```bash +LOCAL_ASCEND=/usr/local/Ascend # the root directory of run package +source ${LOCAL_ASCEND}/ascend-toolkit/set_env.sh +export ASCEND_CUSTOM_PATH=${LOCAL_ASCEND}/ascend-toolkit +``` -- **vLLM前置依赖安装** +#### vLLM前置依赖安装 - vLLM的环境配置与安装方法,请参考[vLLM安装教程](https://docs.vllm.ai/en/v0.9.1/getting_started/installation/cpu.html)。其依赖`gcc/g++ >= 12.3.0`版本,可通过以下命令完成安装: +vLLM的环境配置与安装方法,请参考[vLLM安装教程](https://docs.vllm.ai/en/v0.9.1/getting_started/installation/cpu.html)。其依赖`gcc/g++ >= 12.3.0`版本,可通过以下命令完成安装: - ```bash - yum install -y gcc gcc-c++ - ``` +```bash +yum install -y gcc gcc-c++ +``` + +#### vLLM MindSpore安装 -- **vLLM MindSpore安装** +vLLM MindSpore有以下两种安装方式。**vLLM MindSpore一键式安装**适用于用户快速使用与部署的场景。**vLLM MindSpore手动安装**适用于用户对组件有自定义修改的场景。 - 安装vLLM MindSpore,需要在拉取vLLM MindSpore源码后,执行以下命令,安装依赖包: +- **vLLM MindSpore一键式安装** + + 采用一键式安装脚本来安装vLLM MindSpore,需要在拉取vLLM MindSpore源码后,执行以下命令,安装依赖包: ```bash git clone https://gitee.com/mindspore/vllm-mindspore.git @@ -147,53 +151,58 @@ docker exec -it $DOCKER_NAME bash export PYTHONPATH=$MF_PATH:$PYTHONPATH ``` -- **组件手动安装** +- **vLLM MindSpore手动安装** + + 若用户对组件有修改,或者需使用其他版本,则用户需要按照特定顺序,手动安装组件。vLLM MindSpore软件配套下载地址可以参考[版本配套](#版本配套),且对组件的安装顺序要求如下: - 若用户对组件有修改,或者需使用其他版本,则用户需要按照特定顺序,手动安装组件。vLLM MindSpore对组件的安装顺序要求如下: 1. 安装vLLM - ```bash - pip install /path/to/vllm-*.whl - ``` + ```bash + pip install /path/to/vllm-*.whl + ``` 2. 卸载torch相关组件 - ```bash - pip uninstall torch torch-npu torchvision torchaudio -y - ``` + ```bash + pip uninstall torch torch-npu torchvision torchaudio -y + ``` 3. 安装MindSpore - ```bash - pip install /path/to/mindspore-*.whl - ``` + ```bash + pip install /path/to/mindspore-*.whl + ``` 4. 
引入MindSpore Transformers仓,加入到`PYTHONPATH`中 - ```bash - git clone https://gitee.com/mindspore/mindformers.git - export PYTHONPATH=`realpath mindformers`:$PYTHONPATH - ``` + ```bash + git clone https://gitee.com/mindspore/mindformers.git + export PYTHONPATH=$MF_PATH:$PYTHONPATH + ``` 5. 安装Golden Stick - ```bash - pip install /path/to/mindspore_gs-*.whl - ``` + ```bash + pip install /path/to/mindspore_gs-*.whl + ``` 6. 安装MSAdapter - ```bash - pip install /path/to/msadapter-*.whl - ``` + ```bash + pip install /path/to/msadapter-*.whl + ``` 7. 安装vLLM MindSpore - ```bash - pip install . - ``` + 需要先拉取vLLM MindSpore源码,再执行安装 + + ```bash + git clone https://gitee.com/mindspore/vllm-mindspore.git + cd vllm-mindspore + pip install . + ``` -### 快速验证 +## 快速验证 用户可以创建一个简单的离线推理场景,验证安装是否成功。下面以[Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) 为例。首先用户需要执行以下命令,设置环境变量: diff --git a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/quick_start/quick_start.md b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/quick_start/quick_start.md index 82d5534bda..addd3951d0 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/quick_start/quick_start.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/quick_start/quick_start.md @@ -138,7 +138,7 @@ export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Tra 以下是对上述环境变量的解释: - `vLLM_MODEL_BACKEND`:所运行的模型后端。目前vLLM MindSpore所支持的模型与模型后端,可在[模型支持列表](../../user_guide/supported_models/models_list/models_list.md)中进行查询; -- `MINDFORMERS_MODEL_CONFIG`:模型配置文件。用户可以在[MindSpore Transformers工程](https://gitee.com/mindspore/mindformers/tree/r1.5.0/research/qwen2_5)中,找到对应模型的yaml文件。以Qwen2.5-7B为例,则其yaml文件为[predict_qwen2_5_7b_instruct.yaml](https://gitee.com/mindspore/mindformers/blob/r1.5.0/research/qwen2_5/predict_qwen2_5_7b_instruct.yaml) 。 +- `MINDFORMERS_MODEL_CONFIG`:模型配置文件。用户可以在[MindSpore Transformers工程](https://gitee.com/mindspore/mindformers/tree/master/research/qwen2_5)中,找到对应模型的yaml文件。以Qwen2.5-7B为例,则其yaml文件为[predict_qwen2_5_7b_instruct.yaml](https://gitee.com/mindspore/mindformers/blob/master/research/qwen2_5/predict_qwen2_5_7b_instruct.yaml)。 另外,用户需要确保MindSpore Transformers已安装。用户可通过 @@ -220,7 +220,7 @@ Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg gereration throughput: 0.0 curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "Qwen/Qwen2.5-7B-Instruct", "prompt": "I am", "max_tokens": 20, "temperature": 0}' ``` -其中,用户需确认`"model"`字段与启动服务中`--model`一致,请求才能成功匹配到模型。可以通过请求处理成功,将获得以下的推理结果: +其中,用户需确认`"model"`字段与启动服务中`--model`一致,请求才能成功匹配到模型。若请求处理成功,将获得以下推理结果: ```text { diff --git a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md index 2ad616469f..813a0dd588 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md @@ -94,8 +94,8 @@ docker exec -it $DOCKER_NAME bash ```python from openmind_hub import snapshot_download -snapshot_download(repo_id="MindSpore-Lab/DeepSeek-R1-W8A8", - local_dir="/path/to/save/deepseek_r1_w8a8", +snapshot_download(repo_id="MindSpore-Lab/DeepSeek-R1-0528-A8W8", + local_dir="/path/to/save/deepseek_r1_0528_a8w8", local_dir_use_symlinks=False) ``` @@ -120,7 +120,7 @@ Git LFS 
initialized. 工具确认可用后,执行以下命令,下载权重: ```shell -git clone https://modelers.cn/MindSpore-Lab/DeepSeek-R1-W8A8.git +git clone https://modelers.cn/models/MindSpore-Lab/DeepSeek-R1-0528-A8W8.git ``` ## TP16 张量并行推理 @@ -284,7 +284,7 @@ vllm-mindspore serve ```bash # 主节点: -vllm-mindspore serve --model="MindSpore-Lab/DeepSeek-R1-W8A8" --trust-remote-code --max-num-seqs=256 --max_model_len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 16 --distributed-executor-backend=ray +vllm-mindspore serve --model="MindSpore-Lab/DeepSeek-R1-0528-A8W8" --trust-remote-code --max-num-seqs=256 --max_model_len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 16 --distributed-executor-backend=ray ``` 张量并行场景下,`--tensor-parallel-size`参数会覆盖模型yaml文件中`parallel_config`的`model_parallel`配置。用户可以通过`--model`参数,指定模型保存的本地路径。 @@ -294,7 +294,7 @@ vllm-mindspore serve --model="MindSpore-Lab/DeepSeek-R1-W8A8" --trust-remote-cod 使用如下命令发送请求。其中`prompt`字段为模型输入: ```bash -curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "MindSpore-Lab/DeepSeek-R1-W8A8", "prompt": "I am", "max_tokens": 20, "temperature": 0, "top_p": 1.0, "top_k": 1, "repetition_penalty": 1.0}' +curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "MindSpore-Lab/DeepSeek-R1-0528-A8W8", "prompt": "I am", "max_tokens": 20, "temperature": 0, "top_p": 1.0, "top_k": 1, "repetition_penalty": 1.0}' ``` 用户需确认`"model"`字段与启动服务中`--model`一致,请求才能成功匹配到模型。 @@ -347,6 +347,7 @@ parallel_config: ### 在线推理 #### 启动服务 + `vllm-mindspore`可使用OpenAI的API协议部署在线推理。以下是在线推理的拉起流程: ```bash @@ -373,10 +374,10 @@ vllm-mindspore serve ```bash # 主节点: -vllm-mindspore serve --model="MindSpore-Lab/DeepSeek-R1-W8A8" --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 0 --data-parallel-address 192.10.10.10 --data-parallel-rpc-port 12370 --enable-expert-parallel +vllm-mindspore serve --model="MindSpore-Lab/DeepSeek-R1-0528-A8W8" --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 0 --data-parallel-address 192.10.10.10 --data-parallel-rpc-port 12370 --enable-expert-parallel # 从节点: -vllm-mindspore serve --headless --model="MindSpore-Lab/DeepSeek-R1-W8A8" --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 2 --data-parallel-address 192.10.10.10 --data-parallel-rpc-port 12370 --enable-expert-parallel +vllm-mindspore serve --headless --model="MindSpore-Lab/DeepSeek-R1-0528-A8W8" --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 2 --data-parallel-address 192.10.10.10 --data-parallel-rpc-port 12370 --enable-expert-parallel ``` #### 发送请求 @@ -384,7 +385,7 @@ vllm-mindspore serve --headless --model="MindSpore-Lab/DeepSeek-R1-W8A8" --trust 使用如下命令发送请求。其中`prompt`字段为模型输入: ```bash -curl 
http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "MindSpore-Lab/DeepSeek-R1-W8A8", "prompt": "I am, "max_tokens": 120, "temperature": 0}' +curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "MindSpore-Lab/DeepSeek-R1-0528-A8W8", "prompt": "I am, "max_tokens": 120, "temperature": 0}' ``` 用户需确认`"model"`字段与启动服务中`--model`一致,请求才能成功匹配到模型。 diff --git a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md index 72e4bad00a..f0901e2e0f 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md @@ -135,7 +135,7 @@ export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Tra 以下是对上述环境变量的解释: - `vLLM_MODEL_BACKEND`:所运行的模型后端。目前vLLM MindSpore所支持的模型与模型后端,可在[模型支持列表](../../../user_guide/supported_models/models_list/models_list.md)中进行查询。 -- `MINDFORMERS_MODEL_CONFIG`:模型配置文件。用户可以在[MindSpore Transformers工程](https://gitee.com/mindspore/mindformers/tree/r1.5.0/research/qwen2_5)中,找到对应模型的yaml文件。以Qwen2.5-32B为例,则其yaml文件为[predict_qwen2_5_32b_instruct.yaml](https://gitee.com/mindspore/mindformers/blob/r1.5.0/research/qwen2_5/predict_qwen2_5_32b_instruct.yaml) 。 +- `MINDFORMERS_MODEL_CONFIG`:模型配置文件。用户可以在[MindSpore Transformers工程](https://gitee.com/mindspore/mindformers/tree/master/research/qwen2_5)中,找到对应模型的yaml文件。以Qwen2.5-32B为例,则其yaml文件为[predict_qwen2_5_32b_instruct.yaml](https://gitee.com/mindspore/mindformers/blob/master/research/qwen2_5/predict_qwen2_5_32b_instruct.yaml) 。 用户可通过`npu-smi info`查看显存占用情况,并可以使用如下环境变量,设置用于推理的计算卡。以下例子为假设用户使用4,5,6,7卡进行推理: @@ -181,7 +181,7 @@ Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg gereration throughput: 0.0 curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "Qwen/Qwen2.5-32B-Instruct", "prompt": "I am", "max_tokens": 20, "temperature": 0}' ``` -其中,用户需确认`"model"`字段与启动服务中`--model`一致,请求才能成功匹配到模型。可以通过请求处理成功,将获得以下的推理结果: +其中,用户需确认`"model"`字段与启动服务中`--model`一致,请求才能成功匹配到模型。若请求处理成功,将获得以下推理结果: ```text { diff --git a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md index 90a49c065e..c7d8426f6d 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md @@ -135,7 +135,7 @@ export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Tra 以下是对上述环境变量的解释: - `vLLM_MODEL_BACKEND`:所运行的模型后端。目前vLLM MindSpore所支持的模型与模型后端,可在[模型支持列表](../../../user_guide/supported_models/models_list/models_list.md)中进行查询; -- `MINDFORMERS_MODEL_CONFIG`:模型配置文件。用户可以在[MindSpore Transformers工程](https://gitee.com/mindspore/mindformers/tree/r1.5.0/research/qwen2_5)中,找到对应模型的yaml文件。以Qwen2.5-7B为例,则其yaml文件为[predict_qwen2_5_7b_instruct.yaml](https://gitee.com/mindspore/mindformers/blob/r1.5.0/research/qwen2_5/predict_qwen2_5_7b_instruct.yaml) 。 +- `MINDFORMERS_MODEL_CONFIG`:模型配置文件。用户可以在[MindSpore 
Transformers工程](https://gitee.com/mindspore/mindformers/tree/master/research/qwen2_5)中,找到对应模型的yaml文件。以Qwen2.5-7B为例,则其yaml文件为[predict_qwen2_5_7b_instruct.yaml](https://gitee.com/mindspore/mindformers/blob/master/research/qwen2_5/predict_qwen2_5_7b_instruct.yaml) 。 用户可通过`npu-smi info`查看显存占用情况,并可以使用如下环境变量,设置用于推理的计算卡: @@ -194,7 +194,7 @@ vLLM MindSpore可使用OpenAI的API协议,部署为在线推理。以下是以 python3 -m vllm_mindspore.entrypoints vllm.entrypoints.openai.api_server --model "Qwen/Qwen2.5-7B-Instruct" ``` -若服务成功拉起,则可以获得类似的执行结果: +用户可以通过`--model`参数,指定模型保存的本地路径。若服务成功拉起,则可以获得类似的执行结果: ```text INFO: Started server process [6363] @@ -216,7 +216,7 @@ Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg gereration throughput: 0.0 curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "Qwen/Qwen2.5-7B-Instruct", "prompt": "I am", "max_tokens": 20, "temperature": 0}' ``` -若请求处理成功,将获得以下的推理结果: +其中,用户需确认`"model"`字段与启动服务中`--model`一致,请求才能成功匹配到模型。若请求处理成功,将获得以下推理结果: ```text { diff --git a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/environment_variables/environment_variables.md b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/environment_variables/environment_variables.md index 835946b67d..b5e2aefd2d 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/environment_variables/environment_variables.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/environment_variables/environment_variables.md @@ -4,43 +4,20 @@ | 环境变量 | 功能 | 类型 | 取值 | 说明 | | ------ | ------- | ------ | ------ | ------ | -| vLLM_MODEL_BACKEND | 用于指定模型后端。使用vLLM MindSpore原生模型后端时无需指定;使用模型为vLLM MindSpore外部后端时则需要指定。 | String | `MindFormers`: 模型后端为MindSpore Transformers。 | 原生模型后端当前支持Qwen2.5系列;MindSpore Transformers模型后端支持Qwen系列、DeepSeek、Llama系列模型,使用时需配置环境变量:`export PYTHONPATH=/path/to/mindformers/:$PYTHONPATH`。 | -| MINDFORMERS_MODEL_CONFIG | MindSpore Transformers模型的配置文件。使用Qwen2.5系列、DeepSeek系列模型时,需要配置文件路径。 | String | 模型配置文件路径。 | **该环境变量在后续版本会被移除。** 样例:`export MINDFORMERS_MODEL_CONFIG=/path/to/research/deepseek3/deepseek_r1_671b/predict_deepseek_r1_671b_w8a8.yaml`。 | -| GLOO_SOCKET_IFNAME | 用于多机之间使用gloo通信时的网口名称。 | String | 网口名称,例如enp189s0f0。 | 多机场景使用,可通过`ifconfig`查找ip对应网卡的网卡名。 | -| TP_SOCKET_IFNAME | 用于多机之间使用TP通信时的网口名称。 | String | 网口名称,例如enp189s0f0。 | 多机场景使用,可通过`ifconfig`查找ip对应网卡的网卡名。 | -| HCCL_SOCKET_IFNAME | 用于多机之间使用HCCL通信时的网口名称。 | String | 网口名称,例如enp189s0f0。 | 多机场景使用,可通过`ifconfig`查找ip对应网卡的网卡名。 | -| ASCEND_RT_VISIBLE_DEVICES | 指定哪些Device对当前进程可见,支持一次指定一个或多个Device ID。 | String | 为Device ID,逗号分割的字符串,例如"0,1,2,3,4,5,6,7"。 | ray使用场景建议使用。 | -| HCCL_BUFFSIZE | 此环境变量用于控制两个NPU之间共享数据的缓存区大小。 | Integer | 缓存区大小,大小为MB。例如:`2048`。 | 使用方法参考:[HCCL_BUFFSIZE](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/81RC1beta1/maintenref/envvar/envref_07_0080.html)。例如DeepSeek 混合并行(数据并行数为32,专家并行数为32),且`max-num-batched-tokens`为256时,则`export HCCL_BUFFSIZE=2048`。 | -| MS_MEMPOOL_BLOCK_SIZE | 设置PyNative模式下设备内存池的块大小。 | String | 正整数string,单位为GB。 | | -| vLLM_USE_NPU_ADV_STEP_FLASH_OP | 是否使用昇腾`adv_step_flash`算子。 | String | `on`: 使用;`off`:不使用 | 取值为`off`时,将使用小算子实现替代`adv_step_flash`算子。 | -| VLLM_TORCH_PROFILER_DIR | 开启profiling采集数据,当配置了采集数据保存路径后生效 | String | Profiling数据保存路径。| | - -以下环境变量由vLLM MindSpore自动注册: - -| 环境变量 | 功能 | 类型 | 取值 | 说明 | -| ------ | ------- | ------ | ------ | ------ | -| USE_TORCH | Transformer运行时依赖该环境变量 | String | 默认值为"False" | vLLM MindSpore 不使用torch 作为后端 | -| USE_TF | Transformer运行时依赖该环境变量 | String | 默认值为"False" | vLLM MindSpore 不使用TensorFlow 作为后端 | -| RUN_MODE | 执行模式为推理 | String | 默认值为"predict" | 
**该环境变量在后续版本会被移除。** 为MindFormers依赖的环境变量 | -| CUSTOM_MATMUL_SHUFFLE | 开启或关闭自定义矩阵算法的洗牌操作 | String | `on`:开启矩阵洗牌。`off`:关闭矩阵洗牌。默认值为`on`。 | | -| HCCL_DETERMINISTIC | 开启或关闭归约类通信算子的确定性计算,其中归约类通信算子包括 AllReduce、ReduceScatter、Reduce。 | String | `true`:打开 HCCL 确定性开关;`false`:关闭 HCCL 确定性开关。默认值为`false`。 | | -| ASCEND_LAUNCH_BLOCKING | 训练或在线推理场景,可通过此环境变量控制算子执行时是否启动同步模式。 | Integer | `1`:强制算子采用同步模式运行;`0`:不强制算子采用同步模式运行。默认值为`0`。 | | -| TE_PARALLEL_COMPILER | 算子最大并行编译进程数,当大于 1 时开启并行编译。 | Integer | 取值为正整数;最大不超过 cpu 核数*80%/昇腾 AI 处理器个数,取值范围 1~32。默认值是 `0`。 | | -| LCCL_DETERMINISTIC | 设置 LCCL 确定性算子 AllReduce(保序加)是否开启。 | Integer | `1`:打开 LCCL 确定性开关;`0`:关闭 LCCL 确定性开关。默认值是 `0`。 | | -| MS_ENABLE_GRACEFUL_EXIT | 设置使能进程优雅退出 | Integer | `1`:使用进程优雅退出功能。`不设置或者其他值`: 不使用进程优雅退出功能。默认值为`0` | | -| CPU_AFFINIITY | MindSpore推理绑核优化 | String | `True`:开启绑核;`True`:不开启绑核。默认值为`True` | **该环境变量在后续版本会被移除。** 将使用`set_cpu_affinity`接口。 | -| MS_ENABLE_INTERNAL_BOOST | 是否打开 MindSpore 框架的内部加速功能。 | String | `on`:开启 MindSpore 内部加速;`off`:关闭 MindSpore 内部加速。默认值为`on` | | -| MS_ENABLE_LCCL | 是否使用LCCL通信库。 | Integer | `1`:开启,`0`:关闭。默认值为`0`。 | | -| HCCL_EXEC_TIMEOUT | 通过该环境变量可控制设备间执行时同步等待的时间,在该配置时间内各设备进程等待其他设备执行通信同步。 | Integer | 取值范围为:(0, 17340],单位为 s。 默认值为 7200。 | | -| DEVICE_NUM_PER_NODE | 节点上的设备数 | Integer | 默认值为16。 | | -| HCCL_OP_EXPANSION_MODE | 用于配置通信算法的编排展开位置。 | String | `AI_CPU`:通信算法的编排展开位置为Device侧的AI CPU计算单元;`AIV`:通信算法的编排展开位置为Device侧的AI Vector Core计算单元。默认值为`AIV`。 | | -| MS_JIT_MODULES | 指定静态图模式下哪些模块需要JIT静态编译,其函数方法会被编译成静态计算图 | String | 模块名,对应import导入的顶层模块的名称。如果有多个,使用英文逗号分隔。默认值为`"vllm_mindspore,research"`。 | | -| GLOG_v | 控制日志的级别 | Integer | `0`:DEBUG;`1`:INFO;`2`:WARNING;`3`:ERROR,表示程序执行出现报错,输出错误日志,程序可能不会终止;`4`:CRITICAL,表示程序执行出现异常,将会终止执行程序。默认值为`3`。 | | -| RAY_CGRAPH_get_timeout | `ray.get()`方法的超时时间。 | Integer | 默认值为`360`。 | | -| MS_NODE_TIMEOUT | 节点心跳超时时间,单位:秒。 | Integer | 默认值为`180`。 | | +| `vLLM_MODEL_BACKEND` | 用于指定模型后端。使用vLLM MindSpore原生模型后端时无需指定;使用模型为vLLM MindSpore外部后端时则需要指定。 | String | `MindFormers`: 模型后端为MindSpore Transformers。 | 原生模型后端当前支持Qwen2.5系列;MindSpore Transformers模型后端支持Qwen系列、DeepSeek、Llama系列模型,使用时需配置环境变量:`export PYTHONPATH=/path/to/mindformers/:$PYTHONPATH`。 | +| `MINDFORMERS_MODEL_CONFIG` | MindSpore Transformers模型的配置文件。使用Qwen2.5系列、DeepSeek系列模型时,需要配置文件路径。 | String | 模型配置文件路径。 | **该环境变量在后续版本会被移除。** 样例:`export MINDFORMERS_MODEL_CONFIG=/path/to/research/deepseek3/deepseek_r1_671b/predict_deepseek_r1_671b_w8a8.yaml`。 | +| `GLOO_SOCKET_IFNAME` | 用于多机之间使用gloo通信时的网口名称。 | String | 网口名称,例如enp189s0f0。 | 多机场景使用,可通过`ifconfig`查找ip对应网卡的网卡名。 | +| `TP_SOCKET_IFNAME` | 用于多机之间使用TP通信时的网口名称。 | String | 网口名称,例如enp189s0f0。 | 多机场景使用,可通过`ifconfig`查找ip对应网卡的网卡名。 | +| `HCCL_SOCKET_IFNAME` | 用于多机之间使用HCCL通信时的网口名称。 | String | 网口名称,例如enp189s0f0。 | 多机场景使用,可通过`ifconfig`查找ip对应网卡的网卡名。 | +| `ASCEND_RT_VISIBLE_DEVICES` | 指定哪些Device对当前进程可见,支持一次指定一个或多个Device ID。 | String | 为Device ID,逗号分割的字符串,例如"0,1,2,3,4,5,6,7"。 | ray使用场景建议使用。 | +| `HCCL_BUFFSIZE` | 此环境变量用于控制两个NPU之间共享数据的缓存区大小。 | Integer | 缓存区大小,大小为MB。例如:`2048`。 | 使用方法参考:[HCCL_BUFFSIZE](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/81RC1beta1/maintenref/envvar/envref_07_0080.html)。例如DeepSeek 混合并行(数据并行数为32,专家并行数为32),且`max-num-batched-tokens`为256时,则`export HCCL_BUFFSIZE=2048`。 | +| `MS_MEMPOOL_BLOCK_SIZE` | 设置PyNative模式下设备内存池的块大小。 | String | 正整数string,单位为GB。 | | +| `vLLM_USE_NPU_ADV_STEP_FLASH_OP` | 是否使用昇腾`adv_step_flash`算子。 | String | `on`: 使用;`off`:不使用 | 取值为`off`时,将使用小算子实现替代`adv_step_flash`算子。 | +| `VLLM_TORCH_PROFILER_DIR` | 开启profiling采集数据,当配置了采集数据保存路径后生效 | String | Profiling数据保存路径。| | 更多的环境变量信息,请查看: 
- - [CANN 环境变量列表](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/81RC1beta1/index/index.html) - - [MindSpore 环境变量列表](https://www.mindspore.cn/docs/zh-CN/r2.7.0rc1/api_python/env_var_list.html) - - [MindSpore Transformers 环境变量列表](https://www.mindspore.cn/mindformers/docs/zh-CN/r1.6.0/index.html) - - [vLLM 环境变量列表](https://docs.vllm.ai/en/v0.8.4/serving/env_vars.html) + +- [CANN 环境变量列表](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/81RC1beta1/index/index.html) +- [MindSpore 环境变量列表](https://www.mindspore.cn/docs/zh-CN/master/api_python/env_var_list.html) +- [MindSpore Transformers 环境变量列表](https://www.mindspore.cn/mindformers/docs/zh-CN/master/index.html) +- [vLLM 环境变量列表](https://docs.vllm.ai/en/v0.8.4/serving/env_vars.html) diff --git a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/quantization/quantization.md b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/quantization/quantization.md index 71667d5f1e..22a83475ef 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/quantization/quantization.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/quantization/quantization.md @@ -16,7 +16,7 @@ ### 直接下载量化权重 -我们已经将量化好的DeepSeek-R1上传到[魔乐社区](https://modelers.cn):[MindSpore-Lab/DeepSeek-R1-W8A8](https://modelers.cn/models/MindSpore-Lab/DeepSeek-R1-W8A8),可以参考[魔乐社区文档](https://modelers.cn/docs/zh/openmind-hub-client/0.9/basic_tutorial/download.html)将权重下载到本地。 +我们已经将量化好的DeepSeek-R1上传到[魔乐社区](https://modelers.cn):[MindSpore-Lab/DeepSeek-R1-0528-A8W8](https://modelers.cn/models/MindSpore-Lab/DeepSeek-R1-0528-A8W8),可以参考[魔乐社区文档](https://modelers.cn/docs/zh/openmind-hub-client/0.9/basic_tutorial/download.html)将权重下载到本地。 ## 量化模型推理 diff --git a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_models/models_list/models_list.md b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_models/models_list/models_list.md index 2e504c0fec..c64725c9e1 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_models/models_list/models_list.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_models/models_list/models_list.md @@ -6,7 +6,7 @@ |-------| --------- | ---- | | DeepSeek-V3 | 已支持 | [DeepSeek-V3](https://modelers.cn/models/MindSpore-Lab/DeepSeek-V3) | | DeepSeek-R1 | 已支持 | [DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-V3) | -| DeepSeek-R1 W8A8 | 已支持 | [Deepseek-R1-W8A8](https://modelers.cn/models/MindSpore-Lab/DeepSeek-r1-w8a8) | +| DeepSeek-R1 W8A8 | 已支持 | [Deepseek-R1-W8A8](https://modelers.cn/models/MindSpore-Lab/DeepSeek-R1-0528-A8W8) | | Qwen2.5 | 已支持 | [Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct)、[Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct)、[Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct)、 [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct)、[Qwen2.5-14B-Instruct](https://huggingface.co/Qwen/Qwen2.5-14B-Instruct)、[Qwen2.5-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct)、[Qwen2.5-72B-Instruct](https://huggingface.co/Qwen/Qwen2.5-72B-Instruct) | | Qwen3-32B | 已支持 | [Qwen3-32B](https://modelers.cn/models/MindSpore-Lab/Qwen3-32B) | | Qwen3-235B-A22B | 已支持 | [Qwen3-235B-A22B](https://huggingface.co/Qwen/Qwen3-235B-A22B) | -- Gitee From d2c1e2d3ab130e6c000609b8fb0b9f5057b7122a Mon Sep 17 00:00:00 2001 From: horcam Date: Fri, 15 Aug 2025 17:40:28 +0800 Subject: [PATCH 08/12] update for 0.9.1 --- 
.../getting_started/installation/installation.md | 6 +++--- docs/vllm_mindspore/docs/source_en/index.rst | 7 +++++-- .../getting_started/installation/installation.md | 8 ++++---- docs/vllm_mindspore/docs/source_zh_cn/index.rst | 7 +++++-- 4 files changed, 17 insertions(+), 11 deletions(-) diff --git a/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md b/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md index dd8fab4ee0..bb3d36b3ca 100644 --- a/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md +++ b/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md @@ -20,7 +20,7 @@ This document describes the steps to install the vLLM MindSpore environment. Thr |[MSAdapter](https://git.openi.org.cn/OpenI/MSAdapter)| [0.5.0](https://repo.mindspore.cn/mindspore/msadapter/version/202508/20250814/master_20250814010018_4615051c43eef898b6bbdc69768656493b5932f8_newest/any/) | |[MindSpore Transformers](https://gitee.com/mindspore/mindformers)| [1.6.0](https://gitee.com/mindspore/mindformers) | |[Golden Stick](https://gitee.com/mindspore/golden-stick)| [1.2.0](https://repo.mindspore.cn/mindspore/golden-stick/version/202508/20250814/master_20250814010017_2713821db982330b3bcd6d84d85a3b337d555f27_newest/any/) | - |[vLLM](https://github.com/vllm-project/vllm) | [0.9.1](https://repo.mindspore.cn/mirrors/vllm/version/202505/20250514/v0.8.4.dev0_newest/any/) | + |[vLLM](https://github.com/vllm-project/vllm) | [0.9.1](https://repo.mindspore.cn/mirrors/vllm/version/202507/20250715/v0.9.1/any/) | |[vLLM MindSpore](https://gitee.com/mindspore/vllm-mindspore) | [0.3.0](https://gitee.com/mindspore/vllm-mindspore/) | ## Environment Setup @@ -127,9 +127,9 @@ yum install -y gcc gcc-c++ #### vLLM MindSpore Installation -vLLM MindSpore can be installed in the following two ways. **vLLM MindSpore One-click Installation** is suitable for scenarios where users need quick deployment and usage. **vLLM MindSpore Manual Installation** is suitable for scenarios where users require custom modifications to the components. +vLLM MindSpore can be installed in the following two ways. **vLLM MindSpore Quick Installation** is suitable for scenarios where users need quick deployment and usage. **vLLM MindSpore Manual Installation** is suitable for scenarios where users require custom modifications to the components. -- **vLLM MindSpore One-click Installation** +- **vLLM MindSpore Quick Installation** To install vLLM MindSpore, user needs to pull the vLLM MindSpore source code and then runs the following command to install the dependencies: diff --git a/docs/vllm_mindspore/docs/source_en/index.rst b/docs/vllm_mindspore/docs/source_en/index.rst index 3163af72f8..351967f14a 100644 --- a/docs/vllm_mindspore/docs/source_en/index.rst +++ b/docs/vllm_mindspore/docs/source_en/index.rst @@ -58,7 +58,7 @@ Branch ----------------------------------------------------- The vllm-mindspore repository contains the main branch, development branch, and version branches: -- **main**: the main branch, compatible with Mindspore master branch and vLLM v0.8.3 version, is continuously monitored for quality through Ascend-MindSpore CI. +- **main**: the main branch, compatible with Mindspore master branch and vLLM v0.9.1 version, is continuously monitored for quality through Ascend-MindSpore CI. - **develop**: the development branch for adapting vLLM features, which is forked from the main branch when a new vLLM version is released. 
Once the adapted features is stable, it will be merged into the main branch. The current development branch is adapting vLLM v0.9.1 version. - **rX.Y.Z**: version branches used for archiving version release, which is forked from the main branch after the adaptation of a certain vLLM version is completed. @@ -72,7 +72,7 @@ The following are the version branches: - Notes * - master - Maintained - - Compatible with vLLM v0.8.3, and CI commitment for MindSpore master branch + - Compatible with vLLM v0.9.1, and CI commitment for MindSpore master branch * - develop - Maintained - Compatible with vLLM v0.9.1 @@ -82,6 +82,9 @@ The following are the version branches: * - r0.2 - Maintained - Compatible with vLLM v0.7.3, and CI commitment for MindSpore 2.6.0 + * - r0.3.0 + - Maintained + - Compatible with vLLM v0.8.3, and CI commitment for MindSpore 2.7.0 SIG ----------------------------------------------------- diff --git a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md index 1a442ef00c..1f251e3fca 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md @@ -20,7 +20,7 @@ |[MSAdapter](https://git.openi.org.cn/OpenI/MSAdapter)| [0.5.0](https://repo.mindspore.cn/mindspore/msadapter/version/202508/20250814/master_20250814010018_4615051c43eef898b6bbdc69768656493b5932f8_newest/any/) | |[MindSpore Transformers](https://gitee.com/mindspore/mindformers)| [1.6.0](https://gitee.com/mindspore/mindformers) | |[Golden Stick](https://gitee.com/mindspore/golden-stick)| [1.2.0](https://repo.mindspore.cn/mindspore/golden-stick/version/202508/20250814/master_20250814010017_2713821db982330b3bcd6d84d85a3b337d555f27_newest/any/) | - |[vLLM](https://github.com/vllm-project/vllm) | [0.9.1](https://repo.mindspore.cn/mirrors/vllm/version/202505/20250514/v0.8.4.dev0_newest/any/) | + |[vLLM](https://github.com/vllm-project/vllm) | [0.9.1](https://repo.mindspore.cn/mirrors/vllm/version/202507/20250715/v0.9.1/any/) | |[vLLM MindSpore](https://gitee.com/mindspore/vllm-mindspore) | [0.3.0](https://gitee.com/mindspore/vllm-mindspore/) | ## 配置环境 @@ -127,11 +127,11 @@ yum install -y gcc gcc-c++ #### vLLM MindSpore安装 -vLLM MindSpore有以下两种安装方式。**vLLM MindSpore一键式安装**适用于用户快速使用与部署的场景。**vLLM MindSpore手动安装**适用于用户对组件有自定义修改的场景。 +vLLM MindSpore有以下两种安装方式。**vLLM MindSpore快速安装**适用于用户快速使用与部署的场景。**vLLM MindSpore手动安装**适用于用户对组件有自定义修改的场景。 -- **vLLM MindSpore一键式安装** +- **vLLM MindSpore快速安装** - 采用一键式安装脚本来安装vLLM MindSpore,需要在拉取vLLM MindSpore源码后,执行以下命令,安装依赖包: + 采用快速安装脚本来安装vLLM MindSpore,需要在拉取vLLM MindSpore源码后,执行以下命令,安装依赖包: ```bash git clone https://gitee.com/mindspore/vllm-mindspore.git diff --git a/docs/vllm_mindspore/docs/source_zh_cn/index.rst b/docs/vllm_mindspore/docs/source_zh_cn/index.rst index f465f8c121..0f16e9f4ce 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/index.rst +++ b/docs/vllm_mindspore/docs/source_zh_cn/index.rst @@ -58,7 +58,7 @@ vLLM MindSpore采用vLLM社区推荐的插件机制,实现能力注册。未 ----------------------------------------------------- vLLM MindSpore代码仓包含主干分支、开发分支、版本分支: -- **main**: 主干分支,与MindSpore master分支和vLLM v0.8.3版本配套,并通过昇腾+昇思CI持续进行质量看护; +- **main**: 主干分支,与MindSpore master分支和vLLM v0.9.1版本配套,并通过昇腾+昇思CI持续进行质量看护; - **develop**: 开发分支,在vLLM部分新版本发布时从主干分支拉出,用于开发适配vLLM的新功能特性。待特性适配稳定后合入主干分支。当前开发分支正在适配vLLM v0.9.1版本; - **rX.Y.Z**: 版本分支,在完成vLLM某版本适配后,从主干分支拉出,用于正式版本发布归档。 @@ -72,7 +72,7 @@ vLLM 
MindSpore代码仓包含主干分支、开发分支、版本分支: - 备注 * - master - Maintained - - 基于vLLM v0.8.3版本和MindSpore master分支CI看护 + - 基于vLLM v0.9.1版本和MindSpore master分支CI看护 * - develop - Maintained - 基于vLLM v0.9.1版本 @@ -82,6 +82,9 @@ vLLM MindSpore代码仓包含主干分支、开发分支、版本分支: * - r0.2 - Maintained - 基于vLLM v0.7.3版本和MindSpore 2.6.0版本CI看护 + * - r0.3.0 + - Maintained + - 基于vLLM v0.7.3版本和MindSpore 2.7.0版本CI看护 SIG组织 ----------------------------------------------------- -- Gitee From 1280c57183895ba103c3e126ee3ba8c6d5a015b0 Mon Sep 17 00:00:00 2001 From: horcam Date: Mon, 18 Aug 2025 10:31:19 +0800 Subject: [PATCH 09/12] update install --- .../installation/installation.md | 22 ++++++++----------- .../installation/installation.md | 22 ++++++++----------- 2 files changed, 18 insertions(+), 26 deletions(-) diff --git a/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md b/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md index bb3d36b3ca..7d7d0f9925 100644 --- a/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md +++ b/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md @@ -2,7 +2,7 @@ [![View Source On Gitee](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source_en.svg)](https://gitee.com/mindspore/docs/blob/master/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md) -This document describes the steps to install the vLLM MindSpore environment. Three installation methods are provided: +This document will introduce the [Version Matching](#version-compatibility) of vLLM MindSpore, the installation steps for vLLM MindSpore, and the [Quick Verification](#quick-verification) to verify whether the installation is successful. The installation steps provide two installation methods: - [Docker Installation](#docker-installation): Suitable for quick deployment scenarios. - [Source Code Installation](#source-code-installation): Suitable for incremental development of vLLM MindSpore. @@ -23,15 +23,11 @@ This document describes the steps to install the vLLM MindSpore environment. Thr |[vLLM](https://github.com/vllm-project/vllm) | [0.9.1](https://repo.mindspore.cn/mirrors/vllm/version/202507/20250715/v0.9.1/any/) | |[vLLM MindSpore](https://gitee.com/mindspore/vllm-mindspore) | [0.3.0](https://gitee.com/mindspore/vllm-mindspore/) | -## Environment Setup - -This section introduces two installation methods: [Docker Installation](#docker-installation), [Source Code Installation](#source-code-installation), and [Quick Verification](#quick-verification) example to check the installation. - -### Docker Installation +## Docker Installation We recommend using Docker for quick deployment of the vLLM MindSpore environment. Below are the steps: -#### Building the Image +### Building the Image User can execute the following commands to clone the vLLM MindSpore code repository and build the image: @@ -53,7 +49,7 @@ Here, `e40bcbeae9fc` is the image ID, and `vllm_ms_20250726:latest` is the image docker images ``` -#### Creating a Container +### Creating a Container After [building the image](#building-the-image), set `DOCKER_NAME` and `IMAGE_NAME` as the container and image names, then execute the following command to create the container: @@ -95,7 +91,7 @@ The container ID will be returned if docker is created successfully. 
User can al docker ps ``` -#### Entering the Container +### Entering the Container After [creating the container](#creating-a-container), user can start and enter the container, using the environment variable `DOCKER_NAME`: @@ -103,9 +99,9 @@ After [creating the container](#creating-a-container), user can start and enter docker exec -it $DOCKER_NAME bash ``` -### Source Code Installation +## Source Code Installation -#### CANN Installation +### CANN Installation For CANN installation methods and environment configuration, please refer to [CANN Community Edition Installation Guide](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/82RC1alpha002/softwareinst/instg/instg_0001.html?Mode=PmIns&OS=openEuler&Software=cannToolKit). If you encounter any issues during CANN installation, please consult the [Ascend FAQ](https://www.hiascend.com/document/detail/zh/AscendFAQ/ProduTech/CANNFAQ/cannfaq_000.html) for troubleshooting. @@ -117,7 +113,7 @@ source ${LOCAL_ASCEND}/ascend-toolkit/set_env.sh export ASCEND_CUSTOM_PATH=${LOCAL_ASCEND}/ascend-toolkit ``` -#### vLLM Prerequisites Installation +### vLLM Prerequisites Installation For vLLM environment configuration and installation methods, please refer to the [vLLM Installation Guide](https://docs.vllm.ai/en/v0.9.1/getting_started/installation/cpu.html). In vllM installation, `gcc/g++ >= 12.3.0` is required, and it could be installed by the following command: @@ -125,7 +121,7 @@ For vLLM environment configuration and installation methods, please refer to the yum install -y gcc gcc-c++ ``` -#### vLLM MindSpore Installation +### vLLM MindSpore Installation vLLM MindSpore can be installed in the following two ways. **vLLM MindSpore Quick Installation** is suitable for scenarios where users need quick deployment and usage. **vLLM MindSpore Manual Installation** is suitable for scenarios where users require custom modifications to the components. 
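Once either installation path above completes, a quick import check can confirm that the plugin and its companion packages resolve correctly. The sketch below is illustrative only: it assumes the quick installation finished and that MindSpore Transformers was added to `PYTHONPATH` as described, and it follows the convention used throughout these documents of importing `vllm_mindspore` before `vllm`.

```python
# Minimal post-installation sanity check (a sketch, not part of the install scripts).
import vllm_mindspore  # noqa: F401  # import the plugin before vllm so its registrations take effect
import vllm
import mindspore
import mindformers  # noqa: F401  # resolves only if MindSpore Transformers is on PYTHONPATH as described

print("vLLM:", vllm.__version__)            # compare against the version compatibility table above
print("MindSpore:", mindspore.__version__)  # compare against the version compatibility table above
```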
diff --git a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md index 1f251e3fca..b47e8a9e61 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md @@ -2,7 +2,7 @@ [![查看源文件](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source.svg)](https://gitee.com/mindspore/docs/blob/master/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md) -本文档将介绍安装vLLM MindSpore环境的操作步骤。分为三种安装方式: +本文档将介绍vLLM MindSpore的[版本配套](#版本配套),vLLM MindSpore的安装步骤,与[快速验证](#快速验证)用例,用于验证安装是否成功。其中安装步骤分为两种安装方式: - [docker安装](#docker安装):适合用户快速使用的场景; - [源码安装](#源码安装):适合用户有增量开发vLLM MindSpore的场景。 @@ -23,15 +23,11 @@ |[vLLM](https://github.com/vllm-project/vllm) | [0.9.1](https://repo.mindspore.cn/mirrors/vllm/version/202507/20250715/v0.9.1/any/) | |[vLLM MindSpore](https://gitee.com/mindspore/vllm-mindspore) | [0.3.0](https://gitee.com/mindspore/vllm-mindspore/) | -## 配置环境 - -在本章节中,我们将介绍[docker安装](#docker安装)、[源码安装](#源码安装)两种安装方式,以及[快速验证](#快速验证)用例,用于验证安装是否成功。 - -### docker安装 +## docker安装 在本章节中,我们推荐用docker创建的方式,以快速部署vLLM MindSpore环境,以下是部署docker的步骤介绍: -#### 构建镜像 +### 构建镜像 用户可执行以下命令,拉取vLLM MindSpore代码仓库,并构建镜像: @@ -53,7 +49,7 @@ Successfully tagged vllm_ms_20250726:latest docker images ``` -#### 新建容器 +### 新建容器 用户在完成[构建镜像](#构建镜像)后,设置`DOCKER_NAME`与`IMAGE_NAME`以设置容器名与镜像名,并执行以下命令,以新建容器: @@ -95,7 +91,7 @@ docker run -itd --name=${DOCKER_NAME} --ipc=host --network=host --privileged=tru docker ps ``` -#### 进入容器 +### 进入容器 用户在完成[新建容器](#新建容器)后,使用已定义的环境变量`DOCKER_NAME`,启动并进入容器: @@ -103,9 +99,9 @@ docker ps docker exec -it $DOCKER_NAME bash ``` -### 源码安装 +## 源码安装 -#### CANN安装 +### CANN安装 CANN安装方法与环境配套,请参考[CANN社区版软件安装](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/82RC1alpha002/softwareinst/instg/instg_0001.html?Mode=PmIns&OS=openEuler&Software=cannToolKit),若用户在安装CANN过程中遇到问题,可参考[昇腾常见问题](https://www.hiascend.com/document/detail/zh/AscendFAQ/ProduTech/CANNFAQ/cannfaq_000.html)进行解决。 @@ -117,7 +113,7 @@ source ${LOCAL_ASCEND}/ascend-toolkit/set_env.sh export ASCEND_CUSTOM_PATH=${LOCAL_ASCEND}/ascend-toolkit ``` -#### vLLM前置依赖安装 +### vLLM前置依赖安装 vLLM的环境配置与安装方法,请参考[vLLM安装教程](https://docs.vllm.ai/en/v0.9.1/getting_started/installation/cpu.html)。其依赖`gcc/g++ >= 12.3.0`版本,可通过以下命令完成安装: @@ -125,7 +121,7 @@ vLLM的环境配置与安装方法,请参考[vLLM安装教程](https://docs.vl yum install -y gcc gcc-c++ ``` -#### vLLM MindSpore安装 +### vLLM MindSpore安装 vLLM MindSpore有以下两种安装方式。**vLLM MindSpore快速安装**适用于用户快速使用与部署的场景。**vLLM MindSpore手动安装**适用于用户对组件有自定义修改的场景。 -- Gitee From ea7fe2948e4f1b59a063ea13e3f71362969e5ebc Mon Sep 17 00:00:00 2001 From: horcam Date: Mon, 18 Aug 2025 15:08:28 +0800 Subject: [PATCH 10/12] rename vLLM MindSpore Plugin --- docs/vllm_mindspore/docs/source_en/conf.py | 6 ++-- .../source_en/developer_guide/contributing.md | 16 +++++------ .../developer_guide/operations/custom_ops.md | 18 ++++++------ .../docs/source_en/general/security.md | 6 ++-- .../installation/installation.md | 28 +++++++++---------- .../quick_start/quick_start.md | 10 +++---- .../deepseek_r1_671b_w8a8_dp4_tp4_ep4.md | 12 ++++---- .../qwen2.5_32b_multiNPU.md | 8 +++--- .../qwen2.5_7b_singleNPU.md | 8 +++--- docs/vllm_mindspore/docs/source_en/index.rst | 16 +++++------ .../source_en/release_notes/release_notes.md | 4 +-- .../environment_variables.md | 2 +- 
.../supported_features/benchmark/benchmark.md | 8 +++--- .../features_list/features_list.md | 6 ++-- .../supported_features/profiling/profiling.md | 4 +-- .../quantization/quantization.md | 2 +- .../installation/installation.md | 28 +++++++++---------- .../qwen2.5_7b_singleNPU.md | 10 +++---- .../docs/source_zh_cn/index.rst | 18 ++++++------ 19 files changed, 105 insertions(+), 105 deletions(-) diff --git a/docs/vllm_mindspore/docs/source_en/conf.py b/docs/vllm_mindspore/docs/source_en/conf.py index 33bfa05fe2..e90650890c 100644 --- a/docs/vllm_mindspore/docs/source_en/conf.py +++ b/docs/vllm_mindspore/docs/source_en/conf.py @@ -32,9 +32,9 @@ with open(_html_base.__file__, "r", encoding="utf-8") as f: # -- Project information ----------------------------------------------------- -project = 'vLLM MindSpore' +project = 'vLLM-MindSpore Plugin' copyright = 'MindSpore' -author = 'vLLM MindSpore' +author = 'vLLM-MindSpore Plugin' # The full version, including alpha/beta/rc tags release = 'master' @@ -182,7 +182,7 @@ with open(autodoc_source_path, "r+", encoding="utf8") as f: exec(get_param_func_str, sphinx_autodoc.__dict__) exec(code_str, sphinx_autodoc.__dict__) -# Copy source files of chinese python api from vLLM MindSpore repository. +# Copy source files of chinese python api from vLLM-MindSpore Plugin repository. from sphinx.util import logging logger = logging.getLogger(__name__) diff --git a/docs/vllm_mindspore/docs/source_en/developer_guide/contributing.md b/docs/vllm_mindspore/docs/source_en/developer_guide/contributing.md index e912018c23..f6e6140ef2 100644 --- a/docs/vllm_mindspore/docs/source_en/developer_guide/contributing.md +++ b/docs/vllm_mindspore/docs/source_en/developer_guide/contributing.md @@ -13,11 +13,11 @@ Before submitting code to the MindSpore community, you need to sign the Contribu ## Supporting New Models -To support a new model for vLLM MindSpore code repository, please note the following: +To support a new model for vLLM-MindSpore Plugin code repository, please note the following: - **Follow file format and location specifications.** Model code files should be placed under the `vllm_mindspore/model_executor` directory, organized in corresponding subfolders by model type. -- **Implement models using MindSpore interfaces with jit static graph support.** Model definitions in vLLM MindSpore must be implemented using MindSpore interfaces. Since MindSpore's static graph mode offers performance advantages, models should support execution via @jit static graphs. For reference, see the [Qwen2.5](https://gitee.com/mindspore/vllm-mindspore/blob/master/vllm_mindspore/model_executor/models/qwen2.py) implementation. -- **Register new models in vLLM MindSpore.** After implementing the model structure, register it in vLLM MindSpore by adding it to `_NATIVE_MODELS` in `vllm_mindspore/model_executor/models/registry.py`. +- **Implement models using MindSpore interfaces with jit static graph support.** Model definitions in vLLM-MindSpore Plugin must be implemented using MindSpore interfaces. Since MindSpore's static graph mode offers performance advantages, models should support execution via @jit static graphs. For reference, see the [Qwen2.5](https://gitee.com/mindspore/vllm-mindspore/blob/master/vllm_mindspore/model_executor/models/qwen2.py) implementation. +- **Register new models in vLLM-MindSpore Plugin.** After implementing the model structure, register it in vLLM-MindSpore Plugin by adding it to `_NATIVE_MODELS` in `vllm_mindspore/model_executor/models/registry.py`. 
- **Write unit tests.** New models must include corresponding unit tests. Refer to the [Qwen2.5 testcases](https://gitee.com/mindspore/vllm-mindspore/blob/master/tests/st/python/cases_parallel/vllm_qwen_7b.py) for examples. ## Contribution Process @@ -27,12 +27,12 @@ To support a new model for vLLM MindSpore code repository, please note the follo Follow these guidelines for community code review, maintenance, and development. - **Coding Standards:** Use vLLM community code checking tools: yapf, codespell, ruff, isort, and mypy. For more details, see the [Toolchain Usage Guide](https://gitee.com/mindspore/vllm-mindspore/blob/master/codecheck_toolkits/README.md). -- **Unit Testing Guidelines:** vLLM MindSpore uses the [pytest](http://www.pytest.org/en/latest/) framework. Test names should clearly reflect their purpose. +- **Unit Testing Guidelines:** vLLM-MindSpore Plugin uses the [pytest](http://www.pytest.org/en/latest/) framework. Test names should clearly reflect their purpose. - **Refactoring Guidelines:** Developers are encouraged to refactor code to eliminate [code smells](https://en.wikipedia.org/wiki/Code_smell). All code, including refactored code, must adhere to coding and testing standards. ### Fork-Pull Development Model -- **Fork the vLLM MindSpore Repository:** Before submitting code, fork the project to your own repository. Ensure consistency between the vLLM MindSpore repository and your fork during parallel development. +- **Fork the vLLM-MindSpore Plugin Repository:** Before submitting code, fork the project to your own repository. Ensure consistency between the vLLM-MindSpore Plugin repository and your fork during parallel development. - **Clone the Remote Repository:** users can use git to pull the source code: @@ -59,13 +59,13 @@ Follow these guidelines for community code review, maintenance, and development. git push origin {new_branch_name} ``` -- **Create a Pull Request to vLLM MindSpore:** Compare and create a PR between your branch and the vLLM MindSpore master branch. After submission, manually trigger CI checks with `/retest` in the comments. PRs should be merged into upstream master promptly to minimize merge risks. +- **Create a Pull Request to vLLM-MindSpore Plugin:** Compare and create a PR between your branch and the vLLM-MindSpore Plugin master branch. After submission, manually trigger CI checks with `/retest` in the comments. PRs should be merged into upstream master promptly to minimize merge risks. ### Reporting Issues To contribute by reporting issues, follow these guidelines: -- Specify your environment versions (vLLM MindSpore, MindSpore TransFormers, MindSpore, OS, Python, etc.). +- Specify your environment versions (vLLM-MindSpore Plugin, MindSpore TransFormers, MindSpore, OS, Python, etc.). - Indicate whether it's a bug report or feature request. - Label the issue type for visibility on the issue board. - Describe the problem and expected resolution. @@ -92,4 +92,4 @@ To contribute by reporting issues, follow these guidelines: - Keep your branch synchronized with master. - For bug-fix PRs, ensure all related issues are referenced. -Thank you for your interest in contributing to vLLM MindSpore. We welcome and value all forms of collaboration. +Thank you for your interest in contributing to vLLM-MindSpore Plugin. We welcome and value all forms of collaboration. 
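As a complement to the "Supporting New Models" checklist above, the snippet below sketches what a registration entry might look like. The `(module, class)` mapping format is assumed from vLLM's own model-registry convention, and `MyNewModelForCausalLM` is a hypothetical name; check the actual `_NATIVE_MODELS` structure in `vllm_mindspore/model_executor/models/registry.py` before editing it.

```python
# Hypothetical sketch of a new entry in vllm_mindspore/model_executor/models/registry.py.
# The mapping format (architecture name -> (module file, class name)) is assumed from
# vLLM's registry convention; verify against the real _NATIVE_MODELS before changing it.
_NATIVE_MODELS = {
    "Qwen2ForCausalLM": ("qwen2", "Qwen2ForCausalLM"),                   # existing native model
    "MyNewModelForCausalLM": ("my_new_model", "MyNewModelForCausalLM"),  # hypothetical new model
}
```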
diff --git a/docs/vllm_mindspore/docs/source_en/developer_guide/operations/custom_ops.md b/docs/vllm_mindspore/docs/source_en/developer_guide/operations/custom_ops.md index 9f29cb0df6..e16285a9fe 100644 --- a/docs/vllm_mindspore/docs/source_en/developer_guide/operations/custom_ops.md +++ b/docs/vllm_mindspore/docs/source_en/developer_guide/operations/custom_ops.md @@ -4,9 +4,9 @@ When the built-in operators do not meet your requirements, you can use MindSpore's custom operator functionality to integrate your operators. -This document would introduce how to integrate a new custom operator into the vLLM MindSpore project, with the **`advance_step_flashattn`** operator as an example. The focus here is on the integration process into vLLM MindSpore. For the details of custom operator development, please refer to the official MindSpore tutorial: [CustomOpBuilder-Based Custom Operators](https://www.mindspore.cn/tutorials/en/master/custom_program/operation/op_customopbuilder.html), and for AscendC operator development, see the official Ascend documentation: [Ascend C Operator Development](https://www.hiascend.com/document/detail/zh/canncommercial/81RC1/developmentguide/opdevg/Ascendcopdevg/atlas_ascendc_10_0001.html). +This document would introduce how to integrate a new custom operator into the vLLM-MindSpore Plugin project, with the **`advance_step_flashattn`** operator as an example. The focus here is on the integration process into vLLM-MindSpore Plugin. For the details of custom operator development, please refer to the official MindSpore tutorial: [CustomOpBuilder-Based Custom Operators](https://www.mindspore.cn/tutorials/en/master/custom_program/operation/op_customopbuilder.html), and for AscendC operator development, see the official Ascend documentation: [Ascend C Operator Development](https://www.hiascend.com/document/detail/zh/canncommercial/81RC1/developmentguide/opdevg/Ascendcopdevg/atlas_ascendc_10_0001.html). -**Note: Currently, custom operators in vLLM MindSpore are only supported in PyNative Mode.** +**Note: Currently, custom operators in vLLM-MindSpore Plugin are only supported in PyNative Mode.** ## File Structure @@ -109,11 +109,11 @@ VLLM_MS_EXTENSION_MODULE(m) { In the above, the first parameter `"advance_step_flashattn"` in `m.def()` is the Python interface name for the operator. -The `module.h` and `module.cpp` files create the Python module for the operator based on pybind11. Since only one `PYBIND11_MODULE` is allowed per dynamic library, and to allow users to complete operator integration in a single file, vLLM MindSpore provides a new registration macro `VLLM_MS_EXTENSION_MODULE`. When the custom operator dynamic library is loaded, all operator interfaces will be automatically registered into the same Python module. +The `module.h` and `module.cpp` files create the Python module for the operator based on pybind11. Since only one `PYBIND11_MODULE` is allowed per dynamic library, and to allow users to complete operator integration in a single file, vLLM-MindSpore Plugin provides a new registration macro `VLLM_MS_EXTENSION_MODULE`. When the custom operator dynamic library is loaded, all operator interfaces will be automatically registered into the same Python module. ### Operator Interface -The custom operator in vLLM MindSpore is compiled into `_C_ops.so`. For convenient calls, user can add a call interface in `vllm_mindspore/_custom_ops.py`. If extra adaptation is needed before or after the operator call, user can implement it in this interface. 
+The custom operator in vLLM-MindSpore Plugin is compiled into `_C_ops.so`. For convenient calls, user can add a call interface in `vllm_mindspore/_custom_ops.py`. If extra adaptation is needed before or after the operator call, user can implement it in this interface. ```python def advance_step_flashattn(num_seqs: int, num_queries: int, block_size: int, @@ -140,8 +140,8 @@ Here, importing `_C_ops` allows user to use the Python module for the custom ope ### Operator Compilation and Testing -1. **Code Integration**: Merge the code into the vLLM MindSpore project. -2. **Project Compilation**: Run `pip install .` in vllm-mindspore to build and install vLLM MindSpore. +1. **Code Integration**: Merge the code into the vLLM-MindSpore Plugin project. +2. **Project Compilation**: Run `pip install .` in vllm-mindspore to build and install vLLM-MindSpore Plugin. 3. **Operator Testing**: Call the operator interface via `_custom_ops`. Refer to testcase [test_custom_advstepflash.py](https://gitee.com/mindspore/vllm-mindspore/blob/master/tests/st/python/test_custom_advstepflash.py): ```python @@ -152,18 +152,18 @@ custom_ops.advance_step_flashattn(...) ## Custom Operator Compilation Project -Currently, MindSpore provides only a [CustomOpBuilder](https://www.mindspore.cn/docs/en/master/api_python/ops/mindspore.ops.CustomOpBuilder.html) interface for online compilation of custom operators, with default compilation and linking options built in. vLLM MindSpore integrates operators based on MindSpore’s custom operator feature and compiles them into a dynamic library for package release. The following introduces the build process: +Currently, MindSpore provides only a [CustomOpBuilder](https://www.mindspore.cn/docs/en/master/api_python/ops/mindspore.ops.CustomOpBuilder.html) interface for online compilation of custom operators, with default compilation and linking options built in. vLLM-MindSpore Plugin integrates operators based on MindSpore’s custom operator feature and compiles them into a dynamic library for package release. The following introduces the build process: ### Extension Module -In `setup.py`, vLLM MindSpore adds a `vllm_mindspore._C_ops` extension and the corresponding build module: +In `setup.py`, vLLM-MindSpore Plugin adds a `vllm_mindspore._C_ops` extension and the corresponding build module: ```python ext_modules = [Extension("vllm_mindspore._C_ops", sources=[])], cmdclass = {"build_ext": CustomBuildExt}, ``` -There is no need to specify `sources` here because vLLM MindSpore triggers the operator build via CMake, which automatically collects the source files. +There is no need to specify `sources` here because vLLM-MindSpore Plugin triggers the operator build via CMake, which automatically collects the source files. 
### Building Process diff --git a/docs/vllm_mindspore/docs/source_en/general/security.md b/docs/vllm_mindspore/docs/source_en/general/security.md index 886bd1b8ad..d156fffcb7 100644 --- a/docs/vllm_mindspore/docs/source_en/general/security.md +++ b/docs/vllm_mindspore/docs/source_en/general/security.md @@ -2,11 +2,11 @@ [![View Source On Gitee](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source_en.svg)](https://gitee.com/mindspore/docs/blob/master/docs/vllm_mindspore/docs/source_en/general/security.md) -When enabling inference services using vLLM MindSpore on Ascend, there may be some security-related issues due to the need for certain network ports for necessary functions such as serviceification, node communication, and model execution. +When enabling inference services using vLLM-MindSpore Plugin on Ascend, there may be some security-related issues due to the need for certain network ports for necessary functions such as serviceification, node communication, and model execution. ## Service Port Configuration -When starting the inference service using vLLM MindSpore, relevant IP and port information is required, including: +When starting the inference service using vLLM-MindSpore Plugin, relevant IP and port information is required, including: 1. `host`: Sets the IP address associated with the vLLM serve (default: `0.0.0.0`). 2. `port`: Sets the port for vLLM serve (default: `8000`). @@ -36,7 +36,7 @@ For security, it should be deployed in a sufficiently secure isolated network en ### Executing Framework Distributed Communication -It should be noted that vLLM MindSpore use MindSpore's distributed communication. For detailed security information about MindSpore, please refer to the [MindSpore](https://www.mindspore.cn/en). +It should be noted that vLLM-MindSpore Plugin use MindSpore's distributed communication. For detailed security information about MindSpore, please refer to the [MindSpore](https://www.mindspore.cn/en). ## Security Recommendations diff --git a/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md b/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md index 7d7d0f9925..de7bdda373 100644 --- a/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md +++ b/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md @@ -2,10 +2,10 @@ [![View Source On Gitee](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source_en.svg)](https://gitee.com/mindspore/docs/blob/master/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md) -This document will introduce the [Version Matching](#version-compatibility) of vLLM MindSpore, the installation steps for vLLM MindSpore, and the [Quick Verification](#quick-verification) to verify whether the installation is successful. The installation steps provide two installation methods: +This document will introduce the [Version Matching](#version-compatibility) of vLLM-MindSpore Plugin, the installation steps for vLLM-MindSpore Plugin, and the [Quick Verification](#quick-verification) to verify whether the installation is successful. The installation steps provide two installation methods: - [Docker Installation](#docker-installation): Suitable for quick deployment scenarios. -- [Source Code Installation](#source-code-installation): Suitable for incremental development of vLLM MindSpore. 
+- [Source Code Installation](#source-code-installation): Suitable for incremental development of vLLM-MindSpore Plugin. ## Version Compatibility @@ -21,15 +21,15 @@ This document will introduce the [Version Matching](#version-compatibility) of v |[MindSpore Transformers](https://gitee.com/mindspore/mindformers)| [1.6.0](https://gitee.com/mindspore/mindformers) | |[Golden Stick](https://gitee.com/mindspore/golden-stick)| [1.2.0](https://repo.mindspore.cn/mindspore/golden-stick/version/202508/20250814/master_20250814010017_2713821db982330b3bcd6d84d85a3b337d555f27_newest/any/) | |[vLLM](https://github.com/vllm-project/vllm) | [0.9.1](https://repo.mindspore.cn/mirrors/vllm/version/202507/20250715/v0.9.1/any/) | - |[vLLM MindSpore](https://gitee.com/mindspore/vllm-mindspore) | [0.3.0](https://gitee.com/mindspore/vllm-mindspore/) | + |[vLLM-MindSpore Plugin](https://gitee.com/mindspore/vllm-mindspore) | [0.3.0](https://gitee.com/mindspore/vllm-mindspore/) | ## Docker Installation -We recommend using Docker for quick deployment of the vLLM MindSpore environment. Below are the steps: +We recommend using Docker for quick deployment of the vLLM-MindSpore Plugin environment. Below are the steps: ### Building the Image -User can execute the following commands to clone the vLLM MindSpore code repository and build the image: +User can execute the following commands to clone the vLLM-MindSpore Plugin code repository and build the image: ```bash git clone https://gitee.com/mindspore/vllm-mindspore.git @@ -121,13 +121,13 @@ For vLLM environment configuration and installation methods, please refer to the yum install -y gcc gcc-c++ ``` -### vLLM MindSpore Installation +### vLLM-MindSpore Plugin Installation -vLLM MindSpore can be installed in the following two ways. **vLLM MindSpore Quick Installation** is suitable for scenarios where users need quick deployment and usage. **vLLM MindSpore Manual Installation** is suitable for scenarios where users require custom modifications to the components. +vLLM-MindSpore Plugin can be installed in the following two ways. **vLLM-MindSpore Plugin Quick Installation** is suitable for scenarios where users need quick deployment and usage. **vLLM-MindSpore Plugin Manual Installation** is suitable for scenarios where users require custom modifications to the components. -- **vLLM MindSpore Quick Installation** +- **vLLM-MindSpore Plugin Quick Installation** - To install vLLM MindSpore, user needs to pull the vLLM MindSpore source code and then runs the following command to install the dependencies: + To install vLLM-MindSpore Plugin, user needs to pull the vLLM-MindSpore Plugin source code and then runs the following command to install the dependencies: ```bash git clone https://gitee.com/mindspore/vllm-mindspore.git @@ -135,7 +135,7 @@ vLLM MindSpore can be installed in the following two ways. **vLLM MindSpore Quic bash install_depend_pkgs.sh ``` - Compile and install vLLM MindSpore: + Compile and install vLLM-MindSpore Plugin: ```bash pip install . @@ -147,9 +147,9 @@ vLLM MindSpore can be installed in the following two ways. **vLLM MindSpore Quic export PYTHONPATH=$MF_PATH:$PYTHONPATH ``` -- **vLLM MindSpore Manual Installation** +- **vLLM-MindSpore Plugin Manual Installation** - If user need to modify the components or use other versions, components need to be manually installed in a specific order. 
Version compatibility of vLLM MindSpore can be found [Version Compatibility](#version-compatibility), abd vLLM MindSpore requires the following installation sequence: + If user need to modify the components or use other versions, components need to be manually installed in a specific order. Version compatibility of vLLM-MindSpore Plugin can be found [Version Compatibility](#version-compatibility), abd vLLM-MindSpore Plugin requires the following installation sequence: 1. Install vLLM @@ -188,9 +188,9 @@ vLLM MindSpore can be installed in the following two ways. **vLLM MindSpore Quic pip install /path/to/msadapter-*.whl ``` - 7. Install vLLM MindSpore + 7. Install vLLM-MindSpore Plugin - User needs to pull source of vLLM MindSpore, and run installation. + User needs to pull source of vLLM-MindSpore Plugin, and run installation. ```bash git clone https://gitee.com/mindspore/vllm-mindspore.git diff --git a/docs/vllm_mindspore/docs/source_en/getting_started/quick_start/quick_start.md b/docs/vllm_mindspore/docs/source_en/getting_started/quick_start/quick_start.md index 91a88e814d..e425c661a4 100644 --- a/docs/vllm_mindspore/docs/source_en/getting_started/quick_start/quick_start.md +++ b/docs/vllm_mindspore/docs/source_en/getting_started/quick_start/quick_start.md @@ -2,15 +2,15 @@ [![View Source On Gitee](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source_en.svg)](https://gitee.com/mindspore/docs/blob/master/docs/vllm_mindspore/docs/source_en/getting_started/quick_start/quick_start.md) -This document provides a quick guide to deploy vLLM MindSpore by [docker](https://www.docker.com/), with the [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) model as an example. User can quickly experience the serving and inference abilities of vLLM MindSpore by [offline inference](#offline-inference) and [online inference](#online-inference). For more information about installation, please refer to the [Installation Guide](../installation/installation.md). +This document provides a quick guide to deploy vLLM-MindSpore Plugin by [docker](https://www.docker.com/), with the [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) model as an example. User can quickly experience the serving and inference abilities of vLLM-MindSpore Plugin by [offline inference](#offline-inference) and [online inference](#online-inference). For more information about installation, please refer to the [Installation Guide](../installation/installation.md). ## Docker Installation -In this section, we recommend to use docker to deploy the vLLM MindSpore environment. The following sections are the steps for deployment: +In this section, we recommend to use docker to deploy the vLLM-MindSpore Plugin environment. The following sections are the steps for deployment: ### Building the Image -User can execute the following commands to clone the vLLM MindSpore code repository and build the image: +User can execute the following commands to clone the vLLM-MindSpore Plugin code repository and build the image: ```bash git clone https://gitee.com/mindspore/vllm-mindspore.git @@ -137,7 +137,7 @@ export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Tra Here is an explanation of these environment variables: -- `vLLM_MODEL_BACKEND`: The backend of the model to run. User could find supported models and backends for vLLM MindSpore in the [Model Support List](../../user_guide/supported_models/models_list/models_list.md). 
+- `vLLM_MODEL_BACKEND`: The backend of the model to run. User could find supported models and backends for vLLM-MindSpore Plugin in the [Model Support List](../../user_guide/supported_models/models_list/models_list.md). - `MINDFORMERS_MODEL_CONFIG`: The model configuration file. User can find the corresponding YAML file in the [MindSpore Transformers repository](https://gitee.com/mindspore/mindformers/tree/master/research/qwen2_5). For Qwen2.5-7B, the YAML file is [predict_qwen2_5_7b_instruct.yaml](https://gitee.com/mindspore/mindformers/blob/master/research/qwen2_5/predict_qwen2_5_7b_instruct.yaml). Additionally, users need to ensure that MindSpore Transformers is installed. Users can add it by running the following command: @@ -188,7 +188,7 @@ Prompt: 'Llama is'. Generated text: ' a 100% natural, biodegradable, and compost ### Online Inference -vLLM MindSpore supports online inference deployment with the OpenAI API protocol. The following section would introduce how to [starting the service](#starting-the-service) and [send requests](#sending-requests) to obtain inference results, using [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) as an example. +vLLM-MindSpore Plugin supports online inference deployment with the OpenAI API protocol. The following section would introduce how to [starting the service](#starting-the-service) and [send requests](#sending-requests) to obtain inference results, using [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) as an example. #### Starting the Service diff --git a/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md b/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md index 9c369446a2..9080b67487 100644 --- a/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md +++ b/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md @@ -2,7 +2,7 @@ [![View Source On Gitee](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source_en.svg)](https://gitee.com/mindspore/docs/blob/master/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md) -vLLM MindSpore supports hybrid parallel inference with configurations of tensor parallelism (TP), data parallelism (DP), expert parallelism (EP), and their combinations. For the applicable scenarios of different parallel strategies, refer to the [vLLM official documentation](https://docs.vllm.ai/en/latest/configuration/optimization.html#parallelism-strategies). +vLLM-MindSpore Plugin supports hybrid parallel inference with configurations of tensor parallelism (TP), data parallelism (DP), expert parallelism (EP), and their combinations. For the applicable scenarios of different parallel strategies, refer to the [vLLM official documentation](https://docs.vllm.ai/en/latest/configuration/optimization.html#parallelism-strategies). This document uses the DeepSeek R1 671B W8A8 model as an example to introduce the inference workflows for [tensor parallelism (TP16)](#tp16-tensor-parallel-inference) and [hybrid parallelism](#hybrid-parallel-inference). The DeepSeek R1 671B W8A8 model requires multiple nodes to run inference. To ensure consistent execution configurations (including model configuration file paths, Python environments, etc.) 
across all nodes, it is recommended to use Docker containers to eliminate execution differences. @@ -10,11 +10,11 @@ Users can configure the environment by following the [Docker Installation](#dock ## Docker Installation -In this section, we recommend to use docker to deploy the vLLM MindSpore environment. The following sections are the steps for deployment: +In this section, we recommend to use docker to deploy the vLLM-MindSpore Plugin environment. The following sections are the steps for deployment: ### Building the Image -User can execute the following commands to clone the vLLM MindSpore code repository and build the image: +User can execute the following commands to clone the vLLM-MindSpore Plugin code repository and build the image: ```bash git clone https://gitee.com/mindspore/vllm-mindspore.git @@ -114,7 +114,7 @@ Environment variable descriptions: - `HCCL_OP_EXPANSION_MODE`: Configure the communication algorithm expansion location to the AI Vector Core (AIV) computing unit on the device side. - `MS_ALLOC_CONF`: Set the memory policy. Refer to the [MindSpore documentation](https://www.mindspore.cn/docs/en/master/api_python/env_var_list.html). - `ASCEND_RT_VISIBLE_DEVICES`: Configure the available device IDs for each node. Use the `npu-smi info` command to check. -- `vLLM_MODEL_BACKEND`: The backend of the model to run. Currently supported models and backends for vLLM MindSpore can be found in the [Model Support List](../../../user_guide/supported_models/models_list/models_list.md). +- `vLLM_MODEL_BACKEND`: The backend of the model to run. Currently supported models and backends for vLLM-MindSpore Plugin can be found in the [Model Support List](../../../user_guide/supported_models/models_list/models_list.md). - `MINDFORMERS_MODEL_CONFIG`: Model configuration file. Users can find the corresponding YAML file in the [MindSpore Transformers repository](https://gitee.com/mindspore/mindformers/tree/master/research/deepseek3/deepseek_r1_671b), such as [predict_deepseek_r1_671b_w8a8.yaml](https://gitee.com/mindspore/mindformers/blob/master/research/deepseek3/deepseek_r1_671b/predict_deepseek_r1_671b_w8a8.yaml). The model parallel strategy is specified in the `parallel_config` of the configuration file. For example, the TP16 tensor parallel configuration is as follows: @@ -222,7 +222,7 @@ Before managing a multi-node cluster, ensure that the hostnames of all nodes are #### Starting the Service -vLLM MindSpore can deploy online inference using the OpenAI API protocol. Below is the workflow for launching the service. +vLLM-MindSpore Plugin can deploy online inference using the OpenAI API protocol. Below is the workflow for launching the service. ```bash # Service launch parameter explanation @@ -285,7 +285,7 @@ Environment variable descriptions: - `HCCL_OP_EXPANSION_MODE`: Configure the communication algorithm expansion location to the AI Vector Core (AIV) computing unit on the device side. - `MS_ALLOC_CONF`: Set the memory policy. Refer to the [MindSpore documentation](https://www.mindspore.cn/docs/en/master/api_python/env_var_list.html). - `ASCEND_RT_VISIBLE_DEVICES`: Configure the available device IDs for each node. Use the `npu-smi info` command to check. -- `vLLM_MODEL_BACKEND`: The backend of the model to run. Currently supported models and backends for vLLM MindSpore can be found in the [Model Support List](../../../user_guide/supported_models/models_list/models_list.md). +- `vLLM_MODEL_BACKEND`: The backend of the model to run. 
Currently supported models and backends for vLLM-MindSpore Plugin can be found in the [Model Support List](../../../user_guide/supported_models/models_list/models_list.md). - `MINDFORMERS_MODEL_CONFIG`: Model configuration file. Users can find the corresponding YAML file in the [MindSpore Transformers repository](https://gitee.com/mindspore/mindformers/tree/master/research/deepseek3/deepseek_r1_671b), such as [predict_deepseek_r1_671b_w8a8_ep4t4.yaml](https://gitee.com/mindspore/mindformers/blob/master/research/deepseek3/deepseek_r1_671b/predict_deepseek_r1_671b_w8a8_ep4tp4.yaml). The model parallel strategy is specified in the `parallel_config` of the configuration file. For example, the hybrid parallel configuration is as follows: diff --git a/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md b/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md index 40ad4a1597..5976859caa 100644 --- a/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md +++ b/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md @@ -2,15 +2,15 @@ [![View Source On Gitee](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source_en.svg)](https://gitee.com/mindspore/docs/blob/master/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md) -This document introduces single-node multi-card inference process by vLLM MindSpore. Taking the [Qwen2.5-32B](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct) model as an example, users can configure the environment through the [Docker Installation](#docker-installation) section or the [Installation Guide](../../installation/installation.md#installation-guide), and then [download the model weights](#downloading-model-weights). After [setting environment variables](#setting-environment-variables), users can perform [online inference](#online-inference) to experience single-node multi-card inference capabilities. +This document introduces single-node multi-card inference process by vLLM-MindSpore Plugin. Taking the [Qwen2.5-32B](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct) model as an example, users can configure the environment through the [Docker Installation](#docker-installation) section or the [Installation Guide](../../installation/installation.md#installation-guide), and then [download the model weights](#downloading-model-weights). After [setting environment variables](#setting-environment-variables), users can perform [online inference](#online-inference) to experience single-node multi-card inference capabilities. ## Docker Installation -In this section, we recommend using Docker for quick deployment of the vLLM MindSpore environment. Below are the steps for Docker deployment: +In this section, we recommend using Docker for quick deployment of the vLLM-MindSpore Plugin environment. 
Below are the steps for Docker deployment: ### Building the Image -User can execute the following commands to clone the vLLM MindSpore code repository and build the image: +User can execute the following commands to clone the vLLM-MindSpore Plugin code repository and build the image: ```bash git clone https://gitee.com/mindspore/vllm-mindspore.git @@ -144,7 +144,7 @@ export ASCEND_RT_VISIBLE_DEVICES=4,5,6,7 ## Online Inference -vLLM MindSpore supports online inference deployment with the OpenAI API protocol. The following section would introduce how to [starting the service](#starting-the-service) and [send requests](#sending-requests) to obtain inference results, using [Qwen2.5-32B](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct) as an example. +vLLM-MindSpore Plugin supports online inference deployment with the OpenAI API protocol. The following section would introduce how to [starting the service](#starting-the-service) and [send requests](#sending-requests) to obtain inference results, using [Qwen2.5-32B](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct) as an example. ### Starting the Service diff --git a/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md b/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md index 79ed73f812..e4b2167fbc 100644 --- a/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md +++ b/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md @@ -2,15 +2,15 @@ [![View Source On Gitee](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source_en.svg)](https://gitee.com/mindspore/docs/blob/master/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md) -This document introduces single NPU inference process by vLLM MindSpore. Taking the [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) model as an example, user can configure the environment through the [Docker Installation](#docker-installation) or the [Installation Guide](../../installation/installation.md#installation-guide), and [downloading model weights](#downloading-model-weights). After [setting environment variables](#setting-environment-variables), user can perform [offline inference](#offline-inference) and [online inference](#online-inference) to experience single NPU inference abilities. +This document introduces single NPU inference process by vLLM-MindSpore Plugin. Taking the [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) model as an example, user can configure the environment through the [Docker Installation](#docker-installation) or the [Installation Guide](../../installation/installation.md#installation-guide), and [downloading model weights](#downloading-model-weights). After [setting environment variables](#setting-environment-variables), user can perform [offline inference](#offline-inference) and [online inference](#online-inference) to experience single NPU inference abilities. ## Docker Installation -In this section, we recommend using Docker for quick deployment of the vLLM MindSpore environment. Below are the steps for Docker deployment: +In this section, we recommend using Docker for quick deployment of the vLLM-MindSpore Plugin environment. 
Below are the steps for Docker deployment: ### Building the Image -User can execute the following commands to clone the vLLM MindSpore code repository and build the image: +User can execute the following commands to clone the vLLM-MindSpore Plugin code repository and build the image: ```bash git clone https://gitee.com/mindspore/vllm-mindspore.git @@ -182,7 +182,7 @@ Prompt: 'Llama is'. Generated text: ' a 100% natural, biodegradable, and compost ## Online Inference -vLLM MindSpore supports online inference deployment with the OpenAI API protocol. The following section would introduce how to [starting the service](#starting-the-service) and [send requests](#sending-requests) to obtain inference results, using [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) as an example. +vLLM-MindSpore Plugin supports online inference deployment with the OpenAI API protocol. The following section would introduce how to [starting the service](#starting-the-service) and [send requests](#sending-requests) to obtain inference results, using [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) as an example. ### Starting the Service diff --git a/docs/vllm_mindspore/docs/source_en/index.rst b/docs/vllm_mindspore/docs/source_en/index.rst index 351967f14a..b0d3504957 100644 --- a/docs/vllm_mindspore/docs/source_en/index.rst +++ b/docs/vllm_mindspore/docs/source_en/index.rst @@ -1,22 +1,22 @@ -vLLM MindSpore +vLLM-MindSpore Plugin ========================================= Overview ----------------------------------------------------- -vLLM MindSpore (`vllm-mindspore`) is a plugin brewed by the `MindSpore community `_ , which aims to integrate MindSpore LLM inference capabilities into `vLLM `_ . With vLLM MindSpore, technical strengths of Mindspore and vLLM will be organically combined to provide a full-stack open-source, high-performance, easy-to-use LLM inference solution. +vLLM-MindSpore Plugin (`vllm-mindspore`) is a plugin brewed by the `MindSpore community `_ , which aims to integrate MindSpore LLM inference capabilities into `vLLM `_ . With vLLM-MindSpore Plugin, technical strengths of Mindspore and vLLM will be organically combined to provide a full-stack open-source, high-performance, easy-to-use LLM inference solution. vLLM, an opensource and community-driven project initiated by Sky Computing Lab, UC Berkeley, has been widely used in academic research and industry applications. On the basis of Continuous Batching scheduling mechanism and PagedAttention Key-Value cache management, vLLM provides a rich set of inference service features, including speculative inference, Prefix Caching, Multi-LoRA, etc. vLLM also supports a wide range of open-source large models, including Transformer-based models (e.g., LLaMa), Mixture-of-Expert models (e.g., DeepSeek), Embedding models (e.g., E5-Mistral), and multi-modal models (e.g., LLaVA). Because vLLM chooses to use PyTorch to build large models and manage storage resources, it cannot deploy large models built upon MindSpore. -vLLM MindSpore plugin aims to integrate Mindspore large models into vLLM and to enable deploying MindSpore-based LLM inference services. It follows the following design principles: +vLLM-MindSpore Plugin plugin aims to integrate Mindspore large models into vLLM and to enable deploying MindSpore-based LLM inference services. 
It follows the following design principles: - Interface compatibility: support the native APIs and service deployment interfaces of vLLM to avoid adding new configuration files or interfaces, reducing user learning costs and ensuring ease of use. - Minimal invasive modifications: minimize invasive modifications to the vLLM code to ensure system maintainability and evolvability. - Component decoupling: minimize and standardize the coupling between MindSpore large model components and vLLM service components to facilitate the integration of various MindSpore large model suites. -On the basis of the above design principles, vLLM MindSpore adopts the system architecture shown in the figure below, and implements the docking between vLLM and Mindspore in categories of components: +On the basis of the above design principles, vLLM-MindSpore Plugin adopts the system architecture shown in the figure below, and implements the docking between vLLM and Mindspore in categories of components: -- Service components: vLLM MindSpore maps PyTorch API calls in service components including LLMEngine and Scheduler to MindSpore capabilities, inheriting support for service functions like Continuous Batching and PagedAttention. -- Model components: vLLM MindSpore registers or replaces model components including models, network layers, and custom operators, and integrates MindSpore Transformers, MindSpore One, and other MindSpore large model suites, as well as custom large models, into vLLM. +- Service components: vLLM-MindSpore Plugin maps PyTorch API calls in service components including LLMEngine and Scheduler to MindSpore capabilities, inheriting support for service functions like Continuous Batching and PagedAttention. +- Model components: vLLM-MindSpore Plugin registers or replaces model components including models, network layers, and custom operators, and integrates MindSpore Transformers, MindSpore One, and other MindSpore large model suites, as well as custom large models, into vLLM. .. raw:: html @@ -28,7 +28,7 @@ On the basis of the above design principles, vLLM MindSpore adopts the system ar -vLLM MindSpore uses the plugin mechanism recommended by the vLLM community to realize capability registration. In the future, we expect to promote vLLM community to support integration of inference capabilities of third-party AI frameworks, including PaddlePaddle and JAX by following principles described in `[RPC] Multi-framework support for vllm `_ . +vLLM-MindSpore Plugin uses the plugin mechanism recommended by the vLLM community to realize capability registration. In the future, we expect to promote vLLM community to support integration of inference capabilities of third-party AI frameworks, including PaddlePaddle and JAX by following principles described in `[RPC] Multi-framework support for vllm `_ . 
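In practice, the plugin mechanism means the integration is activated simply by importing `vllm_mindspore` before any vLLM API, after which the native vLLM interfaces are used unchanged. The sketch below is a minimal illustration of that pattern; the model path and sampling values are placeholders.

```python
# Minimal illustration of plugin activation (a sketch; model path and sampling values
# are placeholders).
import vllm_mindspore  # noqa: F401  # registers the MindSpore backend with vLLM
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")
sampling_params = SamplingParams(temperature=0.0, max_tokens=32)
for output in llm.generate(["I am", "Llama is"], sampling_params):
    print(output.prompt, "->", output.outputs[0].text)
```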
Code: @@ -88,7 +88,7 @@ The following are the version branches: SIG ----------------------------------------------------- -- Welcome to join vLLM MindSpore SIG to participate in the co-construction of open-source projects and industrial cooperation: https://www.mindspore.cn/community/SIG +- Welcome to join vLLM-MindSpore Plugin SIG to participate in the co-construction of open-source projects and industrial cooperation: https://www.mindspore.cn/community/SIG - SIG meetings, every other Friday or Saturday evening, 20:00 - 21:00 (UTC+8, `Convert to your timezone `_ ) License diff --git a/docs/vllm_mindspore/docs/source_en/release_notes/release_notes.md b/docs/vllm_mindspore/docs/source_en/release_notes/release_notes.md index 4b22bc8bf9..2b68ef86d8 100644 --- a/docs/vllm_mindspore/docs/source_en/release_notes/release_notes.md +++ b/docs/vllm_mindspore/docs/source_en/release_notes/release_notes.md @@ -2,9 +2,9 @@ [![View Source On Gitee](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source_en.svg)](https://gitee.com/mindspore/docs/blob/master/docs/vllm_mindspore/docs/source_en/release_notes/release_notes.md) -## vLLM MindSpore 0.3.0 Release Notes +## vLLM-MindSpore Plugin 0.3.0 Release Notes -The following are the key new features and models supported in the vLLM MindSpore plugin version 0.3.0. +The following are the key new features and models supported in the vLLM-MindSpore Plugin plugin version 0.3.0. ### New Features diff --git a/docs/vllm_mindspore/docs/source_en/user_guide/environment_variables/environment_variables.md b/docs/vllm_mindspore/docs/source_en/user_guide/environment_variables/environment_variables.md index 036834798e..054328a0ff 100644 --- a/docs/vllm_mindspore/docs/source_en/user_guide/environment_variables/environment_variables.md +++ b/docs/vllm_mindspore/docs/source_en/user_guide/environment_variables/environment_variables.md @@ -4,7 +4,7 @@ | Environment Variable | Function | Type | Values | Description | |----------------------|----------|------|--------|-------------| -| `vLLM_MODEL_BACKEND` | Specifies the model backend. Not Required when using vLLM MindSpore native models, and required when using an external vLLM MindSpore models. | String | `MindFormers`: Model source is MindSpore Transformers. | vLLM MindSpore native model backend supports Qwen2.5 series. MindSpore Transformers model backend supports Qwen/DeepSeek/Llama series models, and the environment variable: `export PYTHONPATH=/path/to/mindformers/:$PYTHONPATH` needs to be set. | +| `vLLM_MODEL_BACKEND` | Specifies the model backend. Not Required when using vLLM-MindSpore Plugin native models, and required when using an external vLLM-MindSpore Plugin models. | String | `MindFormers`: Model source is MindSpore Transformers. | vLLM-MindSpore Plugin native model backend supports Qwen2.5 series. MindSpore Transformers model backend supports Qwen/DeepSeek/Llama series models, and the environment variable: `export PYTHONPATH=/path/to/mindformers/:$PYTHONPATH` needs to be set. | | `MINDFORMERS_MODEL_CONFIG` | Configuration file for MindSpore Transformers models. Required for Qwen2.5 series or DeepSeek series models. | String | Path to the model configuration file | **This environment variable will be removed in future versions.** Example: `export MINDFORMERS_MODEL_CONFIG=/path/to/research/deepseek3/deepseek_r1_671b/predict_deepseek_r1_671b_w8a8.yaml`. | | `GLOO_SOCKET_IFNAME` | Specifies the network interface name for inter-machine communication using gloo. 
| String | Interface name (e.g., `enp189s0f0`). | Used in multi-machine scenarios. The interface name can be found via `ifconfig` by matching the IP address. | | `TP_SOCKET_IFNAME` | Specifies the network interface name for inter-machine communication using TP. | String | Interface name (e.g., `enp189s0f0`). | Used in multi-machine scenarios. The interface name can be found via `ifconfig` by matching the IP address. | diff --git a/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/benchmark/benchmark.md b/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/benchmark/benchmark.md index 9ab03ff8ac..e6055ce438 100644 --- a/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/benchmark/benchmark.md +++ b/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/benchmark/benchmark.md @@ -2,7 +2,7 @@ [![View Source On Gitee](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source_en.svg)](https://gitee.com/mindspore/docs/blob/master/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/benchmark/benchmark.md) -The benchmark tool of vLLM MindSpore is inherited from vLLM. You can refer to the [vLLM BenchMark](https://github.com/vllm-project/vllm/blob/main/benchmarks/README.md) documentation for more details. This document introduces [Online Benchmark](#online-benchmark) and [Offline Benchmark](#offline-benchmark). Users can follow the steps to conduct performance tests. +The benchmark tool of vLLM-MindSpore Plugin is inherited from vLLM. You can refer to the [vLLM BenchMark](https://github.com/vllm-project/vllm/blob/main/benchmarks/README.md) documentation for more details. This document introduces [Online Benchmark](#online-benchmark) and [Offline Benchmark](#offline-benchmark). Users can follow the steps to conduct performance tests. ## Online Benchmark @@ -35,7 +35,7 @@ INFO: Waiting for application startup. INFO: Application startup complete. ``` -Clone the vLLM repository and import the vLLM MindSpore plugin to reuse the benchmark tools: +Clone the vLLM repository and import the vLLM-MindSpore Plugin plugin to reuse the benchmark tools: ```bash export VLLM_BRANCH=v0.9.1 @@ -44,7 +44,7 @@ cd vllm sed -i '1i import vllm_mindspore' benchmarks/benchmark_serving.py ``` -Here, `VLLM_BRANCH` refers to the branch name of vLLM, which needs to be compatible with vLLM MindSpore. For compatibility details, please refer to [here](../../../getting_started/installation/installation.md#version-compatibility). +Here, `VLLM_BRANCH` refers to the branch name of vLLM, which needs to be compatible with vLLM-MindSpore Plugin. For compatibility details, please refer to [here](../../../getting_started/installation/installation.md#version-compatibility). Execute the test script: @@ -115,7 +115,7 @@ cd vllm sed -i '1i import vllm_mindspore' benchmarks/benchmark_throughput.py ``` -Here, `VLLM_BRANCH` refers to the branch name of vLLM, which needs to be compatible with vLLM MindSpore. For compatibility details, please refer to [here](../../../getting_started/installation/installation.md#version-compatibility). +Here, `VLLM_BRANCH` refers to the branch name of vLLM, which needs to be compatible with vLLM-MindSpore Plugin. For compatibility details, please refer to [here](../../../getting_started/installation/installation.md#version-compatibility). Run the test script with the following command. 
The script below will start the model automatically, and users do not need to start the model manually:
diff --git a/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/features_list/features_list.md b/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/features_list/features_list.md
index 6753054670..59680c244f 100644
--- a/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/features_list/features_list.md
+++ b/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/features_list/features_list.md
@@ -2,9 +2,9 @@
[![View Source On Gitee](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source_en.svg)](https://gitee.com/mindspore/docs/blob/master/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/features_list/features_list.md)
-The features supported by vLLM MindSpore are consistent with the community version of vLLM. For feature descriptions and usage, please refer to the [vLLM Official Documentation](https://docs.vllm.ai/en/latest/).
+The features supported by vLLM-MindSpore Plugin are consistent with the community version of vLLM. For feature descriptions and usage, please refer to the [vLLM Official Documentation](https://docs.vllm.ai/en/latest/).
-The following is the features supported in vLLM MindSpore.
+The following are the features supported in vLLM-MindSpore Plugin.
| **Features** | **vLLM V0** | **vLLM V1** |
|-----------------------------------|--------------------|--------------------|
@@ -39,5 +39,5 @@ The following is the features supported in vLLM MindSpore.
## Feature Description
-- LoRA currently only supports the Qwen2.5 vLLM MindSpore native model, other models are in the process of adaptation;
+- LoRA currently only supports the Qwen2.5 vLLM-MindSpore Plugin native model; other models are in the process of adaptation;
- Tool Calling only supports DeepSeek V3 0324 W8A8 model.
diff --git a/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/profiling/profiling.md b/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/profiling/profiling.md
index 897f1ec0c8..3810ec6a8c 100644
--- a/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/profiling/profiling.md
+++ b/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/profiling/profiling.md
@@ -2,7 +2,7 @@
[![View Source On Gitee](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source_en.svg)](https://gitee.com/mindspore/docs/blob/master/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/profiling/profiling.md)
-vLLM MindSpore supports the `mindspore.Profiler` module to track the performance of workers in vLLM MindSpore. User can follow the [Collecting Profiling Data](#collecting-profiling-data) section to gather data and then analyze it according to [Analyzing Profiling Data](#analyzing-profiling-data). Additionally, user can inspect the model's IR graph through [Graph Data Dump](#graph-data-dump) to analyze and debug the model structure.
+vLLM-MindSpore Plugin supports the `mindspore.Profiler` module to track the performance of workers in vLLM-MindSpore Plugin. Users can follow the [Collecting Profiling Data](#collecting-profiling-data) section to gather data and then analyze it according to [Analyzing Profiling Data](#analyzing-profiling-data). Additionally, users can inspect the model's IR graph through [Graph Data Dump](#graph-data-dump) to analyze and debug the model structure.
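Before wiring profiling into a running service as described in the next section, it can be useful to confirm that `mindspore.Profiler` writes output on the target device at all. The following standalone sketch is illustrative only; it assumes MindSpore's documented `Profiler(output_path=...)`/`analyse()` interface and an available Ascend device, and it is not how the plugin integrates profiling into its workers.

```python
# Standalone mindspore.Profiler smoke test (illustrative only; not the plugin's
# internal integration). Assumes an Ascend device and MindSpore's documented
# Profiler(output_path=...) / analyse() interface.
import numpy as np
import mindspore as ms
from mindspore import Profiler, Tensor, ops

ms.set_context(device_target="Ascend")
profiler = Profiler(output_path="./profiler_smoke_test")

x = Tensor(np.random.randn(1024, 1024).astype(np.float32))
y = ops.matmul(x, x)  # small stand-in workload to generate operator/timeline data
print(y.shape)

profiler.analyse()  # writes the collected performance data to output_path
```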
## Collecting Profiling Data @@ -12,7 +12,7 @@ To enable profiling data collection, user need to set the `VLLM_TORCH_PROFILER_D export VLLM_TORCH_PROFILER_DIR=/path/to/save/vllm_profile ``` -After setting the variable, Run the following command to launch the vLLM MindSpore service. We take [Qwen2.5-32B](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct) as an example: +After setting the variable, Run the following command to launch the vLLM-MindSpore Plugin service. We take [Qwen2.5-32B](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct) as an example: ```bash export TENSOR_PARALLEL_SIZE=4 diff --git a/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/quantization/quantization.md b/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/quantization/quantization.md index fa1b8f89c3..fb135bfe73 100644 --- a/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/quantization/quantization.md +++ b/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/quantization/quantization.md @@ -24,7 +24,7 @@ After obtaining the DeepSeek-R1 W8A8 weights, ensure they are stored in the rela ### Offline Inference -Refer to the [Installation Guide](../../../getting_started/installation/installation.md) to set up the vLLM MindSpore environment. User need to set the following environment variables: +Refer to the [Installation Guide](../../../getting_started/installation/installation.md) to set up the vLLM-MindSpore Plugin environment. User need to set the following environment variables: ```bash export vLLM_MODEL_BACKEND=MindFormers # use MindSpore Transformers as model backend. diff --git a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md index b47e8a9e61..01ece4828f 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md @@ -2,10 +2,10 @@ [![查看源文件](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source.svg)](https://gitee.com/mindspore/docs/blob/master/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md) -本文档将介绍vLLM MindSpore的[版本配套](#版本配套),vLLM MindSpore的安装步骤,与[快速验证](#快速验证)用例,用于验证安装是否成功。其中安装步骤分为两种安装方式: +本文档将介绍vLLM-MindSpore插件的[版本配套](#版本配套),vLLM-MindSpore插件的安装步骤,与[快速验证](#快速验证)用例,用于验证安装是否成功。其中安装步骤分为两种安装方式: - [docker安装](#docker安装):适合用户快速使用的场景; -- [源码安装](#源码安装):适合用户有增量开发vLLM MindSpore的场景。 +- [源码安装](#源码安装):适合用户有增量开发vLLM-MindSpore插件的场景。 ## 版本配套 @@ -21,15 +21,15 @@ |[MindSpore Transformers](https://gitee.com/mindspore/mindformers)| [1.6.0](https://gitee.com/mindspore/mindformers) | |[Golden Stick](https://gitee.com/mindspore/golden-stick)| [1.2.0](https://repo.mindspore.cn/mindspore/golden-stick/version/202508/20250814/master_20250814010017_2713821db982330b3bcd6d84d85a3b337d555f27_newest/any/) | |[vLLM](https://github.com/vllm-project/vllm) | [0.9.1](https://repo.mindspore.cn/mirrors/vllm/version/202507/20250715/v0.9.1/any/) | - |[vLLM MindSpore](https://gitee.com/mindspore/vllm-mindspore) | [0.3.0](https://gitee.com/mindspore/vllm-mindspore/) | + |[vLLM-MindSpore插件](https://gitee.com/mindspore/vllm-mindspore) | [0.3.0](https://gitee.com/mindspore/vllm-mindspore/) | ## docker安装 -在本章节中,我们推荐用docker创建的方式,以快速部署vLLM MindSpore环境,以下是部署docker的步骤介绍: +在本章节中,我们推荐用docker创建的方式,以快速部署vLLM-MindSpore插件环境,以下是部署docker的步骤介绍: ### 构建镜像 -用户可执行以下命令,拉取vLLM 
MindSpore代码仓库,并构建镜像: +用户可执行以下命令,拉取vLLM-MindSpore插件代码仓库,并构建镜像: ```bash git clone https://gitee.com/mindspore/vllm-mindspore.git @@ -121,13 +121,13 @@ vLLM的环境配置与安装方法,请参考[vLLM安装教程](https://docs.vl yum install -y gcc gcc-c++ ``` -### vLLM MindSpore安装 +### vLLM-MindSpore插件安装 -vLLM MindSpore有以下两种安装方式。**vLLM MindSpore快速安装**适用于用户快速使用与部署的场景。**vLLM MindSpore手动安装**适用于用户对组件有自定义修改的场景。 +vLLM-MindSpore插件有以下两种安装方式。**vLLM-MindSpore插件快速安装**适用于用户快速使用与部署的场景。**vLLM-MindSpore插件手动安装**适用于用户对组件有自定义修改的场景。 -- **vLLM MindSpore快速安装** +- **vLLM-MindSpore插件快速安装** - 采用快速安装脚本来安装vLLM MindSpore,需要在拉取vLLM MindSpore源码后,执行以下命令,安装依赖包: + 采用快速安装脚本来安装vLLM-MindSpore插件,需要在拉取vLLM-MindSpore插件源码后,执行以下命令,安装依赖包: ```bash git clone https://gitee.com/mindspore/vllm-mindspore.git @@ -135,7 +135,7 @@ vLLM MindSpore有以下两种安装方式。**vLLM MindSpore快速安装**适用 bash install_depend_pkgs.sh ``` - 编译安装vLLM MindSpore: + 编译安装vLLM-MindSpore插件: ```bash pip install . @@ -147,9 +147,9 @@ vLLM MindSpore有以下两种安装方式。**vLLM MindSpore快速安装**适用 export PYTHONPATH=$MF_PATH:$PYTHONPATH ``` -- **vLLM MindSpore手动安装** +- **vLLM-MindSpore插件手动安装** - 若用户对组件有修改,或者需使用其他版本,则用户需要按照特定顺序,手动安装组件。vLLM MindSpore软件配套下载地址可以参考[版本配套](#版本配套),且对组件的安装顺序要求如下: + 若用户对组件有修改,或者需使用其他版本,则用户需要按照特定顺序,手动安装组件。vLLM-MindSpore插件软件配套下载地址可以参考[版本配套](#版本配套),且对组件的安装顺序要求如下: 1. 安装vLLM @@ -188,9 +188,9 @@ vLLM MindSpore有以下两种安装方式。**vLLM MindSpore快速安装**适用 pip install /path/to/msadapter-*.whl ``` - 7. 安装vLLM MindSpore + 7. 安装vLLM-MindSpore插件 - 需要先拉取vLLM MindSpore源码,再执行安装 + 需要先拉取vLLM-MindSpore插件源码,再执行安装 ```bash git clone https://gitee.com/mindspore/vllm-mindspore.git diff --git a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md index c7d8426f6d..c7ec37a452 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md @@ -2,15 +2,15 @@ [![查看源文件](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source.svg)](https://gitee.com/mindspore/docs/blob/master/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md) -本文档将为用户介绍使用vLLM MindSpore进行单卡推理流程。以[Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct)模型为例,用户通过以下[docker安装](#docker安装)章节,或[安装指南](../../installation/installation.md#安装指南)进行环境配置,并[下载模型权重](#下载模型权重)。在[设置环境变量](#设置环境变量)之后,可进行[离线推理](#离线推理)与[在线推理](#在线推理),以体验单卡推理功能。 +本文档将为用户介绍使用vLLM-MindSpore插件进行单卡推理流程。以[Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct)模型为例,用户通过以下[docker安装](#docker安装)章节,或[安装指南](../../installation/installation.md#安装指南)进行环境配置,并[下载模型权重](#下载模型权重)。在[设置环境变量](#设置环境变量)之后,可进行[离线推理](#离线推理)与[在线推理](#在线推理),以体验单卡推理功能。 ## docker安装 -在本章节中,我们推荐用docker创建的方式,以快速部署vLLM MindSpore环境。以下是部署docker的步骤介绍: +在本章节中,我们推荐用docker创建的方式,以快速部署vLLM-MindSpore插件环境。以下是部署docker的步骤介绍: ### 构建镜像 -用户可执行以下命令,拉取vLLM MindSpore代码仓库,并构建镜像: +用户可执行以下命令,拉取vLLM-MindSpore插件代码仓库,并构建镜像: ```bash git clone https://gitee.com/mindspore/vllm-mindspore.git @@ -134,7 +134,7 @@ export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Tra 以下是对上述环境变量的解释: -- `vLLM_MODEL_BACKEND`:所运行的模型后端。目前vLLM MindSpore所支持的模型与模型后端,可在[模型支持列表](../../../user_guide/supported_models/models_list/models_list.md)中进行查询; +- 
`vLLM_MODEL_BACKEND`:所运行的模型后端。目前vLLM-MindSpore插件所支持的模型与模型后端,可在[模型支持列表](../../../user_guide/supported_models/models_list/models_list.md)中进行查询; - `MINDFORMERS_MODEL_CONFIG`:模型配置文件。用户可以在[MindSpore Transformers工程](https://gitee.com/mindspore/mindformers/tree/master/research/qwen2_5)中,找到对应模型的yaml文件。以Qwen2.5-7B为例,则其yaml文件为[predict_qwen2_5_7b_instruct.yaml](https://gitee.com/mindspore/mindformers/blob/master/research/qwen2_5/predict_qwen2_5_7b_instruct.yaml) 。 用户可通过`npu-smi info`查看显存占用情况,并可以使用如下环境变量,设置用于推理的计算卡: @@ -184,7 +184,7 @@ Prompt: 'Llama is'. Generated text: ' a 100% natural, biodegradable, and compost ## 在线推理 -vLLM MindSpore可使用OpenAI的API协议,部署为在线推理。以下是以[Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) 为例,介绍模型的[启动服务](#启动服务),并[发送请求](#发送请求),得到在线推理的推理结果。 +vLLM-MindSpore插件可使用OpenAI的API协议,部署为在线推理。以下是以[Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) 为例,介绍模型的[启动服务](#启动服务),并[发送请求](#发送请求),得到在线推理的推理结果。 ### 启动服务 diff --git a/docs/vllm_mindspore/docs/source_zh_cn/index.rst b/docs/vllm_mindspore/docs/source_zh_cn/index.rst index 0f16e9f4ce..921322682e 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/index.rst +++ b/docs/vllm_mindspore/docs/source_zh_cn/index.rst @@ -1,19 +1,19 @@ -vLLM MindSpore 文档 +vLLM-MindSpore插件文档 ========================================= -vLLM MindSpore 简介 +vLLM-MindSpore插件简介 ----------------------------------------------------- -vLLM MindSpore插件(`vllm-mindspore`)是一个由 `MindSpore社区 `_ 孵化的vLLM后端插件。其将基于MindSpore构建的大模型推理能力接入 `vLLM `_ ,从而有机整合MindSpore和vLLM的技术优势,提供全栈开源、高性能、易用的大模型推理解决方案。 +vLLM-MindSpore插件(`vllm-mindspore`)是一个由 `MindSpore社区 `_ 孵化的vLLM后端插件。其将基于MindSpore构建的大模型推理能力接入 `vLLM `_ ,从而有机整合MindSpore和vLLM的技术优势,提供全栈开源、高性能、易用的大模型推理解决方案。 vLLM是由加州大学伯克利分校Sky Computing Lab创建的社区开源项目,已广泛用于学术研究和工业应用。vLLM以Continuous Batching调度机制和PagedAttention Key-Value缓存管理为基础,提供了丰富的推理服务功能,包括投机推理、Prefix Caching、Multi-LoRA等。同时,vLLM已支持种类丰富的开源大模型,包括Transformer类(如LLaMa)、混合专家类(如DeepSeek)、Embedding类(如E5-Mistral)、多模态类(如LLaVA)等。由于vLLM选用PyTorch构建大模型和管理计算存储资源,此前无法使用其部署基于MindSpore大模型的推理服务。 -vLLM MindSpore插件以将MindSpore大模型接入vLLM,并实现服务化部署为功能目标。其遵循以下设计原则: +vLLM-MindSpore插件以将MindSpore大模型接入vLLM,并实现服务化部署为功能目标。其遵循以下设计原则: - 接口兼容:支持vLLM原生的API和服务部署接口,避免新增配置文件或接口,降低用户学习成本和确保易用性。 - 最小化侵入式修改:尽可能避免侵入式修改vLLM代码,以保障系统的可维护性和可演进性。 - 组件解耦:最小化和规范化MindSpore大模型组件和vLLM服务组件的耦合面,以利于多种MindSpore大模型套件接入。 -基于上述设计原则,vLLM MindSpore采用如下图所示的系统架构,分组件类别实现vLLM与MindSpore的对接: +基于上述设计原则,vLLM-MindSpore插件采用如下图所示的系统架构,分组件类别实现vLLM与MindSpore的对接: - 服务化组件:通过将LLM Engine、Scheduler等服务化组件中的PyTorch API调用映射至MindSpore能力调用,继承支持包括Continuous Batching、PagedAttention在内的服务化功能。 - 大模型组件:通过注册或替换模型、网络层、自定义算子等组件,将MindSpore Transformers、MindSpore One等MindSpore大模型套件和自定义大模型接入vLLM。 @@ -28,7 +28,7 @@ vLLM MindSpore插件以将MindSpore大模型接入vLLM,并实现服务化部 -vLLM MindSpore采用vLLM社区推荐的插件机制,实现能力注册。未来期望遵循 `RPC Multi-framework support for vllm `_ 所述原则。 +vLLM-MindSpore插件采用vLLM社区推荐的插件机制,实现能力注册。未来期望遵循 `RPC Multi-framework support for vllm `_ 所述原则。 代码仓地址: @@ -41,8 +41,8 @@ vLLM MindSpore采用vLLM社区推荐的插件机制,实现能力注册。未 * Python >= 3.9, < 3.12 * CANN >= 8.0.0.beta1 - * MindSpore (与vLLM MindSpore版本配套) - * vLLM (与vLLM MindSpore版本配套) + * MindSpore (与vLLM-MindSpore插件版本配套) + * vLLM (与vLLM-MindSpore插件版本配套) 快速体验 ----------------------------------------------------- @@ -56,7 +56,7 @@ vLLM MindSpore采用vLLM社区推荐的插件机制,实现能力注册。未 分支策略 ----------------------------------------------------- -vLLM MindSpore代码仓包含主干分支、开发分支、版本分支: +vLLM-MindSpore插件代码仓包含主干分支、开发分支、版本分支: - **main**: 主干分支,与MindSpore master分支和vLLM v0.9.1版本配套,并通过昇腾+昇思CI持续进行质量看护; - **develop**: 
开发分支,在vLLM部分新版本发布时从主干分支拉出,用于开发适配vLLM的新功能特性。待特性适配稳定后合入主干分支。当前开发分支正在适配vLLM v0.9.1版本; -- Gitee From 83d3e032f1cbd3fff4efdae8666c253ba9725ed7 Mon Sep 17 00:00:00 2001 From: horcam Date: Mon, 18 Aug 2025 15:53:34 +0800 Subject: [PATCH 11/12] update version compatibility --- .../getting_started/installation/installation.md | 14 +++++++------- .../getting_started/installation/installation.md | 16 +++++++++------- 2 files changed, 16 insertions(+), 14 deletions(-) diff --git a/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md b/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md index de7bdda373..20315c3d3b 100644 --- a/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md +++ b/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md @@ -15,13 +15,13 @@ This document will introduce the [Version Matching](#version-compatibility) of v | Software | Version And Links | | ----- | ----- | - |[CANN](https://www.hiascend.com/developer/download/community/result?module=cann) | [8.1.RC1](https://www.hiascend.com/document/detail/zh/canncommercial/81RC1/softwareinst/instg/instg_0000.html?Mode=PmIns&InstallType=local&OS=Debian&Software=cannToolKit) | - |[MindSpore](https://www.mindspore.cn/install/) | [2.7.0](https://repo.mindspore.cn/mindspore/mindspore/version/202508/20250814/master_20250814091143_7548abc43af03319bfa528fc96d0ccd3917fcc9c_newest/unified/) | - |[MSAdapter](https://git.openi.org.cn/OpenI/MSAdapter)| [0.5.0](https://repo.mindspore.cn/mindspore/msadapter/version/202508/20250814/master_20250814010018_4615051c43eef898b6bbdc69768656493b5932f8_newest/any/) | - |[MindSpore Transformers](https://gitee.com/mindspore/mindformers)| [1.6.0](https://gitee.com/mindspore/mindformers) | - |[Golden Stick](https://gitee.com/mindspore/golden-stick)| [1.2.0](https://repo.mindspore.cn/mindspore/golden-stick/version/202508/20250814/master_20250814010017_2713821db982330b3bcd6d84d85a3b337d555f27_newest/any/) | - |[vLLM](https://github.com/vllm-project/vllm) | [0.9.1](https://repo.mindspore.cn/mirrors/vllm/version/202507/20250715/v0.9.1/any/) | - |[vLLM-MindSpore Plugin](https://gitee.com/mindspore/vllm-mindspore) | [0.3.0](https://gitee.com/mindspore/vllm-mindspore/) | + | CANN | [8.1.RC1](https://www.hiascend.com/document/detail/zh/canncommercial/81RC1/softwareinst/instg/instg_0000.html?Mode=PmIns&InstallType=local&OS=Debian&Software=cannToolKit) | + | MindSpore | [2.7.0](https://repo.mindspore.cn/mindspore/mindspore/version/202508/20250814/master_20250814091143_7548abc43af03319bfa528fc96d0ccd3917fcc9c_newest/unified/) | + | MSAdapter | [0.5.0](https://repo.mindspore.cn/mindspore/msadapter/version/202508/20250814/master_20250814010018_4615051c43eef898b6bbdc69768656493b5932f8_newest/any/) | + | MindSpore Transformers | [1.6.0](https://gitee.com/mindspore/mindformers) | + | Golden Stick | [1.2.0](https://repo.mindspore.cn/mindspore/golden-stick/version/202508/20250814/master_20250814010017_2713821db982330b3bcd6d84d85a3b337d555f27_newest/any/) | + | vLLM | [0.9.1](https://repo.mindspore.cn/mirrors/vllm/version/202507/20250715/v0.9.1/any/) | + | vLLM-MindSpore Plugin | [0.3.0](https://gitee.com/mindspore/vllm-mindspore/) | ## Docker Installation diff --git a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md index 01ece4828f..0c77217add 100644 --- 
a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md @@ -15,13 +15,15 @@ | 软件 | 配套版本与下载链接 | | ----- | ----- | - |[CANN](https://www.hiascend.com/developer/download/community/result?module=cann) | [8.1.RC1](https://www.hiascend.com/document/detail/zh/canncommercial/81RC1/softwareinst/instg/instg_0000.html?Mode=PmIns&InstallType=local&OS=Debian&Software=cannToolKit) | - |[MindSpore](https://www.mindspore.cn/install/) | [2.7.0](https://repo.mindspore.cn/mindspore/mindspore/version/202508/20250814/master_20250814091143_7548abc43af03319bfa528fc96d0ccd3917fcc9c_newest/unified/) | - |[MSAdapter](https://git.openi.org.cn/OpenI/MSAdapter)| [0.5.0](https://repo.mindspore.cn/mindspore/msadapter/version/202508/20250814/master_20250814010018_4615051c43eef898b6bbdc69768656493b5932f8_newest/any/) | - |[MindSpore Transformers](https://gitee.com/mindspore/mindformers)| [1.6.0](https://gitee.com/mindspore/mindformers) | - |[Golden Stick](https://gitee.com/mindspore/golden-stick)| [1.2.0](https://repo.mindspore.cn/mindspore/golden-stick/version/202508/20250814/master_20250814010017_2713821db982330b3bcd6d84d85a3b337d555f27_newest/any/) | - |[vLLM](https://github.com/vllm-project/vllm) | [0.9.1](https://repo.mindspore.cn/mirrors/vllm/version/202507/20250715/v0.9.1/any/) | - |[vLLM-MindSpore插件](https://gitee.com/mindspore/vllm-mindspore) | [0.3.0](https://gitee.com/mindspore/vllm-mindspore/) | + | CANN | [8.1.RC1](https://www.hiascend.com/document/detail/zh/canncommercial/81RC1/softwareinst/instg/instg_0000.html?Mode=PmIns&InstallType=local&OS=Debian&Software=cannToolKit) | + | MindSpore | [2.7.0](https://repo.mindspore.cn/mindspore/mindspore/version/202508/20250814/master_20250814091143_7548abc43af03319bfa528fc96d0ccd3917fcc9c_newest/unified/) | + | MSAdapter| [0.5.0](https://repo.mindspore.cn/mindspore/msadapter/version/202508/20250814/master_20250814010018_4615051c43eef898b6bbdc69768656493b5932f8_newest/any/) | + | MindSpore Transformers | [1.6.0](https://gitee.com/mindspore/mindformers) | + | Golden Stick | [1.2.0](https://repo.mindspore.cn/mindspore/golden-stick/version/202508/20250814/master_20250814010017_2713821db982330b3bcd6d84d85a3b337d555f27_newest/any/) | + | vLLM | [0.9.1](https://repo.mindspore.cn/mirrors/vllm/version/202507/20250715/v0.9.1/any/) | + | vLLM-MindSpore插件 | [0.3.0](https://gitee.com/mindspore/vllm-mindspore/) | + + ## docker安装 -- Gitee From d1db5ed321730c4ffe86ea2a532ddcad691da279 Mon Sep 17 00:00:00 2001 From: horcam Date: Mon, 18 Aug 2025 16:20:58 +0800 Subject: [PATCH 12/12] rename vLLM MindSpore Plugin2 --- docs/vllm_mindspore/docs/source_en/index.rst | 2 +- .../source_en/release_notes/release_notes.md | 2 +- .../supported_features/benchmark/benchmark.md | 2 +- docs/vllm_mindspore/docs/source_zh_cn/conf.py | 4 +-- .../developer_guide/contributing.md | 16 +++++----- .../developer_guide/operations/custom_ops.md | 22 +++++++------- .../docs/source_zh_cn/general/security.md | 30 +++++++++---------- .../installation/installation.md | 2 -- .../quick_start/quick_start.md | 10 +++---- .../deepseek_r1_671b_w8a8_dp4_tp4_ep4.md | 12 ++++---- .../qwen2.5_32b_multiNPU.md | 10 +++---- .../release_notes/release_notes.md | 4 +-- .../environment_variables.md | 2 +- .../supported_features/benchmark/benchmark.md | 10 +++---- .../features_list/features_list.md | 6 ++-- .../supported_features/profiling/profiling.md | 4 +-- .../quantization/quantization.md | 2 +- 17 files changed, 
69 insertions(+), 71 deletions(-) diff --git a/docs/vllm_mindspore/docs/source_en/index.rst b/docs/vllm_mindspore/docs/source_en/index.rst index b0d3504957..a373198a6d 100644 --- a/docs/vllm_mindspore/docs/source_en/index.rst +++ b/docs/vllm_mindspore/docs/source_en/index.rst @@ -7,7 +7,7 @@ vLLM-MindSpore Plugin (`vllm-mindspore`) is a plugin brewed by the `MindSpore co vLLM, an opensource and community-driven project initiated by Sky Computing Lab, UC Berkeley, has been widely used in academic research and industry applications. On the basis of Continuous Batching scheduling mechanism and PagedAttention Key-Value cache management, vLLM provides a rich set of inference service features, including speculative inference, Prefix Caching, Multi-LoRA, etc. vLLM also supports a wide range of open-source large models, including Transformer-based models (e.g., LLaMa), Mixture-of-Expert models (e.g., DeepSeek), Embedding models (e.g., E5-Mistral), and multi-modal models (e.g., LLaVA). Because vLLM chooses to use PyTorch to build large models and manage storage resources, it cannot deploy large models built upon MindSpore. -vLLM-MindSpore Plugin plugin aims to integrate Mindspore large models into vLLM and to enable deploying MindSpore-based LLM inference services. It follows the following design principles: +vLLM-MindSpore Plugin aims to integrate Mindspore large models into vLLM and to enable deploying MindSpore-based LLM inference services. It follows the following design principles: - Interface compatibility: support the native APIs and service deployment interfaces of vLLM to avoid adding new configuration files or interfaces, reducing user learning costs and ensuring ease of use. - Minimal invasive modifications: minimize invasive modifications to the vLLM code to ensure system maintainability and evolvability. diff --git a/docs/vllm_mindspore/docs/source_en/release_notes/release_notes.md b/docs/vllm_mindspore/docs/source_en/release_notes/release_notes.md index 2b68ef86d8..ea96f2fb13 100644 --- a/docs/vllm_mindspore/docs/source_en/release_notes/release_notes.md +++ b/docs/vllm_mindspore/docs/source_en/release_notes/release_notes.md @@ -4,7 +4,7 @@ ## vLLM-MindSpore Plugin 0.3.0 Release Notes -The following are the key new features and models supported in the vLLM-MindSpore Plugin plugin version 0.3.0. +The following are the key new features and models supported in the vLLM-MindSpore Plugin version 0.3.0. ### New Features diff --git a/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/benchmark/benchmark.md b/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/benchmark/benchmark.md index e6055ce438..e72a1fee54 100644 --- a/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/benchmark/benchmark.md +++ b/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/benchmark/benchmark.md @@ -35,7 +35,7 @@ INFO: Waiting for application startup. INFO: Application startup complete. 
``` -Clone the vLLM repository and import the vLLM-MindSpore Plugin plugin to reuse the benchmark tools: +Clone the vLLM repository and import the vLLM-MindSpore Plugin to reuse the benchmark tools: ```bash export VLLM_BRANCH=v0.9.1 diff --git a/docs/vllm_mindspore/docs/source_zh_cn/conf.py b/docs/vllm_mindspore/docs/source_zh_cn/conf.py index 509c2877de..2ccb114608 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/conf.py +++ b/docs/vllm_mindspore/docs/source_zh_cn/conf.py @@ -23,9 +23,9 @@ from sphinx.ext import autodoc as sphinx_autodoc # -- Project information ----------------------------------------------------- -project = 'vLLM MindSpore' +project = 'vLLM-MindSpore插件' copyright = 'MindSpore' -author = 'vLLM MindSpore' +author = 'vLLM-MindSpore插件' # The full version, including alpha/beta/rc tags release = 'master' diff --git a/docs/vllm_mindspore/docs/source_zh_cn/developer_guide/contributing.md b/docs/vllm_mindspore/docs/source_zh_cn/developer_guide/contributing.md index aef574b259..2b49068104 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/developer_guide/contributing.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/developer_guide/contributing.md @@ -14,11 +14,11 @@ ## 增加新模型 -若希望将一个新模型合入vLLM MindSpore代码仓库,需要注意几点: +若希望将一个新模型合入vLLM-MindSpore插件代码仓库,需要注意几点: - **文件格式及位置要遵循规范。** 模型代码文件统一放置于`vllm_mindspore/model_executor`文件夹下,请根据不同模型将代码文件放置于对应的文件夹下。 -- **模型基于MindSpore接口实现,支持jit静态图方式执行。** vLLM MindSpore中的模型定义实现需基于MindSpore接口实现。由于MindSpore静态图模式执行性能有优势,因此模型需支持@jit静态图方式执行。详细可参考[Qwen2.5](https://gitee.com/mindspore/vllm-mindspore/blob/master/vllm_mindspore/model_executor/models/qwen2.py)模型定义实现。 -- **将新模型在vLLM MindSpore代码中进行注册。** 模型结构定义实现后,需要将该模型注册到vLLM MindSpore中,注册文件位于'vllm_mindspore/model_executor/models/registry.py'中,请将模型注册到`_NATIVE_MODELS`。 +- **模型基于MindSpore接口实现,支持jit静态图方式执行。** vLLM-MindSpore插件中的模型定义实现需基于MindSpore接口实现。由于MindSpore静态图模式执行性能有优势,因此模型需支持@jit静态图方式执行。详细可参考[Qwen2.5](https://gitee.com/mindspore/vllm-mindspore/blob/master/vllm_mindspore/model_executor/models/qwen2.py)模型定义实现。 +- **将新模型在vLLM-MindSpore插件代码中进行注册。** 模型结构定义实现后,需要将该模型注册到vLLM-MindSpore插件中,注册文件位于'vllm_mindspore/model_executor/models/registry.py'中,请将模型注册到`_NATIVE_MODELS`。 - **编写单元测试。** 新增的模型需同步提交单元测试用例,用例编写请参考[Qwen2.5模型用例](https://gitee.com/mindspore/vllm-mindspore/blob/master/tests/st/python/cases_parallel/vllm_qwen_7b.py)。 ## 贡献流程 @@ -29,13 +29,13 @@ - **编码指南:** 使用vLLM社区代码检查工具:yapf、codespell、ruff、isort和mypy。更多信息可参考[检查工具链使用说明](https://gitee.com/mindspore/vllm-mindspore/blob/master/codecheck_toolkits/README.md)。 -- **单元测试指南:** vLLM MindSpore使用Python单元测试框架[pytest](http://www.pytest.org/en/latest/)。注释名称需反映测试用例的设计意图。 +- **单元测试指南:** vLLM-MindSpore插件使用Python单元测试框架[pytest](http://www.pytest.org/en/latest/)。注释名称需反映测试用例的设计意图。 - **重构指南:** 我们鼓励开发人员重构我们的代码,以消除[代码坏味道](https://zh.wikipedia.org/wiki/%E4%BB%A3%E7%A0%81%E5%BC%82%E5%91%B3)。所有代码都要符合编码风格和测试风格,重构代码也不例外。 ### Fork-Pull开发模型 -- **Fork vLLM MindSpore代码仓:** 在提交代码至vLLM MindSpore项目之前,请确保已fork此项目到您自己的代码仓。vLLM MindSpore代码仓和您自己的代码仓之间可能会并行开发,请注意它们之间的一致性。 +- **Fork vLLM-MindSpore插件代码仓:** 在提交代码至vLLM-MindSpore插件项目之前,请确保已fork此项目到您自己的代码仓。vLLM-MindSpore插件代码仓和您自己的代码仓之间可能会并行开发,请注意它们之间的一致性。 - **克隆远程代码仓:** 如果您想将代码下载到本地计算机,最好使用git方法: @@ -63,7 +63,7 @@ git push origin {新分支名称} ``` -- **将请求拉取到vLLM MindSpore代码仓:** 在最后一步中,您需要在新分支和vLLM MindSpore主分支之间拉取比较请求然后创建PR。提交PR提交后,需要在评论中通过`/retest`手动触发门禁检查,进行构建测试。PR应该尽快合并到上游master分支中,以降低合并的风险。 +- **将请求拉取到vLLM-MindSpore插件代码仓:** 在最后一步中,您需要在新分支和vLLM-MindSpore插件主分支之间拉取比较请求然后创建PR。提交PR提交后,需要在评论中通过`/retest`手动触发门禁检查,进行构建测试。PR应该尽快合并到上游master分支中,以降低合并的风险。 ### 报告Issue @@ 
-71,7 +71,7 @@ 报告issue时,请参考以下格式: -- 说明您使用的环境版本(vLLM MindSpore、MindSpore TransFormers、MindSpore、OS、Python等); +- 说明您使用的环境版本(vLLM-MindSpore插件、MindSpore TransFormers、MindSpore、OS、Python等); - 说明是错误报告还是功能需求; - 说明issue类型,添加标签可以在issue板上突出显示该issue; - 问题是什么; @@ -99,4 +99,4 @@ - 确保您的分支与主分支始终一致。 - 用于修复错误的PR中,确保已关联所有相关问题。 -最后,感谢您对为vLLM MindSpore项目做出贡献的兴趣,我们欢迎并重视任何形式的贡献与合作。 +最后,感谢您对为vLLM-MindSpore插件项目做出贡献的兴趣,我们欢迎并重视任何形式的贡献与合作。 diff --git a/docs/vllm_mindspore/docs/source_zh_cn/developer_guide/operations/custom_ops.md b/docs/vllm_mindspore/docs/source_zh_cn/developer_guide/operations/custom_ops.md index 357fbb0ed1..3e08c978c8 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/developer_guide/operations/custom_ops.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/developer_guide/operations/custom_ops.md @@ -4,15 +4,15 @@ 当内置算子不满足需求时,你可以利用MindSpore提供的自定义算子功能接入你的算子。 -本文档将以 **`advance_step_flashattn`** 算子为例,讲解如何在 vLLM MindSpore 项目中接入一个AscendC自定义算子。 +本文档将以 **`advance_step_flashattn`** 算子为例,讲解如何在vLLM-MindSpore插件项目中接入一个AscendC自定义算子。 -本文重点在于介绍把算子集成进vLLM MindSpore的流程,自定义算子的细节请参考 MindSpore 官方教程:[基于CustomOpBuilder的自定义算子](https://www.mindspore.cn/tutorials/zh-CN/master/custom_program/operation/op_customopbuilder.html)。AscendC算子的开发流程请参考昇腾官方文档:[Ascend C算子开发](https://www.hiascend.com/document/detail/zh/canncommercial/81RC1/developmentguide/opdevg/Ascendcopdevg/atlas_ascendc_10_0001.html)。 +本文重点在于介绍把算子集成进vLLM-MindSpore插件的流程,自定义算子的细节请参考 MindSpore 官方教程:[基于CustomOpBuilder的自定义算子](https://www.mindspore.cn/tutorials/zh-CN/master/custom_program/operation/op_customopbuilder.html)。AscendC算子的开发流程请参考昇腾官方文档:[Ascend C算子开发](https://www.hiascend.com/document/detail/zh/canncommercial/81RC1/developmentguide/opdevg/Ascendcopdevg/atlas_ascendc_10_0001.html)。 -**注:目前vLLM MindSpore的自定义算子仅支持动态图(PyNative Mode)场景。** +**注:目前vLLM-MindSpore插件的自定义算子仅支持动态图(PyNative Mode)场景。** ## 文件组织结构 -接入自定义算子需要在 vLLM MindSpore 项目的 `csrc` 目录下添加代码,目录结构如下: +接入自定义算子需要在vLLM-MindSpore插件项目的 `csrc` 目录下添加代码,目录结构如下: ```text vllm-mindspore/ @@ -111,11 +111,11 @@ VLLM_MS_EXTENSION_MODULE(m) { 上面`m.def()`接口的第一个参数`"advance_step_flashattn"`就是算子的Python接口名。 -`module.h` 和 `module.cpp` 文件的作用是基于pybind11创建算子的Python模块。因为一个动态库内只能有一个 `PYBIND11_MODULE` ,为了让用户可以在一个文件内完成算子接入工作,vLLM MindSpore提供了一个新的注册接口 `VLLM_MS_EXTENSION_MODULE` 宏。自定义算子动态库加载时,所有算子接口都会被自动注册到同一个Python模块中。 +`module.h` 和 `module.cpp` 文件的作用是基于pybind11创建算子的Python模块。因为一个动态库内只能有一个 `PYBIND11_MODULE` ,为了让用户可以在一个文件内完成算子接入工作,vLLM-MindSpore插件提供了一个新的注册接口 `VLLM_MS_EXTENSION_MODULE` 宏。自定义算子动态库加载时,所有算子接口都会被自动注册到同一个Python模块中。 ### 算子调用接口 -vLLM MindSpore的自定义算子被编译到了 `_C_ops.so` 里面,为了方便调用,可以在 `vllm_mindspore/_custom_ops.py` 添加一个调用接口。如果在算子调用前后需要做额外适配,也可以在这接口内实现。 +vLLM-MindSpore插件的自定义算子被编译到了 `_C_ops.so` 里面,为了方便调用,可以在 `vllm_mindspore/_custom_ops.py` 添加一个调用接口。如果在算子调用前后需要做额外适配,也可以在这接口内实现。 ```python def advance_step_flashattn(num_seqs: int, num_queries: int, block_size: int, @@ -142,8 +142,8 @@ def advance_step_flashattn(num_seqs: int, num_queries: int, block_size: int, ### 算子编译和测试 -1. **代码集成**:将代码集成至 vLLM MindSpore 项目。 -2. **编译项目**:在项目代码根目录下,执行 `pip install .` ,编译安装vLLM MindSpore。 +1. **代码集成**:将代码集成至vLLM-MindSpore插件项目。 +2. **编译项目**:在项目代码根目录下,执行 `pip install .` ,编译安装vLLM-MindSpore插件。 3. **测试算子接口**:通过 `_custom_ops` 调用算子接口,可以参考测试用例[test_custom_advstepflash.py](https://gitee.com/mindspore/vllm-mindspore/blob/master/tests/st/python/test_custom_advstepflash.py): ```python @@ -154,18 +154,18 @@ custom_ops.advance_step_flashattn(...) 
## 自定义算子编译工程 -当前MindSpore仅提供了一个 [CustomOpBuilder接口](https://www.mindspore.cn/docs/zh-CN/master/api_python/ops/mindspore.ops.CustomOpBuilder.html) 用于在线编译自定义算子,接口内置了默认的编译和链接选项。vLLM MindSpore基于MindSpore的自定义算子功能接入算子,并编译成动态库随包发布。下面是编译流程介绍: +当前MindSpore仅提供了一个 [CustomOpBuilder接口](https://www.mindspore.cn/docs/zh-CN/master/api_python/ops/mindspore.ops.CustomOpBuilder.html) 用于在线编译自定义算子,接口内置了默认的编译和链接选项。vLLM-MindSpore插件基于MindSpore的自定义算子功能接入算子,并编译成动态库随包发布。下面是编译流程介绍: ### 算子扩展库模块 -在 `setup.py` 中,vLLM MindSpore添加了一个 `vllm_mindspore._C_ops` 扩展,并添加了相应的编译模块: +在 `setup.py` 中,vLLM-MindSpore插件添加了一个 `vllm_mindspore._C_ops` 扩展,并添加了相应的编译模块: ```python ext_modules = [Extension("vllm_mindspore._C_ops", sources=[])], cmdclass = {"build_ext": CustomBuildExt}, ``` -这里不需要指定 `sources` ,是因为vLLM MindSpore通过CMake触发算子编译,自动收集了源文件。 +这里不需要指定 `sources` ,是因为vLLM-MindSpore插件通过CMake触发算子编译,自动收集了源文件。 ### 算子编译流程 diff --git a/docs/vllm_mindspore/docs/source_zh_cn/general/security.md b/docs/vllm_mindspore/docs/source_zh_cn/general/security.md index 575a3b1bf5..63118b8195 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/general/security.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/general/security.md @@ -2,24 +2,24 @@ [![查看源文件](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source.svg)](https://gitee.com/mindspore/docs/blob/master/docs/vllm_mindspore/docs/source_zh_cn/general/security.md) -通过 vLLM MindSpore 在 Ascend 上使能推理服务时,由于服务化、节点通信、模型执行等必要功能需要使用一些网络端口,因此会存在安全相关的一些问题。 +通过vLLM-MindSpore插件在 Ascend 上使能推理服务时,由于服务化、节点通信、模型执行等必要功能需要使用一些网络端口,因此会存在安全相关的一些问题。 ## 服务化端口配置 -使用 vLLM MindSpore 启动推理服务时,需要相关 IP 与端口信息,包括: +使用vLLM-MindSpore插件启动推理服务时,需要相关 IP 与端口信息,包括: -1. `host`: 配置服务关联的 IP 地址,默认值为 `0.0.0.0`。 +1. `host`: 配置服务关联的IP地址,默认值为 `0.0.0.0`。 2. `port`: 配置服务关联的端口,默认值为 `8000`。 -3. `data-parallel-address`: 配置数据并行管理 IP,默认值为 `127.0.0.1`。 - > 仅在使能了多节点数据并行的 vLLM 中生效。 +3. `data-parallel-address`: 配置数据并行管理IP,默认值为 `127.0.0.1`。 + > 仅在使能了多节点数据并行的vLLM中生效。 4. `data-parallel-rpc-port`: 配置数据并行管理端口,默认值为 `29550`。 - > 仅在使能了多节点数据并行的 vLLM 中生效。 + > 仅在使能了多节点数据并行的vLLM中生效。 ## 节点间通信 -通过 vLLM MindSpore 进行多节点部署时,使用默认配置进行节点间的通信是不安全的,包括以下场景: +通过vLLM-MindSpore插件进行多节点部署时,使用默认配置进行节点间的通信是不安全的,包括以下场景: -1. MindSpore 分布式通信。 +1. MindSpore分布式通信。 2. 模型TP、DP并行下的通信。 为了保证其安全性,应该部署在有足够安全的隔离网络环境中。 @@ -27,28 +27,28 @@ ### vLLM 中关于节点间通信的可配置项 1. 环境变量 - * `VLLM_HOST_IP`: 可配置 vLLM 中进程间用于通信的 IP 地址。主要作用场景是传递给运行框架进行分布式通信组网。 - * `VLLM_DP_MASTER_IP`: 设置数据并行主节点 IP 地址(非服务化启动数据并行场景),默认值为 `127.0.0.1`。 + * `VLLM_HOST_IP`: 可配置vLLM中进程间用于通信的IP地址。主要作用场景是传递给运行框架进行分布式通信组网。 + * `VLLM_DP_MASTER_IP`: 设置数据并行主节点IP地址(非服务化启动数据并行场景),默认值为 `127.0.0.1`。 * `VLLM_DP_MASTER_PORT`: 设置数据并行主节点端口(非服务化启动数据并行场景),默认值为 `0`。 2. 数据并行配置项 - * `data_parallel_master_ip`: 设置数据并行时的主节点 IP 地址,默认值为 `127.0.0.1`。 + * `data_parallel_master_ip`: 设置数据并行时的主节点IP地址,默认值为 `127.0.0.1`。 * `data_parallel_master_port`: 设置数据并行时的主节点端口,默认为 `29500`。 ### 执行框架分布式通信 -需要注意的是,vLLM MindSpore 当前通过 MindSpore 进行分布式通信,与其相关的安全问题应该参考 [MindSpore官网](https://www.mindspore.cn/)。 +需要注意的是,vLLM-MindSpore插件当前通过MindSpore进行分布式通信,与其相关的安全问题应该参考 [MindSpore官网](https://www.mindspore.cn/)。 ## 安全建议 1. 网络隔离 - * 在隔离的专用网络上部署 vLLM 节点。 + * 在隔离的专用网络上部署vLLM节点。 * 通过网络分段防止未经授权的访问。 * 设置恰当的防火墙规则。如: - * 除了 vLLM API 服务监听的端口外,阻止所以其他连接。 + * 除了vLLM API服务监听的端口外,阻止所以其他连接。 * 确保用于内部的通信端口仅被信任的主机或网络访。 * 不向公共互联网或不受信任的网络暴露内部端口。 2. 推荐配置行为 - * 总是配置相关参数,避免使用默认值,如通过 `VLLM_HOST_IP` 设置指定的 IP 地址。 + * 总是配置相关参数,避免使用默认值,如通过 `VLLM_HOST_IP` 设置指定的IP地址。 * 设定防火墙规则,只允许必要的端口有访问权限。 3. 
管理访问权限 * 在部署环境实施物理层和网络层的访问限制。 diff --git a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md index 0c77217add..a979696d10 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md @@ -23,8 +23,6 @@ | vLLM | [0.9.1](https://repo.mindspore.cn/mirrors/vllm/version/202507/20250715/v0.9.1/any/) | | vLLM-MindSpore插件 | [0.3.0](https://gitee.com/mindspore/vllm-mindspore/) | - - ## docker安装 在本章节中,我们推荐用docker创建的方式,以快速部署vLLM-MindSpore插件环境,以下是部署docker的步骤介绍: diff --git a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/quick_start/quick_start.md b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/quick_start/quick_start.md index addd3951d0..f0d19926ba 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/quick_start/quick_start.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/quick_start/quick_start.md @@ -2,15 +2,15 @@ [![查看源文件](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source.svg)](https://gitee.com/mindspore/docs/blob/master/docs/vllm_mindspore/docs/source_zh_cn/getting_started/quick_start/quick_start.md) -本文档将为用户提供快速指引,以[Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct)模型为例,使用[docker](https://www.docker.com/)的安装方式部署vLLM MindSpore,并以[离线推理](#离线推理)与[在线推理](#在线推理)两种方式,快速体验vLLM MindSpore的服务化与推理能力。如用户需要了解更多的安装方式,请参考[安装指南](../installation/installation.md)。 +本文档将为用户提供快速指引,以[Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct)模型为例,使用[docker](https://www.docker.com/)的安装方式部署vLLM-MindSpore插件,并以[离线推理](#离线推理)与[在线推理](#在线推理)两种方式,快速体验vLLM-MindSpore插件的服务化与推理能力。如用户需要了解更多的安装方式,请参考[安装指南](../installation/installation.md)。 ## docker安装 -在本章节中,我们推荐用docker创建的方式,以快速部署vLLM MindSpore环境,以下是部署docker的步骤介绍: +在本章节中,我们推荐用docker创建的方式,以快速部署vLLM-MindSpore插件环境,以下是部署docker的步骤介绍: ### 构建镜像 -用户可执行以下命令,拉取vLLM MindSpore代码仓库,并构建镜像: +用户可执行以下命令,拉取vLLM-MindSpore插件代码仓库,并构建镜像: ```bash git clone https://gitee.com/mindspore/vllm-mindspore.git @@ -137,7 +137,7 @@ export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Tra 以下是对上述环境变量的解释: -- `vLLM_MODEL_BACKEND`:所运行的模型后端。目前vLLM MindSpore所支持的模型与模型后端,可在[模型支持列表](../../user_guide/supported_models/models_list/models_list.md)中进行查询; +- `vLLM_MODEL_BACKEND`:所运行的模型后端。目前vLLM-MindSpore插件所支持的模型与模型后端,可在[模型支持列表](../../user_guide/supported_models/models_list/models_list.md)中进行查询; - `MINDFORMERS_MODEL_CONFIG`:模型配置文件。用户可以在[MindSpore Transformers工程](https://gitee.com/mindspore/mindformers/tree/master/research/qwen2_5)中,找到对应模型的yaml文件。以Qwen2.5-7B为例,则其yaml文件为[predict_qwen2_5_7b_instruct.yaml](https://gitee.com/mindspore/mindformers/blob/master/research/qwen2_5/predict_qwen2_5_7b_instruct.yaml)。 另外,用户需要确保MindSpore Transformers已安装。用户可通过 @@ -188,7 +188,7 @@ Prompt: 'Llama is'. 
Generated text: ' a 100% natural, biodegradable, and compost ### 在线推理 -vLLM MindSpore可使用OpenAI的API协议,进行在线推理部署。以下是以[Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) 为例,介绍模型的[启动服务](#启动服务),并[发送请求](#发送请求),得到在线推理的推理结果。 +vLLM-MindSpore插件可使用OpenAI的API协议,进行在线推理部署。以下是以[Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) 为例,介绍模型的[启动服务](#启动服务),并[发送请求](#发送请求),得到在线推理的推理结果。 #### 启动服务 diff --git a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md index 813a0dd588..87eb81a0a1 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md @@ -2,7 +2,7 @@ [![查看源文件](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source.svg)](https://gitee.com/mindspore/docs/blob/master/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md) -vLLM MindSpore支持张量并行(TP)、数据并行(DP)、专家并行(EP)及其组合配置的混合并行推理,不同并行策略的适用场景可参考[vLLM官方文档](https://docs.vllm.ai/en/latest/configuration/optimization.html#parallelism-strategies)。 +vLLM-MindSpore插件支持张量并行(TP)、数据并行(DP)、专家并行(EP)及其组合配置的混合并行推理,不同并行策略的适用场景可参考[vLLM官方文档](https://docs.vllm.ai/en/latest/configuration/optimization.html#parallelism-strategies)。 本文档将以DeepSeek R1 671B W8A8为例介绍[张量并行](#tp16-张量并行推理)及[混合并行](#混合并行推理)推理流程。DeepSeek R1 671B W8A8模型需使用多个节点资源运行推理模型。为确保各个节点的执行配置(包括模型配置文件路径、Python环境等)一致,推荐通过 docker 镜像创建容器的方式避免执行差异。 @@ -10,11 +10,11 @@ vLLM MindSpore支持张量并行(TP)、数据并行(DP)、专家并行 ## docker安装 -在本章节中,我们推荐用docker创建的方式,以快速部署vLLM MindSpore环境。以下是部署docker的步骤介绍: +在本章节中,我们推荐用docker创建的方式,以快速部署vLLM-MindSpore插件环境。以下是部署docker的步骤介绍: ### 构建镜像 -用户可执行以下命令,拉取vLLM MindSpore代码仓库,并构建镜像: +用户可执行以下命令,拉取vLLM-MindSpore插件代码仓库,并构建镜像: ```bash git clone https://gitee.com/mindspore/vllm-mindspore.git @@ -156,7 +156,7 @@ export MINDFORMERS_MODEL_CONFIG=/path/to/research/deepseek3/deepseek_r1_671b/pre - `HCCL_OP_EXPANSION_MODE`: 配置通信算法的编排展开位置为Device侧的AI Vector Core计算单元。 - `MS_ALLOC_CONF`: 设置内存策略。可参考[MindSpore官网文档](https://www.mindspore.cn/docs/zh-CN/master/api_python/env_var_list.html)。 - `ASCEND_RT_VISIBLE_DEVICES`: 配置每个节点可用device id。用户可使用`npu-smi info`命令进行查询。 -- `vLLM_MODEL_BACKEND`:所运行的模型后端。目前vLLM MindSpore所支持的模型与模型后端,可在[模型支持列表](../../../user_guide/supported_models/models_list/models_list.md)中进行查询。 +- `vLLM_MODEL_BACKEND`:所运行的模型后端。目前vLLM-MindSpore插件所支持的模型与模型后端,可在[模型支持列表](../../../user_guide/supported_models/models_list/models_list.md)中进行查询。 - `MINDFORMERS_MODEL_CONFIG`:模型配置文件。用户可以在[MindSpore Transformers工程](https://gitee.com/mindspore/mindformers/tree/master/research/deepseek3/deepseek_r1_671b)中,找到对应模型的yaml文件[predict_deepseek_r1_671b_w8a8.yaml](https://gitee.com/mindspore/mindformers/blob/master/research/deepseek3/deepseek_r1_671b/predict_deepseek_r1_671b_w8a8.yaml) 。 模型并行策略通过配置文件中的`parallel_config`指定,例如TP16 张量并行配置如下所示: @@ -264,7 +264,7 @@ chmod -R 777 ./Ascend-pyACL_8.0.RC1_linux-aarch64.run #### 启动服务 -vLLM MindSpore可使用OpenAI的API协议,部署为在线推理。以下是在线推理的拉起流程。 +vLLM-MindSpore插件可使用OpenAI的API协议,部署为在线推理。以下是在线推理的拉起流程。 ```bash # 启动配置参数说明 @@ -328,7 +328,7 @@ export MINDFORMERS_MODEL_CONFIG=/path/to/research/deepseek3/deepseek_r1_671b/pre - `HCCL_OP_EXPANSION_MODE`: 配置通信算法的编排展开位置为Device侧的AI Vector Core计算单元。 - `MS_ALLOC_CONF`: 
设置内存策略。可参考[MindSpore官网文档](https://www.mindspore.cn/docs/zh-CN/r2.6.0/api_python/env_var_list.html)。 - `ASCEND_RT_VISIBLE_DEVICES`: 配置每个节点可用device id。用户可使用`npu-smi info`命令进行查询。 -- `vLLM_MODEL_BACKEND`:所运行的模型后端。目前vLLM MindSpore所支持的模型与模型后端,可在[模型支持列表](../../../user_guide/supported_models/models_list/models_list.md)中进行查询。 +- `vLLM_MODEL_BACKEND`:所运行的模型后端。目前vLLM-MindSpore插件所支持的模型与模型后端,可在[模型支持列表](../../../user_guide/supported_models/models_list/models_list.md)中进行查询。 - `MINDFORMERS_MODEL_CONFIG`:模型配置文件。用户可以在[MindSpore Transformers工程](https://gitee.com/mindspore/mindformers/tree/master/research/deepseek3/deepseek_r1_671b)中,找到对应模型的yaml文件[predict_deepseek_r1_671b_w8a8.yaml](https://gitee.com/mindspore/mindformers/blob/master/research/deepseek3/deepseek_r1_671b/predict_deepseek_r1_671b_w8a8_ep4tp4.yaml)。 模型并行策略通过配置文件中的`parallel_config`指定,例如混合并行配置如下所示: diff --git a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md index f0901e2e0f..67ea031dd1 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md @@ -2,15 +2,15 @@ [![查看源文件](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source.svg)](https://gitee.com/mindspore/docs/blob/master/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md) -本文档将为用户介绍使用vLLM MindSpore进行单节点多卡的推理流程。以[Qwen2.5-32B](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct)模型为例,用户通过以下[docker安装](#docker安装)章节,或[安装指南](../../installation/installation.md#安装指南)进行环境配置,并[下载模型权重](#下载模型权重)。在[设置环境变量](#设置环境变量)之后,可部署[在线推理](#在线推理),以体验单节点多卡的推理功能。 +本文档将为用户介绍使用vLLM-MindSpore插件进行单节点多卡的推理流程。以[Qwen2.5-32B](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct)模型为例,用户通过以下[docker安装](#docker安装)章节,或[安装指南](../../installation/installation.md#安装指南)进行环境配置,并[下载模型权重](#下载模型权重)。在[设置环境变量](#设置环境变量)之后,可部署[在线推理](#在线推理),以体验单节点多卡的推理功能。 ## docker安装 -在本章节中,我们推荐用docker创建的方式,以快速部署vLLM MindSpore环境,以下是部署docker的步骤介绍: +在本章节中,我们推荐用docker创建的方式,以快速部署vLLM-MindSpore插件环境,以下是部署docker的步骤介绍: ### 构建镜像 -用户可执行以下命令,拉取vLLM MindSpore代码仓库,并构建镜像: +用户可执行以下命令,拉取vLLM-MindSpore插件代码仓库,并构建镜像: ```bash git clone https://gitee.com/mindspore/vllm-mindspore.git @@ -134,7 +134,7 @@ export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Tra 以下是对上述环境变量的解释: -- `vLLM_MODEL_BACKEND`:所运行的模型后端。目前vLLM MindSpore所支持的模型与模型后端,可在[模型支持列表](../../../user_guide/supported_models/models_list/models_list.md)中进行查询。 +- `vLLM_MODEL_BACKEND`:所运行的模型后端。目前vLLM-MindSpore插件所支持的模型与模型后端,可在[模型支持列表](../../../user_guide/supported_models/models_list/models_list.md)中进行查询。 - `MINDFORMERS_MODEL_CONFIG`:模型配置文件。用户可以在[MindSpore Transformers工程](https://gitee.com/mindspore/mindformers/tree/master/research/qwen2_5)中,找到对应模型的yaml文件。以Qwen2.5-32B为例,则其yaml文件为[predict_qwen2_5_32b_instruct.yaml](https://gitee.com/mindspore/mindformers/blob/master/research/qwen2_5/predict_qwen2_5_32b_instruct.yaml) 。 用户可通过`npu-smi info`查看显存占用情况,并可以使用如下环境变量,设置用于推理的计算卡。以下例子为假设用户使用4,5,6,7卡进行推理: @@ -145,7 +145,7 @@ export ASCEND_RT_VISIBLE_DEVICES=4,5,6,7 ## 在线推理 -vLLM MindSpore可使用OpenAI的API协议,部署为在线推理。以下是在线推理的拉起流程。以下是以[Qwen2.5-32B](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct) 为例,介绍模型的[启动服务](#启动服务),并[发送请求](#发送请求),得到在线推理的推理结果。 
+vLLM-MindSpore插件可使用OpenAI的API协议,部署为在线推理。以下是在线推理的拉起流程。以下是以[Qwen2.5-32B](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct) 为例,介绍模型的[启动服务](#启动服务),并[发送请求](#发送请求),得到在线推理的推理结果。 ### 启动服务 diff --git a/docs/vllm_mindspore/docs/source_zh_cn/release_notes/release_notes.md b/docs/vllm_mindspore/docs/source_zh_cn/release_notes/release_notes.md index f8bd01f5a0..8b6c39dbc6 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/release_notes/release_notes.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/release_notes/release_notes.md @@ -2,9 +2,9 @@ [![查看源文件](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source.svg)](https://gitee.com/mindspore/docs/blob/master/docs/vllm_mindspore/docs/source_zh_cn/release_notes/release_notes.md) -## vLLM MindSpore 0.3.0 Release Notes +## vLLM-MindSpore插件 0.3.0 Release Notes -以下为vLLM MindSpore插件0.3.0版本支持的关键新功能和模型。 +以下为vLLM-MindSpore插件0.3.0版本支持的关键新功能和模型。 ### 新特性 diff --git a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/environment_variables/environment_variables.md b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/environment_variables/environment_variables.md index b5e2aefd2d..d3f594eb29 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/environment_variables/environment_variables.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/environment_variables/environment_variables.md @@ -4,7 +4,7 @@ | 环境变量 | 功能 | 类型 | 取值 | 说明 | | ------ | ------- | ------ | ------ | ------ | -| `vLLM_MODEL_BACKEND` | 用于指定模型后端。使用vLLM MindSpore原生模型后端时无需指定;使用模型为vLLM MindSpore外部后端时则需要指定。 | String | `MindFormers`: 模型后端为MindSpore Transformers。 | 原生模型后端当前支持Qwen2.5系列;MindSpore Transformers模型后端支持Qwen系列、DeepSeek、Llama系列模型,使用时需配置环境变量:`export PYTHONPATH=/path/to/mindformers/:$PYTHONPATH`。 | +| `vLLM_MODEL_BACKEND` | 用于指定模型后端。使用vLLM-MindSpore插件原生模型后端时无需指定;使用模型为vLLM-MindSpore插件外部后端时则需要指定。 | String | `MindFormers`: 模型后端为MindSpore Transformers。 | 原生模型后端当前支持Qwen2.5系列;MindSpore Transformers模型后端支持Qwen系列、DeepSeek、Llama系列模型,使用时需配置环境变量:`export PYTHONPATH=/path/to/mindformers/:$PYTHONPATH`。 | | `MINDFORMERS_MODEL_CONFIG` | MindSpore Transformers模型的配置文件。使用Qwen2.5系列、DeepSeek系列模型时,需要配置文件路径。 | String | 模型配置文件路径。 | **该环境变量在后续版本会被移除。** 样例:`export MINDFORMERS_MODEL_CONFIG=/path/to/research/deepseek3/deepseek_r1_671b/predict_deepseek_r1_671b_w8a8.yaml`。 | | `GLOO_SOCKET_IFNAME` | 用于多机之间使用gloo通信时的网口名称。 | String | 网口名称,例如enp189s0f0。 | 多机场景使用,可通过`ifconfig`查找ip对应网卡的网卡名。 | | `TP_SOCKET_IFNAME` | 用于多机之间使用TP通信时的网口名称。 | String | 网口名称,例如enp189s0f0。 | 多机场景使用,可通过`ifconfig`查找ip对应网卡的网卡名。 | diff --git a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/benchmark/benchmark.md b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/benchmark/benchmark.md index 6d28a6ff7c..390f5e6291 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/benchmark/benchmark.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/benchmark/benchmark.md @@ -2,7 +2,7 @@ [![查看源文件](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source.svg)](https://gitee.com/mindspore/docs/blob/master/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/benchmark/benchmark.md) -vLLM MindSpore的性能测试能力,继承自vLLM所提供的性能测试能力,详情可参考[vLLM BenchMark](https://github.com/vllm-project/vllm/blob/main/benchmarks/README.md)文档。该文档将介绍[在线性能测试](#在线性能测试)与[离线性能测试](#离线性能测试),用户可以根据所介绍步骤进行性能测试。 +vLLM-MindSpore插件的性能测试能力,继承自vLLM所提供的性能测试能力,详情可参考[vLLM 
BenchMark](https://github.com/vllm-project/vllm/blob/main/benchmarks/README.md)文档。该文档将介绍[在线性能测试](#在线性能测试)与[离线性能测试](#离线性能测试),用户可以根据所介绍步骤进行性能测试。 ## 在线性能测试 @@ -35,7 +35,7 @@ INFO: Waiting for application startup. INFO: Application startup complete. ``` -拉取vLLM代码仓,导入vLLM MindSpore插件,复用其中benchmark功能: +拉取vLLM代码仓,导入vLLM-MindSpore插件,复用其中benchmark功能: ```bash export VLLM_BRANCH=v0.9.1 @@ -44,7 +44,7 @@ cd vllm sed -i '1i import vllm_mindspore' benchmarks/benchmark_serving.py ``` -其中,`VLLM_BRANCH`为vLLM的分支名,其需要与vLLM MindSpore相配套。配套关系可以参考[这里](../../../getting_started/installation/installation.md#版本配套)。 +其中,`VLLM_BRANCH`为vLLM的分支名,其需要与vLLM-MindSpore插件相配套。配套关系可以参考[这里](../../../getting_started/installation/installation.md#版本配套)。 执行测试脚本: @@ -106,7 +106,7 @@ export vLLM_MODEL_BACKEND=MindFormers # use MindSpore Transformers as model back export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model's YAML file. ``` -并拉取vLLM代码仓,导入vLLM MindSpore插件,复用其中benchmark功能: +并拉取vLLM代码仓,导入vLLM-MindSpore插件,复用其中benchmark功能: ```bash export VLLM_BRANCH=v0.9.1 @@ -115,7 +115,7 @@ cd vllm sed -i '1i import vllm_mindspore' benchmarks/benchmark_throughput.py ``` -其中,`VLLM_BRANCH`为vLLM的分支名,其需要与vLLM MindSpore相配套。配套关系可以参考[这里](../../../getting_started/installation/installation.md#版本配套)。 +其中,`VLLM_BRANCH`为vLLM的分支名,其需要与vLLM-MindSpore插件相配套。配套关系可以参考[这里](../../../getting_started/installation/installation.md#版本配套)。 用户可通过以下命令,运行测试脚本。该脚本将启动模型,并执行测试,用户不需要再拉起模型: diff --git a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/features_list/features_list.md b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/features_list/features_list.md index 910550afcc..baa7edf953 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/features_list/features_list.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/features_list/features_list.md @@ -2,9 +2,9 @@ [![查看源文件](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source.svg)](https://gitee.com/mindspore/docs/blob/master/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/features_list/features_list.md) -vLLM MindSpore支持的特性功能与vLLM社区版本保持一致,特性描述和使用请参考[vLLM官方资料](https://docs.vllm.ai/en/latest/)。 +vLLM-MindSpore插件支持的特性功能与vLLM社区版本保持一致,特性描述和使用请参考[vLLM官方资料](https://docs.vllm.ai/en/latest/)。 -以下是vLLM MindSpore的功能支持状态: +以下是vLLM-MindSpore插件的功能支持状态: | **功能** | **vLLM V0** | **vLLM V1** | |-----------------------------------|--------------------|--------------------| @@ -39,5 +39,5 @@ vLLM MindSpore支持的特性功能与vLLM社区版本保持一致,特性描 ## 特性说明 -- LoRA目前仅支持Qwen2.5 vLLM MindSpore原生模型,其他模型正在适配中; +- LoRA目前仅支持Qwen2.5 vLLM-MindSpore插件原生模型,其他模型正在适配中; - Tool Calling目前已支持DeepSeek V3 0324 W8A8模型。 diff --git a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/profiling/profiling.md b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/profiling/profiling.md index eb90283188..2df2785116 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/profiling/profiling.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/profiling/profiling.md @@ -2,7 +2,7 @@ [![查看源文件](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source.svg)](https://gitee.com/mindspore/docs/blob/master/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/profiling/profiling.md) -vLLM MindSpore支持使用`mindspore.Profiler`模块,跟踪vLLM 
MindSpore中worker的性能。用户可以根据[采集profiling数据](#采集profiling数据)章节,在完成数据采集后,根据[分析profiling数据](#分析profiling数据),进行数据分析。另一方面,用户可以根据[图数据dump](#图数据dump),查看模型的IR图,从而进行对模型结构的分析与调试。 +vLLM-MindSpore插件支持使用`mindspore.Profiler`模块,跟踪vLLM-MindSpore插件中worker的性能。用户可以根据[采集profiling数据](#采集profiling数据)章节,在完成数据采集后,根据[分析profiling数据](#分析profiling数据),进行数据分析。另一方面,用户可以根据[图数据dump](#图数据dump),查看模型的IR图,从而进行对模型结构的分析与调试。 ## 采集profiling数据 @@ -12,7 +12,7 @@ vLLM MindSpore支持使用`mindspore.Profiler`模块,跟踪vLLM MindSpore中wo export VLLM_TORCH_PROFILER_DIR=/path/to/save/vllm_profile ``` -设置完成后,以[Qwen2.5-32B](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct) 为例,启动vLLM MindSpore服务: +设置完成后,以[Qwen2.5-32B](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct) 为例,启动vLLM-MindSpore插件服务: ```bash export TENSOR_PARALLEL_SIZE=4 diff --git a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/quantization/quantization.md b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/quantization/quantization.md index 22a83475ef..0f11e66c83 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/quantization/quantization.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/quantization/quantization.md @@ -24,7 +24,7 @@ ### 离线推理 -用户可以参考[安装指南](../../../getting_started/installation/installation.md),进行vLLM MindSpore的环境搭建。用户需设置以下环境变量: +用户可以参考[安装指南](../../../getting_started/installation/installation.md),进行vLLM-MindSpore插件的环境搭建。用户需设置以下环境变量: ```bash export vLLM_MODEL_BACKEND=MindFormers # use MindSpore Transformers as model backend. -- Gitee
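As a final illustration of the request step that the online-inference sections above refer to: once a service is running with the OpenAI-compatible API on the default port 8000, a completion request can be sent as sketched below. The sketch is illustrative only; the model value is a placeholder and must match the model the server was launched with.

```python
# Illustrative client for the OpenAI-compatible completions endpoint exposed by
# the service (default port 8000). The model value is a placeholder and must
# match the model the server was launched with.
import requests

resp = requests.post(
    "http://127.0.0.1:8000/v1/completions",
    json={
        "model": "/path/to/Qwen2.5-7B-Instruct",  # placeholder
        "prompt": "Llama is",
        "max_tokens": 32,
        "temperature": 0.0,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```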