diff --git a/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md b/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md index 73851332d61d4d9f33665d31dfb44915d1071bb9..7299d810a19425a57052642c19cb664345246e7d 100644 --- a/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md +++ b/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md @@ -4,8 +4,7 @@ This document describes the steps to install the vLLM MindSpore environment. Three installation methods are provided: -- [Docker Installation](#docker-installation): Suitable for quick deployment scenarios. -- [Pip Installation](#pip-installation): Suitable for scenarios requiring specific versions. +- [Docker Installation](#docker-installation): Suitable for quick deployment scenarios. - [Source Code Installation](#source-code-installation): Suitable for incremental development of vLLM MindSpore. ## Version Compatibility @@ -14,19 +13,19 @@ This document describes the steps to install the vLLM MindSpore environment. Thr - Python: 3.9 / 3.10 / 3.11 - Software version compatibility - | Software | Version | Corresponding Branch | - | -------- | ------- | -------------------- | - | [CANN](https://www.hiascend.com/developer/download/community/result?module=cann) | 8.1 | - | - | [MindSpore](https://www.mindspore.cn/install/) | 2.7 | master | - | [MSAdapter](https://git.openi.org.cn/OpenI/MSAdapter) | 0.2 | master | - | [MindSpore Transformers](https://gitee.com/mindspore/mindformers) | 1.6 | dev | - | [Golden Stick](https://gitee.com/mindspore/golden-stick) | 1.1.0 | r1.1.0 | - | [vLLM](https://github.com/vllm-project/vllm) | 0.9.1 | v0.9.1 | - | [vLLM MindSpore](https://gitee.com/mindspore/vllm-mindspore) | 0.3 | master | + | Software | Version And Links | + | ----- | ----- | + |[CANN](https://www.hiascend.com/developer/download/community/result?module=cann) | [8.1.RC1](https://www.hiascend.com/document/detail/zh/canncommercial/81RC1/softwareinst/instg/instg_0000.html?Mode=PmIns&InstallType=local&OS=Debian&Software=cannToolKit) | + |[MindSpore](https://www.mindspore.cn/install/) | [2.7.0](https://repo.mindspore.cn/mindspore/mindspore/version/202508/20250814/master_20250814091143_7548abc43af03319bfa528fc96d0ccd3917fcc9c_newest/unified/) | + |[MSAdapter](https://git.openi.org.cn/OpenI/MSAdapter)| [0.5.0](https://repo.mindspore.cn/mindspore/msadapter/version/202508/20250814/master_20250814010018_4615051c43eef898b6bbdc69768656493b5932f8_newest/any/) | + |[MindSpore Transformers](https://gitee.com/mindspore/mindformers)| [1.6.0](https://gitee.com/mindspore/mindformers) | + |[Golden Stick](https://gitee.com/mindspore/golden-stick)| [1.2.0](https://repo.mindspore.cn/mindspore/golden-stick/version/202508/20250814/master_20250814010017_2713821db982330b3bcd6d84d85a3b337d555f27_newest/any/) | + |[vLLM](https://github.com/vllm-project/vllm) | [0.9.1](https://repo.mindspore.cn/mirrors/vllm/version/202505/20250514/v0.8.4.dev0_newest/any/) | + |[vLLM MindSpore](https://gitee.com/mindspore/vllm-mindspore) | [0.3.0](https://gitee.com/mindspore/vllm-mindspore/) | ## Environment Setup -This section introduces three installation methods: [Docker Installation](#docker-installation), [Pip Installation](#pip-installation), [Source Code Installation](#source-code-installation), and [Quick Verification](#quick-verification) example to check the installation. 
+This section introduces two installation methods: [Docker Installation](#docker-installation), [Source Code Installation](#source-code-installation), and [Quick Verification](#quick-verification) example to check the installation. ### Docker Installation @@ -106,56 +105,109 @@ docker exec -it $DOCKER_NAME bash ### Source Code Installation -- **CANN Installation** +#### CANN Installation - For CANN installation methods and environment configuration, please refer to [CANN Community Edition Installation Guide](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/82RC1alpha002/softwareinst/instg/instg_0001.html?Mode=PmIns&OS=openEuler&Software=cannToolKit). If you encounter any issues during CANN installation, please consult the [Ascend FAQ](https://www.hiascend.com/document/detail/zh/AscendFAQ/ProduTech/CANNFAQ/cannfaq_000.html) for troubleshooting. +For CANN installation methods and environment configuration, please refer to [CANN Community Edition Installation Guide](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/82RC1alpha002/softwareinst/instg/instg_0001.html?Mode=PmIns&OS=openEuler&Software=cannToolKit). If you encounter any issues during CANN installation, please consult the [Ascend FAQ](https://www.hiascend.com/document/detail/zh/AscendFAQ/ProduTech/CANNFAQ/cannfaq_000.html) for troubleshooting. - The default installation path for CANN is `/usr/local/Ascend`. After completing CANN installation, configure the environment variables with the following commands: +The default installation path for CANN is `/usr/local/Ascend`. After completing CANN installation, configure the environment variables with the following commands: - ```bash - LOCAL_ASCEND=/usr/local/Ascend # the root directory of run package - source ${LOCAL_ASCEND}/ascend-toolkit/set_env.sh - export ASCEND_CUSTOM_PATH=${LOCAL_ASCEND}/ascend-toolkit - ``` +```bash +LOCAL_ASCEND=/usr/local/Ascend # the root directory of run package +source ${LOCAL_ASCEND}/ascend-toolkit/set_env.sh +export ASCEND_CUSTOM_PATH=${LOCAL_ASCEND}/ascend-toolkit +``` -- **vLLM Prerequisites Installation** +#### vLLM Prerequisites Installation For vLLM environment configuration and installation methods, please refer to the [vLLM Installation Guide](https://docs.vllm.ai/en/v0.9.1/getting_started/installation/cpu.html). In vllM installation, `gcc/g++ >= 12.3.0` is required, and it could be installed by the following command: - ```bash - yum install -y gcc gcc-c++ - ``` +```bash +yum install -y gcc gcc-c++ +``` + +#### vLLM MindSpore Installation + +vLLM MindSpore can be installed in the following two ways. **vLLM MindSpore One-click Installation** is suitable for scenarios where users need quick deployment and usage. **vLLM MindSpore Manual Installation** is suitable for scenarios where users require custom modifications to the components. + +- **vLLM MindSpore One-click Installation** + + To install vLLM MindSpore, user needs to pull the vLLM MindSpore source code and then runs the following command to install the dependencies: + + ```bash + git clone https://gitee.com/mindspore/vllm-mindspore.git + cd vllm-mindspore + bash install_depend_pkgs.sh + ``` + + Compile and install vLLM MindSpore: + + ```bash + pip install . + ``` + + After executing the above commands, `mindformers` folder will be generated in the `vllm-mindspore/install_depend_pkgs` directory. 
Add this folder to the environment variables: + + ```bash + export PYTHONPATH=$MF_PATH:$PYTHONPATH + ``` + +- **vLLM MindSpore Manual Installation** + + If user need to modify the components or use other versions, components need to be manually installed in a specific order. Version compatibility of vLLM MindSpore can be found [Version Compatibility](#version-compatibility), abd vLLM MindSpore requires the following installation sequence: + + 1. Install vLLM + + ```bash + pip install /path/to/vllm-*.whl + ``` + + 2. Uninstall Torch-related components + + ```bash + pip uninstall torch torch-npu torchvision torchaudio -y + ``` + + 3. Install MindSpore + + ```bash + pip install /path/to/mindspore-*.whl + ``` + + 4. Clone the MindSpore Transformers repository and add it to `PYTHONPATH` + + ```bash + git clone https://gitee.com/mindspore/mindformers.git + export PYTHONPATH=$MF_PATH:$PYTHONPATH + ``` -- **vLLM MindSpore Installation** + 5. Install Golden Stick - To install vLLM MindSpore, user needs to pull the vLLM MindSpore source code and then runs the following command to install the dependencies: + ```bash + pip install /path/to/mindspore_gs-*.whl + ``` - ```bash - git clone https://gitee.com/mindspore/vllm-mindspore.git - cd vllm-mindspore - bash install_depend_pkgs.sh - ``` + 6. Install MSAdapter - Compile and install vLLM MindSpore: + ```bash + pip install /path/to/msadapter-*.whl + ``` - ```bash - pip install . - ``` + 7. Install vLLM MindSpore - After executing the above commands, `mindformers` folder will be generated in the `vllm-mindspore/install_depend_pkgs` directory. Add this folder to the environment variables: + User needs to pull source of vLLM MindSpore, and run installation. - ```bash - export PYTHONPATH=$MF_PATH:$PYTHONPATH - ``` + ```bash + git clone https://gitee.com/mindspore/vllm-mindspore.git + cd vllm-mindspore + pip install . + ``` -### Quick Verification +## Quick Verification User can verify the installation with a simple offline inference test. First, user need to configure the environment variables with the following command: ```bash -export ASCEND_TOTAL_MEMORY_GB=64 # Please use `npu-smi info` to check the memory. export vLLM_MODEL_BACKEND=MindFormers # use MindSpore Transformers as model backend. -export vLLM_MODEL_MEMORY_USE_GB=32 # Memory reserved for model execution. Set according to the model's maximum usage, with the remaining environment used for kvcache allocation export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model's YAML file. ``` diff --git a/docs/vllm_mindspore/docs/source_en/getting_started/quick_start/quick_start.md b/docs/vllm_mindspore/docs/source_en/getting_started/quick_start/quick_start.md index c0eaf16c347299abc75285ffc1d57c50403d5fee..91a88e814de3e6d37ed6edb5b4a37615dc58ffac 100644 --- a/docs/vllm_mindspore/docs/source_en/getting_started/quick_start/quick_start.md +++ b/docs/vllm_mindspore/docs/source_en/getting_started/quick_start/quick_start.md @@ -131,18 +131,14 @@ git clone https://huggingface.co/Qwen/Qwen2.5-7B-Instruct Before launching the model, user need to set the following environment variables: ```bash -export ASCEND_TOTAL_MEMORY_GB=64 # Please use `npu-smi info` to check the memory. export vLLM_MODEL_BACKEND=MindFormers # use MindSpore Transformers as model backend. -export vLLM_MODEL_MEMORY_USE_GB=32 # Memory reserved for model execution. Set according to the model's maximum usage, with the remaining environment used for kvcache allocation. 
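+# The $YAML_PATH in the next line is a placeholder; for the Qwen2.5-7B example in this guide it could be a hypothetical
+# local copy such as /path/to/mindformers/research/qwen2_5/predict_qwen2_5_7b_instruct.yaml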
export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model's YAML file. ``` Here is an explanation of these environment variables: -- `ASCEND_TOTAL_MEMORY_GB`: The memory size of each card. User can check the memory by using `npu-smi info`, where the value corresponds to `HBM-Usage(MB)` in the query results. - `vLLM_MODEL_BACKEND`: The backend of the model to run. User could find supported models and backends for vLLM MindSpore in the [Model Support List](../../user_guide/supported_models/models_list/models_list.md). -- `vLLM_MODEL_MEMORY_USE_GB`: The memory reserved for model loading. Adjust this value if insufficient memory error occurs during model loading. -- `MINDFORMERS_MODEL_CONFIG`: The model configuration file. +- `MINDFORMERS_MODEL_CONFIG`: The model configuration file. User can find the corresponding YAML file in the [MindSpore Transformers repository](https://gitee.com/mindspore/mindformers/tree/master/research/qwen2_5). For Qwen2.5-7B, the YAML file is [predict_qwen2_5_7b_instruct.yaml](https://gitee.com/mindspore/mindformers/blob/master/research/qwen2_5/predict_qwen2_5_7b_instruct.yaml). Additionally, users need to ensure that MindSpore Transformers is installed. Users can add it by running the following command: @@ -202,7 +198,7 @@ Use the model `Qwen/Qwen2.5-7B-Instruct` and start the vLLM service with the fol python3 -m vllm_mindspore.entrypoints vllm.entrypoints.openai.api_server --model "Qwen/Qwen2.5-7B-Instruct" ``` -If the service starts successfully, similar output will be obtained: +User can also set the local model path by `--model` argument. If the service starts successfully, similar output will be obtained: ```text INFO: Started server process [6363] @@ -224,6 +220,8 @@ Use the following command to send a request, where `prompt` is the model input: curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "Qwen/Qwen2.5-7B-Instruct", "prompt": "I am", "max_tokens": 15, "temperature": 0}' ``` +User needs to ensure that the `"model"` field matches the `--model` in the service startup, and the request can successfully match the model. + If the request is processed successfully, the following inference result will be returned: ```text diff --git a/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md b/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md index 7a3a9fb4a83667a8d74de50b205d55a4a8a7a928..9c369446a24fd0f0cd955eeb0a369d4293266e07 100644 --- a/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md +++ b/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md @@ -52,8 +52,8 @@ Execute the following Python script to download the MindSpore-compatible DeepSee ```python from openmind_hub import snapshot_download -snapshot_download(repo_id="MindSpore-Lab/DeepSeek-R1-W8A8", - local_dir="/path/to/save/deepseek_r1_w8a8", +snapshot_download(repo_id="MindSpore-Lab/DeepSeek-R1-0528-A8W8", + local_dir="/path/to/save/deepseek_r1_0528_a8w8", local_dir_use_symlinks=False) ``` @@ -78,7 +78,7 @@ If the tool is unavailable, install [git-lfs](https://git-lfs.com) first. 
Refer Once confirmed, download the weights by executing the following command: ```shell -git clone https://modelers.cn/MindSpore-Lab/DeepSeek-R1-W8A8.git +git clone https://modelers.cn/models/MindSpore-Lab/DeepSeek-R1-0528-A8W8.git ``` ## TP16 Tensor Parallel Inference @@ -241,18 +241,20 @@ Execution example: ```bash # Master node: -vllm-mindspore serve --model="/path/to/save/deepseek_r1_w8a8" --trust-remote-code --max-num-seqs=256 --max_model_len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 16 --distributed-executor-backend=ray +vllm-mindspore serve --model="MindSpore-Lab/DeepSeek-R1-0528-A8W8" --trust-remote-code --max-num-seqs=256 --max_model_len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 16 --distributed-executor-backend=ray ``` -In tensor parallel scenarios, the `--tensor-parallel-size` parameter overrides the `model_parallel` configuration in the model YAML file. +In tensor parallel scenarios, the `--tensor-parallel-size` parameter overrides the `model_parallel` configuration in the model YAML file. User can also set the local model path by `--model` argument. #### Sending Requests Use the following command to send requests, where `prompt` is the model input: ```bash -curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "/path/to/save/deepseek_r1_w8a8", "prompt": "I am", "max_tokens": 20, "temperature": 0, "top_p": 1.0, "top_k": 1, "repetition_penalty": 1.0}' -``` +curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "MindSpore-Lab/DeepSeek-R1-0528-A8W8", "prompt": "I am", "max_tokens": 20, "temperature": 0, "top_p": 1.0, "top_k": 1, "repetition_penalty": 1.0}' +``` + +User needs to ensure that the `"model"` field matches the `--model` in the service startup, and the request can successfully match the model. ## Hybrid Parallel Inference @@ -301,6 +303,8 @@ parallel_config: ### Online Inference +#### Starting the Service + `vllm-mindspore` can deploy online inference using the OpenAI API protocol. Below is the workflow for launching the service: ```bash @@ -321,22 +325,24 @@ vllm-mindspore serve --data-parallel-address [Master node communication IP] --data-parallel-rpc-port [Master node communication port] --enable-expert-parallel # Enable expert parallelism -``` +``` -Execution example: +User can also set the local model path by `--model` argument. 
The following is an execution example: ```bash # Master node: -vllm-mindspore serve --model="/path/to/save/deepseek_r1_w8a8" --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 0 --data-parallel-address 192.10.10.10 --data-parallel-rpc-port 12370 --enable-expert-parallel +vllm-mindspore serve --model="MindSpore-Lab/DeepSeek-R1-0528-A8W8" --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 0 --data-parallel-address 192.10.10.10 --data-parallel-rpc-port 12370 --enable-expert-parallel # Worker node: -vllm-mindspore serve --headless --model="/path/to/save/deepseek_r1_w8a8" --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 2 --data-parallel-address 192.10.10.10 --data-parallel-rpc-port 12370 --enable-expert-parallel -``` +vllm-mindspore serve --headless --model="MindSpore-Lab/DeepSeek-R1-0528-A8W8" --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 2 --data-parallel-address 192.10.10.10 --data-parallel-rpc-port 12370 --enable-expert-parallel +``` -## Sending Requests +#### Sending Requests Use the following command to send requests, where `prompt` is the model input: ```bash -curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "/path/to/save/deepseek_r1_w8a8", "prompt": "I am", "max_tokens": 20, "temperature": 0}' +curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "MindSpore-Lab/DeepSeek-R1-0528-A8W8", "prompt": "I am", "max_tokens": 20, "temperature": 0}' ``` + +User needs to ensure that the `"model"` field matches the `--model` in the service startup, and the request can successfully match the model. diff --git a/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md b/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md index 24d4d4a2cac790dc5e9e8f5d5145266b896d32f5..40ad4a1597b1a469f47fbdcd06d036219c27b71a 100644 --- a/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md +++ b/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md @@ -127,18 +127,14 @@ For [Qwen2.5-32B](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct), the followi ```bash #set environment variables -export ASCEND_TOTAL_MEMORY_GB=64 # Use `npu-smi info` to check the memory. export vLLM_MODEL_BACKEND=MindFormers # Use MindSpore TransFormers as the model backend. -export vLLM_MODEL_MEMORY_USE_GB=32 # Memory reserved for model execution. Adjust based on the model's maximum usage, with the remaining allocated for KV cache. export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model YAML file. 
``` Here is an explanation of these environment variables: -- `ASCEND_TOTAL_MEMORY_GB`: The memory size of each compute card. Query using `npu-smi info`, corresponding to `HBM-Usage(MB)` in the results. - `vLLM_MODEL_BACKEND`: The model backend. Currently supported models and backends are listed in the [Model Support List](../../../user_guide/supported_models/models_list/models_list.md). -- `vLLM_MODEL_MEMORY_USE_GB`: Memory reserved for model loading. Adjust this if encountering insufficient memory. -- `MINDFORMERS_MODEL_CONFIG`: Model configuration file. User can find the corresponding YAML file in the [MindSpore Transformers repository](https://gitee.com/mindspore/mindformers/tree/r1.5.0/research/qwen2_5). For Qwen2.5-32B, the YAML file is [predict_qwen2_5_32b_instruct.yaml](https://gitee.com/mindspore/mindformers/blob/r1.5.0/research/qwen2_5/predict_qwen2_5_32b_instruct.yaml). +- `MINDFORMERS_MODEL_CONFIG`: Model configuration file. User can find the corresponding YAML file in the [MindSpore Transformers repository](https://gitee.com/mindspore/mindformers/tree/master/research/qwen2_5). For Qwen2.5-32B, the YAML file is [predict_qwen2_5_32b_instruct.yaml](https://gitee.com/mindspore/mindformers/blob/master/research/qwen2_5/predict_qwen2_5_32b_instruct.yaml). Users can check memory usage with `npu-smi info` and set the NPU cards for inference using the following example (assuming cards 4,5,6,7 are used): @@ -160,7 +156,7 @@ export MAX_MODEL_LEN=1024 python3 -m vllm_mindspore.entrypoints vllm.entrypoints.openai.api_server --model "Qwen/Qwen2.5-32B-Instruct" --trust_remote_code --tensor-parallel-size $TENSOR_PARALLEL_SIZE --max-model-len $MAX_MODEL_LEN ``` -Here, `TENSOR_PARALLEL_SIZE` specifies the number of NPU cards, and `MAX_MODEL_LEN` sets the maximum output token length. +Here, `TENSOR_PARALLEL_SIZE` specifies the number of NPU cards, and `MAX_MODEL_LEN` sets the maximum output token length. User can also set the local model path by `--model` argument. If the service starts successfully, similar output will be obtained: @@ -181,9 +177,11 @@ Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 Use the following command to send a request, where `prompt` is the model input: ```bash -curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "Qwen2.5-32B-Instruct", "prompt": "I am", "max_tokens": 20, "temperature": 0}' +curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "Qwen/Qwen2.5-32B-Instruct", "prompt": "I am", "max_tokens": 20, "temperature": 0}' ``` +User needs to ensure that the `"model"` field matches the `--model` in the service startup, and the request can successfully match the model. 
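+
+For example, if the service was launched with a hypothetical local weight path via `--model "/path/to/Qwen2.5-32B-Instruct"`, the request must use exactly the same string in its `"model"` field:
+
+```bash
+# Hypothetical local path; it must match the --model value used when starting the service.
+curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "/path/to/Qwen2.5-32B-Instruct", "prompt": "I am", "max_tokens": 20, "temperature": 0}'
+```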
+ If processed successfully, the inference result will be: ```text diff --git a/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md b/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md index 2c360e28f8010720b10792648cd758d2fc54acab..79ed73f81296beb24f42b95e2723377dcd116040 100644 --- a/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md +++ b/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md @@ -127,17 +127,13 @@ For [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct), the following ```bash #set environment variables -export ASCEND_TOTAL_MEMORY_GB=64 # Please use `npu-smi info` to check the memory. export vLLM_MODEL_BACKEND=MindFormers # use MindSpore TransFormers as model backend. -export vLLM_MODEL_MEMORY_USE_GB=32 # Memory reserved for model execution. Set according to the model's maximum usage, with the remaining environment used for kvcache allocation export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model's YAML file. ``` Here is an explanation of these variables: -- `ASCEND_TOTAL_MEMORY_GB`: The memory size of each compute card. Query using `npu-smi info`, corresponding to `HBM-Usage(MB)` in the results. - `vLLM_MODEL_BACKEND`: The model backend. Currently supported models and backends are listed in the [Model Support List](../../../user_guide/supported_models/models_list/models_list.md). -- `vLLM_MODEL_MEMORY_USE_GB`: Memory reserved for model loading. Adjust this if encountering insufficient memory. - `MINDFORMERS_MODEL_CONFIG`: Model configuration file. User can find the corresponding YAML file in the [MindSpore Transformers repository](https://gitee.com/mindspore/mindformers/tree/master/research/qwen2_5). For Qwen2.5-7B, the YAML file is [predict_qwen2_5_7b_instruct.yaml](https://gitee.com/mindspore/mindformers/blob/master/research/qwen2_5/predict_qwen2_5_7b_instruct.yaml). User can check memory usage with `npu-smi info` and set the compute card for inference using: @@ -196,7 +192,7 @@ Use the model `Qwen/Qwen2.5-7B-Instruct` and start the vLLM service with the fol python3 -m vllm_mindspore.entrypoints vllm.entrypoints.openai.api_server --model "Qwen/Qwen2.5-7B-Instruct" ``` -If the service starts successfully, similar output will be obtained: +User can also set the local model path by `--model` argument. If the service starts successfully, similar output will be obtained: ```text INFO: Started server process [6363] @@ -218,6 +214,8 @@ Use the following command to send a request, where `prompt` is the model input: curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "Qwen/Qwen2.5-7B-Instruct", "prompt": "I am", "max_tokens": 15, "temperature": 0}' ``` +User needs to ensure that the `"model"` field matches the `--model` in the service startup, and the request can successfully match the model. 
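+
+If users are unsure which model name the server has registered, they can first query the OpenAI-compatible model list endpoint (assuming the default port 8000 used in the examples above); the returned model `id` is the value to put in the `"model"` field:
+
+```bash
+curl http://localhost:8000/v1/models
+```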
+ If the request is processed successfully, the following inference result will be returned: ```text diff --git a/docs/vllm_mindspore/docs/source_en/user_guide/environment_variables/environment_variables.md b/docs/vllm_mindspore/docs/source_en/user_guide/environment_variables/environment_variables.md index c1b71616260505fa7705916b0598ded2aa098674..036834798eb4e55ba82c5ff34667ff7e83b54e5f 100644 --- a/docs/vllm_mindspore/docs/source_en/user_guide/environment_variables/environment_variables.md +++ b/docs/vllm_mindspore/docs/source_en/user_guide/environment_variables/environment_variables.md @@ -11,6 +11,13 @@ | `HCCL_SOCKET_IFNAME` | Specifies the network interface name for inter-machine communication using HCCL. | String | Interface name (e.g., `enp189s0f0`). | Used in multi-machine scenarios. The interface name can be found via `ifconfig` by matching the IP address. | | `ASCEND_RT_VISIBLE_DEVICES` | Specifies which devices are visible to the current process, supporting one or multiple Device IDs. | String | Device IDs as a comma-separated string (e.g., `"0,1,2,3,4,5,6,7"`). | Recommended for Ray usage scenarios. | | `HCCL_BUFFSIZE` | Controls the buffer size for data sharing between two NPUs. | int | Buffer size in MB (e.g., `2048`). | Usage reference: [HCCL_BUFFSIZE](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/81RC1beta1/maintenref/envvar/envref_07_0080.html). Example: For DeepSeek hybrid parallelism (Data Parallel: 32, Expert Parallel: 32) with `max-num-batched-tokens=256`, set `export HCCL_BUFFSIZE=2048`. | -| MS_MEMPOOL_BLOCK_SIZE | Set the size of the memory pool block in PyNative mode for devices | String | String of positive number, and the unit is GB. | | -| vLLM_USE_NPU_ADV_STEP_FLASH_OP | Whether to use Ascend operation `adv_step_flash` | String | `on`: Use;`off`:Not use | If the variable is set to `off`, model will use the implement of small operations. | -| VLLM_TORCH_PROFILER_DIR | Enables profiling data collection and takes effect when a data save path is configured. | String | The path to save profiling data. | | +| `MS_MEMPOOL_BLOCK_SIZE` | Set the size of the memory pool block in PyNative mode for devices | String | String of positive number, and the unit is GB. | | +| `vLLM_USE_NPU_ADV_STEP_FLASH_OP` | Whether to use Ascend operation `adv_step_flash` | String | `on`: Use;`off`:Not use | If the variable is set to `off`, model will use the implement of small operations. | +| `VLLM_TORCH_PROFILER_DIR` | Enables profiling data collection and takes effect when a data save path is configured. | String | The path to save profiling data. 
| | + +More environment variable information can be referred in the following links: + +- [CANN Environment Variable List](https://www.hiascend.com/document/detail/en/CANNCommunityEdition/81RC1beta1/index/index.html) +- [MindSpore Environment Variable List](https://www.mindspore.cn/docs/en/master/api_python/env_var_list.html) +- [MindSpore Transformers Environment Variable List](https://www.mindspore.cn/mindformers/docs/en/master/index.html) +- [vLLM Environment Variable List](https://docs.vllm.ai/en/v0.8.4/serving/env_vars.html) diff --git a/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/benchmark/benchmark.md b/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/benchmark/benchmark.md index b704a49f442c313bdc3b4a38029672b11ab39734..9ab03ff8ac88bd2612a18421d204a026c63597f5 100644 --- a/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/benchmark/benchmark.md +++ b/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/benchmark/benchmark.md @@ -9,9 +9,7 @@ The benchmark tool of vLLM MindSpore is inherited from vLLM. You can refer to th For single-card inference, we take [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) as an example. You can prepare the environment by following the guide [Single-Card Inference (Qwen2.5-7B)](../../../getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md#online-inference), set the environment variables: ```bash -export ASCEND_TOTAL_MEMORY_GB=64 # Please use `npu-smi info` to check the memory. export vLLM_MODEL_BACKEND=MindFormers # use MindSpore Transformers as model backend. -export vLLM_MODEL_MEMORY_USE_GB=32 # Memory reserved for model execution. Set according to the model's maximum usage, with the remaining environment used for kvcache allocation export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model's YAML file. ``` @@ -104,9 +102,7 @@ P99 ITL (ms): .... For offline performance benchmark, take [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) as an example. Prepare the environment by following the guide [Single-Card Inference (Qwen2.5-7B)](../../../getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md#offline-inference). User need to set the environment variables: ```bash -export ASCEND_TOTAL_MEMORY_GB=64 # Please use `npu-smi info` to check the memory. export vLLM_MODEL_BACKEND=MindFormers # use MindSpore Transformers as model backend. -export vLLM_MODEL_MEMORY_USE_GB=32 # Memory reserved for model execution. Set according to the model's maximum usage, with the remaining environment used for kvcache allocation export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model's YAML file. 
``` diff --git a/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/profiling/profiling.md b/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/profiling/profiling.md index b24b541e59a4a4184de1e3619164b7951fcddd11..897f1ec0c848a9d7608a0100a6e4411989ffe5b8 100644 --- a/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/profiling/profiling.md +++ b/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/profiling/profiling.md @@ -40,7 +40,7 @@ curl -X POST http://127.0.0.1:8000/start_profile curl http://localhost:8000/v1/completions \ -H "Content-Type: application/json" \ -d '{ - "model": "/home/DeepSeekV3", + "model": "Qwen/Qwen2.5-32B-Instruct", "prompt": "San Francisco is a", "max_tokens": 7, "temperature": 0 diff --git a/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/quantization/quantization.md b/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/quantization/quantization.md index 401768ca1af91c1442bee6a3b3d8960921fcea92..fa1b8f89c339d94737f7a9a5329e25c587691e8d 100644 --- a/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/quantization/quantization.md +++ b/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/quantization/quantization.md @@ -16,7 +16,7 @@ We employ [MindSpore Golden Stick's PTQ algorithm](https://gitee.com/mindspore/g ### Downloading Quantized Weights -We have uploaded the quantized DeepSeek-R1 to [ModelArts Community](https://modelers.cn): [MindSpore-Lab/DeepSeek-R1-W8A8](https://modelers.cn/models/MindSpore-Lab/DeepSeek-R1-W8A8). Refer to the [ModelArts Community documentation](https://modelers.cn/docs/en/openmind-hub-client/0.9/basic_tutorial/download.html) to download the weights locally. +We have uploaded the quantized DeepSeek-R1 to [ModelArts Community](https://modelers.cn): [MindSpore-Lab/DeepSeek-R1-0528-A8W8](https://modelers.cn/models/MindSpore-Lab/DeepSeek-R1-0528-A8W8). Refer to the [ModelArts Community documentation](https://modelers.cn/docs/en/openmind-hub-client/0.9/basic_tutorial/download.html) to download the weights locally. ## Quantized Model Inference @@ -27,9 +27,7 @@ After obtaining the DeepSeek-R1 W8A8 weights, ensure they are stored in the rela Refer to the [Installation Guide](../../../getting_started/installation/installation.md) to set up the vLLM MindSpore environment. User need to set the following environment variables: ```bash -export ASCEND_TOTAL_MEMORY_GB=64 # Please use `npu-smi info` to check the memory. export vLLM_MODEL_BACKEND=MindFormers # use MindSpore Transformers as model backend. -export vLLM_MODEL_MEMORY_USE_GB=32 # Memory reserved for model execution. Set according to the model's maximum usage, with the remaining environment used for kvcache allocation export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model's YAML file. 
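+# For the DeepSeek-R1 W8A8 weights, the $YAML_PATH above would typically be the W8A8 prediction config from
+# MindSpore Transformers, e.g. a hypothetical /path/to/research/deepseek3/deepseek_r1_671b/predict_deepseek_r1_671b_w8a8.yaml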
``` diff --git a/docs/vllm_mindspore/docs/source_en/user_guide/supported_models/models_list/models_list.md b/docs/vllm_mindspore/docs/source_en/user_guide/supported_models/models_list/models_list.md index ba825bcae5a769b9217aae4ba8fb808e819b4f85..3d9b49bd6e3f3bd9a9f9bb1cd614cc201c8d1fa9 100644 --- a/docs/vllm_mindspore/docs/source_en/user_guide/supported_models/models_list/models_list.md +++ b/docs/vllm_mindspore/docs/source_en/user_guide/supported_models/models_list/models_list.md @@ -6,7 +6,7 @@ |-------| --------- | ---- | | DeepSeek-V3 | Supported | [DeepSeek-V3](https://modelers.cn/models/MindSpore-Lab/DeepSeek-V3) | | DeepSeek-R1 | Supported | [DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-V3) | -| DeepSeek-R1 W8A8 | Supported | [Deepseek-R1-W8A8](https://modelers.cn/models/MindSpore-Lab/DeepSeek-r1-w8a8) | +| DeepSeek-R1 W8A8 | Supported | [Deepseek-R1-W8A8](https://modelers.cn/models/MindSpore-Lab/DeepSeek-R1-0528-A8W8) | | Qwen2.5 | Supported | [Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct), [Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct), [Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct), [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct), [Qwen2.5-14B-Instruct](https://huggingface.co/Qwen/Qwen2.5-14B-Instruct), [Qwen2.5-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct), [Qwen2.5-72B-Instruct](https://huggingface.co/Qwen/Qwen2.5-72B-Instruct) | | Qwen3-32B | Supported | [Qwen3-32B](https://modelers.cn/models/MindSpore-Lab/Qwen3-32B) | | Qwen3-235B-A22B | Supported | [Qwen3-235B-A22B](https://huggingface.co/Qwen/Qwen3-235B-A22B) | diff --git a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md index 8121916024cadc82ada5d9bbcfe0b86451ada08b..67269d941984af0fea94ca8b829ff08c93fe5f50 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md @@ -13,19 +13,19 @@ - Python:3.9 / 3.10 / 3.11 - 软件版本配套 - | 软件 | 版本 | 对应分支 | - | ----- | ----- | ----- | - |[CANN](https://www.hiascend.com/developer/download/community/result?module=cann) | 8.1 | - | - |[MindSpore](https://www.mindspore.cn/install/) | 2.7 | master | - |[MSAdapter](https://git.openi.org.cn/OpenI/MSAdapter)| 0.2 | master | - |[MindSpore Transformers](https://gitee.com/mindspore/mindformers)|1.6 | dev | - |[Golden Stick](https://gitee.com/mindspore/golden-stick)|1.1.0 | r1.1.0 | - |[vLLM](https://github.com/vllm-project/vllm) | 0.9.1 | v0.9.1 | - |[vLLM MindSpore](https://gitee.com/mindspore/vllm-mindspore) | 0.3 | master | + | 软件 | 配套版本与下载链接 | + | ----- | ----- | + |[CANN](https://www.hiascend.com/developer/download/community/result?module=cann) | [8.1.RC1](https://www.hiascend.com/document/detail/zh/canncommercial/81RC1/softwareinst/instg/instg_0000.html?Mode=PmIns&InstallType=local&OS=Debian&Software=cannToolKit) | + |[MindSpore](https://www.mindspore.cn/install/) | [2.7.0](https://repo.mindspore.cn/mindspore/mindspore/version/202508/20250814/master_20250814091143_7548abc43af03319bfa528fc96d0ccd3917fcc9c_newest/unified/) | + |[MSAdapter](https://git.openi.org.cn/OpenI/MSAdapter)| [0.5.0](https://repo.mindspore.cn/mindspore/msadapter/version/202508/20250814/master_20250814010018_4615051c43eef898b6bbdc69768656493b5932f8_newest/any/) | + |[MindSpore 
Transformers](https://gitee.com/mindspore/mindformers)| [1.6.0](https://gitee.com/mindspore/mindformers) | + |[Golden Stick](https://gitee.com/mindspore/golden-stick)| [1.2.0](https://repo.mindspore.cn/mindspore/golden-stick/version/202508/20250814/master_20250814010017_2713821db982330b3bcd6d84d85a3b337d555f27_newest/any/) | + |[vLLM](https://github.com/vllm-project/vllm) | [0.9.1](https://repo.mindspore.cn/mirrors/vllm/version/202505/20250514/v0.8.4.dev0_newest/any/) | + |[vLLM MindSpore](https://gitee.com/mindspore/vllm-mindspore) | [0.3.0](https://gitee.com/mindspore/vllm-mindspore/) | ## 配置环境 -在本章节中,我们将介绍[docker安装](#docker安装)、[pip安装](#pip安装)、[源码安装](#源码安装)三种安装方式,以及[快速验证](#快速验证)用例,用于验证安装是否成功。 +在本章节中,我们将介绍[docker安装](#docker安装)、[源码安装](#源码安装)两种安装方式,以及[快速验证](#快速验证)用例,用于验证安装是否成功。 ### docker安装 @@ -105,29 +105,33 @@ docker exec -it $DOCKER_NAME bash ### 源码安装 -- **CANN安装** +#### CANN安装 - CANN安装方法与环境配套,请参考[CANN社区版软件安装](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/82RC1alpha002/softwareinst/instg/instg_0001.html?Mode=PmIns&OS=openEuler&Software=cannToolKit),若用户在安装CANN过程中遇到问题,可参考[昇腾常见问题](https://www.hiascend.com/document/detail/zh/AscendFAQ/ProduTech/CANNFAQ/cannfaq_000.html)进行解决。 +CANN安装方法与环境配套,请参考[CANN社区版软件安装](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/82RC1alpha002/softwareinst/instg/instg_0001.html?Mode=PmIns&OS=openEuler&Software=cannToolKit),若用户在安装CANN过程中遇到问题,可参考[昇腾常见问题](https://www.hiascend.com/document/detail/zh/AscendFAQ/ProduTech/CANNFAQ/cannfaq_000.html)进行解决。 - CANN默认安装路径为`/usr/local/Ascend`。用户在安装CANN完毕后,使用如下命令,为CANN配置环境变量: +CANN默认安装路径为`/usr/local/Ascend`。用户在安装CANN完毕后,使用如下命令,为CANN配置环境变量: - ```bash - LOCAL_ASCEND=/usr/local/Ascend # the root directory of run package - source ${LOCAL_ASCEND}/ascend-toolkit/set_env.sh - export ASCEND_CUSTOM_PATH=${LOCAL_ASCEND}/ascend-toolkit - ``` +```bash +LOCAL_ASCEND=/usr/local/Ascend # the root directory of run package +source ${LOCAL_ASCEND}/ascend-toolkit/set_env.sh +export ASCEND_CUSTOM_PATH=${LOCAL_ASCEND}/ascend-toolkit +``` -- **vLLM前置依赖安装** +#### vLLM前置依赖安装 vLLM的环境配置与安装方法,请参考[vLLM安装教程](https://docs.vllm.ai/en/v0.9.1/getting_started/installation/cpu.html)。其依赖`gcc/g++ >= 12.3.0`版本,可通过以下命令完成安装: - ```bash - yum install -y gcc gcc-c++ - ``` +```bash +yum install -y gcc gcc-c++ +``` + +#### vLLM MindSpore安装 -- **vLLM MindSpore安装** +vLLM MindSpore有以下两种安装方式。**vLLM MindSpore一键式安装**适用于用户快速使用与部署的场景。**vLLM MindSpore手动安装**适用于用户对组件有自定义修改的场景。 - 安装vLLM MindSpore,需要在拉取vLLM MindSpore源码后,执行以下命令,安装依赖包: +- **vLLM MindSpore一键式安装** + + 采用一键式安装脚本来安装vLLM MindSpore,需要在拉取vLLM MindSpore源码后,执行以下命令,安装依赖包: ```bash git clone https://gitee.com/mindspore/vllm-mindspore.git @@ -147,14 +151,63 @@ docker exec -it $DOCKER_NAME bash export PYTHONPATH=$MF_PATH:$PYTHONPATH ``` -### 快速验证 +- **vLLM MindSpore手动安装** + + 若用户对组件有修改,或者需使用其他版本,则用户需要按照特定顺序,手动安装组件。vLLM MindSpore软件配套下载地址可以参考[版本配套](#版本配套),且对组件的安装顺序要求如下: + + 1. 安装vLLM + + ```bash + pip install /path/to/vllm-*.whl + ``` + + 2. 卸载torch相关组件 + + ```bash + pip uninstall torch torch-npu torchvision torchaudio -y + ``` + + 3. 安装MindSpore + + ```bash + pip install /path/to/mindspore-*.whl + ``` + + 4. 引入MindSpore Transformers仓,加入到`PYTHONPATH`中 + + ```bash + git clone https://gitee.com/mindspore/mindformers.git + export PYTHONPATH=$MF_PATH:$PYTHONPATH + ``` + + 5. 安装Golden Stick + + ```bash + pip install /path/to/mindspore_gs-*.whl + ``` + + 6. 安装MSAdapter + + ```bash + pip install /path/to/msadapter-*.whl + ``` + + 7. 
安装vLLM MindSpore + + 需要先拉取vLLM MindSpore源码,再执行安装 + + ```bash + git clone https://gitee.com/mindspore/vllm-mindspore.git + cd vllm-mindspore + pip install . + ``` + +## 快速验证 用户可以创建一个简单的离线推理场景,验证安装是否成功。下面以[Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) 为例。首先用户需要执行以下命令,设置环境变量: ```bash -export ASCEND_TOTAL_MEMORY_GB=64 # Please use `npu-smi info` to check the memory. export vLLM_MODEL_BACKEND=MindFormers # use MindSpore Transformers as model backend. -export vLLM_MODEL_MEMORY_USE_GB=32 # Memory reserved for model execution. Set according to the model's maximum usage, with the remaining environment used for kvcache allocation export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model's YAML file. ``` diff --git a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/quick_start/quick_start.md b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/quick_start/quick_start.md index 2bf629908a752b2c386f752e33355339b9ec6069..addd3951d0d120bf3d7a2e3e237b20ddad6a32d4 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/quick_start/quick_start.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/quick_start/quick_start.md @@ -131,18 +131,14 @@ git clone https://huggingface.co/Qwen/Qwen2.5-7B-Instruct 用户在拉起模型前,需设置以下环境变量: ```bash -export ASCEND_TOTAL_MEMORY_GB=64 # Please use `npu-smi info` to check the memory. export vLLM_MODEL_BACKEND=MindFormers # use MindSpore Transformers as model backend. -export vLLM_MODEL_MEMORY_USE_GB=32 # Memory reserved for model execution. Set according to the model's maximum usage, with the remaining environment used for kvcache allocation export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model's YAML file. ``` 以下是对上述环境变量的解释: -- `ASCEND_TOTAL_MEMORY_GB`: 每一张计算卡的显存大小。用户可使用`npu-smi info`命令进行查询,该值对应查询结果中的`HBM-Usage(MB)`; - `vLLM_MODEL_BACKEND`:所运行的模型后端。目前vLLM MindSpore所支持的模型与模型后端,可在[模型支持列表](../../user_guide/supported_models/models_list/models_list.md)中进行查询; -- `vLLM_MODEL_MEMORY_USE_GB`:模型加载时所用空间,根据用户所使用的模型进行设置。若用户在模型加载过程中遇到显存不足时,可适当增大该值并重试; -- `MINDFORMERS_MODEL_CONFIG`:模型配置文件。 +- `MINDFORMERS_MODEL_CONFIG`:模型配置文件。用户可以在[MindSpore Transformers工程](https://gitee.com/mindspore/mindformers/tree/master/research/qwen2_5)中,找到对应模型的yaml文件。以Qwen2.5-7B为例,则其yaml文件为[predict_qwen2_5_7b_instruct.yaml](https://gitee.com/mindspore/mindformers/blob/master/research/qwen2_5/predict_qwen2_5_7b_instruct.yaml)。 另外,用户需要确保MindSpore Transformers已安装。用户可通过 @@ -202,7 +198,7 @@ vLLM MindSpore可使用OpenAI的API协议,进行在线推理部署。以下是 python3 -m vllm_mindspore.entrypoints vllm.entrypoints.openai.api_server --model "Qwen/Qwen2.5-7B-Instruct" ``` -若服务成功拉起,则可以获得类似的执行结果: +用户可以通过`--model`参数,指定模型保存的本地路径。若服务成功拉起,则可以获得类似的执行结果: ```text INFO: Started server process [6363] @@ -224,7 +220,7 @@ Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg gereration throughput: 0.0 curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "Qwen/Qwen2.5-7B-Instruct", "prompt": "I am", "max_tokens": 20, "temperature": 0}' ``` -若请求处理成功,将获得以下的推理结果: +其中,用户需确认`"model"`字段与启动服务中`--model`一致,请求才能成功匹配到模型。若请求处理成功,将获得以下推理结果: ```text { diff --git a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md index 047e7b4aad2f9239abc18c111f8f576867c3a8be..813a0dd588763cd4c46b31331bec7e88edd158c5 100644 --- 
a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md @@ -94,8 +94,8 @@ docker exec -it $DOCKER_NAME bash ```python from openmind_hub import snapshot_download -snapshot_download(repo_id="MindSpore-Lab/DeepSeek-R1-W8A8", - local_dir="/path/to/save/deepseek_r1_w8a8", +snapshot_download(repo_id="MindSpore-Lab/DeepSeek-R1-0528-A8W8", + local_dir="/path/to/save/deepseek_r1_0528_a8w8", local_dir_use_symlinks=False) ``` @@ -120,7 +120,7 @@ Git LFS initialized. 工具确认可用后,执行以下命令,下载权重: ```shell -git clone https://modelers.cn/MindSpore-Lab/DeepSeek-R1-W8A8.git +git clone https://modelers.cn/models/MindSpore-Lab/DeepSeek-R1-0528-A8W8.git ``` ## TP16 张量并行推理 @@ -284,19 +284,21 @@ vllm-mindspore serve ```bash # 主节点: -vllm-mindspore serve --model="/path/to/save/deepseek_r1_w8a8" --trust-remote-code --max-num-seqs=256 --max_model_len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 16 --distributed-executor-backend=ray +vllm-mindspore serve --model="MindSpore-Lab/DeepSeek-R1-0528-A8W8" --trust-remote-code --max-num-seqs=256 --max_model_len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 16 --distributed-executor-backend=ray ``` -张量并行场景下,`--tensor-parallel-size`参数会覆盖模型yaml文件中`parallel_config`的`model_parallel`配置。 +张量并行场景下,`--tensor-parallel-size`参数会覆盖模型yaml文件中`parallel_config`的`model_parallel`配置。用户可以通过`--model`参数,指定模型保存的本地路径。 #### 发起请求 使用如下命令发送请求。其中`prompt`字段为模型输入: ```bash -curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "/path/to/save/deepseek_r1_w8a8", "prompt": "I am", "max_tokens": 20, "temperature": 0, "top_p": 1.0, "top_k": 1, "repetition_penalty": 1.0}' +curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "MindSpore-Lab/DeepSeek-R1-0528-A8W8", "prompt": "I am", "max_tokens": 20, "temperature": 0, "top_p": 1.0, "top_k": 1, "repetition_penalty": 1.0}' ``` +用户需确认`"model"`字段与启动服务中`--model`一致,请求才能成功匹配到模型。 + ## 混合并行推理 vLLM 通过 Ray 对多个节点资源进行管理和运行。该样例对应以下并行策略场景: @@ -344,6 +346,8 @@ parallel_config: ### 在线推理 +#### 启动服务 + `vllm-mindspore`可使用OpenAI的API协议部署在线推理。以下是在线推理的拉起流程: ```bash @@ -366,20 +370,22 @@ vllm-mindspore serve --enable-expert-parallel # 使能专家并行 ``` -执行示例: +用户可以通过`--model`参数,指定模型保存的本地路径。以下为执行示例: ```bash # 主节点: -vllm-mindspore serve --model="/path/to/save/deepseek_r1_w8a8" --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 0 --data-parallel-address 192.10.10.10 --data-parallel-rpc-port 12370 --enable-expert-parallel +vllm-mindspore serve --model="MindSpore-Lab/DeepSeek-R1-0528-A8W8" --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 0 --data-parallel-address 192.10.10.10 --data-parallel-rpc-port 12370 --enable-expert-parallel # 从节点: -vllm-mindspore serve --headless --model="/path/to/save/deepseek_r1_w8a8" --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 
--tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 2 --data-parallel-address 192.10.10.10 --data-parallel-rpc-port 12370 --enable-expert-parallel +vllm-mindspore serve --headless --model="MindSpore-Lab/DeepSeek-R1-0528-A8W8" --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 2 --data-parallel-address 192.10.10.10 --data-parallel-rpc-port 12370 --enable-expert-parallel ``` -## 发送请求 +#### 发送请求 使用如下命令发送请求。其中`prompt`字段为模型输入: ```bash -curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "/path/to/save/deepseek_r1_w8a8", "prompt": "I am, "max_tokens": 120, "temperature": 0}' +curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "MindSpore-Lab/DeepSeek-R1-0528-A8W8", "prompt": "I am, "max_tokens": 120, "temperature": 0}' ``` + +用户需确认`"model"`字段与启动服务中`--model`一致,请求才能成功匹配到模型。 diff --git a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md index 811009f6471bea2dad284efdfad79184b3fed3a8..f0901e2e0f9c461a047645f91b508d6278605e03 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md @@ -128,18 +128,14 @@ git clone https://huggingface.co/Qwen/Qwen2.5-32B-Instruct ```bash #set environment variables -export ASCEND_TOTAL_MEMORY_GB=64 # Please use `npu-smi info` to check the memory. export vLLM_MODEL_BACKEND=MindFormers # use MindSpore TransFormers as model backend. -export vLLM_MODEL_MEMORY_USE_GB=32 # Memory reserved for model execution. Set according to the model's maximum usage, with the remaining environment used for kvcache allocation export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model's YAML file. 
``` 以下是对上述环境变量的解释: -- `ASCEND_TOTAL_MEMORY_GB`: 每一张计算卡的显存大小。用户可使用`npu-smi info`命令进行查询,该值对应查询结果中的`HBM-Usage(MB)`。 - `vLLM_MODEL_BACKEND`:所运行的模型后端。目前vLLM MindSpore所支持的模型与模型后端,可在[模型支持列表](../../../user_guide/supported_models/models_list/models_list.md)中进行查询。 -- `vLLM_MODEL_MEMORY_USE_GB`:模型加载时所用空间,根据用户所使用的模型进行设置。若用户在模型加载过程中遇到显存不足时,可适当增大该值并重试。 -- `MINDFORMERS_MODEL_CONFIG`:模型配置文件。用户可以在[MindSpore Transformers工程](https://gitee.com/mindspore/mindformers/tree/r1.5.0/research/qwen2_5)中,找到对应模型的yaml文件。以Qwen2.5-32B为例,则其yaml文件为[predict_qwen2_5_32b_instruct.yaml](https://gitee.com/mindspore/mindformers/blob/r1.5.0/research/qwen2_5/predict_qwen2_5_32b_instruct.yaml) 。 +- `MINDFORMERS_MODEL_CONFIG`:模型配置文件。用户可以在[MindSpore Transformers工程](https://gitee.com/mindspore/mindformers/tree/master/research/qwen2_5)中,找到对应模型的yaml文件。以Qwen2.5-32B为例,则其yaml文件为[predict_qwen2_5_32b_instruct.yaml](https://gitee.com/mindspore/mindformers/blob/master/research/qwen2_5/predict_qwen2_5_32b_instruct.yaml) 。 用户可通过`npu-smi info`查看显存占用情况,并可以使用如下环境变量,设置用于推理的计算卡。以下例子为假设用户使用4,5,6,7卡进行推理: @@ -163,7 +159,7 @@ python3 -m vllm_mindspore.entrypoints vllm.entrypoints.openai.api_server --model 其中,`TENSOR_PARALLEL_SIZE`为用户指定的卡数,`MAX_MODEL_LEN`为模型最大输出token数。 -若服务成功拉起,则可以获得类似的执行结果: +用户可以通过`--model`参数,指定模型保存的本地路径。若服务成功拉起,则可以获得类似的执行结果: ```text INFO: Started server process [6363] @@ -182,10 +178,10 @@ Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg gereration throughput: 0.0 使用如下命令发送请求。其中`prompt`字段为模型输入: ```bash -curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "Qwen2.5-32B-Instruct", "prompt": "I am", "max_tokens": 20, "temperature": 0}' +curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "Qwen/Qwen2.5-32B-Instruct", "prompt": "I am", "max_tokens": 20, "temperature": 0}' ``` -若请求处理成功,将获得以下的推理结果: +其中,用户需确认`"model"`字段与启动服务中`--model`一致,请求才能成功匹配到模型。若请求处理成功,将获得以下推理结果: ```text { diff --git a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md index ffc82071b2e9a98e315268756ad2a48ec9e12246..c7d8426f6dc0c258addf291672a440d3fe93af07 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md @@ -128,18 +128,14 @@ git clone https://huggingface.co/Qwen/Qwen2.5-7B-Instruct ```bash #set environment variables -export ASCEND_TOTAL_MEMORY_GB=64 # Please use `npu-smi info` to check the memory. export vLLM_MODEL_BACKEND=MindFormers # use MindSpore TransFormers as model backend. -export vLLM_MODEL_MEMORY_USE_GB=32 # Memory reserved for model execution. Set according to the model's maximum usage, with the remaining environment used for kvcache allocation export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model's YAML file. 
``` 以下是对上述环境变量的解释: -- `ASCEND_TOTAL_MEMORY_GB`: 每一张计算卡的显存大小。用户可使用`npu-smi info`命令进行查询,该值对应查询结果中的`HBM-Usage(MB)`; - `vLLM_MODEL_BACKEND`:所运行的模型后端。目前vLLM MindSpore所支持的模型与模型后端,可在[模型支持列表](../../../user_guide/supported_models/models_list/models_list.md)中进行查询; -- `vLLM_MODEL_MEMORY_USE_GB`:模型加载时所用空间,根据用户所使用的模型进行设置。若用户在模型加载过程中遇到显存不足时,可适当增大该值并重试; -- `MINDFORMERS_MODEL_CONFIG`:模型配置文件。用户可以在[MindSpore Transformers工程](https://gitee.com/mindspore/mindformers/tree/r1.5.0/research/qwen2_5)中,找到对应模型的yaml文件。以Qwen2.5-7B为例,则其yaml文件为[predict_qwen2_5_7b_instruct.yaml](https://gitee.com/mindspore/mindformers/blob/r1.5.0/research/qwen2_5/predict_qwen2_5_7b_instruct.yaml) 。 +- `MINDFORMERS_MODEL_CONFIG`:模型配置文件。用户可以在[MindSpore Transformers工程](https://gitee.com/mindspore/mindformers/tree/master/research/qwen2_5)中,找到对应模型的yaml文件。以Qwen2.5-7B为例,则其yaml文件为[predict_qwen2_5_7b_instruct.yaml](https://gitee.com/mindspore/mindformers/blob/master/research/qwen2_5/predict_qwen2_5_7b_instruct.yaml) 。 用户可通过`npu-smi info`查看显存占用情况,并可以使用如下环境变量,设置用于推理的计算卡: @@ -198,7 +194,7 @@ vLLM MindSpore可使用OpenAI的API协议,部署为在线推理。以下是以 python3 -m vllm_mindspore.entrypoints vllm.entrypoints.openai.api_server --model "Qwen/Qwen2.5-7B-Instruct" ``` -若服务成功拉起,则可以获得类似的执行结果: +用户可以通过`--model`参数,指定模型保存的本地路径。若服务成功拉起,则可以获得类似的执行结果: ```text INFO: Started server process [6363] @@ -220,7 +216,7 @@ Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg gereration throughput: 0.0 curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "Qwen/Qwen2.5-7B-Instruct", "prompt": "I am", "max_tokens": 20, "temperature": 0}' ``` -若请求处理成功,将获得以下的推理结果: +其中,用户需确认`"model"`字段与启动服务中`--model`一致,请求才能成功匹配到模型。若请求处理成功,将获得以下推理结果: ```text { diff --git a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/environment_variables/environment_variables.md b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/environment_variables/environment_variables.md index 7fd53b3ff3ee7ca8084c66a5942184e2a1fdff73..b5e2aefd2d27953a0ec22dc8cd05a96117ae8bdd 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/environment_variables/environment_variables.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/environment_variables/environment_variables.md @@ -4,13 +4,20 @@ | 环境变量 | 功能 | 类型 | 取值 | 说明 | | ------ | ------- | ------ | ------ | ------ | -| vLLM_MODEL_BACKEND | 用于指定模型后端。使用vLLM MindSpore原生模型后端时无需指定;使用模型为vLLM MindSpore外部后端时则需要指定。 | String | `MindFormers`: 模型后端为MindSpore Transformers。 | 原生模型后端当前支持Qwen2.5系列;MindSpore Transformers模型后端支持Qwen系列、DeepSeek、Llama系列模型,使用时需配置环境变量:`export PYTHONPATH=/path/to/mindformers/:$PYTHONPATH`。 | -| MINDFORMERS_MODEL_CONFIG | MindSpore Transformers模型的配置文件。使用Qwen2.5系列、DeepSeek系列模型时,需要配置文件路径。 | String | 模型配置文件路径。 | **该环境变量在后续版本会被移除。** 样例:`export MINDFORMERS_MODEL_CONFIG=/path/to/research/deepseek3/deepseek_r1_671b/predict_deepseek_r1_671b_w8a8.yaml`。 | -| GLOO_SOCKET_IFNAME | 用于多机之间使用gloo通信时的网口名称。 | String | 网口名称,例如enp189s0f0。 | 多机场景使用,可通过`ifconfig`查找ip对应网卡的网卡名。 | -| TP_SOCKET_IFNAME | 用于多机之间使用TP通信时的网口名称。 | String | 网口名称,例如enp189s0f0。 | 多机场景使用,可通过`ifconfig`查找ip对应网卡的网卡名。 | -| HCCL_SOCKET_IFNAME | 用于多机之间使用HCCL通信时的网口名称。 | String | 网口名称,例如enp189s0f0。 | 多机场景使用,可通过`ifconfig`查找ip对应网卡的网卡名。 | -| ASCEND_RT_VISIBLE_DEVICES | 指定哪些Device对当前进程可见,支持一次指定一个或多个Device ID。 | String | 为Device ID,逗号分割的字符串,例如"0,1,2,3,4,5,6,7"。 | ray使用场景建议使用。 | -| HCCL_BUFFSIZE | 此环境变量用于控制两个NPU之间共享数据的缓存区大小。 | int | 缓存区大小,大小为MB。例如:`2048`。 | 
使用方法参考:[HCCL_BUFFSIZE](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/81RC1beta1/maintenref/envvar/envref_07_0080.html)。例如DeepSeek 混合并行(数据并行数为32,专家并行数为32),且`max-num-batched-tokens`为256时,则`export HCCL_BUFFSIZE=2048`。 | -| MS_MEMPOOL_BLOCK_SIZE | 设置PyNative模式下设备内存池的块大小。 | String | 正整数string,单位为GB。 | | -| vLLM_USE_NPU_ADV_STEP_FLASH_OP | 是否使用昇腾`adv_step_flash`算子。 | String | `on`: 使用;`off`:不使用 | 取值为`off`时,将使用小算子实现替代`adv_step_flash`算子。 | -| VLLM_TORCH_PROFILER_DIR | 开启profiling采集数据,当配置了采集数据保存路径后生效 | String | Profiling数据保存路径。| | +| `vLLM_MODEL_BACKEND` | 用于指定模型后端。使用vLLM MindSpore原生模型后端时无需指定;使用模型为vLLM MindSpore外部后端时则需要指定。 | String | `MindFormers`: 模型后端为MindSpore Transformers。 | 原生模型后端当前支持Qwen2.5系列;MindSpore Transformers模型后端支持Qwen系列、DeepSeek、Llama系列模型,使用时需配置环境变量:`export PYTHONPATH=/path/to/mindformers/:$PYTHONPATH`。 | +| `MINDFORMERS_MODEL_CONFIG` | MindSpore Transformers模型的配置文件。使用Qwen2.5系列、DeepSeek系列模型时,需要配置文件路径。 | String | 模型配置文件路径。 | **该环境变量在后续版本会被移除。** 样例:`export MINDFORMERS_MODEL_CONFIG=/path/to/research/deepseek3/deepseek_r1_671b/predict_deepseek_r1_671b_w8a8.yaml`。 | +| `GLOO_SOCKET_IFNAME` | 用于多机之间使用gloo通信时的网口名称。 | String | 网口名称,例如enp189s0f0。 | 多机场景使用,可通过`ifconfig`查找ip对应网卡的网卡名。 | +| `TP_SOCKET_IFNAME` | 用于多机之间使用TP通信时的网口名称。 | String | 网口名称,例如enp189s0f0。 | 多机场景使用,可通过`ifconfig`查找ip对应网卡的网卡名。 | +| `HCCL_SOCKET_IFNAME` | 用于多机之间使用HCCL通信时的网口名称。 | String | 网口名称,例如enp189s0f0。 | 多机场景使用,可通过`ifconfig`查找ip对应网卡的网卡名。 | +| `ASCEND_RT_VISIBLE_DEVICES` | 指定哪些Device对当前进程可见,支持一次指定一个或多个Device ID。 | String | 为Device ID,逗号分割的字符串,例如"0,1,2,3,4,5,6,7"。 | ray使用场景建议使用。 | +| `HCCL_BUFFSIZE` | 此环境变量用于控制两个NPU之间共享数据的缓存区大小。 | Integer | 缓存区大小,大小为MB。例如:`2048`。 | 使用方法参考:[HCCL_BUFFSIZE](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/81RC1beta1/maintenref/envvar/envref_07_0080.html)。例如DeepSeek 混合并行(数据并行数为32,专家并行数为32),且`max-num-batched-tokens`为256时,则`export HCCL_BUFFSIZE=2048`。 | +| `MS_MEMPOOL_BLOCK_SIZE` | 设置PyNative模式下设备内存池的块大小。 | String | 正整数string,单位为GB。 | | +| `vLLM_USE_NPU_ADV_STEP_FLASH_OP` | 是否使用昇腾`adv_step_flash`算子。 | String | `on`: 使用;`off`:不使用 | 取值为`off`时,将使用小算子实现替代`adv_step_flash`算子。 | +| `VLLM_TORCH_PROFILER_DIR` | 开启profiling采集数据,当配置了采集数据保存路径后生效 | String | Profiling数据保存路径。| | + +更多的环境变量信息,请查看: + +- [CANN 环境变量列表](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/81RC1beta1/index/index.html) +- [MindSpore 环境变量列表](https://www.mindspore.cn/docs/zh-CN/master/api_python/env_var_list.html) +- [MindSpore Transformers 环境变量列表](https://www.mindspore.cn/mindformers/docs/zh-CN/master/index.html) +- [vLLM 环境变量列表](https://docs.vllm.ai/en/v0.8.4/serving/env_vars.html) diff --git a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/benchmark/benchmark.md b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/benchmark/benchmark.md index 15f104069996c5c0c4e32fbfdf909a2269f4074a..6d28a6ff7c60cb8c1677aeb7fe116534da81802a 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/benchmark/benchmark.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/benchmark/benchmark.md @@ -9,9 +9,7 @@ vLLM MindSpore的性能测试能力,继承自vLLM所提供的性能测试能 若用户使用单卡推理,以[Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct)为例,可按照文档[单卡推理(Qwen2.5-7B)](../../../getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md#在线推理)进行环境准备,设置以下环境变量: ```bash -export ASCEND_TOTAL_MEMORY_GB=64 # Please use `npu-smi info` to check the memory. export vLLM_MODEL_BACKEND=MindFormers # use MindSpore Transformers as model backend. 
diff --git a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/benchmark/benchmark.md b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/benchmark/benchmark.md
index 15f104069996c5c0c4e32fbfdf909a2269f4074a..6d28a6ff7c60cb8c1677aeb7fe116534da81802a 100644
--- a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/benchmark/benchmark.md
+++ b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/benchmark/benchmark.md
@@ -9,9 +9,7 @@ The performance testing capability of vLLM MindSpore is inherited from vLLM
 For single-card inference, taking [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) as an example, users can prepare the environment following [Single-Card Inference (Qwen2.5-7B)](../../../getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md#在线推理) and set the following environment variables:

 ```bash
-export ASCEND_TOTAL_MEMORY_GB=64 # Please use `npu-smi info` to check the memory.
 export vLLM_MODEL_BACKEND=MindFormers # use MindSpore Transformers as model backend.
-export vLLM_MODEL_MEMORY_USE_GB=32 # Memory reserved for model execution. Set according to the model's maximum usage, with the remaining environment used for kvcache allocation
 export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model's YAML file.
 ```

@@ -104,9 +102,7 @@ P99 ITL (ms): ....
 For offline performance testing, taking [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) as an example, users can prepare the environment following [Single-Card Inference (Qwen2.5-7B)](../../../getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md#离线推理) and set the following environment variables:

 ```bash
-export ASCEND_TOTAL_MEMORY_GB=64 # Please use `npu-smi info` to check the memory.
 export vLLM_MODEL_BACKEND=MindFormers # use MindSpore Transformers as model backend.
-export vLLM_MODEL_MEMORY_USE_GB=32 # Memory reserved for model execution. Set according to the model's maximum usage, with the remaining environment used for kvcache allocation
 export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model's YAML file.
 ```
diff --git a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/profiling/profiling.md b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/profiling/profiling.md
index 4dc4f2ccee29c5c3d3842ba62e5381c4e01834d4..eb90283188cdae258793d3614d88431cd2063f0b 100644
--- a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/profiling/profiling.md
+++ b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/profiling/profiling.md
@@ -40,7 +40,7 @@ curl -X POST http://127.0.0.1:8000/start_profile
 curl http://localhost:8000/v1/completions \
 -H "Content-Type: application/json" \
 -d '{
-    "model": "/home/DeepSeekV3",
+    "model": "Qwen/Qwen2.5-32B-Instruct",
     "prompt": "San Francisco is a",
     "max_tokens": 7,
     "temperature": 0
diff --git a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/quantization/quantization.md b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/quantization/quantization.md
index 54ad35032dbdcb594ef4ae2b847beb5d41f701fe..22a83475eff16c8a311a7d7c6a12cc62dcb8f483 100644
--- a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/quantization/quantization.md
+++ b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/quantization/quantization.md
@@ -16,7 +16,7 @@
 ### Downloading Quantized Weights Directly

-The quantized DeepSeek-R1 has been uploaded to the [Modelers community](https://modelers.cn): [MindSpore-Lab/DeepSeek-R1-W8A8](https://modelers.cn/models/MindSpore-Lab/DeepSeek-R1-W8A8). Refer to the [Modelers community documentation](https://modelers.cn/docs/zh/openmind-hub-client/0.9/basic_tutorial/download.html) to download the weights to a local directory.
+The quantized DeepSeek-R1 has been uploaded to the [Modelers community](https://modelers.cn): [MindSpore-Lab/DeepSeek-R1-0528-A8W8](https://modelers.cn/models/MindSpore-Lab/DeepSeek-R1-0528-A8W8). Refer to the [Modelers community documentation](https://modelers.cn/docs/zh/openmind-hub-client/0.9/basic_tutorial/download.html) to download the weights to a local directory.

 ## Quantized Model Inference

@@ -27,9 +27,7 @@
 Users can set up the vLLM MindSpore environment by referring to the [Installation Guide](../../../getting_started/installation/installation.md) and need to set the following environment variables:

 ```bash
-export ASCEND_TOTAL_MEMORY_GB=64 # Please use `npu-smi info` to check the memory.
 export vLLM_MODEL_BACKEND=MindFormers # use MindSpore Transformers as model backend.
-export vLLM_MODEL_MEMORY_USE_GB=32 # Memory reserved for model execution. Set according to the model's maximum usage, with the remaining environment used for kvcache allocation
 export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model's YAML file.
 ```
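Assuming the quantized weights above have been downloaded to a local directory (the path below is a placeholder), serving them follows the same pattern as the earlier examples: point `--model` at the weight directory and keep the request's `"model"` field identical. This is only a sketch; a checkpoint of this size additionally needs the multi-card parallel setup described in its YAML configuration, which is not shown here:

```bash
export vLLM_MODEL_BACKEND=MindFormers
export MINDFORMERS_MODEL_CONFIG=/path/to/research/deepseek3/deepseek_r1_671b/predict_deepseek_r1_671b_w8a8.yaml

# Start the service from the locally downloaded quantized weights (placeholder path).
python3 -m vllm_mindspore.entrypoints vllm.entrypoints.openai.api_server --model /path/to/DeepSeek-R1-W8A8

# Query it with a "model" field that matches the --model argument.
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "/path/to/DeepSeek-R1-W8A8", "prompt": "San Francisco is a", "max_tokens": 7, "temperature": 0}'
```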
diff --git a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_models/models_list/models_list.md b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_models/models_list/models_list.md
index 2e504c0fecd265842d02a1474e0f77fdfe151eac..c64725c9e1e448999d189ec22aeab0942f012299 100644
--- a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_models/models_list/models_list.md
+++ b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_models/models_list/models_list.md
@@ -6,7 +6,7 @@
 |-------| --------- | ---- |
 | DeepSeek-V3 | Supported | [DeepSeek-V3](https://modelers.cn/models/MindSpore-Lab/DeepSeek-V3) |
 | DeepSeek-R1 | Supported | [DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-V3) |
-| DeepSeek-R1 W8A8 | Supported | [Deepseek-R1-W8A8](https://modelers.cn/models/MindSpore-Lab/DeepSeek-r1-w8a8) |
+| DeepSeek-R1 W8A8 | Supported | [Deepseek-R1-W8A8](https://modelers.cn/models/MindSpore-Lab/DeepSeek-R1-0528-A8W8) |
 | Qwen2.5 | Supported | [Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct), [Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct), [Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct), [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct), [Qwen2.5-14B-Instruct](https://huggingface.co/Qwen/Qwen2.5-14B-Instruct), [Qwen2.5-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct), [Qwen2.5-72B-Instruct](https://huggingface.co/Qwen/Qwen2.5-72B-Instruct) |
 | Qwen3-32B | Supported | [Qwen3-32B](https://modelers.cn/models/MindSpore-Lab/Qwen3-32B) |
 | Qwen3-235B-A22B | Supported | [Qwen3-235B-A22B](https://huggingface.co/Qwen/Qwen3-235B-A22B) |