diff --git a/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md b/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md index f088336aa7a953454ca0bf95d4963a7d81a83cf8..972dd8be0f99432037d3c5437e475cdfbf5c09bc 100644 --- a/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md +++ b/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md @@ -1,4 +1,4 @@ -# Installation Guide +# Installation Guide [![View Source](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source_en.svg)](https://gitee.com/mindspore/docs/blob/master/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md) @@ -8,7 +8,7 @@ This document describes the steps to install the vLLM MindSpore environment. Thr - [Pip Installation](#pip-installation): Suitable for scenarios requiring specific versions. - [Source Code Installation](#source-code-installation): Suitable for incremental development of vLLM MindSpore. -## Version Compatibility +## Version Compatibility - OS: Linux-aarch64 - Python: 3.9 / 3.10 / 3.11 @@ -24,15 +24,15 @@ This document describes the steps to install the vLLM MindSpore environment. Thr | [vLLM](https://github.com/vllm-project/vllm) | 0.8.3 | v0.8.3 | | [vLLM MindSpore](https://gitee.com/mindspore/vllm-mindspore) | 0.2 | master | -## Environment Setup +## Environment Setup This section introduces three installation methods: [Docker Installation](#docker-installation), [Pip Installation](#pip-installation), [Source Code Installation](#source-code-installation), and [Quick Verification](#quick-verification) example to check the installation. -### Docker Installation +### Docker Installation We recommend using Docker for quick deployment of the vLLM MindSpore environment. Below are the steps: -#### Pulling the Image +#### Pulling the Image Execute the following command to pull the vLLM MindSpore Docker image: @@ -46,7 +46,7 @@ During the pull process, user will see the progress of each layer. After success docker images ``` -#### Creating a Container +#### Creating a Container After [pulling the image](#pulling-the-image), set `DOCKER_NAME` and `IMAGE_NAME` as the container and image names, then execute the following command to create the container: @@ -88,7 +88,7 @@ The container ID will be returned if docker is created successfully. User can al docker ps ``` -#### Entering the Container +#### Entering the Container After [creating the container](#creating-a-container), user can start and enter the container, using the environment variable `DOCKER_NAME`: @@ -96,7 +96,7 @@ After [creating the container](#creating-a-container), user can start and enter docker exec -it $DOCKER_NAME bash ``` -### Pip Installation +### Pip Installation Use pip to install vLLM MindSpore, by executing the following command: @@ -104,7 +104,7 @@ Use pip to install vLLM MindSpore, by executing the following command: pip install vllm_mindspore ``` -### Source Code Installation +### Source Code Installation - **CANN Installation** For CANN installation methods and environment configuration, please refer to [CANN Community Edition Installation Guide](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/82RC1alpha002/softwareinst/instg/instg_0001.html?Mode=PmIns&OS=openEuler&Software=cannToolKit). 
If you encounter any issues during CANN installation, please consult the [Ascend FAQ](https://www.hiascend.com/document/detail/zh/AscendFAQ/ProduTech/CANNFAQ/cannfaq_000.html) for troubleshooting. @@ -124,7 +124,7 @@ pip install vllm_mindspore yum install -y gcc gcc-c++ ``` -- **vLLM MindSpore Installation** +- **vLLM MindSpore Installation** To install vLLM MindSpore, user needs to pull the vLLM MindSpore source code and then runs the following command to install the dependencies: ```bash @@ -153,7 +153,7 @@ pip install vllm_mindspore export PYTHONPATH=$MF_PATH:$PYTHONPATH ``` -### Quick Verification +### Quick Verification To verify the installation, run a simple offline inference test with [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct): diff --git a/docs/vllm_mindspore/docs/source_en/getting_started/quick_start/quick_start.md b/docs/vllm_mindspore/docs/source_en/getting_started/quick_start/quick_start.md index edf49a6dbb417fcd469be38ab9a5c63815a83db2..277383bf69a0d69211c1179ac8101c9a1ee67149 100644 --- a/docs/vllm_mindspore/docs/source_en/getting_started/quick_start/quick_start.md +++ b/docs/vllm_mindspore/docs/source_en/getting_started/quick_start/quick_start.md @@ -1,14 +1,14 @@ -# Quick Start +# Quick Start [![View Source](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source_en.svg)](https://gitee.com/mindspore/docs/blob/master/docs/vllm_mindspore/docs/source_en/getting_started/quick_start/quick_start.md) This document provides a quick guide to deploy vLLM MindSpore by [docker](https://www.docker.com/), with the [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) model as an example. User can quickly experience the serving and inference abilities of vLLM MindSpore by [offline inference](#offline-inference) and [online serving](#online-serving). For more information about installation, please refer to the [Installation Guide](../installation/installation.md). -## Docker Installation +## Docker Installation In this section, we recommend to use docker to deploy the vLLM MindSpore environment. The following sections are the steps for deployment: -### Pulling the Image +### Pulling the Image Pull the vLLM MindSpore docker image by executing the following command: @@ -22,7 +22,7 @@ During the pull process, user will see the progress of each layer of the docker docker images ``` -### Creating a Container +### Creating a Container After [pulling the image](#pulling-the-image), set `DOCKER_NAME` and `IMAGE_NAME` as the container and image names, and create the container by running: @@ -64,7 +64,7 @@ After successfully creating the container, the container ID will be returned. Us docker ps ``` -### Entering the Container +### Entering the Container After [creating the container](#creating-a-container), use the environment variable `DOCKER_NAME` to start and enter the container by executing the following command: @@ -80,7 +80,7 @@ After deploying the environment, user need to prepare the model files before run User can download the model using either the [Python Tool](#downloading-with-python-tool) or [git-lfs Tool](#downloading-with-git-lfs-tool). 
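Both methods assume network access to [Hugging Face](https://huggingface.co/). The Python script in the next subsection typically relies on the `huggingface_hub` package; if it is missing from the container, it can be installed with pip first (a minimal sketch, assuming internet access inside the container):

```bash
pip install -U huggingface_hub
```
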
-#### Downloading with Python Tool +#### Downloading with Python Tool Execute the following Python script to download the [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) weights and files from [Hugging Face](https://huggingface.co/): @@ -118,7 +118,7 @@ Once confirmed, download the weights by executing the following command: git clone https://huggingface.co/Qwen/Qwen2.5-7B-Instruct ``` -### Setting Environment Variables +### Setting Environment Variables Before launching the model, user need to set the following environment variables: @@ -136,7 +136,7 @@ Here is an explanation of these environment variables: - `vLLM_MODEL_MEMORY_USE_GB`: The memory reserved for model loading. Adjust this value if insufficient memory error occurs during model loading. - `MINDFORMERS_MODEL_CONFIG`: The model configuration file. -### Offline Inference +### Offline Inference Taking [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) as an example, user can perform offline inference with the following Python script: @@ -174,11 +174,11 @@ Prompt: 'Today is'. Generated text: ' the 100th day of school. To celebrate, the Prompt: 'Llama is'. Generated text: ' a 100% natural, biodegradable, and compostable alternative' ``` -### Online Serving +### Online Serving vLLM MindSpore supports online serving deployment with the OpenAI API protocol. The following section would introduce how to [starting the service](#starting-the-service) and [send requests](#sending-requests) to obtain inference results, using [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) as an example. -#### Starting the Service +#### Starting the Service Use the model `Qwen/Qwen2.5-7B-Instruct` and start the vLLM service with the following command: @@ -200,7 +200,7 @@ Additionally, performance metrics will be logged, such as: Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg gereration throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0% ``` -#### Sending Requests +#### Sending Requests Use the following command to send a request, where `prompt` is the model input: diff --git a/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md b/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md new file mode 100644 index 0000000000000000000000000000000000000000..17257f7638b3b84b68e8003a3d4202521d3fc1be --- /dev/null +++ b/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md @@ -0,0 +1,342 @@ +# Parallel Inference (DeepSeek R1) + +[![View Source](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source_en.svg)](https://gitee.com/mindspore/docs/blob/master/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md) + +vLLM MindSpore supports hybrid parallel inference with configurations of tensor parallelism (TP), data parallelism (DP), expert parallelism (EP), and their combinations. For the applicable scenarios of different parallel strategies, refer to the [vLLM official documentation](https://docs.vllm.ai/en/latest/configuration/optimization.html#parallelism-strategies). 
+ +This document uses the DeepSeek R1 671B W8A8 model as an example to introduce the inference workflows for [tensor parallelism (TP16)](#tp16-tensor-parallel-inference) and [hybrid parallelism (DP4TP4EP4)](#dp4tp4ep4-hybrid-parallel-inference). The DeepSeek R1 671B W8A8 model requires multiple nodes to run inference. To ensure consistent execution configurations (including model configuration file paths, Python environments, etc.) across all nodes, it is recommended to use Docker containers to eliminate execution differences. + +Users can configure the environment by following the [Creating a Container](#creating-a-container) section below or referring to the [Installation Guide](../../installation/installation.md#installation-guide). + +## Creating a Container + +```bash +docker pull hub.oepkgs.net/oedeploy/openeuler/aarch64/mindspore:latest + +# Create Docker containers on the master and worker nodes respectively +docker run -itd --name=mindspore_vllm --ipc=host --network=host --privileged=true \ + --device=/dev/davinci0 \ + --device=/dev/davinci1 \ + --device=/dev/davinci2 \ + --device=/dev/davinci3 \ + --device=/dev/davinci4 \ + --device=/dev/davinci5 \ + --device=/dev/davinci6 \ + --device=/dev/davinci7 \ + --device=/dev/davinci_manager \ + --device=/dev/devmm_svm \ + --device=/dev/hisi_hdc \ + -v /usr/local/sbin/:/usr/local/sbin/ \ + -v /var/log/npu/slog/:/var/log/npu/slog \ + -v /var/log/npu/profiling/:/var/log/npu/profiling \ + -v /var/log/npu/dump/:/var/log/npu/dump \ + -v /var/log/npu/:/usr/slog \ + -v /etc/hccn.conf:/etc/hccn.conf \ + -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \ + -v /usr/local/dcmi:/usr/local/dcmi \ + -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \ + -v /etc/ascend_install.info:/etc/ascend_install.info \ + -v /etc/vnpu.cfg:/etc/vnpu.cfg \ + --shm-size="250g" \ + hub.oepkgs.net/oedeploy/openeuler/aarch64/mindspore:latest \ + bash +``` + +After successfully creating the container, the container ID will be returned. Users can execute the following command to verify whether the container was created successfully: + +```bash +docker ps +``` + +### Entering the Container + +After completing the [Creating a Container](#creating-a-container) step, use the predefined environment variable `DOCKER_NAME` to start and enter the container: + +```bash +docker exec -it $DOCKER_NAME bash +``` + +## Downloading Model Weights + +User can download the model using either [Python Tool](#downloading-with-python-tool) or [git-lfs Tool](#downloading-with-git-lfs-tool). + +### Downloading with Python Tool + +Execute the following Python script to download the MindSpore-compatible DeepSeek-R1 W8A8 weights and files from [Modelers Community](https://modelers.cn): + +```python +from openmind_hub import snapshot_download +snapshot_download(repo_id="MindSpore-Lab/DeepSeek-R1-W8A8", + local_dir="/path/to/save/deepseek_r1_w8a8", + local_dir_use_symlinks=False) +``` + +`local_dir` is the user-specified model save path. Ensure sufficient disk space is available. + +### Downloading with git-lfs Tool + +Run the following command to check if [git-lfs](https://git-lfs.com) is available: + +```bash +git lfs install +``` + +If available, the following output will be displayed: + +```text +Git LFS initialized. +``` + +If the tool is unavailable, install [git-lfs](https://git-lfs.com) first. Refer to [git-lfs installation](../../../faqs/faqs.md#git-lfs-installation) guidance in the [FAQ](../../../faqs/faqs.md) section. 
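As a reference, on an aarch64 Linux container the tool can usually be installed from an upstream release tarball; the version number below is illustrative, and the FAQ remains the authoritative procedure:

```bash
# Illustrative git-lfs installation from an upstream release tarball (version is an example).
curl -LO https://github.com/git-lfs/git-lfs/releases/download/v3.4.1/git-lfs-linux-arm64-v3.4.1.tar.gz
tar -xzf git-lfs-linux-arm64-v3.4.1.tar.gz
cd git-lfs-3.4.1 && ./install.sh   # installs the git-lfs binary
git lfs install                    # should print "Git LFS initialized."
```
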
+ +Once confirmed, download the weights by executing the following command: + +```shell +git clone https://modelers.cn/MindSpore-Lab/DeepSeek-R1-W8A8.git +``` + +## TP16 Tensor Parallel Inference + +vLLM manages and runs multi-node resources through Ray. This example corresponds to a scenario with Tensor Parallelism (TP) set to 16. + +### Setting Environment Variables + +Environment variables must be set before creating the Ray cluster. If the environment changes, stop the cluster with `ray stop` and recreate it; otherwise, the environment variables will not take effect. + +Configure the following environment variables on the master and worker nodes: + +```bash +source /usr/local/Ascend/ascend-toolkit/set_env.sh + +export GLOO_SOCKET_IFNAME=enp189s0f0 +export HCCL_SOCKET_IFNAME=enp189s0f0 +export TP_SOCKET_IFNAME=enp189s0f0 +export MS_ENABLE_LCCL=off +export HCCL_OP_EXPANSION_MODE=AIV +export MS_ALLOC_CONF=enable_vmm:true +export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 +export vLLM_MODEL_BACKEND=MindFormers +export MINDFORMERS_MODEL_CONFIG=/path/to/research/deepseek3/deepseek_r1_671b/predict_deepseek_r1_671b_w8a8.yaml +``` + +Environment variable descriptions: + +- `GLOO_SOCKET_IFNAME`: GLOO backend port. Use `ifconfig` to find the network interface name corresponding to the IP. +- `HCCL_SOCKET_IFNAME`: Configure the HCCL port. Use `ifconfig` to find the network interface name corresponding to the IP. +- `TP_SOCKET_IFNAME`: Configure the TP port. Use `ifconfig` to find the network interface name corresponding to the IP. +- `MS_ENABLE_LCCL`: Disable LCCL and enable HCCL communication. +- `HCCL_OP_EXPANSION_MODE`: Configure the communication algorithm expansion location to the AI Vector Core (AIV) computing unit on the device side. +- `MS_ALLOC_CONF`: Set the memory policy. Refer to the [MindSpore documentation](https://www.mindspore.cn/docs/en/master/api_python/env_var_list.html). +- `ASCEND_RT_VISIBLE_DEVICES`: Configure the available device IDs for each node. Use the `npu-smi info` command to check. +- `vLLM_MODEL_BACKEND`: The backend of the model to run. Currently supported models and backends for vLLM MindSpore can be found in the [Model Support List](../../../user_guide/supported_models/models_list/models_list.md). +- `MINDFORMERS_MODEL_CONFIG`: Model configuration file. Users can find the corresponding YAML file in the [MindSpore Transformers repository](https://gitee.com/mindspore/mindformers/tree/dev/research/deepseek3/deepseek_r1_671b), such as [predict_deepseek_r1_671b_w8a8.yaml](https://gitee.com/mindspore/mindformers/blob/dev/research/deepseek3/deepseek_r1_671b/predict_deepseek_r1_671b_w8a8.yaml). + +The model parallel strategy is specified in the `parallel_config` of the configuration file. For example, the TP16 tensor parallel configuration is as follows: + +```text +# default parallel of device num = 16 for Atlas 800T A2 +parallel_config: + data_parallel: 1 + model_parallel: 16 + pipeline_stage: 1 + expert_parallel: 1 +``` + +### Starting Ray for Multi-Node Cluster Management + +On Ascend, the pyACL package must be installed to adapt Ray. Additionally, the CANN dependency versions on all nodes must be consistent. + +#### Installing pyACL + +pyACL (Python Ascend Computing Language) encapsulates AscendCL APIs via CPython, enabling management of Ascend AI processors and computing resources. 
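After completing the installation steps below, a short import check such as the following can confirm that pyACL is visible to the Python environment used by Ray (a minimal sketch; API names follow the standard `acl` module and may differ across CANN versions):

```python
# Quick pyACL sanity check: initialize, count visible Ascend devices, then finalize.
import acl

ret = acl.init()                         # 0 indicates success
count, ret = acl.rt.get_device_count()   # number of Ascend devices visible to this process
print("visible Ascend devices:", count)
acl.finalize()
```
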
+ +In the corresponding environment, obtain the Ascend-cann-nnrt installation package for the required version, extract the pyACL dependency package, install it separately, and add the installation path to the environment variables: + +```shell +./Ascend-cann-nnrt_8.0.RC1_linux-aarch64.run --noexec --extract=./ +cd ./run_package +./Ascend-pyACL_8.0.RC1_linux-aarch64.run --full --install-path= +export PYTHONPATH=/CANN-/python/site-packages/:$PYTHONPATH +``` + +Download the Ascend runtime package from the [Ascend homepage](https://www.hiascend.cn/developer/download/community/result?module=cann&version=8.0.RC1.beta1). + +#### Multi-Node Cluster + +Before managing a multi-node cluster, ensure that the hostnames of all nodes are unique. If they are the same, set different hostnames using `hostname `. + +1. Start the master node: `ray start --head --port=`. After successful startup, the connection method for worker nodes will be displayed. For example, running `ray start --head --port=6379` on a node with IP `192.5.5.5` will display: + + ```text + Local node IP: 192.5.5.5 + + -------------------- + Ray runtime started. + -------------------- + + Next steps + To add another node to this Ray cluster, run + ray start --address='192.5.5.5:6379' + + To connect to this Ray cluster: + import ray + ray.init() + + To terminate the Ray runtime, run + ray stop + + To view the status of the cluster, use + ray status + ``` + +2. Connect worker nodes to the master node: `ray start --address=:`. +3. Check the cluster status with `ray status`. If the total number of NPUs displayed matches the sum of all nodes, the cluster is successfully created. + + For example, with two nodes, each with 8 NPUs, the output will be: + + ```shell + ======== Autoscaler status: 2025-05-19 00:00:00.000000 ======== + Node status + --------------------------------------------------------------- + Active: + 1 node_efa0981305b1204810c3080c09898097099090f09ee909d0ae12545 + 1 node_184f44c4790135907ab098897c878699d89098e879f2403bc990112 + Pending: + (no pending nodes) + Recent failures: + (no failures) + + Resources + --------------------------------------------------------------- + Usage: + 0.0/384.0 CPU + 0.0/16.0 NPU + 0B/2.58TiB memory + 0B/372.56GiB object_store_memory + + Demands: + (no resource demands) + ``` + +### Starting Online Service + +#### Starting the Service + +vLLM MindSpore can deploy online services using the OpenAI API protocol. Below is the workflow for launching the service. + +```bash +# Service launch parameter explanation +vllm-mindspore serve + --model=[Model Config/Weights Path] + --trust-remote-code # Use locally downloaded model files + --max-num-seqs [Maximum Batch Size] + --max-model-len [Maximum Input/Output Length] + --max-num-batched-tokens [Maximum Tokens per Iteration, recommended: 4096] + --block-size [Block Size, recommended: 128] + --gpu-memory-utilization [GPU Memory Utilization, recommended: 0.9] + --tensor-parallel-size [TP Parallelism Degree] +``` + +Execution example: + +```bash +# Master node: +vllm-mindspore serve --model="/path/to/save/deepseek_r1_w8a8" --trust-remote-code --max-num-seqs=256 --max_model_len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 16 --distributed-executor-backend=ray +``` + +In tensor parallel scenarios, the `--tensor-parallel-size` parameter overrides the `model_parallel` configuration in the model YAML file. 
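Besides the `curl` command shown in the next section, the service exposes a standard OpenAI-compatible endpoint, so any OpenAI client can be used. Below is a minimal sketch using the `openai` Python package (installing the package and the weight path are assumptions; the `model` field must match the `--model` path used at launch):

```python
# Minimal OpenAI-compatible client sketch against the vllm-mindspore service started above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # no API key is required by default
response = client.completions.create(
    model="/path/to/save/deepseek_r1_w8a8",  # same path passed to --model
    prompt="I am",
    max_tokens=20,
    temperature=0,
)
print(response.choices[0].text)
```
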
+ +#### Sending Requests + +Use the following command to send requests, where `prompt` is the model input: + +```bash +curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "/path/to/save/deepseek_r1_w8a8", "prompt": "I am", "max_tokens": 20, "temperature": 0, "top_p": 1.0, "top_k": 1, "repetition_penalty": 1.0}' +``` + +## DP4TP4EP4 Hybrid Parallel Inference + +vLLM manages and operates resources across multiple nodes through Ray. This example corresponds to the following parallel strategy: + +- Data Parallelism (DP): 4; +- Tensor Parallelism (TP): 4; +- Expert Parallelism (EP): 4. + +### DP4TP4EP4 Setting Environment Variables + +Configure the following environment variables on the master and worker nodes: + +```bash +source /usr/local/Ascend/ascend-toolkit/set_env.sh + +export MS_ENABLE_LCCL=off +export HCCL_OP_EXPANSION_MODE=AIV +export MS_ALLOC_CONF=enable_vmm:true +export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 +export vLLM_MODEL_BACKEND=MindFormers +export MINDFORMERS_MODEL_CONFIG=/path/to/research/deepseek3/deepseek_r1_671b/predict_deepseek_r1_671b_w8a8_ep4tp4.yaml +``` + +Environment variable descriptions: + +- `MS_ENABLE_LCCL`: Disable LCCL and enable HCCL communication. +- `HCCL_OP_EXPANSION_MODE`: Configure the communication algorithm expansion location to the AI Vector Core (AIV) computing unit on the device side. +- `MS_ALLOC_CONF`: Set the memory policy. Refer to the [MindSpore documentation](https://www.mindspore.cn/docs/en/master/api_python/env_var_list.html). +- `ASCEND_RT_VISIBLE_DEVICES`: Configure the available device IDs for each node. Use the `npu-smi info` command to check. +- `vLLM_MODEL_BACKEND`: The backend of the model to run. Currently supported models and backends for vLLM MindSpore can be found in the [Model Support List](../../../user_guide/supported_models/models_list/models_list.md). +- `MINDFORMERS_MODEL_CONFIG`: Model configuration file. Users can find the corresponding YAML file in the [MindSpore Transformers repository](https://gitee.com/mindspore/mindformers/tree/dev/research/deepseek3/deepseek_r1_671b), such as [predict_deepseek_r1_671b_w8a8_ep4t4.yaml](https://gitee.com/mindspore/mindformers/blob/dev/research/deepseek3/deepseek_r1_671b/predict_deepseek_r1_671b_w8a8_ep4t4.yaml). + +The model parallel strategy is specified in the `parallel_config` of the configuration file. For example, the DP4TP4EP4 hybrid parallel configuration is as follows: + +```text +# default parallel of device num = 16 for Atlas 800T A2 +parallel_config: + data_parallel: 4 + model_parallel: 4 + pipeline_stage: 1 + expert_parallel: 4 +``` + +`data_parallel` and `model_parallel` specify the parallelism strategy for the attention and feed-forward dense layers, while `expert_parallel` specifies the expert routing parallelism strategy for MoE layers. Ensure that `data_parallel` * `model_parallel` is divisible by `expert_parallel`. + +### DP4TP4EP4 Starting Online Service + +`vllm-mindspore` can deploy online services using the OpenAI API protocol. 
Below is the workflow for launching the service: + +```bash +# Parameter explanations for service launch +vllm-mindspore serve + --model=[Model Config/Weights Path] + --trust-remote-code # Use locally downloaded model files + --max-num-seqs [Maximum Batch Size] + --max-model-len [Maximum Input/Output Length] + --max-num-batched-tokens [Maximum Tokens per Iteration, recommended: 4096] + --block-size [Block Size, recommended: 128] + --gpu-memory-utilization [GPU Memory Utilization, recommended: 0.9] + --tensor-parallel-size [TP Parallelism Degree] + --headless # Required only for worker nodes, indicating no service-side content + --data-parallel-size [DP Parallelism Degree] + --data-parallel-size-local [DP count on the current service node, sum across all nodes equals data-parallel-size] + --data-parallel-start-rank [Offset of the first DP handled by the current service node] + --data-parallel-address [Master node communication IP] + --data-parallel-rpc-port [Master node communication port] + --enable-expert-parallel # Enable expert parallelism +``` + +Execution example: + +```bash +# Master node: +vllm-mindspore serve --model="/path/to/save/deepseek_r1_w8a8" --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 0 --data-parallel-address 192.10.10.10 --data-parallel-rpc-port 12370 --enable-expert-parallel + +# Worker node: +vllm-mindspore serve --headless --model="/path/to/save/deepseek_r1_w8a8" --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 2 --data-parallel-address 192.10.10.10 --data-parallel-rpc-port 12370 --enable-expert-parallel +``` + +## Sending Requests + +Use the following command to send requests, where `$PROMPT` is the model input: + +```bash +curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "/path/to/save/deepseek_r1_w8a8", "prompt": "I am", "max_tokens": 20, "temperature": 0}' +``` diff --git a/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md b/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md index 007bdab99ef7f181c9102280ec54feca98e1a3b3..6142a006d9865084b74cac8627c314d38454ef12 100644 --- a/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md +++ b/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md @@ -1,4 +1,4 @@ -# NPU Single-Node Multi-Card Inference (Qwen2.5-32B) +# Single-Node Multi-Card NPU Inference (Qwen2.5-32B) [![View Source](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source_en.svg)](https://gitee.com/mindspore/docs/blob/master/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md) @@ -91,7 +91,7 @@ snapshot_download( `local_dir` is the user-specified path to save the model. Ensure sufficient disk space is available. 
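After the download finishes, it is worth confirming that the weight files actually landed in `local_dir`; the listing below is a sketch, and the exact path and file names vary by model repository:

```bash
# Confirm the download completed (path is the local_dir passed to the download script).
ls /path/to/save/Qwen2.5-32B-Instruct
# A complete Hugging Face checkout typically contains config.json, tokenizer files,
# and a set of *.safetensors weight shards.
```
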
-### Downloading with git-lfs Tool +### Downloading with git-lfs Tool Run the following command to verify if [git-lfs](https://git-lfs.com) is available: @@ -105,7 +105,7 @@ If available, the following output will be displayed: Git LFS initialized. ``` -If unavailable, install [git-lfs](https://git-lfs.com) first. Refer to the [FAQ](../../../faqs/faqs.md) section for [git-lfs installation](../../../faqs/faqs.md#git-lfs-installation) guidance. +If the tool is unavailable, install [git-lfs](https://git-lfs.com) first. Refer to [git-lfs installation](../../../faqs/faqs.md#git-lfs-installation) guidance in the [FAQ](../../../faqs/faqs.md) section. Once confirmed, execute the following command to download the weights: diff --git a/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md b/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md index eb227e4ff9632e8dc0cba350e96f31e8187f680e..8cee4513c5bea7e2177797543c1037686d6a6732 100644 --- a/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md +++ b/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md @@ -1,14 +1,14 @@ -# Single NPU Inference (Qwen2.5-7B) +# Single-Card NPU Inference (Qwen2.5-7B) [![View Source](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source_en.svg)](https://gitee.com/mindspore/docs/blob/master/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md) This document introduces single NPU inference process by vLLM MindSpore. Taking the [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) model as an example, user can configure the environment through the [Docker Installation](#docker-installation) or the [Installation Guide](../../installation/installation.md#installation-guide), and [downloading model weights](#downloading-model-weights). After [setting environment variables](#setting-environment-variables), user can perform [offline inference](#offline-inference) and [online inference](#online-inference) to experience single NPU inference abilities. -## Docker Installation +## Docker Installation In this section, we recommend using Docker for quick deployment of the vLLM MindSpore environment. Below are the steps for Docker deployment: -### Pulling the Image +### Pulling the Image Pull the vLLM MindSpore Docker image by executing the following command: @@ -22,7 +22,7 @@ During the pull process, user will see the progress of each layer. After success docker images ``` -### Creating a Container +### Creating a Container After [pulling the image](#pulling-the-image), set `DOCKER_NAME` and `IMAGE_NAME` as the container and image names, then create the container: @@ -64,7 +64,7 @@ After successful creation, the container ID will be returned. Verify the contain docker ps ``` -### Entering the Container +### Entering the Container After [creating the container](#creating-a-container), start and enter the container using the predefined `DOCKER_NAME`: @@ -72,7 +72,7 @@ After [creating the container](#creating-a-container), start and enter the conta docker exec -it $DOCKER_NAME bash ``` -## Downloading Model Weights +## Downloading Model Weights User can download the model using either [Python Tool](#downloading-with-python-tool) or [git-lfs Tool](#downloading-with-git-lfs-tool). 
@@ -91,7 +91,7 @@ snapshot_download( `local_dir` is the user-specified model save path. Ensure sufficient disk space is available. -### Downloading with git-lfs Tool +### Downloading with git-lfs Tool Run the following command to check if [git-lfs](https://git-lfs.com) is available: @@ -105,7 +105,7 @@ If available, the following output will be displayed: Git LFS initialized. ``` -If unavailable, install [git-lfs](https://git-lfs.com) first. Refer to the [FAQ](../../../faqs/faqs.md) section for [git-lfs installation](../../../faqs/faqs.md#git-lfs-installation) guidance. +If the tool is unavailable, install [git-lfs](https://git-lfs.com) first. Refer to [git-lfs installation](../../../faqs/faqs.md#git-lfs-installation) guidance in the [FAQ](../../../faqs/faqs.md) section. Once confirmed, download the weights by executing the following command: @@ -113,7 +113,7 @@ Once confirmed, download the weights by executing the following command: git clone https://huggingface.co/Qwen/Qwen2.5-7B-Instruct ``` -## Setting Environment Variables +## Setting Environment Variables For [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct), the following environment variables configure memory allocation, backend, and model-related YAML files: @@ -139,7 +139,7 @@ export NPU_VISIBE_DEVICES=0 export ASCEND_RT_VISIBLE_DEVICES=$NPU_VISIBE_DEVICES ``` -## Offline Inference +## Offline Inference Taking [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) as an example, user can perform offline inference with the following Python code: @@ -176,11 +176,11 @@ Prompt: 'Today is'. Generated text: ' the 100th day of school. To celebrate, the Prompt: 'Llama is'. Generated text: ' a 100% natural, biodegradable, and compostable alternative' ``` -## Online Inference +## Online Inference vLLM MindSpore supports online serving deployment with the OpenAI API protocol. The following section would introduce how to [starting the service](#starting-the-service) and [send requests](#sending-requests) to obtain inference results, using [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) as an example. -### Starting the Service +### Starting the Service Use the model `Qwen/Qwen2.5-7B-Instruct` and start the vLLM service with the following command: @@ -202,7 +202,7 @@ Additionally, performance metrics will be logged, such as: Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0% ``` -#### Sending Requests +#### Sending Requests Use the following command to send a request, where `prompt` is the model input: diff --git a/docs/vllm_mindspore/docs/source_en/release_notes/release_notes.md b/docs/vllm_mindspore/docs/source_en/release_notes/release_notes.md index e8dc13f686149d9ac4066029e039c40b1c38dba1..f8a100f29165a48cfd67a267dcb825958346d6dd 100644 --- a/docs/vllm_mindspore/docs/source_en/release_notes/release_notes.md +++ b/docs/vllm_mindspore/docs/source_en/release_notes/release_notes.md @@ -2,11 +2,11 @@ [![View Source](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source_en.svg)](https://gitee.com/mindspore/docs/blob/master/docs/vllm_mindspore/docs/source_en/release_notes/release_notes.md) -## vLLM MindSpore 0.3.0 Release Notes +## vLLM MindSpore 0.3.0 Release Notes The following are the key new features and models supported in the vLLM MindSpore plugin version 0.3.0. 
-### New Features +### New Features - 0.8.3 V1 Architecture Basic Features, including chunked prefill and automatic prefix caching; - V0 Multi-step Scheduling; diff --git a/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/operations/npu_ops.md b/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/operations/npu_ops.md index 4e8392a75eae6a7bec60b1df429e7f3b77d4f1ae..b186332f26102ae08ef624199f075985d049a586 100644 --- a/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/operations/npu_ops.md +++ b/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/operations/npu_ops.md @@ -1,4 +1,4 @@ -# Custom Operator Integration +# Custom Operator Integration [![View Source](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source_en.svg)](https://gitee.com/mindspore/docs/blob/master/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/operations/npu_ops.md) @@ -6,7 +6,7 @@ This document would introduce how to integrate a new custom operator into the vL For development, additional features can be extended based on project requirements. Implementation details can be referenced from [MindSpore Custom Operator Implementation](https://www.mindspore.cn/tutorials/en/master/custom_program/operation/op_customopbuilder.html). -## File Structure +## File Structure The directory `vllm_mindspore/ops` contains and declaration and implementation of operations: @@ -26,11 +26,11 @@ vllm_mindspore/ops/ - **`ops/ascendc/`**: Contains AscendC custom operator implementation code. - **`ops/module/`**: Contains operator integration layer code, including common module registration (`module.h`, `module.cpp`) and operator-specific integration (e.g., `adv_step_flash.cpp`). -## Integration Process +## Integration Process To integrate a custom operator, user need to create [Operator Interface Declaration](#operator-interface-declaration), [Operator Implementation](#operator-implementation) and [Operator Integration](#operator-integration) in the directory `ops/ascendc/`. And do [Operator Compilation and Testing](#operator-compilation-and-testing) after declaration and implementation. -### Operator Interface Declaration +### Operator Interface Declaration Create a header file (e.g., `my_custom_op.h`) in `ops/ascendc/` to declare the operator function and related interfaces: @@ -44,7 +44,7 @@ extern void MyCustomOpKernelEntry(uint32_t blockDims, void *l2ctrl, void *aclStr #endif // VLLM_MINDSPORE_OPS_ASCENDC_MY_CUSTOM_OP_H ``` -### Operator Implementation +### Operator Implementation Create an implementation file (e.g., `my_custom_op.c`) in `ops/ascendc/` for the core logic: @@ -65,7 +65,7 @@ void MyCustomOpKernelEntry(uint32_t blockDims, void *l2ctrl, void *aclStream, #endif ``` -### Operator Integration +### Operator Integration Create an integration file (e.g., `my_custom_op.cpp`) in `module/`. User can refer to `adv_step_flash.cpp` for more details about the integration: @@ -86,7 +86,7 @@ MS_EXTENSION_MODULE(my_custom_op) { } ``` -### Operator Compilation and Testing +### Operator Compilation and Testing 1. **Code Integration**: Merge the code into the vLLM MindSpore project. 2. **Project Compilation**: Build and install the whl package containing the custom operator. 
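Once the wheel is installed, a short Python check can confirm that the new operator is importable and callable. The module path and call signature below follow the hypothetical `my_custom_op` example above and are illustrative only; the real entry point depends on how the extension module is registered:

```python
# Hypothetical smoke test for the custom operator built above.
# `my_custom_op_module` and the call signature are placeholders for the names
# registered via MS_EXTENSION_MODULE in the integration file.
import numpy as np
import mindspore as ms
import my_custom_op_module  # assumed name of the compiled extension module

x = ms.Tensor(np.random.rand(4, 8).astype(np.float32))
y = my_custom_op_module.my_custom_op(x)  # dispatches to the AscendC kernel
print(y.shape)
```
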
diff --git a/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/profiling/profiling.md b/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/profiling/profiling.md index 7213a8c04bfd0aef1c12da1da05e5f7b33be379e..fbf5b7868f499625f50ff997b0fc09bbe74b8556 100644 --- a/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/profiling/profiling.md +++ b/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/profiling/profiling.md @@ -4,7 +4,7 @@ vLLM MindSpore supports the `mindspore.Profiler` module to track the performance of workers in vLLM MindSpore. User can follow the [Collecting Profiling Data](#collecting-profiling-data) section to gather data and then analyze it according to [Analyzing Profiling Data](#analyzing-profiling-data). Additionally, user can inspect the model's IR graph through [Graph Data Dump](#graph-data-dump) to analyze and debug the model structure. -## Collecting Profiling Data +## Collecting Profiling Data To enable profiling data collection, user need to set the `VLLM_TORCH_PROFILER_DIR` environment variable to the directory where the profiling results will be saved. For multi-machine inference, this variable must be set on each machine before inference: @@ -56,7 +56,7 @@ When the log displays content similar to the following, it indicates that profil Parsing: [####################] 3/3 Done ``` -## Analyzing Profiling Data +## Analyzing Profiling Data The directory specified by `VLLM_TORCH_PROFILER_DIR` contains the profiling results, with subdirectories named with the `ascend_ms` suffix. Each subdirectory stores the profiling results for one worker. The files in these subdirectories can be referenced for performance analysis, as described in [Ascend Performance Tuning](https://www.mindspore.cn/tutorials/en/master/debug/profiler.html). @@ -82,7 +82,7 @@ User can select a subdirectory to analyze the performance of a single worker: ![](trace_2.png) -## Graph Data Dump +## Graph Data Dump Refer to the [MindSpore Dump Documentation](https://www.mindspore.cn/tutorials/en/master/debug/dump.html). First, configure the JSON file, then set the `MINDSPORE_DUMP_CONFIG` environment variable to point to the absolute path of this configuration file. After inference completes, the graph data can be obtained. 
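A minimal sketch of this setup is shown below; the field values are illustrative, and the full schema is described in the Dump documentation linked above:

```bash
# Write an illustrative dump configuration and point MINDSPORE_DUMP_CONFIG at it.
cat > /absolute/path/to/dump_config.json << 'EOF'
{
  "common_dump_settings": {
    "op_debug_mode": 0,
    "dump_mode": 0,
    "path": "/absolute/path/to/dump_output",
    "net_name": "Net",
    "iteration": "all",
    "input_output": 0,
    "kernels": ["Default"],
    "support_device": [0, 1, 2, 3, 4, 5, 6, 7]
  },
  "e2e_dump_settings": {
    "enable": true,
    "trans_flag": true
  }
}
EOF
export MINDSPORE_DUMP_CONFIG=/absolute/path/to/dump_config.json
```
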
diff --git a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/deepseek_DP_EP/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/deepseek_DP_EP/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md deleted file mode 100644 index fe4b053d40fb25f81c1350a4887efe77423b3f27..0000000000000000000000000000000000000000 --- a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/deepseek_DP_EP/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md +++ /dev/null @@ -1,136 +0,0 @@ -# Deepseek r1 DP EP推理示例 - -[![查看源文件](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source.svg)](https://gitee.com/mindspore/docs/blob/master/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/deepseek_DP_EP/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md) - -以下将以Deepseek r1 671B w8a8为例,介绍双机DP推理流程。 - -## 新建docker容器 - -```bash -docker pull hub.oepkgs.net/oedeploy/openeuler/aarch64/mindspore:latest - -# 分别在主从节点新建docker容器 -docker run -itd --name=mindspore_vllm --ipc=host --network=host --privileged=true \ - --device=/dev/davinci0 \ - --device=/dev/davinci1 \ - --device=/dev/davinci2 \ - --device=/dev/davinci3 \ - --device=/dev/davinci4 \ - --device=/dev/davinci5 \ - --device=/dev/davinci6 \ - --device=/dev/davinci7 \ - --device=/dev/davinci_manager \ - --device=/dev/devmm_svm \ - --device=/dev/hisi_hdc \ - -v /usr/local/sbin/:/usr/local/sbin/ \ - -v /var/log/npu/slog/:/var/log/npu/slog \ - -v /var/log/npu/profiling/:/var/log/npu/profiling \ - -v /var/log/npu/dump/:/var/log/npu/dump \ - -v /var/log/npu/:/usr/slog \ - -v /etc/hccn.conf:/etc/hccn.conf \ - -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \ - -v /usr/local/dcmi:/usr/local/dcmi \ - -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \ - -v /etc/ascend_install.info:/etc/ascend_install.info \ - -v /etc/vnpu.cfg:/etc/vnpu.cfg \ - --shm-size="250g" \ - hub.oepkgs.net/oedeploy/openeuler/aarch64/mindspore:latest \ - bash -``` - -## 下载模型权重 - -### Python脚本工具下载 - -执行以下 Python 脚本,从[魔乐社区](https://modelers.cn)下载 MindSpore版本的DeepSeek-R1 W8A8权重及文件。其中`local_dir`由用户指定,请确保该路径下有足够的硬盘空间。 - -```python -from openmind_hub import snapshot_download -snapshot_download(repo_id="MindSpore-Lab/DeepSeek-R1-W8A8", - local_dir="/path/to/save/deepseek_r1_w8a8", - local_dir_use_symlinks=False) -``` - -### git-lfs工具下载 - -执行以下代码,以确认`git-lfs`工具是否可用: - -```bash -git lfs install -``` - -如果可用,将获得如下返回结果: - -```text -Git LFS initialized. 
-``` - -不可用则需要先安装`git-lfs`,请参考[git-lfs](https://git-lfs.com),或参考[faqs](../../../faqs/faqs.md)章节中关于`git-lfs安装`的阐述。 -执行以下命令,下载权重: - -```shell -git clone https://modelers.cn/MindSpore-Lab/DeepSeek-R1-W8A8.git -``` - -## 设置环境变量 - -分别在主从节点配置如下环境变量: - -```bash -alias wget="wget --no-check-certificate" -source /usr/local/Ascend/ascend-toolkit/set_env.sh -export ASCEND_CUSTOM_PATH=$ASCEND_HOME_PATH/../ -export MINDFORMERS_MODEL_CONFIG=/usr/local/Python-3.11/lib/python3.11/site-packages/research/deepseek3/deepseek_r1_671b/predict_deepseek_r1_671b_w8a8_ep4tp4.yaml # DP4 TP4 EP4 -# export MINDFORMERS_MODEL_CONFIG=/usr/local/Python-3.11/lib/python3.11/site-packages/research/deepseek3/deepseek_r1_671b/predict_deepseek_r1_671b_w8a8_ep16.yaml # DP16 EP16 -export vLLM_MODEL_BACKEND=MindFormers -export GLOO_SOCKET_IFNAME=enp189s0f0 # ifconfig查找ip对应网卡的网卡名 -export MS_ENABLE_LCCL=off -export HCCL_OP_EXPANSION_MODE=AIV -export VLLM_USE_V1=1 -export EXPERIMENTAL_KERNEL_LAUNCH_GROUP='thread_num:4,kernel_group_num:16' -export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 -``` - -## 启动Deepseek r1 671B w8a8模型在线服务 - -`vllm-mindspore`可使用OpenAI的API协议,部署为在线服务。以下是在线服务的拉起流程: - -```bash -# 启动配置参数说明 -vllm-mindspore serve - --model=[模型Config/权重路径] - --trust_remote_code # 使用本地下载的model文件 - --max-num-seqs [最大Batch数] - --max_model_len [输出输出最大长度] - --max-num-batched-tokens [单次迭代最大支持token数, 推荐4096] - --block-size [Block Size 大小, 推荐128] - --gpu-memory-utilization [显存利用率, 推荐0.9] - --tensor-parallel-size [TP 并行数] - --headless # 仅从节点需要配置,表示不需要服务侧相关内容 - --data-parallel-size [DP 并行数] - --data-parallel-size-local [当前服务节点中的DP数,所有节点求和等于data-parallel-size] - --data-parallel-start-rank [当前服务节点中负责的首个DP的偏移量] - --data-parallel-address [主节点的通讯IP] - --data-parallel-rpc-port [主节点的通讯端口] - --enable-expert-parallel # 使能专家并行 -``` - -执行示例: - -```bash -# 主节点: -vllm-mindspore serve --model="/path/to/save/deepseek_r1_w8a8" --trust_remote_code --max-num-seqs=256 --max_model_len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 0 --data-parallel-address 192.10.10.10 --data-parallel-rpc-port 12370 --enable-expert-parallel > log11 2>&1 & - -# 从节点: -vllm-mindspore serve --model="/path/to/save/deepseek_r1_w8a8" --trust_remote_code --max-num-seqs=256 --max_model_len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --headless --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 2 --data-parallel-address 192.10.10.10 --data-parallel-rpc-port 12370 --enable-expert-parallel > log11 2>&1 & -``` - -## 发送请求 - -使用如下命令发送请求。其中`$PROMPT`为模型输入: - -```bash -PROMPT="I am" -MAX_TOKEN=120 -curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "/path/to/save/deepseek_r1_w8a8", "prompt": "$PROMPT", "max_tokens": $MAX_TOKEN, "temperature": 0}' -``` diff --git a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/deepseek_multiNode/deepseek_r1_671b_w8a8_tp16_multi_node.md b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md similarity index 32% rename from docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/deepseek_multiNode/deepseek_r1_671b_w8a8_tp16_multi_node.md rename to docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md index 
af563414a8a993bc6ba58f27887779c7999a34ca..add87115e00ddfd1d7712928039f4db8f1b7a8a5 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/deepseek_multiNode/deepseek_r1_671b_w8a8_tp16_multi_node.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md @@ -1,16 +1,14 @@ -# Deepseek r1 多节点 TP16 推理示例 +# 并行推理(DeepSeek R1) -[![查看源文件](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source.svg)](https://gitee.com/mindspore/docs/blob/master/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/deepseek_multiNode/deepseek_r1_671b_w8a8_tp16_multi_node.md) +[![查看源文件](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source.svg)](https://gitee.com/mindspore/docs/blob/master/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md) -如果一个节点环境无法支撑一个推理模型服务的运行,则考虑会使用多个节点资源运行推理模型。 -VLLM 通过 Ray 对多个节点资源进行管理和运行。 +vLLM MindSpore支持张量并行(TP)、数据并行(DP)、专家并行(EP)及其组合配置的混合并行推理,不同并行策略的适用场景可参考[vLLM官方文档](https://docs.vllm.ai/en/latest/configuration/optimization.html#parallelism-strategies)。 -以下将以 Deepseek r1 671B w8a8 为例,介绍双节点TP推理流程。 +本文档将以DeepSeek R1 671B W8A8为例介绍[张量并行](#tp16-张量并行推理)及[混合并行](#dp4tp4ep4-混合并行推理)推理流程。DeepSeek R1 671B W8A8模型需使用多个节点资源运行推理模型。为确保各个节点的执行配置(包括模型配置文件路径、Python环境等)一致,推荐通过 docker 镜像创建容器的方式避免执行差异。 -## 使用 docker 容器 +用户可通过以下[新建容器](#新建容器)章节或参考[安装指南](../../installation/installation.md#安装指南)进行环境配置。 -在通过 Ray 执行多节点任务时,需要确保各个节点的执行配置都是一致的,包括但不限于:模型配置文件路径、Python环境等。 -推荐通过 docker 镜像创建容器的方式屏蔽执行差异。镜像和创建容器方法如下: +## 新建容器 ```bash docker pull hub.oepkgs.net/oedeploy/openeuler/aarch64/mindspore:latest @@ -44,15 +42,27 @@ docker run -itd --name=mindspore_vllm --ipc=host --network=host --privileged=tru bash ``` -## 下载模型权重 +新建容器后成功后,将返回容器ID。用户可执行以下命令,确认容器是否创建成功: + +```bash +docker ps +``` + +### 进入容器 + +用户在完成[新建容器](#新建容器)后,使用已定义的环境变量`DOCKER_NAME`,启动并进入容器: + +```bash +docker exec -it $DOCKER_NAME bash +``` -推理模型服务需要下再对应的模型配置与权重。 +## 下载模型权重 -> 各节点的下载存放位置应该一致。 +用户可采用[Python工具下载](#python工具下载)或[git-lfs工具下载](#git-lfs工具下载)两种方式,进行模型下载。 -### Python 脚本工具下载 +### Python工具下载 -执行以下 Python 脚本,从[魔乐社区](https://modelers.cn)下载 MindSpore版本的DeepSeek-R1 W8A8权重及文件。其中`local_dir`由用户指定,请确保该路径下有足够的硬盘空间: +执行以下 Python 脚本,从[魔乐社区](https://modelers.cn)下载 MindSpore版本的DeepSeek-R1 W8A8权重及文件: ```python from openmind_hub import snapshot_download @@ -61,9 +71,11 @@ snapshot_download(repo_id="MindSpore-Lab/DeepSeek-R1-W8A8", local_dir_use_symlinks=False) ``` -### git-lfs 工具下载 +其中`local_dir`为模型保存路径,由用户指定,请确保该路径下有足够的硬盘空间。 + +### git-lfs工具下载 -执行以下代码,以确认`git-lfs`工具是否可用: +执行以下代码,以确认[git-lfs](https://git-lfs.com)工具是否可用: ```bash git lfs install @@ -75,40 +87,66 @@ git lfs install Git LFS initialized. 
``` -不可用则需要先安装`git-lfs`,请参考[git-lfs](https://git-lfs.com),或参考[faqs](../../../faqs/faqs.md)章节中关于`git-lfs安装`的阐述。 -执行以下命令,下载权重: +若工具不可用,则需要先安装[git-lfs](https://git-lfs.com),可参考[FAQ](../../../faqs/faqs.md)章节中关于[git-lfs安装](../../../faqs/faqs.md#git-lfs安装)的阐述。 + +工具确认可用后,执行以下命令,下载权重: ```shell git clone https://modelers.cn/MindSpore-Lab/DeepSeek-R1-W8A8.git ``` -## 设置环境变量 +## TP16 张量并行推理 -分别在主从节点配置如下环境变量: +vLLM 通过 Ray 对多个节点资源进行管理和运行。该样例对应张量并行(TP)为16的场景。 + +### 设置环境变量 -> 环境变量必须设置在 Ray 创建集群前,且当环境有变更时,需要通过 `ray stop` 将主从节点集群停止,并重新创建集群,否则环境变量将不生效。 +环境变量必须设置在 Ray 创建集群前,且当环境有变更时,需要通过 `ray stop` 将主从节点集群停止,并重新创建集群,否则环境变量将不生效。 + +分别在主从节点配置如下环境变量: ```bash source /usr/local/Ascend/ascend-toolkit/set_env.sh -export ASCEND_CUSTOM_PATH=$ASCEND_HOME_PATH/../ -export GLOO_SOCKET_IFNAME=enp189s0f0 # ifconfig查找ip对应网卡的网卡名 -export HCCL_SOCKET_IFNAME=enp189s0f0 # ifconfig查找ip对应网卡的网卡名 -export TP_SOCKET_IFNAME=enp189s0f0 # ifconfig查找ip对应网卡的网卡名 +export GLOO_SOCKET_IFNAME=enp189s0f0 +export HCCL_SOCKET_IFNAME=enp189s0f0 +export TP_SOCKET_IFNAME=enp189s0f0 export MS_ENABLE_LCCL=off export HCCL_OP_EXPANSION_MODE=AIV export MS_ALLOC_CONF=enable_vmm:true export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 -export RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES=1 export vLLM_MODEL_BACKEND=MindFormers export MINDFORMERS_MODEL_CONFIG=/path/to/research/deepseek3/deepseek_r1_671b/predict_deepseek_r1_671b_w8a8.yaml ``` -## 启动 Ray 进行多节点集群管理 +环境变量说明: + +- `GLOO_SOCKET_IFNAME`: GLOO后端端口。可通过`ifconfig`查找ip对应网卡的网卡名。 +- `HCCL_SOCKET_IFNAME`: 配置HCCL端口。可通过`ifconfig`查找ip对应网卡的网卡名。 +- `TP_SOCKET_IFNAME`: 配置TP端口。可通过`ifconfig`查找ip对应网卡的网卡名。 +- `MS_ENABLE_LCCL`: 关闭LCCL,使能HCCL通信。 +- `HCCL_OP_EXPANSION_MODE`: 配置通信算法的编排展开位置为Device侧的AI Vector Core计算单元。 +- `MS_ALLOC_CONF`: 设置内存策略。可参考[MindSpore官网文档](https://www.mindspore.cn/docs/zh-CN/master/api_python/env_var_list.html)。 +- `ASCEND_RT_VISIBLE_DEVICES`: 配置每个节点可用device id。用户可使用`npu-smi info`命令进行查询。 +- `vLLM_MODEL_BACKEND`:所运行的模型后端。目前vLLM MindSpore所支持的模型与模型后端,可在[模型支持列表](../../../user_guide/supported_models/models_list/models_list.md)中进行查询。 +- `MINDFORMERS_MODEL_CONFIG`:模型配置文件。用户可以在[MindSpore Transformers工程](https://gitee.com/mindspore/mindformers/tree/dev/research/deepseek3/deepseek_r1_671b)中,找到对应模型的yaml文件[predict_deepseek_r1_671b_w8a8.yaml](https://gitee.com/mindspore/mindformers/blob/dev/research/deepseek3/deepseek_r1_671b/predict_deepseek_r1_671b_w8a8.yaml) 。 + +模型并行策略通过配置文件中的`parallel_config`指定,例如TP16 张量并行配置如下所示: + +```text +# default parallel of device num = 16 for Atlas 800T A2 +parallel_config: + data_parallel: 1 + model_parallel: 16 + pipeline_stage: 1 + expert_parallel: 1 +``` + +### 启动 Ray 进行多节点集群管理 在 Ascend 上,需要额外安装 pyACL 包来适配 Ray。且所有节点的 CANN 依赖版本需要保持一致。 -### 安装 pyACL +#### 安装 pyACL pyACL (Python Ascend Computing Language) 通过 CPython 封装了 AscendCL 对应的 API 接口,使用接口可以管理 Ascend AI 处理器和对应的计算资源。 @@ -118,16 +156,16 @@ pyACL (Python Ascend Computing Language) 通过 CPython 封装了 AscendCL 对 ./Ascend-cann-nnrt_8.0.RC1_linux-aarch64.run --noexec --extract=./ cd ./run_package ./Ascend-pyACL_8.0.RC1_linux-aarch64.run --full --install-path= -export PYTHONPATH=/CANN-/python/site-packages/:\$PYTHONPATH +export PYTHONPATH=/CANN-/python/site-packages/:$PYTHONPATH ``` -> 在 Ascend 的首页中可以下载 Ascend 运行包。如, 可以下载 [8.0.RC1.beta1](https://www.hiascend.cn/developer/download/community/result?module=cann&version=8.0.RC1.beta1) 对应版本的运行包。 +在 Ascend 的首页中可以下载 Ascend 运行包。如, 可以下载 [8.0.RC1.beta1](https://www.hiascend.cn/developer/download/community/result?module=cann&version=8.0.RC1.beta1) 对应版本的运行包。 -### 多节点间集群 +#### 多节点间集群 -> 
多节点集群管理前,需要检查各节点的 hostname 是否各异,如果存在相同的,需要通过 `hostname ` 设置不同的 hostname。 +多节点集群管理前,需要检查各节点的 hostname 是否各异,如果存在相同的,需要通过 `hostname ` 设置不同的 hostname。 -- 启动主节点 `ray start --head --port=`,启动成功后,会提示从节点的连接方式。如在 ip 为 `192.5.5.5` 的环境中,通过 `ray start --head --port=6379`,提示如下: +1. 启动主节点 `ray start --head --port=`,启动成功后,会提示从节点的连接方式。如在 ip 为 `192.5.5.5` 的环境中,通过 `ray start --head --port=6379`,提示如下: ```text Local node IP: 192.5.5.5 @@ -151,8 +189,8 @@ export PYTHONPATH=/CANN-/python/site-packages/:\$PYTHONPA ray status ``` -- 从节点连接主节点 `ray start --address=:`。 -- 通过 `ray status` 查询集群状态,显示的NPU总数为节点总合,则表示集群成功。 +2. 从节点连接主节点 `ray start --address=:`。 +3. 通过 `ray status` 查询集群状态,显示的NPU总数为节点总合,则表示集群成功。 当有两个节点,每个节点有8个NPU时,其结果如下: @@ -180,19 +218,20 @@ export PYTHONPATH=/CANN-/python/site-packages/:\$PYTHONPA (no resource demands) ``` -## 启动 Deepseek r1 671B w8a8 模型在线服务 +### 启动在线服务 -### 启动服务 +#### 启动服务 vLLM MindSpore可使用OpenAI的API协议,部署为在线服务。以下是在线服务的拉起流程。 ```bash # 启动配置参数说明 + vllm-mindspore serve --model=[模型Config/权重路径] - --trust_remote_code # 使用本地下载的model文件 + --trust-remote-code # 使用本地下载的model文件 --max-num-seqs [最大Batch数] - --max_model_len [输出输出最大长度] + --max-model-len [输出输出最大长度] --max-num-batched-tokens [单次迭代最大支持token数, 推荐4096] --block-size [Block Size 大小, 推荐128] --gpu-memory-utilization [显存利用率, 推荐0.9] @@ -203,15 +242,104 @@ vllm-mindspore serve ```bash # 主节点: -vllm-mindspore serve --model="/path/to/save/deepseek_r1_w8a8" --trust_remote_code --max-num-seqs=256 --max_model_len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 16 +vllm-mindspore serve --model="/path/to/save/deepseek_r1_w8a8" --trust-remote-code --max-num-seqs=256 --max_model_len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 16 --distributed-executor-backend=ray +``` + +张量并行场景下,`--tensor-parallel-size`参数会覆盖模型yaml文件中`parallel_config`的`model_parallel`配置。 + +#### 发起请求 + +使用如下命令发送请求。其中`prompt`字段为模型输入: + +```bash +curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "/path/to/save/deepseek_r1_w8a8", "prompt": "I am", "max_tokens": 20, "temperature": 0, "top_p": 1.0, "top_k": 1, "repetition_penalty": 1.0}' +``` + +## DP4TP4EP4 混合并行推理 + +vLLM 通过 Ray 对多个节点资源进行管理和运行。该样例对应以下并行策略场景: + +- 数据并行(DP)为4; +- 张量并行(TP)为4; +- 专家并行(EP)为4。 + +### DP4TP4EP4 设置环境变量 + +分别在主从节点配置如下环境变量: + +```bash +source /usr/local/Ascend/ascend-toolkit/set_env.sh + +export MS_ENABLE_LCCL=off +export HCCL_OP_EXPANSION_MODE=AIV +export MS_ALLOC_CONF=enable_vmm:true +export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 +export vLLM_MODEL_BACKEND=MindFormers +export MINDFORMERS_MODEL_CONFIG=/path/to/research/deepseek3/deepseek_r1_671b/predict_deepseek_r1_671b_w8a8_ep4tp4.yaml +``` + +环境变量说明: + +- `MS_ENABLE_LCCL`: 关闭LCCL,使能HCCL通信。 +- `HCCL_OP_EXPANSION_MODE`: 配置通信算法的编排展开位置为Device侧的AI Vector Core计算单元。 +- `MS_ALLOC_CONF`: 设置内存策略。可参考[MindSpore官网文档](https://www.mindspore.cn/docs/zh-CN/r2.6.0/api_python/env_var_list.html)。 +- `ASCEND_RT_VISIBLE_DEVICES`: 配置每个节点可用device id。用户可使用`npu-smi info`命令进行查询。 +- `vLLM_MODEL_BACKEND`:所运行的模型后端。目前vLLM MindSpore所支持的模型与模型后端,可在[模型支持列表](../../../user_guide/supported_models/models_list/models_list.md)中进行查询。 +- `MINDFORMERS_MODEL_CONFIG`:模型配置文件。用户可以在[MindSpore 
Transformers工程](https://gitee.com/mindspore/mindformers/tree/dev/research/deepseek3/deepseek_r1_671b)中,找到对应模型的yaml文件[predict_deepseek_r1_671b_w8a8.yaml](https://gitee.com/mindspore/mindformers/blob/dev/research/deepseek3/deepseek_r1_671b/predict_deepseek_r1_671b_w8a8_ep4t4.yaml) 。 + +模型并行策略通过配置文件中的`parallel_config`指定,例如DP4TP4EP4 混合并行配置如下所示: + +```text +# default parallel of device num = 16 for Atlas 800T A2 +parallel_config: + data_parallel: 4 + model_parallel: 4 + pipeline_stage: 1 + expert_parallel: 4 +``` + +`data_parallel`及`model_parallel`指定attn及ffn-dense部分的并行策略,`expert_parallel`指定moe部分路由专家并行策略,且需满足`data_parallel` * `model_parallel`可被`expert_parallel`整除。 + +### DP4TP4EP4 启动在线服务 + +`vllm-mindspore`可使用OpenAI的API协议部署在线服务。以下是在线服务的拉起流程: + +```bash +# 启动配置参数说明 +vllm-mindspore serve + --model=[模型Config/权重路径] + --trust-remote-code # 使用本地下载的model文件 + --max-num-seqs [最大Batch数] + --max-model-len [输出输出最大长度] + --max-num-batched-tokens [单次迭代最大支持token数, 推荐4096] + --block-size [Block Size 大小, 推荐128] + --gpu-memory-utilization [显存利用率, 推荐0.9] + --tensor-parallel-size [TP 并行数] + --headless # 仅从节点需要配置,表示不需要服务侧相关内容 + --data-parallel-size [DP 并行数] + --data-parallel-size-local [当前服务节点中的DP数,所有节点求和等于data-parallel-size] + --data-parallel-start-rank [当前服务节点中负责的首个DP的偏移量] + --data-parallel-address [主节点的通讯IP] + --data-parallel-rpc-port [主节点的通讯端口] + --enable-expert-parallel # 使能专家并行 +``` + +执行示例: + +```bash +# 主节点: +vllm-mindspore serve --model="/path/to/save/deepseek_r1_w8a8" --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 0 --data-parallel-address 192.10.10.10 --data-parallel-rpc-port 12370 --enable-expert-parallel + +# 从节点: +vllm-mindspore serve --headless --model="/path/to/save/deepseek_r1_w8a8" --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 2 --data-parallel-address 192.10.10.10 --data-parallel-rpc-port 12370 --enable-expert-parallel ``` -### 发起请求 +## 发送请求 -使用如下命令发送请求。其中`$PROMPT`为模型输入。 +使用如下命令发送请求。其中`$PROMPT`为模型输入: ```bash PROMPT="I am" MAX_TOKEN=120 -curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "/path/to/save/deepseek_r1_w8a8", "prompt": "$PROMPT", "max_tokens": $MAX_TOKEN, "temperature": 0, "top_p": 1.0, "top_k": 1, "repetition_penalty": 1.0}' +curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "/path/to/save/deepseek_r1_w8a8", "prompt": "$PROMPT", "max_tokens": $MAX_TOKEN, "temperature": 0}' ``` diff --git a/docs/vllm_mindspore/docs/source_zh_cn/index.rst b/docs/vllm_mindspore/docs/source_zh_cn/index.rst index 46f7199b4adb2838f63e009ff141e262eadef67d..7213a68b710c76a0c1f2789c0bffaf04e452e43b 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/index.rst +++ b/docs/vllm_mindspore/docs/source_zh_cn/index.rst @@ -103,6 +103,7 @@ Apache 许可证 2.0,如 `LICENSE