diff --git a/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md b/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md
index 9080b67487908939da1d82b6a212341c274042b7..dee8d2d13bde51cd80d98bcefbfb4d4c5728011c 100644
--- a/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md
+++ b/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md
@@ -264,7 +264,16 @@ vLLM manages and operates resources across multiple nodes through Ray. This exam
- Tensor Parallelism (TP): 4;
- Expert Parallelism (EP): 4.

-### Setting Environment Variables
+The data parallel deployment backend can be selected with `--data-parallel-backend`, which accepts either `mp` or `ray`. By default, data parallel is deployed with the `mp` backend.
+
+`--data-parallel-backend` options:
+
+- `mp`: Deploy with the multiprocessing backend;
+- `ray`: Deploy with the Ray backend.
+
+### Deploy DP with `mp` Backend
+
+#### Setting Environment Variables

Configure the following environment variables on the master and worker nodes:

@@ -301,11 +310,11 @@ parallel_config:

`data_parallel` and `model_parallel` specify the parallelism strategy for the attention and feed-forward dense layers, while `expert_parallel` specifies the expert routing parallelism strategy for MoE layers. Ensure that `data_parallel` * `model_parallel` is divisible by `expert_parallel`.

-### Online Inference
+#### Online Inference

-#### Starting the Service
+**Starting the Service**

-`vllm-mindspore` can deploy online inference using the OpenAI API protocol. Below is the workflow for launching the service:
+vLLM-MindSpore Plugin can deploy online inference using the OpenAI API protocol. Below is the workflow for launching the service:

```bash
# Parameter explanations for service launch
@@ -337,7 +346,83 @@ vllm-mindspore serve --model="MindSpore-Lab/DeepSeek-R1-0528-A8W8" --trust-remot
vllm-mindspore serve --headless --model="MindSpore-Lab/DeepSeek-R1-0528-A8W8" --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 2 --data-parallel-address 192.10.10.10 --data-parallel-rpc-port 12370 --enable-expert-parallel
```

-#### Sending Requests
+**Sending Requests**
+
+Use the following command to send requests, where `prompt` is the model input:
+
+```bash
+curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "MindSpore-Lab/DeepSeek-R1-0528-A8W8", "prompt": "I am", "max_tokens": 20, "temperature": 0}'
+```
+
+Users need to ensure that the `"model"` field matches the `--model` argument used when starting the service, so that the request can be matched to the served model.
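+
+Since the service follows the OpenAI API protocol, the names of the currently served models can usually be listed by querying the `/v1/models` endpoint. Below is a minimal check, assuming the service listens on the default address `http://localhost:8000`:
+
+```bash
+# List the served models; the returned "id" values are what the "model" field
+# of a request has to match (default host and port assumed).
+curl http://localhost:8000/v1/models
+```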
+
+### Deploy DP with `ray` Backend
+
+#### Setting Environment Variables
+
+Configure the following environment variables on the master and worker nodes:
+
+```bash
+source /usr/local/Ascend/ascend-toolkit/set_env.sh
+
+export MS_ENABLE_LCCL=off
+export HCCL_OP_EXPANSION_MODE=AIV
+export MS_ALLOC_CONF=enable_vmm:true
+export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
+export vLLM_MODEL_BACKEND=MindFormers
+export MINDFORMERS_MODEL_CONFIG=/path/to/research/deepseek3/deepseek_r1_671b/predict_deepseek_r1_671b_w8a8_ep4tp4.yaml
+
+export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python
+export GLOO_SOCKET_IFNAME=enp189s0f0
+export HCCL_SOCKET_IFNAME=enp189s0f0
+export TP_SOCKET_IFNAME=enp189s0f0
+```
+
+Extra environment variable descriptions:
+
+- `PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION`: Specifies the protocol buffers implementation; it must be set to `python`.
+- `GLOO_SOCKET_IFNAME`: Network interface used by the GLOO backend. Use `ifconfig` to find the interface name corresponding to the node IP.
+- `HCCL_SOCKET_IFNAME`: Network interface used by HCCL. Use `ifconfig` to find the interface name corresponding to the node IP.
+- `TP_SOCKET_IFNAME`: Network interface used for TP communication. Use `ifconfig` to find the interface name corresponding to the node IP.
+
+The model parallel strategy is specified in the `parallel_config` of the configuration file.
+
+For more information on environment variables and parallel strategy configuration, please refer to [Deploy DP with `mp` Backend](#setting-environment-variables-1).
+
+#### Starting Ray for Multi-Node Cluster Management
+
+Refer to [Starting Ray for Multi-Node Cluster Management for TP](#starting-ray-for-multi-node-cluster-management).
+
+#### Online Inference
+
+**Starting the Service**
+
+vLLM-MindSpore Plugin can deploy online inference using the OpenAI API protocol. Below is the workflow for launching the service:
+
+```bash
+# Parameter explanations for service launch
+vllm-mindspore serve
+ --model=[Model Config/Weights Path]
+ --trust-remote-code # Use locally downloaded model files
+ --max-num-seqs [Maximum Batch Size]
+ --max-model-len [Maximum Input/Output Length]
+ --max-num-batched-tokens [Maximum Tokens per Iteration, recommended: 4096]
+ --block-size [Block Size, recommended: 128]
+ --gpu-memory-utilization [GPU Memory Utilization, recommended: 0.9]
+ --tensor-parallel-size [TP Parallelism Degree]
+ --data-parallel-size [DP Parallelism Degree]
+ --data-parallel-size-local [DP count on the current service node, sum across all nodes equals data-parallel-size]
+ --enable-expert-parallel # Enable expert parallelism
+ --data-parallel-backend=ray # Set the DP deployment backend to ray
+```
+
+User can also set the local model path by `--model` argument. The following is an execution example:
+
+```bash
+vllm-mindspore serve --model="MindSpore-Lab/DeepSeek-R1-0528-A8W8" --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --enable-expert-parallel --data-parallel-backend=ray
+```
+
+**Sending Requests**

Use the following command to send requests, where `prompt` is the model input:

diff --git a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md
index 87eb81a0a13c62b94bc89191982b163df7b596e1..690b13196436c770014669e66b8eada82b2adf06 100644
--- a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md
+++ b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md
@@ -307,7 +307,16 @@ vLLM 通过 Ray 对多个节点资源进行管理和运行。该样例对应以
- 张量并行(TP)为4;
- 专家并行(EP)为4。

-### 设置环境变量
+数据并行可以通过 `--data-parallel-backend` 设置部署方式，可选项为 `mp` 和 `ray`。默认行为下，DP 将以 mp 方式部署。
+
+`--data-parallel-backend` 选项：
+
+- `mp`: 以多进程的方式部署;
+- `ray`: 以 ray 方式部署。
+
+### 以 mp 方式部署 DP
+
+#### 设置环境变量

分别在主从节点配置如下环境变量：

@@ -344,11 +353,11 @@ parallel_config:

`data_parallel`及`model_parallel`指定attn及ffn-dense部分的并行策略，`expert_parallel`指定moe部分路由专家并行策略，且需满足`data_parallel` * `model_parallel`可被`expert_parallel`整除。

-### 在线推理
+#### 在线推理

-#### 启动服务
+**启动服务**

-`vllm-mindspore`可使用OpenAI的API协议部署在线推理。以下是在线推理的拉起流程：
+vLLM-MindSpore插件可使用OpenAI的API协议部署在线推理。以下是在线推理的拉起流程：

```bash
# 启动配置参数说明
@@ -380,7 +389,83 @@ vllm-mindspore serve --model="MindSpore-Lab/DeepSeek-R1-0528-A8W8" --trust-remot
vllm-mindspore serve --headless --model="MindSpore-Lab/DeepSeek-R1-0528-A8W8" --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 2 --data-parallel-address 192.10.10.10 --data-parallel-rpc-port 12370 --enable-expert-parallel
```

-#### 发送请求
+**发送请求**
+
+使用如下命令发送请求。其中`prompt`字段为模型输入：
+
+```bash
+curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "MindSpore-Lab/DeepSeek-R1-0528-A8W8", "prompt": "I am", "max_tokens": 120, "temperature": 0}'
+```
+
+用户需确认`"model"`字段与启动服务中`--model`一致，请求才能成功匹配到模型。
+
+### 以 ray 方式部署 DP
+
+#### 设置环境变量
+
+分别在主从节点配置如下环境变量：
+
+```bash
+source /usr/local/Ascend/ascend-toolkit/set_env.sh
+
+export MS_ENABLE_LCCL=off
+export HCCL_OP_EXPANSION_MODE=AIV
+export MS_ALLOC_CONF=enable_vmm:true
+export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
+export vLLM_MODEL_BACKEND=MindFormers
+export MINDFORMERS_MODEL_CONFIG=/path/to/research/deepseek3/deepseek_r1_671b/predict_deepseek_r1_671b_w8a8_ep4tp4.yaml
+
+export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python
+export GLOO_SOCKET_IFNAME=enp189s0f0
+export HCCL_SOCKET_IFNAME=enp189s0f0
+export TP_SOCKET_IFNAME=enp189s0f0
+```
+
+额外环境变量说明：
+
+- `PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION`: 指定 protocol buffers 的实现方式，需要设置为 `python`。
+- `GLOO_SOCKET_IFNAME`: GLOO后端使用的网卡。可通过`ifconfig`查找ip对应网卡的网卡名。
+- `HCCL_SOCKET_IFNAME`: 配置HCCL使用的网卡。可通过`ifconfig`查找ip对应网卡的网卡名。
+- `TP_SOCKET_IFNAME`: 配置TP通信使用的网卡。可通过`ifconfig`查找ip对应网卡的网卡名。
+
+模型并行策略通过配置文件中的`parallel_config`指定。
+
+更多环境变量与并行策略配置方式请参考 [mp 部署 DP](#设置环境变量-1)。
+
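+作为参考，下面给出一个查找网卡名的示例命令（此处以示例 IP 192.10.10.10 为例，实际请替换为当前节点的 IP），其输出中的网卡名即可用于上述 `GLOO_SOCKET_IFNAME`、`HCCL_SOCKET_IFNAME` 与 `TP_SOCKET_IFNAME`：
+
+```bash
+# 示例：查询指定 IP 对应的网卡名（输出第二列即为网卡名，例如 enp189s0f0）
+ip -o -4 addr show | grep "192.10.10.10"
+```
+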
+#### 启动 Ray 进行多节点集群管理
+
+参考 [TP 中启动 ray 进行多节点集群管理](#启动-ray-进行多节点集群管理)。
+
+#### 在线推理
+
+**启动服务**
+
+vLLM-MindSpore插件可使用OpenAI的API协议部署在线推理。以下是在线推理的拉起流程：
+
+```bash
+# 启动配置参数说明
+vllm-mindspore serve
+ --model=[模型Config/权重路径]
+ --trust-remote-code # 使用本地下载的model文件
+ --max-num-seqs [最大Batch数]
+ --max-model-len [输入输出最大长度]
+ --max-num-batched-tokens [单次迭代最大支持token数, 推荐4096]
+ --block-size [Block Size 大小, 推荐128]
+ --gpu-memory-utilization [显存利用率, 推荐0.9]
+ --tensor-parallel-size [TP 并行数]
+ --data-parallel-size [DP 并行数]
+ --data-parallel-size-local [当前服务节点中的DP数，所有节点求和等于data-parallel-size]
+ --enable-expert-parallel # 使能专家并行
+ --data-parallel-backend=ray # 指定 dp 部署方式为 ray
+```
+
+用户可以通过`--model`参数，指定模型保存的本地路径。以下为执行示例：
+
+```bash
+vllm-mindspore serve --model="MindSpore-Lab/DeepSeek-R1-0528-A8W8" --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --enable-expert-parallel --data-parallel-backend=ray
+```
+
+**发送请求**

使用如下命令发送请求。其中`prompt`字段为模型输入：