From e024428679f71c211afcb543be67fdda3be6cc28 Mon Sep 17 00:00:00 2001
From: liyanlong <2020023855@m.scnu.edu.cn>
Date: Tue, 1 Jun 2021 22:09:51 +0800
Subject: [PATCH] update
tutorials/inference/source_en/serving_distributed_example.md.
---
.../source_en/serving_distributed_example.md | 121 +++++++++---------
1 file changed, 60 insertions(+), 61 deletions(-)
diff --git a/tutorials/inference/source_en/serving_distributed_example.md b/tutorials/inference/source_en/serving_distributed_example.md
index 9394788a2a..e611c795e6 100644
--- a/tutorials/inference/source_en/serving_distributed_example.md
+++ b/tutorials/inference/source_en/serving_distributed_example.md
@@ -1,52 +1,51 @@
-# MindSpore Serving-based Distributed Inference Service Deployment
+# Deploying Distributed Inference Services Based on MindSpore Serving
-Translator: [xiaoxiaozhang](https://gitee.com/xiaoxinniuniu)
-
-`Linux` `Ascend` `Serving` `Intermediate` `Senior`
+`Linux` `Ascend` `Serving` `Intermediate` `Senior`
-- [MindSpore Serving-based Distributed Inference Service Deployment](#mindspore-serving-based-distributed-inference-service-deployment)
- - [Overview](#overview)
- - [Environment Preparation](#environment-preparation)
- - [Exporting a Distributed Model](#exporting-a-distributed-model)
- - [Deploying the Distributed Inference Service](#deploying-the-distributed-inference-service)
- - [Starting Master and Distributed Worker](#starting-master-and-distributed-worker)
- - [Starting Agent](#starting-agent)
- - [Executing Inference](#executing-inference)
+- [Deploying Distributed Inference Services Based on MindSpore Serving](#deploying-distributed-inference-services-based-on-mindspore-serving)
+    - [Overview](#overview)
+    - [Environment Preparation](#environment-preparation)
+    - [Exporting a Distributed Model](#exporting-a-distributed-model)
+    - [Deploying the Distributed Inference Service](#deploying-the-distributed-inference-service)
+        - [Starting Master and Distributed Worker](#starting-master-and-distributed-worker)
+        - [Starting Agent](#starting-agent)
+    - [Executing Inference](#executing-inference)
-
+
-## Overview
+## Overview
-Distributed inference means that multiple cards are used in the inference phase, in order to solve the problem that too many parameters are in the very large scale neural network and the model cannot be fully loaded into a single card for inference, multi-cards can be used for distributed inference. This document describes the process of deploying the distributed inference service, which is similar to the process of deploying the [single-card inference service](https://www.mindspore.cn/tutorial/inference/en/master/serving_example.html), and these two can refer to each other.
+Distributed inference means that multiple devices are used in the inference phase. A very large scale neural network has too many parameters for the model to be fully loaded onto a single device, so multiple devices can be used for distributed inference. This document describes the process of deploying the distributed inference service, which is similar to the process of deploying the [single-card inference service](https://www.mindspore.cn/tutorial/inference/en/master/serving_example.html); the two can be cross-referenced.
-The architecture of the distributed inference service shows as follows:
+The architecture of the distributed inference service is shown in the following figure:

-The master provides an interface for client access, manages distributed workers, and performs task management and distribution; Distributed workers automatically schedule agents based on model configurations to complete distributed inference; Each agent contains a slice of the distributed model, occupies a device, and loads the model to performance inference.
+The master provides the client access interface, manages the distributed workers, and performs task management and distribution. Based on the model configuration, a distributed worker automatically schedules agents to complete the distributed inference. Each agent contains a slice of the distributed model, occupies one device, and loads the model to perform inference.
-The preceding figure shows the scenario where rank_size is 16 and stage_size is 2. Each stage contains 8 agents and occupies 8 devices. rank_size indicates the number of devices used in inference, stage indicates a pipeline segment, and stage_size indicates the number of pipeline segments. The distributed worker sends an inference requests to the agent and obtains the inference result from the agent. Agents communicate with each other using HCCL.
+The preceding figure shows an example where rank_size is 16 and stage_size is 2; each stage contains 8 agents and occupies 8 devices. rank_size indicates the number of devices used in inference, stage indicates a pipeline segment, and stage_size indicates the number of pipeline segments. The distributed worker sends inference requests to the agents and obtains the inference results from them. Agents communicate with each other using HCCL.
-Currently, the distributed model has the following restrictions:
+Currently, the distributed model has the following restrictions:
- The model of the first stage receives the same input data.
-- The models of other stages do not receive data.
-- All models of the latter stage return the same data.
-- Only Ascend 910 inference is supported.
+- The models of the other stages do not receive data.
+- All models of the last stage return the same data.
+- Only Ascend 910 inference is supported.
-The following uses a simple distributed network MatMul as an example to demonstrate the deployment process.
+The following uses a simple distributed network, MatMul, as an example to demonstrate the deployment process.
-### Environment Preparation
+### Environment Preparation
-Before running the example, ensure that MindSpore Serving has been correctly installed. If not, install MindSpore Serving by referring to the [MindSpore Serving installation page](https://gitee.com/mindspore/serving/blob/master/README.md#installation), and configure environment variables by referring to the [MindSpore Serving environment configuration page](https://gitee.com/mindspore/serving/blob/master/README.md#configuring-environment-variables).
+Before running the example, ensure that MindSpore Serving has been correctly installed. If not, install MindSpore Serving by referring to the [MindSpore Serving installation page](https://gitee.com/mindspore/serving/blob/master/README.md#installation), and configure environment variables by referring to the [MindSpore Serving environment configuration page](https://gitee.com/mindspore/serving/blob/master/README.md#configuring-environment-variables).
-### Exporting a Distributed Model
+### Exporting a Distributed Model
-For details about the files required for exporting distributed models, see the [export_model directory](https://www.mindspore.cn/tutorial/training/en/master/advanced_use/distributed_training_ascend.html#id4), the following files are required:
+For the files required to export the distributed model, see the [export_model directory](https://gitee.com/mindspore/serving/tree/master/example/matmul_distributed/export_model). The following files are required:
```text
export_model
@@ -56,12 +55,12 @@ export_model
└── rank_table_8pcs.json
```
-- `net.py` contains the definition of MatMul network.
-- `distributed_inference.py` is used to configure distributed parameters.
-- `export_model.sh` creates `device` directory on the current host and exports model files corresponding to `device`.
-- `rank_table_8pcs.json` is a json file for configuring the multi-cards network. For details, see [rank_table](https://www.mindspore.cn/tutorial/training/en/master/advanced_use/distributed_training_ascend.html#id4).
+- `net.py` contains the definition of the MatMul network.
+- `distributed_inference.py` configures the distributed parameters.
+- `export_model.sh` creates the `device` directories on the current host and exports the model files corresponding to each `device`.
+- `rank_table_8pcs.json` is a JSON file for configuring the multi-card network. For details, see [rank_table](https://www.mindspore.cn/tutorial/training/en/master/advanced_use/distributed_training_ascend.html#id4).
-Use [net.py](https://gitee.com/mindspore/serving/blob/master/example/matmul_distributed/export_model/net.py) to construct a network that contains the MatMul and Neg operators.
+Use [net.py](https://gitee.com/mindspore/serving/blob/master/example/matmul_distributed/export_model/net.py) to construct a network that contains the MatMul and Neg operators.
```python
import numpy as np
@@ -85,7 +84,7 @@ class Net(Cell):
return x
```
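+
+The diff elides the body of the network above. As a rough sketch (the weight shape, fill value, and optional shard strategy are illustrative assumptions, not necessarily the exact contents of net.py), such a network might look like this:
+
+```python
+import numpy as np
+from mindspore import Tensor, Parameter
+from mindspore.nn import Cell
+from mindspore.ops import operations as ops
+
+
+class Net(Cell):
+    """MatMul followed by Neg, with an optional model-parallel strategy."""
+
+    def __init__(self, matmul_size, strategy=None):
+        super().__init__()
+        # Constant weight; shape and fill value are assumptions for illustration.
+        matmul_np = np.full(matmul_size, 0.5, dtype=np.float32)
+        self.matmul_weight = Parameter(Tensor(matmul_np))
+        self.matmul = ops.MatMul()
+        self.neg = ops.Neg()
+        if strategy is not None:
+            # Shard the MatMul across devices according to the given strategy.
+            self.matmul.shard(strategy)
+
+    def construct(self, inputs):
+        x = self.matmul(inputs, self.matmul_weight)
+        x = self.neg(x)
+        return x
+```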
-Use [distributed_inference.py](https://gitee.com/mindspore/serving/blob/master/example/matmul_distributed/export_model/distributed_inference.py) to configure the distributed model. Refer to [Distributed inference](https://www.mindspore.cn/tutorial/inference/en/master/multi_platform_inference_ascend_910.html#id1)。
+Use [distributed_inference.py](https://gitee.com/mindspore/serving/blob/master/example/matmul_distributed/export_model/distributed_inference.py) to configure the distributed model. For details, refer to [Distributed Inference](https://www.mindspore.cn/tutorial/inference/en/master/multi_platform_inference_ascend_910.html#id1).
```python
import numpy as np
@@ -114,13 +113,13 @@ def create_predict_data():
return Tensor(inputs_np)
```
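+
+The export logic elided above might look like the following sketch, written against the MindSpore 1.x API. The device count, parallel mode, input shape, and file names are assumptions taken from this example's setup (8 devices, a matmul servable), not a verbatim copy of distributed_inference.py:
+
+```python
+import numpy as np
+from mindspore import context, Model, Tensor
+from mindspore.communication.management import init
+from mindspore.train.serialization import export
+
+from net import Net
+
+
+def test_inference():
+    """Configure auto-parallel, then export the sliced MINDIR model and group config."""
+    context.set_context(mode=context.GRAPH_MODE)
+    init()  # initialize HCCL communication
+    context.set_auto_parallel_context(full_batch=True,
+                                      parallel_mode="semi_auto_parallel",
+                                      device_num=8,
+                                      group_ckpt_save_file="./group_config.pb")
+    predict_data = create_predict_data()
+    network = Net(matmul_size=(96, 16))
+    model = Model(network)
+    # Derive the parameter layout before exporting the model slices.
+    model.infer_predict_layout(predict_data)
+    export(model.predict_network, predict_data, file_name="matmul", file_format="MINDIR")
+
+
+def create_predict_data():
+    """Fabricated predict input; the real shape must match the network."""
+    inputs_np = np.random.randn(128, 96).astype(np.float32)
+    return Tensor(inputs_np)
+```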
-Run [export_model.sh](https://gitee.com/mindspore/serving/blob/master/example/matmul_distributed/export_model/export_model.sh) to export the distributed model. After the command is executed successfully, the `model` directory is created in the upper-level directory. The structure is as follows:
+Run [export_model.sh](https://gitee.com/mindspore/serving/blob/master/example/matmul_distributed/export_model/export_model.sh) to export the distributed model. After successful execution, a `model` directory is created in the upper-level directory, with the following structure:
```text
model
├── device0
-│ ├── group_config.pb
-│ └── matmul.mindir
+│   ├── group_config.pb
+│   └── matmul.mindir
├── device1
├── device2
├── device3
@@ -130,26 +129,26 @@ model
└── device7
```
-Each `device` directory contains two files, `group_config.pb` and `matmul.mindir`, which represent the model group configuration file and model file respectively.
+Each `device` directory contains two files, `group_config.pb` and `matmul.mindir`, which represent the model group configuration file and the model file respectively.
-### Deploying the Distributed Inference Service
+### Deploying the Distributed Inference Service
-For details about how to start the distributed inference service, refer to [matmul_distributed](https://gitee.com/mindspore/serving/tree/master/example/matmul_distributed), the following files are required:
+To start the distributed inference service, refer to [matmul_distributed](https://gitee.com/mindspore/serving/tree/master/example/matmul_distributed). The following files are required:
```text
matmul_distributed
├── agent.py
├── master_with_worker.py
├── matmul
-│ └── servable_config.py
+│   └── servable_config.py
├── model
└── rank_table_8pcs.json
```
-- `model` is the directory for storing model files.
-- `master_with_worker.py` is the script for starting services.
-- `agent.py` is the script for starting agents.
-- `servable_config.py` is the [Model Configuration File](https://www.mindspore.cn/tutorial/inference/en/master/serving_model.html). It declares a distributed model with rank_size 8 and stage_size 1 through `declare_distributed_servable`, and defines a method `predict` for distributed servable.
+- `model` is the directory for storing model files.
+- `master_with_worker.py` is the script for starting services.
+- `agent.py` is the script for starting agents.
+- `servable_config.py` is the [model configuration file](https://www.mindspore.cn/tutorial/inference/en/master/serving_model.html). It declares a distributed model with rank_size 8 and stage_size 1 through `declare_distributed_servable`, and defines the method `predict` for the distributed servable.
The content of the model configuration file is as follows:
@@ -166,9 +165,9 @@ def predict(x):
return y
```
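+
+The diff shows only the tail of this file. A sketch of the whole configuration, matching the declaration described above (the import paths follow the mindspore_serving.worker package used elsewhere in this example and are an assumption for the Serving version in use):
+
+```python
+from mindspore_serving.worker import distributed, register
+
+# Declare a distributed servable: 8 model slices in a single pipeline stage.
+distributed.declare_distributed_servable(rank_size=8, stage_size=1, with_batch_dim=False)
+
+
+@register.register_method(output_names=["y"])
+def predict(x):
+    # Forward the input to the distributed model and return its output.
+    y = register.call_servable(x)
+    return y
+```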
-#### Starting Master and Distributed Worker
+#### Starting Master and Distributed Worker
-Use [master_with_worker.py](https://gitee.com/mindspore/serving/blob/master/example/matmul_distributed/master_with_worker.py) to call `start_distributed_servable_in_master` method to deploy the co-process master and distributed workers.
+Use [master_with_worker.py](https://gitee.com/mindspore/serving/blob/master/example/matmul_distributed/master_with_worker.py) to call the `start_distributed_servable_in_master` method, which deploys the co-process master and distributed worker. The key parameters are listed below, followed by a sketch of the script.
```python
import os
@@ -192,16 +191,16 @@ if __name__ == "__main__":
start()
```
-- `servable_dir` is the directory for storing a servable.
-- `servable_name` is the name of the servable, which corresponds to a directory for storing model configuration files.
-- `rank_table_json_file` is the JSON file for configuring multi-cards network.
+- `servable_dir` is the directory for storing a servable.
+- `servable_name` is the name of the servable, which corresponds to a directory for storing model configuration files.
+- `rank_table_json_file` is the JSON file for configuring the multi-card network.
- `worker_ip` is the IP address of the distributed worker.
-- `worker_port` is the port of the distributed worker.
-- `wait_agents_time_in_seconds` specifies the duration of waiting for all agents to be registered, the default value 0 means it will wait forever.
+- `worker_port` is the port of the distributed worker.
+- `wait_agents_time_in_seconds` specifies the duration of waiting for all agents to be registered; the default value 0 means to wait forever.
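+
+Putting the parameters together, a minimal sketch of the startup script (the servable name, ports, and gRPC access point are assumptions for this example):
+
+```python
+import os
+import sys
+
+from mindspore_serving import master
+from mindspore_serving.worker import distributed
+
+
+def start():
+    # The servable directory sits next to this script and contains "matmul".
+    servable_dir = os.path.dirname(os.path.realpath(sys.argv[0]))
+    distributed.start_distributed_servable_in_master(
+        servable_dir, "matmul",
+        rank_table_json_file="rank_table_8pcs.json",
+        worker_ip="127.0.0.1", worker_port=6200,
+        wait_agents_time_in_seconds=0)
+    # Expose a gRPC endpoint for clients; port 5500 is an assumption.
+    master.start_grpc_server("127.0.0.1", 5500)
+
+
+if __name__ == "__main__":
+    start()
+```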
#### Starting Agent
-Use [agent.py](https://gitee.com/mindspore/serving/blob/master/example/matmul_distributed/agent.py) to call `startup_worker_agents` method to start 8 agent processes on the current host. Agents obtain rank_tables from distributed workers so that agents can communicate with each other using HCCL.
+Use [agent.py](https://gitee.com/mindspore/serving/blob/master/example/matmul_distributed/agent.py) to call the `startup_worker_agents` method to start 8 agent processes on the current host. The agents obtain the rank_table from the distributed worker so that they can communicate with each other using HCCL. The parameters are listed below, followed by a sketch of the script.
```python
from mindspore_serving.worker import distributed
@@ -224,17 +223,17 @@ if __name__ == '__main__':
start_agents()
```
-- `worker_ip` is the IP address of the distributed worker.
-- `worker_port` is the port of the distributed worker.
-- `model_files` is a list of model file paths.
-- `group_config_files` is a list of model group configuration file paths.
+- `worker_ip` is the IP address of the distributed worker.
+- `worker_port` is the port of the distributed worker.
+- `model_files` is a list of model file paths.
+- `group_config_files` is a list of model group configuration file paths.
- `agent_start_port` is the start port used by the agent. The default value is 7000.
-- `agent_ip` is the IP address of an agent. The default value is None. The IP address used by the agent to communicate with the distributed worker is obtained from rank_table by default. If the IP address is unavailable, you need to set both `agent_ip` and `rank_start`.
-- `rank_start` is the start rank_id of the current server, the default value is None.
+- `agent_ip` is the IP address of an agent. The default value is None. The IP address used by the agent to communicate with the distributed worker is obtained from the rank_table by default; if that IP address is unavailable, you need to set both `agent_ip` and `rank_start`.
+- `rank_start` is the start rank_id of the current server; the default value is None.
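+
+A sketch of the agent startup script follows. The per-device paths mirror the model directory exported earlier, and the worker address must match the one used in master_with_worker.py:
+
+```python
+from mindspore_serving.worker import distributed
+
+
+def start_agents():
+    """Start 8 agents on this host, one per device slice of the model."""
+    model_files = []
+    group_config_files = []
+    for i in range(8):
+        model_files.append(f"model/device{i}/matmul.mindir")
+        group_config_files.append(f"model/device{i}/group_config.pb")
+    # worker_ip and worker_port must match the distributed worker started above.
+    distributed.startup_worker_agents(worker_ip="127.0.0.1", worker_port=6200,
+                                      model_files=model_files,
+                                      group_config_files=group_config_files)
+
+
+if __name__ == '__main__':
+    start_agents()
+```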
-### Executing Inference
+### Executing Inference
-To access the inference service through gRPC, the client needs to specify the IP address and port of the gRPC server. Run [client.py](https://gitee.com/mindspore/serving/blob/master/example/matmul_distributed/client.py) to call the `predict` method of matmul distributed model, execute inference.
+To access the inference service through gRPC, the client needs to specify the IP address and port of the gRPC server. Run [client.py](https://gitee.com/mindspore/serving/blob/master/example/matmul_distributed/client.py) to call the `predict` method of the matmul distributed model and execute inference.
```python
import numpy as np
@@ -253,7 +252,7 @@ if __name__ == '__main__':
run_matmul()
```
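+
+A sketch of such a client is shown below. The gRPC address must match the one served by the master, and the input shape is an assumption that has to agree with the exported model:
+
+```python
+import numpy as np
+from mindspore_serving.client import Client
+
+
+def run_matmul():
+    """Call the predict method of the matmul servable over gRPC."""
+    client = Client("localhost", 5500, "matmul", "predict")
+    instance = {"x": np.ones((128, 96), np.float32)}
+    result = client.infer(instance)
+    print("result:\n", result)
+
+
+if __name__ == '__main__':
+    run_matmul()
+```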
-The following return value indicates that the Serving distributed inference service has correctly executed the inference of MatMul net:
+After execution, the following return value is displayed, indicating that the Serving distributed inference service has correctly executed the inference of the MatMul network:
```text
result:
--
Gitee