diff --git a/tutorials/experts/source_en/parallel/save_load.md b/tutorials/experts/source_en/parallel/save_load.md
index b915076d87e9bf55d3852e02facfd2988459d6bd..3b01f05b92450d34c89bbc74c457a0349882b147 100644
--- a/tutorials/experts/source_en/parallel/save_load.md
+++ b/tutorials/experts/source_en/parallel/save_load.md
@@ -8,9 +8,9 @@
In the MindSpore model parallel scenario, each instance process stores only the parameter data on the current node. The parameter data of a model parallel Cell on each node is a slice of the complete parameter data. For example, the complete parameter data shape is \[8, 8], and the parameter data on each node is a part of the data, for example, shape \[2, 8].
-In the auto parallel scenario, MindSpore automatically generates the dividing strategy. The MindSpore checkpoint module supports automatic integrating, saving, and loading.
+In the auto parallel scenario, MindSpore automatically generates the slice strategy. The MindSpore checkpoint module supports automatic integration, saving, and loading.
-In the hybrid parallel scenario, the dividing strategy is implemented by users. MindSpore saves the slice strategy of model, which is the same on each node, and the data corresponding to each node is stored respectively. Users need to integrate, save, and load the checkpoint files by themselves. This tutorial describes how to integrate, save, and load checkpoint files in the hybrid parallel scenario.
+In the hybrid parallel scenario, the slice strategy is implemented by users. MindSpore saves the slice strategy of the model, which is the same on each node, while the data corresponding to each node is stored separately. Users need to integrate, save, and load the checkpoint files by themselves. This tutorial describes how to integrate, save, and load checkpoint files in the hybrid parallel scenario.
### Application Scenario
@@ -24,7 +24,7 @@ The following describes the overall process of training on 64 devices and infere
2. Integrate the saved checkpoint files.
- Integrate the divided model parameters based on the specific dividing strategy to generate a new checkpoint file.
+ Integrate the divided model parameters based on the specific slice strategy to generate a new checkpoint file.
3. Load the new checkpoint file in the single-GPU environment and call the export API to export the model for inference as required.
@@ -38,7 +38,7 @@ For example, in the training stage 1, the training environment with 64 devices i
2. Integrate the saved checkpoint files.
- Integrate the divided model parameters based on the specific dividing strategy to generate a new checkpoint file.
+ Integrate the divided model parameters based on the specific slice strategy to generate a new checkpoint file.
3. Load the checkpoint file that is integrated and saved in the stage 2 cluster.
@@ -85,7 +85,7 @@ In the preceding information:
- `load_checkpoint`: loads the checkpoint model parameter file and returns a parameter dictionary.
- `load_param_into_net`: loads model parameter data to the network.
-#### Obtaining a List of All Parameters on the Network
+#### Obtaining the Model Parameter Slice Strategy
Call the `build_searched_strategy` API to obtain the slice strategy of model.
@@ -95,13 +95,13 @@ strategy = build_searched_strategy("./strategy_train.ckpt")
In the preceding information:
-- `strategy_train.ckpt`: name of model slice strategy, set by users calling `set_auto_parallel_context` API and customizing `strategy_ckpt_save_file` parameter before training network.
+- `strategy_train.ckpt`: name of the model parameter slice strategy file, which is set by the user calling the `set_auto_parallel_context` interface and customizing the `strategy_ckpt_save_file` parameter before training the network, as sketched below.
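+
+A minimal sketch, assuming the same file name as above, of how the strategy file is specified before training:
+
+```python
+from mindspore import set_auto_parallel_context
+
+# Ask the framework to save the generated slice strategy to this file during training,
+# so that build_searched_strategy can read it back afterwards.
+set_auto_parallel_context(strategy_ckpt_save_file="./strategy_train.ckpt")
+```
+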
-### Integrate the Model Parallel Parameters
+### Integrating the Model Parallel Parameters
The following uses a model parameter as an example to describe a specific integration process.
-The parameter name is weight and the dividing strategy is to perform dividing in a 4-device scenario.
+The parameter name is weight and the slice strategy is to perform slicing in a 4-device scenario.
1. Obtain the data value on all nodes for model parallel parameters.
@@ -124,7 +124,7 @@ The parameter name is weight and the dividing strategy is to perform dividing in
### Saving the Data and Generating a New Checkpoint File
-1. Convert `param_dict` to `param_list`.
+1. Convert `param_dict` to data of the list type.
```python
param_list = []
@@ -154,7 +154,11 @@ The parameter name is weight and the dividing strategy is to perform dividing in
### Overall Process
-If you need to load the integrated and saved checkpoint file to multi-device training or inference, divide the parallel parameter data based on the new strategy before loading the model parameters to the network. The following steps are implemented in the pre-training script. Steps 1 and 3 are the same as the strategy of checkpoint loading in a single-node system. Step 2 is added to divide model parallel parameters. In the single-device training/inference scenario, data dividing is not involved. In this case, step 2 can be skipped.
+If you need to load the integrated and saved checkpoint file for multi-device training or inference, divide the parallel parameter data based on the new strategy before loading the model parameters to the network.
+
+The following steps are implemented in the pre-training script. Steps 1 and 3 are the same as the checkpoint loading process in a single-node system. Step 2 is added to slice the model parallel parameters.
+
+In the single-device training/inference scenario, data slicing is not involved. In this case, step 2 can be skipped.
### Step 1: Loading the Checkpoint File
@@ -167,9 +171,11 @@ param_dict = load_checkpoint("./CKP-Integrated_1-4_32.ckpt")
- `load_checkpoint`: loads the checkpoint model parameter file and returns a parameter dictionary.
- `CKP-Integrated_1-4_32.ckpt`: name of the checkpoint model parameter file to be loaded.
-### Step 2: Dividing a Model Parallel Parameter
+### Step 2: Slicing the Parallel Parameters of the Model
+
+The following uses a specific model parameter as an example. The parameter name is "weight", and the data value is Tensor \[\[1, 2, 3, 4], \[5, 6, 7, 8]]. The slice strategy is to perform slicing in the two-device scenario based on \[2, 1].
-The following uses a specific model parameter as an example. The parameter name is weight, the data value is Tensor \[\[1, 2, 3, 4], \[5, 6, 7, 8]], and the dividing strategy is to perform dividing in the two-device scenario based on \[2, 1]. Data distribution after dividing is as follows:
+Data distribution after slicing is as follows:
| Device0 | Device1 |
|--------------------|---------------------|
@@ -186,14 +192,14 @@ The following uses a specific model parameter as an example. The parameter name
slice_moments_list = np.split(new_param_moments.data.asnumpy(), 2, axis=0)
```
- Data after dividing:
+ Data after slicing:
```text
slice_list[0] --- [1, 2, 3, 4] Corresponding to device0
slice_list[1] --- [5, 6, 7, 8] Corresponding to device1
```
- Similar to slice\_list, slice\_moments\_list is divided into two tensors with the shape of \[1, 4].
+ Similar to `slice_list`, `slice_moments_list` is divided into two tensors with the shape of \[1, 4].
2. Load the corresponding data slice on each node.
@@ -237,7 +243,7 @@ User process:
1. Execute stage 1 training. There are four devices in stage 1 training environment. The weight shape of the MatMul operator on each device is \[2, 8]. Checkpoint files are automatically exported during the training.
-2. Execute the script to integrate checkpoint files. Based on the specific dividing strategy, integrate the divided model parameters to generate the integrated checkpoint file.
+2. Execute the script to integrate checkpoint files. Based on the specific slice strategy, integrate the divided model parameters to generate the integrated checkpoint file.
3. Execute stage 2 training: There are two devices in stage 2 training environment. The weight shape of the MatMul operator on each device is \[4, 8]. Load the initialized model parameter data from the integrated checkpoint file and then perform training.
diff --git a/tutorials/experts/source_en/parallel/train_ascend.md b/tutorials/experts/source_en/parallel/train_ascend.md
index c07847e2002ca275c34c9e9284c30a4985e08233..a36997070baae1876c7eaaea61f883c896ef8960 100644
--- a/tutorials/experts/source_en/parallel/train_ascend.md
+++ b/tutorials/experts/source_en/parallel/train_ascend.md
@@ -1,11 +1,13 @@
-# Parallel Distributed Training Example (Ascend)
+# Distributed Parallel Training Example (Ascend)
## Overview
This tutorial describes how to train the ResNet-50 network in data parallel and automatic parallel modes on MindSpore based on the Ascend 910 AI processor.
-> Download address of the complete sample code:
+> Download address of the complete sample code:
+>
+> <https://gitee.com/mindspore/docs/tree/master/docs/sample_code/distributed_training>
The directory structure is as follows:
@@ -29,7 +31,7 @@ The directory structure is as follow:
...
```
-`rank_table_16pcs.json`, `rank_table_8pcs.json` and `rank_table_2pcs.json` are the networking information files. `resnet.py`,`resnet50_distributed_training.py` , `resnet50_distributed_training_gpu.py` and `resnet50_distributed_training_grad_accu.py` are the network structure files. `run.sh` , `run_gpu.sh`, `run_grad_accu.sh` and `run_cluster.sh` are the execute scripts.
+`rank_table_16pcs.json`, `rank_table_8pcs.json` and `rank_table_2pcs.json` are the networking information files. `resnet.py`, `resnet50_distributed_training.py`, `resnet50_distributed_training_gpu.py` and `resnet50_distributed_training_grad_accu.py` are the network structure files. `run.sh`, `run_gpu.sh`, `run_grad_accu.sh` and `run_cluster.sh` are the execution scripts.
Besides, we describe the usages of hybrid parallel and semi-auto parallel modes in the sections [Defining the Network](https://www.mindspore.cn/tutorials/experts/en/master/parallel/train_ascend.html#defining-the-network) and [Distributed Training Model Parameters Saving and Loading](https://www.mindspore.cn/tutorials/experts/en/master/parallel/train_ascend.html#distributed-training-model-parameters-saving-and-loading).
@@ -169,7 +171,7 @@ Different from the single-node system, the multi-node system needs to transfer t
## Defining the Network
-In data parallel and automatic parallel modes, the network definition method is the same as that in a single-node system. The reference code of ResNet is as follows:
+In data parallel and automatic parallel modes, the network definition method is the same as that in a single-node system. For the reference code of ResNet, see the [ResNet network sample script](https://gitee.com/mindspore/docs/blob/master/docs/sample_code/resnet/resnet.py).
In this section we focus on how to define a network in hybrid parallel or semi-auto parallel mode.
@@ -294,9 +296,11 @@ The `Momentum` optimizer is used as the parameter update tool. The definition is
- `gradients_mean`: During backward computation, the framework collects gradients of parameters in data parallel mode across multiple hosts, obtains the global gradient value, and transfers the global gradient value to the optimizer for update. The default value is `False`, which indicates that the `AllReduce.Sum` operation is applied. The value `True` indicates that the `AllReduce.Mean` operation is applied.
- You are advised to set `device_num` and `global_rank` to their default values. The framework calls the HCCL API to obtain the values.
+> For more information about distributed parallelism configuration items, see [Distributed Parallel Overview](https://www.mindspore.cn/tutorials/experts/en/master/parallel/introduction.html).
+
If multiple network cases exist in the script, call `reset_auto_parallel_context` to restore all parameters to default values before executing the next case.
-In the following sample code, the automatic parallel mode is specified. To switch to the data parallel mode, you only need to change `parallel_mode` to `DATA_PARALLEL` and do not need to specify the strategy search algorithm `auto_parallel_search_mode`. In the sample code, the recursive programming strategy search algorithm is specified for automatic parallel.
+In the following sample code, the automatic parallel mode is specified. To switch to the data parallel mode, you only need to change `parallel_mode` to `DATA_PARALLEL`.
```python
from mindspore import ParallelMode, Model, set_context, GRAPH_MODE, set_auto_parallel_context
@@ -402,7 +406,7 @@ The distributed related environment variables are as follows:
- `DEVICE_ID`: actual sequence number of the current device on the corresponding host.
- `RANK_ID`: logical sequence number of the current device.
-For details about other environment variables, see configuration items in the installation guide.
+For details about other environment variables, see configuration items in the [installation guide](https://www.mindspore.cn/install).
The running time is about 5 minutes, which is mainly occupied by operator compilation. The actual training time is within 20 seconds. You can use `ps -ef | grep pytest` to monitor task processes.
@@ -426,8 +430,7 @@ epoch: 10 step: 156, loss is 1.1533381
The previous chapters introduced the distributed training of MindSpore, which is based on the Ascend environment of a single host with multiple devices. Using multiple hosts for distributed training can greatly improve the training speed.
In the Ascend environment, the communication between NPU units across hosts is the same as the communication between each NPU unit in a single host. It is still communicated through HCCL. The difference is that the NPU units in a single host are naturally interoperable, while cross-host communication needs to be guaranteed that the networks of the two hosts are interoperable.
-Execute the following command on server 1 to configure the target connect IP as the `device ip` on the server 2. For example, configure the target IP of device 0 of server 1 as the IP of device 0 of server 2. Configuration command requires the `hccn_tool` tool.
-[HCCL tool](https://support.huawei.com/enterprise/en/ascend-computing/a300t-9000-pid-250702906?category=developer-documents) comes with the CANN package.
+Execute the following command on server 1 to configure the target connect IP as the `device ip` of server 2. For example, configure the target IP of device 0 of server 1 as the IP of device 0 of server 2. The configuration command requires the `hccn_tool` tool, and the [HCCL tool](https://support.huawei.com/enterprise/en/ascend-computing/a300t-9000-pid-250702906?category=developer-documents) comes with the CANN package.
```bash
hccn_tool -i 0 -netdetect -s address 192.98.92.131
@@ -549,13 +552,13 @@ bash run_cluster.sh /path/dataset /path/rank_table.json 16 0
bash run_cluster.sh /path/dataset /path/rank_table.json 16 8
```
-## Running the Script
+## Running the Script through OpenMPI
Currently MindSpore also supports `mpirun` of OpenMPI for distributed training on the Ascend hardware platform without the environment variable `RANK_TABLE_FILE`.
### Single-host Training
-Take the distributed training script for eight devices[run_with_mpi.sh](https://gitee.com/mindspore/docs/blob/master/docs/sample_code/distributed_training/run_with_mpi.sh) for an example, the script will run in the background. The log file is saved in the device directory, the log for different device will be saved in `log_output/1/` directory.
+Take the distributed training script for eight devices, [run_with_mpi.sh](https://gitee.com/mindspore/docs/blob/master/docs/sample_code/distributed_training/run_with_mpi.sh), as an example. The script runs in the background. The log file is saved in the device directory, and the logs for different devices are saved in the `log_output/1/` directory.
> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.
>
@@ -567,18 +570,19 @@ Take the distributed training script for eight devices[run_with_mpi.sh](https://
>
> Refer to [GPU Distributed Parallel Training Example](https://www.mindspore.cn/tutorials/experts/en/master/parallel/train_gpu.html) or the OpenMPI documentation for detailed information.
>
->
### Multi-host Training
Before running multi-host training, you need to ensure that you have the same openMPI, Python, and MindSpore versions and install path on each node.
+OpenMPI multi-host training is generally configured through a hostfile, by adding `--hostfile filepath` to the `mpirun` command line arguments. Each line of the hostfile uses the format `[hostname] slots=[slotnum]`, where hostname can be an IP address or a host name, and slotnum is the number of child processes started on that machine.
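+
+A minimal sketch of a hostfile and the corresponding `mpirun` invocation, assuming two hosts with the placeholder names `node1` and `node2`, eight devices each, and the training script from this tutorial:
+
+```bash
+# hostfile: one line per host; slots is the number of processes to start on that host.
+# node1 and node2 are placeholder host names; replace them with your own IPs or host names.
+cat > hostfile <<EOF
+node1 slots=8
+node2 slots=8
+EOF
+
+# Start 16 processes in total, distributed across the two hosts according to the hostfile
+# (pass any dataset path arguments your training script expects).
+mpirun -n 16 --hostfile hostfile python resnet50_distributed_training.py
+```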
+
### Non-sink Mode Training
In graph mode, you can specify to train the model in a non-sink mode by setting the environment variable [GRAPH_OP_RUN](https://www.mindspore.cn/docs/en/master/note/env_var_list.html)=1. In this case, you need to set environment variable `HCCL_WHITELIST_DISABLE=1` and train model with OpenMPI `mpirun`.
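+
+A minimal sketch, assuming a single host with eight devices and the training script from this tutorial, of launching non-sink training:
+
+```bash
+# Train kernel by kernel (non-sink mode) instead of sinking the whole graph to the device.
+export GRAPH_OP_RUN=1
+# Required together with OpenMPI mpirun in this mode, as described above.
+export HCCL_WHITELIST_DISABLE=1
+# Launch eight processes on the local host (pass any dataset arguments your script expects).
+mpirun -n 8 python resnet50_distributed_training.py
+```
+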
-## Distributed Training Model Parameters Saving and Loading
+## Saving and Loading Distributed Training Model Parameters
-The below content introduced how to save and load models under the four distributed parallel training modes respectively. Before saving model parameters for distributed training, it is necessary to configure distributed environment variables and collective communication library in accordance with this tutorial.
+In MindSpore, four distributed parallel training modes are supported, namely Auto Parallel, Data Parallel, Semi Auto Parallel, and Hybrid Parallel. The following content introduces how to save and load models under these four distributed parallel training modes. Before saving model parameters for distributed training, it is necessary to configure the distributed environment variables and collective communication library in accordance with this tutorial.
### Auto Parallel Mode