diff --git a/docs/mindspore/source_en/design/distributed_training_design.md b/docs/mindspore/source_en/design/distributed_training_design.md
index 3a2fc5cff63004684d1525d1f26e8b84025a39d5..079e174f9a280d90d7409b6c7ddb9f6043632522 100644
--- a/docs/mindspore/source_en/design/distributed_training_design.md
+++ b/docs/mindspore/source_en/design/distributed_training_design.md
@@ -123,8 +123,86 @@ As a key feature of MindSpore, automatic parallelism is used to implement hybrid
## Advanced Features
-- [pipeline_parallel](https://www.mindspore.cn/docs/en/master/design/pipeline_parallel.html)
-- [host_device_training](https://www.mindspore.cn/docs/en/master/design/host_device_training.html)
-- [recompute](https://www.mindspore.cn/docs/en/master/design/recompute.html)
-- [sharding_propagation](https://www.mindspore.cn/docs/en/master/design/sharding_propagation.html)
-- [parameter_server_training](https://www.mindspore.cn/docs/en/master/design/parameter_server_training.html)
\ No newline at end of file
+As deep learning evolves, models are getting larger and larger. In NLP, for example, the parameter count has grown in just a few years from BERT's 100 million to GPT-3's 175 billion and then to PanGu-Alpha's 200 billion, and the industry has proposed models that are orders of magnitude larger still. The scale of parameters has clearly grown exponentially in recent years. On the other hand, with the development of big data and Internet technologies, the datasets available for model training are also expanding rapidly; in scenarios such as recommendation and natural language processing, datasets can reach the terabyte scale.
+
+Faced with large-scale data and large-scale parameters, a single device either takes a very long time to train the model or cannot train it at all due to insufficient device memory. Therefore, distributed training techniques need to be introduced.
+
+Currently, the most commonly used distributed training technique is data parallelism. Data parallelism splits the training data across multiple devices; each device maintains the same model parameters and the same amount of computation but processes different data. During backpropagation, the parameter gradients produced on each device are summed globally through a synchronous AllReduce. When the dataset is large and the model is small, as with ResNet50, data parallelism is advantageous. However, when the model is large, or both the dataset and the model are large, other distributed features are needed.
+
+MindSpore provides the following advanced features to support distributed training of large models, and users can flexibly combine them according to their own needs.
+
+### Operator Parallel
+
+Operator-level parallelism distributes computation by splitting the input tensors of each operator across multiple devices. On the one hand, data samples and model parameters can be split across devices simultaneously, making it possible to train large models. On the other hand, cluster resources can be fully used for parallel computation to improve the overall speed.
+
+Users can set a sharding strategy for each operator in the forward network. The framework then models each operator and its input tensors according to that strategy, so that the operator's computation remains mathematically equivalent before and after sharding.
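+
+A minimal sketch of how such strategies might be configured is shown below; the 8-device layout, the network, and the strategies are illustrative assumptions rather than part of the original tutorial:
+
+```python
+import mindspore as ms
+from mindspore import nn, ops
+
+# Assume an 8-device cluster whose communication has been initialized elsewhere.
+ms.set_auto_parallel_context(parallel_mode=ms.ParallelMode.SEMI_AUTO_PARALLEL,
+                             device_num=8)
+
+class MatMulNet(nn.Cell):
+    def __init__(self):
+        super().__init__()
+        # Split the first input by rows on 2 devices and the weight by
+        # columns on 4 devices: 2 * 4 = 8 devices in total.
+        self.matmul = ops.MatMul().shard(((2, 1), (1, 4)))
+        self.relu = ops.ReLU().shard(((8, 1),))
+
+    def construct(self, x, w):
+        return self.relu(self.matmul(x, w))
+```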
+
+### [Pipeline Parallel](https://www.mindspore.cn/docs/en/master/design/pipeline_parallel.html)
+
+When the cluster contains a large number of devices, using only operator-level parallelism requires communication over the communication domain of the whole cluster, which can make communication inefficient and reduce overall performance.
+
+Pipeline parallelism splits the neural network into multiple stages, each running on a subset of the devices. Collective communication is confined to the devices within a stage, while point-to-point communication is used between stages.
+
+The advantages of pipeline parallelism are higher communication efficiency and a natural fit for layered network structures. The disadvantage is that some devices may be idle at times.
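+
+The sketch below illustrates how a two-stage split might be expressed; the toy network, the stage assignment, and the MicroBatch count are illustrative assumptions:
+
+```python
+import mindspore as ms
+from mindspore import nn
+
+# Assume 8 devices arranged into 2 pipeline stages (4 devices per stage).
+ms.set_auto_parallel_context(parallel_mode=ms.ParallelMode.SEMI_AUTO_PARALLEL,
+                             device_num=8, pipeline_stages=2)
+
+block0 = nn.Dense(128, 128)
+block1 = nn.Dense(128, 10)
+block0.pipeline_stage = 0   # executed by the devices of stage 0
+block1.pipeline_stage = 1   # executed by the devices of stage 1
+
+net = nn.SequentialCell([block0, block1])
+loss = nn.SoftmaxCrossEntropyWithLogits(sparse=True, reduction="mean")
+# Wrap the loss network and split each MiniBatch into 4 MicroBatches.
+net_with_loss = nn.PipelineCell(nn.WithLossCell(net, loss), 4)
+```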
+
+### Optimizer Parallel
+
+When training with data parallelism or operator-level parallelism, the same copy of a model parameter may exist on multiple devices, so the optimizer performs redundant computation on those devices when updating that weight. Optimizer parallelism spreads the optimizer's computation across the devices. Its advantages are lower static memory consumption and less optimizer computation; its disadvantage is increased communication overhead.
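+
+A minimal sketch of enabling it follows; the threshold and flag values are illustrative:
+
+```python
+import mindspore as ms
+
+# Enable optimizer parallelism on top of (semi-)auto parallelism; parameters
+# no larger than 64 KB are left unsharded.
+ms.set_auto_parallel_context(
+    parallel_mode=ms.ParallelMode.SEMI_AUTO_PARALLEL,
+    enable_parallel_optimizer=True,
+    parallel_optimizer_config={"gradient_accumulation_shard": False,
+                               "parallel_optimizer_threshold": 64})
+```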
+
+### [Host Device Training](https://www.mindspore.cn/docs/en/master/design/host_device_training.html)
+
+When training large models, the overall model size that can be trained is limited by the number of devices, because each device (accelerator) has limited memory capacity. To train even larger models, you can use the host+device heterogeneous training mode, which exploits both the large memory on the host side and the fast computation on the accelerator side. It is an effective way to reduce the number of devices required when training very large models.
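+
+As a rough sketch of the idea (the vocabulary size, shapes, and network are illustrative assumptions), the memory-hungry EmbeddingLookup can be pinned to the host CPU while the rest of the network stays on the accelerator:
+
+```python
+import numpy as np
+import mindspore as ms
+from mindspore import nn, ops
+
+class HybridEmbedding(nn.Cell):
+    def __init__(self, vocab_size=1000000, embedding_dim=16):
+        super().__init__()
+        table = np.random.randn(vocab_size, embedding_dim).astype(np.float32)
+        self.embedding_table = ms.Parameter(ms.Tensor(table), name="embedding_table")
+        self.lookup = ops.EmbeddingLookup()
+        # Execute the lookup on the host CPU to save accelerator memory.
+        self.lookup.add_prim_attr("primitive_target", "CPU")
+
+    def construct(self, indices):
+        # The third argument is the offset of the local table slice.
+        return self.lookup(self.embedding_table, indices, 0)
+```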
+
+### [Recompute](https://www.mindspore.cn/docs/en/master/design/recompute.html)
+
+MindSpore automatically derives the backward graph from the forward computation, and the forward and backward graphs together form the complete computational graph. Some backward operators depend on the results of forward operators, so those forward results must stay resident in memory until the corresponding backward operators have been computed, and the memory they occupy cannot be reused by other operators. These long-lived forward results push up the peak memory footprint of the computation, especially in large-scale network models. To reduce the memory peak, the recomputation technique skips saving the results of the forward activation layers so that the memory can be reused, and recomputes those forward activations when they are needed in the backward pass.
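+
+A small sketch of the two interface levels follows; the block itself is an illustrative assumption:
+
+```python
+from mindspore import nn, ops
+
+class Block(nn.Cell):
+    def __init__(self):
+        super().__init__()
+        self.dense = nn.Dense(1024, 1024)
+        self.gelu = ops.GeLU()
+        # Recompute only this activation during the backward pass instead of
+        # keeping its forward result resident in memory.
+        self.gelu.recompute()
+
+    def construct(self, x):
+        return self.gelu(self.dense(x))
+
+block = Block()
+# Alternatively, mark every forward operator inside the Cell for recomputation.
+block.recompute()
+```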
+
+### [Sharding Propagation](https://www.mindspore.cn/docs/en/master/design/sharding_propagation.html)
+
+In operator-level parallelism, the user is required to configure a sharding strategy for each operator in the forward network (if none is configured, the data-parallel strategy is used by default). With sharding propagation, the user configures strategies for only a few operators, and the framework automatically generates a feasible sharding strategy for the operators without one while minimizing communication overhead.
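+
+A minimal sketch, with an illustrative network and strategy: only the MatMul is configured, and the strategy of the following activation is propagated automatically.
+
+```python
+import mindspore as ms
+from mindspore import nn, ops
+
+ms.set_auto_parallel_context(parallel_mode=ms.ParallelMode.AUTO_PARALLEL,
+                             search_mode="sharding_propagation", device_num=8)
+
+class Net(nn.Cell):
+    def __init__(self):
+        super().__init__()
+        self.matmul = ops.MatMul().shard(((2, 4), (4, 1)))  # configured by the user
+        self.gelu = ops.GeLU()                              # strategy propagated by the framework
+
+    def construct(self, x, w):
+        return self.gelu(self.matmul(x, w))
+```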
+
+### [Parameter Server Training](https://www.mindspore.cn/docs/en/master/design/parameter_server_training.html)
+
+Parameter Server is a widely used architecture in distributed training. It offers better flexibility, scalability, and node fault tolerance than synchronous AllReduce-based data-parallel training. A parameter server supports both synchronous and asynchronous SGD (Stochastic Gradient Descent) training. In terms of scalability, model computation and model updates are deployed in worker and server processes respectively, so worker and server resources can be scaled horizontally and independently (added or removed). In addition, in large-scale data centers, computing devices, networks, and storage frequently fail and cause some nodes to behave abnormally; under the parameter server architecture, such failures can be handled easily without affecting the training job.
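+
+A minimal sketch of switching a network to parameter-server training; it assumes the scheduler, server, and worker roles are configured through the usual MS_ROLE and related environment variables before launch, and the tiny network here is only illustrative:
+
+```python
+import mindspore as ms
+from mindspore import nn
+
+# Assume the cluster roles (MS_ROLE, MS_SERVER_NUM, MS_WORKER_NUM, ...) are
+# already exported in the launch environment.
+ms.set_ps_context(enable_ps=True)
+
+net = nn.Dense(128, 10)
+# Store and update all weights of this network on the parameter servers.
+net.set_param_ps()
+```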
+
+### Communication Operator Fusion
+
+In distributed training, cross-device and even cross-node data transmission is a bottleneck that limits scalability and compute utilization. Communication operator fusion is an important way to improve network resource utilization and speed up data transmission: it packs communication operators that share the same source and destination nodes and executes them together, avoiding the extra overhead of launching many individual operators.
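+
+A small configuration sketch; the thresholds are illustrative:
+
+```python
+import mindspore as ms
+
+# Fuse gradient AllReduce operators in 32 MB chunks and let AllGather use the
+# automatic 64 MB threshold.
+ms.set_auto_parallel_context(
+    comm_fusion={"allreduce": {"mode": "size", "config": 32},
+                 "allgather": {"mode": "auto", "config": None}})
+```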
+
+### Dataset Splitting
+
+For distributed training, the training dataset has to be imported to each device. There are two common import modes: 1) data-parallel import, where the data is split along the batch dimension and each device imports one part; 2) full import, where each device imports the complete data. In addition, when some dimensions of the data are especially large (for example, the H/W dimensions of remote-sensing images can be huge), the images need to be split even if the sample count is small; that is, the data is split along the H/W dimension and each device reads a part of each image. This feature supports splitting the dataset along a specific dimension to meet the training requirements of large-format image processing.
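+
+The sketch below shows the two string modes and a tuple strategy that splits a large NCHW image along its H dimension; the shapes and device count are illustrative assumptions:
+
+```python
+import mindspore as ms
+
+# Every device imports the full data (useful when there are few samples).
+ms.set_auto_parallel_context(dataset_strategy="full_batch")
+
+# Or: split an NCHW image input along H across 8 devices while keeping the
+# label column unsplit, similar to a shard() strategy.
+ms.set_auto_parallel_context(dataset_strategy=((1, 1, 8, 1), (1,)))
+```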
+
+### [Distributed Inference](https://www.mindspore.cn/tutorials/experts/en/master/parallel/distributed_inference.html)
+
+For large models the trained parameters are huge; PanGu-Alpha, for example, has 200 billion parameters. Such a model cannot simply be deployed on a single device for inference; it has to be sharded across a cluster. Moreover, inference and training may not run in the same environment, and the cluster sizes may differ, so scenarios where the number of inference devices differs from the number of training devices also need to be supported.
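+
+A rough sketch of the flow follows; the network, the checkpoint file names, and the strategy file name are hypothetical placeholders:
+
+```python
+import numpy as np
+import mindspore as ms
+from mindspore import Model, Tensor, nn
+
+# Placeholder sharded inference network and its distributed checkpoint files.
+net = nn.Dense(1024, 1024)
+ckpt_file_list = ["rank_{}.ckpt".format(r) for r in range(8)]
+
+model = Model(net)
+predict_data = Tensor(np.ones((1, 1024), dtype=np.float32))
+
+# Pre-compile with the inference data to obtain the weight sharding layout.
+predict_layout = model.infer_predict_layout(predict_data)
+ms.load_distributed_checkpoint(network=net,
+                               checkpoint_filenames=ckpt_file_list,
+                               predict_strategy=predict_layout,
+                               train_strategy_filename="train_strategy.ckpt")
+output = model.predict(predict_data)
+```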
+
+### Functional Operator Splitting
+
+In dynamic graph (PyNative) mode, you can specify that a part of the network executes in graph mode and performs various parallel operations there.
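+
+A minimal sketch of the idea; the cell, strategies, and device count are illustrative assumptions, and the exact signatures are listed in the interface table below:
+
+```python
+import numpy as np
+import mindspore as ms
+from mindspore import nn, ops, Tensor
+
+ms.set_context(mode=ms.PYNATIVE_MODE)
+ms.set_auto_parallel_context(parallel_mode=ms.ParallelMode.AUTO_PARALLEL,
+                             search_mode="sharding_propagation", device_num=8)
+
+dense = nn.Dense(1024, 1024)
+# Run this cell in graph mode with its input split along the batch dimension
+# over 8 devices; the output strategy is left to the framework (assumption:
+# None means "no constraint").
+sharded_dense = ops.shard(dense, in_strategy=((8, 1),), out_strategy=None)
+
+x = Tensor(np.ones((32, 1024), dtype=np.float32))
+y = sharded_dense(x)
+```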
+
+## Description of Feature-Related Interfaces
+
+| Feature category | Feature interface | Description | Function |
+| ---------------------------- | ------------------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ |
+| operator parallel | shard(in_strategy=None, out_strategy=None)
In Primitive class | Set the sharding strategy of the operator's input and output tensors (the output-tensor sharding strategy is supported only by some operators, such as Gather and MatMul). | Reduce the memory usage of a single device by sharding the tensors involved in each operator of the network model, enabling large-model training/inference; or use cluster resources for distributed computation to reduce the overall execution time. |
+| | add_prim_attr(name, value)
In Primitive class | Gather Operator:
add_prim_attr(“manual_split”, config): Configure a non-uniform sharding strategy for its first input, where config is a tuple describing how dimension 0 of the first parameter is split. For example, (10, 20, 30, 4) means that dimension 0 of the operator's first input is split into 4 parts whose sizes are 10, 20, 30, and 4, respectively. | In the recommendation domain, there are scenarios where each column of the dataset corresponds to a subtable. Using this configuration in such scenarios can reduce traffic and improve overall performance. |
+| | | EmbeddingLookUp Operator:
add_prim_attr(“primitive_target”, “CPU”): Configure it to execute on the CPU for heterogeneous scenarios. | In the recommendation domain, where the embedding table can be particularly large, this configuration puts EmbeddingLookUp on the CPU to save device memory and enables training of large recommendation models. |
+| | set_auto_parallel_context(enable_alltoall=bool_value) | Indicate whether AllToAll communication operators may be generated during communication. The value is of type bool and defaults to False. | AllToAll communication can reduce the amount of communicated data and improve communication efficiency, but it requires environment support. |
+| Pipeline parallel | set_auto_parallel_context(pipeline_stages=stage_num) | Set the number of pipeline stages. The value is a positive integer in the range [1, number of devices]. | Specifying the number of stages confines collective communication to within each stage, with point-to-point communication between stages. |
+| | pipeline_stage(value)
In Cell class | Set which stage the Cell executes in. | Set which stage the Cell executes in. |
+| | PipelineCell(network, micro_size) | Specify the number of MicroBatches for the training network, where network is the network to be trained and micro_size is a positive integer. | Specifying micro_size reduces the idle waiting time between stages and improves the overall efficiency of pipeline parallelism. |
+| Optimizer parallel | set_auto_parallel_context(enable_parallel_optimizer=bool_value) | Indicate whether optimizer parallelism is enabled. The value is of type bool and defaults to False. | Optimizer parallelism saves static memory overhead but increases communication overhead. |
+| | set_auto_parallel_context(parallel_optimizer_config=config) | This configuration takes effect only after optimizer parallelism is enabled. config is a dict that supports two keys:
gradient_accumulation_shard(bool): If True, the accumulated gradient variables are sharded along the data-parallel dimension; defaults to False.
parallel_optimizer_threshold(int): The optimizer sharding threshold in KB (default 64KB). Parameters no larger than this value are not split. | Setting gradient_accumulation_shard to True saves static memory equal to one copy of the parameters but increases communication overhead.
The optimizer sharding threshold keeps parameters with small shapes from being split by the optimizer, saving communication resources. |
+| Recompute | recompute(mode=True)
In Primitive class | Specify whether the operator needs to be recomputed. The value is of type bool and defaults to True, which enables recomputation for this operator. | Enabling operator recomputation reduces the peak dynamic memory but increases the overall amount of computation. |
+| | recompute(**kwargs)
In Cell class | When this interface is called, the operators in this Cell are recomputed.
The input has two bool options:
mp_comm_recompute: Whether to recompute model-parallel communication operators; defaults to True.
parallel_optimizer_comm_recompute: Whether to recompute optimizer-parallel communication operators; defaults to False. | Enable recomputation for the Cell and configure whether the model-parallel and optimizer-parallel communication operators are recomputed. Recomputing communication operators consumes communication resources but lowers the peak dynamic memory. |
+| Auto-parallel | set_auto_parallel_context(search_mode=mode) | Specify the strategy search algorithm. The value is a string with the following options:
1. "sharding_propagation": search strategies with the sharding strategy propagation algorithm;
2. "dynamic_programming": search strategies with the dynamic programming algorithm;
3. "recursive_programming": search strategies with the double recursive algorithm. | Auto-parallel lets the user configure no or only a few operator sharding strategies, and the framework searches for the rest. |
+| | set_algo_parameters(fully_use_devices=bool_value) | Set whether operators must be sharded across all devices during strategy search. The value is of type bool and defaults to True. | Sharding operators across all devices narrows the search space and speeds up the search, but the resulting strategy may not be globally optimal. |
+| | set_auto_parallel_context(all_reduce_fusion_config=config) | Configure the gradient AllReduce operator fusion strategy. The value is of type list. For example, [20, 35] means that the first 20 AllReduce operators are fused into 1, the 20th to 35th are fused into 1, and the remaining ones are fused into 1. | Reduce the number of AllReduce communication operations and improve communication efficiency. |
+| comm_fusion | set_auto_parallel_context(comm_fusion=config) | Set the fusion configuration of communication operators; AllReduce, AllGather, and ReduceScatter are currently supported. The value is of type dict, for example comm_fusion={"allreduce": {"mode": "auto", "config": None}}. "mode" has three options:
"auto": Fuse operators automatically according to a 64MB data volume threshold; the configuration parameter "config" is None.
"size": Fuse communication operators according to a manually set data volume threshold; the configuration parameter "config" is of type int, in MB.
"index": Only "allreduce" supports index, which fuses operators by their sequence numbers; the configuration parameter "config" is of type list. For example, [20, 35] means that the first 20 AllReduce operators are fused into 1, the 20th to 35th are fused into 1, and the remaining ones are fused into 1. | Reduce the number of AllReduce/AllGather/ReduceScatter communication operations and improve communication efficiency. |
+| Dataset slicing | set_auto_parallel_context(dataset_strategy=config) | Configure the sharding strategy for the dataset, where config is Union[str, tuple].
When a string is passed in, there are two options:
"full_batch": the dataset is not sliced;
"data_parallel": the dataset is sliced in the data-parallel way.
When a tuple is passed in, its content represents the dataset's sharding strategy, similar to the primitive shard() interface.
If this interface is not called, the "data_parallel" mode is used by default. | When the number of samples is smaller than the number of cards, the dataset can be imported with "full_batch"; when the number of samples is large and the model parameters are small, it can be imported with "data_parallel"; when the dataset consists of high-resolution images, it can be imported by configuring a tuple sharding strategy. |
+| Distributed inference | infer_predict_layout(*predict_data) | Use inference data to perform a precompilation, which outputs the operators' sharding information. | Obtain the sharding information of all weights for inference. |
+| | load_distributed_checkpoint(network, checkpoint_filenames, predict_strategy=None, train_strategy_filename=None) | Load the distributed weights; each machine needs the full set of checkpoint files placed in advance.
Here network is the inference network, checkpoint_filenames are the checkpoint files, predict_strategy is the output of infer_predict_layout(), and train_strategy_filename is the operator sharding strategy information saved during training. | Load distributed weights for distributed inference. |
+| Functional operator sharding | shard(in_strategy, out_strategy, device="Ascend", level=0)
In Cell class | Set the sharding strategy of the cell's input and output tensors; the parallel strategies of the remaining operators are derived by sharding propagation. in_strategy/out_strategy specify the sharding strategies of the input/output tensors, device specifies the execution device, and level specifies the mode of the sharding propagation algorithm. | In PyNative mode, specify that a cell instance executes in graph mode and performs operator-level model parallelism according to the specified input/output sharding strategies, while the rest of the model still runs data parallelism in PyNative mode. |
+| | ops.shard(fn, in_strategy, out_strategy, device="Ascend", level=0) | The fn passed in is a cell instance or a function. The remaining inputs are the same as shard, and the return value is a function. When that function is called, operator-level model parallelism is executed in graph mode. | This usage specifies that a function performs operator-level model parallelism, with the same functionality as the cell's shard method. |
+
diff --git a/docs/mindspore/source_en/design/glossary.md b/docs/mindspore/source_en/design/glossary.md
index b9612101327bff319bf9aa0df83a4bb2af0db501..c89650e91a50504ad640cf2c109d1ca553166bbf 100644
--- a/docs/mindspore/source_en/design/glossary.md
+++ b/docs/mindspore/source_en/design/glossary.md
@@ -65,8 +65,8 @@
| TFRecord | Data format defined by TensorFlow. |
| Tensor | A tensor is a generalization of vectors and matrices and is easily understood as a multidimensional array, scalar, and matrix. |
| Broadcast | In matrix mathematical operations, the shape of the operands is extended to a dimension compatible with the operation. In distributed parallelism, the parameters on one card are synchronized to other cards. |
-| Computational Graphs on Devices | The entire graph is executed on the device to reduce the interaction overheads between the host and device. For details see [On-Device Execution](https://www.mindspore.cn/docs/en/master/design/on_device.html) |
-| Cyclic Sinking | Cyclic sinking is optimized based on on-device execution to further reduce the number of interactions between the host and device. For details see [On-Device Execution](https://www.mindspore.cn/docs/en/master/design/on_device.html) |
-| Data Sinking | Sinking means that data is directly transmitted to the device through a channel. For details see [On-Device Execution](https://www.mindspore.cn/docs/en/master/design/on_device.html) |
+| Computational Graphs on Devices | The entire graph is executed on the device to reduce the interaction overheads between the host and device. For details see [On-Device Execution](https://www.mindspore.cn/docs/en/master/design/on_device.html). |
+| Cyclic Sinking | Cyclic sinking is optimized based on on-device execution to further reduce the number of interactions between the host and device. For details see [On-Device Execution](https://www.mindspore.cn/docs/en/master/design/on_device.html). |
+| Data Sinking | Sinking means that data is directly transmitted to the device through a channel. For details see [On-Device Execution](https://www.mindspore.cn/docs/en/master/design/on_device.html). |
| Graph Mode | Static graph mode or graph mode. In this mode, the neural network model is compiled into an entire graph, and then the graph is delivered for execution. This mode uses graph optimization to improve the running performance and facilitates large-scale deployment and cross-platform running. |
| PyNative Mode | Dynamic graph mode. In this mode, operators in the neural network are delivered and executed one by one, facilitating the compilation and debugging of the neural network model. |
diff --git a/docs/mindspore/source_en/design/host_device_training.md b/docs/mindspore/source_en/design/host_device_training.md
index 5c45525a68d85c6caee2c9131383fc6ffcef18f8..257c11ec7daea78ababba1dbffcfc9250044ef9d 100644
--- a/docs/mindspore/source_en/design/host_device_training.md
+++ b/docs/mindspore/source_en/design/host_device_training.md
@@ -8,16 +8,31 @@ In deep learning, one usually has to deal with the huge model problem, in which
the number of required accelerators is too overwhelming for people to access, resulting in this solution inapplicable. One alternative is Host+Device hybrid training. This solution simultaneously leveraging the huge memory in hosts and fast computation in accelerators, is a promisingly
efficient method for addressing huge model problem.
-In MindSpore, users can easily implement hybrid training by configuring trainable parameters and necessary operators to run on hosts, and other operators to run on accelerators.
-This tutorial introduces how to train [Wide&Deep](https://gitee.com/mindspore/models/tree/master/official/recommend/wide_and_deep) in the Host+Ascend 910 AI Accelerator mode.
+In MindSpore, users can easily implement hybrid training by configuring trainable parameters and necessary operators to run on hosts, and other operators to run on accelerators. This tutorial introduces how to train [Wide&Deep](https://gitee.com/mindspore/models/tree/master/official/recommend/wide_and_deep) in the Host+Ascend 910 AI Accelerator mode.
-## Preliminaries
+## Basic Principle
-1. Prepare the model. The Wide&Deep code can be found at: , in which `train_and_eval_auto_parallel.py` is the main function for training, `src/` directory contains the model definition, data processing and configuration files, `script/` directory contains the launch scripts in different modes.
+Pipeline parallelism and operator-level parallelism suit models that have a large number of operators with the parameters distributed fairly evenly among them. What if the model has only a few operators and the parameters are concentrated in just a few of them? Wide&Deep is such an example, as shown in the figure below. The embedding table in Wide&Deep can hold hundreds of gigabytes or even several terabytes of trainable parameters. If it is executed on accelerators (devices), the number of accelerators required is huge and training is expensive. On the other hand, if accelerators are used for this computation, the training speedup obtained is limited, cross-server traffic is triggered, and end-to-end training efficiency is not high.
+
+
+
+*Figure: Part of the structure of the Wide & Deep model*
+
+Careful analysis of the structure of the Wide&Deep model shows that although the embedding table holds a huge number of parameters, it participates in very little computation. The embedding table and its corresponding operator, EmbeddingLookup, can therefore be placed on the host side and computed on the CPU, while the remaining operators are placed on the accelerator side. This takes advantage of the large host-side memory and the fast accelerator-side computation, as well as the high host-to-accelerator bandwidth within the same server. The following diagram shows how Wide&Deep heterogeneous partitioning works:
+
+
+
+*Figure: Wide & Deep Heterogeneous Approach*
+
+## Practices
+
+### Sample Code Description
+
+1. Prepare the model code. The Wide&Deep code can be found at: , in which `train_and_eval_auto_parallel.py` defines the main function for model training, `src/` directory contains the model definition, data processing and configuration files, and `script/` directory contains the training scripts in different modes.
2. Prepare the dataset. Please refer the link in [1] to download the dataset, and use the script `src/preprocess_data.py` to transform dataset into MindRecord format.
-3. Configure the device information. When performing training in the bare-metal environment, the network information file needs to be configured. This example only employs one accelerator, thus `rank_table_1p_0.json` containing #0 accelerator is configured (about the rank table file, you can refer to [HCCL_TOOL](https://gitee.com/mindspore/models/tree/master/utils/hccl_tools)).
+3. Configure the device information. When performing distributed training in a bare-metal environment (that is, with an Ascend 910 AI processor available locally), the network information file needs to be configured. This example uses only one accelerator, so `rank_table_1p_0.json`, which contains accelerator #0, is configured. MindSpore provides an automated script and related instructions for generating this configuration file. For details, see [HCCL_TOOL](https://gitee.com/mindspore/models/tree/master/utils/hccl_tools).
## Configuring for Hybrid Training
@@ -45,9 +60,7 @@ This tutorial introduces how to train [Wide&Deep](https://gitee.com/mindspore/mo
In order to save enough log information, use the command `export GLOG_v=1` to set the log level to INFO before executing the script, and add the `-p on` option when compiling MindSpore. For the details about compiling MindSpore, refer to [Compiling MindSpore](https://www.mindspore.cn/install/detail/en?path=install/master/mindspore_ascend_install_source_en.md&highlight=%E7%BC%96%E8%AF%91mindspore).
-Use the script `script/run_auto_parallel_train.sh`. Run the command `bash run_auto_parallel_train.sh 1 1 `,
-where the first `1` is the number of accelerators, the second `1` is the number of epochs, `DATASET_PATH` is the path of dataset,
-and `RANK_TABLE_FILE` is the path of the above `rank_table_1p_0.json` file.
+Use the script `script/run_auto_parallel_train.sh`. Run the command `bash run_auto_parallel_train.sh 1 1 `, where the first `1` is the number of cards used in this case, the second `1` is the number of epochs, `DATASET_PATH` is the path of the dataset, and `RANK_TABLE_FILE` is the path of the above `rank_table_1p_0.json` file.
The running log is in the directory of `device_0`, where `loss.log` contains every loss value of every step in the epoch. Here is an example:
@@ -62,7 +75,6 @@ epoch: 1 step: 7, wide_loss is 0.5798845, deep_loss is 0.7245408
epoch: 1 step: 8, wide_loss is 0.57553077, deep_loss is 0.7123517
epoch: 1 step: 9, wide_loss is 0.5733629, deep_loss is 0.70278376
epoch: 1 step: 10, wide_loss is 0.566089, deep_loss is 0.6884129
-...
```
`test_deep0.log` contains the runtime log.
diff --git a/docs/mindspore/source_en/design/on_device.md b/docs/mindspore/source_en/design/on_device.md
index d110546af72b188f2cb3b856076998cd71949246..5c7706cf6047267d3c71afa12ef4ebb78a6ad1ca 100644
--- a/docs/mindspore/source_en/design/on_device.md
+++ b/docs/mindspore/source_en/design/on_device.md
@@ -6,11 +6,11 @@
The backends supported by MindSpore include Ascend, GPU, and CPU. The device in the "On-Device" refers to the Ascend AI processor.
-The Ascend AI processor integrates the AI core, AI CPU, and CPU. The AI core is responsible for large Tensor Vector computing, the AI CPU is responsible for scalar computing, and the CPU is responsible for logic control and task distribution.
+The Ascend AI processor integrates the AICORE, AICPU, and CPU. The AICORE is responsible for large tensor and vector computing, the AICPU is responsible for scalar computing, and the CPU is responsible for logic control and task distribution.
The CPU on the host side delivers graphs or operators to the Ascend AI processor. The Ascend AI processor has the functions of computing, logic control, and task distribution. Therefore, it does not need to frequently interact with the CPU on the host side. It only needs to return the final calculation result to the host. In this way, the entire graph is sunk to the device for execution, avoiding frequent interaction between the host and device and reducing overheads.
-### Computational Graphs on Devices
+### Computational Graph Sinking
The entire graph is executed on the device to reduce the interaction overheads between the host and device. Multiple steps can be moved downwards together with cyclic sinking to further reduce the number of interactions between the host and device.
@@ -41,7 +41,7 @@ The following is a code example:
```python
import os
-
+import requests
import mindspore.dataset as ds
import mindspore.dataset.transforms.c_transforms as CT
import mindspore.dataset.vision.c_transforms as CV
@@ -50,10 +50,10 @@ from mindspore import Model, set_context, GRAPH_MODE
from mindspore import dtype as mstype
from mindspore.common.initializer import TruncatedNormal
from mindspore.dataset.vision import Inter
-from mindspore.nn import Accuracy
import mindspore.ops as ops
from mindspore import LossMonitor
+requests.packages.urllib3.disable_warnings()
def create_dataset(data_path, batch_size=32, repeat_size=1,
num_parallel_workers=1):
@@ -153,10 +153,26 @@ class LeNet5(nn.Cell):
x = self.fc3(x)
return x
+def download_dataset(dataset_url, path):
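+    # Download the file at `dataset_url` into `path`; skip it if already present.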
+ filename = dataset_url.split("/")[-1]
+ save_path = os.path.join(path, filename)
+ if os.path.exists(save_path):
+ return
+ if not os.path.exists(path):
+ os.makedirs(path)
+ res = requests.get(dataset_url, stream=True, verify=False)
+ with open(save_path, "wb") as f:
+ for chunk in res.iter_content(chunk_size=512):
+ if chunk:
+ f.write(chunk)
+ print("The {} file is downloaded and saved in the path {} after processing".format(os.path.basename(dataset_url), path))
+
if __name__ == "__main__":
set_context(mode=GRAPH_MODE, device_target="GPU")
ds_train_path = "./datasets/MNIST_Data/train/"
+ download_dataset("https://mindspore-website.obs.myhuaweicloud.com/notebook/datasets/mnist/train-labels-idx1-ubyte", ds_train_path)
+ download_dataset("https://mindspore-website.obs.myhuaweicloud.com/notebook/datasets/mnist/train-images-idx3-ubyte", ds_train_path)
ds_train = create_dataset(ds_train_path, 32)
network = LeNet5(10)
diff --git a/docs/mindspore/source_en/design/pipeline_parallel.md b/docs/mindspore/source_en/design/pipeline_parallel.md
index 9b66ac998151c89139efec42a119bf51c82cb075..8570a102719c87a47c3bc04fc15490a0a597d15d 100644
--- a/docs/mindspore/source_en/design/pipeline_parallel.md
+++ b/docs/mindspore/source_en/design/pipeline_parallel.md
@@ -1,16 +1,34 @@
-# Pipeline Parallelism
+# Pipeline Parallel
## Overview
-In recent years, the scale of neural networks has increased exponentially. Limited by the memory on a single device, the
-number of devices used for training large models is also increasing. Due to the low communication bandwidth between
-servers, the performance of the conventional hybrid parallelism (data parallel + model parallel) is poor. Therefore,
-pipeline parallelism needs to be introduced. Pipeline parallelism can divide a model in space based on `stage`.
-Each `stage` needs to execute only a part of the network, which greatly reduces memory overheads, shrinks the
-communication domain, and shortens the communication time. MindSpore can automatically convert a standalone model to the
-pipeline parallel mode based on user configurations.
+In recent years, the scale of neural networks has increased exponentially. Limited by the memory on a single device, the number of devices used for training large models is also increasing. Due to the low communication bandwidth between servers, the performance of the conventional hybrid parallelism (data parallel + model parallel) is poor. Therefore, pipeline parallelism needs to be introduced. Pipeline parallel can divide a model in space based on `stage`. Each `stage` needs to execute only a part of the network, which greatly reduces memory overheads, shrinks the communication domain, and shortens the communication time. MindSpore can automatically convert a standalone model to the pipeline parallel mode based on user configurations.
+
+## Basic Principle
+
+Pipeline parallelism splits the operators of a neural network into multiple stages and maps the stages to different devices, so that different devices compute different parts of the network. It is suitable for models with a linear graph structure. As shown in Figure 1, a network of 4 MatMul layers is split into 4 stages and distributed to 4 devices. In the forward pass, each device sends its local MatMul result to the next device through a communication operator (Send), while the next device receives (Receive) the previous device's result through a communication operator and starts its own MatMul computation. In the backward pass, once the gradient on the last device has been computed, the result is sent to the previous device, which receives it and begins its own backward computation.
+
+
+
+*Figure 1: Schematic diagram of graph splitting in pipeline parallel*
+
+Simply splitting the model across multiple devices does not bring a performance gain, because with the model's linear structure only one device is working at any time while the others wait, wasting resources. To improve efficiency, pipeline parallelism further divides each MiniBatch into finer-grained MicroBatches and pipelines their execution, as shown in Figure 2. Each MiniBatch is split into 4 MicroBatches, which are executed across the 4 stages to form a pipeline. The gradients of the MicroBatches are accumulated and then used to update the parameters, with each device storing and updating only the parameters of its own stage. The white numbers indicate the MicroBatch index.
+
+
+
+*Figure 2: Schematic diagram of a pipeline parallel execution timeline with MicroBatch*
+
+In MindSpore's pipeline parallel implementation, the execution order is adjusted for better memory management. As shown in Figure 3, the backward pass of MicroBatch 0 is executed immediately after its forward pass, so the memory holding its intermediate results is freed earlier than in Figure 2, which keeps the peak memory usage lower than with the schedule of Figure 2.
+
+
+
+*Figure 3: MindSpore Pipeline Parallel Execution Timeline Diagram*
+
+## Operation Practices
+
+### Sample Code Description
> Download address of the complete sample code:
>
@@ -32,8 +50,6 @@ The directory structure is as follows:
`rank_table_16pcs.json`, `rank_table_8pcs.json` and `rank_table_2pcs.json` are the networking information files. `resnet.py` and `resnet50_distributed_training_pipeline.py` are the network structure files. `run_pipeline.sh` are the execute scripts.
-## Preparations
-
### Downloading the Dataset
This example uses the `CIFAR-10` dataset. For details about how to download and load the dataset,
@@ -105,10 +121,7 @@ To enable pipeline parallelism, you need to add the following configurations to
- Set `pipeline_stages` in `set_auto_parallel_context` to specify the total number of `stages`.
- Set the `SEMI_AUTO_PARALLEL` mode. Currently, the pipeline parallelism supports only this mode.
- Define the LossCell. In this example, the `nn.WithLossCell` API is called.
-- Finally, wrap the LossCell with `PipelineCell`, and specify the Micro_batch size. To improve machine utilization,
- MindSpore divides Mini_batch into finer-grained Micro_batch to streamline the entire cluster. The final loss value is
- the sum of the loss values computed by all Micro_batch. The size of Micro_batch must be greater than or equal to the
- number of `stages`.
+- Finally, wrap the LossCell with `PipelineCell`, and specify the Micro_batch size. To improve machine utilization, MindSpore divides the Mini_batch into finer-grained Micro_batches to pipeline the entire cluster. The final loss value is the sum of the loss values computed by all Micro_batches. The size of Micro_batch must be greater than or equal to the number of `stages`.
```python
from mindspore import Model, nn, set_auto_parallel_context, ParallelMode
@@ -136,7 +149,7 @@ def test_train_cifar(epoch_size=10):
## Running the Single-host with 8 devices Script
-Using the sample code, you can run a 2-stage pipeline on 8 Ascend devices using below scripts:
+Using the sample code, you can run a 2-stage pipeline on 8 Ascend devices by using the script below:
```bash
bash run_pipeline.sh [DATA_PATH] Ascend
diff --git a/docs/mindspore/source_en/design/recompute.md b/docs/mindspore/source_en/design/recompute.md
index f10d4495d8356e6bfe0026192aac1f99f42efb88..2aea7131536f4ffc9cf3178157908c57eef709a9 100644
--- a/docs/mindspore/source_en/design/recompute.md
+++ b/docs/mindspore/source_en/design/recompute.md
@@ -8,9 +8,31 @@ The automatic differential of MindSpore is in reverse-mode, which derives the ba
In order to solve this problem, Mindspore provides the recomputation function. It will recompute the forward operators before computing the backward operators rather than storing the results of forward operators, which can help the memory be reused. This tutorial takes the model ResNet-50 for example to explain how to configure recomputation to train your model in MindSpore.
+## Basic Principle
+
+MindSpore automatically derives the backward graph from the forward computation, and the forward and backward graphs together form the complete computational graph. Some backward operators depend on the results of forward operators, so those forward results must stay resident in memory until the corresponding backward operators have been computed, and the memory they occupy cannot be reused by other operators. These long-lived forward results push up the peak memory footprint of the computation, especially in large-scale network models.
+
+To reduce the memory peak, the recomputation technique skips saving the results of the forward activation layers so that the memory can be reused, and recomputes those forward activations when they are needed in the backward pass. MindSpore provides this recomputation capability.
+
+Recomputation is implemented by duplicating the forward operators that the user marks for recomputation, feeding the copies' outputs to the backward operators, and removing the edges between the original forward operators and the backward operators. In addition, the copied operators must only start executing when the corresponding backward part is being computed, so control dependencies are inserted to enforce the execution order. This is shown in the following figure:
+
+
+
+*Figure: Forward and backward graphs before and after the recomputation function is enabled*
+
+For user convenience, MindSpore provides not only a recompute interface for individual operators but also a recompute interface for Cells. When the user calls a Cell's recompute interface, all forward operators in that Cell are set to be recomputed.
+
+Taking the GPT-3 model as an example, the strategy is to mark the cell corresponding to each layer for recomputation and then set each layer's output operators to non-recompute. The effect of recomputation on the 72-layer GPT-3 network is shown in the following figure:
+
+
+
+*Figure: Comparison of GPT-3 memory usage before and after the recomputation function is enabled*
+
## Preliminaries
-1. Prepare the model. The ResNet-50 code can be found at: , in which `train.py` is the main function for training, `src/` directory contains the model definition and configuration files of ResNet-50, `script/` directory contains the training and evaluation scripts.
+### Sample Code Description
+
+1. Prepare the model. The ResNet-50 code can be found at: , in which `train.py` is the main function for training, `src/` directory contains the model definition and configuration files of ResNet-50, and `script/` directory contains the training and evaluation scripts.
2. Prepare the dataset. This example uses the `CIFAR-10` dataset. For details about how to download and load the dataset, visit .
@@ -96,8 +118,7 @@ We can call two kinds of interface to configure the recomputation. Take `src/res
## Training the Model
-We take the GPU environment for example, use the script `script/run_standalone_train_gpu.sh`. Run the command `bash scripts/run_standalone_train_gpu.sh $date_set_path config/resnet50_cifar10_config.yaml`.
-We can set the context: `save_graph=True` in `src/train.py` to print the construction of the computation graph to do comparison.
+We take the GPU environment for example, use the script `script/run_standalone_train_gpu.sh`. Run the command `bash scripts/run_standalone_train_gpu.sh $date_set_path config/resnet50_cifar10_config.yaml`. We can set the context: `save_graph=True` in `src/train.py` to print the construction of the computation graph to do comparison.
The graph before setting recomputation is as follow:
diff --git a/docs/mindspore/source_zh_cn/design/distributed_training_design.md b/docs/mindspore/source_zh_cn/design/distributed_training_design.md
index d3c823c4e918fb0c6f9fa5170725b3a4c8f66bc2..1504013a904e2882c8a43caf33e8ca4e4a8e94f0 100644
--- a/docs/mindspore/source_zh_cn/design/distributed_training_design.md
+++ b/docs/mindspore/source_zh_cn/design/distributed_training_design.md
@@ -193,14 +193,14 @@ Parameter Server(参数服务器)是分布式训练中一种广泛使用的架
| | PipelineCell(network, micro_size) | 用于指定训练网络的MicroSize数量,其中network为待训练的网络,micro_size为正整数。 | 指定micro_size,能减少stage间的空闲等待时间,提升流水线并行的整体效率。 |
| 优化器并行 | set_auto_parallel_context(enable_parallel_optimizer=bool_value) | 表示是否开启优化器并行,其值为bool型,默认为False。 | 优化器并行能节省静态内存的开销,但增加了通信开销。 |
| | set_auto_parallel_context(parallel_optimizer_config=config) | 只有开启优化器并行后,此配置才生效。其中config是个dict,支持两个键值:
gradient_accumulation_shard(bool):如果为True,则累积梯度变量将在数据并行度上进行分片,默认为False。
parallel_optimizer_threshold(int):该值表示优化器切分阈值,单位为KB(默认64KB)。当参数大小不超过该值时,将不会被切分。 | gradient_accumulation_shard为True时,将节省一份参数大小的静态内存,但增加了通信开销。
优化器切分阈值,能使得shape较小的参数不进行优化器切分,以节省通信资源。 |
-| 重计算 | recompute(mode=True)
在primitive类中 | 用于指定该算子是否需要重计算,其值为bool类型,默认为True,表示开启算子重计算。 | 开启算子重计算后,能减少动态内存的峰值,但增加整体计算量。 |
-| | recompute(**kwargs)
在Cell类中 | 调用此接口后,将会对此Cell中的算子进行重计算。
其中输入参数有两个bool类型选项:
mp_comm_recompute:是否开启模型并行通信算子重计算,默认为True。
parallel_optimizer_comm_recompute:是否开启优化器并行通信算子重计算,默认为False | 开启Cell重计算,且能配置模型并行的通信算子、优化器并行的通信算子是否进行重计算。当通信算子重计算时,将消耗通信资源,但能降低动态内存的峰值。 |
+| 重计算 | recompute(mode=True)
在Primitive类中 | 用于指定该算子是否需要重计算,其值为bool类型,默认为True,表示开启算子重计算。 | 开启算子重计算后,能减少动态内存的峰值,但增加整体计算量。 |
+| | recompute(**kwargs)
在Cell类中 | 调用此接口后,将会对此Cell中的算子进行重计算。
其中输入参数有两个bool类型选项:
mp_comm_recompute:是否开启模型并行通信算子重计算,默认为True。
parallel_optimizer_comm_recompute:是否开启优化器并行通信算子重计算,默认为False。 | 开启Cell重计算,且能配置模型并行的通信算子、优化器并行的通信算子是否进行重计算。当通信算子重计算时,将消耗通信资源,但能降低动态内存的峰值。 |
| 自动并行 | set_auto_parallel_context(search_mode=mode) | 用于指定策略搜索算法,其值为字符串类型,可选值为:
1,"sharding_propagation":表示使用切分策略传播算法进行策略搜索;
2,"dynamic_programming":表示使用动态规划算法进行策略搜索;
3,"recursive_programming":表示使用双递归算法进行策略搜索; | 自动并行可以让用户不配置或者少量配置算子的切分策略,而由框架搜索出切分策略。 |
| | set_algo_parameters(fully_use_devices=bool_value) | 用于设置搜索策略时是否需要将算子切分到所有设备上。其值为bool类型,默认为True。 | 如果将算子切分到所有设备上,则能缩小搜索空间,提高搜索速度,但搜索出来的策略并非全局最优。 |
| | set_auto_parallel_context(all_reduce_fusion_config=config) | 配置梯度AllReduce算子融合策略,其值为list类型。例如:[20, 35],表示将前20个AllReduce融合成1个,第20~35个AllReduce融合成1个,剩下的AllReduce融合成1个。 | 减少AllReduce通信算子的操作次数,提高通信效率。 |
-| 通信算子融合 | set_auto_parallel_context(comm_fusion=config) | 设置通信算子的融合配置,当前支持AllReduce、AllGather、ReduceScatter通信算子的配置。其值为dict类型,如comm_fusion={"allreduce": {"mode": "auto", "config": None}}。其中"mode"有三种选项:
"auto":自动按照数据量阈值64MB进行算子融合,配置参数“config”为None。
"size":按照手动设置数据量阈值的方式进行通信算子融合,配置参数"config"类型为int,单位MB。
"index":仅"allreduce"支持配置index,表示按照通信算子序列号进行融合的方式,配置参数"config"类型为list。例如:[20, 35],表示将前20个AllReduce融合成1个,第20~35个AllReduce融合成1个,剩下的AllReduce融合成1个。 | 减少AllReduce/AllGather/ReduceScatter通信算子的操作次数,提高通信效率 |
+| 通信算子融合 | set_auto_parallel_context(comm_fusion=config) | 设置通信算子的融合配置,当前支持AllReduce、AllGather、ReduceScatter通信算子的配置。其值为dict类型,如comm_fusion={"allreduce": {"mode": "auto", "config": None}}。其中"mode"有三种选项:
"auto":自动按照数据量阈值64MB进行算子融合,配置参数“config”为None。
"size":按照手动设置数据量阈值的方式进行通信算子融合,配置参数"config"类型为int,单位MB。
"index":仅"allreduce"支持配置index,表示按照通信算子序列号进行融合的方式,配置参数"config"类型为list。例如:[20, 35],表示将前20个AllReduce融合成1个,第20~35个AllReduce融合成1个,剩下的AllReduce融合成1个。 | 减少AllReduce/AllGather/ReduceScatter通信算子的操作次数,提高通信效率。 |
| 数据集切分 | set_auto_parallel_context(dataset_strategy=config) | 配置数据集的切分策略。其中,config为Union[str, tuple]。
当传入字符串时,有两种选项:
"full_batch":表示数据集不切分;
"data_parallel":表示数据集按数据并行的方式切分。
当传入tuple时,tuple中的内容代表数据集的切分策略,类似于primitive的shard()接口。
若不调用此接口,则默认采用"data_parallel"的方式。 | 当样本数比卡数少时,可以采用"full_batch"的方式进行导入;当样本数大、模型参数小时,可以采用"data_parallel"的方式导入;当数据集是高分辨率图像数据时,可以采用配置tuple切分策略的方式导入。 |
| 分布式推理 | infer_predict_layout(*predict_data) | 使用推理数据进行一次预编译,输出算子的切分信息。 | 获取推理时所有权重的切分信息。 |
| | load_distributed_checkpoint(network, checkpoint_filenames, predict_strategy=None, train_strategy_filename=None) | 加载分布式权重,需每台机器预先放置全量的ckpt。
其中network代表推理网络,checkpoint_filenames代表checkpoint文件,predict_strategy为infer_predict_layout()的输出,train_strategy_filename为训练时保存的算子切分策略信息。 | 加载分布式权重,以进行分布式推理。 |
-| 函数式算子切分 | shard(in_strategy, out_strategy, device="Ascend", level=0)
在cell类中 | 设置cell的输入及输出张量的切分策略,其余算子的并行策略由切分策略传播得到。 in_strategy/out_strategy指定输入/输出张量的切分策略,device指定执行设备,level指定切分策略传播算法的模式。 | 在PyNative模式下指定某个cell实例以图模式执行,并且依据指定的输入输出切分策略进行算子级别的模型并行, 其余的部分仍以PyNative模式执行数据并行。 |
-| | ops.shard(fn, in_strategy, out_strategy, device="Ascend", level=0) | 传入的fn为cell实例或函数,其余输入和shard相同,返回值为函数,再调用此函数时,会以图模式执行算子级别的模型并行 | 此用法可以指定某个函数进行算子级别的模型并行,具体功能和cell的shard方法相同。|
\ No newline at end of file
+| 函数式算子切分 | shard(in_strategy, out_strategy, device="Ascend", level=0)
在Cell类中 | 设置cell的输入及输出张量的切分策略,其余算子的并行策略由切分策略传播得到。 in_strategy/out_strategy指定输入/输出张量的切分策略,device指定执行设备,level指定切分策略传播算法的模式。 | 在PyNative模式下指定某个cell实例以图模式执行,并且依据指定的输入输出切分策略进行算子级别的模型并行, 其余的部分仍以PyNative模式执行数据并行。 |
+| | ops.shard(fn, in_strategy, out_strategy, device="Ascend", level=0) | 传入的fn为cell实例或函数,其余输入和shard相同,返回值为函数,再调用此函数时,会以图模式执行算子级别的模型并行。 | 此用法可以指定某个函数进行算子级别的模型并行,具体功能和cell的shard方法相同。|
\ No newline at end of file
diff --git a/docs/mindspore/source_zh_cn/design/glossary.md b/docs/mindspore/source_zh_cn/design/glossary.md
index 2332767dcce57956cd281606d4be323193234001..b024bc8bd5784633e32e4161e84b19f32904bb83 100644
--- a/docs/mindspore/source_zh_cn/design/glossary.md
+++ b/docs/mindspore/source_zh_cn/design/glossary.md
@@ -65,8 +65,8 @@
| TFRecord | Tensorflow定义的数据格式。 |
| Tensor | 张量,存储多维数组的数据结构。最常见的是标量、向量或矩阵。 |
| 广播 | 在矩阵数学运算中,是将操作数的shape扩展到与该运算兼容的维。在分布式并行中,是某卡上的参数同步到其他卡上。 |
-| 计算图下沉 | 计算图整图下沉到Device上执行,减少Host-Device交互开销。详见[on-device执行](https://www.mindspore.cn/docs/zh-CN/master/design/on_device.html#计算图下沉) |
-| 循环下沉 | 在On Device执行的基础上的优化,目的是进一步减少Host侧和Device侧之间的交互次数。详见[on-device执行](https://www.mindspore.cn/docs/zh-CN/master/design/on_device.html) |
-| 数据下沉 | 下沉即数据通过通道直接传送到Device上。详见[on-device执行](https://www.mindspore.cn/docs/zh-CN/master/design/on_device.html#数据下沉) |
+| 计算图下沉 | 计算图整图下沉到Device上执行,减少Host-Device交互开销。详见[on-device执行](https://www.mindspore.cn/docs/zh-CN/master/design/on_device.html#计算图下沉)。 |
+| 循环下沉 | 在On Device执行的基础上的优化,目的是进一步减少Host侧和Device侧之间的交互次数。详见[on-device执行](https://www.mindspore.cn/docs/zh-CN/master/design/on_device.html)。 |
+| 数据下沉 | 下沉即数据通过通道直接传送到Device上。详见[on-device执行](https://www.mindspore.cn/docs/zh-CN/master/design/on_device.html#数据下沉)。 |
| 图模式 | 又称静态图模式,将神经网络模型编译成一整张图,然后下发执行。该模式利用图优化等技术提高运行性能,同时有助于规模部署和跨平台运行。 |
| PyNative模式 | 动态图模式,将神经网络中的各个算子逐一下发执行,方便用户编写和调试神经网络模型。 |
\ No newline at end of file
diff --git a/docs/mindspore/source_zh_cn/design/pipeline_parallel.md b/docs/mindspore/source_zh_cn/design/pipeline_parallel.md
index 5fe013bfafa63f6e2c918508ec7a3f8c01c638af..91087c1f3f0c663b4f7554153e6453384fdf5cfb 100644
--- a/docs/mindspore/source_zh_cn/design/pipeline_parallel.md
+++ b/docs/mindspore/source_zh_cn/design/pipeline_parallel.md
@@ -4,8 +4,7 @@
## 概述
-近年来,神经网络的规模几乎是呈指数型增长。受单卡内存的限制,训练这些大模型用到的设备数量也在不断增加。受server间通信带宽低的影响,传统数据并行叠加模型并行的这种混合并行模式的性能表现欠佳,需要引入流水线并行。流水线并行能够将模型在空间上按`stage`
-进行切分,每个`stage`只需执行网络的一部分,大大节省了内存开销,同时缩小了通信域,缩短了通信时间。MindSpore能够根据用户的配置,将单机模型自动地转换成流水线并行模式去执行。
+近年来,神经网络的规模几乎是呈指数型增长。受单卡内存的限制,训练这些大模型用到的设备数量也在不断增加。受server间通信带宽低的影响,传统数据并行叠加模型并行的这种混合并行模式的性能表现欠佳,需要引入流水线并行。流水线并行能够将模型在空间上按`stage`进行切分,每个`stage`只需执行网络的一部分,大大节省了内存开销,同时缩小了通信域,缩短了通信时间。MindSpore能够根据用户的配置,将单机模型自动地转换成流水线并行模式去执行。
## 基本原理
@@ -121,9 +120,7 @@ class ResNet(nn.Cell):
- 目前流水线并行只支持`SEMI_AUTO_PARALLEL`模式,数据集要以`full_batch`模式导入。
- 需要定义LossCell,本例中调用了`nn.WithLossCell`接口。
- 目前流水线并行不支持自动混合精度特性。
-- 最后,需要在LossCell外包一层`PipelineCell`
- ,并指定MicroBatch的size。为了提升机器的利用率,MindSpore将MiniBatch切分成了更细粒度的MicroBatch,最终的loss则是所有MicroBatch计算的loss值累加。其中,MicroBatch的size必须大于等于`stage`
- 的数量。
+- 最后,需要在LossCell外包一层`PipelineCell`,并指定MicroBatch的size。为了提升机器的利用率,MindSpore将MiniBatch切分成了更细粒度的MicroBatch,最终的loss则是所有MicroBatch计算的loss值累加。其中,MicroBatch的size必须大于等于`stage`的数量。
```python
from mindspore import Model, nn, ParallelMode, set_auto_parallel_context
@@ -151,9 +148,7 @@ def test_train_cifar(epoch_size=10):
### 运行单机八卡脚本
-利用样例代码,
-
-Ascend可以用以下命令运行8卡,2个stage的流水线训练:
+利用样例代码,Ascend可以用以下命令运行8卡,2个stage的流水线训练:
```bash
bash run_pipeline.sh [DATA_PATH] Ascend