diff --git a/docs/note/source_en/community.rst b/docs/note/source_en/community.rst index a0743659bf7114e4c183ee3e57d0203901fdb2da..c5db7a7b34e8feef3b9b13ce2d42e230b445469c 100644 --- a/docs/note/source_en/community.rst +++ b/docs/note/source_en/community.rst @@ -4,9 +4,9 @@ Participate in MindSpore Community Contributing Code ----------------- -If you want to contribute code, please read https://gitee.com/mindspore/mindspore/blob/master/CONTRIBUTING.md . +If you want to contribute code, please read https://gitee.com/mindspore/mindspore/blob/r1.0/CONTRIBUTING.md . Contributing Documents ---------------------- -If you want to contribute documents, please read https://gitee.com/mindspore/docs/blob/master/CONTRIBUTING_DOC.md . \ No newline at end of file +If you want to contribute documents, please read https://gitee.com/mindspore/docs/blob/r1.0/CONTRIBUTING_DOC.md . \ No newline at end of file diff --git a/docs/note/source_en/design.rst b/docs/note/source_en/design.rst index 9373ac6ee09efe3f394a6b5680599050b5807910..f7b0d4649a80d518f0bdb2f581b5b489e86c8130 100644 --- a/docs/note/source_en/design.rst +++ b/docs/note/source_en/design.rst @@ -11,4 +11,4 @@ Design design/mindinsight/training_visual_design design/mindinsight/graph_visual_design design/mindinsight/tensor_visual_design - design/mindarmour/differential_privacy_design.md + design/mindarmour/differential_privacy_design diff --git a/docs/note/source_zh_cn/community.rst b/docs/note/source_zh_cn/community.rst index 04ffbb4cde98b2fdb3646e754840ef042f019292..221f9a3050729656055058ad731acc99123aea77 100644 --- a/docs/note/source_zh_cn/community.rst +++ b/docs/note/source_zh_cn/community.rst @@ -4,9 +4,9 @@ 贡献代码 ----------- -如何贡献代码,请参见链接 https://gitee.com/mindspore/mindspore/blob/master/CONTRIBUTING.md 。 +如何贡献代码,请参见链接。 贡献文档 ----------- -如何贡献文档,请参见链接 https://gitee.com/mindspore/docs/blob/master/CONTRIBUTING_DOC.md 。 \ No newline at end of file +如何贡献文档,请参见链接。 \ No newline at end of file diff --git a/docs/programming_guide/source_zh_cn/auto_parallel.md b/docs/programming_guide/source_zh_cn/auto_parallel.md index cc8cee2f718ac38ca7b222613db8401c836f55b4..a0c514462f2ac47694932d0eaa9da7ee36863d67 100644 --- a/docs/programming_guide/source_zh_cn/auto_parallel.md +++ b/docs/programming_guide/source_zh_cn/auto_parallel.md @@ -13,7 +13,6 @@ - [all_reduce_fusion_config](#all_reduce_fusion_config) - [自动并行配置](#自动并行配置) - [gradient_fp32_sync](#gradient_fp32_sync) - - [loss_repeated_mean](#loss_repeated_mean) - [auto_parallel_search_mode](#auto_parallel_search_mode) - [strategy_ckpt_load_file](#strategy_ckpt_load_file) - [strategy_ckpt_save_file](#strategy_ckpt_save_file) @@ -26,8 +25,11 @@ - [init](#init) - [get_group_size](#get_group_size) - [get_rank](#get_rank) - - [数据并行](#数据并行) - - [自动并行](#自动并行) + - [分布式属性配置](#分布式属性配置) + - [cross_batch](#cross_batch) + - [fusion](#fusion) + - [数据并行](#数据并行) + - [自动并行](#自动并行) @@ -44,7 +46,7 @@ MindSpore提供了分布式并行训练的功能,它支持了包括数据并 MindSpore的分布式并行配置通过`auto_parallel_context`来进行集中管理,用户可根据自身需求和实际情况来进行个性化的配置。这些配置可分为四大类: - 通用配置:对数据并行和自动并行均起作用的配置,如:`device_num`、`global_rank`。 -- 自动并行配置:仅在自动并行模式下起作用的配置,如:`gradient_fp32_sync`、`loss_repeated_mean`。 +- 自动并行配置:仅在自动并行模式下起作用的配置,如:`gradient_fp32_sync`。 - 数据并行配置:仅在数据并行模式下起作用的配置,如:`enable_parallel_optimizer`。 - 混合并行配置:仅在混合并行模式下起作用的配置,如:`layerwise_parallel`。 @@ -98,15 +100,21 @@ context.get_auto_parallel_context("gradients_mean") - `stand_alone`:单机模式。 - `data_parallel`:数据并行模式。 - `hybrid_parallel`:混合并行模式。 -- `semi_auto_parallel`:半自动并行模式,即用户可通过`set_strategy`方法给算子配置切分策略,若不配置策略,则默认是数据并行策略。 +- 
`semi_auto_parallel`:半自动并行模式,即用户可通过`shard`方法给算子配置切分策略,若不配置策略,则默认是数据并行策略。
 - `auto_parallel`:自动并行模式,即框架会自动建立代价模型,为用户选择最优的切分策略。
+其中`auto_parallel`和`data_parallel`在MindSpore教程中有完整样例:
+
+。
+
 代码样例如下:
 
 ```python
-from mindspore import context
+from mindspore import context
+from mindspore.ops import operations as P
 
-context.set_auto_parallel_context(parallel_mode="auto_parallel")
+context.set_auto_parallel_context(parallel_mode="semi_auto_parallel")
+mul = P.Mul().shard(((2, 1), (2, 1)))
 context.get_auto_parallel_context("parallel_mode")
 ```
 
@@ -141,22 +149,9 @@ context.set_auto_parallel_context(gradient_fp32_sync=False)
 context.get_auto_parallel_context("gradient_fp32_sync")
 ```
 
-#### loss_repeated_mean
-
-`loss_repeated_mean`表示在loss重复计算的场景下,反向是否进行均值操作,其值为bool类型,默认为True。loss存在重复计算的场景下,反向进行均值操作能使分布式逻辑和单机保持一致。但在某些场景下,不进行均值操作可能会使网络收敛的速度更快。因此,MindSpore提供`loss_repeated_mean`接口,让用户自由配置。
-
-代码样例如下:
-
-```python
-from mindspore import context
-
-context.set_auto_parallel_context(loss_repeated_mean=False)
-context.get_auto_parallel_context("loss_repeated_mean")
-```
-
 #### auto_parallel_search_mode
 
-MindSpore提供了`dynamic_programming`和`recursive_programming`两种搜索策略的算法。`dynamic_programming`能够搜索出代价模型刻画的最优策略,但在搜索巨大网络模型的并行策略时耗时较长;而`recursive_programming`能较快搜索出并行策略,但搜索出来的策略可能不是运行性能最优的。为此,MindSpore提供了参数,让用户自由选择搜索算法。
+MindSpore提供了`dynamic_programming`和`recursive_programming`两种搜索策略的算法。`dynamic_programming`能够搜索出代价模型刻画的最优策略,但在搜索巨大网络模型的并行策略时耗时较长;而`recursive_programming`能较快搜索出并行策略,但搜索出来的策略可能不是运行性能最优的。为此,MindSpore提供了参数,让用户自由选择搜索算法,默认是`dynamic_programming`。
 
 代码样例如下:
 
@@ -286,7 +281,33 @@ init()
 rank_id = get_rank()
 ```
 
-### 数据并行
+## 分布式属性配置
+
+### cross_batch
+
+在特定场景下,`data_parallel`的计算逻辑和`stand_alone`是不一样的,`auto_parallel`在任何场景下都是和`stand_alone`的计算逻辑保持一致。而`data_parallel`的收敛效果可能更好,因此MindSpore提供了`cross_batch`这个参数,可以使`auto_parallel`的计算逻辑和`data_parallel`保持一致,用户可通过`add_prim_attr`方法进行配置,默认值是False。
+
+代码样例如下:
+
+```python
+from mindspore.ops import operations as P
+
+mul = P.Mul().add_prim_attr("cross_batch", True)
+```
+
+### fusion
+
+出于性能考虑,MindSpore提供了通信算子融合功能,`fusion`值相同的同类通信算子会融合在一起,`fusion`值为0时,表示不融合。
+
+代码样例如下:
+
+```python
+from mindspore.ops import operations as P
+
+allreduce = P.AllReduce().add_prim_attr("fusion", 1)
+```
+
+## 数据并行
 
 数据并行是对数据进行切分的并行模式,一般按照batch维度切分,将数据分配到各个计算单元(worker)中,进行模型计算。在数据并行模式下,数据集要以数据并行的方式导入,并且`parallel_mode`要设置为`data_parallel`。
 
@@ -294,10 +315,11 @@ rank_id = get_rank()
 
 。
 
-### 自动并行
+## 自动并行
 
 自动并行是融合了数据并行、模型并行及混合并行的一种分布式并行模式,可以自动建立代价模型,为用户选择一种并行模式。其中,代价模型指基于内存的计算开销和通信开销对训练时间建模,并设计高效的算法找到训练时间较短的并行策略。在自动并行模式下,数据集也要以数据并行的方式导入,并且`parallel_mode`要设置为`auto_parallel`。
 
 具体用例请参考MindSpore分布式并行训练教程:
 
 。
+
diff --git a/tutorials/training/source_en/advanced_use/distributed_training_gpu.md b/tutorials/training/source_en/advanced_use/distributed_training_gpu.md
new file mode 100644
index 0000000000000000000000000000000000000000..66e074e17946e872283da4749d8ada1fd838e610
--- /dev/null
+++ b/tutorials/training/source_en/advanced_use/distributed_training_gpu.md
@@ -0,0 +1,148 @@
+# Distributed Parallel Training (GPU)
+
+`Linux` `GPU` `Model Training` `Intermediate` `Expert`
+
+
+
+- [Distributed Parallel Training (GPU)](#distributed-parallel-training-gpu)
+    - [Overview](#overview)
+    - [Preparation](#preparation)
+        - [Downloading the Dataset](#downloading-the-dataset)
+        - [Configuring Distributed Environment](#configuring-distributed-environment)
+        - [Calling the Collective Communication Library](#calling-the-collective-communication-library)
+    - [Defining the Network](#defining-the-network)
+    - [Running the 
Script](#running-the-script)
+    - [Running the Multi-Host Script](#running-the-multi-host-script)
+
+
+
+## Overview
+
+This tutorial describes how to train the ResNet-50 network using MindSpore data parallelism and automatic parallelism on the GPU hardware platform.
+
+## Preparation
+
+### Downloading the Dataset
+
+The `CIFAR-10` dataset is used as an example. The method of downloading and loading the dataset is the same as that for the Ascend 910 AI processor.
+
+> The method of downloading and loading the dataset:
+>
+> 
+
+### Configuring Distributed Environment
+
+- `OpenMPI-3.1.5`: multi-process communication library used by MindSpore.
+
+  > Download the OpenMPI-3.1.5 source code package `openmpi-3.1.5.tar.gz` from .
+  >
+  > For details about how to install OpenMPI, see the official tutorial: .
+
+- `NCCL-2.4.8`: Nvidia collective communication library.
+
+  > Download NCCL-2.4.8 from .
+  >
+  > For details about how to install NCCL, see the official tutorial: .
+
+- Password-free login between hosts (required for multi-host training). If multiple hosts are involved in the training, you need to configure password-free login between them. The procedure is as follows:
+  1. Ensure that the same user is used to log in to each host. (The root user is not recommended.)
+  2. Run the `ssh-keygen -t rsa -P ""` command to generate a key.
+  3. Run the `ssh-copy-id DEVICE-IP` command to copy the public key to the host that requires password-free login, where `DEVICE-IP` is the IP address of that host.
+  4. Run the `ssh DEVICE-IP` command. If you can log in without entering the password, the configuration is successful.
+  5. Run the preceding commands on all hosts to ensure that every two hosts can communicate with each other.
+
+### Calling the Collective Communication Library
+
+On the GPU hardware platform, MindSpore parallel distributed training uses NCCL for communication.
+
+> On the GPU platform, MindSpore does not support the following operations:
+>
+> `get_local_rank`, `get_local_size`, `get_world_rank_from_group_rank`, `get_group_rank_from_world_rank` and `create_group`
+
+The sample code for calling NCCL is as follows:
+
+```python
+from mindspore import context
+from mindspore.communication.management import init
+
+if __name__ == "__main__":
+    context.set_context(mode=context.GRAPH_MODE, device_target="GPU")
+    init("nccl")
+    ...
+```
+
+In the preceding information,
+
+- `mode=context.GRAPH_MODE`: sets the running mode to graph mode for distributed training. (The PyNative mode does not support parallel running.)
+- `init("nccl")`: enables NCCL communication and completes the distributed training initialization.
+
+## Defining the Network
+
+On the GPU hardware platform, the network definition is the same as that for the Ascend 910 AI processor.
+
+> For details about the definitions of the network, optimizer, and loss function, see .
+
+## Running the Script
+
+On the GPU hardware platform, MindSpore uses OpenMPI `mpirun` for distributed training. The following takes the distributed training script for eight devices as an example to describe how to run the script:
+
+> Obtain the running script of the example from:
+>
+> 
+>
+> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.
+
+```bash
+#!/bin/bash
+
+DATA_PATH=$1
+export DATA_PATH=${DATA_PATH}
+
+rm -rf device
+mkdir device
+cp ./resnet50_distributed_training.py ./resnet.py ./device
+cd ./device
+echo "start training"
+mpirun -n 8 pytest -s -v ./resnet50_distributed_training.py > train.log 2>&1 &
+```
+
+The script requires the variable `DATA_PATH`, which indicates the path of the dataset. In addition, you need to modify the `resnet50_distributed_training.py` file. Since the `DEVICE_ID` environment variable does not need to be set on the GPU, you do not need to call `int(os.getenv('DEVICE_ID'))` in the script to obtain the physical sequence number of the device, and `context` does not require `device_id`. You need to set `device_target` to `GPU` and call `init("nccl")` to enable the NCCL. The log file is saved in the `device` directory, and the loss results are saved in `train.log`. The output loss values of the `grep` command are as follows:
+
+```
+epoch: 1 step: 1, loss is 2.3025854
+epoch: 1 step: 1, loss is 2.3025854
+epoch: 1 step: 1, loss is 2.3025854
+epoch: 1 step: 1, loss is 2.3025854
+epoch: 1 step: 1, loss is 2.3025854
+epoch: 1 step: 1, loss is 2.3025854
+epoch: 1 step: 1, loss is 2.3025854
+epoch: 1 step: 1, loss is 2.3025854
+```
+
+## Running the Multi-Host Script
+
+If multiple hosts are involved in the training, you need to set the multi-host configuration in the `mpirun` command, using its `-H` option. For example, `mpirun -n 16 -H DEVICE1_IP:8,DEVICE2_IP:8 python hello.py` starts eight processes on each of the hosts whose IP addresses are DEVICE1_IP and DEVICE2_IP. Alternatively, you can create a hostfile similar to the following and pass its path to the `--hostfile` option of `mpirun`. Each line in the hostfile is in the format of `[hostname] slots=[slotnum]`, where hostname can be an IP address or a host name.
+
+```bash
+DEVICE1 slots=8
+DEVICE2 slots=8
+```
+
+The following is the execution script for the 16-device, two-host cluster. The variables `DATA_PATH` and `HOSTFILE` need to be passed in, indicating the dataset path and the hostfile path. For more `mpirun` options, see the OpenMPI official website.
+
+```bash
+#!/bin/bash
+
+DATA_PATH=$1
+HOSTFILE=$2
+
+rm -rf device
+mkdir device
+cp ./resnet50_distributed_training.py ./resnet.py ./device
+cd ./device
+echo "start training"
+mpirun -n 16 --hostfile $HOSTFILE -x DATA_PATH=$DATA_PATH -x PATH -mca pml ob1 pytest -s -v ./resnet50_distributed_training.py > train.log 2>&1 &
+```
+
+When running on GPU, the model parameters can be saved and loaded by referring to [Distributed Training Model Parameters Saving and Loading](https://www.mindspore.cn/tutorial/en/master/advanced_use/distributed_training_ascend.html#id12).
diff --git a/tutorials/training/source_en/advanced_use/distributed_training_tutorials.rst b/tutorials/training/source_en/advanced_use/distributed_training_tutorials.rst
index 84fcdf2fc5b56b422593c2711fdb581b381b8407..ce35667fe27e45fff70d9cdaf0ae69faf07e431f 100644
--- a/tutorials/training/source_en/advanced_use/distributed_training_tutorials.rst
+++ b/tutorials/training/source_en/advanced_use/distributed_training_tutorials.rst
@@ -17,6 +17,7 @@ MindSpore also provides the parallel distributed training function. 
It supports :maxdepth: 1 distributed_training_ascend + distributed_training_gpu apply_host_device_training apply_parameter_server_training save_load_model_hybrid_parallel diff --git a/tutorials/training/source_en/index.rst b/tutorials/training/source_en/index.rst index b6dde45805be5b7720672c923a7ca0615759c695..4c96f7287323f911e79358eca4885f330ed278da 100644 --- a/tutorials/training/source_en/index.rst +++ b/tutorials/training/source_en/index.rst @@ -36,7 +36,7 @@ Train with MindSpore .. toctree:: :glob: :maxdepth: 1 - :caption: + :caption: Build Networks advanced_use/custom_operator advanced_use/migrate_script @@ -83,6 +83,6 @@ Train with MindSpore advanced_use/computer_vision_application advanced_use/nlp_application - advanced_use/synchronization_training_and_evaluation.md - advanced_use/optimize_the_performance_of_data_preparation.md + advanced_use/synchronization_training_and_evaluation + advanced_use/optimize_the_performance_of_data_preparation
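For illustration, the snippet below sketches how the pieces added in `distributed_training_gpu.md` fit together in a single training script: NCCL initialization, data-parallel context configuration, and dataset sharding. It is not part of the patch itself; the dataset path is a placeholder, and the network, loss, optimizer, and `Model` setup are assumed to follow the Ascend tutorial referenced above.

```python
# Illustrative sketch of the GPU-specific setup described in distributed_training_gpu.md.
# The dataset path is a placeholder; the rest of the training pipeline is assumed to
# match the Ascend ResNet-50 tutorial.
import mindspore.dataset as ds
from mindspore import context
from mindspore.communication.management import init, get_rank, get_group_size

if __name__ == "__main__":
    # Graph mode is required for distributed training; device_id is not set on GPU.
    context.set_context(mode=context.GRAPH_MODE, device_target="GPU")
    init("nccl")  # initialize NCCL-based communication

    # Data-parallel training with gradient averaging across all devices.
    context.set_auto_parallel_context(parallel_mode="data_parallel", gradients_mean=True)

    # Shard the CIFAR-10 dataset so that each mpirun-launched process reads a distinct slice.
    rank_id = get_rank()
    rank_size = get_group_size()
    data_set = ds.Cifar10Dataset("/path/to/cifar-10-batches-bin",  # placeholder path
                                 num_shards=rank_size, shard_id=rank_id)
    # ... define ResNet-50, loss, optimizer, and Model as in the Ascend tutorial,
    # then call model.train(...) ...
```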
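Similarly, a hedged sketch of saving model parameters during the GPU run, as referenced by the final link in the new tutorial. The per-rank directory naming and the `CheckpointConfig` arguments are illustrative choices rather than values taken from the patch, and `init("nccl")` is assumed to have been called already.

```python
# Illustrative sketch: saving checkpoints per rank during data-parallel training on GPU.
# Directory naming and checkpoint settings are example choices, not prescribed values.
import os
from mindspore.communication.management import get_rank
from mindspore.train.callback import ModelCheckpoint, CheckpointConfig, LossMonitor

ckpt_dir = "./ckpt_rank_{}".format(get_rank())  # one checkpoint directory per process
os.makedirs(ckpt_dir, exist_ok=True)

# integrated_save=True merges sliced parameters before saving, so each checkpoint
# can later be loaded like a standalone model.
config = CheckpointConfig(save_checkpoint_steps=100, integrated_save=True)
ckpt_cb = ModelCheckpoint(prefix="resnet50", directory=ckpt_dir, config=config)

# Pass the callbacks to model.train(), for example:
# model.train(epoch=10, train_dataset=data_set, callbacks=[ckpt_cb, LossMonitor()])
```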