From c556e93e754c70cad87f98a4e09c2e09f625378e Mon Sep 17 00:00:00 2001
From: huanxiaoling <3174348550@qq.com>
Date: Mon, 7 Nov 2022 17:15:40 +0800
Subject: [PATCH] update en file in parallel

---
 tutorials/experts/source_en/index.rst        |   1 +
 .../source_en/parallel/fault_recover.md      | 123 ++++++++++++++++++
 .../source_zh_cn/parallel/fault_recover.md   |   4 +-
 3 files changed, 126 insertions(+), 2 deletions(-)
 create mode 100644 tutorials/experts/source_en/parallel/fault_recover.md

diff --git a/tutorials/experts/source_en/index.rst b/tutorials/experts/source_en/index.rst
index 33f52f594f..548ab29a51 100644
--- a/tutorials/experts/source_en/index.rst
+++ b/tutorials/experts/source_en/index.rst
@@ -75,6 +75,7 @@ For Experts
    parallel/introduction
    parallel/distributed_case
    parallel/distributed_inference
+   parallel/fault_recover
    parallel/save_load
    parallel/multi_dimensional
    parallel/other_features

diff --git a/tutorials/experts/source_en/parallel/fault_recover.md b/tutorials/experts/source_en/parallel/fault_recover.md
new file mode 100644
index 0000000000..820b4dbd12
--- /dev/null
+++ b/tutorials/experts/source_en/parallel/fault_recover.md
@@ -0,0 +1,123 @@
# Distributed Fault Recovery

## Overview

It is very common to encounter failures when performing distributed training. Similar to single-card training, training can be resumed by loading the weight information saved during training. Unlike pure data parallel training, once model parallelism is applied the weights are sliced, and the weight information saved on different cards may not be consistent.
To solve this problem, one option is to aggregate the weights through the [AllGather](https://www.mindspore.cn/tutorials/experts/en/master/parallel/communicate_ops.html#allgather) operator before saving the weight checkpoint file, so that each card stores complete information about the weights. This function has been introduced in [Saving and Loading Distributed Training Model Parameters](https://www.mindspore.cn/tutorials/experts/en/master/parallel/train_ascend.html#saving-and-loading-distributed-training-model-parameters).
However, for large models the overhead of aggregated saving is too large for all kinds of resources, so this document presents a recovery scheme in which each card saves only its own weight information. For large models, data parallelism and model parallelism are often applied together, and the devices partitioned along the data-parallel dimension hold exactly the same weight information, which provides a redundant backup for large models. This document also points out how to obtain this redundant information.
The relationship between the parallel strategy and the weight slicing can be mapped as follows. For the concepts of data parallelism and model parallelism, please refer to [Distributed Training](https://www.mindspore.cn/tutorials/experts/en/master/parallel/train_ascend.html). For optimizer parallelism, please refer to [Optimizer Parallelism](https://www.mindspore.cn/tutorials/experts/en/master/parallel/optimizer_parallel.html).

- Data parallelism + optimizer parallelism disabled: the ranks in the parallel communication domain hold the same weight slice.
- Model parallelism: the ranks in the parallel communication domain hold different weight slices.
- Data parallelism + optimizer parallelism enabled, with the number of optimizer-parallel shards smaller than the number of data-parallel shards: within the parallel communication domain, the ranks inside each communication domain sliced by the optimizer hold different weight slices, while the communication domains sliced by the optimizer hold the same weight slices as one another.

It should also be noted that the distributed fault recovery scheme introduced in this document needs to be used in sink mode. This document takes the distributed parallel training of the Transformer model as an example to introduce the scheme. For detailed information about the Transformer, please refer to the [Distributed Parallel Training Transformer Model](https://www.mindspore.cn/tutorials/experts/en/master/parallel/transformer.html) tutorial.

> Download the complete sample code here: [distributed_training_transformer](https://gitee.com/mindspore/docs/tree/master/docs/sample_code/distributed_training_transformer)

The directory structure is as follows:

```text
└─sample_code
    ├─distribute_training_transformer
        ├── dataset.py
        ├── model.py
        ├── rank_table_8pcs.json
        ├── run_parallel_save_ckpt.sh
        ├── run_parallel_recover_ckpt.sh
        ├── parallel_save_ckpt_train.py
        └── parallel_recover_train.py
```

## Saving the Weight Slices

To save the sliced weight information, simply set `integrated_save` to `False` in `CheckpointConfig`. In addition, set the environment variable `GROUP_INFO_FILE` to store the redundancy information of the weights.

```bash
export GROUP_INFO_FILE=./group_info.pb
```

The code for saving the weights is as follows. Note that training is configured to sink mode by setting `dataset_sink_mode` to `True`.

```python
import mindspore as ms
from mindspore.train import CheckpointConfig, ModelCheckpoint
from mindspore.nn import PipelineCell

def train():
    # model create
    # checkpoint save
    # integrated_save=False makes each rank save only the weight slice it holds
    ckpt_config = CheckpointConfig(save_checkpoint_steps=callback_size, keep_checkpoint_max=4,
                                   integrated_save=False)
    ckpoint_cb = ModelCheckpoint(prefix="test", config=ckpt_config)
    callback = [ckpoint_cb]
    model.train(4, dataset, callbacks=callback, dataset_sink_mode=True)
```

## Loading Weights to Continue Training

After the weight slices are saved in the previous step, the following files can be found in the directory produced by training, taking the card-0 directory as an example.

```text
└─ckpt_dir0
    ├── group_info.pb
    ├── test-1_77.ckpt
    └── train.log0
```

In train.log0, you can see the loss values printed during training, similar to the following.

```text
epoch: 1 step: 77, loss is 7.187697
epoch: 1 step: 77, loss is 6.612632
epoch: 1 step: 77, loss is 6.393444
epoch: 1 step: 77, loss is 6.271424
```

Reading group_info.pb gives the redundancy information of the weights. Parsing the file yields a list of rank_id values, and the weight slices corresponding to the rank_ids in this list are identical and can replace one another.
In the parsing example below, the group_info.pb of card 0 yields [0, 4], which means that the weight slices of card 0 and card 4 are exactly the same. When the card-0 checkpoint is lost, the card-4 checkpoint can be copied directly as the card-0 checkpoint to recover it.
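
To illustrate the copy step only, here is a minimal shell sketch; the directory name ckpt_dir4 and the checkpoint file name are assumptions that follow the example layout above and may differ in an actual run.

```bash
# Hypothetical recovery step: rank 4 holds the same weight slice as rank 0,
# so its checkpoint can stand in for the lost card-0 checkpoint.
# Adjust the directory and file names to match your actual checkpoint layout.
cp ./ckpt_dir4/test-1_77.ckpt ./ckpt_dir0/test-1_77.ckpt
```

The redundant rank list itself is obtained by parsing group_info.pb, as in the following example.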
+ +```python +import mindspore as ms +rank_list = ms.restore_group_info_list("./ckpt_dir0/group_info.pb") +print(rank_list) // [0, 4] +``` + +Distributed fault recovery requires prior access to the slicing scores, thus, it is necessary to first call [model.build](https://www.mindspore.cn/docs/zh-CN/master/api_python/train/mindspore.train.Model.html# mindspore.train.model.build) to compile and then perform the training. + +```python +import os +import mindspore as ms +def recover_train(): + # model create + # checkpoint load + if args_opt.ckpt_file: + param_dict = ms.load_checkpoint(args_opt.ckpt_file) + model.build(train_dataset=dataset, epoch=4) + ms.load_param_into_net(net, param_dict) + model.train(2, dataset, callbacks=callback, dataset_sink_mode=True) +``` + +## Running the Code + +First, please refer to the [Preparation Session](https://www.mindspore.cn/tutorials/experts/en/master/parallel/transformer.html#Preparation) in the Distributed Parallel Training Transformer Model tutorial to prepare the dataset. +After entering the code directory, execute the training script that saves the slice weights. + +```bash +bash run_parallel_save_ckpt.sh DATASET_PATH +``` + +Then, the fault recovery training script is executed. + +```bash +bash run_parallel_recover_ckpt.sh DATASET_PATH +``` + +After the recovery training, the loss is as follows. You can see that the loss starts to drop directly from 6.465892, indicating that the loading is successful. + +```text +epoch: 1 step: 77, loss is 6.465892 +epoch: 1 step: 77, loss is 6.239279 +``` diff --git a/tutorials/experts/source_zh_cn/parallel/fault_recover.md b/tutorials/experts/source_zh_cn/parallel/fault_recover.md index c7316df180..43783fb040 100644 --- a/tutorials/experts/source_zh_cn/parallel/fault_recover.md +++ b/tutorials/experts/source_zh_cn/parallel/fault_recover.md @@ -7,7 +7,7 @@ 在进行分布式训练时,遇到故障是非常普遍的,类似于单卡训练,可以通过加载训练过程中保存的权重信息继续进行训练。区别于纯数据并行训练,当应用了模型并行后,权重是进行了切分的,卡与卡之间保存的权重信息可能不一致。 为了解决这个问题,一个方案是在保存权重checkpoint文件前,就将权重通过[AllGather](https://www.mindspore.cn/tutorials/experts/zh-CN/master/parallel/communicate_ops.html#allgather) 算子进行汇聚,每张卡均存储一个完整的权重信息,这一个功能在[分布式训练模型参数保存和加载](https://www.mindspore.cn/tutorials/experts/zh-CN/master/parallel/train_ascend.html#分布式训练模型参数保存和加载) 中已经介绍了。 但是,对于大模型来说,使用汇聚保存对各种资源的开销都过于巨大,因此,本文档介绍的是每张卡仅仅保存自身的权重信息的恢复方案。对于大模型来说,往往会同时应用上数据并行与模型并行,而数据并行的维度所划分的设备,它们持有的权重信息是完全一致的,这也为大模型提供了冗余的备份,本文档也将指出如何去获取这个冗余信息。 -关于并行策略与权重的切片划分的关系,可以进行如下映射。关于数据并行,模型并行的概念,请参考[分布式训练](https://www.mindspore.cn/tutorials/experts/zh-CN/master/parallel/train_ascend.html) 、关于优化器并行,请参考[优化器并行](https://www.mindspore.cn/tutorials/experts/zh-CN/master/parallel/optimizer_parallel.html) 。 +关于并行策略与权重的切片划分的关系,可以进行如下映射。关于数据并行,模型并行的概念,请参考[分布式训练](https://www.mindspore.cn/tutorials/experts/zh-CN/master/parallel/train_ascend.html) 、关于优化器并行,请参考[优化器并行](https://www.mindspore.cn/tutorials/experts/zh-CN/master/parallel/optimizer_parallel.html)。 - 数据并行 + 不开启优化器并行:并行通信域内的rank持有相同权重切片。 - 模型并行:并行通信域内的rank持有不同权重切片。 @@ -86,7 +86,7 @@ rank_list = ms.restore_group_info_list("./ckpt_dir0/group_info.pb") print(rank_list) // [0, 4] ``` -分布式的故障恢复,需要事先获取切分的信息,因而,需要先调用[model.build](https://www.mindspore.cn/docs/zh-CN/master/api_python/train/mindspore.train.Model.html#mindspore.train.Model.build) 进行编译, 继而再执行训练。 +分布式的故障恢复,需要事先获取切分的信息,因而,需要先调用[model.build](https://www.mindspore.cn/docs/zh-CN/master/api_python/train/mindspore.train.Model.html#mindspore.train.Model.build) 进行编译,继而再执行训练。 ```python import os -- Gitee