From 3a8cf0d1ec2128e6f2cb940189cc9c677e73428a Mon Sep 17 00:00:00 2001
From: lzy0920232
Date: Sun, 29 Jun 2025 17:24:33 +0800
Subject: [PATCH] code_docs_parallel_en

---
 .../source_en/feature/parallel_training.md | 116 +++++++++++++++---
 1 file changed, 101 insertions(+), 15 deletions(-)

diff --git a/docs/mindformers/docs/source_en/feature/parallel_training.md b/docs/mindformers/docs/source_en/feature/parallel_training.md
index 14d19251ed..7dbcab2e7e 100644
--- a/docs/mindformers/docs/source_en/feature/parallel_training.md
+++ b/docs/mindformers/docs/source_en/feature/parallel_training.md
@@ -23,19 +23,62 @@ In actual applications, different parallelism strategies apply to different scen

 MindSpore Transformers supports multiple parallelism features. You can use these features to optimize the training of different model architectures and hardware configurations. The following table outlines these parallelism features and provides links to the details in the MindSpore documentation.

-| **Parallelism Feature** | **Description** |
-|-----------------------------------|---------------------------------------------------------------------------------|
-| **[Data parallelism](https://www.mindspore.cn/tutorials/en/master/parallel/data_parallel.html)** | Splits data to multiple devices and trains the data on each device at the same time. This mode applies to training a simple model with a lot of data. |
-| **[Model parallelism](https://www.mindspore.cn/tutorials/en/master/parallel/operator_parallel.html)** | Distributes model parameters to multiple devices. This mode applies to the scenario where a single device cannot accommodate the entire model. |
-| **[Pipeline parallelism](https://www.mindspore.cn/tutorials/en/master/parallel/pipeline_parallel.html)** | Divides an ultra-large model into multiple phases with each running on different devices for efficient training. |
-| **[Optimizer parallelism](https://www.mindspore.cn/tutorials/en/master/parallel/optimizer_parallel.html)** | Distributes the optimizer computation to multiple devices to reduce memory usage and improve training efficiency. |
-| **Sequence parallelism** | Designed to share the memory and computation that cannot be sliced by model parallel, the inputs of LayerNorm and Dropout in the Transformer layer are sliced according to the sequence dimension to reduce the memory pressure on a single device. |
-| **[Long sequence parallelism](#long-sequence-parallelism)** | Slices all inputs and output activations by sequence to further reduce the GPU memory usage of the model for processing long sequence inputs.|
-| **[Multi-copy parallelism](https://www.mindspore.cn/docs/en/master/features/parallel/pipeline_parallel.html#mindspore-interleaved-pipeline-scheduler)** | Implements fine-grained parallel control among multiple copies to optimize performance and resource utilization. This mode is suitable for efficient training of models with large specifications. |
+### Data Parallelism

-For details about how to configure distributed parallel parameters, see [MindSpore Transformers Configuration Description](https://www.mindspore.cn/mindformers/docs/en/dev/feature/configuration.html).
+In data parallelism, each device (worker) holds a complete copy of the model weights. The input data is split into slices that are distributed to different computing devices, and each device performs forward and backward propagation on its local slice. After backward propagation is completed, the gradients computed on all devices are aggregated through a global reduction (AllReduce) operation to keep the model parameters consistent across devices. Because communication is required only once per iteration, during the gradient update, data parallelism achieves optimal performance when multiple data streams are trained simultaneously, but it does not reduce memory usage. It is suitable for scenarios with a large data volume and a small model size. For the framework-side implementation of data parallelism, refer to [MindSpore Data Parallelism](https://www.mindspore.cn/docs/en/master/features/parallel/data_parallel.html).

-## Introduction to Parallel Characterization
+MindSpore Transformers supports data parallelism, which can be enabled with the following configuration items:
+
+```yaml
+parallel_config:
+  ...
+  data_parallel: 2
+  ...
+```
+
+Parameter Descriptions:
+
+- data_parallel: The number of data-parallel shards. The default value is 1, and it is configured based on user requirements.
+
+For the configuration method of distributed parallel parameters, see the parallel configuration section in the [MindSpore Transformers Configuration Instructions](https://www.mindspore.cn/mindformers/docs/en/dev/feature/configuration.html).
+
+### Model Parallelism
+
+In data parallel training, each device stores all model parameters, leading to high memory usage, which may become a bottleneck when the model size is large. Model parallelism splits the entire model and distributes it across an array of devices, so each device maintains only a portion of the model's weights. Each device computes its own part of the network in parallel, and communication is required at positions such as LayerNorm. This approach is the most memory-efficient but involves significant communication. Model parallelism is suitable for scenarios where the model size is large and a single device cannot accommodate the entire model. For the framework-side implementation of model parallelism, refer to [MindSpore Model Parallelism](https://www.mindspore.cn/docs/en/master/features/parallel/operator_parallel.html).
+
+MindSpore Transformers supports model parallelism, which can be enabled with the following configuration items:
+
+```yaml
+parallel_config:
+  ...
+  model_parallel: 2
+  ...
+```
+
+Parameter Descriptions:
+
+- model_parallel: The number of model-parallel shards. The default value is 1, and it is configured based on user requirements.
+
+For the configuration method of distributed parallel parameters, see the parallel configuration section in the [MindSpore Transformers Configuration Instructions](https://www.mindspore.cn/mindformers/docs/en/dev/feature/configuration.html).
+
+### Sequence Parallelism
+
+Sequence parallelism is designed to share the memory and computation that cannot be sliced by model parallelism: the inputs of LayerNorm and Dropout in the Transformer layer are sliced along the sequence dimension, reducing the memory pressure on a single device.
+
+MindSpore Transformers supports sequence parallelism, which can be enabled with the following configuration items:
+
+```yaml
+parallel_config:
+  ...
+  use_seq_parallel: True
+  ...
+```
+
+Parameter Descriptions:
+
+- use_seq_parallel: Whether to enable sequence parallelism. The default value is False.
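+
+The options above can be combined in a single `parallel_config` block. The following sketch is illustrative only: the shard numbers are hypothetical and need to fit the devices actually available in the cluster, and sequence parallelism is shown together with model parallelism because it slices the LayerNorm and Dropout inputs that model parallelism leaves unsliced.
+
+```yaml
+# Illustrative combination of the options described above (hypothetical values).
+parallel_config:
+  data_parallel: 4        # four data-parallel groups (hypothetical)
+  model_parallel: 2       # within each group, the weights are sliced across two devices (hypothetical)
+  use_seq_parallel: True  # additionally slice LayerNorm/Dropout inputs along the sequence dimension
+```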
+
+For the configuration method of distributed parallel parameters, see the parallel configuration section in the [MindSpore Transformers Configuration Instructions](https://www.mindspore.cn/mindformers/docs/en/dev/feature/configuration.html).

 ### Long Sequence Parallelism
@@ -125,11 +168,11 @@ For configuration method of distributed parallel parameters, refer to the conten

 ### Pipeline Parallelism

-#### Sequence Pipeline Parallelism (Seq-Pipe)
+#### Multi-pipeline Interleaved Parallelism

-The model inputs are segmented along the sequence dimension and unfolded into multiple sequence chunks. In the original 1F1B (One Forward One Backward) and 1F1B-Interleave methods, the scheduling unit is reduced to a Sequence Chunk. `seq_split_num` represents the number of Sequence Chunk; when `seq_split_num`=1, it degrades to 1F1B or 1F1B-Interleave.
+Multi-pipeline interleaved parallelism reduces pipeline bubbles through data interleaving, inter-layer interleaving, and forward-backward interleaving. By configuring the pipeline scheduling policy, the model input is split along the sequence dimension and expanded into multiple sequence chunks. Based on the original 1F1B (One Forward One Backward) and 1F1B-Interleave methods, the scheduling unit is reduced to a sequence chunk. `seq_split_num` is the number of splits; when `seq_split_num`=1, it degenerates to 1F1B or 1F1B-Interleave. When global_batch_size is large and the pipeline bubble is large, this scheduling can significantly reduce the idle time of the cluster, at the cost of higher memory usage and additional communication. For more information about the framework-side implementation of pipeline parallelism, see [MindSpore Pipeline Parallelism](https://www.mindspore.cn/docs/en/master/features/parallel/pipeline_parallel.html).

-MindSpore Transformers supports configuring the Seq-Pipe pipeline parallelism, which can be enabled through the following configuration items:
+MindSpore Transformers supports multi-pipeline interleaved parallelism, which can be enabled with the following configuration items:

 ```yaml
 # parallel context
@@ -145,7 +188,8 @@ parallel_config:

 Parameter Descriptions:

-- pipeline_scheduler: The scheduling strategy for the pipeline, currently, mindspore transformers only supports setting this to `"seqpipe"`.
+- pipeline_interleave: Whether to enable multi-pipeline interleaved parallelism.
+- pipeline_scheduler: The scheduling policy of the pipeline. Currently, MindSpore Transformers only supports setting it to `"seqpipe"`.
 - seq_split_num: The number of Sequence Chunk which splits along the sequence dimension of the input.

 Notes:
@@ -155,6 +199,48 @@ Notes:

 For more information on configuring distributed parallel parameters, see the [MindSpore Transformers configuration description](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/configuration.html), specifically the section on parallel configuration.

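+Putting the options above together, the sketch below gathers the keys described in this section in one place. It is illustrative only: the exact nesting (a `pipeline_config` section under the parallel context and `seq_split_num` under `parallel_config`) and the value `seq_split_num: 2` are assumptions and should be checked against the configuration description linked above.
+
+```yaml
+# Illustrative sketch of the multi-pipeline interleaved (seqpipe) settings; nesting and values are assumptions.
+parallel:
+  pipeline_config:
+    pipeline_interleave: True       # enable multi-pipeline interleaved parallelism
+    pipeline_scheduler: "seqpipe"   # currently the only supported scheduling policy
+
+parallel_config:
+  seq_split_num: 2                  # number of sequence chunks; 1 degenerates to 1F1B / 1F1B-Interleave
+```
+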
+### Optimizer Parallelism
+
+During data parallel training, the model parameter update is computed redundantly on every card. Optimizer parallelism distributes the optimizer computation across the cards in the data parallel dimension, which effectively reduces memory consumption and improves network performance on large-scale networks. For the framework-side implementation of optimizer parallelism, refer to [MindSpore Optimizer Parallelism](https://www.mindspore.cn/docs/en/master/features/parallel/optimizer_parallel.html).
+
+MindSpore Transformers supports optimizer parallelism, which can be enabled with the following configuration items:
+
+```yaml
+parallel:
+  ...
+  enable_parallel_optimizer: True
+  ...
+```
+
+Parameter Descriptions:
+
+- enable_parallel_optimizer: Whether to enable optimizer parallelism. The default value is False.
+
+For more information on configuring distributed parallel parameters, see the [MindSpore Transformers configuration description](https://www.mindspore.cn/mindformers/docs/en/dev/feature/configuration.html), specifically the section on parallel configuration.
+
+### Multi-replica Parallelism
+
+Multi-replica parallelism implements fine-grained parallel control among multiple replicas to optimize performance and resource utilization, and is suitable for efficient training of large-scale models. For more information about the framework-side implementation of multi-replica parallelism, see [MindSpore Multi-replica Parallelism](https://www.mindspore.cn/docs/en/master/features/parallel/pipeline_parallel.html#interleaved-pipeline-scheduler).
+
+MindSpore Transformers supports multi-replica parallelism, which can be enabled with the following configuration items:
+
+```yaml
+model_config:
+  ...
+  fine_grain_interleave: 2
+  ...
+```
+
+Parameter Descriptions:
+
+- fine_grain_interleave: The number of fine-grained replicas.
+
+Notes:
+
+- Currently, only Llama and DeepSeek series models are supported.
+
+For more information on configuring distributed parallel parameters, see the [MindSpore Transformers configuration description](https://www.mindspore.cn/mindformers/docs/en/dev/feature/configuration.html), specifically the section on parallel configuration.
+
 ## MindSpore Transformers Distributed Parallel Application Practices

 In the [Llama3-70B fine-tuning configuration](https://gitee.com/mindspore/mindformers/blob/dev/research/llama3/llama3_70b/finetune_llama3_70b.yaml#) file provided on the official website, multiple distributed parallelism strategies are used to improve the training efficiency in the multi-node multi-device environment. The main parallelism strategies and key parameters involved in the configuration file are as follows:
--
Gitee