From c954094fd3667942ee8c164ad66620dceec4ef7c Mon Sep 17 00:00:00 2001 From: huan <3174348550@qq.com> Date: Thu, 14 Aug 2025 15:57:00 +0800 Subject: [PATCH] modify contents --- tutorials/source_en/parallel/dataset_slice.md | 2 +- tutorials/source_en/parallel/host_device_training.md | 3 ++- tutorials/source_zh_cn/parallel/comm_fusion.md | 6 +++--- tutorials/source_zh_cn/parallel/dataset_slice.md | 2 +- .../parallel/high_dimension_tensor_parallel.md | 4 ++-- tutorials/source_zh_cn/parallel/host_device_training.md | 9 +++++---- 6 files changed, 14 insertions(+), 12 deletions(-) diff --git a/tutorials/source_en/parallel/dataset_slice.md b/tutorials/source_en/parallel/dataset_slice.md index cecf49f750..a878165f18 100644 --- a/tutorials/source_en/parallel/dataset_slice.md +++ b/tutorials/source_en/parallel/dataset_slice.md @@ -4,7 +4,7 @@ ## Overview -When performing distributed training, taking image data as an example, when the size of a single image is too large, such as large-format images of remote sensing satellites, when an image is too large, it is necessary to slice the image and read a portion of each card to perform distributed training. Scenarios that deal with dataset slicing need to be combined with model parallelism to achieve the desired effect of reducing video memory, so this feature is provided based on automatic parallelism. The sample used in this tutorial is not a large-format network, and is intended as an example only. Real-life applications to large-format networks often require detailed design of parallel strategies. +When performing distributed training, taking image data as an example, when the size of a single image is too large, such as large-format images of remote sensing satellites, it is necessary to slice the image so that each card reads a portion of it for distributed training. 
Dataset slicing scenarios must be combined with model parallelism to achieve the expected reduction in device memory usage, so this feature is provided based on automatic parallelism. The sample used in this tutorial is not a large-format network and is intended as an example only. Real-world applications on large-format networks often require detailed design of parallel strategies. > Dataset sharding is not involved in data parallel mode. diff --git a/tutorials/source_en/parallel/host_device_training.md b/tutorials/source_en/parallel/host_device_training.md index 1f34264cb7..997d8643c1 100644 --- a/tutorials/source_en/parallel/host_device_training.md +++ b/tutorials/source_en/parallel/host_device_training.md @@ -10,7 +10,7 @@ In MindSpore, users can easily implement hybrid training by configuring trainabl ### Basic Principle -Pipeline parallel and operator-level parallel are suitable for the model to have a large number of operators, and the parameters are more evenly distributed among the operators. What if the number of operators in the model is small, and the parameters are concentrated in only a few operators? Wide & Deep is an example of this, as shown in the image below. The Embedding table in Wide & Deep can be trained as a parameter of hundreds of GIGabytes or even a few terabytes. If it is executed on an accelerator (device), the number of accelerators required is huge, and the training cost is expensive. On the other hand, if you use accelerator computing, the training acceleration obtained is limited, and it will also trigger cross-server traffic, and the end-to-end training efficiency will not be very high. +Pipeline parallel and operator-level parallel are suitable for scenarios where there are a large number of model operators and parameters are distributed evenly across the operators. If there are fewer model operators and parameters are concentrated in a small number of operators, a different strategy is required. 
Wide & Deep is an example of this, as shown in the image below. The Embedding table in Wide & Deep is a trainable parameter that can reach hundreds of gigabytes or even a few terabytes. If it is executed on an accelerator (device), the number of accelerators required is huge, and the training cost is expensive. On the other hand, if you use accelerator computing, the training acceleration obtained is limited, and it will also trigger cross-server traffic, so the end-to-end training efficiency will not be very high. ![image](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/tutorials/source_zh_cn/parallel/images/host_device_image_0_zh.png) @@ -69,6 +69,7 @@ The dataset is loaded and the data is parallelized consistently with the followi import os import mindspore as ms import mindspore.dataset as ds +from mindspore.communication import get_rank, get_group_size ms.set_seed(1) diff --git a/tutorials/source_zh_cn/parallel/comm_fusion.md b/tutorials/source_zh_cn/parallel/comm_fusion.md index f5bd3c2bfc..dca3d1344e 100644 --- a/tutorials/source_zh_cn/parallel/comm_fusion.md +++ b/tutorials/source_zh_cn/parallel/comm_fusion.md @@ -4,7 +4,7 @@ ## 简介 -在分布式并行训练场景下训练大规模参数量的模型(如GPT-3, Pangu-$\alpha$),跨设备甚至跨节点的数据传输是制约扩展性以及算力利用率的瓶颈[1]。通信融合是一种提升网络资源利用率、加速数据传输效率的重要方法,其将相同源节点和目的节点的通信算子打包同时执行,以避免多个单算子执行带来的额外开销。 +在分布式并行训练场景下训练大规模参数量的模型(如GPT-3、Pangu-$\alpha$),跨设备甚至跨节点的数据传输是制约扩展性以及算力利用率的瓶颈[1]。通信融合是一种提升网络资源利用率、加速数据传输效率的重要方法,其将相同源节点和目的节点的通信算子打包同时执行,以避免多个单算子执行带来的额外开销。 MindSpore支持对分布式训练中三种常用通信算子([AllReduce](https://www.mindspore.cn/docs/zh-CN/master/api_python/ops/mindspore.ops.AllReduce.html)、[AllGather](https://www.mindspore.cn/docs/zh-CN/master/api_python/ops/mindspore.ops.AllGather.html)、[ReduceScatter](https://www.mindspore.cn/docs/zh-CN/master/api_python/ops/mindspore.ops.ReduceScatter.html))的融合,并提供简洁易用的接口方便用户自行配置。在长稳训练任务支撑中,通信融合特性发挥了重要作用。 @@ -54,7 +54,7 @@ MindSpore提供两种接口来使能通信融合,下面分别进行介绍: 2.
利用`Cell`提供的接口 - 无论在哪种并行模式场景下,用户都可以通过[Cell.set_comm_fusion](https://www.mindspore.cn/docs/zh-CN/master/api_python/nn/mindspore.nn.Cell.html#mindspore.nn.Cell.set_comm_fusion)接口为模型某layer的参数设置index,MindSpore将融合相同index的参数所对应的通信算子。 + 无论在哪种并行模式场景下,用户都可以通过[Cell.set_comm_fusion](https://www.mindspore.cn/docs/zh-CN/master/api_python/nn/mindspore.nn.Cell.html#mindspore.nn.Cell.set_comm_fusion)接口为模型某个layer的参数设置index,MindSpore将融合相同index的参数所对应的通信算子。 ## 操作实践 @@ -91,7 +91,7 @@ net.comm_fusion(config={"allreduce": {"mode": "auto", "config": None}}) init() ``` -若将所有的同类通信算子融合成一个算子,在当前训练迭代中,传输需要等待计算完全结束后才能执行,这样会造成设备的等待。 +若将所有的同类通信算子融合成一个算子,在当前训练迭代中,需要等待计算完全结束后才能执行传输,这样会造成设备的等待。 为了避免上述问题,可以将网络参数进行分组融合:在下一组参数进行的计算的同时,进行上组参数的通信,使得计算和通信能够互相隐藏,可以通过限定fusion buffer的大小,或者index分区的方法进行分组融合。 diff --git a/tutorials/source_zh_cn/parallel/dataset_slice.md b/tutorials/source_zh_cn/parallel/dataset_slice.md index fcedd8e414..709c7135dd 100644 --- a/tutorials/source_zh_cn/parallel/dataset_slice.md +++ b/tutorials/source_zh_cn/parallel/dataset_slice.md @@ -4,7 +4,7 @@ ## 简介 -在进行分布式训练时,以图片数据为例,当单张图片的大小过大时,如遥感卫星等大幅面图片,当单张图片过大时,需要对图片进行切分,每张卡读取一部分图片,进行分布式训练。处理数据集切分的场景,需要配合模型并行一起才能达到预期的降低显存的效果,因此,基于自动并行提供了该项功能。本教程使用的样例不是大幅面的网络,仅作示例。真实应用到大幅面的网络时,往往需要详细设计并行策略。 +在进行分布式训练时,以图片数据为例,当单张图片的大小过大时,如遥感卫星等大幅面图片,需要对图片进行切分,每张卡读取一部分图片,进行分布式训练。处理数据集切分的场景,需要配合模型并行一起才能达到预期的降低显存的效果,因此,基于自动并行提供了该项功能。本教程使用的样例不是大幅面的网络,仅作示例。真实应用到大幅面的网络时,往往需要详细设计并行策略。 > 数据集切分在数据并行模式下不涉及。 diff --git a/tutorials/source_zh_cn/parallel/high_dimension_tensor_parallel.md b/tutorials/source_zh_cn/parallel/high_dimension_tensor_parallel.md index 104c534b22..3781617b1d 100644 --- a/tutorials/source_zh_cn/parallel/high_dimension_tensor_parallel.md +++ b/tutorials/source_zh_cn/parallel/high_dimension_tensor_parallel.md @@ -6,11 +6,11 @@ 大模型训练中,模型并行能够有效减少内存负荷,但其引入的通信是一个显著的性能瓶颈。因此需要优化整网模型切分策略以期引入最小的通信量。 -张量并行(Tensor Parallel,简称TP)训练是将一个张量沿特定维度分成 `N` 块,每个设备只持有整个张量的 
`1/N`,进行MatMul/BatchMatMul等算子计算,并引入额外通信保证最终结果正确。而高维张量并行则允许灵活控制对张量的切分次数和切分轴,支持1D、2D、3D切分。2D/3D切分相对与1D切分,在合适的切分策略下,通信量随着TP设备数增长更慢,在TP设备数较大时有着更低的额外通信量,达到提高训练速度的目的。 +张量并行(Tensor Parallel,简称TP)训练是将一个张量沿特定维度分成 `N` 块,每个设备只持有整个张量的 `1/N`,进行MatMul/BatchMatMul等算子计算,并引入额外通信保证最终结果正确。而高维张量并行则允许灵活控制对张量的切分次数和切分轴,支持1D、2D、3D切分。2D/3D切分相对于1D切分,在合适的切分策略下,通信量随着TP设备数增长更慢,在TP设备数较大时有着更低的额外通信量,达到提高训练速度的目的。 > 本特性支持的硬件平台为Ascend,需要在Graph模式、半自动并行下运行。 -使用场景:在半自动模式下,网络中存在张量并行,且训练卡数较多时(一般不少于8卡)时,对MatMul/BatchMatMul进行2D/3D张量并行策略配置,并适配上下游算子的切分策略,可获得训练性能提升。 +使用场景:在半自动模式下,网络中存在张量并行,且训练卡数较多时(一般不少于8卡),对MatMul/BatchMatMul进行2D/3D张量并行策略配置,并适配上下游算子的切分策略,可获得训练性能提升。 ### 原理 diff --git a/tutorials/source_zh_cn/parallel/host_device_training.md b/tutorials/source_zh_cn/parallel/host_device_training.md index 63386c8a08..9620757121 100644 --- a/tutorials/source_zh_cn/parallel/host_device_training.md +++ b/tutorials/source_zh_cn/parallel/host_device_training.md @@ -10,13 +10,13 @@ ### 基本原理 -流水线并行和算子级并行适用于模型的算子数量较大,同时参数较均匀的分布在各个算子中。如果模型中的算子数量较少,同时参数只集中在几个算子中呢?Wide&Deep就是这样的例子,如下图所示。Wide&Deep中的Embedding table作为需训练的参数可达几百GB甚至几TB,若放在加速器(device)上执行,那么所需的加速器数量巨大,训练费用昂贵。另一方面,若使用加速器计算,其获得的训练加速有限,同时会引发跨服务器的通信量,端到端的训练效率不会很高。 +流水线并行和算子级并行适用于模型算子数量较多,且参数较均匀地分布在各算子中的场景。若模型算子较少而参数集中在少数算子中,则需要采用不同策略。Wide & Deep 是典型例子,如下图所示。Wide&Deep中的Embedding table作为需训练的参数可达几百GB甚至几TB,若放在加速器(device)上执行,那么所需的加速器数量巨大,训练费用昂贵。另一方面,若使用加速器计算,其获得的训练加速有限,同时会引发跨服务器的通信量,端到端的训练效率不会很高。 ![image](./images/host_device_image_0_zh.png) *图:Wide&Deep模型的部分结构* -仔细分析Wide&Deep模型的特殊结构后可得:Embedding table虽然参数量巨大,但其参与的计算量很少,可以将Embedding table和其对应的算子EmbeddingLookup算子放置在Host端计算,其余算子放置在加速器端。这样做能够同时发挥Host端内存量大、加速器端计算快的特性,同时利用了同一台服务器的Host到加速器高带宽的特性。下图展示了Wide&Deep异构切分的方式: +仔细分析Wide&Deep模型的特殊结构后可得:Embedding table虽然参数量巨大,但其参与的计算量很少,可以将Embedding table和其对应的EmbeddingLookup算子放置在Host端计算,其余算子放置在加速器端。这样做能够同时发挥Host端内存量大、加速器端计算快的特性,同时利用了同一台服务器的Host到加速器高带宽的特性。下图展示了Wide&Deep异构切分的方式: ![image](./images/host_device_image_1_zh.png) @@ -69,6 +69,7 @@ init() import os import mindspore as ms import 
mindspore.dataset as ds +from mindspore.communication import get_rank, get_group_size ms.set_seed(1) @@ -93,7 +94,7 @@ data_set = create_dataset(32) ### 网络定义 -网络定义与单卡网络区别在于,配置[ops.Add()](https://www.mindspore.cn/docs/en/master/api_python/ops/mindspore.ops.Add.html)算子在主机端运行,代码如下: +网络定义与单卡网络区别在于,配置[ops.Add()](https://www.mindspore.cn/docs/zh-CN/master/api_python/ops/mindspore.ops.Add.html)算子在主机端运行,代码如下: ```python import mindspore as ms @@ -180,7 +181,7 @@ for epoch in range(5): bash run.sh ``` -训练完后,关于Loss部分结果保存在`log_output/worker_*.log`中,示例如下: +训练完成后,关于Loss部分结果保存在`log_output/worker_*.log`中,示例如下: ```text epoch: 0, step: 0, loss is 2.302936 -- Gitee