diff --git a/tutorials/source_en/beginner/accelerate_with_static_graph.md b/tutorials/source_en/beginner/accelerate_with_static_graph.md
index cecbf71c56e9f5df5684819515f5cf308120d695..72317d7bc7ba48f0b1072d677bcef43989941aaa 100644
--- a/tutorials/source_en/beginner/accelerate_with_static_graph.md
+++ b/tutorials/source_en/beginner/accelerate_with_static_graph.md
@@ -131,7 +131,7 @@ For an example of using static graphs for network compilation, see [Network Buil

 ## Static Graph Mode Startup Method

-Usually, due to the flexibility of dynamic graphs, we choose to use PyNative mode for free neural network construction for model innovation and optimization. But when performance acceleration is needed, we need to accelerate the neural network partially or as a whole. MindSpore provides two ways of switching to graph mode, the decorator-based startup method and the global context-based startup method.
+Usually, due to the flexibility of dynamic graphs, we choose to use PyNative mode for free neural network construction for model innovation and optimization. But when performance acceleration is needed, we need to accelerate the neural network partially or as a whole. MindSpore provides two ways of switching to static graph mode: the decorator-based startup method and the global context-based startup method.

 ### Decorator-based Startup Method

diff --git a/tutorials/source_en/model_infer/ms_infer/ms_infer_model_infer.rst b/tutorials/source_en/model_infer/ms_infer/ms_infer_model_infer.rst
index 3ec0aed33ef3640da7930c717e4c2d45d4105b96..b7f210325b54d93531bafbf5a853bcb095caee8f 100644
--- a/tutorials/source_en/model_infer/ms_infer/ms_infer_model_infer.rst
+++ b/tutorials/source_en/model_infer/ms_infer/ms_infer_model_infer.rst
@@ -94,9 +94,9 @@ To achieve the optimal cost-effectiveness, MindSpore LLM has undergone multiple

 - **Attention optimization**: The primary computation in the LLM's network involves the computation of attention. Since the attention size in mainstream models is often large (typically 4096 x 4096 or more), the performance of the entire inference process heavily relies on the efficiency of attention computation. Many studies focus on optimizing the performance of attention computation, with notable techniques such as flash attention and page attention.

-  - **Flash attention**: During attention computation, two large matrices (4096 x 4096) are multiplied. This computation breaks the large matrix into smaller matrices that can be processed on multiple chips. Subject to the minimum cache size of chips, data must continuously be moved between the cache and main memory. As a result, compute resources cannot be fully used. Consequently, attention computation is often bandwidth-bound. Flash attention addresses this by dividing attention into blocks, allowing each block to be computed independently on a chip, avoiding multiple data movements during the computation of KVs and enhancing attention computation performance. For details, see `Flash Attention `_.
+  - **Flash Attention**: During attention computation, two large matrices (4096 x 4096) are multiplied. This computation breaks the large matrix into smaller matrices that can be processed on multiple chips. Subject to the minimum cache size of chips, data must continuously be moved between the cache and main memory. As a result, compute resources cannot be fully used. Consequently, attention computation is often bandwidth-bound. Flash Attention addresses this by dividing attention into blocks, allowing each block to be computed independently on a chip, avoiding multiple data movements during the computation of KVs and enhancing attention computation performance. For details, see `Flash Attention `_.

-  - **Page attention graphics memory optimization**: Standard flash attention reads and saves the entire input KV data each time. This method is simple but wastes many resources. For example, "China's capital" and "China's national flag" share "China's", leading to identical KVs for their attention. Standard flash attention needs to store two copies of KVs, wasting the graphics memory. Page attention optimizes KVCache based on the page table principle of the Linux OS. It stores KVs in blocks of a specific size. In the preceding example, "China", "'s", "capital", and "national flag" are stored as four pieces of KV data. Compared with the original six pieces of data, this method effectively saves graphics memory resources. In the service-oriented scenario, more idle graphics memory allows for a larger batch size for model inference, thereby achieving higher throughput. For details, see `Page Attention `_.
+  - **Paged Attention**: Standard Flash Attention reads and saves the entire input Key and Value data each time. Although this method is simple, it wastes a considerable amount of resources: when the requests in a batch have different sequence lengths, Flash Attention has to allocate Key and Value memory for every request according to the longest sequence. For example, for the two requests "The capital of China is Beijing" and "The national flag of China is the Five-Star Red Flag" (tokenized character by character in the original Chinese, giving sequences of 8 and 10 tokens), 10 * 2 = 20 KVCache memory units are required. Paged Attention optimizes KVCache based on the page table principle of the Linux OS and stores the Key and Value data in blocks of a fixed size. With a block size of 2, KVCache can be allocated block by block, so only 4 * 2 + 5 * 2 = 18 KVCache memory units are required. Because Paged Attention stores these blocks discretely, it can also be combined with techniques such as Prefix Cache to further save the memory occupied by the common prefix "of China", so that only 3 * 2 + 5 * 2 = 16 KVCache units are ultimately required. In the service-oriented scenario, more idle graphics memory allows for a larger batch size for model inference, thereby achieving higher throughput. For details, see `Page Attention `_.

 - **Model quantization**: MindSpore LLM inference supports quantization to reduce the model size. It provides technologies such as A16W8, A16W4, A8W8, and KVCache quantizations to reduce model resource usage and improve the inference throughput.
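The KVCache figures in the Paged Attention paragraph above can be reproduced with a few lines of arithmetic. A minimal sketch follows, assuming character-level tokenization (8 and 10 tokens) and assuming that the 3 * 2 + 5 * 2 = 16 figure counts only the prefix blocks that are fully shared; `BLOCK_SIZE`, `seq_lens`, and `shared_prefix_len` are illustrative names, not part of the patched tutorials.

```python
import math

BLOCK_SIZE = 2            # KVCache block size used in the example
seq_lens = [8, 10]        # token counts of the two requests (character-level tokenization)
shared_prefix_len = 3     # tokens in the common prefix ("of China")

# Flash Attention: every request reserves Key/Value memory for the longest sequence.
flash_units = max(seq_lens) * len(seq_lens)                                   # 10 * 2 = 20

# Paged Attention: each request only occupies the whole blocks it actually needs.
paged_units = sum(math.ceil(n / BLOCK_SIZE) * BLOCK_SIZE for n in seq_lens)   # 4*2 + 5*2 = 18

# Paged Attention + Prefix Cache: blocks fully covered by the shared prefix are
# stored only once (assumed reading of the 3 * 2 + 5 * 2 = 16 figure above).
shared_blocks = shared_prefix_len // BLOCK_SIZE                               # 1 shared block
prefix_units = paged_units - shared_blocks * BLOCK_SIZE                       # 16

print(flash_units, paged_units, prefix_units)                                 # 20 18 16
```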
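For the static graph startup change in the first hunk of this patch, a minimal sketch of the two startup methods follows, assuming the `mindspore.jit` decorator and `mindspore.set_context(mode=mindspore.GRAPH_MODE)` interfaces described in the tutorial; `Net` and `forward` are illustrative names.

```python
import mindspore as ms
from mindspore import nn, ops

class Net(nn.Cell):
    """Toy network used only to show the two startup methods."""
    def __init__(self):
        super().__init__()
        self.dense = nn.Dense(10, 1)

    def construct(self, x):
        return self.dense(x)

net = Net()
x = ops.ones((2, 10), ms.float32)

# Decorator-based startup: only the decorated function is compiled into a static graph,
# while the rest of the program keeps running in PyNative mode.
@ms.jit
def forward(inputs):
    return net(inputs)

out = forward(x)

# Global context-based startup: from this point on, networks run in static graph mode.
ms.set_context(mode=ms.GRAPH_MODE)
out = net(x)
```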
diff --git a/tutorials/source_zh_cn/beginner/accelerate_with_static_graph.ipynb b/tutorials/source_zh_cn/beginner/accelerate_with_static_graph.ipynb
index 4f80abe32f06a91e830e7653ce5c9de0c907b58c..44f1e78f73008e7c6c3ad0ece6635185dfebbef2 100644
--- a/tutorials/source_zh_cn/beginner/accelerate_with_static_graph.ipynb
+++ b/tutorials/source_zh_cn/beginner/accelerate_with_static_graph.ipynb
@@ -171,7 +171,7 @@
     "\n",
     "## 静态图模式开启方式\n",
     "\n",
-    "通常情况下,由于动态图的灵活性,我们会选择使用PyNative模式来进行自由的神经网络构建,以实现模型的创新和优化。但是当需要进行性能加速时,可以对神经网络部分或整体进行加速。MindSpore提供了两种切换为图模式的方式:基于装饰器的开启方式以及基于全局context的开启方式。\n",
+    "通常情况下,由于动态图的灵活性,我们会选择使用PyNative模式来进行自由的神经网络构建,以实现模型的创新和优化。但是当需要进行性能加速时,可以对神经网络部分或整体进行加速。MindSpore提供了两种切换为静态图模式的方式:基于装饰器的开启方式以及基于全局context的开启方式。\n",
     "\n",
     "### 基于装饰器的开启方式\n",
     "\n",
diff --git a/tutorials/source_zh_cn/model_infer/ms_infer/ms_infer_model_infer.rst b/tutorials/source_zh_cn/model_infer/ms_infer/ms_infer_model_infer.rst
index cca47ae04ffb1a0c8c4fe8ac7a13bdb1fa6a52c4..edad80d5899e999f72211d38595267a042ef9a22 100644
--- a/tutorials/source_zh_cn/model_infer/ms_infer/ms_infer_model_infer.rst
+++ b/tutorials/source_zh_cn/model_infer/ms_infer/ms_infer_model_infer.rst
@@ -96,7 +96,7 @@ MindSpore大语言模型为了能够实现最优的性价比,针对大语言

  - **Flash Attention**:Attention计算中会存在两个大矩阵相乘(4K大小),实际计算会将大矩阵分解为多个芯片能够计算的小矩阵单元进行计算,由于芯片的最小级的缓存大小限制,需要不断地将待计算数据在缓存和主存间搬入搬出,导致计算资源实际无法充分利用,因此当前主流芯片下,Attention计算实际上是带宽bound。Flash Attention技术将原本Attention进行分块,使得每一块计算都能够在芯片上独立计算完成,避免了在计算Key和Value时多次数据的搬入和搬出,从而提升Attention计算性能,具体可以参考 `Flash Attention `_。

-  - **Page Attention显存优化**:标准的Flash Attention每次会读取和保存整个输入的Key和Value数据,这种方式虽然比较简单,但是会造成较多的资源浪费,如“中国的首都”和“中国的国旗”,都有共同的“中国的”作为公共前缀,其Attention对应的Key和Value值实际上是一样的,标准Flash Attention就需要存两份Key和Value,导致显存浪费。Page Attention基于Linux操作系统页表原理对KVCache进行优化,按照特定大小的块来存储Key和Value的数据,将上面例子中的Key和Value存储为“中国”、“的”、“首都”、“国旗”一共四份Key和Value数据,相比原来的六份数据,有效地节省了显存资源。在服务化的场景下,更多空闲显存可以让模型推理的batch更大,从而获得更高的吞吐量,具体可以参考 `Page Attention `_。
+  - **Paged Attention**:标准的Flash Attention每次会读取和保存整个输入的Key和Value数据,这种方式虽然比较简单,但是会造成较多的资源浪费,当一个batch中多个请求序列长度不一致时,Flash Attention需要key和value用最长的序列的显存,如“中国的首都是北京”和“中国的国旗是五星红旗”,假设分词按字分词,则需要10 * 2 = 20个KVCache显存单元。Paged Attention基于Linux操作系统页表原理对KVCache进行优化,按照特定大小的块来存储Key和Value的数据,如块大小为2时,可以按照块使用KVCache,只需要4 * 2 + 5 * 2 = 18个KVCache的显存单元,由于Paged Attention离散的特性,也可以结合Prefix Cache这类技术进一步节省“中国的”所占用的显存,最终只需要3 * 2 + 5 * 2 = 16个KVCache单元。在服务化的场景下,更多空闲显存可以让模型推理的batch更大,从而获得更高的吞吐量,具体可以参考 `Page Attention `_。

 - **模型量化**:MindSpore大语言模型推理支持通过量化技术减小模型体积,提供了A16W8、A16W4、A8W8量化以及KVCache量化等技术,减少模型资源占用,提升推理吞吐量。
diff --git a/tutorials/source_zh_cn/parallel/msrun_launcher.md b/tutorials/source_zh_cn/parallel/msrun_launcher.md
index c0e48a22916c6e1a57106540fb8e217f822a43cd..56a8e1fa7a169417307035d7883905e6e554dbc7 100644
--- a/tutorials/source_zh_cn/parallel/msrun_launcher.md
+++ b/tutorials/source_zh_cn/parallel/msrun_launcher.md
@@ -4,7 +4,7 @@

 ## 概述

-`msrun`是[动态组网](https://www.mindspore.cn/tutorials/zh-CN/master/parallel/dynamic_cluster.html)启动方式的封装,用户可使用`msrun`,以单个命令行指令的方式在各节点拉起多进程分布式任务,并且无需手动设置[动态组网环境变量](https://www.mindspore.cn/tutorials/zh-CN/master/parallel/dynamic_cluster.html)。`msrun`同时支持`Ascend`,`GPU`和`CPU`后端。与`动态组网`启动方式一样,`msrun`无需依赖第三方库以及配置文件。
+`msrun`是[动态组网](https://www.mindspore.cn/tutorials/zh-CN/master/parallel/dynamic_cluster.html)启动方式的封装,用户可使用`msrun`,以单个命令行指令的方式在各节点拉起多进程分布式任务,并且无需手动设置[动态组网环境变量](https://www.mindspore.cn/tutorials/zh-CN/master/parallel/dynamic_cluster.html)。`msrun`同时支持`Ascend`、`GPU`和`CPU`后端。与`动态组网`启动方式一样,`msrun`无需依赖第三方库以及配置文件。

 > - `msrun`在用户安装MindSpore后即可使用,可使用指令`msrun --help`查看支持参数。
 > - `msrun`支持`图模式`以及`PyNative模式`。
diff --git a/tutorials/source_zh_cn/parallel/optimize_technique.rst b/tutorials/source_zh_cn/parallel/optimize_technique.rst
index 4a4df80d3f93bd596f480bbd8ebce011244be8a3..2a1bfe268eab6dfcf457612ea346e4136e2a91c5 100644
--- a/tutorials/source_zh_cn/parallel/optimize_technique.rst
+++ b/tutorials/source_zh_cn/parallel/optimize_technique.rst
@@ -21,7 +21,7 @@

 考虑到实际并行训练中,可能会对训练性能、吞吐量或规模有要求,可以从三个方面考虑优化:并行策略优化、内存优化和通信优化

-- 并行策略优化:并行策略优化主要包括并行策略的选择、算子级并行下的切分技巧以及多副本技巧。
+- 并行策略优化:主要包括并行策略的选择、算子级并行下的切分技巧以及多副本技巧。

   - `策略选择 `_:根据模型规模和数据量大小,可以选择不同的并行策略,以提高训练效率和资源利用率。
   - `切分技巧 `_:切分技巧是指通过手动配置某些关键算子的切分策略,减少张量重排布来提升训练效率。
diff --git a/tutorials/source_zh_cn/parallel/overview.md b/tutorials/source_zh_cn/parallel/overview.md
index 3da1bc241819a4971bf00d48d8899053f28daf8f..b5a157c5814ae8706ccca7f7306f75e6771dab02 100644
--- a/tutorials/source_zh_cn/parallel/overview.md
+++ b/tutorials/source_zh_cn/parallel/overview.md
@@ -39,7 +39,7 @@ MindSpore提供两种粒度的算子级并行能力:算子级并行和高阶

 ## 流水线并行

-近年来,神经网络的规模几乎是呈指数型增长。受单卡内存的限制,训练这些大模型用到的设备数量也在不断增加。受server间通信带宽低的影响,传统数据并行叠加模型并行的这种混合并行模式的性能表现欠佳,需要引入流水线并行。流水线并行能够将模型在空间上按阶段(Stage)进行切分,每个Stage只需执行网络的一部分,大大节省了内存开销,同时缩小了通信域,缩短了通信时间。MindSpore能够根据用户的配置,将单机模型自动地转换成流水线并行模式去执行。
+近年来,神经网络的规模几乎呈指数型增长。受单卡内存的限制,训练这些大模型用到的设备数量也在不断增加。受server间通信带宽低的影响,传统数据并行叠加模型并行的这种混合并行模式的性能表现欠佳,需要引入流水线并行。流水线并行能够将模型在空间上按阶段(Stage)进行切分,每个Stage只需执行网络的一部分,大大节省了内存开销,同时缩小了通信域,缩短了通信时间。MindSpore能够根据用户的配置,将单机模型自动地转换成流水线并行模式去执行。

 详细可参考[流水线并行](https://www.mindspore.cn/tutorials/zh-CN/master/parallel/pipeline_parallel.html)章节。

diff --git a/tutorials/source_zh_cn/parallel/pipeline_parallel.md b/tutorials/source_zh_cn/parallel/pipeline_parallel.md
index e38b87ae28d475f347e6b69f21f734f5b4fe9c04..3a8eaa4165b3d6fb8165419489eef6c4860e7ae3 100644
--- a/tutorials/source_zh_cn/parallel/pipeline_parallel.md
+++ b/tutorials/source_zh_cn/parallel/pipeline_parallel.md
@@ -4,7 +4,7 @@

 ## 简介

-近年来,神经网络的规模几乎是呈指数型增长。受单卡内存的限制,训练这些大模型用到的设备数量也在不断增加。受server间通信带宽低的影响,传统数据并行叠加模型并行的这种混合并行模式的性能表现欠佳,需要引入流水线并行。流水线并行能够将模型在空间上按阶段(Stage)进行切分,每个Stage只需执行网络的一部分,大大节省了内存开销,同时缩小了通信域,缩短了通信时间。MindSpore能够根据用户的配置,将单机模型自动地转换成流水线并行模式去执行。
+近年来,神经网络的规模几乎呈指数型增长。受单卡内存的限制,训练这些大模型用到的设备数量也在不断增加。受server间通信带宽低的影响,传统数据并行叠加模型并行的这种混合并行模式的性能表现欠佳,需要引入流水线并行。流水线并行能够将模型在空间上按阶段(Stage)进行切分,每个Stage只需执行网络的一部分,大大节省了内存开销,同时缩小了通信域,缩短了通信时间。MindSpore能够根据用户的配置,将单机模型自动转换成流水线并行模式去执行。

 ## 训练操作实践

diff --git a/tutorials/source_zh_cn/parallel/split_technique.md b/tutorials/source_zh_cn/parallel/split_technique.md
index a6f1ef019c1fa2938acaecbdf7b4e323b7648155..9a23e6a058ef4a40a32b2e557dc27118921853cd 100644
--- a/tutorials/source_zh_cn/parallel/split_technique.md
+++ b/tutorials/source_zh_cn/parallel/split_technique.md
@@ -118,7 +118,7 @@ class CoreAttention(nn.Cell):

-再看[FlashAttention](https://gitee.com/mindspore/mindformers/blob/master/mindformers/modules/flash_attention.py)的例子: 
+再看[FlashAttention](https://gitee.com/mindspore/mindformers/blob/master/mindformers/modules/flash_attention.py)的例子:
@@ -193,4 +193,4 @@ class LlamaForCausalLM(LlamaPretrainedModel):
-**用户无法确认是否需要对算子配置策略时,可以不配置,由算法传播找寻最优策略,但是可能无法获得最佳的并行效果;若用户能够确认该算子需要配置什么策略,则可以进行配置帮助算法获得预期效果。**
+**用户无法确认是否需要对算子配置策略时,可以不配置,由算法传播找寻最优策略,但是可能无法获得最佳的并行效果;若用户能够确认该算子需要配置什么策略,则可以进行配置,帮助算法获得预期效果。**
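The split_technique.md conclusion above distinguishes operators whose shard strategy the user configures explicitly from operators left to strategy propagation. A minimal sketch of such a manual configuration follows, assuming the `Primitive.shard` interface and a semi-auto-parallel launch on 4 devices; the cell name and the (4, 1, 1, 1) strategy are hypothetical, not taken from the patched files.

```python
import mindspore as ms
from mindspore import nn, ops

class CoreAttentionSketch(nn.Cell):
    """Illustrative cell: one key operator gets an explicit shard strategy."""
    def __init__(self):
        super().__init__()
        self.bmm = ops.BatchMatMul()
        # Hypothetical strategy for 4 devices: split the batch/head dimension of both
        # inputs by 4 and keep the remaining dimensions whole, so no tensor
        # redistribution is introduced between neighbouring operators.
        self.bmm.shard(((4, 1, 1, 1), (4, 1, 1, 1)))

    def construct(self, query, key):
        # Operators without an explicit .shard() call are left to strategy propagation.
        return self.bmm(query, key)
```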
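Similarly, the pipeline parallel hunks in overview.md and pipeline_parallel.md describe splitting a model into stages so that each device executes only part of the network. A minimal sketch of one common way to configure such a split follows, assuming the `set_auto_parallel_context`, `pipeline_stage`, and `nn.PipelineCell` interfaces from earlier MindSpore releases; the layer names, two-stage split, and micro-batch size are hypothetical, and the code only takes effect when launched as a distributed job (for example with `msrun`).

```python
import mindspore as ms
from mindspore import nn

class ToyNet(nn.Cell):
    """Hypothetical two-block network used only to illustrate stage assignment."""
    def __init__(self):
        super().__init__()
        self.block1 = nn.Dense(1024, 1024)
        self.block2 = nn.Dense(1024, 16)

    def construct(self, x):
        return self.block2(self.block1(x))

# Two pipeline stages; requires a multi-device distributed launch to actually run.
ms.set_auto_parallel_context(parallel_mode="semi_auto_parallel", pipeline_stages=2)

net = ToyNet()
net.block1.pipeline_stage = 0   # first stage holds block1
net.block2.pipeline_stage = 1   # second stage holds block2

# Split each training step into 4 micro-batches so the stages can overlap.
loss_fn = nn.MAELoss()
net_with_loss = nn.PipelineCell(nn.WithLossCell(net, loss_fn), micro_size=4)
```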