From 8e4f3ffb2160c3ec40ba5e949588d967836040bc Mon Sep 17 00:00:00 2001
From: guangpengz
Date: Wed, 2 Jul 2025 16:55:11 +0800
Subject: [PATCH] Add deprecation notice for Ring Attention.

---
 docs/mindformers/docs/source_en/feature/parallel_training.md    | 2 ++
 docs/mindformers/docs/source_zh_cn/feature/parallel_training.md | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/docs/mindformers/docs/source_en/feature/parallel_training.md b/docs/mindformers/docs/source_en/feature/parallel_training.md
index 14d19251ed..17b16b9ac9 100644
--- a/docs/mindformers/docs/source_en/feature/parallel_training.md
+++ b/docs/mindformers/docs/source_en/feature/parallel_training.md
@@ -43,6 +43,8 @@ From generative AI to scientific models, long sequence training is becoming very
 
 #### Ring Attention Sequence Parallelism
 
+> This feature has been deprecated and will be removed in a future version. Please use other sequence parallelism methods instead. If you have any questions or suggestions, please submit feedback through **[Community Issue](https://gitee.com/mindspore/mindformers/issues/new)**. Thank you for your understanding and support!
+
 Long Sequence Parallel Algorithm, Ring Attention, is a representative technique for long sequence parallelism in the current industry, which is used to solve the memory overhead problem during long sequence training, while realizing computation and communication masking. The Ring Attention algorithm utilizes the chunking property of Attention, when the sequence parallelism is N, Q, K, V are sliced into N sub-chunks, and each card calls the Flash Attention algorithm to compute the Attention result of the local QKV sub-chunks respectively. Since each card only needs to compute the Attention of the sliced QKV sub-chunks, its memory occupation is reduced significantly. Ring Attention uses ring communication to collect and send sub-chunks to neighboring cards while doing FA computation to maximize the masking of computation and communication, which guarantees the overall performance of long sequence parallelism.
 
 MindSpore Transformers has support for configuring Ring Attention sequence parallel schemes, which can be enabled with the following configuration item:
diff --git a/docs/mindformers/docs/source_zh_cn/feature/parallel_training.md b/docs/mindformers/docs/source_zh_cn/feature/parallel_training.md
index 7fe00f2d6d..c15df2cf25 100644
--- a/docs/mindformers/docs/source_zh_cn/feature/parallel_training.md
+++ b/docs/mindformers/docs/source_zh_cn/feature/parallel_training.md
@@ -86,6 +86,8 @@ parallel_config:
 
 #### Ring Attention序列并行
 
+> 本功能已废弃，将在后续版本中下架，可使用其他序列并行方法。如有任何问题或建议，请通过 **[社区Issue](https://gitee.com/mindspore/mindformers/issues/new)** 提交反馈，感谢您的理解和支持！
+
 长序列并行算法 Ring Attention 是当前业界长序列并行的代表性技术，用于解决长序列训练时的内存开销问题，同时实现计算与通信掩盖。Ring Attention 算法利用 Attention 的分块计算性质，当序列并行度为 N 时，将 Q,K,V 分别切分为 N 个子块，每张卡分别调用 Flash Attention 算子来计算本地 QKV 子块的 Attention 结果。由于每张卡只需要计算切分后 QKV 子块的 Attention，其内存占用大幅降低。Ring Attention 在做 FA 计算的同时采用环形通信向相邻卡收集和发送子块，实现计算与通信的最大化掩盖，保障了长序列并行的整体性能。
 
 MindSpore Transformers已支持配置Ring Attention序列并行方案，可通过以下配置项使能：
--
Gitee