From 5694dfdcff3376c8dc1310bbc883c381016f7db0 Mon Sep 17 00:00:00 2001
From: buxue
Date: Mon, 9 Jun 2025 15:08:13 +0800
Subject: [PATCH] add training step pause for MindFormers

---
 .../source_en/feature/high_availability.md    | 20 ++++++++++---------
 .../source_zh_cn/feature/high_availability.md | 20 ++++++++++---------
 .../source_en/api_python/env_var_list.rst     |  8 ++++++++
 .../source_zh_cn/api_python/env_var_list.rst  |  8 ++++++++
 4 files changed, 38 insertions(+), 18 deletions(-)

diff --git a/docs/mindformers/docs/source_en/feature/high_availability.md b/docs/mindformers/docs/source_en/feature/high_availability.md
index bf7d8b4fa6..9b5ae5bafb 100644
--- a/docs/mindformers/docs/source_en/feature/high_availability.md
+++ b/docs/mindformers/docs/source_en/feature/high_availability.md
@@ -4,22 +4,23 @@

 ## Overview

-MindSpore Transformers high availability provides the following four functions:
+MindSpore Transformers high availability provides the following five functions:

 - **End-of-life CKPT**: It is mainly aimed at accelerating the fault recovery in the training process of large models. This feature verifies the integrity and consistency of the intermediate state data after a fault occurs during the training process and generates an end-of-life CheckPoint data, which can be used to recover the training and reduce the loss of training iterations caused by the fault.
 - **UCE Fault-tolerant Recovery**: It mainly focuses on the detection of UCE faults in on-chip memory during the training process of large models, and accomplishes online repair to reach Step-level recomputation.
 - **TRE Training Result Exception Recovery**: It mainly focuses on the detection of value exceptions of loss, global-norm, etc. during the training process of large models, and accomplishes online repair to reach Step-level recomputation.
 - **ARF Process-Level Rescheduling Recovery**: Instead of pulling up the entire cluster again after an anomaly in training occurs, simply restart or replace it on a node-by-node basis to complete the repair and continue training.
+- **TSP Training Step Pause Function**: After each training step completes, training enters the pause interface and is paused or resumed according to the needs of upper-layer operations and maintenance. For example, training can be paused to perform a communication network track switchover and resumed once the switchover succeeds.

 Constraints and dependencies of the high availability functions:

-| | End-of-life CKPT | UCE | ARF | TRE |
-| - | - | - | - | - |
-| Depending on MindIO | Yes | Yes | Yes | No |
-| Replica relationship between between cards | Yes | Yes | Yes | No |
-| Sink Size is 1 | Yes | Yes | Yes | No |
+| | End-of-life CKPT | UCE | ARF | TRE | TSP |
+| - | - | - | - | - | - |
+| Depends on MindIO | Yes | Yes | Yes | No | Yes |
+| Replica relationship between cards | Yes | Yes | Yes | No | No |
+| Sink Size is 1 | Yes | Yes | Yes | No | No |

-These four high availability functions are currently only supported in the MindSpore Ascend back-end graph schema to support Step-level recovery.
+These five high availability functions are currently supported only in graph mode on the MindSpore Ascend backend, and provide Step-level recovery.

 The replica relationship between cards is used to make sure when one of the cards fails, it can be recovered from the other card. It requires that there must be at least two copies of redundancy in both the weights and the optimizer. To ensure this redundancy relationship, data parallelism must be turned on to ensure that there are two cards with the same weights, and also if optimizer parallelism is turned on, it must be ensured that there are two cards with the same optimizer state.
@@ -35,17 +36,18 @@ For high availability functions which depend on MindIO, the user needs to instal

 ```shell
 export MINDIO_FOR_MINDSPORE=1
-export MS_ENABLE_TFT="{TTP:1,UCE:1,ARF:1,TRE:1}"
+export MS_ENABLE_TFT="{TTP:1,UCE:1,ARF:1,TRE:1,TSP:1}"
 export MS_TFT_IP=127.0.0.1
 export MS_TFT_PORT=30051
 ```

 - `MINDIO_FOR_MINDSPORE`: Enabling MindIO TFT SDK to support MindSpore
-- `MS_ENABLE_TFT`: Indicates that the TTP, UCE, ARF and TRE functions are enabled. If you want to enable only one of these functions, set the corresponding value to 1.
+- `MS_ENABLE_TFT`: Indicates that the TTP, UCE, ARF, TRE, and TSP functions are enabled. If you want to enable only one of these functions, set the corresponding value to 1.
   - **TTP (Try To Persist)**: End-of-life CKPT function
   - **UCE (Uncorrectable Memory Error)**: UCE fault tolerance recovery
   - **ARF (Air Refuelling)**: Process-level rescheduling recovery function
   - **TRE (Training Result Error)**: Training result exception recovery
+  - **TSP (Training Step Pause)**: Training step pause function
   - When UCE or ARF is enabled, TTP is enabled by default.
   - TRE function can not be used with UCE or ARF feature
   - TRE does not depend on MindIO.
It is not necessary to configure the MindIO-related environment variables MINDIO_FOR_MINDSPORE, MS_TFT_IP, and MS_TFT_PORT to enable only the TRE feature
diff --git a/docs/mindformers/docs/source_zh_cn/feature/high_availability.md b/docs/mindformers/docs/source_zh_cn/feature/high_availability.md
index e8acb538c8..9f887caba2 100644
--- a/docs/mindformers/docs/source_zh_cn/feature/high_availability.md
+++ b/docs/mindformers/docs/source_zh_cn/feature/high_availability.md
@@ -4,22 +4,23 @@

 ## 概述

-MindSpore Transformers 高可用特性提供了如下四个功能：
+MindSpore Transformers 高可用特性提供了如下五个功能：

 - **临终 CKPT 功能**：主要针对大模型训练过程中的故障恢复加速，该特性在训练过程中发生故障后，校验中间状态数据的完整性和一致性，生成一次临终 CheckPoint 数据，恢复训练时能够通过该 CheckPoint 数据恢复，减少故障造成的训练迭代损失。
 - **UCE 故障容错恢复功能**：主要是针对大模型训练过程中片上内存的 UCE 故障检测，并完成在线修复，达到 Step 级重计算。
 - **TRE 训练结果异常恢复功能**：主要是针对大模型训练过程中出现loss或global norm等值异常检测，并完成在线修复，达到 Step 级重计算。
 - **ARF 进程级重调度恢复功能**：训练发生异常后，不需要重新拉起整个集群，只需以节点为单位进行重启或替换，完成修复并继续训练。
+- **TSP 训练迭代暂停功能**：在每个训练step结束后，进入训练暂停接口，根据上层运维需要进行训练暂停和继续，例如，暂停训练执行通信网络轨道切换，切换成功后继续训练。

 这几个高可用特性的**约束**和**依赖**如下：

-| | 临终 CKPT | UCE | ARF | TRE |
-| - | - | - | - | - |
-| 依赖MindIO组件 | Yes | Yes | Yes | No |
-| 卡间存在副本关系 | Yes | Yes | Yes | No |
-| Sink Size 为 1 | Yes | Yes | Yes | No |
+| | 临终 CKPT | UCE | ARF | TRE | TSP |
+| - | - | - | - | - | - |
+| 依赖MindIO组件 | Yes | Yes | Yes | No | Yes |
+| 卡间存在副本关系 | Yes | Yes | Yes | No | No |
+| Sink Size 为 1 | Yes | Yes | Yes | No | No |

-目前这四个高可用特性只支持Ascend后端上图模式的Step级别恢复。
+目前这五个高可用特性只支持Ascend后端上图模式的Step级别恢复。

 卡间存在副本关系的目的是当其中一张卡发生故障时，可从另外一张卡恢复，要求权重和优化器状态都会存在至少两份冗余。为保证这种冗余关系，必须开启数据并行，保证有两张卡权重一致，同时如果开启了优化器并行，也必须确保存在两张卡的优化器状态一致。

@@ -35,17 +36,18 @@ MindSpore Transformers 高可用特性提供了如下四个功能：

 ```shell
 export MINDIO_FOR_MINDSPORE=1
-export MS_ENABLE_TFT="{TTP:1,UCE:1,ARF:1,TRE:1}"
+export MS_ENABLE_TFT="{TTP:1,UCE:1,ARF:1,TRE:1,TSP:1}"
 export MS_TFT_IP=127.0.0.1
 export MS_TFT_PORT=30051
 ```

 - `MINDIO_FOR_MINDSPORE`：使能 MindIO TFT SDK 支持 MindSpore
-- `MS_ENABLE_TFT`：表示启用 TTP、UCE、ARF 和 TRE 功能，如果只想启用其中的某一个功能，则将对应的值设置为 1
即可。
+- `MS_ENABLE_TFT`：表示启用 TTP、UCE、ARF、TRE、TSP 功能，如果只想启用其中的某一个功能，则将对应的值设置为 1 即可。
   - **TTP (Try To Persist)**：临终 CKPT 功能
   - **UCE (Uncorrectable Memory Error)**：UCE 故障容错恢复功能
   - **ARF (Air Refuelling)**：进程级重调度恢复功能
   - **TRE (Training Result Error)**：TRE 训练结果异常恢复功能
+  - **TSP (Training Step Pause)**：TSP 训练迭代暂停功能
   - 开启 UCE 或者 ARF 功能时，默认开启 TTP 功能
   - 目前 TRE 功能不可以与 UCE 或 ARF 功能同时使用
   - TRE 功能不依赖 MindIO 组件，若只使能TRE特性，无需配置 MindIO 相关的环境变量 MINDIO_FOR_MINDSPORE、MS_TFT_IP 和 MS_TFT_PORT
diff --git a/docs/mindspore/source_en/api_python/env_var_list.rst b/docs/mindspore/source_en/api_python/env_var_list.rst
index 5841a403c3..10e9f51a40 100644
--- a/docs/mindspore/source_en/api_python/env_var_list.rst
+++ b/docs/mindspore/source_en/api_python/env_var_list.rst
@@ -375,6 +375,14 @@ Graph Compilation and Execution

         No setting or other value: Disable Tensor index optimization.
       - Used only in the Ascend AI processor environment when the graph compilation level is O0 or O1. Experimental environment variables.
+    * - MS_SUPPORT_BINARY
+      - Controls whether running pyc or so files is supported in graph mode.
+      - Integer
+      - 1: Support running pyc or so files in graph mode.
+
+        No setting or other value: Not supported.
+      -

 Dump Debugging
 ---------------

diff --git a/docs/mindspore/source_zh_cn/api_python/env_var_list.rst b/docs/mindspore/source_zh_cn/api_python/env_var_list.rst
index 29f7750481..2068ef2222 100644
--- a/docs/mindspore/source_zh_cn/api_python/env_var_list.rst
+++ b/docs/mindspore/source_zh_cn/api_python/env_var_list.rst
@@ -375,6 +375,14 @@

       不设置或其他值：不启用Tensor索引优化
     - 仅限Ascend AI处理器环境，图编译等级为O0或O1流程使用。实验性质的环境变量。
+    * - MS_SUPPORT_BINARY
+      - 控制是否支持在图模式下运行pyc或者so。
+      - Integer
+      - 1：支持图模式下运行pyc或者so。
+
+        不设置或其他值：不支持。
+      -

 Dump调试
 --------
--
Gitee
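For reviewers who want to try the feature out, the environment setup that both language versions of the patch describe can be consolidated into one script. This is only a sketch: the IP and port are the documentation's example values, not required ones, and since the docs state that TRE cannot currently be combined with UCE or ARF, this sketch enables TTP, UCE, ARF, and TSP without TRE.

```shell
# Sketch of the environment setup described in the patch (example values only).

# Enable the MindIO TFT SDK integration with MindSpore
# (required by the MindIO-dependent functions: TTP, UCE, ARF, TSP).
export MINDIO_FOR_MINDSPORE=1

# Enable a set of high-availability functions; drop an entry to leave it off.
# TRE is omitted here because the docs say it cannot be combined with UCE/ARF.
export MS_ENABLE_TFT="{TTP:1,UCE:1,ARF:1,TSP:1}"

# Address of the MindIO controller (the documentation's example values;
# point these at your actual controller in a real deployment).
export MS_TFT_IP=127.0.0.1
export MS_TFT_PORT=30051
```

To enable only the TSP training step pause, the same pattern applies with `MS_ENABLE_TFT="{TSP:1}"`; the MindIO variables are still needed because TSP, unlike TRE, depends on MindIO.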