diff --git a/docs/mindformers/docs/source_en/env_variables.md b/docs/mindformers/docs/source_en/env_variables.md
index 82af089a3c03e0571a7e6fd900801d723703709a..7e1496d3b7559b3607669a14f3f71b9930ed1e46 100644
--- a/docs/mindformers/docs/source_en/env_variables.md
+++ b/docs/mindformers/docs/source_en/env_variables.md
@@ -40,4 +40,11 @@ The following environment variables are supported by MindSpore Transformers.
| **MS_ENABLE_FA_FLATTEN** | on | Controls whether to support the FlashAttention flatten optimization. | `on`: enable the FlashAttention flatten optimization;<br>`off`: disable the FlashAttention flatten optimization. | Provides a fallback mechanism for models that have not yet been adapted to the FlashAttention flatten optimization. |
| **EXPERIMENTAL_KERNEL_LAUNCH_GROUP** | NA | Controls whether to support batch parallel launch of operators. If supported, enables parallel launch and configures the number of parallel launches. | `thread_num`: the number of concurrent threads; increasing it is not recommended. The default value is `2`;<br>`kernel_group_num`: the total number of operator groups, with `kernel_group_num/thread_num` groups per thread. The default value is `8`. | This feature will continue to evolve, and its behavior may change in the future. Currently, only the `deepseek` inference scenario is supported, where it brings a certain performance gain; other models using this feature may see performance degradation, so use it with caution. Usage: `export EXPERIMENTAL_KERNEL_LAUNCH_GROUP="thread_num:2,kernel_group_num:8"`. |
| **FORCE_EAGER** | False | Controls whether to disable jit mode. | `False`: enable jit mode;<br>`True`: disable jit mode. | Jit compiles functions into a callable MindSpore graph. Setting FORCE_EAGER to False enables jit mode, which can bring performance benefits. Currently, only inference mode is supported. See the example below the table. |
-| **MS_ENABLE_TFT** | NA | Enable [MindIO TFT](https://www.hiascend.com/document/detail/zh/mindx-dl/600/clusterscheduling/ref/mindiottp/mindiotft001.html) feature. Turn on TTP, UCE, ARF or TRE feature. | The value of the environment variable can be:"{TTP:1,UCE:1,ARF:1,TRE:1}", when using a certain feature, the corresponding field can be configured as "1". | Usage can refer to [High Availability](https://www.mindspore.cn/mindformers/docs/en/dev/feature/high_availability.html). |
+| **MS_ENABLE_TFT** | NA | Enables the [MindIO TFT](https://www.hiascend.com/document/detail/zh/mindx-dl/600/clusterscheduling/ref/mindiottp/mindiotft001.html) feature, i.e., turns on the TTP, UCE, HCCE, ARF, or TRE feature. | The value of the environment variable can be "{TTP:1,UCE:1,HCCE:1,ARF:1,TRE:1}"; to use a certain feature, set the corresponding field to "1". | For usage, refer to [High Availability](https://www.mindspore.cn/mindformers/docs/en/dev/feature/high_availability.html). |
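+
+For example, a minimal sketch of disabling jit mode via FORCE_EAGER (assuming eager execution is acceptable for your debugging run):
+
+```shell
+# Run in eager (non-jit) mode; unset or set back to False to restore jit compilation.
+export FORCE_EAGER=True
+```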
diff --git a/docs/mindformers/docs/source_en/feature/high_availability.md b/docs/mindformers/docs/source_en/feature/high_availability.md
index bf7d8b4fa6b6fe89a5d429bfac39ae7cb5b8487c..4c85ada29eb84a426e57a6b6eaee53d7793b0fb6 100644
--- a/docs/mindformers/docs/source_en/feature/high_availability.md
+++ b/docs/mindformers/docs/source_en/feature/high_availability.md
@@ -4,22 +4,23 @@
## Overview
-MindSpore Transformers high availability provides the following four functions:
+MindSpore Transformers high availability provides the following five functions:
- **End-of-life CKPT**: It is mainly aimed at accelerating the fault recovery in the training process of large models. This feature verifies the integrity and consistency of the intermediate state data after a fault occurs during the training process and generates an end-of-life CheckPoint data, which can be used to recover the training and reduce the loss of training iterations caused by the fault.
- **UCE Fault-tolerant Recovery**: It mainly focuses on the detection of UCE faults in on-chip memory during the training process of large models, and accomplishes online repair to reach Step-level recomputation.
+- **HCCL Recompute Error Recovery**: It mainly focuses on the detection of HCCL recompute failures during the training process of large models, and accomplishes online repair to reach Step-level recomputation.
-- **TRE Training Result Excepition Recovery**:It mainly focuses on the detection of value excepton of loss, global-norm, etc. during the training process of large models, and accomplishes online repair to reach Step-level recomputation.
+- **TRE Training Result Exception Recovery**: It mainly focuses on the detection of value exceptions in loss, global-norm, etc. during the training process of large models, and accomplishes online repair to reach Step-level recomputation.
- **ARF Process-Level Rescheduling Recovery**: Instead of pulling up the entire cluster again after an anomaly occurs during training, the affected processes are simply restarted or replaced on a node-by-node basis to complete the repair and continue training.
Constraints and dependencies of the high availability functions:
-| | End-of-life CKPT | UCE | ARF | TRE |
-| - | - | - | - | - |
-| Depending on MindIO | Yes | Yes | Yes | No |
-| Replica relationship between between cards | Yes | Yes | Yes | No |
-| Sink Size is 1 | Yes | Yes | Yes | No |
+| | End-of-life CKPT | UCE | HCCE | ARF | TRE |
+| - | - | - | - | - | - |
+| Depends on MindIO | Yes | Yes | Yes | Yes | No |
+| Replica relationship between cards | Yes | Yes | No | Yes | No |
+| Sink Size is 1 | Yes | Yes | Yes | Yes | No |
-These four high availability functions are currently only supported in the MindSpore Ascend back-end graph schema to support Step-level recovery.
+These five high availability functions are currently only supported in graph mode on the MindSpore Ascend backend, and provide Step-level recovery.
The replica relationship between cards ensures that when one card fails, its state can be recovered from another card. This requires at least two redundant copies of both the weights and the optimizer state: data parallelism must be turned on so that two cards hold the same weights, and if optimizer parallelism is turned on, it must also be ensured that two cards hold the same optimizer state.
@@ -35,7 +36,7 @@ For high availability functions which depend on MindIO, the user needs to instal
```shell
export MINDIO_FOR_MINDSPORE=1
-export MS_ENABLE_TFT="{TTP:1,UCE:1,ARF:1,TRE:1}"
+export MS_ENABLE_TFT="{TTP:1,UCE:1,HCCE:1,ARF:1,TRE:1}"
export MS_TFT_IP=127.0.0.1
export MS_TFT_PORT=30051
```
@@ -44,6 +45,14 @@ export MS_TFT_PORT=30051
-- `MS_ENABLE_TFT`: Indicates that the TTP, UCE, ARF and TRE functions are enabled. If you want to enable only one of these functions, set the corresponding value to 1.
+- `MS_ENABLE_TFT`: Indicates that the TTP, UCE, HCCE, ARF and TRE functions are enabled. If you want to enable only one of these functions, set the corresponding value to 1, as shown in the example below.
- **TTP (Try To Persist)**: End-of-life CKPT function
- **UCE (Uncorrectable Memory Error)**: UCE fault tolerance recovery
+ - **HCCE (Huawei Collective Communication Error)**: HCCL recompute error recovery
- **ARF (Air Refuelling)**: Process-level rescheduling recovery function
- **TRE (Training Result Error)**: Training result exception recovery
- When UCE or ARF is enabled, TTP is enabled by default.
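+
+For example, to enable only the TRE function, which does not depend on MindIO, a minimal sketch could be (assuming fields other than the one you need can simply be omitted):
+
+```shell
+# TRE alone needs neither MINDIO_FOR_MINDSPORE nor MS_TFT_IP/MS_TFT_PORT.
+export MS_ENABLE_TFT="{TRE:1}"
+```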
diff --git a/docs/mindformers/docs/source_zh_cn/env_variables.md b/docs/mindformers/docs/source_zh_cn/env_variables.md
index e58a500de1487620a79234cac67ee3a85d088d37..ea66b3fa5c2dd78e26768813cb5cff7eac7db8aa 100644
--- a/docs/mindformers/docs/source_zh_cn/env_variables.md
+++ b/docs/mindformers/docs/source_zh_cn/env_variables.md
@@ -40,4 +40,11 @@
| **MS_ENABLE_FA_FLATTEN** | on | 控制是否支持 FlashAttention flatten 优化。 | `on`:启用 FlashAttention flatten 优化;<br>`off`:禁用 FlashAttention flatten 优化。 | 对于还未适配 FlashAttention flatten 优化的模型提供回退机制。 |
| **EXPERIMENTAL_KERNEL_LAUNCH_GROUP** | NA | 控制是否支持算子批量并行下发,支持开启并行下发,并配置并行数。 | `thread_num`:并发线程数,一般不建议增加,默认值为`2`;<br>`kernel_group_num`:算子分组总数量,每线程`kernel_group_num/thread_num`个组,默认值为`8`。 | 该特性后续还会继续演进,后续行为可能会有变更。当前仅支持`deepseek`推理场景,有一定的性能优化,但其他模型使用该特性可能会有劣化,用户需要谨慎使用。使用方法如下:`export EXPERIMENTAL_KERNEL_LAUNCH_GROUP="thread_num:2,kernel_group_num:8"`。 |
| **FORCE_EAGER** | False | 控制是否**不开启**jit模式。 | `False`:开启jit模式;<br>`True`:不开启jit模式。 | Jit将函数编译成一张可调用的MindSpore图,设置FORCE_EAGER为False开启jit模式,可以获取性能收益,当前仅支持推理模式。示例见表格下方。 |
-| **MS_ENABLE_TFT** | NA | 使能 [MindIO TFT](https://www.hiascend.com/document/detail/zh/mindx-dl/600/clusterscheduling/ref/mindiottp/mindiotft001.html) 特性,表示启用 TTP、UCE、ARF 或 TRE 功能。 | 取值为"{TTP:1,UCE:1,ARF:1,TRE:1}",使用某一功能时,可将对应字段配置为"1"。 | 使用方式可以参考[高可用特性](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/high_availability.html)。 |
+| **MS_ENABLE_TFT** | NA | 使能 [MindIO TFT](https://www.hiascend.com/document/detail/zh/mindx-dl/600/clusterscheduling/ref/mindiottp/mindiotft001.html) 特性,表示启用 TTP、UCE、HCCE、ARF 或 TRE 功能。 | 取值为"{TTP:1,UCE:1,HCCE:1,ARF:1,TRE:1}",使用某一功能时,可将对应字段配置为"1"。 | 使用方式可以参考[高可用特性](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/high_availability.html)。 |
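+
+例如,通过 FORCE_EAGER 关闭 jit 模式的一个最小示例(假设调试场景可以接受 eager 执行):
+
+```shell
+# 以 eager(非 jit)模式运行;取消设置或设回 False 可恢复 jit 编译。
+export FORCE_EAGER=True
+```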
diff --git a/docs/mindformers/docs/source_zh_cn/feature/high_availability.md b/docs/mindformers/docs/source_zh_cn/feature/high_availability.md
index e8acb538c89e773bd3c61ed39a9741454e0ca922..a6e36d119b1731938ecd9db472d0543274984095 100644
--- a/docs/mindformers/docs/source_zh_cn/feature/high_availability.md
+++ b/docs/mindformers/docs/source_zh_cn/feature/high_availability.md
@@ -4,22 +4,23 @@
## 概述
-MindSpore Transformers 高可用特性提供了如下四个功能:
+MindSpore Transformers 高可用特性提供了如下五个功能:
- **临终 CKPT 功能**:主要针对大模型训练过程中的故障恢复加速,该特性在训练过程中发生故障后,校验中间状态数据的完整性和一致性,生成一次临终 CheckPoint 数据,恢复训练时能够通过该 CheckPoint 数据恢复,减少故障造成的训练迭代损失。
- **UCE 故障容错恢复功能**:主要是针对大模型训练过程中片上内存的 UCE 故障检测,并完成在线修复,达到 Step 级重计算。
+- **HCCL 重计算失败恢复功能**:主要是针对大模型训练过程中 HCCL 通信算子重计算失败的检测,并完成在线修复,达到 Step 级重计算。
- **TRE 训练结果异常恢复功能**:主要是针对大模型训练过程中出现loss或global norm等值异常检测,并完成在线修复,达到 Step 级重计算。
- **ARF 进程级重调度恢复功能**:训练发生异常后,不需要重新拉起整个集群,只需以节点为单位进行重启或替换,完成修复并继续训练。
这几个高可用特性的**约束**和**依赖**如下:
-| | 临终 CKPT | UCE | ARF | TRE |
-| - | - | - | - | - |
-| 依赖MindIO组件 | Yes | Yes | Yes | No |
-| 卡间存在副本关系 | Yes | Yes | Yes | No |
-| Sink Size 为 1 | Yes | Yes | Yes | No |
+| | 临终 CKPT | UCE | HCCE | ARF | TRE |
+| - | - | - | - | - | - |
+| 依赖MindIO组件 | Yes | Yes | Yes | Yes | No |
+| 卡间存在副本关系 | Yes | Yes | No | Yes | No |
+| Sink Size 为 1 | Yes | Yes | Yes | Yes | No |
-目前这四个高可用特性只支持Ascend后端上图模式的Step级别恢复。
+目前这五个高可用特性只支持Ascend后端上图模式的Step级别恢复。
卡间存在副本关系的目的是当其中一张卡发生故障时,可从另外一张卡恢复,要求权重和优化器状态都会存在至少两份冗余。为保证这种冗余关系,必须开启数据并行,保证有两张卡权重一致,同时如果开启了优化器并行,也必须确保存在两张卡的优化器状态一致。
@@ -35,7 +36,7 @@ MindSpore Transformers 高可用特性提供了如下四个功能:
```shell
export MINDIO_FOR_MINDSPORE=1
-export MS_ENABLE_TFT="{TTP:1,UCE:1,ARF:1,TRE:1}"
+export MS_ENABLE_TFT="{TTP:1,UCE:1,HCCE:1,ARF:1,TRE:1}"
export MS_TFT_IP=127.0.0.1
export MS_TFT_PORT=30051
```
@@ -44,6 +45,14 @@ export MS_TFT_PORT=30051
-- `MS_ENABLE_TFT`:表示启用 TTP、UCE、ARF 和 TRE 功能,如果只想启用其中的某一个功能,则将对应的值设置为 1 即可。
+- `MS_ENABLE_TFT`:表示启用 TTP、UCE、HCCE、ARF 和 TRE 功能,如果只想启用其中的某一个功能,则将对应的值设置为 1 即可,示例见下文。
- **TTP (Try To Persist)**:临终 CKPT 功能
- **UCE (Uncorrectable Memory Error)**:UCE 故障容错恢复功能
+ - **HCCE (Huawei Collective Communication Error)**:HCCL 重计算失败恢复功能
- **ARF (Air Refuelling)**:进程级重调度恢复功能
- **TRE (Training Result Error)**:TRE 训练结果异常恢复功能
- 开启 UCE 或者 ARF 功能时,默认开启 TTP 功能
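+
+例如,只启用不依赖 MindIO 的 TRE 功能时,可参考如下最小示例(假设未用到的字段可以直接省略):
+
+```shell
+# 仅启用 TRE 时,无需设置 MINDIO_FOR_MINDSPORE、MS_TFT_IP 和 MS_TFT_PORT。
+export MS_ENABLE_TFT="{TRE:1}"
+```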