diff --git a/docs/mindformers/docs/source_en/function/high_availability.md b/docs/mindformers/docs/source_en/function/high_availability.md
index 72b5cc64b48e10964c9d55bada11a17da1d78928..934bb138613b2a7472c6859e502ea13446952d2a 100644
--- a/docs/mindformers/docs/source_en/function/high_availability.md
+++ b/docs/mindformers/docs/source_en/function/high_availability.md
@@ -40,6 +40,7 @@ export MS_TFT_PORT=30051
 - **TTP (Try To Persist)**: End-of-life CKPT function
 - **UCE (Uncorrectable Memory Error)**: UCE fault tolerance recovery
 - **ARF (Air Refuelling)**: Process-level rescheduling recovery function
+ - **TRE(Train Result Error)**: TRE fault tolerance recovery
 - When UCE or ARF is enabled, TTP is enabled by default.
 - `MS_TFT_IP` and `MS_TFT_PORT` represent the IP and port number of TFT Controller respectively, no default value, need to be specified by user. If the Controller is started by MindSpore Transformers, the IP and port number of the rank0 node in the user's cluster are configured. If the Controller is started by the user, configure the IP and port number of the Controller.
diff --git a/docs/mindformers/docs/source_zh_cn/function/high_availability.md b/docs/mindformers/docs/source_zh_cn/function/high_availability.md
index bf6f64b70a39ad9f04fa8fdff4d53662e275b61b..93b359a5f424dbc73d908cc92468bdb2be68aab3 100644
--- a/docs/mindformers/docs/source_zh_cn/function/high_availability.md
+++ b/docs/mindformers/docs/source_zh_cn/function/high_availability.md
@@ -40,6 +40,7 @@ export MS_TFT_PORT=30051
 - **TTP (Try To Persist)**:临终 CKPT 功能
 - **UCE (Uncorrectable Memory Error)**:UCE 故障容错恢复功能
 - **ARF (Air Refuelling)**:进程级重调度恢复功能
+ - **TRE(Train Result Error)**:TRE 故障容错恢复功能
 - 开启 UCE 或者 ARF 功能时,默认开启 TTP 功能
 - `MS_TFT_IP` 和 `MS_TFT_PORT` 分别表示 TFT Controller 的 IP 和端口号,无默认值,需要用户指定。如果由 MindSpore Transformers 启动 Controller,则配置用户集群中 rank0 节点的 IP 和端口号。如果用户自行启动 Controller,则配置 Controller 的 IP 和端口号。
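
Reviewer note: a minimal sketch of how a user might enable the newly documented TRE option, for checking the wording against real usage. The `'{NAME:1}'` value syntax for `MS_ENABLE_TFT` is an assumption based on the surrounding documentation and does not appear in this diff. The IP address is a hypothetical placeholder. Only the port value appears in the hunk context.

```shell
# Assumption: MS_ENABLE_TFT takes a '{FEATURE:1,...}' string; this syntax is not shown in the diff.
# Only TRE is enabled here; whether it can be combined with UCE or ARF is not stated in the diff.
export MS_ENABLE_TFT='{TRE:1}'

# Hypothetical placeholder: replace with the IP of the rank0 node (or of your own TFT Controller).
export MS_TFT_IP=192.168.0.1

# Port of the TFT Controller; 30051 matches the value visible in the hunk header.
export MS_TFT_PORT=30051
```

Both `MS_TFT_IP` and `MS_TFT_PORT` have no default value according to the documented text, so they are set explicitly here.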