From 24881b2dab7c223046a1ca7768008a3c5babb444 Mon Sep 17 00:00:00 2001
From: jizewei
Date: Tue, 17 Jun 2025 16:25:24 +0800
Subject: [PATCH] add environment variable MS_SDC_DETECT_ENABLE

---
 .../mindspore/source_en/api_python/env_var_list.rst | 13 ++++++++++---
 .../source_zh_cn/api_python/env_var_list.rst        | 13 ++++++++++---
 2 files changed, 20 insertions(+), 6 deletions(-)

diff --git a/docs/mindspore/source_en/api_python/env_var_list.rst b/docs/mindspore/source_en/api_python/env_var_list.rst
index fc58157be3..804f68eb03 100644
--- a/docs/mindspore/source_en/api_python/env_var_list.rst
+++ b/docs/mindspore/source_en/api_python/env_var_list.rst
@@ -363,7 +363,7 @@ Graph Compilation and Execution
      - Control whether the operator is launched in a single thread in PyNative mode. When enabled, the operator will be launched in a single thread in PyNative mode.
      - Integer
      - 1: The operator will be launched in a single thread in PyNative mode.
-       
+
        No setting or use other value: Multi-thread launching is enabled for operator in PyNative mode.
      -
@@ -834,8 +834,8 @@ Log
 Note: glog does not support log file wrapping. If you need to control the log file occupation of disk space, you can use the log file management tool provided by the operating system, for example: logrotate for Linux. Please set the log environment variables before `import mindspore` .

-Feature Value Detection
-------------------------------
+Silent Data Corruption Detection
+--------------------------------

 .. list-table::
    :widths: 20 20 10 30 20
@@ -877,6 +877,13 @@ Feature Value Detection

        By default, if this environment variable is not configured, `NPU_ASD_SIGMA_THRESH=100000,5000`
      -
+   * - MS_SDC_DETECT_ENABLE
+     - Whether to enable CheckSum for silent data corruption detection
+     - Integer
+     - 0: Disable CheckSum for silent data corruption detection
+
+       1: Enable CheckSum for silent data corruption detection
+     - Currently, this feature only supports Atlas A2 training series products, and only supports CheckSum for MatMul with bfloat16 data type in O0 or O1 mode.

 For more information on feature value detection, see `Feature Value Detection `_.

diff --git a/docs/mindspore/source_zh_cn/api_python/env_var_list.rst b/docs/mindspore/source_zh_cn/api_python/env_var_list.rst
index 2ac93724d8..7bbcff459c 100644
--- a/docs/mindspore/source_zh_cn/api_python/env_var_list.rst
+++ b/docs/mindspore/source_zh_cn/api_python/env_var_list.rst
@@ -363,7 +363,7 @@
      - 控制动态图算子是否单线程下发。开启后,动态图算子将采用单线程下发。
      - Integer
      - 1:动态图算子采用单线程下发。
-       
+
        不设置或其他值:动态图算子不开启单线程下发。
      -
@@ -831,7 +831,7 @@ Dump调试

 注意:glog不支持日志文件的绕接,如果需要控制日志文件对磁盘空间的占用,可选用操作系统提供的日志文件管理工具,例如:Linux的logrotate。请在 `import mindspore` 之前设置日志相关环境变量。

-特征值检测
+静默故障检测
 ------------

 .. list-table::
@@ -853,7 +853,7 @@ Dump调试
        2:检测到异常,打印日志,检测算子抛出异常

        3:特征值正常和异常场景下都会打印(备注:正常场景下只有CANN开启了INFO及DEBUG级别才会打印),检测到异常时检测算子抛出异常
-     - 目前本特性仅支持Atlas A2 训练系列产品,仅支持检测Transformer类模型,bfloat16数据类型,训练过程中出现的特征值检测异常
+     - 目前本特性仅支持Atlas A2训练系列产品,仅支持检测Transformer类模型,bfloat16数据类型,训练过程中出现的特征值检测异常

        考虑到无法事先知道数据特征值的分布范围,建议设置NPU_ASD_ENABLE的值为1来使能静默检测,以防止误检导致训练中断
    * - NPU_ASD_UPPER_THRESH
@@ -874,6 +874,13 @@ Dump调试

        在不配置该环境变量的默认情况下,`NPU_ASD_SIGMA_THRESH=100000,5000`
      -
+   * - MS_SDC_DETECT_ENABLE
+     - 是否使能CheckSum检测静默故障
+     - Integer
+     - 0:关闭CheckSum检测静默故障
+
+       1:使能CheckSum检测静默故障
+     - 目前本特性仅支持Atlas A2训练系列产品,仅支持在O0或O1模式下,对bfloat16数据类型的MatMul算子进行CheckSum校验

 特征值检测的更多内容详见 `特征值检测 `_ 。
--
Gitee