From efdef685d818d48742bba6e882a9fb8ccadc44f2 Mon Sep 17 00:00:00 2001
From: jizewei
Date: Tue, 26 Aug 2025 20:10:34 +0800
Subject: [PATCH] silent detect docs

---
 .../source_en/api_python/env_var_list.rst     |  29 ++++-
 .../source_zh_cn/api_python/env_var_list.rst  |  23 ++++
 tutorials/source_en/debug/sdc.md              | 103 ++++++++++++++++++
 tutorials/source_zh_cn/debug/sdc.md           | 103 ++++++++++++++++++
 4 files changed, 255 insertions(+), 3 deletions(-)

diff --git a/docs/mindspore/source_en/api_python/env_var_list.rst b/docs/mindspore/source_en/api_python/env_var_list.rst
index b6f2c18b7e..a378effa02 100644
--- a/docs/mindspore/source_en/api_python/env_var_list.rst
+++ b/docs/mindspore/source_en/api_python/env_var_list.rst
@@ -889,7 +889,7 @@ Silent Data Corruption Detection
        2: Enable feature value detection function, when error was detected, throw exception
 
        3: Enable feature value detection function, when error was detected, throw exception, but at the same time write value detection info of each time to log file (this requires set ascend log level to info or debug)
-     - Currently, this feature only supports Atlas A2 training series products, and only detects abnormal feature value that occur during the training of Transformer class models with bfloat16 data type
+     - Currently, this feature only supports the Atlas A2 training series products, and only detects abnormal feature values that occur during the training of Transformer class models with bfloat16 data type
 
        Considering that the feature value range can not be known ahead, setting NPU_ASD_ENABLE to 1 is recommended to enable silent check, which prevents training interruption caused by false detection
    * - NPU_ASD_UPPER_THRESH
@@ -915,8 +915,31 @@ Silent Data Corruption Detection
      - Integer
      - 0: Disable CheckSum for silent data corruption detection
 
-       1: Enable CheckSum for silent data corruption Detection
-     - Currently, this feature only supports Atlas A2 training series products, and only supports CheckSum for MatMul with bfloat16 data type in O0 or O1 mode
+       1: Enable CheckSum for silent data corruption detection
+     - Currently, this feature only supports the Atlas A2 training series products, and only supports CheckSum for MatMul with bfloat16 data type in O0 or O1 mode
+   * - MS_NPU_ASD_CONFIG
+     - Configure silent detection
+     - String
+     - Configuration items in the format "key:value", with multiple items separated by commas, for example, "export MS_NPU_ASD_CONFIG=enable:true,with_checksum:true,grad_sample_interval:10,upper_thresh1:1000000,upper_thresh2:100,cooldown:5,strikes_num:3,strikes_window:480,checksum_cooldown:180".
+
+       enable: Whether to enable feature value detection, with a default value of false
+
+       with_checksum: Whether to work with CheckSum, with a default value of false
+
+       grad_sample_interval: Gradient feature value sampling interval, with a default value of 10, i.e. a sampling rate of 1/10
+
+       upper_thresh1: First-level threshold for feature value detection, an integer greater than or equal to 3, with a default value of 1000000
+
+       upper_thresh2: Second-level threshold for feature value detection, an integer greater than or equal to 3, with a default value of 100
+
+       cooldown: Cooldown time after a feature value detection anomaly and CheckSum execution time, a positive integer in minutes, with a default value of 5
+
+       strikes_num: Number of feature value anomalies required to trigger CheckSum detection, with a default value of 3
+
+       strikes_window: Time window for counting feature value detection anomalies, a positive integer in minutes, with a default value of 480
+
+       checksum_cooldown: CheckSum detection cooldown time, a positive integer in minutes, with a default value of 180
+     - Currently, this feature only supports the Atlas A2 training series products and is limited to networks trained in automatic or semi-automatic parallel mode.
 
 For more information on feature value detection, see `Feature Value Detection `_.
 
diff --git a/docs/mindspore/source_zh_cn/api_python/env_var_list.rst b/docs/mindspore/source_zh_cn/api_python/env_var_list.rst
index 7b6d8d6c24..3bacd164c6 100644
--- a/docs/mindspore/source_zh_cn/api_python/env_var_list.rst
+++ b/docs/mindspore/source_zh_cn/api_python/env_var_list.rst
@@ -914,6 +914,29 @@ Dump调试
 
        1:使能CheckSum检测静默故障
      - 目前本特性仅支持Atlas A2训练系列产品,仅支持在O0或O1模式下,对bfloat16数据类型的MatMul算子进行CheckSum校验
+   * - MS_NPU_ASD_CONFIG
+     - 设置静默检测选项
+     - String
+     - 配置项,格式为key:value,多个配置项以逗号分隔,例如 `export MS_NPU_ASD_CONFIG=enable:true,with_checksum:true,grad_sample_interval:10,upper_thresh1:1000000,upper_thresh2:100,cooldown:5,strikes_num:3,strikes_window:480,checksum_cooldown:180`
+
+       enable: 是否开启特征值检测,默认值为false
+
+       with_checksum: 是否联动CheckSum检测,默认值为false
+
+       grad_sample_interval: 梯度特征值采样间隔,默认值为10,即1/10的采样率
+
+       upper_thresh1: 特征值检测的一级阈值,格式为大于等于3的正整数,默认值为1000000
+
+       upper_thresh2: 特征值检测的二级阈值,格式为大于等于3的正整数,默认值为100
+
+       cooldown: 特征值异常冷却时间和CheckSum执行时间,格式为正整数,单位为分钟,默认值为5
+
+       strikes_num: 触发CheckSum检测的特征值异常次数,默认值为3
+
+       strikes_window: 统计特征值异常次数的时间窗口,格式为正整数,单位分钟,默认值为480
+
+       checksum_cooldown: CheckSum检测冷却时间,格式为正整数,单位为分钟,默认值为180
+     - 目前本特性仅支持Atlas A2训练系列产品,仅支持检测自动并行和半自动并行模式的网络
 
 特征值检测的更多内容详见 `特征值检测 `_ 。
 
diff --git a/tutorials/source_en/debug/sdc.md b/tutorials/source_en/debug/sdc.md
index 45fd8379c2..e5126400b2 100644
--- a/tutorials/source_en/debug/sdc.md
+++ b/tutorials/source_en/debug/sdc.md
@@ -14,6 +14,8 @@ The MindSpore framework version 2.4 provides a solution for feature value detect
 
 For default feature value detection checkpoints, users can enable detection capability using the environment variable `NPU_ASD_ENABLE=1`, `NPU_ASD_ENABLE=2` or `NPU_ASD_ENABLE=3`, and adjust the detection intensity by configuring the environment variables `NPU_ASD_UPPER_THRESH` and `NPU_ASD_SIGMA_THRESH`.
 
+The MindSpore framework version 2.7.1 provides a combined detection scheme using feature value detection and CheckSum, which can locate silent faults more accurately. It samples parameter gradients for feature value detection. When multiple anomalies occur, a "strike out" mechanism triggers CheckSum to further locate the faulty device. Users can configure the combined detection through `MS_NPU_ASD_CONFIG`.
+
 For information on configuring related environment variables, see **Feature Switches and Configuration**.
 
 For an introduction to default feature value detection checkpoints, and design guidelines for custom feature value detection checkpoints, see **Usage Recommendations and Detection Principles**.
@@ -32,12 +34,16 @@ Based on experimental results, the following empirical conclusions are drawn:
 
 After enabling the detection switch (set `NPU_ASD_ENABLE` to `1`, `2` or `3`), during the backpropagation phase of training Transformer structure models, abnormality is determined by collecting the activation value gradients of the Norm layer through calling the detection operator inserted before the communication operator in the backward graph, and using an algorithm to determine if an anomaly exists. If an anomaly occurs, print the relevant logs or terminate the training depending on the different values of the environment variable `NPU_ASD_ENABLE`, and set the NPU state on the device where the anomaly is detected to Warning to report the fault event.
 
+When using the combined detection scheme of feature value detection and CheckSum (set `enable:true` in `MS_NPU_ASD_CONFIG`), feature values are sampled from parameter gradients before communication in the backward graph, and an algorithm determines whether an anomaly exists. If feature value detection is linked with CheckSum (set `with_checksum:true` in `MS_NPU_ASD_CONFIG`), CheckSum is performed when the number of anomalies within a time window exceeds the threshold. CheckSum verifies the calculation results of the MatMul operator of bfloat16 data type on each device to identify the faulty one.
+
 The reasons for feature value anomalies can be divided into two categories: hardware errors and software errors, which can be referred to in the **Fault Handling** section for further analysis.
 
 ### Usage Restrictions
 
 Currently, this feature only supports Atlas A2 training series products, detects abnormal feature value during the training process with Transformer model within 8-D and bfloat16, float32 data type.
 
+The combined detection scheme currently only supports the `auto_parallel` and `semi_auto_parallel` modes. CheckSum only verifies the MatMul operator of bfloat16 data type.
+
 ## Feature Switches and Configuration
 
 The environment variable `NPU_ASD_ENABLE` serves as a feature switch, `export NPU_ASD_ENABLE=1`, `export NPU_ASD_ENABLE=2` or `export NPU_ASD_ENABLE=3` to enable this feature; if this environment variable is not configured or `export NPU_ASD_ENABLE=0`, this feature is disabled.
@@ -46,6 +52,8 @@ The environment variable `NPU_ASD_UPPER_THRESH` controls the absolute numerical
 
 The environment variable `NPU_ASD_SIGMA_THRESH` controls the relative numerical threshold of detection, in the same format as the above, where the first element controls the first-level threshold of numerical changes, and the second element controls the second-level threshold of numerical changes; by default, `NPU_ASD_SIGMA_THRESH=100000,5000`.
 
+The environment variable `MS_NPU_ASD_CONFIG` configures the combined detection scheme of feature value detection and CheckSum, in the format `key:value`, with configuration items separated by commas. `enable` is the feature value detection switch, `with_checksum` is the CheckSum linkage switch, `grad_sample_interval` is the feature value sampling interval, `upper_thresh1` and `upper_thresh2` control the absolute and relative thresholds of feature value detection respectively, `cooldown` is both the cooldown time after a feature value anomaly and the CheckSum execution time, `strikes_num` and `strikes_window` are the number of feature value anomalies and the time window required to trigger CheckSum, and `checksum_cooldown` is the CheckSum cooldown time. By default, `MS_NPU_ASD_CONFIG="enable:false,with_checksum:false,grad_sample_interval:10,upper_thresh1:1000000,upper_thresh2:100,cooldown:5,strikes_num:3,strikes_window:480,checksum_cooldown:180"`.
+
 For details of above environment variables, see [Environment Variables](https://www.mindspore.cn/docs/en/master/api_python/env_var_list.html).
 
 ## Use Cases
@@ -262,6 +270,95 @@ device-2/device-311523_20250225184632284.log:1829:[INFO] AICPU(26559,aicpu_sched
 device-2/device-311523_20250225184632284.log:1891:[ERROR] AICPU(26559,aicpu_scheduler):2025-02-25-18:46:51.762.577 [silent_check_v3.cc:250][ComputeL1Error][tid:26572]SilentCheckV3 ComputeL1Error:val = [nan], max = [nan], avg=[5.752281e-08], step=[5], c_thresh_l1 = [1.000000e+06], c_thresh_l2 = [1.000000e+04], beta1 = [9.900000e-01], npu_asd_detect = [3].
 ```
 
+### Combined Detection
+
+When the combined detection scheme is enabled (set `enable:true` in `MS_NPU_ASD_CONFIG`), its own feature value detection method is used, and `NPU_ASD_ENABLE` is ignored.
+
+A simple neural network is constructed here, and feature value anomalies are simulated through MindSpore's fault injection operator. The network script (`silent_detect.py`) is shown below:
+
+```python
+"""Silent Detect Demo"""
+import time
+import numpy as np
+
+import mindspore as ms
+from mindspore import nn, Tensor, Parameter, context, ops, jit
+from mindspore.communication import init, get_rank
+from mindspore.nn import Momentum, TrainOneStepCell
+from mindspore.parallel.auto_parallel import AutoParallel
+
+context.set_context(mode=context.GRAPH_MODE)
+init()
+ms.set_seed(1)
+np.random.seed(1)
+
+
+class Net(nn.Cell):
+    def __init__(self):
+        super(Net, self).__init__()
+        self.fc1 = nn.Dense(1, 8)
+        self.fc2 = nn.Dense(8, 8)
+        self.relu = ops.ReLU()
+        self.eod_mask = ops.auto_generate.GenerateEodMaskV2()
+        self.cur_step = Parameter(Tensor(-1, ms.int64), requires_grad=False)
+        rank_id = get_rank()
+        if rank_id == 2:
+            self.flip_mode = 'bitflip_designed'
+        else:
+            self.flip_mode = 'multiply'
+
+    def construct(self, x):
+        x = self.fc1(x)
+        x = self.relu(x)
+        ele_pos = Tensor(0, ms.int64)
+        seed = Tensor(0, ms.int64)
+        offset = Tensor(0, ms.int64)
+        start = 0
+        steps = [5]
+        error_mode = 'cycle'
+        multiply_factor = 1.0
+        bit_pos = 0
+        flip_probability = 0.0
+        self.cur_step = self.cur_step + 1
+        x = self.eod_mask(x, ele_pos, self.cur_step, seed, offset, start, steps, error_mode, self.flip_mode,
+                          multiply_factor, bit_pos, flip_probability)
+        x = self.fc2(x)
+        return x
+
+
+if __name__ == '__main__':
+    net = Net()
+    optimizer = Momentum(net.trainable_params(), learning_rate=0.1, momentum=0.9)
+    net = TrainOneStepCell(net, optimizer)
+    net.set_train()
+
+    @jit
+    def compiled_one_step(inputs):
+        net(inputs)
+
+    parallel_net = AutoParallel(compiled_one_step, parallel_mode='semi_auto')
+    for i in range(200):
+        inputs = Tensor(np.random.rand(8, 1).astype(np.float32))
+        parallel_net(inputs)
+        time.sleep(1)
+```
+
+Start command:
+
+```bash
+export MS_NPU_ASD_CONFIG="enable:true,with_checksum:true,grad_sample_interval:1,cooldown:1,strikes_num:1"
+msrun --worker_num=8 --local_worker_num=8 --master_addr=127.0.0.1 --master_port=11235 --join=True python silent_detect.py
+```
+
+Feature value detection anomalies and CheckSum verification results can be observed in the training logs (`worker_*.log` by default):
+
+```bash
+$ grep -m1 'Silent detect strike' worker_0.log
+[WARNING] DEBUG(2950752,fffee7e591e0,python):2025-08-26-10:46:26.665.782 [mindspore/ccsrc/tools/silent_detect/silent_detector.cc:109] SilentDetect] Silent detect strike detected: StrikeRecord{timestamp: 1756176386, name: fc1.weight, value: inf, stat: StatData{avg: 6.44326e+12, pre_value: 6.441e+14, count: 6, none_zero_count: 6}}
+$ grep -m1 'Global CheckSum result is' worker_0.log
+[WARNING] DEBUG(2950752,fffda37fe1e0,python):2025-08-26-10:47:28.934.305 [mindspore/ccsrc/tools/silent_detect/silent_detector.cc:316] DoCheckSum] Global CheckSum result is 0
+```
+
 ## Detection Results and Handling
 
 ### Abnormal Detection Results
@@ -274,6 +371,12 @@ When numerical anomalies are detected, the training task fails and alerts are re
 * Monitor the NPU health status: if Health Status displays Warning, Error Code displays 80818C00, and Error Information displays node type=SoC, sensor type=Check Sensor, event state=check fail;
 * Check the [Ascend Device Plugin](https://github.com/Ascend/ascend-device-plugin) events, report error code 80818C00, event type is fault event, and the fault level is minor.
 
+When using combined detection, if feature value anomalies occur or CheckSum detects silent faults, warnings are written to the training logs:
+
+* The keyword for feature value detection anomalies is "Silent detect strike";
+* The keyword for triggering CheckSum is "Feature value detection strikes out";
+* The keywords for silent errors are "CheckSum detects MatMul error on rank" and "SilentCheck detects SDC error".
+
 ### Fault Handling
 
 Isolate the abnormal device, resume training with checkpoint recovery; meanwhile, on the abnormal device, use the Ascend-DMI tool to perform AICore ERROR stress diagnostics to detect whether there are faulty NPUs on the device. For details, see [ToolBox User Guide](https://www.hiascend.com/document/detail/zh/mindx-dl/600/toolbox/ascenddmi/toolboxug_000002.html) in the "ascend-dmi tool usage > fault diagnosis" section.
diff --git a/tutorials/source_zh_cn/debug/sdc.md b/tutorials/source_zh_cn/debug/sdc.md
index 6f644702f0..6a448f0aa2 100644
--- a/tutorials/source_zh_cn/debug/sdc.md
+++ b/tutorials/source_zh_cn/debug/sdc.md
@@ -14,6 +14,8 @@ MindSpore框架2.4版本提供了网络模型的特征值检测方案,该方
 
 对于默认的特征值检测点,用户可以设置环境变量 `NPU_ASD_ENABLE` 为`1`、`2`或`3`使能检测能力,并且通过配置环境变量 `NPU_ASD_UPPER_THRESH`, `NPU_ASD_SIGMA_THRESH`,调整检测强度。
 
+MindSpore框架2.7.1版本提供了特征值与CheckSum联合检测方案,能够更加准确地定位静默故障。该方案采样参数梯度进行特征值检测,并在检测到多次特征值异常时,通过“三振出局”机制触发CheckSum校验,进一步定位故障卡。用户可以通过`MS_NPU_ASD_CONFIG`对联合检测进行配置。
+
 关于相关环境变量的配置,见 **特性开关及配置**。
 
 关于默认的特征值检测点的介绍,以及对于自定义特征值检测点的设计指导,见 **使用建议与检测原理** 。
@@ -32,12 +34,16 @@ MindSpore框架2.4版本提供了网络模型的特征值检测方案,该方
 
 开启检测开关(设置`NPU_ASD_ENABLE`为`1`,`2`或`3`)后,针对Transformer结构模型训练的反向阶段,通过在反向图的通信算子前插入检测算子,采集Norm层的激活值梯度,并通过算法判断是否异常。若出现异常,则根据环境变量`NPU_ASD_ENABLE`的不同取值打印相关日志或终止训练,并将检测到异常的设备上的NPU状态置为Warning,上报故障事件。
 
+采用特征值与CheckSum联合检测方案(`MS_NPU_ASD_CONFIG`中设置`enable:true`)时,会在反向图中对参数通信前的梯度进行特征值采样,并通过算法判断是否异常。当联合CheckSum校验(`MS_NPU_ASD_CONFIG`中设置`with_checksum:true`)时,若在时间窗口内异常次数超过阈值,会进一步开启CheckSum校验,对各卡bfloat16数据类型的MatMul算子的计算结果进行校验。
+
 特征值异常原因可分为两类:硬件错误与软件错误,可参考**故障处理**章节进行后续分析。
 
 ### 使用限制
 
 目前本特性仅支持Atlas A2 训练系列产品,仅支持检测8维以内Transformer类模型,bfloat16和float32数据类型,训练过程中出现的特征值检测异常。
 
+联合检测方案目前仅支持自动并行或半自动并行模式。CheckSum仅针对bfloat16数据类型的MatMul算子进行校验。
+
 ## 特性开关及配置
 
 环境变量`NPU_ASD_ENABLE`作为特性开关,`export NPU_ASD_ENABLE=1`、`export NPU_ASD_ENABLE=2`或`export NPU_ASD_ENABLE=3`开启本特性;不配置该环境变量或`export NPU_ASD_ENABLE=0`关闭本特性。
@@ -46,6 +52,8 @@ MindSpore框架2.4版本提供了网络模型的特征值检测方案,该方
 
 环境变量`NPU_ASD_SIGMA_THRESH`控制检测的相对数值阈值,格式与上者相同,其中第一个元素控制数值跳变一级阈值,第二个元素控制数值跳变二级阈值;默认情况下,`NPU_ASD_SIGMA_THRESH=100000,5000`。
 
+环境变量`MS_NPU_ASD_CONFIG`对特征值和CheckSum联合检测方案进行配置,格式为key:value,并以逗号分隔各个配置项。其中`enable`为特征值检测开关,`with_checksum`为联动CheckSum开关,`grad_sample_interval`为特征值采样间隔,`upper_thresh1`和`upper_thresh2`分别控制特征值检测的绝对阈值和相对阈值,`cooldown`为特征值异常冷却时间和单次CheckSum执行时长,`strikes_num`和`strikes_window`为触发CheckSum所需的特征值异常次数和时间窗口大小,`checksum_cooldown`为CheckSum冷却时间。默认情况下,`MS_NPU_ASD_CONFIG="enable:false,with_checksum:false,grad_sample_interval:10,upper_thresh1:1000000,upper_thresh2:100,cooldown:5,strikes_num:3,strikes_window:480,checksum_cooldown:180"`。
+
 上述环境变量的详细说明参见[环境变量](https://www.mindspore.cn/docs/zh-CN/master/api_python/env_var_list.html)。
 
 ## 使用用例
@@ -262,6 +270,95 @@ device-2/device-311523_20250225184632284.log:1829:[INFO] AICPU(26559,aicpu_sched
 device-2/device-311523_20250225184632284.log:1891:[ERROR] AICPU(26559,aicpu_scheduler):2025-02-25-18:46:51.762.577 [silent_check_v3.cc:250][ComputeL1Error][tid:26572]SilentCheckV3 ComputeL1Error:val = [nan], max = [nan], avg=[5.752281e-08], step=[5], c_thresh_l1 = [1.000000e+06], c_thresh_l2 = [1.000000e+04], beta1 = [9.900000e-01], npu_asd_detect = [3].
 ```
 
+### 联合检测
+
+使用联合检测方案时(`MS_NPU_ASD_CONFIG`中设置`enable:true`),会采用该方案对应的特征值检测方法,`NPU_ASD_ENABLE`不再生效。
+
+这里构造了一个简单的神经网络,并通过MindSpore的故障注入算子模拟特征值异常。网络脚本(`silent_detect.py`)如下:
+
+```python
+"""Silent Detect Demo"""
+import time
+import numpy as np
+
+import mindspore as ms
+from mindspore import nn, Tensor, Parameter, context, ops, jit
+from mindspore.communication import init, get_rank
+from mindspore.nn import Momentum, TrainOneStepCell
+from mindspore.parallel.auto_parallel import AutoParallel
+
+context.set_context(mode=context.GRAPH_MODE)
+init()
+ms.set_seed(1)
+np.random.seed(1)
+
+
+class Net(nn.Cell):
+    def __init__(self):
+        super(Net, self).__init__()
+        self.fc1 = nn.Dense(1, 8)
+        self.fc2 = nn.Dense(8, 8)
+        self.relu = ops.ReLU()
+        self.eod_mask = ops.auto_generate.GenerateEodMaskV2()
+        self.cur_step = Parameter(Tensor(-1, ms.int64), requires_grad=False)
+        rank_id = get_rank()
+        if rank_id == 2:
+            self.flip_mode = 'bitflip_designed'
+        else:
+            self.flip_mode = 'multiply'
+
+    def construct(self, x):
+        x = self.fc1(x)
+        x = self.relu(x)
+        ele_pos = Tensor(0, ms.int64)
+        seed = Tensor(0, ms.int64)
+        offset = Tensor(0, ms.int64)
+        start = 0
+        steps = [5]
+        error_mode = 'cycle'
+        multiply_factor = 1.0
+        bit_pos = 0
+        flip_probability = 0.0
+        self.cur_step = self.cur_step + 1
+        x = self.eod_mask(x, ele_pos, self.cur_step, seed, offset, start, steps, error_mode, self.flip_mode,
+                          multiply_factor, bit_pos, flip_probability)
+        x = self.fc2(x)
+        return x
+
+
+if __name__ == '__main__':
+    net = Net()
+    optimizer = Momentum(net.trainable_params(), learning_rate=0.1, momentum=0.9)
+    net = TrainOneStepCell(net, optimizer)
+    net.set_train()
+
+    @jit
+    def compiled_one_step(inputs):
+        net(inputs)
+
+    parallel_net = AutoParallel(compiled_one_step, parallel_mode='semi_auto')
+    for i in range(200):
+        inputs = Tensor(np.random.rand(8, 1).astype(np.float32))
+        parallel_net(inputs)
+        time.sleep(1)
+```
+
+启动命令:
+
+```bash
+export MS_NPU_ASD_CONFIG="enable:true,with_checksum:true,grad_sample_interval:1,cooldown:1,strikes_num:1"
+msrun --worker_num=8 --local_worker_num=8 --master_addr=127.0.0.1 --master_port=11235 --join=True python silent_detect.py
+```
+
+通过查看训练日志(默认为`worker_*.log`),可以观察到特征值异常记录、CheckSum校验结果:
+
+```bash
+$ grep -m1 'Silent detect strike' worker_0.log
+[WARNING] DEBUG(2950752,fffee7e591e0,python):2025-08-26-10:46:26.665.782 [mindspore/ccsrc/tools/silent_detect/silent_detector.cc:109] SilentDetect] Silent detect strike detected: StrikeRecord{timestamp: 1756176386, name: fc1.weight, value: inf, stat: StatData{avg: 6.44326e+12, pre_value: 6.441e+14, count: 6, none_zero_count: 6}}
+$ grep -m1 'Global CheckSum result is' worker_0.log
+[WARNING] DEBUG(2950752,fffda37fe1e0,python):2025-08-26-10:47:28.934.305 [mindspore/ccsrc/tools/silent_detect/silent_detector.cc:316] DoCheckSum] Global CheckSum result is 0
+```
+
 ## 检测结果及处理
 
 ### 异常检测结果
@@ -274,6 +371,12 @@ device-2/device-311523_20250225184632284.log:1891:[ERROR] AICPU(26559,aicpu_sche
 * 通过监控NPU健康状态:Health Status显示Warning,Error Code显示80818C00,Error Information显示node type=SoC, sensor type=Check Sensor, event state=check fail;
 * 通过查看[Ascend Device Plugin](https://github.com/Ascend/ascend-device-plugin)事件,上报错误码80818C00,事件类型为故障事件,故障级别次要。
 
+当使用联合检测时,若训练中发生特征值异常、CheckSum检测出静默故障,会在业务训练日志中产生告警:
+
+* 特征值异常日志关键字为“Silent detect strike”;
+* 触发CheckSum校验日志关键字为“Feature value detection strikes out”;
+* 联合CheckSum识别出静默故障日志关键字为“CheckSum detects MatMul error on rank”和“SilentCheck detects SDC error”。
+
 ### 故障处理
 
 将异常设备隔离,断点续训拉起继续训练;同时在异常设备上,通过Ascend-DMI工具执行AICore ERROR压测诊断,检测该设备上是否存在故障NPU。详情请查看[《ToolBox用户指南》](https://www.hiascend.com/document/detail/zh/mindx-dl/600/toolbox/ascenddmi/toolboxug_000002.html) “ascend-dmi工具使用 > 故障诊断”章节。
-- 
Gitee
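
Not part of the patch: the `MS_NPU_ASD_CONFIG` format documented above (comma-separated `key:value` items with the listed defaults and ranges) could be checked in a launch script before the variable is exported. The following is a minimal sketch under stated assumptions, not MindSpore's own parser: the names `ASD_DEFAULTS` and `parse_asd_config` are hypothetical, the accepted boolean spellings (`true`/`false` only) are an assumption, and how MindSpore itself treats unknown keys or malformed values is not stated in the patch.

```python
# Hypothetical helper: validates an MS_NPU_ASD_CONFIG string against the
# defaults and value ranges listed in the documentation added by this patch.
ASD_DEFAULTS = {
    "enable": False,
    "with_checksum": False,
    "grad_sample_interval": 10,
    "upper_thresh1": 1000000,
    "upper_thresh2": 100,
    "cooldown": 5,
    "strikes_num": 3,
    "strikes_window": 480,
    "checksum_cooldown": 180,
}


def parse_asd_config(raw):
    """Parse 'key:value,key:value' into a dict, filling documented defaults."""
    config = dict(ASD_DEFAULTS)
    for item in raw.split(","):
        item = item.strip()
        if not item:
            continue
        key, sep, value = item.partition(":")
        key, value = key.strip(), value.strip()
        if not sep or key not in ASD_DEFAULTS:
            raise ValueError(f"unknown or malformed item: {item!r}")
        if isinstance(ASD_DEFAULTS[key], bool):
            # Assumption: only the lowercase spellings shown in the docs.
            if value not in ("true", "false"):
                raise ValueError(f"{key} expects true or false, got {value!r}")
            config[key] = value == "true"
        else:
            config[key] = int(value)  # raises ValueError on non-integers
    # Ranges stated in the docs: thresholds >= 3, time values positive.
    for key in ("upper_thresh1", "upper_thresh2"):
        if config[key] < 3:
            raise ValueError(f"{key} must be an integer >= 3")
    for key in ("cooldown", "strikes_window", "checksum_cooldown"):
        if config[key] <= 0:
            raise ValueError(f"{key} must be a positive integer (minutes)")
    return config


if __name__ == "__main__":
    demo = "enable:true,with_checksum:true,grad_sample_interval:1,cooldown:1,strikes_num:1"
    print(parse_asd_config(demo))
```

The demo string is the one used in the start command of the tutorial; keys it omits, such as `strikes_window`, fall back to the documented defaults.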