diff --git a/docs/mindspore/source_en/api_python/env_var_list.rst b/docs/mindspore/source_en/api_python/env_var_list.rst index e18ace87b7212d39faac4d63519fbab38de398b4..d88ee1b49fc9af3e5317bb8f7ee3ae7b9420f49b 100644 --- a/docs/mindspore/source_en/api_python/env_var_list.rst +++ b/docs/mindspore/source_en/api_python/env_var_list.rst @@ -882,37 +882,6 @@ Silent Data Corruption Detection - Type - Value - Description - * - NPU_ASD_ENABLE - - Whether to enable feature value detection function - - Integer - - 0: Disable feature value detection function - - 1: Enable feature value detection function, when error was detected, just print log, not throw exception - - 2: Enable feature value detection function, when error was detected, throw exception - - 3: Enable feature value detection function, when error was detected, throw exception, but at the same time write value detection info of each time to log file (this requires set ascend log level to info or debug) - - Currently, this feature only supports the Atlas A2 training series products, and only detects abnormal feature value that occur during the training of Transformer class models with bfloat16 data type - - Considering that the feature value range can not be known ahead, setting NPU_ASD_ENABLE to 1 is recommended to enable silent check, which prevents training interruption caused by false detection - * - NPU_ASD_UPPER_THRESH - - Controls the absolute numerical threshold for detection - - String - - The format is a pair of integers, where the first element controls the first-level absolute numerical threshold, and the second element controls the second-level absolute numerical threshold - - Decreasing the threshold can detect smaller fluctuations of abnormal data, increasing the detection rate, while increasing the threshold has the opposite effect - - By default, if this environment variable is not configured, `NPU_ASD_UPPER_THRESH=1000000,10000` - - - * - NPU_ASD_SIGMA_THRESH - - Controls the relative numerical threshold for detection - - String - - The format is a pair of integers, where the first element controls the first-level relative numerical threshold, and the second element controls the second-level relative numerical threshold - - Decreasing the threshold can detect smaller fluctuations of abnormal data, increasing the detection rate, while increasing the threshold has the opposite effect - - By default, if this environment variable is not configured, `NPU_ASD_SIGMA_THRESH=100000,5000` - - * - MS_SDC_DETECT_ENABLE - Whether to enable CheckSum for silent data corruption detection - Integer diff --git a/docs/mindspore/source_zh_cn/api_python/env_var_list.rst b/docs/mindspore/source_zh_cn/api_python/env_var_list.rst index fa7134defdecf282b7acef6208184081d24c672b..01b4557a4989da3b2f795c2e3391458b2c531be4 100644 --- a/docs/mindspore/source_zh_cn/api_python/env_var_list.rst +++ b/docs/mindspore/source_zh_cn/api_python/env_var_list.rst @@ -879,37 +879,6 @@ Dump调试 - 类型 - 取值 - 说明 - * - NPU_ASD_ENABLE - - 是否开启特征值检测功能 - - Integer - - 0:关闭特征值检测功能 - - 1:检测到异常,只打印日志,但检测算子不抛异常 - - 2:检测到异常,打印日志,检测算子抛出异常 - - 3:特征值正常和异常场景下都会打印(备注:正常场景下只有CANN开启了INFO及DEBUG级别才会打印),检测到异常时检测算子抛出异常 - - 目前本特性仅支持Atlas A2训练系列产品,仅支持检测Transformer类模型,bfloat16数据类型,训练过程中出现的特征值检测异常 - - 考虑到无法事先知道数据特征值的分布范围,建议设置NPU_ASD_ENABLE的值为1来使能静默检测,以防止误检导致训练中断 - * - NPU_ASD_UPPER_THRESH - - 控制检测的绝对数值阈值 - - String - - 格式为整型数据对,其中第一个元素控制绝对数值一级阈值,第二个元素控制绝对数值二级阈值 - - 减小阈值可以检出波动更小的异常数据,增加检出率,增大阈值与之相反 - - 在不配置该环境变量的默认情况下,`NPU_ASD_UPPER_THRESH=1000000,10000` - - - * - NPU_ASD_SIGMA_THRESH - - 控制检测的相对数值阈值 - - String - - 格式为整型数据对,其中第一个元素控制相对数值一级阈值,第二个元素控制相对数值二级阈值 - - 减小阈值可以检出波动更小的异常数据,增加检出率,增大阈值与之相反 - - 在不配置该环境变量的默认情况下,`NPU_ASD_SIGMA_THRESH=100000,5000` - - * - MS_SDC_DETECT_ENABLE - 是否使能CheckSum检测静默故障 - Integer diff --git a/tutorials/source_en/debug/sdc.md b/tutorials/source_en/debug/sdc.md index 02f51c912effe48e99b6040b03328262dc1d8b84..8da5d4f24b04cc163911f4140569ff3b123615e2 100644 --- a/tutorials/source_en/debug/sdc.md +++ b/tutorials/source_en/debug/sdc.md @@ -10,10 +10,6 @@ During model training, processors may encounter feature value detection anomalie ### Solution -The MindSpore framework version 2.4 provides a solution for feature value detection of Transformer structure models. Internally a feature value detection operator is inserted before the communication operator in the backward graph to monitor feature value and prevent anomaly from spreading to other cards. - -For default feature value detection checkpoints, users can enable detection capability using the environment variable `NPU_ASD_ENABLE=1`, `NPU_ASD_ENABLE=2` or `NPU_ASD_ENABLE=3`, and adjust the detection intensity by configuring the environment variables `NPU_ASD_UPPER_THRESH` and `NPU_ASD_SIGMA_THRESH`. - The Mindspore framework version 2.7.1 provides a combined detection scheme using feature value and CheckSum, which can more accurately locate silent faults. It samples parameter gradients for feature value detection. When multiple anomalies occur, CheckSum is triggered by a "strike out" mechanism to further locate the faulty device. Users can configure the combined detection through `MS_NPU_ASD_CONFIG`. For information on configuring related environment variables, see **Feature Switches and Configuration**. @@ -32,8 +28,6 @@ Based on experimental results, the following empirical conclusions are drawn: * Setting too many checkpoints will affect model training performance. * Based on experiments on the sensitivity of calculation errors, the MindSpore framework defaults to selecting the `Norm` activation value gradient in the backpropagation calculation process as the detection feature value, with performance loss less than 2% based on **Llama 2 - 7B** testing. -After enabling the detection switch (set `NPU_ASD_ENABLE` to `1`, `2` or `3`), during the backpropagation phase of training Transformer structure models, abnormality is determined by collecting the activation value gradients of the Norm layer through calling the detection operator inserted before the communication operator in the backward graph, and using an algorithm to determine if an anomaly exists. If an anomaly occurs, print the relevant logs or terminate the training depending on the different values of the environment variable `NPU_ASD_ENABLE`, and set the NPU state on the device where the anomaly is detected to Warning to report the fault event. - When using the combined detection scheme of feature value and CheckSum (set `enable:true` in `MS_NPU_ASD_CONFIG`), feature values are sampled from parameter gradients before communication in the backward graph, and an algorithm is used to detect if an anomaly exists. If feature value detection working with CheckSum (set `with_checksum:true` in `MS_NPU_ASD_CONFIG`), CheckSum will be performed when the number of anomalies exceeds the threshold within a time window. CheckSum verifies the calculation results of the MatMul operator of bfloat16 data type on each device to identify the faulty one. The reasons for feature value anomalies can be divided into two categories: hardware errors and software errors, which can be referred to in the **Fault Handling** section for further analysis. @@ -46,12 +40,6 @@ The combined detection scheme currently only supports `auto_parallel` or `semi_a ## Feature Switches and Configuration -The environment variable `NPU_ASD_ENABLE` serves as a feature switch, `export NPU_ASD_ENABLE=1`, `export NPU_ASD_ENABLE=2` or `export NPU_ASD_ENABLE=3` to enable this feature; if this environment variable is not configured or `export NPU_ASD_ENABLE=0`, this feature is disabled. - -The environment variable `NPU_ASD_UPPER_THRESH` controls the absolute numerical threshold of detection, in the format of integer pairs, where the first element controls the first-level threshold of absolute numerical values, and the second element controls the second-level threshold of absolute numerical values; reducing the threshold can detect smaller fluctuations in abnormal data, increase the detection rate, and increasing the threshold is the opposite. In the default case where this environment variable is not configured, `NPU_ASD_UPPER_THRESH=1000000,10000`. - -The environment variable `NPU_ASD_SIGMA_THRESH` controls the relative numerical threshold of detection, in the same format as the above, where the first element controls the first-level threshold of numerical changes, and the second element controls the second-level threshold of numerical changes; by default, `NPU_ASD_SIGMA_THRESH=100000,5000`. - The environment variable `MS_NPU_ASD_CONFIG` configures the combined detection scheme of feature value and CheckSum, in the format of `key:value`, with each configuration item separated by commas. `enable` is the feature value detection switch, `with_checksum` is the CheckSum linkage switch, `grad_sample_interval` is the feature value sampling interval, `upper_thresh1` and `upper_thresh2` control the absolute and relative thresholds of feature value detection respectively, `cooldown` is the feature value detection anomaly cooldown time and the CheckSum execution time, `strikes_num` and `strikes_window` are the number of feature value detection anomalies and the time window required to trigger CheckSum, and `checksum_cooldown` is the CheckSum cooldown time. By default, `MS_NPU_ASD_CONFIG="enable:false,with_checksum:false,grad_sample_interval:10,upper_thresh1:1000000,upper_thresh2:100,cooldown:5,strikes_num:3,strikes_window:480,checksum_cooldown:180"`. For details of above environment variables, see [Environment Variables](https://www.mindspore.cn/docs/en/master/api_python/env_var_list.html). @@ -60,220 +48,6 @@ For details of above environment variables, see [Environment Variables](https:// > This document describes the usage methods and use cases of feature value detection. -### Model and Dataset Preparation - -To provide a complete experience, simple neural networks and a simulated dataset are constructed here, and the use of feature value detection is demonstrated by simulating feature value anomalies through MindSpore's fault injection operator (a step that is not required in the actual network). - -The full python script (`silent_check.py`) is as below: - -```python -"""Silent Check Demo""" - -import os -import numpy as np -import mindspore as ms -import mindspore.dataset as ds -from mindspore import nn, ops -from mindspore.communication import init -from mindspore.common.initializer import initializer -from mindspore.parallel.auto_parallel import AutoParallel -from mindspore.nn.utils import no_init_parameters - - -ms.set_context(mode=ms.GRAPH_MODE) -ms.runtime.set_memory(max_size="2GB") -init() -ms.set_seed(1) -np.random.seed(1) - -class Network(nn.Cell): - """Network""" - def __init__(self): - super().__init__() - self.flatten = ops.Flatten() - self.fc1_weight = ms.Parameter(initializer("normal", [28*28, 512], ms.float32)) - self.fc2_weight = ms.Parameter(initializer("normal", [512, 512], ms.float32)) - self.fc3_weight = ms.Parameter(initializer("normal", [512, 10], ms.float32)) - self.matmul1 = ops.MatMul() - self.relu1 = ops.ReLU() - self.matmul2 = ops.MatMul() - self.relu2 = ops.ReLU() - self.matmul3 = ops.MatMul() - # ====== begin ====== operator and parameter for injecting fault ========================== - self.eod_mask = ops.auto_generate.GenerateEodMaskV2() - self.cur_step = ms.Parameter(ms.Tensor(-1, ms.int64), requires_grad=False) - rank_id = os.environ['RANK_ID'] - print(f'rank id of process {os.getpid()} is {rank_id}') - if rank_id == '2': - self.flip_mode = 'bitflip_designed' # bitflip, bitflip_designed, multiply, multiply_max - else: - self.flip_mode = 'multiply' # bitflip, bitflip_designed, multiply, multiply_max - # ====== *end* ====== operator and parameter for injecting fault ========================== - - def construct(self, x): - x = self.flatten(x) - x = self.matmul1(x, self.fc1_weight) - # ====== begin ====== inject eod_mask ===================================================== - ele_pos = ms.Tensor(1, ms.int64) - seed = ms.Tensor(0, ms.int64) - offset = ms.Tensor(0, ms.int64) - start = 0 - steps = [5] - error_mode = 'cycle' # cycle, specific - multiply_factor = 1.0 - bit_pos = 0 - flip_probability = 0.0 - # GenerateEodMaskV2()(input=, ele_pos=, cur_step=, seed= - # , offset=, start=, steps=, error_mode= - # , flip_mode=, multiply_factor=, bit_pos=, flip_probability=) - self.cur_step = self.cur_step + 1 - x = self.eod_mask(x, ele_pos, self.cur_step, seed, offset, start, steps, error_mode, self.flip_mode, - multiply_factor, bit_pos, flip_probability) - # ====== *end* ====== inject eod_mask ===================================================== - x = self.relu1(x) - x = self.matmul2(x, self.fc2_weight) - x = self.relu2(x) - logits = self.matmul3(x, self.fc3_weight) - return logits - -with no_init_parameters(): - net = Network() - optimizer = nn.SGD(net.trainable_params(), 1e-2) -net.matmul1.shard(((1, 4), (4, 1))) -net.relu1.shard(((4, 1),)) -net.matmul2.shard(((1, 4), (4, 1))) -net.relu2.shard(((4, 1),)) -parallel_net = AutoParallel(net, parallel_mode='semi_auto') - -# fake dataset -def create_dataset(batch_size): - # """create dataset""" - # Random-accessible object as input source - class RandomAccessDataset: - def __init__(self): - self.dataset_size = 20 - - def __getitem__(self, index): - image_np = np.random.randn(batch_size, 1, 28, 28).astype(np.float32) + 10 - label_np = np.random.randint(low=0, high=10, size=batch_size, dtype=np.int32) - return ms.Tensor(image_np), ms.Tensor(label_np) - - def __len__(self): - return self.dataset_size - - loader = RandomAccessDataset() - return ds.GeneratorDataset(source=loader, column_names=["image", "label"]) - - -data_set = create_dataset(32) -loss_fn = nn.CrossEntropyLoss() - -def forward_fn(data, target): - """forward propagation""" - logits = net(data) - loss = loss_fn(logits, target) - return loss, logits - -grad_fn = ms.value_and_grad(forward_fn, None, net.trainable_params(), has_aux=True) - -@ms.jit -def train_step(inputs, targets): - """train_step""" - (loss_value, _), grads = grad_fn(inputs, targets) - optimizer(grads) - return loss_value - -# training -for epoch in range(1): - i = 0 - for image, label in data_set: - loss_output = train_step(image, label) - myrank_id = os.environ['RANK_ID'] - if i % 10 == 0 and myrank_id == '0': - print("rank %s, epoch: %s, step: %s, loss is %s" % (myrank_id, epoch, i, loss_output)) - i += 1 -``` - -### Running Silent Check Script - -This silent check demo uses 4 NPU cards, the start script (`run_silent_check.sh`) is as follows: - -```bash -#!/bin/bash - -# set cann log level to info -export ASCEND_GLOBAL_LOG_LEVEL=1 - -mpirun -n 4 --output-filename log_output --merge-stderr-to-stdout python silent_check.py -``` - -### Different Detection Levels and Running Results - -#### Execution Result of Setting NPU_ASD_ENABLE to 1 - -When `NPU_ASD_ENABLE` was set to `1`, if error was detected, just print `ERROR` log, not stop training process. - -Start command: - -```bash -NPU_ASD_ENABLE=1 bash run_silent_check.sh -``` - -From the CANN log, by default the log path is `~/ascend/log/`, the main `ERROR` logs are as follows, there are many `ERROR` logs. After error was detected, the training process was not stopped. - -```bash -$ cd ~/ascend/log/debug/ -$ grep -nr 'silent_check_v[2-9].cc:.*SilentCheck' device-* -device-0/device-299066_20250225184036913.log:1968:[ERROR] AICPU(26533,aicpu_scheduler):2025-02-25-18:40:56.176.403 [silent_check_v3.cc:250][ComputeL1Error][tid:26552]SilentCheckV3 ComputeL1Error:val = [nan], max = [nan], avg=[1.128970e-09], step=[5], c_thresh_l1 = [1.000000e+06], c_thresh_l2 = [1.000000e+04], beta1 = [9.900000e-01], npu_asd_detect = [1]. -device-0/device-299066_20250225184036913.log:2134:[ERROR] AICPU(26533,aicpu_scheduler):2025-02-25-18:40:56.269.071 [silent_check_v3.cc:250][ComputeL1Error][tid:26547]SilentCheckV3 ComputeL1Error:val = [nan], max = [nan], avg=[6.995705e-08], step=[6], c_thresh_l1 = [1.000000e+06], c_thresh_l2 = [1.000000e+04], beta1 = [9.900000e-01], npu_asd_detect = [1]. -device-0/device-299066_20250225184036913.log:2190:[ERROR] AICPU(26533,aicpu_scheduler):2025-02-25-18:40:56.275.860 [silent_check_v3.cc:250][ComputeL1Error][tid:26548]SilentCheckV3 ComputeL1Error:val = [nan], max = [nan], avg=[nan], step=[6], c_thresh_l1 = [1.000000e+06], c_thresh_l2 = [1.000000e+04], beta1 = [9.900000e-01], npu_asd_detect = [1]. -device-0/device-299066_20250225184036913.log:2246:[ERROR] AICPU(26533,aicpu_scheduler):2025-02-25-18:40:56.282.746 [silent_check_v3.cc:250][ComputeL1Error][tid:26549]SilentCheckV3 ComputeL1Error:val = [nan], max = [nan], avg=[1.526131e-09], step=[6], c_thresh_l1 = [1.000000e+06], c_thresh_l2 = [1.000000e+04], beta1 = [9.900000e-01], npu_asd_detect = [1]. -device-0/device-299066_20250225184036913.log:2357:[ERROR] AICPU(26533,aicpu_scheduler):2025-02-25-18:40:56.366.766 [silent_check_v3.cc:250][ComputeL1Error][tid:26549]SilentCheckV3 ComputeL1Error:val = [nan], max = [nan], avg=[nan], step=[7], c_thresh_l1 = [1.000000e+06], c_thresh_l2 = [1.000000e+04], beta1 = [9.900000e-01], npu_asd_detect = [1]. -device-0/device-299066_20250225184036913.log:2413:[ERROR] AICPU(26533,aicpu_scheduler):2025-02-25-18:40:56.373.589 [silent_check_v3.cc:250][ComputeL1Error][tid:26550]SilentCheckV3 ComputeL1Error:val = [nan], max = [nan], avg=[nan], step=[7], c_thresh_l1 = [1.000000e+06], c_thresh_l2 = [1.000000e+04], beta1 = [9.900000e-01], npu_asd_detect = [1]. -``` - -#### Execution Result of Setting NPU_ASD_ENABLE to 2 - -When `NPU_ASD_ENABLE` was set to `2`, if error was detected, print `ERROR` log and stop training process. - -Start command: - -```bash -NPU_ASD_ENABLE=2 bash run_silent_check.sh -``` - -From the CANN log, by default the log path os `~/ascend/log/`, the main `ERROR` logs are as follows, there only one `ERROR` log. After error was detected, the training process was stopped. - -```bash -$ cd ~/ascend/log/debug/ -$ grep -nr 'silent_check_v[2-9].cc:.*SilentCheck' device-* -device-2/device-305322_20250225184310213.log:1859:[ERROR] AICPU(25787,aicpu_scheduler):2025-02-25-18:43:29.395.610 [silent_check_v3.cc:250][ComputeL1Error][tid:25799]SilentCheckV3 ComputeL1Error:val = [nan], max = [nan], avg=[5.752283e-08], step=[5], c_thresh_l1 = [1.000000e+06], c_thresh_l2 = [1.000000e+04], beta1 = [9.900000e-01], npu_asd_detect = [2]. -``` - -#### Execution Result of Setting NPU_ASD_ENABLE to 3 - -When `NPU_ASD_ENABLE` was set to `3`, the action is similar to detection level `2`, i.e. if error was detected, print `ERROR` log and stop training process. Besides an `INFO` log as also output for non anomaly feature values (In oreder to see logs of level INFO, need to set `export ASCEND_GLOBAL_LOG_LEVEL=0` to enable log level `DEBUG` or set `export ASCEND_GLOBAL_LOG_LEVEL=1` to enable log level `INFO`). - -Start command: - -```bash -NPU_ASD_ENABLE=3 bash run_silent_check.sh -``` - -From the CANN log, by default the log path os `~/ascend/log/`, the main `ERROR` logs are as follows, there are `INFO` logs except for `ERROR` log about feature value info. - -```bash -$ cd ~/ascend/log/debug/ -$ grep -nr 'silent_check_v[2-9].cc:.*SilentCheck' device-* -device-2/device-311523_20250225184632284.log:1767:[INFO] AICPU(26559,aicpu_scheduler):2025-02-25-18:46:51.678.981 [silent_check_v3.cc:240][SilentCheck][tid:26572]SilentCheckV3 normal case, val=[3.350337e-08], max=[3.350337e-08], avg=[9.879098e-10], step=[4], c_threshold_l1=[1.000000e+06], c_threshold_l2=[1.000000e+04], beta1=[9.900000e-01], npu_asd_detect=[3]. -device-2/device-311523_20250225184632284.log:1829:[INFO] AICPU(26559,aicpu_scheduler):2025-02-25-18:46:51.684.993 [silent_check_v3.cc:240][SilentCheck][tid:26570]SilentCheckV3 normal case, val=[2.016393e+02], max=[2.016393e+02], avg=[7.349676e+00], step=[4], c_threshold_l1=[1.000000e+06], c_threshold_l2=[1.000000e+04], beta1=[9.900000e-01], npu_asd_detect=[3]. -device-2/device-311523_20250225184632284.log:1891:[ERROR] AICPU(26559,aicpu_scheduler):2025-02-25-18:46:51.762.577 [silent_check_v3.cc:250][ComputeL1Error][tid:26572]SilentCheckV3 ComputeL1Error:val = [nan], max = [nan], avg=[5.752281e-08], step=[5], c_thresh_l1 = [1.000000e+06], c_thresh_l2 = [1.000000e+04], beta1 = [9.900000e-01], npu_asd_detect = [3]. -``` - -### Combined Detection - -When using the combined detection scheme (set `enable:true` in `MS_NPU_ASD_CONFIG`), the feature value detection method corresponding to this scheme will be used, and `NPU_ASD_ENABLE` will be ignored. - A simple neural network is constructed here, and feature value anomalies are simulated through MindSpore's fault injection operator. The network script (`silent_detect.py`) is as below: ```python diff --git a/tutorials/source_zh_cn/debug/sdc.md b/tutorials/source_zh_cn/debug/sdc.md index 7ad0be2e8e43a174277d2219c235b0d8fb7fb73c..8d2f71d3473e681425d548a2ad910aa73540e8c9 100644 --- a/tutorials/source_zh_cn/debug/sdc.md +++ b/tutorials/source_zh_cn/debug/sdc.md @@ -10,10 +10,6 @@ ### 解决方案 -MindSpore框架2.4版本提供了网络模型的特征值检测方案,该方案主要是在反向图的通信算子前插入特征值检测算子,监测异常值并防止异常值扩散到其他卡。 - -对于默认的特征值检测点,用户可以设置环境变量 `NPU_ASD_ENABLE` 为`1`、`2`或`3`使能检测能力,并且通过配置环境变量 `NPU_ASD_UPPER_THRESH`, `NPU_ASD_SIGMA_THRESH`,调整检测强度。 - MindSpore框架2.7.1版本提供了特征值与CheckSum联合检测方案,能够更加准确地定位静默故障。该方案采样参数梯度进行特征值检测,并在检测到多次特征值异常时,通过“三振出局”机制触发CheckSum校验,进一步定位故障卡。用户可以通过`MS_NPU_ASD_CONFIG`对联合检测进行配置。 关于相关环境变量的配置,见 **特性开关及配置**。 @@ -32,8 +28,6 @@ MindSpore框架2.7.1版本提供了特征值与CheckSum联合检测方案,能 * 过多的检测点设置会影响模型训练性能。 * 根据计算错误检测敏感性实验结果,MindSpore框架默认选择反向传播计算过程中的`Norm`激活值梯度作为检测特征值,基于 **Llama 2 - 7B** 测试性能损失小于 2%。 -开启检测开关(设置`NPU_ASD_ENABLE`为`1`,`2`或`3`)后,针对Transformer结构模型训练的反向阶段,通过在反向图的通信算子前插入检测算子,采集Norm层的激活值梯度,并通过算法判断是否异常。若出现异常,则根据环境变量`NPU_ASD_ENABLE`的不同取值打印相关日志或终止训练,并将检测到异常的设备上的NPU状态置为Warning,上报故障事件。 - 采用特征值与CheckSum联合检测方案(`MS_NPU_ASD_CONFIG`中设置`enable:true`)时,会在反向图中对参数通信前的梯度进行特征值采样,并通过算法判断是否异常。当联合CheckSum校验(`MS_NPU_ASD_CONFIG`中设置`with_checksum:true`)时,若在时间窗口内异常次数超过阈值,会进一步开启CheckSum校验,对各卡bfloat16数据类型的MatMul算子的计算结果进行校验。 特征值异常原因可分为两类:硬件错误与软件错误,可参考**故障处理**章节进行后续分析。 @@ -46,12 +40,6 @@ MindSpore框架2.7.1版本提供了特征值与CheckSum联合检测方案,能 ## 特性开关及配置 -环境变量`NPU_ASD_ENABLE`作为特性开关,`export NPU_ASD_ENABLE=1`、`export NPU_ASD_ENABLE=2`或`export NPU_ASD_ENABLE=3`开启本特性;不配置该环境变量或`export NPU_ASD_ENABLE=0`关闭本特性。 - -环境变量`NPU_ASD_UPPER_THRESH`控制检测的绝对数值阈值,格式为整型数据对,其中第一个元素控制绝对数值一级阈值,第二个元素控制绝对数值二级阈值;减小阈值可以检出波动更小的异常数据,增加检出率,增大阈值与之相反。在不配置该环境变量的默认情况下,`NPU_ASD_UPPER_THRESH=1000000,10000`。 - -环境变量`NPU_ASD_SIGMA_THRESH`控制检测的相对数值阈值,格式与上者相同,其中第一个元素控制数值跳变一级阈值,第二个元素控制数值跳变二级阈值;默认情况下,`NPU_ASD_SIGMA_THRESH=100000,5000`。 - 环境变量`MS_NPU_ASD_CONFIG`对特征值和CheckSum联合检测方案进行配置,格式为key:value,并以逗号分隔各个配置项。其中`enable`为特征值检测开关,`with_checksum`为联动CheckSum开关,`grad_sample_interval`为特征值采样间隔,`upper_thresh1`和`upper_thresh2`分别控制特征值检测的绝对阈值和相对阈值,`cooldown`为特征值异常冷却时间和单次CheckSum执行时长,`strikes_num`和`strikes_window`为触发CheckSum所需的特征值异常次数和时间窗口大小,`checksum_cooldown`为CheckSum冷却时间。默认情况下,`MS_NPU_ASD_CONFIG="enable:false,with_checksum:false,grad_sample_interval:10,upper_thresh1:1000000,upper_thresh2:100,cooldown:5,strikes_num:3,strikes_window:480,checksum_cooldown:180"`。 上述环境变量的详细说明参见[环境变量](https://www.mindspore.cn/docs/zh-CN/master/api_python/env_var_list.html)。 @@ -60,220 +48,6 @@ MindSpore框架2.7.1版本提供了特征值与CheckSum联合检测方案,能 > 本文档介绍特征值检测的使用方法以及用例。 -### 模型与数据集准备 - -为了提供完整的体验,这里构造了简单的神经网络和一个模拟数据集,并通过MindSpore的故障注入算子模拟特征值异常(实际网络中不需要该步骤)来展示特征值检测的使用方法。 - -完整的脚本(`silent_check.py`)如下: - -```python -"""Silent Check Demo""" - -import os -import numpy as np -import mindspore as ms -import mindspore.dataset as ds -from mindspore import nn, ops -from mindspore.communication import init -from mindspore.common.initializer import initializer -from mindspore.parallel.auto_parallel import AutoParallel -from mindspore.nn.utils import no_init_parameters - - -ms.set_context(mode=ms.GRAPH_MODE) -ms.runtime.set_memory(max_size="2GB") -init() -ms.set_seed(1) -np.random.seed(1) - -class Network(nn.Cell): - """Network""" - def __init__(self): - super().__init__() - self.flatten = ops.Flatten() - self.fc1_weight = ms.Parameter(initializer("normal", [28*28, 512], ms.float32)) - self.fc2_weight = ms.Parameter(initializer("normal", [512, 512], ms.float32)) - self.fc3_weight = ms.Parameter(initializer("normal", [512, 10], ms.float32)) - self.matmul1 = ops.MatMul() - self.relu1 = ops.ReLU() - self.matmul2 = ops.MatMul() - self.relu2 = ops.ReLU() - self.matmul3 = ops.MatMul() - # ====== begin ====== operator and parameter for injecting fault ========================== - self.eod_mask = ops.auto_generate.GenerateEodMaskV2() - self.cur_step = ms.Parameter(ms.Tensor(-1, ms.int64), requires_grad=False) - rank_id = os.environ['RANK_ID'] - print(f'rank id of process {os.getpid()} is {rank_id}') - if rank_id == '2': - self.flip_mode = 'bitflip_designed' # bitflip, bitflip_designed, multiply, multiply_max - else: - self.flip_mode = 'multiply' # bitflip, bitflip_designed, multiply, multiply_max - # ====== *end* ====== operator and parameter for injecting fault ========================== - - def construct(self, x): - x = self.flatten(x) - x = self.matmul1(x, self.fc1_weight) - # ====== begin ====== inject eod_mask ===================================================== - ele_pos = ms.Tensor(1, ms.int64) - seed = ms.Tensor(0, ms.int64) - offset = ms.Tensor(0, ms.int64) - start = 0 - steps = [5] - error_mode = 'cycle' # cycle, specific - multiply_factor = 1.0 - bit_pos = 0 - flip_probability = 0.0 - # GenerateEodMaskV2()(input=, ele_pos=, cur_step=, seed= - # , offset=, start=, steps=, error_mode= - # , flip_mode=, multiply_factor=, bit_pos=, flip_probability=) - self.cur_step = self.cur_step + 1 - x = self.eod_mask(x, ele_pos, self.cur_step, seed, offset, start, steps, error_mode, self.flip_mode, - multiply_factor, bit_pos, flip_probability) - # ====== *end* ====== inject eod_mask ===================================================== - x = self.relu1(x) - x = self.matmul2(x, self.fc2_weight) - x = self.relu2(x) - logits = self.matmul3(x, self.fc3_weight) - return logits - -with no_init_parameters(): - net = Network() - optimizer = nn.SGD(net.trainable_params(), 1e-2) -net.matmul1.shard(((1, 4), (4, 1))) -net.relu1.shard(((4, 1),)) -net.matmul2.shard(((1, 4), (4, 1))) -net.relu2.shard(((4, 1),)) -parallel_net = AutoParallel(net, parallel_mode='semi_auto') - -# fake dataset -def create_dataset(batch_size): - # """create dataset""" - # Random-accessible object as input source - class RandomAccessDataset: - def __init__(self): - self.dataset_size = 20 - - def __getitem__(self, index): - image_np = np.random.randn(batch_size, 1, 28, 28).astype(np.float32) + 10 - label_np = np.random.randint(low=0, high=10, size=batch_size, dtype=np.int32) - return ms.Tensor(image_np), ms.Tensor(label_np) - - def __len__(self): - return self.dataset_size - - loader = RandomAccessDataset() - return ds.GeneratorDataset(source=loader, column_names=["image", "label"]) - - -data_set = create_dataset(32) -loss_fn = nn.CrossEntropyLoss() - -def forward_fn(data, target): - """forward propagation""" - logits = net(data) - loss = loss_fn(logits, target) - return loss, logits - -grad_fn = ms.value_and_grad(forward_fn, None, net.trainable_params(), has_aux=True) - -@ms.jit -def train_step(inputs, targets): - """train_step""" - (loss_value, _), grads = grad_fn(inputs, targets) - optimizer(grads) - return loss_value - -# training -for epoch in range(1): - i = 0 - for image, label in data_set: - loss_output = train_step(image, label) - myrank_id = os.environ['RANK_ID'] - if i % 10 == 0 and myrank_id == '0': - print("rank %s, epoch: %s, step: %s, loss is %s" % (myrank_id, epoch, i, loss_output)) - i += 1 -``` - -### 网络运行脚本 - -上面网络脚本是4卡并行,其启动脚本(`run_silent_check.sh`)内容如下: - -```bash -#!/bin/bash - -# set cann log level to info -export ASCEND_GLOBAL_LOG_LEVEL=1 - -mpirun -n 4 --output-filename log_output --merge-stderr-to-stdout python silent_check.py -``` - -### 不同检测级别及运行日志 - -#### NPU_ASD_ENABLE 取值为 1 的运行日志 - -`NPU_ASD_ENABLE`取值为`1`的行为是当检测到特征值异常时,只打印 ERROR 日志,不中止训练。 - -启动命令: - -```bash -NPU_ASD_ENABLE=1 bash run_silent_check.sh -``` - -通过查看 CANN 的 device 日志,默认在 `~/ascend/log/` 目录下,关键 ERROR 日志如下,从中可以有多条 ERROR 日志,即检测到异常值是并未中止训练。 - -```bash -$ cd ~/ascend/log/debug/ -$ grep -nr 'silent_check_v[2-9].cc:.*SilentCheck' device-* -device-0/device-299066_20250225184036913.log:1968:[ERROR] AICPU(26533,aicpu_scheduler):2025-02-25-18:40:56.176.403 [silent_check_v3.cc:250][ComputeL1Error][tid:26552]SilentCheckV3 ComputeL1Error:val = [nan], max = [nan], avg=[1.128970e-09], step=[5], c_thresh_l1 = [1.000000e+06], c_thresh_l2 = [1.000000e+04], beta1 = [9.900000e-01], npu_asd_detect = [1]. -device-0/device-299066_20250225184036913.log:2134:[ERROR] AICPU(26533,aicpu_scheduler):2025-02-25-18:40:56.269.071 [silent_check_v3.cc:250][ComputeL1Error][tid:26547]SilentCheckV3 ComputeL1Error:val = [nan], max = [nan], avg=[6.995705e-08], step=[6], c_thresh_l1 = [1.000000e+06], c_thresh_l2 = [1.000000e+04], beta1 = [9.900000e-01], npu_asd_detect = [1]. -device-0/device-299066_20250225184036913.log:2190:[ERROR] AICPU(26533,aicpu_scheduler):2025-02-25-18:40:56.275.860 [silent_check_v3.cc:250][ComputeL1Error][tid:26548]SilentCheckV3 ComputeL1Error:val = [nan], max = [nan], avg=[nan], step=[6], c_thresh_l1 = [1.000000e+06], c_thresh_l2 = [1.000000e+04], beta1 = [9.900000e-01], npu_asd_detect = [1]. -device-0/device-299066_20250225184036913.log:2246:[ERROR] AICPU(26533,aicpu_scheduler):2025-02-25-18:40:56.282.746 [silent_check_v3.cc:250][ComputeL1Error][tid:26549]SilentCheckV3 ComputeL1Error:val = [nan], max = [nan], avg=[1.526131e-09], step=[6], c_thresh_l1 = [1.000000e+06], c_thresh_l2 = [1.000000e+04], beta1 = [9.900000e-01], npu_asd_detect = [1]. -device-0/device-299066_20250225184036913.log:2357:[ERROR] AICPU(26533,aicpu_scheduler):2025-02-25-18:40:56.366.766 [silent_check_v3.cc:250][ComputeL1Error][tid:26549]SilentCheckV3 ComputeL1Error:val = [nan], max = [nan], avg=[nan], step=[7], c_thresh_l1 = [1.000000e+06], c_thresh_l2 = [1.000000e+04], beta1 = [9.900000e-01], npu_asd_detect = [1]. -device-0/device-299066_20250225184036913.log:2413:[ERROR] AICPU(26533,aicpu_scheduler):2025-02-25-18:40:56.373.589 [silent_check_v3.cc:250][ComputeL1Error][tid:26550]SilentCheckV3 ComputeL1Error:val = [nan], max = [nan], avg=[nan], step=[7], c_thresh_l1 = [1.000000e+06], c_thresh_l2 = [1.000000e+04], beta1 = [9.900000e-01], npu_asd_detect = [1]. -``` - -#### NPU_ASD_ENABLE 取值为 2 的运行日志 - -`NPU_ASD_ENABLE`取值为`2`的行为是当检测到特征值异常时,打印 ERROR 日志并中止训练。 - -启动命令: - -```bash -NPU_ASD_ENABLE=2 bash run_silent_check.sh -``` - -通过查看 CANN 的 device 日志,默认在 `~/ascend/log/` 目录下,关键 ERROR 日志如下,发现只有一条 ERROR 日志,即检测到异常值时中止了训练: - -```bash -$ cd ~/ascend/log/debug/ -$ grep -nr 'silent_check_v[2-9].cc:.*SilentCheck' device-* -device-2/device-305322_20250225184310213.log:1859:[ERROR] AICPU(25787,aicpu_scheduler):2025-02-25-18:43:29.395.610 [silent_check_v3.cc:250][ComputeL1Error][tid:25799]SilentCheckV3 ComputeL1Error:val = [nan], max = [nan], avg=[5.752283e-08], step=[5], c_thresh_l1 = [1.000000e+06], c_thresh_l2 = [1.000000e+04], beta1 = [9.900000e-01], npu_asd_detect = [2]. -``` - -#### NPU_ASD_ENABLE 取值为 3 的运行日志 - -`NPU_ASD_ENABLE`取值为`3`的行为是当检测到特征值异常时,与取值`2`行为类似,除打印 ERROR 日志并中止训练外,还会在 CANN 日志中打印特征值没有异常时检测算子的入参信息(需要通过`export ASCEND_GLOBAL_LOG_LEVEL=0`开启debug级别日志或`export ASCEND_GLOBAL_LOG_LEVEL=1`开启info级别日志才会把非异常场景下的日志输出到 CANN 日志中)。 - -启动命令: - -```bash -NPU_ASD_ENABLE=3 bash run_silent_check.sh -``` - -通过查看 CANN 的 device 日志,默认在 `~/ascend/log/` 目录下,关键 ERROR 日志如下,发现除了 ERROR 日志之外,还有一些 SilentCheck 的 INFO 日志: - -```bash -$ cd ~/ascend/log/debug/ -$ grep -nr 'silent_check_v[2-9].cc:.*SilentCheck' device-* -device-2/device-311523_20250225184632284.log:1767:[INFO] AICPU(26559,aicpu_scheduler):2025-02-25-18:46:51.678.981 [silent_check_v3.cc:240][SilentCheck][tid:26572]SilentCheckV3 normal case, val=[3.350337e-08], max=[3.350337e-08], avg=[9.879098e-10], step=[4], c_threshold_l1=[1.000000e+06], c_threshold_l2=[1.000000e+04], beta1=[9.900000e-01], npu_asd_detect=[3]. -device-2/device-311523_20250225184632284.log:1829:[INFO] AICPU(26559,aicpu_scheduler):2025-02-25-18:46:51.684.993 [silent_check_v3.cc:240][SilentCheck][tid:26570]SilentCheckV3 normal case, val=[2.016393e+02], max=[2.016393e+02], avg=[7.349676e+00], step=[4], c_threshold_l1=[1.000000e+06], c_threshold_l2=[1.000000e+04], beta1=[9.900000e-01], npu_asd_detect=[3]. -device-2/device-311523_20250225184632284.log:1891:[ERROR] AICPU(26559,aicpu_scheduler):2025-02-25-18:46:51.762.577 [silent_check_v3.cc:250][ComputeL1Error][tid:26572]SilentCheckV3 ComputeL1Error:val = [nan], max = [nan], avg=[5.752281e-08], step=[5], c_thresh_l1 = [1.000000e+06], c_thresh_l2 = [1.000000e+04], beta1 = [9.900000e-01], npu_asd_detect = [3]. -``` - -### 联合检测 - -使用联合检测方案时(`MS_NPU_ASD_CONFIG`中设置`enable:true`),会采用该方案对应的特征值检测方法,`NPU_ASD_ENABLE`不再生效。 - 这里构造了一个简单的神经网络,并通过MindSpore的故障注入算子模拟特征值异常。网络脚本(`silent_detect.py`)如下: ```python