From 31cdfb67e171ec06e55c9a4b16cc23b2b9f0a7b9 Mon Sep 17 00:00:00 2001
From: Soleil <1272134437@qq.com>
Date: Tue, 1 Jun 2021 17:19:30 +0800
Subject: [PATCH] Add Translation

---
 .../source_en/neural_network_debug.md         | 249 ++++++++++++++++++
 1 file changed, 249 insertions(+)
 create mode 100644 docs/migration_guide/source_en/neural_network_debug.md

diff --git a/docs/migration_guide/source_en/neural_network_debug.md b/docs/migration_guide/source_en/neural_network_debug.md
new file mode 100644
index 0000000000..7eecc57ab8
--- /dev/null
+++ b/docs/migration_guide/source_en/neural_network_debug.md
@@ -0,0 +1,249 @@
# Network Debugging

- [Network Debugging](#network-debugging)
    - [The Basic Process Of Network Debugging](#the-basic-process-of-network-debugging)
    - [Common Methods Used In Network Debugging](#common-methods-used-in-network-debugging)
        - [Process Debugging](#process-debugging)
            - [Process Debugging With PyNative Mode](#process-debugging-with-pynative-mode)
            - [Getting More Error Messages](#getting-more-error-messages)
            - [Common Errors](#common-errors)
        - [Loss Value Comparison](#loss-value-comparison)
            - [Main Steps](#main-steps)
            - [Locating Related Issues](#locating-related-issues)
        - [Precision Debugging Tools](#precision-debugging-tools)
            - [Customized Debugging Information](#customized-debugging-information)
            - [Hyperparameter Tuning With MindOptimizer](#hyperparameter-tuning-with-mindoptimizer)
            - [Locating Abnormal Loss Values](#locating-abnormal-loss-values)

This chapter introduces the basic ideas and common tools of network debugging, as well as solutions to some common problems.

## The Basic Process Of Network Debugging

Network debugging generally proceeds through the following steps:

1. Debug the network process: the whole network executes without errors, the loss value is output properly, and the parameter update completes normally.

    In general, if a step executes completely through the `model.train` interface without reporting an error, the network has executed normally and the parameter update has completed. If you need to confirm this precisely, you can save checkpoint files for two consecutive steps by setting `save_checkpoint_steps=1` in `mindspore.train.callback.CheckpointConfig`, or save a checkpoint file directly with the `save_checkpoint` interface, and then print the weight values in the checkpoint files with the following code to check whether the weights of the two steps have changed and the update has completed.

    ```python
    import mindspore
    import numpy as np

    # ckpt_path is the path of the saved checkpoint file
    ckpt = mindspore.load_checkpoint(ckpt_path)
    for param in ckpt:
        value = ckpt[param].data.asnumpy()
        print(value)
    ```

2. Run the network for multiple iterations and output the loss values, and check that the loss values show a basic convergence trend.

3. Debug the network accuracy and tune the hyperparameters.

## Common Methods Used In Network Debugging

### Process Debugging

This section introduces problems that may occur during network process debugging after script development is basically completed, and their solutions.

#### Process Debugging With PyNative Mode

For script development and network process debugging, we recommend debugging in PyNative mode. PyNative mode supports executing single operators, ordinary functions, and networks, as well as computing gradients separately. In PyNative mode, you can easily set breakpoints and obtain intermediate results of network execution, and you can also debug the network with pdb.

By default, MindSpore is in PyNative mode, which can also be set explicitly via `context.set_context(mode=context.PYNATIVE_MODE)`. Related examples can be found in [Debugging With PyNative Mode](https://www.mindspore.cn/tutorial/training/zh-CN/master/advanced_use/debug_in_pynative_mode.html#pynative).
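For illustration only, the following is a minimal sketch of this style of debugging. It assumes a toy `Net` cell with a single `nn.Dense` layer (not part of the original example): a single operator is executed eagerly, and an intermediate result is printed inside `construct`, where a `pdb.set_trace()` call could be placed instead.

```python
import numpy as np
import mindspore.nn as nn
import mindspore.ops as ops
from mindspore import context, Tensor

context.set_context(mode=context.PYNATIVE_MODE)

# A single operator can be executed eagerly and its result inspected at once.
x = Tensor(np.ones((2, 3)).astype(np.float32))
y = Tensor(np.ones((2, 3)).astype(np.float32))
print(ops.Add()(x, y))

class Net(nn.Cell):
    """Toy network used only to show how intermediate results can be inspected."""
    def __init__(self):
        super(Net, self).__init__()
        self.dense = nn.Dense(3, 4)

    def construct(self, x):
        out = self.dense(x)
        # In PyNative mode every statement runs immediately, so a plain print
        # (or import pdb; pdb.set_trace()) here shows the intermediate result.
        print(out.shape)
        return out

print(Net()(x))
```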
#### Getting More Error Messages

During network process debugging, if you need more information about an error, you can obtain it in the following ways:

- Debug with pdb in PyNative mode, and use pdb to print the relevant stack and context information to help locate the problem.
- Use the Print operator to print more context information. Related examples can be found in [Print Operator Features](https://www.mindspore.cn/tutorial/training/zh-CN/master/advanced_use/custom_debugging_info.html#print).
- Adjust the log level to get more error information. MindSpore log levels can be adjusted easily through environment variables. Related examples can be found in [Logging-related Environment Variables And Configurations](https://www.mindspore.cn/tutorial/training/zh-CN/master/advanced_use/custom_debugging_info.html#id6). A small sketch combining the Print operator with the log-level environment variable follows this list.
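The snippet below is a hedged sketch of the last two items, not a definitive recipe: it assumes the `GLOG_v` environment variable is set before MindSpore is imported and that the backend in use supports the Print operator; `PrintNet` is a toy cell introduced only for this example.

```python
import os
# Assumption: raise the log verbosity to INFO (1); must be set before importing mindspore.
os.environ['GLOG_v'] = '1'

import numpy as np
import mindspore.nn as nn
import mindspore.ops as ops
from mindspore import context, Tensor

context.set_context(mode=context.GRAPH_MODE)

class PrintNet(nn.Cell):
    """Toy cell showing how the Print operator exposes intermediate values."""
    def __init__(self):
        super(PrintNet, self).__init__()
        self.dense = nn.Dense(3, 4)
        self.print = ops.Print()

    def construct(self, x):
        out = self.dense(x)
        # Print runs as part of the compiled graph, so the intermediate value
        # is still visible even though Python-side breakpoints are not.
        self.print("dense output:", out)
        return out

PrintNet()(Tensor(np.ones((2, 3)).astype(np.float32)))
```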
#### Common Errors

In network process debugging, common errors include the following:

- An operator reports an error during execution.

    Errors such as shape mismatches and unsupported dtypes are often reported during operator execution in network process debugging. In this case, check against the error message whether the operator is used correctly and whether the shape of the operator's input data matches expectations, and make the corresponding modifications.

    Operator support and API introductions can be found in the [Operator Support List](https://www.mindspore.cn/doc/programming_guide/zh-CN/master/operator_list.html) and [Operators Python API](https://www.mindspore.cn/doc/api_python/zh-CN/master/index.html).

- The same script runs in PyNative mode but reports errors in Graph mode.

    In MindSpore's Graph mode, the code in the `construct` function is parsed by the MindSpore framework, and some Python syntax is not yet supported, which results in errors. In this case, modify the code according to the error message, following [MindSpore's Syntax Description](https://www.mindspore.cn/doc/note/zh-CN/master/static_graph_syntax_support.html).

- The distributed parallel training script or environment is misconfigured.

    Distributed parallel training scripts and environment configuration can be found in the [Distributed Parallel Training Tutorial](https://www.mindspore.cn/tutorial/training/zh-CN/master/advanced_use/distributed_training_tutorials.html).

### Loss Value Comparison

With a benchmark script available, the loss values produced by the benchmark script can be compared with those produced by the MindSpore script to verify the correctness of the overall network structure and the accuracy of the operators.

#### Main Steps

1. Guarantee identical inputs.

    It is necessary to ensure that the inputs to both networks are the same, so that with the same network structure the outputs are also the same. The same inputs can be guaranteed in the following ways:

    - Construct the input data with numpy to ensure the same network inputs. MindSpore supports free conversion between Tensor and numpy. The following script can be used to construct the input data:

        ```python
        input = Tensor(np.random.randint(0, 10, size=(3, 5, 10)).astype(np.float32))
        ```

    - Use the same dataset for computation. MindSpore supports the TFRecord dataset, which can be read with the `mindspore.dataset.TFRecordDataset` interface.

2. Remove the influence of randomness in the network (a small sketch follows this list).

    The main methods for removing the effects of randomness are setting the same random seed, turning off data shuffling, and modifying the code to remove the effects of dropout, initializers, and other operators with randomness in the network.

3. Ensure the same settings for the relevant hyperparameters.

    It is necessary to ensure the same hyperparameter settings in both networks so that, with the same inputs, the operators produce the same outputs.

4. Run the networks and compare the output loss values. Generally, the error of the loss values is about 1%, because the operators themselves have a certain accuracy error, and as the number of steps increases the error accumulates to some extent.
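As a rough illustration of step 2 (not an exhaustive recipe), the sketch below fixes the random seeds and turns off data shuffling; `data_path` is a placeholder for whatever dataset files both scripts read, and dropout-like operators still have to be removed or disabled in the scripts themselves.

```python
import numpy as np
import mindspore
import mindspore.dataset as ds

# Fix the global seeds so that weight initialization and random operators
# produce the same values on every run.
mindspore.set_seed(1)
np.random.seed(1)

# data_path is a placeholder; shuffle=False turns off data shuffling so both
# networks read the samples in the same order.
dataset = ds.TFRecordDataset(data_path, shuffle=False)
```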
#### Locating Related Issues

If the loss error is large, the problem can be located in the following ways:

- Check whether the inputs and hyperparameter settings are the same, and whether the effects of randomness have been completely removed.

    If the loss values differ significantly across multiple reruns of the same script, the effects of randomness in the network have not been completely removed.

- Make an overall judgment.

    If there is a large error in the loss value of the first step, there is a problem in the forward computation of the network.

    If the loss value of the first step is within the error range but a large error appears from the second step, the forward computation of the network should be fine and the problem may lie in the backward gradient or weight update computation.

- After making the overall judgment, compare the accuracy of input and output values from coarse to fine.

    First, compare the input and output values of each subnet layer by layer, starting from the input, and identify the subnet where the problem first appears.

    Then, compare the network structure within that subnet and the inputs and outputs of its operators, find the problematic network structure or operator, and modify it.

    If you find any operator accuracy problems in the process, you can raise an issue on the [MindSpore Code Hosting Platform](https://gitee.com/mindspore/mindspore), and the relevant personnel will follow up on the problem.

- MindSpore provides various tools for obtaining intermediate network data, which can be used as appropriate:

    - [Data Dump Function](https://www.mindspore.cn/tutorial/training/zh-CN/master/advanced_use/custom_debugging_info.html#dump)
    - [Use Print Operator To Print Related Information](https://www.mindspore.cn/tutorial/training/zh-CN/master/advanced_use/custom_debugging_info.html#print)
    - [Using The Visualization Component MindInsight](https://www.mindspore.cn/tutorial/training/zh-CN/master/advanced_use/visualization_tutorials.html)

### Precision Debugging Tools

#### Customized Debugging Information

- [Callback Function](https://www.mindspore.cn/tutorial/training/zh-CN/master/advanced_use/custom_debugging_info.html#callback)

    MindSpore provides Callback classes such as ModelCheckpoint, LossMonitor, and SummaryCollector for saving model parameters, monitoring loss values, saving training process information, and so on. Users can also customize Callback functions to run operations at points such as the beginning and end of each epoch and step; refer to [Custom Callback](https://www.mindspore.cn/tutorial/training/zh-CN/master/advanced_use/custom_debugging_info.html#id3) for specific examples.

- [MindSpore Metrics Function](https://www.mindspore.cn/tutorial/training/zh-CN/master/advanced_use/custom_debugging_info.html#mindspore-metrics)

    When training is finished, metrics can be used to evaluate the training results. MindSpore provides a variety of metrics for evaluation, such as `accuracy`, `loss`, `precision`, `recall`, and `F1`.

- [Inference While Training](https://www.mindspore.cn/tutorial/training/zh-CN/master/advanced_use/evaluate_the_model_during_training.html)

    Inference can be performed during training by defining a Callback function that runs inference.

- [Custom Training Loops](https://www.mindspore.cn/doc/programming_guide/zh-CN/master/train.html#%E8%87%AA%E5%AE%9A%E4%B9%89%E8%AE%AD%E7%BB%83%E5%BE%AA%E7%8E%AF)

- Customized Learning Rate

    MindSpore provides several common dynamic learning rate implementations and several common optimizers with adaptive learning rate adjustment; see [Dynamic Learning Rate](https://www.mindspore.cn/doc/api_python/zh-CN/master/mindspore/mindspore.nn.html#dynamic-learning-rate) and [Optimizer Functions](https://www.mindspore.cn/doc/api_python/zh-CN/master/mindspore/mindspore.nn.html#optimizer-functions) in the API documentation.

    Users can also implement a customized dynamic learning rate, as exemplified by WarmUpLR:

    ```python
    import mindspore.ops as ops
    import mindspore.common.dtype as mstype
    from mindspore.nn import LearningRateSchedule
    from mindspore._checkparam import Validator as validator


    class WarmUpLR(LearningRateSchedule):
        def __init__(self, learning_rate, warmup_steps):
            super(WarmUpLR, self).__init__()
            ## check the input
            if not isinstance(learning_rate, float):
                raise TypeError("learning_rate must be float.")
            validator.check_non_negative_float(learning_rate, "learning_rate", self.cls_name)
            validator.check_positive_int(warmup_steps, 'warmup_steps', self.cls_name)
            ## define the operators
            self.warmup_steps = warmup_steps
            self.learning_rate = learning_rate
            self.min = ops.Minimum()
            self.cast = ops.Cast()

        def construct(self, global_step):
            ## calculate the lr
            warmup_percent = self.cast(self.min(global_step, self.warmup_steps), mstype.float32) / self.warmup_steps
            return self.learning_rate * warmup_percent
    ```
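As a short usage sketch (assuming `net` is an `nn.Cell` that has already been defined elsewhere), such a schedule instance can be passed directly to an optimizer as its learning rate:

```python
import mindspore.nn as nn

# Assumption: net is an already-defined nn.Cell; WarmUpLR is the class sketched above.
warmup_lr = WarmUpLR(learning_rate=0.01, warmup_steps=1000)
optimizer = nn.Momentum(net.trainable_params(), learning_rate=warmup_lr, momentum=0.9)
```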
#### Hyperparameter Tuning With MindOptimizer

MindSpore provides the MindOptimizer tool to help users tune hyperparameters more conveniently; refer to [Hyperparameter Tuning With MindOptimizer](https://www.mindspore.cn/tutorial/training/zh-CN/master/advanced_use/hyper_parameters_auto_tuning.html) for detailed examples and usage.

#### Locating Abnormal Loss Values

For cases where the loss value is INF or NAN, or the loss value does not converge, investigate the following scenarios:

1. Check for loss_scale overflow.

    When loss_scale is used with mixed precision, an INF or NAN loss value may be caused by a scale value that is too large. With dynamic loss_scale, the scale value is adjusted automatically; with static loss_scale, the scale value needs to be reduced.

    If the loss value is still INF or NAN with `scale=1`, an operator in the network has probably overflowed and further localization is needed.

2. Abnormal loss values may be caused by abnormal input data, operator overflow, vanishing gradients, exploding gradients, etc.

    To check intermediate values in the network, such as operator overflow, zero gradients, abnormal weights, vanishing gradients, and exploding gradients, it is recommended to use the [MindInsight Debugger](https://www.mindspore.cn/tutorial/training/zh-CN/master/advanced_use/debugger.html) to set corresponding watchpoints for detection and debugging, which locates problems more comprehensively and offers stronger debuggability.

    The following are a few simple initial troubleshooting methods:

    - Observe whether the weight values are abnormal, or load the saved checkpoint file and print the weight values, to make a preliminary judgment. The weight values can be printed with code like the following:

        ```python
        import mindspore
        import numpy as np

        # ckpt_path is the path of the saved checkpoint file
        ckpt = mindspore.load_checkpoint(ckpt_path)
        for param in ckpt:
            value = ckpt[param].data.asnumpy()
            print(value)
        ```

    - Check whether the gradients are 0, or compare whether the weight values of checkpoint files saved at different steps have changed, to make a preliminary judgment. The weight values of two checkpoint files can be compared with code like the following:

        ```python
        import mindspore
        import numpy as np

        # ckpt1_path and ckpt2_path point to checkpoint files saved at two different steps
        ckpt1 = mindspore.load_checkpoint(ckpt1_path)
        ckpt2 = mindspore.load_checkpoint(ckpt2_path)
        total = 0
        same = 0
        for param1, param2 in zip(ckpt1, ckpt2):
            total = total + 1
            value1 = ckpt1[param1].data.asnumpy()
            value2 = ckpt2[param2].data.asnumpy()
            # a weight that is identical in both checkpoints was probably not updated
            if np.array_equal(value1, value2):
                print('same value: ', param1, value1)
                same = same + 1
        print('All params num: ', total)
        print('same params num: ', same)
        ```

    - Check whether there is abnormal NAN or INF data in the weight values; you can also load the checkpoint file for a simple judgment. In general, if NAN or INF appears in the weight values, it also appears in the gradient calculation, and an overflow may have occurred. The relevant code reference is as follows:

        ```python
        import mindspore
        import numpy as np

        ckpt = mindspore.load_checkpoint(ckpt_path)
        for param in ckpt:
            value = ckpt[param].data.asnumpy()
            # value is a numpy array, so .any() is used to check every element
            if np.isnan(value).any():
                print('NAN value:', param, value)
            if np.isinf(value).any():
                print('INF value:', param, value)
        ```
--
Gitee