diff --git a/docs/migration_guide/source_en/neural_network_debug.en.md b/docs/migration_guide/source_en/neural_network_debug.en.md
new file mode 100644
index 0000000000000000000000000000000000000000..012e892b2ac4e81a6698c16916700f82f4cb777b
--- /dev/null
+++ b/docs/migration_guide/source_en/neural_network_debug.en.md
@@ -0,0 +1,231 @@

# Network Debugging

[TOC]

[![img](https://gitee.com/mindspore/docs/raw/master/resource/_static/logo_source.png)](https://gitee.com/mindspore/docs/blob/master/docs/migration_guide/source_zh_cn/neural_network_debug.md)

This chapter introduces the basic ideas of network debugging, the common tools, and some common problems.

## Basic Process of Network Debugging

Network debugging is mainly divided into the following steps:

1. Debug the network process successfully: the network as a whole executes without reporting an error, the loss value is output correctly, and the parameter update completes normally.

    In general, if a step finishes through the `model.train` interface without reporting an error, execution is normal and the parameter update has completed. To confirm this precisely, set `save_checkpoint_steps=1` in `mindspore.train.callback.CheckpointConfig` to save a Checkpoint file for two consecutive steps, or save Checkpoint files directly through the `save_checkpoint` interface, and then print the weight values in the Checkpoint files with the following code to check whether the weights changed between the two steps, that is, whether the update has completed.

    ```python
    import mindspore
    import numpy as np

    # ckpt_path is the path of a saved Checkpoint file
    ckpt = mindspore.load_checkpoint(ckpt_path)
    for param in ckpt:
        value = ckpt[param].data.asnumpy()
        print(value)
    ```

2. Run multiple iterations of the network, output the loss values, and check that the loss values show a basic convergence trend.

3. Debug the network precision and tune the hyper-parameters.

## Common methods in network debugging

### Process debugging

This section introduces the problems that may occur during network process debugging after script development is basically completed, and the corresponding solutions.

#### Use PyNative mode for process debugging

During script development and network process debugging, we recommend using PyNative mode. PyNative mode supports executing single operators, ordinary functions, and networks, as well as computing gradients separately. In PyNative mode, you can easily set breakpoints to obtain intermediate results of network execution, and you can also debug the network through pdb.

By default, MindSpore is in PyNative mode, and the mode can also be set explicitly through `context.set_context(mode=context.PYNATIVE_MODE)`. For related examples, please refer to [Use PyNative Mode to Debug](https://www.mindspore.cn/tutorial/training/zh-CN/master/advanced_use/debug_in_pynative_mode.html#pynative).

#### Get more error information

If you need more error information during network process debugging, you can obtain it in the following ways:

- In PyNative mode, use pdb for debugging and print the related stack and context information to help locate the problem.
- Use the Print operator to print more context information. For specific examples, please refer to [Print operator function introduction](https://www.mindspore.cn/tutorial/training/zh-CN/master/advanced_use/custom_debugging_info.html#print).
- Adjust the log level to get more error information. MindSpore can adjust the log level conveniently through environment variables. For details, please refer to [Log-related environment variables and configuration](https://www.mindspore.cn/tutorial/training/zh-CN/master/advanced_use/custom_debugging_info.html#id6).
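As a quick illustration of the last point, one way to raise the log verbosity is to set the MindSpore logging environment variable before the framework is imported; this minimal sketch assumes the `GLOG_v` level values described in the linked log configuration page (smaller values are more verbose):

```python
import os

# 0-DEBUG, 1-INFO, 2-WARNING, 3-ERROR; must be set before mindspore is imported
os.environ['GLOG_v'] = '1'

import mindspore
```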
#### Common mistakes

In network process debugging, common errors are as follows:

- Operator execution errors

    During network process debugging, operator execution errors such as shape mismatch and unsupported dtype often occur. In this case, check whether the operator is used correctly according to the error message and whether the shape of the operator's input data is consistent with expectations, and modify the script accordingly.

    For operator support and API introductions, please refer to the [Operator Support List](https://www.mindspore.cn/doc/programming_guide/zh-CN/master/operator_list.html) and the [Operator Python API](https://www.mindspore.cn/doc/api_python/zh-CN/master/index.html).

- The same script runs in PyNative mode but reports an error in Graph mode

    In MindSpore's Graph mode, the code in the `construct` function is parsed by the MindSpore framework, and some Python syntax is not yet supported, which leads to errors. In this case, modify the relevant code according to the error message and the [MindSpore syntax description](https://www.mindspore.cn/doc/note/zh-CN/master/static_graph_syntax_support.html).

- Configuration errors in distributed parallel training scripts

    For the distributed parallel training script and environment configuration, please refer to the [Distributed Parallel Training Tutorial](https://www.mindspore.cn/tutorial/training/zh-CN/master/advanced_use/distributed_training_tutorials.html).

### Loss value comparison check

When a benchmark script is available, the loss value produced by the MindSpore script can be compared with the loss value produced by the benchmark script to verify the correctness of the overall network architecture and the accuracy of the operators.

#### The main steps

1. Ensure that the inputs are the same

    The inputs of the two networks must be the same so that, under the same network structure, the network outputs are the same. There are several ways to ensure identical inputs:

    - Construct the input data yourself with NumPy to ensure that the network inputs are the same. MindSpore supports free conversion between Tensor and NumPy. The input data can be constructed as in the following script:

        ```python
        import numpy as np
        from mindspore import Tensor

        input_data = Tensor(np.random.randint(0, 10, size=(3, 5, 10)).astype(np.float32))
        ```

    - Use the same dataset for computation. MindSpore supports the TFRecord dataset, which can be read through the `mindspore.dataset.TFRecordDataset` interface.

2. Remove the effects of random factors in the network

    The main methods of removing the influence of randomness in the network are setting the same random seed, turning off data shuffling, and modifying the code to remove the influence of random operators in the network such as dropout and initializer (see the sketch after this list).

3. Ensure that the settings of the related hyper-parameters are the same

    The hyper-parameter settings in the two networks must be the same, so that the same inputs produce the same operator outputs.

4. Run the networks and compare the output loss values. Generally, the loss value error is around 1‰, because the operators themselves have a certain accuracy error, and the error accumulates to some extent as the number of steps increases.
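As an illustration of steps 1 and 2, a minimal sketch of fixing the random seeds and reading the same TFRecord data without shuffling might look like the following; the dataset path and batch size are placeholders:

```python
import numpy as np
import mindspore.dataset as ds
from mindspore import set_seed

# Fix the random seeds for MindSpore, NumPy and the dataset pipeline.
set_seed(1)
np.random.seed(1)
ds.config.set_seed(1)

# Read the same TFRecord dataset in both scripts without shuffling;
# "/path/to/data.tfrecord" is a placeholder path.
dataset = ds.TFRecordDataset("/path/to/data.tfrecord", shuffle=False)
dataset = dataset.batch(32, drop_remainder=True)
```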
#### Related problem positioning

If the loss value error is large, the following methods can be used to locate the problem:

- Check whether the inputs and hyper-parameter settings are the same, and whether the random effects have been completely removed.

    If the same script is re-run several times and the loss values differ considerably, the random influence in the network has not been completely removed.

- Make an overall judgment.

    If there is a large error in the loss value of the first round, there is a problem with the forward calculation of the network.

    If the loss value of the first round is within the error range but a larger error appears from the second round onwards, the forward calculation of the network should be fine, and the problem probably lies in the backward gradient calculation or the weight update.

- Based on the overall judgment, compare the accuracy of the input and output values from coarse to fine.

    First, compare the input and output values of each subnet layer by layer, starting from the input, to determine the subnet where the problem first appears.

    Then, compare the network structure within that subnet and the input and output of each operator, find the network structure or operator where the problem occurs, and modify it.

    If you find an operator accuracy problem during this process, you can submit an issue on the [MindSpore code hosting platform](https://gitee.com/mindspore/mindspore), and the relevant personnel will track the problem.

- MindSpore provides a wealth of tools to obtain intermediate network data, which can be selected according to the actual situation.

    - [Data dump function](https://www.mindspore.cn/tutorial/training/zh-CN/master/advanced_use/custom_debugging_info.html#dump)
    - [Use the Print operator to print related information](https://www.mindspore.cn/tutorial/training/zh-CN/master/advanced_use/custom_debugging_info.html#print)
    - [Using the visualization component MindInsight](https://www.mindspore.cn/tutorial/training/zh-CN/master/advanced_use/visualization_tutorials.html)

### Precision Debugging Tools

#### Custom debugging information

- [Callback function](https://www.mindspore.cn/tutorial/training/zh-CN/master/advanced_use/custom_debugging_info.html#callback)

    MindSpore provides Callback classes such as ModelCheckpoint, LossMonitor, and SummaryCollector for saving model parameters, monitoring loss values, saving training process information, and so on. Users can also customize Callback functions to perform operations at the start and end of each epoch and step. For specific examples, please refer to [Custom Callback](https://www.mindspore.cn/tutorial/training/zh-CN/master/advanced_use/custom_debugging_info.html#id3).

- [MindSpore metrics function](https://www.mindspore.cn/tutorial/training/zh-CN/master/advanced_use/custom_debugging_info.html#callback)

    After training ends, metrics can be used to evaluate how good the training results are. MindSpore provides a variety of metrics, such as `accuracy`, `loss`, `precision`, `recall`, and `F1`.

- [Inference while training](https://www.mindspore.cn/tutorial/training/zh-CN/master/advanced_use/evaluate_the_model_during_training.html)

    Inference can be performed during training by defining an inference Callback function.

- [Custom training loop](https://www.mindspore.cn/doc/programming_guide/zh-CN/master/train.html#自定义训练循环)

- Custom learning rate

    MindSpore provides some common dynamic learning rate implementations and some common optimizers with adaptive learning rate adjustment. You can refer to [Dynamic Learning Rate](https://www.mindspore.cn/doc/api_python) and [Optimizer Functions](https://www.mindspore.cn/doc/api_python/zh-CN/master/mindspore/mindspore.nn.html#optimizer-functions) in the API documentation.

    Users can also implement a customized dynamic learning rate. Take WarmUpLR as an example:

    ```python
    import mindspore.ops as ops
    from mindspore.common import dtype as mstype
    from mindspore.nn import LearningRateSchedule
    # Validator is MindSpore's internal parameter-checking utility used by the built-in schedules
    from mindspore._checkparam import Validator as validator

    class WarmUpLR(LearningRateSchedule):
        def __init__(self, learning_rate, warmup_steps):
            super(WarmUpLR, self).__init__()
            ## check the input
            if not isinstance(learning_rate, float):
                raise TypeError("learning_rate must be float.")
            validator.check_non_negative_float(learning_rate, "learning_rate", self.cls_name)
            validator.check_positive_int(warmup_steps, 'warmup_steps', self.cls_name)
            ## define the operators
            self.warmup_steps = warmup_steps
            self.learning_rate = learning_rate
            self.min = ops.Minimum()
            self.cast = ops.Cast()

        def construct(self, global_step):
            ## calculate the lr: scale the learning rate linearly during the warm-up steps
            warmup_percent = self.cast(self.min(global_step, self.warmup_steps), mstype.float32) / self.warmup_steps
            return self.learning_rate * warmup_percent
    ```
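A possible usage sketch of the schedule above, assuming the WarmUpLR class just defined and an already-defined network `net` (an `nn.Cell`), is to pass an instance of it to an optimizer as the learning rate:

```python
import mindspore.nn as nn

# `net` is assumed to be an already-defined nn.Cell
warmup_lr = WarmUpLR(learning_rate=0.1, warmup_steps=1000)
optimizer = nn.Momentum(net.trainable_params(), learning_rate=warmup_lr, momentum=0.9)
```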
#### Use MindOptimizer for hyper-parameter tuning

MindSpore provides the MindOptimizer tool to help users perform hyper-parameter tuning more conveniently. For detailed examples and usage, please refer to [Using MindOptimizer for hyper-parameter tuning](https://www.mindspore.cn/tutorial/training/zh-CN/master/advanced_use/hyper_parameters_auto_tuning.html).

#### Locating abnormal loss values

If the loss value is INF or NAN, or the loss value does not converge, you can investigate the following situations:

1. Check for loss_scale overflow.

    In mixed-precision scenarios that use loss_scale, an INF or NAN loss value may be caused by an excessively large scale value. With dynamic loss_scale, the scale value is adjusted automatically; with static loss_scale, the scale value needs to be reduced (a minimal configuration sketch is given at the end of this section).

    If INF or NAN loss values still appear with `scale=1`, an operator in the network is probably overflowing, and further locating is required.

2. An abnormal loss value may be caused by abnormal input data, operator overflow, vanishing gradients, exploding gradients, and other reasons.

    To troubleshoot intermediate network values such as operator overflow, zero gradients, abnormal weights, vanishing gradients, and exploding gradients, it is recommended to use the [MindInsight Debugger](https://www.mindspore.cn/tutorial/training/zh-CN/master/advanced_use/debugger.html) and set the corresponding watch points for detection and debugging. This method locates problems more comprehensively, and its debugging capability is relatively strong.

    Here are a few simple preliminary troubleshooting methods:

    - Observe whether the weight values show gradient explosion. You can load a saved Checkpoint file and print the weight values to make a preliminary judgment. To print the weight values, refer to the following code:

        ```python
        import mindspore
        import numpy as np

        # ckpt_path is the path of a saved Checkpoint file
        ckpt = mindspore.load_checkpoint(ckpt_path)
        for param in ckpt:
            value = ckpt[param].data.asnumpy()
            print(value)
        ```
    - Check whether the gradient is 0. You can also make a preliminary judgment by comparing the weight values of Checkpoint files saved at different steps. To compare the weight values of two Checkpoint files, refer to the following code:

        ```python
        import mindspore
        import numpy as np

        ckpt1 = mindspore.load_checkpoint(ckpt1_path)
        ckpt2 = mindspore.load_checkpoint(ckpt2_path)
        total = 0
        same = 0
        for param1, param2 in zip(ckpt1, ckpt2):
            total = total + 1
            value1 = ckpt1[param1].data.asnumpy()
            value2 = ckpt2[param2].data.asnumpy()
            # a weight that is exactly the same in both steps suggests its gradient is 0
            if np.array_equal(value1, value2):
                print('same value: ', param1, value1)
                same = same + 1
        print('All params num: ', total)
        print('Same params num: ', same)
        ```

    - Check whether there are abnormal NAN or INF values in the weights. You can also make a simple judgment by loading the Checkpoint file. Generally speaking, if NAN or INF appears in the weight values, NAN or INF also appears in the gradient calculation, and overflow may have occurred. The relevant code can refer to:

        ```python
        import mindspore
        import numpy as np

        ckpt = mindspore.load_checkpoint(ckpt_path)
        for param in ckpt:
            value = ckpt[param].data.asnumpy()
            if np.any(np.isnan(value)):
                print('NAN value:', param)
            if np.any(np.isinf(value)):
                print('INF value:', param)
        ```
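For the loss_scale adjustment mentioned in item 1 above, a minimal configuration sketch is shown below; it assumes `net`, `loss_fn`, and `optimizer` are already defined, and the concrete scale values and `amp_level` are only illustrative:

```python
from mindspore.train.model import Model
from mindspore.train.loss_scale_manager import DynamicLossScaleManager, FixedLossScaleManager

# Dynamic loss scale: the scale value is adjusted automatically when overflow is detected.
loss_scale_manager = DynamicLossScaleManager(init_loss_scale=2**16, scale_factor=2, scale_window=1000)

# Static loss scale: reduce the scale value manually; scale=1 helps rule out loss_scale overflow.
# loss_scale_manager = FixedLossScaleManager(loss_scale=1, drop_overflow_update=False)

# `net`, `loss_fn` and `optimizer` are assumed to be already defined.
model = Model(net, loss_fn=loss_fn, optimizer=optimizer,
              loss_scale_manager=loss_scale_manager, amp_level="O2")
```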