From 39d945acba848bebdecca8cf4ff2fdd7753a8231 Mon Sep 17 00:00:00 2001
From: huanxiaoling <3174348550@qq.com>
Date: Fri, 18 Nov 2022 10:56:32 +0800
Subject: [PATCH] update the en files in tutorials
---
.../experts/source_en/debug/error_analyze.md | 238 ++++++++++
.../source_en/debug/function_debug.rst | 2 +
.../experts/source_en/debug/minddata_debug.md | 414 ++++++++++++++++++
.../experts/source_en/debug/mindrt_debug.md | 115 +++++
.../experts/source_en/debug/pynative_debug.md | 80 ++++
.../source_zh_cn/debug/minddata_debug.md | 8 +-
.../source_zh_cn/debug/mindrt_debug.md | 4 +-
7 files changed, 855 insertions(+), 6 deletions(-)
create mode 100644 tutorials/experts/source_en/debug/error_analyze.md
create mode 100644 tutorials/experts/source_en/debug/minddata_debug.md
create mode 100644 tutorials/experts/source_en/debug/mindrt_debug.md
create mode 100644 tutorials/experts/source_en/debug/pynative_debug.md
diff --git a/tutorials/experts/source_en/debug/error_analyze.md b/tutorials/experts/source_en/debug/error_analyze.md
new file mode 100644
index 0000000000..81f5952273
--- /dev/null
+++ b/tutorials/experts/source_en/debug/error_analyze.md
@@ -0,0 +1,238 @@
+# Error Analysis
+
+
+
+As mentioned before, error analysis refers to analyzing and inferring possible error causes based on the obtained network and framework information (such as error messages and network code).
+
+During error analysis, first identify the scenario where the error occurs: determine whether the error is caused by data loading and processing, or by network construction and training. The format of the error message usually tells you whether it is a data problem or a network problem. In the distributed parallel scenario, you can run the network on a single device for verification: if neither data loading and processing nor network construction and training is at fault there, the error comes from the parallel configuration itself. The following describes the error analysis methods in different scenarios.
+
+## Data Loading and Processing Error Analysis
+
+When an error is reported during data processing, check whether C++ error messages are contained, as shown in Figure 1. Typically, a data processing operation has the same name in C++ as in Python, so you can identify the operation that reports the error from the error message and locate it in the Python code.
+
+
+
+Figure 1
+
+As shown in Figure 1, `batch_op.cc` reports a C++ error. The batch operation combines multiple consecutive pieces of data in a dataset into one batch and is implemented at the backend. According to the error description, the input data does not meet the parameter requirements of the batch operation: all data to be batched must have the same shape, and the mismatched shape sizes are printed in the message.
+
+Data loading and processing has three phases: data preparation, data loading, and data augmentation. The following table lists common errors.
+
+| Error Type| Error Description| Case Analysis|
+|-------------|---------|---|
+| Data preparation error| The dataset is faulty, involving a path or MindRecord file problem.| [Error Case](https://www.mindspore.cn/tutorials/experts/en/master/debug/minddata_debug.html)|
+| Data loading error| Incorrect resource configuration, customized loading method, or iterator usage in the data loading phase.| [Error Case](https://www.mindspore.cn/tutorials/experts/en/master/debug/minddata_debug.html)|
+| Data augmentation error| Unmatched data format/size, high resource usage, or multi-thread suspension.| [Error Case](https://www.mindspore.cn/tutorials/experts/en/master/debug/minddata_debug.html)|
+
+## Network Construction and Training Error Analysis
+
+The network construction and training process can be executed in the dynamic graph mode or static graph mode, and has two phases: build and execution. The error analysis method varies according to the execution phase in different modes.
+
+The following table lists common network construction and training errors.
+
+| Error Type | Error Description| Case Analysis|
+| - | - | - |
+| Incorrect context configuration| An error occurs when the system configures the context.| [Error Analysis](https://mindspore.cn/tutorials/experts/en/master/debug/mindrt_debug.html)|
+| Syntax error | Python syntax errors and MindSpore static graph syntax errors, such as unsupported control flow syntax and tensor slicing errors| [Error Analysis](https://mindspore.cn/tutorials/experts/en/master/debug/mindrt_debug.html)|
+| Operator build error | The operator parameter value, type, or shape does not meet the requirements, or the operator function is restricted.| [Error Analysis](https://mindspore.cn/tutorials/experts/en/master/debug/mindrt_debug.html)|
+| Operator execution error | Input data exceptions, operator implementation errors, function restrictions, resource restrictions, etc.| [Error Analysis](https://mindspore.cn/tutorials/experts/en/master/debug/mindrt_debug.html)|
+| Insufficient resources | The device memory is insufficient, the number of function call stacks exceeds the threshold, and the number of flow resources exceeds the threshold.| [Error Analysis](https://mindspore.cn/tutorials/experts/en/master/debug/mindrt_debug.html)|
+
+- Error analysis of the dynamic graph mode
+
+ In dynamic graph mode, the program is executed line by line according to the code writing sequence, and the execution result can be returned in time. Figure 2 shows the error message reported during dynamic graph build. The error message is from the Python frontend, indicating that the number of function parameters does not meet the requirements. Through the Python call stack, you can locate the error code: `c = self.mul(b, self.func(a,a,b))`.
+
+    The error message may also contain `WARNING` logs. During error analysis, analyze the error message following `Traceback` first.
+
+ 
+
+ Figure 2
+
+ In dynamic graph mode, common network construction and training errors are found in environment configuration, Python syntax, and operator usage. The general analysis method is as follows:
+
+ - Determine the object where the error is reported based on the error description, for example, the operator API name.
+ - Locate the code line where the error is reported based on the Python call stack information.
+ - Analyze the code input data and calculation logic at the position where the error occurs, and find the error cause based on the description and specifications of the error object in the [MindSpore API](https://www.mindspore.cn/docs/en/master/api_python/mindspore.html).
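These steps can be tried out with a plain-Python analogue of the Figure 2 error: a call with too many arguments raises a `TypeError`, and the last frame of the call stack names the offending code line (the `func` helper below is hypothetical, purely for illustration):

```python
import traceback

def func(a, b):
    """A helper that expects exactly two positional arguments."""
    return a + b

caught = None
try:
    func(1, 2, 3)  # one argument too many, as in the Figure 2 scenario
except TypeError as exc:
    caught = exc
    # The last traceback frame points at the offending call line.
    error_line = traceback.extract_tb(exc.__traceback__)[-1].line

print(caught)  # func() takes 2 positional arguments but 3 were given
```

Matching the API name in the message against its description in the MindSpore API documentation is then the final step of the analysis.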
+
+- Error analysis of the static graph mode
+
+ In static graph mode, MindSpore builds the network structure into a computational graph, and then performs the computation operations involved in the graph. Therefore, errors reported in static graph mode include computational graph build errors and computational graph execution errors. Figure 3 shows the error message reported during computational graph build. When an error occurs, the `analyze_failed.dat` file is automatically saved to help analyze the location of the error code.
+
+ 
+
+ Figure 3
+
+ The general error analysis method in static graph mode is as follows:
+
+ Check whether the error is caused by graph build or graph execution based on the error description.
+
+ - If the error is reported during computational graph build, analyze the cause and location of the failure based on the error description and the `analyze_failed.dat` file automatically saved when the error occurs.
+ - If the error is reported during computational graph execution, the error may be caused by insufficient resources or improper operator execution. You need to further distinguish the error based on the error message. If the error is reported during operator execution, locate the operator, use the dump function to save the input data of the operator, and analyze the cause of the error based on the input data.
+
+ For details about how to analyze and infer the failure cause, see the analysis methods described in [`analyze_failed.dat`](https://www.mindspore.cn/tutorials/experts/en/master/debug/mindir.html#example-1-parameters-number-mismatch).
+
+ For details about how to use Dump to save the operator input data, see [Dump Function Debugging](https://www.mindspore.cn/tutorials/experts/en/master/debug/dump.html).
+
+## Distributed Parallel Error Analysis
+
+MindSpore provides the distributed parallel training function and supports multiple parallel modes. The following table lists common distributed parallel training errors and possible causes.
+
+| Error Type| Error Description |
+| ------------ | -------------------------------- |
+| Incorrect policy configuration| Incorrect operator logic.|
+| | Incorrect scalar policy configuration. |
+| | No policy configuration. |
+| Parallel script error| Incorrect script startup, or unmatched parallel configuration and startup task.|
+
+- Incorrect policy configuration
+
+ Policy check errors may be reported after you enable automatic parallelism using `mindspore.set_auto_parallel_context(parallel_mode="semi_auto_parallel")`. These policy check errors are reported due to specific operator slicing restrictions. The following uses three examples to describe how to analyze the three types of errors.
+
+ - Incorrect operator logic
+
+ The error message is as follows:
+
+ ```python
+ [ERROR]Check StridedSliceInfo1414: When there is a mask, the input is not supported to be split
+ ```
+
+        The following shows a piece of possible error code where the network input is a [2, 4] tensor. The network slices the input to obtain the first half of dimension 0, which is equivalent to the `x[:1, :]` operation in NumPy, where x is the input tensor. On the network, the (2, 1) policy is configured for the `StridedSlice` operator to slice dimension 0.
+
+        ```python
+        import numpy as np
+        import mindspore.nn as nn
+        import mindspore.ops as ops
+        from mindspore import Tensor
+
+        tensor = Tensor(np.ones((2, 4)))
+        stridedslice = ops.StridedSlice((0, 0), (1, 4), (1, 1))
+
+        class MyStridedSlice(nn.Cell):
+            def __init__(self):
+                super(MyStridedSlice, self).__init__()
+                self.slice = stridedslice.shard(((2, 1),))
+
+            def construct(self, x):
+                # x is a two-dimensional tensor
+                return self.slice(x)
+        ```
+
+ Error cause:
+
+ The piece of code performs the slice operation on dimension 0. However, the configured policy (2,1) indicates that the slice operation is performed on both dimension 0 and dimension 1 of the input tensor. According to the description of operator slicing in the [MindSpore API](https://www.mindspore.cn/docs/en/master/note/operator_list_parallel.html),
+
+ > only the mask whose value is all 0s is supported. All dimensions that are sliced must be extracted together. The input dimensions whose strides is not set to 1 cannot be sliced.
+
+ Dimensions that are sliced cannot be separately extracted. Therefore, the policy must be modified as follows:
+
+ Change the policy of dimension 0 from 2 to 1. In this way, dimension 0 will be sliced into one, that is, dimension 0 will not be sliced. Therefore, the policy meets the operator restrictions and the policy check is successful.
+
+ ```python
+ class MyStridedSlice(nn.Cell):
+ def __init__(self):
+ super(MyStridedSlice, self).__init__()
+ self.slice = stridedslice.shard(((1,1),))
+
+ def construct(self, x):
+ # x is a two-dimensional tensor
+ return self.slice(x)
+ ```
+
+ - Incorrect scalar policy configuration
+
+ Error message:
+
+ ```
+ [ERROR] The strategy is ..., strategy len:. is not equal to inputs len:., index:
+ ```
+
+ Possible error code:
+
+ ```python
+ class MySub(nn.Cell):
+ def __init__(self):
+ super(MySub, self).__init__()
+ self.sub = ops.Sub().shard(((1,1), (1,)))
+ def construct(self, x):
+ # x is a two-dimensional tensor
+ return self.sub(x, 1)
+ ```
+
+        The input of many operators can be scalars, for example, the operands of addition, subtraction, multiplication, and division, or the axis parameter of operators such as concat and gather. Do not configure policies for such scalar inputs. If, as in the preceding code, the policy (1,) is configured for the scalar 1 of the subtraction operation, an error is reported: the policy at index 1 has length 1, which is not equal to the length 0 of the corresponding input, because that input is a scalar.
+
+ Modified code:
+
+ In this case, set an empty policy for the scalar or do not set any policy (recommended method).
+
+ ```
+ self.sub = ops.Sub().shard(((1,1),()))
+
+ self.sub = ops.Sub().shard(((1,1),))
+ ```
+
+ - No policy configuration
+
+ ```
+ [ERROR]The strategy is ((8, 1)), shape 4 can not be divisible by strategy value 8
+ ```
+
+ Possible error code:
+
+ ```python
+ class MySub(nn.Cell):
+ def __init__(self):
+ super(MySub, self).__init__()
+ self.sub = ops.Sub()
+ def construct(self, x):
+ # x is a two-dimensional tensor
+ return self.sub(x, 1)
+ ```
+
+        The preceding code runs training in an 8-device environment in semi-automatic parallel mode. No policy is configured for the Sub operator, so its default policy, data parallelism, is used. Assume that the input x is a matrix of size [2, 4]. After the build starts, an error is reported because the input dimensions cannot be evenly divided by the policy values. In this case, modify the policy as follows (each slicing value must evenly divide the size of the corresponding input dimension):
+
+ ```python
+ class MySub(nn.Cell):
+ def __init__(self):
+ super(MySub, self).__init__()
+ self.sub = ops.Sub().shard(((2, 1), ()))
+ def construct(self, x):
+ # x is a two-dimensional tensor
+ return self.sub(x, 1)
+ ```
+
+ (2, 1) indicates that dimension 0 of the first input tensor is sliced into two parts, and dimension 1 is sliced into one, that is, dimension 1 is not sliced. The second input of `ops.Sub` is a scalar that cannot be sliced. Therefore, the slicing policy is set to empty ().
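The divisibility rule above can be sketched as a small stand-alone check (`is_valid_strategy` is a hypothetical helper for illustration, not a MindSpore API):

```python
def is_valid_strategy(shape, strategy):
    """Each slicing value must evenly divide the size of the
    corresponding input dimension."""
    return len(shape) == len(strategy) and all(
        dim % cut == 0 for dim, cut in zip(shape, strategy)
    )

x_shape = (2, 4)
print(is_valid_strategy(x_shape, (8, 1)))  # False: 2 is not divisible by 8
print(is_valid_strategy(x_shape, (2, 1)))  # True: the corrected policy
```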
+
+- Parallel script error
+
+    The following is a bash script that starts an 8-device training task in an Ascend environment.
+
+ ```bash
+ #!/bin/bash
+ set -e
+ EXEC_PATH=$(pwd)
+ export RANK_SIZE=8
+ export RANK_TABLE_FILE=${EXEC_PATH}/rank_table_8pcs.json
+
+    for((i=0;i<${RANK_SIZE};i++))
+    do
+        mkdir -p device$i
+        cd device$i
+        export DEVICE_ID=$i
+        export RANK_ID=$i
+        env > env$i.log
+        python ./train.py > train.log$i 2>&1 &
+        cd ../
+    done
+ echo "The program launch succeed, the log is under device0/train.log0."
+ ```
+
+ Errors may occur in the following scenarios:
+
+ 1) The number of training tasks (`RANK_SIZE`) started using the for loop does not match the number of devices configured in the `rank_table_8pcs.json` configuration file. As a result, an error is reported.
+
+    2) The command that runs the training script is not executed asynchronously (`python ./train.py > train.log$i 2>&1`). As a result, the training tasks start at different times and an error is reported. In this case, append the `&` operator to the command so that it is executed asynchronously in a subshell and all tasks can be started at the same time.
+
+ In parallel scenarios, you may encounter the `Distribute Task Failed` error. In this case, analyze whether the error occurs in the computational graph build phase or the execution phase of printing training loss to further locate the error.
+
+ For details, visit the following website:
+
+ For more information about distributed parallel errors in MindSpore, see [Distributed Task Failed](https://bbs.huaweicloud.com/forum/thread-181820-1-1.html).
diff --git a/tutorials/experts/source_en/debug/function_debug.rst b/tutorials/experts/source_en/debug/function_debug.rst
index d81c37016d..3041939829 100644
--- a/tutorials/experts/source_en/debug/function_debug.rst
+++ b/tutorials/experts/source_en/debug/function_debug.rst
@@ -8,9 +8,11 @@ Function Debug
:maxdepth: 1
:hidden:
+ error_analyze
custom_debug
mindir
dump
+ pynative_debug
pynative
fixing_randomness
diff --git a/tutorials/experts/source_en/debug/minddata_debug.md b/tutorials/experts/source_en/debug/minddata_debug.md
new file mode 100644
index 0000000000..76c2fa68be
--- /dev/null
+++ b/tutorials/experts/source_en/debug/minddata_debug.md
@@ -0,0 +1,414 @@
+# Common Data Processing Errors and Analysis Methods
+
+
+
+## Data Preparation
+
+Common errors in the data preparation phase include dataset path errors and MindRecord file errors, which occur when you read data from or save data to a path, or read or write a MindRecord file.
+
+- The dataset path contains Chinese characters.
+
+ Error log:
+
+ ```python
+ RuntimeError: Unexpected error. Failed to open file, file path E:\深度学习\models-master\official\cv\ssd\MsindRecord_COCO\test.mindrecord
+ ```
+
+ Two solutions are available:
+
+ 1. Specify the output path of the MindRecord dataset to a path containing only English characters.
+
+ 2. Upgrade MindSpore to a version later than 1.6.0.
+
+ For details, visit the following website:
+
+ [MindRecord Data Preparation - Unexpected error. Failed to open file_MindSpore](https://bbs.huaweicloud.com/forum/thread-183183-1-1.html)
+
+- MindRecord file error
+
+ - The duplicate file is not deleted.
+
+ Error log:
+
+ ```python
+ MRMOpenError: [MRMOpenError]: MindRecord File could not open successfully.
+ ```
+
+ Solution:
+
+ 1. Add the file deletion logic to the code to ensure that the MindRecord file with the same name in the directory is deleted before the file is saved.
+
+ 2. In versions later than MindSpore 1.6.0, when defining the `FileWriter` object, add `overwrite=True` to implement overwriting.
+
+ For details, visit the following website:
+
+ [MindSpore Data Preparation - MindRecord File could not open successfully](https://bbs.huaweicloud.com/forum/thread-184006-1-1.html)
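The delete-before-saving logic of solution 1 can be sketched with ordinary files (`overwrite_save` is a hypothetical stand-in for the MindRecord writer; the `.db` companion file mirrors the index file that MindRecord keeps alongside the data file):

```python
import os
import tempfile

def overwrite_save(path, payload):
    """Hypothetical stand-in for a MindRecord writer: delete any leftover
    file with the same name (and its .db index file) before saving."""
    for leftover in (path, path + ".db"):
        if os.path.exists(leftover):
            os.remove(leftover)
    with open(path, "w") as f:
        f.write(payload)

with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, "test.mindrecord")
    overwrite_save(target, "first run")
    overwrite_save(target, "second run")  # no duplicate-file error this time
    with open(target) as f:
        content = f.read()

print(content)  # second run
```

With MindSpore 1.6.0 or later, passing `overwrite=True` to `FileWriter` achieves the same effect without manual deletion.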
+
+ - The file is moved.
+
+ Error log:
+
+ ```
+ RuntimeError: Thread ID 1 Unexpected error. Fail to open ./data/cora
+ RuntimeError: Unexpected error. Invalid file, DB file can not match file
+ ```
+
+ When MindSpore 1.4 or an earlier version is used, in the Windows environment, after a MindRecord dataset file is generated and moved, the file cannot be loaded to MindSpore.
+
+ Solution:
+
+ 1. Do not move the MindRecord file generated in the Windows environment.
+
+ 2. Upgrade MindSpore to 1.5.0 or a later version and regenerate a MindRecord dataset. Then, the dataset can be copied and moved properly.
+
+ For details, visit the following website:
+
+ [MindSpore Data Preparation - Invalid file,DB file can not match_MindSpore](https://bbs.huaweicloud.com/forum/forum.php?mod=viewthread&tid=183187&page=1&authorid=&replytype=&extra=#pid1436140)
+
+ - The user-defined data type is incorrect.
+
+ Error log:
+
+ ```
+ RuntimeError: Unexpected error. Invalid data, the number of schema should be positive but got: 0. Please check the input schema.
+ ```
+
+ Solution:
+
+ Modify the input data type to ensure that it is consistent with the type definition in the script.
+
+ For details, visit the following website:
+
+ [MindSpore Data Preparation - Unexpected error. Invalid data](https://bbs.huaweicloud.com/forum/thread-189349-1-1.html)
+
+## Data Loading
+
+In the data loading phase, errors may be reported in resource configuration, `GeneratorDataset`, and iterators.
+
+- Resource configuration
+
+ - Incorrect number of CPU cores
+
+ Error log:
+
+ ```
+ RuntimeError: Thread ID 140706176251712 Unexpected error. GeneratorDataset's num_workers=8, this value is not within the required range of [1, cpu_thread_cnt=2].
+ ```
+
+ Solution:
+
+ 1. Add the following code to manually configure the number of CPU cores: `ds.config.set_num_parallel_workers()`
+
+ 2. Upgrade to MindSpore 1.6.0, which automatically adapts to the number of CPU cores in the hardware to prevent errors caused by insufficient CPU cores.
+
+ For details, visit the following website:
+
+ [MindSpore Data Loading - Unexpected error. GeneratorDataset's num_workers=8, this value is not within the required range of](https://bbs.huaweicloud.com/forum/thread-189861-1-1.html)
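The clamping behavior described in solution 2 can be sketched as follows (`safe_num_workers` is a hypothetical helper, not a MindSpore API; its result is the kind of value you would pass to `ds.config.set_num_parallel_workers()`):

```python
import os

def safe_num_workers(requested):
    """Clamp the requested worker count to the CPU cores actually
    available, similar to what MindSpore 1.6.0+ does automatically."""
    available = os.cpu_count() or 1
    return max(1, min(requested, available))

print(safe_num_workers(8))  # never exceeds this machine's core count
```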
+
+ - Incorrect PageSize setting
+
+ Error log:
+
+ ```
+ RuntimeError: Syntax error. Invalid data, Page size: 1048576 is too small to save a blob row.
+ ```
+
+ Solution:
+
+        Call the `set_page_size` API to set the page size to a larger value. The setting method is as follows:
+
+ ```python
+ from mindspore.mindrecord import FileWriter
+ writer = FileWriter(file_name="test.mindrecord", shard_num=1)
+ writer.set_page_size(1 << 26) # 128MB
+ ```
+
+ For details, visit the following website:
+
+ [MindSpore Data Loading - Invalid data,Page size is too small"](https://bbs.huaweicloud.com/forum/thread-190004-1-1.html)
+
+- `GeneratorDataset`
+
+ - Suspended `GeneratorDataset` thread
+
+ No error log is generated, and the thread is suspended.
+
+        During customized data processing, the `numpy.ndarray` and `mindspore.Tensor` data types are mixed, and `numpy.array(Tensor)` is incorrectly used for conversion. As a result, the global interpreter lock (GIL) cannot be released and `GeneratorDataset` cannot work properly.
+
+ Solution:
+
+ 1. When defining the first input parameter `source` of `GeneratorDataset`, use the `numpy.ndarray` data type if a Python function needs to be invoked.
+
+ 2. Use the `Tensor.asnumpy()` method to convert `Tensor` to `numpy.ndarray`.
+
+ For details, visit the following website:
+
+ [MindSpore Data Loading - Suspended GeneratorDataset Thread](https://bbs.huaweicloud.com/forum/forum.php?mod=viewthread&tid=183188&page=1&authorid=&replytype=&extra=#pid1436147)
+
+ - Incorrect user-defined return type
+
+ Error log:
+
+ ```
+ Unexpected error. Invalid data type.
+ ```
+
+ Error description:
+
+        A user-defined `Dataset` or `map` operation returns data of the dict type instead of a numpy array or a tuple consisting of numpy arrays. Data types other than these (such as dict and object) are not controllable and their storage layout is unclear, so the `Invalid type` error is reported.
+
+ Solution:
+
+ 1. Check the return type of the customized data processing. The return type must be numpy array or a tuple consisting of numpy arrays.
+
+ 2. Check the return type of the `__getitem__` function during customized data loading. The return type must be a tuple consisting of numpy arrays.
+
+ For details, visit the following website:
+
+ [MindSpore Dataset Loading - Unexpected error. Invalid data type_MindSpore](https://bbs.huaweicloud.com/forum/forum.php?mod=viewthread&tid=183190&page=1&authorid=&replytype=&extra=#pid1436154)
+
+ - User-defined sampler initialization error
+
+ Error log:
+
+ ```
+ AttributeError: 'IdentitySampler' object has no attribute 'child_sampler'
+ ```
+
+ Solution:
+
+        In the `__init__()` method of the user-defined sampler, use `super().__init__()` to invoke the constructor of the parent class.
+
+ For details, visit the following website:
+
+ [MindSpore Dataset Loading - 'IdentitySampler' has no attribute child_sampler](https://bbs.huaweicloud.com/forum/thread-184010-1-1.html#pid1439794)
+
+ - Repeated access definition
+
+ Error log:
+
+ ```
+ For 'Tensor', the type of "input_data" should be one of ...
+ ```
+
+ Solution:
+
+        Select a proper data access method: random access (`__getitem__`) or sequential access (`__iter__` and `__next__`).
+
+ For details, visit the following website:
+
+ [MindSpore Dataset Loading - the type of `input_data` should be one of](https://bbs.huaweicloud.com/forum/thread-184041-1-1.html)
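The two access styles can be sketched as minimal plain-Python source classes of the kind `GeneratorDataset` accepts; define one style per source, not both (class names below are illustrative):

```python
class RandomAccessSource:
    """Random access: implement __getitem__ and __len__."""
    def __init__(self, data):
        self._data = data

    def __getitem__(self, index):
        return (self._data[index],)

    def __len__(self):
        return len(self._data)


class SequentialSource:
    """Sequential access: implement __iter__ (a generator supplies __next__)."""
    def __init__(self, data):
        self._data = data

    def __iter__(self):
        for item in self._data:
            yield (item,)


random_src = RandomAccessSource([10, 20, 30])
sequential_src = SequentialSource([10, 20, 30])
print(random_src[1])            # (20,)
print(list(sequential_src)[1])  # (20,)
```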
+
+ - Inconsistency between the fields returned by the user-defined data and the defined fields
+
+ Error log:
+
+ ```
+ RuntimeError: Exception thrown from PyFunc. Invalid python function, the 'source' of 'GeneratorDataset' should return same number of NumPy arrays as specified in column_names
+ ```
+
+ Solution:
+
+ Check whether the fields returned by `GeneratorDataset` are the same as those defined in `columns`.
+
+ For details, visit the following website:
+
+ [MindSpore Dataset Loading -Exception thrown from PyFunc](https://bbs.huaweicloud.com/forum/forum.php?mod=viewthread&tid=189645&page=1&authorid=&replytype=&extra=#pid1474252)
+
+ - Incorrect user script
+
+ Error log:
+
+ ```
+        TypeError: parse() missing 1 required positional argument: 'self'
+ ```
+
+ Solution:
+
+ Debug the code step by step and check the syntax in the script to see whether '()' is missing.
+
+ For details, visit the following website:
+
+ [MindSpore Dataset Loading - parse() missing 1 required positional](https://bbs.huaweicloud.com/forum/thread-189950-1-1.html)
+
+ - Incorrect use of tensor operations or operators in custom datasets
+
+ Error log:
+
+ ```
+ RuntimeError: Exception thrown from PyFunc. RuntimeError: mindspore/ccsrc/pipeline/pynative/pynative_execute.cc:1116 GetOpOutput] : The pointer[cnode] is null.
+ ```
+
+ Error description:
+
+ Tensor operations or operators are used in custom datasets. Because data processing is performed in multi-thread parallel mode and tensor operations or operators do not support multi-thread parallel execution, an error is reported.
+
+ Solution:
+
+        In the user-defined Pyfunc, do not use MindSpore tensor operations or operators in the dataset's `__getitem__` method. You are advised to convert the input parameters to NumPy arrays and implement the required functions with NumPy operations.
+
+ For details, visit the following website:
+
+ [MindSpore Dataset Loading - The pointer[cnode] is null](https://bbs.huaweicloud.com/forum/forum.php?mod=viewthread&tid=183191)
+
+ - Index out of range due to incorrect iteration initialization
+
+ Error log:
+
+ ```
+ list index out of range
+ ```
+
+ Solution:
+
+ Remove unnecessary `index` member variables, or set `index` to 0 before each iteration to perform the reset operation.
+
+ For details, visit the following website:
+
+ [MindSpore Dataset Loading - list index out of range](https://bbs.huaweicloud.com/forum/thread-184036-1-1.html)
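The reset fix can be sketched as a plain-Python iterable source (names are illustrative):

```python
class ResettableSource:
    """Sequential source that resets its cursor at the start of each epoch."""
    def __init__(self, data):
        self._data = data
        self._index = 0

    def __iter__(self):
        self._index = 0  # the reset that prevents 'index out of range'
        return self

    def __next__(self):
        if self._index >= len(self._data):
            raise StopIteration
        item = self._data[self._index]
        self._index += 1
        return item


src = ResettableSource([1, 2, 3])
first_epoch = list(src)
second_epoch = list(src)  # non-empty only because __iter__ resets the index
print(first_epoch, second_epoch)  # [1, 2, 3] [1, 2, 3]
```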
+
+ - No iteration initialization
+
+ Error log:
+
+ ```
+ Unable to fetch data from GeneratorDataset, try iterate the source function of GeneratorDataset or check value of num_epochs when create iterator.
+ ```
+
+ The value of `len` is inconsistent with that of `iter` because iteration initialization is not performed.
+
+ Solution:
+
+ Clear the value of `iter`.
+
+ For details, visit the following website:
+
+ [MindSpore Dataset Loading - Unable to fetch data from GeneratorDataset](https://bbs.huaweicloud.com/forum/thread-189895-1-1.html)
+
+- Iterator
+
+ - Repeated iterator creation
+
+ Error log:
+
+ ```
+ oserror: [errno 24] too many open files
+ ```
+
+ Error description:
+
+ If `iter()` is repeatedly called, iterators are repeatedly created. However, because `GeneratorDataset` loads datasets in multi-thread mode by default, the handles opened each time cannot be released before the main process stops. As a result, the number of opened handles keeps increasing.
+
+ Solution:
+
+ Use the dict iterator `create_dict_iterator()` and tuple iterator `create_tuple_iterator()` provided by MindSpore.
+
+ For details, visit the following website:
+
+ [MindSpore Data Loading - too many open files](https://bbs.huaweicloud.com/forum/thread-184134-1-1.html)
+
+ - Improper data acquisition from the iterator
+
+ Error log:
+
+ ```
+ 'DictIterator' has no attribute 'get_next'
+ ```
+
+ Solution:
+
+ You can obtain the next piece of data from the iterator in either of the following ways:
+
+ ```
+ item = next(ds_test.create_dict_iterator())
+
+ for item in ds_test.create_dict_iterator():
+ ```
+
+ For details, visit the following website:
+
+ [MindSpore Dataset Loading - 'DictIterator' has no attribute 'get_next'](https://bbs.huaweicloud.com/forum/thread-184026-1-1.html#pid1439832)
+
+## Data Augmentation
+
+In the data augmentation phase, the read data is processed. Currently, MindSpore supports common data processing operations, such as shuffle, batch, repeat, and concat. You may encounter the following errors in this phase: data type errors, interface parameter type errors, consumption node conflict, data batch errors, and memory resource errors.
+
+- Incorrect data type for invoking a third-party library API in a user-defined data augmentation operation
+
+ Error log:
+
+ ```
+ TypeError: Invalid object with type'' and value''.
+ ```
+
+ Solution:
+
+ Check the data type requirements of the third-party library API used in the user-defined function, and convert the input data type to the data type expected by the API.
+
+ For details, visit the following website:
+
+ [MindSpore Data Augmentation - TypeError: Invalid with type](https://bbs.huaweicloud.com/forum/thread-184123-1-1.html)
+
+- Incorrect parameter type in a user-defined data augmentation operation
+
+ Error log:
+
+ ```
+ Exception thrown from PyFunc. TypeError: args should be Numpy narray. Got .
+ ```
+
+ Solution:
+
+    Make the number of input parameters of `__call__` (excluding `self`) equal to the number of columns in `input_columns`, and ensure that each parameter is of type numpy.ndarray. If `input_columns` is omitted, all data columns are passed by default.
+
+ For details, visit the following website:
+
+ [MindSpore Data Augmentation - args should be Numpy narray](https://bbs.huaweicloud.com/forum/forum.php?mod=viewthread&tid=183196&page=1&authorid=&replytype=&extra=#pid1436178)
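The parameter-matching rule can be sketched with a plain-Python transform class (here plain lists stand in for numpy.ndarray, and `ShiftLabel` is a hypothetical operation for a two-column `input_columns=["image", "label"]` pipeline):

```python
class ShiftLabel:
    """Transform for input_columns=["image", "label"]: __call__ takes one
    positional parameter per listed column and returns one value per column."""
    def __init__(self, offset):
        self.offset = offset

    def __call__(self, image, label):
        # Plain lists stand in for the numpy.ndarray columns here.
        return image, [v + self.offset for v in label]


op = ShiftLabel(offset=10)
image, label = op([0.1, 0.2], [1, 2])
print(image, label)  # [0.1, 0.2] [11, 12]
```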
+
+- Consumption node conflict in the dataset
+
+ Error log:
+
+ ```
+ ValueError: The data pipeline is not a tree (i.e. one node has 2 consumers)
+ ```
+
+ Error description:
+
+ A branch occurs in the dataset definition. As a result, the dataset cannot determine the direction.
+
+ Solution:
+
+    Check the dataset variable names. Generally, keep assigning the result of each operation back to the same dataset variable so that the pipeline remains a single chain.
+
+ For details, visit the following website:
+
+ [MindSpore Data Augmentation - The data pipeline is not a tree](https://bbs.huaweicloud.com/forum/thread-183193-1-1.html)
+
+- Improper batch operation due to inconsistent data shapes
+
+ Error log:
+
+ ```
+ RuntimeError: Unexpected error. Inconsistent batch shapes, batch operation expect same shape for each data row, but got inconsistent shape in column 0, expected shape for this column is:, got shape:
+ ```
+
+ Solution:
+
+ 1. Check the shapes of the data that requires the batch operation. If the shapes are inconsistent, cancel the batch operation.
+
+ 2. If you need to perform the batch operation on the data with inconsistent shapes, sort out the dataset and unify the shapes of the input data by padding.
+
+ For details, visit the following website:
+
+ [MindSpore Data Augmentation - Unexpected error. Inconsistent batch](https://bbs.huaweicloud.com/forum/thread-190394-1-1.html)
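The padding approach of solution 2 can be sketched with plain lists standing in for numpy arrays (`pad_to_max` is a hypothetical helper, not a MindSpore API):

```python
def pad_to_max(rows, pad_value=0):
    """Pad every 1-D row with pad_value to the longest row length so that
    all rows share one shape and can be batched."""
    width = max(len(row) for row in rows)
    return [list(row) + [pad_value] * (width - len(row)) for row in rows]


batch = pad_to_max([[1, 2], [3, 4, 5], [6]])
print(batch)  # [[1, 2, 0], [3, 4, 5], [6, 0, 0]]
```

In practice, MindSpore's batch operation also accepts per-column padding settings, so check the API documentation before hand-rolling padding like this.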
+
+- High memory usage due to data augmentation
+
+ Error description:
+
+ If the memory is insufficient when MindSpore performs data augmentation, MindSpore may automatically exit. In MindSpore 1.7 and later versions, an alarm is generated when the memory usage exceeds 80%. When performing large-scale data training, pay attention to the memory usage to prevent direct exit due to high memory usage.
+
+ For details, visit the following website:
+
+ [MindSpore Data Augmentation - Automatic Exit Due to Insufficient Memory](https://bbs.huaweicloud.com/forum/thread-190001-1-1.html)
diff --git a/tutorials/experts/source_en/debug/mindrt_debug.md b/tutorials/experts/source_en/debug/mindrt_debug.md
new file mode 100644
index 0000000000..74584161ef
--- /dev/null
+++ b/tutorials/experts/source_en/debug/mindrt_debug.md
@@ -0,0 +1,115 @@
+# Network Construction and Training Error Analysis
+
+
+
+The following lists the common network construction and training errors in static graph mode.
+
+## Incorrect Context Configuration
+
+When performing network training, you must specify the backend device by calling `set_context(device_target=device)`. MindSpore supports CPU, GPU, and Ascend. If the backend on a GPU machine is incorrectly specified as Ascend by calling `set_context(device_target="Ascend")`, the following error message is displayed:
+
+```python
+ValueError: For 'set_context', package type mindspore-gpu support 'device_target' type gpu or cpu, but got Ascend.
+```
+
+The running backend specified by the script must match the actual hardware device.
+
+For details, visit the following website:
+
+[MindSpore Configuration Error - 'set_context' Configuration Error](https://bbs.huaweicloud.com/forum/thread-183514-1-1.html)
+
+For details about the context configuration, see ['set_context'](https://www.mindspore.cn/docs/en/master/api_python/mindspore/mindspore.set_context.html).
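For instance, on a machine where the `mindspore-gpu` package is installed, the context would be configured as below (a configuration sketch; adjust `device_target` to match the hardware and MindSpore package you actually have):

```python
import mindspore as ms

# The backend must match both the installed package and the real hardware:
# mindspore-gpu accepts only "GPU" or "CPU" for device_target.
ms.set_context(mode=ms.GRAPH_MODE, device_target="GPU")
```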
+
+## Syntax Errors
+
+- Incorrect construct parameter
+
+    In MindSpore, the basic unit of a neural network is `nn.Cell`. All models and neural network layers must inherit from this base class. Its member function `construct` defines the computation to be performed and must be overridden in every subclass. The function prototype of `construct` is:
+
+ ```python
+ def construct(self, *inputs, **kwargs):
+ ```
+
+    If the function is overridden incorrectly, the following error message may be displayed:
+
+ ```python
+ TypeError: The function construct needs 0 positional argument and 0 default argument, but provided 1
+ ```
+
+    This error occurs when the parameter list of the user-defined `construct` function is wrong, for example, `def construct(*inputs, **kwargs):`, where `self` is missing. MindSpore then reports the error while parsing the function.
+
+ For details, visit the following website:
+
+ [MindSpore Syntax Error - 'construct' Definition Error](https://bbs.huaweicloud.com/forum/thread-178902-1-1.html)
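The effect of a missing `self` can be reproduced with plain Python, without MindSpore (a minimal analogy: `Base` stands in for `nn.Cell`, and the class names are illustrative only; MindSpore's own message comes from its graph parser rather than the Python interpreter, but the root cause is the same):

```python
class Base:
    """Stand-in for nn.Cell: __call__ dispatches to construct."""
    def __call__(self, *inputs):
        return self.construct(*inputs)

class Good(Base):
    def construct(self, x, y):   # correct: self comes first
        return x + y

class Bad(Base):
    def construct(x, y):         # wrong: self is missing
        return x + y

print(Good()(1, 2))              # 3
try:
    Bad()(1, 2)                  # the instance is passed implicitly,
except TypeError as err:         # so construct() receives 3 arguments
    print("TypeError:", err)
```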
+
+- Incorrect control flow syntax
+
+    In static graph mode, Python code is not executed by the Python interpreter. Instead, the code is compiled into a static computational graph for execution. The control flow syntax supported by MindSpore includes if, for, and while statements. If the objects returned by different branches of an if statement have inconsistent attributes, such as data type or shape, an error is reported. The error message is displayed as follows:
+
+ ```c++
+ TypeError: Cannot join the return values of different branches, perhaps you need to make them equal.
+ Type Join Failed: dtype1 = Float32, dtype2 = Float16.
+ ```
+
+ According to the error message, the return values of different branches of the if statement are of different types. One is float32, and the other is float16. As a result, a build error is reported.
+
+ ```c++
+ ValueError: Cannot join the return values of different branches, perhaps you need to make them equal.
+ Shape Join Failed: shape1 = (2, 3, 4, 5), shape2 = ().
+ ```
+
+    According to the error message, the shapes of the return values of different branches of the if statement are different. One is a four-dimensional tensor of shape `2*3*4*5`, and the other is a scalar. As a result, a build error is reported.
+
+ For details, visit the following website:
+
+ [MindSpore Syntax Error - Type (Shape) Join Failed](https://www.mindspore.cn/docs/en/master/faq/network_compilation.html)
+
+ The number of loops of the for and while statements may exceed the permitted range. As a result, the function call stack exceeds the threshold. The error message is displayed as follows:
+
+ ```c++
+ RuntimeError: Exceed function call depth limit 1000, (function call depth: 1001, simulate call depth: 997).
+ ```
+
+ One solution to the problem that the function call stack exceeds the threshold is to simplify the network structure and reduce the number of loops. Another method is to use `set_context(max_call_depth=value)` to increase the threshold of the function call stack.
+
+ For details, visit the following website:
+
+ [MindSpore Syntax Error - Exceed function call depth limit](https://bbs.huaweicloud.com/forum/thread-182165-1-1.html)
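The fix for a `Type Join Failed` or `Shape Join Failed` error is to make every branch return the same dtype and shape. A NumPy stand-in for the idea (illustrative only; in a real network the casts would use MindSpore operators such as `ops.cast`):

```python
import numpy as np

def branch(cond, x):
    """Both branches return float32 with the same shape, so they 'join' cleanly."""
    if cond:
        out = np.sqrt(x)              # would otherwise stay float64
    else:
        out = np.zeros_like(x)        # same shape as x
    return out.astype(np.float32)     # one explicit output dtype for all branches

a = np.ones((2, 3))
print(branch(True, a).dtype, branch(False, a).shape)  # float32 (2, 3)
```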
+
+## Operator Build Errors
+
+Operator build errors are mainly caused by input parameters that do not meet requirements or operator functions that are not supported.
+
+For example, when the ReduceSum operator is used, the following error message is displayed if the input data exceeds eight dimensions:
+
+```c++
+RuntimeError: ({'errCode': 'E80012', 'op_name': 'reduce_sum_d', 'param_name': 'x', 'min_value': 0, 'max_value': 8, 'real_value': 10}, 'In op, the num of dimensions of input/output[x] should be in the range of [0, 8], but actually is [10].')
+```
+
+For details, visit the following website:
+
+[MindSpore Operator Build Error - ReduceSum Operator Does Not Support Input of More Than Eight Dimensions](https://bbs.huaweicloud.com/forum/thread-182168-1-1.html)
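A common workaround is to merge axes with a reshape so that the operator never sees more than eight dimensions. A NumPy sketch of the idea (the same reshape would be applied before `ops.ReduceSum` in MindSpore):

```python
import numpy as np

x = np.ones((2,) * 10)                # 10-D input: too many dims for the operator
merged = x.reshape(-1, *x.shape[4:])  # fold the first 4 axes into one -> 7-D
out = merged.sum(axis=0)              # reduce over the folded axis
print(out.shape)                      # (2, 2, 2, 2, 2, 2)
```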
+
+For example, `Parameter` does not support automatic data type conversion. When a `Parameter` is updated with data of a different type, an error is reported. The error message is as follows:
+
+```c++
+RuntimeError: Data type conversion of 'Parameter' is not supported, so data type int32 cannot be converted to data type float32 automatically.
+```
+
+For details, visit the following website:
+
+[MindSpore Operator Build Error - Error Reported Due to Inconsistent ScatterNdUpdate Operator Parameter Types](https://bbs.huaweicloud.com/forum/thread-182175-1-1.html)
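The fix is to convert the data to the parameter's dtype explicitly before the assignment or update. Sketched with NumPy as a stand-in (in MindSpore, the conversion would be done with, for example, `Tensor(update, ms.float32)` or `ops.cast`):

```python
import numpy as np

param = np.zeros((2, 2), dtype=np.float32)   # stands in for a float32 Parameter
update = np.ones((2, 2), dtype=np.int32)     # int32 data: no implicit conversion

param[:] = update.astype(param.dtype)        # cast explicitly before assigning
print(param.dtype, param[0, 0])              # float32 1.0
```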
+
+## Operator Execution Errors
+
+Operator execution errors are mainly caused by improper input data, operator implementation, or operator initialization. Generally, such errors can be analyzed by comparison, that is, by contrasting the abnormal case with a known-good reference.
+
+For details, see the following example:
+
+[MindSpore Operator Execution Error - nn.GroupNorm Operator Output Exception](https://bbs.huaweicloud.com/forum/thread-182191-1-1.html)
+
+## Insufficient Resources
+
+During network debugging, `Out Of Memory` errors often occur. MindSpore divides the memory into four layers for management on the Ascend device, including runtime, context, dual cursors, and memory overcommitment.
+
+For details about memory management and FAQs of MindSpore on the Ascend device, see [MindSpore Ascend Memory Management](https://bbs.huaweicloud.com/forum/thread-171161-1-1.html).
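When `Out Of Memory` occurs, one first step is to cap how much device memory MindSpore may claim (a configuration sketch; `max_device_memory` takes a string such as `"28GB"`, and the right value depends on your device; combine this with a smaller batch size if the error persists):

```python
import mindspore as ms

# Limit the device memory MindSpore is allowed to occupy on Ascend.
ms.set_context(max_device_memory="28GB")
```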
diff --git a/tutorials/experts/source_en/debug/pynative_debug.md b/tutorials/experts/source_en/debug/pynative_debug.md
new file mode 100644
index 0000000000..934deab8c1
--- /dev/null
+++ b/tutorials/experts/source_en/debug/pynative_debug.md
@@ -0,0 +1,80 @@
+# Debugging in PyNative Mode
+
+
+
+## Introduction
+
+PyNative mode, also called dynamic graph mode, executes Python statements one by one according to the Python syntax, and the result of each statement is available as soon as it has run. Therefore, in PyNative mode, you can debug a network script statement by statement or stop it at any specific statement.
+
+## Breakpoint Debugging
+
+Breakpoint debugging means setting a breakpoint before or after a statement in a network script. When the script reaches the breakpoint, it pauses so that you can inspect the variables there or step through the code; by checking whether those variables have reasonable values, you can judge whether the current code is correct. Because PyNative mode executes Python statements one by one, you can use pdb, the breakpoint debugger provided by Python, to debug network scripts.
+
+The following piece of code is used to demonstrate the breakpoint debugging function.
+
+```python
+import pdb
+import numpy as np
+import mindspore as ms
+from mindspore import Tensor, nn, set_context
+from mindspore import Parameter, ParameterTuple
+from mindspore import ops
+set_context(mode=ms.PYNATIVE_MODE)
+class Net(nn.Cell):
+ def __init__(self):
+ super(Net, self).__init__()
+ self.w1 = Parameter(Tensor(np.random.randn(5, 6).astype(np.float32)), name="w1", requires_grad=True)
+ self.w2 = Parameter(Tensor(np.random.randn(5, 6).astype(np.float32)), name="w2", requires_grad=True)
+ self.relu = nn.ReLU()
+ self.pow = ops.Pow()
+
+ def construct(self, x, y):
+ x = self.relu(x * self.w1) * self.w2
+ pdb.set_trace()
+ out = self.pow(x - y, 2)
+ return out
+
+x = Tensor(np.random.randn(5, 6).astype(np.float32))
+y = Tensor(np.random.randn(5, 6).astype(np.float32))
+
+net = Net()
+ret = net(x, y)
+weights = ParameterTuple(filter(lambda x : x.requires_grad, net.get_parameters()))
+grads = ms.grad(net, grad_position=None, weights=weights)(x, y)
+print("grads: ", grads)
+```
+
+1. Import pdb into the script to enable the breakpoint debugging function.
+
+ ```python
+ import pdb
+ ```
+
+2. Insert `pdb.set_trace()` at the position where you want the network script to pause:
+
+ **Demo code:**
+
+ ```python
+ x = self.relu(x * self.w1) * self.w2
+ pdb.set_trace()
+ out = self.pow(x - y, 2)
+ return out
+ ```
+
+    As shown in Figure 1, the script stops at the `out = self.pow(x - y, 2)` statement and waits for the next pdb command.
+
+ 
+
+ Figure 1
+
+3. When a network script stops at a breakpoint, you can use common pdb debugging commands to debug the network script. For example, you can print variable values, view program call stacks, and perform step-by-step debugging.
+
+    * To print the value of a variable, run the `p` command, as shown in (1) in Figure 1.
+    * To view the program call stack, run the `bt` command, as shown in (2) in Figure 1.
+    * To view the network script context around the breakpoint, run the `l` command, as shown in (3) in Figure 1.
+    * To debug the network script step by step, run the `n` command, as shown in (4) in Figure 1.
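Beyond unconditional breakpoints, pdb can also be triggered conditionally so that the script stops only when something looks wrong (an illustrative pattern; the NaN check below is just one example condition, and `checked_loss` is a hypothetical helper):

```python
import math
import pdb

def checked_loss(step, loss):
    """Drop into pdb only when the loss becomes NaN, not on every step."""
    if isinstance(loss, float) and math.isnan(loss):
        pdb.set_trace()          # reached only on a diverged step
    return loss

print(checked_loss(1, 0.5))      # 0.5, breakpoint not hit
```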
+
+## Common pdb Commands
+
+For details about how to use pdb commands, see the [pdb official document](https://docs.python.org/3/library/pdb.html).
diff --git a/tutorials/experts/source_zh_cn/debug/minddata_debug.md b/tutorials/experts/source_zh_cn/debug/minddata_debug.md
index 02e3a5f964..9dd20f96c4 100644
--- a/tutorials/experts/source_zh_cn/debug/minddata_debug.md
+++ b/tutorials/experts/source_zh_cn/debug/minddata_debug.md
@@ -231,7 +231,7 @@
[MindSpore 数据集加载 - parse() missing 1 required positional](https://bbs.huaweicloud.com/forum/thread-189950-1-1.html)
- - 自定义数据集使用Tensor计算算子
+ - 自定义数据集使用了算子或Tensor操作
错误日志:
@@ -241,11 +241,11 @@
错误描述:
- 在自定义数据集里面使用了 Tensor 计算算子,此时会调用后端底层的算子进行执行,但是数据处理又是多线程并行处理,因此会起多个线程进行计算,但是计算算子可能不支持多线程执行,因此报错。
+ 在自定义数据集里面使用了算子或Tensor操作,而数据处理时采用多线程并行处理,但算子或Tensor操作并不支持多线程执行,因此报错。
参考解决方法:
- 用户自定义的 Pyfunc 中,在数据集中的`__getitem__` 中不使用 MindSpore 的 Tensor 及相关操作,建议先把入参转为 Numpy 类型,再通过 Numpy 相关操作实现相关功能。
+ 用户自定义的 Pyfunc 中,在数据集中的`__getitem__` 中不使用 MindSpore的Tensor操作或算子,建议先把入参转为 Numpy 类型,再通过 Numpy 相关操作实现相关功能。
参考实例链接:
@@ -395,7 +395,7 @@
参考解决方法:
- ① 检查需要进行 batch 操作的数据 shape,不一致时放弃进行 shape 操作。
+ ① 检查需要进行 batch 操作的数据 shape,不一致时放弃进行 batch 操作。
② 如果一定要对 shape 不一致的数据进行 batch 操作,需要整理数据集,通过 pad 补全等方式进行输入数据 shape 的统一。
diff --git a/tutorials/experts/source_zh_cn/debug/mindrt_debug.md b/tutorials/experts/source_zh_cn/debug/mindrt_debug.md
index ef8c307969..41251dbff1 100644
--- a/tutorials/experts/source_zh_cn/debug/mindrt_debug.md
+++ b/tutorials/experts/source_zh_cn/debug/mindrt_debug.md
@@ -6,7 +6,7 @@
## context配置问题
-执行网络训练时,需要指定后端设备,使用方式是:`set_context(device_target=device)`。MindSpore支持CPU,GPU和昇腾后端Ascend。如果在GPU设备上,错误指定后端设备为Ascend,即`set_context(device_target="Ascend")`, 会得到如下报错信息:
+执行网络训练时,需要指定后端设备,使用方式是:`set_context(device_target=device)`。MindSpore支持CPU,GPU和昇腾后端Ascend。如果在GPU设备上,错误指定后端设备为Ascend,即`set_context(device_target="Ascend")`,会得到如下报错信息:
```python
ValueError: For 'set_context', package type mindspore-gpu support 'device_target' type gpu or cpu, but got Ascend.
@@ -88,7 +88,7 @@ RuntimeError: ({'errCode': 'E80012', 'op_name': 'reduce_sum_d', 'param_name': 'x
参考实例链接:
-[MindSpore 算子编译问题 - ReduceSum算子不支持8维以上输入](https://bbs.huaweicloud.com/forum/thread-182168-1-1.html)。
+[MindSpore 算子编译问题 - ReduceSum算子不支持8维以上输入](https://bbs.huaweicloud.com/forum/thread-182168-1-1.html)
例如,Parameter参数不支持类型自动转换,使用Parameter算子时,进行数据类型转换时报错,报错信息如下:
--
Gitee