From 5fa1f8e9e34d446941f8fdb53eb351eb4f0863b4 Mon Sep 17 00:00:00 2001
From: zhangyi
Date: Thu, 31 Mar 2022 17:59:05 +0800
Subject: [PATCH] fix the English files

---
 .../mindspore/faq/source_en/feature_advice.md | 58 +++---
 .../faq/source_en/implement_problem.md | 171 +++++++++++++-----
 2 files changed, 161 insertions(+), 68 deletions(-)

diff --git a/docs/mindspore/faq/source_en/feature_advice.md b/docs/mindspore/faq/source_en/feature_advice.md
index 7faa2758c2..8b9f6351ce 100644
--- a/docs/mindspore/faq/source_en/feature_advice.md
+++ b/docs/mindspore/faq/source_en/feature_advice.md
@@ -8,9 +8,9 @@ A: The format is not fixed. This step is to create an input for constructing the
-**Q: What framework models and formats can be directly read by MindSpore? Can the PTH Model Obtained Through Training in PyTorch Be Loaded to the MindSpore Framework for Use?**
+**Q: What framework models and formats can be directly read by MindSpore? Can the PTH model obtained through training in PyTorch be loaded to the MindSpore framework for use?**

-A: MindSpore uses protocol buffers (Protobuf) to store training parameters and cannot directly read framework models. A model file stores parameters and their values. You can use APIs of other frameworks to read parameters, obtain the key-value pairs of parameters, and load the key-value pairs to MindSpore. If you want to use the .ckpt file trained by a framework, read the parameters and then call the `save_checkpoint` API of MindSpore to save the file as a .ckpt file that can be read by MindSpore.
+A: MindSpore uses protocol buffers (Protobuf) to store training parameters and cannot directly read the model files of other frameworks. A model file stores parameters and their values. You can use the APIs of the other framework to read the parameters, obtain the key-value pairs of the parameters, and load them to MindSpore. If you want to use a .ckpt file trained by another framework, read the parameters and then call the `save_checkpoint` API of MindSpore to save them as a .ckpt file that can be read by MindSpore.
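+For example, a minimal sketch of this conversion, assuming a hypothetical `model.pth` file that stores a state dictionary whose parameter names already match the MindSpore network, might look as follows:
+
+```python
+import torch
+from mindspore import Tensor, save_checkpoint
+
+# Read the key-value pairs of parameters with the PyTorch API.
+pth_params = torch.load("model.pth", map_location="cpu")
+
+# Repack them into the list-of-dict form accepted by save_checkpoint.
+ms_params = [{"name": name, "data": Tensor(value.numpy())}
+             for name, value in pth_params.items()]
+
+# Save a .ckpt file that can be read by MindSpore.
+save_checkpoint(ms_params, "model.ckpt")
+```
+
+If the parameter names differ between the two frameworks, rename the keys accordingly before saving.
+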
@@ -24,13 +24,13 @@ A: When a single Protobuf data is too large, because Protobuf itself limits the

A: Compare through the following four aspects:

-- In terms of network execution:operators used in the two modes are the same. Therefore, when the same network and operators are executed in the two modes, the accuracy is the same. As Graph mode uses graph optimization, calculation graph sinking and other technologies, it has higher performance and efficiency in executing the network.
+- In terms of network execution: operators used in the two modes are the same. Therefore, when the same network and operators are executed in the two modes, the accuracy is the same. As Graph mode uses graph optimization, calculation graph sinking and other technologies, it has higher performance and efficiency in executing the network.

-- In terms of application scenarios,:Graph mode requires the network structure to be built at the beginning, and then the framework performs entire graph optimization and execution. This mode is suitable to scenarios where the network is fixed and high performance is required.
+- In terms of application scenarios: Graph mode requires the network structure to be built at the beginning, and then the framework performs entire graph optimization and execution. This mode is suitable for scenarios where the network is fixed and high performance is required.

-- The two modes are supported on different hardware (such as `Ascend`, `GPU`, and `CPU`).
+- In terms of hardware resources: both modes are supported on different hardware (such as `Ascend`, `GPU`, and `CPU`).

-- In terms of code debugging,:since operators are executed line by line in PyNative mode, you can directly debug the Python code and view the `/api` output or execution result of the corresponding operator at any breakpoint in the code. In Graph mode, the network is built but not executed in the constructor function. Therefore, you cannot obtain the output of the corresponding operator at breakpoints in the `construct` function. You can only specify operators and print their output results, and then view the results after the network execution is completed.
+- In terms of code debugging: since operators are executed line by line in PyNative mode, you can directly debug the Python code and view the `/api` output or execution result of the corresponding operator at any breakpoint in the code. In Graph mode, the network is built but not executed in the constructor function. Therefore, you cannot obtain the output of the corresponding operator at breakpoints in the `construct` function. You can only specify operators and print their output results, and then view the results after the network execution is completed.
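+For reference, both modes are selected through the same `set_context` API; a minimal sketch (using the CPU backend as an example) is as follows:
+
+```python
+from mindspore import context
+
+# Graph mode: the network is compiled and optimized as a whole before execution.
+context.set_context(mode=context.GRAPH_MODE, device_target="CPU")
+
+# PyNative mode: operators are executed line by line, which is convenient for debugging.
+context.set_context(mode=context.PYNATIVE_MODE, device_target="CPU")
+```
+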
@@ -46,21 +46,33 @@ A: You can use the two frameworks in a python file. Pay attention to the differe
-**Q: Can MindSpore read a TensorFlow checkpoint?**
+**Q: Can MindSpore read a TensorFlow ckpt file?**

-A: The checkpoint format of MindSpore is different from that of TensorFlow. Although both use the Protocol Buffers, their definitions are different. Currently, MindSpore cannot read the TensorFlow or Pytorch checkpoints.
+A: The `ckpt` formats of MindSpore and TensorFlow are not interchangeable. Although both use Protocol Buffers, their `proto` definitions are different. Currently, MindSpore cannot read TensorFlow or PyTorch `ckpt` files.

<br/>
**Q: How do I use models trained by MindSpore on Ascend 310? Can they be converted to models used by HiLens Kit?**

-A: Yes. HiLens Kit uses Ascend 310 as the inference core. Therefore, the two questions are essentially the same. Ascend 310 requires a dedicated OM model. Use MindSpore to export the ONNX or AIR model and convert it into an OM model supported by Ascend 310. For details, see [Multi-platform Inference](https://www.mindspore.cn/docs/programming_guide/en/master/multi_platform_inference_ascend_310.html).
+A: Yes. HiLens Kit uses Ascend 310 as the inference core. Therefore, the two questions are essentially the same: both require conversion to an OM model. Ascend 310 requires a dedicated OM model. Use MindSpore to export an ONNX or AIR model and convert it into an OM model supported by Ascend 310. For details, see [Multi-platform Inference](https://www.mindspore.cn/docs/programming_guide/en/master/multi_platform_inference_ascend_310.html).
+
+<br/>
+**Q: Does MindSpore run only on Huawei's own `Ascend`?**
+
+A: MindSpore supports not only Huawei's own `Ascend` but also `GPU` and `CPU`, and it supports heterogeneous computing power.

<br/>
**Q: Can MindSpore be converted to an AIR model on Ascend 310?**

-A: An AIR model cannot be exported from the Ascend 310. You need to load a trained checkpoint on the Ascend 910, export an AIR model, and then convert the AIR model into an OM model for inference on the Ascend 310. For details about the Ascend 910 installation, see the MindSpore Installation Guide at [here](https://www.mindspore.cn/install/en).
+A: An AIR model cannot be exported from the Ascend 310. You need to load a trained checkpoint on the Ascend 910, export an AIR model, and then convert it into an OM model for inference on the Ascend 310. For details about installing MindSpore for the Ascend 910, see the [MindSpore Installation Guide](https://www.mindspore.cn/install/en).
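+For example, a minimal sketch of this export flow on the Ascend 910, assuming a hypothetical `resnet.ckpt` file and an already constructed network object `net`, is as follows:
+
+```python
+import numpy as np
+from mindspore import Tensor, export, load_checkpoint, load_param_into_net
+
+# Load the trained checkpoint into the network.
+param_dict = load_checkpoint("resnet.ckpt")
+load_param_into_net(net, param_dict)
+
+# Export the network with a dummy input of the expected shape.
+dummy_input = Tensor(np.ones([1, 3, 224, 224]).astype(np.float32))
+export(net, dummy_input, file_name="resnet", file_format="AIR")
+```
+
+The exported AIR file can then be converted into an OM model with the Ascend ATC tool for inference on the Ascend 310.
+
+<br/>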
+**Q: Does MindSpore have any limitation on the input size of a single Tensor for exporting and loading models?**
+
+A: Due to the limitations of Protobuf, when exporting to the AIR or ONNX format, the size of the model parameters cannot exceed 2G. When exporting to the MINDIR format, there is no limit on the size of the model parameters. For loading, MindSpore supports only MINDIR (not AIR or ONNX), and the size limitation is the same as that for exporting.
@@ -70,21 +82,21 @@ A: MindSpore currently supports CPU, GPU, and Ascend. Currently, you can try out
-**Q: Does MindSpore have any limitation on the input size of a single Tensor for exporting and loading models?**
+**Q: Does MindSpore have any plan on supporting other types of heterogeneous computing hardware?**

-A: Due to hardware limitations of Protobuf, when exporting to AIR and ONNX formats, the size of model parameters cannot exceed 2G; when exporting to MINDIR format, there is no limit to the size of model parameters. MindSpore only supports MINDIR, and the size of a single Tensor cannot exceed 2G.
+A: MindSpore provides a pluggable device management interface, so that developers can easily integrate other types of heterogeneous computing hardware (such as FPGA) into MindSpore. We welcome more backend support for MindSpore from the community.

<br/>
-**Q: Does MindSpore have any plan on supporting other types of heterogeneous computing hardwares?**
+**Q: What is the relationship between MindSpore and ModelArts? Can we use MindSpore in ModelArts?**

-A: MindSpore provides pluggable device management interface so that developer could easily integrate other types of heterogeneous computing hardwares like FPGA to MindSpore. We welcome more backend support in MindSpore from the community.
+A: ModelArts is Huawei's online training and inference platform on the public cloud, and MindSpore is Huawei's deep learning framework. MindSpore can be used in ModelArts; the [MindSpore official website tutorial](https://www.mindspore.cn/docs/programming_guide/zh-CN/master/use_on_the_cloud.html) shows in detail how to train MindSpore models with ModelArts.

<br/>
**Q: The recent announced programming language such as taichi got Python extensions that could be directly used as `import taichi as ti`. Does MindSpore have similar support?**

-A: MindSpore supports Python native expression via `import mindspore`.
+A: MindSpore supports Python native expression; you can simply use `import mindspore` and the related packages.
@@ -108,25 +120,25 @@ A: In addition to data parallelism, MindSpore distributed training also supports

**Q: How does MindSpore implement semantic collaboration and processing? Is the popular Formal Concept Analysis (FCA) used?**

-A: The MindSpore framework does not support FCA. For semantic models, you can call third-party tools to perform FCA in the data preprocessing phase. MindSpore supports Python therefore `import FCA` could do the trick.
+A: The MindSpore framework does not support FCA. For semantic models, you can call third-party tools to perform FCA in the data preprocessing phase. MindSpore supports Python, so importing a third-party FCA package could do the trick.

<br/>
-**Q: Does MindSpore have any plan or consideration on the edge and device when the training and inference functions on the cloud are relatively mature?**
+**Q: Does MindSpore have any plans for edge and device scenarios now that the training and inference functions of MindSpore on the cloud are relatively mature?**

-A: MindSpore is a unified cloud-edge-device training and inference framework. Edge has been considered in its design, so MindSpore can perform inference at the edge. The open-source version will support Ascend 310-based inference. The optimizations supported in the current inference stage include quantization, operator fusion, and memory overcommitment.
+A: MindSpore is a unified cloud-edge-device training and inference framework, which supports exporting cloud-side trained models to Ascend AI processors and terminal devices for inference. The optimizations supported in the current inference stage include quantization, operator fusion, and memory overcommitment.

<br/>
**Q: How does MindSpore support automatic parallelism?**

-A: Automatic parallelism on CPUs and GPUs are being improved. You are advised to use the automatic parallelism feature on the Ascend 910 AI processor. Follow our open source community and apply for a MindSpore developer experience environment for trial use.
+A: Automatic parallelism on CPUs and GPUs is still being improved. You are advised to use the automatic parallelism feature on the Ascend 910 AI processor. Follow our open source community and apply for a MindSpore developer experience environment for trial use.
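+For example, a minimal sketch of enabling the automatic parallelism feature, assuming a hypothetical 8-device Ascend setup, is as follows:
+
+```python
+from mindspore import context
+from mindspore.context import ParallelMode
+
+# Run in graph mode on Ascend and let the framework search the parallel strategy.
+context.set_context(mode=context.GRAPH_MODE, device_target="Ascend")
+context.set_auto_parallel_context(parallel_mode=ParallelMode.AUTO_PARALLEL,
+                                  gradients_mean=True, device_num=8)
+```
+
+Distributed execution also requires initializing the communication service (for example, via `mindspore.communication.init`) before training.
+
+<br/>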
-**Q: Does MindSpore have a module that can implement object detection algorithms as TensorFlow does?**
+**Q: Does MindSpore have a module similar to TensorFlow's for implementing object detection algorithms?**

-A: The TensorFlow's object detection pipeline API belongs to the TensorFlow's Model module. After MindSpore's detection models are complete, similar pipeline APIs will be provided.
+A: TensorFlow's object detection pipeline API belongs to TensorFlow's Model module. After MindSpore's detection models are complete, similar pipeline APIs will be provided.
@@ -138,9 +150,9 @@ A: PyNative mode is compatible with transfer learning. For more tutorial informa **Q: What is the difference between [MindSpore ModelZoo](https://gitee.com/mindspore/models/tree/master) and [Ascend ModelZoo](https://www.hiascend.com/software/modelzoo)?** -A: `MindSpore ModelZoo` contains models only implemented by MindSpore. But these models support different devices including Ascend, GPU, CPU and mobile. `Ascend ModelZoo` contains models only running on Ascend which are implemented by different ML platform including MindSpore, PyTorch, TensorFlow and Caffe. You can refer to the corresponding [gitee repository](https://gitee.com/ascend/modelzoo). +A: `MindSpore ModelZoo` contains models mainly implemented by MindSpore. But these models support different devices including Ascend, GPU, CPU and Mobile. `Ascend ModelZoo` contains models only running on Ascend which are implemented by different ML platform including MindSpore, PyTorch, TensorFlow and Caffe. You can refer to the corresponding [gitee repository](https://gitee.com/ascend/modelzoo). -As for the models implemented by MindSpore running on Ascend, these are maintained in `MindSpore ModelZoo`, and will be released to `Ascend ModelZoo` regularly. +The combination of MindSpore and Ascend is overlapping, and this part of the model will be based on MindSpore's ModelZoo as the main version, and will be released to Ascend ModelZoo regularly.
diff --git a/docs/mindspore/faq/source_en/implement_problem.md b/docs/mindspore/faq/source_en/implement_problem.md
index 90dc101a35..9ce9aaadde 100644
--- a/docs/mindspore/faq/source_en/implement_problem.md
+++ b/docs/mindspore/faq/source_en/implement_problem.md
@@ -4,11 +4,11 @@

**Q: How do I use MindSpore to implement multi-scale training?**

-A: During multi-scale training, when different `shape` are used to call `Cell` objects, different graphs are automatically built and called based on different `shape`. Note that multi-scale training supports only the non-data sink mode and does not support the data offloading mode. For details, see the multi-scale training of the [yolov3](https://gitee.com/mindspore/models/tree/master/official/cv/yolov3_darknet53).
+A: During multi-scale training, when `Cell` objects are called with different `shape`, different graphs are automatically built and called based on the different `shape`, which implements multi-scale training. Note that multi-scale training supports only the non-data sink mode and does not support data offloading. For details, see the multi-scale training implementation of [yolov3](https://gitee.com/mindspore/models/tree/master/official/cv/yolov3_darknet53).
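+For example, in Graph mode each distinct input `shape` triggers building (and caching) a separate graph for the same `Cell`; a minimal sketch is as follows:
+
+```python
+import numpy as np
+from mindspore import Tensor, nn
+
+conv = nn.Conv2d(3, 8, kernel_size=3)
+
+# Each new input shape causes another graph to be built for the same Cell object.
+out_320 = conv(Tensor(np.ones([1, 3, 320, 320]).astype(np.float32)))
+out_416 = conv(Tensor(np.ones([1, 3, 416, 416]).astype(np.float32)))
+```
+
+<br/>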
-**Q: If a `tensor` whose `requirements_grad` is set to `False` is converted into `numpy` for processing and then converted into `tensor`, will the computational graph and backward propagation be affected?** +**Q: If a `tensor` of MindSpore whose `requirements_grad` is set to `False` is converted into `numpy` for processing and then converted into `tensor`, will the computational graph and backward propagation be affected?** A: In PyNative mode, if `numpy` is used for computation, gradient transfer will be interrupted. In the scenario where `requirements_grad` is set to `False`, if the backward propagation of `tensor` is not transferred to other parameters, there is no impact. If `requirements_grad` is set to `True`, there is an impact. @@ -22,30 +22,30 @@ A: The `nn.Dense` interface is similar to `torch.nn.functional.linear()`. `nn.De **Q: What is the function of the `.meta` file generated after the model is saved using MindSpore? Can the `.meta` file be used to import the graph structure?** -A: The `.meta` file is a built graph structure. However, this structure cannot be directly imported currently. If you do not know the graph structure, you still need to use the MindIR file to import the network. +A: The `.meta` file is a compiled graph structure. However, this structure cannot be directly imported currently. If you do not know the graph structure, you still need to use the MindIR file to import the network.
**Q: Can the `yolov4-tiny-3l.weights` model file be directly converted into a MindSpore model?**

-A: No. You need to convert the parameters trained by other frameworks into the MindSpore format, and then convert the model file into a MindSpore model.
+A: No. The parameters trained by the other framework need to be converted into the MindSpore format before the weights can be used in a MindSpore model.

<br/>
-**Q: Why an error is reported when MindSpore is used to set `model.train`?**
+**Q: Why is an error message displayed when MindSpore is used to set `model.train`?**

```python
model.train(1, dataset, callbacks=LossMonitor(1), dataset_sink_mode=True)
model.train(1, dataset, callbacks=LossMonitor(1), dataset_sink_mode=False)
```

-A: If the offloading mode has been set, it cannot be set to non-offloading mode. This is a restriction on the running mechanism.
+A: If the data offloading mode has been set, it cannot be changed to the non-offloading mode afterwards, which is a restriction of the running mechanism.

<br/>
-**Q: What should I pay attention to when using MindSpore to train a model in the `eval` phase? Can the network and parameters be loaded directly? Does the optimizer need to be used in the model?**
+**Q: What should I pay attention to when using MindSpore to train a model in the `eval` phase? Can the network and parameters be loaded directly? Does the optimizer need to be used in the Model?**

-A: It mainly depends on what is required in the `eval` phase. For example, the output of the `eval` network of the image classification task is the probability value of each class, and the `acc` is computed with the corresponding label.
+A: It mainly depends on what is required in the `eval` phase. For example, the output of the `eval` network of an image classification task is the probability of each class, and the `acc` is computed with the corresponding label.

In most cases, the training network and parameters can be directly reused. Note that the inference mode needs to be set.

```python
@@ -71,13 +71,13 @@ A: To change the value according to `epoch`, use [Dynamic LR](https://www.mindsp

**Q: How do I modify parameters (such as the dropout value) on MindSpore?**

-A: When building a network, use `if self.training: x = dropput(x)`. When reasoning, set `network.set_train(mode_false)` before execution to disable the dropout function. During training, set `network.set_train(mode_false)` to True to enable the dropout function.
+A: When building a network, use `if self.training: x = dropout(x)`. During inference, call `network.set_train(False)` before execution to disable the dropout function. During training, call `network.set_train(True)` to enable the dropout function.
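+For example, a minimal sketch of toggling dropout through `set_train` is as follows:
+
+```python
+from mindspore import nn
+
+dropout = nn.Dropout(keep_prob=0.8)
+
+dropout.set_train(True)   # training mode: dropout is active
+dropout.set_train(False)  # inference mode: dropout is disabled
+```
+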
**Q: How do I view the number of model parameters?**

-A: You can load the checkpoint to count the parameter number. Variables in the momentum and optimizer may be counted, so you need to filter them out.
+A: You can load the checkpoint and count the parameters directly. Variables of the momentum and the optimizer may be counted as well, so you need to filter them out.
You can refer to the following APIs to collect the number of network parameters:

```python
@@ -98,9 +98,9 @@ def count_params(net):
-**Q: How do I monitor the loss during training and save the training parameters when the `loss` is the lowest?**
+**Q: How do I monitor the `loss` during training and save the training parameters when the `loss` is the lowest?**

-A: You can customize a `callback`.For details, see the writing method of `ModelCheckpoint`. In addition, the logic for determining loss is added.
+A: You can customize a `callback`. For details, refer to how `ModelCheckpoint` is written, and add the logic for checking the `loss`.

```python
class EarlyStop(Callback):
@@ -115,9 +115,9 @@ class EarlyStop(Callback):
-**Q: How do I obtain the expected `feature map` when `nn.Conv2d` is used?**
+**Q: How do I obtain a `feature map` with the expected size when `nn.Conv2d` is used?**

-A: For details about how to derive the `Conv2d shape`, click [here](https://www.mindspore.cn/docs/api/en/master/api_python/nn/mindspore.nn.Conv2d.html#mindspore.nn.Conv2d) Change `pad_mode` of `Conv2d` to `same`. Alternatively, you can calculate the `pad` based on the Conv2d shape derivation formula to keep the `shape` unchanged. Generally, the pad is `(kernel_size-1)//2`.
+A: For details about how to derive the `Conv2d shape`, click [here](https://www.mindspore.cn/docs/api/en/master/api_python/nn/mindspore.nn.Conv2d.html#mindspore.nn.Conv2d). You can change `pad_mode` of `Conv2d` to `same`, or calculate the `pad` based on the `Conv2d shape` derivation formula to keep the `shape` unchanged. Generally, the pad is `(kernel_size-1)//2`.
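+For example, a minimal sketch with `kernel_size=3`, where `pad = (3 - 1) // 2 = 1` keeps the spatial size unchanged, is as follows:
+
+```python
+import numpy as np
+from mindspore import Tensor, nn
+
+x = Tensor(np.ones([1, 3, 32, 32]).astype(np.float32))
+
+# Explicit padding of (kernel_size - 1) // 2 preserves the 32x32 feature map.
+conv = nn.Conv2d(3, 16, kernel_size=3, pad_mode="pad", padding=1)
+print(conv(x).shape)  # (1, 16, 32, 32)
+```
+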
@@ -150,7 +150,7 @@ class EarlyStop(Callback): cb_params = run_context.original_args() loss = cb_params.net_outputs if loss.asnumpy() < self._control_loss: - # Stop training. + # Stop training run_context._stop_requested = True stop_cb = EarlyStop(control_loss=1) @@ -186,9 +186,70 @@ net = Vgg(cfg['16'], num_classes=num_classes, args=args, batch_norm=args.batch_n
+**Q: How do I handle an abnormal shutdown of the cache server?**
+
+A: While the cache server is in use, system resources such as IPC shared memory and socket files are allocated. If overflow is allowed, there will also be overflowing data files in the disk space. In general, if the server is shut down normally via the `cache_admin --stop` command, these resources are cleaned up automatically.
+
+However, if the cache server is shut down abnormally, for example, the cache server process is killed, first try to restart the server. If the startup fails, follow the steps below to manually clean up the system resources:
+
+- Delete the IPC resources.
+
+    1. Check for residual IPC shared memory.
+
+       In general, the system allocates 4 GB of shared memory for the caching service. The following command shows the usage of the shared memory blocks in the system.
+
+       ```text
+       $ ipcs -m
+       ------ Shared Memory Segments --------
+       key shmid owner perms bytes nattch status
+       0x61020024 15532037 root 666 4294967296 1
+       ```
+
+       Here, `shmid` is the shared memory block ID, `bytes` is the size of the shared memory block, and `nattch` is the number of processes attached to the shared memory block. If `nattch` is not 0, some processes are still using the shared memory block. Before deleting the shared memory, you need to stop all processes that use it.
+
+    2. Delete the IPC shared memory.
+
+       Find the corresponding shared memory ID and delete it with the following command.
+
+       ```text
+       ipcrm -m {shmid}
+       ```
+
+- Delete the socket files.
+
+In general, the socket files are located in `/tmp/mindspore/cache`. Enter the folder and execute the following command to delete them.
+
+```text
+rm cache_server_p{port_number}
+```
+
+Here, `port_number` is the port number specified when the cache server was created, which defaults to 50052.
+
+- Delete the data files that overflow to the disk space.
+
+Enter the overflow data path specified when the cache server was enabled. In general, the default overflow path is `/tmp/mindspore/cache`. Find the corresponding data folders under the path and delete them one by one.
+
+<br/>
+**Q: Can the `vgg16` model be loaded via Hub on the GPU and used for transfer learning?**
+
+A: Yes. Please manually modify the following two places:
+
+```python
+# Add the **kwargs parameter as follows:
+def vgg16(num_classes=1000, args=None, phase="train", **kwargs):
+```
+
+```python
+# Add the **kwargs parameter as follows:
+net = Vgg(cfg['16'], num_classes=num_classes, args=args, batch_norm=args.batch_norm, phase=phase, **kwargs)
+```
+
+<br/>
**Q: How to obtain middle-layer features of a VGG model?**

-A: Obtaining the middle-layer features of a network is not closely related to the specific framework. For the `vgg` model defined in `torchvison`, the `features` field can be used to obtain the middle-layer features. The `vgg` source code of `torchvison` is as follows:
+A: Obtaining the middle-layer features of a network is not closely related to the specific framework. For the `vgg` model defined in `torchvision`, the `features` field can be used to obtain the middle-layer features. The `vgg` source code of `torchvision` is as follows:

```python
class VGG(nn.Module):
@@ -216,7 +277,7 @@ A: The `dataset` received by the defined `model.train` API can consist of multip

**Q: How do I load the PyTorch weight to MindSpore during model transfer?**

-A: First, enter the `PTH` file of PyTorch. Take `ResNet-18` as an example. The network structure of MindSpore is the same as that of PyTorch. After transferring, the file can be directly loaded to the network. Only `BN` and `Conv2D` are used during loading. If the network names of MindSpore and PyTorch at other layers are different, change the names to the same.
+A: First, obtain the `PTH` file of PyTorch. Taking `ResNet-18` as an example, the network structure of MindSpore is the same as that of PyTorch. After conversion, the file can be directly loaded to the network. Only `BN` and `Conv2D` are used during loading. If the layer names of MindSpore and PyTorch differ elsewhere, change them to be the same.
@@ -298,17 +359,17 @@ Modify the following items to fit $f(x) = ax^2 + bx + c$:

The following explains detailed information about the modification:

```python
-# The selected optimizer does not support CPUs. Therefore, the GPU computing platform is used for training. You need to install MindSpore of the GPU version.
+# Since the selected optimizer does not support the CPU, the training platform is changed to the GPU, which requires the GPU version of MindSpore to be installed.
context.set_context(mode=context.GRAPH_MODE, device_target="GPU")

-# Assume that the function to be fitted is f(x)=2x^2+3x+4. Modify the data generation function as follows:
+# Assume that the function to be fitted this time is f(x)=2x^2+3x+4. The data generation function is modified as follows:
def get_data(num, a=2.0, b=3.0 ,c = 4):
    for i in range(num):
        x = np.random.uniform(-10.0, 10.0)
        noise = np.random.normal(0, 1)
-        # For details about how to generate the value of y, see the to-be-fitted objective function ax^2+bx+c.
+        # The y value is generated from the target function ax^2+bx+c to be fitted.
        y = x * x * a + x * b + c + noise
-        # When fitting a*x^2 + b*x +c, a and b are weight parameters, and c is the offset parameter bias. The training data corresponding to the two weights is x^2 and x, respectively. Therefore, the dataset generation mode is changed as follows:
+        # When a*x^2+b*x+c is fitted, a and b are weight parameters and c is the bias parameter. The training data corresponding to the two weights are x^2 and x respectively, so the dataset generation mode is changed as follows:
        yield np.array([x*x, x]).astype(np.float32), np.array([y]).astype(np.float32)

def create_dataset(num_data, batch_size=16, repeat_size=1):
@@ -320,7 +381,7 @@ def create_dataset(num_data, batch_size=16, repeat_size=1):
 class LinearNet(nn.Cell):
     def __init__(self):
         super(LinearNet, self).__init__()
-        # Two training parameters are input for the full connection function. Therefore, the input value is changed to 2. The first Normal(0.02) automatically allocates random weights to the two input parameters, and the second Normal is the random bias.
+        # Because the fully connected layer takes two training parameters as input, the input dimension is changed to 2. The first Normal(0.02) automatically assigns random weights to the two inputs, and the second Normal is the random bias.
        self.fc = nn.Dense(2, 1, Normal(0.02), Normal(0.02))

     def construct(self, x):
@@ -336,7 +397,7 @@ if __name__ == "__main__":
     net = LinearNet()
     net_loss = nn.loss.MSELoss()

-    # RMSProp optimizer with better effect is selected for quadratic function fitting. Currently, Ascend and GPU computing platforms are supported.
+    # The RMSProp optimizer, which works better for quadratic function fitting, is selected. Currently, the Ascend and GPU computing platforms are supported.
     opt = nn.RMSProp(net.trainable_params(), learning_rate=0.1)
     model = Model(net, net_loss, opt)
@@ -350,33 +411,51 @@ if __name__ == "__main__":

**Q: How do I execute a single `ut` case in `mindspore/tests`?**

-A: `ut` cases are usually based on the MindSpore package of the debug version, which is not provided on the official website. You can run `sh build.sh` to compile the source code and then run the `pytest` command. The compilation in debug mode does not depend on the backend. Run the `sh build.sh -t on` command. For details about how to execute cases, see the `tests/runtest.sh` script.
+A: `ut` cases are usually based on the MindSpore package of the debug version, which is not provided on the official website. You can run `sh build.sh` to compile from the source code and then run the `pytest` command. The compilation in debug mode does not depend on the backend; compile with the `sh build.sh -t on` option. For details about how to execute cases, see the `tests/runtest.sh` script.

<br/>
-**Q: For Ascend users, how to get more detailed logs when the `run task error` is reported?**
+**Q: For Ascend users, how can I get more detailed logs to help locate the problem when `run task error` is reported during case execution?**

A: Use the msnpureport tool to set the on-device log level. The tool is stored in `/usr/local/Ascend/latest/driver/tools/msnpureport`.

+- Global-level:
+
```bash
-- Global: /usr/local/Ascend/latest/driver/tools/msnpureport -g info
+/usr/local/Ascend/latest/driver/tools/msnpureport -g info
```

+- Module-level:
+
```bash
-- Module-level: /usr/local/Ascend/latest/driver/tools/msnpureport -m SLOG:error
+/usr/local/Ascend/latest/driver/tools/msnpureport -m SLOG:error
```

+- Event-level:
+
```bash
-- Event-level: /usr/local/Ascend/latest/driver/tools/msnpureport -e disable/enable
+/usr/local/Ascend/latest/driver/tools/msnpureport -e disable/enable
```

+- Multi-device ID-level:
+
```bash
-- Multi-device ID-level: /usr/local/Ascend/latest/driver/tools/msnpureport -d 1 -g warning
+/usr/local/Ascend/latest/driver/tools/msnpureport -d 1 -g warning
```

-Assume that the value range of deviceID is [0, 7], and `devices 0–3` and `devices 4–7` are on the same OS. `Devices 0–3` share the same log configuration file and `devices 4–7` share the same configuration file. In this way, changing the log level of any device (for example device 0) will change that of other devices (for example `devices 1–3`). This rule also applies to `devices 4–7`.
+Assume that the value range of deviceID is [0, 7], and `devices 0–3` and `devices 4–7` are on the same OS. `device 0` to `device 3` share one log configuration file, and `device 4` to `device 7` share another. Therefore, changing the log level of any device among `device 0` to `device 3` also changes that of the other devices in the group. This rule also applies to `device 4` to `device 7`.

-After the driver package is installed (assuming that the installation path is /usr/local/HiAI and the execution file `msnpureport.exe` is in the C:\ProgramFiles\Huawei\Ascend\Driver\tools\ directory on Windows), run the command in the /home/shihangbo/ directory to export logs on the device to the current directory and store logs in a folder named after the timestamp.
+After the `Driver` package is installed (assuming that the installation path is /usr/local/HiAI and the executable `msnpureport.exe` is in the C:\ProgramFiles\Huawei\Ascend\Driver\tools\ directory on Windows), if the user runs the command directly in the /home/shihangbo/ directory, the device-side logs are exported to the current directory and stored in a folder named after the timestamp.
+
+<br/>
+**Q: What can I do when the error message `Out of Memory!!! total[3212254720] (dynamic[0] memory poll[524288000]) malloc[32611480064] failed!` is displayed during training on the Ascend platform?**
+
+A: This issue is a memory shortage problem caused by excessive memory usage, which may have two causes:
+
+- The value of `batch_size` is set too large. Solution: reduce the value of `batch_size`.
+- An abnormally large `parameter` is introduced. For example, a single piece of data with shape [640,1024,80,81] and data type float32 is over 15G, and adding two pieces of data of similar size occupies over 3*15G of memory, which easily causes `Out of Memory`. Solution: check the `shape` of the parameter and reduce it if it is abnormally large.
+- If the preceding operations cannot solve the problem, you can raise it on the [official forum](https://bbs.huaweicloud.com/forum/forum-1076-1.html), where dedicated technical personnel will help.
@@ -386,25 +465,27 @@ A: Sorry, this function is not available yet. You can find the optimal hyperpara
-**Q: What should I do when error `error while loading shared libraries: libge_compiler.so: cannot open shared object file: No such file or directory` prompts during application running?**
+**Q: What should I do when the error `error while loading shared libraries: libge_compiler.so: cannot open shared object file: No such file or directory` is displayed during application running?**

-A: While installing Ascend 310 AI Processor software packages,the `CANN` package should install the full-featured `toolkit` version instead of the `nnrt` version.
+A: When installing the Ascend 310 AI Processor software packages on which MindSpore depends, install the full-featured `toolkit` version of the `CANN` package instead of the `nnrt` version.

<br/>
**Q: Why does context.set_ps_context(enable_ps=True) in model_zoo/official/cv/resnet/train.py in the MindSpore code have to be set before init?** -A: In MindSpore Ascend mode, if init is called first, then all processes will be allocated cards, but in parameter server training mode, the server does not need to allocate cards, then the worker and server will use the same card, resulting in an error: HCCL dependent tsd is not open. +A: In MindSpore Ascend mode, if init is called first, all processes will be allocated cards, but in parameter server training mode, the server does not need to allocate cards, and the worker and server will use the same card, resulting in an error: HCCL dependent tsd is not open.
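+A minimal sketch of the required ordering, assuming parameter server training on Ascend, is as follows:
+
+```python
+from mindspore import context
+from mindspore.communication import init
+
+context.set_context(mode=context.GRAPH_MODE, device_target="Ascend")
+# Declare parameter server mode first so that the server process does not get a card allocated.
+context.set_ps_context(enable_ps=True)
+# Only then initialize the distributed communication.
+init()
+```
+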
**Q: What should I do if the memory continues to increase when resnet50 training is being performed on the CPU ARM platform?**

-A: When performing resnet50 training on the CPU ARM, some operators are implemented based on the oneDNN library, and the oneDNN library is based on the libgomp library to achieve multi-threaded parallelism. Currently, libgomp has multiple parallel domain configurations. The number of threads is different and the memory usage continues to increase. The continuous growth of memory can be controlled by configuring a uniform number of threads globally. For comprehensive performance considerations, it is recommended to configure a unified configuration to 1/4 of the number of physical cores, such as export `OMP_NUM_THREADS=32`.
+A: When resnet50 training is performed on an ARM CPU, some operators are implemented based on the oneDNN library, and the oneDNN library achieves multi-threaded parallelism based on the libgomp library. Currently, libgomp has a problem in which memory consumption keeps growing when multiple parallel domains are configured with different numbers of threads. The continuous growth of memory can be controlled by configuring a uniform number of threads globally. For overall performance considerations, it is recommended to set it to 1/4 of the number of physical cores, for example, `export OMP_NUM_THREADS=32`.
+
**Q: Why report an error that the stream exceeds the limit when executing the model on the Ascend platform?** -A: Stream represents an operation queue. Tasks on the same stream are executed in sequence, and different streams can be executed in parallel. Various operations in the network generate tasks and are assigned to streams to control the concurrent mode of task execution. Ascend platform has a limit on the number of tasks on the same stream, and tasks that exceed the limit will be assigned to new streams. The multiple parallel methods of MindSpore will also assign new streams, such as parallel communication operators. Therefore, when the number of assigned streams exceeds the resource limit of the Ascend platform, an error will be reported. Reference solution: +A: Stream represents an operation queue. Tasks on the same stream are executed in sequence, and different streams can be executed in parallel. Various operations in the network generate tasks and are assigned to streams to control the concurrent mode of task execution. Ascend platform has a limit on the number of tasks on the same stream, and tasks that exceed the limit will be assigned to new streams. The multiple parallel methods of MindSpore will also be assigned to new streams, such as parallel communication operators. Therefore, when the number of assigned streams exceeds the resource limit of the Ascend platform, an error will be reported. Reference solution: - Reduce the size of the network model @@ -414,9 +495,9 @@ A: Stream represents an operation queue. Tasks on the same stream are executed i
-**Q: On the Ascend platform, if an error "Ascend error occurred, error message:" is reported and followed by an error code, such as "E40011", how to find the cause of the error code?** +**Q: On the Ascend platform, if an error "Ascend error occurred, error message:" is reported in the log and followed by an error code, such as "E40011", how to find the cause of the error code?** -A: When "Ascend error occurred, error message:" appears, it indicates that a module of Ascend CANN is abnormal and the error code is reported. +A: When "Ascend error occurred, error message:" appears, it indicates that a module of Ascend CANN is abnormal and the error log is reported. At this time, there is an error message after the error code. If you need a more detailed possible cause and solution for this exception, please refer to the "error code troubleshooting" section of the corresponding Ascend version document, such as [CANN Community 5.0.3 alpha 002 (training) Error Code troubleshooting](https://support.huaweicloud.com/trouble-cann503alpha2training/atlaspd_15_0001.html). @@ -454,7 +535,7 @@ Method 2: If the problem persists, delete the cache file of the wheel installati
-**Q: What should I do if I encounter `matplotlib.pyplot.show()` (most often plt.show()) cannot be executed during the tutorial is running?** +**Q: What should I do if I encounter `matplotlib.pyplot.show()` or `plt.show` not be executed during the documentation sample code is running?** A: First confirm whether `matplotlib` is installed. If it is not installed, you can execute `pip install matplotlib` on the command line to install it. @@ -462,13 +543,13 @@ Secondly, because the function of `matplotlib.pyplot.show()` is to display graph
-**Q: What issues should be paid attention to when using the *Run in ModelArts* in tutorials?** +**Q: How to handle running failures when encountering an online runtime provided in the documentation?** A: Need to confirm that the following preparations have been done. - First, you need to log in to ModelArts through your HUAWEI CLOUD account. -- Secondly, note that the hardware environment supported by the tags in the tutorial document is Ascend, GPU or CPU. Since the hardware environment used by default after login is CPU, the Ascend environment and GPU environment need to be switched manually by the user. -- Finally, confirm that the current `Kernel` of Jupyter Notebook is MindSpore. +- Secondly, note that the hardware environment supported by the tags in the tutorial document and the hardware environment configured in the example code is Ascend, GPU or CPU. Since the hardware environment used by default after login is CPU, the Ascend environment and GPU environment need to be switched manually by the user. +- Finally, confirm that the current `Kernel` is MindSpore. After completing the above steps, you can run the tutorial. @@ -478,9 +559,9 @@ For the specific operation process, please refer to [Based on ModelArts Online E **Q: No error is reported when using result of division in GRAPH mode, but an error is reported when using result of division in PYNATIVE mode?** -A: In GRAPH mode, the data type of the output result of the operator is determined at the graph compilation stage. +A: In GRAPH mode, since the graph compilation is used, the data type of the output result of the operator is determined at the graph compilation stage. -For example, the following code is executed in GRAPH mode, the type of input data is int type, so the output result is also int type according to graph compiler. +For example, the following code is executed in GRAPH mode, and the type of input data is int, so the output result is also int type according to graph compilation. ```python from mindspore import context @@ -507,7 +588,7 @@ output: 4 ``` -Change GRAPH_MODE to PYNATIVE_MODE. Since the Python syntax is used in PyNative mode, the type of any division output is float type, so the execution result is as follows. +Change the execution mode and change GRAPH_MODE to PYNATIVE_MODE. Since the Python syntax is used in PyNative mode, the type of any division output to Python syntax is float type, so the execution result is as follows. ```text 4.0 -- Gitee