diff --git a/docs/en/PyTorch API Support List/PyTorch API Support List.md b/docs/en/PyTorch API Support List/PyTorch API Support List.md
index f1d83dffd361f15774265dc7b641cee463d8a917..e0776803554b8ee33c8377fad2b03d27c1684647 100644
--- a/docs/en/PyTorch API Support List/PyTorch API Support List.md
+++ b/docs/en/PyTorch API Support List/PyTorch API Support List.md
@@ -1,17 +1,17 @@
 # PyTorch API Support List
-- [Tensors](#tensors.md)
-- [Generators](#generators.md)
-- [Random sampling](#random-sampling.md)
-- [Serialization](#serialization.md)
-- [Math operations](#math-operations.md)
-- [Utilities](#utilities.md)
-- [Other](#other.md)
-- [torch.Tensor](#torch-tensor.md)
-- [Layers \(torch.nn\)](#layers-(torch-nn).md)
-- [Functions\(torch.nn.functional\)](#functions(torch-nn-functional).md)
-- [torch.distributed](#torch-distributed.md)
-- [NPU and CUDA Function Alignment](#npu-and-cuda-function-alignment.md)
-<h2 id="tensors.md">Tensors</h2>
+- [Tensors](#tensors)
+- [Generators](#generators)
+- [Random sampling](#random-sampling)
+- [Serialization](#serialization)
+- [Math operations](#math-operations)
+- [Utilities](#utilities)
+- [Other](#other)
+- [torch.Tensor](#torch-tensor)
+- [Layers \(torch.nn\)](#layers-torch-nn)
+- [Functions\(torch.nn.functional\)](#functionstorch-nn-functional)
+- [torch.distributed](#torch-distributed)
+- [NPU and CUDA Function Alignment](#npu-and-cuda-function-alignment)
+<h2 id="tensors">Tensors</h2>
@@ -361,7 +361,7 @@
-<h2 id="generators.md">Generators</h2>
+<h2 id="generators">Generators</h2>
@@ -424,7 +424,7 @@
-<h2 id="random-sampling.md">Random sampling</h2>
+<h2 id="random-sampling">Random sampling</h2>
@@ -641,7 +641,7 @@
-<h2 id="serialization.md">Serialization</h2>
+<h2 id="serialization">Serialization</h2>
@@ -669,7 +669,7 @@
-<h2 id="math-operations.md">Math operations</h2>
+<h2 id="math-operations">Math operations</h2>
@@ -1873,7 +1873,7 @@
-<h2 id="utilities.md">Utilities</h2>
+<h2 id="utilities">Utilities</h2>
@@ -1915,7 +1915,7 @@
-<h2 id="other.md">Other</h2>
+<h2 id="other">Other</h2>
@@ -1978,7 +1978,7 @@
-<h2 id="torch-tensor.md">torch.Tensor</h2>
+<h2 id="torch-tensor">torch.Tensor</h2>
@@ -4484,7 +4484,7 @@
-<h2 id="layers-(torch-nn).md">Layers \(torch.nn\)</h2>
+<h2 id="layers-torch-nn">Layers \(torch.nn\)</h2>
@@ -6710,7 +6710,7 @@
-<h2 id="functions(torch-nn-functional).md">Functions\(torch.nn.functional\)</h2>
+<h2 id="functionstorch-nn-functional">Functions\(torch.nn.functional\)</h2>
@@ -7417,7 +7417,7 @@
-<h2 id="torch-distributed.md">torch.distributed</h2>
+<h2 id="torch-distributed">torch.distributed</h2>
@@ -7641,7 +7641,7 @@
-<h2 id="npu-and-cuda-function-alignment.md">NPU and CUDA Function Alignment</h2>
+<h2 id="npu-and-cuda-function-alignment">NPU and CUDA Function Alignment</h2>
diff --git a/docs/en/PyTorch-Installation-Guide/PyTorch-Installation-Guide.md b/docs/en/PyTorch Installation Guide/PyTorch Installation Guide.md
similarity index 94%
rename from docs/en/PyTorch-Installation-Guide/PyTorch-Installation-Guide.md
rename to docs/en/PyTorch Installation Guide/PyTorch Installation Guide.md
index f48cc191c6b0e314e6cc07641f341e8f5d4d16a5..1239a71fed6a2e9f1331dd796043de22b49ff1b0 100644
--- a/docs/en/PyTorch-Installation-Guide/PyTorch-Installation-Guide.md
+++ b/docs/en/PyTorch Installation Guide/PyTorch Installation Guide.md
@@ -1,42 +1,42 @@
 # FrameworkPTAdapter 2.0.2 PyTorch Installation Guide
-- [Overview](#overview.md)
-- [Manual Build and Installation](#manual-build-and-installation.md)
-  - [Prerequisites](#prerequisites.md)
-  - [Installing the PyTorch Framework](#installing-the-pytorch-framework.md)
-  - [Configuring Environment Variables](#configuring-environment-variables.md)
-  - [Installing the Mixed Precision Module](#installing-the-mixed-precision-module.md)
-- [Using the Ascend Hub Image](#using-the-ascend-hub-image.md)
-  - [Obtaining the PyTorch Image from the Ascend Hub](#obtaining-the-pytorch-image-from-the-ascend-hub.md)
-  - [Configuring Environment Variables](#configuring-environment-variables-0.md)
-- [References](#references.md)
-  - [Installing CMake](#installing-cmake.md)
-  - [How Do I Install GCC 7.3.0?](#how-do-i-install-gcc-7-3-0.md)
-  - [What Do I Do If "torch 1.5.0xxxx" and "torchvision" Do Not Match When torch-\*.whl Is Installed?](#what-do-i-do-if-torch-1-5-0xxxx-and-torchvision-do-not-match-when-torch--whl-is-installed.md)
-<h2 id="overview.md">Overview</h2>
+- [Overview](#overview)
+- [Manual Build and Installation](#manual-build-and-installation)
+  - [Prerequisites](#prerequisites)
+  - [Installing the PyTorch Framework](#installing-the-pytorch-framework)
+  - [Configuring Environment Variables](#configuring-environment-variables)
+  - [Installing the Mixed Precision Module](#installing-the-mixed-precision-module)
+- [Using the Ascend Hub Image](#using-the-ascend-hub-image)
+  - [Obtaining the PyTorch Image from the Ascend Hub](#obtaining-the-pytorch-image-from-the-ascend-hub)
+  - [Configuring Environment Variables](#configuring-environment-variables-0)
+- [References](#references)
+  - [Installing CMake](#installing-cmake)
+  - [How Do I Install GCC 7.3.0?](#how-do-i-install-gcc-7-3-0)
+  - [What Do I Do If "torch 1.5.0xxxx" and "torchvision" Do Not Match When torch-\*.whl Is Installed?](#what-do-i-do-if-torch-1-5-0xxxx-and-torchvision-do-not-match-when-torch--whl-is-installed)
+<h2 id="overview">Overview</h2>
When setting up the environment for PyTorch model porting and training, you can manually build and install the modules adapted to the PyTorch framework on a training server, or use the base image provided by the Ascend Hub image center \(the PyTorch module and mixed precision module have been installed in the image\).

**Figure 1** Environment setup process
![](figures/environment-setup-process.png "environment-setup-process")

-<h2 id="manual-build-and-installation.md">Manual Build and Installation</h2>
+<h2 id="manual-build-and-installation">Manual Build and Installation</h2>
-- **[Prerequisites](#prerequisites.md)**
+- **[Prerequisites](#prerequisites)**

-- **[Installing the PyTorch Framework](#installing-the-pytorch-framework.md)**
+- **[Installing the PyTorch Framework](#installing-the-pytorch-framework)**

-- **[Configuring Environment Variables](#configuring-environment-variables.md)**
+- **[Configuring Environment Variables](#configuring-environment-variables)**

-- **[Installing the Mixed Precision Module](#installing-the-mixed-precision-module.md)**
+- **[Installing the Mixed Precision Module](#installing-the-mixed-precision-module)**

-<h2 id="prerequisites.md">Prerequisites</h2>
+<h2 id="prerequisites">Prerequisites</h2>
## Prerequisites

- The development or operating environment of CANN has been installed. For details, see the _CANN Software Installation Guide_.
-- CMake 3.12.0 or later has been installed. For details about how to install CMake, see [Installing CMake](#installing-cmake.md).
-- GCC 7.3.0 or later has been installed. For details about how to install and use GCC 7.3.0, see [How Do I Install GCC 7.3.0?](#how-do-i-install-gcc-7-3-0.md).
+- CMake 3.12.0 or later has been installed. For details about how to install CMake, see [Installing CMake](#installing-cmake).
+- GCC 7.3.0 or later has been installed. For details about how to install and use GCC 7.3.0, see [How Do I Install GCC 7.3.0?](#how-do-i-install-gcc-7-3-0).
- The Patch and Git tools have been installed in the environment. To install the tools for Ubuntu and CentOS, run the following commands:
  - Ubuntu

@@ -54,7 +54,7 @@ When setting up the environment for PyTorch model porting and training, you can

-<h2 id="installing-the-pytorch-framework.md">Installing the PyTorch Framework</h2>
+<h2 id="installing-the-pytorch-framework">Installing the PyTorch Framework</h2>
## Installation Process

@@ -155,7 +155,7 @@ When setting up the environment for PyTorch model porting and training, you can

   >**pip3 list | grep torch**

-<h2 id="configuring-environment-variables.md">Configuring Environment Variables</h2>
+<h2 id="configuring-environment-variables">Configuring Environment Variables</h2>
After the software packages are installed, configure environment variables to use Ascend PyTorch. You are advised to build a startup script, for example, the **set\_env.sh** script, and run **source set\_env.sh** to configure the environment variables. The content of the **set\_env.sh** script is as follows \(the **root** user is used as the installation user and the default installation path is used\):

@@ -224,7 +224,7 @@ export HCCL_IF_IP="1.1.1.1" # 1.1.1.1 is the NIC IP address of the host. Change

LD_LIBRARY_PATH

Dynamic library search path. Set this variable based on the preceding example.

If you need to upgrade GCC in OSs such as CentOS, Debian, and BC-Linux, add ${install_path}/lib64 to the LD_LIBRARY_PATH variable of the dynamic library search path. Replace {install_path} with the GCC installation path. For details, see 5.

PYTHONPATH
@@ -279,12 +279,12 @@ export HCCL_IF_IP="1.1.1.1" # 1.1.1.1 is the NIC IP address of the host. Change
-<h2 id="installing-the-mixed-precision-module.md">Installing the Mixed Precision Module</h2>
+<h2 id="installing-the-mixed-precision-module">Installing the Mixed Precision Module</h2>
## Prerequisites

1. Ensure that the PyTorch framework adapted to Ascend AI Processors in the operating environment can be used properly.
-2. Before building and installing Apex, you have configured the environment variables on which the build depends. See [Configuring Environment Variables](#configuring-environment-variables.md).
+2. Before building and installing Apex, you have configured the environment variables on which the build depends. See [Configuring Environment Variables](#configuring-environment-variables).

## Installation Process

@@ -376,14 +376,14 @@ export HCCL_IF_IP="1.1.1.1" # 1.1.1.1 is the NIC IP address of the host. Change

   >**pip3 list | grep apex**

-<h2 id="using-the-ascend-hub-image.md">Using the Ascend Hub Image</h2>
+<h2 id="using-the-ascend-hub-image">Using the Ascend Hub Image</h2>
-- **[Obtaining the PyTorch Image from the Ascend Hub](#obtaining-the-pytorch-image-from-the-ascend-hub.md)**
+- **[Obtaining the PyTorch Image from the Ascend Hub](#obtaining-the-pytorch-image-from-the-ascend-hub)**

-- **[Configuring Environment Variables](#configuring-environment-variables-0.md)**
+- **[Configuring Environment Variables](#configuring-environment-variables-0)**

-<h2 id="obtaining-the-pytorch-image-from-the-ascend-hub.md">Obtaining the PyTorch Image from the Ascend Hub</h2>
+<h2 id="obtaining-the-pytorch-image-from-the-ascend-hub">Obtaining the PyTorch Image from the Ascend Hub</h2>
## Prerequisites

@@ -417,20 +417,20 @@ Log in to the [Ascend Hub](https://ascendhub.huawei.com/#/home) to obtain the

-<h2 id="configuring-environment-variables-0.md">Configuring Environment Variables</h2>
+<h2 id="configuring-environment-variables-0">Configuring Environment Variables</h2>
-After starting and entering the image container, configure the environment variables on which model training depends by referring to [Configuring Environment Variables](#configuring-environment-variables.md).
+After starting and entering the image container, configure the environment variables on which model training depends by referring to [Configuring Environment Variables](#configuring-environment-variables).

-<h2 id="references.md">References</h2>
+<h2 id="references">References</h2>
-- **[Installing CMake](#installing-cmake.md)**
+- **[Installing CMake](#installing-cmake)**

-- **[How Do I Install GCC 7.3.0?](#how-do-i-install-gcc-7-3-0.md)**
+- **[How Do I Install GCC 7.3.0?](#how-do-i-install-gcc-7-3-0)**

-- **[What Do I Do If "torch 1.5.0xxxx" and "torchvision" Do Not Match When torch-\*.whl Is Installed?](#what-do-i-do-if-torch-1-5-0xxxx-and-torchvision-do-not-match-when-torch--whl-is-installed.md)**
+- **[What Do I Do If "torch 1.5.0xxxx" and "torchvision" Do Not Match When torch-\*.whl Is Installed?](#what-do-i-do-if-torch-1-5-0xxxx-and-torchvision-do-not-match-when-torch--whl-is-installed)**

-<h2 id="installing-cmake.md">Installing CMake</h2>
+<h2 id="installing-cmake">Installing CMake</h2>
Procedure for upgrading CMake to 3.12.1

@@ -469,7 +469,7 @@ Procedure for upgrading CMake to 3.12.1

   If the message "cmake version 3.12.1" is displayed, the installation is successful.

-<h2 id="how-do-i-install-gcc-7-3-0.md">How Do I Install GCC 7.3.0?</h2>
+<h2 id="how-do-i-install-gcc-7-3-0">How Do I Install GCC 7.3.0?</h2>
Perform the following steps as the **root** user.

@@ -550,7 +550,7 @@ Perform the following steps as the **root** user.

   >Skip this step if you do not need to use the compilation environment with GCC upgraded.

-<h2 id="what-do-i-do-if-torch-1-5-0xxxx-and-torchvision-do-not-match-when-torch--whl-is-installed.md">What Do I Do If "torch 1.5.0xxxx" and "torchvision" Do Not Match When torch-\*.whl Is Installed?</h2>
+<h2 id="what-do-i-do-if-torch-1-5-0xxxx-and-torchvision-do-not-match-when-torch--whl-is-installed">What Do I Do If "torch 1.5.0xxxx" and "torchvision" Do Not Match When torch-\*.whl Is Installed?</h2>
## Symptom

diff --git a/docs/en/PyTorch-Installation-Guide/figures/en-us_image_0000001180656411.png b/docs/en/PyTorch Installation Guide/figures/en-us_image_0000001180656411.png
similarity index 100%
rename from docs/en/PyTorch-Installation-Guide/figures/en-us_image_0000001180656411.png
rename to docs/en/PyTorch Installation Guide/figures/en-us_image_0000001180656411.png
diff --git a/docs/en/PyTorch-Installation-Guide/figures/environment-setup-process.png b/docs/en/PyTorch Installation Guide/figures/environment-setup-process.png
similarity index 100%
rename from docs/en/PyTorch-Installation-Guide/figures/environment-setup-process.png
rename to docs/en/PyTorch Installation Guide/figures/environment-setup-process.png
diff --git a/docs/en/PyTorch-Installation-Guide/public_sys-resources/icon-caution.gif b/docs/en/PyTorch Installation Guide/public_sys-resources/icon-caution.gif
similarity index 100%
rename from docs/en/PyTorch-Installation-Guide/public_sys-resources/icon-caution.gif
rename to docs/en/PyTorch Installation Guide/public_sys-resources/icon-caution.gif
diff --git a/docs/en/PyTorch-Installation-Guide/public_sys-resources/icon-danger.gif b/docs/en/PyTorch Installation Guide/public_sys-resources/icon-danger.gif
similarity index 100%
rename from docs/en/PyTorch-Installation-Guide/public_sys-resources/icon-danger.gif
rename to docs/en/PyTorch Installation Guide/public_sys-resources/icon-danger.gif
diff --git a/docs/en/PyTorch-Installation-Guide/public_sys-resources/icon-note.gif b/docs/en/PyTorch Installation Guide/public_sys-resources/icon-note.gif
similarity index 100%
rename from docs/en/PyTorch-Installation-Guide/public_sys-resources/icon-note.gif
rename to docs/en/PyTorch Installation Guide/public_sys-resources/icon-note.gif
diff --git a/docs/en/PyTorch-Installation-Guide/public_sys-resources/icon-notice.gif b/docs/en/PyTorch Installation Guide/public_sys-resources/icon-notice.gif
similarity index 100%
rename from docs/en/PyTorch-Installation-Guide/public_sys-resources/icon-notice.gif
rename to docs/en/PyTorch Installation Guide/public_sys-resources/icon-notice.gif
diff --git a/docs/en/PyTorch-Installation-Guide/public_sys-resources/icon-tip.gif b/docs/en/PyTorch Installation Guide/public_sys-resources/icon-tip.gif
similarity index 100%
rename from docs/en/PyTorch-Installation-Guide/public_sys-resources/icon-tip.gif
rename to docs/en/PyTorch Installation Guide/public_sys-resources/icon-tip.gif
diff --git a/docs/en/PyTorch-Installation-Guide/public_sys-resources/icon-warning.gif b/docs/en/PyTorch Installation Guide/public_sys-resources/icon-warning.gif
similarity index 100%
rename from docs/en/PyTorch-Installation-Guide/public_sys-resources/icon-warning.gif
rename to docs/en/PyTorch Installation Guide/public_sys-resources/icon-warning.gif
diff --git a/docs/en/PyTorch Network Model Porting and Training Guide/PyTorch Network Model Porting and Training Guide.md b/docs/en/PyTorch Network Model Porting and Training Guide/PyTorch Network Model Porting and Training Guide.md
index 3987aa95e2033affacda4761b5f6979a87e5960d..a8b75148b3686cf6abc9456d76cd6f8ac4b3022a 100644
--- a/docs/en/PyTorch Network Model Porting and Training Guide/PyTorch Network Model Porting and Training Guide.md
+++ b/docs/en/PyTorch Network Model Porting and Training Guide/PyTorch Network Model Porting and Training Guide.md
@@ -1,95 +1,95 @@
 # PyTorch Network Model Porting and Training Guide
-- [Overview](#overview.md)
-- [Restrictions and Limitations](#restrictions-and-limitations.md)
-- [Porting Process](#porting-process.md)
-- [Model Porting Evaluation](#model-porting-evaluation.md)
-- [Environment Setup](#environment-setup.md)
-  - [Setting Up the Operating Environment](#setting-up-the-operating-environment.md)
-  - [Configuring Environment Variables](#configuring-environment-variables.md)
-- [Model Porting](#model-porting.md)
-  - [Tool-Facilitated](#tool-facilitated.md)
-    - [Introduction](#introduction.md)
-    - [Instructions](#instructions.md)
-    - [Result Analysis](#result-analysis.md)
-  - [Manual](#manual.md)
-    - [Single-Device Training Model Porting](#single-device-training-model-porting.md)
-    - [Multi-Device Training Model Porting](#multi-device-training-model-porting.md)
-    - [Replacing PyTorch-related APIs](#replacing-pytorch-related-apis.md)
-  - [Mixed Precision](#mixed-precision.md)
-  - [Performance Optimization](#performance-optimization.md)
-    - [Overview](#overview-0.md)
-    - [Changing the CPU Performance Mode \(x86 Server\)](#changing-the-cpu-performance-mode-(x86-server).md)
-    - [Changing the CPU Performance Mode \(ARM Server\)](#changing-the-cpu-performance-mode-(arm-server).md)
-    - [Installing the High-Performance Pillow Library \(x86 Server\)](#installing-the-high-performance-pillow-library-(x86-server).md)
-    - [\(Optional\) Installing the OpenCV Library of the Specified Version](#(optional)-installing-the-opencv-library-of-the-specified-version.md)
-- [Model Training](#model-training.md)
-- [Performance Analysis and Optimization](#performance-analysis-and-optimization.md)
-  - [Prerequisites](#prerequisites.md)
-  - [Commissioning Process](#commissioning-process.md)
-    - [Overall Guideline](#overall-guideline.md)
-    - [Collecting Data Related to the Training Process](#collecting-data-related-to-the-training-process.md)
-    - [Performance Optimization](#performance-optimization-1.md)
-  - [Affinity Library](#affinity-library.md)
-    - [Source](#source.md)
-    - [Functions](#functions.md)
-- [Precision Commissioning](#precision-commissioning.md)
-  - [Prerequisites](#prerequisites-2.md)
-  - [Commissioning Process](#commissioning-process-3.md)
-    - [Overall Guideline](#overall-guideline-4.md)
-    - [Precision Optimization Methods](#precision-optimization-methods.md)
-- [Model Saving and Conversion](#model-saving-and-conversion.md)
-  - [Introduction](#introduction-5.md)
-  - [Saving a Model](#saving-a-model.md)
-  - [Exporting an ONNX Model](#exporting-an-onnx-model.md)
-- [Samples](#samples.md)
-  - [ResNet-50 Model Porting](#resnet-50-model-porting.md)
-    - [Obtaining Samples](#obtaining-samples.md)
-    - [Porting the Training Script](#porting-the-training-script.md)
-      - [Single-Device Training Modification](#single-device-training-modification.md)
-      - [Distributed Training Modification](#distributed-training-modification.md)
-    - [Executing the Script](#executing-the-script.md)
-  - [ShuffleNet Model Optimization](#shufflenet-model-optimization.md)
-    - [Obtaining Samples](#obtaining-samples-6.md)
-    - [Evaluating the Model](#evaluating-the-model.md)
-    - [Porting the Network](#porting-the-network.md)
-    - [Commissioning the Network](#commissioning-the-network.md)
-- [References](#references.md)
-  - [Single-Operator Sample Building](#single-operator-sample-building.md)
-  - [Single-Operator Dump Method](#single-operator-dump-method.md)
-    - [Common Environment Variables](#common-environment-variables.md)
-    - [dump op Method](#dump-op-method.md)
-  - [How Do I Install GCC 7.3.0?](#how-do-i-install-gcc-7-3-0.md)
-- [FAQs](#faqs.md)
-  - [FAQs About Software Installation](#faqs-about-software-installation.md)
-    - [pip3.7 install Pillow==5.3.0 Installation Failed](#pip3-7-install-pillow-5-3-0-installation-failed.md)
-  - [FAQs About Model and Operator Running](#faqs-about-model-and-operator-running.md)
-    - [What Do I Do If the Error Message "RuntimeError: ExchangeDevice:" Is Displayed During Model or Operator Running?](#what-do-i-do-if-the-error-message-runtimeerror-exchangedevice-is-displayed-during-model-or-operator.md)
-    - [What Do I Do If the Error Message "Error in atexit.\_run\_exitfuncs:" Is Displayed During Model or Operator Running?](#what-do-i-do-if-the-error-message-error-in-atexit-_run_exitfuncs-is-displayed-during-model-or-operat.md)
-    - [What Do I Do If the Error Message "terminate called after throwing an instance of 'c10::Error' what\(\): HelpACLExecute:" Is Displayed During Model Running?](#what-do-i-do-if-the-error-message-terminate-called-after-throwing-an-instance-of-c10-error-what()-he.md)
-    - [What Do I Do If the Error Message "ImportError: libhccl.so." Is Displayed During Model Running?](#what-do-i-do-if-the-error-message-importerror-libhccl-so-is-displayed-during-model-running.md)
-    - [What Do I Do If the Error Message "RuntimeError: Initialize." Is Displayed During Model Running?](#what-do-i-do-if-the-error-message-runtimeerror-initialize-is-displayed-during-model-running.md)
-    - [What Do I Do If the Error Message "TVM/te/cce error." Is Displayed During Model Running?](#what-do-i-do-if-the-error-message-tvm-te-cce-error-is-displayed-during-model-running.md)
-    - [What Do I Do If the Error Message "MemCopySync:drvMemcpy failed." Is Displayed During Model Running?](#what-do-i-do-if-the-error-message-memcopysync-drvmemcpy-failed-is-displayed-during-model-running.md)
-    - [What Do I Do If the Error Message "MemCopySync:drvMemcpy failed." Is Displayed During Model Running?](#what-do-i-do-if-the-error-message-memcopysync-drvmemcpy-failed-is-displayed-during-model-running-7.md)
-    - [What Do I Do If the Error Message "HelpACLExecute." Is Displayed After Multi-Task Delivery Is Disabled \(export TASK\_QUEUE\_ENABLE=0\) During Model Running?](#what-do-i-do-if-the-error-message-helpaclexecute-is-displayed-after-multi-task-delivery-is-disabled.md)
-    - [What Do I Do If the Error Message "55056 GetInputConstDataOut: ErrorNo: -1\(failed\)" Is Displayed During Model Running?](#what-do-i-do-if-the-error-message-55056-getinputconstdataout-errorno--1(failed)-is-displayed-during.md)
-  - [FAQs About Model Commissioning](#faqs-about-model-commissioning.md)
-    - [What Do I Do If the Error Message "RuntimeError: malloc:/..../pytorch/c10/npu/NPUCachingAllocator.cpp:293 NPU error, error code is 500000." Is Displayed During Model Commissioning?](#what-do-i-do-if-the-error-message-runtimeerror-malloc-pytorch-c10-npu-npucachingallocator-cpp-293-np.md)
-    - [What Do I Do If the Error Message "RuntimeError: Could not run 'aten::trunc.out' with arguments from the 'NPUTensorId' backend." Is Displayed During Model Commissioning](#what-do-i-do-if-the-error-message-runtimeerror-could-not-run-aten-trunc-out-with-arguments-from-the.md)
-    - [What Do I Do If the MaxPoolGradWithArgmaxV1 and max Operators Report Errors During Model Commissioning?](#what-do-i-do-if-the-maxpoolgradwithargmaxv1-and-max-operators-report-errors-during-model-commissioni.md)
-    - [What Do I Do If the Error Message "ModuleNotFoundError: No module named 'torch.\_C'" Is Displayed When torch Is Called?](#what-do-i-do-if-the-error-message-modulenotfounderror-no-module-named-torch-_c-is-displayed-when-tor.md)
-  - [FAQs About Other Operations](#faqs-about-other-operations.md)
-    - [What Do I Do If an Error Is Reported During CUDA Stream Synchronization?](#what-do-i-do-if-an-error-is-reported-during-cuda-stream-synchronization.md)
-    - [What Do I Do If aicpu\_kernels/libpt\_kernels.so Does Not Exist?](#what-do-i-do-if-aicpu_kernels-libpt_kernels-so-does-not-exist.md)
-    - [What Do I Do If the Python Process Is Residual When the npu-smi info Command Is Used to View Video Memory?](#what-do-i-do-if-the-python-process-is-residual-when-the-npu-smi-info-command-is-used-to-view-video-m.md)
-    - [What Do I Do If the Error Message "match op inputs failed"Is Displayed When the Dynamic Shape Is Used?](#what-do-i-do-if-the-error-message-match-op-inputs-failed-is-displayed-when-the-dynamic-shape-is-used.md)
-    - [What Do I Do If the Error Message "Op type SigmoidCrossEntropyWithLogitsV2 of ops kernel AIcoreEngine is unsupported" Is Displayed?](#what-do-i-do-if-the-error-message-op-type-sigmoidcrossentropywithlogitsv2-of-ops-kernel-aicoreengine.md)
-    - [What Do I Do If a Hook Failure Occurs?](#what-do-i-do-if-a-hook-failure-occurs.md)
-    - [What Do I Do If the Error Message "load state\_dict error." Is Displayed When the Weight Is Loaded?](#what-do-i-do-if-the-error-message-load-state_dict-error-is-displayed-when-the-weight-is-loaded.md)
-  - [FAQs About Distributed Model Training](#faqs-about-distributed-model-training.md)
-    - [What Do I Do If the Error Message "host not found." Is Displayed During Distributed Model Training?](#what-do-i-do-if-the-error-message-host-not-found-is-displayed-during-distributed-model-training.md)
-    - [What Do I Do If the Error Message "RuntimeError: connect\(\) timed out." Is Displayed During Distributed Model Training?](#what-do-i-do-if-the-error-message-runtimeerror-connect()-timed-out-is-displayed-during-distributed-m.md)
-<h2 id="overview.md">Overview</h2>
+- [Overview](#overview)
+- [Restrictions and Limitations](#restrictions-and-limitations)
+- [Porting Process](#porting-process)
+- [Model Porting Evaluation](#model-porting-evaluation)
+- [Environment Setup](#environment-setup)
+  - [Setting Up the Operating Environment](#setting-up-the-operating-environment)
+  - [Configuring Environment Variables](#configuring-environment-variables)
+- [Model Porting](#model-porting)
+  - [Tool-Facilitated](#tool-facilitated)
+    - [Introduction](#introduction)
+    - [Instructions](#instructions)
+    - [Result Analysis](#result-analysis)
+  - [Manual](#manual)
+    - [Single-Device Training Model Porting](#single-device-training-model-porting)
+    - [Multi-Device Training Model Porting](#multi-device-training-model-porting)
+    - [Replacing PyTorch-related APIs](#replacing-pytorch-related-apis)
+  - [Mixed Precision](#mixed-precision)
+  - [Performance Optimization](#performance-optimization)
+    - [Overview](#overview-0)
+    - [Changing the CPU Performance Mode \(x86 Server\)](#changing-the-cpu-performance-mode-x86-server)
+    - [Changing the CPU Performance Mode \(ARM Server\)](#changing-the-cpu-performance-mode-arm-server)
+    - [Installing the High-Performance Pillow Library \(x86 Server\)](#installing-the-high-performance-pillow-library-x86-server)
+    - [\(Optional\) Installing the OpenCV Library of the Specified Version](#optional-installing-the-opencv-library-of-the-specified-version)
+- [Model Training](#model-training)
+- [Performance Analysis and Optimization](#performance-analysis-and-optimization)
+  - [Prerequisites](#prerequisites)
+  - [Commissioning Process](#commissioning-process)
+    - [Overall Guideline](#overall-guideline)
+    - [Collecting Data Related to the Training Process](#collecting-data-related-to-the-training-process)
+    - [Performance Optimization](#performance-optimization-1)
+  - [Affinity Library](#affinity-library)
+    - [Source](#source)
+    - [Functions](#functions)
+- [Precision Commissioning](#precision-commissioning)
+  - [Prerequisites](#prerequisites-2)
+  - [Commissioning Process](#commissioning-process-3)
+    - [Overall Guideline](#overall-guideline-4)
+    - [Precision Optimization Methods](#precision-optimization-methods)
+- [Model Saving and Conversion](#model-saving-and-conversion)
+  - [Introduction](#introduction-5)
+  - [Saving a Model](#saving-a-model)
+  - [Exporting an ONNX Model](#exporting-an-onnx-model)
+- [Samples](#samples)
+  - [ResNet-50 Model Porting](#resnet-50-model-porting)
+    - [Obtaining Samples](#obtaining-samples)
+    - [Porting the Training Script](#porting-the-training-script)
+      - [Single-Device Training Modification](#single-device-training-modification)
+      - [Distributed Training Modification](#distributed-training-modification)
+    - [Executing the Script](#executing-the-script)
+  - [ShuffleNet Model Optimization](#shufflenet-model-optimization)
+    - [Obtaining Samples](#obtaining-samples-6)
+    - [Evaluating the Model](#evaluating-the-model)
+    - [Porting the Network](#porting-the-network)
+    - [Commissioning the Network](#commissioning-the-network)
+- [References](#references)
+  - [Single-Operator Sample Building](#single-operator-sample-building)
+  - [Single-Operator Dump Method](#single-operator-dump-method)
+    - [Common Environment Variables](#common-environment-variables)
+    - [dump op Method](#dump-op-method)
+  - [How Do I Install GCC 7.3.0?](#how-do-i-install-gcc-7-3-0)
+- [FAQs](#faqs)
+  - [FAQs About Software Installation](#faqs-about-software-installation)
+    - [pip3.7 install Pillow==5.3.0 Installation Failed](#pip3-7-install-pillow-5-3-0-installation-failed)
+  - [FAQs About Model and Operator Running](#faqs-about-model-and-operator-running)
+    - [What Do I Do If the Error Message "RuntimeError: ExchangeDevice:" Is Displayed During Model or Operator Running?](#what-do-i-do-if-the-error-message-runtimeerror-exchangedevice-is-displayed-during-model-or-operator)
+    - [What Do I Do If the Error Message "Error in atexit.\_run\_exitfuncs:" Is Displayed During Model or Operator Running?](#what-do-i-do-if-the-error-message-error-in-atexit-_run_exitfuncs-is-displayed-during-model-or-operat)
+    - [What Do I Do If the Error Message "terminate called after throwing an instance of 'c10::Error' what\(\): HelpACLExecute:" Is Displayed During Model Running?](#what-do-i-do-if-the-error-message-terminate-called-after-throwing-an-instance-of-c10-error-what-he)
+    - [What Do I Do If the Error Message "ImportError: libhccl.so." Is Displayed During Model Running?](#what-do-i-do-if-the-error-message-importerror-libhccl-so-is-displayed-during-model-running)
+    - [What Do I Do If the Error Message "RuntimeError: Initialize." Is Displayed During Model Running?](#what-do-i-do-if-the-error-message-runtimeerror-initialize-is-displayed-during-model-running)
+    - [What Do I Do If the Error Message "TVM/te/cce error." Is Displayed During Model Running?](#what-do-i-do-if-the-error-message-tvm-te-cce-error-is-displayed-during-model-running)
+    - [What Do I Do If the Error Message "MemCopySync:drvMemcpy failed." Is Displayed During Model Running?](#what-do-i-do-if-the-error-message-memcopysync-drvmemcpy-failed-is-displayed-during-model-running)
+    - [What Do I Do If the Error Message "MemCopySync:drvMemcpy failed." Is Displayed During Model Running?](#what-do-i-do-if-the-error-message-memcopysync-drvmemcpy-failed-is-displayed-during-model-running-7)
+    - [What Do I Do If the Error Message "HelpACLExecute." Is Displayed After Multi-Task Delivery Is Disabled \(export TASK\_QUEUE\_ENABLE=0\) During Model Running?](#what-do-i-do-if-the-error-message-helpaclexecute-is-displayed-after-multi-task-delivery-is-disabled)
+    - [What Do I Do If the Error Message "55056 GetInputConstDataOut: ErrorNo: -1\(failed\)" Is Displayed During Model Running?](#what-do-i-do-if-the-error-message-55056-getinputconstdataout-errorno--1failed-is-displayed-during)
+  - [FAQs About Model Commissioning](#faqs-about-model-commissioning)
+    - [What Do I Do If the Error Message "RuntimeError: malloc:/..../pytorch/c10/npu/NPUCachingAllocator.cpp:293 NPU error, error code is 500000." Is Displayed During Model Commissioning?](#what-do-i-do-if-the-error-message-runtimeerror-malloc-pytorch-c10-npu-npucachingallocator-cpp-293-np)
+    - [What Do I Do If the Error Message "RuntimeError: Could not run 'aten::trunc.out' with arguments from the 'NPUTensorId' backend." Is Displayed During Model Commissioning](#what-do-i-do-if-the-error-message-runtimeerror-could-not-run-aten-trunc-out-with-arguments-from-the)
+    - [What Do I Do If the MaxPoolGradWithArgmaxV1 and max Operators Report Errors During Model Commissioning?](#what-do-i-do-if-the-maxpoolgradwithargmaxv1-and-max-operators-report-errors-during-model-commissioni)
+    - [What Do I Do If the Error Message "ModuleNotFoundError: No module named 'torch.\_C'" Is Displayed When torch Is Called?](#what-do-i-do-if-the-error-message-modulenotfounderror-no-module-named-torch-_c-is-displayed-when-tor)
+  - [FAQs About Other Operations](#faqs-about-other-operations)
+    - [What Do I Do If an Error Is Reported During CUDA Stream Synchronization?](#what-do-i-do-if-an-error-is-reported-during-cuda-stream-synchronization)
+    - [What Do I Do If aicpu\_kernels/libpt\_kernels.so Does Not Exist?](#what-do-i-do-if-aicpu_kernels-libpt_kernels-so-does-not-exist)
+    - [What Do I Do If the Python Process Is Residual When the npu-smi info Command Is Used to View Video Memory?](#what-do-i-do-if-the-python-process-is-residual-when-the-npu-smi-info-command-is-used-to-view-video-m)
+    - [What Do I Do If the Error Message "match op inputs failed"Is Displayed When the Dynamic Shape Is Used?](#what-do-i-do-if-the-error-message-match-op-inputs-failed-is-displayed-when-the-dynamic-shape-is-used)
+    - [What Do I Do If the Error Message "Op type SigmoidCrossEntropyWithLogitsV2 of ops kernel AIcoreEngine is unsupported" Is Displayed?](#what-do-i-do-if-the-error-message-op-type-sigmoidcrossentropywithlogitsv2-of-ops-kernel-aicoreengine)
+    - [What Do I Do If a Hook Failure Occurs?](#what-do-i-do-if-a-hook-failure-occurs)
+    - [What Do I Do If the Error Message "load state\_dict error." Is Displayed When the Weight Is Loaded?](#what-do-i-do-if-the-error-message-load-state_dict-error-is-displayed-when-the-weight-is-loaded)
+  - [FAQs About Distributed Model Training](#faqs-about-distributed-model-training)
+    - [What Do I Do If the Error Message "host not found." Is Displayed During Distributed Model Training?](#what-do-i-do-if-the-error-message-host-not-found-is-displayed-during-distributed-model-training)
+    - [What Do I Do If the Error Message "RuntimeError: connect\(\) timed out." Is Displayed During Distributed Model Training?](#what-do-i-do-if-the-error-message-runtimeerror-connect-timed-out-is-displayed-during-distributed-m)
+<h2 id="overview">Overview</h2>
Currently, the solution of adapting to the Ascend AI Processor is an online solution.

@@ -110,7 +110,7 @@ Currently, the main reasons for selecting the online adaptation solution are as

4. It has good scalability. During the streamlining process, only the development and implementation of related compute operators are involved for new network types or structures. Framework operators, reverse graph building, and implementation mechanisms can be reused.
5. The usage and style are the same as those of the GPU-based implementation. During online adaptation, you only need to specify the device as the Ascend AI Processor in Python and device operations to develop, train, and debug the network in PyTorch using the Ascend AI Processor. You do not need to pay attention to the underlying details of the Ascend AI Processor. In this way, you can minimize the modification and complete porting with low costs.

-<h2 id="restrictions-and-limitations.md">Restrictions and Limitations</h2>
+<h2 id="restrictions-and-limitations">Restrictions and Limitations</h2>
- In the **infershape** phase, operators do not support unknown shape inference.
- Only the float16 operator can be used for cube computing.

@@ -124,7 +124,7 @@ Currently, the main reasons for selecting the online adaptation solution are as

- Only the int8, int32, float16, and float32 data types are supported.

-<h2 id="porting-process.md">Porting Process</h2>
+<h2 id="porting-process">Porting Process</h2>
Model porting refers to moving models that have been implemented in the open-source community to an Ascend AI Processor. [Figure 1](#fig759451810422) shows the model porting process.

@@ -142,12 +142,12 @@ Model porting refers to moving models that have been implemented in the open-sou

Model selection

For details, see Model Selection.

Model porting evaluation

For details, see Model Porting Evaluation.

Operator development

@@ -157,17 +157,17 @@ Model porting refers to moving models that have been implemented in the open-sou

Environment setup

For details, see Environment Setup.

Model porting

For details, see Model Porting.

Model training

For details, see Model Training.

Error analysis

@@ -177,17 +177,17 @@ Model porting refers to moving models that have been implemented in the open-sou

Performance analysis and optimization

For details, see Performance Optimization and Analysis.

Precision commissioning

For details, see Precision Commissioning.

Model saving and conversion

For details, see Model Saving and Conversion and "ATC Tool Instructions" in the _CANN Auxiliary Development Tool User Guide_.

Application software development

@@ -197,33 +197,33 @@ Model porting refers to moving models that have been implemented in the open-sou

FAQs

Describes how to prepare the environment, port models, commission models, and resolve other common problems. For details, see FAQs.
-<h2 id="model-porting-evaluation.md">Model Porting Evaluation</h2>
+<h2 id="model-porting-evaluation">Model Porting Evaluation</h2>
1. When selecting models, select authoritative PyTorch models as benchmarks, including but not limited to PyTorch \([example](https://github.com/pytorch/examples/tree/master/imagenet)/[vision](https://github.com/pytorch/vision)\), facebookresearch \([Detectron](https://github.com/facebookresearch/Detectron)/[detectron2](https://github.com/facebookresearch/detectron2)\), and open-mmlab \([mmdetection](https://github.com/open-mmlab/mmdetection)/[mmpose](https://github.com/open-mmlab/mmpose)\).
-2. Check the operator adaptation. Before porting the original model and training script to an Ascend AI Processor, train the original model and training script on the CPU, obtain the operator information by using the dump op method, and compare the operator information with that in the _PyTorch Adapted Operator List_ to check whether the operator is supported. For details about the dump op method, see [dump op Method](#dump-op-method.md). If an operator is not supported, develop the operator. For details, see the _PyTorch Operator Development Guide_.
+2. Check the operator adaptation. Before porting the original model and training script to an Ascend AI Processor, train the original model and training script on the CPU, obtain the operator information by using the dump op method, and compare the operator information with that in the _PyTorch Adapted Operator List_ to check whether the operator is supported. For details about the dump op method, see [dump op Method](#dump-op-method). If an operator is not supported, develop the operator. For details, see the _PyTorch Operator Development Guide_.

   >![](public_sys-resources/icon-note.gif) **NOTE:**
   >You can also port the model and training script to the Ascend AI Processor for training to view the error information. For details about how to port the model and training script, see the following sections. Generally, a message is displayed, indicating that an operator \(the first operator that is not supported\) cannot run in the backend of the Ascend AI Processor.

-<h2 id="environment-setup.md">Environment Setup</h2>
+<h2 id="environment-setup">Environment Setup</h2>
-- **[Setting Up the Operating Environment](#setting-up-the-operating-environment.md)**
+- **[Setting Up the Operating Environment](#setting-up-the-operating-environment)**

-- **[Configuring Environment Variables](#configuring-environment-variables.md)**
+- **[Configuring Environment Variables](#configuring-environment-variables)**

-<h2 id="setting-up-the-operating-environment.md">Setting Up the Operating Environment</h2>
+<h2 id="setting-up-the-operating-environment">Setting Up the Operating Environment</h2>
For details about how to set up the PyTorch operating environment, see the _PyTorch Installation Guide_.

-<h2 id="configuring-environment-variables.md">Configuring Environment Variables</h2>
+<h2 id="configuring-environment-variables">Configuring Environment Variables</h2>
After the software packages are installed, configure environment variables to use Ascend PyTorch. You are advised to build a startup script, for example, the **set\_env.sh** script, and run **source set\_env.sh** to configure the environment variables. The content of the **set\_env.sh** script is as follows \(the **root** user is used as the installation user and the default installation path is used\):

@@ -347,29 +347,29 @@ export HCCL_IF_IP="1.1.1.1" # 1.1.1.1 is the NIC IP address of the host. Change

-<h2 id="model-porting.md">Model Porting</h2>
+<h2 id="model-porting">Model Porting</h2>
-- **[Tool-Facilitated](#tool-facilitated.md)**
+- **[Tool-Facilitated](#tool-facilitated)**

-- **[Manual](#manual.md)**
+- **[Manual](#manual)**

-- **[Mixed Precision](#mixed-precision.md)**
+- **[Mixed Precision](#mixed-precision)**

-- **[Performance Optimization](#performance-optimization.md)**
+- **[Performance Optimization](#performance-optimization)**

-<h2 id="tool-facilitated.md">Tool-Facilitated</h2>
+<h2 id="tool-facilitated">Tool-Facilitated</h2>
The Ascend platform provides a script conversion tool that ports training scripts to Ascend AI Processors using commands; the following sections provide the details. In addition to using commands, you can also use the PyTorch GPU2Ascend function integrated in MindStudio to port scripts. For details, see the _MindStudio User Guide_.

-- **[Introduction](#introduction.md)**
+- **[Introduction](#introduction)**

-- **[Instructions](#instructions.md)**
+- **[Instructions](#instructions)**

-- **[Result Analysis](#result-analysis.md)**
+- **[Result Analysis](#result-analysis)**

-<h2 id="introduction.md">Introduction</h2>
+<h2 id="introduction">Introduction</h2>
## Overview

@@ -676,7 +676,7 @@ msFmkTransplt runs on Ubuntu 18.04, CentOS 7.6, and EulerOS 2.8 only.

Set up the development environment by referring to the _CANN Software Installation Guide_.

-<h2 id="instructions.md">Instructions</h2>
+<h2 id="instructions">Instructions</h2>
## Command-line Options

@@ -835,7 +835,7 @@ An example of a custom conversion rule is as follows:

3. Find the converted script in the specified output path.

-<h2 id="result-analysis.md">Result Analysis</h2>
+<h2 id="result-analysis">Result Analysis</h2>
You can view the result files in the output path when the script is converted.

@@ -846,16 +846,16 @@ You can view the result files in the output path when the script is converted.

```
│   ├── unsupported_op.xlsx                // File of the unsupported operator list
```

-<h2 id="manual.md">Manual</h2>
+<h2 id="manual">Manual</h2>
-- **[Single-Device Training Model Porting](#single-device-training-model-porting.md)**
+- **[Single-Device Training Model Porting](#single-device-training-model-porting)**

-- **[Multi-Device Training Model Porting](#multi-device-training-model-porting.md)**
+- **[Multi-Device Training Model Porting](#multi-device-training-model-porting)**

-- **[Replacing PyTorch-related APIs](#replacing-pytorch-related-apis.md)**
+- **[Replacing PyTorch-related APIs](#replacing-pytorch-related-apis)**

-<h2 id="single-device-training-model-porting.md">Single-Device Training Model Porting</h2>
+<h2 id="single-device-training-model-porting">Single-Device Training Model Porting</h2>
The advantage of the online adaptation is that the training on the Ascend AI Processor is consistent with the usage of the GPU. During online adaptation, **you only need to specify the device as the Ascend AI Processor in Python and device operations** to develop, train, and debug the network in PyTorch using the Ascend AI Processor. For single-device model training, the main changes for porting are as follows:

@@ -885,9 +885,9 @@ The code ported to the Ascend AI Processor is as follows:

       target = target.to(CALCULATE_DEVICE)
   ```

-For details, see [Single-Device Training Modification](#single-device-training-modification.md).
+For details, see [Single-Device Training Modification](#single-device-training-modification).
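For orientation, a minimal runnable sketch of this porting pattern is shown below. The device string, layer sizes, and batch shapes are illustrative and not part of the original sample; **torch.npu** is provided by the Ascend-adapted framework.

```python
import torch
import torch.nn as nn

CALCULATE_DEVICE = "npu:0"              # device string for the Ascend AI Processor
torch.npu.set_device(CALCULATE_DEVICE)  # bind this process to the target NPU

model = nn.Linear(16, 4).to(CALCULATE_DEVICE)        # move parameters to the NPU
criterion = nn.CrossEntropyLoss().to(CALCULATE_DEVICE)

# move each batch to the same device before the forward pass
images = torch.randn(8, 16).to(CALCULATE_DEVICE, non_blocking=True)
target = torch.randint(0, 4, (8,)).to(CALCULATE_DEVICE, non_blocking=True)

loss = criterion(model(images), target)  # forward pass runs on the NPU
loss.backward()
```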

-<h2 id="multi-device-training-model-porting.md">Multi-Device Training Model Porting</h2>
+<h2 id="multi-device-training-model-porting">Multi-Device Training Model Porting</h2>
To port a multi-device training model, **you need to specify the device as the Ascend AI Processor in Python and device operations**. In addition, you can perform distributed training using PyTorch **DistributedDataParallel**, that is, run **init\_process\_group** during model initialization, and then initialize the model into a **DistributedDataParallel** model. Note that the **backend** must be set to **hccl** and the initialization mode must be shielded when **init\_process\_group** is executed.

@@ -911,9 +911,9 @@ def main():

                                lr_scheduler)
   ```

-For details, see [Distributed Training Modification](#distributed-training-modification.md).
+For details, see [Distributed Training Modification](#distributed-training-modification).

-<h2 id="replacing-pytorch-related-apis.md">Replacing PyTorch-related APIs</h2>
+<h2 id="replacing-pytorch-related-apis">Replacing PyTorch-related APIs</h2>

1. To enable the Ascend AI Processor to use the capabilities of the PyTorch framework, the native PyTorch framework needs to be adapted at the device layer. The APIs related to the CPU and CUDA need to be replaced for external presentation. During network porting, some device-related APIs need to be replaced with the APIs related to the Ascend AI Processor. [Table 1](#table1922064517344) lists the supported device-related APIs.

@@ -1112,7 +1112,7 @@ For details, see [Distributed Training Modification](#distributed-training-modi

   For more APIs, see the _PyTorch API Support_.
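A condensed sketch of the initialization just described is shown below. The rendezvous address, port, and rank handling are illustrative; in a real launch they come from the launcher or environment.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup(rank: int, world_size: int, model: torch.nn.Module) -> DDP:
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")  # rendezvous address
    os.environ.setdefault("MASTER_PORT", "29500")
    # backend must be "hccl" on Ascend AI Processors; init_method is
    # omitted (shielded), so the default env:// initialization is used.
    dist.init_process_group(backend="hccl", rank=rank, world_size=world_size)
    torch.npu.set_device(f"npu:{rank}")        # one process per device
    model = model.to(f"npu:{rank}")
    return DDP(model, device_ids=[rank])       # wrap for distributed training
```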

-<h2 id="mixed-precision.md">Mixed Precision</h2>
+<h2 id="mixed-precision">Mixed Precision</h2>
## Overview

@@ -1203,27 +1203,27 @@ In addition to the preceding advantages, the mixed precision module Apex adapted

   ```
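The Apex usage itself is elided by the hunk above; as a rough sketch, the adapted module follows the standard Apex pattern. The opt_level and loss_scale values here are illustrative, not recommendations from this guide.

```python
import torch
from apex import amp  # mixed precision module Apex adapted for Ascend

model = torch.nn.Linear(16, 4).to("npu:0")
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Wrap model and optimizer once; O2 and a static loss scale are examples.
model, optimizer = amp.initialize(model, optimizer,
                                  opt_level="O2", loss_scale=128.0)

loss = model(torch.randn(8, 16).to("npu:0")).sum()
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()   # backward pass on the scaled loss
optimizer.step()
```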

-<h2 id="performance-optimization.md">Performance Optimization</h2>
+<h2 id="performance-optimization">Performance Optimization</h2>
-- **[Overview](#overview-0.md)**
+- **[Overview](#overview-0)**

-- **[Changing the CPU Performance Mode \(x86 Server\)](#changing-the-cpu-performance-mode-(x86-server).md)**
+- **[Changing the CPU Performance Mode \(x86 Server\)](#changing-the-cpu-performance-mode-x86-server)**

-- **[Changing the CPU Performance Mode \(ARM Server\)](#changing-the-cpu-performance-mode-(arm-server).md)**
+- **[Changing the CPU Performance Mode \(ARM Server\)](#changing-the-cpu-performance-mode-arm-server)**

-- **[Installing the High-Performance Pillow Library \(x86 Server\)](#installing-the-high-performance-pillow-library-(x86-server).md)**
+- **[Installing the High-Performance Pillow Library \(x86 Server\)](#installing-the-high-performance-pillow-library-x86-server)**

-- **[\(Optional\) Installing the OpenCV Library of the Specified Version](#(optional)-installing-the-opencv-library-of-the-specified-version.md)**
+- **[\(Optional\) Installing the OpenCV Library of the Specified Version](#optional-installing-the-opencv-library-of-the-specified-version)**

-<h2 id="overview-0.md">Overview</h2>
+<h2 id="overview-0">Overview</h2>
During PyTorch model porting and training, the number of images recognized within one second \(FPS\) for some network models is low and the performance does not meet the requirements. In this case, you need to perform the following optimization operations on the server:

- Change the CPU performance mode.
- Install the high-performance Pillow library.

-<h2 id="changing-the-cpu-performance-mode-(x86-server).md">Changing the CPU Performance Mode \(x86 Server\)</h2>
+<h2 id="changing-the-cpu-performance-mode-x86-server">Changing the CPU Performance Mode \(x86 Server\)</h2>
## Setting the Power Policy to High Performance

@@ -1330,7 +1330,7 @@ Perform the following steps as the **root** user:

4. Perform [Step 1](#li158435131344) again to check whether the current CPU mode is set to performance.

-<h2 id="changing-the-cpu-performance-mode-(arm-server).md">Changing the CPU Performance Mode \(ARM Server\)</h2>
+<h2 id="changing-the-cpu-performance-mode-arm-server">Changing the CPU Performance Mode \(ARM Server\)</h2>
## Setting the Power Policy to High Performance

@@ -1359,7 +1359,7 @@ Some models that have demanding requirements on the CPUs on the host, for exampl

6. Press **F10** to save the settings and reboot the server.

-<h2 id="installing-the-high-performance-pillow-library-(x86-server).md">Installing the High-Performance Pillow Library \(x86 Server\)</h2>
+<h2 id="installing-the-high-performance-pillow-library-x86-server">Installing the High-Performance Pillow Library \(x86 Server\)</h2>
1. Run the following command to install the dependencies for the high-performance pillow library:

@@ -1397,7 +1397,7 @@ Some models that have demanding requirements on the CPUs on the host, for exampl

   >```

-3. Modify the torchvision code to solve the problem that the pillow-simd does not contain the **PILLOW\_VERSION** field. For details about how to install torchvision, see [How to Obtain](#obtaining-samples.md).
+3. Modify the torchvision code to solve the problem that the pillow-simd does not contain the **PILLOW\_VERSION** field. For details about how to install torchvision, see [How to Obtain](#obtaining-samples).

   Modify the code in line 5 of **/usr/local/python3.7.5/lib/python3.7/site-packages/torchvision/transforms/functional.py** as follows:

@@ -1410,42 +1410,42 @@ Some models that have demanding requirements on the CPUs on the host, for exampl

-<h2 id="(optional)-installing-the-opencv-library-of-the-specified-version.md">\(Optional\) Installing the OpenCV Library of the Specified Version</h2>
+<h2 id="optional-installing-the-opencv-library-of-the-specified-version">\(Optional\) Installing the OpenCV Library of the Specified Version</h2>
If the model depends on OpenCV, you are advised to install OpenCV 3.4.10 to ensure training performance.

1. Source code: [Link](https://opencv.org/releases/)
2. Installation guide: [Link](https://docs.opencv.org/3.4.10/d7/d9f/tutorial_linux_install.html)

-<h2 id="model-training.md">Model Training</h2>
+<h2 id="model-training">Model Training</h2>
-After the training scripts are migrated, set environment variables by following the instructions in [Configuring Environment Variables](#configuring-environment-variables.md) and run the **python3.7** _xxx_ command to train a model. For details, see [Executing the Script](#executing-the-script.md).
+After the training scripts are migrated, set environment variables by following the instructions in [Configuring Environment Variables](#configuring-environment-variables) and run the **python3.7** _xxx_ command to train a model. For details, see [Executing the Script](#executing-the-script).

-<h2 id="performance-analysis-and-optimization.md">Performance Analysis and Optimization</h2>
+<h2 id="performance-analysis-and-optimization">Performance Analysis and Optimization</h2>
-- **[Prerequisites](#prerequisites.md)**
+- **[Prerequisites](#prerequisites)**

-- **[Commissioning Process](#commissioning-process.md)**
+- **[Commissioning Process](#commissioning-process)**

-- **[Affinity Library](#affinity-library.md)**
+- **[Affinity Library](#affinity-library)**

-<h2 id="prerequisites.md">Prerequisites</h2>
+<h2 id="prerequisites">Prerequisites</h2>
-1. Modify the open-source code to ensure that the model can run properly, including data preprocessing, forward propagation, loss calculation, mixed precision, back propagation, and parameter update. For details, see [Samples](#samples.md).
+1. Modify the open-source code to ensure that the model can run properly, including data preprocessing, forward propagation, loss calculation, mixed precision, back propagation, and parameter update. For details, see [Samples](#samples).
2. During model porting, check whether the model can run properly and whether the existing operators can meet the requirements. If no operator meets the requirements, develop an adapted operator. For details, see the _PyTorch Operator Development Guide_.
3. Prioritize the single-device function, and then enable the multi-device function.

-<h2 id="commissioning-process.md">Commissioning Process</h2>
+<h2 id="commissioning-process">Commissioning Process</h2>
-- **[Overall Guideline](#overall-guideline.md)**
+- **[Overall Guideline](#overall-guideline)**

-- **[Collecting Data Related to the Training Process](#collecting-data-related-to-the-training-process.md)**
+- **[Collecting Data Related to the Training Process](#collecting-data-related-to-the-training-process)**

-- **[Performance Optimization](#performance-optimization-1.md)**
+- **[Performance Optimization](#performance-optimization-1)**

-<h2 id="overall-guideline.md">Overall Guideline</h2>
+<h2 id="overall-guideline">Overall Guideline</h2>
1. Check whether the throughput meets the expected requirements based on the training execution result.
2. If the throughput does not meet requirements, you need to find out the causes of the performance bottleneck. Possible causes are as follows:

@@ -1456,7 +1456,7 @@

3. Analyze the preceding causes of performance bottlenecks and optimize the performance.

-<h2 id="collecting-data-related-to-the-training-process.md">Collecting Data Related to the Training Process</h2>
+<h2 id="collecting-data-related-to-the-training-process">Collecting Data Related to the Training Process</h2>

## Profile Data Collection

@@ -1545,20 +1545,20 @@ The network model is executed as an operator \(OP\). The OPInfo log can be used

6. Analyze the extra tasks in TaskInfo, especially transdata.
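The collection steps are elided by this hunk. As one possible sketch, operator-level timing can be gathered with the PyTorch profiler; the **use_npu** switch is an assumption about the Ascend-adapted framework \(mirroring **use_cuda**\), so verify it against your installed version.

```python
import torch

model = torch.nn.Linear(16, 4).to("npu:0")
inputs = torch.randn(8, 16).to("npu:0")

# use_npu=True is assumed to be the NPU counterpart of use_cuda in the
# Ascend-adapted framework; check your installed version.
with torch.autograd.profiler.profile(use_npu=True) as prof:
    model(inputs)

print(prof.key_averages().table(sort_by="self_cpu_time_total"))
prof.export_chrome_trace("output.prof")  # inspect in chrome://tracing
```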

-<h2 id="performance-optimization-1.md">Performance Optimization</h2>
+<h2 id="performance-optimization-1">Performance Optimization</h2>
## Operator Bottleneck Optimization

-1. Obtain the profile data during training. For details, see [Profile Data Collection](#collecting-data-related-to-the-training-process.md).
+1. Obtain the profile data during training. For details, see [Profile Data Collection](#collecting-data-related-to-the-training-process).
2. Analyze the profile data to obtain the time-consuming operator.
-3. See [Single-Operator Sample Building](#single-operator-sample-building.md) to build the single-operator sample of the time-consuming operator, and compare the execution time of a single-operator sample on the CPU and GPU. If the performance is insufficient, use either of the following methods to solve the problem:
+3. See [Single-Operator Sample Building](#single-operator-sample-building) to build the single-operator sample of the time-consuming operator, and compare the execution time of a single-operator sample on the CPU and GPU. If the performance is insufficient, use either of the following methods to solve the problem:
   - Workaround: Use other efficient operators with the same semantics.
   - Solution: Improve the operator performance.

## Copy Bottleneck Optimization

-1. Obtain the profile data during training. For details, see [Profile Data Collection](#collecting-data-related-to-the-training-process.md).
+1. Obtain the profile data during training. For details, see [Profile Data Collection](#collecting-data-related-to-the-training-process).
2. Analyze the Profile data to obtain the execution time of **D2DCopywithStreamSynchronize**, **PTCopy**, or **format\_contiguous** in the entire network.
3. If the execution takes a long time, use either of the following methods to solve the problem:
   - Method 1 \(workaround\): Replace view operators with compute operators. In PyTorch, view operators cause conversion from non-contiguous tensors to contiguous tensors. The optimization idea is to replace view operators with compute operators. Common view operators include view, permute, and transpose operators. For more view operators, go to [https://pytorch.org/docs/stable/tensor_view.html](https://pytorch.org/docs/stable/tensor_view.html).

## Framework Bottleneck Optimization

-1. Obtain the operator information \(OP\_INFO\) during the training. For details, see [Obtaining Operator Information \(OP\_INFO\)](#collecting-data-related-to-the-training-process.md).
+1. Obtain the operator information \(OP\_INFO\) during the training. For details, see [Obtaining Operator Information \(OP\_INFO\)](#collecting-data-related-to-the-training-process).
2. Analyze the specifications and calling relationship of operators in OP\_INFO to check whether redundant operators are inserted. Pay special attention to check whether transdata is proper.
3. Solution: Specify the initialization format of some operators to eliminate cast operators.
4. In **pytorch/torch/nn/modules/module.py**, specify the operator initialization format in **cast\_weight**, as shown in the following figure.

@@ -1582,25 +1582,25 @@ The network model is executed as an operator \(OP\). The OPInfo log can be used

## Compilation Bottleneck Optimization

-1. Obtain the operator information \(OP\_INFO\) during the training. For details, see [Obtaining Operator Information \(OP\_INFO\)](#collecting-data-related-to-the-training-process.md).
+1. Obtain the operator information \(OP\_INFO\) during the training. For details, see [Obtaining Operator Information \(OP\_INFO\)](#collecting-data-related-to-the-training-process).
2. View the INFO log and check the keyword **aclopCompile::aclOp** after the first step. If **Match op inputs/type failed** or **To compile op** is displayed, the operator is dynamically compiled and needs to be optimized.
3. Use either of the following methods to solve the problem:
   - Workaround: Based on the understanding of model semantics and related APIs, replace dynamic shape with static shape.
   - Solution: Reduce compilation or do not compile the operator.
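For the CPU/device comparison mentioned in step 3 of the operator bottleneck procedure, a small timing harness along these lines can be used. This is a sketch: the operator, tensor shapes, and iteration count are arbitrary, and **torch.npu.synchronize** is assumed to mirror **torch.cuda.synchronize** in the adapted framework.

```python
import time
import torch

def time_op(device: str, n_iter: int = 100) -> float:
    x = torch.randn(1024, 1024).to(device)
    x.permute(1, 0).contiguous()          # warm-up: excludes one-time compilation
    start = time.time()
    for _ in range(n_iter):
        x.permute(1, 0).contiguous()      # the single operator under test
    if device.startswith("npu"):
        torch.npu.synchronize()           # wait for queued NPU work
    return (time.time() - start) / n_iter

print("cpu   :", time_op("cpu"))
print("npu:0 :", time_op("npu:0"))
```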

-<h2 id="affinity-library.md">Affinity Library</h2>
+<h2 id="affinity-library">Affinity Library</h2>
-- **[Source](#source.md)**
+- **[Source](#source)**

-- **[Functions](#functions.md)**
+- **[Functions](#functions)**

-<h2 id="source.md">Source</h2>
+<h2 id="source">Source</h2>
The common network structures and functions in the public models are optimized to greatly improve computing performance. In addition, the network structures and functions are integrated into the PyTorch framework to facilitate model performance optimization.

-<h2 id="functions.md">Functions</h2>
+<h2 id="functions">Functions</h2>
@@ -1645,30 +1645,30 @@ The common network structures and functions in the public models are optimized t

   >![](public_sys-resources/icon-note.gif) **NOTE:**
   >The optimization content will be enhanced and updated with the version. Use the content in the corresponding path of the actual PyTorch version.

-<h2 id="precision-commissioning.md">Precision Commissioning</h2>
+<h2 id="precision-commissioning">Precision Commissioning</h2>
-- **[Prerequisites](#prerequisites-2.md)**
+- **[Prerequisites](#prerequisites-2)**

-- **[Commissioning Process](#commissioning-process-3.md)**
+- **[Commissioning Process](#commissioning-process-3)**

-<h2 id="prerequisites-2.md">Prerequisites</h2>
+<h2 id="prerequisites-2">Prerequisites</h2>
Run a certain number of epochs \(20% of the total number of epochs is recommended\) with the same semantics and hyperparameters to align the precision and loss with the corresponding level of the GPU. After the alignment is complete, align the final precision.

-<h2 id="commissioning-process-3.md">Commissioning Process</h2>
+<h2 id="commissioning-process-3">Commissioning Process</h2>
-- **[Overall Guideline](#overall-guideline-4.md)**
+- **[Overall Guideline](#overall-guideline-4)**

-- **[Precision Optimization Methods](#precision-optimization-methods.md)**
+- **[Precision Optimization Methods](#precision-optimization-methods)**

-<h2 id="overall-guideline-4.md">Overall Guideline</h2>
+<h2 id="overall-guideline-4">Overall Guideline</h2>
To locate the precision problem, you need to find out the step in which the problem occurs. The following aspects are involved:

1. Model network calculation error
-   - Locating method: Add a hook to the network to determine which part is suspected. Then build a [single-operator sample](#single-operator-sample-building.md) to narrow down the error range. This can prove that the operator calculation is incorrect in the current network. You can compare the result with the CPU or GPU result to prove the problem.
+   - Locating method: Add a hook to the network to determine which part is suspected. Then build a [single-operator sample](#single-operator-sample-building) to narrow down the error range. This can prove that the operator calculation is incorrect in the current network. You can compare the result with the CPU or GPU result to prove the problem.

   - Workaround: Use other operators with the same semantics.

@@ -1697,7 +1697,7 @@ To locate the precision problem, you need to find out the step in which the prob
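A minimal sketch of the hook-based locating method described above; the network and the statistics printed are illustrative.

```python
import torch
import torch.nn as nn

def make_hook(name):
    def hook(module, inputs, output):
        # print summary statistics per layer to spot abnormal outputs
        print(f"{name}: mean={output.mean().item():.6f}, "
              f"max={output.abs().max().item():.6f}")
    return hook

model = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 4))
for name, module in model.named_modules():
    if len(list(module.children())) == 0:      # leaf modules only
        module.register_forward_hook(make_hook(name))

model(torch.randn(2, 16))   # run once; compare the dumps on CPU and NPU
```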

Precision Optimization Methods

+

Precision Optimization Methods

1. Determine whether the calculation on the Ascend AI Processor is correct by comparing the calculation result of the CPU and that of the Ascend AI Processor. @@ -1757,16 +1757,16 @@ To locate the precision problem, you need to find out the step in which the prob ``` -

Model Saving and Conversion

+

Model Saving and Conversion

-- **[Introduction](#introduction-5.md)** +- **[Introduction](#introduction-5)** -- **[Saving a Model](#saving-a-model.md)** +- **[Saving a Model](#saving-a-model)** -- **[Exporting an ONNX Model](#exporting-an-onnx-model.md)** +- **[Exporting an ONNX Model](#exporting-an-onnx-model)** -

Introduction

+

Introduction

After the model training is complete, save the model file and export the ONNX model by using the APIs provided by PyTorch. Then use the ATC tool to convert the model into an .om file that adapts to the Ascend AI Processor for offline inference. @@ -1778,7 +1778,7 @@ For details about how to build an offline inference application, see the _CANN ![](figures/en-us_image_0000001106176222.png) -

Saving a Model

+

Saving a Model

During PyTorch training, **torch.save\(\)** is used to save checkpoint files. Depending on how the model file will be used, it is saved in either of the following two formats:

@@ -1851,7 +1851,7 @@ During PyTorch training, **torch.save\(\)** is used to save checkpoint files.

>![](public_sys-resources/icon-notice.gif) **NOTICE:**
>Generally, an operator is processed in different ways in the training graph and inference graph \(for example, BatchNorm and dropout operators\), and the input formats are also different. Therefore, before inference or ONNX model exporting, **model.eval\(\)** must be called to set the dropout and batch normalization layers to the inference mode.

-
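A minimal sketch of the two saving styles and of calling **model.eval\(\)** before export; the model and file names are illustrative:

```
import torch

model = torch.nn.Sequential(torch.nn.Linear(4, 2), torch.nn.Dropout(0.5))

# Format 1: save only the learned parameters (smaller; preferred for resuming training).
torch.save(model.state_dict(), "checkpoint.pth")

# Format 2: save the entire module (ties the file to the current class definition).
torch.save(model, "model.pth")

# Before inference or ONNX export, switch dropout/batch-norm layers to inference mode.
model.eval()
```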

Exporting an ONNX Model

+

Exporting an ONNX Model

## Introduction @@ -1938,23 +1938,23 @@ if __name__ == "__main__": convert() ``` -

Samples

+

Samples

-- **[ResNet-50 Model Porting](#resnet-50-model-porting.md)** +- **[ResNet-50 Model Porting](#resnet-50-model-porting)** -- **[ShuffleNet Model Optimization](#shufflenet-model-optimization.md)** +- **[ShuffleNet Model Optimization](#shufflenet-model-optimization)** -

ResNet-50 Model Porting

+

ResNet-50 Model Porting

-- **[Obtaining Samples](#obtaining-samples.md)** +- **[Obtaining Samples](#obtaining-samples)** -- **[Porting the Training Script](#porting-the-training-script.md)** +- **[Porting the Training Script](#porting-the-training-script)** -- **[Executing the Script](#executing-the-script.md)** +- **[Executing the Script](#executing-the-script)** -

Obtaining Samples

+

Obtaining Samples

## How to Obtain @@ -1984,7 +1984,7 @@ if __name__ == "__main__": >![](public_sys-resources/icon-note.gif) **NOTE:** >ResNet-50 is a model built in PyTorch. For more built-in models, visit the [PyTorch official website](https://pytorch.org/). - 2. During script execution, set **arch** to **resnet50**. This method is used in the sample. For details, see [Executing the Script](#executing-the-script.md). + 2. During script execution, set **arch** to **resnet50**. This method is used in the sample. For details, see [Executing the Script](#executing-the-script). ``` --arch resnet50 @@ -2000,14 +2000,14 @@ The structure of major directories and files is as follows: ├──main.py ``` -

Porting the Training Script

+

Porting the Training Script

-- **[Single-Device Training Modification](#single-device-training-modification.md)** +- **[Single-Device Training Modification](#single-device-training-modification)** -- **[Distributed Training Modification](#distributed-training-modification.md)** +- **[Distributed Training Modification](#distributed-training-modification)** -

Single-Device Training Modification

+

Single-Device Training Modification

1. Add the required import statements to **main.py** to support model training on the Ascend 910 AI Processor based on the PyTorch framework.

@@ -2150,7 +2150,7 @@ The structure of major directories and files is as follows:

```

-
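A minimal sketch of the core single-device changes this section describes, assuming the Ascend-adapted PyTorch build; the device string is illustrative:

```
import torch

CALCULATE_DEVICE = "npu:0"              # illustrative device name
torch.npu.set_device(CALCULATE_DEVICE)  # bind the process to one Ascend device

model = torch.nn.Linear(16, 4).to(CALCULATE_DEVICE)  # move the model to the NPU
inputs = torch.rand(8, 16).to(CALCULATE_DEVICE)      # move inputs the same way
loss = model(inputs).sum()
loss.backward()
```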

Distributed Training Modification

+

Distributed Training Modification

1. Add the required import statements to **main.py** to support mixed-precision model training on the Ascend 910 AI Processor based on the PyTorch framework.

@@ -2501,7 +2501,7 @@ The structure of major directories and files is as follows:

```

-
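A minimal sketch of the distributed setup this section describes; the address, port, world size, rank, and device index are illustrative, and **hccl** is the backend required on Ascend AI devices:

```
import torch
import torch.distributed as dist

# One process per Ascend device; the backend must be "hccl" on Ascend AI devices.
dist.init_process_group(backend="hccl",
                        init_method="tcp://1.1.1.1:23456",  # illustrative address/port
                        world_size=8, rank=0)
torch.npu.set_device("npu:0")  # each rank binds its own device

model = torch.nn.Linear(16, 4).to("npu:0")
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[0])
```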

Executing the Script

+

Executing the Script

## Preparing a Dataset @@ -2509,7 +2509,7 @@ Prepare a dataset and upload it to a directory in the operating environment, for ## Configuring Environment Variables -For details, see [Configuring Environment Variables](#configuring-environment-variables.md). +For details, see [Configuring Environment Variables](#configuring-environment-variables). ## Command @@ -2552,18 +2552,18 @@ python3.7 main.py /home/data/resnet50/imagenet --addr='1.1.1.1' \ >![](public_sys-resources/icon-note.gif) **NOTE:** >**dist-backend** must be set to **hccl** to support distributed training on the Ascend AI device. -

ShuffleNet Model Optimization

+

ShuffleNet Model Optimization

-- **[Obtaining Samples](#obtaining-samples-6.md)** +- **[Obtaining Samples](#obtaining-samples-6)** -- **[Evaluating the Model](#evaluating-the-model.md)** +- **[Evaluating the Model](#evaluating-the-model)** -- **[Porting the Network](#porting-the-network.md)** +- **[Porting the Network](#porting-the-network)** -- **[Commissioning the Network](#commissioning-the-network.md)** +- **[Commissioning the Network](#commissioning-the-network)** -

Obtaining Samples

+

Obtaining Samples

## How to Obtain @@ -2586,17 +2586,17 @@ The structure of major directories and files is as follows: ├──main.py ``` -

Evaluating the Model

+

Evaluating the Model

Model evaluation focuses on operator adaptation. Use the dump op method to obtain the ShuffleNet operator information and compare the information with that in the _PyTorch Adapted Operator List_. If an operator is not supported, in simple scenarios, you can replace the operator with a similar operator or place the operator on the CPU to avoid this problem. In complex scenarios, operator development is required. For details, see the _PyTorch Operator Development Guide_. -

Porting the Network

+

Porting the Network

-For details about how to port the training scripts, see [Single-Device Training Modification](#single-device-training-modification.md) and [Distributed Training Modification](#distributed-training-modification.md). During the script execution, select the **--arch shufflenet\_v2\_x1\_0** parameter. +For details about how to port the training scripts, see [Single-Device Training Modification](#single-device-training-modification) and [Distributed Training Modification](#distributed-training-modification). During the script execution, select the **--arch shufflenet\_v2\_x1\_0** parameter. -

Commissioning the Network

+

Commissioning the Network

-For details about how to commission the network, see [Commissioning Process](#commissioning-process.md). After check, it is found that too much time is consumed by operators during ShuffleNet running. The following provides the time consumption data and solutions.
+For details about how to commission the network, see [Commissioning Process](#commissioning-process). Checks show that operators consume too much time when ShuffleNet runs. The following provides the time consumption data and solutions.

## Forward check

@@ -2665,10 +2665,10 @@ The forward check record table is as follows:

The details are as follows:

-- The native **torch.transpose\(x, 1, 2\).contiguous\(\)** uses the view operator transpose, which produced non-contiguous tensors. For example, the copy bottleneck described in the [copy bottleneck optimization](#performance-optimization-1.md) uses **channel\_shuffle\_index\_select** to replace the framework operator with the compute operator when the semantics is the same, reducing the time consumption.
-- ShuffleNet V2 contains a large number of chunk operations, and chunk operations are framework operators in PyTorch. As a result, a tensor is split into several non-contiguous tensors of the same length. The operation of converting non-contiguous tensors to contiguous tensors takes a long time. Therefore, the compute operator is used to eliminate non-contiguous tensors. For details, see the copy bottleneck described in the [copy bottleneck optimization](#performance-optimization-1.md)
+- The native **torch.transpose\(x, 1, 2\).contiguous\(\)** uses the view operator transpose, which produces non-contiguous tensors. As described in the [copy bottleneck optimization](#performance-optimization-1), **channel\_shuffle\_index\_select** replaces this framework operator with a compute operator of the same semantics, reducing the time consumption \(a sketch is provided at the end of this section\).
+- ShuffleNet V2 contains a large number of chunk operations, and chunk operations are framework operators in PyTorch. As a result, a tensor is split into several non-contiguous tensors of the same length, and converting them back to contiguous tensors takes a long time. Therefore, a compute operator is used to eliminate the non-contiguous tensors. For details, see the copy bottleneck described in the [copy bottleneck optimization](#performance-optimization-1).
- During operator adaptation, the output format is specified as the input format by default. However, Concat does not support the 5HD format whose C dimension is not an integral multiple of 16, so it converts the format into 4D for processing. In addition, the Concat is followed by the GatherV2 operator, which supports only the 4D format. Therefore, the data format conversion process is 5HD \> 4D \> Concat \> 5HD \> 4D \> GatherV2 \> 5HD. The solution is to modify the Concat output format. When the output format is not an integer multiple of 16, the specified output format is 4D. After the optimization, the data format conversion process is 5HD \> 4D \> Concat \> GatherV2 \> 5HD. For details about the method for ShuffleNet, see line 121 in **pytorch/aten/src/ATen/native/npu/CatKernelNpu.cpp**.
-- Set the weight initialization format to avoid repeated transdata during calculation, for example, the framework bottleneck described in the [copy bottleneck optimization](#performance-optimization-1.md).
+- Set the weight initialization format to avoid repeated transdata during calculation, as with the framework bottleneck described in [performance optimization](#performance-optimization-1).
- The output format of the DWCONV weight is rectified to avoid the unnecessary conversion from 5HD to 4D.

## Entire Network Check

@@ -3093,22 +3093,22 @@ for group in [2, 4, 8]:

```

-
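A sketch of the **channel\_shuffle\_index\_select** replacement mentioned in the forward-check details; it precomputes a channel permutation so the shuffle needs no non-contiguous intermediate tensor (this standalone implementation is illustrative):

```
import torch

def channel_shuffle_index_select(x, groups=2):
    # Same result as x.view(n, g, c // g, h, w).transpose(1, 2).contiguous().view(n, c, h, w),
    # but index_select avoids creating a non-contiguous intermediate tensor.
    n, c, h, w = x.shape
    index = torch.arange(c).view(groups, c // groups).t().reshape(-1).to(x.device)
    return x.index_select(1, index)
```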

References

+

References

-- **[Single-Operator Sample Building](#single-operator-sample-building.md)** +- **[Single-Operator Sample Building](#single-operator-sample-building)** -- **[Single-Operator Dump Method](#single-operator-dump-method.md)** +- **[Single-Operator Dump Method](#single-operator-dump-method)** -- **[Common Environment Variables](#common-environment-variables.md)** +- **[Common Environment Variables](#common-environment-variables)** -- **[dump op Method](#dump-op-method.md)** +- **[dump op Method](#dump-op-method)** -- **[How Do I Install GCC 7.3.0?](#how-do-i-install-gcc-7-3-0.md)** +- **[How Do I Install GCC 7.3.0?](#how-do-i-install-gcc-7-3-0)** -

Single-Operator Sample Building

+

Single-Operator Sample Building

-When a problem occurs in a model, it is costly to reproduce the problem in the entire network. You can build a single-operator sample to reproduce the precision or performance problem to locate and solve the problem. A single-operator sample can be built in either of the following ways: For details about single-operator dump methods, see [Single-Operator Dump Method](#single-operator-dump-method.md).
+When a problem occurs in a model, it is costly to reproduce it in the entire network. You can build a single-operator sample that reproduces the precision or performance problem in isolation, and use it to locate and solve the problem. A single-operator sample can be built in either of the following ways. For details about single-operator dump methods, see [Single-Operator Dump Method](#single-operator-dump-method).

1. Build a single-operator sample test case. You can directly call the operator to reproduce the error scenario.

@@ -3201,7 +3201,7 @@ When a problem occurs in a model, it is costly to reproduce the problem in the e

```

-
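A minimal sketch of a single-operator performance sample, timing one operator on the CPU; the NPU variant (commented out) assumes an Ascend environment and an illustrative device name:

```
import time
import torch

def time_op(x, repeat=100):
    y = torch.matmul(x, x)          # warm-up so one-off compilation is not measured
    start = time.time()
    for _ in range(repeat):
        y = torch.matmul(x, x)
    return (time.time() - start) / repeat

x = torch.rand(512, 512)
print("cpu:", time_op(x))
# On an Ascend environment (synchronize the stream before reading the clock):
# torch.npu.set_device("npu:0")
# print("npu:", time_op(x.to("npu:0")))
```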

Single-Operator Dump Method

+

Single-Operator Dump Method

## Collecting Dump Data @@ -3309,7 +3309,7 @@ The fields in the dump data path and file are described as follows: The dimension and **Dtype** information no longer exist in the .txt file. For details, visit the NumPy website. -

Common Environment Variables

+

Common Environment Variables

1. Enable task delivery in multi-thread mode. When this function is enabled, the training performance of the entire network improves in most cases.

@@ -3327,7 +3327,7 @@ The fields in the dump data path and file are described as follows:

**export DUMP\_GRAPH\_LEVEL=3**

-

dump op Method

+

dump op Method

1. Use the profile API to reconstruct the loss calculation and optimization process of the original code training script and print the operator information. The following is a code example. @@ -3342,7 +3342,7 @@ The fields in the dump data path and file are described as follows: 2. Train the reconstructed training script on the CPU. The related operator information is displayed. -
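A minimal sketch of the profile-based reconstruction in step 1, assuming a standard model, loss, and optimizer (all illustrative):

```
import torch

model = torch.nn.Linear(16, 4)
criterion = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
inputs, target = torch.rand(8, 16), torch.rand(8, 4)

# Rebuild one training step under the profiler so every dispatched operator is recorded.
with torch.autograd.profiler.profile() as prof:
    output = model(inputs)
    loss = criterion(output, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
print(prof)  # prints the collected operator information
```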

How Do I Install GCC 7.3.0?

+

How Do I Install GCC 7.3.0?

Perform the following steps as the **root** user.

@@ -3424,25 +3424,25 @@ Perform the following steps as the **root** user.

>Skip this step if you do not need to use the compilation environment with the upgraded GCC.

-

FAQs

+

FAQs

-- **[FAQs About Software Installation](#faqs-about-software-installation.md)** +- **[FAQs About Software Installation](#faqs-about-software-installation)** -- **[FAQs About Model and Operator Running](#faqs-about-model-and-operator-running.md)** +- **[FAQs About Model and Operator Running](#faqs-about-model-and-operator-running)** -- **[FAQs About Model Commissioning](#faqs-about-model-commissioning.md)** +- **[FAQs About Model Commissioning](#faqs-about-model-commissioning)** -- **[FAQs About Other Operations](#faqs-about-other-operations.md)** +- **[FAQs About Other Operations](#faqs-about-other-operations)** -- **[FAQs About Distributed Model Training](#faqs-about-distributed-model-training.md)** +- **[FAQs About Distributed Model Training](#faqs-about-distributed-model-training)** -

FAQs About Software Installation

+

FAQs About Software Installation

-- **[pip3.7 install Pillow==5.3.0 Installation Failed](#pip3-7-install-pillow-5-3-0-installation-failed.md)** +- **[pip3.7 install Pillow==5.3.0 Installation Failed](#pip3-7-install-pillow-5-3-0-installation-failed)** -

pip3.7 install Pillow==5.3.0 Installation Failed

+

pip3.7 install Pillow==5.3.0 Installation Failed

## Symptom

@@ -3465,30 +3465,30 @@ Run the following commands to install the dependencies:

**yum install libjpeg-turbo-devel python-devel zlib-devel** \(CentOS\) or **apt-get install libjpeg-dev python-dev zlib1g-dev** \(Ubuntu\)

-

FAQs About Model and Operator Running

+

FAQs About Model and Operator Running

-- **[What Do I Do If the Error Message "RuntimeError: ExchangeDevice:" Is Displayed During Model or Operator Running?](#what-do-i-do-if-the-error-message-runtimeerror-exchangedevice-is-displayed-during-model-or-operator.md)** +- **[What Do I Do If the Error Message "RuntimeError: ExchangeDevice:" Is Displayed During Model or Operator Running?](#what-do-i-do-if-the-error-message-runtimeerror-exchangedevice-is-displayed-during-model-or-operator)** -- **[What Do I Do If the Error Message "Error in atexit.\_run\_exitfuncs:" Is Displayed During Model or Operator Running?](#what-do-i-do-if-the-error-message-error-in-atexit-_run_exitfuncs-is-displayed-during-model-or-operat.md)** +- **[What Do I Do If the Error Message "Error in atexit.\_run\_exitfuncs:" Is Displayed During Model or Operator Running?](#what-do-i-do-if-the-error-message-error-in-atexit-_run_exitfuncs-is-displayed-during-model-or-operat)** -- **[What Do I Do If the Error Message "terminate called after throwing an instance of 'c10::Error' what\(\): HelpACLExecute:" Is Displayed During Model Running?](#what-do-i-do-if-the-error-message-terminate-called-after-throwing-an-instance-of-c10-error-what()-he.md)** +- **[What Do I Do If the Error Message "terminate called after throwing an instance of 'c10::Error' what\(\): HelpACLExecute:" Is Displayed During Model Running?](#what-do-i-do-if-the-error-message-terminate-called-after-throwing-an-instance-of-c10-error-what-he)** -- **[What Do I Do If the Error Message "ImportError: libhccl.so." Is Displayed During Model Running?](#what-do-i-do-if-the-error-message-importerror-libhccl-so-is-displayed-during-model-running.md)** +- **[What Do I Do If the Error Message "ImportError: libhccl.so." Is Displayed During Model Running?](#what-do-i-do-if-the-error-message-importerror-libhccl-so-is-displayed-during-model-running)** -- **[What Do I Do If the Error Message "RuntimeError: Initialize." Is Displayed During Model Running?](#what-do-i-do-if-the-error-message-runtimeerror-initialize-is-displayed-during-model-running.md)** +- **[What Do I Do If the Error Message "RuntimeError: Initialize." Is Displayed During Model Running?](#what-do-i-do-if-the-error-message-runtimeerror-initialize-is-displayed-during-model-running)** -- **[What Do I Do If the Error Message "TVM/te/cce error." Is Displayed During Model Running?](#what-do-i-do-if-the-error-message-tvm-te-cce-error-is-displayed-during-model-running.md)** +- **[What Do I Do If the Error Message "TVM/te/cce error." Is Displayed During Model Running?](#what-do-i-do-if-the-error-message-tvm-te-cce-error-is-displayed-during-model-running)** -- **[What Do I Do If the Error Message "MemCopySync:drvMemcpy failed." Is Displayed During Model Running?](#what-do-i-do-if-the-error-message-memcopysync-drvmemcpy-failed-is-displayed-during-model-running.md)** +- **[What Do I Do If the Error Message "MemCopySync:drvMemcpy failed." Is Displayed During Model Running?](#what-do-i-do-if-the-error-message-memcopysync-drvmemcpy-failed-is-displayed-during-model-running)** -- **[What Do I Do If the Error Message "MemCopySync:drvMemcpy failed." Is Displayed During Model Running?](#what-do-i-do-if-the-error-message-memcopysync-drvmemcpy-failed-is-displayed-during-model-running-7.md)** +- **[What Do I Do If the Error Message "MemCopySync:drvMemcpy failed." Is Displayed During Model Running?](#what-do-i-do-if-the-error-message-memcopysync-drvmemcpy-failed-is-displayed-during-model-running-7)** -- **[What Do I Do If the Error Message "HelpACLExecute." 
Is Displayed After Multi-Task Delivery Is Disabled \(export TASK\_QUEUE\_ENABLE=0\) During Model Running?](#what-do-i-do-if-the-error-message-helpaclexecute-is-displayed-after-multi-task-delivery-is-disabled.md)** +- **[What Do I Do If the Error Message "HelpACLExecute." Is Displayed After Multi-Task Delivery Is Disabled \(export TASK\_QUEUE\_ENABLE=0\) During Model Running?](#what-do-i-do-if-the-error-message-helpaclexecute-is-displayed-after-multi-task-delivery-is-disabled)** -- **[What Do I Do If the Error Message "55056 GetInputConstDataOut: ErrorNo: -1\(failed\)" Is Displayed During Model Running?](#what-do-i-do-if-the-error-message-55056-getinputconstdataout-errorno--1(failed)-is-displayed-during.md)** +- **[What Do I Do If the Error Message "55056 GetInputConstDataOut: ErrorNo: -1\(failed\)" Is Displayed During Model Running?](#what-do-i-do-if-the-error-message-55056-getinputconstdataout-errorno--1failed-is-displayed-during)** -

What Do I Do If the Error Message "RuntimeError: ExchangeDevice:" Is Displayed During Model or Operator Running?

+

What Do I Do If the Error Message "RuntimeError: ExchangeDevice:" Is Displayed During Model or Operator Running?

## Symptom @@ -3502,7 +3502,7 @@ Currently, only one NPU device can be called in a thread. When different NPU dev In the code, when **torch.npu.set\_device\(device\)**, **tensor.to\(device\)**, or **model.to\(device\)** is called in the same thread, the device names are inconsistent. For multiple threads \(such as multi-device training\), each thread can call only a fixed NPU device. -
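A sketch of consistent device usage within one thread; the device name is illustrative:

```
import torch

device = "npu:0"                       # use one fixed device name per thread
torch.npu.set_device(device)
model = torch.nn.Linear(4, 2).to(device)
inputs = torch.rand(1, 4).to(device)   # the same device string everywhere
output = model(inputs)
```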

What Do I Do If the Error Message "Error in atexit.\_run\_exitfuncs:" Is Displayed During Model or Operator Running?

+

What Do I Do If the Error Message "Error in atexit.\_run\_exitfuncs:" Is Displayed During Model or Operator Running?

## Symptom @@ -3516,7 +3516,7 @@ If no NPU device is specified by **torch.npu.device\(id\)** during torch initi Before calling an NPU device, specify the NPU device by using **torch.npu.set\_device\(device\)**. -

What Do I Do If the Error Message "terminate called after throwing an instance of 'c10::Error' what\(\): HelpACLExecute:" Is Displayed During Model Running?

+

What Do I Do If the Error Message "terminate called after throwing an instance of 'c10::Error' what\(\): HelpACLExecute:" Is Displayed During Model Running?

## Symptom @@ -3533,7 +3533,7 @@ You can resolve this exception by using either of the following methods: - Check the host error log information. The default log path is **/var/log/npu/slog/host-0/**. Search for the log file whose name is prefixed with **host-0** based on the time identifier, open the log file, and search for error information using keyword **ERROR**. - Disable multi-thread delivery \(**export TASK\_QUEUE\_ENABLE=0**\) and run the code again. Generally, you can locate the fault based on the error information reported by the terminal. -

What Do I Do If the Error Message "ImportError: libhccl.so." Is Displayed During Model Running?

+

What Do I Do If the Error Message "ImportError: libhccl.so." Is Displayed During Model Running?

## Symptom @@ -3547,7 +3547,7 @@ Currently, the released PyTorch installation package uses the NPU and HCCL funct Add the path of the HCCL module to the environment variables. Generally, the path of the HCCL library file is **.../fwkacllib/python/site-packages/hccl** in the installation package. -

What Do I Do If the Error Message "RuntimeError: Initialize." Is Displayed During Model Running?

+

What Do I Do If the Error Message "RuntimeError: Initialize." Is Displayed During Model Running?

## Symptom @@ -3585,7 +3585,7 @@ To solve the problem, perform the following steps: 4. Contact Huawei technical support personnel. -

What Do I Do If the Error Message "TVM/te/cce error." Is Displayed During Model Running?

+

What Do I Do If the Error Message "TVM/te/cce error." Is Displayed During Model Running?

## Symptom @@ -3601,7 +3601,7 @@ Update the versions of components such as TE. The **te-\*.whl** and **topi-\* ![](figures/faq10-1.png) -

What Do I Do If the Error Message "MemCopySync:drvMemcpy failed." Is Displayed During Model Running?

+

What Do I Do If the Error Message "MemCopySync:drvMemcpy failed." Is Displayed During Model Running?

## Symptom @@ -3670,7 +3670,7 @@ Perform the following steps to locate the fault based on the actual error inform 4. Print the shape, dtype, and npu\_format of all stack parameters. Construct a single-operator case to reproduce the problem. The cause is that the data types of the input parameters for subtraction are different. As a result, the data types of the a-b and b-a results are different, and an error is reported in the stack operator. 5. Convert the data types of the stack input parameters to the same one to temporarily avoid the problem. -

What Do I Do If the Error Message "MemCopySync:drvMemcpy failed." Is Displayed During Model Running?

+

What Do I Do If the Error Message "MemCopySync:drvMemcpy failed." Is Displayed During Model Running?

## Symptom @@ -3739,7 +3739,7 @@ Perform the following steps to locate the fault based on the actual error inform 4. Print the shape, dtype, and npu\_format of all stack parameters. Construct a single-operator case to reproduce the problem. The cause is that the data types of the input parameters for subtraction are different. As a result, the data types of the a-b and b-a results are different, and an error is reported in the stack operator. 5. Convert the data types of the stack input parameters to the same one to temporarily avoid the problem. -
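A minimal sketch of the temporary workaround in step 5, unifying the input data types before the subtraction and stack (shapes and dtypes are illustrative):

```
import torch

a = torch.rand(2, 3)                   # float32
b = torch.rand(2, 3).half()            # float16
b = b.to(a.dtype)                      # unify dtypes before computing
stacked = torch.stack([a - b, b - a])  # both entries now share one dtype
```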

What Do I Do If the Error Message "HelpACLExecute." Is Displayed After Multi-Task Delivery Is Disabled \(export TASK\_QUEUE\_ENABLE=0\) During Model Running?

+

What Do I Do If the Error Message "HelpACLExecute." Is Displayed After Multi-Task Delivery Is Disabled \(export TASK\_QUEUE\_ENABLE=0\) During Model Running?

## Symptom @@ -3759,7 +3759,7 @@ The error information in the log indicates that the error operator is topKD and Locate the topKD operator in the model code and check whether the operator can be replaced by another operator. If the operator can be replaced by another operator, use the replacement solution and report the operator error information to Huawei engineers. If the operator cannot be replaced by another operator, contact Huawei technical support. -

What Do I Do If the Error Message "55056 GetInputConstDataOut: ErrorNo: -1\(failed\)" Is Displayed During Model Running?

+

What Do I Do If the Error Message "55056 GetInputConstDataOut: ErrorNo: -1\(failed\)" Is Displayed During Model Running?

## Symptom @@ -3775,18 +3775,18 @@ A public API is called. The error information does not affect the training function and performance and can be ignored. -

FAQs About Model Commissioning

+

FAQs About Model Commissioning

-- **[What Do I Do If the Error Message "RuntimeError: malloc:/..../pytorch/c10/npu/NPUCachingAllocator.cpp:293 NPU error, error code is 500000." Is Displayed During Model Commissioning?](#what-do-i-do-if-the-error-message-runtimeerror-malloc-pytorch-c10-npu-npucachingallocator-cpp-293-np.md)** +- **[What Do I Do If the Error Message "RuntimeError: malloc:/..../pytorch/c10/npu/NPUCachingAllocator.cpp:293 NPU error, error code is 500000." Is Displayed During Model Commissioning?](#what-do-i-do-if-the-error-message-runtimeerror-malloc-pytorch-c10-npu-npucachingallocator-cpp-293-np)** -- **[What Do I Do If the Error Message "RuntimeError: Could not run 'aten::trunc.out' with arguments from the 'NPUTensorId' backend." Is Displayed During Model Commissioning](#what-do-i-do-if-the-error-message-runtimeerror-could-not-run-aten-trunc-out-with-arguments-from-the.md)** +- **[What Do I Do If the Error Message "RuntimeError: Could not run 'aten::trunc.out' with arguments from the 'NPUTensorId' backend." Is Displayed During Model Commissioning](#what-do-i-do-if-the-error-message-runtimeerror-could-not-run-aten-trunc-out-with-arguments-from-the)** -- **[What Do I Do If the MaxPoolGradWithArgmaxV1 and max Operators Report Errors During Model Commissioning?](#what-do-i-do-if-the-maxpoolgradwithargmaxv1-and-max-operators-report-errors-during-model-commissioni.md)** +- **[What Do I Do If the MaxPoolGradWithArgmaxV1 and max Operators Report Errors During Model Commissioning?](#what-do-i-do-if-the-maxpoolgradwithargmaxv1-and-max-operators-report-errors-during-model-commissioni)** -- **[What Do I Do If the Error Message "ModuleNotFoundError: No module named 'torch.\_C'" Is Displayed When torch Is Called?](#what-do-i-do-if-the-error-message-modulenotfounderror-no-module-named-torch-_c-is-displayed-when-tor.md)** +- **[What Do I Do If the Error Message "ModuleNotFoundError: No module named 'torch.\_C'" Is Displayed When torch Is Called?](#what-do-i-do-if-the-error-message-modulenotfounderror-no-module-named-torch-_c-is-displayed-when-tor)** -

What Do I Do If the Error Message "RuntimeError: malloc:/..../pytorch/c10/npu/NPUCachingAllocator.cpp:293 NPU error, error code is 500000." Is Displayed During Model Commissioning?

+

What Do I Do If the Error Message "RuntimeError: malloc:/..../pytorch/c10/npu/NPUCachingAllocator.cpp:293 NPU error, error code is 500000." Is Displayed During Model Commissioning?

## Symptom @@ -3800,7 +3800,7 @@ For the malloc error in **NPUCachingAllocator**, the possible cause is that the During model commissioning, you can decrease the value of the **batch size** parameter to reduce the size of the occupied video memory on the NPU. -

What Do I Do If the Error Message "RuntimeError: Could not run 'aten::trunc.out' with arguments from the 'NPUTensorId' backend." Is Displayed During Model Commissioning

+

What Do I Do If the Error Message "RuntimeError: Could not run 'aten::trunc.out' with arguments from the 'NPUTensorId' backend." Is Displayed During Model Commissioning

## Symptom

@@ -3814,7 +3814,7 @@

Currently, the NPU supports only some PyTorch operators. The preceding error is

During model commissioning, replace the unsupported operator with another operator that has similar semantics, or move the operation to the CPU to avoid the problem.

-

What Do I Do If the MaxPoolGradWithArgmaxV1 and max Operators Report Errors During Model Commissioning?

+

What Do I Do If the MaxPoolGradWithArgmaxV1 and max Operators Report Errors During Model Commissioning?

## Symptom

@@ -3840,9 +3840,9 @@ Locate the operators based on the error information and perform the following st

In the preceding figure, the error information indicates that the MaxPoolGradWithArgmaxV1 and max operators report the error. MaxPoolGradWithArgmaxV1 reports the error during backward propagation. Therefore, construct a reverse scenario. The max operator reports the error during forward propagation. Therefore, construct a forward scenario.

-If an operator error is reported in the model, you are advised to build a single-operator test case and determine the error scenario and cause. If a single-operator case cannot be built in a single operator, you need to construct a context-based single-operator scenario. For details about how to build a test case, see [Single-Operator Sample Building](#single-operator-sample-building.md).
+If an operator error is reported in the model, you are advised to build a single-operator test case to determine the error scenario and cause. If the error cannot be reproduced with a single operator, construct a single-operator scenario that includes the surrounding context. For details about how to build a test case, see [Single-Operator Sample Building](#single-operator-sample-building).

-

What Do I Do If the Error Message "ModuleNotFoundError: No module named 'torch.\_C'" Is Displayed When torch Is Called?

+

What Do I Do If the Error Message "ModuleNotFoundError: No module named 'torch.\_C'" Is Displayed When torch Is Called?

## Symptom @@ -3856,24 +3856,24 @@ In the preceding figure, the error path is **.../code/pytorch/torch/\_\_init\_\ Switch to another directory to run the script. -

FAQs About Other Operations

+

FAQs About Other Operations

-- **[What Do I Do If an Error Is Reported During CUDA Stream Synchronization?](#what-do-i-do-if-an-error-is-reported-during-cuda-stream-synchronization.md)** +- **[What Do I Do If an Error Is Reported During CUDA Stream Synchronization?](#what-do-i-do-if-an-error-is-reported-during-cuda-stream-synchronization)**

-- **[What Do I Do If aicpu\_kernels/libpt\_kernels.so Does Not Exist?](#what-do-i-do-if-aicpu_kernels-libpt_kernels-so-does-not-exist.md)** +- **[What Do I Do If aicpu\_kernels/libpt\_kernels.so Does Not Exist?](#what-do-i-do-if-aicpu_kernels-libpt_kernels-so-does-not-exist)**

-- **[What Do I Do If the Python Process Is Residual When the npu-smi info Command Is Used to View Video Memory?](#what-do-i-do-if-the-python-process-is-residual-when-the-npu-smi-info-command-is-used-to-view-video-m.md)** +- **[What Do I Do If the Python Process Is Residual When the npu-smi info Command Is Used to View Video Memory?](#what-do-i-do-if-the-python-process-is-residual-when-the-npu-smi-info-command-is-used-to-view-video-m)**

-- **[What Do I Do If the Error Message "match op inputs failed"Is Displayed When the Dynamic Shape Is Used?](#what-do-i-do-if-the-error-message-match-op-inputs-failed-is-displayed-when-the-dynamic-shape-is-used.md)** +- **[What Do I Do If the Error Message "match op inputs failed" Is Displayed When the Dynamic Shape Is Used?](#what-do-i-do-if-the-error-message-match-op-inputs-failed-is-displayed-when-the-dynamic-shape-is-used)**

-- **[What Do I Do If the Error Message "Op type SigmoidCrossEntropyWithLogitsV2 of ops kernel AIcoreEngine is unsupported" Is Displayed?](#what-do-i-do-if-the-error-message-op-type-sigmoidcrossentropywithlogitsv2-of-ops-kernel-aicoreengine.md)** +- **[What Do I Do If the Error Message "Op type SigmoidCrossEntropyWithLogitsV2 of ops kernel AIcoreEngine is unsupported" Is Displayed?](#what-do-i-do-if-the-error-message-op-type-sigmoidcrossentropywithlogitsv2-of-ops-kernel-aicoreengine)**

-- **[What Do I Do If a Hook Failure Occurs?](#what-do-i-do-if-a-hook-failure-occurs.md)** +- **[What Do I Do If a Hook Failure Occurs?](#what-do-i-do-if-a-hook-failure-occurs)**

-- **[What Do I Do If the Error Message "load state\_dict error." Is Displayed When the Weight Is Loaded?](#what-do-i-do-if-the-error-message-load-state_dict-error-is-displayed-when-the-weight-is-loaded.md)** +- **[What Do I Do If the Error Message "load state\_dict error." Is Displayed When the Weight Is Loaded?](#what-do-i-do-if-the-error-message-load-state_dict-error-is-displayed-when-the-weight-is-loaded)**

-

What Do I Do If an Error Is Reported During CUDA Stream Synchronization?

+

What Do I Do If an Error Is Reported During CUDA Stream Synchronization?

## Symptom @@ -3892,7 +3892,7 @@ stream = torch.npu.current_stream() stream.synchronize() ``` -

What Do I Do If aicpu\_kernels/libpt\_kernels.so Does Not Exist?

+

What Do I Do If aicpu\_kernels/libpt\_kernels.so Does Not Exist?

## Symptom @@ -3910,7 +3910,7 @@ Import the AI CPU. \(The following describes how to install the Toolkit software export ASCEND_AICPU_PATH=/usr/local/Ascend/ascend-toolkit/latest ``` -

What Do I Do If the Python Process Is Residual When the npu-smi info Command Is Used to View Video Memory?

+

What Do I Do If the Python Process Is Residual When the npu-smi info Command Is Used to View Video Memory?

## Symptom @@ -3928,7 +3928,7 @@ Kill the Python process. pkill -9 python ``` -

What Do I Do If the Error Message "match op inputs failed"Is Displayed When the Dynamic Shape Is Used?

+

What Do I Do If the Error Message "match op inputs failed" Is Displayed When the Dynamic Shape Is Used?

## Symptom @@ -3942,7 +3942,7 @@ The operator compiled by **PTIndexPut** does not match the input shape, and th **PTIndexPut** corresponds to **tensor\[indices\] = value**. Locate the field in the code and change the dynamic shape to a fixed shape. -

What Do I Do If the Error Message "Op type SigmoidCrossEntropyWithLogitsV2 of ops kernel AIcoreEngine is unsupported" Is Displayed?

+

What Do I Do If the Error Message "Op type SigmoidCrossEntropyWithLogitsV2 of ops kernel AIcoreEngine is unsupported" Is Displayed?

## Symptom @@ -3959,7 +3959,7 @@ The input data type is not supported by the SigmoidCrossEntropyWithLogitsV2 oper Check the input data type in the Python code and modify the data type. -

What Do I Do If a Hook Failure Occurs?

+

What Do I Do If a Hook Failure Occurs?

## Symptom @@ -4015,7 +4015,7 @@ if len(self._backward_hooks) > 0: return result ``` -

What Do I Do If the Error Message "load state\_dict error." Is Displayed When the Weight Is Loaded?

+

What Do I Do If the Error Message "load state\_dict error." Is Displayed When the Weight Is Loaded?

## Symptom @@ -4044,14 +4044,14 @@ The script is as follows: model.load_state_dict(state_dict) ``` -

FAQs About Distributed Model Training

+

FAQs About Distributed Model Training

-- **[What Do I Do If the Error Message "host not found." Is Displayed During Distributed Model Training?](#what-do-i-do-if-the-error-message-host-not-found-is-displayed-during-distributed-model-training.md)** +- **[What Do I Do If the Error Message "host not found." Is Displayed During Distributed Model Training?](#what-do-i-do-if-the-error-message-host-not-found-is-displayed-during-distributed-model-training)** -- **[What Do I Do If the Error Message "RuntimeError: connect\(\) timed out." Is Displayed During Distributed Model Training?](#what-do-i-do-if-the-error-message-runtimeerror-connect()-timed-out-is-displayed-during-distributed-m.md)** +- **[What Do I Do If the Error Message "RuntimeError: connect\(\) timed out." Is Displayed During Distributed Model Training?](#what-do-i-do-if-the-error-message-runtimeerror-connect-timed-out-is-displayed-during-distributed-m)** -

What Do I Do If the Error Message "host not found." Is Displayed During Distributed Model Training?

+

What Do I Do If the Error Message "host not found." Is Displayed During Distributed Model Training?

## Symptom @@ -4065,7 +4065,7 @@ During distributed model training, the Huawei Collective Communication Library \ Set the correct IP address in the running script. If a single server is deployed, set the IP address to the IP address of the server. If multiple servers are deployed, set the IP address in the script on each server to the IP address of the active node. -

What Do I Do If the Error Message "RuntimeError: connect\(\) timed out." Is Displayed During Distributed Model Training?

+

What Do I Do If the Error Message "RuntimeError: connect\(\) timed out." Is Displayed During Distributed Model Training?

## Symptom diff --git a/docs/en/PyTorch Online Inference User Guide/PyTorch Online Inference User Guide.md b/docs/en/PyTorch Online Inference User Guide/PyTorch Online Inference User Guide.md index 933c21999e1af0a5f1b61c00ab235580cf1531f5..dd5e572028a8c639244dac5bd6a14f84103e09be 100644 --- a/docs/en/PyTorch Online Inference User Guide/PyTorch Online Inference User Guide.md +++ b/docs/en/PyTorch Online Inference User Guide/PyTorch Online Inference User Guide.md @@ -1,14 +1,14 @@ # PyTorch Online Inference Guide -- [Application Scenario](#application-scenario.md) -- [Basic Workflow](#basic-workflow.md) - - [Prerequisites](#prerequisites.md) - - [Online Inference Process](#online-inference-process.md) - - [Environment Variable Configuration](#environment-variable-configuration.md) - - [Sample Reference](#sample-reference.md) -- [Special Topics](#special-topics.md) - - [Mixed Precision](#mixed-precision.md) -- [How Do I Install GCC 7.3.0?](#how-do-i-install-gcc-7-3-0.md) -

Application Scenario

+- [Application Scenario](#application-scenario) +- [Basic Workflow](#basic-workflow) + - [Prerequisites](#prerequisites) + - [Online Inference Process](#online-inference-process) + - [Environment Variable Configuration](#environment-variable-configuration) + - [Sample Reference](#sample-reference) +- [Special Topics](#special-topics) + - [Mixed Precision](#mixed-precision) +- [How Do I Install GCC 7.3.0?](#how-do-i-install-gcc-7-3-0) +

Application Scenario

Online inference, unlike offline inference, allows developers to perform inference directly with PyTorch models using the **model.eval\(\)** method. @@ -20,29 +20,29 @@ Ascend 910 AI Processor Ascend 710 AI Processor -

Basic Workflow

+

Basic Workflow

-- **[Prerequisites](#prerequisites.md)** +- **[Prerequisites](#prerequisites)** -- **[Online Inference Process](#online-inference-process.md)** +- **[Online Inference Process](#online-inference-process)** -- **[Environment Variable Configuration](#environment-variable-configuration.md)** +- **[Environment Variable Configuration](#environment-variable-configuration)** -- **[Sample Reference](#sample-reference.md)** +- **[Sample Reference](#sample-reference)** -

Prerequisites

+

Prerequisites

The PyTorch framework and mixed precision module have been installed. For details, see the _PyTorch Installation Guide_.

-

Online Inference Process

+

Online Inference Process

[Figure 1](#fig13802941161818) shows the online inference process. **Figure 1** Online inference process ![](figures/online-inference-process.png "online-inference-process") -

Environment Variable Configuration

+

Environment Variable Configuration

The following are the environment variables required for starting the inference process on PyTorch: @@ -133,7 +133,7 @@ export TASK_QUEUE_ENABLE=0 >![](public_sys-resources/icon-note.gif) **NOTE:** >For more log information, see the _CANN Log Reference_. -

Sample Reference

+

Sample Reference

## Sample Code @@ -420,7 +420,7 @@ The following uses the ResNet-50 model as an example to describe how to perform 3. Run inference. - Set environment variables by referring to [Environment Variable Configuration](#environment-variable-configuration.md) and then run the following command: + Set environment variables by referring to [Environment Variable Configuration](#environment-variable-configuration) and then run the following command: ``` python3 pytorch-resnet50-apex.py --data /data/imagenet \ @@ -433,12 +433,12 @@ The following uses the ResNet-50 model as an example to describe how to perform >The preceding command is an example only. Modify the arguments as needed. -

Special Topics

+

Special Topics

-- **[Mixed Precision](#mixed-precision.md)** +- **[Mixed Precision](#mixed-precision)** -

Mixed Precision

+

Mixed Precision

## Overview

@@ -503,7 +503,7 @@ However, the mixed precision training is limited by the precision range expresse

 model, optimizer = amp.initialize(model, optimizer)
 ```

- For details, see "Initialize the mixed precision model."# in [Sample Code](#sample-reference.md).
+ For details, see "Initialize the mixed precision model." in [Sample Code](#sample-reference).

 ```
 model, optimizer = amp.initialize(model, optimizer, opt_level="O2", loss_scale=1024, verbosity=1)
 ```

@@ -514,9 +514,9 @@ However, the mixed precision training is limited by the precision range expresse

After the mixed precision model is initialized, perform model forward propagation.

-Sample code: For details, see the implementation of **validate\(val\_loader, model, args\)** in [Sample Code](#sample-reference.md).
+Sample code: For details, see the implementation of **validate\(val\_loader, model, args\)** in [Sample Code](#sample-reference).

-
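After initialization, the backward pass is wrapped with the loss scaler; a minimal sketch in the Apex amp style shown above, assuming **model**, **optimizer**, and **loss** already exist:

```
from apex import amp  # mixed precision module adapted for Ascend

# amp.initialize(model, optimizer, ...) must have been called first.
# Scaling the loss keeps float16 gradients from underflowing; amp
# unscales them again before the optimizer step.
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()
```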

How Do I Install GCC 7.3.0?

+

How Do I Install GCC 7.3.0?

Perform the following steps as the **root** user. diff --git a/docs/en/PyTorch Operator Development Guide/PyTorch Operator Development Guide.md b/docs/en/PyTorch Operator Development Guide/PyTorch Operator Development Guide.md index 65fe9c21471de4d49984dcb61cd099b5343b6403..e3534a433f7781c5119b8f2e3440593fd800544f 100644 --- a/docs/en/PyTorch Operator Development Guide/PyTorch Operator Development Guide.md +++ b/docs/en/PyTorch Operator Development Guide/PyTorch Operator Development Guide.md @@ -1,38 +1,37 @@ # PyTorch Operator Development Guide -- [Introduction](#introduction.md) -- [Operator Development Process](#operator-development-process.md) -- [Operator Development Preparations](#operator-development-preparations.md) - - [Setting Up the Environment](#setting-up-the-environment.md) - - [Looking Up Operators](#looking-up-operators.md) -- [Operator Adaptation](#operator-adaptation.md) - - [Prerequisites](#prerequisites.md) - - [Obtaining the PyTorch Source Code](#obtaining-the-pytorch-source-code.md) - - [Registering Operator Development](#registering-operator-development.md) - - [Developing an Operator Adaptation Plugin](#developing-an-operator-adaptation-plugin.md) - - [Compiling and Installing the PyTorch Framework](#compiling-and-installing-the-pytorch-framework.md) -- [Operator Function Verification](#operator-function-verification.md) - - [Overview](#overview.md) - - [Implementation](#implementation.md) -- [FAQs](#faqs.md) - - [Pillow==5.3.0 Installation Failed](#pillow-5-3-0-installation-failed.md) - - [pip3.7 install torchvision Installation Failed](#pip3-7-install-torchvision-installation-failed.md) - - ["torch 1.5.0xxxx" and "torchvision" Do Not Match When torch-\*.whl Is Installed](#torch-1-5-0xxxx-and-torchvision-do-not-match-when-torch--whl-is-installed.md) - - [如何查看测试的运行日志](#en-us_topic_0000001117914770.md) - - [What Is the Meaning Of The NPU Error Code Output During the Test? Is There Any Corresponding Explanation?](#what-is-the-meaning-of-the-npu-error-code-output-during-the-test-is-there-any-corresponding-explanat.md) - - [Why Cannot the Custom TBE Operator Be Called?](#why-cannot-the-custom-tbe-operator-be-called.md) - - [How Do I Determine Whether the TBE Operator Is Correctly Called for PyTorch Adaptation?](#how-do-i-determine-whether-the-tbe-operator-is-correctly-called-for-pytorch-adaptation.md) - - [PyTorch Compilation Fails and the Message "error: ld returned 1 exit status" Is Displayed](#pytorch-compilation-fails-and-the-message-error-ld-returned-1-exit-status-is-displayed.md) - - [PyTorch Compilation Fails and the Message "error: call of overload...." Is Displayed](#pytorch-compilation-fails-and-the-message-error-call-of-overload-is-displayed.md) -- [Appendixes](#appendixes.md) - - [Installing CMake](#installing-cmake.md) - - [Exporting a Custom Operator](#exporting-a-custom-operator.md) -

Introduction

+- [Introduction](#introduction) +- [Operator Development Process](#operator-development-process) +- [Operator Development Preparations](#operator-development-preparations) + - [Setting Up the Environment](#setting-up-the-environment) + - [Looking Up Operators](#looking-up-operators) +- [Operator Adaptation](#operator-adaptation) + - [Prerequisites](#prerequisites) + - [Obtaining the PyTorch Source Code](#obtaining-the-pytorch-source-code) + - [Registering Operator Development](#registering-operator-development) + - [Developing an Operator Adaptation Plugin](#developing-an-operator-adaptation-plugin) + - [Compiling and Installing the PyTorch Framework](#compiling-and-installing-the-pytorch-framework) +- [Operator Function Verification](#operator-function-verification) + - [Overview](#overview) + - [Implementation](#implementation) +- [FAQs](#faqs) + - [Pillow==5.3.0 Installation Failed](#pillow-5-3-0-installation-failed) + - [pip3.7 install torchvision Installation Failed](#pip3-7-install-torchvision-installation-failed) + - ["torch 1.5.0xxxx" and "torchvision" Do Not Match When torch-\*.whl Is Installed](#torch-1-5-0xxxx-and-torchvision-do-not-match-when-torch--whl-is-installed) + - [What Is the Meaning Of The NPU Error Code Output During the Test? Is There Any Corresponding Explanation?](#what-is-the-meaning-of-the-npu-error-code-output-during-the-test-is-there-any-corresponding-explanat) + - [Why Cannot the Custom TBE Operator Be Called?](#why-cannot-the-custom-tbe-operator-be-called) + - [How Do I Determine Whether the TBE Operator Is Correctly Called for PyTorch Adaptation?](#how-do-i-determine-whether-the-tbe-operator-is-correctly-called-for-pytorch-adaptation) + - [PyTorch Compilation Fails and the Message "error: ld returned 1 exit status" Is Displayed](#pytorch-compilation-fails-and-the-message-error-ld-returned-1-exit-status-is-displayed) + - [PyTorch Compilation Fails and the Message "error: call of overload...." Is Displayed](#pytorch-compilation-fails-and-the-message-error-call-of-overload-is-displayed) +- [Appendixes](#appendixes) + - [Installing CMake](#installing-cmake) + - [Exporting a Custom Operator](#exporting-a-custom-operator) +

Introduction

## Overview To enable the PyTorch deep learning framework to run on Ascend AI Processors, you need to use Tensor Boost Engine \(TBE\) to customize the framework operators. -

Operator Development Process

+

Operator Development Process

PyTorch operator development includes TBE operator development and operator adaptation to the PyTorch framework. @@ -67,7 +66,7 @@ PyTorch operator development includes TBE operator development and operator adap

Set up the development and operating environments required for operator development, execution, and verification.

Operator Development Preparations

+

Operator Development Preparations

2

@@ -84,7 +83,7 @@ PyTorch operator development includes TBE operator development and operator adap

Obtain the PyTorch source code from the Ascend Community.

Operator Adaptation

+

Operator Adaptation

4

@@ -114,25 +113,25 @@ PyTorch operator development includes TBE operator development and operator adap

Verify the operator functions in the real-world hardware environment.

Operator Function Verification

+

Operator Function Verification

-

Operator Development Preparations

+

Operator Development Preparations

-- **[Setting Up the Environment](#setting-up-the-environment.md)** +- **[Setting Up the Environment](#setting-up-the-environment)** -- **[Looking Up Operators](#looking-up-operators.md)** +- **[Looking Up Operators](#looking-up-operators)** -

Setting Up the Environment

+

Setting Up the Environment

## Prerequisites - The development or operating environment of CANN has been installed. For details, see the _CANN Software Installation Guide_. -- CMake 3.12.0 or later has been installed. For details, see [Installing CMake](#installing-cmake.md). +- CMake 3.12.0 or later has been installed. For details, see [Installing CMake](#installing-cmake). - GCC 7.3.0 or later has been installed. For details about how to install and use GCC 7.3.0, see "Installing GCC 7.3.0" in the _CANN Software Installation Guide_. - The Git tool has been installed. To install Git for Ubuntu and CentOS, run the following commands: - Ubuntu @@ -160,9 +159,9 @@ pip3.7 install Pillow==5.3.0 ``` >![](public_sys-resources/icon-note.gif) **NOTE:** ->If an error is reported in the preceding process, rectify the fault by referring to [FAQs](#faqs.md). +>If an error is reported in the preceding process, rectify the fault by referring to [FAQs](#faqs). -

Looking Up Operators

+

Looking Up Operators

During operator development, you can query the list of operators supported by Ascend AI Processors and the list of operators adapted to PyTorch. Develop or adapt operators to PyTorch based on the query result. @@ -178,25 +177,25 @@ The following describes how to query the operators supported by Ascend AI Proces - For the list of operators adapted to PyTorch, see the _PyTorch Adapted Operator List_. -

Operator Adaptation

+

Operator Adaptation

-- **[Prerequisites](#prerequisites.md)** +- **[Prerequisites](#prerequisites)** -- **[Obtaining the PyTorch Source Code](#obtaining-the-pytorch-source-code.md)** +- **[Obtaining the PyTorch Source Code](#obtaining-the-pytorch-source-code)** -- **[Registering Operator Development](#registering-operator-development.md)** +- **[Registering Operator Development](#registering-operator-development)** -- **[Developing an Operator Adaptation Plugin](#developing-an-operator-adaptation-plugin.md)** +- **[Developing an Operator Adaptation Plugin](#developing-an-operator-adaptation-plugin)** -- **[Compiling and Installing the PyTorch Framework](#compiling-and-installing-the-pytorch-framework.md)** +- **[Compiling and Installing the PyTorch Framework](#compiling-and-installing-the-pytorch-framework)** -

Prerequisites

+

Prerequisites

-- The development and operating environments have been set up, and related dependencies have been installed. For details, see [Setting Up the Environment](#setting-up-the-environment.md). +- The development and operating environments have been set up, and related dependencies have been installed. For details, see [Setting Up the Environment](#setting-up-the-environment). - TBE operators have been developed and deployed. For details, see the _CANN TBE Custom Operator Development Guide_. -

Obtaining the PyTorch Source Code

+

Obtaining the PyTorch Source Code

Visit [https://gitee.com/ascend/pytorch-develop](https://gitee.com/ascend/pytorch-develop) to obtain the PyTorch source code that adapts to the Ascend AI Processor. Run the following **git** command to download the source code: @@ -209,7 +208,7 @@ After the download is successful, the PyTorch file directory is generated. >![](public_sys-resources/icon-note.gif) **NOTE:** >If you do not have the permission to obtain the code, contact Huawei technical support to join the **Ascend** organization. -

Registering Operator Development

+

Registering Operator Development

## Overview @@ -345,7 +344,7 @@ The following uses the torch.add\(\) operator as an example to describe how to r -

Developing an Operator Adaptation Plugin

+

Developing an Operator Adaptation Plugin

## Overview

@@ -530,7 +529,7 @@ The following uses the torch.add\(\) operator as an example to describe how to a

>![](public_sys-resources/icon-note.gif) **NOTE:**
>For details about the implementation code of **AddKernelNpu.cpp**, see the **pytorch/aten/src/ATen/native/npu/AddKernelNpu.cpp** file.

-

Compiling and Installing the PyTorch Framework

+

Compiling and Installing the PyTorch Framework

## Compiling the PyTorch Framework @@ -550,7 +549,7 @@ The following uses the torch.add\(\) operator as an example to describe how to a ## Installing the PyTorch Framework -1. Upload the **torch-**_\*_**.whl** package generated in [Compiling and Installing the PyTorch Framework](#compiling-and-installing-the-pytorch-framework.md) to any path on the server. +1. Upload the **torch-**_\*_**.whl** package generated in [Compiling and Installing the PyTorch Framework](#compiling-and-installing-the-pytorch-framework) to any path on the server. 2. Go to the directory where **torch-**_\*_**.whl** is located and run the **pip** command to install PyTorch. If the current user is the **root** user, run the following command: @@ -570,14 +569,14 @@ The following uses the torch.add\(\) operator as an example to describe how to a >- After the code has been modified, you need to re-compile and re-install PyTorch. >- During the installation, the system may display a message indicating that the TorchVision 0.6.0 version does not match PyTorch. This problem has no impact and can be ignored. -

Operator Function Verification

+

Operator Function Verification

-- **[Overview](#overview.md)** +- **[Overview](#overview)** -- **[Implementation](#implementation.md)** +- **[Implementation](#implementation)** -

Overview

+

Overview

## Introduction @@ -591,7 +590,7 @@ Use the PyTorch frontend to construct the custom operator function and run the f The test cases and test tools are provided in the **pytorch/test/test\_npu/test\_network\_ops** directory at **https://gitee.com/ascend/pytorch-develop**. -

Implementation

+

Implementation

## Introduction @@ -674,28 +673,26 @@ This section describes how to test the functions of a PyTorch operator. ``` -
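A minimal sketch of such a function test, comparing the operator's NPU result against the CPU result; the tested operator, device name, and tolerance are illustrative (the real cases live under **pytorch/test/test\_npu/test\_network\_ops**):

```
import unittest
import torch

class TestAdd(unittest.TestCase):
    def test_add_float32(self):
        a, b = torch.rand(2, 3), torch.rand(2, 3)
        cpu_out = torch.add(a, b)
        npu_out = torch.add(a.to("npu:0"), b.to("npu:0")).cpu()
        # The NPU result should match the CPU result within a small tolerance.
        self.assertTrue(torch.allclose(cpu_out, npu_out, atol=1e-4))

if __name__ == "__main__":
    unittest.main()
```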

FAQs

+

FAQs

-- **[Pillow==5.3.0 Installation Failed](#pillow-5-3-0-installation-failed.md)**
+- **[Pillow==5.3.0 Installation Failed](#pillow-5-3-0-installation-failed)**

-- **[pip3.7 install torchvision Installation Failed](#pip3-7-install-torchvision-installation-failed.md)**
+- **[pip3.7 install torchvision Installation Failed](#pip3-7-install-torchvision-installation-failed)**

-- **["torch 1.5.0xxxx" and "torchvision" Do Not Match When torch-\*.whl Is Installed](#torch-1-5-0xxxx-and-torchvision-do-not-match-when-torch--whl-is-installed.md)**
+- **["torch 1.5.0xxxx" and "torchvision" Do Not Match When torch-\*.whl Is Installed](#torch-1-5-0xxxx-and-torchvision-do-not-match-when-torch--whl-is-installed)**

-- **[How Do I View the Test Run Logs?](#en-us_topic_0000001117914770.md)**
+- **[What Is the Meaning Of The NPU Error Code Output During the Test? Is There Any Corresponding Explanation?](#what-is-the-meaning-of-the-npu-error-code-output-during-the-test-is-there-any-corresponding-explanat)**

-- **[What Is the Meaning Of The NPU Error Code Output During the Test? Is There Any Corresponding Explanation?](#what-is-the-meaning-of-the-npu-error-code-output-during-the-test-is-there-any-corresponding-explanat.md)**
+- **[Why Cannot the Custom TBE Operator Be Called?](#why-cannot-the-custom-tbe-operator-be-called)**

-- **[Why Cannot the Custom TBE Operator Be Called?](#why-cannot-the-custom-tbe-operator-be-called.md)**
+- **[How Do I Determine Whether the TBE Operator Is Correctly Called for PyTorch Adaptation?](#how-do-i-determine-whether-the-tbe-operator-is-correctly-called-for-pytorch-adaptation)**

-- **[How Do I Determine Whether the TBE Operator Is Correctly Called for PyTorch Adaptation?](#how-do-i-determine-whether-the-tbe-operator-is-correctly-called-for-pytorch-adaptation.md)**
+- **[PyTorch Compilation Fails and the Message "error: ld returned 1 exit status" Is Displayed](#pytorch-compilation-fails-and-the-message-error-ld-returned-1-exit-status-is-displayed)**

-- **[PyTorch Compilation Fails and the Message "error: ld returned 1 exit status" Is Displayed](#pytorch-compilation-fails-and-the-message-error-ld-returned-1-exit-status-is-displayed.md)**
+- **[PyTorch Compilation Fails and the Message "error: call of overload...." Is Displayed](#pytorch-compilation-fails-and-the-message-error-call-of-overload-is-displayed)**

-- **[PyTorch Compilation Fails and the Message "error: call of overload...." Is Displayed](#pytorch-compilation-fails-and-the-message-error-call-of-overload-is-displayed.md)**
-
-<h2 id="pillow-5-3-0-installation-failed.md"><a name="pillow-5-3-0-installation-failed.md"></a><a name="pillow-5-3-0-installation-failed.md"></a>Pillow==5.3.0 Installation Failed</h2>
+<h2 id="pillow-5-3-0-installation-failed"><a name="pillow-5-3-0-installation-failed"></a><a name="pillow-5-3-0-installation-failed"></a>Pillow==5.3.0 Installation Failed</h2>
 ## Symptom

@@ -713,7 +710,7 @@ Run the following command to install the required dependencies:
 apt-get install libjpeg python-devel zlib-devel libjpeg-turbo-devel
 ```

-<h2 id="pip3-7-install-torchvision-installation-failed.md"><a name="pip3-7-install-torchvision-installation-failed.md"></a><a name="pip3-7-install-torchvision-installation-failed.md"></a>pip3.7 install torchvision Installation Failed</h2>
+<h2 id="pip3-7-install-torchvision-installation-failed"><a name="pip3-7-install-torchvision-installation-failed"></a><a name="pip3-7-install-torchvision-installation-failed"></a>pip3.7 install torchvision Installation Failed</h2>
 ## Symptom

@@ -731,7 +728,7 @@ Run the following command:
 pip3.7 install torchvision --no-deps
 ```

-<h2 id="torch-1-5-0xxxx-and-torchvision-do-not-match-when-torch--whl-is-installed.md"><a name="torch-1-5-0xxxx-and-torchvision-do-not-match-when-torch--whl-is-installed.md"></a><a name="torch-1-5-0xxxx-and-torchvision-do-not-match-when-torch--whl-is-installed.md"></a>"torch 1.5.0xxxx" and "torchvision" Do Not Match When torch-\*.whl Is Installed</h2>
+<h2 id="torch-1-5-0xxxx-and-torchvision-do-not-match-when-torch--whl-is-installed"><a name="torch-1-5-0xxxx-and-torchvision-do-not-match-when-torch--whl-is-installed"></a><a name="torch-1-5-0xxxx-and-torchvision-do-not-match-when-torch--whl-is-installed"></a>"torch 1.5.0xxxx" and "torchvision" Do Not Match When torch-\*.whl Is Installed</h2>
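A quick way to confirm which versions are actually installed before deciding to ignore the warning (the printed values are examples only):

```python
# The mismatch warning is cosmetic as long as both packages import:
# the Ascend build of torch carries a non-standard version suffix
# that torchvision's dependency check does not recognize.
import torch
import torchvision

print(torch.__version__)        # e.g. an Ascend-suffixed 1.5.0 build
print(torchvision.__version__)  # e.g. 0.6.0
```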
 ## Symptom

@@ -749,11 +746,11 @@ When the PyTorch is installed, the version check is automatically triggered. The

 This problem has no impact on the actual result, and no action is required.

-<h2 id="what-is-the-meaning-of-the-npu-error-code-output-during-the-test-is-there-any-corresponding-explanat.md"><a name="what-is-the-meaning-of-the-npu-error-code-output-during-the-test-is-there-any-corresponding-explanat.md"></a><a name="what-is-the-meaning-of-the-npu-error-code-output-during-the-test-is-there-any-corresponding-explanat.md"></a>What Is the Meaning Of The NPU Error Code Output During the Test? Is There Any Corresponding Explanation?</h2>
+<h2 id="what-is-the-meaning-of-the-npu-error-code-output-during-the-test-is-there-any-corresponding-explanat"><a name="what-is-the-meaning-of-the-npu-error-code-output-during-the-test-is-there-any-corresponding-explanat"></a><a name="what-is-the-meaning-of-the-npu-error-code-output-during-the-test-is-there-any-corresponding-explanat"></a>What Is the Meaning Of The NPU Error Code Output During the Test? Is There Any Corresponding Explanation?</h2>
 For details, see [aclError](https://support.huaweicloud.com/intl/en-us/adevg-A800_3000_3010/atlasdevelopment_01_0256.html).

-<h2 id="why-cannot-the-custom-tbe-operator-be-called.md"><a name="why-cannot-the-custom-tbe-operator-be-called.md"></a><a name="why-cannot-the-custom-tbe-operator-be-called.md"></a>Why Cannot the Custom TBE Operator Be Called?</h2>
+<h2 id="why-cannot-the-custom-tbe-operator-be-called"><a name="why-cannot-the-custom-tbe-operator-be-called"></a><a name="why-cannot-the-custom-tbe-operator-be-called"></a>Why Cannot the Custom TBE Operator Be Called?</h2>
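One quick check for this FAQ is whether the TBE toolchain is importable from the same Python interpreter that runs PyTorch. The package names `te` and `topi` follow the CANN toolkit convention and may differ across toolkit versions:

```python
# Probe the TBE toolchain from the interpreter that runs PyTorch.
# "te" and "topi" are the CANN-era package names; adjust as needed.
import importlib

for mod in ("te", "topi"):
    try:
        importlib.import_module(mod)
        print(mod, "OK")
    except ImportError as err:
        print(mod, "missing:", err)
```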
 ## Symptom

@@ -767,7 +764,7 @@ The custom TBE operator has been developed and adapted to PyTorch. However, the

 ## Solutions

-1. Set the operating environment by referring to [Verifying Operator Functions](#operator-function-verification.md). Pay special attention to the following settings:
+1. Set the operating environment by referring to [Verifying Operator Functions](#operator-function-verification). Pay special attention to the following settings:

    ```
    . /home/HwHiAiUser/Ascend/ascend-toolkit/set_env.sh
@@ -807,7 +804,7 @@ The custom TBE operator has been developed and adapted to PyTorch. However, the

-<h2 id="how-do-i-determine-whether-the-tbe-operator-is-correctly-called-for-pytorch-adaptation.md"><a name="how-do-i-determine-whether-the-tbe-operator-is-correctly-called-for-pytorch-adaptation.md"></a><a name="how-do-i-determine-whether-the-tbe-operator-is-correctly-called-for-pytorch-adaptation.md"></a>How Do I Determine Whether the TBE Operator Is Correctly Called for PyTorch Adaptation?</h2>
+<h2 id="how-do-i-determine-whether-the-tbe-operator-is-correctly-called-for-pytorch-adaptation"><a name="how-do-i-determine-whether-the-tbe-operator-is-correctly-called-for-pytorch-adaptation"></a><a name="how-do-i-determine-whether-the-tbe-operator-is-correctly-called-for-pytorch-adaptation"></a>How Do I Determine Whether the TBE Operator Is Correctly Called for PyTorch Adaptation?</h2>
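The tip below works because the adapted operator entries are installed as plain Python source. A non-invasive variant is to wrap a suspect entry point at run time instead of editing the installed file; the wrapper below and the choice of `torch.add` are purely illustrative:

```python
# Print every call's inputs so you can verify that the adapted
# operator receives the parameters you expect.
import functools
import torch

def log_entry(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        print("[npu-debug]", fn.__name__, "args:", args, "kwargs:", kwargs)
        return fn(*args, **kwargs)
    return wrapper

torch.add = log_entry(torch.add)  # wrap whichever entry is under suspicion
```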
 Both the custom and built-in operators are stored in the installation directory as .py source code after installation. Therefore, you can edit the source code and add logs at the API entry to print the input parameters, and determine whether the input parameters are correct.

@@ -857,7 +854,7 @@ The following uses the **zn\_2\_nchw** operator in the built-in operator packa

 ![](figures/en-us_image_0000001126846512.png)

-<h2 id="pytorch-compilation-fails-and-the-message-error-ld-returned-1-exit-status-is-displayed.md"><a name="pytorch-compilation-fails-and-the-message-error-ld-returned-1-exit-status-is-displayed.md"></a><a name="pytorch-compilation-fails-and-the-message-error-ld-returned-1-exit-status-is-displayed.md"></a>PyTorch Compilation Fails and the Message "error: ld returned 1 exit status" Is Displayed</h2>
+<h2 id="pytorch-compilation-fails-and-the-message-error-ld-returned-1-exit-status-is-displayed"><a name="pytorch-compilation-fails-and-the-message-error-ld-returned-1-exit-status-is-displayed"></a><a name="pytorch-compilation-fails-and-the-message-error-ld-returned-1-exit-status-is-displayed"></a>PyTorch Compilation Fails and the Message "error: ld returned 1 exit status" Is Displayed</h2>
 ## Symptom

@@ -877,7 +874,7 @@ In the implementation, the type of the last parameter is **int**, which does no

 Modify the adaptation function implemented in _xxxx_**KernelNpu.cpp**. In the preceding example, change the type of the last parameter in the **binary\_cross\_entropy\_npu** function to **int64\_t** \(use **int64\_t** instead of **long** in the .cpp file\).

-<h2 id="pytorch-compilation-fails-and-the-message-error-call-of-overload-is-displayed.md"><a name="pytorch-compilation-fails-and-the-message-error-call-of-overload-is-displayed.md"></a><a name="pytorch-compilation-fails-and-the-message-error-call-of-overload-is-displayed.md"></a>PyTorch Compilation Fails and the Message "error: call of overload...." Is Displayed</h2>
+<h2 id="pytorch-compilation-fails-and-the-message-error-call-of-overload-is-displayed"><a name="pytorch-compilation-fails-and-the-message-error-call-of-overload-is-displayed"></a><a name="pytorch-compilation-fails-and-the-message-error-call-of-overload-is-displayed"></a>PyTorch Compilation Fails and the Message "error: call of overload...." Is Displayed</h2>
 ## Symptom

@@ -901,14 +898,14 @@ In the implementation, the type of the second input parameter of **NPUAttrDesc*

 2. Change the input parameter type of **binary\_cross\_entropy\_attr\(\)** to **int64\_t**.

-<h2 id="appendixes.md"><a name="appendixes.md"></a><a name="appendixes.md"></a>Appendixes</h2>
+<h2 id="appendixes"><a name="appendixes"></a><a name="appendixes"></a>Appendixes</h2>
-- **[Installing CMake](#installing-cmake.md)**
+- **[Installing CMake](#installing-cmake)**

-- **[Exporting a Custom Operator](#exporting-a-custom-operator.md)**
+- **[Exporting a Custom Operator](#exporting-a-custom-operator)**

-<h2 id="installing-cmake.md"><a name="installing-cmake.md"></a><a name="installing-cmake.md"></a>Installing CMake</h2>
+<h2 id="installing-cmake"><a name="installing-cmake"></a><a name="installing-cmake"></a>Installing CMake</h2>
 The following describes how to upgrade CMake to 3.12.1.

@@ -947,7 +944,7 @@ The following describes how to upgrade CMake to 3.12.1.

 If the message "cmake version 3.12.1" is displayed, the installation is successful.

-<h2 id="exporting-a-custom-operator.md"><a name="exporting-a-custom-operator.md"></a><a name="exporting-a-custom-operator.md"></a>Exporting a Custom Operator</h2>
+<h2 id="exporting-a-custom-operator"><a name="exporting-a-custom-operator"></a><a name="exporting-a-custom-operator"></a>Exporting a Custom Operator</h2>
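As a taste of what the section below covers, a hedged sketch of exporting a model containing a custom operator in the PyTorch 1.5 style; `CustomAdd`, the `npu::` domain, and the file name are illustrative, and the ONNX checker may warn about the custom domain:

```python
# Export sketch: route the custom op through symbolic() so it appears
# as a node in a custom ONNX domain.
import torch

class CustomAdd(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, y):
        return x + y

    @staticmethod
    def symbolic(g, x, y):
        return g.op("npu::CustomAdd", x, y)

class Net(torch.nn.Module):
    def forward(self, x, y):
        return CustomAdd.apply(x, y)

x, y = torch.rand(2, 3), torch.rand(2, 3)
torch.onnx.export(Net(), (x, y), "custom_add.onnx",
                  input_names=["x", "y"], output_names=["out"],
                  opset_version=11)
```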
 ## Overview

diff --git a/docs/en/PyTorch Operator Support/PyTorch Operator Support.md b/docs/en/PyTorch Operator Support/PyTorch Operator Support.md
index 55dd5031e6e21332c6da0468607ed88c521abc8d..35b744fb2188ffe4041f2b0c8a02e4fdd7b872ef 100644
--- a/docs/en/PyTorch Operator Support/PyTorch Operator Support.md
+++ b/docs/en/PyTorch Operator Support/PyTorch Operator Support.md
@@ -1,7 +1,7 @@
 # FrameworkPTAdapter 2.0.2 PyTorch Operator Support

-- [Mapping Between PyTorch Native Operators and Ascend Adapted Operators](#mapping-between-pytorch-native-operators-and-ascend-adapted-operators.md)
-- [PyTorch Operators Customized by Ascend](#pytorch-operators-customized-by-ascend.md)
-<h2 id="mapping-between-pytorch-native-operators-and-ascend-adapted-operators.md"><a name="mapping-between-pytorch-native-operators-and-ascend-adapted-operators.md"></a><a name="mapping-between-pytorch-native-operators-and-ascend-adapted-operators.md"></a>Mapping Between PyTorch Native Operators and Ascend Adapted Operators</h2>
+- [Mapping Between PyTorch Native Operators and Ascend Adapted Operators](#mapping-between-pytorch-native-operators-and-ascend-adapted-operators)
+- [PyTorch Operators Customized by Ascend](#pytorch-operators-customized-by-ascend)
+<h2 id="mapping-between-pytorch-native-operators-and-ascend-adapted-operators"><a name="mapping-between-pytorch-native-operators-and-ascend-adapted-operators"></a><a name="mapping-between-pytorch-native-operators-and-ascend-adapted-operators"></a>Mapping Between PyTorch Native Operators and Ascend Adapted Operators</h2>
 No.

@@ -5405,7 +5405,7 @@

-<h2 id="pytorch-operators-customized-by-ascend.md"><a name="pytorch-operators-customized-by-ascend.md"></a><a name="pytorch-operators-customized-by-ascend.md"></a>PyTorch Operators Customized by Ascend</h2>
+<h2 id="pytorch-operators-customized-by-ascend"><a name="pytorch-operators-customized-by-ascend"></a><a name="pytorch-operators-customized-by-ascend"></a>PyTorch Operators Customized by Ascend</h2>

 No.
diff --git a/docs/en/RELEASENOTE/RELEASENOTE.md b/docs/en/RELEASENOTE/RELEASENOTE.md
index 47a9de634802cca189dcca262576ac69dbba1115..1e1fecb717753e094e27dab245d4651fffb05006 100644
--- a/docs/en/RELEASENOTE/RELEASENOTE.md
+++ b/docs/en/RELEASENOTE/RELEASENOTE.md
@@ -1,15 +1,15 @@
 # PyTorch Release Notes 2.0.2

-- [Before You Start](#before-you-start.md)
-- [New Features](#new-features.md)
-- [Modified Features](#modified-features.md)
-- [Resolved Issues](#resolved-issues.md)
-- [Known Issues](#known-issues.md)
-- [Compatibility](#compatibility.md)
-<h2 id="before-you-start.md"><a name="before-you-start.md"></a><a name="before-you-start.md"></a>Before You Start</h2>
+- [Before You Start](#before-you-start)
+- [New Features](#new-features)
+- [Modified Features](#modified-features)
+- [Resolved Issues](#resolved-issues)
+- [Known Issues](#known-issues)
+- [Compatibility](#compatibility)
+<h2 id="before-you-start"><a name="before-you-start"></a><a name="before-you-start"></a>Before You Start</h2>
 This framework is modified based on the open-source PyTorch 1.5.0 primarily developed by Facebook, inherits native PyTorch features, and uses NPUs for dynamic graph training. Models are adapted at operator granularity, code can be reused, and existing networks can be ported to and run on NPUs with only device types or data types modified.

-<h2 id="new-features.md"><a name="new-features.md"></a><a name="new-features.md"></a>New Features</h2>
+<h2 id="new-features"><a name="new-features"></a><a name="new-features"></a>New Features</h2>
 **Table 1**  Features supported by PyTorch

@@ -84,15 +84,15 @@

-<h2 id="modified-features.md"><a name="modified-features.md"></a><a name="modified-features.md"></a>Modified Features</h2>
+<h2 id="modified-features"><a name="modified-features"></a><a name="modified-features"></a>Modified Features</h2>

 N/A

-<h2 id="resolved-issues.md"><a name="resolved-issues.md"></a><a name="resolved-issues.md"></a>Resolved Issues</h2>
+<h2 id="resolved-issues"><a name="resolved-issues"></a><a name="resolved-issues"></a>Resolved Issues</h2>

 N/A

-<h2 id="known-issues.md"><a name="known-issues.md"></a><a name="known-issues.md"></a>Known Issues</h2>
+<h2 id="known-issues"><a name="known-issues"></a><a name="known-issues"></a>Known Issues</h2>

 Known Issue
@@ -131,7 +131,7 @@

-<h2 id="compatibility.md"><a name="compatibility.md"></a><a name="compatibility.md"></a>Compatibility</h2>
+<h2 id="compatibility"><a name="compatibility"></a><a name="compatibility"></a>Compatibility</h2>

 Atlas 800 \(model 9010\): CentOS 7.6/Ubuntu 18.04/BC-Linux 7.6/Debian 9.9/Debian 10/openEuler 20.03 LTS