diff --git a/docs/en/Multi-server Multi-device Training Adaptation Guide for PyTorch Models.md b/docs/en/Multi-server Multi-device Training Adaptation Guide for PyTorch Models.md
new file mode 100644
index 0000000000000000000000000000000000000000..a73ac64a6a5266a6375f417432973a65d3a66ab8
--- /dev/null
+++ b/docs/en/Multi-server Multi-device Training Adaptation Guide for PyTorch Models.md
@@ -0,0 +1,900 @@
+# Overview
+
+Users can obtain PyTorch training models from Ascend ModelZoo, but the models do not support multi-server multi-device training out of the box. You need to modify them based on the actual model code. This document describes how to quickly train a PyTorch model in Distributed Data Parallel (DDP) mode in a multi-server multi-device scenario.
+
+# Training Workflow
+
+The process of training a PyTorch model in a multi-server multi-device scenario includes environment preparation, model preparation, model modification, and training startup.
+
+1. Environment Preparation
+
+    Prepare the software, hardware, and network environment for multi-server multi-device training, including setting up the development and operating environment, connecting the cluster network, setting the processor IP addresses, and configuring the firewall.
+
+2. Model Preparation
+
+    Prepare a PyTorch model, data loader, and optimizer for training. You can download them from the open source community (https://gitee.com/ascend/modelzoo/tree/master/built-in/PyTorch) or prepare them by yourself.
+
+3. Model Modification
+
+    Modify the basic model and add the code and environment variables required by DDP to enable multi-server multi-device training.
+
+4. Training Startup
+
+    Start model training in the multi-server multi-device scenario and view the training logs.
+
+
+
+# Quick Start
+
+## Overview
+
+The example presented in this document helps you quickly understand how a PyTorch model is trained in a multi-server multi-device scenario. The example uses a custom model for training in a two-computer eight-device scenario. The two computers are named AI Server0 and AI Server1. The eight Ascend 910 Processors on each computer are named device0 to device7.
+
+## Preparing the Environment
+
+At least two computers with Ascend 910 Processors installed are required, and the NPU firmware and driver must be correctly installed on each computer.
+
+1. Prepare the development and operating environment on each computer.
+
+    - Install the CANN development and operating environment. For details, see the *CANN Software Installation Guide*. Use CANN 5.0.3 or later.
+
+    - Install the NPU-adapted PyTorch. For details, see the *PyTorch Installation Guide*.
+
+2. Prepare the network.
+
+    Set up the network by connecting the computers through switches or directly through optical ports. For details, see the *Ascend Data Center Solution Networking Guide* at https://support.huawei.com/enterprise/en/doc/EDOC1100221995/229cc0e4.
+
+    In this example, two computers with eight devices each are used for training, so optical ports are used for the network connection.
+
+3. Configure the device IP addresses.
+
+    Configure the device IP addresses on AI Server0.
+
+    ```shell
+    hccn_tool -i 0 -ip -s address 192.168.100.101 netmask 255.255.255.0
+    hccn_tool -i 1 -ip -s address 192.168.101.101 netmask 255.255.255.0
+    hccn_tool -i 2 -ip -s address 192.168.102.101 netmask 255.255.255.0
+    hccn_tool -i 3 -ip -s address 192.168.103.101 netmask 255.255.255.0
+    hccn_tool -i 4 -ip -s address 192.168.100.100 netmask 255.255.255.0
+    hccn_tool -i 5 -ip -s address 192.168.101.100 netmask 255.255.255.0
+    hccn_tool -i 6 -ip -s address 192.168.102.100 netmask 255.255.255.0
+    hccn_tool -i 7 -ip -s address 192.168.103.100 netmask 255.255.255.0
+    ```
+
+    Configure the device IP addresses on AI Server1.
+
+    ```shell
+    hccn_tool -i 0 -ip -s address 192.168.100.111 netmask 255.255.255.0
+    hccn_tool -i 1 -ip -s address 192.168.101.111 netmask 255.255.255.0
+    hccn_tool -i 2 -ip -s address 192.168.102.111 netmask 255.255.255.0
+    hccn_tool -i 3 -ip -s address 192.168.103.111 netmask 255.255.255.0
+    hccn_tool -i 4 -ip -s address 192.168.100.110 netmask 255.255.255.0
+    hccn_tool -i 5 -ip -s address 192.168.101.110 netmask 255.255.255.0
+    hccn_tool -i 6 -ip -s address 192.168.102.110 netmask 255.255.255.0
+    hccn_tool -i 7 -ip -s address 192.168.103.110 netmask 255.255.255.0
+    ```
+
+4. Configure the firewall.
+
+    - Command for disabling the firewall on Ubuntu:
+
+      ```shell
+      ufw disable
+      ```
+
+    - Command for disabling the firewall on Red Hat or CentOS 7:
+
+      ```shell
+      systemctl stop firewalld
+      ```
+
+## Preparing a Model
+
+This example creates a simple model to help you quickly understand multi-server multi-device training. You can also obtain an Ascend NPU-based PyTorch training model from the open source community (https://gitee.com/ascend/modelzoo/tree/master/built-in/PyTorch).
+
+1. Prepare a DDP model.
+
+    The following is an example of main.py for multi-server multi-device training.
+
+    ```python
+    import argparse
+    import os
+    import torch
+    import torchvision
+    import torch.nn as nn
+    import torch.nn.functional as F
+    import torch.distributed as dist
+    from torch.nn.parallel import DistributedDataParallel as DDP
+
+    ### 1. Perform basic operations. ###
+    # Build a model.
+    class ToyModel(nn.Module):
+        def __init__(self):
+            super(ToyModel, self).__init__()
+            self.conv1 = nn.Conv2d(3, 6, 5)
+            self.pool = nn.MaxPool2d(2, 2)
+            self.conv2 = nn.Conv2d(6, 16, 5)
+            self.fc1 = nn.Linear(16 * 5 * 5, 120)
+            self.fc2 = nn.Linear(120, 84)
+            self.fc3 = nn.Linear(84, 10)
+
+        def forward(self, x):
+            x = self.pool(F.relu(self.conv1(x)))
+            x = self.pool(F.relu(self.conv2(x)))
+            x = x.view(-1, 16 * 5 * 5)
+            x = F.relu(self.fc1(x))
+            x = F.relu(self.fc2(x))
+            x = self.fc3(x)
+            return x
+
+    # Obtain a dataset.
+    def get_dataset():
+        transform = torchvision.transforms.Compose([
+            torchvision.transforms.ToTensor(),
+            torchvision.transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
+        ])
+        my_trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
+            download=True, transform=transform)
+
+        train_sampler = torch.utils.data.distributed.DistributedSampler(my_trainset)
+        trainloader = torch.utils.data.DataLoader(my_trainset,
+            batch_size=16, num_workers=2, sampler=train_sampler)
+        return trainloader
+
+
+    ### 2. Initialize the parameters, data, model, loss function, and optimizer. ###
+    # Obtain local_rank and addr.
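+    # Note: --local_rank is injected automatically by torch.distributed.launch
+    # (one process per device, numbered 0-7 on each node); --addr must be the
+    # host IP address of the master node, reachable from every AI server.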
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--local_rank", default=-1, type=int)
+    parser.add_argument("--addr", default='127.0.0.1', type=str, help='master addr')
+
+    FLAGS = parser.parse_args()
+    local_rank = FLAGS.local_rank
+    addr = FLAGS.addr
+
+    # Set the IP address and port of the master node.
+    os.environ['MASTER_ADDR'] = addr
+    os.environ['MASTER_PORT'] = '29501'
+
+    # Initialize the DDP backend.
+    loc = 'npu:{}'.format(local_rank)
+    torch.npu.set_device(loc)
+    dist.init_process_group(backend='hccl')  # HCCL is the backend for NPU devices.
+
+
+    # Prepare data after DDP initialization.
+    trainloader = get_dataset()
+
+    # Instantiate the model.
+    model = ToyModel().to(loc)
+
+    # Load the model weights. They need to be loaded only on the master node, before the DDP model is built.
+    ckpt_path = None
+    if dist.get_rank() == 0 and ckpt_path is not None:
+        model.load_state_dict(torch.load(ckpt_path))
+
+    # Build the DDP model.
+    model = DDP(model, device_ids=[local_rank], output_device=local_rank)
+
+    # Initialize the optimizer. After the DDP model is built, use the model to initialize the optimizer.
+    optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
+
+    # Initialize the loss function.
+    loss_func = nn.CrossEntropyLoss().to(loc)
+
+    ### 3. Train the network. ###
+    model.train()
+    iterator = range(100)
+    for epoch in iterator:
+        trainloader.sampler.set_epoch(epoch)
+        for data, label in trainloader:
+            data, label = data.to(local_rank), label.to(local_rank)
+            optimizer.zero_grad()
+            prediction = model(data)
+            loss = loss_func(prediction, label)
+            loss.backward()
+            print("loss = %0.3f \n" % loss)
+            optimizer.step()
+
+        # 1. As in DP mode, save model.module instead of model when saving a model,
+        #    because model actually refers to a DDP-wrapped model after `model = DDP(model)`.
+        # 2. Save the model only on process 0 to avoid saving it repeatedly.
+        if dist.get_rank() == 0:
+            torch.save(model.module.state_dict(), "%d.ckpt" % epoch)
+    ```
+
+2. Ensure that the model is correct for training in the single-server multi-device scenario.
+
+    1. Install the Python third-party libraries required by the model script.
+
+    2. Configure the NPU environment variables. For information about the **env_npu.sh** script, see the appendix.
+
+        ```shell
+        source env_npu.sh
+        ```
+
+    3. Run the following command, which uses **torch.distributed.launch** to execute main.py and train the model in the single-server multi-device scenario.
+
+        ```shell
+        python -m torch.distributed.launch --nproc_per_node 8 main.py
+        ```
+
+        `--nproc_per_node` indicates the number of devices used for training on the node.
+
+        After the command is executed successfully, the model is trained on the eight NPUs of the server.
+
+
+
+## Modifying the Model
+
+The example provided in "Quick Start" is already adapted to multi-server multi-device training, so you do not need to modify the script. For details about how to modify other models, see section "Multi-server Multi-device Training".
+
+## Starting the Training
+
+1. Upload the main.py model script to any directory on AI Server0 and AI Server1.
+
+2. Query the host IP address of each server.
+
+    ```shell
+    hostname -I
+    ```
+
+    All IP addresses of the server are displayed, and the first one is the host IP address of the current server.
+
+    For example, the host IP address of AI Server0 is **192.168.*xx*.22**, and that of AI Server1 is **192.168.*xx*.23**.
+
+3. Use AI Server0 as the master node, and start the 2 x 8 cluster. In this cluster, torch.distributed.launch starts eight processes per server, so the world size is 16, and the global rank of each process is node_rank x 8 + local_rank, as illustrated by the sketch below.
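+
+    The following sketch (illustrative only, not part of main.py) shows how the global ranks are derived in this 2 x 8 topology:
+
+    ```python
+    # Illustrative only: how DDP numbers processes in a 2 x 8 cluster.
+    nnodes, nproc_per_node = 2, 8
+    world_size = nnodes * nproc_per_node          # 16 processes in total
+    for node_rank in range(nnodes):
+        for local_rank in range(nproc_per_node):
+            rank = node_rank * nproc_per_node + local_rank
+            print("node %d, device %d -> global rank %d" % (node_rank, local_rank, rank))
+    ```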
+
+    Startup commands for AI Server0:
+
+    ```shell
+    # Set environment variables. Obtain the env_npu.sh script content from the appendix.
+    source env_npu.sh
+    # Disable the trustlist of the HCCL channel.
+    export HCCL_WHITELIST_DISABLE=1
+    # Initialize the IP address of the HCCL communication NIC. Set it to the host IP address of the current server.
+    export HCCL_IF_IP=192.168.xx.22
+    # Start the training.
+    python3.7 -m torch.distributed.launch --nnodes=2 --node_rank=0 --nproc_per_node 8 --master_addr 192.168.xx.22 --master_port 29501 main.py --addr 192.168.xx.22
+    ```
+
+    Startup commands for AI Server1:
+
+    ```shell
+    # Set environment variables. Obtain the env_npu.sh script content from the appendix.
+    source env_npu.sh
+    # Disable the trustlist of the HCCL channel.
+    export HCCL_WHITELIST_DISABLE=1
+    # Initialize the IP address of the HCCL communication NIC. Set it to the host IP address of the current server.
+    export HCCL_IF_IP=192.168.xx.23
+    # Start the training.
+    python3.7 -m torch.distributed.launch --nnodes=2 --node_rank=1 --nproc_per_node 8 --master_addr 192.168.xx.22 --master_port 29501 main.py --addr 192.168.xx.22
+    ```
+
+    Parameter description:
+
+    --nnodes: specifies the number of nodes used for distributed training.
+
+    --node_rank: specifies the rank of the current node during multi-node distributed training.
+
+    --nproc_per_node: specifies the number of training processes on the current node, one per device.
+
+    --master_addr: specifies the address of the master node (whose rank is 0). The value can be the IP address or the host name of node 0.
+
+    --master_port: specifies the port number used by the master node during distributed training.
+
+    --addr: input parameter of the main.py script, specifying the host IP address of the master node.
+
+4. View the host logs.
+
+    Host logs are stored in the `~/ascend/log` directory. You can go to this directory on each host to view its device logs.
+
+# Multi-server Multi-device Training
+
+## Common Concepts and Parameters
+
+Basic concepts for PyTorch distributed training:
+
+| Basic Concept | Description |
+| :-----------: | ------------------------------------------------------------ |
+| AI Server | Computer with Ascend 910 Processors installed. Multiple computers are identified as AI Server + serial number, for example, AI Server0 and AI Server1. |
+| device | Ascend 910 Processor on the AI server. Multiple processors are represented as device 0, device 1, ..., and device 7. |
+| host | AI server host. |
+| master | One of the AI servers, selected as the master node for data communication. |
+| group | Process group. By default, there is only one group. Use the default value. |
+| world size | Number of global parallel processes, which can be obtained by running **torch.distributed.get_world_size()**. The value is the same across processes. |
+| rank | Sequence number of the current process, used for communication between processes. For example, for a 2 x 8 cluster, the **world size** is 16, and the ranks of the processes are [0, 1, 2, ..., 15]. |
+| local_rank | Sequence number of a process on its host; for example, there are processes 0-7 on each host. Generally, **local_rank** is used to select the GPU/NPU on which the current model runs. |
+
+Parameters for executing **torch.distributed.launch** to start multi-device training:
+
+| Parameter | Description |
+| ------------------ | ------------------------------------------------------------ |
+| **nnodes** | Specifies the number of nodes used for distributed training. |
+| **node_rank** | Specifies the rank of the current node during multi-node distributed training. |
+| **nproc_per_node** | Specifies the number of training processes on the current node. You are advised to set this parameter to the number of GPUs/NPUs on the current node, so that each process independently controls one device for the highest efficiency. |
+| **master_addr** | Specifies the address of the master node (whose rank is 0). The value can be the IP address or the host name of node 0. For single-node multi-process training, set this parameter to **127.0.0.1**. |
+| **master_port** | Specifies the port number used by the master node during distributed training. The port number must be different from those of other applications. |
+
+## Multi-server Multi-device Training Process
+
+### Preparing the Environment
+
+At least two computers with Ascend 910 Processors installed are required, and the NPU firmware and driver must be correctly installed on each server.
+
+1. Prepare the development and operating environment on each computer.
+
+    - Install the CANN development and operating environment. For details, see the *CANN Software Installation Guide*. Use CANN 5.0.3 or later.
+
+    - Install the NPU-adapted PyTorch. For details, see the *PyTorch Installation Guide*.
+
+2. Prepare the network.
+
+    Cluster training is performed by multiple computers (a maximum of 128) with Ascend 910 Processors installed. The computers need to work with switches to form a fully-connected active/standby network on the data plane, and 8 x *n*-device training is supported. Two computers can also be directly connected through optical ports. For details, see the *Ascend Data Center Solution Networking Guide* at https://support.huawei.com/enterprise/en/doc/EDOC1100221995/229cc0e4.
+
+3. Configure the device IP addresses.
+
+    Use hccn_tool, which is provided by CANN, to configure the device IP addresses.
+
+    ```shell
+    hccn_tool -i 0 -ip -s address 192.168.100.111 netmask 255.255.255.0
+    ```
+
+    Observe the following rules when configuring the device IP addresses:
+
+    1. On each AI server, devices 0/4, 1/5, 2/6, and 3/7 must be in the same network segment, while devices 0, 1, 2, and 3 must be in different network segments, and devices 4, 5, 6, and 7 must be in different network segments.
+    2. In the cluster scenario, corresponding devices on different AI servers must be in the same network segment. For example, NIC 0 of AI Server0 and NIC 0 of AI Server1 must be in the same network segment.
+    3. Each IP address must be unique. IP addresses in the same network segment must differ in the last eight bits.
+
+    Use hccn_tool to check whether the device IP addresses are correct.
+
+    - Query the IP address of each device.
+
+        ```shell
+        hccn_tool -i 0 -ip -g
+        ```
+
+        The IP address and subnet mask of the device are displayed:
+
+        > ipaddr:192.168.100.101
+        >
+        > netmask:255.255.255.0
+
+    - Use hccn_tool to ensure that the devices of the two hosts are correctly connected, by performing the following test eight times, from device0 to device7.
+
+        ```shell
+        hccn_tool -i 0 -netdetect -s address xx.xx.xx.xx
+
+        hccn_tool -i 0 -net_health -g
+        ```
+
+        **-i**: device ID.
+
+        **-s address**: *xx.xx.xx.xx* is the IP address of device *i* on the other host.
+
+        If `success` is returned, the connection is successful.
+
+4. Configure the firewall.
+
+    During HCCL communication, the firewall may intercept the communication port, causing communication timeout. Therefore, you need to disable the firewall on the servers for PyTorch cluster training.
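+
+    If you are unsure whether a firewall is active, you can first check its status (a quick check, assuming the standard tools of each distribution):
+
+    ```shell
+    ufw status                      # Ubuntu
+    systemctl status firewalld      # Red Hat/CentOS 7
+    ```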
+
+    - Command for disabling the firewall on Ubuntu:
+
+      ```shell
+      ufw disable
+      ```
+
+    - Command for disabling the firewall on Red Hat or CentOS 7:
+
+      ```shell
+      systemctl stop firewalld
+      ```
+
+### Preparing a Model
+
+There are two methods for preparing a model.
+
+- Download a PyTorch training model from the open source community (https://gitee.com/ascend/modelzoo/tree/master/built-in/PyTorch).
+
+    A model obtained from the open source community supports single-server multi-device training. Modify it based on the related parameters described in section "Modifying the Model".
+
+- Manually build a PyTorch training model.
+
+1. Prepare a PyTorch training model and data loader.
+
+    Prepare a PyTorch model:
+
+    ```python
+    class ToyModel(nn.Module):
+        def __init__(self):
+            ...
+        def forward(self, x):
+            ...
+    ```
+
+    Prepare data:
+
+    ```python
+    def get_dataset():
+        transform = torchvision.transforms.Compose([
+            torchvision.transforms.ToTensor(),
+            torchvision.transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
+        ])
+        my_trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
+            download=True, transform=transform)
+
+        trainloader = torch.utils.data.DataLoader(my_trainset, batch_size=16)
+        return trainloader
+
+    trainloader = get_dataset()
+    ```
+
+2. Instantiate the model.
+
+    ```python
+    # Instantiate the model.
+    model = ToyModel().to(loc)
+
+    # Load the model weights.
+    if ckpt_path is not None:
+        model.load_state_dict(torch.load(ckpt_path))
+    ```
+
+3. Prepare the loss function and optimizer.
+
+    ```python
+    # Initialize the optimizer.
+    optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
+
+    # Initialize the loss function.
+    loss_func = nn.CrossEntropyLoss().to(loc)
+    ```
+
+4. Train the model.
+
+    ```python
+    ### 3. Train the network. ###
+    model.train()
+    iterator = range(100)
+    for epoch in iterator:
+        for data, label in trainloader:
+            data, label = data.to(local_rank), label.to(local_rank)
+            optimizer.zero_grad()
+            prediction = model(data)
+            loss = loss_func(prediction, label)
+            loss.backward()
+            print("loss = %0.3f \n" % loss)
+            optimizer.step()
+
+        torch.save(model.state_dict(), "%d.ckpt" % epoch)
+    ```
+
+
+
+### Modifying the Model
+
+Based on the initial model code, modify the IP address and port of the master node, initialize the **distributed** function, the model DDP, the data DDP, and the optimizer, and modify the DDP model training method.
+
+1. Set the IP address and port of the master node. NPU distributed training uses HCCL for communication, and PyTorch uses the HCCL communication mechanism with automatic topology detection. That is, **RANK_TABLE_FILE** is not required, but the communication depends on the NIC on the host side. Therefore, you need to set environment variables in the code to specify the communication NIC.
+
+    ```python
+    os.environ['MASTER_ADDR'] = 'xxx.xxx.xxx.xxx'
+    os.environ['MASTER_PORT'] = 'xxx'
+    ```
+
+    **MASTER_ADDR**: Set this parameter to the IP address of the master node in the cluster. (Select any host as the master node.)
+
+    **MASTER_PORT**: Set this parameter to an idle port of the master node.
+
+    In the model code, the IP address and port number of the master node are generally passed in as command-line parameters. In some open-source code, they may be hard-coded as **127.0.0.1**; in this case, you need to modify them, as shown in the sketch below.
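+
+    The following is a minimal sketch of this modification (the `--addr` and `--port` parameter names are illustrative, not a fixed convention):
+
+    ```python
+    import argparse
+    import os
+
+    parser = argparse.ArgumentParser()
+    # Replace any hard-coded 127.0.0.1 with a configurable master address.
+    parser.add_argument("--addr", default='192.168.xx.22', type=str, help='master addr')
+    parser.add_argument("--port", default='29501', type=str, help='master port')
+    args = parser.parse_args()
+
+    # Declare the communication NIC settings before init_process_group() is invoked.
+    os.environ['MASTER_ADDR'] = args.addr
+    os.environ['MASTER_PORT'] = args.port
+    ```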
+
+    The preceding variables must be declared before **torch.distributed.init_process_group()** is invoked.
+
+2. Initialize **distributed**.
+
+    In PyTorch, `dist.init_process_group(backend='hccl', world_size=world_size, rank=rank)` is used to initialize process groups. The parameters are described as follows:
+
+    `backend`: communication protocol used for distributed training. Only hccl can be used on NPUs.
+
+    `world_size`: total number of devices used for training.
+
+    `rank`: rank ID of the device being initialized, that is, its global logical ID.
+
+    There are two methods to start multi-device training:
+
+    - **torch.distributed.launch**:
+
+        ```python
+        import torch.distributed as dist
+
+        dist.init_process_group(backend='hccl')  # HCCL is the backend for NPU devices.
+        ```
+
+    - **mp.spawn**:
+
+        ```python
+        import torch.distributed as dist
+
+        def main_worker(pid_idx, device_nums_per_node, args):
+            args.distributed_rank = args.rank * device_nums_per_node + pid_idx
+            dist.init_process_group(backend=args.dist_backend, world_size=args.distributed_world_size, rank=args.distributed_rank)
+        ```
+
+        In the preceding code:
+
+        `pid_idx`: device ID.
+
+        `device_nums_per_node`: number of devices on each AI server.
+
+3. Initialize the model DDP.
+
+    ```python
+    # Instantiate the model.
+    model = ToyModel().to(loc)
+
+    # Load the model weights. They need to be loaded only on the master node, before the DDP model is built.
+    if dist.get_rank() == 0 and ckpt_path is not None:
+        model.load_state_dict(torch.load(ckpt_path))
+
+    # Build the DDP model.
+    model = DDP(model, device_ids=[local_rank], output_device=local_rank)
+    ```
+
+4. Initialize the data DDP.
+
+    ```python
+    def get_dataset():
+        transform = torchvision.transforms.Compose([
+            torchvision.transforms.ToTensor(),
+            torchvision.transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
+        ])
+        my_trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
+            download=True, transform=transform)
+
+        train_sampler = torch.utils.data.distributed.DistributedSampler(my_trainset)
+        trainloader = torch.utils.data.DataLoader(my_trainset,
+            batch_size=16, num_workers=2, sampler=train_sampler)
+        return trainloader
+
+    trainloader = get_dataset()
+    ```
+
+5. Initialize the loss function and optimizer.
+
+    ```python
+    # Initialize the optimizer. After the DDP model is built, use the model to initialize the optimizer.
+    optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
+
+    # Initialize the loss function.
+    loss_func = nn.CrossEntropyLoss().to(loc)
+    ```
+
+6. Train the DDP model.
+
+    ```python
+    model.train()
+    iterator = range(100)
+    for epoch in iterator:
+        # Set the epoch so that shuffling differs across epochs.
+        trainloader.sampler.set_epoch(epoch)
+
+        for data, label in trainloader:
+            data, label = data.to(local_rank), label.to(local_rank)
+            optimizer.zero_grad()
+            prediction = model(data)
+            loss = loss_func(prediction, label)
+            loss.backward()
+            print("loss = %0.3f \n" % loss)
+            optimizer.step()
+
+        # 1. As in DP mode, save model.module instead of model,
+        #    because model refers to a DDP-wrapped model after `model = DDP(model)`.
+        # 2. Save the model only on process 0 to avoid saving it repeatedly.
+        if dist.get_rank() == 0:
+            torch.save(model.module.state_dict(), "%d.ckpt" % epoch)
+    ```
+
+### Start the Training
+
+The training can be started manually or by using a shell script.
+
+- Start the training manually using **torch.distributed.launch**.
+
+    1. Configure the NPU environment variables. For details, see the **env_npu.sh** script in the appendix.
+
+    2. Add environment variables. For multi-server training, add the `HCCL_WHITELIST_DISABLE` and `HCCL_IF_IP` environment variables.
+
+        - **HCCL_WHITELIST_DISABLE**: HCCL channel trustlist. The value **1** indicates that the trustlist is disabled.
+        - **HCCL_IF_IP**: initial IP address of the HCCL communication NIC. Set it to the host NIC IP address of the current server.
+
+    3. Upload the modified model script to each AI server.
+
+    4. Install the required Python libraries on each AI server.
+
+    5. Select an AI server as the master node and query the IP address of each AI server.
+
+    6. Run the following command on each AI server:
+
+        ```shell
+        python3 -m torch.distributed.launch --nnodes=${nnodes} --node_rank=i --nproc_per_node 8 --master_addr 192.168.xx.22 --master_port 29501 main.py --addr 192.168.xx.22
+        ```
+
+        In the preceding command:
+
+        **--nnodes**: number of AI servers used for distributed training.
+
+        **--node_rank**: AI server ID.
+
+        **--nproc_per_node**: number of devices on each AI server.
+
+        **--master_addr**: IP address of the AI server that functions as the master node.
+
+        **--master_port**: port number of the AI server that functions as the master node.
+
+        **main.py**: Change it to the name of your startup script.
+
+        **--addr**: IP address of the master node, which is a parameter passed to the startup script.
+
+- Start the training using Open MPI.
+
+    1. Install the Open MPI open-source library.
+
+        In the multi-server multi-device scenario, distributed training deployment depends on the Open MPI open-source library, which must be installed on each server that participates in model training. Currently, Open MPI 4.0.1, 4.0.2, or 4.0.3 is required.
+
+        Run the **mpirun --version** command to check whether Open MPI has been installed. If version information such as `mpirun (Open MPI) 4.0.2` is returned, Open MPI has been installed. If the installed version is 4.0.1, 4.0.2, or 4.0.3, you do not need to install it again.
+
+        Otherwise, perform the following steps to install it:
+
+        1. Visit the following link to download the Open MPI software package, for example, openmpi-4.0.2.tar.bz2:
+
+            https://www.open-mpi.org/software/ompi/v4.0/
+
+        2. Log in to the installation environment as the root user.
+
+        3. Upload the downloaded Open MPI software package to a directory in the installation environment.
+
+        4. Go to the directory and run the following command to decompress the software package:
+
+            ```shell
+            tar -jxvf openmpi-4.0.2.tar.bz2
+            ```
+
+        5. Go to the directory generated after the decompression, and run the following commands to configure, compile, and install Open MPI:
+
+            ```shell
+            ./configure --prefix=/usr/local/mpirun4.0.2
+            make
+            make install
+            ```
+
+            The **--prefix** option specifies the Open MPI installation path. Change it based on the site requirements.
+
+        6. Run the **vi ~/.bashrc** command to open the **.bashrc** file, and add the following environment variables to the end of the file:
+
+            ```shell
+            export OPENMPI=/usr/local/mpirun4.0.2
+            export LD_LIBRARY_PATH=$OPENMPI/lib
+            export PATH=$OPENMPI/bin:$PATH
+            ```
+
+            In the environment variables, **/usr/local/mpirun4.0.2** indicates the Open MPI installation path. Change it based on the site requirements. Run the **:wq!** command to save the file and exit.
+
+        7. Make the configuration take effect.
+
+            ```
+            source ~/.bashrc
+            ```
+
+        8. After the installation is complete, run the following command to check the installed version. If the required version information is displayed, the installation is successful.
+
+            ```
+            mpirun --version
+            ```
+
+    2. Configure SSH password-free login between the AI servers.
+
+        If Open MPI is used for distributed training deployment in the multi-server multi-device scenario, you need to configure SSH password-free login between every two servers to ensure that the servers can communicate with each other. The procedure is as follows:
+
+        1. Log in to each server as the root user.
+
+        2. Configure mutual trust between the hosts in the cluster.
+
+            Open the **/etc/ssh/ssh_config** file and add the following fields to the end of the file:
+
+            ```
+            StrictHostKeyChecking no
+            UserKnownHostsFile /dev/null
+            ```
+
+        3. Open the **/etc/hosts** file on each server and add the IP address and host name of the server to the first line of the file. If the file already contains them, skip this step. The following is an example of the content to be added:
+
+            ```
+            10.90.140.199 ubuntu
+            ```
+
+            In the preceding content, **10.90.140.199** is the IP address of the server, and **ubuntu** is the host name.
+
+        4. Run the following commands on the first server to generate a public key (assuming that the IP address of the first server is **10.90.140.199**):
+
+            ```
+            cd ~/.ssh/                       # If the directory does not exist, run the ssh localhost command first.
+            ssh-keygen -t rsa                # After the key is generated, a message is displayed. Press Enter three consecutive times.
+            mv id_rsa.pub authorized_keys    # Rename the generated key id_rsa.pub authorized_keys.
+            ```
+
+        5. Generate a key on each of the other servers, and copy the keys to the **authorized_keys** file on the first server.
+
+            1. Run the following commands on each of the other servers to generate a key:
+
+                ```
+                cd ~/.ssh/
+                ssh-keygen -t rsa
+                ```
+
+            2. Download the key file **id_rsa.pub** generated on each server to the local host and copy the key in the file.
+
+            3. On the first server, run the following command to open the authorized_keys file, and paste the key of each of the other servers to the end of the first server's public key:
+
+                ```
+                vi ~/.ssh/authorized_keys
+                ```
+
+                Run the **:wq!** command to save the file.
+
+        6. Run the following commands on each of the other servers to copy the public keys of the first server to that server:
+
+            ```
+            cd ~/.ssh/
+            scp root@10.90.140.199:~/.ssh/authorized_keys ./
+            ```
+
+        7. Run the following command on each server to test password-free login:
+
+            ```
+            ssh <username>@<IP address>
+            ```
+
+            For example, run the **ssh root@10.90.140.231** command to log in to the server whose IP address is 10.90.140.231 from the first server (10.90.140.199) without a password.
+
+            If information similar to the following is displayed, the password-free login is successful:
+
+            ```
+            Linux ubuntu 4.19.28 #1 SMP Tue Jun 23 19:05:23 EDT 2020 x86_64
+
+            The programs included with the ubuntu GNU/Linux system are free software;
+            the exact distribution terms for each program are described in the
+            individual files in /usr/share/doc/*/copyright.
+
+            ubuntu GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
+            permitted by applicable law.
+            Last login: Tue Sep 15 14:37:21 2020 from 10.254.88.75
+            ```
+
+            You can run the **exit** command to log out of the server. If information similar to the following is displayed, the logout is successful:
+
+            ```
+            logout
+            Connection to 10.90.140.231 closed.
+            ```
+
+    3. Use Open MPI to start model training.
+
+        1. Compile a startup script for each AI server, for example, train.sh, and place the script in the same path on each AI server.
+
+            ```
+            # Configure the NPU environment variables. For information about the env_npu.sh script, see the appendix.
+            source env_npu.sh
+            # Disable the trustlist of the HCCL channel.
+            export HCCL_WHITELIST_DISABLE=1
+            # Initialize the IP address of the HCCL communication NIC. Set it to the host IP address of the current server.
+            export HCCL_IF_IP=xxx.xxx.xx.xxx
+            python3 -m torch.distributed.launch --nnodes=${nnodes} --node_rank=i --nproc_per_node 8 --master_addr xxx.xxx.xx.xxx --master_port 29501 main.py --addr xxx.xxx.xx.xxx
+            ```
+
+            For details about the script parameters, see "Start the training manually using **torch.distributed.launch**".
+
+        2. Compile the mpirun startup script.
+
+            ```
+            # Configure the mpirun environment variables.
+            export PATH=$PATH:/usr/local/mpirun4.0.2/bin
+            # Run the mpirun tool.
+            mpirun -H xxx.xxx.xxx.xxx:1,xxx.xxx.xxx.xxx:1 \
+                   --bind-to none -map-by slot \
+                   --mca btl_tcp_if_exclude lo,docker0,endvnic \
+                   --allow-run-as-root \
+                   --prefix /usr/local/mpirun4.0.2/ \
+                   ./train.sh
+            ```
+
+            In the preceding command:
+
+            **-H**: IP address of each AI server and the number of processes to start on it.
+
+            **--bind-to**: process binding policy.
+
+            **--mca**: sets an MCA parameter in a specific context; the first argument is the parameter name, and the second is the parameter value.
+
+            **--allow-run-as-root**: allows the script to be run by the root user.
+
+            **--prefix**: Open MPI installation path (/usr/local/mpirun4.0.2 in this example).
+
+            **./train.sh**: path of the startup script on each AI server.
+
+    4. View the log information after the training succeeds.
+
+        Host logs are stored in the `~/ascend/log` directory. You can go to this directory on each host to view its device logs.
+
+# Appendix
+
+The following shows the NPU environment variable configuration script **env_npu.sh**, which can be used to configure the operating and development environment variables.
+
+```shell
+#!/bin/bash
+export install_path=/usr/local/Ascend
+
+if [ -d ${install_path}/toolkit ]; then
+    export LD_LIBRARY_PATH=/usr/include/hdf5/lib/:/usr/local/:/usr/local/lib/:/usr/lib/:${install_path}/fwkacllib/lib64/:${install_path}/driver/lib64/common/:${install_path}/driver/lib64/driver/:${install_path}/add-ons:${path_lib}:${LD_LIBRARY_PATH}
+    export PATH=${install_path}/fwkacllib/ccec_compiler/bin:${install_path}/fwkacllib/bin:$PATH
+    export PYTHONPATH=${install_path}/fwkacllib/python/site-packages:${install_path}/tfplugin/python/site-packages:${install_path}/toolkit/python/site-packages:$PYTHONPATH
+    export PYTHONPATH=/usr/local/python3.7.5/lib/python3.7/site-packages:$PYTHONPATH
+    export ASCEND_OPP_PATH=${install_path}/opp
+else
+    if [ -d ${install_path}/nnae/latest ];then
+        export LD_LIBRARY_PATH=/usr/local/:/usr/local/python3.7.5/lib/:/usr/local/openblas/lib:/usr/local/lib/:/usr/lib64/:/usr/lib/:${install_path}/nnae/latest/fwkacllib/lib64/:${install_path}/driver/lib64/common/:${install_path}/driver/lib64/driver/:${install_path}/add-ons/:/usr/lib/aarch64-linux-gnu:$LD_LIBRARY_PATH
+        export PATH=$PATH:${install_path}/nnae/latest/fwkacllib/ccec_compiler/bin/:${install_path}/nnae/latest/toolkit/tools/ide_daemon/bin/
+        export ASCEND_OPP_PATH=${install_path}/nnae/latest/opp/
+        export OPTION_EXEC_EXTERN_PLUGIN_PATH=${install_path}/nnae/latest/fwkacllib/lib64/plugin/opskernel/libfe.so:${install_path}/nnae/latest/fwkacllib/lib64/plugin/opskernel/libaicpu_engine.so:${install_path}/nnae/latest/fwkacllib/lib64/plugin/opskernel/libge_local_engine.so
+        export PYTHONPATH=${install_path}/nnae/latest/fwkacllib/python/site-packages/:${install_path}/nnae/latest/fwkacllib/python/site-packages/auto_tune.egg/auto_tune:${install_path}/nnae/latest/fwkacllib/python/site-packages/schedule_search.egg:$PYTHONPATH
+        export ASCEND_AICPU_PATH=${install_path}/nnae/latest
+    else
+        export LD_LIBRARY_PATH=/usr/local/:/usr/local/lib/:/usr/lib64/:/usr/lib/:/usr/local/python3.7.5/lib/:/usr/local/openblas/lib:${install_path}/ascend-toolkit/latest/fwkacllib/lib64/:${install_path}/driver/lib64/common/:${install_path}/driver/lib64/driver/:${install_path}/add-ons/:/usr/lib/aarch64-linux-gnu:$LD_LIBRARY_PATH
+        export PATH=$PATH:${install_path}/ascend-toolkit/latest/fwkacllib/ccec_compiler/bin/:${install_path}/ascend-toolkit/latest/toolkit/tools/ide_daemon/bin/
+        export ASCEND_OPP_PATH=${install_path}/ascend-toolkit/latest/opp/
+        export OPTION_EXEC_EXTERN_PLUGIN_PATH=${install_path}/ascend-toolkit/latest/fwkacllib/lib64/plugin/opskernel/libfe.so:${install_path}/ascend-toolkit/latest/fwkacllib/lib64/plugin/opskernel/libaicpu_engine.so:${install_path}/ascend-toolkit/latest/fwkacllib/lib64/plugin/opskernel/libge_local_engine.so
+        export PYTHONPATH=${install_path}/ascend-toolkit/latest/fwkacllib/python/site-packages/:${install_path}/ascend-toolkit/latest/fwkacllib/python/site-packages/auto_tune.egg/auto_tune:${install_path}/ascend-toolkit/latest/fwkacllib/python/site-packages/schedule_search.egg:$PYTHONPATH
+        export ASCEND_AICPU_PATH=${install_path}/ascend-toolkit/latest
+    fi
+fi
+
+# Output host logs to the serial port. 0: disable; 1: enable.
+export ASCEND_SLOG_PRINT_TO_STDOUT=0
+# Set the default log level. 0: debug; 1: info; 2: warning; 3: error.
+export ASCEND_GLOBAL_LOG_LEVEL=3
+# Enable or disable the event log. 0: disable; 1: enable.
+export ASCEND_GLOBAL_EVENT_ENABLE=0
+# Enable or disable taskque. 0: disable; 1: enable.
+export TASK_QUEUE_ENABLE=1
+# Enable or disable the HCCL trustlist. 1: disable; 0: enable.
+export HCCL_WHITELIST_DISABLE=1 + +#Set the device-side log to error. +${install_path}/driver/tools/msnpureport -g error -d 0 +${install_path}/driver/tools/msnpureport -g error -d 1 +${install_path}/driver/tools/msnpureport -g error -d 2 +${install_path}/driver/tools/msnpureport -g error -d 3 +${install_path}/driver/tools/msnpureport -g error -d 4 +${install_path}/driver/tools/msnpureport -g error -d 5 +${install_path}/driver/tools/msnpureport -g error -d 6 +${install_path}/driver/tools/msnpureport -g error -d 7 +#Disable the event log on the device side. +${install_path}/driver/tools/msnpureport -e disable + +path_lib=$(python3.7 -c """ +import sys +import re +result='' +for index in range(len(sys.path)): + match_sit = re.search('-packages', sys.path[index]) + if match_sit is not None: + match_lib = re.search('lib', sys.path[index]) + + if match_lib is not None: + end=match_lib.span()[1] + result += sys.path[index][0:end] + ':' + + result+=sys.path[index] + '/torch/lib:' +print(result)""" +) + +echo ${path_lib} + +export LD_LIBRARY_PATH=/usr/local/python3.7.5/lib/:${path_lib}:$LD_LIBRARY_PATH +``` \ No newline at end of file diff --git a/docs/en/ONNX Operator List/ONNX Operator List.md b/docs/en/ONNX Operator List/Supported ONNX Operators.md similarity index 39% rename from docs/en/ONNX Operator List/ONNX Operator List.md rename to docs/en/ONNX Operator List/Supported ONNX Operators.md index 4b5f3dee3622695b35c934600e8cb4692638fdc6..a7e975595cbb4119a0564bc9b6528eb998ce9f1a 100644 --- a/docs/en/ONNX Operator List/ONNX Operator List.md +++ b/docs/en/ONNX Operator List/Supported ONNX Operators.md @@ -1,276 +1,276 @@ -# ONNX Operator List -- [Abs](#absmd) -- [Acos](#acosmd) -- [Acosh](#acoshmd) -- [AdaptiveAvgPool2D](#adaptiveavgpool2dmd) -- [AdaptiveMaxPool2D](#adaptivemaxpool2dmd) -- [Add](#addmd) -- [Addcmul](#addcmulmd) -- [AffineGrid](#affinegridmd) -- [And](#andmd) -- [Argmax](#argmaxmd) -- [Argmin](#argminmd) -- [AscendRequantS16](#ascendrequants16md) -- [AscendRequant](#ascendrequantmd) -- [AscendQuant](#ascendquantmd) -- [AscendDequantS16](#ascenddequants16md) -- [AscendDequant](#ascenddequantmd) -- [AscendAntiQuant](#ascendantiquantmd) -- [Asin](#asinmd) -- [Asinh](#asinhmd) -- [Atan](#atanmd) -- [Atanh](#atanhmd) -- [AveragePool](#averagepoolmd) -- [BatchNormalization](#batchnormalizationmd) -- [BatchMatMul](#batchmatmulmd) -- [BatchMultiClassNMS](#batchmulticlassnmsmd) -- [BitShift](#bitshiftmd) -- [Cast](#castmd) -- [Ceil](#ceilmd) -- [Celu](#celumd) -- [Concat](#concatmd) -- [Clip](#clipmd) -- [ConvTranspose](#convtransposemd) -- [Cumsum](#cumsummd) -- [Conv](#convmd) -- [Compress](#compressmd) -- [Constant](#constantmd) -- [ConstantOfShape](#constantofshapemd) -- [Cos](#cosmd) -- [Cosh](#coshmd) -- [DeformableConv2D](#deformableconv2dmd) -- [Det](#detmd) -- [DepthToSpace](#depthtospacemd) -- [Div](#divmd) -- [Dropout](#dropoutmd) -- [Elu](#elumd) -- [EmbeddingBag](#embeddingbagmd) -- [Equal](#equalmd) -- [Erf](#erfmd) -- [Exp](#expmd) -- [Expand](#expandmd) -- [EyeLike](#eyelikemd) -- [Flatten](#flattenmd) -- [Floor](#floormd) -- [Gather](#gathermd) -- [GatherND](#gatherndmd) -- [GatherElements](#gatherelementsmd) -- [Gemm](#gemmmd) -- [GlobalAveragePool](#globalaveragepoolmd) -- [GlobalLpPool](#globallppoolmd) -- [GlobalMaxPool](#globalmaxpoolmd) -- [Greater](#greatermd) -- [GreaterOrEqual](#greaterorequalmd) -- [HardSigmoid](#hardsigmoidmd) +# Supported ONNX Operators +- [Abs](#Absmd) +- [Acos](#Acosmd) +- [Acosh](#Acoshmd) +- [AdaptiveAvgPool2D](#AdaptiveAvgPool2Dmd) +- 
[AdaptiveMaxPool2D](#AdaptiveMaxPool2Dmd) +- [Add](#Addmd) +- [Addcmul](#Addcmulmd) +- [AffineGrid](#AffineGridmd) +- [And](#Andmd) +- [Argmax](#Argmaxmd) +- [Argmin](#Argminmd) +- [AscendRequantS16](#AscendRequantS16md) +- [AscendRequant](#AscendRequantmd) +- [AscendQuant](#AscendQuantmd) +- [AscendDequantS16](#AscendDequantS16md) +- [AscendDequant](#AscendDequantmd) +- [AscendAntiQuant](#AscendAntiQuantmd) +- [Asin](#Asinmd) +- [Asinh](#Asinhmd) +- [Atan](#Atanmd) +- [Atanh](#Atanhmd) +- [AveragePool](#AveragePoolmd) +- [BatchNormalization](#BatchNormalizationmd) +- [BatchMatMul](#BatchMatMulmd) +- [BatchMultiClassNMS](#BatchMultiClassNMSmd) +- [BitShift](#BitShiftmd) +- [Cast](#Castmd) +- [Ceil](#Ceilmd) +- [Celu](#Celumd) +- [Concat](#Concatmd) +- [Clip](#Clipmd) +- [ConvTranspose](#ConvTransposemd) +- [Cumsum](#Cumsummd) +- [Conv](#Convmd) +- [Compress](#Compressmd) +- [Constant](#Constantmd) +- [ConstantOfShape](#ConstantOfShapemd) +- [Cos](#Cosmd) +- [Cosh](#Coshmd) +- [DeformableConv2D](#DeformableConv2Dmd) +- [Det](#Detmd) +- [DepthToSpace](#DepthToSpacemd) +- [Div](#Divmd) +- [Dropout](#Dropoutmd) +- [Elu](#Elumd) +- [EmbeddingBag](#EmbeddingBagmd) +- [Equal](#Equalmd) +- [Erf](#Erfmd) +- [Exp](#Expmd) +- [Expand](#Expandmd) +- [EyeLike](#EyeLikemd) +- [Flatten](#Flattenmd) +- [Floor](#Floormd) +- [Gather](#Gathermd) +- [GatherND](#GatherNDmd) +- [GatherElements](#GatherElementsmd) +- [Gemm](#Gemmmd) +- [GlobalAveragePool](#GlobalAveragePoolmd) +- [GlobalLpPool](#GlobalLpPoolmd) +- [GlobalMaxPool](#GlobalMaxPoolmd) +- [Greater](#Greatermd) +- [GreaterOrEqual](#GreaterOrEqualmd) +- [HardSigmoid](#HardSigmoidmd) - [hardmax](#hardmaxmd) -- [HardSwish](#hardswishmd) -- [Identity](#identitymd) -- [If](#ifmd) -- [InstanceNormalization](#instancenormalizationmd) -- [Less](#lessmd) -- [LeakyRelu](#leakyrelumd) -- [LessOrEqual](#lessorequalmd) -- [Log](#logmd) -- [LogSoftMax](#logsoftmaxmd) -- [LpNormalization](#lpnormalizationmd) -- [LpPool](#lppoolmd) -- [LRN](#lrnmd) -- [LSTM](#lstmmd) -- [MatMul](#matmulmd) -- [Max](#maxmd) -- [MaxPool](#maxpoolmd) -- [MaxRoiPool](#maxroipoolmd) -- [MaxUnpool](#maxunpoolmd) -- [Mean](#meanmd) -- [MeanVarianceNormalization](#meanvariancenormalizationmd) -- [Min](#minmd) -- [Mod](#modmd) -- [Mul](#mulmd) -- [Multinomial](#multinomialmd) -- [Neg](#negmd) -- [NonMaxSuppression](#nonmaxsuppressionmd) -- [NonZero](#nonzeromd) -- [Not](#notmd) -- [OneHot](#onehotmd) -- [Or](#ormd) -- [RandomNormalLike](#randomnormallikemd) -- [RandomUniformLike](#randomuniformlikemd) -- [RandomUniform](#randomuniformmd) -- [Range](#rangemd) -- [Reciprocal](#reciprocalmd) -- [ReduceL1](#reducel1md) -- [ReduceL2](#reducel2md) -- [ReduceLogSum](#reducelogsummd) -- [ReduceLogSumExp](#reducelogsumexpmd) -- [ReduceMin](#reduceminmd) -- [ReduceMean](#reducemeanmd) -- [ReduceProd](#reduceprodmd) -- [ReduceSumSquare](#reducesumsquaremd) -- [Resize](#resizemd) -- [Relu](#relumd) -- [ReduceSum](#reducesummd) -- [ReduceMax](#reducemaxmd) -- [Reshape](#reshapemd) -- [ReverseSequence](#reversesequencemd) -- [RoiExtractor](#roiextractormd) -- [RoiAlign](#roialignmd) -- [Round](#roundmd) -- [PRelu](#prelumd) -- [Scatter](#scattermd) -- [ScatterElements](#scatterelementsmd) -- [ScatterND](#scatterndmd) -- [Shrink](#shrinkmd) -- [Selu](#selumd) -- [Shape](#shapemd) -- [Sigmoid](#sigmoidmd) -- [Slice](#slicemd) -- [Softmax](#softmaxmd) -- [Softsign](#softsignmd) -- [Softplus](#softplusmd) -- [SpaceToDepth](#spacetodepthmd) -- [Split](#splitmd) -- [Sqrt](#sqrtmd) -- [Squeeze](#squeezemd) -- 
[Sub](#submd) -- [Sign](#signmd) -- [Sin](#sinmd) -- [Sinh](#sinhmd) -- [Size](#sizemd) -- [Sum](#summd) -- [Tanh](#tanhmd) -- [TfIdfVectorizer](#tfidfvectorizermd) -- [Tile](#tilemd) -- [ThresholdedRelu](#thresholdedrelumd) -- [TopK](#topkmd) -- [Transpose](#transposemd) -- [Pad](#padmd) -- [Pow](#powmd) -- [Unsqueeze](#unsqueezemd) -- [Xor](#xormd) -- [Where](#wheremd) -

-<h2 id="absmd">Abs</h2>
- -### Description +- [HardSwish](#HardSwishmd) +- [Identity](#Identitymd) +- [If](#Ifmd) +- [InstanceNormalization](#InstanceNormalizationmd) +- [Less](#Lessmd) +- [LeakyRelu](#LeakyRelumd) +- [LessOrEqual](#LessOrEqualmd) +- [Log](#Logmd) +- [LogSoftMax](#LogSoftMaxmd) +- [LpNormalization](#LpNormalizationmd) +- [LpPool](#LpPoolmd) +- [LRN](#LRNmd) +- [LSTM](#LSTMmd) +- [MatMul](#MatMulmd) +- [Max](#Maxmd) +- [MaxPool](#MaxPoolmd) +- [MaxRoiPool](#MaxRoiPoolmd) +- [MaxUnpool](#MaxUnpoolmd) +- [Mean](#Meanmd) +- [MeanVarianceNormalization](#MeanVarianceNormalizationmd) +- [Min](#Minmd) +- [Mod](#Modmd) +- [Mul](#Mulmd) +- [Multinomial](#Multinomialmd) +- [Neg](#Negmd) +- [NonMaxSuppression](#NonMaxSuppressionmd) +- [NonZero](#NonZeromd) +- [Not](#Notmd) +- [OneHot](#OneHotmd) +- [Or](#Ormd) +- [RandomNormalLike](#RandomNormalLikemd) +- [RandomUniformLike](#RandomUniformLikemd) +- [RandomUniform](#RandomUniformmd) +- [Range](#Rangemd) +- [Reciprocal](#Reciprocalmd) +- [ReduceL1](#ReduceL1md) +- [ReduceL2](#ReduceL2md) +- [ReduceLogSum](#ReduceLogSummd) +- [ReduceLogSumExp](#ReduceLogSumExpmd) +- [ReduceMin](#ReduceMinmd) +- [ReduceMean](#ReduceMeanmd) +- [ReduceProd](#ReduceProdmd) +- [ReduceSumSquare](#ReduceSumSquaremd) +- [Resize](#Resizemd) +- [Relu](#Relumd) +- [ReduceSum](#ReduceSummd) +- [ReduceMax](#ReduceMaxmd) +- [Reshape](#Reshapemd) +- [ReverseSequence](#ReverseSequencemd) +- [RoiExtractor](#RoiExtractormd) +- [RoiAlign](#RoiAlignmd) +- [Round](#Roundmd) +- [PRelu](#PRelumd) +- [Scatter](#Scattermd) +- [ScatterElements](#ScatterElementsmd) +- [ScatterND](#ScatterNDmd) +- [Shrink](#Shrinkmd) +- [Selu](#Selumd) +- [Shape](#Shapemd) +- [Sigmoid](#Sigmoidmd) +- [Slice](#Slicemd) +- [Softmax](#Softmaxmd) +- [Softsign](#Softsignmd) +- [Softplus](#Softplusmd) +- [SpaceToDepth](#SpaceToDepthmd) +- [Split](#Splitmd) +- [Sqrt](#Sqrtmd) +- [Squeeze](#Squeezemd) +- [Sub](#Submd) +- [Sign](#Signmd) +- [Sin](#Sinmd) +- [Sinh](#Sinhmd) +- [Size](#Sizemd) +- [Sum](#Summd) +- [Tanh](#Tanhmd) +- [TfIdfVectorizer](#TfIdfVectorizermd) +- [Tile](#Tilemd) +- [ThresholdedRelu](#ThresholdedRelumd) +- [TopK](#TopKmd) +- [Transpose](#Transposemd) +- [Pad](#Padmd) +- [Pow](#Powmd) +- [Unsqueeze](#Unsqueezemd) +- [Xor](#Xormd) +- [Where](#Wheremd) +

+<h2 id="Absmd">Abs</h2>
+ +### Description Computes the absolute value of a tensor. -### Parameters +### Parameters -\[Inputs\] +[Inputs] One input -x: tensor. Must be one of the following types: float16, float32, double, int32, int64. +x: tensor of type float16, float32, double, int32, or int64. -\[Outputs\] +[Outputs] One output -y: tensor. Has the identical data type and shape as the input. +y: tensor. Has an identical data type and shape to the input. -### ONNX Opset Support +### ONNX Opset Support Opset v8/v9/v10/v11/v12/v13 -
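+A minimal sketch of exercising an operator from this list with the official onnx Python package (the graph and tensor names here are illustrative):
+
+```python
+import onnx
+from onnx import TensorProto, helper
+
+# Build a one-node graph: y = Abs(x).
+node = helper.make_node('Abs', inputs=['x'], outputs=['y'])
+graph = helper.make_graph(
+    [node], 'abs_graph',
+    inputs=[helper.make_tensor_value_info('x', TensorProto.FLOAT, [1, 3, 32, 32])],
+    outputs=[helper.make_tensor_value_info('y', TensorProto.FLOAT, [1, 3, 32, 32])],
+)
+
+# Opset 11 falls within the supported v8-v13 range listed above.
+model = helper.make_model(graph, opset_imports=[helper.make_opsetid('', 11)])
+onnx.checker.check_model(model)
+onnx.save(model, 'abs.onnx')
+```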

-<h2 id="acosmd">Acos</h2>
+

+<h2 id="Acosmd">Acos</h2>
-### Description +### Description Computes acos of the input element-wise. -### Parameters +### Parameters -\[Inputs\] +[Inputs] One input x: tensor of type float16, float32, or double. -\[Outputs\] +[Outputs] One output -y: tensor. Has the identical data type and shape as the input. +y: tensor. Has an identical data type and shape to the input. -### ONNX Opset Support +### ONNX Opset Support Opset v8/v9/v10/v11/v12/v13 -

-<h2 id="acoshmd">Acosh</h2>
+

+<h2 id="Acoshmd">Acosh</h2>
-### Description +### Description -Computes inverse hyperbolic cosine of x element-wise. +Computes inverse hyperbolic cosine of input element-wise. -### Parameters +### Parameters -\[Inputs\] +[Inputs] One input x: tensor of type float16, float32, or double. -\[Outputs\] +[Outputs] One output -y: tensor. Has the identical data type and shape as the input. +y: tensor. Has an identical data type and shape to the input. -### ONNX Opset Support +### ONNX Opset Support Opset v9/v10/v11/v12/v13 -

-<h2 id="adaptiveavgpool2dmd">AdaptiveAvgPool2D</h2>
+

+<h2 id="AdaptiveAvgPool2Dmd">AdaptiveAvgPool2D</h2>
-### Description +### Description Applies a 2D adaptive avg pooling over the input. -### Parameters +### Parameters -\[Inputs\] +[Inputs] One input x: tensor of type float16 or float32. -\[Attributes\] +[Attributes] One attribute -output\_size: array of ints, specifying the output H and W dimension sizes. +output\_size: array of ints, specifying the output H and W shape sizes. -\[Outputs\] +[Outputs] One output y: tensor of the identical data type as x. -### ONNX Opset Support +### ONNX Opset Support -No ONNX support for this custom operator +No ONNX support for this custom operator. -

-<h2 id="adaptivemaxpool2dmd">AdaptiveMaxPool2D</h2>
+

+<h2 id="AdaptiveMaxPool2Dmd">AdaptiveMaxPool2D</h2>
-### Description +### Description Applies a 2D adaptive max pooling over the input. -### Parameters +### Parameters -\[Inputs\] +[Inputs] One input x: tensor of type float16, float32, or float64. -\[Attributes\] +[Attributes] One attribute -output\_size: array of ints, specifying the output H and W dimension sizes. +output\_size: array of ints, specifying the output H and W shape sizes. -\[Outputs\] +[Outputs] Two outputs @@ -278,43 +278,43 @@ y: tensor of the identical data type as x. argmax: tensor of type int32 or int64. -### ONNX Opset Support +### ONNX Opset Support -No ONNX support for this custom operator +No ONNX support for this custom operator. -

-<h2 id="addmd">Add</h2>
+

+<h2 id="Addmd">Add</h2>
-### Description +### Description Adds inputs element-wise. -### Parameters +### Parameters -\[Inputs\] +[Inputs] Two inputs A: tensor. Must be one of the following types: int8, int16, int32, int64, uint8, float32, float16, double. -B: tensor of the identical data type as A. +B: tensor. Has an identical data type to that of A. -\[Outputs\] +[Outputs] -C: tensor of the identical data type as A. +C: tensor. Has an identical data type to that of A. -### ONNX Opset Support +### ONNX Opset Support Opset v8/v9/v10/v11/v12/v13 -

-<h2 id="addcmulmd">Addcmul</h2>
+

+<h2 id="Addcmulmd">Addcmul</h2>
-### Description +### Description Performs element-wise computation: \(x1 \* x2\) \* value + input\_data -### Parameters +### Parameters -\[Inputs\] +[Inputs] Four inputs @@ -326,57 +326,57 @@ x2: tensor of the identical data type as input\_data value: tensor of the identical data type as input\_data -\[Outputs\] +[Outputs] One output y: tensor of the identical data type as the inputs. -### ONNX Opset Support +### ONNX Opset Support -No ONNX support for this custom operator +No ONNX support for this custom operator. -

-<h2 id="affinegridmd">AffineGrid</h2>
+

+<h2 id="AffineGridmd">AffineGrid</h2>
-### Description +### Description Generates a sampling grid with given matrices. -### Parameters +### Parameters -\[Inputs\] +[Inputs] Two inputs theta: tensor of type float16 or float32. -output\_size: tensor of type int32. +output\_size: tensor of type int32 -\[Attributes\] +[Attributes] One attribute align\_corners: bool -\[Outputs\] +[Outputs] One output y: tensor of type int. -### ONNX Opset Support +### ONNX Opset Support -No ONNX support for this custom operator +No ONNX support for this custom operator. -

-<h2 id="andmd">And</h2>
+

+<h2 id="Andmd">And</h2>
-### Description +### Description -Returns the tensor resulted from performing the and logical operation element-wise on the input tensors. +Returns the tensor resulted from performing the AND logical operation element-wise on the input tensors. -### Parameters +### Parameters -\[Inputs\] +[Inputs] Two inputs @@ -384,91 +384,91 @@ x1: tensor of type bool. x2: tensor of type bool. -\[Outputs\] +[Outputs] One output -y: tensor of the identical data type and shape as input x. +y: tensor of the same data type and shape as input x. -### ONNX Opset Support +### ONNX Opset Support Opset v8/v9/v10/v11/v12/v13 -

-<h2 id="argmaxmd">Argmax</h2>
+

+<h2 id="Argmaxmd">Argmax</h2>
-### Description +### Description Returns the indices of the maximum elements along the provided axis. -### Parameters +### Parameters -\[Inputs\] +[Inputs] One input x: tensor of type float16 or float32. -\[Outputs\] +[Outputs] One output -y: tensor of type int32, the indexes. Has the same shape as x with the dimension along axis removed. +y: tensor of type int32, the indices. Has the same shape as x with the dimension along axis removed. -\[Attributes\] +[Attributes] -axis: \(required\) int32, axis in which to compute the arg indices. Accepted range is \[–len\(x.shape\), len\(x.shape\) – 1\]. +axis: (required) int32, axis in which to compute the arg indices. Accepted range is \[-len\(x.shape\), len\(x.shape\)-1\]. -keep\_dim: \(optional\) either 1 \(default\) or 0. +keep\_dim: (optional) either 1 (default) or 0. -\[Restrictions\] +[Restrictions] -The operator does not support inputs of type float32 when the atc command-line option **--precision\_mode** is set to **must\_keep\_origin\_dtype**. +The operator does not support inputs of type float32 when the atc command-line option --precision_mode is set to must_keep_origin_dtype. -### ONNX Opset Support +### ONNX Opset Support Opset v8/v9/v10/v11/v12/v13 -

-<h2 id="argminmd">Argmin</h2>
+

+<h2 id="Argminmd">Argmin</h2>
-### Description +### Description Returns the indices of the minimum values along an axis. -### Parameters +### Parameters -\[Inputs\] +[Inputs] One input x: tensor of type float16 or float32. -\[Outputs\] +[Outputs] One output y: tensor of type int64. -\[Attributes\] +[Attributes] -axis: int. Must be in the range \[–r, r – 1\], where r indicates the rank of the input. +axis: int. Must be in the range [–r, r – 1], where r indicates the rank of the input. -\[Restrictions\] +[Restrictions] -The operator does not support inputs of type float32 when the atc command-line option **--precision\_mode** is set to **must\_keep\_origin\_dtype**. +The operator does not support inputs of type float32 when the atc command-line option --precision_mode is set to must_keep_origin_dtype. -### ONNX Opset Support +### ONNX Opset Support Opset v8/v9/v10/v11/v12/v13 -

-<h2 id="ascendrequants16md">AscendRequantS16</h2>
+

+<h2 id="AscendRequantS16md">AscendRequantS16</h2>
-### Description +### Description Performs requantization. -### Parameters +### Parameters -\[Inputs\] +[Inputs] Two required inputs and one optional input @@ -478,7 +478,7 @@ req\_scale: tensor of type uint64. x1: tensor of type int16. -\[Attributes\] +[Attributes] Two attributes @@ -486,7 +486,7 @@ dual\_output: bool relu\_flag: bool -\[Outputs\] +[Outputs] Two outputs @@ -494,19 +494,19 @@ y0: tensor of type int8. y1: tensor of type int16. -### ONNX Opset Support +### ONNX Opset Support -No ONNX support for this custom operator +No ONNX support for this custom operator. -

-<h2 id="ascendrequantmd">AscendRequant</h2>
+

+<h2 id="AscendRequantmd">AscendRequant</h2>
 
 ### Description
 
 Performs requantization.
 
 ### Parameters
 
-\[Inputs\]
+[Inputs]
 
 Two inputs
 
 x0: tensor of type int32.
 
 req\_scale: tensor of type uint64.
 
-\[Attributes\]
+[Attributes]
 
 One attribute
 
 relu\_flag: bool
 
-\[Outputs\]
+[Outputs]
 
 One output
 
 y: tensor of type int8.
 
 ### ONNX Opset Support
 
-No ONNX support for this custom operator
+No ONNX support for this custom operator.
 

-<h2 id="ascendquantmd">AscendQuant</h2>
+

+<h2 id="AscendQuantmd">AscendQuant</h2>
 
 ### Description
 
 Performs quantization.
 
 ### Parameters
 
-\[Inputs\]
+[Inputs]
 
 One input
 
 x: tensor of type float16 or float32.
 
-\[Attributes\]
+[Attributes]
 
 Four attributes
 
 offset: float
 
 scale: float
 
 sqrt\_mode: bool
 
 round\_mode: string
 
-\[Outputs\]
+[Outputs]
 
 One output
 
 y: tensor of type int8.
 
 ### ONNX Opset Support
 
-No ONNX support for this custom operator
+No ONNX support for this custom operator.
 

-<h2 id="ascenddequants16md">AscendDequantS16</h2>
+

+<h2 id="AscendDequantS16md">AscendDequantS16</h2>
 
 ### Description
 
 Performs dequantization.
 
 ### Parameters
 
-\[Inputs\]
+[Inputs]
 
 Two required inputs and one optional input
 
 x0: tensor of type int32.
 
 req\_scale: tensor of type uint64.
 
 x1: tensor of type int16.
 
-\[Attributes\]
+[Attributes]
 
 One attribute
 
 relu\_flag: bool
 
-\[Outputs\]
+[Outputs]
 
 One output
 
 y: tensor of type int16.
 
 ### ONNX Opset Support
 
-No ONNX support for this custom operator
+No ONNX support for this custom operator.
 

-<h2 id="ascenddequantmd">AscendDequant</h2>
+

+<h2 id="AscendDequantmd">AscendDequant</h2>
 
 ### Description
 
 Performs dequantization.
 
 ### Parameters
 
-\[Inputs\]
+[Inputs]
 
 Two inputs
 
 x0: tensor of type int32.
 
 deq\_scale: tensor of type uint64 or float16.
 
-\[Attributes\]
+[Attributes]
 
 sqrt\_mode: bool
 
 relu\_flag: bool
 
 dtype: float
 
-\[Outputs\]
+[Outputs]
 
 One output
 
 y: tensor of type float16 or float.
 
 ### ONNX Opset Support
 
-No ONNX support for this custom operator
+No ONNX support for this custom operator.
 

-<h2 id="ascendantiquantmd">AscendAntiQuant</h2>
+

+<h2 id="AscendAntiQuantmd">AscendAntiQuant</h2>
-### Description +### Description Performs dequantization. -### Parameters +### Parameters -\[Inputs\] +[Inputs] One input x: tensor of type int8. -\[Attributes\] +[Attributes] offset: float @@ -658,185 +658,185 @@ sqrt\_mode: bool round\_mode: string -\[Outputs\] +[Outputs] One output y: tensor of type float16 or float. -### ONNX Opset Support +### ONNX Opset Support -No ONNX support for this custom operator +No ONNX support for this custom operator. -

## Asin

### Description

Computes the trigonometric inverse sine of the input element-wise.

### Parameters

[Inputs]

One input

x1: tensor of type float16, float32, or double.

[Outputs]

One output

y: tensor. Has the same data type and shape as the input.

### ONNX Opset Support

Opset v8/v9/v10/v11/v12/v13

## Asinh

### Description

Computes the inverse hyperbolic sine of the input element-wise.

### Parameters

[Inputs]

One input

x: tensor of type float16, float32, or double.

[Outputs]

y: tensor. Has the same data type and shape as the input.

### ONNX Opset Support

Opset v9/v10/v11/v12/v13

## Atan

### Description

Computes the trigonometric inverse tangent of the input element-wise.

### Parameters

[Inputs]

One input

x: tensor of type float16, float32, or double.

[Outputs]

One output

y: tensor. Has the same data type and shape as the input.

### ONNX Opset Support

Opset v8/v9/v10/v11/v12/v13

## Atanh

### Description

Computes the inverse hyperbolic tangent of the input element-wise.

### Parameters

[Inputs]

One input

x: tensor of type float16, float32, or double.

[Outputs]

One output

y: tensor. Has the same data type and shape as the input.

### ONNX Opset Support

Opset v9/v10/v11/v12/v13

## AveragePool

### Description

Performs average pooling.

### Parameters

[Inputs]

X: tensor of type float16 or float32, in NCHW format.

[Outputs]

Y: tensor of type float16 or float32, in NCHW format.

[Attributes]

auto_pad: (optional) selected from NOTSET, SAME_UPPER, SAME_LOWER, and VALID.

count_include_pad: int, not supported currently.

kernel_shape: (optional)

- kernel_shape[0]: int32, the kernel height. Must be in the range [1, 32768]. Defaults to 1.
- kernel_shape[1]: int32, the kernel width. Must be in the range [1, 32768]. Defaults to 1.

strides: (optional)

- strides[0]: int32, the stride height. Defaults to 1.
- strides[1]: int32, the stride width. Defaults to 1.

pads: (optional)

- pads[0]: int32, top padding. Defaults to 0.
- pads[1]: int32, bottom padding. Defaults to 0.
- pads[2]: int32, left padding. Defaults to 0.
- pads[3]: int32, right padding. Defaults to 0.

ceil_mode: (optional) int32, either 0 (floor mode) or 1 (ceil mode). Defaults to 0.

[Restrictions]

When strides[0] or strides[1] is greater than 63, computation is performed on AI CPU, which will compromise performance.

When the value of kernel_shape_H or kernel_shape_W is beyond the range [1, 255] or kernel_shape_H * kernel_shape_W > 256, computation is performed on AI CPU, which will compromise performance.

input_w must be in the range [1, 4096].

When N of the input tensor is a prime number, N must be less than 65535.

ceil_mode is valid only when auto_pad is set to NOTSET.

The operator does not support inputs of type float32 when the atc command-line option --precision_mode is set to must_keep_origin_dtype.

Note that both the SAME_UPPER and SAME_LOWER values of auto_pad are functionally the same as the SAME argument of built-in TBE operators. This attribute configuration may lead to an accuracy drop because the SAME argument is position-insensitive.

### ONNX Opset Support

Opset v8/v9/v10/v11/v12/v13
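To make the attribute layout concrete, here is a minimal sketch of an AveragePool node built with the ONNX Python helper. The tensor names X and Y and the attribute values are illustrative only; they are chosen to stay inside the ranges documented above.

```python
import onnx
from onnx import helper

# Illustrative node definition; kernel and stride values are kept inside
# the documented AI Core-friendly ranges (kernel H*W <= 256, strides <= 63).
node = helper.make_node(
    "AveragePool",
    inputs=["X"],          # NCHW tensor of type float16 or float32
    outputs=["Y"],
    kernel_shape=[3, 3],
    strides=[2, 2],
    pads=[1, 1, 1, 1],     # top, bottom, left, right
    ceil_mode=0,           # 0 = floor mode (the default)
)
print(helper.printable_node(node))
```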

## BatchNormalization

### Description

Normalizes the inputs.

### Parameters

[Inputs]

Five inputs

x: input tensor to be normalized.

scale: tensor of type float32, specifying the scale factor.

b: tensor of type float32, specifying the offset.

mean: tensor of type float32, specifying the mean value.

var: tensor of type float32, specifying the variance value.

[Outputs]

Five outputs

y: normalized output tensor.

mean: mean value.

var: variance value.

saved_mean: saved mean value, used to accelerate gradient calculation during training.

saved_var: saved variance value, used to accelerate gradient calculation during training.

[Attributes]

epsilon: (optional) float32, added to var to avoid dividing by zero. Defaults to 0.0001.

momentum: float32, not supported currently.

### ONNX Opset Support

Opset v8/v9/v10/v11/v12/v13

## BatchMatMul

### Description

Multiplies slices of two tensors in batches.

### Parameters

[Inputs]

Two inputs

x1: tensor of type float16, float, or int32.

x2: tensor of type float16, float, or int32.

[Attributes]

Two attributes

adj_x1: bool

adj_x2: bool

[Outputs]

One output

y: tensor of type float16, float, or int32.

### ONNX Opset Support

No ONNX support for this custom operator.

## BatchMultiClassNMS

### Description

Applies non-maximum suppression (NMS) on input boxes and input scores.

### Parameters

[Inputs]

Two required inputs and two optional inputs

boxes: tensor of type float16

scores: tensor of type float16

clip_window: tensor of type float16

num_valid_boxes: tensor of type int32

[Attributes]

Six attributes

score_threshold: float

iou_threshold: float

max_size_per_class: int

max_total_size: int

change_coordinate_frame: bool

transpose_box: bool

[Outputs]

Four outputs

nmsed_boxes: tensor of type float16

nmsed_scores: tensor of type float16

nmsed_classes: tensor of type float16

nmsed_num: tensor of type float16

### ONNX Opset Support

No ONNX support for this custom operator.

## BitShift

### Description

Performs element-wise bit shifting.

### Parameters

[Inputs]

Two inputs

x: tensor, indicating the input to be shifted.

y: tensor, indicating the amounts of shift.

[Outputs]

z: shifted tensor.

[Attributes]

direction: (required) string, indicating the direction of moving bits. Either RIGHT or LEFT.

[Restrictions]

When direction is set to LEFT, the inputs must not be of type UINT16, UINT32, or UINT64.

### ONNX Opset Support

Opset v11/v12/v13

## Cast

### Description

Casts a tensor to a new type.

### Parameters

[Inputs]

One input

x: tensor

[Outputs]

y: tensor of the data type specified by the attribute. Must be one of the following types: bool, float16, float32, int8, int32, uint8.

[Attributes]

to: (required) int, the destination type.

### ONNX Opset Support

Opset v8/v9/v10/v11/v12/v13
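Because the to attribute is a raw integer enum, it is easiest to pass it via onnx.TensorProto. A minimal sketch (the tensor names are illustrative):

```python
import onnx
from onnx import helper, TensorProto

# TensorProto.INT32 == 6 and TensorProto.FLOAT == 1; "to" carries the enum value.
node = helper.make_node("Cast", inputs=["x"], outputs=["y"], to=TensorProto.INT32)
print(helper.printable_node(node))
```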

## Ceil

### Description

Returns the ceiling of the input, element-wise.

### Parameters

[Inputs]

One input

x: tensor of type float16, float32, or double.

[Outputs]

One output

y: tensor. Has the same data type and shape as the input.

### ONNX Opset Support

Opset v8/v9/v10/v11/v12/v13

## Celu

### Description

Continuously Differentiable Exponential Linear Units (CELUs): performs the linear unit element-wise on the input tensor X using the formula:

max(0, x) + min(0, alpha * (exp(x / alpha) – 1))

### Parameters

[Inputs]

X: tensor of type float.

[Outputs]

Y: tensor of type float.

[Attributes]

alpha: float. Defaults to 1.0.

### ONNX Opset Support

Opset v12/v13
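The formula is easy to check numerically. A small NumPy sketch of the element-wise computation (an illustration, not the operator itself):

```python
import numpy as np

def celu(x, alpha=1.0):
    # max(0, x) + min(0, alpha * (exp(x / alpha) - 1))
    return np.maximum(0.0, x) + np.minimum(0.0, alpha * (np.exp(x / alpha) - 1.0))

x = np.array([-2.0, -0.5, 0.0, 1.5], dtype=np.float32)
print(celu(x))             # negative inputs saturate smoothly toward -alpha
print(celu(x, alpha=2.0))  # alpha rescales the negative branch
```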

## Concat

### Description

Concatenates multiple inputs.

### Parameters

[Inputs]

inputs: tensors. Must be one of the following data types: float16, float32, int32, uint8, int16, int8, int64, qint8, quint8, qint32, uint16, uint32, uint64, qint16, quint16.

[Outputs]

concat_result: tensor of the same data type as inputs.

[Attributes]

axis: the axis along which to concatenate. May be negative to index from the end. Must be in the range [–r, r – 1], where r = rank(inputs).

### ONNX Opset Support

Opset v8/v9/v10/v11/v12/v13

## Clip

### Description

Clips tensor values to a specified min and max.

### Parameters

[Inputs]

Three inputs

X: input tensor whose elements are to be clipped.

min: must be a scalar.

max: must be a scalar.

[Outputs]

One output

Y: output tensor with clipped input elements. Has the same shape and data type as the input.

### ONNX Opset Support

Opset v8/v9/v10/v11/v12/v13

## ConvTranspose

### Description

Computes a transposed convolution.

### Parameters

[Inputs]

Three inputs

x: tensor of type float16 or float32.

w: tensor of type float16 or float32.

b: (optional) tensor of type float16 or float32.

[Outputs]

One output

y: tensor of the same data type and shape as input x.

[Attributes]

auto_pad: string. Defaults to NOTSET, which means explicit padding is used.

dilations: ints. Dilation value along each spatial axis of the filter. Defaults to an all-1 array.

group: int. Number of groups input channels and output channels are divided into.

kernel_shape: ints. The shape of the convolution kernel. Defaults to w.

output_padding: ints. Additional elements added to the side with higher coordinate indices in the output. Defaults to an all-0 array.

output_shape: ints. The shape of the output can be explicitly set, which will cause the pads values to be auto-generated.

pads: ints. Padding for the beginning and ending along each spatial axis. Defaults to an all-0 matrix.

strides: ints. Stride along each spatial axis. Defaults to an all-1 matrix.

[Restrictions]

Currently, only 2D transposed convolution is supported. 3D and higher are not supported.

dilations can only be 1.

Currently, output_shape can be used to specify the output shape size, but the specified size must not be greater than the input size.

The operator does not support inputs of type float32 or float64 when the atc command-line option --precision_mode is set to must_keep_origin_dtype.

The auto_pad attribute must not be SAME_UPPER or SAME_LOWER.

### ONNX Opset Support

Opset v8/v9/v10/v11/v12/v13
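A minimal sketch of a node that fixes output_shape and lets the pads be derived automatically (the tensor names and sizes are illustrative only):

```python
import onnx
from onnx import helper

# output_shape is set explicitly, so pads are auto-generated; dilations are
# left at the default all-1 value, as the restrictions above require.
node = helper.make_node(
    "ConvTranspose",
    inputs=["x", "w"],     # an optional bias "b" could be appended
    outputs=["y"],
    kernel_shape=[3, 3],
    strides=[2, 2],
    output_shape=[8, 8],   # subject to the size restriction described above
)
print(helper.printable_node(node))
```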

## Cumsum

### Description

Performs a cumulative sum of the input elements along the given axis.

### Parameters

[Inputs]

Two inputs

x: tensor of type float16, float32, or int32.

axis: scalar of type int32 or int64. Defaults to 0. Must be in the range [–rank(x), rank(x) – 1].

[Outputs]

One output

y: tensor of the same data type as input x.

[Attributes]

exclusive: int. Whether to return an exclusive sum in which the top element is not included. Defaults to 0.

reverse: int. Whether to perform the sums in the reverse direction. Defaults to 0.

### ONNX Opset Support

Opset v8/v9/v10/v11/v12/v13
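The exclusive and reverse attributes are easiest to see on a small vector. A NumPy sketch of the three variants (an illustration, not the operator itself):

```python
import numpy as np

x = np.array([1, 2, 3, 4], dtype=np.int32)

print(np.cumsum(x))                              # default:      [ 1  3  6 10]
print(np.concatenate(([0], np.cumsum(x)[:-1])))  # exclusive=1:  [ 0  1  3  6]
print(np.cumsum(x[::-1])[::-1])                  # reverse=1:    [10  9  7  4]
```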

## Conv

### Description

Computes convolution.

### Parameters

[Inputs]

X: 4D tensor

W: tensor for the weight

B: (optional) 1D tensor for the bias

[Outputs]

Y: tensor for the convolution output

[Attributes]

auto_pad: (optional) either VALID or NOTSET.

dilations: list of four integers, specifying the dilation rate. The value range for the H and W dimensions is [1, 255].

group: int32. The input and output channels are separated into groups, and the output group channels will be only connected to the input group channels. Both the input and output channels must be divisible by group. Must be 1.

pads: list of four integers, specifying the number of pixels to add to each side of the input. Must be in the range [0, 255].

strides: list of four integers, specifying the strides of the convolution along the H and W dimensions. The value range for the H and W dimensions is [1, 63]. By default, the N and C dimensions are set to 1.

[Restrictions]

For input X, the value range for the W dimension is [1, 4096].

For the weight tensor, the value range for the H and W dimensions is [1, 255].

When W and H of the output tensor are both 1, inputs X and W must have the same H and W dimensions.

The operator is not supported if the output Y meets: W = 1, H != 1.

The operator does not support inputs of type float32 or float64 when the atc command-line option --precision_mode is set to must_keep_origin_dtype.

### ONNX Opset Support

Opset v9/v10/v11/v12/v13

## Compress

### Description

Slices data based on the specified axis.

### Parameters

[Inputs]

Two inputs

input: tensor with one or more dimensions. Supported types include uint8 and uint16.

condition: 1-dimensional tensor, used to specify the slices and elements to be selected. The supported type is bool.

[Outputs]

One output

output: tensor of the same type as the input

[Attributes]

axis: (optional) int, axis for slicing. If no axis is specified, the input tensor is flattened before slicing. The value range is [–r, r – 1], where r indicates the number of dimensions of the input tensor.

### ONNX Opset Support

Opset v9/v11/v12/v13

## Constant

### Description

Creates a constant tensor.

### Parameters

[Inputs]

None

[Outputs]

One output

Y: output tensor containing the value of the provided tensor.

[Attributes]

value: the value for the elements of the output tensor.

[Restrictions]

sparse_value: not supported

### ONNX Opset Support

Opset v8/v9/v10/v11/v12/v13

## ConstantOfShape

### Description

Generates a tensor with a given value and shape.

### Parameters

[Inputs]

x: 1D tensor of type int64, the shape of the output tensor. All values must be greater than 0.

[Outputs]

y: output tensor of the shape specified by the input. If value is specified, the value and data type of the output tensor are taken from value. If value is not specified, the value in the output defaults to 0, and the data type defaults to float32.

[Attributes]

value: the value and data type of the output elements.

[Restrictions]

x: 1 <= len(shape) <= 8

### ONNX Opset Support

Opset v9/v10/v11/v12/v13
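The value attribute is a one-element tensor that fixes both the fill value and the output data type. A minimal sketch with the ONNX helper (the names are illustrative):

```python
import onnx
from onnx import helper, TensorProto

# A one-element tensor supplies both the fill value and the output dtype;
# omitting "value" yields float32 zeros.
value = helper.make_tensor("value", TensorProto.FLOAT, dims=[1], vals=[3.14])
node = helper.make_node("ConstantOfShape", inputs=["x"], outputs=["y"], value=value)
print(helper.printable_node(node))
```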

## Cos

### Description

Computes the cosine of the input element-wise.

### Parameters

[Inputs]

One input

x: tensor of type float16, float32, or double.

[Outputs]

One output

y: tensor. Has the same data type and shape as the input.

### ONNX Opset Support

Opset v8/v9/v10/v11/v12/v13

## Cosh

### Description

Computes the hyperbolic cosine of the input element-wise.

### Parameters

[Inputs]

One input

x1: tensor of type float16, float, or double.

[Outputs]

One output

y: tensor. Has the same data type and shape as the input.

### ONNX Opset Support

Opset v8/v9/v10/v11/v12/v13

## DeformableConv2D

### Description

Performs deformable convolution.

### Parameters

[Inputs]

X: 4D tensor

filter: weight tensor

offsets: 4D tensor for the offset

bias: (optional) 1D tensor for the bias

[Outputs]

Y: deformed tensor

[Attributes]

auto_pad: (optional) either VALID or NOTSET.

dilations: list of four integers, specifying the dilation rate. The value range for the H and W dimensions is [1, 255].

group: int32. The input and output channels are separated into groups, and the output group channels will be only connected to the input group channels. Both the input and output channels must be divisible by group. Must be 1.

pads: list of four integers, specifying the number of pixels to add to each side of the input. Must be in the range [0, 255].

strides: list of four integers, specifying the strides of the convolution along the H and W dimensions. The value range for the H and W dimensions is [1, 63]. By default, the N and C dimensions are set to 1.

data_format: string, specifying the format of the input data. Defaults to NHWC.

deformable_groups: number of deformable group partitions. Defaults to 1.

modulated: bool, specifying the DeformableConv2D version. Set to true to use v2; set to false to use v1. Currently, only true (v2) is supported.

[Restrictions]

For the input tensor X, the expected range of the W dimension is [1, 4096/filter_width] and the expected range of the H dimension is [1, 100000/filter_height].

For the weight tensor, the expected range of both the W and H dimensions is [1, 63].

The operator does not support inputs of type float32 or float64 when the atc command-line option --precision_mode is set to must_keep_origin_dtype.

### ONNX Opset Support

No ONNX support for this custom operator.

## Det

### Description

Calculates the determinant of a square matrix or batches of square matrices.

### Parameters

[Inputs]

One input

x: tensor of type float16 or float32.

[Outputs]

One output

y: tensor of the same data type and shape as input x.

### ONNX Opset Support

Opset v8/v9/v10/v11/v12/v13

## DepthToSpace

### Description

Rearranges (permutes) data from depth into blocks of spatial data.

### Parameters

[Inputs]

One input

input: input tensor in NCHW format. Must be one of the following types: float16, float32, double, int32, int64.

[Outputs]

One output

output: tensor with shape [N, C/(blocksize * blocksize), H * blocksize, W * blocksize]

[Attributes]

blocksize: (required) int, blocks to be moved.

mode: string, either DCR (default) for depth-column-row order re-arrangement or CRD for column-row-depth order re-arrangement.

### ONNX Opset Support

Opset v8/v9/v10/v11/v12/v13
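The documented output shape follows from a reshape/transpose pair. A NumPy sketch of the DCR mode (a hand-rolled illustration, not the operator itself):

```python
import numpy as np

def depth_to_space_dcr(x, b):
    # [N, C, H, W] -> [N, C/(b*b), H*b, W*b], depth-column-row ordering
    n, c, h, w = x.shape
    y = x.reshape(n, b, b, c // (b * b), h, w)
    y = y.transpose(0, 3, 4, 1, 5, 2)
    return y.reshape(n, c // (b * b), h * b, w * b)

x = np.arange(2 * 8 * 3 * 3, dtype=np.float32).reshape(2, 8, 3, 3)
print(depth_to_space_dcr(x, 2).shape)   # (2, 2, 6, 6)
```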

## Div

### Description

Performs element-wise division.

### Parameters

[Inputs]

Two inputs

x1: tensor of type float16, float32, double, int32, or int64.

x2: tensor of type float16, float32, double, int32, or int64.

[Outputs]

One output

y: tensor of the same data type as the inputs.

[Restrictions]

The output has the same data type as the inputs.

### ONNX Opset Support

Opset v8/v9/v10/v11/v12/v13

## Dropout

### Description

Copies or masks the input tensor.

### Parameters

[Inputs]

One to three inputs

data: input tensor of type float16, float32, or double.

ratio: (optional) float16, float32, or double.

training_mode: (optional) bool

[Outputs]

One to two outputs

output: tensor

mask: tensor

### ONNX Opset Support

Opset v8/v9/v10/v11/v12/v13

## Elu

### Description

Computes the exponential linear function.

### Parameters

[Inputs]

One input

x: tensor of type float16 or float32.

[Outputs]

One output

y: tensor of the same data type and shape as input x.

[Attributes]

alpha: float, indicating the coefficient. Defaults to 1.0.

### ONNX Opset Support

Opset v8/v9/v10/v11/v12/v13

## EmbeddingBag

### Description

Computes sums, means, or maxes of bags of embeddings.

### Parameters

[Inputs]

Two required inputs and two optional inputs

weight: tensor of type float32.

indices: tensor of type int32.

offset: tensor of type int32.

per_sample_weights: tensor of type float32.

[Attributes]

Four attributes

mode: string

scale_grad_by_fraq: bool

sparse: bool

include_last_offset: bool

[Outputs]

One output

y: tensor of type float32.

### ONNX Opset Support

No ONNX support for this custom operator.

## Equal

### Description

Returns the truth value of (X1 == X2) element-wise.

### Parameters

[Inputs]

Two inputs

X1: tensor

X2: tensor

[Outputs]

One output

y: tensor of type bool.

[Restrictions]

X1 and X2 must have the same format and data type. The following data types are supported: bool, uint8, int8, int16, int32, int64, float16, float32, and double.

### ONNX Opset Support

Opset v8/v9/v10/v11/v12/v13

## Erf

### Description

Computes the Gauss error function of x element-wise.

### Parameters

[Inputs]

One input

x: tensor of type float16 or float32.

[Outputs]

One output

y: tensor. Has the same data type and format as the input.

### ONNX Opset Support

Opset v9/v10/v11/v12/v13

## Exp

### Description

Computes the exponential of the input element-wise.

### Parameters

[Inputs]

One input

x: tensor of type float16 or float32.

[Outputs]

One output

y: tensor. Has the same data type and shape as the input.

### ONNX Opset Support

Opset v8/v9/v10/v11/v12/v13

## Expand

### Description

Broadcasts the input tensor following the given shape and the broadcast rule.

### Parameters

[Inputs]

Two inputs

input: tensor of type float16 or float32.

shape: tensor of type int64.

[Outputs]

One output

y: tensor of the same data type as the input, with the given (broadcast) shape.

[Restrictions]

The shape input must be changed from a placeholder to a constant. You can use ONNX Simplifier to simplify your model.

### ONNX Opset Support

Opset v8/v9/v10/v11/v12/v13

## EyeLike

### Description

Generates a 2D tensor (matrix) with ones on the diagonal and zeros everywhere else.

### Parameters

[Inputs]

One input

x: 2D tensor, to be copied.

[Outputs]

One output

y: tensor of the same shape as input x.

[Attributes]

dtype: int, specifying the data type of the output.

k: int, specifying the index of the diagonal to be populated with ones. Defaults to 0. If y is output, y[i, i+k] = 1.

[Restrictions]

k must be 0.

### ONNX Opset Support

Opset v8/v9/v10/v11/v12/v13

## Flatten

### Description

Flattens the input.

### Parameters

[Inputs]

input: ND tensor. Must be one of the following data types: int8, uint8, int16, uint16, int32, uint32, int64, uint64, float16, float32.

[Outputs]

2D tensor with the content of the input tensor.

[Attributes]

axis: int. Must be positive.

### ONNX Opset Support

Opset v8/v9/v10/v11/v12/v13

## Floor

### Description

Returns the element-wise largest integer not greater than x.

### Parameters

[Inputs]

One input

x: tensor of type float16, float32, or double.

[Outputs]

One output

y: tensor. Has the same data type and shape as the input.

### ONNX Opset Support

Opset v8/v9/v10/v11/v12/v13

## Gather

### Description

Gathers slices from the input according to indices.

### Parameters

[Inputs]

Two inputs

x1: tensor of type float16, float32, int32, int64, int8, int16, uint8, uint16, uint32, or uint64.

indices: tensor of type int32 or int64.

[Outputs]

One output

y: tensor of the same data type as input x1.

[Attributes]

axis: int, the axis in x1 to gather indices from. Must be in the range [–r, r – 1], where r indicates the rank of the input x1.

[Restrictions]

indices must not be negative.

### ONNX Opset Support

Opset v8/v9/v10/v11/v12/v13

## GatherND

### Description

Gathers slices of data into an output tensor.

### Parameters

[Inputs]

Two inputs

data: input tensor of rank r >= 1. Must be one of the following types: float16, float32, double, int32, int64.

indices: tensor of type int64, of rank q >= 1.

[Outputs]

One output

output: tensor of rank q + r – indices_shape[–1] – 1

[Attributes]

batch_dims: int, the number of batch dimensions. Defaults to 0.

[Restrictions]

The operator does not support inputs of type double when the atc command-line option --precision_mode is set to must_keep_origin_dtype.

### ONNX Opset Support

Opset v11/v12/v13
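The output rank formula can be checked on a small case. A NumPy sketch with batch_dims = 0 (the shapes are illustrative):

```python
import numpy as np

data = np.zeros((2, 3, 4), dtype=np.float32)           # r = 3
indices = np.array([[0, 1], [1, 2]], dtype=np.int64)   # q = 2, last dim = 2

# Each index row addresses data[i0, i1, :], so the gathered output is (2, 4):
out = np.stack([data[tuple(idx)] for idx in indices])
print(out.shape)   # rank 2 == q + r - indices_shape[-1] - 1 == 2 + 3 - 2 - 1
```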

## GatherElements

### Description

Produces an output by indexing into the input tensor at index positions.

### Parameters

[Inputs]

Two inputs

input: input tensor of rank > 1. Must be one of the following types: float16, float32, double, int32, int64.

indices: tensor of type int32 or int64.

[Outputs]

One output

output: tensor with the same shape as indices.

[Attributes]

axis: int, the axis to gather on. Defaults to 0.

### ONNX Opset Support

Opset v8/v9/v10/v11/v12/v13

## Gemm

### Description

General matrix multiplication

### Parameters

[Inputs]

A: 2D tensor of type float16 or float32.

B: 2D tensor of type float16 or float32.

C: (optional) bias, not supported currently.

[Outputs]

Y: 2D tensor of type float16 or float32.

[Attributes]

transA: bool, indicating whether A needs to be transposed.

transB: bool, indicating whether B needs to be transposed.

alpha: float, not supported currently.

beta: float, not supported currently.

[Restrictions]

Opset v8, v9, and v10 do not support inputs of type float32 when the atc command-line option --precision_mode is set to must_keep_origin_dtype.

### ONNX Opset Support

Opset v8/v9/v10/v11/v12/v13

## GlobalAveragePool

### Description

Performs global average pooling.

### Parameters

[Inputs]

X: tensor of type float16 or float32, in NCHW format.

[Outputs]

Y: pooled tensor in NCHW format. Has the same data type as X.

### ONNX Opset Support

Opset v8/v9/v10/v11/v12/v13

## GlobalLpPool

### Description

Performs global norm pooling.

### Parameters

[Inputs]

Two inputs

input: tensor of type float16 or float32.

p: (optional) int32. Defaults to 2.

[Outputs]

One output

y: tensor of the same data type as input x.

### ONNX Opset Support

Opset v8/v9/v10/v11/v12/v13

## GlobalMaxPool

### Description

Performs global max pooling.

### Parameters

[Inputs]

One input

x: output tensor of the upstream node. Must be of type float16, float32, or double.

[Outputs]

One output

output: pooled tensor

### ONNX Opset Support

Opset v8/v9/v10/v11/v12/v13

## Greater

### Description

Returns the truth value of (x1 > x2) element-wise.

### Parameters

[Inputs]

Two inputs

x1: tensor of type float16, float32, int32, int8, or uint8.

x2: tensor of type float16, float32, int32, int8, or uint8.

[Outputs]

One output

y: tensor of type bool.

### ONNX Opset Support

Opset v8/v9/v10/v11/v12/v13

## GreaterOrEqual

### Description

Returns the truth value of (x1 >= x2) element-wise.

### Parameters

[Inputs]

Two inputs

x1: tensor of type float16, float32, int32, int8, or uint8.

x2: tensor of type float16, float32, int32, int8, or uint8.

[Outputs]

One output

y: tensor of type bool.

### ONNX Opset Support

Opset v8/v12

## HardSigmoid

### Description

Takes one input tensor and produces one output tensor, where the HardSigmoid function, y = max(0, min(1, alpha * x + beta)), is applied to the tensor element-wise.

### Parameters

[Inputs]

One input

X: tensor of type float16, float, or double.

[Outputs]

One output

Y: tensor of type float16, float, or double.

[Attributes]

alpha: float. Defaults to 0.2.

beta: float. Defaults to 0.2.

### ONNX Opset Support

Opset v1/v6/v8/v9/v10/v11/v12/v13

## Hardmax

### Description

Computes the hardmax values for the given input: Hardmax(element in input, axis) = 1 if the element is the first maximum value along the specified axis, 0 otherwise.

### Parameters

[Inputs]

One input

x: tensor of type float16 or float32, of rank = 2.

[Outputs]

One output

y: tensor of the same data type and shape as input x.

[Attributes]

axis: int. The dimension on which Hardmax will be performed. Defaults to –1.

[Restrictions]

In the atc command line, the --precision_mode option must be set to allow_fp32_to_fp16.

### ONNX Opset Support

Opset v8/v9/v10/v11/v12/v13
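The "first maximum" rule matters when there are ties. A NumPy sketch of the element selection (an illustration, not the operator itself):

```python
import numpy as np

def hardmax(x, axis=-1):
    # 1 at the first maximum along the axis, 0 everywhere else
    y = np.zeros_like(x)
    idx = np.expand_dims(np.argmax(x, axis=axis), axis)  # argmax picks the first max
    np.put_along_axis(y, idx, 1, axis=axis)
    return y

x = np.array([[1.0, 3.0, 3.0, 0.0]], dtype=np.float32)
print(hardmax(x))   # [[0. 1. 0. 0.]], the tie resolves to the first occurrence
```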

## HardSwish

### Description

Applies the HardSwish function: y = x * max(0, min(1, alpha * x + beta)), where alpha is 1/6 and beta is 0.5.

### Parameters

[Inputs]

One input

x: tensor of type float16 or float32.

[Outputs]

One output

y: tensor of type float16 or float32.

### ONNX Opset Support

Opset v14

## Identity

### Description

Identity operator

### Parameters

[Inputs]

One input

x: tensor of type float16 or float32.

[Outputs]

One output

y: tensor of the same data type and shape as input x.

### ONNX Opset Support

Opset v8/v9/v10/v11/v12/v13

## If

### Description

If conditional

### Parameters

[Inputs]

One input

cond: condition for the If operator.

[Attributes]

Two attributes

else_branch: branch tensor to run if the condition is false.

then_branch: branch tensor to run if the condition is true.

[Outputs]

One or more outputs

y: tensor or list of tensors

### ONNX Opset Support

Opset v8/v9/v10/v11/v12/v13

## InstanceNormalization

### Description

Computes a tensor using the formula: y = scale * (x – mean) / sqrt(variance + epsilon) + B, where mean and variance are computed per instance per channel.

### Parameters

[Inputs]

Three inputs

x: tensor of type float16 or float.

scale: 1D tensor of size C (the channel dimension of x). Has the same data type as input x.

B: 1D tensor of size C (the channel dimension of x). Has the same data type as input x.

[Outputs]

One output

y: tensor of the same data type and shape as input x.

[Attributes]

epsilon: float. The epsilon value to use to avoid division by zero. Defaults to 1e-05.

### ONNX Opset Support

Opset v8/v9/v10/v11/v12/v13
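"Per instance per channel" means the statistics are reduced over the spatial axes only. A NumPy sketch of the formula for NCHW input (an illustration, not the operator itself):

```python
import numpy as np

def instance_norm(x, scale, b, epsilon=1e-5):
    # mean/variance over H and W, separately for every (instance, channel) pair
    mean = x.mean(axis=(2, 3), keepdims=True)
    var = x.var(axis=(2, 3), keepdims=True)
    return scale[None, :, None, None] * (x - mean) / np.sqrt(var + epsilon) + \
        b[None, :, None, None]

x = np.random.randn(2, 3, 4, 4).astype(np.float32)
y = instance_norm(x, np.ones(3, np.float32), np.zeros(3, np.float32))
print(y.mean(axis=(2, 3)))   # close to 0 for every instance and channel
```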

## Less

### Description

Returns the truth value of (x1 < x2) element-wise.

### Parameters

[Inputs]

Two inputs

x1: tensor of type float16, float32, int32, int8, or uint8.

x2: tensor of type float16, float32, int32, int8, or uint8.

[Outputs]

One output

y: tensor of type bool.

### ONNX Opset Support

Opset v8/v9/v10/v11/v12/v13

## LeakyRelu

### Description

Computes the Leaky ReLU activation function.

### Parameters

[Inputs]

One input

x: tensor of type float16 or float32.

[Outputs]

One output

y: tensor. Has the same data type and shape as the input.

[Attributes]

alpha: float, the leakage coefficient. Defaults to 0.01.

### ONNX Opset Support

Opset v8/v9/v10/v11/v12/v13

## LessOrEqual

### Description

Returns the truth value of (x <= y) element-wise.

### Parameters

[Inputs]

Two inputs

x: tensor of type float16 or float32.

y: tensor of type float16 or float32.

[Outputs]

One output

y: tensor of type bool, with the same shape as the input x.

### ONNX Opset Support

Opset v12/v13

## Log

### Description

Computes the natural logarithm of x element-wise.

### Parameters

[Inputs]

One input

x: tensor of type float16 or float32.

[Outputs]

One output

y: tensor. Has the same data type as the input.

### ONNX Opset Support

Opset v8/v9/v10/v11/v12/v13

## LogSoftMax

### Description

Computes log softmax activations.

### Parameters

[Inputs]

One input

x: tensor of type float16 or float32.

[Outputs]

One output

y: tensor. Has the same data type and shape as the input.

[Attributes]

axis: int. Must be in the range [–r, r – 1], where r indicates the rank of the input.

### ONNX Opset Support

Opset v8/v9/v10/v11/v12/v13

## LpNormalization

### Description

Given a matrix, applies Lp-normalization along the provided axis.

### Parameters

[Inputs]

One input

input: tensor of type float16 or float.

[Outputs]

One output

output: tensor of type float16 or float.

[Attributes]

axis: int. Defaults to –1.

p: int. Defaults to 2.

[Restrictions]

Note that both the SAME_UPPER and SAME_LOWER values of auto_pad are functionally the same as the SAME argument of built-in TBE operators. This attribute configuration may lead to an accuracy drop because the SAME argument is position-insensitive.

### ONNX Opset Support

Opset v1/v8/v9/v10/v11/v12/v13

## LpPool

### Description

Performs Lp norm pooling.

### Parameters

[Inputs]

One input

x: tensor of type float16 or float32.

[Outputs]

One output

y: tensor of type float16 or float32.

[Attributes]

auto_pad: string. The value can be NOTSET (default), SAME_UPPER, or VALID.

kernel_shape: (required) int list, size of the kernel on each axis.

p: int, norm. Defaults to 2.

pads: int list.

strides: int list.

### ONNX Opset Support

Opset v11/v12/v13

## LRN

### Description

Performs local response normalization.

### Parameters

[Inputs]

One input

x: tensor of type float16 or float32.

[Outputs]

One output

y: tensor of the same data type and format as input x.

[Attributes]

alpha: float, a scale factor.

beta: float, an exponent.

bias: float.

size: int, the number of channels to sum over. Must be odd.

### ONNX Opset Support

Opset v8/v9/v10/v11/v12/v13

## LSTM

### Description

Computes a one-layer LSTM. This operator is usually supported via some custom implementation such as cuDNN.

### Parameters

[3–8 Inputs]

X: tensor of type float16, float, or double.

W: tensor of type float16, float, or double.

R: tensor of type float16, float, or double.

B: tensor of type float16, float, or double.

sequence_lens: tensor of type int32.

initial_h: tensor of type float16, float, or double.

initial_c: tensor of type float16, float, or double.

p: tensor of type float16, float, or double.

[0–3 Outputs]

Y: tensor of type float16, float, or double.

Y_h: tensor of type float16, float, or double.

Y_c: tensor of type float16, float, or double.

[Attributes]

activation_alpha: list of floats.

activation_beta: list of floats.

activations: list of strings.

clip: float

direction: string. Defaults to forward.

hidden_size: int

input_forget: int. Defaults to 0.

layout: int. Defaults to 0.

### ONNX Opset Support

Opset v8/v9/v10/v11/v12/v13

## MatMul

### Description

Multiplies two matrices.

### Parameters

[Inputs]

Two inputs

x1: 2D tensor of type float16.

x2: 2D tensor of type float16.

[Outputs]

One output

y: 2D tensor of type float16.

[Restrictions]

Only 1D to 6D inputs are supported.

### ONNX Opset Support

Opset v8/v9/v10/v11/v12/v13

## Max

### Description

Computes the element-wise max of each of the input tensors.

### Parameters

[Inputs]

One or more inputs (1–∞)

data_0: list of tensors. Must be one of the following types: float16, float32, int8, int16, int32.

[Outputs]

One output

max: tensor with the same type and shape as the input x (broadcast shape)

### ONNX Opset Support

Opset v8/v9/v10/v11/v12/v13

## MaxPool

### Description

Performs max pooling.

### Parameters

[Inputs]

X: tensor of type float16 or float32, in NCHW format.

[Outputs]

Y: tensor of type float16 or float32, in NCHW format.

[Attributes]

auto_pad: (optional) selected from SAME_UPPER, SAME_LOWER, VALID, and NOTSET.

storage_order: not supported currently.

kernel_shape: (optional)

- kernel_shape[0]: int32, the kernel height. Must be in the range [1, 32768]. Defaults to 1.
- kernel_shape[1]: int32, the kernel width. Must be in the range [1, 32768]. Defaults to 1.

strides: (optional)

- strides[0]: int32, the stride height. Defaults to 1.
- strides[1]: int32, the stride width. Defaults to 1.

pads: (optional)

- pads[0]: int32, top padding. Defaults to 0.
- pads[1]: int32, bottom padding. Defaults to 0.
- pads[2]: int32, left padding. Defaults to 0.
- pads[3]: int32, right padding. Defaults to 0.

ceil_mode: (optional) int32, either 0 (floor mode) or 1 (ceil mode). Defaults to 0.

[Restrictions]

When strides[0] or strides[1] is greater than 63, computation is performed on AI CPU, which will compromise performance.

When the value of kernel_shape_H or kernel_shape_W is beyond the range [1, 255] or kernel_shape_H * kernel_shape_W > 256, computation is performed on AI CPU, which will compromise performance.

input_w must be in the range [1, 4096].

When N of the input tensor is a prime number, N must be less than 65535.

dilations is not supported for a 2D tensor.

If auto_pad is VALID, ceil_mode must be 0.

The operator does not support inputs of type float32 when the atc command-line option --precision_mode is set to must_keep_origin_dtype.

pads and auto_pad are mutually exclusive.

### ONNX Opset Support

Opset v8/v9/v10/v11/v12/v13

## MaxRoiPool

### Description

Consumes an input tensor X and regions of interest (RoIs) and applies max pooling across each RoI to produce a 4D output tensor of shape (num_rois, channels, pooled_shape[0], pooled_shape[1]).

### Parameters

[Inputs]

X: tensor of type float16 or float.

rois: tensor of type float16 or float.

[Outputs]

Y: tensor of type float16, float, or double.

[Attributes]

pooled_shape: list of ints

spatial_scale: float. Defaults to 1.0.

[Restrictions]

The operator does not support inputs of type float32 when the atc command-line option --precision_mode is set to must_keep_origin_dtype.

### ONNX Opset Support

Opset v8/v9/v10/v11/v12/v13

## MaxUnpool

### Description

Computes the reverse of the MaxPool operation.

### Parameters

[Inputs]

X: tensor of type float16 or float32.

I: tensor of type int64.

output_shape: (optional) output shape of type int64.

[Outputs]

Y: tensor of the same data type as the input.

[Attributes]

kernel_shape: (required) int list, kernel size on each axis.

pads: int list, pad on each axis.

strides: int list, stride on each axis.

### ONNX Opset Support

Opset v9/v11/v12/v13

## Mean

### Description

Computes the element-wise mean of each of the input tensors (with NumPy-style broadcasting support). All inputs and outputs must have the same data type. This operator supports multi-directional (NumPy-style) broadcasting.

### Parameters

[Inputs]

One or more inputs (1–∞)

data_0: tensor of type float16, float, double, or bfloat16.

[Outputs]

mean: tensor of type float16, float, double, or bfloat16.

### ONNX Opset Support

Opset v8/v9/v10/v11/v12/v13

## MeanVarianceNormalization

### Description

Performs mean variance normalization on the input tensor X using the formula: (X – EX) / sqrt(E(X – EX)^2)

### Parameters

[Inputs]

X: tensor of type float16, float, or bfloat16.

[Outputs]

Y: tensor of type float16, float, or bfloat16.

[Attributes]

axes: list of ints. Defaults to ['0', '2', '3'].

### ONNX Opset Support

Opset v9/v10/v11/v12/v13
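With the default axes ['0', '2', '3'], the statistics are shared across batch and spatial positions, per channel. A NumPy sketch of the formula (illustrative only):

```python
import numpy as np

def mvn(x, axes=(0, 2, 3)):
    # (X - EX) / sqrt(E(X - EX)^2), with statistics over the given axes
    mean = x.mean(axis=axes, keepdims=True)
    std = np.sqrt(((x - mean) ** 2).mean(axis=axes, keepdims=True))
    return (x - mean) / std

x = np.random.randn(2, 3, 4, 4).astype(np.float32)
print(mvn(x).mean(axis=(0, 2, 3)))   # close to 0 for every channel
```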

## Min

### Description

Returns the minimum of the input tensors.

### Parameters

[Inputs]

One input

x: list of tensors of type float16 or float32.

[Outputs]

One output

y: output tensor

### ONNX Opset Support

Opset v8/v9/v10/v11/v12/v13

Mod

+

Mod

-### Description +### Description -Performs element-wise binary modulus \(with NumPy-style broadcasting support\). The sign of the remainder is the same as that of the divisor. +Performs element-wise binary modulus (with NumPy-style broadcasting support). The sign of the remainder is the same as that of the divisor. -### Parameters +### Parameters -\[Inputs\] +[Inputs] A: tensor. Must be one of the following data types: uint8, uint16, uint32, uint64, int8, int16, int32, int64, float16, float, double, bfloat16. B: tensor. Must be one of the following data types: uint8, uint16, uint32, uint64, int8, int16, int32, int64, float16, float, double, bfloat16. -\[Outputs\] +[Outputs] C: tensor. Must be one of the following data types: uint8, uint16, uint32, uint64, int8, int16, int32, int64, float16, float, double, bfloat16. -\[Attributes\] +[Attributes] fmod: int. Defaults to 0. -\[Restrictions\] +[Restrictions] fmod must not be 0 if the inputs are of type float. -### ONNX Opset Support +### ONNX Opset Support Opset v10/v11/v12/v13 -
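The fmod attribute decides which sign convention applies. A small NumPy sketch of the two behaviors (np.mod follows the divisor's sign, np.fmod the dividend's):

```python
import numpy as np

# fmod = 0: integer-mod semantics, remainder takes the sign of the divisor.
# fmod = 1: C fmod() semantics, remainder takes the sign of the dividend
# (this is the mode required for float inputs).
print(np.mod(-7, 3))   # 2
print(np.fmod(-7, 3))  # -1
```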

Mul

+

Mul

-### Description +### Description Performs element-wise binary multiplication (with NumPy-style broadcasting support). -### Parameters +### Parameters -\[Inputs\] +[Inputs] A: tensor of type float16, float32, uint8, int8, int16, or int32. B: tensor of type float16, float32, uint8, int8, int16, or int32. -\[Outputs\] +[Outputs] -C: tensor of the identical data type as the input tensor. +C: tensor. Has an identical data type to that of the input tensor. -### ONNX Opset Support +### ONNX Opset Support Opset v8/v9/v10/v11/v12/v13 -

Multinomial

+

Multinomial

-### Description +### Description Generates a tensor of samples from a multinomial distribution according to the probabilities of each of the possible outcomes. -### Parameters +### Parameters -\[Inputs\] +[Inputs] One input -x: tensor of type float16 or float32, with shape \[batch\_size, class\_size\]. +x: tensor of type float16 or float32, with shape [batch_size, class_size]. -\[Outputs\] +[Outputs] One output -y: tensor of type int32 or int64, with shape \[batch\_size, sample\_size\]. +y: tensor of type int32 or int64, with shape [batch_size, sample_size]. -\[Attributes\] +[Attributes] -dtype: int. The output dtype. Defaults to 6 \(int32\). +dtype: int. The output dtype. Defaults to 6 (int32). -sample\_size: int. Number of times to sample. Defaults to 1. +sample_size: int. Number of times to sample. Defaults to 1. seed: float. Seed to the random generator. -### ONNX Opset Support +### ONNX Opset Support Opset v8/v9/v10/v11/v12/v13 -

Neg

+

Neg

-### Description +### Description Computes numerical negative value element-wise. -### Parameters +### Parameters -\[Inputs\] +[Inputs] One input x: tensor of type float16, float32, or int32. -\[Outputs\] +[Outputs] One output -y: tensor of the identical data type as the input. +y: tensor. Has an identical data type to that of the input. -### ONNX Opset Support +### ONNX Opset Support Opset v8/v9/v10/v11/v12/v13 -

NonMaxSuppression

+

NonMaxSuppression

-### Description +### Description -Filters out boxes that have high intersection-over-union \(IOU\) overlap with previously selected boxes. Bounding boxes with score less than score\_threshold are removed. Bounding box format is indicated by the center\_point\_box attribute. Note that this algorithm is agnostic to where the origin is in the coordinate system and more generally is invariant to orthogonal transformations and translations of the coordinate system; thus translating or reflections of the coordinate system result in the same boxes being selected by the algorithm. The selected\_indices output is a set of integers indexing into the input collection of bounding boxes representing the selected boxes. The bounding box coordinates corresponding to the selected indices can then be obtained using the Gather or GatherND operation. +Filters out boxes that have high intersection-over-union (IOU) overlap with previously selected boxes. Bounding boxes with score less than score_threshold are removed. Bounding box format is indicated by the center_point_box attribute. Note that this algorithm is agnostic to where the origin is in the coordinate system and more generally is invariant to orthogonal transformations and translations of the coordinate system; thus translations or reflections of the coordinate system result in the same boxes being selected by the algorithm. The selected_indices output is a set of integers indexing into the input collection of bounding boxes representing the selected boxes. The bounding box coordinates corresponding to the selected indices can then be obtained using the Gather or GatherND operation. -### Parameters +### Parameters -\[2–5 Inputs\] +[2–5 Inputs] boxes: tensor of type float scores: tensor of type float -max\_output\_boxes\_per\_class: \(optional\) tensor of type int64 +max_output_boxes_per_class: (optional) tensor of type int64 -iou\_threshold: \(optional\) tensor of type float +iou_threshold: (optional) tensor of type float -score\_threshold: \(optional\) tensor of type float +score_threshold: (optional) tensor of type float -\[Outputs\] +[Outputs] -selected\_indices: tensor of type int64 +selected_indices: tensor of type int64 -\[Attributes\] +[Attributes] -center\_point\_box: int. Defaults to 0. +center_point_box: int. Defaults to 0. -### ONNX Opset Support +### ONNX Opset Support Opset v10/v11/v12/v13 -
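As a sketch of how such a node is typically declared with the standard onnx.helper API (the tensor names here are illustrative; the three optional inputs would be supplied as constant tensors elsewhere in the graph):

```python
from onnx import helper

# Hedged sketch: declares a NonMaxSuppression node with all five inputs.
nms_node = helper.make_node(
    "NonMaxSuppression",
    inputs=["boxes", "scores", "max_output_boxes_per_class",
            "iou_threshold", "score_threshold"],
    outputs=["selected_indices"],
    center_point_box=0,  # 0: corner format [y1, x1, y2, x2]; 1: center format
)
```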

NonZero

+

NonZero

-### Description +### Description -Returns the indices of the elements that are non-zero \(in row-major order\). +Returns the indices of the elements that are non-zero (in row-major order). -### Parameters +### Parameters -\[Inputs\] +[Inputs] One input x: tensor of type float16, float32, int32, int8, or uint8. -\[Outputs\] +[Outputs] One output y: tensor of type int64. -### ONNX Opset Support +### ONNX Opset Support Opset v9/v10/v11/v12/v13 -

Not

+

Not

-### Description +### Description Returns the negation of the input tensor element-wise. -### Parameters +### Parameters -\[Inputs\] +[Inputs] One input x: tensor of type bool. -\[Outputs\] +[Outputs] One output y: tensor of type bool. -### ONNX Opset Support +### ONNX Opset Support Opset v8/v9/v10/v11/v12/v13 -

OneHot

+

OneHot

-### Description +### Description Produces a one-hot tensor based on inputs. -### Parameters +### Parameters -\[Inputs\] +[Inputs] Three inputs @@ -2997,35 +2997,35 @@ depth: tensor. Must be one of the following data types: uint8, uint16, uint32, u values: tensor. Must be one of the following data types: uint8, uint16, uint32, uint64, int8, int16, int32, int64, float16, float, double. -\[Attributes\] +[Attributes] One attribute -axis: \(optional\) axis along which one-hot representation is added. +axis: (optional) axis along which one-hot representation is added. -\[Outputs\] +[Outputs] One output y: tensor of the identical data type as the values input. -\[Restrictions\] +[Restrictions] axis must not be less than –1. -### ONNX Opset Support +### ONNX Opset Support Opset v9/v10/v11/v12/v13 -

Or

+

Or

-### Description +### Description -Returns the tensor resulted from performing the or logical operation element-wise on the input tensors. +Returns the tensor resulted from performing the OR logical operation element-wise on the input tensors. -### Parameters +### Parameters -\[Inputs\] +[Inputs] Two inputs @@ -3033,37 +3033,37 @@ X1: tensor of type bool. X2: tensor of type bool. -\[Outputs\] +[Outputs] One output y: tensor of type bool. -### ONNX Opset Support +### ONNX Opset Support Opset v8/v9/v10/v11/v12/v13 -

RandomNormalLike

+

RandomNormalLike

-### Description +### Description Generates a tensor with random values drawn from a normal distribution. The shape of the output tensor is copied from the shape of the input tensor. -### Parameters +### Parameters -\[Inputs\] +[Inputs] One input x: tensor of type float16 or float. -\[Outputs\] +[Outputs] One output y: tensor of the identical data type and shape as input x. -\[Attributes\] +[Attributes] dtype: int, specifying the data type of the output tensor. @@ -3073,31 +3073,31 @@ scale: float. The standard deviation of the normal distribution. Defaults to 1.0 seed: float. Seed to the random generator. -### ONNX Opset Support +### ONNX Opset Support Opset v8/v9/v10/v11/v12/v13 -

RandomUniformLike

+

RandomUniformLike

-### Description +### Description Generates a tensor with random values drawn from a uniform distribution. The shape of the output tensor is copied from the shape of the input tensor. -### Parameters +### Parameters -\[Inputs\] +[Inputs] One input x: tensor of type float16 or float. -\[Outputs\] +[Outputs] One output y: tensor of the identical data type and shape as input x. -\[Attributes\] +[Attributes] dtype: int, specifying the data type of the output tensor. @@ -3107,19 +3107,19 @@ low: float. Lower boundary of the uniform distribution. Defaults to 0.0. seed: float. Seed to the random generator. -### ONNX Opset Support +### ONNX Opset Support Opset v8/v9/v10/v11/v12/v13 -

RandomUniform

+

RandomUniform

-### Description +### Description Generates a tensor with random values drawn from a uniform distribution. -### Parameters +### Parameters -\[Attributes\] +[Attributes] Five attributes @@ -3129,29 +3129,29 @@ high: float. Specifies the upper boundary. low: float. Specifies the lower boundary. -seed: \(optional\) seed to the random generator. +seed: (optional) seed to the random generator. shape: output shape. -\[Outputs\] +[Outputs] One output y: tensor of the data type specified by the dtype attribute. -### ONNX Opset Support +### ONNX Opset Support Opset v8/v9/v10/v11/v12/v13 -

Range

+

Range

-### Description +### Description Generates a tensor containing a sequence of numbers. -### Parameters +### Parameters -\[Inputs\] +[Inputs] Three inputs @@ -3161,399 +3161,399 @@ limit: scalar of type float16 or float32. delta: scalar of type float16 or float32. -\[Outputs\] +[Outputs] One output -y: tensor of the identical data type as input x. +y: tensor. Has an identical data type to that of input x. -### ONNX Opset Support +### ONNX Opset Support Opset v8/v9/v10/v11/v12/v13 -

Reciprocal

+

Reciprocal

-### Description +### Description Computes the reciprocal of the input element-wise. -### Parameters +### Parameters -\[Inputs\] +[Inputs] One input x: tensor of type float16, float32, or double. -\[Outputs\] +[Outputs] One output -y: tensor. Has the identical data type and shape as the input. +y: tensor. Has an identical data type and shape to the input. -### ONNX Opset Support +### ONNX Opset Support Opset v8/v9/v10/v11/v12/v13 -

ReduceL1

+

ReduceL1

-### Description +### Description Computes the L1 norm of the input tensor's elements along the provided axes. The resulted tensor has the same rank as the input if keepdim is set to 1. If keepdim is set to 0, then the result tensor has the reduced dimension pruned. The above behavior is similar to NumPy, with the exception that NumPy defaults keepdim to False instead of True. -### Parameters +### Parameters -\[Inputs\] +[Inputs] data: tensor. Must be one of the following types: uint32, uint64, int32, int64, float16, float, double, bfloat16. -\[Outputs\] +[Outputs] reduced: tensor. Must be one of the following types: uint32, uint64, int32, int64, float16, float, double, bfloat16. -\[Attributes\] +[Attributes] axes: list of ints. keepdims: int. Defaults to 1. -### ONNX Opset Support +### ONNX Opset Support Opset v8/v9/v10/v11/v12/v13 -

ReduceL2

+

ReduceL2

-### Description +### Description Computes the L2 norm of the input tensor's elements along the provided axes. The resulted tensor has the same rank as the input if keepdim is set to 1. If keepdim is set to 0, then the result tensor has the reduced dimension pruned. The above behavior is similar to NumPy, with the exception that NumPy defaults keepdim to False instead of True. -### Parameters +### Parameters -\[Inputs\] +[Inputs] data: tensor. Must be one of the following types: uint32, uint64, int32, int64, float16, float, double, bfloat16. -\[Outputs\] +[Outputs] reduced: tensor. Must be one of the following types: uint32, uint64, int32, int64, float16, float, double, bfloat16. -\[Attributes\] +[Attributes] axes: list of ints. keepdims: int. Defaults to 1. -### ONNX Opset Support +### ONNX Opset Support Opset v8/v9/v10/v11/v12/v13 -

ReduceLogSum

+

ReduceLogSum

-### Description +### Description Computes the sum of elements across dimensions of a tensor in log representations. -### Parameters +### Parameters -\[Inputs\] +[Inputs] One input x: tensor of type float16 or float32. -\[Outputs\] +[Outputs] One output y: tensor of type float16 or float32. -\[Attributes\] +[Attributes] -axes: int list. Must be in the range \[–r, r – 1\], where **r** indicates the dimension count of the input x. +axes: int list. Must be in the range [–r, r – 1], where r indicates the dimension count of the input x. -keepdims: int. Defaults to **1**, meaning that the reduced dimensions with length 1 are retained. +keepdims: int. Defaults to 1, meaning that the reduced dimensions with length 1 are retained. -### ONNX Opset Support +### ONNX Opset Support Opset v11/v13 -

ReduceLogSumExp

+

ReduceLogSumExp

-### Description +### Description Reduces a dimension of a tensor by computing the exponential of all elements in the dimension, summing them, and then taking the logarithm of the sum. -### Parameters +### Parameters -\[Inputs\] +[Inputs] One input data: tensor of type float16 or float32. -\[Outputs\] +[Outputs] One output reduced: tensor of type float16 or float32. -\[Attributes\] +[Attributes] -axes: tensor of type int32 or int64. Must be in the range \[–r, r – 1\], where **r** indicates the dimension count of the input x. +axes: tensor of type int32 or int64. Must be in the range [–r, r – 1], where r indicates the dimension count of the input x. -keepdims: int, indicating whether to reduce the dimensions. The default value is **1**, indicating that the dimensions are reduced. +keepdims: int, indicating whether to keep the reduced dimensions. The default value is 1, indicating that the reduced dimensions with length 1 are retained. -### ONNX Opset Support +### ONNX Opset Support Opset v8/v9/v10/v11/v12/v13 -
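In NumPy terms the reduction reads as follows (a sketch; a production kernel would typically use a numerically stabilized variant):

```python
import numpy as np

# log(sum(exp(x))) over axes = [1] with keepdims = 1; shape is illustrative.
x = np.random.rand(2, 3, 4).astype(np.float32)
reduced = np.log(np.sum(np.exp(x), axis=(1,), keepdims=True))
```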

ReduceMin

+

ReduceMin

-### Description +### Description Computes the minimum of elements across dimensions of a tensor. -### Parameters +### Parameters -\[Inputs\] +[Inputs] One input x: tensor of type float16 or float32. -\[Outputs\] +[Outputs] One output y: tensor of type float16 or float32. -\[Attributes\] +[Attributes] -axes: int list. Must be in the range \[–r, r – 1\], where **r** indicates the dimension count of the input x. +axes: int list. Must be in the range [–r, r – 1], where r indicates the dimension count of the input x. keepdims: int. Defaults to 1, meaning that the reduced dimensions with length 1 are retained. -### ONNX Opset Support +### ONNX Opset Support Opset v8/v9/v10/v11/v12/v13 -

ReduceMean

+

ReduceMean

-### Description +### Description Computes the mean of elements across dimensions of a tensor. -### Parameters +### Parameters -\[Inputs\] +[Inputs] One input x: tensor of type float16 or float32. -\[Outputs\] +[Outputs] One output -y: tensor of the identical data type and format as input x. +y: tensor. Has an identical data type and format to those of input x. -\[Attributes\] +[Attributes] -axes: 1D list of ints, the dimensions to reduce. Must be in the range \[–r, r – 1\], where r indicates the rank of the input. +axes: 1D list of ints, the dimensions to reduce. Must be in the range [–r, r – 1], where r indicates the rank of the input. keepdims: int. Defaults to 1, meaning that the reduced dimensions with length 1 are retained. -### ONNX Opset Support +### ONNX Opset Support Opset v8/v9/v10/v11/v12/v13 -

ReduceProd

+

ReduceProd

-### Description +### Description Computes the product of the input tensor's elements along the provided axes. The resulted tensor has the same rank as the input if keepdim is set to 1. If keepdim is set to 0, then the result tensor has the reduced dimension pruned. -### Parameters +### Parameters -\[Inputs\] +[Inputs] data: tensor. Must be one of the following types: uint32, uint64, int32, int64, float16, float, double, bfloat16. -\[Outputs\] +[Outputs] reduced: tensor. Must be one of the following types: uint32, uint64, int32, int64, float16, float, double, bfloat16. -\[Attributes\] +[Attributes] axes: list of ints. keepdims: int. Defaults to 1. -### ONNX Opset Support +### ONNX Opset Support Opset v8/v9/v10/v11/v12/v13 -

ReduceSumSquare

+

ReduceSumSquare

-### Description +### Description Computes the sum square of the input tensor's elements along the provided axes. The resulted tensor has the same rank as the input if keepdim is set to 1. If keepdim is set to 0, then the result tensor has the reduced dimension pruned. The above behavior is similar to NumPy, with the exception that NumPy defaults keepdim to False instead of True. -### Parameters +### Parameters -\[Inputs\] +[Inputs] data: tensor. Must be one of the following types: uint32, uint64, int32, int64, float16, float, double, bfloat16. -\[Outputs\] +[Outputs] reduced: tensor. Must be one of the following types: uint32, uint64, int32, int64, float16, float, double, bfloat16. -\[Attributes\] +[Attributes] axes: list of ints. keepdims: int. Defaults to 1. -### ONNX Opset Support +### ONNX Opset Support Opset v1/v8/v9/v10/v11/v12/v13 -

Resize

+

Resize

-### Description +### Description Resizes the input tensor. -### Parameters +### Parameters -\[Inputs\] +[Inputs] Four inputs x: tensor of type float16 or float32. -roi: 1D tensor of type float16 or float32, with shape \[start1, ..., startN, end1, ..., endN\]. The tensor normalized by the input image. +roi: 1D tensor of type float16 or float32, with shape [start1, ..., startN, end1, ..., endN]. The tensor is normalized by the input image. scales: array. Has the same rank as that of the input x. sizes: size of the output tensor. -\[Outputs\] +[Outputs] One output y: resized tensor -\[Attributes\] +[Attributes] -coordinate\_transformation\_mode: string. Defaults to half\_pixel. Describes how to transform the coordinate in the resized tensor to the coordinate in the original tensor. +coordinate_transformation_mode: string. Defaults to half_pixel. Describes how to transform the coordinate in the resized tensor to the coordinate in the original tensor. -cubic\_coeff\_a: float. The coefficient used in cubic interpolation. Defaults to –0.75. +cubic_coeff_a: float. The coefficient used in cubic interpolation. Defaults to –0.75. -exclude\_outside: int. The weight outside the tensor. Defaults to 0. +exclude_outside: int. The weight outside the tensor. Defaults to 0. -mode: string. Interpolation mode selected from nearest \(default\), linear, and cubic. +mode: string. Interpolation mode selected from nearest (default), linear, and cubic. -nearest\_mode: string. Defaults to round\_prefer\_floor. +nearest_mode: string. Nearest operator mode. Defaults to round_prefer_floor. -\[Restrictions\] +[Restrictions] -Currently, only the nearest and linear interpolation modes are supported to process images. In addition, the model's two inputs \(scales and sizes\) need to be changed from placeholders to constants. You can use ONNX Simplifier to simplify your model. +Currently, only the nearest and linear interpolation modes are supported to process images. In addition, the model's two inputs (scales and sizes) need to be changed from placeholders to constants. You can use ONNX Simplifier to simplify your model. -### ONNX Opset Support +### ONNX Opset Support Opset v10/v11/v12 -
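For the restriction above, one common way to fold the scales and sizes placeholders into constants is the third-party ONNX Simplifier package (an assumption here, not a tool this document mandates; the file names are illustrative):

```python
import onnx
from onnxsim import simplify  # third-party package: onnx-simplifier

model = onnx.load("model.onnx")
# Constant-folds dynamic inputs such as Resize scales/sizes where possible.
simplified_model, check_ok = simplify(model)
assert check_ok, "simplified model failed the output-equivalence check"
onnx.save(simplified_model, "model_sim.onnx")
```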

Relu

+

Relu

-### Description +### Description Applies the rectified linear unit activation function. -### Parameters +### Parameters -\[Inputs\] +[Inputs] -X: input tensor of type float32, int32, uint8, int16, int8, uint16, float16, or qint8. +X: input tensor of type float32, int32, uint8, int16, int8, uint16, float16, or qint8 -\[Outputs\] +[Outputs] -Y: tensor of the identical data type as X. +Y: tensor. Has an identical data type to that of X. -### ONNX Opset Support +### ONNX Opset Support Opset v8/v9/v10/v11/v12/v13 -

ReduceSum

+

ReduceSum

-### Description +### Description Computes the sum of the input tensor's elements along the provided axes. -### Parameters +### Parameters -\[Inputs\] +[Inputs] One input x: tensor of type float16 or float32. -\[Outputs\] +[Outputs] One output -y: tensor of the identical data type and format as input x. +y: tensor. Has an identical data type and format to those of input x. -\[Attributes\] +[Attributes] -axes: 1D list of ints, the dimensions to reduce. Must be in the range \[–r, r – 1\], where r indicates the rank of the input. +axes: 1D list of ints, the dimensions to reduce. Must be in the range [–r, r – 1], where r indicates the rank of the input. keepdims: int. Defaults to 1, meaning that the reduced dimensions with length 1 are retained. -### ONNX Opset Support +### ONNX Opset Support Opset v8/v9/v10/v11/v12/v13 -

ReduceMax

+

ReduceMax

-### Description +### Description Computes the maximum of elements across dimensions of a tensor. -### Parameters +### Parameters -\[Inputs\] +[Inputs] One input x: tensor of type float16, float32, or int32. -\[Outputs\] +[Outputs] One output y: tensor of type float16, float32, or int32. -\[Attributes\] +[Attributes] -axes: list of ints. Must be in the range \[–r, r – 1\], where r indicates the rank of the input. +axes: list of ints. Must be in the range [–r, r – 1], where r indicates the rank of the input. keepdims: int. Defaults to 1, meaning that the reduced dimensions with length 1 are retained. -### ONNX Opset Support +### ONNX Opset Support Opset v8/v9/v10/v11/v12/v13 -

Reshape

+

Reshape

-### Description +### Description Reshapes the input. -### Parameters +### Parameters -\[Inputs\] +[Inputs] Two inputs @@ -3561,55 +3561,55 @@ data: tensor. shape: tensor of type int64, for the shape of the output tensor. -\[Outputs\] +[Outputs] reshaped: tensor -### ONNX Opset Support +### ONNX Opset Support Opset v8/v9/v10/v11/v12/v13 -

ReverseSequence

+

ReverseSequence

-### Description +### Description Reverses batch of sequences having different lengths. -### Parameters +### Parameters -\[Inputs\] +[Inputs] Two inputs -x: tensor of type float16 or float32, of rank \>= 2. +x: tensor of type float16 or float32, of rank >= 2. -sequence\_lens: tensor of type int64. Lengths of the sequences in a batch. +sequence_lens: tensor of type int64. Lengths of the sequences in a batch. -\[Outputs\] +[Outputs] One output -y: tensor of the identical data type and shape as input x. +y: tensor of the same data type and shape as input x. -\[Attributes\] +[Attributes] -batch\_axis: int. Specifies the batch axis. Defaults to 1. +batch_axis: int. Specifies the batch axis. Defaults to 1. -time\_axis: int. Specifies the time axis. Defaults to 1. +time_axis: int. Specifies the time axis. Defaults to 1. -### ONNX Opset Support +### ONNX Opset Support Opset v10/v11/v12/v13 -

RoiExtractor

+

RoiExtractor

-### Description +### Description Obtains the ROI feature matrix from the feature mapping list. -### Parameters +### Parameters -\[Inputs\] +[Inputs] Two inputs @@ -3617,115 +3617,115 @@ features: tensor of type float32 or float16. rois: tensor of type float32 or float16. -\[Attributes\] +[Attributes] Eight attributes -finest\_scale: int +finest_scale: int -roi\_scale\_factor: float +roi_scale_factor: float -spatial\_scale: array of floats +spatial_scale: array of floats -pooled\_height: int +pooled_height: int -pooled\_width: int +pooled_width: int -sample\_num: int +sample_num: int -pool\_mode: string +pool_mode: string aligned: bool -\[Outputs\] +[Outputs] One output y: tensor of type float32 or float16. -### ONNX Opset Support +### ONNX Opset Support -No ONNX support for this custom operator +No ONNX support for this custom operator. -

RoiAlign

+

RoiAlign

-### Description +### Description Performs ROI align operation. -### Parameters +### Parameters -\[Inputs\] +[Inputs] Three inputs x: 4D tensor of type float16 or float32. -rois: float16 or float32. ROIs to pool over. Has shape \(num\_rois, 4\). +rois: float16 or float32 with shape (num_rois, 4). -batch\_indices: int64. Has shape \(num\_rois,\). +batch_indices: int64 with shape (num_rois,). -\[Outputs\] +[Outputs] One output -y: tensor of the identical type as input x. Has shape \(num\_rois, C, output\_height, output\_width\). +y: tensor of the identical type as input x. Has shape (num_rois, C, output_height, output_width). -\[Attributes\] +[Attributes] mode: string. The pooling method. Defaults to avg. -output\_height: int. Pooled output y's height. Defaults to 1. +output_height: int. Pooled output y's height. Defaults to 1. -output\_width: int. Pooled output y's width. Defaults to 1. +output_width: int. Pooled output y's width. Defaults to 1. -sampling\_ratio: int. Number of sampling points in the interpolation grid used to compute the output value of each pooled output bin. Defaults to 0. +sampling_ratio: int. Number of sampling points in the interpolation grid used to compute the output value of each pooled output bin. Defaults to 0. -spatial\_scale: float. Multiplicative spatial scale factor to translate ROI coordinates from their input spatial scale to the scale used when pooling. Defaults to 1.0. +spatial_scale: float. Multiplicative spatial scale factor to translate ROI coordinates from their input spatial scale to the scale used when pooling. Defaults to 1.0. -\[Restrictions\] +[Restrictions] -batch\_indices must be of type int32 instead of int64. +batch_indices must be of type int32 instead of int64. -The operator does not support inputs of type float32 or float64 when the atc command-line option **--precision\_mode** is set to **must\_keep\_origin\_dtype**. +The operator does not support inputs of type float32 or float64 when the atc command-line option --precision_mode is set to must_keep_origin_dtype. -### ONNX Opset Support +### ONNX Opset Support Opset v10/v11/v12/v13 -

Round

+

Round

-### Description +### Description Rounds the values of a tensor to the nearest integer, element-wise. -### Parameters +### Parameters -\[Inputs\] +[Inputs] One input x: tensor of type float16, float32, or double. -\[Outputs\] +[Outputs] One output -y: tensor. Has the identical data type and shape as the input. +y: tensor. Has an identical data type and shape to the input. -### ONNX Opset Support +### ONNX Opset Support Opset v8/v9/v10/v11/v12/v13 -

PRelu

+

PRelu

-### Description +### Description Computes Parametric Rectified Linear Unit. -### Parameters +### Parameters -\[Inputs\] +[Inputs] Two inputs @@ -3733,29 +3733,29 @@ x: tensor of type float16 or float32. slope: tensor of the same data type as input x. -\[Outputs\] +[Outputs] One output -y: tensor of the identical data type and shape as input x. +y: tensor of the same data type and shape as input x. -\[Restrictions\] +[Restrictions] -slope must be 1D. When input x is 1D, the dimension value of slope must be 1. When input x is not 1D, the dimension value of slope can be 1 or shape\[1\] of input x. +slope must be 1D. When input x is 1D, the dimension value of slope must be 1. When input x is not 1D, the dimension value of slope can be 1 or shape[1] of input x. -### ONNX Opset Support +### ONNX Opset Support Opset v8/v9/v10/v11/v12/v13 -

Scatter

+

Scatter

-### Description +### Description Returns the result by updating the values of the input data to values specified by updates at specific index positions specified by indices. -### Parameters +### Parameters -\[Inputs\] +[Inputs] Three inputs @@ -3765,29 +3765,29 @@ indices: tensor of type int32 or int64. updates: tensor of the identical data type as data. -\[Outputs\] +[Outputs] One output y: tensor of the identical data type and shape as input x. -\[Attributes\] +[Attributes] axis: int, specifying which axis to scatter on. Defaults to 0. -### ONNX Opset Support +### ONNX Opset Support Opset v9/v10 -

ScatterElements

+

ScatterElements

-### Description +### Description Returns the result by updating the values of the input data to values specified by updates at specific index positions specified by indices. -### Parameters +### Parameters -\[Inputs\] +[Inputs] Three inputs @@ -3797,87 +3797,87 @@ indices: tensor of type int32 or int64. updates: tensor of the identical data type as data. -\[Outputs\] +[Outputs] One output y: tensor of the identical data type and shape as input x. -\[Attributes\] +[Attributes] axis: int, specifying which axis to scatter on. Defaults to 0. -### ONNX Opset Support +### ONNX Opset Support Opset v11/v12/v13 -
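A reference NumPy sketch of the update rule (each entry of updates is written into a copy of data, with the coordinate along axis redirected through indices; the sample values are illustrative):

```python
import numpy as np

def scatter_elements(data, indices, updates, axis=0):
    y = data.copy()
    for idx in np.ndindex(*indices.shape):
        pos = list(idx)
        pos[axis] = indices[idx]   # redirect one coordinate through indices
        y[tuple(pos)] = updates[idx]
    return y

data = np.zeros((3, 3), dtype=np.float32)
indices = np.array([[1, 0, 2], [0, 2, 1]], dtype=np.int64)
updates = np.array([[1.0, 1.1, 1.2], [2.0, 2.1, 2.2]], dtype=np.float32)
print(scatter_elements(data, indices, updates, axis=0))
```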

ScatterND

+

ScatterND

-### Description +### Description Creates a copy of the input data, and then updates its values to those specified by updates at specific index positions specified by indices. -### Parameters +### Parameters -\[Inputs\] +[Inputs] Three inputs -data: tensor of type float16 or float32, of rank \>= 1. +data: tensor of type float16 or float32, of rank >= 1. -indices: tensor of type int64, of rank \>= 1. +indices: tensor of type int64, of rank >= 1. -updates: tensor of type float16 or float32, of rank = q + r – indices\_shape\[–1\] – 1. +updates: tensor of type float16 or float32, of rank = q + r – indices_shape[–1] – 1. -\[Outputs\] +[Outputs] One output -y: tensor of the identical data type and shape as input x. +y: tensor of the same data type and shape as input x. -### ONNX Opset Support +### ONNX Opset Support Opset v11 -

Shrink

+

Shrink

-### Description +### Description -Takes one input tensor and outputs one tensor. The formula of this operator is: If x < – lambd, y = x + bias; If x \> lambd, y = x – bias; otherwise, y = 0. +Takes one input tensor and outputs one tensor. The formula of this operator is: if x < –lambd, y = x + bias; if x > lambd, y = x – bias; otherwise, y = 0. -### Parameters +### Parameters -\[Inputs\] +[Inputs] One input data: tensor of type float16 or float. -\[Outputs\] +[Outputs] One output y: tensor of the identical data type and shape as input x. -\[Attributes\] +[Attributes] bias: float. Defaults to 0.0. lambd: float. Defaults to 0.5. -### ONNX Opset Support +### ONNX Opset Support Opset v9/v10/v11/v12/v13 -
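The piecewise formula translates directly to NumPy. A minimal sketch with the default attribute values:

```python
import numpy as np

def shrink(x, bias=0.0, lambd=0.5):
    # y = x + bias where x < -lambd; y = x - bias where x > lambd; else 0
    return np.where(x < -lambd, x + bias,
                    np.where(x > lambd, x - bias, 0.0))

print(shrink(np.array([-1.0, -0.3, 0.0, 0.3, 1.0], dtype=np.float32)))
# [-1.  0.  0.  0.  1.]
```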

Selu

+

Selu

-### Description +### Description -Produces a tensor where the scaled exponential linear unit function: y = gamma \* \(alpha \* e^x – alpha\) for x <= 0, y = gamma \* x for x \> 0, is applied to the input tensor element-wise. +Produces a tensor where the scaled exponential linear unit function: y = gamma * (alpha * e^x – alpha) for x <= 0, y = gamma * x for x > 0, is applied to the input tensor element-wise. -### Parameters +### Parameters -\[Inputs\] +[Inputs] One input @@ -3889,71 +3889,71 @@ alpha: coefficient of SELU gamma: coefficient of SELU -\[Outputs\] +[Outputs] One output y: tensor of the identical data type as the input. -### ONNX Opset Support +### ONNX Opset Support Opset v8/v9/v10/v11/v12/v13 -

Shape

+

Shape

-### Description +### Description Returns a tensor containing the shape of the input tensor. -### Parameters +### Parameters -\[Inputs\] +[Inputs] One input x: tensor -\[Outputs\] +[Outputs] -y: int64 tensor containing the shape of the input tensor. +y: int64 tensor with the shape of the input tensor. -### ONNX Opset Support +### ONNX Opset Support Opset v8/v9/v10/v11/v12/v13 -

Sigmoid

+

Sigmoid

-### Description +### Description Computes sigmoid of the input element-wise. -### Parameters +### Parameters -\[Inputs\] +[Inputs] One input x: tensor of type float16 or float32. -\[Outputs\] +[Outputs] One output -y: tensor of the identical data type as input x. +y: tensor. Has an identical data type to that of input x. -### ONNX Opset Support +### ONNX Opset Support Opset v8/v9/v10/v11/v12/v13 -

Slice

+

Slice

-### Description +### Description Extracts a slice from a tensor. -### Parameters +### Parameters -\[Inputs\] +[Inputs] Five inputs @@ -3963,231 +3963,231 @@ starts: 1D tensor of type int32 or int64, specifying the start index. ends: 1D tensor of type int32 or int64, specifying the end index. -axes: \(optional\) 1D tensor of type int32 or int64. The axis to extract a slice from. Must be in the range \[–r, r – 1\], where r indicates the rank of the input x. +axes: (optional) 1D tensor of type int32 or int64. The axis to extract a slice from. Must be in the range [–r, r – 1], where r indicates the rank of the input x. -steps: \(optional\) 1D tensor of type int32 or int64, specifying the slice step. The slice step of the last axis must be 1. +steps: (optional) 1D tensor of type int32 or int64, specifying the slice step. The slice step of the last axis must be 1. -\[Outputs\] +[Outputs] -y: tensor of the identical data type as input x. +y: tensor. Has an identical data type to that of the input. -\[Restrictions\] +[Restrictions] -x: must have a rank greater than 1. +x: The rank of the input tensor must be greater than 1. -### ONNX Opset Support +### ONNX Opset Support Opset v8/v9/v10/v11/v12/v13 -
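A NumPy sketch of the five-input semantics (negative indices and out-of-range ends are clamped by Python slicing itself; the values below are illustrative):

```python
import numpy as np

def onnx_slice(data, starts, ends, axes, steps):
    slices = [slice(None)] * data.ndim
    for start, end, axis, step in zip(starts, ends, axes, steps):
        slices[axis] = slice(start, end, step)
    return data[tuple(slices)]

x = np.arange(20).reshape(4, 5)
print(onnx_slice(x, starts=[1], ends=[4], axes=[0], steps=[1]))  # rows 1..3
```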

Softmax

+

Softmax

-### Description +### Description Computes softmax activations. -### Parameters +### Parameters -\[Inputs\] +[Inputs] One input x: tensor of type float16, float32, or double. -\[Outputs\] +[Outputs] One output -y: tensor. Has the identical data type and shape as the input x. +y: tensor. Has an identical data type and shape to those of input x. -\[Attributes\] +[Attributes] -axis: \(optional\) int, the dimension softmax would be performed on. Defaults to –1. Must be in the range \[–len\(x.shape\), len\(x.shape\) – 1\]. +axis: (optional) int, the dimension softmax would be performed on. Defaults to –1. Must be in the range [–len(x.shape), len(x.shape) – 1]. -### ONNX Opset Support +### ONNX Opset Support Opset v8/v9/v10/v11/v12/v13 -

Softsign

+

Softsign

-### Description +### Description -Computes softsign: \(x/\(1+|x|\)\) +Computes softsign: (x/(1+|x|)) -### Parameters +### Parameters -\[Inputs\] +[Inputs] One input x: tensor of type float16, float32, or double. -\[Outputs\] +[Outputs] One output -y: tensor. Has the identical data type and shape as the input. +y: tensor. Has an identical data type and shape to the input. -### ONNX Opset Support +### ONNX Opset Support Opset v8/v9/v10/v11/v12/v13 -

Softplus

+

Softplus

-### Description +### Description Computes softplus. -### Parameters +### Parameters -\[Inputs\] +[Inputs] One input X: 1D input tensor -\[Outputs\] +[Outputs] One output Y: 1D tensor -\[Restrictions\] +[Restrictions] Only the float16 and float32 data types are supported. -The output has the identical data type as the input. +The output has an identical data type to that of the input. -### ONNX Opset Support +### ONNX Opset Support Opset v8/v9/v10/v11/v12/v13 -

SpaceToDepth

+

SpaceToDepth

-### Description +### Description Rearranges blocks of spatial data into depth. More specifically, this operator outputs a copy of the input tensor where values from the height and width dimensions are moved to the depth dimension. -### Parameters +### Parameters -\[Inputs\] +[Inputs] input: tensor. Must be one of the following data types: uint8, uint16, uint32, uint64, int8, int16, int32, int64, bfloat16, float16, float, double, string, bool, complex64, complex128. -\[Outputs\] +[Outputs] output: tensor. Must be one of the following data types: uint8, uint16, uint32, uint64, int8, int16, int32, int64, bfloat16, float16, float, double, string, bool, complex64, complex128. -\[Attributes\] +[Attributes] blocksize: int -### ONNX Opset Support +### ONNX Opset Support Opset v8/v9/v10/v11/v12/v13 -

Split

+

Split

-### Description +### Description Splits the input tensor into a list of sub-tensors. -### Parameters +### Parameters -\[Inputs\] +[Inputs] One input x: tensor. Must be one of the following types: float16, float32, int8, int16, int32, int64, uint8, uint16, uint32, uint64. -\[Outputs\] +[Outputs] One output -y: list of tensors of the identical data type as input x. +y: list of tensors. Has an identical data type to that of input x. -\[Attributes\] +[Attributes] split: list of type int8, int16, int32, or int64, for the length of each output along axis. axis: int8, int16, int32, or int64, for the axis along which to split. -\[Restrictions\] +[Restrictions] Each element of split must be greater than or equal to 1. The sum of all split elements must be equal to the size of input x along axis. -axis ∈ \[–len\(x.shape\), len\(x.shape\) – 1\] +axis ∈ [–len(x.shape), len(x.shape) – 1] -### ONNX Opset Support +### ONNX Opset Support Opset v8/v9/v10/v11/v12/v13 -
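The restriction above in concrete numbers, using NumPy as a stand-in: for an input of size 5 along axis 0, split = [2, 3] is valid because each element is >= 1 and 2 + 3 == 5.

```python
import numpy as np

x = np.arange(10).reshape(5, 2)
split = [2, 3]                                      # sums to x.shape[0] == 5
parts = np.split(x, np.cumsum(split)[:-1], axis=0)  # split points at [2]
print([p.shape for p in parts])                     # [(2, 2), (3, 2)]
```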

Sqrt

+

Sqrt

-### Description +### Description Computes element-wise square root of the input tensor. -### Parameters +### Parameters -\[Inputs\] +[Inputs] One input x: tensor -\[Outputs\] +[Outputs] One output y: tensor -\[Restrictions\] +[Restrictions] The output has the identical shape and dtype as the input. The supported data types are float16 and float32. -NaN is returned if x is less than 0. +If x is less than 0, NaN is returned. -### ONNX Opset Support +### ONNX Opset Support Opset v8/v9/v10/v11/v12/v13 -

Squeeze

+

Squeeze

-### Description +### Description Removes dimensions of size 1 from the shape of a tensor. -### Parameters +### Parameters -\[Inputs\] +[Inputs] One input x: tensor. Must be one of the following data types: float16, float32, double, uint8, uint16, uint32, uint64, int8, int16, int32, int64, bool. -\[Outputs\] +[Outputs] -y: tensor of the identical data type as the input. +y: tensor. Has an identical data type to that of the input. -\[Attributes\] +[Attributes] -axes: 1D list of int32s or int64s, indicating the dimensions to squeeze. Negative value means counting dimensions from the back. Accepted range is \[–r, r – 1\] where r = rank\(x\). +axes: 1D list of int32s or int64s, indicating the dimensions to squeeze. Negative value means counting dimensions from the back. Accepted range is [–r, r – 1] where r = rank(x). -### ONNX Opset Support +### ONNX Opset Support Opset v8/v9/v10/v11/v12/v13 -

Sub

+

Sub

-### Description +### Description Performs element-wise subtraction. -### Parameters +### Parameters -\[Inputs\] +[Inputs] Two inputs @@ -4195,217 +4195,217 @@ x1: tensor x2: tensor -\[Outputs\] +[Outputs] One output -y: tensor of the identical data type as the input. +y: tensor. Has an identical data type to that of the input. -\[Restrictions\] +[Restrictions] The output has the identical shape and dtype as the input. The supported data types are int32, float16, and float32. -### ONNX Opset Support +### ONNX Opset Support Opset v8/v9/v10/v11/v12/v13 -

Sign

+

Sign

-### Description +### Description Computes the sign of the input tensor element-wise. -### Parameters +### Parameters -\[Inputs\] +[Inputs] One input x: tensor of type float16 or float32. -\[Outputs\] +[Outputs] One output -y: tensor of the identical data type and shape as input x. +y: tensor of the same data type and shape as input x. -### ONNX Opset Support +### ONNX Opset Support Opset v8/v9/v10/v11/v12/v13 -

Sin

+

Sin

-### Description +### Description Computes sine of the input element-wise. -### Parameters +### Parameters -\[Inputs\] +[Inputs] One input x: tensor of type float16, float32, or double. -\[Outputs\] +[Outputs] One output -y: tensor. Has the identical data type and shape as the input. +y: tensor. Has an identical data type and shape to the input. -### ONNX Opset Support +### ONNX Opset Support Opset v8/v9/v10/v11/v12/v13 -

Sinh

+

Sinh

-### Description +### Description Computes hyperbolic sine of the input element-wise. -### Parameters +### Parameters -\[Inputs\] +[Inputs] One input x: tensor of type float16, float32, or double. -\[Outputs\] +[Outputs] One output -y: tensor. Has the identical data type and shape as the input. +y: tensor. Has an identical data type and shape to the input. -### ONNX Opset Support +### ONNX Opset Support Opset v8/v9/v10/v11/v12/v13 -

Size

+

Size

-### Description +### Description Outputs the number of elements in the input tensor. -### Parameters +### Parameters -\[Inputs\] +[Inputs] One input x: tensor of type float16 or float32. -\[Outputs\] +[Outputs] One output y: scalar of type int64 -### ONNX Opset Support +### ONNX Opset Support Opset v8/v9/v10/v11/v12/v13 -

Sum

+

Sum

-### Description +### Description Computes element-wise sum of each of the input tensors. -### Parameters +### Parameters -\[Inputs\] +[Inputs] One input x: tensor of type float16 or float32. -\[Outputs\] +[Outputs] One output -y: tensor of the identical data type and shape as input x. +y: tensor of the same data type and shape as input x. -### ONNX Opset Support +### ONNX Opset Support Opset v8/v9/v10/v11/v12/v13 -

Tanh

+

Tanh

-### Description +### Description Computes hyperbolic tangent of the input element-wise. -### Parameters +### Parameters -\[Inputs\] +[Inputs] One input x: tensor of type float16 or float32. -\[Outputs\] +[Outputs] One output -y: tensor of the identical data type as the input. +y: tensor. Has an identical data type to that of the input. -### ONNX Opset Support +### ONNX Opset Support Opset v8/v9/v10/v11/v12/v13 -

TfIdfVectorizer

+

TfIdfVectorizer

-### Description +### Description Extracts n-grams from the input sequence and saves them as a vector. -### Parameters +### Parameters -\[Inputs\] +[Inputs] One input data: tensor of type int32 or int64. -\[Outputs\] +[Outputs] One output y: tensor of type float. -\[Attributes\] +[Attributes] -max\_gram\_length: int. Maximum n-gram length. +max_gram_length: int. Maximum n-gram length. -max\_skip\_count: int. Maximum number of items to be skipped when constructing an n-gram from data. +max_skip_count: int. Maximum number of items to be skipped when constructing an n-gram from data. -min\_gram\_length: int. Minimum n-gram length. +min_gram_length: int. Minimum n-gram length. -mode: string. The weighting criteria. It can be "TF" \(term frequency\), "IDF" \(inverse document frequency\), or "TFIDF" \(the combination of TF and IDF\). +mode: string. The weighting criteria. It can be "TF" (term frequency), "IDF" (inverse document frequency), or "TFIDF" (the combination of TF and IDF). -ngram\_counts: list of ints. The starting indexes of n-grams in pool. It is useful when determining the boundary between two consecutive collections of n-grams. +ngram_counts: list of ints. The starting indexes of n-grams in pool. It is useful when determining the boundary between two consecutive collections of n-grams. -ngram\_indexes: list of ints. The i-th element in ngram\_indexes indicates the coordinate of the i-th n-gram in the output tensor. +ngram_indexes: list of ints. The i-th element in ngram_indexes indicates the coordinate of the i-th n-gram in the output tensor. -pool\_int64s: list of ints, indicating n-grams learned from the training set. This attribute and pool\_strings are mutually exclusive. +pool_int64s: list of ints, indicating n-grams learned from the training set. This attribute and pool_strings are mutually exclusive. -pool\_strings: list of strings. Has the same meaning as pool\_int64s. +pool_strings: list of strings. Has the same meaning as pool_int64s. weights: list of floats. Stores the weight of each n-gram in pool. -### ONNX Opset Support +### ONNX Opset Support Opset v9/v10/v11/v12/v13 -

Tile

+

Tile

-### Description +### Description Constructs a tensor by tiling a given tensor. -### Parameters +### Parameters -\[Inputs\] +[Inputs] Two inputs @@ -4413,53 +4413,53 @@ x: tensor repeats: 1D tensor of type int64. Has the same size as the number of dimensions in x. -\[Outputs\] +[Outputs] One output -y: tensor of the identical type and dimension as the input. output\_dim\[i\] = input\_dim\[i\] \* repeats\[i\] +y: tensor of the identical type and dimension as the input. output_dim[i] = input_dim[i] * repeats[i] -### ONNX Opset Support +### ONNX Opset Support Opset v8/v9/v10/v11/v12/v13 -

ThresholdedRelu

+

ThresholdedRelu

-### Description +### Description -When x \> alpha, y = x; otherwise, y = 0. +When x > alpha, y = x; otherwise, y = 0. -### Parameters +### Parameters -\[Inputs\] +[Inputs] One input x: tensor of type float16 or float32. -\[Outputs\] +[Outputs] One output -y: tensor of the identical data type and shape as input x. +y: tensor of the same data type and shape as input x. -\[Attributes\] +[Attributes] alpha: float, indicating the threshold. Defaults to 1.0. -### ONNX Opset Support +### ONNX Opset Support Opset v10/v11/v12/v13 -

TopK

+

TopK

-### Description +### Description Retrieves the top-K largest or smallest elements along a specified axis. -### Parameters +### Parameters -\[Inputs\] +[Inputs] Two inputs @@ -4467,7 +4467,7 @@ x: tensor of type float16 or float32. k: tensor of type int64. -\[Outputs\] +[Outputs] Two outputs @@ -4475,7 +4475,7 @@ Values: tensor containing top K values from the input tensor. Indexes: tensor containing the corresponding input tensor indices for the top K values. -\[Attributes\] +[Attributes] axis: int. The dimension on which to do the sort. Defaults to –1. @@ -4483,43 +4483,43 @@ largest: int. Whether to return the top-K largest or smallest elements. Defaults sorted: int. Whether to return the elements in sorted order. Defaults to 1. -### ONNX Opset Support +### ONNX Opset Support Opset v8/v9/v10/v11/v12/v13 -
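torch.topk mirrors these semantics and is a convenient way to sanity-check expected values and indices (a sketch; note the ONNX operator takes k as a tensor input rather than a Python argument):

```python
import torch

x = torch.tensor([1.0, 5.0, 3.0, 4.0])
values, indices = torch.topk(x, k=2, largest=True, sorted=True)
print(values)   # tensor([5., 4.])
print(indices)  # tensor([1, 3])
```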

Transpose

+

Transpose

-### Description +### Description Transposes the input. -### Parameters +### Parameters -\[Inputs\] +[Inputs] data: tensor. Must be one of the following types: float16, float32, int8, int16, int32, int64, uint8, uint16, uint32, uint64. -\[Outputs\] +[Outputs] transposed: tensor after transposition. -\[Attributes\] +[Attributes] -perm: \(required\) list of integers, for the dimension sequence of data. +perm: (required) list of integers, for the dimension sequence of data. -### ONNX Opset Support +### ONNX Opset Support Opset v8/v9/v10/v11/v12/v13 -

Pad

+

Pad

-### Description +### Description Pads a tensor. -### Parameters +### Parameters -\[Inputs\] +[Inputs] Two inputs @@ -4527,89 +4527,89 @@ x: tensor of type float16, float32, or int32. pads: tensor of type int32 or int64. -constant\_value: optional. Defaults to **0**, an empty string, or **False**. If the selected mode is **constant**, the scalar value is used. +constant_value: optional. Defaults to 0, an empty string, or False. If the selected mode is constant, the scalar value is used. -\[Outputs\] +[Outputs] One output -y: tensor of the identical data type as input x. +y: tensor. Has an identical data type to that of input x. -\[Attributes\] +[Attributes] mode: str type. The following modes are supported: constant, reflect, and edge. -\[Restrictions\] +[Restrictions] -If the value of mode is **constant**, the value of **constant\_value** can only be **0**. +If the value of mode is constant, the value of constant_value can only be 0. -### ONNX Opset Support +### ONNX Opset Support Opset v11 -

Pow

+

Pow

-### Description +### Description Computes x1 to the x2th power. -### Parameters +### Parameters -\[Inputs\] +[Inputs] Two inputs x1: tensor of type float16, float32, double, int32, int8, or uint8. -x2: tensor of the identical data type as input x1. +x2: tensor. Has an identical data type to that of input x1. -\[Outputs\] +[Outputs] One output -y: tensor of the identical data type as input x1. +y: tensor. Has an identical data type to that of x1. -### ONNX Opset Support +### ONNX Opset Support Opset v8/v9/v10/v11/v12/v13 -

Unsqueeze

+

Unsqueeze

-### Description +### Description Inserts single-dimensional entries to the shape of an input tensor. -### Parameters +### Parameters -\[Inputs\] +[Inputs] One input x: tensor of type uint8, uint16, uint32, int8, int16, int32, float16, or float32. -\[Outputs\] +[Outputs] One output -y: tensor of the identical data type as input x. +y: tensor. Has an identical data type to that of input x. -\[Attributes\] +[Attributes] -axes: list of integers indicating the dimensions to be inserted. Accepted range is \[–input\_rank, input\_rank\]\(inclusive\) where r = rank\(x\). +axes: list of integers indicating the dimensions to be inserted. Accepted range is [–input_rank, input_rank] (inclusive), where input_rank = rank(x). -### ONNX Opset Support +### ONNX Opset Support Opset v8/v9/v10/v11/v12 -

Xor

+

Xor

-### Description +### Description Computes the element-wise logical XOR of the given input tensors. -### Parameters +### Parameters -\[Inputs\] +[Inputs] Two inputs @@ -4617,23 +4617,23 @@ a: tensor of type bool. b: tensor of type bool. -\[Outputs\] +[Outputs] c: tensor of type bool. -### ONNX Opset Support +### ONNX Opset Support Opset v8/v9/v10/v11/v12/v13 -

Where

+

Where

-### Description +### Description Returns elements chosen from x or y depending on condition. -### Parameters +### Parameters -\[Inputs\] +[Inputs] Three inputs @@ -4641,13 +4641,12 @@ condition: bool. x: tensor of type float16, float32, int8, int32, or uint8. Elements from which to choose when condition is true. -y: tensor of the identical data type as x. Elements from which to choose when condition is false. +y: tensor. Has an identical data type to that of x. Elements from which to choose when condition is false. -\[Outputs\] +[Outputs] -Tensor of the identical data type as input x. +Tensor that has an identical data type to that of input x. -### ONNX Opset Support +### ONNX Opset Support Opset v8/v9/v10/v11/v12/v13 - diff --git a/docs/en/PyTorch 1.5.0 API Support.md b/docs/en/PyTorch 1.5.0 API Support.md new file mode 100644 index 0000000000000000000000000000000000000000..2a0fde3c703b38e4864e269fbcc2ada65ca34a4f --- /dev/null +++ b/docs/en/PyTorch 1.5.0 API Support.md @@ -0,0 +1,3284 @@ +## [Tensors](https://pytorch.org/docs/1.5.0/torch.html) + +| No. | API | Supported/Unsupported | +| ---- | ----------------------------- | ------------------------------------------------------------ | +| 1 | torch.is_tensor | Supported | +| 2 | torch.is_storage | Supported | +| 3 | torch.is_complex | Supported (The judgment is supported, but the complex number is not supported by the current hardware.) | +| 4 | torch.is_floating_point | Supported | +| 5 | torch.set_default_dtype | Supported | +| 6 | torch.get_default_dtype | Supported | +| 7 | torch.set_default_tensor_type | Supported | +| 8 | torch.numel | Supported | +| 9 | torch.set_printoptions | Supported | +| 10 | torch.set_flush_denormal | Supported | +| 11 | torch.tensor | Supported | +| 12 | torch.sparse_coo_tensor | Unsupported | +| 13 | torch.as_tensor | Supported | +| 14 | torch.as_strided | Supported | +| 15 | torch.from_numpy | Supported | +| 16 | torch.zeros | Supported | +| 17 | torch.zeros_like | Supported | +| 18 | torch.ones | Supported | +| 19 | torch.ones_like | Supported | +| 20 | torch.arange | Supported | +| 21 | torch.range | Supported | +| 22 | torch.linspace | Supported | +| 23 | torch.logspace | Supported | +| 24 | torch.eye | Supported | +| 25 | torch.empty | Supported | +| 26 | torch.empty_like | Supported | +| 27 | torch.empty_strided | Supported | +| 28 | torch.full | Supported | +| 29 | torch.full_like | Supported | +| 30 | torch.quantize_per_tensor | Supported | +| 31 | torch.quantize_per_channel | Supported | +| 32 | torch.cat | Supported | +| 33 | torch.chunk | Supported | +| 34 | torch.gather | Supported | +| 35 | torch.index_select | Supported | +| 36 | torch.masked_select | Supported | +| 37 | torch.narrow | Supported | +| 38 | torch.nonzero | Supported | +| 39 | torch.reshape | Supported | +| 40 | torch.split | Supported | +| 41 | torch.squeeze | Supported | +| 42 | torch.stack | Supported | +| 43 | torch.t | Supported | +| 44 | torch.take | Supported | +| 45 | torch.transpose | Supported | +| 46 | torch.unbind | Supported | +| 47 | torch.unsqueeze | Supported | +| 48 | torch.where | Supported | + +## Generators + +| No. 
| API | Supported/Unsupported | +| ---- | ------------------------------- | --------------------- | +| 1 | torch._C.Generator | Supported | +| 2 | torch._C.Generator.device | Supported | +| 3 | torch._C.Generator.get_state | Unsupported | +| 4 | torch._C.Generator.initial_seed | Supported | +| 5 | torch._C.Generator.manual_seed | Supported | +| 6 | torch._C.Generator.seed | Supported | +| 7 | torch._C.Generator.set_state | Unsupported | + +## Random Sampling + +| No. | API | Supported/Unsupported | +| ---- | ------------------------------------------ | --------------------- | +| 1 | torch.seed | Supported | +| 2 | torch.manual_seed | Supported | +| 3 | torch.initial_seed | Supported | +| 4 | torch.get_rng_state | Supported | +| 5 | torch.set_rng_state | Supported | +| 6 | torch.torch.default_generator | Supported | +| 7 | torch.bernoulli | Supported | +| 8 | torch.multinomial | Supported | +| 9 | torch.normal | Supported | +| 10 | torch.poisson | Unsupported | +| 11 | torch.rand | Supported | +| 12 | torch.rand_like | Supported | +| 13 | torch.randint | Supported | +| 14 | torch.randint_like | Supported | +| 15 | torch.randn | Supported | +| 16 | torch.randn_like | Supported | +| 17 | torch.randperm | Supported | +| 18 | torch.Tensor.bernoulli_() | Supported | +| 19 | torch.Tensor.bernoulli_() | Supported | +| 20 | torch.Tensor.exponential_() | Unsupported | +| 21 | torch.Tensor.geometric_() | Unsupported | +| 22 | torch.Tensor.log_normal_() | Unsupported | +| 23 | torch.Tensor.normal_() | Supported | +| 24 | torch.Tensor.random_() | Supported | +| 25 | torch.Tensor.uniform_() | Supported | +| 26 | torch.quasirandom.SobolEngine | Supported | +| 27 | torch.quasirandom.SobolEngine.draw | Supported | +| 28 | torch.quasirandom.SobolEngine.fast_forward | Supported | +| 29 | torch.quasirandom.SobolEngine.reset | Supported | + +## Serialization + +| No. | API | Supported/Unsupported | +| ---- | ---------- | --------------------- | +| 1 | torch.save | Supported | +| 2 | torch.load | Supported | + +## Math Operations + +| No. 
| API | Supported/Unsupported | +| ---- | ------------------------ | --------------------- | +| 1 | torch.abs | Supported | +| 2 | torch.acos | Supported | +| 3 | torch.add | Supported | +| 4 | torch.addcdiv | Supported | +| 5 | torch.addcmul | Supported | +| 6 | torch.angle | Unsupported | +| 7 | torch.asin | Supported | +| 8 | torch.atan | Supported | +| 9 | torch.atan2 | Supported | +| 10 | torch.bitwise_not | Supported | +| 11 | torch.bitwise_and | Supported | +| 12 | torch.bitwise_or | Supported | +| 13 | torch.bitwise_xor | Supported | +| 14 | torch.ceil | Supported | +| 15 | torch.clamp | Supported | +| 16 | torch.conj | Unsupported | +| 17 | torch.cos | Supported | +| 18 | torch.cosh | Supported | +| 19 | torch.div | Supported | +| 20 | torch.digamma | Unsupported | +| 21 | torch.erf | Supported | +| 22 | torch.erfc | Supported | +| 23 | torch.erfinv | Supported | +| 24 | torch.exp | Supported | +| 25 | torch.expm1 | Supported | +| 26 | torch.floor | Supported | +| 27 | torch.floor_divide | Supported | +| 28 | torch.fmod | Supported | +| 29 | torch.frac | Supported | +| 30 | torch.imag | Unsupported | +| 31 | torch.lerp | Supported | +| 32 | torch.lgamma | Unsupported | +| 33 | torch.log | Supported | +| 34 | torch.log10 | Supported | +| 35 | torch.log1p | Supported | +| 36 | torch.log2 | Supported | +| 37 | torch.logical_and | Supported | +| 38 | torch.logical_not | Supported | +| 39 | torch.logical_or | Supported | +| 40 | torch.logical_xor | Supported | +| 41 | torch.mul | Supported | +| 42 | torch.mvlgamma | Unsupported | +| 43 | torch.neg | Supported | +| 44 | torch.polygamma | Unsupported | +| 45 | torch.pow | Supported | +| 46 | torch.real | Supported | +| 47 | torch.reciprocal | Supported | +| 48 | torch.remainder | Supported | +| 49 | torch.round | Supported | +| 50 | torch.rsqrt | Supported | +| 51 | torch.sigmoid | Supported | +| 52 | torch.sign | Supported | +| 53 | torch.sin | Supported | +| 54 | torch.sinh | Supported | +| 55 | torch.sqrt | Supported | +| 56 | torch.square | Supported | +| 57 | torch.tan | Supported | +| 58 | torch.tanh | Supported | +| 59 | torch.true_divide | Supported | +| 60 | torch.trunc | Supported | +| 61 | torch.argmax | Supported | +| 62 | torch.argmin | Supported | +| 63 | torch.dist | Supported | +| 64 | torch.logsumexp | Supported | +| 65 | torch.mean | Supported | +| 66 | torch.median | Supported | +| 67 | torch.mode | Unsupported | +| 68 | torch.norm | Supported | +| 69 | torch.prod | Supported | +| 70 | torch.std | Supported | +| 71 | torch.std_mean | Supported | +| 72 | torch.sum | Supported | +| 73 | torch.unique | Supported | +| 74 | torch.unique_consecutive | Unsupported | +| 75 | torch.var | Unsupported | +| 76 | torch.var_mean | Unsupported | +| 77 | torch.allclose | Supported | +| 78 | torch.argsort | Supported | +| 79 | torch.eq | Supported | +| 80 | torch.equal | Supported | +| 81 | torch.ge | Supported | +| 82 | torch.gt | Supported | +| 83 | torch.isfinite | Supported | +| 84 | torch.isinf | Supported | +| 85 | torch.isnan | Supported | +| 86 | torch.kthvalue | Supported | +| 87 | torch.le | Supported | +| 88 | torch.lt | Supported | +| 89 | torch.max | Supported | +| 90 | torch.min | Supported | +| 91 | torch.ne | Supported | +| 92 | torch.sort | Supported | +| 93 | torch.topk | Supported | +| 94 | torch.fft | Unsupported | +| 95 | torch.ifft | Unsupported | +| 96 | torch.rfft | Unsupported | +| 97 | torch.irfft | Unsupported | +| 98 | torch.stft | Unsupported | +| 99 | torch.bartlett_window | Supported | +| 100 | 
torch.blackman_window | Supported | +| 101 | torch.hamming_window | Supported | +| 102 | torch.hann_window | Supported | +| 103 | torch.bincount | Supported | +| 104 | torch.broadcast_tensors | Supported | +| 105 | torch.cartesian_prod | Supported | +| 106 | torch.cdist | Supported | +| 107 | torch.combinations | Unsupported | +| 108 | torch.cross | Supported | +| 109 | torch.cummax | Supported | +| 110 | torch.cummin | Supported | +| 111 | torch.cumprod | Supported | +| 112 | torch.cumsum | Supported | +| 113 | torch.diag | Supported | +| 114 | torch.diag_embed | Supported | +| 115 | torch.diagflat | Supported | +| 116 | torch.diagonal | Supported | +| 117 | torch.einsum | Supported | +| 118 | torch.flatten | Supported | +| 119 | torch.flip | Supported | +| 120 | torch.rot90 | Supported | +| 121 | torch.histc | Unsupported | +| 122 | torch.meshgrid | Supported | +| 123 | torch.renorm | Supported | +| 124 | torch.repeat_interleave | Supported | +| 125 | torch.roll | Supported | +| 126 | torch.tensordot | Supported | +| 127 | torch.trace | Unsupported | +| 128 | torch.tril | Supported | +| 129 | torch.tril_indices | Supported | +| 130 | torch.triu | Supported | +| 131 | torch.triu_indices | Supported | +| 132 | torch.addbmm | Supported | +| 133 | torch.addmm | Supported | +| 134 | torch.addmv | Supported | +| 135 | torch.addr | Supported | +| 136 | torch.baddbmm | Supported | +| 137 | torch.bmm | Supported | +| 138 | torch.chain_matmul | Supported | +| 139 | torch.cholesky | Unsupported | +| 140 | torch.cholesky_inverse | Unsupported | +| 141 | torch.cholesky_solve | Unsupported | +| 142 | torch.dot | Supported | +| 143 | torch.eig | Unsupported | +| 144 | torch.geqrf | Unsupported | +| 145 | torch.ger | Supported | +| 146 | torch.inverse | Supported | +| 147 | torch.det | Unsupported | +| 148 | torch.logdet | Unsupported | +| 149 | torch.slogdet | Supported | +| 150 | torch.lstsq | Unsupported | +| 151 | torch.lu | Unsupported | +| 152 | torch.lu_solve | Unsupported | +| 153 | torch.lu_unpack | Unsupported | +| 154 | torch.matmul | Supported | +| 155 | torch.matrix_power | Supported | +| 156 | torch.matrix_rank | Supported | +| 157 | torch.mm | Supported | +| 158 | torch.mv | Supported | +| 159 | torch.orgqr | Unsupported | +| 160 | torch.ormqr | Unsupported | +| 161 | torch.pinverse | Supported | +| 162 | torch.qr | Supported | +| 163 | torch.solve | Unsupported | +| 164 | torch.svd | Supported | +| 165 | torch.svd_lowrank | Supported | +| 166 | torch.pca_lowrank | Supported | +| 167 | torch.symeig | Supported | +| 168 | torch.lobpcg | Unsupported | +| 169 | torch.trapz | Supported | +| 170 | torch.triangular_solve | Supported | + +## Utilities + +| No. | API | Supported/Unsupported | +| ---- | ----------------------------- | --------------------- | +| 1 | torch.compiled_with_cxx11_abi | Supported | +| 2 | torch.result_type | Supported | +| 3 | torch.can_cast | Supported | +| 4 | torch.promote_types | Supported | + +## Other + +| No. | API | Supported/Unsupported | +| ---- | ----------------------------- | --------------------- | +| 1 | torch.no_grad | Supported | +| 2 | torch.enable_grad | Supported | +| 3 | torch.set_grad_enabled | Supported | +| 4 | torch.get_num_threads | Supported | +| 5 | torch.set_num_threads | Supported | +| 6 | torch.get_num_interop_threads | Supported | +| 7 | torch.set_num_interop_threads | Supported | + +## torch.Tensor + +| No. 
| API | Supported/Unsupported | +| ---- | -------------------------------------- | --------------------- | +| 1 | torch.Tensor | Supported | +| 2 | torch.Tensor.new_tensor | Supported | +| 3 | torch.Tensor.new_full | Supported | +| 4 | torch.Tensor.new_empty | Supported | +| 5 | torch.Tensor.new_ones | Supported | +| 6 | torch.Tensor.new_zeros | Supported | +| 7 | torch.Tensor.is_cuda | Supported | +| 8 | torch.Tensor.is_quantized | Supported | +| 9 | torch.Tensor.device | Supported | +| 10 | torch.Tensor.ndim | Supported | +| 11 | torch.Tensor.T | Supported | +| 12 | torch.Tensor.abs | Supported | +| 13 | torch.Tensor.abs_ | Supported | +| 14 | torch.Tensor.acos | Supported | +| 15 | torch.Tensor.acos_ | Supported | +| 16 | torch.Tensor.add | Supported | +| 17 | torch.Tensor.add_ | Supported | +| 18 | torch.Tensor.addbmm | Supported | +| 19 | torch.Tensor.addbmm_ | Supported | +| 20 | torch.Tensor.addcdiv | Supported | +| 21 | torch.Tensor.addcdiv_ | Supported | +| 22 | torch.Tensor.addcmul | Supported | +| 23 | torch.Tensor.addcmul_ | Supported | +| 24 | torch.Tensor.addmm | Supported | +| 25 | torch.Tensor.addmm_ | Supported | +| 26 | torch.Tensor.addmv | Supported | +| 27 | torch.Tensor.addmv_ | Supported | +| 28 | torch.Tensor.addr | Supported | +| 29 | torch.Tensor.addr_ | Supported | +| 30 | torch.Tensor.allclose | Supported | +| 31 | torch.Tensor.angle | Unsupported | +| 32 | torch.Tensor.apply_ | Unsupported | +| 33 | torch.Tensor.argmax | Supported | +| 34 | torch.Tensor.argmin | Supported | +| 35 | torch.Tensor.argsort | Supported | +| 36 | torch.Tensor.asin | Supported | +| 37 | torch.Tensor.asin_ | Supported | +| 38 | torch.Tensor.as_strided | Supported | +| 39 | torch.Tensor.atan | Supported | +| 40 | torch.Tensor.atan2 | Supported | +| 41 | torch.Tensor.atan2_ | Supported | +| 42 | torch.Tensor.atan_ | Supported | +| 43 | torch.Tensor.baddbmm | Supported | +| 44 | torch.Tensor.baddbmm_ | Supported | +| 45 | torch.Tensor.bernoulli | Supported | +| 46 | torch.Tensor.bernoulli_ | Supported | +| 47 | torch.Tensor.bfloat16 | Unsupported | +| 48 | torch.Tensor.bincount | Supported | +| 49 | torch.Tensor.bitwise_not | Supported | +| 50 | torch.Tensor.bitwise_not_ | Supported | +| 51 | torch.Tensor.bitwise_and | Supported | +| 52 | torch.Tensor.bitwise_and_ | Supported | +| 53 | torch.Tensor.bitwise_or | Supported | +| 54 | torch.Tensor.bitwise_or_ | Supported | +| 55 | torch.Tensor.bitwise_xor | Supported | +| 56 | torch.Tensor.bitwise_xor_ | Supported | +| 57 | torch.Tensor.bmm | Supported | +| 58 | torch.Tensor.bool | Supported | +| 59 | torch.Tensor.byte | Supported | +| 60 | torch.Tensor.cauchy_ | Unsupported | +| 61 | torch.Tensor.ceil | Supported | +| 62 | torch.Tensor.ceil_ | Supported | +| 63 | torch.Tensor.char | Supported | +| 64 | torch.Tensor.cholesky | Unsupported | +| 65 | torch.Tensor.cholesky_inverse | Unsupported | +| 66 | torch.Tensor.cholesky_solve | Unsupported | +| 67 | torch.Tensor.chunk | Supported | +| 68 | torch.Tensor.clamp | Supported | +| 69 | torch.Tensor.clamp_ | Supported | +| 70 | torch.Tensor.clone | Supported | +| 71 | torch.Tensor.contiguous | Supported | +| 72 | torch.Tensor.copy_ | Supported | +| 73 | torch.Tensor.conj | Unsupported | +| 74 | torch.Tensor.cos | Supported | +| 75 | torch.Tensor.cos_ | Supported | +| 76 | torch.Tensor.cosh | Supported | +| 77 | torch.Tensor.cosh_ | Supported | +| 78 | torch.Tensor.cpu | Supported | +| 79 | torch.Tensor.cross | Supported | +| 80 | torch.Tensor.cuda | Unsupported | +| 81 | torch.Tensor.cummax | 
Supported | +| 82 | torch.Tensor.cummin | Supported | +| 83 | torch.Tensor.cumprod | Supported | +| 84 | torch.Tensor.cumsum | Supported | +| 85 | torch.Tensor.data_ptr | Supported | +| 86 | torch.Tensor.dequantize | Unsupported | +| 87 | torch.Tensor.det | Unsupported | +| 88 | torch.Tensor.dense_dim | Unsupported | +| 89 | torch.Tensor.diag | Supported | +| 90 | torch.Tensor.diag_embed | Supported | +| 91 | torch.Tensor.diagflat | Supported | +| 92 | torch.Tensor.diagonal | Supported | +| 93 | torch.Tensor.fill_diagonal_ | Supported | +| 94 | torch.Tensor.digamma | Unsupported | +| 95 | torch.Tensor.digamma_ | Unsupported | +| 96 | torch.Tensor.dim | Supported | +| 97 | torch.Tensor.dist | Supported | +| 98 | torch.Tensor.div | Supported | +| 99 | torch.Tensor.div_ | Supported | +| 100 | torch.Tensor.dot | Supported | +| 101 | torch.Tensor.double | Unsupported | +| 102 | torch.Tensor.eig | Unsupported | +| 103 | torch.Tensor.element_size | Supported | +| 104 | torch.Tensor.eq | Supported | +| 105 | torch.Tensor.eq_ | Supported | +| 106 | torch.Tensor.equal | Supported | +| 107 | torch.Tensor.erf | Supported | +| 108 | torch.Tensor.erf_ | Supported | +| 109 | torch.Tensor.erfc | Supported | +| 110 | torch.Tensor.erfc_ | Supported | +| 111 | torch.Tensor.erfinv | Supported | +| 112 | torch.Tensor.erfinv_ | Supported | +| 113 | torch.Tensor.exp | Supported | +| 114 | torch.Tensor.exp_ | Supported | +| 115 | torch.Tensor.expm1 | Supported | +| 116 | torch.Tensor.expm1_ | Supported | +| 117 | torch.Tensor.expand | Supported | +| 118 | torch.Tensor.expand_as | Supported | +| 119 | torch.Tensor.exponential_ | Unsupported | +| 120 | torch.Tensor.fft | Unsupported | +| 121 | torch.Tensor.fill_ | Supported | +| 122 | torch.Tensor.flatten | Supported | +| 123 | torch.Tensor.flip | Supported | +| 124 | torch.Tensor.float | Supported | +| 125 | torch.Tensor.floor | Supported | +| 126 | torch.Tensor.floor_ | Supported | +| 127 | torch.Tensor.floor_divide | Supported | +| 128 | torch.Tensor.floor_divide_ | Supported | +| 129 | torch.Tensor.fmod | Supported | +| 130 | torch.Tensor.fmod_ | Supported | +| 131 | torch.Tensor.frac | Supported | +| 132 | torch.Tensor.frac_ | Supported | +| 133 | torch.Tensor.gather | Supported | +| 134 | torch.Tensor.ge | Supported | +| 135 | torch.Tensor.ge_ | Supported | +| 136 | torch.Tensor.geometric_ | Unsupported | +| 137 | torch.Tensor.geqrf | Unsupported | +| 138 | torch.Tensor.ger | Supported | +| 139 | torch.Tensor.get_device | Supported | +| 140 | torch.Tensor.gt | Supported | +| 141 | torch.Tensor.gt_ | Supported | +| 142 | torch.Tensor.half | Supported | +| 143 | torch.Tensor.hardshrink | Supported | +| 144 | torch.Tensor.histc | Unsupported | +| 145 | torch.Tensor.ifft | Unsupported | +| 146 | torch.Tensor.index_add_ | Supported | +| 147 | torch.Tensor.index_add | Supported | +| 148 | torch.Tensor.index_copy_ | Supported | +| 149 | torch.Tensor.index_copy | Supported | +| 150 | torch.Tensor.index_fill_ | Supported | +| 151 | torch.Tensor.index_fill | Supported | +| 152 | torch.Tensor.index_put_ | Supported | +| 153 | torch.Tensor.index_put | Supported | +| 154 | torch.Tensor.index_select | Supported | +| 155 | torch.Tensor.indices | Unsupported | +| 156 | torch.Tensor.int | Supported | +| 157 | torch.Tensor.int_repr | Unsupported | +| 158 | torch.Tensor.inverse | Supported | +| 159 | torch.Tensor.irfft | Unsupported | +| 160 | torch.Tensor.is_contiguous | Supported | +| 161 | torch.Tensor.is_complex | Supported | +| 162 | torch.Tensor.is_floating_point | 
Supported | +| 163 | torch.Tensor.is_pinned | Supported | +| 164 | torch.Tensor.is_set_to | Unsupported | +| 165 | torch.Tensor.is_shared | Supported | +| 166 | torch.Tensor.is_signed | Supported | +| 167 | torch.Tensor.is_sparse | Supported | +| 168 | torch.Tensor.item | Supported | +| 169 | torch.Tensor.kthvalue | Supported | +| 170 | torch.Tensor.le | Supported | +| 171 | torch.Tensor.le_ | Supported | +| 172 | torch.Tensor.lerp | Supported | +| 173 | torch.Tensor.lerp_ | Supported | +| 174 | torch.Tensor.lgamma | Unsupported | +| 175 | torch.Tensor.lgamma_ | Unsupported | +| 176 | torch.Tensor.log | Supported | +| 177 | torch.Tensor.log_ | Supported | +| 178 | torch.Tensor.logdet | Unsupported | +| 179 | torch.Tensor.log10 | Supported | +| 180 | torch.Tensor.log10_ | Supported | +| 181 | torch.Tensor.log1p | Supported | +| 182 | torch.Tensor.log1p_ | Supported | +| 183 | torch.Tensor.log2 | Supported | +| 184 | torch.Tensor.log2_ | Supported | +| 185 | torch.Tensor.log_normal_ | Supported | +| 186 | torch.Tensor.logsumexp | Supported | +| 187 | torch.Tensor.logical_and | Supported | +| 188 | torch.Tensor.logical_and_ | Supported | +| 189 | torch.Tensor.logical_not | Supported | +| 190 | torch.Tensor.logical_not_ | Supported | +| 191 | torch.Tensor.logical_or | Supported | +| 192 | torch.Tensor.logical_or_ | Supported | +| 193 | torch.Tensor.logical_xor | Unsupported | +| 194 | torch.Tensor.logical_xor_ | Unsupported | +| 195 | torch.Tensor.long | Supported | +| 196 | torch.Tensor.lstsq | Unsupported | +| 197 | torch.Tensor.lt | Supported | +| 198 | torch.Tensor.lt_ | Supported | +| 199 | torch.Tensor.lu | Supported | +| 200 | torch.Tensor.lu_solve | Supported | +| 201 | torch.Tensor.map_ | Unsupported | +| 202 | torch.Tensor.masked_scatter_ | Supported | +| 203 | torch.Tensor.masked_scatter | Supported | +| 204 | torch.Tensor.masked_fill_ | Supported | +| 205 | torch.Tensor.masked_fill | Supported | +| 206 | torch.Tensor.masked_select | Supported | +| 207 | torch.Tensor.matmul | Supported | +| 208 | torch.Tensor.matrix_power | Supported | +| 209 | torch.Tensor.max | Supported | +| 210 | torch.Tensor.mean | Supported | +| 211 | torch.Tensor.median | Supported | +| 212 | torch.Tensor.min | Supported | +| 213 | torch.Tensor.mm | Supported | +| 214 | torch.Tensor.mode | Unsupported | +| 215 | torch.Tensor.mul | Supported | +| 216 | torch.Tensor.mul_ | Supported | +| 217 | torch.Tensor.multinomial | Supported | +| 218 | torch.Tensor.mv | Supported | +| 219 | torch.Tensor.mvlgamma | Unsupported | +| 220 | torch.Tensor.mvlgamma_ | Unsupported | +| 221 | torch.Tensor.narrow | Supported | +| 222 | torch.Tensor.narrow_copy | Supported | +| 223 | torch.Tensor.ndimension | Supported | +| 224 | torch.Tensor.ne | Supported | +| 225 | torch.Tensor.ne_ | Supported | +| 226 | torch.Tensor.neg | Supported | +| 227 | torch.Tensor.neg_ | Supported | +| 228 | torch.Tensor.nelement | Supported | +| 229 | torch.Tensor.nonzero | Supported | +| 230 | torch.Tensor.norm | Supported | +| 231 | torch.Tensor.normal_ | Supported | +| 232 | torch.Tensor.numel | Supported | +| 233 | torch.Tensor.numpy | Unsupported | +| 234 | torch.Tensor.orgqr | Unsupported | +| 235 | torch.Tensor.ormqr | Unsupported | +| 236 | torch.Tensor.permute | Supported | +| 237 | torch.Tensor.pin_memory | Unsupported | +| 238 | torch.Tensor.pinverse | Supported | +| 239 | torch.Tensor.polygamma | Unsupported | +| 240 | torch.Tensor.polygamma_ | Unsupported | +| 241 | torch.Tensor.pow | Supported | +| 242 | torch.Tensor.pow_ | Supported | +| 
243 | torch.Tensor.prod | Supported | +| 244 | torch.Tensor.put_ | Supported | +| 245 | torch.Tensor.qr | Supported | +| 246 | torch.Tensor.qscheme | Unsupported | +| 247 | torch.Tensor.q_scale | Unsupported | +| 248 | torch.Tensor.q_zero_point | Unsupported | +| 249 | torch.Tensor.q_per_channel_scales | Unsupported | +| 250 | torch.Tensor.q_per_channel_zero_points | Unsupported | +| 251 | torch.Tensor.q_per_channel_axis | Unsupported | +| 252 | torch.Tensor.random_ | Supported | +| 253 | torch.Tensor.reciprocal | Supported | +| 254 | torch.Tensor.reciprocal_ | Supported | +| 255 | torch.Tensor.record_stream | Supported | +| 256 | torch.Tensor.remainder | Supported | +| 257 | torch.Tensor.remainder_ | Supported | +| 258 | torch.Tensor.renorm | Supported | +| 259 | torch.Tensor.renorm_ | Supported | +| 260 | torch.Tensor.repeat | Supported | +| 261 | torch.Tensor.repeat_interleave | Supported | +| 262 | torch.Tensor.requires_grad_ | Supported | +| 263 | torch.Tensor.reshape | Supported | +| 264 | torch.Tensor.reshape_as | Supported | +| 265 | torch.Tensor.resize_ | Supported | +| 266 | torch.Tensor.resize_as_ | Supported | +| 267 | torch.Tensor.rfft | Unsupported | +| 268 | torch.Tensor.roll | Supported | +| 269 | torch.Tensor.rot90 | Supported | +| 270 | torch.Tensor.round | Supported | +| 271 | torch.Tensor.round_ | Supported | +| 272 | torch.Tensor.rsqrt | Supported | +| 273 | torch.Tensor.rsqrt_ | Supported | +| 274 | torch.Tensor.scatter | Supported | +| 275 | torch.Tensor.scatter_ | Supported | +| 276 | torch.Tensor.scatter_add_ | Supported | +| 277 | torch.Tensor.scatter_add | Supported | +| 278 | torch.Tensor.select | Supported | +| 279 | torch.Tensor.set_ | Supported | +| 280 | torch.Tensor.share_memory_ | Unsupported | +| 281 | torch.Tensor.short | Supported | +| 282 | torch.Tensor.sigmoid | Supported | +| 283 | torch.Tensor.sigmoid_ | Supported | +| 284 | torch.Tensor.sign | Supported | +| 285 | torch.Tensor.sign_ | Supported | +| 286 | torch.Tensor.sin | Supported | +| 287 | torch.Tensor.sin_ | Supported | +| 288 | torch.Tensor.sinh | Supported | +| 289 | torch.Tensor.sinh_ | Supported | +| 290 | torch.Tensor.size | Supported | +| 291 | torch.Tensor.slogdet | Supported | +| 292 | torch.Tensor.solve | Unsupported | +| 293 | torch.Tensor.sort | Supported | +| 294 | torch.Tensor.split | Supported | +| 295 | torch.Tensor.sparse_mask | Unsupported | +| 296 | torch.Tensor.sparse_dim | Unsupported | +| 297 | torch.Tensor.sqrt | Supported | +| 298 | torch.Tensor.sqrt_ | Supported | +| 299 | torch.Tensor.square | Supported | +| 300 | torch.Tensor.square_ | Supported | +| 301 | torch.Tensor.squeeze | Supported | +| 302 | torch.Tensor.squeeze_ | Supported | +| 303 | torch.Tensor.std | Supported | +| 304 | torch.Tensor.stft | Unsupported | +| 305 | torch.Tensor.storage | Supported | +| 306 | torch.Tensor.storage_offset | Supported | +| 307 | torch.Tensor.storage_type | Supported | +| 308 | torch.Tensor.stride | Supported | +| 309 | torch.Tensor.sub | Supported | +| 310 | torch.Tensor.sub_ | Supported | +| 311 | torch.Tensor.sum | Supported | +| 312 | torch.Tensor.sum_to_size | Supported | +| 313 | torch.Tensor.svd | Supported | +| 314 | torch.Tensor.symeig | Supported | +| 315 | torch.Tensor.t | Supported | +| 316 | torch.Tensor.t_ | Supported | +| 317 | torch.Tensor.to | Supported | +| 318 | torch.Tensor.to_mkldnn | Unsupported | +| 319 | torch.Tensor.take | Supported | +| 320 | torch.Tensor.tan | Supported | +| 321 | torch.Tensor.tan_ | Supported | +| 322 | torch.Tensor.tanh | Supported | 
+| 323 | torch.Tensor.tanh_ | Supported | +| 324 | torch.Tensor.tolist | Supported | +| 325 | torch.Tensor.topk | Supported | +| 326 | torch.Tensor.to_sparse | Unsupported | +| 327 | torch.Tensor.trace | Unsupported | +| 328 | torch.Tensor.transpose | Supported | +| 329 | torch.Tensor.transpose_ | Supported | +| 330 | torch.Tensor.triangular_solve | Supported | +| 331 | torch.Tensor.tril | Supported | +| 332 | torch.Tensor.tril_ | Supported | +| 333 | torch.Tensor.triu | Supported | +| 334 | torch.Tensor.triu_ | Supported | +| 335 | torch.Tensor.true_divide | Supported | +| 336 | torch.Tensor.true_divide_ | Supported | +| 337 | torch.Tensor.trunc | Supported | +| 338 | torch.Tensor.trunc_ | Supported | +| 339 | torch.Tensor.type | Supported | +| 340 | torch.Tensor.type_as | Supported | +| 341 | torch.Tensor.unbind | Supported | +| 342 | torch.Tensor.unfold | Supported | +| 343 | torch.Tensor.uniform_ | Supported | +| 344 | torch.Tensor.unique | Supported | +| 345 | torch.Tensor.unique_consecutive | Unsupported | +| 346 | torch.Tensor.unsqueeze | Supported | +| 347 | torch.Tensor.unsqueeze_ | Supported | +| 348 | torch.Tensor.values | Unsupported | +| 349 | torch.Tensor.var | Unsupported | +| 350 | torch.Tensor.view | Supported | +| 351 | torch.Tensor.view_as | Supported | +| 352 | torch.Tensor.where | Supported | +| 353 | torch.Tensor.zero_ | Supported | +| 354 | torch.BoolTensor | Supported | +| 355 | torch.BoolTensor.all | Supported | +| 356 | torch.BoolTensor.any | Supported | + +## Layers (torch.nn) + +| No. | API | Supported/Unsupported | +| ---- | -------------------------------------------------------- | ------------------------------------------------------------ | +| 1 | torch.nn.Parameter | Supported | +| 2 | torch.nn.Module | Supported | +| 3 | torch.nn.Module.add_module | Supported | +| 4 | torch.nn.Module.apply | Supported | +| 5 | torch.nn.Module.bfloat16 | Unsupported | +| 6 | torch.nn.Module.buffers | Supported | +| 7 | torch.nn.Module.children | Supported | +| 8 | torch.nn.Module.cpu | Supported | +| 9 | torch.nn.Module.cuda | Unsupported | +| 10 | torch.nn.Module.double | Unsupported | +| 11 | torch.nn.Module.dump_patches | Supported | +| 12 | torch.nn.Module.eval | Supported | +| 13 | torch.nn.Module.extra_repr | Supported | +| 14 | torch.nn.Module.float | Supported | +| 15 | torch.nn.Module.forward | Supported | +| 16 | torch.nn.Module.half | Supported | +| 17 | torch.nn.Module.load_state_dict | Supported | +| 18 | torch.nn.Module.modules | Supported | +| 19 | torch.nn.Module.named_buffers | Supported | +| 20 | torch.nn.Module.named_children | Supported | +| 21 | torch.nn.Module.named_modules | Supported | +| 22 | torch.nn.Module.named_parameters | Supported | +| 23 | torch.nn.Module.parameters | Supported | +| 24 | torch.nn.Module.register_backward_hook | Supported | +| 25 | torch.nn.Module.register_buffer | Supported | +| 26 | torch.nn.Module.register_forward_hook | Supported | +| 27 | torch.nn.Module.register_forward_pre_hook | Supported | +| 28 | torch.nn.Module.register_parameter | Supported | +| 29 | torch.nn.Module.requires_grad_ | Supported | +| 30 | torch.nn.Module.state_dict | Supported | +| 31 | torch.nn.Module.to | Supported | +| 32 | torch.nn.Module.train | Supported | +| 33 | torch.nn.Module.type | Supported | +| 34 | torch.nn.Module.zero_grad | Supported | +| 35 | torch.nn.Sequential | Supported | +| 36 | torch.nn.ModuleList | Supported | +| 37 | torch.nn.ModuleList.append | Supported | +| 38 | torch.nn.ModuleList.extend | Supported | +| 39 | 
torch.nn.ModuleList.insert | Supported | +| 40 | torch.nn.ModuleDict | Supported | +| 41 | torch.nn.ModuleDict.clear | Supported | +| 42 | torch.nn.ModuleDict.items | Supported | +| 43 | torch.nn.ModuleDict.keys | Supported | +| 44 | torch.nn.ModuleDict.pop | Supported | +| 45 | torch.nn.ModuleDict.update | Supported | +| 46 | torch.nn.ModuleDict.values | Supported | +| 47 | torch.nn.ParameterList | Supported | +| 48 | torch.nn.ParameterList.append | Supported | +| 49 | torch.nn.ParameterList.extend | Supported | +| 50 | torch.nn.ParameterDict | Supported | +| 51 | torch.nn.ParameterDict.clear | Supported | +| 52 | torch.nn.ParameterDict.items | Supported | +| 53 | torch.nn.ParameterDict.keys | Supported | +| 54 | torch.nn.ParameterDict.pop | Supported | +| 55 | torch.nn.ParameterDict.update | Supported | +| 56 | torch.nn.ParameterDict.values | Supported | +| 57 | torch.nn.Conv1d | Supported | +| 58 | torch.nn.Conv2d | Supported | +| 59 | torch.nn.Conv3d | Supported | +| 60 | torch.nn.ConvTranspose1d | Supported | +| 61 | torch.nn.ConvTranspose2d | Supported | +| 62 | torch.nn.ConvTranspose3d | Supported | +| 63 | torch.nn.Unfold | Supported | +| 64 | torch.nn.Fold | Supported | +| 65 | torch.nn.MaxPool1d | Supported | +| 66 | torch.nn.MaxPool2d | Supported | +| 67 | torch.nn.MaxPool3d | Supported | +| 68 | torch.nn.MaxUnpool1d | Supported | +| 69 | torch.nn.MaxUnpool2d | Supported | +| 70 | torch.nn.MaxUnpool3d | Supported | +| 71 | torch.nn.AvgPool1d | Supported | +| 72 | torch.nn.AvgPool2d | Supported | +| 73 | torch.nn.AvgPool3d | Supported | +| 74 | torch.nn.FractionalMaxPool2d | Unsupported | +| 75 | torch.nn.LPPool1d | Supported | +| 76 | torch.nn.LPPool2d | Supported | +| 77 | torch.nn.AdaptiveMaxPool1d | Supported | +| 78 | torch.nn.AdaptiveMaxPool2d | Supported | +| 79 | torch.nn.AdaptiveMaxPool3d | Unsupported | +| 80 | torch.nn.AdaptiveAvgPool1d | Supported | +| 81 | torch.nn.AdaptiveAvgPool2d | Supported | +| 82 | torch.nn.AdaptiveAvgPool3d | Supported (Only the scenario with D=1, H=1, and W=1 is supported.) | +| 83 | torch.nn.ReflectionPad1d | Unsupported | +| 84 | torch.nn.ReflectionPad2d | Supported | +| 85 | torch.nn.ReplicationPad1d | Unsupported | +| 86 | torch.nn.ReplicationPad2d | Supported | +| 87 | torch.nn.ReplicationPad3d | Unsupported | +| 88 | torch.nn.ZeroPad2d | Supported | +| 89 | torch.nn.ConstantPad1d | Supported | +| 90 | torch.nn.ConstantPad2d | Supported | +| 91 | torch.nn.ConstantPad3d | Supported | +| 92 | torch.nn.ELU | Supported | +| 93 | torch.nn.Hardshrink | Supported | +| 94 | torch.nn.Hardtanh | Supported | +| 95 | torch.nn.LeakyReLU | Supported | +| 96 | torch.nn.LogSigmoid | Supported | +| 97 | torch.nn.MultiheadAttention | Supported | +| 98 | torch.nn.PReLU | Supported | +| 99 | torch.nn.ReLU | Supported | +| 100 | torch.nn.ReLU6 | Supported | +| 101 | torch.nn.RReLU | Supported | +| 102 | torch.nn.SELU | Supported | +| 103 | torch.nn.CELU | Supported | +| 104 | torch.nn.GELU | Supported | +| 105 | torch.nn.Sigmoid | Supported | +| 106 | torch.nn.Softplus | Supported | +| 107 | torch.nn.Softshrink | Supported (The SoftShrink scenario is not supported.) 
| +| 108 | torch.nn.Softsign | Supported | +| 109 | torch.nn.Tanh | Supported | +| 110 | torch.nn.Tanhshrink | Supported | +| 111 | torch.nn.Threshold | Supported | +| 112 | torch.nn.Softmin | Supported | +| 113 | torch.nn.Softmax | Supported | +| 114 | torch.nn.Softmax2d | Supported | +| 115 | torch.nn.LogSoftmax | Supported | +| 116 | torch.nn.AdaptiveLogSoftmaxWithLoss | Unsupported | +| 117 | torch.nn.AdaptiveLogSoftmaxWithLoss.log_prob | Unsupported | +| 118 | torch.nn.AdaptiveLogSoftmaxWithLoss.predict | Unsupported | +| 119 | torch.nn.BatchNorm1d | Supported | +| 120 | torch.nn.BatchNorm2d | Supported | +| 121 | torch.nn.BatchNorm3d | Supported | +| 122 | torch.nn.GroupNorm | Supported | +| 123 | torch.nn.SyncBatchNorm | Supported | +| 124 | torch.nn.SyncBatchNorm.convert_sync_batchnorm | Supported | +| 125 | torch.nn.InstanceNorm1d | Supported | +| 126 | torch.nn.InstanceNorm2d | Supported | +| 127 | torch.nn.InstanceNorm3d | Supported | +| 128 | torch.nn.LayerNorm | Supported | +| 129 | torch.nn.LocalResponseNorm | Supported | +| 130 | torch.nn.RNNBase | Supported | +| 131 | torch.nn.RNNBase.flatten_parameters | Supported | +| 132 | torch.nn.RNN | Supported | +| 133 | torch.nn.LSTM | Supported | +| 134 | torch.nn.GRU | Supported (The DynamicGRUV2 scenario is not supported.) | +| 135 | torch.nn.RNNCell | Supported | +| 136 | torch.nn.LSTMCell | Supported | +| 137 | torch.nn.GRUCell | Supported | +| 138 | torch.nn.Transformer | Supported | +| 139 | torch.nn.Transformer.forward | Supported | +| 140 | torch.nn.Transformer.generate_square_subsequent_mask | Supported | +| 141 | torch.nn.TransformerEncoder | Supported | +| 142 | torch.nn.TransformerEncoder.forward | Supported | +| 143 | torch.nn.TransformerDecoder | Supported | +| 144 | torch.nn.TransformerDecoder.forward | Supported | +| 145 | torch.nn.TransformerEncoderLayer | Supported | +| 146 | torch.nn.TransformerEncoderLayer.forward | Supported | +| 147 | torch.nn.TransformerDecoderLayer | Supported | +| 148 | torch.nn.TransformerDecoderLayer.forward | Supported | +| 149 | torch.nn.Identity | Supported | +| 150 | torch.nn.Linear | Supported | +| 151 | torch.nn.Bilinear | Supported | +| 152 | torch.nn.Dropout | Supported | +| 153 | torch.nn.Dropout2d | Supported | +| 154 | torch.nn.Dropout3d | Supported | +| 155 | torch.nn.AlphaDropout | Supported | +| 156 | torch.nn.Embedding | Supported | +| 157 | torch.nn.Embedding.from_pretrained | Supported | +| 158 | torch.nn.EmbeddingBag | Supported | +| 159 | torch.nn.EmbeddingBag.from_pretrained | Supported | +| 160 | torch.nn.CosineSimilarity | Supported | +| 161 | torch.nn.PairwiseDistance | Supported | +| 162 | torch.nn.L1Loss | Supported | +| 163 | torch.nn.MSELoss | Supported | +| 164 | torch.nn.CrossEntropyLoss | Supported | +| 165 | torch.nn.CTCLoss | Supported | +| 166 | torch.nn.NLLLoss | Supported | +| 167 | torch.nn.PoissonNLLLoss | Supported | +| 168 | torch.nn.KLDivLoss | Supported | +| 169 | torch.nn.BCELoss | Supported | +| 170 | torch.nn.BCEWithLogitsLoss | Supported | +| 171 | torch.nn.MarginRankingLoss | Supported | +| 172 | torch.nn.HingeEmbeddingLoss | Supported | +| 173 | torch.nn.MultiLabelMarginLoss | Supported | +| 174 | torch.nn.SmoothL1Loss | Supported | +| 175 | torch.nn.SoftMarginLoss | Supported | +| 176 | torch.nn.MultiLabelSoftMarginLoss | Supported | +| 177 | torch.nn.CosineEmbeddingLoss | Supported | +| 178 | torch.nn.MultiMarginLoss | Unsupported | +| 179 | torch.nn.TripletMarginLoss | Supported | +| 180 | torch.nn.PixelShuffle | Supported | +| 181 | 
torch.nn.Upsample | Supported | +| 182 | torch.nn.UpsamplingNearest2d | Supported | +| 183 | torch.nn.UpsamplingBilinear2d | Supported | +| 184 | torch.nn.DataParallel | Unsupported | +| 185 | torch.nn.parallel.DistributedDataParallel | Supported | +| 186 | torch.nn.parallel.DistributedDataParallel.no_sync | Supported | +| 187 | torch.nn.utils.clip_grad_norm_ | Supported | +| 188 | torch.nn.utils.clip_grad_value_ | Supported | +| 189 | torch.nn.utils.parameters_to_vector | Supported | +| 190 | torch.nn.utils.vector_to_parameters | Supported | +| 197 | torch.nn.utils.prune.PruningContainer | Supported | +| 198 | torch.nn.utils.prune.PruningContainer.add_pruning_method | Supported | +| 199 | torch.nn.utils.prune.PruningContainer.apply | Supported | +| 200 | torch.nn.utils.prune.PruningContainer.apply_mask | Supported | +| 201 | torch.nn.utils.prune.PruningContainer.compute_mask | Supported | +| 202 | torch.nn.utils.prune.PruningContainer.prune | Supported | +| 203 | torch.nn.utils.prune.PruningContainer.remove | Supported | +| 204 | torch.nn.utils.prune.Identity | Supported | +| 205 | torch.nn.utils.prune.Identity.apply | Supported | +| 206 | torch.nn.utils.prune.Identity.apply_mask | Supported | +| 207 | torch.nn.utils.prune.Identity.prune | Supported | +| 208 | torch.nn.utils.prune.Identity.remove | Supported | +| 209 | torch.nn.utils.prune.RandomUnstructured | Supported | +| 210 | torch.nn.utils.prune.RandomUnstructured.apply | Supported | +| 211 | torch.nn.utils.prune.RandomUnstructured.apply_mask | Supported | +| 212 | torch.nn.utils.prune.RandomUnstructured.prune | Supported | +| 213 | torch.nn.utils.prune.RandomUnstructured.remove | Supported | +| 214 | torch.nn.utils.prune.L1Unstructured | Supported | +| 215 | torch.nn.utils.prune.L1Unstructured.apply | Supported | +| 216 | torch.nn.utils.prune.L1Unstructured.apply_mask | Supported | +| 217 | torch.nn.utils.prune.L1Unstructured.prune | Supported | +| 218 | torch.nn.utils.prune.L1Unstructured.remove | Supported | +| 219 | torch.nn.utils.prune.RandomStructured | Supported | +| 220 | torch.nn.utils.prune.RandomStructured.apply | Supported | +| 221 | torch.nn.utils.prune.RandomStructured.apply_mask | Supported | +| 222 | torch.nn.utils.prune.RandomStructured.compute_mask | Supported | +| 223 | torch.nn.utils.prune.RandomStructured.prune | Supported | +| 224 | torch.nn.utils.prune.RandomStructured.remove | Supported | +| 225 | torch.nn.utils.prune.LnStructured | Supported | +| 226 | torch.nn.utils.prune.LnStructured.apply | Supported | +| 227 | torch.nn.utils.prune.LnStructured.apply_mask | Supported | +| 228 | torch.nn.utils.prune.LnStructured.compute_mask | Supported | +| 229 | torch.nn.utils.prune.LnStructured.prune | Supported | +| 230 | torch.nn.utils.prune.LnStructured.remove | Supported | +| 231 | torch.nn.utils.prune.CustomFromMask | Supported | +| 232 | torch.nn.utils.prune.CustomFromMask.apply | Supported | +| 233 | torch.nn.utils.prune.CustomFromMask.apply_mask | Supported | +| 234 | torch.nn.utils.prune.CustomFromMask.prune | Supported | +| 235 | torch.nn.utils.prune.CustomFromMask.remove | Supported | +| 236 | torch.nn.utils.prune.identity | Supported | +| 237 | torch.nn.utils.prune.random_unstructured | Supported | +| 238 | torch.nn.utils.prune.l1_unstructured | Supported | +| 239 | torch.nn.utils.prune.random_structured | Supported | +| 240 | torch.nn.utils.prune.ln_structured | Supported | +| 241 | torch.nn.utils.prune.global_unstructured | Supported | +| 242 | torch.nn.utils.prune.custom_from_mask | Supported | +| 243 | 
torch.nn.utils.prune.remove | Supported | +| 244 | torch.nn.utils.prune.is_pruned | Supported | +| 245 | torch.nn.utils.weight_norm | Supported | +| 246 | torch.nn.utils.remove_weight_norm | Supported | +| 247 | torch.nn.utils.spectral_norm | Supported | +| 248 | torch.nn.utils.remove_spectral_norm | Supported | +| 249 | torch.nn.utils.rnn.PackedSequence | Supported | +| 250 | torch.nn.utils.rnn.pack_padded_sequence | Supported | +| 251 | torch.nn.utils.rnn.pad_packed_sequence | Unsupported | +| 252 | torch.nn.utils.rnn.pad_sequence | Supported | +| 253 | torch.nn.utils.rnn.pack_sequence | Unsupported | +| 254 | torch.nn.Flatten | Supported | +| 255 | torch.quantization.quantize | Unsupported | +| 256 | torch.quantization.quantize_dynamic | Unsupported | +| 257 | torch.quantization.quantize_qat | Unsupported | +| 258 | torch.quantization.prepare | Supported | +| 259 | torch.quantization.prepare_qat | Unsupported | +| 260 | torch.quantization.convert | Unsupported | +| 261 | torch.quantization.QConfig | Supported | +| 262 | torch.quantization.QConfigDynamic | Supported | +| 263 | torch.quantization.fuse_modules | Supported | +| 264 | torch.quantization.QuantStub | Supported | +| 265 | torch.quantization.DeQuantStub | Supported | +| 266 | torch.quantization.QuantWrapper | Supported | +| 267 | torch.quantization.add_quant_dequant | Supported | +| 268 | torch.quantization.add_observer_ | Supported | +| 269 | torch.quantization.swap_module | Supported | +| 270 | torch.quantization.propagate_qconfig_ | Supported | +| 271 | torch.quantization.default_eval_fn | Supported | +| 272 | torch.quantization.MinMaxObserver | Supported | +| 273 | torch.quantization.MovingAverageMinMaxObserver | Supported | +| 274 | torch.quantization.PerChannelMinMaxObserver | Supported | +| 275 | torch.quantization.MovingAveragePerChannelMinMaxObserver | Supported | +| 276 | torch.quantization.HistogramObserver | Unsupported | +| 277 | torch.quantization.FakeQuantize | Unsupported | +| 278 | torch.quantization.NoopObserver | Supported | +| 279 | torch.quantization.get_observer_dict | Supported | +| 280 | torch.quantization.RecordingObserver | Supported | +| 281 | torch.nn.intrinsic.ConvBn2d | Supported | +| 282 | torch.nn.intrinsic.ConvBnReLU2d | Supported | +| 283 | torch.nn.intrinsic.ConvReLU2d | Supported | +| 284 | torch.nn.intrinsic.ConvReLU3d | Supported | +| 285 | torch.nn.intrinsic.LinearReLU | Supported | +| 286 | torch.nn.intrinsic.qat.ConvBn2d | Unsupported | +| 287 | torch.nn.intrinsic.qat.ConvBnReLU2d | Unsupported | +| 288 | torch.nn.intrinsic.qat.ConvReLU2d | Unsupported | +| 289 | torch.nn.intrinsic.qat.LinearReLU | Unsupported | +| 290 | torch.nn.intrinsic.quantized.ConvReLU2d | Unsupported | +| 291 | torch.nn.intrinsic.quantized.ConvReLU3d | Unsupported | +| 292 | torch.nn.intrinsic.quantized.LinearReLU | Unsupported | +| 293 | torch.nn.qat.Conv2d | Unsupported | +| 294 | torch.nn.qat.Conv2d.from_float | Unsupported | +| 295 | torch.nn.qat.Linear | Unsupported | +| 296 | torch.nn.qat.Linear.from_float | Unsupported | +| 297 | torch.nn.quantized.functional.relu | Unsupported | +| 298 | torch.nn.quantized.functional.linear | Unsupported | +| 299 | torch.nn.quantized.functional.conv2d | Unsupported | +| 300 | torch.nn.quantized.functional.conv3d | Unsupported | +| 301 | torch.nn.quantized.functional.max_pool2d | Unsupported | +| 302 | torch.nn.quantized.functional.adaptive_avg_pool2d | Unsupported | +| 303 | torch.nn.quantized.functional.avg_pool2d | Unsupported | +| 304 | 
torch.nn.quantized.functional.interpolate | Unsupported |
+| 305 | torch.nn.quantized.functional.upsample | Unsupported |
+| 306 | torch.nn.quantized.functional.upsample_bilinear | Unsupported |
+| 307 | torch.nn.quantized.functional.upsample_nearest | Unsupported |
+| 308 | torch.nn.quantized.ReLU | Unsupported |
+| 309 | torch.nn.quantized.ReLU6 | Unsupported |
+| 310 | torch.nn.quantized.Conv2d | Unsupported |
+| 311 | torch.nn.quantized.Conv2d.from_float | Unsupported |
+| 312 | torch.nn.quantized.Conv3d | Unsupported |
+| 313 | torch.nn.quantized.Conv3d.from_float | Unsupported |
+| 314 | torch.nn.quantized.FloatFunctional | Supported |
+| 315 | torch.nn.quantized.QFunctional | Unsupported |
+| 316 | torch.nn.quantized.Quantize | Supported |
+| 317 | torch.nn.quantized.DeQuantize | Unsupported |
+| 318 | torch.nn.quantized.Linear | Unsupported |
+| 319 | torch.nn.quantized.Linear.from_float | Unsupported |
+| 320 | torch.nn.quantized.dynamic.Linear | Unsupported |
+| 321 | torch.nn.quantized.dynamic.Linear.from_float | Unsupported |
+| 322 | torch.nn.quantized.dynamic.LSTM | Unsupported |
+
+## Functions (torch.nn.functional)
+
+| No. | API | Supported/Unsupported |
+| ---- | ---------------------------------------------------- | ------------------------------------------------------------ |
+| 1 | torch.nn.functional.conv1d | Supported |
+| 2 | torch.nn.functional.conv2d | Supported |
+| 3 | torch.nn.functional.conv3d | Supported |
+| 4 | torch.nn.functional.conv_transpose1d | Supported |
+| 5 | torch.nn.functional.conv_transpose2d | Supported |
+| 6 | torch.nn.functional.conv_transpose3d | Supported |
+| 7 | torch.nn.functional.unfold | Supported |
+| 8 | torch.nn.functional.fold | Supported |
+| 9 | torch.nn.functional.avg_pool1d | Supported |
+| 10 | torch.nn.functional.avg_pool2d | Supported |
+| 11 | torch.nn.functional.avg_pool3d | Supported |
+| 12 | torch.nn.functional.max_pool1d | Supported |
+| 13 | torch.nn.functional.max_pool2d | Supported |
+| 14 | torch.nn.functional.max_pool3d | Supported |
+| 15 | torch.nn.functional.max_unpool1d | Supported |
+| 16 | torch.nn.functional.max_unpool2d | Supported |
+| 17 | torch.nn.functional.max_unpool3d | Supported |
+| 18 | torch.nn.functional.lp_pool1d | Supported |
+| 19 | torch.nn.functional.lp_pool2d | Supported |
+| 20 | torch.nn.functional.adaptive_max_pool1d | Supported |
+| 21 | torch.nn.functional.adaptive_max_pool2d | Supported |
+| 22 | torch.nn.functional.adaptive_max_pool3d | Unsupported |
+| 23 | torch.nn.functional.adaptive_avg_pool1d | Supported |
+| 24 | torch.nn.functional.adaptive_avg_pool2d | Supported |
+| 25 | torch.nn.functional.adaptive_avg_pool3d | Supported (Only the scenario with D=1, H=1, and W=1 is supported.) 
| +| 26 | torch.nn.functional.threshold | Supported | +| 27 | torch.nn.functional.threshold_ | Supported | +| 28 | torch.nn.functional.relu | Supported | +| 29 | torch.nn.functional.relu_ | Supported | +| 30 | torch.nn.functional.hardtanh | Supported | +| 31 | torch.nn.functional.hardtanh_ | Supported | +| 32 | torch.nn.functional.relu6 | Supported | +| 33 | torch.nn.functional.elu | Supported | +| 34 | torch.nn.functional.elu_ | Supported | +| 35 | torch.nn.functional.selu | Supported | +| 36 | torch.nn.functional.celu | Supported | +| 37 | torch.nn.functional.leaky_relu | Supported | +| 38 | torch.nn.functional.leaky_relu_ | Supported | +| 39 | torch.nn.functional.prelu | Supported | +| 40 | torch.nn.functional.rrelu | Supported | +| 41 | torch.nn.functional.rrelu_ | Supported | +| 42 | torch.nn.functional.glu | Supported | +| 43 | torch.nn.functional.gelu | Supported | +| 44 | torch.nn.functional.logsigmoid | Supported | +| 45 | torch.nn.functional.hardshrink | Supported | +| 46 | torch.nn.functional.tanhshrink | Supported | +| 47 | torch.nn.functional.softsign | Supported | +| 48 | torch.nn.functional.softplus | Supported | +| 49 | torch.nn.functional.softmin | Supported | +| 50 | torch.nn.functional.softmax | Supported | +| 51 | torch.nn.functional.softshrink | Supported | +| 52 | torch.nn.functional.gumbel_softmax | Unsupported | +| 53 | torch.nn.functional.log_softmax | Supported | +| 54 | torch.nn.functional.tanh | Supported | +| 55 | torch.nn.functional.sigmoid | Supported | +| 56 | torch.nn.functional.batch_norm | Supported | +| 57 | torch.nn.functional.instance_norm | Supported | +| 58 | torch.nn.functional.layer_norm | Supported | +| 59 | torch.nn.functional.local_response_norm | Supported | +| 60 | torch.nn.functional.normalize | Supported | +| 61 | torch.nn.functional.linear | Supported | +| 62 | torch.nn.functional.bilinear | Supported | +| 63 | torch.nn.functional.dropout | Supported | +| 64 | torch.nn.functional.alpha_dropout | Supported | +| 65 | torch.nn.functional.dropout2d | Supported | +| 66 | torch.nn.functional.dropout3d | Supported | +| 67 | torch.nn.functional.embedding | Supported | +| 68 | torch.nn.functional.embedding_bag | Supported | +| 69 | torch.nn.functional.one_hot | Supported | +| 70 | torch.nn.functional.pairwise_distance | Supported | +| 71 | torch.nn.functional.cosine_similarity | Supported | +| 72 | torch.nn.functional.pdist | Supported | +| 73 | torch.nn.functional.binary_cross_entropy | Supported | +| 74 | torch.nn.functional.binary_cross_entropy_with_logits | Supported | +| 75 | torch.nn.functional.poisson_nll_loss | Supported | +| 76 | torch.nn.functional.cosine_embedding_loss | Supported | +| 77 | torch.nn.functional.cross_entropy | Supported | +| 78 | torch.nn.functional.ctc_loss | Supported (Only 2-dimensional input is supported.) 
| +| 79 | torch.nn.functional.hinge_embedding_loss | Supported | +| 80 | torch.nn.functional.kl_div | Supported | +| 81 | torch.nn.functional.l1_loss | Supported | +| 82 | torch.nn.functional.mse_loss | Supported | +| 83 | torch.nn.functional.margin_ranking_loss | Supported | +| 84 | torch.nn.functional.multilabel_margin_loss | Supported | +| 85 | torch.nn.functional.multilabel_soft_margin_loss | Supported | +| 86 | torch.nn.functional.multi_margin_loss | Unsupported | +| 87 | torch.nn.functional.nll_loss | Supported | +| 88 | torch.nn.functional.smooth_l1_loss | Supported | +| 89 | torch.nn.functional.soft_margin_loss | Supported | +| 90 | torch.nn.functional.triplet_margin_loss | Supported | +| 91 | torch.nn.functional.pixel_shuffle | Supported | +| 92 | torch.nn.functional.pad | Supported | +| 93 | torch.nn.functional.interpolate | Supported | +| 94 | torch.nn.functional.upsample | Supported | +| 95 | torch.nn.functional.upsample_nearest | Supported | +| 96 | torch.nn.functional.upsample_bilinear | Supported | +| 97 | torch.nn.functional.grid_sample | Supported | +| 98 | torch.nn.functional.affine_grid | Supported | +| 99 | torch.nn.parallel.data_parallel | Unsupported | + +## torch.distributed + +| No. | API | Supported/Unsupported | +| ---- | ------------------------------------- | --------------------- | +| 1 | torch.distributed.init_process_group | Supported | +| 2 | torch.distributed.Backend | Supported | +| 3 | torch.distributed.get_backend | Supported | +| 4 | torch.distributed.get_rank | Supported | +| 5 | torch.distributed.get_world_size | Supported | +| 6 | torch.distributed.is_initialized | Supported | +| 7 | torch.distributed.is_mpi_available | Supported | +| 8 | torch.distributed.is_nccl_available | Supported | +| 9 | torch.distributed.new_group | Supported | +| 10 | torch.distributed.send | Unsupported | +| 11 | torch.distributed.recv | Unsupported | +| 12 | torch.distributed.isend | Unsupported | +| 13 | torch.distributed.irecv | Unsupported | +| 14 | is_completed | Supported | +| 15 | wait | Supported | +| 16 | torch.distributed.broadcast | Supported | +| 17 | torch.distributed.all_reduce | Supported | +| 18 | torch.distributed.reduce | Unsupported | +| 19 | torch.distributed.all_gather | Supported | +| 20 | torch.distributed.gather | Unsupported | +| 21 | torch.distributed.scatter | Unsupported | +| 22 | torch.distributed.barrier | Supported | +| 23 | torch.distributed.ReduceOp | Supported | +| 24 | torch.distributed.reduce_op | Supported | +| 25 | torch.distributed.broadcast_multigpu | Unsupported | +| 26 | torch.distributed.all_reduce_multigpu | Unsupported | +| 27 | torch.distributed.reduce_multigpu | Unsupported | +| 28 | torch.distributed.all_gather_multigpu | Unsupported | +| 29 | torch.distributed.launch | Supported | +| 30 | torch.multiprocessing.spawn | Supported | + +## torch.npu + +| No. 
| API | NPU API | Supported/Unsupported | +| ---- | ------------------------------------- | ------------------------------------ | --------------------- | +| 1 | torch.cuda.current_blas_handle | torch.npu.current_blas_handle | Unsupported | +| 2 | torch.cuda.current_device | torch.npu.current_device | Supported | +| 3 | torch.cuda.current_stream | torch.npu.current_stream | Supported | +| 4 | torch.cuda.default_stream | torch.npu.default_stream | Supported | +| 5 | torch.cuda.device | torch.npu.device | Supported | +| 6 | torch.cuda.device_count | torch.npu.device_count | Supported | +| 7 | torch.cuda.device_of | torch.npu.device_of | Supported | +| 8 | torch.cuda.get_device_capability | torch.npu.get_device_capability | Unsupported | +| 9 | torch.cuda.get_device_name | torch.npu.get_device_name | Unsupported | +| 10 | torch.cuda.init | torch.npu.init | Supported | +| 11 | torch.cuda.ipc_collect | torch.npu.ipc_collect | Unsupported | +| 12 | torch.cuda.is_available | torch.npu.is_available | Supported | +| 13 | torch.cuda.is_initialized | torch.npu.is_initialized | Supported | +| 14 | torch.cuda.set_device | torch.npu.set_device | Partially supported | +| 15 | torch.cuda.stream | torch.npu.stream | Supported | +| 16 | torch.cuda.synchronize | torch.npu.synchronize | Supported | +| 17 | torch.cuda.get_rng_state | torch.npu.get_rng_state | Unsupported | +| 18 | torch.cuda.get_rng_state_all | torch.npu.get_rng_state_all | Unsupported | +| 19 | torch.cuda.set_rng_state | torch.npu.set_rng_state | Unsupported | +| 20 | torch.cuda.set_rng_state_all | torch.npu.set_rng_state_all | Unsupported | +| 21 | torch.cuda.manual_seed | torch.npu.manual_seed | Unsupported | +| 22 | torch.cuda.manual_seed_all | torch.npu.manual_seed_all | Unsupported | +| 23 | torch.cuda.seed | torch.npu.seed | Unsupported | +| 24 | torch.cuda.seed_all | torch.npu.seed_all | Unsupported | +| 25 | torch.cuda.initial_seed | torch.npu.initial_seed | Unsupported | +| 26 | torch.cuda.comm.broadcast | torch.npu.comm.broadcast | Unsupported | +| 27 | torch.cuda.comm.broadcast_coalesced | torch.npu.comm.broadcast_coalesced | Unsupported | +| 28 | torch.cuda.comm.reduce_add | torch.npu.comm.reduce_add | Unsupported | +| 29 | torch.cuda.comm.scatter | torch.npu.comm.scatter | Unsupported | +| 30 | torch.cuda.comm.gather | torch.npu.comm.gather | Unsupported | +| 31 | torch.cuda.Stream | torch.npu.Stream | Supported | +| 32 | torch.cuda.Stream.query | torch.npu.Stream.query | Supported | +| 33 | torch.cuda.Stream.record_event | torch.npu.Stream.record_event | Supported | +| 34 | torch.cuda.Stream.synchronize | torch.npu.Stream.synchronize | Supported | +| 35 | torch.cuda.Stream.wait_event | torch.npu.Stream.wait_event | Supported | +| 36 | torch.cuda.Stream.wait_stream | torch.npu.Stream.wait_stream | Supported | +| 37 | torch.cuda.Event | torch.npu.Event | Supported | +| 38 | torch.cuda.Event.elapsed_time | torch.npu.Event.elapsed_time | Supported | +| 39 | torch.cuda.Event.from_ipc_handle | torch.npu.Event.from_ipc_handle | Unsupported | +| 40 | torch.cuda.Event.ipc_handle | torch.npu.Event.ipc_handle | Unsupported | +| 41 | torch.cuda.Event.query | torch.npu.Event.query | Supported | +| 42 | torch.cuda.Event.record | torch.npu.Event.record | Supported | +| 43 | torch.cuda.Event.synchronize | torch.npu.Event.synchronize | Supported | +| 44 | torch.cuda.Event.wait | torch.npu.Event.wait | Supported | +| 45 | torch.cuda.empty_cache | torch.npu.empty_cache | Supported | +| 46 | torch.cuda.memory_stats | torch.npu.memory_stats | 
Supported |
+| 47 | torch.cuda.memory_summary | torch.npu.memory_summary | Supported |
+| 48 | torch.cuda.memory_snapshot | torch.npu.memory_snapshot | Supported |
+| 49 | torch.cuda.memory_allocated | torch.npu.memory_allocated | Supported |
+| 50 | torch.cuda.max_memory_allocated | torch.npu.max_memory_allocated | Supported |
+| 51 | torch.cuda.reset_max_memory_allocated | torch.npu.reset_max_memory_allocated | Supported |
+| 52 | torch.cuda.memory_reserved | torch.npu.memory_reserved | Supported |
+| 53 | torch.cuda.max_memory_reserved | torch.npu.max_memory_reserved | Supported |
+| 54 | torch.cuda.memory_cached | torch.npu.memory_cached | Supported |
+| 55 | torch.cuda.max_memory_cached | torch.npu.max_memory_cached | Supported |
+| 56 | torch.cuda.reset_max_memory_cached | torch.npu.reset_max_memory_cached | Supported |
+| 57 | torch.cuda.nvtx.mark | torch.npu.nvtx.mark | Unsupported |
+| 58 | torch.cuda.nvtx.range_push | torch.npu.nvtx.range_push | Unsupported |
+| 59 | torch.cuda.nvtx.range_pop | torch.npu.nvtx.range_pop | Unsupported |
+| 60 | torch.cuda._sleep | torch.npu._sleep | Unsupported |
+| 61 | torch.cuda.Stream.priority_range | torch.npu.Stream.priority_range | Unsupported |
+| 62 | torch.cuda.get_device_properties | torch.npu.get_device_properties | Unsupported |
+| 63 | torch.cuda.amp.GradScaler | torch.npu.amp.GradScaler | Unsupported |
+
+The **torch.npu.set_device()** API can specify the device only once, at the beginning of the program. Specifying the device multiple times, or switching devices through **torch.npu.device(id)**, is not supported.
+
+## NPU Custom Operators
+
+| No. | Operator |
+| ---- | ---------------------------------------------- |
+| 1 | npu_convolution_transpose |
+| 2 | npu_conv_transpose2d |
+| 3 | npu_convolution_transpose_backward |
+| 4 | npu_conv_transpose2d_backward |
+| 5 | npu_conv_transpose3d_backward |
+| 6 | npu_convolution |
+| 7 | npu_convolution_backward |
+| 8 | npu_convolution_double_backward |
+| 9 | npu_conv2d |
+| 10 | npu_conv2d.out |
+| 11 | npu_conv2d_backward |
+| 12 | npu_conv3d |
+| 13 | npu_conv3d.out |
+| 14 | npu_conv3d_backward |
+| 15 | one_ |
+| 16 | npu_sort_v2.out |
+| 17 | npu_sort_v2 |
+| 18 | npu_format_cast |
+| 19 | npu_format_cast_.acl_format |
+| 20 | npu_format_cast_.src |
+| 21 | npu_transpose_to_contiguous |
+| 22 | npu_transpose |
+| 23 | npu_transpose.out |
+| 24 | npu_broadcast |
+| 25 | npu_broadcast.out |
+| 26 | npu_dtype_cast |
+| 27 | npu_dtype_cast_.Tensor |
+| 28 | npu_roi_alignbk |
+| 29 | empty_with_format |
+| 30 | empty_with_format.names |
+| 31 | copy_memory_ |
+| 32 | npu_one_hot |
+| 33 | npu_stride_add |
+| 34 | npu_softmax_cross_entropy_with_logits |
+| 35 | npu_softmax_cross_entropy_with_logits_backward |
+| 36 | npu_ps_roi_pooling |
+| 37 | npu_ps_roi_pooling_backward |
+| 38 | npu_roi_align |
+| 39 | npu_nms_v4 |
+| 40 | npu_lstm |
+| 41 | npu_lstm_backward |
+| 42 | npu_iou |
+| 43 | npu_ptiou |
+| 44 | npu_nms_with_mask |
+| 45 | npu_pad |
+| 46 | npu_bounding_box_encode |
+| 47 | npu_bounding_box_decode |
+| 48 | npu_gru |
+| 49 | npu_gru_backward |
+| 50 | npu_set_.source_Storage_storage_offset_format |
+| 51 | npu_random_choice_with_mask |
+| 52 | npu_batch_nms |
+| 53 | npu_slice |
+| 54 | npu_slice.out |
+| 55 | npu_dropoutV2 |
+| 56 | npu_dropoutV2_backward |
+| 57 | _npu_dropout |
+| 58 | _npu_dropout_inplace |
+| 59 | npu_dropout_backward |
+| 60 | npu_indexing |
+| 61 | npu_indexing.out |
+| 62 | npu_ifmr |
+| 63 | npu_max.dim |
+| 64 | 
npu_max.names_dim |
+| 65 | npu_scatter |
+| 66 | npu_max_backward |
+| 67 | npu_apply_adam |
+| 68 | npu_layer_norm_eval |
+| 69 | npu_alloc_float_status |
+| 70 | npu_get_float_status |
+| 71 | npu_clear_float_status |
+| 72 | npu_confusion_transpose |
+| 73 | npu_confusion_transpose_backward |
+| 74 | npu_bmmV2 |
+| 75 | fast_gelu |
+| 76 | fast_gelu_backward |
+| 77 | npu_sub_sample |
+| 78 | npu_deformable_conv2d |
+| 79 | npu_deformable_conv2dbk |
+| 80 | npu_mish |
+| 81 | npu_anchor_response_flags |
+| 82 | npu_yolo_boxes_encode |
+| 83 | npu_grid_assign_positive |
+| 84 | npu_mish_backward |
+| 85 | npu_normalize_batch |
+| 86 | npu_masked_fill_range |
+| 87 | npu_linear |
+| 88 | npu_linear_backward |
+| 89 | npu_bert_apply_adam |
+| 90 | npu_giou |
+| 91 | npu_giou_backward |
+
+Operator descriptions:
+
+> npu_apply_adam(beta1_power, beta2_power, lr, beta1, beta2, epsilon, grad, use_locking, use_nesterov, out = (var, m, v))
+
+Computes the Adam optimizer update and writes the results to var, m, and v.
+
+- Parameters:
+  - **beta1_power** (Number) - power of beta1.
+  - **beta2_power** (Number) - power of beta2.
+  - **lr** (Number) - learning rate.
+  - **beta1** (Number) - exponential decay rate for the 1st moment estimates.
+  - **beta2** (Number) - exponential decay rate for the 2nd moment estimates.
+  - **epsilon** (Number) - term added to the denominator to improve numerical stability.
+  - **grad** (Tensor) - the gradient.
+  - **use_locking** (bool) - If `True`, use locks for update operations.
+  - **use_nesterov** (bool) - If `True`, uses the Nesterov update.
+  - **var** (Tensor) - variables to be optimized.
+  - **m** (Tensor) - mean value of variables.
+  - **v** (Tensor) - variance of variables.
+
+- Constraints:
+
+  None
+
+- Examples:
+
+  None
+
+> npu_convolution_transpose(input, weight, bias, padding, output_padding, stride, dilation, groups) -> Tensor
+
+Applies a 2D or 3D transposed convolution operator over an input image composed of several input planes, sometimes also called “deconvolution”.
+
+- Parameters:
+  - **input** (Tensor) - input tensor of shape (minibatch, in_channels, iH, iW) or (minibatch, in_channels, iT, iH, iW)
+  - **weight** (Tensor) - filters of shape (in_channels, out_channels/groups, kH, kW) or (in_channels, out_channels/groups, kT, kH, kW)
+  - **bias** (Tensor, optional) - optional bias of shape (out_channels)
+  - **padding** (ListInt) - (dilation * (kernel_size - 1) - padding) zero-padding will be added to both sides of each dimension in the input
+  - **output_padding** (ListInt) - additional size added to one side of each dimension in the output shape.
+  - **stride** (ListInt) - the stride of the convolving kernel
+  - **dilation** (ListInt) - the spacing between kernel elements
+  - **groups** (Number) - split input into groups; in_channels should be divisible by the number of groups
+
+- Constraints:
+
+  None
+
+- Examples:
+
+  None
+
+> npu_conv_transpose2d(input, weight, bias, padding, output_padding, stride, dilation, groups) -> Tensor
+
+Applies a 2D transposed convolution operator over an input image composed of several input planes, sometimes also called “deconvolution”. 
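+
+For illustration only (this example is not part of the original reference): a minimal, hedged usage sketch, assuming an Ascend environment where the custom operator is exposed as `torch.npu_conv_transpose2d` and tensors are moved to the NPU with `.npu()`. The parameters are described below.
+
+```python
+import torch
+
+# Assumed: a PyTorch build with NPU (torch.npu) support and an available device.
+x = torch.randn(1, 16, 32, 32).npu()  # input of shape (minibatch, in_channels, iH, iW)
+w = torch.randn(16, 8, 3, 3).npu()    # weight of shape (in_channels, out_channels/groups, kH, kW)
+
+# Positional arguments follow the signature above:
+# input, weight, bias, padding, output_padding, stride, dilation, groups
+out = torch.npu_conv_transpose2d(x, w, None, [1, 1], [0, 0], [2, 2], [1, 1], 1)
+
+# H_out = (iH - 1) * stride - 2 * padding + dilation * (kH - 1) + output_padding + 1
+print(out.shape)  # expected: torch.Size([1, 8, 63, 63])
+```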
+
+- Parameters:
+  - **input** (Tensor) - input tensor of shape (minibatch, in_channels, iH, iW)
+  - **weight** (Tensor) - filters of shape (in_channels, out_channels/groups, kH, kW)
+  - **bias** (Tensor, optional) - optional bias of shape (out_channels)
+  - **padding** (ListInt) - (dilation * (kernel_size - 1) - padding) zero-padding will be added to both sides of each dimension in the input
+  - **output_padding** (ListInt) - additional size added to one side of each dimension in the output shape.
+  - **stride** (ListInt) - the stride of the convolving kernel
+  - **dilation** (ListInt) - the spacing between kernel elements
+  - **groups** (Number) - split input into groups; in_channels should be divisible by the number of groups
+
+- Constraints:
+
+  None
+
+- Examples:
+
+  None
+
+> npu_convolution(input, weight, bias, stride, padding, dilation, groups) -> Tensor
+
+Applies a 2D or 3D convolution over an input image composed of several input planes.
+
+- Parameters:
+  - **input** (Tensor) - input tensor of shape (minibatch, in_channels, iH, iW) or (minibatch, in_channels, iT, iH, iW)
+  - **weight** (Tensor) - filters of shape (out_channels, in_channels/groups, kH, kW) or (out_channels, in_channels/groups, kT, kH, kW)
+  - **bias** (Tensor, optional) - optional bias of shape (out_channels)
+  - **stride** (ListInt) - the stride of the convolving kernel
+  - **padding** (ListInt) - implicit paddings on both sides of the input
+  - **dilation** (ListInt) - the spacing between kernel elements
+  - **groups** (Number) - split input into groups; in_channels should be divisible by the number of groups
+
+- Constraints:
+
+  None
+
+- Examples:
+
+  None
+
+> npu_conv2d(input, weight, bias, stride, padding, dilation, groups) -> Tensor
+
+Applies a 2D convolution over an input image composed of several input planes.
+
+- Parameters:
+  - **input** (Tensor) - input tensor of shape (minibatch, in_channels, iH, iW)
+  - **weight** (Tensor) - filters of shape (out_channels, in_channels/groups, kH, kW)
+  - **bias** (Tensor, optional) - optional bias of shape (out_channels)
+  - **stride** (ListInt) - the stride of the convolving kernel
+  - **padding** (ListInt) - implicit paddings on both sides of the input
+  - **dilation** (ListInt) - the spacing between kernel elements
+  - **groups** (Number) - split input into groups; in_channels should be divisible by the number of groups
+
+- Constraints:
+
+  None
+
+- Examples:
+
+  None
+
+> npu_conv3d(input, weight, bias, stride, padding, dilation, groups) -> Tensor
+
+Applies a 3D convolution over an input image composed of several input planes.
+
+- Parameters:
+  - **input** (Tensor) - input tensor of shape (minibatch, in_channels, iT, iH, iW)
+  - **weight** (Tensor) - filters of shape (out_channels, in_channels/groups, kT, kH, kW)
+  - **bias** (Tensor, optional) - optional bias of shape (out_channels)
+  - **stride** (ListInt) - the stride of the convolving kernel
+  - **padding** (ListInt) - implicit paddings on both sides of the input
+  - **dilation** (ListInt) - the spacing between kernel elements
+  - **groups** (Number) - split input into groups; in_channels should be divisible by the number of groups
+
+- Constraints:
+
+  None
+
+- Examples:
+
+  None
+
+> one_(self) -> Tensor
+
+Fills self tensor with ones. 
+
+> one_(self) -> Tensor
+
+Fills self tensor with ones.
+
+- Parameters:
+  - **self** (Tensor) - the input tensor
+
+- Constraints:
+
+  None
+
+- Examples:
+
+  ```python
+  >>> x = torch.rand(2, 3).npu()
+  >>> x
+  tensor([[0.6072, 0.9726, 0.3475],
+          [0.3717, 0.6135, 0.6788]], device='npu:0')
+  >>> x.one_()
+  tensor([[1., 1., 1.],
+          [1., 1., 1.]], device='npu:0')
+  ```
+
+> npu_sort_v2(self, dim=-1, descending=False, out=None) -> Tensor
+
+Sorts the elements of the input tensor along a given dimension in ascending order by value, without returning indices.
+If dim is not given, the last dimension of the input is chosen.
+If descending is True, the elements are sorted in descending order by value.
+
+- Parameters:
+  - **self** (Tensor) - the input tensor
+  - **dim** (int, optional) - the dimension to sort along
+  - **descending** (bool, optional) - controls the sorting order (ascending or descending)
+  - **out** (Tensor, optional) - the output that can be optionally given to be used as output buffers
+
+- Constraints:
+
+  At present, only the last dim (-1) is supported.
+
+- Examples:
+
+  ```python
+  >>> x = torch.randn(3, 4).npu()
+  >>> x
+  tensor([[-0.0067,  1.7790,  0.5031, -1.7217],
+          [ 1.1685, -1.0486, -0.2938,  1.3241],
+          [ 0.1880, -2.7447,  1.3976,  0.7380]], device='npu:0')
+  >>> sorted_x = torch.npu_sort_v2(x)
+  >>> sorted_x
+  tensor([[-1.7217, -0.0067,  0.5029,  1.7793],
+          [-1.0488, -0.2937,  1.1689,  1.3242],
+          [-2.7441,  0.1880,  0.7378,  1.3975]], device='npu:0')
+  ```
+
+> npu_format_cast(self, acl_format) -> Tensor
+
+Changes the format of an NPU tensor.
+
+- Parameters:
+  - **self** (Tensor) - the input tensor
+  - **acl_format** (int) - the target format to transform to
+
+- Constraints:
+
+  None
+
+- Examples:
+
+  ```python
+  >>> x = torch.rand(2, 3, 4, 5).npu()
+  >>> x.storage().npu_format()
+  0
+  >>> x1 = x.npu_format_cast(29)
+  >>> x1.storage().npu_format()
+  29
+  ```
+
+> npu_format_cast_
+
+> npu_format_cast_.acl_format(self, acl_format) -> Tensor
+
+  In-place version of npu_format_cast()
+
+> npu_format_cast_.src(self, src) -> Tensor
+
+  In-place version that changes the format of self to the format of src.
+
+  - Parameters:
+    - **self** (Tensor) - the input tensor
+    - **src** (Tensor) - the tensor whose format is used as the target format
+
+  - Constraints:
+
+    None
+
+  - Examples:
+
+    ```python
+    >>> x = torch.rand(2, 3, 4, 5).npu()
+    >>> x.storage().npu_format()
+    0
+    >>> x.npu_format_cast_(29).storage().npu_format()
+    29
+    ```
+
+> npu_transpose(self, perm) -> Tensor
+
+Returns a view of the original tensor with its dimensions permuted, and makes the result contiguous.
+
+- Parameters:
+  - **self** (Tensor) - the input tensor
+  - **perm** (ListInt) - the desired ordering of dimensions
+
+- Constraints:
+
+  None
+
+- Examples:
+
+  ```python
+  >>> x = torch.randn(2, 3, 5).npu()
+  >>> x.shape
+  torch.Size([2, 3, 5])
+  >>> x1 = torch.npu_transpose(x, (2, 0, 1))
+  >>> x1.shape
+  torch.Size([5, 2, 3])
+  >>> x2 = x.npu_transpose(2, 0, 1)
+  >>> x2.shape
+  torch.Size([5, 2, 3])
+  ```
+
+> npu_broadcast(self, perm) -> Tensor
+
+Returns a new view of the self tensor with singleton dimensions expanded to a larger size, and makes the result contiguous.
+
+The tensor can also be expanded to a larger number of dimensions, and the new ones will be appended at the front.
+
+- Parameters:
+  - **self** (Tensor) - the input tensor
+  - **perm** (ListInt) - the desired expanded size
+
+- Constraints:
+
+  None
+
+- Examples:
+
+  ```python
+  >>> x = torch.tensor([[1], [2], [3]]).npu()
+  >>> x.shape
+  torch.Size([3, 1])
+  >>> x.npu_broadcast(3, 4)
+  tensor([[1, 1, 1, 1],
+          [2, 2, 2, 2],
+          [3, 3, 3, 3]], device='npu:0')
+  ```
+
+> npu_dtype_cast(input, dtype) -> Tensor
+
+Performs Tensor dtype conversion.
+
+- Parameters:
+  - **input** (Tensor) - the input tensor.
+  - **dtype** (torch.dtype) - the desired data type of the returned Tensor.
+
+- Constraints:
+
+  None
+
+- Examples:
+
+  ```python
+  >>> torch.npu_dtype_cast(torch.tensor([0, 0.5, -1.]).npu(), dtype=torch.int)
+  tensor([ 0,  0, -1], device='npu:0', dtype=torch.int32)
+  ```
+
+> empty_with_format(size, dtype, layout, device, pin_memory, acl_format) -> Tensor
+
+Returns a tensor filled with uninitialized data. The shape of the tensor is defined by the variable argument size. The format of the tensor is defined by the variable argument acl_format.
+
+- Parameters:
+
+  - **size** (int...) - a sequence of integers defining the shape of the output tensor. Can be a variable number of arguments or a collection like a list or tuple.
+
+  - **dtype** (torch.dtype, optional) - the desired data type of the returned tensor. Default: if None, uses a global default (see torch.set_default_tensor_type()).
+
+  - **layout** (torch.layout, optional) - the desired layout of the returned Tensor. Default: None.
+
+  - **device** (torch.device, optional) - the desired device of the returned tensor. Default: None.
+
+  - **pin_memory** (bool, optional) - If set, the returned tensor is allocated in pinned memory. Default: None.
+
+  - **acl_format** (Number) - the desired memory format of the returned Tensor. Default: 2.
+
+- Constraints:
+
+  None
+
+- Examples:
+  ```python
+  >>> torch.empty_with_format((2, 3), dtype=torch.float32, device="npu")
+  tensor([[1., 1., 1.],
+          [1., 1., 1.]], device='npu:0')
+  ```
+
+> copy_memory_(dst, src, non_blocking=False) -> Tensor
+
+Copies the elements from src into the self tensor and returns self.
+
+- Parameters:
+  - **dst** (Tensor) - the destination tensor, modified in place.
+  - **src** (Tensor) - the source tensor to copy from.
+  - **non_blocking** (bool) - if True and this copy is between CPU and NPU, the copy may occur asynchronously with respect to the host. For other cases, this argument has no effect.
+
+- Constraints:
+
+  copy_memory_ only supports NPU tensors.
+  The input tensors of copy_memory_ must have the same dtype.
+  The input tensors of copy_memory_ must have the same device index.
+
+- Examples:
+
+  ```python
+  >>> a=torch.IntTensor([0, 0, -1]).npu()
+  >>> b=torch.IntTensor([1, 1, 1]).npu()
+  >>> a.copy_memory_(b)
+  tensor([1, 1, 1], device='npu:0', dtype=torch.int32)
+  ```
+
+> npu_one_hot(input, num_classes=-1, depth=1, on_value=1, off_value=0) -> Tensor
+
+Returns a one-hot tensor. The locations represented by index in "x" take value "on_value", while all other locations take value "off_value".
+
+- Parameters:
+  - **input** (Tensor) - class values of any shape.
+  - **num_classes** (Number) - The axis to fill. Defaults to "-1".
+  - **depth** (Number) - The depth of the one hot dimension.
+  - **on_value** (Number) - The value to fill in output when indices[j] = i.
+  - **off_value** (Number) - The value to fill in output when indices[j] != i.
+
+- Constraints:
+
+  None
+
+- Examples:
+  ```python
+  >>> a=torch.IntTensor([5, 3, 2, 1]).npu()
+  >>> b=torch.npu_one_hot(a, depth=5)
+  >>> b
+  tensor([[0., 0., 0., 0., 0.],
+          [0., 0., 0., 1., 0.],
+          [0., 0., 1., 0., 0.],
+          [0., 1., 0., 0., 0.]], device='npu:0')
+  ```
+
+> npu_stride_add(x1, x2, offset1, offset2, c1_len) -> Tensor
+
+Adds the partial values of two tensors in format NC1HWC0.
+
+- Parameters:
+  - **x1** (Tensor) - A Tensor in 5HD.
+  - **x2** (Tensor) - A Tensor of the same type as "x1", and the same shape as "x1", except for the C1 value.
+  - **offset1** (Number) - A required int. Offset value of C1 in "x1".
+  - **offset2** (Number) - A required int. Offset value of C1 in "x2".
+  - **c1_len** (Number) - A required int. C1 len of "y". The value must be less than the difference between C1 and offset in "x1" and "x2".
+
+- Constraints:
+
+  None
+
+- Examples:
+  ```python
+  >>> a=torch.tensor([[[[[1.]]]]]).npu()
+  >>> b=torch.npu_stride_add(a, a, 0, 0, 1)
+  >>> b
+  tensor([[[[[2.]]],
+          [[[0.]]],
+          [[[0.]]],
+          [[[0.]]],
+          [[[0.]]],
+          [[[0.]]],
+          [[[0.]]],
+          [[[0.]]],
+          [[[0.]]],
+          [[[0.]]],
+          [[[0.]]],
+          [[[0.]]],
+          [[[0.]]],
+          [[[0.]]],
+          [[[0.]]],
+          [[[0.]]]]], device='npu:0')
+  ```
+
+> npu_softmax_cross_entropy_with_logits(features, labels) -> Tensor
+
+Computes softmax cross entropy cost.
+
+- Parameters:
+  - **features** (Tensor) - A Tensor. A "batch_size * num_classes" matrix.
+  - **labels** (Tensor) - A Tensor of the same type as "features". A "batch_size * num_classes" matrix.
+
+- Constraints:
+
+  None
+
+- Examples:
+
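+  A minimal sketch; the shapes are illustrative, "labels" is assumed to be a one-hot (or probability) matrix matching "features", and the op is assumed to return a per-sample loss, as in the TensorFlow operator of the same name:
+
+  ```python
+  >>> features = torch.rand(2, 5).npu()                               # logits: (batch_size, num_classes)
+  >>> labels = torch.npu_one_hot(torch.IntTensor([2, 4]).npu(), depth=5)  # one-hot targets
+  >>> loss = torch.npu_softmax_cross_entropy_with_logits(features, labels)
+  >>> loss.shape
+  torch.Size([2])
+  ```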
"N" indicates the number of ROIs, the value "5" indicates the indexes of images where the ROIs are located, "x0", "y0", "x1", and "y1". + - **spatial_scale** (Number) - A required attribute of type float32, specifying the scaling ratio of "features" to the original image. + - **pooled_height** (Number) - A required attribute of type int32, specifying the H dimension. + - **pooled_width** (Number) - A required attribute of type int32, specifying the W dimension. + - **sample_num** (Number) - An optional attribute of type int32, specifying the horizontal and vertical sampling frequency of each output. If this attribute is set to "0", the sampling frequency is equal to the rounded up value of "rois", which is a floating point number. Defaults to "2". + - **roi_end_mode** (Number) - An optional attribute of type int32. Defaults to "1". + +- Constraints: + + None + +- Examples: + ```python + >>> x = torch.FloatTensor([[[[1, 2, 3 , 4, 5, 6], + [7, 8, 9, 10, 11, 12], + [13, 14, 15, 16, 17, 18], + [19, 20, 21, 22, 23, 24], + [25, 26, 27, 28, 29, 30], + [31, 32, 33, 34, 35, 36]]]]).npu() + >>> rois = torch.tensor([[0, -2.0, -2.0, 22.0, 22.0]]).npu() + >>> out = torch.npu_roi_align(x, rois, 0.25, 3, 3, 2, 0) + >>> out + tensor([[[[ 4.5000, 6.5000, 8.5000], + [16.5000, 18.5000, 20.5000], + [28.5000, 30.5000, 32.5000]]]], device='npu:0') + ``` + +> npu_nms_v4(boxes, scores, max_output_size, iou_threshold, scores_threshold, pad_to_max_output_size=False) -> (Tensor, Tensor) + +Greedily selects a subset of bounding boxes in descending order of score. + +- Parameters: + - **boxes** (Tensor) - A 2-D float tensor of shape [num_boxes, 4]. + - **scores** (Tensor) - A 1-D float tensor of shape [num_boxes] representing a single score corresponding to each box (each row of boxes). + - **max_output_size** (Number) - A scalar representing the maximum number of boxes to be selected by non max suppression. + - **iou_threshold** (Tensor) - A 0-D float tensor representing the threshold for deciding whether boxes overlap too much with respect to IOU. + - **scores_threshold** (Tensor) - A 0-D float tensor representing the threshold for deciding when to remove boxes based on score. + - **pad_to_max_output_size** (bool) - If true, the output selected_indices is padded to be of length max_output_size. Defaults to false. + +- Returns: + - **selected_indices** - A 1-D integer tensor of shape [M] representing the selected indices from the boxes tensor, where M <= max_output_size. + - **valid_outputs** - A 0-D integer tensor representing the number of valid elements in selected_indices, with the valid elements appearing first. + +- Constraints: + + None + +- Examples: + ```python + >>> boxes=torch.randn(100,4).npu() + >>> scores=torch.randn(100).npu() + >>> boxes.uniform_(0,100) + >>> scores.uniform_(0,1) + >>> max_output_size = 20 + >>> iou_threshold = torch.tensor(0.5).npu() + >>> scores_threshold = torch.tensor(0.3).npu() + >>> npu_output = torch.npu_nms_v4(boxes, scores, max_output_size, iou_threshold, scores_threshold) + >>> npu_output + (tensor([57, 65, 25, 45, 43, 12, 52, 91, 23, 78, 53, 11, 24, 62, 22, 67, 9, 94, + 54, 92], device='npu:0', dtype=torch.int32), tensor(20, device='npu:0', dtype=torch.int32)) + ``` + +> npu_nms_rotated(dets, scores, iou_threshold, scores_threshold=0, max_output_size=-1, mode=0) -> (Tensor, Tensor) + +Greedy selects a subset of the rotated bounding boxes in descending fractional order. + +- Parameters: + - **dets** (Tensor) - A 2-D float tensor of shape [num_boxes, 5]. 
+  - **scores** (Tensor) - A 1-D float tensor of shape [num_boxes] representing a single score corresponding to each box (each row of boxes).
+  - **iou_threshold** (Number) - A scalar representing the threshold for deciding whether boxes overlap too much with respect to IOU.
+  - **scores_threshold** (Number) - A scalar representing the threshold for deciding when to remove boxes based on score. Defaults to "0".
+  - **max_output_size** (Number) - A scalar integer representing the maximum number of boxes to be selected by non max suppression. Defaults to "-1", that is, no constraint is imposed.
+  - **mode** (Number) - Specifies the layout type of the dets. If mode is set to 0, the input values of dets are x, y, w, h, and angle. If mode is set to 1, the input values of dets are x1, y1, x2, y2, and angle. Defaults to "0".
+
+- Returns:
+  - **selected_index** - A 1-D integer tensor of shape [M] representing the selected indices from the dets tensor, where M <= max_output_size.
+  - **selected_num** - A 0-D integer tensor representing the number of valid elements in selected_indices.
+
+- Constraints:
+
+  None
+
+- Examples:
+  ```python
+  >>> dets=torch.randn(100,5).npu()
+  >>> scores=torch.randn(100).npu()
+  >>> dets.uniform_(0,100)
+  >>> scores.uniform_(0,1)
+  >>> output1, output2 = torch.npu_nms_rotated(dets, scores, 0.2, 0, -1, 1)
+  >>> output1
+  tensor([76, 48, 15, 65, 91, 82, 21, 96, 62, 90, 13, 59,  0, 18, 47, 23,  8, 56,
+          55, 63, 72, 39, 97, 81, 16, 38, 17, 25, 74, 33, 79, 44, 36, 88, 83, 37,
+          64, 45, 54, 41, 22, 28, 98, 40, 30, 20,  1, 86, 69, 57, 43,  9, 42, 27,
+          71, 46, 19, 26, 78, 66,  3, 52], device='npu:0', dtype=torch.int32)
+  >>> output2
+  tensor([62], device='npu:0', dtype=torch.int32)
+  ```
+
+> npu_lstm(x, weight, bias, seq_len, h, c, has_biases, num_layers, dropout, train, bidirectional, batch_first, flag_seq, direction)
+
+DynamicRNN calculation.
+
+- Parameters:
+  - **x** (Tensor) - A required 4D Tensor. Must be one of the following types: float16, float32. The format must be FRACTAL_NZ.
+  - **weight** (Tensor) - A required 4D Tensor. Must be one of the following types: float16, float32. The format must be FRACTAL_ZN_LSTM.
+  - **bias** (Tensor) - A required 1D Tensor. Must be one of the following types: float16, float32. The format must be ND.
+  - **seq_len** (Tensor) - An optional Tensor. Only float16 in FRACTAL_NZ and int32 in ND are supported.
+  - **h** (Tensor) - An optional 4D Tensor. Must be one of the following types: float16, float32. The format must be FRACTAL_NZ.
+  - **c** (Tensor) - An optional 4D Tensor. Must be one of the following types: float16, float32. The format must be FRACTAL_NZ.
+  - **has_biases** (bool) - If the value is true, bias exists.
+  - **num_layers** (Number) - Number of recurrent layers. Only a single layer is supported currently.
+  - **dropout** (Number) - If non-zero, introduces a Dropout layer on the outputs of each LSTM layer except the last layer, with dropout probability equal to dropout. Not supported currently.
+  - **train** (bool) - A bool indicating whether the op is in training mode. Defaults to true.
+  - **bidirectional** (bool) - If True, becomes a bidirectional LSTM. Not supported currently.
+  - **batch_first** (bool) - If True, the input and output tensors are provided as (batch, seq, feature). Not supported currently.
+  - **flag_seq** (bool) - If True, the input is a PackedSequence. Not supported currently.
+  - **direction** (bool) - If True, the direction is "BIDIRECTIONAL"; otherwise, "UNIDIRECTIONAL".
+
+- Returns:
+  - **y** - A 4D Tensor. Must be one of the following types: float16, float32. The format must be FRACTAL_NZ.
+  - **output_h** - A 4D Tensor. Must be one of the following types: float16, float32. The format must be FRACTAL_NZ.
+  - **output_c** - A 4D Tensor. Must be one of the following types: float16, float32. The format must be FRACTAL_NZ.
+  - **i** - A 4D Tensor. Must be one of the following types: float16, float32. The format must be FRACTAL_NZ.
+  - **j** - A 4D Tensor. Must be one of the following types: float16, float32. The format must be FRACTAL_NZ.
+  - **f** - A 4D Tensor. Must be one of the following types: float16, float32. The format must be FRACTAL_NZ.
+  - **o** - A 4D Tensor. Must be one of the following types: float16, float32. The format must be FRACTAL_NZ.
+  - **tanhct** - A 4D Tensor. Must be one of the following types: float16, float32. The format must be FRACTAL_NZ.
+
+- Constraints:
+
+  None
+
+- Examples:
+
+  None
+
+>npu_iou(bboxes, gtboxes, mode=0) -> Tensor
+>npu_ptiou(bboxes, gtboxes, mode=0) -> Tensor
+
+Computes the intersection over union (IoU) or the intersection over foreground (IoF) based on the ground-truth and predicted regions.
+
+- Parameters:
+  - **bboxes** (Tensor) - the input tensor.
+  - **gtboxes** (Tensor) - the input tensor.
+  - **mode** (Number) - 0 for "iou" mode, 1 for "iof" mode.
+
+- Constraints:
+
+  None
+
+- Examples:
+
+  ```python
+  >>> bboxes = torch.tensor([[0, 0, 10, 10],
+                             [10, 10, 20, 20],
+                             [32, 32, 38, 42]], dtype=torch.float16).to("npu")
+  >>> gtboxes = torch.tensor([[0, 0, 10, 20],
+                              [0, 10, 10, 10],
+                              [10, 10, 20, 20]], dtype=torch.float16).to("npu")
+  >>> output_iou = torch.npu_iou(bboxes, gtboxes, 0)
+  >>> output_iou
+  tensor([[0.4985, 0.0000, 0.0000],
+          [0.0000, 0.0000, 0.0000],
+          [0.0000, 0.9961, 0.0000]], device='npu:0', dtype=torch.float16)
+  ```
+
+>npu_pad(input, paddings) -> Tensor
+
+Pads a tensor.
+
+- Parameters:
+  - **input** (Tensor) - the input tensor.
+  - **paddings** (ListInt) - type int32 or int64.
+
+- Constraints:
+
+  None
+
+- Examples:
+
+  ```python
+  >>> input = torch.tensor([[20, 20, 10, 10]], dtype=torch.float16).to("npu")
+  >>> paddings = [1, 1, 1, 1]
+  >>> output = torch.npu_pad(input, paddings)
+  >>> output
+  tensor([[ 0.,  0.,  0.,  0.,  0.,  0.],
+          [ 0., 20., 20., 10., 10.,  0.],
+          [ 0.,  0.,  0.,  0.,  0.,  0.]], device='npu:0', dtype=torch.float16)
+  ```
+
+>npu_nms_with_mask(input, iou_threshold) -> (Tensor, Tensor, Tensor)
+
+Performs NMS and generates 0/1 values indicating whether each output box is valid.
+
+- Parameters:
+  - **input** (Tensor) - the input tensor.
+  - **iou_threshold** (Number) - Threshold. If the IoU exceeds this threshold, the mask value is 1; otherwise, it is 0.
+
+- Returns:
+
+  - **selected_boxes** - 2-D tensor with shape of [N,5], representing filtered boxes including proposal boxes and corresponding confidence scores.
+  - **selected_idx** - 1-D tensor with shape of [N], representing the index of input proposal boxes.
+  - **selected_mask** - 1-D tensor with shape of [N], indicating whether the output proposal boxes are valid.
+
+- Constraints:
+
+  The 2nd-dim of input box_scores must be equal to 8.
+
+- Examples:
+
+  ```python
+  >>> input = torch.tensor([[0.0, 1.0, 2.0, 3.0, 0.6], [6.0, 7.0, 8.0, 9.0, 0.4]], dtype=torch.float16).to("npu")
+  >>> iou_threshold = 0.5
+  >>> output1, output2, output3, = torch.npu_nms_with_mask(input, iou_threshold)
+  >>> output1
+  tensor([[0.0000, 1.0000, 2.0000, 3.0000, 0.6001],
+          [6.0000, 7.0000, 8.0000, 9.0000, 0.3999]], device='npu:0',
+         dtype=torch.float16)
+  >>> output2
+  tensor([0, 1], device='npu:0', dtype=torch.int32)
+  >>> output3
+  tensor([1, 1], device='npu:0', dtype=torch.uint8)
+  ```
+
+>npu_bounding_box_encode(anchor_box, ground_truth_box, means0, means1, means2, means3, stds0, stds1, stds2, stds3) -> Tensor
+
+Computes the coordinate variations between bboxes and ground truth boxes. It is a customized FasterRcnn operator.
+
+- Parameters:
+  - **anchor_box** (Tensor) - Anchor boxes. A 2D Tensor of float32 with shape (N, 4). "N" indicates the number of bounding boxes, and the value "4" refers to "x0", "x1", "y0", and "y1".
+  - **ground_truth_box** (Tensor) - Ground truth boxes. A 2D Tensor of float32 with shape (N, 4). "N" indicates the number of bounding boxes, and the value "4" refers to "x0", "x1", "y0", and "y1".
+  - **means0** (Number) - An attribute of type float.
+  - **means1** (Number) - An attribute of type float.
+  - **means2** (Number) - An attribute of type float.
+  - **means3** (Number) - An attribute of type float. The means default to [0,0,0,0]. "deltas" = "deltas" x "stds" + "means".
+  - **stds0** (Number) - An attribute of type float.
+  - **stds1** (Number) - An attribute of type float.
+  - **stds2** (Number) - An attribute of type float.
+  - **stds3** (Number) - An attribute of type float. The stds default to [1.0,1.0,1.0,1.0]. "deltas" = "deltas" x "stds" + "means".
+
+- Constraints:
+
+  None
+
+- Examples:
+
+  ```python
+  >>> anchor_box = torch.tensor([[1., 2., 3., 4.], [3.,4., 5., 6.]], dtype = torch.float32).to("npu")
+  >>> ground_truth_box = torch.tensor([[5., 6., 7., 8.], [7.,8., 9., 6.]], dtype = torch.float32).to("npu")
+  >>> output = torch.npu_bounding_box_encode(anchor_box, ground_truth_box, 0, 0, 0, 0, 0.1, 0.1, 0.2, 0.2)
+  >>> output
+  tensor([[13.3281, 13.3281,  0.0000,  0.0000],
+          [13.3281,  6.6641,  0.0000, -5.4922]], device='npu:0')
+  >>>
+  ```
+
+>npu_bounding_box_decode(rois, deltas, means0, means1, means2, means3, stds0, stds1, stds2, stds3, max_shape, wh_ratio_clip) -> Tensor
+
+Generates bounding boxes based on "rois" and "deltas". It is a customized FasterRcnn operator.
+
+- Parameters:
+  - **rois** (Tensor) - Regions of interest (ROIs) generated by the region proposal network (RPN). A 2D Tensor of type float32 or float16 with shape (N, 4). "N" indicates the number of ROIs, and the value "4" refers to "x0", "x1", "y0", and "y1".
+  - **deltas** (Tensor) - Absolute variation between the ROIs generated by the RPN and ground truth boxes. A 2D Tensor of type float32 or float16 with shape (N, 4). "N" indicates the number of errors, and 4 indicates "dx", "dy", "dw", and "dh".
+  - **means0** (Number) - An attribute of type float.
+  - **means1** (Number) - An attribute of type float.
+  - **means2** (Number) - An attribute of type float.
+  - **means3** (Number) - An attribute of type float. The means default to [0,0,0,0]. "deltas" = "deltas" x "stds" + "means".
+  - **stds0** (Number) - An attribute of type float.
+  - **stds1** (Number) - An attribute of type float.
+  - **stds2** (Number) - An attribute of type float.
+  - **stds3** (Number) - An attribute of type float. The stds default to [1.0,1.0,1.0,1.0]. "deltas" = "deltas" x "stds" + "means".
+
+  - **max_shape** (ListInt) - Shape [h, w], specifying the size of the image transferred to the network. Used to ensure that the bbox shape after conversion does not exceed "max_shape".
+  - **wh_ratio_clip** (Number) - Defaults to "16/1000". The values of "dw" and "dh" fall within (-wh_ratio_clip, wh_ratio_clip).
+
+- Constraints:
+
+  None
+
+- Examples:
+
+  ```python
+  >>> rois = torch.tensor([[1., 2., 3., 4.], [3.,4., 5., 6.]], dtype = torch.float32).to("npu")
+  >>> deltas = torch.tensor([[5., 6., 7., 8.], [7.,8., 9., 6.]], dtype = torch.float32).to("npu")
+  >>> output = torch.npu_bounding_box_decode(rois, deltas, 0, 0, 0, 0, 1, 1, 1, 1, (10, 10), 0.1)
+  >>> output
+  tensor([[2.5000, 6.5000, 9.0000, 9.0000],
+          [9.0000, 9.0000, 9.0000, 9.0000]], device='npu:0')
+  ```
+
+>npu_gru(input, hx, weight_input, weight_hidden, bias_input, bias_hidden, seq_length, has_biases, num_layers, dropout, train, bidirectional, batch_first) -> (Tensor, Tensor, Tensor, Tensor, Tensor, Tensor)
+
+DynamicGRUV2 calculation.
+
+- Parameters:
+  - **input** (Tensor) - Must be one of the following types: float16. The format must be FRACTAL_NZ.
+  - **hx** (Tensor) - Must be one of the following types: float16, float32. The format must be FRACTAL_NZ.
+  - **weight_input** (Tensor) - Must be one of the following types: float16. The format must be FRACTAL_Z.
+  - **weight_hidden** (Tensor) - Must be one of the following types: float16. The format must be FRACTAL_Z.
+  - **bias_input** (Tensor) - Must be one of the following types: float16, float32. The format must be ND.
+  - **bias_hidden** (Tensor) - Must be one of the following types: float16, float32. The format must be ND.
+  - **seq_length** (Tensor) - Must be one of the following types: int32. The format must be ND.
+  - **has_biases** (bool) - Defaults to true.
+  - **num_layers** (Number)
+  - **dropout** (Number)
+  - **train** (bool) - A bool indicating whether the op is in training mode. Defaults to true.
+  - **bidirectional** (bool) - Defaults to true.
+  - **batch_first** (bool) - Defaults to true.
+
+- Returns:
+
+  - **y** (Tensor) - Must be one of the following types: float16, float32. The format must be FRACTAL_NZ.
+  - **output_h** (Tensor) - Must be one of the following types: float16, float32. The format must be FRACTAL_NZ.
+  - **update** (Tensor) - Must be one of the following types: float16, float32. The format must be FRACTAL_NZ.
+  - **reset** (Tensor) - Must be one of the following types: float16, float32. The format must be FRACTAL_NZ.
+  - **new** (Tensor) - Must be one of the following types: float16, float32. The format must be FRACTAL_NZ.
+  - **hidden_new** (Tensor) - Must be one of the following types: float16, float32. The format must be FRACTAL_NZ.
+
+- Constraints:
+
+  None
+
+- Examples:
+
+  None
+
+>npu_random_choice_with_mask(x, count=256, seed=0, seed2=0) -> (Tensor, Tensor)
+
+Shuffles the indexes of nonzero elements.
+
+- Parameters:
+  - **x** (Tensor) - the input tensor.
+  - **count** (Number) - the count of output elements; if 0, all nonzero elements are output.
+  - **seed** (Number) - type int32 or int64.
+  - **seed2** (Number) - type int32 or int64.
+
+- Returns:
+
+  - **y** - A 2-D tensor of nonzero-element indexes.
+  - **mask** - A 1-D tensor indicating whether the corresponding index is valid.
+
+- Constraints:
+
+  None
+
+- Examples:
+
+  ```python
+  >>> x = torch.tensor([1, 0, 1, 0], dtype=torch.bool).to("npu")
+  >>> result, mask = torch.npu_random_choice_with_mask(x, 2, 1, 0)
+  >>> result
+  tensor([[0],
+          [2]], device='npu:0', dtype=torch.int32)
+  >>> mask
+  tensor([True, True], device='npu:0')
+  ```
+
+>npu_batch_nms(self, scores, score_threshold, iou_threshold, max_size_per_class, max_total_size, change_coordinate_frame=False, transpose_box=False) -> (Tensor, Tensor, Tensor, Tensor)
+
+Computes NMS for input boxes and scores, supporting multiple batches and classes. The op clips boxes to the window, filters by score, applies top_k, and then performs NMS.
+
+- Parameters:
+  - **self** (Tensor) - the input tensor.
+  - **scores** (Tensor) - the input tensor.
+  - **score_threshold** (Number) - A required attribute of type float32, specifying the score filtering threshold.
+  - **iou_threshold** (Number) - A required attribute of type float32, specifying the IoU threshold for NMS.
+  - **max_size_per_class** (Number) - A required attribute of type int, specifying the NMS output num per class.
+  - **max_total_size** (Number) - A required attribute of type int, specifying the NMS output num per batch.
+  - **change_coordinate_frame** (bool) - An optional attribute of type bool, specifying whether to normalize coordinates after clipping.
+  - **transpose_box** (bool) - An optional attribute of type bool, specifying whether a transpose is inserted before this op. Must be "false".
+
+- Returns:
+
+  - **nmsed_boxes** (Tensor) - A 3D Tensor of type float16 with shape (batch, max_total_size, 4), specifying the output NMS boxes per batch.
+  - **nmsed_scores** (Tensor) - A 2D Tensor of type float16 with shape (batch, max_total_size), specifying the output NMS score per batch.
+  - **nmsed_classes** (Tensor) - A 2D Tensor of type float16 with shape (batch, max_total_size), specifying the output NMS class per batch.
+  - **nmsed_num** (Tensor) - A 1D Tensor of type int32 with shape (batch), specifying the valid num of nmsed_boxes.
+
+- Constraints:
+
+  None
+
+- Examples:
+
+  ```python
+  >>> boxes = torch.randn(8, 2, 4, 4, dtype = torch.float32).to("npu")
+  >>> scores = torch.randn(8, 2, 4, dtype = torch.float32).to("npu")
+  >>> nmsed_boxes, nmsed_scores, nmsed_classes, nmsed_num = torch.npu_batch_nms(boxes, scores, 0.3, 0.5, 3, 4)
+  >>> nmsed_boxes
+  >>> nmsed_scores
+  >>> nmsed_classes
+  >>> nmsed_num
+  ```
+
+>npu_slice(self, offsets, size) -> Tensor
+
+Extracts a slice from a tensor.
+
+- Parameters:
+  - **self** (Tensor) - the input tensor.
+  - **offsets** (ListInt) - type int32 or int64.
+  - **size** (ListInt) - type int32 or int64.
+
+- Constraints:
+
+  None
+
+- Examples:
+
+  ```python
+  >>> input = torch.tensor([[1,2,3,4,5], [6,7,8,9,10]], dtype=torch.float16).to("npu")
+  >>> offsets = [0, 0]
+  >>> size = [2, 2]
+  >>> output = torch.npu_slice(input, offsets, size)
+  >>> output
+  tensor([[1., 2.],
+          [6., 7.]], device='npu:0', dtype=torch.float16)
+  ```
+
+>npu_dropoutV2(self, seed, p) -> (Tensor, Tensor, Tensor(a!))
+
+Computes the dropout result with seed.
+
+- Parameters:
+  - **self** (Tensor) - The input Tensor.
+  - **seed** (Tensor) - The input Tensor.
+  - **p** (Float) - Dropout probability.
+
+- Returns:
+
+  - **y** - A tensor with the same shape and type as "x".
+  - **mask** - A tensor with the same shape and type as "x".
+  - **new_seed** - A tensor with the same shape and type as "seed".
+
+- Constraints:
+
+  None
+
+- Examples:
+
+  ```python
+  >>> input = torch.tensor([1.,2.,3.,4.]).npu()
+  >>> input
+  tensor([1., 2., 3., 4.], device='npu:0')
+  >>> seed = torch.rand((32,),dtype=torch.float32).npu()
+  >>> seed
+  tensor([0.4368, 0.7351, 0.8459, 0.4657, 0.6783, 0.8914, 0.8995, 0.4401, 0.4408,
+          0.4453, 0.2404, 0.9680, 0.0999, 0.8665, 0.2993, 0.5787, 0.0251, 0.6783,
+          0.7411, 0.0670, 0.9430, 0.9165, 0.3983, 0.5849, 0.7722, 0.4659, 0.0486,
+          0.2693, 0.6451, 0.2734, 0.3176, 0.0176], device='npu:0')
+  >>> prob = 0.3
+  >>> output, mask, out_seed = torch.npu_dropoutV2(input, seed, prob)
+  >>> output
+  tensor([0.4408, 0.4453, 0.2404, 0.9680], device='npu:0')
+  >>> mask
+  tensor([0., 0., 0., 0.], device='npu:0')
+  >>> out_seed
+  tensor([0.4408, 0.4453, 0.2404, 0.9680, 0.0999, 0.8665, 0.2993, 0.5787, 0.0251,
+          0.6783, 0.7411, 0.0670, 0.9430, 0.9165, 0.3983, 0.5849, 0.7722, 0.4659,
+          0.0486, 0.2693, 0.6451, 0.2734, 0.3176, 0.0176, 0.0000, 0.0000, 0.0000,
+          0.0000, 0.0000, 0.0000, 0.0000, 0.0000], device='npu:0')
+  ```
+
+>_npu_dropout(self, p) -> (Tensor, Tensor)
+
+Computes the dropout result without seed.
+
+- Parameters:
+  Similar to `torch.dropout`, with an implementation optimized for the NPU device.
+  - **self** (Tensor) - The input Tensor.
+  - **p** (Float) - Dropout probability.
+
+- Constraints:
+
+  None
+
+- Examples:
+
+  ```python
+  >>> input = torch.tensor([1.,2.,3.,4.]).npu()
+  >>> input
+  tensor([1., 2., 3., 4.], device='npu:0')
+  >>> prob = 0.3
+  >>> output, mask = torch._npu_dropout(input, prob)
+  >>> output
+  tensor([0.0000, 2.8571, 0.0000, 0.0000], device='npu:0')
+  >>> mask
+  tensor([ 98, 255, 188, 186, 120, 157, 175, 159,  77, 223, 127,  79, 247, 151,
+        253, 255], device='npu:0', dtype=torch.uint8)
+  ```
+
+>_npu_dropout_inplace(result, p) -> (Tensor(a!), Tensor)
+
+Computes the dropout result in place.
+
+- Parameters:
+  Similar to `torch.dropout_`, with an implementation optimized for the NPU device.
+  - **result** (Tensor) - The tensor to apply dropout to in place.
+  - **p** (Float) - Dropout probability.
+
+- Constraints:
+
+  None
+
+- Examples:
+
+  ```python
+  >>> input = torch.tensor([1.,2.,3.,4.]).npu()
+  >>> input
+  tensor([1., 2., 3., 4.], device='npu:0')
+  >>> prob = 0.3
+  >>> output, mask = torch._npu_dropout_inplace(input, prob)
+  >>> output
+  tensor([0.0000, 2.8571, 0.0000, 0.0000], device='npu:0')
+  >>> input
+  tensor([0.0000, 2.8571, 4.2857, 5.7143], device='npu:0')
+  >>> mask
+  tensor([ 98, 255, 188, 186, 120, 157, 175, 159,  77, 223, 127,  79, 247, 151,
+        253, 255], device='npu:0', dtype=torch.uint8)
+  ```
+
+>npu_indexing(self, begin, end, strides, begin_mask=0, end_mask=0, ellipsis_mask=0, new_axis_mask=0, shrink_axis_mask=0) -> Tensor
+
+Computes a strided-slice (indexing) result from the begin, end, and strides arrays.
+
+- Parameters:
+  - **self** (Tensor) - An input Tensor.
+  - **begin** (ListInt) - The index of the first value to select.
+  - **end** (ListInt) - The index of the last value to select.
+  - **strides** (ListInt) - The index increment.
+  - **begin_mask** (Number) - A bitmask where a bit "i" being "1" means to ignore the begin value and instead use the largest interval possible.
+  - **end_mask** (Number) - Analogous to "begin_mask".
+  - **ellipsis_mask** (Number) - A bitmask where bit "i" being "1" means the "i"th position is actually an ellipsis.
+  - **new_axis_mask** (Number) - A bitmask where bit "i" being "1" means the "i"th specification creates a new shape 1 dimension.
+  - **shrink_axis_mask** (Number) - A bitmask where bit "i" implies that the "i"th specification should shrink the dimensionality.
+
+- Constraints:
+
+  None
+
+- Examples:
+
+  ```python
+  >>> input = torch.tensor([[1, 2, 3, 4],[5, 6, 7, 8]], dtype=torch.int32).to("npu")
+  >>> input
+  tensor([[1, 2, 3, 4],
+        [5, 6, 7, 8]], device='npu:0', dtype=torch.int32)
+  >>> output = torch.npu_indexing(input, [0, 0], [2, 2], [1, 1])
+  >>> output
+  tensor([[1, 2],
+        [5, 6]], device='npu:0', dtype=torch.int32)
+  ```
+
+>npu_ifmr(Tensor data, Tensor data_min, Tensor data_max, Tensor cumsum, float min_percentile, float max_percentile, float search_start, float search_end, float search_step, bool with_offset) -> (Tensor, Tensor)
+
+Computes the IFMR (Input Feature Map Reconstruction) result, searching for the optimal quantization scale and offset.
+
+- Parameters:
+  - **data** (Tensor) - A Tensor of feature map.
+  - **data_min** (Tensor) - A Tensor of the min value of the feature map.
+  - **data_max** (Tensor) - A Tensor of the max value of the feature map.
+  - **cumsum** (Tensor) - A Tensor of cumsum bins of data.
+  - **min_percentile** (Float) - min init percentile.
+  - **max_percentile** (Float) - max init percentile.
+  - **search_start** (Float) - search start.
+  - **search_end** (Float) - search end.
+  - **search_step** (Float) - step size of searching.
+  - **with_offset** (bool) - whether to use offset.
+
+- Returns:
+
+  - **scale** - optimal scale.
+  - **offset** - optimal offset.
+
+- Constraints:
+
+  None
+
+- Examples:
+
+  ```python
+  >>> input = torch.rand((2,2,3,4),dtype=torch.float32).npu()
+  >>> input
+  tensor([[[[0.4508, 0.6513, 0.4734, 0.1924],
+            [0.0402, 0.5502, 0.0694, 0.9032],
+            [0.4844, 0.5361, 0.9369, 0.7874]],
+
+           [[0.5157, 0.1863, 0.4574, 0.8033],
+            [0.5986, 0.8090, 0.7605, 0.8252],
+            [0.4264, 0.8952, 0.2279, 0.9746]]],
+
+          [[[0.0803, 0.7114, 0.8773, 0.2341],
+            [0.6497, 0.0423, 0.8407, 0.9515],
+            [0.1821, 0.5931, 0.7160, 0.4968]],
+
+           [[0.7977, 0.0899, 0.9572, 0.0146],
+            [0.2804, 0.8569, 0.2292, 0.1118],
+            [0.5747, 0.4064, 0.8370, 0.1611]]]], device='npu:0')
+  >>> min_value = torch.min(input)
+  >>> min_value
+  tensor(0.0146, device='npu:0')
+  >>> max_value = torch.max(input)
+  >>> max_value
+  tensor(0.9746, device='npu:0')
+  >>> hist = torch.histc(input.to('cpu'),
+                         bins=128,
+                         min=min_value.to('cpu'),
+                         max=max_value.to('cpu'))
+  >>> hist
+  tensor([1., 0., 0., 2., 0., 0., 0., 1., 1., 0., 1., 0., 1., 0., 0., 0., 0., 0.,
+          0., 1., 0., 0., 2., 1., 0., 0., 0., 0., 2., 1., 0., 0., 0., 0., 0., 1.,
+          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.,
+          1., 0., 0., 0., 1., 1., 0., 1., 1., 0., 1., 0., 1., 0., 0., 1., 0., 1.,
+          0., 0., 1., 0., 0., 2., 0., 0., 0., 0., 0., 0., 2., 0., 0., 0., 0., 0.,
+          0., 0., 1., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 1., 1., 2., 0., 0.,
+          1., 1., 1., 0., 1., 0., 0., 1., 0., 1., 1., 0., 0., 0., 1., 0., 1., 1.,
+          0., 1.])
+  >>> cdf = torch.cumsum(hist,dim=0).int().npu()
+  >>> cdf
+  tensor([ 1,  1,  1,  3,  3,  3,  3,  4,  5,  5,  6,  6,  7,  7,  7,  7,  7,  7,
+           7,  8,  8,  8, 10, 11, 11, 11, 11, 11, 13, 14, 14, 14, 14, 14, 14, 15,
+          15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 16, 16,
+          17, 17, 17, 17, 18, 19, 19, 20, 21, 21, 22, 22, 23, 23, 23, 24, 24, 25,
+          25, 25, 26, 26, 26, 28, 28, 28, 28, 28, 28, 28, 30, 30, 30, 30, 30, 30,
+          30, 30, 31, 32, 32, 32, 32, 32, 32, 33, 33, 33, 33, 34, 35, 37, 37, 37,
+          38, 39, 40, 40, 41, 41, 41, 42, 42, 43, 44, 44, 44, 44, 45, 45, 46, 47,
+          47, 48], device='npu:0', dtype=torch.int32)
+  >>> scale, offset = torch.npu_ifmr(input,
+                                     min_value,
+                                     max_value,
+                                     cdf,
+                                     min_percentile=0.999999,
+                                     max_percentile=0.999999,
+                                     search_start=0.7,
+                                     search_end=1.3,
+                                     search_step=0.01,
+                                     with_offset=False)
+  >>> scale
+  tensor(0.0080, device='npu:0')
+  >>> offset
+  tensor(0., device='npu:0')
+  ```
+
+>npu_max.dim(self, dim, keepdim=False) -> (Tensor, Tensor)
+
+Computes the maximum values along the given dim.
+
+- Parameters:
+  Similar to `torch.max`, with an implementation optimized for the NPU device.
+
+  - **self** (Tensor) - the input tensor.
+  - **dim** (Number) - the dimension to reduce.
+  - **keepdim** (bool) - whether the output tensor has dim retained or not.
+
+- Returns:
+
+  - **values** - max values in the input tensor.
+  - **indices** - indexes of the max values in the input tensor.
+
+- Constraints:
+
+  None
+
+- Examples:
+
+  ```python
+  >>> input = torch.randn(2, 2, 2, 2, dtype = torch.float32).npu()
+  >>> input
+  tensor([[[[-1.8135,  0.2078],
+            [-0.6678,  0.7846]],
+
+           [[ 0.6458, -0.0923],
+            [-0.2124, -1.9112]]],
+
+          [[[-0.5800, -0.4979],
+            [ 0.2580,  1.1335]],
+
+           [[ 0.6669,  0.1876],
+            [ 0.1160, -0.1061]]]], device='npu:0')
+  >>> outputs, indices = torch.npu_max(input, 2)
+  >>> outputs
+  tensor([[[-0.6678,  0.7846],
+           [ 0.6458, -0.0923]],
+
+          [[ 0.2580,  1.1335],
+           [ 0.6669,  0.1876]]], device='npu:0')
+  >>> indices
+  tensor([[[1, 1],
+           [0, 0]],
+
+          [[1, 1],
+           [0, 0]]], device='npu:0', dtype=torch.int32)
+  ```
+
+>npu_min.dim(self, dim, keepdim=False) -> (Tensor, Tensor)
+
+Computes the minimum values along the given dim.
+
+- Parameters:
+  Similar to `torch.min`, with an implementation optimized for the NPU device.
+  - **self** (Tensor) - the input tensor.
+  - **dim** (Number) - the dimension to reduce.
+  - **keepdim** (bool) - whether the output tensor has dim retained or not.
+
+- Returns:
+
+  - **values** - min values in the input tensor.
+  - **indices** - indexes of the min values in the input tensor.
+
+- Constraints:
+
+  None
+
+- Examples:
+
+  ```python
+  >>> input = torch.randn(2, 2, 2, 2, dtype = torch.float32).npu()
+  >>> input
+  tensor([[[[-0.9909, -0.2369],
+            [-0.9569, -0.6223]],
+
+           [[ 0.1157, -0.3147],
+            [-0.7761,  0.1344]]],
+
+          [[[ 1.6292,  0.5953],
+            [ 0.6940, -0.6367]],
+
+           [[-1.2335,  0.2131],
+            [ 1.0748, -0.7046]]]], device='npu:0')
+  >>> outputs, indices = torch.npu_min(input, 2)
+  >>> outputs
+  tensor([[[-0.9909, -0.6223],
+           [-0.7761, -0.3147]],
+
+          [[ 0.6940, -0.6367],
+           [-1.2335, -0.7046]]], device='npu:0')
+  >>> indices
+  tensor([[[0, 1],
+           [1, 0]],
+
+          [[1, 1],
+           [0, 1]]], device='npu:0', dtype=torch.int32)
+  ```
+
+>npu_scatter(self, indices, updates, dim) -> Tensor
+
+Computes the scatter result along the given dim.
+
+- Parameters:
+  Similar to `torch.scatter`, with an implementation optimized for the NPU device.
+
+  - **self** (Tensor) - the input tensor.
+  - **indices** (Tensor) - the indices of elements to scatter; can be either empty or of the same dimensionality as src. When empty, the operation returns self unchanged.
+  - **updates** (Tensor) - the source element(s) to scatter.
+  - **dim** (Number) - the axis along which to index.
+
+- Constraints:
+
+  None
+
+- Examples:
+
+  ```python
+  >>> input = torch.tensor([[1.6279, 0.1226], [0.9041, 1.0980]]).npu()
+  >>> input
+  tensor([[1.6279, 0.1226],
+          [0.9041, 1.0980]], device='npu:0')
+  >>> indices = torch.tensor([0, 1],dtype=torch.int32).npu()
+  >>> indices
+  tensor([0, 1], device='npu:0', dtype=torch.int32)
+  >>> updates = torch.tensor([-1.1993, -1.5247]).npu()
+  >>> updates
+  tensor([-1.1993, -1.5247], device='npu:0')
+  >>> dim = 0
+  >>> output = torch.npu_scatter(input, indices, updates, dim)
+  >>> output
+  tensor([[-1.1993,  0.1226],
+          [ 0.9041, -1.5247]], device='npu:0')
+  ```
+
+>npu_layer_norm_eval(input, normalized_shape, weight=None, bias=None, eps=1e-05) -> Tensor
+
+Computes the layer norm result in evaluation mode.
+
+- Parameters:
+  The same as `torch.nn.functional.layer_norm`, with an implementation optimized for the NPU device.
+  - **input** (Tensor) - The input Tensor.
+  - **normalized_shape** (ListInt) - input shape from an expected input of size.
+  - **weight** (Tensor) - The gamma Tensor.
+  - **bias** (Tensor) - The beta Tensor.
+  - **eps** (Float) - The epsilon value added to the denominator for numerical stability. Default: 1e-5.
+
+- Constraints:
+
+  None
+
+- Examples:
+
+  ```python
+  >>> input = torch.rand((6, 4), dtype=torch.float32).npu()
+  >>> input
+  tensor([[0.1863, 0.3755, 0.1115, 0.7308],
+          [0.6004, 0.6832, 0.8951, 0.2087],
+          [0.8548, 0.0176, 0.8498, 0.3703],
+          [0.5609, 0.0114, 0.5021, 0.1242],
+          [0.3966, 0.3022, 0.2323, 0.3914],
+          [0.1554, 0.0149, 0.1718, 0.4972]], device='npu:0')
+  >>> normalized_shape = input.size()[1:]
+  >>> normalized_shape
+  torch.Size([4])
+  >>> weight = torch.Tensor(*normalized_shape).npu()
+  >>> weight
+  tensor([        nan,  6.1223e-41, -8.3159e-20,  9.1834e-41], device='npu:0')
+  >>> bias = torch.Tensor(*normalized_shape).npu()
+  >>> bias
+  tensor([5.6033e-39, 6.1224e-41, 6.1757e-39, 6.1224e-41], device='npu:0')
+  >>> output = torch.npu_layer_norm_eval(input, normalized_shape, weight, bias, 1e-5)
+  >>> output
+  tensor([[        nan,  6.7474e-41,  8.3182e-20,  2.0687e-40],
+          [        nan,  8.2494e-41, -9.9784e-20, -8.2186e-41],
+          [        nan, -2.6695e-41, -7.7173e-20,  2.1353e-41],
+          [        nan, -1.3497e-41, -7.1281e-20, -6.9827e-42],
+          [        nan,  3.5663e-41,  1.2002e-19,  1.4314e-40],
+          [        nan, -6.2792e-42,  1.7902e-20,  2.1050e-40]], device='npu:0')
+  ```
+
+>npu_alloc_float_status(self) -> Tensor
+
+Produces a tensor of eight numbers with a value of zero.
+
+- Parameters:
+
+  - **self** (Tensor) - Any Tensor.
+
+- Constraints:
+
+  None
+
+- Examples:
+
+  ```python
+  >>> input = torch.randn([1,2,3]).npu()
+  >>> output = torch.npu_alloc_float_status(input)
+  >>> input
+  tensor([[[ 2.2324,  0.2478, -0.1056],
+           [ 1.1273, -0.2573,  1.0558]]], device='npu:0')
+  >>> output
+  tensor([0., 0., 0., 0., 0., 0., 0., 0.], device='npu:0')
+  ```
+
+> npu_get_float_status(self) -> Tensor
+
+Obtains the NPU floating-point status.
+
+- Parameters:
+
+  - **self** (Tensor) - A Tensor of the data memory address. Must be float32.
+
+- Constraints:
+
+  None
+
+- Examples:
+
+  ```python
+  >>> x = torch.rand(2).npu()
+  >>> torch.npu_get_float_status(x)
+  tensor([0., 0., 0., 0., 0., 0., 0., 0.], device='npu:0')
+  ```
+
+> npu_clear_float_status(self) -> Tensor
+
+Sets the value of address 0x40000 to 0 in each core.
+
+- Parameters:
+
+  - **self** (Tensor) - A tensor of type float32.
+
+- Constraints:
+
+  None
+
+- Examples:
+
+  ```python
+  >>> x = torch.rand(2).npu()
+  >>> torch.npu_clear_float_status(x)
+  tensor([0., 0., 0., 0., 0., 0., 0., 0.], device='npu:0')
+  ```
+
+> npu_confusion_transpose(self, perm, shape, transpose_first) -> Tensor
+
+Fuses reshape and transpose operations.
+
+- Parameters:
+
+  - **self** (Tensor) - A Tensor. Must be one of the following types: float16, float32, int8, int16, int32, int64, uint8, uint16, uint32, uint64.
+  - **perm** (ListInt) - A permutation of the dimensions of "x".
+  - **shape** (ListInt) - The shape of the input.
+  - **transpose_first** (bool) - If True, the transpose is done first; otherwise, the reshape is done first.
+
+- Constraints:
+
+  None
+
+- Examples:
+
+  ```python
+  >>> x = torch.rand(2, 3, 4, 6).npu()
+  >>> x.shape
+  torch.Size([2, 3, 4, 6])
+  >>> y = torch.npu_confusion_transpose(x, (0, 2, 1, 3), (2, 4, 18), True)
+  >>> y.shape
+  torch.Size([2, 4, 18])
+  >>> y2 = torch.npu_confusion_transpose(x, (0, 2, 1), (2, 12, 6), False)
+  >>> y2.shape
+  torch.Size([2, 6, 12])
+  ```
+
+> npu_bmmV2(self, mat2, output_sizes) -> Tensor
+
+Multiplies matrix "a" by matrix "b", producing "a * b".
+
+- Parameters:
+  - **self** (Tensor) - A matrix Tensor. Must be one of the following types: float16, float32, int32. 2D or higher. Has format [ND, NHWC, FRACTAL_NZ].
+  - **mat2** (Tensor) - A matrix Tensor. Must be one of the following types: float16, float32, int32. 2D or higher. Has format [ND, NHWC, FRACTAL_NZ].
+  - **output_sizes** (ListInt) - Output's shape, used in matmul's backpropagation. Defaults to [].
+
+- Constraints:
+
+  None
+
+- Examples:
+
+  ```python
+  >>> mat1 = torch.randn(10, 3, 4).npu()
+  >>> mat2 = torch.randn(10, 4, 5).npu()
+  >>> res = torch.npu_bmmV2(mat1, mat2, [])
+  >>> res.shape
+  torch.Size([10, 3, 5])
+  ```
+
+> fast_gelu(self) -> Tensor
+
+Computes the fast GELU of "x" element-wise.
+
+- Parameters:
+
+  - **self** (Tensor) - A Tensor. Must be one of the following types: float16, float32.
+
+- Constraints:
+
+  None
+
+- Examples:
+
+  ```python
+  >>> x = torch.rand(2).npu()
+  >>> x
+  tensor([0.5991, 0.4094], device='npu:0')
+  >>> torch.fast_gelu(x)
+  tensor([0.4403, 0.2733], device='npu:0')
+  ```
+
+> npu_sub_sample(self, per_images, positive_fraction) -> Tensor
+
+Randomly samples a subset of positive and negative examples, and overwrites the label vector with the ignore value (-1) for all elements that are not included in the sample.
+
+- Parameters:
+
+  - **self** (Tensor) - A label vector with shape (N, ).
+  - **per_images** (Number) - A required attribute of type int.
+  - **positive_fraction** (Float) - A required attribute of type float.
+
+- Constraints:
+
+  None
+
+- Examples:
+
+  ```python
+  >>> x = torch.tensor([-2, 3, 6, -7, -2, 8, 1, -5, 7, 4]).int().npu()
+  >>> x
+  tensor([-2,  3,  6, -7, -2,  8,  1, -5,  7,  4], device='npu:0',
+        dtype=torch.int32)
+  >>> torch.npu_sub_sample(x, 5, 0.6)
+  tensor([-1, -1, -1, -1, -1, -1,  1, -1, -1, -1], device='npu:0',
+        dtype=torch.int32)
+  ```
+
+> npu_deformable_conv2d(input, weight, offset, bias, kernel_size, stride, padding, dilation=[1,1,1,1], groups=1, deformable_groups=1, modulated=True) -> (Tensor, Tensor)
+
+Computes the deformed convolution output with the expected input.
+
+- Parameters:
+
+  - **input** (Tensor) - A 4D tensor of input image. With the format "NHWC", the data is stored in the order of: [batch, in_height, in_width, in_channels].
+  - **weight** (Tensor) - A 4D tensor of learnable filters. Must have the same type as "x". With the format "HWCN", the data is stored in the order of: [filter_height, filter_width, in_channels / groups, out_channels].
+  - **offset** (Tensor) - A 4D tensor of x-y coordinates offset and mask. With the format "NHWC", the data is stored in the order of: [batch, out_height, out_width, deformable_groups * filter_height * filter_width * 3].
+  - **bias** (Tensor) - An optional 1D tensor of additive biases to the filter outputs. The data is stored in the order of: [out_channels].
+  - **kernel_size** (ListInt) - A tuple/list of 2 integers. The kernel size.
+  - **stride** (ListInt) - Required. A list of 4 integers. The stride of the sliding window for each dimension of input. The dimension order is interpreted according to the data format of "x". The N and C dimensions must be set to 1.
+  - **padding** (ListInt) - Required. A list of 4 integers. The number of pixels to add to each (top, bottom, left, right) side of the input.
+  - **dilation** (ListInt) - Optional. A list of 4 integers. The dilation factor for each dimension of input. The dimension order is interpreted according to the data format of "x". The N and C dimensions must be set to 1. Defaults to [1, 1, 1, 1].
+  - **groups** (Number) - Optional. An integer of type int32. The number of blocked connections from input channels to output channels. In_channels and out_channels must both be divisible by "groups". Defaults to 1.
+  - **deformable_groups** (Number) - Optional. An integer of type int32. The number of deformable group partitions. In_channels must be divisible by "deformable_groups". Defaults to 1.
+  - **modulated** (bool) - Optional. Specifies the version of DeformableConv2D: true means v2, false means v1. Currently only v2 is supported.
+
+- Constraints:
+
+  None
+
+- Examples:
+
+  ```python
+  >>> x = torch.rand(16, 32, 32, 32).npu()
+  >>> weight = torch.rand(32, 32, 5, 5).npu()
+  >>> offset = torch.rand(16, 75, 32, 32).npu()
+  >>> output, _ = torch.npu_deformable_conv2d(x, weight, offset, None, kernel_size=[5, 5], stride = [1, 1, 1, 1], padding = [2, 2, 2, 2])
+  >>> output.shape
+  torch.Size([16, 32, 32, 32])
+  ```
+
+> npu_mish(self) -> Tensor
+
+Computes the Mish activation of "x" element-wise.
+
+- Parameters:
+
+  - **self** (Tensor) - A Tensor. Must be one of the following types: float16, float32.
+
+- Constraints:
+
+  None
+
+- Examples:
+
+  ```python
+  >>> x = torch.rand(10, 30, 10).npu()
+  >>> y = torch.npu_mish(x)
+  >>> y.shape
+  torch.Size([10, 30, 10])
+  ```
+
+> npu_anchor_response_flags(self, featmap_size, stride, num_base_anchors) -> Tensor
+
+Generates the responsible flags of anchors in a single feature map.
+
+- Parameters:
+  - **self** (Tensor) - Ground truth boxes, a 2-D Tensor with shape [batch, 4].
+  - **featmap_size** (ListInt) - The size of the feature map.
+  - **stride** (ListInt) - Stride of the current level.
+  - **num_base_anchors** (Number) - The number of base anchors.
+
+- Constraints:
+
+  None
+
+- Examples:
+
+  ```python
+  >>> x = torch.rand(100, 4).npu()
+  >>> y = torch.npu_anchor_response_flags(x, [60, 60], [2, 2], 9)
+  >>> y.shape
+  torch.Size([32400])
+  ```
+
+> npu_yolo_boxes_encode(self, gt_bboxes, stride, performance_mode=False) -> Tensor
+
+Generates bounding boxes based on yolo's "anchor" and "ground-truth" boxes. It is a customized mmdetection operator.
+
+- Parameters:
+  - **self** (Tensor) - Anchor boxes generated by the yolo training set. A 2D Tensor of type float32 or float16 with shape (N, 4). "N" indicates the number of ROIs, and the value "4" refers to (tx, ty, tw, th).
+  - **gt_bboxes** (Tensor) - Target of the transformation, e.g., ground-truth boxes. A 2D Tensor of type float32 or float16 with shape (N, 4). "N" indicates the number of ROIs, and 4 indicates "dx", "dy", "dw", and "dh".
+  - **stride** (Tensor) - Scale for each box. A 1D Tensor of type int32 with shape (N,). "N" indicates the number of ROIs.
+  - **performance_mode** (bool) - Selects the performance mode, "high_precision" or "high_performance". With "high_precision" and float32 input, the output precision error is smaller than 0.0001; with "high_performance" and float32 input, the op delivers the best performance, but the precision error is only smaller than 0.005.
+
+- Constraints:
+
+  The input anchor boxes only support a maximum N of 20480.
+
+- Examples:
+
+  ```python
+  >>> anchor_boxes = torch.rand(2, 4).npu()
+  >>> gt_bboxes = torch.rand(2, 4).npu()
+  >>> stride = torch.tensor([2, 2], dtype=torch.int32).npu()
+  >>> output = torch.npu_yolo_boxes_encode(anchor_boxes, gt_bboxes, stride, False)
+  >>> output.shape
+  torch.Size([2, 4])
+  ```
+
+> npu_grid_assign_positive(self, overlaps, box_responsible_flags, max_overlaps, argmax_overlaps, gt_max_overlaps, gt_argmax_overlaps, num_gts, pos_iou_thr, min_pos_iou, gt_max_assign_all) -> Tensor
+
+Assigns ground-truth indexes to positive bboxes based on grid responsibility flags. It is a customized mmdetection operator.
+
+- Parameters:
+  - **self** (Tensor) - Tensor of type float16 or float32, shape (n, ).
+  - **overlaps** (Tensor) - A Tensor. Datatype is the same as assigned_gt_inds. IoU between gt_bboxes and bboxes. Shape (k, n).
+  - **box_responsible_flags** (Tensor) - A Tensor. Support uint8. Flag to indicate whether a box is responsible.
+  - **max_overlaps** (Tensor) - A Tensor. Datatype is the same as assigned_gt_inds. overlaps.max(axis=0).
+  - **argmax_overlaps** (Tensor) - A Tensor. Support int32. overlaps.argmax(axis=0).
+  - **gt_max_overlaps** (Tensor) - A Tensor. Datatype is the same as assigned_gt_inds. overlaps.max(axis=1).
+  - **gt_argmax_overlaps** (Tensor) - A Tensor. Support int32. overlaps.argmax(axis=1).
+  - **num_gts** (Number) - A Tensor. Support int32. Real k. Shape (1, ).
+  - **pos_iou_thr** (Float) - float. IoU threshold for positive bboxes.
+  - **min_pos_iou** (Float) - float. Minimum IoU for a bbox to be considered as a positive bbox.
+  - **gt_max_assign_all** (bool) - bool. Whether to assign all bboxes with the same highest overlap with some gt to that gt.
+
+- Constraints:
+
+  None
+
+- Examples:
+
+  ```python
+  >>> assigned_gt_inds = torch.rand(4).npu()
+  >>> overlaps = torch.rand(2,4).npu()
+  >>> box_responsible_flags = torch.tensor([1, 1, 1, 0], dtype=torch.uint8).npu()
+  >>> max_overlap = torch.rand(4).npu()
+  >>> argmax_overlap = torch.tensor([1, 0, 1, 0], dtype=torch.int32).npu()
+  >>> gt_max_overlaps = torch.rand(2).npu()
+  >>> gt_argmax_overlaps = torch.tensor([1, 0],dtype=torch.int32).npu()
+  >>> output = torch.npu_grid_assign_positive(assigned_gt_inds, overlaps, box_responsible_flags, max_overlap, argmax_overlap, gt_max_overlaps, gt_argmax_overlaps, 128, 0.5, 0., True)
+  >>> output.shape
+  torch.Size([4])
+  ```
+
+> npu_normalize_batch(self, seq_len, normalize_type=0) -> Tensor
+
+Performs batch normalization.
+
+- Parameters:
+
+  - **self** (Tensor) - A Tensor. Support float32. Shape (n, c, d).
+  - **seq_len** (Tensor) - A Tensor. The number of elements to normalize in each batch. Support int32. Shape (n, ).
+  - **normalize_type** (Number) - 0 for "per_feature", 1 for "all_features".
+
+- Constraints:
+
+  None
+
+- Examples:
+  ```python
+  >>> import numpy as np
+  >>> a=np.random.uniform(1,10,(2,3,6)).astype(np.float32)
+  >>> b=np.random.uniform(3,6,(2)).astype(np.int32)
+  >>> x=torch.from_numpy(a).to("npu")
+  >>> seqlen=torch.from_numpy(b).to("npu")
+  >>> out = torch.npu_normalize_batch(x, seqlen, 0)
+  >>> out
+  tensor([[[ 1.1496, -0.6685, -0.4812,  1.7611, -0.5187,  0.7571],
+           [ 1.1445, -0.4393, -0.7051,  1.0474, -0.2646, -0.1582],
+           [ 0.1477,  0.9179, -1.0656, -6.8692, -6.7437,  2.8621]],
+
+          [[-0.6880,  0.1337,  1.3623, -0.8081, -1.2291, -0.9410],
+           [ 0.3070,  0.5489, -1.4858,  0.6300,  0.6428,  0.0433],
+           [-0.5387,  0.8204, -1.1401,  0.8584, -0.3686,  0.8444]]],
+         device='npu:0')
+  ```
+
+> npu_masked_fill_range(self, start, end, value, axis=-1) -> Tensor
+
+Fills a tensor along the given axis within the specified ranges. It is a customized masked fill range operator.
+
+- Parameters:
+
+  - **self** (Tensor) - The input tensor. An ND Tensor of float32/float16/int32/int8 with shapes 1-D (D,), 2-D (N, D), 3-D (N, C, D).
+  - **start** (Tensor) - Masked fill start positions. A 3D Tensor of int32 with shape (num, N).
+  - **end** (Tensor) - Masked fill end positions. A 3D Tensor of int32 with shape (num, N).
+  - **value** (Tensor) - Masked fill value. A 2D Tensor of float32/float16/int32/int8 with shape (num,).
+  - **axis** (Number) - Axis for the masked fill, of type int32. Defaults to -1.
+
+- Constraints:
+
+  None
+
+- Examples:
+  ```python
+  >>> a=torch.rand(4,4).npu()
+  >>> a
+  tensor([[0.9419, 0.4919, 0.2874, 0.6560],
+          [0.6691, 0.6668, 0.0330, 0.1006],
+          [0.3888, 0.7011, 0.7141, 0.7878],
+          [0.0366, 0.9738, 0.4689, 0.0979]], device='npu:0')
+  >>> start = torch.tensor([[0,1,2]], dtype=torch.int32).npu()
+  >>> end = torch.tensor([[1,2,3]], dtype=torch.int32).npu()
+  >>> value = torch.tensor([1], dtype=torch.float).npu()
+  >>> out = torch.npu_masked_fill_range(a, start, end, value, 1)
+  >>> out
+  tensor([[1.0000, 0.4919, 0.2874, 0.6560],
+          [0.6691, 1.0000, 0.0330, 0.1006],
+          [0.3888, 0.7011, 1.0000, 0.7878],
+          [0.0366, 0.9738, 0.4689, 0.0979]], device='npu:0')
+  ```
+
+> npu_linear(input, weight, bias=None) -> Tensor
+
+Multiplies matrix "input" by the transpose of matrix "weight", adding "bias" if provided.
+
+- Parameters:
+
+  - **input** (Tensor) - A matrix Tensor. 2D. Must be one of the following types: float32, float16, int32, int8. Has format [ND, NHWC, FRACTAL_NZ].
+  - **weight** (Tensor) - A matrix Tensor. 2D. Must be one of the following types: float32, float16, int32, int8. Has format [ND, NHWC, FRACTAL_NZ].
+  - **bias** (Tensor) - A 1D Tensor. Must be one of the following types: float32, float16, int32. Has format [ND, NHWC].
+
+- Constraints:
+
+  None
+
+- Examples:
+  ```python
+  >>> x=torch.rand(2,16).npu()
+  >>> w=torch.rand(4,16).npu()
+  >>> b=torch.rand(4).npu()
+  >>> output = torch.npu_linear(x, w, b)
+  >>> output
+  tensor([[3.6335, 4.3713, 2.4440, 2.0081],
+          [5.3273, 6.3089, 3.9601, 3.2410]], device='npu:0')
+  ```
+
+> npu_bert_apply_adam.old(Tensor(a!) var, Tensor(b!) m, Tensor(c!) v, lr, beta1, beta2, epsilon, grad, max_grad_norm, global_grad_norm, weight_decay, step_size=None, adam_mode=0) -> (Tensor(a!), Tensor(b!), Tensor(c!))
+
+Computes the Adam optimizer update.
+
+- Parameters:
+
+  - **var** (Tensor) - A Tensor. Support float16/float32.
+  - **m** (Tensor) - A Tensor. Datatype and shape are the same as exp_avg.
+  - **v** (Tensor) - A Tensor. Datatype and shape are the same as exp_avg.
+  - **lr** (Number) - Datatype is the same as exp_avg.
+  - **beta1** (Number) - Datatype is the same as exp_avg.
+  - **beta2** (Number) - Datatype is the same as exp_avg.
+
+> npu_bert_apply_adam.old(Tensor(a!) var, Tensor(b!) m, Tensor(c!) v, lr, beta1, beta2, epsilon, grad, max_grad_norm, global_grad_norm, weight_decay, step_size=None, adam_mode=0) -> (Tensor(a!), Tensor(b!), Tensor(c!))
+
+Computes the Adam optimizer update and writes the result to var, m, and v in place.
+
+- Parameters:
+
+  - **var** (Tensor) - A float16/float32 Tensor; the variable to update.
+  - **m** (Tensor) - A Tensor with the same datatype and shape as var; the first moment estimate.
+  - **v** (Tensor) - A Tensor with the same datatype and shape as var; the second moment estimate.
+  - **lr** (Number) - The learning rate. Datatype is the same as that of var.
+  - **beta1** (Number) - Datatype is the same as that of var.
+  - **beta2** (Number) - Datatype is the same as that of var.
+  - **epsilon** (Number) - Datatype is the same as that of var.
+  - **grad** (Tensor) - A Tensor with the same datatype and shape as var; the gradient.
+  - **max_grad_norm** (Number) - Datatype is the same as that of var.
+  - **global_grad_norm** (Number) - Datatype is the same as that of var.
+  - **weight_decay** (Number) - Datatype is the same as that of var.
+
+- Constraints:
+
+  None
+
+- Examples:
+
+  ```python
+  >>> var_in = torch.rand(321538).uniform_(-32., 21.).npu()
+  >>> m_in = torch.zeros(321538).npu()
+  >>> v_in = torch.zeros(321538).npu()
+  >>> grad = torch.rand(321538).uniform_(-0.05, 0.03).npu()
+  >>> max_grad_norm = -1.
+  >>> beta1 = 0.9
+  >>> beta2 = 0.99
+  >>> weight_decay = 0.
+  >>> lr = 0.
+  >>> epsilon = 1e-06
+  >>> global_grad_norm = 0.
+  >>> var_out, m_out, v_out = torch.npu_bert_apply_adam(var_in, m_in, v_in, lr, beta1, beta2, epsilon, grad, max_grad_norm, global_grad_norm, weight_decay)
+  >>> var_out
+  tensor([ 14.7733, -30.1218,  -1.3647,  ..., -16.6840,   7.1518,   8.4872],
+         device='npu:0')
+  ```
+
+> npu_bert_apply_adam(lr, beta1, beta2, epsilon, grad, max_grad_norm, global_grad_norm, weight_decay, step_size=None, adam_mode=0, *, out=(var,m,v))
+
+Computes the Adam optimizer update and writes the result to the tensors passed via out.
+
+- Parameters:
+
+  - **var** (Tensor) - A float16/float32 Tensor; the variable to update.
+  - **m** (Tensor) - A Tensor with the same datatype and shape as var; the first moment estimate.
+  - **v** (Tensor) - A Tensor with the same datatype and shape as var; the second moment estimate.
+  - **lr** (Number) - Datatype is the same as that of var.
+  - **beta1** (Number) - Datatype is the same as that of var.
+  - **beta2** (Number) - Datatype is the same as that of var.
+  - **epsilon** (Number) - Datatype is the same as that of var.
+  - **grad** (Tensor) - A Tensor with the same datatype and shape as var.
+  - **max_grad_norm** (Number) - Datatype is the same as that of var.
+  - **global_grad_norm** (Number) - Datatype is the same as that of var.
+  - **weight_decay** (Number) - Datatype is the same as that of var.
+
+- Keyword Arguments:
+
+  - **out** (tuple of Tensors, optional) - The output tensors (var, m, v).
+
+- Constraints:
+
+  None
+
+- Examples:
+
+  ```python
+  >>> var_in = torch.rand(321538).uniform_(-32., 21.).npu()
+  >>> m_in = torch.zeros(321538).npu()
+  >>> v_in = torch.zeros(321538).npu()
+  >>> grad = torch.rand(321538).uniform_(-0.05, 0.03).npu()
+  >>> max_grad_norm = -1.
+  >>> beta1 = 0.9
+  >>> beta2 = 0.99
+  >>> weight_decay = 0.
+  >>> lr = 0.
+  >>> epsilon = 1e-06
+  >>> global_grad_norm = 0.
+  >>> var_out, m_out, v_out = torch.npu_bert_apply_adam(lr, beta1, beta2, epsilon, grad, max_grad_norm, global_grad_norm, weight_decay, out=(var_in, m_in, v_in))
+  >>> var_out
+  tensor([ 14.7733, -30.1218,  -1.3647,  ..., -16.6840,   7.1518,   8.4872],
+         device='npu:0')
+  ```
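+
+For orientation, these arguments name the quantities of the standard Adam update, sketched below without bias correction (as in BERT-style Adam variants); how max_grad_norm and global_grad_norm gate or rescale the update is device-specific and not documented here:
+
+```latex
+m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t
+v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2
+var_t = var_{t-1} - lr \cdot \left( \frac{m_t}{\sqrt{v_t} + \epsilon} + weight\_decay \cdot var_{t-1} \right)
+```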
+
+> npu_giou(self, gtboxes, trans=False, is_cross=False, mode=0) -> Tensor
+
+First computes the minimum enclosing (closure) area of the two boxes and their IoU, then the proportion of the enclosing area that is covered by neither box, and finally subtracts this proportion from the IoU to obtain the GIoU.
+
+- Parameters:
+
+  - **self** (Tensor) - Bounding boxes, a 2D Tensor of type float16 or float32 with shape (N, 4). "N" indicates the number of bounding boxes, and the value "4" refers to [x1, y1, x2, y2] or [x, y, w, h].
+  - **gtboxes** (Tensor) - Ground-truth boxes, a 2D Tensor of type float16 or float32 with shape (M, 4). "M" indicates the number of ground-truth boxes, and the value "4" refers to [x1, y1, x2, y2] or [x, y, w, h].
+  - **trans** (bool) - An optional bool, true for 'xywh', false for 'xyxy'.
+  - **is_cross** (bool) - An optional bool that controls whether the output shape is [M, N] (True) or [1, N] (False).
+  - **mode** (Number) - Computation mode that selects the metric from [iou, iof]: 0 for "iou", 1 for "iof".
+
+- Constraints:
+
+  None
+
+- Examples:
+
+  ```python
+  >>> a=np.random.uniform(0,1,(4,10)).astype(np.float16)
+  >>> b=np.random.uniform(0,1,(4,10)).astype(np.float16)
+  >>> box1=torch.from_numpy(a).to("npu")
+  >>> box2=torch.from_numpy(a).to("npu")
+  >>> output = torch.npu_giou(box1, box2, trans=True, is_cross=False, mode=0)
+  >>> output
+  tensor([[1.],
+          [1.],
+          [1.],
+          [1.],
+          [1.],
+          [1.],
+          [1.],
+          [1.],
+          [1.],
+          [1.]], device='npu:0', dtype=torch.float16)
+  ```
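+
+Written out, with A a bounding box, B a ground-truth box, and C their minimum enclosing box, the description above corresponds to:
+
+```latex
+GIoU(A, B) = IoU(A, B) - \frac{|C \setminus (A \cup B)|}{|C|},
+\qquad
+IoU(A, B) = \frac{|A \cap B|}{|A \cup B|}
+```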
+
+> npu_silu(self) -> Tensor
+
+Computes the Swish (SiLU) of "x", that is, x * sigmoid(x).
+
+- Parameters:
+
+  - **self** (Tensor) - A Tensor. Must be one of the following types: float16, float32.
+
+- Constraints:
+
+  None
+
+- Examples:
+
+  ```python
+  >>> a=torch.rand(2,8).npu()
+  >>> output = torch.npu_silu(a)
+  >>> output
+  tensor([[0.4397, 0.7178, 0.5190, 0.2654, 0.2230, 0.2674, 0.6051, 0.3522],
+          [0.4679, 0.1764, 0.6650, 0.3175, 0.0530, 0.4787, 0.5621, 0.4026]],
+         device='npu:0')
+  ```
+
+> npu_reshape(self, shape, bool can_refresh=False) -> Tensor
+
+Reshapes a tensor. Only the tensor shape is changed; the data is untouched.
+
+- Parameters:
+
+  - **self** (Tensor) - A Tensor.
+  - **shape** (ListInt) - Defines the shape of the output tensor.
+  - **can_refresh** (bool) - Specifies whether the reshape can be refreshed in place.
+
+- Constraints:
+
+  This operator cannot be directly called by the aclopExecute API.
+
+- Examples:
+
+  ```python
+  >>> a=torch.rand(2,8).npu()
+  >>> out=torch.npu_reshape(a,(4,4))
+  >>> out
+  tensor([[0.6657, 0.9857, 0.7614, 0.4368],
+          [0.3761, 0.4397, 0.8609, 0.5544],
+          [0.7002, 0.3063, 0.9279, 0.5085],
+          [0.1009, 0.7133, 0.8118, 0.6193]], device='npu:0')
+  ```
+
+> npu_rotated_overlaps(self, query_boxes, trans=False) -> Tensor
+
+Calculates the overlapping area of rotated boxes.
+
+- Parameters:
+
+  - **self** (Tensor) - Bounding boxes, a 3D Tensor of type float32 with shape (B, 5, N).
+  - **query_boxes** (Tensor) - Bounding boxes, a 3D Tensor of type float32 with shape (B, 5, K).
+  - **trans** (bool) - An optional attr, true for 'xyxyt', false for 'xywht'.
+
+- Constraints:
+
+  None
+
+- Examples:
+
+  ```python
+  >>> a=np.random.uniform(0,1,(1,3,5)).astype(np.float16)
+  >>> b=np.random.uniform(0,1,(1,2,5)).astype(np.float16)
+  >>> box1=torch.from_numpy(a).to("npu")
+  >>> box2=torch.from_numpy(a).to("npu")
+  >>> output = torch.npu_rotated_overlaps(box1, box2, trans=False)
+  >>> output
+  tensor([[[0.0000, 0.1562, 0.0000],
+           [0.1562, 0.3713, 0.0611],
+           [0.0000, 0.0611, 0.0000]]], device='npu:0', dtype=torch.float16)
+  ```
+
+> npu_rotated_iou(self, query_boxes, trans=False, mode=0, is_cross=True) -> Tensor
+
+Calculates the IoU of rotated boxes.
+
+- Parameters:
+
+  - **self** (Tensor) - Bounding boxes, a 3D Tensor of type float32 with shape (B, 5, N).
+  - **query_boxes** (Tensor) - Bounding boxes, a 3D Tensor of type float32 with shape (B, 5, K).
+  - **trans** (bool) - An optional attr, true for 'xyxyt', false for 'xywht'.
+  - **is_cross** (bool) - Cross calculation when True, one-to-one calculation when False.
+  - **mode** (Number) - Computation mode that selects the metric from [iou, iof]: 0 for "iou", 1 for "iof".
+
+- Constraints:
+
+  None
+
+- Examples:
+
+  ```python
+  >>> a=np.random.uniform(0,1,(2,2,5)).astype(np.float16)
+  >>> b=np.random.uniform(0,1,(2,3,5)).astype(np.float16)
+  >>> box1=torch.from_numpy(a).to("npu")
+  >>> box2=torch.from_numpy(a).to("npu")
+  >>> output = torch.npu_rotated_iou(box1, box2, trans=False, mode=0, is_cross=True)
+  >>> output
+  tensor([[[3.3325e-01, 1.0162e-01],
+           [1.0162e-01, 1.0000e+00]],
+
+          [[0.0000e+00, 0.0000e+00],
+           [0.0000e+00, 5.9605e-08]]], device='npu:0', dtype=torch.float16)
+  ```
+
+> npu_rotated_box_encode(anchor_box, gt_bboxes, weight) -> Tensor
+
+Rotated Bounding Box Encoding.
+
+- Parameters:
+
+  - **anchor_box** (Tensor) - Anchor boxes, a 3D Tensor with shape (B, 5, N). "B" indicates the batch size, "N" indicates the number of bounding boxes, and the value "5" refers to "x0", "x1", "y0", "y1" and "angle".
+  - **gt_bboxes** (Tensor) - A 3D Tensor of float32 (float16) with shape (B, 5, N).
+  - **weight** (Tensor) - A float list for "x0", "x1", "y0", "y1" and "angle", defaults to [1.0, 1.0, 1.0, 1.0, 1.0].
+
+- Constraints:
+
+  None
+
+- Examples:
+
+  ```python
+  >>> anchor_boxes = torch.tensor([[[30.69], [32.6], [45.94], [59.88], [-44.53]]], dtype=torch.float16).to("npu")
+  >>> gt_bboxes = torch.tensor([[[30.44], [18.72], [33.22], [45.56], [8.5]]], dtype=torch.float16).to("npu")
+  >>> weight = torch.tensor([1., 1., 1., 1., 1.], dtype=torch.float16).npu()
+  >>> out = torch.npu_rotated_box_encode(anchor_boxes, gt_bboxes, weight)
+  >>> out
+  tensor([[[-0.4253],
+           [-0.5166],
+           [-1.7021],
+           [-0.0162],
+           [ 1.1328]]], device='npu:0', dtype=torch.float16)
+  ```
+
+> npu_rotated_box_decode(anchor_boxes, deltas, weight) -> Tensor
+
+Rotated Bounding Box Decoding.
+
+- Parameters:
+
+  - **anchor_boxes** (Tensor) - Anchor boxes, a 3D Tensor with shape (B, 5, N). "B" indicates the batch size, "N" indicates the number of bounding boxes, and the value "5" refers to "x0", "x1", "y0", "y1" and "angle".
+  - **deltas** (Tensor) - A 3D Tensor of float32 (float16) with shape (B, 5, N).
+  - **weight** (Tensor) - A float list for "x0", "x1", "y0", "y1" and "angle", defaults to [1.0, 1.0, 1.0, 1.0, 1.0].
+
+- Constraints:
+
+  None
+
+- Examples:
+
+  ```python
+  >>> anchor_boxes = torch.tensor([[[4.137],[33.72],[29.4], [54.06], [41.28]]], dtype=torch.float16).to("npu")
+  >>> deltas = torch.tensor([[[0.0244], [-1.992], [0.2109], [0.315], [-37.25]]], dtype=torch.float16).to("npu")
+  >>> weight = torch.tensor([1., 1., 1., 1., 1.], dtype=torch.float16).npu()
+  >>> out = torch.npu_rotated_box_decode(anchor_boxes, deltas, weight)
+  >>> out
+  tensor([[[  1.7861],
+           [-10.5781],
+           [ 33.0000],
+           [ 17.2969],
+           [-88.4375]]], device='npu:0', dtype=torch.float16)
+  ```
\ No newline at end of file
diff --git a/docs/en/PyTorch 1.8.1 API Support.md b/docs/en/PyTorch 1.8.1 API Support.md
new file mode 100644
index 0000000000000000000000000000000000000000..1f26f5d55693e12980d6af105309ea0d2696b4fa
--- /dev/null
+++ b/docs/en/PyTorch 1.8.1 API Support.md
@@ -0,0 +1,1231 @@
+# PyTorch 1.8.1 API Support
+
+## Tensors
+
+| No. 
| API | Supported/Unsupported | +| ---- | ------------------------------------------------------------ | --------------------- | +| 1 | [is_tensor](https://pytorch.org/docs/1.8.1/generated/torch.is_tensor.html) | Unsupported | +| 2 | [is_storage](https://pytorch.org/docs/1.8.1/generated/torch.is_storage.html) | Unsupported | +| 3 | [is_complex](https://pytorch.org/docs/1.8.1/generated/torch.is_complex.html) | Unsupported | +| 4 | [is_floating_point](https://pytorch.org/docs/1.8.1/generated/torch.is_floating_point.html) | Unsupported | +| 5 | [is_nonzero](https://pytorch.org/docs/1.8.1/generated/torch.is_nonzero.html) | Unsupported | +| 6 | [set_default_dtype](https://pytorch.org/docs/1.8.1/generated/torch.set_default_dtype.html) | Unsupported | +| 7 | [get_default_dtype](https://pytorch.org/docs/1.8.1/generated/torch.get_default_dtype.html) | Unsupported | +| 8 | [set_default_tensor_type](https://pytorch.org/docs/1.8.1/generated/torch.set_default_tensor_type.html) | Unsupported | +| 9 | [numel](https://pytorch.org/docs/1.8.1/generated/torch.numel.html) | Unsupported | +| 10 | [set_printoptions](https://pytorch.org/docs/1.8.1/generated/torch.set_printoptions.html) | Unsupported | +| 11 | [set_flush_denormal](https://pytorch.org/docs/1.8.1/generated/torch.set_flush_denormal.html) | Unsupported | + +### Creation Ops + +| No. | API | Supported/Unsupported | +| ---- | ------------------------------------------------------------ | --------------------- | +| 1 | [tensor](https://pytorch.org/docs/1.8.1/generated/torch.tensor.html) | Unsupported | +| 2 | [sparse_coo_tensor](https://pytorch.org/docs/1.8.1/generated/torch.sparse_coo_tensor.html) | Unsupported | +| 3 | [as_tensor](https://pytorch.org/docs/1.8.1/generated/torch.as_tensor.html) | Unsupported | +| 4 | [as_strided](https://pytorch.org/docs/1.8.1/generated/torch.as_strided.html) | Unsupported | +| 5 | [from_numpy](https://pytorch.org/docs/1.8.1/generated/torch.from_numpy.html) | Unsupported | +| 6 | [zeros](https://pytorch.org/docs/1.8.1/generated/torch.zeros.html) | Unsupported | +| 7 | [zeros_like](https://pytorch.org/docs/1.8.1/generated/torch.zeros_like.html) | Unsupported | +| 8 | [ones](https://pytorch.org/docs/1.8.1/generated/torch.ones.html) | Unsupported | +| 9 | [ones_like](https://pytorch.org/docs/1.8.1/generated/torch.ones_like.html) | Unsupported | +| 10 | [arange](https://pytorch.org/docs/1.8.1/generated/torch.arange.html) | Unsupported | +| 11 | [range](https://pytorch.org/docs/1.8.1/generated/torch.range.html) | Unsupported | +| 12 | [linspace](https://pytorch.org/docs/1.8.1/generated/torch.linspace.html) | Unsupported | +| 13 | [logspace](https://pytorch.org/docs/1.8.1/generated/torch.logspace.html) | Unsupported | +| 14 | [eye](https://pytorch.org/docs/1.8.1/generated/torch.eye.html) | Unsupported | +| 15 | [empty](https://pytorch.org/docs/1.8.1/generated/torch.empty.html) | Unsupported | +| 16 | [empty_like](https://pytorch.org/docs/1.8.1/generated/torch.empty_like.html) | Unsupported | +| 17 | [empty_strided](https://pytorch.org/docs/1.8.1/generated/torch.empty_strided.html) | Unsupported | +| 18 | [full](https://pytorch.org/docs/1.8.1/generated/torch.full.html) | Unsupported | +| 19 | [full_like](https://pytorch.org/docs/1.8.1/generated/torch.full_like.html) | Unsupported | +| 20 | [quantize_per_tensor](https://pytorch.org/docs/1.8.1/generated/torch.quantize_per_tensor.html) | Unsupported | +| 21 | [quantize_per_channel](https://pytorch.org/docs/1.8.1/generated/torch.quantize_per_channel.html) | Unsupported | +| 22 | 
[dequantize](https://pytorch.org/docs/1.8.1/generated/torch.dequantize.html) | Unsupported | +| 23 | [complex](https://pytorch.org/docs/1.8.1/generated/torch.complex.html) | Unsupported | +| 24 | [polar](https://pytorch.org/docs/1.8.1/generated/torch.polar.html) | Unsupported | +| 25 | [heaviside](https://pytorch.org/docs/1.8.1/generated/torch.heaviside.html) | Unsupported | + +### Indexing, Slicing, Joining, Mutating Ops + +| No. | API | Supported/Unsupported | +| ---- | ------------------------------------------------------------ | --------------------- | +| 1 | [cat](https://pytorch.org/docs/1.8.1/generated/torch.cat.html) | Unsupported | +| 2 | [chunk](https://pytorch.org/docs/1.8.1/generated/torch.chunk.html) | Unsupported | +| 3 | [column_stack](https://pytorch.org/docs/1.8.1/generated/torch.column_stack.html) | Unsupported | +| 4 | [dstack](https://pytorch.org/docs/1.8.1/generated/torch.dstack.html) | Unsupported | +| 5 | [gather](https://pytorch.org/docs/1.8.1/generated/torch.gather.html) | Unsupported | +| 6 | [hstack](https://pytorch.org/docs/1.8.1/generated/torch.hstack.html) | Unsupported | +| 7 | [index_select](https://pytorch.org/docs/1.8.1/generated/torch.index_select.html) | Unsupported | +| 8 | [masked_select](https://pytorch.org/docs/1.8.1/generated/torch.masked_select.html) | Unsupported | +| 9 | [movedim](https://pytorch.org/docs/1.8.1/generated/torch.movedim.html) | Unsupported | +| 10 | [moveaxis](https://pytorch.org/docs/1.8.1/generated/torch.moveaxis.html) | Unsupported | +| 11 | [narrow](https://pytorch.org/docs/1.8.1/generated/torch.narrow.html) | Unsupported | +| 12 | [nonzero](https://pytorch.org/docs/1.8.1/generated/torch.nonzero.html) | Unsupported | +| 13 | [reshape](https://pytorch.org/docs/1.8.1/generated/torch.reshape.html) | Unsupported | +| 14 | [row_stack](https://pytorch.org/docs/1.8.1/generated/torch.row_stack.html) | Unsupported | +| 15 | [scatter](https://pytorch.org/docs/1.8.1/generated/torch.scatter.html) | Unsupported | +| 16 | [scatter_add](https://pytorch.org/docs/1.8.1/generated/torch.scatter_add.html) | Unsupported | +| 17 | [split](https://pytorch.org/docs/1.8.1/generated/torch.split.html) | Unsupported | +| 18 | [squeeze](https://pytorch.org/docs/1.8.1/generated/torch.squeeze.html) | Unsupported | +| 19 | [stack](https://pytorch.org/docs/1.8.1/generated/torch.stack.html) | Unsupported | +| 20 | [swapaxes](https://pytorch.org/docs/1.8.1/generated/torch.swapaxes.html) | Unsupported | +| 21 | [swapdims](https://pytorch.org/docs/1.8.1/generated/torch.swapdims.html) | Unsupported | +| 22 | [t](https://pytorch.org/docs/1.8.1/generated/torch.t.html) | Unsupported | +| 23 | [take](https://pytorch.org/docs/1.8.1/generated/torch.take.html) | Unsupported | +| 24 | [tensor_split](https://pytorch.org/docs/1.8.1/generated/torch.tensor_split.html) | Unsupported | +| 25 | [tile](https://pytorch.org/docs/1.8.1/generated/torch.tile.html) | Unsupported | +| 26 | [transpose](https://pytorch.org/docs/1.8.1/generated/torch.transpose.html) | Unsupported | +| 27 | [unbind](https://pytorch.org/docs/1.8.1/generated/torch.unbind.html) | Unsupported | +| 28 | [unsqueeze](https://pytorch.org/docs/1.8.1/generated/torch.unsqueeze.html) | Unsupported | +| 29 | [vstack](https://pytorch.org/docs/1.8.1/generated/torch.vstack.html) | Unsupported | +| 30 | [where](https://pytorch.org/docs/1.8.1/generated/torch.where.html) | Unsupported | + +## Generators + +| No. 
| API | Supported/Unsupported | +| ---- | ------------------------------------------------------------ | --------------------- | +| 1 | [Generator](https://pytorch.org/docs/1.8.1/generated/torch.Generator.html) | Unsupported | + +## Random Sampling + +| No. | API | Supported/Unsupported | +| ---- | ------------------------------------------------------------ | --------------------- | +| 1 | [seed](https://pytorch.org/docs/1.8.1/generated/torch.seed.html) | Unsupported | +| 2 | [manual_seed](https://pytorch.org/docs/1.8.1/generated/torch.manual_seed.html) | Unsupported | +| 3 | [initial_seed](https://pytorch.org/docs/1.8.1/generated/torch.initial_seed.html) | Unsupported | +| 4 | [get_rng_state](https://pytorch.org/docs/1.8.1/generated/torch.get_rng_state.html) | Unsupported | +| 5 | [set_rng_state](https://pytorch.org/docs/1.8.1/generated/torch.set_rng_state.html) | Unsupported | +| 6 | [bernoulli](https://pytorch.org/docs/1.8.1/generated/torch.bernoulli.html) | Unsupported | +| 7 | [multinomial](https://pytorch.org/docs/1.8.1/generated/torch.multinomial.html) | Unsupported | +| 8 | [normal](https://pytorch.org/docs/1.8.1/generated/torch.normal.html) | Unsupported | +| 9 | [poisson](https://pytorch.org/docs/1.8.1/generated/torch.poisson.html) | Unsupported | +| 10 | [rand](https://pytorch.org/docs/1.8.1/generated/torch.rand.html) | Unsupported | +| 11 | [rand_like](https://pytorch.org/docs/1.8.1/generated/torch.rand_like.html) | Unsupported | +| 12 | [randint](https://pytorch.org/docs/1.8.1/generated/torch.randint.html) | Unsupported | +| 13 | [randint_like](https://pytorch.org/docs/1.8.1/generated/torch.randint_like.html) | Unsupported | +| 14 | [randn](https://pytorch.org/docs/1.8.1/generated/torch.randn.html) | Unsupported | +| 15 | [randn_like](https://pytorch.org/docs/1.8.1/generated/torch.randn_like.html) | Unsupported | +| 16 | [randperm](https://pytorch.org/docs/1.8.1/generated/torch.randperm.html) | Unsupported | + +### In-place Random Sampling + +| No. | API | Supported/Unsupported | +| ---- | ------------------------------------------------------------ | --------------------- | +| 1 | [torch.Tensor.bernoulli_()](https://pytorch.org/docs/1.8.1/tensors.html) | Unsupported | +| 2 | [torch.Tensor.cauchy_()](https://pytorch.org/docs/1.8.1/tensors.html) | Unsupported | +| 3 | [torch.Tensor.exponential_()](https://pytorch.org/docs/1.8.1/tensors.html) | Unsupported | +| 4 | [torch.Tensor.geometric_()](https://pytorch.org/docs/1.8.1/tensors.html) | Unsupported | +| 5 | [torch.Tensor.log_normal_()](https://pytorch.org/docs/1.8.1/tensors.html) | Unsupported | +| 6 | [torch.Tensor.normal_()](https://pytorch.org/docs/1.8.1/tensors.html) | Unsupported | +| 7 | [torch.Tensor.random_()](https://pytorch.org/docs/1.8.1/tensors.html) | Unsupported | +| 8 | [torch.Tensor.uniform_()](https://pytorch.org/docs/1.8.1/tensors.html) | Unsupported | + +### Quasi-random Sampling + +| No. | API | Supported/Unsupported | +| ---- | ------------------------------------------------------------ | --------------------- | +| 1 | [quasirandom.SobolEngine](https://pytorch.org/docs/1.8.1/generated/torch.quasirandom.SobolEngine.html) | Unsupported | + +## Serialization + +| No. | API | Supported/Unsupported | +| ---- | ------------------------------------------------------------ | --------------------- | +| 1 | [save](https://pytorch.org/docs/1.8.1/generated/torch.save.html) | Unsupported | +| 2 | [load](https://pytorch.org/docs/1.8.1/generated/torch.load.html) | Unsupported | + +## Parallelism + +| No. 
| API | Supported/Unsupported | +| ---- | ------------------------------------------------------------ | --------------------- | +| 1 | [get_num_threads](https://pytorch.org/docs/1.8.1/generated/torch.get_num_threads.html) | Unsupported | +| 2 | [set_num_threads](https://pytorch.org/docs/1.8.1/generated/torch.set_num_threads.html) | Unsupported | +| 3 | [get_num_interop_threads](https://pytorch.org/docs/1.8.1/generated/torch.get_num_interop_threads.html) | Unsupported | +| 4 | [set_num_interop_threads](https://pytorch.org/docs/1.8.1/generated/torch.set_num_interop_threads.html) | Unsupported | + +## Locally Disabling Gradient Computation + +| No. | API | Supported/Unsupported | +| ---- | ------------------------------------------------------------ | --------------------- | +| 1 | [no_grad](https://pytorch.org/docs/1.8.1/generated/torch.no_grad.html#torch.no_grad) | Unsupported | +| 2 | [enable_grad](https://pytorch.org/docs/1.8.1/generated/torch.enable_grad.html#torch.enable_grad) | Unsupported | +| 3 | set_grad_enabled | Unsupported | + +## Math Operations + +### Pointwise Ops + +| No. | API | Supported/Unsupported | +| ---- | ------------------------------------------------------------ | --------------------- | +| 1 | [abs](https://pytorch.org/docs/1.8.1/generated/torch.abs.html#torch.abs) | Unsupported | +| 2 | [absolute](https://pytorch.org/docs/1.8.1/generated/torch.absolute.html#torch.absolute) | Unsupported | +| 3 | [acos](https://pytorch.org/docs/1.8.1/generated/torch.acos.html#torch.acos) | Unsupported | +| 4 | [arccos](https://pytorch.org/docs/1.8.1/generated/torch.arccos.html#torch.arccos) | Unsupported | +| 5 | [acosh](https://pytorch.org/docs/1.8.1/generated/torch.acosh.html#torch.acosh) | Unsupported | +| 6 | [arccosh](https://pytorch.org/docs/1.8.1/generated/torch.arccosh.html#torch.arccosh) | Unsupported | +| 7 | [add](https://pytorch.org/docs/1.8.1/generated/torch.add.html#torch.add) | Unsupported | +| 8 | [addcdiv](https://pytorch.org/docs/1.8.1/generated/torch.addcdiv.html#torch.addcdiv) | Unsupported | +| 9 | [addcmul](https://pytorch.org/docs/1.8.1/generated/torch.addcmul.html#torch.addcmul) | Unsupported | +| 10 | [angle](https://pytorch.org/docs/1.8.1/generated/torch.angle.html#torch.angle) | Unsupported | +| 11 | [asin](https://pytorch.org/docs/1.8.1/generated/torch.asin.html#torch.asin) | Unsupported | +| 12 | [arcsin](https://pytorch.org/docs/1.8.1/generated/torch.arcsin.html#torch.arcsin) | Unsupported | +| 13 | [asinh](https://pytorch.org/docs/1.8.1/generated/torch.asinh.html#torch.asinh) | Unsupported | +| 14 | [arcsinh](https://pytorch.org/docs/1.8.1/generated/torch.arcsinh.html#torch.arcsinh) | Unsupported | +| 15 | [atan](https://pytorch.org/docs/1.8.1/generated/torch.atan.html#torch.atan) | Unsupported | +| 16 | [arctan](https://pytorch.org/docs/1.8.1/generated/torch.arctan.html#torch.arctan) | Unsupported | +| 17 | [atanh](https://pytorch.org/docs/1.8.1/generated/torch.atanh.html#torch.atanh) | Unsupported | +| 18 | [arctanh](https://pytorch.org/docs/1.8.1/generated/torch.arctanh.html#torch.arctanh) | Unsupported | +| 19 | [atan2](https://pytorch.org/docs/1.8.1/generated/torch.atan2.html#torch.atan2) | Unsupported | +| 20 | [bitwise_not](https://pytorch.org/docs/1.8.1/generated/torch.bitwise_not.html#torch.bitwise_not) | Unsupported | +| 21 | [bitwise_and](https://pytorch.org/docs/1.8.1/generated/torch.bitwise_and.html#torch.bitwise_and) | Unsupported | +| 22 | [bitwise_or](https://pytorch.org/docs/1.8.1/generated/torch.bitwise_or.html#torch.bitwise_or) | 
Unsupported | +| 23 | [bitwise_xor](https://pytorch.org/docs/1.8.1/generated/torch.bitwise_xor.html#torch.bitwise_xor) | Unsupported | +| 24 | [ceil](https://pytorch.org/docs/1.8.1/generated/torch.ceil.html#torch.ceil) | Unsupported | +| 25 | [clamp](https://pytorch.org/docs/1.8.1/generated/torch.clamp.html#torch.clamp) | Unsupported | +| 26 | [clip](https://pytorch.org/docs/1.8.1/generated/torch.clip.html#torch.clip) | Unsupported | +| 27 | [conj](https://pytorch.org/docs/1.8.1/generated/torch.conj.html#torch.conj) | Unsupported | +| 28 | [copysign](https://pytorch.org/docs/1.8.1/generated/torch.copysign.html#torch.copysign) | Unsupported | +| 29 | [cos](https://pytorch.org/docs/1.8.1/generated/torch.cos.html#torch.cos) | Unsupported | +| 30 | [cosh](https://pytorch.org/docs/1.8.1/generated/torch.cosh.html#torch.cosh) | Unsupported | +| 31 | [deg2rad](https://pytorch.org/docs/1.8.1/generated/torch.deg2rad.html#torch.deg2rad) | Unsupported | +| 32 | [div](https://pytorch.org/docs/1.8.1/generated/torch.div.html#torch.div) | Unsupported | +| 33 | [divide](https://pytorch.org/docs/1.8.1/generated/torch.divide.html#torch.divide) | Unsupported | +| 34 | [digamma](https://pytorch.org/docs/1.8.1/generated/torch.digamma.html#torch.digamma) | Unsupported | +| 35 | [erf](https://pytorch.org/docs/1.8.1/generated/torch.erf.html#torch.erf) | Unsupported | +| 36 | [erfc](https://pytorch.org/docs/1.8.1/generated/torch.erfc.html#torch.erfc) | Unsupported | +| 37 | [erfinv](https://pytorch.org/docs/1.8.1/generated/torch.erfinv.html#torch.erfinv) | Unsupported | +| 38 | [exp](https://pytorch.org/docs/1.8.1/generated/torch.exp.html#torch.exp) | Unsupported | +| 39 | [exp2](https://pytorch.org/docs/1.8.1/generated/torch.exp2.html#torch.exp2) | Unsupported | +| 40 | [expm1](https://pytorch.org/docs/1.8.1/generated/torch.expm1.html#torch.expm1) | Unsupported | +| 41 | [fake_quantize_per_channel_affine](https://pytorch.org/docs/1.8.1/generated/torch.fake_quantize_per_channel_affine.html#torch.fake_quantize_per_channel_affine) | Unsupported | +| 42 | [fake_quantize_per_tensor_affine](https://pytorch.org/docs/1.8.1/generated/torch.fake_quantize_per_tensor_affine.html#torch.fake_quantize_per_tensor_affine) | Unsupported | +| 43 | [fix](https://pytorch.org/docs/1.8.1/generated/torch.fix.html#torch.fix) | Unsupported | +| 44 | [float_power](https://pytorch.org/docs/1.8.1/generated/torch.float_power.html#torch.float_power) | Unsupported | +| 45 | [floor](https://pytorch.org/docs/1.8.1/generated/torch.floor.html#torch.floor) | Unsupported | +| 46 | [floor_divide](https://pytorch.org/docs/1.8.1/generated/torch.floor_divide.html#torch.floor_divide) | Unsupported | +| 47 | [fmod](https://pytorch.org/docs/1.8.1/generated/torch.fmod.html#torch.fmod) | Unsupported | +| 48 | [frac](https://pytorch.org/docs/1.8.1/generated/torch.frac.html#torch.frac) | Unsupported | +| 49 | [imag](https://pytorch.org/docs/1.8.1/generated/torch.imag.html#torch.imag) | Unsupported | +| 50 | [ldexp](https://pytorch.org/docs/1.8.1/generated/torch.ldexp.html#torch.ldexp) | Unsupported | +| 51 | [lerp](https://pytorch.org/docs/1.8.1/generated/torch.lerp.html#torch.lerp) | Unsupported | +| 52 | [lgamma](https://pytorch.org/docs/1.8.1/generated/torch.lgamma.html#torch.lgamma) | Unsupported | +| 53 | [log](https://pytorch.org/docs/1.8.1/generated/torch.log.html#torch.log) | Unsupported | +| 54 | [log10](https://pytorch.org/docs/1.8.1/generated/torch.log10.html#torch.log10) | Unsupported | +| 55 | 
[log1p](https://pytorch.org/docs/1.8.1/generated/torch.log1p.html#torch.log1p) | Unsupported | +| 56 | [log2](https://pytorch.org/docs/1.8.1/generated/torch.log2.html#torch.log2) | Unsupported | +| 57 | [logaddexp](https://pytorch.org/docs/1.8.1/generated/torch.logaddexp.html#torch.logaddexp) | Unsupported | +| 58 | [logaddexp2](https://pytorch.org/docs/1.8.1/generated/torch.logaddexp2.html#torch.logaddexp2) | Unsupported | +| 59 | [logical_and](https://pytorch.org/docs/1.8.1/generated/torch.logical_and.html#torch.logical_and) | Unsupported | +| 60 | [logical_not](https://pytorch.org/docs/1.8.1/generated/torch.logical_not.html#torch.logical_not) | Unsupported | +| 61 | [logical_or](https://pytorch.org/docs/1.8.1/generated/torch.logical_or.html#torch.logical_or) | Unsupported | +| 62 | [logical_xor](https://pytorch.org/docs/1.8.1/generated/torch.logical_xor.html#torch.logical_xor) | Unsupported | +| 63 | [logit](https://pytorch.org/docs/1.8.1/generated/torch.logit.html#torch.logit) | Unsupported | +| 64 | [hypot](https://pytorch.org/docs/1.8.1/generated/torch.hypot.html#torch.hypot) | Unsupported | +| 65 | [i0](https://pytorch.org/docs/1.8.1/generated/torch.i0.html#torch.i0) | Unsupported | +| 66 | [igamma](https://pytorch.org/docs/1.8.1/generated/torch.igamma.html#torch.igamma) | Unsupported | +| 67 | [igammac](https://pytorch.org/docs/1.8.1/generated/torch.igammac.html#torch.igammac) | Unsupported | +| 68 | [mul](https://pytorch.org/docs/1.8.1/generated/torch.mul.html#torch.mul) | Unsupported | +| 69 | [multiply](https://pytorch.org/docs/1.8.1/generated/torch.multiply.html#torch.multiply) | Unsupported | +| 70 | [mvlgamma](https://pytorch.org/docs/1.8.1/generated/torch.mvlgamma.html#torch.mvlgamma) | Unsupported | +| 71 | [nan_to_num](https://pytorch.org/docs/1.8.1/generated/torch.nan_to_num.html#torch.nan_to_num) | Unsupported | +| 72 | [neg](https://pytorch.org/docs/1.8.1/generated/torch.neg.html#torch.neg) | Unsupported | +| 73 | [negative](https://pytorch.org/docs/1.8.1/generated/torch.negative.html#torch.negative) | Unsupported | +| 74 | [nextafter](https://pytorch.org/docs/1.8.1/generated/torch.nextafter.html#torch.nextafter) | Unsupported | +| 75 | [polygamma](https://pytorch.org/docs/1.8.1/generated/torch.polygamma.html#torch.polygamma) | Unsupported | +| 76 | [pow](https://pytorch.org/docs/1.8.1/generated/torch.pow.html#torch.pow) | Unsupported | +| 77 | [rad2deg](https://pytorch.org/docs/1.8.1/generated/torch.rad2deg.html#torch.rad2deg) | Unsupported | +| 78 | [real](https://pytorch.org/docs/1.8.1/generated/torch.real.html#torch.real) | Unsupported | +| 79 | [reciprocal](https://pytorch.org/docs/1.8.1/generated/torch.reciprocal.html#torch.reciprocal) | Unsupported | +| 80 | [remainder](https://pytorch.org/docs/1.8.1/generated/torch.remainder.html#torch.remainder) | Unsupported | +| 81 | [round](https://pytorch.org/docs/1.8.1/generated/torch.round.html#torch.round) | Unsupported | +| 82 | [rsqrt](https://pytorch.org/docs/1.8.1/generated/torch.rsqrt.html#torch.rsqrt) | Unsupported | +| 83 | [sigmoid](https://pytorch.org/docs/1.8.1/generated/torch.sigmoid.html#torch.sigmoid) | Unsupported | +| 84 | [sign](https://pytorch.org/docs/1.8.1/generated/torch.sign.html#torch.sign) | Unsupported | +| 85 | [sgn](https://pytorch.org/docs/1.8.1/generated/torch.sgn.html#torch.sgn) | Unsupported | +| 86 | [signbit](https://pytorch.org/docs/1.8.1/generated/torch.signbit.html#torch.signbit) | Unsupported | +| 87 | [sin](https://pytorch.org/docs/1.8.1/generated/torch.sin.html#torch.sin) | 
Unsupported | +| 88 | [sinc](https://pytorch.org/docs/1.8.1/generated/torch.sinc.html#torch.sinc) | Unsupported | +| 89 | [sinh](https://pytorch.org/docs/1.8.1/generated/torch.sinh.html#torch.sinh) | Unsupported | +| 90 | [sqrt](https://pytorch.org/docs/1.8.1/generated/torch.sqrt.html#torch.sqrt) | Unsupported | +| 91 | [square](https://pytorch.org/docs/1.8.1/generated/torch.square.html#torch.square) | Unsupported | +| 92 | [sub](https://pytorch.org/docs/1.8.1/generated/torch.sub.html#torch.sub) | Unsupported | +| 93 | [subtract](https://pytorch.org/docs/1.8.1/generated/torch.subtract.html#torch.subtract) | Unsupported | +| 94 | [tan](https://pytorch.org/docs/1.8.1/generated/torch.tan.html#torch.tan) | Unsupported | +| 95 | [tanh](https://pytorch.org/docs/1.8.1/generated/torch.tanh.html#torch.tanh) | Unsupported | +| 96 | [true_divide](https://pytorch.org/docs/1.8.1/generated/torch.true_divide.html#torch.true_divide) | Unsupported | +| 97 | [trunc](https://pytorch.org/docs/1.8.1/generated/torch.trunc.html#torch.trunc) | Unsupported | +| 98 | [xlogy](https://pytorch.org/docs/1.8.1/generated/torch.xlogy.html#torch.xlogy) | Unsupported | + +### Reduction Ops + +| No. | API | Supported/Unsupported | +| ---- | ------------------------------------------------------------ | --------------------- | +| 1 | [argmax](https://pytorch.org/docs/1.8.1/generated/torch.argmax.html#torch.argmax) | Unsupported | +| 2 | [argmin](https://pytorch.org/docs/1.8.1/generated/torch.argmin.html#torch.argmin) | Unsupported | +| 3 | [amax](https://pytorch.org/docs/1.8.1/generated/torch.amax.html#torch.amax) | Unsupported | +| 4 | [amin](https://pytorch.org/docs/1.8.1/generated/torch.amin.html#torch.amin) | Unsupported | +| 5 | [all](https://pytorch.org/docs/1.8.1/generated/torch.all.html#torch.all) | Unsupported | +| 6 | [any](https://pytorch.org/docs/1.8.1/generated/torch.any.html#torch.any) | Unsupported | +| 7 | [max](https://pytorch.org/docs/1.8.1/generated/torch.max.html#torch.max) | Unsupported | +| 8 | [min](https://pytorch.org/docs/1.8.1/generated/torch.min.html#torch.min) | Unsupported | +| 9 | [dist](https://pytorch.org/docs/1.8.1/generated/torch.dist.html#torch.dist) | Unsupported | +| 10 | [logsumexp](https://pytorch.org/docs/1.8.1/generated/torch.logsumexp.html#torch.logsumexp) | Unsupported | +| 11 | [mean](https://pytorch.org/docs/1.8.1/generated/torch.mean.html#torch.mean) | Unsupported | +| 12 | [median](https://pytorch.org/docs/1.8.1/generated/torch.median.html#torch.median) | Unsupported | +| 13 | [nanmedian](https://pytorch.org/docs/1.8.1/generated/torch.nanmedian.html#torch.nanmedian) | Unsupported | +| 14 | [mode](https://pytorch.org/docs/1.8.1/generated/torch.mode.html#torch.mode) | Unsupported | +| 15 | [norm](https://pytorch.org/docs/1.8.1/generated/torch.norm.html#torch.norm) | Unsupported | +| 16 | [nansum](https://pytorch.org/docs/1.8.1/generated/torch.nansum.html#torch.nansum) | Unsupported | +| 17 | [prod](https://pytorch.org/docs/1.8.1/generated/torch.prod.html#torch.prod) | Unsupported | +| 18 | [quantile](https://pytorch.org/docs/1.8.1/generated/torch.quantile.html#torch.quantile) | Unsupported | +| 19 | [nanquantile](https://pytorch.org/docs/1.8.1/generated/torch.nanquantile.html#torch.nanquantile) | Unsupported | +| 20 | [std](https://pytorch.org/docs/1.8.1/generated/torch.std.html#torch.std) | Unsupported | +| 21 | [std_mean](https://pytorch.org/docs/1.8.1/generated/torch.std_mean.html#torch.std_mean) | Unsupported | +| 22 | 
[sum](https://pytorch.org/docs/1.8.1/generated/torch.sum.html#torch.sum) | Unsupported | +| 23 | [unique](https://pytorch.org/docs/1.8.1/generated/torch.unique.html#torch.unique) | Unsupported | +| 24 | [unique_consecutive](https://pytorch.org/docs/1.8.1/generated/torch.unique_consecutive.html#torch.unique_consecutive) | Unsupported | +| 25 | [var](https://pytorch.org/docs/1.8.1/generated/torch.var.html#torch.var) | Unsupported | +| 26 | [var_mean](https://pytorch.org/docs/1.8.1/generated/torch.var_mean.html#torch.var_mean) | Unsupported | +| 27 | [count_nonzero](https://pytorch.org/docs/1.8.1/generated/torch.count_nonzero.html#torch.count_nonzero) | Unsupported | + +### Comparison Ops + +| No. | API | Supported/Unsupported | +| ---- | ------------------------------------------------------------ | --------------------- | +| 1 | [allclose](https://pytorch.org/docs/1.8.1/generated/torch.allclose.html#torch.allclose) | Unsupported | +| 2 | [argsort](https://pytorch.org/docs/1.8.1/generated/torch.argsort.html#torch.argsort) | Unsupported | +| 3 | [eq](https://pytorch.org/docs/1.8.1/generated/torch.eq.html#torch.eq) | Unsupported | +| 4 | [equal](https://pytorch.org/docs/1.8.1/generated/torch.equal.html#torch.equal) | Unsupported | +| 5 | [ge](https://pytorch.org/docs/1.8.1/generated/torch.ge.html#torch.ge) | Unsupported | +| 6 | [greater_equal](https://pytorch.org/docs/1.8.1/generated/torch.greater_equal.html#torch.greater_equal) | Unsupported | +| 7 | [gt](https://pytorch.org/docs/1.8.1/generated/torch.gt.html#torch.gt) | Unsupported | +| 8 | [greater](https://pytorch.org/docs/1.8.1/generated/torch.greater.html#torch.greater) | Unsupported | +| 9 | [isclose](https://pytorch.org/docs/1.8.1/generated/torch.isclose.html#torch.isclose) | Unsupported | +| 10 | [isfinite](https://pytorch.org/docs/1.8.1/generated/torch.isfinite.html#torch.isfinite) | Unsupported | +| 11 | [isinf](https://pytorch.org/docs/1.8.1/generated/torch.isinf.html#torch.isinf) | Unsupported | +| 12 | [isposinf](https://pytorch.org/docs/1.8.1/generated/torch.isposinf.html#torch.isposinf) | Unsupported | +| 13 | [isneginf](https://pytorch.org/docs/1.8.1/generated/torch.isneginf.html#torch.isneginf) | Unsupported | +| 14 | [isnan](https://pytorch.org/docs/1.8.1/generated/torch.isnan.html#torch.isnan) | Unsupported | +| 15 | [isreal](https://pytorch.org/docs/1.8.1/generated/torch.isreal.html#torch.isreal) | Unsupported | +| 16 | [kthvalue](https://pytorch.org/docs/1.8.1/generated/torch.kthvalue.html#torch.kthvalue) | Unsupported | +| 17 | [le](https://pytorch.org/docs/1.8.1/generated/torch.le.html#torch.le) | Unsupported | +| 18 | [less_equal](https://pytorch.org/docs/1.8.1/generated/torch.less_equal.html#torch.less_equal) | Unsupported | +| 19 | [lt](https://pytorch.org/docs/1.8.1/generated/torch.lt.html#torch.lt) | Unsupported | +| 20 | [less](https://pytorch.org/docs/1.8.1/generated/torch.less.html#torch.less) | Unsupported | +| 21 | [maximum](https://pytorch.org/docs/1.8.1/generated/torch.maximum.html#torch.maximum) | Unsupported | +| 22 | [minimum](https://pytorch.org/docs/1.8.1/generated/torch.minimum.html#torch.minimum) | Unsupported | +| 23 | [fmax](https://pytorch.org/docs/1.8.1/generated/torch.fmax.html#torch.fmax) | Unsupported | +| 24 | [fmin](https://pytorch.org/docs/1.8.1/generated/torch.fmin.html#torch.fmin) | Unsupported | +| 25 | [ne](https://pytorch.org/docs/1.8.1/generated/torch.ne.html#torch.ne) | Unsupported | +| 26 | [not_equal](https://pytorch.org/docs/1.8.1/generated/torch.not_equal.html#torch.not_equal) | 
Unsupported | +| 27 | [sort](https://pytorch.org/docs/1.8.1/generated/torch.sort.html#torch.sort) | Unsupported | +| 28 | [topk](https://pytorch.org/docs/1.8.1/generated/torch.topk.html#torch.topk) | Unsupported | +| 29 | [msort](https://pytorch.org/docs/1.8.1/generated/torch.msort.html#torch.msort) | Unsupported | + +### Spectral Ops + +| No. | API | Supported/Unsupported | +| ---- | ------------------------------------------------------------ | --------------------- | +| 1 | [stft](https://pytorch.org/docs/1.8.1/generated/torch.stft.html#torch.stft) | Unsupported | +| 2 | [istft](https://pytorch.org/docs/1.8.1/generated/torch.istft.html#torch.istft) | Unsupported | +| 3 | [bartlett_window](https://pytorch.org/docs/1.8.1/generated/torch.bartlett_window.html#torch.bartlett_window) | Unsupported | +| 4 | [blackman_window](https://pytorch.org/docs/1.8.1/generated/torch.blackman_window.html#torch.blackman_window) | Unsupported | +| 5 | [hamming_window](https://pytorch.org/docs/1.8.1/generated/torch.hamming_window.html#torch.hamming_window) | Unsupported | +| 6 | [hann_window](https://pytorch.org/docs/1.8.1/generated/torch.hann_window.html#torch.hann_window) | Unsupported | +| 7 | [kaiser_window](https://pytorch.org/docs/1.8.1/generated/torch.kaiser_window.html#torch.kaiser_window) | Unsupported | + +### Other Operations + +| No. | API | Supported/Unsupported | +| ---- | ------------------------------------------------------------ | --------------------- | +| 1 | [atleast_1d](https://pytorch.org/docs/1.8.1/generated/torch.atleast_1d.html#torch.atleast_1d) | Unsupported | +| 2 | [atleast_2d](https://pytorch.org/docs/1.8.1/generated/torch.atleast_2d.html#torch.atleast_2d) | Unsupported | +| 3 | [atleast_3d](https://pytorch.org/docs/1.8.1/generated/torch.atleast_3d.html#torch.atleast_3d) | Unsupported | +| 4 | [bincount](https://pytorch.org/docs/1.8.1/generated/torch.bincount.html#torch.bincount) | Unsupported | +| 5 | [block_diag](https://pytorch.org/docs/1.8.1/generated/torch.block_diag.html#torch.block_diag) | Unsupported | +| 6 | [broadcast_tensors](https://pytorch.org/docs/1.8.1/generated/torch.broadcast_tensors.html#torch.broadcast_tensors) | Unsupported | +| 7 | [broadcast_to](https://pytorch.org/docs/1.8.1/generated/torch.broadcast_to.html#torch.broadcast_to) | Unsupported | +| 8 | [broadcast_shapes](https://pytorch.org/docs/1.8.1/generated/torch.broadcast_shapes.html#torch.broadcast_shapes) | Unsupported | +| 9 | [bucketize](https://pytorch.org/docs/1.8.1/generated/torch.bucketize.html#torch.bucketize) | Unsupported | +| 10 | [cartesian_prod](https://pytorch.org/docs/1.8.1/generated/torch.cartesian_prod.html#torch.cartesian_prod) | Unsupported | +| 11 | [cdist](https://pytorch.org/docs/1.8.1/generated/torch.cdist.html#torch.cdist) | Unsupported | +| 12 | [clone](https://pytorch.org/docs/1.8.1/generated/torch.clone.html#torch.clone) | Unsupported | +| 13 | [combinations](https://pytorch.org/docs/1.8.1/generated/torch.combinations.html#torch.combinations) | Unsupported | +| 14 | [cross](https://pytorch.org/docs/1.8.1/generated/torch.cross.html#torch.cross) | Unsupported | +| 15 | [cummax](https://pytorch.org/docs/1.8.1/generated/torch.cummax.html#torch.cummax) | Unsupported | +| 16 | [cummin](https://pytorch.org/docs/1.8.1/generated/torch.cummin.html#torch.cummin) | Unsupported | +| 17 | [cumprod](https://pytorch.org/docs/1.8.1/generated/torch.cumprod.html#torch.cumprod) | Unsupported | +| 18 | [cumsum](https://pytorch.org/docs/1.8.1/generated/torch.cumsum.html#torch.cumsum) | Unsupported 
| +| 19 | [diag](https://pytorch.org/docs/1.8.1/generated/torch.diag.html#torch.diag) | Unsupported | +| 20 | [diag_embed](https://pytorch.org/docs/1.8.1/generated/torch.diag_embed.html#torch.diag_embed) | Unsupported | +| 21 | [diagflat](https://pytorch.org/docs/1.8.1/generated/torch.diagflat.html#torch.diagflat) | Unsupported | +| 22 | [diagonal](https://pytorch.org/docs/1.8.1/generated/torch.diagonal.html#torch.diagonal) | Unsupported | +| 23 | [diff](https://pytorch.org/docs/1.8.1/generated/torch.diff.html#torch.diff) | Unsupported | +| 24 | [einsum](https://pytorch.org/docs/1.8.1/generated/torch.einsum.html#torch.einsum) | Unsupported | +| 25 | [flatten](https://pytorch.org/docs/1.8.1/generated/torch.flatten.html#torch.flatten) | Unsupported | +| 26 | [flip](https://pytorch.org/docs/1.8.1/generated/torch.flip.html#torch.flip) | Unsupported | +| 27 | [fliplr](https://pytorch.org/docs/1.8.1/generated/torch.fliplr.html#torch.fliplr) | Unsupported | +| 28 | [flipud](https://pytorch.org/docs/1.8.1/generated/torch.flipud.html#torch.flipud) | Unsupported | +| 29 | [kron](https://pytorch.org/docs/1.8.1/generated/torch.kron.html#torch.kron) | Unsupported | +| 30 | [rot90](https://pytorch.org/docs/1.8.1/generated/torch.rot90.html#torch.rot90) | Unsupported | +| 31 | [gcd](https://pytorch.org/docs/1.8.1/generated/torch.gcd.html#torch.gcd) | Unsupported | +| 32 | [histc](https://pytorch.org/docs/1.8.1/generated/torch.histc.html#torch.histc) | Unsupported | +| 33 | [meshgrid](https://pytorch.org/docs/1.8.1/generated/torch.meshgrid.html#torch.meshgrid) | Unsupported | +| 34 | [lcm](https://pytorch.org/docs/1.8.1/generated/torch.lcm.html#torch.lcm) | Unsupported | +| 35 | [logcumsumexp](https://pytorch.org/docs/1.8.1/generated/torch.logcumsumexp.html#torch.logcumsumexp) | Unsupported | +| 36 | [ravel](https://pytorch.org/docs/1.8.1/generated/torch.ravel.html#torch.ravel) | Unsupported | +| 37 | [renorm](https://pytorch.org/docs/1.8.1/generated/torch.renorm.html#torch.renorm) | Unsupported | +| 38 | [repeat_interleave](https://pytorch.org/docs/1.8.1/generated/torch.repeat_interleave.html#torch.repeat_interleave) | Unsupported | +| 39 | [roll](https://pytorch.org/docs/1.8.1/generated/torch.roll.html#torch.roll) | Unsupported | +| 40 | [searchsorted](https://pytorch.org/docs/1.8.1/generated/torch.searchsorted.html#torch.searchsorted) | Unsupported | +| 41 | [tensordot](https://pytorch.org/docs/1.8.1/generated/torch.tensordot.html#torch.tensordot) | Unsupported | +| 42 | [trace](https://pytorch.org/docs/1.8.1/generated/torch.trace.html#torch.trace) | Unsupported | +| 43 | [tril](https://pytorch.org/docs/1.8.1/generated/torch.tril.html#torch.tril) | Unsupported | +| 44 | [tril_indices](https://pytorch.org/docs/1.8.1/generated/torch.tril_indices.html#torch.tril_indices) | Unsupported | +| 45 | [triu](https://pytorch.org/docs/1.8.1/generated/torch.triu.html#torch.triu) | Unsupported | +| 46 | [triu_indices](https://pytorch.org/docs/1.8.1/generated/torch.triu_indices.html#torch.triu_indices) | Unsupported | +| 47 | [vander](https://pytorch.org/docs/1.8.1/generated/torch.vander.html#torch.vander) | Unsupported | +| 48 | [view_as_real](https://pytorch.org/docs/1.8.1/generated/torch.view_as_real.html#torch.view_as_real) | Unsupported | +| 49 | [view_as_complex](https://pytorch.org/docs/1.8.1/generated/torch.view_as_complex.html#torch.view_as_complex) | Unsupported | + +### BLAS and LAPACK Operations + +| No. 
| API | Supported/Unsupported | +| ---- | ------------------------------------------------------------ | --------------------- | +| 1 | [addbmm](https://pytorch.org/docs/1.8.1/generated/torch.addbmm.html#torch.addbmm) | Unsupported | +| 2 | [addmm](https://pytorch.org/docs/1.8.1/generated/torch.addmm.html#torch.addmm) | Unsupported | +| 3 | [addmv](https://pytorch.org/docs/1.8.1/generated/torch.addmv.html#torch.addmv) | Unsupported | +| 4 | [addr](https://pytorch.org/docs/1.8.1/generated/torch.addr.html#torch.addr) | Unsupported | +| 5 | [baddbmm](https://pytorch.org/docs/1.8.1/generated/torch.baddbmm.html#torch.baddbmm) | Unsupported | +| 6 | [bmm](https://pytorch.org/docs/1.8.1/generated/torch.bmm.html#torch.bmm) | Unsupported | +| 7 | [chain_matmul](https://pytorch.org/docs/1.8.1/generated/torch.chain_matmul.html#torch.chain_matmul) | Unsupported | +| 8 | [cholesky](https://pytorch.org/docs/1.8.1/generated/torch.cholesky.html#torch.cholesky) | Unsupported | +| 9 | [cholesky_inverse](https://pytorch.org/docs/1.8.1/generated/torch.cholesky_inverse.html#torch.cholesky_inverse) | Unsupported | +| 10 | [cholesky_solve](https://pytorch.org/docs/1.8.1/generated/torch.cholesky_solve.html#torch.cholesky_solve) | Unsupported | +| 11 | [dot](https://pytorch.org/docs/1.8.1/generated/torch.dot.html#torch.dot) | Unsupported | +| 12 | [eig](https://pytorch.org/docs/1.8.1/generated/torch.eig.html#torch.eig) | Unsupported | +| 13 | [geqrf](https://pytorch.org/docs/1.8.1/generated/torch.geqrf.html#torch.geqrf) | Unsupported | +| 14 | [ger](https://pytorch.org/docs/1.8.1/generated/torch.ger.html#torch.ger) | Unsupported | +| 15 | [inner](https://pytorch.org/docs/1.8.1/generated/torch.inner.html#torch.inner) | Unsupported | +| 16 | [inverse](https://pytorch.org/docs/1.8.1/generated/torch.inverse.html#torch.inverse) | Unsupported | +| 17 | [det](https://pytorch.org/docs/1.8.1/generated/torch.det.html#torch.det) | Unsupported | +| 18 | [logdet](https://pytorch.org/docs/1.8.1/generated/torch.logdet.html#torch.logdet) | Unsupported | +| 19 | [slogdet](https://pytorch.org/docs/1.8.1/generated/torch.slogdet.html#torch.slogdet) | Unsupported | +| 20 | [lstsq](https://pytorch.org/docs/1.8.1/generated/torch.lstsq.html#torch.lstsq) | Unsupported | +| 21 | [lu](https://pytorch.org/docs/1.8.1/generated/torch.lu.html#torch.lu) | Unsupported | +| 22 | [lu_solve](https://pytorch.org/docs/1.8.1/generated/torch.lu_solve.html#torch.lu_solve) | Unsupported | +| 23 | [lu_unpack](https://pytorch.org/docs/1.8.1/generated/torch.lu_unpack.html#torch.lu_unpack) | Unsupported | +| 24 | [matmul](https://pytorch.org/docs/1.8.1/generated/torch.matmul.html#torch.matmul) | Unsupported | +| 25 | [matrix_power](https://pytorch.org/docs/1.8.1/generated/torch.matrix_power.html#torch.matrix_power) | Unsupported | +| 26 | [matrix_rank](https://pytorch.org/docs/1.8.1/generated/torch.matrix_rank.html#torch.matrix_rank) | Unsupported | +| 27 | [matrix_exp](https://pytorch.org/docs/1.8.1/generated/torch.matrix_exp.html#torch.matrix_exp) | Unsupported | +| 28 | [mm](https://pytorch.org/docs/1.8.1/generated/torch.mm.html#torch.mm) | Unsupported | +| 29 | [mv](https://pytorch.org/docs/1.8.1/generated/torch.mv.html#torch.mv) | Unsupported | +| 30 | [orgqr](https://pytorch.org/docs/1.8.1/generated/torch.orgqr.html#torch.orgqr) | Unsupported | +| 31 | [ormqr](https://pytorch.org/docs/1.8.1/generated/torch.ormqr.html#torch.ormqr) | Unsupported | +| 32 | [outer](https://pytorch.org/docs/1.8.1/generated/torch.outer.html#torch.outer) | Unsupported | +| 33 
| [pinverse](https://pytorch.org/docs/1.8.1/generated/torch.pinverse.html#torch.pinverse) | Unsupported | +| 34 | [qr](https://pytorch.org/docs/1.8.1/generated/torch.qr.html#torch.qr) | Unsupported | +| 35 | [solve](https://pytorch.org/docs/1.8.1/generated/torch.solve.html#torch.solve) | Unsupported | +| 36 | [svd](https://pytorch.org/docs/1.8.1/generated/torch.svd.html#torch.svd) | Unsupported | +| 37 | [svd_lowrank](https://pytorch.org/docs/1.8.1/generated/torch.svd_lowrank.html#torch.svd_lowrank) | Unsupported | +| 38 | [pca_lowrank](https://pytorch.org/docs/1.8.1/generated/torch.pca_lowrank.html#torch.pca_lowrank) | Unsupported | +| 39 | [symeig](https://pytorch.org/docs/1.8.1/generated/torch.symeig.html#torch.symeig) | Unsupported | +| 40 | [lobpcg](https://pytorch.org/docs/1.8.1/generated/torch.lobpcg.html#torch.lobpcg) | Unsupported | +| 41 | [trapz](https://pytorch.org/docs/1.8.1/generated/torch.trapz.html#torch.trapz) | Unsupported | +| 42 | [triangular_solve](https://pytorch.org/docs/1.8.1/generated/torch.triangular_solve.html#torch.triangular_solve) | Unsupported | +| 43 | [vdot](https://pytorch.org/docs/1.8.1/generated/torch.vdot.html#torch.vdot) | Unsupported | + +## Utilities + +| No. | API | Supported/Unsupported | +| ---- | ------------------------------------------------------------ | --------------------- | +| 1 | [compiled_with_cxx11_abi](https://pytorch.org/docs/1.8.1/generated/torch.compiled_with_cxx11_abi.html#torch.compiled_with_cxx11_abi) | Unsupported | +| 2 | [result_type](https://pytorch.org/docs/1.8.1/generated/torch.result_type.html#torch.result_type) | Unsupported | +| 3 | [can_cast](https://pytorch.org/docs/1.8.1/generated/torch.can_cast.html#torch.can_cast) | Unsupported | +| 4 | [promote_types](https://pytorch.org/docs/1.8.1/generated/torch.promote_types.html#torch.promote_types) | Unsupported | +| 5 | [use_deterministic_algorithms](https://pytorch.org/docs/1.8.1/generated/torch.use_deterministic_algorithms.html#torch.use_deterministic_algorithms) | Unsupported | +| 6 | [are_deterministic_algorithms_enabled](https://pytorch.org/docs/1.8.1/generated/torch.are_deterministic_algorithms_enabled.html#torch.are_deterministic_algorithms_enabled) | Unsupported | +| 7 | [_assert](https://pytorch.org/docs/1.8.1/generated/torch._assert.html#torch._assert) | Unsupported | + +# Layers (torch.nn) + +| No. | API | Supported/Unsupported | +| ---- | ------------------------------------------------------------ | --------------------- | +| 1 | [Parameter](https://pytorch.org/docs/1.8.1/generated/torch.nn.parameter.Parameter.html#torch.nn.parameter.Parameter) | Unsupported | +| 2 | [UninitializedParameter](https://pytorch.org/docs/1.8.1/generated/torch.nn.parameter.UninitializedParameter.html#torch.nn.parameter.UninitializedParameter) | Unsupported | + +## [Containers](https://pytorch.org/docs/1.8.1/nn.html#id1) + + +| No. 
| API | Supported/Unsupported | +| ---- | ------------------------------------------------------------ | --------------------- | +| 1 | [Module](https://pytorch.org/docs/1.8.1/generated/torch.nn.Module.html#torch.nn.Module) | Unsupported | +| 2 | [Sequential](https://pytorch.org/docs/1.8.1/generated/torch.nn.Sequential.html#torch.nn.Sequential) | Unsupported | +| 3 | [ModuleList](https://pytorch.org/docs/1.8.1/generated/torch.nn.ModuleList.html#torch.nn.ModuleList) | Unsupported | +| 4 | [ModuleDict](https://pytorch.org/docs/1.8.1/generated/torch.nn.ModuleDict.html#torch.nn.ModuleDict) | Unsupported | +| 5 | [ParameterList](https://pytorch.org/docs/1.8.1/generated/torch.nn.ParameterList.html#torch.nn.ParameterList) | Unsupported | +| 6 | [ParameterDict](https://pytorch.org/docs/1.8.1/generated/torch.nn.ParameterDict.html#torch.nn.ParameterDict) | Unsupported | + +### Global Hooks For Module + +| No. | API | Supported/Unsupported | +| ---- | ------------------------------------------------------------ | --------------------- | +| 1 | [register_module_forward_pre_hook](https://pytorch.org/docs/1.8.1/generated/torch.nn.modules.module.register_module_forward_pre_hook.html#torch.nn.modules.module.register_module_forward_pre_hook) | Unsupported | +| 2 | [register_module_forward_hook](https://pytorch.org/docs/1.8.1/generated/torch.nn.modules.module.register_module_forward_hook.html#torch.nn.modules.module.register_module_forward_hook) | Unsupported | +| 3 | [register_module_backward_hook](https://pytorch.org/docs/1.8.1/generated/torch.nn.modules.module.register_module_backward_hook.html#torch.nn.modules.module.register_module_backward_hook) | Unsupported | + +## [Convolution Layers](https://pytorch.org/docs/1.8.1/nn.html#id1) + +| No. | API | Supported/Unsupported | +| ---- | ------------------------------------------------------------ | --------------------- | +| 1 | [nn.Conv1d](https://pytorch.org/docs/1.8.1/generated/torch.nn.Conv1d.html#torch.nn.Conv1d) | Unsupported | +| 2 | [nn.Conv2d](https://pytorch.org/docs/1.8.1/generated/torch.nn.Conv2d.html#torch.nn.Conv2d) | Unsupported | +| 3 | [nn.Conv3d](https://pytorch.org/docs/1.8.1/generated/torch.nn.Conv3d.html#torch.nn.Conv3d) | Unsupported | +| 4 | [nn.ConvTranspose1d](https://pytorch.org/docs/1.8.1/generated/torch.nn.ConvTranspose1d.html#torch.nn.ConvTranspose1d) | Unsupported | +| 5 | [nn.ConvTranspose2d](https://pytorch.org/docs/1.8.1/generated/torch.nn.ConvTranspose2d.html#torch.nn.ConvTranspose2d) | Unsupported | +| 6 | [nn.ConvTranspose3d](https://pytorch.org/docs/1.8.1/generated/torch.nn.ConvTranspose3d.html#torch.nn.ConvTranspose3d) | Unsupported | +| 7 | [nn.LazyConv1d](https://pytorch.org/docs/1.8.1/generated/torch.nn.LazyConv1d.html#torch.nn.LazyConv1d) | Unsupported | +| 8 | [nn.LazyConv2d](https://pytorch.org/docs/1.8.1/generated/torch.nn.LazyConv2d.html#torch.nn.LazyConv2d) | Unsupported | +| 9 | [nn.LazyConv3d](https://pytorch.org/docs/1.8.1/generated/torch.nn.LazyConv3d.html#torch.nn.LazyConv3d) | Unsupported | +| 10 | [nn.LazyConvTranspose1d](https://pytorch.org/docs/1.8.1/generated/torch.nn.LazyConvTranspose1d.html#torch.nn.LazyConvTranspose1d) | Unsupported | +| 11 | [nn.LazyConvTranspose2d](https://pytorch.org/docs/1.8.1/generated/torch.nn.LazyConvTranspose2d.html#torch.nn.LazyConvTranspose2d) | Unsupported | +| 12 | [nn.LazyConvTranspose3d](https://pytorch.org/docs/1.8.1/generated/torch.nn.LazyConvTranspose3d.html#torch.nn.LazyConvTranspose3d) | Unsupported | +| 13 | 
[nn.Unfold](https://pytorch.org/docs/1.8.1/generated/torch.nn.Unfold.html#torch.nn.Unfold) | Unsupported | +| 14 | [nn.Fold](https://pytorch.org/docs/1.8.1/generated/torch.nn.Fold.html#torch.nn.Fold) | Unsupported | + +## [Pooling Layers](https://pytorch.org/docs/1.8.1/nn.html#id1) + +| No. | API | Supported/Unsupported | +| ---- | ------------------------------------------------------------ | --------------------- | +| 1 | [nn.MaxPool1d](https://pytorch.org/docs/1.8.1/generated/torch.nn.MaxPool1d.html#torch.nn.MaxPool1d) | Unsupported | +| 2 | [nn.MaxPool2d](https://pytorch.org/docs/1.8.1/generated/torch.nn.MaxPool2d.html#torch.nn.MaxPool2d) | Unsupported | +| 3 | [nn.MaxPool3d](https://pytorch.org/docs/1.8.1/generated/torch.nn.MaxPool3d.html#torch.nn.MaxPool3d) | Unsupported | +| 4 | [nn.MaxUnpool1d](https://pytorch.org/docs/1.8.1/generated/torch.nn.MaxUnpool1d.html#torch.nn.MaxUnpool1d) | Unsupported | +| 5 | [nn.MaxUnpool2d](https://pytorch.org/docs/1.8.1/generated/torch.nn.MaxUnpool2d.html#torch.nn.MaxUnpool2d) | Unsupported | +| 6 | [nn.MaxUnpool3d](https://pytorch.org/docs/1.8.1/generated/torch.nn.MaxUnpool3d.html#torch.nn.MaxUnpool3d) | Unsupported | +| 7 | [nn.AvgPool1d](https://pytorch.org/docs/1.8.1/generated/torch.nn.AvgPool1d.html#torch.nn.AvgPool1d) | Unsupported | +| 8 | [nn.AvgPool2d](https://pytorch.org/docs/1.8.1/generated/torch.nn.AvgPool2d.html#torch.nn.AvgPool2d) | Unsupported | +| 9 | [nn.AvgPool3d](https://pytorch.org/docs/1.8.1/generated/torch.nn.AvgPool3d.html#torch.nn.AvgPool3d) | Unsupported | +| 10 | [nn.FractionalMaxPool2d](https://pytorch.org/docs/1.8.1/generated/torch.nn.FractionalMaxPool2d.html#torch.nn.FractionalMaxPool2d) | Unsupported | +| 11 | [nn.LPPool1d](https://pytorch.org/docs/1.8.1/generated/torch.nn.LPPool1d.html#torch.nn.LPPool1d) | Unsupported | +| 12 | [nn.LPPool2d](https://pytorch.org/docs/1.8.1/generated/torch.nn.LPPool2d.html#torch.nn.LPPool2d) | Unsupported | +| 13 | [nn.AdaptiveMaxPool1d](https://pytorch.org/docs/1.8.1/generated/torch.nn.AdaptiveMaxPool1d.html#torch.nn.AdaptiveMaxPool1d) | Unsupported | +| 14 | [nn.AdaptiveMaxPool2d](https://pytorch.org/docs/1.8.1/generated/torch.nn.AdaptiveMaxPool2d.html#torch.nn.AdaptiveMaxPool2d) | Unsupported | +| 15 | [nn.AdaptiveMaxPool3d](https://pytorch.org/docs/1.8.1/generated/torch.nn.AdaptiveMaxPool3d.html#torch.nn.AdaptiveMaxPool3d) | Unsupported | +| 16 | [nn.AdaptiveAvgPool1d](https://pytorch.org/docs/1.8.1/generated/torch.nn.AdaptiveAvgPool1d.html#torch.nn.AdaptiveAvgPool1d) | Unsupported | +| 17 | [nn.AdaptiveAvgPool2d](https://pytorch.org/docs/1.8.1/generated/torch.nn.AdaptiveAvgPool2d.html#torch.nn.AdaptiveAvgPool2d) | Unsupported | +| 18 | [nn.AdaptiveAvgPool3d](https://pytorch.org/docs/1.8.1/generated/torch.nn.AdaptiveAvgPool3d.html#torch.nn.AdaptiveAvgPool3d) | Unsupported | + +## [Padding Layers](https://pytorch.org/docs/1.8.1/nn.html#id1) + +| No. 
| API | Supported/Unsupported | +| ---- | ------------------------------------------------------------ | --------------------- | +| 1 | [nn.ReflectionPad1d](https://pytorch.org/docs/1.8.1/generated/torch.nn.ReflectionPad1d.html#torch.nn.ReflectionPad1d) | Unsupported | +| 2 | [nn.ReflectionPad2d](https://pytorch.org/docs/1.8.1/generated/torch.nn.ReflectionPad2d.html#torch.nn.ReflectionPad2d) | Unsupported | +| 3 | [nn.ReplicationPad1d](https://pytorch.org/docs/1.8.1/generated/torch.nn.ReplicationPad1d.html#torch.nn.ReplicationPad1d) | Unsupported | +| 4 | [nn.ReplicationPad2d](https://pytorch.org/docs/1.8.1/generated/torch.nn.ReplicationPad2d.html#torch.nn.ReplicationPad2d) | Unsupported | +| 5 | [nn.ReplicationPad3d](https://pytorch.org/docs/1.8.1/generated/torch.nn.ReplicationPad3d.html#torch.nn.ReplicationPad3d) | Unsupported | +| 6 | [nn.ZeroPad2d](https://pytorch.org/docs/1.8.1/generated/torch.nn.ZeroPad2d.html#torch.nn.ZeroPad2d) | Unsupported | +| 7 | [nn.ConstantPad1d](https://pytorch.org/docs/1.8.1/generated/torch.nn.ConstantPad1d.html#torch.nn.ConstantPad1d) | Unsupported | +| 8 | [nn.ConstantPad2d](https://pytorch.org/docs/1.8.1/generated/torch.nn.ConstantPad2d.html#torch.nn.ConstantPad2d) | Unsupported | +| 9 | [nn.ConstantPad3d](https://pytorch.org/docs/1.8.1/generated/torch.nn.ConstantPad3d.html#torch.nn.ConstantPad3d) | Unsupported | + + + +## [Non-Linear Activations (Weighted sum, Nonlinearity)](https://pytorch.org/docs/1.8.1/nn.html#id1) + +| No. | API | Supported/Unsupported | +| ---- | ------------------------------------------------------------ | --------------------- | +| 1 | [nn.ELU](https://pytorch.org/docs/1.8.1/generated/torch.nn.ELU.html#torch.nn.ELU) | Unsupported | +| 2 | [nn.Hardshrink](https://pytorch.org/docs/1.8.1/generated/torch.nn.Hardshrink.html#torch.nn.Hardshrink) | Unsupported | +| 3 | [nn.Hardsigmoid](https://pytorch.org/docs/1.8.1/generated/torch.nn.Hardsigmoid.html#torch.nn.Hardsigmoid) | Unsupported | +| 4 | [nn.Hardtanh](https://pytorch.org/docs/1.8.1/generated/torch.nn.Hardtanh.html#torch.nn.Hardtanh) | Unsupported | +| 5 | [nn.Hardswish](https://pytorch.org/docs/1.8.1/generated/torch.nn.Hardswish.html#torch.nn.Hardswish) | Unsupported | +| 6 | [nn.LeakyReLU](https://pytorch.org/docs/1.8.1/generated/torch.nn.LeakyReLU.html#torch.nn.LeakyReLU) | Unsupported | +| 7 | [nn.LogSigmoid](https://pytorch.org/docs/1.8.1/generated/torch.nn.LogSigmoid.html#torch.nn.LogSigmoid) | Unsupported | +| 8 | [nn.MultiheadAttention](https://pytorch.org/docs/1.8.1/generated/torch.nn.MultiheadAttention.html#torch.nn.MultiheadAttention) | Unsupported | +| 9 | [nn.PReLU](https://pytorch.org/docs/1.8.1/generated/torch.nn.PReLU.html#torch.nn.PReLU) | Unsupported | +| 10 | [nn.ReLU](https://pytorch.org/docs/1.8.1/generated/torch.nn.ReLU.html#torch.nn.ReLU) | Unsupported | +| 11 | [nn.ReLU6](https://pytorch.org/docs/1.8.1/generated/torch.nn.ReLU6.html#torch.nn.ReLU6) | Unsupported | +| 12 | [nn.RReLU](https://pytorch.org/docs/1.8.1/generated/torch.nn.RReLU.html#torch.nn.RReLU) | Unsupported | +| 13 | [nn.SELU](https://pytorch.org/docs/1.8.1/generated/torch.nn.SELU.html#torch.nn.SELU) | Unsupported | +| 14 | [nn.CELU](https://pytorch.org/docs/1.8.1/generated/torch.nn.CELU.html#torch.nn.CELU) | Unsupported | +| 15 | [nn.GELU](https://pytorch.org/docs/1.8.1/generated/torch.nn.GELU.html#torch.nn.GELU) | Unsupported | +| 16 | [nn.Sigmoid](https://pytorch.org/docs/1.8.1/generated/torch.nn.Sigmoid.html#torch.nn.Sigmoid) | Unsupported | +| 17 | 
[nn.SiLU](https://pytorch.org/docs/1.8.1/generated/torch.nn.SiLU.html#torch.nn.SiLU) | Unsupported | +| 18 | [nn.Softplus](https://pytorch.org/docs/1.8.1/generated/torch.nn.Softplus.html#torch.nn.Softplus) | Unsupported | +| 19 | [nn.Softshrink](https://pytorch.org/docs/1.8.1/generated/torch.nn.Softshrink.html#torch.nn.Softshrink) | Unsupported | +| 20 | [nn.Softsign](https://pytorch.org/docs/1.8.1/generated/torch.nn.Softsign.html#torch.nn.Softsign) | Unsupported | +| 21 | [nn.Tanh](https://pytorch.org/docs/1.8.1/generated/torch.nn.Tanh.html#torch.nn.Tanh) | Unsupported | +| 22 | [nn.Tanhshrink](https://pytorch.org/docs/1.8.1/generated/torch.nn.Tanhshrink.html#torch.nn.Tanhshrink) | Unsupported | +| 23 | [nn.Threshold](https://pytorch.org/docs/1.8.1/generated/torch.nn.Threshold.html#torch.nn.Threshold) | Unsupported | + +## [Non-Linear Activations (Other)](https://pytorch.org/docs/1.8.1/nn.html#id1) + +| No. | API | Supported/Unsupported | +| ---- | ------------------------------------------------------------ | --------------------- | +| 1 | [nn.Softmin](https://pytorch.org/docs/1.8.1/generated/torch.nn.Softmin.html#torch.nn.Softmin) | Unsupported | +| 2 | [nn.Softmax](https://pytorch.org/docs/1.8.1/generated/torch.nn.Softmax.html#torch.nn.Softmax) | Unsupported | +| 3 | [nn.Softmax2d](https://pytorch.org/docs/1.8.1/generated/torch.nn.Softmax2d.html#torch.nn.Softmax2d) | Unsupported | +| 4 | [nn.LogSoftmax](https://pytorch.org/docs/1.8.1/generated/torch.nn.LogSoftmax.html#torch.nn.LogSoftmax) | Unsupported | +| 5 | [nn.AdaptiveLogSoftmaxWithLoss](https://pytorch.org/docs/1.8.1/generated/torch.nn.AdaptiveLogSoftmaxWithLoss.html#torch.nn.AdaptiveLogSoftmaxWithLoss) | Unsupported | + +## [Normalization Layers](https://pytorch.org/docs/1.8.1/nn.html#id1) + +| No. | API | Supported/Unsupported | +| ---- | ------------------------------------------------------------ | --------------------- | +| 1 | [nn.BatchNorm1d](https://pytorch.org/docs/1.8.1/generated/torch.nn.BatchNorm1d.html#torch.nn.BatchNorm1d) | Unsupported | +| 2 | [nn.BatchNorm2d](https://pytorch.org/docs/1.8.1/generated/torch.nn.BatchNorm2d.html#torch.nn.BatchNorm2d) | Unsupported | +| 3 | [nn.BatchNorm3d](https://pytorch.org/docs/1.8.1/generated/torch.nn.BatchNorm3d.html#torch.nn.BatchNorm3d) | Unsupported | +| 4 | [nn.GroupNorm](https://pytorch.org/docs/1.8.1/generated/torch.nn.GroupNorm.html#torch.nn.GroupNorm) | Unsupported | +| 5 | [nn.SyncBatchNorm](https://pytorch.org/docs/1.8.1/generated/torch.nn.SyncBatchNorm.html#torch.nn.SyncBatchNorm) | Unsupported | +| 6 | [nn.InstanceNorm1d](https://pytorch.org/docs/1.8.1/generated/torch.nn.InstanceNorm1d.html#torch.nn.InstanceNorm1d) | Unsupported | +| 7 | [nn.InstanceNorm2d](https://pytorch.org/docs/1.8.1/generated/torch.nn.InstanceNorm2d.html#torch.nn.InstanceNorm2d) | Unsupported | +| 8 | [nn.InstanceNorm3d](https://pytorch.org/docs/1.8.1/generated/torch.nn.InstanceNorm3d.html#torch.nn.InstanceNorm3d) | Unsupported | +| 9 | [nn.LayerNorm](https://pytorch.org/docs/1.8.1/generated/torch.nn.LayerNorm.html#torch.nn.LayerNorm) | Unsupported | +| 10 | [nn.LocalResponseNorm](https://pytorch.org/docs/1.8.1/generated/torch.nn.LocalResponseNorm.html#torch.nn.LocalResponseNorm) | Unsupported | + + + +## [Recurrent Layers](https://pytorch.org/docs/1.8.1/nn.html#id1) + +| No. 
| API | Supported/Unsupported | +| ---- | ------------------------------------------------------------ | --------------------- | +| 1 | [nn.RNNBase](https://pytorch.org/docs/1.8.1/generated/torch.nn.RNNBase.html#torch.nn.RNNBase) | Unsupported | +| 2 | [nn.RNN](https://pytorch.org/docs/1.8.1/generated/torch.nn.RNN.html#torch.nn.RNN) | Unsupported | +| 3 | [nn.LSTM](https://pytorch.org/docs/1.8.1/generated/torch.nn.LSTM.html#torch.nn.LSTM) | Unsupported | +| 4 | [nn.GRU](https://pytorch.org/docs/1.8.1/generated/torch.nn.GRU.html#torch.nn.GRU) | Unsupported | +| 5 | [nn.RNNCell](https://pytorch.org/docs/1.8.1/generated/torch.nn.RNNCell.html#torch.nn.RNNCell) | Unsupported | +| 6 | [nn.LSTMCell](https://pytorch.org/docs/1.8.1/generated/torch.nn.LSTMCell.html#torch.nn.LSTMCell) | Unsupported | +| 7 | [nn.GRUCell](https://pytorch.org/docs/1.8.1/generated/torch.nn.GRUCell.html#torch.nn.GRUCell) | Unsupported | + + + +## [Transformer Layers](https://pytorch.org/docs/1.8.1/nn.html#id1) + +| No. | API | Supported/Unsupported | +| ---- | ------------------------------------------------------------ | --------------------- | +| 1 | [nn.Transformer](https://pytorch.org/docs/1.8.1/generated/torch.nn.Transformer.html#torch.nn.Transformer) | Unsupported | +| 2 | [nn.TransformerEncoder](https://pytorch.org/docs/1.8.1/generated/torch.nn.TransformerEncoder.html#torch.nn.TransformerEncoder) | Unsupported | +| 3 | [nn.TransformerDecoder](https://pytorch.org/docs/1.8.1/generated/torch.nn.TransformerDecoder.html#torch.nn.TransformerDecoder) | Unsupported | +| 4 | [nn.TransformerEncoderLayer](https://pytorch.org/docs/1.8.1/generated/torch.nn.TransformerEncoderLayer.html#torch.nn.TransformerEncoderLayer) | Unsupported | +| 5 | [nn.TransformerDecoderLayer](https://pytorch.org/docs/1.8.1/generated/torch.nn.TransformerDecoderLayer.html#torch.nn.TransformerDecoderLayer) | Unsupported | + + + +## [Linear Layers](https://pytorch.org/docs/1.8.1/nn.html#id1) + +| No. | API | Supported/Unsupported | +| ---- | ------------------------------------------------------------ | --------------------- | +| 1 | [nn.Identity](https://pytorch.org/docs/1.8.1/generated/torch.nn.Identity.html#torch.nn.Identity) | Unsupported | +| 2 | [nn.Linear](https://pytorch.org/docs/1.8.1/generated/torch.nn.Linear.html#torch.nn.Linear) | Unsupported | +| 3 | [nn.Bilinear](https://pytorch.org/docs/1.8.1/generated/torch.nn.Bilinear.html#torch.nn.Bilinear) | Unsupported | +| 4 | [nn.LazyLinear](https://pytorch.org/docs/1.8.1/generated/torch.nn.LazyLinear.html#torch.nn.LazyLinear) | Unsupported | + + + +## [Dropout Layers](https://pytorch.org/docs/1.8.1/nn.html#id1) + + + +| No. | API | Supported/Unsupported | +| ---- | ------------------------------------------------------------ | --------------------- | +| 1 | [nn.Dropout](https://pytorch.org/docs/1.8.1/generated/torch.nn.Dropout.html#torch.nn.Dropout) | Unsupported | +| 2 | [nn.Dropout2d](https://pytorch.org/docs/1.8.1/generated/torch.nn.Dropout2d.html#torch.nn.Dropout2d) | Unsupported | +| 3 | [nn.Dropout3d](https://pytorch.org/docs/1.8.1/generated/torch.nn.Dropout3d.html#torch.nn.Dropout3d) | Unsupported | +| 4 | [nn.AlphaDropout](https://pytorch.org/docs/1.8.1/generated/torch.nn.AlphaDropout.html#torch.nn.AlphaDropout) | Unsupported | + +## [Sparse Layers](https://pytorch.org/docs/1.8.1/nn.html#id1) + +| No. 
| API | Supported/Unsupported | +| ---- | ------------------------------------------------------------ | --------------------- | +| 1 | [nn.Embedding](https://pytorch.org/docs/1.8.1/generated/torch.nn.Embedding.html#torch.nn.Embedding) | Unsupported | +| 2 | [nn.EmbeddingBag](https://pytorch.org/docs/1.8.1/generated/torch.nn.EmbeddingBag.html#torch.nn.EmbeddingBag) | Unsupported | + + + +## [Distance Functions](https://pytorch.org/docs/1.8.1/nn.html#id1) + +| No. | API | Supported/Unsupported | +| ---- | ------------------------------------------------------------ | --------------------- | +| 1 | [nn.CosineSimilarity](https://pytorch.org/docs/1.8.1/generated/torch.nn.CosineSimilarity.html#torch.nn.CosineSimilarity) | Unsupported | +| 2 | [nn.PairwiseDistance](https://pytorch.org/docs/1.8.1/generated/torch.nn.PairwiseDistance.html#torch.nn.PairwiseDistance) | Unsupported | + + + +## [Loss Functions](https://pytorch.org/docs/1.8.1/nn.html#id1) + +| No. | API | Supported/Unsupported | +| ---- | ------------------------------------------------------------ | --------------------- | +| 1 | [nn.L1Loss](https://pytorch.org/docs/1.8.1/generated/torch.nn.L1Loss.html#torch.nn.L1Loss) | Unsupported | +| 2 | [nn.MSELoss](https://pytorch.org/docs/1.8.1/generated/torch.nn.MSELoss.html#torch.nn.MSELoss) | Unsupported | +| 3 | [nn.CrossEntropyLoss](https://pytorch.org/docs/1.8.1/generated/torch.nn.CrossEntropyLoss.html#torch.nn.CrossEntropyLoss) | Unsupported | +| 4 | [nn.CTCLoss](https://pytorch.org/docs/1.8.1/generated/torch.nn.CTCLoss.html#torch.nn.CTCLoss) | Unsupported | +| 5 | [nn.NLLLoss](https://pytorch.org/docs/1.8.1/generated/torch.nn.NLLLoss.html#torch.nn.NLLLoss) | Unsupported | +| 6 | [nn.PoissonNLLLoss](https://pytorch.org/docs/1.8.1/generated/torch.nn.PoissonNLLLoss.html#torch.nn.PoissonNLLLoss) | Unsupported | +| 7 | [nn.GaussianNLLLoss](https://pytorch.org/docs/1.8.1/generated/torch.nn.GaussianNLLLoss.html#torch.nn.GaussianNLLLoss) | Unsupported | +| 8 | [nn.KLDivLoss](https://pytorch.org/docs/1.8.1/generated/torch.nn.KLDivLoss.html#torch.nn.KLDivLoss) | Unsupported | +| 9 | [nn.BCELoss](https://pytorch.org/docs/1.8.1/generated/torch.nn.BCELoss.html#torch.nn.BCELoss) | Unsupported | +| 10 | [nn.BCEWithLogitsLoss](https://pytorch.org/docs/1.8.1/generated/torch.nn.BCEWithLogitsLoss.html#torch.nn.BCEWithLogitsLoss) | Unsupported | +| 11 | [nn.MarginRankingLoss](https://pytorch.org/docs/1.8.1/generated/torch.nn.MarginRankingLoss.html#torch.nn.MarginRankingLoss) | Unsupported | +| 12 | [nn.HingeEmbeddingLoss](https://pytorch.org/docs/1.8.1/generated/torch.nn.HingeEmbeddingLoss.html#torch.nn.HingeEmbeddingLoss) | Unsupported | +| 13 | [nn.MultiLabelMarginLoss](https://pytorch.org/docs/1.8.1/generated/torch.nn.MultiLabelMarginLoss.html#torch.nn.MultiLabelMarginLoss) | Unsupported | +| 14 | [nn.SmoothL1Loss](https://pytorch.org/docs/1.8.1/generated/torch.nn.SmoothL1Loss.html#torch.nn.SmoothL1Loss) | Unsupported | +| 15 | [nn.SoftMarginLoss](https://pytorch.org/docs/1.8.1/generated/torch.nn.SoftMarginLoss.html#torch.nn.SoftMarginLoss) | Unsupported | +| 16 | [nn.MultiLabelSoftMarginLoss](https://pytorch.org/docs/1.8.1/generated/torch.nn.MultiLabelSoftMarginLoss.html#torch.nn.MultiLabelSoftMarginLoss) | Unsupported | +| 17 | [nn.CosineEmbeddingLoss](https://pytorch.org/docs/1.8.1/generated/torch.nn.CosineEmbeddingLoss.html#torch.nn.CosineEmbeddingLoss) | Unsupported | +| 18 | [nn.MultiMarginLoss](https://pytorch.org/docs/1.8.1/generated/torch.nn.MultiMarginLoss.html#torch.nn.MultiMarginLoss) | 
Unsupported | +| 19 | [nn.TripletMarginLoss](https://pytorch.org/docs/1.8.1/generated/torch.nn.TripletMarginLoss.html#torch.nn.TripletMarginLoss) | Unsupported | +| 20 | [nn.TripletMarginWithDistanceLoss](https://pytorch.org/docs/1.8.1/generated/torch.nn.TripletMarginWithDistanceLoss.html#torch.nn.TripletMarginWithDistanceLoss) | Unsupported | + +## [Vision Layers](https://pytorch.org/docs/1.8.1/nn.html#id1) + +| No. | API | Supported/Unsupported | +| ---- | ------------------------------------------------------------ | --------------------- | +| 1 | [nn.PixelShuffle](https://pytorch.org/docs/1.8.1/generated/torch.nn.PixelShuffle.html#torch.nn.PixelShuffle) | Unsupported | +| 2 | [nn.PixelUnshuffle](https://pytorch.org/docs/1.8.1/generated/torch.nn.PixelUnshuffle.html#torch.nn.PixelUnshuffle) | Unsupported | +| 3 | [nn.Upsample](https://pytorch.org/docs/1.8.1/generated/torch.nn.Upsample.html#torch.nn.Upsample) | Unsupported | +| 4 | [nn.UpsamplingNearest2d](https://pytorch.org/docs/1.8.1/generated/torch.nn.UpsamplingNearest2d.html#torch.nn.UpsamplingNearest2d) | Unsupported | +| 5 | [nn.UpsamplingBilinear2d](https://pytorch.org/docs/1.8.1/generated/torch.nn.UpsamplingBilinear2d.html#torch.nn.UpsamplingBilinear2d) | Unsupported | + + + +## [Shuffle Layers](https://pytorch.org/docs/1.8.1/nn.html#id1) + +| No. | API | Supported/Unsupported | +| ---- | ------------------------------------------------------------ | --------------------- | +| 1 | [nn.ChannelShuffle](https://pytorch.org/docs/1.8.1/generated/torch.nn.ChannelShuffle.html#torch.nn.ChannelShuffle) | Unsupported | + + + +## [DataParallel Layers (Multi-GPU, Distributed)](https://pytorch.org/docs/1.8.1/nn.html#id1) + +| No. | API | Supported/Unsupported | +| ---- | ------------------------------------------------------------ | --------------------- | +| 1 | [nn.DataParallel](https://pytorch.org/docs/1.8.1/generated/torch.nn.DataParallel.html#torch.nn.DataParallel) | Unsupported | +| 2 | [nn.parallel.DistributedDataParallel](https://pytorch.org/docs/1.8.1/generated/torch.nn.parallel.DistributedDataParallel.html#torch.nn.parallel.DistributedDataParallel) | Unsupported | + +## [Utilities](https://pytorch.org/docs/1.8.1/nn.html#id1) + + + +From the `torch.nn.utils` module + +| No. 
| API | Supported/Unsupported | +| ---- | ------------------------------------------------------------ | --------------------- | +| 1 | [clip_grad_norm_](https://pytorch.org/docs/1.8.1/generated/torch.nn.utils.clip_grad_norm_.html#torch.nn.utils.clip_grad_norm_) | Unsupported | +| 2 | [clip_grad_value_](https://pytorch.org/docs/1.8.1/generated/torch.nn.utils.clip_grad_value_.html#torch.nn.utils.clip_grad_value_) | Unsupported | +| 3 | [parameters_to_vector](https://pytorch.org/docs/1.8.1/generated/torch.nn.utils.parameters_to_vector.html#torch.nn.utils.parameters_to_vector) | Unsupported | +| 4 | [vector_to_parameters](https://pytorch.org/docs/1.8.1/generated/torch.nn.utils.vector_to_parameters.html#torch.nn.utils.vector_to_parameters) | Unsupported | +| 5 | [prune.BasePruningMethod](https://pytorch.org/docs/1.8.1/generated/torch.nn.utils.prune.BasePruningMethod.html#torch.nn.utils.prune.BasePruningMethod) | Unsupported | +| 6 | [prune.PruningContainer](https://pytorch.org/docs/1.8.1/generated/torch.nn.utils.prune.PruningContainer.html#torch.nn.utils.prune.PruningContainer) | Unsupported | +| 7 | [prune.Identity](https://pytorch.org/docs/1.8.1/generated/torch.nn.utils.prune.Identity.html#torch.nn.utils.prune.Identity) | Unsupported | +| 8 | [prune.RandomUnstructured](https://pytorch.org/docs/1.8.1/generated/torch.nn.utils.prune.RandomUnstructured.html#torch.nn.utils.prune.RandomUnstructured) | Unsupported | +| 9 | [prune.L1Unstructured](https://pytorch.org/docs/1.8.1/generated/torch.nn.utils.prune.L1Unstructured.html#torch.nn.utils.prune.L1Unstructured) | Unsupported | +| 10 | [prune.RandomStructured](https://pytorch.org/docs/1.8.1/generated/torch.nn.utils.prune.RandomStructured.html#torch.nn.utils.prune.RandomStructured) | Unsupported | +| 11 | [prune.LnStructured](https://pytorch.org/docs/1.8.1/generated/torch.nn.utils.prune.LnStructured.html#torch.nn.utils.prune.LnStructured) | Unsupported | +| 12 | [prune.CustomFromMask](https://pytorch.org/docs/1.8.1/generated/torch.nn.utils.prune.CustomFromMask.html#torch.nn.utils.prune.CustomFromMask) | Unsupported | +| 13 | [prune.identity](https://pytorch.org/docs/1.8.1/generated/torch.nn.utils.prune.identity.html#torch.nn.utils.prune.identity) | Unsupported | +| 14 | [prune.random_unstructured](https://pytorch.org/docs/1.8.1/generated/torch.nn.utils.prune.random_unstructured.html#torch.nn.utils.prune.random_unstructured) | Unsupported | +| 15 | [prune.l1_unstructured](https://pytorch.org/docs/1.8.1/generated/torch.nn.utils.prune.l1_unstructured.html#torch.nn.utils.prune.l1_unstructured) | Unsupported | +| 16 | [prune.random_structured](https://pytorch.org/docs/1.8.1/generated/torch.nn.utils.prune.random_structured.html#torch.nn.utils.prune.random_structured) | Unsupported | +| 17 | [prune.ln_structured](https://pytorch.org/docs/1.8.1/generated/torch.nn.utils.prune.ln_structured.html#torch.nn.utils.prune.ln_structured) | Unsupported | +| 18 | [prune.global_unstructured](https://pytorch.org/docs/1.8.1/generated/torch.nn.utils.prune.global_unstructured.html#torch.nn.utils.prune.global_unstructured) | Unsupported | +| 19 | [prune.custom_from_mask](https://pytorch.org/docs/1.8.1/generated/torch.nn.utils.prune.custom_from_mask.html#torch.nn.utils.prune.custom_from_mask) | Unsupported | +| 20 | [prune.remove](https://pytorch.org/docs/1.8.1/generated/torch.nn.utils.prune.remove.html#torch.nn.utils.prune.remove) | Unsupported | +| 21 | 
[prune.is_pruned](https://pytorch.org/docs/1.8.1/generated/torch.nn.utils.prune.is_pruned.html#torch.nn.utils.prune.is_pruned) | Unsupported | +| 22 | [weight_norm](https://pytorch.org/docs/1.8.1/generated/torch.nn.utils.weight_norm.html#torch.nn.utils.weight_norm) | Unsupported | +| 23 | [remove_weight_norm](https://pytorch.org/docs/1.8.1/generated/torch.nn.utils.remove_weight_norm.html#torch.nn.utils.remove_weight_norm) | Unsupported | +| 24 | [spectral_norm](https://pytorch.org/docs/1.8.1/generated/torch.nn.utils.spectral_norm.html#torch.nn.utils.spectral_norm) | Unsupported | +| 25 | [remove_spectral_norm](https://pytorch.org/docs/1.8.1/generated/torch.nn.utils.remove_spectral_norm.html#torch.nn.utils.remove_spectral_norm) | Unsupported | + + + +### Utility Functions in Other Modules + +| No. | API | Supported/Unsupported | +| ---- | ------------------------------------------------------------ | --------------------- | +| 1 | [nn.utils.rnn.PackedSequence](https://pytorch.org/docs/1.8.1/generated/torch.nn.utils.rnn.PackedSequence.html#torch.nn.utils.rnn.PackedSequence) | Unsupported | +| 2 | [nn.utils.rnn.pack_padded_sequence](https://pytorch.org/docs/1.8.1/generated/torch.nn.utils.rnn.pack_padded_sequence.html#torch.nn.utils.rnn.pack_padded_sequence) | Unsupported | +| 3 | [nn.utils.rnn.pad_packed_sequence](https://pytorch.org/docs/1.8.1/generated/torch.nn.utils.rnn.pad_packed_sequence.html#torch.nn.utils.rnn.pad_packed_sequence) | Unsupported | +| 4 | [nn.utils.rnn.pad_sequence](https://pytorch.org/docs/1.8.1/generated/torch.nn.utils.rnn.pad_sequence.html#torch.nn.utils.rnn.pad_sequence) | Unsupported | +| 5 | [nn.utils.rnn.pack_sequence](https://pytorch.org/docs/1.8.1/generated/torch.nn.utils.rnn.pack_sequence.html#torch.nn.utils.rnn.pack_sequence) | Unsupported | +| 6 | [nn.Flatten](https://pytorch.org/docs/1.8.1/generated/torch.nn.Flatten.html#torch.nn.Flatten) | Unsupported | +| 7 | [nn.Unflatten](https://pytorch.org/docs/1.8.1/generated/torch.nn.Unflatten.html#torch.nn.Unflatten) | Unsupported | + +### Lazy Modules Initialization + +| No. | API | Supported/Unsupported | +| ---- | ------------------------------------------------------------ | --------------------- | +| 1 | [nn.modules.lazy.LazyModuleMixin](https://pytorch.org/docs/1.8.1/generated/torch.nn.modules.lazy.LazyModuleMixin.html#torch.nn.modules.lazy.LazyModuleMixin) | Unsupported | + + + + + + + + + + + +# Functions(torch.nn.functional) + +## [Convolution Functions](https://pytorch.org/docs/1.8.1/nn.functional.html#convolution-functions) + +| No. 
| API | Supported/Unsupported | +| ---- | ------------------------------------------------------------ | --------------------- | +| 1 | [conv1d](https://pytorch.org/docs/1.8.1/nn.functional.html#conv1d) | Unsupported | +| 2 | [conv2d](https://pytorch.org/docs/1.8.1/nn.functional.html#conv2d) | Unsupported | +| 3 | [conv3d](https://pytorch.org/docs/1.8.1/nn.functional.html#conv3d) | Unsupported | +| 4 | [conv_transpose1d](https://pytorch.org/docs/1.8.1/nn.functional.html#conv-transpose1d) | Unsupported | +| 5 | [conv_transpose2d](https://pytorch.org/docs/1.8.1/nn.functional.html#conv-transpose2d) | Unsupported | +| 6 | [conv_transpose3d](https://pytorch.org/docs/1.8.1/nn.functional.html#conv-transpose3d) | Unsupported | +| 7 | [unfold](https://pytorch.org/docs/1.8.1/nn.functional.html#unfold) | Unsupported | +| 8 | [fold](https://pytorch.org/docs/1.8.1/nn.functional.html#fold) | Unsupported | + +## [Pooling Functions](https://pytorch.org/docs/1.8.1/nn.functional.html#pooling-functions) + +| No. | API | Supported/Unsupported | +| ---- | ------------------------------------------------------------ | --------------------- | +| 1 | [avg_pool1d](https://pytorch.org/docs/1.8.1/nn.functional.html#avg-pool1d) | Unsupported | +| 2 | [avg_pool2d](https://pytorch.org/docs/1.8.1/nn.functional.html#avg-pool2d) | Unsupported | +| 3 | [avg_pool3d](https://pytorch.org/docs/1.8.1/nn.functional.html#avg-pool3d) | Unsupported | +| 4 | [max_pool1d](https://pytorch.org/docs/1.8.1/nn.functional.html#max-pool1d) | Unsupported | +| 5 | [max_pool2d](https://pytorch.org/docs/1.8.1/nn.functional.html#max-pool2d) | Unsupported | +| 6 | [max_pool3d](https://pytorch.org/docs/1.8.1/nn.functional.html#max-pool3d) | Unsupported | +| 7 | [max_unpool1d](https://pytorch.org/docs/1.8.1/nn.functional.html#max-unpool1d) | Unsupported | +| 8 | [max_unpool2d](https://pytorch.org/docs/1.8.1/nn.functional.html#max-unpool2d) | Unsupported | +| 9 | [max_unpool3d](https://pytorch.org/docs/1.8.1/nn.functional.html#max-unpool3d) | Unsupported | +| 10 | [lp_pool1d](https://pytorch.org/docs/1.8.1/nn.functional.html#lp-pool1d) | Unsupported | +| 11 | [lp_pool2d](https://pytorch.org/docs/1.8.1/nn.functional.html#lp-pool2d) | Unsupported | +| 12 | [adaptive_max_pool1d](https://pytorch.org/docs/1.8.1/nn.functional.html#adaptive-max-pool1d) | Unsupported | +| 13 | [adaptive_max_pool2d](https://pytorch.org/docs/1.8.1/nn.functional.html#adaptive-max-pool2d) | Unsupported | +| 14 | [adaptive_max_pool3d](https://pytorch.org/docs/1.8.1/nn.functional.html#adaptive-max-pool3d) | Unsupported | +| 15 | [adaptive_avg_pool1d](https://pytorch.org/docs/1.8.1/nn.functional.html#adaptive-avg-pool1d) | Unsupported | +| 16 | [adaptive_avg_pool2d](https://pytorch.org/docs/1.8.1/nn.functional.html#adaptive-avg-pool2d) | Unsupported | +| 17 | [adaptive_avg_pool3d](https://pytorch.org/docs/1.8.1/nn.functional.html#adaptive-avg-pool3d) | Unsupported | + +## [Non-Linear Activation Functions](https://pytorch.org/docs/1.8.1/nn.functional.html#non-linear-activation-functions) + +| No. 
| API | Supported/Unsupported | +| ---- | ------------------------------------------------------------ | --------------------- | +| 1 | [threshold](https://pytorch.org/docs/1.8.1/nn.functional.html#threshold) | Unsupported | +| 2 | [relu](https://pytorch.org/docs/1.8.1/nn.functional.html#relu) | Unsupported | +| 3 | [hardtanh](https://pytorch.org/docs/1.8.1/nn.functional.html#hardtanh) | Unsupported | +| 4 | [hardswish](https://pytorch.org/docs/1.8.1/nn.functional.html#hardswish) | Unsupported | +| 5 | [relu6](https://pytorch.org/docs/1.8.1/nn.functional.html#relu6) | Unsupported | +| 6 | [elu](https://pytorch.org/docs/1.8.1/nn.functional.html#elu) | Unsupported | +| 7 | [selu](https://pytorch.org/docs/1.8.1/nn.functional.html#selu) | Unsupported | +| 8 | [celu](https://pytorch.org/docs/1.8.1/nn.functional.html#celu) | Unsupported | +| 9 | [leaky_relu](https://pytorch.org/docs/1.8.1/nn.functional.html#leaky-relu) | Unsupported | +| 10 | [prelu](https://pytorch.org/docs/1.8.1/nn.functional.html#prelu) | Unsupported | +| 11 | [rrelu](https://pytorch.org/docs/1.8.1/nn.functional.html#rrelu) | Unsupported | +| 12 | [glu](https://pytorch.org/docs/1.8.1/nn.functional.html#glu) | Unsupported | +| 13 | [gelu](https://pytorch.org/docs/1.8.1/nn.functional.html#gelu) | Unsupported | +| 14 | [logsigmoid](https://pytorch.org/docs/1.8.1/nn.functional.html#logsigmoid) | Unsupported | +| 15 | [hardshrink](https://pytorch.org/docs/1.8.1/nn.functional.html#hardshrink) | Unsupported | +| 16 | [tanhshrink](https://pytorch.org/docs/1.8.1/nn.functional.html#tanhshrink) | Unsupported | +| 17 | [softsign](https://pytorch.org/docs/1.8.1/nn.functional.html#softsign) | Unsupported | +| 18 | [softplus](https://pytorch.org/docs/1.8.1/nn.functional.html#softplus) | Unsupported | +| 19 | [softmin](https://pytorch.org/docs/1.8.1/nn.functional.html#softmin) | Unsupported | +| 20 | [softmax](https://pytorch.org/docs/1.8.1/nn.functional.html#softmax) | Unsupported | +| 21 | [softshrink](https://pytorch.org/docs/1.8.1/nn.functional.html#softshrink) | Unsupported | +| 22 | [gumbel_softmax](https://pytorch.org/docs/1.8.1/nn.functional.html#gumbel-softmax) | Unsupported | +| 23 | [log_softmax](https://pytorch.org/docs/1.8.1/nn.functional.html#log-softmax) | Unsupported | +| 24 | [tanh](https://pytorch.org/docs/1.8.1/nn.functional.html#tanh) | Unsupported | +| 25 | [sigmoid](https://pytorch.org/docs/1.8.1/nn.functional.html#sigmoid) | Unsupported | +| 26 | [hardsigmoid](https://pytorch.org/docs/1.8.1/nn.functional.html#hardsigmoid) | Unsupported | +| 27 | [silu](https://pytorch.org/docs/1.8.1/nn.functional.html#silu) | Unsupported | + +## [Normalization Functions](https://pytorch.org/docs/1.8.1/nn.functional.html#normalization-functions) + +| No. | API | Supported/Unsupported | +| ---- | ------------------------------------------------------------ | --------------------- | +| 1 | [batch_norm](https://pytorch.org/docs/1.8.1/nn.functional.html#batch-norm) | Unsupported | +| 2 | [instance_norm](https://pytorch.org/docs/1.8.1/nn.functional.html#instance-norm) | Unsupported | +| 3 | [layer_norm](https://pytorch.org/docs/1.8.1/nn.functional.html#layer-norm) | Unsupported | +| 4 | [local_response_norm](https://pytorch.org/docs/1.8.1/nn.functional.html#local-response-norm) | Unsupported | +| 5 | [normalize](https://pytorch.org/docs/1.8.1/nn.functional.html#normalize) | Unsupported | + +## [Linear Functions](https://pytorch.org/docs/1.8.1/nn.functional.html#linear-functions) + +| No. 
| API | Supported/Unsupported | +| ---- | ------------------------------------------------------------ | --------------------- | +| 1 | [linear](https://pytorch.org/docs/1.8.1/nn.functional.html#linear) | Unsupported | +| 2 | [bilinear](https://pytorch.org/docs/1.8.1/nn.functional.html#bilinear) | Unsupported | + +## [Dropout Functions](https://pytorch.org/docs/1.8.1/nn.functional.html#dropout-functions) + +| No. | API | Supported/Unsupported | +| ---- | ------------------------------------------------------------ | --------------------- | +| 1 | [dropout](https://pytorch.org/docs/1.8.1/nn.functional.html#dropout) | Unsupported | +| 2 | [alpha_dropout](https://pytorch.org/docs/1.8.1/nn.functional.html#alpha-dropout) | Unsupported | +| 3 | [feature_alpha_dropout](https://pytorch.org/docs/1.8.1/nn.functional.html#feature-alpha-dropout) | Unsupported | +| 4 | [dropout2d](https://pytorch.org/docs/1.8.1/nn.functional.html#dropout2d) | Unsupported | +| 5 | [dropout3d](https://pytorch.org/docs/1.8.1/nn.functional.html#dropout3d) | Unsupported | + +## [Sparse Functions](https://pytorch.org/docs/1.8.1/nn.functional.html#sparse-functions) + +| No. | API | Supported/Unsupported | +| ---- | ------------------------------------------------------------ | --------------------- | +| 1 | [embedding](https://pytorch.org/docs/1.8.1/nn.functional.html#embedding) | Unsupported | +| 2 | [embedding_bag](https://pytorch.org/docs/1.8.1/nn.functional.html#embedding-bag) | Unsupported | +| 3 | [one_hot](https://pytorch.org/docs/1.8.1/nn.functional.html#one-hot) | Unsupported | + +## [Distance Functions](https://pytorch.org/docs/1.8.1/nn.functional.html#distance-functions) + +| No. | API | Supported/Unsupported | +| ---- | ------------------------------------------------------------ | --------------------- | +| 1 | [pairwise_distance](https://pytorch.org/docs/1.8.1/nn.functional.html#pairwise-distance) | Unsupported | +| 2 | [cosine_similarity](https://pytorch.org/docs/1.8.1/nn.functional.html#cosine-similarity) | Unsupported | +| 3 | [pdist](https://pytorch.org/docs/1.8.1/nn.functional.html#pdist) | Unsupported | + +## [Loss Functions](https://pytorch.org/docs/1.8.1/nn.functional.html#loss-functions) + +| No. 
| API | Supported/Unsupported | +| ---- | ------------------------------------------------------------ | --------------------- | +| 1 | [binary_cross_entropy](https://pytorch.org/docs/1.8.1/nn.functional.html#binary-cross-entropy) | Unsupported | +| 2 | [binary_cross_entropy_with_logits](https://pytorch.org/docs/1.8.1/nn.functional.html#binary-cross-entropy-with-logits) | Unsupported | +| 3 | [poisson_nll_loss](https://pytorch.org/docs/1.8.1/nn.functional.html#poisson-nll-loss) | Unsupported | +| 4 | [cosine_embedding_loss](https://pytorch.org/docs/1.8.1/nn.functional.html#cosine-embedding-loss) | Unsupported | +| 5 | [cross_entropy](https://pytorch.org/docs/1.8.1/nn.functional.html#cross-entropy) | Unsupported | +| 6 | [ctc_loss](https://pytorch.org/docs/1.8.1/nn.functional.html#ctc-loss) | Unsupported | +| 7 | [hinge_embedding_loss](https://pytorch.org/docs/1.8.1/nn.functional.html#hinge-embedding-loss) | Unsupported | +| 8 | [kl_div](https://pytorch.org/docs/1.8.1/nn.functional.html#kl-div) | Unsupported | +| 9 | [l1_loss](https://pytorch.org/docs/1.8.1/nn.functional.html#l1-loss) | Unsupported | +| 10 | [mse_loss](https://pytorch.org/docs/1.8.1/nn.functional.html#mse-loss) | Unsupported | +| 11 | [margin_ranking_loss](https://pytorch.org/docs/1.8.1/nn.functional.html#margin-ranking-loss) | Unsupported | +| 12 | [multilabel_margin_loss](https://pytorch.org/docs/1.8.1/nn.functional.html#multilabel-margin-loss) | Unsupported | +| 13 | [multilabel_soft_margin_loss](https://pytorch.org/docs/1.8.1/nn.functional.html#multilabel-soft-margin-loss) | Unsupported | +| 14 | [multi_margin_loss](https://pytorch.org/docs/1.8.1/nn.functional.html#multi-margin-loss) | Unsupported | +| 15 | [nll_loss](https://pytorch.org/docs/1.8.1/nn.functional.html#nll-loss) | Unsupported | +| 16 | [smooth_l1_loss](https://pytorch.org/docs/1.8.1/nn.functional.html#smooth-l1-loss) | Unsupported | +| 17 | [soft_margin_loss](https://pytorch.org/docs/1.8.1/nn.functional.html#soft-margin-loss) | Unsupported | +| 18 | [triplet_margin_loss](https://pytorch.org/docs/1.8.1/nn.functional.html#triplet-margin-loss) | Unsupported | +| 19 | [triplet_margin_with_distance_loss](https://pytorch.org/docs/1.8.1/nn.functional.html#triplet-margin-with-distance-loss) | Unsupported | + +## [Vision Functions](https://pytorch.org/docs/1.8.1/nn.functional.html#vision-functions) + +| No. 
| API | Supported/Unsupported | +| ---- | ------------------------------------------------------------ | --------------------- | +| 1 | [pixel_shuffle](https://pytorch.org/docs/1.8.1/nn.functional.html#pixel-shuffle) | Unsupported | +| 2 | [pixel_unshuffle](https://pytorch.org/docs/1.8.1/nn.functional.html#pixel-unshuffle) | Unsupported | +| 3 | [pad](https://pytorch.org/docs/1.8.1/nn.functional.html#pad) | Unsupported | +| 4 | [interpolate](https://pytorch.org/docs/1.8.1/nn.functional.html#interpolate) | Unsupported | +| 5 | [upsample](https://pytorch.org/docs/1.8.1/nn.functional.html#upsample) | Unsupported | +| 6 | [upsample_nearest](https://pytorch.org/docs/1.8.1/nn.functional.html#upsample-nearest) | Unsupported | +| 7 | [upsample_bilinear](https://pytorch.org/docs/1.8.1/nn.functional.html#upsample-bilinear) | Unsupported | +| 8 | [grid_sample](https://pytorch.org/docs/1.8.1/nn.functional.html#grid-sample) | Unsupported | +| 9 | [affine_grid](https://pytorch.org/docs/1.8.1/nn.functional.html#affine-grid) | Unsupported | + +## [Data Parallel Functions (Multi-GPU, Distributed)](https://pytorch.org/docs/1.8.1/nn.functional.html#dataparallel-functions-multi-gpu-distributed) + +| No. | API | Supported/Unsupported | +| ---- | ------------------------------------------------------------ | --------------------- | +| 1 | [data_parallel](https://pytorch.org/docs/1.8.1/nn.functional.html#data-parallel) | Unsupported | + +# [torch.distributed](https://pytorch.org/docs/1.8.1/distributed.html) + +| No. | API | Supported/Unsupported | +| ---- | ----------------------------------------- | --------------------- | +| 1 | torch.distributed.is_available | Unsupported | +| 2 | torch.distributed.init_process_group | Unsupported | +| 3 | torch.distributed.Backend | Unsupported | +| 4 | torch.distributed.get_backend | Unsupported | +| 5 | torch.distributed.get_rank | Unsupported | +| 6 | torch.distributed.get_world_size | Unsupported | +| 7 | torch.distributed.is_initialized | Unsupported | +| 8 | torch.distributed.is_mpi_available | Unsupported | +| 9 | torch.distributed.is_nccl_available | Unsupported | +| 10 | torch.distributed.Store | Unsupported | +| 11 | torch.distributed.TCPStore | Unsupported | +| 12 | torch.distributed.HashStore | Unsupported | +| 13 | torch.distributed.FileStore | Unsupported | +| 14 | torch.distributed.PrefixStore | Unsupported | +| 15 | torch.distributed.Store.set | Unsupported | +| 16 | torch.distributed.Store.get | Unsupported | +| 17 | torch.distributed.Store.add | Unsupported | +| 18 | torch.distributed.Store.wait | Unsupported | +| 19 | torch.distributed.Store.num_keys | Unsupported | +| 20 | torch.distributed.Store.delete_key | Unsupported | +| 21 | torch.distributed.Store.set_timeout | Unsupported | +| 22 | torch.distributed.new_group | Unsupported | +| 23 | torch.distributed.send | Unsupported | +| 24 | torch.distributed.recv | Unsupported | +| 25 | torch.distributed.isend | Unsupported | +| 26 | torch.distributed.irecv | Unsupported | +| 27 | is_completed | Unsupported | +| 28 | wait | Unsupported | +| 29 | torch.distributed.broadcast | Unsupported | +| 30 | torch.distributed.broadcast_object_list | Unsupported | +| 31 | torch.distributed.all_reduce | Unsupported | +| 32 | torch.distributed.reduce | Unsupported | +| 33 | torch.distributed.all_gather | Unsupported | +| 34 | torch.distributed.all_gather_object | Unsupported | +| 35 | torch.distributed.gather | Unsupported | +| 36 | torch.distributed.gather_object | Unsupported | +| 37 | torch.distributed.scatter | 
Unsupported | +| 38 | torch.distributed.scatter_object_list | Unsupported | +| 39 | torch.distributed.reduce_scatter | Unsupported | +| 40 | torch.distributed.all_to_all | Unsupported | +| 41 | torch.distributed.barrier | Unsupported | +| 42 | torch.distributed.ReduceOp | Unsupported | +| 43 | torch.distributed.reduce_op | Unsupported | +| 44 | torch.distributed.broadcast_multigpu | Unsupported | +| 45 | torch.distributed.all_reduce_multigpu | Unsupported | +| 46 | torch.distributed.reduce_multigpu | Unsupported | +| 47 | torch.distributed.all_gather_multigpu | Unsupported | +| 48 | torch.distributed.reduce_scatter_multigpu | Unsupported | +| 49 | torch.distributed.launch | Unsupported | +| 50 | torch.multiprocessing.spawn | Unsupported | + +# torch.npu + +| No. | API | NPU API | Supported/Unsupported | +| ---- | ------------------------------------- | :----------------------------------- | --------------------- | +| 1 | torch.cuda.current_blas_handle | torch.npu.current_blas_handle | Unsupported | +| 2 | torch.cuda.current_device | torch.npu.current_device | Supported | +| 3 | torch.cuda.current_stream | torch.npu.current_stream | Supported | +| 4 | torch.cuda.default_stream | torch.npu.default_stream | Supported | +| 5 | torch.cuda.device | torch.npu.device | Unsupported | +| 6 | torch.cuda.device_count | torch.npu.device_count | Supported | +| 7 | torch.cuda.device_of | torch.npu.device_of | Unsupported | +| 8 | torch.cuda.get_device_capability | torch.npu.get_device_capability | Unsupported | +| 9 | torch.cuda.get_device_name | torch.npu.get_device_name | Unsupported | +| 10 | torch.cuda.init | torch.npu.init | Supported | +| 11 | torch.cuda.ipc_collect | torch.npu.ipc_collect | Unsupported | +| 12 | torch.cuda.is_available | torch.npu.is_available | Supported | +| 13 | torch.cuda.is_initialized | torch.npu.is_initialized | Supported | +| 14 | torch.cuda.set_device | torch.npu.set_device | Partially supported | +| 15 | torch.cuda.stream | torch.npu.stream | Supported | +| 16 | torch.cuda.synchronize | torch.npu.synchronize | Supported | +| 17 | torch.cuda.get_rng_state | torch.npu.get_rng_state | Unsupported | +| 18 | torch.cuda.get_rng_state_all | torch.npu.get_rng_state_all | Unsupported | +| 19 | torch.cuda.set_rng_state | torch.npu.set_rng_state | Unsupported | +| 20 | torch.cuda.set_rng_state_all | torch.npu.set_rng_state_all | Unsupported | +| 21 | torch.cuda.manual_seed | torch.npu.manual_seed | Unsupported | +| 22 | torch.cuda.manual_seed_all | torch.npu.manual_seed_all | Unsupported | +| 23 | torch.cuda.seed | torch.npu.seed | Unsupported | +| 24 | torch.cuda.seed_all | torch.npu.seed_all | Unsupported | +| 25 | torch.cuda.initial_seed | torch.npu.initial_seed | Unsupported | +| 26 | torch.cuda.comm.broadcast | torch.npu.comm.broadcast | Unsupported | +| 27 | torch.cuda.comm.broadcast_coalesced | torch.npu.comm.broadcast_coalesced | Unsupported | +| 28 | torch.cuda.comm.reduce_add | torch.npu.comm.reduce_add | Unsupported | +| 29 | torch.cuda.comm.scatter | torch.npu.comm.scatter | Unsupported | +| 30 | torch.cuda.comm.gather | torch.npu.comm.gather | Unsupported | +| 31 | torch.cuda.Stream | torch.npu.Stream | Supported | +| 32 | torch.cuda.Stream.query | torch.npu.Stream.query | Unsupported | +| 33 | torch.cuda.Stream.record_event | torch.npu.Stream.record_event | Supported | +| 34 | torch.cuda.Stream.synchronize | torch.npu.Stream.synchronize | Supported | +| 35 | torch.cuda.Stream.wait_event | torch.npu.Stream.wait_event | Supported | +| 36 | torch.cuda.Stream.wait_stream 
| torch.npu.Stream.wait_stream | Supported | +| 37 | torch.cuda.Event | torch.npu.Event | Supported | +| 38 | torch.cuda.Event.elapsed_time | torch.npu.Event.elapsed_time | Supported | +| 39 | torch.cuda.Event.from_ipc_handle | torch.npu.Event.from_ipc_handle | Unsupported | +| 40 | torch.cuda.Event.ipc_handle | torch.npu.Event.ipc_handle | Unsupported | +| 41 | torch.cuda.Event.query | torch.npu.Event.query | Supported | +| 42 | torch.cuda.Event.record | torch.npu.Event.record | Supported | +| 43 | torch.cuda.Event.synchronize | torch.npu.Event.synchronize | Supported | +| 44 | torch.cuda.Event.wait | torch.npu.Event.wait | Supported | +| 45 | torch.cuda.empty_cache | torch.npu.empty_cache | Supported | +| 46 | torch.cuda.memory_stats | torch.npu.memory_stats | Supported | +| 47 | torch.cuda.memory_summary | torch.npu.memory_summary | Supported | +| 48 | torch.cuda.memory_snapshot | torch.npu.memory_snapshot | Supported | +| 49 | torch.cuda.memory_allocated | torch.npu.memory_allocated | Supported | +| 50 | torch.cuda.max_memory_allocated | torch.npu.max_memory_allocated | Supported | +| 51 | torch.cuda.reset_max_memory_allocated | torch.npu.reset_max_memory_allocated | Supported | +| 52 | torch.cuda.memory_reserved | torch.npu.memory_reserved | Supported | +| 53 | torch.cuda.max_memory_reserved | torch.npu.max_memory_reserved | Supported | +| 54 | torch.cuda.memory_cached | torch.npu.memory_cached | Supported | +| 55 | torch.cuda.max_memory_cached | torch.npu.max_memory_cached | Supported | +| 56 | torch.cuda.reset_max_memory_cached | torch.npu.reset_max_memory_cached | Supported | +| 57 | torch.cuda.nvtx.mark | torch.npu.nvtx.mark | Unsupported | +| 58 | torch.cuda.nvtx.range_push | torch.npu.nvtx.range_push | Unsupported | +| 59 | torch.cuda.nvtx.range_pop | torch.npu.nvtx.range_pop | Unsupported | +| 60 | torch.cuda._sleep | torch.npu._sleep | Unsupported | +| 61 | torch.cuda.Stream.priority_range | torch.npu.Stream.priority_range | Unsupported | +| 62 | torch.cuda.get_device_properties | torch.npu.get_device_properties | Unsupported | +| 63 | torch.cuda.amp.GradScaler | torch.npu.amp.GradScaler | Unsupported | + +# NPU Custom Operators + +| No. 
| Operator |
+| ---- | ---------------------------------------------- |
+| 1 | npu_convolution_transpose |
+| 2 | npu_conv_transpose2d |
+| 3 | npu_convolution_transpose_backward |
+| 4 | npu_conv_transpose2d_backward |
+| 5 | npu_conv_transpose3d_backward |
+| 6 | npu_convolution |
+| 7 | npu_convolution_backward |
+| 8 | npu_convolution_double_backward |
+| 9 | npu_conv2d |
+| 10 | npu_conv2d.out |
+| 11 | npu_conv2d_backward |
+| 12 | npu_conv3d |
+| 13 | npu_conv3d.out |
+| 14 | npu_conv3d_backward |
+| 15 | one_ |
+| 16 | npu_sort_v2.out |
+| 17 | npu_sort_v2 |
+| 18 | npu_format_cast |
+| 19 | npu_format_cast_.acl_format |
+| 20 | npu_format_cast_.src |
+| 21 | npu_transpose_to_contiguous |
+| 22 | npu_transpose |
+| 23 | npu_transpose.out |
+| 24 | npu_broadcast |
+| 25 | npu_broadcast.out |
+| 26 | npu_dtype_cast |
+| 27 | npu_dtype_cast_.Tensor |
+| 28 | npu_roi_alignbk |
+| 29 | empty_with_format |
+| 30 | empty_with_format.names |
+| 31 | copy_memory_ |
+| 32 | npu_one_hot |
+| 33 | npu_stride_add |
+| 34 | npu_softmax_cross_entropy_with_logits |
+| 35 | npu_softmax_cross_entropy_with_logits_backward |
+| 36 | npu_ps_roi_pooling |
+| 37 | npu_ps_roi_pooling_backward |
+| 38 | npu_roi_align |
+| 39 | npu_nms_v4 |
+| 40 | npu_lstm |
+| 41 | npu_lstm_backward |
+| 42 | npu_iou |
+| 43 | npu_ptiou |
+| 44 | npu_nms_with_mask |
+| 45 | npu_pad |
+| 46 | npu_bounding_box_encode |
+| 47 | npu_bounding_box_decode |
+| 48 | npu_gru |
+| 49 | npu_gru_backward |
+| 50 | npu_set_.source_Storage_storage_offset_format |
+| 51 | npu_random_choice_with_mask |
+| 52 | npu_batch_nms |
+| 53 | npu_slice |
+| 54 | npu_slice.out |
+| 55 | npu_dropoutV2 |
+| 56 | npu_dropoutV2_backward |
+| 57 | _npu_dropout |
+| 58 | _npu_dropout_inplace |
+| 59 | npu_dropout_backward |
+| 60 | npu_indexing |
+| 61 | npu_indexing.out |
+| 62 | npu_ifmr |
+| 63 | npu_max.dim |
+| 64 | npu_max.names_dim |
+| 65 | npu_scatter |
+| 66 | npu_max_backward |
+| 67 | npu_apply_adam |
+| 68 | npu_layer_norm_eval |
+| 69 | npu_alloc_float_status |
+| 70 | npu_get_float_status |
+| 71 | npu_clear_float_status |
+| 72 | npu_confusion_transpose |
+| 73 | npu_confusion_transpose_backward |
+| 74 | npu_bmmV2 |
+| 75 | fast_gelu |
+| 76 | fast_gelu_backward |
+| 77 | npu_sub_sample |
+| 78 | npu_deformable_conv2d |
+| 79 | npu_deformable_conv2dbk |
+| 80 | npu_mish |
+| 81 | npu_anchor_response_flags |
+| 82 | npu_yolo_boxes_encode |
+| 83 | npu_grid_assign_positive |
+| 84 | npu_mish_backward |
+| 85 | npu_normalize_batch |
+| 86 | npu_masked_fill_range |
+| 87 | npu_linear |
+| 88 | npu_linear_backward |
+| 89 | npu_bert_apply_adam |
+| 90 | npu_giou |
+| 91 | npu_giou_backward |
+
+Operator descriptions:
+
+> ```
+> npu_apply_adam(beta1_power, beta2_power, lr, beta1, beta2, epsilon, grad, use_locking, use_nesterov, out = (var, m, v))
+> ```
+
+Computes the Adam optimization result.
+
+- Parameters:
+  - **beta1_power** (Number) - power of beta1.
+  - **beta2_power** (Number) - power of beta2.
+  - **lr** (Number) - learning rate.
+  - **beta1** (Number) - exponential decay rate for the 1st moment estimates.
+  - **beta2** (Number) - exponential decay rate for the 2nd moment estimates.
+  - **epsilon** (Number) - term added to the denominator to improve numerical stability.
+  - **grad** (Tensor) - the gradient.
+  - **use_locking** (bool) - If `True`, uses locks for update operations.
+  - **use_nesterov** (bool) - If `True`, uses the Nesterov update.
+  - **var** (Tensor) - variables to be optimized.
+  - **m** (Tensor) - mean value of variables.
+  - **v** (Tensor) - variance of variables.
+
+- Constraints:
+
+  None
+
+- Examples:
+
+  None
+
+> ```
+> npu_bert_apply_adam(var, m, v, lr, beta1, beta2, epsilon, grad, max_grad_norm, global_grad_norm, weight_decay)
+> ```
+
+Computes the Adam optimization result used in BERT training.
+
+- Parameters:
+  - **lr** (Number) - learning rate.
+  - **beta1** (Number) - exponential decay rate for the 1st moment estimates.
+  - **beta2** (Number) - exponential decay rate for the 2nd moment estimates.
+  - **epsilon** (Number) - term added to the denominator to improve numerical stability.
+  - **grad** (Tensor) - the gradient.
+  - **max_grad_norm** (Number) - maximum norm for the gradients.
+  - **global_grad_norm** (Number) - L2 norm for the gradients.
+  - **weight_decay** (Number) - weight decay.
+  - **var** (Tensor) - variables to be optimized.
+  - **m** (Tensor) - mean value of variables.
+  - **v** (Tensor) - variance of variables.
+
+- Constraints:
+
+  None
+
+- Examples:
+
+  ```python
+  >>> var_in = torch.rand(321538).uniform_(-32.,21.).npu()
+  >>> var_in
+  tensor([  0.6119,   5.8193,   3.0683,  ..., -28.5832,  12.9402, -24.0488],
+        device='npu:0')
+  >>> m_in = torch.zeros(321538).npu()
+  >>> v_in = torch.zeros(321538).npu()
+  >>> grad = torch.rand(321538).uniform_(-0.05,0.03).npu()
+  >>> grad
+  tensor([-0.0315, -0.0113, -0.0132,  ...,  0.0106, -0.0226, -0.0252],
+        device='npu:0')
+  >>> max_grad_norm = -1.
+  >>> beta1 = 0.9
+  >>> beta2 = 0.99
+  >>> weight_decay = 0.
+  >>> lr = 0.1
+  >>> epsilon = 1e-06
+  >>> global_grad_norm = 0.
+  >>> var_out, m_out, v_out = torch.npu_bert_apply_adam(var_in, m_in, v_in, lr, beta1, beta2, epsilon, grad, max_grad_norm, global_grad_norm, weight_decay)
+  >>> var_out
+  tensor([  0.7118,   5.9192,   3.1682,  ..., -28.6831,  13.0402, -23.9489],
+        device='npu:0')
+  >>> m_out
+  tensor([-0.0032, -0.0011, -0.0013,  ...,  0.0011, -0.0023, -0.0025],
+        device='npu:0')
+  >>> v_out
+  tensor([9.9431e-06, 1.2659e-06, 1.7328e-06,  ..., 1.1206e-06, 5.0933e-06,
+        6.3495e-06], device='npu:0')
+  ```
+
\ No newline at end of file
diff --git a/docs/en/PyTorch Network Model Porting and Training Guide/PyTorch Network Model Porting and Training Guide.md b/docs/en/PyTorch Network Model Porting and Training Guide/PyTorch Network Model Porting and Training Guide.md
index d250cd923478b49503dabbabec0a59133f959201..e99f00231957da1e12309442e7ccd9a703032c72 100644
--- a/docs/en/PyTorch Network Model Porting and Training Guide/PyTorch Network Model Porting and Training Guide.md
+++ b/docs/en/PyTorch Network Model Porting and Training Guide/PyTorch Network Model Porting and Training Guide.md
@@ -1,4190 +1,4362 @@
-# PyTorch Network Model Porting and Training Guide
-- [Overview](#overviewmd)
-- [Restrictions and Limitations](#restrictions-and-limitationsmd)
-- [Porting Process](#porting-processmd)
-- [Model Porting Evaluation](#model-porting-evaluationmd)
-- [Environment Setup](#environment-setupmd)
-- [Model Porting](#model-portingmd)
-    - [Tool-Facilitated](#tool-facilitatedmd)
-        - [Introduction](#introductionmd)
-        - [Instructions](#instructionsmd)
-        - [Result Analysis](#result-analysismd)
-    - [Manual](#manualmd)
-        - [Single-Device Training Model Porting](#single-device-training-model-portingmd)
-        - [Multi-Device Training Model Porting](#multi-device-training-model-portingmd)
-        - [PyTorch-related API Replacement](#pytorch-related-api-replacementmd)
-    - [Mixed Precision](#mixed-precisionmd)
-- [Model Training](#model-trainingmd)
-- [Performance Analysis and Optimization](#performance-analysis-and-optimizationmd)
-    - [Prerequisites](#prerequisitesmd)
-    - 
[Commissioning Process](#commissioning-processmd) - - [Overall Guideline](#overall-guidelinemd) - - [Training Data Collection](#training-data-collectionmd) - - [Host-side Performance Optimization](#host-side-performance-optimizationmd) - - [Overview](#overview-0md) - - [Changing the CPU Performance Mode \(x86 Server\)](#changing-the-cpu-performance-mode-x86-servermd) - - [Changing the CPU Performance Mode \(ARM Server\)](#changing-the-cpu-performance-mode-arm-servermd) - - [Installing the High-Performance Pillow Library \(x86 Server\)](#installing-the-high-performance-pillow-library-x86-servermd) - - [\(Optional\) Installing the OpenCV Library of the Specified Version](#optional-installing-the-opencv-library-of-the-specified-versionmd) - - [Training Performance Optimization](#training-performance-optimizationmd) - - [Affinity Library](#affinity-librarymd) - - [Source](#sourcemd) - - [Functions](#functionsmd) -- [Precision Commissioning](#precision-commissioningmd) - - [Prerequisites](#prerequisites-1md) - - [Commissioning Process](#commissioning-process-2md) - - [Overall Guideline](#overall-guideline-3md) - - [Precision Tuning Methods](#precision-tuning-methodsmd) - - [Single-Operator Overflow/Underflow Detection](#single-operator-overflow-underflow-detectionmd) - - [Network-wide Commissioning ](#network-wide-commissioningmd) -- [Model Saving and Conversion](#model-saving-and-conversionmd) - - [Introduction](#introduction-4md) - - [Saving a Model](#saving-a-modelmd) - - [Exporting an ONNX Model](#exporting-an-onnx-modelmd) -- [Samples](#samplesmd) - - [ResNet-50 Model Porting](#resnet-50-model-portingmd) - - [Obtaining Samples](#obtaining-samplesmd) - - [Porting the Training Script](#porting-the-training-scriptmd) - - [Single-Device Training Modification](#single-device-training-modificationmd) - - [Distributed Training Modification](#distributed-training-modificationmd) - - [Script Execution](#script-executionmd) - - [ShuffleNet Model Optimization](#shufflenet-model-optimizationmd) - - [Obtaining Samples](#obtaining-samples-5md) - - [Model Evaluation](#model-evaluationmd) - - [Porting the Network](#porting-the-networkmd) - - [Commissioning the Network](#commissioning-the-networkmd) -- [References](#referencesmd) - - [Single-Operator Sample Building](#single-operator-sample-buildingmd) - - [Single-Operator Dump Method](#single-operator-dump-methodmd) - - [Common Environment Variables](#common-environment-variablesmd) - - [dump op Method](#dump-op-methodmd) - - [Compilation Option Settings](#compilation-option-settingsmd) - - [How Do I Install GCC 7.3.0?](#how-do-i-install-gcc-7-3-0md) - - [HDF5 Compilation and Installation](#hdf5-compilation-and-installationmd) -- [FAQs](#faqsmd) - - [FAQs About Software Installation](#faqs-about-software-installationmd) - - [pip3.7 install Pillow==5.3.0 Installation Failed](#pip3-7-install-pillow-5-3-0-installation-failedmd) - - [FAQs About Model and Operator Running](#faqs-about-model-and-operator-runningmd) - - [What Do I Do If the Error Message "RuntimeError: ExchangeDevice:" Is Displayed During Model or Operator Running?](#what-do-i-do-if-the-error-message-runtimeerror-exchangedevice-is-displayed-during-model-or-operatormd) - - [What Do I Do If the Error Message "Error in atexit.\_run\_exitfuncs:" Is Displayed During Model or Operator Running?](#what-do-i-do-if-the-error-message-error-in-atexit-_run_exitfuncs-is-displayed-during-model-or-operatmd) - - [What Do I Do If the Error Message "terminate called after throwing an instance of 'c10::Error' 
what\(\): HelpACLExecute:" Is Displayed During Model Running?](#what-do-i-do-if-the-error-message-terminate-called-after-throwing-an-instance-of-c10-error-what-hemd) - - [What Do I Do If the Error Message "terminate called after throwing an instance of 'c10::Error' what\(\): 0 INTERNAL ASSERT" Is Displayed During Model Running?](#what-do-i-do-if-the-error-message-terminate-called-after-throwing-an-instance-of-c10-error-what-0md) - - [What Do I Do If the Error Message "ImportError: libhccl.so." Is Displayed During Model Running?](#what-do-i-do-if-the-error-message-importerror-libhccl-so-is-displayed-during-model-runningmd) - - [What Do I Do If the Error Message "RuntimeError: Initialize." Is Displayed During Model Running?](#what-do-i-do-if-the-error-message-runtimeerror-initialize-is-displayed-during-model-runningmd) - - [What Do I Do If the Error Message "TVM/te/cce error." Is Displayed During Model Running?](#what-do-i-do-if-the-error-message-tvm-te-cce-error-is-displayed-during-model-runningmd) - - [What Do I Do If the Error Message "MemCopySync:drvMemcpy failed." Is Displayed During Model Running?](#what-do-i-do-if-the-error-message-memcopysync-drvmemcpy-failed-is-displayed-during-model-runningmd) - - [What Do I Do If the Error Message "MemCopySync:drvMemcpy failed." Is Displayed During Model Running?](#what-do-i-do-if-the-error-message-memcopysync-drvmemcpy-failed-is-displayed-during-model-running-6md) - - [What Do I Do If the Error Message "HelpACLExecute." Is Displayed After Multi-Task Delivery Is Disabled \(export TASK\_QUEUE\_ENABLE=0\) During Model Running?](#what-do-i-do-if-the-error-message-helpaclexecute-is-displayed-after-multi-task-delivery-is-disabledmd) - - [What Do I Do If the Error Message "55056 GetInputConstDataOut: ErrorNo: -1\(failed\)" Is Displayed During Model Running?](#what-do-i-do-if-the-error-message-55056-getinputconstdataout-errorno--1failed-is-displayed-duringmd) - - [FAQs About Model Commissioning](#faqs-about-model-commissioningmd) - - [What Do I Do If the Error Message "RuntimeError: malloc:/..../pytorch/c10/npu/NPUCachingAllocator.cpp:293 NPU error, error code is 500000." Is Displayed During Model Commissioning?](#what-do-i-do-if-the-error-message-runtimeerror-malloc-pytorch-c10-npu-npucachingallocator-cpp-293-npmd) - - [What Do I Do If the Error Message "RuntimeError: Could not run 'aten::trunc.out' with arguments from the 'NPUTensorId' backend." 
Is Displayed During Model Commissioning](#what-do-i-do-if-the-error-message-runtimeerror-could-not-run-aten-trunc-out-with-arguments-from-themd) - - [What Do I Do If the MaxPoolGradWithArgmaxV1 and max Operators Report Errors During Model Commissioning?](#what-do-i-do-if-the-maxpoolgradwithargmaxv1-and-max-operators-report-errors-during-model-commissionimd) - - [What Do I Do If the Error Message "ModuleNotFoundError: No module named 'torch.\_C'" Is Displayed When torch Is Called?](#what-do-i-do-if-the-error-message-modulenotfounderror-no-module-named-torch-_c-is-displayed-when-tormd) - - [FAQs About Other Operations](#faqs-about-other-operationsmd) - - [What Do I Do If an Error Is Reported During CUDA Stream Synchronization?](#what-do-i-do-if-an-error-is-reported-during-cuda-stream-synchronizationmd) - - [What Do I Do If aicpu\_kernels/libpt\_kernels.so Does Not Exist?](#what-do-i-do-if-aicpu_kernels-libpt_kernels-so-does-not-existmd) - - [What Do I Do If the Python Process Is Residual When the npu-smi info Command Is Used to View Video Memory?](#what-do-i-do-if-the-python-process-is-residual-when-the-npu-smi-info-command-is-used-to-view-video-mmd) - - [What Do I Do If the Error Message "match op inputs failed"Is Displayed When the Dynamic Shape Is Used?](#what-do-i-do-if-the-error-message-match-op-inputs-failed-is-displayed-when-the-dynamic-shape-is-usedmd) - - [What Do I Do If the Error Message "Op type SigmoidCrossEntropyWithLogitsV2 of ops kernel AIcoreEngine is unsupported" Is Displayed?](#what-do-i-do-if-the-error-message-op-type-sigmoidcrossentropywithlogitsv2-of-ops-kernel-aicoreenginemd) - - [What Do I Do If a Hook Failure Occurs?](#what-do-i-do-if-a-hook-failure-occursmd) - - [What Do I Do If the Error Message "load state\_dict error." Is Displayed When the Weight Is Loaded?](#what-do-i-do-if-the-error-message-load-state_dict-error-is-displayed-when-the-weight-is-loadedmd) - - [FAQs About Distributed Model Training](#faqs-about-distributed-model-trainingmd) - - [What Do I Do If the Error Message "host not found." Is Displayed During Distributed Model Training?](#what-do-i-do-if-the-error-message-host-not-found-is-displayed-during-distributed-model-trainingmd) - - [What Do I Do If the Error Message "RuntimeError: connect\(\) timed out." Is Displayed During Distributed Model Training?](#what-do-i-do-if-the-error-message-runtimeerror-connect-timed-out-is-displayed-during-distributed-mmd) -

Overview

- -Currently, the solution of adapting to the Ascend AI Processor is an online solution. - -### Solution Features and Advantages - -The acceleration of the Ascend AI Processor is implemented by calling various operators \(OP-based\). That is, the AscendCL is used to call one or more D affinity operators to replace the original GPU-based implementation. [Figure 1](#fig2267112413239) shows the logical model of the implementation. - -**Figure 1** Logical model - - -![](figures/pytorch适配逻辑结构图-优化.png) - -Currently, the main reasons for selecting the online adaptation solution are as follows: - -1. The dynamic graph feature of the PyTorch framework is inherited to the maximum extent. -2. The GPU's usage on the PyTorch is inherited to the maximum extent, which minimizes the changes in the development mode and code reuse when a model is ported to the Ascend AI Processor for training. -3. The original PyTorch architecture is inherited to the maximum extent and the excellent features of the PyTorch architecture are retained, such as automatic differentiation, dynamic distribution, debugging, profiling, storage sharing mechanism, and dynamic memory management on the device side. -4. It has good scalability. During the streamlining process, only the development and implementation of related compute operators are involved for new network types or structures. Framework operators, reverse graph building, and implementation mechanisms can be reused. -5. The usage and style are the same as those of the GPU-based implementation. During online adaption, you only need to specify the device as the Ascend AI Processor in Python and device operations to develop, train, and debug the network in PyTorch using the Ascend AI Processor. You do not need to pay attention to the underlying details of the Ascend AI Processor. In this way, you can minimize the modification and complete porting with low costs. - -

Restrictions and Limitations

- -- In the **infershape** phase, operators do not support unknown shape inference. -- Only the float16 operator can be used for cube computing. -- inf/nan data of the float16 type cannot be input or output. -- Dimensions cannot be reduced when the format larger than 4D is used. -- In the current version, Apex is implemented using Python, and the customized optimization CUDA kernel in Apex is not supported. -- The current version of Apex supports only the mixed precision calculation and multiple fusion optimizer functions adapted to Ascend AI Processors. -- The restrictions on collective communication are as follows: - - In data parallel mode, the graphs executed on different devices must be the same. - - Allocation at only 1, 2, 4, or 8 processors is supported. - - Only the int8, int32, float16, and float32 data types are supported. - - -

Porting Process

- -Model porting refers to moving models that have been implemented in the open-source community to an Ascend AI Processor. [Figure 1](#fig759451810422) shows the model porting process. - -**Figure 1** Porting process -![](figures/porting-process.png "porting-process") - -**Table 1** Porting process - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Scenario

-

Description

-

Model selection

-

For details, see Model Selection.

-

Model porting evaluation

-

For details, see Model Porting Evaluation.

-

Operator development

-

For details, see the PyTorch Operator Development Guide.

-

Environment setup

-

For details, see Environment Setup.

-

Model porting

-

For details, see Model Porting.

-

Model training

-

For details, see Model Training.

-

Error analysis

-

For details, see "AI Core Error Analyzer Instructions" in the CANN Log Reference and CANN Auxiliary Development Tool User Guide .

-

Performance analysis and optimization

-

For details, see Performance Optimization and Analysis.

-

Precision commissioning

-

For details, see Precision Commissioning.

-

Model saving and conversion

-

For details, see Model Saving and Conversion and "ATC Tool Instructions" in the CANN Auxiliary Development Tool User Guide .

-

Application software development

-

For details, see the CANN Application Software Development Guide (C and C++, Inference).

-

FAQs

-

Describes how to prepare the environment, port models, commission models, and resolve other common problems. For details, see FAQs.

-
- -

Model Porting Evaluation

- -1. When selecting models, select authoritative PyTorch models as benchmarks, including but not limited to PyTorch \([example](https://github.com/pytorch/examples/tree/master/imagenet)/[vision](https://github.com/pytorch/vision)\), facebookresearch \([Detectron](https://github.com/facebookresearch/Detectron)/[detectron2](https://github.com/facebookresearch/detectron2)\), and open-mmlab \([mmdetection](https://github.com/open-mmlab/mmdetection)/[mmpose](https://github.com/open-mmlab/mmpose)\). -2. Check the operator adaptation. Before porting the original model and training script to an Ascend AI Processor, train the original model and training script on the CPU, obtain the operator information by using the dump op method, and compare the operator information with that in the _PyTorch Operator Support_ to check whether the operator is supported. For details about the dump op method, see [dump op Method](#dump-op-methodmd). If an operator is not supported, develop the operator. For details, see the _PyTorch Operator Development Guide_. - - >![](public_sys-resources/icon-note.gif) **NOTE:** - >You can also port the model and training script to the Ascend AI Processor for training to view the error information. For details about how to port the model and training script, see the following sections. Generally, a message is displayed, indicating that an operator \(the first operator that is not supported\) cannot run in the backend of the Ascend AI Processor. - - -
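As a quick complement to the dump op method, the PyTorch profiler can enumerate the operators a model actually executes on the CPU, and the resulting list can be compared against the _PyTorch Operator Support_. The following is a minimal sketch; the ResNet-50 model and input shape are placeholders:

```python
import torch
import torchvision.models as models

# Minimal sketch: run one forward/backward pass on the CPU and list the
# executed operators. The model and input shape here are placeholders.
model = models.resnet50()
dummy_input = torch.randn(1, 3, 224, 224)

with torch.autograd.profiler.profile() as prof:
    out = model(dummy_input)
    out.sum().backward()

# Collect the distinct operator names observed during execution.
op_names = sorted({evt.name for evt in prof.function_events})
for name in op_names:
    print(name)
```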

Environment Setup

- -Refer to the _PyTorch Installation Guide_ to install PyTorch and the mixed precision module, and configure required environment variables. - -

Model Porting

- -- **[Tool-Facilitated](#tool-facilitatedmd)** - -- **[Manual](#manualmd)** - -- **[Mixed Precision](#mixed-precisionmd)** - - -

Tool-Facilitated

- -The Ascend platform provides a script conversion tool to enable you to port training scripts to Ascend AI Processors using commands. The following will provide the details. In addition to using commands, you can also use the PyTorch GPU2Ascend function integrated in MindStudio to port scripts. For details, see the _MindStudio User Guide_. - -- **[Introduction](#introductionmd)** - -- **[Instructions](#instructionsmd)** - -- **[Result Analysis](#result-analysismd)** - - -

Introduction

- -##### Overview - -Ascend NPU is an up-and-comer in the AI computing field, but most training and online inference scripts are based on GPUs. Due to the architecture differences between NPUs and GPUs, GPU-based training and online inference scripts cannot be directly used on NPUs. The script conversion tool provides an automated method for converting GPU-based scripts into NPU-based scripts, reducing the learning cost and workload of manual script migration, thereby improving the migration efficiency. - ->![](public_sys-resources/icon-note.gif) **NOTE:** ->- msFmkTransplt provides suggestions and converts scripts by the adaptation rules, significantly accelerating script migration and reducing development workload. The scripts in [Table 1](#en-us_topic_0000001133095885_table4705239194613) can be directly executed after being converted. The conversion results of other scripts are for reference only. You need to perform adaptation based on the site requirements. ->- The original scripts in [Table 1](#en-us_topic_0000001133095885_table4705239194613) must be executed in the GPU environment and based on Python 3. ->- For scripts in [Table 1](#en-us_topic_0000001133095885_table4705239194613), the execution logic after conversion is the same as that before conversion. ->- This script conversion tool only supports the conversion of PyTorch training scripts. - -**Table 1** Supported models - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

No.

-

Model

-

1

-

3D AttentionNet

-

2

-

3D Nested_UNet

-

3

-

Advanced East

-

4

-

AlexNet

-

5

-

DeeplabV3+(Xception-JFT)

-

6

-

DeepMar

-

7

-

Densenet121

-

8

-

DenseNet161

-

9

-

DenseNet169

-

10

-

DenseNet201

-

11

-

EAST

-

12

-

FCN

-

13

-

FD-GAN

-

14

-

FOTS

-

15

-

GENet

-

16

-

GoogleNet

-

17

-

GRU

-

18

-

Inception V4

-

19

-

InceptionV2

-

20

-

LPRNet

-

21

-

LSTM

-

22

-

MNASNet0_5

-

23

-

MNASNet0_75

-

24

-

MNASNet1_0

-

25

-

MNASNet1_3

-

26

-

MobileNetV1

-

27

-

MobileNetV2

-

28

-

PNet

-

29

-

PSENet

-

30

-

RAFT

-

31

-

RecVAE

-

32

-

ResNet101

-

33

-

ResNet152

-

34

-

ResNet18

-

35

-

ResNet34

-

36

-

ResNet50

-

37

-

Resnext101_32x8d

-

38

-

Resnext50

-

39

-

RNet

-

40

-

Shufflenetv2

-

41

-

SqueezeNet1_0

-

42

-

SqueezeNet1_1

-

43

-

U-Net

-

44

-

VAE+GAN

-

45

-

VGG11

-

46

-

VGG11_BN

-

47

-

VGG13

-

48

-

VGG13_BN

-

49

-

VGG16

-

50

-

VGG16_BN

-

51

-

VGG19

-

52

-

VGG19_BN

-

53

-

VIT-base

-

54

-

Wide_ResNet101_2

-

55

-

Wide_ResNet50_2

-
- -##### System Requirement - -msFmkTransplt runs on Ubuntu 18.04, CentOS 7.6, and EulerOS 2.8 only. - -##### Environment Setup - -Set up the development environment by referring to the _CANN Software Installation Guide_. - -

Instructions

- -##### Command-line Options - -**Table 1** Command-line options - - - - - - - - - - - - - - - - - - - - - - - - -

Option

-

Description

-

Example Value

-

-i

-

--input

-
  • Path of the folder or file where the original script file to be converted is located.
  • Required
-
  • /home/username/fmktransplt
  • /home/username/fmktransplt.py
-

-o

-

--output

-
  • Output path of the script conversion result. A folder with the .msft suffix will be generated in the path.
  • Required
-

/home/username/fmktransplt_output

-

-r

-

--rule

-
  • Path of the JSON file for custom general conversion rules, which cover function parameter, function name, and module name modifications.
  • Optional
-

/home/username/fmktransplt_rule.json

-

-h

-

--help

-

Help information.

-

-

-
- -##### Customizing a Rule File - -An example of a custom conversion rule is as follows: - -``` -{ - "rules": { - "ArgsModifyRule": [ - { - "func_name": "name1", - "arg_idx": 0, - "arg_new": "agrs0" - }, - { - "func_name": "name2", - "arg_idx": 0, - "arg_new": "agrs0" - } - ], - "FuncNameModifyRule": [ - { - "old_name": "func", - "new_name": "new_func" - } - ], - "ModuleNameModifyRule": [ - { - "old_name": "module", - "new_name": "new_module", - "parent_module":"parent_module" - } - ] - } -} -``` - -**Table 2** Options - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Option

-

Description

-

ArgsModifyRule

-

Function parameter modification

-

func_name

-

Function name

-

arg_idx

-

Parameter position

-

arg_new

-

New parameter

-

FuncNameModifyRule

-

Function name modification

-

ModuleNameModifyRule

-

Module name modification

-

old_name

-

Old name

-

new_name

-

New name

-

parent_module

-

Parent module name

-
- -##### Performing Conversion - -1. Go to the directory of the script conversion tool msFmkTransplt. - - ``` - cd {Ascend-CANN-Toolkit install path}/ascend-toolkit/{version}/{arch}-linux/toolkit/tools/ms_fmk_transplt - ``` - -2. Execute msFmkTransplt. - - ``` - python3 ms_fmk_transplt.py -i original script path -o output path of the script conversion result [-r path of the JSON file for custom general conversion rules] - ``` - -3. Find the converted script in the specified output path. - -
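For example, using the illustrative paths from Table 1, converting a script folder with a custom rule file would look as follows:

```shell
# Paths are the example values from Table 1; substitute your own.
python3 ms_fmk_transplt.py -i /home/username/fmktransplt -o /home/username/fmktransplt_output -r /home/username/fmktransplt_rule.json
```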

Result Analysis

- -You can view the result files in the output path when the script is converted. - -``` -├── xxx_msft // Directory for storing script conversion results. The default directory is the directory of the original script. xxx indicates the name of the folder where the original script is stored. -│ ├── generated script file // The directory structure is the same as that of the script file before conversion. -│ ├── msFmkTranspltlog.txt // Log file generated during script conversion -│ ├── unsupported_op.xlsx // File of the unsupported operator list -``` - -

Manual

- -- **[Single-Device Training Model Porting](#single-device-training-model-portingmd)** - -- **[Multi-Device Training Model Porting](#multi-device-training-model-portingmd)** - -- **[PyTorch-related API Replacement](#pytorch-related-api-replacementmd)** - - -

Single-Device Training Model Porting

- -The advantage of the online adaption is that the training on the Ascend AI Processor is consistent with the usage of the GPU. During online adaption,** you only need to specify the device as the Ascend AI Processor in Python and device operations** to develop, train, and debug the network in PyTorch using the Ascend AI Processor. For single-device model training, main changes for porting are as follows: - -GPU code before porting: - -``` - CALCULATE_DEVICE = "gpu:0" - torch.cuda.set_device(CALCULATE_DEVICE) - # Two methods for porting the code to device - model = model.cuda() # Method 1 - model = model.to(CALCULATE_DEVICE) # Method 2 - # Port the input from host to device. - images = images.to(CALCULATE_DEVICE) - target = target.to(CALCULATE_DEVICE) -``` - -The code ported to the Ascend AI Processor is as follows: - -``` - CALCULATE_DEVICE = "npu:0" - torch.npu.set_device(CALCULATE_DEVICE) - # Two methods for porting the code to device - model = model.npu() # Method 1 - model = model.to(CALCULATE_DEVICE) # Method 2 - # Port the input from host to device. - images = images.to(CALCULATE_DEVICE) - target = target.to(CALCULATE_DEVICE) -``` - -For details, see [Single-Device Training Modification](#single-device-training-modificationmd). - -
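Because the NPU adaptation keeps the `torch.device` semantics, a script can also be written device-agnostically so that the same code runs on NPU, GPU, or CPU. The following is a minimal sketch under the assumption that the NPU-adapted PyTorch is installed (otherwise `torch.npu` does not exist); the linear model and random input are placeholders:

```python
import torch
import torch.nn as nn

# Minimal sketch: pick the computing device at run time so the same script
# runs on NPU, GPU, or CPU. torch.npu exists only with the NPU-adapted PyTorch.
if hasattr(torch, "npu") and torch.npu.is_available():
    device = torch.device("npu:0")
elif torch.cuda.is_available():
    device = torch.device("cuda:0")
else:
    device = torch.device("cpu")

model = nn.Linear(4, 2).to(device)       # port the model to the device
inputs = torch.randn(8, 4).to(device)    # port the input from host to device
outputs = model(inputs)
```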

Multi-Device Training Model Porting

- -To port a multi-device training model, **you need to specify the device as the Ascend AI Processor in Python and device operations**. In addition, you can perform distributed training using PyTorch **DistributedDataParallel**, that is, run **init\_process\_group** during model initialization, and then initialize the model into a **DistributedDataParallel** model. Note that the **backend **must be set to **hccl **and the initialization mode must be shielded when **init\_process\_group** is executed. - -PyTorch distributed training code example \(some code is omitted\): - -``` -import torch -import torch.distributed as dist -import torch.nn.parallel -def main(): - args = parser.parse_args() - # The initialization mode needs to be shielded. - dist.init_process_group(backend='hccl',# init_method=args.dist_url, - world_size=args.world_size, rank=args.rank) - model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu]) # The model needs to be delivered to the NPU. - train_loader = torch.utils.data.DataLoader( - train_dataset, batch_size=args.batch_size, shuffle=(train_sampler is None), - num_workers=args.workers, pin_memory=True, sampler=train_sampler) - for epoch in range(args.start_epoch, args.epochs): - acc1 = train(train_loader, model, criterion, optimizer, epoch, args,ngpus_per_node, - lr_scheduler) -``` - -For details, see [Distributed Training Modification](#distributed-training-modificationmd). - - - -1. To enable the Ascend AI Processor to use the capabilities of the PyTorch framework, the native PyTorch framework needs to be adapted at the device layer. The APIs related to the CPU and CUDA need to be replaced for external presentation. During network porting, some device-related APIs need to be replaced with the APIs related to the Ascend AI Processor. [Table 1](#table1922064517344) lists the supported device-related APIs. - - **Table 1** Device-related APIs - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Original PyTorch API

-

API Adapted to the Ascend AI Processor

-

Description

-

torch.cuda.is_available()

-

torch.npu.is_available()

-

Checks whether the device is available in the current environment (not the final result).

-

torch.cuda.current_device()

-

torch.npu.current_device()

-

Obtains the device in use.

-

torch.cuda.device_count()

-

torch.npu.device_count()

-

Obtains the number of devices in the current environment.

-

torch.cuda.set_device()

-

torch.npu.set_device()

-

Sets the device in use.

-

torch.tensor([1,2,3]).is_cuda

-

torch.tensor([1,2,3]).is_npu

-

Checks whether a tensor is in the format on the CUDA or NPU device.

-

torch.tensor([1,2,3]).cuda()

-

torch.tensor([1,2,3]).npu()

-

Converts a tensor to the format on the CUDA or NPU device.

-

torch.tensor([1,2,3]).to("cuda")

-

torch.tensor([1,2,3]).to('npu')

-

Converts a tensor to the format on the CUDA or NPU device.

-

torch.cuda.synchronize()

-

torch.npu.synchronize()

-

Waits until the event is complete.

-

torch.cuda.device

-

torch.npu.device

-

Generates a device class, which can be used to perform device-related operations.

-

torch.cuda.Stream(device)

-

torch.npu.Stream(device)

-

Generates a stream object.

-

torch.cuda.stream(Stream)

-

torch.npu.stream(Stream)

-

Mainly used for scope restriction.

-

torch.cuda.current_stream()

-

torch.npu.current_stream()

-

Obtains the current stream.

-

torch.cuda.default_stream()

-

torch.npu.default_stream()

-

Obtains the default stream.

-

device = torch.device("cuda:0")

-

device = torch.device("npu:0")

-

Specifies a device.

-

torch.autograd.profiler.profile

-

(use_cuda=True)

-

torch.autograd.profiler.profile

-

(use_npu=True)

-

Specifies that CUDA/NPU is used during profiler execution.

-

torch.cuda.Event()

-

torch.npu.Event()

-

Returns events on a device.

-
- -2. When building or porting a network, you need to create tensors of specified data types. The following table lists the tensors created on the Ascend AI Processor. - - **Table 2** Tensor-related APIs - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

GPU tensor

-

API Adapted to the Ascend AI Processor

-

torch.tensor([1,2,3],dtype=torch.long,device='cuda')

-

torch.tensor([1,2,3],dtype=torch.long,device='npu')

-

torch.tensor([1,2,3],dtype=torch.int,device='cuda')

-

torch.tensor([1,2,3],dtype=torch.int,device='npu')

-

torch.tensor([1,2,3],dtype=torch.half,device='cuda')

-

torch.tensor([1,2,3],dtype=torch.half,device='npu')

-

torch.tensor([1,2,3],dtype=torch.float,device='cuda')

-

torch.tensor([1,2,3],dtype=torch.float,device='npu')

-

torch.tensor([1,2,3],dtype=torch.bool,device='cuda')

-

torch.tensor([1,2,3],dtype=torch.bool,device='npu')

-

torch.cuda.BoolTensor([1,2,3])

-

torch.npu.BoolTensor([1,2,3])

-

torch.cuda.FloatTensor([1,2,3])

-

torch.npu.FloatTensor([1,2,3])

-

torch.cuda.IntTensor([1,2,3])

-

torch.npu.IntTensor([1,2,3])

-

torch.cuda.LongTensor([1,2,3])

-

torch.npu.LongTensor([1,2,3])

-

torch.cuda.HalfTensor([1,2,3])

-

torch.npu.HalfTensor([1,2,3])

-
- - -For more APIs, see the _PyTorch API Support_. - -

Mixed Precision

- -#### Overview - -Based on the architecture features of the NPU chip, mixed precision training is involved, that is, the scenario where the float16 and float32 data types are used together. Replacing float32 with float16 has the following advantages: - -- The memory usage of intermediate variables is reduced. -- The data transfer time decreases because the memory usage is reduced. -- The computing units of float16 provide better computing performance. - -However, the mixed precision training is limited by the precision range expressed by float16. If float32 is converted into float16, the training convergence is affected. To use float16 for acceleration in some computations and ensure training convergence, the mixed precision module Apex is used. The mixed precision module Apex is a comprehensive optimization library that features high optimization performance and precision. - -In addition to the preceding advantages, the mixed precision module Apex adapted to Ascend AI Processors can improve computing performance. Details are described as follows: - -- During mixed precision calculation, Apex calculates the grad of the model. You can enable combine\_grad to accelerate these operations. Set the **combine\_grad** parameter of the amp.initialize\(\) interface to **True**. -- After the adaptation, Apex optimizes optimizers, such as adadelta, adam, sgd, and lamb to adapt them to Ascend AI Processors. As a result, the obtained NPU-based fusion optimizers are consistent with the native algorithms, but the calculation speed is faster. You only need to replace the original optimizer with **apex.optimizers.\*** \(**\*** indicates the optimizer name, for example, **NpuFusedSGD**\). - -#### Supported Features - -[Table 1](#table10717173813332) describes the functions and optimization of the mixed precision module. - -**Table 1** Functions of the mixed precision module - - - - - - - - - - - - - - - - - - - -

Function

-

Description

-

O1 configuration

-

Conv and Matmul use float16 for computing, and Softmax and BN use float32.

-

O2 configuration

-

BN uses float32, and others use float16.

-

Static loss scale

-

Parameters are statically set to ensure the convergence of mixed precision training.

-

Dynamic loss scale

-

The loss scale value is dynamically calculated to determine whether overflow occurs.

-
- ->![](public_sys-resources/icon-note.gif) **NOTE:** ->- In the current version, Apex is implemented using Python and does not support AscendCL or CUDA optimization. ->- Ascend AI devices do not support the original FusedLayerNorm interface module of Apex. If the original model script file uses the FusedLayerNorm interface module, you need to replace the script header file **from apex.normalization import FusedLayerNorm** with **from torch.nn import LayerNorm**. - -#### Integrating Mixed Precision Module Into the PyTorch Model - -1. To use the mixed precision module Apex, you need to import the amp from the Apex library as follows: - - ``` - from apex import amp - ``` - -2. After the amp module is imported, you need to initialize the amp module so that it can modify the model, optimizer, and PyTorch internal functions. The initialization code is as follows: - - ``` - model, optimizer = amp.initialize(model, optimizer, combine_grad=True) - ``` - -3. Mark the location where the back propagation **.backward\(\)** occurs so that the amp can perform loss scaling and clear the status of each iteration. The code is as follows: - - Original code: - - ``` - loss = criterion(...) - loss.backward() - optimizer.step() - ``` - - Code after the modification to support loss scaling: - - ``` - loss = criterion(...) - with amp.scale_loss(loss, optimizer) as scaled_loss: - scaled_loss.backward() - optimizer.step() - ``` - - -
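Putting the three steps together, a minimal end-to-end sketch of an NPU training step with Apex mixed precision might look as follows. It assumes the NPU-adapted PyTorch and Apex are installed; the linear model, random data, and loop length are illustrative placeholders:

```python
import torch
import torch.nn as nn
from apex import amp

# Placeholders standing in for a real model and dataset.
device = "npu:0"
torch.npu.set_device(device)

model = nn.Linear(16, 4).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.MSELoss()

# Step 2: initialize amp (combine_grad=True enables gradient combination on NPU).
model, optimizer = amp.initialize(model, optimizer, combine_grad=True)

for step in range(10):
    inputs = torch.randn(8, 16).to(device)
    target = torch.randn(8, 4).to(device)
    optimizer.zero_grad()
    loss = criterion(model(inputs), target)
    # Step 3: scale the loss before back propagation.
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()
```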

Model Training

- -After the training scripts are ported, set environment variables by following the instructions in [Environment Variable Configuration](#en-us_topic_0000001144082004md) and run the **python3** _xxx_ command to train a model. For details, see [Script Execution](#script-executionmd). - ->![](public_sys-resources/icon-note.gif) **NOTE:** ->When running the **python3** _xxx_ command, create a soft link between Python 3 and the installation path of Python that matches the current PyTorch version. - -
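For example, if the matching Python is installed under /usr/local/python3.7.5 (an assumed path; substitute your actual installation), the soft link could be created as follows:

```shell
# Assumed installation path; adjust to your environment.
ln -sf /usr/local/python3.7.5/bin/python3.7 /usr/local/bin/python3
```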

Performance Analysis and Optimization

- -- **[Prerequisites](#prerequisitesmd)** - -- **[Commissioning Process](#commissioning-processmd)** - -- **[Affinity Library](#affinity-librarymd)** - - -

Prerequisites

- -1. Modify the open-source code to ensure that the model can run properly, including data preprocessing, forward propagation, loss calculation, mixed precision, back propagation, and parameter update. For details, see [Samples](#samplesmd). -2. During model porting, check whether the model can run properly and whether the existing operators can meet the requirements. If no operator meets the requirements, develop an adapted operator. For details, see the _PyTorch Operator Development Guide_. -3. Prioritize the single-device function, and then enable the multi-device function. - -

Commissioning Process

- -- **[Overall Guideline](#overall-guidelinemd)** - -- **[Training Data Collection](#training-data-collectionmd)** - -- **[Host-side Performance Optimization](#host-side-performance-optimizationmd)** - -- **[Training Performance Optimization](#training-performance-optimizationmd)** - - -

Overall Guideline

- -1. Check whether the throughput meets the expected requirements based on the training execution result. -2. If the throughput does not meet requirements, you need to find out the causes of the performance bottleneck. Possible causes are as follows: - - Operator bottleneck: The execution of an operator is too slow. - - Copy bottleneck: The bottleneck is caused by the copy operation during converting non-contiguous tensors to contiguous tensors. - - Framework bottleneck: Additional operations are required due to operator format conversion. - - Compilation bottleneck: Repeated compilation is caused by the changes of shape or attributes. - -3. Analyze the preceding causes of performance bottlenecks and optimize the performance. - -

Training Data Collection

- -##### Profile Data Collection - -During model training, if the throughput does not meet requirements, you can collect profile data generated during the training process to analyze which step and which operator cause the performance consumption. The profile data is collected at the PyTorch layer \(PyTorch API data\) and CANN layer \(TBE operator data\). - -Select a collection mode based on the site requirements and perform the following steps to collect the profile data. - -- Profile data collection at the PyTorch layer - 1. Obtain the **chrome\_trace** file. - - Use the profile API to reconstruct the loss calculation and optimization process of the original code. - - ``` - # Use the profile API adapted to Ascend-PyTorch. You are advised to run only one step. - with torch.autograd.profiler.profile(use_npu=True) as prof: - out = model(input_tensor) - loss=loss_func(out) - loss.backward() - optimizer.zero_grad() - optimizer.step() - # Export the chrome_trace file to a specified path. - output_path = '/home/HwHiAiUser/profile_data.json' - prof.export_chrome_trace(output_path) - ``` - - 2. View the **chrome\_trace** file. - - To view the **chrome\_trace** file, access **chrome://tracing** in the Chrome browser, drag the file in the blank space. You can press **W**, **A**, **S**, or **D** to zoom in, zoom out, or move the profiling result. - - -- Profile data collection at the CANN layer - 1. Obtain the profile data file. - - ``` - profiler_result_path = "/home/profiling_data" # folder for storing the profile data. You need to manually create the folder in advance based on the site requirements. - with torch.npu.profile(profiler_result_path): - out = model(input_tensor) - loss=loss_func(out,target) - loss.backward() - optimizer.zero_grad() - optimizer.step() - ``` - - >![](public_sys-resources/icon-note.gif) **NOTE:** - >When obtaining the profile data file, deliver **model**, **input\_tensor**, and **target** to the NPU. - - 2. Parse the profile data file. - - For details, see "Profiling Instructions \(Training\)" in the _CANN Auxiliary Development Tool User Guide _. - - - -##### Obtaining Operator Information \(OP\_INFO\) - -The network model is executed as an operator \(OP\). The OPInfo log can be used to obtain the operator and its attributes during the actual execution. Obtain the information by running the **get\_ascend\_op\_info.py** script. - -1. Write the **get\_ascend\_op\_info.py** script to obtain the operator information. The script content is as follows: - - ``` - # -*- coding: utf-8 -*- - """ Used to export operator information. - """ - import os - import sys - import argparse - - def func(host_log_folder): - """ - :param host_log_folder: where host_log_folder addr is. 
- :return: - """ - host_log_files = os.listdir(host_log_folder) - result = {} - - for host_log in host_log_files: - if not host_log.endswith('.log') or host_log.endswith('.out'): - continue - with open(os.path.join(host_log_folder, host_log), 'r')as f: - host_log_lines = f.readlines() - for line in host_log_lines: - if line.startswith('[INFO] ASCENDCL') and "aclopCompile::aclOp" in line: - op_info = line.split('OpType: ')[1][:-2] - op_type = op_info.split(',')[0] - op_param = op_info[len(op_type) + 2:] - if op_type not in result.keys(): - result[op_type] = [op_param] - else: - result[op_type].append(op_param) - - with open('ascend_op_info_summary.txt', 'w')as f: - for k, v in result.items(): - v_set = set(v) - for info in v_set: - f.write(k + " " + info + "\n") - - if __name__ == "__main__": - parser = argparse.ArgumentParser(description='trans the log') - parser.add_argument('--host_log_folder', default="./", - help="input the dir name, trans the current dir with default") - ags = parser.parse_args() - func(ags.host_log_folder) - ``` - -2. Set the environment variable to print host logs to the screen. - - ``` - export ASCEND_SLOG_PRINT_TO_STDOUT=1 - ``` - -3. Set the log level to **info**. For details, see the _CANN Log Reference_. -4. Run the training script to train the model. After the training is complete, obtain the host logs. By default, the logs are stored in the **$HOME/ascend/log/plog** directory. **$HOME** indicates the root directory of the user on the host. -5. After the host logs are parsed, obtain the operator information **ascend\_op\_info\_summary.txt** in the current directory. - - ``` - python3 get_ascend_op_info.py --host_log_folder $HOME/ascend/log/plog - ``` - -6. Analyze the extra tasks in TaskInfo, especially transdata. - -

Host-side Performance Optimization

- -- **[Overview](#overview-0md)** - -- **[Changing the CPU Performance Mode \(x86 Server\)](#changing-the-cpu-performance-mode-(x86-server)md)** - -- **[Changing the CPU Performance Mode \(ARM Server\)](#changing-the-cpu-performance-mode-(arm-server)md)** - -- **[Installing the High-Performance Pillow Library \(x86 Server\)](#installing-the-high-performance-pillow-library-(x86-server)md)** - -- **[\(Optional\) Installing the OpenCV Library of the Specified Version](#(optional)-installing-the-opencv-library-of-the-specified-versionmd)** - - -
Overview
- -During PyTorch model porting and training, the number of images recognized within one second \(FPS\) for some network models is low and the performance does not meet the requirements. You can perform the following optimization on the server to improve the training performance: - -- Change the CPU performance mode. -- Install the high-performance Pillow library. -- \(Optional\) Install the OpenCV library of the specified version. - -
Changing the CPU Performance Mode (x86 Server)
- -###### Setting the Power Policy to High Performance - -To improve network performance, you need to set the power policy to high performance in the BIOS settings of the x86 server. The detailed operations are as follows: - -1. Log in to the iBMC WebUI, start the virtual console, and select **HTML5 Integrated Remote Console**, as shown in [Figure 1](#fig15869135420288). - - **Figure 1** Remote console - ![](figures/remote-console.png "remote-console") - -2. On the virtual toolbar, click the startup item tool ![](figures/en-us_image_0000001144241932.png). The startup item drop-down list is displayed, as shown in [Figure 2](#fig744814574243). - - **Figure 2** Startup item tool - ![](figures/startup-item-tool.png "startup-item-tool") - -3. In the drop-down list, choose, select **BIOS Setup**, and click ![](figures/en-us_image_0000001190201999.png) on the toolbar to restart the server. -4. After the system restarts, the BIOS configuration screen is displayed. Choose **Advanced** \> **Socket Configuration**. See [Figure 3](#fig4546303814). - - **Figure 3** Socket Configuration - ![](figures/socket-configuration.png "socket-configuration") - -5. On the **Advanced Power Mgmt. Configuration** page displayed, set **Power Policy** to **Performance**, See [Figure 4](#fig15501111014442). - - **Figure 4** Setting the power policy - ![](figures/setting-the-power-policy.png "setting-the-power-policy") - -6. Press **F10** to save the settings and reboot the server. - -###### Setting the CPU Mode to Performance - -Perform the following steps as the **root** user: - -1. Run the following command to check the current CPU mode: - - ``` - cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor - ``` - - After the preceding command is run, the current CPU mode is displayed. For details, see [Table 1](#table354392019384). If the current CPU mode is not performance, perform the following operations to set the CPU mode to performance: Otherwise, skip this step. - - **Table 1** CPU mode - - - - - - - - - - - - - - - - - - - - - - - - - -

Governor

-

Description

-

performance

-

The CPU runs at the maximum frequency.

-

powersave

-

The CPU runs at the minimum frequency.

-

userspace

-

The CPU runs at a frequency specified by the user.

-

ondemand

-

The CPU frequency is dynamically adjusted as required. Once a task needs CPU computing power, the CPU runs at the maximum frequency. If the idle time increases, the CPU frequency decreases.

-

conservative

-

The CPU frequency is dynamically adjusted as required. The adjustment is more conservative than that of the ondemand mode.

-

schedutil

-

The CPU frequency is adjusted based on the scheduler.

-
- -2. Run the following command to install the tool: - - The **ubuntu/debian** system is used as an example. - - ``` - apt-get install linux-tools-$(uname -r) - ``` - - - The **centos/bclinux/euler** system is used as an example: - - ``` - yum install kernel-tools -y - systemctl daemon-reload - systemctl enable cpupower - systemctl start cpupower - ``` - -3. Sets the CPU mode to performance. - - ``` - cpupower frequency-set -g performance - ``` - -4. Perform [Step 1](#li158435131344) again to check whether the current CPU mode is set to performance. - -
Changing the CPU Performance Mode (ARM Server)
- -###### Setting the Power Policy to High Performance - -Some models that have demanding requirements on the CPUs on the host, for example, the object detection model, require complex image pre-processing. Enabling the high-performance mode of the power supply can improve performance and stability. To improve network performance, you need to set the power policy to high performance in the BIOS settings of the ARM server. The detailed operations are as follows: - -1. Log in to the iBMC WebUI, start the virtual console, and select **HTML5 Integrated Remote Console**, as shown in [Figure 1](#fig15869135420288). - - **Figure 1** Remote console - ![](figures/remote-console-0.png "remote-console-0") - -2. On the virtual toolbar, click the startup item tool ![](figures/en-us_image_0000001190202013.png). The startup item drop-down list is displayed, as shown in [Figure 2](#fig744814574243). - - **Figure 2** Startup item tool - ![](figures/startup-item-tool-1.png "startup-item-tool-1") - -3. In the drop-down list, select **BIOS Setup**, and click ![](figures/en-us_image_0000001190081877.png) on the toolbar to restart the server. -4. After the system restarts, the BIOS configuration screen is displayed. Choose **Advanced** \> **Performance Config**. See [Figure 3](#fig4546303814). - - **Figure 3** Performance Config - ![](figures/performance-config.png "performance-config") - -5. On the **Performance Config** page, set **Power Policy** to **Performance**. See [Figure 4](#fig15501111014442). - - **Figure 4** Setting the power policy - ![](figures/setting-the-power-policy-2.png "setting-the-power-policy-2") - -6. Press **F10** to save the settings and reboot the server. - -
Installing the High-Performance Pillow Library (x86 Server)
- -1. Run the following command to install the dependencies for the high-performance pillow library: - - Ubuntu/Debian: - - ``` - apt-get install libtiff5-dev libjpeg8-dev libopenjp2-7-dev zlib1g-dev libfreetype6-dev liblcms2-dev libwebp-dev tcl8.6-dev tk8.6-dev python3-tk libharfbuzz-dev libfribidi-dev libxcb1-dev - ``` - - CentOS/BC-Linux/EulerOS: - - ``` - yum install libtiff-devel libjpeg-devel openjpeg2-devel zlib-devel freetype-devel lcms2-devel libwebp-devel tcl-devel tk-devel harfbuzz-devel fribidi-devel libraqm-devel libimagequant-devel libxcb-devel - ``` - -2. Install the high-performance Pillow library. - 1. Run the following command to uninstall the native Pillow: - - ``` - pip3.7 uninstall -y pillow - ``` - - 2. Install the pillow-simd of the SSE4 version. - - Run the following command as the **root** user. If a non-root user is used, add **--user** to the end of the command. - - ``` - pip3.7 install pillow-simd - ``` - - >![](public_sys-resources/icon-note.gif) **NOTE:** - >If the CPU supports the AVX2 instruction set, run the following command to install pillow-simd of the AVX2 version: - >``` - >CC="cc -mavx2" pip3.7 install -U --force-reinstall pillow-simd - >``` - - -3. Modify the TorchVision code to solve the problem that the pillow-simd does not contain the **PILLOW\_VERSION** field. For details about how to install TorchVision, see [How to Obtain](#obtaining-samplesmd). - - Modify the code in line 5 of **/usr/local/python3._x.x_/lib/python3._x_/site-packages/torchvision/transforms/functional.py** as follows: - - ``` - try: - from PIL import Image, ImageOps, ImageEnhance,PILLOW_VERSION - except: - from PIL import Image, ImageOps, ImageEnhance - PILLOW_VERSION="7.0.0" - ``` - - -
(Optional) Installing the OpenCV Library of the Specified Version
- -If the model depends on OpenCV, you are advised to install OpenCV 3.4.10 to ensure training performance. - -1. Source code: [Link](https://opencv.org/releases/) -2. Installation guide: [Link](https://docs.opencv.org/3.4.10/d7/d9f/tutorial_linux_install.html) - -

Training Performance Optimization

- -##### Operator Bottleneck Optimization - -1. Obtain the profile data during training. For details, see [Profile Data Collection](#training-data-collectionmd). -2. Analyze the profile data to obtain the time-consuming operator. -3. See [Single-Operator Sample Building](#single-operator-sample-buildingmd) to build the single-operator sample of the time-consuming operator, and compare the execution time of a single-operator sample on the CPU and GPU. If the performance is insufficient, use either of the following methods to solve the problem: - - Workaround: Use other efficient operators with the same semantics. - - Solution: Improve the operator performance. - - -##### Copy Bottleneck Optimization - -1. Obtain the profile data during training. For details, see [Profile Data Collection](#training-data-collectionmd). -2. Analyze the Profile data to obtain the execution time of **D2DCopywithStreamSynchronize**, **PTCopy**, or **format\_contiguous** in the entire network. -3. If the execution takes a long time, use either of the following methods to solve the problem: - - Method 1 \(workaround\): Replace view operators with compute operators. In PyTorch, view operators cause conversion from non-contiguous tensors to contiguous tensors. The optimization idea is to replace view operators with compute operators. Common view operators include view, permute, and transpose operators. For more view operators, go to [https://pytorch.org/docs/stable/tensor\_view.html](https://pytorch.org/docs/stable/tensor_view.html). - - Method 2 \(solution\): Accelerate the operation of converting non-contiguous tensors to contiguous tensors. - - -##### Framework Bottleneck Optimization - -1. Obtain the operator information \(OP\_INFO\) during the training. For details, see [Obtaining Operator Information \(OP\_INFO\)](#training-data-collectionmd). -2. Analyze the specifications and calling relationship of operators in OP\_INFO to check whether redundant operators are inserted. Pay special attention to check whether transdata is proper. -3. Solution: Specify the initialization format of some operators to eliminate cast operators. -4. In **pytorch/torch/nn/modules/module.py**, specify the operator initialization format in **cast\_weight**, as shown in the following figure. - - ![](figures/指定算子初始化方式.png) - - The format setting principle is as follows: - - - For the Conv2D operator, weight can be set to FZ format, for example, line 424. - - For the linear operator, weight can be set to NZ format, for example, line 409. - - -##### Compilation Bottleneck Optimization - -1. Obtain the operator information \(OP\_INFO\) during the training. For details, see [Obtaining Operator Information \(OP\_INFO\)](#training-data-collectionmd). -2. View the INFO log and check the keyword **aclopCompile::aclOp** after the first step. If **Match op inputs/type failed** or **To compile op** is displayed, the operator is dynamically compiled and needs to be optimized. -3. Use either of the following methods to solve the problem: - - Workaround: Based on the understanding of model semantics and related APIs, replace dynamic shape with static shape. - - Solution: Reduce compilation or do not compile the operator. - - For details about how to optimize the operator compilation configuration, see [Compilation Option Settings](#compilation-option-settingsmd). - - -
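To illustrate the copy bottleneck described above, the following sketch shows how a view operator such as permute produces a non-contiguous tensor, so that a later call to contiguous() (or an operator that needs contiguous input) triggers a copy; on the NPU this copy shows up as PTCopy or format\_contiguous time in the profile data:

```python
import torch

x = torch.randn(32, 64, 128)

# permute is a view operator: no data is moved, but the result is non-contiguous.
y = x.permute(0, 2, 1)
print(y.is_contiguous())  # False

# Materializing the new layout copies the data.
z = y.contiguous()
print(z.is_contiguous())  # True
```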

Affinity Library

- -- **[Source](#sourcemd)** - -- **[Functions](#functionsmd)** - - -

Source

- -The common network structures and functions in the public models are optimized to greatly improve computing performance. In addition, the network structures and functions are integrated into the PyTorch framework to facilitate model performance optimization. - -

Functions

- - - - - - - - - - - - - - - - - - - - - - - - -

Function

-

Location

-

Description

-

pairwise_iou

-

torch.contrib.npu.optimized_lib

-

Calculates the IOUs of the two bounding boxes.

-

fast_rcnn_inference_single_image

-

torch.contrib.npu.optimized_lib

-

Provides the inference API of the Mask R-CNN and Faster R-CNN models.

-

ChannelShuffle

-

torch.contrib.npu.optimized_lib

-

Provides NPU-affinity channelshuffle operations and applies to models such as shufflenetv2.

-

PreLoader

-

torch.contrib.npu.optimized_lib

-

Provides the data loading method for accelerating Ascend AI Processors.

-
- ->![](public_sys-resources/icon-note.gif) **NOTE:** ->The optimization content will be enhanced and updated with the version. Use the content in the corresponding path of the actual PyTorch version. - -

Precision Commissioning

- -- **[Prerequisites](#prerequisites-1md)** - -- **[Commissioning Process](#commissioning-process-2md)** - - -

Prerequisites

- -Run a certain number of epochs \(20% of the total number of epoches is recommended\) with the same semantics and hyperparameters to align the precision and loss with the corresponding level of the GPU. After the alignment is complete, align the final precision. - -

Commissioning Process

- -- **[Overall Guideline](#overall-guideline-3md)** - -- **[Precision Tuning Methods](#precision-tuning-methodsmd)** - - -

Overall Guideline

- -To locate the precision problem, you need to find out the step in which the problem occurs. The following aspects are involved: - -1. Model network calculation error - - Locating method: Add a hook to the network to determine which part is suspected. Then build a [single-operator sample](#single-operator-sample-buildingmd) to narrow down the error range. This can prove that the operator calculation is incorrect in the current network. You can compare the result with the CPU or GPU result to prove the problem. - - - Workaround: Use other operators with the same semantics. - - - Solution: Improve the operator precision or function. - -2. Loss calculation error - - Locating method: The loss is special and can be customized. After determining that the loss calculation is incorrect, you are advised to dump the loss input in the network instead of a random tensor with the identical shape, so that the problem can be better reproduced and proved. - - - Workaround: Use other operators with the same semantics. - - - Solution: Improve the operator precision or function. \(Loss is also formed by operators.\) - -3. Parameter update error - - - Locating method: Before each **optim.step\(\)**, print the gradients of the parameters in the network one by one to determine which part is suspected. Then build a single-operator sample to narrow down the error range. This can prove that the gradient calculation by the operator is incorrect in the current network. You can compare the result with the CPU or GPU result to prove the problem. The priority of this item should be lower than that of items [1](#li17755175510322) and [2](#li25281726103316) because the errors of items 1 and 2 can also cause the gradient exception. - - - Workaround: Use other operators with the same semantics. - - - Solution: Improve the precision or function of the operator for gradient calculation. - -4. Multi-device calculation error - - - Locating method: When the precision of a single-device is ensured, multi-device calculation errors occur. - - - Solution: Contact Huawei support to provide the single-device script and multi-device script of stable reproduction. - - - -
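For step 3 above, a minimal sketch of printing per-parameter gradient statistics before the optimizer step is shown below; the linear model and loss are placeholders standing in for the real network:

```python
import torch
import torch.nn as nn

# Placeholders standing in for the real network and loss.
model = nn.Linear(8, 2)
loss = model(torch.randn(4, 8)).sum()
loss.backward()

# Print per-parameter gradient statistics before optim.step()
# to narrow down which part of the network computes a wrong gradient.
for name, param in model.named_parameters():
    if param.grad is not None:
        print(name, float(param.grad.abs().mean()), float(param.grad.abs().max()))
```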

Precision Tuning Methods

- -General model precision problems are as follows: training loss not converge or unqualified precision due to operator overflow/underflow; unqualified performance due to network-wide training. You can perform single-operator overflow/underflow detection and network-wide commissioning to resolve the preceding problems. - -- **[Single-Operator Overflow/Underflow Detection](#single-operator-overflow-underflow-detectionmd)** - -- **[Network-wide Commissioning](#network-wide-commissioningmd)** - - -
Single-Operator Overflow/Underflow Detection
- -With this function, you can check whether an operator overflows/underflows and collect data of overflowed/underflowed operators, helping developers quickly locate and solve operator precision problems. - -###### Restrictions - -- Install the HDF5 tool to support the operator dump function. For details about how to install the tool, see [HDF5 Compilation and Installation](#hdf5-compilation-and-installationmd). -- This function provides only IR-level operator overflow/underflow detection for only the AI Core \(not Atomic\). -- Add the **USE\_DUMP=1** field to the **build.sh** file of the PyTorch source code. - - ``` - Before the modification: DEBUG=0 USE_DISTRIBUTED=1 USE_HCCL=1 USE_MKLDNN=0 USE_CUDA=0 USE_NPU=1 BUILD_TEST=0 USE_NNPACK=0 python3 setup.py build bdist_wheel - After the modification: DEBUG=0 USE_DISTRIBUTED=1 USE_HCCL=1 USE_MKLDNN=0 USE_CUDA=0 USE_NPU=1 BUILD_TEST=0 USE_NNPACK=0 USE_DUMP=1 python3 setup.py build - ``` - - Recompile and install PyTorch by referring to "Manual Build and Installation" in the _PyTorch Installation Guide_. - -- When using the single-operator overflow/underflow detection function, do not enable the dynamic loss scale mode of apex and the tensor fusion function at the same time. - -###### Collecting Data of Overflowed/Underflowed Operators - -``` -# check_overflow is the overflow/underflow detection control switch. -# dump_path is the path for storing dump files. -with torch.utils.dumper(check_overflow=check_overflow, dump_path=dump_path, load_file_path='') as dump: - # Code snippet for detecting operator overflow/underflow. -``` - -During model running, if an operator overflows/underflows, the name of the corresponding IR is printed. - -###### Viewing Dump Data - -If dump data is collected during training, an .h5 file of the dump data is generated in the **\{dump\_path\}** directory. You can go to the directory to view the dump data. - -###### Solution - -Send the screenshots of operator overflow/underflow and the collected .h5 file to Huawei R&D engineers as the attachment of an issue. - -
Network-wide Commissioning
- -You can also commission the network model precision by analyzing the entire network. - -1. Determine whether the calculation on the Ascend AI Processor is correct by comparing the calculation result on the CPU and that on the Ascend AI Processor. - - Code example \(this example shows only the basic method and does not allow direct copy\): - - ``` - # The input parameters are fixed to ensure that the model and input data are the same on the CPU and Ascend AI Processor. - input_tensor_cpu = torch.Tensor() - model_cpu = build_model() - # Port the input data to the Ascend AI Processor. - input_tensor_npu = input_tensor_cpu.npu() - # Port the model to the Ascend AI Processor. - model_npu = model_cpu.npu() - - #Compare the calculation results. - output_cpu = model_cpu(input_tensor_cpu) - output_npu = model_npu(input_tensor_npu) - compute_result = (output_cpu - output_npu).abs().mean()) - print(compute_result) - ``` - - The calculation results are slightly different because the hardware architecture of the Ascend AI Processor is different from that of the CPU. If the calculation results are close \(generally not higher than 1e-4\), then they are normal. - -2. Use the hook mechanism of PyTorch to print the inputs and outputs of the module in the forward and backward propagation for analysis. - - Code example \(this example shows only the basic method and does not allow direct copy\): - - ``` - # Set the hook function. - def hook_func(name, module): - def hook_function(module, inputs, outputs): - print(name+' inputs', inputs) - print(name+' outputs', outputs) - return hook_function - - # Register the forward and backward hooks. - for name, module in model.named_modules(): - module.register_forward_hook(hook_func('[forward]: '+name, module)) - module.register_backward_hook(hook_func('[backward]: '+name, module)) - - # Run - model(input_tensor) - ``` - - Analyze the printed inputs and outputs in the forward and backward propagation. - -3. Obtain parameters such as **grad**, **running\_mean**, and **running\_var** of the module to analyze the updates. - - Code example \(this example shows only the basic method and does not allow direct copy\): - - ``` - # For example, obtain the gradient and average value of BN for check. - for name, module in model.named_modules(): - if isinstance(module, nn._BatchNorm): - print("[BN_buffer]: "+name, module.running_mean, module.running_var) - print("[grad]: "+name, module.grad) - ``` - - -

Model Saving and Conversion

- -- **[Introduction](#introduction-4md)** - -- **[Saving a Model](#saving-a-modelmd)** - -- **[Exporting an ONNX Model](#exporting-an-onnx-modelmd)** - - -

Introduction

- -After the model training is complete, save the model file and export the ONNX model by using the APIs provided by PyTorch. Then use the ATC tool to convert the model into an .om file that adapts to the Ascend AI Processor for offline inference. - -This section describes how to convert the trained .pth or .pth.tar file into the ONNX model. For details about how to convert the ONNX model into an .om file adapted to the Ascend AI Processor, see "ATC Tool Instructions" in the _CANN Auxiliary Development Tool User Guide _. - -For details about how to use the Auto Tune function, see "Auto Tune Instructions" in the _CANN Auxiliary Development Tool User Guide _. - -For details about how to build an offline inference application, see the _CANN Application Software Development Guide \(C and C++, Inference\)_. The process is as follows: - -![](figures/en-us_image_0000001144082132.png) - -
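As an illustration of the ONNX-to-.om step, a typical ATC invocation might look as follows. See "ATC Tool Instructions" for the authoritative option list; the file names and --soc_version value below are assumptions that must match your model and hardware:

```shell
# Illustrative ATC invocation; --framework=5 denotes an ONNX input model.
atc --framework=5 \
    --model=resnet50_official.onnx \
    --output=resnet50_official \
    --soc_version=Ascend910
```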

Saving a Model

- -During PyTorch training, **torch.save\(\)** is used to save checkpoint files. Based on the usage of model files, model files are saved in the following two formats: - -- .pth or .pt files: These files are used for online inference or exporting ONNX models. Only model parameters are saved, and the model structure is not saved, so that the compressed file can be opened using a visualization tool such as Netron. [Figure 1](#fig315704722610) shows an example. - - **Figure 1** .pth file - ![](figures/pth-file.jpg "pth-file") - - Use **state\_dict** to save and load a model. The following is an example: - - 1. Save a model. - - ``` - # Create a storage path. - PATH = "state_dict_model.pt" - # Save a model. - torch.save(net.state_dict(), PATH) - ``` - - 2. Load the model for online inference. The following is an example. For details, see the _PyTorch Online Inference Guide_. - - ``` - # Path for storing the model file - PATH = "state_dict_model.pt" - model = TheModelClass(*args, **kwargs) - # Load a model. - model.load_state_dict(torch.load(PATH)) - model.eval() - ``` - - >![](public_sys-resources/icon-notice.gif) **NOTICE:** - >The model definition file must be provided when the .pth or .pt file is saved. Otherwise, the deployment cannot be performed. - -- .pth.tar files: can be used for online inference or training after reloading. Multiple components are saved in dictionary format. Common components include the **state\_dict** of the model and optimizer, epoch when the training stops, training loss of the latest record, and the external torch.nn.Embedding layer. If only an inference model needs to be deployed, you are advised to save the weight information only, that is, the **state\_dict** of the model, in the .pth.tar file. - - The following is an example of saving and loading a model: - - 1. Save a model. - - ``` - PATH = "checkpoint.pth.tar" - torch.save({ - 'epoch': epoch, - 'loss': loss, - 'state_dict': model.state_dict(), - 'optimizer' : optimizer.state_dict(), - ... - }, PATH) - ``` - - 2. Load a model for inference or resuming training. - - ``` - model = TheModelClass(*args, **kwargs) - optimizer = TheOptimizerClass(*args, **kwargs) - - checkpoint = torch.load(PATH) - model.load_state_dict(checkpoint['model_state_dict']) - optimizer.load_state_dict(checkpoint['optimizer_state_dict']) - epoch = checkpoint['epoch'] - loss = checkpoint['loss'] - - model.eval() - # - or - - model.train() - ``` - - - ->![](public_sys-resources/icon-notice.gif) **NOTICE:** ->Generally, an operator is processed in different ways in the training graph and inference graph \(for example, BatchNorm and dropout operators\), and the input formats are also different. Therefore, before inference or ONNX model exporting, **model.eval\(\)** must be called to set the dropout and batch normalization layers to the inference mode. - -

Exporting an ONNX Model

#### Introduction

The deployment policy of the Ascend AI Processor for PyTorch models is implemented based on the ONNX module supported by PyTorch. ONNX is a mainstream model format in the industry and is widely used for model sharing and deployment. This section describes how to export a checkpoint file as an ONNX model by using the **torch.onnx.export\(\)** API.

#### Using the .pth or .pt File to Export the ONNX Model

A saved .pth or .pt file can be restored by building a model using PyTorch and then loading the weights. After that, you can export the ONNX model. The following is an example:

```
import torch
import torch.onnx
import torchvision.models as models
# Set the CPU to be used to export the model.
device = torch.device("cpu")

def convert():
    # The model definition comes from torchvision. The model file generated in this example is based on the ResNet-50 model.
    model = models.resnet50(pretrained=False)
    resnet50_model = torch.load('resnet50.pth', map_location='cpu')
    model.load_state_dict(resnet50_model)

    batch_size = 1  # Size of the batch processing
    input_shape = (3, 224, 224)  # Input shape. Replace it with the actual shape.

    # Set the model to inference mode.
    model.eval()

    dummy_input = torch.randn(batch_size, *input_shape)  # Define the input.
    torch.onnx.export(model,
                      dummy_input,
                      "resnet50_official.onnx",
                      input_names=["input"],    # Construct the input name.
                      output_names=["output"],  # Construct the output name.
                      opset_version=11,         # Currently, the ATC tool supports only opset_version=11.
                      dynamic_axes={"input": {0: "batch_size"}, "output": {0: "batch_size"}})  # Make the batch dimension dynamic.

if __name__ == "__main__":
    convert()
```

>![](public_sys-resources/icon-note.gif) **NOTE:**
>- Before exporting the ONNX model, **model.eval\(\)** must be called to set the dropout and batch normalization layers to inference mode.
>- The model in the sample script comes from the definition in the torchvision module. Specify your own model when using one.
>- The constructed input and output must correspond to the input and output used during training. Otherwise, the inference cannot be performed properly.

#### Using the .pth.tar File to Export the ONNX Model

Before exporting the ONNX model from a .pth.tar file, check the saved information. Sometimes the saved node names differ from the node names in the model definition; for example, a prefix or suffix may have been added. You can modify the node names during the conversion. The following is an example:

```
import torch
import torch.onnx
from collections import OrderedDict
import mobilenet

# In this example, the prefix "module." was added to the node names when the .pth.tar file was saved. Delete it by traversing the state dictionary.
def proc_nodes_module(checkpoint, AttrName):
    new_state_dict = OrderedDict()
    for key, value in checkpoint[AttrName].items():
        if key[0:7] == "module.":
            name = key[7:]
        else:
            name = key[0:]
        new_state_dict[name] = value
    return new_state_dict

def convert():
    checkpoint = torch.load("./mobilenet_cpu.pth.tar", map_location=torch.device('cpu'))
    checkpoint['state_dict'] = proc_nodes_module(checkpoint, 'state_dict')
    model = mobilenet.mobilenet_v2(pretrained=False)
    model.load_state_dict(checkpoint['state_dict'])
    model.eval()
    input_names = ["actual_input_1"]
    output_names = ["output1"]
    dummy_input = torch.randn(1, 3, 224, 224)
    torch.onnx.export(model, dummy_input, "mobilenetV2_npu.onnx", input_names=input_names, output_names=output_names, opset_version=11)

if __name__ == "__main__":
    convert()
```
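After either conversion, it can be worth sanity-checking the exported file before feeding it to the ATC tool. The following sketch uses the standard **onnx** Python package \(assumed to be installed separately; it is not part of this guide's required environment\); the file name matches the ResNet-50 example above.

```
import onnx

# Load the exported model and run the ONNX structural validator.
model = onnx.load("resnet50_official.onnx")
onnx.checker.check_model(model)  # Raises an exception if the graph is malformed.
print(onnx.helper.printable_graph(model.graph))  # Optional: inspect the graph.
```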

## Samples

- **[ResNet-50 Model Porting](#resnet-50-model-portingmd)**
- **[ShuffleNet Model Optimization](#shufflenet-model-optimizationmd)**

### ResNet-50 Model Porting

- **[Obtaining Samples](#obtaining-samplesmd)**
- **[Porting the Training Script](#porting-the-training-scriptmd)**
- **[Script Execution](#script-executionmd)**

#### Obtaining Samples

##### How to Obtain

1. This sample adapts the ImageNet training model provided on the PyTorch official website for porting to the Ascend 910 AI Processor. The sample can be obtained from [https://github.com/pytorch/examples/tree/master/imagenet](https://github.com/pytorch/examples/tree/master/imagenet).
2. This sample depends on torchvision. Therefore, install the torchvision dependency. If you install it as a non-root user, add **--user** to the end of the command.

    If the server runs in an x86 environment, run the following command:

    ```
    pip3.7 install torchvision==0.6.0 --no-deps
    ```

    If the server runs in an ARM environment, run the following command:

    ```
    pip3.7 install torchvision==0.2.2.post3 --no-deps
    ```

3. For details about the ResNet-50 model, go to [https://pytorch.org/hub/pytorch\_vision\_resnet/](https://pytorch.org/hub/pytorch_vision_resnet/). The model can be used in either of the following ways:
    1. Directly call the corresponding API. For example:

        ```
        import torchvision.models as models
        model = models.resnet50()
        ```

        >![](public_sys-resources/icon-note.gif) **NOTE:**
        >ResNet-50 is a model built in PyTorch. For more built-in models, visit the [PyTorch official website](https://pytorch.org/).

    2. During script execution, set **arch** to **resnet50**. This method is used in the sample. For details, see [Script Execution](#script-executionmd).

        ```
        --arch resnet50
        ```

##### Directory Structure

The structure of major directories and files is as follows:

```
├──main.py
```

#### Porting the Training Script

- **[Single-Device Training Modification](#single-device-training-modificationmd)**
- **[Distributed Training Modification](#distributed-training-modificationmd)**
##### Single-Device Training Modification
1. Add the following import to **main.py** so that models can be trained on the Ascend 910 AI Processor based on the PyTorch framework:

    ```
    import torch.npu
    ```

2. Add the following parameter after the import statements in **main.py** to specify that the Ascend 910 AI Processor is used for training:

    ```
    CALCULATE_DEVICE = "npu:1"
    ```

3. Modify the parameter and option so that training is performed only on the Ascend 910 AI Processor.

    Code location: **main\_worker\(\)** in **main.py** \(The changes are in bold.\)

    ```
    def main_worker(gpu, ngpus_per_node, args):
        global best_acc1
        # The original code specifies the GPU for training. The original code is as follows:
        # args.gpu = gpu
        ############## npu modify begin #############
        args.gpu = None
        ############## npu modify end #############
        if args.gpu is not None:
            print("Use GPU: {} for training".format(args.gpu))

        if args.distributed:
            if args.dist_url == "env://" and args.rank == -1:
                args.rank = int(os.environ["RANK"])
            if args.multiprocessing_distributed:
                # For multiprocessing distributed training, rank needs to be the
                # global rank among all the processes
                args.rank = args.rank * ngpus_per_node + gpu
            dist.init_process_group(backend=args.dist_backend, init_method=args.dist_url,
                                    world_size=args.world_size, rank=args.rank)
        # create model
        if args.pretrained:
            print("=> using pre-trained model '{}'".format(args.arch))
            model = models.__dict__[args.arch](pretrained=True)
        else:
            print("=> creating model '{}'".format(args.arch))
            model = models.__dict__[args.arch]()
        # The original code determines whether to perform training on the GPU. The code is as follows:
        # if not torch.cuda.is_available():
        #     print('using CPU, this will be slow')
        # elif args.distributed:
        ############## npu modify begin #############
        # After the migration, the code directly determines whether to perform distributed training and does not determine whether to perform training on the GPU.
        if args.distributed:
        ############## npu modify end #############
            # For multiprocessing distributed, DistributedDataParallel constructor
            # should always set the single device scope, otherwise,
            # DistributedDataParallel will use all available devices.
            if args.gpu is not None:
                ......
    ```

4. Migrate the model and loss function to the Ascend 910 AI Processor for calculation.

    Code location: **main\_worker\(\)** in **main.py** \(The changes are in bold.\)

    ```
    elif args.gpu is not None:
        torch.cuda.set_device(args.gpu)
        model = model.cuda(args.gpu)
    else:
        # DataParallel will divide and allocate batch_size to all available GPUs
        if args.arch.startswith('alexnet') or args.arch.startswith('vgg'):
            model.features = torch.nn.DataParallel(model.features)
            model.cuda()
        else:
            # The original code uses the torch.nn.DataParallel() class to accelerate training using multiple GPUs.
            # model = torch.nn.DataParallel(model).cuda()
            ############## npu modify begin #############
            # Migrate the model to the NPU for training.
            model = model.to(CALCULATE_DEVICE)
            ############## npu modify end #############
    # In the original code, the loss function is calculated on the GPU.
    # # define loss function (criterion) and optimizer
    # criterion = nn.CrossEntropyLoss().cuda(args.gpu)
    ############## npu modify begin #############
    # Migrate the loss function to the NPU for calculation.
    criterion = nn.CrossEntropyLoss().to(CALCULATE_DEVICE)
    ############## npu modify end #############
    ```

5. Change the data type of **target** in the dataset to **int32** to resolve the operator error, and migrate the data to the Ascend 910 AI Processor for calculation.

    Code location: **train\(\)** in **main.py** \(The changes are in bold.\)

    ```
    for i, (images, target) in enumerate(train_loader):
        # measure data loading time
        data_time.update(time.time() - end)

        if args.gpu is not None:
            images = images.cuda(args.gpu, non_blocking=True)
        # In the original code, the training dataset is loaded and calculated on the GPU. The original code is as follows:
        # if torch.cuda.is_available():
        #     target = target.cuda(args.gpu, non_blocking=True)
        ############## npu modify begin #############
        # Port the dataset to the NPU for calculation and modify the target data type to improve performance.
        if 'npu' in CALCULATE_DEVICE:
            target = target.to(torch.int32)
        images, target = images.to(CALCULATE_DEVICE, non_blocking=True), target.to(CALCULATE_DEVICE, non_blocking=True)
        ############## npu modify end #############
    ```

    Code location: **validate\(\)** in **main.py** \(The changes are in bold.\)

    ```
    with torch.no_grad():
        end = time.time()
        for i, (images, target) in enumerate(val_loader):
            if args.gpu is not None:
                images = images.cuda(args.gpu, non_blocking=True)
            # In the original code, the validation dataset is loaded and calculated on the GPU. The original code is as follows:
            # if torch.cuda.is_available():
            #     target = target.cuda(args.gpu, non_blocking=True)
            ############## npu modify begin #############
            # Port the dataset to the NPU for calculation and modify the target data type.
            if 'npu' in CALCULATE_DEVICE:
                target = target.to(torch.int32)
            images, target = images.to(CALCULATE_DEVICE, non_blocking=True), target.to(CALCULATE_DEVICE, non_blocking=True)
            ############## npu modify end #############
    ```

6. Set the device in use.

    Code location: main function entry in **main.py** \(The changes are in bold.\)

    ```
    if __name__ == '__main__':
        ############## npu modify begin #############
        if 'npu' in CALCULATE_DEVICE:
            torch.npu.set_device(CALCULATE_DEVICE)
        ############## npu modify end #############
        main()
    ```
##### Distributed Training Modification
1. Add the following imports to **main.py** to support mixed-precision model training on the Ascend 910 AI Processor based on the PyTorch framework:

    ```
    import torch.npu
    from apex import amp
    ```

2. Add the following parameters, including the parameters for specifying the Ascend 910 AI Processors involved in training and the parameters required for mixed-precision training:

    ```
    parser.add_argument('--device', default='npu', type=str, help='npu or gpu')
    parser.add_argument('--addr', default='10.136.181.115', type=str, help='master addr')
    parser.add_argument('--device-list', default='0,1,2,3,4,5,6,7', type=str, help='device id list')
    parser.add_argument('--amp', default=False, action='store_true', help='use amp to train the model')
    parser.add_argument('--loss-scale', default=1024., type=float,
                        help='static loss scale used in amp')
    parser.add_argument('--opt-level', default='O2', type=str,
                        help='optimization level used in amp, for example O1 or O2')
    ```

3. Create a mapping function from **device\_id** to **process\_id** and specify the device for training. Add the following function to **main.py**:

    ```
    def device_id_to_process_device_map(device_list):
        devices = device_list.split(",")
        devices = [int(x) for x in devices]
        devices.sort()

        process_device_map = dict()
        for process_id, device_id in enumerate(devices):
            process_device_map[process_id] = device_id

        return process_device_map
    ```

4. Specify the IP address and port number of the training server.

    Code location: main function **main\(\)** in **main.py** \(The changes are in bold.\)

    ```
    def main():
        args = parser.parse_args()
        ############## npu modify begin #############
        os.environ['MASTER_ADDR'] = args.addr
        os.environ['MASTER_PORT'] = '29688'
        ############## npu modify end #############
    ```

5. Use the mapping from **device\_id** to **process\_id** to obtain the number of Ascend 910 AI Processors on a single node.

    Code location: main function **main\(\)** in **main.py** \(The changes are in bold.\)

    ```
    args.distributed = args.world_size > 1 or args.multiprocessing_distributed
    ############## npu modify begin #############
    args.process_device_map = device_id_to_process_device_map(args.device_list)
    if args.device == 'npu':
        ngpus_per_node = len(args.process_device_map)
    else:
        ngpus_per_node = torch.cuda.device_count()
    ############## npu modify end #############
    # The original code is as follows:
    # ngpus_per_node = torch.cuda.device_count()
    ```

6. Obtain the ID of the Ascend 910 AI Processor corresponding to **process\_id** and specify the Ascend 910 AI Processor for training.

    Code location: **main\_worker\(\)** in **main.py** \(The changes are in bold.\)

    ```
    def main_worker(gpu, ngpus_per_node, args):
        global best_acc1
        ############## npu modify begin #############
        args.gpu = args.process_device_map[gpu]
        ############## npu modify end #############
        # The original code is as follows:
        # args.gpu = gpu
    ```

7. Initialize the process group and mask the initialization method.
    Code location: **main\_worker\(\)** in **main.py** \(The changes are in bold.\)

    ```
    ############## npu modify begin #############
    if args.device == 'npu':
        dist.init_process_group(backend=args.dist_backend, #init_method=args.dist_url,
                                world_size=args.world_size, rank=args.rank)
    else:
        dist.init_process_group(backend=args.dist_backend, init_method=args.dist_url,
                                world_size=args.world_size, rank=args.rank)
    ############## npu modify end #############
    # The original code is as follows:
    # dist.init_process_group(backend=args.dist_backend, init_method=args.dist_url,
    #                         world_size=args.world_size, rank=args.rank)
    ```

8. To perform distributed training, the mixed precision module needs to be introduced and the model needs to be ported to the Ascend AI Processor. Therefore, the code that checks whether training is distributed and whether the model is trained on the GPU needs to be masked.

    Code location: **main\_worker\(\)** in **main.py** \(The changes are in bold.\)

    ```
    # create model
    if args.pretrained:
        print("=> using pre-trained model '{}'".format(args.arch))
        model = models.__dict__[args.arch](pretrained=True)
    else:
        print("=> creating model '{}'".format(args.arch))
        model = models.__dict__[args.arch]()
    ############## npu modify begin #############
    # Add the following to the code:
    # Specify the Ascend AI Processor as the training device.
    loc = 'npu:{}'.format(args.gpu)
    torch.npu.set_device(loc)
    # Calculate batch_size and workers used for training.
    args.batch_size = int(args.batch_size / ngpus_per_node)
    args.workers = int((args.workers + ngpus_per_node - 1) / ngpus_per_node)
    ############## npu modify end #############
    # The original code is as follows. The code needs to be masked and is commented out.
    # if not torch.cuda.is_available():
    #     print('using CPU, this will be slow')
    # elif args.distributed:
    #     # For multiprocessing distributed, DistributedDataParallel constructor
    #     # should always set the single device scope, otherwise,
    #     # DistributedDataParallel will use all available devices.
    #     if args.gpu is not None:
    #         torch.cuda.set_device(args.gpu)
    #         model.cuda(args.gpu)
    #         # When using a single GPU per process and per
    #         # DistributedDataParallel, we need to divide the batch size
    #         # ourselves based on the total number of GPUs we have
    #         args.batch_size = int(args.batch_size / ngpus_per_node)
    #         args.workers = int((args.workers + ngpus_per_node - 1) / ngpus_per_node)
    #         model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu])
    #     else:
    #         model.cuda()
    #         # DistributedDataParallel will divide and allocate batch_size to all
    #         # available GPUs if device_ids are not set
    #         model = torch.nn.parallel.DistributedDataParallel(model)
    # elif args.gpu is not None:
    #     torch.cuda.set_device(args.gpu)
    #     model = model.cuda(args.gpu)
    # else:
    #     # DataParallel will divide and allocate batch_size to all available GPUs
    #     if args.arch.startswith('alexnet') or args.arch.startswith('vgg'):
    #         model.features = torch.nn.DataParallel(model.features)
    #         model.cuda()
    #     else:
    #         model = torch.nn.DataParallel(model).cuda()
    ```

9. Mask the loss function, optimizer, and checkpoint resuming; this part is combined with the mixed precision training later.

    Code location: **main\_worker\(\)** in **main.py** \(The changes are in bold.\)

    ```
    # The original code is masked and commented out.
    # # define loss function (criterion) and optimizer
    # criterion = nn.CrossEntropyLoss().cuda(args.gpu)
    #
    # optimizer = torch.optim.SGD(model.parameters(), args.lr,
    #                             momentum=args.momentum,
    #                             weight_decay=args.weight_decay)
    #
    # # optionally resume from a checkpoint
    # if args.resume:
    #     if os.path.isfile(args.resume):
    #         print("=> loading checkpoint '{}'".format(args.resume))
    #         if args.gpu is None:
    #             checkpoint = torch.load(args.resume)
    #         else:
    #             # Map model to be loaded to specified single gpu.
    #             loc = 'cuda:{}'.format(args.gpu)
    #             checkpoint = torch.load(args.resume, map_location=loc)
    #         args.start_epoch = checkpoint['epoch']
    #         best_acc1 = checkpoint['best_acc1']
    #         if args.gpu is not None:
    #             # best_acc1 may be from a checkpoint from a different GPU
    #             best_acc1 = best_acc1.to(args.gpu)
    #         model.load_state_dict(checkpoint['state_dict'])
    #         optimizer.load_state_dict(checkpoint['optimizer'])
    #         print("=> loaded checkpoint '{}' (epoch {})"
    #               .format(args.resume, checkpoint['epoch']))
    #     else:
    #         print("=> no checkpoint found at '{}'".format(args.resume))
    #
    # cudnn.benchmark = True
    ```

10. A data loader combines a dataset and a sampler and can use multiple workers to load the dataset. When the Ascend AI Processor is used for training, **pin\_memory** must be set to **False**. Currently, only training with static shapes is supported, and the number of remaining samples in the data flow may be less than the batch size, so **drop\_last** must be set to **True**. In addition, set **shuffle** to **True** for the validation data loader.

    Code location: **main\_worker\(\)** in **main.py** \(The changes are in bold.\)

    ```
    ############## npu modify begin #############
    train_loader = torch.utils.data.DataLoader(
        train_dataset, batch_size=args.batch_size, shuffle=(train_sampler is None),
        num_workers=args.workers, pin_memory=False, sampler=train_sampler, drop_last=True)

    val_loader = torch.utils.data.DataLoader(
        datasets.ImageFolder(valdir, transforms.Compose([
            transforms.Resize(256),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            normalize,
        ])),
        batch_size=args.batch_size, shuffle=True,
        num_workers=args.workers, pin_memory=False, drop_last=True)
    ############## npu modify end #############
    ```

11. Construct the loss function and optimizer, and port the model and loss function to the Ascend AI Processor. Combine the optimizer and model with the mixed precision module to enable mixed precision training, and combine the checkpoint resuming part with the mixed precision module as well.
- - Code location: after the data loading verification part of **main\_worker\(\)** in **main.py** \(The changes are in bold.\) - - ``` - val_loader = torch.utils.data.DataLoader( - datasets.ImageFolder(valdir, transforms.Compose([ - transforms.Resize(256), - transforms.CenterCrop(224), - transforms.ToTensor(), - normalize, - ])), - batch_size=args.batch_size, shuffle=True, - num_workers=args.workers, pin_memory=False, drop_last=True) - - ############## npu modify begin ############# - model = model.to(loc) - # define loss function (criterion) and optimizer - criterion = nn.CrossEntropyLoss().to(loc) - optimizer = torch.optim.SGD(model.parameters(), args.lr, - momentum=args.momentum, - weight_decay=args.weight_decay) - - if args.amp: - model, optimizer = amp.initialize(model, optimizer, opt_level=args.opt_level, loss_scale=args.loss_scale) - model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu]) - - # optionally resume from a checkpoint - if args.resume: - if os.path.isfile(args.resume): - print("=> loading checkpoint '{}'".format(args.resume)) - checkpoint = torch.load(args.resume, map_location=loc) - args.start_epoch = checkpoint['epoch'] - best_acc1 = checkpoint['best_acc1'] - model.load_state_dict(checkpoint['state_dict']) - optimizer.load_state_dict(checkpoint['optimizer']) - if args.amp: - amp.load_state_dict(checkpoint['amp']) - print("=> loaded checkpoint '{}' (epoch {})" - .format(args.resume, checkpoint['epoch'])) - else: - print("=> no checkpoint found at '{}'".format(args.resume)) - - cudnn.benchmark = True - ############## npu modify end ############# - ``` - -12. The checkpoint saving needs to be combined with the mixed precision training. The modification is as follows: - - Code location: **main\_worker\(\)** in **main.py** \(The changes are in bold.\) - - ``` - # remember best acc@1 and save checkpoint - is_best = acc1 > best_acc1 - best_acc1 = max(acc1, best_acc1) - - if not args.multiprocessing_distributed or (args.multiprocessing_distributed - and args.rank % ngpus_per_node == 0): - ############## npu modify begin ############# - if args.amp: - save_checkpoint({ - 'epoch': epoch + 1, - 'arch': args.arch, - 'state_dict': model.state_dict(), - 'best_acc1': best_acc1, - 'optimizer' : optimizer.state_dict(), - 'amp': amp.state_dict(), - }, is_best) - else: - save_checkpoint({ - 'epoch': epoch + 1, - 'arch': args.arch, - 'state_dict': model.state_dict(), - 'best_acc1': best_acc1, - 'optimizer' : optimizer.state_dict(), - }, is_best) - ############## npu modify end ############# - ``` - -13. During training, you need to migrate the dataset to the Ascend AI Processor. The modification is as follows: - - Code location: **train\(\)** in **main.py** \(The changes are in bold.\) - - ``` - for i, (images, target) in enumerate(train_loader): - # measure data loading time - data_time.update(time.time() - end) - ############## npu modify begin ############# - loc = 'npu:{}'.format(args.gpu) - target = target.to(torch.int32) - images, target = images.to(loc, non_blocking=False), target.to(loc, non_blocking=False) - ############## npu modify end ############# - # The original model code is as follows: - # if args.gpu is not None: - # images = images.cuda(args.gpu, non_blocking=True) - # if torch.cuda.is_available(): - # target = target.cuda(args.gpu, non_blocking=True) - ``` - -14. Mark the location where the backpropagation .backward\(\) occurs so that the mixed precision module can perform loss scaling and clear the status of each iteration. 
The code is as follows: - - Code location: **train\(\)** in **main.py** \(The changes are in bold.\) - - ``` - optimizer.zero_grad() - ############## npu modify begin ############# - if args.amp: - with amp.scale_loss(loss, optimizer) as scaled_loss: - scaled_loss.backward() - else: - loss.backward() - # The original code is as follows: - # loss.backward() - ############## npu modify end ############# - optimizer.step() - ``` - -15. Before verification, you need to migrate the dataset to be verified to the Ascend AI Processor. The modification is as follows: - - Code location: **validate\(\)** in **main.py** \(The changes are in bold.\) - - ``` - with torch.no_grad(): - end = time.time() - for i, (images, target) in enumerate(val_loader): - ############## npu modify begin ############# - loc = 'npu:{}'.format(args.gpu) - target = target.to(torch.int32) - images, target = images.to(loc, non_blocking=False), target.to(loc, non_blocking=False) - ############## npu modify end ############# - # The original model code is as follows: - # if args.gpu is not None: - # images = images.cuda(args.gpu, non_blocking=True) - # if torch.cuda.is_available(): - # target = target.cuda(args.gpu, non_blocking=True) - ``` - - -
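The modifications above assume that one training process is spawned per device, as in the original sample. For reference, the following is a minimal sketch of how **main\_worker\(\)** is typically launched from **main\(\)** through **torch.multiprocessing.spawn** in the upstream ImageNet example; treat it as an illustration of the launch flow rather than a required change, since the sample script already contains this logic. The names **parser**, **main\_worker**, and **device\_id\_to\_process\_device\_map** come from the steps above.

```
import os
import torch.multiprocessing as mp

def main():
    args = parser.parse_args()
    os.environ['MASTER_ADDR'] = args.addr  # Master node IP address (see step 4).
    os.environ['MASTER_PORT'] = '29688'
    args.process_device_map = device_id_to_process_device_map(args.device_list)
    ngpus_per_node = len(args.process_device_map)
    # World size counts processes across all nodes; each node spawns one
    # process per device, and each process runs main_worker(gpu, ...).
    args.world_size = ngpus_per_node * args.world_size
    mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
```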

#### Script Execution

##### Preparing a Dataset

Prepare a dataset and upload it to a directory in the operating environment, for example, **/home/data/resnet50/imagenet**.

##### Configuring Environment Variables

For details, see [Environment Variable Configuration](#en-us_topic_0000001144082004md).

##### Command

Example:

Single-device:

```
python3 main.py /home/data/resnet50/imagenet --batch-size 128 \  # Training batch size
    --lr 0.1 \             # Learning rate
    --epochs 90 \          # Number of training epochs
    --arch resnet50 \      # Model architecture
    --world-size 1 \
    --rank 0 \
    --workers 40 \         # Number of processes for loading data
    --momentum 0.9 \       # Momentum
    --weight-decay 1e-4    # Weight decay
```

Distributed:

```
python3 main.py /home/data/resnet50/imagenet --addr='1.1.1.1' \  # Example IP address. Replace it with the actual IP address.
    --seed 49 \                        # Random seed
    --workers 160 \                    # Number of processes for loading data
    --lr 0.8 \
    --print-freq 1 \
    --arch resnet50 \                  # Model architecture
    --dist-url 'tcp://127.0.0.1:50000' \
    --dist-backend 'hccl' \
    --multiprocessing-distributed \    # Multi-device training
    --world-size 1 \
    --batch-size 2048 \                # Training batch size
    --epochs 90 \                      # Number of training epochs
    --rank 0 \
    --device-list '0,1,2,3,4,5,6,7' \
    --amp                              # Use mixed precision for training.
```

>![](public_sys-resources/icon-note.gif) **NOTE:**
>**dist-backend** must be set to **hccl** to support distributed training on Ascend AI devices.

### ShuffleNet Model Optimization

- **[Obtaining Samples](#obtaining-samples-5md)**
- **[Model Evaluation](#model-evaluationmd)**
- **[Porting the Network](#porting-the-networkmd)**
- **[Commissioning the Network](#commissioning-the-networkmd)**

#### Obtaining Samples

##### How to Obtain

1. This sample adapts the ImageNet training model provided on the PyTorch official website for porting to the Ascend 910 AI Processor. The sample can be obtained from [https://github.com/pytorch/examples/tree/master/imagenet](https://github.com/pytorch/examples/tree/master/imagenet).
2. For details about the ShuffleNet model, see [ShuffleNet V2](https://pytorch.org/hub/pytorch_vision_shufflenet_v2/) on the PyTorch official website. Set the **arch** parameter to **shufflenet\_v2\_x1\_0** during script execution.

    ```
    --arch shufflenet_v2_x1_0
    ```

    >![](public_sys-resources/icon-note.gif) **NOTE:**
    >ShuffleNet is a model built in PyTorch. For more built-in models, visit the [PyTorch official website](https://pytorch.org/).

##### Directory Structure

The structure of major directories and files is as follows:

```
├──main.py
```

#### Model Evaluation

Model evaluation focuses on operator adaptation. Use the dump op method to obtain the ShuffleNet operator information and compare it with the operator list in the _PyTorch Operator Support_. If an operator is not supported, then in simple scenarios you can replace it with a similar supported operator or run it on the CPU to avoid the problem; in complex scenarios, operator development is required. For details, see the _PyTorch Operator Development Guide_. A sketch of the CPU-fallback workaround follows.
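The following is a minimal sketch of the CPU-fallback workaround mentioned above, using a hypothetical unsupported operator **torch.some\_op** \(a placeholder, not a real API\): move the input to the CPU, run the operator there, and move the result back to the NPU. This trades performance for correctness and is only meant as a stopgap until the operator is adapted.

```
import torch

def some_op_cpu_fallback(x):
    # x lives on the NPU; run the (hypothetically) unsupported operator on the CPU.
    out = torch.some_op(x.cpu())  # 'some_op' is a placeholder for the unsupported operator.
    return out.to(x.device)       # Move the result back to the NPU.
```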

#### Porting the Network

For details about how to port the training scripts, see [Single-Device Training Modification](#single-device-training-modificationmd) and [Distributed Training Modification](#distributed-training-modificationmd). During script execution, pass the **--arch shufflenet\_v2\_x1\_0** parameter.

#### Commissioning the Network

For details about how to commission the network, see [Commissioning Process](#commissioning-processmd). The check found that operators consumed too much time during ShuffleNet running. The following provides the time consumption data and the corresponding solutions.

##### Forward Check

The forward check record table is as follows:

**Table 1** Forward check

| No. | time (ms) | batch_size | Detail |
| --- | --- | --- | --- |
| 1 | 1100 | 512 | Replace channel_shuffle with channel_shuffle_index_select. |
| 2 | 600 | 512 | Perform the channel_shuffle_index_select operation twice to reduce the non-contiguous tensors caused by chunk. |
| 3 | 300 | 512 | Specify the concat output format as NCHW through the framework layer to eliminate excessive transdata. |
| 4 | 285 | 512 | Rectify the weight format. |
| 5 | 275 | 512 | Rectify the problem that the output format 5HD was not specified for DWCONV. |
The details are as follows:

- The native **torch.transpose\(x, 1, 2\).contiguous\(\)** uses the view operator transpose, which produces non-contiguous tensors, that is, the copy bottleneck described in [copy bottleneck optimization](#training-performance-optimizationmd). **channel\_shuffle\_index\_select** replaces the framework operator with a compute operator of the same semantics, reducing the time consumption.
- ShuffleNet V2 contains a large number of chunk operations, which are framework operators in PyTorch. Each one splits a tensor into several non-contiguous tensors of the same length, and converting non-contiguous tensors to contiguous tensors takes a long time. Therefore, a compute operator is used to eliminate the non-contiguous tensors. For details, see the copy bottleneck described in [copy bottleneck optimization](#training-performance-optimizationmd).
- During operator adaptation, the output format is set to the input format by default. However, Concat does not support the 5HD format whose C dimension is not an integral multiple of 16, so it converts such inputs to 4D for processing. In addition, Concat is followed by the GatherV2 operator, which supports only the 4D format. Therefore, the data format conversion process is 5HD \> 4D \> Concat \> 5HD \> 4D \> GatherV2 \> 5HD. The solution is to modify the Concat output format: when the C dimension is not an integral multiple of 16, the output format is specified as 4D. After the optimization, the conversion process becomes 5HD \> 4D \> Concat \> GatherV2 \> 5HD. For the ShuffleNet-specific method, see line 121 in **pytorch/aten/src/ATen/native/npu/CatKernelNpu.cpp**.
- Set the weight initialization format to avoid repeated transdata during calculation, as in the framework bottleneck described in [copy bottleneck optimization](#training-performance-optimizationmd).
- Rectify the output format of the DWCONV weight to avoid unnecessary conversion from 5HD to 4D.

##### Entire Network Check

The record table of the entire network check is as follows:

**Table 2** Entire network check

| No. | time (ms) | batch_size | Detail |
| --- | --- | --- | --- |
| 1 | 5500 | 512 | The index_add operation is performed by copying index to the CPU through the framework layer. |
| 2 | 4000 | 512 | Customize operators to pre-generate an index. |
| 3 | 1800 | 512 | Customize operators to combine index_add and chunk. |
| 4 | 885 | 512 | Add contiguous_with_gatherv2. |
| 5 | 3480 | 1024 | Modify batchsize. |
| 6 | 1650 | 1024 | Modify batchsize and contiguous_with_gatherv2. |
| 7 | 1424 | 1024 | Customize operators to combine cat, shuffle, and chunk to eliminate non-contiguous tensors. |
| 8 | 1360 | 1024 | Modify the format of the gradient transferred by ReluGrad through the framework layer. |
| 9 | 1300 | 1024 | Modify the backward propagation input format of IndexSelectFullImplementation. |
| 10 | 920 | 1024 | Modify amp O1. |
| 11 | 860 | 1024 | Modify amp O2. |
| 12 | 830 | 1024 | Eliminate the excessive transdata introduced by AXPY during BN parameter update. |
| 13 | 800 | 1024 | Cancel the stream synchronization among forward propagation, backward propagation, and parm_update. |
| 14 | 461 | 1024 | Optimize the GatherV2 operator for non-32-byte alignment scenarios. |
| 15 | 429 | 1024 | Optimize GatherV2 to GatherV3 in the ShuffleNet V2 scenario. |
The details are as follows:

1. Replace framework operators with compute operators.
2. Use a buffer to keep the index information on the NPU, and cancel the **index.to\(npu\)** copy operation.
3. Use compute operators to eliminate non-contiguous tensors.
4. Use the AI Core operator GatherV2 in **contiguous\_with\_gatherv2** to convert non-contiguous tensors to contiguous tensors.
5. Modify **batchsize**.
6. Modify **batchsize** and **contiguous\_with\_gatherv2**.
7. The chunk operator is the backward calculation mode of the Concat operator and may produce non-contiguous tensors. Therefore, the backward calculation mode of the Concat operator needs to be customized: combine cat, shuffle, and chunk, then replace chunk with GatherV2 to eliminate the non-contiguous tensors.
8. The ReluGrad operator has two inputs: **grad\_output** \(backward input\) and **self** \(forward output\). In ShuffleNet, the 4D and 5HD formats sometimes coexist, but the FE format is usually aligned with the format of the first tensor, so the following process occurs: \(4D, 5HD\) \> \(4D, 4D\) \> ReluGrad \> 4D \> 5HD. The forward output format is basically the input format, and ReLU is usually used together with Conv and BN, a scenario in which the 5HD output format is more suitable. Therefore, insert **npu\_format\_cast** manually so that the process becomes \(4D, 5HD\) \> \(5HD, 5HD\) \> ReluGrad \> 5HD.
9. In IndexSelectFullImplementation, the gatherv2 operation is performed twice on a 5HD tensor, so the conversion from 5HD to 4D is performed twice. By manually converting 5HD to 4D once, no transdata is needed during the gatherv2 operations, which saves one transdata operation.
10. Add the mixed precision O1.
11. Add the mixed precision O2.
12. Because of the parameter verification of the Axpy operator, whenever the C dimension of a network parameter is not exactly divisible by 16 during parameter update, the 4D Axpy operation is wrapped by transdata operators, which introduces a large number of them. To solve this problem, add a function that skips the verification when the Axpy input shapes are the same. This avoids the format conversion and improves the running efficiency.
13. Delete all the stream synchronization operations. This is not adopted because it easily causes non-convergence.
14. After using the GatherV2 operator optimized for non-alignment scenarios, the overall performance is improved to the delivery level.
15. After using the GatherV3 operator optimized for the ShuffleNet V2 scenario, the overall performance can be further improved.

##### Python Optimization Details

The optimization on the Python side rewrites parts of the network into equivalent semantics that have better affinity with the NPU. The current operations of converting non-contiguous tensors to contiguous tensors can be a performance bottleneck. In ShuffleNet V2, the **channel\_shuffle** operation involves such conversions after permute, degrading the performance of the entire network. The performance can be greatly improved by rewriting the **channel\_shuffle** operation with equivalent semantics and combining it with the concat operation. The torchvision version is used. For details, go to the [open source link](https://github.com/pytorch/vision/blob/master/torchvision/models/shufflenetv2.py).
- Original **channel\_shuffle** operation:

    ```
    def channel_shuffle(x, groups):
        # type: (torch.Tensor, int) -> torch.Tensor
        batchsize, num_channels, height, width = x.data.size()
        channels_per_group = num_channels // groups
        # reshape
        x = x.view(batchsize, groups,
                   channels_per_group, height, width)
        x = torch.transpose(x, 1, 2).contiguous()
        # flatten
        x = x.view(batchsize, -1, height, width)
        return x

    class InvertedResidual(nn.Module):
        def __init__(self, inp, oup, stride):
            super(InvertedResidual, self).__init__()
            if not (1 <= stride <= 3):
                raise ValueError('illegal stride value')
            self.stride = stride
            branch_features = oup // 2
            assert (self.stride != 1) or (inp == branch_features << 1)
            if self.stride > 1:
                self.branch1 = nn.Sequential(
                    self.depthwise_conv(inp, inp, kernel_size=3, stride=self.stride, padding=1),
                    nn.BatchNorm2d(inp),
                    nn.Conv2d(inp, branch_features, kernel_size=1, stride=1, padding=0, bias=False),
                    nn.BatchNorm2d(branch_features),
                    nn.ReLU(inplace=True),
                )
            else:
                self.branch1 = nn.Sequential()

            self.branch2 = nn.Sequential(
                nn.Conv2d(inp if (self.stride > 1) else branch_features,
                          branch_features, kernel_size=1, stride=1, padding=0, bias=False),
                nn.BatchNorm2d(branch_features),
                nn.ReLU(inplace=True),
                self.depthwise_conv(branch_features, branch_features, kernel_size=3, stride=self.stride, padding=1),
                nn.BatchNorm2d(branch_features),
                nn.Conv2d(branch_features, branch_features, kernel_size=1, stride=1, padding=0, bias=False),
                nn.BatchNorm2d(branch_features),
                nn.ReLU(inplace=True),
            )

        @staticmethod
        def depthwise_conv(i, o, kernel_size, stride=1, padding=0, bias=False):
            return nn.Conv2d(i, o, kernel_size, stride, padding, bias=bias, groups=i)

        def forward(self, x):
            if self.stride == 1:
                x1, x2 = x.chunk(2, dim=1)
                out = torch.cat((x1, self.branch2(x2)), dim=1)
            else:
                out = torch.cat((self.branch1(x), self.branch2(x)), dim=1)

            out = channel_shuffle(out, 2)

            return out
    ```

- Equivalent semantics rewriting:

    ```
    import numpy as np
    import torch

    def channel_shuffle_index_select(x, groups=2):
        N, C, H, W = x.shape
        inp = C
        # The channel_shuffle operation rearranges the C dimension according to fixed rules, so it can be expressed as a simple index rearrangement.
        group_len = inp // groups
        index = torch.from_numpy(np.array(list(range(inp))).reshape(groups, group_len).transpose(1, 0).flatten()).long()

        x = x.index_select(1, index)
        return x

    # Compare the results of the two operations. The semantics are the same.
    x = torch.randn(2, 232, 14, 14)
    for group in [2, 4, 8]:
        out1 = channel_shuffle(x, group)
        out2 = channel_shuffle_index_select(x, group)
        print((out1 - out2).sum())
    ```

- Affinity writing method for the Ascend AI Processor:

    ```
    # Corresponding to out = channel_shuffle(torch.cat((self.branch1(x), self.branch2(x)), dim=1))
    # Replace channel_shuffle with channel_shuffle_index_select.
    # Customize operators to combine channel_shuffle_index_select and cat, and use compute operators to reduce non-contiguous tensors.
    class IndexSelectFullImplementation(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x1, x2, fp_index, bp_index1, bp_index2):
            # Forcible stream synchronization, which is used only for training stabilization.
            stream = torch.npu.current_stream()
            stream.synchronize()

            # Register bp_index1 and bp_index2 with context so that they can be used in backward propagation.
- ctx.bp_index1 = bp_index1 - ctx.bp_index2 = bp_index2 - - x = torch.cat([x1, x2], dim=1) - - # Replace channel_shuffle with index_select. In this example, the chunk operator is not used. - result = x.index_select(1, fp_index) - - return result - - @staticmethod - def backward(ctx, grad_output): - # Forcible stream synchronization, which is used only for training stabilization. - stream = torch.npu.current_stream() - stream.synchronize() - - # Convert the format to NCHW to reduce extra transdata because index_select does not support the 5HD format. - grad_output.data = grad_output.data.npu_format_cast(0) - - # Use index_select to reverse index_select and cat based on the reverse expression obtained from forward derivation. - out1 = grad_output.index_select(1, ctx.bp_index1) - out2 = grad_output.index_select(1, ctx.bp_index2) - return out1, out2, None, None, None, None - - - class IndexSelectHalfImplementation(torch.autograd.Function): - @staticmethod - def forward(ctx, x1, x2, fp_index1, fp_index2, bp_index1, bp_index2): - ctx.bp_index1 = bp_index1 - ctx.bp_index2 = bp_index2 - x = torch.cat([x1, x2], dim=1) - - # Replace channel_shuffle with index_select. In this example, the chunk operator is used. - return x.index_select(1, fp_index1), x.index_select(1, fp_index2) - - @staticmethod - def backward(ctx, grad_output1, grad_output2): - grad_output = torch.cat([grad_output1, grad_output2], 1) - - out1 = grad_output.index_select(1, ctx.bp_index1) - out2 = grad_output.index_select(1, ctx.bp_index2) - return out1, out2, None, None, None, None - - - class Channel_Shuffle(nn.Module): - def __init__(self, inp, groups=2, split_shuffle=True): - super(Channel_Shuffle, self).__init__() - - self.split_shuffle = split_shuffle - self.group_len = inp // groups - - # Initialize fp_index to be used in channel_shuffle_index_select. - self.out = np.array(list(range(inp))).reshape(groups, self.group_len).transpose(1, 0).flatten().tolist() - - # Register the initialized fp_index as the buffer of the module. When to.device is called, the buffer is brought to the device to reduce the time consumed by host-to-device copy. - # This section describes only the common usage when the value of group is 2. Expand based on the actual scenario. - if self.split_shuffle: - self.register_buffer('fp_index1', torch.tensor(self.out[:self.group_len], dtype=torch.int32)) - self.register_buffer('fp_index2', torch.tensor(self.out[self.group_len:], dtype=torch.int32)) - else: - self.register_buffer('fp_index', torch.tensor(self.out, dtype=torch.int32)) - - # Register the corresponding bp_index as the buffer of the module. When to.device is called, the buffer is brought to the device to reduce the time consumed by host-to-device copy. 
- self.register_buffer('bp_index1', torch.tensor(list(range(0, inp, 2)), dtype=torch.int32)) - self.register_buffer('bp_index2', torch.tensor(list(range(1, inp, 2)), dtype=torch.int32)) - - def forward(self, x1, x2): - if self.split_shuffle: - return IndexSelectHalfImplementation.apply(x1, x2, self.fp_index1, self.fp_index2, self.bp_index1, - self.bp_index2) - else: - return IndexSelectFullImplementation.apply(x1, x2, self.fp_index, self.bp_index1, self.bp_index2) - - - class InvertedResidual(nn.Module): - def __init__(self, inp, oup, stride, split_shuffle=True): - super(InvertedResidual, self).__init__() - - if not (1 <= stride <= 3): - raise ValueError('illegal stride value') - self.stride = stride - - branch_features = oup // 2 - assert (self.stride != 1) or (inp == branch_features << 1) - - if self.stride > 1: - self.branch1 = nn.Sequential( - self.depthwise_conv(inp, inp, kernel_size=3, stride=self.stride, padding=1), - nn.BatchNorm2d(inp), - nn.Conv2d(inp, branch_features, kernel_size=1, stride=1, padding=0, bias=False), - nn.BatchNorm2d(branch_features), - nn.ReLU(inplace=True), - ) - else: - self.branch1 = nn.Sequential() - - self.branch2 = nn.Sequential( - nn.Conv2d(inp if (self.stride > 1) else branch_features, - branch_features, kernel_size=1, stride=1, padding=0, bias=False), - nn.BatchNorm2d(branch_features), - nn.ReLU(inplace=True), - self.depthwise_conv(branch_features, branch_features, kernel_size=3, stride=self.stride, padding=1), - nn.BatchNorm2d(branch_features), - nn.Conv2d(branch_features, branch_features, kernel_size=1, stride=1, padding=0, bias=False), - nn.BatchNorm2d(branch_features), - nn.ReLU(inplace=True), - ) - - if self.stride > 1: - self.channel_shuffle = Channel_Shuffle(inp=branch_features + branch_features, groups=2, - split_shuffle=split_shuffle) - else: - self.channel_shuffle = Channel_Shuffle(inp=inp, groups=2, split_shuffle=split_shuffle) - - @staticmethod - def depthwise_conv(i, o, kernel_size, stride=1, padding=0, bias=False): - return nn.Conv2d(i, o, kernel_size, stride, padding, bias=bias, groups=i) - - def forward(self, x): - - # Delete the concat and chunk operations and combine them into self.channel_shuffle for processing. - if self.stride == 1: - x1, x2 = x - x2 = self.branch2(x2) - else: - x1 = self.branch1(x) - x2 = self.branch2(x) - - out = self.channel_shuffle(x1, x2) - - return out - ``` - - -

## References

- **[Single-Operator Sample Building](#single-operator-sample-buildingmd)**
- **[Single-Operator Dump Method](#single-operator-dump-methodmd)**
- **[Common Environment Variables](#common-environment-variablesmd)**
- **[dump op Method](#dump-op-methodmd)**
- **[Compilation Option Settings](#compilation-option-settingsmd)**
- **[How Do I Install GCC 7.3.0?](#how-do-i-install-gcc-7-3-0md)**
- **[HDF5 Compilation and Installation](#hdf5-compilation-and-installationmd)**

### Single-Operator Sample Building

When a problem occurs in a model, it is costly to reproduce it in the entire network. You can build a single-operator sample to reproduce the precision or performance problem and then locate and solve it. A single-operator sample can be built in either of the following ways. For details about single-operator dump methods, see [Single-Operator Dump Method](#single-operator-dump-methodmd).

1. Build a single-operator sample test case. You can directly call the operator to reproduce the error scenario.

    The following is an example of building a single-operator sample of the max operator:

    ```
    import torch
    import copy
    from torch.testing._internal.common_utils import TestCase, run_tests
    class TestMax(TestCase):
        def cpu_op_exec(self, input1):
            # Call the operator.
            output = torch.max(input1)
            output = output.numpy()
            return output

        def npu_op_exec(self, input1):
            # Call the corresponding NPU operator and move the result back to the CPU.
            output = torch.max(input1)
            output = output.to('cpu')
            output = output.numpy()
            return output

        def test_max(self):
            input = torch.randn(10, 20)
            input = input.to(torch.int64)  # Convert the data type.
            input_cpu = copy.deepcopy(input)
            input_npu = copy.deepcopy(input).npu()

            output_cpu = self.cpu_op_exec(input_cpu)
            output_npu = self.npu_op_exec(input_npu)

            # Compare the calculation results of the CPU and NPU. prec is the allowed error.
            self.assertEqual(output_cpu, output_npu, prec=1e-4)

    if __name__ == '__main__':
        run_tests()
    ```

    >![](public_sys-resources/icon-note.gif) **NOTE:**
    >- Run the preceding code. If the reported error information is the same as that of the max operator in the model, the single-operator test case is successfully built.
    >- If no error is reported after the data type conversion code is commented out, it indicates that the max operator fails on the NPU only when the input is of type **torch.int64**.

2. Build a single-operator test case based on the context.

    Although this is a single-operator sample, sometimes it is not only a single operation but a scenario with context or a module with parameters. The module mode is the more common method. The following is an example of building a module that contains two operators:

    ```
    import copy
    import torch
    import torch.nn as nn
    from torch.testing._internal.common_utils import TestCase, run_tests

    class Model(nn.Module):
        def __init__(self, in_channels=1, hooks=False):
            super(Model, self).__init__()
            self.conv = nn.Conv2d(in_channels, in_channels*2, kernel_size=64)
            if hooks:
                self.conv.weight.register_hook(lambda grad: print(grad))
        def forward(self, x):
            out = self.conv(x)
            return out

    class TestConv2d(TestCase):
        def test_conv2d(self):

            model = Model(in_channels=16)

            # Add hooks to obtain the backward propagation result.
            # model = Model(in_channels=16, hooks=True)
            # Create an input tensor.
            input_tensor = torch.randn(4, 16, 64, 64)

            input_tensor_cpu = copy.deepcopy(input_tensor)
            out = model(input_tensor_cpu)
            loss = out.sum()
            loss.backward()
            cpuout = out

            # Run the model and input tensor on the NPU.
            torch.npu.set_device("npu:0")  # Set the running device first.
            model_npu = Model(in_channels=16).npu()
            input_tensor_npu = copy.deepcopy(input_tensor).npu()
            out = model_npu(input_tensor_npu)
            loss = out.sum()
            loss.backward()
            npuout = out
            # Determine whether the scenario is an error scenario based on the result.
            self.assertEqual(cpuout, npuout, prec=1e-4)

    if __name__ == '__main__':
        run_tests()
    ```

### Single-Operator Dump Method

#### Collecting Dump Data

Currently, the PyTorch adapted to Ascend AI Processors collects operator dump data through the init\_dump\(\), set\_dump\(\), and finalize\_dump\(\) interfaces in **torch.npu**. Call init\_dump\(\) to initialize the dump configuration, call set\_dump\(\) to pass in a configuration file that sets the dump parameters, and call finalize\_dump\(\) to end the dump. The following uses the add\_ operator as an example to describe how to collect dump data.

```
import torch
torch.npu.set_device("npu:0")
torch.npu.init_dump()
torch.npu.set_dump("/home/HwHiAiUser/dump.json")  # "/home/HwHiAiUser/dump.json" is the path of the configuration file. You can configure it as required.
a = torch.tensor([2, 2]).to("npu:0")
a.add_(1)
torch.npu.finalize_dump()
```

The configuration method of **dump.json** is as follows:

```
{
    "dump":
    {
        "dump_list": [],
        "dump_path": "/home/HwHiAiUser/dump/output",
        "dump_mode": "all",
        "dump_op_switch": "on"
    }
}
```

The fields of **dump.json** are described as follows:

| Field | Description |
| --- | --- |
| dump_list | Operator model whose data is to be dumped. Leave this parameter empty. |
| dump_path | Directory where dump data files are stored in the operating environment. The value can be an absolute path or a relative path. An absolute path starts with a slash \(/\), for example, **/home/HwHiAiUser/output**. A relative path starts with a directory name, for example, **output**. For example, if **dump_path** is set to **/home/HwHiAiUser/output**, the dump data files are generated under the **/home/HwHiAiUser/output** directory in the operating environment. |
| dump_mode | Dump data mode. **output** \(default\): dumps operator outputs only. **input**: dumps operator inputs only. **all**: dumps both operator inputs and outputs. |
| dump_op_switch | Dump data status of the single-operator model. **off** \(default\): disables dump for the single-operator model. **on**: enables dump for the single-operator model. |
#### Viewing Overflowed Data

The collected dump data is generated in the _\{dump\_path\}_**/**_\{time\}_**/**_\{deviceid\}_**/**_\{model\_id\}_**/**_\{data\_index\}_ directory, for example, **/home/HwHiAiUser/output/20200808163566/0/0**.

The fields in the dump data path and file names are described as follows:

- _dump\_path_: user-defined path for storing overflowed data, for example, **/home/HwHiAiUser/output**.
- _time_: timestamp \(for example, **20200808163566**\)
- _deviceid_: device ID
- _model\_id_: subgraph ID
- A dump file is named as _\{op\_type\}_._\{op\_name\}_._\{taskid\}_._\{stream\_id\}_._\{timestamp\}_. Any period \(.\), slash \(/\), backslash \(\\\), or space in the _op\_type_ or _op\_name_ field is replaced by an underscore \(\_\).

#### Parsing the Dump File of an Overflow Operator

1. Upload the _\{op\_type\}.\{op\_name\}.\{taskid\}.\{stream\_id\}.\{timestamp\}_ file to an environment with CANN installed.
2. Go to the path where the parsing script is stored. Assume that the installation directory of CANN is **/home/HwHiAiUser/Ascend**.

    ```
    cd /home/HwHiAiUser/Ascend/ascend-toolkit/latest/toolkit/tools/operator_cmp/compare
    ```

3. Run the **msaccucmp.pyc** script to convert the dump file into a NumPy file. The following is an example:

    ```
    python3 msaccucmp.pyc convert -d /home/HwHiAiUser/dump -out /home/HwHiAiUser/dumptonumpy -v 2
    ```

    >![](public_sys-resources/icon-note.gif) **NOTE:**
    >The **-d** option enables the conversion of a single dump file or all dump files in a path.

4. Use Python to save the NumPy data into a .txt file. The following is an example:

    ```
    $ python3
    >>> import numpy as np
    >>> a = np.load("/home/HwHiAiUser/dumptonumpy/Pooling.pool1.1147.1589195081588018.output.0.npy")
    >>> b = a.flatten()
    >>> np.savetxt("/home/HwHiAiUser/dumptonumpy/Pooling.pool1.1147.1589195081588018.output.0.txt", b)
    ```

    The dimension and **Dtype** information no longer exist in the .txt file. For details, visit the NumPy website.
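Step 4 converts one file at a time. When a directory contains many dump files, a short script along the lines of the following sketch can batch-convert them; the directory name matches the example above, and the script only wraps the documented **np.load**/**np.savetxt** calls.

```
import glob
import os
import numpy as np

# Convert every .npy file produced by msaccucmp.pyc into a flattened .txt file.
src_dir = "/home/HwHiAiUser/dumptonumpy"  # output directory used in step 3
for npy_path in glob.glob(os.path.join(src_dir, "*.npy")):
    data = np.load(npy_path).flatten()
    np.savetxt(npy_path.replace(".npy", ".txt"), data)
```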

### Common Environment Variables

1. Enables task delivery in multi-thread mode. When this function is enabled, the training performance of the entire network is improved in most cases.

    **export TASK\_QUEUE\_ENABLE=1**

2. Controls whether host logs are printed to **stdout** \(**1**: print to the screen; **0**: write to log files\).

    **export ASCEND\_SLOG\_PRINT\_TO\_STDOUT=0**

3. Sets the log level. Log levels in descending order are: debug \> info \> warning \> error \> null. Generally, the log level is set to **error**; **info** is used for debugging. For details about how to set the log level, see the _CANN Log Reference_.

    **export ASCEND\_GLOBAL\_LOG\_LEVEL=3**

4. Dumps the graph, which is used to view the graph structure.

    **export DUMP\_GE\_GRAPH=2**

    **export DUMP\_GRAPH\_LEVEL=3**

5. Enables/Disables the event log function.

    **export ASCEND\_GLOBAL\_EVENT\_ENABLE=0**

6. Enables/Disables PTCopy.

    **export PTCOPY\_ENABLE=1**

7. Enables/Disables the combined flag.

    **export COMBINED\_ENABLE=1**

8. Sets whether to recompile the code in special scenarios. You do not need to modify this parameter.

    **export DYNAMIC\_OP="ADD\#MUL"**

9. Enables/Disables the HCCL trustlist.

    **export HCCL\_WHITELIST\_DISABLE=1**

### dump op Method

1. Use the profile API to wrap the loss calculation and optimization process of the original training script and print the operator information. The following is a code example:

    ```
    with torch.autograd.profiler.profile() as prof:
        out = model(input_tensor)
        loss = out.sum()
        loss.backward()
    # You can also export the result to a file.
    print(prof.key_averages().table(sort_by="self_cpu_time_total"))
    ```

2. Run the reconstructed training script on the CPU. The related operator information is displayed.

### Compilation Option Settings

Configure the attributes of an operator during compilation to improve performance. This is implemented through ACL APIs. The usage and explanation are as follows:

```
import torch
option = {key: val}
torch.npu.set_option(option) # Set in dict mode.

The key options are as follows:
ACL_OP_SELECT_IMPL_MODE, // Sets the operator implementation mode (high-precision or high-performance).
ACL_OPTYPELIST_FOR_IMPLMODE, // Lists operator types. Operators on the list are implemented in the mode specified by ACL_OP_SELECT_IMPL_MODE.
ACL_OP_DEBUG_LEVEL, // Enables TBE operator debug during operator compilation.
ACL_DEBUG_DIR, // Sets the debug directory, for saving the files generated during model conversion and network migration, including the .o, .json, and .cce files of operators. The directory must exist.
ACL_OP_COMPILER_CACHE_MODE, // Sets the disk cache mode for operator compilation.
ACL_OP_COMPILER_CACHE_DIR, // Sets the path of the disk cache for operator compilation. The path must exist.

The key values are as follows:
ACL_OP_SELECT_IMPL_MODE: Sets the operator implementation mode (high-precision or high-performance). If this option is not set, high_precision is used by default.
    high_precision: All operators in the network are implemented with high precision.
    high_performance: All operators in the network are implemented with high performance.

ACL_OPTYPELIST_FOR_IMPLMODE: Sets the implementation mode of the operators in the optype list. Currently, this parameter can set the implementation mode of only one operator, such as Pooling, SoftmaxV2, LRN, or ROIAlign. Operators in the operator type list use the mode specified by ACL_OP_SELECT_IMPL_MODE.

ACL_OP_DEBUG_LEVEL: Enables TBE operator debug during operator compilation.
    0: Disables operator debug. The operator binary file (.o) and operator description file (.json) are not retained in the kernel_meta folder in the atc command execution directory.
    1: Enables operator debug. TBE instruction mapping files, including an operator CCE file (*.cce) and a Python-CCE mapping file (*_loc.json), are generated in the kernel_meta folder under the atc command execution directory. You can locate AI Core errors by using tools.
    2: Enables operator debug. In addition to the files generated at level 1, build optimization is disabled and CCE compiler debug is enabled (by setting -O0 -g). You can locate AI Core errors by using tools.
    3: Disables operator debug. However, the operator binary file (.o) and operator description file (.json) are retained in the kernel_meta folder in the atc command execution directory.
    4: Disables operator debug. The operator binary file (.o) and operator description file (.json) are retained, and a TBE instruction mapping file (.cce) and a UB fusion description file ({$kernel_name}_compute.json) are generated in the kernel_meta folder under the atc command execution directory.

ACL_DEBUG_DIR: Sets the debug directory for saving the debug-related files generated during model conversion and network migration, including the .o, .json, and .cce files of operators.

ACL_OP_COMPILER_CACHE_MODE: Configures the disk cache mode for operator compilation. This compilation option must be used together with ACL_OP_COMPILER_CACHE_DIR.
    enable: operator compilation cache enabled.
    disable: operator compilation cache disabled.
- force: cache forcibly refreshed. That is, the existing cache is deleted, recompiled, and then added to the cache. When the Python or dependency library of a user changes, you need to use force to clear the existing cache. - -ACL_OP_COMPILER_CACHE_DIR: Configures the disk cache directory for operator compilation. This compilation option must be used together with ACL_OP_COMPILER_CACHE_MODE. -``` - -
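As an illustration, the following is a minimal sketch (assuming an environment where the NPU-adapted PyTorch is installed and the cache directory already exists; the path is a placeholder) that selects the high-performance mode and enables the compilation disk cache:

```python
import torch
import torch.npu

# Placeholder path: ACL_OP_COMPILER_CACHE_DIR requires an existing directory.
option = {
    "ACL_OP_SELECT_IMPL_MODE": "high_performance",  # implement all operators with high performance
    "ACL_OP_COMPILER_CACHE_MODE": "enable",         # reuse compiled operators across runs
    "ACL_OP_COMPILER_CACHE_DIR": "/home/user/op_cache",
}
torch.npu.set_option(option)  # set in dict mode, as described above
```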

## How Do I Install GCC 7.3.0?

Perform the following steps as the **root** user.

1. Download **gcc-7.3.0.tar.gz** from [https://mirrors.tuna.tsinghua.edu.cn/gnu/gcc/gcc-7.3.0/gcc-7.3.0.tar.gz](https://mirrors.tuna.tsinghua.edu.cn/gnu/gcc/gcc-7.3.0/gcc-7.3.0.tar.gz).

2. GCC installation requires adequate temporary space. Run the following command to clear the **/tmp** directory in advance:

   ```
   sudo rm -rf /tmp/*
   ```

3. Install dependencies.

   For CentOS/BCLinux, run the following command:

   ```
   yum install bzip2
   ```

   For Ubuntu/Debian, run the following command:

   ```
   apt-get install bzip2
   ```

4. Build and install GCC.

   1. Go to the directory where the source package **gcc-7.3.0.tar.gz** is located and run the following command to decompress it:

      ```
      tar -zxvf gcc-7.3.0.tar.gz
      ```

   2. Go to the extracted directory and run the following command to download the GCC dependency packages:

      ```
      cd gcc-7.3.0
      ./contrib/download_prerequisites
      ```

      If an error is reported during the command execution, run the following commands in the **gcc-7.3.0/** directory to download the dependency packages:

      ```
      wget http://gcc.gnu.org/pub/gcc/infrastructure/gmp-6.1.0.tar.bz2
      wget http://gcc.gnu.org/pub/gcc/infrastructure/mpfr-3.1.4.tar.bz2
      wget http://gcc.gnu.org/pub/gcc/infrastructure/mpc-1.0.3.tar.gz
      wget http://gcc.gnu.org/pub/gcc/infrastructure/isl-0.16.1.tar.bz2
      ```

      After the preceding dependencies are downloaded, run the following command again:

      ```
      ./contrib/download_prerequisites
      ```

      If the validation fails, check whether any dependency package was downloaded more than once. Each package must be downloaded exactly once.

   3. Run the following commands for configuration, build, and installation.

      ```
      ./configure --enable-languages=c,c++ --disable-multilib --with-system-zlib --prefix=/usr/local/linux_gcc7.3.0
      make -j15    # Check the number of CPUs by running grep -w processor /proc/cpuinfo|wc -l. In this example, the number is 15.
      make install
      ```

      >![](public_sys-resources/icon-caution.gif) **CAUTION:**
      >The **--prefix** option specifies the linux_gcc7.3.0 installation path and is configurable. Do not set it to **/usr/local** or **/usr**, which is the default installation path of the GCC installed from the software source. Otherwise, a conflict occurs and the original GCC compilation environment of the system is damaged. In this example, the installation path is set to **/usr/local/linux_gcc7.3.0**.

5. Set the environment variable.

   Training must be performed in an environment with the upgraded GCC. Before running training, configure the following environment variable in your training script:

   ```
   export LD_LIBRARY_PATH=${install_path}/lib64:${LD_LIBRARY_PATH}
   ```

   **${install_path}** indicates the GCC 7.3.0 installation path configured in [3](#en-us_topic_0000001173199577_en-us_topic_0000001172534867_en-us_topic_0276688294_li1649343041310). In this example, it is **/usr/local/linux_gcc7.3.0/**.

   >![](public_sys-resources/icon-note.gif) **NOTE:**
   >Skip this step if you do not need to use the compilation environment with GCC upgraded.

## HDF5 Compilation and Installation

Perform the following steps as the **root** user.

1. Obtain the code.

   ```
   git clone https://github.com/HDFGroup/hdf5.git
   ```

2. Switch to the hdf5_1_10_7 branch.

   ```
   cd hdf5
   git checkout remotes/origin/hdf5_1_10_7
   ```

3. Compile and install HDF5.

   ```
   ./configure --prefix=/usr/local/hdf5 --enable-cxx
   make -j72             # The value following -j can be set based on the number of CPU cores.
   make check            # Run the test suite.
   make install
   make check-install    # Verify the installation.
   ```

4. Add environment variables.

   ```
   export PATH=/usr/local/hdf5/bin:$PATH
   export LD_LIBRARY_PATH=/usr/local/hdf5/lib:$LD_LIBRARY_PATH
   export LIBRARY_PATH=/usr/local/hdf5/lib:$LIBRARY_PATH
   export CPATH=/usr/local/hdf5/include:$CPATH
   ```

## FAQs

- **[FAQs About Software Installation](#faqs-about-software-installationmd)**
- **[FAQs About Model and Operator Running](#faqs-about-model-and-operator-runningmd)**
- **[FAQs About Model Commissioning](#faqs-about-model-commissioningmd)**
- **[FAQs About Other Operations](#faqs-about-other-operationsmd)**
- **[FAQs About Distributed Model Training](#faqs-about-distributed-model-trainingmd)**

### FAQs About Software Installation

- **[pip3.7 install Pillow==5.3.0 Installation Failed](#pip3-7-install-pillow-5-3-0-installation-failedmd)**

#### pip3.7 install Pillow==5.3.0 Installation Failed

##### Symptom

Installing Pillow 5.3.0 by running **pip3.7 install pillow==5.3.0** fails.

##### Possible Causes

Necessary dependencies are missing, such as libjpeg, python-devel, zlib-devel, and libjpeg-turbo-devel.

##### Solution

Run the following commands to install the dependencies:

- CentOS/EulerOS/Tlinux/BClinux/Suse

  **yum install libjpeg python-devel zlib-devel libjpeg-turbo-devel**

- Ubuntu/Debian/UOS (the development packages have different names under apt)

  **apt-get install libjpeg-dev python3-dev zlib1g-dev**

### FAQs About Model and Operator Running

- **[What Do I Do If the Error Message "RuntimeError: ExchangeDevice:" Is Displayed During Model or Operator Running?](#what-do-i-do-if-the-error-message-runtimeerror-exchangedevice-is-displayed-during-model-or-operatormd)**
- **[What Do I Do If the Error Message "Error in atexit._run_exitfuncs:" Is Displayed During Model or Operator Running?](#what-do-i-do-if-the-error-message-error-in-atexit-_run_exitfuncs-is-displayed-during-model-or-operatmd)**
- **[What Do I Do If the Error Message "terminate called after throwing an instance of 'c10::Error' what(): HelpACLExecute:" Is Displayed During Model Running?](#what-do-i-do-if-the-error-message-terminate-called-after-throwing-an-instance-of-c10-error-what-hemd)**
- **[What Do I Do If the Error Message "terminate called after throwing an instance of 'c10::Error' what(): 0 INTERNAL ASSERT" Is Displayed During Model Running?](#what-do-i-do-if-the-error-message-terminate-called-after-throwing-an-instance-of-c10-error-what-0md)**
- **[What Do I Do If the Error Message "ImportError: libhccl.so." Is Displayed During Model Running?](#what-do-i-do-if-the-error-message-importerror-libhccl-so-is-displayed-during-model-runningmd)**
- **[What Do I Do If the Error Message "RuntimeError: Initialize." Is Displayed During Model Running?](#what-do-i-do-if-the-error-message-runtimeerror-initialize-is-displayed-during-model-runningmd)**
- **[What Do I Do If the Error Message "TVM/te/cce error." Is Displayed During Model Running?](#what-do-i-do-if-the-error-message-tvm-te-cce-error-is-displayed-during-model-runningmd)**
- **[What Do I Do If the Error Message "MemCopySync:drvMemcpy failed." Is Displayed During Model Running?](#what-do-i-do-if-the-error-message-memcopysync-drvmemcpy-failed-is-displayed-during-model-runningmd)**
- **[What Do I Do If the Error Message "HelpACLExecute." Is Displayed After Multi-Task Delivery Is Disabled (export TASK_QUEUE_ENABLE=0) During Model Running?](#what-do-i-do-if-the-error-message-helpaclexecute-is-displayed-after-multi-task-delivery-is-disabledmd)**
- **[What Do I Do If the Error Message "55056 GetInputConstDataOut: ErrorNo: -1(failed)" Is Displayed During Model Running?](#what-do-i-do-if-the-error-message-55056-getinputconstdataout-errorno--1failed-is-displayed-duringmd)**

#### What Do I Do If the Error Message "RuntimeError: ExchangeDevice:" Is Displayed During Model or Operator Running?

##### Symptom

![](figures/faq1.png)

##### Possible Causes

Currently, only one NPU device can be called in a thread. The preceding error occurs when the thread switches between different NPU devices.

##### Solution

Check the code and ensure that **torch.npu.set_device(device)**, **tensor.to(device)**, and **model.to(device)** use the same device name in one thread. For multiple threads (such as multi-device training), each thread must call a fixed NPU device.
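The following sketch illustrates the consistent usage described above (the device ID and the small model are placeholders):

```python
import torch
import torch.nn as nn
import torch.npu

device = "npu:0"                    # one fixed device for this thread
torch.npu.set_device(device)        # bind the current thread to that device

model = nn.Linear(8, 2).to(device)  # move the model to the same device
x = torch.randn(4, 8).to(device)    # and every tensor it touches
y = model(x)                        # runs entirely on npu:0
```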

#### What Do I Do If the Error Message "Error in atexit._run_exitfuncs:" Is Displayed During Model or Operator Running?

##### Symptom

![](figures/faq2.png)

##### Possible Causes

If no NPU device is specified by **torch.npu.device(id)** during torch initialization, device 0 is used by default. If another NPU device is then used directly, for example, a tensor is created on device 1, the preceding error occurs during running.

##### Solution

Before calling an NPU device, specify the NPU device by using **torch.npu.set_device(device)**.
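A short sketch of the fix (the device ID is chosen for illustration): select the device before creating any tensor on it.

```python
import torch
import torch.npu

torch.npu.set_device("npu:1")  # specify the device first
t = torch.ones(2, 2).npu()     # now allocated on npu:1 instead of the default device 0
print(t.device)
```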

#### What Do I Do If the Error Message "terminate called after throwing an instance of 'c10::Error' what(): HelpACLExecute:" Is Displayed During Model Running?

##### Symptom

![](figures/faq3.png)

##### Possible Causes

Currently, the HelpACLExecute error cannot be directly located. In this case, an error is reported when the task is delivered. Because multi-thread delivery of tasks is enabled (**export TASK_QUEUE_ENABLE=1**), the error information is encapsulated at the upper layer, so more detailed error logs cannot be obtained.

##### Solution

You can resolve this exception by using either of the following methods:

- Check the host error log. The default log path is **/var/log/npu/slog/host-0/**. Search for the log file whose name is prefixed with **host-0** based on the time identifier, open the log file, and search for error information using the keyword **ERROR**.
- Disable multi-thread delivery (**export TASK_QUEUE_ENABLE=0**) and run the code again. Generally, you can locate the fault based on the error information reported by the terminal.

#### What Do I Do If the Error Message "terminate called after throwing an instance of 'c10::Error' what(): 0 INTERNAL ASSERT" Is Displayed During Model Running?

##### Symptom

```
import torch

npu = "npu"

def test_cpu():
    input = torch.randn(2000, 1000).detach().requires_grad_()
    output = torch.sum(input)
    output.backward(torch.ones_like(output))

def test_npu():
    input = torch.randn(2000, 1000).detach().requires_grad_().npu()
    output = torch.sum(input)
    output.backward(torch.ones_like(output))

if __name__ == "__main__":
    test_cpu()
    torch.npu.set_device(f"{npu}:1")
    test_npu()
```

The following error message is displayed after code execution.

![](figures/en-us_image_0000001208897433.png)

##### Possible Causes

In this script, the device is manually set by using the **set_device()** method only after the backward operation has been performed. If the device is not set when the backward operation runs, the program automatically initializes the device to **0** by default, that is, **set_device("npu:0")** is executed. Currently, the device cannot be switched for calculation, so manually switching to another device afterwards by using the **set_device()** method causes this error.

##### Solution

Manually set the device by using the **set_device()** method before performing any calculation, including the backward operation. The modification is as follows:

```
if __name__ == "__main__":
    torch.npu.set_device(f"{npu}:1")
    test_cpu()
    test_npu()
```

#### What Do I Do If the Error Message "ImportError: libhccl.so." Is Displayed During Model Running?

##### Symptom

![](figures/faq7.png)

##### Possible Causes

The released PyTorch installation package uses the NPU and HCCL functions by default. Therefore, you need to add the path of the HCCL module to the environment variables when calling the PyTorch installation package. The error message "can not find libhccl.so" indicates that the HCCL library file cannot be found.

##### Solution

Add the path of the HCCL module to the environment variables. Generally, the path of the HCCL library file is **.../fwkacllib/python/site-packages/hccl** in the installation package.

#### What Do I Do If the Error Message "RuntimeError: Initialize." Is Displayed During Model Running?

##### Symptom

![](figures/faq9.png)

##### Possible Causes

According to the error information, it is preliminarily determined that an error occurs during the initialization of the NPU device. The error information in the host log is as follows:

![](figures/faq9-1.png)

The log information indicates that an error is reported when the system starts the NPU device.

##### Solution

To solve the problem, perform the following steps:

1. Restart the server and all NPU devices.
   - If the problem is resolved, no further action is required.
   - If the problem persists, go to [2](#li77121667913).
2. Check whether the driver version matches the firmware version.
   - If no, go to [3](#li967615545918).
   - If yes, go to [4](#li475615212912).
3. Ensure that the driver version matches the firmware version.
   - If the problem is resolved, no further action is required.
   - If the problem persists, go to [4](#li475615212912).
4. Contact Huawei technical support personnel.

#### What Do I Do If the Error Message "TVM/te/cce error." Is Displayed During Model Running?

##### Symptom

![](figures/faq10.png)

##### Possible Causes

Calling an NPU operator in PyTorch strongly depends on the TE, CCE, and TVM components. The PyTorch, CANN/NNAE, and TE versions must match. After CANN/NNAE is updated, components such as TE are not automatically updated. When their versions do not match, this error is reported.

##### Solution

Update the versions of components such as TE. The **te-\*.whl** and **topi-\*.whl** installation packages need to be updated. The **topi-0.4.0-py3-none-any.whl** and **te-0.4.0-py3-none-any.whl** packages are located in the **lib64** subdirectory of the CANN or NNAE installation directory (if the installation user is **root**, the default directory is **/usr/local/Ascend/ascend-toolkit/latest/lib64**). Go to that directory and run the **pip3 install --upgrade topi-0.4.0-py3-none-any.whl** and **pip3 install --upgrade te-0.4.0-py3-none-any.whl** commands.

![](figures/faq10-1.png)

#### What Do I Do If the Error Message "MemCopySync:drvMemcpy failed." Is Displayed During Model Running?

##### Symptom

Script:

```
import torch

def test_sum():
    xs_shape = [22400, 8]
    ys_shape = [22400, 8]
    gt_bboxes_shape = [22400, 8, 4]
    xs = torch.rand(xs_shape).npu()
    ys = torch.rand(ys_shape).npu()
    gt_bboxes = torch.rand(gt_bboxes_shape).npu().half()
    left = xs - gt_bboxes[..., 0]
    right = gt_bboxes[..., 2] - xs
    top = ys - gt_bboxes[..., 1]
    bottom = gt_bboxes[..., 3] - ys
    # stream = torch.npu.current_stream()
    # stream.synchronize()
    # left, top: fp32, right, bottom: fp16,
    # print(left.dtype, top.dtype, right.dtype, bottom.dtype)
    bbox_targets = torch.stack((left, top, right, bottom), -1)  # Error reported here
    # stream.synchronize()

    bbox_targets = torch.sum(bbox_targets)
```

Shell error message:

```
RuntimeError: Run:/usr1/workspace/PyTorch_Apex_Daily_c20tr5/CODE/aten/src/ATen/native/npu/utils/OpParamMaker.h:280 NPU error,NPU error code is:500002
[ERROR] RUNTIME(160809)kernel task happen error, retCode=0x28, [aicpu timeout].
[ERROR] RUNTIME(160809)aicpu kernel execute failed, device_id=0, stream_id=512, task_id=24, fault so_name=, fault kernel_name=, extend_info=.
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/torch/__init__.py", line 429, in _npu_shutdown
    torch._C._npu_shutdown()
RuntimeError: npuSynchronizeDevice:/usr1/workspace/PyTorch_Apex_Daily_c20tr5/CODE/c10/npu/NPUStream.cpp:806 NPU error, error code is 0
```

Log message:

```
[ERROR] RUNTIME(12731,python3.7):2021-02-02-22:23:56.475.679 [../../../../../../runtime/feature/src/npu_driver.cc:1408]12828 MemCopySync:drvMemcpy failed: dst=0x108040288000, destMax=1240, src=0x7fe7649556d0, size=1240, kind=1, drvRetCode=17!
[ERROR] RUNTIME(12731,python3.7):2021-02-02-22:23:56.475.698 [../../../../../../runtime/feature/src/logger.cc:113]12828 KernelLaunch:launch kernel failed, kernel=140631803535760/ArgMinWithValue_tvmbin, dim=32, stream=0x55b22b3def50
[ERROR] RUNTIME(12731,python3.7):2021-02-02-22:23:56.475.717 [../../../../../../runtime/feature/src/api_c.cc:224]12828 rtKernelLaunch:ErrCode=207001, desc=[module new memory error], InnerCode=0x70a0002
```

##### Possible Causes

The shell error message does not match the log message.

The shell error message indicates that the error occurs on the AI CPU during synchronization, whereas the log message indicates that the error occurs on the min operator (an internal call of ArgMinWithValue_tvmbin). The two error messages do not match. Generally, this mismatch occurs because the error information in the log is generated with a delay.

The probable cause is that the AI CPU operator is executed asynchronously, so the error information is delayed.

##### Solution

Perform the following steps to locate the fault based on the actual error information:

1. Disable multi-task operator delivery (**export TASK_QUEUE_ENABLE=0**). The result remains unchanged, which indicates that the error occurs before the errors shown in the shell and log messages.
2. Perform stream synchronization based on the error information to narrow down the error range and locate the error operator. Stream synchronization forces all calculations before the synchronization point to be complete, so the error can be located precisely.
3. The error operator is determined to be stack.
4. Print the shape, dtype, and npu_format of all stack input parameters and construct a single-operator case to reproduce the problem. The cause is that the data types of the inputs of the subtraction operations are different. As a result, the data types of the a - b and b - a results are different, and an error is reported in the stack operator.
5. Convert the input parameters of stack to the same data type to work around the problem.
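The workaround in step 5 can be sketched as follows (the choice of float16 as the common dtype is illustrative):

```python
import torch

xs = torch.rand(22400, 8).npu()                   # float32
gt_bboxes = torch.rand(22400, 8, 4).npu().half()  # float16

left = xs - gt_bboxes[..., 0]
right = gt_bboxes[..., 2] - xs
# Cast both inputs to one dtype so that the stack operator
# receives consistent input types.
bbox_targets = torch.stack((left.half(), right.half()), -1)
```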


#### What Do I Do If the Error Message "HelpACLExecute." Is Displayed After Multi-Task Delivery Is Disabled (export TASK_QUEUE_ENABLE=0) During Model Running?

##### Symptom

![](figures/faq8.png)

##### Possible Causes

The PyTorch operator runs on the NPU and calls the optimized operators at the bottom layer through the AscendCL API. Because the error information and logs reported at the upper layer are still being optimized, when errors occur in some operators, only the "HelpACLExecute." message is reported and detailed error information cannot be obtained.

##### Solution

View the host log to determine the operator and location where the error is reported. The default log path is **/var/log/npu/slog/host-0**. Search for the **ERROR** field in the log file of the corresponding time to find the error information. For the preceding error, the **ERROR** field in the log is as follows:

![](figures/faq8-1.png)

The error information in the log indicates that the error operator is topKD and the error cause is "The number of attrs in op desc and op store does not match." Therefore, it is determined that the parameters of the topKD operator do not match.

Locate the topKD operator in the model code and check whether it can be replaced by another operator. If it can, use the replacement solution and report the operator error information to Huawei engineers. If it cannot, contact Huawei technical support.

#### What Do I Do If the Error Message "55056 GetInputConstDataOut: ErrorNo: -1(failed)" Is Displayed During Model Running?

##### Symptom

During model training, the following error information may be displayed in the host training log (directory: **/root/ascend/log/plog/**):

![](figures/20210720-102720(welinkpc).png)

##### Possible Causes

A public API is called.

##### Solution

The error information does not affect the training function and performance and can be ignored.

### FAQs About Model Commissioning

- **[What Do I Do If the Error Message "RuntimeError: malloc:/..../pytorch/c10/npu/NPUCachingAllocator.cpp:293 NPU error, error code is 500000." Is Displayed During Model Commissioning?](#what-do-i-do-if-the-error-message-runtimeerror-malloc-pytorch-c10-npu-npucachingallocator-cpp-293-npmd)**
- **[What Do I Do If the Error Message "RuntimeError: Could not run 'aten::trunc.out' with arguments from the 'NPUTensorId' backend." Is Displayed During Model Commissioning?](#what-do-i-do-if-the-error-message-runtimeerror-could-not-run-aten-trunc-out-with-arguments-from-themd)**
- **[What Do I Do If the MaxPoolGradWithArgmaxV1 and max Operators Report Errors During Model Commissioning?](#what-do-i-do-if-the-maxpoolgradwithargmaxv1-and-max-operators-report-errors-during-model-commissionimd)**
- **[What Do I Do If the Error Message "ModuleNotFoundError: No module named 'torch._C'" Is Displayed When torch Is Called?](#what-do-i-do-if-the-error-message-modulenotfounderror-no-module-named-torch-_c-is-displayed-when-tormd)**

#### What Do I Do If the Error Message "RuntimeError: malloc:/..../pytorch/c10/npu/NPUCachingAllocator.cpp:293 NPU error, error code is 500000." Is Displayed During Model Commissioning?

##### Symptom

![](figures/faq4.png)

##### Possible Causes

For the malloc error in **NPUCachingAllocator**, the possible cause is that the required video memory is larger than the video memory available on the NPU.

##### Solution

During model commissioning, decrease the value of the **batch size** parameter to reduce the video memory occupied on the NPU.

#### What Do I Do If the Error Message "RuntimeError: Could not run 'aten::trunc.out' with arguments from the 'NPUTensorId' backend." Is Displayed During Model Commissioning?

##### Symptom

![](figures/faq5.png)

##### Possible Causes

Currently, the NPU supports only some PyTorch operators. The preceding error is reported when an unsupported operator is used. More operators are being developed. For details about the supported operators, see [PyTorch Native Operators](https://support.huaweicloud.com/intl/en-us/opl-pytorch/atlasptol_09_0001.html).

##### Solution

Replace the unsupported operator with an equivalent operator that is supported by the NPU, or develop the operator. For details, see the *PyTorch Operator Developer Guide*.

#### What Do I Do If the MaxPoolGradWithArgmaxV1 and max Operators Report Errors During Model Commissioning?

##### Symptom

![](figures/faq6.png)

![](figures/faq6-1.png)

##### Possible Causes

During model building, the operator input parameters are diversified. For some operators (such as MaxPoolGradWithArgmaxV1 and max) with specific parameters, an error is reported during calculation or the operators are not supported. You can locate the operators based on the error information.

##### Solution

Locate the operators based on the error information and perform the following steps:

1. Check whether the call mode and parameters of the operators in the model are correct.
2. Build a single-operator case based on the error operators to construct the error scenario.
3. Generally, operator errors cannot be resolved in Python. Construct the error scenario, post it in the forum, and ask for help from Huawei engineers.

   >![](public_sys-resources/icon-note.gif) **NOTE:**
   >Pay special attention to the input parameters **shape** and **dtype**, which are the main causes of operator errors.

In the preceding figures, the error information indicates that the MaxPoolGradWithArgmaxV1 and max operators report the errors. MaxPoolGradWithArgmaxV1 reports the error during backward propagation, so construct a backward scenario. The max operator reports the error during forward propagation, so construct a forward scenario, as in the sketch below.

If an operator error is reported in the model, you are advised to build a single-operator test case and determine the error scenario and cause. If the error cannot be reproduced with a standalone single-operator case, construct a single-operator scenario that includes the surrounding context. For details about how to build a test case, see [Single-Operator Sample Building](#single-operator-sample-buildingmd).
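For reference, the following is a sketch of a standalone forward case for the max operator; the shape and dtype are placeholders that you should replace with the values printed from the failing model:

```python
import torch
import torch.npu

torch.npu.set_device("npu:0")

# Reproduce the forward max call in isolation with the suspect shape/dtype.
x = torch.randn(16, 64, 32, 32, dtype=torch.float16).npu()
values, indices = torch.max(x, dim=1)
print(values.shape, indices.dtype)
```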

#### What Do I Do If the Error Message "ModuleNotFoundError: No module named 'torch._C'" Is Displayed When torch Is Called?

##### Symptom

![](figures/faq11.png)

##### Possible Causes

In the preceding figure, the error path is **.../code/pytorch/torch/__init__.py**, while the current working directory is **.../code/pytorch**. When the **import torch** command is executed, Python searches the current directory first, so the local **torch** source folder is imported instead of the torch package installed in the system directory. Because the local source folder does not contain the compiled **torch._C** module, the error is reported.

##### Solution

Switch to another directory and run the script.

### FAQs About Other Operations

- **[What Do I Do If an Error Is Reported During CUDA Stream Synchronization?](#what-do-i-do-if-an-error-is-reported-during-cuda-stream-synchronizationmd)**
- **[What Do I Do If aicpu_kernels/libpt_kernels.so Does Not Exist?](#what-do-i-do-if-aicpu_kernels-libpt_kernels-so-does-not-existmd)**
- **[What Do I Do If the Python Process Is Residual When the npu-smi info Command Is Used to View Video Memory?](#what-do-i-do-if-the-python-process-is-residual-when-the-npu-smi-info-command-is-used-to-view-video-mmd)**
- **[What Do I Do If the Error Message "match op inputs failed" Is Displayed When the Dynamic Shape Is Used?](#what-do-i-do-if-the-error-message-match-op-inputs-failed-is-displayed-when-the-dynamic-shape-is-usedmd)**
- **[What Do I Do If the Error Message "Op type SigmoidCrossEntropyWithLogitsV2 of ops kernel AIcoreEngine is unsupported" Is Displayed?](#what-do-i-do-if-the-error-message-op-type-sigmoidcrossentropywithlogitsv2-of-ops-kernel-aicoreenginemd)**
- **[What Do I Do If a Hook Failure Occurs?](#what-do-i-do-if-a-hook-failure-occursmd)**
- **[What Do I Do If the Error Message "load state_dict error." Is Displayed When the Weight Is Loaded?](#what-do-i-do-if-the-error-message-load-state_dict-error-is-displayed-when-the-weight-is-loadedmd)**

#### What Do I Do If an Error Is Reported During CUDA Stream Synchronization?

##### Symptom

![](figures/model_faq11_20210728.jpg)

##### Possible Causes

The script calls CUDA stream synchronization, which is not supported on the NPU.

##### Solution

Use NPU stream synchronization instead:

```
stream = torch.npu.current_stream()
stream.synchronize()
```

#### What Do I Do If aicpu_kernels/libpt_kernels.so Does Not Exist?

##### Symptom

![](figures/faq13.png)

##### Possible Causes

The AI CPU package path is not set.

##### Solution

Set the AI CPU path. (The following assumes that the CANN software package is installed in the default installation path as the **root** user.)

```
export ASCEND_AICPU_PATH=/usr/local/Ascend/ascend-toolkit/latest
```

#### What Do I Do If the Python Process Is Residual When the npu-smi info Command Is Used to View Video Memory?

##### Symptom

![](figures/faq14.png)

##### Possible Causes

Residual Python training processes have not exited and still occupy video memory.

##### Solution

Kill the residual Python processes. Note that the following command kills all Python processes on the server:

```
pkill -9 python
```

#### What Do I Do If the Error Message "match op inputs failed" Is Displayed When the Dynamic Shape Is Used?

##### Symptom

![](figures/faq15.png)

##### Possible Causes

The operator compiled by **PTIndexPut** does not match the input shape. Because the log starting with **acl_dynamic_shape_op** is displayed, it is determined that the error is reported for the dynamic shape.

##### Solution

**PTIndexPut** corresponds to **tensor[indices] = value**. Locate this statement in the code and change the dynamic shape to a fixed shape.
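One common fixed-shape rewrite, shown here as a sketch (whether it is applicable depends on how the assignment result is used), replaces the boolean-mask assignment with **torch.where**, which keeps every tensor at its full static shape:

```python
import torch

x = torch.randn(8, 8).npu()
mask = x < 0

# Dynamic-shape form: the number of selected elements varies per step.
# x[mask] = 0.0

# Static-shape equivalent: all tensors keep the fixed shape (8, 8).
x = torch.where(mask, torch.zeros_like(x), x)
```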

#### What Do I Do If the Error Message "Op type SigmoidCrossEntropyWithLogitsV2 of ops kernel AIcoreEngine is unsupported" Is Displayed?

##### Symptom

```
[ERROR] GE(24836,python3.7):2021-01-27-18:27:51.562.111 [../../../../../../graphengine/ge/engine_manager/dnnengine_manager.cc:266]25155 GetDNNEngineName: ErrorNo: 1343242282(assign engine failed) GetDNNEngineName:Op type SigmoidCrossEntropyWithLogitsV2 of ops kernel AIcoreEngine is unsupported, reason:Op SigmoidCrossEntropyWithLogitsV2 not supported reason: The type of this op is not found in op store, check whether the op store has this type of op. Op store name is tbe-custom.
The dtype, format or shape of input in op desc is not supported in op store, check the dtype, format or shape of input between the op store and the graph. Op store name is tbe-builtin.
```

##### Possible Causes

An input data type is not supported by the SigmoidCrossEntropyWithLogitsV2 operator. The possible cause is that an input of the int64 type is passed in.

##### Solution

Check the data types of the inputs in the Python code and cast unsupported types (such as int64) to a supported type.
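For example, if an int64 label tensor is fed to a sigmoid cross-entropy loss, the cast could look like the following sketch (the loss call and tensor names are illustrative):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 10).npu()
labels = torch.randint(0, 2, (4, 10)).npu()  # int64 by default

# Cast the labels to float32 so that the underlying NPU operator supports them.
loss = F.binary_cross_entropy_with_logits(logits, labels.float())
```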

#### What Do I Do If a Hook Failure Occurs?

##### Symptom

```
Traceback (most recent call last):
  File "tools/train.py", line 227, in <module>
    main()
  File "tools/train.py", line 221, in main
    meta=meta)
  File "/root/YoloV3/mmdetection/mmdet/apis/train.py", line 192, in train_detector
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 166, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 50, in train
    self.run_iter(data_batch, train_mode=True)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 30, in run_iter
    outputs = self.model.train_step(data_batch, self.optimizer, **kwargs)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/mmcv/parallel/data_parallel.py", line 100, in train_step
    return self.module.train_step(*inputs[0], **kwargs[0])
  File "/root/YoloV3/mmdetection/mmdet/models/detectors/base.py", line 251, in train_step
    losses = self(**data)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/torch/nn/modules/module.py", line 660, in __call__
    var = next((v for v in var.values() if isinstance(v, torch.Tensor)))
StopIteration
```

##### Possible Causes

The loss structure of mmdet triggers a bug in the native hook mechanism of PyTorch, leading to an infinite loop.

##### Solution

Wrap the block starting at line 658 of the **/usr/local/python3.7.5/lib/python3.7/site-packages/torch/nn/modules/module.py** file in a **try** statement so that the failure is skipped:

```
if len(self._backward_hooks) > 0:
    var = result
    try:
        while not isinstance(var, torch.Tensor):
            if isinstance(var, dict):
                var = next((v for v in var.values() if isinstance(v, torch.Tensor)))
            else:
                var = var[0]
        grad_fn = var.grad_fn
        if grad_fn is not None:
            for hook in self._backward_hooks.values():
                wrapper = functools.partial(hook, self)
                functools.update_wrapper(wrapper, hook)
                grad_fn.register_hook(wrapper)
    except Exception as e:
        print('hook failed..')
        print(str(e))
return result
```

#### What Do I Do If the Error Message "load state_dict error." Is Displayed When the Weight Is Loaded?

##### Symptom

![](figures/faq18.png)

![](figures/faq18-1.png)

##### Possible Causes

The key values of the **state_dict** saved after model training are different from the key values of the **state_dict** expected when the model is loaded. When the model was saved, a **module** prefix was added to the beginning of each key.

##### Solution

When loading the weights, traverse the **state_dict** dictionary, modify the key values, and use the new dictionary. For details about the test case, see **demo.py**.

The script is as follows:

```
ckpt = torch.load("checkpoint.pth", map_location=loc)
# model.load_state_dict(ckpt['state_dict'])
state_dict_old = ckpt['state_dict']
state_dict = {}
for key, value in state_dict_old.items():
    key = key[7:]  # Strip the 7-character "module." prefix added when the model was saved.
    state_dict[key] = value
model.load_state_dict(state_dict)
```

### FAQs About Distributed Model Training

- **[What Do I Do If the Error Message "host not found." Is Displayed During Distributed Model Training?](#what-do-i-do-if-the-error-message-host-not-found-is-displayed-during-distributed-model-trainingmd)**
- **[What Do I Do If the Error Message "RuntimeError: connect() timed out." Is Displayed During Distributed Model Training?](#what-do-i-do-if-the-error-message-runtimeerror-connect-timed-out-is-displayed-during-distributed-mmd)**

#### What Do I Do If the Error Message "host not found." Is Displayed During Distributed Model Training?

##### Symptom

![](figures/faq19.png)

##### Possible Causes

During distributed model training, the Huawei Collective Communication Library (HCCL) is invoked. You need to set the IP address and port number based on the site requirements. The error information indicates that the IP address is incorrect.

##### Solution

Set the correct IP address in the running script. If a single server is deployed, set the IP address to the IP address of that server. If multiple servers are deployed, set the IP address in the script on each server to the IP address of the active node.
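In a training script, this typically looks like the following sketch (the address is a placeholder for the active node's real IP address; the port number follows the porting example later in this guide):

```python
import os

os.environ['MASTER_ADDR'] = '192.168.1.100'  # placeholder: IP address of the active node
os.environ['MASTER_PORT'] = '29688'          # communication port, as in the porting example
```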

#### What Do I Do If the Error Message "RuntimeError: connect() timed out." Is Displayed During Distributed Model Training?

##### Symptom

![](figures/1234.png)

##### Possible Causes

During distributed model training, the system firewall may block the communication of the HCCL port. Check whether the communication port is enabled based on the error information and perform related settings.

##### Solution

Query the HCCL port that is blocked by the system firewall and enable the port.

# PyTorch Network Model Porting and Training Guide

- [PyTorch Network Model Porting and Training Guide](#pytorch-network-model-porting-and-training-guide)
  - [Overview](#overview)
    - [Solution Features and Advantages](#solution-features-and-advantages)
  - [Restrictions and Limitations](#restrictions-and-limitations)
  - [Porting Process](#porting-process)
  - [Quick Start](#quick-start)
    - [Introduction](#introduction)
    - [Model Selection](#model-selection)
    - [Model Porting Evaluation](#model-porting-evaluation)
    - [Environment Setup](#environment-setup)
    - [Model Porting](#model-porting)
      - [Single-Device Training Porting](#single-device-training-porting)
      - [Single-Server Multi-Device Training Modification](#single-server-multi-device-training-modification)
    - [Model Training](#model-training)
  - [Model Porting Evaluation](#model-porting-evaluation-1)
  - [Environment Setup](#environment-setup-1)
  - [Model Porting](#model-porting-1)
    - [Tool-Facilitated](#tool-facilitated)
      - [Introduction](#introduction-1)
        - [Overview](#overview-1)
        - [System Requirement](#system-requirement)
        - [Environment Setup](#environment-setup-2)
      - [Instructions](#instructions)
        - [Command-line Options](#command-line-options)
        - [Customizing a Rule File](#customizing-a-rule-file)
        - [Performing Conversion](#performing-conversion)
        - [Result Analysis](#result-analysis)
    - [Manual](#manual)
      - [Single-Device Training Model Porting](#single-device-training-model-porting)
      - [Multi-Device Training Model Porting](#multi-device-training-model-porting)
      - [PyTorch-related API Replacement](#pytorch-related-api-replacement)
    - [Mixed Precision](#mixed-precision)
      - [Overview](#overview-2)
      - [Supported Features](#supported-features)
      - [Integrating Mixed Precision Module Into the PyTorch Model](#integrating-mixed-precision-module-into-the-pytorch-model)
  - [Model Training](#model-training-1)
  - [Performance Analysis and Optimization](#performance-analysis-and-optimization)
    - [Prerequisites](#prerequisites)
    - [Commissioning Process](#commissioning-process)
      - [Overall Guideline](#overall-guideline)
      - [Training Data Collection](#training-data-collection)
        - [Profile Data Collection](#profile-data-collection)
        - [Obtaining Operator Information (OP_INFO)](#obtaining-operator-information-op_info)
    - [Host-side Performance Optimization](#host-side-performance-optimization)
      - [Overview](#overview-3)
      - [Changing the CPU Performance Mode (x86 Server)](#changing-the-cpu-performance-mode-x86-server)
        - [Setting the Power Policy to High Performance](#setting-the-power-policy-to-high-performance)
        - [Setting the CPU Mode to Performance](#setting-the-cpu-mode-to-performance)
      - [Changing the CPU Performance Mode (ARM Server)](#changing-the-cpu-performance-mode-arm-server)
        - [Setting the Power Policy to High Performance](#setting-the-power-policy-to-high-performance-1)
      - [Installing the High-Performance Pillow Library (x86 Server)](#installing-the-high-performance-pillow-library-x86-server)
      - [(Optional) Installing the OpenCV Library of the Specified Version](#optional-installing-the-opencv-library-of-the-specified-version)
    - [Training Performance Optimization](#training-performance-optimization)
      - [Operator Bottleneck Optimization](#operator-bottleneck-optimization)
      - [Copy Bottleneck Optimization](#copy-bottleneck-optimization)
      - [Framework Bottleneck Optimization](#framework-bottleneck-optimization)
      - [Compilation Bottleneck Optimization](#compilation-bottleneck-optimization)
    - [E2E Performance Tool (E2E prof) Instructions](#e2e-performance-tool-e2e-prof-instructions)
      - [Introduction](#introduction-2)
      - [Usage Tutorial](#usage-tutorial)
      - [Result Parsing](#result-parsing)
      - [Advanced Settings](#advanced-settings)
    - [Affinity Library](#affinity-library)
      - [Source](#source)
      - [Functions](#functions)
  - [Precision Commissioning](#precision-commissioning)
    - [Prerequisites](#prerequisites-1)
    - [Commissioning Process](#commissioning-process-1)
      - [Overall Guideline](#overall-guideline-1)
      - [Precision Tuning Methods](#precision-tuning-methods)
        - [**Environment Setup**](#environment-setup-3)
        - [Model Operator Precision Comparison](#model-operator-precision-comparison)
        - [Single-Operator Overflow/Underflow Detection](#single-operator-overflowunderflow-detection)
        - [Mapping Between IR and TBE Operators](#mapping-between-ir-and-tbe-operators)
        - [Mapping Between NPU and GPU Operators.](#mapping-between-npu-and-gpu-operators)
  - [Model Saving and Conversion](#model-saving-and-conversion)
    - [Introduction](#introduction-3)
    - [Saving a Model](#saving-a-model)
    - [Exporting an ONNX Model](#exporting-an-onnx-model)
      - [Introduction](#introduction-4)
      - [Using the .pth or .pt File to Export the ONNX Model](#using-the-pth-or-pt-file-to-export-the-onnx-model)
      - [Using the .pth.tar File to Export the ONNX Model](#using-the-pthtar-file-to-export-the-onnx-model)
  - [Samples](#samples)
    - [ShuffleNet Model Optimization](#shufflenet-model-optimization)
      - [Obtaining Samples](#obtaining-samples)
        - [How to Obtain](#how-to-obtain)
        - [Directory Structure](#directory-structure)
      - [Model Evaluation](#model-evaluation)
      - [Porting the Network](#porting-the-network)
      - [Commissioning the Network](#commissioning-the-network)
        - [Forward check](#forward-check)
        - [Entire Network Check](#entire-network-check)
        - [Python Optimization Details](#python-optimization-details)
  - [References](#references)
    - [Single-Operator Sample Building](#single-operator-sample-building)
    - [Single-Operator Dump Method](#single-operator-dump-method)
      - [Collecting Dump Data](#collecting-dump-data)
      - [Viewing Overflowed Data](#viewing-overflowed-data)
      - [Parse the dump file of an overflow operator.](#parse-the-dump-file-of-an-overflow-operator)
      - [Common Environment Variables](#common-environment-variables)
      - [dump op Method](#dump-op-method)
    - [Compilation Option Settings](#compilation-option-settings)
    - [How Do I Install GCC 7.3.0?](#how-do-i-install-gcc-730)
    - [HDF5 Compilation and Installation](#hdf5-compilation-and-installation)
    - [FAQs](#faqs)
      - [FAQs About Software Installation](#faqs-about-software-installation)
      - [FAQs About Model and Operator Running](#faqs-about-model-and-operator-running)
      - [FAQs About Model Commissioning](#faqs-about-model-commissioning)
      - [FAQs About Other Operations](#faqs-about-other-operations)
      - [FAQs About Distributed Model Training](#faqs-about-distributed-model-training)

## Overview

Currently, the solution for adapting to the Ascend AI Processor is an online solution.

### Solution Features and Advantages

The acceleration of the Ascend AI Processor is implemented by calling various operators (OP-based). That is, AscendCL is used to call one or more D affinity operators to replace the original GPU-based implementation. [Figure 1](#fig2267112413239) shows the logical model of the implementation.

**Figure 1** Logical model

![](figures/pytorch适配逻辑结构图-优化.png)

Currently, the main reasons for selecting the online adaptation solution are as follows:

1. The dynamic graph feature of the PyTorch framework is inherited to the maximum extent.
2. The GPU usage habits of PyTorch are inherited to the maximum extent, which minimizes changes to the development mode and maximizes code reuse when a model is ported to the Ascend AI Processor for training.
3. The original PyTorch architecture is inherited to the maximum extent, and the excellent features of the PyTorch architecture are retained, such as automatic differentiation, dynamic distribution, debugging, profiling, the storage sharing mechanism, and dynamic memory management on the device side.
4. It has good scalability. For a new network type or structure, only the related compute operators need to be developed and implemented. Framework operators, reverse graph building, and implementation mechanisms can be reused.
5. The usage and style are the same as those of the GPU-based implementation. During online adaptation, you only need to specify the Ascend AI Processor as the device in Python and in device operations to develop, train, and debug the network in PyTorch using the Ascend AI Processor. You do not need to pay attention to the underlying details of the Ascend AI Processor. In this way, you can minimize the modification and complete porting at low cost.

## Restrictions and Limitations

- In the **infershape** phase, operators do not support unknown shape inference.
- Only the float16 operator can be used for cube computing.
- inf/nan data of the float16 type cannot be input or output.
- Dimensions cannot be reduced when a format larger than 4D is used.
- In the current version, Apex is implemented using Python, and the customized optimized CUDA kernels in Apex are not supported.
- The current version of Apex supports only the mixed precision calculation and multiple fusion optimizer functions adapted to Ascend AI Processors.
- The restrictions on collective communication are as follows:
  - In data parallel mode, the graphs executed on different devices must be the same.
  - Allocation of only 1, 2, 4, or 8 processors is supported.
  - Only the int8, int32, float16, and float32 data types are supported.

## Porting Process

Model porting refers to moving models that have been implemented in the open-source community to an Ascend AI Processor. [Figure 2](#fig759451810422) shows the model porting process.

**Figure 2** Porting process
![](figures/porting-process.png "porting-process")

**Table 1** Porting process

| Scenario | Description |
| --- | --- |
| Model selection | Select the model to be ported. |
| Model porting evaluation | For details, see Model Porting Evaluation. |
| Operator development | For details, see the *PyTorch Operator Developer Guide*. |
| Environment setup | For details, see Environment Setup. |
| Model porting | For details, see Model Porting. |
| Model training | For details, see Model Training. |
| Error analysis | For details, see "AI Core Error Analyzer Instructions" in the *CANN Log Reference* and *CANN Auxiliary Development Tool User Guide*. |
| Performance analysis and optimization | For details, see Performance Analysis and Optimization. |
| Precision commissioning | For details, see Precision Commissioning. |
| Model saving and conversion | For details, see Model Saving and Conversion and "ATC Tool Instructions" in the *CANN Auxiliary Development Tool User Guide*. |
| Application software development | For details, see the *CANN Application Software Development Guide (C and C++, Inference)*. |
| FAQs | Describes how to prepare the environment, port models, commission models, and resolve other common problems. For details, see FAQs. |
## Quick Start

### Introduction

This section describes how to port a ResNet-50 model to help users quickly understand the porting process.

### Model Selection

In this example, the [main.py](https://github.com/pytorch/examples/tree/master/imagenet/main.py) script for model training on the ImageNet dataset is ported to adapt to Ascend 910 AI Processors. The script can be obtained from the PyTorch official website.

### Model Porting Evaluation

Whether a model can be successfully ported depends on whether its operators are supported by Ascend AI Processors. Therefore, you can evaluate whether the operators of the model are supported by Ascend AI Processors using either of the following methods:

- Before model porting, obtain information about the operators by dumping them, and then compare the information with the list in the *PyTorch Operator Support* to determine whether they are supported by Ascend AI Processors.
- After model porting, run the training script on an Ascend AI Processor. If operators not supported by Ascend AI Processors exist, an error is reported.

If operators not supported by Ascend AI Processors exist, you can replace them with equivalent operators or develop other appropriate operators. For details, see the *PyTorch Operator Developer Guide*.

The operators used by the ResNet-50 model are all supported by Ascend AI Processors.

### Environment Setup

Install the CANN software, PyTorch framework, and mixed precision module, and set environment variables. For details, see the *PyTorch Installation Guide*.

Set up the Python environment and prepare the dependencies required for model running. For details, see the PyTorch [examples](https://github.com/pytorch/examples/tree/master/imagenet).

### Model Porting

Modify the **main.py** training script to implement single-device model training and single-server multi-device model training porting.

#### Single-Device Training Porting

1. Import the **torch.npu** module to **main.py**.

   ```python
   import torch.npu
   ```

2. Define the training device in **main.py**.

   ```python
   CALCULATE_DEVICE = "npu:0"
   ```

3. Modify the parameters and options so that the script can be trained only on Ascend 910 AI Processors.

   Code location: **main_worker()** function in **main.py**:

   ```python
   def main_worker(gpu, ngpus_per_node, args):
       global best_acc1
       # The source code specifies that the GPU is used for training. The following is an example.
       # args.gpu = gpu
       ############## npu modify begin #############
       args.gpu = None
       ############## npu modify end #############

       if args.gpu is not None:
           print("Use GPU: {} for training".format(args.gpu))

       if args.distributed:
           if args.dist_url == "env://" and args.rank == -1:
               args.rank = int(os.environ["RANK"])
           if args.multiprocessing_distributed:
               # For multiprocessing distributed training, rank needs to be the
               # global rank among all the processes
               args.rank = args.rank * ngpus_per_node + gpu
           dist.init_process_group(backend=args.dist_backend, init_method=args.dist_url,
                                   world_size=args.world_size, rank=args.rank)
       # create model
       if args.pretrained:
           print("=> using pre-trained model '{}'".format(args.arch))
           model = models.__dict__[args.arch](pretrained=True)
       else:
           print("=> creating model '{}'".format(args.arch))
           model = models.__dict__[args.arch]()
       # The source code determines whether to perform training on the GPU. The following is an example.
       # if not torch.cuda.is_available():
       #     print('using CPU, this will be slow')
       # elif args.distributed:
       ############## npu modify begin #############
       # After the porting, the code only determines whether to perform distributed training (without determining whether to perform training on the GPU).
       if args.distributed:
       ############## npu modify end #############
           # For multiprocessing distributed, DistributedDataParallel constructor
           # should always set the single device scope, otherwise,
           # DistributedDataParallel will use all available devices.
           if args.gpu is not None:
               ......
   ```

4. Port the model and loss function to an Ascend 910 AI Processor for calculation.

   Code location: **main_worker()** function in **main.py**:

   ```python
   elif args.gpu is not None:
       torch.cuda.set_device(args.gpu)
       model = model.cuda(args.gpu)
   else:
       # DataParallel will divide and allocate batch_size to all available GPUs
       if args.arch.startswith('alexnet') or args.arch.startswith('vgg'):
           model.features = torch.nn.DataParallel(model.features)
           model.cuda()
       else:
           # The source code uses the torch.nn.DataParallel() class to accelerate training on multiple GPUs.
           # model = torch.nn.DataParallel(model).cuda()
           ############## npu modify begin #############
           # Port the model to the NPU for training.
           model = model.to(CALCULATE_DEVICE)
           ############## npu modify end #############
   # In the source code, the loss function is calculated on the GPU.
   # # define loss function (criterion) and optimizer
   # criterion = nn.CrossEntropyLoss().cuda(args.gpu)
   ############## npu modify begin #############
   # Port the loss function to the NPU for calculation.
   criterion = nn.CrossEntropyLoss().to(CALCULATE_DEVICE)
   ############## npu modify end #############
   ```

5. Change the type of the **target** operator in the dataset to **int32** to resolve the operator error. Port the dataset to the Ascend 910 AI Processor for calculation.

   - Code location: **train()** function in **main.py**:

     ```python
     for i, (images, target) in enumerate(train_loader):
         # measure data loading time
         data_time.update(time.time() - end)

         if args.gpu is not None:
             images = images.cuda(args.gpu, non_blocking=True)
         # In the source code, the training dataset is loaded and calculated on the GPU. The following is an example.
         # if torch.cuda.is_available():
         #     target = target.cuda(args.gpu, non_blocking=True)
         ############## npu modify begin #############
         # Port the dataset to the NPU for calculation and modify the target data type to improve performance.
         if 'npu' in CALCULATE_DEVICE:
             target = target.to(torch.int32)
         images, target = images.to(CALCULATE_DEVICE, non_blocking=True), target.to(CALCULATE_DEVICE, non_blocking=True)
         ############## npu modify end #############
     ```

   - Code location: **validate()** function in **main.py**:

     ```python
     with torch.no_grad():
         end = time.time()
         for i, (images, target) in enumerate(val_loader):
             if args.gpu is not None:
                 images = images.cuda(args.gpu, non_blocking=True)
             # In the source code, the training dataset is loaded and calculated on the GPU. The following is an example.
             # if torch.cuda.is_available():
             #     target = target.cuda(args.gpu, non_blocking=True)
             ############## npu modify begin #############
             # Port the dataset to the NPU for calculation and modify the target data type.
             if 'npu' in CALCULATE_DEVICE:
                 target = target.to(torch.int32)
             images, target = images.to(CALCULATE_DEVICE, non_blocking=True), target.to(CALCULATE_DEVICE, non_blocking=True)
             ############## npu modify end #############
     ```

6. Set the device in use.

   Code location: main function entry point in **main.py**:

   ```python
   if __name__ == '__main__':
       ############## npu modify begin #############
       if 'npu' in CALCULATE_DEVICE:
           torch.npu.set_device(CALCULATE_DEVICE)
       ############## npu modify end #############
       main()
   ```

#### Single-Server Multi-Device Training Modification

1. Add the required imports to **main.py** to support mixed precision training of PyTorch-based models on Ascend 910 AI Processors.

   ```python
   import torch.npu
   from apex import amp
   ```

2. Add the following parameters, including those for specifying the Ascend 910 AI Processors involved in training and those required for mixed precision training.

   ```python
   parser.add_argument('--device', default='npu', type=str, help='npu or gpu')
   parser.add_argument('--addr', default='10.136.181.115', type=str, help='master addr')
   parser.add_argument('--device-list', default='0,1,2,3,4,5,6,7', type=str, help='device id list')
   parser.add_argument('--amp', default=False, action='store_true', help='use amp to train the model')
   parser.add_argument('--loss-scale', default=1024., type=float,
                       help='loss scale using in amp, default -1 means dynamic')
   parser.add_argument('--opt-level', default='O2', type=str,
                       help='opt level used in amp, default O2')
   ```

3. Create a mapping function from **device_id** to **process_id** and specify the device for training. Add the following function to **main.py**:

   ```python
   def device_id_to_process_device_map(device_list):
       devices = device_list.split(",")
       devices = [int(x) for x in devices]
       devices.sort()

       process_device_map = dict()
       for process_id, device_id in enumerate(devices):
           process_device_map[process_id] = device_id

       return process_device_map
   ```

4. Specify the IP address and port number of the training server.

   Code location: main function **main()** in **main.py**:

   ```python
   def main():
       args = parser.parse_args()
       ############## npu modify begin #############
       os.environ['MASTER_ADDR'] = args.addr
       os.environ['MASTER_PORT'] = '29688'
       ############## npu modify end #############
   ```

5. Create a mapping parameter from **device_id** to **process_id** to obtain the number of Ascend 910 AI Processors on a single node.

   Code location: main function **main()** in **main.py**:

   ```python
   args.distributed = args.world_size > 1 or args.multiprocessing_distributed
   ############## npu modify begin #############
   args.process_device_map = device_id_to_process_device_map(args.device_list)
   if args.device == 'npu':
       ngpus_per_node = len(args.process_device_map)
   else:
       ngpus_per_node = torch.cuda.device_count()
   ############## npu modify end #############
   # The source code is as follows:
   # ngpus_per_node = torch.cuda.device_count()
   ```

6. Obtain the ID of the Ascend 910 AI Processor corresponding to **process_id** and specify the Ascend 910 AI Processor for training.
+
+6. Obtain the ID of the Ascend 910 AI Processor corresponding to **process_id** and specify the Ascend 910 AI Processor for training.
+
+   Code location: **main_worker()** function in **main.py**:
+
+   ```python
+   def main_worker(gpu, ngpus_per_node, args):
+       global best_acc1
+       ############## npu modify begin #############
+       args.gpu = args.process_device_map[gpu]
+       ############## npu modify end #############
+       # The source code is as follows:
+       # args.gpu = gpu
+   ```
+
+7. Initialize the process group and mask the initialization mode.
+
+   Code location: **main_worker()** function in **main.py**:
+
+   ```python
+   ############## npu modify begin #############
+   if args.device == 'npu':
+       dist.init_process_group(backend=args.dist_backend, #init_method=args.dist_url,
+                               world_size=args.world_size, rank=args.rank)
+   else:
+       dist.init_process_group(backend=args.dist_backend, init_method=args.dist_url,
+                               world_size=args.world_size, rank=args.rank)
+   ############## npu modify end #############
+   # The source code is as follows:
+   # dist.init_process_group(backend=args.dist_backend, init_method=args.dist_url,
+   #                         world_size=args.world_size, rank=args.rank)
+   ```
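+
+   As a quick sanity check (a minimal sketch, not part of the original script), you can verify that the process group is up before the model is built:
+
+   ```python
+   # Runs after dist.init_process_group() succeeds; assumes MASTER_ADDR/MASTER_PORT
+   # were exported as shown in the earlier steps.
+   assert dist.is_initialized()
+   print("rank {} of world size {}".format(dist.get_rank(), dist.get_world_size()))
+   ```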
+
+8. To perform distributed training, the mixed precision module needs to be introduced, and the model needs to be ported to Ascend AI Processors. Therefore, the code that checks whether the training is distributed and whether the model is trained on the GPU needs to be masked.
+
+   Code location: **main_worker()** function in **main.py**:
+
+   ```python
+   # create model
+   if args.pretrained:
+       print("=> using pre-trained model '{}'".format(args.arch))
+       model = models.__dict__[args.arch](pretrained=True)
+   else:
+       print("=> creating model '{}'".format(args.arch))
+       model = models.__dict__[args.arch]()
+   ############## npu modify begin #############
+   # Add the following content to the code.
+   # Specify Ascend AI Processors as the training devices.
+   loc = 'npu:{}'.format(args.gpu)
+   torch.npu.set_device(loc)
+   # Calculate batch_size and workers used for training.
+   args.batch_size = int(args.batch_size / ngpus_per_node)
+   args.workers = int((args.workers + ngpus_per_node - 1) / ngpus_per_node)
+   ############## npu modify end #############
+   # The source code below needs to be masked (commented out):
+   # if not torch.cuda.is_available():
+   #     print('using CPU, this will be slow')
+   # elif args.distributed:
+   #     # For multiprocessing distributed, DistributedDataParallel constructor
+   #     # should always set the single device scope, otherwise,
+   #     # DistributedDataParallel will use all available devices.
+   #     if args.gpu is not None:
+   #         torch.cuda.set_device(args.gpu)
+   #         model.cuda(args.gpu)
+   #         # When using a single GPU per process and per
+   #         # DistributedDataParallel, we need to divide the batch size
+   #         # ourselves based on the total number of GPUs we have
+   #         args.batch_size = int(args.batch_size / ngpus_per_node)
+   #         args.workers = int((args.workers + ngpus_per_node - 1) / ngpus_per_node)
+   #         model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu])
+   #     else:
+   #         model.cuda()
+   #         # DistributedDataParallel will divide and allocate batch_size to all
+   #         # available GPUs if device_ids are not set
+   #         model = torch.nn.parallel.DistributedDataParallel(model)
+   # elif args.gpu is not None:
+   #     torch.cuda.set_device(args.gpu)
+   #     model = model.cuda(args.gpu)
+   # else:
+   #     # DataParallel will divide and allocate batch_size to all available GPUs
+   #     if args.arch.startswith('alexnet') or args.arch.startswith('vgg'):
+   #         model.features = torch.nn.DataParallel(model.features)
+   #         model.cuda()
+   #     else:
+   #         model = torch.nn.DataParallel(model).cuda()
+   ```
+
+9. Mask the loss function, optimizer, and checkpoint-resume code. These parts are reworked together with the mixed precision training later.
+
+   Code location: **main_worker()** function in **main.py**:
+
+   ```python
+   # The source code is masked and commented out.
+   # # define loss function (criterion) and optimizer
+   # criterion = nn.CrossEntropyLoss().cuda(args.gpu)
+   #
+   # optimizer = torch.optim.SGD(model.parameters(), args.lr,
+   #                             momentum=args.momentum,
+   #                             weight_decay=args.weight_decay)
+   #
+   # # optionally resume from a checkpoint
+   # if args.resume:
+   #     if os.path.isfile(args.resume):
+   #         print("=> loading checkpoint '{}'".format(args.resume))
+   #         if args.gpu is None:
+   #             checkpoint = torch.load(args.resume)
+   #         else:
+   #             # Map model to be loaded to specified single gpu.
+   #             loc = 'cuda:{}'.format(args.gpu)
+   #             checkpoint = torch.load(args.resume, map_location=loc)
+   #         args.start_epoch = checkpoint['epoch']
+   #         best_acc1 = checkpoint['best_acc1']
+   #         if args.gpu is not None:
+   #             # best_acc1 may be from a checkpoint from a different GPU
+   #             best_acc1 = best_acc1.to(args.gpu)
+   #         model.load_state_dict(checkpoint['state_dict'])
+   #         optimizer.load_state_dict(checkpoint['optimizer'])
+   #         print("=> loaded checkpoint '{}' (epoch {})"
+   #               .format(args.resume, checkpoint['epoch']))
+   #     else:
+   #         print("=> no checkpoint found at '{}'".format(args.resume))
+   #
+   # cudnn.benchmark = True
+   ```
+
+10. A data loader combines a dataset and a sampler and uses multiple worker processes to load the dataset. If Ascend AI Processors are used for training, **pin_memory** must be set to **False**. Currently, only static-shape training is supported, and the number of remaining samples in the data flow may be less than the batch size; therefore, **drop_last** must be set to **True**. In addition, **shuffle** must be set to **True** for some datasets used for validation.
+
+    Code location: **main_worker()** function in **main.py**:
+
+    ```python
+    ############## npu modify begin #############
+    train_loader = torch.utils.data.DataLoader(
+        train_dataset, batch_size=args.batch_size, shuffle=(train_sampler is None),
+        num_workers=args.workers, pin_memory=False, sampler=train_sampler, drop_last=True)
+
+    val_loader = torch.utils.data.DataLoader(
+        datasets.ImageFolder(valdir, transforms.Compose([
+            transforms.Resize(256),
+            transforms.CenterCrop(224),
+            transforms.ToTensor(),
+            normalize,
+        ])),
+        batch_size=args.batch_size, shuffle=True,
+        num_workers=args.workers, pin_memory=False, drop_last=True)
+    ############## npu modify end #############
+    ```
+
+11. Construct the loss function and optimizer, and port the model and loss function to Ascend AI Processors. The optimizer, model, and checkpoint-resume logic are combined with the mixed precision module to support mixed precision training.
+
+    Code location: after the validation data loader in **main_worker()** in **main.py**:
+
+    ```python
+    val_loader = torch.utils.data.DataLoader(
+        datasets.ImageFolder(valdir, transforms.Compose([
+            transforms.Resize(256),
+            transforms.CenterCrop(224),
+            transforms.ToTensor(),
+            normalize,
+        ])),
+        batch_size=args.batch_size, shuffle=True,
+        num_workers=args.workers, pin_memory=False, drop_last=True)
+
+    ############## npu modify begin #############
+    model = model.to(loc)
+    # define loss function (criterion) and optimizer
+    criterion = nn.CrossEntropyLoss().to(loc)
+    optimizer = torch.optim.SGD(model.parameters(), args.lr,
+                                momentum=args.momentum,
+                                weight_decay=args.weight_decay)
+
+    if args.amp:
+        model, optimizer = amp.initialize(model, optimizer, opt_level=args.opt_level, loss_scale=args.loss_scale)
+    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu])
+
+    # optionally resume from a checkpoint
+    if args.resume:
+        if os.path.isfile(args.resume):
+            print("=> loading checkpoint '{}'".format(args.resume))
+            checkpoint = torch.load(args.resume, map_location=loc)
+            args.start_epoch = checkpoint['epoch']
+            best_acc1 = checkpoint['best_acc1']
+            model.load_state_dict(checkpoint['state_dict'])
+            optimizer.load_state_dict(checkpoint['optimizer'])
+            if args.amp:
+                amp.load_state_dict(checkpoint['amp'])
+            print("=> loaded checkpoint '{}' (epoch {})"
+                  .format(args.resume, checkpoint['epoch']))
+        else:
+            print("=> no checkpoint found at '{}'".format(args.resume))
+
+    cudnn.benchmark = True
+    ############## npu modify end #############
+    ```
+
+12. The checkpoint saving needs to be combined with the mixed precision training. The modification is as follows:
+
+    Code location: **main_worker()** in **main.py** (the changes are enclosed in the `npu modify` comments):
+
+    ```python
+    # remember best acc@1 and save checkpoint
+    is_best = acc1 > best_acc1
+    best_acc1 = max(acc1, best_acc1)
+
+    if not args.multiprocessing_distributed or (args.multiprocessing_distributed
+            and args.rank % ngpus_per_node == 0):
+        ############## npu modify begin #############
+        if args.amp:
+            save_checkpoint({
+                'epoch': epoch + 1,
+                'arch': args.arch,
+                'state_dict': model.state_dict(),
+                'best_acc1': best_acc1,
+                'optimizer' : optimizer.state_dict(),
+                'amp': amp.state_dict(),
+            }, is_best)
+        else:
+            save_checkpoint({
+                'epoch': epoch + 1,
+                'arch': args.arch,
+                'state_dict': model.state_dict(),
+                'best_acc1': best_acc1,
+                'optimizer' : optimizer.state_dict(),
+            }, is_best)
+        ############## npu modify end #############
+    ```
+
+13. During training, you need to port the dataset to Ascend AI Processors. The modification is as follows:
+
+    Code location: **train()** in **main.py** (the changes are enclosed in the `npu modify` comments):
+
+    ```python
+    for i, (images, target) in enumerate(train_loader):
+        # measure data loading time
+        data_time.update(time.time() - end)
+        ############## npu modify begin #############
+        loc = 'npu:{}'.format(args.gpu)
+        target = target.to(torch.int32)
+        images, target = images.to(loc, non_blocking=False), target.to(loc, non_blocking=False)
+        ############## npu modify end #############
+        # The source model code is as follows:
+        # if args.gpu is not None:
+        #     images = images.cuda(args.gpu, non_blocking=True)
+        # if torch.cuda.is_available():
+        #     target = target.cuda(args.gpu, non_blocking=True)
+    ```
+
+14. Mark the location where the backpropagation **.backward()** occurs so that the mixed precision module can perform loss scaling and clear the status of each iteration. The code is as follows:
+
+    Code location: **train()** in **main.py** (the changes are enclosed in the `npu modify` comments):
+
+    ```python
+    optimizer.zero_grad()
+    ############## npu modify begin #############
+    if args.amp:
+        with amp.scale_loss(loss, optimizer) as scaled_loss:
+            scaled_loss.backward()
+    else:
+        loss.backward()
+    ############## npu modify end #############
+    # The source code is as follows:
+    # loss.backward()
+    optimizer.step()
+    ```
+
+15. Before validation, port the validation dataset to Ascend AI Processors. The modification is as follows:
+
+    Code location: **validate()** function in **main.py**:
+
+    ```python
+    with torch.no_grad():
+        end = time.time()
+        for i, (images, target) in enumerate(val_loader):
+            ############## npu modify begin #############
+            loc = 'npu:{}'.format(args.gpu)
+            target = target.to(torch.int32)
+            images, target = images.to(loc, non_blocking=False), target.to(loc, non_blocking=False)
+            ############## npu modify end #############
+            # The source model code is as follows:
+            # if args.gpu is not None:
+            #     images = images.cuda(args.gpu, non_blocking=True)
+            # if torch.cuda.is_available():
+            #     target = target.cuda(args.gpu, non_blocking=True)
+    ```
+
+### Model Training
+
+**Dataset Preparation**
+
+Prepare a dataset and upload it to a directory in the operating environment, for example, **/home/data/resnet50/imagenet**.
+
+**Command Execution**
+
+Single-device training:
+
+```shell
+python3 main.py /home/data/resnet50/imagenet --batch-size 128 \   # Training batch size
+    --lr 0.1 \                  # Learning rate
+    --epochs 90 \               # Number of training epochs
+    --arch resnet50 \           # Model architecture
+    --world-size 1 \
+    --rank 0 \
+    --workers 40 \              # Number of processes for loading data
+    --momentum 0.9 \            # Momentum
+    --weight-decay 1e-4         # Weight decay
+```
+
+Distributed training:
+
+```shell
+python3 main.py /home/data/resnet50/imagenet --addr='1.1.1.1' \   # Example IP address. Replace it with the actual IP address.
+    --seed 49 \                 # Random seed
+    --workers 160 \             # Number of processes for loading data
+    --lr 0.8 \
+    --print-freq 1 \
+    --arch resnet50 \           # Model architecture
+    --dist-url 'tcp://127.0.0.1:50000' \
+    --dist-backend 'hccl' \
+    --multiprocessing-distributed \   # Multi-device training
+    --world-size 1 \
+    --batch-size 2048 \         # Training batch size
+    --epochs 90 \               # Number of training epochs
+    --rank 0 \
+    --device-list '0,1,2,3,4,5,6,7' \
+    --amp                       # Use mixed precision for training.
+```
+
+>![](public_sys-resources/icon-note.gif) **NOTE:**
+>**dist-backend** must be set to **hccl** to support distributed training on Ascend AI devices.
+
+## Model Porting Evaluation
+
+1. When selecting models, use authoritative PyTorch models as benchmarks, including but not limited to PyTorch ([example](https://github.com/pytorch/examples/tree/master/imagenet)/[vision](https://github.com/pytorch/vision)), facebookresearch ([Detectron](https://github.com/facebookresearch/Detectron)/[detectron2](https://github.com/facebookresearch/detectron2)), and open-mmlab ([mmdetection](https://github.com/open-mmlab/mmdetection)/[mmpose](https://github.com/open-mmlab/mmpose)).
+
+2. Check the operator adaptation. Before porting the original model and training script to an Ascend AI Processor, train the original model and training script on the CPU, obtain the operator information by using the dump op method, and compare the operator information with that in the _PyTorch Operator Support_ to check whether the operators are supported. For details about the dump op method, see [dump op Method](#dump-op-method). If an operator is not supported, develop the operator. For details, see the *PyTorch Operator Development Guide*.
+
+   >![](public_sys-resources/icon-note.gif) **NOTE:**
+   >You can also port the model and training script to the Ascend AI Processor for training and view the error information. For details about how to port the model and training script, see the following sections. Generally, a message is displayed, indicating that an operator (the first unsupported operator) cannot run in the backend of the Ascend AI Processor.
+
+## Environment Setup
+
+Refer to the _PyTorch Installation Guide_ to install PyTorch and the mixed precision module, and configure the required environment variables.
+
+## Model Porting
+
+### Tool-Facilitated
+
+The Ascend platform provides a script conversion tool, msFmkTransplt, which ports training scripts to Ascend AI Processors using commands; this section describes how to use it. In addition to using commands, you can also use the PyTorch GPU2Ascend function integrated in MindStudio to port scripts. For details, see the _MindStudio User Guide_.
+
+#### Introduction
+
+##### Overview
+
+Ascend NPU is an up-and-comer in the AI computing field, but most training and online inference scripts are based on GPUs. Due to the architecture differences between NPUs and GPUs, GPU-based training and online inference scripts cannot be directly used on NPUs. The script conversion tool provides an automated method for converting GPU-based scripts into NPU-based scripts, reducing the learning cost and workload of manual script migration, thereby improving the migration efficiency.
+
+>![](public_sys-resources/icon-note.gif) **NOTE:**
+>- msFmkTransplt provides suggestions and converts scripts by the adaptation rules, significantly accelerating script migration and reducing development workload. The scripts in [Table 2](#en-us_topic_0000001133095885_table4705239194613) can be directly executed after being converted. The conversion results of other scripts are for reference only. You need to perform adaptation based on the site requirements.
+>- The original scripts in [Table 2](#en-us_topic_0000001133095885_table4705239194613) must be executed in the GPU environment and based on Python 3.
+>- For scripts in [Table 2](#en-us_topic_0000001133095885_table4705239194613), the execution logic after conversion is the same as that before conversion.
+>- This script conversion tool only supports the conversion of PyTorch training scripts.
+
+**Table 2** Supported models
+
+| No. | Model |
+| --- | --- |
+| 1 | 3D AttentionNet |
+| 2 | 3D Nested_UNet |
+| 3 | Advanced East |
+| 4 | AlexNet |
+| 5 | DeeplabV3+(Xception-JFT) |
+| 6 | DeepMar |
+| 7 | Densenet121 |
+| 8 | DenseNet161 |
+| 9 | DenseNet169 |
+| 10 | DenseNet201 |
+| 11 | EAST |
+| 12 | FCN |
+| 13 | FD-GAN |
+| 14 | FOTS |
+| 15 | GENet |
+| 16 | GoogleNet |
+| 17 | GRU |
+| 18 | Inception V4 |
+| 19 | InceptionV2 |
+| 20 | LPRNet |
+| 21 | LSTM |
+| 22 | MNASNet0_5 |
+| 23 | MNASNet0_75 |
+| 24 | MNASNet1_0 |
+| 25 | MNASNet1_3 |
+| 26 | MobileNetV1 |
+| 27 | MobileNetV2 |
+| 28 | PNet |
+| 29 | PSENet |
+| 30 | RAFT |
+| 31 | RecVAE |
+| 32 | ResNet101 |
+| 33 | ResNet152 |
+| 34 | ResNet18 |
+| 35 | ResNet34 |
+| 36 | ResNet50 |
+| 37 | Resnext101_32x8d |
+| 38 | Resnext50 |
+| 39 | RNet |
+| 40 | Shufflenetv2 |
+| 41 | SqueezeNet1_0 |
+| 42 | SqueezeNet1_1 |
+| 43 | U-Net |
+| 44 | VAE+GAN |
+| 45 | VGG11 |
+| 46 | VGG11_BN |
+| 47 | VGG13 |
+| 48 | VGG13_BN |
+| 49 | VGG16 |
+| 50 | VGG16_BN |
+| 51 | VGG19 |
+| 52 | VGG19_BN |
+| 53 | VIT-base |
+| 54 | Wide_ResNet101_2 |
+| 55 | Wide_ResNet50_2 |
+
+##### System Requirement
+
+msFmkTransplt runs on Ubuntu 18.04, CentOS 7.6, and EulerOS 2.8 only.
+
+##### Environment Setup
+
+Set up the development environment by referring to the _CANN Software Installation Guide_.
+
+#### Instructions
+
+##### Command-line Options
+
+**Table 3** Command-line options
+
+| Option | Description | Example Value |
+| --- | --- | --- |
+| -i, --input | Path of the folder or file where the original script file to be converted is located. (Required) | /home/username/fmktransplt<br>/home/username/fmktransplt.py |
+| -o, --output | Output path of the script conversion result. A folder with the .msft suffix will be generated in the path. (Required) | /home/username/fmktransplt_output |
+| -r, --rule | Path of the JSON file for custom general conversion rules, which cover function parameter, function name, and module name modifications. (Optional) | /home/username/fmktransplt_rule.json |
+| -h, --help | Help information. | - |
+
+##### Customizing a Rule File
+
+An example of a custom conversion rule is as follows:
+
+```
+{
+    "rules": {
+        "ArgsModifyRule": [
+            {
+                "func_name": "name1",
+                "arg_idx": 0,
+                "arg_new": "args0"
+            },
+            {
+                "func_name": "name2",
+                "arg_idx": 0,
+                "arg_new": "args0"
+            }
+        ],
+        "FuncNameModifyRule": [
+            {
+                "old_name": "func",
+                "new_name": "new_func"
+            }
+        ],
+        "ModuleNameModifyRule": [
+            {
+                "old_name": "module",
+                "new_name": "new_module",
+                "parent_module": "parent_module"
+            }
+        ]
+    }
+}
+```
+
+**Table 4** Options
+
+| Option | Description |
+| --- | --- |
+| ArgsModifyRule | Function parameter modification |
+| func_name | Function name |
+| arg_idx | Parameter position |
+| arg_new | New parameter |
+| FuncNameModifyRule | Function name modification |
+| ModuleNameModifyRule | Module name modification |
+| old_name | Old name |
+| new_name | New name |
+| parent_module | Parent module name |
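+
+To make the rule semantics concrete, the following hypothetical sketch illustrates the effect of the **FuncNameModifyRule** entry above ({"old_name": "func", "new_name": "new_func"}). The class and names are placeholders from the example rule file, not part of msFmkTransplt itself:
+
+```python
+# Hypothetical before/after illustration of FuncNameModifyRule.
+class Module:
+    def func(self, x):       # old API name matched by "old_name"
+        return x
+
+    def new_func(self, x):   # new API name substituted by "new_name"
+        return x
+
+m = Module()
+out_before = m.func(1)       # call as written in the original GPU script
+out_after = m.new_func(1)    # the same call after the conversion rule is applied
+```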
+
+##### Performing Conversion
+
+1. Go to the directory of the script conversion tool msFmkTransplt.
+
+   ```
+   cd {Ascend-CANN-Toolkit install path}/ascend-toolkit/{version}/{arch}-linux/toolkit/tools/ms_fmk_transplt
+   ```
+
+2. Execute msFmkTransplt.
+
+   ```
+   python3 ms_fmk_transplt.py -i original script path -o output path of the script conversion result [-r path of the JSON file for custom general conversion rules]
+   ```
+
+3. Find the converted script in the specified output path.
+
+#### Result Analysis
+
+You can view the result files in the output path after the script is converted.
+
+```
+├── xxx_msft                  // Directory for storing script conversion results. The default directory is the directory of the original script. xxx indicates the name of the folder where the original script is stored.
+│   ├── generated script file // The directory structure is the same as that of the script file before conversion.
+│   ├── msFmkTranspltlog.txt  // Log file generated during script conversion
+│   ├── unsupported_op.xlsx   // File of the unsupported operator list
+```
+
+### Manual
+
+#### Single-Device Training Model Porting
+
+The advantage of online adaptation is that training on the Ascend AI Processor is consistent with GPU usage. During online adaptation, **you only need to specify the device as the Ascend AI Processor in Python and device operations** to develop, train, and debug the network in PyTorch using the Ascend AI Processor. For single-device model training, the main changes for porting are as follows:
+
+GPU code before porting:
+
+```
+    CALCULATE_DEVICE = "cuda:0"
+    torch.cuda.set_device(CALCULATE_DEVICE)
+    # Two methods for porting the code to device
+    model = model.cuda() # Method 1
+    model = model.to(CALCULATE_DEVICE) # Method 2
+    # Port the input from host to device.
+    images = images.to(CALCULATE_DEVICE)
+    target = target.to(CALCULATE_DEVICE)
+```
+
+The code ported to the Ascend AI Processor is as follows:
+
+```
+    CALCULATE_DEVICE = "npu:0"
+    torch.npu.set_device(CALCULATE_DEVICE)
+    # Two methods for porting the code to device
+    model = model.npu() # Method 1
+    model = model.to(CALCULATE_DEVICE) # Method 2
+    # Port the input from host to device.
+    images = images.to(CALCULATE_DEVICE)
+    target = target.to(CALCULATE_DEVICE)
+```
+
+For details, see [Single-Device Training Porting](#single-device-training-porting).
+
+#### Multi-Device Training Model Porting
+
+To port a multi-device training model, you need to specify the device as the Ascend AI Processor in Python and device operations. In addition, you can perform distributed training using PyTorch **DistributedDataParallel**: run **init_process_group** during model initialization, and then initialize the model into a **DistributedDataParallel** model. Note that the **backend** must be set to **hccl** and the initialization mode must be masked when **init_process_group** is executed.
+
+PyTorch distributed training code example (some code is omitted):
+
+```
+import torch
+import torch.distributed as dist
+import torch.nn.parallel
+def main():
+    args = parser.parse_args()
+    # The initialization mode needs to be masked.
+    dist.init_process_group(backend='hccl',# init_method=args.dist_url,
+                            world_size=args.world_size, rank=args.rank)
+    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu]) # The model needs to be delivered to the NPU.
+    train_loader = torch.utils.data.DataLoader(
+        train_dataset, batch_size=args.batch_size, shuffle=(train_sampler is None),
+        num_workers=args.workers, pin_memory=True, sampler=train_sampler)
+    for epoch in range(args.start_epoch, args.epochs):
+        acc1 = train(train_loader, model, criterion, optimizer, epoch, args, ngpus_per_node,
+                     lr_scheduler)
+```
+
+For details, see [Single-Server Multi-Device Training Modification](#single-server-multi-device-training-modification).
+
+#### PyTorch-related API Replacement
+
+1. To enable the Ascend AI Processor to use the capabilities of the PyTorch framework, the native PyTorch framework is adapted at the device layer, and the externally visible CPU- and CUDA-related APIs are replaced with NPU equivalents. During network porting, some device-related APIs need to be replaced with the APIs related to the Ascend AI Processor. [Table 5](#table1922064517344) lists the supported device-related APIs.
+
+   **Table 5** Device-related APIs
+
+   | Original PyTorch API | API Adapted to the Ascend AI Processor | Description |
+   | --- | --- | --- |
+   | torch.cuda.is_available() | torch.npu.is_available() | Checks whether the device is available in the current environment (not the final result). |
+   | torch.cuda.current_device() | torch.npu.current_device() | Obtains the device in use. |
+   | torch.cuda.device_count() | torch.npu.device_count() | Obtains the number of devices in the current environment. |
+   | torch.cuda.set_device() | torch.npu.set_device() | Sets the device in use. |
+   | torch.tensor([1,2,3]).is_cuda | torch.tensor([1,2,3]).is_npu | Checks whether a tensor is in the format on the CUDA or NPU device. |
+   | torch.tensor([1,2,3]).cuda() | torch.tensor([1,2,3]).npu() | Converts a tensor to the format on the CUDA or NPU device. |
+   | torch.tensor([1,2,3]).to("cuda") | torch.tensor([1,2,3]).to('npu') | Converts a tensor to the format on the CUDA or NPU device. |
+   | torch.cuda.synchronize() | torch.npu.synchronize() | Waits until the event is complete. |
+   | torch.cuda.device | torch.npu.device | Generates a device class, which can be used to perform device-related operations. |
+   | torch.cuda.Stream(device) | torch.npu.Stream(device) | Generates a stream object. |
+   | torch.cuda.stream(Stream) | torch.npu.stream(Stream) | Mainly used for scope restriction. |
+   | torch.cuda.current_stream() | torch.npu.current_stream() | Obtains the current stream. |
+   | torch.cuda.default_stream() | torch.npu.default_stream() | Obtains the default stream. |
+   | device = torch.device("cuda:0") | device = torch.device("npu:0") | Specifies a device. |
+   | torch.autograd.profiler.profile(use_cuda=True) | torch.autograd.profiler.profile(use_npu=True) | Specifies that CUDA/NPU is used during profiler execution. |
+   | torch.cuda.Event() | torch.npu.Event() | Returns events on a device. |
+
+2. When building or porting a network, you need to create tensors of specified data types. The following table lists the tensors created on the Ascend AI Processor.
+
+   **Table 6** Tensor-related APIs
+
+   | GPU tensor | API Adapted to the Ascend AI Processor |
+   | --- | --- |
+   | torch.tensor([1,2,3],dtype=torch.long,device='cuda') | torch.tensor([1,2,3],dtype=torch.long,device='npu') |
+   | torch.tensor([1,2,3],dtype=torch.int,device='cuda') | torch.tensor([1,2,3],dtype=torch.int,device='npu') |
+   | torch.tensor([1,2,3],dtype=torch.half,device='cuda') | torch.tensor([1,2,3],dtype=torch.half,device='npu') |
+   | torch.tensor([1,2,3],dtype=torch.float,device='cuda') | torch.tensor([1,2,3],dtype=torch.float,device='npu') |
+   | torch.tensor([1,2,3],dtype=torch.bool,device='cuda') | torch.tensor([1,2,3],dtype=torch.bool,device='npu') |
+   | torch.cuda.BoolTensor([1,2,3]) | torch.npu.BoolTensor([1,2,3]) |
+   | torch.cuda.FloatTensor([1,2,3]) | torch.npu.FloatTensor([1,2,3]) |
+   | torch.cuda.IntTensor([1,2,3]) | torch.npu.IntTensor([1,2,3]) |
+   | torch.cuda.LongTensor([1,2,3]) | torch.npu.LongTensor([1,2,3]) |
+   | torch.cuda.HalfTensor([1,2,3]) | torch.npu.HalfTensor([1,2,3]) |
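+
+A device-agnostic pattern built from the replacements above (a minimal sketch; it assumes an NPU-adapted PyTorch build, where the torch.npu namespace exists):
+
+```python
+import torch
+
+# Pick the device string once, then write the rest of the code device-agnostically.
+if hasattr(torch, "npu") and torch.npu.is_available():
+    device = "npu:0"
+    torch.npu.set_device(device)
+elif torch.cuda.is_available():
+    device = "cuda:0"
+else:
+    device = "cpu"
+
+x = torch.tensor([1, 2, 3], dtype=torch.int, device=device)  # created directly on the device
+y = torch.ones(3, dtype=torch.int).to(device)                # or moved with .to()
+print((x + y).device)
+```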
+
+For more APIs, see the _PyTorch API Support_.
+
+### Mixed Precision
+
+#### Overview
+
+Based on the architecture features of the NPU chip, mixed precision training, that is, using the float16 and float32 data types together, is involved. Replacing float32 with float16 has the following advantages:
+
+- The memory usage of intermediate variables is reduced.
+- The data transfer time decreases because the memory usage is reduced.
+- The computing units of float16 provide better computing performance.
+
+However, mixed precision training is limited by the precision range that float16 can express. If float32 is simply converted into float16, training convergence is affected. To use float16 for acceleration in some computations while ensuring training convergence, the mixed precision module Apex is used. The mixed precision module Apex is a comprehensive optimization library that features high optimization performance and precision.
+
+In addition to the preceding advantages, the mixed precision module Apex adapted to Ascend AI Processors can improve computing performance. Details are described as follows:
+
+- During mixed precision calculation, Apex calculates the grad of the model. You can enable combine_grad to accelerate these operations. To do so, set the **combine_grad** parameter of the amp.initialize() interface to **True**.
+- After the adaptation, Apex optimizes optimizers such as adadelta, adam, sgd, and lamb to adapt them to Ascend AI Processors. The resulting NPU-based fused optimizers are consistent with the native algorithms but compute faster. To use them, you only need to replace the original optimizer with **apex.optimizers.\*** (**\*** indicates the optimizer name, for example, **NpuFusedSGD**).
+
+#### Supported Features
+
+[Table 7](#table10717173813332) describes the functions and optimization of the mixed precision module.
+
+**Table 7** Functions of the mixed precision module
+
+| Function | Description |
+| --- | --- |
+| O1 configuration | Conv and Matmul use float16 for computing, and Softmax and BN use float32. |
+| O2 configuration | BN uses float32, and others use float16. |
+| Static loss scale | Parameters are statically set to ensure the convergence of mixed precision training. |
+| Dynamic loss scale | The loss scale value is dynamically calculated to determine whether overflow occurs. |
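+
+For instance, the table rows map to **amp.initialize()** arguments roughly as follows (a minimal sketch; the tiny model is a placeholder, and Apex accepts either a float or the string 'dynamic' for **loss_scale**):
+
+```python
+import torch
+import torch.nn as nn
+from apex import amp
+
+model = nn.Linear(8, 2).npu()                            # placeholder model on the NPU
+optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
+
+# O2 configuration + static loss scale: BN stays float32, the rest runs in float16.
+model, optimizer = amp.initialize(model, optimizer, opt_level='O2', loss_scale=1024.0)
+# Alternatively, O1 configuration + dynamic loss scale (choose one of the two):
+# model, optimizer = amp.initialize(model, optimizer, opt_level='O1', loss_scale='dynamic')
+```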
+
+>![](public_sys-resources/icon-note.gif) **NOTE:**
+>- In the current version, Apex is implemented using Python and does not support AscendCL or CUDA optimization.
+>- Ascend AI devices do not support the original FusedLayerNorm interface module of Apex. If the original model script file uses the FusedLayerNorm interface module, replace the script header file **from apex.normalization import FusedLayerNorm** with **from torch.nn import LayerNorm**.
+
+#### Integrating Mixed Precision Module Into the PyTorch Model
+
+1. To use the mixed precision module Apex, import the amp module from the Apex library as follows:
+
+   ```
+   from apex import amp
+   ```
+
+2. After the amp module is imported, initialize it so that it can modify the model, optimizer, and PyTorch internal functions. The initialization code is as follows:
+
+   ```
+   model, optimizer = amp.initialize(model, optimizer, combine_grad=True)
+   ```
+
+3. Mark the location where the back propagation **.backward()** occurs so that amp can perform loss scaling and clear the status of each iteration. The code is as follows:
+
+   Original code:
+
+   ```
+   loss = criterion(...)
+   loss.backward()
+   optimizer.step()
+   ```
+
+   Code after the modification to support loss scaling:
+
+   ```
+   loss = criterion(...)
+   with amp.scale_loss(loss, optimizer) as scaled_loss:
+       scaled_loss.backward()
+   optimizer.step()
+   ```
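+
+Putting the three steps together, a minimal end-to-end sketch might look as follows (assuming an NPU-adapted PyTorch build with Apex installed; the toy model and batch are placeholders):
+
+```python
+import torch
+import torch.nn as nn
+from apex import amp
+
+device = 'npu:0'
+torch.npu.set_device(device)
+
+model = nn.Linear(8, 2).to(device)                        # placeholder model
+optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
+criterion = nn.CrossEntropyLoss().to(device)
+
+# Step 2: initialize amp; combine_grad=True enables the NPU-side gradient fusion described above.
+model, optimizer = amp.initialize(model, optimizer, combine_grad=True)
+
+x = torch.randn(4, 8).to(device)                          # placeholder batch
+y = torch.zeros(4, dtype=torch.int32).to(device)          # int32 labels, as recommended earlier
+
+optimizer.zero_grad()
+loss = criterion(model(x), y)
+# Step 3: let amp scale the loss around backward().
+with amp.scale_loss(loss, optimizer) as scaled_loss:
+    scaled_loss.backward()
+optimizer.step()
+```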
+
+## Model Training
+
+After the training scripts are ported, set environment variables by following the instructions in [Environment Variable Configuration](#en-us_topic_0000001144082004md) and run the **python3** _xxx_ command to train a model. For details, see [Script Execution](#script-executionmd).
+
+>![](public_sys-resources/icon-note.gif) **NOTE:**
+>When running the **python3** _xxx_ command, create a soft link from **python3** to the installation path of the Python version that matches the current PyTorch version.
+
+## Performance Analysis and Optimization
+
+### Prerequisites
+
+1. Modify the open-source code to ensure that the model can run properly, including data preprocessing, forward propagation, loss calculation, mixed precision, back propagation, and parameter update. For details, see [Samples](#samples).
+2. During model porting, check whether the model can run properly and whether the existing operators can meet the requirements. If an existing operator does not meet the requirements, develop an adapted operator. For details, see the _PyTorch Operator Development Guide_.
+3. Prioritize the single-device function, and then enable the multi-device function.
+
+### Commissioning Process
+
+#### Overall Guideline
+
+1. Check whether the throughput meets the expected requirements based on the training execution result.
+2. If the throughput does not meet requirements, find out the causes of the performance bottleneck. Possible causes are as follows:
+   - Operator bottleneck: The execution of an operator is too slow.
+   - Copy bottleneck: The bottleneck is caused by the copy operation performed when non-contiguous tensors are converted to contiguous tensors.
+   - Framework bottleneck: Additional operations are required due to operator format conversion.
+   - Compilation bottleneck: Repeated compilation is caused by the changes of shape or attributes.
+3. Analyze the preceding causes of performance bottlenecks and optimize the performance.
+
+#### Training Data Collection
+
+##### Profile Data Collection
+
+During model training, if the throughput does not meet requirements, you can collect profile data generated during the training process to analyze which step and which operator cause the performance consumption. The profile data is collected at the PyTorch layer (PyTorch API data) and CANN layer (TBE operator data).
+
+Select a collection mode based on the site requirements and perform the following steps to collect the profile data.
+
+- Profile data collection at the PyTorch layer
+    1. Obtain the **chrome_trace** file.
+
+       Use the profile API to reconstruct the loss calculation and optimization process of the original code.
+
+       ```
+       # Use the profile API adapted to Ascend-PyTorch. You are advised to run only one step.
+       with torch.autograd.profiler.profile(use_npu=True) as prof:
+           out = model(input_tensor)
+           loss = loss_func(out)
+           loss.backward()
+           optimizer.zero_grad()
+           optimizer.step()
+       # Print the profiling result.
+       print(prof)
+       # Export the chrome_trace file to a specified path.
+       output_path = '/home/HwHiAiUser/profile_data.json'
+       prof.export_chrome_trace(output_path)
+       ```
+
+    2. After the execution is successful, print the profiling result.
+
+       The printed result includes the CPU and NPU time consumption. For details, see Table 8.
+
+       **Table 8** Profiling result fields
+
+       | Name | Self CPU % | Self CPU | CPU total % | CPU total | CPU time avg | Self NPU % | Self NPU | NPU total | NPU time avg | # of Calls |
+       | ---- | ---------- | -------- | ----------- | --------- | ------------ | ---------- | -------- | --------- | ------------ | :--------: |
+
+    3. View the **chrome_trace** file.
+
+       To view the **chrome_trace** file, access **chrome://tracing** in the Chrome browser and drag the file into the blank space. You can press **W**, **A**, **S**, or **D** to zoom in, zoom out, or move the profiling result.
+
+    4. Other profiling functions are as follows.
+
+       - Obtain the shape information of the input tensor of an operator.
+
+         ```python
+         # Add the record_shapes parameter to obtain the shape information of the input tensor.
+         with torch.autograd.profiler.profile(use_npu=True, record_shapes=True) as prof:
+             # Add the model calculation process.
+         print(prof)
+         ```
+
+         The `Input Shape` information of each operator is added to the printed result.
+
+       - Obtain the memory information of the NPU in use.
+
+         ```python
+         # Add the profile_memory parameter to obtain the memory usage of the operators.
+         with torch.autograd.profiler.profile(use_npu=True, profile_memory=True) as prof:
+             # Add the model calculation process.
+         print(prof)
+         ```
+
+         The `CPU Mem`, `Self CPU Mem`, `NPU Mem`, and `Self NPU Mem` information of each operator is added to the printed result.
+
+         >![](public_sys-resources/icon-note.gif) **NOTE:**
+         >
+         >This function is supported only by PyTorch 1.8 or later.
+
+       - Obtain a simplified operator performance report.
+
+         This function prints only the operator information at the bottom layer of each operator stack, simplifying the analysis result.
+
+         ```python
+         # Add the use_npu_simple parameter to obtain the simplified operator performance report.
+         with torch.autograd.profiler.profile(use_npu=True, use_npu_simple=True) as prof:
+             # Add the model calculation process.
+         # Export the chrome_trace file to a specified path.
+         output_path = '/home/HwHiAiUser/profile_data.json'
+         prof.export_chrome_trace(output_path)
+         ```
+
+         Open the **chrome_trace** result file in the Chrome browser to view the simplified operator performance report.
+
+- Profile data collection at the CANN layer
+    1. Obtain the profile data file.
+
+       ```
+       profiler_result_path = "/home/profiling_data"  # Folder for storing the profile data. Create it manually in advance based on the site requirements.
+       with torch.npu.profile(profiler_result_path):
+           out = model(input_tensor)
+           loss = loss_func(out, target)
+           loss.backward()
+           optimizer.zero_grad()
+           optimizer.step()
+       ```
+
+       >![](public_sys-resources/icon-note.gif) **NOTE:**
+       >When obtaining the profile data file, deliver **model**, **input_tensor**, and **target** to the NPU.
+
+    2. Parse the profile data file.
+
+       For details, see "Profiling Instructions (Training)" in the *CANN Auxiliary Development Tool User Guide*.
+
+##### Obtaining Operator Information (OP_INFO)
+
+The network model is executed as operators (OPs). The OPInfo log can be used to obtain the operators and their attributes during the actual execution. Obtain the information by running the **get_ascend_op_info.py** script.
+
+1. Write the **get_ascend_op_info.py** script to obtain the operator information. The script content is as follows:
+
+   ```
+   # -*- coding: utf-8 -*-
+   """Used to export operator information."""
+   import os
+   import sys
+   import argparse
+
+   def func(host_log_folder):
+       """
+       :param host_log_folder: where host_log_folder addr is.
+       :return:
+       """
+       host_log_files = os.listdir(host_log_folder)
+       result = {}
+
+       for host_log in host_log_files:
+           # Process only the .log and .out host log files.
+           if not (host_log.endswith('.log') or host_log.endswith('.out')):
+               continue
+           with open(os.path.join(host_log_folder, host_log), 'r') as f:
+               host_log_lines = f.readlines()
+               for line in host_log_lines:
+                   if line.startswith('[INFO] ASCENDCL') and "aclopCompile::aclOp" in line:
+                       op_info = line.split('OpType: ')[1][:-2]
+                       op_type = op_info.split(',')[0]
+                       op_param = op_info[len(op_type) + 2:]
+                       if op_type not in result.keys():
+                           result[op_type] = [op_param]
+                       else:
+                           result[op_type].append(op_param)
+
+       with open('ascend_op_info_summary.txt', 'w') as f:
+           for k, v in result.items():
+               v_set = set(v)
+               for info in v_set:
+                   f.write(k + " " + info + "\n")
+
+   if __name__ == "__main__":
+       parser = argparse.ArgumentParser(description='trans the log')
+       parser.add_argument('--host_log_folder', default="./",
+                           help="input the dir name, trans the current dir with default")
+       ags = parser.parse_args()
+       func(ags.host_log_folder)
+   ```
+
+2. Set the environment variable to print host logs to the screen.
+
+   ```
+   export ASCEND_SLOG_PRINT_TO_STDOUT=1
+   ```
+
+3. Set the log level to **info**. For details, see the _CANN Log Reference_.
+4. Run the training script to train the model. After the training is complete, obtain the host logs. By default, the logs are stored in the **$HOME/ascend/log/plog** directory, where **$HOME** indicates the home directory of the user on the host.
+5. Parse the host logs to obtain the operator information file **ascend_op_info_summary.txt** in the current directory.
+
+   ```
+   python3 get_ascend_op_info.py --host_log_folder $HOME/ascend/log/plog
+   ```
+
+6. Analyze the extra tasks in TaskInfo, especially transdata.
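+
+   As a starting point for step 6, a small helper can count how often each operator appears in the summary file, transdata in particular (a sketch; the file name comes from the script above):
+
+   ```python
+   from collections import Counter
+
+   # Count operator occurrences in the summary produced by get_ascend_op_info.py.
+   counts = Counter()
+   with open('ascend_op_info_summary.txt') as f:
+       for line in f:
+           op_type = line.split(' ', 1)[0]
+           counts[op_type] += 1
+
+   for op, n in counts.most_common(10):
+       print(op, n)
+   # The operator may be logged as "TransData" or "transdata" depending on the log source.
+   print('transdata entries:', counts.get('TransData', 0) + counts.get('transdata', 0))
+   ```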
+
+#### Host-side Performance Optimization
+
+##### Overview
+
+During PyTorch model porting and training, for some network models, the number of images processed per second (FPS) is low and the performance does not meet the requirements. You can perform the following optimizations on the server to improve the training performance:
+
+- Change the CPU performance mode.
+- Install the high-performance Pillow library.
+- (Optional) Install the OpenCV library of the specified version.
+
+##### Changing the CPU Performance Mode (x86 Server)
+
+###### Setting the Power Policy to High Performance
+
+To improve network performance, set the power policy to high performance in the BIOS settings of the x86 server. The detailed operations are as follows:
+
+1. Log in to the iBMC WebUI, start the virtual console, and select **HTML5 Integrated Remote Console**, as shown in [Figure 3](#fig15869135420288).
+
+   **Figure 3** Remote console
+
+   ![](figures/remote-console.png "remote-console")
+
+2. On the virtual toolbar, click the startup item tool ![](figures/en-us_image_0000001144241932.png). The startup item drop-down list is displayed, as shown in [Figure 4](#fig744814574243).
+
+   **Figure 4** Startup item tool
+
+   ![](figures/startup-item-tool.png "startup-item-tool")
+
+3. In the drop-down list, select **BIOS Setup**, and click ![](figures/en-us_image_0000001190201999.png) on the toolbar to restart the server.
+4. After the system restarts, the BIOS configuration screen is displayed. Choose **Advanced** > **Socket Configuration**. See [Figure 5](#fig4546303814).
+
+   **Figure 5** Socket Configuration
+
+   ![](figures/socket-configuration.png "socket-configuration")
+
+5. On the **Advanced Power Mgmt. Configuration** page displayed, set **Power Policy** to **Performance**. See [Figure 6](#fig15501111014442).
+
+   **Figure 6** Setting the power policy
+
+   ![](figures/setting-the-power-policy.png "setting-the-power-policy")
+
+6. Press **F10** to save the settings and reboot the server.
+
+###### Setting the CPU Mode to Performance
+
+Perform the following steps as the **root** user:
+
+1. Run the following command to check the current CPU mode:
+
+   ```
+   cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
+   ```
+
+   After the preceding command is run, the current CPU mode is displayed. For details, see [Table 1](#table354392019384). If the current CPU mode is not performance, perform the following operations to set it to performance; otherwise, skip this step.
+
+   **Table 1** CPU mode
+
+   | Governor | Description |
+   | --- | --- |
+   | performance | The CPU runs at the maximum frequency. |
+   | powersave | The CPU runs at the minimum frequency. |
+   | userspace | The CPU runs at a frequency specified by the user. |
+   | ondemand | The CPU frequency is dynamically adjusted as required. Once a task needs CPU computing power, the CPU runs at the maximum frequency. If the idle time increases, the CPU frequency decreases. |
+   | conservative | The CPU frequency is dynamically adjusted as required. The adjustment is more conservative than that of the ondemand mode. |
+   | schedutil | The CPU frequency is adjusted based on the scheduler. |
+
+2. Run the following command to install the tool:
+   - The **ubuntu/debian** system is used as an example.
+
+     ```
+     apt-get install linux-tools-$(uname -r)
+     ```
+
+   - The **centos/bclinux/euler** system is used as an example:
+
+     ```
+     yum install kernel-tools -y
+     systemctl daemon-reload
+     systemctl enable cpupower
+     systemctl start cpupower
+     ```
+
+3. Set the CPU mode to performance.
+
+   ```
+   cpupower frequency-set -g performance
+   ```
+
+4. Perform [Step 1](#li158435131344) again to check whether the current CPU mode is set to performance.
+
+##### Changing the CPU Performance Mode (ARM Server)
+
+###### Setting the Power Policy to High Performance
+
+Some models that have demanding requirements on the host CPUs, for example, object detection models, require complex image pre-processing. Enabling the high-performance mode of the power supply can improve performance and stability. To improve network performance, set the power policy to high performance in the BIOS settings of the ARM server. The detailed operations are as follows:
+
+1. Log in to the iBMC WebUI, start the virtual console, and select **HTML5 Integrated Remote Console**, as shown in [Figure 7](#fig15869135420288).
+
+   **Figure 7** Remote console
+
+   ![](figures/remote-console-0.png "remote-console-0")
+
+2. On the virtual toolbar, click the startup item tool ![](figures/en-us_image_0000001190202013.png). The startup item drop-down list is displayed, as shown in [Figure 8](#fig744814574243).
+
+   **Figure 8** Startup item tool
+
+   ![](figures/startup-item-tool-1.png "startup-item-tool-1")
+
+3. In the drop-down list, select **BIOS Setup**, and click ![](figures/en-us_image_0000001190081877.png) on the toolbar to restart the server.
+4. After the system restarts, the BIOS configuration screen is displayed. Choose **Advanced** > **Performance Config**. See [Figure 9](#fig4546303814).
+
+   **Figure 9** Performance Config
+
+   ![](figures/performance-config.png "performance-config")
+
+5. On the **Performance Config** page, set **Power Policy** to **Performance**. See [Figure 10](#fig15501111014442).
+
+   **Figure 10** Setting the power policy
+
+   ![](figures/setting-the-power-policy-2.png "setting-the-power-policy-2")
+
+6. Press **F10** to save the settings and reboot the server.
+
+##### Installing the High-Performance Pillow Library (x86 Server)
+
+1. Run the following command to install the dependencies for the high-performance Pillow library:
+
+   Ubuntu/Debian:
+
+   ```
+   apt-get install libtiff5-dev libjpeg8-dev libopenjp2-7-dev zlib1g-dev libfreetype6-dev liblcms2-dev libwebp-dev tcl8.6-dev tk8.6-dev python3-tk libharfbuzz-dev libfribidi-dev libxcb1-dev
+   ```
+
+   CentOS/BC-Linux/EulerOS:
+
+   ```
+   yum install libtiff-devel libjpeg-devel openjpeg2-devel zlib-devel freetype-devel lcms2-devel libwebp-devel tcl-devel tk-devel harfbuzz-devel fribidi-devel libraqm-devel libimagequant-devel libxcb-devel
+   ```
+
+2. Install the high-performance Pillow library.
+   1. Run the following command to uninstall the native Pillow:
+
+      ```
+      pip3.7 uninstall -y pillow
+      ```
+
+   2. Install the pillow-simd of the SSE4 version.
+
+      Run the following command as the **root** user. If a non-root user is used, add **--user** to the end of the command.
+ + ``` + pip3.7 install pillow-simd + ``` + + >![](public_sys-resources/icon-note.gif) **NOTE:** + >If the CPU supports the AVX2 instruction set, run the following command to install pillow-simd of the AVX2 version: + >``` + >CC="cc -mavx2" pip3.7 install -U --force-reinstall pillow-simd + >``` + + +3. Modify the TorchVision code to solve the problem that the pillow-simd does not contain the **PILLOW\_VERSION** field. For details about how to install TorchVision, see [How to Obtain](#how-to-obtain). + + Modify the code in line 5 of **/usr/local/python3._x.x_/lib/python3._x_/site-packages/torchvision/transforms/functional.py** as follows: + + ``` + try: + from PIL import Image, ImageOps, ImageEnhance,PILLOW_VERSION + except: + from PIL import Image, ImageOps, ImageEnhance + PILLOW_VERSION="7.0.0" + ``` + +##### (Optional) Installing the OpenCV Library of the Specified Version + +If the model depends on OpenCV, you are advised to install OpenCV 3.4.10 to ensure training performance. + +1. Source code: [Link](https://opencv.org/releases/) +2. Installation guide: [Link](https://docs.opencv.org/3.4.10/d7/d9f/tutorial_linux_install.html) + +#### Training Performance Optimization + +##### Operator Bottleneck Optimization + +1. Obtain the profile data during training. For details, see [Profile Data Collection](#profile-data-collection). +2. Analyze the profile data to obtain the time-consuming operator. +3. See [Single-Operator Sample Building](#single-operator-sample-building) to build the single-operator sample of the time-consuming operator, and compare the execution time of a single-operator sample on the CPU and GPU. If the performance is insufficient, use either of the following methods to solve the problem: + - Workaround: Use other efficient operators with the same semantics. + - Solution: Improve the operator performance. + + +##### Copy Bottleneck Optimization + +1. Obtain the profile data during training. For details, see [Profile Data Collection](#profile-data-collection). +2. Analyze the Profile data to obtain the execution time of **D2DCopywithStreamSynchronize**, **PTCopy**, or **format\_contiguous** in the entire network. +3. If the execution takes a long time, use either of the following methods to solve the problem: + - Method 1 \(workaround\): Replace view operators with compute operators. In PyTorch, view operators cause conversion from non-contiguous tensors to contiguous tensors. The optimization idea is to replace view operators with compute operators. Common view operators include view, permute, and transpose operators. For more view operators, go to [https://pytorch.org/docs/stable/tensor\_view.html](https://pytorch.org/docs/stable/tensor_view.html). + - Method 2 \(solution\): Accelerate the operation of converting non-contiguous tensors to contiguous tensors. + + +##### Framework Bottleneck Optimization + +1. Obtain the operator information (OP_INFO) during the training. For details, see [Obtaining Operator Information (OP_INFO)](#obtaining-operator-information-op_info). +2. Analyze the specifications and calling relationship of operators in OP\_INFO to check whether redundant operators are inserted. Pay special attention to check whether transdata is proper. +3. Solution: Specify the initialization format of some operators to eliminate cast operators. +4. In **pytorch/torch/nn/modules/module.py**, specify the operator initialization format in **cast\_weight**, as shown in the following figure. 
+
+   ![](figures/指定算子初始化方式.png)
+
+   The format setting principle is as follows:
+
+   - For the Conv2D operator, weight can be set to FZ format, for example, line 424.
+   - For the linear operator, weight can be set to NZ format, for example, line 409.
+
+##### Compilation Bottleneck Optimization
+
+1. Obtain the operator information (OP_INFO) during the training. For details, see [Obtaining Operator Information (OP_INFO)](#obtaining-operator-information-op_info).
+2. View the INFO log and check the keyword **aclopCompile::aclOp** after the first step. If **Match op inputs/type failed** or **To compile op** is displayed, the operator is dynamically compiled and needs to be optimized.
+3. Use either of the following methods to solve the problem:
+   - Workaround: Based on the understanding of model semantics and related APIs, replace dynamic shape with static shape.
+   - Solution: Reduce compilation or do not compile the operator.
+   - For details about how to optimize the operator compilation configuration, see [Compilation Option Settings](#compilation-option-settings).
+
+### E2E Performance Tool (E2E prof) Instructions
+
+#### Introduction
+
+The E2E prof tool integrates the framework-layer data obtained by the Profiling tool of PyTorch and the operator profile data obtained by the CANN prof tool to implement end-to-end model and operator performance analysis.
+
+#### Usage Tutorial
+
+Add the following with statement to enable the E2E prof function:
+
+```
+with torch.npu.profile(profiler_result_path="./result", use_e2e_profiler=True):
+
+    model_train()
+```
+
+- **profiler_result_path** indicates the path for storing the prof results. If no path is specified, the current path is used by default.
+- **use_e2e_profiler** indicates whether to enable the E2E prof function. The default value is **False**, indicating that only the CANN prof function is enabled.
+
+(NPU operators can be executed only after compilation. To ensure data accuracy, you are advised to run about 10 steps first and perform the E2E prof operation only after compilation has finished. Generally, profiling only needs to cover one or two steps.)
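+
+Following the note above, a sketch that warms up for 10 steps and then profiles a single step might look like this (**train_step** and **loader** are placeholders for your own training step and data loader):
+
+```python
+WARMUP_STEPS = 10  # let operator compilation finish before measuring
+
+for step, (images, target) in enumerate(loader):
+    if step < WARMUP_STEPS:
+        train_step(images, target)
+    elif step == WARMUP_STEPS:
+        # Profile one step only; results go to ./result as in the tutorial above.
+        with torch.npu.profile(profiler_result_path="./result", use_e2e_profiler=True):
+            train_step(images, target)
+    else:
+        break
+```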
+
+#### Result Parsing
+
+The results obtained by using the E2E prof tool are raw data, which can be viewed only after being parsed.
+
+1. Use the path in the tutorial as an example. The tool creates a folder in the *profiler_result_path* directory to save the raw data.
+
+   ![](https://gitee.com/ascend/pytorch/raw/master/docs/zh/PyTorch%E7%BD%91%E7%BB%9C%E6%A8%A1%E5%9E%8B%E7%A7%BB%E6%A4%8D&%E8%AE%AD%E7%BB%83%E6%8C%87%E5%8D%97/figures/1.png)
+
+2. Switch to the **./result** directory in the preceding figure and run the following script:
+
+   ```
+   /usr/local/Ascend/ascend-toolkit/latest/toolkit/tools/profiler/bin/msprof --export=on --output=./
+   ```
+
+   - **--output**: indicates the path of the raw data.
+
+3. After the running is complete, find the **timeline** directory in the raw data path. See the following figure.
+
+   ![](https://gitee.com/ascend/pytorch/raw/master/docs/zh/PyTorch%E7%BD%91%E7%BB%9C%E6%A8%A1%E5%9E%8B%E7%A7%BB%E6%A4%8D&%E8%AE%AD%E7%BB%83%E6%8C%87%E5%8D%97/figures/2.png)
+
+4. The **timeline** directory stores the parsed profile data, which can be opened in **chrome://tracing/**.
+
+   1. Open a browser and enter **chrome://tracing/** in the address box.
+
+   2. Click **Load** to upload the file.
+
+   ![](https://gitee.com/ascend/pytorch/raw/master/docs/zh/PyTorch%E7%BD%91%E7%BB%9C%E6%A8%A1%E5%9E%8B%E7%A7%BB%E6%A4%8D&%E8%AE%AD%E7%BB%83%E6%8C%87%E5%8D%97/figures/chrometracing.png)
+
+   An example is provided as follows:
+
+   ![](https://gitee.com/ascend/pytorch/raw/master/docs/zh/PyTorch%E7%BD%91%E7%BB%9C%E6%A8%A1%E5%9E%8B%E7%A7%BB%E6%A4%8D&%E8%AE%AD%E7%BB%83%E6%8C%87%E5%8D%97/figures/3.png)
+
+   This example contains four layers from top to bottom. The first layer (MsprofTx) contains the PyTorch framework data, the second layer (AscendCL) contains the AscendCL data, the third layer (Task Scheduler) contains the device data, and the fourth layer (AI CPU) contains the AI CPU data.
+
+#### Advanced Settings
+
+By default, the E2E prof tool can obtain all of the preceding data. However, the process of obtaining data affects the performance, and if a large amount of data is obtained, the profile data cannot be used as a reference. Therefore, the E2E prof tool provides configurable options for fine-grained control over obtaining data of specified layers.
+
+```
+with torch.npu.profile(profiler_result_path="./results", use_e2e_profiler=True, config=torch.npu.profileConfig(ACL_PROF_ACL_API=True, ACL_PROF_TASK_TIME=True, ACL_PROF_AICORE_METRICS=True, ACL_PROF_AICPU=True, ACL_PROF_L2CACHE=True, ACL_PROF_HCCL_TRACE=True, ACL_PROF_TRAINING_TRACE=True, aiCoreMetricsType=0)):
+```
+
+- **ACL_PROF_ACL_API**: collects profile data of AscendCL APIs. The default value is **True**.
+- **ACL_PROF_TASK_TIME**: collects the execution time of AI Core operators. The default value is **True**.
+- **ACL_PROF_AICORE_METRICS**: collects the AI Core performance metrics. Only the metrics configured in **aiCoreMetricsType** are valid. The default value is **True**.
+- **ACL_PROF_AICPU**: collects traces of AI CPU tasks, including the start and end of each task. The default value is **True**.
+- **ACL_PROF_L2CACHE**: collects L2 cache data. The default value is **True**.
+- **ACL_PROF_HCCL_TRACE**: collects HCCL data. The default value is **True**.
+- **ACL_PROF_TRAINING_TRACE**: collects iteration traces, which record the forward and backward propagation steps of a model. The default value is **True**.
+
+The values of **aiCoreMetricsType** are defined as follows. The default value is **0**.
+
+- **ACL_AICORE_ARITHMETIC_UTILIZATION = 0**: percentages of arithmetic throughput, including metrics **mac_fp16_ratio**, **mac_int8_ratio**, **vec_fp32_ratio**, **vec_fp16_ratio**, **vec_int32_ratio**, and **vec_misc_ratio**
+- **ACL_AICORE_PIPE_UTILIZATION = 1**: percentages of time taken by the compute units and MTEs, including metrics **vec_ratio**, **mac_ratio**, **scalar_ratio**, **mte1_ratio**, **mte2_ratio**, **mte3_ratio**, and **icache_miss_rate**
+- **ACL_AICORE_MEMORY_BANDWIDTH = 2**: percentages of external memory read/write instructions, including metrics **ub_read_bw**, **ub_write_bw**, **l1_read_bw**, **l1_write_bw**, **l2_read_bw**, **l2_write_bw**, **main_mem_read_bw**, and **main_mem_write_bw**
+- **ACL_AICORE_L0B_AND_WIDTH**: percentages of internal memory read/write instructions, including **scalar_ld_ratio**, **scalar_st_ratio**, **l0a_read_bw**, **l0a_write_bw**, **l0b_read_bw**, **l0b_write_bw**, **l0c_read_bw**, and **l0c_write_bw**.
+- **ACL_AICORE_RESOURCE_CONFLICT_RATIO**: percentages of pipeline stall instructions, including **vec_bankgroup_cflt_ratio**, **vec_bank_cflt_ratio**, **vec_resc_cflt_ratio**, **mte1_iq_full_ratio**, **mte2_iq_full_ratio**, **mte3_iq_full_ratio**, **cube_iq_full_ratio**, **vec_iq_full_ratio**, and **iq_full_ratio**. +- **ACL_AICORE_NONE = 0xFF**: Profiling disabled + +### Affinity Library + +#### Source + +The common network structures and functions in the public models are optimized to greatly improve computing performance. In addition, the network structures and functions are integrated into the PyTorch framework to facilitate model performance optimization. + +#### Functions + + + + + + + + + + + + + + + + + + + + + + + +
+### Affinity Library
+
+#### Source
+
+The common network structures and functions in the public models are optimized to greatly improve computing performance. In addition, the network structures and functions are integrated into the PyTorch framework to facilitate model performance optimization.
+
+#### Functions
+
+| Function | Location | Description |
+| --- | --- | --- |
+| pairwise_iou | torch.contrib.npu.optimized_lib | Calculates the IOUs of the two bounding boxes. |
+| fast_rcnn_inference_single_image | torch.contrib.npu.optimized_lib | Provides the inference API of the Mask R-CNN and Faster R-CNN models. |
+| ChannelShuffle | torch.contrib.npu.optimized_lib | Provides NPU-affinity channelshuffle operations and applies to models such as shufflenetv2. |
+| PreLoader | torch.contrib.npu.optimized_lib | Provides the data loading method for accelerating Ascend AI Processors. |
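+As an illustration only, the sketch below shows how one of the listed modules might be dropped into a model. The import path comes from the table; the **ChannelShuffle** constructor and call signature are assumed to mirror the Channel_Shuffle class defined in the ShuffleNet sample later in this document and may differ in your PyTorch version, so treat them as assumptions rather than the documented API.
+
+```python
+import torch
+from torch.contrib.npu.optimized_lib import ChannelShuffle  # path from the table above
+
+branch_features = 58
+# Assumed signature, mirroring the Channel_Shuffle sample class (inp, groups).
+shuffle = ChannelShuffle(inp=branch_features * 2, groups=2)
+
+x1 = torch.randn(4, branch_features, 28, 28).npu()
+x2 = torch.randn(4, branch_features, 28, 28).npu()
+out = shuffle(x1, x2)  # NPU-affinity replacement for cat + channel_shuffle
+```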
+
+>![](public_sys-resources/icon-note.gif) **NOTE:**
+>The optimization content will be enhanced and updated with the version. Use the content in the corresponding path of the actual PyTorch version.
+
+## Precision Commissioning
+
+### Prerequisites
+
+Run a certain number of epochs (20% of the total number of epochs is recommended) with the same semantics and hyperparameters to align the precision and loss with the corresponding level of the GPU. After the alignment is complete, align the final precision.
+
+### Commissioning Process
+
+#### Overall Guideline
+
+To locate the precision problem, you need to find out the step in which the problem occurs. The following aspects are involved:
+
+1. Model network calculation error
+
+   - Locating method: Add a hook to the network to determine which part is suspected. Then build a single-operator sample by referring to [Single-Operator Sample Building](#single-operator-sample-building) to narrow down the error range. This can prove that the operator calculation is incorrect in the current network. You can compare the result with the CPU or GPU result to prove the problem.
+
+   - Workaround: Use other operators with the same semantics.
+
+   - Solution: Improve the operator precision or function.
+
+2. Loss calculation error
+
+   - Locating method: The loss is special and can be customized. After determining that the loss calculation is incorrect, you are advised to dump the loss input in the network instead of using a random tensor with an identical shape, so that the problem can be better reproduced and proved.
+
+   - Workaround: Use other operators with the same semantics.
+
+   - Solution: Improve the operator precision or function. (Loss is also formed by operators.)
+
+3. Parameter update error
+
+   - Locating method: Before each **optim.step()**, print the gradients of the parameters in the network one by one to determine which part is suspected. Then build a single-operator sample to narrow down the error range. This can prove that the gradient calculation by the operator is incorrect in the current network. You can compare the result with the CPU or GPU result to prove the problem. The priority of this item should be lower than that of items 1 and 2 because the errors of items 1 and 2 can also cause the gradient exception.
+
+   - Workaround: Use other operators with the same semantics.
+
+   - Solution: Improve the precision or function of the operator for gradient calculation.
+
+4. Multi-device calculation error
+
+   - Locating method: If the precision is correct on a single device but a calculation error occurs after switching to multi-device training, a multi-device calculation error exists.
+
+   - Solution: Contact Huawei support and provide single-device and multi-device scripts that stably reproduce the problem.
+
+#### Precision Tuning Methods
+
+Common model precision problems are as follows: the training loss does not converge or the precision does not meet requirements because of operator overflow/underflow or calculation errors in the network. You can perform single-operator overflow/underflow detection and network-wide commissioning to resolve the preceding problems.
+
+##### **Environment Setup**
+
+- Install the HDF5 tool to support the operator dump function. For details about how to install the tool, see [HDF5 Compilation and Installation](#hdf5-compilation-and-installation).
+
+  To use the operator precision comparison function, install the HDF5 tool in both the NPU and GPU environments. Otherwise, install it only in the NPU environment.
+
+- Install the Ascend PyTorch framework that supports the dump function. Modify the **build.sh** script before compilation. For details about the other operations, see the *PyTorch Installation Guide*.
+
+  - Install PyTorch in the NPU environment.
+
+    Add the `USE_DUMP=1` field to the **build.sh** script before compilation.
+
+    ```bash
+    DEBUG=0 USE_DISTRIBUTED=1 USE_HCCL=1 USE_MKLDNN=0 USE_CUDA=0 USE_NPU=1 BUILD_TEST=0 USE_NNPACK=0 USE_DUMP=1 python"${PY_VERSION}" setup.py build bdist_wheel
+    ```
+
+  - (Optional) Install PyTorch in the GPU environment. Perform this operation only when you want to compare the precision of model operators.
+
+    Before compilation, open the **build.sh** script, add the `USE_DUMP=1` and `USE_NCCL=0` fields, change the values of the `USE_HCCL` and `USE_NPU` fields to **0**, and change the value of the `USE_CUDA` field to **1**.
+
+    ```bash
+    DEBUG=0 USE_DISTRIBUTED=1 USE_HCCL=0 USE_NCCL=0 USE_MKLDNN=0 USE_CUDA=1 USE_NPU=0 BUILD_TEST=0 USE_NNPACK=0 USE_DUMP=1 python"${PY_VERSION}" setup.py build bdist_wheel
+    ```
+
+##### Model Operator Precision Comparison
+
+With the same inputs, you can use the Model Accuracy Analyzer to obtain the precision difference of the operator outputs of a model when the model is trained on the GPU and the NPU, helping you locate operator precision problems.
+
+Restrictions:
+
+- You are advised to use a small batch size (**8** or fewer).
+
+  The input and output data of each operator is stored on the disk and occupies a large amount of space. Therefore, you are advised to set a small batch size to save disk space.
+
+- You are advised to dump the data of only one step for precision comparison.
+
+- Currently, only FP32 operators during O1 or O2 mixed-precision training can be used for precision comparison.
+
+Comparison modes:
+
+- Assume that the input and output of the GPU are known data. Load the input data of the GPU to the NPU to obtain the output data, and compare the NPU-based output with the GPU-based output.
+- Assume that the input and output of the NPU are known data. Load the input data of the NPU to the GPU to obtain the output data, and compare the NPU-based output with the GPU-based output.
+
+Procedure:
+
+1. In the GPU or NPU environment, use the Dumper tool to obtain the model input and operator output on the GPU or NPU.
+
+   Modify the training code to add the data dump function. Use the `with` statement in the forward and backward propagation positions of the model training code and add the `torch.utils.dumper()` method to dump data. For example, the following is a modification example in the GPU environment:
+
+   ```python
+   for i, data in enumerate(dataloader):
+       with torch.utils.dumper(use_dump=True, dump_path="./model_gpu.h5") as dump:
+           # Model training code
+           xxx # forward code
+           xxx # backward code
+       exit()
+       xxx # optimizer code
+   ```
+
+   **dump_path** indicates the path of the dump data file, including the file name. You are advised to dump the data of only one step for precision comparison and place the parameter update code outside the `with` statement.
+
+2. Copy the **model_gpu.h5** data dumped in the GPU (NPU) environment to the NPU (GPU) environment.
+
+3. In the GPU or NPU environment, use the Dumper tool to load the dumped data and obtain the operator output data.
+
+   Modify the training code and add the data load and dump functions. Use the `with` statement in the forward and backward propagation positions of the model training code and add the `torch.utils.dumper()` method to load and dump data.
+   For example, the following is a modification example in the NPU environment:
+
+   ```python
+   for i, data in enumerate(dataloader):
+       with torch.utils.dumper(use_dump=True, load_file_path="./model_gpu.h5", dump_path="./model_npu.h5") as dump:
+           # Model training code
+           xxx # forward code
+           xxx # backward code
+       exit()
+       xxx # optimizer code
+   ```
+
+   **load_file_path** indicates the path of the dump data obtained from the GPU or NPU. **dump_path** indicates the path of the dump data file, including the file name. You are advised to dump the data of only one step for precision comparison and place the parameter update code outside the `with` statement.
+
+4. Use msaccucmp.py to compare the operator output data.
+
+   1. Ascend-Toolkit provides the msaccucmp.py tool for precision comparison.
+
+      - The script is stored in **/usr/local/Ascend/ascend-toolkit/latest/tools/operator_cmp/compare/msaccucmp.py**.
+
+        The path is for reference only. Replace it with the actual installation path of Ascend-Toolkit.
+
+      - You can also run the following command to query the path of msaccucmp.py:
+
+        ```shell
+        find / -name msaccucmp.py
+        ```
+
+   2. Run the msaccucmp.py script to compare the precision.
+
+      ```shell
+      python3 /usr/local/Ascend/ascend-toolkit/latest/tools/operator_cmp/compare/msaccucmp.py compare -m ./model_npu.h5 -g ./model_gpu.h5
+      ```
+
+      Parameters:
+
+      `-g` passes the path of the dump data obtained from the GPU.
+
+      `-m` passes the path of the dump data obtained from the NPU.
+
+##### Single-Operator Overflow/Underflow Detection
+
+With this function, you can check whether an operator overflows/underflows and collect the data of overflow/underflow operators, helping developers quickly locate and solve operator precision problems.
+
+Restrictions:
+
+- This function provides IR-level operator overflow/underflow detection, and only for the AI Core (not Atomic).
+- When using the single-operator overflow/underflow detection function, do not enable the dynamic loss scale mode of apex and the tensor fusion function at the same time.
+
+Collecting data of overflow/underflow operators (a runnable sketch follows this section):
+
+```python
+# check_overflow is the overflow/underflow detection control switch.
+# dump_path is the path for storing dump files.
+with torch.utils.dumper(check_overflow=check_overflow, dump_path=dump_path, load_file_path='') as dump:
+    # Code snippet for detecting operator overflow/underflow.
+```
+
+During model running of a step, if an operator overflows/underflows, the name of the corresponding IR is printed.
+
+Viewing dump data:
+
+If dump data is collected during training, an .h5 file of the dump data is generated in the {dump_path} directory. You can go to the directory to view the dump data.
+
+Solution:
+
+1. Map the collected .h5 file to the TBE operators. For details, see [Mapping Between IR and TBE Operators](#mapping-between-ir-and-tbe-operators).
+
+2. Send the screenshots of the operator overflow/underflow and the input and output files of the mapped TBE operators to Huawei R&D engineers as the attachment of an issue.
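+The following is a minimal runnable sketch of the pattern above, assuming an NPU environment and a toy model as a stand-in for your own training objects:
+
+```python
+import torch
+import torch.nn as nn
+
+model = nn.Linear(8, 8).npu()            # toy stand-in for your model
+optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
+input_tensor = torch.randn(4, 8).npu()
+
+# check_overflow=True enables detection; if an operator overflows/underflows
+# during this step, the name of the corresponding IR is printed.
+with torch.utils.dumper(check_overflow=True, dump_path="./overflow.h5", load_file_path='') as dump:
+    loss = model(input_tensor).sum()
+    loss.backward()
+optimizer.step()                         # keep parameter updates outside the with block
+```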
+##### Mapping Between IR and TBE Operators
+
+Prerequisites:
+
+- Set the environment variable: `export ACL_DUMP_DATA=0`.
+- Do not use the `torch.npu.init_dump()` and `torch.npu.set_dump()` APIs in the script.
+
+Procedure:
+
+1. Prepare the .h5 file of the operators to be mapped.
+
+   - In the operator overflow/underflow detection scenario, the .h5 file of the operators to be mapped has already been generated by the single-operator overflow/underflow detection.
+
+   - In the precision comparison scenario, run the following command to extract the .h5 file of the operators to be mapped based on the comparison result (the inspection sketch after this procedure can help you find the operator names and seqids):
+
+     ```shell
+     h5copy -pv -i "./input.h5" -o "./output.h5" -s "/op1/seqid/" -d "/op1/seqid/"
+     ```
+
+     **-i** indicates the input precision comparison file.
+
+     **-o** indicates the output .h5 file of the operators to be mapped.
+
+     **-s** indicates the name and **seqid** of the source operator to be extracted.
+
+     **-d** indicates the name and **seqid** of the target operator to be extracted.
+
+     If multiple operators need to be extracted, modify the **-s** and **-d** parameters and run the command multiple times to extract the operators to **output.h5**.
+
+     The **-s** and **-d** parameters in this command must be the same.
+
+     Example:
+
+     ```shell
+     h5copy -pv -i "./dump_npu.h5" -o "./output.h5" -s "/numpy_T/1/" -d "/numpy_T/1/"
+     ```
+
+     This example extracts the input and output data of the numpy_T operator whose **seqid** is **1** from **./dump_npu.h5** to the **./output.h5** file.
+
+2. Configure the **acl.json** file.
+
+   Create the **acl.json** configuration file required by the AscendCL dump function in the model directory.
+
+   ```json
+   {
+       "dump":
+       {
+           "dump_list":[],
+           "dump_path":"./output_IR2TBE",
+           "dump_mode":"all",
+           "dump_op_switch":"on"
+       }
+   }
+   ```
+
+   Change `dump_path` to the mapping result output path. Other fields do not need to be modified.
+
+3. Modify the training script.
+
+   Add the `with` statement to the training script to enable the IR-to-TBE mapping function.
+
+   ```python
+   with torch.utils.dumper(use_load=True, dump_path="./", load_file_path="./output.h5", load_with_acl_dump=True) as dump:
+       # Model calculation code, which needs to be added by users
+       # x = model(input_data)
+   ```
+
+4. Run the model.
+
+   Run a complete model calculation process. During the calculation, if the load process encounters data recorded in **output.h5**, the AscendCL dump function is automatically enabled to execute the IR and dump the input and output data of the TBE operators corresponding to the IR. After the IR is executed, the AscendCL dump ends.
+
+5. Obtain the mapping file.
+
+   After the execution is successful, view the output result file in the `dump_path` directory specified in the **acl.json** configuration file.
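+The operator names and **seqid** values needed by h5copy can be read from the dump file itself. The sketch below walks a dump file and prints every group and dataset path; it assumes the `/op_name/seqid/` layout implied by the h5copy examples above and requires the h5py Python package (built on the HDF5 library installed earlier).
+
+```python
+import h5py
+
+# Print every group/dataset path in the dump file, e.g. "numpy_T/1/...",
+# so the operator name and seqid can be passed to h5copy via -s and -d.
+with h5py.File("./dump_npu.h5", "r") as f:
+    f.visit(print)
+```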
+##### Mapping Between NPU and GPU Operators
+
+For details, see "Data Preparation > [Preparing Data Files for Accuracy Comparison with PyTorch as the Original Training Network](https://support.huawei.com/enterprise/en/doc/EDOC1100219270/2324edc8#ZH-CN_TOPIC_0000001162580808)" in "Model Accuracy Analyzer Instructions (Training)" in the *CANN Auxiliary Development Tool User Guide*.
+
+##### Network-wide Commissioning
+
+You can also commission the network model precision by analyzing the entire network.
+
+1. Determine whether the calculation on the Ascend AI Processor is correct by comparing the calculation result on the CPU with that on the Ascend AI Processor.
+
+   Code example (this example shows only the basic method and cannot be copied and run directly):
+
+   ```python
+   # The input parameters are fixed to ensure that the model and input data are the same on the CPU and Ascend AI Processor.
+   input_tensor_cpu = torch.Tensor()
+   model_cpu = build_model()
+   # Port the input data to the Ascend AI Processor.
+   input_tensor_npu = input_tensor_cpu.npu()
+   # Port the model to the Ascend AI Processor.
+   model_npu = model_cpu.npu()
+
+   # Compare the calculation results.
+   output_cpu = model_cpu(input_tensor_cpu)
+   output_npu = model_npu(input_tensor_npu)
+   compute_result = (output_cpu - output_npu).abs().mean()
+   print(compute_result)
+   ```
+
+   The calculation results are slightly different because the hardware architecture of the Ascend AI Processor is different from that of the CPU. If the difference is small (generally not greater than 1e-4), the results are normal.
+
+2. Use the hook mechanism of PyTorch to print the inputs and outputs of the modules in the forward and backward propagation for analysis.
+
+   Code example (this example shows only the basic method and cannot be copied and run directly):
+
+   ```python
+   # Set the hook function.
+   def hook_func(name, module):
+       def hook_function(module, inputs, outputs):
+           print(name+' inputs', inputs)
+           print(name+' outputs', outputs)
+       return hook_function
+
+   # Register the forward and backward hooks.
+   for name, module in model.named_modules():
+       module.register_forward_hook(hook_func('[forward]: '+name, module))
+       module.register_backward_hook(hook_func('[backward]: '+name, module))
+
+   # Run
+   model(input_tensor)
+   ```
+
+   Analyze the printed inputs and outputs in the forward and backward propagation.
+
+3. Obtain parameters such as **grad**, **running_mean**, and **running_var** of the module to analyze the updates.
+
+   Code example (this example shows only the basic method and cannot be copied and run directly):
+
+   ```python
+   # For example, obtain the gradients and the running mean/variance of BN layers for check.
+   for name, module in model.named_modules():
+       if isinstance(module, nn.modules.batchnorm._BatchNorm):
+           print("[BN_buffer]: "+name, module.running_mean, module.running_var)
+           # Gradients live on the parameters, not on the module itself.
+           print("[grad]: "+name, module.weight.grad, module.bias.grad)
+   ```
+
+## Model Saving and Conversion
+
+### Introduction
+
+After the model training is complete, save the model file and export the ONNX model by using the APIs provided by PyTorch. Then use the ATC tool to convert the model into an .om file that adapts to the Ascend AI Processor for offline inference.
+
+This section describes how to convert the trained .pth or .pth.tar file into an ONNX model. For details about how to convert the ONNX model into an .om file adapted to the Ascend AI Processor, see "ATC Tool Instructions" in the *CANN Auxiliary Development Tool User Guide*.
+
+For details about how to use the Auto Tune function, see "Auto Tune Instructions" in the *CANN Auxiliary Development Tool User Guide*.
+
+For details about how to build an offline inference application, see the *CANN Application Software Development Guide (C and C++, Inference)*. The process is as follows:
+
+![](figures/en-us_image_0000001144082132.png)
+
+### Saving a Model
+
+During PyTorch training, **torch.save()** is used to save checkpoint files.
+Based on the usage of the model files, model files are saved in the following two formats:
+
+- .pth or .pt files: These files are used for online inference or exporting ONNX models. Only the model parameters are saved; the model structure is not saved. The saved file can be opened using a visualization tool such as Netron. Figure 11 shows an example.
+
+  **Figure 11** .pth file
+  ![](figures/pth-file.jpg "pth-file")
+
+  Use **state_dict** to save and load a model. The following is an example:
+
+  1. Save a model.
+
+     ```python
+     # Create a storage path.
+     PATH = "state_dict_model.pt"
+     # Save a model.
+     torch.save(net.state_dict(), PATH)
+     ```
+
+  2. Load the model for online inference. The following is an example. For details, see the *PyTorch Online Inference Guide*.
+
+     ```python
+     # Path for storing the model file
+     PATH = "state_dict_model.pt"
+     model = TheModelClass(*args, **kwargs)
+     # Load a model.
+     model.load_state_dict(torch.load(PATH))
+     model.eval()
+     ```
+
+  >![](public_sys-resources/icon-notice.gif) **NOTICE:**
+  >The model definition file must be provided when the .pth or .pt file is saved. Otherwise, the deployment cannot be performed.
+
+- .pth.tar files: can be used for online inference or training after reloading. Multiple components are saved in dictionary format. Common components include the **state_dict** of the model and optimizer, the epoch when the training stops, the training loss of the latest record, and the external torch.nn.Embedding layer. If only an inference model needs to be deployed, you are advised to save only the weight information, that is, the **state_dict** of the model, in the .pth.tar file.
+
+  The following is an example of saving and loading a model:
+
+  1. Save a model.
+
+     ```python
+     PATH = "checkpoint.pth.tar"
+     torch.save({
+         'epoch': epoch,
+         'loss': loss,
+         'state_dict': model.state_dict(),
+         'optimizer': optimizer.state_dict(),
+         ...
+     }, PATH)
+     ```
+
+  2. Load a model for inference or resuming training. Note that the keys used for loading must match the keys used for saving.
+
+     ```python
+     model = TheModelClass(*args, **kwargs)
+     optimizer = TheOptimizerClass(*args, **kwargs)
+
+     checkpoint = torch.load(PATH)
+     model.load_state_dict(checkpoint['state_dict'])
+     optimizer.load_state_dict(checkpoint['optimizer'])
+     epoch = checkpoint['epoch']
+     loss = checkpoint['loss']
+
+     model.eval()
+     # - or -
+     model.train()
+     ```
+
+>![](public_sys-resources/icon-notice.gif) **NOTICE:**
+>Generally, an operator is processed in different ways in the training graph and inference graph (for example, the BatchNorm and dropout operators), and the input formats are also different. Therefore, before inference or ONNX model exporting, **model.eval()** must be called to set the dropout and batch normalization layers to the inference mode.
+
+### Exporting an ONNX Model
+
+#### Introduction
+
+The deployment policy of the Ascend AI Processor for PyTorch models is implemented based on the ONNX module that is supported by PyTorch. ONNX is a mainstream model format in the industry and is widely used for model sharing and deployment. This section describes how to export a checkpoint file as an ONNX model by using the **torch.onnx.export()** API.
+
+#### Using the .pth or .pt File to Export the ONNX Model
+
+The saved .pth or .pt file can be restored by building a model using PyTorch and then loading the weights. Then you can export the ONNX model. The following is an example.
+
+```python
+import torch
+import torch.onnx
+import torchvision.models as models
+# Set the CPU to be used to export the model.
+device = torch.device("cpu")
+
+def convert():
+    # The model definition comes from torchvision. The model file generated in this example is based on the ResNet-50 model.
+    model = models.resnet50(pretrained=False)
+    resnet50_model = torch.load('resnet50.pth', map_location='cpu')
+    model.load_state_dict(resnet50_model)
+
+    batch_size = 1  # Batch size
+    input_shape = (3, 224, 224)  # Input shape. Replace it with the actual shape.
+
+    # Set the model to inference mode.
+    model.eval()
+
+    dummy_input = torch.randn(batch_size, *input_shape)  # Define the input.
+    torch.onnx.export(model,
+                      dummy_input,
+                      "resnet50_official.onnx",
+                      input_names=["input"],  # Construct the input name.
+                      output_names=["output"],  # Construct the output name.
+                      opset_version=11,  # Currently, the ATC tool supports only opset_version=11.
+                      dynamic_axes={"input": {0: "batch_size"}, "output": {0: "batch_size"}})  # Dynamic axes of the input and output are supported.
+
+if __name__ == "__main__":
+    convert()
+```
+
+>![](public_sys-resources/icon-note.gif) **NOTE:**
+>- Before exporting the ONNX model, **model.eval()** must be called to set the dropout and batch normalization layers to inference mode.
+>- The model in the sample script comes from the definition in the torchvision module. You need to specify a model when using your own model.
+>- The constructed input and output must correspond to the input and output during training. Otherwise, the inference cannot be performed properly.
+
+#### Using the .pth.tar File to Export the ONNX Model
+
+Before exporting the ONNX model using the .pth.tar file, you need to check the saved information. Sometimes, the saved node names may be different from the node names in the model definition. For example, a prefix or suffix may have been added. During the conversion, you can modify the node names. The following is an example of the conversion.
+
+```python
+import torch
+import torch.onnx
+from collections import OrderedDict
+import mobilenet
+
+# In this example, the prefix "module." was added to the node names when the .pth.tar file was saved. Delete it by traversing the nodes.
+def proc_nodes_module(checkpoint, AttrName):
+    new_state_dict = OrderedDict()
+    for key, value in checkpoint[AttrName].items():
+        if key == "module.features.0.0.weight":
+            print(value)
+        if key[0:7] == "module.":
+            name = key[7:]
+        else:
+            name = key[0:]
+        new_state_dict[name] = value
+    return new_state_dict
+
+def convert():
+    checkpoint = torch.load("./mobilenet_cpu.pth.tar", map_location=torch.device('cpu'))
+    checkpoint['state_dict'] = proc_nodes_module(checkpoint, 'state_dict')
+    model = mobilenet.mobilenet_v2(pretrained=False)
+    model.load_state_dict(checkpoint['state_dict'])
+    model.eval()
+    input_names = ["actual_input_1"]
+    output_names = ["output1"]
+    dummy_input = torch.randn(1, 3, 224, 224)
+    torch.onnx.export(model, dummy_input, "mobilenetV2_npu.onnx", input_names=input_names, output_names=output_names, opset_version=11)
+
+if __name__ == "__main__":
+    convert()
+```
+
+## Samples
+
+### ShuffleNet Model Optimization
+
+#### Obtaining Samples
+
+##### How to Obtain
+
+1. This sample is adapted from the ImageNet dataset training model provided on the PyTorch official website for porting to the Ascend 910 AI Processor. The sample can be obtained from [https://github.com/pytorch/examples/tree/master/imagenet](https://github.com/pytorch/examples/tree/master/imagenet).
+2. For details about the ShuffleNet model, see [ShuffleNet V2](https://pytorch.org/hub/pytorch_vision_shufflenet_v2/) on the PyTorch official website. Set the **arch** parameter to **shufflenet_v2_x1_0** during script execution.
+
+   ```
+   --arch shufflenet_v2_x1_0
+   ```
+
+   >![](public_sys-resources/icon-note.gif) **NOTE:**
+   >ShuffleNet is a model built in PyTorch. For more built-in models, visit the [PyTorch official website](https://pytorch.org/).
+
+##### Directory Structure
+
+The structure of the major directories and files is as follows:
+
+```
+├──main.py
+```
+
+#### Model Evaluation
+
+Model evaluation focuses on operator adaptation. Use the dump op method to obtain the ShuffleNet operator information and compare the information with that in the *PyTorch Operator Support*. If an operator is not supported, in simple scenarios, you can replace the operator with a similar one or place the operator on the CPU to avoid the problem. In complex scenarios, operator development is required. For details, see the *PyTorch Operator Development Guide*.
+
+#### Porting the Network
+
+For details about how to port the training scripts, see [Single-Device Training Porting](#single-device-training-porting) and [Single-Server Multi-Device Training Modification](#single-server-multi-device-training-modification). During script execution, select the **--arch shufflenet_v2_x1_0** parameter.
+
+#### Commissioning the Network
+
+For details about how to commission the network, see [Commissioning Process](#commissioning-process). Checks revealed that operators consumed too much time when ShuffleNet was running. The following provides the time consumption data and solutions.
+
+##### Forward Check
+
+The forward check record table is as follows:
+
+**Table 1** Forward check
+
+| No. | time (ms) | batch_size | Detail |
+| --- | --- | --- | --- |
+| 1 | 1100 | 512 | Replace channel_shuffle with channel_shuffle_index_select. |
+| 2 | 600 | 512 | Perform the channel_shuffle_index_select operation twice to reduce the non-contiguous tensors caused by chunk. |
+| 3 | 300 | 512 | Specify the concat output format to NCHW through the framework layer to eliminate excessive transdata. |
+| 4 | 285 | 512 | Rectify the weight format. |
+| 5 | 275 | 512 | Rectify the problem that the output format 5HD was not specified for DWCONV. |
+
+The details are as follows:
+
+- The native **torch.transpose(x, 1, 2).contiguous()** uses the view operator transpose, which produces non-contiguous tensors. As described for the copy bottleneck in [copy bottleneck optimization](#training-performance-optimizationmd), **channel_shuffle_index_select** replaces the framework operator with a compute operator of the same semantics, reducing the time consumption.
+- ShuffleNet V2 contains a large number of chunk operations, and chunk operations are framework operators in PyTorch. As a result, a tensor is split into several non-contiguous tensors of the same length, and the operation of converting non-contiguous tensors to contiguous tensors takes a long time. Therefore, the compute operator is used to eliminate non-contiguous tensors. For details, see the copy bottleneck described in [copy bottleneck optimization](#training-performance-optimizationmd).
+- During operator adaptation, the output format is specified as the input format by default. However, Concat does not support the 5HD format whose C dimension is not an integral multiple of 16, so it converts the format into 4D for processing. In addition, Concat is followed by the GatherV2 operator, which supports only the 4D format. Therefore, the data format conversion process is 5HD > 4D > Concat > 5HD > 4D > GatherV2 > 5HD. The solution is to modify the Concat output format: when the C dimension is not an integral multiple of 16, the output format is specified as 4D. After the optimization, the data format conversion process is 5HD > 4D > Concat > GatherV2 > 5HD. For details about the method for ShuffleNet, see line 121 in **pytorch/aten/src/ATen/native/npu/CatKernelNpu.cpp**.
+- Set the weight initialization format to avoid repeated transdata during calculation, as with the framework bottleneck described in [copy bottleneck optimization](#training-performance-optimizationmd).
+- The output format of the DWCONV weight is rectified to avoid the unnecessary conversion from 5HD to 4D.
+
+##### Entire Network Check
+
+The record table of the entire network check is as follows:
+
+**Table 2** Entire network check
+
+| No. | time (ms) | batch_size | Detail |
+| --- | --- | --- | --- |
+| 1 | 5500 | 512 | The index_add operation is performed by copying index to the CPU through the framework layer. |
+| 2 | 4000 | 512 | Customize operators to pre-generate an index. |
+| 3 | 1800 | 512 | Customize operators to combine index_add and chunk. |
+| 4 | 885 | 512 | Add contiguous_with_gatherv2. |
+| 5 | 3480 | 1024 | Modify batchsize. |
+| 6 | 1650 | 1024 | Modify batchsize and contiguous_with_gatherv2. |
+| 7 | 1424 | 1024 | Customize operators to combine cat, shuffle, and chunk to eliminate non-contiguous tensors. |
+| 8 | 1360 | 1024 | Modify the format of the gradient transferred by ReluGrad through the framework layer. |
+| 9 | 1300 | 1024 | Modify the backward propagation input format of IndexSelectFullImplementation. |
+| 10 | 920 | 1024 | Modify amp O1. |
+| 11 | 860 | 1024 | Modify amp O2. |
+| 12 | 830 | 1024 | Eliminate the excessive transdata introduced by the AXPY during BN parameter update. |
+| 13 | 800 | 1024 | Cancel the stream synchronization among forward propagation, backward propagation, and parm_update. |
+| 14 | 461 | 1024 | Optimize the GatherV2 operator for non-32-byte alignment scenarios. |
+| 15 | 429 | 1024 | Optimize GatherV2 to GatherV3 in the ShuffleNet V2 scenario. |
+
+The details are as follows:
+
+1. Replace framework operators with compute operators.
+
+2. Use a buffer to record the index information on the NPU, and cancel the **index.to('npu')** copy operation.
+
+3. Use compute operators to eliminate non-contiguous tensors.
+
+4. The AI Core operator GatherV2 is used by **contiguous_with_gatherv2** to convert non-contiguous tensors to contiguous tensors.
+
+5. Modify **batchsize**.
+
+6. Modify **batchsize** and **contiguous_with_gatherv2**.
+
+7. The chunk operator is the backward calculation mode of the Concat operator, and it may produce non-contiguous tensors. Therefore, the backward calculation mode of the Concat operator needs to be customized. Combine cat, shuffle, and chunk, then replace chunk with GatherV2 to eliminate non-contiguous tensors.
+
+8. The ReluGrad operator has two inputs: **grad_output** (backward input) and **self** (forward output). In ShuffleNet, the 4D and 5HD formats exist at the same time in some cases. However, the FE format is usually aligned with the format of the first tensor, so the following process occurs: (4D, 5HD) > (4D, 4D) > ReluGrad > 4D > 5HD. The forward output format is basically the input format, and ReLU is usually used together with Conv and BN. In this scenario, the 5HD format is more suitable for the output. Therefore, insert **npu_format_cast** manually, so that the following process occurs: (4D, 5HD) > (5HD, 5HD) > ReluGrad > 5HD.
+
+9. In IndexSelectFullImplementation, the gatherv2 operation is performed twice on a 5HD tensor. In this case, the conversion from 5HD to 4D is performed twice. You can manually convert 5HD to 4D once, so that transdata is not performed during the gatherv2 operation, reducing one transdata operation.
+
+10. Add the mixed precision O1.
+
+11. Add the mixed precision O2.
+
+12. Because of the parameter verification of the Axpy operator, when the parameters of the network are updated and the C dimension is not exactly divisible by 16, the Axpy operation for 4D is performed by transdata operators, which introduces a large number of transdata operators. To solve this problem, add a check so that the verification ends early when the Axpy input shapes are the same. This avoids format conversion and improves the running efficiency.
+
+13. Delete all the stream synchronization operations. This is not adopted because it easily causes non-convergence.
+
+14. After using the GatherV2 operator optimized for non-alignment scenarios, the overall performance is improved to the delivery level.
+
+15. After using the GatherV3 operator optimized for the ShuffleNet V2 scenario, the overall performance can be further improved.
+
+##### Python Optimization Details
+
+The optimization on the Python side makes the network more NPU-affine by modifying equivalent semantics. The operations of converting non-contiguous tensors to contiguous tensors can be a performance bottleneck. The **channel_shuffle** operation in ShuffleNet V2 involves such conversion operations after permute, causing poor performance of the entire network. The performance of the entire network can be greatly improved by modifying the equivalent semantics of the **channel_shuffle** operation and combining it with the concat operation. The torchvision implementation is used as the baseline. For details, go to the [open source link](https://github.com/pytorch/vision/blob/master/torchvision/models/shufflenetv2.py).
+ +- Original **channel\_shuffle** operation: + + ``` + def channel_shuffle(x, groups): + # type: (torch.Tensor, int) -> torch.Tensor + batchsize, num_channels, height, width = x.data.size() + channels_per_group = num_channels // groups + # reshape + x = x.view(batchsize, groups, + channels_per_group, height, width) + x = torch.transpose(x, 1, 2).contiguous() + # flatten + x = x.view(batchsize, -1, height, width) + return x + + class InvertedResidual(nn.Module): + def __init__(self, inp, oup, stride): + super(InvertedResidual, self).__init__() + if not (1 <= stride <= 3): + raise ValueError('illegal stride value') + self.stride = stride + branch_features = oup // 2 + assert (self.stride != 1) or (inp == branch_features << 1) + if self.stride > 1: + self.branch1 = nn.Sequential( + self.depthwise_conv(inp, inp, kernel_size=3, stride=self.stride, padding=1), + nn.BatchNorm2d(inp), + nn.Conv2d(inp, branch_features, kernel_size=1, stride=1, padding=0, bias=False), + nn.BatchNorm2d(branch_features), + nn.ReLU(inplace=True), + ) + else: + self.branch1 = nn.Sequential() + + self.branch2 = nn.Sequential( + nn.Conv2d(inp if (self.stride > 1) else branch_features, + branch_features, kernel_size=1, stride=1, padding=0, bias=False), + nn.BatchNorm2d(branch_features), + nn.ReLU(inplace=True), + self.depthwise_conv(branch_features, branch_features, kernel_size=3, stride=self.stride, padding=1), + nn.BatchNorm2d(branch_features), + nn.Conv2d(branch_features, branch_features, kernel_size=1, stride=1, padding=0, bias=False), + nn.BatchNorm2d(branch_features), + nn.ReLU(inplace=True), + ) + + @staticmethod + def depthwise_conv(i, o, kernel_size, stride=1, padding=0, bias=False): + return nn.Conv2d(i, o, kernel_size, stride, padding, bias=bias, groups=i) + + def forward(self, x): + if self.stride == 1: + x1, x2 = x.chunk(2, dim=1) + out = torch.cat((x1, self.branch2(x2)), dim=1) + else: + out = torch.cat((self.branch1(x), self.branch2(x)), dim=1) + + out = channel_shuffle(out, 2) + + return out + ``` + +- Equivalent semantics rewriting: + +``` +def channel_shuffle_index_select(x, groups=2): + N, C, H, W = x.shape + inp = C +# The channel_shuffle operation is to rearrange the C dimension according to certain rules. It can be expressed as a simple rearrangement. + group_len = inp // groups + index = torch.from_numpy(np.array(list(range(inp))).reshape(groups, group_len).transpose(1, 0).flatten()).long() + + x = x.index_select(1, index) + return x + +# Compare the results of the two operations. The semantics are the same. +x = torch.randn(2, 232, 14, 14) +for group in [2, 4, 8]: + out1 = channel_shuffle(x, group) + out2 = channel_shuffle_index_select(x, group) + print((out1 - out2).sum()) +``` + +- Affinity writing method of the Ascend AI Processor: + + ``` + # Corresponding to out = channel_shuffle(torch.cat((self.branch1(x), self.branch2(x)), dim=1)) + # Replace channel_shuffle with channel_shuffle_index_select. + # Customize operators to combine channel_shuffle_index_select and cat, and use compute operators to reduce non-contiguous tensors. + class IndexSelectFullImplementation(torch.autograd.Function): + @staticmethod + def forward(ctx, x1, x2, fp_index, bp_index1, bp_index2): + # Forcible stream synchronization, which is used only for training stabilization. + stream = torch.npu.current_stream() + stream.synchronize() + + # Register bp_index1 and bp_index2 with context so that they can be used in backward propagation. 
+ ctx.bp_index1 = bp_index1 + ctx.bp_index2 = bp_index2 + + x = torch.cat([x1, x2], dim=1) + + # Replace channel_shuffle with index_select. In this example, the chunk operator is not used. + result = x.index_select(1, fp_index) + + return result + + @staticmethod + def backward(ctx, grad_output): + # Forcible stream synchronization, which is used only for training stabilization. + stream = torch.npu.current_stream() + stream.synchronize() + + # Convert the format to NCHW to reduce extra transdata because index_select does not support the 5HD format. + grad_output.data = grad_output.data.npu_format_cast(0) + + # Use index_select to reverse index_select and cat based on the reverse expression obtained from forward derivation. + out1 = grad_output.index_select(1, ctx.bp_index1) + out2 = grad_output.index_select(1, ctx.bp_index2) + return out1, out2, None, None, None, None + + + class IndexSelectHalfImplementation(torch.autograd.Function): + @staticmethod + def forward(ctx, x1, x2, fp_index1, fp_index2, bp_index1, bp_index2): + ctx.bp_index1 = bp_index1 + ctx.bp_index2 = bp_index2 + x = torch.cat([x1, x2], dim=1) + + # Replace channel_shuffle with index_select. In this example, the chunk operator is used. + return x.index_select(1, fp_index1), x.index_select(1, fp_index2) + + @staticmethod + def backward(ctx, grad_output1, grad_output2): + grad_output = torch.cat([grad_output1, grad_output2], 1) + + out1 = grad_output.index_select(1, ctx.bp_index1) + out2 = grad_output.index_select(1, ctx.bp_index2) + return out1, out2, None, None, None, None + + + class Channel_Shuffle(nn.Module): + def __init__(self, inp, groups=2, split_shuffle=True): + super(Channel_Shuffle, self).__init__() + + self.split_shuffle = split_shuffle + self.group_len = inp // groups + + # Initialize fp_index to be used in channel_shuffle_index_select. + self.out = np.array(list(range(inp))).reshape(groups, self.group_len).transpose(1, 0).flatten().tolist() + + # Register the initialized fp_index as the buffer of the module. When to.device is called, the buffer is brought to the device to reduce the time consumed by host-to-device copy. + # This section describes only the common usage when the value of group is 2. Expand based on the actual scenario. + if self.split_shuffle: + self.register_buffer('fp_index1', torch.tensor(self.out[:self.group_len], dtype=torch.int32)) + self.register_buffer('fp_index2', torch.tensor(self.out[self.group_len:], dtype=torch.int32)) + else: + self.register_buffer('fp_index', torch.tensor(self.out, dtype=torch.int32)) + + # Register the corresponding bp_index as the buffer of the module. When to.device is called, the buffer is brought to the device to reduce the time consumed by host-to-device copy. 
+ self.register_buffer('bp_index1', torch.tensor(list(range(0, inp, 2)), dtype=torch.int32)) + self.register_buffer('bp_index2', torch.tensor(list(range(1, inp, 2)), dtype=torch.int32)) + + def forward(self, x1, x2): + if self.split_shuffle: + return IndexSelectHalfImplementation.apply(x1, x2, self.fp_index1, self.fp_index2, self.bp_index1, + self.bp_index2) + else: + return IndexSelectFullImplementation.apply(x1, x2, self.fp_index, self.bp_index1, self.bp_index2) + + + class InvertedResidual(nn.Module): + def __init__(self, inp, oup, stride, split_shuffle=True): + super(InvertedResidual, self).__init__() + + if not (1 <= stride <= 3): + raise ValueError('illegal stride value') + self.stride = stride + + branch_features = oup // 2 + assert (self.stride != 1) or (inp == branch_features << 1) + + if self.stride > 1: + self.branch1 = nn.Sequential( + self.depthwise_conv(inp, inp, kernel_size=3, stride=self.stride, padding=1), + nn.BatchNorm2d(inp), + nn.Conv2d(inp, branch_features, kernel_size=1, stride=1, padding=0, bias=False), + nn.BatchNorm2d(branch_features), + nn.ReLU(inplace=True), + ) + else: + self.branch1 = nn.Sequential() + + self.branch2 = nn.Sequential( + nn.Conv2d(inp if (self.stride > 1) else branch_features, + branch_features, kernel_size=1, stride=1, padding=0, bias=False), + nn.BatchNorm2d(branch_features), + nn.ReLU(inplace=True), + self.depthwise_conv(branch_features, branch_features, kernel_size=3, stride=self.stride, padding=1), + nn.BatchNorm2d(branch_features), + nn.Conv2d(branch_features, branch_features, kernel_size=1, stride=1, padding=0, bias=False), + nn.BatchNorm2d(branch_features), + nn.ReLU(inplace=True), + ) + + if self.stride > 1: + self.channel_shuffle = Channel_Shuffle(inp=branch_features + branch_features, groups=2, + split_shuffle=split_shuffle) + else: + self.channel_shuffle = Channel_Shuffle(inp=inp, groups=2, split_shuffle=split_shuffle) + + @staticmethod + def depthwise_conv(i, o, kernel_size, stride=1, padding=0, bias=False): + return nn.Conv2d(i, o, kernel_size, stride, padding, bias=bias, groups=i) + + def forward(self, x): + + # Delete the concat and chunk operations and combine them into self.channel_shuffle for processing. + if self.stride == 1: + x1, x2 = x + x2 = self.branch2(x2) + else: + x1 = self.branch1(x) + x2 = self.branch2(x) + + out = self.channel_shuffle(x1, x2) + + return out + ``` + +## References + +### Single-Operator Sample Building + +When a problem occurs in a model, it is costly to reproduce the problem in the entire network. You can build a single-operator sample to reproduce the precision or performance problem to locate and solve the problem. A single-operator sample can be built in either of the following ways: For details about single-operator dump methods, see [Single-Operator Dump Method](#single-operator-dump-method). + +1. Build a single-operator sample test case. You can directly call the operator to reproduce the error scenario. + + The following is an example of building a single-operator sample of the max operator: + + ``` + import torch + import copy + from torch.testing._internal.common_utils import TestCase, run_tests + class TestMax(TestCase): + def cpu_op_exec(self, input1): + # Call the operator. + output = torch.max(input1) + output = output.to('cpu') + output = output.numpy() + return output + + def npu_op_exec(self, input1): + # Call the corresponding NPU operator. 
+            output = torch.max(input1)
+            return output
+
+        def test_max(self):
+            input = torch.randn(10, 20)
+            input = input.to(torch.int64)   # Convert the data type.
+            input_cpu = copy.deepcopy(input)
+            input_npu = copy.deepcopy(input).npu()
+
+            output_cpu = self.cpu_op_exec(input_cpu)
+            output_npu = self.npu_op_exec(input_npu)
+
+            # Compare the calculation results of the CPU and NPU. prec is the allowed error.
+            self.assertEqual(output_cpu, output_npu, prec=1e-4)
+
+    if __name__ == '__main__':
+        run_tests()
+    ```
+
+    >![](public_sys-resources/icon-note.gif) **NOTE:**
+    >- Run the preceding code. If the reported error information is the same as that of the max operator in the model, the single-operator test case is successfully built.
+    >- Assume that the data type conversion code is commented out. If no error is reported in the test case, an error of the max operator is reported on the NPU when the input parameter is **torch.int64**.
+
+2. Build a single-operator test case based on the context.
+
+    Although this is a single-operator sample, sometimes it is not only an operation but also a scenario with context or a module with parameters. The module mode is a more common method. The following is an example of building a module that contains two operators:
+
+    ```python
+    import torch
+    import torch.nn as nn
+    import copy
+    from torch.testing._internal.common_utils import TestCase, run_tests
+
+    class Model(nn.Module):
+        def __init__(self, in_channels=1, hooks=False):
+            super(Model, self).__init__()
+            self.conv = nn.Conv2d(in_channels, in_channels*2, kernel_size=64)
+            if hooks:
+                self.conv.weight.register_hook(lambda grad: print(grad))
+        def forward(self, x):
+            out = self.conv(x)
+            return out
+
+    class TestConv2d(TestCase):
+        def test_conv2d(self):
+
+            model = Model(in_channels=16)
+
+            # Add hooks to obtain the backward propagation result.
+            # model = Model(in_channels=16, hooks=True)
+            # Create an input tensor.
+            input_tensor = torch.randn(4, 16, 64, 64)
+
+            input_tensor_cpu = copy.deepcopy(input_tensor)
+            out = model(input_tensor_cpu)
+            loss = out.sum()
+            loss.backward()
+            cpuout = out
+
+            # Run the model and input tensor on the NPU.
+            torch.npu.set_device("npu:0")   # Set the running device first.
+            # Copy the CPU model so that both devices use the same weights for comparison.
+            model_npu = copy.deepcopy(model).npu()
+            input_tensor_npu = copy.deepcopy(input_tensor).npu()
+            out = model_npu(input_tensor_npu)
+            loss = out.sum()
+            loss.backward()
+            npuout = out
+            # Determine whether the scenario is an error scenario based on the result.
+            self.assertEqual(cpuout, npuout, prec=1e-4)
+
+    if __name__ == '__main__':
+        run_tests()
+    ```
+
+### Single-Operator Dump Method
+
+#### Collecting Dump Data
+
+Currently, the PyTorch adapted to Ascend AI Processors uses the init_dump(), set_dump(), and finalize_dump() interfaces in **torch.npu** to collect operator dump data. Call init_dump() to initialize the dump configuration, call set_dump() to import the configuration file that sets the dump parameters, and call finalize_dump() to end the dump. The following uses the add_ operator as an example to describe how to collect dump data.
+
+```python
+import torch
+torch.npu.set_device("npu:0")
+torch.npu.init_dump()
+torch.npu.set_dump("/home/HwHiAiUser/dump.json") # "/home/HwHiAiUser/dump.json" is the path of the configuration file. You can configure it as required.
+a = torch.tensor([2, 2]).to("npu:0")
+a.add_(1)
+torch.npu.finalize_dump()
+```
+
+The configuration method of **dump.json** is as follows.
+
+```json
+{
+    "dump":
+    {
+        "dump_list":[],
+        "dump_path":"/home/HwHiAiUser/dump/output",
+        "dump_mode":"all",
+        "dump_op_switch":"on"
+    }
+}
+```
+
+The fields of **dump.json** are described as follows.
+| Field | Description |
+| --- | --- |
+| dump_list | Operator model whose data is to be dumped. Leave this parameter empty. |
+| dump_path | Directory where dump data files are stored in the operating environment. The value can be an absolute path or a relative path.<br>- An absolute path starts with a slash (/), for example, **/home/HwHiAiUser/output**.<br>- A relative path starts with a directory name, for example, **output**.<br>For example, if **dump_path** is set to **/home/HwHiAiUser/output**, the dump data files are generated under the **/home/HwHiAiUser/output** directory in the operating environment. |
+| dump_mode | Dump data mode. The configuration is as follows:<br>- **output** (default): dumps operator outputs only.<br>- **input**: dumps operator inputs only.<br>- **all**: dumps both operator inputs and outputs. |
+| dump_op_switch | Dump data status of the single-operator model. The configuration is as follows:<br>- **off** (default): disables dump for the single-operator model.<br>- **on**: enables dump for the single-operator model. |
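+Putting the fields together, the sketch below writes a **dump.json** programmatically and then collects dump data with the **torch.npu** dump APIs shown earlier in this section. The file path is an example; adjust it to your environment.
+
+```python
+import json
+import torch
+
+cfg = {
+    "dump": {
+        "dump_list": [],                            # leave empty
+        "dump_path": "/home/HwHiAiUser/dump/output",
+        "dump_mode": "all",                         # dump both inputs and outputs
+        "dump_op_switch": "on",                     # enable single-operator dump
+    }
+}
+with open("/home/HwHiAiUser/dump.json", "w") as f:
+    json.dump(cfg, f, indent=4)
+
+torch.npu.set_device("npu:0")
+torch.npu.init_dump()
+torch.npu.set_dump("/home/HwHiAiUser/dump.json")
+a = torch.tensor([2, 2]).to("npu:0")
+a.add_(1)                                           # the operator to be dumped
+torch.npu.finalize_dump()
+```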
+
+#### Viewing Overflowed Data
+
+The collected dump data is generated in the **{dump_path}/{time}/{deviceid}/{model_id}/{data_index}** directory, for example, **/home/HwHiAiUser/output/20200808163566/0/0**.
+
+The fields in the dump data path and files are described as follows:
+
+- **dump_path**: user-defined path for storing overflowed data, for example, **/home/HwHiAiUser/output**.
+- **time**: timestamp, for example, **20200808163566**.
+- **deviceid**: device ID.
+- **model_id**: subgraph ID.
+- A dump file is named **{op_type}.{op_name}.{taskid}.{stream_id}.{timestamp}**. Any period (.), slash (/), backslash (\), or space in the **op_type** or **op_name** field is replaced by an underscore (_).
+
+#### Parsing the Dump File of an Overflow Operator
+
+1. Upload the **{op_type}.{op_name}.{taskid}.{stream_id}.{timestamp}** file to the environment with CANN installed.
+
+2. Go to the path where the parsing script is stored. Assume that the installation directory of the CANN is **/home/HwHiAiUser/Ascend**.
+
+   **cd /home/HwHiAiUser/Ascend/ascend-toolkit/latest/toolkit/tools/operator_cmp/compare**
+
+3. Run the **msaccucmp.pyc** script to convert the dump file into a NumPy file. The following is an example:
+
+   **python3 msaccucmp.pyc convert -d /home/HwHiAiUser/dump -out /home/HwHiAiUser/dumptonumpy -v 2**
+
+   >![](public_sys-resources/icon-note.gif) **NOTE:**
+   >The **-d** option enables the conversion of a single dump file or all dump files in a path.
+
+4. Use Python to save the NumPy data into a .txt file. The following is an example:
+
+   **$ python3**
+
+   **>>> import numpy as np**
+
+   **>>> a = np.load("/home/HwHiAiUser/dumptonumpy/Pooling.pool1.1147.1589195081588018.output.0.npy")**
+
+   **>>> b = a.flatten()**
+
+   **>>> np.savetxt("/home/HwHiAiUser/dumptonumpy/Pooling.pool1.1147.1589195081588018.output.0.txt", b)**
+
+   The dimension and **Dtype** information no longer exist in the .txt file. For details, visit the NumPy website.
+
+### Common Environment Variables
+
+1. Enables the task delivery in multi-thread mode. When this function is enabled, the training performance of the entire network is improved in most cases.
+
+   **export TASK_QUEUE_ENABLE=1**
+
+2. Controls whether host logs are printed to stdout (**1**: enable; **0**: disable), which can be used to view host logs on the screen.
+
+   **export ASCEND_SLOG_PRINT_TO_STDOUT=0**
+
+3. Sets the log level. Log levels in descending order are: debug > info > warning > error > null. Generally, the log level is set to **error**; **info** is used for debugging. For details about how to set the log level, see the *CANN Log Reference*.
+
+   **export ASCEND_GLOBAL_LOG_LEVEL=3**
+
+4. Dumps graphs, which is used to view the graph structure.
+
+   **export DUMP_GE_GRAPH=2**
+
+   **export DUMP_GRAPH_LEVEL=3**
+
+5. Enables/Disables the event log function.
+
+   **export ASCEND_GLOBAL_EVENT_ENABLE=0**
+
+6. Enables/Disables PTCopy.
+
+   **export PTCOPY_ENABLE=1**
+
+7. Enables/Disables the combined flag.
+
+   **export COMBINED_ENABLE=1**
+
+8. Sets whether to recompile the code in special scenarios. You do not need to modify this parameter.
+
+   **export DYNAMIC_OP="ADD#MUL"**
+
+9. Enables/Disables the HCCL trustlist.
+
+   **export HCCL_WHITELIST_DISABLE=1**
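+If you prefer to set these variables from a Python launcher instead of the shell, the equivalent is a few `os.environ` assignments. This is a sketch under the assumption that the variables are read when the framework initializes, so they must be set before the NPU modules are imported.
+
+```python
+import os
+
+os.environ["TASK_QUEUE_ENABLE"] = "1"         # multi-thread task delivery
+os.environ["ASCEND_GLOBAL_LOG_LEVEL"] = "3"   # error-level logging
+os.environ["PTCOPY_ENABLE"] = "1"             # enable PTCopy
+os.environ["COMBINED_ENABLE"] = "1"           # enable the combined flag
+
+import torch  # import the framework only after the variables are set
+```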
+
+### dump op Method
+
+1. Use the profile API to reconstruct the loss calculation and optimization process of the original training script and print the operator information. The following is a code example.
+
+   ```python
+   with torch.autograd.profiler.profile() as prof:
+       out = model(input_tensor)
+       loss = out.sum()
+       loss.backward()
+   # You can also export the result to a file.
+   print(prof.key_averages().table(sort_by="self_cpu_time_total"))
+   ```
+
+2. Run the reconstructed training script on the CPU. The related operator information is displayed.
+
+### Compilation Option Settings
+
+Configure the attributes of an operator during compilation to improve performance. This is implemented by ACL APIs. The usage is as follows, and a combined example follows the option descriptions below.
+
+```python
+import torch
+option = {key: val}
+torch.npu.set_option(option) # Set in dict mode.
+```
+
+The keys are described as follows:
+
+- **ACL_OP_SELECT_IMPL_MODE**: sets the operator implementation mode (high-precision or high-performance).
+- **ACL_OPTYPELIST_FOR_IMPLMODE**: lists operator types. Operators on the list are implemented in the mode specified by **ACL_OP_SELECT_IMPL_MODE**.
+- **ACL_OP_DEBUG_LEVEL**: enables TBE operator debug during operator compilation.
+- **ACL_DEBUG_DIR**: sets the debug directory for saving the files generated during model conversion and network migration, including the .o, .json, and .cce files of operators. The directory must exist.
+- **ACL_OP_COMPILER_CACHE_MODE**: sets the disk cache mode for operator compilation.
+- **ACL_OP_COMPILER_CACHE_DIR**: sets the path of the disk cache for operator compilation. The path must exist.
+
+The values are described as follows:
+
+- **ACL_OP_SELECT_IMPL_MODE**: sets the operator implementation mode (high-precision or high-performance). If this option is not set, **high_precision** is used by default.
+  - **high_precision**: All operators in the network are implemented with high precision.
+  - **high_performance**: All operators in the network are implemented with high performance.
+- **ACL_OPTYPELIST_FOR_IMPLMODE**: sets the implementation mode of the operators in the optype list. Currently, this parameter can set the implementation mode of only one operator, such as Pooling, SoftmaxV2, LRN, or ROIAlign. Operators in the operator type list use the mode specified by **ACL_OP_SELECT_IMPL_MODE**.
+- **ACL_OP_DEBUG_LEVEL**: enables TBE operator debug during operator compilation.
+  - **0**: Disables operator debug. The operator binary file (.o) and operator description file (.json) are not retained in the kernel_meta folder in the atc command execution directory.
+  - **1**: Enables operator debug. TBE instruction mapping files, including an operator CCE file (\*.cce) and a Python-CCE mapping file (\*_loc.json), are generated in the kernel_meta folder under the atc command execution directory. You can locate AI Core errors by using tools.
+  - **2**: Enables operator debug. TBE instruction mapping files, including an operator CCE file (\*.cce) and a Python-CCE mapping file (\*_loc.json), are generated in the kernel_meta folder under the atc command execution directory. Build optimization is disabled and CCE compiler debug is enabled (by setting -O0 -g). You can locate AI Core errors by using tools.
+  - **3**: Disables operator debug. However, the operator binary file (.o) and operator description file (.json) are retained in the kernel_meta folder in the atc command execution directory.
+  - **4**: Disables operator debug. The operator binary file (.o) and operator description file (.json) are retained, and a TBE instruction mapping file (.cce) and a UB fusion description file ({$kernel_name}_compute.json) are generated in the kernel_meta folder under the atc command execution directory.
+- **ACL_DEBUG_DIR**: sets the debug directory for saving the debug-related files generated during model conversion and network migration, including the .o, .json, and .cce files of operators.
+- **ACL_OP_COMPILER_CACHE_MODE**: configures the disk cache mode for operator compilation. This compilation option must be used together with **ACL_OP_COMPILER_CACHE_DIR**.
+  - **enable**: enables the operator compilation cache.
+  - **disable**: disables the operator compilation cache.
+  - **force**: forcibly refreshes the cache. That is, the existing cache is deleted, recompiled, and then added to the cache. When the Python version or a dependency library changes, use **force** to clear the existing cache.
+- **ACL_OP_COMPILER_CACHE_DIR**: configures the disk cache directory for operator compilation. This compilation option must be used together with **ACL_OP_COMPILER_CACHE_MODE**.
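+Combining the keys and values above, a typical configuration keeps high precision while enabling the compilation disk cache. The cache directory below is an example path and must already exist.
+
+```python
+import torch
+
+torch.npu.set_option({
+    "ACL_OP_SELECT_IMPL_MODE": "high_precision",   # all operators in high-precision mode
+    "ACL_OP_COMPILER_CACHE_MODE": "enable",        # cache compiled operators on disk
+    "ACL_OP_COMPILER_CACHE_DIR": "/home/HwHiAiUser/op_cache",  # example path; must exist
+})
+```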
+
+### How Do I Install GCC 7.3.0?
+
+Perform the following steps as the **root** user.
+
+1. Download **gcc-7.3.0.tar.gz** from [https://mirrors.tuna.tsinghua.edu.cn/gnu/gcc/gcc-7.3.0/gcc-7.3.0.tar.gz](https://mirrors.tuna.tsinghua.edu.cn/gnu/gcc/gcc-7.3.0/gcc-7.3.0.tar.gz).
+
+2. GCC installation requires adequate temporary space. Run the following command to clear the **/tmp** directory in advance:
+
+   ```shell
+   sudo rm -rf /tmp/*
+   ```
+
+3. Install the dependencies.
+
+   For CentOS/BCLinux, run the following command:
+
+   ```shell
+   yum install bzip2
+   ```
+
+   For Ubuntu/Debian, run the following command:
+
+   ```shell
+   apt-get install bzip2
+   ```
+
+4. Build and install GCC.
+
+   1. Go to the directory where the source package **gcc-7.3.0.tar.gz** is located and run the following command to decompress it:
+
+      ```shell
+      tar -zxvf gcc-7.3.0.tar.gz
+      ```
+
+   2. Go to the extracted directory and run the following command to download the GCC dependency packages:
+
+      ```shell
+      cd gcc-7.3.0
+      ./contrib/download_prerequisites
+      ```
+
+      If an error is reported during the command execution, run the following commands in the **gcc-7.3.0/** directory to download the dependency packages:
+
+      ```shell
+      wget http://gcc.gnu.org/pub/gcc/infrastructure/gmp-6.1.0.tar.bz2
+      wget http://gcc.gnu.org/pub/gcc/infrastructure/mpfr-3.1.4.tar.bz2
+      wget http://gcc.gnu.org/pub/gcc/infrastructure/mpc-1.0.3.tar.gz
+      wget http://gcc.gnu.org/pub/gcc/infrastructure/isl-0.16.1.tar.bz2
+      ```
+
+      After the preceding dependencies are downloaded, run the following command again:
+
+      ```shell
+      ./contrib/download_prerequisites
+      ```
+
+      If the validation fails, check whether the dependency packages were downloaded repeatedly. Each package should be downloaded only once.
+
+   3. Run the following commands for configuration, build, and installation.
+
+      ```shell
+      ./configure --enable-languages=c,c++ --disable-multilib --with-system-zlib --prefix=/usr/local/linux_gcc7.3.0
+      make -j15 # Check the number of CPUs by running grep -w processor /proc/cpuinfo|wc -l. In this example, the number is 15.
+      make install
+      ```
+
+      >![](public_sys-resources/icon-caution.gif) **CAUTION:**
+      >The **--prefix** option is used to specify the linux_gcc7.3.0 installation path, which is configurable. Do not set it to **/usr/local** or **/usr**, which is the default installation path for the GCC installed by using the software source. Otherwise, a conflict occurs and the original GCC compilation environment of the system is damaged. In this example, the installation path is set to **/usr/local/linux_gcc7.3.0**.
+
+5. Set the environment variable.
+
+   Training must be performed in the compilation environment with GCC upgraded.
To run training in this environment, configure the following environment variable in your training script:

    ```
    export LD_LIBRARY_PATH=${install_path}/lib64:${LD_LIBRARY_PATH}
    ```

    **${install_path}** indicates the GCC 7.3.0 installation path configured in [3](#en-us_topic_0000001173199577_en-us_topic_0000001172534867_en-us_topic_0276688294_li1649343041310). In this example, the GCC 7.3.0 installation path is **/usr/local/linux\_gcc7.3.0/**.

    >![](public_sys-resources/icon-note.gif) **NOTE:**
    >Skip this step if you do not need to use the compilation environment with GCC upgraded.

### HDF5 Compilation and Installation

Perform the following steps as the **root** user.

1. Obtain the code.

    ```
    git clone https://github.com/HDFGroup/hdf5.git
    ```

2. Switch to the hdf5-1\_10\_7 branch.

    ```
    cd hdf5
    git checkout remotes/origin/hdf5_1_10_7
    ```

3. Compile HDF5.

    ```
    ./configure --prefix=/usr/local/hdf5 --enable-cxx
    make -j72                 # The value following -j can be set based on the number of CPU cores.
    make check                # Run the test suite.
    make install
    make check-install        # Verify the installation.
    ```

4. Add environment variables.

    ```
    export PATH=/usr/local/hdf5/bin:$PATH
    export LD_LIBRARY_PATH=/usr/local/hdf5/lib:$LD_LIBRARY_PATH
    export LIBRARY_PATH=/usr/local/hdf5/lib:$LIBRARY_PATH
    export CPATH=/usr/local/hdf5/include:$CPATH
    ```


## FAQs


- **[FAQs About Software Installation](#faqs-about-software-installationmd)**

- **[FAQs About Model and Operator Running](#faqs-about-model-and-operator-runningmd)**

- **[FAQs About Model Commissioning](#faqs-about-model-commissioningmd)**

- **[FAQs About Other Operations](#faqs-about-other-operationsmd)**

- **[FAQs About Distributed Model Training](#faqs-about-distributed-model-trainingmd)**


### FAQs About Software Installation


- **[pip3.7 install Pillow==5.3.0 Installation Failed](#pip3-7-install-pillow-5-3-0-installation-failedmd)**


#### pip3.7 install Pillow==5.3.0 Installation Failed

##### Symptom

**pip3.7 install Pillow==5.3.0** installation failed.

##### Possible Causes

Necessary dependencies are missing, such as libjpeg, python-devel, zlib-devel, and libjpeg-turbo-devel.
##### Solutions

Run the following commands to install the dependencies:

- CentOS/EulerOS/Tlinux/BClinux/Suse

    **yum install libjpeg python-devel zlib-devel libjpeg-turbo-devel**

- Ubuntu/Debian/UOS

    **apt-get install libjpeg-dev python3-dev zlib1g-dev**


### FAQs About Model and Operator Running


- **[What Do I Do If the Error Message "RuntimeError: ExchangeDevice:" Is Displayed During Model or Operator Running?](#what-do-i-do-if-the-error-message-runtimeerror-exchangedevice-is-displayed-during-model-or-operatormd)**

- **[What Do I Do If the Error Message "Error in atexit.\_run\_exitfuncs:" Is Displayed During Model or Operator Running?](#what-do-i-do-if-the-error-message-error-in-atexit-_run_exitfuncs-is-displayed-during-model-or-operatmd)**

- **[What Do I Do If the Error Message "terminate called after throwing an instance of 'c10::Error' what\(\): HelpACLExecute:" Is Displayed During Model Running?](#what-do-i-do-if-the-error-message-terminate-called-after-throwing-an-instance-of-c10-error-what-hemd)**

- **[What Do I Do If the Error Message "terminate called after throwing an instance of 'c10::Error' what\(\): 0 INTERNAL ASSERT" Is Displayed During Model Running?](#what-do-i-do-if-the-error-message-terminate-called-after-throwing-an-instance-of-c10-error-what-0md)**

- **[What Do I Do If the Error Message "ImportError: libhccl.so." Is Displayed During Model Running?](#what-do-i-do-if-the-error-message-importerror-libhccl-so-is-displayed-during-model-runningmd)**

- **[What Do I Do If the Error Message "RuntimeError: Initialize." Is Displayed During Model Running?](#what-do-i-do-if-the-error-message-runtimeerror-initialize-is-displayed-during-model-runningmd)**

- **[What Do I Do If the Error Message "TVM/te/cce error." Is Displayed During Model Running?](#what-do-i-do-if-the-error-message-tvm-te-cce-error-is-displayed-during-model-runningmd)**

- **[What Do I Do If the Error Message "MemCopySync:drvMemcpy failed." Is Displayed During Model Running?](#what-do-i-do-if-the-error-message-memcopysync-drvmemcpy-failed-is-displayed-during-model-runningmd)**

- **[What Do I Do If the Error Message "HelpACLExecute." Is Displayed After Multi-Task Delivery Is Disabled \(export TASK\_QUEUE\_ENABLE=0\) During Model Running?](#what-do-i-do-if-the-error-message-helpaclexecute-is-displayed-after-multi-task-delivery-is-disabledmd)**

- **[What Do I Do If the Error Message "55056 GetInputConstDataOut: ErrorNo: -1\(failed\)" Is Displayed During Model Running?](#what-do-i-do-if-the-error-message-55056-getinputconstdataout-errorno--1failed-is-displayed-duringmd)**


#### What Do I Do If the Error Message "RuntimeError: ExchangeDevice:" Is Displayed During Model or Operator Running?

##### Symptom

![](figures/faq1.png)

##### Possible Causes

Currently, only one NPU device can be called in a thread. The preceding error occurs when the code switches between different NPU devices.

##### Solution

Check that **torch.npu.set_device(device)**, **tensor.to(device)**, and **model.to(device)** use the same device name whenever they are called in the same thread. For multiple threads (such as multi-device training), each thread can call only one fixed NPU device.
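A minimal sketch of consistent device usage within one thread, using a toy model for illustration:

```python
import torch
import torch.nn as nn

# A minimal sketch: within one thread, use the same NPU device name for
# set_device(), model.to(), and tensor.to().
device = "npu:0"
torch.npu.set_device(device)

model = nn.Linear(4, 2).to(device)
x = torch.randn(1, 4).to(device)
out = model(x)  # all computation stays on npu:0
```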

#### What Do I Do If the Error Message "Error in atexit.\_run\_exitfuncs:" Is Displayed During Model or Operator Running?

+ +##### Symptom + +![](figures/faq2.png) + +##### Possible Causes + +If no NPU device is specified by **torch.npu.device\(id\)** during torch initialization, device 0 is used by default. If another NPU device is directly used, for example, a tensor is created on device 1, the preceding error occurs during running. + +##### Solution + +Before calling an NPU device, specify the NPU device by using **torch.npu.set_device(device)**. + +
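A minimal sketch, assuming device 1 is the target:

```python
import torch

# A minimal sketch: select the target device before creating any tensor on
# it, so initialization does not fall back to device 0.
torch.npu.set_device("npu:1")
x = torch.randn(2, 2).npu()  # allocated on npu:1
```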

#### What Do I Do If the Error Message "terminate called after throwing an instance of 'c10::Error' what(): HelpACLExecute:" Is Displayed During Model Running?

+ +##### Symptom + +![](figures/faq3.png) + +##### Possible Causes + +Currently, the HelpACLExecute error cannot be directly located. In this case, an error is reported when the task is delivered. This is because the multi-thread delivery of the task is enabled (**export TASK_QUEUE_ENABLE=1**), and the error information is encapsulated at the upper layer. As a result, more detailed error logs cannot be obtained. + +##### Solution + +You can resolve this exception by using either of the following methods: + +- Check the host error log information. The default log path is **/var/log/npu/slog/host-0/**. Search for the log file whose name is prefixed with **host-0** based on the time identifier, open the log file, and search for error information using keyword **ERROR**. +- Disable multi-thread delivery (**export TASK_QUEUE_ENABLE=0**) and run the code again. Generally, you can locate the fault based on the error information reported by the terminal. + +

#### What Do I Do If the Error Message "terminate called after throwing an instance of 'c10::Error' what(): 0 INTERNAL ASSERT" Is Displayed During Model Running?

##### Symptom

```
import torch

npu = "npu"

def test_cpu():
    input = torch.randn(2000, 1000).detach().requires_grad_()
    output = torch.sum(input)
    output.backward(torch.ones_like(output))

def test_npu():
    input = torch.randn(2000, 1000).detach().requires_grad_().npu()
    output = torch.sum(input)
    output.backward(torch.ones_like(output))

if __name__ == "__main__":
    test_cpu()
    torch.npu.set_device(f"{npu}:1")
    test_npu()
```

The following error message is displayed after code execution.

![](figures/en-us_image_0000001208897433.png)

##### Possible Causes

The device is switched with the **set\_device\(\)** method after a backward operation has already run. If no device is set before the backward operation, the program initializes device **0** by default, which is equivalent to executing **set\_device\("npu:0"\)**. Currently, the device cannot be switched once calculation has started, so manually switching it with **set\_device\(\)** afterward raises this error.

##### Solution

Call the **set\_device\(\)** method to select the device before performing any backward operation. The modification is as follows:

```
if __name__ == "__main__":
    torch.npu.set_device(f"{npu}:1")
    test_cpu()
    test_npu()
```

#### What Do I Do If the Error Message "ImportError: libhccl.so." Is Displayed During Model Running?

##### Symptom

![](figures/faq7.png)

##### Possible Causes

The released PyTorch installation package uses the NPU and HCCL functions by default, so the path of the HCCL module must be added to the environment variables when the package is used. The error message "can not find libhccl.so" indicates that the HCCL library file cannot be found.

##### Solution

Add the path of the HCCL module to the environment variables. Generally, the HCCL library file is located in **.../fwkacllib/python/site-packages/hccl** in the installation directory.

#### What Do I Do If the Error Message "RuntimeError: Initialize." Is Displayed During Model Running?

+ +##### Symptom + +![](figures/faq9.png) + +##### Possible Causes + +According to the error information, it is preliminarily determined that an error occurs during the initialization of the NPU device. The error information in the host log is as follows: + +![](figures/faq9-1.png) + +The log information indicates that an error is reported when the system starts the NPU device. + +##### Solution + +To solve the problem, perform the following steps: + +1. Restart the server and all NPU devices. + + If the problem is resolved, no further action is required. + + If the problem persists, go to [2](#li77121667913). + +2. Check whether the driver version matches the firmware version. + + If no, go to [3](#li967615545918). + + If yes, go to [4](#li475615212912). + +3. Ensure that the driver version matches the firmware version. + + If the problem is resolved, no further action is required. + + If the problem persists, go to Step 4. + +4. Contact Huawei technical support personnel. + +

#### What Do I Do If the Error Message "TVM/te/cce error." Is Displayed During Model Running?

##### Symptom

![](figures/faq10.png)

##### Possible Causes

Calling an NPU operator in PyTorch strongly depends on the TE, CCE, and TVM components. The PyTorch, CANN/NNAE, and TE versions must match. After CANN/NNAE is updated, components such as TE are not updated automatically, and this error is reported once their versions no longer match.

##### Solution

Update the TE components, that is, the **te-\*.whl** and **topi-\*.whl** installation packages. The packages **topi-0.4.0-py3-none-any.whl** and **te-0.4.0-py3-none-any.whl** are located in the **lib64** subdirectory of the CANN or NNAE installation directory (when installed by the **root** user, the default directory is **/usr/local/Ascend/ascend-toolkit/latest/lib64**). Run **pip3 install --upgrade topi-0.4.0-py3-none-any.whl** and **pip3 install --upgrade te-0.4.0-py3-none-any.whl**.

![](figures/faq10-1.png)

#### What Do I Do If the Error Message "MemCopySync:drvMemcpy failed." Is Displayed During Model Running?

##### Symptom

Script:

```
    import torch

    def test_sum():
        xs_shape = [22400, 8]
        ys_shape = [22400, 8]
        gt_bboxes_shape = [22400, 8, 4]
        xs = torch.rand(xs_shape).npu()
        ys = torch.rand(ys_shape).npu()
        gt_bboxes = torch.rand(gt_bboxes_shape).npu().half()
        left = xs - gt_bboxes[..., 0]
        right = gt_bboxes[..., 2] - xs
        top = ys - gt_bboxes[..., 1]
        bottom = gt_bboxes[..., 3] - ys
        # stream = torch.npu.current_stream()
        # stream.synchronize()
        # left, top: fp32; right, bottom: fp16
        # print(left.dtype, top.dtype, right.dtype, bottom.dtype)
        bbox_targets = torch.stack((left, top, right, bottom), -1)  # Error reported here
        # stream.synchronize()

        bbox_targets = torch.sum(bbox_targets)
```

Shell error message:

```
    RuntimeError: Run:/usr1/workspace/PyTorch_Apex_Daily_c20tr5/CODE/aten/src/ATen/native/npu/utils/OpParamMaker.h:280 NPU error,NPU error code is:500002
    [ERROR] RUNTIME(160809)kernel task happen error, retCode=0x28, [aicpu timeout].
    [ERROR] RUNTIME(160809)aicpu kernel execute failed, device_id=0, stream_id=512, task_id=24, fault so_name=, fault kernel_name=, extend_info=.
    Error in atexit._run_exitfuncs:
    Traceback (most recent call last):
    File "/usr/local/python3.7.5/lib/python3.7/site-packages/torch/__init__.py", line 429, in _npu_shutdown
        torch._C._npu_shutdown()
    RuntimeError: npuSynchronizeDevice:/usr1/workspace/PyTorch_Apex_Daily_c20tr5/CODE/c10/npu/NPUStream.cpp:806 NPU error, error code is 0
```

Log message:

```
    [ERROR] RUNTIME(12731,python3.7):2021-02-02-22:23:56.475.679 [../../../../../../runtime/feature/src/npu_driver.cc:1408]12828 MemCopySync:drvMemcpy failed: dst=0x108040288000, destMax=1240, src=0x7fe7649556d0, size=1240, kind=1, drvRetCode=17!
    [ERROR] RUNTIME(12731,python3.7):2021-02-02-22:23:56.475.698 [../../../../../../runtime/feature/src/logger.cc:113]12828 KernelLaunch:launch kernel failed, kernel=140631803535760/ArgMinWithValue_tvmbin, dim=32, stream=0x55b22b3def50
    [ERROR] RUNTIME(12731,python3.7):2021-02-02-22:23:56.475.717 [../../../../../../runtime/feature/src/api_c.cc:224]12828 rtKernelLaunch:ErrCode=207001, desc=[module new memory error], InnerCode=0x70a0002
```

##### Possible Causes

The shell error message does not match the log message. The shell error message indicates that the error occurs on the AI CPU during synchronization, while the log message indicates that the error occurs on the min operator \(an internal call of ArgMinWithValue\_tvmbin\). The two messages do not match because the AI CPU operator is executed asynchronously, so its error information appears in the log with a delay.

##### Solution

Perform the following steps to locate the fault based on the actual error information:

1. Disable multi-task operator delivery. The result remains unchanged, so the real error occurs before the errors in the shell and log messages.
2. Perform stream synchronization based on the error information to narrow down the error range and locate the error operator. Stream synchronization forces all calculations issued up to that point in the code to complete before continuing.
3. The error operator is determined to be stack.
4. Print the shape, dtype, and npu\_format of all stack parameters, and construct a single-operator case to reproduce the problem.

    The cause is that the data types of the input parameters for subtraction are different. As a result, the a-b and b-a results have different data types, and the stack operator reports an error on the mixed inputs.

5. Convert the data types of the stack input parameters to the same one to temporarily avoid the problem (see the sketch below).
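A minimal sketch of the workaround in step 5, casting the inputs to one dtype before the stack call:

```python
import torch

# A minimal sketch of step 5: cast every stack input to one dtype so the
# stack operator no longer receives mixed fp32/fp16 tensors.
left = torch.rand(22400).npu()           # fp32
right = torch.rand(22400).npu().half()   # fp16
bbox_targets = torch.stack((left, right.float()), -1)
```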


#### What Do I Do If the Error Message "HelpACLExecute." Is Displayed After Multi-Task Delivery Is Disabled (export TASK\_QUEUE\_ENABLE=0) During Model Running?

##### Symptom

![](figures/faq8.png)

##### Possible Causes

The PyTorch operator runs on the NPU and calls the optimized operators at the bottom layer through the AscendCL API. When the error message "HelpACLExecute." is reported at the upper layer, the error information and logs at this layer are still being improved, so detailed failure information cannot be obtained for some operator errors.

##### Solution

View the host log to determine the operator and location where the error is reported. The default log path is **/var/log/npu/slog/host-0**. Search for the **ERROR** field in the log file of the corresponding time to find the error information. For the preceding error, the **ERROR** field in the log is as follows:

![](figures/faq8-1.png)

The error information in the log indicates that the error operator is topKD and the error cause is "The number of attrs in op desc and op store does not match." Therefore, the parameters of the topKD operator do not match.

Locate the topKD operator in the model code and check whether the operator can be replaced by another operator. If it can, use the replacement and report the operator error information to Huawei engineers. If it cannot, contact Huawei technical support.

#### What Do I Do If the Error Message "55056 GetInputConstDataOut: ErrorNo: -1(failed)" Is Displayed During Model Running?

+ +##### Symptom + +During model training, the following error information may be displayed in the host training log \(directory: **/root/ascend/log/plog/**\): + +![](figures/20210720-102720(welinkpc).png) + +##### Possible Causes + +A public API is called. + +##### Solution + +The error information does not affect the training function and performance and can be ignored. + +### FAQs About Model Commissioning + + +- **[What Do I Do If the Error Message "RuntimeError: malloc:/..../pytorch/c10/npu/NPUCachingAllocator.cpp:293 NPU error, error code is 500000." Is Displayed During Model Commissioning?](#what-do-i-do-if-the-error-message-runtimeerror-malloc-pytorch-c10-npu-npucachingallocator-cpp-293-npmd)** + +- **[What Do I Do If the Error Message "RuntimeError: Could not run 'aten::trunc.out' with arguments from the 'NPUTensorId' backend." Is Displayed During Model Commissioning](#what-do-i-do-if-the-error-message-runtimeerror-could-not-run-aten-trunc-out-with-arguments-from-themd)** + +- **[What Do I Do If the MaxPoolGradWithArgmaxV1 and max Operators Report Errors During Model Commissioning?](#what-do-i-do-if-the-maxpoolgradwithargmaxv1-and-max-operators-report-errors-during-model-commissionimd)** + +- **[What Do I Do If the Error Message "ModuleNotFoundError: No module named 'torch.\_C'" Is Displayed When torch Is Called?](#what-do-i-do-if-the-error-message-modulenotfounderror-no-module-named-torch-_c-is-displayed-when-tormd)** + + +

#### What Do I Do If the Error Message "RuntimeError: malloc:/..../pytorch/c10/npu/NPUCachingAllocator.cpp:293 NPU error, error code is 500000." Is Displayed During Model Commissioning?

+ +##### Symptom + +![](figures/faq4.png) + +##### Possible Causes + +For the malloc error in **NPUCachingAllocator**, the possible cause is that the required video memory is larger than the available video memory on the NPU. + +##### Solution + +During model commissioning, you can decrease the value of the **batch size** parameter to reduce the size of the occupied video memory on the NPU. + +

#### What Do I Do If the Error Message "RuntimeError: Could not run 'aten::trunc.out' with arguments from the 'NPUTensorId' backend." Is Displayed During Model Commissioning

##### Symptom

![](figures/faq5.png)

##### Possible Causes

Currently, the NPU supports only some PyTorch operators. The preceding error is reported when an operator that is not yet supported is used. The operators are being developed. For details about the supported operators, see [PyTorch Native Operators](https://support.huaweicloud.com/intl/en-us/opl-pytorch/atlasptol_09_0001.html).

##### Solution

Replace the unsupported operator with a supported equivalent, or temporarily run that part of the computation on the CPU and move the result back to the NPU, as shown in the sketch below.
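A minimal sketch of such a CPU fallback for the trunc operator mentioned in the error message:

```python
import torch

# A minimal sketch of a CPU fallback for an unsupported operator: compute
# trunc on the CPU, then move the result back to the NPU.
x = torch.randn(4).npu()
y = torch.trunc(x.cpu()).npu()
```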

#### What Do I Do If the MaxPoolGradWithArgmaxV1 and max Operators Report Errors During Model Commissioning?

##### Symptom

![](figures/faq6.png)

![](figures/faq6-1.png)

##### Possible Causes

During model building, the operator input parameters are diversified. For some operators (such as MaxPoolGradWithArgmaxV1 and max) with specific parameters, an error is reported during calculation, or the parameter combination is not supported. You can locate the operators based on the error information.

##### Solution

Locate the operators based on the error information and perform the following steps:

1. Check whether the call mode and parameters of the operators in the model are correct.
2. Build a single-operator case based on the error operators to construct the error scenario.
3. Generally, operator errors cannot be resolved on the Python side, so an error scenario needs to be constructed. Post the error scenario in the forum and ask for help from Huawei engineers.

    >![](public_sys-resources/icon-note.gif) **NOTE:**
    >Pay special attention to the input parameters **shape** and **dtype**, which are the main causes of operator errors.

In the preceding figure, the error information indicates that the MaxPoolGradWithArgmaxV1 and max operators report the error. MaxPoolGradWithArgmaxV1 reports the error during backward propagation, so construct a backward scenario; the max operator reports the error during forward propagation, so construct a forward scenario.

If an operator error is reported in the model, you are advised to build a single-operator test case and determine the error scenario and cause. If the error cannot be reproduced with an isolated operator, construct a context-based single-operator scenario. For details about how to build a test case, see [Single-Operator Sample Building](#single-operator-sample-building). A minimal forward sketch follows.
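A minimal forward single-operator sketch for the max operator; the shape and dtype here are placeholders and should be replaced with the values reported in your error log:

```python
import torch

# A minimal forward single-operator case for the max operator. The shape
# and dtype are placeholders: substitute the values from your error log.
x = torch.randn(2, 3, 8, 8).half().npu()
values, indices = torch.max(x, dim=1)
print(values.shape, values.dtype)
```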

#### What Do I Do If the Error Message "ModuleNotFoundError: No module named 'torch.\_C'" Is Displayed When torch Is Called?

##### Symptom

![](figures/faq11.png)

##### Possible Causes

In the preceding figure, the error path is **.../code/pytorch/torch/\_\_init\_\_.py**, and the current operating path is **.../code/pytorch**. When the **import torch** command is executed, the **torch** folder in the current directory is found first and shadows the torch package installed in the system directory, which causes the error. The torch package installed in the system directory, not the folder in the current directory, should be imported.

##### Solution

Switch to another directory to run the script.

### FAQs About Other Operations


- **[What Do I Do If an Error Is Reported During CUDA Stream Synchronization?](#what-do-i-do-if-an-error-is-reported-during-cuda-stream-synchronizationmd)**

- **[What Do I Do If aicpu\_kernels/libpt\_kernels.so Does Not Exist?](#what-do-i-do-if-aicpu_kernels-libpt_kernels-so-does-not-existmd)**

- **[What Do I Do If the Python Process Is Residual When the npu-smi info Command Is Used to View Video Memory?](#what-do-i-do-if-the-python-process-is-residual-when-the-npu-smi-info-command-is-used-to-view-video-mmd)**

- **[What Do I Do If the Error Message "match op inputs failed" Is Displayed When the Dynamic Shape Is Used?](#what-do-i-do-if-the-error-message-match-op-inputs-failed-is-displayed-when-the-dynamic-shape-is-usedmd)**

- **[What Do I Do If the Error Message "Op type SigmoidCrossEntropyWithLogitsV2 of ops kernel AIcoreEngine is unsupported" Is Displayed?](#what-do-i-do-if-the-error-message-op-type-sigmoidcrossentropywithlogitsv2-of-ops-kernel-aicoreenginemd)**

- **[What Do I Do If a Hook Failure Occurs?](#what-do-i-do-if-a-hook-failure-occursmd)**

- **[What Do I Do If the Error Message "load state\_dict error." Is Displayed When the Weight Is Loaded?](#what-do-i-do-if-the-error-message-load-state_dict-error-is-displayed-when-the-weight-is-loadedmd)**


#### What Do I Do If an Error Is Reported During CUDA Stream Synchronization?

##### Symptom

![](figures/model_faq11_20210728.jpg)

##### Possible Causes

The code calls CUDA stream synchronization, which is not available on the NPU.

##### Solution

Use NPU stream synchronization instead.

```
stream = torch.npu.current_stream()
stream.synchronize()
```

#### What Do I Do If aicpu\_kernels/libpt\_kernels.so Does Not Exist?

##### Symptom

![](figures/faq13.png)

##### Possible Causes

The AI CPU is not imported.

##### Solution

Import the AI CPU. \(The following assumes that the CANN software package was installed by the **root** user in the default installation path.\)

```
export ASCEND_AICPU_PATH=/usr/local/Ascend/ascend-toolkit/latest
```

#### What Do I Do If the Python Process Is Residual When the npu-smi info Command Is Used to View Video Memory?

##### Symptom

![](figures/faq14.png)

##### Possible Causes

Python processes from a previous job did not exit and still occupy video memory.

##### Solution

Kill the residual Python processes.

```
pkill -9 python
```

#### What Do I Do If the Error Message "match op inputs failed" Is Displayed When the Dynamic Shape Is Used?

+ +##### Symptom + +![](figures/faq15.png) + +##### Possible Causes + +The operator compiled by **PTIndexPut** does not match the input shape, and the log starting with **acl\_dynamic\_shape\_op** is displayed. It is determined that an error is reported for the dynamic shape. + +##### Solution + +**PTIndexPut** corresponds to **tensor\[indices\] = value**. Locate the field in the code and change the dynamic shape to a fixed shape. + +
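A minimal sketch, assuming the dynamic shape comes from boolean-mask assignment; **masked_fill_** writes the same scalar values with a fixed shape:

```python
import torch

# A minimal sketch, assuming the dynamic shape comes from boolean-mask
# assignment: masked_fill_ writes the same values with a fixed shape.
x = torch.randn(4, 4).npu()
mask = x > 0
# x[mask] = 1.0            # dynamic-shape path (PTIndexPut)
x.masked_fill_(mask, 1.0)  # fixed-shape equivalent
```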

#### What Do I Do If the Error Message "Op type SigmoidCrossEntropyWithLogitsV2 of ops kernel AIcoreEngine is unsupported" Is Displayed?

+ +##### Symptom + +``` +[ERROR] GE(24836,python3.7):2021-01-27-18:27:51.562.111 [../../../../../../graphengine/ge/engine_manager/dnnengine_manager.cc:266]25155 GetDNNEngineName: ErrorNo: 1343242282(assign engine failed) GetDNNEngineName:Op type SigmoidCrossEntropyWithLogitsV2 of ops kernel AIcoreEngine is unsupported, reason:Op SigmoidCrossEntropyWithLogitsV2 not supported reason: The type of this op is not found in op store, check whether the op store has this type of op. Op store name is tbe-custom. +The dtype, format or shape of input in op desc is not supported in op store, check the dtype, format or shape of input between the op store and the graph. Op store name is tbe-builtin. +``` + +##### Possible Causes + +The input data type is not supported by the SigmoidCrossEntropyWithLogitsV2 operator. The possible cause is that the input data type is int64. + +##### Solution + +Check the input data type in the Python code and modify the data type. + +
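A minimal sketch of such a dtype fix; the tensor here is a placeholder for the actual loss input:

```python
import torch

# A minimal sketch: cast an int64 tensor to float32 before it reaches the
# loss operator, since int64 inputs trigger this error. The tensor is a
# placeholder for the actual target.
target = torch.ones(8, dtype=torch.int64)
target = target.to(torch.float32).npu()
```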

#### What Do I Do If a Hook Failure Occurs?

##### Symptom

```
Traceback (most recent call last):
  File "tools/train.py", line 227, in <module>
    main()
  File "tools/train.py", line 221, in main
    meta=meta)
  File "/root/YoloV3/mmdetection/mmdet/apis/train.py", line 192, in train_detector
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 166, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 50, in train
    self.run_iter(data_batch, train_mode=True)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 30, in run_iter
    outputs = self.model.train_step(data_batch, self.optimizer, **kwargs)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/mmcv/parallel/data_parallel.py", line 100, in train_step
    return self.module.train_step(*inputs[0], **kwargs[0])
  File "/root/YoloV3/mmdetection/mmdet/models/detectors/base.py", line 251, in train_step
    losses = self(**data)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/torch/nn/modules/module.py", line 660, in __call__
    var = next((v for v in var.values() if isinstance(v, torch.Tensor)))
StopIteration
```

##### Possible Causes

The loss structure of mmdet triggers a bug in the native hook handling of PyTorch, leading to an infinite loop.

##### Solution

In the **/usr/local/python3.7.5/lib/python3.7/site-packages/torch/nn/modules/module.py** file, wrap the hook-registration loop starting at line 658 in a **try** block so that the failure is skipped:

```
if len(self._backward_hooks) > 0:
    var = result
    try:
        while not isinstance(var, torch.Tensor):
            if isinstance(var, dict):
                var = next((v for v in var.values() if isinstance(v, torch.Tensor)))
            else:
                var = var[0]
        grad_fn = var.grad_fn
        if grad_fn is not None:
            for hook in self._backward_hooks.values():
                wrapper = functools.partial(hook, self)
                functools.update_wrapper(wrapper, hook)
                grad_fn.register_hook(wrapper)
    except Exception as e:
        print('hook failed..')
        print(str(e))
return result
```

#### What Do I Do If the Error Message "load state\_dict error." Is Displayed When the Weight Is Loaded?

##### Symptom

![](figures/faq18.png)

![](figures/faq18-1.png)

##### Possible Causes

The key values of the **state\_dict** saved after model training differ from the key values of the **state\_dict** expected when the model is loaded: a **module.** prefix was added to the beginning of each key when the model was saved.

##### Solution

When loading the weights, traverse the **state\_dict** dictionary, strip the prefix from each key, and load the new dictionary. For details about the test case, see **demo.py**.

The script is as follows:

```
    ckpt = torch.load("checkpoint.pth", map_location=loc)
    # model.load_state_dict(ckpt['state_dict'])
    state_dict_old = ckpt['state_dict']
    state_dict = {}
    for key, value in state_dict_old.items():
        key = key[7:]  # Strip the leading "module." prefix (7 characters).
        state_dict[key] = value
    model.load_state_dict(state_dict)
```

### FAQs About Distributed Model Training

- **[What Do I Do If the Error Message "host not found." Is Displayed During Distributed Model Training?](#what-do-i-do-if-the-error-message-host-not-found-is-displayed-during-distributed-model-trainingmd)**

- **[What Do I Do If the Error Message "RuntimeError: connect\(\) timed out." Is Displayed During Distributed Model Training?](#what-do-i-do-if-the-error-message-runtimeerror-connect-timed-out-is-displayed-during-distributed-mmd)**


#### What Do I Do If the Error Message "host not found." Is Displayed During Distributed Model Training?

+ +##### Symptom + +![](figures/faq19.png) + +##### Possible Causes + +During distributed model training, the Huawei Collective Communication Library \(HCCL\) is invoked. You need to set the IP address and port number based on the site requirements. The error information indicates that the IP address is incorrect. + +##### Solution + +Set the correct IP address in the running script. If a single server is deployed, set the IP address to the IP address of the server. If multiple servers are deployed, set the IP address in the script on each server to the IP address of the active node. + +
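A minimal sketch of the corresponding initialization, assuming two 8-device servers whose active (master) node is at 192.168.1.100; the address, port, world size, and rank are placeholders to be replaced with your cluster's values:

```python
import os
import torch.distributed as dist

# A minimal sketch, assuming two 8-device servers with the active (master)
# node at 192.168.1.100; every server must point MASTER_ADDR at that node.
os.environ["MASTER_ADDR"] = "192.168.1.100"
os.environ["MASTER_PORT"] = "29688"
dist.init_process_group(backend="hccl", world_size=16, rank=0)
```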

#### What Do I Do If the Error Message "RuntimeError: connect\(\) timed out." Is Displayed During Distributed Model Training?

+ +##### Symptom + +![](figures/1234.png) + +##### Possible Causes + +During distributed model training, the system firewall may block the communication of the HCCL port. Check whether the communication port is enabled based on the error information and perform related settings. + +##### Solution + +Query the HCCL port that is blocked by the system firewall and enable the port. + diff --git a/docs/en/PyTorch Operator Support/PyTorch Operator Support.md b/docs/en/PyTorch Operator Support/PyTorch Operator Support.md index 6488217b0bca200148d2826b7d57c6a062e17067..a4759affb51fc3249daf977b269bd9914970a145 100644 --- a/docs/en/PyTorch Operator Support/PyTorch Operator Support.md +++ b/docs/en/PyTorch Operator Support/PyTorch Operator Support.md @@ -1,6296 +1,6297 @@ -# PyTorch Operator Support -- [Mapping Between PyTorch Native Operators and Ascend Adapted Operators](#mapping-between-pytorch-native-operators-and-ascend-adapted-operatorsmd) -- [PyTorch Operators Customized by Ascend](#pytorch-operators-customized-by-ascendmd) -

## Mapping Between PyTorch Native Operators and Ascend Adapted Operators

| No. | PyTorch Native Operator | Ascend Adapted Operator |
| --- | ----------------------- | ----------------------- |
| 1 | dropout | dropout_npu |
| 2 | dropout_ | dropout_npu_ |
| 3 | abs | abs_npu |
| 4 | abs_ | abs_npu_ |
| 5 | abs.out | abs_out_npu |
| 6 | acos | acos_npu |
| 7 | acos_ | acos_npu_ |
| 8 | acos.out | acos_out_npu |
| 9 | adaptive_avg_pool1d | adaptive_avg_pool1d_npu |
| 10 | add.Tensor | add_npu |
| 11 | add_.Tensor | add_npu_ |
| 12 | add.out | add_out_npu |
| 13 | add.Scalar | add_npu |
| 14 | add_.Scalar | add_npu_ |
| 15 | addmv | addmv_npu |
| 16 | addmv_ | addmv_npu_ |
| 17 | addmv.out | addmv_out_npu |
| 18 | addr | addr_npu |
| 19 | addr_ | addr_npu_ |
| 20 | addr.out | addr_out_npu |
| 21 | affine_grid_generator | affine_grid_generator_npu |
| 22 | affine_grid_generator_backward | affine_grid_generator_backward_npu |
| 23 | all.dim | all_npu |
| 24 | all.out | all_out_npu |
| 25 | any.dim | any_npu |
| 26 | any.out | any_out_npu |
| 27 | arange | arange_npu |
| 28 | arange.start | arange_npu |
| 29 | arange.start_step | arange_npu |
| 30 | arange.out | arange_out_npu |
| 31 | arange.start_out | arange_out_npu |
| 32 | _dim_arange | _dim_arange_npu |
| 33 | argmax | argmax_npu |
| 34 | argmin | argmin_npu |
| 35 | as_strided | as_strided_npu |
| 36 | as_strided_ | as_strided_npu_ |
| 37 | asin | asin_npu |
| 38 | asin_ | asin_npu_ |
| 39 | asin.out | asin_out_npu |
| 40 | atan | atan_npu |
| 41 | atan_ | atan_npu_ |
| 42 | atan.out | atan_out_npu |
| 43 | baddbmm | baddbmm_npu |
| 44 | baddbmm_ | baddbmm_npu_ |
| 45 | baddbmm.out | baddbmm_out_npu |
| 46 | bartlett_window | bartlett_window_npu |
| 47 | bartlett_window.periodic | bartlett_window_npu |
| 48 | batch_norm | batch_norm_npu_ |
| 49 | _batch_norm_impl_index | _batch_norm_impl_index_npu |
| 50 | _batch_norm_impl_index_backward | _batch_norm_impl_index_backward_npu |
| 51 | bernoulli | bernoulli_npu |
| 52 | bernoulli_.Tensor | bernoulli_npu_ |
| 53 | bernoulli_.float | bernoulli_npu_ |
| 54 | binary_cross_entropy | binary_cross_entropy_npu |
| 55 | binary_cross_entropy.out | binary_cross_entropy_out_npu |
| 56 | binary_cross_entropy_backward | binary_cross_entropy_backward_npu |
| 57 | binary_cross_entropy_backward.grad_input | binary_cross_entropy_backward_out_npu |
| 58 | binary_cross_entropy_with_logits | binary_cross_entropy_with_logits_npu |
| 59 | binary_cross_entropy_with_logits_backward | binary_cross_entropy_with_logits_backward_npu |
| 60 | bitwise_not | bitwise_not_npu |
| 61 | bitwise_not_ | bitwise_not_npu_ |
| 62 | bitwise_not.out | bitwise_not_out_npu |
| 63 | logical_not | logical_not_npu |
| 64 | logical_not_ | logical_not_npu_ |
| 65 | logical_not.out | logical_not_out_npu |
| 66 | logical_and | logical_and_npu |
| 67 | logical_and_ | logical_and_npu_ |
| 68 | logical_and.out | logical_and_out_npu |
| 69 | logical_or | logical_or_npu |
| 70 | logical_or_ | logical_or_npu_ |
| 71 | logical_or.out | logical_or_out_npu |
| 72 | blackman_window | blackman_window_npu |
| 73 | blackman_window.periodic | blackman_window_npu |
| 74 | bmm | bmm_npu |
| 75 | bmm.out | bmm_out_npu |
| 76 | cat | cat_npu |
| 77 | cat.out | cat_out_npu |
| 78 | cat.names | cat_npu |
| 79 | cat.names_out | cat_out_npu |
| 80 | ceil | ceil_npu |
| 81 | ceil_ | ceil_npu_ |
| 82 | ceil.out | ceil_out_npu |
| 83 | clamp | clamp_npu |
| 84 | clamp_ | clamp_npu_ |
| 85 | clamp.out | clamp_out_npu |
| 86 | clamp_max | clamp_max_npu |
| 87 | clamp_max_ | clamp_max_npu_ |
| 88 | clamp_max.out | clamp_max_out_npu |
| 89 | clamp_min | clamp_min_npu |
| 90 | clamp_min_ | clamp_min_npu_ |
| 91 | clamp_min.out | clamp_min_out_npu |
| 92 | constant_pad_nd | constant_pad_nd_npu |
| 93 | contiguous | contiguous_npu |
| 94 | convolution | convolution_npu |
| 95 | _convolution | _convolution_npu |
| 96 | _convolution_nogroup | _convolution_nogroup_npu |
| 97 | conv2d | conv2d_npu_ |
| 98 | conv3d | _conv3d_npu |
| 99 | conv_tbc | conv_tbc_npu |
| 100 | conv_tbc_backward | conv_tbc_backward_npu |
| 101 | conv_transpose2d.input | conv_transpose2d_npu_ |
| 102 | conv_transpose3d.input | conv_transpose3d_npu_ |
| 103 | copy_ | copy_npu_ |
| 104 | cos | cos_npu |
| 105 | cos_ | cos_npu_ |
| 106 | cos.out | cos_out_npu |
| 107 | cosh | cosh_npu |
| 108 | cosh_ | cosh_npu_ |
| 109 | cosh.out | cosh_out_npu |
| 110 | _cummax_helper | cummax_helper_npu |
| 111 | _cummin_helper | cummin_helper_npu |
| 112 | cumprod | cumprod_npu |
| 113 | cumprod.out | cumprod_out_npu |
| 114 | cumprod.dimname | cumprod_npu |
| 115 | cumprod.dimname_out | cumprod_out_npu |
| 116 | ctc_loss.IntList | ctc_loss_npu |
| 117 | ctc_loss.Tensor | ctc_loss_npu |
| 118 | _ctc_loss | ctc_loss_npu |
| 119 | _ctc_loss_backward | ctc_loss_backward_npu |
| 120 | fill_diagonal_ | fill_diagonal_npu_ |
| 121 | div.Tensor | div_npu |
| 122 | div_.Tensor | div_npu_ |
| 123 | div.out | div_out_npu |
| 124 | div.Scalar | div_npu |
| 125 | div_.Scalar | div_npu_ |
| 126 | dot | dot_npu |
| 127 | dot.out | dot_out_npu |
| 128 | embedding | embedding_npu |
| 129 | embedding_backward | embedding_backward_npu |
| 130 | embedding_dense_backward | embedding_dense_backward_npu |
| 131 | embedding_renorm_ | embedding_renorm_npu_ |
| 132 | _embedding_bag | _embedding_bag_npu |
| 133 | empty.memory_format | empty_npu |
| 134 | resize_ | resize_npu_ |
| 135 | empty_like | empty_like_npu |
| 136 | empty_strided | empty_strided_npu |
| 137 | erf | erf_npu |
| 138 | erf_ | erf_npu_ |
| 139 | erf.out | erf_out_npu |
| 140 | erfc | erfc_npu |
| 141 | erfc_ | erfc_npu_ |
| 142 | erfc.out | erfc_out_npu |
| 143 | exp | exp_npu |
| 144 | exp_ | exp_npu_ |
| 145 | exp.out | exp_out_npu |
| 146 | expm1 | expm1_npu |
| 147 | expm1_ | expm1_npu_ |
| 148 | expm1.out | expm1_out_npu |
| 149 | eye | eye_npu |
| 150 | eye.m | eye_npu |
| 151 | eye.out | eye_out_npu |
| 152 | eye.m_out | eye_out_npu |
| 153 | fill_.Scalar | fill_npu_ |
| 154 | fill_.Tensor | fill_npu_ |
| 155 | floor | floor_npu |
| 156 | floor_ | floor_npu_ |
| 157 | floor.out | floor_out_npu |
| 158 | floor_divide | floor_divide_npu |
| 159 | floor_divide_.Tensor | floor_divide_npu_ |
| 160 | floor_divide.out | floor_divide_out_npu |
| 161 | floor_divide.Scalar | floor_divide_npu |
| 162 | floor_divide_.Scalar | floor_divide_npu_ |
| 163 | frac | frac_npu |
| 164 | frac_ | frac_npu_ |
| 165 | frac.out | frac_out_npu |
| 166 | full.names | full_npu |
| 167 | full | full_npu |
| 168 | full.out | full_out_npu |
| 169 | grid_sampler | grid_sampler_npu |
| 170 | grid_sampler_3d | grid_sampler_3d_npu |
| 171 | grid_sampler_3d_backward | grid_sampler_3d_backward_npu |
| 172 | hann_window | hann_window_npu |
| 173 | hann_window.periodic | hann_window_npu |
| 174 | hamming_window | hamming_window_npu |
| 175 | hamming_window.periodic | hamming_window_npu |
| 176 | hamming_window.periodic_alpha | hamming_window_npu |
| 177 | hamming_window.periodic_alpha_beta | hamming_window_npu |
| 178 | ger | ger_npu |
| 179 | ger.out | ger_out_npu |
| 180 | index.Tensor | index_npu |
| 181 | index_put_ | index_put_npu_ |
| 182 | index_put | index_put_npu |
| 183 | _index_put_impl_ | _index_put_impl_npu_ |
| 184 | inverse | inverse_npu |
| 185 | inverse.out | inverse_out_npu |
| 186 | isclose | isclose_npu |
| 187 | isnan | isnan_npu |
| 188 | is_nonzero | is_nonzero_npu |
| 189 | kl_div | kl_div_npu |
| 190 | kl_div_backward | kl_div_backward_npu |
| 191 | kthvalue | kthvalue_npu |
| 192 | kthvalue.values | kthvalue_out_npu |
| 193 | kthvalue.dimname | kthvalue_npu |
| 194 | kthvalue.dimname_out | kthvalue_out_npu |
| 195 | native_layer_norm | layer_norm_npu |
| 196 | native_layer_norm_backward | layer_norm_backward_npu |
| 197 | linspace | linspace_npu |
| 198 | linspace.out | linspace_out_npu |
| 199 | log | log_npu |
| 200 | log_ | log_npu_ |
| 201 | log.out | log_out_npu |
| 202 | log10 | log10_npu |
| 203 | log10_ | log10_npu_ |
| 204 | log10.out | log10_out_npu |
| 205 | log1p | log1p_npu |
| 206 | log1p_ | log1p_npu_ |
| 207 | log1p.out | log1p_out_npu |
| 208 | log2 | log2_npu |
| 209 | log2_ | log2_npu_ |
| 210 | log2.out | log2_out_npu |
| 211 | logspace | logspace_npu |
| 212 | logspace.out | logspace_out_npu |
| 213 | log_softmax.int | log_softmax_npu |
| 214 | log_softmax.Dimname | log_softmax_npu |
| 215 | _log_softmax | _log_softmax_npu |
| 216 | _log_softmax_backward_data | _log_softmax_backward_npu |
| 217 | logsumexp | logsumexp_npu |
| 218 | logsumexp.out | logsumexp_out_npu |
| 219 | logsumexp.names | logsumexp_npu |
| 220 | logsumexp.names_out | logsumexp_out_npu |
| 221 | matmul | matmul_npu |
| 222 | matmul.out | matmul_out_npu |
| 223 | max.dim | max_npu |
| 224 | max.dim_max | max_out_npu |
| 225 | max_values | max_npu |
| 226 | max.names_dim | max_npu |
| 227 | max.names_dim_max | max_out_npu |
| 228 | max_values.names | max_npu |
| 229 | max_pool2d | max_pool2d_npu |
| 230 | mean | mean_npu |
| 231 | mean.dim | mean_npu |
| 232 | mean.out | mean_out_npu |
| 233 | mean.names_dim | mean_npu |
| 234 | mean.names_out | mean_out_npu |
| 235 | median.dim | median_npu |
| 236 | median.dim_values | median_out_npu |
| 237 | median.names_dim | median_npu |
| 238 | median.names_dim_values | median_out_npu |
| 239 | min.dim | min_npu |
| 240 | min.dim_min | min_out_npu |
| 241 | min_values | min_npu |
| 242 | min.names_dim | min_npu |
| 243 | min.names_dim_min | min_out_npu |
| 244 | min_values.names | min_npu |
| 245 | mm | mm_npu |
| 246 | mm.out | mm_out_npu |
| 247 | mul.Tensor | mul_npu |
| 248 | mul_.Tensor | mul_npu_ |
| 249 | mul.out | mul_out_npu |
| 250 | mul.Scalar | mul_npu |
| 251 | mul_.Scalar | mul_npu_ |
| 252 | mv | mv_npu |
| 253 | mv.out | mv_out_npu |
| 254 | narrow_copy | narrow_copy_npu |
| 255 | native_batch_norm | batch_norm_npu |
| 256 | batch_norm_stats | batch_norm_stats_npu |
| 257 | batch_norm_elemt | batch_norm_elemt_npu |
| 258 | batch_norm_elemt.out | batch_norm_elemt_out_npu |
| 259 | native_batch_norm_backward | batch_norm_backward_npu |
| 260 | batch_norm_backward_reduce | batch_norm_backward_reduce_npu |
| 261 | _nnpack_spatial_convolution | _nnpack_spatial_convolution_npu |
| 262 | ones.names | ones_npu |
| 263 | ones | ones_npu |
| 264 | ones.out | ones_out_npu |
| 265 | ones_like | ones_like_npu |
| 266 | cdist | cdist_npu |
| 267 | _cdist_forward | _cdist_forward_npu |
| 268 | _cdist_backward | _cdist_backward_npu |
| 269 | pdist | pdist_npu |
| 270 | _pdist_forward | _pdist_forward_npu |
| 271 | randperm | randperm_npu |
| 272 | randperm.generator | randperm_npu |
| 273 | randperm.out | randperm_out_npu |
| 274 | randperm.generator_out | randperm_out_npu |
| 275 | range.step | range_npu |
| 276 | range | range_npu |
| 277 | range.out | range_out_npu |
| 278 | reciprocal | reciprocal_npu |
| 279 | reciprocal_ | reciprocal_npu_ |
| 280 | reciprocal.out | reciprocal_out_npu |
| 281 | neg | neg_npu |
| 282 | neg_ | neg_npu_ |
| 283 | neg.out | neg_out_npu |
| 284 | repeat | repeat_npu |
| 285 | repeat_interleave.self_int | repeat_interleave_npu |
| 286 | round | round_npu |
| 287 | round_ | round_npu_ |
| 288 | round.out | round_out_npu |
| 289 | relu | relu_npu |
| 290 | relu_ | relu_npu_ |
| 291 | prelu | prelu_npu |
| 292 | prelu_backward | prelu_backward_npu |
| 293 | gelu | gelu_npu |
| 294 | gelu_backward | gelu_backward_npu |
| 295 | hardshrink | hardshrink_npu |
| 296 | hardshrink_backward | hardshrink_backward_npu |
| 297 | rsqrt | rsqrt_npu |
| 298 | rsqrt_ | rsqrt_npu_ |
| 299 | rsqrt.out | rsqrt_out_npu |
| 300 | selu | selu_npu |
| 301 | selu_ | selu_npu_ |
| 302 | celu | celu_npu |
| 303 | celu_ | celu_npu_ |
| 304 | sigmoid | sigmoid_npu |
| 305 | sigmoid_ | sigmoid_npu_ |
| 306 | sigmoid.out | sigmoid_out_npu |
| 307 | sin | sin_npu |
| 308 | sin_ | sin_npu_ |
| 309 | sin.out | sin_out_npu |
| 310 | sinh | sinh_npu |
| 311 | sinh_ | sinh_npu_ |
| 312 | sinh.out | sinh_out_npu |
| 313 | slogdet | slogdet_npu |
| 314 | softmax.int | softmax_npu |
| 315 | softmax.Dimname | softmax_npu |
| 316 | _softmax | _softmax_npu |
| 317 | _softmax_backward_data | _softmax_backward_npu |
| 318 | stack | stack_npu |
| 319 | stack.out | stack_out_npu |
| 320 | sum | sum_npu |
| 321 | sum.dim_IntList | sum_npu |
| 322 | sum.dim_DimnameList | sum_npu |
| 323 | sum.IntList_out | sum_out_npu |
| 324 | sum.DimnameList_out | sum_out_npu |
| 325 | sqrt | sqrt_npu |
| 326 | sqrt_ | sqrt_npu_ |
| 327 | sqrt.out | sqrt_out_npu |
| 328 | std | std_npu |
| 329 | std.dim | std_dim_npu |
| 330 | std_mean | std_mean_npu |
| 331 | std_mean.dim | std_mean_dim_npu |
| 332 | std_mean.names_dim | std_mean_names_npu |
| 333 | std.out | std_out_npu |
| 334 | std.names_dim | std_names_npu |
| 335 | std.names_out | std_out_npu |
| 336 | prod | prod_npu |
| 337 | prod.dim_int | prod_npu |
| 338 | prod.int_out | prod_out_npu |
| 339 | prod.dim_Dimname | prod_npu |
| 340 | prod.Dimname_out | prod_out_npu |
| 341 | tan | tan_npu |
| 342 | tan_ | tan_npu_ |
| 343 | tan.out | tan_out_npu |
| 344 | tanh | tanh_npu |
| 345 | tanh_ | tanh_npu_ |
| 346 | tanh.out | tanh_out_npu |
| 347 | threshold | threshold_npu |
| 348 | threshold_ | threshold_npu_ |
| 349 | threshold.out | threshold_out_npu |
| 350 | threshold_backward | threshold_backward_npu |
| 351 | one_hot | one_hot_npu1 |
| 352 | flip | flip_npu |
| 353 | roll | roll_npu |
| 354 | true_divide.Tensor | true_divide_npu |
| 355 | true_divide_.Tensor | true_divide_npu_ |
| 356 | true_divide.out | true_divide_out_npu |
| 357 | true_divide.Scalar | true_divide_npu |
| 358 | true_divide_.Scalar | true_divide_npu_ |
| 359 | trunc | trunc_npu |
| 360 | trunc_ | trunc_npu_ |
| 361 | trunc.out | trunc_out_npu |
| 362 | _unique2 | _unique2_npu |
| 363 | var | var_npu |
| 364 | var.dim | var_npu |
| 365 | var.out | var_out_npu |
| 366 | var.names_dim | var_npu |
| 367 | var.names_out | var_out_npu |
| 368 | var_mean | var_mean_npu |
| 369 | var_mean.dim | var_mean_npu |
| 370 | var_mean.names_dim | var_mean_npu |
| 371 | where.self | where_npu |
| 372 | where | where_npu |
| 373 | _s_where | _s_where_npu |
| 374 | zeros.names | zeros_npu |
| 375 | zeros | zeros_npu |
| 376 | zeros.out | zeros_out_npu |
| 377 | zeros_like | zeros_like_npu |
| 378 | norm.ScalarOpt_dtype | norm_npu |
| 379 | norm.Scalar | norm_npu |
| 380 | norm.ScalarOpt_dim_dtype | norm_npu |
| 381 | norm.ScalarOpt_dim | norm_npu |
| 382 | norm.dtype_out | norm_out_npu |
| 383 | norm.out | norm_out_npu |
| 384 | clone | clone_npu |
| 385 | resize_as_ | resize_as_npu_ |
| 386 | pow.Tensor_Scalar_out | pow_out_npu |
| 387 | pow.Tensor_Scalar | pow_npu |
| 388 | zero_ | zero_npu_ |
| 389 | sub.out | sub_out_npu |
| 390 | sub.Tensor | sub_npu |
| 391 | sub_.Tensor | sub_npu_ |
| 392 | sub.Scalar | sub_npu |
| 393 | sub_.Scalar | sub_npu_ |
| 394 | rsub.Tensor | rsub_npu |
| 395 | rsub.Scalar | rsub_npu |
| 396 | addmm.out | addmm_out_npu |
| 397 | addmm | addmm_npu |
| 398 | addmm_ | addmm_npu_ |
| 399 | quantize_per_tensor | quantize_per_tensor_npu |
| 400 | quantize_per_channel | quantize_per_channel_npu |
| 401 | to.dtype_layout | to_npu |
| 402 | to.device | to_device_npu |
| 403 | to.dtype | to_dtype_npu |
| 404 | to.other | to_other_npu |
| 405 | _local_scalar_dense | _local_scalar_dense_npu |
| 406 | lstm.input | lstm_npu |
| 407 | lstm.data | lstm_npu |
| 408 | gru.input | gru_npu_ |
| 409 | _pack_padded_sequence | _pack_padded_sequence_npu |
| 410 | _pad_packed_sequence | _pad_packed_sequence_npu |
| 411 | set_.source_Storage | set_npu_ |
| 412 | set_.source_Storage_storage_offset | set_npu_ |
| 413 | set_.source_Tensor | set_npu_ |
| 414 | set_ | set_npu_ |
| 415 | masked_fill_.Scalar | masked_fill_npu_ |
| 416 | masked_fill_.Tensor | masked_fill_npu_ |
| 417 | masked_scatter_ | masked_scatter_npu_ |
| 418 | view | view_npu |
| 419 | put_ | put_npu_ |
| 420 | index_add_ | index_add_npu_ |
| 421 | index_add | index_add_npu |
| 422 | index_add.dimname | index_add_npu |
| 423 | index_fill_.int_Scalar | index_fill_npu_ |
| 424 | index_fill.int_Scalar | index_fill_npu |
| 425 | index_fill_.int_Tensor | index_fill_npu_ |
| 426 | index_fill.int_Tensor | |

-

index_fill_npu

-

427

-

scatter_.src

-

scatter_npu_

-

428

-

scatter_.value

-

scatter_npu_

-

429

-

scatter_add_

-

scatter_add_npu_

-

430

-

scatter_add

-

scatter_add_npu

-

431

-

scatter_add.dimname

-

scatter_add_npu

-

432

-

lt_.Scalar

-

lt_npu_

-

433

-

lt_.Tensor

-

lt_npu_

-

434

-

gt_.Scalar

-

gt_npu_

-

435

-

gt_.Tensor

-

gt_npu_

-

436

-

le_.Scalar

-

le_npu_

-

437

-

le_.Tensor

-

le_npu_

-

438

-

ge_.Scalar

-

ge_npu_

-

439

-

ge_.Tensor

-

ge_npu_

-

440

-

eq_.Scalar

-

eq_npu_

-

441

-

eq_.Tensor

-

eq_npu_

-

442

-

ne_.Scalar

-

ne_npu_

-

443

-

ne_.Tensor

-

ne_npu_

-

444

-

bitwise_and.Tensor_out

-

bitwise_and_out_npu

-

445

-

bitwise_and.Scalar_out

-

bitwise_and_out_npu

-

446

-

bitwise_and.Scalar

-

bitwise_and_npu

-

447

-

bitwise_and.Tensor

-

bitwise_and_npu

-

448

-

bitwise_and_.Scalar

-

bitwise_and_npu_

-

449

-

bitwise_and_.Tensor

-

bitwise_and_npu_

-

450

-

__and__.Scalar

-

__and___npu

-

451

-

__and__.Tensor

-

__and___npu

-

452

-

bitwise_or.Tensor_out

-

bitwise_or_out_npu

-

453

-

bitwise_or.Scalar_out

-

bitwise_or_out_npu

-

454

-

bitwise_or.Scalar

-

bitwise_or_npu

-

455

-

bitwise_or.Tensor

-

bitwise_or_npu

-

456

-

bitwise_or_.Scalar

-

bitwise_or_npu_

-

457

-

bitwise_or_.Tensor

-

bitwise_or_npu_

-

458

-

__or__.Scalar

-

__or___npu

-

459

-

__or__.Tensor

-

__or___npu

-

460

-

__ior__.Scalar

-

__ior___npu

-

461

-

__ior__.Tensor

-

__ior___npu

-

462

-

bitwise_xor.Tensor_out

-

bitwise_xor_out_npu

-

463

-

bitwise_xor.Scalar_out

-

bitwise_xor_out_npu

-

464

-

bitwise_xor.Scalar

-

bitwise_xor_npu

-

465

-

bitwise_xor.Tensor

-

bitwise_xor_npu

-

466

-

bitwise_xor_.Scalar

-

bitwise_xor_npu_

-

467

-

bitwise_xor_.Tensor

-

bitwise_xor_npu_

-

468

-

__xor__.Scalar

-

__xor___npu

-

469

-

__xor__.Tensor

-

__xor___npu

-

470

-

__lshift__.Scalar

-

__lshift___npu

-

471

-

__lshift__.Tensor

-

__lshift___npu

-

472

-

__ilshift__.Scalar

-

__iLshift___npu

-

473

-

__ilshift__.Tensor

-

__iLshift___npu

-

474

-

__rshift__.Scalar

-

__rshift___npu

-

475

-

__rshift__.Tensor

-

__rshift___npu

-

476

-

__irshift__.Scalar

-

__iRshift___npu

-

477

-

__irshift__.Tensor

-

__iRshift___npu

-

478

-

atan2_

-

atan2_npu_

-

479

-

tril_

-

tril_npu_

-

480

-

triu_

-

triu_npu_

-

481

-

renorm_

-

renorm_npu_

-

482

-

pow_.Scalar

-

pow_npu_

-

483

-

pow_.Tensor

-

pow_npu_

-

484

-

lerp_.Scalar

-

lerp_npu_

-

485

-

lerp_.Tensor

-

lerp_npu_

-

486

-

fmod_.Scalar

-

fmod_npu_

-

487

-

fmod_.Tensor

-

fmod_npu_

-

488

-

remainder_.Scalar

-

remainder_npu_

-

489

-

remainder_.Tensor

-

remainder_npu_

-

490

-

addbmm_

-

addbmm_npu_

-

491

-

addbmm.out

-

addbmm_out_npu

-

492

-

addbmm

-

addbmm_npu

-

493

-

addcdiv_

-

addcdiv_npu_

-

494

-

random_.from

-

random_npu_

-

495

-

random_.to

-

random_npu_

-

496

-

random_

-

random_npu_

-

497

-

uniform_

-

uniform_npu_

-

498

-

diag.out

-

diag_out_npu

-

499

-

diag

-

diag_npu

-

500

-

cross.out

-

cross_out_npu

-

501

-

cross

-

cross_npu

-

502

-

triu.out

-

triu_out_npu

-

503

-

triu

-

triu_npu

-

504

-

tril.out

-

tril_out_npu

-

505

-

tril

-

tril_npu

-

506

-

tril_indices

-

tril_indices_npu

-

507

-

triu_indices

-

triu_indices_npu

-

508

-

ne.Scalar_out

-

ne_out_npu

-

509

-

ne.Scalar

-

ne_npu

-

510

-

ne.Tensor_out

-

ne_out_npu

-

511

-

ne.Tensor

-

ne_npu

-

512

-

eq.Scalar_out

-

eq_out_npu

-

513

-

eq.Scalar

-

eq_npu

-

514

-

eq.Tensor_out

-

eq_out_npu

-

515

-

eq.Tensor

-

eq_npu

-

516

-

ge.Scalar_out

-

ge_out_npu

-

517

-

ge.Scalar

-

ge_npu

-

518

-

ge.Tensor_out

-

ge_out_npu

-

519

-

ge.Tensor

-

ge_npu

-

520

-

le.Scalar_out

-

le_out_npu

-

521

-

le.Scalar

-

le_npu

-

522

-

le.Tensor_out

-

le_out_npu

-

523

-

le.Tensor

-

le_npu

-

524

-

gt.Scalar_out

-

gt_out_npu

-

525

-

gt.Scalar

-

gt_npu

-

526

-

gt.Tensor_out

-

gt_out_npu

-

527

-

gt.Tensor

-

gt_npu

-

528

-

lt.Scalar_out

-

lt_out_npu

-

529

-

lt.Scalar

-

lt_npu

-

530

-

lt.Tensor_out

-

lt_out_npu

-

531

-

lt.Tensor

-

lt_npu

-

532

-

take.out

-

take_out_npu

-

533

-

take

-

take_npu

-

534

-

index_select.out

-

index_select_out_npu

-

535

-

index_select

-

index_select_npu

-

536

-

index_select.dimname_out

-

index_select_out_npu

-

537

-

index_select.dimname

-

index_select_npu

-

538

-

masked_select.out

-

masked_select_out_npu

-

539

-

masked_select

-

masked_select_npu

-

540

-

nonzero.out

-

nonzero_out_npu

-

541

-

nonzero

-

nonzero_npu

-

542

-

gather.out

-

gather_out_npu

-

543

-

gather

-

gather_npu

-

544

-

gather.dimname_out

-

gather_out_npu

-

545

-

gather.dimname

-

gather_npu

-

546

-

addcmul.out

-

addcmul_out_npu

-

547

-

addcmul

-

addcmul_npu

-

548

-

addcmul_

-

addcmul_npu_

-

549

-

addcdiv.out

-

addcdiv_out_npu

-

550

-

addcdiv

-

addcdiv_npu

-

551

-

_triangular_solve_helper

-

_triangular_solve_helper_npu

-

552

-

_symeig_helper

-

_symeig_helper_npu

-

553

-

_svd_helper

-

_svd_helper_npu

-

554

-

qr.Q

-

qr_out_npu

-

555

-

qr

-

qr_npu

-

556

-

multinomial.out

-

multinomial_out_npu

-

557

-

multinomial

-

multinomial_npu

-

558

-

erfinv

-

erfinv_npu

-

559

-

erfinv_

-

erfinv_npu_

-

560

-

erfinv.out

-

erfinv_out_npu

-

561

-

sign

-

sign_npu

-

562

-

sign_

-

sign_npu_

-

563

-

sign.out

-

sign_out_npu

-

564

-

atan2.out

-

atan2_out_npu

-

565

-

atan2

-

atan2_npu

-

566

-

lerp.Scalar_out

-

lerp_out_npu

-

567

-

lerp.Tensor_out

-

lerp_out_npu

-

568

-

lerp.Scalar

-

lerp_npu

-

569

-

lerp.Tensor

-

lerp_npu

-

570

-

fmod.Scalar_out

-

fmod_out_npu

-

571

-

fmod.Scalar

-

fmod_npu

-

572

-

fmod.Tensor_out

-

fmod_out_npu

-

573

-

fmod.Tensor

-

fmod_npu

-

574

-

remainder.Scalar_out

-

remainder_out_npu

-

575

-

remainder.Scalar

-

remainder_npu

-

576

-

remainder.Tensor_out

-

remainder_out_npu

-

577

-

remainder.Tensor

-

remainder_npu

-

578

-

min.out

-

min_out_npu

-

579

-

min.other

-

min_npu

-

580

-

min

-

min_npu

-

581

-

max.out

-

max_out_npu

-

582

-

max.other

-

max_npu

-

583

-

max

-

max_npu

-

584

-

median

-

median_npu

-

585

-

sort.values

-

sort_out_npu

-

586

-

sort

-

sort_npu

-

587

-

sort.dimname_values

-

sort_out_npu

-

588

-

sort.dimname

-

sort_npu

-

589

-

argsort

-

argsort_npu

-

590

-

argsort.dimname

-

argsort_npu

-

591

-

topk.values

-

topk_out_npu

-

592

-

topk

-

topk_npu

-

593

-

all

-

all_npu

-

594

-

any

-

any_npu

-

595

-

renorm.out

-

renorm_out_npu

-

596

-

renorm

-

renorm_npu

-

597

-

unfold

-

unfold

-

598

-

equal

-

equal_npu

-

599

-

pow.Tensor_Tensor_out

-

pow_out_npu

-

600

-

pow.Tensor_Tensor

-

pow_npu

-

601

-

pow.Scalar_out

-

pow_out_npu

-

602

-

pow.Scalar

-

pow_npu

-

603

-

normal_

-

normal_npu_

-

604

-

normal.Tensor_float_out

-

normal_out_npu

-

605

-

normal.Tensor_float

-

normal_npu

-

606

-

normal.float_Tensor_out

-

normal_out_npu

-

607

-

normal.float_Tensor

-

normal_npu

-

608

-

normal.Tensor_Tensor_out

-

normal_out_npu

-

609

-

normal.Tensor_Tensor

-

normal_npu

-

610

-

normal.float_float

-

normal_npu

-

611

-

normal.float_float_out

-

normal_out_npu

-

612

-

_addr

-

_addr_npu

-

613

-

_addr_

-

_addr_npu_

-

614

-

_addr.out

-

_addr_out_npu

-

615

-

_index_copy_

-

index_copy_npu_

-

616

-

_cumsum

-

_cumsum_npu

-

617

-

_cumsum.out

-

_cumsum_out_npu

-

618

-

_cumprod

-

_cumprod_npu

-

619

-

_cumprod.out

-

_cumprod_out_npu

-

620

-

_var

-

_var_npu

-

621

-

_amp_non_finite_check_and_unscale_

-

_amp_non_finite_check_and_unscale_npu_

-

622

-

_cat

-

_cat_npu

-

623

-

_cat.out

-

_cat_out_npu

-

624

-

_max

-

_max_npu

-

625

-

_max.max

-

_max_out_npu

-

626

-

_min

-

_min_npu

-

627

-

_min.min

-

_min_out_npu

-

628

-

mse_loss.out

-

mse_loss_out_npu

-

629

-

mse_loss

-

mse_loss_npu

-

630

-

mse_loss_backward.grad_input

-

mse_loss_backward_out_npu

-

631

-

mse_loss_backward

-

mse_loss_backward_npu

-

632

-

l1_loss.out

-

l1_loss_out_npu

-

633

-

l1_loss

-

l1_loss_npu

-

634

-

l1_loss_backward.grad_input

-

l1_loss_backward_out_npu

-

635

-

l1_loss_backward

-

l1_loss_backward_npu

-

636

-

multilabel_margin_loss.out

-

multilabel_margin_loss_out_npu

-

637

-

multilabel_margin_loss

-

multilabel_margin_loss_npu

-

638

-

multilabel_margin_loss_forward.output

-

multilabel_margin_loss_forward_out_npu

-

639

-

multilabel_margin_loss_forward

-

multilabel_margin_loss_forward_npu

-

640

-

nll_loss.out

-

nll_loss_out_npu

-

641

-

nll_loss

-

nll_loss_npu

-

642

-

nll_loss_forward.output

-

nll_loss_forward_out_npu

-

643

-

nll_loss_forward

-

nll_loss_forward_npu

-

644

-

nll_loss_backward.grad_input

-

nll_loss_backward_out_npu

-

645

-

nll_loss_backward

-

nll_loss_backward_npu

-

646

-

nll_loss2d.out

-

nll_loss2d_out_npu

-

647

-

nll_loss2d

-

nll_loss2d_npu

-

648

-

nll_loss2d_forward.output

-

nll_loss2d_forward_out_npu

-

649

-

nll_loss2d_forward

-

nll_loss2d_forward_npu

-

650

-

nll_loss2d_backward.grad_input

-

nll_loss2d_backward_out_npu

-

651

-

nll_loss2d_backward

-

nll_loss2d_backward_npu

-

652

-

smooth_l1_loss.out

-

smooth_l1_loss_out_npu

-

653

-

smooth_l1_loss

-

smooth_l1_loss_npu

-

654

-

smooth_l1_loss_backward.grad_input

-

smooth_l1_loss_backward_out_npu

-

655

-

smooth_l1_loss_backward

-

smooth_l1_loss_backward_npu

-

656

-

soft_margin_loss.out

-

soft_margin_loss_out_npu

-

657

-

soft_margin_loss

-

soft_margin_loss_npu

-

658

-

soft_margin_loss_backward.grad_input

-

soft_margin_loss_backward_out_npu

-

659

-

soft_margin_loss_backward

-

soft_margin_loss_backward_npu

-

660

-

elu.out

-

elu_out_npu

-

661

-

elu

-

elu_npu

-

662

-

elu_backward.grad_input

-

elu_backward_out_npu

-

663

-

elu_backward

-

elu_backward_npu

-

664

-

elu_

-

elu_npu_

-

665

-

glu.out

-

glu_out_npu

-

666

-

glu

-

glu_npu

-

667

-

glu_backward.grad_input

-

glu_backward_out_npu

-

668

-

glu_backward

-

glu_backward_npu

-

669

-

hardsigmoid.out

-

hardsigmoid_out_npu

-

670

-

hardsigmoid

-

hardsigmoid_npu

-

671

-

hardsigmoid_

-

hardsigmoid_npu_

-

672

-

hardsigmoid_backward

-

hardsigmoid_backward_npu

-

673

-

hardtanh.out

-

hardtanh_out_npu

-

674

-

hardtanh

-

hardtanh_npu

-

675

-

hardtanh_backward.grad_input

-

hardtanh_backward_out_npu

-

676

-

hardtanh_backward

-

hardtanh_backward_npu

-

677

-

hardtanh_

-

hardtanh_npu_

-

678

-

leaky_relu.out

-

leaky_relu_out_npu

-

679

-

leaky_relu

-

leaky_relu_npu

-

680

-

leaky_relu_backward

-

leaky_relu_backward_npu

-

681

-

leaky_relu_

-

leaky_relu_npu_

-

682

-

log_sigmoid.out

-

log_sigmoid_out_npu

-

683

-

log_sigmoid

-

log_sigmoid_npu

-

684

-

log_sigmoid_forward.output

-

log_sigmoid_forward_out_npu

-

685

-

log_sigmoid_forward

-

log_sigmoid_forward_npu

-

686

-

log_sigmoid_backward.grad_input

-

log_sigmoid_backward_out_npu

-

687

-

log_sigmoid_backward

-

log_sigmoid_backward_npu

-

688

-

rrelu_with_noise.out

-

rrelu_with_noise_out_npu

-

689

-

rrelu_with_noise

-

rrelu_with_noise_npu

-

690

-

rrelu_with_noise_backward

-

rrelu_with_noise_backward_npu

-

691

-

rrelu_with_noise_

-

rrelu_with_noise_npu_

-

692

-

softplus.out

-

softplus_out_npu

-

693

-

softplus

-

softplus_npu

-

694

-

softplus_backward.grad_input

-

softplus_backward_out_npu

-

695

-

softplus_backward

-

softplus_backward_npu

-

696

-

softshrink.out

-

softshrink_out_npu

-

697

-

softshrink

-

softshrink_npu

-

698

-

softshrink_backward.grad_input

-

softshrink_backward_out_npu

-

699

-

softshrink_backward

-

softshrink_backward_npu

-

700

-

adaptive_avg_pool2d.out

-

adaptive_avg_pool2d_out_npu

-

701

-

adaptive_avg_pool2d

-

adaptive_avg_pool2d_npu

-

702

-

_adaptive_avg_pool2d

-

_adaptive_avg_pool2d_npu

-

703

-

_adaptive_avg_pool2d_backward

-

adaptive_avg_pool2d_backward_npu

-

704

-

adaptive_avg_pool3d.out

-

adaptive_avg_pool3d_out_npu

-

705

-

adaptive_avg_pool3d

-

adaptive_avg_pool3d_npu

-

706

-

adaptive_avg_pool3d_backward.grad_input

-

adaptive_avg_pool3d_backward_out_npu

-

707

-

adaptive_avg_pool3d_backward

-

adaptive_avg_pool3d_backward_npu

-

708

-

adaptive_max_pool2d.out

-

adaptive_max_pool2d_out_npu

-

709

-

adaptive_max_pool2d

-

adaptive_max_pool2d_npu

-

710

-

adaptive_max_pool2d_backward.grad_input

-

adaptive_max_pool2d_backward_out_npu

-

711

-

adaptive_max_pool2d_backward

-

adaptive_max_pool2d_backward_npu

-

712

-

avg_pool2d.out

-

avg_pool2d_out_npu

-

713

-

avg_pool2d

-

avg_pool2d_npu

-

714

-

avg_pool2d_backward.grad_input

-

avg_pool2d_backward_out_npu

-

715

-

avg_pool2d_backward

-

avg_pool2d_backward_npu

-

716

-

avg_pool3d.out

-

avg_pool3d_out_npu

-

717

-

avg_pool3d

-

avg_pool3d_npu

-

718

-

avg_pool3d_backward.grad_input

-

avg_pool3d_backward_out_npu

-

719

-

avg_pool3d_backward

-

avg_pool3d_backward_npu

-

720

-

max_pool2d_with_indices.out

-

max_pool2d_with_indices_out_npu

-

721

-

max_pool2d_with_indices

-

max_pool2d_with_indices_npu

-

722

-

max_pool2d_with_indices_backward.grad_input

-

max_pool2d_with_indices_backward_out_npu

-

723

-

max_pool2d_with_indices_backward

-

max_pool2d_with_indices_backward_npu

-

724

-

max_pool3d_with_indices.out

-

max_pool3d_with_indices_out_npu

-

725

-

max_pool3d_with_indices

-

max_pool3d_with_indices_npu

-

726

-

max_pool3d_with_indices_backward.grad_input

-

max_pool3d_with_indices_backward_out_npu

-

727

-

max_pool3d_with_indices_backward

-

max_pool3d_with_indices_backward_npu

-

728

-

max_unpool2d.out

-

max_unpool2d_out_npu

-

729

-

max_unpool2d

-

max_unpool2d_npu

-

730

-

max_unpool2d_backward.grad_input

-

max_unpool2d_backward_out_npu

-

731

-

max_unpool2d_backward

-

max_unpool2d_backward_npu

-

732

-

max_unpool3d.out

-

max_unpool3d_out_npu

-

733

-

max_unpool3d

-

max_unpool3d_npu

-

734

-

max_unpool3d_backward.grad_input

-

max_unpool3d_backward_out_npu

-

735

-

max_unpool3d_backward

-

max_unpool3d_backward_npu

-

736

-

reflection_pad2d.out

-

reflection_pad2d_out_npu

-

737

-

reflection_pad2d

-

reflection_pad2d_npu

-

738

-

reflection_pad2d_backward.grad_input

-

reflection_pad2d_backward_out_npu

-

739

-

reflection_pad2d_backward

-

reflection_pad2d_backward_npu

-

740

-

replication_pad2d.out

-

replication_pad2d_out_npu

-

741

-

replication_pad2d

-

replication_pad2d_npu

-

742

-

replication_pad2d_backward.grad_input

-

replication_pad2d_backward_out_npu

-

743

-

replication_pad2d_backward

-

replication_pad2d_backward_npu

-

744

-

upsample_linear1d.out

-

upsample_linear1d_out_npu

-

745

-

upsample_linear1d

-

upsample_linear1d_npu

-

746

-

upsample_linear1d_backward

-

upsample_linear1d_backward_npu

-

747

-

upsample_bilinear2d.out

-

upsample_bilinear2d_out_npu

-

748

-

upsample_bilinear2d

-

upsample_bilinear2d_npu

-

749

-

upsample_bilinear2d_backward.grad_input

-

upsample_bilinear2d_backward_out_npu

-

750

-

upsample_bilinear2d_backward

-

upsample_bilinear2d_backward_npu

-

751

-

upsample_bicubic2d.out

-

upsample_bicubic2d_out_npu

-

752

-

upsample_bicubic2d

-

upsample_bicubic2d_npu

-

753

-

upsample_bicubic2d_backward.grad_input

-

upsample_bicubic2d_backward_out_npu

-

754

-

upsample_bicubic2d_backward

-

upsample_bicubic2d_backward_npu

-

755

-

upsample_trilinear3d.out

-

upsample_trilinear3d_out_npu

-

756

-

upsample_trilinear3d

-

upsample_trilinear3d_npu

-

757

-

upsample_trilinear3d_backward.grad_input

-

upsample_trilinear3d_backward_out_npu

-

758

-

upsample_trilinear3d_backward

-

upsample_trilinear3d_backward_npu

-

759

-

upsample_nearest1d.out

-

upsample_nearest1d_out_npu

-

760

-

upsample_nearest1d

-

upsample_nearest1d_npu

-

761

-

upsample_nearest1d_backward.grad_input

-

upsample_nearest1d_backward_out_npu

-

762

-

upsample_nearest1d_backward

-

upsample_nearest1d_backward_npu

-

763

-

upsample_nearest2d.out

-

upsample_nearest2d_out_npu

-

764

-

upsample_nearest2d

-

upsample_nearest2d_npu

-

765

-

upsample_nearest2d_backward.grad_input

-

upsample_nearest2d_backward_out_npu

-

766

-

upsample_nearest2d_backward

-

upsample_nearest2d_backward_npu

-

767

-

upsample_nearest3d.out

-

upsample_nearest3d_out_npu

-

768

-

upsample_nearest3d

-

upsample_nearest3d_npu

-

769

-

upsample_nearest3d_backward.grad_input

-

upsample_nearest3d_backward_out_npu

-

770

-

upsample_nearest3d_backward

-

upsample_nearest3d_backward_npu

-

771

-

sigmoid_backward.grad_input

-

sigmoid_backward_out_npu

-

772

-

sigmoid_backward

-

sigmoid_backward_npu

-

773

-

tanh_backward.grad_input

-

tanh_backward_out_npu

-

774

-

tanh_backward

-

tanh_backward_npu

-

775

-

slow_conv_transpose2d.out

-

slow_conv_transpose2d_out_npu

-

776

-

slow_conv_transpose2d

-

slow_conv_transpose2d_npu

-

777

-

slow_conv_transpose2d_backward.grad_output

-

slow_conv_transpose2d_backward_out_npu

-

778

-

slow_conv_transpose2d_backward.output_mask

-

slow_conv_transpose2d_backward_npu

-

779

-

thnn_conv2d.out

-

thnn_conv2d_out_npu

-

780

-

thnn_conv2d

-

thnn_conv2d_npu

-

781

-

thnn_conv2d_forward.output

-

thnn_conv2d_forward_out_npu

-

782

-

thnn_conv2d_forward

-

thnn_conv2d_forward_npu

-

783

-

thnn_conv2d_backward.output_mask

-

thnn_conv2d_backward_npu

-

784

-

thnn_conv_depthwise2d.out

-

thnn_conv_depthwise2d_out_npu

-

785

-

thnn_conv_depthwise2d

-

thnn_conv_depthwise2d_npu

-

786

-

thnn_conv_depthwise2d_forward.out

-

thnn_conv_depthwise2d_forward_out_npu

-

787

-

thnn_conv_depthwise2d_forward

-

thnn_conv_depthwise2d_forward_npu

-

788

-

thnn_conv_depthwise2d_backward.grad_input

-

thnn_conv_depthwise2d_backward_out_npu

-

789

-

thnn_conv_depthwise2d_backward.output_mask

-

thnn_conv_depthwise2d_backward_npu

-

790

-

slow_conv3d.out

-

slow_conv3d_out_npu

-

791

-

slow_conv3d

-

slow_conv3d_npu

-

792

-

slow_conv3d_forward.output

-

slow_conv3d_forward_out_npu

-

793

-

slow_conv3d_forward

-

slow_conv3d_forward_npu

-

794

-

slow_conv_dilated2d

-

slow_conv_dilated2d_npu

-

795

-

slow_conv_dilated2d_backward

-

slow_conv_dilated2d_backward_npu

-

796

-

col2im.out

-

im2col_backward_out_npu

-

797

-

col2im

-

im2col_backward_npu

-

798

-

col2im_backward.grad_input

-

im2col_out_npu

-

799

-

col2im_backward

-

im2col_npu

-

800

-

im2col.out

-

im2col_out_npu

-

801

-

im2col

-

im2col_npu

-

802

-

im2col_backward.grad_input

-

im2col_backward_out_npu

-

803

-

im2col_backward

-

im2col_backward_npu

-

804

-

isfinite

-

isfinite_npu

-
- -
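The mapping in this table is applied transparently by the dispatcher: calling a native operator on a tensor that lives on an NPU device routes to the adapted kernel in the right-hand column. The following is a minimal sketch, assuming an Ascend-adapted PyTorch build; the `torch.npu` module, the `npu()` tensor method, and the `npu:0` device string are features of that adapted build, not of stock PyTorch.

```python
import torch

# Assumption: an Ascend-adapted PyTorch build that registers the NPU
# device type and the adapted kernels (for example, relu -> relu_npu).
torch.npu.set_device("npu:0")   # select the first Ascend 910 device

x = torch.randn(2, 3).npu()     # copy the tensor to the NPU
y = torch.relu(x)               # dispatches to the adapted kernel relu_npu
y.relu_()                       # the in-place variant dispatches to relu_npu_
print(y.device)                 # -> npu:0
```
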

## PyTorch Operators Customized by Ascend

| No. | PyTorch Operator (Developed by Ascend) | Ascend Adapted Operator |
| --- | --- | --- |
| 1 | npu_convolution_transpose | npu_convolution_transpose |
| 2 | npu_conv_transpose2d | conv_transpose2d_npu |
| 3 | npu_convolution_transpose_backward | npu_convolution_transpose_backward |
| 4 | npu_conv_transpose2d_backward | conv_transpose2d_backward_npu |
| 5 | npu_conv_transpose3d_backward | conv_transpose3d_backward_npu |
| 6 | npu_convolution | npu_convolution |
| 7 | npu_convolution_backward | npu_convolution_backward |
| 8 | npu_convolution_double_backward | npu_convolution_double_backward |
| 9 | npu_conv2d | conv2d_npu |
| 10 | npu_conv2d.out | conv2d_out_npu |
| 11 | npu_conv2d_backward | conv2d_backward_npu |
| 12 | npu_conv3d | conv3d_npu |
| 13 | npu_conv3d.out | conv3d_out_npu |
| 14 | npu_conv3d_backward | conv3d_backward_npu |
| 15 | one_ | one_npu_ |
| 16 | npu_sort_v2.out | sort_without_indices_out_npu |
| 17 | npu_sort_v2 | sort_without_indices_npu |
| 18 | npu_format_cast | format_cast_npu |
| 19 | npu_format_cast_.acl_format | format_cast_npu_ |
| 20 | npu_format_cast_.src | format_cast_npu_ |
| 21 | npu_transpose_to_contiguous | transpose_to_contiguous_npu |
| 22 | npu_transpose | transpose_npu |
| 23 | npu_transpose.out | transpose_out_npu |
| 24 | npu_broadcast | broadcast_npu |
| 25 | npu_broadcast.out | broadcast_out_npu |
| 26 | npu_dtype_cast | dtype_cast_npu |
| 27 | npu_dtype_cast_.Tensor | dtype_cast_npu_ |
| 28 | npu_roi_alignbk | roi_align_backward_npu |
| 29 | empty_with_format | empty_with_format_npu |
| 30 | empty_with_format.names | empty_with_format_npu |
| 31 | copy_memory_ | copy_memory_npu_ |
| 32 | npu_one_hot | one_hot_npu |
| 33 | npu_stride_add | stride_add_npu |
| 34 | npu_softmax_cross_entropy_with_logits | softmax_cross_entropy_with_logits_npu |
| 35 | npu_softmax_cross_entropy_with_logits_backward | softmax_cross_entropy_with_logits_backward_npu |
| 36 | npu_ps_roi_pooling | ps_roi_pooling_npu |
| 37 | npu_ps_roi_pooling_backward | ps_roi_pooling_backward_npu |
| 38 | npu_roi_align | roi_align_npu |
| 39 | npu_nms_v4 | nms_v4_npu |
| 40 | npu_lstm | lstm_npu |
| 41 | npu_lstm_backward | lstm_backward_npu |
| 42 | npu_iou | iou_npu |
| 43 | npu_ptiou | ptiou_npu |
| 44 | npu_nms_with_mask | nms_with_mask_npu |
| 45 | npu_pad | pad_npu |
| 46 | npu_bounding_box_encode | bounding_box_encode_npu |
| 47 | npu_bounding_box_decode | bounding_box_decode_npu |
| 48 | npu_gru | gru_npu |
| 49 | npu_gru_backward | gru_backward_npu |
| 50 | npu_set_.source_Storage_storage_offset_format | set_npu_ |
| 51 | npu_random_choice_with_mask | random_choice_with_mask_npu |
| 52 | npu_batch_nms | batch_nms_npu |
| 53 | npu_slice | slice_npu |
| 54 | npu_slice.out | slice_out_npu |
| 55 | npu_dropoutV2 | dropout_v2_npu |
| 56 | npu_dropoutV2_backward | dropout_v2_backward_npu |
| 57 | _npu_dropout | _dropout_npu |
| 58 | _npu_dropout_inplace | _dropout_npu_inplace |
| 59 | npu_dropout_backward | dropout_backward_npu |
| 60 | npu_indexing | indexing_npu |
| 61 | npu_indexing.out | indexing_out_npu |
| 62 | npu_ifmr | ifmr_npu |
| 63 | npu_max.dim | max_v1_npu |
| 64 | npu_max.names_dim | max_v1_npu |
| 65 | npu_scatter | scatter_npu |
| 66 | npu_max_backward | max_backward_npu |
| 67 | npu_apply_adam | apply_adam_npu |
| 68 | npu_layer_norm_eval | layer_norm_eval_npu |
| 69 | npu_alloc_float_status | alloc_float_status_npu |
| 70 | npu_get_float_status | get_float_status_npu |
| 71 | npu_clear_float_status | clear_float_status_npu |
| 72 | npu_confusion_transpose | confusion_transpose_npu |
| 73 | npu_confusion_transpose_backward | confusion_transpose_backward_npu |
| 74 | npu_bmmV2 | bmm_v2_npu |
| 75 | fast_gelu | fast_gelu_npu |
| 76 | fast_gelu_backward | fast_gelu_backward_npu |
| 77 | npu_sub_sample | sub_sample_npu |
| 78 | npu_deformable_conv2d | deformable_conv2d_npu |
| 79 | npu_deformable_conv2dbk | deformable_conv2d_backward_npu |
| 80 | npu_mish | mish_npu |
| 81 | npu_anchor_response_flags | anchor_response_flags_npu |
| 82 | npu_yolo_boxes_encode | yolo_boxes_encode_npu |
| 83 | npu_grid_assign_positive | grid_assign_positive_npu |
| 84 | npu_mish_backward | mish_backward_npu |
| 85 | npu_normalize_batch | normalize_batch_npu |
| 86 | npu_masked_fill_range | masked_fill_range_npu |
| 87 | npu_linear | linear_npu |
| 88 | npu_linear_backward | linear_backward_npu |
| 89 | npu_bert_apply_adam | bert_apply_adam_npu |
| 90 | npu_giou | giou_npu |
| 91 | npu_giou_backward | giou_backward_npu |

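Unlike the native operators, these customized operators exist only in the Ascend-adapted build. The following is a minimal sketch that probes which of them the installed build provides; the three operator names are taken from the table above, and exposing them as attributes of the `torch` module is an assumption about the adapted build (a stock PyTorch install reports them all as missing).

```python
import torch

# Assumption: the Ascend-adapted build attaches customized operators such as
# npu_one_hot directly to the torch module; stock PyTorch does not have them.
for name in ("npu_one_hot", "npu_iou", "npu_linear"):
    status = "available" if hasattr(torch, name) else "not available"
    print(f"torch.{name}: {status}")
```
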
# PyTorch Operator Support

- [Mapping Between PyTorch Native Operators and Ascend Adapted Operators](#Mapping Between PyTorch Native Operators and Ascend Adapted Operatorsmd)
- [PyTorch Operators Customized by Ascend](#PyTorch Operators Customized by Ascendmd)

## Mapping Between PyTorch Native Operators and Ascend Adapted Operators

| No. | PyTorch Native Operator | Ascend Adapted Operator |
| --- | --- | --- |
| 1 | dropout | dropout_npu |
| 2 | dropout_ | dropout_npu_ |
| 3 | abs | abs_npu |
| 4 | abs_ | abs_npu_ |
| 5 | abs.out | abs_out_npu |
| 6 | acos | acos_npu |
| 7 | acos_ | acos_npu_ |
| 8 | acos.out | acos_out_npu |
| 9 | adaptive_avg_pool1d | adaptive_avg_pool1d_npu |
| 10 | add.Tensor | add_npu |
| 11 | add_.Tensor | add_npu_ |
| 12 | add.out | add_out_npu |
| 13 | add.Scalar | add_npu |
| 14 | add_.Scalar | add_npu_ |
| 15 | addmv | addmv_npu |
| 16 | addmv_ | addmv_npu_ |
| 17 | addmv.out | addmv_out_npu |
| 18 | addr | addr_npu |
| 19 | addr_ | addr_npu_ |
| 20 | addr.out | addr_out_npu |
| 21 | affine_grid_generator | affine_grid_generator_npu |
| 22 | affine_grid_generator_backward | affine_grid_generator_backward_npu |
| 23 | all.dim | all_npu |
| 24 | all.out | all_out_npu |
| 25 | any.dim | any_npu |
| 26 | any.out | any_out_npu |
| 27 | arange | arange_npu |
| 28 | arange.start | arange_npu |
| 29 | arange.start_step | arange_npu |
| 30 | arange.out | arange_out_npu |
| 31 | arange.start_out | arange_out_npu |
| 32 | _dim_arange | _dim_arange_npu |
| 33 | argmax | argmax_npu |
| 34 | argmin | argmin_npu |
| 35 | as_strided | as_strided_npu |
| 36 | as_strided_ | as_strided_npu_ |
| 37 | asin | asin_npu |
| 38 | asin_ | asin_npu_ |
| 39 | asin.out | asin_out_npu |
| 40 | atan | atan_npu |
| 41 | atan_ | atan_npu_ |
| 42 | atan.out | atan_out_npu |
| 43 | baddbmm | baddbmm_npu |
| 44 | baddbmm_ | baddbmm_npu_ |
| 45 | baddbmm.out | baddbmm_out_npu |
| 46 | bartlett_window | bartlett_window_npu |
| 47 | bartlett_window.periodic | bartlett_window_npu |
| 48 | batch_norm | batch_norm_npu_ |
| 49 | _batch_norm_impl_index | _batch_norm_impl_index_npu |
| 50 | _batch_norm_impl_index_backward | _batch_norm_impl_index_backward_npu |
| 51 | bernoulli | bernoulli_npu |
| 52 | bernoulli_.Tensor | bernoulli_npu_ |
| 53 | bernoulli_.float | bernoulli_npu_ |
| 54 | binary_cross_entropy | binary_cross_entropy_npu |
| 55 | binary_cross_entropy.out | binary_cross_entropy_out_npu |
| 56 | binary_cross_entropy_backward | binary_cross_entropy_backward_npu |
| 57 | binary_cross_entropy_backward.grad_input | binary_cross_entropy_backward_out_npu |
| 58 | binary_cross_entropy_with_logits | binary_cross_entropy_with_logits_npu |
| 59 | binary_cross_entropy_with_logits_backward | binary_cross_entropy_with_logits_backward_npu |
| 60 | bitwise_not | bitwise_not_npu |
| 61 | bitwise_not_ | bitwise_not_npu_ |
| 62 | bitwise_not.out | bitwise_not_out_npu |
| 63 | logical_not | logical_not_npu |
| 64 | logical_not_ | logical_not_npu_ |
| 65 | logical_not.out | logical_not_out_npu |
| 66 | logical_and | logical_and_npu |
| 67 | logical_and_ | logical_and_npu_ |
| 68 | logical_and.out | logical_and_out_npu |
| 69 | logical_or | logical_or_npu |
| 70 | logical_or_ | logical_or_npu_ |
| 71 | logical_or.out | logical_or_out_npu |
| 72 | blackman_window | blackman_window_npu |
| 73 | blackman_window.periodic | blackman_window_npu |
| 74 | bmm | bmm_npu |
| 75 | bmm.out | bmm_out_npu |
| 76 | cat | cat_npu |
| 77 | cat.out | cat_out_npu |
| 78 | cat.names | cat_npu |
| 79 | cat.names_out | cat_out_npu |
| 80 | ceil | ceil_npu |
| 81 | ceil_ | ceil_npu_ |
| 82 | ceil.out | ceil_out_npu |
| 83 | clamp | clamp_npu |
| 84 | clamp_ | clamp_npu_ |
| 85 | clamp.out | clamp_out_npu |
| 86 | clamp_max | clamp_max_npu |
| 87 | clamp_max_ | clamp_max_npu_ |
| 88 | clamp_max.out | clamp_max_out_npu |
| 89 | clamp_min | clamp_min_npu |
| 90 | clamp_min_ | clamp_min_npu_ |
| 91 | clamp_min.out | clamp_min_out_npu |
| 92 | constant_pad_nd | constant_pad_nd_npu |
| 93 | contiguous | contiguous_npu |
| 94 | convolution | convolution_npu |
| 95 | _convolution | _convolution_npu |
| 96 | _convolution_nogroup | _convolution_nogroup_npu |
| 97 | conv2d | conv2d_npu_ |
| 98 | conv3d | _conv3d_npu |
| 99 | conv_tbc | conv_tbc_npu |
| 100 | conv_tbc_backward | conv_tbc_backward_npu |
| 101 | conv_transpose2d.input | conv_transpose2d_npu_ |
| 102 | conv_transpose3d.input | conv_transpose3d_npu_ |
| 103 | copy_ | copy_npu_ |
| 104 | cos | cos_npu |
| 105 | cos_ | cos_npu_ |
| 106 | cos.out | cos_out_npu |
| 107 | cosh | cosh_npu |
| 108 | cosh_ | cosh_npu_ |
| 109 | cosh.out | cosh_out_npu |
| 110 | _cummax_helper | cummax_helper_npu |
| 111 | _cummin_helper | cummin_helper_npu |
| 112 | cumprod | cumprod_npu |
| 113 | cumprod.out | cumprod_out_npu |
| 114 | cumprod.dimname | cumprod_npu |
| 115 | cumprod.dimname_out | cumprod_out_npu |
| 116 | ctc_loss.IntList | ctc_loss_npu |
| 117 | ctc_loss.Tensor | ctc_loss_npu |
| 118 | _ctc_loss | ctc_loss_npu |
| 119 | _ctc_loss_backward | ctc_loss_backward_npu |
| 120 | fill_diagonal_ | fill_diagonal_npu_ |
| 121 | div.Tensor | div_npu |
| 122 | div_.Tensor | div_npu_ |
| 123 | div.out | div_out_npu |
| 124 | div.Scalar | div_npu |
| 125 | div_.Scalar | div_npu_ |
| 126 | dot | dot_npu |
| 127 | dot.out | dot_out_npu |
| 128 | embedding | embedding_npu |
| 129 | embedding_backward | embedding_backward_npu |
| 130 | embedding_dense_backward | embedding_dense_backward_npu |
| 131 | embedding_renorm_ | embedding_renorm_npu_ |
| 132 | _embedding_bag | _embedding_bag_npu |
| 133 | empty.memory_format | empty_npu |
| 134 | resize_ | resize_npu_ |
| 135 | empty_like | empty_like_npu |
| 136 | empty_strided | empty_strided_npu |
| 137 | erf | erf_npu |
| 138 | erf_ | erf_npu_ |
| 139 | erf.out | erf_out_npu |
| 140 | erfc | erfc_npu |
| 141 | erfc_ | erfc_npu_ |
| 142 | erfc.out | erfc_out_npu |
| 143 | exp | exp_npu |
| 144 | exp_ | exp_npu_ |
| 145 | exp.out | exp_out_npu |
| 146 | expm1 | expm1_npu |
| 147 | expm1_ | expm1_npu_ |
| 148 | expm1.out | expm1_out_npu |
| 149 | eye | eye_npu |
| 150 | eye.m | eye_npu |
| 151 | eye.out | eye_out_npu |
| 152 | eye.m_out | eye_out_npu |
| 153 | fill_.Scalar | fill_npu_ |
| 154 | fill_.Tensor | fill_npu_ |
| 155 | floor | floor_npu |
| 156 | floor_ | floor_npu_ |
| 157 | floor.out | floor_out_npu |
| 158 | floor_divide | floor_divide_npu |
| 159 | floor_divide_.Tensor | floor_divide_npu_ |
| 160 | floor_divide.out | floor_divide_out_npu |
| 161 | floor_divide.Scalar | floor_divide_npu |
| 162 | floor_divide_.Scalar | floor_divide_npu_ |
| 163 | frac | frac_npu |
| 164 | frac_ | frac_npu_ |
| 165 | frac.out | frac_out_npu |
| 166 | full.names | full_npu |
| 167 | full | full_npu |
| 168 | full.out | full_out_npu |
| 169 | grid_sampler | grid_sampler_npu |
| 170 | grid_sampler_3d | grid_sampler_3d_npu |
| 171 | grid_sampler_3d_backward | grid_sampler_3d_backward_npu |
| 172 | hann_window | hann_window_npu |
| 173 | hann_window.periodic | hann_window_npu |
| 174 | hamming_window | hamming_window_npu |
| 175 | hamming_window.periodic | hamming_window_npu |
| 176 | hamming_window.periodic_alpha | hamming_window_npu |
| 177 | hamming_window.periodic_alpha_beta | hamming_window_npu |
| 178 | ger | ger_npu |
| 179 | ger.out | ger_out_npu |
| 180 | index.Tensor | index_npu |
| 181 | index_put_ | index_put_npu_ |
| 182 | index_put | index_put_npu |
| 183 | _index_put_impl_ | _index_put_impl_npu_ |
| 184 | inverse | inverse_npu |
| 185 | inverse.out | inverse_out_npu |
| 186 | isclose | isclose_npu |
| 187 | isnan | isnan_npu |
| 188 | is_nonzero | is_nonzero_npu |
| 189 | kl_div | kl_div_npu |
| 190 | kl_div_backward | kl_div_backward_npu |
| 191 | kthvalue | kthvalue_npu |
| 192 | kthvalue.values | kthvalue_out_npu |
| 193 | kthvalue.dimname | kthvalue_npu |
| 194 | kthvalue.dimname_out | kthvalue_out_npu |
| 195 | native_layer_norm | layer_norm_npu |
| 196 | native_layer_norm_backward | layer_norm_backward_npu |
| 197 | linspace | linspace_npu |
| 198 | linspace.out | linspace_out_npu |
| 199 | log | log_npu |
| 200 | log_ | log_npu_ |
| 201 | log.out | log_out_npu |
| 202 | log10 | log10_npu |
| 203 | log10_ | log10_npu_ |
| 204 | log10.out | log10_out_npu |
| 205 | log1p | log1p_npu |
| 206 | log1p_ | log1p_npu_ |
| 207 | log1p.out | log1p_out_npu |
| 208 | log2 | log2_npu |
| 209 | log2_ | log2_npu_ |
| 210 | log2.out | log2_out_npu |
| 211 | logspace | logspace_npu |
| 212 | logspace.out | logspace_out_npu |
| 213 | log_softmax.int | log_softmax_npu |
| 214 | log_softmax.Dimname | log_softmax_npu |
| 215 | _log_softmax | _log_softmax_npu |
| 216 | _log_softmax_backward_data | _log_softmax_backward_npu |
| 217 | logsumexp | logsumexp_npu |
| 218 | logsumexp.out | logsumexp_out_npu |
| 219 | logsumexp.names | logsumexp_npu |
| 220 | logsumexp.names_out | logsumexp_out_npu |
| 221 | matmul | matmul_npu |
| 222 | matmul.out | matmul_out_npu |
| 223 | max.dim | max_npu |
| 224 | max.dim_max | max_out_npu |
| 225 | max_values | max_npu |
| 226 | max.names_dim | max_npu |
| 227 | max.names_dim_max | max_out_npu |
| 228 | max_values.names | max_npu |
| 229 | max_pool2d | max_pool2d_npu |
| 230 | mean | mean_npu |
| 231 | mean.dim | mean_npu |
| 232 | mean.out | mean_out_npu |
| 233 | mean.names_dim | mean_npu |
| 234 | mean.names_out | mean_out_npu |
| 235 | median.dim | median_npu |
| 236 | median.dim_values | median_out_npu |
| 237 | median.names_dim | median_npu |
| 238 | median.names_dim_values | median_out_npu |
| 239 | min.dim | min_npu |
| 240 | min.dim_min | min_out_npu |
| 241 | min_values | min_npu |
| 242 | min.names_dim | min_npu |
| 243 | min.names_dim_min | min_out_npu |
| 244 | min_values.names | min_npu |
| 245 | mm | mm_npu |
| 246 | mm.out | mm_out_npu |
| 247 | mul.Tensor | mul_npu |
| 248 | mul_.Tensor | mul_npu_ |
| 249 | mul.out | mul_out_npu |
| 250 | mul.Scalar | mul_npu |
| 251 | mul_.Scalar | mul_npu_ |
| 252 | mv | mv_npu |
| 253 | mv.out | mv_out_npu |
| 254 | narrow_copy | narrow_copy_npu |
| 255 | native_batch_norm | batch_norm_npu |
| 256 | batch_norm_stats | batch_norm_stats_npu |
| 257 | batch_norm_elemt | batch_norm_elemt_npu |
| 258 | batch_norm_elemt.out | batch_norm_elemt_out_npu |
| 259 | native_batch_norm_backward | batch_norm_backward_npu |
| 260 | batch_norm_backward_reduce | batch_norm_backward_reduce_npu |
| 261 | _nnpack_spatial_convolution | _nnpack_spatial_convolution_npu |
| 262 | ones.names | ones_npu |
| 263 | ones | ones_npu |
| 264 | ones.out | ones_out_npu |
| 265 | ones_like | ones_like_npu |
| 266 | cdist | cdist_npu |
| 267 | _cdist_forward | _cdist_forward_npu |
| 268 | _cdist_backward | _cdist_backward_npu |
| 269 | pdist | pdist_npu |
| 270 | _pdist_forward | _pdist_forward_npu |
| 271 | randperm | randperm_npu |
| 272 | randperm.generator | randperm_npu |
| 273 | randperm.out | randperm_out_npu |
| 274 | randperm.generator_out | randperm_out_npu |
| 275 | range.step | range_npu |
| 276 | range | range_npu |
| 277 | range.out | range_out_npu |
| 278 | reciprocal | reciprocal_npu |
| 279 | reciprocal_ | reciprocal_npu_ |
| 280 | reciprocal.out | reciprocal_out_npu |
| 281 | neg | neg_npu |
| 282 | neg_ | neg_npu_ |
| 283 | neg.out | neg_out_npu |
| 284 | repeat | repeat_npu |
| 285 | repeat_interleave.self_int | repeat_interleave_npu |
| 286 | round | round_npu |
| 287 | round_ | round_npu_ |
| 288 | round.out | round_out_npu |
| 289 | relu | relu_npu |
| 290 | relu_ | relu_npu_ |
| 291 | prelu | prelu_npu |
| 292 | prelu_backward | prelu_backward_npu |
| 293 | gelu | gelu_npu |
| 294 | gelu_backward | gelu_backward_npu |
| 295 | hardshrink | hardshrink_npu |
| 296 | hardshrink_backward | hardshrink_backward_npu |
| 297 | rsqrt | rsqrt_npu |
| 298 | rsqrt_ | rsqrt_npu_ |
| 299 | rsqrt.out | rsqrt_out_npu |
| 300 | selu | selu_npu |
| 301 | selu_ | selu_npu_ |
| 302 | celu | celu_npu |
| 303 | celu_ | celu_npu_ |
| 304 | sigmoid | sigmoid_npu |
| 305 | sigmoid_ | sigmoid_npu_ |
| 306 | sigmoid.out | sigmoid_out_npu |
| 307 | sin | sin_npu |
| 308 | sin_ | sin_npu_ |
| 309 | sin.out | sin_out_npu |
| 310 | sinh | sinh_npu |
| 311 | sinh_ | sinh_npu_ |
| 312 | sinh.out | sinh_out_npu |
| 313 | slogdet | slogdet_npu |
| 314 | softmax.int | softmax_npu |
| 315 | softmax.Dimname | softmax_npu |
| 316 | _softmax | _softmax_npu |
| 317 | _softmax_backward_data | _softmax_backward_npu |
| 318 | stack | stack_npu |
| 319 | stack.out | stack_out_npu |
| 320 | sum | sum_npu |
| 321 | sum.dim_IntList | sum_npu |
| 322 | sum.dim_DimnameList | sum_npu |
| 323 | sum.IntList_out | sum_out_npu |
| 324 | sum.DimnameList_out | sum_out_npu |
| 325 | sqrt | sqrt_npu |
| 326 | sqrt_ | sqrt_npu_ |
| 327 | sqrt.out | sqrt_out_npu |
| 328 | std | std_npu |
| 329 | std.dim | std_dim_npu |
| 330 | std_mean | std_mean_npu |
| 331 | std_mean.dim | std_mean_dim_npu |
| 332 | std_mean.names_dim | std_mean_names_npu |
| 333 | std.out | std_out_npu |
| 334 | std.names_dim | std_names_npu |
| 335 | std.names_out | std_out_npu |
| 336 | prod | prod_npu |
| 337 | prod.dim_int | prod_npu |
| 338 | prod.int_out | prod_out_npu |
| 339 | prod.dim_Dimname | prod_npu |
| 340 | prod.Dimname_out | prod_out_npu |
| 341 | tan | tan_npu |
| 342 | tan_ | tan_npu_ |
| 343 | tan.out | tan_out_npu |
| 344 | tanh | tanh_npu |
| 345 | tanh_ | tanh_npu_ |
| 346 | tanh.out | tanh_out_npu |
| 347 | threshold | threshold_npu |
| 348 | threshold_ | threshold_npu_ |
| 349 | threshold.out | threshold_out_npu |
| 350 | threshold_backward | threshold_backward_npu |
| 351 | one_hot | one_hot_npu1 |
| 352 | flip | flip_npu |
| 353 | roll | roll_npu |
| 354 | true_divide.Tensor | true_divide_npu |
| 355 | true_divide_.Tensor | true_divide_npu_ |
| 356 | true_divide.out | true_divide_out_npu |
| 357 | true_divide.Scalar | true_divide_npu |
| 358 | true_divide_.Scalar | true_divide_npu_ |
| 359 | trunc | trunc_npu |
| 360 | trunc_ | trunc_npu_ |
| 361 | trunc.out | trunc_out_npu |
| 362 | _unique2 | _unique2_npu |
| 363 | var | var_npu |
| 364 | var.dim | var_npu |
| 365 | var.out | var_out_npu |
| 366 | var.names_dim | var_npu |
| 367 | var.names_out | var_out_npu |
| 368 | var_mean | var_mean_npu |
| 369 | var_mean.dim | var_mean_npu |
| 370 | var_mean.names_dim | var_mean_npu |
| 371 | where.self | where_npu |
| 372 | where | where_npu |
| 373 | _s_where | _s_where_npu |
| 374 | zeros.names | zeros_npu |
| 375 | zeros | zeros_npu |
| 376 | zeros.out | zeros_out_npu |
| 377 | zeros_like | zeros_like_npu |
| 378 | norm.ScalarOpt_dtype | norm_npu |
| 379 | norm.Scalar | norm_npu |
| 380 | norm.ScalarOpt_dim_dtype | norm_npu |
| 381 | norm.ScalarOpt_dim | norm_npu |
| 382 | norm.dtype_out | norm_out_npu |
| 383 | norm.out | norm_out_npu |
| 384 | clone | clone_npu |
| 385 | resize_as_ | resize_as_npu_ |
| 386 | pow.Tensor_Scalar_out | pow_out_npu |
| 387 | pow.Tensor_Scalar | pow_npu |
| 388 | zero_ | zero_npu_ |
| 389 | sub.out | sub_out_npu |
| 390 | sub.Tensor | sub_npu |
| 391 | sub_.Tensor | sub_npu_ |
| 392 | sub.Scalar | sub_npu |
| 393 | sub_.Scalar | sub_npu_ |
| 394 | rsub.Tensor | rsub_npu |
| 395 | rsub.Scalar | rsub_npu |
| 396 | addmm.out | addmm_out_npu |
| 397 | addmm | addmm_npu |
| 398 | addmm_ | addmm_npu_ |
| 399 | quantize_per_tensor | quantize_per_tensor_npu |
| 400 | quantize_per_channel | quantize_per_channel_npu |
| 401 | to.dtype_layout | to_npu |
| 402 | to.device | to_device_npu |
| 403 | to.dtype | to_dtype_npu |
| 404 | to.other | to_other_npu |
| 405 | _local_scalar_dense | _local_scalar_dense_npu |
| 406 | lstm.input | lstm_npu |
| 407 | lstm.data | lstm_npu |
| 408 | gru.input | gru_npu_ |
| 409 | _pack_padded_sequence | _pack_padded_sequence_npu |
| 410 | _pad_packed_sequence | _pad_packed_sequence_npu |
| 411 | set_.source_Storage | set_npu_ |
| 412 | set_.source_Storage_storage_offset | set_npu_ |
| 413 | set_.source_Tensor | set_npu_ |
| 414 | set_ | set_npu_ |
| 415 | masked_fill_.Scalar | masked_fill_npu_ |
| 416 | masked_fill_.Tensor | masked_fill_npu_ |
| 417 | masked_scatter_ | masked_scatter_npu_ |
| 418 | view | view_npu |
| 419 | put_ | put_npu_ |
| 420 | index_add_ | index_add_npu_ |
| 421 | index_add | index_add_npu |
| 422 | index_add.dimname | index_add_npu |
| 423 | index_fill_.int_Scalar | index_fill_npu_ |
| 424 | index_fill.int_Scalar | index_fill_npu |
| 425 | index_fill_.int_Tensor | index_fill_npu_ |
| 426 | index_fill.int_Tensor | index_fill_npu |
| 427 | scatter_.src | scatter_npu_ |
| 428 | scatter_.value | scatter_npu_ |
| 429 | scatter_add_ | scatter_add_npu_ |
| 430 | scatter_add | scatter_add_npu |
| 431 | scatter_add.dimname | scatter_add_npu |
| 432 | lt_.Scalar | lt_npu_ |
| 433 | lt_.Tensor | lt_npu_ |
| 434 | gt_.Scalar | gt_npu_ |
| 435 | gt_.Tensor | gt_npu_ |
| 436 | le_.Scalar | le_npu_ |
| 437 | le_.Tensor | le_npu_ |
| 438 | ge_.Scalar | ge_npu_ |
| 439 | ge_.Tensor | ge_npu_ |
| 440 | eq_.Scalar | eq_npu_ |
| 441 | eq_.Tensor | eq_npu_ |
| 442 | ne_.Scalar | ne_npu_ |
| 443 | ne_.Tensor | ne_npu_ |
| 444 | bitwise_and.Tensor_out | bitwise_and_out_npu |
| 445 | bitwise_and.Scalar_out | bitwise_and_out_npu |
| 446 | bitwise_and.Scalar | bitwise_and_npu |
| 447 | bitwise_and.Tensor | bitwise_and_npu |
| 448 | bitwise_and_.Scalar | bitwise_and_npu_ |
| 449 | bitwise_and_.Tensor | bitwise_and_npu_ |
| 450 | __and__.Scalar | __and___npu |
| 451 | __and__.Tensor | __and___npu |
| 452 | bitwise_or.Tensor_out | bitwise_or_out_npu |
| 453 | bitwise_or.Scalar_out | bitwise_or_out_npu |
| 454 | bitwise_or.Scalar | bitwise_or_npu |
| 455 | bitwise_or.Tensor | bitwise_or_npu |
| 456 | bitwise_or_.Scalar | bitwise_or_npu_ |
| 457 | bitwise_or_.Tensor | bitwise_or_npu_ |
| 458 | __or__.Scalar | __or___npu |
| 459 | __or__.Tensor | __or___npu |
| 460 | __ior__.Scalar | __ior___npu |
| 461 | __ior__.Tensor | __ior___npu |
| 462 | bitwise_xor.Tensor_out | bitwise_xor_out_npu |
| 463 | bitwise_xor.Scalar_out | bitwise_xor_out_npu |
| 464 | bitwise_xor.Scalar | bitwise_xor_npu |
| 465 | bitwise_xor.Tensor | bitwise_xor_npu |
| 466 | bitwise_xor_.Scalar | bitwise_xor_npu_ |
| 467 | bitwise_xor_.Tensor | bitwise_xor_npu_ |
| 468 | __xor__.Scalar | __xor___npu |
| 469 | __xor__.Tensor | __xor___npu |
| 470 | __lshift__.Scalar | __lshift___npu |
| 471 | __lshift__.Tensor | __lshift___npu |
| 472 | __ilshift__.Scalar | __iLshift___npu |
| 473 | __ilshift__.Tensor | __iLshift___npu |
| 474 | __rshift__.Scalar | __rshift___npu |
| 475 | __rshift__.Tensor | __rshift___npu |
| 476 | __irshift__.Scalar | __iRshift___npu |
| 477 | __irshift__.Tensor | __iRshift___npu |
| 478 | atan2_ | atan2_npu_ |
| 479 | tril_ | tril_npu_ |
| 480 | triu_ | triu_npu_ |
| 481 | renorm_ | renorm_npu_ |
| 482 | pow_.Scalar | pow_npu_ |
| 483 | pow_.Tensor | pow_npu_ |
| 484 | lerp_.Scalar | lerp_npu_ |
| 485 | lerp_.Tensor | lerp_npu_ |
| 486 | fmod_.Scalar | fmod_npu_ |
| 487 | fmod_.Tensor | fmod_npu_ |
| 488 | remainder_.Scalar | remainder_npu_ |
| 489 | remainder_.Tensor | remainder_npu_ |
| 490 | addbmm_ | addbmm_npu_ |
| 491 | addbmm.out | addbmm_out_npu |
| 492 | addbmm | addbmm_npu |
| 493 | addcdiv_ | addcdiv_npu_ |
| 494 | random_.from | random_npu_ |
| 495 | random_.to | random_npu_ |
| 496 | random_ | random_npu_ |
| 497 | uniform_ | uniform_npu_ |
| 498 | diag.out | diag_out_npu |
| 499 | diag | diag_npu |
| 500 | cross.out | cross_out_npu |
| 501 | cross | cross_npu |
| 502 | triu.out | triu_out_npu |
| 503 | triu | triu_npu |
| 504 | tril.out | tril_out_npu |
| 505 | tril | tril_npu |
| 506 | tril_indices | tril_indices_npu |
| 507 | triu_indices | triu_indices_npu |
| 508 | ne.Scalar_out | ne_out_npu |
| 509 | ne.Scalar | ne_npu |
| 510 | ne.Tensor_out | ne_out_npu |
| 511 | ne.Tensor | ne_npu |
| 512 | eq.Scalar_out | eq_out_npu |
| 513 | eq.Scalar | eq_npu |
| 514 | eq.Tensor_out | eq_out_npu |
| 515 | eq.Tensor | eq_npu |
| 516 | ge.Scalar_out | ge_out_npu |
| 517 | ge.Scalar | ge_npu |
| 518 | ge.Tensor_out | ge_out_npu |
| 519 | ge.Tensor | ge_npu |
| 520 | le.Scalar_out | le_out_npu |
| 521 | le.Scalar | le_npu |
| 522 | le.Tensor_out | le_out_npu |
| 523 | le.Tensor | le_npu |
| 524 | gt.Scalar_out | gt_out_npu |
| 525 | gt.Scalar | gt_npu |
| 526 | gt.Tensor_out | gt_out_npu |
| 527 | gt.Tensor | gt_npu |
| 528 | lt.Scalar_out | lt_out_npu |
| 529 | lt.Scalar | lt_npu |
| 530 | lt.Tensor_out | lt_out_npu |
| 531 | lt.Tensor | lt_npu |
| 532 | take.out | take_out_npu |
| 533 | take | take_npu |
| 534 | index_select.out | index_select_out_npu |
| 535 | index_select | index_select_npu |
| 536 | index_select.dimname_out | index_select_out_npu |
| 537 | index_select.dimname | index_select_npu |
| 538 | masked_select.out | masked_select_out_npu |
| 539 | masked_select | masked_select_npu |
| 540 | nonzero.out | nonzero_out_npu |
| 541 | nonzero | nonzero_npu |
| 542 | gather.out | gather_out_npu |
| 543 | gather | gather_npu |
| 544 | gather.dimname_out | gather_out_npu |
| 545 | gather.dimname | gather_npu |
| 546 | addcmul.out | addcmul_out_npu |
| 547 | addcmul | addcmul_npu |
| 548 | addcmul_ | addcmul_npu_ |
| 549 | addcdiv.out | addcdiv_out_npu |
| 550 | addcdiv | addcdiv_npu |
| 551 | _triangular_solve_helper | _triangular_solve_helper_npu |
| 552 | _symeig_helper | _symeig_helper_npu |
| 553 | _svd_helper | _svd_helper_npu |
| 554 | qr.Q | qr_out_npu |
| 555 | qr | qr_npu |
| 556 | multinomial.out | multinomial_out_npu |
| 557 | multinomial | multinomial_npu |
| 558 | erfinv | erfinv_npu |
| 559 | erfinv_ | erfinv_npu_ |
| 560 | erfinv.out | erfinv_out_npu |
| 561 | sign | sign_npu |
| 562 | sign_ | sign_npu_ |
| 563 | sign.out | sign_out_npu |
| 564 | atan2.out | atan2_out_npu |
| 565 | atan2 | atan2_npu |
| 566 | lerp.Scalar_out | lerp_out_npu |
| 567 | lerp.Tensor_out | lerp_out_npu |
| 568 | lerp.Scalar | lerp_npu |
| 569 | lerp.Tensor | lerp_npu |
| 570 | fmod.Scalar_out | fmod_out_npu |
| 571 | fmod.Scalar | fmod_npu |
| 572 | fmod.Tensor_out | fmod_out_npu |
| 573 | fmod.Tensor | fmod_npu |
| 574 | remainder.Scalar_out | remainder_out_npu |
| 575 | remainder.Scalar | remainder_npu |
| 576 | remainder.Tensor_out | remainder_out_npu |
| 577 | remainder.Tensor | remainder_npu |
| 578 | min.out | min_out_npu |
| 579 | min.other | min_npu |
| 580 | min | min_npu |
| 581 | max.out | max_out_npu |
| 582 | max.other | max_npu |
| 583 | max | max_npu |
| 584 | median | median_npu |
| 585 | sort.values | sort_out_npu |
| 586 | sort | sort_npu |
| 587 | sort.dimname_values | sort_out_npu |
| 588 | sort.dimname | sort_npu |
| 589 | argsort | argsort_npu |
| 590 | argsort.dimname | argsort_npu |
| 591 | topk.values | topk_out_npu |
| 592 | topk | topk_npu |
| 593 | all | all_npu |
| 594 | any | any_npu |
| 595 | renorm.out | renorm_out_npu |
| 596 | renorm | renorm_npu |
| 597 | unfold | unfold |
| 598 | equal | equal_npu |
| 599 | pow.Tensor_Tensor_out | pow_out_npu |
| 600 | pow.Tensor_Tensor | pow_npu |

+

pow_npu

+

601

+

pow.Scalar_out

+

pow_out_npu

+

602

+

pow.Scalar

+

pow_npu

+

603

+

normal_

+

normal_npu_

+

604

+

normal.Tensor_float_out

+

normal_out_npu

+

605

+

normal.Tensor_float

+

normal_npu

+

606

+

normal.float_Tensor_out

+

normal_out_npu

+

607

+

normal.float_Tensor

+

normal_npu

+

608

+

normal.Tensor_Tensor_out

+

normal_out_npu

+

609

+

normal.Tensor_Tensor

+

normal_npu

+

610

+

normal.float_float

+

normal_npu

+

611

+

normal.float_float_out

+

normal_out_npu

+

612

+

_addr

+

_addr_npu

+

613

+

_addr_

+

_addr_npu_

+

614

+

_addr.out

+

_addr_out_npu

+

615

+

_index_copy_

+

index_copy_npu_

+

616

+

_cumsum

+

_cumsum_npu

+

617

+

_cumsum.out

+

_cumsum_out_npu

+

618

+

_cumprod

+

_cumprod_npu

+

619

+

_cumprod.out

+

_cumprod_out_npu

+

620

+

_var

+

_var_npu

+

621

+

_amp_non_finite_check_and_unscale_

+

_amp_non_finite_check_and_unscale_npu_

+

622

+

_cat

+

_cat_npu

+

623

+

_cat.out

+

_cat_out_npu

+

624

+

_max

+

_max_npu

+

625

+

_max.max

+

_max_out_npu

+

626

+

_min

+

_min_npu

+

627

+

_min.min

+

_min_out_npu

+

628

+

mse_loss.out

+

mse_loss_out_npu

+

629

+

mse_loss

+

mse_loss_npu

+

630

+

mse_loss_backward.grad_input

+

mse_loss_backward_out_npu

+

631

+

mse_loss_backward

+

mse_loss_backward_npu

+

632

+

l1_loss.out

+

l1_loss_out_npu

+

633

+

l1_loss

+

l1_loss_npu

+

634

+

l1_loss_backward.grad_input

+

l1_loss_backward_out_npu

+

635

+

l1_loss_backward

+

l1_loss_backward_npu

+

636

+

multilabel_margin_loss.out

+

multilabel_margin_loss_out_npu

+

637

+

multilabel_margin_loss

+

multilabel_margin_loss_npu

+

638

+

multilabel_margin_loss_forward.output

+

multilabel_margin_loss_forward_out_npu

+

639

+

multilabel_margin_loss_forward

+

multilabel_margin_loss_forward_npu

+

640

+

nll_loss.out

+

nll_loss_out_npu

+

641

+

nll_loss

+

nll_loss_npu

+

642

+

nll_loss_forward.output

+

nll_loss_forward_out_npu

+

643

+

nll_loss_forward

+

nll_loss_forward_npu

+

644

+

nll_loss_backward.grad_input

+

nll_loss_backward_out_npu

+

645

+

nll_loss_backward

+

nll_loss_backward_npu

+

646

+

nll_loss2d.out

+

nll_loss2d_out_npu

+

647

+

nll_loss2d

+

nll_loss2d_npu

+

648

+

nll_loss2d_forward.output

+

nll_loss2d_forward_out_npu

+

649

+

nll_loss2d_forward

+

nll_loss2d_forward_npu

+

650

+

nll_loss2d_backward.grad_input

+

nll_loss2d_backward_out_npu

+

651

+

nll_loss2d_backward

+

nll_loss2d_backward_npu

+

652

+

smooth_l1_loss.out

+

smooth_l1_loss_out_npu

+

653

+

smooth_l1_loss

+

smooth_l1_loss_npu

+

654

+

smooth_l1_loss_backward.grad_input

+

smooth_l1_loss_backward_out_npu

+

655

+

smooth_l1_loss_backward

+

smooth_l1_loss_backward_npu

+

656

+

soft_margin_loss.out

+

soft_margin_loss_out_npu

+

657

+

soft_margin_loss

+

soft_margin_loss_npu

+

658

+

soft_margin_loss_backward.grad_input

+

soft_margin_loss_backward_out_npu

+

659

+

soft_margin_loss_backward

+

soft_margin_loss_backward_npu

+

660

+

elu.out

+

elu_out_npu

+

661

+

elu

+

elu_npu

+

662

+

elu_backward.grad_input

+

elu_backward_out_npu

+

663

+

elu_backward

+

elu_backward_npu

+

664

+

elu_

+

elu_npu_

+

665

+

glu.out

+

glu_out_npu

+

666

+

glu

+

glu_npu

+

667

+

glu_backward.grad_input

+

glu_backward_out_npu

+

668

+

glu_backward

+

glu_backward_npu

+

669

+

hardsigmoid.out

+

hardsigmoid_out_npu

+

670

+

hardsigmoid

+

hardsigmoid_npu

+

671

+

hardsigmoid_

+

hardsigmoid_npu_

+

672

+

hardsigmoid_backward

+

hardsigmoid_backward_npu

+

673

+

hardtanh.out

+

hardtanh_out_npu

+

674

+

hardtanh

+

hardtanh_npu

+

675

+

hardtanh_backward.grad_input

+

hardtanh_backward_out_npu

+

676

+

hardtanh_backward

+

hardtanh_backward_npu

+

677

+

hardtanh_

+

hardtanh_npu_

+

678

+

leaky_relu.out

+

leaky_relu_out_npu

+

679

+

leaky_relu

+

leaky_relu_npu

+

680

+

leaky_relu_backward

+

leaky_relu_backward_npu

+

681

+

leaky_relu_

+

leaky_relu_npu_

+

682

+

log_sigmoid.out

+

log_sigmoid_out_npu

+

683

+

log_sigmoid

+

log_sigmoid_npu

+

684

+

log_sigmoid_forward.output

+

log_sigmoid_forward_out_npu

+

685

+

log_sigmoid_forward

+

log_sigmoid_forward_npu

+

686

+

log_sigmoid_backward.grad_input

+

log_sigmoid_backward_out_npu

+

687

+

log_sigmoid_backward

+

log_sigmoid_backward_npu

+

688

+

rrelu_with_noise.out

+

rrelu_with_noise_out_npu

+

689

+

rrelu_with_noise

+

rrelu_with_noise_npu

+

690

+

rrelu_with_noise_backward

+

rrelu_with_noise_backward_npu

+

691

+

rrelu_with_noise_

+

rrelu_with_noise_npu_

+

692

+

softplus.out

+

softplus_out_npu

+

693

+

softplus

+

softplus_npu

+

694

+

softplus_backward.grad_input

+

softplus_backward_out_npu

+

695

+

softplus_backward

+

softplus_backward_npu

+

696

+

softshrink.out

+

softshrink_out_npu

+

697

+

softshrink

+

softshrink_npu

+

698

+

softshrink_backward.grad_input

+

softshrink_backward_out_npu

+

699

+

softshrink_backward

+

softshrink_backward_npu

+

700

+

adaptive_avg_pool2d.out

+

adaptive_avg_pool2d_out_npu

+

701

+

adaptive_avg_pool2d

+

adaptive_avg_pool2d_npu

+

702

+

_adaptive_avg_pool2d

+

_adaptive_avg_pool2d_npu

+

703

+

_adaptive_avg_pool2d_backward

+

adaptive_avg_pool2d_backward_npu

+

704

+

adaptive_avg_pool3d.out

+

adaptive_avg_pool3d_out_npu

+

705

+

adaptive_avg_pool3d

+

adaptive_avg_pool3d_npu

+

706

+

adaptive_avg_pool3d_backward.grad_input

+

adaptive_avg_pool3d_backward_out_npu

+

707

+

adaptive_avg_pool3d_backward

+

adaptive_avg_pool3d_backward_npu

+

708

+

adaptive_max_pool2d.out

+

adaptive_max_pool2d_out_npu

+

709

+

adaptive_max_pool2d

+

adaptive_max_pool2d_npu

+

710

+

adaptive_max_pool2d_backward.grad_input

+

adaptive_max_pool2d_backward_out_npu

+

711

+

adaptive_max_pool2d_backward

+

adaptive_max_pool2d_backward_npu

+

712

+

avg_pool2d.out

+

avg_pool2d_out_npu

+

713

+

avg_pool2d

+

avg_pool2d_npu

+

714

+

avg_pool2d_backward.grad_input

+

avg_pool2d_backward_out_npu

+

715

+

avg_pool2d_backward

+

avg_pool2d_backward_npu

+

716

+

avg_pool3d.out

+

avg_pool3d_out_npu

+

717

+

avg_pool3d

+

avg_pool3d_npu

+

718

+

avg_pool3d_backward.grad_input

+

avg_pool3d_backward_out_npu

+

719

+

avg_pool3d_backward

+

avg_pool3d_backward_npu

+

720

+

max_pool2d_with_indices.out

+

max_pool2d_with_indices_out_npu

+

721

+

max_pool2d_with_indices

+

max_pool2d_with_indices_npu

+

722

+

max_pool2d_with_indices_backward.grad_input

+

max_pool2d_with_indices_backward_out_npu

+

723

+

max_pool2d_with_indices_backward

+

max_pool2d_with_indices_backward_npu

+

724

+

max_pool3d_with_indices.out

+

max_pool3d_with_indices_out_npu

+

725

+

max_pool3d_with_indices

+

max_pool3d_with_indices_npu

+

726

+

max_pool3d_with_indices_backward.grad_input

+

max_pool3d_with_indices_backward_out_npu

+

727

+

max_pool3d_with_indices_backward

+

max_pool3d_with_indices_backward_npu

+

728

+

max_unpool2d.out

+

max_unpool2d_out_npu

+

729

+

max_unpool2d

+

max_unpool2d_npu

+

730

+

max_unpool2d_backward.grad_input

+

max_unpool2d_backward_out_npu

+

731

+

max_unpool2d_backward

+

max_unpool2d_backward_npu

+

732

+

max_unpool3d.out

+

max_unpool3d_out_npu

+

733

+

max_unpool3d

+

max_unpool3d_npu

+

734

+

max_unpool3d_backward.grad_input

+

max_unpool3d_backward_out_npu

+

735

+

max_unpool3d_backward

+

max_unpool3d_backward_npu

+

736

+

reflection_pad2d.out

+

reflection_pad2d_out_npu

+

737

+

reflection_pad2d

+

reflection_pad2d_npu

+

738

+

reflection_pad2d_backward.grad_input

+

reflection_pad2d_backward_out_npu

+

739

+

reflection_pad2d_backward

+

reflection_pad2d_backward_npu

+

740

+

replication_pad2d.out

+

replication_pad2d_out_npu

+

741

+

replication_pad2d

+

replication_pad2d_npu

+

742

+

replication_pad2d_backward.grad_input

+

replication_pad2d_backward_out_npu

+

743

+

replication_pad2d_backward

+

replication_pad2d_backward_npu

+

744

+

upsample_linear1d.out

+

upsample_linear1d_out_npu

+

745

+

upsample_linear1d

+

upsample_linear1d_npu

+

746

+

upsample_linear1d_backward

+

upsample_linear1d_backward_npu

+

747

+

upsample_bilinear2d.out

+

upsample_bilinear2d_out_npu

+

748

+

upsample_bilinear2d

+

upsample_bilinear2d_npu

+

749

+

upsample_bilinear2d_backward.grad_input

+

upsample_bilinear2d_backward_out_npu

+

750

+

upsample_bilinear2d_backward

+

upsample_bilinear2d_backward_npu

+

751

+

upsample_bicubic2d.out

+

upsample_bicubic2d_out_npu

+

752

+

upsample_bicubic2d

+

upsample_bicubic2d_npu

+

753

+

upsample_bicubic2d_backward.grad_input

+

upsample_bicubic2d_backward_out_npu

+

754

+

upsample_bicubic2d_backward

+

upsample_bicubic2d_backward_npu

+

755

+

upsample_trilinear3d.out

+

upsample_trilinear3d_out_npu

+

756

+

upsample_trilinear3d

+

upsample_trilinear3d_npu

+

757

+

upsample_trilinear3d_backward.grad_input

+

upsample_trilinear3d_backward_out_npu

+

758

+

upsample_trilinear3d_backward

+

upsample_trilinear3d_backward_npu

+

759

+

upsample_nearest1d.out

+

upsample_nearest1d_out_npu

+

760

+

upsample_nearest1d

+

upsample_nearest1d_npu

+

761

+

upsample_nearest1d_backward.grad_input

+

upsample_nearest1d_backward_out_npu

+

762

+

upsample_nearest1d_backward

+

upsample_nearest1d_backward_npu

+

763

+

upsample_nearest2d.out

+

upsample_nearest2d_out_npu

+

764

+

upsample_nearest2d

+

upsample_nearest2d_npu

+

765

+

upsample_nearest2d_backward.grad_input

+

upsample_nearest2d_backward_out_npu

+

766

+

upsample_nearest2d_backward

+

upsample_nearest2d_backward_npu

+

767

+

upsample_nearest3d.out

+

upsample_nearest3d_out_npu

+

768

+

upsample_nearest3d

+

upsample_nearest3d_npu

+

769

+

upsample_nearest3d_backward.grad_input

+

upsample_nearest3d_backward_out_npu

+

770

+

upsample_nearest3d_backward

+

upsample_nearest3d_backward_npu

+

771

+

sigmoid_backward.grad_input

+

sigmoid_backward_out_npu

+

772

+

sigmoid_backward

+

sigmoid_backward_npu

+

773

+

tanh_backward.grad_input

+

tanh_backward_out_npu

+

774

+

tanh_backward

+

tanh_backward_npu

+

775

+

slow_conv_transpose2d.out

+

slow_conv_transpose2d_out_npu

+

776

+

slow_conv_transpose2d

+

slow_conv_transpose2d_npu

+

777

+

slow_conv_transpose2d_backward.grad_output

+

slow_conv_transpose2d_backward_out_npu

+

778

+

slow_conv_transpose2d_backward.output_mask

+

slow_conv_transpose2d_backward_npu

+

779

+

thnn_conv2d.out

+

thnn_conv2d_out_npu

+

780

+

thnn_conv2d

+

thnn_conv2d_npu

+

781

+

thnn_conv2d_forward.output

+

thnn_conv2d_forward_out_npu

+

782

+

thnn_conv2d_forward

+

thnn_conv2d_forward_npu

+

783

+

thnn_conv2d_backward.output_mask

+

thnn_conv2d_backward_npu

+

784

+

thnn_conv_depthwise2d.out

+

thnn_conv_depthwise2d_out_npu

+

785

+

thnn_conv_depthwise2d

+

thnn_conv_depthwise2d_npu

+

786

+

thnn_conv_depthwise2d_forward.out

+

thnn_conv_depthwise2d_forward_out_npu

+

787

+

thnn_conv_depthwise2d_forward

+

thnn_conv_depthwise2d_forward_npu

+

788

+

thnn_conv_depthwise2d_backward.grad_input

+

thnn_conv_depthwise2d_backward_out_npu

+

789

+

thnn_conv_depthwise2d_backward.output_mask

+

thnn_conv_depthwise2d_backward_npu

+

790

+

slow_conv3d.out

+

slow_conv3d_out_npu

+

791

+

slow_conv3d

+

slow_conv3d_npu

+

792

+

slow_conv3d_forward.output

+

slow_conv3d_forward_out_npu

+

793

+

slow_conv3d_forward

+

slow_conv3d_forward_npu

+

794

+

slow_conv_dilated2d

+

slow_conv_dilated2d_npu

+

795

+

slow_conv_dilated2d_backward

+

slow_conv_dilated2d_backward_npu

+

796

+

col2im.out

+

im2col_backward_out_npu

+

797

+

col2im

+

im2col_backward_npu

+

798

+

col2im_backward.grad_input

+

im2col_out_npu

+

799

+

col2im_backward

+

im2col_npu

+

800

+

im2col.out

+

im2col_out_npu

+

801

+

im2col

+

im2col_npu

+

802

+

im2col_backward.grad_input

+

im2col_backward_out_npu

+

803

+

im2col_backward

+

im2col_backward_npu

+

804

+

isfinite

+

isfinite_npu

+
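Because each native operator above is adapted one-to-one, model code does not call the `*_npu` functions directly: moving tensors to an NPU device is enough for PyTorch's dispatcher to pick the adapted kernel. A minimal sketch, assuming the NPU-adapted PyTorch from this guide is installed and device0 is available (some adapter versions additionally require `import torch_npu` before the `npu` device is visible):

```python
import torch

torch.npu.set_device(0)           # bind this process to device0

x = torch.randn(2, 3).npu()       # move tensors to the NPU, like .cuda()
y = torch.randn(2, 3).npu()

# Standard calls dispatch to the adapted kernels listed above,
# e.g. lt -> lt_npu, pow -> pow_npu, topk -> topk_out_npu.
mask = x.lt(y)
p = x.pow(2)
values, indices = p.topk(2, dim=1)
print(mask.cpu(), values.cpu(), indices.cpu())
```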
## PyTorch Operators Customized by Ascend

| No. | PyTorch Operator (Developed by Ascend) | Ascend Adapted Operator |
| ---- | -------------------------------------- | ----------------------- |
| 1 | `npu_convolution_transpose` | `npu_convolution_transpose` |
| 2 | `npu_conv_transpose2d` | `conv_transpose2d_npu` |
| 3 | `npu_convolution_transpose_backward` | `npu_convolution_transpose_backward` |
| 4 | `npu_conv_transpose2d_backward` | `conv_transpose2d_backward_npu` |
| 5 | `npu_conv_transpose3d_backward` | `conv_transpose3d_backward_npu` |
| 6 | `npu_convolution` | `npu_convolution` |
| 7 | `npu_convolution_backward` | `npu_convolution_backward` |
| 8 | `npu_convolution_double_backward` | `npu_convolution_double_backward` |
| 9 | `npu_conv2d` | `conv2d_npu` |
| 10 | `npu_conv2d.out` | `conv2d_out_npu` |
| 11 | `npu_conv2d_backward` | `conv2d_backward_npu` |
| 12 | `npu_conv3d` | `conv3d_npu` |
| 13 | `npu_conv3d.out` | `conv3d_out_npu` |
| 14 | `npu_conv3d_backward` | `conv3d_backward_npu` |
| 15 | `one_` | `one_npu_` |
| 16 | `npu_sort_v2.out` | `sort_without_indices_out_npu` |
| 17 | `npu_sort_v2` | `sort_without_indices_npu` |
| 18 | `npu_format_cast` | `format_cast_npu` |
| 19 | `npu_format_cast_.acl_format` | `format_cast_npu_` |
| 20 | `npu_format_cast_.src` | `format_cast_npu_` |
| 21 | `npu_transpose_to_contiguous` | `transpose_to_contiguous_npu` |
| 22 | `npu_transpose` | `transpose_npu` |
| 23 | `npu_transpose.out` | `transpose_out_npu` |
| 24 | `npu_broadcast` | `broadcast_npu` |
| 25 | `npu_broadcast.out` | `broadcast_out_npu` |
| 26 | `npu_dtype_cast` | `dtype_cast_npu` |
| 27 | `npu_dtype_cast_.Tensor` | `dtype_cast_npu_` |
| 28 | `npu_roi_alignbk` | `roi_align_backward_npu` |
| 29 | `empty_with_format` | `empty_with_format_npu` |
| 30 | `empty_with_format.names` | `empty_with_format_npu` |
| 31 | `copy_memory_` | `copy_memory_npu_` |
| 32 | `npu_one_hot` | `one_hot_npu` |
| 33 | `npu_stride_add` | `stride_add_npu` |
| 34 | `npu_softmax_cross_entropy_with_logits` | `softmax_cross_entropy_with_logits_npu` |
| 35 | `npu_softmax_cross_entropy_with_logits_backward` | `softmax_cross_entropy_with_logits_backward_npu` |
| 36 | `npu_ps_roi_pooling` | `ps_roi_pooling_npu` |
| 37 | `npu_ps_roi_pooling_backward` | `ps_roi_pooling_backward_npu` |
| 38 | `npu_roi_align` | `roi_align_npu` |
| 39 | `npu_nms_v4` | `nms_v4_npu` |
| 40 | `npu_lstm` | `lstm_npu` |
| 41 | `npu_lstm_backward` | `lstm_backward_npu` |
| 42 | `npu_iou` | `iou_npu` |
| 43 | `npu_ptiou` | `ptiou_npu` |
| 44 | `npu_nms_with_mask` | `nms_with_mask_npu` |
| 45 | `npu_pad` | `pad_npu` |
| 46 | `npu_bounding_box_encode` | `bounding_box_encode_npu` |
| 47 | `npu_bounding_box_decode` | `bounding_box_decode_npu` |
| 48 | `npu_gru` | `gru_npu` |
| 49 | `npu_gru_backward` | `gru_backward_npu` |
| 50 | `npu_set_.source_Storage_storage_offset_format` | `set_npu_` |
| 51 | `npu_random_choice_with_mask` | `random_choice_with_mask_npu` |
| 52 | `npu_batch_nms` | `batch_nms_npu` |
| 53 | `npu_slice` | `slice_npu` |
| 54 | `npu_slice.out` | `slice_out_npu` |
| 55 | `npu_dropoutV2` | `dropout_v2_npu` |
| 56 | `npu_dropoutV2_backward` | `dropout_v2_backward_npu` |
| 57 | `_npu_dropout` | `_dropout_npu` |
| 58 | `_npu_dropout_inplace` | `_dropout_npu_inplace` |
| 59 | `npu_dropout_backward` | `dropout_backward_npu` |
| 60 | `npu_indexing` | `indexing_npu` |
| 61 | `npu_indexing.out` | `indexing_out_npu` |
| 62 | `npu_ifmr` | `ifmr_npu` |
| 63 | `npu_max.dim` | `max_v1_npu` |
| 64 | `npu_max.names_dim` | `max_v1_npu` |
| 65 | `npu_scatter` | `scatter_npu` |
| 66 | `npu_max_backward` | `max_backward_npu` |
| 67 | `npu_apply_adam` | `apply_adam_npu` |
| 68 | `npu_layer_norm_eval` | `layer_norm_eval_npu` |
| 69 | `npu_alloc_float_status` | `alloc_float_status_npu` |
| 70 | `npu_get_float_status` | `get_float_status_npu` |
| 71 | `npu_clear_float_status` | `clear_float_status_npu` |
| 72 | `npu_confusion_transpose` | `confusion_transpose_npu` |
| 73 | `npu_confusion_transpose_backward` | `confusion_transpose_backward_npu` |
| 74 | `npu_bmmV2` | `bmm_v2_npu` |
| 75 | `fast_gelu` | `fast_gelu_npu` |
| 76 | `fast_gelu_backward` | `fast_gelu_backward_npu` |
| 77 | `npu_sub_sample` | `sub_sample_npu` |
| 78 | `npu_deformable_conv2d` | `deformable_conv2d_npu` |
| 79 | `npu_deformable_conv2dbk` | `deformable_conv2d_backward_npu` |
| 80 | `npu_mish` | `mish_npu` |
| 81 | `npu_anchor_response_flags` | `anchor_response_flags_npu` |
| 82 | `npu_yolo_boxes_encode` | `yolo_boxes_encode_npu` |
| 83 | `npu_grid_assign_positive` | `grid_assign_positive_npu` |
| 84 | `npu_mish_backward` | `mish_backward_npu` |
| 85 | `npu_normalize_batch` | `normalize_batch_npu` |
| 86 | `npu_masked_fill_range` | `masked_fill_range_npu` |
| 87 | `npu_linear` | `linear_npu` |
| 88 | `npu_linear_backward` | `linear_backward_npu` |
| 89 | `npu_bert_apply_adam` | `bert_apply_adam_npu` |
| 90 | `npu_giou` | `giou_npu` |
| 91 | `npu_giou_backward` | `giou_backward_npu` |
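Unlike the native operators in the previous table, these customized operators have no upstream PyTorch counterpart, so they are called explicitly rather than reached through dispatch. A hedged sketch, again assuming the NPU-adapted PyTorch from this guide; operator availability and exact signatures vary across adapter versions, so treat the argument names below as illustrative and check the adapter's operator reference:

```python
import torch

torch.npu.set_device(0)

x = torch.randn(4, 8).npu()

# fast_gelu (No. 75 above) dispatches to the Ascend kernel fast_gelu_npu.
y = torch.fast_gelu(x)

# Customized operators with richer signatures follow the same pattern;
# the depth argument for npu_one_hot (No. 32) is illustrative here.
labels = torch.tensor([0, 2, 1]).npu()
one_hot = torch.npu_one_hot(labels, depth=3)
print(y.cpu(), one_hot.cpu())
```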