![PyPI](https://img.shields.io/pypi/v/warp-rnnt.svg)
[![Downloads](https://pepy.tech/badge/warp-rnnt)](https://pepy.tech/project/warp-rnnt)

# CUDA-Warp RNN-Transducer

A GPU implementation of RNN Transducer (Graves [2012](https://arxiv.org/abs/1211.3711), [2013](https://arxiv.org/abs/1303.5778)). This code is ported from the [reference implementation](https://github.com/awni/transducer/blob/master/ref_transduce.py) (by Awni Hannun) and fully utilizes the CUDA warp mechanism.

The main bottleneck in the loss is the forward/backward pass, which is based on a dynamic programming algorithm. In particular, a nested loop populates a lattice of shape (T, U), and each value in this lattice depends on the two previous cells, one from each dimension (e.g. the [forward pass](https://github.com/awni/transducer/blob/6b37e98c21551c7ed2181e2f526053bae8ae94d2/ref_transduce.py#L56); a Python sketch of the recurrence follows the figure below).

CUDA executes threads in groups of 32 parallel threads called [warps](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#simt-architecture). Full efficiency is realized when all 32 threads of a warp agree on their execution path, and this is exactly what is exploited to optimize the RNN Transducer. The lattice is split into warps along the T dimension. Within each warp, values are exchanged between threads using fast intra-warp operations. As soon as the current warp fills its last value, the next two warps, (t+32, u) and (t, u+1), start running. A schematic of the forward pass is shown in the figure below, where T is the number of frames, U is the number of labels, and W is the warp size. A similar procedure for the backward pass runs in parallel.

![](lattice.gif)
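To make the recurrence concrete, here is a minimal NumPy sketch of the forward pass, paraphrased from the reference implementation linked above (names such as `forward_pass` and `blank` are illustrative):

```python
import numpy as np

def forward_pass(log_probs, labels, blank=0):
    """Fill the (T, U) lattice of forward variables (alphas).

    log_probs: (T, U, V) array of log-softmax outputs.
    labels: target label sequence of length U - 1.
    """
    T, U, _ = log_probs.shape
    alphas = np.zeros((T, U))

    # First column: only blank emissions advance along t.
    for t in range(1, T):
        alphas[t, 0] = alphas[t - 1, 0] + log_probs[t - 1, 0, blank]
    # First row: only label emissions advance along u.
    for u in range(1, U):
        alphas[0, u] = alphas[0, u - 1] + log_probs[0, u - 1, labels[u - 1]]

    # Each inner cell combines a blank step from (t - 1, u)
    # and a label step from (t, u - 1) in log space.
    for t in range(1, T):
        for u in range(1, U):
            no_emit = alphas[t - 1, u] + log_probs[t - 1, u, blank]
            emit = alphas[t, u - 1] + log_probs[t, u - 1, labels[u - 1]]
            alphas[t, u] = np.logaddexp(emit, no_emit)

    # The sequence ends with a final blank emission.
    log_likelihood = alphas[T - 1, U - 1] + log_probs[T - 1, U - 1, blank]
    return alphas, log_likelihood
```

The CUDA kernel evaluates this same recurrence, but assigns 32 consecutive t indices to the threads of one warp, so the serial inner loop is replaced by warp-level exchanges.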
## Performance

[Benchmarked](pytorch_binding/benchmark.py) on a GeForce GTX 1080 Ti GPU, Intel i7-8700 CPU @ 3.20GHz.

|                         | warp_rnnt | [warprnnt_pytorch](https://github.com/HawkAaron/warp-transducer/tree/master/pytorch_binding) | [transducer](https://github.com/awni/transducer) |
| :---------------------- | ------------: | ------------: | ------------: |
| **T=150, U=40, V=28**   |               |               |               |
| N=1                     | 0.07 ms       | 0.68 ms       | 1.28 ms       |
| N=16                    | 0.33 ms       | 1.80 ms       | 6.15 ms       |
| N=32                    | 0.35 ms       | 3.39 ms       | 12.72 ms      |
| N=64                    | 0.56 ms       | 6.11 ms       | 23.73 ms      |
| N=128                   | 0.60 ms       | 9.22 ms       | 47.93 ms      |
| **T=150, U=20, V=5000** |               |               |               |
| N=1                     | 0.46 ms       | 2.14 ms       | 21.18 ms      |
| N=16                    | 1.42 ms       | 21.24 ms      | 240.11 ms     |
| N=32                    | 2.51 ms       | 38.26 ms      | 490.66 ms     |
| N=64                    | out-of-memory | 75.54 ms      | 944.73 ms     |
| N=128                   | out-of-memory | out-of-memory | 1894.93 ms    |
| **T=1500, U=300, V=50** |               |               |               |
| N=1                     | 0.60 ms       | 10.77 ms      | 121.82 ms     |
| N=16                    | 2.25 ms       | 97.69 ms      | 732.50 ms     |
| N=32                    | 3.97 ms       | 184.73 ms     | 1448.54 ms    |
| N=64                    | out-of-memory | out-of-memory | 2767.59 ms    |

## Note

- This implementation assumes that the input is log-softmax normalized (see the usage sketch at the end of this README).
- In addition to the alphas and betas arrays, a counts array of shape (N, U * 2) is allocated and used as a scheduling mechanism.
- [core_gather.cu](core_gather.cu) is a slightly more memory-efficient version that expects log_probs of shape (N, T, U, 2), containing only the blank and label values.
- Do not expect this implementation to greatly reduce the training time of an RNN Transducer model. The main bottleneck is likely to be the trainable joint network, whose output has shape (N, T, U, V).
- There is also a restricted variant, [Recurrent Neural Aligner](https://github.com/1ytic/warp-rna), which assumes that the input sequence is at least as long as the target sequence.

## Install

Currently, there is only a binding for PyTorch 1.0 and higher.

```bash
pip install warp_rnnt
```

## Test

There is a unittest in `pytorch_binding/warp_rnnt`, which includes tests for the arguments and outputs of the loss.

```bash
python -m warp_rnnt.test
```

## Reference

- Awni Hannun [transducer](https://github.com/awni/transducer)
- Mingkun Huang [warp-transducer](https://github.com/HawkAaron/warp-transducer)
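For completeness, here is a minimal usage sketch of the PyTorch binding. It is hedged: the `rnnt_loss` argument names and defaults below are assumptions and should be checked against the package docstring.

```python
import torch
from warp_rnnt import rnnt_loss  # argument names below are assumptions; see the docstring

N, T, U, V = 2, 150, 40, 28  # batch, frames, 1 + target length, vocabulary (0 = blank)

# The loss expects log-softmax inputs of shape (N, T, U, V) on the GPU.
log_probs = torch.randn(N, T, U, V, device="cuda", requires_grad=True).log_softmax(dim=-1)
labels = torch.randint(1, V, (N, U - 1), dtype=torch.int32, device="cuda")
frames_lengths = torch.full((N,), T, dtype=torch.int32, device="cuda")
labels_lengths = torch.full((N,), U - 1, dtype=torch.int32, device="cuda")

loss = rnnt_loss(log_probs, labels, frames_lengths, labels_lengths)
loss.mean().backward()  # .mean() is a no-op if the binding already reduces
```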