# espnet
**Repository Path**: xbnpyk/espnet
## Basic Information
- **Project Name**: espnet
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2020-10-22
- **Last Updated**: 2020-12-19
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README

# ESPnet: end-to-end speech processing toolkit
|system/pytorch ver.|1.0.1|1.1.0|1.2.0|1.3.1|1.4.0|1.5.1|1.6.0|
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|ubuntu18/python3.8/pip|||||||[CI](https://github.com/espnet/espnet/actions)|
|ubuntu18/python3.7/pip|[CI](https://github.com/espnet/espnet/actions)|[CI](https://github.com/espnet/espnet/actions)|[CI](https://github.com/espnet/espnet/actions)|[CI](https://github.com/espnet/espnet/actions)|[CI](https://github.com/espnet/espnet/actions)|[CI](https://github.com/espnet/espnet/actions)|[CI](https://github.com/espnet/espnet/actions)|
|ubuntu18/python3.6/conda|||||||[CI](https://circleci.com/gh/espnet/espnet)|
|ubuntu20/python3.6/conda|||||||[CI](https://circleci.com/gh/espnet/espnet)|
|debian9/python3.6/conda|||||||[CI](https://circleci.com/gh/espnet/espnet)|
|centos7/python3.6/conda|||||||[CI](https://circleci.com/gh/espnet/espnet)|
|[docs/coverage] python3.8|||||||[CI](https://travis-ci.org/espnet/espnet)|
[PyPI version](https://badge.fury.io/py/espnet)
[Supported Python versions](https://pypi.org/project/espnet/)
[Downloads](https://pepy.tech/project/espnet)
[GitHub](https://github.com/espnet/espnet)
[Codecov](https://codecov.io/gh/espnet/espnet)
[Code style: black](https://github.com/psf/black)
[Mergify](https://mergify.io)
[Gitter](https://gitter.im/espnet-en/community?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge)
[**Docs**](https://espnet.github.io/espnet/)
| [**Example**](https://github.com/espnet/espnet/tree/master/egs)
| [**Example (ESPnet2)**](https://github.com/espnet/espnet/tree/master/egs2)
| [**Docker**](https://github.com/espnet/espnet/tree/master/docker)
| [**Notebook**](https://github.com/espnet/notebook)
| [**Tutorial (2019)**](https://github.com/espnet/interspeech2019-tutorial)
ESPnet is an end-to-end speech processing toolkit, mainly focusing on end-to-end speech recognition and end-to-end text-to-speech.
ESPnet uses [chainer](https://chainer.org/) and [pytorch](http://pytorch.org/) as its main deep learning engines,
and also follows [Kaldi](http://kaldi-asr.org/)-style data processing, feature extraction/format, and recipes to provide a complete setup for speech recognition and other speech processing experiments.
## Key Features
### Kaldi style complete recipe
- Support for a number of `ASR` recipes (WSJ, Switchboard, CHiME-4/5, Librispeech, TED, CSJ, AMI, HKUST, Voxforge, REVERB, etc.)
- Support for a number of `TTS` recipes, structured in a manner similar to the ASR recipes (LJSpeech, LibriTTS, M-AILABS, etc.)
- Support for a number of `ST` recipes (Fisher-CallHome Spanish, Libri-trans, IWSLT'18, How2, Must-C, Mboshi-French, etc.)
- Support for a number of `MT` recipes (IWSLT'16, the above ST recipes, etc.)
- Support for a speech separation and recognition recipe (WSJ-2mix)
- Support for a voice conversion recipe (VCC2020 baseline) (new!)
### ASR: Automatic Speech Recognition
- **State-of-the-art performance** in several ASR benchmarks (comparable/superior to hybrid DNN/HMM and CTC)
- **Hybrid CTC/attention** based end-to-end ASR
  - Fast/accurate training with CTC/attention multitask training
  - CTC/attention joint decoding to boost monotonic alignment decoding
  - Encoder: VGG-like CNN + BiRNN (LSTM/GRU), sub-sampling BiRNN (LSTM/GRU) or Transformer
  - Attention: Dot product, location-aware attention, variants of multihead
  - Incorporate RNNLM/LSTMLM/TransformerLM/N-gram trained only with text data
  - Batch GPU decoding
- **Transducer** based end-to-end ASR
  - Available: RNN-based encoder/decoder and Transformer-based encoder/decoder w/ customizable architecture.
  - Also support: mixed RNN/Transformer architecture, attention mechanism (RNN decoder), VGG2L (RNN/Transformer encoder), Conformer (Transformer encoder), TDNN (Transformer encoder), Causal Conv1d (Transformer decoder) and various decoding algorithms.
  > Please refer to the [tutorial page](https://espnet.github.io/espnet/tutorial.html#transducer) for complete documentation.
- CTC segmentation
### TTS: Text-to-speech
- Tacotron2
- Transformer-TTS
- FastSpeech
- FastSpeech2 (in ESPnet2)
- Conformer-based FastSpeech & FastSpeech2 (in ESPnet2)
- Multi-speaker model with pretrained speaker embedding
- Multi-speaker model with GST (in ESPnet2)
- Phoneme-based training (En, Jp, and Zh)
- Integration with neural vocoders (WaveNet, ParallelWaveGAN, and MelGAN)
You can try the demos online now!
- Real-time TTS demo with ESPnet2: [Open in Colab](https://colab.research.google.com/github/espnet/notebook/blob/master/espnet2_tts_realtime_demo.ipynb)
- Real-time TTS demo with ESPnet1: [Open in Colab](https://colab.research.google.com/github/espnet/notebook/blob/master/tts_realtime_demo.ipynb)
To train the neural vocoder, please check the following repositories:
- [kan-bayashi/ParallelWaveGAN](https://github.com/kan-bayashi/ParallelWaveGAN)
- [r9y9/wavenet_vocoder](https://github.com/r9y9/wavenet_vocoder)
> **NOTE**:
> - We are moving to ESPnet2-based development for TTS.
> - If you are a beginner, we recommend using [ESPnet2-TTS](https://github.com/espnet/espnet/tree/master/egs2/TEMPLATE/tts1).
### ST: Speech Translation & MT: Machine Translation
- **State-of-the-art performance** in several ST benchmarks (comparable/superior to cascaded ASR and MT)
- Transformer based end-to-end ST (new!)
- Transformer based end-to-end MT (new!)
### VC: Voice conversion
- Transformer and Tacotron2 based parallel VC using mel-spectrogram features (new!)
- End-to-end VC based on cascaded ASR+TTS (Baseline system for Voice Conversion Challenge 2020!)
### DNN Framework
- Flexible network architecture thanks to chainer and pytorch
- Flexible front-end processing thanks to [kaldiio](https://github.com/nttcslab-sp/kaldiio) and HDF5 support
- Tensorboard based monitoring
### ESPnet2
See [ESPnet2](https://espnet.github.io/espnet/espnet2_tutorial.html).
- Independent of Kaldi/Chainer
- On-the-fly feature extraction and text processing during training
- Multi-GPU training on single/multiple nodes (distributed training)
- A template recipe that can be applied to all corpora
- Able to train on corpora of any size without CPU memory errors
- (Under development) [ESPnet Model Zoo](https://github.com/espnet/espnet_model_zoo)
## Installation
- If you intend to do full experiments including DNN training, then see [Installation](https://espnet.github.io/espnet/installation.html).
- If you only need the Python module:
```sh
pip install espnet
# To install the latest development version
# pip install git+https://github.com/espnet/espnet
```
You also need to install the following packages.
```sh
pip install torch
pip install chainer==6.0.0 cupy==6.0.0 # [Option] If you'll use ESPnet1
pip install torchaudio # [Option] If you'll use enhancement task
pip install torch_optimizer # [Option] If you'll use additional optimizers in ESPnet2
```
Some tasks require additional packages beyond those listed above. If you encounter an ImportError, please install them at that point.
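As a quick sanity check after installation, you can confirm that the core modules import correctly (a minimal sketch; the printed versions depend on what you installed):
```sh
# verify that espnet and torch are importable and print their versions
python3 -c "import espnet; print(espnet.__version__)"
python3 -c "import torch; print(torch.__version__)"
```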
## Usage
See [Usage](https://espnet.github.io/espnet/tutorial.html).
## Docker Container
Go to [docker/](docker/) and follow the [instructions](https://espnet.github.io/espnet/docker.html).
## Contribution
Thank you for taking the time to contribute to ESPnet! Any contributions to ESPnet are welcome, and feel free to post questions or requests in the [issues](https://github.com/espnet/espnet/issues).
If this is your first contribution to ESPnet, please follow the [contribution guide](CONTRIBUTING.md).
## Results and demo
You can find useful tutorials and demos in the [Interspeech 2019 Tutorial](https://github.com/espnet/interspeech2019-tutorial).
### ASR results
We list the character error rate (CER) and word error rate (WER) of major ASR tasks.
| Task | CER (%) | WER (%) | Pretrained model |
| ----------- | :----: | :----: | :----: |
| Aishell dev | 6.0 | N/A | [link](https://github.com/espnet/espnet/blob/master/egs/aishell/asr1/RESULTS.md#transformer-result-default-transformer-with-initial-learning-rate--10-and-epochs--50) |
| Aishell test | 6.6 | N/A | same as above |
| Common Voice dev | 1.7 | 2.2 | [link](https://github.com/espnet/espnet/blob/master/egs/commonvoice/asr1/RESULTS.md#first-results-default-pytorch-transformer-setting-with-bpe-100-epochs-single-gpu) |
| Common Voice test | 1.8 | 2.3 | same as above |
| CSJ eval1 | 5.7 | N/A | [link](https://github.com/espnet/espnet/blob/master/egs/csj/asr1/RESULTS.md#pytorch-backend-transformer-without-any-hyperparameter-tuning) |
| CSJ eval2 | 3.8 | N/A | same as above |
| CSJ eval3 | 4.2 | N/A | same as above |
| HKUST dev | 23.5 | N/A | [link](https://github.com/espnet/espnet/blob/master/egs/hkust/asr1/RESULTS.md#transformer-only-20-epochs) |
| Librispeech dev_clean | N/A | 2.0 | [link](https://github.com/espnet/espnet/blob/master/egs/librispeech/asr1/RESULTS.md#pytorch-large-transformer-with-specaug-4-gpus--transformer-lm-4-gpus) |
| Librispeech dev_other | N/A | 4.9 | same as above |
| Librispeech test_clean | N/A | 2.2 | same as above |
| Librispeech test_other | N/A | 5.3 | same as above |
| TEDLIUM2 dev | N/A | 9.3 | [link](https://github.com/espnet/espnet/blob/master/egs/tedlium2/asr1/RESULTS.md#transformer-large-model--specaug--large-lm) |
| TEDLIUM2 test | N/A | 8.1 | same as above |
| TEDLIUM3 dev | N/A | 9.7 | [link](https://github.com/espnet/espnet/blob/master/egs/tedlium3/asr1/RESULTS.md#transformer-elayers12-dlayers6-units2048-8-gpus-specaug--large-lm) |
| TEDLIUM3 test | N/A | 8.0 | same as above |
| WSJ dev93 | 3.2 | 7.0 | N/A |
| WSJ eval92 | 2.1 | 4.7 | N/A |
Note that the performance of the CSJ, HKUST, and Librispeech tasks was significantly improved by using a wide network (#units = 1024) and, where necessary, large subword units, as reported by [RWTH](https://arxiv.org/pdf/1805.03294.pdf).
If you want to check the results of the other recipes, please check `egs/*/asr1/RESULTS.md`.
### ASR demo
You can recognize speech in a WAV file using pretrained models.
Go to a recipe directory and run `utils/recog_wav.sh` as follows:
```sh
# go to recipe directory and source path of espnet tools
cd egs/tedlium2/asr1 && . ./path.sh
# let's recognize speech!
recog_wav.sh --models tedlium2.transformer.v1 example.wav
```
where `example.wav` is a WAV file to be recognized.
The sampling rate must be consistent with that of data used in training.
The pretrained models available in the demo script are listed below.
| Model | Notes |
| :------ | :------ |
| [tedlium2.rnn.v1](https://drive.google.com/open?id=1UqIY6WJMZ4sxNxSugUqp3mrGb3j6h7xe) | Streaming decoding based on CTC-based VAD |
| [tedlium2.rnn.v2](https://drive.google.com/open?id=1cac5Uc09lJrCYfWkLQsF8eapQcxZnYdf) | Streaming decoding based on CTC-based VAD (batch decoding) |
| [tedlium2.transformer.v1](https://drive.google.com/open?id=1cVeSOYY1twOfL9Gns7Z3ZDnkrJqNwPow) | Joint-CTC attention Transformer trained on Tedlium 2 |
| [tedlium3.transformer.v1](https://drive.google.com/open?id=1zcPglHAKILwVgfACoMWWERiyIquzSYuU) | Joint-CTC attention Transformer trained on Tedlium 3 |
| [librispeech.transformer.v1](https://drive.google.com/open?id=1BtQvAnsFvVi-dp_qsaFP7n4A_5cwnlR6) | Joint-CTC attention Transformer trained on Librispeech |
| [commonvoice.transformer.v1](https://drive.google.com/open?id=1tWccl6aYU67kbtkm8jv5H6xayqg1rzjh) | Joint-CTC attention Transformer trained on CommonVoice |
| [csj.transformer.v1](https://drive.google.com/open?id=120nUQcSsKeY5dpyMWw_kI33ooMRGT2uF) | Joint-CTC attention Transformer trained on CSJ |
| [csj.rnn.v1](https://drive.google.com/open?id=1ALvD4nHan9VDJlYJwNurVr7H7OV0j2X9) | Joint-CTC attention VGGBLSTM trained on CSJ |
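If your recording's sampling rate differs from that of the model's training data, you can resample it with `sox` before recognition. The sketch below assumes a 16 kHz model and uses `librispeech.transformer.v1` from the table above purely as an illustration:
```sh
# resample to 16 kHz mono (an assumption; match the sampling rate of the model's training data)
sox input.wav -r 16000 -c 1 example.wav
# recognize with another pretrained model from the table above
recog_wav.sh --models librispeech.transformer.v1 example.wav
```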
### ST results
We list the 4-gram BLEU scores of major ST tasks.
#### End-to-end system
| Task | BLEU | Pretrained model |
| ---- | :----: | :----: |
| Fisher-CallHome Spanish fisher_test (Es->En) | 51.03 | [link](https://github.com/espnet/espnet/blob/master/egs/fisher_callhome_spanish/st1/RESULTS.md#train_spen_lcrm_pytorch_train_pytorch_transformer_bpe_short_long_bpe1000_specaug_asrtrans_mttrans) |
| Fisher-CallHome Spanish callhome_evltest (Es->En) | 20.44 | [link](https://github.com/espnet/espnet/blob/master/egs/fisher_callhome_spanish/st1/RESULTS.md#train_spen_lcrm_pytorch_train_pytorch_transformer_bpe_short_long_bpe1000_specaug_asrtrans_mttrans) |
| Libri-trans test (En->Fr) | 16.70 | [link](https://github.com/espnet/espnet/blob/master/egs/libri_trans/st1/RESULTS.md#train_spfr_lc_pytorch_train_pytorch_transformer_bpe_short_long_bpe1000_specaug_asrtrans_mttrans-1) |
| How2 dev5 (En->Pt) | 45.68 | [link](https://github.com/espnet/espnet/blob/master/egs/how2/st1/RESULTS.md#trainpt_tc_pytorch_train_pytorch_transformer_short_long_bpe8000_specaug_asrtrans_mttrans-1) |
| Must-C tst-COMMON (En->De) | 22.91 | [link](https://github.com/espnet/espnet/blob/master/egs/must_c/st1/RESULTS.md#train_spen-dede_tc_pytorch_train_pytorch_transformer_short_long_bpe8000_specaug_asrtrans_mttrans) |
| Mboshi-French dev (Fr->Mboshi) | 6.18 | N/A |
#### Cascaded system
| Task | BLEU | Pretrained model |
| ---- | :----: | :----: |
| Fisher-CallHome Spanish fisher_test (Es->En) | 42.16 | N/A |
| Fisher-CallHome Spanish callhome_evltest (Es->En) | 19.82 | N/A |
| Libri-trans test (En->Fr) | 16.96 | N/A |
| How2 dev5 (En->Pt) | 44.90 | N/A |
| Must-C tst-COMMON (En->De) | 23.65 | N/A |
If you want to check the results of the other recipes, please check `egs/*/st1/RESULTS.md`.
### ST demo
(**New!**) We made a new real-time E2E-ST + TTS demonstration in Google Colab.
Please access the notebook from the following button and enjoy the real-time speech-to-speech translation!
[Open in Colab](https://colab.research.google.com/github/espnet/notebook/blob/master/st_demo.ipynb)
---
You can translate speech in a WAV file using pretrained models.
Go to a recipe directory and run `utils/translate_wav.sh` as follows:
```sh
# go to recipe directory and source path of espnet tools
cd egs/fisher_callhome_spanish/st1 && . ./path.sh
# download example wav file
wget -O - https://github.com/espnet/espnet/files/4100928/test.wav.tar.gz | tar zxvf -
# let's translate speech!
translate_wav.sh --models fisher_callhome_spanish.transformer.v1.es-en test.wav
```
where `test.wav` is a WAV file to be translated.
The sampling rate must be consistent with that of data used in training.
The pretrained models available in the demo script are listed below.
| Model | Notes |
| :------ | :------ |
| [fisher_callhome_spanish.transformer.v1](https://drive.google.com/open?id=1hawp5ZLw4_SIHIT3edglxbKIIkPVe8n3) | Transformer-ST trained on Fisher-CallHome Spanish Es->En |
### MT results
| Task | BLEU | Pretrained model |
| ---- | :----: | :----: |
| Fisher-CallHome Spanish fisher_test (Es->En) | 61.45 | [link](https://github.com/espnet/espnet/blob/master/egs/fisher_callhome_spanish/mt1/RESULTS.md#trainen_lcrm_lcrm_pytorch_train_pytorch_transformer_bpe_bpe1000) |
| Fisher-CallHome Spanish callhome_evltest (Es->En) | 29.86 | [link](https://github.com/espnet/espnet/blob/master/egs/fisher_callhome_spanish/mt1/RESULTS.md#trainen_lcrm_lcrm_pytorch_train_pytorch_transformer_bpe_bpe1000) |
| Libri-trans test (En->Fr) | 18.09 | [link](https://github.com/espnet/espnet/blob/master/egs/libri_trans/mt1/RESULTS.md#trainfr_lcrm_tc_pytorch_train_pytorch_transformer_bpe1000) |
| How2 dev5 (En->Pt) | 58.61 | [link](https://github.com/espnet/espnet/blob/master/egs/how2/mt1/RESULTS.md#trainpt_tc_tc_pytorch_train_pytorch_transformer_bpe8000) |
| Must-C tst-COMMON (En->De) | 27.63 | [link](https://github.com/espnet/espnet/blob/master/egs/must_c/mt1/RESULTS.md#summary-4-gram-bleu) |
| IWSLT'14 test2014 (En->De) | 24.70 | [link](https://github.com/espnet/espnet/blob/master/egs/iwslt16/mt1/RESULTS.md#result) |
| IWSLT'14 test2014 (De->En) | 29.22 | [link](https://github.com/espnet/espnet/blob/master/egs/iwslt16/mt1/RESULTS.md#result) |
| IWSLT'16 test2014 (En->De) | 24.05 | [link](https://github.com/espnet/espnet/blob/master/egs/iwslt16/mt1/RESULTS.md#result) |
| IWSLT'16 test2014 (De->En) | 29.13 | [link](https://github.com/espnet/espnet/blob/master/egs/iwslt16/mt1/RESULTS.md#result) |
### TTS results
#### ESPnet2
You can listen to the generated samples at the following URL.
- [ESPnet2 TTS generated samples](https://drive.google.com/drive/folders/1H3fnlBbWMEkQUfrHqosKN_ZX_WjO29ma?usp=sharing)
> Note that in the generation we use Griffin-Lim (`wav/`) and [Parallel WaveGAN](https://github.com/kan-bayashi/ParallelWaveGAN) (`wav_pwg/`).
You can download pretrained models via `espnet_model_zoo`.
- [ESPnet model zoo](https://github.com/espnet/espnet_model_zoo)
- [Pretrained model list](https://github.com/espnet/espnet_model_zoo/blob/master/espnet_model_zoo/table.csv)
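As a reference, here is a minimal sketch of downloading a pretrained model with `espnet_model_zoo`, assuming the `ModelDownloader` interface documented in that repository; the model name below is an illustrative placeholder (see `table.csv` for the actual names):
```sh
pip install espnet_model_zoo
python3 - << 'EOF'
from espnet_model_zoo.downloader import ModelDownloader

d = ModelDownloader()  # models are cached under the package directory by default
# "kan-bayashi/ljspeech_tacotron2" is an illustrative placeholder; use a name from table.csv
paths = d.download_and_unpack("kan-bayashi/ljspeech_tacotron2")
print(paths)  # dict of local paths to the unpacked config and model files
EOF
```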
You can download pretrained vocoders via `kan-bayashi/ParallelWaveGAN`.
- [kan-bayashi/ParallelWaveGAN](https://github.com/kan-bayashi/ParallelWaveGAN)
- [Pretrained vocoder list](https://github.com/kan-bayashi/ParallelWaveGAN#results)
#### ESPnet1
> NOTE: We are moving to ESPnet2-based development for TTS. Please check the latest results in the ESPnet2 results above.
You can listen to our samples on the demo page [espnet-tts-sample](https://espnet.github.io/espnet-tts-sample/).
Here we list some notable ones:
- [Single English speaker Tacotron2](https://drive.google.com/open?id=18JgsOCWiP_JkhONasTplnHS7yaF_konr)
- [Single Japanese speaker Tacotron2](https://drive.google.com/open?id=1fEgS4-K4dtgVxwI4Pr7uOA1h4PE-zN7f)
- [Single other language speaker Tacotron2](https://drive.google.com/open?id=1q_66kyxVZGU99g8Xb5a0Q8yZ1YVm2tN0)
- [Multi English speaker Tacotron2](https://drive.google.com/open?id=18S_B8Ogogij34rIfJOeNF8D--uG7amz2)
- [Single English speaker Transformer](https://drive.google.com/open?id=14EboYVsMVcAq__dFP1p6lyoZtdobIL1X)
- [Single English speaker FastSpeech](https://drive.google.com/open?id=1PSxs1VauIndwi8d5hJmZlppGRVu2zuy5)
- [Multi English speaker Transformer](https://drive.google.com/open?id=1_vrdqjM43DdN1Qz7HJkvMQ6lCMmWLeGp)
- [Single Italian speaker FastSpeech](https://drive.google.com/open?id=13I5V2w7deYFX4DlVk1-0JfaXmUR2rNOv)
- [Single Mandarin speaker Transformer](https://drive.google.com/open?id=1mEnZfBKqA4eT6Bn0eRZuP6lNzL-IL3VD)
- [Single Mandarin speaker FastSpeech](https://drive.google.com/open?id=1Ol_048Tuy6BgvYm1RpjhOX4HfhUeBqdK)
- [Multi Japanese speaker Transformer](https://drive.google.com/open?id=1fFMQDF6NV5Ysz48QLFYE8fEvbAxCsMBw)
- [Single English speaker models with Parallel WaveGAN](https://drive.google.com/open?id=1HvB0_LDf1PVinJdehiuCt5gWmXGguqtx)
- [Single English speaker knowledge distillation-based FastSpeech](https://drive.google.com/open?id=1wG-Y0itVYalxuLAHdkAHO7w1CWFfRPF4)
You can download all of the pretrained models and generated samples:
- [All of the pretrained E2E-TTS models](https://drive.google.com/open?id=1k9RRyc06Zl0mM2A7mi-hxNiNMFb_YzTF)
- [All of the generated samples](https://drive.google.com/open?id=1bQGuqH92xuxOX__reWLP4-cif0cbpMLX)
Note that in the generated samples we use the following vocoders: Griffin-Lim (**GL**), WaveNet vocoder (**WaveNet**), Parallel WaveGAN (**ParallelWaveGAN**), and MelGAN (**MelGAN**).
The neural vocoders are based on the following repositories.
- [kan-bayashi/ParallelWaveGAN](https://github.com/kan-bayashi/ParallelWaveGAN): Parallel WaveGAN / MelGAN / Multi-band MelGAN
- [r9y9/wavenet_vocoder](https://github.com/r9y9/wavenet_vocoder): 16-bit mixture-of-logistics WaveNet vocoder
- [kan-bayashi/PytorchWaveNetVocoder](https://github.com/kan-bayashi/PytorchWaveNetVocoder): 8-bit softmax WaveNet vocoder with noise shaping
If you want to build your own neural vocoder, please check the above repositories.
[kan-bayashi/ParallelWaveGAN](https://github.com/kan-bayashi/ParallelWaveGAN) provides [a manual](https://github.com/kan-bayashi/ParallelWaveGAN#decoding-with-espnet-tts-models-features) on how to decode ESPnet-TTS models' features with neural vocoders. Please check it.
Here we list all of the pretrained neural vocoders. Please download them and enjoy the generation of high-quality speech!
| Model link | Lang | Fs [Hz] | Mel range [Hz] | FFT / Shift / Win [pt] | Model type |
| :------ | :---: | :----: | :--------: | :---------------: | :------ |
| [ljspeech.wavenet.softmax.ns.v1](https://drive.google.com/open?id=1eA1VcRS9jzFa-DovyTgJLQ_jmwOLIi8L) | EN | 22.05k | None | 1024 / 256 / None | [Softmax WaveNet](https://github.com/kan-bayashi/PytorchWaveNetVocoder) |
| [ljspeech.wavenet.mol.v1](https://drive.google.com/open?id=1sY7gEUg39QaO1szuN62-Llst9TrFno2t) | EN | 22.05k | None | 1024 / 256 / None | [MoL WaveNet](https://github.com/r9y9/wavenet_vocoder) |
| [ljspeech.parallel_wavegan.v1](https://drive.google.com/open?id=1tv9GKyRT4CDsvUWKwH3s_OfXkiTi0gw7) | EN | 22.05k | None | 1024 / 256 / None | [Parallel WaveGAN](https://github.com/kan-bayashi/ParallelWaveGAN) |
| [ljspeech.wavenet.mol.v2](https://drive.google.com/open?id=1es2HuKUeKVtEdq6YDtAsLNpqCy4fhIXr) | EN | 22.05k | 80-7600 | 1024 / 256 / None | [MoL WaveNet](https://github.com/r9y9/wavenet_vocoder) |
| [ljspeech.parallel_wavegan.v2](https://drive.google.com/open?id=1Grn7X9wD35UcDJ5F7chwdTqTa4U7DeVB) | EN | 22.05k | 80-7600 | 1024 / 256 / None | [Parallel WaveGAN](https://github.com/kan-bayashi/ParallelWaveGAN) |
| [ljspeech.melgan.v1](https://drive.google.com/open?id=1ipPWYl8FBNRlBFaKj1-i23eQpW_W_YcR) | EN | 22.05k | 80-7600 | 1024 / 256 / None | [MelGAN](https://github.com/kan-bayashi/ParallelWaveGAN) |
| [ljspeech.melgan.v3](https://drive.google.com/open?id=1_a8faVA5OGCzIcJNw4blQYjfG4oA9VEt) | EN | 22.05k | 80-7600 | 1024 / 256 / None | [MelGAN](https://github.com/kan-bayashi/ParallelWaveGAN) |
| [libritts.wavenet.mol.v1](https://drive.google.com/open?id=1jHUUmQFjWiQGyDd7ZeiCThSjjpbF_B4h) | EN | 24k | None | 1024 / 256 / None | [MoL WaveNet](https://github.com/r9y9/wavenet_vocoder) |
| [jsut.wavenet.mol.v1](https://drive.google.com/open?id=187xvyNbmJVZ0EZ1XHCdyjZHTXK9EcfkK) | JP | 24k | 80-7600 | 2048 / 300 / 1200 | [MoL WaveNet](https://github.com/r9y9/wavenet_vocoder) |
| [jsut.parallel_wavegan.v1](https://drive.google.com/open?id=1OwrUQzAmvjj1x9cDhnZPp6dqtsEqGEJM) | JP | 24k | 80-7600 | 2048 / 300 / 1200 | [Parallel WaveGAN](https://github.com/kan-bayashi/ParallelWaveGAN) |
| [csmsc.wavenet.mol.v1](https://drive.google.com/open?id=1PsjFRV5eUP0HHwBaRYya9smKy5ghXKzj) | ZH | 24k | 80-7600 | 2048 / 300 / 1200 | [MoL WaveNet](https://github.com/r9y9/wavenet_vocoder) |
| [csmsc.parallel_wavegan.v1](https://drive.google.com/open?id=10M6H88jEUGbRWBmU1Ff2VaTmOAeL8CEy) | ZH | 24k | 80-7600 | 2048 / 300 / 1200 | [Parallel WaveGAN](https://github.com/kan-bayashi/ParallelWaveGAN) |
If you want to use the above pretrained vocoders, please make sure the feature settings exactly match theirs.
### TTS demo
#### ESPnet2
You can try the real-time demo in Google Colab.
Please access the notebook from the following button and enjoy the real-time synthesis!
- Real-time TTS demo with ESPnet2: [Open in Colab](https://colab.research.google.com/github/espnet/notebook/blob/master/espnet2_tts_realtime_demo.ipynb)
English, Japanese, and Mandarin models are available in the demo.
#### ESPnet1
> NOTE: We are moving to ESPnet2-based development for TTS. Please check the latest demo in the ESPnet2 demo above.
You can try the real-time demo in Google Colab.
Please access the notebook from the following button and enjoy the real-time synthesis.
- Real-time TTS demo with ESPnet1: [Open in Colab](https://colab.research.google.com/github/espnet/notebook/blob/master/tts_realtime_demo.ipynb)
We also provide a shell script to perform synthesis.
Go to a recipe directory and run `utils/synth_wav.sh` as follows:
```sh
# go to recipe directory and source path of espnet tools
cd egs/ljspeech/tts1 && . ./path.sh
# we use upper-case char sequence for the default model.
echo "THIS IS A DEMONSTRATION OF TEXT TO SPEECH." > example.txt
# let's synthesize speech!
synth_wav.sh example.txt
# also you can use multiple sentences
echo "THIS IS A DEMONSTRATION OF TEXT TO SPEECH." > example_multi.txt
echo "TEXT TO SPEECH IS A TECHQNIQUE TO CONVERT TEXT INTO SPEECH." >> example_multi.txt
synth_wav.sh example_multi.txt
```
You can change the pretrained model as follows:
```sh
synth_wav.sh --models ljspeech.fastspeech.v1 example.txt
```
Waveform synthesis is performed with the Griffin-Lim algorithm and neural vocoders (WaveNet and ParallelWaveGAN).
You can change the pretrained vocoder model as follows:
```sh
synth_wav.sh --vocoder_models ljspeech.wavenet.mol.v1 example.txt
```
The WaveNet vocoder provides very high-quality speech, but it takes time to generate.
See more details and the available models via `--help`.
```sh
synth_wav.sh --help
```
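The options above can also be combined, e.g. selecting both the TTS model and the neural vocoder in a single call (a sketch using names that appear in this README; make sure the feature settings of the model and the vocoder match):
```sh
# choose both the TTS model and the neural vocoder explicitly
synth_wav.sh --models ljspeech.fastspeech.v1 --vocoder_models ljspeech.parallel_wavegan.v1 example.txt
```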
### VC results
- Transformer and Tacotron2 based VC
You can listen to some samples on the [demo webpage](https://unilight.github.io/Publication-Demos/publications/transformer-vc/).
- Cascade ASR+TTS as one of the baseline systems of VCC2020
The [Voice Conversion Challenge 2020](http://www.vc-challenge.org/) (VCC2020) adopts ESPnet to build an end-to-end based baseline system.
In VCC2020, the objective is intra-/cross-lingual nonparallel VC.
You can download converted samples of the cascade ASR+TTS baseline system [here](https://drive.google.com/drive/folders/1oeZo83GrOgtqxGwF7KagzIrfjr8X59Ue?usp=sharing).
### CTC Segmentation demo
[CTC segmentation](https://arxiv.org/abs/2007.09127) determines utterance segments within audio files.
Aligned utterance segments constitute the labels of speech datasets.
As a demo, we align the start and end of utterances within the audio file `ctc_align_test.wav`, using the example script `utils/ctc_align_wav.sh`.
For preparation, set up a data directory:
```sh
cd egs/tedlium2/align1/
# data directory
align_dir=data/demo
mkdir -p ${align_dir}
# wav file
base=ctc_align_test
wav=../../../test_utils/${base}.wav
# recipe files
echo "batchsize: 0" > ${align_dir}/align.yaml
cat << EOF > ${align_dir}/utt_text
${base} THE SALE OF THE HOTELS
${base} IS PART OF HOLIDAY'S STRATEGY
${base} TO SELL OFF ASSETS
${base} AND CONCENTRATE
${base} ON PROPERTY MANAGEMENT
EOF
```
Here, `utt_text` is the file containing the list of utterances.
Choose a pre-trained ASR model that includes a CTC layer to find utterance segments:
```sh
# pre-trained ASR model
model=wsj.transformer_small.v1
mkdir ./conf && cp ../../wsj/asr1/conf/no_preprocess.yaml ./conf
../../../utils/asr_align_wav.sh \
--models ${model} \
--align_dir ${align_dir} \
--align_config ${align_dir}/align.yaml \
${wav} ${align_dir}/utt_text
```
Segments are written to `aligned_segments` as a list of file/utterance name, utterance start and end times in seconds, and a confidence score.
The confidence score is a probability in log space that indicates how well the utterance was aligned. If needed, remove bad utterances:
```sh
min_confidence_score=-5
awk -v ms=${min_confidence_score} '{ if ($5 > ms) {print} }' ${align_dir}/aligned_segments
```
The demo script `utils/ctc_align_wav.sh` uses an already pretrained ASR model (see the list above for more models).
It is recommended to use models with RNN-based encoders (such as BLSTMP) for aligning large audio files,
rather than Transformer models, which have high memory consumption on longer audio data.
The sample rate of the audio must be consistent with that of the data used in training; adjust with `sox` if needed.
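For example, a long recording could be converted before alignment as follows (16 kHz mono is an assumption here; use the sampling rate of the ASR model's training data):
```sh
# resample to the ASR model's sampling rate (assumed 16 kHz) and downmix to mono
sox long_recording.wav -r 16000 -c 1 long_recording_16k.wav
```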
A full example recipe is in `egs/tedlium2/align1/`.
## References
[1] Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno, Nelson Enrique Yalta Soplin, Jahn Heymann, Matthew Wiesner, Nanxin Chen, Adithya Renduchintala, and Tsubasa Ochiai, "ESPnet: End-to-End Speech Processing Toolkit," *Proc. Interspeech'18*, pp. 2207-2211 (2018)
[2] Suyoun Kim, Takaaki Hori, and Shinji Watanabe, "Joint CTC-attention based end-to-end speech recognition using multi-task learning," *Proc. ICASSP'17*, pp. 4835--4839 (2017)
[3] Shinji Watanabe, Takaaki Hori, Suyoun Kim, John R. Hershey and Tomoki Hayashi, "Hybrid CTC/Attention Architecture for End-to-End Speech Recognition," *IEEE Journal of Selected Topics in Signal Processing*, vol. 11, no. 8, pp. 1240-1253, Dec. 2017
## Citations
```
@inproceedings{watanabe2018espnet,
author={Shinji Watanabe and Takaaki Hori and Shigeki Karita and Tomoki Hayashi and Jiro Nishitoba and Yuya Unno and Nelson {Enrique Yalta Soplin} and Jahn Heymann and Matthew Wiesner and Nanxin Chen and Adithya Renduchintala and Tsubasa Ochiai},
title={{ESPnet}: End-to-End Speech Processing Toolkit},
year={2018},
booktitle={Proceedings of Interspeech},
pages={2207--2211},
doi={10.21437/Interspeech.2018-1456},
url={http://dx.doi.org/10.21437/Interspeech.2018-1456}
}
@inproceedings{hayashi2020espnet,
title={{Espnet-TTS}: Unified, reproducible, and integratable open source end-to-end text-to-speech toolkit},
author={Hayashi, Tomoki and Yamamoto, Ryuichi and Inoue, Katsuki and Yoshimura, Takenori and Watanabe, Shinji and Toda, Tomoki and Takeda, Kazuya and Zhang, Yu and Tan, Xu},
booktitle={Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
pages={7654--7658},
year={2020},
organization={IEEE}
}
@inproceedings{inaguma-etal-2020-espnet,
title = "{ESP}net-{ST}: All-in-One Speech Translation Toolkit",
author = "Inaguma, Hirofumi and
Kiyono, Shun and
Duh, Kevin and
Karita, Shigeki and
Yalta, Nelson and
Hayashi, Tomoki and
Watanabe, Shinji",
booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations",
month = jul,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.acl-demos.34",
pages = "302--311",
}
```