# Muyan-TTS

**Repository Path**: AI-LargeModel/Muyan-TTS

## Basic Information

- **Project Name**: Muyan-TTS
- **Description**: Muyan-TTS is a trainable TTS model that enables zero-shot TTS synthesis and high-quality voice generation. Muyan-TTS supports speaker adaptation with dozens of minutes of target speech, making it highly customizable for individual voices.
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: main
- **Homepage**: https://github.com/MYZY-AI/Muyan-TTS
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-05-14
- **Last Updated**: 2025-05-14

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README
Muyan-TTS is a trainable TTS model designed for podcast applications within a $50,000 budget. It is pre-trained on over 100,000 hours of podcast audio data, enabling zero-shot TTS synthesis with high-quality voice generation. Furthermore, Muyan-TTS supports speaker adaptation with dozens of minutes of target speech, making it highly customizable for individual voices.

## 🔥🔥🔥 News!!

* April 29, 2025: 👋 We release the zero-shot TTS model weights of [Muyan-TTS](https://huggingface.co/MYZY-AI/Muyan-TTS).
* April 29, 2025: 👋 We release the few-shot TTS model weights of [Muyan-TTS-SFT](https://huggingface.co/MYZY-AI/Muyan-TTS-SFT), which is trained from [Muyan-TTS](https://huggingface.co/MYZY-AI/Muyan-TTS) on dozens of minutes of a single speaker's speech.
* April 29, 2025: 👋 We release the training code from the base model to the SFT model for speaker adaptation.
* April 29, 2025: 👋 We release the [technical report](https://arxiv.org/abs/2504.19146) of Muyan-TTS.

## Summary

### Framework

Framework of Muyan-TTS. The left part is an LLM that models the parallel corpus of text (in blue) and audio (in green) tokens. The right part is a SoVITS model that decodes the generated audio tokens, as well as phonemes and speaker embeddings, into the audio waveform.

### Data

Data processing pipeline. The final dataset comprises over 100,000 hours of high-quality speech and corresponding transcriptions, forming a robust parallel corpus suitable for TTS training in long-form audio scenarios such as podcasts.

### Training costs

| Training Cost | Data Processing | Pre-training of LLM | Training of Decoder | Total |
|-------|-------|-------|-------|-------|
| in GPU Hours | 60K (A10) | 19.2K (A100) | 1.34K (A100) | - |
| in USD | $30K | $19.2K | $1.34K | $50.54K |

Training costs of Muyan-TTS, assuming rental prices of $0.5 and $1 per GPU hour for the A10 and A100, respectively.

### Synthesis speed

We denote ```r``` as the inference time needed to generate one second of audio and compare the synthesis speed with several open-source TTS models.

| Model | CosyVoice2 | Step-Audio | Spark-TTS | FireRedTTS | GPT-SoVITS v3 | Muyan-TTS |
|-------|-------|-------|-------|-------|-------|-------|
| r ↓ | 2.19 | 0.90 | 1.31 | 0.61 | 0.48 | 0.33 |

All inference was run on a single NVIDIA A100 (40 GB, PCIe) GPU, and the baseline models were evaluated using their official inference implementations.

*Note*: Muyan-TTS only supports English input since the training data is heavily skewed toward English.

## Demo

https://github.com/user-attachments/assets/a20d407c-15f8-40da-92b7-65e92e4f0c06

The three audio clips in the "Base model" column and the first clip in the "SFT model" column are synthesized by the open-sourced Muyan-TTS and Muyan-TTS-SFT, respectively. The last two clips in the "SFT model" column are generated by SFT models trained separately on the base model, which are not publicly released.

## Install

### Clone & Install

```sh
git clone https://github.com/MYZY-AI/Muyan-TTS.git
cd Muyan-TTS

conda create -n muyan-tts python=3.10 -y
conda activate muyan-tts
make build
```

You need to install ```FFmpeg```. If you're using Ubuntu, you can install it with the following command:

```sh
sudo apt update
sudo apt install ffmpeg
```
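As a quick sanity check (optional, not a required step), you can confirm that FFmpeg is visible on your ```PATH``` before moving on:

```sh
ffmpeg -version
```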
### Model Download

| Models | Links |
|-------|-------|
| Muyan-TTS | [huggingface](https://huggingface.co/MYZY-AI/Muyan-TTS) \| [modelscope](https://modelscope.cn/models/MYZY-AI/Muyan-TTS) \| [wisemodel](https://wisemodel.cn/models/MYZY-AI/Muyan-TTS) |
| Muyan-TTS-SFT | [huggingface](https://huggingface.co/MYZY-AI/Muyan-TTS-SFT) \| [modelscope](https://modelscope.cn/models/MYZY-AI/Muyan-TTS-SFT) \| [wisemodel](https://wisemodel.cn/models/MYZY-AI/Muyan-TTS-SFT) |

Additionally, you need to download the weights of [chinese-hubert-base](https://huggingface.co/TencentGameMate/chinese-hubert-base).

Place all the downloaded models in the ```pretrained_models``` directory. Your directory structure should look similar to the following:

```
pretrained_models
├── chinese-hubert-base
├── Muyan-TTS
└── Muyan-TTS-SFT
```

## Quickstart

```sh
python tts.py
```

This will synthesize speech through inference. The core code is as follows:

```py
async def main(model_type, model_path):
    tts = Inference(model_type, model_path, enable_vllm_acc=False)
    wavs = await tts.generate(
        ref_wav_path="assets/Claire.wav",
        prompt_text="Although the campaign was not a complete success, it did provide Napoleon with valuable experience and prestige.",
        text="Welcome to the captivating world of podcasts, let's embark on this exciting journey together."
    )
    output_path = "logs/tts.wav"
    with open(output_path, "wb") as f:
        f.write(next(wavs))
    print(f"Speech generated in {output_path}")
```

You need to specify the prompt speech, including the ```ref_wav_path``` and its ```prompt_text```, and the ```text``` to be synthesized. The synthesized speech is saved to ```logs/tts.wav``` by default.

Additionally, you need to specify ```model_type``` as either ```base``` or ```sft```, with the default being ```base```. When you set ```model_type``` to ```base```, you can change the prompt speech to an arbitrary speaker for zero-shot TTS synthesis. When you set ```model_type``` to ```sft```, you need to keep the prompt speech unchanged because the ```sft``` model is trained on Claire's voice.

## API Usage

```sh
python api.py
```

Using the API mode automatically enables vLLM acceleration, and the above command will start a service on the default port ```8020```. Additionally, LLM logs will be saved in ```logs/llm.log```.

Similarly, you need to specify ```model_type``` as either ```base``` or ```sft```, with the default being ```base```. Note that the ```model_path``` should be consistent with your specified ```model_type```.

You can send a request to the API using the example below:

```py
import time
import requests

TTS_PORT=8020
payload = {
    "ref_wav_path": "assets/Claire.wav",
    "prompt_text": "Although the campaign was not a complete success, it did provide Napoleon with valuable experience and prestige.",
    "text": "Welcome to the captivating world of podcasts, let's embark on this exciting journey together.",
    "temperature": 0.6,
    "speed": 1.0,
}
start = time.time()
url = f"http://localhost:{TTS_PORT}/get_tts"
response = requests.post(url, json=payload)
audio_file_path = "logs/tts.wav"
with open(audio_file_path, "wb") as f:
    f.write(response.content)
print(time.time() - start)
```

By default, the synthesized speech will be saved at ```logs/tts.wav```.
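The example above writes whatever the server returns straight to disk. For slightly more robust client code, you may want to check the HTTP status before saving the audio. The snippet below is a minimal sketch using the same ```/get_tts``` endpoint and payload fields as above; the timeout value is an illustrative assumption, not part of the official API.

```py
import requests

TTS_PORT = 8020
url = f"http://localhost:{TTS_PORT}/get_tts"

payload = {
    "ref_wav_path": "assets/Claire.wav",
    "prompt_text": "Although the campaign was not a complete success, it did provide Napoleon with valuable experience and prestige.",
    "text": "Welcome to the captivating world of podcasts, let's embark on this exciting journey together.",
}

try:
    # Generous timeout: generation time grows with the length of `text`.
    response = requests.post(url, json=payload, timeout=120)
    response.raise_for_status()  # raise if the server returned an HTTP error code
except requests.RequestException as exc:
    print(f"TTS request failed: {exc}")
else:
    with open("logs/tts.wav", "wb") as f:
        f.write(response.content)
```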
## Training

We use ```LibriSpeech``` as an example. You can use your own dataset instead, but you need to organize the data into the format shown in ```data_process/examples```. If you haven't downloaded ```LibriSpeech``` yet, you can download the dev-clean set using:

```sh
wget --no-check-certificate https://www.openslr.org/resources/12/dev-clean.tar.gz
```

After uncompressing the data, specify the ```librispeech_dir``` in ```prepare_sft_dataset.py``` to be the parent folder of your ```LibriSpeech``` path. Then run:

```sh
./train.sh
```

This will automatically process the data and generate ```data/tts_sft_data.json```. Note that we use a specific speaker ID, "3752", from dev-clean of LibriSpeech (which can be specified in ```data_process/text_format_conversion.py```) as an example because its data size is relatively large. If you organize your own dataset for training, please prepare at least a dozen minutes of speech from the target speaker. If an error occurs during the process, resolve the error, delete the existing contents of the ```data``` folder, and then rerun ```train.sh```.

After generating ```data/tts_sft_data.json```, ```train.sh``` will automatically copy it to ```llama-factory/data``` and add the following field to ```dataset_info.json```:

```json
"tts_sft_data": {
  "file_name": "tts_sft_data.json"
}
```

Finally, it will automatically execute the ```llamafactory-cli train``` command to start training. You can adjust the training settings in ```training/sft.yaml```. By default, the trained weights will be saved to ```pretrained_models/Muyan-TTS-new-SFT```.

After training, you need to copy the ```sovits.pth``` of the base/SFT model to your trained model path before inference:

```sh
cp pretrained_models/Muyan-TTS/sovits.pth pretrained_models/Muyan-TTS-new-SFT
```

You can directly deploy your trained model using the API tool above. During inference, you need to specify ```model_type``` to be ```sft``` and replace the ```ref_wav_path``` and ```prompt_text``` with a sample of the speaker's voice you trained on, as shown in the sketch at the end of this README.

## Acknowledgment

The model is trained based on [Llama-3.2-3B](https://huggingface.co/meta-llama/Llama-3.2-3B).

We borrow a lot of code from [GPT-SoVITS](https://github.com/RVC-Boss/GPT-SoVITS).

We borrow a lot of code from [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory).

## Citation

```
@article{li2025muyan,
  title={Muyan-TTS: A Trainable Text-to-Speech Model Optimized for Podcast Scenarios with a $50K Budget},
  author={Li, Xin and Jia, Kaikai and Sun, Hao and Dai, Jun and Jiang, Ziyang},
  journal={arXiv preprint arXiv:2504.19146},
  year={2025}
}
```
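For reference, the request below adapts the API example from the API Usage section to a model trained with the steps in the Training section. It is only a sketch: the reference clip path and transcript are hypothetical placeholders that you should replace with an actual clip of your target speaker and its exact transcript, and it assumes the server was started with ```python api.py``` using ```model_type``` ```sft``` and ```pretrained_models/Muyan-TTS-new-SFT``` as the model path.

```py
import requests

TTS_PORT = 8020  # default port used by api.py

payload = {
    # Hypothetical placeholders: point to a real recording of the speaker
    # you trained on, with its exact transcript as the prompt text.
    "ref_wav_path": "assets/my_speaker_sample.wav",
    "prompt_text": "Transcript of the reference clip spoken by the target speaker.",
    "text": "This sentence will be synthesized in the adapted speaker's voice.",
    "temperature": 0.6,
    "speed": 1.0,
}

response = requests.post(f"http://localhost:{TTS_PORT}/get_tts", json=payload)
with open("logs/tts_sft.wav", "wb") as f:
    f.write(response.content)
```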