# VibeVoice **Repository Path**: mirrors_microsoft/VibeVoice ## Basic Information - **Project Name**: VibeVoice - **Description**: Frontier Open-Source Text-to-Speech - **Primary Language**: Unknown - **License**: MIT - **Default Branch**: main - **Homepage**: None - **GVP Project**: No ## Statistics - **Stars**: 0 - **Forks**: 0 - **Created**: 2025-08-26 - **Last Updated**: 2025-09-06 ## Categories & Tags **Categories**: Uncategorized **Tags**: None ## README

## 🎙️ VibeVoice: A Frontier Long Conversational Text-to-Speech Model [![Project Page](https://img.shields.io/badge/Project-Page-blue?logo=microsoft)](https://microsoft.github.io/VibeVoice) [![Hugging Face](https://img.shields.io/badge/HuggingFace-Collection-orange?logo=huggingface)](https://huggingface.co/collections/microsoft/vibevoice-68a2ef24a875c44be47b034f) [![Technical Report](https://img.shields.io/badge/Technical-Report-red?logo=adobeacrobatreader)](https://arxiv.org/pdf/2508.19205)

**2025-09-05: VibeVoice is an open-source research framework intended to advance collaboration in the speech synthesis community. After release, we discovered instances where the tool was used in ways inconsistent with the stated intent. Since responsible use of AI is one of Microsoft’s guiding principles, we have disabled this repo until we are confident that out-of-scope use is no longer possible.**
VibeVoice is a novel framework designed for generating **expressive**, **long-form**, **multi-speaker** conversational audio, such as podcasts, from text. It addresses significant challenges in traditional Text-to-Speech (TTS) systems, particularly in scalability, speaker consistency, and natural turn-taking. A core innovation of VibeVoice is its use of continuous speech tokenizers (Acoustic and Semantic) operating at an ultra-low frame rate of 7.5 Hz. These tokenizers efficiently preserve audio fidelity while significantly boosting computational efficiency for processing long sequences. VibeVoice employs a [next-token diffusion](https://arxiv.org/abs/2412.08635) framework, leveraging a Large Language Model (LLM) to understand textual context and dialogue flow, and a diffusion head to generate high-fidelity acoustic details. The model can synthesize speech up to **90 minutes** long with up to **4 distinct speakers**, surpassing the typical 1-2 speaker limits of many prior models.

MOS Preference Results VibeVoice Overview

### 🎵 Demo Examples **Video Demo** We produced this video with [Wan2.2](https://github.com/Wan-Video/Wan2.2). We sincerely appreciate the Wan-Video team for their great work. **English**

https://github.com/user-attachments/assets/0967027c-141e-4909-bec8-091558b1b784

**Chinese**

https://github.com/user-attachments/assets/322280b7-3093-4c67-86e3-10be4746c88f

**Cross-Lingual**

https://github.com/user-attachments/assets/838d8ad9-a201-4dde-bb45-8cd3f59ce722

**Spontaneous Singing**

https://github.com/user-attachments/assets/6f27a8a5-0c60-4f57-87f3-7dea2e11c730

**Long Conversation with 4 people**

https://github.com/user-attachments/assets/a357c4b6-9768-495c-a576-1618f6275727

For more examples, see the [Project Page](https://microsoft.github.io/VibeVoice). ## Risks and limitations While efforts have been made to optimize it through various techniques, it may still produce outputs that are unexpected, biased, or inaccurate. VibeVoice inherits any biases, errors, or omissions produced by its base model (specifically, Qwen2.5 1.5b in this release). Potential for Deepfakes and Disinformation: High-quality synthetic speech can be misused to create convincing fake audio content for impersonation, fraud, or spreading disinformation. Users must ensure transcripts are reliable, check content accuracy, and avoid using generated content in misleading ways. Users are expected to use the generated content and to deploy the models in a lawful manner, in full compliance with all applicable laws and regulations in the relevant jurisdictions. It is best practice to disclose the use of AI when sharing AI-generated content. English and Chinese only: Transcripts in languages other than English or Chinese may result in unexpected audio outputs. Non-Speech Audio: The model focuses solely on speech synthesis and does not handle background noise, music, or other sound effects. Overlapping Speech: The current model does not explicitly model or generate overlapping speech segments in conversations. We do not recommend using VibeVoice in commercial or real-world applications without further testing and development. This model is intended for research and development purposes only. Please use responsibly.