# Kimi-Audio
[Kimi-Audio-7B-Instruct 🤗](https://huggingface.co/moonshotai/Kimi-Audio-7B-Instruct) | [📑 Paper](assets/kimia_report.pdf)
We present Kimi-Audio, an open-source audio foundation model excelling in **audio understanding, generation, and conversation**. This repository contains the official implementation, models, and evaluation toolkit for Kimi-Audio.

## 🔥🔥🔥 News!!

* April 25, 2025: 👋 We release the inference code and model weights of [Kimi-Audio-7B-Instruct](https://huggingface.co/moonshotai/Kimi-Audio-7B-Instruct).
* April 25, 2025: 👋 We release the audio evaluation toolkit [Kimi-Audio-Evalkit](https://github.com/MoonshotAI/Kimi-Audio-Evalkit). You can easily reproduce **our results and baselines** with this toolkit!
* April 25, 2025: 👋 We release the technical report of [Kimi-Audio](assets/kimia_report.pdf).

## Table of Contents

- [Introduction](#introduction)
- [Architecture Overview](#architecture-overview)
- [Quick Start](#quick-start)
- [Evaluation](#evaluation)
  - [Speech Recognition](#automatic-speech-recognition-asr)
  - [Audio Understanding](#audio-understanding)
  - [Audio-to-Text Chat](#audio-to-text-chat)
  - [Speech Conversation](#speech-conversation)
- [Evaluation Toolkit](#evaluation-toolkit)
- [License](#license)
- [Acknowledgements](#acknowledgements)
- [Citation](#citation)
- [Contact Us](#contact-us)

## Introduction

Kimi-Audio is designed as a universal audio foundation model capable of handling a wide variety of audio processing tasks within a single unified framework. Key features include:

* **Universal Capabilities:** Handles diverse tasks like speech recognition (ASR), audio question answering (AQA), audio captioning (AAC), speech emotion recognition (SER), sound event/scene classification (SEC/ASC), and end-to-end speech conversation.
* **State-of-the-Art Performance:** Achieves SOTA results on numerous audio benchmarks (see [Evaluation](#evaluation) and the [Technical Report](assets/kimia_report.pdf)).
* **Large-Scale Pre-training:** Pre-trained on over 13 million hours of diverse audio data (speech, music, sounds) and text data, enabling robust audio reasoning and language understanding.
* **Novel Architecture:** Employs a hybrid audio input (continuous acoustic + discrete semantic tokens) and an LLM core with parallel heads for text and audio token generation.
* **Efficient Inference:** Features a chunk-wise streaming detokenizer based on flow matching for low-latency audio generation.
* **Open-Source:** We release the code, model checkpoints, and a comprehensive evaluation toolkit to foster community research and development.
## Architecture Overview

Kimi-Audio consists of three main components:

1. **Audio Tokenizer:** Converts input audio into:
   * Discrete semantic tokens (12.5Hz) using vector quantization.
   * Continuous acoustic features derived from a Whisper encoder (downsampled to 12.5Hz).
2. **Audio LLM:** A transformer-based model (initialized from a pre-trained text LLM like Qwen 2.5 7B) with shared layers processing multimodal inputs, followed by parallel heads for autoregressively generating text tokens and discrete audio semantic tokens.
3. **Audio Detokenizer:** Converts the predicted discrete semantic audio tokens back into high-fidelity waveforms using a flow-matching model and a vocoder (BigVGAN), supporting chunk-wise streaming with a look-ahead mechanism for low latency.

## Quick Start

This example demonstrates basic usage for generating text from audio (ASR) and generating both text and speech in a conversational turn.

```python
import soundfile as sf
from kimia_infer.api.kimia import KimiAudio

# --- 1. Load Model ---
model_path = "moonshotai/Kimi-Audio-7B-Instruct"
model = KimiAudio(model_path=model_path, load_detokenizer=True)

# --- 2. Define Sampling Parameters ---
sampling_params = {
    "audio_temperature": 0.8,
    "audio_top_k": 10,
    "text_temperature": 0.0,
    "text_top_k": 5,
    "audio_repetition_penalty": 1.0,
    "audio_repetition_window_size": 64,
    "text_repetition_penalty": 1.0,
    "text_repetition_window_size": 16,
}

# --- 3. Example 1: Audio-to-Text (ASR) ---
messages_asr = [
    # You can provide context or instructions as text
    {"role": "user", "message_type": "text", "content": "Please transcribe the following audio:"},
    # Provide the audio file path
    {"role": "user", "message_type": "audio", "content": "test_audios/asr_example.wav"}
]

# Generate only text output
_, text_output = model.generate(messages_asr, **sampling_params, output_type="text")
print(">>> ASR Output Text: ", text_output)
# Expected output: "这并不是告别，这是一个篇章的结束，也是新篇章的开始。"
# ("This is not a farewell; it is the end of one chapter and the beginning of a new one.")

# --- 4. Example 2: Audio-to-Audio/Text Conversation ---
messages_conversation = [
    # Start conversation with an audio query
    {"role": "user", "message_type": "audio", "content": "test_audios/qa_example.wav"}
]

# Generate both audio and text output
wav_output, text_output = model.generate(messages_conversation, **sampling_params, output_type="both")

# Save the generated audio
output_audio_path = "output_audio.wav"
sf.write(output_audio_path, wav_output.detach().cpu().view(-1).numpy(), 24000)  # Assuming 24kHz output
print(f">>> Conversational Output Audio saved to: {output_audio_path}")
print(">>> Conversational Output Text: ", text_output)
# Expected output: "A."

print("Kimi-Audio inference examples complete.")
```
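The same `generate` interface also covers understanding-style tasks (audio question answering, captioning, emotion recognition) simply by changing the text instruction. Below is a minimal sketch under the same setup as above; the instruction wording is an illustrative example rather than a prescribed prompt format, and it reuses the sample audio path from Example 2.

```python
from kimia_infer.api.kimia import KimiAudio

# Same setup as the Quick Start above (skip if `model` and `sampling_params` already exist).
model = KimiAudio(model_path="moonshotai/Kimi-Audio-7B-Instruct", load_detokenizer=True)
sampling_params = {
    "audio_temperature": 0.8, "audio_top_k": 10,
    "text_temperature": 0.0, "text_top_k": 5,
    "audio_repetition_penalty": 1.0, "audio_repetition_window_size": 64,
    "text_repetition_penalty": 1.0, "text_repetition_window_size": 16,
}

# --- Example 3 (illustrative): Audio Understanding ---
# The instruction text below is arbitrary; any natural-language question about the audio works the same way.
messages_understanding = [
    {"role": "user", "message_type": "text",
     "content": "Listen to the following clip and describe the sound events and the speaker's emotion:"},
    {"role": "user", "message_type": "audio", "content": "test_audios/qa_example.wav"},
]

# Text-only output is sufficient for understanding-style tasks.
_, description = model.generate(messages_understanding, **sampling_params, output_type="text")
print(">>> Audio understanding output:", description)
```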
## Evaluation

Kimi-Audio achieves state-of-the-art (SOTA) performance across a wide range of audio benchmarks. Detailed results on each benchmark are listed below; you can easily reproduce **our results and baselines** with our [Kimi-Audio-Evalkit](https://github.com/MoonshotAI/Kimi-Audio-Evalkit) (also see [**Evaluation Toolkit**](#evaluation-toolkit)).

### Automatic Speech Recognition (ASR)
| Datasets | Model | Performance (WER↓) |
|---|---|---|
| LibriSpeech (test-clean / test-other) | Qwen2-Audio-base | 1.74 / 4.04 |
| | Baichuan-base | 3.02 / 6.04 |
| | Step-Audio-chat | 3.19 / 10.67 |
| | Qwen2.5-Omni | 2.37 / 4.21 |
| | Kimi-Audio | 1.28 / 2.42 |
| Fleurs (zh / en) | Qwen2-Audio-base | 3.63 / 5.20 |
| | Baichuan-base | 4.15 / 8.07 |
| | Step-Audio-chat | 4.26 / 8.56 |
| | Qwen2.5-Omni | 2.92 / 4.17 |
| | Kimi-Audio | 2.69 / 4.44 |
| AISHELL-1 | Qwen2-Audio-base | 1.52 |
| | Baichuan-base | 1.93 |
| | Step-Audio-chat | 2.14 |
| | Qwen2.5-Omni | 1.13 |
| | Kimi-Audio | 0.60 |
| AISHELL-2 ios | Qwen2-Audio-base | 3.08 |
| | Baichuan-base | 3.87 |
| | Step-Audio-chat | 3.89 |
| | Qwen2.5-Omni | 2.56 |
| | Kimi-Audio | 2.56 |
| WenetSpeech (test-meeting / test-net) | Qwen2-Audio-base | 8.40 / 7.64 |
| | Baichuan-base | 13.28 / 10.13 |
| | Step-Audio-chat | 10.83 / 9.47 |
| | Qwen2.5-Omni | 7.71 / 6.04 |
| | Kimi-Audio | 6.28 / 5.37 |
| Kimi-ASR Internal Testset (subset1 / subset2) | Qwen2-Audio-base | 2.31 / 3.24 |
| | Baichuan-base | 3.41 / 5.60 |
| | Step-Audio-chat | 2.82 / 4.74 |
| | Qwen2.5-Omni | 1.53 / 2.68 |
| | Kimi-Audio | 1.42 / 2.44 |
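As a reminder of how to read the table: WER (word error rate) is the edit distance between the model transcript and the reference, normalized by the reference length, so lower is better. Below is a toy illustration using the third-party `jiwer` package; this is only a sketch, not how the numbers above were produced, which come from the Kimi-Audio-Evalkit with its own text normalization.

```python
# Toy WER illustration with the third-party `jiwer` package (pip install jiwer).
# Reference and hypothesis strings here are made-up examples.
import jiwer

reference = "this is the end of one chapter and the beginning of a new one"
hypothesis = "this is the end of one chapter and beginning of a new one"

# jiwer.wer returns (substitutions + deletions + insertions) / reference word count.
print(f"WER: {jiwer.wer(reference, hypothesis):.3f}")  # lower is better
```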
### Audio Understanding

| Datasets | Model | Performance↑ |
|---|---|---|
| MMAU (music / sound / speech) | Qwen2-Audio-base | 58.98 / 69.07 / 52.55 |
| | Baichuan-chat | 49.10 / 59.46 / 42.47 |
| | GLM-4-Voice | 38.92 / 43.54 / 32.43 |
| | Step-Audio-chat | 49.40 / 53.75 / 47.75 |
| | Qwen2.5-Omni | 62.16 / 67.57 / 53.92 |
| | Kimi-Audio | 61.68 / 73.27 / 60.66 |
| ClothoAQA (test / dev) | Qwen2-Audio-base | 71.73 / 72.63 |
| | Baichuan-chat | 48.02 / 48.16 |
| | Step-Audio-chat | 45.84 / 44.98 |
| | Qwen2.5-Omni | 72.86 / 73.12 |
| | Kimi-Audio | 71.24 / 73.18 |
| VocalSound | Qwen2-Audio-base | 93.82 |
| | Baichuan-base | 58.17 |
| | Step-Audio-chat | 28.58 |
| | Qwen2.5-Omni | 93.73 |
| | Kimi-Audio | 94.85 |
| Nonspeech7k | Qwen2-Audio-base | 87.17 |
| | Baichuan-chat | 59.03 |
| | Step-Audio-chat | 21.38 |
| | Qwen2.5-Omni | 69.89 |
| | Kimi-Audio | 93.93 |
| MELD | Qwen2-Audio-base | 51.23 |
| | Baichuan-chat | 23.59 |
| | Step-Audio-chat | 33.54 |
| | Qwen2.5-Omni | 49.83 |
| | Kimi-Audio | 59.13 |
| TUT2017 | Qwen2-Audio-base | 33.83 |
| | Baichuan-base | 27.9 |
| | Step-Audio-chat | 7.41 |
| | Qwen2.5-Omni | 43.27 |
| | Kimi-Audio | 65.25 |
| CochlScene (test / dev) | Qwen2-Audio-base | 52.69 / 50.96 |
| | Baichuan-base | 34.93 / 34.56 |
| | Step-Audio-chat | 10.06 / 10.42 |
| | Qwen2.5-Omni | 63.82 / 63.82 |
| | Kimi-Audio | 79.84 / 80.99 |
### Audio-to-Text Chat

| Datasets | Model | Performance↑ |
|---|---|---|
| OpenAudioBench (AlpacaEval / Llama Questions / Reasoning QA / TriviaQA / Web Questions) | Qwen2-Audio-chat | 57.19 / 69.67 / 42.77 / 40.30 / 45.20 |
| | Baichuan-chat | 59.65 / 74.33 / 46.73 / 55.40 / 58.70 |
| | GLM-4-Voice | 57.89 / 76.00 / 47.43 / 51.80 / 55.40 |
| | StepAudio-chat | 56.53 / 72.33 / 60.00 / 56.80 / 73.00 |
| | Qwen2.5-Omni | 72.76 / 75.33 / 63.76 / 57.06 / 62.80 |
| | Kimi-Audio | 75.73 / 79.33 / 58.02 / 62.10 / 70.20 |
| VoiceBench (AlpacaEval / CommonEval / SD-QA / MMSU) | Qwen2-Audio-chat | 3.69 / 3.40 / 35.35 / 35.43 |
| | Baichuan-chat | 4.00 / 3.39 / 49.64 / 48.80 |
| | GLM-4-Voice | 4.06 / 3.48 / 43.31 / 40.11 |
| | StepAudio-chat | 3.99 / 2.99 / 46.84 / 28.72 |
| | Qwen2.5-Omni | 4.33 / 3.84 / 57.41 / 56.38 |
| | Kimi-Audio | 4.46 / 3.97 / 63.12 / 62.17 |
| VoiceBench (OpenBookQA / IFEval / AdvBench / Avg) | Qwen2-Audio-chat | 49.01 / 22.57 / 98.85 / 54.72 |
| | Baichuan-chat | 63.30 / 41.32 / 86.73 / 62.51 |
| | GLM-4-Voice | 52.97 / 24.91 / 88.08 / 57.17 |
| | StepAudio-chat | 31.87 / 29.19 / 65.77 / 48.86 |
| | Qwen2.5-Omni | 79.12 / 53.88 / 99.62 / 72.83 |
| | Kimi-Audio | 83.52 / 61.10 / 100.00 / 76.93 |
### Speech Conversation

| Model | Speed Control | Accent Control | Emotion Control | Empathy | Style Control | Avg |
|---|---|---|---|---|---|---|
| GPT-4o | 4.21 | 3.65 | 4.05 | 3.87 | 4.54 | 4.06 |
| Step-Audio-chat | 3.25 | 2.87 | 3.33 | 3.05 | 4.14 | 3.33 |
| GLM-4-Voice | 3.83 | 3.51 | 3.77 | 3.07 | 4.04 | 3.65 |
| GPT-4o-mini | 3.15 | 2.71 | 4.24 | 3.16 | 4.01 | 3.45 |
| Kimi-Audio | 4.30 | 3.45 | 4.27 | 3.39 | 4.09 | 3.90 |
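In these conversation tests, abilities such as speed, emotion, and style control are driven by natural-language instructions in the conversation itself. The sketch below shows one way to prompt such control with the Quick Start API; the instruction wording and audio path are illustrative placeholders, and the 24 kHz output rate is assumed as in the Quick Start example.

```python
import soundfile as sf
from kimia_infer.api.kimia import KimiAudio

# Same setup as the Quick Start example above.
model = KimiAudio(model_path="moonshotai/Kimi-Audio-7B-Instruct", load_detokenizer=True)
sampling_params = {
    "audio_temperature": 0.8, "audio_top_k": 10,
    "text_temperature": 0.0, "text_top_k": 5,
    "audio_repetition_penalty": 1.0, "audio_repetition_window_size": 64,
    "text_repetition_penalty": 1.0, "text_repetition_window_size": 16,
}

# Style control expressed as a plain-text instruction alongside the audio query.
# The instruction wording and audio path are illustrative, not prescribed prompts.
messages_style = [
    {"role": "user", "message_type": "text",
     "content": "Please answer the following question slowly and in a cheerful tone."},
    {"role": "user", "message_type": "audio", "content": "test_audios/qa_example.wav"},
]

# Generate both the spoken reply and its text transcript.
wav_output, text_output = model.generate(messages_style, **sampling_params, output_type="both")
sf.write("styled_reply.wav", wav_output.detach().cpu().view(-1).numpy(), 24000)  # assuming 24 kHz output
print(">>> Styled reply text:", text_output)
```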