# Kimi-Audio

[🤗 Kimi-Audio-7B-Instruct](https://huggingface.co/moonshotai/Kimi-Audio-7B-Instruct) | [📑 Paper](assets/kimia_report.pdf)

We present Kimi-Audio, an open-source audio foundation model excelling in **audio understanding, generation, and conversation**. This repository contains the official implementation, models, and evaluation toolkit for Kimi-Audio.

## 🔥🔥🔥 News!!

* April 25, 2025: 👋 We release the inference code and model weights of [Kimi-Audio-7B-Instruct](https://huggingface.co/moonshotai/Kimi-Audio-7B-Instruct).
* April 25, 2025: 👋 We release the audio evaluation toolkit [Kimi-Audio-Evalkit](https://github.com/MoonshotAI/Kimi-Audio-Evalkit). You can easily reproduce our results and the baselines with this toolkit!
* April 25, 2025: 👋 We release the technical report of [Kimi-Audio](assets/kimia_report.pdf).

## Table of Contents

- [Introduction](#introduction)
- [Architecture Overview](#architecture-overview)
- [Quick Start](#quick-start)
- [Evaluation](#evaluation)
  - [Speech Recognition](#automatic-speech-recognition-asr)
  - [Audio Understanding](#audio-understanding)
  - [Audio-to-Text Chat](#audio-to-text-chat)
  - [Speech Conversation](#speech-conversation)
- [Evaluation Toolkit](#evaluation-toolkit)
- [License](#license)
- [Acknowledgements](#acknowledgements)
- [Citation](#citation)
- [Contact Us](#contact-us)

## Introduction

Kimi-Audio is designed as a universal audio foundation model capable of handling a wide variety of audio processing tasks within a single unified framework. Key features include:

* **Universal Capabilities:** Handles diverse tasks like speech recognition (ASR), audio question answering (AQA), audio captioning (AAC), speech emotion recognition (SER), sound event/scene classification (SEC/ASC), and end-to-end speech conversation.
* **State-of-the-Art Performance:** Achieves SOTA results on numerous audio benchmarks (see [Evaluation](#evaluation) and the [Technical Report](assets/kimia_report.pdf)).
* **Large-Scale Pre-training:** Pre-trained on over 13 million hours of diverse audio data (speech, music, sounds) and text data, enabling robust audio reasoning and language understanding.
* **Novel Architecture:** Employs a hybrid audio input (continuous acoustic features + discrete semantic tokens) and an LLM core with parallel heads for text and audio token generation.
* **Efficient Inference:** Features a chunk-wise streaming detokenizer based on flow matching for low-latency audio generation.
* **Open-Source:** We release the code, model checkpoints, and a comprehensive evaluation toolkit to foster community research and development.

## Architecture Overview

Kimi-Audio consists of three main components:

1. **Audio Tokenizer:** Converts input audio into:
   * Discrete semantic tokens (12.5 Hz) using vector quantization.
   * Continuous acoustic features derived from a Whisper encoder (downsampled to 12.5 Hz).
2. **Audio LLM:** A transformer-based model (initialized from a pre-trained text LLM like Qwen 2.5 7B) with shared layers processing multimodal inputs, followed by parallel heads for autoregressively generating text tokens and discrete audio semantic tokens.
3. **Audio Detokenizer:** Converts the predicted discrete semantic audio tokens back into high-fidelity waveforms using a flow-matching model and a vocoder (BigVGAN), supporting chunk-wise streaming with a look-ahead mechanism for low latency.

## Quick Start

This example demonstrates basic usage for generating text from audio (ASR) and generating both text and speech in a conversational turn.

```python
import soundfile as sf
from kimia_infer.api.kimia import KimiAudio

# --- 1. Load Model ---
model_path = "moonshotai/Kimi-Audio-7B-Instruct"
model = KimiAudio(model_path=model_path, load_detokenizer=True)

# --- 2. Define Sampling Parameters ---
sampling_params = {
    "audio_temperature": 0.8,
    "audio_top_k": 10,
    "text_temperature": 0.0,
    "text_top_k": 5,
    "audio_repetition_penalty": 1.0,
    "audio_repetition_window_size": 64,
    "text_repetition_penalty": 1.0,
    "text_repetition_window_size": 16,
}

# --- 3. Example 1: Audio-to-Text (ASR) ---
messages_asr = [
    # You can provide context or instructions as text
    {"role": "user", "message_type": "text", "content": "Please transcribe the following audio:"},
    # Provide the audio file path
    {"role": "user", "message_type": "audio", "content": "test_audios/asr_example.wav"}
]

# Generate only text output
_, text_output = model.generate(messages_asr, **sampling_params, output_type="text")
print(">>> ASR Output Text: ", text_output)
# Expected output: "这并不是告别,这是一个篇章的结束,也是新篇章的开始。"
# ("This is not a farewell; it is the end of one chapter and the start of a new one.")

# --- 4. Example 2: Audio-to-Audio/Text Conversation ---
messages_conversation = [
    # Start conversation with an audio query
    {"role": "user", "message_type": "audio", "content": "test_audios/qa_example.wav"}
]

# Generate both audio and text output
wav_output, text_output = model.generate(messages_conversation, **sampling_params, output_type="both")

# Save the generated audio
output_audio_path = "output_audio.wav"
sf.write(output_audio_path, wav_output.detach().cpu().view(-1).numpy(), 24000)  # Assuming 24 kHz output
print(f">>> Conversational Output Audio saved to: {output_audio_path}")
print(">>> Conversational Output Text: ", text_output)
# Expected output: "A."

print("Kimi-Audio inference examples complete.")
```
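
If you need to transcribe multiple files, the same `generate` call can simply be reused in a loop. The snippet below is a minimal batch-ASR sketch built only on the API shown above; the `transcribe_dir` helper and the `test_audios` directory layout are illustrative assumptions, not part of the released package.

```python
# Minimal batch-ASR sketch reusing the Quick Start API above.
# `transcribe_dir` and the "test_audios" directory are illustrative
# assumptions, not part of the released package.
from pathlib import Path

from kimia_infer.api.kimia import KimiAudio

model = KimiAudio(model_path="moonshotai/Kimi-Audio-7B-Instruct", load_detokenizer=True)

sampling_params = {
    "audio_temperature": 0.8,
    "audio_top_k": 10,
    "text_temperature": 0.0,
    "text_top_k": 5,
    "audio_repetition_penalty": 1.0,
    "audio_repetition_window_size": 64,
    "text_repetition_penalty": 1.0,
    "text_repetition_window_size": 16,
}

def transcribe_dir(wav_dir):
    """Transcribe every .wav file in `wav_dir`; returns {path: transcript}."""
    transcripts = {}
    for wav_path in sorted(Path(wav_dir).glob("*.wav")):
        messages = [
            {"role": "user", "message_type": "text", "content": "Please transcribe the following audio:"},
            {"role": "user", "message_type": "audio", "content": str(wav_path)},
        ]
        # Text-only decoding, exactly as in the ASR example above.
        _, text_output = model.generate(messages, **sampling_params, output_type="text")
        transcripts[str(wav_path)] = text_output
    return transcripts

if __name__ == "__main__":
    for path, text in transcribe_dir("test_audios").items():
        print(f"{path}: {text}")
```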

## Evaluation

Kimi-Audio achieves state-of-the-art (SOTA) performance across a wide range of audio benchmarks. The tables below report performance on individual benchmarks; you can easily reproduce our results and the baselines with our [Kimi-Audio-Evalkit](https://github.com/MoonshotAI/Kimi-Audio-Evalkit) (also see [**Evaluation Toolkit**](#evaluation-toolkit)).

### Automatic Speech Recognition (ASR)

| Dataset | Model | Performance (WER ↓) |
|---|---|---|
| **LibriSpeech**<br>test-clean \| test-other | Qwen2-Audio-base | 1.74 \| 4.04 |
| | Baichuan-base | 3.02 \| 6.04 |
| | Step-Audio-chat | 3.19 \| 10.67 |
| | Qwen2.5-Omni | 2.37 \| 4.21 |
| | Kimi-Audio | 1.28 \| 2.42 |
| **Fleurs**<br>zh \| en | Qwen2-Audio-base | 3.63 \| 5.20 |
| | Baichuan-base | 4.15 \| 8.07 |
| | Step-Audio-chat | 4.26 \| 8.56 |
| | Qwen2.5-Omni | 2.92 \| 4.17 |
| | Kimi-Audio | 2.69 \| 4.44 |
| **AISHELL-1** | Qwen2-Audio-base | 1.52 |
| | Baichuan-base | 1.93 |
| | Step-Audio-chat | 2.14 |
| | Qwen2.5-Omni | 1.13 |
| | Kimi-Audio | 0.60 |
| **AISHELL-2 ios** | Qwen2-Audio-base | 3.08 |
| | Baichuan-base | 3.87 |
| | Step-Audio-chat | 3.89 |
| | Qwen2.5-Omni | 2.56 |
| | Kimi-Audio | 2.56 |
| **WenetSpeech**<br>test-meeting \| test-net | Qwen2-Audio-base | 8.40 \| 7.64 |
| | Baichuan-base | 13.28 \| 10.13 |
| | Step-Audio-chat | 10.83 \| 9.47 |
| | Qwen2.5-Omni | 7.71 \| 6.04 |
| | Kimi-Audio | 6.28 \| 5.37 |
| **Kimi-ASR Internal Testset**<br>subset1 \| subset2 | Qwen2-Audio-base | 2.31 \| 3.24 |
| | Baichuan-base | 3.41 \| 5.60 |
| | Step-Audio-chat | 2.82 \| 4.74 |
| | Qwen2.5-Omni | 1.53 \| 2.68 |
| | Kimi-Audio | 1.42 \| 2.44 |
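
For reference, WER (word error rate) is the word-level edit distance between the model's hypothesis and the reference transcript (substitutions + insertions + deletions), divided by the number of reference words. The snippet below is a minimal, self-contained illustration of that metric; the scores in the table above were produced with the official [Kimi-Audio-Evalkit](https://github.com/MoonshotAI/Kimi-Audio-Evalkit), not with this code.

```python
# Minimal word error rate (WER) illustration using dynamic programming.
# Sketch for intuition only; the benchmark numbers above come from the
# official Kimi-Audio-Evalkit, not from this function.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,       # deletion
                dp[i][j - 1] + 1,       # insertion
                dp[i - 1][j - 1] + sub, # substitution / match
            )
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# 1 substitution + 1 deletion over 6 reference words -> 33.33
print(f"{wer('the cat sat on the mat', 'the cat sit on mat') * 100:.2f}")
```
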
### Audio Understanding

| Dataset | Model | Performance ↑ |
|---|---|---|
| **MMAU**<br>music \| sound \| speech | Qwen2-Audio-base | 58.98 \| 69.07 \| 52.55 |
| | Baichuan-chat | 49.10 \| 59.46 \| 42.47 |
| | GLM-4-Voice | 38.92 \| 43.54 \| 32.43 |
| | Step-Audio-chat | 49.40 \| 53.75 \| 47.75 |
| | Qwen2.5-Omni | 62.16 \| 67.57 \| 53.92 |
| | Kimi-Audio | 61.68 \| 73.27 \| 60.66 |
| **ClothoAQA**<br>test \| dev | Qwen2-Audio-base | 71.73 \| 72.63 |
| | Baichuan-chat | 48.02 \| 48.16 |
| | Step-Audio-chat | 45.84 \| 44.98 |
| | Qwen2.5-Omni | 72.86 \| 73.12 |
| | Kimi-Audio | 71.24 \| 73.18 |
| **VocalSound** | Qwen2-Audio-base | 93.82 |
| | Baichuan-base | 58.17 |
| | Step-Audio-chat | 28.58 |
| | Qwen2.5-Omni | 93.73 |
| | Kimi-Audio | 94.85 |
| **Nonspeech7k** | Qwen2-Audio-base | 87.17 |
| | Baichuan-chat | 59.03 |
| | Step-Audio-chat | 21.38 |
| | Qwen2.5-Omni | 69.89 |
| | Kimi-Audio | 93.93 |
| **MELD** | Qwen2-Audio-base | 51.23 |
| | Baichuan-chat | 23.59 |
| | Step-Audio-chat | 33.54 |
| | Qwen2.5-Omni | 49.83 |
| | Kimi-Audio | 59.13 |
| **TUT2017** | Qwen2-Audio-base | 33.83 |
| | Baichuan-base | 27.9 |
| | Step-Audio-chat | 7.41 |
| | Qwen2.5-Omni | 43.27 |
| | Kimi-Audio | 65.25 |
| **CochlScene**<br>test \| dev | Qwen2-Audio-base | 52.69 \| 50.96 |
| | Baichuan-base | 34.93 \| 34.56 |
| | Step-Audio-chat | 10.06 \| 10.42 |
| | Qwen2.5-Omni | 63.82 \| 63.82 |
| | Kimi-Audio | 79.84 \| 80.99 |
### Audio-to-Text Chat

**OpenAudioBench (Performance ↑)**

| Model | AlpacaEval | Llama Questions | Reasoning QA | TriviaQA | Web Questions |
|---|---|---|---|---|---|
| Qwen2-Audio-chat | 57.19 | 69.67 | 42.77 | 40.30 | 45.20 |
| Baichuan-chat | 59.65 | 74.33 | 46.73 | 55.40 | 58.70 |
| GLM-4-Voice | 57.89 | 76.00 | 47.43 | 51.80 | 55.40 |
| StepAudio-chat | 56.53 | 72.33 | 60.00 | 56.80 | 73.00 |
| Qwen2.5-Omni | 72.76 | 75.33 | 63.76 | 57.06 | 62.80 |
| Kimi-Audio | 75.73 | 79.33 | 58.02 | 62.10 | 70.20 |

**VoiceBench (Performance ↑)**

| Model | AlpacaEval | CommonEval | SD-QA | MMSU | OpenBookQA | IFEval | AdvBench | Avg |
|---|---|---|---|---|---|---|---|---|
| Qwen2-Audio-chat | 3.69 | 3.40 | 35.35 | 35.43 | 49.01 | 22.57 | 98.85 | 54.72 |
| Baichuan-chat | 4.00 | 3.39 | 49.64 | 48.80 | 63.30 | 41.32 | 86.73 | 62.51 |
| GLM-4-Voice | 4.06 | 3.48 | 43.31 | 40.11 | 52.97 | 24.91 | 88.08 | 57.17 |
| StepAudio-chat | 3.99 | 2.99 | 46.84 | 28.72 | 31.87 | 29.19 | 65.77 | 48.86 |
| Qwen2.5-Omni | 4.33 | 3.84 | 57.41 | 56.38 | 79.12 | 53.88 | 99.62 | 72.83 |
| Kimi-Audio | 4.46 | 3.97 | 63.12 | 62.17 | 83.52 | 61.10 | 100.00 | 76.93 |
### Speech Conversation

Performance of Kimi-Audio and baseline models on speech conversation.

| Model | Speed Control | Accent Control | Emotion Control | Empathy | Style Control | Avg |
|---|---|---|---|---|---|---|
| GPT-4o | 4.21 | 3.65 | 4.05 | 3.87 | 4.54 | 4.06 |
| Step-Audio-chat | 3.25 | 2.87 | 3.33 | 3.05 | 4.14 | 3.33 |
| GLM-4-Voice | 3.83 | 3.51 | 3.77 | 3.07 | 4.04 | 3.65 |
| GPT-4o-mini | 3.15 | 2.71 | 4.24 | 3.16 | 4.01 | 3.45 |
| Kimi-Audio | 4.30 | 3.45 | 4.27 | 3.39 | 4.09 | 3.90 |
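
The Avg column is the arithmetic mean of the five ability scores in each row; the illustrative snippet below recomputes it (small differences, e.g. a recomputed 3.64 vs. the reported 3.65 for GLM-4-Voice, come from rounding of the published per-ability numbers).

```python
# Illustrative sanity check: Avg = mean of the five ability scores above.
# Minor mismatches (e.g. GLM-4-Voice: 3.64 vs. reported 3.65) reflect
# rounding of the published per-ability scores.
scores = {
    "GPT-4o":          [4.21, 3.65, 4.05, 3.87, 4.54],
    "Step-Audio-chat": [3.25, 2.87, 3.33, 3.05, 4.14],
    "GLM-4-Voice":     [3.83, 3.51, 3.77, 3.07, 4.04],
    "GPT-4o-mini":     [3.15, 2.71, 4.24, 3.16, 4.01],
    "Kimi-Audio":      [4.30, 3.45, 4.27, 3.39, 4.09],
}

for model, vals in scores.items():
    print(f"{model}: avg = {sum(vals) / len(vals):.2f}")
```
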
## Evaluation Toolkit

Evaluating and comparing audio foundation models is challenging due to inconsistent metrics, varying inference configurations, and a lack of standardized generation evaluation. To address this, we developed and open-sourced an **Evaluation Toolkit**.

Key features:

* Integrates Kimi-Audio and other recent audio LLMs.
* Implements standardized metric calculation and integrates LLMs for intelligent judging (e.g., for AQA).
* Provides a unified platform for side-by-side comparisons with shareable inference "recipes" for reproducibility.
* Includes a benchmark for evaluating speech conversation abilities (control, empathy, style).

We encourage the community to use and contribute to this toolkit to foster more reliable and comparable benchmarking. Find it here: [Kimi-Audio-Evalkit](https://github.com/MoonshotAI/Kimi-Audio-Evalkit).

## License

The model is based on and modified from [Qwen 2.5-7B](https://github.com/QwenLM/Qwen2.5). Code derived from Qwen2.5-7B is licensed under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0). Other parts of the code are licensed under the [MIT License](https://opensource.org/licenses/MIT).

## Acknowledgements

We would like to thank the following projects and individuals for their contributions to the development of Kimi-Audio:

* [Whisper](https://github.com/openai/whisper)
* [Transformers](https://github.com/huggingface/transformers)
* [BigVGAN](https://github.com/NVIDIA/BigVGAN)
* [GLM-4-Voice](https://github.com/THUDM/GLM-4-Voice)

Thank you to all the open-source projects for their contributions to this project!

## Citation

If you find Kimi-Audio useful in your research or applications, please cite our technical report:

```bibtex
@misc{kimi_audio_2024,
  title={Kimi-Audio Technical Report},
  author={Kimi Team},
  year={2024},
  eprint={arXiv:placeholder},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```

## Contact Us

For questions, issues, or collaboration inquiries, please feel free to open an issue on GitHub.