# Kimi-Audio

[🤗 Kimi-Audio-7B-Instruct](https://huggingface.co/moonshotai/Kimi-Audio-7B-Instruct) | [📑 Paper](assets/kimia_report.pdf)

We present Kimi-Audio, an open-source audio foundation model excelling in **audio understanding, generation, and conversation**. This repository contains the official implementation, models, and evaluation toolkit for Kimi-Audio.

## 🔥🔥🔥 News!!

* April 25, 2025: 👋 We release the inference code and model weights of [Kimi-Audio-7B-Instruct](https://huggingface.co/moonshotai/Kimi-Audio-7B-Instruct).
* April 25, 2025: 👋 We release the audio evaluation toolkit [Kimi-Audio-Evalkit](https://github.com/MoonshotAI/Kimi-Audio-Evalkit). You can easily reproduce our results and the baselines with this toolkit!
* April 25, 2025: 👋 We release the technical report of [Kimi-Audio](assets/kimia_report.pdf).

## Table of Contents

- [Introduction](#introduction)
- [Architecture Overview](#architecture-overview)
- [Quick Start](#quick-start)
- [Evaluation](#evaluation)
  - [Speech Recognition](#automatic-speech-recognition-asr)
  - [Audio Understanding](#audio-understanding)
  - [Audio-to-Text Chat](#audio-to-text-chat)
  - [Speech Conversation](#speech-conversation)
- [Evaluation Toolkit](#evaluation-toolkit)
- [License](#license)
- [Acknowledgements](#acknowledgements)
- [Citation](#citation)
- [Contact Us](#contact-us)

## Introduction

Kimi-Audio is designed as a universal audio foundation model capable of handling a wide variety of audio processing tasks within a single unified framework. Key features include:

* **Universal Capabilities:** Handles diverse tasks like speech recognition (ASR), audio question answering (AQA), audio captioning (AAC), speech emotion recognition (SER), sound event/scene classification (SEC/ASC), and end-to-end speech conversation.
* **State-of-the-Art Performance:** Achieves SOTA results on numerous audio benchmarks (see [Evaluation](#evaluation) and the [Technical Report](assets/kimia_report.pdf)).
* **Large-Scale Pre-training:** Pre-trained on over 13 million hours of diverse audio data (speech, music, sounds) and text data, enabling robust audio reasoning and language understanding.
* **Novel Architecture:** Employs a hybrid audio input (continuous acoustic features + discrete semantic tokens) and an LLM core with parallel heads for text and audio token generation.
* **Efficient Inference:** Features a chunk-wise streaming detokenizer based on flow matching for low-latency audio generation.
* **Open-Source:** We release the code, model checkpoints, and a comprehensive evaluation toolkit to foster community research and development.

## Architecture Overview

Kimi-Audio consists of three main components:

1. **Audio Tokenizer:** Converts input audio into:
   * Discrete semantic tokens (12.5 Hz) using vector quantization.
   * Continuous acoustic features derived from a Whisper encoder (downsampled to 12.5 Hz).
2. **Audio LLM:** A transformer-based model (initialized from a pre-trained text LLM like Qwen 2.5 7B) with shared layers processing multimodal inputs, followed by parallel heads for autoregressively generating text tokens and discrete audio semantic tokens.
3. **Audio Detokenizer:** Converts the predicted discrete semantic audio tokens back into high-fidelity waveforms using a flow-matching model and a vocoder (BigVGAN), supporting chunk-wise streaming with a look-ahead mechanism for low latency.

## Quick Start

This example demonstrates basic usage for generating text from audio (ASR) and generating both text and speech in a conversational turn.

```python
import soundfile as sf
from kimia_infer.api.kimia import KimiAudio

# --- 1. Load Model ---
model_path = "moonshotai/Kimi-Audio-7B-Instruct"
model = KimiAudio(model_path=model_path, load_detokenizer=True)

# --- 2. Define Sampling Parameters ---
sampling_params = {
    "audio_temperature": 0.8,
    "audio_top_k": 10,
    "text_temperature": 0.0,
    "text_top_k": 5,
    "audio_repetition_penalty": 1.0,
    "audio_repetition_window_size": 64,
    "text_repetition_penalty": 1.0,
    "text_repetition_window_size": 16,
}

# --- 3. Example 1: Audio-to-Text (ASR) ---
messages_asr = [
    # You can provide context or instructions as text
    {"role": "user", "message_type": "text", "content": "Please transcribe the following audio:"},
    # Provide the audio file path
    {"role": "user", "message_type": "audio", "content": "test_audios/asr_example.wav"}
]

# Generate only text output
_, text_output = model.generate(messages_asr, **sampling_params, output_type="text")
print(">>> ASR Output Text: ", text_output)
# Expected output: "这并不是告别,这是一个篇章的结束,也是新篇章的开始。"
# ("This is not a farewell; it is the end of one chapter and the start of a new one.")

# --- 4. Example 2: Audio-to-Audio/Text Conversation ---
messages_conversation = [
    # Start conversation with an audio query
    {"role": "user", "message_type": "audio", "content": "test_audios/qa_example.wav"}
]

# Generate both audio and text output
wav_output, text_output = model.generate(messages_conversation, **sampling_params, output_type="both")

# Save the generated audio
output_audio_path = "output_audio.wav"
sf.write(output_audio_path, wav_output.detach().cpu().view(-1).numpy(), 24000)  # Assuming 24 kHz output
print(f">>> Conversational Output Audio saved to: {output_audio_path}")
print(">>> Conversational Output Text: ", text_output)
# Expected output: "A."

print("Kimi-Audio inference examples complete.")
```
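
If you need to transcribe multiple files, the same `generate` call can simply be reused in a loop. The snippet below is a minimal batch-ASR sketch built only on the API shown above; the `transcribe_dir` helper and the `test_audios` directory layout are illustrative assumptions, not part of the released package.

```python
# Minimal batch-ASR sketch reusing the Quick Start API above.
# `transcribe_dir` and the "test_audios" directory are illustrative
# assumptions, not part of the released package.
from pathlib import Path

from kimia_infer.api.kimia import KimiAudio

model = KimiAudio(model_path="moonshotai/Kimi-Audio-7B-Instruct", load_detokenizer=True)

sampling_params = {
    "audio_temperature": 0.8,
    "audio_top_k": 10,
    "text_temperature": 0.0,
    "text_top_k": 5,
    "audio_repetition_penalty": 1.0,
    "audio_repetition_window_size": 64,
    "text_repetition_penalty": 1.0,
    "text_repetition_window_size": 16,
}

def transcribe_dir(wav_dir):
    """Transcribe every .wav file in `wav_dir`; returns {path: transcript}."""
    transcripts = {}
    for wav_path in sorted(Path(wav_dir).glob("*.wav")):
        messages = [
            {"role": "user", "message_type": "text", "content": "Please transcribe the following audio:"},
            {"role": "user", "message_type": "audio", "content": str(wav_path)},
        ]
        # Text-only decoding, exactly as in the ASR example above.
        _, text_output = model.generate(messages, **sampling_params, output_type="text")
        transcripts[str(wav_path)] = text_output
    return transcripts

if __name__ == "__main__":
    for path, text in transcribe_dir("test_audios").items():
        print(f"{path}: {text}")
```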

## Evaluation

Kimi-Audio achieves state-of-the-art (SOTA) performance across a wide range of audio benchmarks. The tables below report performance on individual benchmarks; you can easily reproduce our results and the baselines with our [Kimi-Audio-Evalkit](https://github.com/MoonshotAI/Kimi-Audio-Evalkit) (also see [**Evaluation Toolkit**](#evaluation-toolkit)).

### Automatic Speech Recognition (ASR)

| Dataset | Model | Performance (WER ↓) |
|---|---|---|
| **LibriSpeech**<br>test-clean \| test-other | Qwen2-Audio-base | 1.74 \| 4.04 |
| | Baichuan-base | 3.02 \| 6.04 |
| | Step-Audio-chat | 3.19 \| 10.67 |
| | Qwen2.5-Omni | 2.37 \| 4.21 |
| | Kimi-Audio | 1.28 \| 2.42 |
| **Fleurs**<br>zh \| en | Qwen2-Audio-base | 3.63 \| 5.20 |
| | Baichuan-base | 4.15 \| 8.07 |
| | Step-Audio-chat | 4.26 \| 8.56 |
| | Qwen2.5-Omni | 2.92 \| 4.17 |
| | Kimi-Audio | 2.69 \| 4.44 |
| **AISHELL-1** | Qwen2-Audio-base | 1.52 |
| | Baichuan-base | 1.93 |
| | Step-Audio-chat | 2.14 |
| | Qwen2.5-Omni | 1.13 |
| | Kimi-Audio | 0.60 |
| **AISHELL-2 ios** | Qwen2-Audio-base | 3.08 |
| | Baichuan-base | 3.87 |
| | Step-Audio-chat | 3.89 |
| | Qwen2.5-Omni | 2.56 |
| | Kimi-Audio | 2.56 |
| **WenetSpeech**<br>test-meeting \| test-net | Qwen2-Audio-base | 8.40 \| 7.64 |
| | Baichuan-base | 13.28 \| 10.13 |
| | Step-Audio-chat | 10.83 \| 9.47 |
| | Qwen2.5-Omni | 7.71 \| 6.04 |
| | Kimi-Audio | 6.28 \| 5.37 |
| **Kimi-ASR Internal Testset**<br>subset1 \| subset2 | Qwen2-Audio-base | 2.31 \| 3.24 |
| | Baichuan-base | 3.41 \| 5.60 |
| | Step-Audio-chat | 2.82 \| 4.74 |
| | Qwen2.5-Omni | 1.53 \| 2.68 |
| | Kimi-Audio | 1.42 \| 2.44 |
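
For reference, WER (word error rate) is the word-level edit distance between the model's hypothesis and the reference transcript (substitutions + insertions + deletions), divided by the number of reference words. The snippet below is a minimal, self-contained illustration of that metric; the scores in the table above were produced with the official [Kimi-Audio-Evalkit](https://github.com/MoonshotAI/Kimi-Audio-Evalkit), not with this code.

```python
# Minimal word error rate (WER) illustration using dynamic programming.
# Sketch for intuition only; the benchmark numbers above come from the
# official Kimi-Audio-Evalkit, not from this function.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,       # deletion
                dp[i][j - 1] + 1,       # insertion
                dp[i - 1][j - 1] + sub, # substitution / match
            )
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# 1 substitution + 1 deletion over 6 reference words -> 33.33
print(f"{wer('the cat sat on the mat', 'the cat sit on mat') * 100:.2f}")
```
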
### Audio Understanding

| Dataset | Model | Performance ↑ |
|---|---|---|
| **MMAU**<br>music \| sound \| speech | Qwen2-Audio-base | 58.98 \| 69.07 \| 52.55 |
| | Baichuan-chat | 49.10 \| 59.46 \| 42.47 |
| | GLM-4-Voice | 38.92 \| 43.54 \| 32.43 |
| | Step-Audio-chat | 49.40 \| 53.75 \| 47.75 |
| | Qwen2.5-Omni | 62.16 \| 67.57 \| 53.92 |
| | Kimi-Audio | 61.68 \| 73.27 \| 60.66 |
| **ClothoAQA**<br>test \| dev | Qwen2-Audio-base | 71.73 \| 72.63 |
| | Baichuan-chat | 48.02 \| 48.16 |
| | Step-Audio-chat | 45.84 \| 44.98 |
| | Qwen2.5-Omni | 72.86 \| 73.12 |
| | Kimi-Audio | 71.24 \| 73.18 |
| **VocalSound** | Qwen2-Audio-base | 93.82 |
| | Baichuan-base | 58.17 |
| | Step-Audio-chat | 28.58 |
| | Qwen2.5-Omni | 93.73 |
| | Kimi-Audio | 94.85 |
| **Nonspeech7k** | Qwen2-Audio-base | 87.17 |
| | Baichuan-chat | 59.03 |
| | Step-Audio-chat | 21.38 |
| | Qwen2.5-Omni | 69.89 |
| | Kimi-Audio | 93.93 |
| **MELD** | Qwen2-Audio-base | 51.23 |
| | Baichuan-chat | 23.59 |
| | Step-Audio-chat | 33.54 |
| | Qwen2.5-Omni | 49.83 |
| | Kimi-Audio | 59.13 |
| **TUT2017** | Qwen2-Audio-base | 33.83 |
| | Baichuan-base | 27.9 |
| | Step-Audio-chat | 7.41 |
| | Qwen2.5-Omni | 43.27 |
| | Kimi-Audio | 65.25 |
| **CochlScene**<br>test \| dev | Qwen2-Audio-base | 52.69 \| 50.96 |
| | Baichuan-base | 34.93 \| 34.56 |
| | Step-Audio-chat | 10.06 \| 10.42 |
| | Qwen2.5-Omni | 63.82 \| 63.82 |
| | Kimi-Audio | 79.84 \| 80.99 |
### Audio-to-Text Chat

**OpenAudioBench (Performance ↑)**

| Model | AlpacaEval | Llama Questions | Reasoning QA | TriviaQA | Web Questions |
|---|---|---|---|---|---|
| Qwen2-Audio-chat | 57.19 | 69.67 | 42.77 | 40.30 | 45.20 |
| Baichuan-chat | 59.65 | 74.33 | 46.73 | 55.40 | 58.70 |
| GLM-4-Voice | 57.89 | 76.00 | 47.43 | 51.80 | 55.40 |
| StepAudio-chat | 56.53 | 72.33 | 60.00 | 56.80 | 73.00 |
| Qwen2.5-Omni | 72.76 | 75.33 | 63.76 | 57.06 | 62.80 |
| Kimi-Audio | 75.73 | 79.33 | 58.02 | 62.10 | 70.20 |

**VoiceBench (Performance ↑)**

| Model | AlpacaEval | CommonEval | SD-QA | MMSU | OpenBookQA | IFEval | AdvBench | Avg |
|---|---|---|---|---|---|---|---|---|
| Qwen2-Audio-chat | 3.69 | 3.40 | 35.35 | 35.43 | 49.01 | 22.57 | 98.85 | 54.72 |
| Baichuan-chat | 4.00 | 3.39 | 49.64 | 48.80 | 63.30 | 41.32 | 86.73 | 62.51 |
| GLM-4-Voice | 4.06 | 3.48 | 43.31 | 40.11 | 52.97 | 24.91 | 88.08 | 57.17 |
| StepAudio-chat | 3.99 | 2.99 | 46.84 | 28.72 | 31.87 | 29.19 | 65.77 | 48.86 |
| Qwen2.5-Omni | 4.33 | 3.84 | 57.41 | 56.38 | 79.12 | 53.88 | 99.62 | 72.83 |
| Kimi-Audio | 4.46 | 3.97 | 63.12 | 62.17 | 83.52 | 61.10 | 100.00 | 76.93 |
### Speech Conversation

Performance of Kimi-Audio and baseline models on speech conversation.

| Model | Speed Control | Accent Control | Emotion Control | Empathy | Style Control | Avg |
|---|---|---|---|---|---|---|
| GPT-4o | 4.21 | 3.65 | 4.05 | 3.87 | 4.54 | 4.06 |
| Step-Audio-chat | 3.25 | 2.87 | 3.33 | 3.05 | 4.14 | 3.33 |
| GLM-4-Voice | 3.83 | 3.51 | 3.77 | 3.07 | 4.04 | 3.65 |
| GPT-4o-mini | 3.15 | 2.71 | 4.24 | 3.16 | 4.01 | 3.45 |
| Kimi-Audio | 4.30 | 3.45 | 4.27 | 3.39 | 4.09 | 3.90 |
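
The Avg column is the arithmetic mean of the five ability scores in each row; the illustrative snippet below recomputes it (small differences, e.g. a recomputed 3.64 vs. the reported 3.65 for GLM-4-Voice, come from rounding of the published per-ability numbers).

```python
# Illustrative sanity check: Avg = mean of the five ability scores above.
# Minor mismatches (e.g. GLM-4-Voice: 3.64 vs. reported 3.65) reflect
# rounding of the published per-ability scores.
scores = {
    "GPT-4o":          [4.21, 3.65, 4.05, 3.87, 4.54],
    "Step-Audio-chat": [3.25, 2.87, 3.33, 3.05, 4.14],
    "GLM-4-Voice":     [3.83, 3.51, 3.77, 3.07, 4.04],
    "GPT-4o-mini":     [3.15, 2.71, 4.24, 3.16, 4.01],
    "Kimi-Audio":      [4.30, 3.45, 4.27, 3.39, 4.09],
}

for model, vals in scores.items():
    print(f"{model}: avg = {sum(vals) / len(vals):.2f}")
```
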
## Evaluation Toolkit

Evaluating and comparing audio foundation models is challenging due to inconsistent metrics, varying inference configurations, and a lack of standardized generation evaluation. To address this, we developed and open-sourced an **Evaluation Toolkit**.

Key features:

* Integrates Kimi-Audio and other recent audio LLMs.
* Implements standardized metric calculation and integrates LLMs for intelligent judging (e.g., for AQA).
* Provides a unified platform for side-by-side comparisons with shareable inference "recipes" for reproducibility.
* Includes a benchmark for evaluating speech conversation abilities (control, empathy, style).

We encourage the community to use and contribute to this toolkit to foster more reliable and comparable benchmarking. Find it here: [Kimi-Audio-Evalkit](https://github.com/MoonshotAI/Kimi-Audio-Evalkit).

## License

The model is based on and modified from [Qwen 2.5-7B](https://github.com/QwenLM/Qwen2.5). Code derived from Qwen2.5-7B is licensed under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0). Other parts of the code are licensed under the [MIT License](https://opensource.org/licenses/MIT).

## Acknowledgements

We would like to thank the following projects and individuals for their contributions to the development of Kimi-Audio:

* [Whisper](https://github.com/openai/whisper)
* [Transformers](https://github.com/huggingface/transformers)
* [BigVGAN](https://github.com/NVIDIA/BigVGAN)
* [GLM-4-Voice](https://github.com/THUDM/GLM-4-Voice)

Thank you to all the open-source projects for their contributions to this project!

## Citation

If you find Kimi-Audio useful in your research or applications, please cite our technical report:

```bibtex
@misc{kimi_audio_2024,
  title={Kimi-Audio Technical Report},
  author={Kimi Team},
  year={2024},
  eprint={arXiv:placeholder},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```

## Contact Us

For questions, issues, or collaboration inquiries, please feel free to open an issue on GitHub.