Kimi-Audio Technical Report (2504.18425v1)

Published 25 Apr 2025 in eess.AS, cs.AI, cs.CL, cs.LG, cs.MM, and cs.SD

Abstract: We present Kimi-Audio, an open-source audio foundation model that excels in audio understanding, generation, and conversation. We detail the practices in building Kimi-Audio, including model architecture, data curation, training recipe, inference deployment, and evaluation. Specifically, we leverage a 12.5Hz audio tokenizer, design a novel LLM-based architecture with continuous features as input and discrete tokens as output, and develop a chunk-wise streaming detokenizer based on flow matching. We curate a pre-training dataset that consists of more than 13 million hours of audio data covering a wide range of modalities including speech, sound, and music, and build a pipeline to construct high-quality and diverse post-training data. Initialized from a pre-trained LLM, Kimi-Audio is continually pre-trained on both audio and text data with several carefully designed tasks, and then fine-tuned to support a diverse set of audio-related tasks. Extensive evaluation shows that Kimi-Audio achieves state-of-the-art performance on a range of audio benchmarks including speech recognition, audio understanding, audio question answering, and speech conversation. We release the code, model checkpoints, and evaluation toolkit at https://github.com/MoonshotAI/Kimi-Audio.

Summary

  • The paper introduces an open-source audio foundation model that unifies audio understanding, generation, and conversation using a hybrid tokenization approach.
  • It combines discrete semantic tokens and continuous acoustic features with a shared transformer backbone and a specialized audio detokenizer for real-time synthesis.
  • Evaluation demonstrates state-of-the-art performance on ASR, audio reasoning, and speech conversation tasks, backed by a robust, reproducible evaluation toolkit.

Kimi-Audio is presented as an open-source audio foundation model designed for comprehensive audio understanding, generation, and conversation. The technical report details its architecture, data curation, training methodology, inference deployment, and evaluation process. The goal is to unify various audio tasks within a single, performant model, overcoming limitations of prior models that focused on specific tasks or lacked open availability and extensive pre-training.

The model architecture consists of three main components:

  1. Audio Tokenizer: This component converts raw audio into representations suitable for the LLM. It uses a hybrid approach:
    • Discrete Semantic Tokens: Derived from a supervised speech tokenizer (based on a Whisper encoder with a vector quantization layer) at a low frame rate of 12.5Hz. These capture the semantic content.
    • Continuous Acoustic Features: Extracted from a pre-trained Whisper model, downsampled from 50Hz to 12.5Hz using an adaptor. These provide richer acoustic details like speaker characteristics, emotion, and environmental sounds. The final input to the LLM is an embedding formed by combining the discrete token embeddings and the continuous features.
  2. Audio LLM: This is the core processing unit, initialized from a pre-trained text LLM (Qwen2.5 7B). It uses a shared transformer backbone that processes multimodal inputs. After the shared layers, the architecture branches into two parallel heads:
    • Text Head: Predicts discrete text tokens autoregressively.
    • Audio Head: Predicts discrete semantic audio tokens autoregressively. The shared layers and text head are initialized from the text LLM weights, while the audio head is randomly initialized, allowing the model to leverage strong text capabilities while learning audio processing (a minimal sketch of this hybrid-input, dual-head forward pass is given after this list).
  3. Audio Detokenizer: Converts the predicted discrete semantic audio tokens back into a high-quality audio waveform. It comprises two parts:
    • Flow-Matching Module: Converts 12.5Hz semantic tokens into 50Hz mel-spectrograms.
    • Vocoder: Uses BigVGAN [lee2022bigvgan] to generate waveforms from the mel-spectrograms. To enable low-latency, real-time generation, a chunk-wise streaming detokenizer is employed: it processes audio in chunks (e.g., 1 second), using previous chunks as context. A "look-ahead" mechanism is introduced during inference, where a few future tokens are included in the context when processing chunk boundaries, mitigating intermittent artifacts at chunk transitions without increasing training complexity (see the chunk-wise decoding sketch after this list).
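To make the data flow concrete, here is a minimal, illustrative PyTorch sketch of the hybrid-input, dual-head design described above. It is not the released implementation: the hidden size, vocabulary sizes, the 1280-dimensional Whisper feature width, the toy two-layer backbone, and the additive merging of discrete-token embeddings with continuous features are assumptions made for brevity.

```python
import torch.nn as nn

class KimiAudioLMSketch(nn.Module):
    """Illustrative hybrid-input, dual-head audio LLM (not the released code)."""

    def __init__(self, d_model=4096, text_vocab=152064, audio_vocab=16384):
        super().__init__()
        # Embedding table for 12.5 Hz discrete semantic audio tokens.
        self.audio_token_emb = nn.Embedding(audio_vocab, d_model)
        # Adaptor projecting continuous Whisper features (assumed 1280-dim) to the model width.
        self.whisper_adaptor = nn.Linear(1280, d_model)
        # Stand-in for the shared transformer backbone (initialized from Qwen2.5 7B in the paper).
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=32, batch_first=True)
        self.shared = nn.TransformerEncoder(layer, num_layers=2)
        # Two parallel heads on top of the shared layers.
        self.text_head = nn.Linear(d_model, text_vocab)    # predicts text tokens
        self.audio_head = nn.Linear(d_model, audio_vocab)  # predicts discrete semantic audio tokens

    def forward(self, audio_tokens, whisper_feats):
        # audio_tokens:  (B, T) discrete 12.5 Hz semantic tokens
        # whisper_feats: (B, T, 1280) continuous features, already downsampled 50 Hz -> 12.5 Hz
        # Merging the two streams by addition is an assumption of this sketch.
        x = self.audio_token_emb(audio_tokens) + self.whisper_adaptor(whisper_feats)
        h = self.shared(x)
        return self.text_head(h), self.audio_head(h)
```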
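The chunk-wise streaming detokenizer with look-ahead can likewise be sketched as a decoding loop. The chunk length, look-ahead size, and the `flow_matching_to_mel` / `vocoder` callables are placeholders; only the control flow (carrying previous chunks as context and peeking a few future tokens at boundaries) follows the report.

```python
def stream_detokenize(semantic_tokens, flow_matching_to_mel, vocoder,
                      chunk_len=12, look_ahead=3):
    """Chunk-wise streaming detokenization with look-ahead (illustrative only).

    semantic_tokens:      list of 12.5 Hz discrete semantic tokens (~12-13 per second).
    flow_matching_to_mel: placeholder callable, tokens -> 50 Hz mel frames (4 frames per token).
    vocoder:              placeholder callable standing in for BigVGAN, mel -> waveform.
    """
    context = []  # previously generated tokens kept as left context
    for start in range(0, len(semantic_tokens), chunk_len):
        chunk = semantic_tokens[start:start + chunk_len]
        # Peek a few future tokens so chunk boundaries are synthesized smoothly.
        peek = semantic_tokens[start + chunk_len:start + chunk_len + look_ahead]
        mel = flow_matching_to_mel(context + chunk + peek)
        # Emit only the mel frames belonging to the current chunk (12.5 Hz -> 50 Hz is 4x).
        mel_chunk = mel[len(context) * 4:(len(context) + len(chunk)) * 4]
        yield vocoder(mel_chunk)  # roughly 1 s of waveform, streamed back immediately
        context = (context + chunk)[-chunk_len:]  # keep a bounded left context
```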

Data curation is a crucial aspect of building Kimi-Audio, involving both pre-training and supervised fine-tuning (SFT) data.

  • Pre-training Data: Over 13 million hours of raw audio (speech, sound, music) are used, along with text data. A sophisticated automatic pipeline processes the raw audio to generate high-quality multimodal data (audio-text pairs), as illustrated in Figure~\ref{fig:pretrain_data_pipeline}. Key steps include:
    • Speech Enhancement: Using a BSRNN model, with random mixing of original and enhanced audio to preserve environmental sounds.
    • Segmentation by Diarization: Utilizing PyAnnote for initial diarization, followed by extensive post-processing (speaker cluster merging based on embeddings, chunk-based reassignment for purity, and segment merging with length/silence constraints) to produce accurate, consistently sized speaker turns.
    • Speech Transcription: Employing Whisper-large-v3 for language detection (English/Mandarin retained) and English transcription/punctuation, and Paraformer-Zh [gao2022paraformer] for Mandarin transcription with custom punctuation rules based on silence gaps.
    • The pipeline is deployed on a sizable cloud cluster (30 instances, 240 GPUs), achieving a throughput of roughly 200,000 hours of audio per day (a simplified sketch of the per-file flow is given after this list).
  • SFT Data: Designed to support diverse tasks such as audio understanding (ASR, audio question answering (AQA), automated audio captioning (AAC), speech emotion recognition (SER), sound event classification (SEC), and acoustic scene classification (ASC)), speech conversation, and audio-to-text chat. It leverages a mix of open-source datasets (listed in Table~\ref{tab:sft_datasets_collected} and Table~\ref{tab:sft_datasets_synthesized}) and in-house data.
    • Speech Conversation Data: Multi-turn conversations constructed by generating user queries (LLM text + Kimi-TTS synthesis with diverse timbres) and assistant responses (text + Kimi-TTS/Kimi-VC synthesis using a specific Kimi-Audio speaker timbre). Kimi-TTS is a zero-shot TTS system trained on 1M hours, and Kimi-VC is a Seed-VC based system fine-tuned on speaker data to preserve style/emotion during timbre conversion.
    • Audio-to-Text Chat Data: Synthesized by converting public text SFT datasets into audio-text pairs, where user queries are synthesized speech. Text preprocessing filters complex content and rewrites instructions for clarity.
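At a high level, the per-file flow of the pre-training pipeline referenced above can be summarized as follows. All callables here are placeholders for the components the report names (BSRNN enhancement, PyAnnote diarization with post-processing, Whisper-large-v3, Paraformer-Zh); the mixing probability and control flow are simplifications, not the authors' implementation.

```python
import random

def process_raw_audio(audio, enhance, diarize, refine_segments,
                      whisper_large_v3, paraformer_zh, mix_prob=0.5):
    """Illustrative per-file flow of the pre-training data pipeline (simplified)."""
    # 1) Speech enhancement, randomly mixed with the original audio so that
    #    environmental sounds are not erased from the corpus.
    if random.random() < mix_prob:
        audio = enhance(audio)

    # 2) Diarization, then post-processing (speaker-cluster merging, chunk-based
    #    reassignment, segment merging) to get consistently sized speaker turns.
    segments = refine_segments(diarize(audio))

    # 3) Per-segment language detection and transcription.
    pairs = []
    for seg in segments:
        lang = whisper_large_v3.detect_language(seg)
        if lang == "en":
            pairs.append((seg, whisper_large_v3.transcribe(seg)))
        elif lang == "zh":
            pairs.append((seg, paraformer_zh.transcribe(seg)))
        # segments in other languages are dropped
    return pairs
```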

The training process is divided into pre-training and supervised fine-tuning.

  • Pre-training: Aims to learn unimodal knowledge and align audio/text modalities. Tasks include:
    • Audio/Text Unimodal: Next-token prediction on text-only and audio-only data.
    • Audio-Text Mapping: ASR (a \rightarrow t) and TTS (t \rightarrow a^d) tasks formulated as sequence-to-sequence prediction.
    • Audio-Text Interleaving: Interleaving sequences of audio (a) with target audio semantic tokens (a^d) or text tokens (t) in various patterns (a_i, a_{i+1}^d, \dots; a_i, t_{i+1}, \dots; a_i, a_{i+1}^d / t_{i+1}, \dots) to bridge modalities and enable multi-output generation.
    • The model is initialized from Qwen2.5 7B and trained for 1 epoch on 585B audio and text tokens with specific task weights; the Whisper feature extractor is initially frozen and later unfrozen (an illustrative sketch of the interleaving patterns is given after this list).
  • Supervised Fine-tuning: Equips the model with instruction-following capabilities using natural language instructions (audio or text versions). The model is fine-tuned on these datasets for 2-4 epochs using the AdamW optimizer with a cosine decay learning rate schedule (a minimal configuration sketch follows this list).
  • Audio Detokenizer Training: Trained in three stages: pre-training on 1M hours for diversity, chunk-wise fine-tuning, and fine-tuning on high-quality single-speaker data for expressiveness.
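The interleaving patterns used in pre-training can be illustrated with a small helper that builds a training sequence from aligned audio/text segments. The dict layout, pattern names, and the flat concatenation in the last case are assumptions; only the interleaving idea itself follows the report.

```python
def build_interleaved_sequence(segments, pattern="audio_text"):
    """Build one interleaved pre-training sequence from aligned segments.

    segments: list of dicts with (assumed) keys
      "a"  - input audio representation for segment i
      "ad" - discrete semantic audio tokens for segment i
      "t"  - text tokens for segment i
    pattern: "audio_semantic" (a_i, a_{i+1}^d, ...), "audio_text" (a_i, t_{i+1}, ...),
    or "audio_both" (a_i, a_{i+1}^d / t_{i+1}, ...). In the real model the last case
    is produced in parallel by the audio and text heads; here the two streams are
    simply concatenated to keep the sketch flat.
    """
    seq = []
    for i, seg in enumerate(segments):
        if i % 2 == 0:                      # even segments enter as audio a_i
            seq.extend(seg["a"])
        elif pattern == "audio_semantic":
            seq.extend(seg["ad"])           # odd segments as semantic tokens a_{i+1}^d
        elif pattern == "audio_text":
            seq.extend(seg["t"])            # odd segments as text tokens t_{i+1}
        else:                               # "audio_both": a_{i+1}^d and t_{i+1}
            seq.extend(seg["ad"] + seg["t"])
    return seq
```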
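For the fine-tuning recipe (AdamW with cosine decay), a minimal PyTorch configuration could look like the sketch below. The learning rate, weight decay, and the choice of 3 epochs (within the reported 2-4 range) are illustrative values, not numbers from the report.

```python
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

def make_sft_optimizer(model, steps_per_epoch, epochs=3, peak_lr=1e-5, weight_decay=0.1):
    """AdamW with a cosine-decay schedule, as named in the SFT recipe (values assumed)."""
    optimizer = AdamW(model.parameters(), lr=peak_lr, weight_decay=weight_decay)
    # Decay the learning rate along a cosine curve over the whole fine-tuning run.
    scheduler = CosineAnnealingLR(optimizer, T_max=epochs * steps_per_epoch)
    return optimizer, scheduler
```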

For inference and deployment, Kimi-Audio is designed for tasks like real-time speech-to-speech conversation. The process involves client streaming audio, server-side VAD, LLM inference, and real-time streaming of generated audio chunks back to the client. The production deployment architecture leverages modular services (RTC Service, Inference Scheduler, Tokenizer/Detokenizer/LLM Services) with load balancing and parallel instances for scalability and low latency, critical for real-time interaction.
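A highly simplified view of one such speech-to-speech turn is sketched below. The service names and interfaces (`vad`, `tokenizer_service`, `llm_service`, `detokenizer_service`) are placeholders standing in for the modular services behind the Inference Scheduler; this is not a real API.

```python
def handle_turn(mic_chunks, vad, tokenizer_service, llm_service, detokenizer_service):
    """Sketch of one speech-to-speech turn in the serving pipeline (placeholder services)."""
    # 1) Accumulate client audio until server-side VAD detects end of speech.
    utterance = []
    for chunk in mic_chunks:
        utterance.append(chunk)
        if vad.is_end_of_speech(chunk):
            break

    # 2) Tokenize the utterance and run LLM inference, which yields discrete
    #    semantic audio tokens chunk by chunk.
    tokens_in = tokenizer_service(utterance)
    for semantic_chunk in llm_service.generate_stream(tokens_in):
        # 3) Detokenize each chunk and stream the waveform back immediately.
        yield detokenizer_service(semantic_chunk)
```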

Evaluating Kimi-Audio and comparing it to other models requires a robust evaluation framework, as current audio benchmarks suffer from inconsistent metrics, diverse configurations, and limited generation evaluation. To address this, the authors developed an open-source Evaluation Toolkit (2504.18425) with features like standardized WER calculation, GPT-4o-mini as an intelligent judge for nuanced tasks, a unified platform supporting multiple models, and standardized inference parameters/prompting strategies ("recipes") to ensure reproducibility and fair comparison.
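As an illustration of what a "standardized WER calculation" entails, the sketch below applies a fixed text normalization before scoring with the jiwer package. The normalization rules, the use of jiwer, and computing character error rate for Mandarin are assumptions for this example; the released toolkit defines its own recipes.

```python
import re
import string
import jiwer  # pip install jiwer; the released toolkit may use different tooling

def normalize(text):
    """Lowercase and strip ASCII punctuation (illustrative normalization only)."""
    text = text.lower()
    return re.sub(rf"[{re.escape(string.punctuation)}]", "", text).strip()

def asr_score(reference, hypothesis, lang="en"):
    """WER for English, CER for Mandarin (a common convention, assumed here)."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    return jiwer.cer(ref, hyp) if lang == "zh" else jiwer.wer(ref, hyp)
```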

Evaluation results, presented using this toolkit, demonstrate Kimi-Audio's state-of-the-art performance across various benchmarks:

  • ASR: Achieves superior performance (lower WER) on datasets like LibriSpeech, Fleurs, AISHELL-1/2, and WenetSpeech compared to models like Qwen2-Audio, Baichuan-Audio, Step-Audio, and Qwen2.5-Omni (Table~\ref{tab:asr_performance}).
  • Audio Understanding: Shows leading performance on tasks involving music, sound events, and speech understanding across MMAU, ClothoAQA, VocalSound, Nonspeech7k, MELD, TUT2017, and CochlScene datasets (Table~\ref{tab:audio_understanding_performance}).
  • Audio-to-Text Chat: Attains state-of-the-art results on multiple sub-tasks of OpenAudioBench and VoiceBench, indicating strong conversational and reasoning abilities based on audio input (Table~\ref{tab:audio_text_performance}).
  • Speech Conversation: Subjective human evaluations show Kimi-Audio achieving high scores for speed control, emotion control, and empathy, with a competitive overall average compared to other models like GPT-4o-mini, GLM-4-Voice, and Step-Audio-chat (Table~\ref{tab:smo_performance}).

The report concludes by discussing challenges and future trends, including the need to move beyond transcription-only pre-training towards incorporating richer audio descriptions, developing better audio representations that blend semantic and acoustic information, and reducing reliance on ASR/TTS for data generation to unlock the full potential of audio foundation models. The open-sourcing of Kimi-Audio and its evaluation toolkit is presented as a contribution to fostering further research and development in the community.
