Kimi-Audio: Open-Source Audio Foundation Model
- Kimi-Audio is an open-source audio foundation model that leverages both continuous acoustic features and discrete semantic tokens for comprehensive audio understanding, generation, and conversation.
- It utilizes a transformer-based LLM with a 12.5 Hz audio tokenizer and a streamable detokenizer to deliver efficient, high-fidelity performance on diverse audio tasks.
- Its extensive pre-training on over 13 million hours of multi-modal audio enables robust applications in speech recognition, synthesis, and conversational AI.
Kimi-Audio is an open-source audio foundation model designed for comprehensive audio understanding, generation, and conversational tasks. Built upon a large-scale LLM-based architecture, Kimi-Audio processes both continuous acoustic features and discrete semantic tokens, leveraging a massive, multi-modal dataset and advanced training strategies to achieve state-of-the-art performance across diverse audio and speech benchmarks. The model integrates a 12.5 Hz audio tokenizer, a hybrid architecture for multi-modal input and output, continual pre-training on audio/text data, task-specific fine-tuning, and a chunk-wise streaming detokenizer utilizing flow matching. Kimi-Audio supports real-time inference, robust evaluation, and scalable deployment for research and industrial applications.
1. Model Architecture and Audio Tokenization
Kimi-Audio utilizes a hybrid architecture in which audio signals are represented both as continuous features and discrete semantic tokens. The core system consists of:
- Audio Tokenizer: Based on a Whisper encoder, the tokenizer extracts discrete semantic tokens via vector quantization and operates at a reduced frame rate of 12.5 Hz, facilitating alignment with text and improving computational efficiency. In parallel, continuous acoustic features are downsampled from the native 50 Hz of Whisper to 12.5 Hz and concatenated with the discrete token embeddings.
- Audio LLM Processor: The main processor is a transformer-based LLM initialized from a pre-trained text LLM (e.g., Qwen2.5-7B). At each time step it receives a fused representation that combines the continuous acoustic vector with the embedding of the corresponding discrete semantic token (illustrated in the sketch below). The shared transformer layers produce features used by two heads: one for autoregressive text generation and one for predicting discrete audio semantic tokens.
- Streamable Detokenizer: The flow-matching detokenizer transforms discrete tokens to 50 Hz mel-spectrograms, then synthesizes waveforms through a BigVGAN vocoder. Chunk-wise streaming detokenization, with look-ahead tokens, is used to ensure real-time, boundary-smooth synthesis.
This design enables Kimi-Audio to process, generate, and converse in both text and audio domains with high fidelity and efficiency.
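The following PyTorch sketch illustrates the dual-representation design described above: continuous features and discrete token embeddings are concatenated per 12.5 Hz frame, passed through a shared trunk, and decoded by parallel text and audio heads. The dimensions, vocabulary sizes, module names, and the small stand-in trunk are illustrative assumptions, not the released implementation.

```python
# Minimal sketch of the dual-representation input and dual output heads,
# assuming PyTorch. Dimensions, vocabulary sizes, and the small stand-in
# trunk are illustrative assumptions, not the released Kimi-Audio code.
import torch
import torch.nn as nn


class DualRepresentationAudioLM(nn.Module):
    def __init__(self, audio_vocab=16384, text_vocab=32000,
                 cont_dim=1280, hidden=512, layers=2):
        super().__init__()
        # Embedding table for discrete semantic audio tokens (12.5 Hz).
        self.audio_embed = nn.Embedding(audio_vocab, hidden)
        # Projection for continuous acoustic features downsampled to 12.5 Hz.
        self.cont_proj = nn.Linear(cont_dim, hidden)
        # Fuse the concatenated continuous/discrete representations.
        self.fuse = nn.Linear(2 * hidden, hidden)
        # Stand-in for the shared trunk (a pre-trained text LLM in practice);
        # the causal attention mask is omitted here for brevity.
        layer = nn.TransformerEncoderLayer(hidden, nhead=8, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, num_layers=layers)
        # Two parallel heads: autoregressive text and discrete audio semantic tokens.
        self.text_head = nn.Linear(hidden, text_vocab)
        self.audio_head = nn.Linear(hidden, audio_vocab)

    def forward(self, cont_feats, audio_tokens):
        # cont_feats: (B, T, cont_dim) continuous features at 12.5 Hz
        # audio_tokens: (B, T) discrete semantic token ids at 12.5 Hz
        fused = self.fuse(torch.cat([self.cont_proj(cont_feats),
                                     self.audio_embed(audio_tokens)], dim=-1))
        h = self.trunk(fused)
        return self.text_head(h), self.audio_head(h)


model = DualRepresentationAudioLM()
text_logits, audio_logits = model(torch.randn(1, 25, 1280),          # 2 s at 12.5 Hz
                                  torch.randint(0, 16384, (1, 25)))
```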
2. Data Curation and Pre-Training Corpus
The pre-training dataset encompasses over 13 million hours of audio, spanning speech, music, and environmental sound modalities:
- Speech Data: Includes conversational speech, audiobooks, interviews, and meeting data.
- Music and Environmental Sound: Covers diverse non-speech audio contexts.
- Processing Pipeline:
- Speech Enhancement: BSRNN-based denoising and dereverberation, applied selectively to balance noise removal against preservation of the original acoustic context.
- Diarization and Segmentation: Utilizing PyAnnote and heuristic post-processing for speaker clustering and chunk assignment.
- Transcription: Whisper-large-v3 for English; Paraformer-Zh with Mandarin-specific punctuation rules for Chinese segments.
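A simplified sketch of the diarize-then-transcribe step in this pipeline is shown below, assuming pyannote.audio and openai-whisper. The model identifiers, the per-turn clipping, and the omission of BSRNN enhancement, Paraformer-Zh, and the heuristic post-processing are assumptions for illustration, not the production pipeline.

```python
# Illustrative diarize-then-transcribe sketch using pyannote.audio and
# openai-whisper. Model names, clipping, and merging heuristics are
# assumptions; BSRNN enhancement and Paraformer-Zh are omitted.
import whisper
from pyannote.audio import Pipeline

SAMPLE_RATE = 16000  # whisper.load_audio resamples to 16 kHz

# Loading the diarization pipeline may require a Hugging Face auth token.
diarizer = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1")
asr = whisper.load_model("large-v3")

audio_path = "meeting.wav"              # hypothetical input file
audio = whisper.load_audio(audio_path)  # float32 waveform at 16 kHz
diarization = diarizer(audio_path)

segments = []
for turn, _, speaker in diarization.itertracks(yield_label=True):
    # Slice the waveform to the diarized turn and transcribe it in isolation.
    clip = audio[int(turn.start * SAMPLE_RATE):int(turn.end * SAMPLE_RATE)]
    text = asr.transcribe(clip)["text"].strip()
    segments.append((speaker, turn.start, turn.end, text))

for speaker, start, end, text in segments:
    print(f"[{start:6.1f}-{end:6.1f}] {speaker}: {text}")
```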
Post-training supervised fine-tuning data further includes curated open-source and proprietary datasets for audio understanding, speech conversation, and audio-to-text chat.
3. Training Strategy and Objective Formulation
Kimi-Audio’s training follows a two-stage approach:
- Continual Pre-Training:
- Initialized from a text LLM.
- Employs unimodal next-token prediction (audio-only and text-only), audio-to-text mapping (ASR, TTS), and audio-text interleaving.
- Composite training sequences combine these objectives, with prediction losses applied at the token positions designated for each task and the task mixture balanced in a 1:7:1:1:1:1:2 ratio.
- Large-scale token-level optimization: 585B audio tokens and 585B text tokens per epoch.
- Supervised Fine-Tuning (SFT):
- Instruction-based tuning on ASR, audio understanding, speech conversation, and audio-to-text chat, with both text and synthesized audio instructions via Kimi-TTS.
- Learning rates are carefully controlled, and each SFT dataset is typically trained for 2–4 epochs.
- AdamW optimizer with warmup and cosine decay.
This regimen enhances the model's ability to understand multimodal audio/text input, generate responses, and follow instructions.
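As a concrete illustration of the optimizer setup mentioned above (AdamW with warmup followed by cosine decay), here is a minimal PyTorch sketch; the peak learning rate, warmup length, total steps, and the stand-in model are placeholder assumptions, not the values used to train Kimi-Audio.

```python
# Minimal sketch of AdamW with linear warmup followed by cosine decay,
# assuming PyTorch. Peak LR, warmup/total steps, and the stand-in model
# are placeholders, not Kimi-Audio's actual training configuration.
import math
import torch

model = torch.nn.Linear(512, 512)   # stand-in for the audio LLM
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.1)

warmup_steps, total_steps = 2_000, 100_000


def lr_lambda(step: int) -> float:
    if step < warmup_steps:
        # Linear warmup from 0 to the peak learning rate.
        return step / max(1, warmup_steps)
    # Cosine decay from the peak learning rate toward 0.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))


scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for step in range(total_steps):
    # In real training, the forward/backward pass would precede optimizer.step().
    optimizer.step()
    scheduler.step()
    if step in (0, warmup_steps, total_steps - 1):
        print(f"step {step}: lr = {scheduler.get_last_lr()[0]:.2e}")
```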
4. Evaluation Methodology and Benchmark Results
Extensive evaluation demonstrates Kimi-Audio’s state-of-the-art performance:
| Task | Benchmark Datasets | Reported Metrics/Performance |
| --- | --- | --- |
| ASR | LibriSpeech, AISHELL-1/2, WenetSpeech | 1.28% WER on LibriSpeech test-clean |
| Audio Understanding | MMAU, ClothoAQA, VocalSound, Nonspeech7k, MELD, TUT2017, CochlScene | State-of-the-art scores on sound event, emotion, and scene tasks |
| Audio-to-Text Chat | OpenAudioBench, VoiceBench | Outperforms Qwen2-Audio and Qwen2.5-Omni |
| Speech Conversation | Internal human ratings | High scores, close to GPT-4o on fluency and empathy |
Subjective human ratings address nuanced conversational metrics (speed control, accent, emotion, empathy).
The evaluation toolkit, released as open source, standardizes WER computation, supports diverse audio tasks, and combines objective metrics with LLM-based judging using models such as GPT-4o-mini.
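For instance, a standardized WER computation in the spirit of this toolkit might look like the following sketch using the jiwer library; the normalization rules shown are illustrative assumptions and may differ from the toolkit's actual ones.

```python
# Hedged example of WER scoring with light text normalization, using jiwer.
# The normalization steps are illustrative, not the toolkit's exact rules.
import jiwer

normalize = jiwer.Compose([
    jiwer.ToLowerCase(),
    jiwer.RemovePunctuation(),
    jiwer.RemoveMultipleSpaces(),
    jiwer.Strip(),
])

reference = "The quick brown fox jumps over the lazy dog."
hypothesis = "the quick brown fox jumped over the lazy dog"

# Apply the same normalization to both sides before computing WER.
error_rate = jiwer.wer(normalize(reference), normalize(hypothesis))
print(f"WER: {error_rate:.2%}")   # one substitution out of nine words
```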
5. Inference Deployment and System Workflow
Kimi-Audio is optimized for real-time, scalable deployment:
- Inference System:
- Client-side streaming (app or browser) sends audio to the server over WebRTC.
- Server-side VAD module detects end-of-utterance; segment is fed to tokenizer for multimodal representation.
- Audio LLM generates output tokens (text and audio).
- The streaming detokenizer processes output tokens chunk by chunk, using look-ahead tokens for boundary smoothing, and streams audio responses back to the client.
- Microservice Architecture:
- Tokenizer, LLM, and Detokenizer are modular microservices with load balancing and parallel execution.
- Designed for low-latency, robust real-time interaction and conversational tasks.
The workflow is suitable for robust deployment in research and production environments.
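The chunk-wise, look-ahead detokenization scheduling described in this workflow can be sketched as follows; the chunk size, look-ahead length, and the placeholder detokenize() function are assumptions standing in for the flow-matching detokenizer and BigVGAN vocoder.

```python
# Illustrative sketch of chunk-wise streaming detokenization with look-ahead
# for boundary smoothing. Chunk size, look-ahead length, and the placeholder
# detokenize() are assumptions standing in for flow matching + BigVGAN.
from typing import Iterable, Iterator, List

CHUNK_TOKENS = 25      # assumed chunk size (2 s of discrete tokens at 12.5 Hz)
LOOKAHEAD_TOKENS = 5   # extra future tokens used only to smooth chunk boundaries


def detokenize(tokens: List[int]) -> List[float]:
    """Placeholder: maps each token to one fake waveform sample.

    A real detokenizer would produce many waveform samples per 12.5 Hz token.
    """
    return [float(t) for t in tokens]


def stream_detokenize(token_stream: Iterable[int]) -> Iterator[List[float]]:
    buffer: List[int] = []
    emitted = 0
    for tok in token_stream:
        buffer.append(tok)
        # Synthesize a chunk only once its look-ahead context has arrived.
        while len(buffer) - emitted >= CHUNK_TOKENS + LOOKAHEAD_TOKENS:
            window = buffer[emitted:emitted + CHUNK_TOKENS + LOOKAHEAD_TOKENS]
            audio = detokenize(window)
            yield audio[:CHUNK_TOKENS]   # emit only the non-look-ahead portion
            emitted += CHUNK_TOKENS
    if emitted < len(buffer):            # flush the final partial chunk
        yield detokenize(buffer[emitted:])


# In deployment, each yielded chunk would be streamed back to the client.
for audio_chunk in stream_detokenize(range(130)):
    print(len(audio_chunk), "samples")
```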
6. Comparative Context and Relation to Recent Audio-LLMs
Kimi-Audio builds upon—and is compared directly to—recent open and closed-source audio-LLMs:
- Outperforms Qwen2-Audio-base, Qwen2.5-Omni, and other competitive baselines on speech and audio benchmarks.
- While Audio Flamingo 2 (Ghosh et al., 6 Mar 2025) and Audio Flamingo 3 (Goel et al., 10 Jul 2025) introduce additional curriculum-based training strategies and purpose-built encoders for reasoning and long-context comprehension, Kimi-Audio’s unified hybrid tokenization and large-scale data curation distinguish it in the SOTA landscape.
- The architectural innovations (12.5 Hz tokenizer, explicit dual representation, real-time chunk-wise detokenization) and scalable evaluation contribute significantly to performance gains against contemporaries.
A plausible implication is that future audio foundation models will further integrate continuous/discrete token synergies and prioritize massive multi-modal corpora, as exemplified by Kimi-Audio and its competitive cohort.
7. Applications and Implications
Kimi-Audio enables a wide spectrum of tasks:
- Audio Understanding: Robust handling of speech, music, and environmental sounds.
- Speech Generation and Conversation: Instruction-following, context-rich dialogue, and expressive synthesis.
- Audio Captioning and QA: Unified model supports both content description and interactive QA across modalities.
- Real-Time Interfaces: Foundation for deployable speech-to-speech systems for accessibility, multimedia retrieval, virtual agents, and more.
In summary, Kimi-Audio represents a scalable, unified approach to end-to-end audio modeling. Its architecture and extensive training pipeline offer concrete advances in multimodal audio intelligence, positioning it as a reference model for state-of-the-art audio-language research and application.