MiMo-Audio-7B-Base: Unified Audio-Language Model
- MiMo-Audio-7B-Base is a unified audio-language model that treats speech and text as a single interleaved sequence for few-shot learning.
- It employs a high-fidelity MiMo-Audio-Tokenizer with patch aggregation to convert 24 kHz waveforms into discrete tokens matched to text granularity.
- Pretraining on over 100M hours of audio allows the model to achieve competitive performance in speech understanding, speech-to-speech, and cross-modal generation.
MiMo-Audio-7B-Base is a unified audio–LLM introduced in the paper "MiMo-Audio: Audio LLMs are Few-Shot Learners" (Team et al., 29 Dec 2025). It treats speech and text as a single interleaved sequence and learns entirely via next-token prediction at scale. Its design centers on lossless flow of speech information: high-fidelity discrete audio tokens are produced by a dedicated tokenizer, aggregated into patches to match the granularity of text, and modeled autoregressively by a 7B LLM backbone. The reported pretraining regime scales beyond one hundred million hours of speech and trillions of mixed text/audio tokens, yielding few-shot capabilities across speech intelligence, audio understanding, and speech-to-speech generation tasks without task-specific finetuning (Team et al., 29 Dec 2025).
1. Model identity and system-level design
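MiMo-Audio-7B-Base is described as a unified audio–LLM in which speech and text occupy a common sequence space. The model operates on an interleaved sequence $x_1, \dots, x_N$ in which each element is either a text token or an audio patch, modeled autoregressively as
$$p_\theta(x_1, \dots, x_N) = \prod_{i=1}^{N} p_\theta(x_i \mid x_{<i})$$
with the negative log-likelihood objective
$$\mathcal{L}(\theta) = -\sum_{i=1}^{N} \log p_\theta(x_i \mid x_{<i})$$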
This formulation places speech understanding, text generation, and speech generation under a single next-token prediction paradigm (Team et al., 29 Dec 2025).
The architecture is organized around four principal components: MiMo-Audio-Tokenizer, a patch encoder, the MiMo-7B-Base LLM backbone, and a patch decoder. The tokenizer converts 24 kHz mono waveforms into discrete tokens at 25 Hz. The patch encoder aggregates groups of four consecutive 25 Hz frames into one patch, so the LLM receives audio at 6.25 Hz. The LLM backbone has 36 layers, 32 heads, model dimension 4096, FFN dimension 11008, a text vocabulary size of 151,680, and context length 8192. The patch decoder then autoregressively emits RVQ tokens per patch through independent output heads, one per codebook (Team et al., 29 Dec 2025).
A concise summary of the model’s input/output interface is as follows.
| Component | Configuration | Function |
|---|---|---|
| Tokenizer | 25 Hz discrete audio tokens | Converts waveform to RVQ token sequence |
| Patch encoder | 4 frames per patch; 6.25 Hz to LLM | Matches audio granularity to text |
| LLM backbone | 36 layers, 32 heads, dim 4096, context 8192 | Unified autoregressive modeling |
| Patch decoder | 16 layers, 64 heads, dim 1024 | Generates delayed RVQ tokens |
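The rates in the table follow from simple arithmetic, which the sketch below makes explicit. It is only an illustration: the class name `AudioRates` and its fields are hypothetical, and the patch size of 4 is taken from the ratio between the 25 Hz tokenizer rate and the 6.25 Hz LLM input rate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AudioRates:
    """Hypothetical container for the rates quoted in the text."""
    sample_rate_hz: int = 24_000   # waveform sample rate
    frame_rate_hz: float = 25.0    # tokenizer output rate
    frames_per_patch: int = 4      # 25 Hz / 6.25 Hz

    @property
    def samples_per_frame(self) -> float:
        # Waveform samples summarized by one tokenizer frame.
        return self.sample_rate_hz / self.frame_rate_hz

    @property
    def patch_rate_hz(self) -> float:
        # Rate at which audio patches reach the LLM.
        return self.frame_rate_hz / self.frames_per_patch

    @property
    def patch_duration_ms(self) -> float:
        # Audio duration covered by one LLM position.
        return 1000.0 / self.patch_rate_hz


rates = AudioRates()
print(rates.samples_per_frame, rates.patch_rate_hz, rates.patch_duration_ms)
```

Under these numbers, one LLM position covers 160 ms of audio, or 3840 waveform samples.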
A common misconception is that MiMo-Audio-7B-Base is a conventional cascaded speech system. The reported design instead places speech and text into a single autoregressive sequence, with the LLM hidden states serving either for text next-token projection or as conditioning for audio decoding (Team et al., 29 Dec 2025).
2. Audio tokenization and lossless speech representation
The tokenizer is a central part of the system because the broader architecture depends on preserving speech detail at sufficiently high fidelity. MiMo-Audio-Tokenizer has 1.2B parameters and processes a mel-spectrogram at 100 Hz. Its encoder is a Transformer with bidirectional attention, 32 layers, 20 heads, model dimension 1280, FFN dimension 5120, RoPE positional embeddings, GELU activations, and 2× downsampling at input/output. To mitigate semantic–acoustic conflict, layer-3 hidden states are added element-wise to the final encoder output. Discretization is performed with Residual Vector Quantization using 20 layers; codebook sizes are 1024 for the first two layers and 128 for layers 3–20, while downstream MiMo-Audio uses only the first 8 RVQ codebooks (Team et al., 29 Dec 2025).
The decoder is a Transformer with causal self-attention and is described as streaming-friendly, mirroring the encoder. Waveform reconstruction is handled by a Vocos-style Transformer vocoder with 16 layers, 16 heads, dimension 256, FFN 1024, RoPE, and sliding-window attention, which bounds the receptive field and supports sequence packing and streaming synthesis. With the 8 codebooks used downstream, the effective token rate is $25$ frames/s $\times\ 8$ codebooks $= 200$ discrete tokens/s, and the codebook sizes 1024/128 imply $25 \times (2 \times 10 + 6 \times 7) = 1550$ bit/s, or approximately $1.55$ kbps (Team et al., 29 Dec 2025).
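The token rate and bitrate can be checked directly from the codebook sizes. The sketch below assumes that each token costs $\log_2$ of its codebook size in bits and that only the first 8 of the 20 RVQ layers are counted, as in the downstream model; the function names are illustrative.

```python
import math

FRAME_RATE_HZ = 25
# First 8 RVQ layers: two codebooks of size 1024, then six of size 128.
CODEBOOK_SIZES = [1024, 1024] + [128] * 6


def tokens_per_second(frame_rate_hz, codebook_sizes):
    # One token per codebook per frame.
    return frame_rate_hz * len(codebook_sizes)


def bits_per_second(frame_rate_hz, codebook_sizes):
    # Each token carries log2(codebook size) bits.
    bits_per_frame = sum(math.log2(size) for size in codebook_sizes)
    return frame_rate_hz * bits_per_frame


print(tokens_per_second(FRAME_RATE_HZ, CODEBOOK_SIZES))  # 200
print(bits_per_second(FRAME_RATE_HZ, CODEBOOK_SIZES))    # 1550.0
```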
The tokenizer’s reconstruction quality is reported using only the first 8 codebooks, exactly matching what MiMo-Audio “sees.” On SEED-ZH, the values are PESQ-NB 3.30, PESQ-WB 2.71, SIM 0.89, and STOI 0.93. On SEED-EN, the values are PESQ-NB 3.02, PESQ-WB 2.43, SIM 0.85, and STOI 0.92. The paper states that these surpass GLM-4-Voice-Tokenizer, Baichuan-Audio-Tokenizer, Mimi, XY-Tokenizer, XCodec2.0, and BigCodec at similar bitrates (Team et al., 29 Dec 2025).
This representation strategy is closely tied to the model’s later few-shot behavior. The paper attributes the system’s emergent speech-to-speech capabilities to lossless compression-style pretraining over high-fidelity RVQ tokens that preserve paralinguistic detail, allowing the model to carry timbre, prosody, and environmental context in its internal state and manipulate these in-context via patch-level conditioning and delayed generation (Team et al., 29 Dec 2025).
3. Patching, delayed generation, and autoregressive audio decoding
A key engineering problem in unified audio–language modeling is the mismatch between raw audio token rate and text token granularity. MiMo-Audio-7B-Base addresses this through patching. For each 25 Hz frame $t$, embeddings from the $R = 8$ codebooks are summed as
$$e_t = \sum_{r=1}^{R} E_r(c_{t,r})$$
where $c_{t,r}$ is the RVQ token of codebook $r$ at frame $t$ and $E_r$ is the embedding table of that codebook, and the sequence within a patch is encoded by a Transformer with bidirectional attention, 6 layers, 64 heads, model dimension 1024, and FFN dimension 4096. The encoded patch is then linearly projected to the LLM input dimension (Team et al., 29 Dec 2025).
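The patching step on the input side can be illustrated with a toy sketch: per-frame embeddings are formed by summing one embedding per codebook, and consecutive frames are grouped into patches of four. The vocabulary size and embedding width below are toy values rather than the real 1024/128 codebooks and 1024-dimensional encoder, the helper names are hypothetical, and dropping a trailing incomplete patch is an assumption (padding would be an equally plausible choice). The bidirectional Transformer and the linear projection are omitted.

```python
import random

NUM_CODEBOOKS = 8      # RVQ codebooks used downstream
FRAMES_PER_PATCH = 4   # 25 Hz frames per 6.25 Hz patch
TOY_VOCAB = 16         # toy codebook size (assumption, not the real one)
TOY_DIM = 6            # toy embedding width (assumption, not the real one)


def make_embedding_tables(num_codebooks, vocab, dim, seed=0):
    # One random embedding table per codebook: tables[r][token] -> vector.
    rng = random.Random(seed)
    return [[[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(vocab)]
            for _ in range(num_codebooks)]


def embed_frame(frame_tokens, tables):
    # Sum the embeddings of the frame's tokens across codebooks.
    summed = [0.0] * len(tables[0][0])
    for table, token in zip(tables, frame_tokens):
        summed = [acc + value for acc, value in zip(summed, table[token])]
    return summed


def group_into_patches(frame_embeddings, frames_per_patch=FRAMES_PER_PATCH):
    # Assumption: a trailing incomplete patch is dropped.
    usable = len(frame_embeddings) - len(frame_embeddings) % frames_per_patch
    return [frame_embeddings[i:i + frames_per_patch]
            for i in range(0, usable, frames_per_patch)]


tables = make_embedding_tables(NUM_CODEBOOKS, TOY_VOCAB, TOY_DIM)
token_rng = random.Random(1)
frames = [[token_rng.randrange(TOY_VOCAB) for _ in range(NUM_CODEBOOKS)]
          for _ in range(10)]
patches = group_into_patches([embed_frame(f, tables) for f in frames])
print(len(patches), len(patches[0]), len(patches[0][0]))  # 2 4 6
```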
On the output side, the patch decoder is a Transformer with causal masking, 16 layers, 64 heads, model dimension 1024, and FFN dimension 4096. It autoregressively emits RVQ tokens per patch through independent output heads for the eight codebooks. To handle inter-layer dependencies among RVQ tokens, the decoder uses a layer-delayed generation pattern in which the token of codebook $r$ is shifted by a delay of $d_r$ steps within the patch:
$$\tilde{c}_{s,r} = c_{s - d_r,\, r}$$
With patch size $G = 4$ and maximum delay $d_{\max} = \max_r d_r$, the delayed sequence length per patch becomes $G + d_{\max}$, which matches the patch decoder context length reported in the model configuration (Team et al., 29 Dec 2025).
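A layer-delayed layout can be sketched as a simple shift of each codebook's tokens within the patch. The example below is a minimal illustration under stated assumptions: the delays are taken to be one step per codebook (0, 1, ..., 7), unfilled positions hold a hypothetical `PAD` marker, and the helper names are not from the paper. The model's actual delay values and padding convention may differ.

```python
PAD = -1  # hypothetical marker for positions with no token


def apply_delay(patch_tokens, delays, pad=PAD):
    """Shift codebook r by delays[r] steps: delayed[t + delays[r]][r] = patch_tokens[t][r]."""
    num_frames = len(patch_tokens)
    length = num_frames + max(delays)
    delayed = [[pad] * len(delays) for _ in range(length)]
    for t, frame in enumerate(patch_tokens):
        for r, token in enumerate(frame):
            delayed[t + delays[r]][r] = token
    return delayed


def undo_delay(delayed, delays, num_frames):
    """Recover the frame-aligned tokens from the delayed layout."""
    return [[delayed[t + delays[r]][r] for r in range(len(delays))]
            for t in range(num_frames)]


# Toy patch: 4 frames x 8 codebooks, token value encodes (frame, codebook).
patch = [[100 * t + r for r in range(8)] for t in range(4)]
assumed_delays = list(range(8))  # assumption: one extra step per codebook

delayed = apply_delay(patch, assumed_delays)
print(len(delayed))  # 4 frames + max delay 7 = 11 steps
```

With this layout, at every decoding step the head for codebook $r$ can condition on the tokens that lower-index codebooks already produced for the same frame, which is the usual motivation for delay patterns over RVQ layers.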
This design has direct consequences for sequence length. The LLM context length is 8192 positions. With audio-only input at 6.25 Hz patches, and each patch corresponding to 160 ms, the theoretical maximum audio context is approximately $8192 \times 0.16\ \text{s} \approx 1311$ s, or about 21.8 minutes, reduced if text is interleaved. The tokenizer’s decoder and the vocoder are designed for streaming through causal attention and sliding-window mechanisms, whereas the patch encoder uses bidirectional attention and is therefore non-streaming during understanding. For real-time applications, the generation pipeline nonetheless supports incremental audio production (Team et al., 29 Dec 2025).
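The context budget can be estimated with a short helper. This sketch assumes that every text token and every audio patch occupies exactly one of the 8192 positions and ignores any special or separator tokens; the function name and its parameters are illustrative.

```python
CONTEXT_LEN = 8192     # LLM context length in positions
PATCH_RATE_HZ = 6.25   # audio patches per second


def max_audio_seconds(text_tokens=0, context_len=CONTEXT_LEN,
                      patch_rate_hz=PATCH_RATE_HZ):
    """Seconds of audio that fit once `text_tokens` positions are used by text."""
    if not 0 <= text_tokens <= context_len:
        raise ValueError("text_tokens must be between 0 and context_len")
    return (context_len - text_tokens) / patch_rate_hz


audio_only = max_audio_seconds()
with_prompt = max_audio_seconds(text_tokens=512)
print(f"{audio_only:.2f} s = {audio_only / 60:.1f} min")
print(f"{with_prompt:.2f} s with a 512-token text prompt")
```

For example, a 512-token text prompt leaves room for roughly 1229 s of audio instead of roughly 1311 s.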
The paper does not specify temperature or top-$p$ settings for decoding. It states only that users can apply standard LLM sampling parameters for text and similar sampling for per-codebook logits in audio generation (Team et al., 29 Dec 2025).
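Since no decoding settings are given, the following is only a generic sketch of temperature plus nucleus sampling applied independently to each codebook's logits. The function `sample_top_p`, its default values, and the toy logits are assumptions for illustration and are not taken from the paper or its repository.

```python
import math
import random


def sample_top_p(logits, temperature=1.0, top_p=1.0, rng=random):
    """Sample one index from `logits` using temperature and nucleus (top-p) filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    probs = [value / total for value in exps]

    # Keep the smallest set of most likely tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for index in order:
        kept.append(index)
        mass += probs[index]
        if mass >= top_p:
            break
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


# One frame: sample each of 8 codebooks from its own (toy) logits.
rng = random.Random(0)
toy_logits = [[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(8)]
frame_tokens = [sample_top_p(l, temperature=0.8, top_p=0.9, rng=rng)
                for l in toy_logits]
print(frame_tokens)
```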
4. Training objectives, stages, and data curation
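The tokenizer is trained in two stages. Stage 1 combines an audio-to-text objective with multi-scale mel reconstruction and RVQ commitment loss. The audio-to-text term is
$$\mathcal{L}_{\text{A2T}} = -\sum_{i=1}^{N} \log p(t_i \mid Q, t_{<i})$$
where $Q$ is the quantized audio representation and $t_1, \dots, t_N$ is the target text sequence. The reconstruction loss is
$$\mathcal{L}_{\text{recon}} = \sum_{i \in e} \lVert \mathcal{S}_{2^i}(x) - \mathcal{S}_{2^i}(\hat{x}) \rVert_1$$
with $e$ the set of scales and $\mathcal{S}_{2^i}$ the mel-spectrogram computed via normalized STFT with window size $2^i$ and hop length $2^i/4$; $x$ and $\hat{x}$ denote the input and reconstructed waveforms. The total Stage-1 objective is
$$\mathcal{L}_{\text{stage1}} = \lambda_{\text{A2T}}\mathcal{L}_{\text{A2T}} + \lambda_{\text{recon}}\mathcal{L}_{\text{recon}} + \lambda_{\text{commit}}\mathcal{L}_{\text{commit}}$$
with scalar weights $\lambda_{\text{A2T}}$, $\lambda_{\text{recon}}$, and $\lambda_{\text{commit}}$ (Team et al., 29 Dec 2025).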
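Stage 2 for the tokenizer adds adversarial fine-tuning with Multi-Period Discriminator and Multi-Scale STFT discriminators under a Hinge-GAN objective. Writing $D_k$ for the $k$-th of $K$ discriminators, $x$ for the input waveform, and $\hat{x}$ for its reconstruction, the discriminator loss is
$$\mathcal{L}_{D} = \frac{1}{K}\sum_{k=1}^{K}\left[\max\big(0,\, 1 - D_k(x)\big) + \max\big(0,\, 1 + D_k(\hat{x})\big)\right]$$
the adversarial loss is
$$\mathcal{L}_{\text{adv}} = \frac{1}{K}\sum_{k=1}^{K}\max\big(0,\, 1 - D_k(\hat{x})\big)$$
and the feature matching loss is
$$\mathcal{L}_{\text{FM}} = \frac{1}{KL}\sum_{k=1}^{K}\sum_{l=1}^{L}\lVert D_k^{l}(x) - D_k^{l}(\hat{x})\rVert_1$$
where $D_k^{l}$ denotes the feature map of layer $l$ in discriminator $k$. The generator loss is
$$\mathcal{L}_{G} = \lambda_{\text{recon}}\mathcal{L}_{\text{recon}} + \lambda_{\text{adv}}\mathcal{L}_{\text{adv}} + \lambda_{\text{FM}}\mathcal{L}_{\text{FM}}$$
with $\mathcal{L}_{\text{recon}}$ the mel reconstruction loss from Stage 1 and scalar weights $\lambda_{\text{recon}}$, $\lambda_{\text{adv}}$, and $\lambda_{\text{FM}}$ (Team et al., 29 Dec 2025).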
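For readers unfamiliar with hinge-style adversarial training, the sketch below computes hinge losses on plain lists of discriminator scores. The discriminator form is the standard hinge loss. The generator form shown, $\max(0, 1 - D(\hat{x}))$, is one common convention in neural audio codecs and is an assumption here; another widely used variant is $-D(\hat{x})$. Function names are illustrative.

```python
def hinge_discriminator_loss(real_scores, fake_scores):
    """Standard hinge loss: push real scores above +1 and fake scores below -1."""
    real_term = sum(max(0.0, 1.0 - s) for s in real_scores) / len(real_scores)
    fake_term = sum(max(0.0, 1.0 + s) for s in fake_scores) / len(fake_scores)
    return real_term + fake_term


def hinge_generator_loss(fake_scores):
    """Assumed generator variant: penalize fake scores that fall below +1."""
    return sum(max(0.0, 1.0 - s) for s in fake_scores) / len(fake_scores)


# A discriminator that separates real from fake with margin incurs zero loss.
print(hinge_discriminator_loss([2.0, 1.5], [-2.0, -1.0]))  # 0.0
# An undecided discriminator (all scores 0) pays the full margin on both terms.
print(hinge_discriminator_loss([0.0], [0.0]))              # 2.0
print(hinge_generator_loss([0.0]))                         # 1.0
```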
MiMo-Audio pretraining itself proceeds in two stages starting from MiMo-7B-Base. Stage 1 is understanding-only: patch encoder plus LLM are trained for speech understanding, with loss computed only on text tokens. The data volume is 2.6T tokens total, comprising 1.2T text and 1.4T speech-related tokens at 6.25 Hz. Tasks include speech–text interleaved training, ASR, audio captioning, and text-only pretraining. The patch encoder and the LLM use separate learning rates, with a constant LR schedule, batch size 16.8M tokens, warmup ratio 0.01, and LLM context 8192 (Team et al., 29 Dec 2025).
Stage 2 is understanding–generation joint training: patch encoder, LLM, and patch decoder are trained with loss on both text and audio tokens. The data scale is 5T tokens total, with 2.6T text and 2.4T audio. Tasks include speech continuation, speech–text interleaved training, ASR, TTS, audio captioning, instruction-following TTS, and text pretraining. Text-guided interleaving with a fixed 5:5 ratio is used during generation; after text completion, speech tokens are generated to finish the turn. Loss weights are 100 for text and 12, 8, 6, 4, 2, 2, 1, 1 for the eight RVQ heads. Separate learning rates are used for the patch encoder/decoder and for the LLM, with a cosine LR schedule, batch size 16.8M tokens, warmup ratio 0.01, and a per-codebook delay pattern across the eight RVQ heads (Team et al., 29 Dec 2025).
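The Stage-2 loss weighting can be illustrated as a weighted sum of per-stream losses. The weights are those quoted above; how the per-stream losses are averaged (per token, per sequence) and whether the sum is normalized is not stated, so the plain weighted sum below is an assumption, as is the function name.

```python
TEXT_WEIGHT = 100.0
RVQ_WEIGHTS = [12.0, 8.0, 6.0, 4.0, 2.0, 2.0, 1.0, 1.0]  # codebooks 1..8


def combined_loss(text_nll, rvq_nlls):
    """Weighted sum of the text loss and the eight per-codebook audio losses."""
    if len(rvq_nlls) != len(RVQ_WEIGHTS):
        raise ValueError("expected one loss per RVQ codebook")
    audio_term = sum(w * loss for w, loss in zip(RVQ_WEIGHTS, rvq_nlls))
    return TEXT_WEIGHT * text_nll + audio_term


# With every stream at a loss of 1.0, text contributes 100 and audio 36 in total.
print(combined_loss(1.0, [1.0] * 8))  # 136.0
```

The decreasing weights place most of the audio emphasis on the first codebooks, which in residual quantization carry the coarsest part of the signal.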
The reported pretraining corpus contains over 100 million hours of in-the-wild speech audio, described as an order of magnitude larger than prior open-source speech models. The sources include podcasts, audiobooks, news, interviews, and conference recordings, spanning daily conversation, entertainment, business, arts, culture, and science. The processing pipeline includes audio normalization, speaker diarization, voice activity detection, ASR, and audio quality assessment. Automated annotation adds semantic scores for conversational quality, knowledge density, and logical reasoning using text quality assessment applied to ASR transcripts, together with non-semantic descriptions of timbre, emotion, and background generated by an audio captioner. Low-quality, noisy, or unsafe segments are filtered, after which sampling is performed by integrated semantic and non-semantic scoring to favor high-quality coverage. The paper does not provide detailed licensing terms, speaker counts, or per-domain hour breakdowns (Team et al., 29 Dec 2025).
5. Few-shot evaluation protocol and benchmark performance
The evaluation protocol is explicitly GPT-3–style few-shot evaluation without parameter updates. For SpeechMMLU, the benchmark contains 8,549 entries across 34 subjects and supports controlled splits enabling T2T, S2T, T2S, and S2S, each evaluated with 5-shot prompts. For MMAU, the setup is also 5-shot, using audio+text input and text output across speech, environmental sounds, and music. For speech-to-speech generation, the reported protocol uses 16-shot prompts conditioned only on speech exemplars placed in context; tasks include voice conversion, emotion conversion, speaking rate conversion, speech denoising, and En→Zh speech translation (Team et al., 29 Dec 2025).
On SpeechMMLU, MiMo-Audio-7B-Base reports S2S 69.1, S2T 69.5, T2S 71.5, and T2T 72.5. The paper states that S2S 69.1 is highest among open-source baselines, with gains of +57.3 over Kimi-Audio-Base (11.8), +17.3 over Step-Audio2-mini-Base (51.8), and +37.2 over Baichuan-Audio-Base (31.9). For S2T, the score is +1.6 over Kimi (67.9), +1.7 over Step-Audio2 (67.8), and +39.6 over Baichuan (29.9). For T2S, the score is +8.1 over Step-Audio2 (63.4), +71.5 over Kimi (0.0), and +54.8 over Baichuan (16.7). T2T 72.5 is described as retaining text performance close to Step-Audio2 (74.1) and Kimi (70.7). The modality gap, defined as T2T minus S2S, is 3.4 for MiMo-Audio, compared with 22.3 for Step-Audio2, 58.9 for Kimi, and 39.2 for Baichuan, which the paper interprets as indicating strong cross-modal consistency (Team et al., 29 Dec 2025).
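The gains and modality gaps are simple differences of the listed scores, so they can be recomputed as a consistency check. The sketch below uses only scores stated in the text; Baichuan's T2T score is not listed, so its gap is not recomputed. The helper names are illustrative.

```python
SCORES = {
    "MiMo-Audio-7B-Base": {"S2S": 69.1, "S2T": 69.5, "T2S": 71.5, "T2T": 72.5},
    "Step-Audio2-mini-Base": {"S2S": 51.8, "S2T": 67.8, "T2S": 63.4, "T2T": 74.1},
    "Kimi-Audio-Base": {"S2S": 11.8, "S2T": 67.9, "T2S": 0.0, "T2T": 70.7},
    "Baichuan-Audio-Base": {"S2S": 31.9, "S2T": 29.9, "T2S": 16.7},  # T2T not listed
}
REFERENCE = "MiMo-Audio-7B-Base"


def gain_over(baseline, setting):
    """Score difference between the reference model and a baseline."""
    return round(SCORES[REFERENCE][setting] - SCORES[baseline][setting], 1)


def modality_gap(model):
    """T2T minus S2S, or None when the T2T score is unavailable."""
    row = SCORES[model]
    if "T2T" not in row:
        return None
    return round(row["T2T"] - row["S2S"], 1)


for name in SCORES:
    if name != REFERENCE:
        gains = {s: gain_over(name, s) for s in ("S2S", "S2T", "T2S")}
        print(name, gains, "gap:", modality_gap(name))
print(REFERENCE, "gap:", modality_gap(REFERENCE))
```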
On MMAU, MiMo-Audio-7B-Base reports 66.0 overall, described as best among open-source models, with gains of +5.7 over Step-Audio2-mini-Base (60.3), +37.4 over Kimi-Audio-Base (28.6), and +40.1 over Baichuan-Audio-Base (25.9). The subdomain breakdown is Speech 67.6, Sound 65.2, and Music 65.3. Step-Audio2 leads in Sound at 67.9 but lags in Speech at 55.0 and Music at 58.1 (Team et al., 29 Dec 2025).
These results are presented as evidence that the model’s unified formulation does not merely preserve text capability while adding speech handling. Rather, it yields comparatively small degradation across modalities, especially in the reported modality gap analysis. A plausible implication is that the shared interleaved sequence representation narrows the difference between text-native and speech-native reasoning settings, although the paper confines its concrete claim to benchmark results and gap values (Team et al., 29 Dec 2025).
6. Emergent few-shot behavior and speech-to-speech generalization
The paper places particular emphasis on abilities absent from the training data. It reports a phase transition with scale: once pretraining exceeds approximately 0.7T tokens, performance in 5-shot SpeechMMLU T2S and S2S, 16-shot Voice Conversion, and 16-shot Speech-to-Speech Translation rises sharply from near-zero baselines and then stabilizes, described as consistent with the emergent abilities literature (Team et al., 29 Dec 2025).
The reported speech-to-speech tasks are all evaluated in-context rather than via task-specific finetuning. In the 16-shot setup, the model is conditioned only on speech exemplars placed in context. The tasks include voice conversion, defined as A→B timbre with input semantics preserved; emotion conversion, defined as same speaker, emotion A→B; speaking rate conversion, defined as same speaker, rate A→B; speech denoising, defined as noisy→clean; and speech translation, defined as En→Zh (Team et al., 29 Dec 2025). A common misconception is that such transformations necessarily require dedicated supervised training for each task. The reported protocol instead treats them as few-shot in-context behaviors emerging from general pretraining (Team et al., 29 Dec 2025).
The paper’s qualitative evidence also extends beyond conventional benchmark tasks. It states that MiMo-Audio-7B-Base demonstrates powerful speech continuation capabilities, including highly realistic talk shows, recitations, livestreaming, and debates. More specific examples listed in the paper include realistic talk shows with crowd reactions, debates with multi-speaker consistency and viewpoint coherence, recitations with emotion and professional cadence, livestreams and teaching with expressiveness, colloquial fillers, and volume modulation, as well as singing with melodic consistency and dialect continuation (Team et al., 29 Dec 2025).
The proposed mechanism is that high-fidelity RVQ tokens preserve paralinguistic detail and allow the model to manipulate timbre, prosody, and environmental context through patch-level conditioning and delayed generation. This suggests that the system’s few-shot speech transformations are linked not only to dataset scale but also to the decision to model relatively lossless discrete speech representations rather than heavily compressed semantic tokens (Team et al., 29 Dec 2025).
7. Inference, reproducibility, safety, and limitations
The reported high-level inference workflow begins by preparing audio input: resampling to 24 kHz mono, computing a mel-spectrogram at 100 Hz, passing it through the MiMo-Audio-Tokenizer encoder and RVQ to obtain 8-layer tokens at 25 Hz, and grouping them into patches of 4 frames for 6.25 Hz input to the LLM. Text tokens and audio patch embeddings are then concatenated into the interleaved sequence, up to 8192 positions, with task-specific prompting and few-shot exemplars as needed. Text outputs are produced by projecting LLM hidden states to text logits and sampling with typical LLM settings; audio outputs are produced by the patch decoder, which generates delayed RVQ tokens following the layer-delay pattern, after which the tokenizer decoder and vocoder reconstruct waveform audio. Streaming is supported by the causal decoder and vocoder (Team et al., 29 Dec 2025).
The code, checkpoints, and evaluation suite are reported as available through the project repository at https://github.com/XiaomiMiMo/MiMo-Audio. For replication, the paper specifies the principal training hyperparameters: Stage 1 uses separate learning rates for the patch encoder and the LLM, a constant LR schedule, and batch size 16.8M tokens; Stage 2 uses separate learning rates for the patch encoder/decoder and the LLM, a cosine LR schedule, batch size 16.8M tokens, loss weights of 100 for text and 12, 8, 6, 4, 2, 2, 1, 1 for the eight RVQ heads, and a per-codebook delay pattern (Team et al., 29 Dec 2025).
The paper does not report VRAM requirements or official quantization options for the 7B model. It states that users should expect typical 7B LLM inference footprints, with additional memory for the patch modules and vocoder, and that mixed precision and tensor parallelism are recommended for long audio contexts (Team et al., 29 Dec 2025). It also does not specify license terms for released assets, advising consultation of the GitHub repository for model and dataset licensing (Team et al., 29 Dec 2025).
Safety and ethics are treated primarily in relation to data filtering and misuse risk. The data pipeline includes automated quality filters and unsafe-content removal, and both semantic and non-semantic labels are used to target coverage and diversity. At the same time, the paper states that voice conversion and cloning risks are inherent to high-fidelity generative models. The reported mitigation consists of data filtering and a proposal for future reinforcement learning alignment; the text further notes that ethical safeguards such as consent, watermarking, and policy filters should be applied (Team et al., 29 Dec 2025).
The model’s limitations are explicitly stated. For the base model, in-context generation can struggle with background music mixing and complex sound event handling, and speech continuation is strong but not universally robust. For the post-trained instruct variant, the paper reports timbre discontinuities, unstable audio quality, mispronunciations, especially for symbols and formulas, inconsistent style control, and a trade-off in which “thinking” helps speech understanding but may induce hallucinations in sound and music tasks (Team et al., 29 Dec 2025). These caveats place the reported few-shot and generative results in a more precise context: the system demonstrates broad emergent capability, but not uniform robustness across all acoustic conditions and output regimes.