MinMo: Multimodal Voice Interaction LLM
- MinMo is a multimodal large language model that integrates voice and text processing with 8B parameters for real-time, full-duplex interactions.
- It employs a multi-stage alignment framework combining speech-to-text, text-to-speech, and full-duplex modules to achieve state-of-the-art performance across various voice benchmarks.
- Its autoregressive streaming voice decoder and full-duplex predictor enable low-latency, controllable speech synthesis and comprehension with precise style-following capabilities.
MinMo is a multimodal LLM with approximately 8 billion parameters, specifically designed for seamless and natural voice interactions across speech and text. Utilizing a multi-stage alignment framework, MinMo integrates a speech front-end, a text LLM backbone, an autoregressive streaming voice decoder, and a full-duplex interaction module, enabling real-time, full-duplex, and fine-grained controllable speech comprehension and generation. Trained on 1.4 million hours of diverse speech data spanning ASR, TTS, speaker, and style tasks, MinMo achieves state-of-the-art performance on a broad set of voice benchmarks, surpassing both native and aligned predecessor models while retaining high-level text LLM capabilities (Chen et al., 10 Jan 2025).
1. Model Architecture and Multimodal Alignment
MinMo is built as an aligned multimodal model, maintaining the core capabilities of a pre-trained text LLM while incorporating speech processing through dedicated speech-to-text (S2T) and text-to-speech (T2S) pathways. The principal components and parameter allocation are as follows:
| Module | Component Model | Parameters (M) |
|---|---|---|
| Voice Encoder | SenseVoice-Large | 636 |
| Input Projector | CNN + 2 Transformers | 170 |
| Text LLM Backbone | Qwen2.5-7B-Instruct | 7,000 |
| Output Projector | Linear | 6 |
| Voice Token LM | CosyVoice 2 LM | 370 |
| Full-Duplex Predictor | 1 Transformer + softmax | 18 |
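As a quick consistency check, the per-module counts in the table can be tallied to confirm the "approximately 8 billion" figure. The sketch below copies the values from the table; the helper names `total_params_billions` and `share_of_total` are illustrative and do not come from the source.

```python
# Parameter counts in millions, copied from the table above.
PARAMS_M = {
    "Voice Encoder (SenseVoice-Large)": 636,
    "Input Projector": 170,
    "Text LLM Backbone (Qwen2.5-7B-Instruct)": 7000,
    "Output Projector": 6,
    "Voice Token LM (CosyVoice 2 LM)": 370,
    "Full-Duplex Predictor": 18,
}


def total_params_billions(params_m):
    """Sum per-module counts (in millions) and convert to billions."""
    return sum(params_m.values()) / 1000.0


def share_of_total(params_m, name):
    """Fraction of all parameters held by the named module."""
    return params_m[name] / sum(params_m.values())


if __name__ == "__main__":
    print(f"total: {total_params_billions(PARAMS_M):.1f}B parameters")
    for module in PARAMS_M:
        print(f"{module}: {share_of_total(PARAMS_M, module):.1%}")
```

The total comes to 8,200M, so the text backbone accounts for the large majority of the model, and the speech-specific modules together add roughly 1.2B parameters.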
The speech understanding pathway routes incoming voice waveforms through the Voice Encoder and Input Projector to create hidden queries, which are then processed by Qwen2.5-7B-Instruct (updated via LoRA). Speech generation takes LLM text tokens and hidden states, passes them through the Output Projector and the Voice Token LM, producing discrete speech tokens that are synthesized to waveform output.
A distinctive feature is the full-duplex predictor, which operates on the LLM's last-layer hidden states and makes a frame-level decision between continuing to speak and stopping to listen. This decision is crucial for simultaneous two-way conversation.
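The frame-level decision can be illustrated with a minimal sketch. It assumes the predictor emits two logits per frame and that the larger softmax probability wins. The label strings, the function names, and the example logits are all hypothetical, and the Transformer layer that would produce the logits from hidden states is not modelled here.

```python
import math

# Hypothetical label names for the two-way frame decision.
LABELS = ("continue_speaking", "stop_and_listen")


def softmax(logits):
    """Numerically stable softmax over a short list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def frame_decisions(frame_logits):
    """Map each frame's two logits to a label via softmax and argmax."""
    decisions = []
    for logits in frame_logits:
        probs = softmax(logits)
        decisions.append(LABELS[probs.index(max(probs))])
    return decisions


def first_stop_frame(decisions):
    """Return the index of the first 'stop_and_listen' frame, or None."""
    for index, label in enumerate(decisions):
        if label == "stop_and_listen":
            return index
    return None


if __name__ == "__main__":
    # Placeholder logits; a real system would compute these from hidden states.
    logits = [[2.0, 0.1], [1.2, 0.9], [0.3, 1.5]]
    labels = frame_decisions(logits)
    print(labels, first_stop_frame(labels))
```

In a deployed system the stop decision would likely be smoothed over several frames to avoid reacting to a single noisy prediction, but the source does not describe such a policy.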
2. Autoregressive Voice Decoder and Latency
MinMo introduces a novel streaming autoregressive (AR) Transformer-based voice decoder, which interleaves semantic vectors from the LLM (5 per block) with speech tokens (15 per block) at a fixed $5 : 15$ ratio. These blocks are presented to the Voice Token LM in a streaming format using teacher forcing. The output is chunked by the Token2Wav synthesizer (employing flow-matching plus a vocoder) to produce waveform audio.
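The 5:15 interleaving can be sketched as follows. This is an illustration of the block layout only: the function name `interleave_blocks`, the `("sem", ...)` and `("speech", ...)` tags, and the handling of a trailing partial block are assumptions, not details taken from the source.

```python
def interleave_blocks(semantic_vectors, speech_tokens, n_sem=5, n_speech=15):
    """Lay out inputs as repeated blocks of n_sem semantic items
    followed by n_speech speech tokens.

    Leftover items that do not fill a whole block are appended in the
    same order (an assumption; the source does not specify this case).
    """
    sequence = []
    s, t = 0, 0
    while s < len(semantic_vectors) or t < len(speech_tokens):
        sequence.extend(("sem", v) for v in semantic_vectors[s:s + n_sem])
        sequence.extend(("speech", tok) for tok in speech_tokens[t:t + n_speech])
        s += n_sem
        t += n_speech
    return sequence


if __name__ == "__main__":
    # Integers stand in for semantic vectors and speech-token ids.
    sem = list(range(10))
    speech = list(range(100, 130))
    layout = interleave_blocks(sem, speech)
    print([kind for kind, _ in layout[:20]])
```

Because each block needs only five semantic vectors before speech tokens can follow, speech generation can start as soon as the first few text tokens are available rather than after the full response.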
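The theoretical end-to-end latency for one block is given by:
$$L_{\text{block}} = 5\,d_{\text{llm}} + 15\,d_{\text{lm}} + 15\,d_{\text{syn}}$$
where $d_{\text{llm}}$ is the time to generate one text token, $d_{\text{lm}}$ is the time to generate one speech token, and $d_{\text{syn}}$ is the waveform synthesis time per speech token.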
Full-duplex theoretical latency is approximately 600 ms, with an empirical end-to-end latency of 800 ms when including OS and I/O overhead, both measured on L20 GPUs with BF16 precision (Chen et al., 10 Jan 2025).
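A small calculation shows how the per-token costs add up to a block latency. It assumes the block latency is the cost of 5 text tokens, plus 15 speech tokens, plus waveform synthesis for those 15 speech tokens. The per-token timings below are hypothetical placeholders: 20 ms per text token matches the roughly 100 ms reported for 5 tokens, while the speech-token and synthesis values are chosen only to land near the reported 600 ms.

```python
def block_latency_ms(d_llm, d_lm, d_syn, n_text=5, n_speech=15):
    """Latency of one block in milliseconds.

    d_llm: time per text token, d_lm: time per speech token,
    d_syn: synthesis time per speech token (all in ms).
    """
    return n_text * d_llm + n_speech * (d_lm + d_syn)


if __name__ == "__main__":
    # Hypothetical timings, not measurements from the source.
    latency = block_latency_ms(d_llm=20.0, d_lm=20.0, d_syn=13.0)
    print(f"theoretical block latency: {latency:.0f} ms")
```

Since the speech-token terms are multiplied by 15, reducing the per-speech-token cost has a larger effect on latency than reducing the per-text-token cost by the same amount.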
3. Multi-Stage Training Procedure
The training pipeline for MinMo spans four alignment stages, leveraging approximately 1.4 million hours of speech data:
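- Stage 1: Speech-to-Text Alignment. Pre-alignment (120k h) updates only the Input Projector. Full alignment (1.2M h) trains the Voice Encoder and Input Projector with the LLM frozen. LoRA-based instruction fine-tuning then adapts the LLM to cross-modal prompts. The objective is the standard next-token cross-entropy, $\mathcal{L}_{\text{S2T}} = -\sum_{t} \log p_\theta(y_t \mid y_{<t}, X)$, where $X$ is the input speech and $y_t$ is the $t$-th target text token.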
- Stage 2: Text-to-Speech Alignment. Uses 170k h of basic TTS and 1k h of instruction-controlled TTS. The Output Projector is first trained alone, then jointly with the Voice Token LM (LLM and Voice Encoder frozen), using cross-entropy loss over speech tokens.
- Stage 3: Speech-to-Speech Alignment. Trains on 10k h of simulated multi-turn conversations plus 100 h of style-controllable speech. Only the Output Projector and Voice Token LM are trained, with style control embeddings concatenated.
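- Stage 4: Duplex Interaction Alignment. Uses 4k h of real and simulated long-form dialogs. Only the Full-Duplex Predictor is trained, with a frame-wise classification loss, $\mathcal{L}_{\text{duplex}} = -\sum_{t} \log p(c_t \mid h_t)$, where $h_t$ is the LLM's last-layer hidden state at frame $t$ and $c_t$ is the continue-or-stop label for that frame.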
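The staged schedule above amounts to a table of which modules are trainable in each phase. The sketch below encodes that table as plain Python data. The module identifiers and phase names are hypothetical, the base LLM weights are treated as always frozen with only LoRA adapters modelled, and the instruction fine-tuning phase lists only the LoRA adapters because the text does not say whether other modules are also updated there.

```python
# Hypothetical identifiers for the trainable parts described in the text.
ALL_MODULES = frozenset({
    "voice_encoder",
    "input_projector",
    "llm_lora",
    "output_projector",
    "voice_token_lm",
    "full_duplex_predictor",
})

# (phase name, modules updated in that phase), in training order.
SCHEDULE = [
    ("stage1/pre-alignment", {"input_projector"}),
    ("stage1/full-alignment", {"voice_encoder", "input_projector"}),
    ("stage1/instruction-finetune", {"llm_lora"}),
    ("stage2/projector-only", {"output_projector"}),
    ("stage2/joint", {"output_projector", "voice_token_lm"}),
    ("stage3/speech-to-speech", {"output_projector", "voice_token_lm"}),
    ("stage4/duplex", {"full_duplex_predictor"}),
]


def frozen_modules(trainable):
    """Modules that stay fixed when only `trainable` is updated."""
    return ALL_MODULES - set(trainable)


def validate(schedule):
    """Raise ValueError if a phase names a module outside ALL_MODULES."""
    for phase, trainable in schedule:
        unknown = set(trainable) - ALL_MODULES
        if unknown:
            raise ValueError(f"{phase}: unknown modules {sorted(unknown)}")


if __name__ == "__main__":
    validate(SCHEDULE)
    for phase, trainable in SCHEDULE:
        print(f"{phase}: train {sorted(trainable)}, "
              f"freeze {sorted(frozen_modules(trainable))}")
```

In a training framework, each phase would typically translate into toggling gradient updates for the listed modules and leaving the rest fixed.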
Summary of Training Data
| Task | Hours (k) |
|---|---|
| Speech-to-Text | 1,200 |
| Text-to-Speech | 171 |
| Speech-to-Speech | 10 + 0.1 |
| Duplex | 4 |
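Summing the rows of the table gives a quick check against the "approximately 1.4 million hours" figure. The sketch uses the table values as-is; whether the 120k h pre-alignment subset is counted inside the 1,200k h speech-to-text row is not stated, so it is not added separately here. The dictionary keys are illustrative names.

```python
# Hours in thousands, copied from the table above.
HOURS_K = {
    "speech_to_text": 1200.0,
    "text_to_speech": 171.0,
    "speech_to_speech": 10.0 + 0.1,
    "duplex": 4.0,
}


def total_hours_millions(hours_k):
    """Sum per-task hours (in thousands) and convert to millions."""
    return sum(hours_k.values()) / 1000.0


if __name__ == "__main__":
    print(f"total: {total_hours_millions(HOURS_K):.2f}M hours")
```

The speech-to-text row dominates the total, which reflects how much more data the comprehension alignment uses than the generation and duplex stages.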
4. Performance Evaluation and Benchmarks
MinMo demonstrates state-of-the-art results across multilingual ASR, speech-to-text translation, spoken question answering, TTS, and voice style control.
- Voice Comprehension
- Voice Generation and Style Control
  - Seed-TTS, English WER: 2.90%
  - Style-following on 122-turn, 12-style test: MinMo 98.4%, GLM-4-Voice 63.1%
| Style Group | GLM-4-Voice | MinMo |
|---|---|---|
| Emotion | 75.6% | 97.6% |
| Dialect | 42.9% | 100% |
| Speaking rate | 80.0% | 100% |
| Role-play | 70.4% | 96.3% |
| Default | 88.2% | 96.3% |
| Total | 63.1% | 98.4% |
- Latency
  - Speech-to-text: ~100 ms (for 5 tokens)
  - Theoretical full-duplex: ~600 ms
  - Practical full-duplex: ~800 ms
On the radar chart comparison (Figure 1), MinMo outperforms native (Moshi, GLM-4-Voice) and aligned (LLaMA-Omni, Freeze-Omni, Mini-Omni2) models on ASR, S2TT, SER, LID, and VSC, while retaining text LLM capabilities (Chen et al., 10 Jan 2025).
5. Instruction-Following and Controllable Voice Synthesis
MinMo supports nuanced, instruction-driven speech synthesis using learned style embeddings concatenated to LLM hidden states. The embeddings are conditioned on explicit user instructions covering emotion, dialect, speaking rate, and speaker mimicry, and are learned from the 1k h of instruction-controlled TTS data and the 100 h of style-controllable speech-to-speech data.
Examples of prompt control include:
- “Please speak very fast: …” resulting in an accelerated utterance.
- “Speaking with a tone of sadness: I miss my dear friend…” producing corresponding emotional prosody.
This mechanism enables MinMo to follow verbal instructions with high fidelity, as shown in style-following benchmarks.
6. Limitations and Directions for Enhancement
- Text-Instruction Following: MinMo updates the LLM only through LoRA during cross-modal fine-tuning, and its instruction following on text-only input remains weaker. Further text-only fine-tuning is needed to recover the full instruction-following capacity of the LLM.
- Pronunciation Errors: Residual long-tail errors in end-to-end speech generation suggest room for improvement via a more balanced token vocabulary or additional diverse training data.
- Style-Control Efficiency: Performance may be further enhanced by expanding the scale and diversity of style instruction datasets and increasing embedding complexity.
- Duplex Integration: Presently, full-duplex mode relies on external AEC/VAD modules; full end-to-end duplex processing remains a target for future development (Chen et al., 10 Jan 2025).
MinMo establishes that an aligned architecture, multi-stage speech-text alignment at scale, and an AR streaming voice decoder can together give a text LLM state-of-the-art performance in voice interaction, generation, duplex conversation, and fine-grained controllability within an 8B-parameter framework.