MinMo: Multimodal Voice Interaction LLM

Updated 1 July 2026

MinMo is a multimodal large language model that integrates voice and text processing with 8B parameters for real-time, full-duplex interactions.
It employs a multi-stage alignment framework combining speech-to-text, text-to-speech, and full-duplex modules to achieve state-of-the-art performance across various voice benchmarks.
Its autoregressive streaming voice decoder and full-duplex predictor enable low-latency, controllable speech synthesis and comprehension with precise style-following capabilities.

MinMo is a multimodal LLM with approximately 8 billion parameters, specifically designed for seamless and natural voice interactions across speech and text. Utilizing a multi-stage alignment framework, MinMo integrates a speech front-end, a text LLM backbone, an autoregressive streaming voice decoder, and a full-duplex interaction module, enabling real-time, full-duplex, and fine-grained controllable speech comprehension and generation. Trained on 1.4 million hours of diverse speech data spanning ASR, TTS, speaker, and style tasks, MinMo achieves state-of-the-art performance on a broad set of voice benchmarks, surpassing both native and aligned predecessor models while retaining high-level text LLM capabilities (Chen et al., 10 Jan 2025).

1. Model Architecture and Multimodal Alignment

MinMo is built as an aligned multimodal model, maintaining the core capabilities of a pre-trained text LLM while incorporating speech processing through dedicated speech-to-text (S2T) and text-to-speech (T2S) pathways. The principal components and parameter allocation are as follows:

Module	Component Model	Parameters (M)
Voice Encoder	SenseVoice-Large	636
Input Projector	CNN + 2 Transformers	170
Text LLM Backbone	Qwen2.5-7B-Instruct	7,000
Output Projector	Linear	6
Voice Token LM	CosyVoice 2 LM	370
Full-Duplex Predictor	1 Transformer + softmax	18
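
As a quick arithmetic check on the table above, the per-module counts can be summed to confirm the "approximately 8 billion" figure. This is a minimal sketch: the dictionary simply restates the table, and the function names are illustrative, not part of any MinMo release.

```python
# Parameter counts in millions, copied from the table above.
PARAMS_M = {
    "Voice Encoder": 636,
    "Input Projector": 170,
    "Text LLM Backbone": 7000,
    "Output Projector": 6,
    "Voice Token LM": 370,
    "Full-Duplex Predictor": 18,
}


def total_params_billion(params_m):
    """Sum the per-module counts (in millions) and convert to billions."""
    return sum(params_m.values()) / 1000.0


def share_of_total(params_m):
    """Return each module's fraction of the total parameter count."""
    total = sum(params_m.values())
    return {name: count / total for name, count in params_m.items()}


print(total_params_billion(PARAMS_M))  # total across all six modules, in billions
```

The modules sum to 8,200 M parameters, of which the text LLM backbone accounts for roughly 85%.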

The speech understanding pathway routes incoming voice waveforms through the Voice Encoder and Input Projector to create hidden queries, which are then processed by Qwen2.5-7B-Instruct (updated via LoRA). Speech generation takes LLM text tokens and hidden states, passes them through the Output Projector and the Voice Token LM, producing discrete speech tokens that are synthesized to waveform output.

A distinctive feature is the full-duplex predictor, which operates on the LLM's last-layer hidden states and emits a frame-level decision ("CONTINUE_SPEAKING" vs. "STOP to listen"). This decision is what makes simultaneous two-way conversation possible.
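
The frame-level decision can be illustrated with a small sketch. It assumes the predictor yields two logits per frame (continue, stop) that are passed through a softmax; the label strings, the `stop_threshold` parameter, and the function names are hypothetical and only stand in for whatever the actual model uses.

```python
import math

# Hypothetical label names for the two frame-level outcomes.
LABELS = ("CONTINUE_SPEAKING", "STOP")


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def duplex_decisions(frame_logits, stop_threshold=0.5):
    """Map per-frame [continue, stop] logits to a decision per frame.

    stop_threshold is an assumed tuning knob, not a documented parameter.
    """
    decisions = []
    for logits in frame_logits:
        _p_continue, p_stop = softmax(logits)
        decisions.append(LABELS[1] if p_stop > stop_threshold else LABELS[0])
    return decisions
```

For example, `duplex_decisions([[2.0, 0.0], [0.0, 3.0]])` keeps speaking on the first frame and stops to listen on the second.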

2. Autoregressive Voice Decoder and Latency

MinMo introduces a novel streaming autoregressive (AR) Transformer-based voice decoder, which interleaves semantic vectors from the LLM (5 per block) with speech tokens (15 per block) at a fixed $5 : 15$ ratio. These blocks are presented to the Voice Token LM in a streaming format using teacher forcing. The output is chunked by the Token2Wav synthesizer (employing flow-matching plus a vocoder) to produce waveform audio.
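
The 5 : 15 interleaving described above can be sketched as follows. This is an illustration under stated assumptions: the inputs are plain Python lists, each element is tagged with its kind, and the handling of a short final block is a choice made here, not something the source specifies.

```python
def interleave_blocks(semantic_vectors, speech_tokens, n_sem=5, n_speech=15):
    """Interleave semantic vectors and speech tokens in fixed-size blocks.

    Each block holds up to n_sem semantic vectors followed by up to
    n_speech speech tokens. A trailing partial block is emitted as-is
    (an assumption for this sketch).
    """
    sequence = []
    i = j = 0
    while i < len(semantic_vectors) or j < len(speech_tokens):
        sequence.extend(("sem", v) for v in semantic_vectors[i:i + n_sem])
        sequence.extend(("speech", t) for t in speech_tokens[j:j + n_speech])
        i += n_sem
        j += n_speech
    return sequence
```

With 10 semantic vectors and 30 speech tokens this yields two blocks of 20 elements each, in the order 5 semantic, 15 speech, 5 semantic, 15 speech.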

Theoretical end-to-end latency for a block is given by:

$\text{Latency} = 5\,d_{\text{LLM}} + 15\,d_{\text{lm}} + 15\,d_{\text{syn}}$

where $d_{\text{LLM}}$ is the time to generate one text token, $d_{\text{lm}}$ is the time to generate one speech token, and $d_{\text{syn}}$ is the waveform synthesis time per speech token.
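
The formula is easy to evaluate directly. The sketch below is a literal transcription of it; the per-token delays used in the example call are placeholder values chosen for illustration, not measurements from the paper.

```python
def block_latency_ms(d_llm, d_lm, d_syn, n_text=5, n_speech=15):
    """Latency of one block: n_text*d_LLM + n_speech*d_lm + n_speech*d_syn.

    All delays are in milliseconds per token.
    """
    return n_text * d_llm + n_speech * d_lm + n_speech * d_syn


# Placeholder delays (ms per token), purely illustrative.
print(block_latency_ms(d_llm=20, d_lm=10, d_syn=5))  # 5*20 + 15*10 + 15*5 = 325
```

Because the speech-token terms are multiplied by 15, reducing $d_{\text{lm}}$ or $d_{\text{syn}}$ by one millisecond saves three times as much as the same reduction in $d_{\text{LLM}}$.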

Full-duplex theoretical latency is approximately 600 ms, with an empirical end-to-end latency of 800 ms when including OS and I/O overhead, both measured on L20 GPUs with BF16 precision (Chen et al., 10 Jan 2025).

3. Multi-Stage Training Procedure

The training pipeline for MinMo spans four alignment stages, leveraging approximately 1.4 million hours of speech data:

Stage 1: Speech-to-Text Alignment. Pre-alignment (120k h) updates only the Input Projector. Full alignment (1.2M h) trains the Voice Encoder and Input Projector with the LLM frozen. LoRA-based instruction fine-tuning adapts the LLM for cross-modal prompts. The objective is standard next-token cross-entropy:

$\mathcal{L}_{\text{S2T}} = -\sum_{t=1}^T \log P_{\theta}(y_t \mid y_{<t}, X_{\text{speech}})$

Stage 2: Text-to-Speech Alignment. Uses 170k h of basic TTS data and 1k h of instruction-controlled TTS data. The Output Projector is first trained alone, then jointly with the Voice Token LM (LLM and Voice Encoder frozen), using cross-entropy loss over speech tokens.

Stage 3: Speech-to-Speech Alignment. Trains on 10k h of simulated multi-turn conversations plus 100 h of style-controllable speech. Only the Output Projector and Voice Token LM are trained, with style control embeddings concatenated.

Stage 4: Duplex Interaction Alignment. Uses 4k h of real and simulated long-form dialogs. Only the Full-Duplex Predictor is trained, using frame-wise classification:

$\mathcal{L}_{\text{Duplex}} = -\sum_{t}\left[y_t\log p_t + (1-y_t)\log(1-p_t)\right], \quad y_t\in \{0,1\}$
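
The two objectives written out above (next-token cross-entropy for Stage 1 and frame-wise binary cross-entropy for Stage 4) can be sketched in plain Python. This is a minimal illustration operating on already-computed probabilities; the function names and the `eps` clamp are assumptions of this sketch, and a real training loop would use a deep-learning framework working on logits.

```python
import math


def s2t_loss(step_probs, target_ids):
    """Next-token cross-entropy: -sum_t log P(y_t | y_<t, X_speech).

    step_probs[t] is the predicted distribution over the vocabulary at
    step t; target_ids[t] is the index of the reference token y_t.
    """
    if len(step_probs) != len(target_ids):
        raise ValueError("one distribution is required per target token")
    return -sum(math.log(probs[y]) for probs, y in zip(step_probs, target_ids))


def duplex_loss(probs, labels, eps=1e-12):
    """Frame-wise binary cross-entropy with labels y_t in {0, 1}.

    probs[t] is the predicted probability p_t for frame t. The eps clamp
    (an addition in this sketch) avoids log(0).
    """
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total
```

Both functions sum over time steps rather than averaging, matching the way the losses are written in the text.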

Summary of Training Data

Task	Hours (k)
Speech-to-Text	1,200
Text-to-Speech	171
Speech-to-Speech	10 + 0.1
Duplex	4
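
Summing the rows of the table gives a quick consistency check against the "approximately 1.4 million hours" figure. The dictionary below only restates the table; the variable and function names are illustrative.

```python
# Training hours in thousands, copied from the table above.
HOURS_K = {
    "Speech-to-Text": 1200.0,
    "Text-to-Speech": 171.0,
    "Speech-to-Speech": 10.0 + 0.1,
    "Duplex": 4.0,
}


def total_hours_k(hours_k):
    """Total training hours, in thousands."""
    return sum(hours_k.values())


print(round(total_hours_k(HOURS_K), 1))  # 1385.1 thousand hours, about 1.4 million
```

Speech-to-text data dominates, making up roughly 87% of the total.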

4. Performance Evaluation and Benchmarks

MinMo demonstrates state-of-the-art results across multilingual ASR, speech-to-text translation, spoken question answering, TTS, and voice style control.

Voice Comprehension
- LibriSpeech-clean: 1.64% WER
- Fleurs (avg): 4.13% CER
- Speech-to-text translation Fleurs: 29.13 BLEU (xx→EN)
- Spoken QA (WebQ): 55.0% S2T, 39.9% S2S accuracy, both outperforming Moshi, GLM-4-Voice, and Freeze-Omni.
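
For readers unfamiliar with the WER figures quoted in this section, word error rate is the word-level edit distance between reference and hypothesis divided by the number of reference words; CER is the same computation over characters. The sketch below is a generic implementation with whitespace tokenization and no text normalization, not the scoring script used for the reported numbers.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)
```

For instance, dropping one word from a six-word reference gives a WER of 1/6, or about 16.7%.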

Voice Generation and Style Control
- Seed-TTS, English WER: 2.90%
- Style-following accuracy on a 122-turn, 12-style test: MinMo 98.4%, GLM-4-Voice 63.1%

Style Group	GLM-4-Voice	MinMo
Emotion	75.6%	97.6%
Dialect	42.9%	100%
Speaking rate	80.0%	100%
Role-play	70.4%	96.3%
Default	88.2%	96.3%
Total	63.1%	98.4%

Latency
- Speech-to-text: ~100 ms (for 5 tokens)
- Theoretical full-duplex: ~600 ms
- Practical full-duplex: ~800 ms

In the radar-chart comparison (Figure 1), MinMo outperforms both native (Moshi, GLM-4-Voice) and aligned (LLaMA-Omni, Freeze-Omni, Mini-Omni2) models on ASR, S2TT, SER, LID, and VSC, while retaining text LLM capabilities (Chen et al., 10 Jan 2025).

5. Instruction-Following and Controllable Voice Synthesis

MinMo supports instruction-driven speech synthesis using learned style embeddings concatenated to the LLM hidden states. These embeddings are conditioned on explicit user instructions covering emotion, dialect, speaking rate, and speaker mimicry, and are learned from 1k h of instruction-controlled TTS data and 100 h of style-controllable S2S data.

Examples of prompt control include:

- “Please speak very fast: …”, resulting in an accelerated utterance.
- “Speaking with a tone of sadness: I miss my dear friend…”, producing the corresponding emotional prosody.

This mechanism enables MinMo to follow verbal instructions with high fidelity, as shown in style-following benchmarks.

6. Limitations and Directions for Enhancement

- Text-Instruction Following: MinMo applies only LoRA updates to the LLM during cross-modal fine-tuning, and its text-only instruction following remains weaker as a result. Further text-only fine-tuning is needed to recover the full instruction-following capacity of the LLM.
- Pronunciation Errors: Residual long-tail errors in end-to-end speech generation suggest room for improvement via a more balanced token vocabulary or additional diverse training data.
- Style-Control Efficiency: Performance may be further improved by expanding the scale and diversity of style instruction datasets and increasing embedding complexity.
- Duplex Integration: Full-duplex mode currently relies on external AEC/VAD modules; fully end-to-end duplex processing remains a target for future development (Chen et al., 10 Jan 2025).

MinMo shows that an aligned architecture, multi-stage speech-text alignment at scale, and an AR streaming voice decoder can together give a text LLM state-of-the-art performance in voice comprehension, generation, duplex conversation, and fine-grained controllability within an 8B-parameter framework.

References (1)

MinMo: A Multimodal Large Language Model for Seamless Voice Interaction (2025)
