
MGM-Omni: Unified Omni-Modal Speech Model

Updated 6 October 2025
  • MGM-Omni is a unified omni-modal architecture that integrates multimodal reasoning with real-time, long-form speech generation.
  • Its dual-track 'brain-mouth' design separates language reasoning from speech synthesis, ensuring low latency and enhanced audio quality.
  • The framework leverages dual audio encoder fusion and chunk-based parallel decoding to achieve data efficiency and robust ASR/TTS performance.

MGM-Omni refers to a unified, omni-modal LLM architecture and training framework specialized for scalable multimodal understanding and expressive, long-horizon speech generation. Departing from cascaded pipelines that treat speech and language interaction as modular but loosely coupled tasks, MGM-Omni establishes an efficient “brain-mouth” dual-track design that decouples multimodal reasoning from token-level real-time speech synthesis. This allows for robust omnimodal perception, low-latency streaming speech synthesis, and controllable, personalized long-form audio generation within a single, data- and compute-efficient end-to-end paradigm (Wang et al., 29 Sep 2025).

1. Dual-Track "Brain-Mouth" Architecture

MGM-Omni centers on a dual-track, token-based architecture. The “brain” comprises a multimodal LLM (MLLM) responsible for unified reasoning across text, images, video, and audio, whereas the “mouth” consists of a speech-focused module (SpeechLM) dedicated to speech token generation. The separation of reasoning and speech synthesis is fundamental: the MLLM generates textual tokens that capture context and intent, which are then ingested by the SpeechLM that produces speech tokens in a streaming, real-time manner. The interface between the two modules is token-aligned, facilitating immediate output and supporting extensions in vocabulary for multimodal token fusion.

The design supports low-latency, streaming speech generation that is essential for interactive applications and enables direct cross-modal reasoning-to-speech generation without the bottlenecks of a pipelined design. The use of a TTS-Adapter appended to the LLM backbone (e.g., Qwen3) allows for specialized adaptation to speech synthesis within the same computational graph.
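The following is a minimal sketch of how such a decoupled, token-aligned streaming interface could behave, assuming generator-style components; the names brain, mouth, generate_stream, chunk_ready, and synthesize are illustrative placeholders, not the released MGM-Omni API.

```python
# Illustrative dual-track streaming loop (hypothetical component names).
def stream_response(brain, mouth, multimodal_inputs):
    """The MLLM "brain" yields text tokens as it reasons; the SpeechLM "mouth"
    converts each completed chunk of text into speech tokens immediately, so
    audio playback can start before the full textual answer is finished."""
    text_buffer = []
    for text_token in brain.generate_stream(multimodal_inputs):  # reasoning track
        text_buffer.append(text_token)
        if mouth.chunk_ready(text_buffer):                       # e.g., phrase boundary reached
            for speech_token in mouth.synthesize(text_buffer):   # speech track
                yield speech_token                               # stream to vocoder/player
            text_buffer.clear()
    if text_buffer:                                              # flush any trailing text
        yield from mouth.synthesize(text_buffer)
```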

2. Unified Multimodal Training and Dual Audio Encoder Fusion

MGM-Omni employs a unified training regime that accommodates multimodal inputs—especially long-form and heterogeneous audio—using a dual encoder design. The architecture integrates:

  • Qwen2-Audio encoder (Whisper-large-v3-based) for robust acoustic feature extraction, optimized via continuous training.
  • Belle-Whisper-large-v3 encoder targeting semantic enrichment and language-specific comprehension, essential for complex ASR tasks (notably Chinese).

Fusion of these encoders is handled by an information mining module. Suppose Q is the representation from the primary encoder and (K, V) are the keys and values from the secondary encoder (transformed via learned projections φ). The integrated audio representation T_A is given by:

T_A = \text{MLP}\left(Q + \text{Softmax}\bigl(\phi(Q)\,\phi(K)^{\top}\bigr)\,\phi(V)\right)

where the MLP is a two-layer perceptron. This design ensures that both low-level acoustic features and high-level semantic cues are preserved, yielding enhanced long-form audio perception across variable acoustic regimes.
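The fusion in the equation above admits a compact implementation. Below is a minimal sketch in PyTorch, assuming single-head cross-attention with no scaling (as written in the formula); the dimensions and GELU activation are illustrative choices, not the paper's exact hyperparameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualEncoderFusion(nn.Module):
    """Information-mining fusion: T_A = MLP(Q + Softmax(phi(Q) phi(K)^T) phi(V))."""
    def __init__(self, dim: int):
        super().__init__()
        # learned projections phi(.) applied to query, key, and value streams
        self.phi_q = nn.Linear(dim, dim)
        self.phi_k = nn.Linear(dim, dim)
        self.phi_v = nn.Linear(dim, dim)
        # two-layer perceptron applied after the residual cross-attention
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, q: torch.Tensor, kv: torch.Tensor) -> torch.Tensor:
        # q:  (B, Tq, D) features from the primary (acoustic) encoder
        # kv: (B, Tk, D) features from the secondary (semantic) encoder
        attn = F.softmax(self.phi_q(q) @ self.phi_k(kv).transpose(-2, -1), dim=-1)
        return self.mlp(q + attn @ self.phi_v(kv))  # integrated representation T_A
```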

Training proceeds in two phases:

  1. Audio-to-text pre-training aligns the dual audio encoder output to the language backbone with transcription data.
  2. Unified multimodal fine-tuning mixes audio QA, transcription, VQA, and text instruction to enable comprehensive omnimodal understanding.

Dynamic batching strategies further improve sample efficiency—batches are grouped by clip length, and batch sizes are dynamically adapted, ensuring both short and long-form audio are learned with equal stability.
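A minimal sketch of such length-grouped dynamic batching is shown below, assuming a fixed per-batch frame budget; the budget value, sorting strategy, and function name are illustrative rather than the exact recipe used for MGM-Omni.

```python
def length_grouped_batches(clips, max_frames_per_batch=30_000):
    """clips: list of (clip_id, num_frames) pairs. Groups clips of similar
    length and lets the batch size shrink as clips get longer, so short and
    long-form audio see a comparable per-step compute budget."""
    batches, current, current_frames = [], [], 0
    for clip_id, n_frames in sorted(clips, key=lambda c: c[1]):  # group by length
        if current and current_frames + n_frames > max_frames_per_batch:
            batches.append(current)            # flush the filled batch
            current, current_frames = [], 0
        current.append(clip_id)
        current_frames += n_frames
    if current:
        batches.append(current)                # flush the remainder
    return batches
```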

3. Chunk-Based Parallel Speech Decoding for Long-Form Generation

SpeechLM in MGM-Omni is designed for efficient and natural long-form speech synthesis, with support for streaming zero-shot voice cloning and stable timbre preservation. The key technical advancement is the “chunk-based parallel decoding” scheme.

  • Token rate mismatch correction: Speech tokenizers (e.g., FSQ) produce ~25 tokens/s, far higher than text rates. To address misalignment between generated text and speech tokens (which can cause slow or desynchronized speech), the input text is split into chunks, and for each, SpeechLM outputs corresponding speech tokens, maintaining synchrony.
  • Parallel token generation: During each forward step, the model decodes both a text token $x_t$ and $k$ speech tokens $\{s_t^1, \dots, s_t^k\}$. The extended vocabulary size is

|V| = |V_{\text{text}}| + k \cdot |V_{\text{speech}}|

and the input embedding aggregate is:

h_t^{\text{in}} = \frac{1}{k+1}\left[f(x_t) + \sum_{i=1}^{k} f(s_t^i)\right]

with $f(\cdot)$ the embedding function. The TTS-Adapter and LM head then produce the next speech token block:

\{\hat{s}_{t+1}^1, \dots, \hat{s}_{t+1}^k\} = \text{lm\_head}\bigl(\text{TTS-Adapter}(h_t^{\text{out}})\bigr)

The result is a ~3× boost in inference throughput and near-constant-latency generation, even for long-form utterances; a minimal sketch of one decoding step follows this list.

  • Streaming zero-shot voice cloning: The approach also enables, for the first time in an open-source omni-modal LLM, controlled and stable voice style transfer over multi-minute audio streams, with consistent timbre and phonetic fidelity.
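The sketch below illustrates one chunk-parallel decoding step from the equations above, assuming PyTorch, a shared embedding table over the extended vocabulary, and a single LM head whose output is split into k speech-token slices; all module and argument names are illustrative, and the head-splitting is a simplifying assumption rather than the paper's exact layout.

```python
import torch

def parallel_decode_step(embed, backbone, tts_adapter, lm_head,
                         x_t, s_t, k, v_text, v_speech):
    """x_t: (B,) current text token ids; s_t: (B, k) current speech token ids,
    already offset into the extended vocabulary. Returns the next block of k
    speech tokens, predicted in a single forward pass."""
    # h_t^in: average of the text embedding and the k aligned speech embeddings
    h_in = (embed(x_t) + embed(s_t).sum(dim=1)) / (k + 1)        # (B, D)
    h_out = backbone(h_in.unsqueeze(1))[:, -1]                   # (B, D) last hidden state
    logits = lm_head(tts_adapter(h_out))                         # (B, v_text + k*v_speech)
    # one argmax per speech-token slot, over that slot's slice of the vocabulary
    next_block = torch.stack(
        [logits[:, v_text + i * v_speech: v_text + (i + 1) * v_speech].argmax(dim=-1)
         for i in range(k)],
        dim=1,
    )                                                            # (B, k)
    return next_block
```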

4. Data Efficiency and Training Regime

Whereas competing approaches frequently require millions of hours of audio and large-scale text-to-speech corpora, MGM-Omni achieves state-of-the-art long-form speech and omnimodal understanding with approximately 400k hours of audio for pre- and post-training. This efficiency arises from:

  • The dual audio encoder fusion, which extracts significant signal from smaller, more diverse datasets.
  • The chunk-based, parallel decoding, which minimizes error accumulation and avoids drift in long-horizon synthesis.
  • Two-stage training schedules, where the core LLM is frozen during early TTS-Adapter training, followed by joint fine-tuning with differential learning rates (sketched below).
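A minimal sketch of this two-stage schedule, assuming PyTorch and illustrative attribute names (model.llm, model.tts_adapter) and learning-rate values:

```python
import torch

def build_stage_optimizer(model, stage: int):
    if stage == 1:
        # Stage 1: freeze the core LLM and train only the TTS-Adapter
        for p in model.llm.parameters():
            p.requires_grad = False
        return torch.optim.AdamW(model.tts_adapter.parameters(), lr=1e-4)
    # Stage 2: joint fine-tuning with differential learning rates
    for p in model.llm.parameters():
        p.requires_grad = True
    return torch.optim.AdamW([
        {"params": model.llm.parameters(),         "lr": 1e-5},  # gentle updates to the backbone
        {"params": model.tts_adapter.parameters(), "lr": 1e-4},  # larger steps for the adapter
    ])
```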

Empirical results indicate that MGM-Omni consistently achieves lower word and character error rates across benchmark ASR and TTS tasks, and sustains lower real-time factors for speech synthesis (e.g., an RTF of 0.19 with parallel size 4 on GPU) than prior open-source omni-modal systems.
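For reference, the real-time factor quoted here follows the standard definition, where values below 1 indicate faster-than-real-time synthesis:

\text{RTF} = \frac{T_{\text{synthesis}}}{T_{\text{audio}}}

so an RTF of 0.19 corresponds to generating audio roughly five times faster than it plays back.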

5. Evaluation Results and Ablation Studies

MGM-Omni’s performance has been assessed on multiple public and custom metrics, including:

  • ASR: On LibriSpeech, CommonVoice, and AISHELL, it matches or surpasses prior open-source omni-modal LLMs in WER/CER.
  • Audio QA: On audio question-answering benchmarks, MGM-Omni achieves the highest GPT-4-scored averages among its open-source peers.
  • Speech Synthesis: Evaluated on both short-form (Seed-TTS-Eval) and long-form (Long-TTS-Eval) metrics, MGM-Omni demonstrates considerably lower WER and CER, with enhanced prosody and timbre preservation.

Ablation studies confirm the positive contribution of each architectural component (dual encoder, chunkwise decoding, parallel speech token generation) to result stability, low latency, and voice quality.

6. Applications and Broader Implications

The MGM-Omni paradigm enables several advanced AI applications:

  • Long-form personalized speech generation for audiobooks, podcasts, and accessible content creation.
  • Streaming zero-shot voice cloning for real-time virtual assistants, digital actors, and personalized media.
  • Cross-modal interactive AI systems: By combining high-fidelity speech synthesis with unified perception for text, image, video, and audio, MGM-Omni is directly suitable for context-aware chat, automated dubbing, and broadcast.
  • Efficient end-to-end omnimodal learning: The “brain-mouth” architecture, dual audio encoder integration, and chunkwise decoding provide templates for future multimodal LLM research, emphasizing modular decoupling, architectural efficiency, and scalable inference.

The MGM-Omni approach establishes a new technical reference point in unified, controllable omnimodal AI, demonstrating that expressive, context-aware, long-horizon speech synthesis can be achieved with compact architectures and data-efficient learning (Wang et al., 29 Sep 2025).
