LLaMA-Omni 2: Real-Time SpeechLM

Updated 13 May 2026
  • LLaMA-Omni 2 is a series of speech language models that integrate a frozen Whisper encoder, an autoregressive streaming TTS, and a Qwen2.5 LLM for real-time spoken dialogue applications.
  • It achieves state-of-the-art performance on both speech-to-text and speech-to-speech tasks, with lower latency and higher accuracy than baselines such as GLM-4-Voice.
  • The model’s efficient training on 200,000 multi-turn dialogues, combined with modular pre-trained components, demonstrates a shift towards data-efficient, low-latency speech synthesis.

LLaMA-Omni 2 is a series of speech language models (SpeechLMs) designed for real-time, high-fidelity spoken chatbot applications. By integrating a speech encoder, an autoregressive streaming speech decoder, and an LLM backbone, LLaMA-Omni 2 achieves state-of-the-art performance on spoken question answering and speech instruction following tasks. Despite being trained on only 200,000 multi-turn speech dialogue samples, it outperforms prior models trained on much larger datasets, such as GLM-4-Voice (Fang et al., 5 May 2025).

1. Model Architecture and Dataflow

LLaMA-Omni 2 utilizes a modular architecture structured around the following components:

  • Speech Encoder: Takes input speech $X$ as 128-dimensional log-mel features and encodes it with a frozen Whisper-large-v3 (≈1.5B parameters).
  • Speech Adapter: Applies 5× frame downsampling followed by a feedforward network (FFN), producing representations for the LLM.
  • LLM Backbone: Qwen2.5-Instruct (0.5B–14B parameters) acts as a decoder-only Transformer for speech-instruction understanding.
  • Gated Fusion: Combines LLM hidden states and sampled text tokens, providing context to the speech generation module.
  • Text-to-Speech LM ($\mathcal{M}_\mathrm{TTS}$): An autoregressive Transformer LM, initialized from Qwen2.5-0.5B, with vocabulary expanded by 6,561 discrete speech tokens.
  • Chunk-aware Causal Flow Matching + Vocoder: Converts discrete speech tokens into mel-spectrograms using a frozen, pretrained CosyVoice 2 model and then synthesizes audio waveforms via a streaming HiFi-GAN vocoder (≈50M parameters).

High-level dataflow:

Input speech $X$ → Whisper encoder → speech adapter → LLM → gate fusion → TTS LM → flow matching → mel-spectrogram → HiFi-GAN vocoder → output waveform $Y^S$.

2. Component Specifications

Speech Encoder

  • Whisper-large-v3: Processes 128-dimensional log-mel features (25 ms window, 10 ms shift); Transformer architecture; 1.5B parameters; weights frozen during LLaMA-Omni 2 training.

Speech Adapter

  • Downsampling: Concatenates every $k=5$ consecutive frames, reducing the input sequence length by 5×.
  • FFN: Single-layer, intermediate dimension 2048, output matching LLM input dimension.
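The two adapter steps can be sketched in a few lines. This is a minimal, dependency-free illustration rather than the authors' implementation: the function names (`stack_frames`, `ffn`), the toy dimensions, the ReLU activation, and the choice to drop trailing frames that do not fill a full group of $k$ are all assumptions.

```python
import random

def stack_frames(frames, k=5):
    """Concatenate every k consecutive frames into one longer vector.

    Assumption: trailing frames that do not fill a full group are dropped.
    """
    usable = len(frames) - len(frames) % k
    return [
        [x for frame in frames[i:i + k] for x in frame]
        for i in range(0, usable, k)
    ]

def linear(vec, weight, bias):
    """Dense layer; weight holds one row per output unit."""
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weight, bias)]

def ffn(vec, w1, b1, w2, b2):
    """Linear -> ReLU -> linear (the activation is an assumption)."""
    hidden = [max(0.0, h) for h in linear(vec, w1, b1)]
    return linear(hidden, w2, b2)

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

# Toy sizes for illustration only; the text gives 2048 as the intermediate dimension.
rng = random.Random(0)
enc_dim, hidden_dim, llm_dim, k = 4, 8, 6, 5
encoder_out = [[rng.random() for _ in range(enc_dim)] for _ in range(23)]

stacked = stack_frames(encoder_out, k)  # 23 frames -> 4 groups of 5 frames
w1, b1 = random_matrix(hidden_dim, enc_dim * k, rng), [0.0] * hidden_dim
w2, b2 = random_matrix(llm_dim, hidden_dim, rng), [0.0] * llm_dim
adapted = [ffn(v, w1, b1, w2, b2) for v in stacked]

print(len(encoder_out), "->", len(adapted), "frames; dim", len(adapted[0]))
```

Stacking trades sequence length for feature width: the LLM sees five times fewer positions, and the FFN projects the widened vectors back to the LLM's input dimension.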

LLM Backbone

  • Qwen2.5-Instruct: Available in 0.5B, 1.5B, 3B, 7B, and 14B sizes; decoder-only with standard self-attention; trained with a cross-entropy loss to map speech instructions to text responses.

Text-to-Speech LM ($\mathcal{M}_\mathrm{TTS}$)

  • Same architecture as Qwen2.5-0.5B; vocabulary expanded by 6,561 new discrete speech tokens.
  • All weights are initialized from Qwen2.5-0.5B; the new token embeddings are randomly initialized (≈0.5B parameters).
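One way to picture the vocabulary expansion is as an id offset plus extra embedding rows. The sketch below assumes the 6,561 speech tokens are appended directly after the text vocabulary; that layout, the helper names, the initialization scale `std`, and the toy vocabulary size are illustrative assumptions, not details from the source.

```python
import random

NUM_SPEECH_TOKENS = 6561  # number of discrete speech tokens stated above

def speech_code_to_id(code, base_vocab_size):
    """Map a speech code in [0, 6561) to its id in the extended vocabulary."""
    if not 0 <= code < NUM_SPEECH_TOKENS:
        raise ValueError(f"speech code out of range: {code}")
    return base_vocab_size + code

def id_to_speech_code(token_id, base_vocab_size):
    """Return the speech code for an extended-vocabulary id, or None for text ids."""
    code = token_id - base_vocab_size
    return code if 0 <= code < NUM_SPEECH_TOKENS else None

def extend_embeddings(pretrained_rows, rng, std=0.02):
    """Keep pretrained rows unchanged and append randomly initialized speech rows."""
    dim = len(pretrained_rows[0])
    new_rows = [
        [rng.gauss(0.0, std) for _ in range(dim)]
        for _ in range(NUM_SPEECH_TOKENS)
    ]
    return pretrained_rows + new_rows

rng = random.Random(0)
base_vocab_size, dim = 1000, 4  # toy values, not the real Qwen2.5 sizes
pretrained = [[0.0] * dim for _ in range(base_vocab_size)]
extended = extend_embeddings(pretrained, rng)

print(len(extended), speech_code_to_id(42, base_vocab_size))
```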

Flow Matching Model and Vocoder

  • Chunk-aware Flow Matching: Pretrained CosyVoice 2 (frozen); performs streaming synthesis every $W$ speech tokens.
  • HiFi-GAN Vocoder: Streams $2W$ mel frames per chunk; ≈50M parameters.

3. Training Methodology

Dataset Construction

  • 200,000 Multi-turn Dialogues: Derived by rewriting the Alpaca and UltraChat corpora into dialogues, with the number of turns drawn as $N \sim \mathrm{Pois}(\lambda=2)$ and truncated to 1–5 turns.
  • Instruction Synthesis: fish-speech-1.5 produces random-voice prompts, and CosyVoice2-0.5B clones the prompted voice for each dialogue.
  • Response Synthesis: CosyVoice2-0.5B generates all responses in a uniform voice.
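The turn-count sampling can be illustrated with a short standard-library sketch. The source only says the Poisson draw is "truncated to 1–5 turns"; this example assumes that means clamping into the range (rejection sampling would be another reading), and the function names are hypothetical.

```python
import math
import random

def sample_poisson(lam, rng):
    """Draw from Poisson(lam) using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def sample_num_turns(rng, lam=2.0, lo=1, hi=5):
    """Assumption: 'truncated' is implemented by clamping the draw into [lo, hi]."""
    return min(hi, max(lo, sample_poisson(lam, rng)))

rng = random.Random(0)
turns = [sample_num_turns(rng) for _ in range(10_000)]
histogram = {n: turns.count(n) for n in range(1, 6)}
print(histogram)
```

Under the clamping assumption, draws of 0 are folded into single-turn dialogues and draws above 5 into five-turn dialogues, so short dialogues dominate the dataset.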

Training Stages

  1. Speech-to-Text (Stage I(a)):

    • Freeze the speech encoder; train the adapter and LLM on paired speech instructions and text responses.
    • Loss: cross-entropy over the text response tokens, conditioned on the speech instruction.

  2. TTS LM Pretraining (Stage I(b)):

    • Train $\mathcal{M}_\mathrm{TTS}$ on […], with gate fusion disabled.
    • Loss:

    […]

  3. End-to-end Speech-to-Speech (Stage II):

    • Freeze the encoder, adapter, and LLM; train gate fusion and $\mathcal{M}_\mathrm{TTS}$ on […].
    • Streaming TTS loss (read–write):

    […]
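For orientation, the three stage objectives can be written as standard autoregressive cross-entropy losses. The block below is a sketch under assumed notation, not the paper's exact formulas: $Y^T = (y^T_1, \dots, y^T_N)$ for the text response, $Y^U = (y^U_1, \dots, y^U_M)$ for the discrete speech tokens, $c_i$ for the fused representation of the $i$-th text token, and $R$ and $W$ for the read and write sizes are all introduced here for illustration.

```latex
% Assumed notation: X = speech instruction, Y^T = text response,
% Y^U = discrete speech tokens, c_i = fused representation of text token i.
\begin{align}
  \mathcal{L}_{\mathrm{I(a)}} &= -\sum_{i=1}^{N} \log P\left(y^T_i \mid X,\; y^T_{<i}\right) \\
  \mathcal{L}_{\mathrm{I(b)}} &= -\sum_{j=1}^{M} \log P\left(y^U_j \mid Y^T,\; y^U_{<j}\right) \\
  \mathcal{L}_{\mathrm{II}}   &= -\sum_{j=1}^{M} \log P\left(y^U_j \mid c_{\le \min(N,\, R\lceil j/W \rceil)},\; y^U_{<j}\right)
\end{align}
```

In the last line, the conditioning index expresses the read–write constraint: speech token $j$ may only attend to the text-side inputs that have been read by the time its chunk is written.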

Hyperparameters

  • Batch size: 32. Stage I(a): 3 epochs, learning rate […]; Stage I(b): 5 epochs, learning rate […]; Stage II: 1 epoch, learning rate […].
  • 3% warmup, cosine-annealing schedule.
  • Hardware: 4×H800 GPUs for the 14B model, 4×L40 for the smaller models.
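A schedule with 3% warmup followed by cosine annealing can be sketched as below. The linear warmup shape, the decay to zero, the function name `lr_at`, and the step count and peak learning rate in the demo are assumptions for illustration; the source does not give them here.

```python
import math

def lr_at(step, total_steps, peak_lr, warmup_frac=0.03, min_lr=0.0):
    """Learning rate at a 0-indexed step: linear warmup, then cosine decay."""
    warmup_steps = max(1, round(total_steps * warmup_frac))
    if step < warmup_steps:
        # Ramp linearly up to peak_lr over the warmup steps.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay from peak_lr toward min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

total_steps, peak_lr = 1000, 1e-4  # hypothetical values, not from the source
schedule = [lr_at(s, total_steps, peak_lr) for s in range(total_steps)]
print(f"start={schedule[0]:.2e} peak={max(schedule):.2e} end={schedule[-1]:.2e}")
```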

4. Real-Time Streaming and Decoding

Read–Write Streaming

For every $R$ text tokens generated by the LLM (default $R$ = […]), the TTS LM synthesizes $W$ speech tokens ($W$ = […]). After the LLM finishes, the remaining speech tokens are generated autoregressively.

Latency Calculation:

[…]

  • On an NVIDIA L40, Omni2-7B (streaming setting […]) achieves ≈583 ms end-to-first-chunk latency.

Decoding Algorithms

  • LLM: Greedy decoding for stable generation.
  • TTS LM: Sampling with temperature 1.0 to minimize repetition.
  • Flow Matching and Vocoder: Streaming synthesis per chunk.
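The difference between the two token-level decoding rules can be shown with a small standard-library sketch. The function names and example logits are hypothetical; only the choice of greedy decoding for the LLM and temperature-1.0 sampling for the TTS LM comes from the text above.

```python
import math
import random

def greedy(logits):
    """Pick the index of the largest logit (deterministic)."""
    return max(range(len(logits)), key=lambda i: logits[i])

def sample_with_temperature(logits, temperature, rng):
    """Sample an index from softmax(logits / temperature)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

rng = random.Random(0)
logits = [2.0, 1.5, 0.1]  # toy scores for three candidate tokens

text_token = greedy(logits)
speech_tokens = [sample_with_temperature(logits, 1.0, rng) for _ in range(1000)]
print(text_token, sorted(set(speech_tokens)))
```

Greedy decoding always returns the same token for the same logits, whereas sampling spreads choices across plausible tokens, which is what helps an autoregressive speech-token model avoid getting stuck in repeated patterns.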

Gate Fusion

The gate fusion mechanism combines the LLM hidden states with the sampled text tokens into a fused representation:

[…]
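A common way to realize this kind of gate is a learned sigmoid that interpolates element-wise between two vectors. The sketch below uses that generic form, $g = \sigma(W_g[h; e] + b_g)$ and $c = g \odot h + (1 - g) \odot e$, as an assumption; it is not confirmed to be the exact formulation used by LLaMA-Omni 2, and the names `gate_fusion`, `hidden`, and `embedding` are illustrative.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gate_fusion(hidden, embedding, w_gate, b_gate):
    """Fuse two equal-length vectors with an element-wise learned gate.

    Assumed form: g = sigmoid(W_g [h; e] + b_g), c = g * h + (1 - g) * e.
    """
    concat = hidden + embedding  # list concatenation gives [h; e]
    gate = [
        sigmoid(sum(w * x for w, x in zip(row, concat)) + b)
        for row, b in zip(w_gate, b_gate)
    ]
    return [g * h + (1.0 - g) * e for g, h, e in zip(gate, hidden, embedding)]

rng = random.Random(0)
dim = 4  # toy dimension
hidden = [rng.uniform(-1, 1) for _ in range(dim)]     # stands in for an LLM hidden state
embedding = [rng.uniform(-1, 1) for _ in range(dim)]  # stands in for a text-token embedding
w_gate = [[rng.uniform(-0.5, 0.5) for _ in range(2 * dim)] for _ in range(dim)]
b_gate = [0.0] * dim

fused = gate_fusion(hidden, embedding, w_gate, b_gate)
print([round(v, 3) for v in fused])
```

Because the gate lies in (0, 1), each fused component falls between the corresponding hidden-state and embedding components, letting the model decide per dimension how much contextual versus lexical information to pass to the TTS LM.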

Read–Write Streaming Algorithm (pseudocode)

[…]
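The read–write schedule described in this section can be sketched as a Python generator. Everything here is a simplified stand-in: `read_write_stream`, the `tts_step` callback, the `EOS` marker, the toy TTS rule of four speech tokens per text token, and the demo values of `R` and `W` are assumptions chosen for illustration, not the authors' code or defaults.

```python
from itertools import islice

EOS = None  # hypothetical end-of-speech marker returned by the TTS step

def read_write_stream(llm_tokens, tts_step, R, W):
    """Alternate reading up to R text tokens with writing up to W speech tokens.

    Yields (num_text_tokens_read, speech_chunk) so callers can pass each
    chunk on to flow matching and the vocoder as soon as it is ready.
    """
    context, speech = [], []
    llm_iter = iter(llm_tokens)
    llm_done = False
    while True:
        if not llm_done:
            read = list(islice(llm_iter, R))  # read phase
            context.extend(read)
            llm_done = len(read) < R          # a short read means the LLM has finished
        chunk = []
        for _ in range(W):                    # write phase
            token = tts_step(context, speech)
            if token is EOS:
                break
            speech.append(token)
            chunk.append(token)
        if chunk:
            yield len(context), chunk
        if len(chunk) < W:                    # the TTS LM emitted end-of-speech
            return

def toy_tts_step(context, speech):
    """Toy rule: four speech tokens per text token read so far, then EOS."""
    return len(speech) if len(speech) < 4 * len(context) else EOS

chunks = list(read_write_stream(range(7), toy_tts_step, R=2, W=5))
for text_read, chunk in chunks:
    print(f"text tokens read: {text_read}, speech tokens written: {len(chunk)}")
```

The first chunk is produced after only `R` text tokens have been read, which is why first-chunk latency depends on the read and write sizes rather than on the full response length.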

5. Performance and Benchmarking

LLaMA-Omni 2 demonstrates strong performance in both speech-to-text (S2T) and speech-to-speech (S2S) tasks across various metrics. Key results are summarized below.

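| Model | Llama Qs S2T | Llama Qs S2S | Web Qs S2T | Web Qs S2S | GPT Score S2T | GPT Score S2S | ASR-WER↓ | UTMOS↑ | Latency (ms)↓ |
|---|---|---|---|---|---|---|---|---|---|
| GLM-4-Voice (9B) | 64.7 | 50.7 | 32.2 | 15.9 | 4.16 | 4.09 | 9.02 | 3.48 | 1562.8 |
| LLaMA-Omni (8B) | 67.7 | 49.0 | 33.4 | 23.7 | 3.99 | 3.52 | 5.95 | 3.67 | 346.7 |
| Omni2-7B | 70.3 | 60.7 | 34.5 | 31.3 | 4.28 | 4.15 | 3.26 | 4.19 | 582.9 |
| Omni2-14B | 73.0 | 62.7 | 40.4 | 37.1 | 4.56 | 4.35 | 3.89 | 4.20 | 663.3 |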

Key observations:

  • Omni 2 markedly improves both S2T and S2S accuracy versus GLM-4-Voice, sharply reducing the S2T→S2S drop (Web Qs: 34.5 → 31.3 for Omni2-7B versus 32.2 → 15.9 for GLM-4-Voice).
  • Instruction following (GPT-4o scores), ASR-WER, and naturalness (UTMOS ≈ 4.2) all improve over both baselines under real-time streaming synthesis.
  • End-to-first-chunk latency (≈600 ms) is roughly 2.4–2.7× lower than that of GLM-4-Voice (1562.8 ms), though higher than that of LLaMA-Omni (346.7 ms), and remains suitable for real-time deployment.
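The figures behind these observations follow directly from the results table. The short script below recomputes the Web Qs S2T→S2S drop and the latency ratio relative to GLM-4-Voice; the dictionary layout and key names are illustrative, while the numbers are copied from the table.

```python
results = {
    "GLM-4-Voice (9B)": {"web_s2t": 32.2, "web_s2s": 15.9, "latency_ms": 1562.8},
    "Omni2-7B": {"web_s2t": 34.5, "web_s2s": 31.3, "latency_ms": 582.9},
    "Omni2-14B": {"web_s2t": 40.4, "web_s2s": 37.1, "latency_ms": 663.3},
}

baseline_latency = results["GLM-4-Voice (9B)"]["latency_ms"]
summary = {}
for name, row in results.items():
    drop = row["web_s2t"] - row["web_s2s"]         # accuracy lost going from S2T to S2S
    ratio = baseline_latency / row["latency_ms"]   # how many times lower the latency is
    summary[name] = (round(drop, 1), round(ratio, 2))
    print(f"{name}: Web Qs drop {drop:.1f} points, latency ratio {ratio:.2f}x")
```

The drop shrinks from 16.3 points for GLM-4-Voice to 3.2 and 3.3 points for Omni2-7B and Omni2-14B, and the latency ratios come out at about 2.68× and 2.36×.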

6. Significance, Context, and Implications

LLaMA-Omni 2 shows that combining a Qwen2.5 LLM with frozen open speech components (Whisper, CosyVoice 2) and end-to-end streaming autoregressive TTS modeling enables efficient, high-quality real-time spoken dialogue systems. Omni 2 is trained on only 200,000 synthetic multi-turn dialogues, yet it outperforms models trained on orders of magnitude more audio (e.g., millions of hours for GLM-4-Voice). This suggests a shift in the relative importance of data quantity versus architectural modularity and pretraining alignment in SpeechLMs.

A plausible implication is that, for real-time high-fidelity spoken interaction, exhaustive supervised speech data may be less critical than previously assumed, provided that pre-trained components and data-efficient fusion strategies are employed. This suggests new avenues for multimodal chatbot training that emphasize modular integration, parameter efficiency, and low-latency streaming generation (Fang et al., 5 May 2025).
