LLaMA-Omni 2: Real-Time SpeechLM

Updated 13 May 2026

LLaMA-Omni 2 is a series of speech language models that integrate a frozen Whisper encoder, an autoregressive streaming TTS, and a Qwen2.5 LLM for real-time spoken dialogue applications.
It achieves state-of-the-art performance in both speech-to-text and speech-to-speech tasks, with lower latency and higher accuracy than baselines such as GLM-4-Voice.
The model’s efficient training on 200,000 multi-turn dialogues, combined with modular pre-trained components, demonstrates a shift towards data-efficient, low-latency speech synthesis.

LLaMA-Omni 2 is a series of Speech LLMs (SpeechLMs) designed for real-time, high-fidelity spoken chatbot applications. Integrating a speech encoder, an autoregressive streaming speech decoder, and an LLM backbone, LLaMA-Omni 2 achieves state-of-the-art performance on spoken question answering and speech instruction following tasks. Despite being trained on only 200,000 multi-turn speech dialogue samples, it outperforms models trained on much larger datasets, such as GLM-4-Voice (Fang et al., 5 May 2025).

1. Model Architecture and Dataflow

LLaMA-Omni 2 utilizes a modular architecture structured around the following components:

Speech Encoder: Takes input speech $X$ as 128-dimensional log-mel features (the input size used by Whisper-large-v3) and encodes it with a frozen Whisper-large-v3 (≈1.5B parameters).
Speech Adapter: Applies 5× frame downsampling followed by a feedforward network (FFN), producing representations for the LLM.
LLM Backbone: Qwen2.5-Instruct (0.5B–14B parameters) acts as a decoder-only Transformer for speech-instruction understanding.
Gated Fusion: Combines LLM hidden states and sampled text tokens, providing context to the speech generation module.
Text-to-Speech LLM ( $\mathcal{M}_\mathrm{TTS}$ ): An autoregressive Transformer LM, initialized from Qwen2.5-0.5B, with vocabulary expanded by 6,561 discrete speech tokens.
Chunk-aware Causal Flow Matching + Vocoder: Converts discrete speech tokens into mel-spectrograms using a frozen, pretrained CosyVoice 2 model and then synthesizes audio waveforms via a streaming HiFi-GAN vocoder (≈50M parameters).

High-level dataflow:

Input speech $X$ → Whisper encoder → speech adapter → LLM → gate fusion → TTS LM → flow matching → mel-spectrogram → HiFi-GAN vocoder → output waveform $Y^S$.

2. Component Specifications

Speech Encoder

Whisper-large-v3: Processes 128-dimensional log-mel features (25 ms window, 10 ms shift); Transformer architecture; 1.5B parameters; weights frozen during LLaMA-Omni 2 training.

Speech Adapter

Downsampling: Concatenates every $k=5$ frames, reducing input sequence length by 5×.
FFN: Single-layer, intermediate dimension 2048, output matching LLM input dimension.
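The adapter's two steps can be illustrated with a minimal pure-Python sketch. It assumes that "single-layer" means one hidden layer between two linear projections, that the activation is ReLU, and that trailing frames which do not fill a group of $k$ are dropped; none of these details are stated in the text. The dimensions are toy values, and `concat_downsample`, `make_layer`, and `adapter` are hypothetical names.

```python
import random

def concat_downsample(frames, k=5):
    """Concatenate every k consecutive frames into one longer vector.

    Trailing frames that do not fill a complete group are dropped
    (an assumption; padding would be an equally valid choice).
    """
    usable = len(frames) - len(frames) % k
    return [
        [value for frame in frames[i:i + k] for value in frame]
        for i in range(0, usable, k)
    ]

def linear(x, weight, bias):
    """Dense layer; weight is a list of out_dim rows of length in_dim."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def make_layer(in_dim, out_dim, rng):
    """Randomly initialized (weight, bias) pair for a dense layer."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim

def adapter(frames, layer1, layer2, k=5):
    """Downsample by k, then apply projection -> ReLU -> projection."""
    return [linear(relu(linear(x, *layer1)), *layer2)
            for x in concat_downsample(frames, k)]

rng = random.Random(0)
ENC_DIM, HIDDEN, LLM_DIM = 8, 16, 12   # toy sizes, not the real dimensions
frames = [[rng.random() for _ in range(ENC_DIM)] for _ in range(23)]
layer1 = make_layer(5 * ENC_DIM, HIDDEN, rng)
layer2 = make_layer(HIDDEN, LLM_DIM, rng)
out = adapter(frames, layer1, layer2)
print(len(frames), "->", len(out), "vectors of size", len(out[0]))
```

Concatenation keeps all the information in each group of frames and leaves it to the FFN to compress it, at the cost of a 5× wider input to the first projection.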

LLM Backbone

Qwen2.5-Instruct: Available in 0.5B, 1.5B, 3B, 7B, and 14B sizes; decoder-only Transformer with standard self-attention; trained with a cross-entropy loss to map speech instructions to text responses.

Text-to-Speech LM ( $\mathcal{M}_\mathrm{TTS}$ )

Same architecture as Qwen2.5-0.5B; vocabulary expanded by 6,561 new discrete speech tokens.
All weights are initialized from Qwen2.5-0.5B, except the new token embeddings, which are randomly initialized (≈0.5B parameters in total).
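As a rough illustration of the vocabulary expansion, the sketch below appends 6,561 randomly initialized embedding rows to a stand-in pretrained table and maps speech codes to ids in the expanded vocabulary. The base vocabulary size, the embedding width, the Gaussian initialization with `std=0.02`, and the choice to place speech tokens directly after the text tokens are all assumptions made for the example; the function names are hypothetical.

```python
import random

NUM_SPEECH_TOKENS = 6561   # number of added speech tokens, from the text
BASE_VOCAB_SIZE = 1000     # hypothetical toy value, not the real Qwen2.5 vocabulary size

def speech_code_to_id(code, base=BASE_VOCAB_SIZE):
    """Map a speech code in [0, 6561) to its id in the expanded vocabulary."""
    if not 0 <= code < NUM_SPEECH_TOKENS:
        raise ValueError(f"speech code out of range: {code}")
    return base + code

def id_to_speech_code(token_id, base=BASE_VOCAB_SIZE):
    """Return the speech code for an id, or None if the id is a text token."""
    code = token_id - base
    return code if 0 <= code < NUM_SPEECH_TOKENS else None

def expand_embeddings(pretrained_rows, num_new, rng, std=0.02):
    """Keep pretrained rows unchanged and append randomly initialized rows."""
    dim = len(pretrained_rows[0])
    new_rows = [[rng.gauss(0.0, std) for _ in range(dim)] for _ in range(num_new)]
    return pretrained_rows + new_rows

rng = random.Random(0)
pretrained = [[0.1] * 4 for _ in range(BASE_VOCAB_SIZE)]   # stand-in for pretrained embeddings
table = expand_embeddings(pretrained, NUM_SPEECH_TOKENS, rng)
print(len(table))                                   # rows after expansion
print(speech_code_to_id(0), id_to_speech_code(42))  # id 42 is a text token here
```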

Flow Matching Model and Vocoder

Chunk-aware Flow Matching: Pretrained CosyVoice 2 (frozen); performs streaming synthesis every $W$ tokens.
HiFi-GAN Vocoder: Streams $2W$ mel frames per chunk; ≈50M parameters.

3. Training Methodology

Dataset Construction

200,000 Multi-turn Dialogues: Derived by rewriting the Alpaca and UltraChat corpora into dialogues; the number of turns is sampled as $N \sim \mathrm{Pois}(\lambda=2)$, truncated to 1–5 turns.
Instruction Synthesis: fish-speech-1.5 produces a random voice prompt for each dialogue, and CosyVoice2-0.5B clones that voice to synthesize the spoken instructions.
Response Synthesis: CosyVoice2-0.5B generates all responses in a single, uniform voice.
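The turn-count sampling can be sketched as follows. The text does not say whether "truncated" means rejecting out-of-range draws or clipping them to the boundary; this sketch assumes rejection sampling, and `sample_num_turns` is a hypothetical helper name. The Poisson draw uses Knuth's multiplication method so that only the standard library is needed.

```python
import math
import random
from collections import Counter

def sample_poisson(lam, rng):
    """Draw from Poisson(lam) with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def sample_num_turns(rng, lam=2.0, low=1, high=5):
    """Draw N ~ Pois(lam) and redraw until low <= N <= high (assumed truncation rule)."""
    while True:
        n = sample_poisson(lam, rng)
        if low <= n <= high:
            return n

rng = random.Random(0)
turns = [sample_num_turns(rng) for _ in range(10_000)]
print(sorted(Counter(turns).items()))   # empirical histogram over 1..5 turns
```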

Training Stages

Speech-to-Text (Stage I(a)):
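- Freeze the speech encoder; train the adapter and LLM on pairs $(X, Y^T)$, where $Y^T$ is the text response to speech instruction $X$.
- Loss:
$\mathcal{L}_\mathrm{S2T} = -\sum_{i=1}^{|Y^T|} \log P\left(y^T_i \mid X, Y^T_{<i}\right)$
TTS LM Pretraining (Stage I(b)):
- Train $\mathcal{M}_\mathrm{TTS}$ on pairs $(Y^T, Y^U)$, where $Y^U$ is the discrete speech-token sequence of the spoken response, with gate fusion disabled.
- Loss:
$\mathcal{L}_\mathrm{TTS} = -\sum_{j=1}^{|Y^U|} \log P\left(y^U_j \mid Y^T, Y^U_{<j}\right)$
End-to-end Speech-to-Speech (Stage II):
- Freeze the encoder, adapter, and LLM; train gate fusion and $\mathcal{M}_\mathrm{TTS}$ on triples $(X, Y^T, Y^U)$.
- Streaming TTS loss (read–write):
$\mathcal{L}_\mathrm{S2S} = -\sum_{j=1}^{|Y^U|} \log P\left(y^U_j \mid C_{\le r(j)}, Y^U_{<j}\right)$, where $C$ is the sequence of fused representations and $r(j)$ is the number of them already read when speech token $j$ is written.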
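The freeze/train split across the three stages can be written down as a small lookup table. This is only a bookkeeping sketch of what the stage descriptions state; the component and stage keys are hypothetical names, and the flow-matching model and vocoder are listed as never trained because no stage mentions updating them.

```python
COMPONENTS = (
    "speech_encoder", "speech_adapter", "llm",
    "gate_fusion", "tts_lm", "flow_matching", "vocoder",
)

# Components whose weights are updated in each stage, per the stage descriptions.
TRAINABLE = {
    "stage_1a": {"speech_adapter", "llm"},   # speech-to-text
    "stage_1b": {"tts_lm"},                  # TTS LM pretraining; gate fusion is disabled, not trained
    "stage_2": {"gate_fusion", "tts_lm"},    # end-to-end speech-to-speech
}

def frozen_components(stage):
    """Return the components that receive no gradient updates in the given stage."""
    return [name for name in COMPONENTS if name not in TRAINABLE[stage]]

for stage in TRAINABLE:
    print(stage, "trains", sorted(TRAINABLE[stage]), "| frozen:", frozen_components(stage))
```

In a real training loop the same table would decide which parameter groups are passed to the optimizer.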

Hyperparameters

Batch size: 32; Stage I(a): 3 epochs, learning rate […]; Stage I(b): 5 epochs, learning rate […]; Stage II: 1 epoch, learning rate […].
Schedule: 3% warmup followed by cosine annealing.
Hardware: 4×H800 GPUs for the 14B model, 4×L40 GPUs for the smaller models.
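A 3% warmup followed by cosine annealing can be sketched as a function of the step index. The linear shape of the warmup, the decay to `min_lr=0.0`, and the `peak_lr` value used in the demo are assumptions for illustration only; they are not the per-stage learning rates used in training.

```python
import math

def lr_at(step, total_steps, peak_lr, warmup_ratio=0.03, min_lr=0.0):
    """Learning rate at a 0-indexed step: linear warmup, then cosine decay."""
    warmup_steps = max(1, round(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Ramp linearly so that the last warmup step reaches peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

TOTAL_STEPS = 1000
PEAK_LR = 1.0   # hypothetical value; the schedule simply scales with it
schedule = [lr_at(s, TOTAL_STEPS, PEAK_LR) for s in range(TOTAL_STEPS)]
print(schedule[0], max(schedule), schedule[-1])
```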

4. Real-Time Streaming and Decoding

Read–Write Streaming

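For every $R$ text tokens generated by the LLM (default $R$ = […]), the TTS LM generates $W$ speech tokens (default $W$ = […]). After the LLM finishes, the remaining speech tokens are generated autoregressively.

Latency Calculation:

End-to-first-chunk latency covers everything needed before the first audio chunk can be played: generating the first $R$ text tokens, generating the first $W$ speech tokens, and running flow matching and the vocoder on that chunk.

On NVIDIA L40, Omni2-7B with the default $R$ and $W$ achieves ≈583 ms end-to-first-chunk latency.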

Decoding Algorithms

LLM: Greedy decoding for stable generation.
TTS LM: Sampling with temperature 1.0 to minimize repetition.
Flow Matching and Vocoder: Streaming synthesis per chunk.
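The two token-selection rules differ only in how the next token is picked from the logits. The following sketch shows both on a toy logit vector; `greedy` and `sample` are hypothetical helper names, not functions from the LLaMA-Omni 2 code base.

```python
import math
import random

def greedy(logits):
    """Return the index of the largest logit (deterministic)."""
    return max(range(len(logits)), key=logits.__getitem__)

def sample(logits, rng, temperature=1.0):
    """Sample an index from softmax(logits / temperature)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]   # subtract the max for numerical stability
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

rng = random.Random(0)
logits = [0.1, 2.0, -1.0, 1.5]
print("greedy:", greedy(logits))
print("sampled:", [sample(logits, rng) for _ in range(10)])
```

Greedy decoding always returns the same token for the same logits, whereas sampling at temperature 1.0 draws from the unmodified model distribution, so repeated calls can return different tokens.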

Gate Fusion

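The gate fusion mechanism computes a fused representation for each text position by combining the LLM hidden state with the sampled text token through a learned gate; the resulting sequence provides the context for the TTS LM.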
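A common form of gated fusion is sketched below, as an assumption rather than as the paper's exact formula: a sigmoid gate $g = \sigma(W_g [h; e] + b_g)$ is computed from the concatenation of the hidden state $h$ and the text-token embedding $e$, and the output is $g \odot h + (1 - g) \odot e$. The sketch also assumes that $h$ has already been projected to the same width as $e$; all names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gate_fusion(hidden, text_emb, gate_weight, gate_bias):
    """Fuse two equal-length vectors with an element-wise learned gate.

    gate_weight has len(hidden) rows, each of length len(hidden) + len(text_emb).
    """
    concat = hidden + text_emb
    gate = [
        sigmoid(sum(w * v for w, v in zip(row, concat)) + b)
        for row, b in zip(gate_weight, gate_bias)
    ]
    return [g * h + (1.0 - g) * e for g, h, e in zip(gate, hidden, text_emb)]

hidden = [1.0, 2.0, 3.0]       # stand-in for a projected LLM hidden state
text_emb = [0.0, 0.0, 1.0]     # stand-in for the sampled token's embedding
zero_weight = [[0.0] * 6 for _ in range(3)]

# With zero weights and zero bias the gate is 0.5 everywhere, so the output is the average.
print(gate_fusion(hidden, text_emb, zero_weight, [0.0] * 3))
# A large positive bias pushes the gate towards 1, so the output follows the hidden state.
print(gate_fusion(hidden, text_emb, zero_weight, [20.0] * 3))
```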

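Read–Write Streaming Algorithm (pseudocode)

Given read size $R$ and write size $W$:
repeat until the LLM emits end-of-sequence:
    read: generate up to $R$ text tokens with the LLM and pass their fused representations to the TTS LM
    write: generate $W$ speech tokens with the TTS LM
    synthesize: convert the $W$ speech tokens to $2W$ mel frames with flow matching, run the vocoder, and emit the audio chunk
after the LLM has finished:
    keep generating speech tokens autoregressively, synthesizing and emitting a chunk every $W$ tokens, until the TTS LM stops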
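The read–write schedule described in this section can be simulated without any model. The sketch below takes the total numbers of text and speech tokens as given (in a real system both are decided by end-of-sequence tokens), interleaves reads and writes, and counts mel frames at two per speech token. The values used in the demo are arbitrary and are not the model's defaults; the function names are hypothetical.

```python
def read_write_schedule(num_text_tokens, num_speech_tokens, read_size, write_size):
    """Return the interleaved list of ("read", n) and ("write", n) events."""
    events = []
    text_done = speech_done = 0
    # Alternate reads and writes while the LLM is still producing text.
    while text_done < num_text_tokens and speech_done < num_speech_tokens:
        n_read = min(read_size, num_text_tokens - text_done)
        text_done += n_read
        events.append(("read", n_read))
        n_write = min(write_size, num_speech_tokens - speech_done)
        speech_done += n_write
        events.append(("write", n_write))
    # The LLM has finished: write the remaining speech tokens chunk by chunk.
    while speech_done < num_speech_tokens:
        n_write = min(write_size, num_speech_tokens - speech_done)
        speech_done += n_write
        events.append(("write", n_write))
    return events

def mel_frames_per_chunk(events):
    """Each written chunk of n speech tokens yields 2 * n mel frames."""
    return [2 * n for kind, n in events if kind == "write"]

events = read_write_schedule(num_text_tokens=5, num_speech_tokens=14,
                             read_size=2, write_size=4)
print(events)
print(mel_frames_per_chunk(events))
```

The first audio chunk depends only on the first read and the first write, which is why latency to the first chunk does not grow with the length of the response.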

5. Performance and Benchmarking

LLaMA-Omni 2 demonstrates strong performance in both speech-to-text (S2T) and speech-to-speech (S2S) tasks across various metrics. Key results are summarized below.

Model	Llama Qs (S2T)	Llama Qs (S2S)	Web Qs (S2T)	Web Qs (S2S)	GPT Score (S2T)	GPT Score (S2S)	ASR-WER↓	UTMOS↑	Latency (ms)↓
GLM-4-Voice (9B)	64.7	50.7	32.2	15.9	4.16	4.09	9.02	3.48	1562.8
LLaMA-Omni (8B)	67.7	49.0	33.4	23.7	3.99	3.52	5.95	3.67	346.7
Omni2-7B	70.3	60.7	34.5	31.3	4.28	4.15	3.26	4.19	582.9
Omni2-14B	73.0	62.7	40.4	37.1	4.56	4.35	3.89	4.20	663.3

Key observations:

Omni 2 markedly improves both S2T and S2S accuracy over GLM-4-Voice and sharply reduces the S2T→S2S drop (Web Qs: 34.5 → 31.3 for Omni2-7B versus 32.2 → 15.9 for GLM-4-Voice).
Instruction following (GPT-4o scores), ASR-WER, and naturalness (UTMOS ≈ 4.2) all improve over both baselines under real-time streaming synthesis.
End-to-first-chunk latency (≈600 ms) is less than half that of GLM-4-Voice (1562.8 ms), though higher than that of LLaMA-Omni (346.7 ms), and meets real-time deployment requirements.
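The S2T→S2S drop and the latency comparison can be recomputed directly from the table. The sketch below copies the Web Qs and latency columns; the dictionary layout and function names are just one convenient arrangement.

```python
RESULTS = {
    "GLM-4-Voice (9B)": {"web_s2t": 32.2, "web_s2s": 15.9, "latency_ms": 1562.8},
    "LLaMA-Omni (8B)":  {"web_s2t": 33.4, "web_s2s": 23.7, "latency_ms": 346.7},
    "Omni2-7B":         {"web_s2t": 34.5, "web_s2s": 31.3, "latency_ms": 582.9},
    "Omni2-14B":        {"web_s2t": 40.4, "web_s2s": 37.1, "latency_ms": 663.3},
}

def web_qs_drop(model):
    """Accuracy points lost on Web Qs when going from text output to speech output."""
    row = RESULTS[model]
    return round(row["web_s2t"] - row["web_s2s"], 1)

def latency_ratio(baseline, model):
    """How many times higher the baseline's latency is than the model's."""
    return RESULTS[baseline]["latency_ms"] / RESULTS[model]["latency_ms"]

for name in RESULTS:
    print(f"{name}: Web Qs drop = {web_qs_drop(name)} points")
print(f"GLM-4-Voice / Omni2-7B latency: {latency_ratio('GLM-4-Voice (9B)', 'Omni2-7B'):.2f}x")
```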

6. Significance, Context, and Implications

LLaMA-Omni 2 establishes that integration of a lightweight Qwen2.5 LLM with frozen open speech components (Whisper, CosyVoice 2) and end-to-end streaming AR TTS modeling enables efficient, high-quality real-time spoken dialogue systems. The fact that Omni 2 achieves superior performance relative to models trained on orders of magnitude more audio (e.g., millions of hours for GLM-4-Voice) with only 200,000 synthetic multi-turn dialogues suggests a shift in the relative importance of data quantity versus architectural modularity and pretraining alignment in SpeechLMs.

A plausible implication is that, for real-time high-fidelity spoken interaction, exhaustive supervised speech data may be less critical than previously assumed, provided that pre-trained components and data-efficient fusion strategies are employed. This suggests new avenues for multimodal chatbot training that emphasize modular integration, parameter efficiency, and low-latency streaming generation (Fang et al., 5 May 2025).

References

Fang et al. (5 May 2025). LLaMA-Omni2: LLM-based Real-time Spoken Chatbot with Autoregressive Streaming Speech Synthesis.