LLaMA-Omni 2: Real-Time SpeechLM
- LLaMA-Omni 2 is a series of speech language models that integrate a frozen Whisper encoder, an autoregressive streaming TTS, and a Qwen2.5 LLM for real-time spoken dialogue applications.
- It achieves state-of-the-art performance on both speech-to-text and speech-to-speech tasks, improving accuracy and cutting latency relative to baselines such as GLM-4-Voice.
- The model’s efficient training on 200,000 multi-turn dialogues, combined with modular pre-trained components, demonstrates a shift towards data-efficient, low-latency speech synthesis.
LLaMA-Omni 2 is a series of speech language models (SpeechLMs) designed for real-time, high-fidelity spoken chatbot applications. By integrating a speech encoder, an autoregressive streaming speech decoder, and an LLM backbone, LLaMA-Omni 2 achieves state-of-the-art performance on spoken question answering and speech instruction following. Despite being trained on only 200,000 multi-turn speech dialogue samples, it surpasses models trained on much larger datasets, such as GLM-4-Voice (Fang et al., 5 May 2025).
1. Model Architecture and Dataflow
LLaMA-Omni 2 utilizes a modular architecture structured around the following components:
- Speech Encoder: Inputs speech as 80-dimensional log-mel features and encodes via a frozen Whisper-large-v3 (≈1.5B parameters).
- Speech Adapter: Applies 5× frame downsampling followed by a feedforward network (FFN), producing representations for the LLM.
- LLM Backbone: Qwen2.5-Instruct (0.5B–14B parameters) acts as a decoder-only Transformer for speech-instruction understanding.
- Gated Fusion: Combines LLM hidden states and sampled text tokens, providing context to the speech generation module.
- Text-to-Speech LM: An autoregressive Transformer LM, initialized from Qwen2.5-0.5B, with its vocabulary expanded by 6,561 discrete speech tokens.
- Chunk-aware Causal Flow Matching + Vocoder: Converts discrete speech tokens into mel-spectrograms using a frozen, pretrained CosyVoice 2 model and then synthesizes audio waveforms via a streaming HiFi-GAN vocoder (≈50M parameters).
High-level dataflow:
Input speech → Whisper encoder → speech adapter → LLM → gate fusion → TTS LM → flow matching → mel-spectrogram → HiFi-GAN vocoder → output waveform.
2. Component Specifications
Speech Encoder
- Whisper-large-v3: Processes 80-dimensional log-mel features (25 ms window, 10 ms shift); Transformer architecture; 1.5B parameters; weights frozen during LLaMA-Omni 2 training.
Speech Adapter
- Downsampling: Concatenates every 5 consecutive frames, reducing the input sequence length by 5×.
- FFN: Single-layer, intermediate dimension 2048, output matching LLM input dimension.
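The adapter described above can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' code: the function name `speech_adapter`, the ReLU activation, the random weights, the choice to drop leftover frames, and the toy encoder and LLM dimensions are all assumptions. Only the 5× frame concatenation and the 2048-dimensional intermediate layer come from the text.

```python
import numpy as np

def speech_adapter(frames, w1, b1, w2, b2, k=5):
    """Concatenate every k consecutive frames, then apply a small FFN.

    frames: (T, d_enc) encoder outputs. Leftover frames that do not fill a
    full group of k are dropped (an assumption of this sketch).
    """
    T, d_enc = frames.shape
    T_trim = (T // k) * k
    # (T, d_enc) -> (T // k, k * d_enc): each row holds k stacked frames
    stacked = frames[:T_trim].reshape(T_trim // k, k * d_enc)
    hidden = np.maximum(stacked @ w1 + b1, 0.0)  # ReLU is an assumed activation
    return hidden @ w2 + b2

rng = np.random.default_rng(0)
d_enc, d_mid, d_llm = 16, 2048, 32  # d_enc and d_llm are toy sizes
w1 = rng.normal(scale=0.02, size=(5 * d_enc, d_mid))
b1 = np.zeros(d_mid)
w2 = rng.normal(scale=0.02, size=(d_mid, d_llm))
b2 = np.zeros(d_llm)

frames = rng.normal(size=(103, d_enc))  # 103 encoder frames
out = speech_adapter(frames, w1, b1, w2, b2)
print(out.shape)  # (20, 32)
```

The sequence length shrinks from 103 to 20 frames, so the LLM attends over roughly one fifth as many positions per utterance.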
LLM Backbone
- Qwen2.5-Instruct: Model sizes are 0.5B, 1.5B, 3B, 7B, and 14B; decoder-only, standard self-attention; trained with a cross-entropy loss to map speech instructions to text responses.
Text-to-Speech LM
- Same architecture as Qwen2.5-0.5B; vocabulary expanded by 6,561 new discrete speech tokens.
- All weights are initialized from Qwen2.5-0.5B (0.5B parameters); the embeddings of the new speech tokens are randomly initialized.
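The vocabulary expansion can be pictured as appending rows to a pretrained embedding table. The sketch below is illustrative only: the helper names, the toy vocabulary and hidden sizes, the initialization scale of 0.02, and the choice to place speech-token IDs directly after the text vocabulary are assumptions, not details confirmed by the source.

```python
import numpy as np

NUM_SPEECH_TOKENS = 6561  # from the text; numerically equal to 3 ** 8

def expand_embeddings(text_emb, num_new, rng, init_std=0.02):
    """Append randomly initialized rows for new tokens to an embedding table."""
    hidden = text_emb.shape[1]
    new_rows = rng.normal(scale=init_std, size=(num_new, hidden))
    return np.concatenate([text_emb, new_rows], axis=0)

def speech_token_to_vocab_id(speech_id, text_vocab_size):
    """Map a speech token index to its ID in the expanded vocabulary."""
    if not 0 <= speech_id < NUM_SPEECH_TOKENS:
        raise ValueError("speech_id out of range")
    return text_vocab_size + speech_id

rng = np.random.default_rng(0)
text_emb = rng.normal(size=(1000, 64))  # toy stand-in for pretrained embeddings
full_emb = expand_embeddings(text_emb, NUM_SPEECH_TOKENS, rng)
print(full_emb.shape)  # (7561, 64)
```

The pretrained rows are left untouched, so the TTS LM keeps its text knowledge while the new rows are learned from scratch.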
Flow Matching Model and Vocoder
- Chunk-aware Flow Matching: Pretrained CosyVoice 2 (frozen); performs streaming synthesis chunk by chunk, each time a new chunk of speech tokens is available.
- HiFi-GAN Vocoder: Streams 2 mel frames per chunk; ≈50M parameters.
3. Training Methodology
Dataset Construction
- 200,000 Multi-turn Dialogues: Derived by rewriting the Alpaca and UltraChat corpora into multi-turn dialogues, with the number of turns truncated to 1–5.
- Instruction Synthesis: fish-speech-1.5 produces a voice prompt in a random voice for each dialogue; CosyVoice2-0.5B then clones that voice to synthesize the dialogue's instructions.
- Response Synthesis: Uniform voice output generated by CosyVoice2-0.5B.
Training Stages
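- Speech-to-Text (Stage I(a)):
  - Freeze the speech encoder; train the adapter and LLM on (speech instruction, text response) pairs.
  - Loss: cross-entropy on the text response tokens, conditioned on the speech instruction.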
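- TTS LM Pretraining (Stage I(b)):
  - Train the TTS LM on (text response, speech token) pairs, with gate fusion disabled.
  - Loss: cross-entropy on the speech tokens, conditioned on the response text.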
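- End-to-end Speech-to-Speech (Stage II):
  - Freeze the encoder, adapter, and LLM; train the gate fusion module and the TTS LM on the speech-to-speech dialogue data.
  - Streaming TTS loss (read–write): cross-entropy on the speech tokens, where each speech token is conditioned only on the fused representations read so far.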
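All three stage objectives are next-token cross-entropy losses. One way to write them is given below, under notation assumed here rather than taken from the source: $X$ is the speech instruction, $Y^T = (y^T_1, \dots, y^T_N)$ the text response, $Y^U = (y^U_1, \dots, y^U_M)$ the discrete speech tokens of the response, $C = (c_1, \dots, c_N)$ the fused representations produced by gate fusion, and $r(j)$ the number of fused representations that have been read before speech token $j$ is written.

```latex
\begin{aligned}
\mathcal{L}_{\text{I(a)}} &= -\sum_{i=1}^{N} \log P_{\text{LLM}}\left(y^T_i \mid X,\, y^T_{<i}\right) \\
\mathcal{L}_{\text{I(b)}} &= -\sum_{j=1}^{M} \log P_{\text{TTS}}\left(y^U_j \mid Y^T,\, y^U_{<j}\right) \\
\mathcal{L}_{\text{II}}   &= -\sum_{j=1}^{M} \log P_{\text{TTS}}\left(y^U_j \mid c_{\le r(j)},\, y^U_{<j}\right),
\qquad r(j) = \min\left(N,\; R \left\lceil j / W \right\rceil\right)
\end{aligned}
```

Here $R$ and $W$ denote the number of LLM tokens read and speech tokens written per read–write step. The form of $r(j)$ follows directly from that alternation: speech tokens $1$ to $W$ see the first $R$ representations, tokens $W+1$ to $2W$ see the first $2R$, and so on.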
Hyperparameters
- Batch size: 32. Epochs: 3 for Stage I(a), 5 for Stage I(b), and 1 for Stage II, each stage with its own learning rate.
- 3% warmup, cosine-annealing schedule.
- Hardware: 4×H800 GPUs for the 14B model; 4×L40 GPUs for the smaller models.
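A schedule with 3% warmup followed by cosine annealing could look like the sketch below. The function name `lr_at_step`, the linear shape of the warmup, and annealing down to zero are assumptions. The peak learning rate is left as a parameter, and the example uses a unit peak so the printed values read as fractions of the peak.

```python
import math

def lr_at_step(step, total_steps, peak_lr, warmup_frac=0.03, min_lr=0.0):
    """Linear warmup for the first warmup_frac of steps, then cosine annealing."""
    warmup_steps = max(1, int(round(warmup_frac * total_steps)))
    if step < warmup_steps:
        # Ramp linearly so the last warmup step reaches peak_lr
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
    return min_lr + (peak_lr - min_lr) * cosine

total = 1000
for s in (0, 29, 30, 515, 1000):
    print(s, round(lr_at_step(s, total, peak_lr=1.0), 4))
```

With 1000 total steps, the first 30 steps ramp up to the peak, and the remaining steps follow a half cosine down to the minimum.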
4. Real-Time Streaming and Decoding
Read–Write Streaming
For every $R$ text tokens generated by the LLM (the read step), the TTS LM generates $W$ speech tokens (the write step). After the LLM finishes, the remaining speech tokens are generated autoregressively.
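Latency Calculation: latency is measured up to the first synthesized speech chunk, so it depends on the steps that must finish before that chunk is ready: the first read of LLM tokens, the first write of speech tokens, and flow matching plus vocoding for that chunk.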
- On an NVIDIA L40, Omni2-7B with the default read–write setting achieves ≈583 ms end-to-first-chunk latency.
Decoding Algorithms
- LLM: Greedy decoding for stable generation.
- TTS LM: Sampling with temperature 1.0 to minimize repetition.
- Flow Matching and Vocoder: Streaming synthesis per chunk.
Gate Fusion
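The gate fusion mechanism computes the fused representation as a gated combination of the LLM hidden state and the embedding of the corresponding sampled text token, so the TTS LM receives both contextual and lexical information at each position.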
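As an illustration of gated fusion, the sketch below assumes a sigmoid gate computed from the concatenated hidden state and text embedding, followed by an element-wise convex combination. The paper's exact parameterization (for example, any projection applied to the hidden states first) may differ, and the names `gate_fusion`, `w_g`, and `b_g` are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gate_fusion(h, e, w_g, b_g):
    """Fuse LLM hidden states h with text-token embeddings e via a learned gate.

    h, e: (n, d); w_g: (2 * d, d); b_g: (d,)
    """
    g = sigmoid(np.concatenate([h, e], axis=-1) @ w_g + b_g)  # gate in (0, 1)
    return g * h + (1.0 - g) * e  # element-wise convex combination

rng = np.random.default_rng(0)
n, d = 4, 8  # toy sizes
h = rng.normal(size=(n, d))
e = rng.normal(size=(n, d))
w_g = rng.normal(scale=0.1, size=(2 * d, d))
b_g = np.zeros(d)

c = gate_fusion(h, e, w_g, b_g)
print(c.shape)  # (4, 8)
```

Because the gate lies strictly between 0 and 1, every fused value sits between the corresponding hidden-state and embedding values. A zero-initialized gate reduces to their plain average.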
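Read–Write Streaming Algorithm
The streaming loop alternates between reading a group of LLM tokens and writing a chunk of speech tokens until the LLM finishes, and then generates the remaining speech tokens.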
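The read–write loop described in this section can be sketched as a Python generator. This is a simplified illustration, not the authors' implementation: `write_speech` and `finish_speech` are hypothetical callbacks standing in for the TTS LM, the remaining speech tokens are emitted as one final chunk, and the read and write sizes in the demo are arbitrary values, not the paper's defaults.

```python
from typing import Callable, Iterable, Iterator, List

def read_write_stream(
    llm_tokens: Iterable[int],
    write_speech: Callable[[List[int], int], List[int]],
    finish_speech: Callable[[List[int]], List[int]],
    read_size: int,
    write_size: int,
) -> Iterator[List[int]]:
    """Alternate between reading LLM tokens and writing speech-token chunks."""
    context: List[int] = []
    pending = 0
    for token in llm_tokens:
        context.append(token)  # read one LLM token
        pending += 1
        if pending == read_size:
            # Enough tokens read: write one chunk of speech tokens
            yield write_speech(context, write_size)
            pending = 0
    # LLM finished: generate whatever speech tokens remain
    tail = finish_speech(context)
    if tail:
        yield tail

def fake_write(context, n):
    return [len(context)] * n  # stand-in for n sampled speech tokens

def fake_finish(context):
    return [-1, -1]  # stand-in for the remaining speech tokens

chunks = list(read_write_stream(range(5), fake_write, fake_finish,
                                read_size=2, write_size=3))
print(chunks)  # [[2, 2, 2], [4, 4, 4], [-1, -1]]
```

In a full system, each yielded chunk would be passed to the flow matching model and vocoder as soon as it is produced, which is what lets audio playback begin before the text response is complete.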
5. Performance and Benchmarking
LLaMA-Omni 2 demonstrates strong performance in both speech-to-text (S2T) and speech-to-speech (S2S) tasks across various metrics. Key results are summarized below.
| Model | Llama Qs S2T | Llama Qs S2S | Web Qs S2T | Web Qs S2S | GPT Score S2T | GPT Score S2S | ASR-WER↓ | UTMOS↑ | Latency (ms)↓ |
|---|---|---|---|---|---|---|---|---|---|
| GLM-4-Voice (9B) | 64.7 | 50.7 | 32.2 | 15.9 | 4.16 | 4.09 | 9.02 | 3.48 | 1562.8 |
| LLaMA-Omni (8B) | 67.7 | 49.0 | 33.4 | 23.7 | 3.99 | 3.52 | 5.95 | 3.67 | 346.7 |
| Omni2-7B | 70.3 | 60.7 | 34.5 | 31.3 | 4.28 | 4.15 | 3.26 | 4.19 | 582.9 |
| Omni2-14B | 73.0 | 62.7 | 40.4 | 37.1 | 4.56 | 4.35 | 3.89 | 4.20 | 663.3 |
Key observations:
- Omni 2 markedly improves both S2T and S2S accuracy versus GLM-4-Voice and sharply reduces the S2T→S2S drop (Web Qs: 34.5 → 31.3 for Omni2-7B versus 32.2 → 15.9 for GLM-4-Voice).
- Instruction following (GPT-4o scores), ASR-WER, and naturalness (UTMOS ≈ 4.2) are all better than both baselines, even under real-time streaming synthesis.
- End-to-first-chunk latency (about 600 ms: 582.9 ms for Omni2-7B and 663.3 ms for Omni2-14B) is less than half that of GLM-4-Voice (1562.8 ms) and meets real-time deployment requirements.
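The S2T→S2S drop and the latency comparison can be recomputed directly from the table. The snippet below copies the Web Questions and latency columns. The dictionary layout and helper names are illustrative choices.

```python
# Web Questions accuracy (S2T, S2S) and latency in ms, copied from the table
RESULTS = {
    "GLM-4-Voice (9B)": {"web_s2t": 32.2, "web_s2s": 15.9, "latency_ms": 1562.8},
    "LLaMA-Omni (8B)": {"web_s2t": 33.4, "web_s2s": 23.7, "latency_ms": 346.7},
    "Omni2-7B": {"web_s2t": 34.5, "web_s2s": 31.3, "latency_ms": 582.9},
    "Omni2-14B": {"web_s2t": 40.4, "web_s2s": 37.1, "latency_ms": 663.3},
}

def s2t_to_s2s_drop(row):
    """Accuracy points lost when answering in speech instead of text."""
    return row["web_s2t"] - row["web_s2s"]

def latency_ratio(row, baseline):
    """How many times lower the latency is compared with the baseline."""
    return baseline["latency_ms"] / row["latency_ms"]

baseline = RESULTS["GLM-4-Voice (9B)"]
for name, row in RESULTS.items():
    print(f"{name}: drop = {s2t_to_s2s_drop(row):.1f} points, "
          f"latency ratio vs GLM-4-Voice = {latency_ratio(row, baseline):.2f}x")
```

The Omni 2 models lose about 3.2 to 3.3 points when moving from text to speech output, against 16.3 points for GLM-4-Voice, and their latency is roughly 2.7× (7B) and 2.4× (14B) lower.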
6. Significance, Context, and Implications
LLaMA-Omni 2 shows that combining a lightweight Qwen2.5 LLM with frozen open speech components (Whisper, CosyVoice 2) and end-to-end streaming autoregressive TTS modeling enables efficient, high-quality real-time spoken dialogue systems. Omni 2 outperforms models trained on orders of magnitude more audio (e.g., millions of hours for GLM-4-Voice) using only 200,000 synthetic multi-turn dialogues, which suggests a shift in the relative importance of data quantity versus architectural modularity and pretraining alignment in SpeechLMs.
A plausible implication is that, for real-time high-fidelity spoken interaction, exhaustive supervised speech data may be less critical than previously assumed, provided that pre-trained components and data-efficient fusion strategies are employed. This suggests new avenues for multimodal chatbot training that emphasize modular integration, parameter efficiency, and low-latency streaming generation (Fang et al., 5 May 2025).