VocalNet-M2: Efficient Low-Latency SLM
- VocalNet-M2 is a low-latency spoken language model that integrates multi-codebook tokenization and multi-token prediction to eliminate the flow-matching bottleneck.
- The architecture achieves a nearly 50% reduction in first-chunk latency (∼349 ms) while retaining strong speech quality (6.07% WER) relative to previous models.
- Its modular Thinker–Talker pipeline with stacked MTP layers enables efficient real-time processing, ideal for voice assistants, human-robot interfaces, and telepresence.
VocalNet-M2 is a low-latency spoken language model (SLM) that advances the architecture of end-to-end speech generation systems by introducing two core techniques: integrated multi-codebook tokenization and multi-token prediction (MTP). Developed with the aim of minimizing response latency in real-time interactive applications while maintaining high speech and text quality, VocalNet-M2 eliminates the flow-matching bottleneck present in previous SLMs by directly generating multi-track speech tokens. The architecture is designed to support fast, efficient, and high-fidelity speech synthesis, making it applicable to use cases such as voice assistants, human-robot interaction, and telepresence systems (Wang et al., 13 Nov 2025).
1. Multi-Codebook Tokenization
VocalNet-M2 employs the XY-Tokenizer (Gong et al., 2025) for discretizing speech into multiple codebook representations. The tokenizer comprises eight independent codebooks $\{\mathcal{C}^{(i)}\}_{i=1}^{8}$, each containing $K$ embedding vectors $e^{(i)}_k \in \mathbb{R}^{d}$ of dimension $d$. For a given acoustic frame $x_t$, a small encoder network produces a latent vector $z_t$. Each codebook then quantizes this latent via nearest-neighbor search:

$$q_t^{(i)} = \arg\min_{k \in \{1, \dots, K\}} \big\| z_t^{(i)} - e^{(i)}_k \big\|_2, \qquad i = 1, \dots, 8.$$

This results in an 8-tuple of codebook indices $\big(q_t^{(1)}, \dots, q_t^{(8)}\big)$ at each timestep $t$.
While the XY-Tokenizer's explicit loss is omitted in the primary text, the standard vector-quantization variational autoencoder (VQ-VAE) objective is implied:

$$\mathcal{L}_{\mathrm{VQ}} = \big\| x_t - \hat{x}_t \big\|_2^2 + \big\| \mathrm{sg}[z_t] - e_{q_t} \big\|_2^2 + \beta \big\| z_t - \mathrm{sg}[e_{q_t}] \big\|_2^2,$$

where $\mathrm{sg}[\cdot]$ denotes the stop-gradient operator and $\beta$ weights the commitment term. The approach allows rich, low-latency, multi-channel acoustic representation, but the model's robustness depends heavily on the diversity and quality of training data. Empirically, multi-codebook tokenization requires larger, higher-quality datasets to match the robustness and word error rate (WER) of single-codebook approaches.
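A minimal PyTorch sketch of the per-codebook nearest-neighbor quantization and the implied VQ-VAE objective is given below; the codebook size, latent dimension, and commitment weight are illustrative assumptions rather than values reported for the XY-Tokenizer.

```python
import torch
import torch.nn.functional as F

# Illustrative sizes only: the XY-Tokenizer's real codebook size and latent
# dimension are not specified here. The eight codebooks follow the paper.
NUM_CODEBOOKS, CODEBOOK_SIZE, LATENT_DIM = 8, 1024, 64

# One embedding table per codebook: (NUM_CODEBOOKS, CODEBOOK_SIZE, LATENT_DIM).
codebooks = torch.randn(NUM_CODEBOOKS, CODEBOOK_SIZE, LATENT_DIM)


def quantize(z: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Nearest-neighbor quantization of per-codebook encoder latents.

    z: (batch, NUM_CODEBOOKS, LATENT_DIM), one latent slice per codebook.
    Returns the 8-tuple of indices per frame and the quantized vectors.
    """
    # Distance of each latent slice to every entry of its own codebook:
    # result shape (batch, NUM_CODEBOOKS, CODEBOOK_SIZE).
    dists = torch.cdist(z.transpose(0, 1), codebooks).transpose(0, 1)
    indices = dists.argmin(dim=-1)                       # (batch, NUM_CODEBOOKS)
    quantized = torch.stack(
        [codebooks[i][indices[:, i]] for i in range(NUM_CODEBOOKS)], dim=1
    )
    return indices, quantized


def vq_loss(x, x_hat, z, quantized, beta: float = 0.25):
    """Standard VQ-VAE objective: reconstruction + codebook + commitment terms."""
    recon = F.mse_loss(x_hat, x)
    codebook_term = F.mse_loss(quantized, z.detach())    # sg[] on the encoder output
    commit_term = F.mse_loss(z, quantized.detach())      # sg[] on the codebook entry
    return recon + codebook_term + beta * commit_term


indices, q = quantize(torch.randn(4, NUM_CODEBOOKS, LATENT_DIM))
print(indices.shape)  # torch.Size([4, 8]) -> an 8-tuple of indices per frame
```

Each codebook quantizes its own slice of the encoder latent, which is why the distances are computed per codebook rather than against a single shared table.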
2. Multi-Token Prediction (MTP) Acceleration
To address the latency introduced by serial, step-by-step autoregression in conventional speech decoders, VocalNet-M2 incorporates $M$ MTP layers stacked atop its Talker module. Each MTP layer predicts one additional future token in parallel, permitting inference to stride forward by $M{+}1$ tokens (the next token plus $M$ look-ahead tokens) per decoding pass.
Let $h_t$ denote the upsampled semantic representation, and let $s_t = \sum_{i=1}^{8} E^{(i)}\big(q_t^{(i)}\big)$ sum the previous codebook embeddings. The base Talker transformer produces the next token set:

$$\hat{q}_{t+1}^{(i)} = \mathrm{Talker}\big(h_{\le t}, s_{\le t}\big), \qquad i = 1, \dots, 8.$$

Successive MTP layers predict tokens at timestep $t+1+m$ for $m = 1, \dots, M$:

$$\hat{q}_{t+1+m}^{(i)} = \mathrm{MTP}_m\big(h_{\le t}, s_{\le t}\big), \qquad i = 1, \dots, 8.$$

Training minimizes the total cross-entropy over next-step and future predictions:

$$\mathcal{L}_{\mathrm{MTP}} = \sum_{i=1}^{8} \mathrm{CE}\big(\hat{q}_{t+1}^{(i)}, q_{t+1}^{(i)}\big) + \sum_{m=1}^{M} \sum_{i=1}^{8} \mathrm{CE}\big(\hat{q}_{t+1+m}^{(i)}, q_{t+1+m}^{(i)}\big).$$
Empirical ablation demonstrates that stacking MTP layers yields clear gains up to an optimal depth (WER improves from 8.56% to 6.07%, with a stable UTMOS rating).
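The decoding loop below sketches how stacked MTP heads allow one autoregressive pass to emit several frames of 8-codebook tokens instead of one. The module names (`TalkerWithMTP`, `decode_step`), the GRU backbone standing in for the Talker transformer, and the linear MTP heads are simplifying assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn as nn

# Illustrative sizes; VOCAB and HIDDEN are assumptions, NUM_MTP is the number
# of stacked MTP layers (the paper ablates this choice).
NUM_CODEBOOKS, VOCAB, HIDDEN, NUM_MTP = 8, 1024, 512, 2


class TalkerWithMTP(nn.Module):
    """Base AR step predicts frame t+1; each MTP layer adds one further frame."""

    def __init__(self):
        super().__init__()
        # A GRU stands in for the autoregressive Talker transformer.
        self.backbone = nn.GRU(HIDDEN, HIDDEN, batch_first=True)
        self.base_heads = nn.ModuleList(
            [nn.Linear(HIDDEN, VOCAB) for _ in range(NUM_CODEBOOKS)]
        )
        # One stack of eight output heads per MTP layer (real MTP layers are
        # transformer blocks; plain linear heads keep the sketch short).
        self.mtp_heads = nn.ModuleList(
            [nn.ModuleList([nn.Linear(HIDDEN, VOCAB) for _ in range(NUM_CODEBOOKS)])
             for _ in range(NUM_MTP)]
        )

    @torch.no_grad()
    def decode_step(self, fused: torch.Tensor) -> torch.Tensor:
        """fused: (batch, T, HIDDEN) upsampled semantic + summed codebook embeddings.

        Returns (batch, 1 + NUM_MTP, NUM_CODEBOOKS) token indices: a single pass
        strides forward by 1 + NUM_MTP frames instead of one.
        """
        h, _ = self.backbone(fused)
        last = h[:, -1]                                   # state at the current frame
        frames = [torch.stack([hd(last).argmax(-1) for hd in self.base_heads], dim=-1)]
        for heads in self.mtp_heads:                      # future frames, same pass
            frames.append(torch.stack([hd(last).argmax(-1) for hd in heads], dim=-1))
        return torch.stack(frames, dim=1)


tokens = TalkerWithMTP().decode_step(torch.randn(2, 12, HIDDEN))
print(tokens.shape)  # torch.Size([2, 3, 8]): base frame + 2 MTP frames, 8 codebooks each
```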
3. Architectural Design
VocalNet-M2 follows a modular Thinker–Talker pipeline:
- Audio Encoding: Raw audio is processed by a frozen Whisper-large-v3 encoder and a learned downsample adapter, producing compact acoustic representations.
- Semantic Decoding (Thinker): The Thinker, a Qwen3-8B autoregressive transformer, generates response text tokens and their corresponding hidden states.
- Fusion & Upsampling: Text token embeddings and Thinker hidden states are fused via a learned linear layer, then upsampled (by a factor of 3) to produce the frame-level semantic representation, matching the speech-token frame count.
- Audio Token Generation (Talker with MTP): The Talker, an autoregressive transformer with eight parallel output heads, consumes and previous codebook embeddings, outputting eight codebook indices per timestep. Stacked MTP layers predict future tokens in the same pass.
- Vocoder Synthesis: The contiguous 8-codebook indices are fed directly to a lightweight neural vocoder, synthesizing the waveform without flow-matching.
Block Diagram:
```
Raw Audio
  → Whisper Encoder → Downsample Adapter
  → Thinker (Qwen3-8B)
  → Fusion Layer → Upsample (×3)
  → Talker (8-track AR Transformer + MTP Layers)
  → 8 Codebook Token Streams
  → Lightweight Neural Vocoder
  → Output Audio Waveform
```
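To make the data flow concrete, the following sketch traces a request through the Thinker–Talker pipeline using lightweight placeholder modules; every module name and size here (`whisper_encoder`, `downsample_adapter`, `talker_mtp`, the 1024-entry vocabulary, the 320-sample hop of the vocoder stub) is an assumption that mirrors the block diagram rather than the published API.

```python
import torch
import torch.nn as nn

HIDDEN, UPSAMPLE = 512, 3  # factor-3 upsampling is from the paper; HIDDEN is illustrative

# Lightweight placeholders for the real components in the block diagram.
whisper_encoder    = nn.Identity()              # frozen Whisper-large-v3 encoder (stub)
downsample_adapter = nn.Linear(HIDDEN, HIDDEN)  # learned downsample adapter (stub)
thinker            = nn.GRU(HIDDEN, HIDDEN, batch_first=True)  # stands in for Qwen3-8B
fusion             = nn.Linear(2 * HIDDEN, HIDDEN)  # fuses text embeds + Thinker states
talker_mtp         = nn.Linear(HIDDEN, 8 * 1024)    # 8 heads over a 1024-entry vocab (stub)
vocoder            = lambda codes: torch.zeros(codes.shape[0], codes.shape[1] * 320)  # waveform stub


def respond(audio_feats: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
    """audio_feats: (B, T_a, HIDDEN) acoustic features; text_embeds: (B, T_t, HIDDEN).

    In the real system the Thinker generates the text autoregressively and the
    Talker decodes with MTP striding; this function only shows the data flow.
    """
    enc = downsample_adapter(whisper_encoder(audio_feats))   # audio encoding
    thinker_states, _ = thinker(enc)                         # semantic decoding (Thinker)
    t = min(thinker_states.shape[1], text_embeds.shape[1])
    fused = fusion(torch.cat([text_embeds[:, :t], thinker_states[:, :t]], dim=-1))
    fused = fused.repeat_interleave(UPSAMPLE, dim=1)         # 3x upsampling to frame rate
    logits = talker_mtp(fused).view(fused.shape[0], fused.shape[1], 8, 1024)
    codes = logits.argmax(-1)                                # 8 codebook indices per frame
    return vocoder(codes)                                    # waveform, no flow-matching stage


wave = respond(torch.randn(1, 20, HIDDEN), torch.randn(1, 15, HIDDEN))
print(wave.shape)  # torch.Size([1, 14400])
```

The point of the sketch is the absence of a flow-matching stage: the 8-codebook indices go straight to the vocoder.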
4. Empirical Evaluation and Comparative Performance
VocalNet-M2 was evaluated via a rigorous three-phase training regimen: (i) Talker pretraining on 10,000 h of TTS data (Emilia corpus), (ii) Downsample Adapter and Thinker training with LoRA on speech-to-text, and (iii) end-to-end fine-tuning on 7,000 h of speech-to-speech dialogues from VoiceAssistant, UltraChat, and Tulu-3-derived samples.
Core metrics:
- Text Quality: scored by AlpacaEval, Llama Questions, TriviaQA, Web Questions (0–10 scale)
- Speech Quality: UTMOS (predicted MOS), word error rate (WER, via Whisper-large-v3)
- Latency: first-chunk generation time, measured as the time to produce the first 0.8 s of audio on a single NVIDIA L20 GPU
Summary Results:
| Model | AlpacaEval | LlamaQ | TriviaQA | WebQ | WER (%) | UTMOS | Latency (ms) |
|---|---|---|---|---|---|---|---|
| SLAM-Omni | 3.50 | 2.94 | 0.39 | 0.84 | 5.78 | 4.46 | 702 ± 30 |
| VocalNet-8B | 7.12 | 7.95 | 6.24 | 6.48 | 3.64 | 4.49 | 556 ± 8 |
| GLM-4-Voice | 5.86 | 7.74 | 4.95 | 5.56 | 11.90 | 4.23 | 1060 ± 2 |
| MiniCPM-o | 6.13 | 7.72 | 6.43 | 7.16 | 9.52 | 4.14 | 894 ± 82 |
| kimi-audio | 6.49 | 8.10 | 6.15 | 7.10 | 14.71 | 2.87 | 1745 ± 140 |
| Qwen2.5-Omni | 6.01 | 7.90 | 5.89 | 6.88 | 2.31 | 4.34 | — |
| VocalNet-M2 | 7.29 | 8.33 | 6.13 | 6.65 | 6.07 | 4.31 | 348.9 ± 2.9 |
VocalNet-M2 achieves a near 50% reduction in first-chunk latency (∼725 ms to ∼349 ms) relative to prior SLMs, retaining strong text and speech metrics.
Ablation Studies:
- Single vs. Multi-Codebook Tokenization: Multi-codebook models, when matched with high-quality, filtered data, close the WER and UTMOS gap with single-codebook models and provide a ∼2× latency speedup by forgoing the flow-matching model.
- Effect of MTP Layers: Increasing the number of MTP layers improves WER up to an optimal depth, after which gains plateau.
5. Insights, Advantages, and Limitations
By directly generating multi-codebook tokens, VocalNet-M2 obviates the need for a separate flow-matching decoder, which is a principal source of its latency reduction. The MTP stack widens the stride of autoregressive decoding, further reducing inference cost, particularly in resource-constrained or real-time interactive scenarios.
A primary limitation emerges from the data requirements for multi-codebook tokenization. Achieving robust tokenization that rivals single-codebook systems in WER and audio quality demands both extensive and high-quality training data. Consequently, extensions such as adaptive codebook sizing, semi-supervised learning, and improved semantic-acoustic embedding mechanisms become pertinent directions for future research.
6. Application Domains and Prospects
VocalNet-M2's design, with first-chunk latency reduced to ∼350 ms, is particularly suited to real-time dialogue, voice-assistant interactivity, remote conferencing, and responsive human-robot interfaces. This latency regime supports near-natural conversational turn-taking and improved user experience in interactive systems.
A plausible implication is that the removal of the flow-matching constraint and the multi-token generation mechanism together set a new trajectory for scalable, efficient SLMs, provided that future methods can efficiently address the data brittleness currently observed in multi-codebook training paradigms (Wang et al., 13 Nov 2025).