VocalNet-M2: Efficient Low-Latency SLM

Updated 20 November 2025
  • VocalNet-M2 is a low-latency spoken language model that integrates multi-codebook tokenization and multi-token prediction to eliminate the flow-matching bottleneck.
  • The architecture achieves a nearly 50% reduction in first-chunk latency (∼349 ms) while maintaining competitive speech quality, with a WER of 6.07% and a stable UTMOS rating.
  • Its modular Thinker–Talker pipeline with stacked MTP layers enables efficient real-time processing, ideal for voice assistants, human-robot interfaces, and telepresence.

VocalNet-M2 is a low-latency spoken language model (SLM) that advances the architecture of end-to-end speech generation systems by introducing two core techniques: integrated multi-codebook tokenization and multi-token prediction (MTP). Developed with the aim of minimizing response latency in real-time interactive applications while maintaining high speech and text quality, VocalNet-M2 eliminates the flow-matching model bottleneck present in previous SLMs by directly generating multi-track speech tokens. The architecture is designed to support fast, efficient, and high-fidelity speech synthesis, making it applicable to use cases such as voice assistants, human-robot interaction, and telepresence systems (Wang et al., 13 Nov 2025).

1. Multi-Codebook Tokenization

VocalNet-M2 employs the XY-Tokenizer (Gong et al., 2025) for discretizing speech into multiple codebook representations. The tokenizer comprises $J=8$ independent codebooks, each containing $K$ embedding vectors of dimension $D$. For a given acoustic frame $x_t$, a small encoder network $E$ produces a latent vector $h_t = E(x_t) \in \mathbb{R}^D$. Each codebook $j$ then quantizes this latent via nearest-neighbor search:

$$a^{\mathrm{cb},j}_t = \arg\min_{1 \leq k \leq K} \left\| h_t - e^{(j)}_k \right\|^2_2, \qquad j = 1, \ldots, 8$$

This results in an 8-tuple of codebook indices at each timestep.
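
A minimal sketch of this per-frame nearest-neighbor quantization, assuming illustrative codebook sizes ($K$, $D$) and randomly initialized codebooks rather than the actual XY-Tokenizer weights:

```python
import torch

# Illustrative sizes only; the real XY-Tokenizer's K, D, and codebook
# contents are not reproduced here.
J, K, D = 8, 1024, 256
codebooks = torch.randn(J, K, D)          # e^{(j)}_k for j = 1..J, k = 1..K

def quantize_frame(h_t: torch.Tensor) -> torch.Tensor:
    """Map one encoder latent h_t (shape [D]) to an 8-tuple of codebook indices."""
    dists = ((codebooks - h_t) ** 2).sum(dim=-1)   # squared L2 distances, shape [J, K]
    return dists.argmin(dim=-1)                    # a^{cb,j}_t for each codebook j

h_t = torch.randn(D)              # h_t = E(x_t); the encoder E is omitted here
indices = quantize_frame(h_t)     # tensor of 8 indices, one per codebook
```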

While the XY-Tokenizer's explicit loss is omitted in the primary text, the standard vector-quantization variational autoencoder (VQ-VAE) objective is implied:

$$\mathcal{L}_{\mathrm{VQ}} = \sum_t \sum_{j=1}^{8} \left( \left\| \mathrm{sg}[h_t] - e^{(j)}_{a^{\mathrm{cb},j}_t} \right\|^2_2 + \beta \left\| h_t - \mathrm{sg}\!\left[ e^{(j)}_{a^{\mathrm{cb},j}_t} \right] \right\|^2_2 \right), \qquad 0 < \beta \leq 1$$

where $\mathrm{sg}[\cdot]$ denotes the stop-gradient operator. The approach allows rich, low-latency, multi-channel acoustic representation, but the model’s robustness depends heavily on the diversity and quality of training data. Empirically, multi-codebook tokenization requires larger, higher-quality datasets to match the robustness and word error rate (WER) of single-codebook approaches.
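
The following sketch spells out this implied objective; the commitment weight $\beta = 0.25$ is the conventional VQ-VAE choice, not a value reported for the XY-Tokenizer:

```python
import torch

def vq_loss(h: torch.Tensor, codebooks: torch.Tensor,
            indices: torch.Tensor, beta: float = 0.25) -> torch.Tensor:
    """
    Standard VQ-VAE objective summed over frames t and codebooks j.
      h:         [T, D]    encoder latents h_t
      codebooks: [J, K, D] codebook embeddings e^{(j)}_k
      indices:   [T, J]    selected indices a^{cb,j}_t
    """
    J = codebooks.shape[0]
    # Gather the chosen embedding for every (t, j): shape [T, J, D]
    selected = torch.stack([codebooks[j, indices[:, j]] for j in range(J)], dim=1)
    h_exp = h.unsqueeze(1)                                          # [T, 1, D]
    codebook_term   = ((h_exp.detach() - selected) ** 2).sum(-1)    # ||sg[h] - e||^2
    commitment_term = ((h_exp - selected.detach()) ** 2).sum(-1)    # ||h - sg[e]||^2
    return (codebook_term + beta * commitment_term).sum()
```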

2. Multi-Token Prediction (MTP) Acceleration

To address the latency introduced by serial, step-by-step autoregression in conventional speech decoders, VocalNet-M2 incorporates $N_{\mathrm{mtp}}$ MTP layers stacked atop its Talker module. Each MTP layer predicts one additional future token in the same pass, permitting inference to stride forward by $N_{\mathrm{mtp}}+1$ tokens per decoding pass.
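
To make the stride concrete, a back-of-the-envelope count of decoding passes (the frame count below is illustrative, not taken from the paper):

```python
import math

def decoding_passes(num_frames: int, n_mtp: int) -> int:
    """Autoregressive passes needed when each pass emits n_mtp + 1 frames."""
    return math.ceil(num_frames / (n_mtp + 1))

# With a hypothetical 100-frame response chunk:
print(decoding_passes(100, 0))   # 100 passes without MTP
print(decoding_passes(100, 4))   # 20 passes with N_mtp = 4
```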

Let $h^{\mathrm{up}}_{1:t} \in \mathbb{R}^{t \times d}$ be the upsampled semantic representation, and let $\sum_{j=1}^{8} \mathrm{Emb}(a^{\mathrm{cb},j}_{1:t})$ denote the sum of the embeddings of previously generated codebook indices. The base Talker transformer $\mathcal{T}_{\mathrm{talker}}$ produces the next token set:

$$\{a^{\mathrm{cb},j}_{t+1}\}_{j=1}^{8} = \mathcal{T}_{\mathrm{talker}}\!\left( h^{\mathrm{up}}_{1:t} + \sum_{j=1}^{8} \mathrm{Emb}\left(a^{\mathrm{cb},j}_{1:t}\right) \right)$$

Successive MTP layers predict tokens at step $t+n+1$ for $n = 1, \ldots, N_{\mathrm{mtp}}$:

$$\{a^{\mathrm{cb},j}_{t+n+1}\}_{j=1}^{8} = \mathcal{T}_{\mathrm{MTP}_n} \circ \cdots \circ \mathcal{T}_{\mathrm{MTP}_1}\!\left( h^{\mathrm{up}}_{1:t} + \sum_{j=1}^{8} \mathrm{Emb}\left(a^{\mathrm{cb},j}_{1:t}\right) \right)$$
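
A structural sketch of one decoding pass with the base Talker and stacked MTP layers; the hidden size, vocabulary size, greedy head selection, and the use of generic transformer layers are assumptions for illustration, not the released architecture:

```python
import torch
import torch.nn as nn

d, V, J, N_mtp = 1024, 4096, 8, 4          # illustrative sizes

talker = nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True)
mtp_stack = nn.ModuleList(
    [nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True)
     for _ in range(N_mtp)])
heads = nn.ModuleList([nn.Linear(d, V) for _ in range(J)])     # 8 output heads
emb = nn.ModuleList([nn.Embedding(V, d) for _ in range(J)])    # codebook embeddings

def predict_heads(state: torch.Tensor) -> torch.Tensor:
    """Greedy 8-tuple of codebook indices from the final hidden state."""
    last = state[:, -1]                                         # [1, d]
    return torch.stack([heads[j](last).argmax(-1) for j in range(J)], dim=-1)

def decode_step(h_up: torch.Tensor, prev_idx: torch.Tensor) -> list:
    """
    h_up:     [1, t, d]  upsampled semantic states h^{up}_{1:t}
    prev_idx: [1, t, J]  previously generated indices a^{cb,j}_{1:t}
    Returns N_mtp + 1 token sets (steps t+1 ... t+N_mtp+1) from a single pass.
    """
    x = h_up + sum(emb[j](prev_idx[..., j]) for j in range(J))
    state = talker(x)                       # base Talker: step t+1
    tokens = [predict_heads(state)]
    for layer in mtp_stack:                 # each MTP layer: one further step
        state = layer(state)
        tokens.append(predict_heads(state))
    return tokens
```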

Training minimizes the total cross-entropy over next-step and future predictions:

$$\mathcal{L} = -\sum_{t=0}^{M-1} \sum_{j=1}^{8} \log P\!\left(a^{\mathrm{cb},j}_{t+1} \mid \cdots\right) \;-\; \sum_{n=1}^{N_{\mathrm{mtp}}} \sum_{t=0}^{M-1} \sum_{j=1}^{8} \log P\!\left(a^{\mathrm{cb},j}_{t+n+1} \mid \cdots\right)$$
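
A compact version of this objective with illustrative tensor shapes; masking of positions whose targets fall past the end of the sequence is omitted:

```python
import torch.nn.functional as F

def mtp_training_loss(logits, targets):
    """
    Summed cross-entropy over the next-step head (n = 0) and the N_mtp
    future-step heads (n = 1..N_mtp).
      logits:  [N_mtp + 1, M, J, V]  predictions at every position t = 0..M-1
      targets: [N_mtp + 1, M, J]     ground-truth indices a^{cb,j}_{t+n+1}
    """
    V = logits.shape[-1]
    return F.cross_entropy(logits.reshape(-1, V), targets.reshape(-1),
                           reduction="sum")
```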

Empirical ablation demonstrates that $N_{\mathrm{mtp}} = 4$ yields optimal performance (WER improves from 8.56% at $N_{\mathrm{mtp}} = 0$ to 6.07% at $N_{\mathrm{mtp}} = 4$, with a stable UTMOS rating).

3. Architectural Design

VocalNet-M2 follows a modular Thinker–Talker pipeline:

  1. Audio Encoding: Raw audio $x^a$ is processed by a frozen Whisper-large-v3 encoder and a learned downsample adapter, producing representations $r^a_{1:T}$.
  2. Semantic Decoding (Thinker): The Thinker, a Qwen3-8B autoregressive transformer, generates $N$ text tokens $t^{\text{text}}_{1:N}$ and hidden states $h^{\text{text}}_{1:N}$.
  3. Fusion & Upsampling: Text token embeddings and Thinker hidden states are fused via a learned linear layer, then upsampled by a factor of 3 to produce $h^{\mathrm{up}}_{1:3N}$, matching the finer-grained speech frame count.
  4. Audio Token Generation (Talker with MTP): The Talker, an autoregressive transformer with eight parallel output heads, consumes $h^{\mathrm{up}}_{1:t}$ and the previous codebook embeddings, outputting eight codebook indices per timestep. Stacked MTP layers predict future tokens in the same pass.
  5. Vocoder Synthesis: The contiguous 8-codebook indices are fed directly to a lightweight neural vocoder, which synthesizes the waveform without flow-matching.

Block Diagram:

Raw Audio → Whisper Encoder → Downsample Adapter
    ↓
Thinker (Qwen3-8B)
    ↓
Fusion Layer → Upsample (×3)
    ↓
Talker (8-track AR Transformer + $N_{\mathrm{mtp}}$ MTP Layers)
    ↓
8 Codebook Token Streams
    ↓
Lightweight Neural Vocoder
    ↓
Output Audio Waveform
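
The sketch below traces the same pipeline end to end. Every component is a stand-in (identity or random stub) so the control flow executes; module names, the 320-sample hop, and all sizes are placeholders, not the released VocalNet-M2 implementation:

```python
import torch

D, V, J = 1024, 4096, 8                                         # illustrative sizes

whisper_encoder = lambda x: torch.randn(x.shape[-1] // 320, D)  # frozen encoder stand-in
adapter = lambda r: r[::4]                                      # downsample adapter stand-in
def thinker(r):                                                 # Qwen3-8B stand-in
    n = 16                                                      # pretend it emits 16 text tokens
    return torch.randint(0, V, (n,)), torch.randn(n, D)
fuse = lambda tokens, hidden: hidden                            # linear fusion stand-in
talker_decode = lambda h_up: torch.randint(0, V, (h_up.shape[0], J))  # Talker + MTP stand-in
vocoder = lambda idx: torch.randn(idx.shape[0] * 320)           # lightweight vocoder stand-in

def vocalnet_m2_pipeline(raw_audio: torch.Tensor) -> torch.Tensor:
    r_a = adapter(whisper_encoder(raw_audio))                     # 1. encode + downsample
    text_tokens, h_text = thinker(r_a)                            # 2. Thinker: text + hidden states
    h_up = fuse(text_tokens, h_text).repeat_interleave(3, dim=0)  # 3. fuse, then 3x upsample
    codebook_indices = talker_decode(h_up)                        # 4. eight codebook streams
    return vocoder(codebook_indices)                              # 5. waveform, no flow-matching

waveform = vocalnet_m2_pipeline(torch.randn(16000))               # 1 s of 16 kHz audio
```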

4. Empirical Evaluation and Comparative Performance

VocalNet-M2 was evaluated via a rigorous three-phase training regimen: (i) Talker pretraining on ∼10,000 h of TTS data (Emilia corpus), (ii) Downsample Adapter and Thinker training with LoRA on speech-to-text, and (iii) end-to-end fine-tuning on ∼7,000 h of speech-to-speech dialogues from VoiceAssistant, UltraChat, and Tulu-3-derived samples.

Core metrics:

  • Text Quality: scored by AlpacaEval, Llama Questions, TriviaQA, Web Questions (0–10 scale)
  • Speech Quality: UTMOS (predicted MOS), word error rate (WER, via Whisper-large-v3)
  • Latency: first-chunk generation time (for 0.8 s of audio, single L20 GPU)

Summary Results:

| Model | AlpacaEval | LlamaQ | TriviaQA | WebQ | WER (%) | UTMOS | Latency (ms) |
|---|---|---|---|---|---|---|---|
| SLAM-Omni | 3.50 | 2.94 | 0.39 | 0.84 | 5.78 | 4.46 | 702 ± 30 |
| VocalNet-8B | 7.12 | 7.95 | 6.24 | 6.48 | 3.64 | 4.49 | 556 ± 8 |
| GLM-4-Voice | 5.86 | 7.74 | 4.95 | 5.56 | 11.90 | 4.23 | 1060 ± 2 |
| MiniCPM-o | 6.13 | 7.72 | 6.43 | 7.16 | 9.52 | 4.14 | 894 ± 82 |
| kimi-audio | 6.49 | 8.10 | 6.15 | 7.10 | 14.71 | 2.87 | 1745 ± 140 |
| Qwen2.5-Omni | 6.01 | 7.90 | 5.89 | 6.88 | 2.31 | 4.34 | — |
| VocalNet-M2 | 7.29 | 8.33 | 6.13 | 6.65 | 6.07 | 4.31 | 348.9 ± 2.9 |

VocalNet-M2 achieves a nearly 50% reduction in first-chunk latency (from ∼725 ms to ∼349 ms) relative to prior SLMs while retaining strong text and speech metrics.

Ablation Studies:

  • Single vs. Multi-Codebook Tokenization: Multi-codebook models, when matched with high-quality, filtered data, close the WER and UTMOS gap with single-codebook models and provide a ∼2× latency speedup by forgoing the flow-matching model.
  • Effect of MTP Layers: Increasing $N_{\mathrm{mtp}}$ improves WER up to $N_{\mathrm{mtp}} = 4$, after which gains plateau.

5. Insights, Advantages, and Limitations

By directly generating multi-codebook tokens, VocalNet-M2 obviates the need for a separate flow-matching decoder, and removing that stage is the principal source of its latency reduction. The MTP stack enables a larger stride in autoregressive decoding, further reducing inference cost, particularly in resource-constrained or real-time interactive scenarios.

A primary limitation emerges from the data requirements for multi-codebook tokenization. Achieving robust tokenization that rivals single-codebook systems in WER and audio quality demands both extensive and high-quality training data. Consequently, extensions such as adaptive codebook sizing, semi-supervised learning, and improved semantic-acoustic embedding mechanisms become pertinent directions for future research.

6. Application Domains and Prospects

VocalNet-M2's design, with first-chunk latency reduced to ∼350 ms, is particularly suited to real-time dialogue, voice-assistant interactivity, remote conferencing, and responsive human-robot interfaces. This latency regime supports near-natural conversational turn-taking and improved user experience in interactive systems.

A plausible implication is that the removal of the flow-matching constraint and the multi-token generation mechanism together set a new trajectory for scalable, efficient SLMs, provided that future methods can efficiently address the data brittleness currently observed in multi-codebook training paradigms (Wang et al., 13 Nov 2025).
