SpeechLMs: Unified Speech Language Models

Updated 2 December 2025
  • SpeechLMs are neural architectures designed to convert raw audio into semantically aligned tokens for integrated speech-to-speech, speech-to-text, and TTS applications.
  • They employ a three-module pipeline of speech tokenizer, autoregressive Transformer, and waveform reconstructor, using coarse, semantically aligned tokenization to shorten token sequences while preserving semantic fidelity.
  • Leveraging large-scale synthetic speech-text pretraining and multimodal objectives, SpeechLMs achieve significant gains in speech recognition, QA accuracy, and dialogue systems.

Speech language models (SpeechLMs) are neural architectures designed to process, understand, and generate spoken language directly at the signal or token level, bypassing the conventional separation between automatic speech recognition (ASR), text-based LLMs, and text-to-speech (TTS) synthesis. SpeechLMs unify speech and text processing for end-to-end multimodal tasks, supporting direct speech-to-speech, speech-to-text, and text-to-speech functionality with integrated semantic, syntactic, and paralinguistic competence.

1. Core Architectural Principles and Tokenization

SpeechLMs are generally structured as a composition of three principal modules: a speech tokenizer, a sequence model (typically an autoregressive Transformer), and a waveform reconstructor (neural vocoder or decoder). The tokenizer maps raw audio waveforms to sequences of discrete tokens, often leveraging vector-quantized bottlenecks or clustering of self-supervised features extracted at frame rates ranging from 12.5 Hz to 50 Hz. State-of-the-art approaches may further separate token streams into semantic and acoustic channels, or develop controllable-rate, syllable-like tokens to reduce sequence length, as in SyllableLM, where semantic units with controllable granularity enable a 30-fold reduction in training compute and 4-fold speedup over phone-level modeling (Baade et al., 5 Oct 2024).
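
The three-module decomposition can be made concrete with a minimal sketch. The class names, dimensions, and codebook size below are illustrative placeholders rather than any published system's components, and the codebook is random here rather than learned from self-supervised features:

```python
# Minimal sketch of the tokenizer -> sequence model -> (vocoder) pipeline.
import torch
import torch.nn as nn

class SpeechTokenizer(nn.Module):
    """Maps frame-level features to discrete token ids via nearest-codebook lookup."""
    def __init__(self, feat_dim=256, codebook_size=4096):
        super().__init__()
        # Placeholder codebook; a real tokenizer learns it (e.g., by vector
        # quantization or clustering of self-supervised features at 12.5-50 Hz).
        self.codebook = nn.Parameter(torch.randn(codebook_size, feat_dim))

    def forward(self, feats):                       # feats: (T, feat_dim)
        dists = torch.cdist(feats, self.codebook)   # (T, codebook_size)
        return dists.argmin(dim=-1)                 # (T,) discrete token ids

class TokenLM(nn.Module):
    """Autoregressive Transformer over the discrete (speech and/or text) vocabulary."""
    def __init__(self, vocab_size=4096, d_model=512, n_layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):                      # tokens: (B, T)
        causal = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        hidden = self.backbone(self.embed(tokens), mask=causal)
        return self.head(hidden)                    # (B, T, vocab_size)

# The third module (a neural vocoder mapping predicted speech tokens back to a
# waveform) is omitted; it is a separately trained generative decoder.
tokenizer, lm = SpeechTokenizer(), TokenLM()
feats = torch.randn(100, 256)                       # stand-in for SSL features
tokens = tokenizer(feats).unsqueeze(0)              # (1, 100)
logits = lm(tokens)                                 # (1, 100, 4096)
```

In practice the tokenizer is trained rather than random, and generated speech tokens are passed to the waveform reconstructor to produce audio.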

A defining innovation is the move from frame-level, high-resolution quantization toward coarse, semantically-aligned units (e.g., syllable, word-level, or segmental tokens), motivated by the observation that excessive paralinguistic variability and long token sequences inhibit efficient language modeling and semantic generalization (Wang et al., 22 Dec 2024). Approaches such as vector-quantized, supervised bottlenecks derived from ASR encoders (Whisper-large-v3 plus block-causal attention) are used to ensure semantic preservation even at lower frame rates (e.g., 12.5 Hz), achieving WER ≈ 8.4% and high MOSNet scores while reducing sequence lengths by a factor of four (Zeng et al., 26 Nov 2024). Decoupled tokenization architectures separate semantic and acoustic channels to improve alignment and synthesis quality, enabling multi-token prediction and rapid decoding (Fan et al., 14 Jun 2025).
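
As a concrete illustration of the sequence-length savings, a 10-second utterance tokenized at 50 Hz yields $10 \times 50 = 500$ tokens, while the same utterance at 12.5 Hz yields $10 \times 12.5 = 125$ tokens, the fourfold reduction noted above.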

2. Data Curation, Synthetic Pretraining, and Interleaved Data

A central barrier to scaling SpeechLMs is the scarcity of large, high-quality, unsupervised, and parallel speech-text corpora. To overcome this, leading work proposes the large-scale generation of synthetic, interleaved speech-text data. One approach stochastically samples text spans from existing corpora, then maps each span into a speech token sequence via a pretrained text-to-token transformer, entirely avoiding waveform generation in the pretraining phase (Zeng et al., 26 Nov 2024). Each resulting document is an interleaving of text tokens and synthesized speech tokens, which supports training on up to 1 trillion tokens, including 600B synthetic speech-text tokens and over 300B text tokens, combined with smaller proportions of unsupervised and supervised natural speech.
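
The span-sampling procedure can be sketched as follows. The helper `text_to_speech_tokens` is a hypothetical stand-in for the pretrained text-to-token transformer, and the span length and mixing probability are arbitrary illustrative values:

```python
# Sketch of synthetic interleaved speech-text data construction: sample text
# spans, replace some of them with speech tokens produced by a text-to-token
# model, and keep the rest as text. No waveforms are generated at any point.
import random

def text_to_speech_tokens(span: str) -> list[str]:
    # Placeholder: a real model maps the span to discrete speech-token ids.
    return [f"<s{hash(w) % 4096}>" for w in span.split()]

def make_interleaved_document(words: list[str], span_len=8, speech_prob=0.5):
    """Return a single sequence mixing text tokens and synthetic speech tokens."""
    out, i = [], 0
    while i < len(words):
        span = words[i:i + span_len]
        if random.random() < speech_prob:
            out.extend(text_to_speech_tokens(" ".join(span)))  # speech channel
        else:
            out.extend(span)                                    # text channel
        i += span_len
    return out

doc = "speech language models unify text and audio in a single token stream".split()
print(make_interleaved_document(doc))
```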

Fine-grained interleaving of text and audio during pretraining exposes the model to joint-modality context and improves spoken question answering (SQA) accuracy by up to 3.1 percentage points over coarse chunking (Udandarao et al., 22 Oct 2025). Further, the inclusion of knowledge-rich synthetic speech-text pairs (e.g., QA-formatted and domain-focused data) fills coverage gaps and closes the modality distribution gap to natural benchmarks, achieving a 10.2% absolute improvement over larger baselines on SQA (Udandarao et al., 22 Oct 2025).

The synthetic approach also dramatically reduces the compute and I/O burden, as discrete token streams are far more compact than raw waveforms or spectrograms. These strategies drive the scalability of recent SpeechLMs to the trillion-token regime.

3. Pretraining Objectives, Model Scaling, and Optimization

SpeechLMs are pretrained using autoregressive next-token prediction across text-only, speech-only, and mixed or interleaved streams. The canonical objective for a sequence $x$ comprising both textual and speech tokens is

$$L_\mathrm{NTP} = -\sum_{i=1}^{L} \log P(x_i \mid x_{<i}),$$

where $L$ is the length of the token sequence. Models typically extend the vocabulary of a large, dense text Transformer (e.g., GLM-4-9B-Base) with a fresh set of speech tokens, yielding joint vocabularies upwards of 150,000 entries (Zeng et al., 26 Nov 2024). Training leverages batch compositions that blend unsupervised text (e.g., 30% of each batch), massive synthetic interleaved speech-text data (e.g., 600B tokens), unsupervised speech (one epoch, ~63B tokens at 25 Hz), and supervised ASR/TTS (one epoch, ~7B tokens).
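
A minimal sketch of this objective over a joint text-plus-speech vocabulary, with placeholder vocabulary sizes and random tensors standing in for real model inputs and outputs:

```python
# Next-token prediction over a joint text + speech vocabulary: each position
# predicts the following token, regardless of which modality it belongs to.
import torch
import torch.nn.functional as F

text_vocab, speech_vocab = 150_000, 16_384          # illustrative split
joint_vocab = text_vocab + speech_vocab

batch, seq_len = 2, 32
tokens = torch.randint(0, joint_vocab, (batch, seq_len))   # interleaved ids
logits = torch.randn(batch, seq_len, joint_vocab)          # model outputs

# Shift by one so position i predicts x_{i+1}: L_NTP = -sum log P(x_i | x_<i)
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, joint_vocab),  # predictions for positions 1..L-1
    tokens[:, 1:].reshape(-1),                # targets are the following tokens
)
print(float(loss))
```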

Scaling studies demonstrate strong, predictable power-law correlations between pretraining loss and downstream syntactic/semantic performance, but show that pure speech-based models require 100–1000× more compute to reach LLM-level proficiency compared to text models, due to weaker scaling exponents in speech ($\gamma \approx 0.021$ for BLiMP vs. $\approx 0.066$ in text) (Cuervo et al., 31 Mar 2024). Incorporating synthetic data (e.g., sTinyStories) and avoiding overly compressed tokenization can yield 6–10% accuracy improvements on semantic tasks for a fixed compute budget.
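
Read schematically, these exponents correspond to power-law fits of the form (the exact parameterization in the cited scaling study may differ):

$$\text{downstream accuracy} \propto C^{\gamma},$$

where $C$ denotes pretraining compute and $\gamma$ is the task- and modality-specific exponent; the much smaller $\gamma$ for speech is what translates into the 100–1000× compute gap quoted above.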

Optimized model implementations employ delay- or parallel-interleaving to fuse multi-stream speech and text inputs, and weighted cross-entropy to normalize the contributions of text, semantic, and acoustic token streams (Tian et al., 21 Jun 2025). Post-pretraining, fine-tuning on curated speech dialogue data (e.g., SpeechDialog-90K) enables end-to-end fully speech-domain chatbots.
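
The stream-weighting idea can be sketched as a weighted cross-entropy in which each token's loss is scaled by the weight of its stream; the stream labels and weights below are illustrative, not the values used in the cited work:

```python
# Weighted cross-entropy over text, semantic, and acoustic token streams.
import torch
import torch.nn.functional as F

def weighted_stream_loss(logits, targets, stream_ids, weights):
    """logits: (N, V); targets: (N,); stream_ids: (N,) in {0: text, 1: semantic, 2: acoustic}."""
    per_token = F.cross_entropy(logits, targets, reduction="none")  # (N,)
    w = weights[stream_ids]                                         # per-token stream weight
    return (w * per_token).sum() / w.sum()

vocab = 1000
logits = torch.randn(6, vocab)
targets = torch.randint(0, vocab, (6,))
stream_ids = torch.tensor([0, 0, 1, 1, 2, 2])
weights = torch.tensor([1.0, 0.5, 0.25])   # hypothetical text/semantic/acoustic weights
print(float(weighted_stream_loss(logits, targets, stream_ids, weights)))
```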

4. Evaluation, Benchmarking, and State-of-the-Art Results

SpeechLMs are evaluated across a range of metrics, including next-continuation accuracy on sTopic-StoryCloze and sStoryCloze, closed-book SQA (WebQuestions, LlamaQuestions, TriviaQA), and spoken chatbot performance assessed by GPT-4 content quality scores, UTMOS, and ASR-WER. The best SpeechLMs have achieved:

  • sTopic-StoryCloze speech→text: 93.6% (vs. prior SOTA 88.6%)
  • Spoken QA: average speech-to-speech accuracy 31% (vs. previous SOTA Moshi ~13%)
  • Dialogues: UTMOS ≈ 4.33, ASR-WER ≈ 7.8% (parity with top TTS-fine-tuned baselines)

These advances reflect a near-closure of the performance gap between speech-only and text-augmented tasks, as well as the production of end-to-end spoken conversation systems that rival cascaded ASR+LLM+TTS solutions (Zeng et al., 26 Nov 2024).

5. Model Variants and Robustness Considerations

Different architectural and optimization choices yield notable trade-offs:

  • Models employing single-stage joint supervised fine-tuning of both text and speech tasks via parameter-efficient adapters (e.g., LoRA; see the sketch after this list) preserve core text reasoning performance while acquiring strong speech proficiency, effectively avoiding the catastrophic forgetting observed with independent or sequential SFT (Peng et al., 23 Oct 2024).
  • Large multi-modal systems that admit multi-stream token interleaving (e.g., OpusLM) demonstrate that scaling to ≥1.7B parameters is critical for balanced text and speech capability; under-scaling yields near-random performance (Tian et al., 21 Jun 2025).
  • Alignment-aware connectors that segment and compress speech features to token-aligned granularity, followed by distillation and multitask fine-tuning, yield marked improvements in speech understanding and preserve pretrained LLM abilities (Tan et al., 30 Sep 2024).
  • Speaker- and role-aware generation, multi-token prediction, and fully decoupled semantic/acoustic tokenization schemes further improve synthesis quality and cross-modal alignment (Fan et al., 14 Jun 2025).
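
A minimal sketch of the LoRA-style adapter referenced in the first bullet, assuming a frozen linear layer from the text backbone; the rank, scaling factor, and dimensions are illustrative:

```python
# LoRA-style parameter-efficient adapter: the frozen base weight is augmented
# with a low-rank update, so joint text + speech SFT trains only the small
# A/B matrices while the text backbone stays fixed.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                 # keep the pretrained weight frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no change at start
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T

layer = LoRALinear(nn.Linear(512, 512))
out = layer(torch.randn(4, 512))                    # (4, 512)
```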

Notably, benchmarking has revealed subtle but impactful forms of bias—most prominently, acoustic-based gender differentiation—traceable to the properties of speech encoders (e.g., Whisper), which can introduce systematic male-oriented response patterns or erase legitimate gender-dependent distinctions unless carefully audited and controlled (Choi et al., 25 Sep 2025).

6. Practical Guidelines, Toolkits, and Open Challenges

Recent toolkits such as ESPnet-SpeechLM standardize SpeechLM development as universal sequential modeling, supporting arbitrary input/output typologies, arbitrary streaming or parallel interleaving, and task templates decoupled across ASR, TTS, TextLM, and AudioLM (Tian et al., 21 Feb 2025). Best practices identified in the literature include initializing from large-scale text LMs, balancing cross-modal loss weights, asynchronous offline tokenization at scale, and leveraging multi-task objectives without sacrificing efficiency or model fidelity.
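
The underlying idea of universal sequential modeling, expressing every task as a template over typed token segments that are flattened into one stream, can be illustrated with a hypothetical data structure. This is a sketch of the concept only, not ESPnet-SpeechLM's actual configuration API:

```python
# Hypothetical task templates: each task is a sequence of typed segments
# (modality + role) whose token streams are concatenated into one sequence.
from dataclasses import dataclass

@dataclass
class Segment:
    modality: str          # "text" or "speech"
    role: str              # "condition" or "target"

TASK_TEMPLATES = {
    "asr":     [Segment("speech", "condition"), Segment("text", "target")],
    "tts":     [Segment("text", "condition"), Segment("speech", "target")],
    "textlm":  [Segment("text", "target")],
    "audiolm": [Segment("speech", "target")],
}

def flatten(task: str, streams: dict[str, list[int]]) -> list[int]:
    """Concatenate the per-segment token streams in template order."""
    return [tok for seg in TASK_TEMPLATES[task] for tok in streams[seg.modality]]

print(flatten("asr", {"speech": [7, 8, 9], "text": [42, 43]}))
```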

Open challenges and research directions include:

  • Mitigating modality-specific gaps through more efficient speech tokenization (syllable/word level, minimal paralinguistics), improved speech-text alignment, and compositional cross-modal adapters.
  • Scaling jointly in model size and data while circumventing the impractical compute demands of pure speech LMs via transfer learning, synthetic data, or enhanced cross-modal objectives.
  • Standardizing evaluation suites that span linguistic, paralinguistic, and acoustic scene understanding, and addressing trust, safety, and social bias in both speech representations and generation.
  • Extending to under-resourced languages, multi-speaker, multi-dialect, and emotion-adaptive dialog, as well as integration into real-time and low-latency systems.

SpeechLM research is transitioning toward universal models able to perform all classical and modern speech tasks—recognition, synthesis, understanding, translation, and dialogue—in a unified, scalable, and robust manner. Recent advances showcase that, given tailored data generation, advanced tokenization, expressive model architectures, and rigorous multi-modal pretraining, SpeechLMs can rival or exceed pipeline-based approaches on both standard and emerging benchmarks (Zeng et al., 26 Nov 2024, Udandarao et al., 22 Oct 2025, Tian et al., 21 Jun 2025, Cuervo et al., 31 Mar 2024, Peng et al., 23 Oct 2024).
