Speech-Language Models: Unified Speech Processing

Updated 21 April 2026
  • Speech-Language Models (SLMs) are neural sequence models that process tokenized speech either on its own ("pure" SLMs) or together with text ("hybrid" SLMs), enabling holistic, multimodal speech processing.
  • They combine specialized speech encoders, modality adapters, and autoregressive Transformers with tokenization techniques like VQ-VAE and RVQ to ensure effective cross-modal alignment.
  • SLMs use multi-objective training—autoregressive, masked acoustic, and contrastive losses—to overcome modality gaps and enhance cross-lingual, paralinguistic, and reasoning performance.

Speech-Language Models (SLMs) are neural sequence models designed to process and generate speech directly, generalizing the principles of LLMs to the acoustic domain. They are distinguished by their capacity to handle raw speech input as a primary modality, with architectures that span fully speech-centric models and hybrid models integrating text-based LMs. SLMs model distributions over tokenized speech representations, enable speech-text multimodal interactions, and pursue universal speech processing with joint modeling of linguistic, paralinguistic, and speaker characteristics (Arora et al., 11 Apr 2025).

1. Definition, Scope, and Historical Context

SLMs are defined as neural sequence models that either model distributions over tokenized speech sequences ("pure" SLMs) or augment LLMs with speech-encoding frontends ("hybrid" SLMs) (Arora et al., 11 Apr 2025). Pure SLMs learn $P(x_1,\dots,x_T) = \prod_{t=1}^T P(x_t \mid x_{<t})$ over discrete speech tokens $x_t$, typically obtained from quantized self-supervised representations (e.g., HuBERT units, RVQ codecs). Hybrid SLMs prepend or interleave encoded speech representations $H^{sp}$ with text embeddings $H^{txt}$, then condition language modeling on both (Arora et al., 11 Apr 2025).

Early work on speech technology partitioned pipelines into ASR → text-LM → TTS stages. Recent advancements, including encoder-decoder transformers and self-supervised learning for speech, have driven a paradigm shift toward end-to-end multimodal models capable of holistic speech understanding, generation, and reasoning (Zhou et al., 26 Oct 2025, Chou et al., 12 Aug 2025). The field is now increasingly focused on universal SLMs unifying diverse speech processing tasks, with objectives paralleling those used in LLM research.

2. Model Architectures and Tokenization Strategies

SLMs typically feature three architectural components (Arora et al., 11 Apr 2025):

  1. Speech encoder ($\mathrm{Enc}^{sp}$) mapping raw waveform $X^{sp}$ to either continuous features $H^{sp}$ or discrete token indices $h_l$.
  2. Modality adapter, which projects speech/text features into a common embedding space. This includes strided CNNs, Q-Former modules using cross-attention from trainable queries, and CTC-based compressors for downsampling and temporal compression (Lu et al., 2024).
  3. Sequence model, most often a decoder-only Transformer, realizing autoregressive prediction of the next token: pure speech LMs factorize $P(x_t \mid x_{<t})$ over speech tokens, while hybrid models concatenate speech and text representations before LM inference.
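
The following minimal PyTorch sketch wires these three components together for the hybrid case; the module sizes, the convolutional encoder stand-in, and the causal-mask construction are illustrative assumptions, not any cited system's recipe:

```python
# Minimal sketch of the three-component hybrid SLM, assuming illustrative
# sizes and stand-in modules (not any cited system's recipe).
import torch
import torch.nn as nn

class HybridSLM(nn.Module):
    def __init__(self, d_speech=512, d_model=768, vocab=32000, n_layers=4):
        super().__init__()
        # 1. Speech encoder stand-in: strided convolutions over the waveform.
        self.encoder = nn.Sequential(
            nn.Conv1d(1, d_speech, kernel_size=10, stride=5),
            nn.GELU(),
            nn.Conv1d(d_speech, d_speech, kernel_size=8, stride=4),
        )
        # 2. Modality adapter: strided CNN projecting into the LM space
        #    (a Q-Former or CTC compressor could be swapped in here).
        self.adapter = nn.Conv1d(d_speech, d_model, kernel_size=4, stride=2)
        # 3. Decoder-only sequence model: causally masked self-attention.
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.lm = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.embed = nn.Embedding(vocab, d_model)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, waveform, text_ids):
        # waveform: (B, 1, samples); text_ids: (B, T_txt)
        h_sp = self.adapter(self.encoder(waveform)).transpose(1, 2)  # (B, T_sp, d)
        h_txt = self.embed(text_ids)                                 # (B, T_txt, d)
        h = torch.cat([h_sp, h_txt], dim=1)  # prepend speech to text
        mask = nn.Transformer.generate_square_subsequent_mask(h.size(1))
        return self.head(self.lm(h, mask=mask))  # next-token logits

model = HybridSLM()
logits = model(torch.randn(2, 1, 16000), torch.randint(0, 32000, (2, 8)))
print(logits.shape)  # (2, T_sp + 8, 32000)
```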

Tokenization is a central design axis:

  • Phonetic tokens: k-means clusters or VQ-VAE quantizations of SSL speech features, with vocabularies typically in the hundreds to low thousands of units (see the sketch after this list).
  • Codec tokens: fine-grained, multi-codebook RVQ tokens (e.g., SoundStream, EnCodec) encoding acoustic detail at frame rates of roughly 50–75 Hz (Fan et al., 14 Jun 2025).
  • Fully decoupled tokenization: semantic and acoustic/prosodic tokens are factored into separate streams (e.g., FACodec), enabling independent modeling/routing and superior cross-modal alignment (Fan et al., 14 Jun 2025).
  • Continuous features: mel-spectrograms or hidden states as direct sequence inputs, increasingly rare but still present in joint acoustic modeling approaches (Chou et al., 12 Aug 2025).
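
As referenced above, phonetic tokenization amounts to clustering SSL frame features. In this sketch, random vectors stand in for real HuBERT/WavLM outputs, and the cluster count of 500 is an illustrative choice:

```python
# "Phonetic" tokenization sketch: k-means over SSL frame features.
# Random vectors stand in for real HuBERT/WavLM outputs, and the
# cluster count (500) is an illustrative choice.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
ssl_features = rng.normal(size=(10_000, 768))  # (frames, feature_dim)

kmeans = KMeans(n_clusters=500, n_init=4, random_state=0).fit(ssl_features)

# An utterance becomes a sequence of discrete unit IDs.
utterance = rng.normal(size=(250, 768))    # ~5 s of frames at 50 Hz
speech_tokens = kmeans.predict(utterance)  # shape (250,), ints in [0, 500)
print(speech_tokens[:10])
```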

A summary of model and tokenization principles:

| Component | Pure SLMs | Hybrid/LLM-centric SLMs |
|---|---|---|
| Speech encoder | SSL (e.g., HuBERT, Whisper, WavLM) | Same |
| Modality adapter | Projection, CTC, Q-Former, downsampling | Strided CNN, Q-Former, MLP fusion |
| Tokenization | k-means, VQ-VAE, RVQ, FACodec | Codec/semantic streams, decoupled |
| Sequence model | Transformer, autoregressive or chunked | Decoder-only Transformer |

3. Training Objectives and Alignment Paradigms

Canonical SLM pretraining uses the autoregressive language modeling loss $\mathcal{L}_{\mathrm{AR}} = -\sum_{t=1}^{T} \log P(x_t \mid x_{<t})$, optionally supplemented by:

  • Masked acoustic modeling losses over speech frames or tokens.
  • Contrastive speech-text objectives that align paired representations.
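
The autoregressive term renders directly as a shifted cross-entropy; shapes below are illustrative, and in practice the logits come from the sequence model:

```python
# The autoregressive term rendered as a shifted cross-entropy.
# Shapes are illustrative; in practice `logits` come from the sequence model.
import torch
import torch.nn.functional as F

B, T, V = 2, 128, 1024                 # batch, sequence length, speech vocab
logits = torch.randn(B, T, V)          # model output at each position
tokens = torch.randint(0, V, (B, T))   # discrete speech tokens x_1..x_T

# Position t predicts token t+1: drop the last logit, drop the first target.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, V),     # (B*(T-1), V)
    tokens[:, 1:].reshape(-1),         # (B*(T-1),)
)
print(loss.item())
```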

Several works highlight the importance of multi-objective training to bridge modality gaps and to mitigate overfitting to idiosyncratic speech variations (e.g., prosody, background noise, speaker id).

For efficient alignment and decoding, multi-token prediction (MTP) aggregates multiple speech tokens per prediction step, balancing the information rate between speech and text and enabling up to 12× faster decoding without loss in cross-modal alignment or synthesis quality (Fan et al., 14 Jun 2025).
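
A hedged sketch of the idea: k parallel output heads emit k consecutive speech tokens per LM step, cutting the number of decoding steps by a factor of k. The head design here is an assumption for illustration, not the cited paper's exact architecture:

```python
# Hedged MTP sketch: k parallel heads emit k consecutive speech tokens per
# LM step, shrinking decoding steps by a factor of k. This head design is
# an assumption for illustration, not the cited paper's exact architecture.
import torch
import torch.nn as nn

class MTPHead(nn.Module):
    def __init__(self, d_model=768, vocab=1024, k=4):
        super().__init__()
        self.k = k
        self.heads = nn.ModuleList([nn.Linear(d_model, vocab) for _ in range(k)])

    def forward(self, hidden):  # hidden: (B, T, d_model) from the LM trunk
        # (B, T, k, vocab): step t proposes tokens t*k .. t*k + k - 1
        # of the underlying speech-token stream.
        return torch.stack([head(hidden) for head in self.heads], dim=2)

mtp = MTPHead()
logits = mtp(torch.randn(2, 32, 768))  # 32 LM steps -> 128 speech tokens
block = logits[:, -1].argmax(dim=-1)   # greedy k-token block, shape (2, 4)
print(block.shape)
```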

4. Evaluation Metrics, Benchmarks, and Analysis of Linguistic Competence

SLMs are evaluated using a diverse set of metrics:

  • Likelihood and Generative Metrics:
    • Perplexity (PPL) on speech or transcribed sequences.
    • Discrimination tasks (sWUGGY, sBLIMP): distinguishing real from matched fake utterances; a scoring sketch follows this list.
    • Diversity metrics (auto-BLEU), VERT (composite of PPL and diversity).
  • Downstream Task Accuracy:
    • ASR Word Error Rate (WER), speech translation BLEU, instruction-following accuracy (Dynamic-SUPERB, AIR-Bench, VoiceBench).
  • Speech Quality:
    • Naturalness and intelligibility of generated speech, e.g., mean opinion scores (MOS) and their automatic estimates.
  • Instruction-Following and Reasoning:
    • Adherence to spoken or written instructions and multi-step reasoning over speech inputs.
  • Trustworthiness:
    • Hallucination rates, toxicity/bias in transcribed content, fake speech detectability (CodecFake) (Arora et al., 11 Apr 2025).
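
The discrimination metrics referenced in the list reduce to a simple comparison: a pair counts as correct when the real item receives the higher length-normalized log-probability. A sketch with synthetic scores, assuming per-utterance log-probabilities are available from the SLM:

```python
# sWUGGY-style discrimination with synthetic scores: the model is "correct"
# on a pair when the real utterance gets the higher length-normalized
# log-probability. In practice the scores come from the SLM's token log-probs.
import numpy as np

def pair_accuracy(real_logprob, fake_logprob, real_len, fake_len):
    """Fraction of pairs where the real item is scored higher per token."""
    return float(np.mean(real_logprob / real_len > fake_logprob / fake_len))

rng = np.random.default_rng(0)
n_pairs = 1000
real_lp = rng.normal(-50.0, 5.0, n_pairs)          # synthetic sequence log-probs
fake_lp = real_lp - rng.gamma(2.0, 2.0, n_pairs)   # fakes score a bit lower
lens = np.full(n_pairs, 40.0)                      # 40 tokens per utterance

print(f"accuracy: {pair_accuracy(real_lp, fake_lp, lens, lens):.3f}")
```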

Minimal-pair probing reveals that SLMs encode grammatical and phonetic features robustly but exhibit a substantial form–meaning gap, with conceptual (semantic) competence consistently lagging grammatical competence. Self-supervised speech models (S3Ms) and ASR encoders achieve >90% accuracy on syntactic features but only ~60% on conceptual features, with such structure emerging around intermediate Transformer layers (He et al., 19 Sep 2025).

5. Challenges: Modality Gap, Scaling Laws, and Semantic Bottlenecks

Modality Gap and Representation Challenges

The modality gap—the divergence between high-variance acoustic speech embeddings and compact, content-focused text representations—poses a major challenge to SLM generalization. This gap is amplified by SLMs' tendency to exploit spurious within-domain paralinguistic cues rather than semantic content (Xu et al., 11 Aug 2025). OTReg demonstrates that enforcing structured speech-text alignment alleviates this issue and enhances robustness in cross-lingual transfer and domain generalization.

Scaling Properties and Data Efficiency

SLM test loss and downstream metric scaling follow Chinchilla-style laws, but SLMs scale 10²–10³× more slowly than text LLMs on linguistic tasks such as BLiMP and StoryCloze (Cuervo et al., 2024). The slow scaling is attributed to:

  • Low information density per token (longer speech-token sequences per conceptual unit).
  • Restrictive context windows and the need to filter acoustic variability.

Two partial remedies have been explored: data augmentation with synthetic corpora (e.g., sTinyStories) yields limited but consistent improvements on semantic tasks, while coarser tokenization (unigram BPE over speech tokens) improves loss scaling but collapses downstream syntactic/semantic scaling (Cuervo et al., 2024).
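
A minimal sketch of what "Chinchilla-style" means in practice: fit a parametric law $L(N, D) = E + A/N^{\alpha} + B/D^{\beta}$ to measured losses. All coefficients and data points below are synthetic; only the functional form follows the scaling-law literature:

```python
# Fitting a Chinchilla-style law L(N, D) = E + A/N^alpha + B/D^beta to
# synthetic loss measurements. All coefficients and data points are made up;
# only the functional form follows the scaling-law literature.
import numpy as np
from scipy.optimize import curve_fit

def chinchilla(nd, E, A, B, alpha, beta):
    N, D = nd
    return E + A / N**alpha + B / D**beta

rng = np.random.default_rng(0)
N = rng.uniform(1e7, 1e9, 200)      # model parameter counts
D = rng.uniform(1e8, 1e10, 200)     # training tokens
loss = chinchilla((N, D), 1.8, 400.0, 2000.0, 0.33, 0.28)
loss += rng.normal(0.0, 0.01, 200)  # observation noise

popt, _ = curve_fit(chinchilla, (N, D), loss,
                    p0=[1.0, 100.0, 100.0, 0.3, 0.3], maxfev=20000)
print(dict(zip(["E", "A", "B", "alpha", "beta"], popt.round(3))))
```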

The modality-evolving perspective directly quantifies the effect of phonetic dominance, sequence length, and paralinguistic variability, finding that paralinguistic variability is the primary barrier to semantic competence, with sequence length as a secondary factor, and phonetic density as relatively minor (Wang et al., 2024).

6. Unified Multilingual, Paralinguistic, and Reasoning-Centric SLMs

Contemporary SLMs embrace heterogeneity in input/output modalities, cross-linguality, explicit paralinguistic modeling, and causal reasoning:

  • Cross-Lingual SLMs: Textless cross-lingual interleaving at the sentence level fosters shared semantic spaces and robust transfer under data-scarce conditions (Moumen et al., 1 Dec 2025); a toy interleaving sketch follows this list.
  • Paralinguistic Modeling: Decoupled or dual-head architectures (GOAT-SLM) maintain strong core semantic performance while substantially boosting dialect, emotion, and age-aware dialogue in high-dimensional evaluation (TELEVAL) (Chen et al., 24 Jul 2025). End-to-end models outperform OLMs in style intensity adaptation (StyleBench), highlighting the importance of tokenizer/decoder co-design for paralinguistic control (Zhao et al., 8 Mar 2026).
  • Causal and Modular Reasoning: Explicit world models, as in Speech World Model (SWM), factorize speech perception into context, affect, speech-act, and pragmatic modules passing messages along a learned causal DAG; this enables interpretability, counterfactual reasoning, and robustness to missing supervision (Zhou et al., 5 Dec 2025).
  • Instruction Following and Speech-Text Alignment: Automatic speech-captioning alignment with minimal tuning imbues SLMs with strong zero-shot instruction-following and multimodal reasoning, even in the absence of curated instruction-tuned speech data (Lu et al., 2024).
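
For the cross-lingual interleaving mentioned above, a toy sketch: alternate sentence-aligned speech-token spans from two languages into one training stream. Token IDs and sentence boundaries are invented for illustration:

```python
# Toy sketch of sentence-level cross-lingual interleaving: alternate
# sentence-aligned speech-token spans from two languages into one stream.
# Token IDs and sentence boundaries are invented for illustration.

def interleave(sentences_a, sentences_b):
    """Alternate parallel sentences: a1, b1, a2, b2, ..."""
    stream = []
    for sent_a, sent_b in zip(sentences_a, sentences_b):
        stream.extend(sent_a)
        stream.extend(sent_b)
    return stream

# Each "sentence" is a list of discrete speech-unit IDs.
lang_a = [[12, 87, 87, 3], [45, 45, 9]]
lang_b = [[301, 44, 5], [220, 220, 18, 7]]

print(interleave(lang_a, lang_b))
# [12, 87, 87, 3, 301, 44, 5, 45, 45, 9, 220, 220, 18, 7]
```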

7. Open Problems and Future Directions

Key open issues and priorities for advancing SLMs include (Arora et al., 11 Apr 2025):

  • Universal Representations: Optimal trade-offs between discrete and continuous tokenization; merging speech and text embeddings.
  • Modular versus End-to-End Integration: Balancing modular pipelines (ASR → LLM → TTS) with monolithic SLMs that jointly capture content, prosody, and identity.
  • Data Availability and Benchmark Standardization: Unified open benchmarks, robust tokenization standards, and large-scale, diverse speech corpora remain rare.
  • Scalability and Efficiency: Model compression, efficient tokenizations/codecs, and on-device deployment require further research.
  • Trustworthiness and Inclusivity: Broadening coverage to low-resource languages, diverse accents, and challenging acoustic domains while mitigating hallucinations, biases, and deepfake risks.
  • Semantic Bottlenecks: Bridging the form–meaning gap in speech with new architectural and training objectives that emphasize lexical and conceptual structure.
  • Empathetic Reasoning: Deepening integration of non-lexical cues for empathetic conversational intelligence and layered, context-grounded dialogue (Zhou et al., 26 Oct 2025).

In summary, Speech-Language Models generalize the LLM paradigm to speech with specialized encoders, tokenization strategies, and modality adapters. Their evolution now targets universal speech-text reasoning, robust cross-linguality, fine-grained paralinguistic control, and causal interpretability, all while contending with bottlenecks in semantic abstraction, scaling, and trustworthiness (Arora et al., 11 Apr 2025, Fan et al., 14 Jun 2025, Zhou et al., 26 Oct 2025, Lu et al., 2024, Chen et al., 24 Jul 2025, Wang et al., 2024, Cuervo et al., 2024, Moumen et al., 1 Dec 2025, Xu et al., 11 Aug 2025, Zhao et al., 8 Mar 2026, Chou et al., 12 Aug 2025, He et al., 19 Sep 2025, Zhou et al., 5 Dec 2025).
