
Speech and Language Model (SLM)

Updated 14 April 2026
  • Speech and Language Models (SLMs) are unified neural architectures that jointly process, understand, and generate both speech and language signals.
  • They leverage pre-trained speech and text models with specialized tokenization and adapter mechanisms to enable tasks like ASR, TTS, and cross-modal reasoning.
  • SLMs face challenges in compute efficiency and semantic alignment, driving continuous research in modality fusion and robust cross-modal design.

A Speech and Language Model (SLM) is a neural architecture designed to jointly process, understand, and generate speech and language signals within a unified or tightly integrated framework. SLMs generalize LLMs by explicitly incorporating speech as a first-class modality, thereby supporting not only text-based comprehension and generation, but also direct speech-to-text, text-to-speech, speech-to-speech, and cross-modal reasoning tasks. Modern SLMs leverage pre-trained foundation models for both speech (e.g., wav2vec, HuBERT, Whisper, USM) and language (T5, Llama, Qwen), integrating them via carefully designed adapters, tokenizers, and alignment mechanisms. The SLM paradigm encompasses pure speech LMs, hybrid systems (speech encoder + text LLM), and models that unify both modalities for instruction-following, dialog, and expressive generation.

1. Formal Architecture and Modality Alignment

SLMs typically operate by embedding raw audio into a sequence of feature vectors using a pre-trained speech encoder. These feature vectors are quantized or compressed (using CTC blank-filtering, VQ, clustering, or neural codecs), producing a sequence of speech tokens or embeddings. To enable operations in the same space as a downstream LLM, a modality adapter (e.g., a small transformer or feed-forward block) projects these embeddings into the LLM’s token embedding space. The adapted embeddings are then concatenated with textual tokens (e.g., dialog history, textual context, or instructions) and passed to the autoregressive LLM encoder or decoder stack.

A representative instantiation is as follows:

  • Input: raw speech signal $x$ (waveform), optional previous dialog history $h$ (tokenized text)
  • CTC encoder: $f_\text{CTC}(x)$ produces frame-level activations $H \in \mathbb{R}^{T \times d}$
  • Blank-filtering: retain non-blank frames to produce $\bar{H} \in \mathbb{R}^{T' \times d}$, with $T' \approx T/4$
  • Adapter mapping: $E_\text{speech} = f_1(\bar{H}) \in \mathbb{R}^{T' \times d_\text{text}}$
  • Text embedding: $E_\text{text} \in \mathbb{R}^{M \times d_\text{text}}$
  • Concatenation: $E_\text{concat} = [E_\text{speech}; E_\text{text}] \in \mathbb{R}^{(T'+M) \times d_\text{text}}$
  • LLM encoder/decoder: processes $E_\text{concat}$ autoregressively to generate outputs such as the ASR transcript and dialog state
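The pipeline above can be sketched numerically. This is a minimal illustration, not any particular model's implementation: the blank mask is simulated (in practice it comes from the CTC posteriors), and a single random linear projection stands in for the adapter $f_1$; all dimensions are assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

T, d, d_text, M = 100, 512, 768, 12   # frames, encoder dim, LLM dim, text tokens

# Frame-level CTC encoder activations H in R^{T x d} (stand-in for f_CTC(x)).
H = rng.standard_normal((T, d))

# Blank-filtering: keep only frames whose CTC label is non-blank.
# Simulated here with ~75% blanks, so T' is roughly T/4.
blank_mask = rng.random(T) < 0.75
H_bar = H[~blank_mask]                     # (T', d)

# Adapter f1: project filtered speech frames into the LLM embedding space.
W = rng.standard_normal((d, d_text)) / np.sqrt(d)
E_speech = H_bar @ W                       # (T', d_text)

# Text embeddings for dialog history / instructions.
E_text = rng.standard_normal((M, d_text))  # (M, d_text)

# Concatenate along the sequence axis; the result is what the LLM consumes.
E_concat = np.concatenate([E_speech, E_text], axis=0)  # (T' + M, d_text)
print(E_concat.shape)
```

The key design point is that only the sequence axis grows: both modalities share the LLM's embedding dimension after the adapter, so the frozen LLM sees one homogeneous token stream.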

Some models further augment the input via external entity retrieval, prepending representations of retrieved entities to the input sequence and enabling retrieval-augmented architectures for improved robustness and rare-entity generalization (Wang et al., 2023).

2. Speech Tokenization and Sequence Modeling

SLMs are critically dependent on the tokenization approach used to bridge the high information variability of speech and the efficient modeling requirements of LLMs:

  • Phonetic/semantic tokens: Tokenizers based on self-supervised units (HuBERT clusters, phone BPE, k-means) capture linguistic information, but yield long sequences with low information density; their content is primarily phonetic (Wang et al., 2024).
  • Compression strategies: Blank-filtering and deduplication via CTC or BPE reduce sequence length and raise information density, making the representations more tractable for transformer LMs, though at the risk of discarding essential semantics (Wang et al., 2023).
  • Paralinguistic and speaker-invariant tokens: Advanced tokenizers, such as DC-Spin, employ clustering objectives and dual codebooks to generate speech units robust to speaker variation and environmental noise, optimized for both understanding (ABX, sWUGGY) and synthesis (resynthesis WER, UTMOS), and supporting streamable chunk-wise inference (Chang et al., 2024).
  • Decoupled tokenization: State-of-the-art architectures interleave separate streams of purely semantic and acoustic tokens, yielding high synthesis quality, strong cross-modal alignment, and efficient multi-token prediction (MTP) that accelerates decoding (Fan et al., 14 Jun 2025).
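The deduplication step mentioned above is simple to state concretely: consecutive repeats of the same frame-level cluster ID collapse to one token, shortening the sequence the LM must model. A minimal sketch (the unit IDs are hypothetical stand-ins for k-means cluster indices over HuBERT-style features):

```python
def deduplicate(units):
    """Collapse consecutive repeats of the same discrete unit (run-length
    dedup), a common step when turning frame-level cluster IDs into LM tokens."""
    out = []
    for u in units:
        if not out or out[-1] != u:
            out.append(u)
    return out

# Hypothetical frame-level cluster IDs (one per ~20 ms frame).
frame_units = [7, 7, 7, 12, 12, 3, 3, 3, 3, 12, 12]
tokens = deduplicate(frame_units)
print(tokens)                        # [7, 12, 3, 12]
print(len(frame_units), "->", len(tokens))
```

Note what is lost: duration information (how long each unit was held) disappears, which is one reason compression can discard prosodically or semantically relevant detail.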

3. Training Paradigms and Cross-Modal Adaptation

SLMs are trained in one or more of the following regimes:

  • Self-supervised pretraining: Unlabeled speech is used to train encoders via contrastive objectives (InfoNCE), masked prediction (HuBERT, w2v-BERT), or clustering. The resulting representations are quantized into discrete tokens for LM pretraining (Cui et al., 2024).
  • Continued pre-training: LLMs pretrained on text are expanded to joint token spaces and further trained on speech token data, which boosts convergence and semantic transfer (Cui et al., 2024).
  • Adapter-based transfer: Rather than re-training large models, SLMs often freeze both the speech backbone and the LLM, training only a lightweight adapter to align speech and text representations; in many cases, the adapter comprises just 1% of the total parameter count (Wang et al., 2023).
  • Instruction tuning and descriptive alignment: Datasets synthesized via LLM-driven speech captioning or templated instruction generation permit alignment to linguistic, paralinguistic, and contextual attributes without catastrophic forgetting. Speech-text alignment via natural language captions enhances generalization and enables zero-shot instruction following (Lu et al., 2024, Lu et al., 2024).
  • Retrieval augmentation: Retrieval-augmented SLMs incorporate top-K entities derived from speech, prepended to the LLM context, to improve rare entity recall and dialog state tracking in open-domain scenarios (Wang et al., 2023).
  • Weakly-supervised disentanglement: Heterogeneous adapters and regularized randomness help decouple paralinguistic signals from linguistic content, enabling robust emotional and prosodic modeling in the frozen LLM backbone (Wang et al., 11 Aug 2025).
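The "adapter is ~1% of parameters" point in adapter-based transfer can be made concrete with a back-of-the-envelope parameter budget. All sizes below are hypothetical and chosen only to show the arithmetic, not taken from any cited system:

```python
# Hypothetical parameter budget for adapter-based transfer: the speech
# encoder and the LLM are frozen; only the adapter receives gradients.
speech_encoder_params = 300e6      # e.g. a wav2vec/Conformer-scale encoder
llm_params            = 7e9        # e.g. a 7B-parameter text LLM

# A small adapter: one 768 -> 4096 projection pair plus two transformer-ish
# layers of width 4096 (rough weight count, biases and norms ignored).
adapter_params = 2 * 768 * 4096 + 4 * 4096 ** 2

total     = speech_encoder_params + llm_params + adapter_params
trainable = adapter_params
print(f"trainable fraction: {trainable / total:.2%}")
```

With these assumed sizes the trainable fraction lands at roughly 1%, which is why adapter training is so much cheaper than end-to-end fine-tuning while still achieving cross-modal alignment.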

4. Capabilities, Evaluation, and Performance Limits

SLMs address a range of tasks beyond traditional ASR, including:

  • End-to-end dialog understanding: SLMs model the entire speech-to-dialog-state pipeline in an autoregressive, turn-by-turn fashion, directly emitting transcripts and structured dialog states (Wang et al., 2023).
  • Multi-modal reasoning: By accepting mixed speech and text queries, SLMs support cross-modal QA, slot filling, keyword spotting, TTS, and real-time multi-turn interactive setups (Cui et al., 2024, Wang et al., 2023).
  • Expressive and paralinguistic modeling: Architectures incorporating dual modality heads, attribute embeddings, and intent-aware bridging (e.g., via variational information bottlenecks or explicit paralinguistic adapters) produce outputs that cover prosody, emotion, speaker traits, dialectal variation, and non-speech vocalizations with substantial improvements in benchmark performance (Wang et al., 13 Apr 2026, Chen et al., 24 Jul 2025, Wang et al., 11 Aug 2025).
  • Multilingual and domain generalization: Large-scale pre-trained speech and LLMs, combined with modular adapters, can be extended to tens of languages and specialized domains (e.g., singing synthesis) with minimal adaptation data (Zhao et al., 16 Dec 2025).
  • Evaluation metrics: SLMs are assessed via ASR WER/CER, dialog state tracking accuracy (JGA), lexical/semantic/syntactic probes (sWUGGY, sBLIMP, StoryCloze), speaker similarity, UTMOS, and human- or LLM-judged subjective scores (e.g., emotional appropriateness, dialect consistency). Retrieval-augmented and alignment-based systems show marked gains on rare entity and open-domain metrics (Wang et al., 2023, Lu et al., 2024, Chen et al., 24 Jul 2025).
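Of the metrics listed, WER is the most mechanical: it is the Levenshtein edit distance over word sequences, normalized by reference length. A self-contained sketch (the example utterances are invented):

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + insertions + deletions) divided by
    the number of reference words, via dynamic-programming edit distance."""
    r, h = reference.split(), hypothesis.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                         # all deletions
    for j in range(len(h) + 1):
        d[0][j] = j                         # all insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(r)][len(h)] / len(r)

# One deletion ("the") and one substitution ("lights" -> "light") over
# a 5-word reference gives 2/5 = 0.4.
print(wer("turn on the kitchen lights", "turn on kitchen light"))
```

CER is the same computation over characters instead of words; JGA, by contrast, is an all-or-nothing per-turn match on the predicted dialog state.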

Despite these advances, SLMs lag behind pure text LLMs in compute efficiency and semantic scaling; for equivalent proficiency, SLMs often require up to three orders of magnitude more compute. Key bottlenecks are sequence length, information density mismatches, and the challenge of robust abstraction from low-level acoustic variation (Cuervo et al., 2024, Wang et al., 2024). Scaling laws establish a strong but slow correlation between pretraining loss and downstream task improvements.
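The scaling-law comparisons behind these claims typically come from fitting a power law $L(C) = a \cdot C^{-\alpha}$ to (compute, loss) points and comparing exponents across modalities. A minimal sketch of the fitting procedure on synthetic data (all numbers illustrative, not taken from the cited papers):

```python
import numpy as np

# Synthetic (compute, loss) points lying exactly on L = a * C^(-alpha).
alpha_true, a = 0.05, 10.0
C = np.array([1e18, 1e19, 1e20, 1e21])   # training compute (FLOPs)
L = a * C ** (-alpha_true)               # pretraining loss

# Power laws are linear in log space: log L = log a - alpha * log C,
# so a degree-1 polyfit recovers the exponent as the negated slope.
slope, intercept = np.polyfit(np.log(C), np.log(L), 1)
alpha_hat = -slope
print(round(alpha_hat, 3))               # 0.05
```

A smaller fitted exponent for speech tokens than for text means loss falls more slowly with compute, which is exactly the multiplicative compute penalty described above.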

5. Theoretical Insights, Challenges, and Future Directions

Fundamental Insights

  • Representation alignment can be achieved with limited adaptation: Empirical evidence suggests that two-layer transformer adapters suffice to map compressed speech features into the LLM manifold, preserving pre-trained capabilities while enabling effective joint modeling (Wang et al., 2023, Wang et al., 2023).
  • Decoupling modality-specific information yields robust generation: Decoupled tokenizers and multi-head architectures with attribute embeddings allow SLMs to separately model semantics and paralinguistics, producing outputs that are contextually, emotionally, and socially adaptive without sacrificing text-level fluency (Fan et al., 14 Jun 2025, Chen et al., 24 Jul 2025).
  • Self-aware and explicit reasoning frameworks close the understanding–realization gap: Variational information bottlenecks, rubric-driven preference optimization, and causal graph-based world models bridge the gap between what the SLM "thinks" (semantic intent) and "how it speaks" (acoustic realization), yielding outputs that closely match expressive or context-appropriate intent (Wang et al., 13 Apr 2026, Zhou et al., 5 Dec 2025).

Continued Challenges

  • Scaling efficiency mismatch: Bridging the information density gap (phonetic vs. semantic) and reducing token sequence length without loss of essential meaning remain active areas of research (Wang et al., 2024, Cuervo et al., 2024).
  • Evaluation and benchmarking: Unified, reproducible benchmarks encompassing semantic, paralinguistic, and generative quality are needed to measure true progress and guide ablation studies across architectures (Arora et al., 11 Apr 2025).
  • Low-resource, multilingual, and domain adaptation: While the SLM paradigm facilitates efficient scaling to new languages via modular adapters and caption-style alignment, further demonstrations are required in truly low-resource and out-of-domain scenarios (Cui et al., 2024, Zhao et al., 16 Dec 2025).
  • Real-time and full-duplex interaction: Many models remain single-turn or high-latency; deployment in full-duplex, low-latency settings (including streaming and interruption handling) remains a technical hurdle (Cui et al., 2024).
  • Safety and trust: Proactive research into hallucination mitigation, bias, and toxicity in the context of acoustically-grounded outputs is in its infancy (Arora et al., 11 Apr 2025).

Directions for Advancement

  • Adaptive blank-filtering and token compression: Approaches that dynamically select semantically dense frames or employ multi-scale tokenization will further bridge the modeling gap between speech and text (Wang et al., 2023, Wang et al., 2024).
  • Joint contrastive and captioning alignment: Hybrid objectives may improve cross-modal representation and access to both lexical and abstract units (Lu et al., 2024).
  • Self-aware and modular reasoning systems: Continued development of explicit causal models, self-rewarding preference optimization, and closed-loop expressive feedback mechanisms will push SLMs beyond black-box token models to more human-like, interpretable agents (Wang et al., 13 Apr 2026, Zhou et al., 5 Dec 2025).
  • Open and reproducible research: Public release of models, data, and evaluation scripts, and systematic ablation studies, are essential for robust progress and innovative model design (Arora et al., 11 Apr 2025).

6. Representative Quantitative Gains

Key experimental results confirm the potential and current limitations of SLMs:

| Task / Metric | Setting | Baseline | SLM | ReSLM (adapter + retrieval) |
|---|---|---|---|---|
| Dialog state tracking (JGA) | DSTC11 | 24.7% | 28.4% | 34.6% |
| ASR WER | DSTC11, speech input | 13.0% | 9.2% | 8.5% |
| sWUGGY (lexical) | ZeroSpeech, Flow-SLM-1B | – | 69.8% | – |
| Speaker similarity (prompted) | Flow-SLM-1B-ext vs. TWIST | 0.09 | 0.45 | – |
| Paralinguistic accuracy | GOAT-SLM, ESD-zh | 44.8–53.2% | 45.3–72.1% | – |

The table summarizes the impact of adapters and retrieval augmentation on end-to-end dialog state tracking, improvements in base ASR, lexical task scores, and paralinguistic tasks. Substantial gains are observed for rare entity error rates, with slot error rate (SER) reductions of up to 35% on specific categories (Wang et al., 2023), and for expressiveness as quantified by subjective and automatic metrics (Chen et al., 24 Jul 2025, Wang et al., 13 Apr 2026).
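The JGA figures above use an all-or-nothing criterion: a turn is correct only if every predicted slot-value pair matches the gold dialog state. A minimal sketch with an invented three-turn dialog:

```python
def joint_goal_accuracy(predicted, gold):
    """Joint goal accuracy (JGA): a turn counts as correct only if the full
    predicted slot-value state exactly equals the gold state for that turn."""
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)

# Hypothetical three-turn dialog; each state is a {slot: value} dict.
gold = [{"area": "north"},
        {"area": "north", "food": "thai"},
        {"area": "north", "food": "thai", "price": "cheap"}]
pred = [{"area": "north"},
        {"area": "north", "food": "chinese"},   # one wrong slot fails the turn
        {"area": "north", "food": "thai", "price": "cheap"}]
print(joint_goal_accuracy(pred, gold))          # 2 of 3 turns correct
```

This strictness is why even modest absolute JGA gains (e.g. 28.4% to 34.6% in the table) represent substantial improvements in end-to-end dialog tracking.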


SLMs mark a paradigm shift towards universal, expressive, and instruction-following speech understanding and generation, leveraging advances in pre-trained foundation models, modality alignment, and efficient tokenization. Continued progress in architectural efficiency, semantic scaling, paralinguistic modeling, and evaluation infrastructure is critical for realizing their full potential across the spectrum of speech and language applications.
