Speech Large Language Models

Updated 30 August 2025
  • SpeechLLMs are neural models that integrate speech processing and large-scale language modeling to enable end-to-end spoken language understanding, translation, and synthesis.
  • They employ joint training strategies, cross-modal pre-training, and parameter-efficient adaptation techniques like LoRA to fuse audio and text modalities effectively.
  • Despite strong semantic performance, challenges remain in capturing paralinguistic nuances, efficient long-form processing, and balancing modality interference.

Speech LLMs (SpeechLLMs) are a class of neural models that integrate large-scale language modeling capacities with speech processing, enabling end-to-end spoken language understanding, generation, and interaction. Unlike classical cascaded systems, where speech is first transcribed to text by ASR, processed, and optionally re-synthesized, SpeechLLMs are typically architected to directly ingest, interpret, and/or generate raw audio, often alongside or in combination with text, allowing for a unified and more flexible spoken language pipeline. The recent emergence of SpeechLLMs results from the convergence of advances in self-supervised audio modeling, parameter-efficient adaptation techniques, and large-scale cross-modal pre-training, with applications spanning ASR, spoken dialogue, speech translation, and speech synthesis.

1. Core Architectures and Integration Paradigms

SpeechLLMs are generally built by augmenting pretrained LLMs with speech processing modules—especially encoders that map speech signals into feature representations compatible with transformer-based decoders.

Audio Feature Integration:

  • Prepending Speech Embeddings: Semantic or perceptual embeddings (typically from conformer-based or self-supervised encoders such as Whisper, WavLM, or HuBERT) are stacked or projected into the token embedding space of the LLM and prepended to (or interleaved with) the text token embeddings. The effective input becomes a joint sequence that can be modeled with existing autoregressive language modeling frameworks; a minimal sketch follows this list (Fathullah et al., 2023).
  • Direct Discrete Tokenization vs. Continuous Features: Two paradigms have gained traction:
    • Discrete tokenization quantizes encoder outputs into symbolic tokens using clustering and BPE, yielding compact and efficient but potentially lossy representations.
    • Continuous embeddings retain high-dimensional, temporally resolved outputs—better preserving prosody and paralinguistic content at the cost of compute and longer sequences (Wang et al., 25 Aug 2025).
  • Coupled and Modular Designs: Some models keep the LLM (text domain) and the speech modules strictly separate, for example coupling the LLM as a stateless text encoder with an external TTS such as VALL-E for synthesis, while others fine-tune a single joint model over both modalities (Hao et al., 2023).
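
Below is a minimal sketch of the embedding-prepending pattern described in the first item above, assuming a Hugging Face-style decoder-only LLM that exposes get_input_embeddings() and accepts inputs_embeds; speech_encoder, enc_dim, llm_dim, and the frame-stacking factor are illustrative placeholders rather than values from the cited papers.

```python
import torch
import torch.nn as nn

class SpeechPrefixLM(nn.Module):
    """Sketch: project speech encoder features into the LLM's token-embedding
    space and prepend them to the text embeddings (illustrative names)."""

    def __init__(self, speech_encoder, llm, enc_dim, llm_dim, stack=4):
        super().__init__()
        self.speech_encoder = speech_encoder
        self.llm = llm
        self.stack = stack  # stack adjacent frames to shorten the speech sequence
        self.projector = nn.Linear(enc_dim * stack, llm_dim)

    def forward(self, audio, text_ids):
        feats = self.speech_encoder(audio)               # (B, T, enc_dim)
        B, T, D = feats.shape
        T = (T // self.stack) * self.stack               # drop trailing frames
        feats = feats[:, :T].reshape(B, T // self.stack, D * self.stack)
        speech_emb = self.projector(feats)               # (B, T', llm_dim)
        text_emb = self.llm.get_input_embeddings()(text_ids)
        inputs = torch.cat([speech_emb, text_emb], dim=1)
        return self.llm(inputs_embeds=inputs)
```

Stacking adjacent frames before projection is one common way to shorten the speech sequence so the joint input fits within the LLM's context window.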

Parameter-efficient Adaptation:

  • LoRA (Low-Rank Adaptation): LLMs are adapted to handle audio-conditional or mixed-modality input by inserting trainable low-rank matrices into selected attention or feedforward projections, while keeping the vast majority of LLM parameters frozen; a minimal sketch follows this list (Lakomkin et al., 2023, Peng et al., 23 Oct 2024).
  • Prompt-Conditioned Adaptation: Advanced models modulate adaptation at inference time based on prompt content using attention-based prompt-aware scaling of LoRA updates, allowing flexible task-following under multi-instruction, multi-modal scenarios (Hu et al., 31 Mar 2024).
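
As referenced in the LoRA item above, the following is a generic sketch of a LoRA-wrapped linear projection (not the exact configuration of the cited systems): the base weights stay frozen and only the low-rank factors are trained.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Sketch of LoRA: frozen base projection plus a trainable low-rank update,
    as typically inserted into attention or feedforward projections."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False            # keep the pretrained weights frozen
        self.lora_A = nn.Linear(base.in_features, r, bias=False)
        self.lora_B = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_B.weight)     # start as a no-op update
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * self.lora_B(self.lora_A(x))
```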

Dual Encoders and Specialized Modules:

  • To decouple semantic (linguistic) and paralinguistic (speaker/style) information, dual speech encoders (e.g., Whisper for semantics, WavLM for speaker/style) are independently processed, aligned, and fused before input into the LLM (Hu et al., 31 Mar 2024, Meng et al., 13 Sep 2024).
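
A simplified sketch of dual-encoder fusion, assuming the paralinguistic stream is resampled to the semantic frame rate and the two streams are concatenated before a linear projection into the LLM space; the cited systems may use different alignment or attention-based fusion.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualEncoderFusion(nn.Module):
    """Illustrative fusion of a semantic stream (e.g. Whisper features) and a
    speaker/style stream (e.g. WavLM features) before the LLM."""

    def __init__(self, sem_dim, par_dim, llm_dim):
        super().__init__()
        self.proj = nn.Linear(sem_dim + par_dim, llm_dim)

    def forward(self, sem_feats, par_feats):
        # sem_feats: (B, T_s, sem_dim); par_feats: (B, T_p, par_dim)
        # Resample the paralinguistic stream to the semantic frame rate.
        par_aligned = F.interpolate(
            par_feats.transpose(1, 2), size=sem_feats.size(1),
            mode="linear", align_corners=False,
        ).transpose(1, 2)
        fused = torch.cat([sem_feats, par_aligned], dim=-1)
        return self.proj(fused)   # (B, T_s, llm_dim), fed to the LLM
```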

2. Training Strategies and Modal Alignment

SpeechLLMs require specialized training recipes to achieve robust multimodal generalization:

Pretraining and Supervised Fine-Tuning:

  • Speech encoders typically undergo CTC-based or contrastive self-supervised pretraining on large audio corpora before integration with LLMs.
  • Supervised fine-tuning proceeds on large, labeled speech-text datasets, mixing ASR, speech translation, spoken QA, slot filling, and instruction-following SFT data. Key to preserving the LLM's original language capabilities is joint SFT: text and speech data are sampled together within each mini-batch, mitigating catastrophic forgetting of text-based skills (Peng et al., 23 Oct 2024).
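
A minimal sketch of joint-SFT batch mixing, assuming each mini-batch simply combines speech-paired and text-only examples at a fixed ratio; the sampler and the ratio below are illustrative, not taken from the cited work.

```python
import random

def mixed_modality_batches(speech_data, text_data, batch_size, text_ratio=0.3):
    """Yield mini-batches that mix speech-paired and text-only SFT examples so
    the LLM retains its text skills. `text_ratio` is an assumed hyperparameter."""
    n_text = max(1, int(batch_size * text_ratio))
    n_speech = batch_size - n_text
    while True:
        batch = random.sample(speech_data, n_speech) + random.sample(text_data, n_text)
        random.shuffle(batch)   # avoid a fixed modality ordering within the batch
        yield batch
```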

Behavior Imitation and Modality Alignment:

  • Multi-task behavior imitation (MTBI) supervises the model such that, for each speech–text pair, the output generated for speech input must match exactly the text-prompt output, tightly aligning speech and text representation spaces (Xie et al., 24 May 2025).
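
One distillation-style reading of this objective is sketched below: the distribution the model produces from speech input is pushed toward the one it produces from the paired text input. The cited paper may instead supervise the speech branch directly with the text-prompt response as hard targets; the KL form here is an assumption.

```python
import torch.nn.functional as F

def behavior_imitation_loss(speech_logits, text_logits):
    """Sketch of a behavior-imitation objective with the text branch as teacher
    and the speech branch as student (shapes: (B, T, V) logits)."""
    teacher = F.softmax(text_logits.detach(), dim=-1)    # text-prompted outputs
    log_student = F.log_softmax(speech_logits, dim=-1)   # speech-prompted outputs
    return F.kl_div(log_student, teacher, reduction="batchmean")
```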

Speech-Text Interleaving:

  • During training, speech segments are randomly interleaved with corresponding textual segments within input sequences, compelling the LLM to process mixed modalities holistically and further improving cross-modal generalization (Xie et al., 24 May 2025).
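
A toy sketch of interleaving, assuming each training example is pre-segmented into aligned (speech-unit, text-token) spans and one modality is chosen per span at random; the segmentation and mixing probability are assumptions for illustration.

```python
import random

def interleave_modalities(segments, p_speech=0.5):
    """Build a mixed-modality training sequence. `segments` is a list of
    aligned (speech_token_ids, text_token_ids) pairs; each span is emitted in
    one randomly chosen modality."""
    sequence = []
    for speech_ids, text_ids in segments:
        sequence.extend(speech_ids if random.random() < p_speech else text_ids)
    return sequence
```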

Curriculum and Multi-Phase Training:

  • Curriculum strategies build core speech skills on elementary tasks before introducing complex, multi-task, or chain-of-thought (CoT) reasoning instructions, with model adaptation (e.g., prompt-aware LoRA) injected in advanced stages for flexible prompt following (Hu et al., 31 Mar 2024).
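
A hypothetical curriculum schedule in configuration form, only to illustrate the elementary-to-advanced progression; the stage names, task mixes, and trainable components are assumptions, not the recipe from the cited paper.

```python
# Hypothetical multi-phase curriculum (illustrative values only).
CURRICULUM = [
    {"stage": 1, "tasks": ["asr"],                   "trainable": ["speech_projector"]},
    {"stage": 2, "tasks": ["asr", "st", "slu"],      "trainable": ["speech_projector", "lora"]},
    {"stage": 3, "tasks": ["multi_task", "cot_sqa"], "trainable": ["prompt_aware_lora"]},
]

for phase in CURRICULUM:
    print(f"stage {phase['stage']}: train {phase['trainable']} on {phase['tasks']}")
```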

Reinforcement Learning and Rationale-based Supervision:

  • For evaluative or preference learning (e.g., speech-to-speech judgment), models may be trained to align with human evaluators using rationale-based SFT (requiring explanations along with labels) or RL objectives, enabling better alignment with subjective evaluation (Ge et al., 28 Aug 2025).

3. Evaluation Metrics and Compositional Benchmarks

Performance assessment for SpeechLLMs covers core recognition and generation abilities, as well as holistic, linguistic, and acoustic judgment:

| Task Domain | Metric/Benchmark | Key Paper(s) |
|---|---|---|
| ASR, Speech Translation | WER, BLEU, COMET | (Huang et al., 2023, Fathullah et al., 2023) |
| Spoken Language Understanding (SLU) | SLU-F1 (Slot Filling), Intent Accuracy, PP | (Li et al., 29 Aug 2024, Wang et al., 5 Jun 2025) |
| Paralinguistics/Perception | Speaker Recognition, Emotion Accuracy, Prosody Measures | (Wang et al., 5 Jun 2025) |
| Generalization/Reasoning | Prompt/Task Generalization Accuracy, GSM8K for math SQA, Error Analysis | (Xie et al., 24 May 2025, Wang et al., 5 Jun 2025, Wang et al., 25 Aug 2025) |
| Human Judgment/Explainability | Agreement Rate, MOS, Rationalization Consistency | (Ge et al., 28 Aug 2025) |
| Long-Speech Understanding | LongSpeech-Eval (Response Quality, Runtime) | (Guo et al., 20 Jul 2025) |

Specialized benchmarks such as MMSU (Wang et al., 5 Jun 2025) and LongSpeech-Eval (Guo et al., 20 Jul 2025) probe nuanced aspects—perception vs. reasoning, semantics vs. paralinguistics, fine-grained prosody, disfluency, code-switching, and long-form dialog. Experiments consistently reveal strong semantic/transcription performance but lingering deficits on paralinguistic, prosodic, and complex multi-turn reasoning tasks, compared to human baselines (Wang et al., 5 Jun 2025).
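
For the transcription and translation rows of the table above, standard open-source scorers can reproduce the headline metrics; jiwer and sacrebleu are common choices, though not necessarily the exact tooling used by the cited benchmarks.

```python
# Illustrative WER/BLEU scoring with common libraries (toy strings).
import jiwer
import sacrebleu

refs = ["the meeting starts at noon"]
hyps = ["the meeting starts at new"]

wer = jiwer.wer(refs, hyps)                        # word error rate for ASR
bleu = sacrebleu.corpus_bleu(hyps, [refs]).score   # BLEU for speech translation
print(f"WER={wer:.2f}  BLEU={bleu:.1f}")
```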

4. Key Empirical and Theoretical Findings

Multimodal Power-Law Scaling:

  • SpeechLLMs and pure SLMs show power-law scaling curves analogous to text LLMs; however, the compute-efficiency gap is dramatic—up to three orders of magnitude more compute is required for syntactic/semantic proficiency parity with text models (Cuervo et al., 31 Mar 2024).
  • Pre-training loss is highly predictive of downstream linguistic performance in both text and speech models, enabling scale estimation and resource planning (Cuervo et al., 31 Mar 2024).
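
A minimal sketch of fitting such a scaling curve, assuming a saturating power law L(C) = a * C^(-b) + c and using made-up loss/compute points purely for illustration (not data from the cited paper).

```python
import numpy as np
from scipy.optimize import curve_fit

# Made-up points: downstream loss vs. training compute (units of 1e18 FLOPs).
compute = np.array([1.0, 10.0, 100.0, 1000.0])
loss = np.array([3.10, 2.60, 2.25, 2.00])

def power_law(C, a, b, c):
    # L(C) = a * C**(-b) + c : saturating power law in compute C
    return a * np.power(C, -b) + c

(a, b, c), _ = curve_fit(power_law, compute, loss, p0=[1.5, 0.2, 1.5])
print(f"L(C) ~= {a:.2f} * C**(-{b:.2f}) + {c:.2f}")
```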

Continuous vs. Discrete Representations:

  • Continuous SSL-derived features consistently outperform discrete tokens in most SLU tasks and are particularly robust under noise, albeit at the cost of higher bit-rate and slower convergence. Discrete tokens, however, retain efficiency and outperform on phoneme-level recognition (Wang et al., 25 Aug 2025).
  • Layerwise analysis shows that continuous features align speech-text modalities more smoothly through deeper layers of the LLM. Undertrained discrete tokens (long-tail effect) can limit information transmission; hybrid or adaptive strategies remain an open area (Wang et al., 25 Aug 2025).
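
A simplified sketch of the discrete pathway discussed above: k-means quantization of SSL features followed by collapsing adjacent repeated units, as is commonly done before BPE over the unit sequence. The codebook size and feature source are illustrative assumptions.

```python
from itertools import groupby

import numpy as np
from sklearn.cluster import MiniBatchKMeans

def train_quantizer(ssl_features, n_units=500):
    """Fit a k-means codebook over pooled SSL features (e.g. HuBERT layer
    outputs); 500 units is an illustrative choice."""
    km = MiniBatchKMeans(n_clusters=n_units, random_state=0)
    km.fit(np.concatenate(ssl_features, axis=0))   # (sum_T, D)
    return km

def discretize(km, utt_features):
    """Map one utterance's frames to unit IDs and collapse adjacent repeats."""
    units = km.predict(utt_features)               # (T,)
    return [int(u) for u, _ in groupby(units)]
```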

Instruction Sensitivity and Prompt Handling:

  • SpeechLLMs can degrade in semantic reasoning (“LLM dormancy”) if speech embeddings dominate over text in concatenated inputs, especially when connector alignment is poor (Peng et al., 24 Oct 2024). Prompt-aware LoRA, normalization, or improved data balancing can mitigate this.

Generalization and Zero-shot Robustness:

  • Multi-task behavior imitation, speech-text interleaving, and data-efficient SFT approaches substantially improve prompt and task zero-shot generalization, even with relatively little supervised speech data (Xie et al., 24 May 2025, Li et al., 29 Aug 2024).

Long-Speech Processing and Compression:

  • Iterative fusion and dynamic compression strategies enable efficient handling of long-form audio: content-density-guided fusion of temporally redundant frames, with training over variable compression ratios, transfers capabilities learned from short-form data to longer contexts without requiring a new long-speech corpus (Guo et al., 20 Jul 2025).
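
A heavily simplified stand-in for dynamic compression, assuming "content density" can be proxied by cosine similarity between adjacent frames and that redundant runs are mean-pooled; this is not the cited fusion method, only an illustration of the idea.

```python
import torch
import torch.nn.functional as F

def compress_frames(feats, sim_threshold=0.95):
    """Average runs of adjacent frames whose cosine similarity exceeds
    `sim_threshold` (a crude proxy for low content density).
    feats: (T, D) speech features for one utterance."""
    compressed, run = [], [feats[0]]
    for prev, cur in zip(feats[:-1], feats[1:]):
        if F.cosine_similarity(prev, cur, dim=0) > sim_threshold:
            run.append(cur)                           # redundant: extend the run
        else:
            compressed.append(torch.stack(run).mean(dim=0))
            run = [cur]
    compressed.append(torch.stack(run).mean(dim=0))
    return torch.stack(compressed)                    # (T' <= T, D)
```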

5. Limitations and Open Challenges

Paralinguistics, Perception, and Speaker Awareness:

  • On benchmarks separating perception (prosody, intonation, emotion) from semantic reasoning, current models exhibit a 19–28% accuracy gap on paralinguistic tasks compared to their semantic performance; the gap with human listeners for acoustic nuance remains substantial (Wang et al., 5 Jun 2025).
  • Recent studies reveal that even top-performing SpeechLLMs function similarly to cascaded ASR+LLM architectures when it comes to speaker identification and show limited explicit speaker awareness. Key speaker attributes (gender, age, style, emotion) are not reliably captured from audio (Wu et al., 7 Sep 2024).
  • Fusing non-semantic acoustic expertise (e.g., for medical, musical, or emotion-rich tasks) remains difficult. Stacked or joint encoders often fail to preserve abstract or non-semantic cues, especially when paired with instruction sensitivity issues (Bu et al., 17 Oct 2024).

Catastrophic Forgetting and Modality Interference:

  • Adapting an LLM to speech can erode its original text-only capabilities unless text SFT data is mixed into training, and poorly aligned speech embeddings can dominate concatenated inputs and suppress the LLM's reasoning; joint speech-text SFT, improved connector alignment, and data balancing are the principal mitigations reported (Peng et al., 23 Oct 2024, Peng et al., 24 Oct 2024).

Benchmarking and Evaluation Design:

  • Most evaluation is still focused on ASR and semantic reasoning rather than holistic, context-rich, or explainable performance. Scalable and explainable human-likeness judgment models (e.g., SageLM) are under development to address this (Ge et al., 28 Aug 2025).

Compute and Data Efficiency:

  • Achieving text-level proficiency in SLMs via direct scaling has prohibitive compute costs, suggesting a need for transfer learning, cross-modal pre-training, or more information-dense audio representations (Cuervo et al., 31 Mar 2024).

6. Applications, Impact, and Future Roadmap

Industrial and Interactive Applications:

  • End-to-end SpeechLLMs are displacing cascaded systems in industrial workflows for speech translation, subtitling, voice assistants, and dialogue agents. Integrated systems (e.g., LLM-ST, LLaMA-Omni) offer precise timestamped outputs, lower error propagation, and real-time speech-to-speech, speech-to-text, and instruction following (Huang et al., 2023, Fang et al., 10 Sep 2024).
  • Recent models provide efficient, low-latency, simultaneous text and speech responses in interactive settings, with scalable training achievable on modest hardware (Fang et al., 10 Sep 2024).

Roadmapping and Level Taxonomy:

  • An emerging five-level roadmap quantifies progression from basic ASR (Level 1), perception of paralinguistic features (Level 2), richer non-semantic comprehension (Level 3), specialist acoustic reasoning (Level 4), up to a hypothetical AGI-level model unifying linguistic and abstract acoustic faculties (Level 5) (Bu et al., 17 Oct 2024).

Directions for Advancement:

  • Solution pathways include designing higher-capacity or specialized acoustic encoders, advanced fusion (e.g., attention-based or token-aligned connectors), comprehensive multi-modal/cross-task instruction tuning, and domain-expert augmentation (e.g., medical, musical, prosodic).
  • Explainable, multi-aspect rationalization and multi-turn, mixed-modality training and evaluation (e.g., as in SageLM) are advocated to foster human-aligned, transparent evaluation and system behavior (Ge et al., 28 Aug 2025, Wang et al., 5 Jun 2025).

Open Research Themes:

SpeechLLMs represent a rapidly advancing but still incomplete technology, with outstanding challenges notably in fine-grained acoustic understanding, efficient and generalizable multimodal fusion, and the development of benchmarks and evaluation protocols that fully stress both semantic and paralinguistic faculties. The trajectory toward human- and domain-expert–level spoken language understanding and generation will likely depend on innovations in data efficiency, modality alignment, and explainable evaluation frameworks.
