Speech LLMs: Unified Speech Processing

Updated 27 November 2025
  • Speech Large Language Models (SLLMs) are multimodal architectures that integrate speech encoders with large text models to convert raw audio into structured or free-form outputs.
  • They employ a modular design with feature extraction, modality fusion, and LLM inference, enabling efficient processing of linguistic, paralinguistic, and acoustic cues.
  • SLLMs use parameter-efficient fine-tuning with supervised and contrastive training to excel in ASR, translation, conversational reasoning, and other audio-based tasks.

Speech LLMs (SLLMs) are multimodal neural architectures that couple speech encoders with text-based LLMs, enabling unified, instruction-driven speech understanding, generation, and reasoning across diverse audio-based tasks. Unlike classical cascaded systems that decouple automatic speech recognition (ASR) from downstream language understanding or generation, SLLMs are designed for direct end-to-end modeling from raw acoustic input to structured or free-form output, integrating linguistic, paralinguistic, and non-linguistic cues in a single LLM-centric framework (Peng et al., 24 Oct 2024).

1. Conceptual Foundations and Scope

Speech understanding in SLLMs is defined as a multimodal transformation: mapping raw acoustic input to a textual (or audio) output that reflects linguistic content, paralinguistic attributes (emotion, speaker identity), and non-linguistic cues (background, spatial information). This paradigm rests on three core dimensions (Peng et al., 24 Oct 2024):

  • Informational: Linguistic (transcripts, syntax), paralinguistic (emotion, accent, speaker), non-linguistic (background events).
  • Functional: Ranges from perception (ASR, keyword spotting), through shallow cognition (speech translation, emotion classification), to deep cognition (spoken QA, conversational reasoning).
  • Format: Structured inputs/outputs (transcripts), weakly structured (summaries, slot-values), unstructured (open-ended or dialogic responses).

SLLMs enable general-purpose, instruction-following agents that can process and reason over spoken input without rigidly predefined pipelines.
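
As a concrete, purely hypothetical illustration of the three dimensions above, the sketch below represents a single spoken-input example with informational, functional, and format attributes; all field names and values are assumptions for illustration and do not come from any cited paper.

```python
# Hypothetical schema for one spoken-input example, spanning the informational,
# functional, and format dimensions described above (illustrative only).
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class SpokenExample:
    audio_path: str                       # raw acoustic input
    # Informational dimension
    transcript: Optional[str] = None      # linguistic content
    emotion: Optional[str] = None         # paralinguistic attribute
    background_events: list[str] = field(default_factory=list)  # non-linguistic cues
    # Functional dimension: "perception" | "shallow_cognition" | "deep_cognition"
    function_level: str = "perception"
    # Format dimension: "structured" | "weakly_structured" | "unstructured"
    output_format: str = "structured"

example = SpokenExample(
    audio_path="clip.wav",
    transcript="turn off the lights",
    emotion="neutral",
    function_level="perception",
    output_format="structured",
)
```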

2. Model Architectures

Contemporary SLLMs are instantiated in several architectural paradigms, commonly decomposable into three stages (Peng et al., 24 Oct 2024):

  • Modality feature extraction: continuous SSL encoders (Whisper, WavLM, XLSR), as in Whisper-Llama (Li et al., 29 Aug 2024) and XLSR-Thai (Shao et al., 18 Sep 2025), or discrete VQ/codebook tokenizers (HuBERT, EnCodec, TASTE), as in TASTE (Tseng et al., 9 Apr 2025).
  • Modality information fusion: adapter projection and alignment via Conv/MLP/Transformer/Q-Former modules, as in DeSTA2 (Lu et al., 30 Sep 2024) and WHISMA (Li et al., 29 Aug 2024), or concatenation, cross-attention, and token expansion, as in Speech-LLaMA (Wu et al., 2023) and LLaMA-Omni (Fang et al., 10 Sep 2024).
  • LLM inference: decoder-only LLMs (Llama, Qwen, Typhoon), with or without LoRA adapters, as in decoder-only S2TT (Huang et al., 3 Jul 2024) and SageLM (Ge et al., 28 Aug 2025).
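
The following PyTorch sketch shows how these three stages typically compose in an adapter-based SLLM. The module names, dimensions, and pooling stride are assumptions for illustration, not a specific published architecture.

```python
# Minimal sketch of the three-stage SLLM decomposition (illustrative dimensions).
import torch
import torch.nn as nn

class SpeechAdapter(nn.Module):
    """Stage 2: project frozen speech-encoder features into the LLM embedding space."""
    def __init__(self, speech_dim=1024, llm_dim=4096, stride=4):
        super().__init__()
        self.pool = nn.AvgPool1d(kernel_size=stride, stride=stride)  # shorten the frame sequence
        self.proj = nn.Sequential(nn.Linear(speech_dim, llm_dim), nn.GELU(),
                                  nn.Linear(llm_dim, llm_dim))

    def forward(self, feats):                                  # feats: (B, T, speech_dim)
        x = self.pool(feats.transpose(1, 2)).transpose(1, 2)   # (B, T/stride, speech_dim)
        return self.proj(x)                                    # (B, T/stride, llm_dim)

def sllm_forward(speech_encoder, adapter, llm, wav, prompt_embeds):
    with torch.no_grad():                          # Stage 1: frozen speech encoder
        feats = speech_encoder(wav)                # (B, T, speech_dim)
    speech_embeds = adapter(feats)                 # Stage 2: fusion by projection
    inputs = torch.cat([speech_embeds, prompt_embeds], dim=1)
    return llm(inputs_embeds=inputs)               # Stage 3: decoder-only LLM
```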

Architectural specifics vary:

  • Frozen or lightly adapted encoders and LLMs: Many SLLMs are modular, freezing large pre-trained speech/text backbones and training only narrow adapters (e.g., a 2-layer Q-Former, CNN, or MLP) that project speech into the LLM's embedding space (Lu et al., 30 Sep 2024, Li et al., 29 Aug 2024).
  • Tokenization granularity: Recent paradigms compress speech into text-aligned tokens (TASTE (Tseng et al., 9 Apr 2025)) or synchronize CTC-filtered frame rates (Wang et al., 2023), yielding dramatically shorter sequences for efficient LLM inference.
  • Retrieval paradigms: End-to-end speech retrieval models (SEAL (Sun et al., 26 Jan 2025)) align speech and text into a common embedding space, facilitating direct RAG without text intermediates.
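
As a rough illustration of the contrastive alignment used by retrieval-oriented models such as SEAL, the snippet below computes a standard symmetric InfoNCE loss over paired speech and text embeddings; the pooled-embedding inputs and temperature are assumptions, not values from the paper.

```python
# Contrastive speech-text alignment sketch (standard symmetric InfoNCE).
import torch
import torch.nn.functional as F

def info_nce(speech_emb, text_emb, temperature=0.07):
    """speech_emb, text_emb: (B, D) pooled embeddings of paired speech/text."""
    s = F.normalize(speech_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = s @ t.T / temperature                 # (B, B) similarity matrix
    targets = torch.arange(s.size(0), device=s.device)
    # Match speech i to text i and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))
```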

3. Training Algorithms and Methodologies

SLLMs are trained under two principal regimes: supervised (multi-task or instruction) fine-tuning, typically parameter-efficient with frozen speech and text backbones, and contrastive or alignment-based objectives that pull speech and text representations into a shared space. A minimal sketch of the supervised regime follows below.

For languages with limited paired data, pipelines such as U-Align combine differentiable alignment with LLM-based translation and TTS data synthesis to bootstrap training corpora at realistic scale (Thai-SUP (Shao et al., 18 Sep 2025)).
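
The sketch below illustrates the supervised regime, assuming a frozen encoder and LLM, a trainable adapter, and Hugging Face-style label masking; the batch layout and helper names are illustrative, not a specific paper's recipe.

```python
# Supervised fine-tuning step: only the adapter receives gradients (illustrative).
import torch

def sft_step(batch, speech_encoder, adapter, llm, optimizer):
    # batch: raw waveforms, embedded text tokens (instruction + answer, teacher-forced),
    # and the token ids of the answer span used as labels.
    wav, text_embeds, answer_ids = batch
    with torch.no_grad():                              # stage 1: frozen speech features
        feats = speech_encoder(wav)                    # (B, T, speech_dim)
    speech_embeds = adapter(feats)                     # stage 2: trainable projection
    inputs = torch.cat([speech_embeds, text_embeds], dim=1)
    # Next-token cross-entropy on the answer span only; speech and instruction
    # positions are masked out with -100 (the usual Hugging Face label convention).
    labels = torch.full(inputs.shape[:2], -100, dtype=torch.long, device=inputs.device)
    labels[:, -answer_ids.size(1):] = answer_ids
    loss = llm(inputs_embeds=inputs, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```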

4. Representative Applications and Benchmarks

SLLMs support the full stack of speech-language applications:

  • ASR, SLT, and Zero-Shot SLU: Decoder-only SLLMs achieve state-of-the-art BLEU on CoVoST 2 and FLEURS for S2TT, outperforming encoder–decoder and prompt-discretization models (Huang et al., 3 Jul 2024, Wu et al., 2023). In WHISMA, multi-task fine-tuning on ASR, SLU, QA, and spoken instructions yields 63.1% SLU-F1 (SLURP) and 79.0% zero-shot accuracy on SLU-GLUE (Li et al., 29 Aug 2024).
  • Retrieval-Augmented Generation: SEAL presents a unified speech–text embedding framework, reducing retrieval latency by 50% and raising Top-1 accuracy by ~7 points compared to ASR-based pipelines (Sun et al., 26 Jan 2025).
  • Long-Form Speech Understanding: FastLongSpeech compresses multi-minute speech via iterative frame-fusion and dynamic compression training, delivering high QA scores with a ∼60% reduction in compute over RoPE-based context extension (Guo et al., 20 Jul 2025). A simplified frame-fusion sketch follows this list.
  • Instruction Following & Cross-Lingual Reasoning: Instruction-following SLLMs such as DeSTA2 and XS-CoT transfer chain-of-thought reasoning into speech domains, exceeding baseline models by substantial GPT-4 score margins in non-core languages (+45% in Japanese SALMONN) (Xue et al., 29 Apr 2025, Lu et al., 30 Sep 2024).
  • Speech Judgement and Oral Proficiency: SageLM delivers state-of-the-art agreement (82.8%) with human speech-to-speech judges by training on both semantic and acoustic preference data with rationale supervision (Ge et al., 28 Aug 2025); Qwen2Audio-based SLLMs set new benchmarks in L2 proficiency scoring and cross-part generalization (Ma et al., 27 May 2025).
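
The snippet below sketches the frame-fusion idea referenced in the long-form bullet above in a deliberately simplified form: it greedily averages the most similar adjacent frames until a target length is reached. This is a stand-in for illustration, not FastLongSpeech's exact algorithm.

```python
# Greedy adjacent-frame fusion for long-form speech features (simplified sketch).
import torch
import torch.nn.functional as F

def merge_most_similar_adjacent(frames, target_len):
    """frames: (T, D). Repeatedly average the most similar adjacent pair until T == target_len."""
    while frames.size(0) > target_len:
        sims = F.cosine_similarity(frames[:-1], frames[1:], dim=-1)   # (T-1,) adjacent similarities
        i = int(torch.argmax(sims))                                    # most redundant pair
        merged = (frames[i] + frames[i + 1]) / 2
        frames = torch.cat([frames[:i], merged.unsqueeze(0), frames[i + 2:]], dim=0)
    return frames
```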

Key tasks and datasets include: Dynamic-SUPERB (48 classification tasks, assessed by GPT-4o), AIR-Bench-Chat for open-ended QA (Lu et al., 30 Sep 2024), slot filling (CallCenter-A/B (Hacioglu et al., 17 Oct 2025)), and the LongSpeech-Eval QA benchmark (Guo et al., 20 Jul 2025).

5. Evaluation, Scaling, and Generalization

Evaluation methods are differentiated by task format:

  • Structured tasks: WER and CER for ASR; F1 and slot/intent accuracy for SLU and slot filling (a brief metric-computation example follows this list).
  • Weakly/unstructured tasks: BLEU/ROUGE/BERTScore for translation and summarization; LLM-based (e.g., GPT-4o) or human scoring for open-ended outputs and instruction following (Xue et al., 29 Apr 2025, Ge et al., 28 Aug 2025).
  • Explainability: SLLMs employing rationale-augmented SFT can emit model-consistent rationales alongside verdicts, improving human alignment (Ge et al., 28 Aug 2025).
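
For the structured and weakly structured metrics above, a small example using the common jiwer and sacrebleu packages (assumed installed; toy strings) looks like this:

```python
# Toy computation of WER and BLEU, the structured-task metrics named above.
import jiwer
import sacrebleu

refs = ["turn on the kitchen lights"]
hyps = ["turn on the kitchen light"]

wer = jiwer.wer(refs, hyps)                        # word error rate for ASR-style outputs
bleu = sacrebleu.corpus_bleu(hyps, [refs]).score   # corpus BLEU for translation-style outputs
print(f"WER={wer:.3f}  BLEU={bleu:.1f}")
```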

Recent scaling analyses demonstrate power-law loss–performance relationships for SLMs: L(N, D) = E + A/N^α + B/D^β, with α, β ≈ 0.25. However, SLMs acquire syntax and semantics ∼1000× more slowly in compute than text LLMs; e.g., reaching Pythia-6.9B English proficiency would require O(10^12) parameters and O(10^12) tokens (Cuervo et al., 31 Mar 2024). Synthetic semantic data (e.g., sTinyStories) provides a ∼2× compute lift but does not obviate the scaling bottleneck.
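
As a worked example of this scaling-law form, the snippet below plugs in purely illustrative constants for E, A, and B; only the functional form and the roughly 0.25 exponents come from the text, not the fitted values of Cuervo et al.

```python
# Scaling-law form with illustrative (not fitted) constants.
def sllm_loss(N, D, E=1.7, A=50.0, B=60.0, alpha=0.25, beta=0.25):
    return E + A / N**alpha + B / D**beta

# Doubling parameters at fixed data shrinks only the A / N^alpha term:
print(sllm_loss(N=1e9, D=1e10))   # ~2.17 (illustrative)
print(sllm_loss(N=2e9, D=1e10))   # ~2.13 (illustrative)
```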

Parameter-efficient training (e.g., frozen encoders, LoRA/QLoRA, lightweight adapters) remains the default for cost-effective transfer to many settings (Lu et al., 30 Sep 2024, Hacioglu et al., 17 Oct 2025). Generalization is further enhanced by multi-task behavior imitation and interleaving, which enable SLLMs to approach cascaded system upper bounds with less annotated speech data (Xie et al., 24 May 2025).
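
A typical parameter-efficient setup of the kind described above can be sketched with the Hugging Face peft library; the base model name, rank, and target modules below are illustrative choices, not a specific paper's configuration.

```python
# LoRA-style parameter-efficient fine-tuning sketch (illustrative hyperparameters).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

llm = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
llm = get_peft_model(llm, lora)           # only the LoRA matrices are trainable
llm.print_trainable_parameters()          # typically well under 1% of the backbone
```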

6. Challenges, Limitations, and Open Directions

Key open problems outlined across multiple studies include (Peng et al., 24 Oct 2024, Xue et al., 29 Apr 2025, Shao et al., 18 Sep 2025, Hacioglu et al., 17 Oct 2025):

  • Instruction Sensitivity: SLLMs manifest high variability with prompt paraphrasing or reformatting; instruction-following robustness remains brittle.
  • Semantic Reasoning Degradation: End-to-end SLLMs lag two-stage (ASR → LLM) pipelines in deep reasoning tasks, reflecting limitations of joint modality alignment.
  • Low-Resource and Multilingual Gaps: SLLMs using generic encoders (e.g., Whisper) degrade on languages with limited annotated audio. Language-specific SSL pretraining, universal alignment (U-Align), and cross-lingual instruction data generation (XS-CoT) have proven effective (Shao et al., 18 Sep 2025, Xue et al., 29 Apr 2025).
  • Data Scarcity and Scaling: Bootstrapping paired speech–text data for rare languages or tasks remains challenging; synthetic augmentation and speech-text translation with TTS and LLMs are important interim steps.

Future methods are expected to leverage multi-modal pretraining, advanced alignment or contrastive objectives, hierarchical and adaptive compression for long-form audio, and further exploration of cross-modal chain-of-thought for instruction generalization (Xue et al., 29 Apr 2025, Guo et al., 20 Jul 2025).

7. Impact and Prospects

SLLMs have demonstrated state-of-the-art performance or significant gains over both audio-only self-supervised models and classical cascaded baselines in ASR, spoken language understanding, translation, dialog, slot filling, and spoken judgement tasks—often with substantial improvements in zero-shot and generalization settings (Li et al., 29 Aug 2024, Sun et al., 26 Jan 2025, Hacioglu et al., 17 Oct 2025).

Scaling, instruction robustness, and generalization remain central barriers to universal speech LLMs, but advances in modular architectures, data-efficient training, and evaluation transparency (e.g., rationale-augmented feedback, LongSpeech-Eval) drive rapid improvement. Cross-lingual and multimodal unification (including vision and text) and real-time streaming—already exemplified by LLaMA-Omni (sub-250ms response) (Fang et al., 10 Sep 2024)—are key technical frontiers as SLLMs become foundational to open-domain spoken agents.
