Speech-LLMs: Unified Speech & Language
- Speech-LLMs are large language models extended to integrate speech processing with text-based tasks, enabling unified spoken language understanding and generation.
- They leverage modular architectures that combine state-of-the-art speech encoders with LLMs via modality adapters for tasks like transcription, translation, and dialogue systems.
- Techniques such as parameter-efficient fine-tuning, reinforcement learning, and scheduled modality interleaving boost performance, adaptability, and robustness across diverse linguistic scenarios.
Speech-LLMs are LLMs architecturally and algorithmically extended to process, interpret, or generate speech alongside text, unifying spoken language understanding, speech generation, and multimodal reasoning in a single, flexible model. These systems aim to bridge the gap between continuous acoustic input and the discrete, text-oriented representations native to LLMs, allowing speech to be integrated directly into language-centric AI workflows. Research in this area spans speech-to-text understanding, spoken entity retrieval, pronunciation assessment, speech-to-speech translation, diarization, low-resource and zero-resource adaptation, and adversarial robustness. State-of-the-art implementations combine parameter-efficient adaptation modules, advanced self-supervised speech encoders, and training paradigms that exploit both paired and unpaired speech-text data.
1. Core Architectures and Integration Mechanisms
Speech-LLMs are characterized by a modular architecture that combines a robust speech encoder (e.g., Conformer, Whisper, HuBERT) with an LLM (e.g., Llama, T5, mT5) through specialized modality adapters. The canonical architectural pipeline consists of three stages (Peng et al., 24 Oct 2024), sketched in code after the list:
- Modality Feature Extraction: A dedicated speech encoder, often pretrained on large-scale unpaired audio with self-supervised objectives, maps raw audio into high-dimensional acoustic features.
- Modality Information Fusion: The speech representation is aligned with the LLM's text token embedding space through a transformation network such as a self-attention-based adapter (Wang et al., 2023), windowed query-transformer (Saon et al., 13 May 2025), CNN-based projector (Mundnich et al., 24 Dec 2024), or Q-Former (Djanibekov et al., 13 Feb 2025). Techniques such as CTC-based blank filtering (Wang et al., 2023), temporal downsampling, and low-rank adaptation (LoRA) (Saon et al., 13 May 2025, Li et al., 29 Aug 2024) are commonly used.
- LLM Inference: The fused embedding is fed into the decoder of a frozen or partially-adapted LLM, which interprets or generates output based on the downstream task (transcription, translation, question answering, summarization, etc.).
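The following PyTorch sketch illustrates this three-stage flow under simplified assumptions: a generic frozen speech encoder, a small linear projector with temporal downsampling standing in for the modality adapter, and a frozen Hugging-Face-style causal LM. Names such as `SpeechProjector` and `speech_llm_forward` are illustrative and not taken from any cited system.

```python
import torch
import torch.nn as nn

class SpeechProjector(nn.Module):
    """Toy modality adapter: stack adjacent frames (temporal downsampling)
    and project them into the LLM's token-embedding space."""
    def __init__(self, enc_dim: int, llm_dim: int, stack: int = 4):
        super().__init__()
        self.stack = stack
        self.proj = nn.Linear(enc_dim * stack, llm_dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, frames, enc_dim)
        b, t, d = feats.shape
        t = (t // self.stack) * self.stack              # drop trailing frames
        feats = feats[:, :t].reshape(b, t // self.stack, d * self.stack)
        return self.proj(feats)                          # (batch, t/stack, llm_dim)

def speech_llm_forward(speech_encoder, projector, llm, audio, prompt_embeds):
    """1) extract acoustic features, 2) fuse them into the LLM embedding space,
    3) run the (frozen) LLM on [speech embeddings ; text prompt embeddings]."""
    with torch.no_grad():                                # encoder and LLM stay frozen
        feats = speech_encoder(audio)                    # (batch, frames, enc_dim)
    speech_embeds = projector(feats)                     # trainable adapter
    inputs_embeds = torch.cat([speech_embeds, prompt_embeds], dim=1)
    return llm(inputs_embeds=inputs_embeds)              # HF-style causal LM call
```

In practice the adapter may be a Q-Former or windowed query-transformer rather than a linear projector, and the LLM may carry LoRA adapters, but the data flow remains the same.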
This modularity allows dual-mode operation: using the model for either text-only tasks (preserving LLM capabilities and safety) or speech tasks with modality adaptation (Saon et al., 13 May 2025). Table 1 summarizes key architectural elements.
| Component | Example Implementation | Function |
|---|---|---|
| Speech Encoder | Conformer, Whisper, w2v-BERT | Extracts high-level acoustic representations |
| Modality Adapter | Q-Former, CNN, Self-attention | Aligns speech embeddings with LLM input space |
| Text LLM | Llama, T5, mT5 | Provides advanced semantic reasoning |
| LoRA/Fusion Layer | LoRA, Mixture-of-Experts | Efficient parameter/knowledge sharing |
2. Training Paradigms and Adaptation Strategies
The training of Speech-LLMs spans a range of techniques aimed at robust modality alignment and generalization:
- Supervised Fine-Tuning (SFT): Training on paired speech-text datasets for downstream tasks such as ASR, automatic speech translation (AST), or spoken language understanding (SLU) (Wang et al., 2023, Saon et al., 13 May 2025).
- Parameter-Efficient Fine-Tuning: Use of LoRA adapters inserted into key projection layers or attention modules of the LLM, enabling adaptation to the speech modality with minimal update overhead (Li et al., 29 Aug 2024, Saon et al., 13 May 2025, Hao et al., 2023); see the sketch after this list.
- Unpaired/Low-Resource Adaptation: Text-only fine-tuning of LLM modules (keeping speech encoders and projectors frozen) allows for effective domain adaptation in low-resource or zero-resource scenarios, circumventing the need for additional audio data (Fang et al., 6 Jun 2025, Kim et al., 1 Jun 2025).
- Scheduled and Interleaved Training: Gradual interleaving of text and speech units during training facilitates smooth modality adaptation and bridges the semantic gap in speech-to-speech translation (S2ST) and multi-modal reasoning tasks (Futami et al., 12 Jun 2025).
- Multi-Task and Reinforcement Learning: Multi-task regimes (integrating ASR, slot-filling, spoken question answering, etc.), reinforcement learning (e.g., PPO), and preference optimization (e.g., Direct Preference Optimization, DPO) further align multimodal reasoning and optimize the response accuracy-latency frontier (Ling et al., 23 Sep 2025, Nagpal et al., 25 Dec 2024, Shih et al., 8 Oct 2025).
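As a concrete example of the parameter-efficient route, the sketch below adds LoRA adapters to a Hugging Face causal LM via the `peft` library and freezes the speech encoder so that only the low-rank matrices and the projector are trained. It reuses the `speech_encoder` and `projector` names from the earlier pipeline sketch, and the target module names are typical for Llama-style backbones rather than taken from any cited system.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Text backbone (a Llama-style model is assumed purely for illustration).
llm = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Insert LoRA adapters into the attention projections; ranks of 8-32 are common.
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
llm = get_peft_model(llm, lora_cfg)
llm.print_trainable_parameters()  # typically well under 1% of all parameters

# Keep the speech encoder frozen; train only the projector and LoRA weights.
# `speech_encoder` and `projector` as in the pipeline sketch above.
for p in speech_encoder.parameters():
    p.requires_grad = False
trainable_params = list(projector.parameters()) + \
                   [p for p in llm.parameters() if p.requires_grad]
```

For the text-only, low-resource adaptation setting described above, the same recipe applies with the projector also frozen and only the LLM-side adapters updated.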
3. Functional Capabilities and Applications
Speech-LLMs deliver a unified and extensible framework for multiple spoken language tasks:
- Speech Understanding and Dialogue Systems: End-to-end speech-to-text conversational understanding, dialog state tracking, slot-filling, entity retrieval, and spoken question answering (Wang et al., 2023, Li et al., 29 Aug 2024, Wu et al., 7 Sep 2024).
- Speech Synthesis and Multimodal Generation: Extension of LLMs toward speech output via integration with TTS models (e.g., VALL-E, acoustic-semantic tokenizers) and late-fusion, mixture-of-experts architectures for seamless text-speech multimodality (Hao et al., 2023, Shen et al., 27 Oct 2024); a generic gating sketch follows this list.
- Pronunciation Assessment: Multi-modal LLM architectures provide high-quality scoring for pronunciation fluency and accuracy, leveraging task-specific prompts, robust speech encoding, and modality adaptation for educational and linguistic evaluation (Fu et al., 12 Jul 2024).
- Speech Summarization and Quality Assessment: Reinforcement learning frameworks refine summarization directly from speech, supporting controllable styles and zero-shot generalization (Ling et al., 23 Sep 2025); LLMs also act as scalable pseudo-raters for non-intrusive speech quality assessment in large, simulated datasets (Cumlin et al., 8 Aug 2025).
- Zero-Resource and Low-Resource Transfer: Architectures enable speech recognition and translation in languages with no paired audio-text data, utilizing pretraining on large multilingual corpora and lightweight adaptation modules to "soft-prompt" the LLM (Mundnich et al., 24 Dec 2024, Fong et al., 7 Aug 2025).
- Diarization and Multi-Speaker Recognition: Unified architectures for joint diarization and recognition employ special-token interleaving and context-refresh inference to align speaker turns with transcription (Saengthong et al., 26 Jun 2025).
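The late-fusion, mixture-of-experts idea can be illustrated with a generic two-expert gating layer that mixes a text-oriented and a speech-oriented branch per token. This is a schematic sketch of the general technique, not the architecture of the cited systems.

```python
import torch
import torch.nn as nn

class TwoExpertLateFusion(nn.Module):
    """Generic late-fusion mixture of two experts: a learned gate decides,
    per token, how much to weight the text expert vs. the speech expert."""
    def __init__(self, hidden: int):
        super().__init__()
        self.text_expert = nn.Linear(hidden, hidden)
        self.speech_expert = nn.Linear(hidden, hidden)
        self.gate = nn.Linear(hidden, 2)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq, hidden) hidden states from the shared backbone
        w = torch.softmax(self.gate(h), dim=-1)          # (batch, seq, 2)
        return w[..., :1] * self.text_expert(h) + w[..., 1:] * self.speech_expert(h)
```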
4. Performance, Evaluation, and Generalization
Speech-LLMs achieve state-of-the-art or highly competitive results on multiple benchmarks, often underpinned by parameter-efficient adapters and fusion modules:
- ASR Performance: Models such as Granite-speech achieve word error rates (WER) in the 6–8% range for English with training on open-source data, outperforming systems trained on proprietary corpora (Saon et al., 13 May 2025). Zero-resource transfer with cross-lingual adaptation yields WERs in the 28% range for fully unseen languages (Mundnich et al., 24 Dec 2024).
- SLU and Slot Filling: WHISMA demonstrates a 26.6% relative improvement in slot-filling F1 over baseline zero-shot models on SLURP, and a 33% relative improvement over Qwen-Audio on the task-agnostic SLU-GLUE benchmark (Li et al., 29 Aug 2024).
- Pronunciation Assessment: Multi-modal LLM-based systems report Pearson correlation coefficients (PCC) for fluency and accuracy up to 0.777 and 0.713, demonstrating competitive alignment-free evaluation compared to traditional align-based methods (Fu et al., 12 Jul 2024).
- Speech Synthesis and QA Retention: Mixture-of-experts and late-fusion strategies mitigate catastrophic forgetting, retaining both QA and TTS capabilities, with TTS MOS (Mean Opinion Score) around 3.1–3.5 and QA accuracy up to 54.8% on MMLU (Shen et al., 27 Oct 2024).
- Summarization Robustness: Multi-stage RL fine-tuning narrows the audio-text summarization performance gap, achieving up to 28% relative gains with robust zero-shot multilingual generalization (Ling et al., 23 Sep 2025).
Evaluation methodologies incorporate WER, SLU-F1, BLEU for translation, MOS and UTMOS for TTS/audio quality, PCC for scoring, and overall quality scores derived from LLMs or human raters.
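A minimal sketch of how several of these metrics are typically computed, using the widely used `jiwer`, `sacrebleu`, and `scipy` packages; the reference/hypothesis strings and score lists are placeholders.

```python
import jiwer                      # word error rate for ASR
import sacrebleu                  # corpus BLEU for speech translation
from scipy.stats import pearsonr  # PCC for pronunciation scoring

asr_refs = ["turn on the kitchen lights", "what is the weather tomorrow"]
asr_hyps = ["turn on the kitchen light", "what is the weather tomorrow"]
wer = jiwer.wer(asr_refs, asr_hyps)                       # lower is better

ast_refs = ["schalte das licht in der kueche ein"]
ast_hyps = ["schalte das kuechenlicht ein"]
bleu = sacrebleu.corpus_bleu(ast_hyps, [ast_refs]).score  # higher is better

human_scores = [3.5, 4.0, 2.5, 4.5]                       # e.g., human fluency ratings
model_scores = [3.2, 4.1, 2.8, 4.4]                       # model-predicted scores
pcc, _ = pearsonr(human_scores, model_scores)             # agreement with raters

print(f"WER={wer:.3f}  BLEU={bleu:.1f}  PCC={pcc:.3f}")
```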
5. Robustness, Adversarial Vulnerabilities, and Limitations
The flexibility of Speech-LLMs leaves them exposed to universal acoustic adversarial attacks (Ma et al., 20 May 2025). A fixed audio segment prepended to any input can mute the model's output or re-task it; selective attacks permit attribute-conditioned control, such as gender- or language-specific suppression. Both Qwen2-Audio and Granite-speech are vulnerable, with success rates approaching 100% in certain settings. Candidate defenses include adversarial training, fixed-segment detection, and tighter task alignment, but all require further research.
Empirical studies reveal persistent limitations:
- Speaker Awareness: Most models make limited use of acoustic cues for speaker discrimination and perform poorly on identity-critical questions without ground-truth segmentation, approaching the performance of text-only systems (Wu et al., 7 Sep 2024).
- Instruction Sensitivity and Semantic Reasoning: Models tend to become instruction-insensitive or lose nuanced reasoning when audio dominates the fused representation, suggesting the need for careful normalization and alignment (Peng et al., 24 Oct 2024).
- Data Regimes: Matching Whisper-only performance in low-resource settings typically demands 200+ hours of in-domain data or transfer learning from high-resource language-adapted projectors (Fong et al., 7 Aug 2025).
- Speech-Less Training: Text-only training via unified encoder alignment (TESU-LLM) enables speech inference without access to speech during training, but may sacrifice paralinguistic aspects unless further extended (Kim et al., 1 Jun 2025).
- Latency: Adding chain-of-thought (CoT) reasoning increases response latency; entropy-based measures of “question completeness” help optimize the accuracy-latency trade-off, and DPO reduces latency by up to 70% without accuracy loss (Shih et al., 8 Oct 2025); a generic DPO objective is sketched below.
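Since DPO appears both here and among the strategies in Section 6, the sketch below shows the standard DPO objective computed from per-sequence log-probabilities of preferred and dispreferred responses under the policy and a frozen reference model. This is the generic formulation, not the specific recipe of the cited papers.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective: push the policy to prefer the chosen response
    over the rejected one, relative to a frozen reference model, scaled by beta."""
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_ratio - rejected_ratio)
    return -F.logsigmoid(logits).mean()

# Example with dummy per-sequence log-probabilities (batch of 2 preference pairs).
loss = dpo_loss(torch.tensor([-12.0, -15.0]), torch.tensor([-14.0, -15.5]),
                torch.tensor([-12.5, -15.2]), torch.tensor([-13.8, -15.4]))
```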
6. Emerging Directions and Research Opportunities
Speech-LLM research is rapidly evolving with notable emergent strategies:
- Reinforcement and Direct Preference Optimization: RL (e.g., PPO) and DPO have been shown to enhance adaptation to disordered speech and to fine-tune latency-accuracy frontiers in multimodal reasoning tasks (Nagpal et al., 25 Dec 2024, Shih et al., 8 Oct 2025, Ling et al., 23 Sep 2025).
- Chain-of-Thought Prompting in Speech: Instruction-tuned LLMs with explicit CoT context exhibit up to 2.4× reasoning accuracy improvements, while entropy-informed reasoning start points minimize end-to-end latency (Shih et al., 8 Oct 2025).
- Scheduled Modality Interleaving: Gradually decreasing the ratio of text tokens during S2ST adaptation smooths the transition from text to speech units, improving performance particularly in low-resource language pairs (Futami et al., 12 Jun 2025); a minimal schedule sketch follows this list.
- Unified Encoder Alignment: Text-only supervision frameworks (e.g., TESU-LLM) with unified encoders and frozen multimodal alignment projectors deliver competitive ASR and translation without multimodal corpora (Kim et al., 1 Jun 2025).
- Transfer via High-Resource Language Pretraining: Leveraging cross-lingual projectors pretrained on “nearby” high-resource languages improves low-resource ASR generalization without excessive data requirements (Fong et al., 7 Aug 2025).
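A minimal sketch of such a schedule, assuming a linear decay of the text-token ratio over training steps and token-level mixing of pre-aligned text tokens and speech units; the decay shape, mixing granularity, and unit tokens such as `<u17>` are illustrative choices, not the cited paper's exact recipe.

```python
import random

def text_ratio(step: int, total_steps: int, start: float = 1.0, end: float = 0.0) -> float:
    """Linearly anneal the fraction of text tokens mixed into the target sequence."""
    frac = min(step / max(total_steps, 1), 1.0)
    return start + (end - start) * frac

def interleave_targets(text_tokens, speech_units, ratio: float, seed: int = 0):
    """Replace a (1 - ratio) fraction of text positions with aligned speech units.
    Assumes text_tokens and speech_units are pre-aligned lists of equal length."""
    rng = random.Random(seed)
    return [t if rng.random() < ratio else s
            for t, s in zip(text_tokens, speech_units)]

# Early in training the target is mostly text; late in training, mostly speech units.
early = interleave_targets(["the", "cat"], ["<u17>", "<u93>"], text_ratio(100, 10_000))
late  = interleave_targets(["the", "cat"], ["<u17>", "<u93>"], text_ratio(9_900, 10_000))
```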
Future directions include enhanced speaker and temporal awareness, more effective cross-modal fusion mechanisms (e.g., discrete audio token expansions), reinforcement-leveraged instruction tuning, speaker-aware and temporally-aligned summarization, adversarial robustness, and broader domain adaptation with real-time evaluation.
Speech-LLMs synthesize advances in self-supervised learning, model fusion, and parameter-efficient adaptation, enabling unified, flexible, and robust spoken language understanding and generation. Their development, benchmarked across diverse linguistic, acoustic, and multimodal reasoning tasks, continues to broaden the applicability of LLMs beyond text, with ongoing research focusing on efficiency, generalization to low-resource scenarios, robust alignment, and defense against adversarial vulnerabilities.