SpeechLLMs: Unified Speech & Language Models
- SpeechLLMs are large language models that integrate speech processing with language generation to perform diverse spoken tasks end-to-end.
- They employ cross-modal alignment, adapter-based fusion, and joint training techniques to directly map audio input to the language model’s token space for tasks like ASR, dialogue, and translation.
- Recent advances show improved performance in multilingual and noisy settings and on dedicated benchmarks, while challenges remain in speaker identity modeling and robust reasoning.
SpeechLLMs are LLMs that integrate direct speech/audio processing with advanced natural language generation and understanding, enabling a unified, end-to-end approach to spoken language tasks such as automatic speech recognition, dialogue modeling, instruction following, and multimodal reasoning. Unlike traditional cascaded pipelines linking an ASR front-end to a text LLM, SpeechLLMs leverage cross-modal representation alignment, modality fusion, or joint training to directly map spoken input (and in some cases, output) to and from the LLM’s token space, substantially expanding the breadth and flexibility of spoken language AI. Recent research details architectures, training strategies, performance benchmarks, and persistent technical challenges associated with deploying these models in real-world multilingual, noisy, and instruction-driven environments.
1. Definition and Taxonomy of SpeechLLMs
SpeechLLMs broaden the scope of LLMs by coupling large-scale pretrained neural architectures (typically Transformer-based LLMs) with dedicated speech representation modules, permitting end-to-end modeling of spoken input (and optionally output) within a single model. Speech understanding in this context is formalized as the inference of not only textual transcriptions but also semantics, intent, and contextual/paralinguistic cues from audio, thus encompassing tasks beyond classic ASR, such as spoken dialogue understanding, spoken question answering, and multimodal speech translation (Peng et al., 24 Oct 2024).
A taxonomy along input-output modalities clarifies the design space:
- S2T: Speech-to-text models (classic ASR; LLM extends ASR).
- ST2T: Speech+Text-to-text models (spoken prompt plus textual prompt yields textual output; e.g., speech-driven instruction following).
- ST2ST: Speech+Text-to-speech+text models (spoken input and prompt yield both textual and spoken output, relevant for speech-to-speech translation or dialogue).
This categorization captures the evolution from pipeline systems (ASR + LLM) toward joint models in which rich modalities are fused in a unified LLM-centric architecture.
2. Modal Representation Alignment and Fusion Techniques
Central to SpeechLLM design is modality alignment: mapping continuous or discrete speech features into a form compatible with an LLM’s token embedding space, thereby reducing the “modality gap.”
Adapter-Based Alignment
A common approach employs a modality adapter, a learnable neural module (often Transformer-based or a Q-Former) that projects speech encoder outputs (e.g., from HuBERT, Whisper, WavLM) into the LLM’s input embedding space (Wang et al., 2023, Djanibekov et al., 13 Feb 2025):
- CTC-based blank-filtering retains only frames that carry linguistic content (i.e., non-blank frames under an auxiliary CTC head), compressing acoustic sequence lengths toward text-token scale while preserving semantics (Wang et al., 2023); see the sketch after this list.
- Query-token modules, where a fixed set of learnable “queries” aggregate features by attending to dense speech representations, enable efficient alignment for LLM consumption (Djanibekov et al., 13 Feb 2025).
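The following is a minimal PyTorch sketch of these two alignment mechanisms, assuming generic speech-encoder outputs; the names (ctc_blank_filter, QueryTokenAdapter, num_queries) are illustrative and not taken from the cited systems.

```python
import torch
import torch.nn as nn

def ctc_blank_filter(frames: torch.Tensor, ctc_logits: torch.Tensor, blank_id: int = 0) -> torch.Tensor:
    """Drop frames whose greedy CTC prediction is the blank symbol.

    frames:     (T, d) speech-encoder outputs
    ctc_logits: (T, V) logits from an auxiliary CTC head over the same frames
    Returns only the retained frames, roughly at text-token granularity.
    """
    keep = ctc_logits.argmax(dim=-1) != blank_id
    return frames[keep]

class QueryTokenAdapter(nn.Module):
    """Q-Former-style adapter: a fixed set of learnable queries attends to dense
    speech features and emits a short sequence in the LLM embedding space."""

    def __init__(self, speech_dim: int, llm_dim: int, num_queries: int = 64, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, llm_dim) * 0.02)
        self.speech_proj = nn.Linear(speech_dim, llm_dim)  # match encoder dim to LLM dim
        self.cross_attn = nn.MultiheadAttention(llm_dim, num_heads, batch_first=True)
        self.out_norm = nn.LayerNorm(llm_dim)

    def forward(self, speech_feats: torch.Tensor) -> torch.Tensor:
        # speech_feats: (B, T, speech_dim) -> (B, num_queries, llm_dim)
        kv = self.speech_proj(speech_feats)
        q = self.queries.unsqueeze(0).expand(speech_feats.size(0), -1, -1)
        fused, _ = self.cross_attn(q, kv, kv)
        # Downstream, the fused query outputs are concatenated with text embeddings.
        return self.out_norm(fused)
```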
Dual/Tower Architecture and Retrieval
Speech-to-entity retrievers use dual (speech/text) encoders trained with a contrastive loss to retrieve relevant entities directly from audio; augmenting the LLM context with these rare or out-of-vocabulary entities significantly improves dialogue state tracking (Wang et al., 2023).
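A hedged sketch of the dual-encoder idea, assuming in-batch contrastive training and a precomputed index of entity embeddings; the function names and temperature value are illustrative rather than those of the cited retriever.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(speech_emb: torch.Tensor, entity_emb: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """In-batch contrastive loss: the i-th audio segment should score highest
    against the i-th entity text embedding (and vice versa)."""
    speech_emb = F.normalize(speech_emb, dim=-1)   # (B, d)
    entity_emb = F.normalize(entity_emb, dim=-1)   # (B, d)
    logits = speech_emb @ entity_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def retrieve_entities(speech_emb: torch.Tensor, entity_index: torch.Tensor, entity_names: list, top_k: int = 5) -> list:
    """At inference, rank the precomputed entity index by cosine similarity and
    return the top-k names to append to the LLM prompt as retrieved context."""
    scores = F.normalize(speech_emb, dim=-1) @ F.normalize(entity_index, dim=-1).t()
    top = scores.topk(top_k, dim=-1).indices
    return [[entity_names[j] for j in row] for row in top.tolist()]
```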
Multi-Stream/Parallel Architectures
For response generation, parallel decoding of text and speech tokens within a multimodal sequence permits low-latency interaction, with careful loss-balancing between streams and right-padding for alignment (Mitsui et al., 18 Jun 2024). Multi-stream speech decoding yields up to 3x speedup in spoken dialogue systems.
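A minimal sketch of the loss balancing and right-padding described above, assuming the model emits separate logits for the text and speech-token streams; the stream weights and padding convention are illustrative.

```python
import torch
import torch.nn.functional as F

PAD_ID = -100  # ignored by cross_entropy via ignore_index

def right_pad(seq: torch.Tensor, length: int, pad_id: int = PAD_ID) -> torch.Tensor:
    """Right-pad a 1-D label tensor so the text and speech-token streams share
    one sequence length inside the multimodal batch."""
    out = torch.full((length,), pad_id, dtype=seq.dtype)
    out[: seq.size(0)] = seq
    return out

def dual_stream_loss(text_logits, speech_logits, text_labels, speech_labels,
                     w_text: float = 1.0, w_speech: float = 0.5) -> torch.Tensor:
    """Weighted sum of per-stream next-token losses; the weights trade off
    text quality against speech-token quality during joint decoding."""
    loss_text = F.cross_entropy(text_logits.transpose(1, 2), text_labels, ignore_index=PAD_ID)
    loss_speech = F.cross_entropy(speech_logits.transpose(1, 2), speech_labels, ignore_index=PAD_ID)
    return w_text * loss_text + w_speech * loss_speech
```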
3. Training Strategies and Optimization
SpeechLLMs employ a rich spectrum of training approaches to handle cross-modal alignment, data scarcity, and transfer learning:
- Adapter-Only Fine-Tuning: Keeping large LLM/speech encoder backbones frozen while updating only adapters and/or lightweight projection modules (e.g., LoRA) for parameter-efficient adaptation to speech (Li et al., 29 Aug 2024, Meng et al., 13 Sep 2024); a minimal sketch follows this list.
- Unsupervised Interleaved Pre-Training: The InSerter method leverages synthetic speech-text interleaving to align speech and text representations through next-token prediction, eliminating the need for explicit paired data and achieving superior instruction-following (Wang et al., 4 Mar 2025).
- Scheduled Interleaved Training: Gradually transitioning from text-heavy to speech-heavy interleaved token training enables smooth adaptation of text-pretrained LLMs to speech units in low-resource settings (Futami et al., 12 Jun 2025).
- Contrastive Learning: Cross-modal contrastive loss (aligning speech/audio segments to associated bias words for contextual ASR) forms the backbone of scalable retrieval-augmented recognition (Gong et al., 25 May 2025).
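As a concrete illustration of the adapter-only strategy in the first bullet, the sketch below freezes the pretrained backbones and exposes only adapter and LoRA-style parameters to the optimizer. In practice libraries such as PEFT are typically used; the class and argument names here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA-style wrapper: the frozen base projection is augmented with
    a trainable low-rank update (B @ A), scaled by alpha / r."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False            # base weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.A.t() @ self.B.t()) * self.scale

def build_adapter_only_optimizer(speech_encoder, llm, adapter, lr: float = 1e-4):
    """Freeze the pretrained speech encoder and LLM; only the lightweight
    adapter (plus any LoRA matrices attached to the LLM) receives updates."""
    for p in speech_encoder.parameters():
        p.requires_grad = False
    for p in llm.parameters():
        p.requires_grad = False
    trainable = [p for p in adapter.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=lr)
```

Because only the adapter (and low-rank) parameters appear in the optimizer, memory and compute for fine-tuning scale with the small trainable subset rather than the full backbones.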
4. Evaluation Methodologies and Benchmarks
Advances in SpeechLLMs have driven the formulation of specialized benchmarks:
- MMSU: A 5,000-instance, 47-task benchmark spanning phonetics, semantics, prosody, rhetoric, and paralinguistics, designed for holistic evaluation of spoken language understanding and reasoning (Wang et al., 5 Jun 2025).
- SpeechInstructBench: Evaluates instruction-following accuracy under accent variation, noise, and open-ended/adaptive queries (Wang et al., 4 Mar 2025).
- SLU-GLUE and Gaokao: Assess zero-shot SLU, question answering, and speaker identification, with a focus on downstream task generalization and limitations in speaker awareness (Li et al., 29 Aug 2024, Wu et al., 7 Sep 2024).
Commonly reported metrics include:
- ASR: Word Error Rate (WER), and Biased WER (B-WER) for context-aware tasks (a minimal WER computation is sketched after this list).
- Semantic alignment: Pearson Correlation Coefficient (PCC), BLEU for translation, task-specific accuracy.
- Robustness: Evaluation under noise perturbation, accent/dialect shift.
- LLM-as-judge quality (SageLM): agreement rates with human evaluators on multi-aspect spoken dialogue evaluation (Ge et al., 28 Aug 2025).
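For reference, WER is the word-level edit distance between hypothesis and reference divided by the number of reference words; B-WER applies the same idea restricted to words from the biasing list. A minimal implementation:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count,
    computed with a standard Levenshtein alignment over word sequences."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# Example: one deletion ("the") and one substitution ("lights" -> "light")
# over a 5-word reference gives 2 / 5 = 0.4.
# word_error_rate("turn on the kitchen lights", "turn on kitchen light")
```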
5. Comparative Paradigms: Discrete Tokens vs. Continuous Features
The choice of speech representation is pivotal:
- Discrete tokenization (frame-level quantization, BPE compression) offers substantial bandwidth and memory savings, rapid convergence, and superior performance on phoneme recognition tasks, but at the cost of reduced paralinguistic/semantic fidelity and noise robustness due to token instability (Wang et al., 25 Aug 2025).
- Continuous representation pipelines (SSL feature vectors) excel in tasks requiring nuanced prosodic or emotional interpretation, show better stability to noise/accent shifts, and support superior performance in emotion recognition, speech translation, and intent classification, particularly when used with larger LLMs.
Innovations such as StableToken, featuring a multi-branch majority-voting quantizer and consensus regularization, address the instability of prior discrete tokenizers, substantially reducing Unit Edit Distance (UED) under noise and improving downstream resilience in SpeechLLMs (Song et al., 26 Sep 2025).
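To make the discrete-token view and the majority-voting idea concrete, the sketch below pairs a plain nearest-codebook quantizer with a conceptual multi-branch voting wrapper. This illustrates the general mechanism only; it is not the StableToken architecture or its consensus regularization.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CodebookQuantizer(nn.Module):
    """Frame-level quantizer: each continuous feature vector is replaced by the
    index of its nearest codebook entry (the 'discrete token' view of speech)."""

    def __init__(self, num_codes: int, dim: int):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(num_codes, dim))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (T, dim) -> (T,) token ids
        dists = torch.cdist(feats, self.codebook)   # (T, num_codes)
        return dists.argmin(dim=-1)

class MajorityVoteQuantizer(nn.Module):
    """Conceptual multi-branch quantizer: several branches tokenize the same
    features, and each frame's token is decided by majority vote, damping the
    single-branch flips that make discrete tokens unstable under noise."""

    def __init__(self, num_branches: int, num_codes: int, dim: int):
        super().__init__()
        self.branches = nn.ModuleList(
            CodebookQuantizer(num_codes, dim) for _ in range(num_branches)
        )
        self.num_codes = num_codes

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        votes = torch.stack([b(feats) for b in self.branches], dim=0)  # (branches, T)
        one_hot = F.one_hot(votes, self.num_codes)                     # (branches, T, codes)
        return one_hot.sum(dim=0).argmax(dim=-1)                       # (T,) majority token per frame
```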
6. Challenges and Open Problems
Despite progress, several challenges remain:
- Modality Dominance and Instruction Sensitivity: Projected audio embeddings can overwhelm the LLM's textual components (“LLM dormancy”), diminishing prompt/instruction effectiveness. Careful scaling and normalization of the projected embeddings, or architectural expansion to include native audio tokens, are required (Peng et al., 24 Oct 2024); a simple normalization sketch follows this list.
- Speaker Awareness Deficit: State-of-the-art SpeechLLMs, including Qwen-Audio and WavLLM, tend to ignore identity-critical cues, falling short of true multimodal capability and effectively replicating cascaded ASR + LLM behavior (Wu et al., 7 Sep 2024).
- Low-Resource and Multilingual Generalization: Extensive speech or paired data is scarce for many languages. Transfer learning with projectors pretrained on high-resource languages, as well as unified modality encoders, helps close the gap, but cross-domain generalization remains limited (Fong et al., 7 Aug 2025, Kim et al., 1 Jun 2025).
- Robustness under Acoustic Variability: Fragility to environmental noise or meaning-irrelevant perturbations in both discrete and continuous representation pipelines increases training burden and lowers downstream reliability (Song et al., 26 Sep 2025).
- Complex Reasoning and Human-Aligned Feedback: MMSU results indicate substantial gaps relative to human performance, especially for paralinguistic and prosodic reasoning. Emerging strategies such as rationale-augmented supervised fine-tuning (SageLM) and instruction adherence/correction metrics are being developed to drive more interpretable, explainable systems (Ge et al., 28 Aug 2025).
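One simple mitigation in the spirit of the "careful scaling and normalization" noted in the first bullet is to rescale projected audio embeddings to the typical norm of the LLM's text embeddings. The helper below is an illustrative sketch, not the specific remedy of the cited work.

```python
import torch

def match_text_embedding_scale(audio_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    """Rescale projected audio embeddings so their average L2 norm matches that
    of the LLM's text embeddings, reducing the risk that the audio stream
    dominates attention and 'silences' the textual prompt."""
    audio_norm = audio_emb.norm(dim=-1).mean()
    text_norm = text_emb.norm(dim=-1).mean()
    return audio_emb * (text_norm / (audio_norm + 1e-6))
```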
7. Applications and Future Directions
SpeechLLMs underpin a new generation of systems for:
- End-to-end dialogue and task-oriented spoken dialogue systems with reduced response latency and true multimodal semantic alignment (Mitsui et al., 18 Jun 2024, Wang et al., 4 Mar 2025).
- Automated L2 oral proficiency graders that outperform text-based or end-to-end scoring models, with high generalization across test parts and datasets (Ma et al., 27 May 2025).
- Real-time speaker diarization and recognition in multilingual conversational AI, surpassing baseline diarization/transcription pipelines in both accuracy and computational efficiency (Saengthong et al., 26 Jun 2025).
- Low-resource ASR and speech understanding for underrepresented languages via cross-lingual transfer and modality-invariant learning (Fong et al., 7 Aug 2025, Kim et al., 1 Jun 2025).
- Comprehensive spoken language understanding benchmarks (MMSU, SLU-GLUE) for stress-testing and guiding model development toward fine-grained, robust, and scalable human-AI interaction (Wang et al., 5 Jun 2025, Li et al., 29 Aug 2024).
- Explainable and preference-aligned spoken dialogue evaluation supporting industrial deployment, with agreement to human raters and rationale generation (Ge et al., 28 Aug 2025).
Critical ongoing directions include optimizing hybrid discrete/continuous representations, enhancing robustness and speaker identity modeling, developing data-efficient alignment/training strategies, and expanding multimodal integration (e.g., including vision or context metadata) in SpeechLLM architectures. As training and resources for speech-processing LLMs continue to scale, the field is anticipated to move beyond classic ASR and spoken NLU benchmarks toward open-ended, explainable, and more human-aligned speech understanding and generation.