
Speech-Aware LLMs: Integrating Speech & Text

Updated 25 September 2025
  • Speech-Aware LLMs are advanced multimodal models that combine speech processing with language generation to enable joint audio-text tasks.
  • They employ techniques like embedding prepending and modality adapters to merge audio features from encoders directly into large language models.
  • Empirical evaluations show improvements in WER, decoding speed, and cross-modal alignment, while challenges remain in avoiding catastrophic forgetting and ensuring fairness.

Speech-Aware LLMs (SALLMs) constitute an emergent class of neural architectures in which LLMs are endowed with direct speech processing and understanding capabilities, thereby allowing for joint handling of audio and text modalities in both input and output. These models unify the traditionally distinct domains of speech recognition, understanding, and generation with the broad reasoning and generative capabilities of LLMs, targeting applications in multilingual ASR, speech-driven dialog agents, multimodal retrieval, expressive TTS, and context-aware spoken language interaction. SALLMs integrate speech representations—often in the form of fixed or learned embeddings, discrete tokens, or paralinguistic vectors—directly into the LLM, enabling seamless multimodal generative tasks and significantly advancing the integration of spoken language into foundation models.

1. Architectural Principles and Integration Strategies

SALLMs are typically realized by coupling a high-capacity LLM (e.g., LLaMA, Mistral, Qwen, InternLM) with one or more speech encoders and a set of modality adapters. The canonical approach involves processing raw audio into feature representations via a conformer, transformer, or self-supervised speech encoder (e.g., Whisper, WavLM, HuBERT), followed by projection and (optionally) stacking operations to align speech embeddings with the LLM’s input space (Fathullah et al., 2023, Chen et al., 2023).

A widely used integration mechanism is the "embedding prepending" strategy, where variable-length sequences of projected audio embeddings are prepended to the textual input sequence of the decoder-only LLM, thereby transforming classical next-token prediction into a speech-conditional generative process:

y_j = W \cdot [x_{jn}, x_{jn+1}, \ldots, x_{jn+n-1}] + b

where x_i are audio encoder outputs, W is a projection matrix, b is a bias term, and y_j are the resultant embeddings.
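
As a concrete illustration of the stacking-and-projection step, the following is a minimal PyTorch-style sketch; the module and parameter names (frame_stack, SpeechPrepender, stack_factor) are illustrative assumptions, not components of any cited system.

```python
import torch
import torch.nn as nn

def frame_stack(audio_feats: torch.Tensor, n: int) -> torch.Tensor:
    """Concatenate every n consecutive encoder frames along the feature dimension.
    (batch, T, d_audio) -> (batch, T // n, n * d_audio)
    """
    b, t, d = audio_feats.shape
    t = (t // n) * n                      # drop trailing frames that do not fill a group
    return audio_feats[:, :t].reshape(b, t // n, n * d)

class SpeechPrepender(nn.Module):
    """Projects stacked audio frames into the LLM embedding space and prepends them."""
    def __init__(self, d_audio: int, d_llm: int, stack_factor: int = 4):
        super().__init__()
        self.stack_factor = stack_factor
        # Implements y_j = W [x_{jn}, ..., x_{jn+n-1}] + b
        self.proj = nn.Linear(stack_factor * d_audio, d_llm)

    def forward(self, audio_feats: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        speech_embeds = self.proj(frame_stack(audio_feats, self.stack_factor))
        # Prepending makes next-token prediction a speech-conditional generative process.
        return torch.cat([speech_embeds, text_embeds], dim=1)

# Example: 80-dim encoder outputs, 4096-dim LLM embeddings, 100 audio frames, 16 text tokens
prepender = SpeechPrepender(d_audio=80, d_llm=4096, stack_factor=4)
inputs = prepender(torch.randn(2, 100, 80), torch.randn(2, 16, 4096))
print(inputs.shape)  # torch.Size([2, 41, 4096])
```

In practice, the concatenated sequence is fed to the decoder-only LLM in place of its ordinary text-embedding input, leaving the rest of the model unchanged.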

Recent variants adopt more sophisticated multi-modal alignment strategies.

Advancements in tokenization—such as LLM-aware tokenization (LAST) and decoupled multi-stream approaches—support better semantic alignment and improved speech modeling by producing discrete units optimized for LLM-driven next-unit prediction (Turetzky et al., 5 Sep 2024, Fan et al., 14 Jun 2025). Decoupling semantic tokens from acoustic detail tokens has been shown to improve both alignment and expressive synthesis quality.
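
To make the notion of discrete speech units concrete, here is a minimal sketch that assumes a k-means quantizer over self-supervised encoder features; the codebook size and feature dimension are illustrative, and systems such as LAST learn units jointly with the LLM objective rather than with plain k-means.

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative settings: 100 discrete units over 768-dim SSL features (e.g., HuBERT-like).
N_UNITS, D_FEAT = 100, 768

# Fit the codebook on a pool of encoder frames; random data stands in for real features.
pooled_frames = np.random.randn(2000, D_FEAT).astype(np.float32)
codebook = KMeans(n_clusters=N_UNITS, n_init=10, random_state=0).fit(pooled_frames)

def speech_to_units(encoder_frames: np.ndarray) -> np.ndarray:
    """Map (T, D_FEAT) continuous frames to a sequence of discrete unit IDs."""
    return codebook.predict(encoder_frames)

units = speech_to_units(np.random.randn(200, D_FEAT).astype(np.float32))
# The unit IDs can be added to the LLM vocabulary and trained with ordinary
# next-token cross-entropy, i.e., "next-unit prediction".
print(units[:10])
```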

2. Training Methodologies and Optimization Schemes

The training of SALLMs ranges from parameter-efficient adaptation (e.g., LoRA-based fine-tuning (Fathullah et al., 2023, Peng et al., 23 Oct 2024)) to full end-to-end multi-modal instruction-following, multitask learning, and large-scale staged alignment strategies. Key methodologies include:

  • Frozen backbone adaptation: In approaches such as SALM (Chen et al., 2023), the LLM is frozen and only the audio encoders and adapters are updated, allowing for modular reuse and preservation of original textual capabilities.
  • Joint multi-modal, single-stage supervised fine-tuning: Approaches like VoiceTextBlender (Peng et al., 23 Oct 2024) advocate mixing batches from text-only and various speech-related datasets (ASR, AST, SQA, mixed-modal) during fine-tuning, mitigating catastrophic forgetting of text skills while efficiently acquiring speech competence.
  • Reinforcement Learning with Grouped Preference Optimization: Recent work introduces Group Relative Policy Optimization (GRPO) for optimizing SALLMs on open-format generative speech understanding tasks. GRPO leverages rewards such as BLEU between sampled generations and human references, using group-normalized advantage and importance sampling to guide updates:

\hat{A}_i = \frac{r_i - \text{mean}(R)}{\text{std}(R)}, \qquad l_{i,t} = \min\left(s_{i,t}(\theta)\,\hat{A}_i,\ \text{clip}\left[s_{i,t}(\theta),\, 1-\epsilon,\, 1+\epsilon\right]\hat{A}_i\right)

Empirically, GRPO yields significant performance gains over SFT for spoken question answering and automatic speech translation (Elmakies et al., 21 Sep 2025); a minimal sketch of the group-normalized advantage and clipped objective appears after this list.

  • Iterative Cross-modal Pretraining: For SALLMs designed for dialog or expressive generation, multi-step pretraining with prosody-infused acoustic tokens, interleaved speech-text input streams, and chain-of-reasoning templates are used to better align semantic and paralinguistic cues (Kim et al., 8 Feb 2024, Chen et al., 24 Jul 2025).
  • Alignment with LLM Distillation (ALLD): SALLMs can be further tuned for speech quality evaluation by distilling descriptive capabilities from an expert LLM into an audio-aware LLM, optimizing a combined reward and KL penalty to capture both global and dimension-specific quality attributes (Chen et al., 27 Jan 2025).
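
As referenced in the GRPO item above, the following is a minimal sketch of the group-normalized advantage and clipped surrogate objective; it assumes per-sample rewards (e.g., sentence-level BLEU against a human reference) have already been computed, and the function and tensor names are illustrative.

```python
import torch

def grpo_token_loss(
    logp_new: torch.Tensor,   # (G, T) per-token log-probs under the current policy
    logp_old: torch.Tensor,   # (G, T) per-token log-probs under the sampling policy
    rewards: torch.Tensor,    # (G,) scalar reward per sampled generation (e.g., BLEU)
    mask: torch.Tensor,       # (G, T) 1 for real tokens, 0 for padding
    eps: float = 0.2,
) -> torch.Tensor:
    # Group-normalized advantage: A_i = (r_i - mean(R)) / std(R)
    adv = ((rewards - rewards.mean()) / (rewards.std() + 1e-8)).unsqueeze(1)  # (G, 1)

    # Importance ratio s_{i,t}(theta) and clipped surrogate term
    ratio = torch.exp(logp_new - logp_old)                                    # (G, T)
    per_token = torch.min(ratio * adv, torch.clamp(ratio, 1 - eps, 1 + eps) * adv)

    # Maximize the objective, i.e., minimize its negative, averaged over real tokens
    return -(per_token * mask).sum() / mask.sum()

# Example: a group of G=4 sampled generations, each of length T=8
G, T = 4, 8
loss = grpo_token_loss(
    logp_new=torch.randn(G, T),
    logp_old=torch.randn(G, T),
    rewards=torch.tensor([0.31, 0.55, 0.12, 0.40]),
    mask=torch.ones(G, T),
)
print(float(loss))
```

Published GRPO variants typically also add a KL penalty against a reference policy; that term is omitted here for brevity.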

3. Performance Evaluation and Empirical Results

SALLMs are evaluated on a broad spectrum of speech and language understanding tasks, including:

  • Multilingual ASR and AST (measured by WER and BLEU; a reference WER implementation appears after this list),
  • Spoken QA, open-format generative tasks (BLEU, BERTScore, ROUGE, METEOR),
  • In-context learning and prompt-driven keyword boosting (F-score),
  • Speaker and paralinguistic robustness (Similarity, MOS, attribute-responsiveness),
  • Instruction-following and catastrophic forgetting quantification (Speech-IFEval, C3T metrics) (Lu et al., 25 May 2025, Kubis et al., 15 Sep 2025).
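
Since WER is the headline metric for the ASR results below, a reference implementation is sketched here; it is the standard word-level Levenshtein formulation, not any particular paper's scoring script.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words,
    computed via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution or match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("turn the volume up", "turn volume up please"))  # 0.5
```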

Key empirical findings:

  • LLMs equipped with conformer-based audio frontends and trained on multilingual datasets achieve up to an 18% WER improvement over monolingual baselines, with robust performance even when the LLM is frozen or operates at coarse audio strides (up to 1 s) (Fathullah et al., 2023).
  • Decoupled tokenizers, speaker-aware generation schemes, and multi-token prediction yield significant decoding speedups (up to 12×) and WER reductions (e.g., from 6.07 to 3.01), with sustained knowledge understanding and speaker consistency (Fan et al., 14 Jun 2025).
  • Instruction-following is a persistent weakness; SALLMs typically exhibit large forgetting rates (performance drops exceeding 50% relative to their text-only counterparts), with marked declines in both constrained output formatting and chain-of-thought tasks (Lu et al., 25 May 2025).
  • Interleaved SLMs (speech & text tokens) initialized from pretrained TextLMs attain strong semantic metrics with lower compute and data requirements than textless SLMs, especially when optimized for larger model sizes over dataset scale (Maimon et al., 3 Apr 2025).
  • Fairness and modality robustness remain a challenge; even models with high aggregate accuracy can fail to provide fair or consistent results across age, accent, and gender groups in speech interfaces, as revealed by fine-grained evaluations such as C3T (Kubis et al., 15 Sep 2025).

A summary table of selected empirical results appears below:

Model/Approach | Key Metric(s) | Result/Observation
LLaMA-7B + Conformer | Avg. WER on MLS | 9.7% (outperforming monolingual ASR by 18%)
SALM | LibriSpeech ASR (WER) | 2.4 (test-clean), 5.3 (test-other)
SyllableLM | Training compute | 30× reduction vs. SoTA SpeechLM
VoiceTextBlender | Multi-turn SQA/AST | Outperforms 7B/13B models with only 3B parameters
CosyVoice 2 | Streaming TTS latency | <1 s with human-parity naturalness
GOAT-SLM | TELEVAL dialect responsiveness | >90% consistency; high emotion/age responsiveness
Interleaved SLM | StoryCloze sSC | Matches SoTA with 10× less data
Speech-IFEval | Instruction following | ≈50% performance drop vs. text-only LLMs

4. Scaling Laws, Tokenization, and Data Requirements

Research on scaling properties indicates that SLMs’ linguistic performance scales predictably with compute, model size, and training steps, but with roughly three orders of magnitude lower efficiency than text-based models (Cuervo et al., 31 Mar 2024). Synthetic data (e.g., sTinyStories) can boost semantic learning, provided that tokenization preserves sufficient acoustic and linguistic detail. Coarse segmentation (e.g., syllable-level units (Baade et al., 5 Oct 2024)) and LLM-aware speech tokenizers (Turetzky et al., 5 Sep 2024) mitigate inefficiencies by reducing token rate and aligning speech units for better cross-modal modeling.
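
As an illustration of how such scaling trends are typically estimated, the sketch below fits a power law loss ≈ a·C^(−b) to (compute, loss) pairs in log space; the functional form and all numbers are illustrative assumptions, not values from the cited study.

```python
import numpy as np

# Illustrative (training compute, validation loss) pairs; not real measurements.
compute = np.array([1e18, 3e18, 1e19, 3e19, 1e20])
loss = np.array([3.10, 2.85, 2.62, 2.44, 2.28])

# Fit loss ~ a * C^(-b)  <=>  log(loss) = log(a) - b * log(C)
slope, intercept = np.polyfit(np.log(compute), np.log(loss), deg=1)
a, b = np.exp(intercept), -slope
print(f"fitted power law: loss ~ {a:.2f} * C^(-{b:.3f})")

# Extrapolate the compute needed to reach a target loss
target = 2.0
needed_compute = (a / target) ** (1 / b)
print(f"estimated compute for loss {target}: {needed_compute:.2e}")
```

Comparing the fitted exponents (and the compute needed to reach a fixed loss) between speech-based and text-based models is one way to express the efficiency gap noted above.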

Hybrid interleaved models further reduce compute and data requirements. The optimal compute allocation leans toward model growth rather than increased data volume once initialized from rich TextLMs, and higher semantic scores are achievable with less raw audio (Maimon et al., 3 Apr 2025).

5. Specialized Modules and Multimodal Extensions

Recent SALLMs incorporate prompt-aware mixtures of audio encoders, paralinguistic model heads, and explicit module branching (Write/Speak) to handle diverse downstream and expressive tasks (Shan et al., 21 Feb 2025, Chen et al., 24 Jul 2025):

  • Mixture-of-Experts selection for task-specific feature extraction (ASR, speaker counting, audio captioning); a minimal gating sketch appears after this list.
  • Decoupling of linguistic and acoustic generation via dual-head architectures.
  • Explicit modeling of prosody, emotion, dialect, and age in both input understanding and output speech synthesis.
  • Grouped policy optimization for flexible generative open-ended speech tasks (Elmakies et al., 21 Sep 2025).
  • Seamless integration of cross-modal retrieval and RAG from unified embeddings (Sun et al., 26 Jan 2025), with marked reductions in inference latency.
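
For intuition about the Mixture-of-Experts item above, here is a minimal sketch of prompt-aware expert selection over several audio encoders; the class name, softmax gating, and dimensions are illustrative assumptions rather than the exact designs of the cited systems.

```python
import torch
import torch.nn as nn

class PromptAwareEncoderMixture(nn.Module):
    """Weights the outputs of several audio encoders using a pooled prompt embedding."""
    def __init__(self, d_prompt: int, d_audio: int, num_experts: int = 3):
        super().__init__()
        self.gate = nn.Linear(d_prompt, num_experts)   # one logit per expert encoder

    def forward(self, prompt_embed: torch.Tensor, expert_feats: torch.Tensor) -> torch.Tensor:
        """
        prompt_embed: (batch, d_prompt) pooled embedding of the task prompt
        expert_feats: (batch, num_experts, T, d_audio) features from each encoder
        returns:      (batch, T, d_audio) prompt-weighted mixture of expert features
        """
        weights = torch.softmax(self.gate(prompt_embed), dim=-1)   # (batch, num_experts)
        return torch.einsum("be,betd->btd", weights, expert_feats)

# Example: 3 expert encoders (e.g., ASR-, speaker-, and captioning-oriented)
mixer = PromptAwareEncoderMixture(d_prompt=512, d_audio=256, num_experts=3)
mixed = mixer(torch.randn(2, 512), torch.randn(2, 3, 50, 256))
print(mixed.shape)  # torch.Size([2, 50, 256])
```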

Multi-turn dialog and context-aware speech-driven interaction are enabled via interleaved or chain-of-reasoning templates and paralinguistic-aware branching, enhancing the social and dialogic quality of conversational agents (Kim et al., 8 Feb 2024, Chen et al., 24 Jul 2025).

6. Evaluation, Robustness, and Open Challenges

Recent benchmarks directly target the instruction-following, fairness, and modality robustness of SALLMs:

  • Speech-IFEval isolates task adherence from speech perception and quantifies catastrophic forgetting (Lu et al., 25 May 2025); a sketch of one such forgetting-rate computation appears after this list.
  • C3T evaluates the preservation and fairness of language understanding across speaker profiles and between text and speech channels (Kubis et al., 15 Sep 2025).
  • Human and automatic preference tests, as well as chain-of-thought and creative tasks, highlight that current SALLMs often underperform their text-only origins in tasks requiring detailed output control or instruction adherence.
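
To make the forgetting-rate idea concrete, a minimal sketch follows; the relative-drop formula is a common convention and is an assumption here, not necessarily the exact definition used by Speech-IFEval.

```python
def forgetting_rate(text_llm_score: float, speech_llm_score: float) -> float:
    """Relative change in a capability score after speech-centric adaptation.
    Negative values indicate forgetting."""
    return (speech_llm_score - text_llm_score) / text_llm_score

# Instruction-following accuracy of a text-only backbone vs. its speech-adapted
# version (illustrative numbers only).
print(f"{forgetting_rate(0.80, 0.38):.1%}")  # -52.5%
```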

Core challenges include:

  • Preserving or recovering textual reasoning and instruction-following capabilities during speech-centric training (mitigating catastrophic forgetting).
  • Achieving demographic fairness and cross-modal consistency in real-world spoken interfaces.
  • Efficient scaling—both data and compute—without sacrificing generalization or semantic performance.
  • Robust paralinguistic reasoning and generation (emotion, speaker traits, non-linguistic vocal signals).

A plausible implication is that future SALLMs will require more granular alignment strategies, hybrid or multi-layer architectures, and advanced evaluation protocols to ensure balanced competence in both linguistic and non-linguistic dimensions.

7. Future Directions

The field is rapidly progressing toward:

  • Richer alignment of speech, text, and visual modalities to enable genuinely generalist agents (Luo et al., 8 Jan 2025).
  • Broader deployment of SALLMs in conversational assistants, retrieval-augmented systems, voice-based interfaces, and accessibility tools.
  • Efficient adaptation mechanisms (e.g., task-specific adapters, prompt routing, multi-token prediction) to support emerging application scenarios while mitigating trade-offs in memory, computation, and skill retention.
  • Improved evaluation frameworks that capture subtleties in instruction-following, fairness, and expressive/social competence.
  • Open-source models, datasets, and benchmarks to catalyze further innovation and comparative analysis across architectures and scaling regimes (Maimon et al., 3 Apr 2025).

Significant methodological and empirical work remains to be done to achieve robust, fair, and contextually aware spoken language understanding and generation on par with the text capabilities of the most advanced LLMs, particularly in mixed or open-ended settings.
