End-to-End Speech LLMs Overview

Updated 16 May 2026

End-to-End Speech LLMs are unified deep learning architectures that directly convert raw speech into language outputs by jointly modeling acoustic and linguistic features.
They employ pretrained speech encoders and large language models connected via lightweight bridge networks or adapters, which allows parameter-efficient adaptation.
They excel in diverse tasks like ASR, translation, and dialogue understanding, demonstrating competitive accuracy and reduced training overhead.

End-to-end Speech LLMs (Speech LLMs) are neural architectures that map input speech waveforms directly to high-level language outputs (transcribed text, structured representations, dialogue states, translations, and summaries) using unified models built around LLMs. Unlike traditional cascaded pipelines, they integrate pretrained speech encoders and LLM backbones via lightweight adapters or bridging networks. This enables co-optimization of acoustic and linguistic modeling, parameter-efficient adaptation, and the transfer of linguistic knowledge and reasoning capabilities from text-domain LLMs to speech-based tasks. The paradigm has achieved strong results across automatic speech recognition (ASR), speech translation, spoken question answering, spoken dialogue understanding, simultaneous speech-to-speech translation, and speech-driven retrieval tasks.

1. Core Model Architectures and Modality Bridging

End-to-end Speech LLMs combine a frozen or lightly adapted pretrained speech foundation model (such as Whisper, HuBERT, or Wav2Vec2.0) with a large pretrained LLM (e.g., Llama, GPT, OLMo, Gemma). The two modalities are connected through a bridge or adapter module that projects dense acoustic features into the LLM’s embedding space.

Typical pipeline:

Speech encoder: Maps the waveform $x$ (e.g., sampled at 16 kHz) to high-dimensional frame-level embeddings $h = f_\text{speech}(x)$.
Bridge network: Compresses and projects the encoder output into a format the LLM can consume. Options include convolutional downsampling, 1D conv + linear projection (Hono et al., 2023), CTC-based frame selection (Ling et al., 2023), Q-Former attention pooling (Shang et al., 2024), and Transformer-based projectors (Mohapatra et al., 28 Jan 2026).
LLM head: A decoder-only Transformer consumes the projected speech prompt $s$ as a prefix and generates output tokens $y$ autoregressively.
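As a rough illustration of the bridge step, the sketch below stacks every k consecutive encoder frames and applies a linear projection into the LLM embedding width, then prepends the result to text embeddings. It is a minimal NumPy sketch under stated assumptions, not the implementation of any cited system: the function name `stack_and_project`, the toy dimensions, and the random weights are all hypothetical.

```python
import numpy as np

def stack_and_project(h, W, b, k=4):
    """Downsample frame-level features by stacking k frames, then project.

    h: (T, d_enc) encoder output; W: (k * d_enc, d_llm); b: (d_llm,).
    Returns s with shape (T // k, d_llm). Trailing frames that do not fill
    a complete group of k are dropped (an assumption of this sketch).
    """
    T, d_enc = h.shape
    T_trim = (T // k) * k
    stacked = h[:T_trim].reshape(T_trim // k, k * d_enc)
    return stacked @ W + b

rng = np.random.default_rng(0)
d_enc, d_llm, k = 8, 16, 4                 # toy sizes, hypothetical
h = rng.normal(size=(50, d_enc))           # stands in for f_speech(x)
W = rng.normal(size=(k * d_enc, d_llm)) * 0.02
b = np.zeros(d_llm)

s = stack_and_project(h, W, b, k=k)        # projected speech prompt
text_emb = rng.normal(size=(5, d_llm))     # stands in for embedded text tokens
llm_input = np.concatenate([s, text_emb], axis=0)  # speech prefix, then text
print(s.shape, llm_input.shape)
```

Frame stacking is only one of the compression options listed above; a strided convolution, CTC-based selection, or Q-Former pooling would replace the `reshape` step while keeping the same input and output shapes.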

Module-level math as in (Hono et al., 2023):

$h = f_\text{speech}(x)$, $s = g_\text{bridge}(h)$,
$p(y|s;\theta_\text{LLM}) = \prod_{i=1}^I p(y_i | y_{<i}, s; \theta_\text{LLM})$.
Training combines the causal LM loss with an optional CTC loss: $L_\text{total} = L_\text{LM} + \lambda_\text{CTC} L_\text{CTC}$, with $\lambda_\text{CTC}=0.5$.
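The autoregressive factorization and the combined objective can be made concrete with a small numerical sketch. The code below computes the LM loss as the negative log-likelihood of the target tokens under per-step logits and adds a weighted CTC term. Several points are assumptions of this sketch rather than details from the source: the loss is averaged over tokens, the CTC loss is passed in as a precomputed scalar instead of being computed, and the function names are hypothetical.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def lm_loss(logits, targets):
    """Mean negative log-likelihood of the targets.

    logits: (I, V); row i is the model's prediction for y_i given y_{<i}
    and the speech prompt s. targets: (I,) integer token ids.
    """
    logp = log_softmax(logits)
    return -logp[np.arange(len(targets)), targets].mean()

def total_loss(l_lm, l_ctc, lambda_ctc=0.5):
    """L_total = L_LM + lambda_CTC * L_CTC."""
    return l_lm + lambda_ctc * l_ctc

logits = np.zeros((3, 4))          # uniform predictions over a 4-token vocabulary
targets = np.array([0, 1, 2])
l_lm = lm_loss(logits, targets)    # equals log(4) for uniform predictions
l_ctc = 2.0                        # placeholder value, not a computed CTC loss
print(l_lm, total_loss(l_lm, l_ctc))
```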

Adapters such as LoRA may be inserted at self-attention or feed-forward layers in both the speech encoder and the LLM to enable efficient domain adaptation and reduce training cost (Hono et al., 2023, Ling et al., 2023, Mei et al., 4 Jan 2026).

2. Training Regimes, Objectives, and Parameter-efficient Adaptation

Training typically proceeds via a two-stage paradigm:

Pretraining/Stage 1 (modality alignment or feature-space matching): The bridge/projector is pretrained to align speech representations with the LLM’s text-embedding space, using mean-squared-error, cosine similarity, or cross-entropy losses on ASR or synthetic tasks (Mohapatra et al., 28 Jan 2026, Zhang et al., 2023). Padding and masking strategies are used to handle sequence-length mismatches.
Fine-tuning/Stage 2 (task-specific adaptation): The projector and optionally LoRA adapters in the LLM are further tuned with downstream objectives (ASR, translation, dialogue state tracking, SQA), updating only a small parameter subset. For some architectures, the LLM backbone remains frozen except for adapters (Hono et al., 2023, Mohapatra et al., 28 Jan 2026).
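A Stage 1 alignment loss of the MSE kind can be sketched as follows. The example pads projected speech features and target text embeddings to a common length and scores only the positions that both sequences cover. This is one possible padding and masking scheme chosen for illustration; the source does not specify the exact strategy, and the helper names `masked_mse` and `pad_to` are hypothetical.

```python
import numpy as np

def masked_mse(pred, target, mask):
    """Mean squared error over valid positions only.

    pred, target: (L, d) arrays padded to a common length L.
    mask: (L,) with 1.0 at positions to score and 0.0 at padding.
    """
    sq_err = ((pred - target) ** 2).mean(axis=-1)   # per-position error, (L,)
    return (sq_err * mask).sum() / max(mask.sum(), 1.0)

def pad_to(x, length):
    """Zero-pad an (n, d) array along the first axis up to `length` rows."""
    n, d = x.shape
    return np.concatenate([x, np.zeros((length - n, d))], axis=0)

rng = np.random.default_rng(0)
d = 16
speech_proj = rng.normal(size=(7, d))   # projector output, 7 positions
text_emb = rng.normal(size=(5, d))      # target text embeddings, 5 tokens

L = max(len(speech_proj), len(text_emb))
pred = pad_to(speech_proj, L)
target = pad_to(text_emb, L)
# Score only the positions covered by both sequences.
mask = (np.arange(L) < min(len(speech_proj), len(text_emb))).astype(float)

loss = masked_mse(pred, target, mask)
print(loss)
```

In Stage 2 the same projector would be updated with a task loss instead, with the LLM backbone frozen or adapted only through LoRA.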

Parameter-efficient adaptation uses LoRA with small rank and scaling, e.g., $W = W_0 + AB$ ( $A\in\mathbb{R}^{d\times r}, B\in\mathbb{R}^{r\times d}, r\ll d$ ), resulting in a few tens of millions of trainable parameters in models with billions of total parameters (Hono et al., 2023).
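The low-rank update can be sketched in a few lines of NumPy. The example follows the formula $W = W_0 + AB$ with the shapes given above and compares trainable and frozen parameter counts at toy scale. Zero-initializing one factor, so that training starts from the pretrained weights, is a common convention assumed here; the scaling factor is omitted, and the class name `LoRALinear` is hypothetical.

```python
import numpy as np

class LoRALinear:
    """Frozen weight W0 plus a trainable low-rank update A @ B."""

    def __init__(self, W0, r, rng):
        d = W0.shape[0]
        self.W0 = W0                                # frozen, (d, d)
        self.A = rng.normal(size=(d, r)) * 0.01     # trainable, (d, r)
        self.B = np.zeros((r, d))                   # trainable, (r, d), zero init

    def effective_weight(self):
        return self.W0 + self.A @ self.B

    def forward(self, x):
        # Same result as x @ effective_weight(), without forming the d x d update.
        return x @ self.W0 + (x @ self.A) @ self.B

    def trainable_params(self):
        return self.A.size + self.B.size

rng = np.random.default_rng(0)
d, r = 64, 4                                        # toy sizes, r << d
W0 = rng.normal(size=(d, d))
layer = LoRALinear(W0, r, rng)
x = rng.normal(size=(2, d))

print(np.allclose(layer.forward(x), x @ W0))        # update is zero at initialization
print(layer.trainable_params(), W0.size)            # 2*d*r trainable vs d*d frozen
```

The trainable count grows as $2dr$ rather than $d^2$, which is why adapters stay small relative to the backbone even when applied to many layers.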

Ablation studies confirm that full fine-tuning of both speech and language modules yields the best accuracy, but LoRA enables near-optimal adaptation with greatly reduced computational overhead and strong domain portability (Hono et al., 2023, Ling et al., 2023, Mei et al., 4 Jan 2026).

3. Applications Across Speech-Language Tasks

End-to-end Speech LLMs have been successfully applied to a range of tasks beyond conventional ASR:

Task	Key Components	Example Systems
ASR	Speech+LLM+Bridge	Nue-ASR (Hono et al., 2023), Speech LLaMA (Lakomkin et al., 2023)
Diarization+ASR	Joint token modeling	Unified Speech LLM (Saengthong et al., 26 Jun 2025)
Translation	Speech+LLM+Adapters	LLaST (Chen et al., 2024), LST (Zhang et al., 2023, Luu et al., 11 Oct 2025)
Summarization	Speech-QF-LLM	(Shang et al., 2024)
Dialogue	Full spoken context	(Ghazal et al., 10 Oct 2025), E2E RAG (Feng et al., 27 Apr 2025)
SQA/SQA+ST	Modular projector	SpeechMapper (Mohapatra et al., 28 Jan 2026)
Simultaneous S2S	Duplex E2E model	Seed LiveInterpret 2.0 (Cheng et al., 23 Jul 2025)

Notably, these configurations demonstrate that, with appropriate alignment and modular bridging, LLMs pretrained only on text can accurately generate fully formatted text—including punctuation, casing, and numerals—when conditioned directly on continuous speech features (Ling et al., 2023). The “plug-and-play” approach to adaptation enables unification of recognition, translation, and understanding, and supports simultaneous speaker turn, timestamp, and text output in diarization and dialogue settings (Saengthong et al., 26 Jun 2025, Ghazal et al., 10 Oct 2025).

For speech translation, modular architectures such as LLaST combine ASR-augmented multitask training, multilingual augmentation, and dual-LoRA adaptation to reach state-of-the-art SacreBLEU (Chen et al., 2024). In simultaneous speech-to-speech translation, highly integrated models employing streaming attention, voice cloning, and RL-based latency/quality tradeoff achieve sharp improvements in both fidelity and real-time usability (Cheng et al., 23 Jul 2025).

4. Modality Gap, Representation Bottlenecks, and Analysis

Despite marked gains, a persistent “modality gap” remains: for equivalent semantics, speech input yields lower downstream performance than text input, especially for reasoning, translation, and structured understanding. A detailed cross-layer analysis (Hsu et al., 2 Mar 2026) defines the gap as the difference in task metric between text input and matched speech input, $\Delta = M_\text{text} - M_\text{speech}$, where a positive $\Delta$ implies degraded performance on matched speech.

Cross-layer CKA reveals that:

Early speech layers are misaligned with text representations (dark “zone” in CKA heatmaps).
Mid-layers show a broad alignment “band” (∼6 LLM layers), reflecting semantic “smearing”: information is spread redundantly across many speech tokens because acoustic data is lossy and highly redundant.
Late layers in the LLM fail to perform the “decision sharpening” observed with text, resulting in more diffuse attention, lower logit margins, and instability in output tokens.
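For readers unfamiliar with CKA, the sketch below computes linear CKA between two sets of representations and assembles a layer-by-layer similarity matrix of the kind shown in CKA heatmaps. It is a minimal sketch under assumptions: the source does not say which CKA variant was used, the linear form is chosen here for simplicity, and the inputs are assumed to be paired with one pooled vector per example for both speech and text. The random data only exercises the code and carries no empirical meaning.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two representations of the same n examples.

    X: (n, d1), Y: (n, d2). Returns a similarity in [0, 1].
    """
    X = X - X.mean(axis=0, keepdims=True)   # center each feature
    Y = Y - Y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, ord="fro")
    norm_y = np.linalg.norm(Y.T @ Y, ord="fro")
    return cross / (norm_x * norm_y)

def cka_matrix(speech_layers, text_layers):
    """Pairwise CKA: rows index speech-input layers, columns text-input layers."""
    return np.array([[linear_cka(s, t) for t in text_layers]
                     for s in speech_layers])

rng = np.random.default_rng(0)
n = 100                                                # paired examples
text_layers = [rng.normal(size=(n, 32)) for _ in range(3)]
# Synthetic "speech" activations: the text activations plus noise.
speech_layers = [t + 0.5 * rng.normal(size=t.shape) for t in text_layers]

M = cka_matrix(speech_layers, text_layers)
print(M.round(2))
```

Linear CKA is invariant to orthogonal transformations and isotropic scaling of either representation, which makes it suitable for comparing layers whose coordinate systems differ.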

Simple statistical calibration or feature-moment matching is not sufficient to close the gap and can reduce performance catastrophically (BBH –15.5 pp, SpeechMMLU –47.2 pp). The implication is that the speech-to-LLM mapping bottleneck is not purely geometric; it reflects a granularity mismatch where acoustic frames must be condensed adaptively into information-dense, contextually salient units before token-level decisions (Hsu et al., 2 Mar 2026).
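For reference, feature-moment matching of the kind described above can be sketched as a per-dimension shift and rescale of speech features to the mean and standard deviation of text features. This is an illustrative assumption about what such a calibration looks like, not the procedure used in the cited analysis, and the function name `match_moments` is hypothetical.

```python
import numpy as np

def match_moments(speech, text, eps=1e-6):
    """Shift and scale speech features to the per-dimension mean/std of text.

    speech: (n_s, d), text: (n_t, d). Only first and second moments are
    matched; the ordering and granularity of the features are untouched.
    """
    mu_s, sd_s = speech.mean(axis=0), speech.std(axis=0)
    mu_t, sd_t = text.mean(axis=0), text.std(axis=0)
    return (speech - mu_s) / (sd_s + eps) * sd_t + mu_t

rng = np.random.default_rng(0)
speech = rng.normal(loc=3.0, scale=5.0, size=(200, 8))   # synthetic features
text = rng.normal(loc=0.0, scale=1.0, size=(300, 8))

calibrated = match_moments(speech, text)
print(np.abs(calibrated.mean(axis=0) - text.mean(axis=0)).max())
```

The transform changes only the marginal statistics of each dimension, which is consistent with the observation that it cannot address a granularity mismatch between acoustic frames and tokens.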

Recommended directions arising from this analysis include:

Learnable hierarchical token merging or pooling for early-stage compression.
Prosody-aware or semantic change-point segmentation for speaker and unit alignment.
Hybrid discrete-continuous representations (e.g., VQ-VAE-based tokenizers) to enable LLMs to reason over speech inputs with token-like granularity.
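To make the first direction more tangible, the sketch below merges runs of consecutive, similar frames into single averaged vectors. It is a fixed, non-learned greedy heuristic used purely for illustration; a learnable hierarchical merger as proposed above would replace the cosine threshold with trained components. The function name and threshold value are hypothetical.

```python
import numpy as np

def merge_adjacent_frames(h, threshold=0.9):
    """Greedily average runs of consecutive, similar frames.

    h: (T, d) float frame features. A frame joins the current segment when
    its cosine similarity to the segment mean exceeds `threshold`; otherwise
    it starts a new segment. Returns (merged, lengths).
    """
    segments, lengths = [], []
    seg_sum, seg_len = h[0].copy(), 1
    for frame in h[1:]:
        mean = seg_sum / seg_len
        cos = frame @ mean / (np.linalg.norm(frame) * np.linalg.norm(mean) + 1e-8)
        if cos > threshold:
            seg_sum += frame
            seg_len += 1
        else:
            segments.append(seg_sum / seg_len)
            lengths.append(seg_len)
            seg_sum, seg_len = frame.copy(), 1
    segments.append(seg_sum / seg_len)   # flush the final segment
    lengths.append(seg_len)
    return np.stack(segments), lengths

# Three near-identical frames followed by two frames in another direction.
h = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
merged, lengths = merge_adjacent_frames(h)
print(merged, lengths)
```

The output length now depends on the content of the input rather than on a fixed stride, which is the property that adaptive pooling is meant to provide.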

5. Empirical Performance and Comparative Evaluation

Across ASR, translation, and dialogue tasks, end-to-end Speech LLMs have demonstrated competitive performance relative to established baselines, as detailed in the following empirical highlights:

ASR: Nue-ASR achieves character error rates on par with Whisper-large-v2 and dominant conformer baselines (e.g., CER=8.6% vs. 8.7% on JSUT; RTF=0.15 with DeepSpeed optimization) (Hono et al., 2023).
ASR contextualization: Speech LLaMA, with only 30M adapter parameters, attains a relative WER reduction of 7.5% and rare-word WER reduction of 17% over a much larger RNN-T system with WFST biasing, using 25× less training data. Gains are robust, especially on domain-specific or entity-rich transcripts (Lakomkin et al., 2023).
End-to-end speech translation: LLaST-14B and LST-13B, using lightweight adapters and frozen LLM backends, set new SOTA SacreBLEU on CoVoST-2 and MuST-C, outperforming both cascaded and prior multimodal approaches, and demonstrating strong scaling with larger LLMs and speech encoders (Chen et al., 2024, Zhang et al., 2023).
Dialogue and diarization: Incorporating full spoken context, speech-aware LLMs drive joint goal accuracy on SpokenWOZ to 39.3% (vs. 32.1% for text+speech hybrid), with attention-pooling compressed history recovering most of the gain at reduced memory cost (Ghazal et al., 10 Oct 2025).
Multitask and zero-shot transfer: SpeechMapper’s two-stage modularity enables rapid attachment of pretrained speech projectors to any LLM, matching heavy instruction-tuned baselines on SQA and speech translation in <2 GPU-hours of Stage 2 updates (Mohapatra et al., 28 Jan 2026).
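The error-rate figures quoted above rest on two simple computations: an edit-distance-based error rate (WER over words, CER over characters) and a relative reduction between two systems. The sketch below shows both in plain Python. The example strings and numbers are toy values chosen for illustration and are not taken from any cited paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]

def wer(ref, hyp):
    """Word error rate: word-level edit distance over reference length."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)

def cer(ref, hyp):
    """Character error rate: character-level edit distance over reference length."""
    return edit_distance(list(ref), list(hyp)) / len(ref)

def relative_reduction(baseline, system):
    """Relative error reduction of `system` with respect to `baseline`."""
    return (baseline - system) / baseline

print(wer("the cat sat on the mat", "the cat sat on mat"))   # one deletion in six words
print(relative_reduction(10.0, 9.0))                          # toy error rates in percent
```

Relative reductions are expressed as a fraction of the baseline error, so the same absolute improvement corresponds to a larger relative reduction when the baseline error is already low.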

A recurring observation is that, despite closing much of the performance gap, LLM-based ASR remains slightly inferior to direct E2E models (e.g., Whisper full fine-tune), especially for OOD robustness. The reported guidance is to prioritize transcript-aligned objectives (e.g., auxiliary CTC or ASR loss), keep adapter complexity low, and favor simple linear or convolutional projectors unless extreme in-domain adaptation is required (Mei et al., 4 Jan 2026).

6. Knowledge Integration, Summarization, and Advanced Applications

Emerging paradigms for end-to-end Speech LLMs now incorporate retrieval-augmented generation (RAG) and advanced language understanding:

End-to-end speech-to-text retrieval: Speech-to-text embedding alignment enables direct speech-driven retrieval against text indices without ASR intermediates, yielding 4× acceleration in retrieval at minor cost in F1, and seamlessly enabling external knowledge fusion in S2S dialogue generation (Feng et al., 27 Apr 2025).
Summarization: End-to-end abstractive speech summarization is enabled by combining a frozen speech encoder, Q-Former connector, and an LLM head, with a multi-stage curriculum spanning ASR, document-level ASR, and abstraction tasks. Performance matches cascaded pipelines, especially on long-form How-2 inputs (Shang et al., 2024).
Simultaneous S2S translation with voice cloning: Integration of speech encoding, LLM-driven streaming decoding, flexible reward shaping with RL (proximal policy optimization), and FiLM-conditioned neural vocoding delivers high-fidelity, low-latency (<3 s lag) target speech in the speaker’s own voice (Cheng et al., 23 Jul 2025).
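The retrieval step in the first item can be illustrated with a cosine-similarity lookup. The sketch assumes that a speech encoder already maps a spoken query into the same embedding space as the text index; that encoder is not modeled here, and the query is simulated as a noisy copy of one index entry. The function names and sizes are hypothetical.

```python
import numpy as np

def l2_normalize(x):
    """Scale vectors along the last axis to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def retrieve(query_emb, index_embs, top_k=2):
    """Return indices and cosine scores of the top_k text entries.

    query_emb: (d,) embedding of the spoken query.
    index_embs: (N, d) embeddings of the text passages.
    """
    scores = l2_normalize(index_embs) @ l2_normalize(query_emb)
    order = np.argsort(-scores)[:top_k]
    return order, scores[order]

rng = np.random.default_rng(0)
index_embs = rng.normal(size=(4, 32))                        # toy text index
# Simulated speech query: close to passage 2 in the shared space.
query_emb = index_embs[2] + 0.05 * rng.normal(size=32)

order, scores = retrieve(query_emb, index_embs, top_k=2)
print(order, scores.round(3))
```

Because the query is embedded directly from speech, no transcript is produced before the lookup, which is where the reported speedup over ASR-based retrieval comes from.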

7. Open Challenges and Future Research Directions

Principal challenges in end-to-end Speech LLM development include:

Information loss at the speech-to-token alignment: Improvements in semantic-level adaptive pooling, prosodic/action-triggered segmentation, and discrete bottlenecking are expected to be key.
Multilinguality and OOD robustness: Scaling to low-resource languages, cross-lingual settings, and code-switching scenarios requires improved pretraining and regularization (Mei et al., 4 Jan 2026).
Latency and context windowing: For long-form and streaming applications, context-efficient architectures and compression strategies are necessary to mitigate quadratic attention scaling and memory usage (Ghazal et al., 10 Oct 2025, Saengthong et al., 26 Jun 2025).
Knowledge fusion and multi-modal generalization: Further integration with retrieval engines, unified speech–vision–language frameworks, and on-device quantized deployment pipelines offers both practical and research frontiers (Feng et al., 27 Apr 2025, Mohapatra et al., 28 Jan 2026).
Evaluation: There remains a need for new benchmarks that stress test multi-turn, multilingual, and structured output scenarios (e.g., reasoning/QA, slot-filling, S2S+emotion), together with fine-grained analysis of modality-specific failure cases (Hsu et al., 2 Mar 2026).

This body of work establishes the foundational mechanisms, trade-offs, and guiding principles for future general-purpose unified speech-language architectures at scale. The field is advancing rapidly, fueled by open benchmarks, robust ablation protocols, and modular adapter designs.