
Prompt-Based ASR with Speech LLMs

Updated 16 January 2026
  • The topic introduces prompt-based ASR mechanisms that use audio embeddings as continuous prompts to enable generative transcription by LLMs.
  • It combines frozen or lightly adapted LLM decoders with fine-tuned acoustic encoders and soft-prompt tuning, achieving significant word error rate reductions.
  • The approach supports robust multi-talker and domain-adapted recognition through hybrid decoding, contextual biasing, and advanced error correction strategies.

Prompt-based ASR with Speech LLMs refers to a class of automatic speech recognition systems that utilize LLMs and explicit prompting strategies (often via audio and/or text embeddings) to transcribe spoken input. The paradigm leverages LLMs’ generative and contextual capabilities, integrating acoustic features and domain/task-specific prompts to achieve state-of-the-art recognition rates, rapid domain adaptation, and robust error correction. These systems are architecturally flexible, supporting a wide variety of input modalities, fusion strategies, and decoding workflows, from direct speech-to-text inference to hybrid rescoring and error arbitration. Recent work establishes the empirical and theoretical foundations for prompt-based ASR, demonstrating consistent reductions in word error rate (WER), robust handling of rare words, and improved generalization, particularly in domain adaptation, multi-talker, and low-resource scenarios.

1. Principles of Prompt-Based ASR with Speech LLMs

Prompt-based ASR employs LLMs as autoregressive decoders conditioned on speech-derived prompt embeddings. The architecture typically comprises a pretrained audio encoder (e.g., Conformer-CTC) that transforms raw or feature-extracted waveforms into a sequence of high-dimensional prompt vectors, which are prepended to text embeddings and fed into a frozen or lightly adapted decoder-only LLM (often LLaMA-style) (Ma et al., 2024, Deng et al., 2024, Fathullah et al., 2023). The LLM operates on the concatenated sequence, modeling $P(\mathbf{x}\mid\mathrm{prompt})$ and generating transcriptions autoregressively. Prompt tokens may be continuous (audio embeddings), discrete (synthetic text prompts), or hybrid (e.g., CTC outputs plus audio features).
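
A minimal sketch of this concatenation step, with all dimensions, the random weights, and the linear projection layer as illustrative assumptions rather than any cited system's actual configuration:

```python
import numpy as np

# Toy dimensions (assumptions, not taken from any cited model)
T_audio, d_enc = 50, 256   # encoder frames and encoder hidden size
e = 512                    # LLM embedding dimension
n_text = 10                # text tokens decoded so far

rng = np.random.default_rng(0)

# 1. Audio encoder output: (T_audio, d_enc)
audio_feats = rng.standard_normal((T_audio, d_enc))

# 2. Project encoder frames into the LLM embedding space: (T_audio, e)
W_proj = rng.standard_normal((d_enc, e)) / np.sqrt(d_enc)
audio_prompt = audio_feats @ W_proj

# 3. Text token embeddings: (n_text, e)
text_emb = rng.standard_normal((n_text, e))

# 4. Prepend the continuous audio prompt to the text embeddings;
#    the frozen decoder-only LLM consumes this single sequence.
decoder_input = np.concatenate([audio_prompt, text_emb], axis=0)
print(decoder_input.shape)  # (60, 512)
```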

The core insight is that, by treating audio representations as “continuous prompts,” the generative abilities of LLMs can be extended to the speech domain, supporting zero-shot and few-shot adaptation, language understanding, and cross-modal error correction. Soft-prompt fine-tuning techniques further enable domain-specific text adaptation by aligning the fine-tuning environment with conditions observed during inference (Ma et al., 2024).

2. Architectures and Prompt Embedding Strategies

Architecture Components

Most recent speech-LLM systems follow a modular architecture:

| Component | Description |
| --- | --- |
| Audio Encoder | Conformer-CTC, WavLM, or Whisper-style networks mapping audio $a\in\mathbb{R}^{T\times d}$ to prompt embeddings $A(a)\in\mathbb{R}^{d_p\times e}$ |
| Prompting Layer | Prepends continuous audio embeddings, synthetic text prompts, or CTC outputs to the decoder input |
| Text Embedding | Standard learned token embedding module (dimension $e$) |
| Decoder | Autoregressive transformer (e.g., LLaMA, Vicuna, Qwen) with 8–32 layers and 500M–7B parameters |

Prompt Types

  • Continuous soft prompts: Randomly initialized vectors $S_{\zeta}\in\mathbb{R}^{d_s\times e}$ tuned for domain adaptation (Ma et al., 2024)
  • Audio-derived prompts: Embeddings from Conformer/WavLM/Whisper, downsampled and projected to LLM dimensionality (Fathullah et al., 2023)
  • Synthetic text prompts: Domain-specific snippets, historical context, bias word lists, or CTC-generated rough transcripts (Yang et al., 2024)
  • Mixture-of-experts fusion: Routing task-specific combination of multiple audio encoders according to textual prompt content (Shan et al., 21 Feb 2025)
  • Serialized output prompts: Per-speaker CTC outputs concatenated for multi-talker recognition and diarization (Shi et al., 1 Sep 2025)

Training leverages a combination of cross-entropy over text tokens and prompt-alignment objectives (e.g., MSE between speech and text embeddings for Wav2Prompt (Deng et al., 2024)), typically with all LLM parameters frozen except for lightweight adapters (LoRA).
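
As a toy illustration of this combined objective (shapes, weights, and data are all synthetic; real systems compute these losses through the LLM and audio encoder, and the papers tune a weighting between the two terms):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy shapes (illustrative only)
n_tok, vocab, e = 8, 100, 64

# Cross-entropy over text tokens from decoder logits
logits = rng.standard_normal((n_tok, vocab))
targets = rng.integers(0, vocab, size=n_tok)
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
ce_loss = -log_probs[np.arange(n_tok), targets].mean()

# Prompt-alignment term: MSE between speech-derived and text
# embeddings, in the spirit of Wav2Prompt's alignment objective
speech_emb = rng.standard_normal((n_tok, e))
text_emb = rng.standard_normal((n_tok, e))
mse_loss = ((speech_emb - text_emb) ** 2).mean()

total_loss = ce_loss + mse_loss  # interpolation weight omitted here
```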

3. Fine-Tuning and Domain Adaptation Methods

Soft Prompt Fine-Tuning

The dominant strategy for domain adaptation uses a two-step procedure:

  1. Soft-prompt tuning: Freeze all parameters and optimize a trainable prompt $S_{\zeta}$ to maximize the unconditional likelihood of domain-specific text $Q$ (without audio), yielding $S_{\zeta}^{*} = \arg\max_{S} \sum_{x\in Q} \log P_{E,D}(x\mid\mathrm{prompt}=S)$.
  2. Decoder adaptation: With $S_{\zeta}^{*}$ fixed, fine-tune the decoder and text embedding on $Q$ with the pseudo-prompt prepended, ensuring the model conditions on domain-specific structure as it would on real audio (Ma et al., 2024).

This method yields up to 8–9% relative WER/EER reductions across music and chatbot domains, with further gains via shallow fusion with external LMs (additional 2–5% EER drop). The approach explicitly mitigates “prompt mismatch,” in which adaptation without pseudo prompts limits generalization of domain knowledge into real audio inference.
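
Step 1 can be illustrated with a deliberately tiny log-linear stand-in for the frozen decoder. Here `W`, the corpus `Q`, and all dimensions are synthetic, and the gradient is computed in closed form; real soft-prompt tuning backpropagates through the actual LLM:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, e = 20, 16

W = rng.standard_normal((e, vocab)) * 0.1   # frozen "decoder" weights
Q = rng.integers(0, vocab, size=200)        # stand-in domain-text corpus
counts = np.bincount(Q, minlength=vocab)

S = np.zeros(e)                             # trainable soft prompt S_zeta
lr = 0.5
for _ in range(200):
    logits = S @ W
    p = np.exp(logits - logits.max())
    p /= p.sum()
    # gradient of sum_{x in Q} log P(x | prompt=S) with respect to S
    S += lr * (W @ (counts - len(Q) * p)) / len(Q)

logits = S @ W
log_probs = logits - np.log(np.exp(logits - logits.max()).sum()) - logits.max()
avg_ll = log_probs[Q].mean()
# avg_ll now exceeds the uniform baseline of -log(vocab)
```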

Zero/Few-Shot Prompting and Multimodal ICL

Wav2Prompt (Deng et al., 2024) introduces continuous integrate-and-fire mechanisms to align audio prompts to transcript token embeddings, enabling zero-shot transfer to speech translation, understanding, and QA. Multimodal in-context learning (MICL) (Li et al., 9 Jan 2026) uses paired audio-text demonstrations as prompts for cross-lingual ASR in low-resource settings, showing that cross-lingual fine-tuning plus hypothesis re-ranking substantially improves WER over naive speech LLM prompting.

Zero-shot prompt-based ASR is robust to architecture variations (e.g., frozen LLM, long-form audio via strided encoders (Fathullah et al., 2023); prompt-aware mixture-of-experts fusion (Shan et al., 21 Feb 2025)) and can generalize across languages, domains, and novel tasks.

4. Error Correction and Uncertainty Handling via Prompting

Prompt-augmented LLMs provide advanced post-ASR error correction mechanisms. Representative workflows include:

  • Confidence-guided error correction: Embed word-level entropy/confidence scores in the prompt to focus LLM corrections on low-confidence regions, achieving up to 47% relative WER reduction for disordered speech (Hernandez et al., 29 Sep 2025).
  • N-best and confusion-network correction: Feed LLMs n-best ASR hypotheses, optionally extended by prompt-generated candidates, and rescore with acoustic+language-weighted composite functions (e.g., ProGRes (Tur et al., 2024); fusion in device-directed detection (Dighe et al., 2023)).
  • Evolutionary prompt optimization: Use genetic algorithm-inspired LLM prompting to discover task-optimal correction templates and maximize performance on domain-specific and cross-domain error correction (Sachdev et al., 2024).
  • Multi-ASR ensemble fusion: Serialize bracketed confusion regions and let speech-LLMs arbitrate between alternatives, producing high-quality pseudo-labels for semi-supervised ASR (Prakash et al., 5 Jun 2025).
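
The confidence-guided workflow can be sketched as prompt construction alone; the function name, the `[uncertain: ...]` tag format, and the threshold are illustrative assumptions, not the cited papers' exact templates:

```python
# Hedged sketch: build a confidence-annotated correction prompt.
def build_correction_prompt(words, confidences, threshold=0.8):
    """Tag low-confidence words so the LLM focuses corrections there."""
    tagged = [
        w if c >= threshold else f"[uncertain: {w}]"
        for w, c in zip(words, confidences)
    ]
    hypothesis = " ".join(tagged)
    return (
        "Correct the ASR hypothesis below. Only revise words marked "
        "[uncertain: ...]; leave other words unchanged.\n"
        f"Hypothesis: {hypothesis}"
    )

prompt = build_correction_prompt(
    ["the", "whether", "is", "nice"], [0.99, 0.41, 0.97, 0.95]
)
print(prompt)
```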

Confidence-based filtering and explicit localization of uncertain tokens avoid harmful overcorrections and support robust error handling across spontaneous, domain-specific, and impaired-speech ASR pipelines.

5. Contextual Biasing, Rare Word Recognition, and Multi-Talker ASR

Contextual Biasing

LLM-ASR systems ingest dynamic bias lists via prompt-construction algorithms. Coarse CTC decodes are used to filter thousands of rare words down to a handful of hotwords for contextual prompts (Yang et al., 2024, He et al., 31 May 2025). Prompts are constructed as “Transcribe speech to text. Some hotwords might help. The hotwords are {...}.”, and the LLM decoder autoregressively generates the final transcript, substantially boosting recall of long-tail entities and domain terms.
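
A minimal sketch of this two-stage flow, where a coarse first-pass transcript filters the bias list before the quoted prompt is assembled; `select_hotwords`, the fuzzy-matching cutoff, and the sample bias list are illustrative assumptions:

```python
import difflib

def select_hotwords(ctc_transcript, bias_list, cutoff=0.8, max_hotwords=10):
    """Keep only bias words approximately present in the coarse decode."""
    tokens = ctc_transcript.lower().split()
    hits = []
    for word in bias_list:
        if difflib.get_close_matches(word.lower(), tokens, n=1, cutoff=cutoff):
            hits.append(word)
        if len(hits) >= max_hotwords:
            break
    return hits

def build_biasing_prompt(hotwords):
    return ("Transcribe speech to text. Some hotwords might help. "
            f"The hotwords are {{{', '.join(hotwords)}}}.")

ctc_pass = "please call docter smith at ridgemont clinic"
bias_list = ["Ridgemont", "angioplasty", "Smith"]
hotwords = select_hotwords(ctc_pass, bias_list)
print(build_biasing_prompt(hotwords))
```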

Multi-Talker and Complex Scenarios

Serialized output prompting (SOP) (Shi et al., 1 Sep 2025) interleaves per-speaker CTC outputs with special tokens (e.g., <sc> for speaker-change) as an LLM prompt, enabling separation and diarization in multi-talker mixtures. Three-stage training decouples encoder, separator, and LLM decoder adaptation, yielding significant WER reductions in both 2- and 3-talker cases versus single-encoder baselines.
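
The serialization step itself is simple to sketch; the `<sc>` token follows the description above, while the join format and the sample per-speaker outputs are assumptions:

```python
# Hedged sketch of serialized output prompting (SOP): per-speaker CTC
# outputs joined with a speaker-change token into one LLM prompt.
SC = "<sc>"

def serialize_sop(per_speaker_ctc):
    return f" {SC} ".join(per_speaker_ctc)

sop_prompt = serialize_sop([
    "good morning everyone",
    "hi thanks for joining",
    "shall we start",
])
print(sop_prompt)
# good morning everyone <sc> hi thanks for joining <sc> shall we start
```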

Contextual multi-talker frameworks (He et al., 31 May 2025) integrate rare-word biasing and SOT-style prompts, with two-stage filtering to control prompt size. This supports overlapping speech, rare entity recognition, and real-world meeting ASR, outperforming prior contextual bias systems at scale.

6. Decoding, Fusion, and Inference Workflows

Prompt-based speech LLMs are highly flexible at inference:

  • Direct autoregressive decoding: Prepend audio prompt embeddings and decode in a standard LLM autoregressive fashion.
  • Hybrid and non-autoregressive schemes: Switch between AR and NAR decoding (with hard limit on repetitions) anchored by CTC-generated text prompts for error-free, fast transcription (Li et al., 2024).
  • Iterative fusion: Decompose the MAP objective into separate acoustic and LLM probability terms; optimize jointly via beam search, prompting the LLM with partial transcripts, and using Viterbi or acoustic alignment for each candidate extension (Cohen et al., 4 Aug 2025).
  • Shallow fusion: At each beam-search step, rescore candidate next tokens with external LSTM/LLaMA/Transformer LMs, interpolating probabilities (Ma et al., 2024, Tur et al., 2024).
  • Re-ranking and hypothesis selection: Use MICL or post-ASR LLM scoring to select the best candidate from strong acoustic model decodes, especially in low-resource or unseen language scenarios (Li et al., 9 Jan 2026).
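
Shallow fusion, the simplest of these schemes, can be sketched for a single beam-search step; the interpolation weight and both score vectors are synthetic stand-ins for the speech-LLM and external-LM distributions:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = 50

def log_softmax(x):
    return x - np.log(np.exp(x - x.max()).sum()) - x.max()

# Scores over the next-token vocabulary at one decoding step
am_logp = log_softmax(rng.standard_normal(vocab))   # speech-LLM scores
lm_logp = log_softmax(rng.standard_normal(vocab))   # external LM scores
lam = 0.3                                           # fusion weight

# Interpolate in log space and pick the best continuation
fused = am_logp + lam * lm_logp
next_token = int(np.argmax(fused))
```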

Downstream tasks—domain adaptation, translation, understanding, slot filling, tokenization, error correction—are realized in the same continuous-prompt paradigm, with minimal retraining.

7. Empirical Performance, Limitations, Future Directions

Prompt-based speech LLMs consistently yield competitive and often superior ASR performance:

| Scenario | Baseline WER | Prompt-based WER | Relative Reduction |
| --- | --- | --- | --- |
| Domain adaptation (music/chatbot) | 19.34 | 18.27 | –8% |
| Error correction (CHiME-4) | 7.49 | 4.88 | –35% |
| Contextual ASR (LibriSpeech-clean) | 1.96 | 1.27 | –40% |
| SOP multi-talker (Libri3Mix) | 39.1 | 28.5 | –27% |
| Ensemble pseudo-labeling | 14.36 | 9.30 | –35% |

Fusion with domain-specific LMs, confidence-guided prompting, and modular acoustic models and LLMs drive gains even in complex, multilingual, and rare-word-laden conditions. Limitations include prompt length tuning, over-correction in ill-posed regions, reliance on strong acoustic front-ends, and, for zero-shot models, poor direct ASR in unseen languages without cross-lingual adaptation (Li et al., 9 Jan 2026).

Future research aims to extend prompt architectures (prompt length optimization, per-utterance dynamics), deeper end-to-end adaptation (joint training, multimodal error signaling), and generalization to streaming, multilingual, and highly overlapping real-world scenarios.


Prompt-based ASR with speech LLMs represents the convergence of acoustic modeling, generative language modeling, and task-adaptive prompting in a rigorous probabilistic framework. The paradigm leverages alignment by continuous or text-derived prompts, two-step fine-tuning, error correction via uncertainty localization, and robust fusion mechanisms, setting new standards for adaptability, domain coverage, rare-word recall, and multi-modal integration in speech recognition.
