LLM Prompting with Speech Recognition

Updated 29 November 2025
  • LLM-based ASR is a technique that integrates speech encodings with text tokens, enabling end-to-end transcription via transformer architectures.
  • It employs advanced prompt engineering and modular fusion adapters to enhance rare-word recall and multi-modal transcription in diverse audio scenarios.
  • The approach supports streaming, long-form, and multi-talker inputs, achieving competitive WERs and efficient inference through chunking and LoRA adapters.

LLMs can be equipped with automatic speech recognition (ASR) abilities by prompting them with audio encodings. This paradigm leverages pretrained transformer models, typically with decoder-only or encoder–decoder architectures, and integrates audio representations—often from dedicated neural encoders—directly into the LLM’s input sequence or context. Sophisticated prompt engineering, modular fusion adapters, and scalable attention mechanisms now allow these systems to operate on streaming, long-form, and even multi-talker or audio-visual inputs while maintaining high accuracy and computational efficiency.

1. Integration of Audio Encodings into LLMs

A dominant architecture for LLM-based speech recognition attaches a trained audio encoder that converts speech features (e.g., log-Mel filterbanks) into dense vector sequences. These audio encodings are prepended or concatenated with text token embeddings and passed to a frozen or LoRA-adapted transformer LLM. For instance, in "Prompting LLMs with Speech Recognition Abilities," an 18-layer Conformer converts speech into high-dimensional vectors, which are linearly projected to match the LLM hidden size and prepended to the textual input. This allows end-to-end ASR via next-token autoregressive generation in the text domain, often with only minor tuning of LLM attention weights via LoRA adapters, yielding competitive WER on multilingual benchmarks even when the LLM remains largely frozen (Fathullah et al., 2023).
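
A minimal sketch of this fusion, assuming a Hugging Face-style decoder-only LLM that accepts `inputs_embeds` (dimensions, module names, and the class itself are illustrative, not any paper's exact implementation):

```python
import torch
import torch.nn as nn

class AudioPrefixASR(nn.Module):
    """Sketch: a projected audio prefix prepended to the text embeddings of a decoder-only LLM."""

    def __init__(self, audio_encoder, llm, enc_dim=512, llm_dim=4096):
        super().__init__()
        self.audio_encoder = audio_encoder        # e.g. a Conformer; frozen or fine-tuned
        self.llm = llm                            # frozen or LoRA-adapted decoder-only LLM
        self.proj = nn.Linear(enc_dim, llm_dim)   # align encoder outputs with the LLM hidden size

    def forward(self, filterbanks, text_token_ids):
        audio_emb = self.proj(self.audio_encoder(filterbanks))       # (B, T_audio, llm_dim)
        text_emb = self.llm.get_input_embeddings()(text_token_ids)   # (B, T_text, llm_dim)
        inputs = torch.cat([audio_emb, text_emb], dim=1)             # audio prefix + text prompt
        # Next-token cross-entropy is computed only over the text (transcription) positions.
        return self.llm(inputs_embeds=inputs).logits
```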

Advanced architectures use time-reduced streaming encoders (e.g., Emformer with 240 ms look-ahead), integrated chunking strategies, and dedicated adapters for modality alignment. Input audio is partitioned into contiguous chunks; each chunk’s encodings serve as prompts for chunk-level autoregressive transcription. Systems such as SpeechLLM-XL employ this methodology to enable linear scaling in inference cost and perfect length extrapolation without accuracy degradation, by restricting the decoder context to a fixed window size (Jia et al., 2 Oct 2024).

2. Prompt Engineering, Contextualization, and Task Definition

Prompt design for speech recognition in LLMs encompasses both the formatting of audio-text fusion and the explicit injection of prior or contextual knowledge. Contextualization techniques, such as keyword injection in prompts, significantly boost rare-word and homonym recall. In practical systems, a typical prompt may combine audio embeddings, a language tag, a keyword list, and an output placeholder (e.g., "<bos> [AUDIO_EMBEDDINGS] Language: en ; Keywords: ... ; Transcription: ______ <eos>") (Nozawa et al., 15 Aug 2024). Randomizing keyword order mitigates positional bias and mixing keyword/no-keyword data prevents unwanted deletions in non-keyword contexts.
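
A small sketch of how such a keyword-contextualized prompt could be assembled; the layout mirrors the template above, while `[AUDIO_EMBEDDINGS]` is a placeholder that is replaced by the continuous audio prefix at the embedding level:

```python
import random

def build_keyword_prompt(language: str, keywords: list[str]) -> str:
    """Sketch of a keyword-contextualized ASR prompt (exact tokens are illustrative)."""
    keywords = keywords[:]
    random.shuffle(keywords)                      # randomize order to reduce positional bias
    kw = ", ".join(keywords) if keywords else "none"
    # The placeholder is swapped for continuous audio embeddings before the LLM call.
    return f"<bos> [AUDIO_EMBEDDINGS] Language: {language} ; Keywords: {kw} ; Transcription:"

# build_keyword_prompt("en", ["Emformer", "LoRA"])
```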

More generally, additional static context (e.g., video titles or descriptions) can be prepended as text tokens alongside the audio encoding, encouraging the LLM to exploit unstructured contextual information during decoding. This approach yields measurable improvements in both overall and rare-word WER (Lakomkin et al., 2023).

For multi-talker and directional ASR, prompt engineering incorporates structured tags denoting speaker or spatial direction, e.g., "<30°>: <AUDIO_EMBED>", enabling both recognition and localization tasks within a single LLM framework. Serialized Output Prompting (SOP) introduces CTC-extracted, per-speaker preliminary transcripts as discrete prefix prompts, allowing refined modeling of overlapping utterances (Xie et al., 17 Jun 2025, Shi et al., 1 Sep 2025).
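
Two toy helpers showing how such structured prompts could be serialized; the tag formats follow the examples above, but the exact strings are assumptions rather than the papers' verbatim templates:

```python
def directional_prompt(azimuths: list[int]) -> str:
    """Direction tags preceding per-direction audio embedding placeholders."""
    return " ".join(f"<{a}°>: [AUDIO_EMBED_{i}]" for i, a in enumerate(azimuths))

def sop_prompt(ctc_hypotheses: list[str]) -> str:
    """Serialized Output Prompting: CTC first-pass transcripts as per-speaker prefixes."""
    prefix = " ".join(f"<spk{i}> {hyp}" for i, hyp in enumerate(ctc_hypotheses, 1))
    return f"{prefix} Refined transcription:"

# directional_prompt([0, 30, 60]) -> "<0°>: [AUDIO_EMBED_0] <30°>: [AUDIO_EMBED_1] <60°>: [AUDIO_EMBED_2]"
```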

3. Scalability, Streaming, and Efficient Inference

A key challenge for LLM-based speech recognition is handling long or streaming audio without prohibitive compute or memory usage. Conventional full-context transformers scale quadratically in input length. SpeechLLM-XL implements a chunking strategy with restricted self-attention, maintaining a window of (b+1) chunks, so total inference cost is linear in audio frames O(N) rather than quadratic. During both training and inference, each audio chunk is matched via CTC forced alignment to its transcript, and the LLM generates tokens for that chunk until EOS, then advances. This yields state-of-the-art streaming accuracy (2.7 % / 6.7 % WER, LibriSpeech test-clean/other), perfect extrapolation to sequences 10× longer than in training, and competitive latency compared to CTC or Transducer baselines (Jia et al., 2 Oct 2024).
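
The sketch below illustrates chunk-level decoding with a context bounded to (b+1) chunks, assuming a Hugging Face-style LLM interface and a greedy decoding loop; it is a simplified illustration of the idea, not SpeechLLM-XL's actual implementation:

```python
import torch

@torch.no_grad()
def chunked_streaming_decode(audio_encoder, llm, tokenizer, audio_chunks,
                             b=3, eos_id=2, max_new_tokens=64):
    """Greedy chunk-level decoding with context limited to the last (b+1) chunks,
    so per-chunk cost is bounded and total cost grows linearly with audio length."""
    history, transcript = [], []                      # (audio_emb, text_ids) per past chunk
    embed = llm.get_input_embeddings()
    for chunk in audio_chunks:
        audio_emb = audio_encoder(chunk)              # (1, T_chunk, llm_dim), already projected
        context = history[-b:] + [(audio_emb, None)]  # drop everything older than b chunks
        pieces = []
        for emb, ids in context:
            pieces.append(emb)
            if ids is not None and ids.numel() > 0:
                pieces.append(embed(ids))
        inputs = torch.cat(pieces, dim=1)
        new_ids = []
        for _ in range(max_new_tokens):               # emit this chunk's tokens until EOS
            tok = llm(inputs_embeds=inputs).logits[:, -1].argmax(-1, keepdim=True)
            if tok.item() == eos_id:
                break
            new_ids.append(tok)
            inputs = torch.cat([inputs, embed(tok)], dim=1)
        ids = torch.cat(new_ids, dim=1) if new_ids else torch.zeros(1, 0, dtype=torch.long)
        history.append((audio_emb, ids))
        transcript.append(tokenizer.decode(ids[0]))
    return " ".join(t for t in transcript if t)
```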

Parameter efficiency is further enhanced by restricting adaptation to small projectors or LoRA modules (typically <30 M params), while keeping large backbone models and encoders frozen. Downsampling of audio and video streams (by compressing feature sequences temporally) allows controllable trade-off between compute and modeling fidelity in multimodal scenarios (Cappellazzo et al., 18 Sep 2024).
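
A sketch of the parameter-efficiency side, assuming the Hugging Face peft library and Llama-style attention module names (`q_proj`, `v_proj`), together with a simple frame-stacking downsampler for the temporal compression mentioned above:

```python
from peft import LoraConfig, get_peft_model

def add_lora(llm, r=8, alpha=16):
    """Attach small LoRA adapters to the attention projections; the backbone stays frozen."""
    cfg = LoraConfig(r=r, lora_alpha=alpha, target_modules=["q_proj", "v_proj"])
    llm = get_peft_model(llm, cfg)
    llm.print_trainable_parameters()   # typically a small fraction of the backbone
    return llm

def downsample(feats, k=4):
    """Temporal compression: stack k consecutive frames, trading sequence length for width."""
    B, T, D = feats.shape
    T = (T // k) * k
    return feats[:, :T].reshape(B, T // k, k * D)
```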

4. Multi-Modality, Directionality, and Multi-Talker Extensions

Recent research has expanded LLM-ASR to support not only speech, but also audio-visual and multi-talker recognition. For AVSR (audio-visual speech recognition), modality-specific encoders process both acoustic and lip-ROI video streams to produce feature tokens. These are concatenated and projected to the LLM hidden space, then used as joint prompts for transcription. This approach achieves new state-of-the-art results on LRS3, e.g., WER 0.77 % for AVSR, using only projectors and LoRA adapters as trainable parameters (Cappellazzo et al., 18 Sep 2024).
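
A compact sketch of this multimodal prompt construction, with separate projectors per modality (dimensions are placeholders, not the published configuration):

```python
import torch
import torch.nn as nn

class AVPromptProjector(nn.Module):
    """Sketch: project audio and lip-ROI video features into the LLM space and
    concatenate them as a joint prompt."""

    def __init__(self, a_dim=512, v_dim=512, llm_dim=4096):
        super().__init__()
        self.a_proj = nn.Linear(a_dim, llm_dim)
        self.v_proj = nn.Linear(v_dim, llm_dim)

    def forward(self, audio_feats, video_feats, text_emb):
        # The joint sequence is fed to the LLM via inputs_embeds.
        return torch.cat([self.a_proj(audio_feats),
                          self.v_proj(video_feats),
                          text_emb], dim=1)
```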

Directionality is addressed by serializing spatial tags in the prompt sequence and employing directional beamformers at the front end. Systems such as directional-SpeechLlama extract independent audio streams for K pre-set azimuths, prompting the LLM with continuous embeddings and direction tags, then fine-tune using both cross-entropy and contrastive losses to enforce localization and target extraction. These models achieve sub-4 % WER and 98 % source localization success rate in multi-talker conditions (Xie et al., 17 Jun 2025).

For heavily overlapping multi-talker mixtures, SOP-MT-ASR uses CTC decoders to produce preliminary per-speaker transcriptions and embeds them as explicit prefix prompts for the LLM. A three-stage curriculum (pretraining on serialized outputs, jointly training the separator, CTC branch, and LLM, then adapting the LoRA parameters) drives WER far below previous baselines, demonstrating the critical role of structured prompting (Shi et al., 1 Sep 2025).

5. Exploiting ASR Uncertainty, Second-Pass Correction, and Biasing

LLMs can also enhance post-processing of ASR outputs via N-best list rescoring, error correction, and contextual biasing. Prompting with N-best hypotheses, optionally with decoding costs, improves spoken language understanding tasks such as device-directed detection and keyword spotting. Fine-tuning LoRA adapters on structured prompt templates enables the model to learn to exploit uncertainty inherent in ASR outputs, outperforming traditional selection based on 1-best alone (Dighe et al., 2023).
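
One plausible way to format such an N-best prompt (the layout and field names are assumptions, not the paper's exact template):

```python
def nbest_prompt(hypotheses, costs=None, question="Is this query device-directed?"):
    """Sketch of an N-best prompt exposing ASR uncertainty to the LLM."""
    lines = []
    for i, hyp in enumerate(hypotheses, 1):
        cost = f" (cost={costs[i - 1]:.2f})" if costs is not None else ""
        lines.append(f"{i}. {hyp}{cost}")
    return "ASR N-best hypotheses:\n" + "\n".join(lines) + f"\n{question}\nAnswer:"
```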

Post-hoc correction pipelines apply confidence-based filtering to decide when LLM intervention is warranted. By carefully constructing system prompts with few-shot correction examples and JSON-formatted payloads (e.g., highlighting low-confidence words), models such as GPT-4 can achieve substantial WER reductions for weaker ASR engines, while avoiding unnecessary rewrites when confidence is high (Naderi et al., 31 Jul 2024).
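
A sketch of how such a confidence-gated JSON payload could be built; the schema and threshold are illustrative, not the paper's exact format:

```python
import json

def correction_payload(words, confidences, threshold=0.6):
    """Flag low-confidence words; skip the LLM call entirely if nothing is flagged."""
    flagged = [{"index": i, "word": w, "confidence": round(c, 2)}
               for i, (w, c) in enumerate(zip(words, confidences)) if c < threshold]
    if not flagged:
        return None                                 # confidence is high: no rewrite needed
    return json.dumps({"transcript": " ".join(words), "low_confidence": flagged})
```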

Biasing lists and multi-task prompts are effective in improving recognition of named entities. Structured prompts interleave biasing lists of entities with partial hypotheses, sometimes augmented by few-shot demonstrations or entity-class predictions. Dynamic prompting per token (using class-tag heads) enables the model to restrict context to likely entity types, maintaining efficiency and maximizing relative WER gain (Sun et al., 2023).
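
A toy example of a biasing prompt restricted by a predicted entity class (format, field names, and the example entities are hypothetical):

```python
def biasing_prompt(partial_hypothesis, entities, predicted_class=None):
    """Interleave a (class-filtered) biasing list with the partial hypothesis."""
    if predicted_class is not None:                 # a class-tag head narrows the list
        entities = [(n, c) for n, c in entities if c == predicted_class]
    names = ", ".join(n for n, _ in entities)
    return f"Relevant entities: {names}\nPartial transcript: {partial_hypothesis}\nContinue:"

# biasing_prompt("call doctor", [("Dr. Okafor", "PERSON"), ("Lisbon", "CITY")], "PERSON")
```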

6. Training Protocols, Losses, and Adaptation Strategies

Most systems optimize the autoregressive next-token cross-entropy over the transcription sequence, sometimes supplemented by auxiliary losses—CTC alignment on the encoder, mean squared error for embedding manifold alignment (as in Wav2Prompt (Deng et al., 1 Jun 2024)), or prompt-expert classification in mixture-of-experts fusion. Joint training of pre-fusion adapters, fusion routers, and LoRA modules allows scaling to multiple audio encoders, trading off semantic and acoustic features according to task-specific prompt content. Practical recipes for zero-shot, few-shot, and domain adaptation have emerged, with many systems maintaining the emergent abilities of the base LLM by restricting adaptation to small, parameter-efficient modules (Shan et al., 21 Feb 2025).
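
A schematic composite objective along these lines, assuming PyTorch-shaped tensors (the loss weights and shapes are illustrative):

```python
import torch.nn.functional as F

def composite_loss(llm_logits, target_ids,                    # (B, T, V), (B, T)
                   ctc_log_probs=None, ctc_targets=None,      # (T_enc, B, C), (B, S)
                   input_lens=None, target_lens=None,
                   speech_emb=None, text_emb=None,            # for Wav2Prompt-style alignment
                   lam_ctc=0.3, lam_mse=1.0):
    """Next-token cross-entropy plus optional auxiliary CTC and embedding-MSE terms."""
    loss = F.cross_entropy(llm_logits.transpose(1, 2), target_ids, ignore_index=-100)
    if ctc_log_probs is not None:
        loss = loss + lam_ctc * F.ctc_loss(ctc_log_probs, ctc_targets, input_lens, target_lens)
    if speech_emb is not None:
        loss = loss + lam_mse * F.mse_loss(speech_emb, text_emb)
    return loss
```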

Ablation studies confirm the contribution of each component: chunk size, attention window, prompt design, fusion weights, and entity-classification heads all impact final accuracy. Systems generally favor modularity (frozen large encoders, small adapters), and prompt-aware task routing in mixture-of-experts fusion consistently outperforms naive concatenation or averaging of features (Shan et al., 21 Feb 2025).

7. Future Directions and Practical Recommendations

Emerging best practices for prompting LLMs with speech recognition abilities include hybrid chunked attention to achieve O(N) scaling, careful prompt engineering for context and rare token recall, structured tagging for directionality/localization, and modular multi-encoder fusion for robust multitask performance. Ongoing developments such as SOP for multi-talker settings, Wav2Prompt for explicit speech-text alignment and zero-shot instruction generalization, chain-of-thought prompting for speech translation, and advanced post-hoc correction strategies are pushing the frontiers of LLM-based ASR (Hu et al., 17 Sep 2024, Deng et al., 1 Jun 2024, Shi et al., 1 Sep 2025).

In sum, LLM prompt design and audio integration methods have advanced from simple fusion with text to deeply structured, scalable, and multimodal solutions. These frameworks allow LLMs to serve as universal engines for streaming ASR, audio-visual transcription, multi-talker separation, and downstream spoken-language understanding tasks with strong empirical gains and high parameter efficiency.
