- The paper introduces CoT-ASR, a two-stage model that decouples contextual reasoning from transcription to improve ASR accuracy.
- It employs a CTC-guided modality adapter to bridge continuous speech representations and discrete LLM token embeddings, achieving significant reductions in word error rate (WER) and entity error rate (EER).
- Empirical results validate CoT-ASR's superiority over state-of-the-art systems, demonstrating enhanced instruction-following and robust domain adaptation.
Chain-of-Thought ASR: Contextual Reasoning for Speech LLMs
Introduction
The paper “Speech LLMs are Contextual Reasoning Transcribers” (2604.00610) introduces CoT-ASR, a novel automatic speech recognition paradigm that explicitly integrates chain-of-thought (CoT) reasoning into the ASR pipeline. Existing approaches commonly prompt LLMs with speech encoder outputs to generate the transcription directly, which often constrains the LLM’s reasoning capacity because the speech-to-text task is content-preserving. CoT-ASR decouples contextual analysis from transcription, enabling the LLM to apply its semantic understanding and internal knowledge before producing the transcription.
CoT-ASR Model Paradigm
CoT-ASR decomposes the speech recognition task into two consecutive but tightly integrated stages: (1) contextual analysis through explicit reasoning, and (2) transcription. The system operates as a one-pass auto-regressive pipeline, adhering to standard LLM next-token generation. Given an audio input and a fixed template prompt, CoT-ASR generates a segment marked by <CONTEXT>, containing a high-level analysis of the speech, followed by a <TRANSCRIPT> segment yielding the final transcription.
Figure 1: The CoT-ASR pipeline with a fixed ASR prompt generating both contextual analysis (red) and transcription (green) sequentially in a single auto-regressive generation.
This explicit separation prompts the LLM to first resolve ambiguities, disambiguate domain-specific terms, and leverage world knowledge, all prior to token emission for transcription. Importantly, CoT-ASR can also incorporate user-provided context, fully exploiting the in-context learning and instruction-following capacities of LLMs.
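The single-pass, two-segment output format can be illustrated with a short sketch. The <CONTEXT> and <TRANSCRIPT> tags follow the paper; the parsing helper and the example output string below are hypothetical:

```python
def parse_cot_asr_output(generated: str) -> tuple[str, str]:
    """Split a single-pass CoT-ASR generation into its two segments.

    The model emits a contextual analysis after <CONTEXT> and the final
    transcription after <TRANSCRIPT>, all in one auto-regressive pass.
    """
    context_part, _, transcript_part = generated.partition("<TRANSCRIPT>")
    context = context_part.replace("<CONTEXT>", "").strip()
    transcript = transcript_part.strip()
    return context, transcript


# Hypothetical model output: the analysis resolves a domain-specific
# ambiguity before any transcription tokens are emitted.
output = (
    "<CONTEXT> The speaker discusses a cardiology follow-up; "
    "'stents' is the likely domain term, not 'stands'. "
    "<TRANSCRIPT> The patient received two stents last March."
)
context, transcript = parse_cot_asr_output(output)
```

Because both segments come from one auto-regressive generation, the transcription tokens are conditioned on the analysis tokens that precede them.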
CTC-Guided Modality Adapter
To efficiently bridge the substantial modality gap between continuous speech encoder representations and discrete LLM token embeddings, CoT-ASR introduces a CTC-guided modality adapter. This adapter employs frame-level CTC posterior distributions to compute a weighted combination of pre-trained LLM embeddings, explicitly differentiating between blank and non-blank tokens to maximize information preservation.
Figure 2: Architecture of the CTC-guided Modality Adapter, projecting encoder outputs to the CTC vocabulary and the LLM hidden space, and combining them via weighted summation and a residual gated branch.
This design directly leverages the LLM’s token embedding space and weights, guided by non-blank speech evidence, while gated residual connections preserve frame-level structure. Unlike conventional linear adapters or previously proposed CTC compressors, the CTC-guided adapter avoids prompt compression, preventing loss of temporal information critical to ASR performance.
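The weighted-combination idea can be sketched in NumPy under simplifying assumptions: random stand-ins replace the trained CTC posteriors, LLM embedding table, and projected encoder outputs, and the gate is taken to be each frame's non-blank probability mass (in the paper the gated branch is learned):

```python
import numpy as np

rng = np.random.default_rng(0)

T, V, D = 6, 10, 8          # frames, CTC vocab size (index 0 = blank), LLM dim
posteriors = rng.dirichlet(np.ones(V), size=T)    # frame-level CTC posteriors
llm_embeddings = rng.normal(size=(V, D))          # pretrained LLM token embeddings
encoder_proj = rng.normal(size=(T, D))            # encoder outputs projected to LLM dim

# Weighted combination over NON-blank tokens only: renormalize the
# non-blank posterior mass so blank-dominated frames do not inject
# spurious token content into the semantic branch.
nonblank = posteriors[:, 1:]                               # drop blank column
weights = nonblank / nonblank.sum(axis=1, keepdims=True)   # (T, V-1), rows sum to 1
semantic = weights @ llm_embeddings[1:]                    # (T, D)

# Gated residual branch: scale the semantic vector by the frame's
# non-blank probability (a simplifying stand-in for a learned gate)
# and add the projected encoder frame to keep frame-level structure.
gate = nonblank.sum(axis=1, keepdims=True)                 # (T, 1)
adapted = gate * semantic + encoder_proj                   # (T, D)

# One output vector per input frame: no prompt compression, so
# temporal resolution is preserved for the LLM.
print(adapted.shape)
```

Note the output keeps length T: unlike CTC compressors that collapse frames, every encoder frame maps to one adapted vector in the LLM embedding space.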
Empirical Analysis
The experimental evaluation is extensive, spanning LibriSpeech, FLEURS, and multiple domain-specific in-house benchmarks with up to 38,000 hours of English speech. Phi4-mini-instruct (3.8B parameters) serves as the LLM backbone, and all models use a 24-layer Conformer encoder.
Remarkably, CoT-ASR outperforms leading open-source systems (e.g., Whisper-large-v3, Qwen2.5-Omni-7B, Gemma 3n, Voxtral) even when trained on less speech data, indicating that reasoning-augmented generation yields better entity retention than brute-force scaling of model or data.
Implications and Future Directions
The paradigm shift from direct speech-to-text mapping to reasoning-augmented ASR repositions the LLM as more than a denoising transducer. By explicitly constructing a chain of thought, LLMs in ASR can operationalize general world knowledge, domain-specific context, and user intent. Practically, this leads to measurable improvements in the recognition of critical entities, a core requirement for real-world use cases in healthcare, finance, and other verticals.
Architecturally, the CTC-guided adapter provides a template for future modality adapters that explicitly preserve temporal structure and maximize compatibility with pretrained LLM weights.
Future developments will likely enrich the reasoning phase via richer instruction-tuning, introduce meta-cognitive error detection during the analysis phase, or yield hybrid architectures that dynamically determine the need for, and depth of, reasoning based on input complexity. Extensions to low-resource languages, zero-shot speech-task generalization, and unified speech-based reasoning agents are natural next research directions.
Conclusion
CoT-ASR establishes a robust paradigm for LLM-based ASR by introducing chain-of-thought reasoning, decoupling contextual semantic analysis from transcription within a unified sequence-generation framework. The empirical results demonstrate significant reductions in word error rate (WER) and entity error rate (EER), outperforming both conventional systems and more data-heavy, parameter-scaled open-source models. The explicit contextual reasoning mechanism, in tandem with the CTC-guided modality adapter, sets a new technical bar for high-integrity, knowledge-aware ASR and signals promising avenues for general speech understanding systems.