- The paper introduces CoT-ASR, a two-stage model that decouples contextual reasoning from transcription to improve ASR accuracy.
- It employs a CTC-guided modality adapter to bridge continuous speech representations and discrete LLM token embeddings, achieving significant reductions in word error rate (WER) and entity error rate (EER).
- Empirical results validate CoT-ASR's superiority over state-of-the-art systems, demonstrating enhanced instruction-following and robust domain adaptation.
Chain-of-Thought ASR: Contextual Reasoning for Speech LLMs
Introduction
The paper “Speech LLMs are Contextual Reasoning Transcribers” (2604.00610) introduces CoT-ASR, a novel automatic speech recognition paradigm that explicitly integrates chain-of-thought (CoT) reasoning into the ASR pipeline. Existing approaches commonly prompt LLMs with speech encoder outputs to generate the transcription directly, which often constrains the LLM’s reasoning capacity because the speech-to-text task is content-preserving. CoT-ASR decouples contextual analysis from transcription, enabling the LLM to apply its semantic understanding and internal knowledge before producing the transcription.
CoT-ASR Model Paradigm
CoT-ASR decomposes the speech recognition task into two consecutive but tightly integrated stages: (1) contextual analysis through explicit reasoning, and (2) transcription. The system operates as a one-pass auto-regressive pipeline, adhering to standard LLM next-token generation. Given an audio input and a fixed template prompt, CoT-ASR generates a segment marked by <CONTEXT>, containing a high-level analysis of the speech, followed by a <TRANSCRIPT> segment yielding the final transcription.
Figure 1: The CoT-ASR pipeline with a fixed ASR prompt generating both contextual analysis (red) and transcription (green) sequentially in a single auto-regressive generation.
This explicit separation prompts the LLM to first resolve ambiguities, disambiguate domain-specific terms, and leverage world knowledge, all prior to token emission for transcription. Importantly, CoT-ASR can also incorporate user-provided context, fully exploiting the in-context learning and instruction-following capacities of LLMs.
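The single-pass, two-segment output format can be illustrated with a short sketch. The <CONTEXT> and <TRANSCRIPT> tags follow the paper; the parsing helper and the example output string below are hypothetical:

```python
def parse_cot_asr_output(generated: str) -> tuple[str, str]:
    """Split a single-pass CoT-ASR generation into its two segments.

    The model emits a contextual analysis after <CONTEXT> and the final
    transcription after <TRANSCRIPT>, all in one auto-regressive pass.
    """
    context_part, _, transcript_part = generated.partition("<TRANSCRIPT>")
    context = context_part.replace("<CONTEXT>", "").strip()
    transcript = transcript_part.strip()
    return context, transcript


# Hypothetical model output: the analysis resolves a domain-specific
# ambiguity before any transcription tokens are emitted.
output = (
    "<CONTEXT> The speaker discusses a cardiology follow-up; "
    "'stents' is the likely domain term, not 'stands'. "
    "<TRANSCRIPT> The patient received two stents last March."
)
context, transcript = parse_cot_asr_output(output)
```

Because both segments come from one auto-regressive generation, the transcription tokens are conditioned on the analysis tokens that precede them.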
CTC-Guided Modality Adapter
To efficiently bridge the substantial modality gap between continuous speech encoder representations and discrete LLM token embeddings, CoT-ASR introduces a CTC-guided modality adapter. This adapter employs frame-level CTC posterior distributions to compute a weighted combination of pre-trained LLM embeddings, explicitly differentiating between blank and non-blank tokens to maximize information preservation.
Figure 2: Architecture of the CTC-guided Modality Adapter, projecting encoder outputs to the CTC vocabulary and the LLM hidden space, and combining them via weighted summation and a residual gated branch.
This design directly leverages the LLM’s token embedding space and weights, guided by non-blank speech evidence, while gated residual connections preserve frame-level structure. Unlike conventional linear adapters or previously proposed CTC compressors, the CTC-guided adapter avoids prompt compression, preventing loss of temporal information critical to ASR performance.
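The weighted-combination idea can be sketched in NumPy under simplifying assumptions: random stand-ins replace the trained CTC posteriors, LLM embedding table, and projected encoder outputs, and the gate is taken to be each frame's non-blank probability mass (in the paper the gated branch is learned):

```python
import numpy as np

rng = np.random.default_rng(0)

T, V, D = 6, 10, 8          # frames, CTC vocab size (index 0 = blank), LLM dim
posteriors = rng.dirichlet(np.ones(V), size=T)    # frame-level CTC posteriors
llm_embeddings = rng.normal(size=(V, D))          # pretrained LLM token embeddings
encoder_proj = rng.normal(size=(T, D))            # encoder outputs projected to LLM dim

# Weighted combination over NON-blank tokens only: renormalize the
# non-blank posterior mass so blank-dominated frames do not inject
# spurious token content into the semantic branch.
nonblank = posteriors[:, 1:]                               # drop blank column
weights = nonblank / nonblank.sum(axis=1, keepdims=True)   # (T, V-1), rows sum to 1
semantic = weights @ llm_embeddings[1:]                    # (T, D)

# Gated residual branch: scale the semantic vector by the frame's
# non-blank probability (a simplifying stand-in for a learned gate)
# and add the projected encoder frame to keep frame-level structure.
gate = nonblank.sum(axis=1, keepdims=True)                 # (T, 1)
adapted = gate * semantic + encoder_proj                   # (T, D)

# One output vector per input frame: no prompt compression, so
# temporal resolution is preserved for the LLM.
print(adapted.shape)
```

Note the output keeps length T: unlike CTC compressors that collapse frames, every encoder frame maps to one adapted vector in the LLM embedding space.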
Empirical Analysis
The experimental evaluation is extensive, spanning LibriSpeech, FLEURS, and multiple domain-specific in-house benchmarks with up to 38,000 hours of English speech. Phi4-mini-instruct (3.8B parameters) serves as the LLM backbone, and all models use a 24-layer Conformer encoder.
Remarkably, CoT-ASR outperforms leading open-source systems (e.g., Whisper-large-v3, Qwen2.5-Omni-7B, Gemma 3n, Voxtral) even when trained on less speech data, indicating that reasoning-augmented generation yields better entity retention than brute-force scaling of model or data.
Implications and Future Directions
The paradigm shift from direct speech-to-text mapping to reasoning-augmented ASR repositions the LLM as more than a denoising transducer. By explicitly constructing a chain of thought, LLMs in ASR can operationalize general world knowledge, domain-specific context, and user intent. Practically, this leads to measurable improvements in the recognition of critical entities, a core requirement for real-world use cases in healthcare, finance, and other verticals.
Architecturally, the CTC-guided adapter provides a template for future modality adapters that explicitly preserve temporal structure and maximize compatibility with pretrained LLM weights.
Future developments will likely enrich the reasoning phase via richer instruction-tuning, introduce meta-cognitive error detection during the analysis phase, or yield hybrid architectures that dynamically determine the need for, and depth of, reasoning based on input complexity. Extensions to low-resource languages, zero-shot speech-task generalization, and unified speech-based reasoning agents are natural next research directions.
Conclusion
CoT-ASR establishes a robust paradigm for LLM-based ASR by introducing chain-of-thought reasoning, decoupling contextual semantic analysis from transcription within a unified sequence-generation framework. The empirical results demonstrate significant reductions in word error rate (WER) and entity error rate (EER), outperforming both conventional systems and more data-heavy, parameter-scaled open-source models. The explicit contextual reasoning mechanism, in tandem with the CTC-guided modality adapter, sets a new technical bar for high-integrity, knowledge-aware ASR and signals promising avenues for general speech understanding systems.