
Code-Aware ASR Output Refinement

Updated 24 January 2026
  • The paper introduces a modular ASR refinement pipeline that leverages LLMs to restore code-specific accuracy and reduce error rates.
  • It employs zero-shot prompt engineering to correct phonetic distortions and reinsert lost code symbols from ASR-generated transcripts.
  • Downstream evaluations demonstrate significant improvements in code retrieval and QA tasks across multilingual, code-mixed scenarios.

Code-aware ASR (Automatic Speech Recognition) output refinement refers to the process of post-processing ASR-generated transcripts of spoken programming queries to restore code-specific accuracy—particularly identifiers, symbols, and syntactic structures—using LLMs guided by prompts specialized for code contexts. This approach is motivated by the unique challenges that arise when code elements are embedded within natural language, including domain-specific vocabulary, custom identifier names, phonetic ambiguity, and multilingual code-mixed expressions. Such challenges significantly degrade the downstream performance of code understanding models accessed via voice-driven interfaces, especially in low-resource and multilingual environments (Havare et al., 20 Jan 2026).

1. Modular Architecture for Code-Aware ASR Refinement

A modular pipeline architecture underpins the code-aware ASR refinement method. The system comprises three cascaded stages:

  • Speech Recognition: Audio queries—spoken in English or one of four Indic languages (Hindi, Gujarati, Tamil, Bengali)—are transcribed into text using Whisper for English and indic-conformer models for Indic languages. The raw transcript includes both natural-language fragments and code tokens.
  • Code-Aware ASR Refinement: The transcript is post-processed by GPT-4o-mini (or alternative LLMs), prompted to detect misrecognized code terms, correct phonetic distortions (e.g., “ask key” → “ASCII”), re-insert symbolic operators (“equal equal” → “==”), and disambiguate code versus natural language contexts.
  • Code Understanding and Feedback: The refined transcript is forwarded to downstream systems, including code question answering (using a code-capable LLM) and code retrieval (using BAAI/bge-code-v1 embeddings), and outputs both textual and synthesized spoken responses in the user's language (Havare et al., 20 Jan 2026).

This separation ensures systematic correction of ASR error patterns specific to programming discourse, particularly in multilingual and code-mixed input.
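The three cascaded stages can be sketched as a small composable pipeline. This is a minimal illustration of the architecture, not the paper's implementation: the stand-in functions are hypothetical, and in the real system `transcribe` would call Whisper or an indic-conformer model, `refine` would call GPT-4o-mini, and `answer` would query a code LLM or bge-code-v1 retrieval index.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class VoiceQueryPipeline:
    """Three cascaded stages: ASR -> code-aware refinement -> code understanding."""
    transcribe: Callable[[bytes], str]   # stage 1: speech recognition
    refine: Callable[[str], str]         # stage 2: code-aware ASR refinement
    answer: Callable[[str], str]         # stage 3: code QA / retrieval

    def run(self, audio: bytes) -> str:
        raw = self.transcribe(audio)     # raw transcript: NL fragments + code tokens
        refined = self.refine(raw)       # restore identifiers, symbols, keywords
        return self.answer(refined)      # downstream code understanding

# Toy stand-ins so the sketch runs end to end (illustrative only).
pipeline = VoiceQueryPipeline(
    transcribe=lambda audio: "print underscore sum of the list",
    refine=lambda t: t.replace("print underscore sum", "print_sum"),
    answer=lambda q: f"Retrieved snippet for query: {q!r}",
)
print(pipeline.run(b"<audio bytes>"))
```

Keeping the stages as swappable callables mirrors the paper's modularity claim: any ASR backend or refinement LLM can be substituted without touching the rest of the pipeline.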

2. Prompt Engineering and LLM Inference

The LLM-guided refinement employs zero-shot style prompts engineered for robust code error correction, following explicit instructions:

  1. Identify misrecognized code identifiers or keywords and restore them verbatim.
  2. Convert spelled-out symbols (e.g., “underscore”) back to punctuation (_).
  3. Correct phonetically garbled technical terms (“ask key” → “ASCII”).
  4. Preserve code-mixed and multilingual syntax.

Inference uses greedy decoding or small-beam search (beam = 2) with low, near-deterministic temperatures (0.0–0.3), and selects the highest-scoring completion as the final refined transcript. This configuration is empirically justified by consistent correction performance in pilot evaluations; additional reranking or multi-candidate selection is omitted because the top completion produced by GPT-4o-mini proved highly reliable (Havare et al., 20 Jan 2026).
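The four instructions above can be packaged as a zero-shot system prompt for an OpenAI-style chat-completions call. A minimal sketch, with the caveat that the paper's exact prompt wording is not reproduced here, so this phrasing and the helper function are illustrative:

```python
# Illustrative zero-shot prompt following the four instructions in the paper;
# the exact wording used by the authors is an assumption.
REFINE_INSTRUCTIONS = """\
You are a code-aware transcript corrector. Given an ASR transcript of a
spoken programming query:
1. Identify misrecognized code identifiers or keywords and restore them verbatim.
2. Convert spelled-out symbols (e.g., "underscore") back to punctuation (_).
3. Correct phonetically garbled technical terms ("ask key" -> "ASCII").
4. Preserve code-mixed and multilingual syntax.
Return only the corrected transcript."""

def build_refinement_messages(transcript: str) -> list[dict]:
    """Chat messages for a low-temperature (0.0-0.3) completion call,
    e.g. to GPT-4o-mini."""
    return [
        {"role": "system", "content": REFINE_INSTRUCTIONS},
        {"role": "user", "content": f"ASR transcript: {transcript}"},
    ]

messages = build_refinement_messages("check if value equal equal ask key code")
```

The single highest-scoring completion is taken as the refined transcript; no reranking step follows.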

3. Evaluation Metrics and Quantitative Results

Transcription and refinement quality are measured by:

  • Word Error Rate (WER):

\mathrm{WER} = \frac{S + D + I}{N}

where S, D, and I denote the numbers of substitutions, deletions, and insertions, and N is the number of words in the reference transcript.

  • Phoneme Error Rate (PER):

\mathrm{PER} = \frac{S_p + D_p + I_p}{N_p}

calculated on phonemic transcriptions produced by Epitran.

  • Weighted Feature-based Edit Distance (WFED): Computed using PanPhon feature distances between substituted phonemes.
| Dataset & Language | Stage | WER | PER | WFED |
|---|---|---|---|---|
| CSN–Python (Hindi) | ASR | 44.0% | 33.4% | 34.5% |
| CSN–Python (Hindi) | ASR-R | 30.7% | 15.4% | 7.8% |
| CSN–PHP (Bengali) | ASR | 73.0% | 42.5% | 26.9% |
| CSN–PHP (Bengali) | ASR-R | 51.5% | 30.6% | 18.3% |

Average reductions (absolute percentage points): WER by 20.95, PER by 28.85, WFED by 33.36.
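The WER formula generalizes directly to PER when the edit distance is taken over phoneme sequences instead of words. A minimal sketch of the underlying Levenshtein computation (standard algorithm, not code from the paper):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (S + D + I) / N via word-level Levenshtein alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                                  # all deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                                  # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + sub,    # substitution / match
                           dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1)          # insertion
    return dp[len(ref)][len(hyp)] / len(ref)

# "print_sum" is misheard as three words: 1 substitution + 2 insertions over N=4.
print(word_error_rate("print_sum of the list",
                      "print underscore sum of the list"))   # -> 0.75
```

PER would apply the same alignment to Epitran phoneme strings, and WFED would replace the 0/1 substitution cost with a PanPhon feature distance.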

Downstream effect is demonstrated in code retrieval and question answering tasks. For code retrieval on bge-code-v1 (Recall@5):

  • Original queries: 92% recall, 87.2% MRR
  • Raw ASR: 87% recall, 81.58% MRR
  • With ASR-R: 90% recall, 83.78% MRR

For CodeQA question answering, code-aware refinement restored up to 50–80% of the original answer quality, as measured by LLM-based deviation classification (Havare et al., 20 Jan 2026).
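The retrieval metrics above follow their standard definitions; a short sketch (standard formulas, not the paper's evaluation harness):

```python
def recall_at_k(ranked_ids: list[list[str]], gold_ids: list[str], k: int = 5) -> float:
    """Fraction of queries whose gold snippet appears in the top-k results."""
    hits = sum(gold in ranked[:k] for ranked, gold in zip(ranked_ids, gold_ids))
    return hits / len(gold_ids)

def mean_reciprocal_rank(ranked_ids: list[list[str]], gold_ids: list[str]) -> float:
    """Average of 1/rank of the gold snippet across queries (0 if absent)."""
    total = 0.0
    for ranked, gold in zip(ranked_ids, gold_ids):
        if gold in ranked:
            total += 1.0 / (ranked.index(gold) + 1)
    return total / len(gold_ids)

# Toy example: two queries, gold found at rank 2 for the first, absent for the second.
ranked = [["a", "b", "c"], ["x", "y", "z"]]
gold = ["b", "q"]
print(recall_at_k(ranked, gold, k=5))      # -> 0.5
print(mean_reciprocal_rank(ranked, gold))  # -> 0.25
```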

4. Failure Modes in ASR for Code and Remediation

Common ASR error modes on code-centric input include:

  • Phonetic Drift: “async” → “a sink”, “ASCII” → “ask key”
  • Code-Mixed Identifier Errors: “getUserInfo” → “get use run foe”
  • Keyword Ambiguity: keywords such as “not in” parsed as natural phrases
  • Symbol Loss: explicit operators omitted (“equal equal” dropped)
  • Identifier Recall Failures: rare identifiers replaced with generic words
  • Accent-Based Substitutions: accent-driven lexical modifications (e.g., “default” → “de fall”)

LLM-based refinement corrects these errors by restoring code accuracy, exemplified by “print underscore sum” (original) → “print sum” (ASR) → “print_sum” (ASR-R) (Havare et al., 20 Jan 2026).
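To make the symbol-loss and identifier cases concrete, here is a rule-based stand-in for the restoration step. The paper performs this correction with an LLM rather than regexes, so the mapping table below is a simplified illustration of the target behavior only:

```python
import re

# Illustrative spoken-form-to-symbol mapping; the paper's LLM handles these
# corrections contextually rather than via a fixed table.
SPOKEN_TO_SYMBOL = {
    r"\bequal equal\b": "==",
    r"\bnot equal\b": "!=",
    r"\bopen paren\b": "(",
    r"\bclose paren\b": ")",
}

def restore_symbols(transcript: str) -> str:
    """Reinsert operators and rejoin identifiers split around 'underscore'."""
    for pattern, symbol in SPOKEN_TO_SYMBOL.items():
        transcript = re.sub(pattern, symbol, transcript)
    # "print underscore sum" -> "print_sum"
    transcript = re.sub(r"(\w+) underscore (\w+)", r"\1_\2", transcript)
    return transcript

print(restore_symbols("if x equal equal y print underscore sum"))
# -> "if x == y print_sum"
```

Phonetic-drift cases such as “ask key” → “ASCII” are exactly where such fixed rules fail and the contextual LLM correction is needed.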

5. Data Construction and Multilingual Processing

The pipeline was evaluated on 18,000 multilingual query–code pairs across CodeQA, CodeSearchNet, and CoRNStack benchmarks. Multilingual queries were generated by:

  1. Transliteral translation using GPT-4o-mini with a “code-preserve” prompt, maintaining identifiers and symbols verbatim.
  2. Preprocessing code elements for TTS (e.g., “==” replaced by “equal equal”).
  3. Synthetic speech via Microsoft Edge TTS in the target Indic language.

Manual annotation ensured reference transcript fidelity for evaluating code understanding tasks in multilingual and code-mixed contexts (Havare et al., 20 Jan 2026).
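Step 2 above (preparing code elements for TTS) can be sketched as a symbol-to-words rewrite. The “==” → “equal equal” mapping is from the paper; the other entries and the helper are illustrative assumptions:

```python
# "==" -> "equal equal" is specified in the paper; the remaining entries
# are assumed for illustration. Longer symbols are listed first so "!="
# is not clobbered by a shorter rule.
CODE_TO_SPEECH = {
    "==": "equal equal",
    "!=": "not equal",
    "_": " underscore ",
}

def speakable(query: str) -> str:
    """Rewrite code tokens as pronounceable words before synthesis (e.g. Edge TTS)."""
    for symbol, words in CODE_TO_SPEECH.items():
        query = query.replace(symbol, words)
    return " ".join(query.split())  # collapse any doubled spaces

print(speakable("check if count == total in get_user_info"))
# -> "check if count equal equal total in get underscore user underscore info"
```

The synthesized audio of this spoken form is then what the ASR stage transcribes, which is also why errors like dropped “equal equal” appear downstream.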

6. Comparative Analysis and Ablations

Alternative LLMs (Claude Sonnet 4.5, Gemini 2.5 Pro) were compared as post-processors. Ablation studies on 100 CoRNStack samples (Hindi) yielded:

| Model | WER | PER | WFED |
|---|---|---|---|
| Claude Sonnet 4.5 | 22.7% | 7.0% | 3.7% |
| Gemini 2.5 Pro | 20.2% | 3.8% | 1.5% |

End-to-end multimodal LLMs (Phi-4, Qwen3-Omni-Flash) were also evaluated:

| Method | WER | PER | WFED |
|---|---|---|---|
| Phi-4 | 27.9% | 9.3% | 7.4% |
| Qwen3-Omni | 39.6% | 8.1% | 7.3% |
| Pipeline (ASR→LLM) | 19.9% | 2.9% | 1.3% |

The cascaded ASR→LLM refinement design outperformed direct end-to-end multimodal systems, particularly for Indic languages and code-mixed utterances (Havare et al., 20 Jan 2026).

7. Implications for Voice-Driven Programming Tools

Interposing a code-aware LLM post-processor between ASR and code-understanding models substantially recovers transcription integrity at word and phoneme levels, restoring downstream code retrieval and QA performance close to the baseline of manually entered queries. This architecture, leveraging prompt design and multilingual resources, addresses inclusivity and usability in regions with diverse language practice and code-mixing, offering a practical solution for robust, voice-driven programming interfaces (Havare et al., 20 Jan 2026).

A plausible implication is that future systems could further benefit from adaptive prompt engineering, dynamic reranking, and domain-specific fine-tuning for yet more challenging code-mixed, multilingual input scenarios.
