
EvolveCaptions: Real-Time ASR Adaptation

Updated 4 October 2025
  • EvolveCaptions is a collaborative ASR adaptation system that tailors live transcriptions for DHH users through human-in-the-loop corrections.
  • It employs phonetically targeted clause generation and error-driven minimal pair data collection to guide incremental model fine-tuning.
  • Empirical studies reveal a 27% reduction in WER with just five minutes of recording per session, enhancing accessibility and efficiency.

EvolveCaptions is a real-time, collaborative adaptation system for automatic speech recognition (ASR) that facilitates efficient, low-burden transcription personalization for Deaf and Hard of Hearing (DHH) users in live conversational contexts. By integrating human-in-the-loop correction, phonetically targeted data collection, and lightweight on-the-fly model fine-tuning, EvolveCaptions addresses the persistent deficits of standard ASR engines on atypical (often DHH) speech without requiring extensive pre-recorded data or disproportionate user effort.

1. Collaborative Real-Time Adaptation Workflow

The EvolveCaptions system operates through a closed-loop interaction cycle designed to iteratively improve ASR accuracy for DHH users in situ. The key stages are:

  1. Live Caption Correction: During conversation, a Whisper-based ASR engine provides on-screen speech-to-text captions. Hearing participants actively review and edit these captions, marking corrections (highlighted in yellow) and segments with high uncertainty (highlighted in red). Corrections both immediately enhance caption fidelity and pinpoint error loci for targeted adaptation.
  2. Targeted Clause Generation and Recording: Each time an error is corrected, the system generates a short, natural clause using GPT-4. The prompt is constructed to ensure that the problematic word appears in a clear and phonetically informative context that closely resembles practical conversational use:
    "You are generating short, spoken English clauses to help improve an automatic speech recognition (ASR) system. Based on a word that was misrecognized by ASR, your goal is to create a new clause (5-15 words) that:
    – Sounds natural in daily conversation.
    – Contains the corrected word in a prominent, clear context.
    – Has a similar phonetic structure to the original sentence.
    Original words: '{original}'
    Corrected words: '{corrected}'
    Generate one new clause that can be used to help the ASR model learn this correction."
    The DHH user then records this clause, supported by interface elements for visual waveform feedback and re-recording options as needed.
  3. Incremental Model Fine-Tuning: The collected (audio, text) pairs are pipelined into a lightweight ASR fine-tuning loop using tools such as the HuggingFace Seq2SeqTrainer. Hyperparameters include a learning rate of 1×10⁻⁵, batch size 8, and up to 100 update steps per round. Fine-tuning proceeds asynchronously, continually incorporating new in-context corrections and recordings—thereby incrementally aligning recognition to the individual DHH speaker.
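
A minimal sketch of one such fine-tuning round is shown below, assuming a HuggingFace Whisper checkpoint and an in-memory list of the newly recorded (audio, text) pairs. The reported hyperparameters (learning rate 1×10⁻⁵, batch size 8, up to 100 steps) are used directly; the checkpoint name, preprocessing, and collator are illustrative assumptions rather than the authors' exact implementation.

```python
# Sketch: one incremental fine-tuning round for a Whisper model with Seq2SeqTrainer.
# Checkpoint name, dataset handling, and collator are assumptions, not the paper's code.
import torch
from datasets import Dataset
from transformers import (
    WhisperForConditionalGeneration,
    WhisperProcessor,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

processor = WhisperProcessor.from_pretrained("openai/whisper-small")  # assumed checkpoint
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

def preprocess(example):
    # example["audio"]: 16 kHz waveform (list of floats); example["text"]: corrected clause
    features = processor.feature_extractor(
        example["audio"], sampling_rate=16000, return_tensors="pt"
    )
    labels = processor.tokenizer(example["text"]).input_ids
    return {"input_features": features.input_features[0], "labels": labels}

def collate(batch):
    # Pad log-mel features and label sequences; mask label padding with -100.
    input_features = torch.stack([torch.tensor(b["input_features"]) for b in batch])
    label_batch = processor.tokenizer.pad(
        [{"input_ids": b["labels"]} for b in batch], return_tensors="pt"
    )
    labels = label_batch["input_ids"].masked_fill(
        label_batch["attention_mask"].ne(1), -100
    )
    # Whisper prepends the decoder start token itself, so drop it from labels if present.
    if (labels[:, 0] == model.config.decoder_start_token_id).all():
        labels = labels[:, 1:]
    return {"input_features": input_features, "labels": labels}

def finetune_round(new_pairs):
    # new_pairs: list of {"audio": waveform, "text": corrected clause} collected this round
    dataset = Dataset.from_list(new_pairs).map(preprocess)
    args = Seq2SeqTrainingArguments(
        output_dir="evolvecaptions_ckpt",
        learning_rate=1e-5,              # as reported
        per_device_train_batch_size=8,   # as reported
        max_steps=100,                   # up to 100 update steps per round
        logging_steps=10,
        report_to=[],
    )
    trainer = Seq2SeqTrainer(
        model=model, args=args, train_dataset=dataset, data_collator=collate
    )
    trainer.train()
    return model
```

In practice such a round would run asynchronously in a background process, so captioning continues uninterrupted while the model updates.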

The process is designed for rapid iteration and requires as little as five minutes of recording to achieve substantial reductions in word error rate (WER) within an hour of usage.

2. Phonetically Targeted Prompt Generation

The use of contextually rich, phonetically explicit recording prompts is critical for rapid, data-efficient model adaptation. Rather than collecting large generic datasets, EvolveCaptions focuses on error-driven minimal pairs—new clauses are generated only for words or phrases that the ASR system detected or transcribed incorrectly. Prompts are generated to:

  • Situate each error word in a realistic, high-salience clause,
  • Retain a similar overall phonetic structure relative to the original utterance, maximizing transferability of adaptation,
  • Minimize cognitive and articulatory complexity for the DHH user.

This targeted protocol ensures that fine-tuning samples are maximally informative for the ASR, enabling speaker-specific learning trajectories with vastly reduced data collection requirements.
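
As an illustration of this protocol, the correction-time prompt from Section 1 can be issued programmatically whenever an error is logged. The sketch below assumes the OpenAI chat completions client and a GPT-4 model identifier; these are plausible but unconfirmed implementation details.

```python
# Illustrative sketch: generate one phonetically targeted recording clause for a
# corrected ASR error, using the prompt template from Section 1.
# The OpenAI client usage and model name are assumptions, not the authors' exact code.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT_TEMPLATE = """You are generating short, spoken English clauses to help improve an automatic speech recognition (ASR) system. Based on a word that was misrecognized by ASR, your goal is to create a new clause (5-15 words) that:
- Sounds natural in daily conversation.
- Contains the corrected word in a prominent, clear context.
- Has a similar phonetic structure to the original sentence.
Original words: '{original}'
Corrected words: '{corrected}'
Generate one new clause that can be used to help the ASR model learn this correction."""

def generate_recording_clause(original: str, corrected: str) -> str:
    """Return a short clause containing the corrected word for the DHH user to record."""
    response = client.chat.completions.create(
        model="gpt-4",  # assumed model identifier
        messages=[{
            "role": "user",
            "content": PROMPT_TEMPLATE.format(original=original, corrected=corrected),
        }],
        temperature=0.7,
    )
    return response.choices[0].message.content.strip()

# Hypothetical example: the ASR output "meat you at the park" was corrected to
# "meet you at the park".
# clause = generate_recording_clause("meat you at the park", "meet you at the park")
```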

3. Model Personalization and Incremental Fine-Tuning

EvolveCaptions employs a session-based, low-latency model update routine. Each new audio–text pair is ingested as training data and used to update the Whisper ASR model incrementally. The system is implemented using standard sequence-to-sequence training procedures, typically running fine-tuning in the background with only minor computational resource demand (e.g., batch size 8, 100 steps per update cycle).

The fundamental evaluation metric is Word Error Rate (WER),

$$\mathrm{WER} = \frac{S + D + I}{N}$$

where S is the number of substitutions, D the number of deletions, I the number of insertions, and N the number of words in the ground-truth transcript. The protocol specifically targets error classes associated with the DHH user’s idiosyncratic speech, reducing S, D, and I incrementally through each personalized adaptation cycle.
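
For concreteness, the following sketch computes WER from a reference and hypothesis transcript using a standard word-level edit-distance alignment. It implements the generic metric definition above and is not code from the paper.

```python
# Generic word error rate (WER) computation via edit-distance alignment,
# matching the definition WER = (S + D + I) / N. Not code from the paper.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    n, m = len(ref), len(hyp)
    # d[i][j] = minimum edits needed to turn ref[:i] into hyp[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i          # i deletions
    for j in range(m + 1):
        d[0][j] = j          # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])  # substitution or match
            dele = d[i - 1][j] + 1                               # deletion
            ins = d[i][j - 1] + 1                                # insertion
            d[i][j] = min(sub, dele, ins)
    return d[n][m] / max(n, 1)  # (S + D + I) / N

# Example: one substitution ("meet" -> "meat") in a five-word reference gives WER = 0.2.
print(wer("please meet me at noon", "please meat me at noon"))  # 0.2
```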

4. User-In-the-Loop Design and Interaction Paradigm

EvolveCaptions' interactive design distributes adaptation effort across the conversational dyad:

  • Hearing participants act as real-time annotators, correcting and validating ASR output as conversation unfolds.
  • DHH users engage only in brief, context-rich clause recordings, isolated to instances where errors are actually detected.

The interface provides systematic visual cues (highlights, waveform feedback), immediate error acknowledgment, and integrated controls for oversight and repeatability. This approach minimizes DHH user time investment—on average, only five minutes of recording were needed per conversational hour—while maintaining a workflow that is both accessible and unobtrusive.

5. Empirical Findings from User Study

The system was evaluated in a remote longitudinal study involving 12 DHH and 6 hearing participants. Key quantitative and qualitative results:

  • WER was reduced by a median of approximately 27.2% (mean reduction: 30.4%) in under one hour of use per DHH user.
  • The total additional user recording burden averaged only five minutes per session.
  • The reduction in WER was statistically significant (Wilcoxon signed-rank test, p < 0.05; see the sketch after this list).
  • DHH users described a clear sense that the system “gradually learned” their speech over time, using phrases such as “actively teaching the system.”
  • Hearing users described the collaborative correction workflow as “intuitive” and reported that their annotation effort reduced with successive rounds.
  • Both groups highlighted the system’s integration and accessibility—all corrections, prompt recordings, and ASR updates occur in a unified, low-effort session.
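
As a minimal illustration of the reported significance test, the sketch below applies a Wilcoxon signed-rank test (scipy.stats.wilcoxon) to paired per-participant WER values before and after adaptation. The values are placeholders, not data from the study.

```python
# Illustrative Wilcoxon signed-rank test on paired per-participant WER values
# before and after adaptation. The values below are placeholders, not study data.
from scipy.stats import wilcoxon

wer_before = [0.52, 0.47, 0.61, 0.39, 0.55, 0.48, 0.66, 0.43, 0.58, 0.50, 0.45, 0.62]
wer_after  = [0.38, 0.35, 0.44, 0.30, 0.41, 0.36, 0.49, 0.33, 0.42, 0.37, 0.35, 0.46]

statistic, p_value = wilcoxon(wer_before, wer_after)
print(f"W = {statistic}, p = {p_value:.4f}")  # p < 0.05 would indicate a significant reduction
```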

These findings provide evidence that collaborative, real-time adaptation grounded in error-driven, phonetically informative prompts can produce marked improvements in personalized ASR performance with minimal burden, supporting more equitable and fluid mixed-ability communication.

6. Broader Implications and Limitations

EvolveCaptions demonstrates that live, collaborative ASR adaptation—combining immediate correction, targeted data augmentation, and incremental model learning—offers a tractable approach to rectifying the systematic errors that ASR systems exhibit on DHH or atypical speech. By shifting much of the adaptation effort to real-time, context-sensitive annotation and by optimizing the data collection pipeline for maximal informativity and minimal redundancy, the system sidesteps the scalability and user-burden concerns of traditional approaches.

A plausible implication is that such a human-in-the-loop, error-driven paradigm could generalize to other domains of atypical speech or language technology personalization, provided that error localization and data-efficient, on-the-fly adaptation mechanisms are available. The paper also reveals ongoing challenges, chiefly the need to balance conversational flow with correction effort and to reduce DHH user workload even further. Future improvements may involve more sophisticated active learning of correction targets, adaptive prompt generation strategies, or incorporation of non-verbal cues to further improve ASR robustness.
