Papers
Topics
Authors
Recent
Search
2000 character limit reached

Development and multi-center evaluation of domain-adapted speech recognition for human-AI teaming in real-world gastrointestinal endoscopy

Published 2 Apr 2026 in cs.CL and cs.AI | (2604.01705v1)

Abstract: Automatic speech recognition (ASR) is a critical interface for human-AI interaction in gastrointestinal endoscopy, yet its reliability in real-world clinical settings is limited by domain-specific terminology and complex acoustic conditions. Here, we present EndoASR, a domain-adapted ASR system designed for real-time deployment in endoscopic workflows. We develop a two-stage adaptation strategy based on synthetic endoscopy reports, targeting domain-specific language modeling and noise robustness. In retrospective evaluation across six endoscopists, EndoASR substantially improves both transcription accuracy and clinical usability, reducing character error rate (CER) from 20.52% to 14.14% and increasing medical term accuracy (Med ACC) from 54.30% to 87.59%. In a prospective multi-center study spanning five independent endoscopy centers, EndoASR demonstrates consistent generalization under heterogeneous real-world conditions. Compared with the baseline Paraformer model, CER is reduced from 16.20% to 14.97%, while Med ACC is improved from 61.63% to 84.16%, confirming its robustness in practical deployment scenarios. Notably, EndoASR achieves a real-time factor (RTF) of 0.005, significantly faster than Whisper-large-v3 (RTF 0.055), while maintaining a compact model size of 220M parameters, enabling efficient edge deployment. Furthermore, integration with LLMs demonstrates that improved ASR quality directly enhances downstream structured information extraction and clinician-AI interaction. These results demonstrate that domain-adapted ASR can serve as a reliable interface for human-AI teaming in gastrointestinal endoscopy, with consistent performance validated across multi-center real-world clinical settings.

Summary

  • The paper presents EndoASR, a novel ASR system featuring a two-stage domain adaptation with synthetic data and noise augmentation that significantly improves transcription accuracy.
  • It demonstrates a reduction in character error rate from 20.52% to around 14.14% on clinical data and boosts medical term accuracy from 54.3% to 87.6%, outperforming general ASR models.
  • The study validates EndoASR in multi-center clinical trials, showcasing its low-latency performance and robust human-AI teaming in real-world gastrointestinal endoscopy workflows.

Domain-Adaptive Speech Recognition for Human-AI Teaming in Real-World Gastrointestinal Endoscopy

Introduction and Motivation

This work presents EndoASR, a domain-adapted automatic speech recognition (ASR) system tailored for real-time deployment in gastrointestinal endoscopy workflows. The paper addresses limitations of general-purpose ASR models in clinical environments, emphasizing specialized medical terminology, high acoustic variability, and latency constraints in procedural medicine. The authors argue that ASR is a critical interface for effective human-AI teaming, particularly in multi-task, hands-busy workflows such as endoscopy, where natural language voice control is superior to alternative interaction modalities. Figure 1

Figure 1: The EndoASR framework, illustrating a methodological pipeline from human–AI teaming, domain-specific challenges, a two-stage adaptation strategy, to progressive clinical validation.

Methodological Design

The EndoASR system is based on Seaco-Paraformer, a non-autoregressive end-to-end ASR model. The core methodological innovation is a two-stage domain adaptation pipeline designed to address linguistic and acoustic mismatch:

  1. Stage 1 – Domain-Specific Language Adaptation: The base model is fine-tuned on synthetic speech generated from structured endoscopy reports, optimizing for recognition of professional terminology and formulaic reporting styles.
  2. Stage 2 – Noise-Robustness Enhancement: The model is further fine-tuned with noise-augmented synthetic data, where real-world endoscopy room recordings are mixed into the synthetic speech, thus modeling challenging, heterogeneous acoustic backgrounds.

The complete framework supports both controlled (retrospective) and real-world (prospective, multi-center) clinical validation, allowing quantification of both algorithmic advances and practical deployment robustness.

Retrospective Benchmarking

General-purpose ASR models, including Whisper-large-v3 and wav2vec2-XLSR-53, were systematically benchmarked against Paraformer on 600 intraoperative recordings from six endoscopists, with no domain adaptation applied. Results demonstrate that Paraformer outperformed both large-scale and smaller models, achieving a CER of 20.52% and superior semantic metrics, providing the backbone for EndoASR development.

Model comparison revealed that state-of-the-art general ASR systems exhibit high error rates and notable deficiencies in medical term transcription under authentic clinical acoustics. Paraformer, with a favorable efficiency–accuracy trade-off, was selected as the adaptation substrate. Figure 2

Figure 2: Retrospective performance across metrics, speaker- and parameter-dependent variability, and the medical terminology recognition/efficiency frontier.

Two-Stage Adaptation Strategy and Controlled Evaluation

Fine-tuning Paraformer on synthetic domain-specific speech (Stage 1) immediately reduced CER to 3.71% on synthetic data and 14.72% on clinical data; noise-robust fine-tuning (Stage 2) brought further improvements (CER 14.14%). Medical term accuracy (Med ACC) improved from 54.3% (baseline) to 87.6% post adaptation—confirming the effectiveness of domain-adaptive techniques.

Speaker-wise breakdown demonstrates that both language adaptation and noise robustness generalize over diverse individual speaking styles, though gains are predominantly driven by linguistic adaptation due to greater inter-speaker than inter-noise variability.

Prospective Multi-Center Validation

For deployment robustness, prospective evaluations were conducted on real-world data from five endoscopy centers, each characterized by diverse recording environments and clinical content types. t-SNE visualizations of MFCC-derived acoustic embeddings confirmed pronounced inter-center heterogeneity. Figure 3

Figure 3: t-SNE embedding of multi-center speech segments, displaying acoustic clustering by center, and the monotonically increasing trend of Med ACC with synthetic data scale.

EndoASR and EndoASR-noise consistently outperformed Paraformer across all metrics (mean CER reduced from 16.2% to 14.97%; Med ACC from 61.6% to 84.2%). Importantly, terminology accuracy gains proved most significant in clinically critical workflow segments (lesion description, intra-operative communication) and showed only modest saturation as synthetic data scale increased. Figure 4

Figure 4: Multi-center breakdown of CER and Med ACC by clinical content category and institution, demonstrating robust cross-domain performance of EndoASR models.

Further analysis uncovered that noise-robust fine-tuning is most impactful when clinical acoustic conditions closely match training augmentation statistics; linguistic adaptation alone explains the bulk of cross-center generalization gains in current deployment scenarios.

Downstream Integration and Systemic Impact

Embedding EndoASR outputs in LLM-powered clinical information extraction pipelines demonstrated that higher ASR fidelity translates directly to error reduction in downstream structured documentation and automated report generation. EndoASR produced near-real-time, high-accuracy medical transcripts, facilitating new forms of clinician–AI interaction essential for next-generation clinical workflow automation. Figure 5

Figure 5: Upstream training data construction and augmentation, transcription quality head-to-head, RTF profiling, and downstream LLM-based report extraction comparison between EndoASR and Whisper-Large.

EndoASR's real-time factor (RTF) of 0.005 far outpaces Whisper-large-v3 and is competitive even against smaller models, supporting its suitability for low-latency, edge-side deployment in high-acuity clinical settings.

Implications and Future Directions

The study provides compelling evidence that domain-adaptive strategies, leveraging even moderate-scale synthetic data and center-specific noise augmentation, can robustly bridge the OOD gap between generalist ASR models and high-demand, specialized medical speech applications. Gains in terminology recognition yield downstream improvement in automated reporting, clinical decision support, and auditability in human–AI teaming.

The theoretical implications include a demonstration that tailored linguistic fine-tuning dominates ASR adaptation returns within highly-structured specialized domains; practical implications involve enabling real-world trials and scaled uptake of conversational AI agents in medical environments.

Future research should incorporate multi-accent and spontaneous dialogue data to further increase inclusivity and robustness. Center-agnostic or dynamic noise modeling may further enhance out-of-distribution generalization. Integrating ASR outputs with agentic LLMs could unlock higher-level workflow intelligence, with feedback between perception modules (such as EndoASR) and reasoning components as an avenue for adaptive improvement.

Conclusion

This paper establishes that domain-adapted ASR, achieved with a scalable two-stage adaptation mechanism, produces marked and clinically salient advances in speech-driven human–AI interaction for gastrointestinal endoscopy. The EndoASR system demonstrates strong generalization and efficiency, validated across heterogeneous multi-center cohorts, and paves the way for reliable, workflow-integrated AI agents in procedural clinical practice. This approach and its methodology are directly extensible to other complex, terminology-intensive medical and technical domains.

Paper to Video (Beta)

No one has generated a video about this paper yet.

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Open Problems

We found no open problems mentioned in this paper.

Collections

Sign up for free to add this paper to one or more collections.