- The paper presents EndoASR, a novel ASR system built around a two-stage domain adaptation pipeline that combines synthetic data with noise augmentation and significantly improves transcription accuracy.
- It demonstrates a reduction in character error rate (CER) from 20.52% to 14.14% on clinical data and an increase in medical term accuracy from 54.3% to 87.6%, outperforming general-purpose ASR models.
- The study validates EndoASR in multi-center clinical trials, showcasing its low-latency performance and robust human-AI teaming in real-world gastrointestinal endoscopy workflows.
Domain-Adaptive Speech Recognition for Human-AI Teaming in Real-World Gastrointestinal Endoscopy
Introduction and Motivation
This work presents EndoASR, a domain-adapted automatic speech recognition (ASR) system tailored for real-time deployment in gastrointestinal endoscopy workflows. The paper addresses limitations of general-purpose ASR models in clinical environments, emphasizing specialized medical terminology, high acoustic variability, and latency constraints in procedural medicine. The authors argue that ASR is a critical interface for effective human-AI teaming, particularly in multi-task, hands-busy workflows such as endoscopy, where natural language voice control is superior to alternative interaction modalities.
Figure 1: The EndoASR framework, illustrating a methodological pipeline from human–AI teaming, domain-specific challenges, a two-stage adaptation strategy, to progressive clinical validation.
Methodological Design
The EndoASR system is based on Seaco-Paraformer, a non-autoregressive end-to-end ASR model. The core methodological innovation is a two-stage domain adaptation pipeline designed to address linguistic and acoustic mismatch:
- Stage 1 – Domain-Specific Language Adaptation: The base model is fine-tuned on synthetic speech generated from structured endoscopy reports, optimizing for recognition of professional terminology and formulaic reporting styles.
- Stage 2 – Noise-Robustness Enhancement: The model is further fine-tuned with noise-augmented synthetic data, where real-world endoscopy room recordings are mixed into the synthetic speech, thus modeling challenging, heterogeneous acoustic backgrounds (see the augmentation sketch after this list).
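The paper's augmentation code is not reproduced in this summary; as a rough illustration of the Stage 2 data preparation, the sketch below mixes a recorded noise clip into a synthetic utterance at a target signal-to-noise ratio. The `mix_at_snr` helper, the placeholder waveforms, and the 5–20 dB SNR range are assumptions for illustration, not the authors' settings.

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix a noise clip into a speech clip at a target signal-to-noise ratio (dB)."""
    # Loop or trim the noise so it covers the whole utterance.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]

    speech_power = np.mean(speech ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale the noise so that 10 * log10(speech_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    mixed = speech + scale * noise
    # Renormalise only if the mixture would clip when written back to fixed-point audio.
    peak = np.max(np.abs(mixed))
    return mixed / peak if peak > 1.0 else mixed

# Example: augment one synthetic utterance with a random endoscopy-room noise clip.
rng = np.random.default_rng(0)
synthetic_utt = rng.standard_normal(16000 * 5).astype(np.float32) * 0.1  # placeholder 5 s clip
room_noise = rng.standard_normal(16000 * 2).astype(np.float32) * 0.05    # placeholder noise clip
augmented = mix_at_snr(synthetic_utt, room_noise, snr_db=rng.uniform(5, 20))
```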
The complete framework supports both controlled (retrospective) and real-world (prospective, multi-center) clinical validation, allowing quantification of both algorithmic advances and practical deployment robustness.
Retrospective Benchmarking
General-purpose ASR models, including Whisper-large-v3 and wav2vec2-XLSR-53, were systematically benchmarked against Paraformer on 600 intraoperative recordings from six endoscopists, with no domain adaptation applied. Results demonstrate that Paraformer outperformed both large-scale and smaller models, achieving a CER of 20.52% and superior semantic metrics, providing the backbone for EndoASR development.
Model comparison revealed that state-of-the-art general ASR systems exhibit high error rates and notable deficiencies in medical term transcription under authentic clinical acoustics. Paraformer, with a favorable efficiency–accuracy trade-off, was selected as the adaptation substrate.
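The summary reports CER without defining it; CER is the character-level Levenshtein (edit) distance between hypothesis and reference, normalised by the reference length. A minimal reference implementation (not the authors' evaluation code) is sketched below.

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance / reference length (can exceed 1.0)."""
    ref, hyp = list(reference), list(hypothesis)
    # prev[j] holds the edit distance between the reference prefix and hyp[:j]; rolled row by row.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (or exact match)
            )
        prev = curr
    return prev[-1] / max(len(ref), 1)

print(cer("sessile polyp at the ileocecal valve", "sessile polyp at the ileo cecal valve"))
```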
Figure 2: Retrospective performance across metrics, speaker- and parameter-dependent variability, and the medical terminology recognition/efficiency frontier.
Two-Stage Adaptation Strategy and Controlled Evaluation
Fine-tuning Paraformer on synthetic domain-specific speech (Stage 1) reduced CER to 3.71% on synthetic data and 14.72% on clinical data; noise-robust fine-tuning (Stage 2) brought a further improvement on clinical data (CER 14.14%). Medical term accuracy (Med ACC) improved from 54.3% (baseline) to 87.6% after adaptation, confirming the effectiveness of the domain-adaptive strategy.
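Med ACC is likewise not defined in the summary. A plausible minimal implementation, assuming it is the fraction of lexicon terms present in the reference transcript that are also recovered verbatim in the hypothesis (the mini-lexicon and exact-match rule here are illustrative assumptions):

```python
def medical_term_accuracy(reference: str, hypothesis: str, lexicon: set[str]) -> float:
    """Share of medical terms in the reference that also appear in the hypothesis."""
    ref_terms = [term for term in lexicon if term in reference]
    if not ref_terms:
        return 1.0  # nothing to recognise in this utterance
    hits = sum(term in hypothesis for term in ref_terms)
    return hits / len(ref_terms)

lexicon = {"ileocecal valve", "polyp", "mucosa", "biopsy"}  # hypothetical mini-lexicon
print(medical_term_accuracy(
    "a sessile polyp near the ileocecal valve, biopsy taken",
    "a sessile polyp near the ileo cecal valve biopsy taken",
    lexicon,
))  # ~0.67: "ileocecal valve" is not matched verbatim
```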
Speaker-wise breakdown demonstrates that both language adaptation and noise robustness generalize over diverse individual speaking styles, though gains are predominantly driven by linguistic adaptation due to greater inter-speaker than inter-noise variability.
Prospective Multi-Center Validation
For deployment robustness, prospective evaluations were conducted on real-world data from five endoscopy centers, each characterized by diverse recording environments and clinical content types. t-SNE visualizations of MFCC-derived acoustic embeddings confirmed pronounced inter-center heterogeneity.
Figure 3: t-SNE embedding of multi-center speech segments, displaying acoustic clustering by center, and the monotonically increasing trend of Med ACC with synthetic data scale.
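A minimal sketch of the kind of acoustic-heterogeneity analysis described above, using librosa MFCCs pooled per segment and scikit-learn's t-SNE; the feature settings (20 MFCCs, mean/std pooling, perplexity 30) are assumptions rather than the paper's configuration.

```python
import numpy as np
import librosa
from sklearn.manifold import TSNE

def segment_embedding(path: str, sr: int = 16000, n_mfcc: int = 20) -> np.ndarray:
    """Summarise one speech segment as the mean and std of its MFCC frames."""
    audio, _ = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc)    # shape (n_mfcc, frames)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])  # shape (2 * n_mfcc,)

def tsne_by_center(paths_by_center: dict[str, list[str]]) -> tuple[np.ndarray, list[str]]:
    """Embed all segments into 2-D; scatter-plot the coordinates coloured by center."""
    feats, labels = [], []
    for center, paths in paths_by_center.items():
        for p in paths:
            feats.append(segment_embedding(p))
            labels.append(center)
    coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(np.stack(feats))
    return coords, labels
```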
EndoASR and EndoASR-noise consistently outperformed Paraformer across all metrics (mean CER reduced from 16.2% to 14.97%; Med ACC improved from 61.6% to 84.2%). Importantly, terminology accuracy gains were largest in clinically critical workflow segments (lesion description, intra-operative communication) and showed only modest saturation as the synthetic data scale increased.
Figure 4: Multi-center breakdown of CER and Med ACC by clinical content category and institution, demonstrating robust cross-domain performance of EndoASR models.
Further analysis uncovered that noise-robust fine-tuning is most impactful when clinical acoustic conditions closely match training augmentation statistics; linguistic adaptation alone explains the bulk of cross-center generalization gains in current deployment scenarios.
Downstream Integration and Systemic Impact
Embedding EndoASR outputs in LLM-powered clinical information extraction pipelines demonstrated that higher ASR fidelity translates directly to error reduction in downstream structured documentation and automated report generation. EndoASR produced near-real-time, high-accuracy medical transcripts, facilitating new forms of clinician–AI interaction essential for next-generation clinical workflow automation.
Figure 5: Upstream training data construction and augmentation, transcription quality head-to-head, RTF profiling, and downstream LLM-based report extraction comparison between EndoASR and Whisper-Large.
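The paper's downstream extraction pipeline is not reproduced here; the sketch below only shows the general pattern of passing an ASR transcript to an LLM with a structured-output prompt. `call_llm` is a placeholder for whichever model endpoint is used, and the field names are illustrative assumptions.

```python
import json

REPORT_PROMPT = """You are given a raw transcript of an endoscopist dictating findings.
Extract a JSON object with the keys "location", "lesion_description", and "intervention".
Transcript:
{transcript}
Return only the JSON object."""

def call_llm(prompt: str) -> str:
    # Placeholder: plug in any chat/completions client here (local or hosted model).
    raise NotImplementedError

def transcript_to_structured_report(transcript: str) -> dict:
    """Turn a (possibly noisy) ASR transcript into a structured finding record."""
    raw = call_llm(REPORT_PROMPT.format(transcript=transcript))
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # ASR or LLM errors can yield malformed output; keep the raw text for review.
        return {"raw_output": raw}
```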
EndoASR's real-time factor (RTF) of 0.005 far outpaces Whisper-large-v3 and is competitive even against smaller models, supporting its suitability for low-latency, edge-side deployment in high-acuity clinical settings.
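For context, the real-time factor is wall-clock processing time divided by audio duration, so an RTF of 0.005 means a 60-second clip is transcribed in roughly 0.3 seconds. A simple way to measure it, assuming a `model.transcribe`-style inference call (the interface name is an assumption):

```python
import time

def real_time_factor(model, audio, audio_duration_s: float) -> float:
    """Inference time divided by audio duration; lower is better, < 1 is faster than real time."""
    start = time.perf_counter()
    model.transcribe(audio)  # assumed interface; substitute the actual inference call
    elapsed = time.perf_counter() - start
    return elapsed / audio_duration_s
```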
Implications and Future Directions
The study provides compelling evidence that domain-adaptive strategies, leveraging even moderate-scale synthetic data and center-specific noise augmentation, can robustly bridge the out-of-distribution (OOD) gap between generalist ASR models and high-demand, specialized medical speech applications. Gains in terminology recognition yield downstream improvements in automated reporting, clinical decision support, and auditability in human–AI teaming.
The theoretical implication is that tailored linguistic fine-tuning accounts for most of the adaptation gains in highly structured, specialized domains; the practical implication is that real-world trials and scaled uptake of conversational AI agents in medical environments become feasible.
Future research should incorporate multi-accent and spontaneous dialogue data to further increase inclusivity and robustness. Center-agnostic or dynamic noise modeling may further enhance out-of-distribution generalization. Integrating ASR outputs with agentic LLMs could unlock higher-level workflow intelligence, with feedback between perception modules (such as EndoASR) and reasoning components as an avenue for adaptive improvement.
Conclusion
This paper establishes that domain-adapted ASR, achieved with a scalable two-stage adaptation mechanism, produces marked and clinically salient advances in speech-driven human–AI interaction for gastrointestinal endoscopy. The EndoASR system demonstrates strong generalization and efficiency, validated across heterogeneous multi-center cohorts, and paves the way for reliable, workflow-integrated AI agents in procedural clinical practice. This approach and its methodology are directly extensible to other complex, terminology-intensive medical and technical domains.