Evaluating WER under selective prediction in long-form ASR

Determine a principled method to compute and report Word Error Rate (WER) for long-form automatic speech recognition when a subset of predicted words is intentionally ignored (for example, filtered out based on word-level uncertainty). The method should enable fair and meaningful evaluation under selective prediction settings.

Background

The paper studies uncertainty estimation for long-form ASR and notes that the common evaluation via error-retention curves becomes problematic when some predictions are dropped. In particular, WER has no straightforward definition when certain words are ignored based on uncertainty flags: the metric is computed from an edit-distance alignment over the full hypothesis, and it is ambiguous how ignored words should interact with the substitutions, insertions, and deletions in that alignment.
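One way to make the ambiguity concrete is to sketch a candidate "selective WER". The sketch below is not the paper's method; it is one plausible convention, in which reference and hypothesis are aligned by standard edit distance, alignment operations whose hypothesis word is flagged as uncertain are dropped, and WER is reported over the remaining operations together with a coverage figure. The function names and the choice to always count deletions (which have no hypothesis word to flag) are assumptions, and exactly such choices are what makes the metric ill-defined.

```python
def align(ref, hyp):
    """Levenshtein alignment; returns a list of (op, ref_idx, hyp_idx)."""
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    # Backtrack from the bottom-right corner to recover the operations.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            ops.append(("ok" if ref[i - 1] == hyp[j - 1] else "sub", i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append(("del", i - 1, None))
            i -= 1
        else:
            ops.append(("ins", None, j - 1))
            j -= 1
    return list(reversed(ops))

def selective_wer(ref, hyp, ignore):
    """WER over alignment ops that do not touch an ignored hypothesis word.

    `ignore` is a set of hypothesis word indices flagged as uncertain.
    Deletions carry no hypothesis word, so under this convention they are
    always counted -- one of the design choices the metric leaves open.
    Returns (wer, coverage), where coverage is the kept fraction of words.
    """
    ops = align(ref, hyp)
    kept = [op for op in ops if op[2] is None or op[2] not in ignore]
    errors = sum(op[0] != "ok" for op in kept)
    scored_ref = sum(op[1] is not None for op in kept)  # reference words scored
    coverage = (len(hyp) - len(ignore)) / len(hyp) if hyp else 1.0
    return errors / max(scored_ref, 1), coverage
```

For example, with reference "the cat sat", hypothesis "the bat sat", and the second hypothesis word flagged, this convention yields a selective WER of 0 at coverage 2/3, whereas scoring all words gives 1/3. A different, equally defensible treatment of deletions or of the reference denominator would change the number, which is exactly the problem the paper raises.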

To proceed, the authors adopt alternative metrics (uncertainty ratio and recall of error detection), explicitly noting the lack of a clear procedure for computing WER in this selective setting. This highlights the need for a rigorous evaluation framework that accounts for skipped words without compromising interpretability or comparability.
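The two alternative metrics can be sketched under plausible assumed definitions (the paper's exact formulas may differ): here "uncertainty ratio" is taken as the fraction of hypothesis words flagged uncertain, and "recall of error detection" as the fraction of actually erroneous words that were flagged, with word-level error labels assumed to come from an alignment against the reference.

```python
def uncertainty_ratio(flags):
    """Fraction of predicted words flagged as uncertain (assumed definition)."""
    return sum(flags) / len(flags) if flags else 0.0

def error_detection_recall(flags, is_error):
    """Of the words that are actually errors, the fraction that were flagged
    (assumed definition; `is_error` labels come from a reference alignment)."""
    errors = sum(is_error)
    if errors == 0:
        return 1.0  # vacuously perfect recall when there is nothing to detect
    hits = sum(f and e for f, e in zip(flags, is_error))
    return hits / errors
```

Both quantities stay well defined regardless of which words are dropped, which is presumably why they were preferred over WER in this selective setting.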

References

However, in long-form speech recognition, it is not clear how to evaluate WER when ignoring some words.

Pisets: A Robust Speech Recognition System for Lectures and Interviews  (2601.18415 - Bondarenko et al., 26 Jan 2026) in Uncertainty modeling metrics (Section: Uncertainty modeling)