Effective removal of out-of-context and line-break inserts in dictation transcripts

Develop a post-processing approach that reliably detects and removes out-of-context and line-break inserts in dictation-style transcripts while preserving correct paragraph segmentation. The motivation is an observed limitation: a trained Longformer-based model segmented the text into paragraphs correctly but removed too few inserts.

Background

In the 'Total Dictation' scenario, the authors used a Longformer-based model to detect out-of-context and line-break inserts for transcript cleanup. The model split the text into paragraphs correctly, but it did not remove enough of the inserts.

Reliable insert removal therefore remains an unresolved challenge for long-form, structured dictation settings of this kind.
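The gap between correct paragraph splitting and insufficient insert removal can be made concrete by scoring the two outcomes separately. The sketch below is only an illustration under stated assumptions: it supposes that inserts are marked per token with a hypothetical label set (KEEP, OOC, BREAK), which the source does not specify, and the function names apply_labels and insert_recall, as well as the toy sentence, are invented for this example. It is not the authors' method or evaluation.

```python
from typing import List

# Hypothetical label set (an assumption, not taken from the paper):
# KEEP = ordinary text, OOC = out-of-context insert, BREAK = spoken line-break insert.
KEEP, OOC, BREAK = "KEEP", "OOC", "BREAK"


def apply_labels(tokens: List[str], labels: List[str]) -> List[str]:
    """Drop tokens labeled as inserts and start a new paragraph at each BREAK run."""
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    paragraphs: List[List[str]] = [[]]
    for token, label in zip(tokens, labels):
        if label == KEEP:
            paragraphs[-1].append(token)
        elif label == BREAK and paragraphs[-1]:
            # Open a new paragraph only once per run of BREAK tokens.
            paragraphs.append([])
        # OOC tokens (and repeated BREAK tokens) are simply dropped.
    return [" ".join(p) for p in paragraphs if p]


def insert_recall(gold: List[str], pred: List[str]) -> float:
    """Fraction of gold insert tokens (OOC or BREAK) that were also predicted as inserts."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    gold_inserts = [i for i, g in enumerate(gold) if g != KEEP]
    if not gold_inserts:
        return 1.0
    hits = sum(1 for i in gold_inserts if pred[i] != KEEP)
    return hits / len(gold_inserts)


# Toy example: the prediction finds the line-break insert but misses most
# of the out-of-context remark.
tokens = "the sun rose new paragraph sorry one moment birds sang".split()
gold = [KEEP] * 3 + [BREAK] * 2 + [OOC] * 3 + [KEEP] * 2
pred = [KEEP] * 3 + [BREAK] * 2 + [KEEP, OOC, KEEP] + [KEEP] * 2

gold_paragraphs = apply_labels(tokens, gold)
pred_paragraphs = apply_labels(tokens, pred)
recall = insert_recall(gold, pred)
```

In this toy case both label sequences yield two paragraphs, so the segmentation matches, yet the predicted cleanup still contains "sorry moment" and the insert recall is only 0.6. Reporting the two quantities separately keeps a correct paragraph split from masking incomplete insert removal.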

References

It was not possible to remove a sufficient number of inserts, but it split the text into paragraphs correctly.

Pisets: A Robust Speech Recognition System for Lectures and Interviews (2601.18415 - Bondarenko et al., 26 Jan 2026) in Appendix: Dictation mistakes overview (Section: Linguistic Conditions of the Text)