- The paper presents an ensemble method that combines character n-grams with sequence-to-sequence models for post-OCR correction.
- The method splits documents into overlapping character n-grams, corrects each window independently, and merges the outputs with a position-weighted voting scheme.
- Experiments on the ICDAR 2019 datasets show state-of-the-art results in five of nine languages, highlighting the approach's efficiency and scalability.
Post-OCR Document Correction with Large Ensembles of Character Sequence-to-Sequence Models
Introduction
Post-OCR document correction is vital for improving the accuracy of text produced by Optical Character Recognition (OCR), particularly for historical documents, whose vocabulary, typography, and layout pose unique challenges. While modern neural approaches, particularly sequence-to-sequence models, have advanced text correction substantially, handling long sequences in a resource-efficient manner remains a barrier. This study addresses these challenges with a strategy that processes documents using character-based sequence-to-sequence models and assembles their outputs through a robust ensemble approach.
Methodology
The core innovation of this approach is the use of character n-grams for correcting text sequences. Instead of processing document sequences in their entirety, which is computationally expensive, the method corrects short n-grams that can be processed in parallel, improving efficiency. Documents are segmented into n-grams, and each n-gram is corrected individually by a trained Transformer model.
Figure 1: An example of correcting a document using disjoint windows of length 5.
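A minimal sketch of the windowing step, assuming windows are indexed by their start offsets; `char_windows` is a hypothetical helper, not the authors' code:

```python
def char_windows(text: str, n: int, stride: int) -> list[tuple[int, str]]:
    """Slice a document into character windows of length n.

    stride == n yields the disjoint windows of Figure 1; stride == 1
    yields the fully overlapping n-grams of Figure 2. Each window keeps
    its start offset so corrections can be mapped back to the document.
    """
    if len(text) <= n:
        return [(0, text)]
    last_start = len(text) - n
    windows = [(i, text[i:i + n]) for i in range(0, last_start + 1, stride)]
    if windows[-1][0] != last_start:  # cover the tail of the document
        windows.append((last_start, text[last_start:]))
    return windows

# Windows of length 5, as in the figures.
print(char_windows("An 0CR error", 5, stride=5))  # disjoint
print(char_windows("An 0CR error", 5, stride=1))  # overlapping
```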
Sequence Model
A standard Transformer-based sequence-to-sequence model is employed, trained on aligned text pairs extracted from the raw OCR output and its correct transcription. The model learns to map noisy OCR character sequences to their corrected counterparts.
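In outline, such a model can be as simple as the sketch below; the `CharSeq2Seq` class and its hyperparameters are illustrative assumptions, not the authors' exact architecture:

```python
import torch
import torch.nn as nn

class CharSeq2Seq(nn.Module):
    """Minimal character-level encoder-decoder Transformer (illustrative;
    a real model also needs positional encodings, omitted here)."""

    def __init__(self, vocab_size: int, d_model: int = 256,
                 nhead: int = 4, num_layers: int = 3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, src: torch.Tensor, tgt: torch.Tensor) -> torch.Tensor:
        # Causal mask: each target position attends only to earlier ones.
        tgt_mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        hidden = self.transformer(self.embed(src), self.embed(tgt),
                                  tgt_mask=tgt_mask)
        return self.out(hidden)  # logits over the character vocabulary
```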
Document Correction
At inference time, documents are divided into overlapping n-grams, which alleviates context-based errors at window boundaries. The corrected windows are then merged through a voting scheme in which each character's vote is weighted by its position within the window. The overlapping windows thus act as an ensemble of sequence models jointly correcting the document, as sketched after Figure 2.
Figure 2: An example of correcting a document using n-grams of length 5.
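A minimal sketch of the merging step, assuming corrections preserve window length so each output character maps straight back to a document position (the paper's merging must also handle length-changing corrections); `merge_by_voting` and its interface are hypothetical:

```python
from collections import defaultdict

def merge_by_voting(corrections, doc_len, weight_fn):
    """Merge corrected windows into one document by weighted voting.

    corrections: (start_offset, corrected_window) pairs from the model.
    weight_fn(j, n): weight of position j inside a window of length n.
    """
    votes = [defaultdict(float) for _ in range(doc_len)]
    for start, window in corrections:
        for j, ch in enumerate(window):
            if start + j < doc_len:
                votes[start + j][ch] += weight_fn(j, len(window))
    # Each position keeps the character with the highest total weight.
    return "".join(max(v, key=v.get) if v else "" for v in votes)
```

With uniform weighting, `weight_fn` is simply `lambda j, n: 1.0`; position-dependent alternatives are sketched in the Results section below.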
Experimental Setup
The authors evaluated the approach on datasets from the ICDAR 2019 competition, spanning nine languages. Training data was built by aligning raw OCR output with its ground-truth transcription at the character level, and n-gram lengths were chosen to balance context against computational cost.
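One way to obtain such aligned training pairs is sketched below; `difflib` stands in for the paper's alignment procedure, which may differ:

```python
import difflib

def aligned_segments(ocr: str, gold: str):
    """Yield aligned (noisy, clean) character spans from an OCR line and
    its ground-truth transcription; the spans can then be cut into
    fixed-length (source, target) training windows."""
    sm = difflib.SequenceMatcher(None, ocr, gold, autojunk=False)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        yield ocr[i1:i2], gold[j1:j2]

# '0' misread for 'O': one replaced span, then an equal span.
for noisy, clean in aligned_segments("0ptical", "Optical"):
    print(repr(noisy), "->", repr(clean))
```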
Results and Discussion
The ensemble method achieved new state-of-the-art performance in five languages: Bulgarian, Czech, German, Spanish, and Dutch. Robust improvements were observed across languages with varied linguistic characteristics and document types, though gains were smaller in French owing to distinctive properties of that dataset. The method showed little sensitivity to the choice of weighting function: even simple uniform weighting produced competitive results.
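The paper's exact weighting functions are not reproduced here; the two below are plausible instances, compatible with the `merge_by_voting` sketch above:

```python
def uniform(j: int, n: int) -> float:
    """Every position in the window votes with equal weight."""
    return 1.0

def triangular(j: int, n: int) -> float:
    """Weight peaks at the window center, where the model has context on
    both sides, and falls off toward the window boundaries."""
    center = (n - 1) / 2
    return 1.0 - abs(j - center) / (center + 1.0)
```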
Figure 3: Document length in characters plotted against the Character Error Rate (CER) for each document in the German and French datasets.
The analysis suggests that increasing the window size does not consistently improve performance, with disparities potentially arising from language-specific characteristics. Greedy search, despite being the cheaper strategy, sometimes surpassed beam search, challenging the usual expectation that wider search refines sequence model output.
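The contrast between the two decoding strategies can be made concrete with a toy sketch over an abstract autoregressive `model(prefix) -> {char: log-prob}` callable; this interface is a hypothetical stand-in, not the paper's code:

```python
def greedy_decode(model, max_len: int, eos: str = "</s>") -> str:
    """Commit to the single most likely character at every step."""
    seq = []
    for _ in range(max_len):
        dist = model(seq)              # {char: log-prob} given the prefix
        ch = max(dist, key=dist.get)
        if ch == eos:
            break
        seq.append(ch)
    return "".join(seq)

def beam_decode(model, max_len: int, beam: int = 4, eos: str = "</s>") -> str:
    """Track the `beam` best prefixes, which can recover from a locally
    poor early choice that greedy decoding commits to."""
    beams = [([], 0.0)]                # (prefix, total log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == eos:         # finished hypotheses persist
                candidates.append((seq, score))
                continue
            for ch, lp in model(seq).items():
                candidates.append((seq + [ch], score + lp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam]
        if all(s and s[-1] == eos for s, _ in beams):
            break
    return "".join(ch for ch in beams[0][0] if ch != eos)
```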
Conclusions
This research demonstrates an efficient, scalable method for post-OCR correction harnessing character-based models. By adopting n-gram parallelism and ensemble techniques, the approach enhances correction accuracy without extensive computational overhead. Future extensions might explore applicability to other domains such as Automatic Speech Recognition, where sequence alignment challenges persist.
The findings promote scalable document correction, broadening its potential for resource-constrained environments and diverse linguistic applications.