
Post-OCR Document Correction with Large Ensembles of Character Sequence-to-Sequence Models

Published 13 Sep 2021 in cs.CL (arXiv:2109.06264v3)

Abstract: In this paper, we propose a novel method based on character sequence-to-sequence models to correct documents already processed with Optical Character Recognition (OCR) systems. The main contribution of this paper is a set of strategies to accurately process strings much longer than the ones used to train the sequence model while being sample- and resource-efficient, supported by thorough experimentation. The strategy with the best performance involves splitting the input document in character n-grams and combining their individual corrections into the final output using a voting scheme that is equivalent to an ensemble of a large number of sequence models. We further investigate how to weigh the contributions from each one of the members of this ensemble. We test our method on nine languages of the ICDAR 2019 competition on post-OCR text correction and achieve a new state-of-the-art performance in five of them. Our code for post-OCR correction is shared at https://github.com/jarobyte91/post_ocr_correction.

Citations (13)

Summary

  • The paper presents a novel ensemble method that leverages character n-grams with sequence-to-sequence models for post-OCR correction.
  • The methodology includes dividing documents into overlapping n-grams and merging corrections using a weighted voting scheme to improve accuracy.
  • Experimental evaluations demonstrate state-of-the-art results in multiple languages, highlighting the approach’s efficiency and scalability.


Introduction

Post-OCR document correction is vital for improving the accuracy of text produced by Optical Character Recognition (OCR) systems, particularly on historical documents, whose vocabulary, typography, and layout pose unique challenges. While modern neural approaches, particularly sequence-to-sequence models, have substantially advanced text correction, handling long sequences in a resource-efficient manner remains a barrier. This study addresses these challenges with a strategy that processes documents using character-based sequence-to-sequence models and combines their outputs through a robust ensemble approach.

Methodology

The core innovation of this approach is the use of character n-grams for correcting text sequences. Instead of processing entire document sequences at once, which is computationally expensive, the method corrects short n-grams that can be processed in parallel, improving efficiency. Documents are segmented into n-grams, and each n-gram is corrected individually by a pre-trained Transformer model (Figure 1).

Figure 1: An example of correcting a document using disjoint windows of length 5.
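To make the splitting concrete, the following is a minimal sketch of the disjoint-window strategy illustrated in Figure 1. The `correct` function is a hypothetical stand-in for the trained sequence model, not part of the paper's released code.

```python
def correct(window: str) -> str:
    """Hypothetical stand-in for the trained character
    sequence-to-sequence model; here it returns its input unchanged."""
    return window

def correct_disjoint(document: str, n: int = 5) -> str:
    """Split the document into disjoint windows of length n, correct
    each window independently, and concatenate the results."""
    windows = [document[i:i + n] for i in range(0, len(document), n)]
    return "".join(correct(w) for w in windows)
```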

Sequence Model

A standard Transformer-based sequence-to-sequence model is employed, trained on aligned pairs of raw OCR text and its correct transcription. The model learns to map noisy input sequences to their corrected versions.
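As a rough illustration, a character-level encoder-decoder of this kind can be sketched in PyTorch as follows; the dimensions and layer counts below are placeholders, not the paper's hyperparameters.

```python
import torch
import torch.nn as nn

class CharSeq2Seq(nn.Module):
    """A minimal character-level encoder-decoder Transformer sketch."""

    def __init__(self, vocab_size: int, d_model: int = 128,
                 nhead: int = 4, num_layers: int = 2, max_len: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True,
        )
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, src: torch.Tensor, tgt: torch.Tensor) -> torch.Tensor:
        # src, tgt: (batch, seq_len) integer character indices
        pos_src = torch.arange(src.size(1), device=src.device)
        pos_tgt = torch.arange(tgt.size(1), device=tgt.device)
        src_emb = self.embed(src) + self.pos(pos_src)
        tgt_emb = self.embed(tgt) + self.pos(pos_tgt)
        # Causal mask so the decoder cannot attend to future characters.
        causal = self.transformer.generate_square_subsequent_mask(
            tgt.size(1)).to(src.device)
        hidden = self.transformer(src_emb, tgt_emb, tgt_mask=causal)
        return self.out(hidden)  # (batch, seq_len, vocab_size) logits
```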

Document Correction

At inference time, documents are divided into overlapping n-grams, alleviating context-based errors at window boundaries. The corrected segments are then merged using a voting scheme in which each character's votes are weighted by its position within the n-gram. This ensemble effectively acts as a large number of sequence models jointly applying corrections to the document (Figure 2).

Figure 2: An example of correcting a document using n-grams of length 5.
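A simplified sketch of the overlapping-window voting is shown below. It assumes each corrected window has the same length as its input, which real corrections need not satisfy; the actual method aligns window outputs before combining votes. `correct` is the same hypothetical model stub as above.

```python
from collections import Counter

def correct_with_voting(document: str, n: int = 5) -> str:
    """Correct every overlapping window of length n and let each
    window cast one vote per character position it covers; the
    majority character wins at each position."""
    if len(document) <= n:
        return correct(document)
    votes = [Counter() for _ in range(len(document))]
    for start in range(len(document) - n + 1):
        corrected = correct(document[start:start + n])
        for offset, char in enumerate(corrected[:n]):
            votes[start + offset][char] += 1
    return "".join(v.most_common(1)[0][0] for v in votes)
```

Since each interior position is covered by up to n windows, merging the votes behaves like an ensemble of many sequence models, which is the effect the paper exploits.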

Experimental Setup

The authors evaluated the approach on datasets from the ICDAR 2019 competition, spanning nine languages. Training involved pre-aligning OCR outputs with the ground truth to produce character-based pairs, with n-gram lengths chosen to balance context and computational feasibility.
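The alignment step can be approximated with standard library tooling; the sketch below uses Python's difflib rather than the competition's own alignment procedure, and the '@' gap symbol is an arbitrary choice for illustration.

```python
import difflib

def align_chars(ocr: str, gold: str):
    """Produce a character-aligned pair from an OCR string and its
    gold transcription, padding unequal spans with '@'."""
    sm = difflib.SequenceMatcher(a=ocr, b=gold, autojunk=False)
    src, tgt = [], []
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        a, b = ocr[i1:i2], gold[j1:j2]
        width = max(len(a), len(b))
        src.append(a.ljust(width, "@"))
        tgt.append(b.ljust(width, "@"))
    return "".join(src), "".join(tgt)

print(align_chars("th3 qvick", "the quick"))
```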

Results and Discussion

The ensemble method achieved new state-of-the-art performance in five languages: Bulgarian, Czech, German, Spanish, and Dutch. Robust improvements were observed across languages with varied linguistic characteristics and document types, although gains were smaller in French owing to properties of that dataset. The method showed only minor sensitivity to the choice of weighting function, indicating that even simple uniform weighting provides competitive results (Figure 3).

Figure 3: Distribution of the length in characters against the Character Error Rate for each document in the German and French datasets.
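To make the weighting discussion concrete, the following are example position-weight shapes that could be plugged into the voting scheme; they are illustrative candidates, not necessarily the exact functions evaluated in the paper.

```python
import numpy as np

def uniform(n: int) -> np.ndarray:
    # Every position in the n-gram votes with equal weight.
    return np.ones(n)

def triangle(n: int) -> np.ndarray:
    # Peak weight at the centre of the n-gram, where the model
    # sees the most context on both sides.
    mid = (n - 1) / 2
    return 1.0 - np.abs(np.arange(n) - mid) / (mid + 1)

def bell(n: int, sigma: float = 1.0) -> np.ndarray:
    # A smoother centre-peaked alternative.
    mid = (n - 1) / 2
    return np.exp(-((np.arange(n) - mid) ** 2) / (2 * sigma ** 2))
```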

The analysis suggests that increasing the window size does not consistently improve performance, with disparities potentially arising from language-specific characteristics. Greedy search, which is computationally cheaper, sometimes outperformed beam search, challenging the usual expectation that wider search improves sequence model outputs.
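The greedy decoding referred to here simply takes the highest-probability character at each step; a minimal loop compatible with the CharSeq2Seq sketch above might look as follows, where beam search would instead track the k best-scoring prefixes.

```python
import torch

@torch.no_grad()
def greedy_decode(model, src, bos: int, eos: int, max_len: int = 64):
    """Decode by repeatedly appending the argmax character."""
    tgt = torch.full((src.size(0), 1), bos, dtype=torch.long,
                     device=src.device)
    for _ in range(max_len - 1):
        logits = model(src, tgt)                # (batch, len, vocab)
        next_char = logits[:, -1].argmax(-1, keepdim=True)
        tgt = torch.cat([tgt, next_char], dim=1)
        if (next_char == eos).all():            # all sequences finished
            break
    return tgt
```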

Conclusions

This research demonstrates an efficient, scalable method for post-OCR correction built on character-based models. By combining n-gram parallelism with ensemble voting, the approach improves correction accuracy without extensive computational overhead. Future extensions might explore other domains, such as Automatic Speech Recognition, where similar sequence alignment challenges persist.

The findings promote scalable document correction, broadening its applicability to resource-constrained environments and diverse linguistic settings.
