A Tool for Facilitating OCR Postediting in Historical Documents (2004.11471v1)

Published 23 Apr 2020 in cs.CL

Abstract: Optical character recognition (OCR) for historical documents is a complex procedure subject to a unique set of material issues, including inconsistencies in typefaces and low-quality scanning. Consequently, even the most sophisticated OCR engines produce errors. This paper reports on a tool built for postediting the output of Tesseract, more specifically for correcting common errors in digitized historical documents. The proposed tool suggests alternatives for word forms not found in a specified vocabulary. The assumed error is replaced by a presumably correct alternative in the post-edition based on the scores of a language model (LM). The tool is tested on a chapter of the book An Essay Towards Regulating the Trade and Employing the Poor of this Kingdom (Cary, 1719). As demonstrated below, the tool is successful in correcting a number of common errors. If sometimes unreliable, it is also transparent and subject to human intervention.
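The workflow described in the abstract (flag word forms missing from a vocabulary, propose alternatives, pick one by LM score, keep the decision visible to a human) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the toy corpus, the single-edit candidate generator, the add-one-smoothed bigram model used as the LM, and the names `candidates`, `score`, and `postedit` are all hypothetical stand-ins.

```python
import math
from collections import Counter

# Hypothetical toy corpus standing in for in-domain text (assumption).
CORPUS = ("the trade of this kingdom employs the poor and relieves "
          "the poor while the door stays open").split()

VOCAB = set(CORPUS)                         # the "specified vocabulary"
UNIGRAMS = Counter(CORPUS)
BIGRAMS = Counter(zip(CORPUS, CORPUS[1:]))
ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """Return every string one deletion, substitution, or insertion away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    replaces = [left + c + right[1:] for left, right in splits if right for c in ALPHABET]
    inserts = [left + c + right for left, right in splits for c in ALPHABET]
    return set(deletes + replaces + inserts)


def candidates(word):
    """In-vocabulary alternatives for an out-of-vocabulary word form."""
    return sorted(edits1(word) & VOCAB)


def score(prev, word):
    """Add-one-smoothed log-probability of `word` given the previous token."""
    if prev is None:  # sentence start: fall back to a unigram estimate
        return math.log((UNIGRAMS[word] + 1) / (len(CORPUS) + len(VOCAB)))
    return math.log((BIGRAMS[(prev, word)] + 1) / (UNIGRAMS[prev] + len(VOCAB)))


def postedit(tokens):
    """Replace out-of-vocabulary tokens with the best-scoring alternative.

    Returns the corrected tokens and a log of (original, replacement)
    pairs so that a human can review or revert each change.
    """
    corrected, changes = [], []
    for token in tokens:
        options = [] if token in VOCAB else candidates(token)
        if not options:
            corrected.append(token)  # known word, or nothing to suggest
            continue
        prev = corrected[-1] if corrected else None
        best = max(options, key=lambda cand: score(prev, cand))
        corrected.append(best)
        changes.append((token, best))
    return corrected, changes


if __name__ == "__main__":
    print(postedit("employs the qoor".split()))
```

Here the unknown form `qoor` has two in-vocabulary alternatives, `door` and `poor`; the sketch prefers `poor` because `the poor` is more frequent than `the door` in the toy corpus. Returning the change log alongside the corrected text mirrors the abstract's point that the output stays transparent and open to human intervention.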

Authors

Alberto Poncelas
Mohammad Aboomar
Jan Buts
James Hadley
Andy Way

