A Tool for Facilitating OCR Postediting in Historical Documents (2004.11471v1)

Published 23 Apr 2020 in cs.CL

Abstract: Optical character recognition (OCR) for historical documents is a complex procedure subject to a unique set of material issues, including inconsistencies in typefaces and low-quality scanning. Consequently, even the most sophisticated OCR engines produce errors. This paper reports on a tool built for postediting the output of Tesseract, more specifically for correcting common errors in digitized historical documents. The proposed tool suggests alternatives for word forms not found in a specified vocabulary. The assumed error is replaced by a presumably correct alternative in the post-edition based on the scores of a language model (LM). The tool is tested on a chapter of the book An Essay Towards Regulating the Trade and Employing the Poor of this Kingdom (Cary, 1719). As demonstrated below, the tool is successful in correcting a number of common errors. If sometimes unreliable, it is also transparent and subject to human intervention.
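
The workflow the abstract describes (flag word forms missing from a vocabulary, propose alternatives, keep the one a language model scores highest) can be illustrated with a minimal sketch. This is not the authors' implementation: the edit-distance-1 candidate generator, the add-one smoothed bigram model, the toy corpus, and the names `candidates`, `bigram_score`, and `postedit` are all assumptions made for illustration.

```python
from collections import Counter
import string


def candidates(word, vocab):
    """Vocabulary words reachable from `word` by one deletion, substitution, or insertion."""
    letters = string.ascii_lowercase
    edits = set()
    for i in range(len(word) + 1):
        left, right = word[:i], word[i:]
        if right:
            edits.add(left + right[1:])                          # deletion
            edits.update(left + c + right[1:] for c in letters)  # substitution
        edits.update(left + c + right for c in letters)          # insertion
    return sorted(edits & vocab)


def bigram_score(prev, word, bigrams, unigrams):
    """Add-one smoothed estimate of P(word | prev) from toy counts (assumed LM)."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + len(unigrams))


def postedit(tokens, vocab, bigrams, unigrams):
    """Replace out-of-vocabulary tokens with the best-scoring in-vocabulary candidate."""
    out = []
    for tok in tokens:
        if tok in vocab:
            out.append(tok)
            continue
        prev = out[-1] if out else "<s>"
        cands = candidates(tok, vocab)
        if cands:
            out.append(max(cands, key=lambda c: bigram_score(prev, c, bigrams, unigrams)))
        else:
            out.append(tok)  # no suggestion: leave the token for a human editor
    return out


# Toy training text standing in for a real corpus (hypothetical data).
corpus = "<s> the trade of this kingdom employs the poor".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
vocab = set(corpus) - {"<s>"}

print(postedit("the trade of thif kingdom".split(), vocab, bigrams, unigrams))
```

Here the out-of-vocabulary form `thif` has a single in-vocabulary neighbour, `this`, so it is replaced. A token with no candidates is left unchanged, which keeps the process transparent and open to human intervention, in the spirit of the abstract.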

Authors (5)
  1. Alberto Poncelas (15 papers)
  2. Mohammad Aboomar (1 paper)
  3. Jan Buts (2 papers)
  4. James Hadley (5 papers)
  5. Andy Way (46 papers)
Citations (10)
