
OCR Post Correction for Endangered Language Texts (2011.05402v1)

Published 10 Nov 2020 in cs.CL

Abstract: There is little to no data available to build natural language processing models for most endangered languages. However, textual data in these languages often exists in formats that are not machine-readable, such as paper books and scanned images. In this work, we address the task of extracting text from these resources. We create a benchmark dataset of transcriptions for scanned books in three critically endangered languages and present a systematic analysis of how general-purpose OCR tools are not robust to the data-scarce setting of endangered languages. We develop an OCR post-correction method tailored to ease training in this data-scarce setting, reducing the recognition error rate by 34% on average across the three languages.
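As a rough illustration of what an "error rate reduction" of this kind means, the sketch below computes a character error rate (edit distance divided by reference length) and the relative reduction between a raw OCR output and a post-corrected output. It assumes the error rate is a character-level edit-distance metric, which the abstract does not specify. The function names and sample strings are hypothetical and are not taken from the paper.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def char_error_rate(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference transcription must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(before: float, after: float) -> float:
    """Fraction by which the error rate dropped, relative to the starting rate."""
    if before <= 0:
        raise ValueError("starting error rate must be positive")
    return (before - after) / before


# Hypothetical strings, for illustration only (not data from the paper).
reference = "endangered language"
raw_ocr = "endangcred 1anguagc"     # three substituted characters
corrected = "endangered 1anguage"   # one substituted character remains

cer_before = char_error_rate(reference, raw_ocr)
cer_after = char_error_rate(reference, corrected)
print(f"CER before: {cer_before:.3f}, after: {cer_after:.3f}")
print(f"relative reduction: {relative_reduction(cer_before, cer_after):.0%}")
```

Under this definition, a 34% reduction is relative: an error rate of 10% would fall to about 6.6%, not to -24%.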

Authors (3)
  1. Shruti Rijhwani (25 papers)
  2. Antonios Anastasopoulos (111 papers)
  3. Graham Neubig (342 papers)
Citations (44)
