Early evidence of how LLMs outperform traditional systems on OCR/HTR tasks for historical records (2501.11623v1)

Published 20 Jan 2025 in cs.CV, cs.AI, and cs.LG

Abstract: We explore the ability of two LLMs -- GPT-4o and Claude Sonnet 3.5 -- to transcribe historical handwritten documents in a tabular format and compare their performance to traditional OCR/HTR systems: EasyOCR, Keras, Pytesseract, and TrOCR. Considering the tabular form of the data, two types of experiments are executed: one where the images are split line by line and the other where the entire scan is used as input. Based on CER and BLEU, we demonstrate that LLMs outperform the conventional OCR/HTR methods. Moreover, we also compare the evaluated CER and BLEU scores to human evaluations to better judge the outputs of whole-scan experiments and understand influential factors for CER and BLEU. Combining judgments from all the evaluation metrics, we conclude that two-shot GPT-4o for line-by-line images and two-shot Claude Sonnet 3.5 for whole-scan images yield the transcriptions of the historical records most similar to the ground truth.

Summary

  • The paper finds that Large Language Models (LLMs) like GPT-4o and Claude Sonnet 3.5 generally outperform traditional OCR/HTR methods for transcribing historical handwritten documents based on BLEU and CER scores.
  • Zero-shot LLMs demonstrate better performance than fine-tuned traditional models like TrOCR, accurately transcribing elements like names which traditional tools often fail to produce.
  • Performance varies by method: line-by-line transcription generally scores higher, Claude Sonnet 3.5 excels on whole scans, and BLEU distinguishes LLM output quality better than CER.

The paper "Early evidence of how LLMs outperform traditional systems on OCR/HTR tasks for historical records" explores the ability of LLMs such as GPT-4o and Claude Sonnet 3.5 to transcribe historical handwritten documents in a tabular format. The authors compare the performance of these LLMs to traditional Optical Character Recognition (OCR) and Handwritten Text Recognition (HTR) systems, including EasyOCR, Keras, Pytesseract, and TrOCR. The evaluation uses Character Error Rate (CER) and Bilingual Evaluation Understudy (BLEU) scores.

The paper uses a dataset of 20 scanned pages of Belgian probate records from 1921, specifically "Déclaration de Succession" documents written in French with some Dutch influences. The authors executed two experimental setups (a hypothetical line-slicing sketch follows the list):

  • Line-by-line transcription
  • Whole-scan transcription
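How the scans were split into lines is not detailed in this summary, so the following is only a hypothetical illustration using Pillow: it assumes each tabular row can be cropped as a fixed-height horizontal strip, whereas in practice the row coordinates would likely come from layout analysis or the table's ruling lines.

```python
from PIL import Image

def slice_rows(scan_path: str, row_height: int = 120):
    """Hypothetically crop a tabular scan into horizontal line images."""
    scan = Image.open(scan_path)
    width, height = scan.size
    rows = []
    for top in range(0, height, row_height):
        # Crop one horizontal strip: (left, upper, right, lower)
        rows.append(scan.crop((0, top, width, min(top + row_height, height))))
    return rows

# Each cropped strip would then be passed to an OCR/HTR tool or an LLM.
# line_images = slice_rows("declaration_page_01.jpg")  # hypothetical file name
```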

The authors compared several strategies to enhance the LLMs' transcription performance: a simple prompt, a complex prompt, one-shot learning, two-shot learning, and iterative refinement. For TrOCR, the authors experimented with fine-tuning the model on 20% and 50% of the dataset. Human evaluations were also conducted to better understand the outputs of the whole-scan experiments and the factors that influence CER and BLEU.
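The prompting strategies are described only at a high level here, so the sketch below is a hedged illustration of what a two-shot transcription request to GPT-4o might look like with the OpenAI Python SDK; the file names, example transcriptions, and prompt wording are assumptions, not the authors' actual prompts.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def as_data_url(path: str) -> str:
    """Encode a local scan as a base64 data URL for the vision API."""
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

# Two worked examples (the "shots"), then the image to transcribe -- all hypothetical files.
messages = [
    {"role": "system",
     "content": "Transcribe the handwritten French table row exactly as written, preserving names, places, and dates."},
    {"role": "user", "content": [
        {"type": "text", "text": "Example 1:"},
        {"type": "image_url", "image_url": {"url": as_data_url("example_row_1.jpg")}}]},
    {"role": "assistant", "content": "Dupont, Marie | Bruxelles | 3 mars 1921"},
    {"role": "user", "content": [
        {"type": "text", "text": "Example 2:"},
        {"type": "image_url", "image_url": {"url": as_data_url("example_row_2.jpg")}}]},
    {"role": "assistant", "content": "Janssens, Pierre | Anvers | 17 juin 1921"},
    {"role": "user", "content": [
        {"type": "text", "text": "Now transcribe this row:"},
        {"type": "image_url", "image_url": {"url": as_data_url("target_row.jpg")}}]},
]

response = client.chat.completions.create(model="gpt-4o", messages=messages)
print(response.choices[0].message.content)
```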

Key findings include:

  • LLMs outperform conventional OCR/HTR methods in reproducing the texts in scanned images of historical records.
  • Zero-shot LLMs can correctly read names, whereas OCR/HTR tools without fine-tuning fail to produce meaningful outputs.
  • Zero-shot LLMs perform better than fine-tuned TrOCR.
  • In line-by-line experiments, OCR/HTR tools score near zero on the BLEU metric, while fine-tuned TrOCR variants score higher but still lower than LLMs.
  • The CER metric shows smaller differences between LLMs and OCR/HTR tools compared to the BLEU metric, though LLMs generally have lower error values.
  • Line-by-line experiments score higher on average than whole-scan experiments, but whole-scan experiments exhibit smaller variance in BLEU scores.
  • EasyOCR and Pytesseract perform better than KerasOCR and TrOCR in whole-scan experiments without fine-tuning but are limited to OCR rather than HTR.
  • LLM outputs contain more words similar to the ground truth than OCR outputs.

The authors find that BLEU and CER scores do not always align and that BLEU shows more distinct differences between LLMs and OCR/HTR tools. Human evaluations of the whole-scan experiments indicated that Claude Sonnet 3.5 with a two-example prompt returned the best outputs. Human evaluators were more lenient with outputs whose printed header was poorly transcribed, focusing instead on the handwritten content such as names, locations, and dates.

When headers were disregarded in whole-scan outputs, BLEU scores for the Claude methods increased while those for the GPT methods decreased, and the CER score for the two-example Claude method dropped dramatically. The authors conclude that understanding the structure of the outputs is crucial for comparing transcription quality, and that BLEU distinguishes transcription quality better than CER.
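The exact header-removal procedure is not spelled out in this summary; a minimal sketch of the idea, assuming the printed form header occupies a fixed number of leading lines in both the model output and the ground truth, could reuse the `cer` helper from the earlier metric sketch:

```python
def strip_header(text: str, header_lines: int = 2) -> str:
    """Drop the first few lines, assumed here to be the printed form header."""
    return "\n".join(text.splitlines()[header_lines:])

# Hypothetical strings: only the handwritten body is compared.
ground_truth = "DÉCLARATION DE SUCCESSION\nProvince de Brabant\nDupont, Marie, Bruxelles, 3 mars 1921"
llm_output   = "Declaration de Sucession\nProvince de Brabant\nDupont, Marie, Bruxelles, 3 mars 1921"

print(cer(strip_header(ground_truth), strip_header(llm_output)))  # cer() from the earlier sketch
```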

In conclusion, the paper highlights the potential of LLMs for transcribing historical handwritten documents and suggests that, on average, they perform better on sliced (line-by-line) images than on whole scans. GPT-4o performs best on sliced images, while Claude Sonnet 3.5 excels on whole scans, with the two-shot prompt strategy yielding the best results in both cases.