- The paper evaluates open-weight LLMs for OCR error post-correction in historical English and Finnish, identifying key factors like language, segment length, and context handling.
- While several LLMs improved English CER and WER, only GPT-4o achieved positive results for historical Finnish, revealing a strong language dependence in post-correction performance.
- Effective post-correction depends on factors like segment length, providing contextual information across boundaries, and appropriate post-processing like overgeneration removal.
The paper "OCR Error Post-Correction with LLMs in Historical Documents: No Free Lunches" evaluates open-weight LLM performance in Optical Character Recognition post-correction for historical English and Finnish datasets. The authors investigated diverse strategies, including parameter optimization, quantization, segment length effects, and text continuation methods. The paper used the ECCO-TCP dataset for English and the National Library of Finland (NLF) ground truth data for Finnish. Character Error Rate (CER) and Word Error Rate (WER) served as primary evaluation metrics, with relative reduction calculated as $\text{CER\%} = \frac{\text{CER}_{\text{orig}} - \text{CER}_{\text{post}}}{\text{CER}_{\text{orig}}} \times 100$, where:
- $\text{CER}_{\text{orig}}$ is the Character Error Rate before correction
- $\text{CER}_{\text{post}}$ is the Character Error Rate after correction
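The metric above can be sketched in a few lines. This is a minimal illustration, not the paper's actual evaluation code; the Levenshtein implementation here is a simple stand-in for whatever edit-distance tooling the authors used.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def cer(hypothesis: str, reference: str) -> float:
    """Character Error Rate: edit distance normalized by reference length."""
    return levenshtein(hypothesis, reference) / len(reference)

def relative_cer_reduction(cer_orig: float, cer_post: float) -> float:
    """CER% = (CER_orig - CER_post) / CER_orig * 100; positive means improvement."""
    return (cer_orig - cer_post) / cer_orig * 100
```

A negative CER% thus indicates the model made the text worse, which is how the paper reports degradation on Finnish for most models.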
They normalized Unicode whitespace, addressed historical vs. modern spelling differences using Unicode NFKC normalization, and replaced 'w' with 'v' in Finnish text.
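The normalization steps can be sketched as follows. The exact pipeline is not reproduced in this summary, so the function below is an assumption built from the steps named above (NFKC normalization, whitespace normalization, and the Finnish w→v replacement).

```python
import re
import unicodedata

def normalize(text: str, finnish: bool = False) -> str:
    # Unicode compatibility normalization (NFKC), e.g. ligatures and
    # non-breaking spaces become their plain equivalents
    text = unicodedata.normalize("NFKC", text)
    # Collapse all Unicode whitespace runs to single ASCII spaces
    text = re.sub(r"\s+", " ", text).strip()
    # Historical Finnish orthography commonly writes /v/ as 'w'
    if finnish:
        text = text.replace("w", "v").replace("W", "V")
    return text
```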
The researchers evaluated Llama-3-8B-Instruct, Llama-3.1-8B-Instruct, Llama-3.1-70B-Instruct, Mixtral-8x7B-Instruct-v0.1, Gemma-2-9B-it, and Gemma-2-27B-it, along with OpenAI's GPT-4o. They used the Ollama framework with default 4-bit quantization, experimenting with other quantization levels separately. An overgeneration removal method, based on aligning generated output against the original input, was applied to filter out leading and trailing texts.
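The overgeneration removal step can be sketched with a standard-library alignment. The paper's exact alignment algorithm is not detailed here; this version uses `difflib.SequenceMatcher` as an illustrative substitute, trimming whatever the model prepends or appends outside the span that aligns with the OCR input.

```python
from difflib import SequenceMatcher

def trim_overgeneration(ocr_input: str, model_output: str) -> str:
    """Keep only the part of model_output that aligns with ocr_input,
    discarding leading/trailing additions such as 'Here is the corrected text:'."""
    sm = SequenceMatcher(None, ocr_input, model_output, autojunk=False)
    blocks = [b for b in sm.get_matching_blocks() if b.size > 0]
    if not blocks:
        return model_output  # nothing aligned; leave the output untouched
    start = blocks[0].b
    end = blocks[-1].b + blocks[-1].size
    return model_output[start:end]
```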
The paper optimized temperature, top_k, and top_p parameters using the Optuna library, selecting parameters for each language based on the median value of the 10 best runs. For English, the parameters were temperature 0.26, top_k 65, and top_p 0.66, while Finnish used temperature 0.14, top_k 30, and top_p 0.60.
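The paper ran the search itself with Optuna; the sketch below shows only the final selection rule, taking the per-parameter median over the 10 best-scoring trials. The trial data structure here is illustrative, not the paper's code.

```python
import statistics

def select_params(trials, k=10):
    """trials: list of (params_dict, cer_score) pairs; lower score is better.
    Returns the median of each parameter over the k best trials."""
    best = sorted(trials, key=lambda t: t[1])[:k]
    names = best[0][0].keys()
    return {name: statistics.median(t[0][name] for t in best) for name in names}
```

Taking the median over the top trials, rather than the single best trial, reduces the chance of overfitting the decoding parameters to one lucky run.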
Results showed that six of seven models improved English CER, ranging from 7.3% (Llama-3-8B) to 58.1% (GPT-4o). All models improved English WER, with GPT-4o at 59.1% and Llama-3.1-70B at 46.3%. In contrast, GPT-4o was the sole model achieving positive improvement for Finnish with 11.9 CER% and 33.5 WER%. Overgeneration removal significantly improved results for Llama models.
The team compared models at 4-bit Q4_0 quantization and 16-bit fp16; fp16 generally performed better, gaining 2.5-4.7 percentage points, but at a higher memory cost.
The effect of segment length was examined by dividing pages into non-overlapping segments of 50, 100, 200, and 300 words. Shorter segments (50-100 words) resulted in worse CER% scores. They evaluated post-correction methods on segment boundaries: Baseline (independent correction), Left-corrected-concatenate (LCC) (left segment corrected first), and Left-uncorrected-concatenate (LUC) (uncorrected left segment provided for context). LCC and LUC improved post-correction at the boundary.
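The three boundary-handling strategies can be sketched as below. The `correct` callable is a placeholder for the LLM call (an assumption; the paper's prompts and model invocation are only outlined in this summary), and `context` stands for the left-hand text supplied across the segment boundary.

```python
def segment(words, n):
    """Split a word list into non-overlapping segments of n words."""
    return [words[i:i + n] for i in range(0, len(words), n)]

def baseline(correct, left, right):
    # Each segment corrected independently, no cross-boundary context
    return correct(left), correct(right)

def lcc(correct, left, right):
    # Left-corrected-concatenate: correct the left segment first,
    # then pass its corrected form as context for the right segment
    left_out = correct(left)
    return left_out, correct(right, context=left_out)

def luc(correct, left, right):
    # Left-uncorrected-concatenate: pass the raw (uncorrected) left
    # segment as context for the right segment
    return correct(left), correct(right, context=left)
```

LUC avoids sequential dependence between segments, so unlike LCC it allows all segments to be corrected in parallel.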
The authors conclude that while open-weight models show promise for English post-correction, zero-shot post-correction remains out of reach for historical Finnish. They highlight the importance of post-processing, segment length, and methods for incorporating context at segment boundaries.