OlmOCR-Bench: OCR Evaluation Benchmark
- OlmOCR-Bench is a comprehensive benchmark that evaluates OCR systems on diverse, real-world document layouts with challenging structures.
- It employs a unit-test-driven methodology built on synthetic annotations; its unit tests also double as reward signals for reinforcement learning with verifiable rewards (RLVR), targeting accurate extraction of structured text such as formulas, tables, and multi-column layouts.
- State-of-the-art models demonstrate notable improvements on this benchmark, though challenges remain in parsing complex layouts and resolving low-quality scans.
OlmOCR-Bench is a unit-test-driven, end-to-end evaluation benchmark developed to assess the practical correctness of modern Optical Character Recognition (OCR) systems on diverse, challenging real-world documents. It is specifically tailored for benchmarking large multimodal models (LMMs) and vision-LLMs (VLMs) tasked with extracting clean, structured, and reading-order-faithful text—including mathematical formulas, tables, multi-column layouts, and scanned material—from complex document images such as PDFs. OlmOCR-Bench is widely referenced as a standard for measuring state-of-the-art progress in end-to-end document OCR and is tightly integrated with recent advances in reinforcement learning from verifiable rewards (RLVR) and in synthetic unit-test annotation pipelines (Poznanski et al., 22 Oct 2025, Taghadouini et al., 20 Jan 2026).
1. Dataset Composition and Document Coverage
OlmOCR-Bench evaluates OCR systems on a diverse set of challenging document types systematically designed to stress key weaknesses of prior state-of-the-art models. The benchmark comprises 1,403 pages selected to cover the following six principal categories, each representing unique content and layout challenges (Taghadouini et al., 20 Jan 2026, Poznanski et al., 22 Oct 2025):
| Category | Description |
|---|---|
| ArXiv | Scientific PDFs with dense mathematics, complex structures |
| Old Scans Math | Degraded scanned manuscripts rich in formulas |
| Tables | Pages featuring tables with complex, nested, or irregular layouts |
| Old Scans | Non-mathematical scanned text, including noise and OCR artifacts |
| Multi-column | Documents with two or more columns, demanding robust reading order |
| Long Tiny Text | Pages with long stretches of highly condensed or extremely small text |
Although the benchmark is predominantly English and Latin-script-based, occasional samples may exercise multilingual tokenization and layout handling. Headers and footers are specifically excluded from most reported category scores, given the benchmark's emphasis on end-to-end extraction fidelity rather than trivial page artifacts.
The benchmark operates as a fixed, held-out evaluation set: no subdivisions for training or validation exist, ensuring all evaluations are performed “zero-shot” under identical, deterministic conditions (single-pass decoding, no rotation or retry).
2. Benchmark Construction: Synthetic Unit-Test Pipeline
The core methodology for constructing OlmOCR-Bench is the automatic generation of robust, fine-grained unit tests for every page, built atop synthetic ground-truth derived from HTML renderings. The synthetic annotation pipeline proceeds in three VLM-mediated stages (Poznanski et al., 22 Oct 2025):
- PDF Sourcing: Sampling of real-world documents (e.g., arXiv for mathematics-heavy content, archival scans for complex layouts).
- PDF to HTML Conversion: General-language VLMs (e.g., Claude-Sonnet-4) are prompted on rendered page images to analyze structural layout and produce semantic HTML with high visual fidelity.
- Unit-Test Extraction: Programmatic scanning of the HTML to generate binary predicates—asserting, for instance, the correct presence, ordering, or absence of phrases; cell-level table placements; visual equivalence of rendered LaTeX for equations; alignment to natural reading order across columns; and non-repetition of n-grams/foreign glyphs in long text.
The result is a library of machine-checked unit tests—over 30,000 for the 2,186-page synthetic pre-mix (Poznanski et al., 22 Oct 2025)—enabling scalable, precise verification without reliance on manual annotation.
3. Evaluation Methodology and Metrics
OlmOCR-Bench eschews traditional error rates such as character error rate (CER) and word error rate (WER) in favor of strict, deterministic, task-specific unit tests. For each page $p$, a set of binary unit tests is defined, and the page-level score is the mean pass rate:

$$\mathrm{score}(p) = \frac{1}{N_p} \sum_{i=1}^{N_p} t_i(y_p),$$

where $N_p$ is the number of tests for the page, and $t_i(y_p) \in \{0, 1\}$ indicates pass/fail for output $y_p$.
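A minimal Python sketch of this scoring scheme, assuming boolean pass/fail results per test; helper names are illustrative, not the benchmark's toolkit:

```python
from statistics import mean

def page_score(results):
    """Page-level score: mean pass rate over the page's binary unit tests."""
    return mean(results)  # bools are ints in Python, so mean works directly

def overall_score(category_page_scores):
    """Overall score: mean of per-category averages of page-level scores."""
    per_category = {c: mean(s) for c, s in category_page_scores.items()}
    return mean(per_category.values())
```

Averaging per category first (rather than pooling all pages) keeps small categories from being drowned out by large ones.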
Categories of tests are detailed as follows (Poznanski et al., 22 Oct 2025, Taghadouini et al., 20 Jan 2026):
- Plain-text recognition (Base): Key phrase presence/absence and order.
- Mathematical formula conversion: Strict visual equivalence checks via KaTeX DOM bounding boxes for LaTeX.
- Table parsing: Cell-wise placement and content verification in Markdown or HTML representations.
- Multi-column layout: Natural reading order validation across multiple columns.
- Header/Footer Removal: Suppression of page-number/footer text in the main reading flow.
- Long Text Robustness: Detection and penalization of hallucinated/repeated n-grams and non-target character sets.
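The long-text robustness check can be pictured as a repeated n-gram detector; the sketch below is illustrative, with the window size and repeat threshold chosen arbitrarily rather than taken from the benchmark:

```python
from collections import Counter

def has_repeated_ngram(tokens, n=5, max_repeats=3):
    """Flag degenerate repetition: the same n-gram occurring more than
    `max_repeats` times suggests the model is looping or hallucinating."""
    counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return any(c > max_repeats for c in counts.values())
```

A page fails the corresponding unit test if its output triggers this detector or contains characters outside the target script.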
The “Overall” score is the mean of the per-category averages. For comparative context, formulas for CER and WER appear in the literature for completeness, but OlmOCR-Bench itself does not utilize these for scoring (Poznanski et al., 22 Oct 2025, Taghadouini et al., 20 Jan 2026):
$$\mathrm{CER} = \frac{S + D + I}{N_c}, \qquad \mathrm{WER} = \frac{S + D + I}{N_w},$$

with $S$ = substitutions, $D$ = deletions, $I$ = insertions, $N_c$ = characters, $N_w$ = words.
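For reference, CER and WER reduce to an edit-distance computation; a minimal sketch using standard Levenshtein dynamic programming, not the benchmark's own code:

```python
def edit_ops(ref, hyp):
    """Levenshtein distance between sequences, i.e. minimal S + D + I."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))  # one-row DP table
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                      # deletion
                        dp[j - 1] + 1,                  # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[n]

def cer(ref: str, hyp: str) -> float:
    return edit_ops(list(ref), list(hyp)) / len(ref)

def wer(ref: str, hyp: str) -> float:
    return edit_ops(ref.split(), hyp.split()) / len(ref.split())
```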
4. Model Evaluation and State-of-the-Art Results
OlmOCR-Bench serves as the primary evaluation suite for state-of-the-art end-to-end OCR and vision-LLMs, including LightOnOCR-2-1B and olmOCR-2-7B-1025. Results are reported in terms of overall and per-category pass rates, reflecting the percentage of pages with all relevant unit tests passed (Taghadouini et al., 20 Jan 2026, Poznanski et al., 22 Oct 2025):
| System | ArXiv | Old Scans Math | Tables | Old Scans | Multi-col | Long Text | Base | Overall |
|---|---|---|---|---|---|---|---|---|
| LightOnOCR-2-1B | 89.6 | 85.6 | 89.0 | 42.2 | 84.8 | 99.6 | — | 83.2 ± 0.9 |
| Chandra-9B (9B params) | 82.2 | 80.3 | 88.0 | 50.4 | 81.2 | 99.9 | — | 81.7 ± 0.9 |
| olmOCR-2-7B-1025 | 83.0 | 82.3 | 84.9 | 47.7 | 83.7 | 96.1 | 99.7 | 82.4 ± 1.1 |
| Infinity-Parser 7B* | 84.4 | 83.8 | 85.0 | 47.9 | 84.2 | 86.4 | 99.8 | 82.5 |
*Results for Infinity-Parser 7B, PaddleOCR-VL, and Chandra OCR are as reported. Other systems are reproduced in-house (Poznanski et al., 22 Oct 2025).
Key quantitative milestones:
- LightOnOCR-2-1B (1B params) surpasses Chandra-9B (9B params) with 83.2 ± 0.9 vs. 81.7 ± 0.9 overall, and markedly outperforms prior models in ArXiv and mathematical scan categories (+7.4 and +5.3 points, respectively) (Taghadouini et al., 20 Jan 2026).
- RLVR fine-tuning provides a 1.4 point overall gain for LightOnOCR-2-1B (81.8 → 83.2).
- olmOCR-2-7B-1025 achieves improvements of +19.7 (ArXiv), +14.8 (Old Scans Math), +22.6 (Tables), +16.1 (Multi-column), and +27.1 (Long Text) percentage points over baseline olmOCR releases (Poznanski et al., 22 Oct 2025).
No single model yet achieves near-perfect pass rates across all categories, with “Tables” and especially “Old Scans” remaining the weakest, but RLVR-based models show consistent, cross-category gains.
5. Key Methodological Insights: RLVR and Unit-Testing Framework
Reinforcement Learning with Verifiable Rewards (RLVR) is central to recent improvements on OlmOCR-Bench. RLVR utilizes the same unit tests used for evaluation to define reward signals during policy optimization—a tight loop that directly targets actionable, high-precision document extraction objectives (Poznanski et al., 22 Oct 2025, Taghadouini et al., 20 Jan 2026). Specifically:
- Reward Signal: Each unit test (a binary predicate on the parsed output) provides a dense, interpretable supervision signal, such as matching rendered-LaTeX bounding boxes for formulas or verifying table cell placements.
- Policy Update: Group Relative Policy Optimization (GRPO), regularized with KL-divergence, is applied; random seed soups (weight-averaged final models) further stabilize improvements.
- Auxiliary Rewards: End-of-sequence (EOS) token and metadata quality are optionally included (weighted equally).
This synergy enables rapid and targeted optimization of structured text extraction, as evidenced by dramatic quantitative gains in structured content categories.
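The reward and advantage computation can be sketched as follows; this is a hedged illustration of a GRPO-style update using unit-test pass rates, with all names hypothetical and the actual olmOCR-2 / LightOnOCR training code not reproduced here:

```python
def unit_test_reward(output, tests):
    """Verifiable reward: fraction of binary unit tests the output passes."""
    return sum(t(output) for t in tests) / len(tests)

def grpo_advantages(rewards):
    """Group-relative advantages: rewards for a group of samples from the
    same prompt, centered and scaled by the group's own statistics."""
    mu = sum(rewards) / len(rewards)
    var = sum((r - mu) ** 2 for r in rewards) / len(rewards)
    return [(r - mu) / (var ** 0.5 + 1e-8) for r in rewards]
```

Because the reward is the very quantity the benchmark measures, policy optimization directly pushes pass rates upward; KL regularization (omitted above) keeps the policy close to the supervised model.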
6. Analysis of Benchmark Impact and Failure Modes
OlmOCR-Bench has crystallized several key insights for both model architecture and evaluation (Poznanski et al., 22 Oct 2025, Taghadouini et al., 20 Jan 2026):
- Largest single-task gains arise in mathematical formula conversion, table parsing, multi-column reading order, and long-text stretch robustness.
- Simple base text recognition and header/footer removal show smaller but still positive gains from RLVR.
- Unit-test pass rates are statistically robust: overall improvements of +14 points between first- and second-generation olmOCR variants far exceed the ±1.1 point 95% confidence interval (Poznanski et al., 22 Oct 2025).
Common failure modes uncovered by the benchmark include:
- Degraded or low-contrast scans (especially in the Old Scans Math category).
- Dense mathematical or symbolic content intermixed with non-tabular layouts.
- Complicated/layered tables requiring deep structural parsing.
- Residual instances of repeated n-grams or hallucinated text in long-tiny-text settings.
Ongoing improvements in synthetic test annotation, task-specific reward shaping, and high-resolution model architectures are the most plausible avenues for future progress.
7. Benchmark Availability and Integration Guidelines
OlmOCR-Bench is released under permissive open licenses, enabling reproducible, side-by-side evaluation of arbitrary OCR models (Poznanski et al., 22 Oct 2025, Taghadouini et al., 20 Jan 2026). To ensure valid comparison and leaderboard submission:
- Models must be evaluated on the fixed hold-out set using the provided unit-test scripts without any data augmentation, rotation, or retry heuristics.
- Output format must include appropriate Markdown+LaTeX structure for deterministic evaluation.
- In localization-augmented variants (e.g., LightOnOCR-bbox-bench), normalization of bounding boxes to a [0,1000] scale is required.
For integration, standard instruction templates and metric computation toolkits (Python scripts) are distributed alongside the data.
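For the localization-augmented variants, the required bounding-box normalization can be sketched as below; the [0, 1000] scale is per the text, while the rounding choice and function name are assumptions:

```python
def normalize_bbox(bbox, page_width, page_height, scale=1000):
    """Map a pixel-space (x0, y0, x1, y1) box onto the [0, 1000] grid
    expected by localization-augmented variants (e.g. LightOnOCR-bbox-bench)."""
    x0, y0, x1, y1 = bbox
    return (round(x0 / page_width * scale), round(y0 / page_height * scale),
            round(x1 / page_width * scale), round(y1 / page_height * scale))
```

Normalizing to a fixed integer grid makes coordinates resolution-independent, so outputs from pages of different sizes remain directly comparable.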
OlmOCR-Bench represents the current standard for rigorous, actionable evaluation of OCR systems in research and large-scale document understanding, driving advancements in vision-language modeling and robust automated document digitization (Poznanski et al., 22 Oct 2025, Taghadouini et al., 20 Jan 2026).