olmOCR-Bench: OCR Benchmark for Scientific PDFs

Updated 10 May 2026

olmOCR-Bench is an English OCR benchmark comprising binary unit tests that evaluate textual accuracy, table integrity, and math formula reproduction in PDFs.
It leverages a synthetic document generation pipeline with semantic HTML to rigorously assess layout, natural reading order, and content presence or absence.
The benchmark supports reinforcement learning frameworks by providing detailed, interpretable rewards for optimizing vision–language models on complex document structures.

olmOCR-Bench is an English-language optical character recognition (OCR) benchmark built around binary unit tests. The tests assess not only character-level accuracy but also the fidelity of higher-order document structure, such as tables, mathematical formulas, and natural reading order, in digitized print documents (primarily PDFs). It is used to evaluate and guide the development of modern vision–language models (VLMs) and end-to-end OCR systems on challenging academic and scientific content (Poznanski et al., 22 Oct 2025, Taghadouini et al., 20 Jan 2026).

1. Definition and Purpose

olmOCR-Bench constitutes a comprehensive evaluation suite designed to stress-test vision–language document models on complex layout and semantic understanding tasks. Each document in olmOCR-Bench is associated with 10–20 binary unit tests, where each test targets a specific aspect of document structure or content reproduction. The benchmark assesses:

Text Presence: Verification that required phrases (such as section headings or captions) appear exactly once.
Text Absence: Confirmation the model has correctly omitted distractor content (e.g., running headers, page numbers).
Natural Reading Order: Assessment of sequential correctness, ensuring that spans of text occur in intended linear order with no interruptions or spurious interleaving.
Table Accuracy: Validation that table cell contents appear at correct row–column indices.
Math Formula Reproduction: Comparison of rendered mathematical formulas (using KaTeX) to verify visual layout matches the gold standard.
Baseline Robustness: Detection and suppression of edge-case content such as repeated tokens or non-target language noise.
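
To make the first three test types concrete, the following is a minimal sketch of how presence, absence, and reading-order checks could be written as binary predicates. The function names and the whitespace/case normalization are illustrative assumptions, not the benchmark's actual implementation, which may use fuzzy matching or other rules.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so layout differences do not matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def text_present_once(output: str, phrase: str) -> bool:
    """Presence test: the phrase must occur exactly once in the output."""
    return normalize(output).count(normalize(phrase)) == 1


def text_absent(output: str, phrase: str) -> bool:
    """Absence test: distractor text (e.g. a running header) must not occur."""
    return normalize(phrase) not in normalize(output)


def in_reading_order(output: str, first: str, second: str) -> bool:
    """Order test: `first` must occur, and `second` must occur after it."""
    haystack = normalize(output)
    needle = normalize(first)
    start = haystack.find(needle)
    if start == -1:
        return False
    return haystack.find(normalize(second), start + len(needle)) != -1


# Hypothetical model output for one page.
page_output = "1. Introduction\nOCR is hard.\n2. Methods\nWe use unit tests."
results = [
    text_present_once(page_output, "1. Introduction"),
    text_absent(page_output, "Page 3 of 12"),
    in_reading_order(page_output, "Introduction", "Methods"),
    in_reading_order(page_output, "Methods", "Introduction"),
]
print(results)  # [True, True, True, False]
```

Each predicate returns a single boolean, which is what allows results to be collected into a pass/fail vector per page.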

A system's performance is quantified as the mean fraction of unit tests passed (“page reward”) across the benchmark’s pages, with category-level sub-scores isolating nuances in model strengths and weaknesses (Poznanski et al., 22 Oct 2025).

2. Design and Construction of Unit Tests

Each binary unit test in olmOCR-Bench is derived from ground-truth semantic HTML (produced by VLM annotation and refinement pipelines), encoding the page’s logical structure. The synthetic document generation pipeline comprises three critical stages:

Layout Analysis: Utilizing a general-purpose VLM to classify document features (columns, tables, math blocks, headers/footers) in each page image.
Content Rendering: Prompting the VLM with the layout summary and image to generate semantic HTML using a rich tag set (<p>, <header>, <footer>, <table>, <math>).
Output Refinement: Re-rendering VLM-produced HTML to image form, then iteratively refining the HTML for maximal visual–structural fidelity.

From refined HTML, unit tests are programmatically extracted:

Sampling <header>/<footer> text for omission/presence checks,
Extracting (cell value, row, column) triples for table verification,
Generating reading-order assertions from interleaved content,
Harvesting LaTeX for KaTeX DOM comparison in formula accuracy tasks.

Test results are aggregated as a binary pass/fail vector per page, yielding fine-grained programmatic supervision for both evaluation and reward construction (Poznanski et al., 22 Oct 2025).
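
As a rough illustration of the extraction step, the sketch below walks semantic HTML with Python's standard `html.parser` and collects header/footer text (candidates for absence tests) and (cell value, row, column) triples (candidates for table tests). The class name `UnitTestExtractor` and its attributes are hypothetical, and the sketch ignores `rowspan`/`colspan` and nested tables, which a real pipeline would need to handle.

```python
from html.parser import HTMLParser


class UnitTestExtractor(HTMLParser):
    """Collects header/footer text and (value, row, col) table triples."""

    def __init__(self):
        super().__init__()
        self.absent_phrases = []  # text found inside <header>/<footer>
        self.table_cells = []     # (cell value, row index, column index)
        self._stack = []          # currently open tags
        self._row = -1
        self._col = -1

    def handle_starttag(self, tag, attrs):
        self._stack.append(tag)
        if tag == "table":
            self._row = -1
        elif tag == "tr":
            self._row += 1
            self._col = -1
        elif tag in ("td", "th"):
            self._col += 1

    def handle_endtag(self, tag):
        if tag in self._stack:
            # Pop back to (and including) the matching open tag.
            while self._stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text or not self._stack:
            return
        if self._stack[-1] in ("header", "footer"):
            self.absent_phrases.append(text)
        elif self._stack[-1] in ("td", "th"):
            self.table_cells.append((text, self._row, self._col))


# Hypothetical refined HTML for one page.
html = """
<header>Journal of Examples, Vol. 1</header>
<p>Results are shown below.</p>
<table>
  <tr><th>Model</th><th>Score</th></tr>
  <tr><td>A</td><td>0.9</td></tr>
</table>
<footer>Page 7</footer>
"""
extractor = UnitTestExtractor()
extractor.feed(html)
print(extractor.absent_phrases)  # ['Journal of Examples, Vol. 1', 'Page 7']
print(extractor.table_cells)
# [('Model', 0, 0), ('Score', 0, 1), ('A', 1, 0), ('0.9', 1, 1)]
```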

3. Evaluation Methodology and Metrics

olmOCR-Bench adopts a strict binary test-passing regime, reporting for each model:

Overall score: Mean test pass rate across all pages, reported with a confidence interval (typically about ±1 percentage point).
Sub-category scores: Task-level or document-type breakdowns—such as ArXiv (math), tables, multi-columns, headers/footers, long-text, and base text.
Per-page aggregation: Individual page reward defined as $r(\text{page}) = \frac{\#\text{passed tests}}{\#\text{total tests}}$.
Model-wide aggregation: Mean score across the corpus, enabling robust comparison against both open and closed-source baselines under identical, transparent test conditions.

No post-hoc test-time tricks (e.g., rotation sweeps, retries) are employed during evaluation, ensuring comparability (Taghadouini et al., 20 Jan 2026).
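
The aggregation above can be sketched in a few lines. The page reward and corpus mean follow the definitions in this section. The percentile bootstrap used for the confidence interval is an assumption made for illustration, since the source does not state how the reported intervals are computed; all function names are hypothetical.

```python
import random
from statistics import mean


def page_reward(test_results):
    """Fraction of binary unit tests passed on one page."""
    if not test_results:
        raise ValueError("a page needs at least one unit test")
    return sum(test_results) / len(test_results)


def overall_score(pages):
    """Mean page reward across the corpus."""
    return mean(page_reward(p) for p in pages)


def bootstrap_interval(pages, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap over pages (an assumed CI method, not the source's)."""
    rng = random.Random(seed)
    rewards = [page_reward(p) for p in pages]
    means = sorted(
        mean(rng.choices(rewards, k=len(rewards))) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Three hypothetical pages, each with its own pass/fail vector.
pages = [
    [True, True, False, True],     # reward 0.75
    [True, True],                  # reward 1.0
    [False, True, False, False],   # reward 0.25
]
score = overall_score(pages)
lo, hi = bootstrap_interval(pages)
print(f"overall = {score:.3f}, interval = [{lo:.3f}, {hi:.3f}]")
```

Averaging per-page rewards, rather than pooling all tests, gives every page equal weight regardless of how many unit tests it carries.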

4. Role in Reinforcement Learning and Model Development

olmOCR-Bench is central to the reinforcement learning with verifiable rewards (RLVR) paradigm used for training state-of-the-art document VLMs. Specifically, for models such as olmOCR-2-7B-1025, the reward function is composed of:

The fraction of unit tests passed per completion, $r(x, y)$,
An EOS token reward, $r_\text{EOS}(y)$,
A metadata compliance reward, $r_\text{meta}(y)$.

The combined reward, $R(x, y) = r(x, y) + r_\text{EOS}(y) + r_\text{meta}(y)$, provides dense, interpretable feedback suitable for policy optimization within Group Relative Policy Optimization (GRPO) frameworks. Synthetic–real data mixes are used for RL, with inference tuning (e.g., temperature scaling, YAML output) increasing robustness and accuracy (Poznanski et al., 22 Oct 2025).
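
A minimal sketch of how the combined reward could feed a GRPO-style update is given below. It assumes that $r_\text{EOS}$ and $r_\text{meta}$ are simple 0/1 bonuses, which the source does not specify, and it uses the standard group-relative advantage (reward minus group mean, divided by group standard deviation). Function names and the `eps` stabilizer are illustrative.

```python
from statistics import mean, pstdev


def combined_reward(unit_test_fraction, ends_with_eos, metadata_ok):
    """R(x, y) = r(x, y) + r_EOS(y) + r_meta(y), with assumed 0/1 bonus terms."""
    r_eos = 1.0 if ends_with_eos else 0.0
    r_meta = 1.0 if metadata_ok else 0.0
    return unit_test_fraction + r_eos + r_meta


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of completions for the same page."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three hypothetical completions sampled for the same page x.
group_rewards = [
    combined_reward(0.9, ends_with_eos=True, metadata_ok=True),
    combined_reward(0.5, ends_with_eos=True, metadata_ok=False),
    combined_reward(0.7, ends_with_eos=False, metadata_ok=True),
]
advantages = group_relative_advantages(group_rewards)
print(group_rewards)
print(advantages)
```

Because advantages are computed relative to the other completions for the same page, a completion is rewarded for passing more tests than its siblings rather than for reaching a fixed absolute score.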

5. Quantitative Results and Comparative Performance

olmOCR-Bench enables granular benchmarking against the broader OCR landscape. As of late 2025/early 2026, leading models achieve the following:

System	ArXiv	Old-Scans-Math	Tables	Multi-col	Overall
LightOnOCR-2-1B [1B]	89.6%	85.6%	89.0%	91%	83.2 ±0.9%
olmOCR-2-7B-1025 (+RLVR, souping)	83.0%	82.3%	84.9%	83.7%	82.4 ±1.1%
Chandra OCR 0.1.0 [7B]	82.2%	80.3%	88.0%	81.2%	83.1 ±0.9%
Infinity-Parser 7B	84.4%	83.8%	85.0%	84.2%	82.5%
PaddleOCR-VL* (Oct ’25)	85.7%	71.0%	84.1%	79.9%	80.0 ±1.0%
Mistral OCR API (Mar ’25)	77.2%	67.5%	60.6%	71.3%	72.0 ±1.1%
olmOCR (1st release, Feb ’25)	63.3%	67.5%	62.3%	67.6%	68.2 ±1.1%
OpenAI GPT-4o	68.9%	–	–	–	68.9 ±1.1%

*Select columns; full results in (Poznanski et al., 22 Oct 2025, Taghadouini et al., 20 Jan 2026).

Improvements in table and formula handling, multi-column layout parsing, and resistance to hallucinations are attributed directly to RLVR with synthetically generated binary unit tests. Model scaling, high-fidelity LaTeX and table normalization, and architectural innovations (high-res ViT, task-arithmetic merging) also contribute strongly to state-of-the-art performance (Poznanski et al., 22 Oct 2025, Taghadouini et al., 20 Jan 2026).

6. Contextualization within the OCR Benchmark Ecosystem

olmOCR-Bench is distinct among contemporary OCR benchmarks in its emphasis on structured, binary pass/fail criteria for complex document layout features. OCRBench v2 (Fu et al., 2024) provides a broader, multilingual, multi-domain test set, with tasks including text localization, handwritten content extraction, and logical reasoning. olmOCR-Bench instead introduces programmatically dense, HTML-derived unit tests specifically for reproducing English-language scientific and technical PDFs. Compared to domain-specific benchmarks (e.g., food packaging (Nagayi et al., 3 Oct 2025)), olmOCR-Bench prioritizes scientific document fidelity and layout/semantic correctness.

olmOCR-Bench serves as both an evaluation framework and a synthetic reward suite, informing state-of-the-art methods in document vision–language processing. It remains an active reference for researchers seeking standardized, reproducible, and interpretable scoring on high-value structured document OCR tasks.

References:

(Poznanski et al., 22 Oct 2025) “olmOCR 2: Unit Test Rewards for Document OCR”
(Taghadouini et al., 20 Jan 2026) “LightOnOCR: A 1B End-to-End Multilingual Vision-Language Model for State-of-the-Art OCR”
(Fu et al., 2024) “OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning”