olmOCR-2-7B-1025 OCR Model

Updated 23 October 2025
  • olmOCR-2-7B-1025 is a 7-billion-parameter vision-language model designed for high-fidelity OCR on complex documents such as PDFs.
  • It employs a transformer-based architecture with two-stage training: supervised fine-tuning followed by reinforcement learning against verifiable binary unit tests derived from ground-truth HTML.
  • The model substantially improves conversion accuracy for math formulas, tables, and multi-column layouts, gaining 14.2 points on the olmOCR-Bench benchmark.

olmOCR-2-7B-1025 is a specialized 7-billion-parameter vision-language model (VLM) developed for high-fidelity optical character recognition (OCR) on complex documents such as PDFs. As the core model of the olmOCR 2 system, it is trained with a reinforcement learning protocol using verifiable, document-specific rewards: a diverse set of binary unit tests automatically generated from ground-truth HTML. The model advances the extraction of clean, naturally ordered text from digitized print documents, with substantial accuracy improvements for mathematical formulas, tables, and non-linear, multi-column layouts.

1. Architectural Overview

OlmOCR-2-7B-1025 employs a transformer-based VLM foundation, specifically an upgraded derivative of Qwen 2.5 VL, engineered for document OCR. The multimodal architecture accepts rasterized document images and fuses visual features with language-generation tokens via attention layers. This fusion of modalities enables effective semantic modeling of layout-dependent features such as headers, tables, and embedded math notation. The end-to-end transformer paradigm maps document images directly to plain text while preserving fine-grained structural and positional relationships.

A key capability of olmOCR-2-7B-1025 is its handling of complex document layouts—including multi-column text, spatially entangled tables, and visually intricate mathematical expressions—within the language modeling pipeline. This design incorporates dedicated attention mechanisms to encode spatial relationships, facilitating ordered extraction of content in natural reading sequence.

2. Training Protocol and Reinforcement Learning with Verifiable Rewards (RLVR)

The training regime for olmOCR-2-7B-1025 comprises two major stages:

  • Initial supervised fine-tuning (SFT) using a curated dataset (olmOCR-mix-1025) spanning thousands of pages from heterogeneous sources (e.g., scientific articles, business forms, technical documents).
  • Refinement via RLVR, a reinforcement learning approach in which reward signals are constructed from a diverse battery of binary unit tests derived from the ground-truth HTML source for each document.

For each document page, multiple candidate completions (28 per page in experimental settings) are produced and scored against these tests. Examples of unit test criteria include: verification of header/footer presence or absence, fidelity of natural reading order, precision of table cell placement, and accuracy of mathematical formula conversion (validated using rendered KaTeX DOM comparisons). The aggregate reward for a completion is given by

$R = \frac{\text{# of tests passed}}{\text{Total # of tests}}$

Additional constraints are enforced via reward components that check for formatting properties, such as EOS token termination and correct placement of document metadata.
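The reward computation described above can be sketched in a few lines. This is a hypothetical illustration, not the released training code: each unit test is modeled as a predicate over the OCR output, the base reward is the fraction of tests passed, and formatting checks (EOS termination, metadata placement) contribute additional binary components. All names and the example tests are assumptions for illustration.

```python
# Hypothetical sketch of the RLVR reward: each unit test is a predicate
# over the candidate completion; the base reward is the fraction passed,
# and format checks add further binary components (names illustrative).

def rlvr_reward(output: str, tests, format_checks=()) -> float:
    """Score one candidate completion against its document's unit tests."""
    if not tests:
        return 0.0
    passed = sum(1 for test in tests if test(output))
    reward = passed / len(tests)
    # Formatting properties (e.g. EOS termination, metadata placement)
    # contribute additional binary reward components.
    for check in format_checks:
        reward += 1.0 if check(output) else 0.0
    return reward

# Example unit tests for a page whose ground truth has no page footer
tests = [
    lambda out: "Introduction" in out,                        # header text present
    lambda out: "Page 3 of 10" not in out,                    # footer absent
    lambda out: out.find("Methods") > out.find("Abstract"),   # reading order
]
```

In training, each of the 28 candidate completions per page would be scored this way, and the relative scores within the group drive the GRPO update.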

The RL update is performed using Group Relative Policy Optimization (GRPO), regularized with a KL divergence penalty ($\beta = 0.01$), implemented on the Hugging Face TRL framework. Multiple parallel seeds are trained and subsequently “souped” via checkpoint averaging to enhance generalization and robustness.
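Checkpoint “souping” is a simple element-wise average of parameters across the seed runs. The sketch below uses plain floats in place of real tensors so it stays self-contained; the parameter names are illustrative, not from the released checkpoints.

```python
# Illustrative "model souping": average parameters element-wise across
# checkpoints trained from different RL seeds. Real checkpoints hold
# tensors; plain floats keep the sketch self-contained.

def soup(checkpoints):
    """Element-wise average of parameter dicts from parallel seed runs."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    keys = checkpoints[0].keys()
    return {
        k: sum(ckpt[k] for ckpt in checkpoints) / len(checkpoints)
        for k in keys
    }

seed_a = {"layer.weight": 1.0, "layer.bias": 0.5}
seed_b = {"layer.weight": 0.5, "layer.bias": 0.25}
souped = soup([seed_a, seed_b])  # {"layer.weight": 0.75, "layer.bias": 0.375}
```

Averaging checkpoints from independently seeded runs tends to land in a flatter region of the loss surface, which is the usual motivation for souping over picking a single best seed.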

3. Synthetic Document and Unit Test Generation Pipeline

Scaling RLVR required development of a synthetic training pipeline capable of producing diverse and challenging documents with automatically extractable ground truth:

  1. Layout Analysis: A general-purpose VLM inspects random PDF pages, annotating key layout features—column count, presence of figures, tables, headers, and footers.
  2. Semantic HTML Rendering: Based on annotated layout, the model generates a “clean” HTML representation matching original spatial parameters.
  3. Iterative Refinement: The HTML output is rendered to an image and compared pixel-wise with the original; discrepancies trigger further refinement via model feedback loops.

From this validated HTML, structural tags enable automatic construction of binary unit tests. For example, HTML <header> and <footer> elements map directly to presence/absence tests, while math formulas encoded and rendered via KaTeX permit direct visual comparison; table cell positions are likewise compared via DOM traversal. This pipeline yielded the olmOCR2-synthmix-1025 dataset, encompassing 2,186 PDF pages with over 30,000 unique test cases.

4. Evaluation Methodology and Performance Metrics

Performance is measured using olmOCR-Bench, a benchmark that evaluates OCR output not with edit distance, but via binary unit tests reflecting semantic correctness:

  • Accurate extraction of equations, with visual render checks.
  • Ordered reconstruction of multi-column and table data.
  • Conformity to natural reading order, especially for documents with non-linear progression.
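The binary scoring style above can be sketched as a simple pass-rate aggregation. The data layout and category names are assumptions for illustration; the actual olmOCR-Bench harness is more elaborate.

```python
# Hedged sketch of binary unit-test scoring in the style of olmOCR-Bench:
# every test passes or fails, and the score is the pass rate in points,
# overall and per category. Data structures here are illustrative.
from collections import defaultdict

def bench_score(results):
    """results: list of (category, passed) pairs.
    Returns (overall pass rate, per-category pass rates), in points."""
    if not results:
        return 0.0, {}
    per_cat = defaultdict(lambda: [0, 0])   # category -> [passed, total]
    for category, passed in results:
        per_cat[category][0] += int(passed)
        per_cat[category][1] += 1
    overall = 100 * sum(p for p, _ in per_cat.values()) / len(results)
    return overall, {c: 100 * p / t for c, (p, t) in per_cat.items()}

results = [("math", True), ("math", False), ("tables", True), ("order", True)]
overall, by_cat = bench_score(results)  # overall = 75.0
```

Under this kind of metric, a reported gain of 14.2 points means 14.2% more of the benchmark's unit tests pass, which is a stricter, more interpretable claim than an averaged edit-distance improvement.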

Relative to earlier releases (including the baseline olmOCR and other systems), olmOCR-2-7B-1025 improves by 14.2 points on olmOCR-Bench. The model demonstrates marked gains in:

  • Mathematical formula conversion: RL rewards focus on KaTeX DOM correctness, reducing error rates relative to edit-distance metrics.
  • Table parsing: Enhanced cell alignment and structure fidelity.
  • Multi-column layouts: Improved natural ordering and avoidance of cross-column mixing.

A comparative table in the cited paper (Poznanski et al., 22 Oct 2025) confirms state-of-the-art performance across all major challenge types.

5. Release and Licensing

The complete suite—including model weights, training/inference code, and both the supervised (olmOCR-mix-1025) and synthetic (olmOCR2-synthmix-1025) datasets—is released under permissive open-source licenses (e.g., Apache 2.0, MIT), with explicit notation on API or usage restrictions as appropriate. Open release supports full reproducibility of the experiments and eases integration into community research workflows.

6. Significance and Technical Impact

The olmOCR-2-7B-1025 architecture and RLVR methodology introduce a verifiable standard for document OCR, moving beyond traditional edit distance–based assessment to structured, property-specific criteria directly derived from ground-truth semantics. This approach is particularly impactful for document types where layout—rather than mere character recognition—governs meaningful extraction: scientific literature, business forms, mathematical publications, and technical manuals. The model’s ability to generalize through automated synthetic data generation and extensive reinforcement via test-based reward signals suggests new directions for OCR benchmarking and fidelity assurance.

A plausible implication is that binary unit test–driven reinforcement learning can be broadly applied to other vision-language extraction tasks, especially those necessitating fine-grained, context-sensitive structured outputs. The open-source release provides a robust framework for further research in document understanding and semantic extraction from complex visual sources.
