
olmOCR 2: Advanced Open-Source OCR

Updated 23 October 2025
  • olmOCR 2 is an open-source OCR system that leverages a 7-billion parameter vision language model, trained with reinforcement learning using binary unit-test rewards for high-fidelity text conversion.
  • It employs advanced techniques like dynamic temperature scaling, model checkpoint averaging, and enhanced layout parsing to accurately process multi-column pages, tables, and math formulas.
  • All model weights, training data, and code are publicly released under permissive licenses, promoting reproducible research and wide application in digital archiving and document analysis.

olmOCR 2 is an open-source Optical Character Recognition (OCR) system for converting digitized print documents, particularly PDFs, into clean, naturally ordered plain text. It uses a specialized 7-billion-parameter vision language model (VLM) trained with reinforcement learning against verifiable, binary unit-test rewards. The approach pairs large-scale synthetic document creation (with known ground-truth HTML and corresponding test cases) with rigorous training and an improved inference pipeline. olmOCR 2 achieves state-of-the-art performance on olmOCR-Bench, delivering marked advances in math formula conversion, table parsing, and multi-column layout fidelity. All model weights, training data, and code are publicly released under permissive licenses.

1. Architecture and Model Innovations

olmOCR 2 is powered by the olmOCR-2-7B-1025 vision LLM. The architecture is adapted specifically for OCR tasks, comprising an upgraded base model and stabilized inference pipeline. Notable advances include:

  • Refined prompt ordering and use of dynamic temperature scaling during decoding, which mitigate mode collapse and repetitive outputs.
  • “Souping”: model checkpoint averaging across multiple seeds is applied to enhance generalization.
  • Enhanced layout handling: architectural improvements and decoding strategies enable robust parsing of multi-column pages and floating elements.
  • Robust EOS (end-of-sequence) enforcement and output normalization to ensure consistent termination and metadata delivery.

The underlying VLM builds on a vision transformer backbone, adapted for both scanned and born-digital page images, supporting robust handling of tables, equations, and complex layouts.
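As one illustration of how dynamic temperature scaling can mitigate repetitive decoding, here is a minimal sketch; the detection window, period check, and temperature values are illustrative assumptions, not the paper's settings:

```python
# Hypothetical sketch of dynamic temperature scaling: decode at a low
# temperature by default, but boost it when the output tail starts cycling,
# to break out of degenerate repetition loops.
def dynamic_temperature(tokens, base_temp=0.1, boosted_temp=0.8, window=20):
    """Pick the sampling temperature for the next decoding step."""
    tail = tokens[-window:]
    # Look for a repeating cycle of any short period at the end of the tail.
    for period in range(1, window // 2 + 1):
        if len(tail) >= 2 * period and tail[-period:] == tail[-2 * period:-period]:
            return boosted_temp  # repetition detected: sample more diversely
    return base_temp

dynamic_temperature([1, 2, 3, 4])         # no cycle: stays at base_temp
dynamic_temperature([7, 8, 9, 7, 8, 9])   # period-3 cycle: boosted_temp
```

A production decoder would apply this per step inside the sampling loop; the point is only that the temperature is a function of the emitted prefix rather than a constant.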

2. Reinforcement Learning with Verifiable Rewards (RLVR)

Training is governed by a reinforcement learning protocol—Group Relative Policy Optimization (GRPO)—using binary unit tests as the reward function. The RLVR paradigm supersedes traditional edit-distance measures by:

  • Deploying a large suite of unit tests per document, each signaling pass/fail on distinct dimensions (text presence, exclusion of headers/footers, reading order, cell correspondence in tables, visual fidelity of math formulas via KaTeX rendering).
  • Aggregating unit-test results per page into a numeric reward, R_page = M / N, where M is the number of passed unit tests and N is the total number of tests.
  • Integrating these binary rewards into the RL objective, with additional regularization (e.g., a KL-divergence penalty with β = 0.01).
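As a hedged sketch of how such rewards typically enter a GRPO objective (following the standard GRPO formulation; the paper's exact objective may differ), per-page rewards within a group of G sampled outputs are normalized into advantages, and the KL penalty enters as a separate term:

```latex
% Group-relative advantage over G sampled outputs for the same page
\hat{A}_i = \frac{R_i - \operatorname{mean}(R_1,\ldots,R_G)}{\operatorname{std}(R_1,\ldots,R_G)}

% KL-regularized policy objective, with \beta = 0.01 as stated above
\mathcal{J}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}
  \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)}\,\hat{A}_i\right]
  - \beta\, D_{\mathrm{KL}}\!\left(\pi_\theta \,\middle\|\, \pi_{\text{ref}}\right)
```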

This mechanism robustly enforces correctness, natural document ordering, and layout preservation—even with ambiguous or multi-valid outputs.
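The per-page reward can be sketched in a few lines; the example unit tests below (text presence, header/footer exclusion, reading order) are hypothetical stand-ins for the paper's checks:

```python
# Minimal sketch of the binary unit-test reward R_page = M / N:
# the fraction of unit tests the OCR output passes.
def page_reward(output_text, unit_tests):
    """Each unit test is a predicate on the OCR output; reward is M / N."""
    if not unit_tests:
        return 0.0
    passed = sum(1 for test in unit_tests if test(output_text))
    return passed / len(unit_tests)

tests = [
    lambda t: "Introduction" in t,                          # text presence
    lambda t: "Page 3 of 12" not in t,                      # header/footer excluded
    lambda t: t.find("Methods") > t.find("Introduction"),   # reading order
]
reward = page_reward("Introduction ... Methods ...", tests)  # 3/3 pass -> 1.0
```

Because each test is binary, the reward is dense and verifiable per page, which is what makes it usable as an RL signal.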

3. Synthetic Document Generation and Supervision Pipeline

The training corpus is constructed via a synthetic pipeline:

  • Source PDFs, especially with complex tables, math, and multi-column layouts, are sampled.
  • Layout analysis is performed by prompting a general VLM to identify columns, headers/footers, tables, etc.
  • Clean, semantic ground-truth HTML is generated per page, which is then rendered and iteratively refined to maximize fidelity to the original page image.
  • The final HTML serves as reliable supervision; unit tests are automatically extracted mapping HTML tags (e.g., <header>, <footer>, table cells) to evaluative test cases.

This dual role—providing both ground-truth for fine-tuning and a scalable mechanism to generate dense, verifiable unit tests—enables efficient RL training and generalized document understanding.
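A minimal sketch of the last step above, deriving unit tests from ground-truth HTML tags. It uses Python's stdlib `html.parser`, and the tag-to-test mapping shown (header/footer text must be absent, table-cell text present) is a simplified assumption about the extraction rules:

```python
from html.parser import HTMLParser

# Hypothetical sketch: derive pass/fail unit tests from ground-truth HTML.
# Text inside <header>/<footer> must be ABSENT from the OCR output;
# text inside table cells (<td>) must be PRESENT.
class TestExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []                     # stack of currently open tags
        self.absent, self.present = [], []

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if "header" in self.stack or "footer" in self.stack:
            self.absent.append(text)
        elif "td" in self.stack:
            self.present.append(text)

def extract_tests(html):
    """Turn ground-truth HTML into a list of boolean checks on OCR output."""
    parser = TestExtractor()
    parser.feed(html)
    tests = [lambda out, s=s: s not in out for s in parser.absent]
    tests += [lambda out, s=s: s in out for s in parser.present]
    return tests

tests = extract_tests(
    "<header>Journal of X, p. 3</header>"
    "<table><tr><td>42</td></tr></table>"
)
```

The real pipeline covers many more tag types (reading order, math via KaTeX rendering), but the principle is the same: every semantic HTML annotation yields a cheap, automatically checkable test.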

4. Training Regimen and Checkpoint Averaging

Training proceeds in three phases:

  • One epoch of supervised fine-tuning on the cleaned dataset (olmOCR-mix-1025).
  • One epoch of RL training with synthetic data and binary unit-test rewards (olmOCR2-synthmix-1025).
  • Model averaging (“souping”) of checkpoints from multiple random seeds to form the release checkpoint.

Hyperparameters and training protocols are selected to maximize coverage and balance between local fidelity (token-level match) and global structural coherence (layout-ordering, table cell accuracy).
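The final souping step amounts to an element-wise average of parameters across seed checkpoints; a minimal sketch, with plain floats standing in for tensors:

```python
# Minimal sketch of checkpoint "souping": an element-wise average of the
# parameters from checkpoints trained with different random seeds.
# Real checkpoints hold tensors; plain floats stand in here.
def soup(checkpoints):
    """Average a list of state dicts (parameter name -> value)."""
    n = len(checkpoints)
    return {k: sum(ckpt[k] for ckpt in checkpoints) / n
            for k in checkpoints[0]}

seed_a = {"layer.weight": 0.2, "layer.bias": -1.0}
seed_b = {"layer.weight": 0.4, "layer.bias": 1.0}
souped = soup([seed_a, seed_b])
```

Averaging works here because the seed runs start from the same fine-tuned initialization, so their parameters stay close enough for the mean to remain a good model.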

5. Benchmarking and Quantitative Performance

olmOCR 2 achieves state-of-the-art scores on the olmOCR-Bench, with a +14.2 point improvement over prior versions. The largest gains are in:

  • Math formula extraction: verified against KaTeX-rendered output.
  • Advanced table parsing: accurate cell positioning and Markdown-output consistency.
  • Multi-column content linearization: robust reading order and content arrangement across complex page layouts.

Unit tests quantitatively measure gains per dimension; overall scores outperform previous pipelines and generalist VLM baselines.
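Per-dimension scoring can be sketched as averaging pass rates within each test category and then across categories; the category names and pass/fail values below are illustrative, not benchmark results:

```python
# Illustrative sketch of per-dimension benchmark scoring: average the
# unit-test pass rate within each category, then average across categories.
def bench_score(results):
    """results: {category: [passed?, ...]} -> (per-category %, overall %)."""
    per_cat = {cat: 100.0 * sum(r) / len(r) for cat, r in results.items()}
    overall = sum(per_cat.values()) / len(per_cat)
    return per_cat, overall

per_cat, overall = bench_score({
    "math_formulas": [True, True, False, True],  # 75.0
    "tables":        [True, False],              # 50.0
    "multi_column":  [True, True],               # 100.0
})
```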

6. Release, Licensing, and Reproducibility

The authors publicly release all components:

  • Model weights (olmOCR-2-7B-1025).
  • Training datasets (olmOCR-mix-1025, olmOCR-synthmix-1025).
  • Code base for inference, training, and synthetic data generation.
  • Open licensing (Apache 2.0/MIT) for all assets.

GitHub and Hugging Face repositories ensure accessibility for reproducible research, extensibility, and integration with existing document processing pipelines.

7. Applications and Methodological Implications

olmOCR 2 enables robust, high-fidelity conversions of digitized print media and scanned documents to plain text and Markdown for:

  • Digital archiving (library and heritage).
  • Academic publishing and legal document workflows.
  • Automated extraction of mathematical and tabular content for downstream machine learning or NLP tasks.

Its RLVR methodology—using binary unit-test supervision—suggests a new standard for OCR evaluation, particularly for mode-ambiguous outputs and complex layouts, with broad applicability to document conversion and analysis systems.


olmOCR 2 thus represents a technically advanced, reproducible system, setting benchmarks in document OCR through a combination of a specialized vision LLM, reinforcement learning with verifiable unit tests, and large-scale synthetic supervision pipelines (Poznanski et al., 22 Oct 2025).

