
olmOCR 2: Advanced Open-Source OCR

Updated 23 October 2025
  • olmOCR 2 is an open-source OCR system that leverages a 7-billion parameter vision language model, trained with reinforcement learning using binary unit-test rewards for high-fidelity text conversion.
  • It employs advanced techniques like dynamic temperature scaling, model checkpoint averaging, and enhanced layout parsing to accurately process multi-column pages, tables, and math formulas.
  • All model weights, training data, and code are publicly released under permissive licenses, promoting reproducible research and wide application in digital archiving and document analysis.

olmOCR 2 is an open-source Optical Character Recognition (OCR) system for converting digitized print documents, particularly PDFs, into clean, naturally ordered plain text. It utilizes a specialized 7-billion-parameter vision-language model (VLM) trained with reinforcement learning using verifiable, binary unit-test rewards. This approach combines large-scale synthetic document generation—with known ground-truth HTML and corresponding test cases—with rigorous training and an improved inference pipeline. olmOCR 2 exhibits state-of-the-art performance on olmOCR-Bench, delivering marked advances in math formula conversion, table parsing, and multi-column layout fidelity. All model weights, training data, and code are publicly released under permissive licenses.

1. Architecture and Model Innovations

olmOCR 2 is powered by the olmOCR-2-7B-1025 vision-language model. The architecture is adapted specifically for OCR tasks, comprising an upgraded base model and a stabilized inference pipeline. Notable advances include:

  • Refined prompt ordering and use of dynamic temperature scaling during decoding, which mitigate mode collapse and repetitive outputs.
  • “Souping”: model checkpoint averaging across multiple seeds is applied to enhance generalization.
  • Enhanced layout handling: architectural improvements and decoding strategies enable robust parsing of multi-column pages and floating elements.
  • Robust EOS (end-of-sequence) enforcement and output normalization to ensure consistent termination and metadata delivery.

The underlying VLM builds on a vision-transformer backbone, adapted for both scanned and born-digital page images, supporting robust handling of tables, equations, and complex layouts.
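The exact decoding heuristics are not specified in detail; the following is a minimal sketch of one plausible form of dynamic temperature scaling, in which sampling temperature is raised when the recent token window shows low diversity (a simple proxy for the repetition loops the text describes). The function name and the diversity heuristic are illustrative assumptions, not the authors' implementation.

```python
def dynamic_temperature(base_temp, recent_tokens, window=32, max_temp=1.0):
    """Hypothetical sketch: increase sampling temperature when the recent
    window of generated tokens has low diversity, to escape repetition."""
    if not recent_tokens:
        return base_temp
    window_toks = recent_tokens[-window:]
    diversity = len(set(window_toks)) / len(window_toks)  # 1.0 = all distinct
    # Low diversity -> hotter sampling; full diversity -> keep base temperature.
    return min(max_temp, base_temp + (1.0 - diversity) * (max_temp - base_temp))
```

A highly repetitive window (e.g. the same token 32 times) would push the temperature toward `max_temp`, while fully diverse output leaves it at `base_temp`.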

2. Reinforcement Learning with Verifiable Rewards (RLVR)

Training is governed by a reinforcement learning protocol—Group Relative Policy Optimization (GRPO)—using binary unit tests as the reward function. The RLVR paradigm supersedes traditional edit-distance measures by:

  • Deploying a large suite of unit tests per document, each signaling pass/fail on distinct dimensions (text presence, exclusion of headers/footers, reading order, cell correspondence in tables, visual fidelity of math formulas via KaTeX rendering).
  • Aggregating unit-test results per page as a numeric reward: R_page = M / N, where M is the number of passed unit tests and N the total number of tests.
  • Integrating these binary rewards into the RL objective, with additional regularization (e.g., a KL-divergence penalty with β = 0.01).
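The per-page reward above can be sketched directly: each unit test is a predicate on the model's output text, and the reward is the fraction that pass. The representation of tests as callables here is an illustrative assumption.

```python
def page_reward(unit_tests, output_text):
    """Binary unit-test reward: R_page = M / N, the fraction of the page's
    unit tests (pass/fail predicates on the output text) that pass."""
    passed = sum(1 for test in unit_tests if test(output_text))
    return passed / len(unit_tests)
```

For example, a page with a text-presence test and a footer-exclusion test yields reward 0.5 if the output contains the required text but also leaks the footer.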

This mechanism robustly enforces correctness, natural document ordering, and layout preservation—even with ambiguous or multi-valid outputs.
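In GRPO, the per-rollout rewards within a sampled group are normalized against the group's own statistics, so only relative quality drives the policy update. A minimal sketch of that normalization step (the standard group-relative advantage; function name is illustrative):

```python
def grpo_advantages(rewards):
    """Group-relative advantages: normalize each rollout's reward by the
    group's mean and standard deviation. A constant-reward group yields
    zero advantages (no learning signal)."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    if std == 0:
        return [0.0] * n
    return [(r - mean) / std for r in rewards]
```

Because binary unit-test rewards are coarse, several rollouts often tie; the zero-variance guard reflects that such groups contribute no gradient.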

3. Synthetic Document Generation and Supervision Pipeline

The training corpus is constructed via a synthetic pipeline:

  • Source PDFs, especially with complex tables, math, and multi-column layouts, are sampled.
  • Layout analysis is performed by prompting a general VLM to identify columns, headers/footers, tables, etc.
  • Clean, semantic ground-truth HTML is generated per page, which is then rendered and iteratively refined to maximize fidelity to the original page image.
  • The final HTML serves as reliable supervision; unit tests are automatically extracted mapping HTML tags (e.g., <header>, <footer>, table cells) to evaluative test cases.
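The mapping from ground-truth HTML to test cases can be illustrated with a toy sketch: body text becomes a presence test, while header/footer text becomes an absence test. The regex-based parsing and the specific tag set here are simplifying assumptions; the real pipeline extracts far richer tests (reading order, table cells, math rendering).

```python
import re

def extract_unit_tests(ground_truth_html):
    """Illustrative sketch: derive pass/fail checks from ground-truth HTML.
    Returns a list of predicates over the OCR output text."""
    tests = []
    # Body paragraph text must appear in the output.
    for m in re.finditer(r"<p>(.*?)</p>", ground_truth_html, re.S):
        snippet = m.group(1).strip()
        tests.append(lambda out, s=snippet: s in out)
    # Header text must be excluded from the output.
    for m in re.finditer(r"<header>(.*?)</header>", ground_truth_html, re.S):
        snippet = m.group(1).strip()
        tests.append(lambda out, s=snippet: s not in out)
    return tests
```

An output that reproduces the paragraph but leaks the running header would then fail one of the two extracted tests, yielding a page reward of 0.5 under the M/N scheme.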

This dual role—providing both ground-truth for fine-tuning and a scalable mechanism to generate dense, verifiable unit tests—enables efficient RL training and generalized document understanding.

4. Training Regimen and Checkpoint Averaging

Training proceeds in three phases:

  • One epoch of supervised fine-tuning on the cleaned dataset (olmOCR-mix-1025).
  • One epoch of RL training with synthetic data and binary unit-test rewards (olmOCR-synthmix-1025).
  • Model averaging (“souping”) across checkpoints from multiple random seeds to form the release checkpoint.
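The "souping" step is a uniform parameter average across the seed checkpoints. A minimal sketch, representing each checkpoint as a dict of parameter lists (in practice these would be framework tensors):

```python
def soup_checkpoints(state_dicts):
    """Uniform model soup: average each named parameter element-wise
    across checkpoints trained from different random seeds."""
    n = len(state_dicts)
    return {
        name: [sum(vals) / n for vals in zip(*(sd[name] for sd in state_dicts))]
        for name in state_dicts[0]
    }
```

Averaging in weight space (rather than ensembling outputs) keeps inference cost identical to a single model while typically improving generalization.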

Hyperparameters and training protocols are selected to maximize coverage and balance between local fidelity (token-level match) and global structural coherence (layout-ordering, table cell accuracy).

5. Benchmarking and Quantitative Performance

olmOCR 2 achieves state-of-the-art scores on the olmOCR-Bench, with a +14.2 point improvement over prior versions. The largest gains are in:

  • Math formula extraction: verified against KaTeX-rendered output.
  • Advanced table parsing: accurate cell positioning and Markdown-output consistency.
  • Multi-column content linearization: robust reading order and content arrangement across complex page layouts.

Unit tests quantitatively measure gains per dimension; overall scores outperform previous pipelines and generalist VLM baselines.
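Per-dimension scoring amounts to grouping unit-test outcomes by dimension and computing pass rates. A small sketch (dimension names are illustrative, not the benchmark's exact taxonomy):

```python
from collections import defaultdict

def score_by_dimension(results):
    """Aggregate (dimension, passed) unit-test outcomes into per-dimension
    pass rates, e.g. separate scores for math, tables, and reading order."""
    totals = defaultdict(int)
    passed = defaultdict(int)
    for dim, ok in results:
        totals[dim] += 1
        passed[dim] += int(ok)
    return {dim: passed[dim] / totals[dim] for dim in totals}
```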

6. Release, Licensing, and Reproducibility

The authors publicly release all components:

  • Model weights (olmOCR-2-7B-1025).
  • Training datasets (olmOCR-mix-1025, olmOCR-synthmix-1025).
  • Code base for inference, training, and synthetic data generation.
  • Open licensing (Apache 2.0/MIT) for all assets.

GitHub and Hugging Face repositories ensure accessibility for reproducible research, extensibility, and integration with existing document processing pipelines.

7. Applications and Methodological Implications

olmOCR 2 enables robust, high-fidelity conversions of digitized print media and scanned documents to plain text and Markdown for:

  • Digital archiving (library and heritage).
  • Academic publishing and legal document workflows.
  • Automated extraction of mathematical and tabular content for downstream machine learning or NLP tasks.

Its RLVR methodology—using binary unit-test supervision—suggests a new standard for OCR evaluation, particularly for mode-ambiguous outputs and complex layouts, with broad applicability to document conversion and analysis systems.


olmOCR 2 thus represents a technically advanced, reproducible system, setting benchmarks in document OCR through a combination of a specialized vision LLM, reinforcement learning with verifiable unit tests, and large-scale synthetic supervision pipelines (Poznanski et al., 22 Oct 2025).
