olmOCR 2: Advanced Open-Source OCR
- olmOCR 2 is an open-source OCR system that leverages a 7-billion parameter vision language model, trained with reinforcement learning using binary unit-test rewards for high-fidelity text conversion.
- It employs advanced techniques like dynamic temperature scaling, model checkpoint averaging, and enhanced layout parsing to accurately process multi-column pages, tables, and math formulas.
- All model weights, training data, and code are publicly released under permissive licenses, promoting reproducible research and wide application in digital archiving and document analysis.
olmOCR 2 is an open-source Optical Character Recognition (OCR) system for converting digitized print documents, particularly PDFs, into clean, naturally ordered plain text. It utilizes a specialized 7-billion parameter vision LLM (VLM) trained with reinforcement learning using verifiable, binary unit-test rewards. This approach combines large-scale, synthetic document creation—with known ground-truth HTML and corresponding test cases—with rigorous training and an improved inference pipeline. olmOCR 2 exhibits state-of-the-art performance on the olmOCR-Bench, delivering marked advances in math formula conversion, table parsing, and multi-column layout fidelity. All model weights, training data, and code are publicly released under permissive licenses.
1. Architecture and Model Innovations
olmOCR 2 is powered by the olmOCR-2-7B-1025 vision LLM. The architecture is adapted specifically for OCR tasks, comprising an upgraded base model and stabilized inference pipeline. Notable advances include:
- Refined prompt ordering and use of dynamic temperature scaling during decoding, which mitigate mode collapse and repetitive outputs.
- “Souping”: averaging model checkpoints trained from multiple random seeds to enhance generalization.
- Enhanced layout handling: architectural improvements and decoding strategies enable robust parsing of multi-column pages and floating elements.
- Robust EOS (end-of-sequence) enforcement and output normalization to ensure consistent termination and metadata delivery.
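The dynamic temperature scaling mentioned above can be illustrated with a minimal sketch. The idea (under our assumptions about the mechanism; the function names and thresholds here are illustrative, not from the released code) is to decode deterministically until the output starts looping, then raise the sampling temperature to escape the repetition:

```python
def detect_repetition(tokens, window=16, max_repeats=3):
    """Return True if the last `window` tokens end in a short repeating cycle."""
    tail = tokens[-window:]
    for period in range(1, window // max_repeats + 1):
        cycle = tail[-period:]
        # A cycle of length `period` repeated `max_repeats` times signals a loop.
        if cycle * max_repeats == tail[-period * max_repeats:]:
            return True
    return False

def next_temperature(tokens, base_temp=0.0, bumped_temp=0.8):
    """Dynamic temperature scaling sketch: greedy decoding by default,
    switching to higher-temperature sampling once repetition is detected."""
    return bumped_temp if detect_repetition(tokens) else base_temp
```

In practice this logic would sit inside the token-by-token decoding loop of the inference server; the specific window size and temperatures are hypothetical.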
The underlying VLM uses a vision transformer backbone, adapted for both scanned and born-digital page images, supporting the advanced handling of tables, equations, and complex layouts.
2. Reinforcement Learning with Verifiable Rewards (RLVR)
Training is governed by a reinforcement learning protocol—Group Relative Policy Optimization (GRPO)—using binary unit tests as the reward function. The RLVR paradigm supersedes traditional edit-distance measures by:
- Deploying a large suite of unit tests per document, each signaling pass/fail on distinct dimensions (text presence, exclusion of headers/footers, reading order, cell correspondence in tables, visual fidelity of math formulas via KaTeX rendering).
- Aggregating unit test results per page as a numeric reward r = n_pass / N, where n_pass is the number of passed unit tests and N the total number of tests.
- Integrating these binary rewards into the RL objective, with additional regularization (e.g., a KL-divergence penalty against the reference policy).
This mechanism robustly enforces correctness, natural document ordering, and layout preservation—even with ambiguous or multi-valid outputs.
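The reward computation described above reduces to a fraction of passed checks per page. A minimal sketch (the `UnitTest` structure and the specific checks are illustrative assumptions, not the authors' actual test harness):

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class UnitTest:
    name: str
    check: Callable[[str], bool]  # takes the model's page text, returns pass/fail

def page_reward(page_text: str, tests: List[UnitTest]) -> float:
    """Binary unit-test reward: the fraction of tests passed, r = n_pass / N."""
    if not tests:
        return 0.0
    n_pass = sum(1 for t in tests if t.check(page_text))
    return n_pass / len(tests)

# Hypothetical checks for one page: text presence, header exclusion, reading order.
tests = [
    UnitTest("body_present", lambda s: "maximum likelihood" in s),
    UnitTest("header_absent", lambda s: "Running Head" not in s),
    UnitTest("reading_order", lambda s: s.find("Abstract") < s.find("Methods")),
]
```

Each check is deliberately simple and binary; the RL signal comes from aggregating many such checks per page rather than from a fuzzy edit-distance score.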
3. Synthetic Document Generation and Supervision Pipeline
The training corpus is constructed via a synthetic pipeline:
- Source PDFs, especially with complex tables, math, and multi-column layouts, are sampled.
- Layout analysis is performed by prompting a general VLM to identify columns, headers/footers, tables, etc.
- Clean, semantic ground-truth HTML is generated per page, which is then rendered and iteratively refined to maximize fidelity to the original page image.
- The final HTML serves as reliable supervision; unit tests are automatically extracted, mapping HTML tags (e.g., `<header>`, `<footer>`, table cells) to evaluative test cases.
This dual role—providing both ground-truth for fine-tuning and a scalable mechanism to generate dense, verifiable unit tests—enables efficient RL training and generalized document understanding.
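The tag-to-test mapping can be sketched as follows, assuming a simple convention: text inside `<header>`/`<footer>` must be absent from the OCR output, while table-cell text must be present. This is an illustrative reconstruction (regex-based for brevity), not the released extraction code:

```python
import re

def extract_unit_tests(ground_truth_html: str):
    """Sketch: derive pass/fail checks from ground-truth HTML.
    Returns (kind, text, check) triples, where check(output) -> bool."""
    tests = []
    # Header/footer text must NOT appear in the converted output.
    for tag in ("header", "footer"):
        for m in re.finditer(rf"<{tag}>(.*?)</{tag}>", ground_truth_html, re.S):
            text = m.group(1).strip()
            tests.append(("absent", text, lambda out, t=text: t not in out))
    # Table cell text MUST appear in the converted output.
    for m in re.finditer(r"<td>(.*?)</td>", ground_truth_html, re.S):
        text = m.group(1).strip()
        tests.append(("present", text, lambda out, t=text: t in out))
    return tests
```

A real pipeline would use a proper HTML parser and many more test categories (reading order, math rendering via KaTeX), but the dual use of one HTML page as both supervision target and test generator is the key idea.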
4. Training Regimen and Checkpoint Averaging
Training proceeds in three phases:
- One epoch of supervised fine-tuning on the cleaned dataset (olmOCR-mix-1025).
- One epoch of RL training with synthetic data and binary unit-test rewards (olmOCR-synthmix-1025).
- Model averaging (“souping”) from multiple random seeds to form the release checkpoint.
Hyperparameters and training protocols are selected to maximize coverage and balance between local fidelity (token-level match) and global structural coherence (layout-ordering, table cell accuracy).
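The souping step in the regimen above is conceptually just an element-wise average of parameters across seed checkpoints. A minimal sketch (real checkpoints would be tensors loaded with a deep-learning framework; plain Python lists stand in here for illustration):

```python
def soup_checkpoints(checkpoints):
    """Model souping: element-wise average of parameter values across
    checkpoints trained from different random seeds.
    Each checkpoint is a dict mapping parameter name -> list of floats."""
    names = checkpoints[0].keys()
    souped = {}
    for name in names:
        # Zip aligns the i-th value of this parameter across all checkpoints.
        stacked = zip(*(ckpt[name] for ckpt in checkpoints))
        souped[name] = [sum(vals) / len(checkpoints) for vals in stacked]
    return souped
```

Because all seeds start from the same SFT checkpoint, the averaged weights stay in a compatible loss basin, which is what makes souping improve generalization rather than destroy the model.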
5. Benchmarking and Quantitative Performance
olmOCR 2 achieves state-of-the-art scores on the olmOCR-Bench, with a +14.2 point improvement over prior versions. The largest gains are in:
- Math formula extraction: verified against KaTeX-rendered output.
- Advanced table parsing: accurate cell positioning and Markdown-output consistency.
- Multi-column content linearization: robust reading order and content arrangement across complex page layouts.
Unit tests quantitatively measure gains per dimension; overall scores outperform previous pipelines and generalist VLM baselines.
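Per-dimension scoring of this kind reduces to grouping binary test outcomes by category and reporting pass rates. A small sketch (the dimension labels mirror the benchmark categories above; the function itself is illustrative):

```python
from collections import defaultdict

def score_by_dimension(results):
    """Aggregate pass/fail unit-test outcomes into per-dimension pass
    rates (in percent), e.g. 'math', 'tables', 'multi_column'.
    `results` is a list of (dimension, passed) tuples."""
    totals = defaultdict(lambda: [0, 0])  # dimension -> [passed, total]
    for dim, passed in results:
        totals[dim][1] += 1
        totals[dim][0] += int(passed)
    return {dim: 100.0 * p / n for dim, (p, n) in totals.items()}
```

This makes the benchmark diagnostic rather than monolithic: a model can be compared dimension by dimension, not just by one aggregate number.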
6. Release, Licensing, and Reproducibility
The authors publicly release all components:
- Model weights (olmOCR-2-7B-1025).
- Training datasets (olmOCR-mix-1025, olmOCR-synthmix-1025).
- Code base for inference, training, and synthetic data generation.
- Open licensing (Apache 2.0/MIT) for all assets.
GitHub and Hugging Face repositories ensure accessibility for reproducible research, extensibility, and integration with existing document processing pipelines.
7. Applications and Methodological Implications
olmOCR 2 enables robust, high-fidelity conversions of digitized print media and scanned documents to plain text and Markdown for:
- Digital archiving (library and heritage).
- Academic publishing and legal document workflows.
- Automated extraction of mathematical and tabular content for downstream machine learning or NLP tasks.
Its RLVR methodology—using binary unit-test supervision—suggests a new standard for OCR evaluation, particularly for outputs with multiple valid renderings and complex layouts, with broad applicability to document conversion and analysis systems.
olmOCR 2 thus represents a technically advanced, reproducible system, setting benchmarks in document OCR through a combination of a specialized vision LLM, reinforcement learning with verifiable unit tests, and large-scale synthetic supervision pipelines (Poznanski et al., 22 Oct 2025).