OmniDocBench: Comprehensive PDF Parsing Benchmark
- OmniDocBench is a large-scale benchmark suite for document parsing evaluation, featuring diverse PDF types and detailed multi-level annotations.
- It structures tasks into layout detection, text, formula, table recognition, and reading order prediction, using principled metrics such as NED, TEDS, and CDM.
- Benchmark results indicate high accuracy in both pipeline and end-to-end models, while highlighting challenges in handling physical distortions and complex layouts.
OmniDocBench is a comprehensive, large-scale benchmark suite for end-to-end evaluation of document parsing systems on real-world PDF documents. Its design addresses the limitations of prior benchmarks, focusing on the diversity of document types, annotation granularity, and the need for multi-faceted, principled metrics that reflect the demands of document intelligence applications, including training and assessment of LLMs and retrieval-augmented generation (RAG) systems.
1. Benchmark Scope, Document Types, and Annotation Schema
OmniDocBench is constructed to benchmark content extraction across heterogeneous PDF sources, emphasizing broad coverage and annotation depth. The evaluation set spans nine major document types: books, slides, research and financial reports, textbooks, exam papers, magazines, academic articles, handwritten notes, and newspapers. These genres were selected to encompass wide layout variability, including single/double/multi-column text, dense tables, inline/nested formulas, figures, code blocks, and handwriting—attributes that challenge both specialized and general document parsing models (Ouyang et al., 2024).
Annotations are provided at three levels:
- Block-Level: 19 semantic categories (e.g., title, text block, table, figure, caption, reference, code block) with bounding-box coordinates.
- Span-Level: Inline elements such as LaTeX equations and footnotes.
- Attribute-Level: Page, block, and table attributes, including language, column layout, scan fuzziness, watermark presence, color background, frame/margins, and rotation.
In total, OmniDocBench v1.0 comprises 981 pages and ≈20,000 blocks (Ouyang et al., 2024); v1.5 extends this to 1,355 pages with >135,000 text blocks, 6,775 formulas, and 5,420 tables (Cui et al., 29 Jan 2026). The annotation protocol enforces exhaustive per-element alignment, reading-order graphs, and cross-linked captions for figures and tables.
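The three annotation levels can be pictured as a nested record. The sketch below is only an illustration of that nesting: every class name, field name, and category string in it is hypothetical and does not reproduce the released annotation format.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# All class and field names here are hypothetical; they mirror the three
# annotation levels described above, not the benchmark's actual file format.


@dataclass
class Span:
    kind: str   # e.g. "inline_formula" or "footnote"
    text: str   # transcription (LaTeX for formulas)


@dataclass
class Block:
    category: str                            # one of the block-level categories
    bbox: Tuple[float, float, float, float]  # (x0, y0, x1, y1)
    order: Optional[int] = None              # reading-order index, if assigned
    spans: List[Span] = field(default_factory=list)
    attributes: dict = field(default_factory=dict)  # block/table attributes


@dataclass
class Page:
    page_id: str
    attributes: dict                         # e.g. language, column layout
    blocks: List[Block] = field(default_factory=list)

    def reading_sequence(self) -> List[Block]:
        """Blocks that carry a reading-order index, sorted by that index."""
        ordered = [b for b in self.blocks if b.order is not None]
        return sorted(ordered, key=lambda b: b.order)


page = Page(
    page_id="example-0001",
    attributes={"language": "en", "layout": "double_column", "watermark": False},
    blocks=[
        Block("text_block", (40, 120, 300, 400), order=1,
              spans=[Span("inline_formula", r"E = mc^2")]),
        Block("title", (40, 40, 560, 90), order=0),
        Block("figure", (320, 120, 560, 300)),  # left unordered in this toy page
    ],
)
print([b.category for b in page.reading_sequence()])
```

Keeping page attributes on the page record and block attributes on each block is what makes attribute-level breakdowns a simple filter over pages or blocks.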
2. Benchmark Tasks and Evaluation Protocol
OmniDocBench structures document parsing as a set of unified, tightly coupled subtasks:
- Layout Detection and Classification: Localization and identification of block-level entities.
- Text Recognition: Per-block text transcription.
- Formula Recognition: Conversion of detected formula regions into canonical LaTeX.
- Table Recognition: Structured extraction of row/column/cell topology and cell content, represented as a tree so that predictions can be scored by tree edit distance.
- Reading Order Prediction: Generation of a linear reading sequence reflecting human-imposed semantics (critical for multi-column and complex layouts).
Each model is evaluated on standard input–output pairs: given raw PDF images, produce structured outputs encoded in Markdown (text), LaTeX (formulas), HTML (tables), or JSON (for direct element extraction). The scorer matches predictions against ground truth using strict alignment on both coordinates and semantic category, with post-processing to merge or split blocks as needed (Ouyang et al., 2024, Yin et al., 28 Jan 2026).
3. Metrics and Aggregation
OmniDocBench introduces and standardizes multiple evaluation metrics suited to each subtask:
- Text/OCR Accuracy: Normalized Edit Distance (NED, reported as the Text Edit score; lower is better), computed as Levenshtein distance over character sequences divided by ground-truth length (Ouyang et al., 2024, Yin et al., 28 Jan 2026).
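- Formula Recognition: Character Detection Matching (CDM; higher is better), which renders predicted and ground-truth LaTeX to images and scores character-level matches between the two renderings, so that visually equivalent formulas written with different LaTeX tokens are not penalized (Ouyang et al., 2024, Yin et al., 28 Jan 2026).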
- Table Parsing: Tree Edit Distance Similarity (TEDS); higher is better, reflecting structural and content equivalence between predicted and ground-truth table trees. A structure-only variant (TEDS$_s$) ignores cell content and scores only the row, column, and span structure (Ouyang et al., 2024).
- Reading Order: Normalized Edit Distance over predicted vs. ground-truth sequence of block indices (Liu et al., 12 Jan 2026, Yin et al., 28 Jan 2026).
- Overall Score: Aggregated as an average of the component metrics (the exact set of terms varies by version); the three-term form is:
$\mathrm{Overall} = \frac{(1-\mathrm{NED}_{\mathrm{text}})\times 100 + \mathrm{TEDS} + \mathrm{CDM}}{3}$
(formula as in (Zhou et al., 4 Mar 2026); other versions may use five-term averages) (Cui et al., 29 Jan 2026, Dong et al., 11 Mar 2026).
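The edit-distance metrics and the three-term Overall score above can be sketched in a few lines. This is a minimal illustration, not the official scorer: the function names are hypothetical, the handling of an empty ground truth is an assumption, and the normalization follows the "divided by ground-truth length" description given above.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (strings or lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def normalized_edit_distance(pred, gt):
    """Levenshtein distance divided by ground-truth length.

    Assumption: an empty ground truth scores 0.0 for an empty prediction
    and 1.0 otherwise; an official scorer may treat this case differently.
    """
    if len(gt) == 0:
        return 0.0 if len(pred) == 0 else 1.0
    return levenshtein(pred, gt) / len(gt)


def overall_score(ned_text, teds, cdm):
    """Three-term Overall: NED is in [0, 1]; TEDS and CDM are percentages."""
    return ((1.0 - ned_text) * 100.0 + teds + cdm) / 3.0


# Text: one wrong character out of twelve.
text_ned = normalized_edit_distance("OmniDocBemch", "OmniDocBench")

# Reading order: the same function applied to sequences of block indices.
order_ned = normalized_edit_distance([0, 2, 1, 3], [0, 1, 2, 3])

# Component scores taken from the PaddleOCR-VL-1.5 leaderboard row.
print(round(overall_score(0.035, 92.76, 94.21), 2))  # -> 94.49
```

The same `normalized_edit_distance` serves both text and reading order because it only requires sequences with comparable elements. Rounding in the reported components explains small differences from published Overall values.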
Precision, Recall, and F1 are applied for detection subcomponents with two-stage matching (Hungarian + box clustering) (Li et al., 2 Dec 2025). Attribute-level and page-type breakdowns enable fine-grained error analysis.
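To make the detection scores concrete, the sketch below computes precision, recall, and F1 from box matches. It deliberately swaps in a simpler greedy one-to-one IoU matching for the two-stage Hungarian and clustering procedure named above; the function names and the 0.5 IoU threshold are assumptions chosen for illustration.

```python
def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def detection_prf(preds, gts, iou_threshold=0.5):
    """preds, gts: lists of (category, box). Returns (precision, recall, f1).

    Greedy matching (a simplification, not Hungarian matching): same-category
    pairs are visited by decreasing IoU and each box is matched at most once.
    """
    pairs = []
    for i, (pred_cat, pred_box) in enumerate(preds):
        for j, (gt_cat, gt_box) in enumerate(gts):
            if pred_cat == gt_cat:
                score = iou(pred_box, gt_box)
                if score >= iou_threshold:
                    pairs.append((score, i, j))
    pairs.sort(reverse=True)

    used_pred, used_gt = set(), set()
    for _, i, j in pairs:
        if i not in used_pred and j not in used_gt:
            used_pred.add(i)
            used_gt.add(j)

    tp = len(used_pred)
    precision = tp / len(preds) if preds else 0.0
    recall = tp / len(gts) if gts else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return precision, recall, f1


gts = [("title", (0, 0, 100, 20)), ("table", (0, 30, 100, 90))]
preds = [
    ("title", (0, 0, 100, 22)),     # overlaps the title well: a match
    ("table", (0, 60, 100, 120)),   # IoU 1/3 with the table: below threshold
    ("figure", (0, 30, 100, 90)),   # right box, wrong category: no match
]
precision, recall, f1 = detection_prf(preds, gts)
print(precision, recall, f1)
```

Here one of three predictions matches one of two ground-truth blocks, so precision is 1/3, recall is 1/2, and F1 is 0.4. An optimal assignment can recover matches that greedy matching misses when boxes overlap heavily, which is one reason a scorer would prefer Hungarian matching.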
4. Extensions: Robustness, Physical Reality, and Retrieval
OmniDocBench has spawned physically reconstructed and real-capture variants, notably:
- Real5-OmniDocBench (Zhou et al., 4 Mar 2026, Cui et al., 29 Jan 2026): Each v1.5 page is physically printed and recaptured under five canonical distortions: Scanning (noise, blur, misalignments), Warping (creasing, crumpling, book arcs), Skew (3D rotation/homography), Screen-Photography (moiré, display noise), and Illumination (shadows, color casts). All samples use fiducial markers for sub-pixel ground-truth alignment, supporting direct metric correspondence with the digital corpus. This enables per-factor attribution of degradation (e.g., Skew and Warping cause the largest drops in structure-sensitive TEDS and CDM).
- Wild-OmniDocBench (Li et al., 25 Mar 2026): Real-world photos of each OmniDocBench page, balancing print+photo and screen+photo modes and retaining full annotation consistency. It serves as a stress test for MLLM robustness outside the domain of born-digital PDFs.
- Evidence Units (EUs) for Retrieval (Han, 1 Apr 2026): EU chunking semantically groups logical units (e.g., table + caption + explanatory paragraph). When applied to OmniDocBench, retrieval LCS increases from 0.50 to 0.81 and Recall@1 from 0.15 to 0.51, highlighting that annotation structure enables non-fragmented retrieval for RAG systems.
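The grouping idea behind Evidence Units can be illustrated with a toy rule over blocks in reading order. This is not the EU algorithm from the cited work: the function name, the category strings, and the adjacency rule are all hypothetical and only show how annotated structure lets a table travel with its caption and explanatory text.

```python
def group_evidence_units(blocks):
    """blocks: list of (category, text) tuples in reading order.

    Hypothetical rule: a table or figure absorbs an immediately adjacent
    caption (before or after it) and one text block that directly follows.
    Every other block becomes a unit by itself.
    """
    units, i, n = [], 0, len(blocks)
    while i < n:
        cat = blocks[i][0]
        nxt = blocks[i + 1][0] if i + 1 < n else None
        if cat == "caption" and nxt in ("table", "figure"):
            unit = [blocks[i], blocks[i + 1]]   # caption placed before the object
            i += 2
        elif cat in ("table", "figure"):
            unit = [blocks[i]]
            i += 1
            if i < n and blocks[i][0] == "caption":
                unit.append(blocks[i])          # caption placed after the object
                i += 1
        else:
            units.append([blocks[i]])           # standalone block
            i += 1
            continue
        if i < n and blocks[i][0] == "text":
            unit.append(blocks[i])              # explanatory paragraph
            i += 1
        units.append(unit)
    return units


blocks = [
    ("title", "Results"),
    ("caption", "Table 1: Scores"),
    ("table", "<table>...</table>"),
    ("text", "Table 1 shows the scores."),
    ("text", "Next paragraph."),
]
units = group_evidence_units(blocks)
print([[cat for cat, _ in unit] for unit in units])
```

In this toy page the caption, table, and first paragraph end up in one retrievable unit, whereas block-wise chunking would split them into three fragments.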
5. Baseline Methods, Leaderboard Results, and Analysis
OmniDocBench is the reference standard for both modular pipeline and end-to-end VLM evaluation:
- Pipeline Tools: MinerU, DocLayout-YOLO, PaddleOCR, UniMERNet, Mathpix. These excel on clean, academic-style layouts but are less robust to handwritten, stylized, or highly multilingual pages (Ouyang et al., 2024, Li et al., 2 Dec 2025, Wang et al., 17 Oct 2025).
- Specialized VLMs: dots.ocr (Li et al., 2 Dec 2025), PaddleOCR-VL-1.5 (Cui et al., 29 Jan 2026), Youtu-Parsing (Yin et al., 28 Jan 2026), FireRed-OCR (Wu et al., 2 Mar 2026), Qianfan-OCR (Dong et al., 11 Mar 2026), Dolphin-v2 (Feng et al., 5 Feb 2026). End-to-end models increasingly rival pipelines: PaddleOCR-VL-1.5 achieves 94.50 overall, Youtu-Parsing 93.22, Qianfan-OCR 93.12, and FireRed-OCR 92.94. Notably, state-of-the-art models lose less than 4% under the physical distortions of Real5-OmniDocBench, in sharp contrast to previous-generation models (Cui et al., 29 Jan 2026, Zhou et al., 4 Mar 2026).
Representative leaderboard snapshot (v1.5; ↑ higher is better, ↓ lower is better) (Cui et al., 29 Jan 2026, Yin et al., 28 Jan 2026, Dong et al., 11 Mar 2026, Wu et al., 2 Mar 2026):
| Model | Params | Overall (%) ↑ | Text Edit ↓ | Formula CDM ↑ | Table TEDS ↑ | Table TEDS$_s$ ↑ | Reading Order Edit ↓ |
|---------------------|--------|-------------|------------------|----------------|------------------|---------|---------------|
| PaddleOCR-VL-1.5 | 0.9B | 94.50 | 0.035 | 94.21 | 92.76 | 95.79 | 0.042 |
| Youtu-Parsing | 2.5B | 93.22 | 0.045 | 93.19 | 91.15 | 95.43 | 0.026 |
| Qianfan-OCR | 4B | 93.12 | 0.041 | 92.43 | 91.02 | 93.85 | 0.049 |
| FireRed-OCR | 2B | 92.94 | 0.032 | 91.71 | 90.31 | 93.81 | 0.041 |
Analysis converges on several points:
- End-to-end VLMs with multi-modal, prompt-augmented training match or exceed pipelines on key metrics, particularly on reading order, formula CDM, and table TEDS under complex layouts.
- Robustness across languages (EN/ZH) and document categories is now a minimum expectation (Dong et al., 11 Mar 2026, Wang et al., 17 Oct 2025).
- Failure modes drift toward geometric distortions, rare layout categories, and high reading-order entropy, for which new strategies (e.g., layout-as-thought (Dong et al., 11 Mar 2026), hybrid anchor prompting (Feng et al., 5 Feb 2026), RL fine-tuning (Wang et al., 17 Oct 2025)) show clear quantitative wins.
6. Challenges, Limitations, and Future Directions
OmniDocBench continues to expose frontier limitations and open avenues:
- Multilingual and Cross-Domain Generalization: While EN/ZH are fully covered, generalization to low-resource scripts and domain-unique layouts (notably business, chemistry, artistic) remains underexplored (Li et al., 2 Dec 2025, Ouyang et al., 2024).
- Physical Reality and Robustness: Real5-OmniDocBench and Wild-OmniDocBench reveal persistent "reality gaps"—modeling geometric warping and non-uniform illumination remains unsolved at the structure level (Zhou et al., 4 Mar 2026, Li et al., 25 Mar 2026).
- Reading Order Complexity: FocalOrder demonstrates that positional disparity (mid-sequence "inverted-U" error) persists for high-element-density pages, requiring adaptive curriculum and ranking-based objectives (Liu et al., 12 Jan 2026).
- Retrieval and RAG Integration: Fragmentation of units, if not addressed via semantic chunking (EUs), severely impairs LLM pipeline recall and increases context cost (Han, 1 Apr 2026).
- Scalability and Efficiency: Compact VLMs (<1B params) such as PaddleOCR-VL-1.5 now rival 200B+ models, suggesting that architectural advances eclipse brute-force scaling under real-world perturbations (Cui et al., 29 Jan 2026).
Planned expansions include: multi-page linking, vertical/complex script support, cross-page reference annotation, tighter coupling of detection and order induction, and open-source release of all distortion and real-capture variants (Ouyang et al., 2024, Zhou et al., 4 Mar 2026, Li et al., 25 Mar 2026). The field anticipates the integration of RL-based reward frameworks, semantic-aware loss formulations, and adaptive decoding protocols to fully close the loop between page-level structure, content fidelity, and downstream document understanding.
7. Impact and Significance
OmniDocBench has established itself as the de facto gold standard for holistic evaluation of document parsing. It provides rigorous protocols and granular metrics that drive the development of both academic and commercial VLMs, enables clear attribution of model gains (or failures) to specific document phenomena, and sets explicit requirements for scalability and realistic robustness (Ouyang et al., 2024, Zhou et al., 4 Mar 2026, Yin et al., 28 Jan 2026, Dong et al., 11 Mar 2026). Its physically reconstructed and retrieval-augmented extensions reshape evaluation to better reflect the deployed realities of intelligent document processing pipelines.
The evolution and adoption of OmniDocBench, together with its derivatives, directly influence the next generation of multitask VLMs, document QA, RAG systems, and OCR solutions, and provide a reference point for addressing open challenges in structured document intelligence.