OmniDocBench v1.5 Benchmark
- OmniDocBench v1.5 is a comprehensive benchmark for PDF document parsing that overcomes previous limitations by expanding dataset diversity and annotation granularity.
- It features nine distinct real-world document types and employs a rigorous pre-annotation, human refinement, and expert quality inspection pipeline.
- The benchmark introduces unified end-to-end, single-module, and attribute-based evaluation protocols to compare pipeline-based and vision-language models.
OmniDocBench v1.5 defines a comprehensive benchmark for diverse PDF document parsing, designed specifically to address persistent limitations in dataset diversity, annotation granularity, and evaluation methodologies present in previous benchmarks. It features detailed, multi-granularity annotations across a broad range of real-world document types and introduces unified evaluation protocols and metrics for markup-oriented outputs. By facilitating rigorous, fair, and extensible assessments of both pipeline-based and vision-language-based document parsing systems, OmniDocBench v1.5 serves as a new methodological standard in the field (Ouyang et al., 10 Dec 2024).
1. Motivation and Objectives
OmniDocBench v1.5 was introduced to address three critical shortcomings in prior benchmarks for PDF document parsing: limited diversity of document types, restricted evaluation dimensions, and the absence of adequate metrics for markup-style outputs. The benchmark unifies end-to-end evaluation over nine distinct real-world PDF page types: academic papers, textbooks, slides, exam papers, financial reports, magazines, books, newspapers, and handwritten notes. Compared with predecessors, many of which restricted themselves to arXiv-like content or a couple of layout classes, v1.5 expands source types from one or two to nine and increases both layout categories and attribute labels roughly three- to four-fold. The scope also includes a rigorous pre-annotation, human-in-the-loop, and expert review pipeline, alongside standardized metric computation (Ouyang et al., 10 Dec 2024).
2. Dataset Composition
The OmniDocBench v1.5 dataset consists of 981 high-quality document pages, sampled from over 200,000 crawled PDF files via ResNet-50 feature extraction, Faiss-based clustering, and manual balancing. Table 1 summarizes the number of annotated pages by document source:
| Document Type | # Pages |
|---|---|
| Academic Papers | 129 |
| Textbooks | 96 |
| Slides (PPT→PDF) | 133 |
| Exam Papers | 114 |
| Financial Reports | 81 |
| Magazines | 97 |
| Books | 104 |
| Newspapers | 111 |
| Handwritten Notes | 116 |
Global page-level attributes span layout (single-column: 477, double-column: 126, three-column: 45, complex/mixed: 213), language (English: 290, Chinese: 612, mixed: 79), and special issues such as fuzzy scans (28), watermarks (65), and colorful backgrounds (246) (Ouyang et al., 10 Dec 2024).
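The clustering-based page sampling described at the start of this section can be illustrated with a short sketch. The following is a minimal, hypothetical version of that procedure, assuming pages are pre-rendered to images; the model weights, cluster count, and per-cluster sample size are illustrative choices, not the authors' released pipeline (which additionally applies manual balancing).

```python
# Sketch: embed rendered PDF pages with ResNet-50, cluster embeddings with Faiss
# k-means, then draw a balanced sample from each cluster. Illustrative only.
import numpy as np
import faiss
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

def embed_pages(image_paths):
    """Return an (N, 2048) float32 matrix of ResNet-50 global features."""
    resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
    resnet.fc = torch.nn.Identity()          # keep the pooled 2048-d feature
    resnet.eval()
    prep = T.Compose([T.Resize(256), T.CenterCrop(224), T.ToTensor(),
                      T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])])
    feats = []
    with torch.no_grad():
        for p in image_paths:
            x = prep(Image.open(p).convert("RGB")).unsqueeze(0)
            feats.append(resnet(x).squeeze(0).numpy())
    return np.asarray(feats, dtype="float32")

def cluster_and_sample(feats, image_paths, n_clusters=50, per_cluster=20, seed=0):
    """k-means over page embeddings; sample up to `per_cluster` pages per cluster."""
    kmeans = faiss.Kmeans(d=feats.shape[1], k=n_clusters, niter=20, seed=seed)
    kmeans.train(feats)
    _, assign = kmeans.index.search(feats, 1)   # nearest centroid per page
    rng = np.random.default_rng(seed)
    sample = []
    for c in range(n_clusters):
        idx = np.where(assign.ravel() == c)[0]
        pick = rng.choice(idx, size=min(per_cluster, len(idx)), replace=False)
        sample.extend(image_paths[i] for i in pick)
    return sample
```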
3. Annotation Schema and Quality Processes
The annotation schema consists of 19 layout categories covering all major structural and semantic components encountered in real-world documents, including block-level and inline constructs. Categories include Title, Text Block, Figure, Figure Caption, Table, Table Caption, Header, Footer, Code Block, Reference, inline Equation (LaTeX-encoded), and Footnote Mark, with provisions for three masked ("non-semantic") categories.
Attributes are annotated at multiple granularities; an illustrative record is sketched after this list:
- Text-block attributes: language (EN, ZH, Mixed), background color (white, single-color, multi-color), rotation (0°, ±90°, horizontal).
- Table attributes: language, frame type, presence of merged cells, embedded formulas, colorful background, and rotation.
- Formula attributes: inline/display.
- Page-level attributes: page type (from nine source classes), layout type, language, special issues.
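An illustrative annotation record, expressed here as a Python literal, shows how these attribute levels can be combined on a single page. The field names and values below are hypothetical and do not reproduce the released OmniDocBench format.

```python
# Hypothetical annotation record reflecting the attribute granularities above.
page_annotation = {
    "page_attributes": {
        "data_source": "textbook",          # one of the nine source classes
        "layout": "double_column",
        "language": "english",
        "special_issues": ["watermark"],
    },
    "blocks": [
        {
            "category": "text_block",       # one of the 19 layout categories
            "bbox": [72.0, 104.5, 523.8, 301.2],
            "text": "...",
            "attributes": {"language": "english",
                           "background": "white",
                           "rotation": 0},
        },
        {
            "category": "table",
            "bbox": [60.0, 330.0, 540.0, 612.0],
            "latex": "...",                 # table content in markup form
            "attributes": {"language": "english",
                           "frame": "full_frame",
                           "merged_cells": True,
                           "embedded_formula": False,
                           "colorful_background": False,
                           "rotation": 0},
        },
    ],
}
```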
The annotation pipeline employs intelligent pre-annotation (LayoutLMv3 for layout, PaddleOCR for text, UniMERNet and GPT-4o for formula/LaTeX, table-OCR), annotator refinement with manual and external tool-based correction (e.g., TablesGenerator, LaTeXLive), and an expert quality inspection phase including rendering-based checks (CDM) and triage by multiple researchers (Ouyang et al., 10 Dec 2024).
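As a rough sketch of how such a pre-annotation stage can be orchestrated, the following uses hypothetical wrapper functions (`detect_layout`, `recognize_text`, `recognize_formula`, `recognize_table`) as stand-ins for the layout, OCR, formula, and table models named above; it is not the benchmark's actual tooling.

```python
# Hypothetical orchestration of the pre-annotation stage; every helper below is a
# placeholder for the corresponding tool (LayoutLMv3, PaddleOCR, UniMERNet/GPT-4o, table OCR).
def pre_annotate(page_image):
    draft = {"blocks": []}
    for region in detect_layout(page_image):            # layout detection
        block = {"category": region.category, "bbox": region.bbox}
        crop = page_image.crop(region.bbox)
        if region.category in {"text_block", "title"}:
            block["text"] = recognize_text(crop)         # OCR
        elif region.category == "equation":
            block["latex"] = recognize_formula(crop)     # formula -> LaTeX
        elif region.category == "table":
            block["latex"] = recognize_table(crop)       # table structure recognition
        draft["blocks"].append(block)
    return draft    # handed to human annotators for refinement and expert inspection
```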
4. Evaluation Protocols and Metrics
OmniDocBench v1.5 employs a flexible three-level protocol:
- End-to-End: Given a PDF image, the model outputs markdown with special-component tags, enabling joint assessment of text, formulas, tables, and reading order.
- Single-Module: Evaluates OCR, layout detection, table recognition, formula recognition, and reading order in isolation, using unified ground truth.
- Attribute-Based: Partitions results by page/block attributes (e.g., language, layout complexity, special issues) to analyze robustness, as sketched below.
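A minimal sketch of the attribute-based slicing, assuming a hypothetical `results` list in which each entry carries its page or block attributes and a per-sample score:

```python
# Group per-sample scores by an attribute and report the mean per group.
from collections import defaultdict

def scores_by_attribute(results, attribute):
    """results: [{'attributes': {...}, 'edit_distance': float}, ...]"""
    groups = defaultdict(list)
    for r in results:
        groups[r["attributes"].get(attribute, "unknown")].append(r["edit_distance"])
    return {k: sum(v) / len(v) for k, v in groups.items()}

# e.g. scores_by_attribute(results, "language") -> {"english": ..., "chinese": ..., "mixed": ...}
```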
Key steps in the evaluation pipeline:
- Preprocessing markdown by removing noise.
- Component extraction in a fixed order (LaTeX tables, HTML tables, display formulas, markdown tables, code blocks, plain text), as sketched after this list.
- Recording character offsets for accurate reading order.
- Normalization of inline formulas for unified evaluation.
- Matching ground truth and predictions using “Adjacency Search Match,” which searches over splits and merges of adjacent units to minimize normalized edit distance (i.e., maximize similarity).
- Application of “ignore logic” to exclude headers/footers/captions from metrics.
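The fixed-order component extraction with offset recording can be sketched as follows; the regular expressions are simplified approximations for illustration, not the benchmark's exact patterns.

```python
# Illustrative fixed-order component extraction with character offsets.
import re

PATTERNS = [  # extraction proceeds in this fixed order; plain text is the remainder
    ("latex_table",    re.compile(r"\\begin\{tabular\}.*?\\end\{tabular\}", re.S)),
    ("html_table",     re.compile(r"<table.*?</table>", re.S | re.I)),
    ("display_math",   re.compile(r"\$\$.*?\$\$", re.S)),
    ("markdown_table", re.compile(r"(?:^\|.*\|[ \t]*\n?)+", re.M)),
    ("code_block",     re.compile(r"```.*?```", re.S)),
]

def extract_components(markdown):
    """Return (components, remaining_text). Each component keeps its start offset
    so reading order can later be evaluated over the original sequence."""
    components, text = [], markdown
    for name, pattern in PATTERNS:
        for m in pattern.finditer(text):
            components.append({"type": name, "start": m.start(), "content": m.group(0)})
        # Blank out matched spans with same-length whitespace so later offsets stay aligned.
        text = pattern.sub(lambda m: " " * len(m.group(0)), text)
    components.sort(key=lambda c: c["start"])
    return components, text   # `text` now holds only the plain-text remainder
```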
Principal metrics include:
- Normalized Edit Distance (NED): $\mathrm{NED}(p, g) = \mathrm{EditDist}(p, g) / \max(|p|, |g|)$, the Levenshtein distance between a predicted unit $p$ and its matched ground-truth unit $g$, normalized by the longer of the two (lower is better; a minimal implementation is sketched after this list).
- TEDS (Tree-Edit-Distance-based Similarity) for tables: $\mathrm{TEDS}(T_p, T_g) = 1 - \mathrm{TED}(T_p, T_g) / \max(|T_p|, |T_g|)$, where $T_p$ and $T_g$ are the predicted and ground-truth table trees (higher is better).
- CDM (Character Detection Matching) and BLEU for formula evaluation.
- Reading order is evaluated using NED over block/unit sequences (Ouyang et al., 10 Dec 2024).
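A minimal NED implementation consistent with the definition above, assuming the Adjacency Search Match step has already produced prediction/ground-truth unit pairs; this mirrors the standard formula rather than the benchmark's exact code.

```python
# NED: Levenshtein distance normalized by the longer string, averaged over matched pairs.
def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def ned(pred: str, gt: str) -> float:
    """Normalized edit distance in [0, 1]; 0 means an exact match."""
    if not pred and not gt:
        return 0.0
    return edit_distance(pred, gt) / max(len(pred), len(gt))

def average_ned(pairs):
    """pairs: iterable of (prediction, ground_truth) produced by the matching step."""
    pairs = list(pairs)
    return sum(ned(p, g) for p, g in pairs) / len(pairs)
```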
5. Comparative Results and System Analysis
Key benchmarks demonstrate significant variation in system strengths across document domains and attributes. For end-to-end parsing:
- Pipeline tools (MinerU, Mathpix) outperform vision-language models (VLMs) such as GPT-4o on English text (Edit↓ ≈ 0.058–0.101) and formulas (CDM ≈ 71–76%), with MinerU and Mathpix also leading in reading order (NED↓ ≈ 0.105–0.138).
- Document-type analysis shows pipelines excel for academic papers and financial reports (Edit↓ ≈ 0.025–0.033), while VLMs outperform on slides and handwritten notes (e.g., GPT-4o Edit↓ = 0.388 vs. MinerU = 0.984 for notes; lower is better).
- Attribute-robustness experiments indicate that Qwen2-VL and InternVL2 exhibit minimal metric variance (Edit↓ ≈ 0.124–0.157) under conditions such as fuzzy scans or watermarks; reading-order error roughly doubles on multi-column pages compared with single-column ones, though MinerU and Mathpix remain comparatively stable.
- In single-module evaluation, DocLayout-YOLO achieves the highest layout-detection mAP (48.7), RapidTable leads table recognition (TEDS ≈ 82.5), PaddleOCR yields the top OCR score (73.6% average), and Mathpix, GPT-4o, and UniMERNet attain formula CDM ≈ 86%, with GPT-4o highest under the strict matching criterion.
6. Addressed and Persisting Challenges
OmniDocBench v1.5 resolves prior constraints by:
- Enabling evaluation over a broad document-type spectrum with block- and span-level granularity.
- Providing a unified, rigorous metric suite suitable for markup-based outputs.
- Standardizing quality control with pre-annotation, human refinement, and rendering-based expert triage.
Unresolved challenges include annotation cost, limited script and language coverage, incomplete handling of complex multi-column and cross-page structures by current systems, and the lack of metrics for images, code blocks, and multi-page dependencies. Extending VLM fine-tuning to handle complex layouts and broadening the benchmark to low-resource document types remain open avenues.
7. Future Prospects
Planned directions for OmniDocBench include end-to-end VLM fine-tuning (e.g., with Qwen2-VL and InternVL), support for multi-page and interactive formats, new metrics for images and code, and enhanced focus on handwritten and scanned historical documents. These efforts aim to reduce annotation cost, broaden language/script support, and further raise benchmarking standards for fair and comprehensive document parsing (Ouyang et al., 10 Dec 2024).