OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations (2412.07626v2)

Published 10 Dec 2024 in cs.CV, cs.AI, and cs.IR

Abstract: Document content extraction is a critical task in computer vision, underpinning the data needs of LLMs and retrieval-augmented generation (RAG) systems. Despite recent progress, current document parsing methods have not been fairly and comprehensively evaluated due to the narrow coverage of document types and the simplified, unrealistic evaluation procedures in existing benchmarks. To address these gaps, we introduce OmniDocBench, a novel benchmark featuring high-quality annotations across nine document sources, including academic papers, textbooks, and more challenging cases such as handwritten notes and densely typeset newspapers. OmniDocBench supports flexible, multi-level evaluations, ranging from end-to-end assessment to task-specific and attribute-based analysis, using 19 layout categories and 15 attribute labels. We conduct a thorough evaluation of both pipeline-based methods and end-to-end vision-LLMs, revealing their strengths and weaknesses across different document types. OmniDocBench sets a new standard for fair, diverse, and fine-grained evaluation in document parsing. Dataset and code are available at https://github.com/opendatalab/OmniDocBench.

Summary

  • The paper introduces a novel benchmark that compiles diverse document types with comprehensive annotations for thorough evaluation.
  • It employs a flexible evaluation framework enabling end-to-end, module-specific, and attribute-based assessments, with metrics such as TEDS and Normalized Edit Distance.
  • The benchmark reveals the limitations of current pipeline-based methods and the promise of VLMs, guiding future advances in document parsing.

OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations

The paper "OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations" introduces a benchmark for automated document content extraction that spans multiple document types and carries comprehensive annotations. The work addresses a pressing need in computer vision: supplying the high-quality data demanded by LLMs and retrieval-augmented generation (RAG) systems.

Context and Significance

Document content extraction has become increasingly pivotal, given its role in supplying large models with data at scale and with contextual knowledge. Existing benchmarks, however, have struggled to evaluate parsing methods, whether modular pipelines or multimodal end-to-end models, in a comprehensive and diverse way: they often limit themselves to specific document types, such as academic papers, and rely on inadequate metric systems. This has left a significant gap between reported results and real-world applicability.

Contributions of OmniDocBench

  1. Diverse Dataset and Annotations: OmniDocBench distinguishes itself by assembling a high-quality evaluation dataset comprising nine distinct types of documents, such as textbooks, exam papers, and financial reports. It incorporates meticulous annotations across various dimensions, including 19 layout categories and 15 attribute labels.
  2. Flexible Evaluation Framework: The benchmark introduces a flexible, multi-level evaluation framework that can assess entire datasets, individual modules, or specific data types. Support for end-to-end, single-module, and attribute-based evaluations is a critical advance for the domain; a sketch of attribute-based evaluation follows this list.
  3. Comprehensive Evaluations: OmniDocBench enables a rigorous examination of mainstream methods, both traditional modular pipelines and multimodal end-to-end models. This analysis highlights their limitations, particularly in handling diverse document types and in the adequacy of existing metrics.
  4. Open Data and Tools: The dataset and accompanying tools are made accessible to encourage further research and development in document parsing technologies.
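To make the attribute-based mode concrete, the sketch below groups per-page scores by attribute labels. The record fields and values are hypothetical illustrations, not OmniDocBench's actual schema; they only show how one set of predictions can be sliced along different attributes.

```python
from collections import defaultdict

# Hypothetical per-page records: each page carries attribute labels
# (document type, language, ...) alongside a metric score. Field names
# and values are illustrative, not OmniDocBench's actual schema.
pages = [
    {"doc_type": "academic_paper", "language": "en", "text_edit_dist": 0.04},
    {"doc_type": "newspaper",      "language": "en", "text_edit_dist": 0.31},
    {"doc_type": "handwritten",    "language": "zh", "text_edit_dist": 0.52},
    {"doc_type": "academic_paper", "language": "zh", "text_edit_dist": 0.09},
]

def mean_by_attribute(records, attribute, metric):
    """Average a metric over pages that share the same attribute value."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[attribute]].append(rec[metric])
    return {value: sum(scores) / len(scores) for value, scores in groups.items()}

# The same predictions, sliced along two different attributes.
print(mean_by_attribute(pages, "doc_type", "text_edit_dist"))
print(mean_by_attribute(pages, "language", "text_edit_dist"))
```

The point of this slicing is that an aggregate score can hide sharp failure modes: a method that looks strong overall may still collapse on one attribute value, such as handwritten pages.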

Methodology

OmniDocBench constructs its dataset through a multi-stage process of data acquisition, intelligent pre-annotation, manual verification, and expert review. Document parsing performance is then assessed by a pipeline that extracts text, matches predictions to ground-truth blocks, and computes metrics. Evaluation criteria span pure text, tables, formulas, and reading order, employing measures such as TEDS for table structure and Normalized Edit Distance for text.
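A minimal sketch of the two metrics named above, under simplifying assumptions: Normalized Edit Distance implemented directly, and a TEDS-style score computed as 1 - TreeEditDist / max(|T_pred|, |T_gt|), using the third-party zss package for tree edit distance. The real TEDS operates on full HTML table trees with cell-content costs, so this illustrates the formula rather than the benchmark's exact implementation.

```python
from zss import Node, simple_distance  # pip install zss

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming over two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def normalized_edit_distance(pred: str, gt: str) -> float:
    """NED in [0, 1]; 0 means identical, 1 means entirely different."""
    if not pred and not gt:
        return 0.0
    return edit_distance(pred, gt) / max(len(pred), len(gt))

def tree_size(node: Node) -> int:
    return 1 + sum(tree_size(child) for child in node.children)

def teds(pred_tree: Node, gt_tree: Node) -> float:
    """TEDS-style score: 1 - TreeEditDist / max(|T_pred|, |T_gt|)."""
    dist = simple_distance(pred_tree, gt_tree)
    return 1.0 - dist / max(tree_size(pred_tree), tree_size(gt_tree))

# Toy HTML-like table trees; the prediction is missing one cell.
gt = Node("table").addkid(Node("tr").addkid(Node("td")).addkid(Node("td")))
pred = Node("table").addkid(Node("tr").addkid(Node("td")))

print(normalized_edit_distance("OmniDocBench", "OmniDocBech"))  # ~0.083
print(teds(pred, gt))  # 0.75: one insertion over a 4-node ground truth
```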

Evaluation and Results

The evaluations cover both component-specific and end-to-end assessments across diverse document types. Pipeline tools such as MinerU and Mathpix excel in precision but are constrained by the limited diversity of their training data. VLMs such as Qwen2-VL and InternVL2, by contrast, adapt better across specialized document types, underscoring the benefit of broad training exposure.
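For context, end-to-end VLM parsing of the kind evaluated here can be sketched with Hugging Face transformers and a Qwen2-VL checkpoint, as below. The prompt wording, checkpoint choice, and generation settings are assumptions for illustration, not the benchmark's actual protocol.

```python
# pip install transformers torch pillow accelerate
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
from PIL import Image

model_id = "Qwen/Qwen2-VL-7B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

# The instruction below is illustrative, not the paper's exact prompt.
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Convert this document page to Markdown. "
                             "Use LaTeX for formulas and HTML for tables."},
]}]
prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True)

page = Image.open("page.png")  # one rasterized PDF page
inputs = processor(text=[prompt], images=[page],
                   return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=2048)

# Strip the prompt tokens, keep only the generated continuation.
generated = output_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```

The generated Markdown would then be scored against ground truth with the metrics sketched earlier, which is essentially the end-to-end evaluation path the benchmark formalizes.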

Implications and Future Directions

The introduction of OmniDocBench has pivotal implications for both theoretical development and real-world application in document parsing. By offering a diverse and exhaustively annotated benchmark, it sets a new standard in evaluating document parsing methodologies. Furthermore, its flexible framework could encourage the development of more robust and universally applicable document parsing solutions.

Future research may explore the fine-tuning of general VLMs with specialized datasets and further refine models' performance in challenging layouts or document types with unique attributes. As the performance disparity between pipeline-based methods and VLMs narrows, there may be a shift towards leveraging VLMs for their scalability and versatility in document understanding tasks.

In sum, OmniDocBench represents a significant advance for the document extraction field, offering a comprehensive platform for benchmarking and for driving forward the capabilities of document parsing technologies.
