- The paper introduces a novel benchmark that compiles diverse document types with comprehensive annotations for thorough evaluation.
- It employs a flexible evaluation framework enabling end-to-end, module-specific, and attribute-based assessments, with metrics such as TEDS and Normalized Edit Distance.
- The benchmark reveals the limitations of current pipeline tools and shows the potential of VLMs, guiding future advances in document parsing.
OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations
The paper "OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations" introduces a novel benchmark, tailored for automated document content extraction that encompasses multiple document types and comprehensive annotations. This work effectively addresses the pressing need within the computer vision domain to meet the demands for high-quality data required by LLMs and retrieval-augmented generation (RAG) technologies.
Context and Significance
Document content extraction has become increasingly pivotal, given its role in supplying large models with scale and contextual knowledge. Existing evaluations of traditional methods, whether modular pipelines or multimodal end-to-end models, have lacked comprehensiveness and diversity, often restricting themselves to specific document types such as academic papers. Combined with inadequate metric systems, this has left a significant gap in real-world applicability.
Contributions of OmniDocBench
- Diverse Dataset and Annotations: OmniDocBench distinguishes itself by assembling a high-quality evaluation dataset comprising nine distinct types of documents, such as textbooks, exam papers, and financial reports. It incorporates meticulous annotations across various dimensions, including 19 layout categories and 14 attribute labels.
- Flexible Evaluation Framework: The benchmark introduces a flexible, multi-level evaluation framework that can assess the entire dataset, individual modules, or specific data types. The ability to perform end-to-end, single-module, and attribute-based evaluations is a critical advancement for the domain (a brief sketch follows this list).
- Comprehensive Evaluations: OmniDocBench facilitates a rigorous examination of mainstream methods, both traditional modular pipelines and multimodal end-to-end models. This analysis highlights their limitations, particularly in handling diverse documents and in the adequacy of existing metrics.
- Open Data and Tools: The dataset and accompanying tools are made accessible to encourage further research and development in document parsing technologies.
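To make the attribute-based evaluation concrete, here is a minimal sketch, not the benchmark's actual API, of how per-page scores carrying attribute labels could be sliced and averaged per attribute. All field names (`doc_type`, `language`, `rotation`, `score`) and the sample values are illustrative assumptions.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-page records: each page carries attribute labels and a
# score produced upstream (e.g. normalized edit distance on extracted text).
# Field names and values are illustrative only.
samples = [
    {"doc_type": "textbook", "language": "en", "rotation": "none", "score": 0.92},
    {"doc_type": "exam_paper", "language": "zh", "rotation": "none", "score": 0.81},
    {"doc_type": "financial_report", "language": "en", "rotation": "rotated", "score": 0.64},
]

def evaluate_by_attribute(samples, attribute):
    """Group per-page scores by one attribute label and average each group."""
    groups = defaultdict(list)
    for sample in samples:
        groups[sample[attribute]].append(sample["score"])
    return {label: mean(scores) for label, scores in groups.items()}

# End-to-end view: one average over all pages.
print("overall:", mean(s["score"] for s in samples))
# Attribute-based view: the same scores sliced by document type.
print("by doc_type:", evaluate_by_attribute(samples, "doc_type"))
```

The same grouping idea extends to any of the annotated attributes (language, rotation, layout category, and so on), which is what makes a multi-level breakdown possible from a single set of per-page results.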
Methodology
OmniDocBench constructs its dataset through a robust process of data acquisition, intelligent pre-annotation, manual verification, and expert review. Parsing performance is assessed with a pipeline that combines text extraction, matching algorithms, and metric calculation. Evaluation criteria cover pure text, tables, formulas, and reading order, using metrics such as TEDS for table structure and Normalized Edit Distance for text.
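As an illustration of one of these metrics, the following is a minimal, self-contained sketch of Normalized Edit Distance between a predicted text block and its matched ground truth. It is a generic Levenshtein implementation, not the benchmark's own evaluation code.

```python
def edit_distance(pred: str, gt: str) -> int:
    """Levenshtein distance via dynamic programming (two rolling rows)."""
    m, n = len(pred), len(gt)
    prev = list(range(n + 1))  # distances between "" and gt[:j]
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if pred[i - 1] == gt[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n]

def normalized_edit_distance(pred: str, gt: str) -> float:
    """Edit distance scaled to [0, 1] by the longer string's length (0 = identical)."""
    if not pred and not gt:
        return 0.0
    return edit_distance(pred, gt) / max(len(pred), len(gt))

# A single dropped character yields a small but non-zero distance.
print(normalized_edit_distance("OmniDocBench", "OmniDocBenh"))  # ~0.083
```

Scaling by the longer string keeps the score comparable across text blocks of very different lengths, which matters when aggregating over pages with mixed content density.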
Evaluation and Results
The evaluations cover both component-specific and end-to-end assessments across diverse document types. Pipeline tools such as MinerU and Mathpix excel in precision but are constrained by the limited diversity of their training data. VLMs such as Qwen2-VL and InternVL2, by contrast, adapt more readily to specialized document types, underscoring the benefit of broad training-data exposure.
Implications and Future Directions
The introduction of OmniDocBench has pivotal implications for both theoretical development and real-world application in document parsing. By offering a diverse and exhaustively annotated benchmark, it sets a new standard in evaluating document parsing methodologies. Furthermore, its flexible framework could encourage the development of more robust and universally applicable document parsing solutions.
Future research may explore the fine-tuning of general VLMs with specialized datasets and further refine models' performance in challenging layouts or document types with unique attributes. As the performance disparity between pipeline-based methods and VLMs narrows, there may be a shift towards leveraging VLMs for their scalability and versatility in document understanding tasks.
In sum, OmniDocBench represents a significant enhancement in the document extraction field, offering a comprehensive platform for benchmarking and driving forward the capabilities of document parsing technologies.