MMLongBench-Doc: Benchmarking Long-context Document Understanding with Visualizations
MMLongBench-Doc is a newly introduced benchmark aimed at evaluating the capabilities of Large Vision-Language Models (LVLMs) in long-form, multi-modal document understanding. The benchmark addresses significant gaps in previous datasets, which predominantly focused on short, single-page documents and therefore covered only a narrow slice of real document understanding.
Key Contributions
- Dataset Construction:
- The dataset comprises 130 PDF-formatted documents with an average of 49.4 pages and 20,970.9 textual tokens. These documents come from diverse sources such as research reports, financial reports, academic papers, brochures, and guidelines, ensuring broad coverage of document types.
- A distinctive feature is the inclusion of 1,062 expert-annotated questions that require evidence not only from textual content but also from images, charts, tables, and layout structures. Furthermore, 33.2% of these questions are cross-page questions, necessitating comprehension across multiple pages, and 22.8% are designed to be unanswerable, testing the models' hallucination detection capabilities.
- Evaluation Metrics:
- The benchmark uses a combination of generalized accuracy and F1 score to provide a nuanced evaluation across different question types and evidence sources.
- The benchmark methodology comprises a three-step evaluation protocol: response generation, answer extraction, and score calculation. The pipeline is designed so that automatic scores correlate closely with human judgment, with rule-based comparison for each answer format (strings, integers, floats, and lists); a minimal sketch of such rule-based comparison appears after this list.
- Comparison with Previous Datasets:
- In contrast to previous datasets like DocVQA, ChartQA, and SlideVQA, which primarily focused on single-page documents or documents of limited complexity, MMLongBench-Doc stands out for its complexity and diversity. Document lengths, page counts, and token counts in MMLongBench-Doc are significantly higher, pushing the boundaries of existing LVLM capabilities.
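To make the scoring step concrete, here is a minimal sketch of how a rule-based comparison between an extracted answer and a gold answer could work. The format labels ("Int", "Float", "List", "String"), the 1% relative tolerance for floats, and the set-equality rule for lists are illustrative assumptions, not the paper's exact rules.

```python
# Illustrative rule-based comparison of an extracted answer against a gold answer.
# Format labels and matching thresholds are assumptions for the sketch.

def answers_match(pred, gold, answer_format):
    """Return True if the extracted answer matches the gold answer."""
    if answer_format == "Int":
        try:
            return int(str(pred).replace(",", "")) == int(str(gold).replace(",", ""))
        except ValueError:
            return False
    if answer_format == "Float":
        try:
            p, g = float(str(pred).replace(",", "")), float(str(gold).replace(",", ""))
        except ValueError:
            return False
        return abs(p - g) <= 0.01 * max(abs(g), 1e-9)  # assumed 1% relative tolerance
    if answer_format == "List":
        # Order-insensitive comparison of normalized items (assumed rule).
        return {str(x).strip().lower() for x in pred} == {str(x).strip().lower() for x in gold}
    # Default: normalized string comparison; also covers "Not answerable" responses.
    return str(pred).strip().lower() == str(gold).strip().lower()


if __name__ == "__main__":
    print(answers_match("20,970.9", 20970.9, "Float"))                  # True
    print(answers_match(["2019", "2020"], ["2020", "2019"], "List"))    # True
    print(answers_match("Not answerable", "Not answerable", "String"))  # True
```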
Experimental Results
The paper conducts extensive experiments evaluating 14 LVLMs and 10 LLMs. The results show that long-context document understanding remains a substantial challenge for current state-of-the-art models:
- Performance Indicators:
- The best-performing model, GPT-4o, achieved an F1 score of only 42.7%. This performance starkly contrasts with traditional document understanding tasks where models often exceed 90% accuracy.
- Interestingly, many LVLMs fed the page images directly perform worse than LLMs fed only OCR-parsed text, underlining how difficult it is to process multi-modal, multi-page documents effectively (the two input settings are sketched after this list).
- Error Analysis:
- Most errors were attributed to hallucinated evidence, perceptual inaccuracies, and failures to gather complete evidence for cross-page questions. For instance, GPT-4o frequently attempted to answer questions that were in fact unanswerable, leading to a higher hallucination rate.
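The two evaluation settings referenced above differ only in how the document is presented to the model: LVLMs receive rendered page images, while LLMs receive parsed text. The sketch below uses PyMuPDF (the `fitz` package) for rendering and text extraction; the paper specifies the two settings, not this particular tooling, and the file name is hypothetical.

```python
# Sketch of preparing both input formats for one document:
# page screenshots for LVLMs, concatenated parsed text for LLMs.
import fitz  # pip install pymupdf


def prepare_inputs(pdf_path: str, dpi: int = 144):
    """Return (page_image_paths, concatenated_text) for one PDF document."""
    image_paths, page_texts = [], []
    with fitz.open(pdf_path) as doc:
        for i, page in enumerate(doc):
            out = f"{pdf_path}.page{i:03d}.png"
            page.get_pixmap(dpi=dpi).save(out)   # screenshot-style input for LVLMs
            image_paths.append(out)
            page_texts.append(page.get_text())   # parsed-text input for LLMs
    return image_paths, "\n\n".join(page_texts)


if __name__ == "__main__":
    images, text = prepare_inputs("example_report.pdf")  # hypothetical file
    print(len(images), "pages;", len(text), "characters of parsed text")
```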
Implications and Future Directions
The pronounced challenges highlighted by MMLongBench-Doc underscore the necessity for more robust LVLM architectures capable of long-context comprehension. The benchmark reveals specific areas where existing models falter, such as:
- Perceptual Capability: Enhancing the visual perception of models to accurately interpret images, charts, and complex layouts.
- Cross-page Comprehension: Developing mechanisms for effective global search and information aggregation across multiple document pages (a retrieve-then-read sketch follows this list).
- Hallucination Mitigation: Improving the models' ability to recognize when a question is unanswerable to reduce false positives.
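One common way to approach cross-page aggregation, not proposed by the paper itself, is a retrieve-then-read pipeline: rank pages by relevance to the question, then feed only the top pages to the model along with an explicit abstention instruction. The sketch below uses scikit-learn TF-IDF retrieval purely for illustration; stronger systems would use learned text or visual embeddings.

```python
# Illustrative retrieve-then-read pipeline over per-page text.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def retrieve_pages(question: str, page_texts: list[str], top_k: int = 5) -> list[int]:
    """Rank pages by TF-IDF cosine similarity to the question; return top-k indices."""
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform(page_texts + [question])
    scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
    return sorted(range(len(page_texts)), key=lambda i: scores[i], reverse=True)[:top_k]


def build_prompt(question: str, page_texts: list[str], top_k: int = 5) -> str:
    """Aggregate retrieved pages (kept in document order) into a single prompt."""
    selected = sorted(retrieve_pages(question, page_texts, top_k))
    context = "\n\n".join(f"[Page {i + 1}]\n{page_texts[i]}" for i in selected)
    return (f"{context}\n\nQuestion: {question}\n"
            "Answer 'Not answerable' if the document does not contain the evidence.")
```

Retrieval of this kind trades recall for context length: it keeps prompts short enough for current models, but a cross-page question whose evidence is missed by the retriever becomes unanswerable by construction.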
Moving forward, the dataset could be instrumental in guiding the next generation of research in multi-modal long-context document understanding. Enhancing the pre-training corpus with more diverse and complex long-form documents, along with fine-tuning strategies, may bridge the performance gap identified in this paper. Furthermore, practical applications of these advancements could span various domains including legal document analysis, scientific literature reviews, and large-scale financial report audits.
Conclusion
MMLongBench-Doc represents a significant advancement in evaluating the document understanding capabilities of LVLMs, particularly in the context of long, multi-modal documents. By identifying explicit challenges and providing a robust benchmark, this work paves the way for future developments that could significantly enhance the capabilities of LVLMs in practical, real-world applications.