Overview of the GDI-Bench Paper
The paper introduces the General Document Intelligence Benchmark (GDI-Bench), a framework for evaluating multimodal large language models (MLLMs) on document-specific tasks. The authors aim to address the limitations of existing benchmarks, which fail to diagnose model weaknesses or provide guidance for systematic improvement. GDI-Bench comprises 1,900 images spanning nine scenarios and 19 distinct tasks, offering a comprehensive platform for evaluating document intelligence.
Key Contributions and Methodology
The novel aspect of GDI-Bench is its decoupled evaluation mechanism, which separates visual complexity from reasoning complexity. This dual-axis grading enables a more granular assessment of model performance, letting researchers pinpoint specific areas of strength and weakness. Each task is assigned one of three visual-complexity levels (V0 to V2) and one of three reasoning-complexity levels (R0 to R2), yielding a matrix of tasks that spans both simple and compound challenges.
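To make the decoupling concrete, the sketch below shows one way such dual-axis results could be tabulated: each item carries a visual level (V0 to V2) and a reasoning level (R0 to R2), and accuracy is aggregated per cell of the resulting 3x3 grid. The field names, task labels, and aggregation are illustrative assumptions for this sketch, not the paper's released data format or scoring script.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class BenchmarkItem:
    task: str          # hypothetical task label, e.g. "key-value extraction"
    visual_level: int  # 0, 1, or 2  (V0-V2)
    reason_level: int  # 0, 1, or 2  (R0-R2)
    correct: bool      # whether the model answered this item correctly

def score_matrix(items):
    """Aggregate accuracy into a grid keyed by (visual, reasoning) complexity."""
    totals, hits = defaultdict(int), defaultdict(int)
    for it in items:
        key = (it.visual_level, it.reason_level)
        totals[key] += 1
        hits[key] += int(it.correct)
    return {key: hits[key] / totals[key] for key in totals}

# A model that handles reasoning-heavy items but fails on visually dense ones
# would score well in high-R cells and poorly in high-V cells.
results = [
    BenchmarkItem("key-value extraction", 0, 0, True),
    BenchmarkItem("cross-page summarization", 0, 2, True),
    BenchmarkItem("dense table parsing", 2, 0, False),
]
print(score_matrix(results))  # {(0, 0): 1.0, (0, 2): 1.0, (2, 0): 0.0}
```

Reporting a per-cell score rather than a single average is what lets the benchmark distinguish a visual-processing weakness from a reasoning weakness.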
A major technical contribution is the development of a new model, termed the GDI-Model, which is fine-tuned using a customized intelligence-preserving training strategy. The research addresses catastrophic forgetting, a common issue in the supervised fine-tuning (SFT) of MLLMs, by introducing Layer-wise Adaptive Freeze-Tuning (LW-AFT). This method freezes the majority of the model's parameters during SFT and updates only a small subset of domain-sensitive parameters, thereby retaining pre-trained, domain-general knowledge.
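As a rough illustration of the freeze-tuning idea, the sketch below ranks the blocks of a toy model by a gradient-norm sensitivity proxy computed on a small probe batch from the target domain, then unfreezes only the top-ranked fraction before SFT. The toy architecture, the gradient-norm criterion, and the keep ratio are assumptions made for this sketch; the paper defines its own rule for identifying domain-sensitive parameters.

```python
import torch
import torch.nn as nn

class ToyBlock(nn.Module):
    """Stand-in for one transformer block of an MLLM."""
    def __init__(self, dim=32):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
    def forward(self, x):
        return torch.relu(self.proj(x))

class ToyMLLM(nn.Module):
    def __init__(self, dim=32, n_layers=8, n_classes=10):
        super().__init__()
        self.layers = nn.ModuleList([ToyBlock(dim) for _ in range(n_layers)])
        self.head = nn.Linear(dim, n_classes)
    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return self.head(x)

def layer_sensitivity(model, probe_x, probe_y):
    """Illustrative sensitivity proxy: per-block gradient norm on a small
    probe batch drawn from the target (document) domain."""
    loss = nn.functional.cross_entropy(model(probe_x), probe_y)
    loss.backward()
    scores = {
        i: sum(p.grad.norm().item() for p in layer.parameters())
        for i, layer in enumerate(model.layers)
    }
    model.zero_grad()
    return scores

def freeze_all_but_sensitive(model, scores, keep_ratio=0.25):
    """Freeze every block except the top-`keep_ratio` fraction ranked by
    sensitivity; only those blocks receive gradient updates during SFT."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    keep = set(ranked[: max(1, int(len(ranked) * keep_ratio))])
    for i, layer in enumerate(model.layers):
        for p in layer.parameters():
            p.requires_grad = i in keep
    return keep

model = ToyMLLM()
x, y = torch.randn(4, 32), torch.randint(0, 10, (4,))
keep = freeze_all_but_sensitive(model, layer_sensitivity(model, x, y))
print(f"trainable blocks: {sorted(keep)}")  # e.g. [6, 7]
```

Because most parameters never receive gradient updates, the general-domain behavior encoded in the frozen layers is left intact, which is the intuition behind mitigating catastrophic forgetting during SFT.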
Evaluation and Results
A range of state-of-the-art open-source and closed-source models was evaluated on the benchmark, revealing notable performance differences along the decoupled complexity axes. For example, GPT-4o excelled in reasoning tasks but showed limitations in visual processing. The proposed GDI-Model outperformed existing models on both prior benchmarks and the newly established GDI-Bench, highlighting its adaptability and generalization. A detailed analysis across GDI-Bench tasks shows that the LW-AFT approach effectively mitigates forgetting without sacrificing performance on general datasets.
Implications and Future Directions
The introduction of GDI-Bench and its associated methodologies has significant implications for advancing document intelligence in AI research. By providing a nuanced framework for evaluating MLLMs, GDI-Bench drives the development of models capable of handling complex document analysis tasks, which are increasingly important in real-world applications. The decoupled evaluation strategy sets a precedent for future benchmarks, advocating more fine-grained approaches to assessing AI capabilities.
Future research could explore expanding GDI-Bench to include additional scenarios and task types, thereby increasing its applicability across broader domains. Furthermore, the integration of GDI-Bench with existing AI systems could enhance the development of models that require cross-modal comprehension and reasoning, offering valuable insights into the synergy between visual and linguistic data processing.
In conclusion, the GDI-Bench paper presents a robust and detailed framework for evaluating multimodal document intelligence, with particular attention to catastrophic forgetting and the disparate demands of document tasks. The work marks a step forward in the structured assessment of AI models and encourages further exploration of comprehensive benchmarking practices.