M-Longdoc: A Benchmark For Multimodal Super-Long Document Understanding And A Retrieval-Aware Tuning Framework (2411.06176v1)

Published 9 Nov 2024 in cs.CL

Abstract: The ability to understand and answer questions over documents can be useful in many business and practical applications. However, documents often contain lengthy and diverse multimodal contents such as texts, figures, and tables, which are very time-consuming for humans to read thoroughly. Hence, there is an urgent need to develop effective and automated methods to aid humans in this task. In this work, we introduce M-LongDoc, a benchmark of 851 samples, and an automated framework to evaluate the performance of large multimodal models. We further propose a retrieval-aware tuning approach for efficient and effective multimodal document reading. Compared to existing works, our benchmark consists of more recent and lengthy documents with hundreds of pages, while also requiring open-ended solutions and not just extractive answers. To our knowledge, our training framework is the first to directly address the retrieval setting for multimodal long documents. To enable tuning open-source models, we construct a training corpus in a fully automatic manner for the question-answering task over such documents. Experiments show that our tuning approach achieves a relative improvement of 4.6% for the correctness of model responses, compared to the baseline open-source models. Our data, code, and models are available at https://multimodal-documents.github.io.

Summary

  • The paper introduces M-LongDoc, a benchmark that evaluates large multimodal models on long, complex documents containing text, figures, and tables.
  • It proposes a novel retrieval-aware tuning method combining supervised fine-tuning with retrieval-augmented generation, yielding a 4.6% relative improvement in answer correctness over baseline open-source models.
  • The study features an automated evaluation framework using multiple judge models that achieve an 88.9% correlation with human assessments.

An Evaluation and Tuning Framework for Multimodal Long Document Understanding

The paper introduces M-LongDoc, a benchmark designed to evaluate the capabilities of large multimodal models in understanding extensive and complex documents. Additionally, a novel retrieval-aware tuning approach is proposed to enhance model performance when processing long multimodal documents.

M-LongDoc stands out in the literature by focusing on documents that span hundreds of pages and contain diverse content, including text, figures, and tables. The benchmark provides 851 samples, each requiring an in-depth, open-ended answer rather than a short extractive span, setting it apart from existing datasets that focus on shorter documents and simpler tasks.
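To make the task concrete, the following is a purely illustrative sketch of what a single benchmark sample could look like; the field names, values, and schema are assumptions for exposition, not the released data format.

```python
# Hypothetical shape of one M-LongDoc-style sample (illustrative only).
sample = {
    "doc_id": "report_2024_001",
    "num_pages": 312,                       # documents span hundreds of pages
    "question": "How does the trend in the revenue figure compare with the summary table?",
    "evidence_category": "figure",          # text, figure, or table
    "answer": "An open-ended, multi-sentence solution rather than a short extracted span.",
}
```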

The benchmark introduces a scalable and automated evaluation metric that circumvents the need for reference answers by employing multiple judge models, which assign correctness scores based on detailed evaluation guides. This approach helps reduce the subjective bias of any single judge and achieves a high correlation with human preferences (88.9%), supporting the reliability of the automated scoring. The paper also highlights a limitation of current models: they perform noticeably worse on questions grounded in figures and tables than on text-only questions, revealing a bias toward textual content and difficulty staying focused amid potentially irrelevant retrieved content.
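A minimal sketch of reference-free, multi-judge scoring in this spirit is shown below. The prompt wording, the 1-5 score scale, and the judge-callable interface are assumptions for illustration, not the authors' exact protocol.

```python
# Sketch: average correctness scores from several judge models, given an
# evaluation guide instead of a reference answer (assumed interface).
from statistics import mean
from typing import Callable

JUDGE_PROMPT = (
    "You are grading an answer to a question about a long multimodal document.\n"
    "Evaluation guide:\n{guide}\n\n"
    "Question: {question}\n"
    "Candidate answer: {answer}\n\n"
    "Return a single correctness score from 1 (wrong) to 5 (fully correct)."
)

def score_answer(
    question: str,
    answer: str,
    guide: str,
    judges: list[Callable[[str], str]],  # each judge maps a prompt to raw text
) -> float:
    """Average the correctness scores produced by several judge models."""
    prompt = JUDGE_PROMPT.format(guide=guide, question=question, answer=answer)
    scores = []
    for judge in judges:
        raw = judge(prompt)
        digits = [int(ch) for ch in raw if ch.isdigit()]  # naive score parsing
        if digits:
            scores.append(digits[0])
    return mean(scores) if scores else 0.0
```

Averaging over several judges is what lets the procedure scale without reference answers while keeping any single model's quirks from dominating the score.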

To address these challenges, the paper describes a retrieval-aware tuning method that merges supervised fine-tuning with retrieval-augmented generation. The method injects distracting retrieved content into training examples, compelling models to learn to distinguish relevant from irrelevant information. It is supported by a large-scale training corpus, constructed fully automatically, for question answering over such documents.
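The sketch below shows one plausible way such a training example could be assembled: gold evidence pages are mixed with top-retrieved distractor pages so the model sees noisy context at training time. The function names, `Page` structure, and retriever interface are illustrative assumptions, not the paper's exact pipeline.

```python
# Sketch: build a retrieval-aware fine-tuning example with distractor pages.
import random
from dataclasses import dataclass

@dataclass
class Page:
    page_id: int
    content: str  # text, or a caption/placeholder for a figure or table

def build_training_example(
    question: str,
    answer: str,
    gold_pages: list[Page],
    retrieved_pages: list[Page],
    k: int = 5,
) -> dict:
    """Combine gold evidence with retrieved (possibly irrelevant) pages."""
    gold_ids = {p.page_id for p in gold_pages}
    distractors = [p for p in retrieved_pages if p.page_id not in gold_ids][:k]
    context = gold_pages + distractors
    random.shuffle(context)  # the model should not rely on evidence position
    prompt = "\n\n".join(f"[Page {p.page_id}]\n{p.content}" for p in context)
    return {"input": f"{prompt}\n\nQuestion: {question}", "target": answer}
```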

Experimental results highlight the efficacy of these approaches. Notably, the retrieval-aware tuning framework yields a 4.6% relative improvement in answer correctness scores over the baseline open-source models, demonstrating the utility of retrieval-aware tuning for enhancing document understanding. The main results further delineate the gap between open-source models such as Qwen2-VL and proprietary models such as GPT-4o, indicating paths for closing this capability gap.
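As a back-of-the-envelope check of what a 4.6% relative improvement means, the snippet below uses a made-up baseline score; the numbers are illustrative, not from the paper.

```python
# Relative improvement: tuned = baseline * (1 + 0.046). Baseline is hypothetical.
baseline = 3.0            # illustrative average correctness score
tuned = baseline * 1.046  # 4.6% relative improvement
print(f"baseline={baseline:.2f}, tuned={tuned:.3f}")  # -> tuned≈3.138
```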

Key contributions of the paper are:

  1. Establishment of M-LongDoc as a challenging, realistic benchmark for multimodal long documents.
  2. Development of an automated, scalable evaluation procedure for in-depth model assessment.
  3. Proposal of a retrieval-aware tuning strategy, markedly improving model efficacy in document question answering.

While the paper represents substantive progress in processing multimodal documents, the analysis indicates room for further research, particularly in alleviating models' bias toward text and improving comprehension of tables and figures. The exploration of retrieval-aware frameworks thus provides a promising direction for more robust multimodal document analysis in practical applications.

Thus, this research contributes significantly to the domain of multimodal document understanding, laying groundwork for future studies focused on large, complex datasets prevalent in real-world scenarios. The M-LongDoc benchmark, alongside the proposed tuning methodologies, offers a pivotal foundation upon which future innovations and evaluations can be structured.