- The paper introduces M-LongDoc, a benchmark that evaluates large multimodal models on long, complex documents containing text, figures, and tables.
- It proposes a retrieval-aware tuning method that combines supervised fine-tuning with retrieval-augmented generation, yielding a 4.6% relative improvement in answer correctness.
- The study features an automated evaluation framework using multiple judge models that achieve an 88.9% correlation with human assessments.
An Evaluation and Tuning Framework for Multimodal Long Document Understanding
The paper introduces M-LongDoc, a benchmark designed to evaluate the capabilities of large multimodal models in understanding extensive and complex documents. Additionally, a novel retrieval-aware tuning approach is proposed to enhance model performance when processing long multimodal documents.
M-LongDoc stands out in the literature by focusing on documents that span hundreds of pages and contain diverse content, including text, figures, and tables. The benchmark provides 851 samples whose open-ended questions require in-depth understanding, going beyond existing datasets that focus on shorter and simpler tasks.
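To make the task format concrete, the following is a minimal sketch of how such a benchmark sample and its source document might be represented in code. The `Page` and `Sample` structures, field names, and example content are illustrative assumptions rather than the benchmark's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    """One page of a long multimodal document (illustrative structure)."""
    page_number: int
    text: str                                          # extracted body text
    figures: list[str] = field(default_factory=list)   # paths to figure images
    tables: list[str] = field(default_factory=list)    # paths or serialized tables


@dataclass
class Sample:
    """One open-ended QA sample over a multi-hundred-page document."""
    doc_id: str
    pages: list[Page]   # often hundreds of pages
    question: str       # open-ended; may target text, a figure, or a table
    category: str       # e.g. "text", "figure", or "table" question


# Hypothetical example: a figure-focused question over a 200-page report.
sample = Sample(
    doc_id="report-001",
    pages=[Page(page_number=i, text=f"Section text for page {i}...") for i in range(1, 201)],
    question="What trend does the revenue chart on page 42 show, and what explains it?",
    category="figure",
)
```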
The benchmark introduces a scalable, automated evaluation metric that avoids the need for reference answers by employing multiple judge models to assign correctness scores based on detailed evaluation guides. This design reduces subjective bias and achieves a high correlation with human preferences (88.9%), supporting the reliability of automated scoring. The paper also highlights limitations of current models, which perform worse on figure- and table-based questions than on text-only questions, revealing a bias toward textual content and difficulty staying focused amid potentially irrelevant retrieved content.
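The following is a minimal sketch of this reference-free, multi-judge scoring idea. The prompt wording, the 1-5 scale, and the `judge_fn`-style callables standing in for actual judge models are assumptions for illustration, not the paper's exact procedure.

```python
import re
import statistics
from typing import Callable

# A judge is any callable mapping a prompt string to a model response string,
# e.g. a thin wrapper around an API-hosted multimodal model (hypothetical).
Judge = Callable[[str], str]

SCORING_PROMPT = """You are grading an answer to an open-ended question about a long document.
Question:
{question}

Evaluation guide (key points a correct answer should cover):
{guide}

Candidate answer:
{answer}

Rate the answer's correctness from 1 (wrong) to 5 (fully correct).
Reply with only the number."""


def score_answer(question: str, guide: str, answer: str, judges: list[Judge]) -> float:
    """Average correctness score from multiple judge models (no reference answer needed)."""
    prompt = SCORING_PROMPT.format(question=question, guide=guide, answer=answer)
    scores = []
    for judge in judges:
        reply = judge(prompt)
        match = re.search(r"[1-5]", reply)  # take the first score-like digit in the reply
        if match:
            scores.append(int(match.group()))
    return statistics.mean(scores) if scores else 0.0


# Usage (hypothetical judges): score_answer(q, guide, ans, judges=[judge_a, judge_b, judge_c])
```

Averaging over several independent judges is what makes the score less sensitive to any single model's idiosyncrasies, which is presumably why multiple judge models are used.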
To address these challenges, the paper describes a retrieval-aware tuning method that merges supervised fine-tuning with retrieval-augmented generation. The method deliberately mixes distracting content into training examples, so models learn to distinguish relevant from irrelevant information. This approach is supported by a large-scale training corpus tailored for tuning document understanding models.
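A minimal sketch of that data-construction idea follows: the gold evidence page is shuffled among distractor pages retrieved from the same document, so the tuned model must answer while ignoring irrelevant context. The `Retriever` interface, prompt format, and function name are illustrative assumptions, not the paper's implementation.

```python
import random
from typing import Callable

# Assumed retriever interface: retrieve(query, doc_id, k) -> page texts ranked by relevance.
Retriever = Callable[[str, str, int], list[str]]


def build_training_example(question: str, gold_page: str, answer: str,
                           doc_id: str, retrieve: Retriever,
                           num_distractors: int = 4, seed: int = 0) -> dict:
    """Create one retrieval-aware SFT example: gold evidence mixed with distractor pages."""
    rng = random.Random(seed)
    # Retrieve extra pages from the same document; drop the gold page if it reappears.
    distractors = [p for p in retrieve(question, doc_id, num_distractors + 1)
                   if p != gold_page][:num_distractors]
    context_pages = distractors + [gold_page]
    rng.shuffle(context_pages)  # the model must locate the relevant page itself

    context = "\n\n".join(f"[Page {i + 1}]\n{page}" for i, page in enumerate(context_pages))
    prompt = f"{context}\n\nQuestion: {question}\nAnswer based only on the relevant content above."
    return {"prompt": prompt, "target": answer}
```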
Experimental results highlight the efficacy of these approaches. Notably, the retrieval-aware tuning framework yields a reported 4.6% relative improvement in answer correctness scores, demonstrating the value of retrieval-aware tuning for the document understanding capabilities of open-source models. The paper's main results table further delineates the gap between open-source models such as Qwen2-VL and proprietary models such as GPT-4o, indicating paths for closing this capability gap.
Key contributions of the paper are:
- Establishment of M-LongDoc as a challenging, realistic benchmark for multimodal long documents.
- Development of an automated, scalable evaluation procedure for in-depth model assessment.
- Proposal of a retrieval-aware tuning strategy, markedly improving model efficacy in document question answering.
While the paper represents substantive progress in processing multimodal documents, the analysis indicates room for further research, particularly in reducing models' bias toward text and improving comprehension of tables and figures. The exploration of retrieval-aware frameworks thus provides a promising direction for more robust multimodal document analysis in practical applications.
Thus, this research contributes significantly to the domain of multimodal document understanding, laying groundwork for future studies focused on large, complex datasets prevalent in real-world scenarios. The M-LongDoc benchmark, alongside the proposed tuning methodologies, offers a pivotal foundation upon which future innovations and evaluations can be structured.