- The paper introduces DocHaystack and InfoHaystack, two new benchmarks that evaluate vision-language models on large-scale document retrieval and reasoning over collections of up to 1,000 visual documents.
- The paper proposes V-RAG, a vision-centric retrieval-augmented generation framework that achieves significant gains over prior methods (9% and 11% Recall@1 on DocHaystack-1000 and InfoHaystack-1000, respectively).
- These contributions provide more realistic evaluation tools and a robust framework to push the boundaries of vision-language models in processing extensive visual data collections.
Document Haystacks: Vision-Language Reasoning Over Piles of 1000+ Documents
The paper "Document Haystacks: Vision-Language Reasoning Over Piles of 1000+ Documents" addresses a significant limitation in the current benchmarks for large multimodal models (LMMs) regarding vision-language understanding, particularly in relation to reasoning across extensive collections of images or documents. In real-world scenarios, large-scale document retrieval is a common challenge, yet existing benchmarks pair questions with a limited set of images, often not exceeding 30. The authors propose two new benchmarks, DocHaystack and InfoHaystack, designed to evaluate LMMs on their performance in large-scale visual document retrieval tasks, with each question potentially mapped to up to 1,000 visual documents. This development significantly enhances the complexity and applicability of retrieval tasks thereby offering better tools for assessing models' capabilities in approximating real-world use-cases.
The paper introduces V-RAG, a vision-centric retrieval-augmented generation framework that combines several multimodal vision encoders, each contributing different strengths, with a dedicated question-document relevance module. V-RAG improves retrieval accuracy substantially, with gains of 9% and 11% in Recall@1 on the challenging DocHaystack-1000 and InfoHaystack-1000 benchmarks, respectively, over the prior state of the art. Coupling V-RAG with an LMM enables effective reasoning over thousands of images and yields significant improvements on the proposed benchmarks compared to existing methods. Together, the framework and the benchmarks raise both the difficulty and the practical relevance of LMM evaluation.
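At a high level, the paper describes a retrieve-filter-answer flow: a coarse retrieval stage narrows the full document pool to a shortlist, an LMM-based relevance check prunes that shortlist, and an LMM-VQA stage produces the answer. The skeleton below is a schematic sketch of that flow; the callables passed in stand for the components the paper describes and are not its released code:

```python
from typing import Callable, List

def vrag_answer(
    question: str,
    doc_ids: List[str],
    retrieve: Callable[[str, List[str], int], List[str]],   # encoder-ensemble retrieval (see sketch below)
    lmm_filter: Callable[[str, List[str]], List[str]],       # LMM-based relevance check
    lmm_vqa: Callable[[str, List[str]], str],                # answer generation over retained documents
    top_k: int = 10,
    keep: int = 3,
) -> str:
    """Schematic V-RAG-style flow: coarse retrieval -> LMM filtering -> VQA."""
    candidates = retrieve(question, doc_ids, top_k)      # shortlist from the full 1,000-document pool
    relevant = lmm_filter(question, candidates)[:keep]   # drop weakly related documents
    return lmm_vqa(question, relevant)                   # answer conditioned on the kept documents
```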
An analysis of existing benchmarks such as RetVQA and WebVQA highlights their limited scope: each question is typically associated with fewer than 30 images. DocHaystack and InfoHaystack bridge this gap by expanding the candidate pool per question, thereby better simulating practical application scenarios. The authors also address inherent challenges in benchmark construction, such as ensuring that questions are specific to a single document and excluding questions that LMMs can answer without visual context. A data filtering pipeline combining LLM-based screening with human annotation was used to select document-specific question-answer pairs, improving the precision and credibility of the evaluation.
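As one illustration of the LLM-based screening step, a model can be prompted to flag questions that are too generic or answerable without the document; the prompt wording and the `ask_llm` callable below are assumptions made for this sketch, and the paper additionally relies on human annotation:

```python
from typing import Callable, Dict, List

FILTER_PROMPT = (
    "Question: {question}\n"
    "Answer 'DROP' if this question can be answered without seeing a specific document, "
    "or if it does not point to a unique document. Otherwise answer 'KEEP'."
)

def filter_qa_pairs(
    qa_pairs: List[Dict],               # each: {"question": str, "answer": str, "doc_id": str}
    ask_llm: Callable[[str], str],      # returns the LLM's text response
) -> List[Dict]:
    """Keep only document-specific questions, as judged by an LLM screening pass."""
    kept = []
    for qa in qa_pairs:
        verdict = ask_llm(FILTER_PROMPT.format(question=qa["question"]))
        if verdict.strip().upper().startswith("KEEP"):
            kept.append(qa)
    return kept  # human annotators would further verify this subset
```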
The paper attributes these gains in retrieval accuracy to V-RAG's architecture, which combines multiple vision encoders (CLIP, SigLIP, and OpenCLIP), each contributing distinct strengths, to retrieve over extensive visual document sets. A subsequent LMM-filtering module then refines the results so that only the most relevant documents are passed to the LMM-VQA module. The empirical results illustrate V-RAG's advantage over previous techniques, especially in the challenging 1,000-document settings.
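A minimal sketch of the encoder-ensemble retrieval idea follows, assuming each encoder is exposed as a pair of text and image embedding callables (for example, models loaded with the open_clip package) and that per-encoder cosine similarities are fused by simple averaging; the fusion rule is an assumption for illustration, not necessarily the paper's exact relevance module:

```python
from typing import Callable, Dict, List, Tuple
import numpy as np

# An "encoder" here is a (encode_text, encode_image) pair returning 1-D embeddings.
Encoder = Tuple[Callable[[str], np.ndarray], Callable[[str], np.ndarray]]

def ensemble_retrieve(
    question: str,
    doc_paths: List[str],
    encoders: Dict[str, Encoder],   # e.g. {"clip": ..., "siglip": ..., "openclip": ...}
    top_k: int = 10,
) -> List[str]:
    """Rank documents by the mean cosine similarity across several vision-language encoders."""
    fused = np.zeros(len(doc_paths))
    for encode_text, encode_image in encoders.values():
        q = encode_text(question)
        q = q / np.linalg.norm(q)
        docs = np.stack([encode_image(p) for p in doc_paths])
        docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
        fused += docs @ q                   # cosine similarity of each document to the question
    fused /= max(len(encoders), 1)          # simple average fusion (an assumption)
    order = np.argsort(-fused)[:top_k]
    return [doc_paths[i] for i in order]
```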
In conclusion, the paper makes a pragmatic contribution toward improving LMM performance on large-scale vision-language tasks by introducing comprehensive benchmarks alongside a robust retrieval framework. The evaluation insights could guide future work on enhancing the accuracy and efficiency of LMMs for real-world vision-language retrieval. Beyond extending current capabilities, the contributions argue for placing complex, large-scale problem settings at the center of model evaluation. Given the growing need to process vast amounts of visual data in practical applications, these advances carry both theoretical and practical weight for academia and industry. Future directions could include further improving model efficiency and integrating more advanced filtering techniques to maximize retrieval quality in increasingly complex data scenarios.