Visual Haystacks: A Vision-Centric Needle-In-A-Haystack Benchmark (2407.13766v2)

Published 18 Jul 2024 in cs.CV

Abstract: Large Multimodal Models (LMMs) have made significant strides in visual question-answering for single images. Recent advancements like long-context LMMs have allowed them to ingest larger, or even multiple, images. However, the ability to process a large number of visual tokens does not guarantee effective retrieval and reasoning for multi-image question answering (MIQA), especially in real-world applications like photo album searches or satellite imagery analysis. In this work, we first assess the limitations of current benchmarks for long-context LMMs. We address these limitations by introducing a new vision-centric, long-context benchmark, "Visual Haystacks (VHs)". We comprehensively evaluate both open-source and proprietary models on VHs, and demonstrate that these models struggle when reasoning across potentially unrelated images, perform poorly on cross-image reasoning, as well as exhibit biases based on the placement of key information within the context window. Towards a solution, we introduce MIRAGE (Multi-Image Retrieval Augmented Generation), an open-source, lightweight visual-RAG framework that processes up to 10k images on a single 40G A100 GPU -- far surpassing the 1k-image limit of contemporary models. MIRAGE demonstrates up to 13% performance improvement over existing open-source LMMs on VHs, sets a new state-of-the-art on the RetVQA multi-image QA benchmark, and achieves competitive performance on single-image QA with state-of-the-art LMMs.

Analyzing the Capabilities of Large Multimodal Models through the Visual Haystacks Benchmark

This paper explores the challenges Large Multimodal Models (LMMs) encounter on Multi-Image Visual Question Answering (MIQA) tasks, a setting in which existing models struggle both to retrieve pertinent information and to reason across collections of images. The authors introduce the "Visual Haystacks" (VHs) benchmark, designed specifically to assess how effectively LMMs handle queries over large sets of mostly unrelated images, a scenario that mirrors practical applications such as photo album search, medical imaging analysis, and environmental monitoring through satellite imagery.

The paper highlights the inadequacies of current LMMs, particularly closed-source models such as GPT-4o, when tasked with MIQA challenges. Notably, these models suffer performance drops of up to 50% relative to their accuracy on standard, non-retrieval QA problems, and they exhibit a pronounced positional bias: performance degrades further when the relevant image is not favorably placed within the context window, indicating a strong dependence on image order.
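
To make the positional-bias finding concrete, the sketch below shows one way such a probe could be run: the relevant "needle" image is inserted at several depths among shuffled distractors, and accuracy is recorded per depth. This is an illustrative assumption rather than the paper's evaluation code; `answer_question`, `needle`, and `distractors` are hypothetical placeholders for a multi-image LMM call and the benchmark data.

```python
# Hypothetical positional-bias probe (not the paper's evaluation code).
# `answer_question(images, question)` stands in for any multi-image LMM call.
import random

def positional_bias_probe(needle, distractors, question, gold_answer,
                          answer_question, num_depths=5, trials=20):
    """Return accuracy keyed by the depth at which the needle image is inserted."""
    depths = [round(d * len(distractors) / (num_depths - 1)) for d in range(num_depths)]
    accuracy_by_depth = {}
    for depth in depths:
        correct = 0
        for _ in range(trials):
            haystack = list(distractors)
            random.shuffle(haystack)        # distractor order varies across trials
            haystack.insert(depth, needle)  # needle depth is held fixed
            prediction = answer_question(haystack, question)
            correct += prediction.strip().lower() == gold_answer.strip().lower()
        accuracy_by_depth[depth] = correct / trials
    return accuracy_by_depth
```

A flat curve across depths would indicate robustness to image order; the paper's findings suggest current models show the opposite.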

To alleviate these shortcomings, the authors propose "MIRAGE" (Multi-Image Retrieval Augmented Generation), a framework that combines a query-aware retrieval mechanism with an optimized image encoding strategy. MIRAGE is shown to outperform existing closed-source models on the VHs benchmark with notable efficiency, achieving up to an 11% performance gain over GPT-4o and up to 3.4 times more efficient processing when data is in text-focused formats.
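
As a rough illustration of the retrieve-then-answer flow described above, the sketch below scores every candidate image against the query and passes only the top-ranked images to the answering model. The interfaces `embed_image`, `embed_text`, and `generate_answer` are assumed placeholders, not MIRAGE's actual API.

```python
# Minimal visual-RAG sketch under assumed interfaces (not MIRAGE's implementation).
import numpy as np

def retrieve_then_answer(images, question, embed_image, embed_text,
                         generate_answer, top_k=5):
    """Keep only the images most relevant to the query, then answer over that subset."""
    query_vec = embed_text(question)                           # shape (d,)
    image_vecs = np.stack([embed_image(im) for im in images])  # shape (N, d)
    # Cosine similarity as a simple query-aware relevance score.
    scores = image_vecs @ query_vec / (
        np.linalg.norm(image_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-8)
    keep = np.argsort(scores)[::-1][:top_k]                    # top-k most relevant images
    return generate_answer([images[i] for i in keep], question)
```

The key design choice is that the retriever conditions on the question, so the downstream LMM sees a short, query-relevant image list instead of the full haystack.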

Key Findings and Implications

  1. Benchmark Development: The "Visual Haystacks" benchmark is a pivotal contribution aimed at evaluating the retrieval and reasoning capabilities of LMMs in a real-world-like setup, where multiple images must be processed and evaluated jointly to generate accurate responses.
  2. Retrieval Mechanisms: The paper emphasizes the necessity of integrating robust retrieval modules. By incorporating a query-aware retriever into MIRAGE, the model adapts flexibly to the variable relevance of images, reducing the influence of distractors and filtering the input down to query-relevant content.
  3. Image Encoding: MIRAGE's compressive image encoding, built on a Q-Former, sharply reduces the number of visual tokens per image, allowing the model to fit far larger image sets within a fixed context window (see the sketch after this list).
  4. Model Efficiency: A noteworthy aspect of MIRAGE is its optimization for efficiency without sacrificing accuracy, presenting a compelling case for streamlined processing in environments constrained by computational resources.
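
The compressive encoding in item 3 can be pictured with the rough sketch below: a small set of learned query tokens cross-attends to an image's patch features, so each image contributes only a few dozen tokens to the LMM context. Module names, shapes, and hyperparameters here are assumptions for illustration, not MIRAGE's implementation.

```python
# Q-Former-style token compression sketch (assumed shapes, not MIRAGE's module).
import torch
import torch.nn as nn

class CompressiveImageEncoder(nn.Module):
    def __init__(self, patch_dim=1024, num_queries=32, num_heads=8):
        super().__init__()
        # Learned query tokens that will summarize each image.
        self.queries = nn.Parameter(torch.randn(num_queries, patch_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(patch_dim, num_heads, batch_first=True)
        self.proj = nn.Linear(patch_dim, patch_dim)

    def forward(self, patch_features):
        # patch_features: (batch, num_patches, patch_dim), e.g. 576 patch tokens per image.
        batch = patch_features.size(0)
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)
        compressed, _ = self.cross_attn(q, patch_features, patch_features)
        return self.proj(compressed)  # (batch, num_queries, patch_dim): far fewer tokens

# Example: 576 patch tokens per image are compressed to 32 query tokens.
# encoder = CompressiveImageEncoder()
# tokens = encoder(torch.randn(4, 576, 1024))  # -> torch.Size([4, 32, 1024])
```

Shrinking each image's token footprint in this way is what allows a retrieval-augmented pipeline to keep thousands of images within a single GPU's memory and context budget.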

Theoretical and Practical Implications

The research outlined in this paper holds substantial implications for the field of artificial intelligence, particularly in advancing methods for multimodal data processing. The challenge of MIQA highlights the evolving landscape of LMMs and their potential application in complex real-world scenarios where multi-image integration is crucial. The VHs benchmark provides a valuable yardstick for future model evaluations, offering a more comprehensive picture of a model's capabilities than traditional single-image VQA evaluations.

Furthermore, the development of MIRAGE indicates a promising direction for future models that will need to incorporate both retrieval and reasoning capabilities effectively. This suggests a shift towards models that comprehensively engage with multimodal inputs, accommodating a diversity of image sets with enhanced contextual awareness and reduced processing burden.

Conclusion

This paper presents a thorough investigation into the performance of contemporary LMMs on MIQA tasks and offers innovative solutions in the form of the MIRAGE framework. By tackling the limitations in retrieval accuracy and computational efficiency, the authors pave the way for creating more robust and capable models that can fulfill complex multimodal tasks. Future research might focus on expanding MIQA datasets and refining retrieval mechanisms to further enhance model performance and applicability across diverse domains.

Authors (7)
  1. Tsung-Han Wu (29 papers)
  2. Giscard Biamby (8 papers)
  3. Jerome Quenum (5 papers)
  4. Ritwik Gupta (23 papers)
  5. Joseph E. Gonzalez (167 papers)
  6. Trevor Darrell (324 papers)
  7. David M. Chan (30 papers)
Citations (1)