Evaluation of Long-Context Capabilities in Multimodal LLMs
The paper "Multimodal Needle in a Haystack: Benchmarking Long-Context Capability of Multimodal LLMs" provides a comprehensive analysis of multimodal LLMs (MLLMs) using the novel MultiModal Needle-in-a-haystack (MMNeedle) benchmark. This benchmark addresses a crucial gap in the evaluation of MLLMs by focusing on their ability to interpret and retrieve information from long-context inputs. The authors highlight the importance of this evaluation to expand the application horizon of MLLMs beyond current limitations.
The MMNeedle benchmark simulates a realistic, challenging scenario in which an MLLM must locate a target sub-image, the "needle," within a collection of images, the "haystack," guided solely by a textual description. The benchmark is built on a large dataset comprising 40,000 images, 560,000 captions, and 280,000 needle-haystack pairs, providing a statistically robust platform for assessing MLLMs. Its settings vary the context length (the number of input images), the stitching complexity (how many sub-images are tiled into each image), and the number of needles (single versus multiple), yielding a comprehensive stress test.
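To make the setup concrete, below is a minimal sketch of how one such needle-haystack sample might be assembled. The function names, the 2x2 grid default, and the prompt wording are illustrative assumptions, not the authors' actual data pipeline.

```python
from PIL import Image

def stitch_images(images, rows, cols, tile_size=(256, 256)):
    """Tile rows * cols images into one stitched image (the 'haystack')."""
    canvas = Image.new("RGB", (cols * tile_size[0], rows * tile_size[1]))
    for idx, img in enumerate(images[: rows * cols]):
        r, c = divmod(idx, cols)
        canvas.paste(img.resize(tile_size), (c * tile_size[0], r * tile_size[1]))
    return canvas

def make_sample(images, needle_index, needle_caption, rows=2, cols=2):
    """Build one needle-haystack pair: a stitched image plus the text prompt.

    The model sees only the stitched image and the caption, and must answer
    with the needle's location (here, its row and column in the grid).
    """
    haystack = stitch_images(images, rows, cols)
    r, c = divmod(needle_index, cols)  # ground-truth grid position
    prompt = (
        f"Find the sub-image matching this caption: '{needle_caption}'. "
        f"Answer with its row and column in the {rows}x{cols} grid."
    )
    return haystack, prompt, (r, c)
```

Larger stitching sizes pack more sub-images into each input image, which is precisely the dimension along which the paper reports the sharpest performance degradation.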
Key findings reveal significant disparities between state-of-the-art API-based models such as GPT-4o and open-source models like LLaVA-Llama-3. Notably, GPT-4o excels in most scenarios but still struggles with complex multi-needle retrieval and large stitching sizes. Open-source models generally lag behind API models in both retrieval accuracy and avoiding hallucinations, especially on negative samples, i.e., haystacks that contain no needle. For instance, GPT-4o's performance drops sharply from 97% accuracy on 10-image inputs to just 1% on a substantially larger stitched context, illustrating a scalability issue even among top-performing models.
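As a rough illustration of the exact-match scoring behind such numbers, the snippet below computes location accuracy on positive samples and a simple hallucination rate on negative samples. The tuple format and the use of None to mark negative samples are assumptions for illustration, not the paper's exact evaluation code.

```python
def score(predictions, ground_truths):
    """Exact-match accuracy over predicted location tuples.

    A ground truth of None marks a negative sample (no needle present);
    predicting any location there counts as a hallucination.
    """
    correct = hallucinated = positives = negatives = 0
    for pred, truth in zip(predictions, ground_truths):
        if truth is None:
            negatives += 1
            if pred is not None:  # model "found" a needle that is not there
                hallucinated += 1
        else:
            positives += 1
            if pred == truth:  # location must match exactly
                correct += 1
    return {
        "accuracy": correct / positives if positives else 0.0,
        "hallucination_rate": hallucinated / negatives if negatives else 0.0,
    }
```

Under a metric like this, partial credit is impossible: a model that identifies the right image but the wrong grid cell scores zero, which helps explain how accuracy can collapse so steeply as stitching size grows.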
The implications of these results are profound for both the theoretical development and the practical application of AI. The performance gap highlights an urgent need to refine model architectures and evaluation techniques so they better handle high-dimensional, heterogeneous data. This research could catalyze the development of more robust MLLMs capable of seamless, accurate information retrieval across complex, varied inputs, a necessary evolution for deploying AI in real-world applications requiring nuanced understanding and decision-making.
The MMNeedle benchmark's comprehensive dataset and evaluation metrics set a new standard for assessing and developing long-context capabilities in MLLMs. Looking forward, improvements in image processing, context comprehension, and algorithmic efficiency are natural directions for future research. Such developments are essential to enhance the reliability and functionality of MLLMs in both controlled environments and practical deployments. By setting a rigorous baseline and identifying existing model limitations, this research contributes valuable insights for advancing the state of the art in multimodal AI systems.