MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding

Published 13 Jun 2024 in cs.CV, cs.AI, and cs.CL (arXiv:2406.09411v2)

Abstract: We introduce MuirBench, a comprehensive benchmark that focuses on robust multi-image understanding capabilities of multimodal LLMs. MuirBench consists of 12 diverse multi-image tasks (e.g., scene understanding, ordering) that involve 10 categories of multi-image relations (e.g., multiview, temporal relations). Comprising 11,264 images and 2,600 multiple-choice questions, MuirBench is created in a pairwise manner, where each standard instance is paired with an unanswerable variant that has minimal semantic differences, in order for a reliable assessment. Evaluated upon 20 recent multi-modal LLMs, our results reveal that even the best-performing models like GPT-4o and Gemini Pro find it challenging to solve MuirBench, achieving 68.0% and 49.3% in accuracy. Open-source multimodal LLMs trained on single images can hardly generalize to multi-image questions, hovering below 33.3% in accuracy. These results highlight the importance of MuirBench in encouraging the community to develop multimodal LLMs that can look beyond a single image, suggesting potential pathways for future improvements.

Summary

  • The paper introduces MUIRBench, a benchmark using 11,264 images and 2,600 questions to evaluate multi-image understanding in multimodal LLMs.
  • The paper demonstrates significant performance gaps, with top models like GPT-4o achieving only 68% accuracy and struggling with unanswerable questions.
  • The paper underscores that targeted training on diverse multi-image data is essential, as scaling model size alone does not ensure robust reasoning.

A Comprehensive Benchmark for Robust Multi-image Understanding

Introduction

The paper introduces a comprehensive benchmark named MUIRBench, which aims to rigorously evaluate the multi-image understanding capabilities of multimodal LLMs. It recognizes that while humans can naturally synthesize information from multiple images to perceive the world holistically, current multimodal LLMs, including state-of-the-art models such as GPT-4o and Gemini Pro, struggle to achieve similar levels of comprehension when dealing with multiple images.

Benchmark Design and Objectives

MUIRBench comprises 11,264 images and 2,600 multiple-choice questions that span 12 distinctive multi-image understanding tasks. These tasks range from visual retrieval and cartoon understanding to geographic understanding and diagram understanding, encapsulating various categories of multi-image relations like temporal sequences, multi-view perspectives, and ordered pages. Notably, MUIRBench adopts a pairwise design where each answerable question is paired with an unanswerable counterpart that exhibits minimal differences, thereby providing a robust evaluation framework that mitigates the risk of models relying on superficial shortcuts.
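
To make the pairwise construction concrete, the sketch below shows one way such an instance pair could be represented. The field names and the `make_unanswerable_variant` helper are illustrative assumptions, not the paper's actual data schema or release format.

```python
from dataclasses import dataclass, replace
from typing import List

@dataclass
class MuirInstance:
    """Illustrative schema for one multiple-choice instance (field names are assumptions)."""
    images: List[str]    # paths or URLs of the input images
    question: str        # question that requires reasoning over all images
    options: List[str]   # multiple-choice option texts
    answer: str          # gold option letter, e.g. "A"
    task: str            # e.g. "image ordering", "visual retrieval"
    relation: str        # e.g. "temporal", "multiview"

def make_unanswerable_variant(inst: MuirInstance) -> MuirInstance:
    # The unanswerable counterpart keeps the question nearly identical but removes the
    # evidence needed to answer it (here, for illustration, by dropping one image),
    # so the only correct response becomes "None of the above".
    new_options = inst.options + ["None of the above"]
    return replace(
        inst,
        images=inst.images[:-1],                      # illustrative perturbation only
        options=new_options,
        answer=chr(ord("A") + len(new_options) - 1),  # letter of "None of the above"
    )
```

Pairing each answerable instance with a near-identical unanswerable one means a model cannot score well by exploiting option priors or surface cues alone; it has to ground its answer in the images.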

Experimental Evaluation

An extensive evaluation on MUIRBench was conducted across 20 recent multimodal LLMs, including models that natively support multi-image inputs and those designed for single-image inputs. The findings reveal substantial performance gaps: even the best-performing models, GPT-4o and Gemini Pro, achieved only 68.0% and 49.3% accuracy, respectively. Notably, open-source multimodal LLMs trained on single images generalized poorly to multi-image questions, with accuracies below 33.3%.
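
The evaluation protocol itself is standard multiple-choice accuracy. A rough sketch is below, assuming a generic `model.generate(images=..., prompt=...)` interface and a simple letter-matching parser; neither reflects any particular model's real API or the paper's exact prompting setup.

```python
import re

def parse_choice(response: str) -> str:
    """Extract the first option letter (A-E) from a free-form model response."""
    match = re.search(r"\b([A-E])\b", response.strip())
    return match.group(1) if match else ""

def evaluate(model, instances) -> float:
    """Compute multiple-choice accuracy over a list of MuirInstance-like items."""
    correct = 0
    for inst in instances:
        letters = [chr(ord("A") + i) for i in range(len(inst.options))]
        prompt = inst.question + "\n" + "\n".join(
            f"{letter}. {option}" for letter, option in zip(letters, inst.options)
        )
        response = model.generate(images=inst.images, prompt=prompt)  # assumed interface
        if parse_choice(response) == inst.answer:
            correct += 1
    return correct / len(instances)
```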

Key Findings and Model Performance

The comprehensive assessment on MUIRBench highlighted several crucial aspects:

  1. Holistic Multi-image Understanding: Multimodal LLMs showed significant difficulties in holistic multi-image tasks. For instance, while GPT-4o achieved relatively higher accuracies on tasks like image-text matching and visual retrieval, it struggled considerably with tasks requiring detailed integration and reasoning across multiple images, such as multi-image ordering and visual grounding.
  2. Robustness to Unanswerable Questions: The introduction of unanswerable question variants revealed that models frequently fail to recognize what they do not know. The gap in accuracy between answerable and unanswerable variants was notable (see the sketch after this list), suggesting that current models are prone to overconfidence and lack reliable mechanisms to abstain from answering when necessary.
  3. Training Data and Model Size: The results underscored the importance of training data and processes. There was no evident correlation between model size and performance, indicating that mere scaling of model parameters does not suffice for developing robust multi-image understanding. Instead, targeted training on diverse multi-image datasets is essential.
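
As a concrete illustration of the answerable/unanswerable gap mentioned in point 2, the helper below computes accuracy separately on each half of the pairwise benchmark. The `(is_answerable, is_correct)` result format is an assumption made for illustration, not taken from the paper's codebase.

```python
def answerable_vs_unanswerable(results):
    """Split accuracy by variant type.

    `results` is a list of (is_answerable: bool, is_correct: bool) pairs;
    this structure is assumed for illustration only.
    """
    def accuracy(pairs):
        pairs = list(pairs)
        return sum(correct for _, correct in pairs) / len(pairs) if pairs else 0.0

    results = list(results)
    acc_answerable = accuracy(p for p in results if p[0])
    acc_unanswerable = accuracy(p for p in results if not p[0])
    return acc_answerable, acc_unanswerable, acc_answerable - acc_unanswerable
```

A large positive gap indicates a model that answers confidently even when the necessary visual evidence is absent, which is exactly the overconfidence pattern described above.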

Error Analysis

Error analysis provided additional insights. A common error category involved failures in capturing nuances and details within images, which constituted 26% of the errors. Logical reasoning errors and difficulties in object counting were also prevalent, indicating areas where current models require substantial improvements.

Implications and Future Directions

The findings from MUIRBench have important implications for the future development of multimodal LLMs.

  • Improved Training Paradigms: There is a clear need for training paradigms that imbue models with the ability to effectively synthesize and reason over multiple images. This could involve leveraging large-scale, curated multi-image datasets and refining architectures to better handle multi-dimensional visual inputs.
  • Robustness and Uncertainty Quantification: Future research should focus on enhancing the robustness of multimodal LLMs, particularly their ability to handle unanswerable questions gracefully. Techniques such as uncertainty quantification and abstention mechanisms (sketched after this list) could be crucial in this regard.
  • Expanding Task Diversity: Incorporating a broader range of tasks and image relations will further stress-test models and uncover latent weaknesses, driving continual improvements in multimodal understanding capabilities.
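
One simple form such an abstention mechanism could take is sketched below: normalize per-option confidences and fall back to "None of the above" when the top option is not confident enough. The `option_logprobs` interface and the threshold value are hypothetical, used only to illustrate the idea.

```python
import math

def answer_with_abstention(model, inst, threshold: float = 0.5) -> str:
    """Pick the highest-probability option, abstaining when confidence is below threshold."""
    # Hypothetical interface: per-option log-probabilities for the given images and question.
    logprobs = model.option_logprobs(
        images=inst.images, question=inst.question, options=inst.options
    )
    probs = [math.exp(lp) for lp in logprobs]
    total = sum(probs)
    probs = [p / total for p in probs]               # normalize over the option set
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < threshold:
        return "None of the above"                   # abstain under low confidence
    return chr(ord("A") + best)
```

More principled alternatives include calibrated confidence estimates or training with explicit unanswerable examples, which the pairwise design of MUIRBench directly supports.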

Conclusion

MUIRBench stands out as a rigorous, comprehensive benchmark specifically designed to push the boundaries of multi-image understanding in multimodal LLMs. The current limitations of even the most advanced models underscore the need for refined training methodologies and robust evaluation frameworks. By providing a meticulous assessment of multi-image reasoning tasks, MUIRBench paves the way for developing next-generation models capable of transcending single-image limitations and achieving holistic visual comprehension.
