Are We on the Right Way for Evaluating Large Vision-Language Models? (2403.20330v2)

Published 29 Mar 2024 in cs.CV

Abstract: Large vision-language models (LVLMs) have recently achieved rapid progress, sparking numerous studies to evaluate their multi-modal capabilities. However, we dig into current evaluation works and identify two primary issues: 1) Visual content is unnecessary for many samples. The answers can be directly inferred from the questions and options, or from the world knowledge embedded in LLMs. This phenomenon is prevalent across current benchmarks. For instance, GeminiPro achieves 42.9% on the MMMU benchmark without any visual input, and outperforms the random-choice baseline across six benchmarks by over 24% on average. 2) Unintentional data leakage exists in LLM and LVLM training. LLMs and LVLMs can still answer some visual-necessary questions without the visual content, indicating that these samples were memorized during large-scale training. For example, Sphinx-X-MoE gets 43.6% on MMMU without accessing images, surpassing its LLM backbone by 17.9%. Both problems lead to misjudgments of actual multi-modal gains and potentially misguide the study of LVLMs. To this end, we present MMStar, an elite vision-indispensable multi-modal benchmark comprising 1,500 samples meticulously selected by humans. MMStar benchmarks 6 core capabilities and 18 detailed axes, aiming to evaluate LVLMs' multi-modal capacities with carefully balanced and purified samples. These samples are first coarsely selected from current benchmarks with an automated pipeline; human review is then applied to ensure each curated sample exhibits visual dependency, minimal data leakage, and the need for advanced multi-modal capabilities. Moreover, two metrics are developed to measure data leakage and actual performance gain in multi-modal training. We evaluate 16 leading LVLMs on MMStar to assess their multi-modal capabilities, and on 7 benchmarks with the proposed metrics to investigate their data leakage and actual multi-modal gain.

Evaluating Large Vision-Language Models: A New Benchmark and Metrics

Introduction to MMStar and New Metrics

Recent advances in Large Vision-Language Models (LVLMs) have created a need for benchmarks that reliably assess these models' genuine multi-modal capabilities. An examination of current evaluation methodologies identified two significant problems: many samples do not actually require their visual content to be answered, and unintentional data leakage occurs during LLM and LVLM training. To address these issues, the paper introduces MMStar, an elite vision-indispensable multi-modal benchmark comprising 1,500 rigorously selected, human-reviewed samples. It is designed to evaluate the genuine multi-modal understanding of LVLMs across six core capabilities and eighteen detailed axes. In addition, two new metrics, Multi-modal Gain (MG) and Multi-modal Leakage (ML), are developed to measure the performance gains attributable to multi-modal training and the degree of data leakage, respectively.

Methodology of MMStar Benchmark Creation

The MMStar benchmark starts from a comprehensive collection of samples drawn from existing benchmarks, focusing on the areas where those benchmarks fall short. The curation process involved two distinct stages:

  • Automated Filtering: An initial coarse filtering pass applied a set of criteria intended to ensure visual dependency and minimize data leakage. Eight powerful LLMs were used for this preliminary selection, removing samples that could be answered without visual input or that showed evidence of leakage from LLM training data (a hedged sketch of this stage follows this list).
  • Human Review: Subsequently, a stringent human review process ensured that selected samples necessitate visual understanding, cover a wide array of multi-modal capabilities, and present various difficulty levels. This phase solidified MMStar's goal to offer a benchmark that not only challenges LVLMs across multiple dimensions but does so with high-quality, meticulously vetted samples.
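
The summary does not give the exact filtering criteria, so the following is only a minimal sketch of how the coarse LLM-based stage could work. It assumes a hypothetical `ask_llm(model, question, options)` helper that returns a model's chosen option when shown only the text, and treats a sample as lacking visual dependency (or as likely leaked) if too many text-only LLMs already answer it correctly; the helper name and the threshold are illustrative, not the authors' implementation.

```python
# Hedged sketch of the coarse, LLM-based filtering stage (not the authors' code).
# Assumption: ask_llm(model, question, options) returns the model's chosen option
# string when given only the question and options (no image).
from typing import Callable, Iterable

def coarse_filter(
    samples: Iterable[dict],
    llms: list[str],
    ask_llm: Callable[[str, str, list[str]], str],
    max_text_only_hits: int = 2,   # illustrative threshold, not from the paper
) -> list[dict]:
    """Keep samples that most text-only LLMs fail to answer correctly."""
    kept = []
    for sample in samples:
        # Count how many LLMs answer correctly WITHOUT seeing the image.
        hits = sum(
            ask_llm(model, sample["question"], sample["options"]) == sample["answer"]
            for model in llms
        )
        # Many text-only hits suggest weak visual dependency or data leakage.
        if hits <= max_text_only_hits:
            kept.append(sample)
    return kept
```

Samples surviving this coarse pass would then go to the human-review stage described above.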

Core Capabilities and Dimensions

MMStar benchmarks LVLMs across six core capabilities: Coarse Perception (CP), Fine-grained Perception (FP), Instance Reasoning (IR), Logical Reasoning (LR), Science & Technology (ST), and Mathematics (MA), each split into three detailed axes. This comprehensive structure ensures a holistic evaluation of LVLMs' abilities to process and understand visual and textual content in tandem.
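To make the six-capability, eighteen-axis structure concrete, the sketch below shows one way per-capability scores might be aggregated from per-sample results. The record schema is an assumption, and the axis labels in the example are placeholders, since the summary does not list the paper's detailed axes.

```python
# Hedged sketch: aggregating MMStar-style results by core capability.
# The record schema and placeholder axis labels are illustrative assumptions.
from collections import defaultdict

CORE_CAPABILITIES = ["CP", "FP", "IR", "LR", "ST", "MA"]  # six MMStar capabilities

def accuracy_by_capability(results: list[dict]) -> dict[str, float]:
    """results: [{"capability": "CP", "axis": "axis_1", "correct": True}, ...]"""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for r in results:
        totals[r["capability"]] += 1
        correct[r["capability"]] += int(r["correct"])
    return {
        cap: correct[cap] / totals[cap]
        for cap in CORE_CAPABILITIES
        if totals[cap] > 0
    }

# Example with placeholder axis labels (not the paper's 18 axes):
demo = [
    {"capability": "CP", "axis": "axis_1", "correct": True},
    {"capability": "MA", "axis": "axis_3", "correct": False},
]
print(accuracy_by_capability(demo))  # {'CP': 1.0, 'MA': 0.0}
```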

Introducing MG and ML Metrics

The paper proposes two new metrics to address these evaluation pitfalls:

  • Multi-modal Gain (MG): quantifies the actual performance improvement attributable to multi-modal training, clarifying how effectively an LVLM leverages visual information beyond text alone.
  • Multi-modal Leakage (ML): assesses the degree to which data leakage, i.e., the unintended inclusion of evaluation samples in training data, inflates scores, enabling fairer comparisons among models (see the sketch after this list).
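
The summary does not spell out how MG and ML are computed, so the sketch below is only a plausible reading: it assumes MG compares an LVLM's benchmark score with and without images, and ML compares the LVLM's image-free score against its LLM backbone's score, clipped at zero. The exact definitions should be taken from the paper; the numbers in the usage example are made up.

```python
# Hedged sketch of the Multi-modal Gain (MG) and Multi-modal Leakage (ML) metrics.
# Assumed definitions (check the paper for the exact formulas):
#   MG = LVLM score with images - LVLM score without images
#   ML = max(0, LVLM score without images - LLM backbone score)

def multi_modal_gain(score_with_images: float, score_without_images: float) -> float:
    """How much the visual input actually helps the LVLM on a benchmark."""
    return score_with_images - score_without_images

def multi_modal_leakage(score_without_images: float, llm_backbone_score: float) -> float:
    """How far the LVLM exceeds its text-only backbone even without images,
    which points to evaluation samples leaking into multi-modal training data."""
    return max(0.0, score_without_images - llm_backbone_score)

# Hypothetical usage with made-up accuracy percentages:
mg = multi_modal_gain(score_with_images=55.0, score_without_images=40.0)      # -> 15.0
ml = multi_modal_leakage(score_without_images=40.0, llm_backbone_score=30.0)  # -> 10.0
```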

Evaluation and Findings

Evaluating 16 state-of-the-art LVLMs on MMStar, and applying the MG/ML metrics across seven popular benchmarks, shows that even the top-performing models fall short on certain core capabilities, underscoring how challenging MMStar is. The MG and ML metrics also reveal clear distinctions among LVLMs in how much they actually gain from multi-modal training and how well they control data leakage.

Implications and Direction for Future Research

The introduction of the MMStar benchmark and the MG and ML metrics marks a significant step towards more accurate evaluation and understanding of LVLMs. The paper's findings underscore the importance of deliberate, careful construction of evaluation benchmarks and metrics to truly advance our comprehension of multi-modal AI capabilities. Looking ahead, the continued expansion of MMStar and more dynamic evaluation methodologies promise to push the boundaries of what we expect from, and how we assess, LVLMs.

References (54)
  1. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.
  2. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023.
  3. Deepseek llm: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954, 2024.
  4. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  5. Sharegpt4v: Improving large multi-modal models with better captions. arXiv preprint arXiv:2311.12793, 2023.
  6. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. arXiv preprint arXiv:2312.14238, 2023.
  7. Can vision-language models think from a first-person perspective? arXiv preprint arXiv:2311.15596, 2023.
  8. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna.lmsys.org (accessed 14 April 2023), 2023.
  9. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
  10. OpenCompass Contributors. Opencompass: A universal evaluation platform for foundation models. https://github.com/open-compass/opencompass, 2023.
  11. Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023.
  12. Internlm-xcomposer2: Mastering free-form text-image composition and comprehension in vision-language large model. arXiv preprint arXiv:2401.16420, 2024.
  13. Glm: General language model pretraining with autoregressive blank infilling. arXiv preprint arXiv:2103.10360, 2021.
  14. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023.
  15. Sphinx-x: Scaling data and parameters for a family of multi-modal large language models. arXiv preprint arXiv:2402.05935, 2024.
  16. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6904–6913, 2017.
  17. Scaling up visual and vision-language representation learning with noisy text supervision. In International conference on machine learning, pages 4904–4916. PMLR, 2021.
  18. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.
  19. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024.
  20. A diagram is worth a dozen images. ArXiv, abs/1603.07396, 2016.
  21. Seed-bench: Benchmarking multimodal llms with generative comprehension. arXiv preprint arXiv:2307.16125, 2023.
  22. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023.
  23. Monkey: Image resolution and text label are important things for large multi-modal models. arXiv preprint arXiv:2311.06607, 2023.
  24. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023.
  25. Llava-next: Improved reasoning, ocr, and world knowledge, January 2024.
  26. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023.
  27. Mmbench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281, 2023.
  28. Deepseek-vl: Towards real-world vision-language understanding. arXiv preprint arXiv:2403.05525, 2024.
  29. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255, 2023.
  30. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems, 35:2507–2521, 2022.
  31. Cheap and quick: Efficient vision-language instruction tuning for large language models. arXiv preprint arXiv:2305.15023, 2023.
  32. Microsoft. Phi2: The surprising power of small language models. https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/, 2023.
  33. NousResearch. Nous-hermes-2-yi-34b. https://huggingface.co/NousResearch/Nous-Hermes-2-Yi-34B, 2023.
  34. OpenAI. Chatgpt. https://chat.openai.com/, 2023.
  35. OpenAI. Gpt-4v(ision) system card. https://cdn.openai.com/papers/GPTV_System_Card.pdf, 2023.
  36. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
  37. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  38. A-okvqa: A benchmark for visual question answering using world knowledge. In European Conference on Computer Vision, pages 146–162. Springer, 2022.
  39. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, 2018.
  40. H. Taud and J.-F. Mas. Multilayer perceptron (mlp). Geomatic approaches for modeling land change scenarios, pages 451–455, 2018.
  41. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
  42. I. Team. Internlm: A multilingual language model with progressively enhanced capabilities, 2023.
  43. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  44. To see is to believe: Prompting gpt-4v for better visual instruction tuning. arXiv preprint arXiv:2311.07574, 2023.
  45. Cogvlm: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079, 2023.
  46. Q-bench: A benchmark for general-purpose foundation models on low-level vision. arXiv preprint arXiv:2309.14181, 2023.
  47. Baichuan 2: Open large-scale language models. arXiv preprint arXiv:2309.10305, 2023.
  48. mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023.
  49. Yi: Open foundation models by 01.AI. arXiv preprint arXiv:2403.04652, 2024.
  50. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490, 2023.
  51. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. arXiv preprint arXiv:2311.16502, 2023.
  52. Internlm-xcomposer: A vision-language large model for advanced text-image comprehension and composition. arXiv preprint arXiv:2309.15112, 2023.
  53. Tinyllava: A framework of small-scale large multimodal models. arXiv preprint arXiv:2402.14289, 2024.
  54. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.
Authors (11)
  1. Lin Chen (384 papers)
  2. Jinsong Li (12 papers)
  3. Xiaoyi Dong (73 papers)
  4. Pan Zhang (153 papers)
  5. Yuhang Zang (54 papers)
  6. Zehui Chen (41 papers)
  7. Haodong Duan (55 papers)
  8. Jiaqi Wang (218 papers)
  9. Yu Qiao (563 papers)
  10. Dahua Lin (336 papers)
  11. Feng Zhao (110 papers)
Citations (118)