Evaluating Large Vision-Language Models: A New Benchmark and Metrics
Introduction to MMStar and New Metrics
Recent advances in Large Vision-Language Models (LVLMs) have created a need for accurate, reliable benchmarks that truly assess these models' multi-modal capabilities. An examination of current evaluation methodologies identified two significant problems: visual content is unnecessary for many samples, which can be answered from the text alone, and unintentional data leakage occurs during LLM and LVLM training. To address these issues, this paper introduces MMStar, an elite vision-indispensable multi-modal benchmark comprising 1,500 rigorously selected, human-reviewed samples. It is designed to evaluate the genuine multi-modal understanding of LVLMs across six core capabilities and eighteen detailed axes. In addition, two new metrics, Multi-Modal Gain (MG) and Multi-Modal Leakage (ML), are introduced to measure the performance gain attributable to multi-modal training and the degree of data leakage, respectively.
Methodology of MMStar Benchmark Creation
The MMStar benchmark starts from a comprehensive collection of candidate samples, focusing in particular on areas where existing benchmarks fall short. The curation process involved two distinct stages:
- Automated Filtering: An initial coarse filter applied a set of criteria to ensure visual dependency and minimize data leakage. Eight powerful LLMs were used for preliminary sample selection, discarding samples that could be answered without the image or that showed evidence of leakage from LLM training data (a sketch of this stage follows the list).
- Human Review: Subsequently, a stringent human review ensured that the selected samples require visual understanding, cover a wide array of multi-modal capabilities, and span a range of difficulty levels. This phase solidified MMStar's goal of offering a benchmark that challenges LVLMs across multiple dimensions with high-quality, meticulously vetted samples.
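The following is a minimal sketch of how such an LLM-based coarse filter could work. The function names, the `max_correct` threshold, and the callable-LLM interface are assumptions made for illustration and are not taken from the paper.

```python
# Hypothetical sketch of the automated (coarse) filtering stage.
# `candidate_samples` is assumed to be a list of dicts with "question",
# "options", and "answer"; `text_only_llms` is assumed to be a list of
# callables that answer a question WITHOUT seeing the associated image.

def passes_coarse_filter(sample, text_only_llms, max_correct=2):
    """Keep a sample only if most text-only LLMs fail to answer it.

    If many strong LLMs answer correctly without the image, the sample either
    does not need visual input or may have leaked into LLM training data.
    """
    correct = 0
    for llm in text_only_llms:
        prediction = llm(sample["question"], sample["options"])  # no image given
        if prediction == sample["answer"]:
            correct += 1
    return correct <= max_correct


def coarse_filter(candidate_samples, text_only_llms):
    """Return the subset of samples that survive the text-only check."""
    return [s for s in candidate_samples
            if passes_coarse_filter(s, text_only_llms)]
```

Samples surviving this filter then proceed to the human-review stage described above; the `max_correct` threshold controls how strictly "vision-indispensable" is enforced, and the paper's actual criteria may differ.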
Core Capabilities and Dimensions
MMStar benchmarks LVLMs across six core capabilities: Coarse Perception (CP), Fine-grained Perception (FP), Instance Reasoning (IR), Logical Reasoning (LR), Science & Technology (ST), and Mathematics (MA), each split into three detailed axes. This comprehensive structure ensures a holistic evaluation of LVLMs' abilities to process and understand visual and textual content in tandem.
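A compact way to represent this taxonomy and roll per-sample results up into capability-level scores is sketched below. Axis names are omitted because this summary lists only the six capabilities and the count of three axes each; the scoring roll-up is illustrative, not the paper's official protocol.

```python
# Sketch of the MMStar capability taxonomy and a simple score roll-up.
from collections import defaultdict

CORE_CAPABILITIES = {
    "CP": "Coarse Perception",
    "FP": "Fine-grained Perception",
    "IR": "Instance Reasoning",
    "LR": "Logical Reasoning",
    "ST": "Science & Technology",
    "MA": "Mathematics",
}
AXES_PER_CAPABILITY = 3  # 6 capabilities x 3 axes = 18 detailed axes


def capability_scores(results):
    """Average per-sample correctness (0/1) into a score per capability.

    `results` is assumed to be a list of (capability_code, is_correct) pairs.
    """
    totals, correct = defaultdict(int), defaultdict(int)
    for code, is_correct in results:
        totals[code] += 1
        correct[code] += int(is_correct)
    return {code: correct[code] / totals[code] for code in totals}
```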
Introducing MG and ML Metrics
The paper proposes two new metrics to overcome current evaluation pitfalls:
- Multi-Modal Gain (MG): This metric quantifies the actual performance improvement attributable to multi-modal training, clarifying how effectively an LVLM leverages visual information beyond text alone.
- Multi-Modal Leakage (ML): This metric assesses the degree to which data leakage, the unintended inclusion of evaluation samples in training data, might inflate evaluation scores, enabling fairer comparisons among models (a computational sketch follows this list).
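One plausible way to compute these metrics from three evaluation scores is sketched below. The variable names and exact formulas are assumptions for illustration under the definitions above, not a verbatim transcription of the paper's equations.

```python
def multi_modal_gain(score_with_images: float, score_without_images: float) -> float:
    """MG: how much the LVLM's score improves when images are actually provided.

    Both arguments are assumed to be accuracies of the same LVLM on the same
    benchmark, evaluated with and without visual input.
    """
    return score_with_images - score_without_images


def multi_modal_leakage(score_without_images: float, llm_base_score: float) -> float:
    """ML: performance the LVLM shows without images that its original text-only
    LLM base cannot match, hinting that evaluation samples may have leaked into
    the multi-modal training data. Clamped at zero.
    """
    return max(0.0, score_without_images - llm_base_score)


# Example with made-up numbers: an LVLM scores 62% with images, 45% without,
# while its text-only LLM base scores 38% on the same questions.
mg = multi_modal_gain(0.62, 0.45)      # 0.17 -> genuine multi-modal gain
ml = multi_modal_leakage(0.45, 0.38)   # 0.07 -> possible leakage signal
```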
Evaluation and Findings
Evaluating 16 state-of-the-art LVLMs on MMStar, and applying the proposed MG/ML metrics across seven popular benchmarks, shows that even the top-performing models struggle on certain core capabilities, underscoring MMStar's difficulty. The MG and ML metrics also reveal clear differences among LVLMs in how effectively they learn from multi-modal training and how well they control data leakage.
Implications and Direction for Future Research
The introduction of the MMStar benchmark and the new MG and ML metrics marks a significant step toward more accurately evaluating and understanding LVLMs. The findings underscore the importance of deliberate, careful construction of evaluation benchmarks and metrics to truly advance our comprehension of multi-modal AI capabilities. Looking ahead, continued expansion of MMStar and dynamic evaluation methodologies promise to push the boundaries of what we expect from LVLMs and how we assess them.