- The paper critically assesses existing VQA datasets, exposing biases and limitations in evaluating multi-modal LLMs.
- It evaluates models on four targeted benchmarks (TDIUC, TallyQA, DVQA, and VQDv1) that enable fine-grained analysis of visual and textual reasoning.
- The study evaluates leading MLLMs such as LLaVA and GPT-4V, uncovering specific weaknesses in counting, OCR, and spatial comprehension.
The paper "Revisiting Multi-Modal LLM Evaluation" explores the challenging yet pivotal task of assessing multi-modal LLMs (MLLMs). As the field of MLLMs continues to advance, traditional evaluation datasets have become increasingly outdated, exhibiting several significant issues including extreme bias, spurious correlations, and insufficient ability for fine-grained analysis.
Key Contributions
- Critical Evaluation of Current Datasets: The authors scrutinize popular datasets used in visual question answering (VQA) and referring expression comprehension. They argue that these datasets, despite their widespread usage, contain inherent flaws that prevent a comprehensive evaluation of MLLMs.
- Introduction of Robust Datasets:
To address these gaps, the authors are the first to apply three purpose-built VQA datasets to MLLM evaluation:
- TDIUC: Permits fine-grained analysis across 12 different question types.
- TallyQA: Features both simple and complex counting questions.
- DVQA: Requires optical character recognition (OCR) for interpreting charts.
Additionally, they use VQDv1 for referring expression comprehension, which requires identifying all image regions that satisfy a given query (possibly none, possibly several).
- Evaluation of Recent MLLMs: The paper conducts an in-depth evaluation of state-of-the-art MLLMs, including LLaVA 1.5, LLaVA-NeXT, BLIP-2, InstructBLIP, GPT-4V, and GPT-4o. The experiments reveal previously unreported weaknesses and limitations of these models. A schematic of this kind of benchmark evaluation loop is sketched below.
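To make the evaluation protocol concrete, here is a minimal sketch of the kind of loop these benchmarks enable: each dataset supplies (image, question, answer, question-type) records, the model is queried zero-shot, and responses are scored by normalized exact match. The `VQASample` schema and `MultiModalModel` interface are hypothetical placeholders, not the paper's actual harness.

```python
# Hypothetical sketch of a zero-shot VQA evaluation loop; the sample schema
# and model interface are illustrative placeholders, not the paper's harness.
from dataclasses import dataclass
from typing import List, Protocol


@dataclass
class VQASample:
    image_path: str
    question: str
    answer: str          # ground-truth short answer
    question_type: str   # e.g. "counting", "color", "chart_ocr"


class MultiModalModel(Protocol):
    def answer(self, image_path: str, question: str) -> str: ...


def evaluate(model: MultiModalModel, samples: List[VQASample]) -> float:
    """Normalized exact-match accuracy over one benchmark's samples."""
    def norm(s: str) -> str:
        return s.strip().lower().rstrip(".")

    correct = sum(
        norm(model.answer(s.image_path, s.question)) == norm(s.answer)
        for s in samples
    )
    return correct / max(len(samples), 1)
```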
Experimental Findings
- Fine-Grained Analysis with TDIUC:
The analysis using TDIUC reveals marked discrepancies in MLLM performance across its question types, pinpointing the areas where models excel or underperform. A sketch of this per-type scoring, together with the TallyQA counting split, follows this list.
- Counting Capabilities in TallyQA:
TallyQA enables the evaluation of MLLMs' counting abilities, distinguishing performance on simple versus complex counting tasks. This sheds light on how well models handle numerical understanding and reasoning.
- Chart OCR in DVQA:
DVQA tests the OCR capabilities of MLLMs in the context of chart interpretation, a crucial but often overlooked aspect of visual data comprehension.
- Referring Expression Comprehension in VQDv1:
Evaluation with VQDv1 probes how accurately models identify the image regions that correspond to complex queries, highlighting strengths and weaknesses in spatial and contextual understanding. An IoU-based scoring sketch for this task also follows the list.
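To make the per-type and counting analyses concrete, the sketch below computes per-question-type accuracy with arithmetic and harmonic mean-per-type (the metrics commonly reported with TDIUC) and a simple-versus-complex accuracy split in the style of TallyQA. The record formats and field names are illustrative assumptions, not the paper's data schema.

```python
# Sketch of TDIUC-style per-type scoring and a TallyQA-style counting split.
# Record formats are illustrative assumptions.
from collections import defaultdict
from statistics import harmonic_mean
from typing import Dict, List, Tuple


def per_type_accuracy(records: List[Tuple[str, bool]]) -> Dict[str, float]:
    """records: (question_type, is_correct) pairs."""
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for qtype, correct in records:
        totals[qtype] += 1
        hits[qtype] += int(correct)
    return {q: hits[q] / totals[q] for q in totals}


def mean_per_type(acc: Dict[str, float]) -> Dict[str, float]:
    """Mean-per-type accuracy, so rare question types count as much as
    frequent ones; the harmonic mean is dominated by the weakest type."""
    vals = list(acc.values())
    return {
        "arithmetic_mpt": sum(vals) / len(vals),
        "harmonic_mpt": harmonic_mean(vals) if all(v > 0 for v in vals) else 0.0,
    }


def counting_accuracy(records: List[Tuple[bool, bool]]) -> Dict[str, float]:
    """records: (is_complex, is_correct) pairs, reported per split."""
    buckets: Dict[str, List[bool]] = {"simple": [], "complex": []}
    for is_complex, correct in records:
        buckets["complex" if is_complex else "simple"].append(correct)
    return {split: sum(v) / len(v) for split, v in buckets.items() if v}
```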
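For VQDv1, the model must return every box that satisfies the query (possibly none), so scoring naturally takes the form of matching predicted boxes against ground-truth boxes. The sketch below uses greedy one-to-one matching at an IoU threshold of 0.5; the threshold and matching scheme are illustrative assumptions rather than the paper's exact protocol.

```python
# Sketch of IoU-based scoring for a VQDv1-style query; greedy matching at
# IoU >= 0.5 is an assumed protocol, not necessarily the paper's.
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    def area(r: Box) -> float:
        return max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])

    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def match_boxes(pred: List[Box], gt: List[Box], thr: float = 0.5) -> Tuple[float, float]:
    """Greedy one-to-one matching; returns (precision, recall) for one query."""
    if not gt and not pred:
        return 1.0, 1.0  # the model correctly reported "no matching region"
    unmatched_gt = list(gt)
    tp = 0
    for p in pred:
        best_i, best_v = -1, thr
        for i, g in enumerate(unmatched_gt):
            v = iou(p, g)
            if v >= best_v:
                best_i, best_v = i, v
        if best_i >= 0:
            unmatched_gt.pop(best_i)
            tp += 1
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gt) if gt else 0.0
    return precision, recall
```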
Integration and Accessibility
The paper also contributes to the broader research community by integrating its evaluation pipeline into the LAVIS (LAnguage-VISion) library. This integration enables streamlined, rapid assessment of future MLLMs and promotes rigorous, comprehensive evaluation practices.
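As an illustration of what LAVIS-based evaluation looks like, the sketch below loads a BLIP-2 checkpoint through LAVIS's standard `load_model_and_preprocess` entry point and answers a single question about one image; the model name, checkpoint type, prompt template, and image path are assumptions and may differ from the paper's exact configuration.

```python
# Minimal LAVIS usage sketch; model/checkpoint names, the prompt template,
# and the image path are assumptions, not the paper's exact setup.
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load a BLIP-2 (Flan-T5-XL) checkpoint and its matching image preprocessor.
model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_t5", model_type="pretrain_flant5xl", is_eval=True, device=device
)

raw_image = Image.open("chart.png").convert("RGB")
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

# Zero-shot VQA-style prompt; a benchmark harness would compare the generated
# string against the ground-truth answer.
answer = model.generate({
    "image": image,
    "prompt": "Question: What is the title of the chart? Answer:",
})
print(answer)
```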
Conclusion
The work presented in this paper underscores the necessity for updated, unbiased datasets to accurately evaluate the capabilities of modern MLLMs. By introducing robust datasets and performing extensive evaluations, the authors provide valuable insights and a practical framework to benchmark progress in the field of multi-modal AI. The project webpage offers additional resources and code to further support the research community in this ongoing endeavor.