
Revisiting Multi-Modal LLM Evaluation (2408.05334v1)

Published 9 Aug 2024 in cs.AI, cs.CL, and cs.CV

Abstract: With the advent of multi-modal LLMs (MLLMs), datasets used for visual question answering (VQA) and referring expression comprehension have seen a resurgence. However, the most popular datasets used to evaluate MLLMs are some of the earliest ones created, and they have many known problems, including extreme bias, spurious correlations, and an inability to permit fine-grained analysis. In this paper, we pioneer evaluating recent MLLMs (LLaVA 1.5, LLaVA-NeXT, BLIP2, InstructBLIP, GPT-4V, and GPT-4o) on datasets designed to address weaknesses in earlier ones. We assess three VQA datasets: 1) TDIUC, which permits fine-grained analysis on 12 question types; 2) TallyQA, which has simple and complex counting questions; and 3) DVQA, which requires optical character recognition for chart understanding. We also study VQDv1, a dataset that requires identifying all image regions that satisfy a given query. Our experiments reveal the weaknesses of many MLLMs that have not previously been reported. Our code is integrated into the widely used LAVIS framework for MLLM evaluation, enabling the rapid assessment of future MLLMs. Project webpage: https://kevinlujian.github.io/MLLM_Evaluations/


Summary

  • The paper critically assesses existing VQA datasets, exposing biases and limitations in evaluating multi-modal LLMs.
  • It evaluates MLLMs on benchmarks designed to address these weaknesses (TDIUC, TallyQA, DVQA, and VQDv1), enabling detailed analysis of visual and textual reasoning.
  • The study evaluates leading MLLMs such as LLaVA and GPT-4V, uncovering specific weaknesses in counting, OCR, and spatial comprehension.

The paper "Revisiting Multi-Modal LLM Evaluation" explores the challenging yet pivotal task of assessing multi-modal LLMs (MLLMs). As the field of MLLMs continues to advance, traditional evaluation datasets have become increasingly outdated, exhibiting several significant issues including extreme bias, spurious correlations, and insufficient ability for fine-grained analysis.

Key Contributions

  1. Critical Evaluation of Current Datasets: The authors scrutinize the most popular datasets used for visual question answering (VQA) and referring expression comprehension, arguing that, despite their widespread use, these datasets contain inherent flaws that prevent a comprehensive evaluation of MLLMs.
  2. Adoption of More Robust Datasets: To address these gaps, the authors evaluate MLLMs on three VQA datasets designed to remedy the weaknesses of earlier benchmarks:
     • TDIUC, which permits fine-grained analysis across 12 different question types;
     • TallyQA, which features both simple and complex counting questions;
     • DVQA, which requires optical character recognition (OCR) for interpreting charts.
     They also use VQDv1 for the referring-expression-style task of identifying all image regions that satisfy a given query.
  3. Evaluation of Recent MLLMs: The paper conducts an in-depth evaluation of state-of-the-art MLLMs, including LLaVA 1.5, LLaVA-NeXT, BLIP2, InstructBLIP, GPT-4V, and GPT-4o. The experiments reveal previously unreported weaknesses and limitations of these models.

Experimental Findings

  • Fine-Grained Analysis with TDIUC: The analysis using TDIUC reveals discrepancies in MLLM performance across the 12 question types, providing insight into the specific areas where each model excels or underperforms.

  • Counting Capabilities in TallyQA: TallyQA enables evaluation of MLLMs' counting abilities, distinguishing performance on simple versus complex counting questions and shedding light on how well models handle numerical understanding and reasoning.

  • OCR Challenges in DVQA: DVQA tests the OCR capabilities of MLLMs in the context of chart interpretation, a crucial but often overlooked aspect of visual data comprehension.

  • Referring Expression Comprehension in VQDv1: Evaluation with VQDv1 probes models' ability to identify all image regions corresponding to a query, highlighting strengths and vulnerabilities in spatial and contextual understanding; a generic scoring sketch follows this list.
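Scoring VQDv1-style outputs differs from single-box referring expression comprehension in that both the prediction and the ground truth are variable-size sets of regions. The following is a generic sketch of one common way to score such outputs, greedy IoU matching at a fixed threshold; the 0.5 threshold and the precision/recall formulation are illustrative assumptions, not the paper's exact protocol.

```python
# Generic sketch: score a set of predicted boxes against a set of ground-truth boxes
# by greedily matching each ground-truth box to its best unused prediction at an IoU
# threshold. Illustrative only; not the paper's exact VQDv1 scoring protocol.
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def match_boxes(pred: List[Box], gt: List[Box], thresh: float = 0.5) -> Tuple[float, float]:
    """Return (precision, recall) after greedy one-to-one matching."""
    used, matches = set(), 0
    for g in gt:
        best_iou, best_idx = 0.0, None
        for i, p in enumerate(pred):
            score = iou(p, g)
            if i not in used and score > best_iou:
                best_iou, best_idx = score, i
        if best_idx is not None and best_iou >= thresh:
            used.add(best_idx)
            matches += 1
    precision = matches / len(pred) if pred else 1.0  # convention: no predictions -> no false positives
    recall = matches / len(gt) if gt else 1.0         # convention: no ground truth -> nothing missed
    return precision, recall

# Example: one of two ground-truth regions is found, plus one spurious prediction.
pred = [(10, 10, 50, 50), (200, 200, 240, 240)]
gt = [(12, 11, 52, 49), (100, 100, 140, 140)]
print(match_boxes(pred, gt))  # (0.5, 0.5)
```

Handling the empty-prediction and empty-ground-truth cases explicitly matters for set-valued outputs like these; the conventions used above are one reasonable choice.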

Integration and Accessibility

The authors also contribute to the broader research community by integrating their evaluation code into the widely used LAVIS language-and-vision framework. This integration enables streamlined, rapid assessment of future MLLMs and promotes rigorous, comprehensive evaluation practices.
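As an illustration of what a quick assessment through LAVIS can look like, the sketch below loads a BLIP-2 checkpoint via LAVIS's standard `load_model_and_preprocess` interface, generates answers with `model.generate`, and aggregates exact-match accuracy per question type in the spirit of TDIUC's fine-grained breakdown. The file paths, sample records, and question-type labels are hypothetical placeholders; this is not the authors' released evaluation code.

```python
# Minimal sketch: query a BLIP-2 checkpoint through LAVIS and report exact-match
# accuracy per question type (a TDIUC-style breakdown). Paths, records, and
# question-type labels are hypothetical; the authors' actual evaluation code is
# distributed with their LAVIS integration.
from collections import defaultdict

import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Standard LAVIS call: returns the model plus its image/text preprocessors.
model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_t5", model_type="pretrain_flant5xl", is_eval=True, device=device
)

def answer(image_path: str, question: str) -> str:
    """Open-ended VQA-style generation for a single image/question pair."""
    raw = Image.open(image_path).convert("RGB")
    image = vis_processors["eval"](raw).unsqueeze(0).to(device)
    out = model.generate({"image": image, "prompt": f"Question: {question} Answer:"})
    return out[0].strip().lower()

# Hypothetical records; a real run would iterate over the TDIUC/TallyQA/DVQA splits.
samples = [
    {"image": "img1.jpg", "question": "What color is the bus?", "answer": "red", "qtype": "color"},
    {"image": "img2.jpg", "question": "How many dogs are there?", "answer": "3", "qtype": "counting"},
]

correct, total = defaultdict(int), defaultdict(int)
for s in samples:
    total[s["qtype"]] += 1
    correct[s["qtype"]] += int(answer(s["image"], s["question"]) == s["answer"].lower())

# Per-question-type accuracy: the kind of fine-grained view TDIUC is designed to support.
for qtype, n in total.items():
    print(f"{qtype}: {correct[qtype] / n:.3f}")
```

Reporting accuracy per question type, rather than a single aggregate score, is what makes this style of analysis informative about where a given model breaks down.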

Conclusion

The work presented in this paper underscores the necessity for updated, unbiased datasets to accurately evaluate the capabilities of modern MLLMs. By introducing robust datasets and performing extensive evaluations, the authors provide valuable insights and a practical framework to benchmark progress in the field of multi-modal AI. The project webpage offers additional resources and code to further support the research community in this ongoing endeavor.
