- The paper critically assesses existing VQA datasets, exposing biases and limitations in evaluating multi-modal LLMs.
- It evaluates models on four targeted benchmarks (TDIUC, TallyQA, DVQA, and VQDv1) that enable fine-grained analysis of visual and textual reasoning.
- The study evaluates leading MLLMs such as LLaVA and GPT-4V, uncovering specific weaknesses in counting, OCR, and spatial comprehension.
The paper "Revisiting Multi-Modal LLM Evaluation" explores the challenging yet pivotal task of assessing multi-modal LLMs (MLLMs). As the field of MLLMs continues to advance, traditional evaluation datasets have become increasingly outdated, exhibiting several significant issues including extreme bias, spurious correlations, and insufficient ability for fine-grained analysis.
Key Contributions
- Critical Evaluation of Current Datasets: The authors scrutinize popular datasets used in visual question answering (VQA) and referring expression comprehension. They argue that these datasets, despite their widespread usage, contain inherent flaws that prevent a comprehensive evaluation of MLLMs.
- Introduction of Robust Datasets:
To address these gaps, the authors are the first to apply three purpose-built VQA datasets to MLLM evaluation:
- TDIUC: Permits fine-grained analysis across 12 different question types.
- TallyQA: Features both simple and complex counting questions.
- DVQA: Requires optical character recognition (OCR) for interpreting charts.
Additionally, they use VQDv1 for referring expression comprehension, which requires identifying all image regions that satisfy a given query (possibly none, possibly several).
- Evaluation of Recent MLLMs: The paper conducts an in-depth evaluation of state-of-the-art MLLMs, including LLaVA 1.5, LLaVA-NeXT, BLIP-2, InstructBLIP, GPT-4V, and GPT-4o. The experiments reveal previously unreported weaknesses and limitations of these models. A schematic of this kind of benchmark evaluation loop is sketched below.
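To make the evaluation protocol concrete, here is a minimal sketch of the kind of loop these benchmarks enable: each dataset supplies (image, question, answer, question-type) records, the model is queried zero-shot, and responses are scored by normalized exact match. The `VQASample` schema and `MultiModalModel` interface are hypothetical placeholders, not the paper's actual harness.

```python
# Hypothetical sketch of a zero-shot VQA evaluation loop; the sample schema
# and model interface are illustrative placeholders, not the paper's harness.
from dataclasses import dataclass
from typing import List, Protocol


@dataclass
class VQASample:
    image_path: str
    question: str
    answer: str          # ground-truth short answer
    question_type: str   # e.g. "counting", "color", "chart_ocr"


class MultiModalModel(Protocol):
    def answer(self, image_path: str, question: str) -> str: ...


def evaluate(model: MultiModalModel, samples: List[VQASample]) -> float:
    """Normalized exact-match accuracy over one benchmark's samples."""
    def norm(s: str) -> str:
        return s.strip().lower().rstrip(".")

    correct = sum(
        norm(model.answer(s.image_path, s.question)) == norm(s.answer)
        for s in samples
    )
    return correct / max(len(samples), 1)
```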
Experimental Findings
- Fine-Grained Analysis with TDIUC:
The analysis using TDIUC reveals marked discrepancies in MLLM performance across its question types, pinpointing the areas where models excel or underperform. A sketch of this per-type scoring, together with the TallyQA counting split, follows this list.
- Counting Capabilities in TallyQA:
TallyQA enables the evaluation of MLLMs' counting abilities, distinguishing performance on simple versus complex counting tasks. This sheds light on how well models handle numerical understanding and reasoning.
- Chart OCR in DVQA:
DVQA tests the OCR capabilities of MLLMs in the context of chart interpretation, a crucial but often overlooked aspect of visual data comprehension.
- Referring Expression Comprehension in VQDv1:
Evaluation with VQDv1 probes how accurately models identify the image regions that correspond to complex queries, highlighting strengths and weaknesses in spatial and contextual understanding. An IoU-based scoring sketch for this task also follows the list.
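To make the per-type and counting analyses concrete, the sketch below computes per-question-type accuracy with arithmetic and harmonic mean-per-type (the metrics commonly reported with TDIUC) and a simple-versus-complex accuracy split in the style of TallyQA. The record formats and field names are illustrative assumptions, not the paper's data schema.

```python
# Sketch of TDIUC-style per-type scoring and a TallyQA-style counting split.
# Record formats are illustrative assumptions.
from collections import defaultdict
from statistics import harmonic_mean
from typing import Dict, List, Tuple


def per_type_accuracy(records: List[Tuple[str, bool]]) -> Dict[str, float]:
    """records: (question_type, is_correct) pairs."""
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for qtype, correct in records:
        totals[qtype] += 1
        hits[qtype] += int(correct)
    return {q: hits[q] / totals[q] for q in totals}


def mean_per_type(acc: Dict[str, float]) -> Dict[str, float]:
    """Mean-per-type accuracy, so rare question types count as much as
    frequent ones; the harmonic mean is dominated by the weakest type."""
    vals = list(acc.values())
    return {
        "arithmetic_mpt": sum(vals) / len(vals),
        "harmonic_mpt": harmonic_mean(vals) if all(v > 0 for v in vals) else 0.0,
    }


def counting_accuracy(records: List[Tuple[bool, bool]]) -> Dict[str, float]:
    """records: (is_complex, is_correct) pairs, reported per split."""
    buckets: Dict[str, List[bool]] = {"simple": [], "complex": []}
    for is_complex, correct in records:
        buckets["complex" if is_complex else "simple"].append(correct)
    return {split: sum(v) / len(v) for split, v in buckets.items() if v}
```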
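For VQDv1, the model must return every box that satisfies the query (possibly none), so scoring naturally takes the form of matching predicted boxes against ground-truth boxes. The sketch below uses greedy one-to-one matching at an IoU threshold of 0.5; the threshold and matching scheme are illustrative assumptions rather than the paper's exact protocol.

```python
# Sketch of IoU-based scoring for a VQDv1-style query; greedy matching at
# IoU >= 0.5 is an assumed protocol, not necessarily the paper's.
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    def area(r: Box) -> float:
        return max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])

    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def match_boxes(pred: List[Box], gt: List[Box], thr: float = 0.5) -> Tuple[float, float]:
    """Greedy one-to-one matching; returns (precision, recall) for one query."""
    if not gt and not pred:
        return 1.0, 1.0  # the model correctly reported "no matching region"
    unmatched_gt = list(gt)
    tp = 0
    for p in pred:
        best_i, best_v = -1, thr
        for i, g in enumerate(unmatched_gt):
            v = iou(p, g)
            if v >= best_v:
                best_i, best_v = i, v
        if best_i >= 0:
            unmatched_gt.pop(best_i)
            tp += 1
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gt) if gt else 0.0
    return precision, recall
```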
Integration and Accessibility
The paper also contributes to the broader research community by integrating its evaluation pipeline into the LAVIS (LAnguage-VISion) library. This integration enables streamlined, rapid assessment of future MLLMs and promotes rigorous, comprehensive evaluation practices.
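As an illustration of what LAVIS-based evaluation looks like, the sketch below loads a BLIP-2 checkpoint through LAVIS's standard `load_model_and_preprocess` entry point and answers a single question about one image; the model name, checkpoint type, prompt template, and image path are assumptions and may differ from the paper's exact configuration.

```python
# Minimal LAVIS usage sketch; model/checkpoint names, the prompt template,
# and the image path are assumptions, not the paper's exact setup.
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load a BLIP-2 (Flan-T5-XL) checkpoint and its matching image preprocessor.
model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_t5", model_type="pretrain_flant5xl", is_eval=True, device=device
)

raw_image = Image.open("chart.png").convert("RGB")
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

# Zero-shot VQA-style prompt; a benchmark harness would compare the generated
# string against the ground-truth answer.
answer = model.generate({
    "image": image,
    "prompt": "Question: What is the title of the chart? Answer:",
})
print(answer)
```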
Conclusion
The work presented in this paper underscores the necessity for updated, unbiased datasets to accurately evaluate the capabilities of modern MLLMs. By introducing robust datasets and performing extensive evaluations, the authors provide valuable insights and a practical framework to benchmark progress in the field of multi-modal AI. The project webpage offers additional resources and code to further support the research community in this ongoing endeavor.