MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models (2306.13394v4)

Published 23 Jun 2023 in cs.CV

Abstract: Multimodal LLM (MLLM) relies on the powerful LLM to perform multimodal tasks, showing amazing emergent abilities in recent studies, such as writing poems based on an image. However, it is difficult for these case studies to fully reflect the performance of MLLM, lacking a comprehensive evaluation. In this paper, we fill in this blank, presenting the first comprehensive MLLM Evaluation benchmark MME. It measures both perception and cognition abilities on a total of 14 subtasks. In order to avoid data leakage that may arise from direct use of public datasets for evaluation, the annotations of instruction-answer pairs are all manually designed. The concise instruction design allows us to fairly compare MLLMs, instead of struggling in prompt engineering. Besides, with such an instruction, we can also easily carry out quantitative statistics. A total of 30 advanced MLLMs are comprehensively evaluated on our MME, which not only suggests that existing MLLMs still have a large room for improvement, but also reveals the potential directions for the subsequent model optimization. The data application manner and online leaderboards are released at https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation.

A Comprehensive Evaluation Benchmark for Multimodal LLMs

The paper "MME: A Comprehensive Evaluation Benchmark for Multimodal LLMs" presents MME, a benchmark designed to evaluate the capabilities of Multimodal LLMs (MLLMs). The authors identify a significant gap in the evaluation of MLLMs—existing methods do not comprehensively assess their diverse abilities. By addressing both perception and cognition, MME provides a robust benchmark encompassing 14 subtasks.

Key Contributions

The authors make several distinct contributions:

  1. Comprehensive Benchmark: MME evaluates MLLMs on both perception and cognition, providing a nuanced measure of their abilities across 14 subtasks (10 perception, 4 cognition) that span coarse-grained and fine-grained recognition as well as reasoning.
  2. Manual Annotation: To avoid data leakage, all instruction-answer pairs are manually constructed. This ensures that the models are evaluated on genuine understanding rather than memorization.
  3. Concise Instructions: The benchmark emphasizes concise instruction design, mitigating the impact of prompt engineering and focusing on model capability.
  4. Evaluation Metrics: Each instruction is a yes/no question, and two questions are posed per image. Accuracy is computed per question, while the stricter accuracy+ counts an image as correct only if both of its questions are answered correctly, reflecting a more comprehensive understanding of the image (a minimal scoring sketch follows this list).
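
The sketch below shows, under stated assumptions, how such a yes/no subtask can be scored. It is not the authors' released evaluation code; the record layout and the helper names (`normalize_answer`, `score_subtask`) are illustrative.

```python
from collections import defaultdict

def normalize_answer(response: str) -> str:
    """Map a free-form model response to 'yes', 'no', or 'other'.

    Anything other than a recognizable yes/no counts as wrong, which also
    surfaces instruction-following failures.
    """
    text = response.strip().lower()
    if text.startswith("yes"):
        return "yes"
    if text.startswith("no"):
        return "no"
    return "other"

def score_subtask(results: list[dict]) -> tuple[float, float, float]:
    """Compute accuracy, accuracy+, and their sum for one subtask.

    `results` is assumed to hold two records per image:
    {"image_id": ..., "prediction": ..., "label": "yes" or "no"}.
    Accuracy is per question; accuracy+ requires both questions about an
    image to be correct. Each subtask therefore tops out at 200 points.
    """
    per_image = defaultdict(list)
    for r in results:
        per_image[r["image_id"]].append(normalize_answer(r["prediction"]) == r["label"])

    answers = [ok for pair in per_image.values() for ok in pair]
    accuracy = 100.0 * sum(answers) / len(answers)
    accuracy_plus = 100.0 * sum(all(pair) for pair in per_image.values()) / len(per_image)
    return accuracy, accuracy_plus, accuracy + accuracy_plus
```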

Detailed Evaluation

The benchmark provides a detailed evaluation of 30 advanced MLLMs, including well-known models such as GPT-4V and BLIP-2. The results reveal substantial variability in capability across models and tasks.

  • Perception Tasks: These involve the recognition and understanding of visual elements such as existence, count, position, and color of objects, as well as fine-grained tasks such as identifying specific scenes or artworks. The paper reveals that models display notable differences, with some excelling in specific areas like object existence while facing challenges in object position perception.
  • Cognition Tasks: These tasks require reasoning that combines visual perception with the knowledge stored in the underlying LLM, covering commonsense reasoning, numerical calculation, text translation, and code reasoning. The results suggest that existing MLLMs need further development to perform consistently well across these subtasks (the sketch below shows how subtask scores roll up into the reported totals).
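
As a rough illustration of how per-subtask scores combine into the totals reported on the MME leaderboard (a perception score out of 2000 across ten subtasks and a cognition score out of 800 across four), the sketch below aggregates outputs of the hypothetical `score_subtask` helper shown earlier; the subtask names follow the paper, but the aggregation code itself is an assumption.

```python
# Subtask names follow the paper; each subtask score is accuracy + accuracy+ (max 200).
PERCEPTION_SUBTASKS = [
    "existence", "count", "position", "color", "poster",
    "celebrity", "scene", "landmark", "artwork", "OCR",
]
COGNITION_SUBTASKS = [
    "commonsense_reasoning", "numerical_calculation",
    "text_translation", "code_reasoning",
]

def aggregate(subtask_scores: dict[str, float]) -> dict[str, float]:
    """Sum per-subtask scores into the two headline totals.

    Perception tops out at 10 * 200 = 2000 points, cognition at 4 * 200 = 800.
    Missing subtasks simply contribute zero here.
    """
    return {
        "perception": sum(subtask_scores.get(n, 0.0) for n in PERCEPTION_SUBTASKS),
        "cognition": sum(subtask_scores.get(n, 0.0) for n in COGNITION_SUBTASKS),
    }
```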

Identified Challenges and Future Directions

The paper highlights several challenges identified through the evaluation:

  • Instruction Following: Some MLLMs fail to follow even the benchmark's concise instructions, responding with free-form text rather than a simple yes or no, which indicates a gap in effective instruction following.
  • Basic Perception and Reasoning: Many models struggle with basic perception tasks, particularly in more nuanced scenarios like counting and spatial recognition, as well as in logical reasoning required for tasks such as arithmetic calculations.
  • Object Hallucination: A prominent issue is that models sometimes describe objects that are not present in the image, highlighting the need for improved visual grounding.

Implications and Future Work

The introduction of MME provides crucial insights into the current capabilities and limitations of MLLMs. The paper suggests that while these models exhibit impressive emergent abilities, there is substantial room for improvement, particularly in reducing hallucinations and following instructions more reliably. Future research can leverage MME to benchmark advances in MLLM architectures and training methods that aim to address these limitations.

This benchmark represents a foundational step in the evaluation of multimodal AI systems, contributing valuable data and insights that can drive the next wave of innovations in AI research and development.

Authors (12)
  1. Chaoyou Fu
  2. Peixian Chen
  3. Yunhang Shen
  4. Yulei Qin
  5. Mengdan Zhang
  6. Xu Lin
  7. Jinrui Yang
  8. Xiawu Zheng
  9. Ke Li
  10. Xing Sun
  11. Yunsheng Wu
  12. Rongrong Ji