Exploring MME-RealWorld: Evaluating Multimodal LLMs in Complex Real-World Scenarios
The paper "MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?" by Zhang et al. presents a comprehensive benchmark aimed at rigorously evaluating Multimodal LLMs (MLLMs) in challenging real-world scenarios. The authors identify several limitations in existing benchmarks, such as small data scales, reliance on model-based annotations, and insufficient task difficulty. To address these issues, they introduce MME-RealWorld and its Chinese counterpart, MME-RealWorld-CN. This essay provides an expert overview of the paper's contributions, findings, and implications for the future of AI research.
Key Contributions
- Large-Scale Dataset: The authors construct the largest fully human-annotated benchmark for MLLMs to date. MME-RealWorld comprises 29,429 question-answer (QA) pairs over 13,366 high-resolution images, selected from more than 300K candidate images gathered from public datasets and the Internet. The dataset spans five primary domains: Optical Character Recognition (OCR) in the Wild, Remote Sensing (RS), Diagrams and Tables (DT), Autonomous Driving (AD), and Monitoring (MO); a sketch of one possible QA record format follows this list.
- High-Quality Annotations: The benchmark features meticulous annotations created by 25 professional annotators and 7 MLLM experts. This extensive effort ensures the robustness and reliability of the data, avoiding the noise introduced by model-based annotations.
- Challenging Task Design: The tasks within MME-RealWorld are purposefully designed to be difficult, featuring high-resolution images averaging 2,000×1,500 pixels and covering real-world scenarios that are complex even for humans. This includes tasks like object counting in remote sensing images and intention prediction in autonomous driving scenarios.
- Evaluation of Advanced Models: The benchmark evaluates 28 prominent MLLMs, such as GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet, providing a thorough assessment of their performance on both English and Chinese datasets.
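To make the dataset description concrete, here is a minimal sketch of what a single multiple-choice QA record could look like. The field names, option letters, and file paths are illustrative assumptions, not the benchmark's released schema; consult the official MME-RealWorld repository for the actual format.

```python
# Hypothetical record layout; field names and paths are assumptions for illustration.
example_record = {
    "question_id": "OCR_000123",             # illustrative identifier
    "domain": "OCR in the Wild",             # one of the five primary domains
    "image_path": "images/ocr/000123.jpg",   # path to a high-resolution image (assumed)
    "question": "What is the license plate number of the white van?",
    "options": {                              # multiple-choice options (letters assumed)
        "A": "B 12345",
        "B": "B 12354",
        "C": "B 21345",
        "D": "B 12435",
        "E": "The image does not provide this information.",
    },
    "answer": "A",
}

def format_prompt(record: dict) -> str:
    """Render a record as a simple multiple-choice prompt for an MLLM."""
    options = "\n".join(f"({letter}) {text}" for letter, text in record["options"].items())
    return f"{record['question']}\n{options}\nAnswer with the option letter only."

print(format_prompt(example_record))
```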
Major Findings
- Low Overall Accuracy: None of the evaluated models achieved more than 60% accuracy, indicating significant difficulty with high-resolution image perception and understanding of complex scenarios. The best-performing model, InternVL-2, attained only 55.82% accuracy on perception tasks (a minimal accuracy-scoring sketch follows this list).
- Resolution Impact: Models employing high-resolution inputs (e.g., SliME, Cambrian-1, Mini-Gemini-HD) outperformed those reliant on standard vision encoders. This underscores the importance of processing detailed image information for real-world applications.
- OCR Excellence and Deficiencies: GPT-4o excelled in OCR tasks, achieving 77.69% accuracy. However, its performance in other domains was less impressive, suggesting a need for more generalized learning approaches.
- Real-World Task Difficulty: The tasks involving complex reasoning, such as intention prediction in autonomous driving and object attribute recognition in monitoring, remain particularly challenging. The highest accuracy in reasoning tasks was only 44.12%, achieved by Claude 3.5 Sonnet.
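Since every finding above is reported as multiple-choice accuracy, a small scoring sketch may help. The regex-based letter extraction and helper names below are simplifying assumptions, not the paper's official evaluation code.

```python
# Minimal multiple-choice accuracy scoring, assuming each prediction reduces to one letter.
import re

def extract_choice(model_output: str) -> str | None:
    """Pull the first standalone option letter (A-E) out of a model response."""
    match = re.search(r"\b([A-E])\b", model_output.strip())
    return match.group(1) if match else None

def accuracy(predictions: list[str], answers: list[str]) -> float:
    """Fraction of questions where the extracted letter matches the ground truth."""
    correct = sum(
        extract_choice(pred) == ans for pred, ans in zip(predictions, answers)
    )
    return correct / len(answers) if answers else 0.0

# Toy usage: three questions, two answered correctly -> about 67% accuracy.
preds = ["The answer is (A).", "B", "I think the correct option is E."]
golds = ["A", "C", "E"]
print(f"accuracy = {accuracy(preds, golds):.2%}")
```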
Practical and Theoretical Implications
The findings from MME-RealWorld highlight several critical aspects of MLLM development:
- Enhanced Data Processing: The substantial performance gap indicates a need for more sophisticated data processing techniques capable of handling high-resolution images efficiently. This could involve novel algorithms for dynamic image chunking or more powerful vision encoders (a tiling sketch follows this list).
- Model Robustness: The benchmark reveals that even the most advanced models struggle with the robustness required for real-world applications. This points to the necessity of further research into models that can better generalize and adapt to diverse, complex scenarios.
- Computation Optimization: Given the high computational cost associated with processing high-resolution images, future research should focus on optimizing computational efficiency without sacrificing accuracy. This is vital for deploying MLLMs in resource-constrained environments.
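As one concrete illustration of the image-chunking idea mentioned above, the sketch below splits a large image into fixed-size tiles for a vision encoder that expects a small, fixed input. The 336-pixel tile size and the use of Pillow are assumptions for illustration, not the approach of any particular model evaluated in the benchmark.

```python
# Fixed-grid tiling of a high-resolution image; tile size is an illustrative assumption.
from PIL import Image

def tile_image(image: Image.Image, tile_size: int = 336) -> list[Image.Image]:
    """Split an image into non-overlapping tiles, padding the ragged edges."""
    width, height = image.size
    tiles = []
    for top in range(0, height, tile_size):
        for left in range(0, width, tile_size):
            box = (left, top, min(left + tile_size, width), min(top + tile_size, height))
            tile = image.crop(box)
            # Pad edge tiles so every tile matches the encoder's expected input size.
            padded = Image.new(image.mode, (tile_size, tile_size))
            padded.paste(tile, (0, 0))
            tiles.append(padded)
    return tiles

# An image at the benchmark's average resolution (2000x1500) yields a 6x5 grid.
image = Image.new("RGB", (2000, 1500))
print(len(tile_image(image)))  # 30 tiles of 336x336 each
```

The trade-off this makes explicit is the one raised under Computation Optimization: more tiles preserve more detail but multiply the number of vision-encoder forward passes per image.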
Future Directions
The introduction of MME-RealWorld opens several avenues for future research:
- Improved Model Architectures: Developing architectures that can intrinsically handle high-resolution images and complex multimodal data will be critical. This might include hybrid models that combine the strengths of vision-specific and language-specific encoders (a minimal composition sketch follows this list).
- Specialized Training: Fine-tuning models on domain-specific datasets and improving instruction-following capabilities could enhance their performance in complex scenarios.
- Real-World Testing: Deploying MLLMs in real-world applications, such as autonomous driving and remote sensing, will provide practical insights into their performance and areas needing improvement.
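To ground the architecture point above, here is a toy-scale sketch of the common composition pattern in which a vision encoder's patch embeddings are projected into a language model's token space. Every module size, the PyTorch stand-ins, and the class name are placeholder assumptions, not a description of any model evaluated in the paper.

```python
# Toy vision-encoder + projector + language-backbone composition; all sizes are placeholders.
import torch
import torch.nn as nn

class ToyMultimodalModel(nn.Module):
    def __init__(self, vision_dim=256, text_dim=256, vocab_size=1000):
        super().__init__()
        # Stand-in for a ViT: 14x14 patch embedding over a 336x336 image tile.
        self.patch_embed = nn.Conv2d(3, vision_dim, kernel_size=14, stride=14)
        self.projector = nn.Linear(vision_dim, text_dim)      # vision -> LLM embedding space
        self.text_embed = nn.Embedding(vocab_size, text_dim)
        # Stand-in for the language-model backbone.
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=text_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.lm_head = nn.Linear(text_dim, vocab_size)

    def forward(self, image: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
        # (B, 3, 336, 336) -> (B, 576, vision_dim): one embedding per image patch.
        patches = self.patch_embed(image).flatten(2).transpose(1, 2)
        visual_tokens = self.projector(patches)
        text_tokens = self.text_embed(token_ids)
        # Prepend visual tokens to the text tokens and produce per-position logits.
        sequence = torch.cat([visual_tokens, text_tokens], dim=1)
        return self.lm_head(self.backbone(sequence))

# Toy usage: one 336x336 image tile plus a 5-token question.
model = ToyMultimodalModel()
logits = model(torch.randn(1, 3, 336, 336), torch.randint(0, 1000, (1, 5)))
print(logits.shape)  # torch.Size([1, 581, 1000])
```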
Overall, MME-RealWorld represents a significant step forward in the rigorous evaluation of MLLMs, providing a robust framework for future advancements in the field. The benchmark's comprehensive design and challenging tasks set a high bar for the development of more capable and reliable multimodal AI systems.