Comprehensive Evaluation of Multimodal LLMs on Text-Rich Visual Data via SEED-Bench-2-Plus
Introduction
The understanding of text-rich visual data is a crucial capability for Multimodal LLMs (MLLMs). These models are tasked with deciphering intricate visual information laden with embedded text, such as charts, maps, and web pages, which are pervasive in real-world applications. To this end, SEED-Bench-2-Plus has been developed as a robust benchmark for comprehensively evaluating how well various MLLMs handle such complex data.
Overview of SEED-Bench-2-Plus
SEED-Bench-2-Plus extends and enriches the scope of the previous benchmark version by introducing 2.3K multiple-choice questions that span a broad range of text-rich visual comprehension challenges across three major categories:
- Charts: Assessing the model's ability to interpret and extract information from various graphical representations.
- Maps: Evaluating geographical and symbolic data comprehension within different map types.
- Webs: Testing the capability to understand and interact with data from various webpage layouts.
These categories are further broken down into 63 fine-grained data types, providing a comprehensive framework for testing and refining the capabilities of MLLMs on text-rich visual information. A schematic view of how a single benchmark item might be represented is sketched below.
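The following is a minimal, hypothetical sketch of one benchmark item as a data record. The class and field names (`SeedBench2PlusItem`, `data_type`, and so on) are illustrative assumptions, not the official schema of the released dataset.

```python
from dataclasses import dataclass

# Hypothetical record layout for one SEED-Bench-2-Plus item; field names are
# illustrative assumptions, not taken from the official release.
@dataclass
class SeedBench2PlusItem:
    question_id: str
    category: str        # one of the three broad categories: "Charts", "Maps", "Webs"
    data_type: str       # one of the 63 fine-grained types, e.g. "flowchart" (assumed label)
    image_path: str      # path to the text-rich image
    question: str        # question text about the image
    choices: list[str]   # candidate answers for the multiple-choice question
    answer: str          # ground-truth choice, e.g. "B"
```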
Data Collection and Evaluation Strategy
The benchmark utilizes rigorous methodologies for both data collection and evaluation:
- Data Source: A combination of manual collection and automated tooling (e.g., GPT-4V for question generation) was employed, ensuring a rich and diverse dataset.
- Evaluation Strategy: Unlike benchmarks that rely on free-form generation and pattern matching, SEED-Bench-2-Plus uses an answer ranking strategy: the likelihood that the model produces each candidate choice is computed, and the highest-scoring choice is taken as its prediction. This design minimizes bias and reduces reliance on the model's ability to follow specific answering patterns; a sketch of the idea follows this list.
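Below is a minimal sketch of likelihood-based answer ranking. It uses a small text-only causal LM (gpt2) purely as a stand-in; an actual MLLM would also condition on the image, and the prompt format shown is an assumption rather than the benchmark's exact template.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Stand-in text-only model for illustration; a real evaluation would use an MLLM
# conditioned on the image as well as the question text.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def answer_log_likelihood(prompt: str, answer: str) -> float:
    """Sum of log-probabilities the model assigns to the answer tokens given the prompt."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    answer_ids = tokenizer(answer, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, answer_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Logits at position t predict the token at position t + 1,
    # so score only the positions that predict answer tokens.
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    answer_start = prompt_ids.shape[1]
    return sum(
        log_probs[0, pos - 1, input_ids[0, pos]].item()
        for pos in range(answer_start, input_ids.shape[1])
    )

def rank_answers(prompt: str, choices: list[str]) -> str:
    """Pick the choice the model deems most likely as a continuation of the prompt."""
    return max(choices, key=lambda c: answer_log_likelihood(prompt, c))

# Illustrative question and choices, not drawn from the benchmark itself.
prediction = rank_answers(
    "Question: Which month shows the highest revenue in the chart?\nAnswer:",
    [" January", " April", " July", " October"],
)
print(prediction)
```

When candidate answers differ substantially in length, a common variant is to average the per-token log-probability instead of summing, so longer choices are not penalized simply for having more tokens.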
Findings from the Benchmark
A comprehensive evaluation of 34 MLLMs, including prominent models such as GPT-4V, Gemini-Pro-Vision, and Claude-3-Opus, revealed varied performance across models and data types. The evaluation highlighted:
- General Challenges in Text-Rich Scenarios: Performance differs significantly across models, indicating that processing complex, text-rich multimodal information remains an open challenge.
- Specific Struggles with Maps: Maps, which often layer symbolic, geographic, and textual information and demand contextual comprehension, proved particularly challenging.
- Performance Variability: Models handle the various data types very unevenly, suggesting that current MLLMs may require further specialization or training to cope with the intricacies of real-world text-rich scenarios. One way to surface this unevenness is per-category and per-type accuracy, sketched after this list.
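A small sketch of how per-category and per-type accuracy could be aggregated from evaluation results. The record layout (category, data type, correctness flag) is an assumption made for illustration, not the benchmark's official output format.

```python
from collections import defaultdict

def accuracy_by_group(results, key):
    """Aggregate accuracy over groups selected by `key` from (category, data_type, is_correct) records."""
    totals, correct = defaultdict(int), defaultdict(int)
    for category, data_type, is_correct in results:
        group = key(category, data_type)
        totals[group] += 1
        correct[group] += int(is_correct)
    return {group: correct[group] / totals[group] for group in totals}

# Toy records; a real run would cover all 2.3K questions.
results = [
    ("Charts", "bar_chart", True),
    ("Maps", "transit_map", False),
    ("Webs", "news_page", True),
]
per_category = accuracy_by_group(results, key=lambda c, t: c)   # accuracy per broad category
per_data_type = accuracy_by_group(results, key=lambda c, t: t)  # variability across the 63 types
```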
Implications for Future Research
The insights from SEED-Bench-2-Plus underscore several avenues for future work:
- Model Improvement: There is a critical need for enhancing MLLM design and training approaches to better handle the complexity and variability of text-rich visual data.
- Benchmark Refinement: Continual development of benchmarks like SEED-Bench-2-Plus is essential to push the boundaries of what MLLMs can understand and how effectively they can operate in real-world applications.
- Community Collaboration: By making SEED-Bench-2-Plus publicly available and maintaining a leaderboard, the benchmark encourages ongoing community engagement and collaborative improvement in the multimodal AI field.
Conclusion
SEED-Bench-2-Plus offers a rigorous and detailed framework for evaluating the proficiency of MLLMs in understanding text-rich visuals, positioning itself as a critical tool for guiding future advancements in AI research and application. With its comprehensive design and robust evaluation methodology, it sets a new standard for assessing the capabilities of AI models in handling the complexities of real-world data.