SEED-Bench-2: A Comprehensive Benchmark for Multimodal LLMs
As Multimodal LLMs (MLLMs) continue to advance, understanding and evaluating their capabilities remains a critical area of exploration. The paper "SEED-Bench-2: Benchmarking Multimodal Large Language Models" presents a well-structured framework for methodically assessing what these models can do. The work categorizes MLLM capabilities into hierarchical levels from L0 to L4 and proposes SEED-Bench-2, a comprehensive benchmark that evaluates these hierarchical capabilities, specifically up to L3.
Hierarchical Capability Levels and Evaluation Dimensions
The authors introduce a thoughtful categorization of MLLM capabilities into hierarchical levels. At the foundational level, L0, models generate text from text-only inputs, which aligns with the inherent capabilities of LLMs. The subsequent levels (L1 to L4) progressively demand more complex interactions with multimodal content, culminating in L4, where models are expected to process and produce interleaved image-text content in an open-form format. SEED-Bench-2 focuses on levels L1, L2, and L3, reflecting the current frontier of model capabilities. This hierarchical framework not only illustrates current progress but also provides a roadmap for future research.
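For reference, the level taxonomy can be summarized as a simple mapping. This is a paraphrase of the paper's definitions for orientation, not code or terminology released with the benchmark:

```python
# Paraphrased summary of the paper's capability levels (illustrative only).
CAPABILITY_LEVELS = {
    "L0": "Generate text from text-only inputs (plain LLM behavior).",
    "L1": "Generate text from fixed-format multimodal inputs (e.g., image(s) plus a question).",
    "L2": "Generate text from free-form interleaved image-text inputs.",
    "L3": "Generate images in addition to text.",
    "L4": "Process and produce open-form interleaved image-text content.",
}
```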
The benchmark comprises 24,000 multiple-choice questions across 27 evaluation dimensions, each structured to assess a specific aspect of MLLM capability. For instance, the evaluation includes dimensions that require models to comprehend fixed-format multimodal inputs (L1), interpret interleaved image-text inputs (L2), and generate images in addition to text (L3). Notably, the benchmark extends beyond single image-text pair comprehension, embracing the complexity and variety inherent in real-world interactions.
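To make the task format concrete, a single benchmark item can be pictured as a multiple-choice record along the following lines. The field names and example values here are illustrative assumptions, not the benchmark's actual released schema:

```python
# Illustrative sketch of a multiple-choice item; field names and values are
# hypothetical, not SEED-Bench-2's released data format.
from dataclasses import dataclass

@dataclass
class BenchmarkItem:
    question: str          # natural-language question about the visual input
    image_paths: list[str] # one or more images (interleaved with text in part-2 style items)
    choices: list[str]     # candidate answers, exactly one correct
    answer_index: int      # index of the ground-truth choice
    dimension: str         # one of the 27 evaluation dimensions
    level: str             # capability level targeted: "L1", "L2", or "L3"

item = BenchmarkItem(
    question="What is the man in the image holding?",
    image_paths=["example.jpg"],
    choices=["A guitar", "A camera", "A book", "An umbrella"],
    answer_index=1,
    dimension="Instance Identity",
    level="L1",
)
```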
Construction and Evaluation Methodology
The data and evaluation questions were constructed using a combination of automatic pipelines and manually curated datasets. The automatic pipeline leverages foundation models like BLIP2, Tag2Text, and SAM for extracting detailed visual information from images, which is then used to automate the generation of multiple-choice questions. Human annotators further refine this data to ensure accuracy and relevance. This meticulous construction process underscores the robustness of SEED-Bench-2 in providing objective and efficient assessment metrics for current MLLMs.
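A rough sketch of what such a pipeline could look like is shown below. The functions are runnable stand-ins for the real components (BLIP2 for captioning, Tag2Text for tagging, SAM for segmentation, plus an LLM-based question drafter); the paper does not release this exact code, so everything here is an assumption about the overall flow:

```python
# Hypothetical sketch of the automatic annotation pipeline described in the paper.
# The extractor functions below are stubs standing in for BLIP2, Tag2Text, and SAM;
# they return canned values so the sketch runs end to end.

def caption_image(image_path: str) -> str:
    """Stand-in for a BLIP2-style image captioner."""
    return "a man holding a camera on a city street"

def tag_image(image_path: str) -> list[str]:
    """Stand-in for Tag2Text-style open-vocabulary tagging."""
    return ["man", "camera", "street", "daytime"]

def segment_image(image_path: str) -> list[dict]:
    """Stand-in for SAM-style instance segmentation (one box per object)."""
    return [{"label": "man", "box": [120, 40, 260, 400]},
            {"label": "camera", "box": [180, 150, 220, 190]}]

def draft_question(visual_info: dict, dimension: str) -> str:
    """Stand-in for prompting an LLM to draft a 4-option multiple-choice question."""
    return (f"[{dimension}] Based on: {visual_info['caption']} "
            f"(tags: {', '.join(visual_info['tags'])}) -> draft question with 4 options")

def build_drafts(image_paths: list[str], dimensions: list[str]) -> list[str]:
    """Drafts produced automatically; human annotators later filter and correct them."""
    drafts = []
    for path in image_paths:
        info = {
            "caption": caption_image(path),
            "tags": tag_image(path),
            "instances": segment_image(path),
        }
        drafts.extend(draft_question(info, dim) for dim in dimensions)
    return drafts

print(build_drafts(["example.jpg"], ["Instance Counting", "Spatial Relation"]))
```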
For evaluating the MLLMs, the paper adopts an answer ranking strategy: for each candidate choice, the likelihood that the model generates that choice given the question is computed, and the highest-likelihood choice is taken as the model's prediction. This yields a quantitative performance metric without the subjective biases inherent in human evaluation.
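In practice, this kind of likelihood-based ranking is typically implemented by summing the log-probabilities of each choice's tokens conditioned on the question and selecting the highest-scoring option. The following is a minimal sketch using a text-only Hugging Face causal LM (gpt2 as a stand-in model); it illustrates the general strategy rather than the authors' evaluation code, and a real MLLM would also condition on the image features:

```python
# Minimal sketch of likelihood-based answer ranking with a text-only causal LM.
# Generic illustration of the strategy, not the authors' evaluation code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in model for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def choice_log_likelihood(question: str, choice: str) -> float:
    """Sum the log-probabilities of the choice tokens given the question."""
    prompt_ids = tokenizer(question, return_tensors="pt").input_ids
    full_ids = tokenizer(question + " " + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Log-probs at each position for predicting the *next* token.
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    targets = full_ids[:, 1:]
    token_log_probs = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Keep only the positions that correspond to the answer tokens.
    n_choice_tokens = full_ids.shape[1] - prompt_ids.shape[1]
    return token_log_probs[0, -n_choice_tokens:].sum().item()

def rank_choices(question: str, choices: list[str]) -> int:
    """Return the index of the choice the model assigns the highest likelihood."""
    scores = [choice_log_likelihood(question, c) for c in choices]
    return max(range(len(choices)), key=lambda i: scores[i])

prediction = rank_choices(
    "What is the man in the image holding?",
    ["A guitar", "A camera", "A book", "An umbrella"],
)
print(prediction)
```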
Results and Observations
The evaluation results yield several important insights into the current state of open-source MLLMs. First, existing models have not fully realized their potential even at the L1 capability level, with leading models achieving around 60% accuracy. This leaves considerable room for improvement, especially in complex reasoning tasks such as chart understanding and visual mathematics, where all models performed poorly.
Additionally, the results underscore the challenge of comprehending interleaved image-text data: performance on the part-2 (interleaved) dimensions is generally worse than on the fixed-format evaluations, indicating a gap between how models are trained and what the evaluation demands. Furthermore, only a limited subset of models currently supports full multimodal generation, revealing a need for further research into systems that produce both text and images.
Implications and Future Directions
SEED-Bench-2's comprehensive framework has significant implications for advancing the field of multimodal AI. By providing a structured evaluation across a spectrum of capabilities, this benchmark paves the way for identifying specific areas requiring innovation. This work will serve as a cornerstone for both academic research and practical applications aiming to achieve a more profound understanding of and improvements in MLLMs.
As future efforts aspire toward Artificial General Intelligence, frameworks like SEED-Bench-2 will be pivotal. Continuous refinement and expansion of such benchmarks to include emerging modalities and evaluation metrics will be vital for keeping pace with the evolving landscape of AI capabilities. The paper presents a foundational step toward developing robust, versatile MLLMs capable of understanding and generating complex multimodal content in varied contexts.