On Path to Multimodal Generalist: General-Level and General-Bench (2505.04620v1)

Published 7 May 2025 in cs.CV

Abstract: The Multimodal Large Language Model (MLLM) is currently experiencing rapid growth, driven by the advanced capabilities of LLMs. Unlike earlier specialists, existing MLLMs are evolving towards a Multimodal Generalist paradigm. Initially limited to understanding multiple modalities, these models have advanced to not only comprehend but also generate across modalities. Their capabilities have expanded from coarse-grained to fine-grained multimodal understanding and from supporting limited modalities to arbitrary ones. While many benchmarks exist to assess MLLMs, a critical question arises: Can we simply assume that higher performance across tasks indicates a stronger MLLM capability, bringing us closer to human-level AI? We argue that the answer is not as straightforward as it seems. This project introduces General-Level, an evaluation framework that defines 5-scale levels of MLLM performance and generality, offering a methodology to compare MLLMs and gauge the progress of existing systems towards more robust multimodal generalists and, ultimately, towards AGI. At the core of the framework is the concept of Synergy, which measures whether models maintain consistent capabilities across comprehension and generation, and across multiple modalities. To support this evaluation, we present General-Bench, which encompasses a broader spectrum of skills, modalities, formats, and capabilities, including over 700 tasks and 325,800 instances. Evaluation results involving over 100 existing state-of-the-art MLLMs uncover the capability rankings of generalists, highlighting the challenges in reaching genuine AI. We expect this project to pave the way for future research on next-generation multimodal foundation models, providing a robust infrastructure to accelerate the realization of AGI. Project page: https://generalist.top/

The paper "On Path to Multimodal Generalist: General-Level and General-Bench" (Fei et al., 7 May 2025 ) introduces a novel evaluation framework, General-Level, and a comprehensive benchmark, General-Bench, to assess the capabilities of multimodal LLMs (MLLMs) as they evolve towards becoming multimodal generalists and, ultimately, artificial general intelligence (AGI). The authors argue that existing benchmarks primarily focus on task-specific performance and lack a holistic measure of generality, particularly the synergy effect across different modalities and tasks.

The core contribution is the General-Level framework, a 5-tier classification system inspired by autonomous vehicle capability levels. It evaluates MLLMs based on the scope and strength of synergy they exhibit, specifically across task-task, comprehension-generation, and modality-modality interactions.

  • Level-1 (Specialists): Models excelling in specific tasks or modalities, representing the state-of-the-art (SoTA) in narrow domains.
  • Level-2 (Unified Comprehension and/or Generation Generalists): Models capable of handling various modalities and tasks through unified architectures (e.g., MLLMs built on LLMs). Scoring depends on the number of supported tasks/modalities and average performance.
  • Level-3 (Generalists with Synergy in Comprehension and/or Generation): Models that demonstrate improved performance surpassing SoTA specialists on certain tasks through joint learning across multiple tasks within comprehension or generation paradigms. Scoring is based on the average performance on tasks where the model outperforms the corresponding SoTA specialist.
  • Level-4 (Generalists with Synergy Across Comprehension and Generation): Models exhibiting synergy between their comprehension and generation capabilities. Achieving a higher score requires balanced and strong performance where both comprehension and generation tasks surpass SoTA levels.
  • Level-5 (Generalists with Total Synergy): The highest level, requiring synergy across all modalities and both paradigms (comprehension and generation), including language. Crucially, the non-language modalities must enhance language intelligence, yielding performance that exceeds NLP SoTA specialists. This level represents a key milestone towards AGI.

The framework defines scoring metrics for each level, emphasizing that higher levels require increasingly challenging forms of synergy. A model is classified at the highest level at which it achieves a non-zero score. A key simplification is made: outperforming the SoTA specialist on a task is taken as evidence of synergy leveraged from other learned tasks or modalities. The framework is designed to be dynamic, allowing for updates to both the benchmark tasks and the reference SoTA specialist scores as the field evolves.
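
To make the level definitions concrete, here is a minimal Python sketch of the classification logic described above. The data layout, function names, and exact formulas are illustrative assumptions, not the paper's actual General-Level scoring algorithms (which normalize and weight scores in more detail, and which additionally define a Level-5 score over NLP tasks, omitted here for brevity):

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    name: str
    paradigm: str        # "comprehension" or "generation"
    model_score: float   # generalist's normalized score on this task
    sota_score: float    # reference SoTA specialist score on the same task

def level2_score(results):
    """Level-2: average normalized performance over all supported tasks."""
    if not results:
        return 0.0
    return sum(r.model_score for r in results) / len(results)

def level3_score(results):
    """Level-3: credit only tasks where the generalist beats the SoTA
    specialist (the paper's proxy for synergy); zero if it never does."""
    wins = [r.model_score for r in results if r.model_score > r.sota_score]
    return sum(wins) / len(results) if wins else 0.0

def level4_score(results):
    """Level-4: synergy must hold in *both* paradigms, so take the weaker
    of the two per-paradigm Level-3-style scores (rewarding balance)."""
    comp = level3_score([r for r in results if r.paradigm == "comprehension"])
    gen = level3_score([r for r in results if r.paradigm == "generation"])
    return min(comp, gen)

def classify(results):
    """Assign a model to the highest level with a non-zero score."""
    for level, score_fn in [(4, level4_score), (3, level3_score), (2, level2_score)]:
        if score_fn(results) > 0.0:
            return level
    return 1  # falls back to Level-1 (specialist)
```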

To support this framework, the authors introduce General-Bench, a massive multimodal evaluation benchmark. It is designed to be significantly more comprehensive than existing MLLM benchmarks by:

  • Covering a wider array of modalities: including image, video, audio, 3D (RGB, point cloud), language, time series, depth, infrared, spectrogram, radar, code, document, and graph.
  • Encompassing both comprehension and generation tasks, recognizing that true generalists need both capabilities.
  • Featuring a rich diversity of tasks across 29 domains, spanning physical and social sciences, and evaluating 12 modality-invariant capabilities (like reasoning, problem-solving, affective analysis) and 145 modality-specific skills.
  • Preserving the original, native task prediction formats, moving beyond the common practice of converting everything to multiple-choice QA.
  • Containing over 700 tasks and 325,800 instances, making it one of the largest multimodal benchmarks.
  • Utilizing a diverse set of evaluation metrics tailored to each task's native format, standardized via mapping functions onto a common scale for consistent scoring (see the sketch after this list).
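
The following is a hypothetical illustration of what such metric "mapping functions" might look like: each task keeps its native metric, which is then mapped onto a shared 0-100 scale so scores are comparable across tasks. The function names, ranges, and linear rescaling are assumptions for illustration; the paper's concrete mappings may differ:

```python
def map_higher_better(value, lo=0.0, hi=1.0):
    """Linearly rescale a higher-is-better metric (e.g., accuracy, BLEU)
    from its native range [lo, hi] onto [0, 100]."""
    value = min(max(value, lo), hi)  # clamp to the native range
    return 100.0 * (value - lo) / (hi - lo)

def map_lower_better(value, best=0.0, worst=100.0):
    """Invert a lower-is-better metric (e.g., FID, word error rate)
    so that the best native value maps to 100 and the worst to 0."""
    value = min(max(value, best), worst)
    return 100.0 * (worst - value) / (worst - best)

# Example: accuracy 0.82 -> 82.0; FID of 25 (with assumed worst=100) -> 75.0
print(map_higher_better(0.82))   # 82.0
print(map_lower_better(25.0))    # 75.0
```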

To make the benchmark accessible, General-Bench is divided into public "open" and private "closed" test sets, and leaderboard participation is structured into four scopes (full-spectrum, modality-specific, paradigm-specific, skill-level) to accommodate models with different capabilities and available resources.

The authors present experimental results from evaluating over 100 existing MLLMs and SoTA specialists on General-Bench. Key findings include:

  • Most current MLLMs, even advanced models like GPT-4V and GPT-4o, show limited task support across the full spectrum of General-Bench, particularly outside image comprehension.
  • Few MLLMs consistently outperform SoTA specialists across a wide range of tasks, indicating a general lack of strong synergy capabilities necessary for higher levels.
  • There is a clear imbalance, with much stronger support and performance in comprehension tasks compared to generation tasks across most MLLMs.
  • Many MLLMs are primarily focused on a single non-language modality (typically image), with limited support for video, audio, or 3D, undermining their claim as true multimodal generalists.
  • Critically, the evaluation on NLP tasks reveals that no current MLLM demonstrates synergy from non-language modalities to enhance language intelligence beyond the performance of NLP-only SoTA specialists. This highlights a significant gap in achieving Level-5 capabilities.

The leaderboards based on General-Level scoring (Tables 8, 9, 10 in the paper) reflect these observations, showing that current MLLMs are primarily clustered at Level-2 and Level-3, with only a few reaching Level-4, and none achieving Level-5.

The paper concludes by discussing limitations of the current framework and benchmark, such as the simplified synergy measurement and data imbalance, and outlines directions for future research. These include refining the General-Level algorithms, expanding General-Bench to include more diverse and interleaved multimodal tasks, rethinking evaluation methodologies for free-form generation, optimizing model architectures for broader functional and modal support, and, most importantly, focusing research efforts on genuinely strengthening cross-modal synergy, especially from non-language to language modalities, as a critical step towards AGI.

Authors (32)
  1. Hao Fei (105 papers)
  2. Yuan Zhou (251 papers)
  3. Juncheng Li (121 papers)
  4. Xiangtai Li (128 papers)
  5. Qingshan Xu (27 papers)
  6. Bobo Li (23 papers)
  7. Shengqiong Wu (36 papers)
  8. Yaoting Wang (9 papers)
  9. Junbao Zhou (7 papers)
  10. Jiahao Meng (4 papers)
  11. Qingyu Shi (8 papers)
  12. Zhiyuan Zhou (26 papers)
  13. Liangtao Shi (8 papers)
  14. Minghe Gao (12 papers)
  15. Daoan Zhang (24 papers)
  16. Zhiqi Ge (5 papers)
  17. Weiming Wu (5 papers)
  18. Siliang Tang (116 papers)
  19. Kaihang Pan (17 papers)
  20. Yaobo Ye (2 papers)