- The paper introduces OmniGenBench, a comprehensive benchmark that evaluates Large Multimodal Models (LMMs) across 57 diverse sub-tasks spanning six domains, from world-knowledge-anchored and reasoning-driven generation to appearance compliance and dynamics consistency.
- OmniGenBench employs a dual evaluation methodology combining automated visual parsing for perception tasks with an LLM-as-a-judge system for cognition tasks, aligning assessment criteria closely with human judgment.
- Results highlight GPT-4o-Native's superior performance in both perception and cognition compared to other models, demonstrating the benchmark's utility in identifying current LMM capabilities and limitations.
OmniGenBench: Evaluating Instruction-Following in Large Multimodal Models
The paper "OmniGenBench: A Benchmark for Omnipotent Multimodal Generation across 50+ Tasks" presents a comprehensive benchmark for evaluating large multimodal models (LMMs). Rapid advancements in LMMs, notably exemplified by GPT-4o-Native, have emphasized the need for benchmarks that can rigorously assess these models across a variety of tasks, taking into account both perception-centric and cognition-centric capabilities.
Overview
OmniGenBench introduces an expansive framework designed to evaluate LMMs across 57 diverse sub-tasks, systematically categorized into six primary domains (an illustrative sketch of this taxonomy follows the list):
- World Knowledge Anchored Generation: Tasks in this domain assess the model’s ability to generate images grounded in complex world knowledge, including societal roles, events, symbols, and recognized expressions.
- Situational Reasoning Generation: This category evaluates models on reasoning within contextual scenarios, such as deducing the outcomes or causes of particular situations.
- Spatial Reasoning Generation: Focused on understanding and manipulating spatial relationships, this domain challenges the models with tasks requiring 3D and 2D spatial reasoning.
- STEM-Driven Reasoning Generation: Drawing on scientific, technological, engineering, and mathematical concepts, this category covers visualizing underlying principles and producing complex diagrams.
- Appearance Compliance Generation: Tasks involve rendering accurate visual attributes and keeping visuals consistent with the given descriptions, including text rendering and controlled object representation.
- Dynamics Consistency Generation: Models are evaluated on their ability to maintain visual coherence in dynamically evolving contexts, such as narrative generation and image editing.
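To make the taxonomy concrete, here is a minimal sketch of how the benchmark's tasks might be represented in code. The six domain names come from the paper; everything else, including the `SubTask` fields, the example names, and the rubric and attribute fields, is a hypothetical assumption for illustration rather than the paper's actual data format.

```python
# Hypothetical representation of OmniGenBench's task taxonomy.
# Domain names are from the paper; all field names are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class SubTask:
    name: str                    # e.g. "landmark-composition" (made-up example)
    instruction: str             # the generation instruction given to the LMM
    evaluation_mode: str         # "perception" (visual parsing) or "cognition" (LLM judge)
    expected_attrs: dict = field(default_factory=dict)  # attributes an automated parser can verify
    judging_criteria: str = ""   # task-specific rubric for an LLM judge

@dataclass
class Domain:
    name: str
    subtasks: list[SubTask] = field(default_factory=list)

# The six domains; in the benchmark these contain 57 sub-tasks in total.
DOMAINS = [
    Domain("World Knowledge Anchored Generation"),
    Domain("Situational Reasoning Generation"),
    Domain("Spatial Reasoning Generation"),
    Domain("STEM-Driven Reasoning Generation"),
    Domain("Appearance Compliance Generation"),
    Domain("Dynamics Consistency Generation"),
]
```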
Methodology
OmniGenBench features a dual-mode evaluation protocol that combines automated systems with human-aligned judgments. Perception-centric tasks are scored efficiently with automated visual parsing tools that check basic visual attributes, while cognition-centric tasks use an LLM-as-a-judge paradigm whose protocols tailor evaluation criteria to each task. This approach keeps the assessment closely aligned with human judgment across a wide range of scenarios.
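To illustrate how such a dual-mode protocol could be wired together, the sketch below dispatches a generated image either to an automated attribute check or to a judge LLM. It is a minimal illustration under the hypothetical `SubTask` structure above, not the paper's implementation: the `visual_parser` and `judge_llm` callables stand in for whatever parsing tools and judge model the benchmark actually uses.

```python
from typing import Any, Callable

def evaluate(task,                                   # a SubTask from the sketch above
             generated_image: bytes,
             visual_parser: Callable[[bytes], dict],
             judge_llm: Callable[[str, bytes], float]) -> float:
    """Route a generated image to the evaluation mode its task requires."""
    if task.evaluation_mode == "perception":
        # Perception-centric: automatically parse basic visual attributes
        # (e.g., rendered text, object counts) and score them against the
        # attributes the task expects.
        parsed: dict[str, Any] = visual_parser(generated_image)
        required = task.expected_attrs
        matched = sum(1 for key, value in required.items() if parsed.get(key) == value)
        return matched / max(len(required), 1)
    # Cognition-centric: delegate grading to a judge LLM with a
    # task-specific rubric, in the spirit of the LLM-as-a-judge paradigm.
    prompt = (f"Instruction: {task.instruction}\n"
              f"Criteria: {task.judging_criteria}\n"
              "Score from 0 to 10 how well the image follows the instruction.")
    return judge_llm(prompt, generated_image)
```

Passing the parser and judge in as callables keeps the scoring logic independent of any particular OCR or detection tool and of any particular judge model, which is one plausible way to support task-specific evaluation criteria.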
Results and Implications
The benchmark reveals notable differences in model performance. GPT-4o-Native demonstrates superior capabilities in both perception and cognition tasks, outperforming its contemporaries in key areas such as complex reasoning and the incorporation of broad world knowledge. Closed-source models such as Gemini-2.0 exhibit strong reasoning skills, second only to GPT-4o-Native, whereas open-source models lag notably, pointing to room for optimization given proprietary models' access to larger datasets and more advanced architectures.
Future Directions
The development and use of OmniGenBench reveal critical insights into the strengths and limitations of current LMMs, suggesting pathways for further research: enhancing reasoning capabilities, improving visual fidelity, and integrating broader contextual understanding. As benchmarks evolve, attention to diverse and complex real-world scenarios will be crucial to advancing the field of multimodal generation.
OmniGenBench provides a robust framework for systematically evaluating and refining generative models, giving researchers concrete guidance as they push the boundaries of AI-driven multimodal understanding and generation. Its comprehensive, structured assessment helps align model capabilities with complex, real-world tasks, supporting continued progress in the field.