- The paper introduces OmniGenBench, a comprehensive benchmark that evaluates Large Multimodal Models (LMMs) across 57 diverse sub-tasks spanning six domains, from world-knowledge-anchored and reasoning-driven generation to appearance compliance and dynamics consistency.
- OmniGenBench employs a dual evaluation methodology combining automated visual parsing for perception tasks with an LLM-as-a-judge system for cognition tasks, aligning assessment criteria closely with human judgment.
- Results highlight GPT-4o-Native's superior performance in both perception and cognition compared to other models, demonstrating the benchmark's utility in identifying current LMM capabilities and limitations.
OmniGenBench: Evaluating Instruction-Following in Large Multimodal Models
The paper "OmniGenBench: A Benchmark for Omnipotent Multimodal Generation across 50+ Tasks" presents a comprehensive benchmark for evaluating large multimodal models (LMMs). Rapid advancements in LMMs, notably exemplified by GPT-4o-Native, have emphasized the need for benchmarks that can rigorously assess these models across a variety of tasks, taking into account both perception-centric and cognition-centric capabilities.
Overview
OmniGenBench introduces an expansive framework designed to evaluate LMMs across 57 diverse sub-tasks, systematically categorized into six primary domains (an illustrative sketch of this taxonomy follows the list):
- World Knowledge Anchored Generation: Tasks in this domain assess the model’s ability to generate images grounded in complex world knowledge, including societal roles, events, symbols, and recognized expressions.
- Situational Reasoning Generation: This category evaluates models on reasoning within contextual scenarios, such as deducing the outcomes or causes of particular situations.
- Spatial Reasoning Generation: Focused on understanding and manipulating spatial relationships, this domain challenges the models with tasks requiring 3D and 2D spatial reasoning.
- STEM-Driven Reasoning Generation: Drawing on scientific, technological, engineering, and mathematical concepts, this category covers visualizing underlying principles and producing complex diagrams.
- Appearance Compliance Generation: Tasks involve rendering accurate visual attributes and keeping visuals consistent with the given descriptions, including text rendering and controlled object representation.
- Dynamics Consistency Generation: Models are evaluated on their ability to maintain visual coherence in dynamically evolving contexts, such as narrative generation and image editing.
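To make the taxonomy concrete, here is a minimal sketch of how the benchmark's tasks might be represented in code. The six domain names come from the paper; everything else, including the `SubTask` fields, the example names, and the rubric and attribute fields, is a hypothetical assumption for illustration rather than the paper's actual data format.

```python
# Hypothetical representation of OmniGenBench's task taxonomy.
# Domain names are from the paper; all field names are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class SubTask:
    name: str                    # e.g. "landmark-composition" (made-up example)
    instruction: str             # the generation instruction given to the LMM
    evaluation_mode: str         # "perception" (visual parsing) or "cognition" (LLM judge)
    expected_attrs: dict = field(default_factory=dict)  # attributes an automated parser can verify
    judging_criteria: str = ""   # task-specific rubric for an LLM judge

@dataclass
class Domain:
    name: str
    subtasks: list[SubTask] = field(default_factory=list)

# The six domains; in the benchmark these contain 57 sub-tasks in total.
DOMAINS = [
    Domain("World Knowledge Anchored Generation"),
    Domain("Situational Reasoning Generation"),
    Domain("Spatial Reasoning Generation"),
    Domain("STEM-Driven Reasoning Generation"),
    Domain("Appearance Compliance Generation"),
    Domain("Dynamics Consistency Generation"),
]
```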
Methodology
OmniGenBench features a dual-mode evaluation protocol that combines automated systems with human-aligned judgments. Perception-centric tasks are scored efficiently with automated visual parsing tools that check basic visual attributes, while cognition-centric tasks use an LLM-as-a-judge paradigm whose protocols tailor evaluation criteria to each task. This approach keeps the assessment closely aligned with human judgment across a wide range of scenarios.
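To illustrate how such a dual-mode protocol could be wired together, the sketch below dispatches a generated image either to an automated attribute check or to a judge LLM. It is a minimal illustration under the hypothetical `SubTask` structure above, not the paper's implementation: the `visual_parser` and `judge_llm` callables stand in for whatever parsing tools and judge model the benchmark actually uses.

```python
from typing import Any, Callable

def evaluate(task,                                   # a SubTask from the sketch above
             generated_image: bytes,
             visual_parser: Callable[[bytes], dict],
             judge_llm: Callable[[str, bytes], float]) -> float:
    """Route a generated image to the evaluation mode its task requires."""
    if task.evaluation_mode == "perception":
        # Perception-centric: automatically parse basic visual attributes
        # (e.g., rendered text, object counts) and score them against the
        # attributes the task expects.
        parsed: dict[str, Any] = visual_parser(generated_image)
        required = task.expected_attrs
        matched = sum(1 for key, value in required.items() if parsed.get(key) == value)
        return matched / max(len(required), 1)
    # Cognition-centric: delegate grading to a judge LLM with a
    # task-specific rubric, in the spirit of the LLM-as-a-judge paradigm.
    prompt = (f"Instruction: {task.instruction}\n"
              f"Criteria: {task.judging_criteria}\n"
              "Score from 0 to 10 how well the image follows the instruction.")
    return judge_llm(prompt, generated_image)
```

Passing the parser and judge in as callables keeps the scoring logic independent of any particular OCR or detection tool and of any particular judge model, which is one plausible way to support task-specific evaluation criteria.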
Results and Implications
The benchmark reveals notable differences in model performance. GPT-4o-Native demonstrates superior capabilities in both perception and cognition tasks, outperforming its contemporaries in key areas such as complex reasoning and the incorporation of broad world knowledge. Closed-source models such as Gemini-2.0 exhibit strong reasoning skills, second only to GPT-4o-Native, whereas open-source models lag notably, pointing to room for optimization given proprietary models' access to larger datasets and more advanced architectures.
Future Directions
The development and use of OmniGenBench reveal critical insights into the strengths and limitations of current LMMs, suggesting pathways for further research: enhancing reasoning capabilities, improving visual fidelity, and integrating broader contextual understanding. As benchmarks evolve, attention to diverse and complex real-world scenarios will be crucial to advancing the field of multimodal generation.
OmniGenBench provides a robust framework for systematically evaluating and refining generative models, giving researchers concrete guidance as they push the boundaries of AI-driven multimodal understanding and generation. Its comprehensive, structured assessment helps align model capabilities with complex, real-world tasks, supporting continued progress in the field.