This paper introduces MME-Unify (MME-U), a comprehensive benchmark designed to evaluate Unified Multimodal LLMs (U-MLLMs), models that integrate both understanding and generation capabilities (Xie et al., 4 Apr 2025). The benchmark addresses the lack of a standardized evaluation framework for these models, particularly for their distinctive "mixed-modality generation" or "unified" capabilities, in which understanding and generation work together, such as drawing auxiliary lines to solve a geometry problem or explaining an image edit.
MME-U evaluates models across three core domains:
- Multimodal Understanding: Assesses comprehension across different visual input types.
- Subtasks: Single-Image Perception and Understanding (SIPU), Multi-Image Interleaved Text-Image Understanding (MITIU), and Video Perception and Understanding (VPU).
- Data: Curated 1,900 samples from 5 existing benchmarks (e.g., MME, MMBench, Video-MME) covering diverse tasks like OCR, spatial perception, attribute reasoning, and video action reasoning.
- Implementation: All tasks are standardized into multiple-choice question-answering (QA) pairs. For models with input limitations, the first image/frame is used, or videos are represented by 6 sampled keyframes.
- Evaluation: Accuracy is measured using rule-based answer matching after randomly shuffling answer options to mitigate positional bias. The Understanding Score (US) is the average accuracy across the three subtasks: $\text{US} = \tfrac{1}{3}(\text{Acc}_{\text{SIPU}} + \text{Acc}_{\text{MITIU}} + \text{Acc}_{\text{VPU}})$.
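As a concrete illustration of this protocol, here is a minimal Python sketch of option shuffling, rule-based answer extraction, and the US average. The function names and the regex heuristic are illustrative assumptions, not the paper's exact implementation.

```python
import random
import re
import string

def shuffle_options(options: list[str], answer_idx: int, seed: int = 0) -> tuple[list[str], str]:
    """Randomly reorder the answer options (to mitigate positional bias) and
    return the shuffled options plus the gold option letter."""
    rng = random.Random(seed)
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    gold_letter = string.ascii_uppercase[order.index(answer_idx)]
    return shuffled, gold_letter

def extract_choice(response: str, num_options: int) -> str | None:
    """Rule-based matching: take the first standalone option letter in the response."""
    letters = string.ascii_uppercase[:num_options]
    match = re.search(rf"\b([{letters}])\b", response)
    return match.group(1) if match else None

def understanding_score(subtask_accuracy: dict[str, float]) -> float:
    """US = mean accuracy (in percent) over the three understanding subtasks."""
    return sum(subtask_accuracy[t] for t in ("SIPU", "MITIU", "VPU")) / 3
```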
- Multimodal Generation: Evaluates the quality and instruction adherence of generated multimodal content.
- Subtasks: Fine-grained Image Reconstruction (FIR), Text-guided Image Editing (TIE), Text-to-Image Generation (TIG), Conditional Image-to-Video Generation (CIVG), Text-guided Video Generation (TVG), and Video Prediction (VP).
- Data: Samples gathered from datasets like COCO, Emu-Edit, MSR-VTT, ImageNet, and Pexel Videos (at least 200 samples per task).
- Implementation: An "Attribute Unification Pipeline" standardizes input attributes (e.g.,
Text Prompt
,Src Image
,Video
). Task-specific system prompts are engineered to guide model generation based on standardized inputs. Evaluation: Uses standard domain-specific metrics (e.g., LPIPS, CLIP-I, CLIP-T, FVD, FID). Crucially, all metrics are standardized to a (0, 100) scale where higher is better. For example, FVD/FID scores () are normalized: . The Generation Score (GS) is the average of the standardized scores across the six subtasks:
(Specific formulas for subtask scores combining normalized metrics are provided in Appendix B).
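Appendix B gives the exact per-subtask formulas; the sketch below only illustrates the general idea of mapping every metric onto a common (0, 100), higher-is-better scale and averaging. The min-max bounds and function names are assumptions for illustration, not the paper's normalization.

```python
def normalize_higher_better(value: float, lo: float, hi: float) -> float:
    """Rescale a higher-is-better metric (e.g., CLIP-I, CLIP-T) onto (0, 100)."""
    return 100.0 * (value - lo) / (hi - lo)

def normalize_lower_better(value: float, lo: float, hi: float) -> float:
    """Invert a lower-is-better metric (e.g., FID, FVD, LPIPS) so that a smaller
    raw value maps to a larger score on (0, 100)."""
    return 100.0 * (hi - value) / (hi - lo)

def generation_score(standardized: dict[str, float]) -> float:
    """GS = mean of the standardized scores over the six generation subtasks."""
    tasks = ("FIR", "TIE", "TIG", "CIVG", "TVG", "VP")
    return sum(standardized[t] for t in tasks) / len(tasks)
```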
- Unify Capability: Assesses the model's ability to perform tasks requiring synergistic understanding and generation.
- Subtasks (Newly designed):
- Common Sense Question Answering (CSQ): Answer a riddle-like question and generate the corresponding image.
- Image Editing and Explaining (IEE): Understand complex edit instructions, explain them, and generate the edited image.
- SpotDiff (SD): Identify differences between two images, state the count, and generate an image highlighting the differences.
- Auxiliary Lines (AL): Solve a geometry problem by first generating a diagram with necessary auxiliary lines.
- Visual CoT (VCoT): Navigate a maze step-by-step, generating the action, coordinates, and resulting maze state image at each step.
- Data Construction: Each task involves manually constructed samples with specific instructions, text multiple-choice options, and image multiple-choice options (the correct image plus negative samples generated via methods like InstructPix2Pix or created manually; a hedged sketch of such negative-option generation appears after this list). The paper provides detailed construction procedures (Figure 5) and the exact system prompts used for each task (Appendix Figures 6-11), offering significant practical value for replication or extension.
- Evaluation: Combines text and image multiple-choice evaluation. Text answers are matched directly or via CLIP-T similarity. Image answers are evaluated by computing CLIP-I similarity between the generated image and each image option and selecting the highest-scoring option (a minimal sketch of this scoring is given below). Two metrics are reported:
- `acc`: the average of text accuracy and image accuracy. For VCoT, it is the average accuracy across action, coordinate, and image prediction per step.
- `acc+`: the fraction of samples for which both the text and image answers are correct. For VCoT, it is the percentage of mazes solved perfectly across all steps.
The Unify Score (Unify-S) is the average `acc` across the five subtasks: $\text{Unify-S} = \tfrac{1}{5}\sum_{t \in \{\text{CSQ, IEE, SD, AL, VCoT}\}}\text{acc}_t$.
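The following is a minimal sketch, using the Hugging Face transformers CLIP API, of how the image multiple-choice scoring and the `acc`/`acc+` aggregation described above could be implemented. The per-sample data layout (dictionary keys) is an assumption rather than the paper's format, and VCoT's per-step averaging is omitted for brevity.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_image_embeds(images: list[Image.Image]) -> torch.Tensor:
    """L2-normalized CLIP image embeddings."""
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

def pick_image_option(generated: Image.Image, options: list[Image.Image]) -> int:
    """Select the image option with the highest CLIP-I similarity to the generated image."""
    sims = clip_image_embeds([generated]) @ clip_image_embeds(options).T
    return int(sims.squeeze(0).argmax())

def unify_metrics(samples: list[dict]) -> tuple[float, float]:
    """acc = mean of text and image accuracy; acc+ = both answers correct (non-VCoT tasks)."""
    acc, acc_plus = [], []
    for s in samples:
        text_ok = s["pred_text_option"] == s["gt_text_option"]
        image_ok = pick_image_option(s["gen_image"], s["image_options"]) == s["gt_image_index"]
        acc.append((int(text_ok) + int(image_ok)) / 2)
        acc_plus.append(text_ok and image_ok)
    return 100 * sum(acc) / len(acc), 100 * sum(acc_plus) / len(acc_plus)
```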
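Returning to the data construction described above: negative image options can be produced with off-the-shelf instruction-based editors. Below is a hedged sketch using the public InstructPix2Pix checkpoint via diffusers; the distractor prompts and sampler settings are illustrative, not the paper's construction pipeline.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionInstructPix2PixPipeline

# Public InstructPix2Pix checkpoint; settings below are illustrative defaults.
pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

def make_negative_options(src_image: Image.Image, distractor_edits: list[str]) -> list[Image.Image]:
    """Generate distractor image options by applying edits that deliberately
    differ from the ground-truth instruction."""
    negatives = []
    for edit in distractor_edits:
        out = pipe(
            edit,
            image=src_image,
            num_inference_steps=20,
            guidance_scale=7.5,        # text guidance
            image_guidance_scale=1.5,  # fidelity to the source image
        ).images[0]
        negatives.append(out)
    return negatives
```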
Overall MME-U Score:
The final benchmark score is the average of the three domain scores: $\text{MME-U} = \tfrac{1}{3}(\text{US} + \text{GS} + \text{Unify-S})$.
Experiments and Findings:
The paper evaluates 22 models, including U-MLLMs (Janus-Pro, EMU3, MiniGPT-5, MIO-Instruct, Gemini2.0-flash-exp*) and specialized models (GPT-4o, Claude-3.5 Sonnet, DALL-E 3).
- Overall Performance: U-MLLMs show potential but are still in early stages (highest score ~45.57 by Gemini2.0-flash-exp). There's significant variance, and no single model excels across all dimensions.
- Understanding: A gap exists between open-source U-MLLMs (especially single-tokenizer ones like Emu3) and top closed-source models (Gemini) or specialized understanding models. Architectural choices (e.g., separate encoders in Janus) and large-scale data (MIO-Instruct) improve performance.
- Generation: The gap to specialized models (DALL-E 3) is smaller for tasks like TIG, with Gemini2.0-flash-exp even outperforming DALL-E 3. However, video generation and complex instruction following remain weak points for most U-MLLMs. Visual examples show issues like missing details specified in prompts (Figure 13).
- Unify Capability: This is the most challenging area. Performance is generally poor, especially on the `acc+` metric. Multi-step reasoning-and-generation tasks like VCoT prove extremely difficult, with no model successfully completing tasks that require multiple steps. Models struggle to generate images that align with their reasoning or the instructions (e.g., drawing correct auxiliary lines).
- Trade-offs: Models optimized for unified tasks sometimes lag in basic understanding/generation, and vice versa. Balancing these is a key challenge.
- Instruction Following: Models often fail to follow complex instructions (e.g., auxiliary lines, specific edits) or maintain consistent style (e.g., VCoT maze generation).
Practical Implications and Implementation:
- MME-U provides a standardized framework and dataset (4,104 QA pairs total) for rigorously evaluating and comparing U-MLLMs.
- The detailed data construction methods, evaluation protocols (including metric standardization and specific formulas), and provided system prompts (Appendix) offer practical guidance for researchers and developers implementing or evaluating these models.
- The findings highlight key weaknesses in current U-MLLMs: poor performance on unified tasks, challenges in complex instruction following for generation, difficulty with multi-step reasoning/generation, and the trade-off between basic and advanced capabilities. This directs future research towards improving multimodal integration, reasoning, and instruction adherence.
- The benchmark's structure allows for granular analysis across understanding, generation, and unified tasks, helping diagnose specific model weaknesses.
Limitations:
The authors note that the multiple-choice evaluation of unified image generation, being based on CLIP similarity, can potentially be "hacked" by models that generate stylistically poor but semantically similar images. Future work aims to incorporate direct MLLM- or CLIP-based scoring for stricter evaluation.