MMMU-Pro Benchmark Overview

Updated 20 March 2026

MMMU-Pro Benchmark is a rigorously constructed multimodal evaluation suite designed to assess integrated visual and textual reasoning across multiple disciplines.
It systematically filters out text-only answerable items and increases answer options from 4 to 10 to eliminate shallow statistical shortcuts.
Robustness tests under varied image corruptions highlight significant performance drops, emphasizing the need for advanced ensembling and geometric augmentation methods.

The MMMU-Pro Benchmark is a rigorously constructed multi-discipline, multimodal evaluation suite that targets higher-order reasoning by large vision-language models (VLMs). As a “hardened” successor to the original Massive Multi-discipline Multimodal Understanding (MMMU) benchmark, it closes well-documented loopholes in prior evaluations in three ways: it filters out text-only-answerable items, expands the multiple-choice space to 10 plausible options, and embeds both the question and the answer choices into a visual input, a scenario that requires truly integrated visual and textual comprehension. MMMU-Pro has become a reference benchmark for robust, fine-grained assessment of cross-disciplinary visual reasoning, especially under realistic input conditions and diverse corruption regimes.

1. Motivations and Design Rationale

The design of MMMU-Pro is motivated by two principal deficiencies in existing multimodal benchmarks. First, top models achieve "expert-level" scores (e.g., GPT-4o: 69.1% on MMMU validation), yet many original questions can be solved by LLMs using input text alone, exposing spurious statistical shortcuts. Second, with only four answer choices, elimination and shallow correlation often suffice for correct guesses, undermining the assessment of genuine multimodal reasoning. MMMU-Pro’s construction responds by:

Systematically removing all questions that leading text-only LLMs can answer without the image.
Expanding each question to 10 answer options, reducing the chance-level baseline from 25% to 10%, thereby demanding deep distractor elimination.
Embedding questions and options directly into visual inputs to enforce integrated "see + read" problem solving (Yue et al., 2024).
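To make the chance-level argument concrete, the short sketch below computes the expected accuracy of uniform guessing once some wrong options have been ruled out. The function name `guess_accuracy` and the scenario of eliminating two distractors are illustrative assumptions, not part of the MMMU-Pro paper.

```python
from fractions import Fraction


def guess_accuracy(num_options: int, num_eliminated: int = 0) -> Fraction:
    """Expected accuracy of uniform guessing among the options that remain
    after `num_eliminated` wrong options have been ruled out (hypothetical helper)."""
    remaining = num_options - num_eliminated
    if num_eliminated < 0 or remaining < 1:
        raise ValueError("the correct option must always remain")
    return Fraction(1, remaining)


# Compare the 4-option and 10-option formats, with and without
# two distractors eliminated by shallow cues.
for k in (4, 10):
    print(k, float(guess_accuracy(k)), float(guess_accuracy(k, 2)))
```

Under these assumptions, ruling out two distractors lifts a 4-option guess from 25% to 50%, whereas the same shortcut on a 10-option item only moves the guess from 10% to 12.5%.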

2. Construction Pipeline and Dataset Composition

The MMMU-Pro dataset is generated via a three-stage process:

Filtering Text-Only Answerable Questions: Four open-source LLMs (Llama3-70B, Qwen2-72B, Yi-1.5-34B, Mixtral-8×22B) are evaluated on each MMMU item in text-only mode for 10 trials each. Any item answered correctly by at least three of four models in >5/10 trials is excluded.
Distractor Augmentation: Surviving questions (randomly sampled to 1,800 items with subject balance, then 1,730 after image-coherence filtering) are expanded from 4 to 10 options. GPT-4o generates distractors, which are human-vetted for semantic/visual relevance and elimination of accidental correct choices.
Vision-Only Input Setting: Each question (stem + 10 options) is rendered into an image across varied backgrounds (whiteboard, notebook, tablet UI, etc.), layouts, and font faces, simulating naturalistic screenshot or photo input.
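The exclusion rule in the first stage can be sketched as follows. This is a minimal illustration of the stated thresholds (more than 5 of 10 trials correct, for at least three of the four models). The function names, the data layout, and the example trial outcomes are assumptions made for illustration and do not come from the benchmark's released code.

```python
from typing import Mapping, Sequence


def model_solves_item(trial_correct: Sequence[bool], trial_threshold: int = 5) -> bool:
    """A model counts as solving an item text-only when it is correct
    in MORE than `trial_threshold` of its trials (here: >5 of 10)."""
    return sum(trial_correct) > trial_threshold


def is_text_only_answerable(results: Mapping[str, Sequence[bool]], min_models: int = 3) -> bool:
    """An item is excluded when at least `min_models` models solve it without the image."""
    solved_by = sum(model_solves_item(trials) for trials in results.values())
    return solved_by >= min_models


# Hypothetical trial outcomes for one item (True = correct answer without the image).
item_results = {
    "Llama3-70B": [True] * 7 + [False] * 3,     # 7/10 -> solves
    "Qwen2-72B": [True] * 6 + [False] * 4,      # 6/10 -> solves
    "Yi-1.5-34B": [True] * 5 + [False] * 5,     # 5/10 -> does not solve (needs >5)
    "Mixtral-8x22B": [True] * 9 + [False] * 1,  # 9/10 -> solves
}
print(is_text_only_answerable(item_results))  # three of four models solve it
```

Note that a model with exactly 5 correct trials does not count toward exclusion, since the rule requires strictly more than 5.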

The final release comprises 3,460 items: 1,730 in standard text+image format and 1,730 "vision-only" embedded images, drawn from the six MMMU disciplines (Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, Tech & Engineering) and spanning 30 subjects (Yue et al., 2024).

3. Evaluation Protocols and Metrics

Evaluation focuses on accuracy, defined in the standard way over $N$ items with predicted answers $\hat y_i$ and ground-truth answers $y_i$:

$\mathrm{Accuracy} = \frac{1}{N} \sum_{i=1}^N \mathbb{1}[\hat y_i = y_i]$
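The formula translates directly into a few lines of code. The sketch below is illustrative only: the function name `accuracy` and the example answer letters are assumptions, and real evaluation harnesses additionally need to parse the chosen option out of free-form model output.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], labels: Sequence[str]) -> float:
    """Fraction of items whose predicted option equals the ground-truth option."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("at least one item is required")
    # The indicator 1[y_hat_i = y_i] becomes a boolean, which sums as 0 or 1.
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


# Hypothetical 10-option answers (letters A-J) for four items; three match.
print(accuracy(["A", "C", "J", "B"], ["A", "D", "J", "B"]))  # 0.75
```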

Additional metrics quantify the impact of each "hardening" step:

Δ₁: Accuracy drop from 4-option to 10-option standard format.
Δ₂: Drop from standard-10 to vision-only (embedded question) format.
Δ₃: Total drop from original MMMU validation to MMMU-Pro.

Example: For GPT-4o, standard-4 accuracy is 64.7% and standard-10 accuracy is 54.0% (Δ₁ = –10.7 pp); accuracy in the embedded vision-only setting is lower still. All models lose between 8.7 and 15.4 percentage points when forced to choose among 10 high-quality distractors. OCR prompting has a negligible effect (±1%), suggesting that mere text extraction is not the limiting step, while Chain-of-Thought (CoT) reasoning substantially boosts accuracy, particularly for models with strong instruction-following capability (Yue et al., 2024).
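As a worked check of the delta definitions, the sketch below recomputes Δ₁ and Δ₃ for GPT-4o from the figures quoted in this article. The helper name `delta_pp` and the dictionary keys are hypothetical. Δ₂ is omitted because no vision-only accuracy is quoted here.

```python
def delta_pp(after: float, before: float) -> float:
    """Change in accuracy, in percentage points, rounded to one decimal place."""
    return round(after - before, 1)


# GPT-4o accuracies (%) as quoted in this article.
gpt4o = {
    "standard_4": 64.7,       # MMMU-Pro items, 4 options
    "standard_10": 54.0,      # MMMU-Pro items, 10 options
    "mmmu_validation": 69.1,  # original MMMU validation set
    "mmmu_pro": 51.9,         # overall MMMU-Pro accuracy
}

delta_1 = delta_pp(gpt4o["standard_10"], gpt4o["standard_4"])
delta_3 = delta_pp(gpt4o["mmmu_pro"], gpt4o["mmmu_validation"])
print(delta_1, delta_3)  # -10.7 -17.2
```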

4. Baseline Model Performance and Analysis

Performance on MMMU-Pro is consistently lower than on earlier MMMU versions. Reported results include:

Model	MMMU-Pro Accuracy (%)	MMMU (Validation) (%)	Δ₃ (pp)
GPT-4o	51.9	69.1	–17.2
VILA 1.5 (40B)	22.9	~50	–27.1
Claude 3.5	48.2	65.0	–16.8
Qwen2-VL-7B/72B	~37–48	n/a	n/a

These results indicate that MMMU-Pro's design blunts shallow elimination strategies: the added distractors induce frequent confusion among conceptually similar but incorrect options. Integrated vision-text reasoning remains a bottleneck, especially under vision-only inputs, where models must contend with text layout, background noise, and font variation.

Fusion-based ensemble methods, such as V3Fusion-MLP, have shown additional gains. Tekin et al. (13 Mar 2026) report that V3Fusion-MLP + Rectify achieves 49.3% on MMMU-Pro, compared to 47.0% for the best individual baseline (Qwen2.5-VL-7B), by leveraging focal error and CKA-based visual diversity across contributing models.

5. Robustness to Visual Perturbations

VLM-RobustBench (Saxena et al., 6 Mar 2026) employs MMMU-Pro to stress-test models across 49 image corruptions (blur, noise, geometric, color, resolution, occlusion) at multiple severities. Key findings:

Spatial and geometric corruptions (e.g., elastic_transform, upsampling, glass_blur) yield the largest performance drops—sometimes exceeding 34 percentage points—even when visual distortion appears subjectively mild.
Model families display unique vulnerability profiles: some have severe-failure rates up to 27%.
Visual Gain (VG; clean vs. no-image baseline) is small (7–17pp), indicating that semantics alone do not suffice—true visual reasoning is essential.
mRCE (mean Relative Corruption Error) varies: Molmo2-8B (robust: 1.0%), Gemma-3-12B (fragile: 24.2%).
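The two summary metrics above can be sketched in code. Visual Gain follows the description given here (clean accuracy minus no-image accuracy). The source does not spell out the mRCE formula, so the version below ASSUMES that each corruption's relative error is the accuracy lost relative to clean accuracy, averaged over corruptions; the definition used by VLM-RobustBench may differ. All accuracies in the example are made up for illustration.

```python
from statistics import mean
from typing import Mapping


def visual_gain(clean_acc: float, no_image_acc: float) -> float:
    """Accuracy gained (pp) from seeing the clean image versus no image at all."""
    return clean_acc - no_image_acc


def mean_relative_corruption_error(clean_acc: float, corrupted_accs: Mapping[str, float]) -> float:
    """ASSUMED definition: mean over corruptions of (clean - corrupted) / clean."""
    if clean_acc <= 0:
        raise ValueError("clean accuracy must be positive")
    if not corrupted_accs:
        raise ValueError("at least one corruption is required")
    return mean((clean_acc - acc) / clean_acc for acc in corrupted_accs.values())


# Hypothetical accuracies (%) for one model; not taken from any paper.
clean, no_image = 50.0, 38.0
corrupted = {"glass_blur": 40.0, "gaussian_noise": 45.0, "elastic_transform": 35.0}

print(visual_gain(clean, no_image))
print(mean_relative_corruption_error(clean, corrupted))
```

With these made-up numbers the visual gain is 12 pp and the assumed mRCE is about 0.2, i.e. an average relative loss of roughly 20%.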

A plausible implication is that label shifts due to spatial perturbations pose a greater threat to reasoning-centric VLMs than classic photometric or noise-based attacks, and robust architectures must prioritize geometric augmentation and spatial invariance (Saxena et al., 6 Mar 2026).

6. Applications: Ensembling and Unified Multimodal Evaluation

MMMU-Pro serves as a testbed for advanced model ensembling and unified testing. Vision Verification Enhanced Fusion (V3Fusion) fuses VLM outputs via a learned MLP that takes per-choice model probabilities, prunes via focal diversity (CKA/error correlation), and applies epistemic-uncertainty-based rectification. This approach mitigates both majority-vote failure modes and hallucinations, achieving gains even when component VLMs independently mispredict.
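To illustrate why per-choice probabilities carry more information than hard votes, the sketch below contrasts majority voting with plain probability averaging. This is a deliberately simplified stand-in and is not V3Fusion: it has no learned MLP, no diversity-based pruning, and no rectification step. The function names and the probability values are hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(prob_rows: Sequence[Sequence[float]]) -> int:
    """Each model votes for its top option; the most common vote wins."""
    votes = [max(range(len(row)), key=row.__getitem__) for row in prob_rows]
    return Counter(votes).most_common(1)[0][0]


def average_fusion(prob_rows: Sequence[Sequence[float]]) -> int:
    """Average per-choice probabilities across models, then take the argmax."""
    num_options = len(prob_rows[0])
    mean_probs: List[float] = [
        sum(row[i] for row in prob_rows) / len(prob_rows) for i in range(num_options)
    ]
    return max(range(num_options), key=mean_probs.__getitem__)


# Hypothetical per-choice probabilities from three models over three options.
# Option 0 is taken to be correct: one model is confident in it,
# while two models narrowly prefer option 1.
probs = [
    [0.90, 0.05, 0.05],
    [0.40, 0.45, 0.15],
    [0.35, 0.50, 0.15],
]
print(majority_vote(probs))   # 1: two weakly confident models outvote the third
print(average_fusion(probs))  # 0: the confident model's probability mass dominates
```

A learned fusion model can go further than averaging by weighting models unequally, which is where a trained MLP over the same probability inputs would come in.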

Additionally, MMMU-Pro forms the reasoning-oriented half of the Uni-MMMU benchmark suite (Zou et al., 15 Oct 2025), systematically coupling visual understanding and generation across eight domains. By enforcing logical interdependence between visual synthesis and analytical reasoning, these unified protocols measure true multimodal synergy beyond unidirectional evaluation.

7. Research Significance and Future Directions

The MMMU-Pro benchmark establishes a sharper evaluation frontier for VLMs and multimodal LLMs, exposing brittle shortcutting and robustly measuring integrated vision–language reasoning. Its use in robustness testing, cross-model fusion, and unified bidirectional assessment highlights several open directions:

Development of models that natively integrate OCR, layout analysis, and cross-modal CoT at higher fidelity under naturalistic conditions.
Training or fine-tuning with explicit geometric and occlusion augmentations to bridge the spatial-sensitivity gap.
Transparent reporting of visual reliance (VG), error stratification under corruption, and prompt sensitivity deltas as part of benchmarking protocol.

Broad adoption of MMMU-Pro is recommended for leaderboard evaluation and ablation in model development, with implications for real-world settings where inputs are messy, ambiguous, or heavily perturbed (Yue et al., 2024, Saxena et al., 6 Mar 2026, Tekin et al., 13 Mar 2026).