
Multimodal Reasoning Benchmarks

Updated 1 March 2026
  • Multimodal reasoning benchmarks are evaluation platforms that use diverse, multi-step tasks to measure models’ abilities in logical, spatial, numerical, and cross-modal reasoning.
  • They employ dynamic dataset construction, human-vetted annotations, and explicit chain-of-thought metrics to prevent shortcutting and ensure authentic reasoning.
  • Empirical findings reveal significant performance gaps between models and humans, prompting the need for more advanced, structured, and agentic reasoning systems.

Multimodal reasoning benchmarks are systematic evaluation platforms designed to rigorously probe the capacity of models—including Multimodal LLMs (MLLMs), Vision-LLMs (VLMs), and agent systems—to reason across multiple input modalities (typically combinations of text, images, video, and sometimes audio). Unlike perception-oriented or single-modality tests, these benchmarks prioritize multi-hop, cross-modal inference skills such as deductive, spatial, numerical, symbolic, and temporal reasoning, often under real-world, open-ended, or multi-step conditions. The following sections distill the technical foundations, dataset construction strategies, evaluation protocols, and impact of key benchmarks in the field, as documented in recent arXiv publications.

1. Benchmark Objectives and Taxonomy

Modern multimodal reasoning benchmarks are motivated by gaps in prior evaluations that focus primarily on perception, surface-level matching, or unimodal tasks. They aim to probe multi-hop, cross-modal inference, resist shortcutting and memorization, and assess the quality of intermediate reasoning chains rather than only final answers.

Benchmarks may be classified into distinct evaluation paradigms:

| Input Modalities | Output Modalities | Output Structure | Example Benchmarks |
|---|---|---|---|
| Image/Text | Text | Single/Stepwise | PolyMATH, LogicVista |
| Image/Text | Image+Text | Chain-of-Thought | RBench-V, Uni-MMMU |
| Video/Text/Audio | Text/Steps | Multi-step Trace | VideoMathQA |
| Image/Text | Document Retrieval | Ranked List | MR²-Bench, MRMR |

The trend is toward increasing modality richness (audio, video), more structured, stepwise outputs, and evaluation of inference-chain quality rather than final-answer correctness alone.

2. Dataset Construction and Annotation Protocols

Benchmark construction leverages diverse, often human-authored or expert-vetted sources across domains:

  • VideoMathQA (Rasheed et al., 5 Jun 2025): 420 video–question pairs from YouTube lectures and screen-recordings, covering ten mathematical domains. Each question is annotated with multi-step reasoning traces by graduate experts (~4–10 steps/question, totaling 920 man-hours).
  • RBench-V (Guo et al., 22 May 2025) and Uni-MMMU (Zou et al., 15 Oct 2025): Require models to produce multimodal outputs (drawings, image edits, auxiliary lines, maze paths) during completion of math, physics, and puzzle-solving tasks.
  • MDK12-Bench (Zhou et al., 8 Apr 2025): 141,320 K-12 exam problems across six disciplines, with 63,463 multimodal items (often with diverse images), labeled with a six-level knowledge structure and key-point tags supporting fine-grained concept analysis.
  • MMReason (Yao et al., 30 Jun 2025): Merges and filters items from prior multimodal reasoning sources, with a multi-model voting filter to exclude guessable or memorized instances. Each final item is annotated with detailed, minimal-step solution paths.
  • R1-Onevision-Bench (Yang et al., 13 Mar 2025): Constructs a formal image-to-text mapping (f_{img→text}) for each input, then annotates Chain-of-Thought solutions tightly grounded in visual semantics.

Annotation protocols emphasize:

  • Multi-stage expert review, with separate rounds for question crafting, distractor authoring, stepwise annotation, and refinement (as in VideoMathQA, PolyMATH).
  • Explicit marking of reasoning types (direct problem-solving, conceptual transfer, deep comprehension).
  • Human or LLM-in-the-loop verification of multimodal dependency, unambiguity, and step quality (as in MMReason, RPTS-Eval (Wang et al., 10 Nov 2025)).
  • Dynamic augmentation to resist data leakage by bootstrapping images and text variants at evaluation time (MDK12-Bench).
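The dynamic-augmentation idea can be illustrated with a minimal sketch. The actual MDK12-Bench bootstrapping also varies images and text phrasing; the function below (a hypothetical helper, not the benchmark's code) covers only the simplest leak-resistance step, reshuffling answer options at evaluation time so that memorized option letters no longer help:

```python
import random

def bootstrap_mcq(question, options, answer_idx, rng):
    """Illustrative evaluation-time augmentation: permute the answer options
    so a model that memorized 'the answer is C' gains no advantage.
    `answer_idx` is the index of the correct option in the original list."""
    perm = list(range(len(options)))
    rng.shuffle(perm)
    new_options = [options[i] for i in perm]
    # New position of the originally correct option after permutation.
    new_answer_idx = perm.index(answer_idx)
    return question, new_options, new_answer_idx
```

A fresh `rng` seed per evaluation run yields a different variant of every item, which is the property that makes static memorization ineffective.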

3. Evaluation Protocols and Metrics

Multimodal reasoning benchmarks adopt complex, multi-dimensional evaluation frameworks:

Standard Accuracy and Structural Metrics

  • Binary/MCQ Accuracy: $\mathrm{Accuracy} = \frac{1}{M}\sum_{i=1}^{M} \mathbf{1}\{\hat y_i = y_i\}$, standard across MCQ-style tasks (VideoMathQA).
  • Category/domain-specific breakdown: to stress model robustness outside "familiar" areas (MDK12-Bench).
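These two standard metrics can be sketched in a few lines (a minimal illustration, not any benchmark's official scorer):

```python
from collections import defaultdict

def mcq_accuracy(preds, golds):
    """Accuracy = (1/M) * sum over i of 1{pred_i == gold_i}."""
    assert len(preds) == len(golds)
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def accuracy_by_category(preds, golds, categories):
    """Per-domain breakdown, used to stress robustness outside 'familiar' areas."""
    buckets = defaultdict(list)
    for p, g, c in zip(preds, golds, categories):
        buckets[c].append(p == g)
    return {c: sum(hits) / len(hits) for c, hits in buckets.items()}
```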

Stepwise and Chain-of-Thought Metrics

  • Step alignment in CoT: $\mathrm{Score} = \mathrm{round}\left(\frac{\#\,\text{matched steps}}{N} \times 10\right)$ (VideoMathQA).
  • Ternary per-step rubric: 1 (correct), 0.5 (unverifiable), 0 (incorrect) as in MMReason.
  • Reasoning Process Tree Score (RPTS): $\mathrm{RPTS} = \frac{\sum_{i=1}^{N} w_i\, s_i}{\sum_{i=1}^{N} w_i}$, where $w_i$ exponentially weights each inference node by its distance to the focus height (RPTS-Eval).
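The stepwise metrics above reduce to simple aggregations. The sketch below implements the VideoMathQA alignment score and the RPTS weighted mean; the choice of weights is left to the caller, since the exact exponential schedule is specific to RPTS-Eval:

```python
def step_alignment_score(matched_steps, total_steps):
    """VideoMathQA-style CoT score: round((matched / N) * 10), on a 0-10 scale."""
    return round(matched_steps / total_steps * 10)

def rpts(step_scores, weights):
    """Weighted mean of per-step scores s_i (e.g. 1 / 0.5 / 0 under the
    MMReason ternary rubric) with node weights w_i."""
    assert len(step_scores) == len(weights)
    return sum(w * s for w, s in zip(weights, step_scores)) / sum(weights)
```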

Process Quality and Meta-Evaluation

  • Relevance and consistency (MMLU-Reason): RTQ, RTA, and RSC scores, plus trace-length, overthinking, and error taxonomies.
  • Adaptive mode selection rationality (AdaptiveMMBench): Matthews Correlation Coefficient (MCC) on when models invoke tool-augmented vs. direct text reasoning, isolating meta-cognitive performance.
  • Key-step coverage: fraction of human-verified steps covered in the model's reasoning chain (AdaptiveMMBench).
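Both meta-evaluation quantities have standard closed forms; a minimal sketch, treating mode selection as a binary decision (tool-augmented vs. direct text reasoning):

```python
import math

def mcc(y_true, y_pred):
    """Matthews Correlation Coefficient over binary mode-selection decisions."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    tn = sum((not t) and (not p) for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def key_step_coverage(model_steps, key_steps):
    """Fraction of human-verified key steps appearing in the model's chain."""
    return sum(k in model_steps for k in key_steps) / len(key_steps)
```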

Multimodal Output Judging

  • LLM-as-a-Judge (often using GPT-4o): standard for open-ended chains, complex image outputs, and step-level scoring (RBench-V, MMReason, RPTS-Eval).
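In practice, LLM-as-a-Judge scoring amounts to assembling a grading prompt and parsing the judge's structured reply. The template below is purely illustrative (no benchmark publishes this exact wording); it pairs the ternary per-step rubric above with a judge such as GPT-4o:

```python
def build_step_judge_prompt(question, reference_steps, model_trace):
    """Assemble a hypothetical judge prompt asking an LLM grader to score each
    reference step as 1 (matched), 0.5 (unverifiable), or 0 (incorrect)."""
    ref = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(reference_steps))
    return (
        "You are grading a model's reasoning trace against reference steps.\n"
        f"Question: {question}\n"
        f"Reference steps:\n{ref}\n"
        f"Model trace:\n{model_trace}\n"
        "For each reference step, output 1 if the trace matches it, 0.5 if it "
        "is unverifiable, and 0 if it is incorrect. Return a JSON list."
    )
```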

Retrieval-Specific Metrics

  • Recall@k, nDCG@10 (MR²-Bench, MRMR): focus on ranking relevant visual-textual documents given complex, reasoning-centric queries.
  • Contradiction Retrieval Hit@1 (MRMR): detection of conflicting rule/document.
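The two ranking metrics can be sketched directly from their definitions (graded relevance for nDCG, binary relevance for recall; again an illustration, not a benchmark's official scorer):

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant documents retrieved within the top-k ranks."""
    return len(set(ranked_ids[:k]) & set(relevant_ids)) / len(relevant_ids)

def ndcg_at_k(ranked_ids, relevance, k=10):
    """nDCG@k; `relevance` maps doc id -> gain (0 for unlisted documents)."""
    dcg = sum(relevance.get(d, 0) / math.log2(i + 2)
              for i, d in enumerate(ranked_ids[:k]))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return 0.0 if idcg == 0 else dcg / idcg
```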

4. Empirical Findings and Model Performance

Experimental studies across recent benchmarks reveal persistent and significant performance gaps:

  • Human vs. Model Gap: Even the best foundation models fall short of human performance by large margins. For example, on RBench-V, humans obtain 82.3% accuracy vs. OpenAI o3's 25.8% (Guo et al., 22 May 2025); on PolyMATH, Claude-3.5 Sonnet achieves ~41% vs. human 66.6% (Gupta et al., 2024); MM-IQ reports 27.5% (best model) vs. 51.3% (human) (Cai et al., 2 Feb 2025).
  • Intermediate Reasoning Quality: Models frequently arrive at correct answers via flawed reasoning chains (RPTS-Eval: drop in filtered accuracy up to 30 points for open-source models (Wang et al., 10 Nov 2025); MMLU-Reason: high correct-answer rates coexist with reasoning inconsistencies and overthinking (Tie et al., 22 May 2025)).
  • Modal and Task Breakdown: Vision-indispensable and spatial/geometric tasks remain the weakest (e.g., "spatial" category ≤30% accuracy in MME-CC (Zhang et al., 5 Nov 2025)), while visual knowledge/recognition tasks yield higher scores, indicating over-reliance on perceptual cues rather than structured reasoning.
  • Benchmarks Resist Shortcuts: Dynamic augmentation (MDK12-Bench), removal of multiple-choice options (MMReason), and careful filtering eliminate shortcut opportunities, pushing models toward true multi-step, cross-modal reasoning.
  • Retrieval Benchmarks: Reasoning-focused retrieval settings see performance collapse compared to shallow matching: Seed1.6-Embedding achieves 9.9 Recall@1 on MR²-Bench vs. 77.8 on MMEB (Zhou et al., 30 Sep 2025); contradiction retrieval scores remain at random on MRMR (Zhang et al., 10 Oct 2025).

5. Limitations, Failure Modes, and Insights

Recognition of benchmark limitations and analysis of model error patterns are central themes:

  • Scale & Annotation Cost: Video-based and multi-step reasoning datasets (VideoMathQA: 2–2.5 hours/sample) are labor-intensive, limiting scalability (Rasheed et al., 5 Jun 2025).
  • Shortcut Hazards: Prior benchmarks are susceptible to memorization and answer-guessing unless careful multi-model filtering and dynamic augmentation are employed (MMReason (Yao et al., 30 Jun 2025), MDK12-Bench (Zhou et al., 8 Apr 2025)).
  • Reasoning Drift and Hallucination: Models frequently produce longer, meandering, or irrelevant chains (overthinking, logic drift), especially in open-ended settings (MMLU-Reason (Tie et al., 22 May 2025)).
  • Failure at Atomic Step Extraction: RPTS-Eval demonstrates that most open-source models struggle with the very first inference step from visual clues (Wang et al., 10 Nov 2025).
  • Modality Dependency: Textual cues often dominate: ablation studies show improved performance when diagrams are replaced by descriptions, indicating models do not fully internalize spatial/visual structure (PolyMATH (Gupta et al., 2024)).
  • Limited Transfer Across Languages: Faithful reasoning performance in Chinese noticeably lags English across benchmarks (Wang et al., 10 Nov 2025).

6. Current Directions and Recommendations for Benchmark Design

Recent work highlights several priorities for next-generation benchmarks: more scalable annotation pipelines, dynamic dataset refresh to resist leakage, process-level metrics that score intermediate reasoning rather than final answers alone, richer output modalities, and domain-specialized and interactive agentic settings.

7. Impact and Outlook

Multimodal reasoning benchmarks have decisively shifted the research frontier from perception and retrieval toward genuinely cognitive, chain-of-thought, and agentic reasoning. Their influence is visible in the adoption of trace-driven evaluation, multi-level knowledge annotations, and agent-benchmarks that stress tool-use and process fidelity. Nonetheless, large headroom remains before human-level reasoning is achieved across domains and modalities. Ongoing directions include the design of more scalable annotation strategies, richer process-level metrics, domain-specialized tests (finance, education, science), and interactive/real-time agent benchmarks (Rasheed et al., 5 Jun 2025, Guo et al., 22 May 2025, Wang et al., 10 Nov 2025).

References (19)