MMReason: Multimodal Reasoning Benchmark

Updated 15 June 2026

MMReason is a comprehensive, open-ended benchmark designed to evaluate long-chain, multi-step reasoning in multimodal large language models.
It enforces open-ended responses and employs multi-model voting filters to eliminate guessability and ensure true vision-language integration.
It provides detailed, step-by-step reference solutions with ternary scoring, enabling granular analysis of intermediate reasoning and overall model performance.

MMReason denotes both a specific multimodal long-chain reasoning benchmark, "MMReason: An Open-Ended Multi-Modal Multi-Step Reasoning Benchmark for MLLMs Toward AGI" (Yao et al., 30 Jun 2025), and the broader vision of robust multimodal reasoning evaluation and development for multimodal large language models (MLLMs). It targets open-ended, multi-step, multi-domain evaluation of the multimodal reasoning abilities needed for progress toward AGI. MMReason directly confronts longstanding deficiencies in benchmark design, particularly in the difficulty, diversity, and interpretability of reasoning challenges in vision-language AI.

1. Motivations and Conceptual Distinctions

MMReason arose in response to critical gaps in existing MLLM evaluation. Benchmarks such as MathVista, MMMU, and OlympiadBench focus on short, predominantly multiple-choice questions, which exhibit several pathologies:

Insufficient Difficulty and Disciplinary Breadth: Pre-existing benchmarks rarely require extended multi-modal reasoning chains, cover only a narrow range of disciplines, and underrepresent complex, university- or competition-level questions.
Guessability and Memorization: Multiple-choice formats enable correct answers via guessing or answer recall, not genuine reasoning. Memorized “leaks” and visually-irrelevant cues are prevalent.
Neglect of Intermediate Reasoning: Most evaluation targets only the final answer, excluding assessment of logical validity and completeness of intermediate steps.

MMReason provides a rigorous alternative by constructing a fully open-ended, multi-modal, and multi-step benchmark. Its primary objectives are as follows:

Enforce open-ended (non-multiple-choice) format to eliminate guessability.
Diversify content across six disciplines and four difficulty tiers, spanning pre-university, foundational, undergraduate, and competition-level challenges.
Filter out memorization and irrelevant shortcuts using a multi-model voting procedure.
Provide detailed, multi-step annotated reference solutions supporting granular intermediate step assessment.
Deploy reference-based, ternary scoring to capture not just correctness but also uncertainty in chain-of-thought reasoning.
Enable comprehensive, discipline-wise, and difficulty-aware analysis of leading MLLMs’ performance under tractable and reproducible conditions (Yao et al., 30 Jun 2025).

2. Benchmark Construction and Domain Scope

MMReason’s corpus is defined by careful problem selection and reformulation:

Disciplines: Mathematics, Business, Science, Engineering, Social Science, Health.
Difficulty Levels: Pre-University (e.g., high school exams), University (undergraduate), Foundational (core domain concepts), Competition (Olympiad/university contest).

Benchmark items are drawn from existing MCQ benchmarks (M³CoT, MMMU, MMStar) and newly curated, web-sourced problems. All questions are:

Converted to open-ended prompts: Only items with a unique answer (numerical, categorical, chemical, etc.) are retained; “which statement is correct”–style ambiguity is removed. Answer options are stripped, and each question is rewritten in a direct form such as “What is…?” or “How many…?”.
Subjected to strict filtering: Items that can be solved text-only (without visual input) are eliminated to ensure genuine vision-language integration.

The resulting benchmark totals 1,384 high-difficulty, step-annotated items post-filtering.
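The scope described above can be pictured as a small record type. The following is a minimal sketch, not the benchmark's actual schema: the class names, field names, and the example item are all hypothetical, and only the six disciplines and four difficulty levels come from the text.

```python
from dataclasses import dataclass
from enum import Enum


class Discipline(Enum):
    MATHEMATICS = "Mathematics"
    BUSINESS = "Business"
    SCIENCE = "Science"
    ENGINEERING = "Engineering"
    SOCIAL_SCIENCE = "Social Science"
    HEALTH = "Health"


class Difficulty(Enum):
    PRE_UNIVERSITY = "Pre-University"
    UNIVERSITY = "University"
    FOUNDATIONAL = "Foundational"
    COMPETITION = "Competition"


@dataclass(frozen=True)
class BenchmarkItem:
    """Hypothetical record layout; field names are illustrative only."""
    question: str                 # open-ended prompt with answer options stripped
    image_path: str               # visual input the question depends on
    answer: str                   # unique ground-truth answer
    discipline: Discipline
    difficulty: Difficulty
    reference_steps: tuple[str, ...] = ()  # step-annotated solution, if present


# Made-up example item, not taken from the dataset.
item = BenchmarkItem(
    question="How many triangles are shown in the figure?",
    image_path="figures/example.png",
    answer="5",
    discipline=Discipline.MATHEMATICS,
    difficulty=Difficulty.PRE_UNIVERSITY,
    reference_steps=("Count the small triangles.", "Add the enclosing triangle."),
)
```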

3. Filtering Protocol: Eliminating Guessability and Memorization

To ensure all tasks demand authentic multimodal reasoning, MMReason employs a multi-model voting filter. The procedure iteratively removes any item answerable correctly by any strong MLLM when images are omitted:

Input:
    Q = {q₁, q₂, …, q_M}   # each q_j = (text T_j, image V_j)
    Models = {π₁, π₂, …, π_K}
    Rounds = R
for r = 1 to R:
    CorrectCount(q_j) ← 0 for every q_j in Q
    for each model π_k in Models:
        for each q_j in Q:
            run π_k on T_j alone   # image V_j dropped
            if π_k answers q_j correctly:
                CorrectCount(q_j) ← CorrectCount(q_j) + 1
    for each q_j in Q:
        if CorrectCount(q_j) > 0:
            remove q_j from Q
Output: filtered Q

If any model can solve a question using only text, that question is deemed either memorized, trivial, or lacking visual relevance, and is discarded. Typically, two rounds are sufficient to reduce text-only accuracy to under 1%, raising visual relevance for remaining items to ~94–97%.
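The filtering loop can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: a "model" is any callable from question text to an answer string, correctness is checked by plain string equality, and the `Item` type, the two toy models, and the three questions are invented for the example.

```python
from typing import Callable, Iterable, NamedTuple, Optional, Sequence


class Item(NamedTuple):
    text: str
    image: Optional[str]  # placeholder for the visual input (never shown to models here)
    answer: str           # unique ground-truth answer


# Assumption: a text-only model is a callable from question text to an answer string.
TextOnlyModel = Callable[[str], str]


def filter_text_solvable(
    items: Iterable[Item], models: Sequence[TextOnlyModel], rounds: int = 2
) -> list[Item]:
    """Drop every item that any model answers correctly without the image."""
    remaining = list(items)
    for _ in range(rounds):
        solved = set()
        for model in models:
            for idx, item in enumerate(remaining):
                # Only the text is passed; the image is deliberately withheld.
                if model(item.text).strip() == item.answer:
                    solved.add(idx)
        remaining = [it for i, it in enumerate(remaining) if i not in solved]
    return remaining


# Toy stand-ins for strong MLLMs (hypothetical).
def always_four(text: str) -> str:
    return "4"


memorized = {"How many bars exceed 50?": "3"}


def lookup_model(text: str) -> str:
    return memorized.get(text, "")


items = [
    Item("What is 2 + 2?", None, "4"),                               # trivial
    Item("What is the area of the shaded region?", "fig1.png", "12"),  # needs the image
    Item("How many bars exceed 50?", "fig2.png", "3"),               # "memorized"
]
kept = filter_text_solvable(items, [always_four, lookup_model])
```

Here only the shaded-region question survives. With these deterministic toy models the second round removes nothing; repeated rounds matter only when model outputs vary between runs.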

4. Step-by-Step Reference Solutions and Ternary Scoring

To provide a more nuanced evaluation of intermediate reasoning, MMReason annotates discrete solution steps for selected items. Each solution is partitioned into $N$ coherent steps. Model-generated chains-of-thought are then assessed using a reference-based ternary scoring scheme:

Correct: Score = 1.0
Unverifiable: Score = 0.5
Incorrect: Score = 0.0

The mean intermediate-step score is defined as

$S_\mathrm{inter} = \frac{1}{N}\sum_{n=1}^N \mathrm{Score}(s_n)$
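As a worked instance of the formula, the sketch below averages ternary labels for a hypothetical four-step solution. The label strings and the function name are assumptions made for illustration; only the three score values come from the scheme above.

```python
# Ternary scores from the scheme above; the label strings are assumed names.
SCORES = {"correct": 1.0, "unverifiable": 0.5, "incorrect": 0.0}


def intermediate_score(labels):
    """Mean ternary score over the N graded steps of one solution."""
    if not labels:
        raise ValueError("at least one graded step is required")
    return sum(SCORES[label] for label in labels) / len(labels)


# Hypothetical grading of a four-step solution (N = 4).
labels = ["correct", "correct", "unverifiable", "incorrect"]
s_inter = intermediate_score(labels)  # (1.0 + 1.0 + 0.5 + 0.0) / 4 = 0.625
```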

Models are also scored on final answer accuracy ($S_\mathrm{final}$), defined as the exact-match rate of the extracted answer versus a unique ground truth.

This evaluation is performed using a grading LLM (GPT-4o), which compares both step traces and final predictions against detailed reference solutions (Yao et al., 30 Jun 2025).
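The exact-match rate behind $S_\mathrm{final}$ can be sketched as below. This is a simplified stand-in, not the benchmark's grader: the source delegates comparison to a grading LLM, whereas this sketch assumes answers are already extracted and compares them as strings after collapsing whitespace. The function names and sample answers are hypothetical.

```python
def normalize(answer: str) -> str:
    # Assumed normalization: trim and collapse whitespace, keep case
    # (case can matter, e.g. in chemical formulas).
    return " ".join(answer.split())


def final_accuracy(predictions, references):
    """Fraction of extracted answers that exactly match the ground truth."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        raise ValueError("at least one item is required")
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)


# Made-up extracted answers and ground truths.
preds = ["12", " 3.5  m/s", "NaCl", "7"]
refs = ["12", "3.5 m/s", "NaCl", "8"]
s_final = final_accuracy(preds, refs)  # 3 of 4 match: 0.75
```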

5. Empirical Evaluation and Model Performance

MMReason benchmarks both closed- and open-source MLLMs, including:

Closed-source: GPT-4o-1120, Claude-3.7V Sonnet, Gemini-1.5 Pro
Open-source: DeepSeek-VL2, LLaVA-OneVision, Qwen-2.5-VL (7B/72B), MiniCPM-V-2.6, InternVL-2.5-MPO (8B/78B), LLaVA-CoT, Mulberry, LLaMA-3.2-Vision

Key performance statistics:

Model	$S_\mathrm{final}$ (%)	$S_\mathrm{inter}$ (%)
GPT-4o-1120	25.7	42.1
Claude-3.7V Sonnet	25.1	36.1
Gemini-1.5 Pro	24.9	33.3
Qwen-2.5-VL-72B	24.7	28.1
InternVL-2.5-78B	21.3	23.8

Further breakdown shows:

Mathematics achieves the highest per-discipline final accuracy (up to 43.1%), with engineering and competition-tier problems proving especially challenging (<10% for small open-source models).
Tasks newly scraped and reformulated for MMReason are demonstrably harder, with only 8–22% final accuracy.
Filtering reduced possible answer “leaks” by bringing text-only accuracy below 1%, strongly enforcing visual relevance.

Closed-source models outperform open-source variants in intermediate step assessment, indicating more coherent and robust chain-of-thought execution. Even the state-of-the-art, however, achieves only $\sim$26% final answer accuracy and $\sim$42% intermediate-step score; substantial capability gaps persist, particularly on long-chain and competition-level reasoning.
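The closed- versus open-source comparison can be checked directly against the five rows of the table above. The sketch below only averages those reported numbers; the grouping follows the model lists in this section, and the variable and function names are illustrative.

```python
from statistics import mean

# (group, S_final %, S_inter %) copied from the performance table.
results = {
    "GPT-4o-1120": ("closed", 25.7, 42.1),
    "Claude-3.7V Sonnet": ("closed", 25.1, 36.1),
    "Gemini-1.5 Pro": ("closed", 24.9, 33.3),
    "Qwen-2.5-VL-72B": ("open", 24.7, 28.1),
    "InternVL-2.5-78B": ("open", 21.3, 23.8),
}


def group_mean(group: str, column: int) -> float:
    """Average one score column (1 = S_final, 2 = S_inter) over a model group."""
    return mean(row[column] for row in results.values() if row[0] == group)


closed_inter = group_mean("closed", 2)  # about 37.2
open_inter = group_mean("open", 2)      # about 25.95
closed_final = group_mean("closed", 1)  # about 25.2
open_final = group_mean("open", 1)      # about 23.0
```

On these five rows the gap between the two groups is much wider for the intermediate-step score than for final accuracy, which is consistent with the observation about chain-of-thought quality.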

6. Contributions, Impact, and Future Directions

MMReason delivers several novel contributions to multimodal reasoning research:

Dataset: A 1,384-item, six-discipline, four-difficulty-tier corpus of open-ended, vision-grounded, long-chain reasoning problems.
Filtering Pipeline: A multi-model voting protocol that robustly eliminates memorization shortcuts and ensures truly multimodal reasoning.
Scoring Methodology: Reference-based ternary scoring of intermediate steps, enabling fine-grained error localization and reasoning quality analysis.
Empirical Analysis: Demonstrated that even leading MLLMs exhibit clear weaknesses in long-chain multimodal reasoning and remain far from AGI-level competency (Yao et al., 30 Jun 2025).

Potential impact is significant: MMReason's structure encourages research into more advanced multimodal chain-of-thought prompting, architectures integrating iterative visual reasoning and self-critique, and further anti-memorization filtering (e.g., dynamic data augmentation, adversarial screening). By diagnosing and quantifying reasoning failures at step granularity, MMReason sets a new bar for interpretability and progress assessment in multimodal AI.

Finally, MMReason is both a specialized benchmark and a paradigm for the broader "MMReason" agenda: a systematic, stepwise, and nuanced evaluation framework intended to drive multimodal reasoning research toward AGI-level capabilities.


References (1)

MMReason: An Open-Ended Multi-Modal Multi-Step Reasoning Benchmark for MLLMs Toward AGI (2025)
