MMFineReason: Open Multimodal Reasoning Benchmark

Updated 14 May 2026

MMFineReason is a large-scale, open benchmark designed to close performance gaps in multimodal reasoning by utilizing high-quality, visually grounded chain-of-thought annotations.
It employs a three-stage pipeline—data aggregation, rationale distillation, and difficulty-aware selection—to curate 1.77M multimodal problems from diverse STEM, science, and puzzle domains.
Models fine-tuned on MMFineReason perform strongly for their size: MFR-8B averages 75.7% across 16 benchmarks, close to the 77.9% of the four-times-larger Qwen3-VL-32B-Thinking, and generalizes robustly to unseen domains.

MMFineReason is a large-scale, open, data-centric benchmark and training corpus designed to close the multimodal reasoning gap between open-source vision-language models (VLMs) and proprietary models. It covers STEM diagrams, scientific reasoning, visual puzzles, and knowledge-intensive image questions, each annotated with a visually grounded, high-quality chain-of-thought (CoT) rationale generated by an advanced teacher model. MMFineReason has catalyzed a new generation of parameter-efficient multimodal LLMs (MLLMs) that show competitive or superior performance in knowledge-rich multimodal reasoning, with robust generalization to unseen domains (Lin et al., 29 Jan 2026).

1. Data-Centric Construction Pipeline

MMFineReason’s construction follows a three-stage, quality-focused pipeline:

Large-scale aggregation and standardization: The benchmark assembles 2.29 million raw samples from 24 public datasets (e.g., FineVision, Euclid30K, MMR1, BMMR, GameQA-140K). Processing includes standardization to English, cleaning (stripping URLs and garbage characters, removing indices), rewriting inputs to enforce reasoning-first prompts, and strict task-suitability filtering that discards samples that do not require analytical or visual reasoning. Unreadable images are removed, and all images are normalized to a canonical RGB format with a maximum dimension of 2048 px.
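The cleaning and image-size normalization described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the URL pattern, and the choice to preserve aspect ratio when downscaling are assumptions; only the 2048 px limit comes from the text.

```python
import re

# Hypothetical URL pattern; the source does not specify the exact cleaning rules.
URL_RE = re.compile(r"https?://\S+")
MAX_SIDE = 2048  # maximum image dimension stated in the text


def clean_text(text: str) -> str:
    """Strip URLs and collapse runs of whitespace (simplified stand-in for the cleaning step)."""
    no_urls = URL_RE.sub("", text)
    return " ".join(no_urls.split())


def target_size(width: int, height: int, max_side: int = MAX_SIDE) -> tuple[int, int]:
    """Return (width, height) scaled so the longer side is at most max_side.

    Aspect ratio is preserved (an assumption); images already within the
    limit are returned unchanged.
    """
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))
```

In a real pipeline the computed size would be passed to an image library for the actual resize and RGB conversion; that step is left out here to keep the sketch dependency-free.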
High-quality chain-of-thought rationale distillation: For each sample, Qwen3-VL-235B-A22B-Thinking generates a multi-phase, long-form rationale, explicitly segmented into (1) Information Extraction, (2) Problem Setup, (3) Solution Execution, and (4) Solution Validation. Each reasoning annotation is stored as <think>…</think> preceding <answer>…</answer> and is paired with a dense image caption. Only traces that meet stringent length (≥100-word rationale), template, and de-duplication requirements are retained.
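A template and length filter of the kind described above might look like the sketch below. The `<think>` tag name is an assumption (only the `<answer>` tag is shown explicitly in the text), and `parse_trace` is a hypothetical helper, not part of any released tooling.

```python
import re
from typing import Optional

# Assumed template: a reasoning block followed by an answer block.
# The <think> tag name is an assumption; <answer> is taken from the text.
TRACE_RE = re.compile(r"<think>(.*?)</think>\s*<answer>(.*?)</answer>", re.DOTALL)
MIN_WORDS = 100  # minimum rationale length stated in the text


def parse_trace(text: str) -> Optional[tuple[str, str]]:
    """Return (rationale, answer) if the trace fits the template and length rule, else None."""
    match = TRACE_RE.fullmatch(text.strip())
    if match is None:
        return None  # template violation
    rationale = match.group(1).strip()
    answer = match.group(2).strip()
    if len(rationale.split()) < MIN_WORDS or not answer:
        return None  # rationale too short or answer empty
    return rationale, answer
```

Whitespace-split word counting is a simplification; the source does not say how words are counted.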
Difficulty- and quality-aware selection: Each sample is attempted four times independently by Qwen3-VL-4B-Thinking, and its pass rate $PR_i$ is the fraction of those attempts that are correct. The MMFineReason “hard” subset comprises the 7% of instances with $PR=0$, which the 4B model never solves; the “moderate” subset covers the intermediate region $0<PR<1$. Consistency and correctness checks further prune the full corpus, with exact answer matches required for final selection.
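The pass-rate bucketing can be illustrated with a short sketch. The function names and the "easy" label for $PR=1$ are hypothetical; the text names only the "hard" and "moderate" subsets.

```python
def pass_rate(attempts: list[bool]) -> float:
    """Fraction of independent attempts that were correct (PR_i in the text)."""
    if not attempts:
        raise ValueError("at least one attempt is required")
    return sum(attempts) / len(attempts)


def difficulty_bucket(pr: float) -> str:
    """Map a pass rate to a difficulty label.

    "hard" (PR = 0) and "moderate" (0 < PR < 1) follow the text;
    "easy" for PR = 1 is a label introduced here for illustration.
    """
    if pr == 0.0:
        return "hard"
    if pr < 1.0:
        return "moderate"
    return "easy"
```

With four attempts per sample, the pass rate can only take the values 0, 0.25, 0.5, 0.75, and 1, so the exact comparison with 0 is safe.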

After filtering, MMFineReason contains 1,770,926 multimodal problems and 5.15 billion CoT tokens. All samples follow a canonical schema that includes metadata, standardized input/output, and Qwen3-VL-based annotation (Lin et al., 29 Jan 2026).

2. Benchmark Composition, Coverage, and Statistics

MMFineReason exhibits broad coverage and controlled structure:

Domain breakdown:

| Domain       | Share  | Samples   |
|--------------|--------|-----------|
| Mathematics  | 79.4%  | ~1.4M     |
| Science      | 13.8%  | 244K      |
| Puzzle/Game  | 4.6%   | 82K       |
| General/OCR  | 2.2%   | 39K       |

Image types:

75.3% of images are STEM/diagrammatic (e.g., geometry, plots, charts); the remainder are natural images.

Annotation coverage:

All samples have multistep, visually grounded CoT traces and machine-verifiable answer tags.

Subset selection (“less is more”):

Training on only the 123K hardest samples ($PR=0$, 7% of the total) recovers about 97% of the performance obtained with the full dataset on standard evaluation suites, which suggests that difficulty-based data selection is an effective alternative to brute-force scaling (Lin et al., 29 Jan 2026).

3. Modeling and Training Protocols

MMFineReason serves both as a benchmark and as training data for supervised and RL fine-tuning:

Base Models:

Qwen3-VL-Instruct at 2B/4B/8B parameters; models fine-tuned on MMFineReason are denoted as MFR-2B/4B/8B.

Training:
- Supervised fine-tuning: AdamW, $LR=10^{-5}$, batch size 32, up to 32,768 tokens per sequence, 3 epochs, cross-entropy loss over reasoning and answer targets.
- RL fine-tuning: Group Sequence Policy Optimization (GSPO), $LR=10^{-6}$, batch size 256, groups of 16 rollouts, KL-penalized expected reward maximization, with a separate reward computed for each CoT trace.
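
To make the group-based RL step more concrete, the sketch below shows two quantities commonly associated with GSPO-style training: group-normalized advantages over the rollouts for one prompt, and a length-normalized sequence-level importance ratio. The source gives no reward or loss details, so the function names, the use of population standard deviation, and the `eps` constant are assumptions; clipping and the KL penalty are omitted.

```python
import math
from statistics import mean, pstdev

GROUP_SIZE = 16  # rollouts per prompt, as stated in the text


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each rollout's reward against its own group (mean 0, unit scale).

    Population standard deviation and eps are illustrative choices.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def sequence_ratio(logp_new: list[float], logp_old: list[float]) -> float:
    """Length-normalized sequence-level importance ratio.

    Computes exp(mean over tokens of (log pi_new - log pi_old)), i.e. the
    geometric mean of the per-token probability ratios.
    """
    if len(logp_new) != len(logp_old) or not logp_new:
        raise ValueError("log-prob lists must be non-empty and of equal length")
    diff = sum(n - o for n, o in zip(logp_new, logp_old))
    return math.exp(diff / len(logp_new))
```

If every rollout in a group receives the same reward, all advantages are zero and the group contributes no gradient signal, which is one reason difficulty-aware sample selection matters for this kind of training.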

Image preprocessing:

A maximum resolution of $768^2$ pixels gives the best cost-performance trade-off, except in natural-image-heavy domains, where $2048^2$ may yield a marginal improvement (Lin et al., 29 Jan 2026).

4. Empirical Impact and Comparative Performance

MMFineReason-trained MLLMs deliver state-of-the-art results across a wide spectrum of multimodal reasoning evaluations:

Benchmark results (average scores across 16 tasks):

| Model                        | Score (%) |
|------------------------------|-----------|
| Qwen3-VL-8B-Thinking         | 72.5      |
| Qwen3-VL-30B-A3B-Thinking    | 74.5      |
| Qwen3-VL-32B-Thinking        | 77.9      |
| MFR-4B (MMFineReason-4B)     | 73.9      |
| MFR-8B (MMFineReason-8B)     | 75.7      |

MFR-4B surpasses Qwen3-VL-8B-Thinking. MFR-8B outperforms the much larger Qwen3-VL-30B-A3B-Thinking and approaches Qwen3-VL-32B-Thinking with only a quarter of its parameters.

Subset robustness:

Models trained on the 123K hardest questions (7%) reach an aggregate score of 73.3, against 75.7 for the full 1.77M set; both outperform models trained on other public datasets, even much larger ones (Lin et al., 29 Jan 2026).
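
As a quick check, the "about 97%" retention figure quoted for the hard subset follows from the two aggregate scores, and the 7% share follows from the sample counts reported earlier:

$$
\frac{73.3}{75.7} \approx 0.968, \qquad \frac{123{,}000}{1{,}770{,}926} \approx 0.069
$$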

Synergistic effect:

Inclusion of reasoning-rich data in MMFineReason consistently improves VQA and chart analysis in addition to structured mathematical reasoning.

5. Data Quality Controls, Ablations, and Limitations

Data filtering:

Structural filters (template, length, answer extraction), n-gram de-duplication, and consistency checks ensure annotation diversity and correctness.
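
One simple way to realize n-gram de-duplication is a greedy filter on Jaccard similarity between word n-gram sets, sketched below. The n-gram size, the 0.8 threshold, and the function names are assumptions; the source does not describe the exact procedure.

```python
def ngrams(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Set of lowercase word n-grams in the text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(texts: list[str], n: int = 3, threshold: float = 0.8) -> list[str]:
    """Keep a text only if it is below the similarity threshold to every text kept so far."""
    kept: list[str] = []
    kept_grams: list[set] = []
    for text in texts:
        grams = ngrams(text, n)
        if all(jaccard(grams, seen) < threshold for seen in kept_grams):
            kept.append(text)
            kept_grams.append(grams)
    return kept
```

This pairwise comparison is quadratic in the number of texts; at the scale of millions of samples, an approximate method such as MinHash would be the more practical choice.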

Ablations:

Caption augmentation yields marginal benefit with mature CoT models; increased input resolution is only beneficial for natural images.

Domain balance:

The dataset is intentionally math-heavy (79.4%), with puzzles/games underrepresented (4.6%), motivating future research into domain-mixing and curriculum strategies.

Limitations:

The dataset is English-only, with no explicit multilingual or cross-cultural coverage. Biases of the teacher model may propagate into the annotations, and non-STEM visual reasoning is underrepresented. Proposed extensions include multilingual annotation, broader image types, and domain-level balancing (Lin et al., 29 Jan 2026).

6. Relations to Adjacent Advancements

MMFineReason’s methodology connects to several concurrent lines of work:

RL-augmented multimodal agents:

MMRAG-RFT and DR-MMSearchAgent achieve step-wise explainability, retrieval, and CoT-based RL using similar structural templates and reward formulations (Zhao et al., 19 Dec 2025, Wang et al., 21 Apr 2026).

Programmatic reasoning and embedding:

MMEmb-R1 incorporates chain-of-thought into embedding via latent variable and pair-aware RL (Wang et al., 7 Apr 2026).

Fine-grained recognition and image generation:

Fine-R1 and Fine-grained Multimodal Reasoning architectures leverage similar multi-stage distillation and iterative feedback for visual compositionality and local correction (He et al., 7 Feb 2026, Kim et al., 15 Apr 2026).

Financial and scientific reasoning:

Benchmarks such as FinMMDocR and FinMMR extend MMFineReason’s core methodology to high-stakes domains, requiring evidence localization, implicit scenario modeling, and multi-step CPA-grade computation (Tang et al., 31 Dec 2025, Tang et al., 6 Aug 2025).

7. Outlook and Broader Significance

MMFineReason sets a new standard for open multimodal benchmarks by demonstrating that a rigorous, high-quality, CoT-annotated, and difficulty-aware data pipeline can yield models that rival far larger state-of-the-art MLLMs. Its parameter-efficient results and the observed "less is more" effect (the most difficult 7% of samples recovers nearly all of the performance) provide strong support for targeted, quality- and difficulty-driven data curation over indiscriminate dataset enlargement (Lin et al., 29 Jan 2026). The primary frontiers for further research are extension to new modalities (video, audio), more balanced domain representation, explicit cross-linguistic reasoning, and curriculum-based data composition.