OpenMMReasoner: Open-Source Multimodal Reasoning
- OpenMMReasoner is a fully open-source multimodal reasoning framework that uses a two-stage training paradigm combining supervised fine-tuning and reinforcement learning.
- It employs rigorous data curation, diverse domain mixing, and automated verification to generate robust chains of thought.
- Empirical evaluations show an 11.6% average improvement over the Qwen2.5-VL-7B-Instruct baseline, establishing a new state of the art in open-source multimodal reasoning.
OpenMMReasoner is a fully open-source framework for large multimodal reasoning that adopts a transparent, two-stage training recipe. Developed in response to the lack of reproducible data curation and scalable methodologies within visual and multimodal reasoning research, OpenMMReasoner combines supervised fine-tuning (SFT) using a rigorously validated 874K-sample “cold-start” dataset with subsequent reinforcement learning (RL) based on on-policy rollouts and composite reward shaping. This approach yields robust empirical gains, including an 11.6% average improvement over the Qwen2.5-VL-7B-Instruct baseline across nine prominent multimodal reasoning benchmarks, and sets a new state-of-the-art for open-source models in this domain. All code, data, and training pipelines are openly released to facilitate transparent research at scale (Zhang et al., 20 Nov 2025).
1. Two-Stage Training Paradigm
OpenMMReasoner employs a two-phase learning strategy to endow large multimodal models (LMMs) with general reasoning capacity:
- Stage I: Supervised Fine-Tuning (SFT). An 874K-sample dataset of image–question–answer triples with chain-of-thought (CoT) reasoning steps is used to train from a Qwen2.5-VL-7B-Instruct checkpoint. Each reasoning trace must follow a unified CoT format and is verified through automated answer checking and LLM-based judging. This phase establishes broad, stepwise reasoning competency.
- Stage II: Reinforcement Learning (RL). A curated 74K-sample RL dataset spanning science, mathematics, diagrams, chart comprehension, and puzzles provides diverse multimodal questions with ground-truth answers. On-policy rollouts are optimized using Group Sequence Policy Optimization (GSPO) with a composite reward that combines answer accuracy and format compliance; a minimal illustrative sketch of such a reward appears below.
Training continues for 1,232 RL updates, at which point the validation reward empirically saturates (Zhang et al., 20 Nov 2025).
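The paper describes the RL reward only as a composite of answer accuracy and format compliance. The following is a minimal, hedged sketch of such a reward, assuming a rule-based answer check and a simple tag-based format check; the tag template, the helper functions, and the fmt_weight value are illustrative assumptions rather than the released implementation.

```python
# Minimal sketch of a composite accuracy + format reward. The <think>/<answer>
# tag template and the 0.1 format weight are illustrative assumptions.
import re

def format_reward(response: str) -> float:
    """1.0 if the trace follows the assumed <think>...</think><answer>...</answer> layout, else 0.0."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.fullmatch(pattern, response.strip(), flags=re.DOTALL) else 0.0

def accuracy_reward(response: str, ground_truth: str) -> float:
    """1.0 if the extracted final answer matches the ground truth under simple string normalization."""
    match = re.search(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    predicted = match.group(1).strip().lower() if match else ""
    return 1.0 if predicted == ground_truth.strip().lower() else 0.0

def composite_reward(response: str, ground_truth: str, fmt_weight: float = 0.1) -> float:
    """Accuracy-dominated reward with a small bonus for format compliance."""
    return accuracy_reward(response, ground_truth) + fmt_weight * format_reward(response)
```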
2. SFT Data Construction and Curation
The SFT “cold-start” dataset is constructed through the following processes:
- Data Sourcing:
103K raw samples are drawn from public benchmarks (LLaVA-CoT, OpenVLThinker, We-Math2.0).
- Distillation and Scaling:
Chains of thought are distilled from strong teacher LMMs (Qwen2.5-VL-72B-Instruct and Qwen3-VL-235B-Instruct), and each generated trace is validated for final-answer correctness and coherence using both rule-based checks and LLM-as-judge evaluation. To maximize answer diversity, multiple validated traces per question are retained, yielding 583K verified traces (a minimal sketch of this validation loop appears after this list).
- Domain Mixing:
120K samples are appended from math-focused multimodal corpora (MMR1: image math; MiroMind-M1: text math) for comprehensive domain coverage, bringing the total to 874K.
- Filtering Policies:
Experiments showed that post-distillation length and difficulty filtering hurt benchmark performance; thus, the final dataset forgoes such filters, prioritizing answer diversity (Zhang et al., 20 Nov 2025).
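A minimal sketch of the distillation-time validation loop described above, assuming hypothetical helper callables (generate_trace, extract_final_answer, llm_judge) standing in for the teacher-model and judging calls; the candidate count per question is illustrative, not the paper's setting.

```python
# Sketch of rule-based answer checking with an LLM-as-judge fallback, and of
# retaining every validated trace per question for answer diversity.
from typing import Callable, Dict, List

def validate_trace(trace: str, ground_truth: str,
                   extract_final_answer: Callable[[str], str],
                   llm_judge: Callable[[str, str], bool]) -> bool:
    """Accept a trace if the rule-based check or the LLM judge confirms the final answer."""
    predicted = extract_final_answer(trace)
    if predicted.strip().lower() == ground_truth.strip().lower():
        return True                            # cheap rule-based match
    return llm_judge(trace, ground_truth)      # fall back to LLM-as-judge

def collect_traces(question: Dict[str, str],
                   generate_trace: Callable[[Dict[str, str]], str],
                   extract_final_answer: Callable[[str], str],
                   llm_judge: Callable[[str, str], bool],
                   n_candidates: int = 8) -> List[str]:
    """Keep every candidate trace that passes validation; n_candidates is illustrative."""
    kept: List[str] = []
    for _ in range(n_candidates):
        trace = generate_trace(question)
        if validate_trace(trace, question["answer"], extract_final_answer, llm_judge):
            kept.append(trace)
    return kept
```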
3. Reinforcement Learning Strategy and GSPO
Key elements of OpenMMReasoner’s RL stage:
- RL Dataset:
74K curated samples from MMEureka, ViRL, TQA, We-Math, PuzzleVQA, AlgoPuzzleVQA, and ThinkLiteVL are extracted, with rigorous answer de-duplication.
- Policy Optimization:
Several on-policy algorithms were compared: GRPO (critic-free), DAPO (decoupled clip and dynamic sampling), and GSPO. Empirical results demonstrate GSPO's superiority in stability and final reward. GSPO's clipped, sequence-level policy-gradient objective can be written as

$$\mathcal{J}_{\mathrm{GSPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\min\Big(s_i(\theta)\,\hat{A}_i,\ \mathrm{clip}\big(s_i(\theta),\,1-\varepsilon,\,1+\varepsilon\big)\,\hat{A}_i\Big)\right],$$

where $s_i(\theta) = \big(\pi_\theta(y_i \mid x)\,/\,\pi_{\theta_{\mathrm{old}}}(y_i \mid x)\big)^{1/|y_i|}$ is the length-normalized sequence-level importance ratio and $\hat{A}_i$ is the normalized group advantage (Zhang et al., 20 Nov 2025). A minimal code sketch of this objective appears after this list.
- Sampling and Stability:
A larger rollout count per RL update meaningfully improves training stability over smaller rollout counts. No curriculum sampling is used during RL; mixed sampling yields the most robust results. An overlength penalty (adopted from DAPO) is included to discourage verbose traces and preserve token efficiency.
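A minimal sketch of the GSPO group objective above, together with a DAPO-style soft overlength penalty, in NumPy form. Tensor shapes, the clipping range, and the overlength thresholds are illustrative assumptions rather than the released training code.

```python
# Sketch of the clipped, sequence-level GSPO objective for one rollout group,
# plus a DAPO-style soft overlength penalty. Hyperparameter values are illustrative.
import numpy as np

def gspo_group_loss(logp_new: np.ndarray,   # shape (G,): summed token log-probs under pi_theta
                    logp_old: np.ndarray,   # shape (G,): summed token log-probs under pi_theta_old
                    lengths: np.ndarray,    # shape (G,): response lengths |y_i|
                    rewards: np.ndarray,    # shape (G,): composite rewards r(x, y_i)
                    eps: float = 0.2) -> float:
    """Negative clipped GSPO objective for a single group of G rollouts of one prompt."""
    # Normalized group advantage: (r_i - mean) / std within the group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Length-normalized, sequence-level importance ratio s_i(theta).
    ratio = np.exp((logp_new - logp_old) / lengths)
    # Clipped surrogate, averaged over the group; the optimizer minimizes the negative.
    surrogate = np.minimum(ratio * adv, np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv)
    return -float(surrogate.mean())

def overlength_penalty(length: int, soft_max: int, hard_max: int) -> float:
    """Soft overlength penalty in the spirit of DAPO: zero up to soft_max,
    decreasing linearly to -1 at hard_max (thresholds are illustrative)."""
    if length <= soft_max:
        return 0.0
    if length >= hard_max:
        return -1.0
    return -(length - soft_max) / (hard_max - soft_max)
```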
4. Model Architecture and Implementation
OpenMMReasoner is implemented atop the Qwen2.5-VL-7B-Instruct backbone:
- Vision Backbone:
No changes are made to the core transformer or visual encoder architecture; improvements are achieved solely through expanded data curation and specialized training.
- Online Packing:
Training employs online example packing via Liger-Kernel to improve token utilization during SFT; a minimal packing sketch appears after this list.
- RL Policy Module:
RL optimization is implemented as a dedicated codepath built around GSPO, with support for batched rollouts and sequence-level reward assignment.
- Verification Pipeline:
Both SFT and RL data construction utilize a two-stage verification loop (automated rule-based and LLM-judge verdicts) to ensure data consistency and to filter reasoning traces.
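Online packing in the released pipeline is handled by Liger-Kernel and the training framework; the following is a framework-agnostic sketch of the underlying idea of greedily packing tokenized samples into sequences up to a maximum length (the truncation choice is an illustrative assumption).

```python
# Minimal sketch of greedy (first-fit) example packing: concatenate tokenized
# samples into bins of at most max_len tokens to reduce padding waste.
from typing import List

def pack_examples(token_lists: List[List[int]], max_len: int) -> List[List[int]]:
    """First-fit greedy packing of token sequences into bins of at most max_len tokens."""
    bins: List[List[int]] = []
    for tokens in token_lists:
        if len(tokens) > max_len:
            tokens = tokens[:max_len]          # truncate overlong samples (illustrative choice)
        for b in bins:
            if len(b) + len(tokens) <= max_len:
                b.extend(tokens)
                break
        else:
            bins.append(list(tokens))
    return bins
```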
5. Hyperparameters and Training Schedules
Key hyperparameters for training are shown below:
| Component | Optimizer | Scheduler | LR | Weight Decay | Steps | Warmup | Max Len | Packing | Liger Kernel |
|---|---|---|---|---|---|---|---|---|---|
| SFT | AdamW | Cosine | — | 0.0 | 4300 | 430 | 61,440 | on | yes |
| RL | AdamW | Constant | — | 0.1 | 1232 | 25 | 32,792 | on | no |
SFT proceeds until validation accuracy plateaus (about 4,300 steps). RL continues for 1,232 steps, after which the reward saturates (Zhang et al., 20 Nov 2025).
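For reference, a minimal sketch of these settings as a configuration object; the field names are assumptions, and the learning rates, which are not listed in this summary, are left unset.

```python
# Illustrative configuration summary of the reported SFT and RL hyperparameters.
from dataclasses import dataclass
from typing import Optional

@dataclass
class StageConfig:
    optimizer: str
    scheduler: str
    lr: Optional[float]      # not listed in this summary
    weight_decay: float
    steps: int
    warmup_steps: int
    max_seq_len: int
    packing: bool
    liger_kernel: bool

SFT_CONFIG = StageConfig("AdamW", "cosine", None, 0.0, 4300, 430, 61_440, True, True)
RL_CONFIG = StageConfig("AdamW", "constant", None, 0.1, 1232, 25, 32_792, True, False)
```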
6. Empirical Evaluation and Benchmarks
OpenMMReasoner is benchmarked on nine challenging reasoning datasets (14 sub-splits in total): MathVista, MathVision, MathVerse, DynaMath, WeMath, LogicVista, MMMU, MMMU-Pro, and CharXiv. Major findings:
- Qwen2.5-VL-7B-Instruct baseline: 67.3 (average across benchmarks)
- SFT-only (“ColdStart”): 76.1 (+8.8 over baseline)
- +RL (GSPO): 79.5 (+11.6 over baseline)
Ablation studies further attribute the improvement to teacher-model strength (+1.2 for the stronger teacher), answer-diversity scaling (+4.7 from retaining more traces per question), domain mixing (+1.1 for including math data), RL algorithm selection (GSPO > GRPO > DAPO), and rollout count (a larger rollout count provides +2.7 with GSPO) (Zhang et al., 20 Nov 2025).
7. Methodological Insights and Limitations
Empirical analysis highlights several core principles:
- Answer diversity is as significant as question diversity for generalization.
- Stronger teachers in distillation yield higher-quality, more effective CoT data.
- Avoiding over-filtering preserves chain-of-thought robustness.
- RL-based policy optimization reliably sharpens and stabilizes SFT-learned reasoning abilities.
Current limitations include the focus on the Qwen2.5-VL model line and on images as the sole input modality. Extending the recipe to video, audio, and multimodal fusion, as well as to larger model architectures and more extensive RL compute, is identified as a primary direction for future research (Zhang et al., 20 Nov 2025).
OpenMMReasoner delivers a robust, reproducible pipeline for advancing open-source multimodal reasoning. Its design underscores the impact of principled large-scale data curation and methodologically sound optimization, providing a competitive and extensible platform for future developments in the field (Zhang et al., 20 Nov 2025).