Upper bound of OpenMMReasoner performance under further scaling

Determine the upper bound of achievable multimodal reasoning performance when further scaling the OpenMMReasoner training recipe that unifies supervised fine-tuning (SFT) and reinforcement learning (RL). Specifically, ascertain how far the performance of models initialized from Qwen2.5‑VL‑Instruct and trained with the OpenMMReasoner SFT (874k samples) and RL (74k samples) pipelines can be pushed as data volume, answer-trace diversity, and RL training scale increase.

Background

OpenMMReasoner presents a transparent two-stage training recipe, consisting of an SFT stage trained on an 874k-sample dataset followed by an RL stage trained on a 74k-sample dataset, for building multimodal reasoning models from Qwen2.5‑VL‑Instruct. The empirical study shows that scaling answer-trace diversity, using strong teacher models, and applying GSPO-based RL yield notable improvements across nine benchmarks.
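
At a high level, GSPO replaces per-token importance ratios with a sequence-level, length-normalized ratio combined with group-relative advantages and PPO-style clipping. The PyTorch sketch below is a minimal, hedged illustration of that objective; the function name, tensor layout, and clipping constant are illustrative assumptions, not the paper's actual training code.

```python
import torch

def gspo_surrogate_loss(logp_new, logp_old, token_mask, rewards, eps=0.2):
    """Sketch of a GSPO-style sequence-level surrogate for one prompt group.

    logp_new, logp_old: (G, T) per-token log-probs of G sampled responses
    under the current and behavior policies (padded to length T).
    token_mask: (G, T), 1.0 for real response tokens, 0.0 for padding.
    rewards: (G,) scalar verifiable rewards, one per response.
    """
    # Group-relative advantage: z-score each reward within its group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Sequence-level, length-normalized importance ratio:
    # s_i = exp((1 / |y_i|) * sum_t [log pi_new - log pi_old]).
    lengths = token_mask.sum(dim=-1).clamp(min=1.0)
    log_ratio = ((logp_new - logp_old) * token_mask).sum(dim=-1) / lengths
    ratio = log_ratio.exp()

    # Clip the whole-sequence ratio, PPO-style, and maximize the surrogate.
    surrogate = torch.min(ratio * adv, ratio.clamp(1 - eps, 1 + eps) * adv)
    return -surrogate.mean()
```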

Despite these gains, the authors acknowledge that the ultimate limits of performance under further scaling are not established. Identifying the upper bound would clarify where performance saturates or returns diminish as data diversity, data volume, and RL rollout budgets increase, and would guide future resource allocation and design choices.
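
One way to make the upper bound operational is to fit a saturating scaling curve to benchmark scores measured at increasing data budgets and inspect the fitted asymptote. The SciPy sketch below illustrates this; the data points, the functional form score(N) = a - b * N^(-c), and the extrapolation target are hypothetical placeholders, not results from OpenMMReasoner.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical (data size, benchmark score) pairs from repeated training runs
# at growing SFT budgets; placeholders, not numbers reported in the paper.
data_sizes = np.array([50e3, 100e3, 200e3, 400e3, 874e3])
scores = np.array([48.2, 52.1, 55.0, 57.3, 58.9])

def saturating_power_law(n, a, b, c):
    """score(n) = a - b * n**(-c); the fitted 'a' is the performance ceiling."""
    return a - b * np.power(n, -c)

params, _ = curve_fit(
    saturating_power_law, data_sizes, scores,
    p0=[60.0, 1000.0, 0.5],
    bounds=([0.0, 0.0, 0.0], [100.0, 1e6, 2.0]),
)
a, b, c = params
print(f"Fitted asymptote (estimated ceiling under this fit): {a:.1f}")
print(f"Extrapolated score at 2M samples: {saturating_power_law(2e6, a, b, c):.1f}")
```

An analogous fit over answer-trace diversity or RL rollout budget, rather than raw sample count, would indicate whether the current recipe is approaching saturation along each axis or whether further scaling remains worthwhile.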

References

Additionally, although we explore scaling strategies in both SFT and RL stages, we have not yet identified the upper bound of model performance under further scaling, leaving open the question of how far the current recipe can be pushed.

OpenMMReasoner: Pushing the Frontiers for Multimodal Reasoning with an Open and General Recipe (2511.16334 - Zhang et al., 20 Nov 2025) in Limitation and Future Work