Advancing Multimodal Reasoning: From Optimized Cold Start to Staged Reinforcement Learning (2506.04207v1)

Published 4 Jun 2025 in cs.LG, cs.AI, cs.CL, and cs.CV

Abstract: Inspired by the remarkable reasoning capabilities of Deepseek-R1 in complex textual tasks, many works attempt to incentivize similar capabilities in Multimodal LLMs (MLLMs) by directly applying reinforcement learning (RL). However, they still struggle to activate complex reasoning. In this paper, rather than examining multimodal RL in isolation, we delve into current training pipelines and identify three crucial phenomena: 1) Effective cold start initialization is critical for enhancing MLLM reasoning. Intriguingly, we find that initializing with carefully selected text data alone can lead to performance surpassing many recent multimodal reasoning models, even before multimodal RL. 2) Standard GRPO applied to multimodal RL suffers from gradient stagnation, which degrades training stability and performance. 3) Subsequent text-only RL training, following the multimodal RL phase, further enhances multimodal reasoning. This staged training approach effectively balances perceptual grounding and cognitive reasoning development. By incorporating the above insights and addressing multimodal RL issues, we introduce ReVisual-R1, achieving a new state-of-the-art among open-source 7B MLLMs on challenging benchmarks including MathVerse, MathVision, WeMath, LogicVista, DynaMath, and challenging AIME2024 and AIME2025.

Summary

  • The paper introduces ReVisual-R1, trained with a staged curriculum that moves from a textual cold start to multimodal RL and then text-only RL to enhance reasoning capabilities.
  • It employs advanced techniques like Prioritized Advantage Distillation and an Efficient-Length Reward function to stabilize learning and produce concise outputs.
  • Extensive benchmark evaluations show that ReVisual-R1 outperforms previous open-source models, improving average scores by +16.8 percentage points.

This paper, "Advancing Multimodal Reasoning: From Optimized Cold Start to Staged Reinforcement Learning" (2506.04207), addresses the challenge of effectively training Multimodal LLMs (MLLMs) to achieve complex multimodal reasoning capabilities. Existing methods often struggle because they directly apply reinforcement learning (RL) techniques from text-only domains without fully accounting for the unique aspects of multimodal learning. The authors identify three key phenomena: the critical importance of effective cold start initialization, the issue of gradient stagnation when applying standard GRPO (Group Relative Policy Optimization) to multimodal RL, and the surprising benefit of a subsequent text-only RL phase for refining multimodal reasoning.

Based on these findings, the paper introduces ReVisual-R1, an open-source 7B MLLM trained with a novel three-stage curriculum and enhanced RL algorithms. The core idea is to systematically build reasoning capabilities, first through strong textual foundations, then by grounding them in visual perception, and finally by refining abstract reasoning and linguistic fluency.

The training curriculum, termed Staged Reinforcement Optimization (SRO), consists of:

  1. Textual Cold Start: The initial phase uses carefully selected, high-difficulty text data to instill complex reasoning templates and foundational reflective capabilities. A preliminary study revealed that cold-starting with textual data alone significantly improved performance on both textual and multimodal reasoning tasks compared to using existing multimodal cold-start datasets, which often lack sufficient complexity. This highlights the importance of high-quality, challenging data for the initial training phase.
  2. Multimodal Reinforcement Learning (MRL): Following the cold start, this stage focuses on connecting linguistic reasoning with visual perception using multimodal data. The authors employ GRPO as the core RL algorithm but introduce two key enhancements to address challenges in multimodal settings:
    • Prioritized Advantage Distillation (PAD): To combat "gradient stagnation," where near-zero advantage estimates (especially under sparse rewards) stall learning, PAD filters out samples with negligible advantage and prioritizes sampling from the remaining "effective set" according to absolute advantage magnitude, using a temperature-controlled softmax distribution. This focuses training updates on the most informative samples, improving stability and efficiency. The algorithm (detailed in the appendix) computes the per-sequence absolute advantage, retains samples whose magnitude falls within a specified range $[T_{\text{low}}, T_{\text{high}}]$ with $T_{\text{low}} > 0$, and then sub-samples from this effective set with weighted probabilities (see the sketch after this list).
    • Efficient-Length Reward Function: To prevent overly long or suboptimal response generation, this auxiliary reward component encourages concise outputs without truncating necessary reasoning steps. It penalizes deviations from a target length budget $L_{\text{budget}}$ using a clipped linear function: $R_{\text{len}}(L_y, L_{\text{budget}}, \alpha, \delta) = \max\bigl(0.0,\ \min\bigl(1.0,\ \alpha\,(L_{\text{budget}} - L_y) + \delta\bigr)\bigr)$, where $L_y$ is the response length. This provides a continuous signal to guide the model towards efficient responses.
  3. Textual RL Refinement (TRL): The final stage uses text-only RL training to restore linguistic fluency and further enhance abstract reasoning, mitigating the "textual capability decay" that can occur after intensive MRL. This phase refines reasoning expression and linguistic nuances while preserving the multimodal grounding learned in the previous stage. GRPO augmented with PAD is also used here, with a reward function promoting linguistic excellence and conciseness.
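
The sketch below illustrates how the two MRL mechanisms described above could be implemented. The group-normalized advantage computation, function names, and default hyperparameters (`t_low`, `t_high`, `k`, `temperature`, `alpha`, `delta`) are illustrative assumptions, not the authors' released code.

```python
# Illustrative sketch of PAD sampling and the efficient-length reward.
# Names and default hyperparameters are assumptions, not the paper's code.
import torch
import torch.nn.functional as F

def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """GRPO-style advantages: normalize rewards within a group of rollouts
    sampled for the same prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def pad_select(advantages: torch.Tensor, t_low: float = 0.05, t_high: float = 5.0,
               k: int = 64, temperature: float = 1.0) -> torch.Tensor:
    """Prioritized Advantage Distillation: keep samples with |A| in [t_low, t_high]
    (t_low > 0 drops near-zero-advantage samples), then sub-sample k of them with
    probabilities from a temperature-controlled softmax over |A|."""
    abs_adv = advantages.abs()
    effective = ((abs_adv >= t_low) & (abs_adv <= t_high)).nonzero(as_tuple=True)[0]
    if effective.numel() == 0:
        return effective  # no informative samples this step
    weights = F.softmax(abs_adv[effective] / temperature, dim=0)
    k = min(k, effective.numel())
    chosen = torch.multinomial(weights, k, replacement=False)
    return effective[chosen]  # indices used for the policy update

def efficient_length_reward(length: int, budget: int,
                            alpha: float = 1e-3, delta: float = 0.5) -> float:
    """Clipped linear length reward: clip(alpha * (budget - length) + delta, 0, 1)."""
    return max(0.0, min(1.0, alpha * (budget - length) + delta))
```

In a GRPO step, one would compute group-relative advantages per prompt, apply `pad_select` across the batch to pick the samples that drive the update, and add `efficient_length_reward` to the verifiable-correctness reward.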

The authors curated a new dataset called GRAMMAR for training, combining diverse open-source reasoning data, filtering for verifiability, pruning based on difficulty, and balancing samples across topics and difficulty levels using embedding clustering. The training dataset for ReVisual-R1 includes ~40k text entries for cold start, ~26k multimodal entries for MRL, and ~30k text entries for TRL.
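
As a rough illustration of the balancing step, assuming an off-the-shelf embedding model and k-means clustering (the authors' exact tooling and quotas are not specified here), the sampling could look like:

```python
# Hypothetical illustration of topic/difficulty balancing via embedding
# clustering; cluster count and per-cluster quota are assumptions.
import numpy as np
from sklearn.cluster import KMeans

def balance_by_cluster(embeddings: np.ndarray, n_clusters: int = 32,
                       per_cluster: int = 1000, seed: int = 0) -> np.ndarray:
    """Return indices of a subsample in which each embedding cluster
    contributes at most `per_cluster` examples."""
    labels = KMeans(n_clusters=n_clusters, random_state=seed,
                    n_init="auto").fit_predict(embeddings)
    rng = np.random.default_rng(seed)
    keep = []
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        rng.shuffle(idx)
        keep.append(idx[:per_cluster])
    return np.concatenate(keep)
```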

ReVisual-R1 is built upon the Qwen2.5-VL-7B-Instruct model. Training was conducted using LLaMA Factory for cold start and Easy R1 for the RL stages. The vision tower is frozen during the TRL stage. The model was trained on 8 NVIDIA A100-80G GPUs.
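
For reference, freezing the vision tower during the TRL stage could be done along these lines in the Hugging Face ecosystem; the model class and the `visual` parameter-name prefix for Qwen2.5-VL checkpoints are assumptions to verify against the model actually loaded:

```python
# Minimal sketch: exclude vision-tower parameters from text-only RL updates.
# The "visual" name prefix is an assumption based on common Qwen2.5-VL
# checkpoints; verify it against the loaded model before relying on it.
from transformers import Qwen2_5_VLForConditionalGeneration

model = Qwen2_5_VLForConditionalGeneration.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
frozen = 0
for name, param in model.named_parameters():
    if "visual" in name:            # vision encoder / projector parameters
        param.requires_grad = False  # keep them fixed during the TRL stage
        frozen += param.numel()
print(f"Froze {frozen} vision-tower parameters")
```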

Evaluation on a comprehensive suite of benchmarks, including MathVerse, MathVision, MathVista, DynaMath, WeMath, LogicVista (multimodal reasoning), and AIME24/25, GPQA, MATH500 (textual reasoning), demonstrates the effectiveness of the approach. ReVisual-R1 achieves state-of-the-art performance among open-source 7B MLLMs on most challenging benchmarks, often surpassing larger or proprietary models on specific reasoning tasks like AIME and MATH500. It obtained an average score of 53.1% across these benchmarks, significantly outperforming the previous open-source SOTA average by +16.8 percentage points. It also shows competitive performance on general multimodal benchmarks like MMMU and MM-Vet.

Ablation studies validate the proposed SRO framework and the specific algorithmic enhancements. The CS + MRL + TRL sequence is shown to be superior to alternative configurations (CS + MRL, CS + TRL, or CS + TRL + MRL) on multimodal reasoning tasks, confirming the benefit of the specific staging. Ablations on PAD demonstrate that both effective-sample filtering and prioritized sub-sampling are crucial for improved performance and faster, more stable convergence compared to baseline GRPO variants. The Efficient-Length Reward is also shown to be vital for training stability, preventing accuracy degradation and controlling response verbosity and model entropy.

In summary, ReVisual-R1 provides a practical implementation of a structured training approach for MLLMs that effectively cultivates complex multimodal reasoning. By optimizing the cold start with challenging text data, stabilizing multimodal RL with PAD and the efficient-length reward, and refining capabilities with subsequent textual RL, the paper demonstrates that principled curriculum design and targeted algorithmic innovations can unlock advanced reasoning in open-source models. Code is available at https://github.com/CSfufu/Revisual-R1.