R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model
(2503.05132v2)
Published 7 Mar 2025 in cs.AI, cs.CV, and cs.LG
Abstract: Recently DeepSeek R1 demonstrated how reinforcement learning with simple rule-based incentives can enable autonomous development of complex reasoning in LLMs, characterized by the "aha moment", in which the model manifests self-reflection and increased response length during training. However, attempts to extend this success to multimodal reasoning often failed to reproduce these key characteristics. In this report, we present the first successful replication of these emergent characteristics for multimodal reasoning on only a non-SFT 2B model. Starting with Qwen2-VL-2B and applying reinforcement learning directly on the SAT dataset, our model achieves 59.47% accuracy on CVBench, outperforming the base model by approximately 30% and exceeding the SFT setting by approximately 2%. In addition, we share our failed attempts and insights from trying to achieve R1-like reasoning using RL with instruct models, aiming to shed light on the challenges involved. Our key observations include: (1) applying RL on instruct models often results in trivial reasoning trajectories, and (2) naive length rewards are ineffective in eliciting reasoning capabilities. The project code is available at https://github.com/turningpoint-ai/VisualThinker-R1-Zero
Summary
The paper introduces VisualThinker-R1-Zero, demonstrating that applying reinforcement learning directly to a 2B non-SFT model (Qwen2-VL) can replicate the "aha moment" in multimodal reasoning seen in larger models, achieving 59.47% on CVBench.
The method applies GRPO with a simple rule-based reward function directly to the base model, enhancing spatial reasoning significantly (approx. 30% over base) without needing large supervised reasoning datasets or complex prompting.
Crucially, the study finds that applying this RL approach to instruction-tuned (SFT) models results in superficial and less effective reasoning compared to applying it to base, non-SFT models.
The paper "R1-Zero’s “Aha Moment” in Visual Reasoning on a 2B Non-SFT Model" introduces VisualThinker-R1-Zero, which replicates key emergent characteristics of DeepSeek R1 in multimodal reasoning using a 2B non-SFT model. The authors apply reinforcement learning directly to the Qwen2-VL-2B model and achieve a 59.47% accuracy on CVBench, outperforming the base model by approximately 30% and exceeding the SFT model by approximately 2%. The project code is available on GitHub.
The key contributions of this work are:
Replication of DeepSeek R1's "aha moment" and increased reasoning length in multimodal reasoning tasks using a non-SFT 2B model.
Demonstration that vision-centric spatial reasoning tasks benefit from improved reasoning capabilities.
Observation that applying RL on instruction-tuned models leads to superficial reasoning.
Related Work
The authors note that while LLMs can be post-trained to elicit enhanced reasoning abilities, enhancement typically requires sophisticated prompting designs or large amounts of reasoning training data. The community is interested in developing more natural methods to incentivize higher intelligence in models without relying on extensive supervised data or complex prompting techniques. The DeepSeek R1 paper demonstrated that RL can incentivize a model's reasoning abilities without any supervised reasoning data, and discovered an "aha moment" when directly applying RL with rule-based reward on mathematical datasets.
VisualThinker R1 Zero
The method builds on Qwen2-VL-2B as the base model, applying GRPO with a tailored chat template and prompting strategy to enhance its reasoning capabilities. The authors posit that applying GRPO to the base model may be a more efficient and effective way to replicate multimodal R1 reasoning than training instruction fine-tuned models.
For each question q in the dataset Q, the model generates a response o with a specified prompt template and is then optimized using an RL objective. To avoid training the additional value-function model required by PPO, GRPO uses the average reward of the responses sampled from the policy model as the baseline when computing the advantage. Given an input question q, a group of responses {o1, o2, ⋯, oG} is sampled and the corresponding rewards {r1, r2, ⋯, rG} are computed with the reward function. The advantage is then computed as:
$\hat{A}_{i,t} = \tilde{r}_i = \frac{r_i - \text{mean}(\{r_1, r_2, \cdots, r_G\})}{\text{std}(\{r_1, r_2, \cdots, r_G\})}$
where:
$\hat{A}_{i,t}$ is the estimated advantage at step t for response i
$\tilde{r}_i$ is the normalized reward for response i
$r_i$ is the reward for response i
$\text{mean}(\{r_1, \cdots, r_G\})$ is the mean of the group's rewards
$\text{std}(\{r_1, \cdots, r_G\})$ is the standard deviation of the group's rewards
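As a concrete illustration, the group-relative advantage can be computed in a few lines of PyTorch. The snippet below is a minimal sketch; the function name, the epsilon guard against zero variance, and the example rewards are illustrative assumptions, not the authors' implementation.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Normalize the rewards of the G responses sampled for one question.

    Each response's advantage is its reward minus the group mean, divided by
    the group standard deviation; the same value is used at every token of
    that response.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: G = 8 sampled responses, reward = accuracy reward + format reward.
rewards = torch.tensor([2.0, 0.0, 1.0, 2.0, 0.0, 1.0, 2.0, 0.0])
print(group_relative_advantages(rewards))
```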
The policy model is then optimized by maximizing the following clipped surrogate objective with a KL penalty:
$\mathcal{J}_{GRPO}(\theta) = \mathbb{E}_{q \sim P(Q),\, \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{old}}(O \mid q)} \Bigg[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \Bigg\{ \min \Bigg[ \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t} \mid q, o_{i,<t})} \hat{A}_{i,t},\ \text{clip}\left( \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t} \mid q, o_{i,<t})},\, 1 - \epsilon,\, 1 + \epsilon \right) \hat{A}_{i,t} \Bigg] - \beta\, \mathbb{D}_{KL}\left[\pi_\theta \,\|\, \pi_{ref}\right] \Bigg\} \Bigg]$
where:
$\mathcal{J}_{GRPO}(\theta)$ is the GRPO objective function
$q$ is the input question
$P(Q)$ is the distribution of questions
$\{o_i\}_{i=1}^{G}$ is the set of sampled responses
$\pi_{\theta_{old}}(O \mid q)$ is the old policy
$|o_i|$ is the length of response $i$
$o_{i,t}$ is the token at position $t$ in response $i$
$\pi_\theta(o_{i,t} \mid q, o_{i,<t})$ is the probability of token $o_{i,t}$ given question $q$ and previous tokens $o_{i,<t}$ under the current policy $\pi_\theta$
$\pi_{\theta_{old}}(o_{i,t} \mid q, o_{i,<t})$ is the same probability under the old policy $\pi_{\theta_{old}}$
$\hat{A}_{i,t}$ is the estimated advantage at step $t$ for response $i$
$\epsilon$ is the PPO clipping parameter
$\beta$ is the KL penalty coefficient
$\mathbb{D}_{KL}[\pi_\theta \,\|\, \pi_{ref}]$ is the KL divergence between the current policy $\pi_\theta$ and a reference policy $\pi_{ref}$
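The sketch below shows how this objective can be evaluated for one group of responses in PyTorch. It is a simplified illustration under stated assumptions: all responses are padded to the same length with masking omitted, the group-normalized advantage is broadcast to every token, and the KL term uses a k3-style unbiased estimator as in DeepSeek-style GRPO implementations. It is not the authors' exact implementation.

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, advantages, eps=0.2, beta=0.04):
    """GRPO surrogate for one group of G responses.

    logp_new, logp_old, logp_ref: (G, T) per-token log-probabilities under the
    current, old, and reference policies (padding/masking omitted for brevity).
    advantages: (G,) group-normalized rewards, broadcast to every token.
    Returns the negative objective so it can be minimized with an optimizer.
    """
    ratio = torch.exp(logp_new - logp_old)            # importance ratio per token
    adv = advantages.unsqueeze(-1)                    # (G, 1), broadcast over tokens
    surrogate = torch.min(ratio * adv,
                          torch.clamp(ratio, 1 - eps, 1 + eps) * adv)
    # k3-style estimator of KL(pi_theta || pi_ref)
    log_ratio_ref = logp_ref - logp_new
    kl = torch.exp(log_ratio_ref) - log_ratio_ref - 1
    per_token = surrogate - beta * kl
    # 1/|o_i| average over tokens, then 1/G average over the group
    objective = per_token.mean(dim=-1).mean()
    return -objective
```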
The RL approach avoids reward models and Monte Carlo Tree Search (MCTS)-like techniques, and instead employs a rule-based reward function that evaluates responses based on their format and correctness. If the response provides a correct final answer, the model receives an accuracy reward of +1. If the response encloses its thinking in <think> </think> tags and its final answer in <answer> </answer> tags, the model receives a format reward of +1; otherwise, it receives a reward of 0.
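A rule-based reward of this kind can be implemented with a few string checks. The sketch below is an illustrative assumption of how such a function might look; the tag names follow the R1-style template described above, while the exact matching and answer-comparison rules used in the paper may differ.

```python
import re

def rule_based_reward(response: str, ground_truth: str) -> float:
    """Accuracy reward (+1) plus format reward (+1); otherwise 0."""
    reward = 0.0
    # Format reward: reasoning enclosed in <think>...</think>, followed by the
    # final answer enclosed in <answer>...</answer>.
    if re.fullmatch(r"\s*<think>.*?</think>\s*<answer>.*?</answer>\s*",
                    response, flags=re.DOTALL):
        reward += 1.0
    # Accuracy reward: the extracted final answer matches the ground truth.
    match = re.search(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    if match and match.group(1).strip().lower() == ground_truth.strip().lower():
        reward += 1.0
    return reward
```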
Experiments
The authors trained their models on the SAT dataset, a VQA dataset of 218k question-answer pairs synthesized with a photo-realistic physics engine to enhance spatial intelligence, focusing on its static subset. To test the generalization of their method, they evaluated on CVBench, a realistic vision-centric benchmark. All experiments were conducted on four NVIDIA H100 GPUs (80 GB each) with a per-device batch size of 1. The model was trained for 1500 steps with a learning rate of 1×10−6, a sampling temperature of 1.0, and a maximum response length of 700. During GRPO optimization, 8 responses were sampled per step and a KL coefficient of 0.04 was applied.
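The reported hyperparameters can be collected into a single configuration, as sketched below; the dictionary key names are illustrative and do not correspond to the project's actual config schema.

```python
# Hyperparameters reported in the paper, gathered into a plain Python dict.
grpo_training_config = {
    "base_model": "Qwen2-VL-2B",        # non-SFT base model
    "train_dataset": "SAT (static subset)",
    "eval_benchmark": "CVBench",
    "num_gpus": 4,                      # NVIDIA H100, 80 GB each
    "per_device_batch_size": 1,
    "max_steps": 1500,
    "learning_rate": 1e-6,
    "temperature": 1.0,
    "max_response_length": 700,
    "num_sampled_responses": 8,         # GRPO group size G
    "kl_coefficient": 0.04,             # beta in the GRPO objective
}
```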
The fine-tuned Qwen2-VL-2B non-SFT model was evaluated on CVBench, and during training it autonomously developed increasingly long responses alongside performance gains. The method was also tested on other spatial reasoning benchmarks, including BLINK and VSR. It outperformed Qwen2-VL-2B (base) by approximately 30% and Qwen2-VL-2B SFT (base + SFT) by approximately 2% on CVBench, and also achieved superior performance on the BLINK and VSR benchmarks. During training, the model spontaneously revisited its previous judgments and explored alternative options, exhibiting a multimodal "aha moment".
Challenges of Applying RL to Supervised Fine-Tuned Models
Applying RL to an SFT model does improve its performance on CVBench. However, it is questionable whether this approach incentivizes genuinely higher intelligence, as the model's responses tend to degenerate into meaningless or trivial reasoning patterns. To probe where the improvement comes from, the authors investigated whether it arises from enhancement of the vision encoder during training. They hypothesized that freezing the vision encoder during RL might push the model to develop more sophisticated reasoning strategies, whereas freezing the LLM would yield performance at the same level as full fine-tuning. Results show that both frozen settings achieve greater improvement than the vanilla implementation, yet both still generate short and trivial responses.
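Freezing one component while fine-tuning the other can be done by disabling gradients on its parameters, as in the minimal PyTorch sketch below; the submodule attribute names in the comments are assumptions, since they depend on the specific Qwen2-VL implementation.

```python
import torch.nn as nn

def freeze_module(module: nn.Module) -> None:
    """Exclude a submodule from gradient updates during RL fine-tuning."""
    for param in module.parameters():
        param.requires_grad = False

# Ablation 1: freeze the vision encoder, update only the language model,
# e.g. freeze_module(model.visual)   # attribute name is an assumption
# Ablation 2: freeze the language model, update only the vision encoder,
# e.g. freeze_module(model.model)    # attribute name is an assumption
```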
Naive rewarding of lengthy responses does not improve model performance and often leads to reward hacking behaviors, with models generating extremely long yet meaningless content.