Understanding R1-Zero-Like Training: A Critical Perspective (2503.20783v1)

Published 26 Mar 2025 in cs.LG, cs.AI, and cs.CL

Abstract: DeepSeek-R1-Zero has shown that reinforcement learning (RL) at scale can directly enhance the reasoning capabilities of LLMs without supervised fine-tuning. In this work, we critically examine R1-Zero-like training by analyzing its two core components: base models and RL. We investigate a wide range of base models, including DeepSeek-V3-Base, to understand how pretraining characteristics influence RL performance. Our analysis reveals that DeepSeek-V3-Base already exhibits an "Aha moment", while Qwen2.5 base models demonstrate strong reasoning capabilities even without prompt templates, suggesting potential pretraining biases. Additionally, we identify an optimization bias in Group Relative Policy Optimization (GRPO), which artificially increases response length (especially for incorrect outputs) during training. To address this, we introduce Dr. GRPO, an unbiased optimization method that improves token efficiency while maintaining reasoning performance. Leveraging these insights, we present a minimalist R1-Zero recipe that achieves 43.3% accuracy on AIME 2024 with a 7B base model, establishing a new state-of-the-art. Our code is available at https://github.com/sail-sg/understand-r1-zero.

Authors (8)
  1. Zichen Liu (34 papers)
  2. Changyu Chen (19 papers)
  3. Wenjun Li (29 papers)
  4. Penghui Qi (8 papers)
  5. Tianyu Pang (96 papers)
  6. Chao Du (83 papers)
  7. Wee Sun Lee (60 papers)
  8. Min Lin (96 papers)

Summary

The paper "Understanding R1-Zero-Like Training: A Critical Perspective" (Liu et al., 26 Mar 2025 ) presents a detailed examination of the methodology used in DeepSeek-R1-Zero, which employs reinforcement learning (RL) directly on base LLMs to enhance reasoning capabilities without an initial supervised fine-tuning (SFT) phase. The authors critically dissect the two primary components of this approach: the choice of base model and the RL optimization algorithm, specifically Group Relative Policy Optimization (GRPO). Their analysis reveals potential confounding factors, challenges certain interpretations of observed phenomena like the "Aha moment," identifies optimization biases, and proposes an improved RL method, Dr. GRPO.

Base Model Characteristics and Pretraining Influence

A significant portion of the analysis focuses on understanding how the properties of the base LLM influence the outcome of R1-Zero-like training. The investigation covers various models, including the Qwen2.5 family and the original DeepSeek-V3-Base.

  • Inherent Capabilities and Template Dependence: The paper finds that prompt templates (e.g., the R1 or Qwen-Math templates) are generally necessary to elicit task-specific behavior (e.g., math problem-solving) from standard base models trained primarily for next-token prediction. However, a critical finding is that all tested base models, prior to any RL, exhibit non-trivial reasoning capabilities. Measured by pass@8 accuracy (a sketch of this metric follows the list below), these models can already explore trajectories leading to correct solutions, suggesting that RL in the R1-Zero paradigm is not building reasoning ability de novo but rather amplifying or refining pre-existing, potentially latent, capabilities.
  • Qwen2.5 Pretraining Hypothesis: The analysis reveals anomalous behavior in Qwen2.5 base models, particularly Qwen2.5-Math. These models achieve substantially better reasoning performance when no template is used, outperforming results obtained with standard templates or even few-shot prompting (Table 1). This leads the authors to hypothesize that Qwen2.5 models might have been pretrained on data containing concatenated question-answer pairs, effectively undergoing a form of implicit SFT during pretraining. This characteristic makes them potentially less representative of a pure base model in the context of studying R1-Zero, as they might already incorporate behaviors that R1-Zero aims to instill via RL.
  • Revisiting the "Aha Moment": The phenomenon of self-reflection, or the "Aha moment," previously associated with emergent capabilities developed during scaled RL in R1-Zero, is critically re-evaluated. The authors find self-reflection keywords and related behaviors in almost all of the base models tested, including the DeepSeek-V3-Base used in the original R1-Zero work, before any RL is applied (Fig. 3 right, Fig. 4). This suggests that self-reflection is likely a capability inherent to these large pretrained models, which RL may subsequently enhance or make more frequent, rather than one that emerges purely from the RL process. Furthermore, analysis of the original DeepSeek-R1-Zero model indicated that the presence of self-reflection in outputs during inference did not correlate strongly with higher solution accuracy (Appendix C, Fig. 11).
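
The paper measures base-model reasoning via pass@8, i.e., whether any of 8 sampled solutions is correct. The summary does not specify exactly how pass@8 is computed, so the snippet below shows the standard unbiased pass@k estimator commonly used for such evaluations, as an illustrative assumption rather than the authors' exact procedure.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k: the probability that at least one of
    k completions drawn (without replacement) from n samples is correct,
    given that c of the n samples were verified correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 16 sampled solutions per question, 5 verified correct.
print(round(pass_at_k(n=16, c=5, k=8), 3))
```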

Analysis of Reinforcement Learning Optimization (GRPO)

The paper scrutinizes the RL algorithm, GRPO, used in R1-Zero and identifies specific biases within its optimization objective that can influence training dynamics and observed model behaviors.

  • GRPO Optimization Biases: The standard GRPO objective function is shown to introduce two distinct biases:

    1. Response-level Length Bias: The normalization of the advantage estimate by the inverse of the response length ($1/|\tau_i|$ in Eq. 3) disproportionately weights updates. Shorter correct responses receive larger positive gradient updates, while longer incorrect responses receive smaller negative updates (in magnitude). This implicitly encourages the generation of longer sequences, particularly when the model is incorrect, potentially explaining the commonly observed "double-increase" phenomenon (simultaneous rise in performance and response length) as partially an artifact of the optimization objective rather than solely improved reasoning complexity. The authors note this normalization issue is also present in several popular open-source PPO implementations for LLMs (Table 3).
    2. Question-level Difficulty Bias: Normalizing the advantage by the per-question standard deviation ($1/\mathrm{std}(\cdot)$) biases learning towards questions where the model exhibits very consistent performance (low variance in rewards). This can lead to overfitting on examples that are either consistently easy or consistently hard for the current policy, rather than focusing learning on examples where improvement is most likely.
  • Dr. GRPO: Debiased Optimization: To counteract these biases, the authors propose Dr. GRPO (GRPO Done Right). This modified algorithm removes the two problematic normalization terms: the per-response length normalization ($1/|\tau_i|$) and the per-question standard deviation normalization ($1/\mathrm{std}(\cdot)$). The resulting objective function (Appendix A, Eq. 7) aligns more closely with standard policy-gradient methods such as PPO using Monte Carlo returns with an unbiased baseline (akin to REINFORCE Leave-One-Out); a minimal sketch contrasting the two objectives appears after this list.

  • Experimental Validation of Dr. GRPO: Comparative experiments demonstrate that Dr. GRPO successfully mitigates the artificial inflation of response length observed with standard GRPO, especially for incorrect trajectories (Fig. 5, Fig. 7 right). This leads to significantly improved token efficiency during inference. Importantly, this efficiency gain is achieved while maintaining or even slightly improving the final reasoning performance compared to vanilla GRPO.
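
The contrast between the two objectives can be made concrete with a short sketch. This is a minimal illustration of the two normalization terms discussed above, not the authors' implementation: it uses plain REINFORCE-style per-token terms, omits the clipped importance-ratio and KL components used in practice, and the epsilon is an assumed numerical guard.

```python
import torch

def group_policy_loss(logprobs, rewards, dr_grpo=True, eps=1e-6):
    """Policy-gradient loss for one group of G sampled responses to the
    same question.

    logprobs : list of G 1-D tensors, per-token log-probs of each response
    rewards  : tensor of shape (G,), scalar reward per response
    dr_grpo  : if False, apply GRPO's two extra normalizations
    """
    # Group-mean baseline, shared by GRPO and Dr. GRPO
    advantages = rewards - rewards.mean()
    if not dr_grpo:
        # Bias 2: GRPO divides by the per-question reward std
        advantages = advantages / (rewards.std() + eps)

    per_response_terms = []
    for lp, adv in zip(logprobs, advantages):
        term = -(adv * lp).sum()      # REINFORCE term summed over tokens
        if not dr_grpo:
            # Bias 1: GRPO divides by the response length |tau_i|
            term = term / lp.numel()
        per_response_terms.append(term)
    return torch.stack(per_response_terms).mean()
```

With dr_grpo=True, every sampled token contributes to the gradient regardless of how long its response is, which removes the incentive to lengthen incorrect answers that the paper attributes to the $1/|\tau_i|$ term.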

Interaction Effects and Practical Implications

The paper highlights the complex interplay between the base model's pretraining, the choice of prompt template, and the RL training data.

  • Template Mismatch and Data Sensitivity: Using a template that is misaligned with the base model's pretraining (e.g., R1 template on Qwen2.5-Math) can initially degrade performance, necessitating more extensive RL training to recover and improve capabilities. Conversely, when the template (or lack thereof, for Qwen) aligns well with the model's inherent structure, RL can effectively leverage even simpler, out-of-distribution data (like GSM8k) to enhance reasoning on more complex tasks (Fig. 6).
  • Pretraining Boost: Experiments using Llama-3.2 models confirm that while RL can improve reasoning even on base models without strong domain specialization, starting with a model that has undergone relevant domain-specific pretraining (e.g., math pretraining) significantly elevates the performance ceiling achievable through subsequent RL tuning (Fig. 7 left).
  • Minimalist Recipe for State-of-the-Art: Synthesizing these findings, the paper proposes a minimalist yet effective recipe: fine-tuning Qwen2.5-Math-7B using Dr. GRPO on MATH dataset problems (levels 3-5) with the Qwen-Math template. This approach achieved 43.3% accuracy on AIME 2024, establishing a new state-of-the-art for 7B models at the time of publication, demonstrating the practical benefits of understanding and mitigating the identified biases (Fig. 2, Appendix B Table 4).
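
The recipe above, like other R1-Zero-like setups, optimizes a rule-based, verifiable reward (correctness of the final answer) rather than a learned reward model. Below is a minimal sketch of such a reward, assuming final answers appear in a \boxed{...} span as is conventional for MATH-style problems; real pipelines typically add math-aware equivalence checking rather than exact string matching, and this is not the authors' exact verifier.

```python
import re

def extract_boxed_answer(text: str):
    """Return the contents of the last \\boxed{...} span in a generated
    solution, or None if no such span is present."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

def correctness_reward(response: str, reference_answer: str) -> float:
    """Binary rule-based reward: 1.0 if the extracted final answer matches
    the reference exactly, else 0.0."""
    predicted = extract_boxed_answer(response)
    return 1.0 if predicted == reference_answer.strip() else 0.0

# Example usage
print(correctness_reward("... therefore the answer is \\boxed{42}.", "42"))  # 1.0
```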

In conclusion, the paper provides a critical perspective on R1-Zero-like training, urging researchers to consider the influence of base model pretraining characteristics and potential biases within RL optimization algorithms. It demonstrates that phenomena like self-reflection may pre-exist RL training and that observed behavioral changes like increased response length can be partially attributed to optimization artifacts. The proposed Dr. GRPO offers a more robust and efficient alternative to standard GRPO for RL-based enhancement of LLM reasoning.
