R1-Zero-Like Training Overview
- R1-Zero-like training is a paradigm that applies reinforcement learning with rule-based, automatable rewards directly to pretrained models to encourage emergent chain-of-thought reasoning.
- It employs critic-free, group-relative optimization methods such as GRPO, together with debiased variants (e.g., Dr. GRPO) that correct length and difficulty biases, enabling robust and sample-efficient optimization.
- Empirical results demonstrate improved in-context reasoning, extended chain-of-thought length, and enhanced generalization across domains including math, code, and multimodal vision.
R1-Zero-Like Training
R1-Zero-like training denotes a paradigm in foundation model optimization whereby reinforcement learning (RL), often with rule-based or verifiably automatable rewards, is applied directly to base models—frequently without supervised fine-tuning (SFT)—to incentivize emergent reasoning capability. Originating from the DeepSeek-R1-Zero line of work (DeepSeek-AI et al., 22 Jan 2025), this approach has rapidly broadened from LLMs into multimodal vision models, agents, code generators, and graph reasoning, resulting in demonstrable gains in in-context reasoning, chain-of-thought length, and generalization across tasks. Key distinguishing aspects include the use of group-normalized relative optimization (e.g., GRPO), verifiable or explicit reward functions, and avoidance of cold-start or instruction-tuned warmups unless empirically necessary.
1. Principles and Evolution of R1-Zero-like Training
The core principle is direct RL post-training of a base (pretrained) model, using task-aligned, automatable rewards for reasoning and answer quality, often with minimal or no SFT. DeepSeek-R1-Zero demonstrated that applying RL with explicit answer and formatting rewards can lead to the autonomous emergence of long chain-of-thought (CoT) reasoning, self-verification (“aha moments”), and increased test accuracy with no intermediate human supervision (DeepSeek-AI et al., 22 Jan 2025). The framework relies on reward signals that can be checked programmatically (exact answer match for mathematics, test-case passing for code, functional-test passing for EDA) and leverages group-level optimization (GRPO).
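As a concrete illustration, a minimal Python sketch of such programmatically checkable rewards is given below. The <think>/<answer> tag template, the exact-match criterion, and the equal weighting of the format and accuracy terms are assumptions of this sketch, not the exact DeepSeek-R1-Zero implementation.

```python
import re

def format_reward(completion: str) -> float:
    """1.0 if the completion follows the assumed <think>...</think><answer>...</answer> template."""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    return 1.0 if re.match(pattern, completion.strip(), flags=re.DOTALL) else 0.0

def math_accuracy_reward(completion: str, gold_answer: str) -> float:
    """1.0 if the extracted final answer exactly matches the reference answer string."""
    match = re.search(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == gold_answer.strip() else 0.0

def rule_based_reward(completion: str, gold_answer: str) -> float:
    # Composite rule-based signal: both terms are verifiable by program, no learned reward model.
    return format_reward(completion) + math_accuracy_reward(completion, gold_answer)
```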
Extensions have adapted the paradigm for small models (e.g., FastCuRL with curriculum context extension (Song et al., 21 Mar 2025)), visual models (DINO-R1 (Pan et al., 29 May 2025)), multimodal agents (Vision-R1 (Huang et al., 9 Mar 2025)), code generation (CodeV-R1 (Zhu et al., 30 May 2025)), and graph algorithms (Graph-R1 (Wu et al., 24 Aug 2025)), each addressing the data, reward, and optimization peculiarities of its domain. The central motivations are sample efficiency, emergent cognitive behaviors, and robustness in open-domain and zero-shot conditions.
2. RL Algorithms: GRPO and Variants
Group Relative Policy Optimization (GRPO), popularized by DeepSeek-R1, has become the standard objective for R1-Zero-like training (DeepSeek-AI et al., 22 Jan 2025). GRPO dispenses with value critics, instead estimating standardized group advantages for each sampled response relative to its group peers:
$$\hat{A}_i = \frac{r_i - \operatorname{mean}(\{r_1, \dots, r_G\})}{\operatorname{std}(\{r_1, \dots, r_G\})}$$

where $r_i$ is the reward for output $o_i$ and the mean and standard deviation are taken over the group of $G$ responses sampled for the same prompt. The policy gradient update is performed using clipped PPO ratios and, optionally, KL regularization with a static or annealed coefficient:

$$\mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|} \min\Big(\rho_{i,t}\hat{A}_i,\ \operatorname{clip}(\rho_{i,t},\,1-\epsilon,\,1+\epsilon)\,\hat{A}_i\Big) - \beta\,\mathbb{D}_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big)\right], \quad \rho_{i,t} = \frac{\pi_\theta(o_{i,t}\mid q,\,o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t}\mid q,\,o_{i,<t})}.$$
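The same computation can be sketched in a few lines of PyTorch. The padding-mask convention and the externally supplied per-token KL estimate (`kl_to_ref`) are assumptions of this sketch rather than details fixed by the cited papers.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: [G] scalar rewards for the G responses sampled for one prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def grpo_loss(logp_new: torch.Tensor, logp_old: torch.Tensor, mask: torch.Tensor,
              advantages: torch.Tensor, clip_eps: float = 0.2,
              kl_coef: float = 0.0, kl_to_ref=None) -> torch.Tensor:
    """
    Critic-free clipped surrogate with group-relative advantages.
    logp_new / logp_old / mask: [G, T] per-token log-probs and validity mask.
    advantages: [G] per-response advantages from grpo_advantages().
    """
    ratio = torch.exp(logp_new - logp_old)                       # importance ratios rho_{i,t}
    adv = advantages.unsqueeze(-1)                               # broadcast one advantage per response
    surrogate = torch.min(ratio * adv,
                          torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv)
    # Per-response length normalization (the term Dr. GRPO later removes), then group mean.
    per_response = (surrogate * mask).sum(dim=1) / mask.sum(dim=1)
    loss = -per_response.mean()
    if kl_to_ref is not None:
        loss = loss + kl_coef * kl_to_ref.mean()                 # optional KL penalty to a frozen reference
    return loss
```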
This critic-free, group-relative advantage estimation is applicable to multi-turn RL agents (WebAgent-R1 (Wei et al., 22 May 2025)), vision transformers (DINO-R1 (Pan et al., 29 May 2025)), tabular reasoning (Table-R1-Zero (Yang et al., 29 May 2025)), and code generation (CodeV-R1 (Zhu et al., 30 May 2025)).
Optimization variants address artifacts such as length bias and difficulty bias. Dr. GRPO eliminates per-response length and per-group std normalization, improving token efficiency without compromising accuracy (Liu et al., 26 Mar 2025). DRA-GRPO further amends the reward by incorporating semantic diversity via the submodular mutual information graph-cut penalty, balancing exploration and exploitation to avoid “mode collapse” and increase reasoning diversity (Chen et al., 14 May 2025).
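The difference between the standard and debiased objectives is small enough to show directly. In the sketch below, the constant normalizer (the full sequence-length dimension T) stands in for the fixed maximum generation length used by Dr. GRPO and is an assumption of this sketch.

```python
import torch

def grpo_advantage(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Standard GRPO: mean-center AND divide by the group std (source of difficulty bias).
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def dr_grpo_advantage(rewards: torch.Tensor) -> torch.Tensor:
    # Dr. GRPO: keep the group-mean baseline, drop the std division.
    return rewards - rewards.mean()

def dr_grpo_aggregate(per_token_loss: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # Dr. GRPO also replaces per-response length normalization with a constant normalizer,
    # so long (often incorrect) responses no longer receive a smaller per-token penalty.
    G, T = mask.shape
    return (per_token_loss * mask).sum() / (G * T)
```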
3. Reward Functions and Emergent Behavior
R1-Zero-like training depends fundamentally on reward functions that can be automatically evaluated for correctness, structure, and sometimes diversity. For math and code, the reward is a binary signal: a correctly formatted, correct final answer yields reward 1, otherwise 0. For open-ended tasks (MT-R1-Zero (Feng et al., 14 Apr 2025)), the reward combines format enforcement with continuous metrics (BLEU, COMET) to provide differentiated incentives.
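A composite reward of this kind can be sketched as follows; the hard format gate, the 50/50 weighting, and the assumption that `metric_fn` (e.g., a BLEU or COMET wrapper) returns a score in [0, 1] are illustrative choices, not the exact MT-R1-Zero recipe.

```python
import re
from typing import Callable

def mt_reward(completion: str, reference: str,
              metric_fn: Callable[[str, str], float],
              format_weight: float = 0.5) -> float:
    """Format gate plus a continuous quality metric on the extracted translation."""
    match = re.search(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
    if match is None:
        return 0.0                                   # format violation: no metric credit at all
    translation = match.group(1).strip()
    quality = metric_fn(translation, reference)      # assumed to be normalized to [0, 1]
    return format_weight + (1.0 - format_weight) * quality
```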
A central empirical observation is the emergence of cognitive behaviors such as self-reflection, verification, and extended CoT within RL-trained models starting from base weights—termed “aha moments” (DeepSeek-AI et al., 22 Jan 2025, Zhou et al., 7 Mar 2025, Zeng et al., 24 Mar 2025). These behaviors arise naturally under group-relative RL and are measurable by output length, reasoning action markers, and increasingly complex rationale paths.
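Such behaviors are typically quantified with simple surface statistics; the marker list below is an illustrative assumption, not a standardized taxonomy.

```python
# Hypothetical marker list: phrases commonly treated as proxies for self-reflection / verification.
REFLECTION_MARKERS = ("wait", "let me check", "let me verify", "on second thought", "aha")

def reasoning_stats(completion: str) -> dict:
    """Crude proxies for emergent reasoning: response length and self-reflection marker counts."""
    text = completion.lower()
    return {
        "length_tokens": len(text.split()),
        "reflection_markers": sum(text.count(marker) for marker in REFLECTION_MARKERS),
    }
```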
However, the relationship between response length and depth of reasoning is nontrivial—length rewards alone can lead to reward hacking and repetitive, non-informative trajectories (Zhou et al., 7 Mar 2025, Chen et al., 6 Mar 2025). Cognitive emergence is best incentivized via task-aligned rewards and group normalization.
For domains such as multimodal reasoning (Vision-R1 (Huang et al., 9 Mar 2025), DINO-R1 (Pan et al., 29 May 2025)), code synthesis (CodeV-R1 (Zhu et al., 30 May 2025)), and web navigation (WebAgent-R1 (Wei et al., 22 May 2025)), reward specification must be tailored (e.g., testbench-based equivalence for HDL, binary task completion for web agents, KL regularization for objectness distribution in visual grounding) to ensure robust learning and prevent reward hacking.
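In the simplest cases, these domain-specific reward shapes reduce to sketches like the following; the function signatures are assumptions for illustration, and the actual systems combine such signals with format and KL terms.

```python
def web_agent_reward(task_completed: bool) -> float:
    # Binary outcome reward for web navigation: credit only if the end goal was reached.
    return 1.0 if task_completed else 0.0

def hdl_reward(compiles: bool, testbench_passed: bool) -> float:
    # Testbench-based functional check for generated Verilog; compilation alone earns nothing,
    # which discourages syntactically valid but functionally vacuous code.
    return 1.0 if (compiles and testbench_passed) else 0.0
```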
4. Empirical Results and Applications
R1-Zero-like training achieves significant state-of-the-art results across diverse tasks and models, typically outperforming supervised baselines with less data and compute:
| Task / Model | R1-Zero-like Method | Comparison Baseline | Result / Gain | Reference |
|---|---|---|---|---|
| Math (AIME 2024, 7B) | GRPO, Dr. GRPO | SFT | +40% | (DeepSeek-AI et al., 22 Jan 2025, Liu et al., 26 Mar 2025) |
| Visual reasoning | GRPO on Qwen2-VL-2B | SFT | +30% | (Zhou et al., 7 Mar 2025) |
| Table reasoning | Table-R1-Zero-7B | SFT | ≥ GPT-4.1 | (Yang et al., 29 May 2025) |
| EDA/Verilog | CodeV-R1-7B + DAPO | DeepSeek-R1-671B | +12–20% | (Zhu et al., 30 May 2025) |
| GUI grounding | GUI-G1-3B (Modified GRPO) | SFT | +3% over larger models | (Zhou et al., 21 May 2025) |
| Machine translation | MT-R1-Zero (7B) | SFT | ≈ GPT-4o | (Feng et al., 14 Apr 2025) |
| Reasoning diversity | DRA-GRPO-1.5B | DeepScaleR | +2–3% avg, 6x less data | (Chen et al., 14 May 2025) |
Across studies, R1-Zero-like RL consistently yields improvements in reasoning length, accuracy, generalization to out-of-domain data, and diversity of solutions. For small-scale models, curriculum RL (FastCuRL) enables sample-efficient training by stage-wise context and data segmentation, preventing entropy collapse (Song et al., 21 Mar 2025); a sketch of such a schedule is given below. For web agents, ablation studies demonstrate that RL from base weights (R1-Zero style) fails under extremely sparse rewards, establishing behavior cloning (BC) warm-up as a necessity in long-horizon tasks (Wei et al., 22 May 2025). Conversely, in math/code/vision benchmarks, direct RL is highly effective, especially when reward functions are rule-based and automatable.
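A stage-wise curriculum of this kind can be expressed as a simple schedule; the context lengths, split names, and fixed step budget below are assumptions of this sketch rather than FastCuRL's exact configuration.

```python
from dataclasses import dataclass

@dataclass
class CurriculumStage:
    name: str
    max_context_tokens: int   # rollout context window used in this stage
    data_split: str           # subset of training prompts (e.g., segmented by length/difficulty)

# Shorter contexts and shorter/easier prompts first, then progressively longer ones.
CURRICULUM = [
    CurriculumStage("stage1_short", max_context_tokens=8_192,  data_split="short_prompts"),
    CurriculumStage("stage2_mid",   max_context_tokens=16_384, data_split="mixed_prompts"),
    CurriculumStage("stage3_long",  max_context_tokens=24_576, data_split="long_prompts"),
]

def stage_for_step(step: int, steps_per_stage: int = 500) -> CurriculumStage:
    """Advance on a fixed step budget; real schedules may instead switch on reward plateaus."""
    return CURRICULUM[min(step // steps_per_stage, len(CURRICULUM) - 1)]
```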
5. Controversies, Limitations, and Remedies
Several works highlight limitations and optimization artifacts inherent in R1-Zero-like RL:
- Length bias and difficulty bias: The per-response length normalization and per-group standard-deviation division in standard GRPO can artificially inflate the length of incorrect outputs and over- or under-weight questions by difficulty (Liu et al., 26 Mar 2025). Dr. GRPO corrects this by removing both terms.
- Reward hacking: Incentivizing output length, structure, or box size without adequately constraining the reward can lead to degenerate outputs such as trivial comments, oversized bounding boxes, or formally correct but semantically vacuous answers (Zhou et al., 21 May 2025, Chen et al., 6 Mar 2025); a sketch of one mitigation follows this list.
- Importance of cold-start/supervised initialization: For complex domains like web agents or multimodal LLMs, RL from a base checkpoint is often insufficient; warm-up SFT is required for effective policy improvement (Wei et al., 22 May 2025, Huang et al., 9 Mar 2025).
- Pretraining biases: The magnitude of RL improvement is affected by base model pretraining (e.g., Qwen2.5’s instruction-like bias inflates apparent RL gains) and template alignment (Liu et al., 26 Mar 2025).
- Reward diversity: Classic scalar reward functions ignore reasoning path diversity; DRA-GRPO’s semantic diversity weighting addresses this, fostering richer exploration (Chen et al., 14 May 2025).
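As referenced in the reward-hacking item above, one common mitigation is to gate any length or structure bonus on verified correctness; the cap and bonus weight below are illustrative assumptions.

```python
def length_gated_reward(correct: bool, num_tokens: int,
                        target_tokens: int = 1024, length_bonus: float = 0.2) -> float:
    """Length credit is capped and granted only for correct answers, so longer-but-wrong
    outputs can never outscore concise correct ones."""
    if not correct:
        return 0.0
    return 1.0 + length_bonus * min(num_tokens / target_tokens, 1.0)
```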
6. Broader Impact and Future Directions
R1-Zero-like training generalizes well to domains with verifiable, automatable alignment signals—in line with the successes in mathematics, code, visual reasoning, and tabular logic. The frameworks (GRPO, Dr. GRPO, DRA-GRPO, DAPO) enable sample-efficient, scalable, and reproducible RL for reasoning LLMs and VFMs, as well as robustness across distribution shifts, low-resource settings, and open-vocabulary conditions.
Further research directions include: refining reward functions for open-ended tasks (translation, Q&A); developing diversity-aware and context-sensitive RL objectives; extending RLVR to domains where automated verification is less direct; and formalizing the cognitive metrics for emergent reasoning properties using external behavioral analysis tools.
7. Summary Table: Common Components of R1-Zero-like Training
| Component | R1-Zero-like Approach | Alternatives / Notes |
|---|---|---|
| Base model | Pretrained, often no SFT | Sometimes instruction-tuned for stability (esp. web/vision) |
| RL algorithm | GRPO / Dr. GRPO / DRA-GRPO / DAPO | PPO (vanilla), Policy Gradient |
| Reward | Rule-based, verifiable (accuracy/format/diversity) | Learned reward (RLHF), metric-composite |
| Cold-start SFT | Usually omitted, but needed in some cases | Required for web agents, complex reasoning in MLLMs |
| Output structure | Explicit tags (<think>, <answer>, etc.) | Structure varies by domain |
| Diversity incentives | DRA-GRPO, SMI graph-cut penalty | Optional; targets mode collapse |
| Empirical outcomes | Increased CoT length, accuracy, “aha moments” | Observed across math, code, vision, and agent domains |
R1-Zero-like training constitutes a notable regime for incentivizing reasoning and robust generalization in foundation models. Its efficacy hinges on task-aligned, automatable rewards, group-based RL objectives, and careful handling of optimization artifacts and diversity incentives. It achieves significant accuracy gains and cognitive emergence in settings with verifiable alignment, but demands nuanced adaptation in open-ended or long-horizon domains.