R-Zero: Self-Evolving Reasoning LLM from Zero Data (2508.05004v1)

Published 7 Aug 2025 in cs.LG, cs.AI, and cs.CL

Abstract: Self-evolving LLMs offer a scalable path toward super-intelligence by autonomously generating, refining, and learning from their own experiences. However, existing methods for training such models still rely heavily on vast human-curated tasks and labels, typically via fine-tuning or reinforcement learning, which poses a fundamental bottleneck to advancing AI systems toward capabilities beyond human intelligence. To overcome this limitation, we introduce R-Zero, a fully autonomous framework that generates its own training data from scratch. Starting from a single base LLM, R-Zero initializes two independent models with distinct roles, a Challenger and a Solver. These models are optimized separately and co-evolve through interaction: the Challenger is rewarded for proposing tasks near the edge of the Solver capability, and the Solver is rewarded for solving increasingly challenging tasks posed by the Challenger. This process yields a targeted, self-improving curriculum without any pre-existing tasks and labels. Empirically, R-Zero substantially improves reasoning capability across different backbone LLMs, e.g., boosting the Qwen3-4B-Base by +6.49 on math-reasoning benchmarks and +7.54 on general-domain reasoning benchmarks.

Summary

  • The paper introduces a novel self-evolving framework that uses a co-evolutionary loop between a Challenger and a Solver to boost reasoning abilities without human-curated data.
  • It leverages an uncertainty-based reward and iterative filtering to generate challenging questions and fine-tune the Solver, leading to measurable gains on math benchmarks.
  • The results demonstrate consistent improvements and effective transfer to general-domain reasoning tasks, despite challenges like decreasing pseudo-label quality.

R-Zero: A Self-Evolving Reasoning LLM from Zero Data

Introduction

R-Zero introduces a fully autonomous framework for self-evolving LLMs that eliminates the need for any human-curated tasks or labels. The core innovation is a co-evolutionary loop between two independently optimized agents, a Challenger and a Solver, both initialized from the same base LLM. The Challenger is incentivized to generate questions at the edge of the Solver's current capabilities, while the Solver is trained to solve these increasingly difficult tasks. This process yields a self-improving curriculum, enabling the LLM to enhance its reasoning abilities from scratch, without external supervision (Figure 1).

Figure 1: (Left) R-Zero employs a co-evolutionary loop between Challenger and Solver. (Right) R-Zero achieves strong benchmark gains without any pre-existing tasks or human labels.

Methodology

Co-Evolutionary Framework

R-Zero's architecture is built on a two-agent system:

  • Challenger ($Q_\theta$): Generates challenging questions using Group Relative Policy Optimization (GRPO), with rewards based on the uncertainty of the Solver's responses.
  • Solver ($S_\phi$): Fine-tuned on a filtered set of Challenger-generated questions, using pseudo-labels derived from majority voting over its own answers.

The process is iterative: the Challenger is trained to maximize the Solver's uncertainty, and the Solver is subsequently trained to solve the filtered, challenging questions. This loop is repeated, resulting in a progressively more capable Solver (Figure 2).

Figure 2: R-Zero framework overview, illustrating the co-evolution of Challenger and Solver via GRPO and self-consistency-based rewards.
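The alternating structure of the loop can be summarized in a short sketch. This is a minimal illustration of the procedure described above, not the authors' implementation; the helper functions (`train_challenger_grpo`, `generate_questions`, `pseudo_label_and_filter`, `train_solver_grpo`) are hypothetical placeholders for the paper's GRPO training, question generation, and self-consistency filtering steps.

```python
# Minimal sketch of the R-Zero outer loop (illustrative; helpers are
# hypothetical stand-ins for the paper's GRPO training and filtering steps).
def r_zero(base_model, num_iterations=3):
    challenger = base_model.copy()  # Q_theta: proposes questions
    solver = base_model.copy()      # S_phi: answers them

    for _ in range(num_iterations):
        # 1. Train the Challenger (GRPO) to maximize the frozen Solver's uncertainty.
        challenger = train_challenger_grpo(challenger, frozen_solver=solver)

        # 2. Generate candidate questions, pseudo-label them by majority vote
        #    over the Solver's own answers, and keep only those of
        #    intermediate difficulty.
        questions = generate_questions(challenger)
        dataset = pseudo_label_and_filter(questions, solver)

        # 3. Fine-tune the Solver (GRPO) on the filtered, challenging questions.
        solver = train_solver_grpo(solver, dataset)

    return solver
```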

Reward Design

  • Uncertainty Reward: For each generated question, the Solver produces $m$ answers. The reward is $r_{\text{uncertainty}}(x; \phi) = 1 - 2\,|\hat{p}(x; S_\phi) - 0.5|$, where $\hat{p}$ is the empirical accuracy (fraction of answers matching the majority vote). This is maximized when the Solver's accuracy is 50%, i.e., maximal uncertainty.
  • Repetition Penalty: To ensure diversity, a BLEU-based clustering penalty is applied within each batch, discouraging the Challenger from generating semantically similar questions.
  • Format Check Penalty: Outputs not conforming to the required format are immediately assigned zero reward.

The final reward is $r_i = \max\bigl(0,\; r_{\text{uncertainty}}(x_i; \phi) - r_{\text{rep}}(x_i)\bigr)$.
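A minimal sketch of this reward computation, assuming the repetition penalty `rep_penalty` has already been computed from the BLEU-based clustering (not shown) and that `well_formatted` is the outcome of the format check; both names are illustrative rather than taken from the paper:

```python
from collections import Counter

def uncertainty_reward(answers):
    """r_uncertainty = 1 - 2*|p_hat - 0.5|, where p_hat is the fraction of the
    Solver's m sampled answers that agree with the majority-vote answer."""
    _, majority_count = Counter(answers).most_common(1)[0]
    p_hat = majority_count / len(answers)
    return 1.0 - 2.0 * abs(p_hat - 0.5)

def challenger_reward(answers, rep_penalty, well_formatted):
    # Outputs failing the format check receive zero reward immediately.
    if not well_formatted:
        return 0.0
    # Final reward: r_i = max(0, r_uncertainty - r_rep).
    return max(0.0, uncertainty_reward(answers) - rep_penalty)
```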

Dataset Construction and Filtering

After Challenger training, a large pool of candidate questions is generated. For each, the Solver's answers are used to assign a pseudo-label via majority vote. Only questions with empirical correctness within a specified band around 50% are retained, filtering out tasks that are too easy, too hard, or ambiguous.
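A sketch of this filtering step follows, assuming a hypothetical `solver.sample_answer` API and an illustrative retention band of [0.25, 0.75]; the paper only specifies a band around 50%, not its exact width.

```python
from collections import Counter

def pseudo_label_and_filter(questions, solver, m=10, band=(0.25, 0.75)):
    """Pseudo-label each question by majority vote over m Solver samples and
    keep only questions whose empirical accuracy falls inside the band."""
    dataset = []
    for q in questions:
        answers = [solver.sample_answer(q) for _ in range(m)]  # hypothetical API
        label, count = Counter(answers).most_common(1)[0]
        p_hat = count / m
        if band[0] <= p_hat <= band[1]:  # drop too-easy, too-hard, or ambiguous questions
            dataset.append((q, label))
    return dataset
```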

Solver Training

The Solver is fine-tuned on the curated dataset using GRPO, with a binary reward: 1 if the generated answer matches the pseudo-label, 0 otherwise.
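In sketch form, the Solver's per-rollout reward reduces to an exact-match check against the pseudo-label; in practice answer extraction and normalization would precede the comparison, so this is a simplification.

```python
def solver_reward(generated_answer, pseudo_label):
    # 1 if the Solver's answer matches the majority-vote pseudo-label, else 0.
    return 1.0 if generated_answer == pseudo_label else 0.0
```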

Theoretical Motivation

The uncertainty-based reward is theoretically justified by curriculum learning theory: the KL divergence between the Solver's policy and the optimal policy is lower-bounded by the reward variance, which is maximized at 50% accuracy. Thus, the Challenger is incentivized to generate maximally informative tasks for the Solver.
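The intuition can be seen from a standard identity (this restates the motivation, not the paper's proof): if the Solver answers a question correctly with probability $p$, its per-rollout binary reward is Bernoulli($p$), whose variance peaks exactly at $p = 0.5$, which is where the uncertainty reward is maximal.

```latex
% Variance of the Solver's binary (Bernoulli) reward at success probability p:
\[
  \operatorname{Var}[r] = p(1-p),
  \qquad
  \arg\max_{p \in [0,1]} \, p(1-p) = \tfrac{1}{2}.
\]
```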

Experimental Results

Mathematical Reasoning

R-Zero was evaluated on multiple mathematical reasoning benchmarks (AMC, Minerva, MATH-500, GSM8K, Olympiad-Bench, AIME-2024/2025) using Qwen3 and OctoThinker models at 3B, 4B, and 8B scales. Across all architectures, R-Zero consistently improved performance over both the base model and a baseline where the Solver is trained on questions from an untrained Challenger.

  • Qwen3-8B-Base: Average score increased from 49.18 (base) to 54.69 (+5.51) after three R-Zero iterations.
  • Qwen3-4B-Base: Average score increased from 42.58 to 49.07 (+6.49).
  • OctoThinker-3B: Average score increased from 26.64 to 29.32 (+2.68).

Performance gains were monotonic across iterations, confirming the efficacy of the co-evolutionary curriculum.

General-Domain Reasoning

Despite being trained only on self-generated math problems, R-Zero-trained models exhibited significant transfer to general reasoning benchmarks (MMLU-Pro, SuperGPQA, BBEH):

  • Qwen3-8B-Base: General-domain average improved by +3.81 points.
  • OctoThinker-3B: General-domain average improved by +3.65 points.

This demonstrates that the reasoning skills acquired via R-Zero generalize beyond the training domain.

Analysis

Ablation Studies

Disabling any core component (RL-based Challenger, repetition penalty, or difficulty-based filtering) led to substantial performance drops. The largest degradation occurred when RL training for the Challenger was removed, underscoring the necessity of the co-evolutionary curriculum.

Evolution of Question Difficulty and Label Quality

As the Challenger improved, the difficulty of generated questions increased, evidenced by declining Solver accuracy on new question sets. However, the accuracy of pseudo-labels (as measured against GPT-4o) decreased from 79% to 63% over three iterations, indicating a trade-off between curriculum difficulty and label reliability.

Synergy with Supervised Fine-Tuning

R-Zero can be used as a mid-training method: models first improved by R-Zero achieved higher performance after subsequent supervised fine-tuning on labeled data, with an observed gain of +2.35 points over direct supervised training.

Implementation Considerations

  • Resource Requirements: Each iteration involves generating thousands of questions and multiple rollouts per question, requiring significant compute, especially for larger models.
  • Scalability: The framework is model-agnostic and demonstrated effectiveness across different architectures and scales.
  • Limitations: The approach is currently best suited for domains with objective correctness criteria (e.g., math). Extension to open-ended generative tasks remains an open challenge due to the lack of verifiable reward signals.

Implications and Future Directions

R-Zero demonstrates that LLMs can self-improve their reasoning abilities from scratch, without any human-provided data. This has significant implications for scalable, autonomous AI development, particularly in domains where labeled data is scarce or unavailable. The framework's transferability to general reasoning tasks suggests that self-evolving curricula can yield broadly capable models.

Future work should address the degradation of pseudo-label quality as task difficulty increases, explore more robust self-labeling mechanisms, and extend the framework to domains lacking objective evaluation criteria. Additionally, integrating model-based verifiers or leveraging external knowledge sources could further enhance the reliability and generality of self-evolving LLMs.

Conclusion

R-Zero establishes a fully autonomous, co-evolutionary framework for reasoning LLMs that requires zero external data. Through iterative Challenger-Solver interactions and uncertainty-driven curriculum generation, R-Zero achieves substantial improvements in both mathematical and general reasoning tasks. The approach is theoretically grounded, empirically validated, and provides a foundation for future research into self-improving, data-independent AI systems.
