- The paper introduces a self-play reasoning paradigm that enables models to learn autonomously by generating and solving their own tasks without human-curated data.
- It presents the Absolute Zero Reasoner (AZR) which uses a Python executor to validate and reward tasks in deduction, abduction, and induction modes.
- Experiments show AZR outperforms curated-data models on coding and math benchmarks, demonstrating improved scalability and cross-domain generalization.
The paper "Absolute Zero: Reinforced Self-play Reasoning with Zero Data" (2505.03335) introduces a novel paradigm called Absolute Zero for training reasoning models that learn and improve without relying on any external human-curated data. This approach addresses the scalability limitations of existing Reinforcement Learning with Verifiable Rewards (RLVR) methods, which still depend on manually collected datasets of questions and answers.
The core idea of the Absolute Zero paradigm is that a single model acts as both a task proposer and a solver. During training, the model generates learning tasks that maximize its own learning progress and then attempts to solve them. Learning is driven by verifiable feedback from an environment, replacing human supervision or learned reward models with grounded, objective signals.
The authors instantiate this paradigm with the Absolute Zero Reasoner (AZR), which pairs an LLM with a Python code executor acting as the environment. The executor serves a dual purpose: validating the integrity of proposed code-based reasoning tasks and verifying the correctness of the model's solutions.
AZR focuses on three fundamental modes of reasoning about a code triplet (program `p`, input `i`, output `o`), where `o = p(i)`:
- Deduction: Given `p` and `i`, predict `o`. This models step-by-step logical reasoning.
- Abduction: Given `p` and `o`, infer a plausible `i`. This resembles trial-and-error or search.
- Induction: Given a set of `(i, o)` pairs, synthesize `p`. This requires generalization.
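To make the three task views concrete, here is a minimal sketch (not the paper's actual prompt format; the example program and field names are hypothetical) of how one validated triplet yields a deduction, an abduction, and an induction task:

```python
# Hypothetical sketch: deriving the three task views from one validated triplet (p, i, o).
# The program is kept as source code so it can be shown to the model and re-executed.
p_src = "def f(x):\n    return 2 * x + 1"
i, o = 3, 7  # o = p(i)

deduction_query = {"program": p_src, "input": i}           # solver must predict o
abduction_query = {"program": p_src, "output": o}          # solver must propose an input i' with p(i') == o
induction_query = {"io_pairs": [(0, 1), (3, 7), (5, 11)],  # solver must synthesize a program p
                   "description": "an affine map over integers"}
```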
During the propose phase, AZR, conditioned on a task type and a small set of past self-generated examples, generates a program and an input `(p, i)`. The environment (Python executor) runs `p(i)` to obtain the output `o`. If execution succeeds and the program is deterministic and safe (checked against a forbidden list of modules such as `os` and `sys`), the triplet `(p, i, o)` is deemed valid and added to a task buffer. For induction, the proposer samples a program from the buffer and generates multiple inputs and a description; the environment executes the program on these inputs to obtain `(i, o)` pairs. The proposer receives a "learnability" reward, defined as 1 − r̄_solve when the solver's average success rate r̄_solve lies strictly between 0 and 1, and 0 when the task is solved 0% or 100% of the time; this favors tasks that are neither trivial nor currently impossible.
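A minimal sketch of this learnability reward, assuming r̄_solve is estimated by averaging binary solver outcomes over a handful of rollouts on the proposed task (the function name is illustrative):

```python
def learnability_reward(solver_successes: list[int]) -> float:
    """Proposer reward computed from the solver's empirical success rate on a proposed task.

    `solver_successes` holds binary outcomes (0 or 1) from several solver rollouts.
    Tasks the solver always or never solves earn nothing; harder-but-solvable tasks earn more.
    """
    r_bar = sum(solver_successes) / len(solver_successes)
    if r_bar in (0.0, 1.0):      # trivial or (currently) unsolvable task
        return 0.0
    return 1.0 - r_bar           # lower (but nonzero) solve rate -> higher proposer reward
```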
In the solve phase, AZR receives a task query (e.g., `(p, i)` for deduction) derived from the validated triplet and generates a solution `y`. The environment verifies the solution's correctness using the Python executor (e.g., checking whether `p(y) == o` for abduction). The solver receives a binary reward (1 for correct, 0 for incorrect). Both roles' rewards are combined with a format penalty.
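A sketch of what the executor-based checks could look like for the three modes, assuming each program defines a function `f` (the actual verification also handles sandboxing, timeouts, and output serialization; helper names here are hypothetical):

```python
def run(p_src: str, x):
    """Execute a program's source in a fresh namespace and apply its function `f` to x."""
    env: dict = {}
    exec(p_src, env)                      # the real system sandboxes and restricts imports
    return env["f"](x)

def verify_deduction(o, y) -> bool:       # y: solver's predicted output
    return y == o                         # gold o was computed by the executor at proposal time

def verify_abduction(p_src, o, y) -> bool:  # y: solver's proposed input
    return run(p_src, y) == o             # any input that reproduces o is accepted

def verify_induction(p_y_src, hidden_pairs) -> bool:  # p_y_src: solver's synthesized program
    return all(run(p_y_src, i) == o for i, o in hidden_pairs)
```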
The model is trained end-to-end using Reinforcement Learning. The paper proposes Task-Relative REINFORCE++ (TRR++), a variant of REINFORCE++ that computes separate advantage baselines for each task-role combination (propose/solve for deduction/abduction/induction) to reduce variance in this multitask setup.
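The following sketch illustrates the task-relative baseline idea behind TRR++ (not the full update rule; the grouping keys and normalization scheme are assumptions): advantages are computed against a separate baseline for each of the six task-role combinations rather than one global baseline.

```python
from collections import defaultdict
import statistics

def task_relative_advantages(rollouts: list[dict]) -> list[float]:
    """Compute advantages with a separate baseline per (task type, role) combination.

    Each rollout dict has 'task' in {'deduction', 'abduction', 'induction'},
    'role' in {'propose', 'solve'}, and a scalar 'reward'. Normalizing within
    each of the six groups reduces variance in this multitask setup.
    """
    groups = defaultdict(list)
    for r in rollouts:
        groups[(r["task"], r["role"])].append(r["reward"])

    advantages = []
    for r in rollouts:
        rewards = groups[(r["task"], r["role"])]
        mean = statistics.fmean(rewards)
        std = statistics.pstdev(rewards) or 1.0   # guard against zero variance
        advantages.append((r["reward"] - mean) / std)
    return advantages
```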
The training process starts from a single seed triplet, the identity program `(lambda x: x, 1, 1)`. All subsequent training data is generated by the model itself through the propose-solve loop. Validated tasks are stored in buffers and used as in-context examples for future task proposals to promote diversity.
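A sketch of how this bootstrapping could be seeded and how buffers feed reference examples back to the proposer (the per-type buffer layout and sampling policy are assumptions):

```python
import random

# The single seed triplet: the identity program applied to input 1 yields output 1.
SEED_TRIPLET = ("lambda x: x", 1, 1)

# One buffer per task type; each starts from the seed and grows with validated self-generated tasks.
buffers = {task: [SEED_TRIPLET] for task in ("deduction", "abduction", "induction")}

def sample_references(task_type: str, k: int = 3) -> list[tuple]:
    """Draw up to k past validated triplets to show the proposer as in-context examples."""
    buffer = buffers[task_type]
    return random.sample(buffer, min(k, len(buffer)))
```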
Practical Implementation Details:
- Environment: A Python interpreter is used for dynamic task validation (syntax, safety checks, and determinism verification by running the program twice) and answer verification (checking output equality for deduction, whether the proposed input reproduces the gold output for abduction, and whether the synthesized program passes hidden test cases for induction); a minimal sketch of the validation step appears after this list.
- Safety: A list of forbidden Python modules (e.g., `os`, `sys`, `shutil`) is used to filter proposed programs.
- Buffering: Separate buffers are maintained for each task type (deduction, abduction, induction) and populated with validated self-generated tasks. These buffers serve as a source of reference examples for the proposer.
- Task Preparation: For solving, parts of the validated triplet are provided as the query `x` (e.g., `(p, i)` for deduction, `(p, o)` for abduction, and a subset of `(i, o)` pairs plus a description for induction).
- Reward Structure: A composite reward combines the task-specific reward (`r_propose` or `r_solve`) with a penalty for formatting errors.
- Optimization: Trained using AdamW with a fixed learning rate, without KL penalties.
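As referenced above, here is a minimal sketch of the proposal-validation step, assuming programs are plain Python source defining a function `f`, a simple blocklist-based safety filter, and a determinism check via two independent executions (the real pipeline adds sandboxing and timeouts):

```python
FORBIDDEN_MODULES = ("os", "sys", "shutil")   # illustrative subset of the blocklist

def _run_once(p_src: str, i):
    """Execute the program source in a fresh namespace and apply its function `f` to i."""
    env: dict = {}
    exec(p_src, env)
    return env["f"](i)

def validate_proposal(p_src: str, i):
    """Return a validated (p, i, o) triplet, or None if the proposal is rejected."""
    # Safety: reject programs that import forbidden modules.
    if any(f"import {m}" in p_src for m in FORBIDDEN_MODULES):
        return None
    try:
        o1 = _run_once(p_src, i)      # first execution
        o2 = _run_once(p_src, i)      # second execution, to check determinism
    except Exception:
        return None                   # syntax or runtime errors invalidate the task
    if o1 != o2:
        return None                   # non-deterministic programs are rejected
    return (p_src, i, o1)
```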
Experimental Results and Applications:
The authors trained AZR on different base models, primarily Qwen2.5 variants (7B Base, 7B Coder, 14B Base, 14B Coder) and Llama3.1-8B. They evaluated performance on standard zero-shot coding (HumanEval+, MBPP+, LiveCodeBench) and mathematical reasoning benchmarks (AIME, AMC, MATH500, Minerva, OlympiadBench), which are out-of-distribution (OOD) relative to AZR's training environment (code triplets).
- State-of-the-Art with Zero Data: AZR-Coder-7B, trained with zero external data, achieved SOTA performance among zero-setting models on the combined average across OOD coding and math benchmarks, surpassing models trained on tens of thousands of human-curated domain-specific examples. It even outperformed curated-data models on coding benchmarks.
- Cross-Domain Generalization: AZR models showed significantly larger improvements on math benchmarks after training in the code environment compared to expert code models trained on curated code data, demonstrating strong generalized reasoning capability gains.
- Impact of Base Model: Initializing from a code-centric base model (Qwen-Coder) led to better final performance, suggesting that strong code priors can amplify the reasoning improvements from AZR training.
- Scaling: Performance gains scaled positively with model size (3B < 7B < 14B), indicating the approach is effective with larger models.
- Emergent Behaviors: Analysis revealed distinct reasoning strategies emerging for different task types (iterative trial-and-error for abduction, step-by-step simulation for deduction). Models naturally interleaved comments resembling "ReAct"-style planning within generated code during induction. Response token length grew during training at rates that varied by task type, most notably for abduction because of its trial-and-error process.
- Safety Concerns: An observed "uh-oh moment" in the Llama model (a concerning chain of thought) highlighted the need for future work on safety in this self-improving paradigm.
- Ablations: Studies showed that including all three task types (deduction, abduction, induction) and training the proposer role (especially conditioning on reference examples) were beneficial for overall performance.
Limitations and Future Work:
- The environment is currently limited to deterministic Python code execution. Future work could explore non-deterministic environments, web interaction, formal math systems, or real-world scenarios.
- Safety of self-improving systems needs further investigation, especially given the "uh-oh moment" observation.
- Exploring different initial distributions `p(z)` (seed data) and methods for dynamically defining the task validation function `f` are open areas.
- Advanced exploration and diversity rewards for both proposer and solver roles could potentially further enhance performance.
In conclusion, the Absolute Zero paradigm and its instantiation AZR present a compelling step towards reasoning models that can autonomously define their learning curriculum and improve through self-play grounded in an environment, potentially overcoming the data bottleneck of human-curated datasets and enabling continuous self-improvement.