- The paper introduces a self-play reasoning paradigm that enables models to learn autonomously by generating and solving their own tasks without human-curated data.
- It presents the Absolute Zero Reasoner (AZR) which uses a Python executor to validate and reward tasks in deduction, abduction, and induction modes.
- Experiments show AZR outperforms curated-data models on coding and math benchmarks, demonstrating improved scalability and cross-domain generalization.
The paper "Absolute Zero: Reinforced Self-play Reasoning with Zero Data" (2505.03335) introduces a novel paradigm called Absolute Zero for training reasoning models that learn and improve without relying on any external human-curated data. This approach addresses the scalability limitations of existing Reinforcement Learning with Verifiable Rewards (RLVR) methods, which still depend on manually collected datasets of questions and answers.
The core idea of the Absolute Zero paradigm is that a single model acts as both a task proposer and a solver. During training, the model generates learning tasks that maximize its own learning progress and then attempts to solve them. Learning is driven by verifiable feedback from an environment, replacing human supervision or learned reward models with grounded, objective signals.
The authors instantiate this paradigm with the Absolute Zero Reasoner (AZR), which pairs an LLM with a Python code executor acting as the environment. The executor serves a dual purpose: validating the integrity of proposed code-based reasoning tasks and verifying the correctness of the model's solutions.
AZR focuses on three fundamental modes of reasoning about a code triplet (program `p`, input `i`, output `o`), where `o = p(i)`:
- Deduction: Given `p` and `i`, predict `o`. This models step-by-step logical reasoning.
- Abduction: Given `p` and `o`, infer a plausible `i`. This resembles trial-and-error or search.
- Induction: Given a set of `(i, o)` pairs, synthesize `p`. This requires generalization.
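To make the three task views concrete, here is a minimal sketch (not the paper's actual prompt format; the example program and field names are hypothetical) of how one validated triplet yields a deduction, an abduction, and an induction task:

```python
# Hypothetical sketch: deriving the three task views from one validated triplet (p, i, o).
# The program is kept as source code so it can be shown to the model and re-executed.
p_src = "def f(x):\n    return 2 * x + 1"
i, o = 3, 7  # o = p(i)

deduction_query = {"program": p_src, "input": i}           # solver must predict o
abduction_query = {"program": p_src, "output": o}          # solver must propose an input i' with p(i') == o
induction_query = {"io_pairs": [(0, 1), (3, 7), (5, 11)],  # solver must synthesize a program p
                   "description": "an affine map over integers"}
```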
During the propose phase, AZR, conditioned on a task type and a small set of past self-generated examples, generates a program and an input `(p, i)`. The environment (Python executor) runs `p(i)` to obtain the output `o`. If execution succeeds and the program is deterministic and safe (checked against a forbidden list of modules such as `os` and `sys`), the triplet `(p, i, o)` is deemed valid and added to a task buffer. For induction, the proposer samples a program from the buffer and generates multiple inputs and a description; the environment executes the program on these inputs to obtain `(i, o)` pairs. The proposer receives a "learnability" reward, defined as 1 − r̄_solve when the solver's average success rate r̄_solve lies strictly between 0 and 1, and 0 when the task is solved 0% or 100% of the time; this favors tasks that are neither trivial nor currently impossible.
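A minimal sketch of this learnability reward, assuming r̄_solve is estimated by averaging binary solver outcomes over a handful of rollouts on the proposed task (the function name is illustrative):

```python
def learnability_reward(solver_successes: list[int]) -> float:
    """Proposer reward computed from the solver's empirical success rate on a proposed task.

    `solver_successes` holds binary outcomes (0 or 1) from several solver rollouts.
    Tasks the solver always or never solves earn nothing; harder-but-solvable tasks earn more.
    """
    r_bar = sum(solver_successes) / len(solver_successes)
    if r_bar in (0.0, 1.0):      # trivial or (currently) unsolvable task
        return 0.0
    return 1.0 - r_bar           # lower (but nonzero) solve rate -> higher proposer reward
```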
In the solve phase, AZR receives a task query (e.g., `(p, i)` for deduction) derived from the validated triplet and generates a solution `y`. The environment verifies the solution's correctness using the Python executor (e.g., checking whether `p(y) == o` for abduction). The solver receives a binary reward (1 for correct, 0 for incorrect). Both roles' rewards are combined with a format penalty.
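A sketch of what the executor-based checks could look like for the three modes, assuming each program defines a function `f` (the actual verification also handles sandboxing, timeouts, and output serialization; helper names here are hypothetical):

```python
def run(p_src: str, x):
    """Execute a program's source in a fresh namespace and apply its function `f` to x."""
    env: dict = {}
    exec(p_src, env)                      # the real system sandboxes and restricts imports
    return env["f"](x)

def verify_deduction(o, y) -> bool:       # y: solver's predicted output
    return y == o                         # gold o was computed by the executor at proposal time

def verify_abduction(p_src, o, y) -> bool:  # y: solver's proposed input
    return run(p_src, y) == o             # any input that reproduces o is accepted

def verify_induction(p_y_src, hidden_pairs) -> bool:  # p_y_src: solver's synthesized program
    return all(run(p_y_src, i) == o for i, o in hidden_pairs)
```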
The model is trained end-to-end using Reinforcement Learning. The paper proposes Task-Relative REINFORCE++ (TRR++), a variant of REINFORCE++ that computes separate advantage baselines for each task-role combination (propose/solve for deduction/abduction/induction) to reduce variance in this multitask setup.
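The following sketch illustrates the task-relative baseline idea behind TRR++ (not the full update rule; the grouping keys and normalization scheme are assumptions): advantages are computed against a separate baseline for each of the six task-role combinations rather than one global baseline.

```python
from collections import defaultdict
import statistics

def task_relative_advantages(rollouts: list[dict]) -> list[float]:
    """Compute advantages with a separate baseline per (task type, role) combination.

    Each rollout dict has 'task' in {'deduction', 'abduction', 'induction'},
    'role' in {'propose', 'solve'}, and a scalar 'reward'. Normalizing within
    each of the six groups reduces variance in this multitask setup.
    """
    groups = defaultdict(list)
    for r in rollouts:
        groups[(r["task"], r["role"])].append(r["reward"])

    advantages = []
    for r in rollouts:
        rewards = groups[(r["task"], r["role"])]
        mean = statistics.fmean(rewards)
        std = statistics.pstdev(rewards) or 1.0   # guard against zero variance
        advantages.append((r["reward"] - mean) / std)
    return advantages
```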
The training process starts from a single seed triplet, the identity program `(lambda x: x, 1, 1)`. All subsequent training data is generated by the model itself through the propose-solve loop. Validated tasks are stored in buffers and used as in-context examples for future task proposals to promote diversity.
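A sketch of how this bootstrapping could be seeded and how buffers feed reference examples back to the proposer (the per-type buffer layout and sampling policy are assumptions):

```python
import random

# The single seed triplet: the identity program applied to input 1 yields output 1.
SEED_TRIPLET = ("lambda x: x", 1, 1)

# One buffer per task type; each starts from the seed and grows with validated self-generated tasks.
buffers = {task: [SEED_TRIPLET] for task in ("deduction", "abduction", "induction")}

def sample_references(task_type: str, k: int = 3) -> list[tuple]:
    """Draw up to k past validated triplets to show the proposer as in-context examples."""
    buffer = buffers[task_type]
    return random.sample(buffer, min(k, len(buffer)))
```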
Practical Implementation Details:
- Environment: A Python interpreter is used for dynamic task validation (syntax, safety checks, and determinism verification by running the program twice) and answer verification (checking output equality for deduction, whether the proposed input reproduces the gold output for abduction, and whether the synthesized program passes hidden test cases for induction); a minimal sketch of the validation step appears after this list.
- Safety: A list of forbidden Python modules (e.g., `os`, `sys`, `shutil`) is used to filter proposed programs.
- Buffering: Separate buffers are maintained for each task type (deduction, abduction, induction) and populated with validated self-generated tasks. These buffers serve as a source of reference examples for the proposer.
- Task Preparation: For solving, parts of the validated triplet are provided as the query `x` (e.g., `(p, i)` for deduction, `(p, o)` for abduction, and a subset of `(i, o)` pairs plus a description for induction).
- Reward Structure: A composite reward combines the task-specific reward (`r_propose` or `r_solve`) with a penalty for formatting errors.
- Optimization: Trained using AdamW with a fixed learning rate, without KL penalties.
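As referenced above, here is a minimal sketch of the proposal-validation step, assuming programs are plain Python source defining a function `f`, a simple blocklist-based safety filter, and a determinism check via two independent executions (the real pipeline adds sandboxing and timeouts):

```python
FORBIDDEN_MODULES = ("os", "sys", "shutil")   # illustrative subset of the blocklist

def _run_once(p_src: str, i):
    """Execute the program source in a fresh namespace and apply its function `f` to i."""
    env: dict = {}
    exec(p_src, env)
    return env["f"](i)

def validate_proposal(p_src: str, i):
    """Return a validated (p, i, o) triplet, or None if the proposal is rejected."""
    # Safety: reject programs that import forbidden modules.
    if any(f"import {m}" in p_src for m in FORBIDDEN_MODULES):
        return None
    try:
        o1 = _run_once(p_src, i)      # first execution
        o2 = _run_once(p_src, i)      # second execution, to check determinism
    except Exception:
        return None                   # syntax or runtime errors invalidate the task
    if o1 != o2:
        return None                   # non-deterministic programs are rejected
    return (p_src, i, o1)
```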
Experimental Results and Applications:
The authors trained AZR on different base models, primarily Qwen2.5 variants (7B Base, 7B Coder, 14B Base, 14B Coder) and Llama3.1-8B. They evaluated performance on standard zero-shot coding (HumanEval+, MBPP+, LiveCodeBench) and mathematical reasoning benchmarks (AIME, AMC, MATH500, Minerva, OlympiadBench), which are out-of-distribution (OOD) relative to AZR's training environment (code triplets).
- State-of-the-Art with Zero Data: AZR-Coder-7B, trained with zero external data, achieved SOTA performance among zero-setting models on the combined average across OOD coding and math benchmarks, surpassing models trained on tens of thousands of human-curated domain-specific examples. It even outperformed curated-data models on coding benchmarks.
- Cross-Domain Generalization: AZR models showed significantly larger improvements on math benchmarks after training in the code environment compared to expert code models trained on curated code data, demonstrating strong generalized reasoning capability gains.
- Impact of Base Model: Initializing from a code-centric base model (Qwen-Coder) led to better final performance, suggesting that strong code priors can amplify the reasoning improvements from AZR training.
- Scaling: Performance gains scaled positively with model size (3B < 7B < 14B), indicating the approach is effective with larger models.
- Emergent Behaviors: Analysis revealed distinct reasoning strategies emerging for different task types (iterative trial-and-error for abduction, step-by-step simulation for deduction). Models naturally interleaved comments resembling "ReAct"-style planning within generated code during induction. Response token length grew during training at rates that varied by task type, most notably for abduction because of its trial-and-error process.
- Safety Concerns: An observed "uh-oh moment" in the Llama model (a concerning chain of thought) highlighted the need for future work on safety in this self-improving paradigm.
- Ablations: Studies showed that including all three task types (deduction, abduction, induction) and training the proposer role (especially conditioning on reference examples) were beneficial for overall performance.
Limitations and Future Work:
- The environment is currently limited to deterministic Python code execution. Future work could explore non-deterministic environments, web interaction, formal math systems, or real-world scenarios.
- Safety of self-improving systems needs further investigation, especially given the "uh-oh moment" observation.
- Exploring different initial distributions `p(z)` (seed data) and methods for dynamically defining the task validation function `f` are open areas.
- Advanced exploration and diversity rewards for both proposer and solver roles could potentially further enhance performance.
In conclusion, the Absolute Zero paradigm and its instantiation AZR present a compelling step towards reasoning models that can autonomously define their learning curriculum and improve through self-play grounded in an environment, potentially overcoming the data bottleneck of human-curated datasets and enabling continuous self-improvement.