Open-Reasoner-Zero: Scalable RL Reasoning Models
- Open-Reasoner-Zero is an open-source framework that employs reinforcement learning from a base LLM with minimalist binary rewards to enhance multi-step reasoning.
- It uses vanilla PPO and GAE with $\lambda = 1$ and $\gamma = 1$ alongside learned critic estimation to ensure stable, efficient policy updates without complex auxiliary losses.
- Empirical benchmarks reveal that the framework scales effectively across model sizes, achieving superior performance and emergent cognitive behaviors with fewer RL steps.
Open-Reasoner-Zero refers to a class of open-source, reinforcement learning (RL)–trained LLMs and reasoning frameworks that directly enhance multi-step reasoning without supervised fine-tuning, emphasize transparent methodology, and scale efficiently across model families and parameter regimes. The “Zero” in the moniker designates the paradigm of beginning RL training directly from the base model (the pretraining checkpoint), without intervening supervised instruction tuning. These systems constitute both a technical methodology and a blueprint for reproducible, community-driven reasoning models, serving as benchmarks as well as practical engines for advanced cognitive tasks.
1. Training Paradigm: Zero RL and Minimalist Reward Design
Open-Reasoner-Zero systems are characterized by the application of RL algorithms—vanilla Proximal Policy Optimization (PPO) with Generalized Advantage Estimation (GAE) at $\lambda = 1$ and $\gamma = 1$—initiated from a base LLM (e.g., Qwen2.5) without a supervised fine-tuning precursor (Hu et al., 31 Mar 2025). The primary reward function is straightforward: a binary signal based on the correctness of the final answer identified within the generated output, extracted via specific markers (e.g., <answer>...</answer>), with no format or stepwise rewards. This eschews the reward complexities and regularization heuristics commonly found in the RLHF and RLVR literature.
In formal notation, with reference answer $y^{*}$ and extracted answer $\hat{y}$,
$$
r(\hat{y}, y^{*}) =
\begin{cases}
1 & \text{if } \hat{y} = y^{*}, \\
0 & \text{otherwise.}
\end{cases}
$$
No KL-regularization is applied, in contrast to most RLHF recipes, which further simplifies and accelerates the policy updates (Hu et al., 31 Mar 2025).
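For concreteness, the following is a minimal Python sketch of this reward as described above; the exact-match comparison and whitespace normalization are simplifying assumptions rather than the released implementation:

```python
import re
from typing import Optional

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def extract_answer(generation: str) -> Optional[str]:
    """Return the content of the last <answer>...</answer> block, if any."""
    matches = ANSWER_RE.findall(generation)
    return matches[-1].strip() if matches else None


def binary_reward(generation: str, reference: str) -> float:
    """1.0 if the extracted answer exactly matches the reference, else 0.0.

    No format reward, no stepwise reward, and no KL penalty are added.
    """
    predicted = extract_answer(generation)
    if predicted is None:
        return 0.0
    return float(predicted == reference.strip())
```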
2. Algorithmic Details: PPO, GAE, and Critic Estimation
The Open-Reasoner-Zero methodology employs vanilla PPO for policy optimization, relying on a learned critic network for value estimation. GAE is set with $\lambda = 1$ and $\gamma = 1$, so the advantage for each action at time $t$ reduces to the difference between the rollout reward $R$ and the critic value $V(s_t)$:
$$
\hat{A}_t = R - V(s_t).
$$
The surrogate objective optimized is:
$$
\mathcal{L}^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_t\right)\right],
$$
where $r_t(\theta) = \pi_\theta(a_t \mid s_t)/\pi_{\theta_{\text{old}}}(a_t \mid s_t)$ is the per-token policy ratio and $\epsilon$ is typically set to $0.2$ (Hu et al., 31 Mar 2025). The critic is trained with mean squared error between predicted values $V(s_t)$ and the observed return $R$.
This minimalist, critic-centric approach yields robust and stable RL training dynamics without the need for auxiliary loss terms or heavy reference model lookups.
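As a hedged PyTorch sketch of these losses (tensor shapes, prompt-token masking, and loss aggregation are assumptions, not the reference implementation), the clipped surrogate and critic objectives might be written as:

```python
import torch
import torch.nn.functional as F


def ppo_losses(
    logprobs: torch.Tensor,      # log pi_theta(a_t | s_t) for response tokens, shape [T]
    old_logprobs: torch.Tensor,  # log pi_theta_old(a_t | s_t), shape [T]
    values: torch.Tensor,        # critic predictions V(s_t), shape [T]
    reward: float,               # terminal binary reward R for the rollout
    clip_eps: float = 0.2,
):
    # With gamma = lambda = 1 and a single terminal reward, every token's
    # return equals the rollout reward R, so the advantage is R - V(s_t).
    returns = torch.full_like(values, reward)
    advantages = (returns - values).detach()

    # Per-token policy ratio r_t(theta) and the clipped surrogate objective.
    ratio = torch.exp(logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()

    # The critic is regressed onto the observed return with MSE; no KL term.
    value_loss = F.mse_loss(values, returns)
    return policy_loss, value_loss
```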
3. Empirical Performance and Scaling
Open-Reasoner-Zero implementations have demonstrated that with this RL regime, both benchmark accuracy and response length can be scaled up rapidly and efficiently:
- On AIME2024, Open-Reasoner-Zero–32B achieves a score of 48.1, outperforming DeepSeek-R1-Zero-Qwen-32B (47.0).
- On MATH500, Open-Reasoner-Zero–32B scores 92.2 (vs. 91.6).
- On GPQA Diamond, Open-Reasoner-Zero–32B attains 55.5 (vs. 55.0).
Notably, these benchmark results are achieved with only one-tenth the number of RL steps required by previous strong pipelines (e.g., DeepSeek-R1-Zero) (Hu et al., 31 Mar 2025).
The system supports scaling across model sizes, from 0.5B to 32B parameters, and generalizes effectively, increasing both the quality and sophistication (length, self-reflection, and verification steps) of generated reasoning traces.
4. Design Choices Enabling Efficient Scaling
Critical design choices underpin the scalability and efficiency of Open-Reasoner-Zero:
- Reward Simplicity: The reward function avoids format-based constraints and focuses solely on extracted answer correctness, preventing reward hacking and overthinking/collapse in early RL stages (Zeng et al., 24 Mar 2025, Hu et al., 31 Mar 2025).
- No KL Regularization: Omitting KL terms removes the need for reference policy lookups and eliminates additional hyperparameters, further reducing computation and memory requirements.
- Difficulty-Aligned Curriculum: Problems are bucketed by difficulty (easy, medium, hard) so that training data stays aligned with model capability, preventing under- or over-exploration collapse (Zeng et al., 24 Mar 2025); see the sketch after this list.
- Efficient Critic: A learned critic, rather than group-normalized empirical rewards, supports robust and high-resolution advantage estimation that improves both exploration and update stability.
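As an illustrative sketch of the difficulty bucketing referenced above (the use of empirical pass rates and the threshold values are assumptions, not the published recipe), problems could be grouped by a model's measured solve rate:

```python
from typing import Dict, List


def bucket_by_difficulty(pass_rates: Dict[str, float]) -> Dict[str, List[str]]:
    """Group problem ids into easy/medium/hard buckets by empirical pass rate.

    pass_rates maps each problem id to the fraction of sampled rollouts that
    received reward 1. Thresholds below are illustrative, not published values.
    """
    buckets: Dict[str, List[str]] = {"easy": [], "medium": [], "hard": []}
    for problem_id, rate in pass_rates.items():
        if rate >= 0.7:
            buckets["easy"].append(problem_id)
        elif rate >= 0.2:
            buckets["medium"].append(problem_id)
        else:
            buckets["hard"].append(problem_id)
    return buckets
```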
5. Emergence of Cognitive Reasoning Behavior
Open-Reasoner-Zero models naturally exhibit the emergence of higher-order cognitive behaviors—such as verification (“aha moments”), backtracking, subgoal setting, and enumeration—through purely reward-driven exploration without explicit behavior supervision (Zeng et al., 24 Mar 2025). This is evident not only in large Qwen2.5-based models but also in other families (e.g., Llama, DeepSeek-Math), with reflective reasoning first observed in smaller, non-Qwen models in controlled experiments.
The increase in chain-of-thought length and the qualitative shift toward more rigorous self-checking and step-by-step solutions are signatures of this training paradigm.
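One simple way to quantify such behaviors, in the spirit of (but not identical to) the released analysis tools, is to count reflective phrases in generated traces; the marker list below is purely illustrative:

```python
REFLECTION_MARKERS = (
    "wait",              # hesitation / re-evaluation
    "let me check",      # verification
    "let me verify",
    "on second thought",
    "alternatively",     # backtracking to another approach
)


def count_reflections(trace: str) -> int:
    """Count occurrences of reflective phrases in a reasoning trace.

    The marker list is illustrative; the released analysis tools may rely on
    a different lexicon or a trained classifier.
    """
    lowered = trace.lower()
    return sum(lowered.count(marker) for marker in REFLECTION_MARKERS)
```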
6. Open Source Release and Reproducibility
Transparency and reproducibility are central to Open-Reasoner-Zero:
- All code, configuration files, curated training data, and full model weights (across all sizes) are publicly released (e.g., on GitHub and Hugging Face).
- Analysis tools for cognitive behavior, detailed training logs, and reported benchmark scripts accompany the models, supporting robust community validation and extension (Hu et al., 31 Mar 2025).
This open-source commitment democratizes reasoning research, lowers the barrier for scaling up RL for reasoning, and ensures the approach can serve as a reliable baseline for future studies.
7. Comparative Context and Methodological Impact
Compared to contemporaneous methods:
- DeepSeek-R1-Zero employs a similar zero-RL setup but with Group Relative Policy Optimization (GRPO), which may have issues with detecting repetitive responses and with the granularity of its value estimates.
- General-Reasoner (Ma et al., 20 May 2025) and ReasonBridge (Zhong et al., 28 Jun 2025) add more sophisticated verification or hierarchical distillation components, but Open-Reasoner-Zero demonstrates the sufficiency of minimalism for rapid and transparent scaling.
- The approach stands in contrast to RLHF pipelines heavily reliant on reward model alignment, KL regularization, and extensive preference annotation.
Methodologically, Open-Reasoner-Zero represents a baseline for scaling reasoning in open LLMs with a minimal set of ingredients, prioritizing simplicity, interpretability, and accessibility over incremental regularization or reward augmentation.
In sum, Open-Reasoner-Zero establishes a scalable, open, and efficient RL-first recipe for enhancing reasoning-capable LLMs, demonstrating that a minimalist, binary-reward RL setup—coupled with effective critic estimation and open resource release—suffices to achieve performance on par with more complex, less accessible frameworks (Hu et al., 31 Mar 2025, Zeng et al., 24 Mar 2025).