SimpleRL-Zoo: Zero RL for Base Models
- SimpleRL-Zoo is an open-source framework for zero RL that directly optimizes pretrained language models without supervised fine-tuning, enabling emergent chain-of-thought reasoning.
- It uses a length-rectified GRPO algorithm and a correctness-only reward design to improve pass@k performance and foster reflective, multi-step cognitive behaviors.
- Extensive experiments across models like Llama3, Mistral, and Qwen demonstrate enhanced reasoning, stable training dynamics, and effective adaptation to different task difficulties.
SimpleRL-Zoo is an open-source framework for investigating the paradigm of zero reinforcement learning (zero RL) applied to open base LLMs. Zero RL training denotes direct reinforcement learning (RL) optimization performed on a pretrained base model, excluding any intermediate supervised fine-tuning stages. The primary motivation behind SimpleRL-Zoo is to elucidate the generality and transferability of emergent chain-of-thought (CoT) reasoning and cognitive behaviors when using a simple rule-based reward structure, previously demonstrated in DeepSeek-R1 for very large models. The suite encompasses extensive empirical studies on a diverse "zoo" of ten open-source models from multiple families (Llama3, Mistral, DeepSeek, Qwen2.5) and a range of sizes (0.5B–32B parameters), employing a unified GRPO optimization methodology and controlled experimental protocol. The repository provides open-source code, checkpoints, and comprehensive analysis tools for further study (Zeng et al., 24 Mar 2025).
1. Zero RL Paradigm and Chain-of-Thought Emergence
Zero RL is formally defined as RL optimization starting from a pretrained base LLM with no prior supervised fine-tuning (SFT) on the target task. The key phenomenon under investigation is the spontaneous emergence of extended CoT reasoning and self-reflection capabilities ("aha moment") in model outputs when trained under a sparse, rule-based reward signal. For instance, DeepSeek-R1 showed that rewarding only the final correct answer (+1 for correct, 0 otherwise) allowed large models to develop long, multi-step reasoning sequences. SimpleRL-Zoo evaluates whether such behavior generalizes to smaller and more diverse base model architectures. A critical empirical insight is that large open models such as Qwen2.5 already possess strong instruction-following and self-reflection, potentially obscuring true zero RL effects. Tracking additional cognitive behaviors (verification, enumeration, backtracking) is essential for characterizing reasoning emergence.
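Schematically, the zero RL setup described above can be written as the following sparse-reward objective, initialized directly from the base model (a notational sketch for clarity; the symbols are not taken verbatim from the paper):

```latex
% Zero RL: optimization starts directly from the pretrained base policy (no SFT stage)
\theta_0 = \theta_{\text{base}}, \qquad
\max_{\theta}\ \mathbb{E}_{x \sim \mathcal{D},\ y \sim \pi_{\theta}(\cdot \mid x)}\big[R(x, y)\big],
\qquad
R(x, y) =
\begin{cases}
1, & \text{final answer in } y \text{ is correct},\\
0, & \text{otherwise}.
\end{cases}
```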
2. Implementation Architecture and Optimization
SimpleRL-Zoo employs a length-rectified version of the Group Relative Policy Optimization (GRPO) algorithm for policy updates. For each query $q$, a group of $G$ rollouts $\{o_1, \dots, o_G\}$ is sampled from the old policy $\pi_{\theta_{\text{old}}}$. The token-level importance ratio is defined as $r_{i,t}(\theta) = \pi_\theta(o_{i,t} \mid q, o_{i,<t}) / \pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,<t})$. The standardized advantage for each rollout is $\hat{A}_i = \big(R_i - \operatorname{mean}(\{R_j\}_{j=1}^{G})\big) / \operatorname{std}(\{R_j\}_{j=1}^{G})$, where $R_i$ is the scalar reward of rollout $o_i$. The clipped GRPO objective is:

$$\mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\min\!\Big(r_{i,t}(\theta)\,\hat{A}_i,\ \operatorname{clip}\big(r_{i,t}(\theta),\,1-\varepsilon,\,1+\varepsilon\big)\,\hat{A}_i\Big)\right] - \beta\,\mathbb{D}_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big),$$
where $\varepsilon$ is the clipping threshold; one rollout configuration is used for models up to 14B and another for larger models. Reward design is minimal: $R_i = 1$ if the final answer in $o_i$ is correct and $R_i = 0$ otherwise. The training system (built on verl/HybridFlow) uses a training batch size of 256, a prompt batch size of 1,024, up to 8,192 tokens per rollout (4,096 for Qwen-2.5-Math-7B), and a sampling temperature of 1.0. The number of rollouts per prompt is set to $G = 8$ by default. No strict format rewards are imposed, and KL regularization toward the reference policy is controlled by the coefficient $\beta$.
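To make the update concrete, the following is a minimal PyTorch sketch of the clipped, length-normalized GRPO loss with group-standardized advantages for a single query; it is illustrative rather than the repository's actual verl/HybridFlow implementation, and the tensor layout and `mask` convention are assumptions.

```python
import torch

def grpo_loss(logp_new, logp_old, rewards, mask, eps=0.2):
    """Minimal sketch of a clipped GRPO loss for one query.

    logp_new : (G, T) token log-probs under the current policy
    logp_old : (G, T) token log-probs under the rollout (old) policy
    rewards  : (G,)   scalar correctness rewards, one per rollout
    mask     : (G, T) 1.0 for generated tokens, 0.0 for padding
    eps      : clipping threshold
    """
    # Group-standardized advantage: one scalar per rollout, broadcast over tokens.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    adv = adv.unsqueeze(-1)                         # (G, 1)

    # Token-level importance ratio between current and old policy.
    ratio = torch.exp(logp_new - logp_old)          # (G, T)

    # PPO-style clipped surrogate, taking the pessimistic (min) branch.
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * adv
    per_token = torch.min(unclipped, clipped)

    # Length-normalize per rollout, average over the group, negate for minimization.
    per_rollout = (per_token * mask).sum(-1) / mask.sum(-1).clamp(min=1)
    return -per_rollout.mean()
```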
3. Key Experimental Design Strategies
SimpleRL-Zoo demonstrates several critical strategies for successful zero RL training:
- Correctness-Only Rewards: Strict format-based penalties (e.g., enforcing \boxed{…}) suppress exploration and risk training collapse in weaker models, so only correctness signals are used for reward (a minimal sketch follows this list).
- Difficulty Alignment: The training set is subdivided into Easy (GSM8K + MATH lv.1), Medium (MATH lv.1–4), and Hard (MATH lv.3–5) splits (~8K examples each). Model-specific difficulty alignment is essential: excessive difficulty causes learning collapse; insufficient difficulty fails to incentivize advanced reasoning.
- Prompt Tailoring: Weak instruction-followers (Llama3-8B, Mistral-7B, Qwen0.5B/1.5B) are trained with minimal "step-by-step" prompts, while more complex prompt structures are used for stronger models.
- Exploration Hyperparameters: Larger rollout sampling sizes (up to 32 responses per prompt) and increased sampling temperature stabilize training and enhance accuracy by promoting broader exploration of the policy space.
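As an example of the correctness-only reward referenced above, a minimal sketch of a rule-based checker is shown below; the answer-extraction heuristics and exact string comparison are simplifying assumptions, not the repository's verifier.

```python
import re

def correctness_reward(response: str, gold_answer: str) -> float:
    """Illustrative correctness-only reward: 1.0 if the extracted final
    answer matches the reference, 0.0 otherwise."""
    # Prefer a \boxed{...} answer if present, but do not require it
    # (no format penalty is applied when it is missing).
    boxed = re.findall(r"\\boxed\{([^{}]*)\}", response)
    if boxed:
        pred = boxed[-1]
    else:
        # Fall back to the last number-like token in the response.
        nums = re.findall(r"-?\d+(?:\.\d+)?", response)
        pred = nums[-1] if nums else ""
    return 1.0 if pred.strip() == gold_answer.strip() else 0.0
```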
4. Model Coverage, Benchmarks, and Evaluation Protocols
The suite evaluates zero RL across the following models: Llama-3.1-8B, DeepSeek-Math-7B, Mistral-v0.1-7B, Mistral-Small-24B-Base-2501, Qwen-2.5 (0.5B, 1.5B, 7B, 14B, 32B), and Qwen-2.5-Math-7B. Tasks and benchmarks are selected for both math reasoning and generalization:
- Training Data: GSM8K, MATH (total ~15.5K problems)
- Evaluation: GSM8K, MATH500, Minerva Math, OlympiadBench, AIME 2024, AMC 2023 (math); IFEVAL (instruction), MMLU (knowledge), GPQA-Diamond (science QA)
All models use matched hyperparameters for comparability.
5. Quantitative Results: Reasoning Gains and Training Dynamics
Substantial improvements are observed after zero RL training, as measured by pass@1 (the fraction of problems solved with a single sampled attempt) and generalization metrics:
| Model | Base Avg (%) | SimpleRL-Zoo Avg (%) | Δ |
|---|---|---|---|
| Llama-8B | 10.6 | 22.0 | +11.4 |
| DeepSeek-7B | 11.3 | 29.2 | +17.9 |
| Mistral-7B | 5.3 | 18.6 | +13.3 |
| Mistral-24B | 27.6 | 49.6 | +22.0 |
| Qwen-0.5B | 12.1 | 20.9 | +8.8 |
| Qwen-1.5B | 18.5 | 36.1 | +17.6 |
| Qwen-7B | 40.3 | 55.2 | +14.9 |
| Qwen-Math-7B | 37.2 | 59.5 | +22.3 |
| Qwen-14B | 43.2 | 56.8 | +13.6 |
| Qwen-32B | 45.9 | 61.9 | +16.0 |
On generalization benchmarks: Mistral-24B improves from 25.0% to 55.3% (+30.3), Llama-8B from 23.6% to 32.6% (+9.0), DeepSeek-7B from 18.7% to 34.1% (+15.4), Qwen-32B from 46.7% to 60.6% (+13.9).
Notably, post-RL pass@1 can exceed the base model's pass@k: for Mistral-24B, base pass@8 is ≈60% versus a post-RL pass@1 of ≈65%, consistent with genuine capability gains rather than mere reranking of solutions the base model could already sample. Average response length increases substantially (e.g., DeepSeek-7B from ~300 to ~1,200 tokens), yet longer outputs do not always coincide with the emergence of cognitive behaviors. The clipping ratio stays below 5% except in unstable models.
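The pass@k numbers quoted above can be computed with the standard unbiased combinatorial estimator; the generic utility below is shown for reference and is not necessarily the exact evaluation script used in the repository.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k samples,
    drawn without replacement from n generations of which c are correct,
    solves the problem."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 8 samples per problem, 3 of them correct.
print(pass_at_k(8, 3, 1))  # 0.375
print(pass_at_k(8, 3, 8))  # 1.0
```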
6. Cognitive Behavior Tracking and "Aha Moment" Emergence
SimpleRL-Zoo performs explicit cognitive behavior analysis using GPT-4o and the "Cognitive Behaviors" framework, labeling responses for subgoal setting, enumeration, verification, and backtracking. Model-specific patterns are revealed:
- Qwen-2.5 (7B/32B): Already strong reasoning, minimal change in verification/backtracking.
- Mistral-24B, Llama-8B, DeepSeek-7B: Verification & backtracking rise from ~0% to ≈50% of responses, indicating bona fide "aha moments."
- Qwen-0.5B/1.5B: Marked increases in subgoal setting and enumeration.
- Mistral-v0.1-7B: Unstable training, low behavior rates, collapse.
For example, after RL training, Mistral-24B evolves from producing straight-line CoT to exhibiting systematic root solving, constraint checking, and solution revision—demonstrating reflective reasoning.
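A minimal sketch of how such GPT-4o-based behavior labeling can be wired up is shown below; the prompt wording, behavior keys, and JSON-mode usage are illustrative assumptions rather than the paper's exact annotation rubric.

```python
import json
from openai import OpenAI

BEHAVIORS = ["subgoal_setting", "enumeration", "verification", "backtracking"]

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def label_behaviors(question: str, response: str) -> dict:
    """Ask an LLM judge which cognitive behaviors appear in a model response.

    Returns a dict mapping each behavior name to a boolean.
    """
    prompt = (
        "You are annotating reasoning traces. For the problem and model response "
        f"below, return a JSON object whose keys are exactly {BEHAVIORS} and whose "
        "values are true or false.\n\n"
        f"Problem:\n{question}\n\nModel response:\n{response}"
    )
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        response_format={"type": "json_object"},  # constrain the judge to JSON output
    )
    return json.loads(completion.choices[0].message.content)
```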
7. Best Practices, Implications, and Open Resources
Principal findings are as follows:
- Increased response length does not equate to advanced reasoning; behaviors must be tracked directly.
- Rigid format rewards constrain exploration; correctness-only reward fosters emergent cognition.
- Data difficulty must be matched to model capability to prevent collapse or stagnation.
- Applying zero RL directly yields improved reasoning; SFT on short CoT data prior to RL impedes exploration.
- Large rollout sampling sizes (up to 32) and higher exploration temperature enable stable learning.
- Zero RL reliably boosts accuracy by roughly 10–30 points across models and benchmarks, with evidence for genuine capability enhancement rather than superficial reranking.
SimpleRL-Zoo provides open-source code and resources (https://github.com/hkust-nlp/simpleRL-reason), including verl/HybridFlow training scripts, all model checkpoints, and cognitive behavior analysis notebooks to facilitate reproducibility and further study (Zeng et al., 24 Mar 2025).