
SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild (2503.18892v3)

Published 24 Mar 2025 in cs.LG, cs.AI, and cs.CL

Abstract: DeepSeek-R1 has shown that long chain-of-thought (CoT) reasoning can naturally emerge through a simple reinforcement learning (RL) framework with rule-based rewards, where the training may directly start from the base models-a paradigm referred to as zero RL training. Most recent efforts to reproduce zero RL training have primarily focused on the Qwen2.5 model series, which may not be representative as we find the base models already exhibit strong instruction-following and self-reflection abilities. In this work, we investigate zero RL training across 10 diverse base models, spanning different families and sizes including LLama3-8B, Mistral-7B/24B, DeepSeek-Math-7B, Qwen2.5-math-7B, and all Qwen2.5 models from 0.5B to 32B. Leveraging several key design strategies-such as adjusting format reward and controlling query difficulty-we achieve substantial improvements in both reasoning accuracy and response length across most settings. However, by carefully monitoring the training dynamics, we observe that different base models exhibit distinct patterns during training. For instance, the increased response length does not always correlate with the emergence of certain cognitive behaviors such as verification (i.e., the "aha moment"). Notably, we observe the "aha moment" for the first time in small models not from the Qwen family. We share the key designs that enable successful zero RL training, along with our findings and practices. To facilitate further research, we open-source the code, models, and analysis tools.


Summary

  • The paper presents zero RL training from pretrained open base models using rule-based rewards, yielding advanced reasoning capabilities like chain-of-thought.
  • It demonstrates model-specific dynamics with significant improvements on benchmarks including GSM8K, MATH, and AMC23.
  • It highlights that flexible reward formats and aligning data difficulty are key to effectively enhancing reasoning behavior.

SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild

Introduction

The paper "SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild" presents a comprehensive investigation of zero reinforcement learning (RL) training across diverse base models. Zero RL training, a paradigm in which training begins directly from pretrained base models with rule-based rewards, has been shown to facilitate the spontaneous emergence of advanced reasoning capabilities such as long chain-of-thought (CoT) and self-reflection. This paradigm is crucial for exploring the generalizable and scalable potential of open models.

Experimental Setup

The authors examine zero RL training across ten base models, including Llama3-8B, Mistral-7B/24B, DeepSeek-Math-7B, and the Qwen2.5 series (0.5B to 32B, plus Qwen2.5-Math-7B), under a consistent training setup. The experiments use GRPO as the RL algorithm and draw training queries from the GSM8K and MATH datasets, scored with rule-based rewards.
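For reference, GRPO computes advantages by normalizing each rollout's reward within its sampling group for the same prompt. Below is a minimal sketch of that group normalization, assuming binary correctness rewards; it is an illustration, not the authors' implementation.

```python
import statistics

def grpo_advantages(rewards, eps=1e-6):
    """Group-normalized advantages in the style of GRPO: each sampled
    response for the same prompt is scored relative to its rollout group."""
    mean_r = statistics.mean(rewards)
    std_r = statistics.pstdev(rewards)
    return [(r - mean_r) / (std_r + eps) for r in rewards]

# Example: 8 rollouts for one question, reward 1.0 when the final answer
# matches the reference and 0.0 otherwise.
print(grpo_advantages([1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0]))
```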

Key Observations:

  • Distinct training dynamics are observed across different models.
  • Response length increase does not always signify emergent reasoning capabilities.
  • Practical design strategies, such as a flexible (relaxed) format reward and query-difficulty adjustment, substantially improve training outcomes; a minimal reward sketch follows this list.
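To make the "flexible reward format" point concrete, the sketch below scores only final-answer correctness and does not punish a base model for missing a strict output template. The helper names (extract_final_answer, rule_based_reward) and the extraction heuristics are illustrative assumptions, not the paper's exact reward function.

```python
import re

def extract_final_answer(response: str):
    """Leniently extract a final answer: prefer \\boxed{...}, otherwise fall
    back to the last number instead of rejecting the sample outright."""
    boxed = re.findall(r"\\boxed\{([^{}]*)\}", response)
    if boxed:
        return boxed[-1].strip()
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    return numbers[-1] if numbers else None

def rule_based_reward(response: str, reference: str) -> float:
    """Binary correctness reward with a relaxed format requirement: samples
    are not penalized merely for skipping the expected answer wrapper."""
    answer = extract_final_answer(response)
    if answer is None:
        return 0.0
    return 1.0 if answer == reference.strip() else 0.0
```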

Main Results

The paper highlights notable improvements in both accuracy and response length across several benchmarks, including GSM8K and AMC23 (Figure 1).

Figure 1: Accuracy and response length across training iterations for different models, averaged on GSM8K, MATH500, Minerva Math, OlympiadBench, AIME24, and AMC23.

Generalization Performance:

The models trained with zero RL demonstrate enhanced performance on benchmarks such as IFEval, MMLU, and GPQA-Diamond, indicating strong generalization beyond the training datasets (Figure 2).

Figure 2: Pass@1 and Pass@8 accuracy over the training iterations of Mistral-Small-24B. The model is trained on the hard data (MATH levels 3–5).
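The Pass@1 and Pass@8 curves in Figure 2 estimate the probability that at least one of k sampled responses per question is correct. With n samples per question, of which c are correct, the standard unbiased estimator is 1 - C(n-c, k)/C(n, k). A small sketch of this estimator follows; it is independent of the paper's evaluation code.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn
    without replacement from n rollouts (c of them correct) is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 16 rollouts per question, 5 of them correct:
print(pass_at_k(16, 5, 1), pass_at_k(16, 5, 8))
```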

Behavioral Insights and Findings

Careful monitoring of model behavior reveals nuanced insights:

  • Reasoning Behaviors: The "aha moment" emerges for the first time in small models outside the Qwen family, such as Llama3-8B and DeepSeek-Math-7B, with significant growth in reasoning behaviors such as verification over training (Figure 3).

    Figure 3: The change in reasoning behavior over the training iterations across all models.

  • Data Difficulty: Aligning data difficulty with a model's intrinsic exploration capability is critical for effective training; models trained on more challenging data exhibit enhanced reasoning depth (Figure 4; a selection sketch follows this list).


Figure 4: Comparison of accuracy and response length across different data difficulty levels.
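The paper treats query-difficulty control as a key design choice; Figure 2's Mistral-Small-24B run, for example, trains on the harder MATH levels 3-5. One simple way to operationalize difficulty alignment, sketched below, is to estimate the base model's per-question pass rate from a few rollouts and keep questions within a target band. The sample_answers helper and the thresholds are hypothetical; this illustrates the idea rather than the authors' data pipeline.

```python
def select_training_questions(questions, sample_answers, n_samples=8,
                              min_rate=0.0, max_rate=0.9):
    """Keep questions whose estimated base-model pass rate lies in a target
    band, so query difficulty roughly matches the model's current ability.
    `sample_answers(q, n)` is a hypothetical helper returning n booleans,
    one per rollout, indicating whether the rollout solved question q."""
    selected = []
    for q in questions:
        outcomes = sample_answers(q, n_samples)
        rate = sum(outcomes) / n_samples
        if min_rate < rate <= max_rate:  # skip never-solved and trivial items
            selected.append(q)
    return selected
```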

Revisiting Supervised Fine-Tuning

A critical evaluation shows that starting RL training from models with prior supervised fine-tuning (SFT) may limit the emergence of advanced reasoning capabilities (Figure 5).

Figure 5: Reasoning behavior ratio over RL training iterations after using different SFT steps as starting points.

Starting from the base model preserves greater exploration potential, enabling more complex reasoning behavior to develop during RL training.
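Figure 5's reasoning behavior ratio tracks how often responses exhibit behaviors such as verification. The sketch below is a crude keyword-based proxy with an illustrative cue list; the paper's behavior analysis may use a more careful classifier, so treat this only as an approximation of how such a ratio could be computed.

```python
VERIFICATION_CUES = ("let me check", "let me verify", "wait,", "double-check")
# Illustrative cue list, not the paper's behavior taxonomy.

def reasoning_behavior_ratio(responses):
    """Fraction of responses containing at least one verification-style cue,
    a rough proxy for the behavior ratio plotted over training iterations."""
    def has_cue(text):
        lowered = text.lower()
        return any(cue in lowered for cue in VERIFICATION_CUES)
    return sum(has_cue(r) for r in responses) / max(len(responses), 1)
```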

Conclusion

The research provides strong evidence that zero RL training significantly enhances the reasoning capabilities of open base models. It underscores the necessity of carefully designed training protocols, including reward strategies and data-difficulty alignment, to harness these capabilities effectively. Future work may focus on training strategies that further exploit the exploratory potential of base models, facilitating advanced reasoning in real-world AI applications.
