An Empirical Study on Eliciting and Improving R1-like Reasoning Models (2503.04548v1)

Published 6 Mar 2025 in cs.CL

Abstract: In this report, we present the third technical report on the development of slow-thinking models as part of the STILL project. As the technical pathway becomes clearer, scaling RL training has become a central technique for implementing such reasoning models. We systematically experiment with and document the effects of various factors influencing RL training, conducting experiments on both base models and fine-tuned models. Specifically, we demonstrate that our RL training approach consistently improves the Qwen2.5-32B base models, enhancing both response length and test accuracy. Furthermore, we show that even when a model like DeepSeek-R1-Distill-Qwen-1.5B has already achieved a high performance level, it can be further refined through RL training, reaching an accuracy of 39.33% on AIME 2024. Beyond RL training, we also explore the use of tool manipulation, finding that it significantly boosts the reasoning performance of large reasoning models. This approach achieves a remarkable accuracy of 86.67% with greedy search on AIME 2024, underscoring its effectiveness in enhancing model capabilities. We release our resources at the STILL project website: https://github.com/RUCAIBox/Slow_Thinking_with_LLMs.

Authors (13)
  1. Zhipeng Chen (46 papers)
  2. Yingqian Min (14 papers)
  3. Beichen Zhang (27 papers)
  4. Jie Chen (602 papers)
  5. Jinhao Jiang (25 papers)
  6. Daixuan Cheng (8 papers)
  7. Wayne Xin Zhao (196 papers)
  8. Zheng Liu (312 papers)
  9. Xu Miao (2 papers)
  10. Yang Lu (158 papers)
  11. Lei Fang (38 papers)
  12. Zhongyuan Wang (105 papers)
  13. Ji-Rong Wen (299 papers)

Summary

This report details an empirical investigation into methods for eliciting and improving the reasoning capabilities of Large Reasoning Models (LRMs), specifically those exhibiting "slow thinking" characteristics akin to DeepSeek-R1. The paper focuses primarily on Reinforcement Learning (RL) and tool manipulation techniques applied to Qwen2.5-based models.

Reinforcement Learning for Reasoning Enhancement

The core of the paper involves leveraging RL, particularly with rule-based rewards, to enhance the multi-step reasoning abilities of LLMs. This approach circumvents the need for complex trained reward models by focusing on verifiable tasks, such as mathematics problems where correctness can be objectively determined.
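
As a concrete illustration, a minimal rule-based outcome reward might look like the sketch below. The \boxed{} convention is taken from the paper, while the normalized string comparison is an assumption standing in for whatever answer-matching logic the authors actually use.

```python
import re

def rule_based_reward(response: str, ground_truth: str) -> float:
    """Return 1.0 if the final \\boxed{...} answer matches the ground truth, else 0.0.

    Illustrative only: the paper's exact matching logic is not reproduced here,
    so a simple normalized string comparison is used.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    if not matches:
        return 0.0  # no final boxed answer was produced
    predicted = matches[-1].strip()
    return 1.0 if predicted == ground_truth.strip() else 0.0
```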

RL Methodology

  • Frameworks and Models: Experiments were conducted using the OpenRLHF and veRL frameworks. The primary models investigated were variants of Qwen2.5 (1.5B, 7B, 32B), including both base models and those already fine-tuned, such as the DeepSeek-R1-Distill series.
  • Training Data: A dataset comprising 90k verifiable mathematical problems was curated from sources like AIME, MATH, and NuminaMath. This dataset was filtered to ensure diversity, verifiability (excluding multiple-choice and proof-based questions), and appropriate difficulty levels, using model-based filtering.
  • Reward Function: The primary reward mechanism was a simple output reward: R = 1 if the final answer enclosed in \boxed{} matched the ground truth, and R = 0 otherwise. For base models, a format reward was also explored. Auxiliary rewards, such as length reward and action reward (Rewarding Reasoning Actions, RRA), were investigated, though caution was advised regarding potential "length hacking".
  • RL Parameter Tuning: The paper systematically explored the impact of various RL hyperparameters:
    • Train Batch Size (TBS): Larger TBS (e.g., 1024 vs. 128) was found to improve training efficiency and stability.
    • Learning Strategy: On-policy learning generally outperformed off-policy approaches, leading to better performance and more extensive exploration (indicated by increased response length).
    • Rollout Configuration: Increasing the number of rollouts (e.g., 64 vs. 8) and using higher sampling temperatures (e.g., T=1.2) generally enhanced exploration and performance, provided the model maintained coherent generation.
    • KL Regularization: A dynamic KL annealing strategy, where the KL coefficient decays over training, proved superior to fixed KL coefficients or no KL penalty, effectively balancing policy constraints with exploration needs (a minimal sketch of such a schedule follows this list).
    • Instructional Prompts: While detailed prompts improved reasoning efficiency (shorter responses for similar accuracy) on larger models (e.g., 7B), they did not consistently boost raw accuracy and could negatively impact smaller models (e.g., 1.5B).
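
A minimal sketch of a decaying KL coefficient of the kind described above; the linear shape and the initial/final values are illustrative assumptions, not the paper's exact settings.

```python
def kl_coefficient(step: int, total_steps: int,
                   kl_init: float = 1e-3, kl_final: float = 0.0) -> float:
    """Linearly anneal the KL penalty coefficient from kl_init to kl_final.

    Early in training the penalty keeps the policy close to the reference
    model; as training progresses the constraint is relaxed so the policy can
    explore longer, more varied reasoning trajectories. Values are illustrative.
    """
    progress = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return kl_init + (kl_final - kl_init) * progress


# In a PPO-style objective, the annealed coefficient typically scales a
# per-token KL term, e.g.:
#   shaped_reward_t = reward_t - kl_coefficient(step, total_steps) * kl_t
```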

RL Empirical Findings

  • Eliciting Reasoning in Base Models: RL training directly on a base model (Qwen2.5-32B) successfully elicited complex reasoning. This model, termed STILL-3-Zero-32B, saw its AIME 2024 accuracy dramatically increase from 2.08% (base) to 37.08% post-RL. This improvement was accompanied by a significant increase in average response length, suggesting the model learned to perform more extensive reasoning. Notably, complex reasoning patterns (e.g., involving verification, reflection, correction) emerged early and were reinforced during RL, indicating the activation of latent capabilities within the base model.
  • Refining Fine-tuned Models: RL also proved effective in further enhancing models already fine-tuned for reasoning. The STILL-3-1.5B model, initialized from DeepSeek-R1-Distill-Qwen-1.5B (AIME accuracy 28.67%), reached 39.33% accuracy on AIME 2024 after RL training. In some cases involving fine-tuned models, RL led to decreased response length alongside increased accuracy, suggesting improvements in reasoning efficiency rather than just verbosity.
  • Length Hacking: A critical observation was the detrimental effect of explicitly rewarding response length. This often led to "reward hacking," where models generated longer, potentially lower-quality or incomplete outputs, without a corresponding increase in problem-solving accuracy. The paper suggests that response length should be viewed as an emergent property of effective reasoning rather than a direct optimization objective. Rewarding specific reasoning actions (RRA) was found to be less prone to this issue than a simple length reward; a schematic illustration of this failure mode follows this list.
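
To illustrate why a direct length bonus invites reward hacking, consider a hypothetical shaped reward of the following form; the paper's actual auxiliary rewards differ, so this is only a schematic.

```python
def length_bonus_reward(is_correct: bool, response_tokens: int,
                        alpha: float = 1e-4) -> float:
    """Hypothetical shaped reward: correctness plus a per-token length bonus.

    The second term grows with response length regardless of quality, so a
    policy can raise its expected reward simply by padding outputs with
    repetitive or unfinished reasoning ("length hacking") without solving
    more problems. The paper instead treats length as an emergent property
    and, when shaping is desired, rewards specific reasoning actions (RRA).
    """
    return (1.0 if is_correct else 0.0) + alpha * response_tokens
```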

Tool Manipulation via Supervised Fine-Tuning

Beyond RL, the paper explored the integration of tool use, specifically a code interpreter, to augment reasoning capabilities. This was primarily achieved through Supervised Fine-Tuning (SFT).

SFT Methodology for Tool Use

  • Backbone Models: Tool use was primarily investigated using models pre-fine-tuned for reasoning, such as DeepSeek-R1-Distill-Qwen-32B.
  • Demonstration Data: SFT data was generated in two ways:

    1. Teacher Distillation: Prompting a powerful teacher model (DeepSeek-R1) to generate reasoning steps that included code execution via an interpreter.
    2. Heuristic Injection: Injecting code snippets heuristically into rollouts generated by the backbone model itself and then having the model complete the reasoning process incorporating the code.
  • Training: Standard SFT procedures were used to train the backbone models on this demonstration data, teaching them to invoke and utilize the code interpreter within their reasoning chain (a sketch of such an interleaved generate-execute loop appears below).
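
A minimal sketch of how such an interleaved reasoning-and-execution loop might work at inference time: the model generates until it emits a fenced Python snippet, the snippet is executed, and the output is appended to the context before generation resumes. The fence-based delimiters, the `generate` interface, and the subprocess "sandbox" are illustrative assumptions, not the paper's implementation.

```python
import re
import subprocess

# Pattern for a fenced Python snippet emitted inside the model's reasoning.
CODE_BLOCK = re.compile(r"```python\n(.*?)```", re.DOTALL)


def run_snippet(code: str, timeout: int = 10) -> str:
    """Execute a snippet in a subprocess and return its stdout/stderr (toy sandbox)."""
    result = subprocess.run(["python", "-c", code],
                            capture_output=True, text=True, timeout=timeout)
    return result.stdout + result.stderr


def reason_with_interpreter(generate, prompt: str, max_rounds: int = 8) -> str:
    """Interleave model generation with code execution.

    `generate(context)` is an assumed callable returning the model's next chunk
    of text, stopping either after a closing code fence or at the final answer.
    """
    context = prompt
    for _ in range(max_rounds):
        chunk = generate(context)
        context += chunk
        match = CODE_BLOCK.search(chunk)
        if match is None:
            break  # no tool call in this chunk: treat it as the final answer
        output = run_snippet(match.group(1))
        context += f"\n[interpreter output]\n{output}\n"
    return context
```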

Tool Use Empirical Findings

  • Performance Boost: SFT for tool manipulation yielded substantial performance gains. The STILL-3-Tool-32B model (based on DeepSeek-R1-Distill-Qwen-32B) achieved 86.67% accuracy on AIME 2024 using greedy search. This represents a significant improvement over the backbone model's 60.00% accuracy on the same task.
  • Data Efficiency: Relatively small amounts of high-quality demonstration data were sufficient to enable effective tool use. The paper notes that even 0.8k distilled instances could activate this capability.
  • Implicit Benefit: Interestingly, the mere act of generating code snippets during the reasoning process, even without actual execution, provided a performance benefit over the baseline model. This suggests that structuring thought processes in the form of code aids the model's reasoning, potentially by enforcing logical structure and intermediate computation steps.

Conclusion

This empirical paper demonstrates that Reinforcement Learning with rule-based rewards on verifiable tasks is a potent technique for both eliciting latent reasoning abilities in base LLMs and further refining already capable LRMs. Careful hyperparameter tuning, particularly regarding exploration strategies and KL regularization, is crucial. While increased response length often correlates with improved reasoning post-RL, directly optimizing for length can be counterproductive ("length hacking"). Furthermore, integrating tool use, specifically code execution taught via SFT on distilled or heuristically generated data, offers a highly effective pathway to substantially boost performance on complex reasoning tasks, achieving high accuracy levels on challenging benchmarks like AIME 2024.
