Stable Reinforcement Learning for Efficient Reasoning (2505.18086v1)

Published 23 May 2025 in cs.AI and cs.LG

Abstract: The success of Deepseek-R1 has drawn the LLM community's attention to reinforcement learning (RL) methods like GRPO. However, such rule-based 0/1 outcome reward methods lack the capability to regulate the intermediate reasoning processes during chain-of-thought (CoT) generation, leading to severe overthinking phenomena. In response, recent studies have designed reward functions to reinforce models' behaviors in producing shorter yet correct completions. Nevertheless, we observe that these length-penalty reward functions exacerbate RL training instability: as the completion length decreases, model accuracy abruptly collapses, often occurring early in training. To address this issue, we propose a simple yet effective solution GRPO-$\lambda$, an efficient and stabilized variant of GRPO, which dynamically adjusts the reward strategy by monitoring the correctness ratio among completions within each query-sampled group. A low correctness ratio indicates the need to avoid length penalty that compromises CoT quality, triggering a switch to length-agnostic 0/1 rewards that prioritize reasoning capability. A high ratio maintains length penalties to boost efficiency. Experimental results show that our approach avoids training instability caused by length penalty while maintaining the optimal accuracy-efficiency trade-off. On the GSM8K, GPQA, MATH-500, AMC 2023, and AIME 2024 benchmarks, it improves average accuracy by 1.48% while reducing CoT sequence length by 47.3%.

Summary

Stable Reinforcement Learning for Efficient Reasoning

The paper "Stable Reinforcement Learning for Efficient Reasoning" introduces a novel approach to addressing the complexities encountered in Reinforcement Learning (RL) systems for LLMs. Specifically, it focuses on the overthinking phenomena where reasoning models generate excessively long Chain-of-Thought (CoT) sequences, often to the detriment of accuracy and efficiency. The authors propose an innovative solution named GRPO-λ\lambda, which dynamically adjusts reward strategies based on monitored correctness ratios, aiming to balance reasoning accuracy with efficiency.

Overview of GRPO-λ

The conventional GRPO method incentivizes LLMs to produce longer CoT sequences because longer sequences statistically increase the likelihood of correct reasoning steps. However, this often leads to overthinking, characterized by shallow reasoning and frequent thought-switching. Attempts to counteract this with length-penalty reward functions have caused RL training instability, wherein decreasing sequence lengths trigger abrupt collapses in model accuracy. GRPO-λ addresses this by adjusting the reward strategy dynamically based on each group's observed correctness:

  • Adaptive Reward Strategy: The key innovation in GRPO-λ is switching reward strategies based on batch-wise top-λ selection. Each query in the training batch generates multiple candidate completions, and the correctness ratio within each group is evaluated. Groups showing high reasoning capability are subjected to a length penalty to prioritize efficiency, whereas less accurate groups retain standard GRPO's 0/1 outcome reward to preserve accuracy (see the sketch after this list).
  • Training Stability and Performance Enhancement: Experimental results from the GSM8K, GPQA, MATH-500, AMC 2023, and AIME 2024 benchmarks demonstrate that GRPO-λ ameliorates the stability issues linked with length-penalty methods. It improves average accuracy by 1.48% while achieving a significant reduction in CoT sequence length of 47.3%, facilitating at least 2.5× more viable training iterations.
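
The summary describes only the switching criterion, not the exact reward formulas, so the mechanism can be illustrated with a minimal sketch. In the following Python snippet, the function name grpo_lambda_rewards, the relative-length form of the penalty, and the 0.5 penalty scale are illustrative assumptions rather than the authors' exact formulation; only the batch-wise top-λ switch between length-penalized and length-agnostic 0/1 rewards follows the description above.

```python
import numpy as np

def grpo_lambda_rewards(groups, lam=0.5):
    """Illustrative GRPO-lambda reward switching (not the paper's exact formulas).

    groups: list of groups, one per query; each group is a list of
            (is_correct: bool, length: int) pairs for its sampled completions.
    lam:    fraction of groups (ranked by correctness ratio) that keep the
            length penalty; the remaining groups use plain 0/1 rewards.
    Returns a list of per-group, group-normalized advantages.
    """
    # Correctness ratio of each query-sampled group.
    ratios = np.array([np.mean([float(c) for c, _ in g]) for g in groups])

    # Batch-wise top-lambda selection: the most accurate groups are judged
    # capable enough to be pushed toward shorter completions.
    k = int(np.ceil(lam * len(groups)))
    penalized = set(np.argsort(-ratios)[:k].tolist())

    advantages = []
    for i, group in enumerate(groups):
        correct = np.array([float(c) for c, _ in group])
        lengths = np.array([float(l) for _, l in group])
        if i in penalized:
            # Length-penalized reward (placeholder form): correct completions
            # earn more the shorter they are relative to their group.
            span = lengths.max() - lengths.min()
            rel = (lengths - lengths.min()) / (span + 1e-8)
            rewards = correct * (1.0 - 0.5 * rel)
        else:
            # Length-agnostic 0/1 outcome reward (standard GRPO).
            rewards = correct
        # GRPO-style group normalization to obtain advantages.
        advantages.append((rewards - rewards.mean()) / (rewards.std() + 1e-8))
    return advantages


# Example: two queries, four completions each.
groups = [
    [(True, 120), (True, 340), (False, 500), (True, 210)],   # high correctness ratio
    [(False, 800), (False, 650), (True, 900), (False, 700)], # low correctness ratio
]
print(grpo_lambda_rewards(groups, lam=0.5))
```

In a real training loop these group-normalized advantages would feed GRPO's policy-gradient update; here λ (the lam argument) controls how many groups are pushed toward shorter completions, while low-accuracy groups keep the plain outcome reward so their reasoning quality is not compromised.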

Implications and Future Directions

The implications of adopting GRPO-λ are multifaceted, impacting both theoretical research and practical applications in AI.

  • Practical Implications: Maintaining high accuracy while compressing CoT sequences means models can operate effectively under resource constraints such as limited compute or latency budgets. This is paramount for real-world applications where AI systems must balance performance with operational efficiency.
  • Theoretical Insights: The paper highlights the critical nature of reward strategy design in RL frameworks, suggesting that overly aggressive length reductions need to be controlled. The adaptive approach proposed could shift perspectives on how reasoning models are trained, emphasizing the need for flexible reward mechanisms reminiscent of evolving human learning paradigms.
  • Future Research Directions: The insights gained through this paper suggest exploring other dynamic reward strategies and optimization parameters. For instance, investigating how various configurations of λ impact the balance of accuracy and efficiency could yield models that better adapt to distinct reasoning domains.

In conclusion, GRPO-λ represents a meaningful step towards more stable and efficient reinforcement learning for LLMs. By dynamically aligning the reward strategy with the model's current competence, it prevents the degradation of reasoning capabilities, promising improvements in both the accuracy and efficiency of AI systems.