Effective Reinforcement Learning for Reasoning in LLMs
The research presented in this paper investigates the application of reinforcement learning (RL) strategies to enhance the reasoning capabilities of language models (LMs). It highlights the distinctive needs of RL algorithms tailored to LMs, contrasting them with those designed for robotics, and systematically explores the design decisions that influence effectiveness and efficiency. By focusing on relatively small models due to computational constraints, the authors aim to provide insights into designing RL algorithms that improve LM reasoning.
The key findings underscore the advantages of particular RL strategies for LM reasoning. Notably, the paper demonstrates that on-policy RL significantly surpasses supervised fine-tuning (SFT), challenging the efficacy of SFT for reasoning in smaller models, which struggle to effectively mimic the reasoning of larger models or of humans. Within the space of on-policy approaches, the analysis further reveals that Proximal Policy Optimization (PPO) increases accuracy but may introduce higher variance, contrary to the conventional wisdom that PPO stabilizes training by reducing variance. Additionally, KL-divergence regularization tends to compromise performance, leading to less concise generations and reduced accuracy, a counterpoint to common practice in reinforcement learning for LMs.
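To make these design axes concrete, the following is a minimal sketch of an on-policy objective with the two knobs discussed above: PPO-style clipping of the importance ratio and an optional KL penalty toward a frozen reference model. The function name, tensor layout, and the use of the "k3" KL estimator are illustrative assumptions, not the paper's implementation.

```python
import torch

def ppo_kl_loss(logp_new, logp_old, advantages,
                logp_ref=None, kl_coef=0.0, clip_eps=0.2):
    """PPO-style clipped surrogate loss over per-token log-probabilities,
    with an optional KL penalty toward a frozen reference model.
    All tensors share the same shape, e.g. (batch, seq_len)."""
    ratio = torch.exp(logp_new - logp_old)                       # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    loss = -torch.min(unclipped, clipped).mean()                 # maximize the surrogate
    if logp_ref is not None and kl_coef > 0.0:
        # Low-variance per-token KL estimate ("k3") toward the reference policy;
        # setting kl_coef=0.0 recovers the unregularized variant the paper favors.
        kl = torch.exp(logp_ref - logp_new) - (logp_ref - logp_new) - 1.0
        loss = loss + kl_coef * kl.mean()
    return loss
```

Dropping the clipping and the KL term reduces this to a plain on-policy policy-gradient loss, which is the baseline the paper's comparisons are framed against.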
Central to these findings is the introduction of the DASH algorithm, which aims to combine computational efficiency with efficacy. DASH leverages preemptive sampling, in which a large batch is sampled for inference up front and gradients are then accumulated in smaller increments, reducing training time by 83% compared to standard GRPO implementations without sacrificing accuracy. Gradient filtering further reduces the computational load by discarding samples with near-zero advantage estimates, so updates are driven by the most informative samples.
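A minimal sketch of one such update step is shown below, assuming hypothetical callables `sample_large_batch` (the rollout generator) and `policy_loss_fn` (the per-micro-batch loss); it is an illustration of the two ideas described in the text, not the authors' code.

```python
import torch

def dash_update(sample_large_batch, policy_loss_fn, optimizer,
                micro_batch_size=8, adv_threshold=1e-4):
    """One DASH-style update: preemptive sampling, gradient filtering,
    and gradient accumulation in small increments (illustrative sketch)."""
    # 1) Preemptive sampling: run inference once over the whole large batch.
    with torch.no_grad():
        batch = sample_large_batch()   # -> list of samples, each with an "advantage" field

    # 2) Gradient filtering: drop samples whose advantage estimate is ~0,
    #    since they contribute negligible gradient signal.
    batch = [s for s in batch if abs(s["advantage"]) > adv_threshold]

    # 3) Gradient accumulation: backprop micro-batch by micro-batch, then
    #    take a single optimizer step over the accumulated gradients.
    num_micro = max(1, (len(batch) + micro_batch_size - 1) // micro_batch_size)
    optimizer.zero_grad()
    for i in range(0, len(batch), micro_batch_size):
        micro = batch[i:i + micro_batch_size]
        # policy_loss_fn is assumed to return the mean loss over `micro`;
        # dividing by num_micro approximates the mean over the full batch.
        loss = policy_loss_fn(micro) / num_micro
        loss.backward()
    optimizer.step()
```

The design choice here is that generation (the expensive, memory-friendly part) is done once at full batch size, while backpropagation (the memory-hungry part) is split into micro-batches, which is what allows large effective batches on limited hardware.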
The empirical analysis includes rigorous experimentation across math and coding domains, using datasets such as MATH, GSM8K, and MBPP+. These experiments corroborate the paper's claims by demonstrating DASH's ability to outperform traditional methods even under tight computational constraints. Critical factors such as batch size and training dynamics are carefully tuned, shedding light on the trade-offs between speed and accuracy in RL for LM reasoning.
From a practical standpoint, this research delineates pathways to more nuanced RL algorithm designs, which hold potential for enhancing LMs' intrinsic reasoning capabilities rather than merely improving prompt designs. The theoretical implications further emphasize the need for tailored RL strategies in language processing, distinct from those in traditional RL applications like robotics.
Future developments in the AI field may involve extending these findings to larger models and diverse architectures while maintaining computational efficiency. As the landscape of RL evolves, a continued focus on algorithmic tuning and batch optimization could yield substantial improvements in LM reasoning, potentially transforming existing paradigms in natural language processing and AI learning mechanisms.