- The paper introduces Adaptive Length Penalty (ALP), a reinforcement learning method that tailors token generation penalties based on prompt difficulty.
- The approach integrates seamlessly with group-based RL algorithms such as GRPO and Reinforce++, reducing average token usage by 50% in the DeepScaleR-1.5B model without performance loss.
- The research improves resource allocation in large reasoning models, addressing computational waste and scalability by dynamically adjusting inference length.
Efficient Reasoning with Adaptive Length Penalties Reinforcement Learning
The paper "Just Enough Thinking: Efficient Reasoning with Adaptive Length Penalties Reinforcement Learning" addresses the significant challenge posed by verbosity in Large Reasoning Models (LRMs) when solving both simple and complex reasoning tasks. These models typically show impressive performance on intricate reasoning tasks by generating a more substantial number of tokens during inference. However, the indiscriminate generation of lengthy reasoning traces even for straightforward problems leads to unnecessary computational costs and latency.
The authors propose Adaptive Length Penalty (ALP), a method that achieves efficient token usage by tailoring the length penalty applied to token generation to the difficulty of each prompt. ALP uses reinforcement learning to estimate each prompt's solve rate online during training and scales the magnitude of the length penalty accordingly. Prompts that are consistently solved across multiple attempts are deemed easy and incur a strong penalty for additional tokens, whereas harder prompts receive only a minimal penalty and remain free to use extended reasoning.
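A minimal sketch of what such solve-rate-scaled reward shaping could look like, assuming the penalty coefficient grows linearly with the group's empirical solve rate and that lengths are normalized by the generation limit. The function name, the linear scaling, and the `alpha` parameter are illustrative assumptions rather than the paper's exact formulation:

```python
import numpy as np

def alp_shaped_rewards(correct, lengths, max_len, alpha=0.1):
    """Shape rewards for one group of rollouts sampled from the same prompt.

    correct : 0/1 task rewards for each rollout of the prompt
    lengths : generated-token counts for each rollout
    max_len : generation limit, used to normalize lengths into [0, 1]
    alpha   : global strength of the length penalty (illustrative value)

    The group's empirical solve rate acts as an online difficulty proxy:
    easy prompts (high solve rate) receive a strong per-token penalty,
    while hard prompts (low solve rate) are left nearly free to think longer.
    """
    correct = np.asarray(correct, dtype=float)
    lengths = np.asarray(lengths, dtype=float)

    solve_rate = correct.mean()            # online difficulty estimate
    penalty_scale = alpha * solve_rate     # assumption: penalty grows with ease
    return correct - penalty_scale * (lengths / max_len)
```

For example, a prompt solved in 3 of 4 rollouts would have `solve_rate = 0.75`, so each extra token is penalized fairly heavily, while a prompt solved in 0 of 4 rollouts would incur no length penalty at all.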
A core element of the approach is its seamless integration with existing reinforcement learning algorithms that support group-based advantage estimation, such as GRPO, RLOO, and Reinforce++. Because the solve rate is estimated from the rollouts these algorithms already sample, the integration introduces no additional computational overhead. Applied to DeepScaleR-1.5B, ALP reduces average token usage by 50% without a noticeable decline in performance. It also redistributes the inference budget, trimming unnecessary computation on easy problems and reallocating the savings to more complex ones.
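To illustrate why the integration is essentially free, the sketch below shows the standard group-normalized advantage used by GRPO-style methods, with the shaped rewards from the previous sketch simply substituted for the raw task rewards. This is a generic illustration of group-based advantage estimation, not the paper's specific implementation:

```python
import numpy as np

def group_advantages(shaped_rewards, eps=1e-6):
    """GRPO-style advantages: normalize shaped rewards within one rollout group.

    The same group of rollouts supplies both the solve-rate estimate used by
    ALP and the statistics used here, so no extra model calls are needed.
    """
    r = np.asarray(shaped_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)
```

In practice, the output of `alp_shaped_rewards` would be passed directly to `group_advantages`, leaving the rest of the policy-gradient update unchanged.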
The implications of this research are both profound and practical. By enabling LRMs to adaptively allocate computational resources, ALP addresses the critical issue of computational waste in reasoning models. The capacity to discern prompt difficulty and adjust token generation accordingly has substantial benefits, particularly in scenarios where resource constraints are paramount. From a theoretical standpoint, this work contributes to ongoing discussions about the optimization of inference compute and the scalability challenges inherent in advanced LLMs.
Looking ahead, the methodology presents exciting opportunities for further exploration. For instance, the extension of adaptive length penalties to diverse domains beyond mathematics could enable broader application in natural language processing tasks, where prompt difficulty can vary significantly. Moreover, exploring the synergy between ALP and other machine learning paradigms may yield additional insights into efficient algorithm design. In summary, ALP represents progress in the quest for practical, scalable, and resourceful artificial intelligence solutions.