- The paper introduces DLER, a reinforcement learning framework that optimizes token usage with batch-wise reward normalization and dynamic sampling.
- It addresses biases in advantage estimation, entropy collapse, and sparse rewards, reducing token output by up to 70% without accuracy loss.
- Experimental results on benchmarks like MATH and Olympiad validate improved accuracy-length trade-offs and robust efficiency gains.
Incentivizing More Intelligence per Token via Reinforcement Learning
Introduction
The paper "DLER: Doing Length pEnalty Right - Incentivizing More Intelligence per Token via Reinforcement Learning" (2510.15110) introduces a novel framework to enhance the efficiency of reasoning LLMs by optimizing the reinforcement learning (RL) procedure, rather than relying on complex length penalties. Recent developments in RL for reasoning models have aimed to reduce the output lengths generated by models like DeepSeek-R1 and OpenAI-o1, addressing a key concern: maximizing intelligence per token. The proposed DLER approach implements batch-wise reward normalization, higher clipping, dynamic sampling, and a straightforward truncation length penalty, showcasing substantial improvements in accuracy-length trade-offs and operational efficiency.

Figure 1: DLER achieves state-of-the-art accuracy/length trade-offs, shortening CoT by up to 70% without losing accuracy.
Challenges in Reinforcement Learning Optimization
The paper identifies three significant challenges inherent in current RL techniques applied to reasoning models:
- Biased Advantage Estimation: Frequent zero-reward assignments (e.g., for truncated responses) inject noise into GRPO's group-wise reward statistics, biasing the resulting advantage estimates.
- Entropy Collapse: The output distribution loses diversity during training, severely limiting exploration of alternative reasoning paths that effective learning depends on.
- Sparse Reward Signals: Truncation produces a large share of zero rewards, so the model drifts toward easily solvable prompts and learns little from complex tasks.
To counter these problems, DLER pairs batch-wise reward normalization with dynamic sampling, which filters out prompts that provide no useful training signal and keeps learning effective.
Figure 2: Uniform batch-wise reward normalization maintains stability and accuracy, in contrast to GRPO's declining performance.
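To make the contrast concrete, below is a minimal sketch of group-wise (GRPO-style) versus batch-wise advantage normalization, assuming a binary correctness reward in which truncated responses score zero; the function names and toy rewards are illustrative, not the paper's code.

```python
import numpy as np

def grpo_advantages(rewards_per_prompt):
    """Group-wise (GRPO-style) advantages: each prompt's rollouts are normalized
    by that group's own mean and std. With many truncation-induced zero rewards,
    these per-group statistics become noisy and the advantages biased."""
    advantages = []
    for group in rewards_per_prompt:          # one group = rollouts of one prompt
        group = np.asarray(group, dtype=float)
        advantages.append((group - group.mean()) / (group.std() + 1e-8))
    return advantages

def batchwise_advantages(rewards_per_prompt):
    """Batch-wise normalization: pool every rollout in the batch and normalize
    against the batch mean and std, smoothing out the reward noise that any
    single prompt's truncated rollouts would otherwise inject."""
    flat = np.concatenate([np.asarray(g, dtype=float) for g in rewards_per_prompt])
    mean, std = flat.mean(), flat.std()
    return [(np.asarray(g, dtype=float) - mean) / (std + 1e-8)
            for g in rewards_per_prompt]

# Toy batch: two prompts, four rollouts each; zeros mark truncated or incorrect answers.
rewards = [[1.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 0.0]]
print(grpo_advantages(rewards))
print(batchwise_advantages(rewards))
```

Running the toy batch shows batch-wise normalization yielding less extreme and more uniform advantages across prompts than the group-wise version, which is the stabilizing effect Figure 2 refers to.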
DLER Training Recipe
DLER combines several elements to enhance RL optimization:
- Batch-wise Reward Normalization: Stabilizes advantage estimation, mitigating biases linked to high variance rewards.
- Higher Clipping Thresholds: Retains gradient propagation for high-entropy token updates, fostering exploration.
- Dynamic Sampling: Filters prompts to maintain balanced training signals, reducing the influence of skewed distributions.
Applied together, these components fully recover the baseline accuracy while drastically cutting token usage, yielding a markedly better accuracy-length trade-off; a minimal sketch of how the pieces fit together appears below.
Figure 3: DLER reduces token usage while recovering the accuracy lost when only plain batch-wise reward normalization is applied.
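To make the recipe concrete, here is a minimal sketch of how these components could be wired together inside a standard RL training loop, assuming a binary correctness reward with a truncation length penalty; the hyperparameter values (truncation budget, clip thresholds) and function names are illustrative assumptions rather than the paper's reported settings.

```python
import numpy as np

# Illustrative hyperparameters; the paper's exact values may differ.
TRUNCATION_BUDGET = 4096            # responses longer than this receive zero reward
CLIP_LOW, CLIP_HIGH = 0.2, 0.28     # asymmetric clipping with a raised upper threshold

def truncation_reward(is_correct, num_tokens, budget=TRUNCATION_BUDGET):
    """Simple truncation length penalty: reward 1 only if the answer is correct
    and fits within the token budget, otherwise 0."""
    return 1.0 if (is_correct and num_tokens <= budget) else 0.0

def dler_policy_loss(ratio, advantage):
    """PPO-style clipped surrogate with a higher upper clip threshold, so that
    low-probability (high-entropy) tokens with positive advantage still receive
    gradient and exploration is not choked off."""
    ratio, advantage = np.asarray(ratio), np.asarray(advantage)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - CLIP_LOW, 1.0 + CLIP_HIGH) * advantage
    return -np.minimum(unclipped, clipped).mean()

def dynamic_sampling_filter(prompt_rewards):
    """Keep only prompts whose rollouts are neither all correct nor all zero,
    so every retained prompt contributes a non-degenerate training signal."""
    return [g for g in prompt_rewards if 0.0 < np.mean(g) < 1.0]

# Example: the all-correct prompt is dropped, the mixed prompt is kept.
batch = [[1.0, 1.0, 1.0, 1.0], [1.0, 0.0, 0.0, 1.0]]
print(dynamic_sampling_filter(batch))
```

In this sketch, rollouts are scored with truncation_reward, advantages are normalized batch-wise as shown earlier, only prompts passing the dynamic-sampling filter contribute to dler_policy_loss, and the raised upper clip keeps gradients flowing to exploratory tokens.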
Difficulty-Aware DLER
The paper further extends the DLER framework with a difficulty-aware mechanism, DA-DLER, which dynamically adjusts truncation lengths based on prompt difficulty. This refinement allows the model to dedicate fewer tokens to simpler questions, thereby enhancing overall efficiency without sacrificing accuracy on more complex tasks.
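As an illustration of how such a schedule might work, the hypothetical sketch below estimates difficulty from rollout accuracy and tightens the truncation budget for easier prompts; the mapping and budget values are assumptions, not the paper's formula.

```python
def difficulty_aware_budget(rollout_accuracy, min_budget=1024, max_budget=4096):
    """Hypothetical DA-DLER-style schedule: prompts solved often (high rollout
    accuracy) get a tighter truncation budget, harder prompts keep a longer one."""
    # Linear interpolation: accuracy 1.0 -> min_budget, accuracy 0.0 -> max_budget.
    return int(max_budget - rollout_accuracy * (max_budget - min_budget))

# A prompt solved in 90% of rollouts gets a much shorter budget than one solved in 20%.
print(difficulty_aware_budget(0.9))   # 1331
print(difficulty_aware_budget(0.2))   # 3481
```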
Experimentation and Results
DLER models consistently outperform competing baselines, achieving higher reasoning efficiency across benchmarks such as MATH, AIME-24, and Olympiad Bench. These results validate the framework's robustness and illustrate the practical benefits of optimizing the RL recipe over designing complex penalties.
Figure 4: Ratio of prompts with zero rewards and prompt shortening patterns observed during training phases.
Implications and Future Directions
The introduction of DLER marks a significant shift towards optimizing reasoning efficiency by refining RL algorithms, rather than merely adjusting penalty designs. This perspective invites further exploration into adaptive RL strategies that can dynamically modulate resource allocation based on contextual difficulty. Moreover, the successful integration of update-selective merging strategies provides a pathway for leveraging publicly available datasets without compromising accuracy, offering a practical solution for deploying efficient reasoning models in resource-constrained settings.
Conclusion
The study reframes how reasoning efficiency in LLMs should be improved, emphasizing RL optimization techniques over complex length penalties. The findings show that careful adjustment of the RL procedure can substantially improve the trade-off between reasoning capability and token consumption. Future work can explore adaptive optimization algorithms that push the efficiency of reasoning models further, promising advances in both practical deployment and theoretical understanding.