Long2short RL: Efficient Reasoning Shaping

Updated 16 August 2025
  • Long2short RL Algorithm is a family of length-efficient reinforcement learning methods that shape rewards to encourage concise and accurate reasoning traces.
  • LASER and LASER-D introduce step-reward and adaptive target strategies to balance correctness with brevity based on task difficulty.
  • Experimental validations show significant token savings, up to 63%, while maintaining or improving performance across various reasoning benchmarks.

The Long2short RL Algorithm is a family of length-efficient reinforcement learning techniques designed for large reasoning models. These methods directly shape the reward function within the RL optimization process to encourage shorter, more efficient reasoning traces—reducing computational costs, memory footprint, and redundancy—while maintaining or even improving reasoning accuracy. Recent advances in this area, notably LASER (Length-bAsed StEp Reward) and LASER-D (Dynamic and Difficulty-aware LASER), have defined new paradigms for token-efficient, Pareto-optimal training of LLMs and reasoning agents (Liu et al., 21 May 2025, Yuan et al., 18 May 2025).

1. Unified Framework for Length Shaping in RL

At the core of the Long2short RL paradigm is a unified reward shaping framework where the reward $\hat{R}(x, y)$ for response $y$ to input $x$ is composed of a correctness term $C(y)$ and a length-based term $S(y)$ controlled by a (possibly dynamic) weight $\lambda(y)$:

$$\hat{R}(x, y) = C(y) + \lambda(y) \cdot S(y)$$

$C(y)$ typically measures task performance (e.g., exact answer matching), while $S(y)$ encodes incentives or penalties based on the token count of $y$. This abstraction captures a broad class of length-aware RL methods, from dense token-level penalties to adaptive, reward-bonus step functions.

This framework lets the reward shaping function modulate the trade-off between correctness and efficiency, so that sparsity, adaptivity, and context-awareness can be designed directly into the RL objective.
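
A minimal Python sketch of this abstraction is shown below; the callables `correctness`, `length_score`, and `weight` are hypothetical stand-ins for concrete choices of $C$, $S$, and $\lambda$, not names taken from the referenced papers.

```python
from typing import Callable

def shaped_reward(
    response: str,
    correctness: Callable[[str], float],   # C(y): task reward, e.g. exact answer match
    length_score: Callable[[str], float],  # S(y): length-based incentive or penalty
    weight: Callable[[str], float],        # lambda(y): possibly dynamic trade-off weight
) -> float:
    """Unified length-shaped reward: R_hat(x, y) = C(y) + lambda(y) * S(y)."""
    return correctness(response) + weight(response) * length_score(response)
```

Dense token-level penalties, step bonuses, and adaptive schemes all reduce to different choices of `length_score` and `weight` under this interface.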

2. Step Reward Shaping and the LASER Method

The LASER method introduces a step-function length reward that delivers a fixed bonus $\alpha$ if the response is no longer than a target threshold $L_T$:

$$S(y) = \alpha \cdot \mathbb{1}[L(y) \leq L_T]$$

where $L(y)$ is the number of tokens in $y$. This shapes the agent's policy to produce concise, correct outputs by granting a one-time reward for brevity, unlike dense penalties that risk truncating important reasoning steps or harming exploration (Liu et al., 21 May 2025). The LASER reward applies only when correctness is achieved, thereby avoiding any adverse effect on the model during earlier exploration and exploitation phases.

Step-function rewards effectively create a non-intrusive "gate," incentivizing succinctness only if accuracy can be preserved. This mechanism outperforms simple length truncation and naive token-level penalties in both reasoning accuracy and response efficiency.
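
As a sketch, the step reward reduces to a single threshold check gated on correctness; the default value of `alpha` below is illustrative rather than a value prescribed by the paper.

```python
def laser_length_reward(
    num_tokens: int,
    is_correct: bool,
    target_length: int,   # L_T: target length threshold
    alpha: float = 0.5,   # fixed brevity bonus (illustrative value)
) -> float:
    """LASER-style step reward: alpha * 1[L(y) <= L_T], applied only to correct responses."""
    if not is_correct:
        return 0.0
    return alpha if num_tokens <= target_length else 0.0
```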

3. Adaptive and Difficulty-aware Reward Shaping: LASER-D

LASER-D extends LASER through two dynamic reward specification strategies:

  • Dynamic Target Lengths: Rather than using a static $L_T$, LASER-D periodically recalibrates the target based on length statistics from a monitoring set, reflecting changes in model reasoning over the course of training.
  • Difficulty-aware Targets: LASER-D stratifies input queries into difficulty buckets (e.g., easy, medium, hard), assigning each a separate adaptive length target $L_A$. For each bucket, $L_A$ is the smallest length achieving sufficient coverage of correct responses (via Expected Correct Responses, $ECR_d = P_{l,d} \cdot |C_d|$).

This design ensures that easy problems receive strong incentives for brevity, while harder problems permit longer chains of thought. The reward landscape thus aligns with both learning dynamics and task difficulty, supporting optimal adaptation throughout training. LASER-D achieves a Pareto-optimal trade-off in experiments, boosting accuracy (e.g., +6.1 on AIME2024) while reducing average token count by up to 63% (Liu et al., 21 May 2025).
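
A rough sketch of how the per-bucket target $L_A$ could be recalibrated from monitoring rollouts is given below; the candidate length grid and the `coverage` threshold are assumptions made for illustration, not parameters taken from the paper.

```python
from collections import defaultdict

def difficulty_aware_targets(rollouts, candidate_lengths, coverage=0.9):
    """Pick the per-difficulty target L_A as the smallest candidate length whose
    expected correct responses ECR_d = P_{l,d} * |C_d| covers enough of bucket d.

    rollouts: iterable of (difficulty, num_tokens, is_correct) from the monitoring set.
    candidate_lengths: candidate length budgets, e.g. [512, 1024, 2048, 4096] (assumed grid).
    coverage: required fraction of each bucket's correct responses (assumed knob).
    """
    correct_lengths = defaultdict(list)
    for difficulty, num_tokens, is_correct in rollouts:
        if is_correct:
            correct_lengths[difficulty].append(num_tokens)

    targets = {}
    for difficulty, lengths in correct_lengths.items():
        n_correct = len(lengths)                      # |C_d|
        targets[difficulty] = max(candidate_lengths)  # fall back to the largest budget
        for budget in sorted(candidate_lengths):
            p_l_d = sum(n <= budget for n in lengths) / n_correct  # P_{l,d}
            if p_l_d * n_correct >= coverage * n_correct:          # ECR_d meets the coverage bar
                targets[difficulty] = budget
                break
    return targets
```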

4. Composite Designs: Length Nucleus, Correctness, and Accuracy-Aware Shaping

Advanced length-aware RL variants further refine the step reward paradigm:

  • Correctness-conditioned Shaping: Length penalties/rewards are only applied to correct responses, ensuring that model exploration is not prematurely curtailed by brevity constraints (Yuan et al., 18 May 2025).
  • Length Nucleus: A tolerance parameter $\tau_\ell$ defines a "nucleus", a range of lengths within which no penalty is applied. Penalties only activate for outputs exceeding the nucleus, preventing over-contraction of reasoning traces.
  • Accuracy-aware Activation: Length reward terms are activated only when model batch accuracy is close to the historical optimum (within $\tau_\text{acc}$). This ensures that brevity incentives are not imposed before the agent has reached sufficient proficiency, avoiding adverse impacts on exploration or learning stability.

The full length reward can be formalized as:

$$\text{reward}_\text{len}(i) = \begin{cases} \beta, & \text{if } r(x, y_i, y^*) > 0 \ \text{and}\ \text{acc} \geq \text{acc}_\text{max} - \tau_\text{acc} \\ 0, & \text{otherwise} \end{cases}$$

with

$$\beta = \begin{cases} \lambda, & \ell(i) > \ell_\text{min} + \tau_\ell \\ 0.5, & \text{otherwise} \end{cases} \qquad \lambda = 0.5 - \frac{\ell(i) - \ell_\text{min}}{\ell_\text{max} - \ell_\text{min}}$$

where $\ell_\text{min}$ and $\ell_\text{max}$ are calculated over correct responses only (Yuan et al., 18 May 2025).
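
A direct transcription of these cases into Python is sketched below; the argument names are ours, and the degenerate case $\ell_\text{max} = \ell_\text{min}$ is handled defensively.

```python
def composite_length_reward(
    length: int,        # l(i): token count of response i
    is_correct: bool,   # r(x, y_i, y*) > 0
    batch_acc: float,   # current batch accuracy
    best_acc: float,    # historical optimum acc_max
    len_min: int,       # l_min over correct responses in the batch
    len_max: int,       # l_max over correct responses in the batch
    tau_len: int,       # length-nucleus tolerance tau_l
    tau_acc: float,     # accuracy-activation tolerance tau_acc
) -> float:
    """Correctness-conditioned, nucleus-tolerant, accuracy-gated length reward."""
    # Accuracy-aware activation: only correct responses, and only once the batch
    # accuracy is within tau_acc of its historical optimum.
    if not is_correct or batch_acc < best_acc - tau_acc:
        return 0.0
    # Inside the length nucleus (within tau_len of the shortest correct response): flat bonus.
    if length <= len_min + tau_len or len_max == len_min:
        return 0.5
    # Outside the nucleus: linearly decaying reward, turning negative for the longest responses.
    return 0.5 - (length - len_min) / (len_max - len_min)
```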

5. Experimental Validation and Impact

Extensive experiments demonstrate that Long2short RL methods produce substantial gains in both efficiency and accuracy:

| Setting | Token Saving | Accuracy Δ | Benchmark |
|---|---|---|---|
| Logic Reasoning (Short-RL) | 40% | +14% | ppl (logic) |
| Math (Short-RL) | 33% | ≈0/+ | AIME, AMC, MATH-500 |
| LASER-D (AIME2024, 1.5B) | 63% | +6.1 | AIME2024 |

These metrics illustrate that well-structured length rewards compress the chain-of-thought reasoning trace without degrading (and in some cases improving) task performance (Liu et al., 21 May 2025, Yuan et al., 18 May 2025). Qualitative analyses further reveal that models trained with LASER-D generate reasoning traces with fewer redundant self-reflections and filler tokens, resulting in more structured and token-efficient solutions.

6. Broader Implications and Generalization

Long2short RL methods demonstrate that efficient reasoning can be incentivized within RL’s standard training loop, obviating the need for costly auxiliary compression or supervised fine-tuning stages. Adaptive and difficulty-aware designs further suggest that length incentives should be contextually modulated through training and across task types.

Experiments on out-of-domain benchmarks (e.g., GPQA, LSAT, MMLU) confirm that these length-shaping techniques generalize robustly to new reasoning settings, supporting their adoption in broader reasoning and language modeling applications. A plausible implication is that RL-based approaches to output efficiency can serve as an alternative or complement to heuristic max-token cutoffs, offering more granular control and greater performance robustness.

7. Resources and Reproducibility

LASER and LASER-D—including code, pre-trained models, datasets, and extended results—are publicly available at https://github.com/hkust-nlp/Laser, enabling further research and practical deployment (Liu et al., 21 May 2025). Detailed documentation covers reward schedule configuration, monitoring-based adaptation, and evaluation protocols.


In summary, the Long2short RL Algorithm unifies diverse reward shaping techniques for compressing reasoning chains in large models, with step-reward (LASER), adaptive/difficulty-aware (LASER-D), and composite correctness/length/accuracy-aware designs. These approaches have set new efficiency–accuracy baselines in RL for reasoning, revealing that output conciseness can be learned end-to-end in a reinforcement learning framework rather than externally imposed (Liu et al., 21 May 2025, Yuan et al., 18 May 2025).
