Leash: Adaptive Length Penalty & Reward Shaping

Updated 1 January 2026
  • Leash is an adaptive reinforcement learning framework that dynamically balances length penalties and rewards to optimize output brevity and accuracy.
  • It employs constrained optimization with dual ascent methods that update penalty coefficients based on token usage relative to a target length.
  • Empirical studies show that adaptive Leash reduces token count by 50–65% while preserving or improving task accuracy in both mathematical and broader reasoning tasks.

Leash (adaptive LEngth penAlty and reward SHaping) is a reinforcement learning framework for LLMs designed to compress verbose reasoning traces without compromising accuracy. Leash employs an adaptive, feedback-based approach to length penalty and reward shaping, maintaining a dynamic balance between task performance and computational efficiency. Operating in both mathematical and broader reasoning tasks, it achieves substantial reductions in average output length relative to traditional fixed-penalty baselines, using constrained optimization techniques and dual-ascent controllers to regulate verbosity (Li et al., 25 Dec 2025, Li et al., 25 Jun 2025, Su et al., 23 May 2025).

1. Constrained Optimization and Dual Variables

Leash formulates the goal of length-constrained reasoning as a constrained optimization problem. The target is to maximize the expected task reward $J_R(\theta)$ subject to a constraint on the average output length, $J_P(\theta) \le 0$. In its canonical form:

$$\max_\theta\; \mathbb{E}_{y\sim p_\theta}\left[r_{\rm task}(y;x)\right] \quad \text{s.t.} \quad \mathbb{E}_{y\sim p_\theta}\left[L(y)\right] \le L_t,$$

where $L(y)$ is the number of generated tokens and $L_t$ is the target maximum average length. Introducing a nonnegative Lagrange multiplier $\lambda$ yields the saddle-point objective

$$\min_{\lambda\ge 0}\,\max_\theta\; \mathcal{L}(\theta,\lambda), \quad \text{with} \quad \mathcal{L}(\theta,\lambda) = \mathbb{E}_{x,y\sim\pi_\theta}\!\left[ r(x,y) - \lambda\left(\frac{L(y)}{L_t}-1\right) \right].$$

This coupling of reward and constraint via dual variables enables explicit separation of accuracy and efficiency, and direct control over the token budget (Li et al., 25 Dec 2025).
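As a concrete illustration, the Lagrangian can be estimated by Monte Carlo averaging over sampled completions. The sketch below is illustrative only (the function and variable names are not from the cited papers), assuming per-sample task rewards and token counts are already available:

```python
import numpy as np

def lagrangian_estimate(rewards, lengths, lam, target_len):
    """Monte Carlo estimate of L(theta, lambda) from a batch of rollouts.

    rewards:    per-sample task rewards r(x, y)
    lengths:    per-sample token counts L(y)
    lam:        current dual variable (penalty coefficient), lambda >= 0
    target_len: target average length L_t
    """
    rewards = np.asarray(rewards, dtype=float)
    lengths = np.asarray(lengths, dtype=float)
    # Per-sample constraint term L(y)/L_t - 1; positive when over budget.
    violation = lengths / target_len - 1.0
    return float(np.mean(rewards - lam * violation))
```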

2. Adaptive Penalty Mechanism and Primal–Dual Training

Leash employs a primal–dual optimization scheme. Given the current penalty coefficient $\lambda_t$:

  • Primal update: Parameters $\theta$ are updated by gradient ascent on $\mathcal{L}(\theta, \lambda_t)$.
  • Dual update: $\lambda$ is incremented in proportion to the constraint violation, i.e., when average generation length exceeds $L_t$, $\lambda$ increases; if below $L_t$, $\lambda$ decreases. The update is clipped to $[0, \lambda_{\max}]$ for stability.

Explicitly,

$$\lambda_{t+1} = \mathrm{clip}\!\left[ \lambda_t + \eta_\lambda\, J_P(\theta_{t+1}),\; 0,\; \lambda_{\max} \right],$$

where $J_P(\theta) = \mathbb{E}_{x,y\sim\pi_\theta}\!\left[\frac{L(y)}{L_t} - 1 \right]$. The hyperparameter $\eta_\lambda$ controls the adaptation rate and is critical for avoiding oscillatory or unstable behavior (Li et al., 25 Dec 2025).
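A minimal sketch of this dual-ascent step, with illustrative (not paper-specified) names and defaults taken from the hyperparameter ranges reported in Section 4:

```python
def dual_update(lam, lengths, target_len, eta_lam=1e-3, lam_max=1.0):
    """One dual-ascent step: raise lambda when the average length exceeds
    the target, lower it otherwise, then clip to [0, lam_max]."""
    # Batch estimate of J_P(theta) = E[L(y)/L_t - 1].
    j_p = sum(l / target_len - 1.0 for l in lengths) / len(lengths)
    return min(max(lam + eta_lam * j_p, 0.0), lam_max)
```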

Alternative adaptive controllers, including direct reward balancing based on accuracy changes, implement similar feedback—tightening brevity constraints if current accuracy exceeds a reference value, and relaxing otherwise (Su et al., 23 May 2025). This structure generalizes fixed-penalty length reward shaping by rendering the penalty coefficient responsive to observed model behavior.
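One simple way such an accuracy-referenced controller could be realized is an additive update driven by the gap between current and reference accuracy. The sketch below is an assumption about the functional form, not the exact rule from the cited work:

```python
def accuracy_referenced_update(lam, batch_acc, ref_acc,
                               eta_lam=1e-3, lam_max=1.0):
    """Tighten the brevity constraint (raise lambda) when current accuracy
    meets or exceeds the reference value, relax it (lower lambda) otherwise.
    The additive form here is an illustrative assumption."""
    return min(max(lam + eta_lam * (batch_acc - ref_acc), 0.0), lam_max)
```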

3. Reward Shaping and One-Sided Length Penalties

To prevent degenerate solutions such as unnaturally terse output, Leash adopts a clipped, one-sided penalty. The penalty term is applied only when $L(y) > L_t$, leading to the sample-wise reward

$$R_{\rm shaped}(y;x) = r_{\rm task}(y;x) - \lambda\,\max\!\left(0,\; \frac{L(y)}{L_t} - 1 \right),$$

clipped to $[-1, 1]$. This avoids incentivizing excessively short traces and ensures that reward shaping only penalizes overshoot beyond the desired length threshold (Li et al., 25 Dec 2025). Alternative implementations may smooth the length penalty's activation using accuracy-aware scheduling, only enforcing brevity pressure when the model meets or exceeds a dynamic performance target (Li et al., 25 Jun 2025).
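The sample-wise shaped reward translates directly into code; a minimal sketch with illustrative names, not the authors' implementation:

```python
def shaped_reward(task_reward, length, lam, target_len):
    """One-sided, clipped length penalty: only completions longer than the
    target are penalized; the shaped reward is clipped to [-1, 1]."""
    overshoot = max(0.0, length / target_len - 1.0)
    return min(max(task_reward - lam * overshoot, -1.0), 1.0)
```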

4. Algorithmic Workflow

A representative Leash training loop alternates between RL (e.g., PPO- or GRPO-style) policy updates and dual updates of $\lambda$:

  1. Initialize $\theta$ and $\lambda$.
  2. Loop until convergence:
    • Sample batch prompts.
    • For each prompt, generate several rollouts, estimate task rewards, compute over-length penalties, and form shaped rewards.
    • Update the policy parameters $\theta$ using the chosen RL algorithm and advantage estimates.
    • Estimate the average constraint violation and update $\lambda$ accordingly, with clipping.
  3. Terminate once both average length and accuracy stabilize.

Typical RL hyperparameters: actor learning rate $\sim 10^{-6}$, penalty learning rate $\eta_\lambda \sim 10^{-3}$–$10^{-2}$, and $\lambda_{\max} = 1$. Validation and target accuracy for dynamic penalty activation can employ an exponential moving average, potential scheduling, or other smoothed estimators (Li et al., 25 Dec 2025, Li et al., 25 Jun 2025).
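Putting these pieces together, a schematic version of the loop is sketched below, reusing the `shaped_reward` and `dual_update` helpers from the earlier sketches. Here `sample_rollouts` and `policy_update` are caller-supplied placeholders for the underlying RL machinery (e.g., a PPO or GRPO step), not a specific library API:

```python
def train_leash(policy, prompts, sample_rollouts, policy_update, target_len,
                num_steps, eta_lam=1e-3, lam_max=1.0, rollouts_per_prompt=8):
    """Schematic Leash loop alternating policy (primal) and lambda (dual) updates.

    sample_rollouts(policy, prompt, n) -> (completions, task_rewards, lengths)
    policy_update(policy, prompt, completions, shaped_rewards) -> one RL step
    Both callables are placeholders for the chosen RL algorithm.
    """
    lam = 0.0
    for _ in range(num_steps):
        batch_lengths = []
        for prompt in prompts:
            completions, task_rewards, lengths = sample_rollouts(
                policy, prompt, rollouts_per_prompt)
            # One-sided, clipped shaped rewards (see the sketch in Section 3).
            shaped = [shaped_reward(r, l, lam, target_len)
                      for r, l in zip(task_rewards, lengths)]
            policy_update(policy, prompt, completions, shaped)  # primal step
            batch_lengths.extend(lengths)
        # Dual step: adjust lambda from the batch-level constraint violation.
        lam = dual_update(lam, batch_lengths, target_len, eta_lam, lam_max)
    return policy, lam
```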

5. Empirical Performance and Benchmarks

Leash has been evaluated across several LLMs and mathematical reasoning datasets, in both in-domain and transfer regimes.

| Model & Task | Method | Avg. Acc (%) | Avg. Tokens | Δ Tokens (%) | Δ Acc (pts) |
|---|---|---|---|---|---|
| DeepSeek-1.5B, $L_t = 4$k | Original | 33.1 | 15,727 | | |
| | Leash-C (fixed) | 31.5 | 6,635 | –57.8 | –1.6 |
| | Leash (adaptive) | 33.9 | 5,873 | –62.7 | +0.8 |
| Qwen3-4B, $L_t = 12$k | Original | 75.5 | 19,553 | | |
| | Leash-C (fixed) | 73.5 | 15,953 | –18.4 | –2.1 |
| | Leash (adaptive) | 74.6 | 14,428 | –26.2 | –1.0 |
| DeepSeek-1.5B, out-of-domain (GPQA, MMLU-Pro, etc.) | Original | 18.0 | 8,521 | | |
| | Leash-C (fixed) | 20.8 | 3,499 | –58.9 | +2.8 |
| | Leash (adaptive) | 21.2 | 3,064 | –64.0 | +3.2 |
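Δ Tokens and Δ Acc are measured relative to the corresponding Original row; for example, the adaptive variant's 5,873 tokens against the DeepSeek-1.5B baseline's 15,727 corresponds to a reduction of (5,873 − 15,727)/15,727 ≈ −62.7%.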

Empirically, adaptive Leash consistently reduces reasoning trace lengths by 50–65% while maintaining, or occasionally improving, accuracy relative to both baseline and fixed-penalty approaches (Li et al., 25 Dec 2025). On the MATH and GSM8k benchmarks, AALC and A-DLP (alternative Leash variants) reproduce these gains, yielding absolute accuracy reductions under 0.04 while at least halving the mean token count (Li et al., 25 Jun 2025, Su et al., 23 May 2025).

Ablations reveal that adaptivity confers faster convergence and better stability than fixed penalty schemes. Behavioral analysis demonstrates that Leash-trained models eliminate redundant subgoal setting and verification steps, producing concise, non-repetitive chains while still correctly solving complex reasoning tasks.

6. Qualitative Behavioral Effects and Interpretability

Leash's effect on model behavior is to selectively compress intermediate reasoning, prioritizing brevity especially once correctness is reliably established. Qualitative inspection shows that initial stages of finetuning preserve detailed step-by-step chains; as brevity pressure increases, models omit pedagogical framing and extraneous verification:

  • Example: On the input “How many vertical asymptotes does $y = 2/(x^2+x-6)$ have?”
    • Baseline: verbose factorization logic, explicit answer structure (~65 tokens).
    • Leash: compact factoring and direct answer report (~34 tokens) (Li et al., 25 Jun 2025).

Behavioral frequency analyses show stepwise reductions in “subgoal setting” and “verification,” correlating with reduced token count. However, this compression comes with reduced interpretability: narrative and explanatory context is stripped, which may hinder downstream user comprehension or educational uses (Li et al., 25 Jun 2025). A plausible implication is that efficiency-focused reward shaping imposes a trade-off with interpretive transparency.

7. Limitations and Future Directions

Leash's empirical successes are primarily reported on mathematical and structured reasoning tasks using medium-scale models (e.g., ≤4B parameters). Scaling to larger LLMs, open-ended dialogue, or multi-turn contexts is not thoroughly characterized. All variants assume a fixed or globally scheduled target length $L_t$, though difficulty-aware or adaptive per-task $L_t$ may further improve efficiency.

Reward shaping currently operates primarily on binary correctness; richer or graded reward functions reflecting partial correctness or relevance may yield finer-grained control (Li et al., 25 Dec 2025). Sensitivity to the penalty learning rate $\eta_\lambda$ and other hyperparameters is acute, with large values inducing oscillation and small values slowing adaptation (Li et al., 25 Dec 2025, Su et al., 23 May 2025). No sophisticated variance-reduction or smoothing mechanisms for dual-variable updates are yet standard; these could stabilize adaptive behavior, especially as model scale increases.

A plausible implication is that future extensions may exploit control-theoretic or Bayesian scheduling of penalty strength, integration with interpretable reward proxies, and context-aware length constraints to further optimize the efficacy–efficiency frontier.


Principal references: (Li et al., 25 Dec 2025, Li et al., 25 Jun 2025, Su et al., 23 May 2025).
