Selective-Advantage Entropy-Adaptive Horizon GRPO: Asymmetric Token-Level Discounting for Efficient Reinforcement Learning of Language Models

Published 3 Jun 2026 in cs.LG and cs.AI | (2606.05434v1)

Abstract: Group Relative Policy Optimisation (GRPO) has emerged as an effective reinforcement-learning algorithm for aligning LLMs on reasoning tasks, but it treats every token position and every sampled rollout symmetrically. We introduce two complementary extensions: (i) Adaptive-Horizon GRPO (AH-GRPO), which weights each token's policy gradient using a cumulative entropy-based discount that reduces the effective horizon when the model is uncertain, and (ii) Selective-Advantage AH-GRPO (SA-AH-GRPO), which applies this discounting only to negative-advantage rollouts, leaving positive-advantage, successful trajectories unattenuated. We evaluate standard GRPO with alpha = 0, AH-GRPO with alpha = 0.5, and SA-AH-GRPO with alpha = 0.5 on the GSM8K mathematical reasoning benchmark using both Qwen 2.5-1.5B-Instruct and Qwen 2.5-3B-Instruct fine-tuned with LoRA. On the 3B model, SA-AH-GRPO achieves Pass@1 = 0.858 at its peak at step 30 and maintains 0.846 at 180 steps, with training variance reduced to 0.0246, a 3.6 times reduction relative to GRPO while matching its peak accuracy. On the 1.5B model, SA-AH-GRPO achieves a peak Pass@1 of 0.686, improving over the zero-shot baseline of 0.637. Our analysis shows that asymmetric discounting preserves the full gradient signal on correct solutions, prevents entropy collapse, and substantially stabilises training, suggesting a principled inductive bias for reinforcement learning with verifiable rewards on structured generation tasks.


Authors (3)

Summary

The paper introduces SA-AH-GRPO, a method that applies entropy-based discounting only to negative-advantage rollouts, preserving full gradients on promising completions.
The approach significantly reduces training variance (up to 3.6×) while achieving high Pass@1 scores on both 1.5B and 3B scale models in RL fine-tuning.
It leverages per-token normalized entropy to dynamically adjust gradient credit assignment, promoting stable training and improved sample efficiency in language model optimization.

Introduction and Motivation

This work systematically investigates the limitations of symmetric, uniform token-level weighting in RL-from-verifiable-reward (RLVR) LLM fine-tuning, particularly with Group Relative Policy Optimisation (GRPO). The authors introduce two methodologically principled extensions—Adaptive-Horizon GRPO (AH-GRPO) and its asymmetric variant, Selective-Advantage AH-GRPO (SA-AH-GRPO). Both leverage the model's predictive uncertainty, quantified by per-token normalized entropy, to re-weight the policy gradient loss dynamically. Unlike prior approaches, SA-AH-GRPO restricts this entropy-based discounting to rollouts with negative (group-normalized) advantage, preserving the full gradient signal on promising completions. The result is a significant reduction in training variance (especially pronounced at larger scale), aligning with theoretical expectations from variance reduction and decision theory in RL.

Methodology

The central insight is that the informativeness of each token position during autoregressive generation is contextually variable. When the model is confident (low entropy), the sampled token represents a high-fidelity local mode of the distribution, supporting robust policy updates. When uncertainty is high (high entropy), a single sampled token is a weak proxy, and enforcing aggressive updates can destabilize training. AH-GRPO introduces an exponentially decaying discount to token-level gradients as a function of cumulative entropy, shortening the effective credit assignment horizon at uncertain positions:

$w_t^{(i)} = \prod_{s=1}^{t} \exp(-\alpha H_s^{(i)})$

where $H_t^{(i)}$ is the model's normalized entropy at position $t$ in rollout $i$, and $\alpha$ is a tunable hyperparameter controlling the degree of discounting.
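As a rough illustration of the weight definition above, the following NumPy sketch computes per-token normalized entropy and the cumulative discount for a single rollout. The function names are hypothetical, and dividing by $\log V$ (with $V$ the vocabulary size) is an assumption: the summary says only that the entropy is "normalized". The product of exponentials is evaluated through the equivalent form $\exp(-\alpha \sum_{s \le t} H_s)$.

```python
import numpy as np

def normalized_entropy(logits):
    """Per-token entropy scaled to [0, 1] for one rollout.

    logits: array of shape (T, V). Dividing by log(V) is an assumed
    normalization; the source does not specify the exact scheme.
    """
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    probs = np.exp(log_probs)
    entropy = -(probs * log_probs).sum(axis=-1)
    return entropy / np.log(logits.shape[-1])

def horizon_weights(entropy, alpha):
    """w_t = prod_{s<=t} exp(-alpha * H_s), computed as exp(-alpha * cumsum(H))."""
    return np.exp(-alpha * np.cumsum(entropy, axis=-1))

# Illustrative usage with random logits (T = 5 tokens, V = 16 vocabulary entries).
rng = np.random.default_rng(0)
H = normalized_entropy(rng.normal(size=(5, 16)))
w = horizon_weights(H, alpha=0.5)
print(w)  # non-increasing in t, since every H_s >= 0
```

Setting `alpha=0` gives weights of 1 at every position, which corresponds to the uniform weighting of standard GRPO.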

In SA-AH-GRPO, this discount is selectively applied—limited only to rollouts with negative group-normalized advantage. Thus, positive-advantage (correct) trajectories propagate an unaltered gradient through all token positions.
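The selective rule can be sketched as a mask over rollouts. The example below is a minimal illustration, not the authors' implementation: the helper names are hypothetical, the mean/standard-deviation normalization is the common GRPO formulation and is assumed here, and padding masks are ignored for brevity.

```python
import numpy as np

def group_normalized_advantage(rewards, eps=1e-8):
    """Assumed GRPO-style advantage: (r - mean) / (std + eps) within one group."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def selective_weights(entropy, advantages, alpha):
    """Token weights for a group of rollouts.

    entropy:    (G, T) normalized per-token entropies.
    advantages: (G,)  group-normalized advantages.
    Rollouts with negative advantage receive the cumulative entropy
    discount; all others (including zero advantage) keep weight 1.
    """
    entropy = np.asarray(entropy, dtype=float)
    discount = np.exp(-alpha * np.cumsum(entropy, axis=-1))
    negative = (np.asarray(advantages) < 0)[:, None]  # broadcast over tokens
    return np.where(negative, discount, 1.0)

# Illustrative group of 4 rollouts with binary (verifiable) rewards.
adv = group_normalized_advantage([1.0, 0.0, 0.0, 1.0])
H = np.full((4, 3), 0.5)  # hypothetical constant entropy
weights = selective_weights(H, adv, alpha=0.5)
print(weights)
```

In a training loop these weights would multiply the per-token policy-gradient terms before they are summed; how that product is combined with clipping or a KL penalty is not specified in this summary.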

Main Results

The empirical evaluation is conducted on mathematical reasoning (GSM8K), testing Qwen2.5-1.5B-Instruct and Qwen2.5-3B-Instruct models fine-tuned with LoRA. The three variants (GRPO, AH-GRPO, SA-AH-GRPO) are compared in terms of Pass@1, training variance, and KL divergence.

On the 3B model, all methods achieve high final Pass@1 scores (0.846–0.862), outperforming the zero-shot baseline. SA-AH-GRPO reduces training variance by 3.6× compared to standard GRPO with no loss in peak accuracy. On the 1.5B model, where the zero-shot baseline is lower (Pass@1 of 0.637), the best accuracy is observed with SA-AH-GRPO: a peak Pass@1 of 0.686, a gain of 4.9 percentage points over that baseline.

The variance reduction with SA-AH-GRPO is especially notable at the 3B scale, while at 1.5B it yields superior accuracy with higher reward variation, likely due to increased policy plasticity and more aggressive exploration (mirrored in higher mean KL). An $\alpha$-ablation study on the 1.5B model shows that all positive $\alpha$ values outperform both the baseline and all negative (entropy-amplifying) settings, supporting the central design hypothesis.

Figure 1: Pass@1 on GSM8K (3B model) for GRPO; shows convergence and performance under uniform weighting.

Figure 2: Pass@1 for SA-AH-GRPO ($\alpha=0.5$) on 3B; demonstrates sustained accuracy and reduced variance over training.

Figure 3: Pass@1 for SA-AH-GRPO ($\alpha=0.5$) on 1.5B; shows consistent gains over baseline and increasing accuracy.

Figure 4: $\alpha$-ablation of AH-GRPO; positive $\alpha$ consistently outperforms the entropy-amplifying (negative $\alpha$) regime.

Analysis and Theoretical Implications

SA-AH-GRPO introduces a form of inductive bias: correct trajectories, regardless of local uncertainty, propagate full gradient credit, supporting robust model improvement. By contrast, erroneous completions with high-entropy positions have their gradients attenuated, preventing spurious updates from ambiguous contexts—a well-aligned intervention in the RLVR setting where supervision is outcome-grounded.

This asymmetric approach avoids over-penalization of tokens that, while uncertain, contributed to successful completions, addressing a fundamental inefficiency in prior symmetric techniques. The entropy-adaptive discount also implicitly implements a curriculum: as model confidence increases during training, the effective horizon lengthens, annealing the regularisation effect and naturally transitioning to standard GRPO near convergence.
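The curriculum effect can be made concrete with a short worked case. As a simplifying assumption (not taken from the paper), suppose the normalized entropy is roughly constant along a rollout, $H_s^{(i)} \approx \bar{H}$. The cumulative discount then reduces to a plain exponential in the token index:

$w_t^{(i)} = \prod_{s=1}^{t} \exp(-\alpha \bar{H}) = \exp(-\alpha \bar{H} t)$

The position at which the weight falls to $1/e$, which serves here as an informal measure of the effective horizon, is

$t_{1/e} = \frac{1}{\alpha \bar{H}}$

With $\alpha = 0.5$, an uncertain model with $\bar{H} = 0.2$ has $t_{1/e} = 10$ tokens, while a confident model with $\bar{H} = 0.02$ has $t_{1/e} = 100$ tokens. As $\bar{H} \to 0$, $w_t^{(i)} \to 1$ at every position, which is the uniform weighting of standard GRPO.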

Relation to Prior Work

Prior work on RLHF and RLVR (e.g., PPO, process rewards, per-token RL) either relies on auxiliary reward models or fixed heuristics, or does not adapt the loss weighting to model uncertainty. Entropy regularisation has typically been formulated as explicit maximization (e.g., maximum entropy RL) rather than uncertainty-weighted gradient discounting. The present approach distinguishes itself by deriving token-level weights analytically from the model's own predictive distribution, and by introducing outcome-conditioned, per-rollout asymmetry.

Practical and Future Implications

The demonstrated variance reduction at scale has direct implications for improved training stability and decreased fine-tuning cost. The methods are broadly applicable: per-token discounting mechanisms are orthogonal to reward shaping and can be deployed in RLVR settings with verifiable rewards, multi-step reasoning, or code generation. The natural curriculum inherent in the entropy-based approach is likely to generalize robustly to settings where exploration-exploitation balance is challenging.

Extensions include scaling to even larger architectures (e.g., 7B, 70B) and diverse task domains (code synthesis, logic). A full joint sweep over $\alpha$ and asymmetry is warranted. The implications for sample complexity, convergence rate, and optimality in low-resource regimes merit dedicated study.

Conclusion

The paper introduces and rigorously validates entropy-adaptive, asymmetric token-level discounting (SA-AH-GRPO) as an effective variance-reduction and stability mechanism for RL-based LLM fine-tuning with verifiable rewards. The results support the thesis that per-token uncertainty should guide gradient credit assignment, especially in autoregressive structured prediction. The work sets a new methodological baseline for RLVR training protocols and informs the design of scalable, stable RL algorithms for LLMs.
