
Entropy Gated Selective Policy Optimization (EGSPO)

Updated 10 February 2026
  • The paper introduces EGSPO, a token-level gradient gating framework that uses predictive entropy to focus RL updates on high-uncertainty decision points.
  • It employs a three-stage approach—SFT warm-up, RL rollout, and selective policy optimization—to enhance mathematical reasoning performance.
  • Empirical results demonstrate significant improvements in AIME and MATH tasks with reduced convergence epochs and modest computational overhead.

Entropy Gated Selective Policy Optimization (EGSPO) is a hybrid policy optimization framework for LLMs that implements token-level gradient allocation based on predictive entropy. EGSPO is designed to address limitations of conventional hybrid supervised fine-tuning (SFT) and reinforcement learning (RL) approaches, enabling fine-grained credit assignment and stable training by modulating RL updates at the token level according to uncertainty. The three-stage EGSPO framework achieves statistically significant improvements over baseline hybrid methods, particularly in mathematical reasoning tasks, while incurring only modest computational overhead (Hu et al., 3 Feb 2026).

1. Motivation and Problem Context

Hybrid SFT+RL methods, such as CHORD-φ and MIX-CHORD, conventionally alternate between sample-level SFT and RL losses for model-generated responses. This strategy applies either a purely supervised or RL-based objective uniformly to an entire trajectory $y = (y_1, \ldots, y_T)$, ignoring intra-trajectory uncertainty variation. Specifically, it fails to differentiate between near-deterministic “procedural” tokens and high-uncertainty decision points. The consequences are:

  • Inefficient credit assignment: Rewards or penalties are propagated uniformly to all tokens, regardless of their contribution to the final outcome.
  • Training instability and mode collapse: Applying RL to deterministic tokens can impair factual recall, while excess SFT on uncertain tokens suppresses exploration and discovery.

EGSPO introduces token-level gradient routing based on predictive entropy, aiming to efficiently focus RL on critical decision points while maintaining the stability of procedural tokens (Hu et al., 3 Feb 2026).

2. Formalism

2.1 Loss Functions

  • Supervised Fine-Tuning (SFT) Loss: For model $\pi_\theta(y_t|x, y_{<t})$ and reference token $y^*_t$,

\mathcal{L}_\text{SFT}(y_t) = -\log \pi_\theta (y^*_t | x, y_{<t}).

  • Reinforcement Learning PPO Surrogate Loss: For token-level RL,

\mathcal{L}_{\text{PPO},t}(\theta) = -\min \left( \rho_t(\theta) A_t, \ \operatorname{clip}(\rho_t(\theta), 1-\epsilon, 1+\epsilon) A_t \right),

where $\rho_t(\theta) = \pi_\theta(y_t|x, y_{<t}) / \pi_{\theta_\text{old}}(y_t|x, y_{<t})$, $A_t$ is the advantage from a process reward model (PRM), and $\epsilon$ is the PPO clipping hyperparameter.
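As an illustration, the clipped surrogate above can be computed per token with NumPy (the function name and the toy inputs are ours, not from the paper):

```python
import numpy as np

def ppo_token_loss(logp_new, logp_old, advantages, eps=0.2):
    """Clipped PPO surrogate loss per token, negated for minimization."""
    rho = np.exp(logp_new - logp_old)                     # importance ratio rho_t
    unclipped = rho * advantages
    clipped = np.clip(rho, 1.0 - eps, 1.0 + eps) * advantages
    return -np.minimum(unclipped, clipped)                # per-token loss

# toy example: one token with positive and one with negative advantage
loss = ppo_token_loss(np.log([0.5, 0.2]), np.log([0.4, 0.25]), np.array([1.0, -0.5]))
```

The first token's ratio (1.25) is clipped to 1.2 before multiplying its advantage, which is exactly the mechanism that bounds per-token policy movement.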

2.2 Predictive Entropy and Gating

  • Token-Level Predictive Entropy:

H_t = - \sum_{w \in V} p_\theta(w|x, y_{<t}) \log p_\theta(w|x, y_{<t})

  • Gating Function: For entropy threshold $\tau$ and attenuation factor $\alpha \in [0,1]$,

g(H_t) = \begin{cases} 1, & H_t > \tau \\ \alpha, & H_t \leq \tau \end{cases}

Typically, $\tau$ is set so that the top 15% highest-entropy tokens in a batch pass the gate; $\alpha$ is chosen in the range $0.1$–$0.3$.

  • EGSPO Gradient:

\nabla_\theta \mathcal{L}_{\text{EGSPO}} = \mathbb{E}_t \left[ g(H_t) \cdot \nabla_\theta \mathcal{L}_{\text{PPO},t}(\theta) \right]

The SFT loss can also be enforced on low-entropy tokens, and a minimum SFT token fraction $\beta_\text{min}$ avoids excessive forgetting.
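A minimal NumPy sketch of the gating and the combined objective, assuming the SFT loss is added on low-entropy tokens alongside the attenuated PPO term (the function names, the exact combination rule, and the $\beta_\text{min}$ re-assignment logic are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def token_entropy(probs):
    """Predictive entropy H_t from next-token distributions, shape (T, V)."""
    p = np.clip(probs, 1e-12, 1.0)   # guard against log(0)
    return -(p * np.log(p)).sum(axis=-1)

def egspo_token_loss(ppo_loss, sft_loss, H, top_frac=0.15, alpha=0.1, beta_min=0.2):
    """Entropy-gated per-token objective over one batch of tokens."""
    tau = np.quantile(H, 1.0 - top_frac)      # batch-percentile threshold
    gate = np.where(H > tau, 1.0, alpha)      # g(H_t) in {1, alpha}
    sft_mask = H <= tau                       # SFT enforced on low-entropy tokens
    # enforce the minimum SFT token fraction beta_min on the lowest-entropy tokens
    need = int(np.ceil(beta_min * len(H)))
    if sft_mask.sum() < need:
        sft_mask = np.zeros_like(sft_mask)
        sft_mask[np.argsort(H)[:need]] = True
    per_token = gate * ppo_loss + np.where(sft_mask, sft_loss, 0.0)
    return per_token.mean()
```

High-entropy tokens thus receive the full PPO gradient, while low-entropy tokens receive an $\alpha$-attenuated PPO gradient plus the stabilizing SFT term.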

3. Three-Stage EGSPO Workflow

  1. SFT Expert Learning (Warm-Up):
    • Initialize model parameters $\theta$ via SFT on the demonstration set for $N_1$ epochs to stabilize procedural knowledge.
  2. RL Rollout Generation:
    • Generate $K$ rollouts per prompt at temperature $T = 1.0$.
    • For each token $y_t$ in the trajectory:
      • Compute $H_t$.
      • Obtain the PRM correctness probability $r_\text{PRM}(y_{\leq t})$ and compute the token reward $r_t = r_\text{PRM}(y_{\leq t}) - r_\text{PRM}(y_{\leq t-1})$.
    • Store $(x, y_{1:T}, \{H_t\}, \{r_t\})$.
  3. Entropy-Gated Selective Policy Optimization:
    • Compute the $\tau$-percentile threshold over $\{H_t\}$.
    • Assign gating weights $g(H_t)$.
    • Compute token-level advantages $A_t$ using generalized advantage estimation (GAE) on $r_t$ and the value function $V$.
    • Compute and apply the EGSPO gradient update.

A minimum SFT fraction $\beta_\text{min}$ is enforced by re-assigning weights where necessary.
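The reward differencing of stage 2 and the GAE step of stage 3 might be sketched as follows for a single trajectory (function names and defaults are our assumptions):

```python
import numpy as np

def token_rewards(prm_probs):
    """r_t = r_PRM(y_<=t) - r_PRM(y_<=t-1), taking r_PRM(y_<=0) = 0."""
    p = np.asarray(prm_probs, dtype=float)
    return np.diff(p, prepend=0.0)

def gae(rewards, values, gamma=1.0, lam=0.95):
    """Generalized advantage estimation over one trajectory.

    `values` has length T+1; the final entry is the bootstrap value
    after the last token.
    """
    T = len(rewards)
    adv = np.zeros(T)
    last = 0.0
    for t in reversed(range(T)):                       # backward recursion
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        last = delta + gamma * lam * last
        adv[t] = last
    return adv
```

With $\gamma = \lambda = 1$ and a zero value function, each advantage reduces to the sum of future PRM reward increments, which makes the credit-assignment interpretation explicit.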

4. Empirical Performance and Ablations

4.1 Benchmark Results

On AIME (150 questions) and MATH (1,500 questions) with majority-vote@32:

| Method | AIME | MATH | Compute Overhead |
| --- | --- | --- | --- |
| CHORD-φ | 18.4% | 71.4% | Baseline |
| EGSPO | 22.2% | 74.3% | +3.4% / step; +2% wall-clock |
| Gains (absolute) | +3.8% | +2.9% | |
  • EGSPO reduces convergence time (−24% epochs to target) and gradient variance (−31%).

4.2 Baselines and Comparisons

  • Sample-Level Hybrid (CHORD-φ): +2.3% AIME, +1.4% MATH over pure SFT.
  • RLOO (Token-Level Uniform RL): +1.1% AIME.
  • Token-Level DPO: +1.5% AIME.
  • Entropy-Weighted SFT Only: +2.0% AIME (no RL).

Ablation studies indicate optimal performance for $\tau$ at the top-15% entropy cutoff and $\alpha = 0.1$ (soft gating). Removing either the PRM or the entropy gating significantly reduces performance (the AIME gain drops from +3.8% to +1.7%). Without a minimum SFT fraction, models are vulnerable to forgetting (Hu et al., 3 Feb 2026).

5. Theoretical Insights and Mechanistic Interpretation

High-entropy tokens display approximately $3\times$ higher PRM reward variance, confirming their status as key decision points. Entropy gating focuses exploration by routing full PPO updates to these tokens and attenuating updates elsewhere, which:

  • Reduces gradient noise from low-uncertainty procedural tokens.
  • Enhances credit assignment specificity.
  • Maintains procedural and factual recall through SFT or attenuated updates.

Consistently negative $A_t$ appropriately penalizes confident errors, preventing reinforcement of incorrect but high-confidence predictions (Hu et al., 3 Feb 2026).

6. Computational Characteristics

  • PRM Inference: ~8% latency increase.
  • Entropy Computation: ~2% latency increase.
  • Overall Overhead: ~10% per training step.
  • Net Wall-Clock: faster convergence recovers roughly 5–10% of total training time, so wall-clock time increases only ~2% despite the per-step cost.

A plausible implication is that EGSPO’s computational tradeoff is favorable for large-scale pretraining where wall-clock efficiency and stable knowledge retention are priorities.

7. Relation to Entropy-Gated Policy Optimization Methods

EGSPO’s approach is distinguished by its token-level granularity and the use of model predictive entropy for gating, in contrast to prompt-level semantic entropy gating as in SEED-GRPO (Chen et al., 18 May 2025) or attention-entropy-based gating for diffusion models in AEGPO (Li et al., 6 Feb 2026). While SEED-GRPO gates the magnitude of group-level RL policy updates using prompt-level semantic entropy and AEGPO gates sampling and exploration schedules according to attention entropy, EGSPO operates at the token level within trajectories, directly modulating RL gradients during the training of LLMs. This positions EGSPO as a versatile blueprint for selectively allocating optimization signal in regimes characterized by structured token-wise uncertainty.

Summary Table: Key Features of EGSPO

| Component | Description | Distinctive Aspect |
| --- | --- | --- |
| Entropy Signal | Token-level predictive entropy $H_t$ | Intra-sequence decision points |
| Gating | $g(H_t) \in \{1, \alpha\}$ per token | Attenuated vs. full PPO gradient |
| Reward Signal | PRM-based token-wise differential reward $r_t$ | Precise credit assignment |
| Baseline Comparison | Sample-level (CHORD-φ), token-level uniform (RLOO), DPO | Superior, more stable |
| Computational Overhead | +3.4% per step, +2% net wall-clock | Modest, offset by faster convergence |

EGSPO’s entropy-gated, token-level hybrid policy optimization framework achieves robust improvements on mathematical reasoning tasks with minimal additional compute and stable knowledge preservation, establishing it as a foundational methodology for uncertainty-aware LLM training (Hu et al., 3 Feb 2026).
