Entropy Gated Selective Policy Optimization (EGSPO)
- The paper introduces EGSPO, a token-level gradient gating framework that uses predictive entropy to focus RL updates on high-uncertainty decision points.
- It employs a three-stage approach—SFT warm-up, RL rollout, and selective policy optimization—to enhance mathematical reasoning performance.
- Empirical results demonstrate significant improvements in AIME and MATH tasks with reduced convergence epochs and modest computational overhead.
Entropy Gated Selective Policy Optimization (EGSPO) is a hybrid policy optimization framework for LLMs that implements token-level gradient allocation based on predictive entropy. EGSPO is designed to address limitations of conventional hybrid supervised fine-tuning (SFT) and reinforcement learning (RL) approaches, enabling fine-grained credit assignment and stable training by modulating RL updates at the token level according to uncertainty. The three-stage EGSPO framework achieves statistically significant improvements over baseline hybrid methods, particularly in mathematical reasoning tasks, while incurring only modest computational overhead (Hu et al., 3 Feb 2026).
1. Motivation and Problem Context
Hybrid SFT+RL methods, such as CHORD-φ and MIX-CHORD, conventionally alternate between sample-level SFT and RL losses for model-generated responses. This strategy applies either a purely supervised or an RL-based objective uniformly to an entire trajectory, ignoring intra-trajectory uncertainty variation. Specifically, it fails to differentiate between near-deterministic “procedural” tokens and high-uncertainty decision points. The consequences are:
- Inefficient credit assignment: Rewards or penalties are propagated uniformly to all tokens, regardless of their contribution to the final outcome.
- Training instability and mode collapse: Applying RL to deterministic tokens can impair factual recall, while excess SFT on uncertain tokens suppresses exploration and discovery.
EGSPO introduces token-level gradient routing based on predictive entropy, aiming to efficiently focus RL on critical decision points while maintaining the stability of procedural tokens (Hu et al., 3 Feb 2026).
2. Formalism
2.1 Loss Functions
- Supervised Fine-Tuning (SFT) Loss: For model $\pi_\theta$ and reference token $y_t^*$,
  $\mathcal{L}_{\mathrm{SFT}}(\theta) = -\sum_t \log \pi_\theta(y_t^* \mid y_{<t}, x).$
- Reinforcement Learning PPO Surrogate Loss: For token-level RL,
  $\mathcal{L}_{\mathrm{RL}}(\theta) = -\mathbb{E}_t\!\left[\min\!\left(r_t(\theta)\,A_t,\ \mathrm{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,A_t\right)\right],$
  where $r_t(\theta) = \pi_\theta(y_t \mid y_{<t}, x)\,/\,\pi_{\theta_{\mathrm{old}}}(y_t \mid y_{<t}, x)$, $A_t$ is the advantage from a process reward model (PRM), and $\epsilon$ is the PPO clipping hyperparameter.
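As a concrete illustration, the token-level clipped surrogate can be written in a few lines of plain Python. This is a sketch, not the paper's implementation; `eps = 0.2` is a common PPO default rather than a value from the paper.

```python
import math

def ppo_token_loss(logp_new, logp_old, advantage, eps=0.2):
    """Clipped PPO surrogate loss for a single token.

    logp_new / logp_old: log-probabilities of the sampled token under the
    current and rollout policies; advantage: A_t (here from a PRM);
    eps: clipping hyperparameter (0.2 is an illustrative default).
    """
    ratio = math.exp(logp_new - logp_old)            # r_t(theta)
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))  # clip(r_t, 1-eps, 1+eps)
    # Negative sign: minimizing this loss maximizes the clipped surrogate.
    return -min(ratio * advantage, clipped * advantage)
```

The pessimistic `min` means large ratios cannot inflate the objective for positive advantages, while negative advantages are penalized using the less favorable of the two terms.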
2.2 Predictive Entropy and Gating
- Token-Level Predictive Entropy:
  $H_t = -\sum_{v \in \mathcal{V}} \pi_\theta(v \mid y_{<t}, x)\,\log \pi_\theta(v \mid y_{<t}, x).$
- Gating Function: For entropy threshold $\tau$ and attenuation factor $\lambda$,
  $g_t = \begin{cases} 1, & H_t \ge \tau, \\ \lambda, & H_t < \tau. \end{cases}$
  Typically, $\tau$ selects the top 15% highest-entropy tokens in a batch; $\lambda$ is chosen in the range $0.1$–$0.3$.
- EGSPO Gradient:
  $\nabla_\theta \mathcal{L}_{\mathrm{EGSPO}} = \sum_t g_t\,\nabla_\theta \mathcal{L}_{\mathrm{RL},t}.$
The SFT loss can also be enforced on low-entropy tokens, and a minimum SFT token fraction avoids excessive forgetting.
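A minimal sketch of the entropy and gating computations in plain Python, assuming a hard threshold with attenuation $\lambda = 0.2$ (an illustrative midpoint of the stated $0.1$–$0.3$ range):

```python
import math

def token_entropy(probs):
    """Predictive entropy H_t of a next-token distribution, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def gate(h_t, tau, lam=0.2):
    """Gate g_t: full RL update above the threshold, attenuated below it.

    lam is the attenuation factor; 0.2 is an assumed midpoint of the
    0.1-0.3 range stated in the text.
    """
    return 1.0 if h_t >= tau else lam
```

A uniform distribution over the vocabulary maximizes `token_entropy`, so decision-point tokens (many plausible continuations) gate to the full PPO update, while peaked, near-deterministic distributions receive only the attenuated signal.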
3. Three-Stage EGSPO Workflow
- SFT Expert Learning (Warm-Up):
  - Initialize model parameters via SFT on a demonstration set for a fixed number of warm-up epochs to stabilize procedural knowledge.
- RL Rollout Generation:
  - Generate multiple rollouts per prompt at a fixed sampling temperature.
  - For each token in a trajectory:
    - Compute the predictive entropy $H_t$.
    - Obtain the PRM correctness probability and compute the corresponding token-level reward.
    - Store the token together with its log-probability, entropy $H_t$, and reward for the optimization stage.
- Entropy-Gated Selective Policy Optimization:
  - Compute the batch percentile threshold $\tau$ on $H_t$ (top 15% of tokens by default).
  - Assign gating weights $g_t$.
  - Compute token-level advantages $A_t$ using generalized advantage estimation (GAE) on the token rewards and a learned value function.
  - Compute and apply the EGSPO gradient update.
A minimum SFT fraction is enforced by re-assigning gating weights where necessary.
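The threshold selection and gate assignment of the third stage can be sketched as follows. The top-15% selection and $\lambda = 0.2$ follow the text; `min_sft_frac = 0.5` is a hypothetical value for the minimum SFT token fraction, which the source does not specify.

```python
def entropy_threshold(entropies, top_frac=0.15):
    """tau such that roughly the top `top_frac` highest-entropy tokens gate to 1."""
    k = max(1, int(len(entropies) * top_frac))
    return sorted(entropies, reverse=True)[k - 1]

def assign_gates(entropies, top_frac=0.15, lam=0.2, min_sft_frac=0.5):
    """Per-token gate weights with a minimum attenuated (SFT-leaning) fraction.

    min_sft_frac is hypothetical: the paper states only that such a floor
    exists. If too few tokens are attenuated, the lowest-entropy full-RL
    tokens are re-assigned to the attenuated branch.
    """
    tau = entropy_threshold(entropies, top_frac)
    gates = [1.0 if h >= tau else lam for h in entropies]
    # Enforce the floor on the fraction of attenuated tokens.
    need = int(len(entropies) * min_sft_frac) - gates.count(lam)
    if need > 0:
        order = sorted(range(len(entropies)), key=lambda i: entropies[i])
        for i in order:
            if need == 0:
                break
            if gates[i] == 1.0:
                gates[i] = lam   # re-assign a low-entropy full-RL token
                need -= 1
    return gates
```

In training, each token's RL loss would then be multiplied by its gate weight before backpropagation, concentrating the PPO signal on high-entropy decision points.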
4. Empirical Performance and Ablations
4.1 Benchmark Results
On AIME (150 questions) and MATH (1,500 questions) with majority-vote@32:
| Method | AIME | MATH | Compute Overhead |
|---|---|---|---|
| CHORD-φ | 18.4% | 71.4% | Baseline |
| EGSPO | 22.2% | 74.3% | +3.4% / step; +2% wall-clock |
| Gains (absolute) | +3.8% | +2.9% | — |
- EGSPO converges in fewer epochs to target accuracy and exhibits lower gradient variance than the baseline.
4.2 Baselines and Comparisons
- Sample-Level Hybrid (CHORD-φ): +2.3% AIME, +1.4% MATH over pure SFT.
- RLOO (Token-Level Uniform RL): +1.1% AIME.
- Token-Level DPO: +1.5% AIME.
- Entropy-Weighted SFT Only: +2.0% AIME (no RL).
Ablation studies indicate optimal performance when gating the top ~15% of tokens by entropy with soft attenuation ($\lambda$ in the $0.1$–$0.3$ range) rather than hard masking. Removing the PRM or the entropy gating each significantly reduces performance (the AIME gain falls from +3.8% to +1.7%). Without a minimum SFT fraction, models are vulnerable to forgetting (Hu et al., 3 Feb 2026).
5. Theoretical Insights and Mechanistic Interpretation
High-entropy tokens display markedly higher PRM reward variance, confirming their status as key decision points. Entropy gating focuses exploration by routing full PPO updates to these tokens and attenuating updates elsewhere, which:
- Reduces gradient noise from low-uncertainty procedural tokens.
- Enhances credit assignment specificity.
- Maintains procedural and factual recall through SFT or attenuated updates.
Consistently negative advantages $A_t$ appropriately penalize confident errors, preventing reinforcement of incorrect but high-confidence predictions (Hu et al., 3 Feb 2026).
6. Computational Characteristics
- PRM Inference: ~8% latency increase.
- Entropy Computation: ~2% latency.
- Overall Overhead: ~10% per training step; faster convergence yields ~5–10% wall-clock reduction.
- Net wall-clock training time increases only ~2%, despite the higher per-step cost.
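The net effect of these figures can be checked with a quick calculation. The specific numbers below are illustrative assumptions, taking the convergence speedup at ~7% (a midpoint of the reported 5–10% range) against a 10% per-step cost:

```python
# Illustrative overhead arithmetic; numbers assumed, not from the paper.
baseline_steps = 1000
per_step_overhead = 0.10      # PRM inference (~8%) + entropy computation (~2%)
convergence_speedup = 0.07    # assumed midpoint of the 5-10% step reduction

egspo_steps = baseline_steps * (1 - convergence_speedup)
wallclock_ratio = egspo_steps * (1 + per_step_overhead) / baseline_steps
# wallclock_ratio comes out near 1.02, i.e. roughly a +2% net increase
```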
A plausible implication is that EGSPO’s computational tradeoff is favorable for large-scale training where wall-clock efficiency and stable knowledge retention are priorities.
7. Relation to Entropy-Gated Policy Optimization Methods
EGSPO’s approach is distinguished by its token-level granularity and the use of model predictive entropy for gating, in contrast to prompt-level semantic entropy gating as in SEED-GRPO (Chen et al., 18 May 2025) or attention-entropy-based gating for diffusion models in AEGPO (Li et al., 6 Feb 2026). While SEED-GRPO gates the magnitude of group-level RL policy updates using prompt-level semantic entropy and AEGPO gates sampling and exploration schedules according to attention entropy, EGSPO operates at the token level within trajectories, directly modulating RL gradients during the training of LLMs. This positions EGSPO as a versatile blueprint for selectively allocating optimization signal in regimes characterized by structured token-wise uncertainty.
Summary Table: Key Features of EGSPO
| Component | Description | Distinctive Aspect |
|---|---|---|
| Entropy Signal | Token-level predictive entropy | Intra-sequence decision points |
| Gating | $g_t \in \{\lambda, 1\}$ per token | Attenuated vs. full PPO gradient |
| Reward Signal | PRM-based token-wise differential reward | Precise credit assignment |
| Baseline Comparison | Sample-level (CHORD-φ), token-level uniform (RLOO), DPO | Superior, more stable |
| Computational Overhead | 3.4% per step, 2% wall-clock net increase | Modest, offset by faster convergence |
EGSPO’s entropy-gated, token-level hybrid policy optimization framework achieves robust improvements on mathematical reasoning tasks with minimal additional compute and stable knowledge preservation, establishing it as a foundational methodology for uncertainty-aware LLM training (Hu et al., 3 Feb 2026).