- The paper introduces OPEFO, an adaptive method that balances token-level entropy changes to prevent premature determinism in RLVR training.
- It achieves 1.7–2.3% absolute accuracy improvements across benchmarks while maintaining smooth entropy dynamics compared to GRPO.
- OPEFO integrates with strict on-policy updates with minimal overhead, offering a robust and principled alternative to heuristic entropy regularization.
On-Policy Entropy Flow Optimization for Stabilizing RLVR Training
Overview and Motivation
Reinforcement Learning with Verifiable Rewards (RLVR) is now widely used to enhance the reasoning abilities of LLMs, particularly in domains requiring compositional, multi-step reasoning such as mathematical problem solving. RLVR pipelines, typically built on policy optimization methods such as Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO), have delivered strong empirical gains, but they are consistently hampered by entropy collapse: policy entropy drops sharply early in training, leading to premature determinism, poor exploration, and ultimately suboptimal reasoning capacity.
Existing approaches attempt to address entropy collapse through explicit entropy regularization or heuristics based on ratio clipping. However, these approaches are either too coarse to provide stable entropy control or lack theoretical rigor under strict on-policy constraints, because they rely on stale reference policies. This paper introduces a granular, token-level view of entropy dynamics, termed "entropy flow", and proposes an adaptive, strictly on-policy mechanism, On-Policy Entropy Flow Optimization (OPEFO), that directly balances entropy-increasing and entropy-decreasing updates throughout RLVR training (2605.11491).
Entropy Collapse: Token-Level Analysis
The paper decomposes the evolution of policy entropy into local, token-level contributions, building on first-order approximations under a softmax parameterization. Empirical measurements show that in standard GRPO-style RLVR, entropy-decreasing token updates dominate from early in training onward, yielding a net negative entropy flow. This persistent imbalance drives entropy toward zero at every stage of training.
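To make the sign of a token-level entropy change concrete, here is a minimal sketch under a tabular softmax assumption: the logits are treated as free parameters and a single policy-gradient step moves them by `lr * advantage * (onehot(token) - p)`. Under that assumption the first-order entropy change is the dot product of the entropy gradient, `dH/dz_k = -p_k (log p_k + H)`, with the logit update. The function names, the toy distribution, and the step size are illustrative assumptions, and the paper's exact expression may differ.

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=np.float64)
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

def entropy(p):
    return float(-(p * np.log(p)).sum())

def logit_update(p, token, advantage, lr):
    # Assumed tabular policy-gradient step: lr * A * grad_z log p(token).
    onehot = np.zeros_like(p)
    onehot[token] = 1.0
    return lr * advantage * (onehot - p)

def first_order_entropy_change(logits, token, advantage, lr):
    p = softmax(logits)
    grad_h = -p * (np.log(p) + entropy(p))  # dH/dz_k for a softmax policy
    return float(grad_h @ logit_update(p, token, advantage, lr))

def exact_entropy_change(logits, token, advantage, lr):
    z = np.asarray(logits, dtype=np.float64)
    p = softmax(z)
    return entropy(softmax(z + logit_update(p, token, advantage, lr))) - entropy(p)

# Toy next-token distribution over a 3-token vocabulary (hypothetical values).
logits = np.log([0.7, 0.2, 0.1])
dh_likely = first_order_entropy_change(logits, token=0, advantage=1.0, lr=0.01)
dh_rare = first_order_entropy_change(logits, token=2, advantage=1.0, lr=0.01)
```

In this toy case, rewarding the already-likely token gives a negative entropy change and rewarding the rare token gives a positive one; flipping the sign of the advantage flips both. Summing such per-token terms over a batch gives one way to read the "net entropy flow" described above.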
Clipping-based techniques (e.g., Clip-higher) selectively preserve entropy-increasing updates by relaxing clipping bounds. This partially alleviates collapse, but it deviates from strict on-policy optimization and relies on reference policies, which introduces distribution mismatch and instability. Entropy regularization, meanwhile, adds auxiliary objectives that may conflict with policy-gradient directions and are sensitive to hyperparameter choices, so it fails to robustly control high-variance entropy dynamics.
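For reference, the clipping mechanism discussed above can be sketched as a PPO-style clipped surrogate with separate lower and upper bounds. This is a generic illustration rather than code from the paper; `eps_low`, `eps_high`, and the example values are assumptions chosen only to show the effect of raising the upper bound.

```python
import numpy as np

def clipped_surrogate(ratio, advantage, eps_low=0.2, eps_high=0.2):
    """Per-token clipped objective: min(r * A, clip(r, 1 - eps_low, 1 + eps_high) * A)."""
    ratio = np.asarray(ratio, dtype=np.float64)
    advantage = np.asarray(advantage, dtype=np.float64)
    clipped = np.clip(ratio, 1.0 - eps_low, 1.0 + eps_high)
    return np.minimum(ratio * advantage, clipped * advantage)

# A token whose probability has already grown by 25% relative to the reference policy.
symmetric = clipped_surrogate(1.25, 1.0)                  # capped at 1.2
clip_higher = clipped_surrogate(1.25, 1.0, eps_high=0.28) # not capped: 1.25

# Under strict on-policy updates the ratio is exactly 1, so clipping never activates.
on_policy = clipped_surrogate(1.0, 1.0, eps_high=0.28)
```

The ratio is defined against a reference policy, which is why relaxing the upper bound only matters once the current policy has drifted away from the policy that generated the samples.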
OPEFO: Adaptive On-Policy Entropy Flow Control
OPEFO provides a theoretically justified and implementation-friendly solution compatible with strict on-policy requirements. At each policy update, token-level entropy changes (ΔHₜ) are computed, and updates are partitioned into an entropy-increasing set (ΔHₜ > 0) and an entropy-decreasing set (ΔHₜ < 0). Gradient updates from these sets are then rescaled via a single adaptive coefficient, λ, solved analytically to enforce zero net entropy flow for each batch:
- The balancing coefficient λ is computed to ensure the weighted sum of entropy changes for the batch is zero, counteracting transient imbalances in token-level entropy contributions.
- This adaptive scaling preserves the optimization direction of policy gradients, avoids auxiliary loss terms, and requires minimal implementation changes to existing GRPO pipelines.
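The balancing step can be sketched as follows. The summary does not spell out how λ enters the per-token weights, so this sketch assumes one hypothetical two-sided parameterization: entropy-increasing tokens are weighted by λ and entropy-decreasing tokens by 1/λ. Requiring the weighted sum of ΔHₜ to be zero then gives λ = sqrt(|S₋| / S₊), where S₊ and S₋ are the batch sums of positive and negative entropy changes. The function names, the `eps` guard, and the fallback to λ = 1 are all assumptions, not the paper's implementation.

```python
import numpy as np

def balance_coefficient(delta_h, eps=1e-8):
    """Solve lam * S_plus - S_minus / lam = 0 for lam > 0 (assumed parameterization)."""
    delta_h = np.asarray(delta_h, dtype=np.float64)
    s_plus = delta_h[delta_h > 0].sum()     # total entropy-increasing flow
    s_minus = -delta_h[delta_h < 0].sum()   # magnitude of entropy-decreasing flow
    if s_plus < eps or s_minus < eps:
        return 1.0  # one side is empty: no balancing is possible, leave weights alone
    return float(np.sqrt(s_minus / s_plus))

def token_weights(delta_h, eps=1e-8):
    """Per-token multipliers for the policy-gradient loss terms."""
    delta_h = np.asarray(delta_h, dtype=np.float64)
    lam = balance_coefficient(delta_h, eps)
    weights = np.ones_like(delta_h)
    weights[delta_h > 0] = lam
    weights[delta_h < 0] = 1.0 / lam
    return weights, lam

# Hypothetical batch of per-token entropy changes, dominated by decreasing updates.
delta_h = np.array([0.1, -0.4, 0.1, -0.2, 0.0])
weights, lam = token_weights(delta_h)
net_flow = float((weights * delta_h).sum())
```

Because the weights are positive scalars applied to existing loss terms, each token's gradient keeps its direction and no auxiliary loss is introduced, which matches the properties listed above. In a training loop the weights would be treated as constants (no gradient flows through λ); that is also an assumption of this sketch.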
Empirical Results
Extensive experiments are conducted using Qwen2.5-Math-7B and Qwen3-4B-Base models trained on the DAPO-17k dataset. Benchmarks span six challenging mathematical reasoning tasks (AIME24, AIME25, AMC23, MATH500, Minerva, OlympiadBench).
Key results include:
- OPEFO achieves state-of-the-art accuracy on all benchmarks, with average accuracies of 52.4% on Qwen2.5-Math-7B and 51.9% on Qwen3-4B-Base. This is an absolute improvement of 1.7–2.3 percentage points over the next-best strict on-policy and clipping-based baselines.
- Training dynamics with OPEFO maintain smooth entropy trajectories, avoiding the rapid entropy collapse seen in GRPO and unbounded entropy inflation observed with strong entropy regularization.
- OPEFO also sustains longer, more varied response lengths and higher Pass@k scores as k increases, confirming improved exploration and solution diversity.
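For readers unfamiliar with Pass@k, the metric is the probability that at least one of k sampled responses is correct. The summary does not say how the paper computes it; the sketch below uses the commonly used unbiased estimator from n samples with c correct, 1 - C(n-c, k) / C(n, k), as an assumption.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased Pass@k estimate from n samples of which c are correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when k > n - c, so the estimate is exactly 1.0 there.
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical problem: 10 sampled responses, 3 of them correct.
curve = [pass_at_k(10, 3, k) for k in (1, 2, 5, 8)]
```

A policy that has collapsed onto a few responses gains little from extra samples, so a Pass@k curve that keeps rising with k is usually read as a sign of preserved diversity.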
Strict On-Policy vs. Approximate On-Policy Baselines
Strict on-policy optimization alone, as opposed to approximate on-policy optimization with clipped ratios, yields modest accuracy improvements but does not remedy entropy collapse. OPEFO, layered atop strict on-policy updates, is necessary for controlling entropy trajectories and unlocking further gains.
Ablations and Efficiency
Ablation studies indicate that static scaling and single-sided scaling of entropy-increasing or entropy-decreasing updates are both inferior to OPEFO's full adaptive balancing. The balancing coefficient λ remains positive and varies across training, adjusting automatically as optimization proceeds.
Notably, OPEFO adds negligible computational overhead relative to standard GRPO and strict on-policy baselines (149s vs. 158s per batch), dispelling the notion that strict on-policy methods are necessarily more expensive at practical rollout batch sizes.
Theoretical and Practical Implications
The entropy flow framework formalizes the intuition that stable exploration is a product of equilibrium between entropy-increasing and decreasing policy updates. OPEFO avoids the pitfalls of reference policy drift and auxiliary entropy loss instability—common in ratio-based and entropy-regularized RLVR—by ensuring that entropy control is both principled and robust under the strictest policy-gradient protocols.
Practically, OPEFO is broadly adaptable across RLVR setups and compatible with existing codebases, providing an avenue for stable, high-performing LLM fine-tuning, especially in reasoning- or exploration-dominated domains. The token-level entropy flow perspective could also be extended to diagnose or control exploration in non-mathematical RL, and could inform improved optimization in dense-reward or complex credit assignment regimes.
Limitations and Future Directions
OPEFO's entropy change analysis is inherently first-order and leverages a softmax-independence assumption, abstracting away higher-order dependencies found in large transformer-based models. The batch-level, zero-entropy-flow heuristic is a sufficient but not globally optimal criterion, and its efficacy may fluctuate in settings beyond sparse, verifiable rewards. Comprehensive exploration in domains with dense rewards and quantification of higher-order entropy interactions remain important directions for subsequent research.
Conclusion
OPEFO introduces a strict on-policy, adaptive mechanism grounded in token-level entropy flow analysis to reliably prevent entropy collapse in RLVR for LLMs. The method delivers both superior empirical accuracy and robust training dynamics across a range of mathematical reasoning benchmarks, with negligible additional computational cost. This work suggests that principled entropy flow balancing is critical for stabilizing RL-based optimization of large, compositional models and points to new directions for entropy-aware reinforcement learning design in both language and general sequential decision-making systems.