- The paper proposes an entropy-based advantage shaping method to boost exploratory reasoning in RL-trained language models while mitigating over-optimization.
- With only a minimal modification to standard RL pipelines, it integrates a clipped, gradient-detached entropy term into token-level advantage computation, and is validated on mathematical reasoning benchmarks.
- Results show significant Pass@K improvements and more frequent exploratory reasoning actions such as pivotal tokens and reflective behaviors, leading to better multi-step problem solving.
This paper, "Reasoning with Exploration: An Entropy Perspective" (2506.14758), addresses a key challenge in applying reinforcement learning (RL) to LLMs (LMs) for reasoning tasks: balancing exploitation (getting correct answers) with exploration (discovering novel or more robust reasoning paths). The authors note that standard RL methods for LMs, particularly RL with Verifiable Rewards (RLVR), tend to lead to models that converge on narrow, over-optimized behaviors, limiting their ability to perform multi-step reasoning and causing performance plateaus, especially on complex problems.
Motivated by the role of entropy in traditional RL as a signal for exploration, the paper investigates its relationship with exploratory reasoning in LMs. Through empirical analysis on mathematical reasoning tasks (Figure 2), the authors find a positive correlation between high-entropy regions in LM outputs and three types of exploratory reasoning actions (a sketch of locating such high-entropy tokens follows the list):
- Pivotal tokens: Words or phrases like "first," "because," "however" that guide or connect logical steps.
- Reflective actions: Sentences or phrases indicating self-verification or correction (e.g., "Let's verify if this is correct...").
- Rare behaviors: Reasoning strategies or steps rarely seen in the base model's outputs, semantically distant from typical generations (Figure 3).
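To make "high-entropy regions" concrete, the sketch below computes per-token policy entropy from a model's output logits and flags positions above a cutoff. This is a minimal illustration, not code from the paper; the function names and the 2.0-nat threshold are assumptions chosen for the example.

```python
import torch
import torch.nn.functional as F

def token_entropies(logits):
    """Per-position policy entropy H_t (in nats) from pre-softmax logits.

    logits: [seq_len, vocab_size] scores for each generated position.
    """
    logp = F.log_softmax(logits, dim=-1)
    return -(logp.exp() * logp).sum(dim=-1)

def high_entropy_positions(logits, threshold=2.0):
    """Return indices of positions whose entropy exceeds the cutoff, plus all entropies.

    The 2.0-nat threshold is an illustrative choice, not a value from the paper.
    """
    h = token_entropies(logits)
    return torch.nonzero(h > threshold, as_tuple=False).squeeze(-1), h
```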
Based on these findings, the paper proposes a simple yet effective method to encourage exploratory reasoning: augmenting the standard advantage function in RL algorithms with a clipped, gradient-detached entropy-based term. This modification aims to reinforce actions taken under higher uncertainty, which the analysis suggests are associated with exploratory reasoning.
The method is designed to be minimally invasive, requiring only one additional line of code in existing RL training pipelines such as those based on PPO (1707.06347) or GRPO (2402.03300). For a token $o_t$ with policy entropy $H_t$, the shaped advantage $A_t^{\text{shaped}}$ is calculated as:
$$A_t^{\text{shaped}} = A_t + \psi(H_t)$$
where $\psi(H_t) = \min\left(\alpha \cdot H_t^{\text{detach}},\ |A_t| / \kappa\right)$.
Here:
- $A_t$ is the original advantage (e.g., from GAE in PPO or the normalized group reward in GRPO).
- $H_t$ is the entropy of the policy's distribution over the vocabulary at token $o_t$.
- $H_t^{\text{detach}}$ indicates that the entropy term is detached from the computational graph during backpropagation. This is crucial: the entropy bonus influences the magnitude of the gradient update (by modifying the advantage) but not its direction, which is still determined by the log-likelihood gradient $\nabla_\theta \log \pi_\theta(o_t \mid \dots)$ and the standard advantage $A_t$. This is a key distinction from traditional entropy regularization (Table 1), which adds $\nabla_\theta H_t$ to the gradient and explicitly pushes the policy towards higher-entropy distributions (see the sketch after this list).
- $\alpha > 0$ is a scaling coefficient for the entropy bonus.
- $\kappa > 1$ is a clipping factor. The clipping $\min(\cdot,\ |A_t|/\kappa)$ caps the bonus at a fraction of the original advantage's magnitude, ensuring it neither dominates $A_t$ nor flips its sign when $A_t$ is negative, thus preserving the core optimization direction guided by the reward signal.
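The following minimal sketch contrasts the two objectives for a REINFORCE-style token loss (the paper itself trains with PPO/GRPO clipped surrogates). The function names and the values of `alpha`, `kappa`, and `beta` are illustrative assumptions; `entropy` is the per-token entropy, e.g., as computed in the earlier sketch.

```python
import torch

# Classic entropy regularization: the entropy term stays in the computational
# graph, so the gradient contains an explicit grad(H_t) component that pushes
# the policy toward flatter (higher-entropy) distributions.
def loss_entropy_regularized(logp_taken, adv, entropy, beta=0.01):
    return -(logp_taken * adv).mean() - beta * entropy.mean()

# Entropy-based advantage shaping: the entropy is detached, so it only rescales
# the magnitude of the update; the direction is still set by grad(log pi) and
# the sign of A_t (the clip keeps the bonus below |A_t| / kappa).
def loss_shaped_advantage(logp_taken, adv, entropy, alpha=0.4, kappa=2.0):
    bonus = torch.min(alpha * entropy.detach(), adv.abs() / kappa)
    return -(logp_taken * (adv + bonus)).mean()
```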
The practical implementation involves adding this term to the computed advantages before calculating the policy loss, as shown in the pseudocode snippet:
```python
adv = compute_advantages(...)                                  # A_t from PPO (GAE) or GRPO
adv += torch.min(alpha * entropy.detach(), adv.abs() / kappa)  # clipped, gradient-detached entropy bonus
loss = compute_policy_loss(adv, ...)
```
The authors demonstrate that this entropy-based advantage has a self-regulating property (Figure 5). Initially, high entropy might lead to a larger bonus. This bonus, when paired with a positive advantage $A_t$, strengthens the policy update for token $o_t$. The update increases the likelihood of $o_t$, making the distribution sharper and decreasing the entropy $H_t$ at that position in subsequent iterations. Consequently, the entropy bonus for that token naturally diminishes, preventing over-encouragement or "reward hacking" in which the model might repeatedly generate high-entropy tokens without valid reasoning.
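A tiny numerical illustration of this dynamic (all probabilities and hyperparameter values below are made up for the example, not taken from the paper): as the distribution at a position sharpens, its entropy drops and so does the clipped bonus $\psi(H_t)$.

```python
import torch

alpha, kappa, adv = 0.4, 2.0, 1.0   # illustrative values, not from the paper

def entropy(p):
    """Entropy (in nats) of a categorical distribution given as probabilities."""
    return -(p * p.log()).sum().item()

# A flat 4-way distribution vs. the same position after the sampled token has
# been reinforced: entropy drops, so the clipped bonus psi(H_t) shrinks.
for probs in (torch.tensor([0.25, 0.25, 0.25, 0.25]),
              torch.tensor([0.85, 0.05, 0.05, 0.05])):
    h = entropy(probs)
    bonus = min(alpha * h, abs(adv) / kappa)
    print(f"H_t = {h:.3f} nats -> psi(H_t) = {bonus:.3f}")
```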
For experiments, the authors use Qwen2.5-Base-7B and Qwen2.5-Math-Base-7B as backbone models and train GRPO and PPO baselines built on the veRL framework (2409.19256), incorporating techniques like Clip-Higher and Token-level Loss. They use outcome-based rewards (+1 for correct, -1 for incorrect) derived from verifiers like Math-Verify [mathverify]. Evaluation is performed on mathematical reasoning benchmarks (AIME 2025/2024 [AIME], AMC 2023 [AMC], MATH500 [Math500]), reporting both average accuracy (Pass@1) and Pass@K (2107.03374, 2504.13837), the latter as an estimate of reasoning capacity. They use large values of K (up to 256) to probe the upper bound of reasoning capabilities.
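For reference, Pass@K is typically computed with the unbiased estimator from the Codex paper (2107.03374); the sketch below is a straightforward implementation of that formula, with the function name and the usage numbers chosen for illustration.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator from (2107.03374).

    n: total samples drawn for a problem, c: number of correct samples,
    k: evaluation budget. Returns the estimated probability that at least
    one of k drawn samples is correct.
    """
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 256 samples per problem, 32 of them correct.
print(pass_at_k(256, 32, 1))    # ~ average accuracy (Pass@1)
print(pass_at_k(256, 32, 64))   # large-K probe of reasoning capacity
```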
The results (Table 2, Figure 4) show consistent improvements with the entropy-based advantage across different models and RL algorithms, outperforming standard RL baselines and other methods like GPG (2502.01456). Notably, the method significantly boosts Pass@K performance, even at large K values, demonstrating enhanced exploration and reasoning potential. The analysis during training (Figure 6) shows that the method achieves slightly higher rewards, sustains longer response lengths, and maintains controlled entropy compared to baselines. Analysis on test data (Figure 7) confirms that models trained with entropy advantage produce significantly more pivotal tokens and reflective actions, generate longer responses without increased repetition, and exhibit more systematic case analysis and constraint checking in case studies (Figure 8, Appendix C), leading to more accurate solutions.
In summary, the paper provides practical insights into connecting entropy and exploratory reasoning in LMs. The proposed entropy-based advantage shaping is a simple-to-implement, effective method to enhance the exploration capabilities of RL-trained LMs, leading to improved reasoning performance, particularly on challenging problems and for tasks requiring multi-step, exploratory thought processes, as evidenced by the strong Pass@K results. The method's robustness and self-regulating nature make it a promising addition to the RLVR toolkit.