- The paper proposes an entropy-based advantage shaping method to boost exploratory reasoning in RL-trained language models while mitigating over-optimization.
- With only a minimal modification to standard RL pipelines, it integrates a clipped, gradient-detached entropy term into token-level advantage computation, and is validated on mathematical reasoning benchmarks.
- Results show significant Pass@K improvements and more frequent exploratory reasoning actions such as pivotal tokens and reflective behaviors, leading to better multi-step problem solving.
This paper, "Reasoning with Exploration: An Entropy Perspective" (2506.14758), addresses a key challenge in applying reinforcement learning (RL) to LLMs (LMs) for reasoning tasks: balancing exploitation (getting correct answers) with exploration (discovering novel or more robust reasoning paths). The authors note that standard RL methods for LMs, particularly RL with Verifiable Rewards (RLVR), tend to lead to models that converge on narrow, over-optimized behaviors, limiting their ability to perform multi-step reasoning and causing performance plateaus, especially on complex problems.
Motivated by the role of entropy in traditional RL as a signal for exploration, the paper investigates its relationship with exploratory reasoning in LMs. Through empirical analysis on mathematical reasoning tasks (Figure 2), the authors find a positive correlation between high-entropy regions in LM outputs and three types of exploratory reasoning actions (a sketch of locating such high-entropy tokens follows the list):
- Pivotal tokens: Words or phrases like "first," "because," "however" that guide or connect logical steps.
- Reflective actions: Sentences or phrases indicating self-verification or correction (e.g., "Let's verify if this is correct...").
- Rare behaviors: Reasoning strategies or steps rarely seen in the base model's outputs, semantically distant from typical generations (Figure 3).
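To make "high-entropy regions" concrete, the sketch below computes per-token policy entropy from a model's output logits and flags positions above a cutoff. This is a minimal illustration, not code from the paper; the function names and the 2.0-nat threshold are assumptions chosen for the example.

```python
import torch
import torch.nn.functional as F

def token_entropies(logits):
    """Per-position policy entropy H_t (in nats) from pre-softmax logits.

    logits: [seq_len, vocab_size] scores for each generated position.
    """
    logp = F.log_softmax(logits, dim=-1)
    return -(logp.exp() * logp).sum(dim=-1)

def high_entropy_positions(logits, threshold=2.0):
    """Return indices of positions whose entropy exceeds the cutoff, plus all entropies.

    The 2.0-nat threshold is an illustrative choice, not a value from the paper.
    """
    h = token_entropies(logits)
    return torch.nonzero(h > threshold, as_tuple=False).squeeze(-1), h
```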
Based on these findings, the paper proposes a simple yet effective method to encourage exploratory reasoning: augmenting the standard advantage function in RL algorithms with a clipped, gradient-detached entropy-based term. This modification aims to reinforce actions taken under higher uncertainty, which the analysis suggests are associated with exploratory reasoning.
The method is designed to be minimally invasive, requiring only one additional line of code in existing RL training pipelines such as those based on PPO (1707.06347) or GRPO (2402.03300). For a token $o_t$ with policy entropy $H_t$, the shaped advantage $A_t^{\text{shaped}}$ is calculated as:
$$A_t^{\text{shaped}} = A_t + \psi(H_t)$$
where $\psi(H_t) = \min\left(\alpha \cdot H_t^{\text{detach}},\ |A_t| / \kappa\right)$.
Here:
- $A_t$ is the original advantage (e.g., from GAE in PPO or the normalized group reward in GRPO).
- $H_t$ is the entropy of the policy's distribution over the vocabulary at token $o_t$.
- $H_t^{\text{detach}}$ indicates that the entropy term is detached from the computational graph during backpropagation. This is crucial: the entropy bonus influences the magnitude of the gradient update (by modifying the advantage) but not its direction, which is still determined by the log-likelihood gradient $\nabla_\theta \log \pi_\theta(o_t \mid \dots)$ and the standard advantage $A_t$. This is a key distinction from traditional entropy regularization (Table 1), which adds $\nabla_\theta H_t$ to the gradient and explicitly pushes the policy towards higher-entropy distributions (see the sketch after this list).
- $\alpha > 0$ is a scaling coefficient for the entropy bonus.
- $\kappa > 1$ is a clipping factor. The clipping $\min(\cdot,\ |A_t|/\kappa)$ caps the bonus at a fraction of the original advantage's magnitude, ensuring it neither dominates $A_t$ nor flips its sign when $A_t$ is negative, thus preserving the core optimization direction guided by the reward signal.
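The following minimal sketch contrasts the two objectives for a REINFORCE-style token loss (the paper itself trains with PPO/GRPO clipped surrogates). The function names and the values of `alpha`, `kappa`, and `beta` are illustrative assumptions; `entropy` is the per-token entropy, e.g., as computed in the earlier sketch.

```python
import torch

# Classic entropy regularization: the entropy term stays in the computational
# graph, so the gradient contains an explicit grad(H_t) component that pushes
# the policy toward flatter (higher-entropy) distributions.
def loss_entropy_regularized(logp_taken, adv, entropy, beta=0.01):
    return -(logp_taken * adv).mean() - beta * entropy.mean()

# Entropy-based advantage shaping: the entropy is detached, so it only rescales
# the magnitude of the update; the direction is still set by grad(log pi) and
# the sign of A_t (the clip keeps the bonus below |A_t| / kappa).
def loss_shaped_advantage(logp_taken, adv, entropy, alpha=0.4, kappa=2.0):
    bonus = torch.min(alpha * entropy.detach(), adv.abs() / kappa)
    return -(logp_taken * (adv + bonus)).mean()
```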
The practical implementation involves adding this term to the computed advantages before calculating the policy loss, as shown in the pseudocode snippet:
```python
adv = compute_advantages(...)                                  # A_t from PPO (GAE) or GRPO
adv += torch.min(alpha * entropy.detach(), adv.abs() / kappa)  # clipped, gradient-detached entropy bonus
loss = compute_policy_loss(adv, ...)
```
The authors demonstrate that this entropy-based advantage has a self-regulating property (Figure 5). Initially, high entropy might lead to a larger bonus. This bonus, when paired with a positive advantage $A_t$, strengthens the policy update for token $o_t$. The update increases the likelihood of $o_t$, making the distribution sharper and decreasing the entropy $H_t$ at that position in subsequent iterations. Consequently, the entropy bonus for that token naturally diminishes, preventing over-encouragement or "reward hacking" in which the model might repeatedly generate high-entropy tokens without valid reasoning.
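A tiny numerical illustration of this dynamic (all probabilities and hyperparameter values below are made up for the example, not taken from the paper): as the distribution at a position sharpens, its entropy drops and so does the clipped bonus $\psi(H_t)$.

```python
import torch

alpha, kappa, adv = 0.4, 2.0, 1.0   # illustrative values, not from the paper

def entropy(p):
    """Entropy (in nats) of a categorical distribution given as probabilities."""
    return -(p * p.log()).sum().item()

# A flat 4-way distribution vs. the same position after the sampled token has
# been reinforced: entropy drops, so the clipped bonus psi(H_t) shrinks.
for probs in (torch.tensor([0.25, 0.25, 0.25, 0.25]),
              torch.tensor([0.85, 0.05, 0.05, 0.05])):
    h = entropy(probs)
    bonus = min(alpha * h, abs(adv) / kappa)
    print(f"H_t = {h:.3f} nats -> psi(H_t) = {bonus:.3f}")
```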
For experiments, the authors use Qwen2.5-Base-7B and Qwen2.5-Math-Base-7B as backbone models and train GRPO and PPO baselines built on the veRL framework (2409.19256), incorporating techniques like Clip-Higher and Token-level Loss. They use outcome-based rewards (+1 for correct, -1 for incorrect) derived from verifiers like Math-Verify [mathverify]. Evaluation is performed on mathematical reasoning benchmarks (AIME 2025/2024 [AIME], AMC 2023 [AMC], MATH500 [Math500]), reporting both average accuracy (Pass@1) and Pass@K (2107.03374, 2504.13837), the latter as an estimate of reasoning capacity. They use large values of K (up to 256) to probe the upper bound of reasoning capabilities.
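For reference, Pass@K is typically computed with the unbiased estimator from the Codex paper (2107.03374); the sketch below is a straightforward implementation of that formula, with the function name and the usage numbers chosen for illustration.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator from (2107.03374).

    n: total samples drawn for a problem, c: number of correct samples,
    k: evaluation budget. Returns the estimated probability that at least
    one of k drawn samples is correct.
    """
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 256 samples per problem, 32 of them correct.
print(pass_at_k(256, 32, 1))    # ~ average accuracy (Pass@1)
print(pass_at_k(256, 32, 64))   # large-K probe of reasoning capacity
```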
The results (Table 2, Figure 4) show consistent improvements with the entropy-based advantage across different models and RL algorithms, outperforming standard RL baselines and other methods like GPG (2502.01456). Notably, the method significantly boosts Pass@K performance, even at large K values, demonstrating enhanced exploration and reasoning potential. The analysis during training (Figure 6) shows that the method achieves slightly higher rewards, sustains longer response lengths, and maintains controlled entropy compared to baselines. Analysis on test data (Figure 7) confirms that models trained with entropy advantage produce significantly more pivotal tokens and reflective actions, generate longer responses without increased repetition, and exhibit more systematic case analysis and constraint checking in case studies (Figure 8, Appendix C), leading to more accurate solutions.
In summary, the paper provides practical insights into connecting entropy and exploratory reasoning in LMs. The proposed entropy-based advantage shaping is a simple-to-implement, effective method to enhance the exploration capabilities of RL-trained LMs, leading to improved reasoning performance, particularly on challenging problems and for tasks requiring multi-step, exploratory thought processes, as evidenced by the strong Pass@K results. The method's robustness and self-regulating nature make it a promising addition to the RLVR toolkit.