
The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models (2505.22617v1)

Published 28 May 2025 in cs.LG, cs.AI, and cs.CL

Abstract: This paper aims to overcome a major obstacle in scaling RL for reasoning with LLMs, namely the collapse of policy entropy. This phenomenon is consistently observed across extensive RL runs without entropy intervention: policy entropy drops sharply in the early training stage, and this diminished exploratory ability is always accompanied by saturation of policy performance. In practice, we establish a transformation equation R = -a*e^H + b between entropy H and downstream performance R. This empirical law strongly indicates that policy performance is obtained by trading off policy entropy, and is thus bottlenecked by its exhaustion; the ceiling is fully predictable (H = 0, R = -a + b). Our finding necessitates entropy management for continuous exploration toward scaling compute for RL. To this end, we investigate entropy dynamics both theoretically and empirically. Our derivation highlights that the change in policy entropy is driven by the covariance between action probability and the change in logits, which is proportional to its advantage when using Policy Gradient-like algorithms. Empirical study shows that the values of the covariance term and entropy differences match exactly, supporting the theoretical conclusion. Moreover, the covariance term stays mostly positive throughout training, further explaining why policy entropy decreases monotonically. Through understanding the mechanism behind entropy dynamics, we are motivated to control entropy by restricting the update of high-covariance tokens. Specifically, we propose two simple yet effective techniques, namely Clip-Cov and KL-Cov, which respectively clip and apply a KL penalty to tokens with high covariances. Experiments show that these methods encourage exploration, thus helping the policy escape entropy collapse and achieve better downstream performance.

Summary

  • The paper demonstrates that policy entropy collapse in RL for reasoning LLMs follows an exponential relationship (R = -a exp(H) + b), indicating performance saturation as entropy diminishes.
  • It provides a theoretical framework linking the covariance between token log-probabilities and advantages to entropy reduction, confirmed by experiments on diverse model families.
  • The study proposes Clip-Cov and KL-Cov techniques to regulate high-covariance tokens, achieving sustained exploration and up to 6.4% performance gains on downstream tasks.

This paper investigates the phenomenon of "policy entropy collapse" in reinforcement learning (RL) for reasoning LLMs. It finds that during RL training, policy entropy sharply decreases, leading to an overly confident policy with diminished exploratory ability and, consequently, performance saturation.

Key Observations and Findings:

  1. Predictable Entropy-Performance Relationship:

    The paper empirically establishes a predictable exponential relationship between validation performance ($R$) and policy entropy ($\mathcal{H}$): $R = -a \exp(\mathcal{H}) + b$.

    • This implies that performance gains are achieved by "trading" entropy.
    • The coefficients $a$ (rate at which entropy is converted into performance) and $b$ (related to maximum performance) reflect intrinsic properties of the policy and data.
    • The performance ceiling can be predicted when entropy is exhausted ($\mathcal{H} \approx 0 \implies R \approx -a + b$), highlighting a bottleneck for scaling RL.
    • This relationship allows for predicting late-stage RL performance from early-stage observations and even extrapolating performance for larger models by observing trends in $a$ and $b$ with model size.
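    A minimal sketch of how such a fit might be performed (using SciPy's curve_fit; the checkpoint values below are hypothetical, purely for illustration):

# Sketch (not from the paper): fit R = -a * exp(H) + b to early-training
# (entropy, validation score) pairs and predict the ceiling R = -a + b at H = 0.
import numpy as np
from scipy.optimize import curve_fit

def fit_entropy_performance_curve(entropies, scores):
    # entropies, scores: per-checkpoint policy entropy H and validation score R
    def model(H, a, b):
        return -a * np.exp(H) + b

    (a, b), _ = curve_fit(model, entropies, scores, p0=(0.1, 0.5))
    predicted_ceiling = -a + b  # predicted R once entropy is exhausted (H = 0)
    return a, b, predicted_ceiling

# Hypothetical early-training logs:
# a, b, ceiling = fit_entropy_performance_curve(
#     np.array([1.2, 0.9, 0.6, 0.4]), np.array([0.31, 0.38, 0.43, 0.46]))
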
  2. Theoretical Analysis of Entropy Dynamics:

    The paper derives the mechanism behind entropy change:

    • For softmax policies (like LLMs), the change in policy entropy ($\mathcal{H}(\pi_\theta^{k+1}) - \mathcal{H}(\pi_\theta^k)$) is approximately proportional to the negative covariance between the log-probability of an action and the change in its logit:

      $$\mathcal{H}(\pi_\theta^{k+1}) - \mathcal{H}(\pi_\theta^k) \approx \mathbb{E}_{s \sim d_{\pi_\theta}}\left[-\operatorname{Cov}_{a \sim \pi^k_\theta(\cdot\mid s)}\left(\log \pi^k_\theta(a \mid s),\; z^{k+1}_{s,a} - z^k_{s,a}\right)\right]$$

    • Under Policy Gradient (PG)-like algorithms, the change in logits ($z^{k+1}_{s,a} - z^k_{s,a}$) is proportional to the action's advantage ($A(s,a)$) or a function of it. For vanilla PG: $z^{k+1}_{s,a} - z^k_{s,a} = \eta\,\pi_\theta(a \mid s)\,A(s,a)$. For Natural PG: $z^{k+1}_{s,a} - z^k_{s,a} = \eta\,A(s,a)$.
    • This means high-probability actions with high advantages (positive covariance) tend to reduce entropy, explaining the observed monotonic decrease.
    • Empirical studies confirm this: the covariance term and entropy differences match, and the covariance stays mostly positive during training.
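
    As a quick numerical illustration of this relation (a toy sketch under the softmax/Natural-PG assumptions above, not code from the paper), the entropy change of a single softmax distribution after a small logit update closely matches the negative covariance term:

# Toy check (single state, small discrete action set): for a softmax policy,
# H(z + dz) - H(z) ~= -Cov_{a~pi}(log pi(a), dz_a) when dz = eta * A is small.
import torch

def entropy(logits):
    probs = torch.softmax(logits, dim=-1)
    return -(probs * torch.log(probs)).sum()

torch.manual_seed(0)
logits = torch.randn(8)        # current logits z^k
advantages = torch.randn(8)    # per-action advantages A(s, a)
eta = 0.01                     # small step size (Natural PG style update)
delta_z = eta * advantages     # logit change z^{k+1} - z^k

probs = torch.softmax(logits, dim=-1)
log_probs = torch.log(probs)

# Covariance under the current policy: E_pi[(X - E X)(Y - E Y)]
mean_logp = (probs * log_probs).sum()
mean_dz = (probs * delta_z).sum()
cov = (probs * (log_probs - mean_logp) * (delta_z - mean_dz)).sum()

actual_change = entropy(logits + delta_z) - entropy(logits)
print(f"first-order prediction: {-cov.item():.6f}  actual change: {actual_change.item():.6f}")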

  3. Ineffectiveness of Conventional Entropy Regularization: Standard methods like adding an entropy bonus to the loss or a KL divergence penalty against a reference model are found to be either ineffective, highly sensitive to hyperparameters, or detrimental to performance in the context of LLM reasoning.
  4. Proposed Entropy Control Methods: Clip-Cov and KL-Cov:

    Based on the understanding that high-covariance tokens drive entropy collapse, the paper proposes two techniques to control entropy by restricting updates for these tokens:

    • Clip-Cov: Randomly selects a small fraction of tokens with high positive covariances (between log-probability and advantage, as in the Natural PG case above) and detaches their gradients, effectively excluding them from the policy update.
      • The loss for a token $y_t$ is set to 0 if its index $t \in I_{\text{clip}}$.
      • $I_{\text{clip}}$ contains the indices of tokens whose covariance $\operatorname{Cov}(y_i)$ falls within a high range $[\omega_{\text{low}}, \omega_{\text{high}}]$; a fraction $r$ of these are selected.
    • KL-Cov: Applies a KL penalty (between current and old policy) specifically to tokens with the largest covariances.
      • The training objective for a token $y_t$ includes an additional penalty term $-\beta\,\mathbb{D}_{\text{KL}}(\pi_{\theta_{\text{old}}} \,\|\, \pi_\theta)$ if its index $t \in I_{\text{KL}}$ (equivalently, the loss gains $+\beta\,\mathbb{D}_{\text{KL}}$).
      • $I_{\text{KL}}$ contains the indices of tokens with the top-$k$ highest $\operatorname{Cov}(y_i)$. The token-wise covariance is defined as:

    $$\operatorname{Cov}(y_i) = \left(\log\pi_{\theta}(y_{i}) - \frac{1}{N}\sum_{j=1}^{N}\log\pi_\theta(y_{j})\right)\left(A(y_i) - \frac{1}{N}\sum_{j=1}^{N}A(y_j)\right)$$

The following pseudocode (a sketch assuming PyTorch and flat 1-D tensors of per-token log-probabilities and advantages) illustrates the covariance computation and the two loss modifications:

# Pseudocode for covariance calculation and policy-loss modification
import torch

def compute_token_wise_covariance(log_probs, advantages):
    # log_probs: 1-D tensor of log-probabilities of the sampled tokens
    # advantages: 1-D tensor of advantages for the same tokens
    mean_log_probs = log_probs.mean()
    mean_advantages = advantages.mean()
    # Per-token centered product, i.e. the token's contribution to Cov(log pi, A)
    return (log_probs - mean_log_probs) * (advantages - mean_advantages)

def compute_policy_loss_clip_cov(old_log_probs, current_log_probs, advantages,
                                 clip_ratio, cov_threshold_low, cov_threshold_high):
    ratio = torch.exp(current_log_probs - old_log_probs)
    pg_loss = -ratio * advantages  # simple importance-weighted surrogate; PPO-style clipping could be added

    # Covariances are only used to select tokens, so keep them out of the autograd graph
    token_covariances = compute_token_wise_covariance(
        current_log_probs.detach(), advantages.detach())

    # Tokens whose covariance falls in the high range [omega_low, omega_high]
    high_cov_mask = (token_covariances > cov_threshold_low) & (token_covariances < cov_threshold_high)
    high_cov_indices = torch.where(high_cov_mask)[0]

    loss_mask = torch.ones_like(pg_loss)
    num_to_clip = int(clip_ratio * len(high_cov_indices))
    if num_to_clip > 0:
        # Randomly select a fraction of the high-covariance tokens and zero their
        # loss contribution, which is equivalent to detaching their gradients
        selected = high_cov_indices[torch.randperm(len(high_cov_indices))[:num_to_clip]]
        loss_mask[selected] = 0.0

    return (pg_loss * loss_mask).mean()

def compute_policy_loss_kl_cov(old_log_probs, current_log_probs, advantages,
                               kl_penalty_ratio, kl_coefficient):
    ratio = torch.exp(current_log_probs - old_log_probs)
    pg_loss = -ratio * advantages

    token_covariances = compute_token_wise_covariance(
        current_log_probs.detach(), advantages.detach())

    num_to_penalize = int(kl_penalty_ratio * len(token_covariances))
    if num_to_penalize > 0:
        # Only the tokens with the largest covariances receive the KL penalty
        _, top_k_indices = torch.topk(token_covariances, num_to_penalize)

        # Token-wise estimate of D_KL(pi_old || pi_theta): log pi_old - log pi_theta
        kl_div = old_log_probs - current_log_probs

        kl_penalty_term = torch.zeros_like(pg_loss)
        kl_penalty_term[top_k_indices] = kl_coefficient * kl_div[top_k_indices]

        # Added to the loss (the loss is the negative of the objective)
        pg_loss = pg_loss + kl_penalty_term

    return pg_loss.mean()
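
A minimal usage sketch of the two functions above, with dummy per-token tensors (illustrative values only):

# Dummy data: per-token log-probabilities and advantages, flattened to 1-D
num_tokens = 1024
old_log_probs = torch.randn(num_tokens) - 2.0
current_log_probs = (old_log_probs + 0.05 * torch.randn(num_tokens)).requires_grad_()
advantages = torch.randn(num_tokens)

clip_cov_loss = compute_policy_loss_clip_cov(
    old_log_probs, current_log_probs, advantages,
    clip_ratio=2e-4, cov_threshold_low=1.0, cov_threshold_high=5.0)

kl_cov_loss = compute_policy_loss_kl_cov(
    old_log_probs, current_log_probs, advantages,
    kl_penalty_ratio=2e-3, kl_coefficient=1.0)

clip_cov_loss.backward()  # gradients flow to current_log_probs, except for masked tokens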

Experimental Validation:

  • Experiments were conducted on various model families (Qwen2.5, Mistral, LLaMA, DeepSeek-Math) across math and coding tasks.
  • The proposed Clip-Cov and KL-Cov methods successfully maintained higher policy entropy throughout training.
  • This sustained exploration led to better downstream performance on mathematical reasoning tasks, avoiding the performance plateaus seen with vanilla RL (GRPO). For example, on Qwen2.5-32B, KL-Cov showed a 6.4% average improvement over GRPO.
  • The methods allow for controllable policy entropy levels by tuning hyperparameters (clip ratio $r$ for Clip-Cov; KL coefficient $\beta$ or selection ratio $k$ for KL-Cov).

Implementation Considerations:

  • Computational Cost: Calculating token-wise covariances adds some overhead but is manageable as it involves operations on existing quantities (log-probabilities, advantages).
  • Hyperparameter Tuning:
    • Clip-Cov: clip ratio $r$ (e.g., $2 \times 10^{-4}$) and covariance bounds $\omega_{\text{low}}, \omega_{\text{high}}$ (e.g., 1 and 5).
    • KL-Cov: top-$k$ proportion for the KL penalty (e.g., $2 \times 10^{-4}$ to $2 \times 10^{-3}$) and KL coefficient $\beta$ (e.g., 1).
    • These hyperparameters are sensitive and crucial for balancing exploration and stability; the paper notes that only a very small fraction of tokens ($10^{-4}$ to $10^{-3}$) needs intervention. A configuration sketch collecting these values appears after this list.
  • Algorithm Integration: These methods modify the loss calculation within PPO-like algorithms. The paper primarily applies them in the context of GRPO.
  • Scalability: The methods proved more effective on larger models (e.g., Qwen2.5-32B vs. 7B), suggesting they help unlock the potential of larger pretrained models by mitigating exploration issues.
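
A hypothetical configuration sketch collecting the hyperparameter values quoted above (the names are illustrative, not the paper's actual configuration schema):

# Illustrative settings matching the ranges reported above
CLIP_COV_CONFIG = {
    "clip_ratio": 2e-4,         # fraction r of high-covariance tokens excluded from the update
    "cov_threshold_low": 1.0,   # omega_low: lower bound of the "high covariance" range
    "cov_threshold_high": 5.0,  # omega_high: upper bound of the "high covariance" range
}

KL_COV_CONFIG = {
    "kl_penalty_ratio": 2e-4,   # top-k proportion of tokens receiving the KL penalty (2e-4 to 2e-3)
    "kl_coefficient": 1.0,      # beta: weight of the KL penalty term
}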

Practical Implications:

  • The findings provide a deeper understanding of why RL for LLMs often hits performance ceilings.
  • The proposed $R = -a \exp(\mathcal{H}) + b$ relationship can be a useful diagnostic tool for RL training, allowing for early prediction of performance limits.
  • Clip-Cov and KL-Cov offer practical, low-overhead methods to improve RL for LLM reasoning by encouraging sustained exploration. They are relatively simple to implement by modifying the loss computation.
  • The paper suggests that managing entropy by focusing on high-covariance tokens is more effective than global entropy regularization.

Conclusion:

The paper highlights policy entropy collapse as a major obstacle in scaling RL for LLM reasoning. By understanding the dynamics of entropy through covariance analysis, it proposes Clip-Cov and KL-Cov, two simple yet effective techniques to manage entropy by targeting high-covariance tokens. These methods lead to sustained exploration and improved performance, offering a path towards more scalable and effective RL for LLMs.
