
Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning (2506.01939v1)

Published 2 Jun 2025 in cs.CL, cs.AI, and cs.LG

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful approach to enhancing the reasoning capabilities of LLMs, while its mechanisms are not yet well understood. In this work, we undertake a pioneering exploration of RLVR through the novel perspective of token entropy patterns, comprehensively analyzing how different tokens influence reasoning performance. By examining token entropy patterns in Chain-of-Thought (CoT) reasoning, we observe that only a small fraction of tokens exhibit high entropy, and these tokens act as critical forks that steer the model toward diverse reasoning pathways. Furthermore, studying how entropy patterns evolve during RLVR training reveals that RLVR largely adheres to the base model's entropy patterns, primarily adjusting the entropy of high-entropy tokens. These findings highlight the significance of high-entropy tokens (i.e., forking tokens) to RLVR. We ultimately improve RLVR by restricting policy gradient updates to forking tokens and uncover a finding even beyond the 80/20 rule: utilizing only 20% of the tokens while maintaining performance comparable to full-gradient updates on the Qwen3-8B base model and significantly surpassing full-gradient updates on the Qwen3-32B (+11.04 on AIME'25 and +7.71 on AIME'24) and Qwen3-14B (+4.79 on AIME'25 and +5.21 on AIME'24) base models, highlighting a strong scaling trend. In contrast, training exclusively on the 80% lowest-entropy tokens leads to a marked decline in performance. These findings indicate that the efficacy of RLVR primarily arises from optimizing the high-entropy tokens that decide reasoning directions. Collectively, our results highlight the potential to understand RLVR through a token-entropy perspective and optimize RLVR by leveraging high-entropy minority tokens to further improve LLM reasoning.

Summary

  • The paper demonstrates that high-entropy minority tokens, or 'forking tokens', are pivotal in steering LLM reasoning during Reinforcement Learning with Verifiable Rewards.
  • It shows that focusing gradient updates on the top 20% high-entropy tokens significantly enhances reasoning performance, with larger models benefiting the most.
  • The study provides practical insights for designing more efficient RL algorithms by selectively modulating token entropy to optimize exploration and learning.

This paper, "Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning" (2506.01939), investigates the role of token-level entropy in Reinforcement Learning with Verifiable Rewards (RLVR) for enhancing the reasoning capabilities of LLMs. The authors find that a small fraction of "high-entropy" tokens, termed "forking tokens," are critical for guiding reasoning paths, and optimizing these tokens is key to effective RLVR.

The core idea is that not all tokens contribute equally to the reasoning process or to the learning signals in RLVR. By analyzing Chain-of-Thought (CoT) reasoning, the paper observes distinct token entropy patterns:

  1. A majority of tokens are generated with low entropy, often completing linguistic structures or executing deterministic steps.
  2. A minority of tokens exhibit high entropy. These tokens typically act as "forks" or decision points that steer the model toward diverse reasoning pathways (e.g., words like "suppose," "however," "thus").

The practical implication of this observation is demonstrated by experiments where decoding temperatures are selectively modulated. Increasing the temperature (and thus entropy) for high-entropy forking tokens improves reasoning performance, while decreasing it leads to degradation. Conversely, modulating the temperature of low-entropy tokens has a less significant impact.
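
To make the selective modulation concrete, here is a minimal decoding sketch. It is an illustration under assumed names and values (model, boost_temperature, entropy_threshold), not the paper's experimental code; it simply samples with a higher temperature only at positions whose base-temperature entropy exceeds a threshold.

import torch

@torch.no_grad()
def decode_with_selective_temperature(model, input_ids, max_new_tokens=256,
                                      base_temperature=1.0, boost_temperature=1.4,
                                      entropy_threshold=1.0):
    # Sketch: `model` is assumed to be a causal LM returning next-token logits
    # (e.g. a HuggingFace-style model with a .logits field); EOS handling omitted.
    for _ in range(max_new_tokens):
        logits = model(input_ids).logits[:, -1, :]                  # [batch, vocab]
        base_log_probs = torch.log_softmax(logits / base_temperature, dim=-1)
        entropy = -(base_log_probs.exp() * base_log_probs).sum(-1)  # [batch]
        # Raise the temperature only where the next-token distribution is high-entropy.
        temperature = torch.full_like(entropy, base_temperature)
        temperature[entropy > entropy_threshold] = boost_temperature
        probs = torch.softmax(logits / temperature.unsqueeze(-1), dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)        # [batch, 1]
        input_ids = torch.cat([input_ids, next_token], dim=-1)
    return input_ids

Setting boost_temperature below base_temperature would correspond to the degradation case described above, where forking-token entropy is suppressed.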

The paper then examines how token entropy patterns evolve during RLVR training (using DAPO as the baseline algorithm). Key findings include:

  • RLVR largely preserves the base model's entropy patterns. The positions of high-entropy tokens remain relatively stable.
  • RLVR primarily adjusts the entropy of already high-entropy tokens, while low-entropy tokens show minimal changes.

Based on these insights, the authors propose a modification to RLVR: restricting policy gradient updates to only the highest-entropy tokens. Specifically, they experiment with using only the top 20% of high-entropy tokens for gradient updates in the DAPO algorithm.

The token entropy $H_t$ at a given generation step $t$ is calculated as:

$$H_t = -\sum_{j=1}^{V} p_{t,j} \log p_{t,j}$$

where $\boldsymbol{p}_t = \mathrm{Softmax}\!\left(\frac{\boldsymbol{z}_t}{T}\right)$ is the probability distribution over the vocabulary of size $V$, $\boldsymbol{z}_t$ are the pre-softmax logits, and $T$ is the decoding temperature.
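
As a quick numerical illustration of this formula (with made-up logits, not values from the paper), a sharply peaked next-token distribution yields near-zero entropy, while one spread over several plausible continuations approaches the uniform maximum:

import torch

def token_entropy(logits, temperature=1.0):
    # H_t = -sum_j p_{t,j} * log p_{t,j}, with p_t = softmax(z_t / T)
    log_p = torch.log_softmax(logits / temperature, dim=-1)
    return -(log_p.exp() * log_p).sum(-1)

peaked = torch.tensor([10.0, 0.0, 0.0, 0.0])  # low-entropy, near-deterministic token
spread = torch.tensor([2.0, 1.8, 1.7, 1.5])   # high-entropy "forking" token
print(token_entropy(peaked))  # ~0.0015 nats
print(token_entropy(spread))  # ~1.37 nats; the uniform maximum over 4 options is log 4 ≈ 1.386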

The modified DAPO objective, focusing on high-entropy tokens, is:

$$\mathcal{J}_{\text{HighEnt}}^{\mathcal{B}}(\theta) = \mathbb{E}_{\mathcal{B} \sim \mathcal{D},\, (\boldsymbol{q}, \boldsymbol{a}) \sim \mathcal{B},\, \{\boldsymbol{o}^i\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}(\cdot \mid \boldsymbol{q})} \left[ \frac{1}{\sum_{i=1}^{G} |\boldsymbol{o}^i|} \sum_{i=1}^{G} \sum_{t=1}^{|\boldsymbol{o}^i|} \mathbb{I}\!\left[ H_t^i \geq \tau_{\rho}^{\mathcal{B}} \right] \cdot \min\!\left( r_{t}^i(\theta)\, \hat{A}_{t}^i,\ \text{clip}\!\left(r_{t}^i(\theta),\, 1 - \epsilon_{\text{low}},\, 1 + \epsilon_{\text{high}}\right) \hat{A}_{t}^i \right) \right]$$

subject to $0 < |\{ \boldsymbol{o}^i \mid \text{is\_equivalent}(\boldsymbol{a}, \boldsymbol{o}^i) \}| < G$. Here, $\mathbb{I}\left[ H_t^i \geq \tau_{\rho}^{\mathcal{B}} \right]$ is an indicator function equal to 1 if the entropy $H_t^i$ of token $t$ in response $i$ is at or above a batch-specific threshold $\tau_{\rho}^{\mathcal{B}}$ (chosen so that the top $\rho$ proportion of tokens is selected), and 0 otherwise. This effectively masks gradients for low-entropy tokens.

Implementation Details and Experimental Setup:

  • Base Models: Qwen3-8B, Qwen3-14B, Qwen3-32B, and Llama-3.1-8B.
  • RLVR Algorithm: DAPO.
  • Training Data: DAPO-Math-17K dataset.
  • Evaluation Benchmarks: AIME'24, AIME'25, AMC'23, MATH500, Minerva, OlympiadBench, and LiveCodeBench (for OOD generalization).
  • High-Entropy Token Proportion ($\rho$): Primarily 20%, with ablations for 10% and 50%.
  • Training Hyperparameters: Batch size 512, mini-batch size 32, learning rate $10^{-6}$, max response length 20480 tokens (extended to 29696 in some experiments).
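
For concreteness, the reported setup could be summarized in a configuration dictionary like the sketch below; the field names are illustrative assumptions, not taken from a released config:

# Illustrative configuration mirroring the reported setup; field names are assumptions.
rlvr_config = {
    "algorithm": "DAPO",
    "base_models": ["Qwen3-8B", "Qwen3-14B", "Qwen3-32B", "Llama-3.1-8B"],
    "train_dataset": "DAPO-Math-17K",
    "high_entropy_token_fraction": 0.20,  # rho; ablations use 0.10 and 0.50
    "batch_size": 512,
    "mini_batch_size": 32,
    "learning_rate": 1e-6,
    "max_response_length": 20480,         # extended to 29696 in some experiments
}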

Key Experimental Results:

  • Performance: Training with only the top 20% high-entropy tokens achieves performance comparable to full-gradient updates on Qwen3-8B and significantly surpasses full-gradient updates on larger models like Qwen3-32B (+11.04 on AIME'25, +7.71 on AIME'24) and Qwen3-14B (+4.79 on AIME'25, +5.21 on AIME'24).
  • Scaling Trend: The benefits of focusing on forking tokens increase with model size.
  • Low-Entropy Tokens: Training exclusively on the 80% lowest-entropy tokens leads to a marked decline in performance.
  • Exploration: The authors suggest that retaining a critical subset of high-entropy tokens (around 20%) optimally balances exploration and performance. Using too few (e.g., 10%) or too many (e.g., 50%, 100%) high-entropy tokens can reduce overall entropy during training and lead to worse performance, possibly by hindering effective exploration.
  • Generalization: On the out-of-distribution LiveCodeBench dataset, DAPO with only the top 10% or 20% high-entropy tokens still outperformed vanilla DAPO on Qwen3-32B, suggesting that these tokens are linked to generalization capabilities.
  • Response Length: Training with only high-entropy tokens often leads to longer, more detailed responses, which can be beneficial for complex reasoning.

Practical Implications and Applications:

  1. More Efficient RLVR: By focusing gradient updates on a smaller subset of tokens, there's potential for computational savings, although the paper focuses on performance improvements. The primary benefit shown is improved reasoning performance and training stability, especially for larger models.
  2. Understanding LLM Reasoning: The token entropy perspective offers a new way to analyze how LLMs perform reasoning and how RLVR influences this process. It suggests that critical decision points are where learning should be concentrated.
  3. Guiding Future RLVR Algorithm Design: The findings can inform the development of more targeted RL algorithms that strategically leverage these forking tokens.
  4. SFT vs. RL: The paper hypothesizes that RL's tendency to preserve or increase entropy in forking tokens (maintaining reasoning flexibility) might explain its better generalization compared to SFT, which tends to reduce entropy and potentially memorize.
  5. Rethinking Entropy Bonuses: Uniformly applied entropy bonuses (common in RL) might be suboptimal for LLM reasoning, as they could undesirably increase the entropy of low-entropy majority tokens. Techniques like DAPO's "clip-higher" are more aligned with selectively promoting entropy in high-entropy tokens.
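
To make the last point concrete, the sketch below (an illustration, not the paper's method) contrasts a uniform entropy bonus with one restricted to high-entropy positions via the same kind of indicator mask used in the objective above:

import torch

def entropy_bonus(token_entropies, high_entropy_mask, coef=1e-3, selective=True):
    # Uniform bonus (selective=False) pushes entropy up everywhere, including the
    # low-entropy structural tokens; the selective variant (an assumption sketched
    # here, not the paper's proposal) rewards entropy only at forking positions.
    if selective:
        denom = high_entropy_mask.sum().clamp(min=1.0)
        return coef * (token_entropies * high_entropy_mask).sum() / denom
    return coef * token_entropies.mean()

# Usage sketch: total_loss = policy_loss - entropy_bonus(token_entropies, high_entropy_mask)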

Pseudocode for Identifying and Using High-Entropy Tokens in a Batch (PyTorch-style; rollout tensors such as current_policy_logits_for_batch_actions, policy_log_probs, old_policy_log_probs, and advantages are assumed to be available from the RL training step):

import torch
import torch.nn.functional as F

def calculate_token_entropies(logits_for_entropy_calc, temperature=1.0):
    # Per-token entropy of the *training* policy's distribution at each position,
    # even though the tokens themselves were sampled from the old policy.
    probs = F.softmax(logits_for_entropy_calc / temperature, dim=-1)
    log_probs_all_vocab = F.log_softmax(logits_for_entropy_calc / temperature, dim=-1)
    entropies = -torch.sum(probs * log_probs_all_vocab, dim=-1)  # [batch_size, seq_len]
    return entropies

# 1. Token entropies for the sampled responses under the current policy.
token_entropies_in_batch = calculate_token_entropies(
    current_policy_logits_for_batch_actions)               # [micro_batch_size, seq_len]

# 2. Batch-level threshold tau_rho: the (1 - rho) quantile of all token entropies,
#    so that approximately the top rho proportion (e.g. rho = 0.2) is selected.
#    (Assumes 0 < rho <= 1; torch.quantile requires a float tensor.)
all_entropies_flat = token_entropies_in_batch.flatten().float()
entropy_threshold = torch.quantile(all_entropies_flat, 1.0 - rho)

# 3. Indicator I[H_t^i >= tau_rho]: the "forking token" mask.
high_entropy_mask = (token_entropies_in_batch >= entropy_threshold).float()  # [micro_batch_size, seq_len]

# 4. DAPO clipped surrogate with asymmetric "clip-higher" bounds, as in the
#    objective above: min(r_t * A_t, clip(r_t, 1 - eps_low, 1 + eps_high) * A_t).
ratios = torch.exp(policy_log_probs - old_policy_log_probs)   # [micro_batch_size, seq_len]
surr1 = ratios * advantages
surr2 = torch.clamp(ratios, 1.0 - epsilon_low, 1.0 + epsilon_high) * advantages
objective_terms = torch.min(surr1, surr2)

# 5. Mask out gradients for low-entropy tokens and normalize by the total number of
#    response tokens, matching the 1 / sum_i |o^i| factor in the objective.
#    (Averaging over only the selected tokens, i.e. dividing by high_entropy_mask.sum(),
#    is a possible alternative but does not match the paper's formulation.)
masked_objective_terms = objective_terms * high_entropy_mask
total_tokens_in_micro_batch = high_entropy_mask.numel()  # or the sum of actual response lengths
final_dapo_loss = -masked_objective_terms.sum() / total_tokens_in_micro_batch

Limitations and Future Work:

  • The paper primarily uses Qwen models; further validation on diverse model architectures is needed.
  • Experiments are mainly on mathematical reasoning; extending to other domains like programming or complex problem-solving (e.g., ARC-AGI) would be valuable.
  • The optimal proportion of high-entropy tokens (20% in this paper) might vary with different settings, datasets, or models.
  • Future work includes developing new RLVR algorithms that better leverage these insights and exploring applications in SFT, distillation, inference, and multi-modal training.

In summary, the paper presents compelling evidence that focusing RLVR training on a small subset of high-entropy "forking" tokens can lead to significant performance improvements in LLM reasoning, especially for larger models. This offers a more targeted and potentially more efficient approach to RLVR, along with deeper insights into the mechanics of LLM reasoning.
