Token Entropy Patterns in RLVR-Enhanced LLM Reasoning
Token entropy patterns are statistical and structural regularities in the uncertainty associated with each token's selection during sequential reasoning or generation by LLMs. In the context of reinforcement learning with verifiable rewards (RLVR) for chain-of-thought (CoT) LLM reasoning, token entropy patterns capture how the model's confidence varies at each step, delineating critical “forks” in reasoning where multiple plausible continuations exist. Recent research demonstrates that these high-entropy minority tokens—comprising about 20% of all tokens—play a disproportionate role in effective RL optimization and reasoning performance.
1. Token Entropy and Its Computation
Token-level entropy at generation step $t$ is computed as

$$H_t = -\sum_{j=1}^{V} p_{t,j}\,\log p_{t,j}, \qquad p_{t,j} = \mathrm{softmax}\!\left(z_t / T\right)_j,$$

where $p_{t,j}$ is the probability assigned to token $j$ at position $t$ by the model's output distribution (the softmax over the logits $z_t$ at decoding temperature $T$). Here, $V$ is the vocabulary size and $0 \le H_t \le \log V$.
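For concreteness, a minimal PyTorch sketch of this computation (the tensor shapes and the 20% cutoff below are illustrative, not values from the source):

```python
import torch
import torch.nn.functional as F

def token_entropies(logits: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """Per-position entropy H_t of the next-token distribution, in nats.

    logits: (seq_len, vocab_size) raw model outputs; returns a (seq_len,) tensor.
    """
    log_p = F.log_softmax(logits / temperature, dim=-1)  # log p_{t,j}
    return -(log_p.exp() * log_p).sum(dim=-1)            # H_t = -sum_j p_{t,j} log p_{t,j}

# Example: mark the top 20% of positions by entropy as candidate forking tokens.
logits = torch.randn(128, 32000)             # stand-in for real model logits
H = token_entropies(logits)
forking_mask = H >= torch.quantile(H, 0.8)   # True at high-entropy positions
```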
Interpretively:
- Low-entropy tokens reflect highly certain or deterministic positions (e.g., grammatical function words, completion of formulaic phrases).
- High-entropy tokens correspond to points of high uncertainty, often aligning with decision junctures in reasoning—“forks” determining the direction of the subsequent logical flow.
2. Empirical Properties of Entropy Patterns in Reasoning
Analysis of entropy patterns in CoT responses from state-of-the-art LLMs trained with RLVR reveals:
- The majority (>50%) of tokens exhibit very low entropy, with $H_t$ near zero at most steps.
- Only a minority (typically the top ~20% by entropy, termed “forking tokens”) show markedly elevated entropy, but these control the diversity and direction of reasoning.
- Forking tokens commonly correspond to semantic or logical choice points (e.g., “however”, “since”, mathematical assumption steps), with uncertainty typically reflecting ambiguity or opportunity for exploration in multi-step reasoning.
Modulation of token entropy during decoding—such as increasing the temperature for these tokens—has been shown to directly improve reasoning outcomes, confirming their centrality.
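As a concrete illustration, the sketch below decodes token by token, computes $H_t$ at each step, and samples with a higher temperature whenever the entropy crosses a threshold. The model name, threshold, and temperature values are placeholders rather than settings reported in the source; this is one possible instantiation of entropy-aware temperature modulation, not the authors' exact procedure.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-8B"  # placeholder; any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def decode_with_fork_temperature(prompt: str, max_new_tokens: int = 256,
                                 base_temp: float = 0.7, fork_temp: float = 1.2,
                                 entropy_threshold: float = 2.0) -> str:
    """Token-by-token sampling that raises the temperature at high-entropy steps.

    No KV cache is used; the loop is kept simple for clarity.
    """
    ids = tok(prompt, return_tensors="pt").input_ids.to(model.device)
    for _ in range(max_new_tokens):
        with torch.no_grad():
            logits = model(ids).logits[0, -1].float()       # next-token logits
        log_p = torch.log_softmax(logits, dim=-1)
        entropy = -(log_p.exp() * log_p).sum().item()       # H_t in nats
        temp = fork_temp if entropy >= entropy_threshold else base_temp
        probs = torch.softmax(logits / temp, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        ids = torch.cat([ids, next_id.view(1, 1)], dim=-1)
        if next_id.item() == tok.eos_token_id:
            break
    return tok.decode(ids[0], skip_special_tokens=True)
```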
3. Policy Gradient Objectives and High-Entropy Token Optimization
In RLVR (e.g., the DAPO algorithm), the policy-gradient objective for a query $q$ with a group of sampled responses $\{o_i\}_{i=1}^{G}$ is

$$J(\theta) = \mathbb{E}\left[\frac{1}{\sum_{i=1}^{G}|o_i|}\sum_{i=1}^{G}\sum_{t=1}^{|o_i|}\min\!\Big(r_{i,t}(\theta)\,\hat{A}_{i,t},\ \mathrm{clip}\big(r_{i,t}(\theta),\,1-\epsilon_{\mathrm{low}},\,1+\epsilon_{\mathrm{high}}\big)\,\hat{A}_{i,t}\Big)\right],$$

where $r_{i,t}(\theta) = \pi_\theta(o_{i,t}\mid q, o_{i,<t}) / \pi_{\theta_{\mathrm{old}}}(o_{i,t}\mid q, o_{i,<t})$ is the importance ratio and $\hat{A}_{i,t}$ is the estimated advantage. In standard RLVR, gradients are computed over all time steps (i.e., all tokens).
A key empirical finding is that restricting policy-gradient updates to only the top $\rho$ (e.g., 20%) highest-entropy tokens within each batch preserves or improves performance. The restricted objective multiplies each per-token term by $\mathbb{1}\!\left[H_{i,t} \ge \tau_\rho^{B}\right]$, where $\mathbb{1}[\cdot]$ is the indicator that token $(i,t)$ is among the forking tokens and $\tau_\rho^{B}$ is the batch-wise entropy threshold set so that only the top-$\rho$ quantile is selected.
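A minimal sketch of this selective update is shown below, assuming per-token log-probabilities, advantages, and entropies have already been gathered for a batch of responses; the clipping constants and the normalization over selected tokens are illustrative choices, not prescribed values.

```python
import torch

def high_entropy_pg_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
                         advantages: torch.Tensor, entropies: torch.Tensor,
                         rho: float = 0.20, eps_low: float = 0.2,
                         eps_high: float = 0.28) -> torch.Tensor:
    """Clipped surrogate loss restricted to the top-rho highest-entropy tokens.

    All inputs are flat tensors over every response token in the batch.
    """
    ratio = (logp_new - logp_old).exp()                       # importance ratio r_{i,t}
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
    per_token = torch.minimum(ratio * advantages, clipped * advantages)

    # Batch-wise threshold tau such that only the top-rho quantile by entropy is kept.
    tau = torch.quantile(entropies, 1.0 - rho)
    mask = (entropies >= tau).float()                         # 1 at forking tokens, 0 elsewhere

    # Maximize the masked surrogate => minimize its negative mean over selected tokens.
    return -(per_token * mask).sum() / mask.sum().clamp(min=1.0)
```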
Restricting updates to these high-entropy (“actionable”) tokens:
- Preserves and enhances model exploration at reasoning forks.
- Avoids unnecessary updates to highly constrained (low-entropy) tokens where the next step is almost deterministic; gradients at these positions add noise but little meaningful signal.
- Is especially beneficial as model size grows, aligning model capacity and optimization effort with the locus of reasoning uncertainty.
4. The 80/20 Rule Refined: Performance Implications
Empirical studies across multiple Qwen3 model scales (8B, 14B, 32B parameters) demonstrate:
- RLVR restricted to high-entropy tokens (the top 20%) yields accuracy comparable to or better than full-token RLVR, with the largest gains at larger scales. For Qwen3-32B, focusing on forking tokens yields an AIME'24 test accuracy of 63.5 (vs. 55.8 for full-token RLVR), a +7.7 point gain.
- Training only on the 80% lowest-entropy tokens leads to severe performance collapse, indicating that low-entropy tokens contribute little to (and can even hinder) reasoning generalization.
This goes beyond the traditional “Pareto principle”: optimizing the 20% highest-entropy tokens accounts for nearly all of RLVR's impact on reasoning benchmarks.
| Model | Full-Token RLVR (AIME'24) | High-Entropy Token RLVR (AIME'24) | Accuracy Gain |
|---|---|---|---|
| Qwen3-8B | 33.3 | 34.6 | +1.3 |
| Qwen3-14B | 45.2 | 50.4 | +5.2 |
| Qwen3-32B | 55.8 | 63.5 | +7.7 |
5. Significance for RLVR Algorithm Development and Scaling
As model size increases, the magnitude of gains from forking token RLVR grows, indicating that large-capacity models can better exploit exploration and reasoning diversity at uncertain decision points.
- Training efficiency is markedly improved: only ~20% of tokens contribute to the policy-gradient update, with no loss (and sometimes a substantial gain) in performance.
- Successful RLVR increasingly requires dynamic identification and targeted optimization of high-entropy/fork tokens rather than indiscriminate application across a sequence.
- The approach suggests that RLVR algorithms should incorporate entropy-aware curricula or dynamic masking, focusing compute resources on the tokens most responsive to exploration and reward feedback, as sketched below.
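As one purely hypothetical reading of an entropy-aware curriculum, the selected fraction $\rho$ could itself be scheduled over training, starting broad and progressively narrowing toward the forks; the function below is an illustration, not a published recipe.

```python
def entropy_fraction_schedule(step: int, total_steps: int,
                              rho_start: float = 0.5, rho_end: float = 0.2) -> float:
    """Hypothetical linear schedule for the fraction rho of highest-entropy
    tokens that receive policy-gradient updates."""
    progress = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return rho_start + (rho_end - rho_start) * progress
```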
6. Underlying Mechanisms and Rationale
Optimizing only high-entropy tokens enables reinforcement learning to:
- Sharpen model decisions at reasoning forks, improving the ability to generalize to new reasoning tasks or domains.
- Preserve model flexibility and diversify solution pathways, directly counteracting the “mode collapse” risk when RL naively flattens the entropy landscape.
- Filter out spurious gradients from low-entropy, “bureaucratic” tokens, which can tether the optimization to narrow, overfit trajectories.
This pattern aligns with the broader observation that RL (which maintains or increases entropy at forks) yields better out-of-distribution generalization than supervised fine-tuning (which pushes for deterministic, low-entropy outputs everywhere).
7. Implications for the Theory and Practice of LLM Reasoning
The elucidation of token entropy patterns offers a principled framework for RLVR and potentially broader LLM training strategies:
- Detailed token-level entropy analysis can serve as a diagnostic or guidance tool for curriculum design, error analysis, and targeted intervention.
- Concentrating training on forks raises the signal-to-noise ratio of reward-driven optimization.
- The finding that "20% of tokens drive all the reward" offers a new theoretical and practical vantage point for understanding the mechanisms and scaling properties of LLM-based reasoning.
In summary, token entropy patterns—principally the identification and targeted optimization of high-entropy, forking tokens—are foundational both to understanding and advancing the performance of RLVR-trained LLMs in complex reasoning tasks. The observed scaling trends and efficiency gains suggest that future RL algorithms for large models will benefit from adaptive, entropy-aware optimization focused on these critical decision points.