
Do Not Let Low-Probability Tokens Over-Dominate in RL for LLMs (2505.12929v1)

Published 19 May 2025 in cs.CL, cs.AI, and cs.LG

Abstract: Reinforcement learning (RL) has become a cornerstone for enhancing the reasoning capabilities of LLMs, with recent innovations such as Group Relative Policy Optimization (GRPO) demonstrating exceptional effectiveness. In this study, we identify a critical yet underexplored issue in RL training: low-probability tokens disproportionately influence model updates due to their large gradient magnitudes. This dominance hinders the effective learning of high-probability tokens, whose gradients are essential for LLMs' performance but are substantially suppressed. To mitigate this interference, we propose two novel methods: Advantage Reweighting and Low-Probability Token Isolation (Lopti), both of which effectively attenuate gradients from low-probability tokens while emphasizing parameter updates driven by high-probability tokens. Our approaches promote balanced updates across tokens with varying probabilities, thereby enhancing the efficiency of RL training. Experimental results demonstrate that they substantially improve the performance of GRPO-trained LLMs, achieving up to a 46.2% improvement in K&K Logic Puzzle reasoning tasks. Our implementation is available at https://github.com/zhyang2226/AR-Lopti.

Summary

  • The paper demonstrates that low-probability tokens cause disproportionate gradient norms, leading to poor reinforcement of high-probability, accurate outputs in RL training.
  • It introduces two mitigation strategies—Advantage Reweighting and Low-Probability Token Isolation (Lopti)—to rebalance token updates and improve model performance.
  • Empirical results on logical puzzles and math benchmarks show significant accuracy improvements, validating the techniques across different scales and tasks.

Analysis of "Do Not Let Low-Probability Tokens Over-Dominate in RL for LLMs"

This paper targets a key gradient imbalance issue in policy-gradient-based RL training for LLMs, specifically within the context of Group Relative Policy Optimization (GRPO). The authors rigorously analyze how low-probability tokens disproportionately dominate gradient updates during RL, provide a formal analytical characterization of the phenomenon, and propose two concrete mitigation strategies: Advantage Reweighting and Low-Probability Token Isolation (Lopti).

Gradient Dynamics and Token Update Bias

The central finding is that low-probability tokens induce gradient norms scaling with (1 − π), where π is the token's probability under the current policy. As a result, tokens less likely under the model distribution (low π) produce high-magnitude gradients, while high-probability (well-learned) tokens' gradients are suppressed. This causes RL updates—especially under objectives like the GRPO loss—to be dominated by rare/incorrect tokens at the expense of reinforcing correct, high-π behavior.
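
As a quick numerical illustration of this scaling (written for this summary, not taken from the paper's implementation), the PyTorch sketch below backpropagates a single-token policy-gradient surrogate with advantage A = 1 and prints the gradient norm at the logits; the vocabulary size and logit offsets are arbitrary illustrative choices.

import torch

vocab_size = 8  # illustrative; real LLM vocabularies are orders of magnitude larger

def logit_grad_norm(target_offset: float, target: int = 0):
    # Logits whose target-token probability is controlled via target_offset.
    z = torch.zeros(vocab_size)
    z[target] = target_offset
    z.requires_grad_(True)
    # Naive policy-gradient surrogate for one sampled token with advantage A = 1.
    loss = -torch.log_softmax(z, dim=-1)[target]
    loss.backward()
    pi_t = torch.softmax(z.detach(), dim=-1)[target].item()
    return pi_t, z.grad.norm().item()

for offset in (-4.0, 0.0, 4.0, 8.0):
    pi_t, grad_norm = logit_grad_norm(offset)
    print(f"pi={pi_t:.4f}  grad_norm={grad_norm:.4f}  1-pi={1 - pi_t:.4f}")

The printed norms stay within a small constant factor of 1 − π: a token with probability around 0.003 yields a gradient norm near 1, while a token with probability around 0.998 yields one on the order of 0.002.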

The authors demonstrate empirically and theoretically that this dynamic hinders adjustment of the model's probability mass for already likely tokens, which is crucial for correct chain-of-thought and answer formation. Notably, ablations show that when updates are restricted to low-probability tokens, the overall probability mass shifts significantly, often away from the direction that would benefit high-probability, advantageous tokens.

Proposed Mitigations: Advantage Reweighting and Lopti

To address the imbalance, two methods are introduced, each implementable atop standard GRPO or other policy gradient RL algorithms:

1. Advantage Reweighting

Instead of using the raw advantage estimate for each token, the advantage is rescaled by a linear function of the token's probability:

A_new = [α * π(token) + (1 - α)] * A_old
where α is typically selected in [0.1, 0.3]. This constrains the influence of low-probability tokens on the policy update and enables more meaningful adjustment for high-probability tokens. The implementation cost is negligible, simply requiring a new weighting factor during loss computation.
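
A minimal standalone sketch of this reweighting step is given below, assuming PyTorch tensors of per-token probabilities and advantages; the function name reweight_advantages and the default α = 0.3 are illustrative choices for this summary, not identifiers from the released code.

import torch

def reweight_advantages(advantages: torch.Tensor,
                        token_probs: torch.Tensor,
                        alpha: float = 0.3) -> torch.Tensor:
    # A_new = [alpha * pi(token) + (1 - alpha)] * A_old, applied element-wise per token.
    return (alpha * token_probs + (1.0 - alpha)) * advantages

# At alpha = 0.3, a token with pi = 0.05 keeps 71.5% of its advantage, while a token
# with pi = 0.95 keeps 98.5%, shifting relative weight toward high-probability tokens
# to offset their smaller (1 - pi) gradient factor.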

2. Low-Probability Token Isolation (Lopti)

Lopti partitions the tokens in each minibatch by their current probability relative to a threshold η (typically 0.5). RL updates for low-probability tokens are performed first; only then are high-probability tokens updated. This staged update prevents the large gradients of low-probability tokens from directly interfering with the updates for high-probability tokens. The method is straightforward to add to existing batched RL dataloaders and backprop schedules, but it incurs roughly 2x the update steps and computational overhead due to the sequential update procedure.

Practical Algorithm Skeleton

Below is a pseudocode outline for integrating both approaches into a GRPO-style RL schedule:

for minibatch in rl_train_loader:
    # Compute old token probabilities π_old and advantages A_old
    # Optionally recompute A_new via advantage reweighting
    if use_adv_reweight:
        A = (alpha * pi_old + (1 - alpha)) * A_old
    else:
        A = A_old

    # Optionally apply Lopti
    if use_lopti:
        # First, update on low-prob tokens (pi_old <= eta)
        mask = (pi_old <= eta)
        optimizer.zero_grad()
        loss = rl_policy_loss(minibatch[mask], A[mask])
        loss.backward()
        optimizer.step()
        
        # Then, update on high-prob tokens (pi_old > eta)
        mask = (pi_old > eta)
        optimizer.zero_grad()
        loss = rl_policy_loss(minibatch[mask], A[mask])
        loss.backward()
        optimizer.step()
    else:
        # Standard (reweighted) update
        optimizer.zero_grad()
        loss = rl_policy_loss(minibatch, A)
        loss.backward()
        optimizer.step()

Experimental Results

K&K Logic Puzzles

On the K&K (Knights and Knaves) Logic Puzzles—where the RL phase must learn intricate, multi-step logical deduction—the proposed mechanisms yielded large improvements:

  • For the Qwen2.5-3B-Instruct base, GRPO baseline yields 39% accuracy, while GRPO+Reweight achieves 53% (↑35.9%), GRPO+Lopti 54% (↑38.5%), and the combination achieves 57% (↑46.2%).
  • Trends hold at larger scale (Qwen2.5-7B-Instruct-1M): +15.6%, +9.1%, and +18.2% absolute improvements, respectively, when integrating Reweighting and/or Lopti into GRPO.

Ablations confirm the core hypothesis: restricting updates to high-probability tokens leads to drastic performance degradation, while reversing the update order in Lopti (high-probability tokens first) destabilizes training. The reweighting and isolation hyperparameters require some tuning, but performance gains are robust across a broad range around the recommended α and η values.

Math Reasoning Datasets

On challenging math benchmarks (OlympiadBench, MATH-500, AMC, AIME2024, Minerva), both mitigation techniques consistently increased evaluation accuracy across all tasks. Notably, further combining both strategies did not lead to additional improvement over using either in isolation. The highest gains were observed on tasks with more diversity in token distribution and complexity.

Theoretical and Practical Implications

The principal theoretical contribution is the rigorous, layer-wise derivation of how gradients backpropagated from the RL loss assign excessive weight to low-probability tokens, irrespective of overall trajectory reward or actual importance to language-modeling capability. The derived bounds connect directly to the underlying model's well-conditioned Jacobians and vocabulary size, supporting applicability to any transformer-based LLM.
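
For reference, the output-layer step of such a derivation can be written compactly; this is a standard softmax identity consistent with the paper's claim rather than its full layer-wise bound. For a single sampled token t with advantage A and logits z:

\[
\ell(z) = -A \log \mathrm{softmax}(z)_t, \qquad
\frac{\partial \ell}{\partial z_i} = -A\left(\mathbf{1}[i = t] - \pi_i\right),
\]
\[
|A|\,(1 - \pi_t) \;\le\; \left\lVert \frac{\partial \ell}{\partial z} \right\rVert_2
= |A|\sqrt{(1 - \pi_t)^2 + \textstyle\sum_{i \ne t} \pi_i^2}
\;\le\; \sqrt{2}\,|A|\,(1 - \pi_t),
\]

since \(\sum_{i \ne t} \pi_i^2 \le \big(\sum_{i \ne t} \pi_i\big)^2 = (1 - \pi_t)^2\). The per-token gradient magnitude at the logits therefore scales as Θ(1 − π_t), and the paper's bounds propagate this factor through the model's Jacobians to the parameters.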

On the practical side, the findings imply that future RL algorithms for LLM alignment and reasoning will need to engineer token-level weighting schemes more carefully, or dynamically modify their training schedules, to avoid over-correcting for rare or anomalous generations. This matters increasingly as LLMs take on long chain-of-thought tasks, where most of the valuable supervision is carried by probable tokens and rare-token reinforcement can induce detrimental shifts.

Resource and Scaling Considerations

  • Advantage Reweighting introduces negligible computational overhead and is trivial to integrate.
  • Lopti incurs a roughly 2x increase in training time due to dual-step updates per minibatch.
  • Both methods can be composed with GRPO, PPO, and REINFORCE-like objectives as well as DPO-style frameworks, and need only minor adaptation of batch-level loss computations (see the sketch after this list).
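
As a rough indication of what that adaptation looks like, the sketch below shows a generic PPO/GRPO-style clipped token loss that consumes the (optionally reweighted) per-token advantages, i.e. one plausible body for the rl_policy_loss placeholder in the skeleton above. It is a standard clipped surrogate written for this summary, not the paper's exact loss; clipped_token_loss and its arguments are illustrative names.

import torch

def clipped_token_loss(logp_new: torch.Tensor,   # log-probs of chosen tokens, current policy
                       logp_old: torch.Tensor,   # log-probs of the same tokens at rollout time
                       advantages: torch.Tensor, # per-token (optionally reweighted) advantages
                       clip_eps: float = 0.2) -> torch.Tensor:
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Pessimistic (elementwise min) PPO-style objective, negated for gradient descent.
    return -torch.min(unclipped, clipped).mean()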

Future Directions

The work highlights a significant opportunity for structured, token-level curriculum or importance-weighting strategies in RL-based LLM training. Possible future directions include:

  • Automated hyperparameter selection/scheduling for reweighting and isolation thresholds based on gradient diagnostics.
  • Extension to task- and language-specific curriculum to better modulate the interplay between token frequency and reward assignment.
  • Adaptive methods that dynamically adjust gradient scaling to maintain an optimal ratio of low/high-probability token update magnitudes throughout training.
  • Application to non-textual modalities with large output spaces (e.g., code, multimodal LLMs).

In summary, this work exposes a previously overlooked but impactful source of inefficiency and instability in RL-based LLM training and offers two simple, practical remedies that are highly effective at curbing low-probability token dominance. These methods are expected to become part of the standard toolkit for RLHF and self-improving LLM pipelines.
