- The paper presents a gradient-based taxonomy that predicts RLVR stability by analyzing token-level advantages and distinguishing peak from valley tokens.
- The proposed Winner Advantage Policy Optimization (WAPO) algorithm applies only positive-advantage updates, stabilizing policy gradients and improving performance.
- Empirical results show that WAPO outperforms GRPO, DAPO, and GSPO by maintaining exploration, reducing collapse, and achieving higher accuracy on multi-hop QA benchmarks.
A Gradient Perspective on RLVR Stability and Winner Advantage Policy Optimization
Introduction and Motivation
This paper conducts an in-depth analysis of instability in reinforcement learning with verifiable rewards (RLVR), a predominant paradigm for improving reasoning and planning in LLMs. Despite notable progress, existing group-relative policy optimization (GRPO) methods, and variants such as DAPO and GSPO, remain susceptible to collapse, either producing high-entropy, off-task generations or contracting to repetitive, malformed outputs. The prevailing hypothesis attributes instability to policy drift and train-inference mismatch, mitigated through trust-region constraints and importance sampling. However, these interventions do not consistently safeguard against collapse. Instead, the authors propose a first-order gradient-based explanation: instability arises from the interplay between advantage sign and the local geometry of the current policy's token distribution. This gradient-oriented perspective provides a taxonomy that predicts which updates induce stability or collapse and motivates Winner Advantage Policy Optimization (WAPO), a positive-advantage-only online policy-gradient algorithm.
Gradient Analysis and Token Taxonomy
Through token-level gradient analysis, the authors formalize the differential effect of policy-gradient updates on the next-token distribution. Specifically, for a sampled token $s$ with probability $p_s$, a small gradient step on the advantage-weighted negative log-likelihood changes the probability of the sampled token and of every other token. Critically, the likelihood of non-sampled tokens may increase even for positive-advantage samples, contingent on their position relative to a reference level $C(p)$ (the sum of squared token probabilities).
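The first-order effect described above can be made concrete under one common assumption that this summary does not state: the next-token distribution is a softmax over logits $z$, and the update is a single gradient-ascent step of size $\eta$ on $A \log p_s$ taken directly in logit space. The symbols $\eta$, $z$, and $A$ are illustrative. A sketch of the resulting probability changes:

```latex
\begin{aligned}
\Delta z_j &= \eta A \,\bigl(\mathbb{1}[j = s] - p_j\bigr), \\
\Delta p_j &\approx p_j \Bigl(\Delta z_j - \sum_k p_k \,\Delta z_k\Bigr), \qquad
C(p) = \sum_k p_k^2, \\
\Delta p_s &\approx \eta A \, p_s \,\bigl(1 - 2 p_s + C(p)\bigr), \\
\Delta p_j &\approx \eta A \, p_j \,\bigl(C(p) - p_s - p_j\bigr), \qquad j \neq s.
\end{aligned}
```

Under these assumptions, with $A > 0$ a non-sampled token $j$ gains probability exactly when $p_j < C(p) - p_s$, which is only possible when $p_s < C(p)$.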
The paper introduces a taxonomy by distinguishing:
- Peak tokens: $p_s \geq C(p)$
- Valley tokens: $p_s < C(p)$
It further refines these by advantage sign into four regimes: Pos-peak, Pos-valley, Neg-peak, and Neg-valley.
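The taxonomy above can be sketched as a small classifier. This is an illustrative reading of the definitions, not the authors' code; the function names `reference_level` and `classify_token` are hypothetical, and treating a zero advantage as "no regime" is an assumption.

```python
def reference_level(probs):
    """C(p): the sum of squared token probabilities."""
    return sum(p * p for p in probs)


def classify_token(probs, sampled_index, advantage):
    """Return 'Pos-peak', 'Pos-valley', 'Neg-peak', or 'Neg-valley'.

    Returns None for a zero advantage (assumption: such a sample
    contributes no gradient, so it belongs to no regime).
    """
    if advantage == 0:
        return None
    c = reference_level(probs)
    shape = "peak" if probs[sampled_index] >= c else "valley"
    sign = "Pos" if advantage > 0 else "Neg"
    return f"{sign}-{shape}"


# Hand-picked distribution: C(p) = 0.49 + 0.04 + 0.01 = 0.54
probs = [0.7, 0.2, 0.1]
for index in range(len(probs)):
    for advantage in (1.0, -1.0):
        print(index, advantage, classify_token(probs, index, advantage))
```

In this example only the first token (probability 0.7) sits above the reference level, so the other two tokens are valley tokens regardless of advantage sign.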
First-order analysis reveals:
- Reinforcing Pos-peak decreases entropy and leads to conservative exploitation.
- Pos-valley and Neg-peak increase entropy, promoting exploration but also instability.
- Neg-valley may reduce entropy but is liable to premature, overconfident collapse.
Empirical ablations confirm these behaviors, with entropy dynamics tightly matching theoretical predictions.
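The entropy trends listed above can be checked numerically on a toy distribution. The sketch below assumes a softmax policy updated by one small gradient-ascent step on `advantage * log p[sampled_index]` in logit space; the step size `lr` and all function names are illustrative choices, not taken from the paper.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)


def entropy_change(probs, sampled_index, advantage, lr=1e-3):
    """Entropy after one logit-space ascent step minus entropy before."""
    logits = [math.log(p) for p in probs]
    stepped = [
        z + lr * advantage * ((1.0 if j == sampled_index else 0.0) - p)
        for j, (z, p) in enumerate(zip(logits, probs))
    ]
    return entropy(softmax(stepped)) - entropy(probs)


# Hand-picked distribution with C(p) = 0.54:
# index 0 (p = 0.7) is a peak token, index 2 (p = 0.1) is a valley token.
probs = [0.7, 0.2, 0.1]
for name, index, advantage in [
    ("Pos-peak", 0, 1.0),
    ("Pos-valley", 2, 1.0),
    ("Neg-peak", 0, -1.0),
    ("Neg-valley", 2, -1.0),
]:
    print(name, entropy_change(probs, index, advantage))
```

For this one distribution, entropy falls for Pos-peak and Neg-valley and rises for Pos-valley and Neg-peak, matching the direction of the first-order claims. It is a single toy case, not a reproduction of the paper's ablations.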
Limitations of Trust Region and Negative Clipping Methods
The work demonstrates that simply reducing off-policyness through ratio-based clipping does not guarantee improved stability. Aggressive removal of divergent tokens often degrades performance, particularly by adversely affecting low-probability (valley) tokens and amplifying harmful gradients. Instability, therefore, is not fully explained by the distance between the current policy and the rollout policy; it is more precisely attributed to which token-level gradients survive per-update clipping.
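To illustrate what "surviving clipping" means, the sketch below uses the standard PPO-style clipped surrogate, min(ratio * A, clip(ratio) * A). This is an assumption: the summary does not specify the exact clipping variant each baseline uses, and the names `clipped_surrogate`, `gradient_survives`, `eps_low`, and `eps_high` are hypothetical.

```python
def clipped_surrogate(ratio, advantage, eps_low=0.2, eps_high=0.2):
    """PPO-style per-token objective: min(r * A, clip(r) * A)."""
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)


def gradient_survives(ratio, advantage, eps_low=0.2, eps_high=0.2):
    """True if the token still passes a gradient through the ratio.

    Positive advantage: the gradient is cut once the ratio exceeds 1 + eps_high.
    Negative advantage: the gradient is cut once the ratio drops below 1 - eps_low.
    (The exact boundary is a kink; it is treated as surviving here.)
    """
    if advantage > 0:
        return ratio <= 1.0 + eps_high
    if advantage < 0:
        return ratio >= 1.0 - eps_low
    return False  # zero advantage contributes nothing


for ratio in (0.5, 1.0, 1.5):
    for advantage in (1.0, -1.0):
        print(ratio, advantage, gradient_survives(ratio, advantage))
```

Note the asymmetry: a negative-advantage token whose ratio has grown to 1.5 still passes a gradient, while a positive-advantage token at the same ratio does not. Clipping bounds the ratio in one direction per sign; it does not select tokens by where they sit in the distribution.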
Winner Advantage Policy Optimization
The central algorithmic contribution is WAPO, which applies policy-gradient updates only for group-normalized, positive-advantage completions. Unlike naive winner filtering or rejection fine-tuning, WAPO preserves GRPO's online structure: it uses importance ratios, group normalization, and policy-gradient clipping. For prompts lacking any positive-advantage completions, gradients are masked. Theoretical analysis in the binary reward setting shows the WAPO update is equivalent to an adaptive policy-gradient ascent with a scaling factor of $1 - q_x$ (where $q_x$ is the current policy's probability of success), concentrating updates on hard prompts.
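A minimal sketch of a positive-advantage-only objective in the spirit of the description above. Several details are assumptions rather than the paper's specification: advantages are mean-centered within the group without dividing by the standard deviation, the clip is the symmetric PPO-style clip, and the objective is averaged over all retained tokens. The names `group_advantages` and `wapo_objective` are hypothetical.

```python
def group_advantages(rewards):
    """Mean-centered group advantages (assumption: no std division)."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def wapo_objective(rewards, token_ratios, eps=0.2):
    """Clipped surrogate over tokens of positive-advantage completions only.

    rewards:      one scalar reward per completion in the group
    token_ratios: one list of per-token importance ratios per completion
    Returns 0.0 when no completion has a positive advantage (prompt masked).
    """
    terms = []
    for advantage, ratios in zip(group_advantages(rewards), token_ratios):
        if advantage <= 0:
            continue  # mask losers and ties: no gradient from these tokens
        for ratio in ratios:
            clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
            terms.append(min(ratio * advantage, clipped * advantage))
    if not terms:
        return 0.0
    return sum(terms) / len(terms)  # token-level average (assumption)


# One winner in a group of four: empirical success rate 0.25
rewards = [1, 0, 0, 0]
token_ratios = [[1.0, 1.0], [1.0], [1.0, 1.0, 1.0], [1.0]]
print(group_advantages(rewards)[0])           # winner advantage
print(wapo_objective(rewards, token_ratios))  # objective on-policy
print(wapo_objective([0, 0], [[1.0], [1.0]])) # all-failure group is masked
```

With binary rewards and mean-centering, each winner's advantage equals one minus the group's empirical success rate, which mirrors the $1 - q_x$ scaling described above. Whether the paper's normalization also divides by the group standard deviation is not stated in this summary.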
Relative to related positive-only approaches such as PSR and RAFT++, WAPO maintains token-level advantage masking and avoids the pitfalls of sequence-level normalization, such as short-answer bias.
Experimental Results
Comprehensive empirical evaluation covers mathematical reasoning (NuminaMath-LEAN, Math-500) and multi-hop QA (Hotpot-QA, OTT-QA) benchmarks over diverse LLM families (Qwen3-4B, SmolLM3-3B, Gemma3-4B). The following findings are robustly supported:
- Stability: WAPO consistently prevents collapse across all datasets and models. In cases where DAPO or GRPO collapse or saturate early, WAPO remains stable.
- Final Accuracy: WAPO matches or outperforms GRPO, DAPO, and GSPO, with pronounced gains on the most challenging datasets. For multi-hop QA, WAPO surpasses the next-best baseline by up to 10% in final accuracy.
- Exploration and Sample Diversity: Pass@k metrics indicate WAPO preserves exploratory behavior, outperforming baselines not only on pass@1 but across higher k as well.
- Generalization: Out-of-domain evaluation demonstrates robust transfer, especially on multi-hop QA, where WAPO leads across all models and target distributions.
When baselines are stable, WAPO matches their performance; when instability emerges, WAPO's stability is distinctive.
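For readers unfamiliar with the pass@k metric mentioned above, the sketch below shows the widely used unbiased estimator, 1 - C(n - c, k) / C(n, k), computed from n samples of which c are correct. The summary does not say which estimator the paper uses, so this is an assumption; `pass_at_k` is a hypothetical helper name.

```python
import math


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples with c correct.

    Probability that at least one of k samples drawn without
    replacement from the n is correct.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # too few failures to fill k draws
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


# 10 samples per prompt, 3 of them correct
for k in (1, 5, 10):
    print(k, pass_at_k(10, 3, k))
```

A method that keeps its samples diverse tends to hold its lead as k grows, whereas a collapsed policy gains little from extra samples; this is the sense in which higher-k results are read as evidence of preserved exploration.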
Theoretical and Practical Implications
This study shifts the analytic lens from trajectory- or divergence-based regularization towards token-level gradient effects as primary determinants of RLVR stability. The taxonomy clarifies that advantage sign and local probability structure jointly mediate exploration and exploitation, suggesting that effective RL objectives should balance both, rather than indiscriminately constraining update magnitude.
Practically, WAPO offers a minimally intrusive alternative to clipping-based stabilization in RLHF pipelines and other RLVR regimes. By exclusively reinforcing positively advantageous rollouts, WAPO mitigates both random (high-entropy) and mode-collapse (low-entropy) pathologies without sacrificing generalization or exploration.
Future Directions
The gradient perspective motivates extensions to more complex RLVR domains, including program synthesis and code generation, and study of its effects at larger model scales or with sparse mixture-of-experts (MoE) architectures. Adapting taxonomy-based filtering or dynamic update-masking policies may further improve stability-exploration trade-offs, especially in settings with coarse or delayed rewards.
Conclusion
This work provides a rigorous gradient-theoretic framework for understanding and diagnosing RLVR instability. The peak-valley taxonomy offers predictive and explanatory power regarding entropy trends and the destabilizing role of negative-advantage updates. WAPO operationalizes these insights, delivering strong empirical results and improving the stability-reward frontier in RL-finetuned LLMs. The demonstrated principles lay the groundwork for principled, scalable reinforcement learning techniques in high-dimensional autoregressive generative models.
Reference:
"A Gradient Perspective on RLVR Stability and Winner Advantage Policy Optimization" (2606.16154)