Asymmetric Importance Sampling Policy Optimization

Updated 8 October 2025
  • The paper introduces a flipped importance sampling ratio for positive-advantage tokens, ensuring rare but correct outputs receive stronger updates.
  • It employs soft dual-clipping to stabilize gradient flow and suppress extreme updates, preventing premature entropy collapse during training.
  • Empirical results demonstrate improved sample efficiency, training stability, and performance in tasks like mathematical reasoning and code generation.

Asymmetric Importance Sampling Policy Optimization (ASPO) is a methodology developed to address token-level weighting flaws in reinforcement learning-based fine-tuning of LLMs, particularly under Outcome-Supervised Reinforcement Learning (OSRL) regimes. Standard approaches leverage importance sampling (IS) ratios as token-level weights, but this induces a mismatch: positive-advantage tokens are over-weighted for highly probable outputs and suppressed for rarer, low-probability tokens, causing premature entropy collapse and inadequate exploration. ASPO corrects these dynamics by using an asymmetric IS weighting strategy—specifically, flipping the IS ratio for positive-advantage tokens—along with soft dual-clipping to maintain stability and gradient flow. Systematic evaluation demonstrates improved sample efficiency, training stability, and final performance in mathematical reasoning and code generation tasks (Wang et al., 7 Oct 2025).

1. Theoretical Foundations and Motivation

Recent LLM post-training methods rely on token-level clipping in RL, which repurposes the classical IS ratio for per-token weighting. In standard practice, the IS ratio for token $o_{ti}$ is

$$z_{ti} = \frac{\pi_{\text{new}}(o_{ti} \mid q,\, o_{t-1})}{\pi_{\text{old}}(o_{ti} \mid q,\, o_{t-1})}$$

where $\pi_{\text{new}}$ is the current policy and $\pi_{\text{old}}$ is the reference or behavioral policy. Within PPO-Clip frameworks, these ratios are used both for distributional correction and as effective training weights.
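
As a point of reference, the following is a minimal PyTorch sketch of how such per-token ratios are typically computed from cached log-probabilities in a PPO-style trainer; the function and tensor names are illustrative, not taken from the paper.

```python
import torch

def token_is_ratios(logp_new: torch.Tensor, logp_old: torch.Tensor) -> torch.Tensor:
    """Per-token IS ratios z_ti = pi_new(o_ti | q, o_{t-1}) / pi_old(o_ti | q, o_{t-1}),
    computed from log-probabilities for numerical stability.

    logp_new: log pi_new for each sampled token (carries gradients).
    logp_old: log pi_old cached at rollout time (no gradients).
    """
    return torch.exp(logp_new - logp_old)
```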

The central flaw identified in OSRL is that positive-advantage tokens (for example, those matching correct answers or desired code) receive maximal weight when they are already highly probable under the current policy and minimal weight when their probability is low. Negative-advantage tokens, by contrast, retain the expected down-weighting. The result is reinforcement of already high-probability tokens and negligible updates toward rare but correct outputs, producing a rapid decrease in entropy and high repetition rates in generations.

2. Flipping the Importance Sampling Ratio for Positive Tokens

ASPO addresses this bias by “flipping” the IS ratio for tokens with positive advantage ($A_{ti} > 0$). Specifically, for such tokens, the IS ratio is inverted:

$$\hat{z}_{ti} = z_{ti}^{-1} = \frac{\pi_{\text{old}}(o_{ti} \mid q,\, o_{t-1})}{\pi_{\text{new}}(o_{ti} \mid q,\, o_{t-1})}$$

This inversion, coupled with a stop-gradient operator on the denominator, ensures that positive-advantage tokens with low current probabilities receive stronger update weights, aligning the learning signals for positive and negative tokens.

For negative-advantage tokens, standard IS weighting is retained. The rationale is that rare, correct tokens should be up-weighted to avoid premature stabilization around suboptimal, frequent outputs.
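
A sketch of how this asymmetric weighting could be implemented in PyTorch is shown below. The detach-based construction of the flipped ratio is one plausible reading of the stop-gradient described above, not necessarily the paper's exact formulation, and all names are illustrative.

```python
import torch

def asymmetric_is_weights(logp_new: torch.Tensor,
                          logp_old: torch.Tensor,
                          advantages: torch.Tensor) -> torch.Tensor:
    """Standard ratio for negative-advantage tokens, flipped ratio for positive ones."""
    # Standard IS ratio: pi_new / pi_old.
    z_standard = torch.exp(logp_new - logp_old)
    # Flipped ratio: the forward value is pi_old / pi_new, but the gradient flows
    # only through the trailing (non-detached) logp_new term, so the update still
    # raises the token's probability, with a larger effective weight when pi_new
    # is small.
    z_flipped = torch.exp(logp_old - 2.0 * logp_new.detach() + logp_new)
    return torch.where(advantages > 0, z_flipped, z_standard)
```

In a PPO-style surrogate these weights would multiply the token advantages, so rare but correct tokens contribute proportionally more to the update.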

3. Soft Dual-Clipping Mechanism

Inverting IS weights can yield extreme update magnitudes, especially as the denominator $\pi_{\text{new}}(o_{ti} \mid q,\, o_{t-1})$ may be very small at early learning stages. To stabilize these updates while ensuring gradients propagate, ASPO introduces soft dual-clipping. Unlike hard clipping, which truncates the value and also blocks gradient flow, soft dual-clipping clamps only the value without obstructing the gradient:

  • For large inverted IS ratios, the clamped value enters the loss computation, while the backward pass still propagates gradients through the unclamped weight.
  • This mechanism suppresses outlier updates that can induce instability or mode collapse, but allows continual learning in the direction indicated by the corrected IS weighting.

This provides stability during optimization and prevents weight explosion for under-represented tokens, facilitating a gradual and meaningful increase in their probabilities.
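
A minimal sketch of such value-only clipping, assuming a detach-based implementation in PyTorch; the threshold is an illustrative default, not a value from the paper.

```python
import torch

def soft_clip(weights: torch.Tensor, clip_max: float = 5.0) -> torch.Tensor:
    """Clamp the forward value at clip_max while leaving the backward pass
    identical to that of the unclamped weights, so gradient flow is not cut off."""
    clamped = weights.clamp(max=clip_max)
    # Forward value equals `clamped`; the detached difference contributes no
    # gradient, so backward behaves as if no clipping had been applied.
    return weights + (clamped - weights).detach()
```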

4. Empirical Evaluation and Results

Comprehensive experiments demonstrate that ASPO outperforms GRPO-based baselines across coding and mathematical reasoning benchmarks:

  • On math datasets such as AIME, AMC, MATH-500, and OlympiadBench, ASPO registers gains in both accuracy metrics (avg@K) and pass rates (pass@K) relative to strong baselines.
  • On code generation benchmarks (e.g., LiveCodeBench), ASPO consistently surpasses DAPO and Nemotron.
  • Training stability benefits from ASPO’s weighting scheme: entropy decreases more slowly, repetition and clip ratios grow more slowly, and the system avoids premature convergence.
  • Empirical curves show smoother optimization, with a more gradual decline in entropy, better exploration, and improved robustness under OSRL.
  • These improvements are attributed directly to token-level weighting corrections and dual-clipping introduced by the ASPO formulation.

5. Insights into Token-Level Weighting and Update Dynamics

The analysis in ASPO reveals that classical IS ratios, when used as token-level training weights, modify the effective direction and magnitude of the update for each token. For positive tokens, the standard formulation inadvertently enhances already over-represented outputs—an effect exacerbated as entropy collapses and the model becomes repetitive.

By flipping the ratio for positive tokens, ASPO reverses this feedback loop. Under this scheme, improvements in rare but desirable outputs are prioritized, maintaining entropy and diversity longer in training. This correction is especially crucial in LLM RL, where proper credit assignment to low-probability but correct tokens is necessary not only for accuracy but also for maintaining coverage in multi-step reasoning or code synthesis.
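
A toy numerical comparison (the probabilities are illustrative, not values from the paper) makes the reversal concrete:

```python
# Two positive-advantage tokens: one the model already favors, one rare but correct.
for p_new, p_old in [(0.90, 0.60), (0.05, 0.10)]:
    standard = p_new / p_old  # standard IS weight: largest for the confident token
    flipped = p_old / p_new   # flipped IS weight: largest for the rare token
    print(f"pi_new={p_new:.2f}  standard={standard:.2f}  flipped={flipped:.2f}")
```

Under standard weighting the already-likely token receives three times the weight of the rare one; under the flipped scheme the ordering inverts, which is exactly the correction ASPO targets.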

A plausible implication is that similar asymmetric weighting strategies may be beneficial for other RL fine-tuning paradigms where per-element update signals are misaligned with the intended learning dynamics.

6. Practical Relevance and Broader Impact

ASPO’s methodology is particularly relevant for large-scale LLM post-training in settings where token-level reward assignment is critical, such as mathematical reasoning, code synthesis, dialogue, or summarization. Its stability and exploration properties, achieved via flipped-IS weighting and soft dual-clipping, help long-horizon outcome-supervised RL training of LLMs avoid degenerate local optima and maintain output diversity.

The improved training dynamics and theoretical understanding of token-level IS weighting furnish new design criteria for RL-based fine-tuning in large models, highlighting the nontrivial role of importance sampling beyond its classical distribution correction role.

7. Future Directions

The ASPO framework suggests several avenues for further research:

  • Investigation into adaptive clipping schedules or gradient scaling to further control instability during early learning phases.
  • Extension of asymmetrically weighted IS to other RL settings, including continuous control or multi-agent systems, where update dynamics may be similarly unbalanced.
  • Systematic study of the interaction between entropy regularization and asymmetric weighting, optimizing for both stability and exploration.
  • Exploration of multi-factor or hierarchical weighting schemes, combining asymmetric IS strategies with value-based or uncertainty-driven update signals.

These directions are prompted by the fundamental insight that IS ratios, when interpreted as update weights, should reflect not only the probability mass but also the desired directionality of policy improvement at the elemental (token or action) level.


ASPO represents a principled correction to token-level weighting flaws in outcome-supervised reinforcement learning for LLMs, ensuring balanced updates, sample efficiency, stable optimization, and improved final accuracy over existing state-of-the-art OSRL methods (Wang et al., 7 Oct 2025).

References

  • Wang et al. (7 Oct 2025). Asymmetric Importance Sampling Policy Optimization (ASPO).
