
ASymPO: Asymmetric-Scale Policy Optimization for Asynchronous LLM Post-Training Without Behavior Information

Published 2 Jun 2026 in cs.LG and cs.AI | (2606.03070v2)

Abstract: Asynchronous reinforcement learning can improve language-model post-training throughput by decoupling response generation from policy optimization, but stale responses introduce distribution drift. Standard behavior-corrected methods control this drift with behavior-policy probabilities, importance ratios, or clipping, which requires token-aligned, versioned, and numerically consistent behavior log-probabilities across rollout and learner systems. We ask whether asynchronous group-relative RL can instead be stabilized using only current-policy probabilities. We identify a scale-imbalance failure mode: when stale responses are evaluated under the current policy, positive and negative loss terms can appear at different negative log-probability scales, so zero-sum advantages no longer imply balanced loss contributions. We propose Asymmetric-Scale Policy Optimization (ASymPO), which normalizes each response's token loss by its current average token negative log-probability. ASymPO requires no behavior-policy probabilities, restores response-level zero-sum balance, and preserves a nonzero learning signal. We also introduce Scaled Policy Optimization (SPO), a fixed negative-scaling baseline, and evaluate both current-policy-only objectives in asynchronous mathematical reasoning post-training.

Summary

  • The paper introduces ASymPO, a current-policy-only reinforcement learning method that uses adaptive response-level normalization to balance the impact of negative and positive updates.
  • It demonstrates that ASymPO recovers group-relative zero-sum balance by moderating scale imbalances, preventing training collapse even without behavior-policy probabilities.
  • Empirical tests on models like Qwen3 and LLaMA show stable reward trajectories and competitive performance compared to GRPO, while simplifying infrastructure requirements.

Technical Problem and Motivation

The paper addresses the challenge of efficient asynchronous reinforcement learning (RL) post-training of LLMs under reward-driven objectives—particularly mathematical reasoning—without access to behavior-policy probabilities. In typical RL fine-tuning scenarios, asynchronous designs decouple response generation (rollout workers) from policy optimization (learner), which improves throughput but causes distribution drift between the behavior policy (used to generate responses) and the current policy (used for optimization). Standard PPO-style methods correct this drift via importance ratios and clipping, requiring transport of token-aligned, versioned, numerically consistent behavior-policy log-probabilities. This incurs substantial infrastructure overhead and complexity, motivating the search for current-policy-only objectives that retain stability without relying on behavior information.
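To make the infrastructure dependency concrete, here is a minimal sketch of the standard PPO-style clipped token objective. It is not taken from the paper; the function name and the default `eps=0.2` are illustrative assumptions. The point is that the importance ratio cannot be computed without `logp_behavior`, the log-probability the rollout worker assigned to the token at generation time.

```python
import math


def clipped_token_objective(logp_current, logp_behavior, advantage, eps=0.2):
    """PPO-style clipped surrogate for one token (to be maximized).

    logp_behavior must be shipped from the rollout system and must line up
    token by token with logp_current; eps=0.2 is an illustrative default.
    """
    ratio = math.exp(logp_current - logp_behavior)
    clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
    # Pessimistic choice between the unclipped and clipped surrogate.
    return min(ratio * advantage, clipped_ratio * advantage)


# When the two policies agree, the ratio is 1 and the term equals the advantage.
print(clipped_token_objective(-1.0, -1.0, 2.0))
# When the current policy has drifted upward, a positive-advantage term is capped.
print(clipped_token_objective(-1.0, -2.0, 1.0))
```

A current-policy-only objective would take `logp_current` and `advantage` alone, which is what removes the need to transport behavior log-probabilities.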

Analysis of Scale-Imbalance Failure Mode

The paper formalizes a critical failure mode: naive current-policy-only RL objectives, especially with group-relative rewards, do not preserve the positive-negative balance that zero-sum advantages are meant to induce. Because stale responses are evaluated under the current policy, positive and negative loss terms can sit at very different negative log-probability scales. In particular, negative samples can dominate because their current-policy probability is very low and their negative log-probability is correspondingly large. This breaks the cancellation between positive and negative terms, destabilizes training, and leads to collapse, even when the sampled responses were well scaled under the behavior policy. By contrast, classical importance-sampling-based corrections (PPO, GRPO) carry the advantage balance over to the loss contributions through importance ratios and clipping, preserving group-level equilibrium.
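A small numeric illustration of this failure mode follows. The log-probabilities are made-up values, and the loss form (advantage times mean token negative log-probability, averaged over the group) is an assumed simplification of a naive current-policy-only objective rather than the paper's exact definition.

```python
def mean_nll(token_logprobs):
    """Average token negative log-probability of one response under the current policy."""
    return -sum(token_logprobs) / len(token_logprobs)


def naive_contributions(advantages, responses):
    """Per-response terms A_g * S_g / G of a naive current-policy-only loss."""
    group_size = len(advantages)
    return [a * mean_nll(lp) / group_size for a, lp in zip(advantages, responses)]


# Hypothetical group of two responses with zero-sum advantages.
advantages = [1.0, -1.0]
responses = [
    [-0.2, -0.3],  # rewarded response, still likely under the current policy
    [-3.0, -5.0],  # penalized stale response, now very unlikely
]

contribs = naive_contributions(advantages, responses)
print("advantage sum:", sum(advantages))
print("loss contributions:", contribs)
print("net loss:", sum(contribs))
```

The advantages cancel, but the loss contributions do not: the penalized response contributes a term roughly sixteen times larger in magnitude than the rewarded one, so the update is dominated by pushing an already unlikely response further down.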

Methodological Contributions

The authors propose two current-policy-only objectives:

  1. Scaled Policy Optimization (SPO): Weakens the influence of negative-advantage samples by a fixed hyperparameter $\alpha \in (0,1)$. While simple and empirically stabilizing, SPO cannot adapt to individual response scales or differentiate between mildly and strongly stale negatives.
  2. Asymmetric-Scale Policy Optimization (ASymPO): Introduces adaptive response-level normalization, dividing each response's loss by its own current mean token negative log-probability ($S_{\theta,g}$). This targets scale imbalance directly: responses with large current loss scales are automatically moderated, restoring the intended positive-negative contribution balance. ASymPO requires only current-policy probabilities, eliminating all infrastructure associated with behavior-policy log-probabilities.
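The two objectives can be sketched as per-response weights applied to each response's mean token negative log-probability. This is a hedged reading of the summary, not the authors' code: the function names are hypothetical, `alpha=0.5` is an arbitrary example value, and the ASymPO denominator is assumed to be treated as a constant (stop-gradient) so that the gradient does not vanish.

```python
def spo_weights(advantages, alpha=0.5):
    """SPO sketch: shrink negative-advantage terms by a fixed alpha in (0, 1)."""
    return [a if a >= 0 else alpha * a for a in advantages]


def asympo_weights(advantages, scales):
    """ASymPO sketch: divide each advantage by that response's own scale S_g.

    Assumption: S_g in the denominator is a constant (stop-gradient).
    """
    return [a / s for a, s in zip(advantages, scales)]


def contributions(weights, scales):
    """Loss contribution of each response: weight times its current scale."""
    return [w * s for w, s in zip(weights, scales)]


# Hypothetical group: zero-sum advantages, very different current scales.
advantages = [1.0, -1.0]
scales = [0.25, 4.0]  # mean token NLL of each response under the current policy

naive = contributions(advantages, scales)
spo = contributions(spo_weights(advantages), scales)
asympo = contributions(asympo_weights(advantages, scales), scales)

print("naive :", naive, "sum =", sum(naive))
print("SPO   :", spo, "sum =", sum(spo))
print("ASymPO:", asympo, "sum =", sum(asympo))
```

In this toy group SPO halves the negative term but leaves it dominant, while the per-response normalization brings the contributions back to the advantages themselves, so they sum to zero.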

Theoretical results demonstrate that ASymPO exactly inherits response-level zero-sum balance from the group-relative advantages, and preserves a meaningful learning signal via the policy-gradient direction. Detailed proofs show that the difference between naive and ASymPO losses is the weighted scale difference between positives and negatives, which ASymPO normalizes to zero.
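The balance argument can be written out as follows. This is a reconstruction from the summary rather than the paper's own statement: the notation ($G$ responses per group, $T_g$ tokens in response $g$, total positive advantage mass $M$, weighted mean scales $\bar S_{+}$ and $\bar S_{-}$) and the stop-gradient operator $\mathrm{sg}(\cdot)$ in the denominator are assumptions made for illustration.

```latex
% Response scale: mean token negative log-probability under the current policy.
S_{\theta,g} = -\frac{1}{T_g} \sum_{t=1}^{T_g} \log \pi_\theta\!\left(y_{g,t} \mid x, y_{g,<t}\right)

% Zero-sum advantages split into equal positive and negative mass.
\sum_{g=1}^{G} A_g = 0
\quad\Longrightarrow\quad
M := \sum_{g:\,A_g>0} A_g = \sum_{g:\,A_g<0} |A_g|

% Naive loss: its value is the weighted scale gap between positives and negatives.
\mathcal{L}_{\mathrm{naive}}(\theta)
  = \frac{1}{G} \sum_{g=1}^{G} A_g\, S_{\theta,g}
  = \frac{M}{G} \left( \bar S_{+} - \bar S_{-} \right),
\qquad
\bar S_{\pm} = \frac{1}{M} \sum_{g:\,\pm A_g>0} |A_g|\, S_{\theta,g}

% ASymPO (assumed form): each response term is normalized to its advantage.
\mathcal{L}_{\mathrm{ASymPO}}(\theta)
  = \frac{1}{G} \sum_{g=1}^{G} A_g\, \frac{S_{\theta,g}}{\mathrm{sg}\!\left(S_{\theta,g}\right)}
  = \frac{1}{G} \sum_{g=1}^{G} A_g = 0 \quad \text{(in value)}

% The gradient is still nonzero: a scale-normalized policy-gradient direction.
\nabla_\theta \mathcal{L}_{\mathrm{ASymPO}}(\theta)
  = \frac{1}{G} \sum_{g=1}^{G} \frac{A_g}{S_{\theta,g}}\, \nabla_\theta S_{\theta,g}
```

Under these assumptions, the gap $\bar S_{+} - \bar S_{-}$ is exactly the quantity that can grow without bound as stale negatives become unlikely, and normalizing each response by its own scale removes it from the loss value while keeping a gradient.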

Empirical Evaluation: Numerical Results

Experiments are conducted on Qwen3-1.7B-Base, Qwen3-4B-Base, and LLaMA-3.2-3B-Instruct, with asynchronous mathematical reasoning post-training using a subset of MATH problems and multiple benchmarks (AIME24, AIME25, MATH500, AMC23, GSM8K, Minerva-Math). Benchmarks are reported using mean@8 and pass@8 metrics.
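The summary does not define the two metrics, so the sketch below uses the common convention as an assumption: with 8 sampled answers per problem, mean@8 is the per-problem fraction of correct samples averaged over problems, and pass@8 is the fraction of problems with at least one correct sample. The function names and the toy results are illustrative.

```python
def mean_at_k(results):
    """Average per-sample accuracy: mean over problems of (correct samples / k)."""
    return sum(sum(r) / len(r) for r in results) / len(results)


def pass_at_k(results):
    """Fraction of problems solved by at least one of the k samples."""
    return sum(1 for r in results if any(r)) / len(results)


# Toy correctness flags for three problems with k = 8 samples each.
results = [
    [True] * 8,            # solved by every sample
    [False] * 7 + [True],  # solved by one sample only
    [False] * 8,           # never solved
]

print("mean@8:", mean_at_k(results))
print("pass@8:", pass_at_k(results))
```

Under this reading, mean@8 reflects how reliably a single sample is correct, while pass@8 reflects whether the model can solve the problem at all within eight attempts.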

The reported results support the paper's main claims:

  • Naive current-policy loss and GPG collapse during training, leaving no meaningful final checkpoints for evaluation.
  • SPO and ASymPO maintain stable reward trajectories throughout training in all evaluated models.
  • Current-policy-only objectives (SPO, ASymPO) are competitive with GRPO, which does require behavior-policy probabilities. In some cases, ASymPO surpasses GRPO on aggregate metrics (e.g., LLaMA-3.2-3B-Instruct mean@8 and pass@8).
  • ASymPO improves over SPO on some models, but the advantage is not uniform across datasets and metrics. Both remain stable where GPG and the naive loss collapse.
  • The stabilizing mechanism is not merely the removal of importance sampling but explicit response-level scale balancing; methods that drop importance sampling without such balancing still collapse.

Supplementary experiments on an alternative dataset (DAPO-Math-17K) confirm that the collapse of unbalanced current-policy training is not dataset-specific and that explicit scale control remains necessary.

Infrastructure Simplification and Practical Stabilization

ASymPO and SPO remove the need for behavior-policy log-probability transport, policy-version tags, and training-inference logit recomputation. This yields a more compact rollout-learner interface and eliminates failure modes related to numerical drift and token misalignment between the two systems. The paper also discusses practical stabilization of ASymPO via loss clipping, which guards against pathological token-probability regimes, and recommends global gradient clipping as an additional safeguard.
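The summary does not give the details of the paper's loss clipping, so it is not sketched here. Global gradient clipping, however, is a standard technique; a minimal plain-Python sketch follows, with the function name, the flat gradient list, and the small stabilizing constant all being illustrative assumptions.

```python
import math


def clip_global_norm(grads, max_norm, eps=1e-6):
    """Rescale a flat list of gradient values so their L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm


# A large update is shrunk to the norm budget; a small one is left unchanged.
clipped, norm_before = clip_global_norm([3.0, 4.0], max_norm=1.0)
print(norm_before, clipped)
unchanged, _ = clip_global_norm([0.3, 0.4], max_norm=1.0)
print(unchanged)
```

Because the rescaling is applied to the whole gradient at once, it preserves the update direction and only limits its size, which makes it a natural backstop for rare batches with extreme token probabilities.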

Theoretical and Practical Implications

ASymPO provides a principled response-level normalization mechanism that stabilizes asynchronous current-policy-only RL in scenarios where behavior-policy probabilities are expensive or impractical to transport. Theoretically, it shows that scale imbalance, not just the advantage sum, is the critical source of instability, and it proves that exact balance is achievable via normalization. This advances understanding of group-relative RL dynamics in LLM post-training.

Practically, ASymPO enables asynchronous high-throughput RL leveraging only quantities available on the learner side, facilitating deployment in distributed, large-scale environments with minimal infrastructure.

Limitations and Future Directions

While ASymPO addresses scale-imbalance and collapse in group-relative current-policy-only RL, it does not entirely solve general off-policy drift; it does not explicitly bound the policy ratio when the behavior policy is far from the learner. SPO’s fixed coefficient is a manual design choice and can be sensitive to task/model specifics. ASymPO’s normalization, though adaptive, is heuristic rather than a full trust-region method, and its efficacy may vary with reward normalization and sequence length heterogeneity.

Future directions include theoretical characterization of response-scale normalization versus behavior correction, adaptive extensions with drift diagnostics, token-level safeguard design, and broader empirical evaluation in RLHF settings with reward models, RL agents, long-horizon tasks, and larger data mixtures.

Conclusion

ASymPO and SPO deliver infrastructure-simplified, stable, current-policy-only objectives for asynchronous RL post-training of LLMs, with response-level scale balancing as the critical stabilization mechanism. ASymPO, in particular, restores the intended positive-negative update equilibrium under zero-sum group advantages, enabling robust training in settings where behavior-policy correction is infeasible. This work has direct implications for large-scale distributed LLM post-training and contributes technical clarity to RL objective design under asynchronous, off-policy conditions.
