- The paper introduces ASymPO, a current-policy-only reinforcement learning method that uses adaptive response-level normalization to balance the impact of negative and positive updates.
- It demonstrates that ASymPO recovers group-relative zero-sum balance by moderating scale imbalances, preventing training collapse even without behavior-policy probabilities.
- Empirical tests on models like Qwen3 and LLaMA show stable reward trajectories and competitive performance compared to GRPO, while simplifying infrastructure requirements.
ASymPO: Asymmetric-Scale Policy Optimization for Asynchronous LLM Post-Training Without Behavior Information
Technical Problem and Motivation
The paper addresses the challenge of efficient asynchronous reinforcement learning (RL) post-training of LLMs under reward-driven objectives—particularly mathematical reasoning—without access to behavior-policy probabilities. In typical RL fine-tuning scenarios, asynchronous designs decouple response generation (rollout workers) from policy optimization (learner), which improves throughput but causes distribution drift between the behavior policy (used to generate responses) and the current policy (used for optimization). Standard PPO-style methods correct this drift via importance ratios and clipping, requiring transport of token-aligned, versioned, numerically consistent behavior-policy log-probabilities. This incurs substantial infrastructure overhead and complexity, motivating the search for current-policy-only objectives that retain stability without relying on behavior information.
Analysis of Scale-Imbalance Failure Mode
The paper formalizes a critical failure mode: naive current-policy-only RL objectives, especially with group-relative rewards, do not preserve the positive-negative balance that zero-sum advantages are meant to induce. Because stale responses are evaluated under the current policy, positive and negative loss terms can sit at widely different negative log-probability scales; in particular, negative samples can dominate because their very low current-policy probability yields a correspondingly large negative log-probability. This breaks cancellation, destabilizes training, and leads to collapse, even when the sampled responses are well scaled under the behavior policy. By contrast, classical importance-sampling-based corrections (PPO, GRPO) transfer the advantage balance to the loss contributions through clipping, preserving group-level equilibrium.
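A small numeric sketch can make this failure mode concrete. It assumes mean-centered group rewards as the zero-sum advantages and a naive per-response loss term equal to the advantage times the current mean token negative log-probability. The function names and all numbers are illustrative, not taken from the paper.

```python
def group_relative_advantages(rewards):
    """Mean-centered rewards (assumed form); they sum to zero within a group."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def naive_contributions(advantages, mean_nll):
    """Per-response naive loss term: advantage times current mean token NLL."""
    return [a * s for a, s in zip(advantages, mean_nll)]


rewards = [1.0, 1.0, 0.0, 0.0]
adv = group_relative_advantages(rewards)  # [0.5, 0.5, -0.5, -0.5]

# Fresh samples: every response sits at a similar scale, so the terms cancel.
fresh = naive_contributions(adv, [0.5, 0.5, 0.5, 0.5])

# Stale samples: the current policy assigns the negatives very low probability.
stale = naive_contributions(adv, [0.5, 0.5, 4.0, 4.0])

print(sum(fresh))  # 0.0
print(sum(stale))  # -3.5
```

The advantages are identical in both cases; only the current-policy scale of the negative responses changes, and that alone is enough to tilt the group loss toward the negative terms.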
Methodological Contributions
The authors propose two current-policy-only objectives:
- Scaled Policy Optimization (SPO): Weakens the influence of negative-advantage samples by a fixed hyperparameter α∈(0,1). While simple and empirically stabilizing, SPO cannot adapt to individual response scales or differentiate between mildly and strongly stale negatives.
- Asymmetric-Scale Policy Optimization (ASymPO): Introduces adaptive response-level normalization, dividing each response's loss by its own current mean token negative log-probability (S_{θ,g}). This targets scale imbalance directly: responses with large current loss scales are automatically moderated, restoring the intended positive-negative contribution balance. ASymPO requires only current-policy probabilities, eliminating all infrastructure associated with behavior-policy log-probabilities.
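The two objectives can be compared on a toy group with a minimal sketch. It works on scalar per-response quantities only. Several points are assumptions rather than details from the paper: the ASymPO denominator is treated as a constant (no gradient flows through it), `eps` is a hypothetical guard against division by zero, `alpha=0.5` is an arbitrary illustrative value, and all function names are made up for this example.

```python
def spo_coefficients(advantages, alpha=0.5):
    """SPO: shrink negative-advantage terms by a fixed alpha in (0, 1)."""
    return [a if a >= 0 else alpha * a for a in advantages]


def asympo_coefficients(advantages, mean_nll, eps=1e-8):
    """ASymPO: divide each term by its own current mean token NLL.

    Assumptions: the denominator is a constant with respect to the
    gradient, and eps is a hypothetical floor to avoid division by zero.
    """
    return [a / max(s, eps) for a, s in zip(advantages, mean_nll)]


def contributions(coefficients, mean_nll):
    """Loss value contributed by each response: coefficient times mean NLL."""
    return [c * s for c, s in zip(coefficients, mean_nll)]


adv = [0.5, 0.5, -0.5, -0.5]  # zero-sum group advantages
nll = [0.5, 0.5, 4.0, 4.0]    # stale negatives at a large loss scale

naive = contributions(adv, nll)
spo = contributions(spo_coefficients(adv, alpha=0.5), nll)
asympo = contributions(asympo_coefficients(adv, nll), nll)

print(sum(naive))   # -3.5
print(sum(spo))     # -1.5
print(sum(asympo))  # 0.0
```

In this toy case the fixed SPO factor reduces the imbalance but does not remove it, while the per-response normalization returns each contribution to its advantage, so the group sums to zero.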
Theoretical results demonstrate that ASymPO exactly inherits response-level zero-sum balance from the group-relative advantages, and preserves a meaningful learning signal via the policy-gradient direction. Detailed proofs show that the difference between naive and ASymPO losses is the weighted scale difference between positives and negatives, which ASymPO normalizes to zero.
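The balance statement can be written out as a short derivation. The notation here is assumed, and the paper's exact formulation may differ: G is the group size, A_g are zero-sum advantages, S_{θ,g} is the current mean token negative log-probability of response g, sg(·) is a stop-gradient, and P and N are the index sets of positive and negative advantages (with M > 0).

```latex
\begin{align*}
\sum_{g=1}^{G} A_g &= 0, \qquad
M := \sum_{g \in P} A_g = \sum_{g \in N} |A_g|, \\
\bar{S}_P &:= \frac{1}{M} \sum_{g \in P} A_g \, S_{\theta,g}, \qquad
\bar{S}_N := \frac{1}{M} \sum_{g \in N} |A_g| \, S_{\theta,g}, \\
\mathcal{L}_{\text{naive}}(\theta)
  &= \frac{1}{G} \sum_{g=1}^{G} A_g \, S_{\theta,g}
   = \frac{M}{G} \left( \bar{S}_P - \bar{S}_N \right), \\
\mathcal{L}_{\text{ASymPO}}(\theta)
  &= \frac{1}{G} \sum_{g=1}^{G} A_g \,
     \frac{S_{\theta,g}}{\operatorname{sg}(S_{\theta,g})}
   \;\Rightarrow\;
   \text{value} = \frac{1}{G} \sum_{g=1}^{G} A_g = 0 .
\end{align*}
```

Under these assumptions the naive loss value equals the advantage-weighted scale gap between positive and negative responses, while the normalized loss value is zero. The gradient, with per-response weight A_g / S_{θ,g} on ∇S_{θ,g}, still carries a learning signal.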
Empirical Evaluation: Numerical Results
Experiments are conducted on Qwen3-1.7B-Base, Qwen3-4B-Base, and LLaMA-3.2-3B-Instruct, with asynchronous mathematical reasoning post-training using a subset of MATH problems and multiple benchmarks (AIME24, AIME25, MATH500, AMC23, GSM8K, Minerva-Math). Benchmarks are reported using mean@8 and pass@8 metrics.
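As a reference for the two metrics, the sketch below uses their common definitions, which are assumed here rather than quoted from the paper: mean@k averages per-sample accuracy over k samples per problem, and pass@k counts a problem as solved if at least one of its k samples is correct. The function names and toy data are illustrative.

```python
def mean_at_k(correct):
    """Average per-sample accuracy; `correct` holds one list of bools per problem."""
    return sum(sum(row) / len(row) for row in correct) / len(correct)


def pass_at_k(correct):
    """Fraction of problems with at least one correct sample."""
    return sum(1 for row in correct if any(row)) / len(correct)


# Two toy problems with 8 samples each: 2 of 8 correct, then 0 of 8 correct.
samples = [
    [True, False, False, True, False, False, False, False],
    [False] * 8,
]

print(mean_at_k(samples))  # 0.125
print(pass_at_k(samples))  # 0.5
```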
The reported results support the paper's main claims:
- Naive current-policy loss and GPG collapse during training, leaving no meaningful final checkpoints for evaluation.
- SPO and ASymPO maintain stable reward trajectories throughout training in all evaluated models.
- Current-policy-only objectives (SPO, ASymPO) are competitive with GRPO, which does require behavior-policy probabilities. In some cases, ASymPO surpasses GRPO on aggregate metrics (e.g., LLaMA-3.2-3B-Instruct mean@8 and pass@8).
- ASymPO improves over SPO on some models, but the advantage is not uniform across datasets and metrics. Both remain stable where GPG and the naive loss collapse.
- The stabilizing mechanism is not merely the removal of importance sampling but explicit response-level scale balancing: methods that drop importance sampling without such balancing still collapse.
Supplementary experiments on an alternative dataset (DAPO-Math-17K) confirm that the collapse of unbalanced current-policy training is not dataset-specific and that explicit scale control remains necessary.
Infrastructure Simplification and Practical Stabilization
ASymPO and SPO remove the requirements for behavior-policy log-probability transport, policy-version tags, and training-inference logit recomputation. This yields strictly more compact rollout-learner interfaces and eliminates failure modes related to numerical drift and misalignment. The paper also discusses practical stabilization of ASymPO via loss clipping, which keeps training robust in pathological token-probability regimes, and recommends global gradient clipping as an additional safeguard.
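The interface reduction can be pictured with two hypothetical rollout payloads. The class and field names below are illustrative assumptions and do not describe the paper's actual system; they only mirror the items listed above.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class RatioBasedRollout:
    """Hypothetical payload for a ratio-based learner (PPO/GRPO style)."""
    prompt_ids: List[int]
    response_ids: List[int]
    reward: float
    behavior_logprobs: List[float]  # token-aligned, from the rollout worker
    policy_version: int             # which policy snapshot generated the sample


@dataclass
class CurrentPolicyOnlyRollout:
    """Hypothetical payload when the learner needs only current-policy quantities."""
    prompt_ids: List[int]
    response_ids: List[int]
    reward: float


def dropped_fields():
    """Names of fields the current-policy-only payload no longer carries."""
    ratio_based = {f.name for f in fields(RatioBasedRollout)}
    current_only = {f.name for f in fields(CurrentPolicyOnlyRollout)}
    return sorted(ratio_based - current_only)


print(dropped_fields())  # ['behavior_logprobs', 'policy_version']
```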
Theoretical and Practical Implications
ASymPO provides a principled response-level normalization mechanism that stabilizes asynchronous current-policy-only RL in scenarios where behavior-policy probabilities are expensive or impractical to transport. Theoretically, it shows that scale imbalance, not just the advantage sum, is the critical source of instability, and proves that exact balance is achievable via normalization. This advances understanding of group-relative RL dynamics in LLM post-training.
Practically, ASymPO enables asynchronous high-throughput RL leveraging only quantities available on the learner side, facilitating deployment in distributed, large-scale environments with minimal infrastructure.
Limitations and Future Directions
While ASymPO addresses scale-imbalance and collapse in group-relative current-policy-only RL, it does not entirely solve general off-policy drift; it does not explicitly bound the policy ratio when the behavior policy is far from the learner. SPO’s fixed coefficient is a manual design choice and can be sensitive to task/model specifics. ASymPO’s normalization, though adaptive, is heuristic rather than a full trust-region method, and its efficacy may vary with reward normalization and sequence length heterogeneity.
Future directions include theoretical characterization of response-scale normalization versus behavior correction, adaptive extensions with drift diagnostics, token-level safeguard design, and broader empirical evaluation in RLHF settings with reward models, RL agents, long-horizon tasks, and larger data mixtures.
Conclusion
ASymPO and SPO deliver infrastructure-simplified, stable, current-policy-only objectives for asynchronous RL post-training of LLMs, with response-level scale balancing as the critical stabilization mechanism. ASymPO, in particular, restores the intended positive-negative update equilibrium under zero-sum group advantages, enabling robust training in settings where behavior-policy correction is infeasible. This work has direct implications for large-scale distributed LLM post-training and contributes technical clarity to RL objective design under asynchronous, off-policy conditions.