Asymmetric Group-Relative Policy Optimization

Updated 11 May 2026
  • The paper introduces AsymGRPO, a reinforcement learning framework that breaks standard group-relative symmetry to drive efficient exploration and adaptive learning.
  • It employs group-level attenuation and dynamic sample-level rescaling to adjust policy updates based on current proficiency and sample difficulty.
  • Empirical results show significant gains in tasks like LLM reasoning, MLLM applications, and structured forecasting, with minimal tuning requirements.

Asymmetric Group-Relative Policy Optimization (AsymGRPO), also referred to as Asymmetric Group-Relative Advantage Estimation (A-GRAE), is a reinforcement learning policy optimization framework designed to address exploration inefficiency and difficulty adaptation bottlenecks inherent in the baseline Group Relative Policy Optimization (GRPO) algorithm. Originally introduced to improve alignment and reasoning in LLMs and multi-modal LLMs (MLLMs), AsymGRPO deliberately breaks key symmetries embedded in standard group-relative advantage estimation, deploys a curriculum-driven difficulty scheduler, and yields measurable gains in exploration, robustness, and final task performance. The following exposition outlines core principles, technical constructs, theoretical results, empirical benchmarks, and integration guidance for AsymGRPO.

1. Standard Group-Relative Policy Optimization and Its Symmetry Bottlenecks

Group-Relative Policy Optimization (GRPO) replaces the scalar-valued critic in standard actor-critic RL algorithms with a group-wise comparative baseline. Given a prompt $q$ and $G$ sampled rollouts $\{o_i\}_{i=1}^G$ from the old policy $\pi_{\theta_{\mathrm{old}}}$, each with reward $r_i \in \{0,1\}$, the group-relative advantage for rollout $i$ is defined by

$$A_i = \frac{r_i - \bar{r}}{\sigma_r}$$

where $\bar{r} = \frac{1}{G}\sum_{j=1}^G r_j$ and $\sigma_r = \mathrm{std}(\{r_j\})$. The gradient of the GRPO objective, ignoring regularization and clipping, is

$$\nabla_\theta \mathcal{J}_\mathrm{GRPO}(\theta) = \mathbb{E}_{q, \{o_i\}} \left[ \frac{1}{G}\sum_{i=1}^G \frac{1}{|o_i|}\sum_{t=1}^{|o_i|} \rho_{i,t} A_i \nabla_\theta \log \pi_\theta(o_{i,t} \mid \ldots) \right]$$

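with $\rho_{i,t} = \frac{\pi_\theta(o_{i,t} \mid \ldots)}{\pi_{\theta_{\mathrm{old}}}(o_{i,t} \mid \ldots)}$ the token-level importance ratio.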
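As a minimal sketch of the advantage definition above, the following standardizes the binary rewards of one group. The function name `group_relative_advantages`, the use of the population standard deviation, and the choice to return zeros for a zero-variance group are assumptions made for illustration; the text only specifies "std".

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards):
    """Standardize one group's rewards: A_i = (r_i - mean) / std."""
    mean_r = fmean(rewards)
    std_r = pstdev(rewards)  # assumption: population std (divide by G)
    if std_r == 0.0:
        # Every rollout received the same reward, so there is no relative
        # signal; return zeros instead of dividing by zero (assumption).
        return [0.0] * len(rewards)
    return [(r - mean_r) / std_r for r in rewards]


# One prompt, G = 8 rollouts, two of which were rewarded.
rewards = [1, 0, 0, 1, 0, 0, 0, 0]
advantages = group_relative_advantages(rewards)
```

With two successes out of eight, the rewarded rollouts receive a larger positive advantage than the magnitude of the negative advantage assigned to each failure, and the advantages sum to zero.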

GRPO suffers from two implicit “advantage symmetries”:

  • Group-Level Symmetry: The standardized $A_i$ satisfy $\sum_{i=1}^G A_i = 0$, yielding $\sum_{i:\, r_i = 1} A_i = -\sum_{i:\, r_i = 0} A_i$ when the group is partitioned into correct and incorrect outcomes. This enforces a zero-sum update on the sampled rollouts and leaves the logits of any unsampled trajectory unchanged.
  • Sample-Level Symmetry: Across samples with success rate $p$, the sum $\sum_{i=1}^G |A_i|$ is maximized at $p = 0.5$, implying the strongest gradient is always allocated to medium-difficulty samples. GRPO treats samples with success rate $p$ (hard, for $p < 0.5$) and $1 - p$ (easy) equivalently, thus failing to dynamically target the model's curriculum needs as proficiency shifts.

These symmetries prevent efficient exploration in the policy’s behavior space and impede adaptive sample focusing as learning progresses (Yu et al., 5 Feb 2026).
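Both symmetries can be checked numerically. The sketch below assumes binary rewards and the population standard deviation, under which a group with success rate $p$ has total absolute advantage $2G\sqrt{p(1-p)}$; the helper names are illustrative only.

```python
from math import isclose, sqrt
from statistics import fmean, pstdev


def advantages(rewards):
    """Standard group-relative advantages (zeros for a zero-variance group)."""
    mean_r, std_r = fmean(rewards), pstdev(rewards)
    if std_r == 0.0:
        return [0.0] * len(rewards)
    return [(r - mean_r) / std_r for r in rewards]


def total_abs_advantage(num_correct, group_size):
    """Sum of |A_i| for a group with `num_correct` rewarded rollouts."""
    rewards = [1] * num_correct + [0] * (group_size - num_correct)
    return sum(abs(a) for a in advantages(rewards))


G = 8

# Group-level symmetry: positive and negative advantages cancel exactly.
group_sum = sum(advantages([1, 1, 0, 0, 0, 0, 0, 0]))

# Sample-level symmetry: the total signal depends on p only through p(1 - p),
# so it peaks at p = 0.5 and is identical for p and 1 - p.
totals = {k: total_abs_advantage(k, G) for k in range(1, G)}
for k, total in totals.items():
    print(f"p = {k / G:.3f}  sum|A_i| = {total:.3f}")
```

The printed totals are mirror images around $p = 0.5$: a prompt solved one time in eight receives the same total gradient weight as one solved seven times in eight.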

2. The Asymmetric Advantage Mechanism in AsymGRPO

AsymGRPO introduces targeted modifications to the standard advantage calculation, breaking both symmetries identified above:

2.1 Group-Level Attenuation

Define the current batch’s mean reward as […], with […] the batch size. Introduce a hyperparameter […]. The group-level asymmetric advantage is

[…]

Early in training, […], which severely suppresses positive advantages, amplifies updates along unsuccessful (incorrect) trajectories, and drives exploration. As the average proficiency rises, suppression lessens and learning becomes more balanced.

2.2 Dynamic Sample-Level Rescaling

Let […] denote the within-group success rate for the sample associated with […]. The sample-level asymmetric advantage is formed as a weighted sum:

[…]

For […] small, greater emphasis falls on easy samples ([…] scaling); as […], weighting shifts to hard samples ([…] scaling). The final advantage […] is then substituted into the policy loss for both group and sample levels, optionally compounding both forms of asymmetry (Yu et al., 5 Feb 2026).

3. Automatic Curriculum and Scheduling

AsymGRPO deploys an automatic curriculum mechanism, using the monotonic trajectory of […] to drive the attention shift from easy to hard samples:

  • After each optimization step, compute […] as above.
  • Assign sample-level weights […] (hard-focus), […] (easy-focus).
  • As the model’s mean reward increases through training, the curriculum transitions from favoring easy problems to targeting the emerging hard examples, paralleling the increase in model proficiency.

No additional hyperparameters are required beyond […] and group size $G$, making the scheduling robust and lightweight (Yu et al., 5 Feb 2026).

4. Theoretical Properties

Breaking group-level symmetry ensures that previously unsampled correct trajectories receive nonzero updates, directly overcoming the “dead zone” limitation of standard GRPO. Specifically, for an unsampled trajectory […], Theorem 1 in (Yu et al., 5 Feb 2026) establishes that when […], the logit update […].

Dynamic curriculum adaptation strictly increases the total absolute advantage for samples whose difficulty matches current policy proficiency, as evidenced by sample-level bounds. This allocation of the policy gradient reduces wasted effort on data that is too easy or too hard at each stage of training, leveraging the bound […] (Theorem 2).

In fixed-reference settings, the policy optimization landscape remains compatible with reverse-KL regularization structure and group-based normalization, as analyzed in (Vojnovic et al., 25 Feb 2025).

5. Implementation and Pseudocode

The essential AsymGRPO integration amounts to a minimal change in the core PPO/GRPO workflow: replace the standard advantage $A_i$ with the asymmetric variant, […]. Policy gradient steps, KL regularization, and batch structure remain unchanged, resulting in direct drop-in compatibility with standard GRPO, PPO, DAPO, and Dr.GRPO pipelines. A group size $G \geq 4$ is needed for stable statistics. For multimodal or highly imbalanced regimes, […] is preferred; for text-only tasks, […] near 1 is robust. At least […] rollouts per batch are recommended for reliable estimation of […] (Yu et al., 5 Feb 2026).
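The drop-in character of the change can be sketched by making the advantage computation a pluggable function. Everything named below is hypothetical, and `placeholder_asymmetric` in particular is only a stand-in that halves positive advantages so the swap is visible; it is not the AsymGRPO formula, which is not reproduced here.

```python
from statistics import fmean, pstdev
from typing import Callable, List, Sequence

AdvantageFn = Callable[[Sequence[float]], List[float]]


def grpo_advantages(rewards: Sequence[float]) -> List[float]:
    """Standard group-relative advantages (zeros for a zero-variance group)."""
    mean_r, std_r = fmean(rewards), pstdev(rewards)
    if std_r == 0.0:
        return [0.0] * len(rewards)
    return [(r - mean_r) / std_r for r in rewards]


def placeholder_asymmetric(rewards: Sequence[float]) -> List[float]:
    """HYPOTHETICAL stand-in, NOT the AsymGRPO formula: halves positive advantages."""
    return [0.5 * a if a > 0 else a for a in grpo_advantages(rewards)]


def batch_advantages(
    groups: Sequence[Sequence[float]],
    advantage_fn: AdvantageFn = grpo_advantages,
    min_group_size: int = 4,
) -> List[List[float]]:
    """Compute per-rollout advantages for every group in a batch.

    Only `advantage_fn` changes between variants; the surrounding
    policy-gradient step would consume the result unchanged.
    """
    for rewards in groups:
        if len(rewards) < min_group_size:
            raise ValueError(
                f"group size {len(rewards)} is below the minimum of {min_group_size}"
            )
    return [advantage_fn(rewards) for rewards in groups]


groups = [[1, 0, 0, 0], [1, 1, 0, 0]]
baseline = batch_advantages(groups)
swapped = batch_advantages(groups, advantage_fn=placeholder_asymmetric)
```

Keeping the advantage function as a parameter means the same training loop can be run with either variant, which makes side-by-side comparisons straightforward.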

6. Empirical Performance and Application Scope

AsymGRPO realizes consistent, statistically significant improvements across a wide spectrum of LLM and MLLM reasoning and vision-language tasks. On the MATH dataset (Qwen2.5-Math-7B), Pass@1 rises from 76.5% to 78.3% and Pass@32 from 92.6% to 94.6%. On AIME2025, it achieves a ten percentage point increase in Pass@256 (46.7%→56.7%). AMC23 and Geo3K, as well as medical VQA diagnostics (MRI300, CT300, Xray300), all exhibit robust gains, with false alarm rates and OOD generalization likewise improved (Yu et al., 5 Feb 2026).

In structured forecasting domains, such as long-horizon spatiotemporal air quality predictions, AsymGRPO, when paired with class-wise asymmetric rewards and curriculum rollout, approximately halves the false alarm rate (e.g., 32.86%→17.32% for PM₂.₅ 120h forecasts) with minimal F1 degradation, establishing its utility for cost-sensitive continual prediction under operational constraints (Kang et al., 27 Nov 2025).
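As a quick check of the "approximately halves" statement, the relative reduction implied by the two quoted false alarm rates works out as follows (an arithmetic illustration using only the figures above):

```latex
\[
\frac{32.86 - 17.32}{32.86} \;=\; \frac{15.54}{32.86} \;\approx\; 0.473
\]
```

That is an absolute drop of 15.54 percentage points, or a relative reduction of roughly 47%.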

7. Integration Guidance and Limitations

Direct replacement of the standard group-relative advantage with the asymmetric formula suffices for most PPO/GRPO-compatible frameworks. Monitoring policy entropy and greedy-task accuracy is advised; if entropy grows monotonically, increasing […] or including an explicit entropy bonus may be needed to prevent collapse. Division by […] or […] should be skipped when […] or […], since […] in those cases. Group sizes below 4 are inadvisable due to instability in variance estimation. The method is computationally efficient and requires minimal tuning beyond […] and […] (Yu et al., 5 Feb 2026). In current instantiations, AsymGRPO presumes variable or discrete reward functions and is most naturally suited for categorical tasks, but extension to continuous domains is possible.
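The entropy monitoring advice can be sketched with two small helpers. This assumes per-token probability distributions are available from the policy; the function names, the window length, and the trend labels are illustrative choices rather than anything prescribed by the source.

```python
from math import log


def mean_token_entropy(token_distributions):
    """Average Shannon entropy (in nats) over per-token probability distributions."""
    entropies = [
        -sum(p * log(p) for p in dist if p > 0.0)
        for dist in token_distributions
    ]
    return sum(entropies) / len(entropies)


def entropy_trend(history, window=5):
    """Classify the last `window` entropy readings.

    Returns "rising" or "falling" when the readings are strictly monotonic,
    "mixed" otherwise, and "insufficient" when fewer than `window` exist.
    """
    if window < 2:
        raise ValueError("window must be at least 2")
    recent = history[-window:]
    if len(recent) < window:
        return "insufficient"
    diffs = [later - earlier for earlier, later in zip(recent, recent[1:])]
    if all(d > 0 for d in diffs):
        return "rising"
    if all(d < 0 for d in diffs):
        return "falling"
    return "mixed"


# Example: entropy logged once per optimization step.
entropy_log = [0.61, 0.64, 0.66, 0.70, 0.75]
status = entropy_trend(entropy_log, window=5)
```

A strictly monotonic reading over several consecutive steps is the signal to revisit the hyperparameters or add an entropy term, as described above.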


For further details, explicit derivations, and empirical benchmarks, see (Yu et al., 5 Feb 2026) and applications in (Kang et al., 27 Nov 2025).
