
Weighted GRPO: Dynamic Policy Optimization

Updated 23 February 2026
  • Weighted GRPO is a generalized reinforcement learning framework that introduces static or dynamic weights at group and sample levels to balance surrogate policy gradients.
  • Dynamic weighting methods like DARO, λ-GRPO, and MMR-GRPO adapt weights based on reward statistics, enhancing convergence speed and model stability.
  • Empirical studies show that weighted GRPO variants deliver faster convergence, higher final accuracy, and improved robustness across applications such as mathematical reasoning and multi-objective control.

Weighted Group Relative Policy Optimization (Weighted GRPO) generalizes the Group Relative Policy Optimization (GRPO) framework by introducing explicit weighting schemes at the group and/or sample level within the surrogate policy gradient loss. Weighted GRPO is a common thread underlying the recent proliferation of RL with Verifiable Rewards (RLVR) algorithms for LLMs, mathematical reasoning, multi-objective control, and applied natural language processing. All such variants can be cast as instantiations of weighted group-centric advantage policy optimization, differing primarily by how weights are determined—static (predefined, heuristic) or dynamic (algorithmically adapted to learning state).

1. Mathematical Foundations and Canonical Formulation

The GRPO family considers populations (“groups”) of generated responses per prompt, typically leveraging the group’s internal reward statistics to stabilize and calibrate policy updates. For a group $g$, let $D_g$ denote the set of sampled data (trajectories or tokens), with group-relative normalized advantages defined as

A^g(s,a) = \frac{r(s,a) - \mu^g}{\sigma^g}

where $\mu^g$ and $\sigma^g$ are the mean and standard deviation of rewards in group $g$.

Standard GRPO Objective:

\mathcal{L}_{\rm GRPO} = -\sum_{g=1}^G \mathbb{E}_{(s,a) \sim D_g}[A^g(s,a)] \nabla_\theta \log \pi_\theta(a|s)

Weighted GRPO Generalization (static or adaptive weights $w_g > 0$):

\mathcal{L}_{\rm wGRPO} = -\sum_{g=1}^G w_g\, \mathbb{E}_{(s,a) \sim D_g}[A^g(s,a)] \nabla_\theta \log \pi_\theta(a|s)

At the sample level, further importance sampling or reweighting (e.g., based on advantage, diversity, or uncertainty) modifies the per-sample learning signal. The formalization is extended to multi-objective settings, token-/sequence-level weighting, and dynamic annealing of weights during training (Zhou et al., 10 Oct 2025, Fontana et al., 8 Jan 2026, Wang et al., 8 Oct 2025, Yao et al., 29 Sep 2025, Wei et al., 14 Jan 2026, Min et al., 9 Jan 2026, Ichihara et al., 26 Sep 2025, Shen et al., 8 Aug 2025).
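The weighted objective above can be sketched numerically. The following is a minimal NumPy illustration, not a reference implementation: in a real pipeline the log-probabilities would come from the policy network under autograd, and the gradient would be taken through them.

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    """Group-relative normalized advantages: A_i = (r_i - mu_g) / sigma_g."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def weighted_grpo_loss(groups, weights):
    """Scalar surrogate: -sum_g w_g * mean_i [A_i * log pi(a_i|s_i)].

    `groups` is a list of (rewards, log_probs) pairs, one per group;
    `weights` holds the corresponding w_g > 0.
    """
    loss = 0.0
    for (rewards, log_probs), w_g in zip(groups, weights):
        adv = group_advantages(rewards)
        loss -= w_g * float(np.mean(adv * np.asarray(log_probs)))
    return loss
```

Setting all $w_g = 1$ recovers the standard GRPO surrogate.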

Table: Overview of Weighted GRPO Variants

Variant                     Weighting Granularity    Weight Source
Static Weighted GRPO        Group                    Fixed ($w_g$)
MO-GRPO                     Reward dimension         Inverse variance ($1/\sigma$)
S-GRPO                      Group                    Closed-form noise-aware ($w^*$)
λ-GRPO                      Token                    Learned scalar ($\lambda$)
MMR-GRPO                    Sample (within group)    Diversity via MMR
Entropy-guided GRPO/GTPO    Token/sequence           Policy entropy
DARO                        Group                    Learned (inverse sub-loss)

2. Static Weighting Schemes and Their Pathologies

Early “weighted” GRPO schemes use fixed $w_g$, reflecting prior assumptions or heuristics regarding sample or group difficulty. Examples include:

  • Binary DAPO weighting ($w_g = 1$ for groups with $0 < \mu_g < 1$, else $w_g = 0$) and variants that emphasize “medium-difficulty” groups ($w_g \propto \sqrt{\mu_g(1-\mu_g)}$).
  • Reward-based weighting: exponential weighting of group or sample advantages, e.g., $w(a_i) = \exp(\alpha a_i)$.
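For concreteness, these static heuristics can be written as small functions (an illustrative sketch of the formulas above, not any paper's reference code):

```python
import numpy as np

def dapo_binary_weight(mu_g):
    """Keep only groups that are neither all-wrong nor all-correct."""
    return 1.0 if 0.0 < mu_g < 1.0 else 0.0

def medium_difficulty_weight(mu_g):
    """Emphasize medium-difficulty groups: w_g proportional to sqrt(mu_g(1 - mu_g))."""
    return float(np.sqrt(mu_g * (1.0 - mu_g)))

def exponential_sample_weight(a_i, alpha=1.0):
    """Exponential advantage weighting: w(a_i) = exp(alpha * a_i)."""
    return float(np.exp(alpha * a_i))
```

Note that `medium_difficulty_weight` peaks at $\mu_g = 0.5$, matching the intuition that half-solved prompts carry the most learning signal.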

While such weighting increases flexibility, it is fundamentally limited: loss magnitudes, and thus relative group influences, can drift unpredictably during training. This instability manifests as a loss-scale problem. As the model improves, the total loss contributed by each group changes, often sharply, so a fixed $w_g$ cannot maintain balanced learning, and the update overfocuses on certain difficulty levels or group archetypes (Zhou et al., 10 Oct 2025).

Moreover, static weighting can induce undesirable optimization dynamics, including group-specific prefix biases in LMs, and may even degrade model performance if the implicit learning signal is misallocated as the learner improves (Fontana et al., 8 Jan 2026, Shen et al., 8 Aug 2025).

3. Dynamic, Adaptive, and Learnable Weighting: Overcoming Static Limitations

Dynamic weighting addresses the fundamental shortcomings of static GRPO extensions by making the group/sample weights adaptive—explicitly coupled to model, data, or learning state.

  • DARO (Difficulty-Aware Reweighting Policy Optimization) (Zhou et al., 10 Oct 2025): Each group weight $w_g$ becomes a trainable parameter, optimized jointly with the policy parameters via the regularized loss:

\mathcal{L}(\theta,\{w_g\}) = \sum_{g=1}^G \left[w_g \mathcal{L}_g(\theta) - \ln w_g\right]

The optimal $w_g$ is $1/\mathcal{L}_g(\theta)$, enforcing an equalized loss scale across groups. This adaptivity corrects group imbalance and expedites convergence.

  • λ-GRPO (Wang et al., 8 Oct 2025): Introduces a learnable scalar $\lambda$ to explicitly tilt per-token weights as a function of response length (favoring concise or verbose chains of thought), with softmax normalization to preserve the total gradient scale.
  • MMR-GRPO (Maximal Marginal Relevance) (Wei et al., 14 Jan 2026): Reweights within-group samples by a tradeoff between raw reward and semantic diversity, adaptively controlled via an intra-group reward-variance–dependent parameter $\lambda$. This prioritizes informative, non-redundant samples.
  • Entropy-guided weighting (Tan et al., 6 Aug 2025, Min et al., 9 Jan 2026): Token- or sequence-level entropy of the policy distribution is used to upweight uncertain or decision-critical tokens/trajectories, substantially improving long-chain reasoning credit assignment.
  • S-GRPO (Stable GRPO) (Shen et al., 8 Aug 2025): Computes a closed-form, groupwise weight $w^*$ optimizing alignment with the true latent (noise-free) reward, robustly attenuating noisy or unreliable signals in the presence of “Think-Answer Mismatch.”
  • MO-GRPO (Multi-Objective GRPO) (Ichihara et al., 26 Sep 2025): In multi-objective settings, each reward channel $R_i$ is normalized by its sample standard deviation, automatically preventing “reward hacking” by ensuring balanced policy gradients irrespective of reward variances.
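The effect of per-channel normalization of this kind can be illustrated with a toy example (a sketch of the standardization idea described above; `mo_grpo_advantages` is an illustrative name, not from the paper):

```python
import numpy as np

def mo_grpo_advantages(reward_matrix, eps=1e-8):
    """Standardize each reward channel by its own group statistics, then sum.

    reward_matrix has shape (num_samples, num_objectives). Because every
    channel is divided by its own standard deviation, a high-variance
    objective cannot dominate the combined advantage.
    """
    R = np.asarray(reward_matrix, dtype=float)
    Z = (R - R.mean(axis=0)) / (R.std(axis=0) + eps)
    return Z.sum(axis=1)
```

For example, a channel measured in hundreds and a channel confined to [0, 1] contribute equally after standardization.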

These mechanisms often employ annealed or automatically regularized weights, and leverage meta-objective optimization or closed-form statistical adaptation to maximize stability and fairness.
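As a quick check on DARO's closed form, the per-group term $w_g \mathcal{L}_g - \ln w_g$ is minimized exactly at $w_g = 1/\mathcal{L}_g$, which makes the effective loss scale $w_g \mathcal{L}_g$ equal to one for every group. A small sketch (function names are illustrative):

```python
import numpy as np

def daro_group_term(w, L):
    """Per-group term of the DARO regularized loss: w * L - ln(w)."""
    return w * L - np.log(w)

def daro_optimal_weight(L):
    """d/dw [w*L - ln w] = L - 1/w = 0  =>  w* = 1/L."""
    return 1.0 / L

# w* equalizes the effective loss scale: w* * L == 1 for any group loss L > 0.
```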

4. Algorithmic Instantiation and Implementation Details

Weighted GRPO is implemented via modest modifications to standard GRPO/PPO-style pipelines:

  1. Sampling: For each prompt/context, generate a group of $G$ candidate responses from the current (or previous) policy.
  2. Rewarding and Grouping: Compute scalar or vector rewards per sample; divide the batch into groups (e.g., by empirical pass rate, task, response structure).
  3. Advantage Normalization: Derive group-relative advantages, optionally normalized by group std or per-reward std (for multi-objective tasks).
  4. Weight Assignment: Assign weights statically (e.g., heuristic $w_g$), dynamically (learned $w_g$ or $\lambda$), or adaptively (entropy, reward variance, diversity).
  5. Surrogate Loss: Compute the weighted and, if applicable, clipped policy gradient loss:

\mathcal{L} = -\sum_{g=1}^G w_g \mathbb{E}_{i \in g}[A_i] \nabla_\theta \log \pi_\theta(a_i|s_i)

or, for sample weights,

\mathcal{L} = -\frac{1}{G} \sum_{i=1}^{G} w_i A_i \nabla_\theta \log \pi_\theta(y_i|x)

  6. Joint Optimization: Update both the policy and the weighting parameters with AdamW or SGD, using appropriate learning rates.

Below is a schematic pseudocode for dynamic weighted GRPO (as inspired by (Zhou et al., 10 Oct 2025) and (Wang et al., 8 Oct 2025)):

for each RL step:
    sample a batch of prompts
    for each group g:
        generate responses {o_i} and compute rewards {r_i}
        compute group-relative advantages A_i = (r_i - mean) / std
        compute/update group/sample weights (w_g, w_i) from loss, entropy, etc.
    compute the weighted (and, if applicable, clipped) surrogate loss
    update θ and the weighting parameters by gradient descent
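The loop above can be made concrete on a toy bandit-style problem. The sketch below uses a softmax policy over three fixed "responses" and an ad-hoc inverse-magnitude group weight standing in for a learned $w_g$; it is illustrative only and matches no specific paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Toy setting: three candidate "responses"; index 2 has the highest reward.
true_reward = np.array([0.1, 0.4, 0.9])
theta = np.zeros(3)          # policy logits
G, lr = 8, 0.5               # group size and learning rate

for step in range(200):
    p = softmax(theta)
    acts = rng.choice(3, size=G, p=p)                      # sample a group
    r = true_reward[acts] + 0.05 * rng.standard_normal(G)  # noisy rewards
    adv = (r - r.mean()) / (r.std() + 1e-8)                # group-relative A_i
    w_g = 1.0 / (np.abs(adv).mean() + 1e-8)                # ad-hoc dynamic weight
    grad = np.zeros(3)
    for a, A in zip(acts, adv):
        grad_logp = -p                                     # grad of log softmax
        grad_logp[a] += 1.0                                # onehot(a) - p
        grad += w_g * A * grad_logp / G
    theta += lr * grad                                     # gradient ascent

best = int(np.argmax(softmax(theta)))
```

Note that once the group collapses onto a single response, the normalized advantages sum to zero and the update vanishes, which is exactly the group-relative baseline at work.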

Hyperparameter choices are dictated by the weighting regime: learning rate $\eta \in [1\times10^{-6}, 3\times10^{-6}]$, group size $G \in [8, 16]$, clipping thresholds $\epsilon \in [0.05, 0.28]$, and an entropy or MMR tradeoff that is tuned or annealed (Wei et al., 14 Jan 2026, Wang et al., 8 Oct 2025, Zhou et al., 10 Oct 2025).
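Collected as a configuration sketch (values are representative points within the ranges quoted above, not recommendations from any single paper; the tradeoff value is a placeholder):

```python
# Representative hyperparameters for a weighted-GRPO run (illustrative only).
config = {
    "learning_rate": 2e-6,   # eta in [1e-6, 3e-6]
    "group_size": 8,         # G in [8, 16]
    "clip_eps": 0.2,         # epsilon in [0.05, 0.28]
    "tradeoff_lambda": 0.5,  # entropy/MMR tradeoff, tuned or annealed
}
```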

5. Empirical Evaluation and Comparative Performance

Weighted GRPO variants have demonstrated robust improvements on standard mathematical reasoning and code generation benchmarks, as well as broader tasks (control, translation, instruction following):

  • Faster Convergence: Dynamic/learned weighting (DARO, MMR-GRPO, entropy-based) achieves comparable or superior accuracy in fewer steps; MMR-GRPO reduces wall-clock time by up to 70.2% (Wei et al., 14 Jan 2026, Zhou et al., 10 Oct 2025).
  • Higher Final Accuracy: DARO improves final pass@1 rates: Qwen-1.5B: 39.6% (GRPO) → 40.6% (DARO); Llama3.1-8B: 18.7% → 21.4% (Zhou et al., 10 Oct 2025). λ-GRPO yields consistent 1–2% gains over GRPO and DAPO (Wang et al., 8 Oct 2025).
  • Noise Robustness: S-GRPO preserves learning when standard GRPO collapses under 20% label noise (+2.2–2.5% absolute, stable entropy/credit assignment) (Shen et al., 8 Aug 2025).
  • Multi-Objective Stability: MO-GRPO prevents reward hacking, achieving balanced improvement across objectives (e.g., BLEURT and readability in WMT translation, simultaneous RM and brevity in AlpacaFarm) (Ichihara et al., 26 Sep 2025).
  • Token-/Sequence-level Reward Shaping: Entropy- and diversity-aware schemes (GTPO, GRPO-S, MMR-GRPO) promote deeper reasoning, exploration, and informative credit assignment, with up to 57% increase in “best@k” mean reward (Tan et al., 6 Aug 2025).

Ablation studies uniformly show that dynamic or learned weight schedules outperform static and heuristic analogues (Zhou et al., 10 Oct 2025, Wei et al., 14 Jan 2026, Tan et al., 6 Aug 2025). Static weighting may even induce negative transfer, as overemphasis on certain groups limits generalizable reasoning ability.

6. Theoretical Analysis, Implications, and Limitations

Weighted GRPO has been subjected to rigorous theoretical investigation. The general surrogate framework reveals subtle gradient biases induced by non-uniform weighting, particularly prefix bias in LLMs when group structure aligns with shared prompt-response tokens (Fontana et al., 8 Jan 2026). Under AdamW, updates are nearly invariant to reward scaling because of the optimizer's internal rescaling by first and second moments, and momentum can cause updates to “overshoot” the clipping boundaries of the surrogate loss.

Theoretical results also show that any symmetric weighting of pairwise group-squared losses leads to the same regularized optima in the infinite-sample regime (Yao et al., 29 Sep 2025), but can significantly affect finite-sample update efficiency and stability.

Limitations include:

  • Sensitivity to weight estimation for small groups or highly non-stationary distributions.
  • Heuristic dynamic schedules (e.g., entropy- or MMR-based) may overemphasize superficial uncertainty without an explicit model of sample informativeness.
  • Reward normalization approaches (MO-GRPO) do not guard against adversarial, misspecified, or obsolete reward channels.

Extensions involve meta-learned weighting, per-token adaptive weights, and integrating uncertainty-aware credit assignment for improved robustness and theoretical alignment with RL objectives (Zhou et al., 10 Oct 2025, Wang et al., 8 Oct 2025, Fontana et al., 8 Jan 2026).

7. Application Domains and Broader Impact

Weighted GRPO and its descendants are deployed across diverse domains, including mathematical reasoning, code generation, machine translation, instruction following, and multi-objective control.

Weighted GRPO thus represents both a practical and conceptual generalization of GRPO, subsuming numerous contemporary RLVR and RLHF methods for LLMs, control, and beyond, with a rigorous theoretical foundation, efficient implementation path, and empirically validated gains.
