Weighted GRPO: Dynamic Policy Optimization
- Weighted GRPO is a generalized reinforcement learning framework that introduces static or dynamic weights at group and sample levels to balance surrogate policy gradients.
- Dynamic weighting methods like DARO, λ-GRPO, and MMR-GRPO adapt weights based on reward statistics, enhancing convergence speed and model stability.
- Empirical studies show that weighted GRPO variants deliver faster convergence, higher final accuracy, and improved robustness across applications such as mathematical reasoning and multi-objective control.
Weighted Group Relative Policy Optimization (Weighted GRPO) generalizes the Group Relative Policy Optimization (GRPO) framework by introducing explicit weighting schemes at the group and/or sample level within the surrogate policy gradient loss. Weighted GRPO is a common thread underlying the recent proliferation of Reinforcement Learning with Verifiable Rewards (RLVR) algorithms for LLMs, mathematical reasoning, multi-objective control, and applied natural language processing. All such variants can be cast as instantiations of weighted group-centric advantage policy optimization, differing primarily by how weights are determined—static (predefined, heuristic) or dynamic (algorithmically adapted to the learning state).
1. Mathematical Foundations and Canonical Formulation
The GRPO family considers populations (“groups”) of generated responses per prompt, typically leveraging the group’s internal reward statistics to stabilize and calibrate policy updates. For a group $g$, let $\{o_1, \dots, o_G\}$ denote the set of sampled data (trajectories or tokens) with rewards $\{r_1, \dots, r_G\}$, and define group-relative normalized advantages as
$$\hat{A}_i = \frac{r_i - \mu_g}{\sigma_g},$$
where $\mu_g$ and $\sigma_g$ are the mean and standard deviation of rewards in group $g$.
Standard GRPO Objective:
$$\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G} \min\left(\rho_i \hat{A}_i,\ \operatorname{clip}(\rho_i, 1-\epsilon, 1+\epsilon)\,\hat{A}_i\right)\right] - \beta\,\mathbb{D}_{\mathrm{KL}}\!\left[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right], \qquad \rho_i = \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)}$$
Weighted GRPO Generalization (static or adaptive weights $w_g$):
$$\mathcal{J}_{\text{W-GRPO}}(\theta) = \mathbb{E}\left[\sum_{g} w_g \frac{1}{G}\sum_{i \in g} \min\left(\rho_i \hat{A}_i,\ \operatorname{clip}(\rho_i, 1-\epsilon, 1+\epsilon)\,\hat{A}_i\right)\right]$$
At the sample level, further importance sampling or reweighting (e.g., based on advantage, diversity, or uncertainty) modifies the per-sample learning signal. The formalization is extended to multi-objective settings, token-/sequence-level weighting, and dynamic annealing of weights during training (Zhou et al., 10 Oct 2025, Fontana et al., 8 Jan 2026, Wang et al., 8 Oct 2025, Yao et al., 29 Sep 2025, Wei et al., 14 Jan 2026, Min et al., 9 Jan 2026, Ichihara et al., 26 Sep 2025, Shen et al., 8 Aug 2025).
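The core quantities above—group-relative normalized advantages and a group-weighted clipped surrogate—can be sketched in a few lines of Python. This is a minimal illustration, not any paper's reference implementation; the function names (`group_advantages`, `weighted_grpo_loss`) and the list-of-(ratio, advantage) input format are assumptions made for exposition.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-8):
    """Group-relative normalized advantages: (r_i - mu_g) / sigma_g."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def weighted_grpo_loss(groups, weights, clip_eps=0.2):
    """Weighted clipped surrogate over groups.

    groups:  one list per group of (importance_ratio, advantage) pairs.
    weights: one scalar w_g per group (static or dynamically adapted).
    """
    total = 0.0
    for w_g, samples in zip(weights, groups):
        g_loss = 0.0
        for ratio, adv in samples:
            clipped = max(min(ratio, 1 + clip_eps), 1 - clip_eps)
            g_loss += min(ratio * adv, clipped * adv)  # PPO-style clipped term
        total += w_g * g_loss / len(samples)
    return -total  # negated: minimize the loss, maximize the surrogate
```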
Table: Overview of Weighted GRPO Variants
| Variant | Weighting Granularity | Weight Source |
|---|---|---|
| Static Weighted GRPO | Group | Fixed ($w_g$ heuristic) |
| MO-GRPO | Reward dimension | Per-channel inverse std ($1/\sigma_k$) |
| S-GRPO | Group | Closed-form noise-aware |
| λ-GRPO | Token | Learned scalar ($\lambda$) |
| MMR-GRPO | Sample (in group) | Diversity via MMR |
| Entropy-guided GRPO/GTPO | Token/sequence | Policy entropy |
| DARO | Group | Inverse sub-loss (learned) |
2. Static Weighting Schemes and Their Pathologies
Early “weighted” GRPO schemes use fixed , reflecting prior assumptions or heuristics regarding sample or group difficulty. Examples include:
- Binary DAPO weighting ($w_g = 1$ for groups whose pass rate satisfies $0 < \bar{r}_g < 1$, else $0$, discarding all-correct and all-incorrect groups) and variants that emphasize “medium-difficulty” groups ($\bar{r}_g \approx 0.5$).
- Reward-based weighting: exponential weighting of group or sample advantages, e.g., $w_i \propto \exp(\hat{A}_i)$.
While such weighting increases flexibility, it is fundamentally limited by the fact that loss magnitudes, and thus relative group influences, can drift unpredictably during training. This instability manifests as a loss scale problem: as the model improves, the total loss contributed by each group changes—often sharply—making fixed weights $w_g$ insufficient to maintain balanced learning and causing overfocus on certain difficulty levels or group archetypes (Zhou et al., 10 Oct 2025).
Moreover, static weighting can induce undesirable optimization dynamics, including group-specific prefix biases in LMs, and may even degrade model performance if the implicit learning signal is misallocated as the learner improves (Fontana et al., 8 Jan 2026, Shen et al., 8 Aug 2025).
3. Dynamic, Adaptive, and Learnable Weighting: Overcoming Static Limitations
Dynamic weighting addresses the fundamental shortcomings of static GRPO extensions by making the group/sample weights adaptive—explicitly coupled to model, data, or learning state.
- DARO (Difficulty-Aware Reweighting Policy Optimization) (Zhou et al., 10 Oct 2025): Each group weight $w_g$ becomes a trainable parameter, optimized jointly with the policy parameters via a regularized meta-loss over the per-group sub-losses $\mathcal{L}_g$. The optimal weight is inversely proportional to the group sub-loss, $w_g^* \propto 1/\mathcal{L}_g$, enforcing an equalized loss scale across groups. This adaptivity corrects group imbalance and expedites convergence.
- λ-GRPO (Wang et al., 8 Oct 2025): Introduces a learnable scalar $\lambda$ to explicitly tilt per-token weights as a function of response length (favoring concise or verbose chains of thought), with softmax normalization to preserve total gradient scale.
- MMR-GRPO (Maximal Marginal Relevance) (Wei et al., 14 Jan 2026): Reweights within-group samples by a tradeoff between raw reward and semantic diversity, adaptively controlled via a parameter that depends on the intra-group reward variance. This prioritizes informative, non-redundant samples.
- Entropy-guided weighting (Tan et al., 6 Aug 2025, Min et al., 9 Jan 2026): Token- or sequence-level entropy of the policy distribution is used to upweight uncertain or decision-critical tokens/trajectories, substantially improving long-chain reasoning credit assignment.
- S-GRPO (Stable GRPO) (Shen et al., 8 Aug 2025): Computes a closed-form, groupwise weight optimizing alignment with the true latent (noise-free) reward, robustly attenuating noisy or unreliable signals in the presence of “Think-Answer Mismatch.”
- MO-GRPO (Multi-Objective GRPO) (Ichihara et al., 26 Sep 2025): In multi-objective settings, each reward channel is normalized by its sample standard deviation, automatically preventing “reward hacking” by ensuring balanced policy gradients irrespective of reward variances.
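A minimal sketch of DARO-style inverse-sub-loss weighting, assuming the closed-form $w_g \propto \tau/\mathcal{L}_g$ described above; the function name `daro_style_weights` and the scale parameter `tau` are illustrative, not the paper's API:

```python
def daro_style_weights(group_losses, tau=1.0, eps=1e-8):
    """Inverse-sub-loss group weights w_g = tau / L_g, so that every group
    contributes a comparable loss scale: w_g * L_g ~= tau for all g."""
    return [tau / (L + eps) for L in group_losses]
```

With sub-losses 0.5 and 2.0, both weighted contributions come out equal (≈ tau), which is exactly the equalized-loss-scale property the closed form enforces.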
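λ-GRPO's length-tilted weighting can be approximated as below. This is a hedged sketch, not the paper's exact parameterization: it shows only the essentials—one learnable scalar tilting per-response weights by length, with softmax normalization preserving the total weight mass. All names here are hypothetical.

```python
import math

def lambda_length_weights(lengths, lam=0.0):
    """Per-response weights tilted by length via a single scalar lam.
    lam > 0 favors longer responses, lam < 0 shorter; the softmax keeps
    the weights summing to len(lengths), recovering uniform GRPO at lam=0."""
    logits = [lam * math.log(L) for L in lengths]
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    Z = sum(exps)
    return [len(lengths) * e / Z for e in exps]
```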
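Classic maximal-marginal-relevance ranking, which MMR-GRPO adapts for within-group sample weighting, can be sketched as a greedy loop. Here the tradeoff `lam` is a fixed illustrative constant, whereas MMR-GRPO adapts it from the intra-group reward variance; the function name is hypothetical.

```python
def mmr_rank(rewards, sim, lam=0.7):
    """Greedy maximal-marginal-relevance ordering of within-group samples.
    sim[i][j] is a pairwise semantic similarity; lam trades raw reward
    against redundancy with already-selected samples."""
    n = len(rewards)
    selected, remaining = [], list(range(n))
    while remaining:
        def score(i):
            redundancy = max((sim[i][j] for j in selected), default=0.0)
            return lam * rewards[i] - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

In the test below, a low-reward but diverse sample is ranked ahead of a high-reward near-duplicate, which is the behavior that makes MMR weighting favor informative, non-redundant samples.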
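MO-GRPO's per-channel normalization can be sketched as follows, under the assumption that the normalized channels are summed into a scalar advantage per sample; the function name is illustrative.

```python
from statistics import mean, pstdev

def mo_grpo_advantages(reward_vectors, eps=1e-8):
    """Normalize each reward channel by its in-group standard deviation
    before summing, so that no single objective dominates the policy
    gradient regardless of its raw scale."""
    n_obj = len(reward_vectors[0])
    channels = [[rv[k] for rv in reward_vectors] for k in range(n_obj)]
    normed = []
    for ch in channels:
        mu, sigma = mean(ch), pstdev(ch)
        normed.append([(r - mu) / (sigma + eps) for r in ch])
    # sum the normalized channels into one scalar advantage per sample
    return [sum(normed[k][i] for k in range(n_obj))
            for i in range(len(reward_vectors))]
```

Even when one channel is three orders of magnitude larger than the other (e.g., rewards of 0–1 vs. 1000–2000), both contribute equally after normalization.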
These mechanisms often employ annealed or automatically regularized weights, and leverage meta-objective optimization or closed-form statistical adaptation to maximize stability and fairness.
4. Algorithmic Instantiation and Implementation Details
Weighted GRPO is implemented via modest modifications to standard GRPO/PPO-style pipelines:
- Sampling: For each prompt/context, generate a group of candidate responses from the current (or previous) policy.
- Rewarding and Grouping: Compute scalar or vector rewards per sample; divide the batch into groups (e.g., by empirical pass rate, task, response structure).
- Advantage Normalization: Derive group-relative advantages, optionally normalized by group std or per-reward std (for multi-objective tasks).
- Weight Assignment: Assign weights statically (e.g., heuristic $w_g$), dynamically (learned $w_g$ or $\lambda$), or adaptively (entropy, reward-variance, or diversity signals).
- Surrogate Loss: Compute the weighted and, if applicable, clipped policy gradient loss:
$$\mathcal{L}(\theta) = -\sum_{g} w_g \frac{1}{G}\sum_{i \in g} \min\left(\rho_i \hat{A}_i,\ \operatorname{clip}(\rho_i, 1-\epsilon, 1+\epsilon)\,\hat{A}_i\right)$$
or, for sample weights,
$$\mathcal{L}(\theta) = -\sum_{g} \frac{1}{G}\sum_{i \in g} w_i \min\left(\rho_i \hat{A}_i,\ \operatorname{clip}(\rho_i, 1-\epsilon, 1+\epsilon)\,\hat{A}_i\right)$$
- Joint Optimization: Update both policy and weighting parameters with AdamW or SGD, using appropriate learning rates.
Below is a schematic pseudocode for dynamic weighted GRPO (as inspired by (Zhou et al., 10 Oct 2025) and (Wang et al., 8 Oct 2025)):
```
for each RL step:
    sample a batch of prompts
    for each group g:
        generate responses {o_i}, compute rewards {r_i}
        compute group advantages A_i = (r_i - mean) / std
        compute/update group/sample weights (w_g, w_i) from loss, entropy, etc.
        compute the weighted (and, if applicable, clipped) surrogate loss
    update θ and any weighting parameters by gradient descent
```
Hyperparameter choices are dictated by the weighting regime: learning rate $\eta \in [1\times10^{-6}, 3\times10^{-6}]$, group size $G$ in 8–16, clipping thresholds $\epsilon \in [0.05, 0.28]$, and entropy or MMR tradeoff coefficients tuned or annealed (Wei et al., 14 Jan 2026, Wang et al., 8 Oct 2025, Zhou et al., 10 Oct 2025).
5. Empirical Evaluation and Comparative Performance
Weighted GRPO variants have demonstrated robust improvements on standard mathematical reasoning and code generation benchmarks, as well as broader tasks (control, translation, instruction following):
- Faster Convergence: Dynamic/learned weighting (DARO, MMR-GRPO, entropy-based) achieves comparable or superior accuracy in fewer steps; MMR-GRPO reduces wall-clock time by up to 70.2% (Wei et al., 14 Jan 2026, Zhou et al., 10 Oct 2025).
- Higher Final Accuracy: DARO improves final pass@1 rates: Qwen-1.5B: 39.6% (GRPO) → 40.6% (DARO); Llama3.1-8B: 18.7% → 21.4% (Zhou et al., 10 Oct 2025). λ-GRPO yields consistent 1–2% gains over GRPO and DAPO (Wang et al., 8 Oct 2025).
- Noise Robustness: S-GRPO preserves learning when standard GRPO collapses under 20% label noise (+2.2–2.5% absolute, stable entropy/credit assignment) (Shen et al., 8 Aug 2025).
- Multi-Objective Stability: MO-GRPO prevents reward hacking, achieving balanced improvement across objectives (e.g., BLEURT and readability in WMT translation, simultaneous RM and brevity in AlpacaFarm) (Ichihara et al., 26 Sep 2025).
- Token-/Sequence-level Reward Shaping: Entropy- and diversity-aware schemes (GTPO, GRPO-S, MMR-GRPO) promote deeper reasoning, exploration, and informative credit assignment, with up to 57% increase in “best@k” mean reward (Tan et al., 6 Aug 2025).
Ablation studies uniformly show dynamic/learned weight schedules outperform static and heuristic analogues (Zhou et al., 10 Oct 2025, Wei et al., 14 Jan 2026, Tan et al., 6 Aug 2025). Static weighting may even elicit negative transfer, as overemphasis on certain groups limits generalizable reasoning ability.
6. Theoretical Analysis, Implications, and Limitations
Weighted GRPO has been subjected to rigorous theoretical investigation. The general surrogate framework reveals subtle gradient biases induced by non-uniform weighting—particularly prefix bias in LLMs when group structure aligns with shared prompt-response tokens (Fontana et al., 8 Jan 2026). Under AdamW optimization, reward scaling is nearly invariant due to internal rescaling (first and second moments), and momentum can cause updates to “overshoot” clipping boundaries of the surrogate loss.
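The AdamW scale-invariance claim is easy to verify numerically: because the update direction is $\hat{m}_t / \sqrt{\hat{v}_t}$ (up to bias correction and a small $\epsilon$), multiplying every gradient—e.g., by rescaling all rewards—by a constant leaves the step essentially unchanged. The following self-contained demonstration uses a toy scalar Adam loop (not any library's optimizer):

```python
import math

def adam_step(grads, lr=1e-3, b1=0.9, b2=0.999, eps=1e-12):
    """Run Adam from zero moments over a gradient sequence and return the
    total parameter displacement. eps is kept tiny so that the scale
    invariance of m / sqrt(v) is visible."""
    m = v = 0.0
    delta = 0.0
    for t, g in enumerate(grads, start=1):
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
        m_hat = m / (1 - b1 ** t)          # bias-corrected first moment
        v_hat = v / (1 - b2 ** t)          # bias-corrected second moment
        delta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return delta

# Scaling every gradient (i.e., every reward) by 100x leaves the
# accumulated update essentially unchanged:
base = adam_step([0.5, 0.3, 0.8])
scaled = adam_step([50.0, 30.0, 80.0])
```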
Theoretical results also show that any symmetric weighting of pairwise group-squared losses leads to the same regularized optima in the infinite-sample regime (Yao et al., 29 Sep 2025), but can significantly affect finite-sample update efficiency and stability.
Limitations include:
- Sensitivity to weight estimation for small groups or highly non-stationary distributions.
- Heuristic dynamic schedules (e.g., entropy or MMR) may overemphasize superficial uncertainty without an explicit model of sample informativeness.
- Reward normalization approaches (MO-GRPO) do not guard against adversarial, misspecified, or obsolete reward channels.
Extensions involve meta-learned weighting, per-token adaptive weights, and integrating uncertainty-aware credit assignment for improved robustness and theoretical alignment with RL objectives (Zhou et al., 10 Oct 2025, Wang et al., 8 Oct 2025, Fontana et al., 8 Jan 2026).
7. Application Domains and Broader Impact
Weighted GRPO and its descendants are deployed across diverse domains:
- Mathematical Reasoning and LLM RLVR Benchmarks: Efficient optimization for GSM8K, MATH500, AIME, AMC23/24/25, Olympiad, Minerva, with strong pass@k improvements (Zhou et al., 10 Oct 2025, Wei et al., 14 Jan 2026, Shen et al., 8 Aug 2025, Wang et al., 8 Oct 2025).
- Multi-Objective RL in Control and Translation: MO-GRPO achieves stable, interpretable optimization over multi-dimensional reward spaces, mitigating reward dominance/hacking without bespoke tuning (Ichihara et al., 26 Sep 2025).
- Contract Graph Modeling: Weighted GRPO supports entity/edge extraction and analysis in complex language-to-graph parsing (Dechtiar et al., 10 Nov 2025).
- Robotics and Flow-Matching Policies: Energy-efficient, reward-weighted flow-matching demonstrates superior sample efficiency and cost reduction in continuous control (Pfrommer et al., 20 Jul 2025).
Weighted GRPO thus represents both a practical and conceptual generalization of GRPO, subsuming numerous contemporary RLVR and RLHF methods for LLMs, control, and beyond, with a rigorous theoretical foundation, efficient implementation path, and empirically validated gains.