GRPO: Group Relative Policy Optimisation

Updated 6 April 2026

GRPO is a reinforcement learning framework that employs group-normalized advantage estimators to mitigate variance and stability issues in policy gradient optimization.
It replaces traditional value critics with empirical group baselines, improving sample efficiency and ensuring robust performance in tasks like language modeling and control.
Its U-statistic based gradient estimator provides theoretical guarantees and supports scalable variants (e.g., MC-GRPO, P-GRPO, MO-GRPO) for diverse applications.

Group Relative Policy Optimisation (GRPO) is a reinforcement learning framework designed to address key variance and stability challenges in policy gradient optimization, particularly for large language models (LLMs) and other generative settings. GRPO replaces traditional value function critics with empirical, group-normalized advantage estimators, operating directly on small batches of model outputs (“groups”) per input. This approach provides a statistically robust and theoretically grounded alternative to actor-critic methods, leading to improved sample efficiency, stable updates, and strong empirical performance across applications including language modeling, speech recognition, representation learning, multi-agent systems, and continuous control (Zhou et al., 1 Mar 2026, Sane, 30 Jan 2025, Xu et al., 19 Nov 2025, Khanda et al., 25 Jul 2025, Kim, 30 Jan 2026, Shivakumar et al., 2 Sep 2025).

1. Foundational Principles and Formalism

GRPO optimizes the expected return by generating $G$ completions per prompt from the reference or old policy, scoring each with a reward, and standardizing the rewards to compute group-relative advantages. For prompt $q$ and policy $\pi_\theta$, $G$ completions $\{o_i\}_{i=1}^G$ are sampled, each receiving reward $r_i = R(q, o_i)$. The group mean and (optionally) standard deviation are computed:

$$\bar r(q) = \frac{1}{G}\sum_{i=1}^G r_i, \qquad \sigma^2(q) = \frac{1}{G}\sum_{i=1}^G (r_i-\bar r(q))^2$$

The group-normalized advantage is:

$$A_i = \frac{r_i - \bar r(q)}{\sigma(q) + \epsilon}$$

The clipped PPO-style surrogate is used for stability:

$$\mathcal{J}(\theta) = \mathbb{E}_{q,\{o_i\}}\left[\frac{1}{G}\sum_{i=1}^G\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\min(\rho_{i,t}A_i,\, \hat\rho_{i,t}A_i)\right]$$

where $\rho_{i,t} = \frac{\pi_\theta(o_{i,t}\mid q,o_{i,<t})}{\pi_{\text{old}}(o_{i,t}\mid q,o_{i,<t})}$ is the token-level importance ratio and $\hat\rho_{i,t} = \operatorname{clip}(\rho_{i,t},\, 1-\varepsilon,\, 1+\varepsilon)$ is the clipped ratio (with $\varepsilon$ the PPO clip range, distinct from the normalization floor $\epsilon$).
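The advantage and surrogate defined in this section can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the function names `group_advantages` and `clipped_surrogate`, the default `eps=1e-6`, and the clip range `clip_eps=0.2` are assumptions chosen for the example, and per-token log-probabilities are passed in as plain lists instead of coming from a model.

```python
import math
from statistics import fmean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages for one prompt: (r_i - mean) / (std + eps)."""
    mean = fmean(rewards)
    std = pstdev(rewards)  # population form, i.e. the 1/G variance used above
    return [(r - mean) / (std + eps) for r in rewards]


def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped objective: mean over tokens, then mean over completions.

    logp_new / logp_old: one list of per-token log-probs per completion.
    clip_eps is an assumed (hypothetical) clip range, not a value from the text.
    """
    per_completion = []
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        token_terms = []
        for new, old in zip(lp_new, lp_old):
            ratio = math.exp(new - old)  # importance ratio rho_{i,t}
            clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
            token_terms.append(min(ratio * adv, clipped * adv))
        per_completion.append(fmean(token_terms))
    return fmean(per_completion)


# Four completions for one prompt with binary rewards.
advantages = group_advantages([1.0, 0.0, 0.0, 1.0])
```

Note that a group whose rewards are all equal yields zero advantages for every completion, so it contributes nothing to the objective; the floor `eps` only prevents a division by zero in that case.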

Crucially, this “critic-free” group-based advantage estimator introduces no parametric bias, reduces dependency on reward scale, and provides per-mini-batch baselines that control variance, making GRPO robust to reward stochasticity and outlier samples (Zhou et al., 1 Mar 2026, Mroueh, 9 Mar 2025, Sane, 30 Jan 2025).

2. Theoretical Properties and Statistical Foundations

GRPO’s policy gradient estimator is a symmetric U-statistic of order two: for each prompt, the gradient can be written as an average of a symmetric kernel over pairs of completions in the group. This U-statistic structure enables explicit variance analysis. The mean squared error (MSE) matches that of an “oracle baseline” estimator (using the true value function) up to an excess term that is quantified explicitly in Zhou et al. (1 Mar 2026). GRPO is thus oracle-efficient and achieves asymptotically optimal variance within the family of baseline-subtracted policy gradient estimators. A universal scaling law for group size selection is also established: for a fixed total rollout budget, the group size is set from data- and model-dependent variance factors, and the remaining budget determines the number of prompts per batch (Zhou et al., 1 Mar 2026).
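One way to see the order-two structure is the elementary identity $\frac{1}{G}\sum_i (r_i - \bar r)\,g_i = \frac{G-1}{G}\binom{G}{2}^{-1}\sum_{i<j}\tfrac{1}{2}(r_i - r_j)(g_i - g_j)$, which holds for any values $g_i$. The sketch below checks it numerically under stated assumptions: it uses the mean-centered advantage without standard-deviation normalization, and scalar `scores` stand in for the score-function gradients $\nabla_\theta \log \pi_\theta(o_i \mid q)$ (the identity applies componentwise). The kernel shown here is an illustration and may differ from the exact form used in the cited analysis.

```python
from itertools import combinations
from statistics import fmean


def mean_baseline_estimate(rewards, scores):
    """(1/G) * sum_i (r_i - r_bar) * g_i, the mean-baseline estimator."""
    r_bar = fmean(rewards)
    return fmean([(r - r_bar) * g for r, g in zip(rewards, scores)])


def pairwise_u_statistic(rewards, scores):
    """Same quantity written as a scaled average of a symmetric pair kernel."""
    group_size = len(rewards)
    kernel_values = [
        0.5 * (rewards[i] - rewards[j]) * (scores[i] - scores[j])
        for i, j in combinations(range(group_size), 2)
    ]
    # (G - 1) / G rescales the pair average to the 1/G-normalized estimator.
    return (group_size - 1) / group_size * fmean(kernel_values)


rewards = [1.0, 0.0, 0.5, 2.0]
scores = [0.3, -1.2, 0.7, 0.1]  # hypothetical scalar stand-ins for gradients
difference = mean_baseline_estimate(rewards, scores) - pairwise_u_statistic(rewards, scores)
```

The two functions agree up to floating-point error, which is what makes the pairwise form a convenient starting point for variance calculations.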

Beyond asymptotic variance, finite-sample policy suboptimality gaps are computable, with convergence guarantees under standard smoothness and Polyak–Łojasiewicz conditions.

3. Practical Variants and Algorithmic Extensions

Several GRPO variants have been developed to address practical considerations in diverse domains:

MC-GRPO (Median-Centered GRPO): Replaces the mean baseline with a median (and MAD) to robustly mitigate sign-flip errors that arise with small group sizes ($G \le 4$). The median is resilient to outliers and reduces destructive gradient variance at minimal cost, with the median-pivot rollout excluded from updates (Kim, 30 Jan 2026).
Personalized GRPO (P-GRPO): Replaces group-batch normalization with preference-specific running means and standard deviations, preserving gradient magnitude for minority user groups and enhancing heterogeneous preference alignment in personalized LLMs (Wang et al., 17 Feb 2026).
MO-GRPO: For multi-objective settings, computes reward-wise z-scores before summing, ensuring equal weighting and immunity to reward scaling artifacts, directly mitigating “reward hacking” and collapse to high-variance objectives (Ichihara et al., 26 Sep 2025).
Hybrid GRPO: Combines empirical group returns and value function baselines via an interpolating coefficient, controlling the bias–variance tradeoff between standard PPO and pure GRPO (Sane, 30 Jan 2025).
DPPO (Dynamic Pruning Policy Optimization): Supports scalable training by pruning prompts and completions during training, while maintaining unbiasedness via two-level importance weighting (Zhu et al., 4 Mar 2026).
Scaf-GRPO: Augments GRPO with hierarchical, in-prompt hints for “hard” tasks where the model otherwise receives zero reward across all samples—thus overcoming the “learning cliff” and restoring gradient flow in domains with verifiable rewards (Zhang et al., 22 Oct 2025).
Consensus GRPO (C-GRPO): Integrates sample-level consensus objectives, distilling MBR (Minimum Bayes Risk) decoding into the training loop for text generation tasks while achieving performance comparable to rerank-based decoding at a fraction of the runtime cost (Ichihara et al., 3 Feb 2026).
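Two of the variants above change only how advantages are computed, which makes them easy to sketch. The code below is a hedged illustration, not the cited authors' implementations: `median_centered_advantages` centers on the group median and scales by the median absolute deviation (MAD), and `multi_objective_advantages` z-scores each reward component within the group before summing. The function names, the `eps` floor, and the exact scaling choices are assumptions made for this example.

```python
from statistics import fmean, median, pstdev


def median_centered_advantages(rewards, eps=1e-6):
    """Median/MAD advantages in the spirit of MC-GRPO (assumed form)."""
    med = median(rewards)
    mad = median([abs(r - med) for r in rewards])
    return [(r - med) / (mad + eps) for r in rewards]


def multi_objective_advantages(reward_vectors, eps=1e-6):
    """Sum of per-objective z-scores in the spirit of MO-GRPO (assumed form).

    reward_vectors[i][k] is the k-th reward component of completion i.
    """
    totals = [0.0] * len(reward_vectors)
    for k in range(len(reward_vectors[0])):
        column = [vector[k] for vector in reward_vectors]
        mean, std = fmean(column), pstdev(column)
        for i, value in enumerate(column):
            totals[i] += (value - mean) / (std + eps)
    return totals


# One outlier reward pulls the mean (2.575) above three of the four samples.
robust = median_centered_advantages([0.0, 0.1, 0.2, 10.0])

# Rescaling one objective by 1000 leaves the summed z-scores unchanged.
base = multi_objective_advantages([[1.0, 100.0], [0.0, 300.0], [1.0, 200.0]])
scaled = multi_objective_advantages([[1.0, 1e5], [0.0, 3e5], [1.0, 2e5]])
```

In the first example the sample with reward 0.2 lies above the median but below the mean, so its advantage is positive under median centering and would be negative under mean centering; this is the kind of sign flip the median baseline is meant to avoid. In the second, `base` and `scaled` agree up to the effect of `eps`, illustrating why per-objective normalization removes sensitivity to the relative scale of reward components.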

4. Empirical Evidence and Benchmark Results

GRPO and its variants have demonstrated strong empirical improvements in a range of settings:

Mathematical Reasoning (Pass@1, GSM8K, MATH, AIME24): MC-GRPO improves absolute accuracy by up to +4.6% at $G=2$ (narrowing the 2- vs 8-rollout gap from 5.6% to ≤1%). Scaf-GRPO achieves a 44.3% relative Pass@1 gain on AIME24 for Qwen2.5-Math-7B, compared to vanilla GRPO (Kim, 30 Jan 2026, Zhang et al., 22 Oct 2025).
Multi-Agent Topology Learning: In Graph-GRPO, edge-level GRPO normalization improves accuracy over both standard SOTA and graph-level GRPO by up to 2.18%, clarifying communication credit assignment (Cang et al., 3 Mar 2026).
Speech Recognition (ASR): GRPO policies lower word-error rate (WER) by up to 18.4% relative to SFT-only LLMs and reduce hallucinations in out-of-domain generalization and domain adaptation settings (Shivakumar et al., 2 Sep 2025).
Representation Models: GRPO-RM improves softmax regression and kNN accuracy on vision tasks by 3–7% over standard fine-tuning while also accelerating convergence (Xu et al., 19 Nov 2025).
Multi-Objective RL and RLHF: MO-GRPO ensures balanced tradeoff across objectives in bandit, control, machine translation, and instruction following, consistently outperforming vanilla GRPO and clipped Dr.GRPO baselines (Ichihara et al., 26 Sep 2025).
Compute Efficiency: 2-GRPO is empirically equivalent to 16-GRPO at ≈1/8 the cost (Wu et al., 1 Oct 2025). DPPO and Pro-GRPO reduce wall-clock time by up to 2.37× with matching or higher performance, and are Pareto-optimal in compute-vs-accuracy (Zhu et al., 4 Mar 2026, Ge et al., 17 Dec 2025).

5. Application Domains and Specialized Adaptations

GRPO serves as a foundation for robust reinforcement learning and RLHF across modalities:

LLM Post-training: The canonical RLHF setting for large LMs, eliminating the learned value network while retaining PPO-style stability and supporting verifiable, preference-based, and multi-objective rewards (Mroueh, 9 Mar 2025, Zhou et al., 1 Mar 2026).
Multi-Turn and Multi-Agent Reasoning: Incorporation into tool-calling agents (RC-GRPO), graph-based communication optimization (Graph-GRPO), and multi-agent cooperation with population-level constraints (GRPO-GCC) (Zhong et al., 3 Feb 2026, Cang et al., 3 Mar 2026, Yang et al., 7 Oct 2025).
Vision and Representation Learning: By enumerating all class options for each image, GRPO-RM applies the group-based policy optimization logic to classification and segmentation, improving standard metrics (Xu et al., 19 Nov 2025).
Continuous Control for Robotics: Extension to continuous action spaces is achieved by clustering trajectories and states, using state-aware advantages and group-based normalization, with explicit convergence analysis and regularization for stability in high-dimensional, temporally correlated tasks (Khanda et al., 25 Jul 2025).

6. Limitations, Practical Considerations, and Best Practices

Several challenges, caveats, and best-practice guidelines have emerged:

Reward Misspecification and Heterogeneity: Standard GRPO is sensitive to the relative scaling of combined reward components and to the homogeneity assumption; MO-GRPO and P-GRPO directly address these, but cannot compensate for poorly specified objectives.
Small Group Sizes: Baseline variance is high; MC-GRPO is recommended to avoid sign-flip-induced instability at G≤4 (Kim, 30 Jan 2026).
Imbalanced Advantages: Large group sizes may dilute the gradient due to “reward clustering,” motivating variance-maximizing selection (OVF, Pro-GRPO) (Ge et al., 17 Dec 2025).
Sample Efficiency: Empirical and theoretical results indicate that moderate group sizes optimize the variance–compute tradeoff, consistent with the universal scaling law (Zhou et al., 1 Mar 2026).
Bias from Stale Policies: Standard GRPO’s policy gradient is estimated with respect to the old policy rather than the current one; TIC-GRPO (trajectory-level importance correction) provides unbiased gradients with provably fast convergence (Pang et al., 4 Aug 2025).
Multi-Turn and Sparse Reward Settings: Plain GRPO degenerates under low within-group diversity; RC-GRPO and preference-guided weak supervision restore non-degenerate gradients (Zhong et al., 3 Feb 2026, Mundada et al., 19 Feb 2026).

For implementation, it is recommended to perform a pilot search for group size, monitor per-group standard deviation, employ a normalization floor (the $\epsilon$ in the advantage denominator), and use per-objective variance normalization for multi-objective cases. For heterogeneous or personalized domains, maintain preference- or user-specific running baselines. When compute is limited, leverage MC-GRPO or DPPO/Pro-GRPO to control cost without sacrificing stability (Kim, 30 Jan 2026, Zhu et al., 4 Mar 2026).
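The advice to monitor per-group standard deviation can be made concrete with a small diagnostic. The sketch below is an assumption-laden example, not part of any cited method: `group_diagnostics` and its `std_floor` threshold are hypothetical names, and the input is simply a list of reward lists, one per prompt. It reports the average within-group standard deviation and the fraction of groups whose rewards are (nearly) identical, since those groups produce zero advantages and therefore no gradient signal.

```python
from statistics import fmean, pstdev


def group_diagnostics(batch_rewards, std_floor=1e-6):
    """Summarize within-group reward spread for one training batch.

    batch_rewards: list of per-prompt reward lists (one group per prompt).
    std_floor: assumed threshold below which a group counts as degenerate.
    """
    stds = [pstdev(group) for group in batch_rewards]
    degenerate = [i for i, std in enumerate(stds) if std <= std_floor]
    return {
        "mean_std": fmean(stds),
        "degenerate_fraction": len(degenerate) / len(stds),
        "degenerate_prompts": degenerate,
    }


# Three prompts: all-correct, mixed, and all-wrong groups of four rollouts.
report = group_diagnostics([
    [1.0, 1.0, 1.0, 1.0],
    [0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 0.0],
])
```

A persistently high degenerate fraction suggests that prompts are too easy or too hard for the current policy, which is the regime the sparse-reward variants discussed above are designed for.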

7. Impact, Generalization, and Current Research Directions

GRPO has underpinned some of the most successful advances in reasoning- and alignment-focused LLM post-training, notably in DeepSeekMath and DeepSeek-R1 (Zhou et al., 1 Mar 2026, Mroueh, 9 Mar 2025). Its theoretical foundation as a U-statistic-based, oracle-efficient gradient estimator provides a precise and robust alternative to actor-critic methods. Ongoing research explores extensions to new domains (e.g., continuous robotics, multi-modal agents), more effective group selection and variance-enhancement (expand-and-prune, latent pruning), and principled regularization for multi-objective and personalized regimes (Ge et al., 17 Dec 2025, Khanda et al., 25 Jul 2025, Wang et al., 17 Feb 2026). The field continues to push group-based, critic-free policy gradients as a fundamental mechanism for efficient, stable, and generalizable reinforcement learning in complex, high-dimensional environments.