Group Relative Policy Optimisation

Updated 17 November 2025
  • Group Relative Policy Optimisation is a critic-free, group-based reinforcement learning method that normalizes rewards within candidate groups to yield variance-reduced, ranking-based policy gradients.
  • It computes advantages by contrasting individual rewards against group statistics, eliminating the need for a traditional value function while offering stable and trust-region-constrained updates.
  • Extensions like NGRPO, AGPO, and GCPO enhance GRPO by addressing vanishing gradients and multi-objective challenges, making it suitable for diverse domains including LLM post-training and math reasoning.

Group Relative Policy Optimisation (GRPO) is a reinforcement learning (RL) paradigm for critic-free, group-based policy improvement, particularly influential in LLM post-training, mathematical reasoning, and general RL contexts. Unlike critic-based algorithms such as Proximal Policy Optimisation (PPO), GRPO eliminates the explicit value function and instead computes advantages by normalizing verifiable or synthetic rewards within a batch (group) of outputs. This “group-relative” approach provides a variance-reducing, ranking-based policy gradient estimate, making GRPO robust, sample-efficient, and suitable for domains where reward signals are deterministic, sparse, or readily verifiable.

1. Formal Definition and Core Objective

GRPO operates by generating $G$ candidate trajectories (outputs) per context or prompt (e.g., question $q$ for LLMs), then computing the advantage for each candidate by comparing its reward to the intragroup statistics. Given old policy parameters $\theta_{\text{old}}$, the core objective for a group of $G$ samples is

$$\bar{r} = \frac{1}{G}\sum_{j=1}^G r_j, \qquad \sigma_r = \sqrt{\frac{1}{G}\sum_{j=1}^G (r_j - \bar{r})^2},$$

$$A_i = \frac{r_i - \bar{r}}{\sigma_r + \delta}$$

for each candidate $i = 1, \dots, G$, where $\delta$ is a numerical stabilizer. The per-token probability ratio is

$$r_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,<t})}$$

and the clipped surrogate objective follows the PPO structure:

$$J^{\mathrm{GRPO}}(\theta) = \mathbb{E}_q \Biggl[ \sum_{i=1}^G \sum_{t=1}^T \min\bigl[ r_{i,t}(\theta)\,A_i,\ \mathrm{clip}\bigl(r_{i,t}(\theta),\, 1-\epsilon,\, 1+\epsilon\bigr)\,A_i \bigr] \Biggr].$$

The expected gradient is estimated with respect to $\theta$; a (soft) KL penalty to a reference policy can be added for stabilization.
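To make the group-normalized advantage and the clipped surrogate concrete, the following is a minimal NumPy sketch for a single group, assuming per-token log-probabilities are already available as arrays (in practice they would be differentiable tensors from the policy model); the variable names mirror the symbols above and the example data are synthetic.

```python
# Minimal sketch of the GRPO advantage and clipped surrogate for one group,
# using NumPy arrays in place of differentiable tensors.
import numpy as np

def group_advantages(rewards: np.ndarray, delta: float = 1e-6) -> np.ndarray:
    """A_i = (r_i - mean) / (std + delta), computed within one group of G candidates."""
    mean, std = rewards.mean(), rewards.std()
    return (rewards - mean) / (std + delta)

def grpo_clipped_surrogate(logp_new: np.ndarray,   # (G, T): log pi_theta(o_{i,t} | q, o_{i,<t})
                           logp_old: np.ndarray,   # (G, T): log pi_theta_old(...)
                           rewards: np.ndarray,    # (G,): one verifiable reward per candidate
                           eps: float = 0.2) -> float:
    A = group_advantages(rewards)                   # (G,)
    ratio = np.exp(logp_new - logp_old)             # per-token probability ratios r_{i,t}
    unclipped = ratio * A[:, None]
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * A[:, None]
    # GRPO takes the min of the unclipped and clipped terms, summed over tokens and candidates.
    return float(np.minimum(unclipped, clipped).sum())

# Example: a group of G = 4 candidates with binary verifiable rewards.
rng = np.random.default_rng(0)
G, T = 4, 6
logp_old = rng.normal(-2.0, 0.1, size=(G, T))
logp_new = logp_old + rng.normal(0.0, 0.05, size=(G, T))
rewards = np.array([1.0, 0.0, 1.0, 0.0])
print(grpo_clipped_surrogate(logp_new, logp_old, rewards))
```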

2. Theoretical Properties and Contrastive Foundations

When rewards are verifiable and binary, the group-normalized advantage renders GRPO an instance of regularized contrastive learning. The objective may be cast as a KL-regularized contrastive loss

$$L(\pi;\, \pi_{\text{old}}, \pi_{\text{ref}}) = \mathbb{E}_{q,o}\!\left[-\omega^+(p)\,\mathbb{1}_{r=1}\log\pi(o \mid q) + \omega^-(p)\,\mathbb{1}_{r=0}\log\pi(o \mid q)\right] + \beta\,\mathbb{E}_q\!\left[\mathrm{KL}\bigl(\pi(\cdot \mid q)\,\|\,\pi_{\text{ref}}(\cdot \mid q)\bigr)\right],$$

where $\omega^+(p), \omega^-(p)$ are adaptive weights derived from the frequency of correct/incorrect outcomes under $\pi_{\text{old}}$. GRPO directly amplifies the probability of success beyond the reference model's fixed point, with explicit closed-form update rules and provable monotone improvement under mild assumptions (Mroueh, 9 Mar 2025; Nan et al., 23 Sep 2025).

Empirically and theoretically, the group-based normalization yields variance reduction without reliance on a separate value function, providing stable, trust-region-constrained updates analogous to PPO, but without value bias.

3. Algorithmic Variants and Extensions

A rich landscape of GRPO-inspired algorithms has emerged, each addressing specific limitations or introducing domain-driven enhancements.

(a) Negative-Enhanced GRPO (NGRPO)

NGRPO introduces a virtual best reward into each group, calibrating the advantage for "all incorrect" or "all correct" groups (previously yielding null gradients), and uses asymmetric clipping (relaxed for positive, tightened for negative advantages). This calibration ensures that even homogeneous error groups produce nonzero, appropriately signed gradients, correcting the vanishing-gradient pathology of vanilla GRPO and improving learning from hard samples. Asymmetric clipping stabilizes the resultant exploration pressure (Nan et al., 23 Sep 2025).
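A hedged sketch of this calibration follows, under the assumption that the virtual best reward (taken here as the maximum attainable reward, 1.0 for binary tasks) is simply appended to the group before the statistics are computed; the exact NGRPO formulation may differ, but the sketch shows how an all-incorrect group then receives nonzero, negative advantages, and how asymmetric clipping can be applied.

```python
# Illustrative (assumed) NGRPO-style calibration: a virtual best reward joins the
# group statistics, and clipping is asymmetric (eps_pos for positive advantages,
# eps_neg for negative ones).
import numpy as np

def calibrated_advantages(rewards: np.ndarray, r_virtual: float = 1.0, delta: float = 1e-6) -> np.ndarray:
    augmented = np.append(rewards, r_virtual)   # virtual best candidate enters mean/std
    mean, std = augmented.mean(), augmented.std()
    return (rewards - mean) / (std + delta)     # all-incorrect groups now get nonzero, negative A_i

def asymmetric_clip(ratio: np.ndarray, A: np.ndarray,
                    eps_pos: float = 0.24, eps_neg: float = 0.16) -> np.ndarray:
    # Relaxed clipping range where A_i > 0, tightened range where A_i < 0.
    eps = np.where(A[:, None] > 0, eps_pos, eps_neg)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.minimum(ratio * A[:, None], clipped * A[:, None])

# An all-incorrect group: vanilla GRPO would return all-zero advantages here.
print(calibrated_advantages(np.zeros(4)))   # strictly negative values
```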

(b) Adaptive/Gain-Modified GRPO (AGPO)

AGPO directly injects $\pm 1$ signals for unanimous groups (all-correct: $+1$, all-wrong: $-1$) and adds a self-adaptive length bonus, encouraging concise but correct reasoning in LLMs. The formal advantage update is

$$A_i = \begin{cases} +1, & \mu_G = r_{\max} = 1 \\ -1, & \mu_G = r_{\min} = 0 \\ (r_i - \mu_G)/\sigma_G, & \text{otherwise} \end{cases}$$

This adaptive corner-case handling robustly eliminates silent batch drops and reduces variance, increasing token efficiency and stability (Li et al., 20 Mar 2025).
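The case analysis above translates directly into code; the following minimal sketch implements only the advantage rule (the self-adaptive length bonus is omitted), with a small numerical stabilizer added in the mixed-group branch.

```python
# Minimal sketch of the AGPO advantage rule: unanimous groups get a fixed +/-1
# signal, mixed groups fall back to the standard group normalization.
import numpy as np

def agpo_advantages(rewards: np.ndarray, delta: float = 1e-6) -> np.ndarray:
    mu, sigma = rewards.mean(), rewards.std()
    if np.all(rewards == 1.0):          # all-correct group: mu = r_max = 1
        return np.ones_like(rewards)
    if np.all(rewards == 0.0):          # all-wrong group: mu = r_min = 0
        return -np.ones_like(rewards)
    return (rewards - mu) / (sigma + delta)   # delta added here for numerical safety

print(agpo_advantages(np.array([1.0, 1.0, 1.0])))   # [ 1.  1.  1.]
print(agpo_advantages(np.array([0.0, 0.0, 0.0])))   # [-1. -1. -1.]
print(agpo_advantages(np.array([1.0, 0.0, 1.0])))   # group-normalized values
```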

(c) Group Causal Policy Optimization (GCPO)

GCPO incorporates structural causal modeling to correct for semantic dependencies between candidate responses within a group (the collider structure). It executes a projection onto a causally informed subspace both in reward normalization and via an additional KL regularization against a causally projected reference. This yields further variance reduction and improved generative coordination in LLMs (Gu et al., 7 Aug 2025).

(d) MO-GRPO (Multi-objective)

For multi-objective problems, GRPO’s sum-and-normalize structure induces disproportionate optimization toward high-variance rewards. MO-GRPO corrects this by normalizing each reward component individually before aggregation, ensuring balanced and scale-invariant credit assignment:

$$\hat{R}_i(q, o_g) = \frac{R_i(q, o_g) - \mu_i}{\sigma_i + \epsilon}, \qquad A_g^{\mathrm{MO}} = \sum_{i=1}^K \hat{R}_i(q, o_g).$$

This yields stability and predictability across domains with diverse reward structures (Ichihara et al., 26 Sep 2025).
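A minimal sketch of the per-component normalization, assuming rewards arrive as a $(G, K)$ array of $K$ components for $G$ candidates; the example uses two deliberately mismatched reward scales to show why per-component standardization matters.

```python
# Minimal sketch of MO-GRPO's per-component normalization: each reward component
# is standardized over the group before summation, so no single high-variance
# objective dominates the aggregated advantage.
import numpy as np

def mo_grpo_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """rewards: shape (G, K) with G candidates and K reward components."""
    mu = rewards.mean(axis=0, keepdims=True)      # per-component mean over the group
    sigma = rewards.std(axis=0, keepdims=True)    # per-component std over the group
    normalized = (rewards - mu) / (sigma + eps)   # \hat{R}_i(q, o_g)
    return normalized.sum(axis=1)                 # A_g^{MO} = sum over components

# Two components on very different scales: a quality score in [0, 1]
# and a length penalty in [-100, 0].
rewards = np.array([[0.8, -10.0],
                    [0.6, -90.0],
                    [0.9, -50.0]])
print(mo_grpo_advantages(rewards))   # both components contribute comparably after normalization
```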

(e) Trajectory, Off-Policy, and Multi-Sample Variants

  • TIC-GRPO (Trajectory-level Importance-Corrected GRPO): Replaces token-level importance-sampling ratios with a single trajectory-level probability ratio, recovering an unbiased estimate of the current policy gradient and yielding improved convergence rates (Pang et al., 4 Aug 2025); a minimal sketch of the trajectory-level ratio follows this list.
  • Off-policy GRPO: Extends GRPO to off-policy sample reuse via behavior policies, using appropriate importance weighting and clipping to maintain the policy-improvement guarantees (Mroueh et al., 28 May 2025).
  • Hybrid GRPO: Merges group-based empirical advantage estimation with a bootstrapped value baseline, balancing sample efficiency and variance control (Sane, 30 Jan 2025).
  • Token-level GRPO (TEPO): Uniformly distributes group-level advantages across tokens with Markov-likelihood normalization, improving stability and gradient smoothness in long-sequence CoT regimes (Lin et al., 10 Oct 2025).
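As referenced in the TIC-GRPO item above, the following hedged sketch contrasts vanilla GRPO's per-token ratios with a single trajectory-level ratio, assuming the trajectory probability factorizes over tokens so the ratio is the exponential of the summed log-probability differences.

```python
# Hedged sketch: trajectory-level vs token-level importance ratios.
import numpy as np

def trajectory_ratio(logp_new: np.ndarray, logp_old: np.ndarray) -> np.ndarray:
    """logp_*: shape (G, T) token log-probs; returns one scalar ratio per candidate (shape (G,))."""
    return np.exp((logp_new - logp_old).sum(axis=1))

def token_ratios(logp_new: np.ndarray, logp_old: np.ndarray) -> np.ndarray:
    """Vanilla GRPO's per-token ratios r_{i,t}, shape (G, T)."""
    return np.exp(logp_new - logp_old)
```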

4. Implementation Details and Training Workflow

A canonical GRPO training pipeline for LLMs or general RL is as follows:

  1. For each prompt or state, generate $G$ candidate outputs under the frozen/old policy $\pi_{\theta_{\text{old}}}$.
  2. Compute per-candidate rewards $r_i$ using a deterministic verifier or predefined reward model.
  3. Normalize rewards within each group to obtain $A_i$ or an enhanced/causal variant.
  4. For each token within each candidate, calculate the importance sampling ratio $r_{i,t}(\theta)$ and accumulate the PPO-style clipped surrogate loss, possibly using asymmetric clipping (NGRPO) or entropy regularization (Hybrid, TEPO).
  5. Apply a trust-region constraint by penalizing KL divergence to a reference policy.
  6. Take a stochastic gradient step on the total loss, and update $\theta_{\text{old}} \leftarrow \theta$ at the prescribed refresh frequency.
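The pipeline can be summarized in a control-flow skeleton such as the one below; `sample_fn`, `reward_fn`, `logp_new_fn`, `logp_old_fn`, `kl_fn`, and `apply_grad_fn` are hypothetical hooks standing in for the sampling backend, verifier or reward model, policy log-prob computation, KL penalty, and optimizer step, so this is a sketch of steps 1–6 rather than a drop-in implementation.

```python
# Control-flow skeleton of the canonical GRPO update; all *_fn arguments are
# hypothetical hooks supplied by the surrounding training framework.
import numpy as np

def grpo_step(prompts, sample_fn, reward_fn, logp_new_fn, logp_old_fn, kl_fn, apply_grad_fn,
              G: int = 8, eps: float = 0.2, beta: float = 0.01) -> float:
    total_loss = 0.0
    for q in prompts:
        outputs = [sample_fn(q) for _ in range(G)]                   # 1. sample G candidates from the old policy
        rewards = np.array([reward_fn(q, o) for o in outputs])       # 2. verifiable per-candidate rewards
        A = (rewards - rewards.mean()) / (rewards.std() + 1e-6)      # 3. group-relative advantages
        for o, a in zip(outputs, A):
            ratio = np.exp(logp_new_fn(q, o) - logp_old_fn(q, o))    # 4. per-token importance ratios
            surrogate = np.minimum(ratio * a,
                                   np.clip(ratio, 1 - eps, 1 + eps) * a).sum()
            total_loss += -surrogate + beta * kl_fn(q, o)            # 5. KL trust-region penalty
    apply_grad_fn(total_loss)                                        # 6. optimizer step (old-policy refresh handled by caller)
    return total_loss
```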

Key hyperparameters include group size ($G$), clipping thresholds ($\epsilon$, or $\epsilon_{\text{pos}}$ and $\epsilon_{\text{neg}}$ for asymmetric variants), learning rate, and any special regularizers (e.g., for length bonus, causality, or multi-objective weighting).

Hardware and batch sizing should be selected to ensure effective group statistics (e.g., $G = 8$ is common for math LLMs, while up to $G = 50$ has been used for wireless control (Zhang et al., 18 Sep 2025)).

5. Empirical Evaluation and Domain-Specific Results

GRPO and its extensions have been evaluated across a wide range of domains:

| Domain/Task | Model/Setting | GRPO Variant | Best Reported Metric | Relative Gain Over Baseline |
| Math Reasoning | Qwen2.5-Math-7B | NGRPO (Nan et al., 23 Sep 2025) | Pass@1: 10.9% (AIME2025) | +0.9–3.0 AUC vs PPO, GRPO, PSR-NSR |
| Math Reasoning | Qwen2.5-Math-7B | Scaf-GRPO (Zhang et al., 22 Oct 2025) | Pass@1: 43.3% (AIME24) | +44.3% relative vs GRPO |
| Machine Translation | LLaMA/Qwen | MO-GRPO (Ichihara et al., 26 Sep 2025) | GPT4o-mini win-rate: 89% | 89% vs much lower for GRPO |
| Image Captioning | Transformer-based | GRPO (Liang, 3 Mar 2025) | CIDEr: 100.0 | +2.4 over SCST baseline |
| Hyperparameter Opt. | GRPOformer | GRPO (Guo et al., 21 Sep 2025) | BtR: 94.44% | Outperforms BOformer/SBOA |
| Speech Recognition | Llama3 ASR | GRPO (Shivakumar et al., 2 Sep 2025) | WER reduction: up to 18.4% | Robust domain adaptation |
| Wireless FAS | 49% model size cut | GRPO (Zhang et al., 18 Sep 2025) | Sum-rate with 50% compute cut | Outperforms PPO, 50% faster |

Learning curves, AUC metrics, ablations, and cross-domain transfers consistently indicate that GRPO variants dominate earlier PPO-style or SCST baselines, yield lower sample/gradient variance, accelerate learning progress, and improve transferability (e.g., Training-Free GRPO (Cai et al., 9 Oct 2025)).

6. Limitations, Practical Recommendations, and Future Directions

GRPO’s reliance on intragroup variance means that "all correct" or "all incorrect" groups produce vanishing gradients under the vanilla formulation, a pathology corrected by NGRPO, AGPO, and similar variants. The method assumes that reward variance within a group is meaningful: groups of homogeneous, uninformative candidates will not yield signal unless artificial calibration or scaffolding (Scaf-GRPO) is used. High reward variance across objectives can lead to reward hacking unless proper normalization (MO-GRPO) is applied. Empirical recommendations include the following (collected in the configuration sketch after the list):

  • Group size: $G = 8$ (LLMs), $G = 5$–$50$ (control tasks)
  • Clipping thresholds: $\epsilon_{\text{pos}} \approx 0.24$, $\epsilon_{\text{neg}} \approx 0.16$ (NGRPO); $\epsilon = 0.1$–$0.2$ otherwise
  • Add length regularization and advantage corner-case handling in reasoning LLMs for efficiency (Li et al., 20 Mar 2025)
  • Use off-policy variant when serving or IO costs dominate (Mroueh et al., 28 May 2025); combine with trajectory-level importance correction for unbiasedness (Pang et al., 4 Aug 2025)
  • For multi-objective RL, always normalize reward components individually (Ichihara et al., 26 Sep 2025)
  • Employ progressive scaffolding for hard or sparse-reward domains (Zhang et al., 22 Oct 2025)
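As noted before the list, the recommended defaults can be collected into a single configuration object; the field names below are illustrative rather than tied to any particular library, and the KL weight is an assumed placeholder not taken from this section.

```python
# Recommended GRPO defaults from this section gathered into one sketch config.
from dataclasses import dataclass

@dataclass
class GRPOConfig:
    group_size: int = 8                   # G = 8 for LLMs; 5-50 for control tasks
    eps_pos: float = 0.24                 # NGRPO asymmetric clipping, positive advantages
    eps_neg: float = 0.16                 # NGRPO asymmetric clipping, negative advantages
    eps: float = 0.2                      # symmetric clipping threshold otherwise (0.1-0.2)
    kl_beta: float = 0.01                 # KL penalty weight to the reference policy (assumed placeholder)
    length_bonus: bool = True             # AGPO-style length regularization for reasoning LLMs
    normalize_per_objective: bool = True  # MO-GRPO: normalize each reward component individually
```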

Future directions include the extension to subjective, noisy, or multi-modal reward models, scaling to ever more complex multi-agent environments, and the development of adaptive grouping and online calibration methods. The theoretical properties established for modern GRPO-style algorithms—monotonic improvement via clipping, fixed-point and contraction results, convergence bounds—provide a principled foundation for these continued advancements.
