
Group-Based Relative Policy Optimization

Updated 7 October 2025
  • GRPO is a reinforcement learning method that employs intra-group normalization and contrastive comparisons to estimate relative advantages and update policies.
  • It replaces baseline and critic-dependent strategies by evaluating groups of outputs per input, using clipping and KL regularization to maintain stability.
  • GRPO has been applied successfully in LLM fine-tuning, image captioning, speech recognition, and robotics to enhance performance and output diversity.

Group-based Relative Policy Optimization (GRPO) is a reinforcement learning (RL) algorithm rooted in group-wise advantage estimation and policy optimization. The method replaces traditional baseline-based or critic-dependent strategies with intra-group normalization, emphasizing contrastive, relative comparison among candidate outputs for a given input. Originally motivated by the deficiencies of single-sample baseline methods in sequence modeling and adopted as an alternative to Self-Critical Sequence Training (SCST), GRPO is now widely used across supervised and reinforcement learning fine-tuning of LLMs, image captioners, reasoning agents, speech models, and beyond. The central premise is to use a group of outputs per input, calculate reward-based advantages by group-centric comparison, and constrain the policy update via clipping and Kullback–Leibler (KL) regularization, ensuring both stability and output diversity.

1. Mathematical Formulation and Core Mechanism

GRPO operates by sampling, for each input $q$, a group $\{o_1, \dotsc, o_G\}$ of candidate outputs from a reference or current policy $\pi_{\theta_{\text{old}}}$, evaluating each with a task-specific reward, and then performing the policy update via a relative advantage measure. The fundamental GRPO objective, abstracted for sequence modeling and group-centric RL, is:

$$J_{\text{GRPO}}(\theta) = \mathbb{E}_{q \sim \mathbb{P}(Q),\, \{o_i\} \sim \pi_{\theta_{\text{old}}}(\cdot \mid q)} \left[ \frac{1}{G} \sum_{i=1}^{G} \min\!\left\{ \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)}\, A_i,\ \text{clip}\!\left( \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)},\, 1-\epsilon,\, 1+\epsilon \right) A_i \right\} - \beta\, D_{\text{KL}}\!\left( \pi_\theta \,\|\, \pi_{\text{ref}} \right) \right]$$

where the group advantage for each output is

$$A_i = \frac{r_i - \text{mean}(\{r_j\}_{j=1}^{G})}{\text{std}(\{r_j\}_{j=1}^{G})}$$

and the KL term

$$D_{\text{KL}}(\pi_\theta \,\|\, \pi_{\text{ref}}) = \mathbb{E}_{o \sim \pi_\theta}\left[ \log \frac{\pi_\theta(o \mid q)}{\pi_{\text{ref}}(o \mid q)} \right]$$

enforces trust region regularization.

Key hyperparameters include the update clipping parameter $\epsilon$ and the KL penalty $\beta$, both essential for controlling update magnitude and preventing excessive policy drift.
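
To make the objective concrete, the following is a minimal PyTorch sketch of the per-prompt GRPO loss. The function name, tensor shapes, the default values of $\epsilon$ and $\beta$, and the use of a k3-style KL estimator on sampled outputs are illustrative assumptions rather than details of any particular implementation.

```python
# Minimal sketch of the GRPO update for one prompt, assuming precomputed
# sequence log-probabilities under the current, old, and reference policies.
import torch

def grpo_loss(logp_new, logp_old, logp_ref, rewards, eps=0.2, beta=0.04):
    """All inputs are 1-D tensors of length G (one entry per sampled output o_i)."""
    # Group-relative advantage: whiten rewards within the group.
    adv = (rewards - rewards.mean()) / (rewards.std(unbiased=False) + 1e-8)

    # Importance ratio between the current policy and the sampling (old) policy.
    ratio = torch.exp(logp_new - logp_old)

    # PPO-style clipped surrogate, averaged over the group.
    surrogate = torch.min(ratio * adv,
                          torch.clamp(ratio, 1 - eps, 1 + eps) * adv).mean()

    # k3-style estimate of D_KL(pi_theta || pi_ref) from the sampled outputs.
    log_ratio_ref = logp_ref - logp_new
    kl = (torch.exp(log_ratio_ref) - log_ratio_ref - 1).mean()

    # The objective is maximized, so the loss is its negation.
    return -(surrogate - beta * kl)

# Example with G = 4 sampled outputs and binary correctness rewards.
G = 4
logp_old = torch.randn(G)
logp_new = (logp_old + 0.05 * torch.randn(G)).requires_grad_()
logp_ref = logp_old.clone()
loss = grpo_loss(logp_new, logp_old, logp_ref, torch.tensor([1.0, 0.0, 0.0, 1.0]))
loss.backward()
```

In practice the log-probabilities are sums of token-level log-probabilities over each sampled sequence, and the loss is averaged over a batch of prompts.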

2. Theoretical Properties and Connections

GRPO fundamentally reframes policy optimization as a contrastive learning problem, a view that becomes especially clear when binary or verifiable rewards are used. For binary rewards (e.g., correctness checks):

  • The group-wise "whitening" of reward via mean and standard deviation acts as adaptive weighting, amplifying correct outcomes when the policy is weak and penalizing failures when strong; a worked example follows this list.
  • The closed-form policy update,

$$\pi_{n}(o \mid q) = \frac{\pi_{\text{ref}}(o \mid q)\, \exp\!\left[ \frac{1}{\beta}\left( \omega^{+}(p_{n-1}(q))\, \mathbf{1}_{r=1} - \omega^{-}(p_{n-1}(q))\, \mathbf{1}_{r=0} \right) \right]}{Z_{n-1}(q)}$$

where $\omega^{+}$ and $\omega^{-}$ depend on base policy accuracy, yields provable amplification of the policy success rate over iterations (Mroueh, 9 Mar 2025).
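
The adaptive weighting can be read off directly from the group normalization. For binary rewards with empirical group success rate $\hat{p}$, the group mean is $\hat{p}$ and the (population) standard deviation is $\sqrt{\hat{p}(1-\hat{p})}$, so the normalized advantage reduces to the following (a routine calculation under these assumptions, not a formula quoted from the cited work):

$$A_i = \begin{cases} \dfrac{1-\hat{p}}{\sqrt{\hat{p}(1-\hat{p})}} = \sqrt{\dfrac{1-\hat{p}}{\hat{p}}}, & r_i = 1, \\[2ex] \dfrac{-\hat{p}}{\sqrt{\hat{p}(1-\hat{p})}} = -\sqrt{\dfrac{\hat{p}}{1-\hat{p}}}, & r_i = 0. \end{cases}$$

When $\hat{p}$ is small (weak policy), the rare correct outputs receive large positive advantages; when $\hat{p}$ is large (strong policy), the rare failures receive large negative advantages, matching the adaptive weighting described above.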

Recent work demonstrates the equivalence of GRPO, especially at the minimal group size (2-GRPO), with Direct Preference Optimization (DPO) under binary reward settings; the gradient of the GRPO objective directly aligns with that of a contrastive loss familiar from DPO (Wu et al., 1 Oct 2025). This reinterpretation justifies the use of small group sizes without sacrificing statistical efficiency or unbiasedness.

Convergence results establish that under conditions such as Lipschitz-bounded gradients and bounded rewards, both GRPO and variants (such as TIC-GRPO, replacing tokenwise importance sampling with trajectory-level corrections) guarantee convergence to stationary points with rates proportional to step size, number of inner updates, and inverse group size (Pang et al., 4 Aug 2025).

3. Algorithmic Innovations and Extensions

Numerous architectural and algorithmic enhancements have been developed to improve GRPO's flexibility, computational efficiency, and stability:

  • Completion Pruning Policy Optimization (CPPO): Selectively discards completions whose absolute advantage falls below a threshold, greatly accelerating training (up to 8.32×) with negligible or positive effect on accuracy (Lin et al., 28 Mar 2025); a minimal pruning sketch appears after this list.
  • Kalman Filter Enhanced GRPO (KRPO): Uses a lightweight Kalman filter to adaptively estimate the reward mean and variance, replacing the naive group mean for advantage normalization and thereby improving training robustness in noisy environments (Wang et al., 12 May 2025).
  • Prefix Grouper: Shares encoded representations for long, common prefixes across group samples, reducing FLOPs and memory to as little as $1/G$ of the naive cost and enabling scaling to larger group sizes (Liu et al., 5 Jun 2025).
  • Trajectory-Clustering and State-Aware GRPO: For continuous control (robotics), applies group-wise normalization to states and trajectories, with temporal and inter-group diversity regularization, enabling robust RL in infinite action settings (Khanda et al., 25 Jul 2025).
  • Multi-Layer GRPO (MGRPO): Adds an explicit self-correction layer; the first GRPO layer generates initial outputs, while a secondary GRPO models self-correction via error detection and refinement, substantially improving multi-step reasoning accuracy (Ding et al., 5 Jun 2025).
  • Tree-GRPO: Tree-structured rollouts enable process-level preference estimation and finer credit assignment, with sampled branches sharing prefixes and intra-tree normalization mimicking step-wise DPO (Ji et al., 25 Sep 2025).
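
As a concrete illustration of the pruning idea mentioned above, here is a hypothetical Python sketch of CPPO-style completion filtering. The threshold value, the keep-at-least-one fallback, and all names are illustrative assumptions, not the exact procedure of the cited paper.

```python
# Hypothetical CPPO-style pruning: drop completions whose |advantage| is
# below a threshold before running the expensive policy-gradient pass.
import torch

def prune_completions(rewards, completions, tau=1.0):
    """rewards: (G,) tensor of scalar rewards; completions: list of G sequences."""
    adv = (rewards - rewards.mean()) / (rewards.std(unbiased=False) + 1e-8)
    keep = adv.abs() >= tau
    if not keep.any():                       # degenerate group: keep the strongest signal
        keep[adv.abs().argmax()] = True
    kept = [c for c, k in zip(completions, keep.tolist()) if k]
    return kept, adv[keep]

rewards = torch.tensor([1.0, 0.0, 0.0, 0.0, 1.0, 0.0])
completions = [f"completion_{i}" for i in range(6)]
kept, kept_adv = prune_completions(rewards, completions)
# Only the two correct (high-|advantage|) completions survive, so the
# forward/backward cost scales with the pruned group rather than with G.
```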

4. Practical Applications

GRPO has demonstrated strong empirical performance across a variety of domains:

| Domain | GRPO Role | Reported Impact |
|---|---|---|
| Image Captioning | RL stage, CIDEr reward | BLEU-4 +0.9%, CIDEr +2.4% vs. SCST; stable and diverse captioning |
| LLM Reasoning (DeepSeek-R1) | RLHF, binary reward | Success amplification, stable upgrades in mathematical/coding tasks |
| Safe/Aligned LLMs | Multi-label rewards | Multi-objective improvements (safety +0.28), robust alignment (Li et al., 26 Mar 2025) |
| Speech Recognition | ASR WER/ED reward | Up to 18.4% WER reduction, hallucination suppression, domain robustness (Shivakumar et al., 2 Sep 2025) |
| TTS | CER + ASR-NLL reward | Simultaneous gains in intelligibility and naturalness (Liu et al., 23 Sep 2025) |
| Hyperparameter Opt (GRPOformer) | HPO via RL | Outperforms baselines, achieves high optimization efficiency (Guo et al., 21 Sep 2025) |
| Robotics (Continuous) | Grouped control policies | Stabilizes high-dim, sparse-reward RL, ensures temporal smoothness (Khanda et al., 25 Jul 2025) |
| Visual Generation (DanceGRPO) | RL for diffusion/flow | Up to 181% improvement in HPS-v2.1, CLIP; unifies text-to-image/video RL (Xue et al., 12 May 2025) |
| Translation, Control | Multi-objective RL (MO-GRPO) | Prevents reward hacking, stable balance of all objectives (Ichihara et al., 26 Sep 2025) |

For many of these applications, GRPO eliminates the need for critic networks, enables direct RL from rule-based or verifiable rewards, and circumvents reward hacking and credit assignment failures typical in traditional actor-critic or proxy-reward approaches.

5. Limitations and Solutions

Key known challenges and corresponding strategies are:

  • Inefficient Computation for Large Groups: Naive GRPO scales linearly in cost with group size and input prefix length, remedied by methods like Prefix Grouper (Liu et al., 5 Jun 2025) and CPPO (Lin et al., 28 Mar 2025).
  • Reward Hacking in Multi-Objective Settings: Aggregating disparate rewards enables objectives with larger variance to dominate, leading to pathological optimization (e.g., maximizing readability but disregarding accuracy). MO-GRPO addresses this via per-objective normalization:

$$A_g^{\text{MO}} = \sum_{i=1}^{K} \frac{R_i(q, o_g) - \text{mean}_{o}\big(R_i(q, o)\big)}{\text{std}_{o}\big(R_i(q, o)\big)}$$

ensuring balanced optimization without manual scaling (Ichihara et al., 26 Sep 2025).
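
A minimal sketch of this per-objective normalization, assuming a $(G, K)$ reward matrix; the names and the example reward scales are illustrative:

```python
# Illustrative MO-GRPO-style advantage: whiten each reward channel within the
# group before summing, so no objective dominates purely by its scale.
import torch

def mo_grpo_advantage(rewards):
    """rewards: (G, K) tensor -- G completions scored on K objectives."""
    mean = rewards.mean(dim=0, keepdim=True)                    # per-objective mean
    std = rewards.std(dim=0, unbiased=False, keepdim=True) + 1e-8
    return ((rewards - mean) / std).sum(dim=1)                  # (G,) combined advantage

# Two objectives on very different scales (e.g., accuracy in {0, 1} and a
# fluency score in the hundreds): a naive sum would be dominated by the latter.
rewards = torch.tensor([[1.0, 120.0],
                        [0.0, 310.0],
                        [1.0,  95.0],
                        [0.0, 240.0]])
adv = mo_grpo_advantage(rewards)  # both objectives now contribute comparably
```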

  • Lack of Intermediate Supervision: In long-chain tasks or complex RL environments, reliance on end-outcome rewards leads to "credit assignment pathology." Multi-layer GRPO (MGRPO) introduces an explicit self-correction layer to address this (Ding et al., 5 Jun 2025), and tree-based strategies provide step-wise preference learning (Ji et al., 25 Sep 2025).
  • Advantage Baseline Sensitivity: The static group mean is vulnerable to noise; KRPO dynamically tracks the latent reward mean and variance with Kalman filtering (Wang et al., 12 May 2025). A schematic sketch of such a filtered baseline follows this list.
  • Potential for Policy Collapse/Instability: Excessively confident predictions can be unduly penalized, leading to flattened distributions. GTPO detects token-level conflicts and applies entropy-based filtering, removing the need for explicit KL regularization (Simoni et al., 5 Aug 2025).
  • Group Size Trade-offs: While a larger group size $G$ stabilizes the advantage estimate, recent analysis shows (in binary reward settings) that 2-GRPO, the minimal group size, retains unbiased contrastive gradients and matches large-group performance while drastically reducing computational cost (Wu et al., 1 Oct 2025).
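
The filtered-baseline idea can be sketched with a generic scalar Kalman filter that tracks a latent reward mean across groups. This is a schematic illustration under assumed noise parameters, not the exact formulation of KRPO.

```python
# Schematic Kalman-filtered reward baseline in the spirit of KRPO; q, r, and
# the initial state are illustrative assumptions.
import torch

class KalmanBaseline:
    def __init__(self, q=1e-3, r=0.05, mean0=0.0, var0=1.0):
        self.q, self.r = q, r              # process and observation noise
        self.mean, self.var = mean0, var0  # current estimate of the latent reward mean

    def update(self, observed_group_mean: float) -> float:
        self.var += self.q                             # predict: uncertainty grows
        gain = self.var / (self.var + self.r)          # Kalman gain
        self.mean += gain * (observed_group_mean - self.mean)
        self.var *= (1.0 - gain)                       # correct: uncertainty shrinks
        return self.mean

baseline = KalmanBaseline()
rewards = torch.tensor([0.8, 0.2, 0.9, 0.1])           # one noisy reward group
b = baseline.update(rewards.mean().item())
adv = (rewards - b) / (rewards.std(unbiased=False) + 1e-8)  # filtered-baseline advantage
```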

6. Normative Considerations and Future Directions

Recent work highlights that GRPO's core principle of relative, group-based normalization provides both an efficient statistical mechanism for variance reduction and a natural route to contrastive, preference-based optimization. The contrastive view both unifies GRPO with DPO-style objectives and justifies the use of pairwise or small-group sampling under certain reward regimes. Empirical results consistently show that, when properly normalized and regularized (via KL or entropy controls), GRPO-based algorithms enable efficient, high-quality fine-tuning of LLMs, of policy and generative models in vision and speech, and of systems in multi-objective RL where traditional single-reward approaches struggle with reward imbalance and manipulation.

Current research directions include: extending GRPO architectures to continuous control via cluster-based estimation (Khanda et al., 25 Jul 2025), further reducing computational overhead for ultra-long-context and prefix-heavy tasks, advancing group-based RL for multi-modal, multi-turn agentic settings via tree-based rollouts (Ji et al., 25 Sep 2025), and defining sharper theoretical convergence bounds under function approximation and noisy reward conditions.

Open-source codebases for many variants are available, including, for example, image captioning (Liang, 3 Mar 2025), CPPO (Lin et al., 28 Mar 2025), KRPO (Wang et al., 12 May 2025), Prefix Grouper (Liu et al., 5 Jun 2025), and DanceGRPO for visual generation (Xue et al., 12 May 2025), facilitating broad adoption and reproducibility.

7. Summary Table: Algorithmic Extensions

| Variant | Main Innovation | Primary Domain | Reference |
|---|---|---|---|
| GRPO (canonical) | Groupwise advantage and KL regularization | LLM RLHF, vision, captioning | (Liang, 3 Mar 2025; Mroueh, 9 Mar 2025) |
| CPPO | Pruning low-advantage completions | LLM reasoning | (Lin et al., 28 Mar 2025) |
| KRPO | Adaptive baseline (Kalman filter) | RL for LM reasoning | (Wang et al., 12 May 2025) |
| Prefix Grouper | Shared-prefix attention computation | LLMs, long-context tasks | (Liu et al., 5 Jun 2025) |
| MGRPO | Multi-layer, self-correction | Multi-step LLM reasoning | (Ding et al., 5 Jun 2025) |
| MO-GRPO | Per-objective advantage normalization | Multi-objective RL/MT | (Ichihara et al., 26 Sep 2025) |
| Tree-GRPO | Tree-structured, process-level grouping | LLM agent RL | (Ji et al., 25 Sep 2025) |
| DanceGRPO | RL for visual generation (SDE-based) | Text/image/video synthesis | (Xue et al., 12 May 2025) |
| GTPO | Conflict mask, entropy filtering | LLM alignment, reasoning | (Simoni et al., 5 Aug 2025) |
| 2-GRPO | Minimal group, DPO contrastive link | LLM RLHF | (Wu et al., 1 Oct 2025) |

In all, Group-based Relative Policy Optimization is a unifying methodology for critic-free, high-stability, and sample-efficient RL with groupwise normalization, inheriting connections to both contrastive and preference-based learning, and characterized by rapid algorithmic innovation adapting it to vision, robotics, speech, and multi-objective tasks.
