
Group Relative Policy Optimization (GRPO)

Updated 1 July 2026
  • The paper examines GRPO, a reinforcement learning method that normalizes advantages across grouped trajectories to improve training stability without relying on a learned value network.
  • GRPO uses group-wise advantage normalization to reduce variance; the paper analyzes the resulting finite-group bias and tail-miss probability.
  • The paper also presents F-GRPO, a focal-adjusted variant that downweights high-success ("easy") prompts to recover diversity and improve performance on in-domain and out-of-domain tasks.

Group Relative Policy Optimization (GRPO) is a reinforcement learning (RL) algorithm for critic-free, group-based policy optimization, originally introduced to advance reasoning capabilities in LLMs trained with verifiable or binary rewards. The central idea is to normalize advantages within a group of parallel rollouts for each prompt, stabilizing updates without requiring a learned value function. Because its design is baseline-free and variance-reducing, GRPO serves as a foundation for subsequent algorithmic variants, including Focal GRPO (F-GRPO), and for extensions to other domains such as speech recognition, multi-agent control, and neural combinatorial optimization (Plyusov et al., 6 Feb 2026).

1. Core GRPO Algorithm: Objective and Advantage Normalization

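GRPO operates by generating, for each input $x$, a group of $N$ trajectories $\{o_i\}_{i=1}^N$ using the current or previous policy $\pi_\theta$. Each trajectory is assigned a verifiable (often binary) reward:

$$R_i = R_c \cdot 1[o_i\;\text{correct}] + R_w \cdot 1[o_i\;\text{incorrect}]$$

with $R_c > R_w$ (e.g., $R_c = 1$, $R_w = 0$ or $-1$). For this group, the mean $\bar R$ and standard deviation $\sigma_R$ of the rewards are computed:

$$\bar R = \frac{1}{N}\sum_{i=1}^N R_i, \qquad \sigma_R = \sqrt{\frac{1}{N}\sum_{i=1}^N \left(R_i - \bar R\right)^2}$$

The group-relative advantage for each trajectory is defined as:

$$\hat A_i = \frac{R_i - \bar R}{\sigma_R + \epsilon}$$

where $\epsilon > 0$ ensures numerical stability. The token-level policy update uses PPO-style likelihood ratios $r_{i,t}(\theta) = \pi_\theta(o_{i,t} \mid x, o_{i,<t}) / \pi_{\theta_{\text{old}}}(o_{i,t} \mid x, o_{i,<t})$ (between new and old policy at each token), with a clipped surrogate objective:

$$\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}\left[\frac{1}{N}\sum_{i=1}^N \frac{1}{|o_i|}\sum_{t=1}^{|o_i|} \min\left(r_{i,t}(\theta)\,\hat A_i,\ \operatorname{clip}\left(r_{i,t}(\theta),\, 1-\varepsilon_{\text{clip}},\, 1+\varepsilon_{\text{clip}}\right)\hat A_i\right)\right]$$

where $\varepsilon_{\text{clip}}$ is the clipping range. Here, all per-prompt statistics are localized within each group, allowing scale-invariance and variance suppression without a value network or explicit baseline (Plyusov et al., 6 Feb 2026).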
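The group normalization and the clipped per-token term can be illustrated with a small sketch. This is a minimal illustration and not the authors' implementation: the function names, the use of the population standard deviation (dividing by $N$), and the default values `eps=1e-6` and `clip_eps=0.2` are assumptions made here for the example.

```python
import math

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group: (R_i - mean) / (std + eps)."""
    n = len(rewards)
    mean = sum(rewards) / n
    # Population standard deviation (divide by N); an assumption of this sketch.
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + eps) for r in rewards]

def clipped_token_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO-style clipped surrogate term for a single token."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

# Binary rewards with R_c = 1 and R_w = 0 for a group of four rollouts.
rewards = [1.0, 0.0, 0.0, 1.0]
print(group_relative_advantages(rewards))               # close to [+1, -1, -1, +1]
# A group in which every rollout has the same reward carries no update signal.
print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))  # all zeros
```

Because the statistics are computed per group, a group whose rollouts all succeed or all fail produces zero advantages, which is why only mixed groups contribute to the update.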

2. Theoretical Analysis: Finite-Group Bias and Tail-Miss Probability

Although large group sizes $N$ approximate population statistics and minimize bias, they are computationally infeasible in practice. Finite $N$ introduces a characteristic bias in policy learning: rare but correct modes often go unsampled and are thus ignored or even downweighted by the normalization.

The "tail-miss" probability $P_{\text{miss}}(x; N)$ quantifies the chance that, for a prompt $x$, an update occurs (group contains both correct and incorrect trajectories) but none of the correct rollouts is from the rare, desired subspace. This is expressed as:

$$P_{\text{miss}}(x; N) = \left(1 - \tau(x)\right)^N - \left(1 - p(x)\right)^N - \left(p(x) - \tau(x)\right)^N$$

where $p(x)$ is the current probability of generating a correct trajectory and $\tau(x)$ the mass on "rare-correct" solutions. $P_{\text{miss}}$ is non-monotonic in group size: it vanishes for very small $N$ (no updates), also for very large $N$ (coverage complete), but peaks at modest sizes where active updates bias learning toward common solutions while missing the rare.
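The non-monotonic shape can be checked numerically. The sketch below assumes independent rollouts, each correct with probability `p` and rare-correct with probability `tau` (with `tau <= p`). Under that assumption, the event "no rare-correct rollout, at least one correct, at least one incorrect" has probability $(1-\tau)^N - (1-p)^N - (p-\tau)^N$: the probability of no rare-correct rollout, minus the all-incorrect and all-common-correct cases. The function name and the values `p = 0.3`, `tau = 0.02` are illustrative assumptions, not figures from the paper.

```python
def tail_miss_probability(p, tau, n):
    """P(group of n i.i.d. rollouts is mixed but contains no rare-correct rollout).

    p   : probability that a rollout is correct
    tau : probability that a rollout is rare-correct (0 <= tau <= p)
    """
    no_rare = (1 - tau) ** n           # no rollout falls in the rare-correct set
    all_incorrect = (1 - p) ** n       # no update: every rollout fails
    all_common = (p - tau) ** n        # no update: every rollout is common-correct
    return no_rare - all_incorrect - all_common

p, tau = 0.3, 0.02  # illustrative values only
for n in (1, 2, 4, 8, 16, 32, 64, 128, 256):
    print(n, round(tail_miss_probability(p, tau, n), 4))
```

With these illustrative values the probability is zero at `n = 1`, rises to a maximum at a moderate group size, and decays again as `n` grows large enough to cover the rare mode.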

Another consequence is the shrinking of "unsampled-correct mass": the probability mass on correct solutions not appearing in any sampled trajectory. Even as total correct mass can grow, drift induced by the group baseline can systematically reduce unsampled-correct mass, impeding exploration of rare-but-desirable solutions (Plyusov et al., 6 Feb 2026).

3. F-GRPO: Focal Difficulty-Aware Scaling for Diversity Recovery

Motivated by the bias identified above, Focal Group Relative Policy Optimization (F-GRPO) introduces a simple, prompt-specific scaling coefficient inspired by the Focal loss for classification:

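  1. Estimate per-prompt empirical success rate:

$$\hat p(x) = \frac{k}{N}$$

where $k$ is the number of correct trajectories in the group.

  2. Apply a Focal scaling with exponent $\gamma \ge 0$:

$$g(x) = \left(1 - \hat p(x)\right)^{\gamma}$$

which downweights the gradient update for prompts with many successes ("easy" prompts), thus emphasizing harder cases and rare modes.

  3. The group-relative advantage $\hat A_i$ is rescaled:

$$\hat A_i^{\text{F}} = g(x) \cdot \hat A_i$$

and the surrogate objective becomes:

$$\mathcal{J}_{\text{F-GRPO}}(\theta) = \mathbb{E}\left[\frac{1}{N}\sum_{i=1}^N \frac{1}{|o_i|}\sum_{t=1}^{|o_i|} \min\left(r_{i,t}(\theta)\,\hat A_i^{\text{F}},\ \operatorname{clip}\left(r_{i,t}(\theta),\, 1-\varepsilon_{\text{clip}},\, 1+\varepsilon_{\text{clip}}\right)\hat A_i^{\text{F}}\right)\right]$$

where $r_{i,t}(\theta)$ is the token-level likelihood ratio between the new and old policy and $\varepsilon_{\text{clip}}$ is the clipping range.

As $\gamma \to 0$, this recovers vanilla GRPO; as $\gamma$ increases, "obvious" (high-success) prompts are suppressed, counteracting group-drift bias (Plyusov et al., 6 Feb 2026).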
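To make the effect of the focal weight concrete, the sketch below compares an "easy" and a "hard" prompt at the same group size. It is a hedged illustration: the function names, the binary 0/1 reward convention, the population standard deviation, and `gamma = 1.0` are assumptions chosen for the example rather than settings reported in the source.

```python
import math

def group_relative_advantages(rewards, eps=1e-6):
    """Group-normalized advantages (population std; an assumption of this sketch)."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + eps) for r in rewards]

def focal_weight(rewards, gamma):
    """g = (1 - empirical success rate) ** gamma, for 0/1 rewards."""
    success_rate = sum(rewards) / len(rewards)
    return (1.0 - success_rate) ** gamma

def focal_scaled_advantages(rewards, gamma):
    """Rescale every advantage in the group by the prompt-level focal weight."""
    g = focal_weight(rewards, gamma)
    return [g * a for a in group_relative_advantages(rewards)]

hard = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]  # 1 of 8 correct
easy = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0]  # 7 of 8 correct

print(focal_weight(hard, gamma=1.0))  # 0.875
print(focal_weight(easy, gamma=1.0))  # 0.125
print(focal_scaled_advantages(hard, gamma=1.0)[0])   # lone success on the hard prompt
print(focal_scaled_advantages(easy, gamma=1.0)[-1])  # lone failure on the easy prompt
```

Both groups have the same reward variance, so plain group normalization gives their minority rollouts advantages of equal magnitude; the focal weight is what shrinks the update on the easy prompt relative to the hard one.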

4. Algorithmic Workflow and Pseudocode

F-GRPO differs from GRPO only in per-prompt computation of the empirical success rate and scaling of group advantages. The main steps per training iteration are as follows:

  1. Draw a batch of prompts $\mathcal{B}$.
  2. For each prompt $x \in \mathcal{B}$, sample $N$ rollouts $\{o_i\}_{i=1}^N$ under $\pi_{\theta_{\text{old}}}$.
  3. Compute rewards $R_i$, and the empirical success rate $\hat p(x)$.
  4. Set the focal scaling $g(x) = (1 - \hat p(x))^{\gamma}$.
  5. Compute group mean $\bar R$, standard deviation $\sigma_R$, and (scaled) group-relative advantages for each trajectory.
  6. Compute token-level surrogate losses (as in PPO), using the scaled advantage.
  7. Aggregate gradients and perform parameter update.

Pseudocode excerpt (Plyusov et al., 6 Feb 2026):

[Pseudocode listing not preserved in this copy; it follows the seven steps listed above.]
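The workflow steps can be sketched end to end on a toy problem. The code below is not the paper's pseudocode: it is a minimal stand-in that assumes a single prompt, a softmax policy over a handful of discrete answers (so each "trajectory" is one token), 0/1 rewards, and rollouts drawn from the current policy. Under those assumptions the likelihood ratio equals 1 at the update point, and the clipped surrogate gradient reduces to the advantage times the gradient of the log-probability. All names (`fgrpo_update`, `train`) and hyperparameters (`lr`, `gamma`, `group_size`) are hypothetical.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fgrpo_update(logits, samples, rewards, gamma=1.0, lr=0.5, eps=1e-6):
    """One F-GRPO step for a single prompt on a toy softmax policy over answers."""
    n = len(samples)
    # Step 3: empirical success rate (rewards are 1 for correct, 0 otherwise).
    success_rate = sum(rewards) / n
    # Step 4: focal scaling.
    g = (1.0 - success_rate) ** gamma
    # Step 5: group mean, standard deviation, and scaled advantages.
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    advantages = [g * (r - mean) / (std + eps) for r in rewards]
    # Steps 6-7: gradient of (1/n) * sum_i A_i * log pi(o_i) w.r.t. the logits,
    # followed by one gradient-ascent step.
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for o, a in zip(samples, advantages):
        for k in range(len(logits)):
            grad[k] += a * ((1.0 if k == o else 0.0) - probs[k]) / n
    return [z + lr * dz for z, dz in zip(logits, grad)]

def train(correct_answers, num_answers=5, group_size=8, steps=200, gamma=1.0, seed=0):
    rng = random.Random(seed)
    logits = [0.0] * num_answers
    for _ in range(steps):
        probs = softmax(logits)
        # Step 2: sample a group of rollouts from the current policy.
        samples = rng.choices(range(num_answers), weights=probs, k=group_size)
        rewards = [1.0 if o in correct_answers else 0.0 for o in samples]
        logits = fgrpo_update(logits, samples, rewards, gamma=gamma)
    return softmax(logits)

if __name__ == "__main__":
    final_probs = train(correct_answers={0, 1})
    print([round(p, 3) for p in final_probs])
```

Setting `gamma=0.0` in `train` turns the same loop into plain GRPO on this toy problem, which makes it easy to compare how probability mass is distributed across the correct answers under the two settings.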

5. Empirical Performance and Group Size Trade-Offs

Applied to the Qwen2.5-7B LLM, F-GRPO (with a fixed group size $N$ and focal exponent $\gamma$) achieves substantial accuracy improvements in both in-domain and out-of-domain mathematical reasoning tasks:

  • Baseline GRPO: pass@256 = 64.1, pass@1 ≈ 37.3
  • F-GRPO (same $N$): pass@256 = 70.3, pass@1 = 38.6
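The source does not state how pass@k is computed. As a hedged illustration of the metric, the sketch below uses the widely used unbiased estimator, which assumes `n` samples are drawn per problem and `c` of them are correct; whether the reported numbers were obtained this way is an assumption.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased estimate of pass@k from n samples of which c are correct.

    Returns the probability that at least one of k samples, drawn without
    replacement from the n available, is correct.
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# pass@1 tracks average per-sample accuracy; pass@256 rewards having any correct mode.
print(pass_at_k(n=256, c=3, k=1))
print(pass_at_k(n=256, c=3, k=256))
```

A policy can therefore raise pass@1 by concentrating mass on common correct solutions while lowering pass@k for large k on problems whose only correct solutions are rare, which is the trade-off the comparison above measures.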

F-GRPO at a smaller group size $N$ matches or slightly exceeds GRPO with a larger group size (which achieves pass@256 ≈ 70.1), yielding a proportional reduction in rollout cost for similar diversity and success metrics. On out-of-domain tasks, F-GRPO improves pass@256 from 55.9 to 63.3. The method similarly benefits DAPO and CISPO policy optimization variants.

Group size effects:

  • Small $N$: most groups do not contain both a success and a failure, so few updates occur; this favors diversity but reduces pass@1.
  • Moderate $N$: updates boost pass@1 but at the expense of pass@256 (diversity).
  • Large $N$: larger groups recover diversity, but at higher compute cost.
  • F-GRPO with moderate $N$ achieves the diversity and success of larger groups at no extra cost (Plyusov et al., 6 Feb 2026).
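The first regime can be quantified with a short calculation. Assuming independent rollouts with per-rollout success probability `p`, a group produces a non-zero update only if it contains at least one success and at least one failure. The function name and the value `p = 0.05` are illustrative assumptions.

```python
def active_group_probability(p, n):
    """P(a group of n i.i.d. rollouts contains both a success and a failure)."""
    return 1.0 - p ** n - (1.0 - p) ** n

p = 0.05  # illustrative success probability for a hard prompt
for n in (2, 4, 8, 16, 32, 64):
    print(n, round(active_group_probability(p, n), 3))
```

For a hard prompt, very small groups are almost always all-failure and contribute nothing, while the fraction of active groups rises steadily with the group size; the same expression is symmetric in `p` and `1 - p`, so very easy prompts behave the same way.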

6. Broader Significance and Applicability

Theoretical and empirical results show that GRPO's group normalization, while robust and scalable, introduces a non-monotonic, group-size-dependent bias toward common solutions, especially when rare-correct modes are under-sampled. F-GRPO supplies a minimal, focal-inspired variant that actively corrects this effect by adaptively weighting updates according to observed group success rates. This technique is agnostic to the underlying group-relative RL algorithm and can be directly applied to DAPO, CISPO, and other group-normalized RLVR methods.

By stabilizing and diversifying policy updates without incurring additional sampling or computational cost, F-GRPO enables practical deployment of group-relative RL schemes in domains where rare modes are critical (reasoning, code generation, safety-sensitive settings), and large group sizes are not computationally viable (Plyusov et al., 6 Feb 2026).

7. Limitations and Directions for Future Research

F-GRPO's focal scaling necessarily depends upon accurate online estimation of group success rates, which may be impacted by reward sparsity or early training dynamics. Although the method alleviates group-drift and loss of rare-correct mass, further advances may require integrating explicit rare-mode tracking, enhanced rollout-generation strategies, or importance correction for under-sampled trajectories.

Possible research directions include analytical study of scaling behavior across RLVR tasks, adaptive scheduling of the focal exponent $\gamma$ based on training signals, and application to settings with reward corruption or significant noise.

