Group Relative Policy Optimization (GRPO)

Updated 1 July 2026

The paper analyzes GRPO, a reinforcement learning method that normalizes advantages within groups of trajectories to improve training stability without relying on a learned value network.
It characterizes the finite-group bias and tail-miss probability that group-wise advantage normalization introduces at practical group sizes.
The paper also presents F-GRPO, a focal-adjusted variant that downweights prompts with high empirical success rates to recover diversity and improve performance on in-domain and out-of-domain reasoning tasks.

Group Relative Policy Optimization (GRPO) is a reinforcement learning (RL) algorithm for critic-free, group-based policy optimization, originally introduced to advance reasoning capabilities in LLMs trained with verifiable or binary rewards. The central idea is to normalize advantages within a group of parallel rollouts for each prompt, stabilizing updates without requiring a learned value function. Because of its critic-free, variance-reducing design, GRPO serves as a foundation for later algorithmic variants, including Focal GRPO (F-GRPO), and for extensions to other domains such as speech recognition, multi-agent control, and neural combinatorial optimization (Plyusov et al., 6 Feb 2026).

1. Core GRPO Algorithm: Objective and Advantage Normalization

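GRPO operates by generating, for each input $x$, a group of $N$ trajectories $\{o_i\}_{i=1}^N$ sampled from the current or previous policy, denoted $\pi_{\theta_{\text{old}}}$. Each trajectory is assigned a verifiable (often binary) reward: $R_i = R_c \cdot 1[o_i\;\text{correct}] + R_w \cdot 1[o_i\;\text{incorrect}]$ with $R_c > R_w$ (e.g., $R_c=1$, $R_w=0$ or $-1$). For this group, the mean $\bar R$ and standard deviation $\sigma_R$ of the rewards are computed:

$$\bar R = \frac{1}{N}\sum_{i=1}^{N} R_i, \qquad \sigma_R = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\bigl(R_i - \bar R\bigr)^2}$$

The group-relative advantage for each trajectory is defined as:

$$\hat{A}_i = \frac{R_i - \bar R}{\sigma_R + \epsilon}$$

where $\epsilon > 0$ ensures numerical stability. The token-level policy update uses PPO-style likelihood ratios $r_{i,t}(\theta) = \pi_\theta(o_{i,t} \mid x, o_{i,<t}) \,/\, \pi_{\theta_{\text{old}}}(o_{i,t} \mid x, o_{i,<t})$ (between new and old policy at each token), with a clipped surrogate objective:

$$\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}\left[\frac{1}{N}\sum_{i=1}^{N}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\min\Bigl(r_{i,t}(\theta)\,\hat{A}_i,\ \operatorname{clip}\bigl(r_{i,t}(\theta),\,1-\epsilon_{\text{clip}},\,1+\epsilon_{\text{clip}}\bigr)\,\hat{A}_i\Bigr)\right]$$

Because all statistics are computed within each prompt's group, the advantages are invariant to reward scale and have reduced variance, without a value network or learned baseline (Plyusov et al., 6 Feb 2026).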
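The group normalization described above can be illustrated with a minimal sketch. The function name `group_relative_advantages`, the default `eps`, and the use of the population (1/N) standard deviation are assumptions made for illustration, not details confirmed by the source.

```python
import math

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of rollouts."""
    n = len(rewards)
    mean = sum(rewards) / n
    # Population standard deviation (1/N); an assumption of this sketch.
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    # eps keeps the division finite when every reward in the group is equal.
    return [(r - mean) / (std + eps) for r in rewards]

# One group of N = 4 binary rewards (R_c = 1, R_w = 0): one success, three failures.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 0.0])
```

The single correct rollout receives a positive advantage and the three incorrect ones share a smaller negative advantage. A group whose rewards are all equal yields all-zero advantages and therefore contributes no gradient.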

2. Theoretical Analysis: Finite-Group Bias and Tail-Miss Probability

Although large group sizes $N$ approximate population statistics and minimize bias, they are often computationally impractical. Finite $N$ introduces a characteristic bias in policy learning: rare but correct modes are often unsampled and thus ignored or even downweighted by the normalization.

The "tail-miss" probability $P_{\text{miss}}(N)$ quantifies the chance that, for a prompt $x$, an update occurs (the group contains both correct and incorrect trajectories) but none of the correct rollouts comes from the rare, desired subspace. For $N$ independent rollouts this is:

$$P_{\text{miss}}(N) = (1-\tau)^N - (1-p)^N - (p-\tau)^N$$

where $p$ is the current probability of generating a correct trajectory and $\tau \le p$ is the mass on "rare-correct" solutions. The first term is the probability that no rollout is rare-correct; the other two subtract the all-incorrect and all-correct groups, which produce no update. $P_{\text{miss}}(N)$ is non-monotonic in group size: it vanishes for very small $N$ (no updates) and for very large $N$ (coverage complete), but peaks at modest sizes where active updates bias learning toward common solutions while missing the rare ones.
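The non-monotonic behaviour can be checked numerically. The sketch below assumes independent rollouts, under which the event described above has probability $(1-\tau)^N - (1-p)^N - (p-\tau)^N$, with $p$ the probability of a correct rollout and $\tau$ the rare-correct mass. The function name and the values of `p` and `tau` are hypothetical, chosen only for illustration.

```python
def tail_miss_probability(n, p, tau):
    """P(group has both correct and incorrect rollouts, yet no rare-correct one)."""
    no_rare = (1.0 - tau) ** n           # no rollout lands in the rare-correct set
    all_incorrect = (1.0 - p) ** n       # every rollout fails: no update occurs
    all_common_correct = (p - tau) ** n  # every rollout succeeds: no update occurs
    return no_rare - all_incorrect - all_common_correct

# Hypothetical prompt: 50% success rate, 2% of mass on rare-correct solutions.
curve = {n: tail_miss_probability(n, p=0.5, tau=0.02) for n in (1, 2, 8, 64, 512)}
```

With these values the probability is zero at $N=1$, rises through the single-digit group sizes, and decays toward zero again only once $N$ is large enough that a rare-correct rollout is almost surely sampled.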

Another consequence is the shrinking of "unsampled-correct mass": the probability mass on correct solutions not appearing in any sampled trajectory. Even as total correct mass can grow, drift induced by the group baseline can systematically reduce unsampled-correct mass, impeding exploration of rare-but-desirable solutions (Plyusov et al., 6 Feb 2026).

3. F-GRPO: Focal Difficulty-Aware Scaling for Diversity Recovery

Motivated by the bias identified above, Focal Group Relative Policy Optimization (F-GRPO) introduces a simple, prompt-specific scaling coefficient inspired by the Focal loss for classification:

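Estimate the per-prompt empirical success rate:

$$\hat{p}(x) = \frac{k}{N}$$

where $k$ is the number of correct trajectories in the group.

Apply a focal scaling with exponent $\gamma \ge 0$:

$$g(x) = \bigl(1 - \hat{p}(x)\bigr)^{\gamma}$$

which downweights the gradient update for prompts with many successes ("easy" prompts), thus emphasizing harder cases and rare modes.

The group-relative advantage is rescaled:

$$\hat{A}_i^{\text{F}} = g(x)\,\hat{A}_i$$

and the surrogate loss becomes:

$$\mathcal{L}_{\text{F-GRPO}}(\theta) = -\,\mathbb{E}\left[\frac{1}{N}\sum_{i=1}^{N}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\min\Bigl(r_{i,t}(\theta)\,\hat{A}_i^{\text{F}},\ \operatorname{clip}\bigl(r_{i,t}(\theta),\,1-\epsilon_{\text{clip}},\,1+\epsilon_{\text{clip}}\bigr)\,\hat{A}_i^{\text{F}}\Bigr)\right]$$

where $\hat{A}_i$ is the group-normalized advantage of trajectory $o_i$ and $r_{i,t}(\theta)$ is the token-level likelihood ratio between the new and old policy.

As $\gamma \to 0$, this recovers vanilla GRPO; as $\gamma$ increases, "obvious" (high-success) prompts are increasingly suppressed, counteracting group-drift bias (Plyusov et al., 6 Feb 2026).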
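A short sketch of the focal rescaling follows. It assumes the coefficient takes the focal-loss form (1 - success rate) raised to the exponent gamma, consistent with the description above; the function names are hypothetical.

```python
def focal_weight(num_correct, group_size, gamma):
    """Focal coefficient (1 - success_rate) ** gamma for one prompt."""
    success_rate = num_correct / group_size
    return (1.0 - success_rate) ** gamma

def focal_scaled_advantages(advantages, num_correct, gamma):
    """Multiply every advantage in the group by the prompt's focal coefficient."""
    weight = focal_weight(num_correct, len(advantages), gamma)
    return [weight * a for a in advantages]

# An easy prompt (7 of 8 correct) is damped far more than a hard one (1 of 8 correct).
easy = focal_weight(7, 8, gamma=1.0)
hard = focal_weight(1, 8, gamma=1.0)
```

Because the coefficient is a single scalar per prompt, it changes the relative weight of prompts in the batch but not the ordering of trajectories within a group. Setting `gamma=0` gives a coefficient of 1 and leaves the advantages unchanged.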

4. Algorithmic Workflow and Pseudocode

F-GRPO differs from GRPO only in per-prompt computation of the empirical success rate and scaling of group advantages. The main steps per training iteration are as follows:

Draw a batch of prompts $\mathcal{B}$.
For each prompt $x \in \mathcal{B}$, sample $N$ rollouts $\{o_i\}_{i=1}^N$ under $\pi_{\theta_{\text{old}}}$.
Compute rewards $R_i$, and the empirical success rate $\hat{p}(x)$.
Set the focal scaling $g(x) = (1 - \hat{p}(x))^{\gamma}$.
Compute group mean $\bar R$, standard deviation $\sigma_R$, and (scaled) group-relative advantages for each trajectory.
Compute token-level surrogate losses (as in PPO), using the scaled advantage.
Aggregate gradients and perform parameter update.

The source paper gives a pseudocode listing of this procedure (Plyusov et al., 6 Feb 2026).
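The per-prompt steps can be sketched in plain Python as follows. This is not the authors' pseudocode: the function names, argument layout, default hyperparameters (`gamma`, `clip_eps`, `eps`), the focal form (1 - success rate) ** gamma, and the per-trajectory token averaging are all assumptions made for illustration. Token log-probabilities are passed in as plain lists so the loss logic can be read without a deep learning framework.

```python
import math

def clipped_token_loss(logp_new, logp_old, advantage, clip_eps=0.2):
    """Negative PPO-style clipped surrogate for a single token."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return -min(ratio * advantage, clipped * advantage)

def f_grpo_group_loss(correct, logp_new, logp_old, gamma=1.0,
                      r_correct=1.0, r_wrong=0.0, eps=1e-6, clip_eps=0.2):
    """Loss for one prompt's group of rollouts.

    correct:  list of bools, one per rollout
    logp_new: per-rollout lists of token log-probs under the policy being updated
    logp_old: same shape, under the policy that generated the rollouts
    """
    n = len(correct)
    rewards = [r_correct if c else r_wrong for c in correct]
    # Empirical success rate and focal coefficient for this prompt.
    success_rate = sum(correct) / n
    weight = (1.0 - success_rate) ** gamma
    # Group statistics (population standard deviation assumed).
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    total = 0.0
    for reward, new, old in zip(rewards, logp_new, logp_old):
        advantage = weight * (reward - mean) / (std + eps)
        token_losses = [clipped_token_loss(a, b, advantage, clip_eps)
                        for a, b in zip(new, old)]
        total += sum(token_losses) / len(token_losses)  # average over tokens
    return total / n  # average over rollouts in the group
```

In a training loop this value would be averaged over the prompts in the batch and differentiated with respect to the policy parameters. With `gamma=0.0` the focal coefficient is 1 and the function reduces to the plain group-normalized clipped loss.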

5. Empirical Performance and Group Size Trade-Offs

Applied to the Qwen2.5-7B LLM at a fixed group size $N$ and focal exponent $\gamma$, F-GRPO markedly improves pass@256 and modestly improves pass@1 on mathematical reasoning tasks, both in-domain and out-of-domain:

Baseline GRPO: pass@256 = 64.1, pass@1 ≈ 37.3
F-GRPO (same $N$): pass@256 = 70.3, pass@1 = 38.6

F-GRPO at this group size matches or slightly exceeds GRPO run with a larger group (which achieves pass@256 ≈ 70.1), reducing rollout cost for similar diversity and success metrics. On out-of-domain tasks, F-GRPO improves pass@256 from 55.9 to 63.3. The method similarly benefits the DAPO and CISPO policy optimization variants.

Group size effects:

Small $N$: most groups do not contain both a success and a failure and therefore yield no update, which preserves diversity but lowers pass@1.
Intermediate $N$: updates boost pass@1 but at the expense of pass@256 (diversity).
Large $N$: larger groups recover diversity, but at higher compute cost.
F-GRPO at a smaller group size achieves the diversity and success of larger groups at no extra rollout cost (Plyusov et al., 6 Feb 2026).
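The small-group regime can be made concrete with a short calculation. Assuming independent rollouts with per-rollout success probability `p`, a group produces an update only if it contains at least one success and one failure. The function name and the 10% success rate below are hypothetical.

```python
def mixed_group_probability(n, p):
    """P(a group of n i.i.d. rollouts holds at least one success and one failure)."""
    return 1.0 - p ** n - (1.0 - p) ** n

# Hypothetical hard prompt with a 10% success rate.
for n in (2, 8, 64):
    print(n, round(mixed_group_probability(n, p=0.1), 3))
```

For a hard prompt, very small groups are mostly all-failure and give no learning signal, while larger groups almost always contain a mix, which is where the choice of update weighting starts to matter.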

6. Broader Significance and Applicability

Theoretical and empirical results show that GRPO's group normalization, while robust and scalable, introduces a non-monotonic, group-size-dependent bias toward common solutions, especially when rare-correct modes are under-sampled. F-GRPO supplies a minimal, focal-inspired variant that actively corrects this effect by adaptively weighting updates according to observed group success rates. This technique is agnostic to the underlying group-relative RL algorithm and can be directly applied to DAPO, CISPO, and other group-normalized RLVR methods.

By stabilizing and diversifying policy updates without incurring additional sampling or computational cost, F-GRPO enables practical deployment of group-relative RL schemes in domains where rare modes are critical (reasoning, code generation, safety-sensitive settings), and large group sizes are not computationally viable (Plyusov et al., 6 Feb 2026).

7. Limitations and Directions for Future Research

F-GRPO's focal scaling necessarily depends upon accurate online estimation of group success rates, which may be impacted by reward sparsity or early training dynamics. Although the method alleviates group-drift and loss of rare-correct mass, further advances may require integrating explicit rare-mode tracking, enhanced rollout-generation strategies, or importance correction for under-sampled trajectories.

Possible research directions include analytical study of scaling behavior across RLVR tasks, adaptive scheduling of the focal exponent $\gamma$ based on training signals, and application to settings with reward corruption or significant noise.

References:

Plyusov et al. (6 Feb 2026). F-GRPO: Don't Let Your Policy Learn the Obvious and Forget the Rare.
