Group Relative Policy Optimization (GRPO)
- GRPO is a reinforcement learning method that normalizes advantages within groups of rollouts sampled for the same prompt, which stabilizes training without a learned value network.
- The paper analyzes how group-wise advantage normalization reduces variance but, at finite group sizes, introduces a bias, which it quantifies through a tail-miss probability.
- The paper also presents F-GRPO, a focal-weighted variant that downweights high-success prompts to recover diversity and improve performance on in-domain and out-of-domain tasks.
Group Relative Policy Optimization (GRPO) is a reinforcement learning (RL) algorithm for critic-free, group-based policy optimization, originally introduced to advance reasoning capabilities in LLMs trained with verifiable or binary rewards. The central idea is to normalize advantages within a group of parallel rollouts for each prompt, stabilizing updates without requiring a learned value function. GRPO forms a flexible foundation for subsequent algorithmic variants, including Focal GRPO (F-GRPO), and for extensions to other domains such as speech recognition, multi-agent control, and neural combinatorial optimization, owing to its critic-free, variance-reducing design (Plyusov et al., 6 Feb 2026).
1. Core GRPO Algorithm: Objective and Advantage Normalization
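GRPO operates by generating, for each input $x$, a group of $N$ trajectories $o_1, \dots, o_N$ using the current or previous policy $\pi_{\theta_{\text{old}}}$. Each trajectory is assigned a verifiable (often binary) reward $R_i \in \{R_w, R_c\}$ with $R_c > R_w$ (e.g., $\{0, 1\}$ or $\{-1, +1\}$). For this group, the mean $\mu$ and standard deviation $\sigma$ of the rewards are computed:
$$\mu = \frac{1}{N}\sum_{i=1}^{N} R_i, \qquad \sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (R_i - \mu)^2}.$$
The group-relative advantage for each trajectory is defined as:
$$A_i = \frac{R_i - \mu}{\sigma + \epsilon},$$
where $\epsilon > 0$ ensures numerical stability. The token-level policy update uses PPO-style likelihood ratios $r_{i,t}(\theta) = \pi_\theta(o_{i,t} \mid x, o_{i,<t}) / \pi_{\theta_{\text{old}}}(o_{i,t} \mid x, o_{i,<t})$ (between new and old policy at each token), with a clipped surrogate objective of the form:
$$J(\theta) = \mathbb{E}\left[\frac{1}{N}\sum_{i=1}^{N}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\min\Bigl(r_{i,t}(\theta)\,A_i,\ \operatorname{clip}\bigl(r_{i,t}(\theta),\,1-\varepsilon_{\text{clip}},\,1+\varepsilon_{\text{clip}}\bigr)\,A_i\Bigr)\right].$$
Here, all per-prompt statistics are computed within each group, which gives scale-invariance and variance suppression without a value network or learned baseline (Plyusov et al., 6 Feb 2026).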
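The normalization and clipping steps described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's implementation: the function names `group_advantages` and `clipped_term`, the default `eps` and `clip_eps` values, and the use of the population (rather than sample) standard deviation are all assumptions made for the example.

```python
import math

def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group: (R_i - mean) / (std + eps)."""
    n = len(rewards)
    mean = sum(rewards) / n
    # Population standard deviation over the group (an assumption; some
    # implementations use the sample standard deviation instead).
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + eps) for r in rewards]

def clipped_term(ratio, advantage, clip_eps=0.2):
    """PPO-style clipped surrogate for a single token."""
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

# A mixed group yields signed advantages; a uniform group yields all zeros,
# so it contributes no gradient.
advantages = group_advantages([1.0, 0.0, 0.0, 1.0])
uniform = group_advantages([1.0, 1.0, 1.0, 1.0])
```

Note that a group whose rollouts all receive the same reward has zero advantage everywhere, which is why only mixed groups drive updates.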
2. Theoretical Analysis: Finite-Group Bias and Tail-Miss Probability
Although large group sizes $N$ approximate population statistics and minimize bias, they are often computationally impractical. A finite $N$ introduces a characteristic bias in policy learning: rare but correct modes often go unsampled and are thus ignored, or even downweighted, by the normalization.
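The "tail-miss" probability $P_{\text{miss}}(x)$ quantifies the chance that, for a prompt $x$, an update occurs (the group contains both correct and incorrect trajectories) but none of the correct rollouts comes from the rare, desired subspace. For $N$ independent rollouts this is expressed as:
$$P_{\text{miss}}(x) = (1-\tau)^N - (1-p)^N - (p-\tau)^N,$$
where $p$ is the current probability of generating a correct trajectory and $\tau \le p$ is the mass on "rare-correct" solutions. The first term is the probability that no rollout is rare-correct; the two subtracted terms remove the all-incorrect and all-correct groups, which produce no update. $P_{\text{miss}}$ is non-monotonic in group size: it vanishes for very small $N$ (no updates) and for very large $N$ (coverage complete), but peaks at moderate sizes, where active updates bias learning toward common solutions while missing the rare ones.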
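The non-monotonic behavior can be checked numerically. The sketch below assumes independent rollouts and derives the tail-miss probability directly from its verbal definition: the probability that no rollout is rare-correct, minus the all-incorrect and all-correct cases in which no update happens. The function name `tail_miss_probability` and the example values (0.5 for the correct probability, 0.05 for the rare-correct mass) are illustrative assumptions, not figures from the paper.

```python
def tail_miss_probability(p_correct, tau_rare, n):
    """P(group is mixed AND contains no rare-correct rollout) for n i.i.d. rollouts."""
    no_rare = (1.0 - tau_rare) ** n                   # no rollout is rare-correct
    all_incorrect = (1.0 - p_correct) ** n            # no update: every rollout fails
    all_common_correct = (p_correct - tau_rare) ** n  # no update: every rollout succeeds
    return no_rare - all_incorrect - all_common_correct

# Sweep group sizes for an illustrative prompt.
curve = {n: tail_miss_probability(0.5, 0.05, n) for n in (1, 2, 4, 8, 16, 64, 256)}
peak_n = max(curve, key=curve.get)
```

For these illustrative values the probability is zero at a group size of 1, largest at a small-to-moderate group size, and negligible again once the group is large enough to almost surely include a rare-correct rollout.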
Another consequence is the shrinking of "unsampled-correct mass": the probability mass on correct solutions not appearing in any sampled trajectory. Even as total correct mass can grow, drift induced by the group baseline can systematically reduce unsampled-correct mass, impeding exploration of rare-but-desirable solutions (Plyusov et al., 6 Feb 2026).
3. F-GRPO: Focal Difficulty-Aware Scaling for Diversity Recovery
Motivated by the bias identified above, Focal Group Relative Policy Optimization (F-GRPO) introduces a simple, prompt-specific scaling coefficient inspired by the Focal loss for classification:
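- Estimate the per-prompt empirical success rate:
$$\hat{p}(x) = \frac{k}{N},$$
where $k$ is the number of correct trajectories in the group of $N$ rollouts.
- Apply a focal scaling with exponent $\gamma \ge 0$:
$$w(x) = \bigl(1 - \hat{p}(x)\bigr)^{\gamma},$$
which downweights the gradient update for prompts with many successes ("easy" prompts), thus emphasizing harder cases and rare modes.
- The group-relative advantage $A_i$ is rescaled:
$$\tilde{A}_i = w(x)\,A_i,$$
and the surrogate loss becomes:
$$J_{\text{F}}(\theta) = \mathbb{E}\left[\frac{1}{N}\sum_{i=1}^{N}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\min\Bigl(r_{i,t}(\theta)\,\tilde{A}_i,\ \operatorname{clip}\bigl(r_{i,t}(\theta),\,1-\varepsilon_{\text{clip}},\,1+\varepsilon_{\text{clip}}\bigr)\,\tilde{A}_i\Bigr)\right].$$
As $\gamma \to 0$, this recovers vanilla GRPO; as $\gamma$ increases, "obvious" (high-success) prompts are suppressed, counteracting group-drift bias (Plyusov et al., 6 Feb 2026).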
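The focal scaling can be illustrated with a short sketch. It assumes the weight takes the focal-loss-style form of one minus the empirical success rate, raised to the focal exponent, which matches the limiting behavior described above. The names `focal_weight` and `focal_scale`, and the convention that a reward of 1.0 marks a correct rollout, are assumptions made for the example.

```python
def focal_weight(rewards, gamma, correct_reward=1.0):
    """Per-prompt weight (1 - success_rate) ** gamma from one group's rewards."""
    n = len(rewards)
    k = sum(1 for r in rewards if r == correct_reward)  # number of correct rollouts
    success_rate = k / n
    return (1.0 - success_rate) ** gamma

def focal_scale(advantages, rewards, gamma):
    """Multiply every advantage in the group by the prompt's focal weight."""
    w = focal_weight(rewards, gamma)
    return [w * a for a in advantages]

easy = focal_weight([1.0, 1.0, 1.0, 0.0], gamma=1.0)   # 3 of 4 correct
hard = focal_weight([1.0, 0.0, 0.0, 0.0], gamma=1.0)   # 1 of 4 correct
plain = focal_weight([1.0, 1.0, 1.0, 0.0], gamma=0.0)  # exponent 0: no reweighting
scaled = focal_scale([1.0, -1.0], [1.0, 0.0], gamma=2.0)
```

Because the weight is shared by every rollout of a prompt, it changes how much each prompt contributes to the update without changing the sign or relative ordering of advantages inside a group.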
4. Algorithmic Workflow and Pseudocode
F-GRPO differs from GRPO only in per-prompt computation of the empirical success rate and scaling of group advantages. The main steps per training iteration are as follows:
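- Draw a batch of prompts $\{x_b\}_{b=1}^{B}$.
- For each prompt $x$, sample $N$ rollouts $o_1, \dots, o_N$ under $\pi_{\theta_{\text{old}}}$.
- Compute rewards $R_1, \dots, R_N$, and the empirical success rate $\hat{p}(x)$.
- Set the focal scaling $w(x) = (1 - \hat{p}(x))^{\gamma}$.
- Compute group mean $\mu$, standard deviation $\sigma$, and (scaled) group-relative advantages for each trajectory.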
- Compute token-level surrogate losses (as in PPO), using the scaled advantage.
- Aggregate gradients and perform parameter update.
The pseudocode given in the paper follows these same steps (Plyusov et al., 6 Feb 2026).
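The steps listed above can be sketched as a single per-prompt loss function. This is a hedged illustration in plain Python rather than the paper's pseudocode: it assumes binary 0/1 rewards (so the group mean equals the empirical success rate), averages the clipped terms over all tokens in the group, and uses the hypothetical name `fgrpo_prompt_loss` with illustrative default hyperparameters. A real training loop would compute this on tensors with automatic differentiation.

```python
import math

def fgrpo_prompt_loss(rewards, new_logps, old_logps, gamma=1.0,
                      clip_eps=0.2, eps=1e-6):
    """Negative clipped surrogate for one prompt's group of rollouts.

    rewards:   N binary rewards (1.0 = correct, 0.0 = incorrect)
    new_logps: N lists of per-token log-probs under the current policy
    old_logps: same shape, under the policy that sampled the rollouts
    """
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    # With 0/1 rewards the group mean is the empirical success rate.
    weight = (1.0 - mean) ** gamma
    total, count = 0.0, 0
    for r, new_lp, old_lp in zip(rewards, new_logps, old_logps):
        adv = weight * (r - mean) / (std + eps)  # focal-scaled advantage
        for lp_new, lp_old in zip(new_lp, old_lp):
            ratio = math.exp(lp_new - lp_old)    # token-level likelihood ratio
            clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
            total += min(ratio * adv, clipped * adv)
            count += 1
    return -total / count  # minimize the negative surrogate

# On-policy example: identical log-probs give a ratio of 1 for every token.
logps = [[-1.0], [-1.0, -2.0, -0.5]]
loss_focal = fgrpo_prompt_loss([1.0, 0.0], logps, logps, gamma=1.0)
loss_plain = fgrpo_prompt_loss([1.0, 0.0], logps, logps, gamma=0.0)
loss_solved = fgrpo_prompt_loss([1.0, 1.0], logps, logps, gamma=1.0)
```

Setting `gamma=0.0` gives the unscaled group-relative loss, so the two variants can share one code path and differ only in a hyperparameter.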
5. Empirical Performance and Group Size Trade-Offs
Applied to the Qwen2.5-7B LLM at a fixed group size $N$ and focal exponent $\gamma$, F-GRPO improves pass@256 substantially and pass@1 modestly on in-domain mathematical reasoning tasks:
- Baseline GRPO: pass@256 = 64.1, pass@1 ≈ 37.3
- F-GRPO (same $N$): pass@256 = 70.3, pass@1 = 38.6
F-GRPO at this group size matches or slightly exceeds GRPO run with a larger group (which achieves pass@256 ≈ 70.1), giving similar diversity and success metrics at a correspondingly lower rollout cost. On out-of-domain tasks, F-GRPO improves pass@256 from 55.9 to 63.3. The method similarly benefits the DAPO and CISPO policy optimization variants.
Group size effects:
- Very small $N$: most groups do not contain both a success and a failure, so few updates occur; this favors diversity but reduces pass@1.
- Moderate $N$: updates boost pass@1 but at the expense of pass@256 (diversity).
- Large $N$: larger groups recover diversity, but at higher compute cost.
- F-GRPO at moderate $N$ achieves the diversity and success of larger groups at no extra cost (Plyusov et al., 6 Feb 2026).
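The first effect in the list above comes from how often a group is "mixed" at all. Assuming independent rollouts with a fixed per-rollout success probability, the chance that a group contains both a success and a failure is one minus the probabilities of the all-correct and all-incorrect groups. The function name `update_probability` and the sample values are illustrative assumptions.

```python
def update_probability(p_correct, n):
    """P(a group of n i.i.d. rollouts has at least one success and one failure)."""
    return 1.0 - p_correct ** n - (1.0 - p_correct) ** n

# For an easy prompt (90% success), small groups rarely produce an update.
by_size = [update_probability(0.9, n) for n in (1, 2, 4, 8, 16)]
```

The probability is exactly zero for a single rollout and grows toward one as the group size increases, which is why very small groups leave most prompts without a learning signal.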
6. Broader Significance and Applicability
Theoretical and empirical results show that GRPO's group normalization, while robust and scalable, introduces a non-monotonic, group-size-dependent bias toward common solutions, especially when rare-correct modes are under-sampled. F-GRPO supplies a minimal, focal-inspired variant that actively corrects this effect by adaptively weighting updates according to observed group success rates. This technique is agnostic to the underlying group-relative RL algorithm and can be directly applied to DAPO, CISPO, and other group-normalized RLVR methods.
By stabilizing and diversifying policy updates without incurring additional sampling or computational cost, F-GRPO enables practical deployment of group-relative RL schemes in domains where rare modes are critical (reasoning, code generation, safety-sensitive settings), and large group sizes are not computationally viable (Plyusov et al., 6 Feb 2026).
7. Limitations and Directions for Future Research
F-GRPO's focal scaling necessarily depends upon accurate online estimation of group success rates, which may be impacted by reward sparsity or early training dynamics. Although the method alleviates group-drift and loss of rare-correct mass, further advances may require integrating explicit rare-mode tracking, enhanced rollout-generation strategies, or importance correction for under-sampled trajectories.
Possible research directions include analytical study of scaling behavior across RLVR tasks, adaptive scheduling of the focal exponent $\gamma$ based on training signals, and application to settings with reward corruption or significant noise.
References:
- F-GRPO and all above results: (Plyusov et al., 6 Feb 2026)