RefGRPO: Enhanced Policy Optimization

Updated 1 July 2026

RefGRPO is a refined policy optimization framework that integrates rigorous normalization, closed-form recursions, calibration bonuses, and diversity-preserving sampling for robust reinforcement learning.
It addresses challenges in multi-objective constrained optimization by reversing scalarization order and enhancing both constraint satisfaction and success probability.
The approach leverages a U-statistic foundation and dynamic oscillatory modeling to ensure theoretical robustness, finite-sample guarantees, and universal scaling across diverse RL domains.

RefGRPO

RefGRPO, or "Reference Group Relative Policy Optimization," refers to a class of methodological refinements and reinterpretations within the Group Relative Policy Optimization (GRPO) framework. Unlike classic baseline-corrected policy gradients, RefGRPO emphasizes, across its variants, rigorous normalization, closed-form policy characterization, and principled handling of constraint and calibration signals. In modern RL regimes for large models (e.g., LLMs, generative diffusion models, robotics), RefGRPO contributes theoretical and algorithmic advances: preserving the intended semantics of Lagrangian weights, calibrating agent confidence, amplifying binary-verifiable success, and achieving robust empirical convergence. The following sections cover the definitions, algorithms, mathematical pathologies, and empirical impacts of RefGRPO across recent literature.

1. Constrained GRPO and Scalarized-Advantage Pathology

RefGRPO was introduced as a solution to the pathology arising from naïve multi-objective constrained policy optimization within GRPO (Girgis et al., 5 Feb 2026). The classical problem formulation considers a CMDP with objective:

$\max_\theta J_r(\theta)$ subject to $J_c(\theta) \le d$ where
$J_r(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \sum_t R(s_t,a_t)$ : expected return
$J_c(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \sum_t C(s_t,a_t)$ : expected constraint cost (indicator for violation).

A standard Lagrangian relaxation,

$L(\theta,\lambda) = J_r(\theta) - \lambda[J_c(\theta) - d],$

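motivates the primal-dual update: $\begin{aligned} \theta &\leftarrow \theta + \alpha_\theta[\nabla_\theta J_r(\theta) - \lambda\,\nabla_\theta J_c(\theta)] \\ \lambda &\leftarrow \max(0,\lambda + \alpha_\lambda[J_c(\theta) - d]) \end{aligned}$ so that $\lambda$ grows while the constraint is violated ($J_c(\theta) > d$) and decays toward zero once it is satisfied.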

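When using group normalization in GRPO, naïve scalarization (taking the scalarized return $R^s := J_r^{(traj)} - \lambda J_c^{(traj)}$ and then z-normalizing within the group) results in a data-dependent rescaling of the Lagrange weights: $A = \sum_{j \in \{r,c\}} \frac{\lambda_j \sigma_j}{\sigma_{\lambda^T x}} Z_j$, where $Z_j$ is the standardized component $j$, $\sigma_j$ is its within-group standard deviation, $\sigma_{\lambda^T x}$ is the within-group standard deviation of the scalarized return, and $\lambda_r = 1$, $\lambda_c = -\lambda$. Each objective's effective weight is therefore multiplied by $\sigma_j / \sigma_{\lambda^T x}$, a factor that changes from group to group. This corrupts the originally intended trade-off, so the multiplier $\lambda$ no longer enforces the constraint as designed.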

RefGRPO resolves this by reversing the scalarization and standardization sequence:

Standardize each objective within the group to obtain $Z_r$ and $Z_c$.
Form the scalarized advantage: $A = \sum_{j \in \{r,c\}} \lambda_j Z_j = Z_r - \lambda Z_c$.
Use $A$ for the policy update.

This ordering preserves the proportional influence of each objective term. Empirically, in gridworld and NAVSIM-v2, RefGRPO is reported to achieve both effective constraint satisfaction and higher task success, measured by EPDMS, than constrained GRPO and naïve baselines (Girgis et al., 5 Feb 2026).
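The effect of the two orderings can be checked on a toy group. The following is a minimal sketch, not the authors' implementation: the function names, the population standard deviation, the `eps` stabilizer, and the toy numbers are all assumptions made for illustration.

```python
import statistics


def zscore(xs, eps=1e-8):
    """Standardize a list within the group (population std, assumed)."""
    mu = statistics.fmean(xs)
    sd = statistics.pstdev(xs)
    return [(x - mu) / (sd + eps) for x in xs]


def naive_advantage(returns, costs, lam):
    """Scalarize first, then z-normalize (the problematic order)."""
    scalar = [r - lam * c for r, c in zip(returns, costs)]
    return zscore(scalar)


def decoupled_advantage(returns, costs, lam):
    """Z-normalize each objective first, then scalarize with lam."""
    zr, zc = zscore(returns), zscore(costs)
    return [a - lam * b for a, b in zip(zr, zc)]


# Toy group: returns have a large spread, costs are binary violation flags.
returns = [10.0, 12.0, 30.0, 28.0]
costs = [0.0, 1.0, 1.0, 0.0]
lam = 2.0

naive = naive_advantage(returns, costs, lam)
decoupled = decoupled_advantage(returns, costs, lam)

# In the naive order the cost term's weight relative to the return term
# is lam * sigma_c / sigma_r rather than lam.
effective_weight = lam * statistics.pstdev(costs) / statistics.pstdev(returns)

print("naive:    ", [round(a, 3) for a in naive])
print("decoupled:", [round(a, 3) for a in decoupled])
print("effective cost weight under naive ordering:", round(effective_weight, 3))
```

In this toy group the third trajectory violates the constraint yet receives a positive advantage under the naive ordering, because the large spread of returns shrinks the effective cost weight well below `lam`. With per-objective standardization the same trajectory receives a negative advantage.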

2. Closed-Form Policy Recursions and Binary Amplification

For binary, verifiable rewards, RefGRPO reduces to a KL-regularized contrastive loss over synthetic rollouts from the previous policy (Mroueh, 9 Mar 2025). The loss for each prompt $q$, written as an objective to maximize over $\pi$, is: $\mathcal{L}_q(\pi) = \omega_\varepsilon^{+}(p)\,\mathbb{E}_{o \sim \pi_{n-1}(\cdot|q)}\left[\frac{\pi(o|q)}{\pi_{n-1}(o|q)}\,\mathbb{1}\{r(q,o)=1\}\right] - \omega_\varepsilon^{-}(p)\,\mathbb{E}_{o \sim \pi_{n-1}(\cdot|q)}\left[\frac{\pi(o|q)}{\pi_{n-1}(o|q)}\,\mathbb{1}\{r(q,o)=0\}\right] - \beta\,\mathrm{KL}(\pi \,\|\, \pi_{\mathrm{ref}})$ where $\omega_\varepsilon^{+}(p) = \frac{1-p}{\sqrt{p(1-p)+\varepsilon}}$, $\omega_\varepsilon^{-}(p) = \frac{p}{\sqrt{p(1-p)+\varepsilon}}$, and $p = p_{n-1}(q)$ is the empirical success rate under $\pi_{n-1}$.

The resulting closed-form policy update for $n \ge 1$ is: $\pi_n(o|q) = \frac{1}{Z_{n-1}(q)}\,\pi_{\mathrm{ref}}(o|q)\,\exp\left(\frac{1}{\beta}\left[\omega_\varepsilon^{+}(p_{n-1})\,\mathbb{1}\{r(q,o)=1\} - \omega_\varepsilon^{-}(p_{n-1})\,\mathbb{1}\{r(q,o)=0\}\right]\right)$, where $Z_{n-1}(q)$ is the normalizing constant.

This policy induces a one-dimensional fixed-point recursion $p_n = h(p_{n-1})$ for the new success probability, and it is provable that the fixed point satisfies $p^* > p_{\mathrm{ref}}$, where $p_{\mathrm{ref}}$ is the success probability of the reference policy. RefGRPO thus amplifies success probability over the initial model (Mroueh, 9 Mar 2025).
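The success-probability recursion can be iterated numerically. The sketch below assumes variance-normalized weights $\omega_\varepsilon^{+}(p) = (1-p)/\sqrt{p(1-p)+\varepsilon}$ and $\omega_\varepsilon^{-}(p) = p/\sqrt{p(1-p)+\varepsilon}$ and a policy of the form $\pi_n \propto \pi_{\mathrm{ref}}\exp(\cdot/\beta)$. Under those assumptions the success probability obeys $p_n = \left[1 + \frac{1-p_{\mathrm{ref}}}{p_{\mathrm{ref}}}\exp\left(-\frac{1}{\beta\sqrt{p_{n-1}(1-p_{n-1})+\varepsilon}}\right)\right]^{-1}$. The function names and the values of `p_ref`, `beta`, and `eps` are illustrative choices, not values from the cited work.

```python
import math


def omega_sum(p, eps):
    """omega_plus(p) + omega_minus(p) = 1 / sqrt(p * (1 - p) + eps)."""
    return 1.0 / math.sqrt(p * (1.0 - p) + eps)


def next_success_prob(p_prev, p_ref, beta, eps):
    """One step of the assumed recursion p_n = h(p_{n-1})."""
    odds_against_ref = (1.0 - p_ref) / p_ref
    shrink = math.exp(-omega_sum(p_prev, eps) / beta)
    return 1.0 / (1.0 + odds_against_ref * shrink)


def iterate(p_ref, beta=1.0, eps=1e-2, steps=200):
    """Iterate the recursion starting from the reference success rate."""
    p = p_ref
    trajectory = [p]
    for _ in range(steps):
        p = next_success_prob(p, p_ref, beta, eps)
        trajectory.append(p)
    return trajectory


trajectory = iterate(p_ref=0.2)
print("p_ref =", trajectory[0])
print("approximate fixed point =", round(trajectory[-1], 4))
```

Because the exponential factor is always below one, every iterate after the first exceeds `p_ref`, which is the amplification property in this toy setting. Convergence of the plain iteration is observed for these illustrative values and is not claimed in general.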

3. Calibration Bonus and the Reflection Gap

RefGRPO for self-assessment calibration addresses persistent misalignment between agent self-confidence (the reflection signal) and actual outcomes, particularly in LLM RL settings (Zhu, 12 Jun 2026). The core innovation is a "free" calibration bonus that is added to the reward during RL optimization and weighted by a scheduled coefficient. The final advantage is group-wise normalized, and the standard PPO/GRPO objective is optimized.

Empirical results report marked reductions in underconfidence rate at 7B scale, improved task accuracy, and a higher Chow metric (Zhu, 12 Jun 2026). Calibrated reflection further enables self-improvement via pseudo-rewards and superior selective prediction at test time.
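The reward-augmentation pipeline (bonus, scheduled coefficient, group-wise normalization) can be sketched as follows. The exact bonus used by Zhu is not reproduced here. As a stand-in, the sketch uses a Brier-style term, a linear warm-up schedule, and hypothetical function names, all of which are assumptions for illustration only.

```python
import statistics


def scheduled_coeff(step, total_steps, max_coeff=0.5):
    """Hypothetical linear warm-up for the bonus coefficient."""
    return max_coeff * min(1.0, step / max(1, total_steps))


def brier_bonus(confidence, outcome):
    """Stand-in calibration bonus: 1 - (confidence - outcome)^2."""
    return 1.0 - (confidence - outcome) ** 2


def augmented_advantages(outcomes, confidences, step, total_steps, eps=1e-8):
    """Add the weighted bonus to the task reward, then z-normalize in-group."""
    coeff = scheduled_coeff(step, total_steps)
    rewards = [
        y + coeff * brier_bonus(c, y) for y, c in zip(outcomes, confidences)
    ]
    mu = statistics.fmean(rewards)
    sd = statistics.pstdev(rewards)
    return [(r - mu) / (sd + eps) for r in rewards]


# Two correct and two incorrect rollouts, each with a self-reported confidence.
outcomes = [1.0, 1.0, 0.0, 0.0]
confidences = [0.9, 0.2, 0.1, 0.8]
advantages = augmented_advantages(outcomes, confidences, step=100, total_steps=100)
print([round(a, 3) for a in advantages])
```

With this stand-in bonus, a correct but underconfident rollout (confidence 0.2) receives a lower advantage than a correct and confident one, and an overconfident failure is penalized relative to a well-calibrated failure.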

4. Expand-and-Prune: Diversity-Preserving Sampling

RefGRPO in generative settings (notably diffusion and flow-based models) addresses "reward clustering," where, as the group size grows, most samples collapse toward the mean and provide near-zero policy-gradient signal (Ge et al., 17 Dec 2025). The Optimal Variance Filtering (OVF) heuristic selects the fixed-size subset of samples with maximal reward variance, but static, post-sampling OVF remains computationally expensive. Pro-GRPO (Proactive GRPO), a direct instantiation of RefGRPO, implements an "Expand-and-Prune" paradigm: expand to a larger pool of latent samples, iteratively prune low-diversity trajectories via multi-step lookahead and OVF criteria, and fully denoise only the final survivors.
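The static selection step can be illustrated with a brute-force search over subsets. This is a sketch of the stated criterion only (keep the subset whose rewards have the largest variance), not the Pro-GRPO pruning procedure. The function name, the subset size `k`, and the toy rewards are assumptions.

```python
from itertools import combinations
from statistics import pvariance


def max_variance_subset(rewards, k):
    """Return indices of the size-k subset with the largest reward variance."""
    if not 2 <= k <= len(rewards):
        raise ValueError("k must satisfy 2 <= k <= len(rewards)")
    best = max(
        combinations(range(len(rewards)), k),
        key=lambda idx: pvariance([rewards[i] for i in idx]),
    )
    return list(best)


# Most rewards cluster near 0.5; two samples are informative outliers.
rewards = [0.50, 0.52, 0.49, 0.51, 0.90, 0.10, 0.50, 0.48]
kept = max_variance_subset(rewards, k=4)
print("kept indices:", kept)
print("kept rewards:", [rewards[i] for i in kept])
```

The exhaustive search visits every size-`k` subset, so its cost grows combinatorially with the pool size. That makes it useful only as a reference for small groups.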

Empirical results show up to 41% compute savings with improved or matched downstream performance on PickScore, ImageReward, HPSv2 (flow/diffusion models), and consistent compositional benefits in GenEval (Ge et al., 17 Dec 2025).

5. Mathematical Structure: U-Statistic Foundation and Universal Scaling

The RefGRPO gradient estimator is formally a second-order U-statistic (Zhou et al., 1 Mar 2026). This structure enables precise mean squared error decompositions and a finite-sample bound on the suboptimality gap.

A universal scaling law for the optimal group size emerges, governed by two geometric problem constants. This scaling is empirically robust across models and datasets (Zhou et al., 1 Mar 2026). In the large-group limit, RefGRPO matches the oracle policy-gradient variance and suboptimality asymptotics.
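The U-statistic structure can be verified numerically for a simplified estimator. The sketch assumes a leave-one-out mean baseline without standard-deviation scaling, and it uses scalar stand-ins `scores` for the per-sample score-function gradients. Under those assumptions, $\frac{1}{G}\sum_i \big(r_i - \frac{1}{G-1}\sum_{j \ne i} r_j\big) g_i = \binom{G}{2}^{-1}\sum_{i<j} \tfrac{1}{2}(r_i - r_j)(g_i - g_j)$, where $G$ is the number of samples in the group. The function names and toy values are illustrative.

```python
from itertools import combinations


def loo_estimator(rewards, scores):
    """Average of (r_i - leave-one-out mean of the other rewards) * g_i."""
    n = len(rewards)
    total = sum(rewards)
    acc = 0.0
    for r, g in zip(rewards, scores):
        baseline = (total - r) / (n - 1)
        acc += (r - baseline) * g
    return acc / n


def u_statistic(rewards, scores):
    """Second-order U-statistic with kernel 0.5 * (r_i - r_j) * (g_i - g_j)."""
    pairs = list(combinations(range(len(rewards)), 2))
    acc = sum(
        0.5 * (rewards[i] - rewards[j]) * (scores[i] - scores[j])
        for i, j in pairs
    )
    return acc / len(pairs)


rewards = [1.0, 0.0, 0.5, 2.0]
scores = [0.3, -1.2, 0.7, 0.1]
print(loo_estimator(rewards, scores), u_statistic(rewards, scores))
```

The two functions agree up to floating-point error, since the leave-one-out form is an average of a symmetric kernel over all unordered pairs.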

6. Predictable Training Dynamics and Hyperparameter Regimes

RefGRPO training dynamics can be reduced to a stochastically driven, damped oscillator for the expected reward (Ghosh et al., 29 Jun 2026). The model has a “mass” term (momentum/off-policy lag), a “damping” term, and a “stiffness” term, and only the noise term scales with the group size.

Key predictions include:

Deterministic reward trajectory is group-size invariant; only stationary fluctuations shrink as the group size grows.
There exists a stability threshold on the refresh interval.
Overdamped–underdamped transitions determine the onset of oscillatory reward.
Diagnostics based on reward, advantage std collapse, entropy, and KL divergence distinguish failures: reward hacking, advantage degeneracy, policy concentration, instability.
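The overdamped and underdamped regimes in the list above follow from the standard damping ratio of a second-order linear system. The sketch below assumes the generic form $m\ddot{x} + c\,\dot{x} + kx = \text{noise}$. The parameter names `mass`, `damping`, and `stiffness` are placeholders, and how they map to training hyperparameters is not specified here.

```python
import math


def damping_ratio(mass, damping, stiffness):
    """zeta = c / (2 * sqrt(m * k)) for m x'' + c x' + k x = noise."""
    if mass <= 0 or stiffness <= 0:
        raise ValueError("mass and stiffness must be positive")
    return damping / (2.0 * math.sqrt(mass * stiffness))


def regime(mass, damping, stiffness, tol=1e-9):
    """Classify the noise-free dynamics by the damping ratio."""
    zeta = damping_ratio(mass, damping, stiffness)
    if abs(zeta - 1.0) <= tol:
        return "critically damped"
    return "underdamped" if zeta < 1.0 else "overdamped"


for c in (0.5, 2.0, 5.0):
    print(c, regime(mass=1.0, damping=c, stiffness=1.0))
```

A damping ratio below one gives an oscillatory approach to the fixed point, which corresponds to an oscillating reward curve in this picture. A ratio above one gives a monotone but slower approach.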

Empirical fits support the critically damped form and confirm group-size invariance and transfer robustness in out-of-distribution evaluation (Ghosh et al., 29 Jun 2026).

7. Summary Table: RefGRPO Key Features and Empirical Claims

Variant	Problem Domain	Core Methodological Step	Empirical/Axiomatic Claim
Constrained GRPO	CMDP / Robotics	Scalarized-advantage normalization	Stable constraint satisfaction, EPDMS↑
KL-Contrastive RefGRPO	Verifiable-binary rewards	Closed-form policy recursion	Guaranteed success amplification
Calibration Bonus	LLM agentic RL	Reward augmentation, schedule	Underconfidence↓, Chow metric↑
Expand-and-Prune	Generative models	OVF, latent lookahead pruning	Faster/more diverse sampling, compute↓
U-Statistic GRPO	General RL	Leave-one-out group mean estimator	Oracle variance, universal group-size scaling
Oscillator Dynamics	LLM training dynamics	Potential function reduction	Predictable reward curve, stability cond.

All entries above are direct readings of cited contents. Empirical results confirm that RefGRPO delivers reliable constraint control, amplifies success, sharpens calibration, preserves computational efficiency, and abides by provable finite-sample and asymptotic performance guarantees across domains (Girgis et al., 5 Feb 2026, Mroueh, 9 Mar 2025, Zhu, 12 Jun 2026, Ge et al., 17 Dec 2025, Zhou et al., 1 Mar 2026, Ghosh et al., 29 Jun 2026).