
Group-Relative Policy Optimization with RoC

Updated 3 September 2025
  • The paper introduces GRPO-RoC, which unifies group-based policy optimization with on-chain verifiable rewards for robust and transparent reinforcement learning.
  • It employs a group-relative preference model with shift-and-scale invariance and nonlinear advantage scaling, ensuring stability and improved performance over traditional RLHF.
  • The method integrates secure, auditable reward-on-chain logging to mitigate bias and reward hacking, supporting scalable and modular applications in multi-objective AI alignment.

Group Relative Policy Optimization with Reward-on-Chain (GRPO-RoC) is a reinforcement learning (RL) framework designed for preference aggregation and alignment in advanced AI systems, including LLMs and vision–LLMs. The method unifies group-based, reference-regularized policy optimization (as in GRPO) with mechanisms that ensure the verifiability, transparency, or auditability of reward signals (the Reward-on-Chain, RoC, component). This approach introduces a distinct preference aggregation rule, innovative advantage normalization, and flexible integration of verifiable and modular reward signals. It is motivated by the limitations of traditional RLHF (Reinforcement Learning from Human Feedback) and provides novel solutions to policy alignment, efficient reward utilization, and stability of optimization.

1. Group-Relative Preference Aggregation and Objective

GRPO updates the policy by scaling a trusted reference distribution according to a group-based, normalized advantage. For each context $q$, a group of $G$ outputs $\{o_1, \ldots, o_G\}$ is sampled. Their observed reward values $\{r_1, \ldots, r_G\}$ produce normalized advantages:

$$A_i = \frac{r_i - \mathrm{mean}(r_1, \ldots, r_G)}{\mathrm{std}(r_1, \ldots, r_G)}$$

This "whitening" ensures invariance to both affine shifts and rescalings of the reward, focusing policy learning on the relative ranking of outputs.

Aggregate group preference $P_G$ is then defined, and the stationary policy must satisfy a nonlinear fixed-point condition:

$$\left(1 - \frac{P_G(o|\pi(\cdot|q),q) - \mathbb{E}_{o'}[P_G(o'|\pi(\cdot|q),q)]}{\beta}\right) \cdot \pi(o|q) = \pi_{\mathrm{ref}}(o|q)$$

where $\beta$ is a regularization constant. This can be rearranged as

$$\pi(o|q) = g\left( \frac{P_G(o|\pi(\cdot|q),q) - \mathbb{E}_{o'}[P_G(o'|\pi(\cdot|q),q)]}{\beta} \right) \cdot \pi_{\mathrm{ref}}(o|q) \qquad \text{with} \quad g(x) = \frac{1}{1-x}$$

Unlike RLHF's standard logarithmic pooling, which exponentially tilts the reference policy by the reward, GRPO uses this nonlinear "scaling" via $g(\cdot)$, meaning that policy preference is not simply an exponential tilt but corrects for the normalized group-wise reward differences.
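
To make the contrast tangible, the sketch below applies both reweighting rules to a toy reference distribution over three outputs with fixed, centered preference scores. The specific numbers are illustrative assumptions, and the explicit renormalization is added only so that both results remain probability distributions.

```python
import numpy as np

# Toy reference policy over three candidate outputs for one context q.
pi_ref = np.array([0.5, 0.3, 0.2])
# Centered preference scores P_G(o) - E[P_G] (illustrative values only).
delta = np.array([0.3, 0.0, -0.3])
beta = 1.0

# GRPO-style nonlinear scaling: g(x) = 1 / (1 - x) applied to delta / beta.
grpo_weights = 1.0 / (1.0 - delta / beta)
pi_grpo = pi_ref * grpo_weights
pi_grpo /= pi_grpo.sum()          # renormalize onto the simplex

# RLHF-style logarithmic pooling: exponential tilt exp(delta / beta).
pi_rlhf = pi_ref * np.exp(delta / beta)
pi_rlhf /= pi_rlhf.sum()

print("GRPO scaling:     ", pi_grpo)   # 1/(1-x) grows sharply as x -> 1
print("Exponential tilt: ", pi_rlhf)
```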

This form extends naturally to multi-objective settings, and, for binary tasks ($G=2$), reduces to pairwise comparison aggregation, yielding direct equivalence with preference-based methods (Vojnovic et al., 25 Feb 2025, Li et al., 26 Mar 2025).

2. Reward Preference Model: Shift-and-Scale Invariance and Generalization

The reward preference model in GRPO is constructed to be invariant to affine transformations of the reward, emphasizing ordinal rather than cardinal information. For each sampled group:

  • Rewards are shifted to zero mean and scaled to unit variance;
  • The group-relative preference focuses entirely on reward ordering.

For binary groups ($G=2$), preference is equivalent to the sign of the reward difference (pairwise comparison). For larger groups ($G>2$), the aggregation captures higher-order ranking information within the group (Vojnovic et al., 25 Feb 2025, Li et al., 26 Mar 2025). When rewards are verifiable (e.g., programmatically checked correctness), this preference model produces robust and tamper-resistant signals suitable for RL training (Mroueh, 9 Mar 2025).

This shift-and-scale normalization can be explicitly contrasted with RLHF, where absolute reward values (often learned from human comparisons or critics) can be sensitive to calibration, bias, and scale.
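
Both properties can be checked numerically with the short sketch below (restating the whitening function from Section 1 for self-containment): normalized advantages are unchanged under an affine transform of the rewards, and for $G=2$ they collapse to approximately $\pm 1$, i.e., to the sign of the reward difference. The specific reward values and affine constants are illustrative assumptions.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Affine invariance: shifting and rescaling rewards leaves advantages unchanged.
rewards = np.array([0.1, 0.7, 0.4, 0.9])
shifted = 5.0 * rewards + 3.0            # arbitrary affine transform (illustrative)
print(np.allclose(group_relative_advantages(rewards),
                  group_relative_advantages(shifted)))   # True (up to eps)

# Binary group (G = 2): only the ordering of the two rewards survives.
print(group_relative_advantages([0.2, 0.8]))   # approximately [-1, +1]
print(group_relative_advantages([3.0, 1.0]))   # approximately [+1, -1]
```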

3. Reference Policy Regularization: Reverse KL Divergence Penalty

GRPO incorporates a regularization penalty to control divergence from the reference policy, crucial for stability and trust-region updates. The standard penalty function is:

$$D_i(\theta) = \frac{\pi_{\mathrm{ref}}(o_i|q)}{\pi_{\theta}(o_i|q)} - \log \frac{\pi_{\mathrm{ref}}(o_i|q)}{\pi_{\theta}(o_i|q)} - 1$$

The expected value of this term acts as the regularizer in the optimization objective. Its gradient, evaluated at the stationary point ($\pi_{\theta} = \pi_{\mathrm{old}}$), is:

$$- \frac{\pi_{\mathrm{ref}}(o|q)}{\pi_{\theta}(o|q)} + 1$$

This is exactly (up to a constant) the gradient of the reverse Kullback–Leibler divergence $\mathrm{KL}(\pi_{\mathrm{ref}} \parallel \pi_{\theta})$. This choice of penalty favors covering the support of the reference policy and leads to more conservative updates than methods that regularize with the forward KL (Vojnovic et al., 25 Feb 2025).
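
In implementations that work with log-probabilities, the per-sample penalty above is typically computed as in the minimal sketch below; the log-probability values are placeholders and the function name is ours, not taken from any particular codebase.

```python
import numpy as np

def kl_penalty_term(logp_ref, logp_theta):
    """Per-sample penalty D_i = ratio - log(ratio) - 1,
    with ratio = pi_ref(o_i|q) / pi_theta(o_i|q).

    Non-negative, and zero exactly when the two probabilities agree;
    its expected value serves as the KL-based regularizer discussed above.
    """
    log_ratio = logp_ref - logp_theta
    return np.exp(log_ratio) - log_ratio - 1.0

# Placeholder per-sample log-probabilities (illustrative values).
logp_ref = np.array([-1.2, -0.9, -2.1])
logp_theta = np.array([-1.0, -1.1, -2.1])
print(kl_penalty_term(logp_ref, logp_theta))   # all entries >= 0
```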

Alternatively, methods that use the direct KL divergence penalty recover standard exponential-tilt aggregation (i.e., RLHF’s logarithmic pooling), which is less robust to outlying or extremal reward values and does not have the same fixed-point uniqueness guarantees (Vojnovic et al., 25 Feb 2025, Mroueh, 9 Mar 2025).

4. Stationary Policy Characterization and Theoretical Guarantees

By applying KKT conditions to the constrained optimization (probability simplex), the stationary distribution for GRPO satisfies the nonlinear fixed-point equation relating group preference advantage to scaling of the reference distribution. The theory provides explicit solutions in special cases:

  • For binary output settings, aggregate probabilities depend explicitly on the regularization constant $\beta$ and the pairwise confidence margin $\gamma = \mathcal{P}(a \succ b|q) - \mathcal{P}(b \succ a|q)$.
  • For large group sizes, closed-form solutions depend on key group preference statistics.

GRPO’s fixed-point characterization ensures the aggregation is stable and effective in reweighting probabilities toward desirable outputs. The proof of monotonic amplification of the "success probability" under suitable regularization is established via recurrence equations (Mroueh, 9 Mar 2025):

$$p_n(q) = \frac{p_{\mathrm{ref}}(q)\, \exp[(1/\beta)\, \omega^+(p_{n-1}(q))]}{p_{\mathrm{ref}}(q)\, \exp[(1/\beta)\, \omega^+(p_{n-1}(q))] + (1-p_{\mathrm{ref}}(q))\, \exp[-(1/\beta)\, \omega^-(p_{n-1}(q))]}$$

where at convergence, the stationary success rate $p^*$ of the policy strictly exceeds the initial $p_{\mathrm{ref}}$, demonstrating systematic performance amplification.
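
The amplification effect can be checked numerically with the short sketch below. The specific choices $\omega^+(p) = 1 - p$ and $\omega^-(p) = p$ are illustrative assumptions, not the exact forms from the cited analysis; any positive weighting functions produce the same qualitative behavior of lifting the success rate above $p_{\mathrm{ref}}$.

```python
import numpy as np

def iterate_success_rate(p_ref, beta, n_steps,
                         omega_plus=lambda p: 1.0 - p,   # illustrative choice
                         omega_minus=lambda p: p):        # illustrative choice
    """Iterate the success-rate recurrence with assumed omega functions."""
    p = p_ref
    history = [p]
    for _ in range(n_steps):
        up = p_ref * np.exp(omega_plus(p) / beta)
        down = (1.0 - p_ref) * np.exp(-omega_minus(p) / beta)
        p = up / (up + down)
        history.append(p)
    return history

traj = iterate_success_rate(p_ref=0.3, beta=1.0, n_steps=10)
print([round(p, 3) for p in traj])   # rises above p_ref = 0.3 and settles at p* > p_ref
```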

5. Reward-on-Chain Integration: Verifiability, Auditability, and Chain-of-Thought

Reward-on-Chain (RoC) refers to the secure, transparent, and sometimes decentralized logging or verification of reward signals underlying policy updates. The RoC component of GRPO-RoC can encompass:

  • Storing binary or graded reward signals on a blockchain or other immutable ledger;
  • Verifiable, externally-auditable signals suitable for minimizing fraudulent or adversarial reward manipulation (Mroueh, 9 Mar 2025);
  • The possibility of modular or step-wise reward aggregation, as in chain-of-thought (CoT) supervision or self-correcting/step-by-step LLMs (Ding et al., 5 Jun 2025, Yang et al., 5 Jun 2025).

Practical implications include:

  • Ensuring each policy improvement is traceable to a verifiable reward assignment (a minimal hash-chain sketch follows this list);
  • Enabling process-level (not merely trajectory-level) supervision, which is crucial when a single error at any reasoning step may cause overall failure, as addressed by self-correction (MGRPO) or dense stepwise reward chaining (TreeRPO);
  • Providing strong defenses against phenomena such as reward hacking.
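
As a minimal illustration of the auditability idea only (not a protocol taken from the cited works), the sketch below appends reward records to a hash-chained log, so that any later tampering with a recorded reward breaks verification.

```python
import hashlib
import json

def append_record(log, context_id, output_id, reward):
    """Append a reward record whose hash commits to the previous entry."""
    prev_hash = log[-1]["hash"] if log else "0" * 64
    body = {"context": context_id, "output": output_id,
            "reward": reward, "prev": prev_hash}
    digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    log.append({**body, "hash": digest})

def verify(log):
    """Recompute every hash; any edited reward invalidates the chain."""
    prev_hash = "0" * 64
    for entry in log:
        body = {k: entry[k] for k in ("context", "output", "reward", "prev")}
        expected = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
        if entry["prev"] != prev_hash or entry["hash"] != expected:
            return False
        prev_hash = entry["hash"]
    return True

log = []
append_record(log, "q42", "o1", 1.0)
append_record(log, "q42", "o2", 0.0)
print(verify(log))          # True
log[0]["reward"] = 1.0e6    # tamper with a logged reward
print(verify(log))          # False
```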

A plausible implication is that, as reward signals become more modular, transparent, and process-linked, GRPO-RoC is increasingly well-matched to advanced alignment tasks—particularly in high-stakes or multi-agent settings.

6. Empirical Results, Extensions, and Modifications

Empirically, GRPO-RoC and closely related GRPO-based systems have been validated across multiple domains, and GRPO has been adapted to a range of reward structures and task settings, including the self-correcting (MGRPO) and stepwise reward-chaining (TreeRPO) variants noted above.

7. Limitations, Considerations, and Future Directions

Critical considerations for GRPO-RoC include:

  • Quality and calibration of verifiable or learned reward models: group-based normalization mitigates but does not eliminate potential biases;
  • Proper tuning of regularization constant (β\beta): controls the trade-off between reward optimization and reference adherence;
  • Group size selection: influences stability and the granularity of preference aggregation;
  • Reward signal structure: modular/process-level rewards confer benefits but may increase system complexity and engineering overhead;
  • Robustness and computational efficiency: off-policy, incremental, and rule-based reward structures improve efficiency and alignment reliability, but may also limit policy exploration.

Future research directions suggested by the literature include development of:

  • Improved group sampling algorithms for very large-scale LLMs;
  • General frameworks for integrating on-chain reward logging in production RL systems;
  • Adaptive and context-sensitive reward aggregation schemes;
  • Combined usage of multi-layer, tree, or graph-based preference structures;
  • Provable extensions for simultaneous alignment along multiple, potentially conflicting, axes.

A table summarizing key comparative dimensions is presented below.

| Aspect | GRPO-RoC | RLHF / Logarithmic Pooling |
|---|---|---|
| Policy Aggregation | Nonlinear $1/(1-x)$ scaling of $\pi_{\mathrm{ref}}$ | Exponential tilt of $\pi_{\mathrm{ref}}$ |
| Advantage | Group-normalized, shift- and scale-invariant | Absolute, often needs a value network |
| Reward | Binary, learned, verifiable, or fuzzy | Learned scalar, potentially biased |
| Penalty | Reverse KL divergence (default) | Forward KL |
| Preference Model | Reduces to pairwise comparison at $G=2$ | Pairwise or scalar rewards |
| Reward-on-Chain | Verifiable, auditable, step-chain possible | Generally not modular or auditable |

The GRPO-RoC paradigm establishes a robust methodology for scalable, modular, and verifiable policy alignment—enabling high-performance RL optimization with transparent guarantees on reward provenance and aggregation dynamics.