
Group-Relative Policy Optimization with RoC

Updated 3 September 2025
  • The paper introduces GRPO-RoC, which unifies group-based policy optimization with on-chain verifiable rewards for robust and transparent reinforcement learning.
  • It employs a group-relative preference model with shift-and-scale invariance and nonlinear advantage scaling, ensuring stability and improved performance over traditional RLHF.
  • The method integrates secure, auditable reward-on-chain logging to mitigate bias and reward hacking, supporting scalable and modular applications in multi-objective AI alignment.

Group Relative Policy Optimization with Reward-on-Chain (GRPO-RoC) is a reinforcement learning (RL) framework designed for preference aggregation and alignment in advanced AI systems, including LLMs and vision–LLMs. The method unifies group-based, reference-regularized policy optimization (as in GRPO) with mechanisms that ensure the verifiability, transparency, or auditability of reward signals (the Reward-on-Chain, RoC, component). This approach introduces a distinct preference aggregation rule, innovative advantage normalization, and flexible integration of verifiable and modular reward signals. It is motivated by the limitations of traditional RLHF (Reinforcement Learning from Human Feedback) and provides novel solutions to policy alignment, efficient reward utilization, and stability of optimization.

1. Group-Relative Preference Aggregation and Objective

GRPO updates the policy by scaling a trusted reference distribution according to a group-based, normalized advantage. For each context $q$, a group of $G$ outputs $\{o_1, \ldots, o_G\}$ is sampled. Their observed reward values $\{r_1, \ldots, r_G\}$ produce normalized advantages:

$$A_i = \frac{r_i - \mathrm{mean}(r_1, \ldots, r_G)}{\mathrm{std}(r_1, \ldots, r_G)}$$

This "whitening" ensures invariance to both affine shifts and rescalings of the reward, focusing policy learning on the relative ranking of outputs.

Aggregate group preference $P_G$ is then defined, and the stationary policy must satisfy a nonlinear fixed-point condition:

$$\left(1 - \frac{P_G(o|\pi(\cdot|q),q) - \mathbb{E}_{o'}[P_G(o'|\pi(\cdot|q),q)]}{\beta}\right) \cdot \pi(o|q) = \pi_{\mathrm{ref}}(o|q)$$

where $\beta$ is a regularization constant. This can be rearranged as

$$\pi(o|q) = g\left( \frac{P_G(o|\pi(\cdot|q),q) - \mathbb{E}_{o'}[P_G(o'|\pi(\cdot|q),q)]}{\beta} \right) \cdot \pi_{\mathrm{ref}}(o|q) \qquad \text{with} \quad g(x) = \frac{1}{1-x}$$

Unlike RLHF's standard logarithmic pooling, which exponentially tilts the reference policy by the reward, GRPO uses this nonlinear "scaling" via $g(\cdot)$, meaning that policy preference is not simply an exponential tilt but corrects for the normalized group-wise reward differences.
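
To make the contrast tangible, the sketch below applies both reweighting rules to a toy reference distribution over three outputs with fixed, centered preference scores. The specific numbers are illustrative assumptions, and the explicit renormalization is added only so that both results remain probability distributions.

```python
import numpy as np

# Toy reference policy over three candidate outputs for one context q.
pi_ref = np.array([0.5, 0.3, 0.2])
# Centered preference scores P_G(o) - E[P_G] (illustrative values only).
delta = np.array([0.3, 0.0, -0.3])
beta = 1.0

# GRPO-style nonlinear scaling: g(x) = 1 / (1 - x) applied to delta / beta.
grpo_weights = 1.0 / (1.0 - delta / beta)
pi_grpo = pi_ref * grpo_weights
pi_grpo /= pi_grpo.sum()          # renormalize onto the simplex

# RLHF-style logarithmic pooling: exponential tilt exp(delta / beta).
pi_rlhf = pi_ref * np.exp(delta / beta)
pi_rlhf /= pi_rlhf.sum()

print("GRPO scaling:     ", pi_grpo)   # 1/(1-x) grows sharply as x -> 1
print("Exponential tilt: ", pi_rlhf)
```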

This form extends naturally to multi-objective settings, and, for binary tasks ($G=2$), reduces to pairwise comparison aggregation, yielding direct equivalence with preference-based methods (Vojnovic et al., 25 Feb 2025, Li et al., 26 Mar 2025).

2. Reward Preference Model: Shift-and-Scale Invariance and Generalization

The reward preference model in GRPO is constructed to be invariant to affine transformations of the reward, emphasizing ordinal rather than cardinal information. For each sampled group:

  • Rewards are shifted to zero mean and scaled to unit variance;
  • The group-relative preference focuses entirely on reward ordering.

For binary groups ($G=2$), preference is equivalent to the sign of the reward difference (pairwise comparison). For larger groups ($G>2$), the aggregation captures higher-order ranking information within the group (Vojnovic et al., 25 Feb 2025, Li et al., 26 Mar 2025). When rewards are verifiable (e.g., programmatically checked correctness), this preference model produces robust and tamper-resistant signals suitable for RL training (Mroueh, 9 Mar 2025).

This shift-and-scale normalization can be explicitly contrasted with RLHF, where absolute reward values (often learned from human comparisons or critics) can be sensitive to calibration, bias, and scale.
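
Both properties can be checked numerically with the short sketch below (restating the whitening function from Section 1 for self-containment): normalized advantages are unchanged under an affine transform of the rewards, and for $G=2$ they collapse to approximately $\pm 1$, i.e., to the sign of the reward difference. The specific reward values and affine constants are illustrative assumptions.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Affine invariance: shifting and rescaling rewards leaves advantages unchanged.
rewards = np.array([0.1, 0.7, 0.4, 0.9])
shifted = 5.0 * rewards + 3.0            # arbitrary affine transform (illustrative)
print(np.allclose(group_relative_advantages(rewards),
                  group_relative_advantages(shifted)))   # True (up to eps)

# Binary group (G = 2): only the ordering of the two rewards survives.
print(group_relative_advantages([0.2, 0.8]))   # approximately [-1, +1]
print(group_relative_advantages([3.0, 1.0]))   # approximately [+1, -1]
```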

3. Reference Policy Regularization: Reverse KL Divergence Penalty

GRPO incorporates a regularization penalty to control divergence from the reference policy, crucial for stability and trust-region updates. The standard penalty function is:

$$D_i(\theta) = \frac{\pi_{\mathrm{ref}}(o_i|q)}{\pi_{\theta}(o_i|q)} - \log \frac{\pi_{\mathrm{ref}}(o_i|q)}{\pi_{\theta}(o_i|q)} - 1$$

The expected value of this term acts as the regularizer in the optimization objective. Its gradient, evaluated at the stationary point ($\pi_{\theta} = \pi_{\mathrm{old}}$), is:

$$- \frac{\pi_{\mathrm{ref}}(o|q)}{\pi_{\theta}(o|q)} + 1$$

This is exactly (up to a constant) the gradient of the reverse Kullback–Leibler divergence $\mathrm{KL}(\pi_{\mathrm{ref}} \parallel \pi_{\theta})$. This choice of penalty favors covering the support of the reference policy and leads to more conservative updates than methods that regularize with the forward KL (Vojnovic et al., 25 Feb 2025).
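
In implementations that work with log-probabilities, the per-sample penalty above is typically computed as in the minimal sketch below; the log-probability values are placeholders and the function name is ours, not taken from any particular codebase.

```python
import numpy as np

def kl_penalty_term(logp_ref, logp_theta):
    """Per-sample penalty D_i = ratio - log(ratio) - 1,
    with ratio = pi_ref(o_i|q) / pi_theta(o_i|q).

    Non-negative, and zero exactly when the two probabilities agree;
    its expected value serves as the KL-based regularizer discussed above.
    """
    log_ratio = logp_ref - logp_theta
    return np.exp(log_ratio) - log_ratio - 1.0

# Placeholder per-sample log-probabilities (illustrative values).
logp_ref = np.array([-1.2, -0.9, -2.1])
logp_theta = np.array([-1.0, -1.1, -2.1])
print(kl_penalty_term(logp_ref, logp_theta))   # all entries >= 0
```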

Alternatively, methods that use the direct KL divergence penalty recover standard exponential-tilt aggregation (i.e., RLHF’s logarithmic pooling), which is less robust to outlying or extremal reward values and does not have the same fixed-point uniqueness guarantees (Vojnovic et al., 25 Feb 2025, Mroueh, 9 Mar 2025).

4. Stationary Policy Characterization and Theoretical Guarantees

By applying KKT conditions to the constrained optimization (probability simplex), the stationary distribution for GRPO satisfies the nonlinear fixed-point equation relating group preference advantage to scaling of the reference distribution. The theory provides explicit solutions in special cases:

  • For binary output settings, aggregate probabilities depend explicitly on the regularization constant $\beta$ and the pairwise confidence margin $\gamma = \mathcal{P}(a \succ b|q) - \mathcal{P}(b \succ a|q)$.
  • For large group sizes, closed-form solutions depend on key group preference statistics.

GRPO’s fixed-point characterization ensures the aggregation is stable and effective in reweighting probabilities toward desirable outputs. The proof of monotonic amplification of the "success probability" under suitable regularization is established via recurrence equations (Mroueh, 9 Mar 2025):

$$p_n(q) = \frac{p_{\mathrm{ref}}(q)\, \exp[(1/\beta)\, \omega^+(p_{n-1}(q))]}{p_{\mathrm{ref}}(q)\, \exp[(1/\beta)\, \omega^+(p_{n-1}(q))] + (1-p_{\mathrm{ref}}(q))\, \exp[-(1/\beta)\, \omega^-(p_{n-1}(q))]}$$

where at convergence, the stationary success rate $p^*$ of the policy strictly exceeds the initial $p_{\mathrm{ref}}$, demonstrating systematic performance amplification.
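
The amplification effect can be checked numerically with the short sketch below. The specific choices $\omega^+(p) = 1 - p$ and $\omega^-(p) = p$ are illustrative assumptions, not the exact forms from the cited analysis; any positive weighting functions produce the same qualitative behavior of lifting the success rate above $p_{\mathrm{ref}}$.

```python
import numpy as np

def iterate_success_rate(p_ref, beta, n_steps,
                         omega_plus=lambda p: 1.0 - p,   # illustrative choice
                         omega_minus=lambda p: p):        # illustrative choice
    """Iterate the success-rate recurrence with assumed omega functions."""
    p = p_ref
    history = [p]
    for _ in range(n_steps):
        up = p_ref * np.exp(omega_plus(p) / beta)
        down = (1.0 - p_ref) * np.exp(-omega_minus(p) / beta)
        p = up / (up + down)
        history.append(p)
    return history

traj = iterate_success_rate(p_ref=0.3, beta=1.0, n_steps=10)
print([round(p, 3) for p in traj])   # rises above p_ref = 0.3 and settles at p* > p_ref
```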

5. Reward-on-Chain Integration: Verifiability, Auditability, and Chain-of-Thought

Reward-on-Chain (RoC) refers to the secure, transparent, and sometimes decentralized logging or verification of reward signals underlying policy updates. The RoC component of GRPO-RoC can encompass:

  • Storing binary or graded reward signals on a blockchain or other immutable ledger;
  • Verifiable, externally-auditable signals suitable for minimizing fraudulent or adversarial reward manipulation (Mroueh, 9 Mar 2025);
  • The possibility of modular or step-wise reward aggregation, as in chain-of-thought (CoT) supervision or self-correcting/step-by-step LLMs (Ding et al., 5 Jun 2025, Yang et al., 5 Jun 2025).

Practical implications include:

  • Ensuring each policy improvement is traceable to a verifiable reward assignment (a minimal hash-chain sketch follows this list);
  • Enabling process-level (not merely trajectory-level) supervision, which is crucial when a single error at any reasoning step may cause overall failure, as addressed by self-correction (MGRPO) or dense stepwise reward chaining (TreeRPO);
  • Providing strong defenses against phenomena such as reward hacking.
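
As a minimal illustration of the auditability idea only (not a protocol taken from the cited works), the sketch below appends reward records to a hash-chained log, so that any later tampering with a recorded reward breaks verification.

```python
import hashlib
import json

def append_record(log, context_id, output_id, reward):
    """Append a reward record whose hash commits to the previous entry."""
    prev_hash = log[-1]["hash"] if log else "0" * 64
    body = {"context": context_id, "output": output_id,
            "reward": reward, "prev": prev_hash}
    digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    log.append({**body, "hash": digest})

def verify(log):
    """Recompute every hash; any edited reward invalidates the chain."""
    prev_hash = "0" * 64
    for entry in log:
        body = {k: entry[k] for k in ("context", "output", "reward", "prev")}
        expected = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
        if entry["prev"] != prev_hash or entry["hash"] != expected:
            return False
        prev_hash = entry["hash"]
    return True

log = []
append_record(log, "q42", "o1", 1.0)
append_record(log, "q42", "o2", 0.0)
print(verify(log))          # True
log[0]["reward"] = 1.0e6    # tamper with a logged reward
print(verify(log))          # False
```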

A plausible implication is that, as reward signals become more modular, transparent, and process-linked, GRPO-RoC is increasingly well-matched to advanced alignment tasks—particularly in high-stakes or multi-agent settings.

6. Empirical Results, Extensions, and Modifications

Empirically, GRPO-RoC and closely related GRPO-based systems have been validated across multiple domains, and GRPO has been adapted to a range of reward structures and task settings, including the self-correcting (MGRPO) and stepwise reward-chaining (TreeRPO) variants noted above.

7. Limitations, Considerations, and Future Directions

Critical considerations for GRPO-RoC include:

  • Quality and calibration of verifiable or learned reward models: group-based normalization mitigates but does not eliminate potential biases;
  • Proper tuning of regularization constant (β\beta): controls the trade-off between reward optimization and reference adherence;
  • Group size selection: influences stability and the granularity of preference aggregation;
  • Reward signal structure: modular/process-level rewards confer benefits but may increase system complexity and engineering overhead;
  • Robustness and computational efficiency: off-policy, incremental, and rule-based reward structures improve efficiency and alignment reliability, but may also limit policy exploration.

Future research directions suggested by the literature include development of:

  • Improved group sampling algorithms for very large-scale LLMs;
  • General frameworks for integrating on-chain reward logging in production RL systems;
  • Adaptive and context-sensitive reward aggregation schemes;
  • Combined usage of multi-layer, tree, or graph-based preference structures;
  • Provable extensions for simultaneous alignment along multiple, potentially conflicting, axes.

A table summarizing key comparative dimensions is presented below.

| Aspect | GRPO-RoC | RLHF / Logarithmic Pooling |
|---|---|---|
| Policy Aggregation | Nonlinear $1/(1-x)$ scaling of $\pi_{\mathrm{ref}}$ | Exponential tilt of $\pi_{\mathrm{ref}}$ |
| Advantage | Group-normalized, shift- and scale-invariant | Absolute, often needs a value network |
| Reward | Binary, learned, verifiable, or fuzzy | Learned scalar, potentially biased |
| Penalty | Reverse KL divergence (default) | Forward KL |
| Preference Model | Reduces to pairwise comparison at $G=2$ | Pairwise or scalar rewards |
| Reward-on-Chain | Verifiable, auditable, step-chain possible | Generally not modular or auditable |

The GRPO-RoC paradigm establishes a robust methodology for scalable, modular, and verifiable policy alignment—enabling high-performance RL optimization with transparent guarantees on reward provenance and aggregation dynamics.