JustGRPO: Group Relative Policy Optimization
- JustGRPO is a group-based, critic-free reinforcement learning framework that normalizes rewards using group-level statistics.
- It employs a PPO-style clipped surrogate loss with reverse KL regularization, ensuring stable and efficient policy updates.
- Applications in LLM fine-tuning, distributed consensus, and cooperative games demonstrate its improved sample efficiency and scalability.
The JustGRPO framework refers to the application of Group Relative Policy Optimization (GRPO) in its original, unextended form, across a range of modern reinforcement learning and distributed optimization contexts. As formalized in both LLM fine-tuning and distributed convex optimization, JustGRPO leverages a critic-free, group-based normalization of returns—eschewing value-based baselines in favor of relative performance within stochastically generated peer groups. This approach has enabled efficient, stable, and theoretically analyzable policy optimization in large-scale language modeling, complex cooperative games, contract analysis, and distributed consensus problems (Ni et al., 21 Jan 2026, Dechtiar et al., 10 Nov 2025, Pang et al., 4 Aug 2025, Yang et al., 7 Oct 2025, Vojnovic et al., 25 Feb 2025, Lee et al., 2013).
1. Core Principles of JustGRPO
JustGRPO generalizes the single-sample advantage baseline of Proximal Policy Optimization (PPO) to a group-normalized, distribution-relative scheme. For a given prompt, observation, or environment state, a set of candidate actions or output trajectories is sampled from a reference (usually previous) policy. The scalar returns (reward signals) for these candidates are aggregated using a shift-and-scale normalization—typically by centering each reward around the group's mean and scaling by its standard deviation, yielding a group-relative advantage (Yang et al., 7 Oct 2025, Pang et al., 4 Aug 2025, Vojnovic et al., 25 Feb 2025, Ni et al., 21 Jan 2026). The objective function for policy optimization applies these advantages in a PPO-style clipped surrogate, with optional KL penalty to a reference policy.
This structure renders the method critic-free and naturally scale-invariant, avoiding issues of reward magnitude drift and enabling rapid adaptation without the need for separate value function networks (Pang et al., 4 Aug 2025, Yang et al., 7 Oct 2025).
2. Mathematical Formulation and Objective
The central optimization target in JustGRPO is the maximization of a clipped, normalized surrogate with respect to current policy parameters $\theta$:

$$\mathcal{J}(\theta) = \mathbb{E}\left[ \frac{1}{|G|} \sum_{i=1}^{|G|} \min\!\Big( r_i(\theta)\, \hat{A}_i,\; \operatorname{clip}\big(r_i(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, \hat{A}_i \Big) \right].$$

Here, $r_i(\theta) = \pi_\theta(o_i \mid q) / \pi_{\theta_{\text{old}}}(o_i \mid q)$ is the importance sampling ratio between the current and previous policy at candidate $o_i$, and $\epsilon$ is the clipping hyperparameter. The group-relative advantage $\hat{A}_i = \big(R_i - \operatorname{mean}(R_1, \ldots, R_{|G|})\big) / \operatorname{std}(R_1, \ldots, R_{|G|})$ is calculated over the group's returns $R_1, \ldots, R_{|G|}$ (Yang et al., 7 Oct 2025, Pang et al., 4 Aug 2025).
For LLM fine-tuning, each complete response (trajectory) is assigned a scalar reward, and group normalization is performed within each prompt batch. In distributed optimization, each agent's update relies on a similar normalization over a group of local candidate projections (Lee et al., 2013).
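A minimal sketch of the per-group clipped surrogate, assuming the importance ratios and group-relative advantages have already been computed as scalars (the function names and the default $\epsilon = 0.2$ are illustrative, not prescribed by the cited sources):

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO-style clipped term for one candidate:
    min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

def grpo_objective(ratios, advantages, eps=0.2):
    """Average clipped surrogate over a group (to be maximized)."""
    terms = [clipped_surrogate(r, a, eps) for r, a in zip(ratios, advantages)]
    return sum(terms) / len(terms)
```

The outer `min` makes the clip one-sided in the pessimistic direction: a candidate with positive advantage gains nothing from pushing its ratio above $1 + \epsilon$, and one with negative advantage cannot soften its penalty by dropping below $1 - \epsilon$.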
3. Algorithmic Implementation
The canonical JustGRPO algorithm proceeds as follows:
- Sampling: For each state or input, sample a group of candidate actions/outputs from the reference (old) policy.
- Evaluation: Assign scalar returns to each candidate (reward function or optimization objective).
- Normalization: Compute the group mean $\mu_G$ and standard deviation $\sigma_G$, then standardize each reward to obtain the group-relative advantage $\hat{A}_i = (R_i - \mu_G)/\sigma_G$.
- Policy Ratio: For each candidate, compute the importance sampling ratio $r_i(\theta) = \pi_\theta(o_i)/\pi_{\theta_{\text{old}}}(o_i)$.
- Clipped Surrogate: Formulate the loss as the PPO-style minimum of the unclipped and clipped policy-ratio multiplied by group-relative advantage; include KL penalty if used.
- Update: Iterate stochastic gradient ascent (or Adam optimization) steps; periodically refresh reference and old policies (Yang et al., 7 Oct 2025, Pang et al., 4 Aug 2025, Ni et al., 21 Jan 2026).
A representative pseudocode structure is provided in several sources (Pang et al., 4 Aug 2025, Dechtiar et al., 10 Nov 2025, Yang et al., 7 Oct 2025), emphasizing inner loop batch updates and the reuse of samples to increase data efficiency.
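The steps above can be sketched end-to-end on a toy softmax policy over K discrete actions. This is a self-contained illustration rather than the pseudocode from the cited sources; with a single inner update per sampling round, the ratio starts at 1 and the clip rarely binds, so the update behaves like REINFORCE with a group baseline:

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def grpo_step(logits, reward_fn, group_size=8, eps=0.2, lr=0.1):
    """One JustGRPO update on a K-armed softmax policy (illustrative)."""
    K = len(logits)
    old_probs = softmax(logits)  # frozen sampling (old) policy
    # 1. Sampling: draw a group of candidate actions from the old policy.
    actions = random.choices(range(K), weights=old_probs, k=group_size)
    # 2. Evaluation: assign a scalar return to each candidate.
    rewards = [reward_fn(a) for a in actions]
    # 3. Normalization: group-relative advantages (zero if all rewards tie).
    mean = sum(rewards) / group_size
    var = sum((r - mean) ** 2 for r in rewards) / group_size
    std = math.sqrt(var) + 1e-8
    advs = [(r - mean) / std for r in rewards]
    # 4-5. Gradient of the clipped surrogate, accumulated over the group.
    #      In a multi-update inner loop new_probs would drift from old_probs.
    new_probs = softmax(logits)
    grad = [0.0] * K
    for a, adv in zip(actions, advs):
        ratio = new_probs[a] / old_probs[a]
        # Candidates in the clipped region contribute zero gradient.
        if (adv > 0 and ratio > 1 + eps) or (adv < 0 and ratio < 1 - eps):
            continue
        for j in range(K):
            indicator = 1.0 if j == a else 0.0
            # d ratio / d logit_j = ratio * (indicator - new_probs[j])
            grad[j] += adv * ratio * (indicator - new_probs[j])
    # 6. Update: one gradient-ascent step on the surrogate.
    return [l + lr * g / group_size for l, g in zip(logits, grad)]
```

Repeated calls concentrate probability mass on the rewarded action; once a group's rewards all tie, the advantages vanish and the update stalls, which is the intended behavior of a purely relative baseline.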
4. Theoretical and Empirical Properties
JustGRPO admits a first-of-its-kind nonconvex convergence analysis for critic-free RL, establishing that the expected squared policy-gradient norm converges at a rate governed by the learning rate $\eta$, the number of inner updates between reference-policy refreshes, and the group size $|G|$ (Pang et al., 4 Aug 2025). The algorithm achieves stable convergence when $|G|$ is sufficiently large and the drift in $\theta$ between refreshes is small.
Empirically, JustGRPO yields robust improvements over PPO and trajectory-level value-based alternatives in sample efficiency and stability across a variety of domains:
- LLMs: outperforms self-consistency and soft policy-gradient methods on math and code reasoning benchmarks (e.g., 89.1% on GSM8K (Ni et al., 21 Jan 2026)).
- Distributed optimization: guarantees almost-sure convergence on convex consensus problems even in fully asynchronous, uncoordinated networks (Lee et al., 2013).
Group normalization avoids the calibration and scaling issues associated with value-based critics and eliminates the requirement for extensive value network training, enabling effective application in small-data or sparse-reward environments (Dechtiar et al., 10 Nov 2025, Yang et al., 7 Oct 2025).
5. Alignment Objective and Stationary Policy Characterization
The GRPO alignment framework induces a nonlinear preference aggregation effect fundamentally distinct from standard RLHF (logarithmic pooling). Group normalization yields a policy update rule of the form

$$\max_{\pi} \; \mathbb{E}_{o \sim \pi}\!\left[ \hat{A}(o) \right] \;-\; \beta\, D_{\mathrm{KL}}\!\left( \pi_{\mathrm{ref}} \,\|\, \pi \right),$$

where $\hat{A}(o)$ is a group-relative, shift-and-scale normalized reward preference and $D_{\mathrm{KL}}(\pi_{\mathrm{ref}} \,\|\, \pi)$ is the reverse KL (from reference to policy), weighted by the regularization coefficient $\beta$ (Vojnovic et al., 25 Feb 2025).
The stationary solution satisfies an implicit fixed-point equation depending on group size, scaling, and regularization parameter, and, in special cases (binary/pairwise, large group), has closed-form expressions relating the resulting aggregate to confidence margins and group properties.
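For intuition in the binary-reward special case, shift-and-scale normalization itself admits a simple closed form: if a fraction p of the group succeeds, the group mean is p and the population standard deviation is sqrt(p(1-p)), so successes receive advantage sqrt((1-p)/p) and failures receive -sqrt(p/(1-p)). This is a standard property of standardized Bernoulli variables, sketched here (the helper name is illustrative):

```python
import math

def binary_group_advantages(p):
    """Exact group-relative advantages when a fraction p of the group
    succeeds (rewards in {0, 1}, population statistics, 0 < p < 1).

    mean = p, std = sqrt(p * (1 - p)), so standardization collapses to a
    closed form that depends only on the group success rate.
    """
    adv_success = math.sqrt((1 - p) / p)
    adv_failure = -math.sqrt(p / (1 - p))
    return adv_success, adv_failure

# Rare successes are amplified, common successes are damped:
print(binary_group_advantages(0.1))  # success +3.0, failure -1/3
print(binary_group_advantages(0.9))  # success +1/3, failure -3.0
```

This asymmetry is one concrete way the group success rate enters the stationary aggregate: the lower a prompt's success rate, the larger the per-success advantage the policy gradient sees.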
6. Practical Applications and Variants
JustGRPO has been deployed in diverse settings:
- LLM fine-tuning: math and code reasoning (Ni et al., 21 Jan 2026), contract graph extraction (Dechtiar et al., 10 Nov 2025), general alignment for LLMs (Pang et al., 4 Aug 2025, Vojnovic et al., 25 Feb 2025).
- Distributed optimization: consensus under random projection and asynchronous gossip (Lee et al., 2013).
- Cooperative games: adaptation to public goods with group-based normalization and extensions for global constraints (Yang et al., 7 Oct 2025).
Variants such as TIC-GRPO (Trajectory-level Importance-Corrected GRPO) replace token-level importance ratios with a single trajectory ratio, providing unbiased gradient estimation at the cost of marginally higher computation per step and sharing the same convergence guarantees (Pang et al., 4 Aug 2025).
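The distinction between token-level and trajectory-level importance ratios can be made concrete with hypothetical per-token log-probabilities (function names are illustrative, not from the paper): the trajectory ratio is exactly the product of the per-token ratios, collapsed into a single factor.

```python
import math

def token_level_ratios(new_logps, old_logps):
    """Per-token importance ratios, one per generated token."""
    return [math.exp(n - o) for n, o in zip(new_logps, old_logps)]

def trajectory_level_ratio(new_logps, old_logps):
    """Single trajectory-level ratio, as in the TIC-GRPO variant:
    pi_theta(traj) / pi_old(traj) = exp(sum of per-token log-prob gaps)."""
    return math.exp(sum(new_logps) - sum(old_logps))
```

Because log-probabilities add along a trajectory, the two views agree in aggregate; the variant simply applies the clip and the advantage weight once per trajectory instead of once per token.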
7. Theoretical and Empirical Comparison with PPO and RLHF
JustGRPO and its group-relative normalization depart from PPO and RLHF in several significant ways:
- Critic-free: GRPO eliminates dependence on learned baselines, directly using empirical group returns to compute advantages.
- Reverse KL regularization: Stationary distributions are shaped by a reverse-KL term rather than the forward-KL of RLHF, producing different aggregation and alignment dynamics (Vojnovic et al., 25 Feb 2025).
- Policy stability: Group normalization stabilizes updates in environments with highly variable or sparse reward structures, broadening applicability in language modeling and multi-agent coordination (Ni et al., 21 Jan 2026, Yang et al., 7 Oct 2025).
A plausible implication is that group-relative normalization and scale-invariant update mechanics facilitate robust fine-tuning and consensus even in low-data regimes or under high stochasticity.
Principal References:
(Lee et al., 2013, Vojnovic et al., 25 Feb 2025, Pang et al., 4 Aug 2025, Yang et al., 7 Oct 2025, Dechtiar et al., 10 Nov 2025, Ni et al., 21 Jan 2026)