GRPO: Critic-Free, Group-Normalized Fine-Tuning
- GRPO-based fine-tuning is a critic-free reinforcement learning procedure that leverages group-normalized advantages to rank candidate outputs efficiently.
- It employs modular, task-specific reward functions—such as format, IoU, accuracy, and recall—to provide dense supervision across modalities.
- Empirical results demonstrate its superior stability, data efficiency, and generalization compared to traditional PPO and supervised fine-tuning approaches.
Group Relative Policy Optimization (GRPO)-based fine-tuning defines a family of critic-free reinforcement learning procedures for efficiently aligning large language, multimodal, and generative models with complex or verifiable reward functions. In GRPO, multiple candidate outputs are sampled for each input, evaluated under a task-specific reward, and the resulting group is normalized to form a relative advantage signal. Policy updates then maximize the probability of relatively better responses within each group, using a ranking-style, variance-reduced policy gradient. The GRPO framework has proven particularly effective for multimodal LLMs (MLLMs), video understanding, and other RLHF domains, offering superior stability, data efficiency, and generalization versus supervised fine-tuning (SFT) and traditional actor-critic approaches.
1. Core GRPO Formulation and Motivation
GRPO is motivated by the limitations of conventional Proximal Policy Optimization (PPO) in alignment and RLHF settings, particularly the variance and bias introduced by value-function critics. GRPO removes the critic entirely by estimating within-group advantages for a set of outputs sampled per input under the current policy or, more typically, a slightly stale sampling policy $\pi_{\theta_{\text{old}}}$.
Given an input state $q$ (such as a video–query pair), the policy generates a group of $G$ candidate responses $\{o_1, \dots, o_G\} \sim \pi_{\theta_{\text{old}}}(\cdot \mid q)$. Each response receives a scalar reward $r_i = R(q, o_i)$. The group mean $\mu$ and standard deviation $\sigma$ of $\{r_i\}_{i=1}^{G}$ are computed, and group-relative advantages are defined as

$$A_i = \frac{r_i - \mu}{\sigma}.$$

The GRPO surrogate objective to maximize is:

$$\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}_{q,\,\{o_i\} \sim \pi_{\theta_{\text{old}}}}\!\left[ \frac{1}{G} \sum_{i=1}^{G} \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)}\, A_i \right] - \beta\, D_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right).$$

In practice, the negative of this objective is minimized as the loss, and the KL penalty with respect to a frozen reference policy $\pi_{\text{ref}}$ controls distribution shift.
This critic-free, group-normalized update directly increases the probability of outputs that outperform the group's mean reward, reducing reward scaling dependence and enabling stable, sample-efficient RLHF-style fine-tuning (Li et al., 9 Apr 2025).
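As a minimal sketch of this normalization step, the NumPy snippet below computes group-relative advantages for one prompt; the reward values and the `eps` stabilizer are illustrative, not taken from the cited papers:

```python
import numpy as np

# Rewards for G = 4 sampled responses to a single prompt (illustrative values).
rewards = np.array([0.2, 0.9, 0.5, 0.4])

eps = 1e-6                                            # guards against zero variance
advantages = (rewards - rewards.mean()) / (rewards.std() + eps)

# Responses scoring above the group mean receive positive advantage and are
# reinforced; the rest are suppressed. Here: approx. [-1.18, 1.57, 0.00, -0.39].
print(advantages)
```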
2. Reward Design and Task Adaptation
GRPO-based fine-tuning is characterized by the use of task-specific, often explicitly verifiable reward functions. For video MLLMs such as VideoChat-R1, multiple rewards are combined:
- Format reward: $r_{\text{format}} = 1$ if the output matches the required tags, $0$ otherwise.
- IoU reward: $r_{\text{IoU}}$, the temporal or box overlap between prediction and ground truth, for grounding or tracking.
- Accuracy reward: $r_{\text{acc}} = 1$ if the answer is correct, $0$ otherwise.
- Caption/event recall reward: $r_{\text{recall}}$, judged by an LLM, quantifying semantic overlap.
Rewards are typically summed to yield a dense supervision signal:
- $R = r_{\text{format}} + r_{\text{IoU}}$ (grounding/tracking)
- $R = r_{\text{format}} + r_{\text{acc}}$ (QA/classification)
- $R = r_{\text{format}} + r_{\text{IoU}} + r_{\text{acc}}$ (grounded QA)
- $R = r_{\text{format}} + r_{\text{recall}}$ (captioning)
This modular approach allows the fine-tuning pipeline to be extended across domains by exchanging or stacking reward terms, including semantic alignment for medical VQA (Zhu et al., 20 May 2025), de-biasing with multi-dimensional rewards (Yixuan et al., 8 Nov 2025), and dense structural rewards such as Tree-Edit-Distance Similarity for table perception (Kang et al., 21 Sep 2025).
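As an illustration of this modularity, the sketch below composes a format check with a temporal-IoU term for a grounding-style task; the tag pattern and function names are assumptions for illustration, not the exact implementation of VideoChat-R1:

```python
import re

def format_reward(text: str) -> float:
    """1.0 if the response follows the expected tagged layout, else 0.0
    (the <think>/<answer> pattern is an assumed example format)."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.search(pattern, text, re.S) else 0.0

def temporal_iou(pred: tuple, gt: tuple) -> float:
    """Intersection-over-union of two (start, end) time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def grounding_reward(text: str, pred_span: tuple, gt_span: tuple) -> float:
    """Dense grounding reward: format term plus temporal-IoU term."""
    return format_reward(text) + temporal_iou(pred_span, gt_span)
```

Swapping `temporal_iou` for an accuracy check or an LLM-judged recall score yields the QA and captioning variants above.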
3. Fine-Tuning Algorithm and Implementation
The typical GRPO-based fine-tuning procedure, as implemented for VideoChat-R1 and related models, proceeds as follows:
```
Input: reference policy π_ref with parameters θ_ref
Initialize θ ← θ_ref
for epoch in 1…E:
    for each RL-batch B of N examples:
        for q in B:
            Sample G candidate responses {o_1, …, o_G} ~ π_{θ_old}(·|q)
            Compute rewards r_i = R(q, o_i)            (task-specific formula)
            Compute μ, σ = mean, std of {r_i}
            Compute group-advantages A_i = (r_i − μ) / σ
        Compute loss:
            L(θ) = − (1/(N·G)) Σ_q Σ_i [π_θ(o_i|q) / π_{θ_old}(o_i|q)] · A_i
                   + β · D_KL(π_θ(·|q) ‖ π_ref(·|q))
        Update θ ← θ − lr · ∇_θ L(θ)
        θ_old ← θ
```
Key implementation elements:
- No value function; all variance reduction is intra-group.
- Mini-batch processing is used for computational efficiency.
- KL penalty coefficient ($\beta$) is typically small (e.g., $\beta \in [0.01, 0.04]$); no entropy bonus is used.
- One or a few epochs over 10k–20k samples are usually sufficient.
- All parameters are updated, with no module freezing.
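A compact PyTorch-style sketch of the resulting loss is given below; it assumes sequence-level log-probabilities are already available for each sampled response and uses a simple sample-based log-ratio in place of the full KL term (both are simplifying assumptions, not the exact published implementation):

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, rewards, beta=0.04, eps=1e-6):
    """Critic-free GRPO loss for one prompt.

    logp_new, logp_old, logp_ref: [G] sequence log-probs of the G sampled
    responses under the current, sampling, and frozen reference policies.
    rewards: [G] scalar task rewards for the same responses.
    """
    # Group-normalized (z-scored) advantages; no value function anywhere.
    adv = (rewards - rewards.mean()) / (rewards.std() + eps)

    # Importance-weighted, ranking-style policy-gradient term.
    ratio = torch.exp(logp_new - logp_old.detach())
    pg_term = -(ratio * adv.detach()).mean()

    # Sample-based penalty toward the frozen reference policy.
    kl_term = (logp_new - logp_ref.detach()).mean()

    return pg_term + beta * kl_term
```

Token-level KL estimators are more common in practice; the scalar form is kept here only to keep the structure visible.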
For multi-task or multi-stage settings, tasks may be sampled uniformly per mini-batch to encourage cross-task generalization without overfitting (Li et al., 9 Apr 2025, Kang et al., 21 Sep 2025).
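One illustrative way to build such mixed mini-batches is to draw the task uniformly for each slot; the dictionary layout below is an assumption for illustration:

```python
import random

def sample_mixed_batch(task_datasets: dict, batch_size: int) -> list:
    """Draw a mini-batch with the task chosen uniformly at random per example,
    so no single task dominates any GRPO update."""
    tasks = list(task_datasets)
    batch = []
    for _ in range(batch_size):
        task = random.choice(tasks)                     # uniform over tasks
        batch.append((task, random.choice(task_datasets[task])))
    return batch
```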
4. Distinctive Features and Modifications versus PPO/SFT
GRPO departs from prior RLHF schemes and conventional PPO in multiple respects:
- Advantage estimation: replaces value-based TD or GAE with within-group normalization, yielding ranking-style updates.
- Clipping: standard versions omit ratio clipping, instead using a KL penalty toward a frozen reference policy to constrain drift; certain variants (audio QA, robotics) reintroduce PPO-style clipping (see the sketch after this list).
- No learned critic: avoids bias and variance due to value function approximation.
- Data efficiency: exploits within-group variance, allowing few epochs and tolerating high-variance reward settings (e.g., low-data multimodal tasks).
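The two surrogate variants can be contrasted as follows; the clip range $\epsilon = 0.2$ is an assumed PPO-style default, not a value reported in the cited papers:

```python
import torch

def surrogate_kl(ratio, adv, logp_new, logp_ref, beta=0.04):
    """Standard GRPO: unclipped importance weighting plus a KL-to-reference penalty."""
    return -(ratio * adv).mean() + beta * (logp_new - logp_ref).mean()

def surrogate_clipped(ratio, adv, clip_eps=0.2):
    """Variant used by some audio-QA/robotics adaptations: PPO-style ratio clipping."""
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return -torch.minimum(ratio * adv, clipped * adv).mean()
```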
Empirically, this design mitigates overfitting and collapse observed under SFT, providing stable gains even with small sample sizes (18k for VideoChat-R1, 8k for audio QA), and superior transfer to held-out domains (Li et al., 9 Apr 2025, Gibier et al., 18 Nov 2025).
Ablations consistently demonstrate that:
- SFT rapidly overfits and drops in accuracy on general benchmarks.
- GRPO remains robust and supports multi-task RL, with gains sustained across all sub-benchmarks (Li et al., 9 Apr 2025).
- Increasing epochs beyond one may lift in-domain metrics with minimal overfitting.
5. Empirical Performance and Task Coverage
In controlled studies, GRPO-based fine-tuning consistently outperforms SFT and alternative RL schemes across multiple modalities:
- VideoChat-R1: +31.8 mIoU (temporal grounding), +31.2 overlap (tracking) over base Qwen2.5-VL-7B; improvements on general VideoMME (+0.9), MVBench (+1.0), and Perception Test (+0.9).
- Audio QA: GRPO-tuned Qwen2.5-7B-Instruct with LoRA achieves 62.6% accuracy on DCASE 2025, leveraging group-normalized binary rewards (Gibier et al., 18 Nov 2025).
- Table understanding: Table-R1 staged GRPO aligns both structure and reasoning, surpassing SFT and larger models (Kang et al., 21 Sep 2025).
- Multi-reward de-biasing: Multi-dimensional GRPO, guided by a fairness classifier and linguistic quality metrics, substantially increases fairness (from 0.74 to 0.93) without loss of fluency (Yixuan et al., 8 Nov 2025).
Optimal results are consistently achieved via:
- Freezing the reference model at initialization.
- Moderate group sizes (G=4–8, or 16 for large-sample tasks).
- Adaptive reward balancing and dynamic scheduling for weighted reward components.
6. Practical Tuning Guidelines and Pitfalls
Recommended settings:
- Learning rate: 2–5×10⁻⁶ (AdamW)
- Group size: G=8
- Batch size: 4–16 (per device or prompt type)
- KL penalty: β=0.01–0.04
- Epochs: 1–3 (task and data size dependent)
- Sampling: temperature ≈ 1.0 with top-p (nucleus) sampling for candidate diversity
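These values can be collected into a plain configuration block; the dictionary below merely restates the guidelines above and is not the schema of any particular training library (the top-p value is an assumed default):

```python
# Illustrative GRPO fine-tuning configuration mirroring the guidelines above.
grpo_config = {
    "learning_rate": 3e-6,     # AdamW, within the 2-5e-6 range
    "group_size": 8,           # G candidate responses per prompt
    "batch_size": 8,           # per device or prompt type
    "kl_beta": 0.02,           # KL penalty toward the frozen reference
    "epochs": 2,               # 1-3 depending on task and data size
    "temperature": 1.0,        # sampling temperature for candidate diversity
    "top_p": 0.95,             # nucleus sampling cutoff (assumed value)
}
```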
Best practices include:
- Early-phase reward emphasis on text form, then gradually shifting weight to the final task reward (see the scheduling sketch after this list).
- Injecting small stochastic noise into rewards to improve exploration.
- Monitoring validation reward, not loss, for checkpoint selection.
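A minimal sketch of such a schedule, linearly shifting weight from the format term to the task term over a warm-up phase (the linear ramp and step count are illustrative assumptions):

```python
def scheduled_reward(format_r: float, task_r: float,
                     step: int, warmup_steps: int = 500) -> float:
    """Blend format and task rewards: emphasize well-formed outputs early,
    then hand weight over to the final task objective (illustrative ramp)."""
    w = min(step / warmup_steps, 1.0)        # ramps from 0 to 1 over warm-up
    return (1.0 - w) * format_r + w * task_r
```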
Pitfalls to avoid:
- Overweighting fairness or form too early can degrade informativeness (Yixuan et al., 8 Nov 2025).
- Low candidate diversity causes mode collapse.
- Reward miscalibration (e.g., overly aggressive length or fluency penalties) can induce pathological outputs.
7. Extensions, Domain Adaptation, and Impact
GRPO-based fine-tuning has demonstrated exceptional versatility:
- MLLMs for video, audio, image, and table reasoning (Li et al., 9 Apr 2025, Gibier et al., 18 Nov 2025, Kang et al., 21 Sep 2025, Gallici et al., 29 May 2025).
- Ethical alignment and structured de-biasing via multi-objective reward aggregation (Yixuan et al., 8 Nov 2025).
- Domain-specific adaptation, e.g., medical VQA with semantic alignment and specialized reward design (integration of BioGPT/BioMistral as external coherence judges) (Zhu et al., 20 May 2025).
- Efficient LoRA adaptation and quantization for deployment on resource-constrained platforms.
The general mechanism of GRPO—reward-driven, group-normalized, critic-free ranking—offers a practical framework for stable RLHF in tasks with verifiable or decomposable objectives, robust to low data regimes and cross-domain generalization requirements.
Summary table: Key features and settings for GRPO-based fine-tuning
| Component | Typical Setting | Rationale |
|---|---|---|
| Value function | None | Critic-free, group normalization instead |
| Group size (G) | 4–8 (16 for large data/tasks) | Balance variance and computation |
| KL penalty (β) | 0.01–0.04 | Controls policy drift |
| Reference model | θ_ref (frozen initial checkpoint) | Stable alignment |
| Candidate sampling | Temperature ≈ 1.0, top-p (nucleus) | Maintains diversity |
| Reward normalization | Group-wise (z-score or centered) | Scale-invariance, variance reduction |
| Optimization steps | 1–3 epochs | Rapid convergence, avoids overfitting |
This data-efficient, ranking-based RL paradigm is now foundational in state-of-the-art MLLM fine-tuning across diverse domains (Li et al., 9 Apr 2025, Yixuan et al., 8 Nov 2025, Gibier et al., 18 Nov 2025).