GRPO Fine-Tuning in VAR Models
- The paper introduces a critic-free GRPO approach that employs group-wise reward normalization to fine-tune VAR models efficiently.
- It constructs composite reward signals using AES and CLIP scores to guide prompt-aligned, aesthetically refined image generation.
- GRPO’s trust-region optimization with group-wise normalization accelerates convergence and improves stability, while the VAR backbone’s fast sampling keeps online RL far cheaper than diffusion-based alternatives.
Group Relative Policy Optimization (GRPO)-Based Reinforcement Fine-Tuning refers to a family of reinforcement learning algorithms that utilize group-wise normalization of scalar rewards to fine-tune large-scale autoregressive models—particularly visual autoregressive (VAR) architectures—without requiring explicit value networks. This critic-free approach, rooted in robust policy gradient methods, is designed to efficiently align generative outputs with nuanced human-centric reward signals and to maintain high computational throughput, an especially salient property for visual sequence modeling (Gallici et al., 29 May 2025).
1. Next-Scale Visual Autoregressive Model Architecture and Pre-Training
GRPO-based fine-tuning is most impactful when applied to next-scale VAR models, which structurally decompose an input image into a sequence of discrete “scales” $r_1, r_2, \dots, r_K$ of increasing resolution. Each scale $r_k$ is a map of discrete tokens produced by a Vector-Quantized VAE. The generative distribution factorizes coarse-to-fine:

$$p_\theta(r_1, \dots, r_K) = \prod_{k=1}^{K} p_\theta\big(r_k \mid r_1, \dots, r_{k-1}\big).$$

Pre-training proceeds via cross-entropy minimization over training images:

$$\mathcal{L}_{\text{pre}}(\theta) = -\,\mathbb{E}_{x}\left[\sum_{k=1}^{K} \log p_\theta\big(r_k \mid r_{<k}\big)\right].$$

This “next-scale prediction” task preserves spatial locality by predicting entire downsampled token maps in autoregressive order, rather than token-by-token next-pixel prediction.
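The following is a minimal sketch of this next-scale cross-entropy objective, assuming a hypothetical `var_model` that returns token logits for scale $k$ given all coarser scales and a `vqvae` tokenizer that yields the per-scale token maps; both interfaces are illustrative, not the paper's code.

```python
import torch
import torch.nn.functional as F

def next_scale_pretrain_loss(var_model, vqvae, images):
    """Cross-entropy over all scales, predicted coarse-to-fine.

    `vqvae.encode_scales` and `var_model` are assumed interfaces: the
    tokenizer returns a list of token maps r_1..r_K (coarse to fine),
    and the model returns logits for scale k given all scales < k.
    """
    scales = vqvae.encode_scales(images)          # list of LongTensors, one per scale
    loss, total_tokens = 0.0, 0
    for k, r_k in enumerate(scales):
        logits = var_model(scales[:k])            # predict scale k from scales < k
        loss = loss + F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),  # (num_tokens, vocab_size)
            r_k.reshape(-1),                      # (num_tokens,)
            reduction="sum",
        )
        total_tokens += r_k.numel()
    return loss / total_tokens                    # mean over all predicted tokens
```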
2. Construction of Reward Signals for RL Fine-Tuning
GRPO-based fine-tuning relies on verifiable reward functions reflecting task-specific or perceptual criteria. Two principal reward types are used:
- Aesthetic Predictor (AES): returns a scalar aesthetic score by passing the CLIP embedding of the generated image through an MLP trained on human-provided aesthetics ratings.
- CLIP Score: for a text prompt $c$ and generated image $x$, the reward is the cosine similarity between their CLIP embeddings, $R_{\text{CLIP}}(x, c) = \cos\big(E_I(x), E_T(c)\big)$, measuring semantic alignment.
- Combined Reward: a weighted combination of the two terms, $R(x, c) = \lambda_{\text{AES}}\, R_{\text{AES}}(x) + \lambda_{\text{CLIP}}\, R_{\text{CLIP}}(x, c)$. For constrained tasks (such as brightness), simpler reward models (e.g., thresholded mean RGB) are employed. An implementation sketch follows the list.
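A hedged sketch of such a composite reward, using the Hugging Face `transformers` CLIP model for both terms; the `aes_mlp` head and the weights `w_aes`/`w_clip` are illustrative stand-ins, not the paper's exact reward models.

```python
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

# Illustrative aesthetic head: an MLP over CLIP image embeddings, standing in
# for a predictor trained on human aesthetics ratings.
aes_mlp = torch.nn.Sequential(
    torch.nn.Linear(clip.config.projection_dim, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 1),
)

@torch.no_grad()
def composite_reward(pil_images, prompts, w_aes=1.0, w_clip=1.0):
    inputs = processor(text=prompts, images=pil_images,
                       return_tensors="pt", padding=True)
    img_emb = clip.get_image_features(pixel_values=inputs["pixel_values"])
    txt_emb = clip.get_text_features(input_ids=inputs["input_ids"],
                                     attention_mask=inputs["attention_mask"])
    # CLIP score: cosine similarity between image and prompt embeddings.
    clip_score = F.cosine_similarity(img_emb, txt_emb, dim=-1)
    # Aesthetic score: scalar output of the MLP on the image embedding.
    aes_score = aes_mlp(img_emb).squeeze(-1)
    return w_aes * aes_score + w_clip * clip_score      # (batch,) scalar rewards
```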
3. Formalization of GRPO Workflow
GRPO modifies standard PPO by introducing group-wise advantage normalization and token-level importance weights:
- Grouping: $G$ outputs are sampled per condition (e.g., per class label), yielding mini-batches of candidate generations.
- Group-Relative Advantage: for a group of $G$ samples with rewards $\{R_i\}_{i=1}^{G}$, $\hat{A}_i = (R_i - \mu)/(\sigma + \epsilon)$, where $\mu$ and $\sigma$ are the group mean and standard deviation and $\epsilon$ is a small stabilizing constant.
- Importance Weights: for each token $t$ of sample $i$ in the group, $\rho_{i,t} = \pi_\theta(a_{i,t} \mid s_{i,t}) / \pi_{\theta_{\text{old}}}(a_{i,t} \mid s_{i,t})$.
- GRPO Objective: the trust-region PPO-style surrogate

  $$J(\theta) = \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{T_i}\sum_{t=1}^{T_i} \min\!\Big(\rho_{i,t}\hat{A}_i,\ \operatorname{clip}\big(\rho_{i,t},\, 1-\varepsilon,\, 1+\varepsilon\big)\hat{A}_i\Big)\right],$$

  subject to an average KL constraint against the frozen reference policy, $\mathbb{E}\big[D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\text{ref}}\big)\big] \le \delta$.
In practice, the clipped surrogate loss with a fixed clip range $\varepsilon$ and KL regularization (coefficient $\beta$) is used:

$$\mathcal{L}(\theta) = -J(\theta) + \beta\,\mathbb{E}\big[D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\text{ref}}\big)\big],$$

as sketched in the code below.
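A compact sketch of the group-relative advantage and the clipped, KL-regularized surrogate under the notation above; tensor shapes, the k3 KL estimator, and the default coefficients are assumptions rather than the paper's exact implementation.

```python
import torch

def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within a group of G samples: (R_i - mu) / (sigma + eps)."""
    mu, sigma = rewards.mean(), rewards.std()
    return (rewards - mu) / (sigma + eps)

def grpo_loss(logp_new, logp_old, logp_ref, rewards,
              clip_eps=0.2, kl_coef=0.04):
    """Clipped PPO-style surrogate with group-normalized advantages.

    logp_* : (G, T) per-token log-probs under the current, sampling, and
             frozen reference policies. rewards : (G,) scalar rewards.
    clip_eps and kl_coef are illustrative defaults, not the reported values.
    """
    adv = group_advantages(rewards).unsqueeze(-1)           # (G, 1), broadcast over tokens
    ratio = torch.exp(logp_new - logp_old)                  # token-level importance weights
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    surrogate = torch.min(unclipped, clipped).mean()
    # k3 estimator of KL(pi_theta || pi_ref), commonly used with GRPO.
    log_r = logp_ref - logp_new
    kl = (torch.exp(log_r) - log_r - 1).mean()
    return -(surrogate - kl_coef * kl)                      # minimized by the optimizer
```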
4. Algorithmic Structure and Implementation
The fine-tuning process follows a clear, staged workflow:
- Initialization: set the policy $\pi_\theta$ to the pre-trained VAR checkpoint (trained on ImageNet), keeping a frozen copy as the KL reference $\pi_{\text{ref}}$.
- Iterative RL-Driven Training:
  - Randomly select class labels from the ImageNet class pool.
  - For each label, sample a group of $G$ images from the sampling policy using multicategorical sampling at temperature $\tau$.
  - Compute group rewards using AES/CLIP.
  - Calculate the intra-group mean $\mu$ and standard deviation $\sigma$ to derive normalized advantages for all samples.
  - Compute per-token gradients, accumulate the surrogate (clipped) loss plus the KL penalty, and update weights via Adam.
  - Sync the sampling policy $\pi_{\theta_{\text{old}}} \leftarrow \pi_\theta$ at fixed intervals. No separate value network is used; GRPO's group normalization takes its place as the baseline (a schematic loop is sketched below).
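Putting the steps together, a schematic training loop under the same assumed interfaces (`sample_images`, `log_probs`, the `grpo_loss` sketched above); all hyperparameters shown are placeholders, not the reported settings.

```python
import copy
import random
import torch

def finetune_var_grpo(policy, reward_fn, class_pool, num_iters=1000,
                      labels_per_iter=8, group_size=16, temperature=1.0,
                      sync_every=10, lr=1e-6):
    """Schematic GRPO loop; `policy.sample_images` / `policy.log_probs`
    are assumed interfaces, and all hyperparameters are placeholders."""
    ref = copy.deepcopy(policy).eval()       # frozen KL reference policy
    old = copy.deepcopy(policy).eval()       # sampling / importance-weight policy
    opt = torch.optim.Adam(policy.parameters(), lr=lr)

    for it in range(num_iters):
        labels = random.sample(class_pool, labels_per_iter)
        for y in labels:
            with torch.no_grad():
                # Group of G candidates for this condition (multicategorical, temperature tau).
                images, logp_old = old.sample_images(y, n=group_size,
                                                     temperature=temperature)
                rewards = reward_fn(images, y)        # (G,) scalars from the AES/CLIP reward
                logp_ref = ref.log_probs(images, y)   # (G, T) per-token log-probs
            logp_new = policy.log_probs(images, y)    # (G, T), carries gradients
            loss = grpo_loss(logp_new, logp_old, logp_ref, rewards)
            opt.zero_grad()
            loss.backward()
            opt.step()
        if (it + 1) % sync_every == 0:
            old.load_state_dict(policy.state_dict())  # periodic sync of sampling policy
    return policy
```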
5. Experimental Paradigm and Evaluation Metrics
Experiments utilize both mid-scale ($310$M, VAR-d16) and large-scale ($2$B, VAR-d30) pretrained models. Fine-tuning samples ImageNet labels or fixed text prompts (for CLIP reward). Primary metrics include:
- Aesthetic Score (AES): As output by the MLP/CLIP scheme, measured over $10$K images.
- CLIP Score: Alignment with semantic prompts.
- ResNet50 Top-5 Accuracy: Detects distributional drift from the ImageNet training regime.
- FID: Optionally used to benchmark against diffusion models.
Ablations vary the KL penalty $\beta$ and the group size $G$, and train on partial label splits to measure generalization.
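As a concrete example of the drift check, a hedged sketch of the ResNet-50 top-5 metric over generated samples using torchvision's pretrained classifier; the sampling interface and per-label sample count are again illustrative.

```python
import torch
from torchvision.models import resnet50, ResNet50_Weights

@torch.no_grad()
def top5_accuracy(policy, labels, samples_per_label=10):
    """Fraction of generated images whose conditioning label appears in the
    ResNet-50 top-5 predictions; a proxy for drift from ImageNet semantics."""
    weights = ResNet50_Weights.IMAGENET1K_V2
    classifier = resnet50(weights=weights).eval()
    preprocess = weights.transforms()               # standard ImageNet eval transforms
    hits, total = 0, 0
    for y in labels:
        images, _ = policy.sample_images(y, n=samples_per_label)  # assumed interface, as above
        batch = torch.stack([preprocess(img) for img in images])
        top5 = classifier(batch).topk(5, dim=-1).indices          # (n, 5) predicted class ids
        hits += (top5 == y).any(dim=-1).sum().item()
        total += samples_per_label
    return hits / total
```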
6. Empirical Results and Analyses
Key empirical outcomes:
- Toy Reward (Brightness): rapid convergence (within minutes on an H100) to exclusively bright or dark image generation; the reward stabilizes at $1$.
- Aesthetic (AES/CLIP) Fine-Tuning: $40$K gradient steps (wall-clock training on the order of hours for both d16 and d30) yield:
  - VAR-d30: AES increases over the base model while ResNet top-5 accuracy is largely preserved.
  - VAR-d16: AES gains persist even on withheld label classes.
- CLIP Alignment (Style Transfer): in $10$ hr, CLIP alignment doubles; the model synthesizes images for unseen prompts and non-ImageNet styles.
- Ablation: an excessively low KL coefficient $\beta$ causes reward hacking and label collapse, while an overly high $\beta$ blocks improvement; a larger group size $G$ increases stability and metric gains.
- Inference Speed: VAR models offer faster sampling than diffusion approaches, making online RL practical.
7. Broader Implications and Recommendations
GRPO-based fine-tuning provides a robust, critic-free approach for aligning high-throughput VAR models with human-centric objectives (Gallici et al., 29 May 2025). Its group-based normalization reduces gradient variance and removes the need for an explicit value function, supporting:
- Efficient Online RL: Fast autoregressive models accommodate large-sample RL loops without the prohibitive slowdowns typical of diffusion-based alternatives.
- Precise Alignment: The joint use of AES and CLIP rewards enables fine-grained control over both aesthetic quality and prompt-driven style, maintaining classification integrity relative to the base model.
- Generalization Beyond Pretraining: RL-driven exploration allows VAR models to synthesize prompt-aligned outputs not represented in the original training data.
- Methodological Stability: GRPO’s normalization and trust-region constraints enable large-scale fine-tuning with rapid convergence and robust safety against reward exploitation.
In conclusion, GRPO-based reinforcement fine-tuning for visual autoregressive models represents an efficient paradigm for scaling RL alignment protocols to highly performant, generative architectures while retaining computational and modeling tractability (Gallici et al., 29 May 2025).