
GRPO Fine-Tuning in VAR Models

Updated 13 January 2026
  • The paper introduces a critic-free GRPO approach that employs group-wise reward normalization to fine-tune VAR models efficiently.
  • It constructs composite reward signals using AES and CLIP scores to guide prompt-aligned, aesthetically refined image generation.
  • GRPO’s trust-region policy optimization with group normalization accelerates convergence and improves stability, while the underlying VAR models offer markedly faster inference than diffusion-based methods.

Group Relative Policy Optimization (GRPO)-Based Reinforcement Fine-Tuning refers to a family of reinforcement learning algorithms that utilize group-wise normalization of scalar rewards to fine-tune large-scale autoregressive models—particularly visual autoregressive (VAR) architectures—without requiring explicit value networks. This critic-free approach, rooted in robust policy gradient methods, is designed to efficiently align generative outputs with nuanced human-centric reward signals and to maintain high computational throughput, an especially salient property for visual sequence modeling (Gallici et al., 29 May 2025).

1. Next-Scale Visual Autoregressive Model Architecture and Pre-Training

GRPO-based fine-tuning is most impactful when applied to next-scale VAR models, which structurally decompose an input image $x$ into $K$ discrete “scales” $r_1, r_2, \dots, r_K$. Each scale $r_k$ is an $h_k \times w_k$ map of discrete tokens produced by a Vector-Quantized VAE. The generative distribution factorizes coarse-to-fine:

$$p_\theta(x) = p_\theta([r_1, \dots, r_K]) = \prod_{k=1}^{K} p_\theta(r_k \mid r_{<k})$$

Pre-training proceeds via cross-entropy minimization over $N$ training images:

$$\mathcal{L}(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \log p_\theta\big([r_1^{(i)}, \dots, r_K^{(i)}]\big)$$

This “next-scale prediction” task preserves spatial locality by predicting entire downsampled feature maps in autoregressive order, instead of token-by-token next-pixel prediction.
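
To make the coarse-to-fine factorization and the pre-training objective concrete, the following is a minimal sketch (not the paper's implementation); the `var_model(prefix_scales, scale_index=k)` interface and the VQ-VAE token maps are assumed purely for illustration.

```python
import torch
import torch.nn.functional as F

def next_scale_nll(var_model, token_scales):
    """Negative log-likelihood of one image under the coarse-to-fine factorization.

    token_scales: list of K LongTensors; token_scales[k] has shape (h_k, w_k),
                  as produced by a (hypothetical) VQ-VAE tokenizer.
    var_model:    assumed callable taking the coarser scales r_{<k} and returning
                  logits of shape (h_k * w_k, vocab_size) for scale k.
    """
    nll = torch.zeros(())
    for k, r_k in enumerate(token_scales):
        logits = var_model(token_scales[:k], scale_index=k)  # models p_theta(r_k | r_{<k})
        targets = r_k.reshape(-1)                            # flatten the h_k x w_k token map
        nll = nll + F.cross_entropy(logits, targets, reduction="sum")
    return nll

def pretrain_loss(var_model, batch_of_token_scales):
    """Average per-image NLL over a mini-batch, mirroring L(theta) above."""
    losses = [next_scale_nll(var_model, scales) for scales in batch_of_token_scales]
    return torch.stack(losses).mean()
```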

2. Construction of Reward Signals for RL Fine-Tuning

GRPO-based fine-tuning relies on verifiable reward functions reflecting task-specific or perceptual criteria. Two principal reward signals, and their weighted combination, are used:

  • Aesthetic Predictor (AES): Returns a scalar in $[1, 10]$ by processing the CLIP embedding of the generated image $x$ through an MLP trained on human-provided aesthetics ratings.
  • CLIP Score: For prompt $c$, the reward evaluates $\langle E_{\text{visual}}(x), E_{\text{text}}(c)\rangle$ in CLIP space, measuring semantic alignment.
  • Combined Reward: Generalized as $r(x) = \lambda_a \cdot \text{AES}(x) + \lambda_c \cdot \text{CLIP\_score}(x, c)$ (see the sketch after this list). For constrained tasks (such as brightness), simpler reward models (e.g., a thresholded mean RGB value) are employed.
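
As one way such a composite reward could be assembled, the sketch below assumes hypothetical `clip_image_enc`, `clip_text_enc`, and `aes_mlp` callables (the latter being the MLP aesthetic head acting on CLIP image embeddings); the weights `lambda_a` and `lambda_c` are free hyperparameters.

```python
import torch.nn.functional as F

def combined_reward(image, prompt, clip_image_enc, clip_text_enc, aes_mlp,
                    lambda_a=1.0, lambda_c=1.0):
    """r(x) = lambda_a * AES(x) + lambda_c * CLIP_score(x, c) -- illustrative only.

    clip_image_enc / clip_text_enc: hypothetical encoders returning embeddings.
    aes_mlp: hypothetical MLP mapping a CLIP image embedding to a score in [1, 10].
    """
    img_emb = F.normalize(clip_image_enc(image), dim=-1)
    txt_emb = F.normalize(clip_text_enc(prompt), dim=-1)
    clip_score = (img_emb * txt_emb).sum(dim=-1)   # cosine similarity in CLIP space
    aes_score = aes_mlp(img_emb).squeeze(-1)       # aesthetic predictor on the image embedding
    return lambda_a * aes_score + lambda_c * clip_score
```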

3. Formalization of GRPO Workflow

GRPO modifies standard PPO by introducing group-wise advantage normalization and token-level importance weights:

  • Grouping: $G$ outputs are sampled per condition (e.g., per class label), yielding mini-batches of candidate generations.
  • Group-Relative Advantage: For group $g$, with rewards $\{r_1, \dots, r_G\}$,

$$A_i^g = \frac{r_i - \mu_g}{\sigma_g}$$

where $\mu_g$ and $\sigma_g$ are the group mean and standard deviation.

  • Importance Weights: For each token $a$ in group $g$:

$$w^g(s,a) = \frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}$$

  • GRPO Objective: The trust-region PPO-style surrogate,

$$J(\theta) = \mathbb{E}_{g \sim G}\left[ \mathbb{E}_{(s,a) \sim \pi_{\theta_{\text{old}}}^g}\left[ w^g(s,a)\, A_{\text{old}}^g(s,a) \right] \right]$$

subject to an average KL constraint,

$$\mathbb{E}_{g \sim G}\left[ D_{\mathrm{KL}}\!\left( \pi_{\theta_{\text{old}}}^g \,\|\, \pi_\theta^g \right) \right] \leq \delta$$

In practice, a clipped surrogate loss with a fixed clipping parameter $\epsilon$ and a KL regularization coefficient $\beta$ is used (a schematic implementation follows the formula):

$$L(\theta) = \mathbb{E}_g\, \mathbb{E}_i \left[ \min\!\big( w_i A_i,\ \mathrm{clip}(w_i, 1 - \epsilon, 1 + \epsilon)\, A_i \big) \right] - \beta\, \mathbb{E}_g\, D_{\mathrm{KL}}\!\left( \pi_{\theta_{\text{old}}}^g \,\|\, \pi_\theta^g \right)$$
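
The clipped objective for a single group can be rendered schematically as follows; the per-token log-probabilities are assumed to have shape (G, T), and the naive Monte Carlo KL estimator is one convenient choice rather than necessarily the paper's exact formulation.

```python
import torch

def grpo_loss(logp_new, logp_old, rewards, eps=0.2, beta=0.2):
    """Schematic GRPO loss for one group of G sampled generations.

    logp_new, logp_old: (G, T) per-token log-probs under pi_theta and pi_theta_old.
    rewards:            (G,) scalar rewards for the G samples in the group.
    """
    # Group-relative advantage: normalize rewards within the group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)       # (G,)
    adv = adv.unsqueeze(-1)                                         # broadcast over tokens

    # Token-level importance weights w = pi_theta / pi_theta_old.
    ratio = torch.exp(logp_new - logp_old.detach())                 # (G, T)

    # Clipped PPO-style surrogate (maximized, hence the minus sign below).
    surrogate = torch.minimum(ratio * adv,
                              torch.clamp(ratio, 1 - eps, 1 + eps) * adv)

    # Naive Monte Carlo estimate of KL(pi_theta_old || pi_theta) on samples from pi_theta_old.
    kl = (logp_old.detach() - logp_new).mean()

    return -surrogate.mean() + beta * kl
```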

4. Algorithmic Structure and Implementation

The fine-tuning process follows a clear, staged workflow (a schematic training loop is sketched after the list):

  1. Initialization: Set $\theta \leftarrow \theta_{\text{pretrained}}$ (VAR trained on ImageNet).
  2. Iterative RL-Driven Training:
    • Randomly select $C$ class labels from the ImageNet class pool.
    • For each label, sample $G$ images from $\pi_{\theta_{\text{old}}}$ using multicategorical sampling at temperature $\tau$.
    • Compute group rewards using AES/CLIP.
    • Calculate intra-group mean and standard deviation to derive normalized advantages for all samples.
    • Compute per-token gradients, accumulate the surrogate (clipped) loss plus the KL penalty, and update weights via Adam ($\text{lr} = 1 \times 10^{-4}$, $\epsilon = 0.2$, $\beta = 0.2$).
    • Sync $\theta_{\text{old}} \leftarrow \theta$ at regular intervals. No separate value network is used; GRPO's group-wise normalization serves that role.
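
The steps above might be wired together roughly as in the loop below; `sample_fn`, `reward_fn`, and `var_model.log_prob` are assumed interfaces rather than any specific library's API, `grpo_loss` refers to the sketch in Section 3, and the default values of `C` and `G` are placeholders.

```python
import copy
import random
import torch

def grpo_finetune(var_model, class_pool, sample_fn, reward_fn, num_steps,
                  C=8, G=16, tau=1.0, sync_every=100, lr=1e-4, eps=0.2, beta=0.2):
    """Schematic GRPO fine-tuning loop; all helper interfaces are assumptions."""
    policy_old = copy.deepcopy(var_model).eval()              # frozen sampling policy pi_theta_old
    optimizer = torch.optim.Adam(var_model.parameters(), lr=lr)

    for step in range(num_steps):
        labels = random.sample(class_pool, C)                  # C class labels per iteration
        total_loss = torch.zeros(())
        for c in labels:
            # G candidate images plus per-token log-probs under pi_theta_old (assumed helper).
            images, logp_old = sample_fn(policy_old, c, G, temperature=tau)
            rewards = torch.stack([reward_fn(x, c) for x in images])   # AES/CLIP rewards
            logp_new = var_model.log_prob(images, c)           # re-score under current policy (assumed API)
            total_loss = total_loss + grpo_loss(logp_new, logp_old, rewards, eps=eps, beta=beta)

        optimizer.zero_grad()
        (total_loss / C).backward()
        optimizer.step()

        if (step + 1) % sync_every == 0:                       # periodic sync: theta_old <- theta
            policy_old.load_state_dict(var_model.state_dict())
```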

5. Experimental Paradigm and Evaluation Metrics

Experiments utilize both mid-scale ($310$M, VAR-d16) and large-scale ($2$B, VAR-d30) pretrained models. Fine-tuning conditions on sampled ImageNet class labels or fixed text prompts (for the CLIP reward). Primary metrics include:

  • Aesthetic Score (AES): As output by the MLP/CLIP scheme, measured over $10$K images.
  • CLIP Score: Alignment with semantic prompts.
  • ResNet50 Top-5 Accuracy: Detects distributional drift from the ImageNet training regime.
  • FID: Optionally used to benchmark against diffusion models.

Ablations vary the KL penalty $\beta$ and the group size $G$, and train on partial label splits to measure generalization; an illustrative drift-check computation is sketched below.
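
As a rough illustration of the distributional drift check, the snippet below scores a batch of generated images with a pretrained ResNet-50 and reports top-5 accuracy against the conditioning labels; torchvision is just one possible source of the classifier, and ImageNet preprocessing is assumed to have been applied upstream.

```python
import torch
from torchvision.models import resnet50, ResNet50_Weights

@torch.no_grad()
def top5_accuracy(images, labels):
    """Fraction of generated images whose conditioning class is in ResNet-50's top-5.

    images: (N, 3, 224, 224) tensor, already normalized with ImageNet statistics.
    labels: (N,) ImageNet class indices used to condition the generations.
    """
    classifier = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2).eval()
    logits = classifier(images)
    top5 = logits.topk(5, dim=-1).indices                 # (N, 5) predicted classes
    hits = (top5 == labels.unsqueeze(-1)).any(dim=-1)     # conditioning class among top-5?
    return hits.float().mean().item()
```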

6. Empirical Results and Analyses

Key empirical outcomes:

  • Toy Reward (Brightness): Rapid convergence ($\lesssim 10$ min on an H100) to exclusively bright/dark image generation; the reward stabilizes at $1$.
  • Aesthetic (AES/CLIP) Fine-Tuning: $40$K gradient steps ($\sim 16$ hr for d16, $\sim 40$ hr for d30) yield:
    • VAR-d30 AES increases from $4.80 \rightarrow 5.80$, with ResNet top-5 accuracy remaining at $\sim 90\%$.
    • VAR-d16 gains $\sim 0.5$ AES even on withheld label classes.
  • CLIP Alignment (Style Transfer): In $10$ hr, CLIP alignment doubles; the model synthesizes unseen prompts and non-ImageNet styles.
  • Ablation: An excessively low $\beta$ causes reward hacking and label collapse, while a high $\beta$ blocks improvement. Larger $G$ increases stability and metric gains.
  • Inference Speed: VAR models offer $\sim 10\times$ faster sampling than diffusion approaches, making online RL practical.

7. Broader Implications and Recommendations

GRPO-based fine-tuning provides a robust, critic-free approach for aligning high-throughput VAR models with human-centric objectives (Gallici et al., 29 May 2025). Its group-based normalization reduces gradient variance and removes the need for explicit value functions, supporting:

  • Efficient Online RL: Fast autoregressive models accommodate large-sample RL loops without the prohibitive slowdowns typical of diffusion-based alternatives.
  • Precise Alignment: The joint use of AES and CLIP rewards enables fine-grained control over both aesthetic quality and prompt-driven style, maintaining classification integrity relative to the base model.
  • Generalization Beyond Pretraining: RL-driven exploration allows VAR models to synthesize prompt-aligned outputs not represented in the original training data.
  • Methodological Stability: GRPO’s normalization and trust-region constraints enable large-scale fine-tuning with rapid convergence and robust safety against reward exploitation.

In conclusion, GRPO-based reinforcement fine-tuning for visual autoregressive models represents an efficient paradigm for scaling RL alignment protocols to highly performant, generative architectures while retaining computational and modeling tractability (Gallici et al., 29 May 2025).
