GRPO Fine-Tuning in VAR Models
- The paper introduces a critic-free GRPO approach that employs group-wise reward normalization to fine-tune VAR models efficiently.
- It constructs composite reward signals using AES and CLIP scores to guide prompt-aligned, aesthetically refined image generation.
- GRPO’s trust-region optimization with group-wise normalization accelerates convergence and improves stability, while the VAR backbone’s fast sampling keeps online RL far cheaper than diffusion-based alternatives.
Group Relative Policy Optimization (GRPO)-Based Reinforcement Fine-Tuning refers to a family of reinforcement learning algorithms that utilize group-wise normalization of scalar rewards to fine-tune large-scale autoregressive models—particularly visual autoregressive (VAR) architectures—without requiring explicit value networks. This critic-free approach, rooted in robust policy gradient methods, is designed to efficiently align generative outputs with nuanced human-centric reward signals and to maintain high computational throughput, an especially salient property for visual sequence modeling (Gallici et al., 29 May 2025).
1. Next-Scale Visual Autoregressive Model Architecture and Pre-Training
GRPO-based fine-tuning is most impactful when applied to next-scale VAR models, which structurally decompose an input image into a sequence of discrete “scales” $r_1, r_2, \dots, r_K$ of increasing resolution. Each scale $r_k$ is a map of discrete tokens produced by a Vector-Quantized VAE. The generative distribution factorizes coarse-to-fine:

$$p_\theta(r_1, \dots, r_K) = \prod_{k=1}^{K} p_\theta\big(r_k \mid r_1, \dots, r_{k-1}\big).$$

Pre-training proceeds via cross-entropy minimization over training images:

$$\mathcal{L}_{\text{pre}}(\theta) = -\,\mathbb{E}_{x}\left[\sum_{k=1}^{K} \log p_\theta\big(r_k \mid r_{<k}\big)\right].$$

This “next-scale prediction” task preserves spatial locality by predicting entire downsampled token maps in autoregressive order, rather than token-by-token next-pixel prediction.
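The following is a minimal sketch of this next-scale cross-entropy objective, assuming a hypothetical `var_model` that returns token logits for scale $k$ given all coarser scales and a `vqvae` tokenizer that yields the per-scale token maps; both interfaces are illustrative, not the paper's code.

```python
import torch
import torch.nn.functional as F

def next_scale_pretrain_loss(var_model, vqvae, images):
    """Cross-entropy over all scales, predicted coarse-to-fine.

    `vqvae.encode_scales` and `var_model` are assumed interfaces: the
    tokenizer returns a list of token maps r_1..r_K (coarse to fine),
    and the model returns logits for scale k given all scales < k.
    """
    scales = vqvae.encode_scales(images)          # list of LongTensors, one per scale
    loss, total_tokens = 0.0, 0
    for k, r_k in enumerate(scales):
        logits = var_model(scales[:k])            # predict scale k from scales < k
        loss = loss + F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),  # (num_tokens, vocab_size)
            r_k.reshape(-1),                      # (num_tokens,)
            reduction="sum",
        )
        total_tokens += r_k.numel()
    return loss / total_tokens                    # mean over all predicted tokens
```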
2. Construction of Reward Signals for RL Fine-Tuning
GRPO-based fine-tuning relies on verifiable reward functions reflecting task-specific or perceptual criteria. Two principal reward types are used:
- Aesthetic Predictor (AES): returns a scalar aesthetic score by passing the CLIP embedding of the generated image through an MLP trained on human-provided aesthetics ratings.
- CLIP Score: for a text prompt $c$ and generated image $x$, the reward is the cosine similarity between their CLIP embeddings, $R_{\text{CLIP}}(x, c) = \cos\big(E_I(x), E_T(c)\big)$, measuring semantic alignment.
- Combined Reward: a weighted combination of the two terms, $R(x, c) = \lambda_{\text{AES}}\, R_{\text{AES}}(x) + \lambda_{\text{CLIP}}\, R_{\text{CLIP}}(x, c)$. For constrained tasks (such as brightness), simpler reward models (e.g., thresholded mean RGB) are employed. An implementation sketch follows the list.
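A hedged sketch of such a composite reward, using the Hugging Face `transformers` CLIP model for both terms; the `aes_mlp` head and the weights `w_aes`/`w_clip` are illustrative stand-ins, not the paper's exact reward models.

```python
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

# Illustrative aesthetic head: an MLP over CLIP image embeddings, standing in
# for a predictor trained on human aesthetics ratings.
aes_mlp = torch.nn.Sequential(
    torch.nn.Linear(clip.config.projection_dim, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 1),
)

@torch.no_grad()
def composite_reward(pil_images, prompts, w_aes=1.0, w_clip=1.0):
    inputs = processor(text=prompts, images=pil_images,
                       return_tensors="pt", padding=True)
    img_emb = clip.get_image_features(pixel_values=inputs["pixel_values"])
    txt_emb = clip.get_text_features(input_ids=inputs["input_ids"],
                                     attention_mask=inputs["attention_mask"])
    # CLIP score: cosine similarity between image and prompt embeddings.
    clip_score = F.cosine_similarity(img_emb, txt_emb, dim=-1)
    # Aesthetic score: scalar output of the MLP on the image embedding.
    aes_score = aes_mlp(img_emb).squeeze(-1)
    return w_aes * aes_score + w_clip * clip_score      # (batch,) scalar rewards
```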
3. Formalization of GRPO Workflow
GRPO modifies standard PPO by introducing group-wise advantage normalization and token-level importance weights:
- Grouping: $G$ outputs are sampled per condition (e.g., per class label), yielding mini-batches of candidate generations.
- Group-Relative Advantage: for a group of $G$ samples with rewards $\{R_i\}_{i=1}^{G}$, $\hat{A}_i = (R_i - \mu)/(\sigma + \epsilon)$, where $\mu$ and $\sigma$ are the group mean and standard deviation and $\epsilon$ is a small stabilizing constant.
- Importance Weights: for each token $t$ of sample $i$ in the group, $\rho_{i,t} = \pi_\theta(a_{i,t} \mid s_{i,t}) / \pi_{\theta_{\text{old}}}(a_{i,t} \mid s_{i,t})$.
- GRPO Objective: the trust-region PPO-style surrogate

  $$J(\theta) = \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{T_i}\sum_{t=1}^{T_i} \min\!\Big(\rho_{i,t}\hat{A}_i,\ \operatorname{clip}\big(\rho_{i,t},\, 1-\varepsilon,\, 1+\varepsilon\big)\hat{A}_i\Big)\right],$$

  subject to an average KL constraint against the frozen reference policy, $\mathbb{E}\big[D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\text{ref}}\big)\big] \le \delta$.
In practice, the clipped surrogate loss with a fixed clip range $\varepsilon$ and KL regularization (coefficient $\beta$) is used:

$$\mathcal{L}(\theta) = -J(\theta) + \beta\,\mathbb{E}\big[D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\text{ref}}\big)\big],$$

as sketched in the code below.
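A compact sketch of the group-relative advantage and the clipped, KL-regularized surrogate under the notation above; tensor shapes, the k3 KL estimator, and the default coefficients are assumptions rather than the paper's exact implementation.

```python
import torch

def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within a group of G samples: (R_i - mu) / (sigma + eps)."""
    mu, sigma = rewards.mean(), rewards.std()
    return (rewards - mu) / (sigma + eps)

def grpo_loss(logp_new, logp_old, logp_ref, rewards,
              clip_eps=0.2, kl_coef=0.04):
    """Clipped PPO-style surrogate with group-normalized advantages.

    logp_* : (G, T) per-token log-probs under the current, sampling, and
             frozen reference policies. rewards : (G,) scalar rewards.
    clip_eps and kl_coef are illustrative defaults, not the reported values.
    """
    adv = group_advantages(rewards).unsqueeze(-1)           # (G, 1), broadcast over tokens
    ratio = torch.exp(logp_new - logp_old)                  # token-level importance weights
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    surrogate = torch.min(unclipped, clipped).mean()
    # k3 estimator of KL(pi_theta || pi_ref), commonly used with GRPO.
    log_r = logp_ref - logp_new
    kl = (torch.exp(log_r) - log_r - 1).mean()
    return -(surrogate - kl_coef * kl)                      # minimized by the optimizer
```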
4. Algorithmic Structure and Implementation
The fine-tuning process follows a clear, staged workflow:
- Initialization: set the policy $\pi_\theta$ to the pre-trained VAR checkpoint (trained on ImageNet), keeping a frozen copy as the KL reference $\pi_{\text{ref}}$.
- Iterative RL-Driven Training:
  - Randomly select class labels from the ImageNet class pool.
  - For each label, sample a group of $G$ images from the sampling policy using multicategorical sampling at temperature $\tau$.
  - Compute group rewards using AES/CLIP.
  - Calculate the intra-group mean $\mu$ and standard deviation $\sigma$ to derive normalized advantages for all samples.
  - Compute per-token gradients, accumulate the surrogate (clipped) loss plus the KL penalty, and update weights via Adam.
  - Sync the sampling policy $\pi_{\theta_{\text{old}}} \leftarrow \pi_\theta$ at fixed intervals. No separate value network is used; GRPO's group normalization takes its place as the baseline (a schematic loop is sketched below).
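Putting the steps together, a schematic training loop under the same assumed interfaces (`sample_images`, `log_probs`, the `grpo_loss` sketched above); all hyperparameters shown are placeholders, not the reported settings.

```python
import copy
import random
import torch

def finetune_var_grpo(policy, reward_fn, class_pool, num_iters=1000,
                      labels_per_iter=8, group_size=16, temperature=1.0,
                      sync_every=10, lr=1e-6):
    """Schematic GRPO loop; `policy.sample_images` / `policy.log_probs`
    are assumed interfaces, and all hyperparameters are placeholders."""
    ref = copy.deepcopy(policy).eval()       # frozen KL reference policy
    old = copy.deepcopy(policy).eval()       # sampling / importance-weight policy
    opt = torch.optim.Adam(policy.parameters(), lr=lr)

    for it in range(num_iters):
        labels = random.sample(class_pool, labels_per_iter)
        for y in labels:
            with torch.no_grad():
                # Group of G candidates for this condition (multicategorical, temperature tau).
                images, logp_old = old.sample_images(y, n=group_size,
                                                     temperature=temperature)
                rewards = reward_fn(images, y)        # (G,) scalars from the AES/CLIP reward
                logp_ref = ref.log_probs(images, y)   # (G, T) per-token log-probs
            logp_new = policy.log_probs(images, y)    # (G, T), carries gradients
            loss = grpo_loss(logp_new, logp_old, logp_ref, rewards)
            opt.zero_grad()
            loss.backward()
            opt.step()
        if (it + 1) % sync_every == 0:
            old.load_state_dict(policy.state_dict())  # periodic sync of sampling policy
    return policy
```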
5. Experimental Paradigm and Evaluation Metrics
Experiments utilize both mid-scale ($310$M, VAR-d16) and large-scale ($2$B, VAR-d30) pretrained models. Fine-tuning samples ImageNet labels or fixed text prompts (for CLIP reward). Primary metrics include:
- Aesthetic Score (AES): As output by the MLP/CLIP scheme, measured over $10$K images.
- CLIP Score: Alignment with semantic prompts.
- ResNet50 Top-5 Accuracy: Detects distributional drift from the ImageNet training regime.
- FID: Optionally used to benchmark against diffusion models.
Ablations vary the KL penalty $\beta$ and the group size $G$, and train on partial label splits to measure generalization.
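As a concrete example of the drift check, a hedged sketch of the ResNet-50 top-5 metric over generated samples using torchvision's pretrained classifier; the sampling interface and per-label sample count are again illustrative.

```python
import torch
from torchvision.models import resnet50, ResNet50_Weights

@torch.no_grad()
def top5_accuracy(policy, labels, samples_per_label=10):
    """Fraction of generated images whose conditioning label appears in the
    ResNet-50 top-5 predictions; a proxy for drift from ImageNet semantics."""
    weights = ResNet50_Weights.IMAGENET1K_V2
    classifier = resnet50(weights=weights).eval()
    preprocess = weights.transforms()               # standard ImageNet eval transforms
    hits, total = 0, 0
    for y in labels:
        images, _ = policy.sample_images(y, n=samples_per_label)  # assumed interface, as above
        batch = torch.stack([preprocess(img) for img in images])
        top5 = classifier(batch).topk(5, dim=-1).indices          # (n, 5) predicted class ids
        hits += (top5 == y).any(dim=-1).sum().item()
        total += samples_per_label
    return hits / total
```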
6. Empirical Results and Analyses
Key empirical outcomes:
- Toy Reward (Brightness): rapid convergence (within minutes on an H100) to exclusively bright or dark image generation; the reward stabilizes at $1$.
- Aesthetic (AES/CLIP) Fine-Tuning: $40$K gradient steps (wall-clock training on the order of hours for both d16 and d30) yield:
  - VAR-d30: AES increases over the base model while ResNet top-5 accuracy is largely preserved.
  - VAR-d16: AES gains persist even on withheld label classes.
- CLIP Alignment (Style Transfer): in $10$ hr, CLIP alignment doubles; the model synthesizes images for unseen prompts and non-ImageNet styles.
- Ablation: an excessively low KL coefficient $\beta$ causes reward hacking and label collapse, while an overly high $\beta$ blocks improvement; a larger group size $G$ increases stability and metric gains.
- Inference Speed: VAR models offer faster sampling than diffusion approaches, making online RL practical.
7. Broader Implications and Recommendations
GRPO-based fine-tuning provides a robust, critic-free approach for aligning high-throughput VAR models with human-centric objectives (Gallici et al., 29 May 2025). Its group-based normalization reduces gradient variance and removes the need for an explicit value function, supporting:
- Efficient Online RL: Fast autoregressive models accommodate large-sample RL loops without the prohibitive slowdowns typical of diffusion-based alternatives.
- Precise Alignment: The joint use of AES and CLIP rewards enables fine-grained control over both aesthetic quality and prompt-driven style, maintaining classification integrity relative to the base model.
- Generalization Beyond Pretraining: RL-driven exploration allows VAR models to synthesize prompt-aligned outputs not represented in the original training data.
- Methodological Stability: GRPO’s normalization and trust-region constraints enable large-scale fine-tuning with rapid convergence and robust safety against reward exploitation.
In conclusion, GRPO-based reinforcement fine-tuning for visual autoregressive models represents an efficient paradigm for scaling RL alignment protocols to highly performant, generative architectures while retaining computational and modeling tractability (Gallici et al., 29 May 2025).