GARDO: Diversity-Aware RL in Diffusion Models
- The paper introduces GARDO, a novel RL framework integrating gated regularization, adaptive reference resets, and diversity-aware optimization to counter reward hacking and diversity collapse in text-to-image diffusion models.
- GARDO employs targeted KL regularization via uncertainty estimation, applying penalties only to high-risk samples to maintain robust exploration and sample efficiency.
- Adaptive reference resets prevent policy stagnation while multiplicative diversity shaping amplifies reward advantages, leading to enhanced proxy scores and generalization in fine-tuning.
Gated and Adaptive Regularization with Diversity-aware Optimization (GARDO) is a reinforcement learning (RL) framework developed to address the challenge of fine-tuning text-to-image diffusion models using proxy rewards. In scenarios where it is not possible to specify a complete ground-truth objective for generative image tasks, models are optimized against imperfect proxies, leading to reward hacking, sample inefficiency, restricted exploration, and diversity collapse. GARDO integrates three main pillars—gated regularization, adaptive reference resets, and diversity-aware optimization—to achieve sample-efficient fine-tuning, robust exploration, reward hacking mitigation, and diversity preservation (He et al., 30 Dec 2025).
1. Challenges in RL Fine-tuning of Diffusion Models
Fine-tuning diffusion models with RL proxy rewards is hindered by several key issues:
- Reward Hacking: Optimizing against partial proxy objectives (such as ImageReward or OCR accuracy) enables models to obtain high proxy scores by exploiting loopholes, often leading to diminished actual image quality and collapses in sample diversity.
- Sample Efficiency: Universal Kullback–Leibler (KL) regularization, applied to all samples against a static reference, can avoid severe reward hacking but greatly reduces learning speed.
- Exploration Limitation: Overly strong KL penalties cause policies to remain near the initial reference, obstructing the discovery of superior solutions and hindering effective coverage of high-reward regions.
- Diversity Collapse: Standard RL methods such as PPO or GRPO tend to converge to a narrow set of high-reward modes, which concentrates output and reduces generative diversity.
These challenges highlight the need for sophisticated regularization and diversity mechanisms during RL-based fine-tuning.
2. Gated Regularization Mechanism
GARDO introduces a sample-wise regularization strategy based on uncertainty estimation, allowing precise application of KL penalties only to problematic samples:
- Uncertainty Measurement: For each generated sample $x_i$ in a batch $B$, the main proxy reward $r(x_i)$ and an ensemble of $K$ auxiliary reward models $r^{(1)}, \dots, r^{(K)}$ are evaluated. Each reward's batch win-rate is calculated as

$$w_i = \frac{1}{|B| - 1} \sum_{j \neq i} \mathbb{1}\left[r(x_i) > r(x_j)\right],$$

and the uncertainty score is defined by

$$u_i = w_i^{\text{proxy}} - \frac{1}{K} \sum_{k=1}^{K} w_i^{(k)},$$

the gap between the proxy win-rate and the mean auxiliary win-rate, indicating proxy-reward overconfidence relative to the auxiliary ensemble.
- Gating KL Penalty: Samples whose uncertainty exceeds the gating threshold $\tau$ (the $(1-\rho)$-percentile of batch uncertainties, with $\rho$ as the gating rate) are flagged with the indicator

$$g_i = \mathbb{1}\left[u_i \geq \tau\right].$$

KL regularization is applied only to such high-uncertainty samples ($g_i = 1$).
- Surrogate Loss: The training objective for a mini-batch of $N$ rollouts is

$$\mathcal{L} = \mathcal{L}_{\text{RL}} + \beta \cdot \frac{1}{N} \sum_{i=1}^{N} g_i \,\mathrm{KL}\!\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right)_i,$$

where $\mathcal{L}_{\text{RL}}$ is the RL loss (e.g., GRPO) and $\beta$ is the KL weight.
GARDO’s gating mechanism allows efficient learning by penalizing only those outputs at greatest risk of reward hacking or overfitting.
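As a concrete illustration, the win-rate, uncertainty, and gating computations can be sketched in a few lines of NumPy. This is a minimal reconstruction under the definitions above, not the authors' implementation; the function names, the quantile-based threshold, and the toy batch are illustrative.

```python
import numpy as np

def win_rates(scores):
    """Fraction of other batch samples that each sample outscores."""
    scores = np.asarray(scores, dtype=float)
    n = len(scores)
    wins = (scores[:, None] > scores[None, :]).sum(axis=1)  # pairwise comparisons
    return wins / (n - 1)

def uncertainty_scores(proxy_scores, aux_scores_list):
    """u_i = proxy win-rate minus mean win-rate under the auxiliary ensemble."""
    w_proxy = win_rates(proxy_scores)
    w_aux = np.mean([win_rates(s) for s in aux_scores_list], axis=0)
    return w_proxy - w_aux

def gate_mask(u, gating_rate=0.25):
    """Flag the top `gating_rate` fraction of samples by uncertainty."""
    tau = np.quantile(u, 1.0 - gating_rate)  # the (1 - rho)-percentile threshold
    return (u >= tau).astype(float)

# Toy batch: the proxy rates sample 3 highest, while both auxiliaries rate it lowest,
# so only sample 3 is flagged for KL regularization.
proxy = [0.2, 0.4, 0.6, 0.95]
aux = [[0.3, 0.5, 0.6, 0.1], [0.25, 0.45, 0.55, 0.2]]
u = uncertainty_scores(proxy, aux)
g = gate_mask(u, gating_rate=0.25)
```

With this gating mask, the KL penalty term in the surrogate loss is simply weighted by `g` per sample, so the well-behaved majority of the batch trains without any KL drag.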
3. Adaptive Regularization Strategy
The use of a static reference policy for KL regularization increasingly penalizes the online policy as it improves, causing optimization stagnation. GARDO addresses this via periodic reference resets:
- Reset Criteria: The rolling average of the gated KL divergence over recent steps, $\overline{\mathrm{KL}}_t$, is monitored. When $\overline{\mathrm{KL}}_t \geq \delta$ (the reset threshold) or a maximum step count since the last reset is exceeded, the reference policy is updated:

$$\pi_{\text{ref}} \leftarrow \pi_\theta,$$

and the step counter is reset to zero.
- Effect: This adaptation prevents stale anchoring and ensures KL regularization remains relevant, preserving exploration while maintaining control over divergence from past policies.
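The reset logic can be expressed as a small controller object. This is a sketch; the threshold, window length, and step cap are assumed placeholder values, and the actual policy copy is left to the caller.

```python
from collections import deque

class ReferenceResetController:
    """Tracks a rolling average of the gated KL and signals when to
    re-anchor the reference policy (sketch; hyperparameters assumed)."""

    def __init__(self, kl_threshold=0.05, max_steps=200, window=20):
        self.kl_threshold = kl_threshold
        self.max_steps = max_steps
        self.window = deque(maxlen=window)
        self.steps_since_reset = 0

    def update(self, gated_kl):
        """Record this step's mean gated KL; return True when the
        reference should be reset (caller then sets pi_ref <- pi_theta)."""
        self.window.append(gated_kl)
        self.steps_since_reset += 1
        rolling_kl = sum(self.window) / len(self.window)
        if rolling_kl >= self.kl_threshold or self.steps_since_reset >= self.max_steps:
            self.window.clear()
            self.steps_since_reset = 0
            return True
        return False
```

Either trigger fires a reset: a KL spike re-anchors immediately, while the step cap guarantees the anchor never grows stale even under consistently small divergence.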
4. Diversity-Aware Optimization
GARDO introduces explicit reward amplification for diverse, high-quality outputs:
- Diversity Metric: For each output $x_i$ in a sampled group, a DINOv3 semantic embedding $e_i$ is extracted, and the nearest-neighbor cosine isolation score is computed:

$$d_i = 1 - \max_{j \neq i} \cos(e_i, e_j).$$

The overall diversity of a group of $G$ samples is

$$D = \frac{1}{G} \sum_{i=1}^{G} d_i.$$

- Multiplicative Advantage Shaping: Advantages from GRPO, $A_i$, are amplified for samples that are both high-reward (positive advantage) and highly isolated:

$$\tilde{A}_i = \begin{cases} A_i \,(1 + \lambda\, d_i), & A_i > 0, \\ A_i, & \text{otherwise,} \end{cases}$$

with $\lambda$ a diversity-scaling coefficient.
Unlike additive shaping, the multiplicative form strengthens only genuinely diverse, high-reward samples and simplifies tuning.
This dual mechanism is designed to sustain exploration of new modes and prevent convergence to low-diversity policy outputs.
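Both pieces, nearest-neighbor isolation and multiplicative shaping, are cheap to compute on a group of embeddings. The sketch below is a reconstruction from the definitions above; the positive-advantage-only gating and the `lam` coefficient are assumptions.

```python
import numpy as np

def isolation_scores(embeddings):
    """d_i = 1 - max_{j != i} cos(e_i, e_j) over a group of embeddings."""
    e = np.asarray(embeddings, dtype=float)
    e = e / np.linalg.norm(e, axis=1, keepdims=True)  # unit-normalize rows
    sim = e @ e.T                                     # pairwise cosine similarities
    np.fill_diagonal(sim, -np.inf)                    # exclude self-similarity
    return 1.0 - sim.max(axis=1)

def shape_advantages(adv, d, lam=1.0):
    """Multiplicative shaping: scale only positive advantages by 1 + lam * d_i."""
    adv = np.asarray(adv, dtype=float)
    scale = np.where(adv > 0, 1.0 + lam * d, 1.0)
    return adv * scale

# Two near-duplicate outputs plus one isolated one: only the isolated,
# positive-advantage sample gets amplified.
d = isolation_scores([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
shaped = shape_advantages([1.0, -1.0, 1.0], d, lam=1.0)
```

Because the scaling multiplies the advantage rather than adding a bonus, a diverse but low-reward sample gains nothing, which is exactly the property the multiplicative form is chosen for.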
5. GARDO Optimization and Training Workflow
The GARDO training procedure encompasses initialization, sampling, reward computation, diversity evaluation, uncertainty estimation, gating, optimization, adaptive reference resetting, and gating adjustment:
- Initialize the online policy $\pi_\theta$ and the reference policy $\pi_{\text{ref}}$ (initially a copy of $\pi_\theta$).
- For each iteration:
  a. Sample prompts and generate rollouts.
  b. Compute proxy rewards and unnormalized advantages $A_i$ via GRPO.
  c. Extract DINOv3 features, compute isolation scores $d_i$, and shape advantages $\tilde{A}_i$.
  d. Calculate uncertainty scores $u_i$.
  e. Determine the gating threshold $\tau$ and gating indicators $g_i$.
  f. Formulate the gated surrogate loss $\mathcal{L}$.
  g. Update policy parameters with its gradient.
  h. Conditionally reset $\pi_{\text{ref}} \leftarrow \pi_\theta$ according to the KL or step schedules.
  i. Adjust the gating rate $\rho$ based on observed batch uncertainties within a rolling window.
Hyperparameters for Stable Diffusion adaptation include LoRA adapters (rank 32, α = 64), a batch of 6 prompts with 24 rollouts per prompt, tuned values for the initial KL weight $\beta$, the reference reset threshold $\delta$, and the learning rate, and removal of the advantage-normalization standard deviation for stability in low-variance groups.
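The last point, dropping the per-group standard deviation from GRPO's advantage normalization, is easy to see in isolation. A minimal sketch, assuming mean-centering as the baseline and an illustrative `eps` constant:

```python
import numpy as np

def grpo_advantages(rewards, normalize_std=False, eps=1e-8):
    """Group-relative advantages: center rewards on the group mean;
    optionally divide by the group std (GARDO omits this division)."""
    r = np.asarray(rewards, dtype=float)
    adv = r - r.mean()
    if normalize_std:
        adv = adv / (r.std() + eps)
    return adv

# In a near-uniform group, std-normalization inflates tiny reward
# differences to roughly +/-1, producing noisy, oversized updates.
group = [1.000, 1.001, 1.000, 0.999]
adv_centered = grpo_advantages(group)
adv_std_norm = grpo_advantages(group, normalize_std=True)
```

Mean-centering alone keeps the update magnitude proportional to the actual reward spread, which is why it is the more stable choice when a prompt's rollouts all score similarly.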
6. Empirical Evaluation and Ablation Insights
Experiments demonstrate that GARDO delivers sample efficiency comparable to unregularized RL while preventing reward hacking and increasing diversity:
| Method | Proxy Reward Growth | Generalization (Unseen Metrics) | Mean Diversity (%) |
|---|---|---|---|
| GRPO (β=0) | Fast | Poor | 19–20 |
| Uniform KL (β>0) | Slower (10–20% ↓) | High | 19–20 |
| GARDO (Full) | Fast (≈GRPO) | High/exceeds | 24–25 |
| GARDO w/o Div | Fast | High | No diversity gain |
GARDO sustains high proxy scores and simultaneously recovers or exceeds generalization on out-of-distribution (o.o.d.) metrics such as Aesthetic, PickScore, ImageReward, and HPSv3. Removal of multiplicative diversity shaping negates diversity benefits, while diversity shaping without KL regularization eventually leads to over-optimization and collapse. The framework generalizes across models, with performance benefits observed for DiffusionNFT and 12B-parameter Flux.1 models as well (He et al., 30 Dec 2025).
7. Practical Considerations and Implementation Guidelines
For text-to-image (T2I) diffusion fine-tuning, GARDO's auxiliary uncertainty estimators (Aesthetic, ImageReward) incur negligible overhead. Batch groupings (e.g., 6 prompts with a group size of 24 rollouts each), LoRA adapters, and frozen DINOv3 embeddings are recommended for efficient, scalable implementation. Adaptive hyperparameters, rolling-window adjustment of the gating rate $\rho$ (±10%), and std-free advantage normalization for low-variance groups further stabilize training. Together, these measures are designed to avoid reward hacking while maintaining rapid policy improvement and output diversity.
GARDO synthesizes uncertainty-aware KL application, adaptive anchoring, and explicit diversity shaping, enabling RL fine-tuning to achieve efficient exploration and robust prevention of both reward exploitation and diversity collapse in diffusion models (He et al., 30 Dec 2025).