GARDO: Diversity-Aware RL in Diffusion Models
- The paper introduces GARDO, a novel RL framework integrating gated regularization, adaptive reference resets, and diversity-aware optimization to counter reward hacking and diversity collapse in text-to-image diffusion models.
- GARDO employs targeted KL regularization via uncertainty estimation, applying penalties only to high-risk samples to maintain robust exploration and sample efficiency.
- Adaptive reference resets prevent policy stagnation while multiplicative diversity shaping amplifies reward advantages, leading to enhanced proxy scores and generalization in fine-tuning.
Gated and Adaptive Regularization with Diversity-aware Optimization (GARDO) is a reinforcement learning (RL) framework developed to address the challenge of fine-tuning text-to-image diffusion models using proxy rewards. In scenarios where it is not possible to specify a complete ground-truth objective for generative image tasks, models are optimized against imperfect proxies, leading to reward hacking, sample inefficiency, restricted exploration, and diversity collapse. GARDO integrates three main pillars—gated regularization, adaptive reference resets, and diversity-aware optimization—to achieve sample-efficient fine-tuning, robust exploration, reward hacking mitigation, and diversity preservation (He et al., 30 Dec 2025).
1. Challenges in RL Fine-tuning of Diffusion Models
Fine-tuning diffusion models with RL proxy rewards is hindered by several key issues:
- Reward Hacking: Optimizing against partial proxy objectives (such as ImageReward or OCR accuracy) enables models to obtain high proxy scores by exploiting loopholes, often leading to diminished actual image quality and collapses in sample diversity.
- Sample Efficiency: Universal Kullback–Leibler (KL) regularization, applied to all samples against a static reference, can avoid severe reward hacking but greatly reduces learning speed.
- Exploration Limitation: Overly strong KL penalties cause policies to remain near the initial reference, obstructing the discovery of superior solutions and hindering effective coverage of high-reward regions.
- Diversity Collapse: Standard RL methods such as PPO or GRPO tend to converge to a narrow set of high-reward modes, which concentrates output and reduces generative diversity.
These challenges highlight the need for sophisticated regularization and diversity mechanisms during RL-based fine-tuning.
2. Gated Regularization Mechanism
GARDO introduces a sample-wise regularization strategy based on uncertainty estimation, allowing precise application of KL penalties only to problematic samples:
- Uncertainty Measurement: For each generated sample $x_i$ in a batch $B$, the main proxy reward $r(x_i)$ and an ensemble of $K$ auxiliary reward models $r^{(1)}, \dots, r^{(K)}$ are evaluated. Each reward's batch win-rate is calculated as

$$w_i = \frac{1}{|B| - 1} \sum_{j \neq i} \mathbb{1}\left[r(x_i) > r(x_j)\right],$$

and the uncertainty score is defined by

$$u_i = w_i^{\text{proxy}} - \frac{1}{K} \sum_{k=1}^{K} w_i^{(k)},$$

the gap between the proxy win-rate and the mean auxiliary win-rate, indicating proxy-reward overconfidence relative to the auxiliary ensemble.
- Gating KL Penalty: Samples whose uncertainty exceeds the gating threshold $\tau$ (the $(1-\rho)$-percentile of batch uncertainties, with $\rho$ as the gating rate) are flagged with the indicator

$$g_i = \mathbb{1}\left[u_i \geq \tau\right].$$

KL regularization is applied only to such high-uncertainty samples ($g_i = 1$).
- Surrogate Loss: The training objective for a mini-batch of $N$ rollouts is

$$\mathcal{L} = \mathcal{L}_{\text{RL}} + \beta \cdot \frac{1}{N} \sum_{i=1}^{N} g_i \,\mathrm{KL}\!\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right)_i,$$

where $\mathcal{L}_{\text{RL}}$ is the RL loss (e.g., GRPO) and $\beta$ is the KL weight.
GARDO’s gating mechanism allows efficient learning by penalizing only those outputs at greatest risk of reward hacking or overfitting.
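As a concrete illustration, the win-rate, uncertainty, and gating computations can be sketched in a few lines of NumPy. This is a minimal reconstruction under the definitions above, not the authors' implementation; the function names, the quantile-based threshold, and the toy batch are illustrative.

```python
import numpy as np

def win_rates(scores):
    """Fraction of other batch samples that each sample outscores."""
    scores = np.asarray(scores, dtype=float)
    n = len(scores)
    wins = (scores[:, None] > scores[None, :]).sum(axis=1)  # pairwise comparisons
    return wins / (n - 1)

def uncertainty_scores(proxy_scores, aux_scores_list):
    """u_i = proxy win-rate minus mean win-rate under the auxiliary ensemble."""
    w_proxy = win_rates(proxy_scores)
    w_aux = np.mean([win_rates(s) for s in aux_scores_list], axis=0)
    return w_proxy - w_aux

def gate_mask(u, gating_rate=0.25):
    """Flag the top `gating_rate` fraction of samples by uncertainty."""
    tau = np.quantile(u, 1.0 - gating_rate)  # the (1 - rho)-percentile threshold
    return (u >= tau).astype(float)

# Toy batch: the proxy rates sample 3 highest, while both auxiliaries rate it lowest,
# so only sample 3 is flagged for KL regularization.
proxy = [0.2, 0.4, 0.6, 0.95]
aux = [[0.3, 0.5, 0.6, 0.1], [0.25, 0.45, 0.55, 0.2]]
u = uncertainty_scores(proxy, aux)
g = gate_mask(u, gating_rate=0.25)
```

With this gating mask, the KL penalty term in the surrogate loss is simply weighted by `g` per sample, so the well-behaved majority of the batch trains without any KL drag.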
3. Adaptive Regularization Strategy
The use of a static reference policy for KL regularization increasingly penalizes the online policy as it improves, causing optimization stagnation. GARDO addresses this via periodic reference resets:
- Reset Criteria: The rolling average of the gated KL divergence over recent steps, $\overline{\mathrm{KL}}_t$, is monitored. When $\overline{\mathrm{KL}}_t \geq \delta$ (the reset threshold) or a maximum step count since the last reset is exceeded, the reference policy is updated:

$$\pi_{\text{ref}} \leftarrow \pi_\theta,$$

and the step counter is reset to zero.
- Effect: This adaptation prevents stale anchoring and ensures KL regularization remains relevant, preserving exploration while maintaining control over divergence from past policies.
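The reset logic can be expressed as a small controller object. This is a sketch; the threshold, window length, and step cap are assumed placeholder values, and the actual policy copy is left to the caller.

```python
from collections import deque

class ReferenceResetController:
    """Tracks a rolling average of the gated KL and signals when to
    re-anchor the reference policy (sketch; hyperparameters assumed)."""

    def __init__(self, kl_threshold=0.05, max_steps=200, window=20):
        self.kl_threshold = kl_threshold
        self.max_steps = max_steps
        self.window = deque(maxlen=window)
        self.steps_since_reset = 0

    def update(self, gated_kl):
        """Record this step's mean gated KL; return True when the
        reference should be reset (caller then sets pi_ref <- pi_theta)."""
        self.window.append(gated_kl)
        self.steps_since_reset += 1
        rolling_kl = sum(self.window) / len(self.window)
        if rolling_kl >= self.kl_threshold or self.steps_since_reset >= self.max_steps:
            self.window.clear()
            self.steps_since_reset = 0
            return True
        return False
```

Either trigger fires a reset: a KL spike re-anchors immediately, while the step cap guarantees the anchor never grows stale even under consistently small divergence.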
4. Diversity-Aware Optimization
GARDO introduces explicit reward amplification for diverse, high-quality outputs:
- Diversity Metric: For each output $x_i$ in a sampled group, a DINOv3 semantic embedding $e_i$ is extracted, and the nearest-neighbor cosine isolation score is computed:

$$d_i = 1 - \max_{j \neq i} \cos(e_i, e_j).$$

The overall diversity of a group of $G$ samples is

$$D = \frac{1}{G} \sum_{i=1}^{G} d_i.$$

- Multiplicative Advantage Shaping: Advantages from GRPO, $A_i$, are amplified for samples that are both high-reward (positive advantage) and highly isolated:

$$\tilde{A}_i = \begin{cases} A_i \,(1 + \lambda\, d_i), & A_i > 0, \\ A_i, & \text{otherwise,} \end{cases}$$

with $\lambda$ a diversity-scaling coefficient.
Unlike additive shaping, the multiplicative form strengthens only genuinely diverse, high-reward samples and simplifies tuning.
This dual mechanism is designed to sustain exploration of new modes and prevent convergence to low-diversity policy outputs.
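Both pieces, nearest-neighbor isolation and multiplicative shaping, are cheap to compute on a group of embeddings. The sketch below is a reconstruction from the definitions above; the positive-advantage-only gating and the `lam` coefficient are assumptions.

```python
import numpy as np

def isolation_scores(embeddings):
    """d_i = 1 - max_{j != i} cos(e_i, e_j) over a group of embeddings."""
    e = np.asarray(embeddings, dtype=float)
    e = e / np.linalg.norm(e, axis=1, keepdims=True)  # unit-normalize rows
    sim = e @ e.T                                     # pairwise cosine similarities
    np.fill_diagonal(sim, -np.inf)                    # exclude self-similarity
    return 1.0 - sim.max(axis=1)

def shape_advantages(adv, d, lam=1.0):
    """Multiplicative shaping: scale only positive advantages by 1 + lam * d_i."""
    adv = np.asarray(adv, dtype=float)
    scale = np.where(adv > 0, 1.0 + lam * d, 1.0)
    return adv * scale

# Two near-duplicate outputs plus one isolated one: only the isolated,
# positive-advantage sample gets amplified.
d = isolation_scores([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
shaped = shape_advantages([1.0, -1.0, 1.0], d, lam=1.0)
```

Because the scaling multiplies the advantage rather than adding a bonus, a diverse but low-reward sample gains nothing, which is exactly the property the multiplicative form is chosen for.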
5. GARDO Optimization and Training Workflow
The GARDO training procedure encompasses initialization, sampling, reward computation, diversity evaluation, uncertainty estimation, gating, optimization, adaptive reference resetting, and gating adjustment:
- Initialize the online policy $\pi_\theta$ and the reference policy $\pi_{\text{ref}}$ (initially a copy of $\pi_\theta$).
- For each iteration:
  a. Sample prompts and generate rollouts.
  b. Compute proxy rewards and unnormalized advantages $A_i$ via GRPO.
  c. Extract DINOv3 features, compute isolation scores $d_i$, and shape advantages $\tilde{A}_i$.
  d. Calculate uncertainty scores $u_i$.
  e. Determine the gating threshold $\tau$ and gating indicators $g_i$.
  f. Formulate the gated surrogate loss $\mathcal{L}$.
  g. Update policy parameters with its gradient.
  h. Conditionally reset $\pi_{\text{ref}} \leftarrow \pi_\theta$ according to the KL or step schedules.
  i. Adjust the gating rate $\rho$ based on observed batch uncertainties within a rolling window.
Hyperparameters for Stable Diffusion adaptation include LoRA adapters (rank 32, α = 64), a batch of 6 prompts with 24 rollouts per prompt, tuned values for the initial KL weight $\beta$, the reference reset threshold $\delta$, and the learning rate, and removal of the advantage-normalization standard deviation for stability in low-variance groups.
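The last point, dropping the per-group standard deviation from GRPO's advantage normalization, is easy to see in isolation. A minimal sketch, assuming mean-centering as the baseline and an illustrative `eps` constant:

```python
import numpy as np

def grpo_advantages(rewards, normalize_std=False, eps=1e-8):
    """Group-relative advantages: center rewards on the group mean;
    optionally divide by the group std (GARDO omits this division)."""
    r = np.asarray(rewards, dtype=float)
    adv = r - r.mean()
    if normalize_std:
        adv = adv / (r.std() + eps)
    return adv

# In a near-uniform group, std-normalization inflates tiny reward
# differences to roughly +/-1, producing noisy, oversized updates.
group = [1.000, 1.001, 1.000, 0.999]
adv_centered = grpo_advantages(group)
adv_std_norm = grpo_advantages(group, normalize_std=True)
```

Mean-centering alone keeps the update magnitude proportional to the actual reward spread, which is why it is the more stable choice when a prompt's rollouts all score similarly.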
6. Empirical Evaluation and Ablation Insights
Experiments demonstrate that GARDO delivers sample efficiency comparable to unregularized RL while preventing reward hacking and increasing diversity:
| Method | Proxy Reward Growth | Generalization (Unseen Metrics) | Mean Diversity (%) |
|---|---|---|---|
| GRPO (β=0) | Fast | Poor | 19–20 |
| Uniform KL (β>0) | Slower (10–20% ↓) | High | 19–20 |
| GARDO (Full) | Fast (≈GRPO) | High/exceeds | 24–25 |
| GARDO w/o Div | Fast | High | No diversity gain |
GARDO sustains high proxy scores and simultaneously recovers or exceeds generalization on out-of-distribution (o.o.d.) metrics such as Aesthetic, PickScore, ImageReward, and HPSv3. Removal of multiplicative diversity shaping negates diversity benefits, while diversity shaping without KL regularization eventually leads to over-optimization and collapse. The framework generalizes across models, with performance benefits observed for DiffusionNFT and 12B-parameter Flux.1 models as well (He et al., 30 Dec 2025).
7. Practical Considerations and Implementation Guidelines
For text-to-image (T2I) diffusion fine-tuning, GARDO's auxiliary uncertainty estimators (Aesthetic, ImageReward) incur negligible overhead. Batch groupings (e.g., 6 prompts with a group size of 24 rollouts each), LoRA adapters, and frozen DINOv3 embeddings are recommended for efficient, scalable implementation. Adaptive hyperparameters, rolling-window adjustment of the gating rate $\rho$ (±10%), and std-free advantage normalization for low-variance groups further stabilize training. Together, these measures are designed to avoid reward hacking while maintaining rapid policy improvement and output diversity.
GARDO synthesizes uncertainty-aware KL application, adaptive anchoring, and explicit diversity shaping, enabling RL fine-tuning to achieve efficient exploration and robust prevention of both reward exploitation and diversity collapse in diffusion models (He et al., 30 Dec 2025).