
GARDO: Diversity-Aware RL in Diffusion Models

Updated 6 January 2026
  • The paper introduces GARDO, a novel RL framework integrating gated regularization, adaptive reference resets, and diversity-aware optimization to counter reward hacking and diversity collapse in text-to-image diffusion models.
  • GARDO employs targeted KL regularization via uncertainty estimation, applying penalties only to high-risk samples to maintain robust exploration and sample efficiency.
  • Adaptive reference resets prevent policy stagnation while multiplicative diversity shaping amplifies reward advantages, leading to enhanced proxy scores and generalization in fine-tuning.

Gated and Adaptive Regularization with Diversity-aware Optimization (GARDO) is a reinforcement learning (RL) framework developed to address the challenge of fine-tuning text-to-image diffusion models using proxy rewards. In scenarios where it is not possible to specify a complete ground-truth objective for generative image tasks, models are optimized against imperfect proxies, leading to reward hacking, sample inefficiency, restricted exploration, and diversity collapse. GARDO integrates three main pillars—gated regularization, adaptive reference resets, and diversity-aware optimization—to achieve sample-efficient fine-tuning, robust exploration, reward hacking mitigation, and diversity preservation (He et al., 30 Dec 2025).

1. Challenges in RL Fine-tuning of Diffusion Models

Fine-tuning diffusion models with RL proxy rewards is hindered by several key issues:

  • Reward Hacking: Optimizing against partial proxy objectives (such as ImageReward or OCR accuracy) enables models to obtain high proxy scores by exploiting loopholes, often leading to diminished actual image quality and collapses in sample diversity.
  • Sample Efficiency: Universal Kullback–Leibler (KL) regularization, applied to all samples against a static reference, can avoid severe reward hacking but greatly reduces learning speed.
  • Exploration Limitation: Overly strong KL penalties cause policies to remain near the initial reference, obstructing the discovery of superior solutions and hindering effective coverage of high-reward regions.
  • Diversity Collapse: Standard RL methods such as PPO or GRPO tend to converge to a narrow set of high-reward modes, which concentrates output and reduces generative diversity.

These challenges highlight the need for sophisticated regularization and diversity mechanisms during RL-based fine-tuning.

2. Gated Regularization Mechanism

GARDO introduces a sample-wise regularization strategy based on uncertainty estimation, allowing precise application of KL penalties only to problematic samples:

  • Uncertainty Measurement: For each generated sample x^i, the main proxy reward \tilde R(x^i) and an ensemble of K auxiliary reward models \{\hat R_n(x^i)\}_{n=1}^{K} are evaluated. Each reward's batch win-rate is calculated as

w(y^i) = \frac{1}{B}\sum_{j\neq i}\mathbf{1}(y^i > y^j)

and the uncertainty score is defined by

\mathcal U(x^i) = w(\tilde R(x^i)) - \frac{1}{K}\sum_{n=1}^{K} w(\hat R_n(x^i))

indicating proxy-reward overconfidence relative to the auxiliary ensemble.

  • Gating KL Penalty: Samples with uncertainty above the gating threshold \epsilon_{\mathcal U} (the (1-k)-percentile of batch uncertainties, with k the gating rate) are flagged with

\mathbf{1}_i = \begin{cases} 1 & \text{if } \mathcal U(x^i) > \epsilon_{\mathcal U}, \\ 0 & \text{otherwise}. \end{cases}

KL regularization is applied only to these high-uncertainty samples.

  • Surrogate Loss: The training objective for a mini-batch of G rollouts is

L_{\rm total} = \frac{1}{G}\sum_{i=1}^{G}\sum_{t=0}^{T} \Bigl( L_{\rm RL}^{i,t} + \mathbf{1}_i\,\beta\, D_{\rm KL}\bigl(\pi_\theta(\cdot\,|\,s_t)\,\|\,\pi_{\rm ref}(\cdot\,|\,s_t)\bigr) \Bigr)

where L_{\rm RL}^{i,t} is the RL loss (e.g., GRPO) and \beta is the KL weight.

GARDO’s gating mechanism allows efficient learning by penalizing only those outputs at greatest risk of reward hacking or overfitting.
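The win-rate, uncertainty, and gating computations above can be sketched in a few lines of NumPy. The function names and the gated-loss helper are illustrative, not code from the paper; per-sample RL losses and KL terms are assumed to be already summed over timesteps:

```python
import numpy as np

def win_rate(y):
    """Batch win-rate w(y^i): fraction of the batch each sample beats."""
    y = np.asarray(y, dtype=float)
    return (y[:, None] > y[None, :]).sum(axis=1) / len(y)

def uncertainty_scores(proxy_rewards, aux_rewards):
    """U(x^i) = w(R_tilde(x^i)) - (1/K) * sum_n w(R_hat_n(x^i))."""
    w_aux = np.mean([win_rate(r) for r in aux_rewards], axis=0)
    return win_rate(proxy_rewards) - w_aux

def gating_mask(scores, k=0.1):
    """1_i = 1 iff U(x^i) exceeds the (1-k)-percentile threshold eps_U."""
    return scores > np.quantile(scores, 1.0 - k)

def gated_total_loss(rl_losses, kl_divs, mask, beta):
    """L_total: RL loss plus beta-weighted KL on gated samples only."""
    rl_losses, kl_divs = np.asarray(rl_losses), np.asarray(kl_divs)
    return np.mean(rl_losses + mask * beta * kl_divs)
```

On a batch where the proxy ranks a sample far above what the auxiliary ensemble supports, that sample receives a high uncertainty score and is the one gated, so only it pays the KL penalty.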

3. Adaptive Regularization Strategy

The use of a static reference policy \pi_{\rm ref} for KL regularization increasingly penalizes the online policy as it improves, causing optimization stagnation. GARDO addresses this via periodic reference resets:

  • Reset Criteria: The rolling average of the gated KL divergence is monitored:

\overline{D}_{\rm KL} = \frac{1}{|\{i : \mathbf{1}_i = 1\}|}\sum_{i:\mathbf{1}_i=1} D_{\rm KL}(\pi_\theta \| \pi_{\rm ref})

When \overline{D}_{\rm KL} > \epsilon_{\rm KL} (a fixed threshold) or a maximum step count m is reached, the reference policy is updated:

\pi_{\rm ref} \leftarrow \pi_\theta

and the step counter is reset to zero.

  • Effect: This adaptation prevents stale anchoring and ensures KL regularization remains relevant, preserving exploration while maintaining control over divergence from past policies.
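A minimal sketch of the reset rule, treating policies as plain parameter dicts; eps_kl matches the paper's reported threshold, while the max_steps default and the dict representation are assumptions for illustration:

```python
import copy

def maybe_reset_reference(policy_params, ref_params, gated_kls,
                          steps_since_reset, eps_kl=1e-4, max_steps=100):
    """Reset pi_ref <- pi_theta when the mean gated KL exceeds eps_kl
    or max_steps have elapsed since the last reset; otherwise keep counting.

    gated_kls holds the KL values of gated samples only (may be empty).
    """
    mean_kl = sum(gated_kls) / len(gated_kls) if gated_kls else 0.0
    if mean_kl > eps_kl or steps_since_reset >= max_steps:
        return copy.deepcopy(policy_params), 0   # new anchor, counter reset
    return ref_params, steps_since_reset + 1
```

Returning a deep copy keeps the reference frozen while the online parameters continue to be updated in place.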

4. Diversity-Aware Optimization

GARDO introduces explicit reward amplification for diverse, high-quality outputs:

  • Diversity Metric: For each output x_0^i, a DINOv3 semantic embedding e^i \in \mathbb{R}^d is extracted, and the nearest-neighbor cosine isolation score is computed:

d_i = \min_{j\neq i}\bigl(1 - \cos(e^i, e^j)\bigr)

The overall diversity in a group is

\mathrm{Div} = \frac{1}{G(G-1)}\sum_{i\neq j}\bigl(1 - \cos(e^i, e^j)\bigr)

  • Multiplicative Advantage Shaping: GRPO advantages A_i are amplified for samples that are both high-reward and highly isolated:

A_i^{\rm shaped} = \begin{cases} A_i\, d_i, & A_i > 0, \\ A_i, & A_i \le 0 \end{cases}

Unlike additive shaping, the multiplicative form strengthens only genuinely diverse, high-reward samples and simplifies tuning.

This dual mechanism is designed to sustain exploration of new modes and prevent convergence to low-diversity policy outputs.
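The isolation score, group diversity, and multiplicative shaping can be sketched as follows; the function names are illustrative, and embeddings are assumed to arrive as rows of a matrix (here they are unit-normalized before the cosine computation):

```python
import numpy as np

def isolation_scores(embeddings):
    """d_i = min_{j != i} (1 - cos(e^i, e^j)) over a group of embeddings."""
    e = np.asarray(embeddings, dtype=float)
    e = e / np.linalg.norm(e, axis=1, keepdims=True)   # unit-normalize rows
    cos = e @ e.T
    np.fill_diagonal(cos, -np.inf)     # exclude self-similarity
    return 1.0 - cos.max(axis=1)       # min distance = 1 - max cosine

def group_diversity(embeddings):
    """Div: mean cosine distance over all ordered pairs i != j."""
    e = np.asarray(embeddings, dtype=float)
    e = e / np.linalg.norm(e, axis=1, keepdims=True)
    cos = e @ e.T
    G = len(e)
    # sum_{i != j} (1 - cos_ij) = G(G-1) - (sum(cos) - G) = G^2 - sum(cos)
    return (G * G - np.sum(cos)) / (G * (G - 1))

def shape_advantages(advantages, d):
    """Multiply only positive advantages by the isolation score d_i."""
    a = np.asarray(advantages, dtype=float)
    return np.where(a > 0, a * d, a)
```

Duplicated outputs get d_i near zero, so their positive advantages are suppressed, while a sample far from all its neighbors keeps (or nearly keeps) its full advantage.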

5. GARDO Optimization and Training Workflow

The GARDO training procedure encompasses initialization, sampling, reward computation, diversity evaluation, uncertainty estimation, gating, optimization, adaptive reference resetting, and gating adjustment:

  1. Initialize the online policy \pi_\theta and the reference policy \pi_{\rm ref}.
  2. For each iteration:
     a. Sample prompts and generate G rollouts.
     b. Compute proxy rewards and unnormalized advantages via GRPO.
     c. Extract DINOv3 features, compute isolation scores d_i, and shape advantages A_i^{\rm shaped}.
     d. Calculate uncertainty scores \mathcal U(x^i).
     e. Determine \epsilon_{\mathcal U} and gating indicators \mathbf{1}_i.
     f. Form L_{\rm total} using the gated loss.
     g. Update the policy parameters with the gradient.
     h. Conditionally reset \pi_{\rm ref} according to the KL or step schedule.
     i. Adjust the gating rate k based on batch uncertainties observed within a rolling window.

Hyperparameters for Stable Diffusion adaptation include LoRA adapters (rank 32, α = 64), batch size (6 prompts, 24 rollouts per prompt), initial gating rate k = 0.1, reference-reset threshold \epsilon_{\rm KL} = 1\times10^{-4}, learning rate 3\times10^{-4}, and removal of the advantage-normalization standard deviation for stability in low-variance groups.
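Putting steps (b)-(e) together, the per-sample quantities of one iteration can be sketched in a single function. Rewards and DINOv3 embeddings are assumed precomputed, and the mean-centred advantage without std division reflects the normalization choice above; the function name is illustrative:

```python
import numpy as np

def gardo_group_step(rewards, aux_rewards, embeddings, k=0.1):
    """Return shaped advantages and the gating mask for one group of G
    rollouts (workflow steps b-e). Inputs are plain arrays; a real run
    would obtain them from the reward models and the DINOv3 encoder."""
    r = np.asarray(rewards, dtype=float)
    adv = r - r.mean()                       # GRPO advantage, no std division

    # Diversity shaping via nearest-neighbor cosine isolation.
    e = np.asarray(embeddings, dtype=float)
    e = e / np.linalg.norm(e, axis=1, keepdims=True)
    cos = e @ e.T
    np.fill_diagonal(cos, -np.inf)
    d = 1.0 - cos.max(axis=1)
    adv = np.where(adv > 0, adv * d, adv)

    # Uncertainty = proxy win-rate minus mean auxiliary win-rate.
    def win_rate(y):
        y = np.asarray(y, dtype=float)
        return (y[:, None] > y[None, :]).sum(axis=1) / len(y)

    u = win_rate(r) - np.mean([win_rate(a) for a in aux_rewards], axis=0)
    gate = u > np.quantile(u, 1.0 - k)       # 1_i for the gated KL penalty
    return adv, gate
```

The returned advantages feed the policy-gradient term of L_total, while the mask selects which samples receive the β-weighted KL penalty in steps (f)-(g).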

6. Empirical Evaluation and Ablation Insights

Experiments demonstrate that GARDO delivers sample efficiency comparable to unregularized RL while preventing reward hacking and increasing diversity:

| Method | Proxy Reward Growth | Generalization (Unseen Metrics) | Mean Diversity (%) |
|---|---|---|---|
| GRPO (β = 0) | Fast | Poor | 19–20 |
| Uniform KL (β > 0) | Slower (10–20% ↓) | High | 19–20 |
| GARDO (full) | Fast (≈ GRPO) | High, exceeds baselines | 24–25 |
| GARDO w/o diversity shaping | Fast | High | No diversity gain |

GARDO sustains high proxy scores and simultaneously recovers or exceeds generalization on out-of-distribution (o.o.d.) metrics such as Aesthetic, PickScore, ImageReward, and HPSv3. Removal of multiplicative diversity shaping negates diversity benefits, while diversity shaping without KL regularization eventually leads to over-optimization and collapse. The framework generalizes across models, with performance benefits observed for DiffusionNFT and 12B-parameter Flux.1 models as well (He et al., 30 Dec 2025).

7. Practical Considerations and Implementation Guidelines

For text-to-image (T2I) diffusion fine-tuning, GARDO's auxiliary uncertainty estimators (Aesthetic, ImageReward) incur negligible overhead. Batch groupings (e.g., batch size 6 with group size G = 24), LoRA adapters, and frozen DINOv3 embeddings are recommended for efficient and scalable implementation. Adaptive hyperparameters, rolling-window adjustment of the gating rate (±10%), and removal of the advantage-normalization standard deviation in low-variance groups further stabilize training. These measures are designed to avoid reward hacking while maintaining rapid policy improvement and output diversity.
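The rolling-window gating-rate adjustment can be sketched as a small controller. The ±10% step size comes from the guidelines above, but the specific rule used here (raise k when the windowed mean uncertainty is rising, lower it when falling) is an assumption, since the source does not spell out the adjustment criterion:

```python
from collections import deque

class GatingRateController:
    """Adapt the gating rate k from a rolling window of mean batch
    uncertainties. The raise/lower criterion is an illustrative guess;
    only the +/-10% step and the rolling window come from the source."""

    def __init__(self, k=0.1, window=10, k_min=0.05, k_max=0.5):
        self.k, self.k_min, self.k_max = k, k_min, k_max
        self.history = deque(maxlen=window)

    def update(self, mean_batch_uncertainty):
        self.history.append(mean_batch_uncertainty)
        if len(self.history) < self.history.maxlen:
            return self.k                    # wait until the window fills
        half = self.history.maxlen // 2
        older = sum(list(self.history)[:half]) / half
        recent = sum(list(self.history)[half:]) / (self.history.maxlen - half)
        # More overconfident samples lately -> gate a larger fraction.
        self.k *= 1.10 if recent > older else 0.90
        self.k = min(max(self.k, self.k_min), self.k_max)
        return self.k
```

Clamping k to a fixed range keeps the gated fraction from collapsing to zero or swallowing the whole batch under noisy uncertainty estimates.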

GARDO synthesizes uncertainty-aware KL application, adaptive anchoring, and explicit diversity shaping, enabling RL fine-tuning to achieve efficient exploration and robust prevention of both reward exploitation and diversity collapse in diffusion models (He et al., 30 Dec 2025).
