Large-scale Reinforcement Learning for Diffusion Models (2401.12244v1)

Published 20 Jan 2024 in cs.CV, cs.AI, and cs.LG

Abstract: Text-to-image diffusion models are a class of deep generative models that have demonstrated an impressive capacity for high-quality image generation. However, these models are susceptible to implicit biases that arise from web-scale text-image training pairs and may inaccurately model aspects of images we care about. This can result in suboptimal samples, model bias, and images that do not align with human ethics and preferences. In this paper, we present an effective scalable algorithm to improve diffusion models using Reinforcement Learning (RL) across a diverse set of reward functions, such as human preference, compositionality, and fairness over millions of images. We illustrate how our approach substantially outperforms existing methods for aligning diffusion models with human preferences. We further illustrate how this substantially improves pretrained Stable Diffusion (SD) models, generating samples that are preferred by humans 80.3% of the time over those from the base SD model while simultaneously improving both the composition and diversity of generated samples.

References (55)
  1. A general language assistant as a laboratory for alignment, 2021.
  2. Training a helpful and harmless assistant with reinforcement learning from human feedback, 2022.
  3. HRS-Bench: Holistic, reliable and scalable benchmark for text-to-image models, 2023.
  4. Easily accessible text-to-image generation amplifies demographic stereotypes at large scale. In 2023 ACM Conference on Fairness, Accountability, and Transparency. ACM, 2023.
  5. Training diffusion models with reinforcement learning, 2023.
  6. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts, 2021.
  7. Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models, 2023.
  8. Training-free layout control with cross-attention guidance, 2023.
  9. Investigating gender and racial biases in DALL-E mini images. Manuscript.
  10. DALL-Eval: Probing the reasoning skills and social biases of text-to-image generation models, 2023.
  11. Fair generative modeling via weak supervision, 2020.
  12. Debiasing vision-language models via biased prompts, 2023.
  13. Directly fine-tuning diffusion models on differentiable rewards, 2023.
  14. Emu: Enhancing image generation models using photogenic needles in a haystack, 2023.
  15. RAFT: Reward ranked finetuning for generative foundation model alignment, 2023.
  16. Reduce, reuse, recycle: Compositional generation with energy-based diffusion models and MCMC. In International Conference on Machine Learning, pages 8489–8510. PMLR, 2023.
  17. Mitigating stereotypical biases in text to image generative systems, 2023.
  18. DPOK: Reinforcement learning for fine-tuning text-to-image diffusion models, 2023.
  19. Training-free structured diffusion guidance for compositional text-to-image synthesis, 2023a.
  20. LayoutGPT: Compositional visual planning and generation with large language models, 2023b.
  21. Fair diffusion: Instructing text-to-image generation models on fairness, 2023.
  22. Benchmarking spatial relationships in text-to-image generation, 2023.
  23. Classifier-free diffusion guidance, 2022.
  24. Denoising diffusion probabilistic models, 2020.
  25. T2I-CompBench: A comprehensive benchmark for open-world compositional text-to-image generation, 2023.
  26. Aligning text-to-image models using human feedback, 2023.
  27. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation, 2022.
  28. GLIGEN: Open-set grounded text-to-image generation, 2023.
  29. LLM-grounded diffusion: Enhancing prompt understanding of text-to-image diffusion models with large language models, 2023.
  30. Compositional visual generation with composable diffusion models. In European Conference on Computer Vision, pages 423–439. Springer, 2022.
  31. Decoupled weight decay regularization, 2019.
  32. Stable bias: Analyzing societal representations in diffusion models, 2023.
  33. Monte carlo gradient estimation in machine learning, 2020.
  34. Social biases through the text-to-image generation lens, 2023.
  35. WebGPT: Browser-assisted question-answering with human feedback, 2022.
  36. Training language models to follow instructions with human feedback, 2022.
  37. Learning transferable visual models from natural language supervision, 2021.
  38. Exploring the limits of transfer learning with a unified text-to-text transformer, 2023.
  39. High-resolution image synthesis with latent diffusion models, 2022.
  40. LAION-5B: An open large-scale dataset for training next generation image-text models, 2022.
  41. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  42. Balancing the picture: Debiasing vision-language datasets with synthetic contrast sets, 2023.
  43. Deep unsupervised learning using nonequilibrium thermodynamics, 2015.
  44. Denoising diffusion implicit models, 2022.
  45. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems. MIT Press, 1999.
  46. Christopher T. H. Teo and Ngai-Man Cheung. Measuring fairness in generative models, 2021.
  47. DiffusionDB: A large-scale prompt gallery dataset for text-to-image generative models. arXiv:2210.14896 [cs], 2022.
  48. Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256, 1992.
  49. Harnessing the spatial-temporal attention of diffusion models for high-fidelity text-to-image synthesis, 2023a.
  50. Human preference score: Better aligning text-to-image models with human preference, 2023b.
  51. ImageReward: Learning and evaluating human preferences for text-to-image generation, 2023.
  52. Scaling autoregressive models for content-rich text-to-image generation, 2022.
  53. ITI-GEN: Inclusive text-to-image generation. In ICCV, 2023a.
  54. Auditing gender presentation differences in text-to-image models, 2023b.
  55. Simple multi-dataset detection, 2022.
Authors (4)
  1. Yinan Zhang (31 papers)
  2. Eric Tzeng (17 papers)
  3. Yilun Du (113 papers)
  4. Dmitry Kislyuk (8 papers)
Citations (19)

Summary

The paper "Large-scale Reinforcement Learning for Diffusion Models" (Zhang et al., 20 Jan 2024 ) introduces a scalable reinforcement learning (RL) framework designed to fine-tune text-to-image diffusion models. The primary objective is to align these models with diverse objectives, including human preferences, compositionality accuracy, and fairness metrics, addressing limitations inherent in models pre-trained on large, uncurated web datasets. These limitations manifest as suboptimal sample quality, propagation of societal biases, and difficulties in accurately rendering complex textual descriptions.

Methodology: RL for Diffusion Model Alignment

The core methodology formulates the diffusion model's iterative denoising process as a finite-horizon Markov Decision Process (MDP). The goal is to optimize the policy, represented by the diffusion model's parameters $\theta$, to maximize the expected terminal reward obtained from the generated image $x_0$.

  • State ($s_t$): Defined by the tuple $(x_t, c, t)$, where $x_t$ is the noisy image, $c$ is the conditioning text prompt, and $t$ is the current timestep.
  • Action ($a_t$): The denoised image predicted for the previous timestep, $x_{t-1} \sim p_\theta(x_{t-1} \mid x_t, c)$.
  • Policy ($\pi_\theta$): The parameterized reverse diffusion process $p_\theta(x_{t-1} \mid x_t, c)$.
  • Reward ($R$): A terminal reward function $r(x_0, c)$ evaluated on the final generated image $x_0$ given the prompt $c$. Intermediate rewards are zero.

The optimization employs the REINFORCE algorithm (specifically, the likelihood-ratio gradient estimator) to update the model parameters $\theta$. The objective is to maximize the expected reward:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[R(\tau)\right] = \mathbb{E}_{x_0 \sim p_\theta(x_0 \mid c)}\left[r(x_0, c)\right]$$

The policy gradient is given by:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(\tau)\, R(\tau)\right]$$
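
To make the formulation concrete, here is a minimal, self-contained PyTorch sketch (not the authors' code) that treats each reverse denoising step as a Gaussian policy, rolls out a short trajectory, and applies the REINFORCE estimator with a terminal reward. The linear `denoiser`, the fixed step variance, and the toy reward are simplifying assumptions for illustration only.

```python
import torch

torch.manual_seed(0)
denoiser = torch.nn.Linear(8, 8)   # stand-in for the UNet predicting the reverse-step mean
sigma = 0.1                        # fixed reverse-step std (simplifying assumption)
T, batch = 4, 16

x_t = torch.randn(batch, 8)        # start the trajectory from pure noise
log_probs = []
for t in reversed(range(T)):
    mean = denoiser(x_t)                               # mean of p_theta(x_{t-1} | x_t, c)
    step = torch.distributions.Normal(mean, sigma)
    x_prev = step.sample()                             # action a_t = x_{t-1}; no reparameterization
    log_probs.append(step.log_prob(x_prev).sum(dim=-1))
    x_t = x_prev

reward = -x_t.pow(2).mean(dim=-1)                      # toy terminal reward r(x_0, c)
advantage = (reward - reward.mean()) / (reward.std() + 1e-6)  # batch-wise normalization

# REINFORCE: trajectory log-probability weighted by the (detached) terminal advantage
loss = -(torch.stack(log_probs).sum(dim=0) * advantage.detach()).mean()
loss.backward()                                        # gradients flow into the denoiser parameters
```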

To enhance training stability and scalability for large datasets (millions of prompts) and large models (like Stable Diffusion v2), several techniques are incorporated:

  1. Importance Sampling and Clipping: Similar to Proximal Policy Optimization (PPO), importance sampling is used to leverage samples generated by older policies. Policy clipping is applied to the likelihood ratio $\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}$ to prevent large, destabilizing updates. The clipped objective forms the RL loss $L_{RL}$.
  2. Advantage Estimation: An advantage estimate $\hat{A}_t$ is used in place of the raw reward $R(\tau)$. For this terminal-reward setting the advantage simplifies; crucially, rewards $r(x_0, c)$ are normalized per batch using the mean and variance of rewards within the current minibatch. This contrasts with prior work such as DDPO, which used per-prompt normalization, and is identified as a key factor enabling large-scale training.
  3. Pretraining Loss Regularization: To prevent the model from over-optimizing the reward function ("reward hacking") and to maintain generative fidelity, the original diffusion pretraining loss (the denoising score-matching loss $L_{pre}$) is added to the RL objective, weighted by a hyperparameter $\beta$:

    $$L_{total} = L_{RL} + \beta L_{pre}$$

    $$L_{pre} = \mathbb{E}_{t, x_t, c, \epsilon}\left[\, \|\epsilon_\theta(x_t, c, t) - \epsilon\|^2 \,\right]$$

    where $\epsilon$ is the ground-truth noise and $\epsilon_\theta$ is the model's noise prediction. A minimal code sketch of this combined objective follows the list.
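
The sketch below is a hedged PyTorch rendering (not the paper's implementation) of this combined loss: a PPO-style clipped surrogate over the denoising-step importance ratios plus the denoising regularizer. The 0.2 clipping range and $\beta = 0.1$ mirror the values reported later; all tensors are toy placeholders.

```python
import torch

eps_clip, beta = 0.2, 0.1   # clipping range and pretraining-loss weight (reported values)

def rl_plus_pretrain_loss(logp_new, logp_old, advantage, eps_pred, eps_true):
    # Importance ratio pi_theta / pi_theta_old for the sampled denoising actions
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - eps_clip, 1.0 + eps_clip) * advantage
    l_rl = -torch.min(unclipped, clipped).mean()          # PPO-style clipped surrogate

    # Denoising score-matching regularizer keeps the model close to its pretraining behaviour
    l_pre = (eps_pred - eps_true).pow(2).mean()
    return l_rl + beta * l_pre

# Toy usage with random stand-in tensors
logp_new = torch.randn(16, requires_grad=True)
loss = rl_plus_pretrain_loss(logp_new, logp_new.detach() + 0.05,
                             torch.randn(16), torch.randn(16, 4), torch.randn(16, 4))
loss.backward()
```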

Handling Distribution-based Rewards

A notable contribution is the method for incorporating reward functions that depend on the distribution of generated samples, rather than individual samples. This is essential for objectives like fairness or diversity (e.g., ensuring diverse skintone representation across generations for certain prompts). The paper proposes approximating the distribution-level reward by computing it empirically over the samples generated within each training minibatch. This minibatch statistic serves as the reward signal for the policy gradient update associated with that distribution-dependent objective. For instance, to promote skintone diversity, a statistical parity metric (negative difference from uniform distribution across skintone categories) is calculated over the images generated in a batch, and this single value is used as the reward for all samples in that batch contributing to the fairness objective.
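
A minimal sketch of this minibatch approximation, assuming a hypothetical 4-way skintone classifier has already labeled each generated image: the reward is the negative gap between the batch's empirical category distribution and a uniform target, and that single scalar is assigned to every sample in the batch. The specific distance metric here is illustrative, not necessarily the paper's exact definition.

```python
import torch

def batch_fairness_reward(skintone_labels: torch.Tensor, num_classes: int = 4) -> torch.Tensor:
    """Distribution-level reward approximated over one minibatch."""
    counts = torch.bincount(skintone_labels, minlength=num_classes).float()
    empirical = counts / counts.sum()                       # empirical skintone distribution
    uniform = torch.full((num_classes,), 1.0 / num_classes)
    parity_gap = (empirical - uniform).abs().sum()          # deviation from uniform (one possible metric)
    reward = -parity_gap                                    # higher is better (more uniform batch)
    return reward.expand(skintone_labels.shape[0])          # same reward for every sample in the batch

labels = torch.tensor([0, 0, 0, 1, 2, 0, 3, 0])             # classifier outputs for 8 generated images
print(batch_fairness_reward(labels))                        # identical value broadcast to all samples
```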

Multi-task Joint Optimization

The framework supports simultaneous optimization for multiple reward functions. The multi-task training procedure (Algorithm 1 in the paper) involves:

  1. Sampling: In each training step, sample a batch of prompts associated with different tasks (e.g., aesthetic preference, fairness, compositionality).
  2. Generation: Generate images $x_0$ for each prompt using the current policy $\pi_\theta$.
  3. Reward Calculation: Compute the relevant reward $r(x_0, c)$ for each generated image based on its associated task.
  4. Gradient Updates: Sequentially compute and apply the policy gradient updates for each task, using the corresponding rewards and batch-wise normalization.
  5. Pretraining Loss Update: Compute and apply the gradient update for the pretraining loss $L_{pre}$.

This allows a single model to be trained to balance and achieve competence across multiple objectives.
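
The outline below renders one such training step in Python for illustration; it is not the paper's Algorithm 1 verbatim, and `generate`, `rl_loss`, and `pretrain_loss` are hypothetical callables standing in for the components sketched earlier, passed in explicitly so the outline stays self-contained.

```python
def multitask_step(model, optimizer, tasks, pretrain_batch,
                   generate, rl_loss, pretrain_loss, beta=0.1):
    """One multi-task update: per-task RL updates followed by a pretraining-loss update."""
    for task in tasks:                                    # e.g. preference, fairness, compositionality
        prompts = task.sample_prompts()
        images, logp_old = generate(model, prompts)       # rollout with the current policy
        rewards = task.reward_fn(images, prompts)         # per-sample or distribution-level reward
        adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)  # batch-wise normalization

        optimizer.zero_grad()
        rl_loss(model, prompts, images, logp_old, adv).backward()
        optimizer.step()

    optimizer.zero_grad()                                 # pretraining loss limits reward hacking
    (beta * pretrain_loss(model, pretrain_batch)).backward()
    optimizer.step()
```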

Implementation Details and Scale

  • Base Model: Stable Diffusion v2 (SDv2) with a 512x512 resolution UNet backbone.
  • Training Scale: Experiments were conducted using 128 NVIDIA A100 GPUs. Datasets involved millions of prompts: 1.5M from DiffusionDB for preference, 240M BLIP-generated captions from Pinterest images for diversity (with race terms filtered), and over 1M synthetic prompts for compositionality.
  • Reward Functions:
    • Preference: ImageReward (IR), a trained model predicting human aesthetic preference.
    • Fairness: A negative statistical parity score based on a 4-category skintone classifier (calculated per minibatch).
    • Compositionality: Average detection confidence score from a UniDet object detector for objects mentioned in the prompt.
  • Hyperparameters: The pretraining-loss weight $\beta$ was tuned; values around 0.1 were often effective. The PPO clipping range was typically set to 0.2. The AdamW optimizer was used.
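
For reference, these reported settings can be gathered into a small configuration object; this is purely illustrative, and fields the summary does not report (learning rate, batch size) are deliberately left unset rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AlignmentConfig:
    base_model: str = "stable-diffusion-v2"   # 512x512 UNet backbone
    optimizer: str = "AdamW"
    pretrain_loss_weight: float = 0.1         # beta; values around 0.1 reported as effective
    ppo_clip_range: float = 0.2               # typical clipping range
    reward_normalization: str = "per-batch"   # vs. per-prompt normalization in DDPO
    learning_rate: Optional[float] = None     # not reported in this summary
    batch_size: Optional[int] = None          # not reported in this summary

config = AlignmentConfig()
```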

Experimental Results

The proposed RL framework demonstrated significant improvements over the base SDv2 model and existing alignment methods across various tasks.

  • Human Preference: The RL-tuned model was preferred by human evaluators 80.3% of the time against the base SDv2 model. It also achieved higher ImageReward scores and aesthetic ratings (on DiffusionDB and PartiPrompts) compared to baselines like ReFL, RAFT, DRaFT, and Reward-weighted Resampling. The RL approach appeared more robust to reward hacking than direct gradient methods like ReFL, which sometimes introduced high-frequency artifacts.
  • Fairness (Skintone Diversity): Using the minibatch-based distribution reward, the model significantly reduced skintone bias compared to SDv2 on out-of-domain datasets (occupations, HRS-Bench), producing more equitable distributions across Fitzpatrick scale categories.
  • Compositionality: The model fine-tuned with the UniDet reward showed improved ability to generate the correct objects specified in prompts, outperforming SDv2, particularly for prompts involving multiple objects and spatial relationships.
  • Multi-task Learning: A single model jointly trained on preference, fairness, and compositionality rewards substantially outperformed the base SDv2 on all three metrics simultaneously. While specialized single-task models achieved peak performance on their respective metric, they often degraded performance on others (an "alignment tax"). The jointly trained model successfully mitigated this alignment tax, achieving over 80% of the relative performance gain of the specialized models across all objectives.

Practical Significance and Applications

This work provides a scalable and general framework for fine-tuning large diffusion models using RL. Its key practical implications include:

  • Scalability: Demonstrates feasibility of RLHF-style alignment at the scale of millions of prompts, enabled by techniques like batch-wise reward normalization.
  • Generality: Applicable to arbitrary, potentially non-differentiable reward functions (e.g., outputs of object detectors, human feedback simulators, fairness metrics) and distribution-level objectives.
  • Multi-Objective Alignment: Offers a practical method for balancing multiple alignment criteria (e.g., aesthetics, safety, fairness, instruction following) within a single model, crucial for real-world deployment.
  • Improved Robustness: Combining the RL optimization with the pretraining-loss regularizer appears less susceptible to reward hacking than methods that rely solely on backpropagating reward gradients directly.
  • Deployment Strategy: Provides a post-hoc fine-tuning mechanism to adapt pre-trained foundation models to specific downstream requirements and ethical considerations without costly inference-time guidance or complete retraining.

This RL framework can be applied to tailor diffusion models for specific applications requiring high aesthetic quality, adherence to complex compositional instructions, or mitigation of known biases present in the original training data.

In conclusion, the paper presents a robust and scalable RL-based approach for aligning text-to-image diffusion models. By successfully incorporating diverse reward functions, including distribution-level metrics, and enabling effective multi-task optimization at scale, it offers a significant advancement in improving the controllability, quality, and fairness of generative image models.
