Training Diffusion Models with Reinforcement Learning (2305.13301v4)

Published 22 May 2023 in cs.LG, cs.AI, and cs.CV

Abstract: Diffusion models are a class of flexible generative models trained with an approximation to the log-likelihood objective. However, most use cases of diffusion models are not concerned with likelihoods, but instead with downstream objectives such as human-perceived image quality or drug effectiveness. In this paper, we investigate reinforcement learning methods for directly optimizing diffusion models for such objectives. We describe how posing denoising as a multi-step decision-making problem enables a class of policy gradient algorithms, which we refer to as denoising diffusion policy optimization (DDPO), that are more effective than alternative reward-weighted likelihood approaches. Empirically, DDPO is able to adapt text-to-image diffusion models to objectives that are difficult to express via prompting, such as image compressibility, and those derived from human feedback, such as aesthetic quality. Finally, we show that DDPO can improve prompt-image alignment using feedback from a vision-language model without the need for additional data collection or human annotation. The project's website can be found at http://rl-diffusion.github.io.

Authors (5)
  1. Kevin Black (29 papers)
  2. Michael Janner (14 papers)
  3. Yilun Du (113 papers)
  4. Ilya Kostrikov (25 papers)
  5. Sergey Levine (531 papers)
Citations (206)

Summary

Review of "Training Diffusion Models with Reinforcement Learning"

The paper presents an approach to training diffusion models with reinforcement learning, enabling direct optimization for downstream objectives such as human-perceived image quality or other task-specific metrics. Diffusion models are conventionally trained to maximize an approximation to the log-likelihood, but this work shifts the training target to objectives that are more practically relevant and often require qualitative assessment.

Methodological Insights

The central innovation lies in casting the denoising process as a multi-step decision-making problem, which makes reinforcement learning methods directly applicable to diffusion models. Two algorithmic approaches are compared: reward-weighted regression (RWR), adapted from prior work, and the proposed denoising diffusion policy optimization (DDPO). DDPO formalizes denoising as a Markov decision process (MDP) in which each denoising step is an action whose likelihood under the model can be evaluated exactly, allowing policy gradient methods to fine-tune the model for task-specific rewards rather than relying on approximate likelihood weighting.
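
To make the MDP framing concrete, below is a minimal, self-contained sketch of the score-function flavor of this idea, under heavy simplifying assumptions: a toy Gaussian "denoiser" over a low-dimensional latent, a fixed per-step noise scale, and a placeholder reward stand in for the paper's Stable Diffusion setup and learned reward models. The class and function names and all hyperparameters are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

T = 10        # number of denoising steps (toy value)
DIM = 4       # dimensionality of the "image" latent (toy value)
SIGMA = 0.1   # fixed per-step noise scale (assumption)

class ToyDenoiser(nn.Module):
    """Predicts the mean of p_theta(x_{t-1} | x_t, t)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(DIM + 1, 64), nn.ReLU(), nn.Linear(64, DIM))

    def forward(self, x_t, t):
        t_feat = torch.full((x_t.shape[0], 1), float(t) / T)
        return self.net(torch.cat([x_t, t_feat], dim=-1))

def reward(x0):
    # Placeholder reward: prefer final samples close to the all-ones vector.
    return -((x0 - 1.0) ** 2).sum(dim=-1)

model = ToyDenoiser()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(200):
    x_t = torch.randn(64, DIM)          # start each trajectory from Gaussian noise
    log_probs = []
    for t in range(T, 0, -1):           # denoise from t = T down to t = 1
        mean = model(x_t, t)
        dist = torch.distributions.Normal(mean, SIGMA)
        x_prev = dist.sample()          # the "action": the next, less-noisy latent
        log_probs.append(dist.log_prob(x_prev).sum(dim=-1))
        x_t = x_prev
    r = reward(x_t)                     # reward is given only on the final sample x_0
    advantage = r - r.mean()            # simple baseline to reduce variance
    # Score-function (REINFORCE) estimator summed over the denoising trajectory.
    loss = -(torch.stack(log_probs).sum(dim=0) * advantage.detach()).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The property the sketch relies on is the one the paper exploits: each denoising step has a tractable Gaussian likelihood, so the log-probability of the sampled trajectory can be differentiated exactly with respect to the model parameters.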

Empirically, DDPO outperforms RWR on tasks such as maximizing image aesthetic quality and improving prompt-image alignment. The gap is attributed to DDPO exactly optimizing the step-wise structure of the denoising process with policy gradients, whereas RWR relies on an approximate, reward-weighted likelihood objective.
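
For contrast, a hedged sketch of the weighting scheme behind reward-weighted regression: samples are drawn from the current model and the (approximate) likelihood objective is reweighted by an exponentiated reward; the paper also considers a sparse, thresholded variant. The temperature parameter and batch normalization here are illustrative assumptions.

```python
import torch

def rwr_weights(rewards: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """Per-sample weights for reward-weighted regression: exp(r / temperature),
    normalized over the batch so the weights sum to one."""
    w = torch.exp(rewards / temperature)
    return w / w.sum()
```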

Experimental Validation

The experimental evaluation uses Stable Diffusion as the base generative model to demonstrate practical applicability. Three reward scenarios are tested: image compressibility, aesthetic quality derived from human feedback, and prompt-image alignment scored by a vision-language model (VLM). These experiments demonstrate the feasibility of DDPO and show that the learned modifications generalize to prompts not seen during fine-tuning, indicating a noteworthy degree of transferability.
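
As a concrete example of an objective that is hard to express via prompting, here is a hedged sketch of a compressibility-style reward in the spirit of the first scenario: compress the generated image as a JPEG at a fixed quality and score it by negative file size (flipping the sign gives an incompressibility reward). The quality setting and kilobyte scaling are assumptions rather than the paper's exact constants.

```python
import io
from PIL import Image

def jpeg_compressibility_reward(image: Image.Image, quality: int = 95) -> float:
    """Reward a generated image by how small it is after JPEG compression."""
    buf = io.BytesIO()
    image.convert("RGB").save(buf, format="JPEG", quality=quality)
    kilobytes = buf.tell() / 1024
    return -kilobytes  # more compressible images (smaller files) score higher
```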

Implications and Future Work

This work signifies a step forward in tailoring generative models to meet particular user-defined goals beyond generic distribution matching. With the proposed methodology, diffusion models can be adapted to align more closely with qualitative human judgments, opening pathways for applications in fields requiring high-fidelity, context-specific generative capabilities, such as digital art generation and targeted content creation.

From a theoretical perspective, DDPO enhances the discourse on melding reinforcement learning with generative modeling, suggesting a productive avenue for further exploration of policy gradient methods in optimizing complex, multi-step generative processes. Given the framework's adaptability, future research could extend to richer forms of feedback, such as multimodal signals or direct user interaction, further enriching the generative landscape.

Additionally, this work highlights open problems around reward overoptimization, where the fine-tuned model exploits flaws in the reward signal, and the limited capabilities of current VLM-based rewards. Future work may refine how reward signals are specified and mitigate the trade-off between aggressive optimization and faithfulness to the intended objective, ensuring that the resulting generative models maintain both quality and ethical standards.

In summary, "Training Diffusion Models with Reinforcement Learning" offers a novel perspective on leveraging RL to directly target application-driven objectives in generative modeling, suggesting substantial implications for both practical applications and theoretical advancements in AI.
