Training Diffusion Models with Reinforcement Learning (2305.13301v4)
Abstract: Diffusion models are a class of flexible generative models trained with an approximation to the log-likelihood objective. However, most use cases of diffusion models are not concerned with likelihoods, but instead with downstream objectives such as human-perceived image quality or drug effectiveness. In this paper, we investigate reinforcement learning methods for directly optimizing diffusion models for such objectives. We describe how posing denoising as a multi-step decision-making problem enables a class of policy gradient algorithms, which we refer to as denoising diffusion policy optimization (DDPO), that are more effective than alternative reward-weighted likelihood approaches. Empirically, DDPO is able to adapt text-to-image diffusion models to objectives that are difficult to express via prompting, such as image compressibility, and those derived from human feedback, such as aesthetic quality. Finally, we show that DDPO can improve prompt-image alignment using feedback from a vision-language model without the need for additional data collection or human annotation. The project's website can be found at http://rl-diffusion.github.io.
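The abstract's key move is to treat each denoising step as an action in a multi-step decision-making problem, which makes score-function (REINFORCE-style) policy gradients applicable: roll out full denoising trajectories, score only the final image with the downstream reward, and increase the log-probability of every step along high-reward trajectories. The sketch below illustrates that structure only; `sample_trajectory`, `step_log_prob`, and `reward_fn` are hypothetical placeholders, and the paper's DDPO algorithms build on this idea with further refinements not shown here.

```python
# Minimal REINFORCE-style sketch of policy-gradient fine-tuning over the
# denoising MDP. All helper callables are assumed, not the authors' code.
import torch


def ddpo_reinforce_update(model, optimizer, prompts,
                          sample_trajectory, step_log_prob, reward_fn):
    """One policy-gradient update over a batch of denoising trajectories.

    sample_trajectory(model, prompt) -> list of (x_t, t, x_prev) transitions
    step_log_prob(model, x_t, t, x_prev, prompt) -> log p_theta(x_prev | x_t, prompt)
    reward_fn(final_image, prompt) -> float downstream reward (e.g. aesthetic score)
    """
    trajectories, rewards = [], []
    with torch.no_grad():  # rollouts are fixed samples from the current model
        for prompt in prompts:
            traj = sample_trajectory(model, prompt)
            trajectories.append((prompt, traj))
            # reward is computed only on the final denoised image x_0
            rewards.append(reward_fn(traj[-1][2], prompt))

    rewards = torch.tensor(rewards)
    # per-batch baseline: normalized rewards act as advantages
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    loss = 0.0
    for (prompt, traj), adv in zip(trajectories, advantages):
        # every denoising step in a trajectory shares the terminal reward
        for x_t, t, x_prev in traj:
            loss = loss - adv * step_log_prob(model, x_t, t, x_prev, prompt)

    optimizer.zero_grad()
    (loss / len(prompts)).backward()
    optimizer.step()
```

In practice the reward model here could be a compressibility score, an aesthetic predictor, or alignment feedback from a vision-language model, as described in the abstract; only the scalar `reward_fn` changes.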
Authors: Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, Sergey Levine