Introduction
Diffusion models have significantly advanced AI-driven text-to-image generation, turning textual descriptions into striking visual content. Despite this progress, even prominent models such as Imagen, DALL-E 2, and Stable Diffusion struggle with prompts that carry intricate specifications, such as exact object counts or specific colors. In reinforcement learning (RL), a promising direction has been to use human feedback to refine models so that they align better with human preferences. This paper presents DPOK (Diffusion Policy Optimization with KL regularization), an approach that uses online RL to fine-tune text-to-image diffusion models.
Methodology
DPOK applies online reinforcement learning to maximize the expected reward of generated images, where the reward model is trained on human feedback, so that outputs align better with human evaluations. Alongside reward maximization, the method adds a KL-divergence term as regularization, keeping the fine-tuned model from drifting too far from the pretrained model's capabilities. The authors also provide theoretical analysis comparing KL regularization in the online RL and supervised fine-tuning settings. A distinctive property of DPOK is that it evaluates the reward and the conditional KL divergence on samples drawn from the current model rather than only on a fixed supervised dataset, which the authors argue gives it an empirical edge over supervised fine-tuning.
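To make the objective concrete, below is a minimal, self-contained sketch of a DPOK-style update step: a REINFORCE-style term that increases the reward-weighted log-likelihood of the sampled denoising steps, plus a per-step KL penalty toward the frozen pretrained model. The class and coefficient names (DenoisingPolicy, alpha, beta) and the simplified Gaussian denoising steps are illustrative stand-ins, not the authors' implementation.

```python
# Hedged sketch of a DPOK-style loss under a toy Gaussian denoising model.
import torch
import torch.nn as nn

class DenoisingPolicy(nn.Module):
    """Toy stand-in for a conditional denoising network p_theta(x_{t-1} | x_t, z)."""
    def __init__(self, dim=16):
        super().__init__()
        self.net = nn.Linear(dim, dim)
        self.log_std = nn.Parameter(torch.zeros(dim))

    def dist(self, x_t):
        mean = self.net(x_t)
        return torch.distributions.Normal(mean, self.log_std.exp())

def dpok_style_loss(policy, pretrained, trajectory, reward, alpha=1.0, beta=0.01):
    """Illustrative loss: reward-weighted policy gradient plus per-step KL
    regularization toward the frozen pretrained model."""
    pg_terms, kl_terms = [], []
    for x_t, x_prev in trajectory:  # (x_t, x_{t-1}) pairs from a sampled denoising chain
        pi = policy.dist(x_t)
        with torch.no_grad():
            pre = pretrained.dist(x_t)
        # REINFORCE-style term: reward-weighted negative log-likelihood of the taken step.
        pg_terms.append(-reward * pi.log_prob(x_prev).sum())
        # KL(current || pretrained) for this denoising step.
        kl_terms.append(torch.distributions.kl_divergence(pi, pre).sum())
    return alpha * torch.stack(pg_terms).mean() + beta * torch.stack(kl_terms).mean()

# Toy usage: a random "trajectory" and scalar reward just to exercise the shapes.
policy, pretrained = DenoisingPolicy(), DenoisingPolicy()
traj = [(torch.randn(16), torch.randn(16)) for _ in range(4)]
loss = dpok_style_loss(policy, pretrained, traj, reward=torch.tensor(0.8))
loss.backward()
```

The two coefficients mirror the trade-off described above: alpha scales the reward signal from human feedback, while beta controls how strongly the fine-tuned model is pulled back toward the pretrained denoising distribution.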
Experimental Results
DPOK is evaluated by fine-tuning Stable Diffusion with ImageReward as the reward model, targeting text-to-image alignment while retaining high image fidelity. The results show that DPOK generally outperforms supervised fine-tuning on both counts: online RL fine-tuning yields stronger text-image alignment, reflected in higher ImageReward scores, and preserves image quality, as evidenced by higher aesthetic scores. Human evaluations likewise consistently favor the RL-fine-tuned model over the supervised one for both image-text alignment and image quality. The paper also shows that DPOK can mitigate biases inherited from web-scale pretraining; for example, for the prompt "Four roses" the pretrained model tends to depict the whiskey brand, whereas the fine-tuned model generates the flowers.
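For reference, the following is a hedged sketch of the evaluation setup described above: sampling from Stable Diffusion via the diffusers library and scoring text-image alignment with the public ImageReward package. The model identifiers, the example prompt, and the RM.load / score calls follow those packages' commonly documented usage and should be treated as assumptions, not the paper's exact evaluation code.

```python
# Sketch: generate samples with Stable Diffusion and score them with ImageReward.
import torch
from diffusers import StableDiffusionPipeline
import ImageReward as RM  # public THUDM/ImageReward package

prompt = "A green colored rabbit"  # example compositional prompt (color binding)

device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to(device)
images = pipe(prompt, num_images_per_prompt=4).images  # list of PIL images

reward_model = RM.load("ImageReward-v1.0")
scores = reward_model.score(prompt, images)  # higher score = better text-image alignment
print(scores)
```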
Conclusion
DPOK marks a significant step in improving text-to-image diffusion models through online RL fine-tuning. The technique shows a substantial improvement over supervised fine-tuning, optimizing for image-text alignment while maintaining, or even improving, the aesthetic quality of the generated images. It sets the stage for further work on efficient online RL fine-tuning that could let models reliably generate complex and varied images while staying attuned to human judgment. The paper acknowledges limitations and calls for future research on the efficiency and adaptability of fine-tuning with diverse prompts. It also notes broader impacts, emphasizing the need to thoroughly understand reward models, since they now exert greater influence over the fine-tuning process.