Introduction
Diffusion models have significantly advanced AI-driven text-to-image generation, turning textual descriptions into striking visual content. Despite this progress, even prominent models such as Imagen, DALL-E 2, and Stable Diffusion struggle with prompts that carry intricate specifications, such as exact object counts or specific colors. In reinforcement learning (RL), a promising direction has been to use human feedback to refine models so that they align better with human preferences. This paper presents DPOK (Diffusion Policy Optimization with KL regularization), an approach that uses online RL to fine-tune text-to-image diffusion models.
Methodology
DPOK applies online reinforcement learning to maximize the expected reward of generated images, where the reward model is trained on human feedback, so that outputs align better with human evaluations. Alongside reward maximization, the method adds a KL-divergence term as regularization, keeping the fine-tuned model from drifting too far from the pretrained model's capabilities. The authors also provide theoretical analysis comparing KL regularization in the online RL and supervised fine-tuning settings. A distinctive property of DPOK is that it evaluates the reward and the conditional KL divergence on samples drawn from the current model rather than only on a fixed supervised dataset, which the authors argue gives it an empirical edge over supervised fine-tuning.
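To make the objective concrete, below is a minimal, self-contained sketch of a DPOK-style update step: a REINFORCE-style term that increases the reward-weighted log-likelihood of the sampled denoising steps, plus a per-step KL penalty toward the frozen pretrained model. The class and coefficient names (DenoisingPolicy, alpha, beta) and the simplified Gaussian denoising steps are illustrative stand-ins, not the authors' implementation.

```python
# Hedged sketch of a DPOK-style loss under a toy Gaussian denoising model.
import torch
import torch.nn as nn

class DenoisingPolicy(nn.Module):
    """Toy stand-in for a conditional denoising network p_theta(x_{t-1} | x_t, z)."""
    def __init__(self, dim=16):
        super().__init__()
        self.net = nn.Linear(dim, dim)
        self.log_std = nn.Parameter(torch.zeros(dim))

    def dist(self, x_t):
        mean = self.net(x_t)
        return torch.distributions.Normal(mean, self.log_std.exp())

def dpok_style_loss(policy, pretrained, trajectory, reward, alpha=1.0, beta=0.01):
    """Illustrative loss: reward-weighted policy gradient plus per-step KL
    regularization toward the frozen pretrained model."""
    pg_terms, kl_terms = [], []
    for x_t, x_prev in trajectory:  # (x_t, x_{t-1}) pairs from a sampled denoising chain
        pi = policy.dist(x_t)
        with torch.no_grad():
            pre = pretrained.dist(x_t)
        # REINFORCE-style term: reward-weighted negative log-likelihood of the taken step.
        pg_terms.append(-reward * pi.log_prob(x_prev).sum())
        # KL(current || pretrained) for this denoising step.
        kl_terms.append(torch.distributions.kl_divergence(pi, pre).sum())
    return alpha * torch.stack(pg_terms).mean() + beta * torch.stack(kl_terms).mean()

# Toy usage: a random "trajectory" and scalar reward just to exercise the shapes.
policy, pretrained = DenoisingPolicy(), DenoisingPolicy()
traj = [(torch.randn(16), torch.randn(16)) for _ in range(4)]
loss = dpok_style_loss(policy, pretrained, traj, reward=torch.tensor(0.8))
loss.backward()
```

The two coefficients mirror the trade-off described above: alpha scales the reward signal from human feedback, while beta controls how strongly the fine-tuned model is pulled back toward the pretrained denoising distribution.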
Experimental Results
DPOK is evaluated by fine-tuning Stable Diffusion with ImageReward as the reward model, targeting text-to-image alignment while retaining high image fidelity. The results show that DPOK generally outperforms supervised fine-tuning on both counts: online RL fine-tuning yields stronger text-image alignment, reflected in higher ImageReward scores, and preserves image quality, as evidenced by higher aesthetic scores. Human evaluations likewise consistently favor the RL-fine-tuned model over the supervised one for both image-text alignment and image quality. The paper also shows that DPOK can mitigate biases inherited from web-scale pretraining; for example, for the prompt "Four roses" the pretrained model tends to depict the whiskey brand, whereas the fine-tuned model generates the flowers.
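For reference, the following is a hedged sketch of the evaluation setup described above: sampling from Stable Diffusion via the diffusers library and scoring text-image alignment with the public ImageReward package. The model identifiers, the example prompt, and the RM.load / score calls follow those packages' commonly documented usage and should be treated as assumptions, not the paper's exact evaluation code.

```python
# Sketch: generate samples with Stable Diffusion and score them with ImageReward.
import torch
from diffusers import StableDiffusionPipeline
import ImageReward as RM  # public THUDM/ImageReward package

prompt = "A green colored rabbit"  # example compositional prompt (color binding)

device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to(device)
images = pipe(prompt, num_images_per_prompt=4).images  # list of PIL images

reward_model = RM.load("ImageReward-v1.0")
scores = reward_model.score(prompt, images)  # higher score = better text-image alignment
print(scores)
```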
Conclusion
DPOK marks a significant step in improving text-to-image diffusion models through online RL fine-tuning. The technique shows a substantial improvement over supervised fine-tuning, optimizing for image-text alignment while maintaining, or even improving, the aesthetic quality of the generated images. It sets the stage for further work on efficient online RL fine-tuning that could let models reliably generate complex and varied images while staying attuned to human judgment. The paper acknowledges limitations and calls for future research on the efficiency and adaptability of fine-tuning with diverse prompts. It also notes broader impacts, emphasizing the need to thoroughly understand reward models, since they now exert greater influence over the fine-tuning process.