Diffusion Model Alignment Using Direct Preference Optimization
The paper presents Diffusion-DPO, a method for aligning text-to-image diffusion models with human preferences, adapted from Direct Preference Optimization (DPO). This work extends alignment methodologies commonly used for large language models (LLMs) to text-to-image diffusion models, which have historically lagged in incorporating human preference learning.
Research Context and Contributions
In contrast to the two-stage training pipelines used for LLMs, where pretrained models are subsequently fine-tuned on human preferences using methods such as Reinforcement Learning from Human Feedback (RLHF), text-to-image models typically rely on single-stage training over large-scale web data. The current best practice is to fine-tune on curated datasets of high-quality images and captions, which lacks the flexibility and power of the alignment techniques applied to LLMs. With Diffusion-DPO, the authors bridge this methodological gap by directly optimizing diffusion models on human comparison data under a classification-based objective, a structure more typical of preference learning in natural language processing.
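For reference, the objective that Diffusion-DPO adapts is the DPO loss for language models, a logistic (classification-style) loss over preference pairs. With σ the logistic function, β a regularization strength, π_θ the trainable policy, π_ref a frozen reference policy, and (x, y_w, y_l) a prompt paired with preferred and rejected completions:

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta) =
  -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[
    \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)
  \right]
```

The central difficulty in transferring this objective to diffusion models is that the likelihood π_θ(y | x) of an image is intractable; the paper works around this via the evidence lower bound, as described next.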
The authors leverage the Pick-a-Pic dataset, consisting of 851,000 pairwise human preferences over generated images, to fine-tune the Stable Diffusion XL (SDXL)-1.0 model. Notably, the work reformulates the DPO objective to suit the probabilistic framework of diffusion models: because image likelihoods are intractable, it substitutes the evidence lower bound (ELBO), yielding a differentiable optimization problem that aligns the model with human preferences.
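In practice, the resulting objective compares denoising errors on the preferred and rejected images against those of a frozen reference model. Below is a minimal, illustrative sketch of one training step in PyTorch; it is not the authors' code, and `model`, `ref_model`, and the diffusers-style `noise_scheduler` are assumed interfaces (an epsilon-prediction UNet taking a noised latent, timestep, and prompt conditioning). The per-timestep weighting is folded into the constant `beta`.

```python
import torch
import torch.nn.functional as F

def diffusion_dpo_loss(model, ref_model, x_w, x_l, cond, noise_scheduler, beta=5000.0):
    """Diffusion-DPO loss on a batch of (preferred, rejected) latent pairs.

    x_w, x_l: clean latents of the preferred / rejected images.
    cond:     text-conditioning embeddings for the shared prompt.
    beta:     DPO regularization strength (the paper reports values on
              the order of a few thousand; 5000 is an assumed default).
    """
    b = x_w.shape[0]
    t = torch.randint(
        0, noise_scheduler.config.num_train_timesteps, (b,), device=x_w.device
    )
    noise = torch.randn_like(x_w)

    # Noise both latents with the same epsilon and timestep so their
    # denoising errors are directly comparable.
    noisy_w = noise_scheduler.add_noise(x_w, noise, t)
    noisy_l = noise_scheduler.add_noise(x_l, noise, t)

    # Per-example denoising errors under the trainable model.
    err_w = F.mse_loss(model(noisy_w, t, cond), noise, reduction="none").mean(dim=[1, 2, 3])
    err_l = F.mse_loss(model(noisy_l, t, cond), noise, reduction="none").mean(dim=[1, 2, 3])

    # The same errors under the frozen reference model.
    with torch.no_grad():
        ref_err_w = F.mse_loss(ref_model(noisy_w, t, cond), noise, reduction="none").mean(dim=[1, 2, 3])
        ref_err_l = F.mse_loss(ref_model(noisy_l, t, cond), noise, reduction="none").mean(dim=[1, 2, 3])

    # Implicit reward margin: how much more the trainable model improves on
    # the preferred sample than on the rejected one, relative to the reference.
    margin = (err_w - ref_err_w) - (err_l - ref_err_l)

    # Logistic (classification-style) loss on the margin, as in DPO.
    return -F.logsigmoid(-beta * margin).mean()
```

Minimizing this loss pushes the model to denoise preferred images better, and rejected images worse, than the reference does, which is the diffusion analogue of DPO's implicit reward.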
Key Findings and Results
Results from Diffusion-DPO fine-tuning demonstrate significant improvements over baseline diffusion models. In human evaluations, images from the DPO-tuned SDXL model are preferred 69-70% of the time over baseline SDXL, and the tuned base model even surpasses the larger two-stage SDXL system that adds a refinement model. This indicates not only enhanced prompt alignment and visual appeal but also a capacity to generalize across a broad range of prompts without increasing computational cost at inference.
The paper also investigates AI-generated feedback as a substitute for human feedback, a process referred to as learning from AI feedback. This variant yields performance comparable to training on human preference data, suggesting a scalable pathway for future alignment work without the extensive cost and labor of large-scale human annotation.
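As a hedged illustration of this pipeline, pseudo-preference pairs can be constructed by scoring candidate generations with a pretrained preference model (the paper experiments with automated scorers such as PickScore in place of human raters). The `score` callable below is an assumed interface mapping a (prompt, image) pair to a scalar, with higher meaning more preferred:

```python
from typing import Any, Callable, List, Tuple

def label_with_ai_feedback(
    prompt: str,
    images: List[Any],
    score: Callable[[str, Any], float],
) -> Tuple[Any, Any]:
    """Return a (preferred, rejected) image pair for one prompt.

    Scores every candidate image for the prompt, then treats the
    highest-scoring image as 'preferred' and the lowest as 'rejected',
    mirroring how pairwise human preferences are recorded.
    """
    ranked = sorted(images, key=lambda img: score(prompt, img), reverse=True)
    return ranked[0], ranked[-1]

# The resulting (prompt, preferred, rejected) triples feed into the same
# Diffusion-DPO loss used with human-labeled pairs.
```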
Implications and Future Directions
The implications of this work are substantial in several dimensions. Practically, the ability to align diffusion models with human aesthetic preferences through scalable, efficient means can democratize high-fidelity model training, significantly enhancing user experience in creative AI applications. Theoretically, it opens up new questions about the robustness of preference learning across different modalities, suggesting that methodologies similar to those employed in LLMs may be cross-applicable with careful adaptation.
Future work could explore the granularity of preference adaptation, specifically how models can be fine-tuned on smaller, personalized datasets to cater to individual or group preferences. There is also scope for investigating online adaptation of diffusion models, leveraging streaming feedback to refine model outputs in real time and further enhance the interactive capabilities of AI agents.
Overall, Diffusion-DPO marks a significant step forward in fine-tuning diffusion models to align with human preferences, leveraging the strengths of direct preference optimization to bring generative image models closer to user expectations. Such advancements are crucial as AI continues to proliferate in domains requiring nuanced user interaction and satisfaction.