- The paper introduces SPIN-Diffusion, a self-play fine-tuning method that refines diffusion models without requiring human preference data.
- It reformulates fine-tuning as a two-player minimax game, where the model competes against its previous iterations to boost performance.
- Empirical results on datasets like Pick-a-Pic show marked improvements in aesthetic quality and alignment with human preferences over conventional methods.
Enhancing Text-to-Image Generation with SPIN-Diffusion
Innovating Fine-Tuning through Self-Play
The paper introduces SPIN-Diffusion, a novel self-play fine-tuning approach for diffusion models, particularly beneficial in text-to-image generation tasks. Unlike traditional reinforcement learning (RL) methods, which require extensive human preference data, SPIN-Diffusion leverages a self-play mechanism that avoids this data dependency. The technique iteratively refines the model by pitting it against prior versions of itself, creating a continuous improvement loop. In comprehensive experiments on the Pick-a-Pic dataset, the algorithm outperformed both conventional supervised fine-tuning (SFT) and RLHF-based methods across various metrics, including visual quality and alignment with human preferences.
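The iterative loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names (`generate`, `update`) and the data interfaces are assumptions introduced here for clarity. Each round freezes the previous checkpoint as the "opponent" and updates the current model to prefer real data over the opponent's generations.

```python
import copy

def self_play_finetune(model, prompts, real_images, generate, update, rounds=3):
    """Illustrative self-play fine-tuning loop (a sketch; names are
    hypothetical, not the paper's API).

    Each round:
      1. Freeze a copy of the current model as the opponent.
      2. Generate synthetic images from the opponent for the given prompts.
      3. Update the current model to distinguish real images from the
         opponent's generations, using the opponent as reference.
    """
    for _ in range(rounds):
        opponent = copy.deepcopy(model)                    # previous iteration, frozen
        synthetic = [generate(opponent, p) for p in prompts]  # opponent's images
        model = update(model, opponent, real_images, synthetic)
    return model
```

The key design choice this sketch reflects is that no external reward model or human preference pairs are needed: the "losing" samples at each round come from the model's own previous iteration.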
Technical Insights
Central to SPIN-Diffusion is the formulation of fine-tuning as a two-player minimax game in which both players are iterations of the diffusion model itself. This configuration removes the requirement for paired human preference data, a significant advancement given the scarcity of such datasets. The approach addresses the inherent challenges of applying self-play to diffusion models, especially the difficulty of evaluating performance given the probabilistic nature of these models and their outputs. The authors navigate these issues by designing an objective function that integrates over all possible denoising trajectories and by approximating the otherwise intractable distribution of the diffusion model. Efficiency is further improved by employing the Gaussian reparameterization trick from DDIM, which makes the algorithm practical to execute.
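Under the DDIM-style reparameterization, the per-trajectory objective can be approximated with per-timestep noise-prediction errors. The sketch below is an assumption-laden illustration of such a SPIN-style logistic loss, not the paper's exact objective: it takes precomputed squared noise-prediction errors of the current and the frozen previous-iteration model on real versus self-generated samples, and rewards the current model for fitting real data better (and its opponent's generations worse) than the previous iteration did.

```python
import numpy as np

def spin_style_loss(err_real_cur, err_real_prev, err_gen_cur, err_gen_prev, beta=1.0):
    """Hypothetical SPIN-style logistic loss over noise-prediction errors.

    err_*_cur:  squared eps-prediction errors of the current model
    err_*_prev: the same errors under the frozen previous-iteration model
    'real' samples come from the target dataset; 'gen' samples come from
    the previous iteration's own generations.

    The margin is positive when the current model improves on real data
    relative to the opponent while degrading on the opponent's generations.
    """
    margin = beta * ((err_real_prev - err_real_cur) - (err_gen_prev - err_gen_cur))
    # Numerically stable -log(sigmoid(margin)) via log1p(exp(-margin)).
    return float(np.mean(np.log1p(np.exp(-margin))))
```

When all errors are equal, the margin is zero and the loss sits at log 2; any net improvement of the current model over its opponent pushes the loss below that baseline, mirroring how a self-play round scores progress against the previous iteration.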
Empirical Validation
The effectiveness of SPIN-Diffusion is validated through rigorous experimental analysis. The model was tested against various baselines on the Pick-a-Pic dataset, showing marked improvements from its first iteration and exceeding RLHF-based methods by the second. Specifically, it achieved significantly higher aesthetic scores and stronger alignment with human preferences, demonstrating its ability to produce visually appealing and contextually coherent images. The algorithm's robustness was further corroborated on additional benchmarks such as PartiPrompts and HPSv2, confirming its advantage in aligning with human judgment and enhancing image quality.
Future Directions in AI and Diffusion Models
SPIN-Diffusion opens new avenues for refining diffusion models, catering specifically to scenarios with scarce data resources. This methodology is poised to significantly reduce the barrier to entry for leveraging high-quality text-to-image generation, democratizing access to advanced generative AI capabilities. Looking forward, the integration of self-play into the fine-tuning process of diffusion models could catalyze further innovations in AI, driving advancements in natural language understanding, semantic image synthesis, and beyond. Additionally, exploring the application of SPIN-Diffusion in other domains of generative AI, such as video generation and 3D modeling, presents an exciting prospect for future research.
Conclusion
The SPIN-Diffusion algorithm represents a significant milestone in the development of diffusion models, offering a sophisticated yet data-efficient method for improving text-to-image generation. By leveraging a self-play mechanism, it circumvents the limitations posed by the dependency on human preference data, showcasing an innovative path for enhancing model performance. The algorithm's success in producing high-quality, human-aligned images forecasts its potential impact on both the theoretical understanding and practical applications of generative AI technologies.