Diff-Instruct*: Towards Human-Preferred One-step Text-to-Image Generative Models
This paper introduces Diff-Instruct* (DI*), a method for aligning one-step text-to-image generative models with human preferences. The approach applies online reinforcement learning from human feedback (RLHF) to guide the generator: the central objective is to maximize a reward function representing human preferences while keeping the generated distribution close to a reference diffusion process.
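In slightly simplified notation (the paper's exact weighting and conditioning may differ), the online RLHF objective can be written as reward maximization with a divergence penalty toward the reference model:

$$
\max_{\theta}\;\mathbb{E}_{z\sim p(z),\,x=g_\theta(z)}\!\left[r(x)\right]\;-\;\beta\,\mathcal{D}\!\left(p_\theta\,\|\,p_{\mathrm{ref}}\right)
$$

where $g_\theta$ is the one-step generator, $r$ is the human-preference reward, $p_{\mathrm{ref}}$ is the distribution of the reference diffusion model, $\mathcal{D}$ is the divergence used for regularization (a score-based divergence in DI*, rather than the usual KL), and $\beta$ is an illustrative symbol for the regularization weight.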
Unlike conventional approaches that regularize with a Kullback-Leibler (KL) divergence, DI* employs a novel score-based divergence as the regularizer, and the paper's experiments indicate that this choice yields better performance. Because the score-based divergence cannot be computed directly, the authors derive an alternative, tractable loss whose gradients can be estimated efficiently.
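The sketch below illustrates the general idea of such a tractable surrogate in PyTorch-style pseudocode: the regularization gradient is driven by the difference between the reference diffusion's score and the score of the generator's output distribution, estimated by an auxiliary network, while the reward term pushes samples toward human preference. The networks `g_theta`, `score_ref`, `score_fake`, and `reward_model`, the noise schedule, and the sign and weighting conventions are all illustrative assumptions, not the authors' exact formulation.

```python
import torch


def add_noise(x, noise, t):
    """Simple VP-style forward diffusion: x_t = alpha_t * x + sigma_t * noise."""
    alpha_t = torch.cos(0.5 * torch.pi * t).view(-1, 1, 1, 1)
    sigma_t = torch.sin(0.5 * torch.pi * t).view(-1, 1, 1, 1)
    return alpha_t * x + sigma_t * noise


def di_star_style_loss(g_theta, score_ref, score_fake, reward_model, prompts, beta=1.0):
    """Illustrative DI*-style objective: reward maximization plus a
    score-difference regularizer toward the reference diffusion.
    All callables and the exact weighting are placeholders."""
    z = torch.randn(len(prompts), 4, 64, 64)     # latent noise
    x = g_theta(z, prompts)                      # one-step generation

    # Reward term: push samples toward human preference.
    reward = reward_model(x, prompts).mean()

    # Regularization term: diffuse the sample to a random time t and compare
    # the reference score with the score of the generator's (implicit)
    # distribution, estimated by an auxiliary "fake" score network.
    t = torch.rand(x.shape[0], device=x.device)
    noise = torch.randn_like(x)
    x_t = add_noise(x, noise, t)
    with torch.no_grad():                        # stop-gradient on both scores
        s_ref = score_ref(x_t, t, prompts)
        s_fake = score_fake(x_t, t, prompts)
    # Surrogate whose gradient w.r.t. the generator follows the score difference.
    reg = ((s_fake - s_ref) * x_t).sum(dim=(1, 2, 3)).mean()

    return -reward + beta * reg
```

In practice, methods of this family alternate between updating the auxiliary score network on generator samples and updating the generator with a loss of the above shape; the exact alternation and estimator in DI* should be taken from the paper itself.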
Evaluated against the current state of the art, DI* shows substantial improvements. With Stable Diffusion V1.5 as the reference model, DI* reaches a new level of performance for text-to-image generation. With the PixelArt-α model as the reference diffusion process, DI* attains an Aesthetic Score of 6.30 and an Image Reward of 1.31, notably higher than other models of similar scale, and an HPSv2 score of 28.70, establishing a leading benchmark result.
The method also improves the visual quality of generated images, reflected in better layout, richer color, more vibrant detail, and stronger overall aesthetic appeal. This brings generative models closer to human tastes and preferences, addressing a significant need in human-centric AI.
The paper distinguishes two main families of generative models: diffusion models and one-step generators. Diffusion models produce high-quality outputs through progressive denoising, but their efficiency is limited by the many network evaluations required at sampling time. One-step generators, which map latent noise to an output in a single forward pass, offer efficiency gains that make them suitable for real-time applications; the contrast is sketched below. Despite their efficiency, however, these models typically fall short in aligning outputs with human preferences, the gap this paper seeks to close with DI*.
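A minimal sketch of the two sampling regimes, assuming hypothetical `eps_model` and `g_theta` networks and a deliberately simplified update rule (not a faithful DDIM/DPM-Solver step):

```python
import torch


def denoise_step(eps_model, x, t, dt, prompt):
    """Simplified Euler-style update from the predicted noise;
    stands in for a real diffusion sampler step."""
    eps = eps_model(x, t, prompt)
    return x - dt * eps


@torch.no_grad()
def sample_diffusion(eps_model, prompt, steps=50):
    """Multi-step sampling: one network evaluation per denoising step."""
    x = torch.randn(1, 4, 64, 64)
    dt = 1.0 / steps
    for i in reversed(range(steps)):
        t = torch.full((1,), (i + 1) * dt)
        x = denoise_step(eps_model, x, t, dt, prompt)
    return x


@torch.no_grad()
def sample_one_step(g_theta, prompt):
    """One-step generation: a single forward pass maps noise to an image."""
    z = torch.randn(1, 4, 64, 64)
    return g_theta(z, prompt)
```

The efficiency gap is simply the factor of `steps` network evaluations saved per image, which is what makes one-step generators attractive for latency-sensitive applications.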
The framework's regularization, implemented through the tractable pseudo-loss for the score-based divergence, keeps training stable and the generator aligned with the reference distribution. The authors emphasize the method's adaptability across architectures, pointing to its potential for training models in diverse real-time, high-performance applications, and suggest that the approach could stimulate further work on integrating explicit and implicit reward mechanisms in large-scale generative model training.
The findings and methodologies presented in this paper have broad implications, not only enhancing the practical alignment of AI-generated content with human taste but also providing foundational techniques that could be applied in other domains, including real-time graphics, interactive media, and virtual environments. The innovations encapsulated within DI* could inform future research on the efficient and effective alignment of generative models, contributing positively to the broader AI community's efforts in advancing human-centric machine learning paradigms.