Diff-Instruct*: Towards Human-Preferred One-step Text-to-Image Generative Models
This paper introduces Diff-Instruct* (DI*), a method for aligning one-step text-to-image generative models with human preferences. The approach applies online reinforcement learning from human feedback (RLHF) to guide the generator: the central objective is to maximize a reward function representing human preferences while keeping the generated distribution close to a reference diffusion process.
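In slightly simplified notation (the paper's exact weighting and conditioning may differ), the online RLHF objective can be written as reward maximization with a divergence penalty toward the reference model:

$$
\max_{\theta}\;\mathbb{E}_{z\sim p(z),\,x=g_\theta(z)}\!\left[r(x)\right]\;-\;\beta\,\mathcal{D}\!\left(p_\theta\,\|\,p_{\mathrm{ref}}\right)
$$

where $g_\theta$ is the one-step generator, $r$ is the human-preference reward, $p_{\mathrm{ref}}$ is the distribution of the reference diffusion model, $\mathcal{D}$ is the divergence used for regularization (a score-based divergence in DI*, rather than the usual KL), and $\beta$ is an illustrative symbol for the regularization weight.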
Unlike conventional approaches that regularize with a Kullback-Leibler (KL) divergence, DI* employs a novel score-based divergence as the regularizer, and the paper's experiments indicate that this choice yields better performance. Because the score-based divergence cannot be computed directly, the authors derive an alternative, tractable loss whose gradients can be estimated efficiently.
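The sketch below illustrates the general idea of such a tractable surrogate in PyTorch-style pseudocode: the regularization gradient is driven by the difference between the reference diffusion's score and the score of the generator's output distribution, estimated by an auxiliary network, while the reward term pushes samples toward human preference. The networks `g_theta`, `score_ref`, `score_fake`, and `reward_model`, the noise schedule, and the sign and weighting conventions are all illustrative assumptions, not the authors' exact formulation.

```python
import torch


def add_noise(x, noise, t):
    """Simple VP-style forward diffusion: x_t = alpha_t * x + sigma_t * noise."""
    alpha_t = torch.cos(0.5 * torch.pi * t).view(-1, 1, 1, 1)
    sigma_t = torch.sin(0.5 * torch.pi * t).view(-1, 1, 1, 1)
    return alpha_t * x + sigma_t * noise


def di_star_style_loss(g_theta, score_ref, score_fake, reward_model, prompts, beta=1.0):
    """Illustrative DI*-style objective: reward maximization plus a
    score-difference regularizer toward the reference diffusion.
    All callables and the exact weighting are placeholders."""
    z = torch.randn(len(prompts), 4, 64, 64)     # latent noise
    x = g_theta(z, prompts)                      # one-step generation

    # Reward term: push samples toward human preference.
    reward = reward_model(x, prompts).mean()

    # Regularization term: diffuse the sample to a random time t and compare
    # the reference score with the score of the generator's (implicit)
    # distribution, estimated by an auxiliary "fake" score network.
    t = torch.rand(x.shape[0], device=x.device)
    noise = torch.randn_like(x)
    x_t = add_noise(x, noise, t)
    with torch.no_grad():                        # stop-gradient on both scores
        s_ref = score_ref(x_t, t, prompts)
        s_fake = score_fake(x_t, t, prompts)
    # Surrogate whose gradient w.r.t. the generator follows the score difference.
    reg = ((s_fake - s_ref) * x_t).sum(dim=(1, 2, 3)).mean()

    return -reward + beta * reg
```

In practice, methods of this family alternate between updating the auxiliary score network on generator samples and updating the generator with a loss of the above shape; the exact alternation and estimator in DI* should be taken from the paper itself.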
Evaluated against the current state of the art, DI* shows substantial improvements. With Stable Diffusion V1.5 as the reference model, DI* reaches a new level of performance for text-to-image generation. With the PixelArt-α model as the reference diffusion process, DI* attains an Aesthetic Score of 6.30 and an Image Reward of 1.31, notably higher than other models of similar scale, and an HPSv2 score of 28.70, establishing a leading benchmark result.
The method also improves the visual quality of generated images, reflected in better layout, richer color, more vibrant detail, and stronger overall aesthetic appeal. This brings generative models closer to human tastes and preferences, addressing a significant need in human-centric AI.
The paper distinguishes two main families of generative models: diffusion models and one-step generators. Diffusion models produce high-quality outputs through progressive denoising, but their efficiency is limited by the many network evaluations required at sampling time. One-step generators, which map latent noise to an output in a single forward pass, offer efficiency gains that make them suitable for real-time applications; the contrast is sketched below. Despite their efficiency, however, these models typically fall short in aligning outputs with human preferences, the gap this paper seeks to close with DI*.
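A minimal sketch of the two sampling regimes, assuming hypothetical `eps_model` and `g_theta` networks and a deliberately simplified update rule (not a faithful DDIM/DPM-Solver step):

```python
import torch


def denoise_step(eps_model, x, t, dt, prompt):
    """Simplified Euler-style update from the predicted noise;
    stands in for a real diffusion sampler step."""
    eps = eps_model(x, t, prompt)
    return x - dt * eps


@torch.no_grad()
def sample_diffusion(eps_model, prompt, steps=50):
    """Multi-step sampling: one network evaluation per denoising step."""
    x = torch.randn(1, 4, 64, 64)
    dt = 1.0 / steps
    for i in reversed(range(steps)):
        t = torch.full((1,), (i + 1) * dt)
        x = denoise_step(eps_model, x, t, dt, prompt)
    return x


@torch.no_grad()
def sample_one_step(g_theta, prompt):
    """One-step generation: a single forward pass maps noise to an image."""
    z = torch.randn(1, 4, 64, 64)
    return g_theta(z, prompt)
```

The efficiency gap is simply the factor of `steps` network evaluations saved per image, which is what makes one-step generators attractive for latency-sensitive applications.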
The framework's regularization, implemented through the tractable pseudo-loss for the score-based divergence, keeps training stable and the generator aligned with the reference distribution. The authors emphasize the method's adaptability across architectures, pointing to its potential for training models in diverse real-time, high-performance applications, and suggest that the approach could stimulate further work on integrating explicit and implicit reward mechanisms in large-scale generative model training.
The findings and methodologies presented in this paper have broad implications, not only enhancing the practical alignment of AI-generated content with human taste but also providing foundational techniques that could be applied in other domains, including real-time graphics, interactive media, and virtual environments. The innovations encapsulated within DI* could inform future research on the efficient and effective alignment of generative models, contributing positively to the broader AI community's efforts in advancing human-centric machine learning paradigms.