David and Goliath: Small One-step Model Beats Large Diffusion with Score Post-training (2410.20898v3)

Published 28 Oct 2024 in cs.CV, cs.AI, cs.LG, and cs.MM

Abstract: We propose Diff-Instruct* (DI*), a data-efficient post-training approach for one-step text-to-image generative models to improve its human preferences without requiring image data. Our method frames alignment as online reinforcement learning from human feedback (RLHF), which optimizes the one-step model to maximize human reward functions while being regularized to be kept close to a reference diffusion process. Unlike traditional RLHF approaches, which rely on the Kullback-Leibler divergence as the regularization, we introduce a novel general score-based divergence regularization that substantially improves performance as well as post-training stability. Although the general score-based RLHF objective is intractable to optimize, we derive a strictly equivalent tractable loss function in theory that can efficiently compute its \emph{gradient} for optimizations. We introduce \emph{DI*-SDXL-1step}, which is a 2.6B one-step text-to-image model at a resolution of $1024\times 1024$, post-trained from DMD2 w.r.t SDXL. \textbf{Our 2.6B \emph{DI*-SDXL-1step} model outperforms the 50-step 12B FLUX-dev model} in ImageReward, PickScore, and CLIP score on the Parti prompts benchmark while using only 1.88\% of the inference time. This result clearly shows that with proper post-training, the small one-step model is capable of beating huge multi-step diffusion models. Our model is open-sourced at this link: https://github.com/pkulwj1994/diff_instruct_star. We hope our findings can contribute to human-centric machine learning techniques.

Citations (1)

Summary

  • The paper introduces Diff-Instruct*, a novel online RLHF method using score-based divergence regularization to improve human preference alignment in one-step text-to-image models.
  • With Stable Diffusion V1.5 as the reference model, Diff-Instruct* achieves state-of-the-art performance, with significantly higher Aesthetic Score, Image Reward, and HPSv2 than existing methods.
  • The method enhances visual quality, aligning generated images more closely with human taste, and offers efficiency benefits for real-time applications and future research directions in generative AI alignment.

Diff-Instruct*: Towards Human-Preferred One-step Text-to-Image Generative Models

This paper introduces Diff-Instruct* (DI*), a methodology for human preference alignment in one-step text-to-image generative models. The approach frames alignment as online reinforcement learning from human feedback (RLHF): the one-step generator is optimized to maximize a reward function representing human preferences while its outputs are regularized to remain faithful to a reference diffusion process.
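
Schematically, writing $p_{\theta}$ for the one-step generator's output distribution, $r$ for the human-preference reward, and $p_{\mathrm{ref}}$ for the reference diffusion distribution (notation chosen here for illustration, not taken from the paper), the post-training objective takes the familiar regularized-RLHF form:

```latex
% Regularized RLHF objective (schematic form; notation assumed for illustration).
% DI* replaces the usual KL regularizer D(.) with a general score-based divergence.
\max_{\theta} \;
\mathbb{E}_{c \sim p(c),\; x \sim p_{\theta}(x \mid c)}\!\left[ r(x, c) \right]
\;-\; \alpha \, D\!\left( p_{\theta} \,\|\, p_{\mathrm{ref}} \right)
```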

Distinct from conventional approaches that use Kullback-Leibler (KL) divergence for regularization, DI* employs a novel score-based divergence regularization. According to the paper's experiments, this choice yields both better performance and greater post-training stability. Although the score-based divergence is intractable to compute directly, the authors derive an equivalent loss function that allows its gradient to be computed efficiently.
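
As a rough illustration only (the exact divergence family and weighting are defined in the paper; the form below is an assumed schematic), a score-based divergence compares the score functions of the two distributions after they are diffused along the forward noising process:

```latex
% Schematic score-based divergence between the generator distribution p_theta and a
% reference q: p_{theta,t} and q_t are the noised marginals at diffusion time t,
% d(.) is a distance function (d(v) = ||v||^2 recovers the Fisher divergence), and
% w(t) is a time weighting. The paper's actual choices of d and w are not assumed here.
D(p_{\theta} \,\|\, q) \;=\; \int_{0}^{T} w(t)\,
\mathbb{E}_{x_t \sim p_{\theta,t}}\!\left[
d\big( \nabla_{x_t} \log p_{\theta,t}(x_t) - \nabla_{x_t} \log q_{t}(x_t) \big)
\right] dt
```

Because the generator-side score $\nabla_{x_t} \log p_{\theta,t}$ is not available in closed form, this quantity cannot be optimized directly; the paper's contribution is a strictly equivalent pseudo-loss whose gradient with respect to $\theta$ can be computed efficiently.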

When evaluated against the current state of the art, DI* demonstrates substantial improvements. With Stable Diffusion V1.5 as the reference model, DI* reaches a new performance level for one-step text-to-image generation, and with the PixelArt-α model as the reference diffusion process it achieves an Aesthetic Score of 6.30 and an Image Reward of 1.31, notably higher than other models of similar scale, along with an HPSv2 score of 28.70, establishing a leading benchmark result.

The methodology enhances the visual quality of generated images, illustrated by improved layout, color richness, detail vibrancy, and overall aesthetic appeal. This enhancement aligns generative models more closely with human tastes and preferences, addressing a significant need in the domain of human-centric AI.

The paper delineates two main types of generative models: diffusion models and one-step generators. While diffusion models are known for producing high-quality outputs through progressive denoising, their efficiency is limited by the many network evaluations required during sampling. Conversely, one-step generators, which map latent noise to output in a single step, promise efficiency gains that make them suitable for real-time applications. Despite their efficiency, these models typically fall short in aligning output with human preferences, a gap this paper seeks to address with DI*.
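
For intuition, the inference-cost gap between the two families can be sketched as follows. The module and function names here are hypothetical placeholders, not the paper's or any library's API; the point is simply that multi-step sampling costs one network evaluation per denoising step, while a one-step generator costs exactly one.

```python
import torch

# Hypothetical interfaces for illustration: `diffusion_eps` is a noise-prediction
# network and `one_step_gen` maps latent noise (plus a prompt embedding) directly
# to an image latent.

@torch.no_grad()
def sample_diffusion(diffusion_eps, prompt_emb, steps=50):
    """Multi-step sampling: one network evaluation per denoising step."""
    x = torch.randn(1, 4, 128, 128)           # initial latent noise
    for t in reversed(range(steps)):           # 50 forward passes for 50 steps
        eps = diffusion_eps(x, t, prompt_emb)
        x = x - eps / steps                    # schematic update; real samplers use
                                               # DDIM/ancestral update rules
    return x

@torch.no_grad()
def sample_one_step(one_step_gen, prompt_emb):
    """One-step generation: a single network evaluation."""
    z = torch.randn(1, 4, 128, 128)
    return one_step_gen(z, prompt_emb)
```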

The framework's regularization, implemented through a tractable pseudo-loss derived from the score-based divergence, ensures stability and keeps the generator aligned with the reference distribution. The authors emphasize the method's adaptability across architectures, pointing to its potential utility for training models for diverse real-time, high-performance applications. Moreover, the approach could stimulate further exploration of integrating explicit and implicit reward mechanisms in large-scale generative model training.
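
A minimal sketch of what one post-training step could look like under this framing is given below. All function names are assumptions for illustration; the paper's exact pseudo-loss, divergence weighting, and auxiliary-network updates are not reproduced here. The regularizer follows the familiar Diff-Instruct/DMD-style construction, in which a detached score difference is multiplied by the (differentiable) diffused sample so that backpropagation yields the desired gradient.

```python
import torch

def post_training_step(generator, reward_model, ref_score, gen_score,
                       prompt_emb, opt, alpha=0.1, sigma=0.5):
    """One schematic DI*-style update: reward maximization plus a score-difference
    regularizer toward the reference diffusion. `ref_score` and `gen_score` are
    assumed score networks for the reference diffusion and for the generator's
    (diffused) output distribution, respectively."""
    opt.zero_grad()

    z = torch.randn(8, 4, 128, 128)            # latent noise batch
    x = generator(z, prompt_emb)               # one-step generation (differentiable)

    # Reward term: push samples toward higher human-preference scores.
    loss_reward = -reward_model(x, prompt_emb).mean()

    # Regularization term (schematic): diffuse the samples and penalize the
    # discrepancy between the generator-side score and the reference score.
    # The score gap is detached so its product with x_t acts as a pseudo-loss
    # whose gradient matches the score-difference direction.
    x_t = x + sigma * torch.randn_like(x)
    score_gap = gen_score(x_t, sigma, prompt_emb) - ref_score(x_t, sigma, prompt_emb)
    loss_reg = (score_gap.detach() * x_t).flatten(1).sum(dim=1).mean()

    (loss_reward + alpha * loss_reg).backward()
    opt.step()
```

In practice the generator-side score would itself be estimated by an auxiliary network trained online on generator samples, as in the Diff-Instruct family; that inner loop is omitted from this sketch.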

The findings and methodologies presented in this paper have broad implications, not only enhancing the practical alignment of AI-generated content with human taste but also providing foundational techniques that could be applied in other domains, including real-time graphics, interactive media, and virtual environments. The innovations encapsulated within DI* could inform future research on the efficient and effective alignment of generative models, contributing positively to the broader AI community's efforts in advancing human-centric machine learning paradigms.
