SNOOPI: Supercharged One-step Diffusion Distillation with Proper Guidance (2412.02687v2)

Published 3 Dec 2024 in cs.CV

Abstract: Recent approaches have yielded promising results in distilling multi-step text-to-image diffusion models into one-step ones. The state-of-the-art efficient distillation technique, i.e., SwiftBrushv2 (SBv2), even surpasses the teacher model's performance with limited resources. However, our study reveals its instability when handling different diffusion model backbones due to using a fixed guidance scale within the Variational Score Distillation (VSD) loss. Another weakness of the existing one-step diffusion models is the missing support for negative prompt guidance, which is crucial in practical image generation. This paper presents SNOOPI, a novel framework designed to address these limitations by enhancing the guidance in one-step diffusion models during both training and inference. First, we effectively enhance training stability through Proper Guidance-SwiftBrush (PG-SB), which employs a random-scale classifier-free guidance approach. By varying the guidance scale of both teacher models, we broaden their output distributions, resulting in a more robust VSD loss that enables SB to perform effectively across diverse backbones while maintaining competitive performance. Second, we propose a training-free method called Negative-Away Steer Attention (NASA), which integrates negative prompts into one-step diffusion models via cross-attention to suppress undesired elements in generated images. Our experimental results show that our proposed methods significantly improve baseline models across various metrics. Remarkably, we achieve an HPSv2 score of 31.08, setting a new state-of-the-art benchmark for one-step diffusion models.

Summary

The paper introduces SNOOPI, a framework addressing instability and lack of negative prompt guidance in one-step diffusion models through two novel components.
Proper Guidance - SwiftBrush (PG-SB) improves model stability across diverse backbones by training with a random-scale classifier-free guidance approach.
Negative-Away Steer Attention (NASA) is a training-free method that enables negative prompt guidance in one-step models by manipulating cross-attention features in the intermediate feature space to suppress unwanted attributes.

An Expert Overview of the SNOOPI Framework for One-Step Diffusion Models

The paper, titled "SNOOPI: Supercharged One-step Diffusion Distillation with Proper Guidance," addresses critical challenges in one-step text-to-image diffusion models. Traditional multi-step diffusion models, renowned for their high-quality image synthesis, are computationally demanding due to their iterative nature. Recent advancements have focused on distilling these models into more efficient one-step variants, aiming to reduce computational overhead while preserving output quality. Despite the promise shown by techniques like SwiftBrushv2 (SBv2), the paper identifies and seeks to remedy two main issues affecting current one-step diffusion models: instability across various model backbones and the absence of negative prompt guidance.

The proposed SNOOPI framework introduces two innovative components to tackle these obstacles: Proper Guidance - SwiftBrush (PG-SB) and Negative-Away Steer Attention (NASA).

Proper Guidance - SwiftBrush (PG-SB)

PG-SB mitigates instability when training one-step diffusion models by using random-scale classifier-free guidance (CFG). The paper finds that a fixed guidance scale inside the variational score distillation (VSD) loss, as used in SBv2, can lead to inconsistent performance across different model backbones. By varying the guidance scale of the teacher models during training, PG-SB broadens their output distributions, which makes the VSD loss more robust and lets SwiftBrush distill effectively across diverse backbones without additional computational demands. The reported results show improved stability and competitive output quality; for the overall framework, the abstract reports a Human Preference Score v2 (HPSv2) of 31.08, which the authors describe as a new state of the art for one-step diffusion models.
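The random-scale idea can be illustrated with a minimal sketch. It assumes the standard CFG combination rule and a uniform distribution over scales; the function names, the range [2.0, 7.0], and the random arrays standing in for teacher outputs are hypothetical and are not taken from the paper.

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, scale):
    """Standard classifier-free guidance: move from the unconditional
    prediction toward the conditional one by a factor of `scale`."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

def sample_guidance_scale(rng, low=2.0, high=7.0):
    """Draw a fresh guidance scale for one training iteration.
    The uniform distribution and its bounds are assumptions for illustration."""
    return float(rng.uniform(low, high))

rng = np.random.default_rng(0)

# Stand-ins for a teacher's unconditional and text-conditioned predictions.
eps_uncond = rng.standard_normal((4, 8, 8))
eps_cond = rng.standard_normal((4, 8, 8))

# A fixed-scale setup would reuse one constant here; the random-scale
# variant resamples the scale at every iteration instead.
scale = sample_guidance_scale(rng)
guided = cfg_combine(eps_uncond, eps_cond, scale)
```

In an actual training loop, the guided prediction would feed the VSD loss; that part is omitted here because the source text does not spell it out.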

Negative-Away Steer Attention (NASA)

The absence of negative prompt guidance in one-step models restricts their practical use, particularly when specific features must be excluded from an image. NASA addresses this with a training-free method built on the cross-attention layers of the diffusion model. In multi-step models, negative prompts are typically applied through classifier-free guidance at each denoising step; NASA instead works in the intermediate feature space, integrating the negative prompt via cross-attention to suppress unwanted attributes. This broadens control over image synthesis, allowing one-step models to generate high-quality images that respect both positive and negative prompt constraints.
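One plausible reading of this mechanism is sketched below: compute cross-attention features once for the positive prompt and once for the negative prompt, then subtract a scaled copy of the negative features. The subtraction rule, the `alpha` weight, and all function names are assumptions made for illustration; the source text only states that negative prompts are integrated via cross-attention.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, k, v):
    """Scaled dot-product cross-attention.
    q: (n_queries, d) image features; k, v: (n_tokens, d) text features."""
    d = q.shape[-1]
    weights = softmax(q @ k.T / np.sqrt(d))
    return weights @ v

def negative_away_attention(q, k_pos, v_pos, k_neg, v_neg, alpha=0.5):
    """Steer attention features away from the negative prompt.
    `alpha` (hypothetical) controls how strongly negative features are removed."""
    z_pos = cross_attention(q, k_pos, v_pos)
    z_neg = cross_attention(q, k_neg, v_neg)
    return z_pos - alpha * z_neg

rng = np.random.default_rng(0)
q = rng.standard_normal((16, 32))          # 16 spatial positions
k_pos, v_pos = rng.standard_normal((5, 32)), rng.standard_normal((5, 32))
k_neg, v_neg = rng.standard_normal((3, 32)), rng.standard_normal((3, 32))

steered = negative_away_attention(q, k_pos, v_pos, k_neg, v_neg, alpha=0.5)
```

Because the adjustment happens inside a single forward pass, a scheme of this kind needs no extra denoising steps, which is what makes it compatible with one-step generation.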

Implications and Future Directions

Practically, SNOOPI improves the efficiency and flexibility of text-to-image synthesis, supporting real-time applications by reducing the computational burden associated with diffusion models. Theoretically, it makes a compelling case that one-step models can match or even surpass multi-step models on certain performance metrics, provided the distillation and guidance strategies are well designed.

Future work could extend SNOOPI to few-step models, adapt it to architectures that lack cross-attention layers, and explore further scenarios where negative prompt integration is crucial. As generative models continue to evolve, such advances could expand what is achievable under current computational constraints.

In conclusion, SNOOPI represents a significant stride toward stabilizing and enhancing the capacity of one-step diffusion models, ultimately broadening their applicability across diverse practical and theoretical domains.
