
Few-Step Diffusion via Score identity Distillation (2505.12674v1)

Published 19 May 2025 in cs.CV, cs.LG, and stat.ML

Abstract: Diffusion distillation has emerged as a promising strategy for accelerating text-to-image (T2I) diffusion models by distilling a pretrained score network into a one- or few-step generator. While existing methods have made notable progress, they often rely on real or teacher-synthesized images to perform well when distilling high-resolution T2I diffusion models such as Stable Diffusion XL (SDXL), and their use of classifier-free guidance (CFG) introduces a persistent trade-off between text-image alignment and generation diversity. We address these challenges by optimizing Score identity Distillation (SiD) -- a data-free, one-step distillation framework -- for few-step generation. Backed by theoretical analysis that justifies matching a uniform mixture of outputs from all generation steps to the data distribution, our few-step distillation algorithm avoids step-specific networks and integrates seamlessly into existing pipelines, achieving state-of-the-art performance on SDXL at 1024x1024 resolution. To mitigate the alignment-diversity trade-off when real text-image pairs are available, we introduce a Diffusion GAN-based adversarial loss applied to the uniform mixture and propose two new guidance strategies: Zero-CFG, which disables CFG in the teacher and removes text conditioning in the fake score network, and Anti-CFG, which applies negative CFG in the fake score network. This flexible setup improves diversity without sacrificing alignment. Comprehensive experiments on SD1.5 and SDXL demonstrate state-of-the-art performance in both one-step and few-step generation settings, along with robustness to the absence of real images. Our efficient PyTorch implementation, along with the resulting one- and few-step distilled generators, will be released publicly as a separate branch at https://github.com/mingyuanzhou/SiD-LSG.

Authors (3)
  1. Mingyuan Zhou (161 papers)
  2. Yi Gu (69 papers)
  3. Zhendong Wang (60 papers)

Summary

An Analysis of Few-Step Diffusion via Score Identity Distillation

The paper "Few-Step Diffusion via Score identity Distillation" presents a comprehensive exploration of advanced methodologies to accelerate the text-to-image (T2I) diffusion process by optimizing score identity distillation (SiD). The authors address significant challenges in high-resolution image synthesis models, specifically targeting the distillation of Stable Diffusion XL (SDXL) into efficient generators that require fewer steps for sample generation.

Key Contributions

  1. Score identity Distillation (SiD): The paper extends SiD, a data-free one-step distillation framework, to few-step generation. A theoretical analysis justifies matching a uniform mixture of outputs from all generation steps to the data distribution, which removes the need for step-specific networks and lets the method integrate seamlessly into existing pipelines (see the uniform-step matching sketch after this list).
  2. Guidance Strategies: To mitigate the trade-off that classifier-free guidance (CFG) imposes between text-image alignment and generation diversity, two guidance strategies are introduced: Zero-CFG, which disables CFG in the teacher and removes text conditioning in the fake score network, and Anti-CFG, which applies negative CFG in the fake score network. This flexibility improves generative diversity without sacrificing alignment (see the guidance sketch after this list).
  3. Integration of Real Images and Adversarial Losses: When real text-image pairs are available, the paper adds a Diffusion GAN-based adversarial loss applied to the uniform mixture of generator outputs, compensating for discrepancies between real data and synthetic outputs. This further improves diversity and robustness, with the distilled models surpassing existing ones in FID and CLIP scores in several settings (see the adversarial-loss sketch after this list).
  4. Multistep Generator Optimization: Two strategies are compared for training the multistep generator: Final-Step Matching, which backpropagates through the full generation chain and matches its final output, and Uniform-Step Matching, which updates a uniformly sampled step in isolation. Both aim to maximize fidelity and alignment while keeping distillation scalable and efficient.
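
Below is a minimal PyTorch-style sketch of the uniform-step matching idea from points 1 and 4. The module names (`generator`, `teacher_score`, `fake_score`), shapes, noise schedule, and the simplified loss are illustrative assumptions, not the authors' released implementation; the exact SiD identities, weighting, and stop-gradient placement are omitted.

```python
import torch

def uniform_step_matching_loss(generator, teacher_score, fake_score,
                               prompt_emb, K=4, latent_shape=(4, 128, 128)):
    """One generator update under uniform-step matching: sample a step k uniformly,
    take the k-step output of a single shared generator (no step-specific networks),
    and apply a one-step SiD-style score-matching loss to that output."""
    B = prompt_emb.shape[0]
    x = torch.randn(B, *latent_shape, device=prompt_emb.device)

    # Uniformly pick which step's output is matched to the data distribution.
    k = int(torch.randint(1, K + 1, (1,)))
    for step in range(k):
        if step < k - 1:
            with torch.no_grad():               # earlier steps act as fixed inputs
                x = generator(x, step, prompt_emb)
        else:
            x = generator(x, step, prompt_emb)  # only the sampled step receives gradients

    # Diffuse the chosen output and compare teacher vs. fake score estimates,
    # roughly as in one-step SiD (exact weighting and stop-gradient scheme omitted).
    t = torch.rand(B, 1, 1, 1, device=x.device)
    x_t = x + t * torch.randn_like(x)
    s_teacher = teacher_score(x_t, t, prompt_emb)
    s_fake = fake_score(x_t, t, prompt_emb)
    return ((s_teacher - s_fake) * (s_teacher - x)).mean()
```

Final-step matching would instead run the whole chain with gradients enabled and match only the final output, at the cost of backpropagating through all K steps.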
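
The two guidance strategies can be contrasted with standard CFG-based distillation in a few lines. This is a hedged illustration: `teacher`, `fake`, and the guidance scales are placeholders, and how the teacher is conditioned under Anti-CFG is simplified here.

```python
import torch

def guided_score(score_net, x_t, t, text_emb, null_emb, scale):
    """Classifier-free guidance: blend unconditional and text-conditional scores.
    scale = 1.0 reduces to the plain conditional score."""
    s_uncond = score_net(x_t, t, null_emb)
    s_cond = score_net(x_t, t, text_emb)
    return s_uncond + scale * (s_cond - s_uncond)

def teacher_and_fake_scores(teacher, fake, x_t, t, text_emb, null_emb, mode):
    """Illustrative comparison of guidance configurations during distillation."""
    if mode == "standard":     # conventional: CFG in the teacher, conditional fake score
        s_teacher = guided_score(teacher, x_t, t, text_emb, null_emb, scale=7.5)  # illustrative scale
        s_fake = fake(x_t, t, text_emb)
    elif mode == "zero_cfg":   # Zero-CFG: no CFG in the teacher, no text in the fake score net
        s_teacher = teacher(x_t, t, text_emb)
        s_fake = fake(x_t, t, null_emb)
    elif mode == "anti_cfg":   # Anti-CFG: negative CFG applied in the fake score network
        s_teacher = teacher(x_t, t, text_emb)   # teacher guidance kept plain here (simplification)
        s_fake = guided_score(fake, x_t, t, text_emb, null_emb, scale=-1.0)  # illustrative negative scale
    else:
        raise ValueError(mode)
    return s_teacher, s_fake
```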
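
When real text-image pairs are available, the adversarial term can be sketched as a Diffusion GAN-style loss computed on noised real and generated images. The `discriminator` interface and the noise schedule below are placeholder assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def diffusion_gan_losses(discriminator, real_images, fake_images, prompt_emb):
    """Non-saturating GAN losses computed on *noised* real and generated images,
    so the discriminator sees the same corruption process as the score networks."""
    t = torch.rand(real_images.shape[0], 1, 1, 1, device=real_images.device)
    noisy_real = real_images + t * torch.randn_like(real_images)
    noisy_fake = fake_images + t * torch.randn_like(fake_images)

    # Discriminator update: real vs. (detached) fake.
    logits_real = discriminator(noisy_real, t, prompt_emb)
    logits_fake_d = discriminator(noisy_fake.detach(), t, prompt_emb)
    d_loss = (F.softplus(-logits_real) + F.softplus(logits_fake_d)).mean()

    # Generator update: adversarial term added on top of the distillation loss.
    logits_fake_g = discriminator(noisy_fake, t, prompt_emb)
    g_loss = F.softplus(-logits_fake_g).mean()
    return d_loss, g_loss
```

Here `fake_images` would be drawn from the same uniform mixture of step outputs used by the distillation loss, which is what ties the adversarial term to the few-step setup.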

Implications and Future Directions

The paper's methods hold promising implications for AI-driven image synthesis, chiefly by lowering the computational cost of sampling and improving model scalability. Several avenues remain for future exploration:

  • Scalability to Larger Datasets: The distillation is demonstrated on SD1.5 and SDXL; further work could examine how these techniques scale to much larger datasets or more complex domains.
  • Joint Distillation and Preference Optimization: Given the balance required between fidelity and diversity in generative models, integrating preference learning into the distillation process could yield practical benefits for models that must align with human preferences.
  • Long-term Consistency and Automation: Extending multistep generation frameworks to incorporate automation processes, such as auto-tuning of step configurations, might offer opportunities to enhance consistency in output quality across diverse conditions, especially in real-time applications.

Conclusion

The advancement of SiD to few-step diffusion marks a clear step forward in the efficiency and scalability of high-resolution image generation. By optimizing Score identity Distillation for multistep generation and augmenting it with new guidance strategies and an optional adversarial loss on real data, the paper delivers substantial improvements over the existing diffusion distillation landscape. The empirical results indicate that these approaches reliably outperform prior distilled models, providing useful benchmarks for ongoing research in text-to-image synthesis.