Analyzing "On Distillation of Guided Diffusion Models"
This paper addresses the sampling inefficiency of classifier-free guided diffusion models, which achieve strong results in high-resolution image generation but are expensive at inference time: classifier-free guidance evaluates two models (a conditional and an unconditional one) at each of many denoising steps. The core contribution is a two-stage distillation framework that sharply reduces the number of denoising steps needed for high-quality generation, on both pixel-space and latent-space diffusion models.
Technical Contributions
- Two-Stage Distillation Approach:
- Stage One trains a single student model to match the combined output of the conditional and unconditional models used in classifier-free guidance. By conditioning the student on the guidance weight, one model covers a range of guidance strengths, preserving the usual trade-off between sample quality and diversity.
- Stage Two progressively distills the stage-one student into models requiring fewer sampling steps, halving the step count at each distillation round. The approach builds on deterministic samplers such as DDIM and is extended to stochastic samplers.
- Efficiency Achievements:
- For both pixel-space and latent-space models, the distilled samplers cut generation time substantially, with reported speed-ups ranging from 10 times to 256 times. The distilled latent diffusion model, for instance, needs only 1 to 4 denoising steps, versus the hundreds typically required, while keeping FID (Fréchet Inception Distance) and IS (Inception Score) close to those of the original models.
- Empirical Validation:
- The paper validates these claims through experiments on standard benchmarks such as ImageNet and CIFAR-10, with results that match or surpass the teacher models, and demonstrates the same acceleration on high-quality, high-resolution text-guided generation trained on datasets like LAION.
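The two stages above can be sketched in a few lines. This is an illustrative sketch, not the paper's implementation: the helper names are hypothetical, and the combination formula `(1 + w) * eps_cond - w * eps_uncond` is the standard classifier-free-guidance parametrization that the stage-one student, itself conditioned on `w`, is trained to match.

```python
def guided_prediction(eps_cond, eps_uncond, w):
    """Classifier-free-guided noise prediction (standard parametrization).

    Combines the conditional and unconditional model outputs with guidance
    weight w. In stage one, this combined output serves as the regression
    target for a single w-conditioned student model.
    """
    return (1.0 + w) * eps_cond - w * eps_uncond

def progressive_step_schedule(n_steps, n_rounds):
    """Stage two: each progressive-distillation round halves the step count.

    Returns the sequence of sampling-step budgets across rounds,
    e.g. 256 -> 128 -> ... -> 1.
    """
    schedule = [n_steps]
    for _ in range(n_rounds):
        n_steps //= 2
        schedule.append(n_steps)
    return schedule
```

With `w = 0` the guided prediction reduces to the plain conditional output, and larger `w` pushes samples toward the condition at some cost in diversity, which is why training across a range of `w` values matters.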
Implications and Future Directions
The implications of this work are multifaceted. Practically, the method reduces the computational demands for deploying large-scale diffusion models in real-world applications, making them more feasible for industries reliant on generative models, such as entertainment, marketing, and virtual reality. Theoretically, this research enriches our understanding of distillation techniques in the context of diffusion models, providing a generalizable framework applicable to various modalities beyond image generation.
Furthermore, the proposed integration of w-conditioning (conditioning the student on the guidance weight w) in model architectures to capture diverse guidance levels is noteworthy. It shows potential for extending beyond current text-to-image and class-conditional methodologies, paving the way for more nuanced interactive AI systems.
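One plausible way to realize such w-conditioning, mirroring how diffusion U-Nets embed the timestep, is a sinusoidal embedding of the scalar guidance weight. This is a hedged sketch under that assumption, with hypothetical names; the paper's exact conditioning mechanism may differ.

```python
import math

def w_embedding(w, dim=8, max_period=1000.0):
    """Map the scalar guidance weight w to a vector (hypothetical sketch).

    Uses the sinusoidal scheme common for timestep embeddings, so the
    distilled student can be told the desired guidance strength at
    inference time. dim must be even.
    """
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    # Interleave sin and cos responses at geometrically spaced frequencies.
    return [f(w * fr) for fr in freqs for f in (math.sin, math.cos)]
```

The resulting vector would typically be added to (or concatenated with) the timestep embedding before being fed to the denoising network.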
Considerations for AI Development
Future AI developments could consider leveraging the foundations laid by such distillation approaches for even broader applications, including video, 3D model generation, and cross-modal tasks. Also, refining the stochastic sampling strategies could open up new possibilities for improving the trade-off between quality and diversity in generative tasks.
In summary, this paper contributes crucial advancements to efficient generative modeling by reducing the sampling cost of guided diffusion models. It sets the stage for future investigations into even more versatile and powerful diffusion-based frameworks.