- The paper introduces an innovative distillation method that trains a student model to mimic a pre-trained teacher, drastically reducing the number of diffusion steps.
- It combines distillation, adversarial, and distribution matching losses to attain state-of-the-art performance on text-to-image tasks using only 2 NFEs.
- The approach generalizes across tasks like inpainting, super-resolution, and face-swapping, significantly lowering computational costs while maintaining high quality.
Flash Diffusion: Accelerating Any Conditional Diffusion Model for Few Steps Image Generation
The paper introduces Flash Diffusion, an efficient and versatile distillation method that accelerates generation from pre-trained diffusion models across a range of tasks and conditioning types. The method achieves state-of-the-art (SOTA) Fréchet Inception Distance (FID) and CLIP scores for few-step image generation, significantly reducing the computational cost typically associated with these models.
Overview of Flash Diffusion
Flash Diffusion targets a core limitation of diffusion models: they require many sampling steps to produce high-quality samples, which makes them impractical for real-time applications. The method addresses this through a robust distillation process, training a student model to mimic the predictions of a pre-trained teacher model in one or a few steps. The key components of the methodology include:
- Distillation Loss: The student model learns to predict a denoised version of noisy input data by matching the output of a pre-trained teacher model.
- Adversarial Loss: This enhances the quality of the generated samples by training a discriminator to distinguish between real and generated samples, ensuring the student's outputs are indistinguishable from real data.
- Distribution Matching: The student model's distribution is aligned with the teacher's learned distribution using the Kullback–Leibler (KL) divergence.
- Efficient Timestep Sampling: A strategic probability mass function focuses the distillation process on the most relevant timesteps, improving training efficiency.
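The interplay of the three objectives can be pictured with a toy numpy sketch. Everything here is an illustrative stand-in: the array shapes, the loss weights, and the softmax-based "distributions" for the KL term are assumptions for demonstration, not the paper's actual formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

def mse(a, b):
    # Distillation loss: student prediction should match the teacher's
    return float(np.mean((a - b) ** 2))

def kl_div(p, q, eps=1e-8):
    # KL(p || q) between two discrete distributions (distribution matching)
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical model outputs (placeholders for real network forward passes)
student_out = rng.normal(size=(4, 8))   # student's one-step denoised prediction
teacher_out = rng.normal(size=(4, 8))   # teacher's multi-step denoised prediction
disc_logits = rng.normal(size=(4,))     # discriminator scores on student samples

# Toy "distributions" for the KL term, derived from the first sample
p_student = softmax(student_out)[0]
p_teacher = softmax(teacher_out)[0]

l_distill = mse(student_out, teacher_out)
# Non-saturating GAN-style loss pushing discriminator scores up on student samples
l_adv = float(np.mean(np.log1p(np.exp(-disc_logits))))
l_dm = kl_div(p_student, p_teacher)

# Illustrative weighting of the three complementary objectives
total_loss = l_distill + 0.1 * l_adv + 0.5 * l_dm
```

In a real training loop each term would be computed on batches drawn at sampled timesteps and backpropagated through the student only, with the teacher frozen.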
Key Results
The effectiveness of Flash Diffusion is demonstrated through extensive experiments on multiple benchmarks and tasks. Highlights include:
- Text-to-Image Generation: Using the SD1.5 model as a teacher, Flash Diffusion achieves an FID of 22.6 and a CLIP score of 0.306 on the COCO2017 dataset with only 2 function evaluations (NFEs), outperforming several existing methods that require many more steps.
- Versatility Across Tasks: The method's adaptability is showcased through various applications, including inpainting, super-resolution, and face-swapping, with consistent high-quality results and reduced computational costs.
Methodological Contributions
- Complementary Objective Functions: The combined use of distillation, adversarial, and distribution matching losses ensures the student model can generate high-quality samples efficiently.
- LoRA Compatibility: Applying Low-Rank Adaptation (LoRA) significantly reduces the number of trainable parameters, enabling faster and more efficient training.
- Phase-Shifted Timestep Sampling: Dynamically adjusting the probability mass function over timesteps across training phases keeps the model focused on the most informative noise levels, optimizing the learning process.
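One way to picture phase-shifted timestep sampling is a probability mass function whose peak moves across training phases. The Gaussian shape, the specific centers, and the widths below are hypothetical choices for illustration, not the paper's actual schedule.

```python
import numpy as np

def timestep_pmf(num_timesteps, center, width):
    """Gaussian-shaped probability mass over discrete timesteps,
    concentrated around `center` (illustrative shape, not the paper's pmf)."""
    t = np.arange(num_timesteps)
    w = np.exp(-0.5 * ((t - center) / width) ** 2)
    return w / w.sum()

def sample_timesteps(rng, pmf, n):
    # Draw n timesteps according to the current phase's pmf
    return rng.choice(len(pmf), size=n, p=pmf)

T = 1000
rng = np.random.default_rng(0)

# Hypothetical phase schedule: focus on high noise levels first, lower ones later
phase_centers = [800, 500, 200]
for center in phase_centers:
    pmf = timestep_pmf(T, center=center, width=100)
    ts = sample_timesteps(rng, pmf, n=4)
```

Shifting the sampling focus this way concentrates gradient signal on the noise levels that matter most at each stage of distillation, rather than spreading it uniformly over all timesteps.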
Implications and Future Directions
Flash Diffusion presents a significant advancement in the practical application of diffusion models by drastically reducing the computational requirements for high-quality image generation. The method's ability to generalize across different tasks and conditional inputs without extensive retraining illustrates its potential for broader applications in AI-driven fields such as augmented reality, real-time video synthesis, and interactive content creation.
Future research could explore reducing the number of NFEs even further or integrating additional objectives to improve sample quality. Emerging techniques such as Direct Preference Optimization could also be incorporated to align outputs more closely with user-defined preferences.
Conclusion
Flash Diffusion emerges as a highly efficient and versatile method for accelerating conditional diffusion models. By meticulously combining distillation, adversarial training, and distribution matching with efficient timestep sampling, it sets a new standard in the field, demonstrating the feasibility of real-time applications for complex generative models. The method's broad applicability and impressive results advocate for its adoption in various AI and machine learning contexts.