SDXS: Accelerating Latent Diffusion Models for Real-Time Image Generation with Image Conditions
Introduction to Latent Diffusion Models and Existing Challenges
Latent diffusion models have recently emerged as a prominent technology for image generation, demonstrating exceptional capability in producing high-quality images, particularly for text-to-image generation. Foundational models such as SD v1.5 and SDXL have set quality benchmarks; however, their large architectures and iterative sampling mechanisms impose substantial computational demands and operational latency.
Addressing the Challenges
Recognizing these limitations, our work pursues a dual strategy of model miniaturization and a reduction in sampling steps, aiming to preserve image quality while significantly improving operational efficiency. The paper introduces SDXS-512 and SDXS-1024, two models that reach inference speeds of approximately 100 FPS and 30 FPS on a single GPU for generating 512×512 and 1024×1024 images, respectively, roughly 30× and 60× faster than their predecessors SD v1.5 and SDXL.
Methodological Insights
Model Miniaturization
A significant portion of our methodology centers on distilling the U-Net and VAE decoder within the latent diffusion framework. By leveraging knowledge distillation, we streamline these components while maintaining high-quality output and markedly reducing computational overhead. In particular, the strategy employs a lightweight image decoder that closely mimics the original VAE decoder's output, trained with a loss that combines output distillation and a GAN loss.
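The shape of this decoder distillation objective can be sketched as follows. This is a minimal illustration, not the paper's exact recipe: the names teacher_decoder (the frozen original VAE decoder), student_decoder (the lightweight replacement), discriminator, and the loss weight lambda_gan are all assumptions for the sake of the example.

```python
import torch
import torch.nn.functional as F

def decoder_distillation_step(latents, teacher_decoder, student_decoder,
                              discriminator, lambda_gan=0.1):
    # Teacher output serves as the regression target; no gradients flow into it.
    with torch.no_grad():
        target = teacher_decoder(latents)
    pred = student_decoder(latents)

    # Output distillation: match the teacher's decoded image pixel-for-pixel.
    distill_loss = F.l1_loss(pred, target)

    # Non-saturating GAN loss: push the discriminator to score the student's
    # output as "real" (i.e., indistinguishable from teacher-quality images).
    gan_loss = F.softplus(-discriminator(pred)).mean()

    return distill_loss + lambda_gan * gan_loss
```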
Reduction in Sampling Steps
To circumvent the heavy computational cost of iterative sampling, our work introduces a one-step diffusion model (DM) training technique that substantially reduces the latency of image generation. By incorporating feature matching and score distillation into the training regimen, we establish a pathway from multi-step sampling to efficient one-step operation, as sketched below.
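The following sketch illustrates, under stated assumptions, how a feature-matching term and a score-distillation term might be combined in one training step. The handles student_unet and teacher_unet, the features method, the noise-schedule tables alphas and sigmas, and the weight lambda_fm are hypothetical stand-ins, not the paper's actual interfaces.

```python
import torch
import torch.nn.functional as F

def one_step_loss(noise, text_emb, teacher_sample, student_unet, teacher_unet,
                  alphas, sigmas, lambda_fm=1.0):
    # One network evaluation takes pure noise directly to a clean latent.
    x = student_unet(noise, text_emb)

    # Feature matching: align intermediate teacher features computed on the
    # student's sample with those computed on a multi-step teacher sample.
    f_student = teacher_unet.features(x, text_emb)
    with torch.no_grad():
        f_teacher = teacher_unet.features(teacher_sample, text_emb)
    fm_loss = F.mse_loss(f_student, f_teacher)

    # Score distillation: re-noise the student's sample at a random timestep
    # and nudge it along the frozen teacher's denoising direction.
    t = torch.randint(0, len(alphas), (x.shape[0],), device=x.device)
    eps = torch.randn_like(x)
    a, s = alphas[t].view(-1, 1, 1, 1), sigmas[t].view(-1, 1, 1, 1)
    x_t = a * x + s * eps
    with torch.no_grad():
        eps_hat = teacher_unet(x_t, t, text_emb)
    # Standard SDS surrogate: the gradient of this MSE w.r.t. x is
    # proportional to (eps_hat - eps).
    sds_loss = 0.5 * F.mse_loss(x, (x - (eps_hat - eps)).detach())

    return lambda_fm * fm_loss + sds_loss
```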
Experimental Validation and Outcomes
The effectiveness of the SDXS models is demonstrated through comprehensive experiments. Benchmarking against existing models such as SD v1.5 and SDXL shows substantial efficiency gains without compromising image quality. Across different resolutions, the models achieve marked latency improvements while maintaining competitive FID and CLIP scores, standard indicators of image fidelity and alignment with textual prompts.
Further Application in Image-Conditioned Control
Building on these contributions, the paper also applies the optimized model to tasks involving image-conditioned generation. By adapting the distilled model to work with ControlNet for efficient image-to-image translation, we open avenues for deploying these capabilities on edge devices, highlighting the model's versatility and practical utility.
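At inference time, image-conditioned one-step generation reduces to a single forward pass, as in the following sketch. The handles one_step_unet, control_branch, vae_decoder, and the control_residuals keyword are hypothetical placeholders for the distilled SDXS components and a ControlNet-style branch; real checkpoints and call signatures will differ.

```python
import torch

@torch.no_grad()
def generate_with_control(control_image, text_emb, one_step_unet,
                          control_branch, vae_decoder, latent_shape):
    noise = torch.randn(latent_shape, device=control_image.device)
    # The control branch derives residual features from the conditioning
    # image (e.g., edges or depth) and injects them into the U-Net.
    residuals = control_branch(control_image, text_emb)
    latent = one_step_unet(noise, text_emb, control_residuals=residuals)
    # Single forward pass end to end; no iterative sampling loop.
    return vae_decoder(latent)
```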
Future Perspectives and Conclusion
The paper concludes by reflecting on the promising directions this research opens. Deploying efficient, high-quality image generation models on low-power devices presents an exciting frontier for real-time, interactive applications across many sectors. By enabling real-time, efficient image generation with latent diffusion models, this work lays a foundation for extending these advances to broader AI-driven image and video generation tasks.