SDXS: Real-Time One-Step Latent Diffusion Models with Image Conditions (2403.16627v2)

Published 25 Mar 2024 in cs.CV

Abstract: Recent advancements in diffusion models have positioned them at the forefront of image generation. Despite their superior performance, diffusion models are not without drawbacks; they are characterized by complex architectures and substantial computational demands, resulting in significant latency due to their iterative sampling process. To mitigate these limitations, we introduce a dual approach involving model miniaturization and a reduction in sampling steps, aimed at significantly decreasing model latency. Our methodology leverages knowledge distillation to streamline the U-Net and image decoder architectures, and introduces an innovative one-step DM training technique that utilizes feature matching and score distillation. We present two models, SDXS-512 and SDXS-1024, achieving inference speeds of approximately 100 FPS (30x faster than SD v1.5) and 30 FPS (60x faster than SDXL) on a single GPU, respectively. Moreover, our training approach offers promising applications in image-conditioned control, facilitating efficient image-to-image translation.

SDXS: Accelerating Latent Diffusion Models for Real-Time Image Generation with Image Conditions

Introduction to Latent Diffusion Models and Existing Challenges

Latent diffusion models have recently emerged as a prominent technology for image generation, showing exceptional capability in producing high-quality images. Applied to tasks such as text-to-image synthesis, these models have significantly advanced the field. Foundational models such as SD v1.5 and SDXL set benchmarks in quality, but they carry substantial computational demands and high latency owing to their intricate architectures and iterative sampling mechanisms.

Addressing the Challenges

Recognizing these limitations, the presented work pursues a dual strategy: model miniaturization combined with a reduction in sampling steps. The goal is to retain image-generation quality while significantly improving operational efficiency. The paper introduces two models, SDXS-512 and SDXS-1024, which reach inference speeds of approximately 100 FPS and 30 FPS on a single GPU when generating 512×512 and 1024×1024 images, respectively. This corresponds to roughly 30× the speed of SD v1.5 and 60× the speed of SDXL.
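As a quick back-of-the-envelope check on these throughput figures (this arithmetic is ours, not the paper's):

```python
# Convert the reported throughputs into per-image latency and the
# implied baseline speed given the claimed speedup factors.
for name, fps, speedup in [("SDXS-512", 100, 30), ("SDXS-1024", 30, 60)]:
    latency_ms = 1000 / fps           # e.g. 100 FPS -> 10 ms per image
    baseline_fps = fps / speedup      # implied SD v1.5 / SDXL throughput
    print(f"{name}: {latency_ms:.1f} ms/image, baseline ~{baseline_fps:.2f} FPS")
```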

Methodological Insights

Model Miniaturization

A significant portion of the methodology centers on distilling the U-Net and VAE decoder within the latent diffusion framework. Knowledge distillation streamlines these components, preserving high-quality output while markedly reducing computational overhead. In particular, the strategy employs a lightweight image decoder trained to closely mimic the original VAE decoder's output, using a curated training loss that combines an output-distillation term with a GAN loss.
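To make the decoder-distillation objective concrete, here is a minimal sketch of such a combined output-distillation and GAN loss. The module names (`student_dec`, `teacher_dec`, `disc`) and the 0.1 GAN weighting are illustrative assumptions; the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def decoder_distillation_loss(student_dec, teacher_dec, disc, latents,
                              lambda_gan=0.1):
    """Combined output-distillation + adversarial loss (sketch)."""
    with torch.no_grad():
        target = teacher_dec(latents)   # frozen original VAE decoder
    pred = student_dec(latents)         # lightweight student decoder
    distill = F.l1_loss(pred, target)   # output-distillation term
    # Non-saturating generator loss: the student tries to make the
    # discriminator score its reconstructions as real.
    gan = F.softplus(-disc(pred)).mean()
    return distill + lambda_gan * gan
```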

Reduction in Sampling Steps

To avoid the heavy computational cost of iterative sampling, the work introduces a one-step diffusion model (DM) training technique. This approach streamlines the sampling process, substantially reducing the latency of image generation. By incorporating feature matching and score distillation into the training regimen, the method provides a pathway from multi-step to efficient one-step operation.
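The sketch below illustrates the two training signals named above, feature matching and score distillation, in a single training step. All names (`student`, `teacher_unet`, `feat_net`, `alphas_cumprod`) are hypothetical stand-ins, and the SDS-style surrogate is a generic form, not necessarily the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def one_step_training_losses(student, teacher_unet, feat_net, noise, cond,
                             x_teacher, alphas_cumprod, lambda_fm=1.0):
    x0 = student(noise, cond)                     # one-step sample

    # (1) Feature matching: compare the one-step sample with a multi-step
    # teacher sample through a frozen feature extractor.
    with torch.no_grad():
        f_t = feat_net(x_teacher)
    fm_loss = F.mse_loss(feat_net(x0), f_t)

    # (2) Score distillation: re-noise the student sample, query the frozen
    # teacher's noise prediction, and push x0 toward the teacher's score.
    t = torch.randint(0, len(alphas_cumprod), (x0.shape[0],), device=x0.device)
    a = alphas_cumprod[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x0)
    xt = a.sqrt() * x0 + (1 - a).sqrt() * eps
    with torch.no_grad():
        eps_pred = teacher_unet(xt, t, cond)
    # Surrogate whose gradient w.r.t. x0 is (eps_pred - eps), as in SDS.
    sds_loss = ((eps_pred - eps).detach() * x0).mean()

    return sds_loss + lambda_fm * fm_loss
```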

Experimental Validation and Outcomes

The strengths of the SDXS models are demonstrated through comprehensive experimentation. Benchmarking against SD v1.5 and SDXL shows remarkable efficiency gains without compromising image quality. The models' efficacy holds across resolutions, delivering latency improvements while maintaining competitive FID scores (a measure of image fidelity) and CLIP scores (a measure of alignment with textual prompts).
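For context, a typical FID / CLIP-score evaluation loop looks like the following sketch, here using `torchmetrics`; `generate` and `eval_loader` are placeholders, and this is not the paper's actual evaluation code.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.multimodal.clip_score import CLIPScore

fid = FrechetInceptionDistance(feature=2048)
clip = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")

for real_imgs, prompts in eval_loader:   # uint8 tensors, shape (N, 3, H, W)
    fake_imgs = generate(prompts)        # one-step SDXS sampling (placeholder)
    fid.update(real_imgs, real=True)     # accumulate real-image statistics
    fid.update(fake_imgs, real=False)    # accumulate generated-image statistics
    clip.update(fake_imgs, prompts)      # text-image alignment

print(f"FID: {fid.compute():.2f}, CLIP score: {clip.compute():.2f}")
```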

Further Application in Image-Conditioned Control

Building on these contributions, the paper also applies the optimized model to image-conditioned generation. By adapting the distilled model to work with ControlNet for efficient image-to-image translation, the authors open the door to running such capabilities on edge devices, highlighting the model's versatility and practical utility.
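As an illustration of how one-step, image-conditioned sampling might be invoked, here is a hedged sketch using the `diffusers` ControlNet pipeline. The SDXS checkpoint id is an assumption on our part; only the general pipeline pattern is shown.

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# Attach a Canny-edge ControlNet to a one-step text-to-image model.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "IDKiro/sdxs-512-0.9",             # assumed checkpoint id for SDXS-512
    controlnet=controlnet, torch_dtype=torch.float16,
).to("cuda")

control = load_image("edge_map.png")   # e.g. a Canny edge map of the source
image = pipe(
    "a photo of a cozy cabin in the woods",
    image=control,
    num_inference_steps=1,             # single-step sampling is the point
    guidance_scale=0.0,                # distilled one-step models skip CFG
).images[0]
image.save("out.png")
```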

Future Perspectives and Conclusion

The paper concludes by reflecting on promising future directions. Deploying such efficient, high-quality image generation models on low-power devices opens an exciting frontier for real-time, interactive applications across many sectors. By enabling efficient, real-time image generation with latent diffusion models, this work lays a foundation for extending these advances to broader AI-driven image and video generation tasks.

Authors (3)
  1. Yuda Song
  2. Zehao Sun
  3. Xuanwu Yin