
SDXL-Lightning: Progressive Adversarial Diffusion Distillation (2402.13929v3)

Published 21 Feb 2024 in cs.CV, cs.AI, and cs.LG

Abstract: We propose a diffusion distillation method that achieves new state-of-the-art in one-step/few-step 1024px text-to-image generation based on SDXL. Our method combines progressive and adversarial distillation to achieve a balance between quality and mode coverage. In this paper, we discuss the theoretical analysis, discriminator design, model formulation, and training techniques. We open-source our distilled SDXL-Lightning models both as LoRA and full UNet weights.

Progressive Adversarial Diffusion Distillation for Efficient Text-to-Image Generation

Introduction

Generative models, particularly diffusion models, have shown remarkable capabilities in domains such as text-to-image and text-to-video generation. However, their slow, iterative generation process poses significant computational challenges. This paper introduces a distillation method combining progressive and adversarial techniques, aimed at striking a balance between image quality and mode coverage in one-step or few-step generation. The proposed approach, termed SDXL-Lightning, substantially accelerates image generation while matching, and in some cases surpassing, the quality of state-of-the-art models.

Theoretical Foundations and Methodology

At the core of our method lies the fusion of progressive and adversarial distillation strategies, applied to diffusion models. Traditional approaches to reducing inference steps often lead to unacceptable quality loss or still require an impractically high number of steps to produce acceptable results. Our method, by contrast, leverages the strengths of both progressive and adversarial distillation to predict directly farther along the generation flow, surpassing previous methods in producing high-quality images in fewer steps.
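To make the progressive part concrete, here is a minimal PyTorch sketch of one distillation stage, in which the student learns to cover two teacher ODE steps in a single prediction. The toy denoiser, the simplified Euler-style update, and all names are illustrative assumptions, not the paper's actual SDXL formulation.

```python
# Minimal sketch of one progressive-distillation stage (illustrative only):
# the student learns to cover two teacher ODE steps in a single prediction,
# halving the number of inference steps per stage.
import torch
import torch.nn as nn

class ToyDenoiser(nn.Module):
    """Stand-in for the SDXL UNet: predicts the clean latent from (x_t, t)."""
    def __init__(self, dim=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 64), nn.SiLU(), nn.Linear(64, dim))

    def forward(self, x, t):
        return self.net(torch.cat([x, t[:, None]], dim=-1))

def euler_step(model, x, t, t_next):
    """One deterministic step along the sampling trajectory (simplified Euler-style update)."""
    x0_pred = model(x, t)
    w = (t_next / t.clamp(min=1e-5))[:, None]     # how much noise remains at t_next
    return w * x + (1 - w) * x0_pred

teacher, student = ToyDenoiser(), ToyDenoiser()
student.load_state_dict(teacher.state_dict())     # student starts from the teacher
opt = torch.optim.Adam(student.parameters(), lr=1e-4)

for _ in range(10):                               # toy training loop
    x_t = torch.randn(8, 16)                      # noisy latents
    t = torch.full((8,), 1.0)                     # current timestep
    t_mid, t_next = t * 0.75, t * 0.5             # two teacher steps -> one student step
    with torch.no_grad():
        target = euler_step(teacher, euler_step(teacher, x_t, t, t_mid), t_mid, t_next)
    pred = euler_step(student, x_t, t, t_next)    # student jumps directly to t_next
    loss = nn.functional.mse_loss(pred, target)   # the paper augments/replaces this MSE
    opt.zero_grad(); loss.backward(); opt.step()  # objective with an adversarial loss
```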

  • Progressive Distillation: Progressive distillation preserves the original ODE flow and mode coverage, but on its own it struggles to produce sharp images in few inference steps. Including it keeps the distilled models close to the original model's behavior, which makes them compatible with various LoRA modules and control plugins.
  • Adversarial Distillation: An adversarial loss at each distillation stage plays a crucial role in enhancing image quality. Instead of relying solely on mean squared error (MSE), which tends to produce blurry images, our method uses a pre-trained diffusion UNet encoder as the discriminator backbone, operating entirely in latent space. This allows efficient distillation at high resolution while providing flexibility in balancing sample quality and mode coverage (see the discriminator sketch after this list).
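The following is a hedged sketch of the latent-space discriminator idea: a pre-trained encoder backbone (a toy stand-in here, not the actual SDXL UNet encoder) with a small prediction head, trained with a standard hinge-style adversarial loss. Module names, sizes, and the choice to freeze the backbone are illustrative assumptions, not the paper's exact design.

```python
# Hedged sketch of a latent-space discriminator: reuse a pre-trained encoder
# that operates on latents (toy stand-in below) plus a lightweight head.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyLatentEncoder(nn.Module):
    """Placeholder for the pre-trained diffusion UNet encoder backbone."""
    def __init__(self, channels=4, width=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels, width, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(width, width, 3, stride=2, padding=1), nn.SiLU(),
        )

    def forward(self, latents):
        return self.conv(latents)

class LatentDiscriminator(nn.Module):
    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():      # backbone initialized from pre-trained
            p.requires_grad_(False)              # weights; frozen here for simplicity
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 1))

    def forward(self, latents):
        return self.head(self.encoder(latents))

disc = LatentDiscriminator(ToyLatentEncoder())
real_latents = torch.randn(4, 4, 128, 128)       # latents of teacher/real samples
fake_latents = torch.randn(4, 4, 128, 128)       # latents produced by the student

# Hinge losses: the discriminator separates real from fake, and the student
# (generator) is rewarded for fooling it.
d_loss = F.relu(1 - disc(real_latents)).mean() + F.relu(1 + disc(fake_latents)).mean()
g_loss = -disc(fake_latents).mean()
```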

Model Distillation and Results

Our distilled models, named SDXL-Lightning, deliver strong efficiency and quality in text-to-image generation, particularly at 1024px resolution. The models, open-sourced as both LoRA and full UNet weights, show significant improvements over existing distillation methods:

  • Efficiency and Quality: Our distillation procedure reduces the required inference steps to as few as one or two while achieving new state-of-the-art quality, as measured by established metrics such as Fréchet Inception Distance (FID) and CLIP score.
  • Discriminator Design and Training Techniques: The innovative discriminator design, leveraging the pre-trained diffusion model’s encoder, along with strategic training techniques, ensures stable training and high-quality image generation.
  • Adaptability and Compatibility: The distilled models remain compatible with existing LoRA modules and control plugins, making them easy to integrate into existing applications and further research in generative AI; a brief usage sketch follows below.
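As a usage illustration, the sketch below loads the released LoRA weights into the SDXL base pipeline with diffusers. The repository and checkpoint names are assumptions based on the public release and should be checked against the authors' model card.

```python
# Hedged usage sketch: loading an SDXL-Lightning LoRA into the SDXL base
# pipeline with diffusers. Repo and file names are assumptions; verify them
# against the authors' published model card.
import torch
from diffusers import StableDiffusionXLPipeline, EulerDiscreteScheduler

base = "stabilityai/stable-diffusion-xl-base-1.0"
repo = "ByteDance/SDXL-Lightning"                      # assumed release repo
ckpt = "sdxl_lightning_4step_lora.safetensors"         # assumed 4-step LoRA file

pipe = StableDiffusionXLPipeline.from_pretrained(
    base, torch_dtype=torch.float16, variant="fp16"
).to("cuda")
pipe.load_lora_weights(repo, weight_name=ckpt)         # distilled weights as LoRA
pipe.fuse_lora()

# Few-step sampling; the distilled model is typically run without
# classifier-free guidance (guidance_scale=0).
pipe.scheduler = EulerDiscreteScheduler.from_config(
    pipe.scheduler.config, timestep_spacing="trailing"
)
image = pipe(
    "a photo of a corgi surfing a wave, golden hour",
    num_inference_steps=4, guidance_scale=0,
).images[0]
image.save("lightning_4step.png")
```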

Future Directions

While SDXL-Lightning sets a new benchmark in efficient text-to-image generation, future work will explore optimizing the architecture for few-step generation processes and extending the method’s applicability across different domains and modalities. The open sourcing of these distilled models is anticipated to catalyze further advancements in the field.

Concluding Remarks

The proposed progressive adversarial diffusion distillation method represents a significant leap forward in the efficiency of high-quality text-to-image generation. By meticulously combining progressive and adversarial distillation techniques and employing innovative training mechanisms, the resulting SDXL-Lightning models practically balance quality, efficiency, and mode coverage, offering vast potential for real-world applications and further scholarly inquiry.

Authors
  1. Shanchuan Lin
  2. Anran Wang
  3. Xiao Yang