
Fast High-Resolution Image Synthesis with Latent Adversarial Diffusion Distillation (2403.12015v1)

Published 18 Mar 2024 in cs.CV

Abstract: Diffusion models are the main driver of progress in image and video synthesis, but suffer from slow inference speed. Distillation methods, like the recently introduced adversarial diffusion distillation (ADD) aim to shift the model from many-shot to single-step inference, albeit at the cost of expensive and difficult optimization due to its reliance on a fixed pretrained DINOv2 discriminator. We introduce Latent Adversarial Diffusion Distillation (LADD), a novel distillation approach overcoming the limitations of ADD. In contrast to pixel-based ADD, LADD utilizes generative features from pretrained latent diffusion models. This approach simplifies training and enhances performance, enabling high-resolution multi-aspect ratio image synthesis. We apply LADD to Stable Diffusion 3 (8B) to obtain SD3-Turbo, a fast model that matches the performance of state-of-the-art text-to-image generators using only four unguided sampling steps. Moreover, we systematically investigate its scaling behavior and demonstrate LADD's effectiveness in various applications such as image editing and inpainting.


Summary

  • The paper demonstrates that LADD simplifies diffusion distillation by operating in latent space, enabling efficient high-resolution image synthesis.
  • It unifies discriminator and teacher roles to control global and local image features while reducing computational complexity.
  • Applied to SD3 to produce SD3-Turbo, LADD achieves teacher-level image quality in only four unguided sampling steps.

Fast High-Resolution Image Synthesis with Latent Adversarial Diffusion Distillation

Introduction

The advent of diffusion models has marked a significant advance in image and video synthesis, offering an alternative to GANs for generating realistic and diverse samples. These models have a notable drawback, however: they require many network evaluations during inference, which makes generation slow. This limitation hinders real-time applications and has spurred research into accelerating diffusion models. Among these efforts, adversarial diffusion distillation (ADD) emerged as a promising approach to single-step image synthesis, but its reliance on a fixed, pixel-space DINOv2 discriminator made optimization expensive and difficult and restricted the resolution at which the discriminator could be trained.

Advancements in Diffusion Distillation

Latent Adversarial Diffusion Distillation (LADD) addresses these shortcomings by distilling in latent space. Unlike its predecessor, which discriminates in pixel space, LADD uses generative features from a pretrained latent diffusion model as the discriminator's input. This adjustment simplifies the training setup and extends distillation to high-resolution, multi-aspect-ratio image synthesis.

LADD employs a two-pronged strategy: it unifies the discriminator and teacher roles in a single frozen model, and it trains on synthetic data generated by the teacher. This yields several benefits (a training-step sketch follows the list):

  • Efficiency & Simplification: By bypassing the need for pixel space decoding, LADD introduces a more resource-efficient approach that simplifies the overall system architecture.
  • Control Over Discriminator Features: It offers a natural way to adjust the feedback provided by the discriminator, influencing whether more global or local image features are emphasized during training.
  • Improved Performance: LADD demonstrates superior performance to ADD and other single-step approaches across various metrics and applications, from high-resolution image generation to tasks like image editing and inpainting.
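
The following is a minimal, self-contained PyTorch sketch of a LADD-style training step. Everything here is a toy stand-in chosen for brevity: `ToyDenoiser` and `DiscHead` replace the SD3-scale backbone and its multi-level discriminator heads, random tensors replace text conditioning and the synthetic training latents, and timestep conditioning of the denoiser is omitted. What the sketch does mirror is the core mechanism: the frozen teacher supplies discriminator features directly in latent space, small trainable heads are the only adversarial parameters, and the noise level applied before feature extraction steers feedback between global structure (high noise) and local detail (low noise).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

LATENT_CH, H, W = 4, 32, 32  # toy latent shape


class ToyDenoiser(nn.Module):
    """Stand-in for a latent diffusion backbone; returns a denoised
    latent and, optionally, an intermediate feature map."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(LATENT_CH, 64, 3, padding=1), nn.SiLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.SiLU(),
        )
        self.out = nn.Conv2d(64, LATENT_CH, 3, padding=1)

    def forward(self, x, return_features=False):
        feats = self.body(x)
        pred = self.out(feats)
        return (pred, feats) if return_features else pred


class DiscHead(nn.Module):
    """Lightweight discriminator head applied to frozen teacher features."""
    def __init__(self, ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(ch, 64, 4, stride=2), nn.SiLU(),
            nn.Conv2d(64, 1, 4),
        )

    def forward(self, f):
        return self.net(f)


teacher = ToyDenoiser().eval()               # frozen "pretrained" model
for p in teacher.parameters():
    p.requires_grad_(False)
student = ToyDenoiser()                      # student starts as a copy
student.load_state_dict(teacher.state_dict())
disc = DiscHead()

opt_g = torch.optim.Adam(student.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-4)


def renoise(x, t):
    """Blend latents with fresh noise; larger t gives noisier input,
    so the teacher's features emphasize global structure over detail."""
    t = t.view(-1, 1, 1, 1)
    return (1 - t) * x + t * torch.randn_like(x)


for step in range(2):                        # toy loop with random "data"
    real = torch.randn(8, LATENT_CH, H, W)   # latents of (synthetic) data
    fake = student(torch.randn_like(real))   # one-step generation from noise
    t = torch.rand(8)

    # Discriminator update: hinge loss on frozen-teacher features.
    with torch.no_grad():
        _, f_real = teacher(renoise(real, t), return_features=True)
        _, f_fake = teacher(renoise(fake.detach(), t), return_features=True)
    d_loss = F.relu(1 - disc(f_real)).mean() + F.relu(1 + disc(f_fake)).mean()
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Student update: fool the head through the frozen teacher's features.
    _, f_fake = teacher(renoise(fake, t), return_features=True)
    g_loss = -disc(f_fake).mean()
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```

Because discrimination happens on latents, no VAE decode appears anywhere in the loop, which is the main source of the efficiency gains described above.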

Practical Applications and Results

The application of LADD to Stable Diffusion 3 (SD3), yielding SD3-Turbo, encapsulates the method's potential. SD3-Turbo matches the image quality of its teacher in merely four unguided sampling steps (a sampling sketch follows below), generating high-resolution, multi-aspect-ratio images from text prompts. The paper also presents systematic studies of LADD's scaling behavior and demonstrates its adaptability to practical applications, confirming its versatility and effectiveness.
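
"Four unguided sampling steps" means four network evaluations in total: because the distilled student no longer relies on classifier-free guidance, each step is a single forward pass instead of the two that guided sampling requires. Below is a hedged sketch of such a sampler for a rectified-flow-style model (the formulation underlying SD3); the `model(x, t, cond)` velocity signature and the uniform time grid are illustrative assumptions, not SD3-Turbo's exact interface or schedule.

```python
import torch


@torch.no_grad()
def sample_few_step(model, cond, shape, steps=4):
    """Euler sampling of a rectified-flow model without classifier-free
    guidance: one forward pass per step, four passes in total."""
    x = torch.randn(shape)                    # pure noise at t = 1
    ts = torch.linspace(1.0, 0.0, steps + 1)  # uniform time grid
    for i in range(steps):
        t = ts[i].expand(shape[0])
        v = model(x, t, cond)                 # predicted velocity field
        x = x + (ts[i + 1] - ts[i]) * v       # Euler step toward t = 0
    return x                                  # latents; decode with the VAE
```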

Future Implications and Research Directions

The development and implementation of LADD signify a substantial step forward in the distillation of diffusion models, enabling the generation of high-quality images in a fraction of the time previously required. This breakthrough could have notable implications in fields requiring rapid image synthesis, such as real-time image editing, video game development, and augmented reality applications.

Moreover, the success of LADD points toward fertile ground for future research, particularly in exploring the scalability of adversarial models within the constraints of current hardware and further refining the synthetic data generation process to enhance text-image alignment in generated outputs.

Conclusion

Latent Adversarial Diffusion Distillation represents a significant advancement in the field of image synthesis. By resolving key limitations associated with predecessor methods, LADD stands as a testament to the potential of leveraging latent spaces for efficient, high-quality image generation. As the community continues to build upon these findings, the horizon looks promising for the future development of faster, more versatile diffusion models capable of meeting the increasing demand for real-time, high-resolution image synthesis across various domains.
