
Fast High-Resolution Image Synthesis with Latent Adversarial Diffusion Distillation (2403.12015v1)

Published 18 Mar 2024 in cs.CV

Abstract: Diffusion models are the main driver of progress in image and video synthesis, but suffer from slow inference speed. Distillation methods, like the recently introduced adversarial diffusion distillation (ADD) aim to shift the model from many-shot to single-step inference, albeit at the cost of expensive and difficult optimization due to its reliance on a fixed pretrained DINOv2 discriminator. We introduce Latent Adversarial Diffusion Distillation (LADD), a novel distillation approach overcoming the limitations of ADD. In contrast to pixel-based ADD, LADD utilizes generative features from pretrained latent diffusion models. This approach simplifies training and enhances performance, enabling high-resolution multi-aspect ratio image synthesis. We apply LADD to Stable Diffusion 3 (8B) to obtain SD3-Turbo, a fast model that matches the performance of state-of-the-art text-to-image generators using only four unguided sampling steps. Moreover, we systematically investigate its scaling behavior and demonstrate LADD's effectiveness in various applications such as image editing and inpainting.


Summary

  • The paper demonstrates that LADD simplifies diffusion distillation by operating in latent space, enabling efficient high-resolution image synthesis.
  • It unifies discriminator and teacher roles to control global and local image features while reducing computational complexity.
  • Applied to SD3 to produce SD3-Turbo, LADD achieves teacher-level image quality in only four unguided sampling steps.

Fast High-Resolution Image Synthesis with Latent Adversarial Diffusion Distillation

Introduction

The advent of diffusion models has marked a significant advance in image and video synthesis, offering an alternative to GANs for generating realistic and diverse samples. These models have a notable drawback, however: they require many network evaluations during inference, which makes generation slow. This limitation hinders real-time applications and has spurred research into accelerating diffusion models. Among these efforts, adversarial diffusion distillation (ADD) emerged as a promising approach to single-step image synthesis, but its reliance on a fixed, pixel-space DINOv2 discriminator made optimization expensive and difficult and restricted the resolution at which the discriminator could be trained.

Advancements in Diffusion Distillation

Latent Adversarial Diffusion Distillation (LADD) addresses these shortcomings by distilling in latent space. Unlike its predecessor, which discriminates in pixel space, LADD uses generative features from a pretrained latent diffusion model as the discriminator's input. This adjustment simplifies the training setup and extends distillation to high-resolution, multi-aspect-ratio image synthesis.

LADD employs a two-pronged strategy: it unifies the discriminator and teacher roles in a single frozen model, and it trains on synthetic data generated by the teacher. This yields several benefits (a training-step sketch follows the list):

  • Efficiency & Simplification: By bypassing the need for pixel space decoding, LADD introduces a more resource-efficient approach that simplifies the overall system architecture.
  • Control Over Discriminator Features: It offers a natural way to adjust the feedback provided by the discriminator, influencing whether more global or local image features are emphasized during training.
  • Improved Performance: LADD demonstrates superior performance to ADD and other single-step approaches across various metrics and applications, from high-resolution image generation to tasks like image editing and inpainting.
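
The following is a minimal, self-contained PyTorch sketch of a LADD-style training step. Everything here is a toy stand-in chosen for brevity: `ToyDenoiser` and `DiscHead` replace the SD3-scale backbone and its multi-level discriminator heads, random tensors replace text conditioning and the synthetic training latents, and timestep conditioning of the denoiser is omitted. What the sketch does mirror is the core mechanism: the frozen teacher supplies discriminator features directly in latent space, small trainable heads are the only adversarial parameters, and the noise level applied before feature extraction steers feedback between global structure (high noise) and local detail (low noise).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

LATENT_CH, H, W = 4, 32, 32  # toy latent shape


class ToyDenoiser(nn.Module):
    """Stand-in for a latent diffusion backbone; returns a denoised
    latent and, optionally, an intermediate feature map."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(LATENT_CH, 64, 3, padding=1), nn.SiLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.SiLU(),
        )
        self.out = nn.Conv2d(64, LATENT_CH, 3, padding=1)

    def forward(self, x, return_features=False):
        feats = self.body(x)
        pred = self.out(feats)
        return (pred, feats) if return_features else pred


class DiscHead(nn.Module):
    """Lightweight discriminator head applied to frozen teacher features."""
    def __init__(self, ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(ch, 64, 4, stride=2), nn.SiLU(),
            nn.Conv2d(64, 1, 4),
        )

    def forward(self, f):
        return self.net(f)


teacher = ToyDenoiser().eval()               # frozen "pretrained" model
for p in teacher.parameters():
    p.requires_grad_(False)
student = ToyDenoiser()                      # student starts as a copy
student.load_state_dict(teacher.state_dict())
disc = DiscHead()

opt_g = torch.optim.Adam(student.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-4)


def renoise(x, t):
    """Blend latents with fresh noise; larger t gives noisier input,
    so the teacher's features emphasize global structure over detail."""
    t = t.view(-1, 1, 1, 1)
    return (1 - t) * x + t * torch.randn_like(x)


for step in range(2):                        # toy loop with random "data"
    real = torch.randn(8, LATENT_CH, H, W)   # latents of (synthetic) data
    fake = student(torch.randn_like(real))   # one-step generation from noise
    t = torch.rand(8)

    # Discriminator update: hinge loss on frozen-teacher features.
    with torch.no_grad():
        _, f_real = teacher(renoise(real, t), return_features=True)
        _, f_fake = teacher(renoise(fake.detach(), t), return_features=True)
    d_loss = F.relu(1 - disc(f_real)).mean() + F.relu(1 + disc(f_fake)).mean()
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Student update: fool the head through the frozen teacher's features.
    _, f_fake = teacher(renoise(fake, t), return_features=True)
    g_loss = -disc(f_fake).mean()
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```

Because discrimination happens on latents, no VAE decode appears anywhere in the loop, which is the main source of the efficiency gains described above.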

Practical Applications and Results

The application of LADD to Stable Diffusion 3 (SD3), yielding SD3-Turbo, encapsulates the method's potential. SD3-Turbo matches the image quality of its teacher in merely four unguided sampling steps (a sampling sketch follows below), generating high-resolution, multi-aspect-ratio images from text prompts. The paper also presents systematic studies of LADD's scaling behavior and demonstrates its adaptability to practical applications, confirming its versatility and effectiveness.
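
"Four unguided sampling steps" means four network evaluations in total: because the distilled student no longer relies on classifier-free guidance, each step is a single forward pass instead of the two that guided sampling requires. Below is a hedged sketch of such a sampler for a rectified-flow-style model (the formulation underlying SD3); the `model(x, t, cond)` velocity signature and the uniform time grid are illustrative assumptions, not SD3-Turbo's exact interface or schedule.

```python
import torch


@torch.no_grad()
def sample_few_step(model, cond, shape, steps=4):
    """Euler sampling of a rectified-flow model without classifier-free
    guidance: one forward pass per step, four passes in total."""
    x = torch.randn(shape)                    # pure noise at t = 1
    ts = torch.linspace(1.0, 0.0, steps + 1)  # uniform time grid
    for i in range(steps):
        t = ts[i].expand(shape[0])
        v = model(x, t, cond)                 # predicted velocity field
        x = x + (ts[i + 1] - ts[i]) * v       # Euler step toward t = 0
    return x                                  # latents; decode with the VAE
```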

Future Implications and Research Directions

The development and implementation of LADD signify a substantial step forward in the distillation of diffusion models, enabling the generation of high-quality images in a fraction of the time previously required. This breakthrough could have notable implications in fields requiring rapid image synthesis, such as real-time image editing, video game development, and augmented reality applications.

Moreover, the success of LADD points toward fertile ground for future research, particularly in exploring the scalability of adversarial models within the constraints of current hardware and further refining the synthetic data generation process to enhance text-image alignment in generated outputs.

Conclusion

Latent Adversarial Diffusion Distillation represents a significant advancement in the field of image synthesis. By resolving key limitations associated with predecessor methods, LADD stands as a testament to the potential of leveraging latent spaces for efficient, high-quality image generation. As the community continues to build upon these findings, the horizon looks promising for the future development of faster, more versatile diffusion models capable of meeting the increasing demand for real-time, high-resolution image synthesis across various domains.
