SnapFusion: Text-to-Image Diffusion Model on Mobile Devices within Two Seconds (2306.00980v3)

Published 1 Jun 2023 in cs.CV, cs.AI, and cs.LG

Abstract: Text-to-image diffusion models can create stunning images from natural language descriptions that rival the work of professional artists and photographers. However, these models are large, with complex network architectures and tens of denoising iterations, making them computationally expensive and slow to run. As a result, high-end GPUs and cloud-based inference are required to run diffusion models at scale. This is costly and has privacy implications, especially when user data is sent to a third party. To overcome these challenges, we present a generic approach that, for the first time, unlocks running text-to-image diffusion models on mobile devices in less than 2 seconds. We achieve this by introducing an efficient network architecture and improving step distillation. Specifically, we propose an efficient UNet by identifying the redundancy of the original model and reducing the computation of the image decoder via data distillation. Further, we enhance the step distillation by exploring training strategies and introducing regularization from classifier-free guidance. Our extensive experiments on MS-COCO show that our model with 8 denoising steps achieves better FID and CLIP scores than Stable Diffusion v1.5 with 50 steps. Our work democratizes content creation by bringing powerful text-to-image diffusion models to the hands of users.

Authors (9)
  1. Yanyu Li (31 papers)
  2. Huan Wang (211 papers)
  3. Qing Jin (17 papers)
  4. Ju Hu (9 papers)
  5. Pavlo Chemerys (2 papers)
  6. Yun Fu (131 papers)
  7. Yanzhi Wang (197 papers)
  8. Sergey Tulyakov (108 papers)
  9. Jian Ren (97 papers)
Citations (112)

Summary

SnapFusion: A Mobile-Optimized Text-to-Image Diffusion Model

The paper presents SnapFusion, a text-to-image diffusion model engineered to run on mobile devices, generating images in under two seconds. SnapFusion addresses the computational cost and privacy risks of conventional text-to-image diffusion models, which typically require high-end GPUs and cloud-based inference.

Contributions and Methodology

The paper pairs architectural optimizations with improved step distillation to enable fast on-device inference. Its central contributions are outlined below:

  1. Efficient UNet Architecture: The authors identify redundancy in the original UNet, the backbone of the diffusion model, and remove it through a robust training and evaluation mechanism, substantially reducing latency while preserving image quality.
  2. Network Architecture Evolving Framework: A framework that systematically evolves the network architecture, pairing robust stochastic training with an evolutionary algorithm that prunes redundant blocks to improve inference speed.
  3. Compressed VAE Decoder: To accelerate image decoding, the VAE decoder is compressed via data distillation with negligible loss in visual quality: a distillation pipeline trains the compact decoder on synthetic latent-image pairs (see the first sketch after this list).
  4. CFG-Aware Step Distillation: Step distillation is enhanced by integrating classifier-free guidance (CFG) into the distillation objective, cutting the number of denoising iterations while sustaining image fidelity; the resulting 8-step model performs comparably to its 50-step counterpart (see the second sketch after this list).
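
To make the data-distillation idea in item 3 concrete, here is a minimal PyTorch-style sketch. It assumes `unet_sampler` is a helper that runs the frozen text-to-image pipeline and returns latents for a batch of prompts; that helper, the module names, and the plain MSE objective are illustrative assumptions, not the paper's actual code.

```python
import torch
import torch.nn.functional as F

def decoder_distill_step(compact_decoder, original_decoder,
                         unet_sampler, prompts, optimizer):
    """One data-distillation step for the compressed VAE decoder.

    Synthetic (latent, image) pairs come from the frozen pipeline:
    the UNet produces latents, the original decoder produces the
    target images, and the compact decoder regresses those targets.
    """
    with torch.no_grad():
        latents = unet_sampler(prompts)        # synthetic latents (assumed helper)
        targets = original_decoder(latents)    # teacher decoder's images
    preds = compact_decoder(latents)           # student decoder's reconstruction
    loss = F.mse_loss(preds, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```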

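Similarly, the CFG-aware loss in item 4 can be sketched as below. This shows only the guidance-aware regression component: the teacher's noise prediction is guided before it becomes the target, so the guidance signal is distilled into the student. The interface (a UNet called as `unet(x_t, t, emb)`), the choice to also guide the student, and the scale `w=7.5` are assumptions for illustration; the full method additionally involves progressive reduction of the step count, which is omitted here.

```python
import torch
import torch.nn.functional as F

def cfg_epsilon(unet, x_t, t, cond_emb, uncond_emb, w):
    """Classifier-free guidance: blend conditional and unconditional
    noise predictions with guidance scale w."""
    eps_c = unet(x_t, t, cond_emb)
    eps_u = unet(x_t, t, uncond_emb)
    return eps_u + w * (eps_c - eps_u)

def cfg_aware_distill_loss(teacher, student, x_t, t,
                           cond_emb, uncond_emb, w=7.5):
    # Guide the teacher's prediction before using it as the target,
    # so the CFG signal is baked into what the student learns; guiding
    # the student as well keeps it on the guided trajectory.
    with torch.no_grad():
        target = cfg_epsilon(teacher, x_t, t, cond_emb, uncond_emb, w)
    pred = cfg_epsilon(student, x_t, t, cond_emb, uncond_emb, w)
    return F.mse_loss(pred, target)
```
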
Numerical Outcomes

Experiments on the MS-COCO dataset show that SnapFusion with only 8 denoising steps achieves better FID and CLIP scores than Stable Diffusion v1.5 with 50 steps, despite running with far fewer computational resources. In particular, the 8-step model surpasses the 50-step baseline in image-text alignment as measured by CLIP score.

Implications and Future Directions

SnapFusion advances the democratization of content creation by bringing powerful diffusion models to consumer devices. Practical applications span interactive digital content and real-time artistic rendering on consumer hardware. The work also opens avenues for further research into efficient architecture search and distillation, with potential extensions to other domains such as video synthesis and 3D content creation.

Future research could further miniaturize these models to fit diverse mobile hardware or improve their adaptability to varied stylistic attributes. As demand for efficient, high-quality on-device AI models grows, SnapFusion provides a blueprint for overcoming the latency constraints of large-scale machine learning models, ensuring broad accessibility without compromising data privacy.
