SnapFusion: Text-to-Image Diffusion Model on Mobile Devices within Two Seconds (2306.00980v3)

Published 1 Jun 2023 in cs.CV, cs.AI, and cs.LG

Abstract: Text-to-image diffusion models can create stunning images from natural language descriptions that rival the work of professional artists and photographers. However, these models are large, with complex network architectures and tens of denoising iterations, making them computationally expensive and slow to run. As a result, high-end GPUs and cloud-based inference are required to run diffusion models at scale. This is costly and has privacy implications, especially when user data is sent to a third party. To overcome these challenges, we present a generic approach that, for the first time, unlocks running text-to-image diffusion models on mobile devices in less than 2 seconds. We achieve this by introducing an efficient network architecture and improving step distillation. Specifically, we propose an efficient UNet by identifying the redundancy of the original model and reducing the computation of the image decoder via data distillation. Further, we enhance the step distillation by exploring training strategies and introducing regularization from classifier-free guidance. Our extensive experiments on MS-COCO show that our model with 8 denoising steps achieves better FID and CLIP scores than Stable Diffusion v1.5 with 50 steps. Our work democratizes content creation by bringing powerful text-to-image diffusion models to the hands of users.

Authors (9)
  1. Yanyu Li (31 papers)
  2. Huan Wang (211 papers)
  3. Qing Jin (17 papers)
  4. Ju Hu (9 papers)
  5. Pavlo Chemerys (2 papers)
  6. Yun Fu (131 papers)
  7. Yanzhi Wang (197 papers)
  8. Sergey Tulyakov (108 papers)
  9. Jian Ren (97 papers)
Citations (112)

Summary

SnapFusion: A Mobile-Optimized Text-to-Image Diffusion Model

The paper presents SnapFusion, a text-to-image diffusion model engineered to run on mobile devices, generating images in under two seconds. SnapFusion addresses the computational cost and privacy risks of conventional text-to-image diffusion models, which typically require high-end GPUs and cloud-based inference.

Contributions and Methodology

The paper pairs architectural optimizations with improved step distillation to enable fast on-device inference. Its central contributions are outlined below:

  1. Efficient UNet Architecture: The authors identify redundancy in the original UNet, the backbone of the diffusion model, and remove it through a robust training and evaluation mechanism, substantially reducing latency while preserving image quality.
  2. Network Architecture Evolving Framework: A framework that systematically evolves the network architecture, pairing robust stochastic training with an evolutionary algorithm that prunes redundant blocks to improve inference speed.
  3. Compressed VAE Decoder: To accelerate image decoding, the VAE decoder is compressed via data distillation with negligible loss in visual quality: a distillation pipeline trains the compact decoder on synthetic latent-image pairs (see the first sketch after this list).
  4. CFG-Aware Step Distillation: Step distillation is enhanced by integrating classifier-free guidance (CFG) into the distillation objective, cutting the number of denoising iterations while sustaining image fidelity; the resulting 8-step model performs comparably to its 50-step counterpart (see the second sketch after this list).
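
To make the data-distillation idea in item 3 concrete, here is a minimal PyTorch-style sketch. It assumes `unet_sampler` is a helper that runs the frozen text-to-image pipeline and returns latents for a batch of prompts; that helper, the module names, and the plain MSE objective are illustrative assumptions, not the paper's actual code.

```python
import torch
import torch.nn.functional as F

def decoder_distill_step(compact_decoder, original_decoder,
                         unet_sampler, prompts, optimizer):
    """One data-distillation step for the compressed VAE decoder.

    Synthetic (latent, image) pairs come from the frozen pipeline:
    the UNet produces latents, the original decoder produces the
    target images, and the compact decoder regresses those targets.
    """
    with torch.no_grad():
        latents = unet_sampler(prompts)        # synthetic latents (assumed helper)
        targets = original_decoder(latents)    # teacher decoder's images
    preds = compact_decoder(latents)           # student decoder's reconstruction
    loss = F.mse_loss(preds, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```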

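Similarly, the CFG-aware loss in item 4 can be sketched as below. This shows only the guidance-aware regression component: the teacher's noise prediction is guided before it becomes the target, so the guidance signal is distilled into the student. The interface (a UNet called as `unet(x_t, t, emb)`), the choice to also guide the student, and the scale `w=7.5` are assumptions for illustration; the full method additionally involves progressive reduction of the step count, which is omitted here.

```python
import torch
import torch.nn.functional as F

def cfg_epsilon(unet, x_t, t, cond_emb, uncond_emb, w):
    """Classifier-free guidance: blend conditional and unconditional
    noise predictions with guidance scale w."""
    eps_c = unet(x_t, t, cond_emb)
    eps_u = unet(x_t, t, uncond_emb)
    return eps_u + w * (eps_c - eps_u)

def cfg_aware_distill_loss(teacher, student, x_t, t,
                           cond_emb, uncond_emb, w=7.5):
    # Guide the teacher's prediction before using it as the target,
    # so the CFG signal is baked into what the student learns; guiding
    # the student as well keeps it on the guided trajectory.
    with torch.no_grad():
        target = cfg_epsilon(teacher, x_t, t, cond_emb, uncond_emb, w)
    pred = cfg_epsilon(student, x_t, t, cond_emb, uncond_emb, w)
    return F.mse_loss(pred, target)
```
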
Numerical Outcomes

Experiments on the MS-COCO dataset show that SnapFusion with only 8 denoising steps achieves better FID and CLIP scores than Stable Diffusion v1.5 with 50 steps, despite running with far fewer computational resources. In particular, the 8-step model surpasses the 50-step baseline in image-text alignment as measured by CLIP score.

Implications and Future Directions

SnapFusion advances the democratization of content creation by bringing powerful diffusion models to consumer devices. Practical applications span interactive digital content and real-time artistic rendering on consumer hardware. The work also opens avenues for further research into efficient architecture search and distillation, with potential extensions to other domains such as video synthesis and 3D content creation.

Future research could further miniaturize these models to fit diverse mobile hardware or improve their adaptability to varied stylistic attributes. As demand for efficient, high-quality on-device AI models grows, SnapFusion provides a blueprint for overcoming the latency constraints of large-scale machine learning models, ensuring broad accessibility without compromising data privacy.
