SnapFusion: Text-to-Image Diffusion Model on Mobile Devices within Two Seconds (2306.00980v3)
Abstract: Text-to-image diffusion models can create stunning images from natural language descriptions that rival the work of professional artists and photographers. However, these models are large, with complex network architectures and tens of denoising iterations, making them computationally expensive and slow to run. As a result, high-end GPUs and cloud-based inference are required to run diffusion models at scale. This is costly and has privacy implications, especially when user data is sent to a third party. To overcome these challenges, we present a generic approach that, for the first time, unlocks running text-to-image diffusion models on mobile devices in less than $2$ seconds. We achieve this by introducing an efficient network architecture and improving step distillation. Specifically, we propose an efficient UNet by identifying redundancy in the original model, and we reduce the computation of the image decoder via data distillation. Further, we enhance step distillation by exploring training strategies and introducing regularization from classifier-free guidance. Our extensive experiments on MS-COCO show that our model with $8$ denoising steps achieves better FID and CLIP scores than Stable Diffusion v$1.5$ with $50$ steps. Our work democratizes content creation by bringing powerful text-to-image diffusion models into the hands of users.
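To make the distillation idea concrete, below is a minimal PyTorch sketch of classifier-free-guidance-aware step distillation, the general mechanism the abstract describes: a few-step student is regressed onto the teacher's *guided* noise prediction, so the guidance strength is baked into the student. The `model(z_t, t, emb)` call signature, the toy tensor shapes, and the plain MSE objective are illustrative assumptions, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def cfg_epsilon(model, z_t, t, text_emb, null_emb, w):
    # Classifier-free guidance (Ho & Salimans): blend the conditional and
    # unconditional noise predictions with guidance scale w.
    eps_cond = model(z_t, t, text_emb)
    eps_uncond = model(z_t, t, null_emb)
    return eps_uncond + w * (eps_cond - eps_uncond)

def cfg_distill_loss(student, teacher, z_t, t, text_emb, null_emb, w):
    # Sketch of a CFG-regularized distillation step: the student matches
    # the teacher's guided output with a single conditional forward pass,
    # so no second (unconditional) UNet evaluation is needed at inference.
    with torch.no_grad():
        target = cfg_epsilon(teacher, z_t, t, text_emb, null_emb, w)
    pred = student(z_t, t, text_emb)
    return F.mse_loss(pred, target)

# Toy usage with a stand-in "UNet" (any callable with this signature works):
unet = lambda z, t, c: 0.9 * z + c.mean()   # placeholder, not a real UNet
z = torch.randn(2, 4, 64, 64)               # latent batch (SD-style shapes)
emb = torch.randn(2, 77, 768)               # text embeddings
null = torch.zeros_like(emb)                # "empty prompt" embedding
loss = cfg_distill_loss(unet, unet, z, torch.tensor([500, 500]), emb, null, w=7.5)
```

Folding guidance into the student this way halves the per-step UNet cost relative to standard CFG sampling, which compounds with the reduction from $50$ to $8$ denoising steps.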
Authors: Yanyu Li, Huan Wang, Qing Jin, Ju Hu, Pavlo Chemerys, Yun Fu, Yanzhi Wang, Sergey Tulyakov, Jian Ren