Speed Is All You Need: On-Device Acceleration of Large Diffusion Models via GPU-Aware Optimizations
Abstract: The rapid development and application of foundation models have revolutionized the field of artificial intelligence. Large diffusion models have gained significant attention for their ability to generate photorealistic images and support a variety of tasks. On-device deployment of these models offers benefits such as lower server costs, offline functionality, and improved user privacy. However, common large diffusion models have over 1 billion parameters, which poses challenges given the restricted computational and memory resources on devices. We present a series of implementation optimizations for large diffusion models that achieve the fastest inference latency reported to date: under 12 seconds for Stable Diffusion 1.4, without int8 quantization, on a Samsung S23 Ultra for a 512×512 image with 20 denoising iterations on GPU-equipped mobile devices. These enhancements broaden the applicability of generative AI and improve the overall user experience across a wide range of devices.