
Speed Is All You Need: On-Device Acceleration of Large Diffusion Models via GPU-Aware Optimizations

Published 21 Apr 2023 in cs.CV, cs.LG, and eess.IV (arXiv:2304.11267v2)

Abstract: The rapid development and application of foundation models have revolutionized the field of artificial intelligence. Large diffusion models have gained significant attention for their ability to generate photorealistic images and support various tasks. On-device deployment of these models provides benefits such as lower server costs, offline functionality, and improved user privacy. However, common large diffusion models have over 1 billion parameters and pose challenges due to restricted computational and memory resources on devices. We present a series of implementation optimizations for large diffusion models that achieve the fastest reported inference latency to-date (under 12 seconds for Stable Diffusion 1.4 without int8 quantization on Samsung S23 Ultra for a 512x512 image with 20 iterations) on GPU-equipped mobile devices. These enhancements broaden the applicability of generative AI and improve the overall user experience across a wide range of devices.


Summary

  • The paper achieves a significant reduction in inference latency by using custom GPU-aware techniques that cut processing time to under 12 seconds per image on mobile devices.
  • It introduces specialized GPU kernels for operations such as Group Normalization and GELU, streamlining memory usage and boosting processing speed.
  • Advanced attention mechanisms, including partially fused softmax and the Winograd convolution algorithm, further enhance computational throughput for efficient on-device performance.

On-Device Acceleration of Large Diffusion Models Using GPU-Aware Optimizations

The research paper "Speed Is All You Need: On-Device Acceleration of Large Diffusion Models via GPU-Aware Optimizations" focuses on optimizing large diffusion models for on-device execution. Specifically, it presents techniques that reduce inference latency while preserving output quality, targeting GPU-equipped mobile devices. The primary workload is Stable Diffusion 1.4, a diffusion-based generative model with more than 1 billion parameters that produces high-quality photorealistic images. The authors, affiliated with Google LLC, present a comprehensive suite of implementation optimizations for executing these substantial models under constrained computational resources.

Key Contributions and Findings

  1. Inference Latency Optimization: The authors achieve a significant reduction in inference latency through several GPU-aware techniques that improve computational efficiency during the iterative denoising process at the core of diffusion models. Notably, they report inference times under 12 seconds on devices like the Samsung S23 Ultra for generating 512x512 images.
  2. Specialized Kernels: Custom GPU kernels are developed for frequently used operations such as Group Normalization (GN) and the Gaussian Error Linear Unit (GELU). These kernels execute each operation in a single GPU command and avoid materializing intermediate tensors, reducing both memory traffic and processing time.
  3. Enhanced Attention Mechanisms: The paper addresses the costly attention modules in the model by introducing a partially fused softmax and applying FlashAttention selectively where it is profitable. Both methods reduce memory accesses and make better use of on-chip resources to raise computational throughput.
  4. Winograd Convolution Algorithm: To handle the load imposed by the model's many convolutional layers, the authors apply the Winograd convolution algorithm, which significantly reduces the number of multiplications required, at the cost of careful management of memory overhead and numerical precision.
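The techniques in items 2-4 can be illustrated with minimal NumPy sketches. These are illustrative reference implementations only, not the paper's GPU kernels; the function names, the tanh-based GELU approximation, and the 1-D F(2,3) Winograd tile are our own choices for clarity.

```python
import numpy as np

def gelu(x):
    # Tanh approximation of GELU (Hendrycks & Gimpel, 2016). A fused kernel
    # would evaluate this elementwise in a single GPU dispatch.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def group_norm(x, num_groups, eps=1e-5):
    # Group Normalization over an NCHW tensor: channels are split into groups,
    # and each group is normalized by its own mean and variance. A fused kernel
    # performs the reduction and normalization without intermediate tensors.
    n, c, h, w = x.shape
    g = x.reshape(n, num_groups, c // num_groups, h, w)
    mean = g.mean(axis=(2, 3, 4), keepdims=True)
    var = g.var(axis=(2, 3, 4), keepdims=True)
    return ((g - mean) / np.sqrt(var + eps)).reshape(n, c, h, w)

def online_softmax(scores, block=4):
    # Numerically safe softmax computed one block at a time, carrying only a
    # running max and a running sum of rescaled exponentials -- the core trick
    # behind partially fused softmax and FlashAttention, which avoids
    # materializing the full attention matrix in slow memory.
    m = -np.inf   # running max
    s = 0.0       # running sum of exponentials, rescaled to the current max
    for i in range(0, len(scores), block):
        chunk = scores[i:i + block]
        m_new = max(m, chunk.max())
        s = s * np.exp(m - m_new) + np.exp(chunk - m_new).sum()
        m = m_new
    return np.exp(scores - m) / s

def winograd_f23(d, g):
    # Winograd F(2,3): two outputs of a 3-tap 1-D convolution (correlation)
    # from a 4-element input tile using 4 multiplications instead of 6.
    # 2-D kernels such as F(4x4, 3x3) extend the same idea.
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    m1 = (d0 - d2) * g0
    m2 = (d1 + d2) * (g0 + g1 + g2) / 2
    m3 = (d2 - d1) * (g0 - g1 + g2) / 2
    m4 = (d1 - d3) * g2
    return np.array([m1 + m2 + m3, m2 - m3 - m4])
```

The block-wise softmax matches the exact softmax because each partial sum is rescaled whenever a larger maximum is seen, and the Winograd tile reproduces direct convolution exactly in real arithmetic; the precision concerns the paper notes arise only in low-precision GPU formats.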

Experimental Results

Extensive evaluations are carried out on the Samsung S23 Ultra and the iPhone 14 Pro Max. The optimizations reduce end-to-end latency by 52.2% and 32.9% relative to the baseline on the two devices, respectively. Moreover, the experiments underscore the practicality of deploying large models like Stable Diffusion directly on consumer-grade hardware while preserving output quality.

Implications and Future Work

The ramifications of this study are considerable, primarily for the accessibility and user experience of generative AI applications running directly on mobile devices. Reduced reliance on server-based computation can improve privacy, lower server costs, and help applications built on these models scale. The optimization strategies proposed in the study make better use of existing mobile hardware and extend the potential for advanced AI implementations in consumer electronics.

Looking ahead, the techniques outlined might be expanded for broader application across other deep learning model architectures that face similar challenges concerning on-device deployment. As both hardware and algorithmic advancements continue to progress, the trajectory set by this work places greater emphasis on integrating AI more intimately with everyday devices, potentially standardizing high-performance model execution in environments previously considered resource-limited.

In conclusion, this research offers valuable insights and tangible methods to enhance the pace and feasibility of deploying complex generative models on mobile platforms, paving the way for a more direct and accessible interaction with advanced AI technologies in daily life. The methods detailed in the research hold promise for further research and development in refining and tailoring AI models to better suit a diverse range of computational environments.
