
Speed Is All You Need: On-Device Acceleration of Large Diffusion Models via GPU-Aware Optimizations

Published 21 Apr 2023 in cs.CV, cs.LG, and eess.IV (arXiv:2304.11267v2)

Abstract: The rapid development and application of foundation models have revolutionized the field of artificial intelligence. Large diffusion models have gained significant attention for their ability to generate photorealistic images and support various tasks. On-device deployment of these models provides benefits such as lower server costs, offline functionality, and improved user privacy. However, common large diffusion models have over 1 billion parameters and pose challenges due to restricted computational and memory resources on devices. We present a series of implementation optimizations for large diffusion models that achieve the fastest reported inference latency to-date (under 12 seconds for Stable Diffusion 1.4 without int8 quantization on Samsung S23 Ultra for a 512x512 image with 20 iterations) on GPU-equipped mobile devices. These enhancements broaden the applicability of generative AI and improve the overall user experience across a wide range of devices.


Summary

  • The paper achieves a significant reduction in inference latency by using custom GPU-aware techniques that cut processing time to under 12 seconds per image on mobile devices.
  • It introduces specialized GPU kernels for operations such as Group Normalization and GELU, streamlining memory usage and boosting processing speed.
  • Advanced attention mechanisms, including partially fused softmax and the Winograd convolution algorithm, further enhance computational throughput for efficient on-device performance.

On-Device Acceleration of Large Diffusion Models Using GPU-Aware Optimizations

The research paper "Speed Is All You Need: On-Device Acceleration of Large Diffusion Models via GPU-Aware Optimizations" focuses on optimizing large diffusion models for on-device execution. Specifically, it presents techniques that reduce inference latency while preserving output quality, targeting GPU-equipped mobile devices. The primary workload is Stable Diffusion 1.4, a diffusion-based generative model with more than 1 billion parameters that produces high-quality photorealistic images. The authors, affiliated with Google LLC, present a comprehensive suite of implementation optimizations for executing these substantial models under constrained computational resources.

Key Contributions and Findings

  1. Inference Latency Optimization: The authors achieve a significant reduction in inference latency through several GPU-aware techniques that improve computational efficiency during the iterative denoising process at the core of diffusion models. Notably, they report inference times under 12 seconds on devices like the Samsung S23 Ultra for generating 512x512 images.
  2. Specialized Kernels: Custom GPU kernels are developed for frequently used operations such as Group Normalization (GN) and the Gaussian Error Linear Unit (GELU). These kernels execute each operation in a single GPU command and avoid materializing intermediate tensors, reducing both memory traffic and processing time.
  3. Enhanced Attention Mechanisms: The paper addresses the costly attention modules in the model by introducing a partially fused softmax and applying FlashAttention selectively where it is profitable. Both methods reduce memory accesses and make better use of on-chip resources to raise computational throughput.
  4. Winograd Convolution Algorithm: To handle the load imposed by the model's many convolutional layers, the authors apply the Winograd convolution algorithm, which significantly reduces the number of multiplications required, at the cost of careful management of memory overhead and numerical precision.
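The techniques in items 2-4 can be illustrated with minimal NumPy sketches. These are illustrative reference implementations only, not the paper's GPU kernels; the function names, the tanh-based GELU approximation, and the 1-D F(2,3) Winograd tile are our own choices for clarity.

```python
import numpy as np

def gelu(x):
    # Tanh approximation of GELU (Hendrycks & Gimpel, 2016). A fused kernel
    # would evaluate this elementwise in a single GPU dispatch.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def group_norm(x, num_groups, eps=1e-5):
    # Group Normalization over an NCHW tensor: channels are split into groups,
    # and each group is normalized by its own mean and variance. A fused kernel
    # performs the reduction and normalization without intermediate tensors.
    n, c, h, w = x.shape
    g = x.reshape(n, num_groups, c // num_groups, h, w)
    mean = g.mean(axis=(2, 3, 4), keepdims=True)
    var = g.var(axis=(2, 3, 4), keepdims=True)
    return ((g - mean) / np.sqrt(var + eps)).reshape(n, c, h, w)

def online_softmax(scores, block=4):
    # Numerically safe softmax computed one block at a time, carrying only a
    # running max and a running sum of rescaled exponentials -- the core trick
    # behind partially fused softmax and FlashAttention, which avoids
    # materializing the full attention matrix in slow memory.
    m = -np.inf   # running max
    s = 0.0       # running sum of exponentials, rescaled to the current max
    for i in range(0, len(scores), block):
        chunk = scores[i:i + block]
        m_new = max(m, chunk.max())
        s = s * np.exp(m - m_new) + np.exp(chunk - m_new).sum()
        m = m_new
    return np.exp(scores - m) / s

def winograd_f23(d, g):
    # Winograd F(2,3): two outputs of a 3-tap 1-D convolution (correlation)
    # from a 4-element input tile using 4 multiplications instead of 6.
    # 2-D kernels such as F(4x4, 3x3) extend the same idea.
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    m1 = (d0 - d2) * g0
    m2 = (d1 + d2) * (g0 + g1 + g2) / 2
    m3 = (d2 - d1) * (g0 - g1 + g2) / 2
    m4 = (d1 - d3) * g2
    return np.array([m1 + m2 + m3, m2 - m3 - m4])
```

The block-wise softmax matches the exact softmax because each partial sum is rescaled whenever a larger maximum is seen, and the Winograd tile reproduces direct convolution exactly in real arithmetic; the precision concerns the paper notes arise only in low-precision GPU formats.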

Experimental Results

Extensive evaluations are carried out on the Samsung S23 Ultra and the iPhone 14 Pro Max. The optimizations reduce end-to-end latency by 52.2% and 32.9% relative to the baseline on the two devices, respectively. Moreover, the experiments underscore the practicality of deploying large models like Stable Diffusion directly on consumer-grade hardware while preserving output quality.

Implications and Future Work

The ramifications of this study are considerable, primarily for the accessibility and user experience of generative AI applications running directly on mobile devices. Reduced reliance on server-based computation can improve privacy, lower server costs, and help applications built on these models scale. The optimization strategies proposed in the study make better use of existing mobile hardware and extend the potential for advanced AI implementations in consumer electronics.

Looking ahead, the techniques outlined might be expanded for broader application across other deep learning model architectures that face similar challenges concerning on-device deployment. As both hardware and algorithmic advancements continue to progress, the trajectory set by this work places greater emphasis on integrating AI more intimately with everyday devices, potentially standardizing high-performance model execution in environments previously considered resource-limited.

In conclusion, this research offers valuable insights and tangible methods to enhance the pace and feasibility of deploying complex generative models on mobile platforms, paving the way for a more direct and accessible interaction with advanced AI technologies in daily life. The methods detailed in the research hold promise for further research and development in refining and tailoring AI models to better suit a diverse range of computational environments.
