
Faster Neighborhood Attention: Reducing the O(n^2) Cost of Self Attention at the Threadblock Level (2403.04690v3)

Published 7 Mar 2024 in cs.CV, cs.AI, and cs.LG

Abstract: Neighborhood attention reduces the cost of self attention by restricting each token's attention span to its nearest neighbors. This restriction, parameterized by a window size and dilation factor, draws a spectrum of possible attention patterns between linear projection and self attention. Neighborhood attention, and more generally sliding window attention patterns, have long been bounded by infrastructure, particularly in higher-rank spaces (2-D and 3-D), calling for the development of custom kernels, which have been limited in either functionality, or performance, if not both. In this work, we aim to massively improve upon existing infrastructure by providing two new methods for implementing neighborhood attention. We first show that neighborhood attention can be represented as a batched GEMM problem, similar to standard attention, and implement it for 1-D and 2-D neighborhood attention. These kernels on average provide 895% and 272% improvement in full precision runtime compared to existing naive CUDA kernels for 1-D and 2-D neighborhood attention respectively. We find that aside from being heavily bound by memory bandwidth, certain inherent inefficiencies exist in all unfused implementations of neighborhood attention, which in most cases undo their theoretical efficiency gain. Motivated by the progress made into fused dot-product attention kernels, we developed fused neighborhood attention; an adaptation of fused dot-product attention kernels that allow fine-grained control over attention across different spatial axes. Known for reducing the quadratic time complexity of self attention to a linear complexity, neighborhood attention can now enjoy a reduced and constant memory footprint, and record-breaking half precision runtime. We observe that our fused implementation successfully circumvents some of the unavoidable inefficiencies in unfused implementations...

Accelerating Neighborhood Attention via GEMM-based and Fused CUDA Kernels

Introduction

Neighborhood attention has emerged as a key technique for reducing the computational cost of self-attention in deep learning models, particularly in natural language processing and computer vision. It restricts each token's attention span to its nearest neighbors, reducing the complexity of attention from quadratic to linear in the number of tokens. Despite this theoretical efficiency, implementing neighborhood attention, especially in higher-rank (2-D and 3-D) spaces, has been challenging because it depends on custom CUDA kernels, and existing kernels have typically fallen short in performance, functionality, or both, limiting widespread adoption. Addressing this gap, the paper introduces two new methods for implementing neighborhood attention: GEMM-based and fused CUDA kernels. These approaches offer significant performance gains over existing implementations and extend the utility of neighborhood attention across modalities.

Motivation and Background

Neighborhood attention reduces the computational overhead of standard self-attention by restricting each token's attention to its nearest neighbors, parameterized by a window size and a dilation factor. While efficient in theory, its practical use has been held back by the limitations of existing CUDA kernels, particularly for 2-D and 3-D problems. The need for well-optimized custom kernels motivates the two methods developed in this work: a GEMM-based formulation and fused CUDA kernels, which together deliver a substantial leap in both performance and functionality.
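
To make the attention pattern concrete, below is a minimal, unoptimized PyTorch sketch of 1-D neighborhood attention with a window size and dilation factor. It is purely illustrative and not drawn from the paper's released code; the function name and the simplified boundary handling for dilation > 1 are assumptions.

```python
import torch

def neighborhood_attention_1d(q, k, v, window: int = 7, dilation: int = 1):
    """Naive 1-D neighborhood attention (illustrative sketch only).

    q, k, v: [batch, heads, length, head_dim]
    Each query attends to `window` keys spaced `dilation` apart, centered on it
    and shifted back inside the sequence near the boundaries. Assumes
    window * dilation <= length; boundary handling for dilation > 1 is simplified.
    """
    B, H, L, D = q.shape
    scale = D ** -0.5
    r = window // 2
    out = torch.empty_like(q)
    for i in range(L):
        # Center the dilated window on i, clamped so it stays inside [0, L).
        center = min(max(i, r * dilation), L - 1 - r * dilation)
        idx = torch.arange(center - r * dilation, center + r * dilation + 1, dilation)
        attn = torch.einsum("bhd,bhjd->bhj", q[:, :, i] * scale, k[:, :, idx])
        attn = attn.softmax(dim=-1)
        out[:, :, i] = torch.einsum("bhj,bhjd->bhd", attn, v[:, :, idx])
    return out
```

Calling this with q, k, v of shape [batch, heads, length, head_dim] reproduces the pattern described above; the per-query Python loop is exactly the kind of scattered, small-scale work that the custom kernels in this paper are designed to avoid.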

Methodological Innovations

GEMM-based Implementation

The paper first introduces a GEMM-based implementation of neighborhood attention. Expressing neighborhood attention as a batched GEMM problem, much like standard attention, allows it to reuse the efficiency of highly optimized GEMM kernels and addresses the primary shortcomings of earlier implementations. By mapping the many small GEMV problems intrinsic to neighborhood attention onto GEMM operations, the approach achieves better hardware utilization and reduced overhead, yielding significant latency improvements for both 1-D and 2-D neighborhood attention.
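
As a rough, tensor-level illustration of this unfused, batched-matmul view (not the paper's threadblock-level kernels), the sketch below computes the same quantity as the earlier sketch, but gathers each query's neighbor keys and values up front and then uses two batched matrix multiplications. The helper names and the gather-based formulation are assumptions for illustration.

```python
import torch

def neighborhood_indices_1d(length: int, window: int, dilation: int = 1):
    """Neighbor indices for every query position: shape [length, window].
    Assumes window * dilation <= length."""
    r = window // 2
    pos = torch.arange(length)
    center = pos.clamp(r * dilation, length - 1 - r * dilation)
    offsets = torch.arange(-r, r + 1) * dilation
    return center[:, None] + offsets[None, :]

def neighborhood_attention_1d_batched(q, k, v, window: int = 7, dilation: int = 1):
    """1-D neighborhood attention phrased as batched matmuls (unfused sketch)."""
    B, H, L, D = q.shape
    idx = neighborhood_indices_1d(L, window, dilation).to(q.device)  # [L, W]
    k_nbr = k[:, :, idx]                                             # [B, H, L, W, D]
    v_nbr = v[:, :, idx]                                             # [B, H, L, W, D]
    # Batched matmul 1: attention weights for every (query, neighbor) pair.
    attn = torch.einsum("bhld,bhlwd->bhlw", q * D ** -0.5, k_nbr).softmax(dim=-1)
    # Batched matmul 2: weighted sum of neighbor values.
    return torch.einsum("bhlw,bhlwd->bhld", attn, v_nbr)
```

Note that this formulation still materializes the [length, window] attention weights in memory, which is precisely the kind of unfused inefficiency the next section addresses.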

Fused CUDA Kernels

Motivated by the limitations of unfused (BMM-style) implementations, the paper introduces fused CUDA kernels for neighborhood attention. These kernels avoid storing attention weights in global memory, removing a major bottleneck of previous approaches. The fused approach yields a reduced memory footprint and better computational efficiency, with especially strong gains in half precision. The kernels' support for different spatial ranks and for features such as causal masking further broadens their utility.
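
The sketch below conveys the fused idea in PyTorch: stream the neighborhood's keys and values in tiles while maintaining an online softmax, so the attention weights are never written out as a full matrix. This mirrors the accumulation scheme used by fused dot-product attention kernels, but it is only a sketch under simplifying assumptions (1-D, no dilation, illustrative names such as kv_tile), not the paper's CUDA implementation.

```python
import torch

def fused_style_neighborhood_attention_1d(q, k, v, window: int = 7, kv_tile: int = 32):
    """Online-softmax sketch of fused 1-D neighborhood attention (no dilation).
    q, k, v: [batch, heads, length, head_dim]."""
    B, H, L, D = q.shape
    scale = D ** -0.5
    r = window // 2
    out = torch.zeros_like(q)
    for i in range(L):                                  # one query position at a time
        start = min(max(i - r, 0), L - window)          # clamp window inside the sequence
        m = q.new_full((B, H), float("-inf"))           # running row maximum
        l = q.new_zeros((B, H))                         # running softmax denominator
        acc = q.new_zeros((B, H, D))                    # running weighted-value accumulator
        for j0 in range(start, start + window, kv_tile):
            j1 = min(j0 + kv_tile, start + window)
            # Scores for this KV tile only; never stored beyond the tile.
            s = torch.einsum("bhd,bhjd->bhj", q[:, :, i] * scale, k[:, :, j0:j1])
            m_new = torch.maximum(m, s.amax(dim=-1))
            p = (s - m_new[..., None]).exp()
            correction = (m - m_new).exp()              # rescale previous partial sums
            l = l * correction + p.sum(dim=-1)
            acc = acc * correction[..., None] + torch.einsum("bhj,bhjd->bhd", p, v[:, :, j0:j1])
            m = m_new
        out[:, :, i] = acc / l[..., None]
    return out
```

In the actual kernels, the analogous running statistics and accumulators would live on-chip rather than in global memory, which is what gives the fused approach its reduced, constant memory footprint.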

Experimental Validation

The effectiveness of the proposed GEMM-based and fused CUDA kernels is empirically validated through comprehensive experiments. Benchmarks reveal that these new kernels can significantly outperform existing naive CUDA implementations in terms of latency. The fused kernels, in particular, exhibit superior performance across all tested scenarios, enhancing throughput by up to 97% in certain configurations. The applicability of these methodologies is further demonstrated through implementation in existing models like NAT and DiNAT, where notable improvements in throughput are observed without compromising model accuracy.

Implications and Future Directions

The introduction of GEMM-based and fused CUDA kernels for neighborhood attention holds profound implications for the future of attention-based models. By substantially reducing the computational cost and memory footprint, these methodologies pave the way for more efficient and scalable implementations of attention mechanisms. The observed improvements in throughput and latency not only enhance the performance of existing models but also broaden the horizon for the development of more complex and higher-dimensional attention-based architectures. Looking forward, extending these kernels to support backward passes and integrating additional features will be crucial in maximizing their utility across a wider array of deep learning applications.

Conclusion

The paper marks a significant advancement in the implementation of neighborhood attention mechanisms through the introduction of GEMM-based and fused CUDA kernels. These methodologies offer a robust solution to the limitations faced by previous implementations, providing a blend of enhanced performance, reduced memory requirements, and greater adaptability. As attention mechanisms continue to play a central role in deep learning, the innovations presented in this paper are poised to significantly contribute to their evolution and expanded application.

References (22)
  1. Vivit: A video vision transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021.
  2. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020.
  3. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019.
  4. Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691, 2023.
  5. Flashattention: Fast and memory-efficient exact attention with io-awareness. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
  6. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (ICLR), 2020.
  7. Dilated neighborhood attention transformer. arXiv preprint arXiv:2209.15001, 2022.
  8. Neighborhood attention transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
  9. Deep residual learning for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  10. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.
  11. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2012.
  12. Online normalizer calculation for softmax. arXiv preprint arXiv:1805.02867, 2018.
  13. Image transformer. In International Conference on Machine Learning (ICML), 2018.
  14. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
  15. Scalable diffusion models with transformers. arXiv preprint arXiv:2212.09748, 2022.
  16. Self-attention does not need O(n^2) memory. arXiv preprint arXiv:2112.05682, 2021.
  17. Stand-alone self-attention in vision models. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
  18. Cutlass, 2023.
  19. Scaling local self-attention for parameter efficient visual backbones. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
  20. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS), 2017.
  21. Stylenat: Giving each head a new perspective. arXiv preprint arXiv:2211.05770, 2022.
  22. Big bird: Transformers for longer sequences. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
Authors (3)
  1. Ali Hassani (17 papers)
  2. Humphrey Shi (97 papers)
  3. Wen-mei Hwu (62 papers)