
Faster Neighborhood Attention: Reducing the O(n^2) Cost of Self Attention at the Threadblock Level (2403.04690v3)

Published 7 Mar 2024 in cs.CV, cs.AI, and cs.LG

Abstract: Neighborhood attention reduces the cost of self attention by restricting each token's attention span to its nearest neighbors. This restriction, parameterized by a window size and dilation factor, draws a spectrum of possible attention patterns between linear projection and self attention. Neighborhood attention, and more generally sliding window attention patterns, have long been bounded by infrastructure, particularly in higher-rank spaces (2-D and 3-D), calling for the development of custom kernels, which have been limited in either functionality, or performance, if not both. In this work, we aim to massively improve upon existing infrastructure by providing two new methods for implementing neighborhood attention. We first show that neighborhood attention can be represented as a batched GEMM problem, similar to standard attention, and implement it for 1-D and 2-D neighborhood attention. These kernels on average provide 895% and 272% improvement in full precision runtime compared to existing naive CUDA kernels for 1-D and 2-D neighborhood attention respectively. We find that aside from being heavily bound by memory bandwidth, certain inherent inefficiencies exist in all unfused implementations of neighborhood attention, which in most cases undo their theoretical efficiency gain. Motivated by the progress made into fused dot-product attention kernels, we developed fused neighborhood attention; an adaptation of fused dot-product attention kernels that allow fine-grained control over attention across different spatial axes. Known for reducing the quadratic time complexity of self attention to a linear complexity, neighborhood attention can now enjoy a reduced and constant memory footprint, and record-breaking half precision runtime. We observe that our fused implementation successfully circumvents some of the unavoidable inefficiencies in unfused implementations...

Accelerating Neighborhood Attention via GEMM-based and Fused CUDA Kernels

Introduction

Neighborhood attention has emerged as a key technique for reducing the computational cost of self-attention in deep learning models, particularly in natural language processing and computer vision. It restricts each token's attention span to its nearest neighbors, reducing the complexity of attention from quadratic to linear in the number of tokens. Despite this theoretical efficiency, implementing neighborhood attention, especially in higher-rank (2-D and 3-D) spaces, has been challenging because it depends on custom CUDA kernels, and existing kernels have typically fallen short in performance, functionality, or both, limiting widespread adoption. Addressing this gap, the paper introduces two new methods for implementing neighborhood attention: GEMM-based and fused CUDA kernels. These approaches offer significant performance gains over existing implementations and extend the utility of neighborhood attention across modalities.

Motivation and Background

Neighborhood attention reduces the computational overhead of standard self-attention by restricting each token's attention to its nearest neighbors, parameterized by a window size and a dilation factor. While efficient in theory, its practical use has been held back by the limitations of existing CUDA kernels, particularly for 2-D and 3-D problems. The need for well-optimized custom kernels motivates the two methods developed in this work: a GEMM-based formulation and fused CUDA kernels, which together deliver a substantial leap in both performance and functionality.
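
To make the attention pattern concrete, below is a minimal, unoptimized PyTorch sketch of 1-D neighborhood attention with a window size and dilation factor. It is purely illustrative and not drawn from the paper's released code; the function name and the simplified boundary handling for dilation > 1 are assumptions.

```python
import torch

def neighborhood_attention_1d(q, k, v, window: int = 7, dilation: int = 1):
    """Naive 1-D neighborhood attention (illustrative sketch only).

    q, k, v: [batch, heads, length, head_dim]
    Each query attends to `window` keys spaced `dilation` apart, centered on it
    and shifted back inside the sequence near the boundaries. Assumes
    window * dilation <= length; boundary handling for dilation > 1 is simplified.
    """
    B, H, L, D = q.shape
    scale = D ** -0.5
    r = window // 2
    out = torch.empty_like(q)
    for i in range(L):
        # Center the dilated window on i, clamped so it stays inside [0, L).
        center = min(max(i, r * dilation), L - 1 - r * dilation)
        idx = torch.arange(center - r * dilation, center + r * dilation + 1, dilation)
        attn = torch.einsum("bhd,bhjd->bhj", q[:, :, i] * scale, k[:, :, idx])
        attn = attn.softmax(dim=-1)
        out[:, :, i] = torch.einsum("bhj,bhjd->bhd", attn, v[:, :, idx])
    return out
```

Calling this with q, k, v of shape [batch, heads, length, head_dim] reproduces the pattern described above; the per-query Python loop is exactly the kind of scattered, small-scale work that the custom kernels in this paper are designed to avoid.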

Methodological Innovations

GEMM-based Implementation

The paper first introduces a GEMM-based implementation of neighborhood attention. Expressing neighborhood attention as a batched GEMM problem, much like standard attention, allows it to reuse the efficiency of highly optimized GEMM kernels and addresses the primary shortcomings of earlier implementations. By mapping the many small GEMV problems intrinsic to neighborhood attention onto GEMM operations, the approach achieves better hardware utilization and reduced overhead, yielding significant latency improvements for both 1-D and 2-D neighborhood attention.
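
As a rough, tensor-level illustration of this unfused, batched-matmul view (not the paper's threadblock-level kernels), the sketch below computes the same quantity as the earlier sketch, but gathers each query's neighbor keys and values up front and then uses two batched matrix multiplications. The helper names and the gather-based formulation are assumptions for illustration.

```python
import torch

def neighborhood_indices_1d(length: int, window: int, dilation: int = 1):
    """Neighbor indices for every query position: shape [length, window].
    Assumes window * dilation <= length."""
    r = window // 2
    pos = torch.arange(length)
    center = pos.clamp(r * dilation, length - 1 - r * dilation)
    offsets = torch.arange(-r, r + 1) * dilation
    return center[:, None] + offsets[None, :]

def neighborhood_attention_1d_batched(q, k, v, window: int = 7, dilation: int = 1):
    """1-D neighborhood attention phrased as batched matmuls (unfused sketch)."""
    B, H, L, D = q.shape
    idx = neighborhood_indices_1d(L, window, dilation).to(q.device)  # [L, W]
    k_nbr = k[:, :, idx]                                             # [B, H, L, W, D]
    v_nbr = v[:, :, idx]                                             # [B, H, L, W, D]
    # Batched matmul 1: attention weights for every (query, neighbor) pair.
    attn = torch.einsum("bhld,bhlwd->bhlw", q * D ** -0.5, k_nbr).softmax(dim=-1)
    # Batched matmul 2: weighted sum of neighbor values.
    return torch.einsum("bhlw,bhlwd->bhld", attn, v_nbr)
```

Note that this formulation still materializes the [length, window] attention weights in memory, which is precisely the kind of unfused inefficiency the next section addresses.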

Fused CUDA Kernels

Motivated by the limitations of unfused (BMM-style) implementations, the paper introduces fused CUDA kernels for neighborhood attention. These kernels avoid storing attention weights in global memory, removing a major bottleneck of previous approaches. The fused approach yields a reduced memory footprint and better computational efficiency, with especially strong gains in half precision. The kernels' support for different spatial ranks and for features such as causal masking further broadens their utility.
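
The sketch below conveys the fused idea in PyTorch: stream the neighborhood's keys and values in tiles while maintaining an online softmax, so the attention weights are never written out as a full matrix. This mirrors the accumulation scheme used by fused dot-product attention kernels, but it is only a sketch under simplifying assumptions (1-D, no dilation, illustrative names such as kv_tile), not the paper's CUDA implementation.

```python
import torch

def fused_style_neighborhood_attention_1d(q, k, v, window: int = 7, kv_tile: int = 32):
    """Online-softmax sketch of fused 1-D neighborhood attention (no dilation).
    q, k, v: [batch, heads, length, head_dim]."""
    B, H, L, D = q.shape
    scale = D ** -0.5
    r = window // 2
    out = torch.zeros_like(q)
    for i in range(L):                                  # one query position at a time
        start = min(max(i - r, 0), L - window)          # clamp window inside the sequence
        m = q.new_full((B, H), float("-inf"))           # running row maximum
        l = q.new_zeros((B, H))                         # running softmax denominator
        acc = q.new_zeros((B, H, D))                    # running weighted-value accumulator
        for j0 in range(start, start + window, kv_tile):
            j1 = min(j0 + kv_tile, start + window)
            # Scores for this KV tile only; never stored beyond the tile.
            s = torch.einsum("bhd,bhjd->bhj", q[:, :, i] * scale, k[:, :, j0:j1])
            m_new = torch.maximum(m, s.amax(dim=-1))
            p = (s - m_new[..., None]).exp()
            correction = (m - m_new).exp()              # rescale previous partial sums
            l = l * correction + p.sum(dim=-1)
            acc = acc * correction[..., None] + torch.einsum("bhj,bhjd->bhd", p, v[:, :, j0:j1])
            m = m_new
        out[:, :, i] = acc / l[..., None]
    return out
```

In the actual kernels, the analogous running statistics and accumulators would live on-chip rather than in global memory, which is what gives the fused approach its reduced, constant memory footprint.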

Experimental Validation

The effectiveness of the proposed GEMM-based and fused CUDA kernels is empirically validated through comprehensive experiments. Benchmarks reveal that these new kernels can significantly outperform existing naive CUDA implementations in terms of latency. The fused kernels, in particular, exhibit superior performance across all tested scenarios, enhancing throughput by up to 97% in certain configurations. The applicability of these methodologies is further demonstrated through implementation in existing models like NAT and DiNAT, where notable improvements in throughput are observed without compromising model accuracy.

Implications and Future Directions

The introduction of GEMM-based and fused CUDA kernels for neighborhood attention holds profound implications for the future of attention-based models. By substantially reducing the computational cost and memory footprint, these methodologies pave the way for more efficient and scalable implementations of attention mechanisms. The observed improvements in throughput and latency not only enhance the performance of existing models but also broaden the horizon for the development of more complex and higher-dimensional attention-based architectures. Looking forward, extending these kernels to support backward passes and integrating additional features will be crucial in maximizing their utility across a wider array of deep learning applications.

Conclusion

The paper marks a significant advancement in the implementation of neighborhood attention mechanisms through the introduction of GEMM-based and fused CUDA kernels. These methodologies offer a robust solution to the limitations faced by previous implementations, providing a blend of enhanced performance, reduced memory requirements, and greater adaptability. As attention mechanisms continue to play a central role in deep learning, the innovations presented in this paper are poised to significantly contribute to their evolution and expanded application.

References (22)
  1. Vivit: A video vision transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021.
  2. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020.
  3. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019.
  4. Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691, 2023.
  5. Flashattention: Fast and memory-efficient exact attention with io-awareness. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
  6. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (ICLR), 2020.
  7. Dilated neighborhood attention transformer. arXiv preprint arXiv:2209.15001, 2022.
  8. Neighborhood attention transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
  9. Deep residual learning for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  10. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.
  11. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2012.
  12. Online normalizer calculation for softmax. arXiv preprint arXiv:1805.02867, 2018.
  13. Image transformer. In International Conference on Machine Learning (ICML), 2018.
  14. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
  15. Scalable diffusion models with transformers. arXiv preprint arXiv:2212.09748, 2022.
  16. Self-attention does not need O(n^2) memory. arXiv preprint arXiv:2112.05682, 2021.
  17. Stand-alone self-attention in vision models. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
  18. Cutlass, 2023.
  19. Scaling local self-attention for parameter efficient visual backbones. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
  20. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS), 2017.
  21. Stylenat: Giving each head a new perspective. arXiv preprint arXiv:2211.05770, 2022.
  22. Big bird: Transformers for longer sequences. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
Authors (3)
  1. Ali Hassani (17 papers)
  2. Humphrey Shi (97 papers)
  3. Wen-mei Hwu (62 papers)