Generalized Neighborhood Attention: Multi-dimensional Sparse Attention at the Speed of Light (2504.16922v1)

Published 23 Apr 2025 in cs.CV, cs.AI, and cs.LG

Abstract: Many sparse attention mechanisms such as Neighborhood Attention have typically failed to consistently deliver speedup over the self attention baseline. This is largely due to the level of complexity in attention infrastructure, and the rapid evolution of AI hardware architecture. At the same time, many state-of-the-art foundational models, particularly in computer vision, are heavily bound by attention, and need reliable sparsity to escape the O(n^2) complexity. In this paper, we study a class of promising sparse attention mechanisms that focus on locality, and aim to develop a better analytical model of their performance improvements. We first introduce Generalized Neighborhood Attention (GNA), which can describe sliding window, strided sliding window, and blocked attention. We then consider possible design choices in implementing these approaches, and create a simulator that can provide much more realistic speedup upper bounds for any given setting. Finally, we implement GNA on top of a state-of-the-art fused multi-headed attention (FMHA) kernel designed for the NVIDIA Blackwell architecture in CUTLASS. Our implementation can fully realize the maximum speedup theoretically possible in many perfectly block-sparse cases, and achieves an effective utilization of 1.3 petaFLOPs/second in FP16. In addition, we plug various GNA configurations into off-the-shelf generative models, such as Cosmos-7B, HunyuanVideo, and FLUX, and show that it can deliver 28% to 46% end-to-end speedup on B200 without any fine-tuning. We will open source our simulator and Blackwell kernels directly through the NATTEN project.

Summary

  • The paper introduces GNA, a novel approach that extends neighborhood attention with a stride parameter to unify sliding, strided, and blocked patterns.
  • The paper presents NATTENSIM, an analytical tool that estimates realistic speedup upper bounds by accounting for multi-dimensional tiling and identifying configurations that minimize fine-grained masking.
  • The paper demonstrates 28% to 46% end-to-end speedups on B200 GPUs in generative models without fine-tuning, while maintaining comparable output quality.

This paper addresses the challenge that many sparse attention mechanisms, particularly locality-based ones like Neighborhood Attention (NA), often fail to deliver significant speedups over standard dense self-attention despite reducing FLOPs. This gap is attributed to implementation complexities and the rapid evolution of AI hardware. The problem is especially pronounced for multi-dimensional data like images and videos.

To tackle this, the authors introduce Generalized Neighborhood Attention (GNA), an extension of NA that adds a "stride" parameter.

  • GNA Definition: GNA controls how the attention window slides across tokens. A stride of 1 replicates standard NA (sliding window). A stride equal to the window size results in non-overlapping blocked attention (like Window Self Attention in Swin Transformers). Intermediate strides create strided sliding window patterns, where groups of adjacent query tokens share the same context window. This grouping increases the density of computation within processed blocks, aiming to improve hardware utilization. GNA unifies sliding window, strided sliding window, and blocked attention patterns.
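
The effect of the stride is easiest to see in one dimension. The sketch below is a minimal illustration of the pattern described above, using a simplified window-placement rule of my own rather than NATTEN's exact indexing:

```python
import numpy as np

def gna_mask_1d(n, window, stride):
    """Boolean (query x key) mask for a simplified 1D GNA pattern.

    Each group of `stride` consecutive queries shares one window of
    `window` keys, roughly centered on the group and clamped at the
    sequence boundaries.
      stride == 1          -> sliding-window neighborhood attention
      1 < stride < window  -> strided sliding window
      stride == window     -> non-overlapping blocked attention
    """
    mask = np.zeros((n, n), dtype=bool)
    for q in range(n):
        start = (q // stride) * stride + (stride - window) // 2
        start = max(0, min(start, n - window))   # clamp to the sequence
        mask[q, start:start + window] = True
    return mask

# 16 tokens, window of 4: the three regimes GNA unifies.
for s in (1, 2, 4):
    m = gna_mask_1d(16, window=4, stride=s)
    windows = {tuple(np.flatnonzero(row)) for row in m}
    print(f"stride={s}: every query attends {m.sum(axis=1).max()} keys, "
          f"{len(windows)} distinct windows")
```

Fewer distinct windows means more adjacent queries share identical key sets, which is what lets a tiled kernel process dense blocks without fine-grained masking.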

The paper identifies the "curse of multi-dimensionality" as a key challenge for sparse attention in vision tasks. Standard attention implementations often use 1D tiling, which, when applied to 2D or 3D token layouts common in vision, leads to significant "wasted compute" – FLOPs performed on tokens that are ultimately masked out due to the sparse pattern.
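
The waste is easy to quantify in a toy setting. The calculation below is my own illustration (the 32x32 grid, 8x8 window, and 128-token 1D KV tile are assumed, not figures from the paper): it counts how many key slots a purely 1D-tiled kernel must compute to cover one query's spatially local neighborhood on a row-major flattened grid.

```python
H = W = 32            # 2D token grid, flattened row-major for a 1D-tiled kernel
win = 8               # 8x8 neighborhood: 64 attended keys per query
T_kv = 128            # 1D KV tile size of a hypothetical FMHA kernel

# Flattened indices of the keys needed by the query at the grid center.
cy, cx = H // 2, W // 2
needed = {y * W + x
          for y in range(cy - win // 2, cy + win // 2)
          for x in range(cx - win // 2, cx + win // 2)}

# Any 1D KV tile containing at least one needed key must be computed in full;
# the unneeded positions inside it are masked out afterwards.
touched = {k // T_kv for k in needed}
computed = len(touched) * T_kv
print(f"needed keys: {len(needed)}, computed key slots: {computed}, "
      f"wasted compute: {1 - len(needed) / computed:.0%}")
```

A KV tile shaped to match the spatial layout (e.g., 8x16) would cover the same neighborhood with far fewer masked slots, which is the motivation for multi-dimensional tiling and the token permutation described later.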

To better analyze and optimize GNA configurations, the authors developed an analytical tool, NATTENSIM.

  • NATTENSIM: This simulator estimates the upper-bound speedup achievable by a GNA configuration. It considers implementation details like:
    • Query (Q) and Key/Value (KV) tile sizes (T_Q, T_KV) used in the underlying fused multi-head attention (FMHA) kernel.
    • Whether tiling is 1D or multi-dimensional.
    • Whether KV tiling is static or dynamic.
  • Given a GNA setup (window size, stride, dilation, and dimensionality), NATTENSIM calculates the number of KV tiles accessed per Q tile, yielding a more realistic speedup estimate than raw FLOP reduction. It helps identify "perfectly block-sparse" configurations, where the achievable speedup closely matches the theoretical FLOP reduction because fine-grained masking within tiles is minimized or eliminated; a toy 1D analogue of this estimate follows this list.
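
As a rough illustration of the kind of estimate such a simulator produces, the toy 1D model below (my simplification; the actual NATTENSIM handles multi-dimensional layouts, dilation, and dynamic tiling) counts visited KV tiles per Q tile under static tiling and checks whether any visited tile would require fine-grained masking.

```python
def simulate_1d(n, window, stride, t_q, t_kv):
    """Toy 1D analogue of a NATTENSIM-style estimate (a simplification,
    not the actual simulator). Counts, for each query (Q) tile, how many
    static KV tiles it must visit, and whether any visited tile requires
    fine-grained masking."""
    def window_start(q):
        s = (q // stride) * stride + (stride - window) // 2
        return max(0, min(s, n - window))        # same rule as the mask sketch above

    visited, needs_masking = 0, False
    for q0 in range(0, n, t_q):
        needed = set()
        for q in range(q0, min(q0 + t_q, n)):
            s = window_start(q)
            needed.update(range(s, s + window))
        tiles = {k // t_kv for k in needed}
        visited += len(tiles)
        for t in tiles:                          # perfectly block-sparse iff every
            if not set(range(t * t_kv, (t + 1) * t_kv)) <= needed:
                needs_masking = True             # ...visited tile is fully needed

    dense_tiles = (n // t_q) * (n // t_kv)       # tiles a dense kernel would visit
    return dense_tiles / visited, n / window, needs_masking

for stride in (1, 128):                          # sliding window vs. blocked
    ub, flop, masked = simulate_1d(n=4096, window=128, stride=stride,
                                   t_q=128, t_kv=128)
    print(f"stride={stride:3d}: tile-level upper bound {ub:5.2f}x, "
          f"FLOP reduction {flop:.0f}x, fine-grained masking needed: {masked}")
```

In this toy setting the sliding-window stride realizes only about a third of the FLOP-level reduction at the tile level, whereas the blocked configuration (stride = window = tile size) is perfectly block-sparse and matches it exactly.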

The authors implemented GNA, specifically targeting the NVIDIA Blackwell architecture, building upon a high-performance CUTLASS FMHA kernel.

  • Blackwell Implementation:
    • Uses token permutation (re-layout) outside the kernel to handle multi-dimensional token layouts, avoiding the complexity of fused multi-dimensional tiling inside the kernel itself; this requires static KV tiling (a minimal re-layout sketch follows this list).
    • The kernel is designed to minimize overhead, especially for perfectly block-sparse cases identified by NATTENSIM, by skipping the fine-grained masking logic when possible.
    • Achieves an effective utilization of up to 1.3 petaFLOPs/s in FP16.
    • Integrated into the NATTEN library for PyTorch.
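
The out-of-kernel re-layout can be pictured with the minimal sketch below (a hypothetical illustration, not NATTEN's actual permutation code): spatial tiles of a 2D token grid are made contiguous along the sequence dimension before calling a 1D-tiled FMHA kernel, and the permutation is inverted afterwards. Because the permutation is fixed ahead of time, KV tiling must be static.

```python
import torch

def tile_permute(x, tile_h, tile_w):
    """Re-layout a [B, H, W, D] token grid so that each (tile_h x tile_w)
    spatial tile becomes contiguous along the sequence dimension.
    Assumes H % tile_h == 0 and W % tile_w == 0."""
    B, H, W, D = x.shape
    x = x.reshape(B, H // tile_h, tile_h, W // tile_w, tile_w, D)
    x = x.permute(0, 1, 3, 2, 4, 5)               # gather each spatial tile together
    return x.reshape(B, H * W, D)                 # 1D sequence fed to the FMHA kernel

def tile_unpermute(x, H, W, tile_h, tile_w):
    """Inverse re-layout back to the [B, H, W, D] grid."""
    B, _, D = x.shape
    x = x.reshape(B, H // tile_h, W // tile_w, tile_h, tile_w, D)
    x = x.permute(0, 1, 3, 2, 4, 5)
    return x.reshape(B, H, W, D)

# Round-trip check on a toy layout; an attention call would sit in between.
x = torch.randn(2, 32, 48, 64)                    # [batch, H, W, head_dim]
seq = tile_permute(x, tile_h=8, tile_w=16)        # each 8x16 tile is now one contiguous chunk
assert torch.equal(tile_unpermute(seq, 32, 48, 8, 16), x)
```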

Experiments were conducted on large-scale generative models heavily reliant on self-attention: Cosmos-7B (World Model), HunyuanVideo (Video Generation), and FLUX (Image Generation @ 4K).

  • Results:
    • GNA achieved significant operation-level speedups, often approaching or matching the NATTENSIM analytical bounds, especially for perfectly block-sparse strides.
    • End-to-end speedups of 28% to 46% were demonstrated on B200 GPUs without any model fine-tuning by replacing self-attention with GNA (sometimes retaining dense self-attention for the initial diffusion steps to preserve quality). In some cases (e.g., HunyuanVideo at 91% sparsity), the measured speedup reached the theoretical maximum implied by the FLOP reduction (~2.23x); a back-of-the-envelope illustration of how attention-level gains translate end to end follows this list.
    • Qualitative and quantitative evaluations (VBench, MAN-IQA, QualiCLIP, GenEval) showed that GNA configurations, even with high sparsity and optimized strides, maintained comparable output quality to the original models using dense attention.
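
How an attention-level speedup carries over to end-to-end gains follows Amdahl's law: only the fraction of step time spent in attention is accelerated. The snippet below is a back-of-the-envelope illustration with assumed numbers; the runtime breakdown is not taken from the paper.

```python
# Illustrative Amdahl's-law arithmetic; the 60% attention share and the 11x
# attention-level speedup (from ~91% sparsity) are assumptions for the example,
# not figures reported in the paper.
def end_to_end_speedup(attn_fraction, attn_speedup):
    """Only the attention share of step time is accelerated."""
    return 1.0 / ((1.0 - attn_fraction) + attn_fraction / attn_speedup)

print(f"{end_to_end_speedup(attn_fraction=0.60, attn_speedup=11.0):.2f}x")  # ~2.2x
```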

In summary, the paper presents GNA as a flexible framework for local sparse attention, introduces NATTENSIM for performance analysis, and provides a highly optimized Blackwell implementation. It demonstrates that by carefully choosing the stride parameter (often guided by NATTENSIM) to maximize block-sparsity, GNA can deliver substantial speedups proportional to FLOP reduction in real-world generative models, overcoming previous limitations of sparse attention methods.
