- The paper introduces GNA, a novel approach that extends neighborhood attention with a stride parameter to unify sliding, strided, and blocked patterns.
- The paper presents NATTENSIM, an analytical tool that estimates realistic speedups by accounting for multi-dimensional tiling and minimizing fine-grained masking.
- The paper demonstrates up to 46% end-to-end speedups on GPUs in generative models without fine-tuning, while maintaining comparable output quality.
This paper addresses the challenge that many sparse attention mechanisms, particularly locality-based ones like Neighborhood Attention (NA), often fail to deliver significant speedups over standard dense self-attention despite reducing FLOPs. This gap is attributed to implementation complexities and the rapid evolution of AI hardware. The problem is especially pronounced for multi-dimensional data like images and videos.
To tackle this, the authors introduce Generalized Neighborhood Attention (GNA), an extension of NA that adds a "stride" parameter.
- GNA Definition: GNA controls how the attention window slides across tokens. A stride of 1 replicates standard NA (sliding window). A stride equal to the window size results in non-overlapping blocked attention (like Window Self Attention in Swin Transformers). Intermediate strides create strided sliding window patterns, where groups of adjacent query tokens share the same context window. This grouping increases the density of computation within processed blocks, aiming to improve hardware utilization. GNA unifies sliding window, strided sliding window, and blocked attention patterns.
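To make the stride semantics concrete, here is a minimal 1D sketch in Python. The function `gna_window_1d` and its group-centering rule are illustrative assumptions, not the paper's or NATTEN's exact formulation:

```python
def gna_window_1d(i, seq_len, window, stride):
    """Return the [start, end) KV range attended by query i in a 1D layout.

    stride == 1         -> standard neighborhood attention (sliding window)
    stride == window    -> non-overlapping blocked attention (Swin-style WSA)
    1 < stride < window -> strided sliding window: each group of `stride`
                           adjacent queries shares one context window.
    """
    assert 1 <= stride <= window <= seq_len
    leader = (i // stride) * stride           # first query of the group
    center = leader + (stride - 1) // 2       # anchor the window on the group
    start = min(max(center - (window - 1) // 2, 0), seq_len - window)
    return start, start + window

# Example: seq_len=8, window=4, stride=2 -> adjacent query pairs share windows.
print([gna_window_1d(i, 8, 4, 2) for i in range(8)])
# [(0, 4), (0, 4), (1, 5), (1, 5), (3, 7), (3, 7), (4, 8), (4, 8)]
```

With stride 2, each pair of adjacent queries gets an identical (start, end) window; this shared context is exactly the grouping that raises compute density inside processed blocks.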
The paper identifies the "curse of multi-dimensionality" as a key challenge for sparse attention in vision tasks. Standard attention implementations often use 1D tiling, which, when applied to 2D or 3D token layouts common in vision, leads to significant "wasted compute" – FLOPs performed on tokens that are ultimately masked out due to the sparse pattern.
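To see why 1D tiling over multi-dimensional data wastes work, here is a brute-force illustration (not the paper's tooling): it flattens a 2D grid row-major, tiles the flat sequence, and counts how many (query, key) pairs inside processed tiles fall outside the 2D neighborhood and would be masked out.

```python
import itertools

def wasted_compute_fraction(h, w, window, tile_q, tile_kv):
    """Brute-force estimate of wasted FLOPs under 1D tiling (illustrative only).

    Tokens of an h x w grid are flattened row-major, and attention is computed
    with 1D Q/KV tiles of size tile_q x tile_kv. A KV tile must be processed
    for a Q tile if ANY of its queries attends to ANY of its keys; every
    in-tile pair outside the 2D neighborhood is wasted (masked) compute.
    """
    n = h * w

    def attends(q, k):
        qy, qx = divmod(q, w)
        ky, kx = divmod(k, w)
        # 2D neighborhood: centered window per axis, clamped at the borders.
        cy = min(max(qy - window // 2, 0), h - window)
        cx = min(max(qx - window // 2, 0), w - window)
        return cy <= ky < cy + window and cx <= kx < cx + window

    processed = useful = 0
    for q0 in range(0, n, tile_q):
        for k0 in range(0, n, tile_kv):
            qs = range(q0, min(q0 + tile_q, n))
            ks = range(k0, min(k0 + tile_kv, n))
            pairs = sum(1 for q, k in itertools.product(qs, ks) if attends(q, k))
            if pairs:                      # tile cannot be skipped entirely
                processed += len(qs) * len(ks)
                useful += pairs
    return 1.0 - useful / processed

print(wasted_compute_fraction(h=16, w=16, window=4, tile_q=64, tile_kv=64))
```

Because row-major flattening scatters a query's 2D neighbors across the flat sequence, most 1D KV tiles that must be visited are only partially useful; the printed value is the share of in-tile FLOPs spent on masked pairs.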
To better analyze and optimize GNA configurations, the authors developed an analytical tool, NATTENSIM.
- NATTENSIM: This simulator estimates the upper-bound speedup achievable by a GNA configuration. It considers implementation details like:
- Query (Q) and Key/Value (KV) tile sizes (T_Q, T_KV) used in the underlying fused multi-head attention (FMHA) kernel.
- Whether tiling is 1D or multi-dimensional.
- Whether KV tiling is static or dynamic.
NATTENSIM calculates the number of KV tiles accessed per Q tile for a given GNA setup (window size, stride, dilation, dimensions), providing a more realistic speedup estimate than raw FLOP reduction. It helps identify "perfectly block-sparse" configurations, where the speedup can closely match the theoretical FLOP reduction because fine-grained masking within tiles is minimized or eliminated.
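The following toy is in the spirit of what NATTENSIM computes, but far simpler: 1D only, static KV tiling, no dilation. It reuses the hypothetical `gna_window_1d` sketch from above to count KV tiles touched per Q tile and turn that into an upper-bound speedup estimate.

```python
def estimate_speedup_1d(seq_len, window, stride, tile_q, tile_kv):
    """Upper-bound speedup = dense KV-tile visits / KV tiles actually touched.

    A KV tile must be visited by a Q tile if any of its queries attends to any
    of its keys; KV tiles containing only masked pairs can be skipped outright.
    """
    num_q_tiles = -(-seq_len // tile_q)        # ceiling division
    num_kv_tiles = -(-seq_len // tile_kv)
    touched = 0
    for qt in range(num_q_tiles):
        kv_ids = set()
        for i in range(qt * tile_q, min((qt + 1) * tile_q, seq_len)):
            start, end = gna_window_1d(i, seq_len, window, stride)
            kv_ids.update(range(start // tile_kv, (end - 1) // tile_kv + 1))
        touched += len(kv_ids)
    return (num_q_tiles * num_kv_tiles) / touched

# Window and stride aligned with the tile sizes -> perfectly block-sparse:
# the estimate (8x) equals the FLOP reduction (each query attends to 1/8 of keys).
print(estimate_speedup_1d(seq_len=4096, window=512, stride=512, tile_q=128, tile_kv=128))
```

With the same window but stride 1, each Q tile touches several partially used KV tiles and the estimate drops well below the 8x FLOP reduction, which is precisely the gap the analytical tool is meant to expose.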
The authors implemented GNA, specifically targeting the NVIDIA Blackwell architecture, building upon a high-performance CUTLASS FMHA kernel.
- Blackwell Implementation:
- Uses token permutation (re-layout) outside the kernel to handle multi-dimensional token layouts, avoiding the complexity of fused multi-dimensional tiling inside the kernel itself; this requires static KV tiling (see the layout sketch after this list).
- The kernel is designed to minimize overhead, especially for perfectly block-sparse cases identified by NATTENSIM, by skipping the fine-grained masking logic when possible.
- Achieves high utilization, with a reported effective throughput of up to 1.3 PFLOPs/s in FP16.
- Integrated into the NATTEN library for PyTorch.
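Below is a minimal PyTorch sketch of the kind of out-of-kernel token permutation referred to in the list above; the helper names and exact layout are assumptions for illustration, not NATTEN's actual API.

```python
import torch

def tile_permute_2d(x, th, tw):
    """Re-layout (B, H, W, C) tokens so each th x tw spatial tile is contiguous
    along the flattened sequence dimension (illustrative sketch; the paper's
    kernel-side layout may differ). Requires H % th == 0 and W % tw == 0
    (static tiling)."""
    B, H, W, C = x.shape
    assert H % th == 0 and W % tw == 0
    x = x.view(B, H // th, th, W // tw, tw, C)
    x = x.permute(0, 1, 3, 2, 4, 5).contiguous()   # gather each tile together
    return x.view(B, H * W, C)

def tile_unpermute_2d(x, H, W, th, tw):
    """Inverse of tile_permute_2d: restore the original (B, H, W, C) layout."""
    B, N, C = x.shape
    x = x.view(B, H // th, W // tw, th, tw, C)
    x = x.permute(0, 1, 3, 2, 4, 5).contiguous()
    return x.view(B, H, W, C)
```

After such a permutation, each 1D Q/KV tile of size th * tw in the fused kernel corresponds to a rectangular th x tw spatial tile, so static KV tiling can respect the multi-dimensional window without fusing multi-dimensional tiling into the kernel.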
Experiments were conducted on large-scale generative models heavily reliant on self-attention: Cosmos-7B (World Model), HunyuanVideo (Video Generation), and FLUX (Image Generation @ 4K).
- Results:
- GNA achieved significant operation-level speedups, often approaching or matching the NATTENSIM analytical bounds, especially for perfectly block-sparse strides.
- End-to-end speedups of 28% to 46% were demonstrated on B200 GPUs without model fine-tuning, by replacing self-attention with GNA (sometimes retaining dense self-attention for the initial diffusion steps to preserve quality). In some cases (e.g., HunyuanVideo at 91% sparsity), the speedup reached the theoretical maximum based on FLOP reduction (e.g., ~2.23x); see the sketch after this list for how op-level and end-to-end speedups relate.
- Qualitative and quantitative evaluations (VBench, MAN-IQA, QualiCLIP, GenEval) showed that GNA configurations, even with high sparsity and optimized strides, maintained comparable output quality to the original models using dense attention.
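As referenced above, the translation from an op-level speedup to an end-to-end one is essentially Amdahl's law. The sketch below shows only the arithmetic; the attention share of runtime is not stated in this summary, so the numbers are made up and purely illustrative.

```python
def end_to_end_speedup(attn_fraction, attn_speedup):
    """Amdahl-style estimate: attention takes `attn_fraction` of total runtime
    and is accelerated by `attn_speedup`; everything else is unchanged."""
    return 1.0 / ((1.0 - attn_fraction) + attn_fraction / attn_speedup)

# Hypothetical values: if attention were 60% of runtime and ~91% sparsity gave
# roughly an 11x op-level speedup, the end-to-end gain would be about 2.2x.
print(round(end_to_end_speedup(attn_fraction=0.6, attn_speedup=11.0), 2))
```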
In summary, the paper presents GNA as a flexible framework for local sparse attention, introduces NATTENSIM for performance analysis, and provides a highly optimized Blackwell implementation. It demonstrates that by carefully choosing the stride parameter (often guided by NATTENSIM) to maximize block-sparsity, GNA can deliver substantial speedups proportional to the FLOP reduction in real-world generative models, overcoming previous limitations of sparse attention methods.