Neighborhood Attention Transformers
- Neighborhood Attention Transformers are transformer architectures that restrict self-attention to local neighborhoods, reducing computational complexity and enhancing scalability for high-resolution data.
- Variants such as Dilated, Block-Level, and Graph Neighborhood Attention adapt local receptive fields to balance precision and long-range context across vision, video, and graph domains.
- These models achieve significant speedups and efficiency gains—up to 2.7× acceleration and 46% end-to-end speedup—while preserving key inductive biases like translational equivariance.
Neighborhood Attention Transformers (NATs) denote a family of transformer architectures that replace or augment global self-attention with attention mechanisms locally restricted to neighboring tokens in spatial, temporal, or graph domains. By constraining each query token to attend only to its immediate or dilated neighborhood, these architectures drastically reduce memory and computational demands, enable a strong locality bias, and maintain scalability for high-resolution and long-sequence data. NATs have rapidly gained prominence across computer vision, video generation, graph learning, scientific computing, and medical imaging, and are embodied in designs such as standard Neighborhood Attention (NA), Dilated Neighborhood Attention (DiNA), Block-Level Neighborhood Attention (NABLA), generalized patterns (GNA), and multi-neighborhood or dual-masked graph transformers.
1. Core Mechanisms: From Self-Attention to Neighborhood Attention
Conventional dense self-attention computes attention scores between every pair of tokens, resulting in $\mathcal{O}(n^2 d)$ cost for $n$ tokens of dimension $d$. Neighborhood Attention (NA) restricts, for each query position $i$, the computation to a window $\rho_k(i)$ comprising its $k$ nearest spatial (or topological) neighbors. The attention output for token $i$ can be formally written as

$$\mathrm{NA}_k(i) = \operatorname{softmax}\!\left(\frac{Q_i K_{\rho_k(i)}^{\top} + B_{(i,\rho_k(i))}}{\sqrt{d}}\right) V_{\rho_k(i)},$$

where $Q_i K_{\rho_k(i)}^{\top} + B_{(i,\rho_k(i))}$ contains the logits over the restricted neighborhood, including a learned relative positional bias $B$, and $V_{\rho_k(i)}$ gathers the value vectors for those neighbors (Hassani et al., 2022).
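As a concrete illustration, the following is a minimal, naive PyTorch sketch of 1D neighborhood attention using explicit index gathering (no tiling, kernel fusion, or relative positional bias); the function name and boundary handling are illustrative assumptions rather than the NATTEN API.

```python
import torch
import torch.nn.functional as F

def neighborhood_attention_1d(q, k, v, window: int = 7):
    """Naive 1D neighborhood attention: each query attends to its `window`
    nearest tokens, with windows shifted inward at sequence boundaries.
    Shapes: q, k, v are (n, d). The learned relative bias B is omitted."""
    n, d = q.shape
    half = window // 2
    starts = torch.clamp(torch.arange(n) - half, 0, n - window)    # window start per query
    idx = starts[:, None] + torch.arange(window)[None, :]           # (n, window) neighbor indices
    k_nb, v_nb = k[idx], v[idx]                                     # (n, window, d) gathered keys/values
    logits = torch.einsum("nd,nwd->nw", q, k_nb) / d ** 0.5         # restricted attention logits
    attn = F.softmax(logits, dim=-1)
    return torch.einsum("nw,nwd->nd", attn, v_nb)                   # (n, d) outputs

# Usage: a 128-token sequence with 64-dimensional heads.
q, k, v = (torch.randn(128, 64) for _ in range(3))
out = neighborhood_attention_1d(q, k, v, window=7)                  # (128, 64)
```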
For two-dimensional data (e.g., images), the local window is typically a $k \times k$ region centered at pixel $(i, j)$, leading to a per-layer complexity of $\mathcal{O}(n k^2 d)$ in time and $\mathcal{O}(n k^2)$ in memory for the attention weights, where $n = hw$ is the number of tokens. This provides strict linear scaling in $n$ for fixed $k$, enabling NATs to surpass the scalability barrier of global attention for high-resolution vision tasks; for a $56 \times 56$ feature map ($n = 3136$) with $k = 7$, each layer evaluates roughly $nk^2 \approx 1.5 \times 10^5$ query–key pairs rather than $n^2 \approx 9.8 \times 10^6$.
Dilated Neighborhood Attention (DiNA) generalizes this locality by introducing a dilation factor $\delta \geq 1$, sampling the $k$ neighborhood points $\delta$ positions apart in each spatial direction, i.e., replacing $\rho_k(i)$ with a dilated neighborhood $\rho_k^{\delta}(i)$ whose members are spaced $\delta$ apart. As a consequence, DiNA achieves exponential receptive field growth with depth at constant computational cost (Hassani et al., 2022).
Generalized Neighborhood Attention (GNA) extends NA to accommodate sliding windows, strided windows, and block-wise partitions by introducing a stride parameter $s$, unifying a range of local attention patterns including those underlying Swin, block attention, and sliding window variants (Hassani et al., 23 Apr 2025).
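To make the unification concrete, the sketch below generates 1D key indices per query from a window size, dilation, and stride: stride 1 recovers sliding NA, dilation above 1 gives DiNA-style sampling, and stride equal to the window size yields non-overlapping, Swin-like blocks. The function name and exact boundary/centering conventions are assumptions for illustration, not the GNA reference implementation.

```python
import torch

def neighborhood_indices(n: int, window: int, dilation: int = 1, stride: int = 1):
    """Key indices per query for a 1D sequence of length n, shape (n, window).

    stride == 1        -> sliding neighborhood attention (NA)
    dilation > 1       -> dilated neighborhood attention (DiNA)
    stride == window   -> non-overlapping blocked windows (Swin-like)
    """
    span = (window - 1) * dilation + 1                     # spatial extent of one window
    anchor = (torch.arange(n) // stride) * stride           # queries in a stride group share a window
    center = anchor + stride // 2
    start = torch.clamp(center - span // 2, 0, max(n - span, 0))
    offsets = torch.arange(window) * dilation               # dilated offsets within the window
    return start[:, None] + offsets[None, :]

print(neighborhood_indices(16, window=4)[:3])               # sliding NA
print(neighborhood_indices(16, window=4, dilation=2)[:3])   # DiNA-style dilation
print(neighborhood_indices(16, window=4, stride=4)[:8])     # blocked, Swin-like windows
```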
2. Architectural Variants and Advanced Mechanisms
Neighborhood Attention Transformers have diversified into multiple advanced formulations:
- Dilated Neighborhood Attention Transformers (DiNAT/DiNAT-IR): Alternate standard NA and DiNA layers, leveraging dilations of increasing scale per stage to enable both local precision and long-range dependency modeling. Dual-branch blocks combine local (dilation $\delta = 1$) and global ($\delta > 1$) DiNA, often paired with channel-aware or global-context modules to compensate for the loss of full-image attention (Hassani et al., 2022, Liu et al., 23 Jul 2025, Manzari et al., 19 Feb 2025).
- Block-Level and Adaptive Neighborhood Attention (NABLA): The input sequence is partitioned into non-overlapping blocks; attention is first computed between block-averaged queries/keys, then sparsified via content-adaptive thresholding. Full, fine-grained attention is selectively applied only within block pairs exceeding a specified block-to-block attention mass. The resulting block-sparse attention masks deliver up to 2.7× acceleration in large-scale video transformers (Mikhailov et al., 17 Jul 2025); a minimal sketch follows this list.
- Generalized (Sliding, Strided, Blocked) Attention: GNA enables block-sparsity patterns that are optimally aligned with GPU memory tiling, maximizing hardware utilization and permitting end-to-end speedups of up to 46% on modern GPUs without quality loss (Hassani et al., 23 Apr 2025).
- Graph Neighborhood Attention: In graph domains, Multi-Neighborhood Attention constructs multiple parallel attention kernels using $k$-hop neighbor-aggregated features, with adaptive mixing weights that allow each node to assign importance to different neighborhood “radii” (Li et al., 2022). In DAM-GT, a dual positional encoding—topological and attribute-aware—augments neighborhood tokens, with masked self-attention enforcing direct information flow between each node and its local hops (2505.17660).
- Vicinity and KNN/Locality-Aware Attention: Token proximity is measured by an explicit metric distance (e.g., 2D Manhattan distance) and used to bias the attention scores (e.g., via cosine reweighting), facilitating decomposable linear attention with a strong locality prior (Sun et al., 2022, Koh et al., 18 Apr 2025).
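For the block-level mechanism described in the NABLA bullet above, a minimal PyTorch sketch of content-adaptive block selection (assumed details: mean-pooled block queries/keys, a softmax over block pairs, and retention of the fewest key blocks covering a target attention mass per query block; an illustrative approximation, not the reference NABLA code):

```python
import torch
import torch.nn.functional as F

def block_sparse_mask(q, k, block: int = 64, mass: float = 0.9):
    """NABLA-style sketch: coarse block-to-block attention, then keep, for each
    query block, the smallest set of key blocks whose attention mass >= `mass`.
    q, k: (n, d) with n divisible by `block`. Returns an (n_blocks, n_blocks) bool mask."""
    n, d = q.shape
    nb = n // block
    q_blk = q.view(nb, block, d).mean(dim=1)                 # block-averaged queries
    k_blk = k.view(nb, block, d).mean(dim=1)                 # block-averaged keys
    p = F.softmax(q_blk @ k_blk.T / d ** 0.5, dim=-1)         # coarse block attention
    vals, order = p.sort(dim=-1, descending=True)
    keep_sorted = vals.cumsum(dim=-1) - vals < mass           # keep blocks until mass is reached
    return torch.zeros_like(p, dtype=torch.bool).scatter_(-1, order, keep_sorted)

q, k = torch.randn(1024, 64), torch.randn(1024, 64)
blk_mask = block_sparse_mask(q, k, block=64, mass=0.8)                    # (16, 16)
token_mask = blk_mask.repeat_interleave(64, 0).repeat_interleave(64, 1)   # expanded to (1024, 1024)
```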
3. Algorithmic Details and Implementation Considerations
Neighborhood Attention and its derivatives rely on specialized implementation strategies for efficiency:
- Tiled and Fused Kernels: NA and DiNA benefit from tightly tiled GPU kernels (e.g., NATTEN) that maximize shared memory usage and minimize redundant memory access (Hassani et al., 2022, Hassani et al., 2022). Fused dot-product+softmax+gather implementations for DiNA maintain linear time and memory in $n$ for fixed window size $k$, with negligible constant-factor penalty for dilation (Manzari et al., 19 Feb 2025).
- Block Expansion and Masking: In NABLA, block-wise sparsity masks are expanded back to token-level attention masks before passing to efficient sparse or block-sparse attention operators (e.g., PyTorch FlashAttention/FlexAttention). Adaptive thresholds for block retention are content-driven, computed on the fly (Mikhailov et al., 17 Jul 2025).
- Token Reordering: For block-based or block-sparse attention patterns, a spatial fractal or Morton-order flattening is often used to align spatially proximate tokens into contiguous memory regions, harmonizing hardware and attention mask structure (Mikhailov et al., 17 Jul 2025, Hassani et al., 23 Apr 2025).
- Hardware-Aware Patterns: GNA-SIM simulation is used for hyperparameter tuning (window size, stride) to maximize FMHA kernel utilization, which is crucial for achieving promised speedups on specific hardware such as NVIDIA Blackwell GPUs (Hassani et al., 23 Apr 2025).
- KNN Patchifying: In unstructured domains and non-Euclidean spaces, as in the LA2Former, a K-nearest-neighbor procedure is used to define local patches dynamically, enabling effective local attention in point clouds and PDE meshes (Koh et al., 18 Apr 2025).
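As a minimal sketch of such KNN patchifying (assuming Euclidean distances and a simple cdist/topk formulation; the names and patch size are illustrative, not the LA2Former implementation):

```python
import torch

def knn_patches(points: torch.Tensor, patch_size: int = 16):
    """Group each point with its `patch_size` nearest neighbors (itself included).
    points: (n, c) coordinates of point-cloud points or PDE mesh nodes.
    Returns an (n, patch_size) index tensor defining a local attention patch per point."""
    dists = torch.cdist(points, points)                       # (n, n) pairwise Euclidean distances
    _, idx = dists.topk(patch_size, dim=-1, largest=False)    # nearest neighbors per point
    return idx

pts = torch.rand(2048, 3)                                     # e.g., an unstructured mesh
feats = torch.randn(2048, 64)
idx = knn_patches(pts, patch_size=16)                         # (2048, 16)
patch_feats = feats[idx]                                      # (2048, 16, 64) keys/values per query point
```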
4. Empirical Results and Performance Benchmarks
Neighborhood Attention mechanisms have given rise to state-of-the-art results across tasks and domains:
| Model/Domain | SOTA Metrics/Achievements | Efficiency Gains | Key Papers |
|---|---|---|---|
| NAT-Tiny | 83.2% ImageNet top-1; +1.9% over Swin-T | Up to 40% faster than Swin WSA | (Hassani et al., 2022) |
| DiNAT-L | 86.6% ImageNet-22K; 55.3 COCO box-AP | Matches/surpasses ConvNeXt/Swin-L | (Hassani et al., 2022) |
| NABLA (video) | 2.7× speedup, 92% sparsity | CLIP/VBench scores unchanged | (Mikhailov et al., 17 Jul 2025) |
| GNA (FMHA) | 28–46% e2e speedup; up to 1.3 PFLOP/s | Nearly 100% theoretical bound | (Hassani et al., 23 Apr 2025) |
| DiNAT-IR | 33.80 dB GoPro deblurring (SOTA restoration) | Competitive at 45.6 GFlops | (Liu et al., 23 Jul 2025) |
| MedViTV2 | +13.4% MedMNIST-C bACC over prior SOTA | 44% more efficient than prior | (Manzari et al., 19 Feb 2025) |
| LA2Former | 50% relative $L_2$ error reduction (PDEs) | 2–10× faster than dense | (Koh et al., 18 Apr 2025) |
| DAM-GT | Highest node classification on all 12 graphs | Mask: +0.5–1.2pp accuracy | (2505.17660) |
In vision, NATs have outperformed Swin- and ConvNeXt-based backbones on ImageNet, COCO, ADE20K, and Cityscapes at similar or lower parameter count and FLOPs. DiNAT delivers further improvements, particularly on dense segmentation and panoptic tasks, via global context recovered from dilated attention (Hassani et al., 2022).
NABLA shows that adaptive block-level sparsity can realize a 2.7× speedup in video diffusion transformers with negligible degradation in automatic and human-judged quality metrics (Mikhailov et al., 17 Jul 2025). GNA’s CUTLASS-based implementation enables aggressive sparsity with measured speedups that closely match simulator-predicted bounds; no retraining or fine-tuning is required to retain model quality (Hassani et al., 23 Apr 2025).
In scientific and medical domains, neighborhood and dilated neighborhood patterns have led to large increases in accuracy, robustness (e.g., to corruptions), and convergence speed (Manzari et al., 19 Feb 2025, Koh et al., 18 Apr 2025). In graph transformers, both multi-neighborhood and masked attention strategies adapt local receptive fields and information flow to graph structure, increasing classification accuracy on both homophilic and heterophilic networks (Li et al., 2022, 2505.17660).
5. Inductive Properties and Theoretical Implications
The localized attention in NATs introduces several key inductive biases:
- Translational Equivariance: Pixel-centric, per-position sliding windows in NA preserve translational equivariance, in contrast to block- or window-partitioned schemes (e.g., Swin WSA) which break this property at window boundaries (Hassani et al., 2022).
- Receptive Field Growth: Stacking $L$ NA or DiNA layers with window size $k$ and dilation $\delta$ yields receptive fields growing linearly as $L(k-1)+1$ (NA) or exponentially with depth (DiNA, geometric dilation schedule), without increasing per-layer cost (Hassani et al., 2022, Manzari et al., 19 Feb 2025); a worked example follows this list.
- Long-Range and Global Context: Alternating NA and DiNA layers or fusing with channel-aware/global modules restores global dependency tracking otherwise lost under strictly localized attention (Hassani et al., 2022, Liu et al., 23 Jul 2025).
- Structural Adaptivity: In graph domains, multi-neighborhood and dual-encoded mechanisms enable nodes to access both topological and feature/attribute-based neighborhoods, adaptively controlling the scale of aggregation (Li et al., 2022, 2505.17660).
- Block and Stride Patterns: By controlling stride and neighborhood definition, GNA enables trade-offs between strict locality (maximal translational equivariance, high sparsity) and compute density/hardware alignment (blockwise attention) (Hassani et al., 23 Apr 2025).
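As a worked example of the receptive-field bullet above (a simple 1D back-of-the-envelope calculation; the geometric dilation schedule is assumed for illustration and is not tied to any specific model configuration):

```python
def receptive_field(window: int, dilations) -> int:
    """1D receptive field after stacking one NA/DiNA layer per dilation value."""
    rf = 1
    for d in dilations:
        rf += (window - 1) * d              # each layer extends the field by (k - 1) * dilation
    return rf

k, L = 7, 4
print(receptive_field(k, [1] * L))          # NA:   1 + L*(k-1)           -> 25
print(receptive_field(k, [1, 2, 4, 8]))     # DiNA: geometric dilations   -> 91
```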
6. Limitations, Practical Guidelines, and Future Directions
While NATs address the key scalability limitations of dense attention, several important considerations and open questions remain:
- Sparsity-Quality Trade-off: Excessively high sparsity (e.g., overly aggressive block-mask thresholds in NABLA) risks omitting semantically critical long-range interactions (Mikhailov et al., 17 Jul 2025).
- Edge Artifacts: Block-level and windowed patterns may introduce block-edge artifacts, which can be suppressed by combining masks or integrating lightweight global priors (e.g., static STA) (Mikhailov et al., 17 Jul 2025).
- Implementation Tuning: Achieving optimal speedups requires aligning block sizes, strides, and window patterns with hardware tile shapes; permutation or fractal reordering is essential for FMHA kernels (Hassani et al., 23 Apr 2025, Mikhailov et al., 17 Jul 2025).
- Extensibility: NA/DiNA concepts transfer directly to temporal, audio, and cross-modal attention by redefining neighborhood relations. Further research is ongoing into adaptive per-head thresholding, learned sparsity masks, and dynamic neighborhood selection.
- Limitations in Non-Local Dependencies: In graph transformers like DAM-GT, strict masking suppresses higher-order neighborhood–neighborhood interactions, which could in principle be relaxed via soft masking (2505.17660).
- Adaptive Schedules: Hyperparameter choices such as window size ($k$), dilation ($\delta$), block size, and sparsity threshold ($\tau$) require empirical tuning. Intermediate scheduling during training (e.g., gradually ramping up the sparsity level) can enhance convergence and stability (Mikhailov et al., 17 Jul 2025); a hypothetical sketch follows.
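A hypothetical sketch of such a schedule (a linear warm-up of the target sparsity level over early training; the function and parameter names are illustrative and not taken from any cited implementation):

```python
def sparsity_at_step(step: int, warmup_steps: int = 2000,
                     start: float = 0.0, target: float = 0.9) -> float:
    """Linearly ramp the target attention sparsity from `start` to `target`,
    so early training sees (nearly) dense attention and sparsity grows gradually."""
    frac = min(step / max(warmup_steps, 1), 1.0)
    return start + frac * (target - start)

# Sparsity level at a few points during training:
print([round(sparsity_at_step(s), 2) for s in (0, 500, 1000, 2000, 10000)])
# [0.0, 0.23, 0.45, 0.9, 0.9]
```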
7. Related Research Directions and Broader Implications
Neighborhood Attention has stimulated cross-disciplinary architectures and inspired attention strategies based on spatial, metric, and algebraic neighborhood structure. Proliferating variants—multi-scale, block, dual-branch, adaptive masked—continue to close the performance gap to dense attention on long-range reasoning and generative quality. Techniques bridging local and global context (e.g., DiNA, NABLA, channel-aware fusion, KNN-lifted localities) have proven central for SOTA in vision, video, graph, and scientific domains (Mikhailov et al., 17 Jul 2025, Hassani et al., 2022, Li et al., 2022, 2505.17660).
As hardware and model scale evolve, the research ecosystem—open-source projects (e.g., NATTEN, GNA-SIM), standardized operator support (Flash/Flex Attention), and hardware-aware kernel design—enables deployment of Neighborhood Attention mechanisms in ever-broader contexts. Further development of adaptive, task-specific, and multi-headed neighborhood criteria—augmented by theoretical and empirical analyses of inductive biases—constitutes a frontier for transformer research at scale.