Neighborhood Attention Mechanisms

Updated 7 May 2026

Neighborhood Attention is a localized attention mechanism that restricts computations to fixed neighboring tokens based on spatial, geometric, or topological proximity.
By replacing global interactions with local ones, it achieves near-linear complexity and hardware-efficient scaling while preserving translation equivariance.
The approach is applied in diverse fields such as computer vision, 3D point cloud processing, graphs, and medical imaging through variants like dilated, spherical, and block-sparse attention.

Neighborhood Attention is a class of localized, efficient attention mechanisms for deep learning, originally introduced to address the computational bottlenecks and inductive bias issues of global self-attention, especially in high-dimensional visual, sequential, and graph-structured data. In Neighborhood Attention, each query token attends only to a fixed, typically small, set of neighboring tokens defined by spatial, geometric, or topological proximity, rather than to all tokens as in conventional self-attention. This locality enables near-linear complexity in both time and memory, allows for sliding-window equivariance, and admits scalable hardware-efficient implementations across multiple domains, including images, video, 3D point clouds, spherical data, and graphs. Various extensions such as dilation, adaptive block structures, and spectral neighborhood definitions further expand the applicability and effectiveness of neighborhood attention models (arXiv:2209.15001, Hassani et al., 2022, Hassani et al., 2024, Hassani et al., 23 Apr 2025, Bonev et al., 16 May 2025, Xue et al., 2022, Mehta et al., 2023, Manzari et al., 19 Feb 2025, Lu et al., 2024, Song et al., 2020, Kefato et al., 2020).

1. Mathematical Foundations and Formulations

Neighborhood Attention restricts the set of key-value pairs each query interacts with to a localized neighborhood, commonly parameterized by window size, stride, dilation, and geometric distance.

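Standard 2D NA for Vision: For pixel $i$ in a grid, attention weights are computed only over a $K \times K$ window around $i$ (with $K$ odd), i.e. over $K^2$ neighbors, yielding:

$A_i^K = [ Q_i \cdot K_{\rho_1(i)} + B(i,\rho_1(i)), \ldots, Q_i \cdot K_{\rho_{K^2}(i)} + B(i,\rho_{K^2}(i)) ]$

$\text{Output at } i: \quad \operatorname{NA}_K(i) = \operatorname{softmax}(A_i^K / \sqrt{d}) V_i^K$

where a subscripted $K_j$ denotes the key vector of token $j$ (as opposed to the window size $K$), $d$ is the per-head embedding dimension, $V_i^K$ stacks the value vectors of the same neighbors, $B(i,j)$ is a learned relative positional bias, and $\rho_j(i)$ denotes the $j$-th of the $K^2$ neighbors of $i$ (Hassani et al., 2022, Hassani et al., 2022).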
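The windowed computation above can be sketched in plain Python for a single head. This is a minimal illustration, not the reference implementation: the function names are hypothetical, the positional bias $B$ is taken to be zero, and the window is assumed to shift inward at the image border so that every pixel still attends to a full $K \times K$ window.

```python
import math

def softmax(xs):
    m = max(xs)                                  # subtract the max for stability
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def window_start(i, size, k):
    # Centre the window on i, then shift it inward at the borders so that
    # every token still sees exactly k positions along this axis (assumption).
    return min(max(i - k // 2, 0), size - k)

def neighborhood_attention_2d(q, key, v, k):
    """q, key, v: H x W grids of d-dimensional vectors (nested lists).
    k: odd window size per axis. Positional bias B is omitted (assumed zero)."""
    H, W, d = len(q), len(q[0]), len(q[0][0])
    assert k % 2 == 1 and k <= min(H, W)
    out = [[None] * W for _ in range(H)]
    for y in range(H):
        for x in range(W):
            y0, x0 = window_start(y, H, k), window_start(x, W, k)
            nbrs = [(yy, xx) for yy in range(y0, y0 + k)
                             for xx in range(x0, x0 + k)]      # k * k neighbors
            scores = [sum(a * b for a, b in zip(q[y][x], key[yy][xx])) / math.sqrt(d)
                      for yy, xx in nbrs]
            w = softmax(scores)
            out[y][x] = [sum(wi * v[yy][xx][c] for wi, (yy, xx) in zip(w, nbrs))
                         for c in range(d)]
    return out
```

Setting `k` equal to the grid size makes every window cover the whole grid, so in this sketch the result coincides with global self-attention.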

Dilated Neighborhood Attention (DiNA): A dilation factor $\delta$ introduces sparse sampling of neighbors. When dilations are increased layer by layer, the receptive field at layer $\ell$ covers up to $O(K^\ell)$ tokens per axis, while the per-layer computational cost, $O(N K^2 d)$ for $N$ tokens, remains unchanged (Hassani et al., 2022, Mehta et al., 2023, Manzari et al., 19 Feb 2025).
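Spherical NA: On the sphere $S^2$, locality is defined by geodesic neighborhoods:

$\mathcal{N}(x) = \{ x' \in S^2 : d_{S^2}(x, x') \le \theta \}$

where $d_{S^2}$ is the geodesic (great-circle) distance and $\theta$ is the radius of the disk. Attention uses quadrature-weighted softmax within geodesic disks to ensure approximate SO(3) equivariance (Bonev et al., 16 May 2025).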
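A small sketch may help make the geodesic-disk idea concrete. It assumes points are given as (colatitude, longitude) pairs in radians, uses a brute-force neighbor search, and takes the quadrature-weighted softmax to mean that each exponentiated score is multiplied by the quadrature weight of its grid point before normalization. All function names are hypothetical and are not taken from the cited work.

```python
import math

def unit_vector(colat, lon):
    # Convert spherical coordinates (radians) to a 3D unit vector.
    return (math.sin(colat) * math.cos(lon),
            math.sin(colat) * math.sin(lon),
            math.cos(colat))

def geodesic_distance(p, q):
    # Great-circle distance on the unit sphere; the clamp guards against
    # rounding errors pushing the dot product slightly outside [-1, 1].
    dot = sum(a * b for a, b in zip(unit_vector(*p), unit_vector(*q)))
    return math.acos(max(-1.0, min(1.0, dot)))

def geodesic_neighbors(points, i, theta):
    # Indices of all points inside the geodesic disk of radius theta around
    # points[i] (brute force, for illustration only).
    return [j for j, p in enumerate(points)
            if geodesic_distance(points[i], p) <= theta]

def quadrature_softmax(scores, weights):
    # Softmax in which each term is scaled by a quadrature weight, so that the
    # sum approximates an integral over the disk rather than a plain sum.
    m = max(scores)
    num = [w * math.exp(s - m) for s, w in zip(scores, weights)]
    z = sum(num)
    return [n / z for n in num]
```

On an equiangular grid the quadrature weights would typically shrink toward the poles, which keeps densely sampled polar regions from dominating the normalization.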

Graph Neighborhood Attention: The neighborhood is defined by a set of topological (usually $1$-hop or $k$-hop) neighbors, or by topology-adaptive or spectral criteria (e.g. via Laplacian eigenvectors, spectral distance, or adaptive pruning/augmentation) (Lu et al., 2024, Kefato et al., 2020).
Generalized NA (GNA): Extends the formalism to any multi-dimensional domain with arbitrary stride, window, and block-alignment parameters, unifying sliding-window, strided, and block attention (Hassani et al., 23 Apr 2025).
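To illustrate how dilation changes which tokens are visited, the following 1D sketch lists the neighbor indices of a token. It assumes that a dilation of $\delta$ restricts a token to positions sharing its remainder modulo $\delta$, and that windows shift inward at sequence borders; the function name and the border rule are assumptions made for this example.

```python
def dilated_neighbors(i, n, k, dilation=1):
    """Indices attended to by token i in a length-n sequence, for an odd
    window size k and a given dilation (hypothetical helper)."""
    assert k % 2 == 1 and dilation >= 1 and k * dilation <= n
    group, pos = i % dilation, i // dilation          # residue class and rank in it
    group_len = (n - group + dilation - 1) // dilation
    # Centre the window inside the residue class, shifting inward at borders.
    start = min(max(pos - k // 2, 0), group_len - k)
    return [group + dilation * (start + t) for t in range(k)]

print(dilated_neighbors(5, 12, 3))               # [4, 5, 6]
print(dilated_neighbors(5, 12, 3, dilation=3))   # [2, 5, 8]
```

The number of neighbors is `k` in both calls, which is why the per-layer cost does not depend on the dilation.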

2. Algorithmic Structure and Hardware Acceleration

Neighborhood Attention is implemented as sparse or masked attention, often using custom CUDA or FMHA kernels to maximize memory locality and throughput.

Pointwise Neighborhood Attention: Each query index $i$ gathers the keys $K_{\rho_j(i)}$ and values $V_{\rho_j(i)}$ of its spatial or sequential neighbors into a local "halo," computes pairwise dot products and softmax, and aggregates the values using the attention weights (Hassani et al., 2024).
Fused Kernels: Compute the QK products, softmax, and value weighting in a single kernel pass, storing only minimal O(1) auxiliary data per threadblock, and realize practical speedups of 3–12× for 1D and 2D NA over naive CUDA implementations (Hassani et al., 2024, Hassani et al., 23 Apr 2025).
Blockwise Sparse Attention: In the NABLA method, attention masks are adaptively generated at block level by thresholding proxy (downsampled) attention maps, then expanded to token level for efficient execution with PyTorch's FlexAttention or Flash/FusedAttention operators (Mikhailov et al., 17 Jul 2025).
Permutation/Tiling: GNA optionally permutes tokens into a tile-major format to match hardware block-level sparsity and minimize wasted compute inside the kernel, which is necessary for realizing the speedups on Blackwell-class architectures (Hassani et al., 23 Apr 2025).
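The block-level masking idea can be illustrated with a deliberately simplified sketch. It assumes mean pooling as the downsampling step and a fixed probability threshold as the selection rule. Both are stand-ins chosen for this example, and the function names are hypothetical, so this should not be read as the NABLA algorithm itself.

```python
import math

def mean_pool_blocks(x, block):
    # x: list of N token vectors; returns N // block block-averaged vectors.
    assert len(x) % block == 0
    d = len(x[0])
    return [[sum(row[c] for row in x[b:b + block]) / block for c in range(d)]
            for b in range(0, len(x), block)]

def block_mask(q, k, block, threshold):
    """Boolean block-level mask from a downsampled (proxy) attention map."""
    qb, kb = mean_pool_blocks(q, block), mean_pool_blocks(k, block)
    d = len(q[0])
    mask = []
    for qi in qb:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in kb]
        m = max(scores)
        e = [math.exp(s - m) for s in scores]
        z = sum(e)
        # Keep a key block only if its proxy attention probability is large enough.
        mask.append([ej / z >= threshold for ej in e])
    return mask

def expand_mask(mask, block):
    # Repeat every block-level entry block x block times to get a token-level mask.
    return [[mask[i // block][j // block] for j in range(len(mask[0]) * block)]
            for i in range(len(mask) * block)]
```

In practice the block-level mask would be handed to a block-sparse attention kernel directly, so the token-level expansion shown here is mainly useful for inspection.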

3. Complexity, Receptive Field, and Equivariance Properties

Neighborhood Attention mechanisms enable a fundamental trade-off: substantially reduced computational complexity at the expense of strictly local context per layer, which can be mitigated through carefully designed stacking or dilation schemes.

Attention Type	Complexity	Memory	Receptive Field Growth	Equivariance
Global Self-Attention	$O(N^2 d)$	$O(N^2)$	All tokens in one step	No (position biased)
Standard Neighborhood (NA)	$O(N K^2 d)$	$O(N K^2)$	Linear, grows as $\ell(K-1)+1$	Yes (sliding window)
Dilated NA (DiNA)	$O(N K^2 d)$	$O(N K^2)$	Exponential, up to $K^\ell$	Yes, width-dependent
Window Self-Attn (Swin)	$O(N K^2 d)$	$O(N K^2)$	Linear	No (window/block)
Spherical NA	$O(N m d)$ ($m$: grid points per geodesic disk)	$O(N m)$	Local via geodesic disk	Approx. SO(3) equivariant
Block-sparse (NABLA, GNA)	$O(s N^2 d)$ ($s$: fraction of blocks kept)	$O(s N^2)$	Blockwise; adapts to task	Partial, block-aligned

Here $N$ is the number of tokens, $d$ the embedding dimension, $K$ the window size per axis, and $\ell$ the layer index; receptive fields are stated per axis.
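As a rough worked example of the gap between global and windowed attention, the snippet below counts the multiply-accumulates needed for the query-key scores of one head. The 56 × 56 token grid, head dimension 32, and 7 × 7 window are hypothetical values chosen only for illustration.

```python
def attention_score_macs(n_tokens, dim, window=None):
    """Multiply-accumulates for the query-key scores of a single head.
    window=None means global attention; otherwise each query sees a
    window x window neighborhood."""
    keys_per_query = n_tokens if window is None else window * window
    return n_tokens * keys_per_query * dim

n = 56 * 56                                           # hypothetical token grid
global_cost = attention_score_macs(n, 32)             # 314,703,872
local_cost = attention_score_macs(n, 32, window=7)    # 4,917,248
print(global_cost // local_cost)                      # 64
```

The ratio equals $N / K^2$, so the advantage of the windowed form grows linearly with the number of tokens.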

Key principles:

Receptive field grows linearly in local NA, exponentially in DiNA if dilations are increased layerwise, and can be made blockwise-adaptive in GNA/NABLA.
Equivariance is preserved in sliding window NA, in spherical geodesic NA, and in certain block-aligned designs, but is broken in standard window-block schemes.
Hardware scaling is facilitated by block/tile alignment, tile permutation, and kernel fusion.
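The linear versus exponential growth of the receptive field can be checked with a short calculation. The sketch assumes a 1D stack in which each layer with window size `window` and dilation `dil` widens the receptive field by `(window - 1) * dil` tokens; the helper name is hypothetical.

```python
def receptive_field(window, dilations):
    """Receptive field (in tokens, along one axis) after stacking one layer
    per entry of `dilations`, each with the same window size."""
    rf = 1
    for dil in dilations:
        rf += (window - 1) * dil
    return rf

print(receptive_field(3, [1, 1, 1, 1]))     # 9  = 4 * (3 - 1) + 1
print(receptive_field(3, [1, 3, 9, 27]))    # 81 = 3 ** 4
```

With all dilations equal to 1 the result is $\ell(K-1)+1$, while the schedule $1, K, K^2, \ldots$ gives $K^\ell$, at the same per-layer cost.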

4. Specialized Forms Across Modalities

Neighborhood Attention adapts to a wide spectrum of data domains and modeling objectives:

Computer Vision: NA and DiNA are used as spatial attention in vision transformers (NAT, DiNAT), enabling translation equivariance and efficient scaling with image resolution. DiNAT alternates NA and DiNA to combine local and sparse long-range context (Hassani et al., 2022, Hassani et al., 2022).
Medical Images: Enhanced DiNA with fused kernels and hierarchical hybrid strategies achieves SOTA under heavy corruption, outperforming both CNN- and global-attention baselines (Manzari et al., 19 Feb 2025).
LiDAR Point Clouds: Neighborhood Point Attention uses input-adaptive $k$NN-based neighborhoods for sparse 3D geometry, yielding linear scaling and a 640× runtime reduction for geometry compression (Xue et al., 2022).
Video and Temporal Data: DiNA with causal masking enables long-term temporal modeling with low cost in structures such as NAC-TCN for sequence emotion recognition (Mehta et al., 2023).
Graphs: Multiple paradigms span from mutual-attention pooling in GAP (Kefato et al., 2020), cosine-based neighbor-aware weighting in NGAT4Rec (Song et al., 2020), to topology/spectrum-driven directional neighborhood attention in DGAT (Lu et al., 2024).
Spherical Data: Geodesic disks define neighborhoods on $S^2$, using quadrature to maintain equivariance for geophysical and 360° image tasks (Bonev et al., 16 May 2025).
Block-level and Adaptive: GNA and NABLA generalize NA to multi-dimensional block-sparse designs, using adaptive masking or stride/window parameters for further complexity reduction (Hassani et al., 23 Apr 2025, Mikhailov et al., 17 Jul 2025).
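For irregular data such as point clouds, the neighborhood comes from the geometry rather than from a grid. The sketch below shows attention over $k$ nearest neighbors in Euclidean space. It uses a brute-force search for clarity, whereas a practical system would rely on a spatial index; the function names are hypothetical and the code is not taken from any of the cited models.

```python
import math

def knn_indices(points, i, k):
    """Indices of the k points closest to points[i] (the point itself included).
    Brute-force search, for illustration only."""
    order = sorted(range(len(points)),
                   key=lambda j: math.dist(points[i], points[j]))
    return order[:k]

def point_neighborhood_attention(points, q, key, v, k):
    """Single-head attention in which point i attends to its k nearest points."""
    d = len(q[0])
    out = []
    for i in range(len(points)):
        nbrs = knn_indices(points, i, k)
        scores = [sum(a * b for a, b in zip(q[i], key[j])) / math.sqrt(d)
                  for j in nbrs]
        m = max(scores)                           # stabilize the softmax
        e = [math.exp(s - m) for s in scores]
        z = sum(e)
        out.append([sum(ej / z * v[j][c] for ej, j in zip(e, nbrs))
                    for c in range(len(v[0]))])
    return out
```

Because the neighbor set follows the local point density, dense regions attend over small physical distances and sparse regions over larger ones.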

5. Architectural and Empirical Advances

Neighborhood Attention serves as the structural foundation in multiple SOTA models and practical applications.

Vision Transformers: DiNAT improves both box AP and mask AP, and sets a new state of the art on panoptic and instance segmentation benchmarks at throughput and memory on par with or better than Swin/NAT architectures (e.g., 58.5 PQ on COCO panoptic, 84.5 mIoU on Cityscapes) (Hassani et al., 2022).
Image Restoration: DiNAT-IR achieves PSNR gains over channel-attention methods on GoPro motion deblurring, with about 26M parameters and 45G FLOPs (Liu et al., 23 Jul 2025).
Point Cloud Compression: NPAFormer achieves BD-rate gains in lossy scenarios and bitrate reductions in lossless scenarios, with a speedup of two orders of magnitude over the baseline learned octree attention (Xue et al., 2022).
Graph Representation: Directional GAT (DGAT) outperforms prior GAT/GATv2/GT models by 1–7 points on challenging heterophilic graph benchmarks via spectral neighborhood definitions (Lu et al., 2024).
Spherical Transformers: Spherical neighborhood attention achieves 2–3× lower error and improved SO(3) equivariance for weather and 360° vision datasets (Bonev et al., 16 May 2025).
Block-sparse Acceleration: Generalized Neighborhood Attention achieves 28–46% end-to-end speedup on Blackwell B200 GPUs across large vision and generative models (Cosmos-7B, HunyuanVideo, FLUX) without fine-tuning (Hassani et al., 23 Apr 2025, Mikhailov et al., 17 Jul 2025).

6. Limitations, Open Problems, and Future Directions

Long-range Dependencies: While DiNA and block-sparse variants expand receptive fields efficiently, modeling fully global context remains challenging in very shallow or block-partitioned networks (Hassani et al., 2022, Hassani et al., 23 Apr 2025).
Adaptivity: Fixed window/dilation sizes may not optimally capture local versus global context per instance or spatial position. There is active research on adaptive or learned neighborhood selection (Manzari et al., 19 Feb 2025).
Equivariance and Geometry: Spherical NA and other non-Euclidean extensions require specialized kernels and numerics (quadrature) to ensure geometric consistency (Bonev et al., 16 May 2025).
Graph Topologies: On graphs, defining semantically and structurally meaningful neighborhoods, especially under severe heterophily or for dynamic graphs, is an open challenge; spectral or diffusion-based approaches represent one principled direction (Lu et al., 2024).
Hardware Realization: While fused kernels and tiling schemes yield order-of-magnitude speedups, their efficiency strongly depends on the exact alignment of mask/block structure with available hardware primitives; mismatched strides or multi-dimensional permutations can reduce the theoretical benefits (Hassani et al., 2024, Hassani et al., 23 Apr 2025).
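The sensitivity to tile alignment can be illustrated with a toy model. It assumes a 1D sliding window, a kernel that processes queries and keys in fixed-size tiles, and that every key tile touched by any query in a query tile is computed in full. The model and the function name are assumptions made for this example and do not reproduce any published simulator.

```python
def tile_efficiency(n, k, tile_q, tile_kv):
    """Fraction of computed query-key pairs that are actually needed when a
    1D neighborhood of odd size k is evaluated tile by tile (toy model)."""
    assert n % tile_q == 0 and n % tile_kv == 0 and k % 2 == 1 and k <= n
    useful = computed = 0
    for q0 in range(0, n, tile_q):
        # Window start of each query in the tile, shifted inward at the borders.
        starts = [min(max(i - k // 2, 0), n - k) for i in range(q0, q0 + tile_q)]
        lo, hi = min(starts), max(starts) + k       # key range needed by the tile
        first, last = lo // tile_kv, (hi - 1) // tile_kv
        computed += tile_q * (last - first + 1) * tile_kv
        useful += tile_q * k
    return useful / computed

print(tile_efficiency(64, 7, 1, 1))   # 1.0: no tiling, nothing wasted
print(tile_efficiency(64, 7, 8, 8))   # well below 1: most tile work is masked out
```

In this model, efficiency drops whenever the window is small relative to the tile, which is one way to see why tile-aligned neighborhoods and token permutation matter for realized speedups.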

7. Summary Table: Major Neighborhood Attention Variants

Variant / Domain	Neighborhood Definition	Key Property / Benefit	Core Reference
NA (Images)	$K \times K$ spatial window	Linear scaling, equivariance	(Hassani et al., 2022)
DiNA (Images/Seq)	Dilated windows, dilation $\delta$	Exponential RF growth, no extra cost	(Hassani et al., 2022)
GNA (General/Sparse Block)	Parametric stride/window	Block-sparse, tile-level kernel usage	(Hassani et al., 23 Apr 2025)
NPA (Point Clouds)	$k$NN Euclidean	Adaptive, input-density aware	(Xue et al., 2022)
Spherical NA	Geodesic disk on $S^2$	SO(3) equivariance, quadrature correction	(Bonev et al., 16 May 2025)
Block-level Adaptive (NABLA)	Adaptive mask/block threshold	Speedup for video, preserves context	(Mikhailov et al., 17 Jul 2025)
Directional NA (Graphs)	Spectral/diffusion distance	Topology- and edge-flow aware	(Lu et al., 2024)
Causal Dilated NA (Seq/TCN)	Causal, left-only dilation	Long horizon with low cost	(Mehta et al., 2023)

Neighborhood Attention and its variants constitute a broad, rapidly evolving paradigm for efficient, inductive-bias-preserving attention mechanisms, with practical success across diverse modalities and tasks. The flexibility in defining "neighborhood" (spatial, block, spectral) and the ability to exploit hardware and data locality are key to their adoption in modern large-scale architectures. Continued progress increasingly focuses on more adaptive, context-aware, and geometrically principled formulations for even greater efficiency and performance.