Generalized Neighborhood Attention (GNA)
- Generalized Neighborhood Attention (GNA) is a sparse attention mechanism that restricts each token to a flexible local neighborhood, ensuring linear computational cost.
- GNA unifies sliding-window, strided, blocked, Gaussian-sampled, and multi-hop attention variants to effectively process vision, video, and graph data.
- Efficient implementation using tiled sparse computation reduces memory and FLOPs, achieving significant end-to-end speedups at up to 91% sparsity on modern hardware.
Generalized Neighborhood Attention (GNA) is a parametric family of sparse attention mechanisms designed to preserve locality and linear computational cost while providing flexible control over receptive fields. GNA encompasses and unifies sliding-window, strided-window, blocked (tiled), Gaussian-sampled, and multi-hop attention variants for high-dimensional data, including images, videos, and graphs. Its principled mask definition and implementation enable efficient sparse computation in neural network layers with theoretical and empirical speedups on modern hardware. GNA arises in multiple lines of research: kernel-defined sparse attention for vision and diffusion models (Hassani et al., 23 Apr 2025), spatially aware Gaussian attention for crowd localization (Li et al., 2021), and multi-neighborhood attention fusion for graph transformers (Li et al., 2022).
1. Mathematical Formulation of Generalized Neighborhood Attention
GNA restricts each query position to attend only to nearby keys and values within a parameterized "neighborhood," in contrast to dense self-attention, which computes global all-pairs interactions. In the most general formulation (Hassani et al., 23 Apr 2025), let $X \in \mathbb{R}^{n \times d}$, where $n = \prod_{j=1}^{r} n_j$ for $r$-dimensional data. For each query coordinate $i = (i_1, \dots, i_r)$, the neighborhood $\mathcal{N}(i)$ is defined by per-axis window sizes $w = (w_1, \dots, w_r)$, dilations $\delta = (\delta_1, \dots, \delta_r)$, and strides $s = (s_1, \dots, s_r)$:
$$\mathcal{N}(i) = \{\, j : j_a \in \mathcal{W}_a(\lfloor i_a / s_a \rfloor) \ \text{for every axis } a \,\},$$
where $\mathcal{W}_a(g)$ denotes the window of $w_a$ key positions (subsampled with dilation $\delta_a$ and clamped to the feature extent) shared by all queries in stride group $g$ along axis $a$.
The output is
$$y_i = \operatorname{softmax}\!\Big(\frac{q_i K_{\mathcal{N}(i)}^{\top}}{\sqrt{d}}\Big)\, V_{\mathcal{N}(i)},$$
where $K_{\mathcal{N}(i)}$ and $V_{\mathcal{N}(i)}$ gather only the keys and values in query $i$'s neighborhood. This captures sliding windows ($s = 1$), strided windows ($1 < s < w$, where every query in a stride group maps to the same window), and blocked attention ($s = w$), generalizing traditional local attention masks.
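To make the parameterization concrete, the following minimal NumPy sketch builds the boolean GNA mask for 1-D data and recovers the three regimes above; the group-centering and boundary-clamping conventions here are illustrative assumptions, not the exact behavior of the reference kernels.

```python
# Minimal 1-D sketch of the GNA mask family (window size w, stride s).
# Centering and boundary conventions are simplifying assumptions.
import numpy as np

def gna_mask(n: int, w: int, s: int) -> np.ndarray:
    """Boolean (n, n) mask: mask[i, j] = True if query i may attend to key j."""
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        # All queries in the same stride group share one window,
        # anchored at the group's center and clamped to the sequence bounds.
        center = (i // s) * s + (s - 1) // 2
        start = min(max(center - (w - 1) // 2, 0), n - w)
        mask[i, start:start + w] = True
    return mask

if __name__ == "__main__":
    n, w = 16, 4
    sliding = gna_mask(n, w, s=1)   # classic neighborhood attention
    strided = gna_mask(n, w, s=2)   # groups of 2 queries share a window
    blocked = gna_mask(n, w, s=w)   # non-overlapping block attention
    # Blocked case: every query attends exactly to its own block.
    assert all(blocked[i].nonzero()[0].tolist() ==
               list(range((i // w) * w, (i // w) * w + w)) for i in range(n))
    print("density:", sliding.mean(), strided.mean(), blocked.mean())
```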
Gaussian Neighborhood Attention defines $\mathcal{N}(i)$ by stochastic sampling: for spatial location $p_i$, draw samples $p_j \sim \mathcal{N}(p_i, \gamma^2 I)$, and aggregate them with dot-product attention (Li et al., 2021). Multi-focus extensions average multiple Gaussian kernels with different bandwidths $\gamma$.
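A minimal PyTorch sketch of this sampling scheme follows, using the sampled locations as both keys and values of a single feature map; the sample count `k`, bandwidth `gamma`, and the absence of learned projections or cross-frame context are simplifying assumptions.

```python
# Hedged sketch of Gaussian neighborhood sampling for a 2-D feature map:
# for each query location p, draw k key locations from N(p, gamma^2 I),
# clamp them to the map, and attend only over those samples.
import torch

def gaussian_neighborhood_attention(x, k=32, gamma=4.0):
    """x: (H, W, d) feature map; returns (H, W, d) attended features."""
    H, W, d = x.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    centers = torch.stack([ys, xs], dim=-1).float()          # (H, W, 2)
    # Sample k offsets per query from an isotropic Gaussian and clamp.
    offsets = torch.randn(H, W, k, 2) * gamma
    coords = (centers.unsqueeze(2) + offsets).round().long()  # (H, W, k, 2)
    coords[..., 0].clamp_(0, H - 1)
    coords[..., 1].clamp_(0, W - 1)
    flat = x.reshape(H * W, d)
    idx = coords[..., 0] * W + coords[..., 1]                  # (H, W, k)
    keys = flat[idx.reshape(-1)].reshape(H, W, k, d)           # gathered K = V
    # Dot-product attention of each query against its sampled neighborhood.
    attn = torch.softmax(
        (x.unsqueeze(2) * keys).sum(-1) / d ** 0.5, dim=-1)    # (H, W, k)
    return (attn.unsqueeze(-1) * keys).sum(2)

out = gaussian_neighborhood_attention(torch.randn(16, 16, 8))
print(out.shape)  # torch.Size([16, 16, 8])
```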
For graphs, multi-neighborhood attention constructs parallel attention kernels, one for each $k$-hop neighborhood, and adaptively fuses node representations weighted by learned per-hop importance scores (Li et al., 2022).
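The sketch below illustrates, under assumed normalization choices, how such per-hop attention supports can be derived from powers of the normalized adjacency; the actual MNA-GT construction may differ in its details.

```python
# Sketch: building H+1 attention masks from powers of the normalized
# adjacency, one per k-hop neighborhood (k = 0 is the identity kernel).
import numpy as np

def k_hop_masks(adj: np.ndarray, num_hops: int):
    """adj: (n, n) binary adjacency. Returns boolean masks where
    masks[k][i, j] = True iff node j is within k hops of node i."""
    n = adj.shape[0]
    # Symmetrically normalized adjacency with self-loops: D^{-1/2}(A+I)D^{-1/2}
    a_hat = adj + np.eye(n)
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(1))
    a_norm = d_inv_sqrt[:, None] * a_hat * d_inv_sqrt[None, :]
    masks, power = [np.eye(n, dtype=bool)], np.eye(n)
    for _ in range(num_hops):
        power = power @ a_norm
        masks.append(power > 0)          # support of the k-th power
    return masks

# Toy 4-node path graph: 0-1-2-3
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], float)
for k, m in enumerate(k_hop_masks(A, num_hops=2)):
    print(f"{k}-hop receptive field sizes:", m.sum(1))
```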
2. Unification and Special Cases
GNA’s parametric mask subsumes classical and recent sparse attention architectures (Hassani et al., 23 Apr 2025):
- Sliding-window attention: $s = 1$. Every query attends to keys in a fixed local window centered on it.
- Strided sliding-window: $1 < s < w$. Queries grouped by stride share neighborhood windows.
- Blocked (window) attention: $s = w$. Query blocks attend only inside their own block, equivalent to block-sparse kernels.
- Gaussian sampling (GNA/MFGNA): Neighborhood sampled probabilistically, with density decaying with Euclidean distance (Li et al., 2021).
- Multi-hop attention for graphs (MNA-GT): Neighborhoods defined by powers of the normalized adjacency, $\hat{A}^k$ for $k = 0, 1, \dots, H$ (Li et al., 2022).
The framework enables direct comparison and blending of locality, scale, and receptive field size.
| Variant | Window Definition | Typical Domain |
|---|---|---|
| Sliding Window (NA) | Fixed width $w$, stride $s = 1$ | Vision, Language |
| Strided Window | Stride $1 < s < w$ (shared windows per group) | Video |
| Blocked/Tiled | $s = w$, non-overlapping blocks | Vision (FMHA), Language |
| Gaussian Sampling | Stochastic, bandwidth $\gamma$ | Surveillance, Vision |
| Multi-Hop Graph | $k$-hop adjacency | Graphs, Molecules |
3. Implementation, Complexity, and Hardware Optimization
Dense self-attention requires $O(n^2 d)$ FLOPs and $O(n^2)$ memory for attention scores, where $n$ is the number of tokens and $d$ the head dimension. GNA restricts computation to $O(n w d)$, where $w$ is the window size per query. For multi-dimensional ($r$-axis) data, $n = \prod_{j=1}^{r} n_j$ and $w = \prod_{j=1}^{r} w_j$.
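As a rough illustration of these scaling claims, the sketch below tallies attention FLOPs for dense versus GNA computation; the constant factors (two matmuls) and the example extents are assumptions made for illustration.

```python
# Back-of-the-envelope cost model: dense attention is O(n^2 d) FLOPs and
# O(n^2) score memory, GNA is O(n w d), with n and w the products of
# per-axis extents and window sizes.
from math import prod

def attention_cost(extent, window, head_dim=128):
    n, w = prod(extent), prod(window)
    dense_flops = 4 * n * n * head_dim     # QK^T and PV matmuls
    gna_flops = 4 * n * w * head_dim       # each query sees only w keys
    return {"tokens": n, "window": w,
            "dense_GFLOPs": dense_flops / 1e9,
            "gna_GFLOPs": gna_flops / 1e9,
            "speedup_upper_bound": n / w}

# Illustrative 3-D video latent: 30 x 48 x 80 tokens vs. an (8, 24, 24) window.
print(attention_cost(extent=(30, 48, 80), window=(8, 24, 24)))
```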
Efficient GNA execution is enabled by tiling-based algorithms and token permutation (Hassani et al., 23 Apr 2025). On the NVIDIA Blackwell architecture, GNA is implemented atop fused multi-headed attention (FMHA) kernels in CUTLASS. When stride, window, and tile sizes are commensurate ("perfectly block-sparse"), all masked FLOPs are eliminated, achieving the theoretical speedup of $n / w$, the inverse of the attention density. The TileSim simulator analytically computes this upper-bound speedup for arbitrary configurations, accounting for tile overlap and masking efficiency.
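The sketch below gives a 1-D analogue of this analytical accounting, counting how many KV tiles each query tile must visit under a (window, stride) mask; it is an illustration in the spirit of TileSim, not the tool itself, and the tile sizes and window convention are assumptions.

```python
# Rough 1-D upper-bound speedup estimate: compare the number of KV tiles
# visited under a (w, s) GNA mask against dense attention.

def window_bounds(i, n, w, s):
    # Same simplified group-centered window as in the earlier mask sketch.
    center = (i // s) * s + (s - 1) // 2
    start = min(max(center - (w - 1) // 2, 0), n - w)
    return start, start + w - 1

def tiled_speedup(n, w, s, tile_q=128, tile_kv=128):
    num_q_tiles = (n + tile_q - 1) // tile_q
    num_kv_tiles = (n + tile_kv - 1) // tile_kv
    visited = 0
    for q0 in range(0, n, tile_q):
        qs = range(q0, min(q0 + tile_q, n))
        lo = min(window_bounds(i, n, w, s)[0] for i in qs)
        hi = max(window_bounds(i, n, w, s)[1] for i in qs)
        visited += hi // tile_kv - lo // tile_kv + 1   # KV tiles intersected
    return num_q_tiles * num_kv_tiles / visited

# Perfectly block-sparse: window and stride aligned with the tile size.
print(tiled_speedup(n=8192, w=1024, s=1024))   # full n/w = 8x speedup
# A sliding window of the same width loses some efficiency to partial tiles.
print(tiled_speedup(n=8192, w=1024, s=1))
```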
Empirically, the kernels sustain FLOP rates of up to 1.3 petaFLOP/s in FP16. End-to-end speedups on diffusion models reach 26% for Cosmos-7B, 63% for HunyuanVideo, and 45% for FLUX, with visual and metric parity to dense attention at up to 91% sparsity.
Memory use drops from $O(n^2)$ to $O(n w)$ per layer. For Gaussian GNA, per-query cost scales with the number of sampled locations rather than the full feature map, supporting inference at about 3 FPS on a V100 at the resolutions and sample counts reported by Li et al. (2021).
4. Applications in Vision, Video, and Graph Modeling
GNA’s locality bias and efficient masking enable its adoption in computer vision, video crowd analysis, generative diffusion models, and graph transformers.
In GNANet (Li et al., 2021), multi-focus Gaussian GNA aggregates spatial-temporal context in video clips to localize crowd head centers robustly across varying scales and perspectives. The architecture integrates:
- Scene modeling: feature fusion with GNA self-attention.
- Context cross-attention: spatial (F-GNA) and temporal (T-GNA) parallel modules.
- Localization decoding: dilated and deconvolutional layers for fine-grained probability maps.

Empirical validation on the VSCrowd and SenseCrowd benchmarks demonstrates state-of-the-art localization and counting.
Plugging GNA into generative models (Cosmos-7B, HunyuanVideo, FLUX) on Blackwell hardware yields substantial inference acceleration without accuracy degradation (Hassani et al., 23 Apr 2025).
In MNA-GT, $H+1$ multi-hop attention kernels and adaptive per-node fusion capture local-to-global structural information for arbitrary graphs. Results on TU and OGB datasets show consistent improvements over GCN, GAT, GraphSAGE, and graph Transformer baselines (Li et al., 2022).
5. Adaptive and Multi-Focus Extensions
GNA supports scale-adaptive attention and multi-hop fusion beyond fixed-window mechanisms. In video, multi-focus Gaussian kernels with different bandwidths $\gamma$ sample from mixtures of scales, approximating heavy-tailed receptive fields that link both nearby and distant structures, critical for resolving objects under perspective distortion (Li et al., 2021).
In graph transformers, multi-neighborhood kernels specialize in $k$-hop propagation, with adaptive per-node fusion weights $\alpha_{v,k}$ learned by attention gating:
$$h_v = \sum_{k=0}^{H} \alpha_{v,k}\, h_v^{(k)}, \qquad \alpha_{v,k} = \operatorname{softmax}_k\!\big(\mathbf{w}^{\top} h_v^{(k)}\big).$$
Ablation confirms adaptive fusion surpasses uniform merging, preserving task-relevant graph topology (Li et al., 2022).
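A hedged PyTorch sketch of such gated per-hop fusion follows; the scoring function and the names `HopFusion` and `gate` are illustrative choices and may not match the exact MNA-GT formulation.

```python
# Sketch of adaptive per-node fusion over H+1 per-hop representations,
# using a learned gating vector to weight each hop before summation.
import torch
import torch.nn as nn

class HopFusion(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, 1, bias=False)   # scores each hop's features

    def forward(self, hop_feats: torch.Tensor) -> torch.Tensor:
        """hop_feats: (num_hops, num_nodes, dim) -> fused (num_nodes, dim)."""
        scores = self.gate(hop_feats).squeeze(-1)          # (num_hops, num_nodes)
        alpha = torch.softmax(scores, dim=0)               # per-node hop weights
        return (alpha.unsqueeze(-1) * hop_feats).sum(0)    # weighted sum over hops

fused = HopFusion(dim=64)(torch.randn(3, 100, 64))  # H+1 = 3 hops, 100 nodes
print(fused.shape)  # torch.Size([100, 64])
```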
6. Comparison to Other Sparse Attention Variants
GNA generalizes prior sparse attention schemes:
- Fixed window (local attention): rigid mask, inflexible to data geometry.
- Uniform random attention: locality is ignored.
- Deformable attention: learned offsets require an extra prediction network; less explicit structure.
- Laplacian/spectral positional encodings (GT, SAN): global bias, lacks per-node scale adaptability.
- Shortest-path embedding (Graphormer): scalar distances, single-hop bias.
- Kernel masks (GraphiT, Gophormer): multi-view static kernels, single attention head.
Key differences: GNA supports multi-dimensional masking and stride, multi-kernel parallelism, stochastic (Gaussian) neighborhoods, and adaptive fusion for spatial, temporal, or structural scale preservation.
7. Practical Guidelines, Limitations, and Future Research
GNA is most effective when window and stride parameters align with hardware tile dimensions, maximizing block-sparse efficiency. For vision/video, token permutation arranges feature tensors to exploit multi-dimensional tiling. For LLMs, 1D or ND-tiled FMHA suffices.
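The sketch below shows one way such a tile-major token permutation can be realized for 2-D data; the tile sizes and the helper names `tile_permute` and `tile_unpermute` are illustrative, and the actual kernel-level layout may differ.

```python
# Illustrative token permutation: reorder an (H, W) grid so that each
# Th x Tw spatial tile becomes contiguous in the token dimension, letting
# 1-D attention tiles line up with spatial neighborhoods.
import torch

def tile_permute(x: torch.Tensor, th: int, tw: int) -> torch.Tensor:
    """x: (H, W, d) -> (H*W, d) with tile-major token order."""
    H, W, d = x.shape
    x = x.reshape(H // th, th, W // tw, tw, d)
    return x.permute(0, 2, 1, 3, 4).reshape(H * W, d)

def tile_unpermute(tokens: torch.Tensor, H: int, W: int, th: int, tw: int):
    """Inverse of tile_permute, restoring the (H, W, d) layout."""
    d = tokens.shape[-1]
    x = tokens.reshape(H // th, W // tw, th, tw, d)
    return x.permute(0, 2, 1, 3, 4).reshape(H, W, d)

x = torch.randn(32, 32, 8)
assert torch.equal(tile_unpermute(tile_permute(x, 8, 8), 32, 32, 8, 8), x)
```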
The TileSim simulator enables rapid exploration of (window, stride, tile) parameter space, searching for perfect block-sparse configurations. Incorporating GNA into existing models does not require fine-tuning in diffusion generation, provided dense layers are retained in early steps for quality assurance (Hassani et al., 23 Apr 2025).
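The following brute-force sketch illustrates such a search under a simple divisibility criterion for perfect block sparsity; the criterion, the density targets, and the example sizes are assumptions, and this is not the TileSim tool itself.

```python
# Simple search over (window, stride) pairs that are perfectly block-sparse
# for a given tile size, filtered to a target attention density range.
from math import prod

def perfectly_block_sparse(window, stride, tile):
    # Assumed criterion: along every axis the stride equals the window and
    # both are multiples of the tile extent, so no KV tile is partially masked.
    return all(w == s and w % t == 0 for w, s, t in zip(window, stride, tile))

def search(extent, tile, min_density=0.05, max_density=0.15):
    n = prod(extent)
    configs = []
    for w0 in range(tile[0], extent[0] + 1, tile[0]):
        for w1 in range(tile[1], extent[1] + 1, tile[1]):
            window = stride = (w0, w1)
            density = prod(window) / n
            if min_density <= density <= max_density and \
               perfectly_block_sparse(window, stride, tile):
                configs.append((window, stride, round(density, 3)))
    return configs

# 2-D example: 48x80 feature map, 16x16 tiles, targeting ~85-95% sparsity.
print(search(extent=(48, 80), tile=(16, 16)))
```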
Limitations include the bandwidth overhead of token permutation (roughly 1/8 of peak), the need for specialized kernel support for dynamic or learned tiling, and the engineering effort required to deploy GNA on other architectures (Hopper, Ampere) or in low-precision modes (FP8).
Open problems include:
- Hybrid or adaptive stride patterns tied to data geometry or model state.
- Dynamically learned kernel masks and windows under the GNA framework.
- Integration with task-specific priors (e.g., semantic segmentation, histology, scale-varying detection).
Access to implementations and reproducible configurations is provided via the NATTEN project (Hassani et al., 23 Apr 2025). This suggests a trend toward hardware/software co-design for scalable, locality-aware attention in large-scale deep learning.