Neighborhood Attention in Transformers

Updated 10 February 2026
  • Neighborhood Attention is a sparse self-attention mechanism that limits attention to local neighborhoods, achieving linear computational complexity.
  • It incorporates variants like dilated, cross-scale, and adaptive attention to balance the trade-off between locality and receptive field in Transformer architectures.
  • The approach delivers state-of-the-art efficiency in tasks across vision, audio, and point clouds while significantly reducing memory and computation costs.

Neighborhood Attention (NA) is a family of sparse self-attention mechanisms that restrict the attention of each query to a local or structured neighborhood of keys, rather than the full set, to achieve linear rather than quadratic complexity with respect to input length or spatial size. NA is implemented in multiple domains—vision, speech, point clouds, signals—where the inductive bias of locality is appropriate, and forms the core of scalable Transformer architectures with state-of-the-art performance and efficiency. NA layers are typically interleaved with or augmented by other mechanisms (global attention, dilated attention, block-adaptive attention) to balance locality, receptive field, and computational trade-offs.

1. Mathematical Formulation and Variants

Neighborhood Attention fundamentally modifies the self-attention operation by restricting the summation over keys and values to a fixed set of indices, typically surrounding each query in spatial or temporal coordinates. Given input $X \in \mathbb{R}^{n \times d}$, projections are computed as $Q = XW_Q$, $K = XW_K$, $V = XW_V$, with $W_{\cdot} \in \mathbb{R}^{d \times d}$.

For each query token (or position) $i$, a neighborhood set $N(i)$ is defined; e.g., for a window of size $k$,

$$N(i) = \left\{ j : |j - i| \leq \frac{k-1}{2} \right\}.$$

The attention output for $i$ is

$$\mathrm{NA}_k(i) = \sum_{j \in N(i)} \frac{\exp\left((Q_i K_j^T + B(i, j)) / \sqrt{d}\right)}{\sum_{j' \in N(i)} \exp\left((Q_i K_{j'}^T + B(i, j')) / \sqrt{d}\right)}\, V_j,$$

where $B(i, j)$ denotes a relative positional bias (learnable in vision or language settings) (Hassani et al., 2022, Hassani et al., 2022, Mehta et al., 2023).
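The formulation above can be sketched as a minimal single-head 1D implementation in NumPy. This is illustrative only: the relative positional bias $B(i,j)$ is omitted, and boundary windows are simply truncated here, whereas practical NA implementations (e.g., NAT) keep every neighborhood at exactly $k$ keys by shifting windows near the edges.

```python
import numpy as np

def neighborhood_attention_1d(X, W_q, W_k, W_v, k):
    """Single-head 1D neighborhood attention with odd window size k.

    Each query i attends only to keys j with |j - i| <= (k - 1) // 2,
    truncated at the sequence boundaries. The relative positional bias
    B(i, j) is omitted for brevity.
    """
    n, d = X.shape
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    r = (k - 1) // 2
    out = np.empty_like(V)
    for i in range(n):
        lo, hi = max(0, i - r), min(n, i + r + 1)   # neighborhood N(i)
        scores = Q[i] @ K[lo:hi].T / np.sqrt(d)     # (|N(i)|,)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()                    # softmax over N(i)
        out[i] = weights @ V[lo:hi]
    return out
```

Note that when the window covers the whole sequence ($k \geq 2n - 1$), the operation reduces to ordinary dense self-attention, which is a useful sanity check.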

Several variants extend this basic formulation: dilated neighborhoods (DiNA) that sample keys at a stride to expand the receptive field, causal neighborhoods for temporal modeling, cross-scale neighborhoods conditioned on high-resolution guidance (NAF), and generalized or block-adaptive sparsity patterns (GNA, NABLA). These are discussed in the sections below.

2. Computational Complexity and Implementation

The principal motivation for NA is reduction of self-attention's $O(n^2 d)$ compute and memory costs to $O(n k d)$, where $k \ll n$ is the neighborhood size. For example, in 2D images with window size $k \times k$ and $n = H \cdot W$, the cost per layer is $O(n k^2 d)$ (Hassani et al., 2022, Hassani et al., 2024).
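The asymptotic saving is easy to quantify with a back-of-the-envelope FLOP count. The sketch below compares the score-computation cost of full attention against a $k \times k$ neighborhood on a 2D feature map; the figures are illustrative only and ignore projections and constant factors.

```python
# Rough attention-cost comparison for a 2D feature map
# (illustrative FLOP counting only; projections and constants ignored).

def full_attention_cost(H, W, d):
    n = H * W
    return n * n * d          # every query scores against all n keys

def neighborhood_attention_cost(H, W, d, k):
    n = H * W
    return n * k * k * d      # every query scores against a k x k window

H, W, d, k = 56, 56, 64, 7
ratio = full_attention_cost(H, W, d) / neighborhood_attention_cost(H, W, d, k)
print(f"full / NA cost ratio: {ratio:.0f}x")  # 3136 / 49 = 64x
```

The ratio is simply $n / k^2$, so the advantage grows directly with resolution while the NA cost per token stays constant.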

Modern implementations exploit batched GEMM operations and fused attention kernels:

  • Batched GEMM Formulation: Neighbor sets for tiles of queries are gathered to form compact batched matrix multiplications, avoiding explicit per-position loops (Hassani et al., 2024).
  • Fused Kernels: Fused kernels perform all softmax and value-projection operations on chip, reducing global memory footprint and maximizing hardware utilization (see NATTEN library) (Hassani et al., 2022, Hassani et al., 2024).
  • Block- and Tile-aware Designs: By aligning neighborhood windows with memory/compute tiles and introducing block sparsity, kernels exploit hardware threadblock parallelism (Hassani et al., 23 Apr 2025, Mikhailov et al., 17 Jul 2025).

Empirically, fused NA kernels outperform both naive CUDA and unfused batched GEMM approaches, achieving up to 16× speedup and constant global memory usage, independent of the neighborhood size (Hassani et al., 2024).
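The batched-GEMM idea from the bullets above can be illustrated in NumPy: rather than looping over positions, the $k$-key neighborhood of every query is gathered into one $(n, k, d)$ tensor so that all attention scores come from a single batched contraction. This is a simplified 1D sketch of the formulation, not the NATTEN kernel itself; real kernels operate on tiles and avoid materializing the gathered tensor.

```python
import numpy as np

def gather_neighborhoods(K, k):
    """Gather the k-key neighborhood of every position into an
    (n, k, d) tensor, zero-padding out-of-range neighbors and
    returning a validity mask for them."""
    n, d = K.shape
    r = (k - 1) // 2
    padded = np.pad(K, ((r, r), (0, 0)))                  # (n + 2r, d)
    idx = np.arange(n)[:, None] + np.arange(k)[None, :]   # (n, k) into padded
    neighbors = padded[idx]                               # (n, k, d)
    mask = (idx >= r) & (idx < n + r)                     # True for real keys
    return neighbors, mask

def batched_na_scores(Q, K, k):
    """All neighborhood attention scores via one batched contraction,
    replacing the per-position loop (the batched-GEMM formulation)."""
    neighbors, mask = gather_neighborhoods(K, k)
    scores = np.einsum('nd,nkd->nk', Q, neighbors) / np.sqrt(Q.shape[1])
    return np.where(mask, scores, -np.inf)                # mask pads pre-softmax
```

Masked positions receive $-\infty$ so they vanish under the subsequent softmax, mirroring how boundary handling interacts with the gathered layout.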

3. Architectural Integration in Transformers and Other Models

NA is integrated into diverse model architectures across tasks:

  • Vision: Neighborhood Attention Transformer (NAT) replaces windowed or global ViT self-attention with per-pixel NA, preserving translational equivariance while remaining competitive with Swin and ConvNeXt on classification, detection, and segmentation (Hassani et al., 2022). DiNAT alternates between local NA and sparse DiNA layers to combine locality with exponentially expanding context (Hassani et al., 2022).
  • Audio/Speech: In PCF-NAT, NA layers with large temporal windows (e.g., $W = 27$) are alternated with global-attention layers to capture local and global speaker cues, with all projections implemented via progressively fused group convolutions. This design enables state-of-the-art speaker verification performance while retaining low inference cost (Li et al., 2024).
  • Temporal Modeling: NA and its dilated, causal variant (DiNA) are fused with temporal convolutions in NAC-TCN, providing efficient long-range context and causality for video-based emotion analysis (Mehta et al., 2023).
  • Point Cloud Processing: NPA uses dynamic, input-dependent kNN neighborhoods in spatially sparse LiDAR data, aggregating local geometric structure efficiently for occupancy prediction and compression (Xue et al., 2022).
  • Feature Upsampling: NAF (Neighborhood Attention Filtering) applies cross-scale NA for zero-shot upsampling, conditioning adaptive neighborhood weights on high-resolution guidance signals, and explicitly encoding relative position with RoPE (Chambon et al., 23 Nov 2025).
  • Block/Adaptive Attention: NABLA and GNA generalize NA to block-sparse and adaptive sparsity patterns, attaining content-driven global context where block-adaptive masks are determined dynamically from block-level attention scores (Hassani et al., 23 Apr 2025, Mikhailov et al., 17 Jul 2025).

4. Empirical Results and Benchmarks

Neighborhood Attention mechanisms yield competitive or superior results across a wide range of metrics and modalities, while significantly reducing compute:

| Model / Task | Accuracy / mIoU / Metric | Params / FLOPs | Notable Efficiency / Speedup | Reference |
|---|---|---|---|---|
| NAT-Tiny, ImageNet | 83.2% Top-1 | 28M / 4.3G | 40% faster, 25% less memory than Swin | (Hassani et al., 2022) |
| DiNAT-Large, COCO | 55.3 box AP (detection) | | Faster than Swin-L, +1.6 AP | (Hassani et al., 2022) |
| PCF-NAT, VoxCeleb1-O | EER < 0.5% | | >20% lower EER than ECAPA-TDNN | (Li et al., 2024) |
| NAC-TCN, AffWild2 | CCC = 0.52 | 10.12M | Up to 8× fewer MACs | (Mehta et al., 2023) |
| NAF, VOC segmentation | +5.58 mIoU (vs. nearest) | 0.66M | 2-3× faster than AnyUp | (Chambon et al., 23 Nov 2025) |
| NPAFormer, Ford PCG | -14.3% avg. bits-per-point | | 640× faster than OctAttention | (Xue et al., 2022) |
| GNA, video generation | | | 1.2-1.6× end-to-end speedup | (Hassani et al., 23 Apr 2025) |
| NABLA, video generation | | | ~2.7× faster, 92% block-sparsity | (Mikhailov et al., 17 Jul 2025) |

Performance gains are due to both algorithmic locality and tailored hardware-aware implementation.

5. Design Trade-offs, Extensions, and Limitations

While NA achieves significant efficiency gains, several factors shape its practical deployment:

  • Locality vs. Receptive Field: Stacking NA layers grows receptive field linearly (or exponentially with dilation), but pure local NA can miss long-range context. Alternating with global or block-adaptive attention recovers this at controlled cost (Hassani et al., 2022, Li et al., 2024, Mikhailov et al., 17 Jul 2025).
  • Neighborhood Shape and Adaptivity: Static windows are efficient, but non-uniform or learnable neighborhoods (e.g., kNN in NPA) can improve performance in non-Euclidean or heterogeneous data, albeit with more complex search/gather operations (Xue et al., 2022).
  • Translation Equivariance: Unlike block/window partitioning (e.g., Swin), NA preserves almost perfect equivariance—shifts in the input do not change the per-token neighborhood (Hassani et al., 2022).
  • Sparsity Patterns: Generalized (GNA) and adaptive block-level (NABLA) variants allow arbitrarily flexible neighborhood specification, balancing hardware efficiency and expressivity (Hassani et al., 23 Apr 2025, Mikhailov et al., 17 Jul 2025).
  • Implementation Limitations: Unfused or naive kernel implementations can negate theoretical gains. Fused implementations achieve constant memory footprint independent of window size, but extensions to backpropagation and all parameter regimes are ongoing (Hassani et al., 2024).
  • Fixed vs. Dynamic Neighborhood Size: Fixed sizes are suboptimal in some regimes; deformable, content-adaptive, or multi-granularity windows are plausible future improvements (Chambon et al., 23 Nov 2025, Mikhailov et al., 17 Jul 2025).
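The locality-versus-receptive-field trade-off in the first bullet can be made concrete with a simple 1D accounting: each stacked NA layer widens the receptive field by $(k - 1) \cdot \text{dilation}$ positions, so fixed dilation gives linear growth while increasing dilation per layer compounds much faster. This is a simplified 1D model; the dilation schedule below is illustrative.

```python
def receptive_field(window_sizes, dilations):
    """Receptive field (in positions) of stacked 1D NA layers:
    each layer with window k and dilation d widens the field
    by (k - 1) * d around the center token."""
    rf = 1
    for k, d in zip(window_sizes, dilations):
        rf += (k - 1) * d
    return rf

# 4 layers, window 7, no dilation: linear growth
print(receptive_field([7] * 4, [1] * 4))         # 25
# 4 layers, window 7, dilation doubling per layer: compound growth
print(receptive_field([7] * 4, [1, 2, 4, 8]))    # 91
```

This is the quantitative intuition behind DiNAT-style alternation: a few dilated layers recover long-range context that pure local stacking would need many more layers to reach.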

6. Application Domains

Neighborhood Attention has been adapted and validated in diverse modalities: image classification, detection, and segmentation (NAT, DiNAT), speaker verification (PCF-NAT), video-based emotion analysis (NAC-TCN), point cloud occupancy prediction and compression (NPA), zero-shot feature upsampling (NAF), and video generation (GNA, NABLA), as detailed in Section 3.

7. Future Directions and Open Challenges

Several avenues remain active or emerging in Neighborhood Attention research:

  • Content-Adaptive Neighborhoods: Methods for dynamic neighborhood selection (learnable, deformable) are underexplored and may reduce compute or improve fidelity in heterogeneous content (Chambon et al., 23 Nov 2025, Mikhailov et al., 17 Jul 2025).
  • Multi-scale and Multi-granularity Blocks: Hierarchical designs combining NA at various resolutions, or hybrid block/grid structures, could provide additional scalability for ultra-high resolution data (Hassani et al., 23 Apr 2025, Mikhailov et al., 17 Jul 2025).
  • Integration with Hardware Accelerators: Fused GNA and related operators are optimized for emerging architectures (Blackwell, TMA) and are being generalized for broader adoption (Hassani et al., 23 Apr 2025).
  • Gradient-efficient and Backward-pass Kernels: Efficient, fused backward implementations for NA and block-sparse variants are critical for training large models at scale (Hassani et al., 2024).
  • Theoretical Analysis and Modeling: Simulators predicting real speedup (beyond naive FLOP counting) and detailed studies of memory/bandwidth bottlenecks inform practical model design (Hassani et al., 23 Apr 2025, Hassani et al., 2024).

Neighborhood Attention and its descendants continue to advance the scalability, interpretability, and performance of attention-based architectures across modalities, with rapid hardware-software co-evolution enabling new applications at unprecedented scale.
