Neighborhood-Restricted Transformers
- Neighborhood-restricted Transformers are a class of models that impose local or graph-defined attention constraints to enhance computational efficiency and focus on local structure.
- They employ various mechanisms such as fixed-window, dilated, and adaptive sparse attention to balance reduced computation with effective context aggregation.
- These architectures enable aggressive token reduction and scalable performance while preserving spatial and structural consistency in vision, graph, and NLP applications.
Neighborhood-restricted Transformers refer to a family of architectures that impose locality constraints on the attention mechanism or token processing within Transformer models. The core idea is to restrict each token’s attention—explicitly or functionally—to its local or graph-defined neighborhood rather than all other tokens, trading the quadratic complexity and potentially diffuse context of standard self-attention for enhanced computational efficiency, focus on local structure, and improved context aggregation in both vision and graph domains. This class of models encompasses several concrete architectural motifs, such as sliding-window attention, block-level or adaptive sparse attention, graph-based neighborhood attention, Hilbert-reordered neighborhood token processing, and higher-order tuple-based graph transformers.
1. Mathematical Foundations of Neighborhood Restriction
Neighborhood-restricted attention generalizes the standard attention mechanism:
For an input $X \in \mathbb{R}^{N \times d}$, vanilla multi-head self-attention computes (per head)
$$\mathrm{Attn}(X) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,$$
where $Q$, $K$, $V$ are linear projections of $X$, and the attention matrix is dense ($N \times N$, i.e. $O(N^2)$ entries).
Neighborhood-restricted variants introduce a sparsity mask or structural alteration so that for each query position $i$, only a subset of key positions $j \in \mathcal{N}(i)$ is considered, where $\mathcal{N}(i)$ denotes the neighborhood of $i$ (spatial, temporal, graph-theoretic, or otherwise). There are two principal approaches:
- Fixed/local window restriction: $\mathcal{N}(i)$ is a window of fixed size $k$ centered on $i$ in 1D or 2D.
- Graph-defined restriction: $\mathcal{N}(i)$ is the set of $k$-hop graph neighbors, tuples, or edges adjacent to node $i$.
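In the simplest case, both restriction types reduce to a boolean mask over the attention logits. The minimal sketch below (PyTorch; the helper names are illustrative) builds a 1D fixed-window mask and a hop-limited graph mask and applies either one to single-head attention:

```python
# Minimal sketch of neighborhood-restricted attention via a boolean logit mask.
# mask[i, j] is True iff j is in N(i); other positions are set to -inf pre-softmax.
import torch
import torch.nn.functional as F

def window_mask(n: int, radius: int) -> torch.Tensor:
    """1D fixed-window neighborhood: N(i) = {j : |i - j| <= radius}."""
    idx = torch.arange(n)
    return (idx[None, :] - idx[:, None]).abs() <= radius

def graph_mask(adj: torch.Tensor, hops: int = 1) -> torch.Tensor:
    """Graph-defined neighborhood: N(i) = nodes reachable from i within `hops` steps."""
    reach = torch.eye(adj.size(0), dtype=torch.bool) | adj.bool()
    for _ in range(hops - 1):
        reach = reach | (reach.float() @ adj.float() > 0)
    return reach

def masked_attention(q, k, v, mask):
    """Single-head scaled dot-product attention with a neighborhood mask on the logits."""
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

# Example: 16 tokens, window radius 2 (each token attends to at most 5 neighbors).
n, d = 16, 32
q = k = v = torch.randn(n, d)
out_local = masked_attention(q, k, v, window_mask(n, radius=2))
adj = torch.rand(n, n) > 0.7
out_graph = masked_attention(q, k, v, graph_mask((adj | adj.T).float(), hops=2))
```

Materializing a dense $N \times N$ mask is only for exposition; efficient implementations gather the $|\mathcal{N}(i)|$ keys per query directly, which is where the complexity savings discussed below come from.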
The formalism extends to higher-order or blockwise neighborhoods and can incorporate dynamic, adaptive, or even hierarchical attention structures. For instance,
- Neighborhood Attention (NA): $\mathcal{N}(i)$ is the set of grid locations nearest to pixel $i$ in 2D image space (Hassani et al., 2022).
- Dilated Neighborhood Attention (DiNA): the window is subsampled with a dilation factor $\delta$, expanding the effective receptive field at the same per-layer cost (Hassani et al., 2022); see the sketch at the end of this section.
- Block-level Adaptive Neighborhood: Query and key tensors are averaged over blocks, then blockwise attention is sparsified via adaptive masking per block (Mikhailov et al., 17 Jul 2025).
For graph inputs, the restriction is tied to the adjacency or higher-order motif structure of the graph (Li et al., 2022, Zhou et al., 2024, Dwivedi et al., 2023).
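To make the NA/DiNA neighborhoods concrete, the sketch below enumerates the 2D grid positions a given pixel attends to for a chosen window size and dilation. It is illustrative only: real implementations (e.g. the NATTEN kernels) fuse this into the attention computation and shift windows at image borders instead of clamping.

```python
# Illustrative enumeration of 2D (dilated) neighborhood indices in the spirit of NA/DiNA.
import torch

def dilated_neighborhood(h: int, w: int, i: int, j: int,
                         window: int = 3, dilation: int = 1) -> torch.Tensor:
    """Return the (row, col) indices attended to by pixel (i, j).

    `window` is the kernel size (window x window neighbors); `dilation` spaces the
    neighbors apart, expanding the receptive field without adding attended positions.
    Border handling here is a simple clamp (NA proper shifts the window instead).
    """
    r = window // 2
    offsets = torch.arange(-r, r + 1) * dilation
    rows = (i + offsets).clamp(0, h - 1)
    cols = (j + offsets).clamp(0, w - 1)
    return torch.cartesian_prod(rows, cols)  # (window**2, 2) index pairs

# window=3: dilation=1 covers a 3x3 patch; dilation=2 spans a 5x5 area while still
# attending to only 9 positions per query pixel.
print(dilated_neighborhood(8, 8, i=4, j=4, window=3, dilation=2))
```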
2. Locality, Token Reduction, and Receptive Field Control
Neighborhood restriction is also leveraged for aggressive token reduction strategies in ViTs and GNNs. Two notable mechanisms are:
- Neighbor-Aware Pruning (NAP) (Li et al., 28 Dec 2025): After Hilbert-curve ordering, importance scores for tokens are formed by convolving attention statistics with a 1D kernel along the Hilbert sequence, ensuring that pruning respects spatial contiguity (see the sketch after this list).
- Merging by Adjacent Token Similarity (MAT): Only tokens that are immediately adjacent in the Hilbert sequence (thus locally adjacent in 2D) are eligible for merging, preserving patch-level spatial continuity.
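A minimal sketch of this neighbor-aware scoring idea, assuming a precomputed Hilbert permutation and per-token importance scores (e.g. CLS attention); the box-filter kernel and top-k rule are illustrative stand-ins rather than the paper's exact procedure:

```python
# Sketch: smooth per-token importance along the Hilbert order before pruning,
# so that keep/drop decisions respect spatial contiguity.
import torch
import torch.nn.functional as F

def neighbor_aware_keep(importance: torch.Tensor, order: torch.Tensor,
                        keep: int, kernel: int = 3) -> torch.Tensor:
    """importance: (N,) raw scores; order: Hilbert permutation; returns kept patch ids."""
    seq = importance[order].view(1, 1, -1)              # lay tokens out along the curve
    weight = torch.full((1, 1, kernel), 1.0 / kernel)   # simple box filter (illustrative)
    smooth = F.conv1d(seq, weight, padding=kernel // 2).view(-1)
    kept_in_seq = smooth.topk(keep).indices             # positions along the curve
    return order[kept_in_seq]                           # map back to patch indices

scores = torch.rand(256)              # e.g. CLS-attention mass per patch
order = torch.randperm(256)           # stand-in for a real Hilbert-curve permutation
kept = neighbor_aware_keep(scores, order, keep=128)
```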
Similarly, in graph transformers, tokens are constructed by aggregating multi-hop neighborhoods ("Hop2Token") and Transformer attention is applied to these per-node token sequences independently, further maintaining neighborhood restriction (Chen et al., 2023).
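A compact sketch of this Hop2Token-style tokenization (the row-normalized propagation below is an illustrative choice; NAGphormer's exact aggregation operator may differ), showing how attention ends up confined to each node's own hop sequence:

```python
# Sketch of Hop2Token-style tokenization: each node gets a sequence of K+1 tokens,
# one per hop, and Transformer attention runs within that short sequence only.
import torch

def hop2token(x: torch.Tensor, adj: torch.Tensor, num_hops: int) -> torch.Tensor:
    """x: (N, d) node features, adj: (N, N) adjacency. Returns (N, K+1, d) token sequences."""
    deg = adj.sum(dim=1).clamp(min=1)
    a_norm = adj / deg[:, None]          # row-normalized propagation (illustrative choice)
    tokens, h = [x], x
    for _ in range(num_hops):
        h = a_norm @ h                   # aggregate one hop further out
        tokens.append(h)
    return torch.stack(tokens, dim=1)

N, d, K = 100, 16, 3
x = torch.randn(N, d)
adj = (torch.rand(N, N) > 0.9).float()
seq = hop2token(x, adj, K)                                    # (100, 4, 16)
attn = torch.nn.MultiheadAttention(d, num_heads=4, batch_first=True)
out, _ = attn(seq, seq, seq)             # attention restricted to each node's own hops
```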
The effect is a controlled expansion of effective receptive field with stack depth or dilation, local aggregation for context preservation, and a trade-off between computational efficiency and context range.
3. Complexity, Scalability, and Implementation
Restricting attention to neighborhoods directly addresses the quadratic scaling bottleneck of global self-attention.
| Model/Approach | Complexity per Layer | Context Range |
|---|---|---|
| Full self-attention | $O(N^2)$ | Global |
| Windowed/Neighborhood Attention | $O(N \cdot k)$ for neighborhood size $k$ | Local (linear in $N$) |
| Dilated Neighborhood Attention | $O(N \cdot k)$ | Expanding (dilated window) |
| Block-level Adaptive (NABLA) | $O((N/B)^2)$ block map + retained block pairs | Block-adaptive |
| Sparse order-$k$ graph transformer | $O(n^k \cdot kn)$ (neighbor swaps) | $k$-tuple graph neighborhoods |
- Neighborhood Attention in Vision (Hassani et al., 2022, Jamali et al., 2023): Linear in the number of tokens for a fixed window size; implemented via optimized CUDA kernels (e.g., NATTEN), enabling processing of multi-megapixel images.
- Dilated Neighborhood Attention (Hassani et al., 2022, Manzari et al., 19 Feb 2025): Same asymptotic cost as simple neighborhood but exponential receptive field expansion via dilation.
- NABLA Block-Level Attention: Block grouping and adaptive thresholding control sparsity and enable efficient scaling in video and large-sequence contexts without custom kernel development, leveraging standard frameworks via sparse attention masks (Mikhailov et al., 17 Jul 2025).
- Graph Transformer Neighborhood Sampling: Tokenizing multi-hop neighborhoods, or limiting higher-order tuple attention to "neighbor swaps" in tuples, reduces each $k$-tuple's attention targets from all $n^k$ tuples to $O(kn)$ neighbor-swap candidates, versus the $O(n^{2k})$ pairs of all-pair tuple attention (Zhou et al., 2024, Dwivedi et al., 2023).
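A back-of-envelope comparison of attention-score pairs per layer under these regimes (all parameter values below are chosen purely for illustration):

```python
# Back-of-envelope count of attention-score pairs per layer (illustrative numbers only).
def full_pairs(n: int) -> int:             # global self-attention: O(n^2)
    return n * n

def window_pairs(n: int, k: int) -> int:   # neighborhood attention: O(n * k)
    return n * k

def block_pairs(n: int, block: int, keep_frac: float) -> int:
    """Block-level attention keeping a fraction of the (n/block)^2 block pairs."""
    nb = n // block
    return int(keep_frac * nb * nb) * block * block

n = 16_384                                       # e.g. a long video token sequence
print(full_pairs(n))                             # 268,435,456 pairs
print(window_pairs(n, k=49))                     # 802,816 pairs for a 7x7 neighborhood
print(block_pairs(n, block=64, keep_frac=0.2))   # ~54M pairs at 80% block sparsity
```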
4. Concrete Instantiations Across Domains
Vision Transformers
- Neighborhood Attention Transformer (NAT): Employs fixed-size 2D local windows for each pixel, achieving linear complexity, translation equivariance, and improved performance on classification and segmentation tasks compared to fixed-block window attention (e.g. Swin) (Hassani et al., 2022).
- Dilated Neighborhood Attention Transformer (DiNAT): Extends NA by introducing a dilation factor in neighbor selection per attention layer, exponentially expanding receptive field without increasing per-layer cost. Demonstrated consistent gains (up to +1.6% AP) over NA- and Swin-based backbones on detection, segmentation, and panoptic tasks (Hassani et al., 2022).
- MedViTV2 (DiNA in medical imaging): Further explores DiNA inside a local/global block hierarchy for robust medical visual recognition, showing that the expanded context is particularly valuable under noisy clinical conditions; a small window (w = 1 or 2) with dilation d = 2 achieves the best trade-off (Manzari et al., 19 Feb 2025).
- Neighbor-aware token reduction with Hilbert ordering (NAP, MAT): Reorders 2D patches into a 1D sequence using a discrete Hilbert curve, executing pruning/merging that preserves 2D spatial adjacency, empirically improving FLOPs-throughput-accuracy tradeoffs relative to prior methods (Li et al., 28 Dec 2025).
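The sketch below illustrates the locality-preserving constraint behind this Hilbert-ordered reduction: patches are serialized with the standard Hilbert-curve index mapping, and only pairs adjacent in that order are eligible for merging. The cosine-similarity threshold and averaging rule are illustrative stand-ins, not the published NAP/MAT procedure:

```python
# Sketch of Hilbert-ordered adjacent-token merging. d2xy is the standard
# Hilbert-curve construction; the greedy merge rule below is illustrative.
import torch
import torch.nn.functional as F

def d2xy(n: int, d: int):
    """Map a Hilbert index d to (x, y) on an n x n grid (n a power of two)."""
    x = y = 0
    s, t = 1, d
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:                       # rotate the quadrant
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x, y = x + s * rx, y + s * ry
        t //= 4
        s *= 2
    return x, y

def hilbert_order(grid: int) -> torch.Tensor:
    """Permutation listing the grid's patch indices in Hilbert-curve order."""
    coords = [d2xy(grid, d) for d in range(grid * grid)]
    return torch.tensor([x * grid + y for x, y in coords])

def merge_adjacent(tokens: torch.Tensor, order: torch.Tensor, thresh: float = 0.9):
    """Average pairs of tokens that are adjacent in the Hilbert order and similar enough."""
    seq = tokens[order]
    sim = F.cosine_similarity(seq[:-1], seq[1:], dim=-1)
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and sim[i] > thresh:
            out.append((seq[i] + seq[i + 1]) / 2)   # merge a spatially adjacent pair
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return torch.stack(out)

tokens = torch.randn(16 * 16, 192)                  # 16x16 grid of patch embeddings
reduced = merge_adjacent(tokens, hilbert_order(16)) # merges only locally adjacent patches
```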
Graph and Sequence Modeling
- Band-restricted/Local self-attention in NLP: Explicit masking restricts attention heads to a fixed radius $k$ around each token ("band" attention). Systematic evaluation shows that even fully local attention (all heads, all layers) incurs <0.15 BLEU drop or ≤2–3% accuracy loss on GLUE benchmarks, with $O(N \cdot k)$ complexity replacing $O(N^2)$ (Pande et al., 2020).
- Multi-neighborhood attention in graphs (MNA-GT): Constructs separate scaled dot-product attention kernels for each $k$-hop adjacency power, then combines the kernel outputs adaptively per node, allowing the model to emphasize the structurally relevant context scale for each node independently (Li et al., 2022); see the sketch after this list.
- Order-$k$ and sparse tuple-based Graph Transformers: Neighborhood-based sparse order-$k$ self-attention achieves full $k$-WL expressivity at a per-layer cost proportional to the number of $k$-tuples times their $O(kn)$ neighbor swaps, in contrast to the $O(n^{2k})$ scaling of full order-$k$ transformers (Zhou et al., 2024).
- Sampling-based Graph Transformers for Large Graphs (LargeGT): Efficiently merges local (2-hop sample context tokens, yielding effective 4-hop receptive field) and global (approximate codebook) representations, scaling linearly with graph size and delivering state-of-the-art results on node classification for graphs with billions of edges (Dwivedi et al., 2023).
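A minimal sketch of the multi-neighborhood idea referenced above (the module structure, gating, and cumulative hop-mask construction are illustrative simplifications; the published MNA-GT architecture differs in detail):

```python
# Sketch: one attention kernel per hop level, each restricted to that hop's reachability,
# combined per node by learned softmax weights over the hop kernels.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiNeighborhoodAttention(nn.Module):
    def __init__(self, dim: int, num_hops: int):
        super().__init__()
        self.qkv = nn.ModuleList([nn.Linear(dim, 3 * dim) for _ in range(num_hops)])
        self.gate = nn.Linear(dim, num_hops)     # per-node weights over the hop kernels
        self.num_hops = num_hops

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        """x: (N, d) node features, adj: (N, N) binary adjacency."""
        hop_mask = torch.eye(adj.size(0), device=x.device).bool() | adj.bool()
        outs = []
        for k in range(self.num_hops):
            q, key, v = self.qkv[k](x).chunk(3, dim=-1)
            scores = q @ key.T / x.size(-1) ** 0.5
            scores = scores.masked_fill(~hop_mask, float("-inf"))
            outs.append(F.softmax(scores, dim=-1) @ v)
            hop_mask = hop_mask | (hop_mask.float() @ adj > 0)   # grow to the next hop
        weights = F.softmax(self.gate(x), dim=-1)                # (N, num_hops)
        return (torch.stack(outs, dim=1) * weights.unsqueeze(-1)).sum(dim=1)

x = torch.randn(50, 32)
adj = (torch.rand(50, 50) > 0.8).float()
out = MultiNeighborhoodAttention(dim=32, num_hops=3)(x, adj)     # (50, 32)
```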
Block-level and Adaptive Schemes
- NABLA (Neighborhood Adaptive Block-Level Attention): Partitions long token sequences (video, large images) into non-overlapping blocks, summarizes within-block Q/K, computes block-level attention, and adaptively sparsifies according to the cumulative attention mass per query block. Achieves up to a 2.7x speedup with near-constant generative quality and requires only standard sparse attention ops (e.g. PyTorch FlexAttention) (Mikhailov et al., 17 Jul 2025).
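A sketch of the block-level adaptive masking step, assuming mean pooling for the block summaries and a cumulative-mass threshold; the resulting boolean block mask would be handed to a block-sparse attention kernel rather than used with dense attention:

```python
# Sketch: pool Q/K over blocks, score block pairs, and keep (per query block) the
# smallest set of key blocks whose attention mass reaches the target threshold.
import torch
import torch.nn.functional as F

def block_adaptive_mask(q: torch.Tensor, k: torch.Tensor,
                        block: int, mass: float = 0.8) -> torch.Tensor:
    """q, k: (N, d) with N divisible by `block`. Returns an (N/block, N/block) bool mask."""
    nb, d = q.size(0) // block, q.size(-1)
    qb = q.view(nb, block, d).mean(dim=1)              # block-level query summaries
    kb = k.view(nb, block, d).mean(dim=1)              # block-level key summaries
    probs = F.softmax(qb @ kb.T / d ** 0.5, dim=-1)    # (nb, nb) block attention map
    sorted_p, idx = probs.sort(dim=-1, descending=True)
    keep_sorted = sorted_p.cumsum(dim=-1) - sorted_p < mass   # cover `mass` of the prob
    mask = torch.zeros_like(keep_sorted)
    mask.scatter_(-1, idx, keep_sorted)                # undo the sort
    return mask

q = k = torch.randn(1024, 64)
mask = block_adaptive_mask(q, k, block=64, mass=0.8)   # (16, 16) boolean block mask
print(mask.float().mean())                             # fraction of block pairs retained
```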
5. Empirical Results and Trade-offs
- Vision: NAT models outperform Swin and ConvNeXt backbones of similar size on ImageNet (e.g., NAT-Tiny: 83.2% top-1 vs Swin-Tiny: 81.3%) and mIoU on ADE20K segmentation (+2–2.6 points), with comparable or lower throughput cost (Hassani et al., 2022).
- Token reduction: Hilbert-based neighbor-preserving NAP+MAT achieves 30–50% FLOPs reduction with <0.5–2.5% top-1 drop, surpassing EViT, ToMe, DiffRate (Li et al., 28 Dec 2025).
- Dilated Neighborhood Expansion: For fixed window size, increasing dilation grows receptive field and yields +0.3–0.5% accuracy improvements at no extra cost (Manzari et al., 19 Feb 2025, Hassani et al., 2022).
- Graph node classification: LargeGT achieves a 3x speedup and up to +16.8% improvement over strong GNN/transformer baselines on large-scale OGB datasets (Dwivedi et al., 2023). NAGphormer achieves comparable or superior accuracy to both full-graph and scalable GNNs, with attention cost linear in the number of nodes since each node attends only over its own hop-token sequence (Chen et al., 2023).
- NLP: Full local-only restriction in machine translation yields BLEU drops ≤0.14 and, in some cases, improvements, with nearly halved parameter counts (Pande et al., 2020).
- Block-adaptive methods: NABLA enables 80% attention sparsity and nearly identical CLIP and VBench scores relative to dense baselines in video diffusion transformers (Mikhailov et al., 17 Jul 2025).
6. Architectural and Theoretical Implications
Neighborhood-restricted designs offer several significant theoretical and practical advantages:
- Strict local inductive bias: Sliding-window and neighborhood-based attention mimic convolutional localness while allowing flexible receptive field expansion through stacking, dilation, or block adaptation.
- Translational equivariance: Pixelwise neighborhood attention (not blockwise) preserves equivariance, which is broken in chunked or globally-addressed models (Hassani et al., 2022).
- Context preservation and focus: Neighbor-aware token reduction prevents fragmentation of local context under aggressive pruning/merging regimes (Li et al., 28 Dec 2025).
- Expressive power: For graphs, neighborhood-restricted higher-order transformers can match (and, when designed carefully, even surpass) the Weisfeiler-Lehman hierarchy in distinguishing non-isomorphic graphs (Zhou et al., 2024).
- Parameter/computation trade-off: Layerwise and headwise parameter sharing among local heads can reduce the number of learnable parameters by half or more with only marginal accuracy loss (Pande et al., 2020).
- Scalability: Algorithms exploiting local or blockwise restriction easily scale to hundreds of millions of nodes or pixels, a regime unreachable for all-pairs attention.
7. Open Questions and Research Directions
Key ongoing and future areas of exploration include:
- Dynamic neighborhood selection: Learning or adapting the size, shape, or dilation of attention neighborhoods per input or per-stage, as well as per-token adaptive fusion of local and global kernels (Li et al., 2022, Li et al., 28 Dec 2025).
- Space-filling curve reordering: Systematic comparison of locality preservation and context aggregation under different space-filling curves (Hilbert, Peano, Morton) for token reduction (Li et al., 28 Dec 2025).
- Theoretical locality guarantees: Characterization of distortion and limit behavior for sequence-based proxies of multidimensional adjacency, such as Hilbert curve reordering.
- Block-wise versus pixelwise/locality preservation: Tension between computational efficiency (favoring blockwise summarization) and spatial consistency (preserved under sliding window/pixelwise schemes).
- Generalization to additional modalities: Applications in dense prediction for biomedical, spatio-temporal, and geospatial data, as well as video, text, and large-scale relational graphs.
- Parameter sharing/factorization strategies: Further reduction in model size via tied weights across heads/layers, particularly in heavily locality-biased models (Pande et al., 2020).
- Expressive power for graph transformers: Exploration of the border between locally sparse and globally expressive graph transformers, and how neighborhood restriction can be adapted to graph motif structure beyond simple hops (Zhou et al., 2024).
Neighborhood-restricted Transformers represent a central architectural frontier, enabling efficient, locality-preserving, and expressive modeling in tasks and modalities where local context, scalability, and structure-aware representations are essential.