DiNA: Dilated Neighborhood Attention
- DiNA is a dilated, locality-controlled attention mechanism that exponentially expands receptive fields without increasing per-layer computational costs.
- It selectively samples spatial or temporal neighborhoods using configurable strides, balancing fine-grained details with long-range contextual aggregation.
- Efficient implementations employ fused kernels and sliding-window strategies to support diverse applications in vision, sequence, and graph learning.
Dilated Neighborhood Attention (DiNA) is a scalable, locality-controlled attention mechanism that extends classic windowed self-attention with dilation, providing exponential receptive field growth at fixed computational cost. By selectively sampling spatial or temporal neighborhoods with a configurable stride, DiNA interpolates between purely local and quasi-global attention patterns, with implementations spanning vision, sequence modeling, and graph domains. DiNA’s design enables efficient content-dependent aggregation of both fine-grained local and long-range contextual information, forming a core building block in recent state-of-the-art architectures for detection, segmentation, image restoration, medical image analysis, emotion recognition, and heterophilic graph learning.
1. Mathematical Definition and Key Variants
Dilated Neighborhood Attention (DiNA) generalizes Neighborhood Attention (NA) by introducing a dilation factor $\delta \geq 1$ into the spatial or temporal windowing scheme. For a 1-D input sequence $X \in \mathbb{R}^{n \times d}$, DiNA applies linear projections to obtain $Q, K, V \in \mathbb{R}^{n \times d}$ and, at each position $i$, computes attention over a subset of indices $\rho^{\delta}(i) = \{\, i + \delta j : |j| \le \lfloor k/2 \rfloor \,\}$, clipped such that $\rho^{\delta}(i) \subseteq \{1, \dots, n\}$ (Mehta et al., 2023, Hassani et al., 2022). The full attention output at $i$ is
$$\mathrm{DiNA}_k^{\delta}(i) = \operatorname{softmax}\!\left(\frac{A_i^{\delta}}{\sqrt{d}}\right) V_{\rho^{\delta}(i)},$$
where $A_i^{\delta}$ contains the dot products of the query $Q_i$ with the keys $K_j$ at the dilated neighbors $j \in \rho^{\delta}(i)$, possibly including pairwise relative-position biases $B_{i,j}$.
In 2-D vision workloads, let $X \in \mathbb{R}^{H \times W \times d}$ and consider a local attention window of radius $r$ (side length $2r+1$), sampled on a $\delta$-strided subgrid: $\rho^{\delta}(p) = \{\, p + \delta o : o \in \{-r, \dots, r\}^2 \,\}$, again clipped to the feature map (Liu et al., 23 Jul 2025, Hassani et al., 2022, Manzari et al., 19 Feb 2025).
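The 1-D definition above can be illustrated with a naive (unfused) NumPy reference; `dina_1d` is a sketch for exposition only, and its clipping of out-of-range indices at sequence borders is a simplification (NATTEN shifts windows at borders rather than duplicating edge indices):

```python
import numpy as np

def dina_1d(X, Wq, Wk, Wv, k=3, delta=2):
    """Naive 1-D dilated neighborhood attention (illustrative, not fused).

    X: (n, d) input; Wq/Wk/Wv: (d, d) projections; k: odd window size;
    delta: dilation stride between attended neighbors.
    """
    n, d = X.shape
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    r = k // 2
    out = np.zeros_like(Q)
    for i in range(n):
        # Dilated neighborhood rho^delta(i) = {i + delta*j : |j| <= r},
        # clipped to [0, n) (simplification: edge indices may repeat).
        idx = np.clip(i + delta * np.arange(-r, r + 1), 0, n - 1)
        scores = Q[i] @ K[idx].T / np.sqrt(d)   # (k,) dot products A_i^delta
        w = np.exp(scores - scores.max())
        w /= w.sum()                            # softmax over the k neighbors
        out[i] = w @ V[idx]
    return out
```

Setting `delta=1` recovers plain neighborhood attention; the loop form makes the $O(nkd)$ cost of the windowed scheme explicit.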
On graphs, DiNA acquires a topological form, augmenting Graph Attention Network (GAT) edge scores with directional or diffusion-based features extracted from the spectrum of a parameterized Laplacian, enabling edge-aware long-range aggregation (Lu et al., 2024).
2. Receptive Field Expansion and Complexity Characteristics
The canonical advantage of DiNA is exponential receptive field growth with depth or stage—without increasing per-layer cost—by progressive dilation scheduling:
- NA: with fixed window size $k$, the receptive field grows linearly in depth $\ell$: $R_\ell = \ell(k-1) + 1$.
- DiNA: with a geometrically increasing dilation schedule (e.g., $\delta_\ell = 2^\ell$), each layer adds $\delta_\ell (k-1)$ positions, so the receptive field grows exponentially in depth (Hassani et al., 2022, Mehta et al., 2023).
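The linear-versus-exponential contrast can be verified with a short calculation; `receptive_field` is a hypothetical helper applying the standard rule that a layer with dilation $\delta$ and window $k$ extends the 1-D receptive field by $\delta(k-1)$:

```python
def receptive_field(kernel, dilations):
    """Cumulative 1-D receptive field after stacking layers whose per-layer
    dilations are given; each layer adds dilation * (kernel - 1) positions."""
    rf = 1
    for d in dilations:
        rf += d * (kernel - 1)
    return rf

k, depth = 3, 6
na = receptive_field(k, [1] * depth)                   # constant dilation
dina = receptive_field(k, [2**l for l in range(depth)])  # doubling schedule
# na -> 13 (linear in depth); dina -> 127 (exponential in depth)
```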
Both NA and DiNA exhibit $O(nkd)$ compute and memory cost ($n$ tokens, $k$ window size, $d$ channel/model dimension), in strong contrast to the $O(n^2 d)$ compute and $O(n^2)$ memory complexity of global self-attention (Manzari et al., 19 Feb 2025). In practice, DiNA enables much larger stackable depth or spatial resolution, as in high-resolution image restoration (Liu et al., 23 Jul 2025) and medical imaging (Manzari et al., 19 Feb 2025).
3. Implementation Methodologies and Fused Kernels
Efficient DiNA implementations leverage two primary strategies:
- Sliding-window gathering: For each token, gather the $k$ neighbors at stride $\delta$. Efficient vectorized/batched gather operations are employed in both 1-D and 2-D via CUDA kernels (NATTEN (Hassani et al., 2022); fused dot-product kernels (Hassani et al., 2024)).
- Fused attention kernels: The “fused neighborhood attention” paradigm merges the $QK^{\top}$ product, softmax, and the weighted aggregation of $V$ in a single GPU threadblock, keeping attention weights in registers/shared memory and reducing global memory bandwidth. Performance gains are dramatic: the fused 1-D kernels accelerate inference severalfold in both FP32 and FP16 versus naive kernels, with similar boosts in 2-D/3-D (Hassani et al., 2024).
Implementations support both causal and non-causal variants (for sequence modeling, causality is enforced via dynamic masking and padding (Mehta et al., 2023)), multi-head variants (partitioning the feature dimension across heads), and easy integration as PyTorch/torch.nn modules.
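The causal variant amounts to restricting each neighborhood to the present and past; a minimal index-construction sketch (a hypothetical helper, not the NATTEN API) makes the masking explicit:

```python
import numpy as np

def causal_dilated_neighbors(i, k, delta):
    """Indices attended to by position i under causal DiNA: position i itself
    plus up to k-1 earlier positions at stride delta; negative (out-of-range)
    indices are masked out, mirroring the dynamic-masking approach."""
    idx = i - delta * np.arange(k)      # i, i-delta, i-2*delta, ...
    return idx[idx >= 0][::-1]          # keep valid past positions, ascending
```

Near the sequence start the neighborhood simply shrinks, which is the index-level counterpart of masking padded positions out of the softmax.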
4. Architectural Integration in Deep Learning Models
DiNA is a foundational component in multiple state-of-the-art architectures. Canonical patterns include:
- Alternated NA / DiNA Stacking: Hierarchical transformers (e.g., DiNAT and DiNAT-IR) alternate standard NA (dense local) and DiNA (dilated, sparse global) layers, enhancing both local precision and global context (Hassani et al., 2022, Liu et al., 23 Jul 2025).
- Residual and Hybrid Blocks: DiNA modules are inserted after convolutional or feed-forward sub-blocks and combined via residual addition and normalization. In NAC-TCN, DiNA is coupled with dilated 1D convolutions, spatial dropout, and per-block residual connections (Mehta et al., 2023). In MedViTV2, DiNA is used in Local Feature Perception blocks interleaved with Kolmogorov-Arnold Network-based global blocks (Manzari et al., 19 Feb 2025).
For image restoration, DiNA may be hybridized with channel-aware modules for enhanced global context, as in DiNAT-IR’s dual-attention transformer blocks (Liu et al., 23 Jul 2025). In graphs, Directional DiNA introduces spectral edge features and topological rewiring to extend GATs (Lu et al., 2024).
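The alternated NA/DiNA stacking pattern reduces to a simple per-layer dilation schedule; `dilation_schedule` below is an illustrative sketch of the DiNAT-style alternation, not code from the cited works:

```python
def dilation_schedule(depth, max_delta):
    """Alternate dense NA layers (delta = 1) with dilated DiNA layers
    (delta = max_delta), the stacking pattern used in DiNAT-style stages."""
    return [1 if layer % 2 == 0 else max_delta for layer in range(depth)]

# A 4-layer stage at a resolution admitting dilation 8:
# dilation_schedule(4, 8) -> [1, 8, 1, 8]
```

Even-indexed layers sharpen local detail while odd-indexed layers propagate sparse global context, so information mixes across the full feature map every two layers.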
5. Hyperparameterization and Practical Guidelines
The efficacy of DiNA depends on appropriate choices for kernel size $k$, dilation factors $\delta$, depth, and attention head count $h$. Key observed tradeoffs (Mehta et al., 2023, Hassani et al., 2022, Liu et al., 23 Jul 2025, Manzari et al., 19 Feb 2025):
- Neighborhood size $k$: Larger $k$ increases local context and feature extraction capacity, but linearly increases compute/memory per layer.
- Dilation schedule $\delta_\ell$: Geometric growth (e.g., $\delta_\ell = 2^\ell$ at layer/stage $\ell$) achieves global receptive fields. Too aggressive dilation at high resolution may cause loss of short-term cues; insufficient dilation at low resolution may fail to capture long-range dependencies.
- Head count $h$: More heads enhance pattern diversity, with cost linear in $h$.
- Model dimension $d$: Tuned for task and memory budget.
- Alternation and hybridization: Alternating between NA and DiNA within or across stages/stacks balances fine detail and global awareness (Hassani et al., 2022, Liu et al., 23 Jul 2025).
Example settings reported in the cited works include a $7 \times 7$ kernel ($k=7$) with per-stage dilations alternating between 1 and the maximum supported by the feature-map resolution for hierarchical vision models (Hassani et al., 2022, Liu et al., 23 Jul 2025), and per-layer doubling of dilation ($\delta_\ell = 2^\ell$) for temporal modeling (Mehta et al., 2023).
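One practical constraint behind per-stage dilation choices is that the dilated window must fit inside the feature map; a common cap (assumed here, following DiNAT-style models) is $\delta_{\max} = \lfloor \text{feature\_size} / k \rfloor$:

```python
def max_dilation(feature_size, kernel):
    """Largest usable dilation for a k-window on a feature map of the given
    side length, capped so the dilated window stays within the map; assumes
    the floor(feature_size / kernel) rule used by DiNAT-style hierarchies."""
    return max(1, feature_size // kernel)

# For a 7x7 kernel across a typical 4-stage hierarchy
# (56 -> 28 -> 14 -> 7), this yields the schedule 8, 4, 2, 1.
```

The cap halves along with spatial resolution at each downsampling stage, which is why hierarchical models pair large dilations with early, high-resolution stages.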
6. Application Domains and Empirical Performance
DiNA underlies several high-performance, resource-efficient architectures across domains:
- Vision Transformers: DiNAT achieves state-of-the-art on COCO detection (box AP +1.6% over Swin), ADE20K segmentation (mIoU +1.4% over Swin), and competitive image classification (Hassani et al., 2022).
- Image Restoration: DiNAT-IR surpasses Restormer and other baselines on GoPro deblurring (PSNR 33.80 dB vs. 32.92 dB), HIDE, DPDD, SIDD, and Rain datasets, with negligible FLOPs/memory overhead (Liu et al., 23 Jul 2025).
- Medical Imaging: MedViTV2, using enhanced DiNA with KAN integration, improves clean accuracy by up to 6.2 percentage points and robust accuracy by 5.8 points on MedMNIST tasks, while also reducing computational demand by 44% vs. earlier models (Manzari et al., 19 Feb 2025).
- Temporal Modeling: NAC-TCN with causal DiNA outperforms LSTM, GRU, and standard TCNs on emotion recognition from sequences, with fewer parameters and lower compute (Mehta et al., 2023).
- Graph Neural Networks: Directional DiNA outperforms vanilla and state-of-the-art GAT/GNN variants on node classification in both homophilic and heterophilic benchmarks, with up to 10–20 points improvement on difficult heterophilic settings (Lu et al., 2024).
7. Theoretical Analysis, Limitations, and Future Directions
DiNA provides a principled mechanism for balancing computational efficiency and receptive field size. The exponential receptive field scaling with layer/stage and the ability to fuse content-adaptivity with sparse sampling distinguish DiNA from both pure convolutions and global attention. In graph regimes, using spectral/geometric features supplies theoretically justified control of message-passing efficacy via Laplacian eigenstructures (Lu et al., 2024).
Known limitations include:
- For tasks requiring fine-grained short-term features, overly large dilation may skip important details.
- Current fused kernels support forward pass only in some versions, limiting efficiency for training in particular frameworks (Hassani et al., 2024).
- Tuning dilation and alternation schedules is task-dependent and may require ablation.
Continued open-source support via NATTEN and integration in major frameworks is accelerating adoption (Hassani et al., 2022, Hassani et al., 2024). There is active research on expanding fused kernel coverage, autograd/bwd support, higher-dimensional and multi-modal extensions, and hybridizations with other locality/globality mechanisms.
References:
- "NAC-TCN: Temporal Convolutional Networks with Causal Dilated Neighborhood Attention for Emotion Understanding" (Mehta et al., 2023)
- "Dilated Neighborhood Attention Transformer" (Hassani et al., 2022)
- "Faster Neighborhood Attention: Reducing the O(n²) Cost of Self Attention at the Threadblock Level" (Hassani et al., 2024)
- "DiNAT-IR: Exploring Dilated Neighborhood Attention for High-Quality Image Restoration" (Liu et al., 23 Jul 2025)
- "Medical Image Classification with KAN-Integrated Transformers and Dilated Neighborhood Attention" (Manzari et al., 19 Feb 2025)
- "Neighborhood Attention Transformer" (Hassani et al., 2022)
- "Representation Learning on Heterophilic Graph with Directional Neighborhood Attention" (Lu et al., 2024)