Dilated Neighborhood Attention (DiNA)
- DiNA is a self-attention mechanism that introduces a dilation factor to exponentially expand receptive fields while maintaining a constant window size and linear complexity.
- It employs efficient batched GEMM kernels and fused GPU optimizations to overcome the quadratic cost of global attention, enhancing performance in vision and language tasks.
- DiNA integrates seamlessly into hierarchical and causal transformer architectures, achieving state-of-the-art results in image restoration, segmentation, and multimodal applications.
Dilated Neighborhood Attention (DiNA) is an efficient self-attention variant that generalizes neighborhood or sliding-window attention by introducing a dilation factor, enabling exponential expansion of the receptive field without increasing compute or memory cost per layer. DiNA is fundamentally motivated by the limitations of global self-attention's quadratic complexity and the prohibitively local receptive field of standard neighborhood attention. It has been applied across vision, language, and multimodal domains, often in combination with hierarchical or hybrid architectures where global context and local precision are both essential.
1. Core Formulation and Mathematical Definition
DiNA extends standard neighborhood attention by parameterizing a local window of fixed size with an integer dilation factor. For a multi-dimensional input (e.g., a sequence, image, or volume), consider queries $Q$, keys $K$, and values $V$ obtained via linear projections. For a token at index $i$ in a spatial grid, with window size $k$ (typically odd), kernel radius $r = (k-1)/2$, and dilation $\delta \geq 1$, the $i$th output is:

$$y_i = \mathrm{softmax}\!\left(\frac{q_i \, K_{\mathcal{N}_\delta(i)}^{\top}}{\sqrt{d}}\right) V_{\mathcal{N}_\delta(i)}$$

where the dilated neighborhood $\mathcal{N}_\delta(i)$ is defined as:
- 1D: $\mathcal{N}_\delta(i) = \{\, i + \delta j \;:\; j = -r, \dots, r \,\}$
- 2D: $\mathcal{N}_\delta(i) = \{\, (i_1 + \delta j_1,\ i_2 + \delta j_2) \;:\; j_1, j_2 \in \{-r, \dots, r\} \,\}$

With $\delta = 1$, this reduces to standard neighborhood attention (NA). By choosing $\delta > 1$, the receptive field grows to $\delta(k-1) + 1$ per dimension, while the number of attended elements per query remains constant ($k$ in 1D, $k^2$ in 2D), preserving linear complexity. Causal masking is applied as required, for instance by restricting to $j \leq 0$ along a temporal axis (Hassani et al., 2024, Hassani et al., 2022, Liu et al., 23 Jul 2025, Manzari et al., 19 Feb 2025).
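A minimal NumPy sketch of the 1D case, assuming clipped (edge-replicated) boundary handling, shows how the dilated gather and per-window softmax fit together; this is a reference implementation for illustration, not the fused kernels discussed below:

```python
import numpy as np

def dina_1d(q, k_mat, v, window, dilation):
    """1D dilated neighborhood attention. q, k_mat, v: (n, d); window: odd."""
    n, d = q.shape
    r = window // 2
    offsets = dilation * np.arange(-r, r + 1)                  # dilated offsets
    idx = np.clip(np.arange(n)[:, None] + offsets, 0, n - 1)   # (n, window)
    keys = k_mat[idx]                                          # (n, window, d)
    vals = v[idx]                                              # (n, window, d)
    scores = np.einsum('nd,nwd->nw', q, keys) / np.sqrt(d)     # scaled dot-product
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)              # softmax per window
    return np.einsum('nw,nwd->nd', weights, vals)              # weighted sum
```

With `dilation=1` this reduces to plain neighborhood attention; the cost per query is fixed by `window` regardless of the dilation, which is the source of the linear complexity.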
2. Computational Properties and Implementation
The hallmark of DiNA is the preservation of $O(nk)$ time and $O(n)$ space complexity per layer, where $k$ is the window size per query and $n$ is the number of tokens. Unlike global self-attention ($O(n^2)$), DiNA enables expanding the effective receptive field exponentially with depth, by stacking layers with increasing dilation, at no additional per-layer cost (Hassani et al., 2022, Manzari et al., 19 Feb 2025). In practice, DiNA is implemented using:
- Efficient batched GEMM kernels to leverage hardware-accelerated matrix multiplication for sliding windows.
- Fused GPU kernels (adapting FlashAttention and CUTLASS designs), which keep scores and masks in registers/shared memory and perform the softmax and weighted sum in place, eliminating global intermediate state and substantially outperforming naive CUDA implementations in speed (Hassani et al., 2024, Manzari et al., 19 Feb 2025).
The basic algorithm within one block involves:
- Linear projection to $Q$, $K$, $V$.
- For each token, gather the windowed keys/values at dilated offsets.
- Compute scaled dot-product, softmax, and weighted sum as above.
- Optionally, combine with a channel-aware or global aggregation branch for enhanced context (Liu et al., 23 Jul 2025).
The approach generalizes cleanly to multidimensional data with per-axis dilation and is compatible with hierarchical or UNet-style architectures where lower-resolution stages use larger dilations.
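Per-axis dilation on 2D feature maps amounts to building a separate dilated index grid for each axis. The following sketch (a hypothetical helper with border clipping, not a library API) constructs the index grids used to gather keys/values:

```python
import numpy as np

def dilated_window_indices_2d(h, w, window, dil_y, dil_x):
    """Return (row, col) index grids of shape (h, w, window, window) giving,
    for every spatial position, its dilated k x k window (clipped at borders)."""
    r = window // 2
    off_y = dil_y * np.arange(-r, r + 1)
    off_x = dil_x * np.arange(-r, r + 1)
    rows = np.clip(np.arange(h)[:, None] + off_y, 0, h - 1)   # (h, window)
    cols = np.clip(np.arange(w)[:, None] + off_x, 0, w - 1)   # (w, window)
    # Broadcast to full (h, w, window, window) grids for fancy indexing.
    ry = np.broadcast_to(rows[:, None, :, None], (h, w, window, window))
    cx = np.broadcast_to(cols[None, :, None, :], (h, w, window, window))
    return ry, cx
```

Given a feature map `x` of shape `(h, w, d)`, `x[ry, cx]` yields the gathered windows of shape `(h, w, window, window, d)`, to which the scaled dot-product and softmax are then applied per query.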
3. Architectural Integration and Variants
DiNA is most impactful when integrated within hierarchical transformer backbones or encoder-decoder networks. Prominent schemes include:
- DiNAT (“Dilated Neighborhood Attention Transformer”): Stacks layers that alternate between NA ($\delta = 1$) and DiNA ($\delta > 1$), with a dilation schedule that gradually shrinks across the spatial-resolution stages of vision models. This hybrid approach achieves both strong locality and sparse global coverage, allowing the receptive field to expand exponentially with minimal additional cost (Hassani et al., 2022).
- DiNAT-IR: Alternates NA and DiNA blocks at each stage, with deeper, lower-spatial-resolution stages using the largest dilation values. A channel-attention module complements spatial attention for global information integration (Liu et al., 23 Jul 2025).
- NAC-TCN: Applies causal DiNA in the time domain for video and time-series, using a schedule of increasing dilations to expand the temporal receptive field exponentially while maintaining strict causal order. Typically, each block combines a dilated convolution and a DiNA operator of matching kernel size and dilation, followed by residual addition and normalization (Mehta et al., 2023).
- MedViTV2: Blends DiNA with Kolmogorov–Arnold Networks (KANs) for medical image classification, arranging Local Feature Perception (via DiNA) and Global Feature Perception in a hierarchical pattern. Fused DiNA-KAN blocks mitigate feature collapse and scale efficiently on both clean and corrupted datasets (Manzari et al., 19 Feb 2025).
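The NA/DiNA alternation common to these designs can be expressed as a simple per-stage dilation schedule. The helper and example values below are illustrative placeholders, not the exact configuration of any published model:

```python
def dinat_dilation_schedule(layers_per_stage, stage_dilations):
    """Alternate local NA (dilation 1) with DiNA at each stage's dilation.
    Even-indexed layers within a stage use NA; odd-indexed layers use DiNA."""
    schedule = []
    for depth, dil in zip(layers_per_stage, stage_dilations):
        for layer in range(depth):
            schedule.append(1 if layer % 2 == 0 else dil)
    return schedule

# Example: two stages of two layers each, dilation shrinking with resolution.
# dinat_dilation_schedule([2, 2], [4, 2]) -> [1, 4, 1, 2]
```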
4. Empirical Performance and Receptive Field Analysis
Across computer vision, image restoration, medical imaging, and sequence modeling, DiNA has demonstrated:
- Comparable or superior accuracy to global self-attention or undilated neighborhood attention at substantially lower computational burden.
- Exponential receptive field growth per layer stack: with window size $k$ and dilations that increase geometrically across layers, a stack of $L$ DiNA layers covers a receptive field that grows exponentially in $L$, versus only linear growth when $\delta = 1$ throughout (Manzari et al., 19 Feb 2025).
- Enhanced robustness to data corruptions and feature collapse (noted in medical imaging), attributed to sparse, widely-spaced competition among attention windows (Manzari et al., 19 Feb 2025).
- Competitive or state-of-the-art results on ImageNet-1K classification, COCO object detection and segmentation, ADE20K semantic segmentation, Cityscapes instance/semantic segmentation, low-level image restoration (deblurring, denoising), and medical imaging segmentation/diagnosis (Hassani et al., 2022, Liu et al., 23 Jul 2025, Saadati et al., 2023, Manzari et al., 19 Feb 2025).
- Substantial empirical speedups over naive CUDA and even prior optimized attention kernels; for example, in full precision (FP32), the fused DiNA kernels report significant speedups in both 1D and 2D over naive baselines (Hassani et al., 2024).
Example: Empirical improvements with DiNA
| Model/Task | Setting | Baseline (%) | DiNA (%) | Reference |
|---|---|---|---|---|
| GoPro Deblurring | PSNR (NA only) | 31.56 | 32.03 | (Liu et al., 23 Jul 2025) |
| Synapse Multi-organ | DSC (TransUNet) | 77.48 | 82.43 | (Saadati et al., 2023) |
| MedMNIST-C Corrupted | bACC (MedViT V1) | 62.9 | 75.2 | (Manzari et al., 19 Feb 2025) |
Ablations confirm that alternating NA and DiNA layers is superior to either strategy alone, and that too large or too small dilation can reduce downstream performance (Hassani et al., 2022, Liu et al., 23 Jul 2025).
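The exponential-versus-linear receptive-field contrast noted above follows from standard dilated-window arithmetic: each stacked layer with window $k$ and dilation $d$ extends the receptive field by $d(k-1)$. A short sketch:

```python
def receptive_field(window, dilations):
    """Receptive field after stacking dilated windows of size `window`,
    one layer per entry in `dilations`."""
    rf = 1
    for d in dilations:
        rf += d * (window - 1)  # each layer adds d * (k - 1)
    return rf

# Constant dilation grows linearly; doubling dilation grows exponentially:
# receptive_field(3, [1, 1, 1, 1]) -> 9
# receptive_field(3, [1, 2, 4, 8]) -> 31
```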
5. Variants, Limitations, and Practical Considerations
DiNA admits several variations, notably in the treatment of edges/boundaries (masking or padding), selection of dilation schedule, integration with other global or channel modules, and extension to multi-resolution or causal contexts:
- Boundary handling involves clipping, masking with logits, or explicit padding, as native in CUDA or PyTorch implementations (Hassani et al., 2022, Manzari et al., 19 Feb 2025, Liu et al., 23 Jul 2025).
- Dilation schedules must be matched to the feature-map size and window size; typically, larger dilations are used in early, high-resolution stages.
- Fused kernels (e.g., FlashAttention-inspired) allow DiNA to scale efficiently to 1D, 2D, and 3D data, with constant memory footprint per token and strong empirical hardware utilization (Hassani et al., 2024).
- In causal or autoregressive settings, index construction and masking must ensure that each query only attends to valid past neighbors (Mehta et al., 2023).
- As per (Hassani et al., 2024), the backward pass is not yet available in the fused DiNA kernels; GEMM-based variants are limited by scatter/gather overheads in low-precision unless alignment conditions are met.
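Causal index construction for the temporal axis can be sketched as follows (an illustrative helper: each query at time $t$ attends only to past positions $t - \delta j$, $j \geq 0$, and out-of-range entries are masked, e.g., by assigning $-\infty$ logits before the softmax):

```python
import numpy as np

def causal_dilated_neighbors(n, window, dilation):
    """Indices and validity mask for causal dilated neighborhoods along time.
    Returns (idx, mask), both of shape (n, window); masked entries are
    clamped to 0 and should receive -inf attention logits downstream."""
    offsets = dilation * np.arange(window)        # 0, d, 2d, ... (past only)
    idx = np.arange(n)[:, None] - offsets         # (n, window)
    mask = idx >= 0                               # valid past positions
    return np.where(mask, idx, 0), mask
```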
Limitations include:
- Potential for boundary inefficiency or reduced context at edges, unless special handling is employed.
- Fixed dilation schedules may be suboptimal for highly multi-scale data — learnable or adaptive dilation remains an open direction (Saadati et al., 2023).
- Excessively large dilation can lead to oversmoothing or loss of high-frequency detail in low-level tasks (Liu et al., 23 Jul 2025).
6. Position within the Attention Landscape and Extensions
DiNA occupies a central role in bridging dense local attention (as in convolution or standard NA) and global sparse attention (as in long-range Transformer methods). It inherits translation equivariance, locality, and strong scaling properties, and seamlessly integrates into hierarchical and hybrid neural architectures. Extensions and research frontiers include:
- Mixed-scale or multi-dilation DiNA, combining several dilation factors per layer (Moritz et al., 2021).
- Integration with global or channel-aware modules for improved global context modeling at modest cost (Liu et al., 23 Jul 2025).
- Data-driven or adaptive dilation scheduling to match input scale and content (Saadati et al., 2023).
- Applications to temporal or spatiotemporal data with causal masking, as in NAC-TCN (Mehta et al., 2023).
- Robustness, calibration, and feature diversity improvements, particularly in medical imaging or in the presence of heavy dataset corruption, as observed with KAN integration (Manzari et al., 19 Feb 2025).
DiNA represents a natural and efficient generalization of localized attention, offering exponential receptive field growth and hardware efficiency while maintaining linear complexity with respect to sequence or image size. Its wide adoption in contemporary vision and sequence models, and empirical success across modalities, establishes it as a core operator in the next generation of efficient, scalable attention mechanisms.