Dilated Transformers
- Dilated Transformers are Transformer architectures that use sparsified attention via dilation patterns to exponentially enlarge receptive fields with sub-quadratic complexity.
- They leverage dilation factors to selectively attend to local and distant tokens across spatial, temporal, or spectral domains, enabling versatile applications.
- Empirical benchmarks show state-of-the-art performance in vision, medical imaging, video, and language tasks with significantly reduced computational costs.
Dilated Transformers are a class of Transformer architectures in which the attention mechanism is sparsified via dilation patterns (analogous to those in dilated convolutions) to achieve exponential growth in the receptive field with sub-quadratic computational complexity. By selectively attending over local and spatially distant tokens according to a dilation schedule, dilated Transformers efficiently capture both local detail and global context. Variants employ dilated attention in spatial, temporal, or spectral domains and integrate it with hierarchical, multi-scale, or hybrid model designs for diverse modalities including vision, medical imaging, video, and language modeling.
1. Key Concepts and Mathematical Foundations
The core idea in dilated Transformers is to restrict attention to a “dilated” set of tokens: for each query, a sparse but spatially/temporally spread set of key-value pairs is selected based on a dilation factor $\delta$. This allows the receptive field to expand exponentially with depth, rather than linearly as in canonical sliding-window attention mechanisms (Hassani et al., 2022, Ding et al., 2023, Manzari et al., 19 Feb 2025).
For example, Dilated Neighborhood Attention (DiNA) in vision operates as follows:
- For a spatial feature map, each token $i$ attends to the $k \times k$ grid of its neighbors spaced every $\delta$ pixels, i.e., its $\delta$-dilated neighborhood $\rho^k_\delta(i) = \{\, j : j = i + \delta \cdot o,\ o \in [-\lfloor k/2 \rfloor, \lfloor k/2 \rfloor]^2 \,\}$.
- The attention for token $i$ is then
$$\mathrm{DiNA}^k_\delta(i) = \mathrm{softmax}\!\left(\frac{Q_i K_{\rho^k_\delta(i)}^{\top} + B_{i,\rho^k_\delta(i)}}{\sqrt{d}}\right) V_{\rho^k_\delta(i)},$$
where $B$ is a learned relative positional bias.
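To make the indexing concrete, the following is a minimal sketch of 1D dilated neighborhood attention in PyTorch. Function names, shapes, and the clamped boundary handling are illustrative assumptions; the learned positional bias $B$ is omitted, and the official DiNA implementation relies on fused 2D CUDA kernels (NATTEN) rather than explicit gathering.

```python
# Minimal sketch of 1D dilated neighborhood attention (illustrative only;
# the real DiNA operates on 2D windows and adds a relative positional bias).
import torch
import torch.nn.functional as F

def dilated_neighborhood_attention(q, k, v, window=7, dilation=2):
    """q, k, v: (batch, length, dim). Each query attends to `window` keys
    spaced `dilation` positions apart, centred on its own position."""
    B, N, D = q.shape
    half = window // 2
    # Offsets of the dilated neighborhood, e.g. [-6, -4, ..., 6] for window=7, dilation=2.
    offsets = torch.arange(-half, half + 1, device=q.device) * dilation
    idx = (torch.arange(N, device=q.device).unsqueeze(1) + offsets).clamp(0, N - 1)  # (N, window)
    k_nb = k[:, idx]                                           # (B, N, window, D)
    v_nb = v[:, idx]                                           # (B, N, window, D)
    attn = torch.einsum('bnd,bnwd->bnw', q, k_nb) / D ** 0.5   # scaled dot product
    attn = F.softmax(attn, dim=-1)
    return torch.einsum('bnw,bnwd->bnd', attn, v_nb)

# Example: window 7 with dilation 2 covers 13 positions per layer.
x = torch.randn(2, 64, 32)
out = dilated_neighborhood_attention(x, x, x)
```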
In temporal domains such as video, Temporal Dilated Transformer Blocks (TDTB) use a dilation factor $\delta$ to attend only to frames spaced $\delta$ apart, exponentially expanding the temporal receptive field in successive layers (Sun et al., 14 Feb 2024).
For extremely long language sequences, LongNet implements dilated attention by sparsifying attention within segments, using a mixture of dilation patterns $(w_i, r_i)$ whose segment sizes $w_i$ and dilation rates $r_i$ grow geometrically; information between distant tokens propagates as these sparse patterns are composed across layers (Ding et al., 2023).
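A minimal sketch of one such segment-dilation pattern is given below, under the assumption of a single (segment length, dilation rate) pair; LongNet itself mixes several geometrically growing pairs, re-weights their outputs, applies causal masking, and distributes segments across devices, all of which are omitted here. Names and shapes are illustrative.

```python
# Sketch of a single LongNet-style pattern: split the sequence into segments and
# let only every `rate`-th token within each segment attend to the others.
import torch
import torch.nn.functional as F

def segment_dilated_attention(q, k, v, segment=8, rate=2):
    """q, k, v: (batch, length, dim); length must be divisible by `segment`."""
    B, N, D = q.shape
    qs = q.view(B, N // segment, segment, D)[:, :, ::rate]   # (B, S, segment/rate, D)
    ks = k.view(B, N // segment, segment, D)[:, :, ::rate]
    vs = v.view(B, N // segment, segment, D)[:, :, ::rate]
    attn = torch.einsum('bsqd,bskd->bsqk', qs, ks) / D ** 0.5
    attn = F.softmax(attn, dim=-1)
    out_sparse = torch.einsum('bsqk,bskd->bsqd', attn, vs)
    # Scatter the sparse outputs back to their original positions; positions not
    # selected by this pattern stay zero (LongNet covers them with other patterns).
    out = torch.zeros_like(q).view(B, N // segment, segment, D)
    out[:, :, ::rate] = out_sparse
    return out.view(B, N, D)

x = torch.randn(1, 32, 16)
out = segment_dilated_attention(x, x, x)
```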
2. Architectural Patterns and Variants
Several dilated Transformer variants have emerged:
- DiNAT (Dilated Neighborhood Attention Transformer): Hierarchically interleaves local neighborhood attention (NA) and DiNA blocks in a pyramid, achieving exponential receptive field growth while maintaining the linear complexity of local attention (Hassani et al., 2022).
- Dilated-UNet/D-Former: U-shaped encoder-decoder architectures for medical segmentation use alternating local and dilated Transformer blocks with dynamic position encoding and skip connections, allowing efficient fusion of local and global cues at every scale (Saadati et al., 2023, Wu et al., 2022).
- CTformer: Convolution-free vision Transformer for low-dose CT denoising employs cascaded Token2Token dilation and cyclic shifts; unfolding with dilation achieves local and non-local fusion without any learned convolution kernel (Wang et al., 2022).
- TDViT: Temporal dilated attention for video, mixing spatial and temporal information in sequential blocks; hierarchical dilation exponentially increases the frame-wise receptive field while keeping frame-level cost nearly linear (Sun et al., 14 Feb 2024).
- LongNet: Scales Transformers to 1B-token sequences by combining multiple dilated attention patterns with geometrically growing segment sizes; achieves linear computational complexity with logarithmic token-token dependency path lengths (Ding et al., 2023).
- DTU-Net: Multi-scale Dilated Transformer for hyperspectral unmixing, in which dilation rates vary across attention heads in MSDA (Multi-Scale Dilated Attention) blocks, capturing both local and long-range spatial correlations (Wang et al., 5 Mar 2025).
3. Computational Complexity and Scalability
Dilated attention fundamentally reduces cost compared to global self-attention:
| Attention Variant | Time Complexity | Memory |
|---|---|---|
| Full Self-Attention | $O(N^2 d)$ | $O(N^2)$ |
| Local/Neighborhood (NA) | $O(N w d)$ | $O(N w)$ |
| Dilated (DiNA/LongNet) | $O(N w d)$ | $O(N w)$ |

Here $N$ is the number of tokens, $d$ the embedding dimension, and $w$ the number of attended keys per query (e.g., $w = k^2$ for a 2D window of width $k$).
With properly chosen dilation schedules, the receptive field grows exponentially with depth (e.g., on the order of $(k-1)\,\delta^{L}$ after $L$ layers when the dilation grows geometrically with base $\delta$) while complexity remains linear in $N$ and quadratic only in the window width $k$. On video data, TDViT's dilated temporal attention attends to a fixed number of frames per block, so the amortized per-frame cost stays nearly constant as the temporal receptive field expands (Sun et al., 14 Feb 2024); D-Former for 3D medical volumes reduces the quadratic attention cost by a factor determined by its dilation stride (Wu et al., 2022). LongNet demonstrates near-constant GPU latency up to 1B tokens using distributed attention blocks (Ding et al., 2023).
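The receptive-field arithmetic can be checked in a few lines; the numbers below are illustrative and use the per-layer expansion rule of dilated convolutions, which is assumed here to carry over to windowed attention.

```python
# Back-of-the-envelope comparison of receptive-field growth for sliding-window
# attention (dilation 1 everywhere) versus a geometric dilation schedule.
def receptive_field(window, dilations):
    rf = 1
    for d in dilations:
        rf += (window - 1) * d   # each layer extends the field by (k - 1) * delta tokens
    return rf

window, layers = 7, 6
print(receptive_field(window, [1] * layers))                      # sliding window: 37 tokens
print(receptive_field(window, [2 ** i for i in range(layers)]))   # doubling dilation: 379 tokens
```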
4. Empirical Benchmarks and Applications
Dilated Transformer designs have set new benchmarks across multiple domains:
- Vision: DiNAT-Large achieves 84.4% top-1 on ImageNet-1k, 55.3 box AP on COCO, 54.9 mIoU on ADE20K, outperforming Swin and NAT baselines (Hassani et al., 2022). MedViTV2 with DiNA blocks achieves state-of-the-art results on 27/29 medical image datasets with 44% lower computation (Manzari et al., 19 Feb 2025).
- Medical Segmentation: Dilated-UNet reports 82.43% Dice (Synapse CT) and 0.9147 Dice (ISIC18 skin lesions), surpassing Swin-Unet and MISSFormer (Saadati et al., 2023). D-Former attains 88.8 DSC on Synapse 3D CT and 92.3 DSC on ACDC cardiac MRI at lower FLOPs than nnFormer (Wu et al., 2022).
- Low-dose CT Denoising: CTformer yields 0.9121 SSIM and 9.0233 RMSE HU (best among compared baselines), with only 1.45M parameters and 0.86G MACs (Wang et al., 2022).
- Video: TDViT raises AP_box from 47.1% (Swin-T) to 49.1%, and can expand the temporal receptive field by a factor of 10, saturating accuracy at ~78.5% AP_50 on VID (Sun et al., 14 Feb 2024).
- Hyperspectral Unmixing: DTU-Net lowers RMSE relative to competing methods and attains SAD_end as low as 0.0277, outperforming PPNMM-AE, Swin-HU, and DeepTrans across synthetic and real datasets (Wang et al., 5 Mar 2025).
- Long-sequence Language Modeling: LongNet maintains competitive perplexity at context lengths up to 32K and exhibits consistent scaling laws up to 2.7B parameters; runtime grows nearly linearly in sequence length up to 1B tokens (Ding et al., 2023).
5. Limitations, Trade-offs, and Extensions
Dilated Transformers require careful selection of the window size $k$ and dilation factors $\delta$, as large dilation in shallow layers may under-utilize local detail and induce “checkerboard” patterns in aggregation (Wu et al., 2022, Saadati et al., 2023). In temporal and spectral contexts, excessive dilation can skip essential intermediate correlations.
Current limitations include some loss of fine-grained precision relative to dense attention, the need for hyperparameter tuning of pattern mixtures (LongNet), and limited architectural support for dynamic or learned dilation schedules. Proposed extensions include:
- Dynamic or adaptive dilation selection per-layer (Saadati et al., 2023)
- Multimodal cross-dilated attention for vision+language (Ding et al., 2023)
- Temporal and spatial sparsification for scalable video and graph modeling (Sun et al., 14 Feb 2024, Wu et al., 2022)
- Integration of KAN blocks with DiNA for enhanced generalization (Manzari et al., 19 Feb 2025)
6. Implementation Practices and Interpretability
Implementation strategies utilize batched “unfold + gather” kernels (for DiNA), cyclic shifts and dilated-unfolding in pure transformer models (CTformer), and hierarchical multi-branch encoder designs (DTU-Net). Empirical interpretability is enhanced by visualizing static attention maps and hierarchical attention-flow graphs; Grad-CAM analyses reveal that DiNA-equipped layers attend to both local and global context, in contrast to non-dilated approaches (Wang et al., 2022, Manzari et al., 19 Feb 2025).
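As an illustration of the “unfold + gather” pattern, the snippet below collects dilated 2D neighborhoods with torch.nn.functional.unfold, which natively accepts a dilation argument; kernel size, dilation, and tensor shapes are arbitrary choices for the example rather than values from any of the cited models.

```python
# Sketch of dilated-neighborhood gathering on a 2D feature map via F.unfold.
import torch
import torch.nn.functional as F

x = torch.randn(2, 64, 56, 56)                   # (batch, channels, H, W) feature map
k, d = 3, 4                                      # 3x3 window, dilation 4
pad = d * (k - 1) // 2                           # padding that preserves spatial size
patches = F.unfold(x, kernel_size=k, dilation=d, padding=pad)  # (B, C*k*k, H*W)
patches = patches.view(2, 64, k * k, 56 * 56)    # per-pixel dilated neighborhood
# patches[..., i] now holds the k*k dilated neighbors of pixel i, ready to serve
# as keys/values in a neighborhood-attention or token-mixing block.
```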
Stage-wise allocation of dilation (e.g., choosing per-stage dilation values in DiNAT to match each stage's feature-map resolution) enables efficient global context fusion at multiple scales (Hassani et al., 2022, Manzari et al., 19 Feb 2025). Overlapped inference reduces boundary artifacts in CT denoising (Wang et al., 2022).
7. Outlook and Ongoing Research
Dilated Transformers offer a principled approach to scaling self-attention for long-range and multi-scale modeling in diverse modalities. Ongoing research is exploring dynamic dilation, distributed training orchestration (LongNet), extensions to 3D, multimodal input, and the fusion of physical models (e.g. PPNMM in hyperspectral unmixing) with Transformer architectures (Ding et al., 2023, Wang et al., 5 Mar 2025). The consensus across domains is that dilated attention mechanisms can bridge the gap between convolutional locality and full global self-attention, achieving high accuracy and efficiency in dense prediction, sequence modeling, and nonlinear inference tasks.