Dilated Convolutions: Methods & Applications
- Dilated convolutions are generalized operators that increase the receptive field by inserting holes into kernels, enabling efficient aggregation of local and global context.
- Adaptive, learnable dilation variants adjust the dilation factor per channel, improving tasks such as semantic segmentation by balancing fine detail with broader scene context.
- Hardware-aware optimizations and smoothing techniques mitigate gridding artifacts, ensuring effective context integration and parameter efficiency in dense prediction tasks.
Dilated convolutions are generalized convolutional operators that address the need for large receptive fields in neural architectures without a commensurate increase in parameter count or spatial downsampling. By introducing regularly spaced “holes” (gaps whose size is set by the dilation factor) into the convolutional kernel, they enable the simultaneous aggregation of local and global context, which is particularly advantageous in dense prediction tasks, sequential modeling, point cloud analysis, and spatiotemporal segmentation. The development of learnable and adaptive dilation mechanisms, interpolative extensions, and hardware-aware optimizations has positioned dilated convolutions as a critical module in contemporary deep learning frameworks.
1. Mathematical Foundations and Variants
The canonical 2D dilated convolution at output location $\mathbf{p}$ with dilation factor $d$ is given by

$$y(\mathbf{p}) = \sum_{\mathbf{k}} w(\mathbf{k})\, x(\mathbf{p} + d\,\mathbf{k}) + b,$$

where $x$ is the input feature map, $w$ the kernel weights, $b$ the bias, and $d$ typically an integer that is hand-tuned. The receptive field is expanded to $d(K-1)+1$ along each axis for kernel size $K$, providing a parametrically efficient means of enlarging context.
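As a quick sanity check (a minimal PyTorch sketch; the layer and tensor sizes are illustrative, not taken from any cited work), `torch.nn.Conv2d` exposes the dilation factor directly, and the parameter count is visibly independent of $d$:

```python
import torch
import torch.nn as nn

# A 3x3 kernel with dilation d covers d*(K-1)+1 = 2d+1 input rows/cols.
K, d = 3, 4
conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=K,
                 dilation=d, padding=d * (K - 1) // 2, bias=False)

x = torch.randn(1, 1, 64, 64)
y = conv(x)
print(y.shape)  # torch.Size([1, 1, 64, 64]) -- "same" padding preserved

# Parameter count is independent of the dilation factor.
print(sum(p.numel() for p in conv.parameters()))  # 9 weights for any d
```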
Further generalizations include:
- Adaptive and Learnable Dilations: Instead of a fixed integer $d$, a per-channel dilation $d_c$ is learned via backpropagation, requiring bilinear (or higher-order) interpolation to sample the input at non-integral locations, enabling fractional receptive fields and channelwise adaptivity (see Section 2 and (He et al., 2017)).
- Non-grid Spacings (DCLS): Positions of nonzero kernel elements are parameterized as learnable, possibly continuous-valued coordinates. Smooth interpolation (bilinear, triangle, Gaussian) replaces fixed-grid sampling, ensuring differentiability (Section 4, (Khalfaoui-Hassani et al., 2021, Khalfaoui-Hassani et al., 2023, Khalfaoui-Hassani, 10 Aug 2024)).
2. Adaptive Dilation Learning and Context-Aware Segmentation
Learnable dilations address the heterogeneity of object scale and context dependence. By parameterizing the dilation $d_c$ for each input channel and enabling gradient-based adaptation, architectures achieve:
- Fine-grained Adaptivity: Small dilations retain detail; large dilations aggregate context, enabling differential handling of, e.g., small traffic signs and large buildings within the same scene (He et al., 2017).
- Channelwise Heterogeneity: Flexible assignment of receptive field per channel in each layer, outperforming fixed-dilation baselines in metrics such as Mean IoU (e.g., 62.5% → 63.3% on Cityscapes).
- End-to-End Training: Differentiable bilinear interpolation permits gradients to flow to both the weights and the dilation parameters. The per-channel output is
$$y_c(\mathbf{p}) = \sum_{\mathbf{k}} w_c(\mathbf{k})\, \hat{x}_c(\mathbf{p} + d_c\,\mathbf{k}) + b_c,$$
where $\hat{x}_c$ denotes bilinear sampling of channel $c$ at possibly fractional locations, enabling channel/layer-specific context range selection (a minimal sketch is given below). This was demonstrated effective in semantic segmentation backbones (Deeplab-LargeFOV, Deeplab-v2, PSPNet), especially on classes with high interclass confusion.
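The following is a minimal PyTorch sketch of this idea (the helper `fractional_dilated_depthwise` and its shapes are illustrative, not the authors' implementation): a depthwise convolution in which each channel carries its own real-valued dilation, with the fractional sampling done by `torch.nn.functional.grid_sample` so that gradients reach both the weights and the dilation factors.

```python
import torch
import torch.nn.functional as F

def fractional_dilated_depthwise(x, weight, dilation):
    """Depthwise 2D convolution with a real-valued, learnable dilation per channel.

    x:        (N, C, H, W) input feature map
    weight:   (C, K, K) depthwise kernel
    dilation: (C,) real-valued dilation factors (set requires_grad=True to learn them)
    """
    N, C, H, W = x.shape
    K = weight.shape[-1]
    center = (K - 1) / 2.0

    # Base sampling grid in normalized [-1, 1] coordinates, shape (H, W, 2), (x, y) order.
    ys = torch.linspace(-1, 1, H, device=x.device)
    xs = torch.linspace(-1, 1, W, device=x.device)
    base_y, base_x = torch.meshgrid(ys, xs, indexing="ij")
    base = torch.stack((base_x, base_y), dim=-1)

    out = torch.zeros_like(x)
    for i in range(K):
        for j in range(K):
            # Fractional pixel offset d_c * (k - center) for every channel.
            off_y = dilation * (i - center)                      # (C,)
            off_x = dilation * (j - center)                      # (C,)
            # Convert pixel offsets to normalized grid coordinates.
            off = torch.stack((2 * off_x / max(W - 1, 1),
                               2 * off_y / max(H - 1, 1)), dim=-1)      # (C, 2)
            grid = base.unsqueeze(0) + off.view(C, 1, 1, 2)             # (C, H, W, 2)
            grid = grid.unsqueeze(0).expand(N, -1, -1, -1, -1)          # (N, C, H, W, 2)
            # Sample every channel independently with bilinear interpolation.
            sampled = F.grid_sample(
                x.reshape(N * C, 1, H, W),
                grid.reshape(N * C, H, W, 2),
                mode="bilinear", padding_mode="zeros", align_corners=True)
            out = out + weight[:, i, j].view(1, C, 1, 1) * sampled.view(N, C, H, W)
    return out
```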
3. Decomposition, Smoothing, and Mitigation of Gridding Artifacts
Dilated convolutions with high dilation factors are prone to “gridding” artifacts—adjacent outputs drawn from disjoint input regions—resulting in spatial inconsistency.
- Decomposition View: The operation can be reformulated as periodic subsampling, shared standard convolution, and reinterlacing. Smoothing is achieved by “degridding,” i.e., fusing separated information streams (Wang et al., 2018).
- SS (Separable and Shared) Operations: Smoothing across the subsampled groups is achieved via group-interaction FC layers (post-convolution mixing), SS convolutions (pre-subsampling filtering), or SS output layers (block graph attention); each variant shares a small parameter set and can be implemented efficiently.
- Alternative Smoothing: Pre-filtering with local averaging or a Gaussian before the dilated operator reduces gridding at minimal cost, and adaptive convex combinations of such filters further improve robustness (Section 5, (Ziegler et al., 2019)); a minimal sketch of the pre-filtering variant follows this list.
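A minimal PyTorch sketch of the pre-filtering variant (the module name `SmoothedDilatedConv2d` is hypothetical and uses a simple box filter; the cited works study richer smoothing operators):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SmoothedDilatedConv2d(nn.Module):
    """Illustrative sketch: apply a small fixed averaging filter before a dilated
    convolution so that neighbouring outputs no longer depend on fully disjoint
    input pixels (not the exact formulation of the cited works)."""

    def __init__(self, channels, kernel_size=3, dilation=4, smooth=3):
        super().__init__()
        pad = dilation * (kernel_size - 1) // 2
        self.dilated = nn.Conv2d(channels, channels, kernel_size,
                                 padding=pad, dilation=dilation)
        # Depthwise box filter acting as the pre-smoothing operator.
        box = torch.full((channels, 1, smooth, smooth), 1.0 / (smooth * smooth))
        self.register_buffer("box", box)
        self.smooth_pad = smooth // 2

    def forward(self, x):
        x = F.conv2d(x, self.box, padding=self.smooth_pad, groups=x.shape[1])
        return self.dilated(x)

# Usage: y = SmoothedDilatedConv2d(64, dilation=8)(torch.randn(1, 64, 32, 32))
```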
4. Learnable Spacings: DCLS and Beyond
Dilated Convolution with Learnable Spacings (DCLS) eliminates the constraint of a fixed regular grid, introducing a position $p_k$ for each kernel element $w_k$; the positions are real-valued and updated by gradient descent.
- Interpolation for Differentiability: Each weight's contribution is distributed over neighboring integer positions via an interpolation kernel (a 1D sketch of the construction follows this list):
- Bilinear/Triangle: $\Lambda(p) = \max\!\big(0,\, 1 - |p - p_k|\big)$, applied separably along each spatial axis, splitting the weight between the nearest integer taps.
- Gaussian: $G_{\sigma}(p) = \exp\!\big(-\tfrac{(p - p_k)^2}{2\sigma^2}\big)$ (suitably normalized), whose support extends beyond the nearest taps.
- Learning the spread parameters $\sigma$ allows the kernel’s “focus” to be dynamic.
- Empirical Results: In vision (ConvNeXt, ConvFormer, FastViT), DCLS with Gaussian interpolation improved accuracy (e.g., +0.6% top-1 ImageNet, same number of parameters; lower loss and better robustness than fixed-grid dilations) (Khalfaoui-Hassani et al., 2023, Khalfaoui-Hassani, 10 Aug 2024).
- 1D DCLS for Synaptic Delay (SNNs): In SNNs, each weight encodes both synaptic strength and a learnable delay, optimizing coincidence detection and boosting performance in temporal audio classification.
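The following 1D sketch illustrates the DCLS construction referenced above (the class name `DCLS1dSketch` is hypothetical and uses triangle interpolation only; the official DCLS implementation differs in scope and detail): learnable real-valued positions scatter each weight onto a dense kernel by linear interpolation, and the dense kernel is then used in an ordinary depthwise convolution.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DCLS1dSketch(nn.Module):
    """Minimal sketch of dilated convolution with learnable spacings (1D,
    triangle interpolation). Each of the `kernel_count` weights has a
    real-valued position inside a dense kernel of size `dense_size`; the
    weight is split between the two nearest integer taps so gradients reach
    both the weights and the positions."""

    def __init__(self, channels, kernel_count=3, dense_size=17):
        super().__init__()
        self.dense_size = dense_size
        self.weight = nn.Parameter(torch.randn(channels, kernel_count) * 0.1)
        # Positions initialised on a regular grid, then learned freely.
        init = torch.linspace(1.0, dense_size - 2.0, kernel_count)
        self.pos = nn.Parameter(init.repeat(channels, 1))

    def build_dense_kernel(self):
        idx = torch.arange(self.dense_size, device=self.weight.device)
        # Triangle (linear) interpolation: max(0, 1 - |i - p|) for every dense tap i.
        interp = torch.clamp(
            1.0 - (idx.view(1, 1, -1) - self.pos.unsqueeze(-1)).abs(), min=0.0)
        return (self.weight.unsqueeze(-1) * interp).sum(dim=1)   # (C, dense_size)

    def forward(self, x):                                        # x: (N, C, L)
        kernel = self.build_dense_kernel().unsqueeze(1)          # (C, 1, dense_size)
        return F.conv1d(x, kernel, padding=self.dense_size // 2, groups=x.shape[1])

# Usage: y = DCLS1dSketch(channels=8)(torch.randn(2, 8, 100))
```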
5. Dilated Convolutions in Structured and Spatiotemporal Domains
Manifold-Valued Data:
- Adapting dilated convolution to non-Euclidean data spaces (e.g., SPD matrices, ODFs in neuroimaging) involves replacing the linear summation with the weighted Fréchet mean
$$\mathrm{wFM}\big(\{x_i\},\{w_i\}\big) = \arg\min_{m \in \mathcal{M}} \sum_i w_i\, d_{\mathcal{M}}^2(x_i, m), \qquad w_i \ge 0,\ \ \sum_i w_i = 1,$$
where $d_{\mathcal{M}}$ is the geodesic distance on the manifold.
This enables DCNNs to respect data geometry, improving sensitivity and equivariance (Zhen et al., 2019).
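For concreteness, a small NumPy sketch of the weighted Fréchet mean under a simplifying assumption: with the log-Euclidean metric on SPD matrices the minimizer has the closed form $\exp\!\big(\sum_i w_i \log X_i\big)$. The metric and estimator used in the cited work may differ, so this only illustrates the operation that replaces the linear sum.

```python
import numpy as np

def _sym_logm(m):
    """Matrix logarithm of a symmetric positive definite matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(m)
    return (vecs * np.log(vals)) @ vecs.T

def _sym_expm(m):
    """Matrix exponential of a symmetric matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(m)
    return (vecs * np.exp(vals)) @ vecs.T

def weighted_frechet_mean_spd(mats, weights):
    """Weighted Frechet mean of SPD matrices under the log-Euclidean metric
    (closed form; a simplifying assumption made for this sketch)."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                                   # convex weights
    return _sym_expm(sum(wi * _sym_logm(m) for wi, m in zip(w, mats)))

# Usage: the "dilated" part only changes *which* neighbours are averaged,
# e.g. taking every d-th element of a sequence of SPD-valued features.
A = np.array([[2.0, 0.3], [0.3, 1.0]])
B = np.array([[1.0, 0.0], [0.0, 3.0]])
print(weighted_frechet_mean_spd([A, B], [0.7, 0.3]))  # an SPD matrix "between" A and B
```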
3D and Spatio-Temporal Extensions:
- Dynamic Dilated Convolutions (DConv3D): In 3D CNNs for video, dilation factors for each axis are predicted dynamically (per spatiotemporal location), enabling flexible “grid” adaptation while controlling boundary overflow (Schmidt et al., 2021).
- Dilated Point Convolutions: For 3D point clouds, neighbor selection via “dilated KNN” (computing the $k \cdot d$ nearest neighbors and keeping every $d$-th) augments the receptive field without increasing parameter count, improving segmentation and classification accuracy (Engelmann et al., 2019); see the sketch below.
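A minimal NumPy sketch of the dilated k-NN selection (the helper `dilated_knn` and its parameter names are illustrative):

```python
import numpy as np

def dilated_knn(points, query, k=16, d=4):
    """Dilated k-NN sketch: compute the k*d nearest neighbours of `query` and
    keep every d-th one, enlarging the receptive field at zero parameter cost.

    points: (N, 3) array, query: (3,) array. Returns k neighbour indices."""
    dist = np.linalg.norm(points - query, axis=1)
    nearest = np.argsort(dist)[: k * d]      # k*d nearest neighbour indices
    return nearest[::d]                      # keep every d-th -> k "dilated" neighbours

# Usage
pts = np.random.rand(1024, 3)
idx = dilated_knn(pts, pts[0], k=16, d=4)
print(idx.shape)   # (16,)
```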
6. Hardware-Aware Implementations and Efficiency
EcoFlow Dataflows:
- Standard dilated and transposed convolutions involve heavy zero padding, resulting in unnecessary compute and memory access on spatial accelerators.
- EcoFlow devises compile-time symbolic mapping and runtime multicast/accumulation policies that bypass zero multiplicands, substantially improving runtime and energy (up to a 4× speedup, with corresponding energy savings in some regimes) while preserving hardware compatibility (Orosa et al., 2022); a toy illustration of the underlying waste follows below.
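As a toy illustration of the waste EcoFlow avoids (plain arithmetic, not EcoFlow's actual dataflow or mapping): materializing a dilated kernel as a dense zero-inserted kernel multiplies mostly by zeros, whereas gathering only the strided inputs performs the same computation with $K^2$ useful multiply-accumulates per output.

```python
import numpy as np

# Count useful vs. wasted MACs when a K x K kernel with dilation d is
# materialised as a dense zero-inserted kernel.
K, d = 3, 4
dense_size = d * (K - 1) + 1            # 9 for K=3, d=4
total_taps = dense_size ** 2            # 81 multiplies per output if zeros are kept
useful_taps = K ** 2                    # 9 multiplies actually contribute
print(f"useful fraction: {useful_taps / total_taps:.2%}")   # ~11%; the rest multiply by zero

# Skipping the zero taps means gathering inputs with stride d instead:
x = np.arange(64 * 64, dtype=float).reshape(64, 64)
w = np.random.rand(K, K)
r = dense_size // 2
out_pixel = sum(w[i, j] * x[r + (i - K // 2) * d, r + (j - K // 2) * d]
                for i in range(K) for j in range(K))         # one output value at (r, r)
print(out_pixel)
```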
Parameter Efficiency and Throughput:
- Dilated convolutions (and their learnable variants) maintain large receptive fields with no increase in parameters, permitting deep, inexpensive feature hierarchies (critical for mobile and embedded devices). In separable architectures (e.g., ConvNeXt), learnable-spacing modules introduce only a marginal throughput reduction (Khalfaoui-Hassani et al., 2021, Khalfaoui-Hassani, 10 Aug 2024).
7. Applications and Impact
Dilated convolutions and their descendants are central to:
- Semantic Segmentation: Adaptive dilation, lateral inhibition, and multi-scale merging enable superior delineation of objects across scales, particularly in street scenes, remote sensing, and medical imaging (He et al., 2017, Liu et al., 2019, Vesal et al., 2018).
- Genomics and Long-Sequence Modeling: Exponentially growing receptive fields afforded by stacking dilated layers (see the arithmetic sketch after this list) capture long-range dependencies, outperforming standard CNNs and RNNs in genomic marker prediction (Gupta et al., 2017).
- Audio and Speech: Deep, gated, and residual dilated stacks model long temporal dependencies in speech enhancement and keyword spotting without RNNs' sequential limitations (Coucke et al., 2018, Gong et al., 2019).
- Spiking Neural Networks: DCLS-based learned delays enhance temporal pattern recognition in audio SNNs, establishing new benchmarks in neuromorphic applications (Khalfaoui-Hassani, 10 Aug 2024).
- Dense Prediction and Instance Recognition: Channelwise adaptivity, inception convolutions with efficient dilation search, and context aggregation yield state-of-the-art performance across object detection, instance segmentation, and human pose estimation (Liu et al., 2020).
- Biomedical Imaging and 3D Vision: Dilated networks in 3D architectures and point cloud domains expand context hierarchically without increasing resource requirements (Vesal et al., 2018, Engelmann et al., 2019).
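A quick check of the receptive-field arithmetic referenced in the genomics bullet (a standalone sketch; the kernel sizes and dilation schedules are generic, not those of any cited model):

```python
# Receptive field of a stack of 1D dilated convolutions:
# RF = 1 + sum_l (K_l - 1) * d_l.  With K = 2 and dilations 1, 2, 4, ..., 2^(L-1),
# RF = 2^L: exponential growth in depth at constant per-layer parameter cost.
def receptive_field(kernel_sizes, dilations):
    return 1 + sum((k - 1) * d for k, d in zip(kernel_sizes, dilations))

L = 10
print(receptive_field([2] * L, [2 ** i for i in range(L)]))   # 1024 samples
print(receptive_field([3] * 4, [1, 2, 4, 8]))                 # 31, a 4-layer example
```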
8. Future Directions and Open Challenges
Continued research focuses on:
- Architectures Tailored for Learnable Spacings: Dedicated designs exploiting adaptive spatial kernels, further closing the gap between convolutional and attention-based models in high-level vision tasks (Khalfaoui-Hassani, 10 Aug 2024).
- Advanced Interpolation Schemes: Beyond bilinear or Gaussian (e.g., Whittaker–Shannon), seeking improved gradients and better alignment with data geometry (Khalfaoui-Hassani et al., 2023).
- Domain-Specific Extensions: Expanding to 3D/4D data, graph-structured input, and hybrid models integrating attention, potentially with multi-modal fusion (Khalfaoui-Hassani, 10 Aug 2024).
- Hardware–Algorithm Co-Design: Further optimizing data movement, computation re-use, and fitting to future accelerator architectures.
Open questions include the explainability and interpretability of learned kernel spacings in DCLS, how best to regularize or constrain learned positions for stability and generalization, and optimization of computational efficiency in non-Euclidean and hybrid models.
In conclusion, dilated convolutions and their advanced derivatives, notably those with learnable and adaptive spacings, constitute a versatile toolkit. They enable efficient context capture, flexible receptive field modulation, and high accuracy across a spectrum of domains—segmentation, detection, sequence modeling, structured data analysis, spiking computation, and beyond—representing a foundational component in modern deep learning architectures.