Depthwise Dilated Convolution
- Depthwise dilated convolution is a neural operation that applies channel-wise filtering with spatial dilation to expand the receptive field efficiently.
- It integrates techniques like grouped structures and learnable spacing to extract multi-scale features in applications such as embedded vision, audio analysis, and remote sensing.
- Empirical results show that DDC reduces parameters and computational cost while maintaining or improving accuracy, making it ideal for resource-constrained environments.
Depthwise Dilated Convolution (DDC) designates a class of neural network operations that extend standard depthwise convolution by introducing dilation into the spatial sampling of each channel’s kernel, enabling dramatic enlargement of the receptive field without any increase in parameter count or computational cost per filter. DDC and its variants are a key building block in lightweight, high-throughput architectures for embedded vision, audio analysis, and remote sensing, allowing competitive or superior accuracy compared to classical dense convolutional paradigms, with orders-of-magnitude reductions in parameter count and energy footprint.
1. Mathematical Formulation and Definitions
Let $X \in \mathbb{R}^{H \times W \times C}$ denote the input tensor; $K_c \in \mathbb{R}^{k \times k}$ the per-channel kernel for channel $c$ of spatial dimension $k \times k$; and $d$ the dilation rate. The output of a depthwise dilated convolution is

$$Y_c(i,j) = \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} K_c(m,n)\, X_c(i + d\,m,\; j + d\,n),$$

as formalized in (Chen et al., 2021, Muhammad et al., 1 May 2025, Nedeljković, 8 Dec 2025, Drossos et al., 2020), and (Zeng et al., 2023). Here, each input channel is filtered independently using its own kernel, while the dilation factor $d$ determines the spacing by which kernel taps are "spread" spatially, expanding the effective receptive field from $k \times k$ to $\bigl(d(k-1)+1\bigr) \times \bigl(d(k-1)+1\bigr)$ with no increase in weights or MACs.
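As a concrete check of this formula, the following PyTorch sketch (a minimal illustration; tensor sizes and indices are arbitrary, not taken from the cited papers) evaluates a depthwise dilated convolution as a grouped `F.conv2d` call and compares one output location against the explicit double sum.

```python
import torch
import torch.nn.functional as F

C, k, d = 4, 3, 2                      # channels, kernel size, dilation rate
x = torch.randn(1, C, 16, 16)          # illustrative input (N, C, H, W)
w = torch.randn(C, 1, k, k)            # one k x k kernel per channel

# Depthwise dilated convolution: groups=C applies each kernel to its own channel,
# with taps spaced d pixels apart (receptive field d*(k-1)+1 = 5 here).
y = F.conv2d(x, w, dilation=d, groups=C)

# Verify Y_c(i, j) = sum_{m,n} K_c(m, n) * X_c(i + d*m, j + d*n) at one location.
i, j, c = 5, 7, 2
manual = sum(w[c, 0, m, n] * x[0, c, i + d * m, j + d * n]
             for m in range(k) for n in range(k))
print(torch.isclose(y[0, c, i, j], manual))   # tensor(True)
```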
Depthwise separable dilated convolution extends this by (1) applying the above operation per channel, then (2) merging the result with a pointwise $1 \times 1$ convolution for cross-channel mixing:

$$Z(i,j,o) = \sum_{c=1}^{C} W_{o,c}\, Y_c(i,j),$$

where $W \in \mathbb{R}^{C_{\text{out}} \times C}$ collects the pointwise mixing weights (Muhammad et al., 1 May 2025, Drossos et al., 2020, Zeng et al., 2023).
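A minimal module-level sketch of this depthwise-then-pointwise composition, assuming a plain two-layer layout (the blocks in the cited papers may add normalization and activations):

```python
import torch
import torch.nn as nn

class DepthwiseSeparableDilated(nn.Module):
    """Depthwise dilated k x k conv followed by a 1x1 pointwise conv."""
    def __init__(self, c_in, c_out, k=3, d=1):
        super().__init__()
        pad = d * (k - 1) // 2           # "same" padding for odd k
        self.depthwise = nn.Conv2d(c_in, c_in, k, padding=pad,
                                   dilation=d, groups=c_in, bias=False)
        self.pointwise = nn.Conv2d(c_in, c_out, kernel_size=1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

block = DepthwiseSeparableDilated(32, 64, k=3, d=2)
print(block(torch.randn(1, 32, 56, 56)).shape)      # torch.Size([1, 64, 56, 56])
print(sum(p.numel() for p in block.parameters()))   # 32*9 + 32*64 = 2336
```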
2. Architectural Implementations and Variants
The architectural integration of depthwise dilated convolution differs across domains and objectives but fundamentally exploits the same core principles.
Unified hardware acceleration: The design in (Chen et al., 2021) proposes a single, highly parameterized accelerator (CLPU) that supports regular, depthwise, and dilated convolutions by using its address generator to emulate stride and dilation, with no extra logic or buffer overhead. Odd-sized filters and arbitrary dilation rates are supported efficiently.
Grouped/multi-rate structures: GlimmerNet (Nedeljković, 8 Dec 2025) introduces the Grouped Dilated Depthwise Convolution block (GDBlock), partitioning the $C$ channels into $G$ groups, each group $g$ with its own kernel and dilation factor $d_g$. This enables parallel multi-scale extraction, augmenting spatial diversity at constant parameter cost. An aggregator module based on grouped $1 \times 1$ pointwise convolutions fuses these features efficiently.
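The grouped multi-rate idea can be sketched as follows (a hedged approximation, not GlimmerNet's published GDBlock; the group count, dilation rates, and class name are illustrative):

```python
import torch
import torch.nn as nn

class GroupedDilatedDepthwise(nn.Module):
    """Split channels into one group per dilation rate, filter each group
    depthwise at its own rate, then fuse with a grouped 1x1 convolution."""
    def __init__(self, channels, dilations=(1, 2, 3, 4), k=3):
        super().__init__()
        assert channels % len(dilations) == 0
        cg = channels // len(dilations)              # channels per group
        self.branches = nn.ModuleList([
            nn.Conv2d(cg, cg, k, padding=d * (k - 1) // 2,
                      dilation=d, groups=cg, bias=False)
            for d in dilations
        ])
        # Grouped pointwise aggregator: C*C/G weights instead of C*C.
        self.aggregate = nn.Conv2d(channels, channels, kernel_size=1,
                                   groups=len(dilations), bias=False)

    def forward(self, x):
        chunks = torch.chunk(x, len(self.branches), dim=1)
        y = torch.cat([b(c) for b, c in zip(self.branches, chunks)], dim=1)
        return self.aggregate(y)

m = GroupedDilatedDepthwise(64)
print(m(torch.randn(1, 64, 32, 32)).shape)   # torch.Size([1, 64, 32, 32])
```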
Dilated fusion blocks: DSDCN (Muhammad et al., 1 May 2025) implements a parallel fusion module for hyperspectral super-resolution, applying depthwise separable convolutions with several dilation rates in parallel, concatenating the branches, and fusing the result with a pointwise convolution, a pattern also seen in audio (Zeng et al., 2023) and image contexts.
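A hedged sketch of this parallel-dilation fusion pattern (the dilation rates, channel sizes, and class name are assumptions, not the published DSDCN configuration):

```python
import torch
import torch.nn as nn

class DilatedFusionBlock(nn.Module):
    """Parallel depthwise separable convs at several dilation rates,
    concatenated and fused by a pointwise convolution."""
    def __init__(self, c_in, c_out, rates=(1, 2, 4), k=3):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(c_in, c_in, k, padding=r * (k - 1) // 2,
                          dilation=r, groups=c_in, bias=False),    # depthwise, rate r
                nn.Conv2d(c_in, c_out, kernel_size=1, bias=False), # pointwise
            )
            for r in rates
        ])
        self.fuse = nn.Conv2d(len(rates) * c_out, c_out, kernel_size=1, bias=False)

    def forward(self, x):
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))

blk = DilatedFusionBlock(32, 32)
print(blk(torch.randn(1, 32, 48, 48)).shape)   # torch.Size([1, 32, 48, 48])
```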
Learnable spatial arrangements: DCLS (Khalfaoui-Hassani et al., 2021) replaces regular-grid fixed dilation with learnable floating-point positions within a large kernel grid, constructing sparse, channel-specific kernels via bilinear interpolation. This technique is especially effective in depthwise settings, where each channel’s spatial context can be adaptively shaped during learning.
3. Computational and Memory Complexity
The parameter and computational savings of depthwise dilated convolution are substantial:
| Operator | Parameters | MACs per output location | Receptive Field |
|---|---|---|---|
| Standard convolution | $k^2\,C_{\text{in}}C_{\text{out}}$ | $k^2\,C_{\text{in}}C_{\text{out}}$ | $k \times k$ |
| Depthwise separable | $k^2\,C_{\text{in}} + C_{\text{in}}C_{\text{out}}$ | $k^2\,C_{\text{in}} + C_{\text{in}}C_{\text{out}}$ | $k \times k$ |
| Depthwise dilated | $k^2\,C_{\text{in}} + C_{\text{in}}C_{\text{out}}$ | $k^2\,C_{\text{in}} + C_{\text{in}}C_{\text{out}}$ | $\bigl(d(k-1)+1\bigr) \times \bigl(d(k-1)+1\bigr)$ |
| Grouped DDC (GlimmerNet) | $k^2\,C_{\text{in}} + C_{\text{in}}C_{\text{out}}/G$ (grouped pointwise) | $k^2\,C_{\text{in}} + C_{\text{in}}C_{\text{out}}/G$ | $\bigl(d_g(k-1)+1\bigr) \times \bigl(d_g(k-1)+1\bigr)$ per group $g$ |
Parameter reduction for depthwise-separable compared to standard convolution is approximately a factor of $k^2$ (assuming $C_{\text{out}} \gg k^2$). The introduction of dilation does not change the parameter count or number of MACs, only the spatial pattern of memory access and thus the receptive field (Chen et al., 2021, Muhammad et al., 1 May 2025, Nedeljković, 8 Dec 2025, Zeng et al., 2023). Grouped and multi-dilated structures further multiply receptive field diversity at fixed cost.
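To make these counts concrete, a small helper (function names are illustrative) evaluates the formulas from the table above; note that the dilation rate never appears in the counts.

```python
def conv_params(c_in, c_out, k):
    """Standard k x k convolution."""
    return k * k * c_in * c_out

def ddc_params(c_in, c_out, k, pw_groups=1):
    """Depthwise (optionally dilated) k x k conv + (optionally grouped) 1x1
    pointwise conv. Dilation changes the receptive field, not these counts."""
    return k * k * c_in + c_in * c_out // pw_groups

c_in = c_out = 256
k = 3
print(conv_params(c_in, c_out, k))              # 589824
print(ddc_params(c_in, c_out, k))               # 67840 (~8.7x fewer; -> k^2 = 9x as c_out grows)
print(ddc_params(c_in, c_out, k, pw_groups=4))  # 18688 with a 4-way grouped pointwise
```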
4. Practical Performance and Empirical Studies
Across vision, hyperspectral imaging, and audio domains, DDC layers consistently achieve strong or superior accuracy at a fraction of the computational burden:
- Face detection (RetinaFace, MobileNetV1-0.25): Replacing the cascaded context modules with a depthwise dilated convolution plus pointwise convolution reduces the context-module size from 138 KB to 23 KB and its compute from 708 to 119 MACs/pixel, about 31% lower total compute. Validation accuracy remains within 1% of the baseline, and the hardware accelerator delivers 20% higher throughput (180 vs. 150 fps) (Chen et al., 2021).
- ImageNet classification: Swapping in depthwise convolutions increases Top-1 from 68.39% to 69.44%; hybrid variants that apply dilation in the late layers yield further small gains (Chen et al., 2021).
- DCLS on ConvNeXt: Replacing depthwise convolutions with learnable DCLS kernels improves Top-1 accuracy from 82.1% to 82.5%, outperforming both classic dilation (80.8%) and large dense kernels (82.0%), with only a minor throughput hit (7% slowdown) (Khalfaoui-Hassani et al., 2021).
- Hyperspectral SR (DSDCN): With a parameter budget of roughly 1M, DSDCN ranks first in PSNR and SSIM on the PaviaC dataset, matching or beating larger models and confirming the efficiency and efficacy of DDC for high-dimensional input (Muhammad et al., 1 May 2025).
- Audio event classification: Multi-scale dilated depthwise-separable conv boosts accuracy by 3.4% in domestic activity recognition compared to non-dilated; for polyphonic SED ensembles, depthwise-dilated blocks yield 85% parameter reduction and >78% faster training per epoch, with +4.6% F1 and –3.8% error (Zeng et al., 2023, Drossos et al., 2020).
- Ultra-lightweight emergency scene recognition: GlimmerNet achieves F1 = 0.966 with only 31.2K parameters and 22.3M FLOPs, outperforming much larger transformer and CNN baselines that require many times more parameters and FLOPs (Nedeljković, 8 Dec 2025).
5. Receptive Field Control and Multi-Scale Contextualization
Dilation in DDC explicitly decouples receptive-field growth from parameter count. For a single $3 \times 3$ kernel with $d = 2$, the receptive field expands from 3 to 5 while using only 9 parameters rather than 25. Layer stacking, or grouped variants (each group with a different $d_g$), further extends context. In GlimmerNet, grouped DDC and staged downsampling yield large effective receptive fields at full resolution after a single pooling layer.
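The receptive-field arithmetic is easy to verify; the short helper below (names illustrative) computes $d(k-1)+1$ for a single layer and the effective receptive field of a stack of stride-1 dilated layers.

```python
def dilated_rf(k, d):
    """Receptive field of a single k x k kernel at dilation d."""
    return d * (k - 1) + 1

def stacked_rf(layers):
    """Effective receptive field of a stack of stride-1 layers,
    each given as a (kernel_size, dilation) pair."""
    rf = 1
    for k, d in layers:
        rf += d * (k - 1)
    return rf

print(dilated_rf(3, 2))                        # 5: a 3x3 kernel (9 weights) covers a 5x5 extent
print(stacked_rf([(3, 1), (3, 2), (3, 4)]))    # 15, with only 27 weights per channel
```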
Multi-scale embedding strategies, as in (Zeng et al., 2023), concatenate pooled outputs from multiple DDC blocks at different depths. Empirical ablations demonstrate that dilation and multi-scale pooling each independently boost classification accuracy.
6. Advanced Variants: Learnable Spacings and Grouped Structures
DCLS (Khalfaoui-Hassani et al., 2021) generalizes DDC by making the spatial positions of kernel weights differentiable and learnable via backpropagation. Each DCLS kernel consists of weights $w_i$ and continuous positions $(p^x_i, p^y_i)$ within an $s \times s$ grid, with bilinear interpolation projecting each weight onto the grid:

$$K(m,n) = \sum_{i} w_i \,\max\!\bigl(0,\, 1 - |p^x_i - m|\bigr)\,\max\!\bigl(0,\, 1 - |p^y_i - n|\bigr),$$

followed by standard depthwise convolution with the constructed kernel. This can expand the receptive field almost arbitrarily without increasing kernel density, and consistently outperforms both fixed-rate dilation and very large fixed kernels.
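A minimal sketch of this construction (simplified relative to the published DCLS implementation, which handles position initialization, clamping, and scaling differently; the class and parameter names are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DCLSDepthwise2d(nn.Module):
    """Depthwise conv whose kernel is built from a few learnable taps per
    channel, each with a weight and a continuous (x, y) position in an
    s x s grid (simplified DCLS-style construction)."""
    def __init__(self, channels, grid_size=7, taps=3):
        super().__init__()
        self.c, self.s = channels, grid_size
        self.weight = nn.Parameter(torch.randn(channels, taps) * 0.1)
        # Continuous positions in [0, s-1]; gradients flow through the interpolation.
        self.pos = nn.Parameter(torch.rand(channels, taps, 2) * (grid_size - 1))

    def build_kernel(self):
        grid = torch.arange(self.s, dtype=self.pos.dtype, device=self.pos.device)
        # Triangle (bilinear) weights per axis: max(0, 1 - |p - m|), shape (C, taps, s).
        wx = (1 - (self.pos[..., 0:1] - grid).abs()).clamp(min=0)
        wy = (1 - (self.pos[..., 1:2] - grid).abs()).clamp(min=0)
        # Scatter each tap's weight onto the s x s grid and sum over taps.
        k = torch.einsum('ct,ctm,ctn->cmn', self.weight, wx, wy)   # (C, s, s)
        return k.unsqueeze(1)                                      # (C, 1, s, s)

    def forward(self, x):
        return F.conv2d(x, self.build_kernel(), padding=self.s // 2, groups=self.c)

m = DCLSDepthwise2d(16)
print(m(torch.randn(1, 16, 32, 32)).shape)   # torch.Size([1, 16, 32, 32])
```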
In grouped constructs like GlimmerNet (Nedeljković, 8 Dec 2025), the total parameter and compute cost is held constant by dividing channels into groups, applying distinct dilation per group, then recombining feature maps via grouped pointwise convolutions. This strategy enables simultaneous extraction of local and global features and leads to significant accuracy improvements in small-footprint networks.
7. Limitations, Trade-offs, and Design Best Practices
- Accuracy vs. extreme compression: While DDC and its derivatives incur minimal accuracy loss compared to conventional or even transformer-based architectures, aggressive replacement of context modules or excessive spatial simplification (e.g., reducing all context processing to DDC) can still cost a sub-1% drop in accuracy (Chen et al., 2021).
- Hardware design: Dilation and large kernels can be supported essentially for free by folding the dilation stride into the address-generation stage (see the sketch after this list). Additional on-chip buffering may be required for multi-channel input in depthwise mode, and supporting arbitrary dilation rates in the address generator moderately increases hardware complexity (Chen et al., 2021).
- Kernel learnability and dynamic placement: In DCLS, learning the spatial positions of active taps enhances expressivity but slightly reduces throughput compared to fixed-grid dilated or separable kernels (Khalfaoui-Hassani et al., 2021).
- Best practices for network design: Common recommendations include ReLU (or ReLU6) activations, batch normalization after convolutional layers, and max-pooling for dimensionality reduction between DDC modules. Dropout between blocks is favored for regularization, especially in low-parameter regimes. Multi-scale feature extraction through depth stacking and/or explicit concatenation is empirically validated to improve discriminative capacity (Zeng et al., 2023, Nedeljković, 8 Dec 2025).
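As referenced in the hardware-design point above, a purely illustrative sketch (not the CLPU's actual microarchitecture) of how an address generator absorbs stride and dilation: the input address of every kernel tap is a linear function of the output coordinate, stride, and dilation, so dilated kernels add nothing beyond a multiply in the address path.

```python
def tap_addresses(out_y, out_x, k=3, stride=1, dilation=1, row_pitch=224):
    """Flat input addresses read for one output pixel of a depthwise conv.
    Dilation only scales the per-tap offset, so the same datapath and buffers
    serve dilated and dense kernels (illustrative model, not real RTL)."""
    base_y, base_x = out_y * stride, out_x * stride
    return [(base_y + m * dilation) * row_pitch + (base_x + n * dilation)
            for m in range(k) for n in range(k)]

# The same 9 taps, spread twice as far apart when dilation=2:
print(tap_addresses(0, 0, dilation=1))   # [0, 1, 2, 224, 225, 226, 448, 449, 450]
print(tap_addresses(0, 0, dilation=2))   # [0, 2, 4, 448, 450, 452, 896, 898, 900]
```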
Depthwise dilated convolution and its variants constitute foundational operations for highly parameter- and energy-efficient deep neural networks. Through precise control of receptive field, memory usage, and compute via dilation and channel-wise separation, DDC-based modules deliver state-of-the-art accuracy in embedded vision, audio, and remote sensing tasks while supporting real-time inference within severe resource budgets (Chen et al., 2021, Muhammad et al., 1 May 2025, Nedeljković, 8 Dec 2025, Drossos et al., 2020, Zeng et al., 2023, Khalfaoui-Hassani et al., 2021).