Dynamic Convolutions in Deep Learning

Updated 26 November 2025
  • Dynamic Convolutions are neural network operators that adapt kernel weights based on input features, enabling flexible, task-specific feature extraction.
  • They use an auxiliary attention mechanism to generate content-adaptive mixtures of candidate kernels, enhancing efficiency and expressivity.
  • Empirical studies show significant gains in image classification, object detection, and speech processing with minimal computational overhead.

Dynamic convolutions are neural network operators whose kernel weights, or key spatial/spectral parameters, are dynamically generated or modulated as a function of the input data, rather than being fixed after training. These mechanisms offer input-dependent flexibility to classical convolutional neural networks (CNNs), enabling shifting, mixing, or weighting of multiple candidate kernels, spatial locations, or channel groups. Dynamic convolutions have shown marked improvements in efficiency, expressivity, adaptivity, and task-specific accuracy across computer vision, speech, and natural language processing domains.

1. Mathematical Foundations and Core Variants

The canonical dynamic convolution mechanism replaces the fixed kernel in each layer with a content-adaptive mixture

\overline{W}(X) = \sum_{k=1}^{K} \pi_k(X)\, W_k,

where \{W_k\} are K learnable candidate kernels and \pi(X) = (\pi_1(X), \ldots, \pi_K(X)) are input-dependent attention weights generated by an auxiliary network (typically a squeeze-excitation MLP over globally pooled statistics). The output becomes

Y = \mathrm{Conv}(X; \overline{W}(X)) = \sum_{k=1}^{K} \pi_k(X)\, \mathrm{Conv}(X; W_k),

subject to \pi_k(X) \ge 0 and \sum_k \pi_k(X) = 1. The aggregation can be evaluated either as a weighted sum of per-kernel convolution outputs or by first assembling the aggregate kernel, with nearly identical empirical results and computational complexity (Chen et al., 2019, Zhang et al., 2020).
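
A minimal PyTorch sketch of this K-expert mixture is given below; the module name DynamicConv2d, the reduction ratio r, the grouped-convolution batching trick, and the initialization are illustrative choices, not taken from the cited papers.

```python
# Minimal sketch of a K-expert dynamic convolution layer (DY-Conv/CondConv style).
# Names and hyperparameters (r, temperature, init scale) are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConv2d(nn.Module):
    def __init__(self, c_in, c_out, k=3, K=4, r=4, temperature=30.0):
        super().__init__()
        self.K, self.k, self.c_out = K, k, c_out
        # K candidate kernels W_1..W_K and their biases
        self.weight = nn.Parameter(torch.randn(K, c_out, c_in, k, k) * 0.02)
        self.bias = nn.Parameter(torch.zeros(K, c_out))
        # Squeeze-excitation style gate: GAP -> bottleneck MLP -> K logits
        hidden = max(c_in // r, 4)
        self.gate = nn.Sequential(
            nn.Linear(c_in, hidden), nn.ReLU(inplace=True), nn.Linear(hidden, K))
        self.temperature = temperature  # typically annealed toward 1 during training

    def forward(self, x):
        B, C, H, W = x.shape
        s = x.mean(dim=(2, 3))                               # GAP(X), shape (B, C_in)
        pi = F.softmax(self.gate(s) / self.temperature, 1)   # attention pi_k(X), (B, K)
        # Assemble the aggregate kernel W(X) = sum_k pi_k(X) W_k per sample
        w = torch.einsum('bk,koihw->boihw', pi, self.weight).reshape(-1, C, self.k, self.k)
        b = torch.einsum('bk,ko->bo', pi, self.bias).reshape(-1)
        # Grouped-convolution trick: treat the batch as groups so each sample
        # is convolved with its own aggregated kernel
        y = F.conv2d(x.reshape(1, B * C, H, W), w, b, padding=self.k // 2, groups=B)
        return y.reshape(B, self.c_out, H, W)
```

Because convolution is linear in the kernel, assembling the aggregate kernel per sample (as above) and summing K separate convolution outputs are interchangeable; the grouped reshaping simply lets each sample in the batch use its own mixed kernel in a single call.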

In contexts such as dynamic lightweight convolution or sequence modeling, spatially or temporally localized input features are used to predict kernels per location or timestep, rather than per image, further increasing adaptivity (Wu et al., 2019, Chang et al., 2021).
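
Below is a comparable sketch for the sequence case, where each timestep predicts its own softmax-normalized depthwise kernel of width k. The single-head, non-causal simplification and all names are assumptions rather than the exact formulation of Wu et al. (2019).

```python
# Per-timestep dynamic (lightweight) convolution for sequences: each position
# predicts k softmax-normalized taps and mixes its own local window with them.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicLightweightConv1d(nn.Module):
    def __init__(self, d_model, k=7):
        super().__init__()
        self.k = k
        self.kernel_proj = nn.Linear(d_model, k)  # predicts k kernel taps per timestep

    def forward(self, x):
        # x: (B, T, C)
        B, T, C = x.shape
        w = F.softmax(self.kernel_proj(x), dim=-1)       # (B, T, k), per-timestep kernel
        pad = self.k // 2
        x_pad = F.pad(x.transpose(1, 2), (pad, pad))     # (B, C, T + k - 1)
        windows = x_pad.unfold(2, self.k, 1)             # sliding windows: (B, C, T, k)
        # timestep t mixes its own window with its own kernel w[b, t, :]
        y = torch.einsum('bctk,btk->btc', windows, w)
        return y                                         # (B, T, C)
```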

Several notable dynamic convolution extensions exist, including:

  • Omni-Dimensional Dynamic Convolution (ODConv): Applies mutually independent attention on the spatial, input channel, output channel, and kernel index axes, allowing context-adaptive modulation in all relevant kernel tensor dimensions (Li et al., 2022).
  • Dual Complementary Dynamic Convolution (DCDC): Splits processing into a local spatial-adaptive branch and a global, sample-specific shift-invariant branch, summing their outputs (Yan et al., 2022).
  • Per-pixel Atom Factorizations: Each spatial position adapts a kernel over a learned or basis-decomposed “atom” dictionary, substantially reducing memory versus full per-location kernels, crucial for high-resolution tasks (Wang et al., 2021).
  • Decoupled Dynamic Filter (DDF): Decomposes dynamic depthwise convolution into spatial and channel-dynamic subfilters, reducing parameter and computational complexity (Zhou et al., 2021).
  • Dynamic Dilated Convolution (D²Conv3D): Learns input-driven, location-specific dilation rates and modulations for fixed-grid 3D convolutions, enhancing temporal/spatial adaptivity (Schmidt et al., 2021).
  • Frequency-Dynamic Variants: Generate frequency-dependent or event-class-specific kernels for audio and sound event detection (SED) tasks, with possible integration of dilation, partial adaptivity, and temporal attention pooling (Nam, 15 Jun 2025).

2. Architectural Implementations and Attention Mechanisms

Dynamic convolution requires a mechanism to generate input-conditioned mixture weights. In most image models, a global feature vector is extracted with global average pooling,

s = \mathrm{GAP}(X) \in \mathbb{R}^{C_{in}},

passed through a bottleneck MLP with a non-linearity (e.g., ReLU),

u = \mathrm{ReLU}(W^{(1)} s),

and projected to K logits z = W^{(2)} u, which are normalized via softmax (possibly with temperature annealing):

\pi_k(X) = \frac{\exp(z_k / \tau)}{\sum_{j=1}^{K} \exp(z_j / \tau)},

where \tau is scheduled to ensure broad early attention and eventual specialization (Chen et al., 2019). In language models or SED, local or frequency/channel-wise features may also drive the attention mechanism (Chang et al., 2021, Nam, 15 Jun 2025).
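
As one concrete (assumed) choice of schedule, the temperature can be annealed linearly from a large value down to 1 over the first few epochs; the 30 → 1 range and 10-epoch ramp below are illustrative defaults, not values prescribed by the cited papers.

```python
# Linear temperature annealing for the routing softmax: tau starts high so
# early attention stays near-uniform (all kernels receive gradient), then
# decays to 1 so the gate can specialize. Values are illustrative defaults.
def routing_temperature(epoch, tau_start=30.0, tau_end=1.0, anneal_epochs=10):
    if epoch >= anneal_epochs:
        return tau_end
    frac = epoch / anneal_epochs
    return tau_start + frac * (tau_end - tau_start)

# Example usage inside the gate, once per epoch:
# pi = F.softmax(logits / routing_temperature(epoch), dim=-1)
```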

Advances such as ODConv extend this to four parallel attention “heads” (spatial, input, output, kernel-index), all generated in parallel and broadcast to the relevant kernel axes; spatial attention is typically sigmoidal, while channel and kernel-index attentions are softmax-normalized (Li et al., 2022). Dual-component designs (DCDC) integrate two attention branches, one spatially dense, one globally pooled (Yan et al., 2022).
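
A rough sketch of such a four-branch attention generator is shown below. Module and branch names and layer sizes are mine; the per-axis normalization follows the description above and may differ across specific implementations.

```python
# ODConv-style attention generation: one pooled descriptor feeds four parallel
# heads producing spatial, input-channel, output-channel, and kernel-index
# attentions. Per-axis normalization choices vary across implementations.
import torch
import torch.nn as nn
import torch.nn.functional as F

class OmniAttention(nn.Module):
    def __init__(self, c_in, c_out, k, K, r=16):
        super().__init__()
        hidden = max(c_in // r, 16)
        self.stem = nn.Sequential(nn.Linear(c_in, hidden), nn.ReLU(inplace=True))
        self.fc_spatial = nn.Linear(hidden, k * k)  # one weight per kernel position
        self.fc_cin = nn.Linear(hidden, c_in)       # one weight per input channel
        self.fc_cout = nn.Linear(hidden, c_out)     # one weight per output channel
        self.fc_kernel = nn.Linear(hidden, K)       # one weight per candidate kernel

    def forward(self, x):
        # x: (B, C_in, H, W)
        s = self.stem(x.mean(dim=(2, 3)))              # GAP + bottleneck MLP
        a_spatial = torch.sigmoid(self.fc_spatial(s))  # sigmoidal spatial attention
        a_cin = F.softmax(self.fc_cin(s), dim=1)       # channel attentions
        a_cout = F.softmax(self.fc_cout(s), dim=1)
        a_kernel = F.softmax(self.fc_kernel(s), dim=1) # kernel-index attention
        # Each attention is broadcast onto the matching axis of the
        # (K, C_out, C_in, k, k) kernel tensor before the kernels are mixed.
        return a_spatial, a_cin, a_cout, a_kernel
```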

For instance- or location-specific prediction (e.g., CondInst, FCPose, atomized variants), 1×1 or small convolutional controllers predict head parameters from local or per-instance features rather than channel-averaged signals, slicing the flat controller output into the weights and biases of a compact dynamic head (Mao et al., 2021, Tian et al., 2020, Wang et al., 2021).
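
The sketch below illustrates this parameter-slicing pattern for a single instance, loosely in the spirit of a CondInst-style mask head; the two-layer 1×1 head, the 8-channel width, and all names are illustrative assumptions rather than the exact architecture of the cited papers.

```python
# Slice a flat, controller-predicted parameter vector into the weights and
# biases of a small per-instance dynamic head made of two 1x1 convolutions.
import torch
import torch.nn.functional as F

def dynamic_mask_head(feat, params, c_mid=8):
    """feat: (c_in, H, W) mask features for one instance;
    params: flat vector predicted by the controller for this instance."""
    c_in = feat.shape[0]
    # Split the flat parameter vector into [W1, b1, W2, b2]
    sizes = [c_in * c_mid, c_mid, c_mid * 1, 1]
    w1, b1, w2, b2 = torch.split(params, sizes)
    x = feat.unsqueeze(0)
    x = F.relu(F.conv2d(x, w1.view(c_mid, c_in, 1, 1), b1))
    x = F.conv2d(x, w2.view(1, c_mid, 1, 1), b2)
    return x.squeeze(0)  # (1, H, W) instance mask logits
```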

3. Efficiency, Parameter Count, and Resource-Constrained Deployment

Dynamic convolution typically trades a moderate increase in parameter count for substantial gains in representational power:

  • Parameter scaling: Standard convolution has C_{in} C_{out} k^2 parameters; dynamic variants scale as K \cdot C_{in} C_{out} k^2 for K experts (CondConv, DY-Conv), plus O(C_{in}^2 / r) for the gating MLP (a back-of-envelope example follows this list). Matrix-decomposition approaches reduce parameter amplification via low-rank or group-wise factorizations (Li et al., 2021).
  • Sparse Dynamic Convolutions (SD-Conv): Introduce learnable binary masks, trained with a straight-through estimator (STE) and L₀ penalties, to prune kernels/channel groups, halving DY-Conv parameters with no loss in accuracy (He et al., 2022).
  • Decoupling strategies: Atomized (Wang et al., 2021) and decoupled (Zhou et al., 2021) designs factor the dynamic kernel into small per-location atoms or channel/spatial pieces, maintaining low memory and maximal translation equivariance.
  • Computational cost: Although naïvely evaluating all K candidate kernels increases FLOPs, the summation can often be collapsed or shared; overhead is typically <5% over vanilla convolution, and resource-aware configurations (reducing expansion width, aggressive mask pruning, partial dynamic branches) yield even lower costs (Zhang et al., 2020, He et al., 2022).
  • Inference optimizations: For spatial-sparse dynamic conv (mask-based gating), gather–scatter CUDA pipelines skip computation on masked-out positions, yielding theoretical and empirical speedups (up to 60% wall-clock) on GPUs (Verelst et al., 2019). Atomized and DDF modules maintain memory/FLOPs parity with depthwise static convolution (Wang et al., 2021, Zhou et al., 2021).
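
To make the parameter-scaling bullet above concrete, here is a back-of-envelope count for a single 3×3 layer; the layer width, K, and reduction ratio are arbitrary example values, not configurations from the cited papers.

```python
# Rough parameter counts for one 3x3 layer with C_in = C_out = 256,
# K = 4 experts, and gating reduction r = 4 (illustrative values only).
c_in, c_out, k, K, r = 256, 256, 3, 4, 4

static_params  = c_in * c_out * k * k                  # ~0.59M for a static kernel
experts_params = K * static_params                     # ~2.36M (K-fold growth)
gate_params    = c_in * (c_in // r) + (c_in // r) * K  # ~16.6K, i.e. O(C_in^2 / r)

print(static_params, experts_params, gate_params)
```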

4. Empirical Performance and Task-Specific Adaptations

Dynamic convolution is empirically justified along several axes:

  • ImageNet classification: Dynamic conv brings +2–5% top-1 accuracy gains with negligible FLOPs increase across MobileNetV2/V3, ResNet-18/50, and others. ODConv and DCDC reach or surpass prior dynamic variants with reduced or comparable parameter budgets (Chen et al., 2019, Li et al., 2022, Yan et al., 2022).
  • Object detection/segmentation (COCO, etc.): FCPose and CondInst demonstrate that instance-specific dynamic heads (often < 3k parameters per instance) eliminate the need for RoI operations/grouping, outperform Mask R-CNN in COCO AP, and run nearly constant inference time irrespective of person count (Mao et al., 2021, Tian et al., 2020). DCDC-ResNet-50 achieves +3.2 AP in Faster R-CNN at –28% params versus ResNet-50 (Yan et al., 2022).
  • Pose estimation: FCPose outperforms classical and dynamic head baselines in both accuracy and speed, supporting the claim that compact, instance-adaptive dynamic heads bypass traditional bottlenecks (Mao et al., 2021).
  • Video segmentation: D²Conv3D achieves up to +2 J-score over fixed-dilated/deformable convolutions by dynamically adapting dilation and modulation, with minimal computational overhead (Schmidt et al., 2021).
  • Speech/audio processing: Frequency-dynamic, dilated, and partial-dynamic convolutions (e.g., FDY, DFD, PFD, MDFD, TFD convs) yield 7–11% PSDS1 improvements in SED over standard CRNN, with particular advantage for nonstationary, broad-spectral, or transient event detection (Nam, 15 Jun 2025).
  • Language modeling: Dynamic/lightweight convolutions for sequence contexts achieve performance on par with or superior to self-attention on machine translation and summarization, for a fraction of the compute (Wu et al., 2019, Chang et al., 2021).

5. Theoretical and Practical Limitations

Key known or observed limitations and trade-offs include:

  • Parameter explosion: Naïve dynamic designs may scale poorly in memory if full (pixel- or instance-specific) kernels are generated per location. Proposed factorizations (low-rank, atom, DDF, per-branch sparse gating) are essential for scalability (Li et al., 2021, Zhou et al., 2021, Wang et al., 2021).
  • Optimization complexity: Jointly training kernels and attention weights may be unstable if attention is highly peaked early, starving kernel gradients; techniques such as temperature annealing, sum-to-one constraints, or matrix/fusion reductions improve robustness (Chen et al., 2019, Li et al., 2021).
  • Task-specific benefit: Dynamic convolution confers the largest relative gain in underparameterized (“thin and shallow”) regimes or spatially/temporally sparse/dense prediction tasks. In dense, stationary signal domains or overparameterized models, static convolution may match or surpass dynamic methods at lower complexity (Chen et al., 2019, Nam, 15 Jun 2025).
  • Implementation engineering: Custom CUDA kernels or atomic hardware support may be required for dynamic spatial gating, gather–scatter, or advanced atomized architectures, limiting portability (Verelst et al., 2019).
  • Translational equivariance: Some designs (per-location atomized convs) specifically preserve this property across layers; matrix-decomposition or global-kernel branches may partially relax strict translational invariance (Wang et al., 2021, Yan et al., 2022).

6. Connections to Self-Attention and Hybrid Architectures

Dynamic convolutions reveal a fundamental connection between classical convolution, self-attention, and instance-conditioned modulation:

  • Self-attention as dynamic convolution: Transformer attention is mathematically a dynamic convolution with N input-dependent kernels and a softmax reweighting (written out after this list), bridging the “attention vs. convolution” dichotomy (Zhou et al., 2023, Chang et al., 2021).
  • Position and frequency adaptation: Relative position and velocity encodings in self-attention, or frequency-channel-specific gating in SED, are dynamic convolutional designs in disguise (Chang et al., 2021, Nam, 15 Jun 2025).
  • Hybrid architectures: Mobile-Former, DCDC, ODConv, and various audio models demonstrate how dynamic convolution can be harmonized with residual connections, attention, or cross-modal fusion to enhance local-global, spatial-spectral, or instance-class integration (Li et al., 2022, Yan et al., 2022, Nam, 15 Jun 2025).
  • Resource-aware combinatorics: Partial adaptivity, frequency-temporal duality, dynamic dilation, and multi-branching enable tailoring of dynamic convolution to specific sparsity, hardware, or domain constraints (Nam, 15 Jun 2025, Wang et al., 2021).
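
To make the first bullet concrete, the attention output at a single query position t can be written in standard transformer notation (projections W_Q, W_K, W_V, a window of N positions, key dimension d), none of which is taken verbatim from the cited papers:

\mathrm{Attn}(X)_t = \sum_{j=1}^{N} \underbrace{\mathrm{softmax}_j\!\left(\frac{(x_t W_Q)(x_j W_K)^{\top}}{\sqrt{d}}\right)}_{\pi_j^{(t)}(X)} \, x_j W_V = \sum_{j=1}^{N} \pi_j^{(t)}(X)\,(x_j W_V),

i.e., each position applies a softmax-normalized, input-dependent kernel \pi^{(t)}(X) of support N to the value-projected sequence; this is the per-position dynamic convolution form of Section 1, with the aggregation weights generated from query-key interactions instead of pooled statistics.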

Plausible implications are that dynamic convolution, appropriately modularized and regularized, can serve as a generalized operator unifying soft attention, spatial gating, conditional normalization, and local feature mixing, with strong upside for resource-constrained, variant-rich, and task-adaptive deep learning.

7. Comparative Summary Table

| Variant | Major Feature | Typical Complexity/Params | Empirical Gain (Example) |
|---|---|---|---|
| DynamicConv (Chen et al., 2019) | K-way per-input kernel mixing | K× standard conv + attention MLP | +2–4% top-1 (ImageNet), +2–3 AP (COCO) |
| ODConv (Li et al., 2022) | 4D attention over kernel axes | n× kernel tensor size, 4 small FCs | +3–6% top-1 (MobileNet, ResNet) |
| DCDC (Yan et al., 2022) | Dual local-adaptive + global shift-invariant kernel | Slightly > static (shared predictor) | +2.9–3.8% top-1, –26–38% params (ResNet) |
| CondInst / FCPose (Tian et al., 2020; Mao et al., 2021) | Instance-conditioned heads/filters | 169–2,700 params per instance/head | +0.8 AP (COCO mask/AP_kp), 4–5× speedup |
| Atomized (Wang et al., 2021) | Per-pixel atom-decomposed kernels | O(m) atoms, shared coefficients | –85% FLOPs, better accuracy (counting, ImageNet) |
| DDF (Zhou et al., 2021) | Decoupled spatial/channel dynamic filters | O(c·k² + σ·c²) | –44–47% FLOPs, +1–2% top-1 |
| SD-Conv (He et al., 2022) | Learnable sparse-mask expert pruning | ≈½ DynamicConv | Matches DY-Conv, –50% params |
| D²Conv3D (Schmidt et al., 2021) | Input-driven spatial/temporal dilation | Small extra heads for dilation/modulation | +2 J-score (DAVIS’16), <0.02 s/frame overhead |
| FDY/DFD/MDFD/TFD (Nam, 15 Jun 2025) | Frequency-adaptive (dilated/attention-pooled) | 5–18M params, multi-branch | +8–11% PSDS1 (DESED), best for nonstationary/transient SED |

All concrete numbers, formulas, and architectural points are taken verbatim from, or derived directly from, the cited papers. Taken together, these works span a diverse array of operator design choices, theoretical underpinnings, regularization/optimization strategies, and application-specific variants that collectively define the state-of-the-art role of dynamic convolution across modern deep learning.
