
Attention-Driven Dynamic Convolution

Updated 1 February 2026
  • Attention-driven dynamic convolution is defined as replacing fixed kernels with dynamic, input-conditioned mixtures that adapt to spatial and spectral features.
  • It leverages multidimensional attention across kernel axes to enhance performance, with empirical gains shown in architectures like ResNet and MobileNet.
  • Practical implementations focus on efficient training and integration into various network architectures while managing computational and parameter trade-offs.

Attention-driven dynamic convolution refers to a class of neural network operators in which the conventional static convolution kernels are replaced by linear combinations of multiple basis kernels, with the mixing coefficients (“attentions”) being dynamically generated as a function of the input. This architecture unites the spatial-locality and translation-equivariance properties of convolution with content-adaptivity, in essence bringing “attention” to the convolution kernel space. It generalizes classic static convolution, integrates with energy-based attention architectures, and has been rigorously formalized to encompass a spectrum including dynamic kernel selection, multi-dimensional attention in kernel space, unified convolution-attention frameworks, and various parameter-efficient or domain-specialized extensions.

1. Mathematical Foundations of Attention-Driven Dynamic Convolution

The core mechanism is to express the convolution kernel for each input as an input-conditioned mixture of $N$ static kernels. Formally, given an input $x \in \mathbb{R}^{C_\mathrm{in}\times H\times W}$ and basis kernels $(W_1,\ldots,W_N)$, one computes

$$\widetilde{W}(x) = \sum_{k=1}^N \alpha_k(x)\, W_k, \qquad \sum_{k=1}^N \alpha_k(x) = 1, \quad \alpha_k(x) \geq 0,$$

where the attention weights $\alpha_k(x)$ are produced by an attention network—typically a lightweight multi-layer perceptron applied to pooled or transformed representations of $x$ (e.g., via global average pooling) (Chen et al., 2019, Li et al., 30 Mar 2025, Wang et al., 20 Feb 2025, Li et al., 2021). The resulting “dynamic” kernel $\widetilde{W}(x)$ is then convolved with $x$ as usual:

$$y = \widetilde{W}(x) * x + \widetilde{b}(x).$$

For 3D or spectral applications, the same paradigm holds, with the attention computed jointly from spatial and spectral global descriptors (Li et al., 30 Mar 2025).
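
To make the aggregation concrete, the following is a minimal PyTorch sketch of the mixture step, assuming the attention weights have already been computed; the function name, shapes, and the grouped-convolution trick for batching per-sample kernels are illustrative rather than taken from any specific published implementation.

```python
import torch
import torch.nn.functional as F

def dynamic_conv2d(x, basis_kernels, alpha, stride=1, padding=1):
    """Aggregate N basis kernels with per-sample attention weights and convolve.

    x:             (B, C_in, H, W) input batch
    basis_kernels: (N, C_out, C_in, k, k) static basis kernels W_1..W_N
    alpha:         (B, N) softmax-normalized attention weights alpha_k(x)
    """
    B, C_in, H, W = x.shape
    N, C_out, _, k, _ = basis_kernels.shape

    # W~(x) = sum_k alpha_k(x) W_k, computed independently for each sample
    w = torch.einsum('bn,nocij->bocij', alpha, basis_kernels)  # (B, C_out, C_in, k, k)

    # Apply one kernel per sample by folding the batch into the group dimension
    x = x.reshape(1, B * C_in, H, W)
    w = w.reshape(B * C_out, C_in, k, k)
    y = F.conv2d(x, w, stride=stride, padding=padding, groups=B)
    return y.reshape(B, C_out, y.shape[-2], y.shape[-1])
```

Because convolution is linear in the kernel, aggregating kernels before the convolution is mathematically equivalent to convolving with each basis kernel and mixing the outputs; the kernel-side aggregation is simply cheaper for small $N$.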

The attention mechanism typically includes the following stages (a minimal code sketch follows the list):

  • Squeeze: Global average pooling to summarize the input.
  • Excitation: MLP with reduction and subsequent activation (e.g., ReLU).
  • Output projection: Linear layer producing $N$ kernel-wise logits.
  • Softmax normalization: Ensures convex combination of basis kernels.
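
A minimal sketch of this squeeze/excitation/softmax branch is given below; the class name, reduction ratio, and temperature default are illustrative (the temperature appears here because several of the cited papers anneal it during training, see Section 5).

```python
import torch.nn as nn
import torch.nn.functional as F

class KernelAttention(nn.Module):
    """Produces N softmax-normalized mixing weights from a pooled input descriptor."""

    def __init__(self, in_channels, num_kernels=4, reduction=4, temperature=30.0):
        super().__init__()
        hidden = max(in_channels // reduction, 4)
        self.pool = nn.AdaptiveAvgPool2d(1)        # squeeze: global average pooling
        self.fc1 = nn.Linear(in_channels, hidden)  # excitation with channel reduction
        self.fc2 = nn.Linear(hidden, num_kernels)  # projection to N kernel-wise logits
        self.temperature = temperature             # softened early in training, annealed to 1

    def forward(self, x):
        s = self.pool(x).flatten(1)                          # (B, C_in)
        logits = self.fc2(F.relu(self.fc1(s)))               # (B, N)
        return F.softmax(logits / self.temperature, dim=1)   # convex combination weights
```

Composing this module with the aggregation step sketched above yields a complete input-conditioned convolution layer.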

Extensions decompose the kernel further, separating it into input-agnostic (“static”) and input-dependent (“dynamic”) parts, or assemble parameter-sharing structures for efficiency (Li et al., 2021, Li et al., 2024).

2. Generalized Attention in the Kernel Tensor Space

Classic methods use attention only over the kernel index axis, but generalized approaches, such as Omni-Dimensional Dynamic Convolution (ODConv), introduce separate attention maps over four inherent axes of the kernel tensor: kernel number ($N$), spatial location, input channel, and output channel (Li et al., 2022). In this framework, each kernel undergoes multidimensional modulation:

$$\widetilde{W}_i = \alpha^N_i \odot \alpha^S_i \odot \alpha^C_i \odot \alpha^O_i \odot W_i, \qquad i = 1, \ldots, N,$$

where each $\alpha$ is generated by a parallel attention head. This design enhances flexibility and yields performance gains even over classical multi-kernel dynamic methods, often at reduced parameter and computational overhead.
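
A sketch of the four parallel attention heads and the resulting element-wise modulation is shown below; the sigmoid/softmax choices and broadcasting layout follow the general description above but simplify details of the published ODConv code (in particular, the spatial and channel attentions are shared across the $N$ kernels here for brevity).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OmniAttention(nn.Module):
    """Four parallel heads: kernel-wise, spatial, input-channel and output-channel attention."""

    def __init__(self, c_in, c_out, k, num_kernels=4, reduction=4):
        super().__init__()
        hidden = max(c_in // reduction, 4)
        self.squeeze = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(1),
                                     nn.Linear(c_in, hidden), nn.ReLU())
        self.kernel_head = nn.Linear(hidden, num_kernels)  # alpha^N over the N kernels
        self.spatial_head = nn.Linear(hidden, k * k)       # alpha^S over kernel positions
        self.in_head = nn.Linear(hidden, c_in)             # alpha^C over input channels
        self.out_head = nn.Linear(hidden, c_out)           # alpha^O over output channels
        self.k, self.n = k, num_kernels

    def forward(self, x, basis):
        # basis: (N, C_out, C_in, k, k); returns one aggregated kernel per sample
        h = self.squeeze(x)                                            # (B, hidden)
        a_n = F.softmax(self.kernel_head(h), dim=1)                    # (B, N)
        a_s = torch.sigmoid(self.spatial_head(h)).view(-1, 1, 1, 1, self.k, self.k)
        a_c = torch.sigmoid(self.in_head(h)).view(-1, 1, 1, basis.shape[2], 1, 1)
        a_o = torch.sigmoid(self.out_head(h)).view(-1, 1, basis.shape[1], 1, 1, 1)
        # W~_i = alpha^N_i * alpha^S * alpha^C * alpha^O * W_i, then sum over i
        w = a_n.view(-1, self.n, 1, 1, 1, 1) * a_s * a_c * a_o * basis.unsqueeze(0)
        return w.sum(dim=1)                                            # (B, C_out, C_in, k, k)
```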

Parameter-efficient variants such as KernelWarehouse exploit intra- and cross-layer parameter dependencies, partitioning each convolution into small kernel “cells” shared across layers and modulated via contrasting-driven attention functions, thus supporting large values of $n$ (kernel count) with competitive or even reduced model size (Li et al., 2024).

3. Integration of Attention-Driven Dynamic Convolution in Model Architectures

Attention-driven dynamic convolution is commonly embedded as a drop-in replacement for static convolutional layers in canonical architectures such as MobileNetV2, ResNet, DenseNet, and Transformers (Chen et al., 2019, Li et al., 30 Mar 2025, Li et al., 2022, Li et al., 2024, Li et al., 2021). Key deployment points and strategies include the following (a replacement helper is sketched after the list):

  • Replacing every convolution except the very first layer, to maximize content adaptivity (Chen et al., 2019).
  • Integrating as group/dilated or depthwise convolutions in spectral or 3D contexts (Li et al., 30 Mar 2025).
  • Selectively deploying multi-attention variants (input, output, kernel) to optimize the speed-accuracy trade-off, as in FMDConv, which omits costly spatial attention for real-world efficiency (Zhang et al., 21 Mar 2025).
  • Fusion architectures that combine dynamic convolution with other efficient operators (as in MobileFormer/ODConv).
  • Application to sequential or autoregressive models (e.g., Tacotron2), introducing both static and dynamic content-dependent convolution terms in the recurrent attention energy (Gorodetskii et al., 2022).
  • Incorporation into advanced networks for knowledge graph completion (relation-driven dynamic kernel with attention) (Guo et al., 2023) or text modeling (span-based attention-driven convolution replacing local Transformer heads) (Jiang et al., 2020).
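
As a usage illustration of the drop-in deployment pattern, including the convention of leaving the first convolution static, the helper below recursively swaps `nn.Conv2d` layers for a dynamic variant; `make_dynamic` is a hypothetical factory that would wrap the attention and aggregation sketches from Sections 1 and 2.

```python
import torch.nn as nn

def replace_convs(module, make_dynamic, skip=("conv1",)):
    """Recursively replace nn.Conv2d layers with a dynamic variant.

    make_dynamic: callable mapping an existing nn.Conv2d to its replacement module
    skip:         layer names left static (e.g. the stem convolution, per common practice)
    """
    for name, child in module.named_children():
        if isinstance(child, nn.Conv2d) and name not in skip:
            setattr(module, name, make_dynamic(child))
        else:
            replace_convs(child, make_dynamic, skip)
    return module

# Hypothetical usage on a torchvision backbone:
# model = replace_convs(torchvision.models.resnet18(), make_dynamic, skip={"conv1"})
```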

4. Unified Theoretical Frameworks: Convolution as Adaptive Structure

A formal unification is provided by the tensor factorization view: both static convolution and attention are special instances of the more general structured linear map

$$y_{n,q} = \sum_{m=1}^M \sum_{p=1}^P \Phi_{m,n,p,q}\, x_{m,p},$$

where

$$\Phi_{m,n,p,q} = \sum_{k=1}^K A_{k,m,n}\, \Theta_{k,p,q}.$$

In static convolution, $A$ encodes spatial shifts; in attention, $A$ is generated dynamically per input (“attention mask”); and models may blend both. With this view, attention-driven dynamic convolution encompasses both fixed structure and learned (adaptive) structure (Andreoli, 2019).
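
The following small sketch evaluates the factorized map with `torch.einsum`; the tensor names mirror the symbols above, and the comments indicate how the two special cases arise (dimensions and values are illustrative only).

```python
import torch

# Unified structured linear map: y[n, q] = sum_{m, p} Phi[m, n, p, q] x[m, p],
# with Phi factored as Phi[m, n, p, q] = sum_k A[k, m, n] Theta[k, p, q].
M, N, P, Q, K = 8, 8, 16, 32, 3    # input/output positions, feature dims, factorization rank

x = torch.randn(M, P)              # input: P features at each of M positions
A = torch.randn(K, M, N)           # structure tensor: banded shift matrices for static
                                   # convolution, or an input-derived mask for attention
Theta = torch.randn(K, P, Q)       # learned parameters shared across positions

# Contract over k, m, p in one step
y = torch.einsum('kmn,kpq,mp->nq', A, Theta, x)   # (N, Q)
```

For a 1-D convolution with kernel size $K$, each $A_k$ is a fixed shift matrix ($A_{k,m,n} = 1$ when $n = m + k$, up to boundary handling); for attention, $A$ is produced per input, e.g. from query–key similarities.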

Span-based dynamic convolution and content-guided dynamic convolution modules operationalize this unification, extending it from classic vision settings to graph- and sequence-based contexts (Jiang et al., 2020, Gorodetskii et al., 2022, Guo et al., 2023).

5. Practical Considerations: Computational Complexity, Training, and Efficiency

Dynamic convolution introduces modest increases in both parameter count and compute, proportional to the number of candidate kernels ($N$). However, the dominant cost remains the convolution, with the attention network’s overhead being negligible for small $N$ (typical values $N \leq 4$) (Chen et al., 2019, Li et al., 30 Mar 2025).
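
As a rough worked example with illustrative numbers (not drawn from any cited paper): a static 3×3 convolution with $C_\mathrm{in} = C_\mathrm{out} = 256$ has $256 \cdot 256 \cdot 9 \approx 0.59$M weights; an $N = 4$ dynamic layer stores roughly $2.36$M weights, while its attention MLP (reduction ratio 4) adds only about $256 \cdot 64 + 64 \cdot 4 \approx 16.6$K parameters, and per-input compute is still dominated by the single convolution with the aggregated kernel.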

Parameter-efficient designs—dynamic channel fusion, warehouse-based sharing, and channel-attention methods—have been developed to address the $N$-fold parameter blowup of naïve dynamic convolution, often reducing it to $\mathcal{O}(C^2 + CL)$ with negligible degradation in performance (Li et al., 2021, Li et al., 2024).

Training is stabilized by mechanisms such as softmax temperature annealing, residual-in-kernel connections, and split-static/dynamic kernels to avoid attention collapse and ensure all basis kernels receive sufficient gradient (Chen et al., 2019, Li et al., 2021). Optimizers are standard (SGD, Adam), often combined with learning-rate schedules and regularization (dropout, zoneout for RNNs, etc.) (Gorodetskii et al., 2022, Li et al., 30 Mar 2025).
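
As one concrete stabilization recipe among those listed, a minimal temperature-annealing schedule might look as follows; the start value and annealing window are illustrative defaults in the spirit of the setup described by Chen et al. (2019), not an exact reproduction of it.

```python
def annealed_temperature(epoch, start=30.0, end=1.0, anneal_epochs=10):
    """Linearly anneal the attention softmax temperature toward 1 (illustrative defaults)."""
    if epoch >= anneal_epochs:
        return end
    return start + (end - start) * epoch / anneal_epochs

# In the training loop, assuming each dynamic layer exposes a `temperature` attribute
# (as in the KernelAttention sketch in Section 1):
# for layer in dynamic_layers:
#     layer.temperature = annealed_temperature(epoch)
```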

Empirically, dynamic convolution yields consistent accuracy improvements on ImageNet, COCO, and domain-specific benchmarks. For instance, DyConv applied to MobileNetV3-Small reaches +2.9% absolute top-1 with only +4% FLOPs (Chen et al., 2019), ODConv adds +3.72% to ResNet-18, and kernel-warehouse methods yield +4-5% gains on several backbones at equal or lower cost (Li et al., 2024, Li et al., 2022). Application to speech (e.g., adaptation per-frame in enhancement) and segmentation demonstrates similar return on compute (Wang et al., 20 Feb 2025, Shu et al., 4 Apr 2025).

6. Domain-Specific and Emerging Applications

Attention-driven dynamic convolution has been extended into a wide range of problem domains and network types, including hyperspectral and 3D imaging (Li et al., 30 Mar 2025), speech synthesis and enhancement (Gorodetskii et al., 2022, Wang et al., 20 Feb 2025), knowledge graph completion (Guo et al., 2023), span-based text modeling (Jiang et al., 2020), and segmentation (Shu et al., 4 Apr 2025).

7. Limitations, Extensions, and Directions for Research

Current limitations include parameter overhead (scaling with kernel count $N$), additional hyper-parameters for the attention module (kernel count, reduction ratio, temperature schedule), and challenges in optimization due to potential attention collapse when some kernels are not sufficiently attended (Chen et al., 2019, Li et al., 2021).

Research has focused on mitigating these constraints:

  • Parameter- and compute-efficient kernel partition and sharing schemes (KernelWarehouse) enable exploring regimes with $n \gg 10$ kernels (Li et al., 2024).
  • Matrix decomposition and channel fusion drastically reduce the dynamic space while enhancing flexibility (Li et al., 2021).
  • Exclusion of spatial attention and other attention heads with low return on compute—retaining only those attention axes with a favorable efficiency/accuracy trade-off (Zhang et al., 21 Mar 2025).
  • Generalization of adaptive convolution to variants such as frequency-aware channel fusion, multi-branch attention fusion, and multi-stage attention for mixed tasks (Shu et al., 4 Apr 2025, Chen et al., 30 Apr 2025).

Future research will likely extend attention-driven dynamic convolution to transformer architectures, generalized structured data, and ultra-light settings (quantization, pruning), and will further explore integration with neural architecture search, cross-layer parameter sharing, and meta-learned dynamic attention scheduling (Li et al., 2024, Li et al., 2021, Zhang et al., 21 Mar 2025).
