Dual Complementary Dynamic Convolution
- DCDC is a dynamic convolution operator that integrates local spatial-adaptive (LSA) and global shift-invariant (GSI) kernels for enhanced feature representation.
- It employs a two-branch architecture where the LSA branch predicts location-specific kernels while the GSI branch generates sample-specific kernels, balancing local details and global consistency.
- Empirical evaluations demonstrate that DCDC improves classification, detection, and segmentation performance with modest computational and parameter overhead compared to traditional methods.
Dual Complementary Dynamic Convolution (DCDC) is a convolutional neural network (CNN) operator constructed to jointly model local spatial-adaptive (LSA) and global shift-invariant (GSI) features for enhanced representation learning in visual recognition. The DCDC operator mitigates the limitations of vanilla convolution (shift-invariant, sample-agnostic) and prior dynamic convolutional approaches, which typically model only one type of scene characteristic. Through a two-branch architecture, DCDC predicts both spatially-varying kernels tailored per location and a sample-specific, shift-invariant kernel. The summation of the two branches' outputs results in a more expressive convolutional layer, yielding improved performance in classification, detection, and segmentation tasks, with modest computational and parameter overheads relative to traditional or prior dynamic convolutional alternatives (Yan et al., 2022).
1. Motivation and Background
Vanilla convolutions utilize a fixed kernel per output channel, shared across all spatial locations and samples, which confers translation equivariance and parameter efficiency. However, this mechanism is content- and sample-agnostic, impeding adaptation to varied inputs and limiting the diversity of patterns each filter can extract. Convolutional layers of this form are also sensitive to resolution shifts, often necessitating complex multi-scale training protocols.
Several dynamic convolution approaches (such as CondConv, DynamicConv, DyNet, WeightNet) attempt to address sample agnosticism by producing a set of expert kernels per layer, modulating their contributions via global attention. These retain shift invariance but lack spatial adaptivity at the level of individual feature map locations and are comparatively expensive in terms of FLOPs and parameters. Other mechanisms, such as Involution and DDF, incorporate location-specific, spatially adaptive kernels but lose the capacity for modeling globally-shared shift-invariant features. DCDC recognizes that scene features often exhibit both local (e.g., region-specific zebra stripe orientation) and global (e.g., overall stripe regularity) patterns. Its two-branch design processes these complementary aspects simultaneously, enriching representational capacity.
2. Architectural Formulation
The DCDC operator takes an input feature map X ∈ ℝ^(C×H×W) and produces an output Y of the same shape. The output at each position (i, j) is the sum of two components, one from each branch:

Y(i, j) = W_LSA(i, j) ∗ X_N(i, j) + W_GSI ∗ X_N(i, j),

where X_N(i, j) denotes the k×k neighborhood of X centered at (i, j), W_LSA(i, j) is the location-specific kernel, and W_GSI is the sample-specific kernel shared across all locations.
LSA Branch: For each location (i, j) of each sample, a location-specific kernel W_LSA(i, j) is generated by a lightweight predictor network. The predictor processes a local patch of the input centered at (i, j) through blocks of alternating depthwise and pointwise convolutions.
GSI Branch: For each sample, a single kernel W_GSI = Proj(GAP(Conv1×1(X))) is produced via a reduced-channel 1×1 convolution, global average pooling, and a projection. This kernel is shared across the entire spatial domain, introducing sample adaptivity while maintaining shift invariance.
All steps are fully differentiable, facilitating end-to-end training.
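The two-branch computation above can be sketched in NumPy. This is a minimal single-channel, single-sample illustration rather than the paper's implementation: the per-location LSA kernels and the sample-level GSI kernel are supplied as precomputed arrays (here random stand-ins), since in DCDC they would come from the learned predictor networks.

```python
import numpy as np

def dcdc_forward(x, lsa_kernels, gsi_kernel, k=3):
    """Sum of a location-specific (LSA) and a shared (GSI) convolution.

    x           : (H, W) single-channel input feature map
    lsa_kernels : (H, W, k, k) one kernel per output position
    gsi_kernel  : (k, k) one kernel shared by every position
    """
    H, W = x.shape
    pad = k // 2
    xp = np.pad(x, pad)                  # zero-pad so output keeps H x W
    y = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            patch = xp[i:i + k, j:j + k]  # k x k neighborhood of (i, j)
            y[i, j] = (np.sum(lsa_kernels[i, j] * patch)  # LSA branch
                       + np.sum(gsi_kernel * patch))      # GSI branch
    return y

# Toy usage: random input and (stand-in) predicted kernels.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))
lsa = rng.standard_normal((8, 8, 3, 3)) * 0.1  # would come from the LSA predictor
gsi = rng.standard_normal((3, 3)) * 0.1        # would come from the GSI predictor
y = dcdc_forward(x, lsa, gsi)
print(y.shape)  # (8, 8): same spatial size as the input
```

Because the two branches are summed, the operator is linear in each kernel set: zeroing one branch recovers a pure spatially-adaptive (Involution/DDF-like) or pure sample-adaptive (CondConv-like) convolution, which is the complementarity the design exploits.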
3. Implementation and Integration in CNN Architectures
DCDC is typically deployed within ResNet-style architectures by replacing the 3×3 convolutions in both the stem and the bottleneck modules with the DCDC operator. The standard 1×1 "reduce" and "expand" bottleneck convolutions remain unchanged. In the canonical configuration, the LSA kernel predictor uses a 3×3 depthwise convolution, and both kernel-generation paths apply a channel reduction ratio to limit overhead.
The additional computational overhead is modest. For example, DCDC-ResNet-26 contains 9.36M parameters and 1.71G FLOPs, versus 13.7M parameters and 2.4G FLOPs for vanilla ResNet-26, and 9.23M parameters and 1.67G FLOPs for RedNet-26 (Involution-based). The parameter/FLOP increase relative to Involution/RedNet is only a few percent (roughly 1–3% at these depths), while DCDC requires only about 60–70% of the parameters and FLOPs of vanilla ResNet.
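These ratios can be checked directly against the figures quoted above; the following quick calculation reproduces them.

```python
# Parameter / FLOP figures (M params, G FLOPs) for the 26-layer variants.
models = {
    "ResNet-26":      (13.7, 2.40),
    "RedNet-26":      (9.23, 1.67),
    "DCDC-ResNet-26": (9.36, 1.71),
}

dcdc_p, dcdc_f = models["DCDC-ResNet-26"]
res_p, res_f = models["ResNet-26"]
red_p, red_f = models["RedNet-26"]

# DCDC vs vanilla ResNet: fraction of resources actually used.
print(f"params vs ResNet-26: {dcdc_p / res_p:.0%}")  # ~68%
print(f"FLOPs  vs ResNet-26: {dcdc_f / res_f:.0%}")  # ~71%

# DCDC vs RedNet (Involution): relative overhead.
print(f"param overhead vs RedNet-26: {dcdc_p / red_p - 1:+.1%}")  # ~+1.4%
print(f"FLOP  overhead vs RedNet-26: {dcdc_f / red_f - 1:+.1%}")  # ~+2.4%
```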
4. Computational Complexity
The parameter and computational resource requirements of DCDC-augmented networks are listed below (single forward pass, ImageNet scale):
| Architecture | #Params (M) | FLOPs (G) |
|---|---|---|
| ResNet-26 | 13.7 | 2.4 |
| RedNet-26 (Involution) | 9.23 | 1.67 |
| DCDC-ResNet-26 | 9.36 | 1.71 |
| ResNet-50 | 25.6 | 4.1 |
| DDF-ResNet-50 | 16.8 | 2.3 |
| RedNet-50 | 15.5 | 2.62 |
| DCDC-ResNet-50 | 15.8 | 2.68 |
| ResNet-101 | 44.6 | 7.9 |
| DDF-ResNet-101 | 28.1 | 4.1 |
| RedNet-101 | 25.7 | 4.6 |
| DCDC-ResNet-101 | 26.2 | 4.71 |
In DCDC, LSA kernel prediction requires two pointwise convolutions and two 3×3 depthwise convolutions per layer; GSI kernel prediction comprises two 1×1 convolutions and a global average pooling. The overall parameter and FLOP increase relative to Involution/RedNet baselines is minor (Yan et al., 2022).
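As a rough illustration of why the predictor overhead stays small, the cost of the components listed above can be estimated with standard convolution parameter counts. The channel width C, reduction ratio r, and the C → C/r → C channel arrangement below are assumptions chosen for scale only, not the paper's exact settings.

```python
# Rough per-layer parameter estimate for the DCDC kernel predictors,
# using standard conv parameter counts (biases omitted).
# C, r, and the channel arrangement are illustrative assumptions.
C, r, k = 256, 4, 3

# LSA predictor: two pointwise (1x1) convs with channel reduction C -> C/r -> C,
# plus two 3x3 depthwise convs operating on the reduced channels.
lsa_params = (C * (C // r)              # pointwise reduce
              + (C // r) * C            # pointwise expand
              + 2 * k * k * (C // r))   # two depthwise 3x3 convs

# GSI predictor: two 1x1 convs through the reduced width
# (global average pooling contributes no parameters).
gsi_params = C * (C // r) + (C // r) * C

# A plain 3x3 convolution with C input and C output channels, for scale.
static_params = k * k * C * C

print(lsa_params + gsi_params, "predictor params vs", static_params, "for a static 3x3 conv")
```

Under these placeholder settings the two predictors together cost on the order of a tenth of a static 3×3 convolution, consistent with the small overheads in the table above.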
5. Empirical Evaluation
DCDC-based networks were assessed on ImageNet-1K classification, MS COCO object detection/instance segmentation, and panoptic segmentation tasks using MMClassification and timm frameworks. Trained with standard augmentation, SGD optimization, and established learning schedules, DCDC consistently improved over static ResNets and prior dynamic convolutions.
Image Classification (Top-1 Accuracy, ImageNet-1K):
| Model | #Params (M) | FLOPs (G) | Top-1 (%) |
|---|---|---|---|
| ResNet-26 | 13.7 | 2.4 | 73.6 |
| RedNet-26 | 9.23 | 1.67 | 75.9 |
| DCDC-ResNet-26 | 9.36 | 1.71 | 77.4 |
| ResNet-50 | 25.6 | 4.1 | 76.8 |
| RedNet-50 | 15.5 | 2.62 | 78.4 |
| DDF-ResNet-50 | 16.8 | 2.3 | 79.1 |
| DCDC-ResNet-50 | 15.8 | 2.68 | 78.9/80.1* |
| ResNet-101 | 44.6 | 7.9 | 78.5 |
| RedNet-101 | 25.7 | 4.6 | 79.1 |
| DDF-ResNet-101 | 28.1 | 4.1 | 80.2 |
| DCDC-ResNet-101 | 26.2 | 4.71 | 80.8 |
(* indicates use of the timm protocol.)
Compared to leading dynamic convolutional approaches, DCDC-ResNet often matches or surpasses their accuracy with substantially fewer parameters (e.g., 15.8M for DCDC-ResNet-50 vs. 104.8M for CondConv-50 at similar or improved performance).
Object Detection and Segmentation:
Across tasks such as MS COCO object detection (Faster-RCNN, RetinaNet), instance segmentation (Mask-RCNN), and panoptic segmentation (Panoptic FPN), replacing the ResNet backbone with DCDC-ResNet yields consistent AP gains (e.g., Mask-RCNN with DCDC-ResNet-50 increases AP_bbox to 41.7 from 38.4 for ResNet-50, at lower parameter and FLOP counts).
6. Ablations and Hyperparameter Analysis
Ablation studies on RedNet-26 show the separate impacts of the LSA and GSI branches: adding only the LSA branch raises Top-1 accuracy to 76.7% (+0.8%), adding only the GSI branch raises it to 77.4% (+1.5%), and the full DCDC operator reaches 77.4%. Hyperparameter sweeps indicate that enlarging the LSA or GSI branch kernels beyond the default yields only marginal gains (best Top-1 of 77.80%), and the computational cost remains moderate under these configurations (Yan et al., 2022).
7. Qualitative Analysis and Visualization
Visualization of predicted kernels and Grad-CAM attention maps demonstrates distinctive behavior for each branch: LSA kernels adapt to fine local structures (e.g., object contours and stripe orientations), while GSI kernels remain coherent and globally consistent within each sample. DCDC-ResNets exhibit more cohesive attention to entire object shapes compared to both vanilla ResNet and Involution, supporting the hypothesis that dual-branch modeling captures richer features.
DCDC exemplifies a two-branch dynamic convolution paradigm, integrating sample-specific shift-invariant and highly spatially adaptive components in a unified, efficient operator. Empirical results indicate state-of-the-art performance on classification and recognition benchmarks and significant resource savings versus both static and earlier dynamic convolutional designs (Yan et al., 2022).