Deformable V2 Convolutions in Deep Learning

Updated 14 December 2025
  • The paper presents a convolutional operation that learns per-sample offsets and modulation scalars, enabling dynamic adaptation of the receptive field.
  • It employs bilinear interpolation and efficient architectural integration to reduce parameter counts while improving feature alignment in complex geometric scenarios.
  • Empirical results indicate significant gains, including a +7.45% increase in Dice score, enhanced boundary precision, and robust performance in both medical segmentation and object detection.

Deformable V2 Convolutions are a generalized convolutional operation that enhances spatial modeling in deep neural networks via learnable sampling locations (offsets) and modulation scalars. Unlike standard convolutions, which operate on a fixed and regular grid, Deformable V2 Convolutions dynamically select and weight feature samples, enabling the network to adapt its receptive field to geometric variations in the data. These properties make them especially effective for vision tasks involving objects with complex, non-rigid, or context-dependent structures.

1. Mathematical Formulation and Mechanisms

Let $F\in\mathbb{R}^{C_{\text{in}}\times H\times W}$ denote the input feature map and $W\in\mathbb{R}^{C_{\text{out}}\times C_{\text{in}}\times K\times K}$ the convolutional kernel, with $K=3$ for a $3\times 3$ kernel. The regular sampling grid $\mathcal{R} = \{k=(k_x,k_y)\ |\ k_x,k_y= -\lfloor K/2\rfloor, \dots, \lfloor K/2\rfloor\}$ defines the $K^2$ canonical positions.

In Deformable V2 Convolution, for each output position $p$, the convolution output is:

$$Y(p) = \sum_{c=1}^{C_{\text{in}}} \sum_{k\in \mathcal{R}} \alpha_k(p)\, W_c(k)\, F_c\bigl(p + k + \Delta p_k(p)\bigr)$$

where $\Delta p_k(p) \in \mathbb{R}^2$ is the learnable offset and $\alpha_k(p)\in [0,1]$ the learned modulation scalar for kernel position $k$.

Offsets are predicted by a dedicated convolutional layer with $2K^2$ outputs, while modulation scalars are produced by a parallel convolution with $K^2$ outputs followed by a sigmoid activation. Bilinear interpolation is employed to sample $F$ at the (typically non-integer) positions $p + k + \Delta p_k(p)$.
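
Because both heads are ordinary convolutions and the bilinear sampling is differentiable, the operator can be assembled from an existing deformable-convolution kernel in a few lines. The sketch below uses PyTorch and `torchvision.ops.deform_conv2d`, which accepts a modulation mask in recent torchvision releases; the module name, zero-initialization of the heads, and shapes are illustrative assumptions rather than the configuration of any specific paper.

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class DeformableConvV2(nn.Module):
    """Minimal sketch of a modulated (V2) deformable convolution."""
    def __init__(self, in_ch, out_ch, k=3, stride=1, padding=1):
        super().__init__()
        self.stride, self.padding = stride, padding
        # Main kernel weights W (out_ch x in_ch x k x k) and bias
        self.weight = nn.Parameter(torch.empty(out_ch, in_ch, k, k))
        self.bias = nn.Parameter(torch.zeros(out_ch))
        nn.init.kaiming_uniform_(self.weight, a=1)
        # Offset head: 2*K^2 channels (an (x, y) displacement per kernel position)
        self.offset_head = nn.Conv2d(in_ch, 2 * k * k, k, stride, padding)
        # Modulation head: K^2 channels, squashed to [0, 1] by a sigmoid
        self.mod_head = nn.Conv2d(in_ch, k * k, k, stride, padding)
        # Zero-init both heads so training starts from the regular grid
        # with uniform modulation of 0.5 (a common but assumed choice).
        for head in (self.offset_head, self.mod_head):
            nn.init.zeros_(head.weight)
            nn.init.zeros_(head.bias)

    def forward(self, x):
        offsets = self.offset_head(x)            # Delta p_k(p)
        mask = torch.sigmoid(self.mod_head(x))   # alpha_k(p)
        return deform_conv2d(x, offsets, self.weight, self.bias,
                             stride=self.stride, padding=self.padding, mask=mask)

x = torch.randn(1, 64, 32, 32)
print(DeformableConvV2(64, 64)(x).shape)  # torch.Size([1, 64, 32, 32])
```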

2. Architectural Integration and Parameterization

Integration of Deformable V2 Convolutions varies by architecture. In DAUNet (Munir et al., 7 Dec 2025), the deformable module is applied exclusively in the bottleneck—where the feature width equals $C/4$—to maximize spatial adaptivity with low computational cost. The sequence is:

  • $1\times 1$ conv ($C \rightarrow C/4$),
  • $3\times 3$ deformable V2 convolution,
  • $1\times 1$ conv ($C/4 \rightarrow C$),
  • SimAM parameter-free attention.

The offset head is a $3\times 3$ convolution producing $2K^2$ channels and adds $18\times (C/4)$ parameters for $K=3$. The modulation head is another $3\times 3$ conv plus sigmoid, contributing $9\times (C/4)$ parameters. The additional parameter count is negligible compared to the reduction stemming from channel compression: in the DAUNet configuration, the model shrinks from 31.03 M parameters (baseline UNet) to 20.47 M. Bilinear interpolation introduces a 10–20% compute overhead in the bottleneck layer, which is offset by the cheaper $1\times 1$ convolutions.
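
A minimal sketch of this bottleneck ordering is given below, reusing the `DeformableConvV2` module sketched in Section 1. The SimAM step follows the published parameter-free energy formulation; the exact channel widths, activations, and any residual connections in DAUNet are not reproduced here.

```python
import torch
import torch.nn as nn

def simam(x, lam=1e-4):
    """Parameter-free SimAM attention: each activation is reweighted by a
    sigmoid of its normalized squared deviation from the per-channel mean."""
    n = x.shape[2] * x.shape[3] - 1
    d = (x - x.mean(dim=(2, 3), keepdim=True)).pow(2)
    v = d.sum(dim=(2, 3), keepdim=True) / n
    return x * torch.sigmoid(d / (4 * (v + lam)) + 0.5)

class DeformableBottleneck(nn.Module):
    """1x1 reduce -> 3x3 deformable V2 -> 1x1 expand -> SimAM attention."""
    def __init__(self, channels):
        super().__init__()
        mid = channels // 4
        self.reduce = nn.Conv2d(channels, mid, kernel_size=1)
        self.deform = DeformableConvV2(mid, mid)   # module from the Section 1 sketch
        self.expand = nn.Conv2d(mid, channels, kernel_size=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        y = self.act(self.reduce(x))   # compress to C/4 channels
        y = self.act(self.deform(y))   # spatially adaptive sampling at low width
        y = self.expand(y)             # restore C channels
        return simam(y)                # parameter-free attention
```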

3. Modeling Capacity and Receptive Field Adaptation

Standard convolution utilizes a fixed, rigid 3×3 grid with no spatial adaptivity. Deformable V2 Convolution learns per-pixel geometric displacements, “warping” the grid to track relevant contours, scale changes, or orientations in the underlying data (Zhu et al., 2018). The modulation scalars act as gates, selectively increasing or suppressing contributions from spatial samples, allowing the kernel to focus more directly on semantically relevant features while diminishing the influence of background or contextually irrelevant regions.

This adaptivity is crucial for tasks with geometric complexity, such as anatomical boundary segmentation or object detection with arbitrary viewpoint and deformation. For example, in DAUNet the dynamic offsets enable sampling along curved anatomical boundaries, while the modulation weights allow the network to suppress ambiguous or low-contrast background input, improving both spatial precision and contextual robustness (Munir et al., 7 Dec 2025).
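
The differentiability of the offsets rests on bilinear interpolation: a sample at a fractional location is an area-weighted average of its four integer neighbours, so gradients flow back to the offset head. A small standalone illustration of that weighting (independent of any library kernel) follows; the function name is purely illustrative.

```python
import torch

def bilinear_sample(feat, y, x):
    """Sample a single-channel map feat[H, W] at fractional (y, x) as the
    area-weighted average of the four surrounding integer pixels."""
    H, W = feat.shape
    y0, x0 = int(torch.floor(y)), int(torch.floor(x))
    y1, x1 = min(y0 + 1, H - 1), min(x0 + 1, W - 1)
    wy, wx = y - y0, x - x0
    return ((1 - wy) * (1 - wx) * feat[y0, x0] + (1 - wy) * wx * feat[y0, x1]
            + wy * (1 - wx) * feat[y1, x0] + wy * wx * feat[y1, x1])

feat = torch.arange(16.0).reshape(4, 4)   # a linear ramp: value = 4*y + x
print(bilinear_sample(feat, torch.tensor(1.5), torch.tensor(2.25)))  # tensor(8.2500)
```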

4. Training Procedures and Optimization

Offsets $\Delta p_k(p)$ and modulation scalars $\alpha_k(p)$ are jointly optimized with the standard convolutional weights via backpropagation on the task loss (e.g., Dice + weighted BCE for segmentation). In DAUNet, Adam is used with a learning rate of $10^{-4}$ and batch size 12, over 150 epochs on $256\times 256$ inputs. Data augmentation (random flip, rotation, zoom) is applied to encourage geometric invariance (Munir et al., 7 Dec 2025).
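
Because the offsets and modulation scalars come from ordinary convolution heads, no special optimizer handling is required; they receive gradients through the bilinear sampling like any other weight. A hedged sketch of such a training loop is shown below; the positive-class weight, the equal weighting of the Dice and BCE terms, and the single foreground channel are assumptions rather than values reported for DAUNet.

```python
import torch
import torch.nn as nn

def dice_loss(logits, target, eps=1e-6):
    """Soft Dice loss on sigmoid probabilities (binary segmentation)."""
    probs = torch.sigmoid(logits)
    inter = (probs * target).sum(dim=(1, 2, 3))
    denom = probs.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    return (1 - (2 * inter + eps) / (denom + eps)).mean()

def train(model, loader, epochs=150, lr=1e-4, pos_weight=2.0, device="cuda"):
    """Jointly optimize kernel weights, offsets, and modulation scalars
    with Adam on a Dice + weighted-BCE objective."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    bce = nn.BCEWithLogitsLoss(pos_weight=torch.tensor(pos_weight, device=device))
    model.to(device).train()
    for _ in range(epochs):
        for images, masks in loader:   # e.g. 256x256 inputs, batch size 12
            images, masks = images.to(device), masks.to(device)
            logits = model(images)
            loss = dice_loss(logits, masks) + bce(logits, masks)
            opt.zero_grad()
            loss.backward()
            opt.step()
```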

In the context of object detection, Deformable ConvNets v2 (DCNv2) employs a feature mimicking scheme whereby a cropped R-CNN branch produces a teacher feature, and the deformable convolution branch is regularized by a mimic loss, encouraging the learned features to focus on pertinent object regions (Zhu et al., 2018).
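
The mimicking term is, in essence, a similarity loss between per-region feature vectors of the deformable network and the frozen teacher branch. A minimal, hedged sketch using cosine similarity is given below; the RoI pooling, projection heads, and loss weighting of the actual DCNv2 setup are not reproduced.

```python
import torch
import torch.nn.functional as F

def feature_mimic_loss(student_feat, teacher_feat):
    """Cosine-similarity mimic term: pull per-RoI student embeddings toward
    the teacher's embeddings of the corresponding cropped object regions.
    Both inputs are (num_rois, dim) pooled feature vectors."""
    s = F.normalize(student_feat, dim=1)
    t = F.normalize(teacher_feat.detach(), dim=1)  # teacher is not updated by this term
    return (1.0 - (s * t).sum(dim=1)).mean()
```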

5. Empirical Impact and Ablation Results

The combination of learnable offsets and modulation brings substantial performance gains relative to rigid architectures. A representative ablation on the FH-PS-AoP dataset (Munir et al., 7 Dec 2025):

| Configuration | Dice (%) | HD95 (px) | ASD (px) | Params (M) |
|---|---|---|---|---|
| Baseline UNet | 80.22 | 15.87 | 4.88 | 31.03 |
| + SimAM only | 82.16 | 14.73 | 4.92 | 31.03 |
| + Deformable V2 | 87.67 | 11.87 | 4.09 | 20.47 |
| + Deformable V2 + SimAM (DAUNet) | 89.09 | 10.37 | 3.70 | 20.47 |

Introducing Deformable V2 Convolution in place of conventional convolutions raises the Dice score by 7.45 percentage points, reduces HD95 by 4.0 px, and reduces ASD by 0.79 px, while shrinking the parameter count by approximately 34%. Qualitative analysis shows improved boundary alignment and handling of missing or ambiguous context, with the network's offset fields remaining spatially coherent even when 25% of the image content is occluded.

In detection/segmentation, DCNv2 with deeper and denser deployment, combined with the feature-mimicking loss, yields AP$^{\text{bbox}} = 43.1$ on COCO (ResNet-50 backbone), compared to 34.7 for standard Faster R-CNN and 38.0 for DCNv1 (Zhu et al., 2018).

6. Application Scenarios and Robustness

Deformable V2 Convolutions are suited for domains characterized by substantial geometric variation, object deformation, and contextual ambiguity. In medical image segmentation (as in DAUNet), they enable models to robustly delineate anatomical structures across patient populations and modalities, addressing challenges of scale, pose, and incomplete information (Munir et al., 7 Dec 2025). For object detection and instance segmentation (as in DCNv2), their capacity for gated, position-dependent receptive field adaptation manifests in improved discrimination of object boundaries and resilience to clutter (Zhu et al., 2018).

The module’s lightweight parameter profile (especially under channel compression) and ability to maintain spatial coherence under missing data make it appropriate for deployment in computationally constrained and real-time settings, including clinical environments.

7. Summary of Architectural and Methodological Advances

Deformable V2 Convolution generalizes standard convolutional neural network operations by introducing per-sample geometric and amplitude adaptivity. Essential characteristics:

  • Simultaneous learning of position offsets and modulation scalars for each kernel position.
  • Parameter- and compute-efficient integration via targeted placement (e.g., bottleneck only in encoder-decoder models).
  • End-to-end differentiable and trainable with standard optimizers and losses.
  • Empirically validated to yield large gains in segmentation overlap (e.g., +7.45 Dice points in the DAUNet ablation) and substantial reductions in boundary error, without significant runtime or parameter overhead when carefully deployed.
  • Underpinning recent progress in tasks requiring spatial adaptivity and context-aware perception, as reflected in both medical and non-medical vision literature (Munir et al., 7 Dec 2025, Zhu et al., 2018).
