
Deformable Transposed Convolution

Updated 1 February 2026
  • DTC is an upsampling operator that dynamically predicts per-position offsets and modulation, enabling adaptive, detail-preserving feature reconstruction.
  • It pairs a transposed convolution that predicts offsets with grid sampling, allowing the network to focus on semantically important regions.
  • Empirical results show improved segmentation metrics (e.g., DICE, mIoU) with minimal parameter overhead, making it effective in both 2D and 3D applications.

Deformable Transposed Convolution (DTC) is a class of upsampling operators that generalize traditional transposed convolution by introducing dynamic, learnable sampling of feature locations. DTC modules combine a spatially adaptive offset field with optionally learned interpolation (or modulation) kernels to produce high-resolution feature maps that better preserve structural detail, attenuate artifacts, and can be implemented as drop-in replacements for standard transposed convolution in both 2D and 3D settings. Unlike fixed upsampling approaches, DTC explicitly regresses spatial sampling positions conditioned on the local context, allowing the network to attend to semantically salient or structurally challenging regions during upsampling (Sun et al., 25 Jan 2026, Blumberg et al., 2022).

1. Motivation and Limitations of Conventional Upsampling

Conventional upsampling operators such as transposed convolution (deconvolution) and (bi-/tri-)linear interpolation are based on fixed spatial sampling locations. In transposed convolution, zeros are interleaved in the low-resolution map, and a kernel is applied at predetermined spatial locations. Linear interpolation computes output pixels based on fixed weighted averages of input neighbors. These approaches are agnostic to structural cues off the regular grid and are susceptible to blurring, checkerboard artifacts, and detail loss—especially in medical image segmentation and generative imaging contexts (Sun et al., 25 Jan 2026).

The paradigm of deformable convolution (DCN) demonstrated the expressiveness gained by making spatial sampling positions dynamic—adapting offsets per feature location. Deformable Transposed Convolution adopts this principle for upsampling: the network learns per-position offsets and, in some variants, modulation weights or interpolation kernels, thereby enabling the upsampling operator to target informative or structurally critical input regions (Blumberg et al., 2022).

2. Mathematical Formulation and Implementation

Let $X \in \mathbb{R}^{C_{\text{in}} \times H \times W}$ (2D) or $X \in \mathbb{R}^{C_{\text{in}} \times D \times H \times W}$ (3D) denote a low-resolution feature map. The goal is to obtain an upsampled output $Y \in \mathbb{R}^{C_{\text{out}} \times sH \times sW}$ using a scale factor $s$.

DTC/DSTC Forward Pass Overview

A typical DTC block decomposes the upsampling process into the following steps (a code sketch follows the list):

  1. Offset and Modulation Prediction:

A transposed convolution applied to $X$ predicts a dense offset field $\Delta P$ and corresponding modulation weights $M$: $Z = \text{Conv}^T(X) \in \mathbb{R}^{(g \cdot \text{dim} + g) \times sH \times sW}$, where $\text{dim} = 2$ or $3$ and $g = \text{dim}$ (the number of spatial axes).
  - $\Delta p = \tanh(\Delta P)$ clamps offsets to $[-1, 1]$.
  - $m = \text{sigmoid}(M)$ restricts modulation weights to $[0, 1]$.

  2. Receptive Field Control:

Sampling positions are moved from the regular grid $P_{\text{grid}}$ by a scaled and modulated offset, $P_{\text{new}} = P_{\text{grid}} + \lambda \cdot (\Delta p \odot m)$, where $\lambda$ is a scalar (e.g., $\lambda = 1/H$).

  3. Feature Extraction and Deformable Sampling: A $1 \times 1$ (or $1 \times 1 \times 1$) convolution generates $X_{\text{feat}} = \text{Conv}_{1 \times 1}(X)$. Feature values are interpolated at $P_{\text{new}}$ using grid sampling (bilinear grid_sample for 2D, trilinear for 3D).
  4. Residual Fusion: A baseline upsampling result (e.g., transposed convolution or linear interpolation) $Y_{\text{base}}$ is added for stability and global structure preservation:

$Y = Y_{\text{base}} + Y_{\text{def}}$
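
The following is a minimal PyTorch sketch of the four steps above. The class name DTC2d and its internals are illustrative assumptions, not the authors' reference code; it implements a simplified single-group variant (one offset vector and one scalar modulation weight per output position) rather than the full $g \cdot \text{dim} + g$ channel layout.

```python
# Minimal 2D sketch of the DTC forward pass; names are illustrative,
# and this is a simplified single-group variant of the formulation above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DTC2d(nn.Module):
    def __init__(self, in_ch, out_ch, scale=2, lam=None):
        super().__init__()
        self.scale = scale
        self.lam = lam  # receptive-field scale; defaults to 1/H at runtime
        # Step 1: transposed conv predicting 2 offset + 1 modulation channels.
        self.offset_head = nn.ConvTranspose2d(in_ch, 3, scale, stride=scale)
        # Step 3: 1x1 conv producing the features to be deformably sampled.
        self.feat_conv = nn.Conv2d(in_ch, out_ch, 1)
        # Baseline upsampling branch used in the residual fusion (step 4).
        self.base = nn.ConvTranspose2d(in_ch, out_ch, scale, stride=scale)

    def forward(self, x):
        B, _, H, W = x.shape
        sH, sW = self.scale * H, self.scale * W
        lam = self.lam if self.lam is not None else 1.0 / H

        z = self.offset_head(x)              # (B, 3, sH, sW)
        dp = torch.tanh(z[:, :2])            # offsets clamped to [-1, 1]
        m = torch.sigmoid(z[:, 2:])          # modulation in [0, 1]

        # Regular output grid in grid_sample's normalized [-1, 1] coordinates.
        ys = torch.linspace(-1.0, 1.0, sH, device=x.device)
        xs = torch.linspace(-1.0, 1.0, sW, device=x.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        grid = torch.stack((gx, gy), dim=-1).expand(B, sH, sW, 2)

        # Step 2: displace the grid by the scaled, modulated offsets.
        offset = lam * (dp * m)              # broadcast to (B, 2, sH, sW)
        new_grid = grid + offset.permute(0, 2, 3, 1)

        # Step 3: deformable sampling of the 1x1 features at the new grid.
        y_def = F.grid_sample(self.feat_conv(x), new_grid,
                              mode="bilinear", align_corners=True)

        # Step 4: residual fusion with the baseline upsampling result.
        return self.base(x) + y_def
```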

DSTC (Deformably-Scaled Transposed Convolution) Specifics

DSTC additionally introduces a learnable anti-aliasing interpolation kernel GG at each output location, weighted over a Gaussian mixture or similar function, and can use a compact parameterization with global shift and dilation per input location (Blumberg et al., 2022).
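
As a loose illustration of the anti-aliasing idea (not the exact DSTC parameterization), the sketch below builds a per-location interpolation kernel as a normalized mixture of isotropic Gaussians from a predicted shift and log-dilation; the function name and argument layout are assumptions for exposition.

```python
# Illustrative sketch of a learned Gaussian-mixture interpolation kernel
# over a K x K neighborhood; `shift` and `log_dilation` would come from a
# small prediction head in an actual DSTC-style module.
import torch

def gaussian_mixture_kernel(shift, log_dilation, weights, K=5):
    """shift: (B, 2, H, W) per-location shift; log_dilation: (B, 1, H, W);
    weights: (s,) mixture weights over s Gaussian components."""
    B, _, H, W = shift.shape
    # Relative coordinates of the K x K kernel taps, centered at 0.
    r = torch.arange(K, dtype=torch.float32) - (K - 1) / 2
    ry, rx = torch.meshgrid(r, r, indexing="ij")            # (K, K) each
    coords = torch.stack((rx, ry))                          # (2, K, K)
    # Tap positions relative to the learned sampling center.
    d = coords.view(1, 2, 1, 1, K, K) - shift.view(B, 2, H, W, 1, 1)
    dist2 = (d ** 2).sum(dim=1)                             # (B, H, W, K, K)
    dil = torch.exp(log_dilation).view(B, H, W, 1, 1)
    # Mixture of isotropic Gaussians with component-scaled bandwidths.
    kern = torch.zeros_like(dist2)
    for i in range(weights.shape[0]):
        sigma = dil * (i + 1)
        kern = kern + weights[i] * torch.exp(-dist2 / (2 * sigma ** 2))
    # Normalize so the kernel weights sum to 1 at each output location.
    return kern / kern.sum(dim=(-1, -2), keepdim=True)
```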

3. Integration into Neural Architectures

DTC and DSTC are modular and can be inserted into any upsampling position of an encoder-decoder architecture (e.g., U-Net, UNETR, nnUNet). The input and output channels for DTC are inherited from the layer it replaces, typically adding only the parameters of a $1 \times 1$ convolution and a small offset-prediction transposed convolution. For a 6-stage 2D U-Net, the parameter increase is approximately $+1.3$ million (from $66$M to $67.3$M) and the computational overhead is on the order of $+0.4$ GFLOPs, about $+1$ to $2\%$ more parameters and $+1$ to $5\%$ more FLOPs overall (Sun et al., 25 Jan 2026).
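
A hypothetical drop-in swap in a U-Net-style decoder stage, reusing the DTC2d sketch from Section 2 (names and channel sizes are illustrative):

```python
# DTC2d refers to the sketch in Section 2 and is assumed to be defined.
import torch
import torch.nn as nn

up = nn.ConvTranspose2d(256, 128, kernel_size=2, stride=2)  # standard TC
up = DTC2d(in_ch=256, out_ch=128, scale=2)                  # DTC replacement

x = torch.randn(1, 256, 32, 32)
print(up(x).shape)  # torch.Size([1, 128, 64, 64])
```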

In DSTC, non-parametrized versions learn separate interpolation kernels and offsets for each location and kernel index; parametrized variants share kernels and use a global shift and spatial dilation per site, greatly reducing parameter count but achieving near-identical empirical performance (Blumberg et al., 2022).

4. Empirical Results and Comparative Analysis

2D Medical Segmentation

On ISIC18 and BUSI, DTC consistently improved segmentation performance across multiple architectures. For U-Net with bilinear upsampling, DICE increased from $78.23\%$ to $79.58\%$; adding DTC to Conv$^T$ upsampling increased DICE from $78.57\%$ to $79.41\%$. The method also improved SegMamba and SwinUNETR V2 decoders, producing sharper boundaries and reducing hair artifacts and noise (Sun et al., 25 Jan 2026).

3D Medical Segmentation

On BTCV-15:

  • nnUNet: DICE $81.52\% \rightarrow 81.66\%$
  • UNETR: DICE $69.65\% \rightarrow 71.98\%$
  • nnMamba: DICE $75.47\% \rightarrow 76.77\%$

Notably, performance on small organs and thin structures improved, as shown by visualizations and metric gains (Sun et al., 25 Jan 2026).

General Vision Tasks (DSTC)

DSTC improves instance and semantic segmentation and generative modeling:

  • Mask R-CNN Box AP: $38.3$ (TC) $\rightarrow$ $39.2$ (DSTC)
  • Mask AP: $34.8 \rightarrow 35.8$
  • HRNet-W48 VOC-12 mIoU: $76.17$ (TC) $\rightarrow$ $76.99$ (DSTC)
  • DCGAN FID (CelebAScaled): $29.6 \rightarrow 26.3$

DSTC outperforms standard TC in 2D/3D segmentation and MR image enhancement, with competitive performance achieved by the parametrized version at much lower parameter cost (Blumberg et al., 2022).

5. Implementation, Training, and Ablation Considerations

Key implementation practices include:

  • Use of AdamW with learning rate $1 \times 10^{-4}$ and weight decay $1 \times 10^{-5}$ for DTC segmentation (see the optimizer sketch after this list).
  • Offsets constrained via $\tanh$ and modulation weights via $\text{sigmoid}$ to ensure stable gradients through grid_sample (Sun et al., 25 Jan 2026).
  • The receptive field scaling parameter $\lambda$ is typically set to $1/\text{feature-map size}$, with optimal values depending on task and architecture.
  • For DSTC, the number of Gaussian mixture components for anti-aliasing, kernel size, and offset parameterization are hyperparameters, with most ablations indicating improved accuracy and minimal computational penalty at moderate values (Blumberg et al., 2022).
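
A minimal optimizer setup matching the reported recipe; the model stand-in is illustrative (a full network would use DTC upsamplers throughout):

```python
# AdamW with the learning rate and weight decay reported for DTC training.
import torch

model = DTC2d(in_ch=256, out_ch=128, scale=2)  # stand-in for a full network
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-5)
```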

Table: Hyperparameter Ablations (DSTC)

| Parameter | Best Setting | Empirical Impact |
| --- | --- | --- |
| # Gaussians $s$ | $s = 4$ | Best box/mask AP; $s = 1$ gives no significant gain |
| Kernel size $K_\Sigma$ | $K_\Sigma = 5$ | Maximum performance; diminishing returns at $> 5$ |
| Offset param. ($D+1$ ch.) | Yes | Same AP as the full parameterization, with $1/10$ the parameters |

Unbounded or unconstrained offsets and weights lead to divergence or poor segmentation; both branches (offsets and modulation) are necessary for robust performance. Tuning $\lambda$ is critical: an excessive receptive field can reduce edge precision, while a small $\lambda$ limits adaptability.

6. Advantages, Limitations, and Applications

Advantages of DTC/DSTC include:

  • Dynamic, data-driven localization for upsampling, enhancing boundary fidelity and reducing artifacts compared to fixed-grid methods.
  • Modularity: Single-line integration into a wide range of decoders and architectures in both 2D and 3D contexts.
  • Minimal computational and parameter overhead compared to fixed transposed convolution.

Limitations:

  • Learned offsets may be unstable in homogeneous regions, potentially introducing noise.
  • Requires careful tuning of the receptive field scaling parameter $\lambda$ for optimal delineation of structure.
  • Single deformable head per upsample (no multi-head extension as in some later DCN variants).

Potential applications beyond segmentation include super-resolution, detection and localization heads, and generative decoders such as VAEs and GANs (Sun et al., 25 Jan 2026).

DTC and DSTC are extensions of the deformable convolutional paradigm but apply adaptivity specifically to the upsampling step. The approach contrasts with fixed upsampling and other adaptive upsampling strategies, such as DySample and FADE, outperforming these baselines on standard benchmarks. DSTC introduces additional flexibility with learned anti-aliasing kernels and compact parameterization, providing a broader framework for deformable upsampling (Blumberg et al., 2022).

Potential future directions include multi-head offset prediction, improved regularization for stability in homogeneous regions, and broader adoption in non-segmentation generative and regression architectures. The modularity and minimal overhead of DTC-type operators suggest further application in any setting requiring learnable, detail-preserving upsampling.
