
Convolution Subsampling Layer

Updated 12 May 2026
  • A convolution subsampling layer reduces feature-map resolution; recent conv-resize variants combine stride-1 convolution with differentiable resizing, enabling both integer and fractional downsampling.
  • It leverages techniques like conv-resize blocks, multisampling, and group-equivariant subsampling to preserve spatial information and maintain transformation equivariance.
  • Empirical results show enhanced metrics (e.g., PSNR, SSIM) and improved segmentation accuracy, while allowing flexible integration into existing deep learning architectures.

A convolution subsampling layer modifies the spatial resolution of feature maps in convolutional neural networks (CNNs), typically reducing their height and width. Traditional downsampling mechanisms—such as pooling or strided convolutions—are limited to integer scaling factors and may discard substantial spatial information. Recent research has produced advanced convolution subsampling paradigms, including the conv-resize block enabling fractional downsampling (Chen et al., 2021), multisampling and checkered subsampling preserving more input data (Sadigh et al., 2018), and group-equivariant variants that maintain transformation equivariance (Xu et al., 2021). These approaches address scaling flexibility, information preservation, and symmetry constraints central to high-performance deep learning models.

1. Mathematical Formulations and Variants

Conv-Resize Block (Fractional Downsampling).

Let $x \in \mathbb{R}^{H \times W \times C_{\text{in}}}$ denote an input feature map and $s \in \mathbb{Q}_+$ the desired downsampling factor. The conv-resize block $F_{\downarrow s}(\cdot)$ comprises two sequential operations:

  1. Stride-1 Convolution:

$$y_{\text{conv}}[i,j,c] = \sum_{m,n,d} w[m,n,d,c]\, x[i+m, j+n, d] + b[c]$$

where $w \in \mathbb{R}^{K \times K \times C_{\text{in}} \times C_{\text{out}}}$ are learnable filters and $b \in \mathbb{R}^{C_{\text{out}}}$ is a learnable bias.

  2. Differentiable Resizing: Using bilinear interpolation with kernel $\rho(t) = \max(1 - |t|, 0)$, the output is:

$$y_{\text{out}}[u,v,c] = \sum_{i=0}^{H-1} \sum_{j=0}^{W-1} y_{\text{conv}}[i,j,c]\, \rho\!\left(us - i\right)\, \rho\!\left(vs - j\right)$$

The overall operation is $F_{\downarrow s}(x) = R_s\left(C_{s=1}(x)\right)$, with output resolution $\lfloor H/s \rfloor \times \lfloor W/s \rfloor \times C_{\text{out}}$ (Chen et al., 2021).
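
A minimal PyTorch sketch of such a block, assuming a 3×3 stride-1 convolution with "same" padding and the library's bilinear interpolation as the resizer $R_s$; the class and its defaults are illustrative rather than the exact configuration of Chen et al. (2021):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvResize(nn.Module):
    """Stride-1 convolution followed by a parameter-free bilinear resize by s."""
    def __init__(self, c_in, c_out, kernel_size=3, s=1.5):
        super().__init__()
        # C_{s=1}: stride-1 convolution with "same" padding (H x W preserved).
        self.conv = nn.Conv2d(c_in, c_out, kernel_size, stride=1,
                              padding=kernel_size // 2)
        self.s = s  # downsampling factor; may be fractional

    def forward(self, x):                                # x: (N, C_in, H, W)
        y = self.conv(x)                                 # (N, C_out, H, W)
        h, w = y.shape[-2:]
        out_size = (int(h // self.s), int(w // self.s))  # floor(H/s) x floor(W/s)
        # R_s: differentiable bilinear resize (the triangular kernel rho).
        return F.interpolate(y, size=out_size, mode="bilinear",
                             align_corners=False)

x = torch.randn(1, 16, 36, 36)
print(ConvResize(16, 32, s=1.5)(x).shape)                # -> (1, 32, 24, 24)
```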

Multisampling and Checkered Subsampling.

Standard stride-$s$ subsampling retains a single element from each $s \times s$ patch, reducing each spatial dimension by a factor of $s$ and discarding the remaining $s^2 - 1$ samples. Multisampling instead preserves $k > 1$ elements per $s \times s$ patch, organized as $k$ submaps. For 2D stride-2 layers, checkered subsampling retains 2 carefully selected samples per $2 \times 2$ region via complementary binary masks, yielding an effective stride-2 reduction with 50% data retention rather than 25% and maintaining output structure by concatenating the resulting submaps (Sadigh et al., 2018).
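
The sampling pattern can be sketched as follows in PyTorch, assuming the two retained samples per 2×2 patch lie on one diagonal of the patch; the exact binary masks of Sadigh et al. (2018) may differ:

```python
import torch

def checkered_subsample(x):
    """x: (N, C, H, W) with even H, W  ->  (N, 2, C, H//2, W//2).

    Retains 2 of every 4 samples (50%) as two stacked submaps, instead of
    1 of 4 (25%) as with standard stride-2 subsampling.
    """
    sub_a = x[..., 0::2, 0::2]                 # top-left sample of each 2x2 patch
    sub_b = x[..., 1::2, 1::2]                 # complementary diagonal sample
    return torch.stack([sub_a, sub_b], dim=1)  # new submap axis

x = torch.randn(2, 8, 32, 32)
print(checkered_subsample(x).shape)            # -> (2, 2, 8, 16, 16)
```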

Group Equivariant Subsampling.

Let the translation group $\mathbb{Z}^2$ act on feature maps by shifting spatial indices. Given stride $s$, the subgroup $s\mathbb{Z}^2$ partitions $\mathbb{Z}^2$ into $s^2$ cosets; an index map allocates one coset $p^* + s\mathbb{Z}^2$ whose elements are retained, where the representative $p^*$ is found by an argmax of the normed feature map. Because the selected coset shifts together with the input, this yields exact equivariance under $\mathbb{Z}^2$ (translations) and extends to general group actions using coset decomposition (Xu et al., 2021).
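
A minimal sketch of the translation case, assuming the coset representative is selected by an argmax over the per-location feature norm as described above; the function signature and return convention are illustrative:

```python
import torch

def equivariant_subsample(x, s=2):
    """x: (N, C, H, W) -> (coset-subsampled map (N, C, H//s, W//s), offsets (N, 2)).

    The offset (coset index) shifts together with a translated input, so the
    output is exactly equivariant to integer translations of x.
    """
    norm = x.norm(dim=1)                            # (N, H, W) per-location feature norm
    flat = norm.flatten(1).argmax(dim=1)            # argmax location per batch element
    oi, oj = flat // x.shape[-1], flat % x.shape[-1]
    oi, oj = oi % s, oj % s                         # coset representative in {0,...,s-1}^2
    out = torch.stack([x[n, :, oi[n]::s, oj[n]::s] for n in range(x.shape[0])])
    return out, torch.stack([oi, oj], dim=1)        # keep offsets for exact inversion

x = torch.randn(2, 4, 16, 16)
y, offsets = equivariant_subsample(x)
print(y.shape, offsets.shape)                       # -> (2, 4, 8, 8) (2, 2)
```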

2. Architectural Integration and Parameterization

When substituting standard downsampling within a CNN, the conv-resize block replaces stride-$s$ convolutions or pooling with a stride-1 convolution followed by a parameter-free resize by factor $s$, hence supporting both integer and arbitrary rational scaling. The only architectural change is in the first downsampling layer, and the parameter count remains constant as the resizer has no learnable components (Chen et al., 2021).

Multisampling and checkered subsampling substitute each stride-2 layer with a submap-expanding operation, effectively doubling the number of feature channels per downsampling step (a smaller channel-growth factor can be used per step when constant FLOPs are required). This allows retrofitting of ResNet, DenseNet, and similar architectures by converting every stride-2 layer into the custom subsampler and modifying downstream layers for 3D processing (treating submaps as an additional axis) (Sadigh et al., 2018).

Group equivariant variants swap each downsampling layer for a two-part block tracking the coset index alongside the reduced-resolution map. Upsampling is performed by filling samples at coset-representative locations, optionally followed by smoothing for continuity (Xu et al., 2021).

3. Training Procedures and Differentiability

Conv-resize layers exploit fully differentiable bilinear interpolation; the backpropagated gradient with respect to each input pixel is a product of two one-dimensional triangular kernels $\rho$, enabling end-to-end training without custom regularization on the resize operator. A canonical loss is a reconstruction cost between the original input and its upsampled reconstruction. Adam optimization with small, fixed learning rates and standard weight decay or dropout is typical (Chen et al., 2021).
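
A compact training sketch under these assumptions (fixed bilinear upsampler, mean-squared reconstruction loss, placeholder hyperparameters), reusing the ConvResize module sketched in Section 1:

```python
import torch
import torch.nn.functional as F

model = ConvResize(3, 3, s=1.5)                       # conv-resize sketch from Section 1
opt = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-5)

for step in range(100):
    patch = torch.rand(8, 3, 48, 48)                  # stand-in for pre-cropped training patches
    down = model(patch)                               # learned (fractional) downsampling
    up = F.interpolate(down, size=patch.shape[-2:],   # fixed, parameter-free upsampler
                       mode="bilinear", align_corners=False)
    loss = F.mse_loss(up, patch)                      # reconstruction cost
    opt.zero_grad()
    loss.backward()
    opt.step()
```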

Multisampling and checkered subsampling introduce no new learnable parameters. They increase the number of independent gradient paths to deeper layers, enhancing learning signal propagation. Training can proceed as usual, though increased memory and computational overhead may require batch size or channel width adaptation (Sadigh et al., 2018).

For group-equivariant subsampling, equivariant loss functions (such as MSE between reconstructions and symmetrically transformed inputs) are compatible. The design ensures that equivariant information (e.g., translation or rotation indices) is disentangled in the representations (Xu et al., 2021).

4. Empirical Results and Comparative Performance

The conv-resize block outperforms baseline Lanczos downsampling and integer-stride CNN-CR models for both integer and fractional downsampling. With bilinear upsampling, it yields BD-rate reductions in PSNR, SSIM, and VMAF relative to Lanczos on a 45-video set at 1080p; the gains hold at the integer factor $s = 2$ and persist across the fractional factors evaluated (Chen et al., 2021).

Multisampling (particularly checkered subsampling) demonstrates consistent test-error reductions across datasets and architectures, such as DenseNet, ResNet, and VGG, without any parameter increase. For DenseNet-BC-121 on CIFAR-10 (with augmentation), checkered subsampling lowers test error relative to the standard-subsampling baseline. On ImageNet, pretrained models benefit from lower top-1 error without fine-tuning. Inference-time and training resource requirements increase, but substantially less than with full dilation (Sadigh et al., 2018).

Group-equivariant subsampling confers robustness and improved generalization. In object-centric learning tasks, group-equivariant autoencoders (GAEs) achieve substantially lower reconstruction MSE and significant segmentation ARI improvements in low-data regimes. Equivariance guarantees also enable precise spatial/rotational control of output representations (Xu et al., 2021).

5. Implementation Considerations

PyTorch and TensorFlow implementations of conv-resize compose a stride-1 convolution block with the library's built-in resize operator in bilinear mode. Feature-map shapes must be compatible; input spatial dimensions should be divisible by the largest downsampling factor considered during training. For stable convergence, training patches are pre-cropped, Adam hyperparameters fixed, and upsamplers in the reconstruction pathway held fixed (e.g., bicubic for evaluation) (Chen et al., 2021).

Checkered subsampling in PyTorch-style pseudocode applies convolution at stride 2 to the base and shifted inputs, stacks the resulting submaps, and optionally fuses submaps as extra channels processed by 3D convolution. Standard 2D layers are adapted to process the extra submap dimension consistently (Sadigh et al., 2018).
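
One possible rendering of that pattern, assuming a shared stride-2 convolution applied to the base input and a diagonally shifted copy, with a 3D convolution fusing the two submaps; layer choices are illustrative, not the exact layers of Sadigh et al. (2018):

```python
import torch
import torch.nn as nn

class CheckeredDownsample(nn.Module):
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, 3, stride=2, padding=1)   # shared weights for both submaps
        self.fuse = nn.Conv3d(c_out, c_out, kernel_size=(2, 1, 1))   # mixes the 2 submaps

    def forward(self, x):                                  # x: (N, C_in, H, W)
        shifted = torch.roll(x, shifts=(1, 1), dims=(-2, -1))        # diagonally shifted copy
        submaps = torch.stack([self.conv(x), self.conv(shifted)], dim=2)  # (N, C_out, 2, H/2, W/2)
        return self.fuse(submaps).squeeze(2)               # (N, C_out, H/2, W/2)

x = torch.randn(1, 16, 32, 32)
print(CheckeredDownsample(16, 32)(x).shape)                # -> (1, 32, 16, 16)
```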

Group-equivariant approaches require explicit tracking of coset indices batch-wise and handling subsampling/upsampling through slicing and ghost channel insertion, with additional smoothing (e.g., average pooling) post-upsampling to avoid artifacts. Offset computation and storage for each forward batch is essential for exact equivariance recovery (Xu et al., 2021).
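
A corresponding upsampling sketch, assuming the coset offsets returned by the subsampling sketch above; zeros serve as the ghost entries and average pooling provides the optional smoothing:

```python
import torch
import torch.nn.functional as F

def equivariant_upsample(y, offsets, s=2, smooth=True):
    """y: (N, C, h, w), offsets: (N, 2)  ->  (N, C, h*s, w*s)."""
    n, c, h, w = y.shape
    out = y.new_zeros(n, c, h * s, w * s)
    for i in range(n):
        oi, oj = offsets[i]
        out[i, :, oi::s, oj::s] = y[i]   # write back at coset positions; zeros ("ghosts") elsewhere
    if smooth:                           # optional smoothing to reduce gridding artifacts
        out = F.avg_pool2d(out, kernel_size=s, stride=1, padding=s // 2)[..., :h * s, :w * s]
    return out

# Round trip with the subsampling sketch above:
# y, offsets = equivariant_subsample(x, s=2)
# x_up = equivariant_upsample(y, offsets, s=2)   # same spatial size as x
```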

6. Limitations and Open Issues

Conv-resize blocks do not incur parameter overhead but may demand careful patch sizing and input dimension management for arbitrary fractional scaling (Chen et al., 2021).

Checkered subsampling increases memory (~1.1–2× in training) and computation (2–6× slower training, 3–13× inference slowdowns on CIFAR); certain network backbones reliant on absolute spatial positions (AlexNet, VGG without global pooling) lose accuracy due to spatial-offset sensitivity. Clumping bias from repeatedly using the same sampling pattern can arise, which can be mitigated by sampler randomization or low-discrepancy schemes. Generalization to strides larger than 2 is non-trivial and may require more complex sampler design (Sadigh et al., 2018).

Group-equivariant subsampling's main complexity lies in correct implementation of coset selection and record-keeping, especially as the symmetry group extends beyond translations. Precise tuning of the smoothing operation after upsampling may be dataset-dependent (Xu et al., 2021).

7. Applications and Significance

Convolution subsampling layers, especially those employing convolution–resize blocks, multisampling techniques, or group-equivariant constructions, are instrumental in:

  • Video and image processing pipelines requiring fractional resolution changes (e.g., adaptive bitrate streaming, professional video scaling) (Chen et al., 2021).
  • Deep CNN architectures tasked with high-spatial-fidelity recognition or segmentation where traditional downsampling bottlenecks accuracy (Sadigh et al., 2018).
  • Representation learning and autoencoders needing equivariance to spatial symmetries for generalization, interpretability, and robust downstream decompositions (Xu et al., 2021).

These layers expand the class of permissible scaling transformations, encode additional spatial structure, and facilitate model transfer or adaptation between tasks and domains subject to geometric invariances.
