SPD-Conv: Lossless Downsampling in CNNs

Updated 17 December 2025
  • SPD-Conv is a CNN technique that replaces strided convolutions by converting spatial patches into expanded channel representations, preserving all image details.
  • It reorganizes r×r spatial regions into a higher channel dimension, then applies a stride-1 convolution to achieve an equivalent receptive field without losing spatial fidelity.
  • Empirical results on models like ResNet and YOLOv5 demonstrate notable performance gains on low-resolution images and small object detection with minimal computational overhead.

Space-to-Depth Convolution (SPD-Conv) is a convolutional neural network (CNN) building block designed to replace conventional strided convolution and pooling layers, with the aim of improving performance on low-resolution images and small object detection tasks. SPD-Conv consists of a space-to-depth (SPD) transformation followed by a non-strided convolution, yielding downsampling without loss of spatial information. The method was introduced by Sunkara and Luo in "No More Strided Convolutions or Pooling: A New CNN Building Block for Low-Resolution Images and Small Objects" (Sunkara et al., 2022).

1. Mathematical Formulation of Space-to-Depth (SPD)

Let $X \in \mathbb{R}^{H \times W \times C}$ be an input feature map, and $r \in \mathbb{N}$ the spatial downsampling scale. The SPD operation rearranges spatial information into the channel dimension, resulting in an output $Y \in \mathbb{R}^{\frac{H}{r} \times \frac{W}{r} \times (C \cdot r^2)}$ without information loss. Formally, for $x, y \in \{0, \ldots, r-1\}$ and $c \in \{0, \ldots, C-1\}$,

$$Y[i, j, (x \cdot r + y) \cdot C + c] = X[i \cdot r + x,\; j \cdot r + y,\; c]$$

for all $i = 0, \ldots, \frac{H}{r} - 1$ and $j = 0, \ldots, \frac{W}{r} - 1$. In notation reminiscent of MATLAB slicing:

$$f_{x,y} = X[x : r : H-1,\; y : r : W-1,\; :]$$

and

$$Y = \operatorname{concat}_{(x, y) \in \{0, \ldots, r-1\}^2} f_{x, y}$$

along the channel dimension. SPD thus compresses each $r \times r$ spatial patch into a single spatial site, expanding the channel count accordingly.
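
As a concrete check (a minimal NumPy sketch in the HWC layout used above, not code from the paper), the slicing definition can be compared directly against the per-element index formula:

import numpy as np

H, W, C, r = 4, 4, 3, 2
X = np.arange(H * W * C).reshape(H, W, C)

# f_{x,y} = X[x::r, y::r, :], concatenated along the channel axis
subs = [X[x::r, y::r, :] for x in range(r) for y in range(r)]
Y = np.concatenate(subs, axis=-1)            # shape (H/r, W/r, C*r^2)

# Spot-check the identity Y[i, j, (x*r + y)*C + c] == X[i*r + x, j*r + y, c]
i, j, x, y, c = 1, 0, 1, 1, 2
assert Y[i, j, (x * r + y) * C + c] == X[i * r + x, j * r + y, c]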

2. Equivalence to Strided Convolution

A conventional $k \times k$ convolution with kernel $W_s \in \mathbb{R}^{k \times k \times C \times C'}$, stride $r$, and padding $p$ computes

$$(Z_s)[i, j, p'] = \sum_{u=0}^{k-1} \sum_{v=0}^{k-1} \sum_{c=0}^{C-1} W_s[u, v, c, p'] \cdot X[i \cdot r + u - p,\; j \cdot r + v - p,\; c]$$

Replacing this with SPD-Conv involves:

  1. Applying $X' = \operatorname{SPD}(X, r)$, resulting in shape $\mathbb{R}^{H/r \times W/r \times C r^2}$.
  2. Performing a stride-1 convolution $Z = \operatorname{Conv}_1(X'; W')$ with kernel size $k \times k$, output channels $C'$, and padding $p$.

By constructing $W'$ so that the subkernels of $W_s$ are mapped onto the proper channel slices of $X'$, one obtains $Z_s = Z$ exactly. In practice, $W'$ is simply learned afresh, while the effective receptive field and downsampling factor are preserved. The crucial distinction is that SPD drops no pixels, guaranteeing complete spatial fidelity prior to convolution.
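
This equivalence can be checked numerically in a simple case. The sketch below (PyTorch, channels-first; assuming r = k = 2 and no padding, so that $W'$ is a 1×1 kernel) maps the sub-kernels of $W_s$ onto the channel layout from Section 1 and confirms the two paths agree:

import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, C, H, W, C_out, r = 1, 3, 8, 8, 5, 2
x = torch.randn(B, C, H, W)

# Reference path: 2x2 convolution with stride 2 and no padding.
Ws = torch.randn(C_out, C, r, r)
z_strided = F.conv2d(x, Ws, stride=r)

# SPD path: rearrange each 2x2 patch into channels (layout (u*r + v)*C + c).
xp = x.view(B, C, H // r, r, W // r, r).permute(0, 3, 5, 1, 2, 4).contiguous()
xp = xp.view(B, C * r * r, H // r, W // r)

# Map sub-kernels of Ws onto the matching channel slices:
# W'[:, (u*r + v)*C + c, 0, 0] = Ws[:, c, u, v]
Wp = Ws.permute(0, 2, 3, 1).reshape(C_out, C * r * r, 1, 1)
z_spd = F.conv2d(xp, Wp)  # stride-1, 1x1 kernel

print(torch.allclose(z_strided, z_spd, atol=1e-6))  # True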

3. Implementation: Pseudocode

Minimal implementations in PyTorch and TensorFlow are as follows. In practical architectures, SPD, convolution, normalization, and activation are combined into a single module.

PyTorch:

import torch
import torch.nn as nn


class SpaceToDepth(nn.Module):
    """Rearrange each r x r spatial patch into the channel dimension (lossless)."""
    def __init__(self, r):
        super().__init__()
        self.r = r

    def forward(self, x):
        B, C, H, W = x.size()
        r = self.r
        assert H % r == 0 and W % r == 0, "H and W must be divisible by r"
        # (B, C, H, W) -> (B, C, H/r, r, W/r, r) -> (B, r, r, C, H/r, W/r) -> (B, C*r*r, H/r, W/r)
        x = x.view(B, C, H // r, r, W // r, r)
        x = x.permute(0, 3, 5, 1, 2, 4).contiguous()
        x = x.view(B, C * r * r, H // r, W // r)
        return x


class SPDConv(nn.Module):
    """SPD followed by a non-strided convolution, batch norm, and activation."""
    def __init__(self, in_ch, out_ch, scale=2, k=3, pad=1, bias=False, bn=True, act=nn.ReLU()):
        super().__init__()
        self.spd = SpaceToDepth(scale)
        # The convolution sees in_ch * scale^2 channels after the SPD rearrangement.
        self.conv = nn.Conv2d(in_ch * scale * scale, out_ch, k, stride=1, padding=pad, bias=bias)
        self.bn = nn.BatchNorm2d(out_ch) if bn else nn.Identity()
        self.act = act

    def forward(self, x):
        x = self.spd(x)
        x = self.conv(x)
        x = self.bn(x)
        return self.act(x)
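
A quick shape check for the module above (a usage sketch):

m = SPDConv(in_ch=64, out_ch=128, scale=2)
y = m(torch.randn(1, 64, 224, 224))
print(y.shape)  # torch.Size([1, 128, 112, 112])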

TensorFlow/Keras:

import tensorflow as tf
from tensorflow.keras.layers import Lambda, Conv2D, BatchNormalization, ReLU

def space_to_depth_tf(x, r):
    # Built-in lossless space-to-depth op (NHWC layout).
    return tf.nn.space_to_depth(x, block_size=r)

def SPDConvTF(in_ch, out_ch, scale=2, k=3, pad="same"):
    # in_ch is kept for symmetry with the PyTorch module; Conv2D infers its
    # input channels (in_ch * scale^2) when the model is first built.
    return tf.keras.Sequential([
        Lambda(lambda x: space_to_depth_tf(x, scale)),
        Conv2D(out_ch, k, padding=pad, use_bias=False),
        BatchNormalization(),
        ReLU(),
    ])
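
A quick shape check for the Keras block (a usage sketch; channels-last NHWC input assumed):

block = SPDConvTF(in_ch=3, out_ch=32, scale=2)
y = block(tf.zeros([1, 640, 640, 3]))
print(y.shape)  # (1, 320, 320, 32)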

4. Integration into CNN Architectures

ResNet18/50 → ResNet18-SPD/50-SPD:

  • The initial 7×7 stride-2 convolution and the subsequent 3×3 stride-2 max-pool are replaced by a single SPD-Conv with scale 2 and a 3×3 kernel, transitioning the input $H \times W \times 3 \rightarrow (H/2 \times W/2 \times 12) \rightarrow (H/2 \times W/2 \times 64)$.
  • Max-pooling is dropped.
  • Each subsequent downsampling block (previously a stride-2 convolution) is likewise replaced with an SPD-Conv of scale 2, giving the four SPD-Convs listed below; a minimal patching sketch follows the table.

Feature-map shapes in ResNet18-SPD (input 224×224):

Layer            Output Size   Channels
SPD-Conv₁        112×112       64
BasicBlock ×2    112×112       64
SPD-Conv₂        56×56         128
BasicBlock ×2    56×56         128
SPD-Conv₃        28×28         256
BasicBlock ×2    28×28         256
SPD-Conv₄        14×14         512
BasicBlock ×2    14×14         512
GlobalAvgPool    1×1           512
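
The replacements above can be applied to a stock torchvision ResNet18 roughly as follows (a minimal sketch, not the authors' reference code; it reuses the SpaceToDepth module from Section 3, assumes torchvision ≥ 0.13 for the weights=None argument, and leaves the blocks' batch norms and activations untouched):

import torch.nn as nn
from torchvision.models import resnet18

def spd_wrap(conv, r=2):
    """Replace a stride-2 conv by SpaceToDepth followed by an equivalent stride-1 conv."""
    new = nn.Conv2d(conv.in_channels * r * r, conv.out_channels,
                    kernel_size=conv.kernel_size, stride=1,
                    padding=conv.padding, bias=conv.bias is not None)
    return nn.Sequential(SpaceToDepth(r), new)

def resnet18_spd(num_classes=1000):
    model = resnet18(weights=None, num_classes=num_classes)
    # Stem: 7x7 stride-2 conv + 3x3 stride-2 max-pool -> a single SPD-Conv (scale 2).
    model.conv1 = spd_wrap(nn.Conv2d(3, 64, 3, stride=2, padding=1, bias=False))
    model.maxpool = nn.Identity()
    # Stages 2-4: swap the stride-2 convs (main path and shortcut) for SPD-Convs.
    for stage in (model.layer2, model.layer3, model.layer4):
        block = stage[0]
        block.conv1 = spd_wrap(block.conv1)
        block.downsample[0] = spd_wrap(block.downsample[0])
    return model

With a 224×224 input, the resulting feature maps follow the table above, ending at 14×14×512 before global average pooling.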

YOLOv5 → YOLOv5-SPD:

  • In the CSPDarknet53 backbone and the PANet neck, each 3×3 stride-2 convolution is replaced by SPD-Conv (scale 2), affecting all five backbone downsampling stages and both neck downsampling stages, for a total of seven SPD-Convs.
  • For example, the first downsample maps 640×640×3 to 320×320×32 via SPD-Conv (scale = 2, in_ch = 3, out_ch = 32).
  • The remaining network structure (CSP blocks, PAN blocks, and YOLO head) is unchanged.

5. Empirical Results

Object Detection on COCO-2017

  • Training: Models from scratch on train2017.
  • Evaluation: val2017 (5k images) and test-dev2017 split.
  • Metrics: AP@[.50:.95], AP_Small, AP_Medium, AP_Large.

Model           Params (M)   AP     AP_Small
YOLOv5-n        1.9          28.0   14.14
YOLOv5-SPD-n    2.2          31.0   16.00 (+13.2%)
YOLOv5-s        7.2          37.4   21.09
YOLOv5-SPD-s    8.7          40.0   23.50 (+11.4%)
YOLOv5-m        21.2         45.4   27.90
YOLOv5-SPD-m    24.6         46.5   30.30 (+8.6%)
YOLOv5-l        46.5         49.0   31.80
YOLOv5-SPD-l    52.7         48.5   32.40 (+1.8%)

Test-dev results show similar trends. The highest relative improvements are observed in AP_Small, with increases such as SPD-n: +19%, SPD-s: +9.5%, SPD-m: +6%.

Image Classification

  • Tiny ImageNet (64×64, 200 classes):
    • ResNet18: 61.68% top-1
    • ResNet18-SPD: 64.52% (+2.84%)
  • CIFAR-10 (32×32, 10 classes):
    • ResNet50: 93.94%
    • ResNet50-SPD: 95.03% (+1.09%)

These improvements are consistent across low-resolution tasks and are obtained without increasing model depth or adding extra training techniques.

6. Mechanistic Rationale and Trade-offs

Strided convolutions and pooling indiscriminately discard spatial samples, which is especially detrimental for fine-grained detail and small-object features. By contrast, SPD is a lossless transformation that reorganizes $r \times r$ patches into the channel dimension, so all spatial information is retained per subsampled site. The stride-1 convolution that follows can then perform channel mixing over the entirety of the reorganized local receptive field.
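
Because SPD is a pure permutation of tensor entries, the rearrangement is exactly invertible; a short round-trip check (a sketch using the SpaceToDepth module from Section 3):

import torch

x = torch.randn(2, 3, 8, 8)
y = SpaceToDepth(2)(x)                      # (2, 12, 4, 4): no values are discarded
B, _, Hs, Ws = y.shape
inv = (y.view(B, 2, 2, 3, Hs, Ws)           # undo the channel packing...
         .permute(0, 3, 4, 1, 5, 2)         # ...and restore the r x r spatial offsets
         .reshape(B, 3, Hs * 2, Ws * 2))
print(torch.equal(inv, x))                  # True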

The main computational trade-offs are an increase in activation memory (by a factor of $r^2$ in the immediate output of SPD) and, for the layer in question, an $r^2$-fold increase in kernel parameters, since the subsequent convolution sees $C \cdot r^2$ input channels. Because that convolution immediately reduces the channel dimension back to the target number and such layers make up only a small fraction of a network, overall parameter counts and FLOPs remain close to those of conventional stride-$r$ convolution layers (e.g., 1.9 M → 2.2 M parameters from YOLOv5-n to YOLOv5-SPD-n).
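
As a per-layer illustration (a sketch, not a figure from the paper), compare a 3×3 stride-2 convolution mapping 64 → 128 channels with its SPD-Conv counterpart at scale 2; both produce the same output resolution:

import torch.nn as nn

strided  = nn.Conv2d(64, 128, 3, stride=2, padding=1, bias=False)
spd_conv = nn.Conv2d(64 * 4, 128, 3, stride=1, padding=1, bias=False)  # after SPD with r = 2

print(sum(p.numel() for p in strided.parameters()))    # 73728
print(sum(p.numel() for p in spd_conv.parameters()))   # 294912 (r^2 = 4x for this layer)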

Potential avenues for mitigating overhead include group or depthwise variants of SPD-Conv. Extending the approach to tasks such as semantic segmentation, super-resolution, or video-based learning is a logical direction for further investigation.

7. Summary and Applicability

SPD-Conv is a drop-in replacement for any stride-$r$ convolution or pooling layer: it reduces spatial dimensions by a factor of $r$, preserves all information via channel packing, applies a stride-1 convolution for information fusion, and provides consistent accuracy gains on tasks with low-resolution inputs and/or small objects. The methodology is validated across object detection (COCO-2017) and image classification (Tiny ImageNet, CIFAR-10), with open-source implementations provided for both PyTorch and TensorFlow (Sunkara et al., 2022).

References

Sunkara, R., & Luo, T. (2022). No More Strided Convolutions or Pooling: A New CNN Building Block for Low-Resolution Images and Small Objects.