SPD-Conv: Lossless Downsampling in CNNs

Updated 17 December 2025
  • SPD-Conv is a CNN technique that replaces strided convolutions by converting spatial patches into expanded channel representations, preserving all image details.
  • It reorganizes r×r spatial regions into a higher channel dimension, then applies a stride-1 convolution to achieve an equivalent receptive field without losing spatial fidelity.
  • Empirical results on models like ResNet and YOLOv5 demonstrate notable performance gains on low-resolution images and small object detection with minimal computational overhead.

Space-to-Depth Convolution (SPD-Conv) is a convolutional neural network (CNN) building block designed to replace conventional strided convolution and pooling layers, with the aim of improving performance on low-resolution images and small object detection tasks. SPD-Conv consists of a space-to-depth (SPD) transformation followed by a non-strided convolution, yielding downsampling without loss of spatial information. The method was introduced by Sunkara and Luo in "No More Strided Convolutions or Pooling: A New CNN Building Block for Low-Resolution Images and Small Objects" (Sunkara et al., 2022).

1. Mathematical Formulation of Space-to-Depth (SPD)

Let $X \in \mathbb{R}^{H \times W \times C}$ be an input feature map, and $r \in \mathbb{N}$ the spatial downsampling scale. The SPD operation rearranges spatial information into the channel dimension, resulting in an output $Y \in \mathbb{R}^{\frac{H}{r} \times \frac{W}{r} \times (C \cdot r^2)}$ without information loss. Formally, for $x, y \in \{0, \ldots, r-1\}$ and $c \in \{0, \ldots, C-1\}$,

$$Y[i, j, (x \cdot r + y) \cdot C + c] = X[i \cdot r + x,\; j \cdot r + y,\; c]$$

for all $i = 0, \ldots, \frac{H}{r} - 1$ and $j = 0, \ldots, \frac{W}{r} - 1$. In notation reminiscent of MATLAB slicing:

$$f_{x,y} = X[x : r : H-1,\; y : r : W-1,\; :]$$

and

$$Y = \operatorname{concat}_{(x, y) \in \{0, \ldots, r-1\}^2} f_{x, y}$$

along the channel dimension. SPD thus compresses each $r \times r$ spatial patch into a single spatial site, expanding the channel count accordingly.
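
As a concrete check (a minimal NumPy sketch in the HWC layout used above, not code from the paper), the slicing definition can be compared directly against the per-element index formula:

import numpy as np

H, W, C, r = 4, 4, 3, 2
X = np.arange(H * W * C).reshape(H, W, C)

# f_{x,y} = X[x::r, y::r, :], concatenated along the channel axis
subs = [X[x::r, y::r, :] for x in range(r) for y in range(r)]
Y = np.concatenate(subs, axis=-1)            # shape (H/r, W/r, C*r^2)

# Spot-check the identity Y[i, j, (x*r + y)*C + c] == X[i*r + x, j*r + y, c]
i, j, x, y, c = 1, 0, 1, 1, 2
assert Y[i, j, (x * r + y) * C + c] == X[i * r + x, j * r + y, c]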

2. Equivalence to Strided Convolution

A conventional $k \times k$ convolution with kernel $W_s \in \mathbb{R}^{k \times k \times C \times C'}$, stride $r$, and padding $p$ computes

$$(Z_s)[i, j, p'] = \sum_{u=0}^{k-1} \sum_{v=0}^{k-1} \sum_{c=0}^{C-1} W_s[u, v, c, p'] \cdot X[i \cdot r + u - p,\; j \cdot r + v - p,\; c]$$

Replacing this with SPD-Conv involves:

  1. Applying $X' = \operatorname{SPD}(X, r)$, resulting in shape $\mathbb{R}^{H/r \times W/r \times C r^2}$.
  2. Performing a stride-1 convolution $Z = \operatorname{Conv}_1(X'; W')$ with kernel size $k \times k$, output channels $C'$, and padding $p$.

By constructing $W'$ so that the subkernels of $W_s$ are mapped onto the proper channel slices of $X'$, one obtains $Z_s = Z$ exactly. In practice, $W'$ is simply learned afresh, while the effective receptive field and downsampling factor are preserved. The crucial distinction is that SPD drops no pixels, guaranteeing complete spatial fidelity prior to convolution.
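
This equivalence can be checked numerically in a simple case. The sketch below (PyTorch, channels-first; assuming r = k = 2 and no padding, so that $W'$ is a 1×1 kernel) maps the sub-kernels of $W_s$ onto the channel layout from Section 1 and confirms the two paths agree:

import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, C, H, W, C_out, r = 1, 3, 8, 8, 5, 2
x = torch.randn(B, C, H, W)

# Reference path: 2x2 convolution with stride 2 and no padding.
Ws = torch.randn(C_out, C, r, r)
z_strided = F.conv2d(x, Ws, stride=r)

# SPD path: rearrange each 2x2 patch into channels (layout (u*r + v)*C + c).
xp = x.view(B, C, H // r, r, W // r, r).permute(0, 3, 5, 1, 2, 4).contiguous()
xp = xp.view(B, C * r * r, H // r, W // r)

# Map sub-kernels of Ws onto the matching channel slices:
# W'[:, (u*r + v)*C + c, 0, 0] = Ws[:, c, u, v]
Wp = Ws.permute(0, 2, 3, 1).reshape(C_out, C * r * r, 1, 1)
z_spd = F.conv2d(xp, Wp)  # stride-1, 1x1 kernel

print(torch.allclose(z_strided, z_spd, atol=1e-6))  # True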

3. Implementation: Pseudocode

Minimal implementations in PyTorch and TensorFlow are as follows. In practical architectures, SPD, convolution, normalization, and activation are combined into a single module.

PyTorch:

import torch
import torch.nn as nn


class SpaceToDepth(nn.Module):
    """Rearrange each r x r spatial patch into the channel dimension (lossless)."""
    def __init__(self, r):
        super().__init__()
        self.r = r

    def forward(self, x):
        B, C, H, W = x.size()
        r = self.r
        assert H % r == 0 and W % r == 0, "H and W must be divisible by r"
        # (B, C, H, W) -> (B, C, H/r, r, W/r, r) -> (B, r, r, C, H/r, W/r) -> (B, C*r*r, H/r, W/r)
        x = x.view(B, C, H // r, r, W // r, r)
        x = x.permute(0, 3, 5, 1, 2, 4).contiguous()
        x = x.view(B, C * r * r, H // r, W // r)
        return x


class SPDConv(nn.Module):
    """SPD followed by a non-strided convolution, batch norm, and activation."""
    def __init__(self, in_ch, out_ch, scale=2, k=3, pad=1, bias=False, bn=True, act=nn.ReLU()):
        super().__init__()
        self.spd = SpaceToDepth(scale)
        # The convolution sees in_ch * scale^2 channels after the SPD rearrangement.
        self.conv = nn.Conv2d(in_ch * scale * scale, out_ch, k, stride=1, padding=pad, bias=bias)
        self.bn = nn.BatchNorm2d(out_ch) if bn else nn.Identity()
        self.act = act

    def forward(self, x):
        x = self.spd(x)
        x = self.conv(x)
        x = self.bn(x)
        return self.act(x)
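
A quick shape check for the module above (a usage sketch):

m = SPDConv(in_ch=64, out_ch=128, scale=2)
y = m(torch.randn(1, 64, 224, 224))
print(y.shape)  # torch.Size([1, 128, 112, 112])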

TensorFlow/Keras:

import tensorflow as tf
from tensorflow.keras.layers import Lambda, Conv2D, BatchNormalization, ReLU

def space_to_depth_tf(x, r):
    # Built-in lossless space-to-depth op (NHWC layout).
    return tf.nn.space_to_depth(x, block_size=r)

def SPDConvTF(in_ch, out_ch, scale=2, k=3, pad="same"):
    # in_ch is kept for symmetry with the PyTorch module; Conv2D infers its
    # input channels (in_ch * scale^2) when the model is first built.
    return tf.keras.Sequential([
        Lambda(lambda x: space_to_depth_tf(x, scale)),
        Conv2D(out_ch, k, padding=pad, use_bias=False),
        BatchNormalization(),
        ReLU(),
    ])
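
A quick shape check for the Keras block (a usage sketch; channels-last NHWC input assumed):

block = SPDConvTF(in_ch=3, out_ch=32, scale=2)
y = block(tf.zeros([1, 640, 640, 3]))
print(y.shape)  # (1, 320, 320, 32)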

4. Integration into CNN Architectures

ResNet18/50 → ResNet18-SPD/50-SPD:

  • The initial 7×7 stride-2 convolution and the subsequent 3×3 stride-2 max-pool are replaced by a single SPD-Conv with scale 2 and a 3×3 kernel, transitioning the input $H \times W \times 3 \rightarrow (H/2 \times W/2 \times 12) \rightarrow (H/2 \times W/2 \times 64)$.
  • Max-pooling is dropped.
  • Each subsequent downsampling block (previously a stride-2 convolution) is likewise replaced with an SPD-Conv of scale 2, giving the four SPD-Convs listed below; a minimal patching sketch follows the table.

Feature-map shapes in ResNet18-SPD (input 224×224):

Layer            Output Size   Channels
SPD-Conv₁        112×112       64
BasicBlock ×2    112×112       64
SPD-Conv₂        56×56         128
BasicBlock ×2    56×56         128
SPD-Conv₃        28×28         256
BasicBlock ×2    28×28         256
SPD-Conv₄        14×14         512
BasicBlock ×2    14×14         512
GlobalAvgPool    1×1           512
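
The replacements above can be applied to a stock torchvision ResNet18 roughly as follows (a minimal sketch, not the authors' reference code; it reuses the SpaceToDepth module from Section 3, assumes torchvision ≥ 0.13 for the weights=None argument, and leaves the blocks' batch norms and activations untouched):

import torch.nn as nn
from torchvision.models import resnet18

def spd_wrap(conv, r=2):
    """Replace a stride-2 conv by SpaceToDepth followed by an equivalent stride-1 conv."""
    new = nn.Conv2d(conv.in_channels * r * r, conv.out_channels,
                    kernel_size=conv.kernel_size, stride=1,
                    padding=conv.padding, bias=conv.bias is not None)
    return nn.Sequential(SpaceToDepth(r), new)

def resnet18_spd(num_classes=1000):
    model = resnet18(weights=None, num_classes=num_classes)
    # Stem: 7x7 stride-2 conv + 3x3 stride-2 max-pool -> a single SPD-Conv (scale 2).
    model.conv1 = spd_wrap(nn.Conv2d(3, 64, 3, stride=2, padding=1, bias=False))
    model.maxpool = nn.Identity()
    # Stages 2-4: swap the stride-2 convs (main path and shortcut) for SPD-Convs.
    for stage in (model.layer2, model.layer3, model.layer4):
        block = stage[0]
        block.conv1 = spd_wrap(block.conv1)
        block.downsample[0] = spd_wrap(block.downsample[0])
    return model

With a 224×224 input, the resulting feature maps follow the table above, ending at 14×14×512 before global average pooling.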

YOLOv5 → YOLOv5-SPD:

  • In the CSPDarknet53 backbone and the PANet neck, each 3×3 stride-2 convolution is replaced by SPD-Conv (scale 2), affecting all five backbone downsampling stages and both neck downsampling stages, for a total of seven SPD-Convs.
  • For example, the first downsample maps 640×640×3 to 320×320×32 via SPD-Conv (scale = 2, in_ch = 3, out_ch = 32).
  • The remaining network structure (CSP blocks, PAN blocks, and YOLO head) is unchanged.

5. Empirical Results

Object Detection on COCO-2017

  • Training: Models from scratch on train2017.
  • Evaluation: val2017 (5k images) and test-dev2017 split.
  • Metrics: AP@[.50:.95], AP_Small, AP_Medium, AP_Large.

Model           Params (M)   AP     AP_Small
YOLOv5-n        1.9          28.0   14.14
YOLOv5-SPD-n    2.2          31.0   16.00 (+13.2%)
YOLOv5-s        7.2          37.4   21.09
YOLOv5-SPD-s    8.7          40.0   23.50 (+11.4%)
YOLOv5-m        21.2         45.4   27.90
YOLOv5-SPD-m    24.6         46.5   30.30 (+8.6%)
YOLOv5-l        46.5         49.0   31.80
YOLOv5-SPD-l    52.7         48.5   32.40 (+1.8%)

Test-dev results show similar trends. The highest relative improvements are observed in AP_Small, with increases such as SPD-n: +19%, SPD-s: +9.5%, SPD-m: +6%.

Image Classification

  • Tiny ImageNet (64×64, 200 classes):
    • ResNet18: 61.68% top-1
    • ResNet18-SPD: 64.52% (+2.84%)
  • CIFAR-10 (32×32, 10 classes):
    • ResNet50: 93.94%
    • ResNet50-SPD: 95.03% (+1.09%)

These improvements are consistent across low-resolution tasks and are obtained without increasing model depth or adding extra training techniques.

6. Mechanistic Rationale and Trade-offs

Strided convolutions and pooling indiscriminately discard spatial samples, which is especially detrimental for fine-grained detail and small-object features. By contrast, SPD is a lossless transformation that reorganizes $r \times r$ patches into the channel dimension, so all spatial information is retained per subsampled site. The stride-1 convolution that follows can then perform channel mixing over the entirety of the reorganized local receptive field.
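
Because SPD is a pure permutation of tensor entries, the rearrangement is exactly invertible; a short round-trip check (a sketch using the SpaceToDepth module from Section 3):

import torch

x = torch.randn(2, 3, 8, 8)
y = SpaceToDepth(2)(x)                      # (2, 12, 4, 4): no values are discarded
B, _, Hs, Ws = y.shape
inv = (y.view(B, 2, 2, 3, Hs, Ws)           # undo the channel packing...
         .permute(0, 3, 4, 1, 5, 2)         # ...and restore the r x r spatial offsets
         .reshape(B, 3, Hs * 2, Ws * 2))
print(torch.equal(inv, x))                  # True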

The main computational trade-offs are an increase in activation memory (by a factor of $r^2$ in the immediate output of SPD) and, for the layer in question, an $r^2$-fold increase in kernel parameters, since the subsequent convolution sees $C \cdot r^2$ input channels. Because that convolution immediately reduces the channel dimension back to the target number and such layers make up only a small fraction of a network, overall parameter counts and FLOPs remain close to those of conventional stride-$r$ convolution layers (e.g., 1.9 M → 2.2 M parameters from YOLOv5-n to YOLOv5-SPD-n).
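
As a per-layer illustration (a sketch, not a figure from the paper), compare a 3×3 stride-2 convolution mapping 64 → 128 channels with its SPD-Conv counterpart at scale 2; both produce the same output resolution:

import torch.nn as nn

strided  = nn.Conv2d(64, 128, 3, stride=2, padding=1, bias=False)
spd_conv = nn.Conv2d(64 * 4, 128, 3, stride=1, padding=1, bias=False)  # after SPD with r = 2

print(sum(p.numel() for p in strided.parameters()))    # 73728
print(sum(p.numel() for p in spd_conv.parameters()))   # 294912 (r^2 = 4x for this layer)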

Potential avenues for mitigating overhead include group or depthwise variants of SPD-Conv. Extending the approach to tasks such as semantic segmentation, super-resolution, or video-based learning is a logical direction for further investigation.

7. Summary and Applicability

SPD-Conv is a drop-in replacement for any stride-$r$ convolution or pooling layer: it reduces spatial dimensions by a factor of $r$, preserves all information via channel packing, applies a stride-1 convolution for information fusion, and provides consistent accuracy gains on tasks with low-resolution inputs and/or small objects. The methodology is validated across object detection (COCO-2017) and image classification (Tiny ImageNet, CIFAR-10), with open-source implementations provided for both PyTorch and TensorFlow (Sunkara et al., 2022).

References

Sunkara, R., & Luo, T. (2022). No More Strided Convolutions or Pooling: A New CNN Building Block for Low-Resolution Images and Small Objects.