SPD-Conv: Lossless Downsampling in CNNs
- SPD-Conv is a CNN technique that replaces strided convolutions by converting spatial patches into expanded channel representations, preserving all image details.
- It reorganizes r×r spatial regions into a higher channel dimension, then applies a stride-1 convolution to achieve equivalent receptive field without losing spatial fidelity.
- Empirical results on models like ResNet and YOLOv5 demonstrate notable performance gains on low-resolution images and small object detection with minimal computational overhead.
Space-to-Depth Convolution (SPD-Conv) is a convolutional neural network (CNN) building block designed to replace conventional strided convolution and pooling layers, with the aim of improving performance on low-resolution images and small object detection tasks. SPD-Conv consists of a space-to-depth (SPD) transformation followed by a non-strided convolution, yielding downsampling without loss of spatial information. The method was introduced in "No More Strided Convolutions or Pooling: A New CNN Building Block for Low-Resolution Images and Small Objects" (Sunkara et al., 2022).
1. Mathematical Formulation of Space-to-Depth (SPD)
Let $X \in \mathbb{R}^{C_1 \times H \times W}$ be an input feature map, and $r$ the spatial downsampling scale (with $r \mid H$ and $r \mid W$). The SPD operation rearranges spatial information into the channel dimension, resulting in an output $X' \in \mathbb{R}^{r^2 C_1 \times (H/r) \times (W/r)}$ without information loss. Formally, for $0 \le i < r$ and $0 \le j < r$, the sub-feature maps are

$$f_{i,j}[c, h, w] = X[c,\, rh + i,\, rw + j]$$

for all $0 \le c < C_1$, $0 \le h < H/r$, $0 \le w < W/r$. In notation reminiscent of MATLAB slicing,

$$f_{i,j} = X[:,\; i::r,\; j::r],$$

and

$$X' = \operatorname{concat}\big(f_{0,0},\, f_{0,1},\, \ldots,\, f_{r-1,r-1}\big)$$

along the channel dimension (the exact ordering of the sub-maps is an implementation convention and does not affect losslessness). SPD thus compresses each $r \times r$ spatial patch into a single spatial site, expanding the channel count by a factor of $r^2$.
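As a quick numerical check of the losslessness claim, the following minimal PyTorch sketch (the values are arbitrary) slices the sub-feature maps for $r = 2$ and confirms that every input element survives:

```python
import torch

# Slice the sub-feature maps f_{i,j} = X[:, i::r, j::r] for r = 2 and stack
# them along the channel dimension; every input element appears exactly once.
r = 2
X = torch.arange(2 * 4 * 4, dtype=torch.float32).reshape(2, 4, 4)   # C1=2, H=W=4

subs = [X[:, i::r, j::r] for i in range(r) for j in range(r)]        # four (2, 2, 2) maps
Xp = torch.cat(subs, dim=0)                                          # shape (8, 2, 2)

print(Xp.shape)                                      # torch.Size([8, 2, 2])
print(X.numel() == Xp.numel())                       # True
print(torch.equal(torch.sort(X.flatten()).values,
                  torch.sort(Xp.flatten()).values))  # True: nothing is dropped
```

Every element of $X$ appears exactly once in $X'$, which is what distinguishes SPD from strided sampling.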
2. Equivalence to Strided Convolution
A conventional convolution with kernel $W \in \mathbb{R}^{C_2 \times C_1 \times k \times k}$, stride $r$, and padding $p$ computes

$$Y[o, h, w] = \sum_{c=0}^{C_1-1} \sum_{u=0}^{k-1} \sum_{v=0}^{k-1} W[o, c, u, v]\; X[c,\, rh + u - p,\, rw + v - p],$$

with out-of-range positions treated as zero padding.
Replacing this with SPD-Conv involves:
- Applying $\mathrm{SPD}_r(X)$, resulting in shape $r^2 C_1 \times (H/r) \times (W/r)$.
- Performing a stride-1 convolution with kernel $W'$ of size $k' \times k'$, output channels $C_2$, and padding $p'$.

By constructing $W'$ so that the subkernels of $W$ (its entries grouped by spatial offset modulo $r$) land on the corresponding channel slices of $\mathrm{SPD}_r(X)$, the strided convolution above can be reproduced exactly; in the simplest case $k = r$, $W'$ is a $1 \times 1$ convolution whose weights are the flattened $r \times r$ subkernels. In practice, $W'$ is learned afresh, while the effective receptive field and overall downsampling factor are preserved. The crucial distinction is that SPD incurs no pixel drop, guaranteeing complete spatial fidelity prior to convolution.
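To make this concrete, the following minimal PyTorch sketch (sizes are illustrative; the channel ordering matches the SpaceToDepth module in Section 3) checks the case $k = r = 2$, where $W'$ is a $1 \times 1$ convolution:

```python
import torch
import torch.nn.functional as F

# Verification sketch for k = r = 2: a 2x2 convolution with stride 2 equals a
# 1x1 stride-1 convolution on the space-to-depth output, once the 2x2
# subkernels are scattered onto the matching channel slices.
torch.manual_seed(0)
B, C1, C2, H, W, r = 1, 3, 5, 8, 8, 2

x = torch.randn(B, C1, H, W)
w = torch.randn(C2, C1, r, r)              # 2x2 kernel, stride 2, no padding
y_strided = F.conv2d(x, w, stride=r)       # conventional strided convolution

# Space-to-depth with channel ordering (i, j, c), matching SpaceToDepth below.
xs = x.view(B, C1, H // r, r, W // r, r).permute(0, 3, 5, 1, 2, 4)
xs = xs.contiguous().view(B, C1 * r * r, H // r, W // r)

# Rearrange the kernel so entry (o, c, i, j) lands on channel (i*r + j)*C1 + c.
w1 = w.permute(0, 2, 3, 1).contiguous().view(C2, r * r * C1, 1, 1)

y_spd = F.conv2d(xs, w1, stride=1)         # stride-1 convolution on SPD output
print(torch.allclose(y_strided, y_spd, atol=1e-6))   # True
```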
3. Implementation: Pseudocode
Minimal implementations in PyTorch and TensorFlow are as follows. In practical architectures, SPD, convolution, normalization, and activation are combined into a single module.
PyTorch:
```python
import torch.nn as nn


class SpaceToDepth(nn.Module):
    """Lossless rearrangement of r x r spatial patches into channels."""

    def __init__(self, r):
        super().__init__()
        self.r = r

    def forward(self, x):
        B, C, H, W = x.size()
        r = self.r
        assert H % r == 0 and W % r == 0, "spatial dims must be divisible by r"
        x = x.view(B, C, H // r, r, W // r, r)
        x = x.permute(0, 3, 5, 1, 2, 4).contiguous()   # (B, i, j, C, H/r, W/r)
        return x.view(B, C * r * r, H // r, W // r)


class SPDConv(nn.Module):
    """Space-to-depth followed by a stride-1 convolution, BN, and activation."""

    def __init__(self, in_ch, out_ch, scale=2, k=3, pad=1,
                 bias=False, bn=True, act=nn.ReLU()):
        super().__init__()
        self.spd = SpaceToDepth(scale)
        self.conv = nn.Conv2d(in_ch * scale * scale, out_ch, k,
                              stride=1, padding=pad, bias=bias)
        self.bn = nn.BatchNorm2d(out_ch) if bn else nn.Identity()
        self.act = act

    def forward(self, x):
        x = self.spd(x)
        x = self.conv(x)
        x = self.bn(x)
        return self.act(x)
```
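A quick shape check of the module above (illustrative values, matching the ResNet stem discussed in Section 4):

```python
import torch

# A 224x224 RGB input is downsampled to 112x112: SPD first expands 3 channels
# to 12, then the stride-1 convolution maps them to 64.
m = SPDConv(in_ch=3, out_ch=64, scale=2)
x = torch.randn(1, 3, 224, 224)
print(m.spd(x).shape)   # torch.Size([1, 12, 112, 112])
print(m(x).shape)       # torch.Size([1, 64, 112, 112])
```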
TensorFlow/Keras:
```python
import tensorflow as tf
from tensorflow.keras.layers import Lambda, Conv2D, BatchNormalization, ReLU


def space_to_depth_tf(x, r):
    # Built-in lossless space-to-depth rearrangement (expects NHWC layout).
    return tf.nn.space_to_depth(x, block_size=r)


def SPDConvTF(in_ch, out_ch, scale=2, k=3, pad="same"):
    # in_ch is kept for signature symmetry; Keras infers input channels.
    return tf.keras.Sequential([
        Lambda(lambda x: space_to_depth_tf(x, scale)),
        Conv2D(out_ch, k, padding=pad, use_bias=False),
        BatchNormalization(),
        ReLU(),
    ])
```
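And the corresponding check in channels-last (NHWC) layout, with illustrative values:

```python
import tensorflow as tf

# Same shape check as the PyTorch version: 224x224x3 -> 112x112x64.
block = SPDConvTF(in_ch=3, out_ch=64, scale=2)
y = block(tf.random.normal([1, 224, 224, 3]))
print(y.shape)   # (1, 112, 112, 64)
```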
4. Integration into CNN Architectures
ResNet18/50 → ResNet18-SPD/50-SPD:
- The initial stride-2 7×7 convolution and the subsequent stride-2 max-pool are replaced by a single SPD-Conv with scale 2, transitioning the 224×224×3 input to 112×112×64.
- Max-pooling is dropped.
- Each of the four downsampling stages (previously stride-2 convolutions) is handled by an SPD-Conv with scale 2.
Feature-map shapes in ResNet18-SPD (224×224×3 input):
| Layer | Output Size | Channels |
|---|---|---|
| SPD-Conv₁ | 112×112 | 64 |
| BasicBlock×2 | 112×112 | 64 |
| SPD-Conv₂ | 56×56 | 128 |
| BasicBlock×2 | 56×56 | 128 |
| SPD-Conv₃ | 28×28 | 256 |
| BasicBlock×2 | 28×28 | 256 |
| SPD-Conv₄ | 14×14 | 512 |
| BasicBlock×2 | 14×14 | 512 |
| GlobalAvgPool | 1×1 | 512 |
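A minimal sketch of the stem replacement described above, using torchvision's stock resnet18 together with the SPDConv module from Section 3 (replacing the remaining stride-2 convolutions inside the residual stages is analogous and omitted here):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

# Graft an SPD-Conv stem onto resnet18: 224x224x3 -> 112x112x64, with the
# stride-2 max-pool dropped, as in the table above. Note that resnet's own
# bn1/relu still follow; that is harmless for this shape check.
model = resnet18()
model.conv1 = SPDConv(3, 64, scale=2)   # SPDConv module from Section 3
model.maxpool = nn.Identity()           # remove the lossy pooling stage

x = torch.randn(1, 3, 224, 224)
print(model.conv1(x).shape)             # torch.Size([1, 64, 112, 112])
print(model(x).shape)                   # torch.Size([1, 1000])
```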
YOLOv5 → YOLOv5-SPD:
- In the CSPDarknet53 backbone and the PANet neck, each stride-2 convolution is replaced by an SPD-Conv (scale 2), affecting all five backbone downsampling stages and both neck downsampling stages, for a total of seven SPD-Convs.
- For example, after the first downsample a 640×640×3 input becomes 320×320 spatially, with SPD expanding the 3 input channels to 12 before the stride-1 convolution restores the stage's target channel width.
- The remaining network structure (CSP blocks, PAN blocks, and YOLO head) is unchanged.
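The substitution can also be scripted generically. The helper below is an illustrative sketch, not the paper's exact YOLOv5 integration (which operates on YOLOv5's composite Conv blocks rather than bare nn.Conv2d layers): it walks a module tree and swaps every stride-2 convolution for an SPD-Conv of matching width.

```python
import torch.nn as nn

# Illustrative helper: recursively replace each stride-2 nn.Conv2d with an
# SPD-Conv (scale 2) of the same channel widths. BatchNorm/activation are
# disabled here because YOLOv5-style wrappers already apply their own.
# SPDConv is the module from Section 3.
def replace_strided_convs(module: nn.Module) -> nn.Module:
    for name, child in module.named_children():
        if isinstance(child, nn.Conv2d) and child.stride == (2, 2):
            k = child.kernel_size[0]
            setattr(module, name,
                    SPDConv(child.in_channels, child.out_channels, scale=2,
                            k=k, pad=k // 2, bias=child.bias is not None,
                            bn=False, act=nn.Identity()))
        else:
            replace_strided_convs(child)
    return module
```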
5. Empirical Results
Object Detection on COCO-2017
- Training: models are trained from scratch on train2017.
- Evaluation: val2017 (5k images) and the test-dev2017 split.
- Metrics: AP, AP_Small, AP_Medium, AP_Large.
| Model | Params (M) | AP | AP_Small |
|---|---|---|---|
| YOLOv5-n | 1.9 | 28.0 | 14.14 |
| YOLOv5-SPD-n | 2.2 | 31.0 | 16.00 (+13.2%) |
| YOLOv5-s | 7.2 | 37.4 | 21.09 |
| YOLOv5-SPD-s | 8.7 | 40.0 | 23.50 (+11.4%) |
| YOLOv5-m | 21.2 | 45.4 | 27.90 |
| YOLOv5-SPD-m | 24.6 | 46.5 | 30.30 (+8.6%) |
| YOLOv5-l | 46.5 | 49.0 | 31.80 |
| YOLOv5-SPD-l | 52.7 | 48.5 | 32.40 (+1.8%) |
Test-dev results show similar trends. The highest relative improvements are observed in AP_Small, with increases such as SPD-n: +19%, SPD-s: +9.5%, SPD-m: +6%.
Image Classification
- Tiny ImageNet (64×64, 200 classes):
  - ResNet18: 61.68% top-1
  - ResNet18-SPD: 64.52% top-1 (+2.84 percentage points)
- CIFAR-10 (32×32, 10 classes):
  - ResNet50: 93.94% top-1
  - ResNet50-SPD: 95.03% top-1 (+1.09 percentage points)
These improvements are consistent on low-resolution tasks and do not involve increases in model depth or use of additional training techniques.
6. Mechanistic Rationale and Trade-offs
Strided convolutions and pooling indiscriminately discard spatial samples, which is especially detrimental for fine-grained detail and small object features. By contrast, SPD is a lossless transformation—reorganizing patches into the channel dimension—such that all spatial information is retained per subsampled site. The stride-1 convolution that follows can then perform channel mixing over the entirety of the reorganized local receptive field.
The main computational trade-off is an increase in activation memory: the immediate output of SPD carries $r^2$ times as many channels as the input (though the same total number of elements, since the spatial extent shrinks by the same factor), while the SPD rearrangement itself adds no kernel parameters. This overhead is mitigated because the subsequent convolution reduces the channel dimension back to the target number. In practice, overall parameter counts and FLOPs remain comparable to those of conventional stride-$r$ convolution layers, as reflected in the modest parameter increases reported above.
Potential avenues for mitigating overhead include group or depthwise variants of SPD-Conv. Extending the approach to tasks such as semantic segmentation, super-resolution, or video-based learning is a logical direction for further investigation.
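As a sketch of the grouped variant mentioned above (assuming the grouping factor divides both channel counts), the stride-1 convolution inside the module from Section 3 simply takes a groups argument:

```python
import torch.nn as nn

# Sketch of a grouped SPD-Conv: the stride-1 convolution uses `groups` to cut
# its parameters and FLOPs by roughly that factor.
class GroupedSPDConv(nn.Module):
    def __init__(self, in_ch, out_ch, scale=2, k=3, pad=1, groups=4):
        super().__init__()
        self.spd = SpaceToDepth(scale)              # from Section 3
        self.conv = nn.Conv2d(in_ch * scale * scale, out_ch, k, stride=1,
                              padding=pad, groups=groups, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(self.spd(x))))
```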
7. Summary and Applicability
SPD-Conv is a drop-in replacement for any stride-$r$ convolution or pooling layer: it reduces the spatial dimensions by a factor of $r$, preserves all information via channel packing, applies a stride-1 convolution for information fusion, and provides consistent accuracy gains on tasks with low-resolution inputs and/or small objects. The methodology is validated across object detection (COCO-2017) and image classification (Tiny ImageNet, CIFAR-10), with open-source implementations provided for both PyTorch and TensorFlow (Sunkara et al., 2022).