YOLOv5-SPD: Enhanced Object Detection
- YOLOv5-SPD is a modified version of YOLOv5 that replaces stride-2 convolutions with SPD-Conv blocks to preserve detailed spatial information.
- Empirical results on COCO benchmarks show improvements of up to 19% in small object detection metrics, with only modest increases in model size.
- Its implementation uses standard PyTorch modules, enabling easy integration and potential adaptation to other detection architectures.
YOLOv5-SPD refers to a family of modifications to the YOLOv5 object detection architecture, specifically involving integration of the SPD-Conv (Space-to-Depth Convolution) block in place of traditional strided convolution and pooling layers. This approach fundamentally alters the downsampling pathway to preserve more fine-grained spatial information, resulting in improved performance, especially for low-resolution images and small objects. The YOLOv5-SPD design has been empirically validated on benchmark datasets, demonstrating consistent improvements in small object detection metrics with only minor increases in model size and computational cost (Sunkara et al., 2022).
1. SPD-Conv Block and Model Architecture Alterations
YOLOv5-SPD replaces every stride-2 convolution and pooling operation in the standard YOLOv5 backbone (CSPDarknet53-SPP) and PANet neck with an SPD-Conv block. In the YOLOv5 backbone, five stride-2 convolutions progressively halve the feature-map resolution, and two further downsampling layers sit in the neck; all seven are replaced. The SPD-Conv block consists of a parameter-free SPD layer, which spatially rearranges the input tensor with scale factor $s$ (here $s = 2$), followed by a standard convolution with stride 1 that mixes the expanded set of channels.
The SPD operation maps an input $X \in \mathbb{R}^{C \times H \times W}$ to $X' \in \mathbb{R}^{s^2 C \times (H/s) \times (W/s)}$ via deterministic reshuffling: for each input channel, it extracts $s^2$ interleaved submaps and concatenates them along the channel axis.
The full sequence for downsampling thus becomes:
- An SPD layer rearranging the feature map to coalesce spatial neighborhoods into channels,
- Followed by a convolution operating at stride 1.
| Stage | Original Operation | YOLOv5-SPD Operation |
|---|---|---|
| Backbone conv1–5 | Conv2d (stride=2) | SPD-Conv (scale=2) |
| Neck downsample1–2 | Conv2d (stride=2) | SPD-Conv (scale=2) |
This design leaves the overall downsampling schedule unchanged compared to the original stride-2 convolution design, while the effective receptive field at each stage is preserved or slightly enlarged (Sunkara et al., 2022).
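As a minimal illustration of this rearrangement (a sketch, not the reference implementation; the helper name and tensor sizes below are chosen for illustration), the SPD layer can be written as submap slicing followed by channel concatenation:

```python
import torch

def space_to_depth(x: torch.Tensor, s: int = 2) -> torch.Tensor:
    """Rearrange (B, C, H, W) into (B, s*s*C, H//s, W//s) without discarding any values."""
    # Each submap takes every s-th pixel, offset by (i, j) within an s x s neighborhood.
    submaps = [x[:, :, i::s, j::s] for i in range(s) for j in range(s)]
    return torch.cat(submaps, dim=1)

x = torch.randn(2, 32, 128, 128)   # a hypothetical intermediate feature map
y = space_to_depth(x, s=2)
print(y.shape)                      # torch.Size([2, 128, 64, 64])
```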
2. Mathematical Specification and Computational Properties
Let $X \in \mathbb{R}^{C \times H \times W}$ be a feature map and $s$ the scale factor. Space-to-depth reshuffling partitions $X$ into $s^2$ sub-feature maps $f_{i,j}$, where $f_{i,j}[c, y, x] = X[c,\, s y + i,\, s x + j]$ for $i, j \in \{0, \dots, s-1\}$, and concatenates them along the channel axis, giving $X' \in \mathbb{R}^{s^2 C \times (H/s) \times (W/s)}$.
Following the SPD layer, a regular convolution with stride 1 is applied, with $s^2 C$ in-channels and a chosen number of out-channels $C_2$ (typically close to the input channel count or the original YOLOv5 setting, to control model size).
Key computational effects:
- The output resolution after SPD-Conv matches that of the original stride-2 convolution, and the effective receptive field is preserved or slightly enlarged: the SPD layer followed by a $k \times k$ stride-1 convolution covers a $(k \cdot s) \times (k \cdot s)$ region of the input, corresponding to a $k \times k$ patch in the downsampled grid.
- Parameter count can be matched to the original by setting the SPD-Conv out-channels to $C_2 / s^2$: a $k \times k$ stride-$s$ convolution from $C$ to $C_2$ channels has $k^2 C C_2$ weights, exactly as many as a $k \times k$ stride-1 convolution from $s^2 C$ to $C_2 / s^2$ channels. In practice YOLOv5-SPD keeps the out-channel widths close to the original setting, which accounts for the modest model-level parameter increases reported in Section 3.
- FLOPs are effectively unchanged under the same sizing: the SPD rearrangement is a pure reshaping, and the stride-1 convolution operates on the downsampled $(H/s) \times (W/s)$ grid, performing $\frac{H}{s} \cdot \frac{W}{s} \cdot k^2 \cdot (s^2 C) \cdot \frac{C_2}{s^2} = \frac{H}{s} \cdot \frac{W}{s} \cdot k^2 C C_2$ multiply-accumulate operations, the same as the original stride-$s$ convolution.
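This accounting can be verified numerically (a quick sanity-check sketch; the channel counts 64 and 128 are arbitrary illustrative values):

```python
import torch.nn as nn

C, C2, k, s = 64, 128, 3, 2

# Original downsampling layer: k x k convolution with stride s.
strided = nn.Conv2d(C, C2, kernel_size=k, stride=s, padding=k // 2, bias=False)

# SPD-Conv sized to match: s^2 * C in-channels, C2 / s^2 out-channels, stride 1.
matched = nn.Conv2d(s * s * C, C2 // (s * s), kernel_size=k, stride=1,
                    padding=k // 2, bias=False)

params = lambda m: sum(p.numel() for p in m.parameters())
print(params(strided), params(matched))  # 73728 73728 -> identical weight counts
# Both layers also produce an (H/s) x (W/s) output grid, so FLOPs match as well.
```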
3. Empirical Performance on Object Detection Benchmarks
YOLOv5-SPD was evaluated on COCO-2017 object detection benchmarks. For all model scales (nano, small, medium, large), the SPD-Conv replacement yielded consistent gains in AP on small objects (APS), with either matched or improved overall AP@[.5:.95]. Modest increases in parameter count (within 20%) were observed relative to the baseline.
Significant benchmark results:
| Model | Params (M) | AP@[.5:.95] | APS | ΔAPS vs. YOLOv5 |
|---|---|---|---|---|
| YOLOv5n | 1.9 | 28.1 | 12.7 | — |
| YOLOv5-SPD-n | 2.2 | 30.4 | 15.1 | +19.0% |
| YOLOv5s | 7.2 | 37.1 | 20.0 | — |
| YOLOv5-SPD-s | 8.7 | 39.7 | 21.9 | +9.5% |
| YOLOv5m | 21.2 | 45.5 | 26.6 | — |
| YOLOv5-SPD-m | 24.6 | 46.6 | 28.2 | +6.0% |
| YOLOv5l | 46.5 | 49.0 | 29.9 | — |
| YOLOv5-SPD-l | 52.7 | 48.8 | 30.0 | +0.3% |
These results show that SPD-Conv improves APS at every model scale, with relative gains ranging from +0.3% (large) to +19% (nano), the benefit being most pronounced for the smaller models (Sunkara et al., 2022).
4. Training Strategy and Implementation
YOLOv5-SPD uses standard PyTorch implementations. The COCO-2017 dataset (train2017, val2017, test-dev2017) is used for training and evaluation, with a fixed input resolution chosen to accentuate small-object performance. Training employs stochastic gradient descent with momentum (0.937), weight decay ($5 \times 10^{-4}$, the inherited YOLOv5 default), a linear warm-up followed by cosine decay of the learning rate, and batch sizes adapted to model scale.
Losses: Complete IoU (CIoU) for bounding-box regression and binary cross-entropy for the objectness and classification terms. Data augmentation includes hue/saturation/value jitter; random translation, scale, and shear; horizontal and vertical flips; Mosaic; and CutMix. No additional hyper-parameter tuning is required beyond the inherited YOLOv5 defaults.
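A rough sketch of this optimization setup in PyTorch is shown below (illustrative only: the learning rate, weight-decay value, warm-up length, and iteration count are assumed YOLOv5-style defaults rather than values specified here, and the stand-in model is a placeholder):

```python
import math
import torch
import torch.nn as nn

model = nn.Conv2d(3, 16, 3)  # placeholder; in practice this is the YOLOv5-SPD network

# Assumed YOLOv5-style defaults; only the momentum value (0.937) is stated above.
lr0, weight_decay, warmup_iters, total_iters = 0.01, 5e-4, 1_000, 100_000

optimizer = torch.optim.SGD(model.parameters(), lr=lr0, momentum=0.937,
                            weight_decay=weight_decay, nesterov=True)

def lr_lambda(it: int) -> float:
    # Linear warm-up followed by cosine decay, as described above.
    if it < warmup_iters:
        return (it + 1) / warmup_iters
    progress = (it - warmup_iters) / max(1, total_iters - warmup_iters)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# Each training step calls optimizer.step() followed by scheduler.step().
```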
5. Practical Implementation: PyTorch Pseudocode
A succinct PyTorch-style pseudocode for the SPD-Conv block is given below:
```python
import torch
import torch.nn as nn


class SpaceToDepth(nn.Module):
    """Parameter-free SPD layer: rearranges p x p spatial blocks into channels."""

    def __init__(self, p):
        super().__init__()
        self.p = p

    def forward(self, x):
        B, C, H, W = x.shape
        p = self.p
        assert H % p == 0 and W % p == 0
        # Split H and W into (H//p, p) and (W//p, p), move the p x p offsets into
        # the channel dimension, then flatten to (B, C*p*p, H//p, W//p).
        x = x.view(B, C, H // p, p, W // p, p)
        x = x.permute(0, 1, 3, 5, 2, 4).contiguous()
        return x.view(B, C * p * p, H // p, W // p)


class SPDConv(nn.Module):
    """SPD-Conv block: space-to-depth followed by a stride-1 convolution."""

    def __init__(self, in_ch, out_ch, scale=2, k=3):
        super().__init__()
        self.spd = SpaceToDepth(scale)
        # Non-strided convolution mixing the expanded scale^2 * in_ch channels.
        self.conv = nn.Conv2d(in_ch * scale * scale, out_ch,
                              kernel_size=k, stride=1, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.SiLU()

    def forward(self, x):
        x = self.spd(x)
        x = self.conv(x)
        x = self.bn(x)
        return self.act(x)
```
Integration into the YOLOv5 model definition amounts to replacing each downsampling Conv(c1, c2, k, s=2) module with SPDConv(c1, c2, scale=2, k=k) (Sunkara et al., 2022).
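A quick shape check, continuing from the code above (the channel counts and input size are arbitrary):

```python
block = SPDConv(in_ch=64, out_ch=128, scale=2, k=3)
x = torch.randn(1, 64, 160, 160)
y = block(x)
print(y.shape)  # torch.Size([1, 128, 80, 80]); the same 2x downsampling as a stride-2 conv
```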
6. Significance and Implications
The YOLOv5-SPD architecture eliminates information loss associated with strided convolutions and pooling, preserving fine spatial detail that is beneficial for tasks involving small objects and low-resolution imagery. The design achieves these improvements with minimal or negligible extra implementation complexity, only slight increases in model size, and essentially unchanged inference-time compute (FLOPs). The consistent APS gains across model sizes suggest that the SPD-Conv approach generalizes well. A plausible implication is that SPD-Conv can be adapted to other detection and classification backbones with similar benefits, particularly wherever fine-grained localization fidelity is crucial. The open-source code provided by LabSAINT enables further experimentation and broader adoption (Sunkara et al., 2022).