YOLOv5-SPD: Enhanced Object Detection
- YOLOv5-SPD is a modified version of YOLOv5 that replaces stride-2 convolutions with SPD-Conv blocks to preserve detailed spatial information.
- Empirical results on COCO benchmarks show improvements of up to 19% in small object detection metrics, with only modest increases in model size.
- Its implementation uses standard PyTorch modules, enabling easy integration and potential adaptation to other detection architectures.
YOLOv5-SPD refers to a family of modifications to the YOLOv5 object detection architecture, specifically involving integration of the SPD-Conv (Space-to-Depth Convolution) block in place of traditional strided convolution and pooling layers. This approach fundamentally alters the downsampling pathway to preserve more fine-grained spatial information, resulting in improved performance, especially for low-resolution images and small objects. The YOLOv5-SPD design has been empirically validated on benchmark datasets, demonstrating consistent improvements in small object detection metrics with only minor increases in model size and computational cost (Sunkara et al., 2022).
1. SPD-Conv Block and Model Architecture Alterations
YOLOv5-SPD replaces every stride-2 convolution and pooling operation in the standard YOLOv5 backbone (CSPDarknet53-SPP) and PANet neck with an SPD-Conv block. In the YOLOv5 backbone, five stride-2 convolutions progressively halve the feature-map resolution, and two further downsampling layers sit in the neck; all seven are replaced. The SPD-Conv block consists of a parameter-free SPD layer, which spatially rearranges the input tensor with scale factor $s$ (here $s = 2$), followed by a standard convolution with stride 1 that mixes the expanded set of channels.
The SPD operation maps an input $X \in \mathbb{R}^{C \times H \times W}$ to $X' \in \mathbb{R}^{s^2 C \times (H/s) \times (W/s)}$ via deterministic reshuffling: for each input channel, it extracts $s^2$ interleaved submaps and concatenates them along the channel axis.
The full sequence for downsampling thus becomes:
- An SPD layer rearranging the feature map to coalesce spatial neighborhoods into channels,
- Followed by a convolution operating at stride 1.
| Stage | Original Operation | YOLOv5-SPD Operation |
|---|---|---|
| Backbone conv1–5 | Conv2d (stride=2) | SPD-Conv (scale=2) |
| Neck downsample1–2 | Conv2d (stride=2) | SPD-Conv (scale=2) |
This design leaves the overall downsampling schedule unchanged compared to the original stride-2 convolution design, while the effective receptive field at each stage is preserved or slightly enlarged (Sunkara et al., 2022).
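As a minimal illustration of this rearrangement (a sketch, not the reference implementation; the helper name and tensor sizes below are chosen for illustration), the SPD layer can be written as submap slicing followed by channel concatenation:

```python
import torch

def space_to_depth(x: torch.Tensor, s: int = 2) -> torch.Tensor:
    """Rearrange (B, C, H, W) into (B, s*s*C, H//s, W//s) without discarding any values."""
    # Each submap takes every s-th pixel, offset by (i, j) within an s x s neighborhood.
    submaps = [x[:, :, i::s, j::s] for i in range(s) for j in range(s)]
    return torch.cat(submaps, dim=1)

x = torch.randn(2, 32, 128, 128)   # a hypothetical intermediate feature map
y = space_to_depth(x, s=2)
print(y.shape)                      # torch.Size([2, 128, 64, 64])
```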
2. Mathematical Specification and Computational Properties
Let $X \in \mathbb{R}^{C \times H \times W}$ be a feature map and $s$ the scale factor. Space-to-depth reshuffling partitions $X$ into $s^2$ sub-feature maps $f_{i,j}$, where $f_{i,j}[c, y, x] = X[c,\, s y + i,\, s x + j]$ for $i, j \in \{0, \dots, s-1\}$, and concatenates them along the channel axis, giving $X' \in \mathbb{R}^{s^2 C \times (H/s) \times (W/s)}$.
Following the SPD layer, a regular convolution with stride 1 is applied, with $s^2 C$ in-channels and a chosen number of out-channels $C_2$ (typically close to the input channel count or the original YOLOv5 setting, to control model size).
Key computational effects:
- The output resolution after SPD-Conv matches that of the original stride-2 convolution, and the effective receptive field is preserved or slightly enlarged: the SPD layer followed by a $k \times k$ stride-1 convolution covers a $(k \cdot s) \times (k \cdot s)$ region of the input, corresponding to a $k \times k$ patch in the downsampled grid.
- Parameter count can be matched to the original by setting the SPD-Conv out-channels to $C_2 / s^2$: a $k \times k$ stride-$s$ convolution from $C$ to $C_2$ channels has $k^2 C C_2$ weights, exactly as many as a $k \times k$ stride-1 convolution from $s^2 C$ to $C_2 / s^2$ channels. In practice YOLOv5-SPD keeps the out-channel widths close to the original setting, which accounts for the modest model-level parameter increases reported in Section 3.
- FLOPs are effectively unchanged under the same sizing: the SPD rearrangement is a pure reshaping, and the stride-1 convolution operates on the downsampled $(H/s) \times (W/s)$ grid, performing $\frac{H}{s} \cdot \frac{W}{s} \cdot k^2 \cdot (s^2 C) \cdot \frac{C_2}{s^2} = \frac{H}{s} \cdot \frac{W}{s} \cdot k^2 C C_2$ multiply-accumulate operations, the same as the original stride-$s$ convolution.
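This accounting can be verified numerically (a quick sanity-check sketch; the channel counts 64 and 128 are arbitrary illustrative values):

```python
import torch.nn as nn

C, C2, k, s = 64, 128, 3, 2

# Original downsampling layer: k x k convolution with stride s.
strided = nn.Conv2d(C, C2, kernel_size=k, stride=s, padding=k // 2, bias=False)

# SPD-Conv sized to match: s^2 * C in-channels, C2 / s^2 out-channels, stride 1.
matched = nn.Conv2d(s * s * C, C2 // (s * s), kernel_size=k, stride=1,
                    padding=k // 2, bias=False)

params = lambda m: sum(p.numel() for p in m.parameters())
print(params(strided), params(matched))  # 73728 73728 -> identical weight counts
# Both layers also produce an (H/s) x (W/s) output grid, so FLOPs match as well.
```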
3. Empirical Performance on Object Detection Benchmarks
YOLOv5-SPD was evaluated on COCO-2017 object detection benchmarks. For all model scales (nano, small, medium, large), the SPD-Conv replacement yielded consistent gains in AP on small objects (APS), with either matched or improved overall AP@[.5:.95]. Modest increases in parameter count (within 20%) were observed relative to the baseline.
Significant benchmark results:
| Model | Params (M) | AP@[.5:.95] | APS | ΔAPS vs. YOLOv5 |
|---|---|---|---|---|
| YOLOv5n | 1.9 | 28.1 | 12.7 | — |
| YOLOv5-SPD-n | 2.2 | 30.4 | 15.1 | +19.0% |
| YOLOv5s | 7.2 | 37.1 | 20.0 | — |
| YOLOv5-SPD-s | 8.7 | 39.7 | 21.9 | +9.5% |
| YOLOv5m | 21.2 | 45.5 | 26.6 | — |
| YOLOv5-SPD-m | 24.6 | 46.6 | 28.2 | +6.0% |
| YOLOv5l | 46.5 | 49.0 | 29.9 | — |
| YOLOv5-SPD-l | 52.7 | 48.8 | 30.0 | +0.3% |
These results show that SPD-Conv improves APS at every model scale, with relative gains ranging from +0.3% (large) to +19% (nano), the benefit being most pronounced for the smaller models (Sunkara et al., 2022).
4. Training Strategy and Implementation
YOLOv5-SPD uses standard PyTorch implementations. The COCO-2017 dataset (train2017, val2017, test-dev2017) is used for training and evaluation, with a fixed input resolution chosen to accentuate small-object performance. Training employs stochastic gradient descent with momentum (0.937), weight decay ($5 \times 10^{-4}$, the inherited YOLOv5 default), a linear warm-up followed by cosine decay of the learning rate, and batch sizes adapted to model scale.
Losses: Complete IoU (CIoU) for bounding-box regression and binary cross-entropy for the objectness and classification terms. Data augmentation includes hue/saturation/value jitter; random translation, scale, and shear; horizontal and vertical flips; Mosaic; and CutMix. No additional hyper-parameter tuning is required beyond the inherited YOLOv5 defaults.
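A rough sketch of this optimization setup in PyTorch is shown below (illustrative only: the learning rate, weight-decay value, warm-up length, and iteration count are assumed YOLOv5-style defaults rather than values specified here, and the stand-in model is a placeholder):

```python
import math
import torch
import torch.nn as nn

model = nn.Conv2d(3, 16, 3)  # placeholder; in practice this is the YOLOv5-SPD network

# Assumed YOLOv5-style defaults; only the momentum value (0.937) is stated above.
lr0, weight_decay, warmup_iters, total_iters = 0.01, 5e-4, 1_000, 100_000

optimizer = torch.optim.SGD(model.parameters(), lr=lr0, momentum=0.937,
                            weight_decay=weight_decay, nesterov=True)

def lr_lambda(it: int) -> float:
    # Linear warm-up followed by cosine decay, as described above.
    if it < warmup_iters:
        return (it + 1) / warmup_iters
    progress = (it - warmup_iters) / max(1, total_iters - warmup_iters)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# Each training step calls optimizer.step() followed by scheduler.step().
```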
5. Practical Implementation: PyTorch Pseudocode
A succinct PyTorch-style pseudocode for the SPD-Conv block is given below:
```python
import torch
import torch.nn as nn


class SpaceToDepth(nn.Module):
    """Parameter-free SPD layer: rearranges p x p spatial blocks into channels."""

    def __init__(self, p):
        super().__init__()
        self.p = p

    def forward(self, x):
        B, C, H, W = x.shape
        p = self.p
        assert H % p == 0 and W % p == 0
        # Split H and W into (H//p, p) and (W//p, p), move the p x p offsets into
        # the channel dimension, then flatten to (B, C*p*p, H//p, W//p).
        x = x.view(B, C, H // p, p, W // p, p)
        x = x.permute(0, 1, 3, 5, 2, 4).contiguous()
        return x.view(B, C * p * p, H // p, W // p)


class SPDConv(nn.Module):
    """SPD-Conv block: space-to-depth followed by a stride-1 convolution."""

    def __init__(self, in_ch, out_ch, scale=2, k=3):
        super().__init__()
        self.spd = SpaceToDepth(scale)
        # Non-strided convolution mixing the expanded scale^2 * in_ch channels.
        self.conv = nn.Conv2d(in_ch * scale * scale, out_ch,
                              kernel_size=k, stride=1, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.SiLU()

    def forward(self, x):
        x = self.spd(x)
        x = self.conv(x)
        x = self.bn(x)
        return self.act(x)
```
Integration into the YOLOv5 model definition amounts to replacing each downsampling Conv(c1, c2, k, s=2) module with SPDConv(c1, c2, scale=2, k=k) (Sunkara et al., 2022).
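A quick shape check, continuing from the code above (the channel counts and input size are arbitrary):

```python
block = SPDConv(in_ch=64, out_ch=128, scale=2, k=3)
x = torch.randn(1, 64, 160, 160)
y = block(x)
print(y.shape)  # torch.Size([1, 128, 80, 80]); the same 2x downsampling as a stride-2 conv
```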
6. Significance and Implications
The YOLOv5-SPD architecture eliminates information loss associated with strided convolutions and pooling, preserving fine spatial detail that is beneficial for tasks involving small objects and low-resolution imagery. The design achieves these improvements with minimal or negligible extra implementation complexity, only slight increases in model size, and essentially unchanged inference-time compute (FLOPs). The consistent APS gains across model sizes suggest that the SPD-Conv approach generalizes well. A plausible implication is that SPD-Conv can be adapted to other detection and classification backbones with similar benefits, particularly wherever fine-grained localization fidelity is crucial. The open-source code provided by LabSAINT enables further experimentation and broader adoption (Sunkara et al., 2022).