YOLOv8n-SPTS: Advanced Small-Target Detector
- The paper presents a novel YOLOv8n-SPTS model that integrates SPD-Conv, SPPFCSPC, and TSFP modules to significantly improve small-target detection under occlusion and scale imbalance.
- The model processes 3×640×640 RGB images and achieves a +10.8 percentage point mAP@0.5 improvement on VisDrone2019-DET while enhancing recall and precision over standard YOLOv8n.
- YOLOv8n-SPTS is highly applicable in autonomous driving scenarios, reducing missed detections of pedestrians and bicycles and maintaining competitive inference speed on modern GPUs.
The YOLOv8n-SPTS model is an advanced small-target object detector developed for autonomous driving scenarios, with a focus on improving the recognition and localization of small traffic participants under occlusion and scale imbalance. The architecture extends the baseline YOLOv8n framework with novel feature extraction, fusion, and detection mechanisms, and incorporates Pyramid Sparse Transformer (PST) modules for multi-scale, attention-driven fusion. Empirical results on benchmarks such as VisDrone2019-DET and MS COCO demonstrate significant improvements in detection metrics, parameter efficiency, and inference speed compared to prior works (Wu, 10 Dec 2025; Hu et al., 19 May 2025).
1. Architectural Innovations and Model Overview
YOLOv8n-SPTS retains the canonical YOLOv8n pipeline (Backbone → Neck → Detect Head) but introduces three principal modifications: SPD-Conv modules for backbone feature preservation, SPPFCSPC modules for contextual feature aggregation, and a Triple-Stage Feature Pyramid (TSFP) for refined allocation of detection scales. Additionally, the model supports integration of PST modules in the neck to further boost multi-scale feature fusion.
The architecture accepts 3×640×640 RGB images as input and processes them through successive convolutions before entering specialized modules. The backbone replaces four standard convolutional blocks with SPD-Convs—each performing a space-to-depth transformation followed by 3×3 convolution without spatial striding, thus quadrupling channel depth and halving spatial size per block while preserving fine spatial information. The final backbone features are aggregated with SPPFCSPC. The neck comprises FPN/PAN-style upsampling and concatenation operations, possibly interleaved with PST modules to enable attention-based cross-scale fusion. Detection occurs at three spatial scales: 160×160, 80×80, and 40×40, corresponding to stride-4, stride-8, and stride-16 heads.
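For orientation, the following minimal shape trace (not from the paper) shows how four successive space-to-depth steps take a 640×640 input down to a 40×40 map; it ignores any stem convolution and the 3×3 convolutions inside each SPD-Conv, which set the actual channel widths in the real backbone.

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 640, 640)  # 3x640x640 RGB input
print(tuple(x.shape))
for stage in range(4):
    # Space-to-depth with block size 2: (B, C, H, W) -> (B, 4C, H/2, W/2).
    # In the real backbone a non-strided 3x3 conv follows each rearrangement.
    x = F.pixel_unshuffle(x, 2)
    print(f"after SPD stage {stage + 1}: {tuple(x.shape)}")
# (1, 3, 640, 640)
# after SPD stage 1: (1, 12, 320, 320)
# after SPD stage 2: (1, 48, 160, 160)   <- resolution of the stride-4 head
# after SPD stage 3: (1, 192, 80, 80)    <- resolution of the stride-8 head
# after SPD stage 4: (1, 768, 40, 40)    <- resolution of the stride-16 head
```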
2. SPD-Conv, SPPFCSPC, and TSFP Modules
SPD-Conv (Space-to-Depth Convolution)
SPD-Conv replaces each stride-2 convolution with a two-step process: a deterministic space-to-depth rearrangement, which reformulates a C×H×W tensor into a 4C×(H/2)×(W/2) tensor by stacking the four 2×2-subsampled copies of the input along the channel axis, followed by a 3×3 convolution with no further reduction in spatial resolution. This operation preserves pixel relationships—vital for small-object representation—without the fine-grained information loss typical of strided convolutions.
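A minimal PyTorch sketch of such a block is shown below; the slicing-based rearrangement and the Conv-BN-SiLU stack follow the description above, while the channel widths and activation choice are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SPDConv(nn.Module):
    """Sketch of SPD-Conv: space-to-depth rearrangement + non-strided 3x3 conv."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        # The conv sees 4*in_ch channels because the rearrangement stacks the
        # four 2x2-subsampled copies of the input along the channel axis.
        self.conv = nn.Sequential(
            nn.Conv2d(4 * in_ch, out_ch, kernel_size=3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.SiLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (B, C, H, W) -> (B, 4C, H/2, W/2): no pixels are discarded,
        # unlike a stride-2 convolution.
        x = torch.cat(
            [x[..., ::2, ::2], x[..., 1::2, ::2], x[..., ::2, 1::2], x[..., 1::2, 1::2]],
            dim=1,
        )
        return self.conv(x)

# SPDConv(64, 128)(torch.randn(1, 64, 160, 160)).shape  # -> (1, 128, 80, 80)
```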
SPPFCSPC (Spatial Pyramid Pooling – Fast Cross Stage Partial Connection)
The SPPFCSPC module enhances the feature map's contextual awareness before entry into the neck. It concatenates multiple max-pooled versions of the input with increasing effective kernel sizes (5×5, 9×9, and 13×13), maintaining spatial alignment via stride-1 pooling. A CSP-style split-merge then processes each half through dedicated convolution branches before fusing them with a final 1×1 convolution. This design combines global context with local spatial precision while preserving channel-wise feature diversity.
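The sketch below illustrates one common SPPFCSPC layout: a CSP split plus a serial 5×5 pooling pyramid, whose stacked pools reproduce the 5/9/13 effective receptive fields. Branch widths and the `conv_bn_act` helper are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

def conv_bn_act(c_in: int, c_out: int, k: int = 1) -> nn.Sequential:
    """Conv + BatchNorm + SiLU, the standard YOLO building block."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, 1, k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.SiLU(),
    )

class SPPFCSPC(nn.Module):
    """Sketch of an SPPFCSPC-style block: CSP split, pooling pyramid, 1x1 fusion."""

    def __init__(self, c1: int, c2: int, k: int = 5, e: float = 0.5):
        super().__init__()
        c_ = int(2 * c2 * e)                                   # hidden width
        self.cv1 = conv_bn_act(c1, c_, 1)                      # pooling-branch entry
        self.cv2 = conv_bn_act(c1, c_, 1)                      # CSP shortcut branch
        self.cv3 = conv_bn_act(c_, c_, 3)
        self.cv4 = conv_bn_act(c_, c_, 1)
        self.pool = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)
        self.cv5 = conv_bn_act(4 * c_, c_, 1)
        self.cv6 = conv_bn_act(c_, c_, 3)
        self.cv7 = conv_bn_act(2 * c_, c2, 1)                  # final fusion

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1 = self.cv4(self.cv3(self.cv1(x)))
        y1 = self.pool(x1)                                     # 5x5 receptive field
        y2 = self.pool(y1)                                     # ~9x9
        y3 = self.pool(y2)                                     # ~13x13
        y = self.cv6(self.cv5(torch.cat((x1, y1, y2, y3), dim=1)))
        return self.cv7(torch.cat((y, self.cv2(x)), dim=1))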
TSFP (Triple-Stage Feature Pyramid)
TSFP reconfigures the detection-head arrangement for improved small-target coverage. Whereas standard YOLOv8n predicts at downsampling ratios of ×8, ×16, and ×32, TSFP introduces a stride-4 (160×160) detection head that exploits high-resolution, shallow-layer features, and prunes the lowest-resolution (stride-32, 20×20) head, avoiding computation spent on large objects that rarely occur in drone imagery. Detection heads output bounding-box position/size, classification, and objectness confidences using a decoded anchor-based representation.
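The stride arithmetic below (a back-of-the-envelope calculation, not a figure from the paper) makes the reallocation concrete: moving from strides 8/16/32 to 4/8/16 roughly quadruples the number of prediction cells devoted to fine scales for a 640×640 input.

```python
IMG = 640
baseline_strides = (8, 16, 32)   # standard YOLOv8n: P3/P4/P5 heads
tsfp_strides = (4, 8, 16)        # TSFP: adds a stride-4 head, drops stride-32

for name, strides in (("YOLOv8n", baseline_strides), ("TSFP", tsfp_strides)):
    grids = [IMG // s for s in strides]
    cells = sum(g * g for g in grids)
    print(f"{name}: grid sizes {grids}, total prediction cells {cells}")
# YOLOv8n: grid sizes [80, 40, 20], total prediction cells 8400
# TSFP:    grid sizes [160, 80, 40], total prediction cells 33600
```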
3. Pyramid Sparse Transformer Integration
PST modules are plug-and-play attention-based fusion units targeted at computationally efficient multi-scale feature fusion within neck architectures (Hu et al., 19 May 2025). They follow a two-stage operation:
- Coarse Attention: Input feature maps at adjacent resolutions are projected into queries, keys, and values via 1×1 convolutions. Global cross-attention then selects the k most salient tokens via per-key scores, reducing computational cost from O(N²) to O(N·k) for N tokens (a sketch of this stage follows the list).
- Fine Attention: Selected coarse tokens are mapped back to corresponding fine-grained spatial neighbors and refined by targeted attention. Outputs are combined with a convolutional positional encoding and projected back to the feature map domain.
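A minimal sketch of the coarse stage is given below; the 1×1 Q/K/V projections and top-k key selection follow the description above, while the key-norm saliency score, single attention head, residual fusion, and the assumption that both inputs share the same channel width are simplifications rather than the PST paper's exact design.

```python
import torch
import torch.nn as nn

class CoarseTopKAttention(nn.Module):
    """Sketch of coarse cross-attention with top-k key selection (O(N*k) cost)."""

    def __init__(self, dim: int, k: int = 64):
        super().__init__()
        self.k = k
        self.q = nn.Conv2d(dim, dim, 1)       # queries from the fine-resolution map
        self.kv = nn.Conv2d(dim, 2 * dim, 1)  # keys/values from the coarse map
        self.proj = nn.Conv2d(dim, dim, 1)

    def forward(self, fine: torch.Tensor, coarse: torch.Tensor) -> torch.Tensor:
        B, C, H, W = fine.shape
        q = self.q(fine).flatten(2).transpose(1, 2)          # (B, N_q, C)
        kv = self.kv(coarse).flatten(2).transpose(1, 2)      # (B, N_kv, 2C)
        k_tok, v_tok = kv.chunk(2, dim=-1)

        # Per-key saliency: key L2 norm (an assumption; the paper learns its own
        # scoring). Keep only the k most salient key/value tokens.
        scores = k_tok.norm(dim=-1)                          # (B, N_kv)
        topk = scores.topk(min(self.k, scores.shape[1]), dim=1).indices
        idx = topk.unsqueeze(-1).expand(-1, -1, C)
        k_sel = k_tok.gather(1, idx)                         # (B, k, C)
        v_sel = v_tok.gather(1, idx)

        attn = (q @ k_sel.transpose(1, 2)) / C ** 0.5        # (B, N_q, k)
        out = attn.softmax(dim=-1) @ v_sel                   # (B, N_q, C)
        out = out.transpose(1, 2).reshape(B, C, H, W)
        return self.proj(out) + fine                         # residual fusion
```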
Within YOLOv8n-SPTS, two PAN fusion points are replaced with PST modules, with adjustments to embedding dimensions and attention head counts according to spatial scale. During training, only coarse attention is active; the fine path is enabled during inference for maximal spatial selectivity without additional retraining.
4. Training Regime and Implementation Details
YOLOv8n-SPTS is trained on the VisDrone2019-DET dataset (10,209 images, native resolution) using 640×640 crops. Training uses SGD (momentum 0.937, weight decay 0.0005), a batch size matching hardware constraints (e.g., 16 on RTX 3090), and a learning rate schedule annealing from 0.01 to 0.0001 (cosine decay). Data augmentations include mosaic, mixup, random HSV shifts, and horizontal flips. The network is trained for up to 300 epochs with early stopping at 50 epochs of stagnant validation loss.
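Assuming the standard Ultralytics training API, the regime above maps onto a call roughly like the following; the model and dataset YAML names are placeholders, and the augmentation magnitudes are typical defaults rather than values reported in the paper.

```python
from ultralytics import YOLO

# "yolov8n-spts.yaml" and "VisDrone.yaml" are hypothetical config names,
# standing in for a custom model definition and a dataset description.
model = YOLO("yolov8n-spts.yaml")
model.train(
    data="VisDrone.yaml",
    imgsz=640, epochs=300, batch=16,
    optimizer="SGD", momentum=0.937, weight_decay=0.0005,
    lr0=0.01, lrf=0.01, cos_lr=True,     # cosine decay from 0.01 down to 0.0001
    patience=50,                         # early stopping on stalled validation
    mosaic=1.0, mixup=0.1, fliplr=0.5,   # mosaic, mixup, horizontal flips
    hsv_h=0.015, hsv_s=0.7, hsv_v=0.4,   # random HSV shifts
)
```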
For PST integration, training is single-stage: only coarse attention and positional encoding are enabled, which simplifies learning and avoids instability. At inference, the fine branch is activated seamlessly.
5. Empirical Performance and Ablation Results
Performance on VisDrone2019-DET
| Method | Precision (%) | Recall (%) | mAP@0.5 (%) | mAP@0.5:0.95 (%) |
|---|---|---|---|---|
| YOLOv8n | 54.1 | 39.8 | 41.8 | 24.9 |
| YOLOv8n-SPTS | 61.9 | 48.3 | 52.6 | 32.6 |
YOLOv8n-SPTS achieves +10.8 percentage points in mAP@0.5 and +7.7 points in mAP@0.5:0.95 over vanilla YOLOv8n. Miss rates for small, occluded targets such as pedestrians and bicycles are roughly halved in dense scenes (Wu, 10 Dec 2025).
Performance on COCO with PST
On COCO val2017 (640×640 inputs), YOLOv8n-SPTS with PST modules reaches 38.3% mAP, a gain of +0.9 percentage points over the baseline, at a cost of +0.7M parameters and a negligible inference-latency increase (+0.16 ms). Peak activation memory rises by ≈10 MB per image (Hu et al., 19 May 2025).
6. Computational Efficiency and Trade-Offs
Parameter count rises moderately: YOLOv8n-SPTS totals ≈5.2M parameters (+0.5M over YOLOv8n), and FLOPs increase by ∼20% (from ~12B to ~14B). Inference throughput remains competitive at ~40 FPS for 640×640 images on an RTX 3090 (vanilla: ~50 FPS). Adding PST modules yields further savings in multi-scale fusion complexity without substantial latency overhead. This balanced cost-benefit profile matters particularly in autonomous driving, where missing a small obstacle due to information loss is unacceptable.
7. Implementation Considerations and Limitations
SPD-Conv and SPPFCSPC modules can be implemented as drop-in replacements within Ultralytics’ YOLOv8 codebase. For PST modules, using fused Conv1×1+BN kernels and CUDA-optimized attention-gather implementations is recommended for hardware efficiency. Key limitations include diminishing returns of PST on small (320×320) inputs and unstable training if fine attention is enabled during parameter optimization.
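As one example of the recommended Conv+BN fusion, a generic PyTorch fold (not code from either cited repository) looks like this:

```python
import torch
import torch.nn as nn

@torch.no_grad()
def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold a BatchNorm into the preceding conv for inference:
    w' = w * gamma / sqrt(var + eps); b' = (b - mean) * gamma / sqrt(var + eps) + beta."""
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      conv.stride, conv.padding, conv.dilation, conv.groups, bias=True)
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)        # gamma / sigma
    fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
    bias = conv.bias if conv.bias is not None else torch.zeros(conv.out_channels)
    fused.bias.copy_((bias - bn.running_mean) * scale + bn.bias)
    return fused
```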
A plausible implication is that further multi-scale attention enhancements and backbone efficiency improvements may yield diminishing returns relative to the architectural simplicity and compute cost profile currently achieved with YOLOv8n-SPTS.
References:
- "Traffic Scene Small Target Detection Method Based on YOLOv8n-SPTS Model for Autonomous Driving" (Wu, 10 Dec 2025)
- "Pyramid Sparse Transformer: Enhancing Multi-Scale Feature Fusion with Dynamic Token Selection" (Hu et al., 19 May 2025)