HierLight-YOLO: Lightweight UAV Detection
- The paper introduces HierLight-YOLO, a framework that integrates a novel Hierarchical Extended Path Aggregation Network and lightweight modules to enhance small-object detection.
- It applies inverted residual depthwise convolution blocks and lightweight downsampling to reduce parameters by up to 29.7% while maintaining competitive accuracy.
- A dedicated high-resolution detection head improves sensitivity to objects as small as 4×4 pixels, enabling efficient real-time processing in UAV imagery.
HierLight-YOLO is a hierarchical and lightweight object detection framework, specifically formulated to address real-time detection of small objects in unmanned aerial vehicle (UAV) photography. Extending the YOLOv8 architecture, HierLight-YOLO incorporates specialized mechanisms for multi-scale fusion, significant parameter reduction, and enhanced response to tiny target objects—substantially improving detection accuracy while retaining high inference speed and low computational footprint on resource-constrained platforms (Chen et al., 26 Sep 2025).
1. Architectural Overview and Network Design
HierLight-YOLO maintains the canonical three-stage structure—Backbone, Neck, Head—as in YOLOv8, but introduces novel modules for hierarchical multi-scale feature processing and lightweight parameterization.
- Backbone: Begins with Conv 3×3 + BatchNorm + SiLU, followed by stacked Inverted Residual Depthwise Convolution Blocks (IRDCB) which act as both feature extractors and channel modulators. Downsampling is conducted with Lightweight Downsample (LDown) modules, providing a 50% reduction in spatial resolution.
- Feature Map Sequence: The backbone outputs four multi-scale feature maps, P2–P5, at strides 4, 8, 16, and 32 (160×160, 80×80, 40×40, and 20×20 for a 640×640 input; see the resolution sketch following the block diagram).
- Neck (HEPAN): The Hierarchical Extended Path Aggregation Network accepts the backbone features P2–P5 and applies 1×1 conv-based Hierarchical Feature Channel Compression (HFCC), followed by dense hierarchical cross-level fusion via top-down and bottom-up passes.
- Heads: Four detection heads are used. A dedicated small-object head operates at 160×160 (fed from the P2 level) for high spatial fidelity, targeting objects as small as 4×4 pixels; standard heads remain at 80×80, 40×40, and 20×20. All heads use anchor-free regression.
Block Diagram (simplified):
```
Input 640×640
  └─ Conv3×3 → IRDCB ──────────────────→ P2 (160×160)
       └─ LDown → IRDCB ───────────────→ P3 (80×80)
            └─ LDown → IRDCB ──────────→ P4 (40×40)
                 └─ LDown → IRDCB ─────→ P5 (20×20)
```
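For concreteness, a minimal Python sketch of the resulting feature-map resolutions, assuming the standard YOLO strides of 4/8/16/32 for the P2–P5 levels:

```python
# Minimal sketch (not the authors' code): expected pyramid resolutions for a
# 640x640 input, assuming the conventional YOLO strides of 4/8/16/32 for P2-P5.
INPUT_SIZE = 640
STRIDES = {"P2": 4, "P3": 8, "P4": 16, "P5": 32}

for level, stride in STRIDES.items():
    side = INPUT_SIZE // stride
    print(f"{level}: {side}x{side} (stride {stride})")
# P2: 160x160 (stride 4)  <- fed to the dedicated small-object head
# P3: 80x80   (stride 8)
# P4: 40x40   (stride 16)
# P5: 20x20   (stride 32)
```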
2. Key Modules: Mathematical and Structural Specification
2.1 Hierarchical Extended Path Aggregation Network (HEPAN)
HEPAN fuses the backbone features P2–P5 via two critical processes:
- HFCC: 1×1 convolutions compress each level's channels to a common width before fusion, keeping the aggregation lightweight.
- Cross-Level Dense Skip Fusion:
  - Top-down: each scale is fused with the upsampled, semantically richer features from the next-coarser level.
  - Bottom-up: each scale is then fused with the downsampled, spatially finer features from the next-finer level.
  - Output: fused features at all four scales are forwarded to the detection heads.
This concatenated hierarchy increases gradient flow across scales; a structural sketch of the fusion pattern is given below.
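The paper's exact fusion equations are not reproduced here; the following PyTorch sketch only approximates the described pattern (1×1 HFCC compression followed by PAN-style top-down and bottom-up fusion). Module names, channel widths, and the use of max-pooling for downsampling are assumptions, and HEPAN's denser cross-level skip connections are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HEPANSketch(nn.Module):
    """Illustrative approximation of the described HEPAN pattern (not the paper's exact code)."""
    def __init__(self, channels=(64, 128, 256, 512), width=128):
        super().__init__()
        # HFCC: 1x1 convolutions compress every level to a common width before fusion.
        self.hfcc = nn.ModuleList([nn.Conv2d(c, width, kernel_size=1) for c in channels])
        # Fusion convolutions applied after each concatenation.
        self.td_fuse = nn.ModuleList([nn.Conv2d(2 * width, width, 3, padding=1) for _ in range(3)])
        self.bu_fuse = nn.ModuleList([nn.Conv2d(2 * width, width, 3, padding=1) for _ in range(3)])

    def forward(self, p2, p3, p4, p5):
        c2, c3, c4, c5 = [conv(p) for conv, p in zip(self.hfcc, (p2, p3, p4, p5))]
        # Top-down pass: upsample the coarser level and fuse with the finer one.
        t4 = self.td_fuse[0](torch.cat([c4, F.interpolate(c5, scale_factor=2, mode="nearest")], 1))
        t3 = self.td_fuse[1](torch.cat([c3, F.interpolate(t4, scale_factor=2, mode="nearest")], 1))
        t2 = self.td_fuse[2](torch.cat([c2, F.interpolate(t3, scale_factor=2, mode="nearest")], 1))
        # Bottom-up pass: downsample the finer level and fuse with the coarser one.
        b3 = self.bu_fuse[0](torch.cat([t3, F.max_pool2d(t2, 2)], 1))
        b4 = self.bu_fuse[1](torch.cat([t4, F.max_pool2d(b3, 2)], 1))
        b5 = self.bu_fuse[2](torch.cat([c5, F.max_pool2d(b4, 2)], 1))
        return t2, b3, b4, b5  # features for the P2/P3/P4/P5 detection heads
```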
2.2 Inverted Residual Depthwise Convolution Block (IRDCB)
Each IRDCB follows the inverted-residual depthwise pattern: a 1×1 convolution expands the channels, a 3×3 depthwise convolution filters spatially, and a 1×1 convolution projects back to the output width. Residual addition is applied when the input and output shapes match.
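Read as a standard inverted-residual design, a minimal PyTorch sketch of the block could look as follows; the expansion factor, normalization, and activation choices are assumptions rather than the paper's exact configuration.

```python
import torch.nn as nn

class IRDCBSketch(nn.Module):
    """Minimal inverted-residual depthwise block sketch (illustrative, not the paper's exact layers)."""
    def __init__(self, channels, expansion=2):  # expansion factor is a tunable assumption
        super().__init__()
        hidden = channels * expansion
        self.block = nn.Sequential(
            nn.Conv2d(channels, hidden, 1, bias=False),                          # 1x1 expansion
            nn.BatchNorm2d(hidden), nn.SiLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden, bias=False),  # 3x3 depthwise
            nn.BatchNorm2d(hidden), nn.SiLU(),
            nn.Conv2d(hidden, channels, 1, bias=False),                          # 1x1 projection
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        # Residual addition is valid here because input and output shapes match.
        return x + self.block(x)
```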
2.3 Lightweight Downsample (LDown)
Given an input feature map, LDown performs spatial downsampling with a stride-2 group (depthwise) convolution, followed by channel mixing, halving the spatial resolution with far fewer parameters than a standard strided convolution.
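A minimal sketch of this pattern, with layer ordering, normalization, and activation assumed rather than taken from the paper:

```python
import torch.nn as nn

class LDownSketch(nn.Module):
    """Illustrative lightweight downsampling: depthwise stride-2 conv + pointwise channel mixing."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.down = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, 3, stride=2, padding=1,
                      groups=in_channels, bias=False),        # depthwise, halves H and W
            nn.BatchNorm2d(in_channels), nn.SiLU(),
            nn.Conv2d(in_channels, out_channels, 1, bias=False),  # pointwise channel mixing
            nn.BatchNorm2d(out_channels), nn.SiLU(),
        )

    def forward(self, x):
        return self.down(x)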
2.4 Detection Loss
HierLight-YOLO applies YOLOv8's multi-component loss, a weighted sum of the form $\mathcal{L} = \lambda_{\text{box}}\mathcal{L}_{\text{CIoU}} + \lambda_{\text{obj}}\mathcal{L}_{\text{obj}} + \lambda_{\text{cls}}\mathcal{L}_{\text{cls}}$, with the CIoU loss for bounding-box regression and binary cross-entropy for the objectness and class-probability terms.
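As an illustration only (not the authors' implementation), the composite loss can be sketched with off-the-shelf PyTorch/torchvision primitives; the weighting factors are placeholders, and the objectness and distribution-focal terms of the full formulation are omitted.

```python
import torch.nn.functional as F
from torchvision.ops import complete_box_iou_loss

def detection_loss_sketch(pred_boxes, target_boxes, pred_logits, target_labels,
                          w_box=7.5, w_cls=0.5):
    """Illustrative composite loss in the spirit of YOLOv8 (weights and terms simplified)."""
    # CIoU regression term on matched box pairs (boxes in xyxy format, as torchvision expects).
    box_loss = complete_box_iou_loss(pred_boxes, target_boxes, reduction="mean")
    # Binary cross-entropy over class logits (targets are float tensors of the same shape).
    cls_loss = F.binary_cross_entropy_with_logits(pred_logits, target_labels)
    return w_box * box_loss + w_cls * cls_loss
```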
3. Dedicated Small-Object Detection Mechanism
A fourth head at 160×160 resolution provides increased spatial sensitivity. Nearest-neighbor upsampling brings the next-coarser (P3-level) features up to match the P2 resolution, and the two streams are concatenated along the channel dimension.
This multi-layer fusion synergizes edge detail (from the high-resolution P2 features) with enriched semantics (from the upsampled deeper features), allowing detection of targets as small as 4×4 pixels.
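A minimal PyTorch sketch of this fusion step (tensor shapes and channel counts are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

# Illustrative fusion for the 160x160 small-object branch (names and channel widths assumed):
p2 = torch.randn(1, 64, 160, 160)   # high-resolution, edge-rich feature map
p3 = torch.randn(1, 128, 80, 80)    # lower-resolution, semantically richer feature map

p3_up = F.interpolate(p3, scale_factor=2, mode="nearest")  # 80x80 -> 160x160
fused = torch.cat([p2, p3_up], dim=1)                      # channel-wise concatenation
print(fused.shape)  # torch.Size([1, 192, 160, 160])
```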
4. Empirical Evaluation and Ablation Analysis
4.1 Experimental Setup
- Dataset: VisDrone2019 (6,471 train, 548 val, 1,610 test images)
- Input Resolution: 640×640
- Training Regime: 600 epochs, Adam optimizer with 0.9 momentum, batch size 16, standard YOLOv8 augmentations (see the illustrative invocation below)
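The authors' training code is not given in this summary; as a purely illustrative stand-in, the reported hyperparameters map onto the Ultralytics YOLO training API as follows, where `hierlight-s.yaml` is a hypothetical model-definition file.

```python
from ultralytics import YOLO

# Hypothetical invocation mirroring the reported regime; "hierlight-s.yaml" is a
# placeholder for a custom model definition and is not provided by this summary.
model = YOLO("hierlight-s.yaml")
model.train(
    data="VisDrone.yaml",   # VisDrone2019 dataset config
    epochs=600,
    imgsz=640,
    batch=16,
    optimizer="Adam",
    momentum=0.9,
)
```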
4.2 Quantitative Results
| Model | AP@0.5 | AP@0.5:0.95 | Params | FLOPs | FPS |
|---|---|---|---|---|---|
| YOLOv8-S | 40.9% | 24.3% | 11.1M | 28.5G | 143 |
| YOLOv8-S-P2 | 44.1% | 26.5% | 10.6M | 36.7G | 125 |
| HierLight-S | 44.9% | 27.3% | 7.8M | 33.7G | 133 |
Nano-scale (N) and Medium-scale (M) results:
- HierLight-N: 35.8% AP@0.5 (vs. 33.4% for YOLOv8-N), 2.2M params (vs. 3.0M)
- HierLight-M: 50.2% AP@0.5 (vs. 44.6% for YOLOv8-M), 17.9M params
These results indicate a marked improvement in parameter efficiency without sacrificing detection accuracy or real-time throughput; at the S scale, the drop from 11.1M to 7.8M parameters corresponds to the 29.7% reduction cited above.
4.3 Ablation Study Summary
| Experiment | P2 | HEPAN | IRDCB | LDown | AP@0.5 | AP@0.5:0.95 |
|---|---|---|---|---|---|---|
| Baseline (YOLOv8-S) | – | – | – | – | 40.9% | 24.3% |
| + P2 | ✓ | – | – | – | 44.1% | 26.5% |
| + HEPAN | ✓ | ✓ | – | – | 44.9% | 27.2% |
| + IRDCB | ✓ | ✓ | ✓ | – | 44.8% | 27.0% |
| + LDown (full HL-YOLO-S) | ✓ | ✓ | ✓ | ✓ | 44.9% | 27.3% |
- P2 head alone: +3.2% AP@0.5
- HEPAN: +0.8% AP@0.5 (+0.6M params)
- IRDCB: −22.1% parameters (negligible AP change)
- LDown: further parameter reduction with slight AP gain
4.4 Component-Level Comparison
- HEPAN outperforms PANet and BiFPN in AP@0.5.
- IRDCB achieves nearly identical AP as C2f but with fewer parameters.
- An ablation over IRDCB expansion factors identified the best-performing setting.
5. Implications, Significance, and Context
HierLight-YOLO directly addresses the dual challenge of small-object detection accuracy and inference efficiency in UAV aerial imagery, where the prevalent objects span only a handful of pixels. The introduction of hierarchical multi-scale fusion (HEPAN) and a targeted high-resolution detection head substantially mitigates the high false-negative rates observed in prior YOLO models under these conditions. The adoption of IRDCB and LDown modules delivers a parameter reduction of up to 29.7% at competitive accuracy, supporting deployment on embedded UAV edge hardware.
A plausible implication is that the architectural principles underlying HEPAN and lightweight modularization as implemented in HierLight-YOLO may generalize to related small-object detection tasks beyond UAV domains—particularly wherever real-time edge inference with constrained resources is required. However, the actual performance across diverse datasets remains to be rigorously benchmarked.
6. Conclusion
HierLight-YOLO introduces three principal innovations: the Hierarchical Extended Path Aggregation Network (HEPAN) for cross-scale feature fusion, the Inverted Residual Depthwise Convolution Block (IRDCB) for lightweight feature extraction, and Lightweight Downsample (LDown) for parameter-efficient spatial reduction, together with a dedicated 160×160 small-object head for detection at the 4×4-pixel scale. Across nano, small, and medium scales, HierLight-YOLO sets new state-of-the-art performance on VisDrone2019, reducing parameters by up to 29.7% while sustaining real-time speeds above 130 FPS (Chen et al., 26 Sep 2025).