
HierLight-YOLO: Lightweight UAV Detection

Updated 16 December 2025
  • The paper introduces HierLight-YOLO, a framework that integrates a novel Hierarchical Extended Path Aggregation Network and lightweight modules to enhance small-object detection.
  • It applies inverted residual depthwise convolution blocks and lightweight downsampling to reduce parameters by up to 29.7% while maintaining competitive accuracy.
  • A dedicated high-resolution detection head improves sensitivity to objects as small as 4×4 pixels, enabling efficient real-time processing in UAV imagery.

HierLight-YOLO is a hierarchical and lightweight object detection framework, specifically formulated to address real-time detection of small objects in unmanned aerial vehicle (UAV) photography. Extending the YOLOv8 architecture, HierLight-YOLO incorporates specialized mechanisms for multi-scale fusion, significant parameter reduction, and enhanced response to tiny target objects—substantially improving detection accuracy while retaining high inference speed and low computational footprint on resource-constrained platforms (Chen et al., 26 Sep 2025).

1. Architectural Overview and Network Design

HierLight-YOLO maintains the canonical three-stage structure—Backbone, Neck, Head—as in YOLOv8, but introduces novel modules for hierarchical multi-scale feature processing and lightweight parameterization.

  • Backbone: Begins with Conv 3×3 + BatchNorm + SiLU, followed by stacked Inverted Residual Depthwise Convolution Blocks (IRDCB) which act as both feature extractors and channel modulators. Downsampling is conducted with Lightweight Downsample (LDown) modules, providing a 50% reduction in spatial resolution.
  • Feature Map Sequence: The backbone outputs four multi-scale feature maps:
    • $P_2\in\mathbb{R}^{C_2\times160\times160}$
    • $P_3\in\mathbb{R}^{C_3\times80\times80}$
    • $P_4\in\mathbb{R}^{C_4\times40\times40}$
    • $P_5\in\mathbb{R}^{C_5\times20\times20}$
  • Neck (HEPAN): The Hierarchical Extended Path Aggregation Network accepts $P_2$ through $P_5$ and applies 1×1 conv-based Hierarchical Feature Channel Compression (HFCC), followed by dense hierarchical cross-level fusion via top-down and bottom-up passes.
  • Heads: Four detection heads are used. A dedicated small-object head operates at $160\times160$ (from $F^{(2)}$) for high spatial fidelity, targeting objects as small as $4\times4$ pixels; standard heads remain at $80\times80$, $40\times40$, and $20\times20$. All heads use anchor-free regression.

Block Diagram (simplified):

Input 640×640
  │
  ├─ Conv3×3 → IRDCB → LDown ──┐
  │                            ├─ IRDCB → LDown ──┐
  │                            │                  ├─ IRDCB → LDown → P5
  │                            │                  └─ IRDCB →         → P4
  │                            └─ IRDCB →         → P3
  └─ IRDCB ──────────────────────────────────────→ P2
($P_2$ through $P_5$ flow into HEPAN, then four detection heads.)
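The feature-map sizes listed above follow directly from the stride-2 downsampling chain. A minimal sketch, assuming each LDown stage halves spatial resolution and that level $l$ sits after $l$ stride-2 reductions (the function name is illustrative, not from the paper):

```python
# Sketch: spatial sides of the multi-scale feature maps P2-P5 for a 640x640
# input, assuming level l follows l stride-2 reductions (so side = input / 2**l).
def feature_map_sizes(input_size=640, levels=(2, 3, 4, 5)):
    return {f"P{l}": input_size // 2 ** l for l in levels}

sizes = feature_map_sizes()
# For a 640x640 input this yields P2: 160, P3: 80, P4: 40, P5: 20,
# matching the backbone feature-map sequence above.
```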

2. Key Modules: Mathematical and Structural Specification

2.1 Hierarchical Extended Path Aggregation Network (HEPAN)

HEPAN fuses $F_\mathrm{in}^{(l)}$ $(l=2,\ldots,5)$ via two critical processes:

  • HFCC:

$$\tilde F^{(l)} = \mathrm{Conv}_{1\times1}(F_\mathrm{in}^{(l)}) \in \mathbb{R}^{C'_l\times H_l\times W_l}$$

  • Cross-Level Dense Skip Fusion:
    • Top-down: $T^{(5)}=\tilde F^{(5)}$, $T^{(l)}=\mathrm{Conv}_{3\times3}(T^{(l+1)})+\tilde F^{(l)}$
    • Bottom-up: $B^{(2)}=T^{(2)}$, $B^{(l)}=\mathrm{Conv}_{3\times3}(B^{(l-1)})+T^{(l)}$
    • Output: $F^{(l)}=B^{(l)}$
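The two-pass recurrence can be sketched on toy scalar features, with the $3\times3$ convolution stubbed as the identity and spatial resampling omitted. All names are hypothetical; only the fusion pattern from the formulas above is modelled:

```python
# Minimal sketch of HEPAN's top-down / bottom-up fusion on toy scalar
# "features", one value per pyramid level. conv is stubbed as identity;
# in the real network it is a 3x3 convolution with resampling.
def hepan_fuse(f_tilde, conv=lambda x: x):
    levels = sorted(f_tilde)                  # e.g. [2, 3, 4, 5]
    # Top-down pass: T(5) = F~(5); T(l) = conv(T(l+1)) + F~(l)
    T = {levels[-1]: f_tilde[levels[-1]]}
    for l in reversed(levels[:-1]):
        T[l] = conv(T[l + 1]) + f_tilde[l]
    # Bottom-up pass: B(2) = T(2); B(l) = conv(B(l-1)) + T(l)
    B = {levels[0]: T[levels[0]]}
    for l in levels[1:]:
        B[l] = conv(B[l - 1]) + T[l]
    return B                                  # Output: F(l) = B(l)

out = hepan_fuse({2: 1.0, 3: 2.0, 4: 3.0, 5: 4.0})
```

Every output level aggregates contributions from both passes, which is what produces the dense gradient paths described next.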

This dense hierarchy improves gradient flow; the gradient with respect to each compressed feature is given explicitly by:

$$\frac{\partial \mathcal{L}}{\partial \tilde{F}^{(l)}} = \sum_{m\geq l} \frac{\partial \mathcal{L}}{\partial T^{(m)}}\,\frac{\partial T^{(m)}}{\partial \tilde{F}^{(l)}}$$

2.2 Inverted Residual Depthwise Convolution Block (IRDCB)

For $x_\mathrm{in}\in\mathbb{R}^{c_1\times H\times W}$:

$$\mathcal{F}(x_\mathrm{in})= \mathrm{Conv}_{1\times1}^{\mathrm{compress}} \circ \mathrm{DWConv}_{3\times3} \circ \mathrm{DWConv}_{3\times3} \circ \mathrm{Conv}_{1\times1}^{\mathrm{expand}}(x_\mathrm{in})$$

Following the inverted-residual design, the $1\times1$ expansion is applied first and the $1\times1$ compression last. A residual connection adds $x_\mathrm{in}$ when $c_1=c_2$.
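A back-of-envelope parameter count shows why this factorization is lighter than a dense $3\times3$ convolution. This is a hedged sketch assuming $c_1=c_2=c$, expansion factor $t$, and bias-free convolutions; the exact channel widths used in the paper may differ:

```python
# Illustrative parameter counts: one IRDCB versus a dense 3x3 convolution,
# assuming c input = c output channels, expansion factor t, no biases.
def irdcb_params(c, t=2):
    expand = c * (t * c)          # 1x1 conv, c -> t*c channels
    dw = 2 * (t * c) * 3 * 3      # two 3x3 depthwise convs (one filter per channel)
    compress = (t * c) * c        # 1x1 conv, t*c -> c channels
    return expand + dw + compress

def std_conv_params(c):
    return c * c * 3 * 3          # dense 3x3 conv, c -> c channels

# At c = 128, t = 2 the IRDCB needs fewer than half the parameters of the
# dense convolution, consistent with the reported parameter savings.
```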

2.3 Lightweight Downsample (LDown)

Given $F_\mathrm{in}\in\mathbb{R}^{c_1\times H\times W}$:

$$F_\mathrm{out} = \mathrm{Conv}_{1\times1}^{(2)}\big(\mathrm{Conv}_{k\times k,\,s}^{(1)}(F_\mathrm{in})\big)$$

This uses group (depthwise) convolution for spatial downsampling, followed by channel mixing.
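The saving can again be sketched by counting parameters, assuming the depthwise $k\times k$ convolution keeps one filter per input channel and the $1\times1$ convolution handles all channel mixing, both bias-free (illustrative only, not the paper's exact configuration):

```python
# Illustrative parameter counts: LDown (depthwise k x k for spatial reduction,
# then 1x1 for channel mixing) versus a dense strided k x k convolution.
def ldown_params(c_in, c_out, k=3):
    dw = c_in * k * k             # depthwise k x k: one k x k filter per channel
    mix = c_in * c_out            # 1x1 channel-mixing conv
    return dw + mix

def strided_conv_params(c_in, c_out, k=3):
    return c_in * c_out * k * k   # dense strided k x k conv

# For c_in = 128, c_out = 256, k = 3 the factorized form needs roughly an
# order of magnitude fewer parameters than the dense strided convolution.
```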

2.4 Detection Loss

HierLight-YOLO applies YOLOv8’s multi-component loss:

$$\mathcal{L} = \lambda_\mathrm{box}L_\mathrm{CIoU} + \lambda_\mathrm{obj}L_\mathrm{BCE}(\hat o,o) + \lambda_\mathrm{cls}L_\mathrm{BCE}(\hat p,p)$$

with CIoU loss and binary cross-entropy for objectness and class probabilities.
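The CIoU term follows the standard definition: IoU penalized by the normalized squared center distance and an aspect-ratio consistency term. A minimal pure-Python sketch for corner-format boxes $(x_1, y_1, x_2, y_2)$, not taken from the paper's code:

```python
import math

# Sketch of the CIoU box loss, L_CIoU = 1 - CIoU, where
# CIoU = IoU - rho^2/c^2 - alpha*v (standard definition adopted from YOLOv8).
def ciou_loss(box_a, box_b, eps=1e-9):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection-over-union
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    iou = inter / (area_a + area_b - inter + eps)
    # Squared center distance over squared enclosing-box diagonal
    rho2 = ((ax1 + ax2 - bx1 - bx2) ** 2 + (ay1 + ay2 - by1 - by2) ** 2) / 4
    cw = max(ax2, bx2) - min(ax1, bx1)
    ch = max(ay2, by2) - min(ay1, by1)
    c2 = cw ** 2 + ch ** 2 + eps
    # Aspect-ratio consistency term
    v = (4 / math.pi ** 2) * (math.atan((ax2 - ax1) / (ay2 - ay1))
                              - math.atan((bx2 - bx1) / (by2 - by1))) ** 2
    alpha = v / (1 - iou + v + eps)
    return 1 - (iou - rho2 / c2 - alpha * v)

# Identical boxes give a loss of ~0; disjoint boxes give a loss above 1
# because the center-distance penalty adds on top of zero IoU.
```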

3. Dedicated Small-Object Detection Mechanism

A fourth head at $160\times160$ resolution provides increased spatial sensitivity. Nearest-neighbor upsampling $\phi_2$ enlarges $F^{(3)}$ to match the resolution of $F^{(2)}$, and the two are concatenated:

$$F_\mathrm{merge} = \mathrm{Concat}\big(\phi_2(F^{(3)}),\,F^{(2)}\big)$$

This multi-layer fusion combines fine edge detail (from $P_2$) with richer semantics (from upsampled $P_3$), allowing detection of targets as small as $4\times4$ pixels.
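The fusion step can be sketched on toy nested-list feature maps (`[channel][row][col]`), with $\phi_2$ implemented as plain nearest-neighbor index replication; names and shapes are illustrative stand-ins for the real $80\times80$ and $160\times160$ maps:

```python
# Sketch of the small-object fusion: 2x nearest-neighbour upsampling of F(3),
# then channel-wise concatenation with F(2).
def upsample2x_nearest(fmap):
    # Replicate every pixel into a 2x2 block, channel by channel.
    out = []
    for ch in fmap:
        up = []
        for row in ch:
            wide = [v for v in row for _ in range(2)]
            up.append(wide)
            up.append(list(wide))
        out.append(up)
    return out

def fuse_small_object(f3, f2):
    # F_merge = Concat(phi_2(F(3)), F(2)): stack channels once sizes match.
    return upsample2x_nearest(f3) + f2

f3 = [[[1, 2], [3, 4]]]                 # 1 channel, 2x2 (stands in for 80x80)
f2 = [[[0] * 4 for _ in range(4)]]      # 1 channel, 4x4 (stands in for 160x160)
merged = fuse_small_object(f3, f2)      # 2 channels, each 4x4
```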

4. Empirical Evaluation and Ablation Analysis

4.1 Experimental Setup

  • Dataset: VisDrone2019 ($6{,}471$ train, $548$ val, $1{,}610$ test images)
  • Input Resolution: $640\times640$
  • Training Regime: $600$ epochs, Adam optimizer (learning rate $10^{-3}$, momentum $0.9$), batch size 16, standard YOLOv8 augmentations

4.2 Quantitative Results

| Model | [email protected] | [email protected]:0.95 | Params | FLOPs | FPS |
|---|---|---|---|---|---|
| YOLOv8-S | 40.9% | 24.3% | 11.1M | 28.5G | 143 |
| YOLOv8-S-P2 | 44.1% | 26.5% | 10.6M | 36.7G | 125 |
| HierLight-S | 44.9% | 27.3% | 7.8M | 33.7G | 133 |

Nano-scale (N) and Medium-scale (M) results:

These results indicate a marked improvement in parameter efficiency without sacrificing detection accuracy or real-time throughput.

4.3 Ablation Study Summary

| Experiment | P2 | HEPAN | IRDCB | LDown | [email protected] | [email protected]:0.95 |
|---|---|---|---|---|---|---|
| Baseline (YOLOv8-S) | | | | | 40.9% | 24.3% |
| + P2 | ✓ | | | | 44.1% | 26.5% |
| + HEPAN | ✓ | ✓ | | | 44.9% | 27.2% |
| + IRDCB | ✓ | ✓ | ✓ | | 44.8% | 27.0% |
| + LDown (full HierLight-YOLO-S) | ✓ | ✓ | ✓ | ✓ | 44.9% | 27.3% |
  • P2 head alone: +3.2% [email protected]
  • HEPAN: +0.8% [email protected] at a cost of +0.6M parameters
  • IRDCB: −22.1% parameters with no AP loss
  • LDown: further parameter reduction with a slight AP gain

4.4 Component-Level Comparison

  • HEPAN outperforms PANet and BiFPN in [email protected].
  • IRDCB achieves nearly identical AP as C2f but with fewer parameters.
  • The best IRDCB configuration was found at $(n=2, t=2)$.

5. Implications, Significance, and Context

HierLight-YOLO directly addresses the dual challenge of small-object detection accuracy and inference efficiency in UAV-aerial imagery scenarios where the prevalent object size is $<32$ pixels. The introduction of hierarchical multi-scale fusion (HEPAN) and a targeted high-resolution detection head substantially mitigates the high false negative rates observed in prior YOLO models under these conditions. The adoption of IRDCB and LDown modules brings about parameter and FLOP reduction (up to 29.7%) for competitive accuracy, supporting deployment on embedded UAV edge hardware.

A plausible implication is that the architectural principles underlying HEPAN and lightweight modularization as implemented in HierLight-YOLO may generalize to related small-object detection tasks beyond UAV domains—particularly wherever real-time edge inference with constrained resources is required. However, the actual performance across diverse datasets remains to be rigorously benchmarked.

6. Conclusion

HierLight-YOLO introduces three principal innovations: Hierarchical Extended Path Aggregation Network (HEPAN) for cross-scale feature fusion, Inverted Residual Depthwise Convolution Block (IRDCB) for lightweight feature extraction, and Lightweight Downsample (LDown) for parameter-efficient spatial reduction, with the addition of a dedicated small-object head for $4\times4$ pixel scale detection. Across nano, small, and medium scales, HierLight-YOLO sets new state-of-the-art performance on VisDrone2019, reducing parameters by up to 29.7% and sustaining real-time speeds above 130 FPS (Chen et al., 26 Sep 2025).
