HGNetv2: Efficient CNN Backbone for Detection
- HGNetv2 is a CNN backbone that employs hierarchical layering and recursive HGBlocks to capture diverse, multi-scale features while preserving spatial details.
- It integrates residual connections and lightweight GhostConv modules, reducing computational cost by up to 50% in GFLOPs without compromising accuracy.
- Deployed in both pure CNN and hybrid CNN-transformer frameworks, HGNetv2 demonstrates robust performance in domain shift and resource-constrained detection tasks.
HGNetv2 is a convolutional neural network (CNN) backbone architecture developed for efficient multi-scale feature extraction in real-time object detection tasks, with a focus on maximizing feature diversity while minimizing computational cost and model size. It is distinguished by its hierarchical layering, residual multi-scale fusion, and extensibility for lightweight modifications (such as the integration of GhostConv). HGNetv2 has been employed as both a pure CNN backbone and as a component in hybrid CNN-transformer architectures, demonstrating notable advantages in resource-constrained and domain-shift scenarios.
1. Architectural Principles of HGNetv2
HGNetv2 is structured around recursive blocks, designated HGBlocks, that are tailored to support multi-scale feature extraction and robust propagation of spatial detail. Each HGBlock processes its input feature map $X$ through a series of convolutional operators (each followed by batch normalization and non-linear activation), producing intermediate feature representations $F_1, F_2, \ldots, F_n$, where $F_i = \phi_i(F_{i-1})$ and $F_0 = X$. These are concatenated along the channel dimension:

$$F_{\mathrm{cat}} = \mathrm{Concat}(F_1, F_2, \ldots, F_n)$$

A subsequent convolution compresses the channel dimension, followed by a residual addition with the original input:

$$Y = X + \mathrm{Conv}(F_{\mathrm{cat}})$$
This design preserves both the original and learned multi-scale semantics, facilitating better localization and class discrimination across diverse object sizes. When generalized into lightweight variants (e.g., GhostHGNetv2), the standard convolution operators are replaced with GhostConv modules: a two-stage operation generating intrinsic features via standard convolution and additional "ghost" features using cost-effective linear projections.
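To make the block structure concrete, the following is a minimal PyTorch sketch of an HGBlock-style module, assuming SiLU activations, a 1x1 compression convolution, and four stacked 3x3 convolutions; class names and defaults are illustrative, not the reference implementation.

```python
import torch
import torch.nn as nn

class ConvBNAct(nn.Module):
    """Convolution followed by batch normalization and a non-linear activation."""
    def __init__(self, c_in, c_out, k=3):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()  # activation choice is an assumption

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class HGBlockSketch(nn.Module):
    """Chain of convs -> channel-wise concat -> 1x1 compression -> residual add."""
    def __init__(self, c, c_mid, n=4):
        super().__init__()
        self.convs = nn.ModuleList(
            [ConvBNAct(c if i == 0 else c_mid, c_mid) for i in range(n)]
        )
        self.squeeze = ConvBNAct(n * c_mid, c, k=1)  # compress concatenated channels

    def forward(self, x):
        feats, y = [], x
        for conv in self.convs:
            y = conv(y)
            feats.append(y)              # keep every intermediate F_i
        f_cat = torch.cat(feats, dim=1)  # F_cat = Concat(F_1, ..., F_n)
        return x + self.squeeze(f_cat)   # Y = X + Conv(F_cat)
```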
2. Mechanisms for Lightweight Efficiency: GhostHGNetv2
A prominent extension of HGNetv2 is GhostHGNetv2, which introduces GhostConv to reduce redundancy. The GhostConv mechanism decomposes feature computation as follows:

$$Y' = \mathrm{Conv}(X), \qquad Y = \mathrm{Concat}\big(Y', \Phi_1(Y'), \ldots, \Phi_s(Y')\big)$$

where $\Phi_i$ indicates different convolutional operations (with varying kernel sizes or dilation rates) applied to the intrinsic features $Y'$, and $\mathrm{Concat}$ denotes concatenation. This two-stage design allows a drastic reduction in FLOPs and parameter count without degrading representational capacity.
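A minimal PyTorch sketch of the two-stage operation follows, assuming a 1x1 primary convolution, a single 5x5 depthwise convolution as the cheap projection $\Phi$, and an even output channel count; the module name and defaults are illustrative.

```python
import torch
import torch.nn as nn

class GhostConvSketch(nn.Module):
    """Stage 1: a standard conv produces intrinsic features Y'.
    Stage 2: a cheap depthwise conv synthesizes 'ghost' features from Y'.
    The two sets are concatenated along the channel dimension."""
    def __init__(self, c_in, c_out, k=1, dw_k=5):
        super().__init__()
        c_intrinsic = c_out // 2  # half the outputs come from the primary conv
        self.primary = nn.Sequential(
            nn.Conv2d(c_in, c_intrinsic, k, padding=k // 2, bias=False),
            nn.BatchNorm2d(c_intrinsic),
            nn.SiLU(),
        )
        # Cheap operation Phi: depthwise conv (groups == channels).
        self.cheap = nn.Sequential(
            nn.Conv2d(c_intrinsic, c_out - c_intrinsic, dw_k,
                      padding=dw_k // 2, groups=c_intrinsic, bias=False),
            nn.BatchNorm2d(c_out - c_intrinsic),
            nn.SiLU(),
        )

    def forward(self, x):
        y_intrinsic = self.primary(x)
        y_ghost = self.cheap(y_intrinsic)  # linear-projection "ghost" features
        return torch.cat([y_intrinsic, y_ghost], dim=1)
```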
Integration of GhostHGNetv2 into object detectors both simplifies convolution layers (e.g., replacing all standard convolutions within HGBlocks) and enables multi-scale residual fusion with a lower computational footprint. Empirical results indicate a reduction of redundant computation by up to 50%, as measured by GFLOPs, and a concurrent model size decrease while enriching the effective receptive field (Zheng et al., 10 Mar 2025, Pingzhen et al., 23 Jul 2025).
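The ~50% figure is consistent with a back-of-the-envelope FLOP accounting for a ghost ratio of $s = 2$ (following the standard GhostNet cost analysis; the kernel sizes here are generic placeholders). A standard convolution mapping $c$ input channels to $n$ output channels on an $h' \times w'$ output grid with kernel size $k$ costs

$$F_{\mathrm{std}} = n \cdot h' \cdot w' \cdot c \cdot k^2,$$

while the GhostConv counterpart (a primary convolution producing $n/2$ intrinsic channels plus $d \times d$ depthwise projections for the remaining $n/2$) costs

$$F_{\mathrm{ghost}} = \tfrac{n}{2}\, h' w' c k^2 + \tfrac{n}{2}\, h' w' d^2,$$

so $F_{\mathrm{std}} / F_{\mathrm{ghost}} = 2ck^2 / (ck^2 + d^2) \approx 2$ whenever $c k^2 \gg d^2$, i.e., roughly half the multiply-adds.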
3. Application in Detection Frameworks
HGNetv2 has been deployed in a variety of detection frameworks, either as a stand-alone backbone (CNN-only) or within composite detector architectures that combine CNN and transformer components. The general composite detector can be expressed as $D(H, B)$, where $H$ is the detection head and $B$ the backbone, with, for instance, $D(\text{RT-DETR}, \text{HGNetv2})$ denoting an RT-DETR detection head paired with the HGNetv2 backbone (Cani et al., 1 May 2025).
Integration Steps
- Replacement of the default backbone (e.g., YOLOv8's CSP-DarkNet53) with HGNetv2, sometimes requiring modification of spatial skip connections to maintain feature compatibility (see the composition sketch after this list).
- Use of configuration variants to harmonize feature map sizes fed to the detection head.
- For lightweight deployments, pairing HGNetv2 with custom detection heads (OptiConvDetect, GCDetect) using grouped or shared-weight convolutions.
- Optional interleaving with transformer-based heads (RT-DETR), allowing the combination of strong local feature learning (from HGNetv2) and global context modeling (from transformers).
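The integration pattern above reduces to a few lines of PyTorch. The sketch below is illustrative (the class names CompositeDetector and ChannelAdapter are hypothetical, not from the cited works), showing how 1x1 projections can harmonize backbone output channels with what a detection head expects.

```python
import torch
import torch.nn as nn

class ChannelAdapter(nn.Module):
    """1x1 convs projecting backbone feature maps to the channel counts the
    detection head expects; one way to maintain feature compatibility when
    swapping in a new backbone."""
    def __init__(self, in_chs, out_chs):
        super().__init__()
        self.projs = nn.ModuleList(
            [nn.Conv2d(ci, co, kernel_size=1) for ci, co in zip(in_chs, out_chs)]
        )

    def forward(self, feats):
        return [proj(f) for proj, f in zip(self.projs, feats)]

class CompositeDetector(nn.Module):
    """Composite D(head, backbone): an HGNetv2-style backbone feeds adapted
    multi-scale features (e.g., strides /8, /16, /32) to a detection head."""
    def __init__(self, backbone, adapter, head):
        super().__init__()
        self.backbone, self.adapter, self.head = backbone, adapter, head

    def forward(self, images):
        feats = self.backbone(images)  # list of multi-scale feature maps
        return self.head(self.adapter(feats))
```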
4. Performance Metrics and Deployment Impact
Experimental Results
The architecture’s efficiency and versatility are reflected in multiple empirical scenarios:
Detector Variant | Dataset | mAP / AP Variant | Params (M) | GFLOPs | FPS | Notes |
---|---|---|---|---|---|---|
HGO-YOLO (GhostHGNetv2 OptiConvDetect) | Anomaly | 87.4% mAP@0.5 | 4.6 | 4.3 | 56 | +3.0% mAP, -51.7% GFLOPs vs YOLOv8n (Zheng et al., 10 Mar 2025) |
HGO-YOLO (Jetson Orin Nano) | Anomaly | — | — | — | 42 | 42 FPS real-world Jetson throughput |
Ghost-HGNetv2+GCDetect+C2f-Faster+pruning | PCB Defect | 99.32% mAP@0.5 | 0.67–2.31 | 2.4–3 | 129.3 | +10.13 mAP@0.5:0.9 vs YOLOv8n (Pingzhen et al., 23 Jul 2025) |
D(RT-DETR, HGNetv2) | X-ray EDS | 0.573/0.410 mAP | — | — | — | Outperforms YOLOv8 in domain shift (Cani et al., 1 May 2025) |
DEIMv2-Pico (HGNetv2 pruned) | COCO | 38.5 AP | 1.5 | — | — | Matches YOLOv10-Nano with 50% fewer params (Huang et al., 25 Sep 2025) |
DEIMv2-Atto (HGNetv2 pruned) | COCO | 23.8 AP | 0.49 | — | — | Ultra-light variant, competitive AP |
On datasets without domain shift, conventional CNN backbones may still surpass those built with HGNetv2, but in conditions with significant domain variation (multiple X-ray scanners in EDS), the $D(\text{RT-DETR}, \text{HGNetv2})$ composite exhibits superior robustness (Cani et al., 1 May 2025). For highly resource-constrained applications (edge devices, Jetson, mobile), HGNetv2's prunability ensures competitive detection rates and AP at dramatically lower resource budgets (Huang et al., 25 Sep 2025).
5. Specialized Modules and Loss Functions
HGNetv2-based systems often incorporate auxiliary design features to maximize detection quality for small or occluded targets and improve inference efficiency:
- GroupConv-Based Heads: GCDetect and OptiConvDetect replace standard decoupled detection heads with parameter-sharing architectures, leveraging Group or Partial Convolutions to serve both classification and regression branches, reducing head computation by up to 41% (Zheng et al., 10 Mar 2025, Pingzhen et al., 23 Jul 2025).
- Inner-MPDIoU Loss: Designed for extremely fine-grained localization (e.g., PCB defects), this loss combines auxiliary adaptive bounding regions with a distance-based objective, ensuring precise bounding box regression for tiny or irregular objects (Pingzhen et al., 23 Jul 2025).
- Adaptive Pruning (LAMP): After initial training, channels are pruned using magnitude-based LAMP scoring, further compressing the model to as little as 0.67M parameters with no significant degradation in mAP (Pingzhen et al., 23 Jul 2025); a scoring sketch follows this list.
- Dynamic Anchor/Stride Computation: Networks dynamically determine anchor scales and strides during inference, optimizing for input shape and maintaining detection consistency (Zheng et al., 10 Mar 2025).
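The LAMP score referenced above has a simple closed form: each weight's squared magnitude divided by the sum of squared magnitudes of all weights in the same layer that are at least as large. The PyTorch sketch below implements the unstructured per-weight variant (the cited work applies the idea at channel granularity, which would aggregate scores per channel); function names are illustrative.

```python
import torch

def lamp_scores(weight: torch.Tensor) -> torch.Tensor:
    """LAMP score per weight: w^2 / sum of w'^2 over all weights w' in the
    same layer with |w'| >= |w|."""
    flat = weight.detach().flatten() ** 2
    sorted_sq, order = torch.sort(flat, descending=True)
    denom = torch.cumsum(sorted_sq, dim=0)   # squared mass of all larger weights
    scores = torch.empty_like(flat)
    scores[order] = sorted_sq / denom        # undo the sort
    return scores.view_as(weight)

def global_prune_masks(weights: list, sparsity: float) -> list:
    """Keep the (1 - sparsity) fraction of weights with the highest LAMP
    scores across all layers; returns one binary mask per layer."""
    scores = [lamp_scores(w) for w in weights]
    all_scores = torch.cat([s.flatten() for s in scores])
    k = int(sparsity * all_scores.numel())
    threshold = torch.kthvalue(all_scores, k).values if k > 0 else -1.0
    return [(s > threshold).float() for s in scores]
```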
6. Pruning and Scalability
HGNetv2 is explicitly designed for aggressive depth and width pruning to create ultra-lightweight models:
- Depth Pruning: Complete removal of one or more backbone stages (e.g., dropping the fourth stage in the B0 variant for Pico).
- Width Pruning: Reduction of final block channels, formalized as a width multiplier $w$ (with $w = 0.5$ representing channel halving for Atto).
- Block Count Reduction: For the Femto variant, reducing the number of HGBlocks in the final stage.
- Effect on Resource Budgets: These strategies enable the creation of sub-1M parameter models with competitive AP, e.g., DEIMv2-Atto at 0.49M parameters and 23.8 AP on COCO (Huang et al., 25 Sep 2025); a configuration sketch follows this list.
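All three strategies reduce to simple edits of a per-stage configuration. The sketch below shows each knob; the B0-like channel and block counts are hypothetical placeholders, since the exact configuration is not given here.

```python
def scale_backbone_config(stage_channels, stage_blocks,
                          drop_last_stage=False, width_mult=1.0,
                          last_stage_blocks=None):
    """Illustrative HGNetv2-style scaling: depth pruning drops a whole stage,
    width pruning multiplies final-stage channels, and block-count reduction
    shrinks the last stage's HGBlock repeats."""
    channels, blocks = list(stage_channels), list(stage_blocks)
    if drop_last_stage:                    # depth pruning (Pico-style)
        channels, blocks = channels[:-1], blocks[:-1]
    if width_mult != 1.0:                  # width pruning (Atto-style, w = 0.5)
        channels[-1] = int(channels[-1] * width_mult)
    if last_stage_blocks is not None:      # block-count reduction (Femto-style)
        blocks[-1] = last_stage_blocks
    return channels, blocks

# Hypothetical per-stage output channels / block counts for a B0-like backbone:
pico_chs, pico_blks = scale_backbone_config(
    [256, 512, 1024, 2048], [1, 1, 2, 1], drop_last_stage=True
)
```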
The table below summarizes HGNetv2 scaling strategies within DEIMv2:
Variant | Depth Reduction | Width Reduction ($w$) | Output Scale | Params (M) | AP (COCO) |
---|---|---|---|---|---|
Pico | Remove 4th stage | — | 1/16 | 1.5 | 38.5 |
Femto | Reduced final-stage block count | — | 1/16 | — | — |
Atto | — | 0.5 | 1/16 | 0.49 | 23.8 |
This scaling enables deployment across a wide spectrum of hardware, from mobile and edge devices to high-throughput computing environments.
7. Application Domains and Limitations
HGNetv2 and its derivatives are widely applicable to scenarios requiring efficient, real-time, and robust object detection:
- Anomaly Surveillance: Integration into HGO-YOLO for fall/fighting/smoke detection yields improved accuracy and latency on CPUs and embedded devices (Zheng et al., 10 Mar 2025).
- Industrial Inspection: In PCB defect detection, Ghost-HGNetv2, coupled with tailored loss functions and pruning, sets new mAP records with rapid inference and a tiny memory footprint (Pingzhen et al., 23 Jul 2025).
- Security Screening: In X-ray security contexts, pairing HGNetv2 with transformer-inspired detection heads enhances robustness to domain distribution shifts, though careful skip connection design is required (Cani et al., 1 May 2025).
- General Object Detection: As the backbone for ultra-lightweight DEIMv2 variants, HGNetv2 offers a compelling performance-cost ratio, outperforming state-of-the-art YOLO models at comparable or lower parameter counts (Huang et al., 25 Sep 2025).
A plausible implication is that, while HGNetv2 excels in extreme efficiency and robustness under domain variation, hybrid architectures may require careful engineering (particularly in skip connection design and neck/head compatibility) to reliably outperform established CNN-only backbones; additionally, the full benefits of transformer integration appear scenario-dependent.
HGNetv2 thus represents a modular, hierarchical, and prunable CNN backbone supporting an array of lightweight object detection systems, and its adoption may increase in settings where computational and memory efficiency are as critical as detection accuracy.