PACGNet Architecture for Multimodal Detection
- The PACGNet architecture introduces a dual-stream backbone with symmetrical cross-gating (SCG) to refine features from RGB and IR modalities.
- It employs a pyramidal feature-aware multimodal gating (PFMG) module to progressively fuse hierarchical features and enhance small-object detection.
- Experiments on DroneVehicle and VEDAI datasets demonstrate PACGNet’s state-of-the-art performance, with significant mAP50 improvements.
The Pyramidal Adaptive Cross-Gating Network (PACGNet) is an architecture for multimodal object detection, explicitly designed to address the limitations of conventional feature fusion methods in aerial imagery-based detection tasks. PACGNet preserves hierarchical feature structures and reduces cross-modal noise through a combination of bidirectional “horizontal” gating (Symmetrical Cross-Gating, SCG) and progressive “vertical” fusion (Pyramidal Feature-aware Multimodal Gating, PFMG) within a dual-stream deep backbone. This approach produces state-of-the-art results on challenging benchmarks, especially in the detection of small or fine-grained objects in registered RGB and infrared (IR) imagery (Gu et al., 20 Dec 2025).
1. Network Structure and Dataflow
PACGNet processes registered RGB and IR image pairs—each resized to a common square input resolution—via a dual-stream YOLOv8 backbone. The backbone consists of two identical but independently parameterized branches (one for RGB, one for IR), using standard C2f blocks. Each stream generates hierarchical feature maps at four successively halved resolutions, with channel width increasing at each stage in the base model.
After each backbone stage except the final one, a Symmetrical Cross-Gating (SCG) module fuses and refines features between the two modalities at the corresponding resolution, so SCG operates at the three shallower scales.
Following the backbone, a series of PFMG modules performs progressive, top-down fusion across three successively coarser levels, with each level’s fusion guided by the fused features from the previous, higher-resolution stage. The final fused pyramid is processed by a standard PANet neck and a YOLOv8-style oriented-box detection head.
The main dataflow is as follows:
- registered RGB and IR inputs → dual-stream backbone (four stages per stream)
- SCG refinement after each backbone stage except the final one
- PFMG fusion over the three coarser levels, each guided by the fused output of the previous, higher-resolution level
- PANet neck → oriented-box detection head → final oriented detections
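Under illustrative assumptions (stride-2 convolutions standing in for C2f stages, simple averaging stand-ins for the SCG and PFMG modules, and made-up channel widths), the dataflow above can be sketched in PyTorch as:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def stage(cin, cout):
    # Stand-in for one YOLOv8 C2f backbone stage: stride-2 conv + SiLU.
    return nn.Sequential(nn.Conv2d(cin, cout, 3, 2, 1), nn.SiLU())

class PACGNetFlow(nn.Module):
    """Wiring of the PACGNet dataflow only: SCG and PFMG are reduced to
    averaging/gating stand-ins; the widths (16/32/64/128) are illustrative,
    not the paper's."""
    def __init__(self, widths=(16, 32, 64, 128)):
        super().__init__()
        chans = (3,) + tuple(widths)
        self.rgb = nn.ModuleList(stage(chans[i], chans[i + 1]) for i in range(4))
        self.ir = nn.ModuleList(stage(chans[i], chans[i + 1]) for i in range(4))

    def forward(self, x_rgb, x_ir):
        pairs = []
        for i in range(4):
            x_rgb, x_ir = self.rgb[i](x_rgb), self.ir[i](x_ir)
            if i < 3:                        # SCG at all stages except the final
                m = 0.5 * (x_rgb + x_ir)     # stand-in for cross-gated exchange
                x_rgb, x_ir = x_rgb + m, x_ir + m
            pairs.append((x_rgb, x_ir))
        # PFMG: fuse the three coarser scales top-down; each fusion is gated
        # by the previous (higher-resolution) fused map.
        fused, guide = [], None
        for f_rgb, f_ir in pairs[1:]:
            blend = 0.5 * (f_rgb + f_ir)     # stand-in for modality weighting
            if guide is not None:
                gate = torch.sigmoid(F.avg_pool2d(guide, 2)).mean(1, keepdim=True)
                blend = blend * gate         # hierarchical spatial gating
            fused.append(blend)
            guide = blend
        return fused                         # pyramid handed to the PANet neck

feats = PACGNetFlow()(torch.randn(1, 3, 256, 256), torch.randn(1, 3, 256, 256))
print([tuple(f.shape) for f in feats])  # [(1, 32, 64, 64), (1, 64, 32, 32), (1, 128, 16, 16)]
```

The three returned maps correspond to the pyramid levels handed to the neck; the real modules replace the averaging stand-ins with the gating described in the following sections.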
2. Symmetrical Cross-Gating (SCG) Module
At each of the three SCG scales, the module implements a symmetrical, bidirectional “horizontal” cross-gating function between the RGB and IR feature maps at that resolution. Both streams are first refined through intra-modal bottleneck layers before entering cross-modal gating.
SCG includes two gating routes per direction:
- Spatial Gating: Computes a spatial attention map from one modality and applies it to the other, e.g., $A_s = \sigma(\mathrm{Conv}_{1\times 1}(F_{\mathrm{ir}}))$, then $\tilde{F}^{s}_{\mathrm{rgb}} = F_{\mathrm{rgb}} \odot A_s$, where $\sigma$ denotes the sigmoid and $\odot$ element-wise multiplication.
- Channel Gating: Projects the guidance modality through a bottleneck and a 1x1 conv to yield a per-channel gate, e.g., $A_c = \sigma(\mathrm{Conv}_{1\times 1}(\mathrm{Bottleneck}(F_{\mathrm{ir}})))$, then $\tilde{F}^{c}_{\mathrm{rgb}} = F_{\mathrm{rgb}} \odot A_c$.

The final refined feature is obtained by residual addition and normalization:

$$F'_{\mathrm{rgb}} = \mathrm{Norm}\left(F_{\mathrm{rgb}} + \tilde{F}^{s}_{\mathrm{rgb}} + \tilde{F}^{c}_{\mathrm{rgb}}\right)$$

The operation is applied symmetrically in the opposite direction to refine $F_{\mathrm{ir}}$, and each direction employs an independent parameter set. This ensures that cross-modal enhancement selectively absorbs complementary information while suppressing noise; the residual path prevents destructive gating and preserves modality-specific semantics.
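A minimal PyTorch sketch of the two gating routes follows. Global average pooling inside the channel bottleneck and BatchNorm as the normalization are assumptions not specified in the source:

```python
import torch
import torch.nn as nn

class SCGDirection(nn.Module):
    """One direction of symmetrical cross-gating (e.g., IR guiding RGB).
    Layer details beyond the 1x1 gate convs are assumptions."""
    def __init__(self, c, r=4):
        super().__init__()
        self.spatial = nn.Conv2d(c, 1, 1)          # 1x1 conv -> spatial map
        self.channel = nn.Sequential(              # bottleneck channel gate
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(c, c // r, 1), nn.SiLU(),
            nn.Conv2d(c // r, c, 1))
        self.norm = nn.BatchNorm2d(c)
        # zero-init the gate convs so every gate starts at sigmoid(0) = 0.5
        for m in (self.spatial, self.channel[1], self.channel[3]):
            nn.init.zeros_(m.weight)
            nn.init.zeros_(m.bias)

    def forward(self, x, guide):
        a_s = torch.sigmoid(self.spatial(guide))   # (B,1,H,W) spatial gate
        a_c = torch.sigmoid(self.channel(guide))   # (B,C,1,1) channel gate
        return self.norm(x + x * a_s + x * a_c)    # residual refinement

class SCG(nn.Module):
    """Symmetric block: two directions with independent parameters."""
    def __init__(self, c):
        super().__init__()
        self.rgb_from_ir = SCGDirection(c)
        self.ir_from_rgb = SCGDirection(c)

    def forward(self, f_rgb, f_ir):
        return self.rgb_from_ir(f_rgb, f_ir), self.ir_from_rgb(f_ir, f_rgb)

scg = SCG(32)
out_rgb, out_ir = scg(torch.randn(2, 32, 16, 16), torch.randn(2, 32, 16, 16))
print(tuple(out_rgb.shape), tuple(out_ir.shape))  # (2, 32, 16, 16) (2, 32, 16, 16)
```

Note that both refined streams keep their input shapes, so SCG can be dropped in after any backbone stage without altering downstream dimensions.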
3. Pyramidal Feature-aware Multimodal Gating (PFMG) Module
The PFMG module implements a progressive vertical fusion mechanism at each pyramid level, guided by the fused feature from the previous, higher-resolution level. The detailed steps include:
- Hierarchical Spatial Gate: Computes a spatial gate from the higher-resolution fused feature, $G = \sigma(\mathrm{Conv}_{3\times 3,\, s=2}(F^{\mathrm{fused}}_{\mathrm{prev}}))$, downsampling it to the current level’s resolution.
- Modality Interaction & Weighting: Concatenates and projects $F_{\mathrm{rgb}}$ and $F_{\mathrm{ir}}$ through a 1x1 conv, then computes a softmax score map $[W_{\mathrm{rgb}}, W_{\mathrm{ir}}] = \mathrm{softmax}(\mathrm{Conv}_{1\times 1}([F_{\mathrm{rgb}}; F_{\mathrm{ir}}]))$ to yield a spatially adaptive blend.
- Gated Fusion: $F^{\mathrm{fused}} = G \odot (W_{\mathrm{rgb}} \odot F_{\mathrm{rgb}} + W_{\mathrm{ir}} \odot F_{\mathrm{ir}})$.
This process enforces the propagation of fine object detail down the feature pyramid, leveraging high-resolution spatial guidance to improve the preservation of object boundaries and small-object sensitivity.
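A hedged PyTorch sketch of these three steps, matching the kernel sizes given in the implementation details (3x3 stride-2 spatial gate, 1x1 interaction conv); the residual on the gated blend is an assumption:

```python
import torch
import torch.nn as nn

class PFMG(nn.Module):
    """Pyramidal feature-aware multimodal gating, sketched from the paper's
    description. c: channels at this level; c_prev: channels of the
    higher-resolution fused map from the level above."""
    def __init__(self, c, c_prev):
        super().__init__()
        self.h_gate = nn.Conv2d(c_prev, 1, 3, stride=2, padding=1)  # 3x3, s=2
        self.interact = nn.Conv2d(2 * c, 2, 1)                      # 1x1 conv

    def forward(self, f_rgb, f_ir, fused_prev=None):
        # Spatially adaptive modality weights: softmax over two score maps.
        w = torch.softmax(self.interact(torch.cat([f_rgb, f_ir], dim=1)), dim=1)
        blend = w[:, :1] * f_rgb + w[:, 1:] * f_ir
        if fused_prev is not None:
            # Hierarchical spatial gate from the higher-resolution fused map,
            # brought to this level's resolution by the stride-2 conv.
            g = torch.sigmoid(self.h_gate(fused_prev))              # (B,1,H,W)
            blend = blend + blend * g   # residual keeps the ungated signal
        return blend

m = PFMG(c=64, c_prev=32)
out = m(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32),
        fused_prev=torch.randn(1, 32, 64, 64))
print(tuple(out.shape))  # (1, 64, 32, 32)
```

Because the gate is a single-channel map, it broadcasts over all channels of the blend, which is what lets high-resolution spatial structure steer coarser levels cheaply.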
4. Integration with Backbone and Detection Head
The dual-stream backbone is based on the YOLOv8 architecture, with C2f blocks in both the RGB and IR branches (independent weights, identical topology). The SCG modules augment intra-scale feature interaction within the backbone, while PFMG modules replace the traditional lateral connections of an FPN. After fusion, the resulting pyramid is fed into a standard PANet neck and then to an oriented-box detection head, whose outputs consist of class scores and oriented box parameters (location, size, and rotation angle).
5. Implementation Strategies and Training Configuration
- Convolutional Kernels:
- SCG spatial gate: 1x1 conv + sigmoid
- SCG channel gate: bottleneck 1x1 conv (with channel reduction) + sigmoid
- PFMG spatial gate: 3x3 conv (stride=2) + sigmoid
- PFMG interaction: 1x1 conv
- Gate Initialization: All gating conv weights and biases set to zero, yielding initial gates of 0.5 post-sigmoid (a neutral starting point).
- Optimization: SGD with momentum 0.937, a decaying learning-rate schedule, and weight decay.
- Batch Size & Augmentation: Batch size 128 distributed across multiple RTX 3090 GPUs; data augmentation via Mosaic, random flipping, and translation.
- Loss Functions: WIoU-v3 for bounding box regression, BCE for classification.
- Epochs: 3-epoch warmup, 300 epochs in total.
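The gate-initialization choice can be verified directly: with zero weights and biases, every pre-activation is zero, so each gate starts at exactly sigmoid(0) = 0.5, regardless of the input.

```python
import torch
import torch.nn as nn

# Zero-initialized gating conv: every pre-activation is 0, so the sigmoid
# gate starts at exactly 0.5 everywhere -- a neutral, half-open gate that
# training can then open or close smoothly.
gate_conv = nn.Conv2d(64, 1, kernel_size=1)
nn.init.zeros_(gate_conv.weight)
nn.init.zeros_(gate_conv.bias)

g = torch.sigmoid(gate_conv(torch.randn(2, 64, 8, 8)))
print(bool(torch.all(g == 0.5)))  # True
```

This avoids destabilizing early training with randomly open or closed gates.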
6. Experimental Results and Comparative Evaluation
PACGNet was evaluated on DroneVehicle (5 classes, oriented boxes) and VEDAI (9 classes, small-object-heavy, oriented boxes) datasets. The principal metric was mAP50. Results are summarized:
| Model Variant | DroneVehicle mAP50 | VEDAI mAP50 |
|---|---|---|
| Baseline (no SCG, PFMG) | 80.1% | 74.1% |
| + PFMG only | 80.7% | 76.7% |
| + SCG only | 80.8% | 76.6% |
| SCG + PFMG (PACGNet) | 81.7% | 82.1% |
| Prior SOTA (RGFNet) | 81.4% | 81.2% |
PACGNet outperforms prior best methods, particularly on VEDAI, where it achieves an absolute gain of 8.0 mAP50 points over the baseline (82.1% vs. 74.1%), indicating a synergistic effect of the SCG and PFMG modules in enhancing cross-modal reinforcement and detail-aware gating for small-object detection. The effect on DroneVehicle is positive but more modest (+1.6 points), with gains from the two modules combining non-additively, confirming their complementary roles.
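The ablation deltas follow directly from the table:

```python
# mAP50 values (%) copied from the ablation table; deltas are relative to
# the baseline without SCG or PFMG.
results = {
    "baseline": {"DroneVehicle": 80.1, "VEDAI": 74.1},
    "+PFMG":    {"DroneVehicle": 80.7, "VEDAI": 76.7},
    "+SCG":     {"DroneVehicle": 80.8, "VEDAI": 76.6},
    "PACGNet":  {"DroneVehicle": 81.7, "VEDAI": 82.1},
}
base = results["baseline"]
deltas = {name: {d: round(v - base[d], 1) for d, v in row.items()}
          for name, row in results.items()}
print(deltas["PACGNet"])  # {'DroneVehicle': 1.6, 'VEDAI': 8.0}
```

On VEDAI the combined gain (+8.0) exceeds the sum of the individual module gains (+2.6 and +2.5), which is the super-additive behavior behind the "synergistic effect" claim.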
7. Architectural Significance and Practical Implications
PACGNet’s integration of both cross-modal horizontal fusion (SCG) and vertical, detail-aware refinement (PFMG) addresses two prevalent challenges: cross-modal noise and the disruption of multi-scale feature hierarchies observed in conventional fusion schemes. The architecture maintains semantic integrity for each modality while promoting selective absorption of complementary features, and enforces high-fidelity propagation of fine-grained details across pyramid levels. The empirical outcomes on diverse benchmarks support the efficacy of these principles, especially in domains characterized by small object density and challenging multimodal registration (Gu et al., 20 Dec 2025).
This suggests that progressive, attention-based gating at both intra- and inter-hierarchical levels may be an effective paradigm for multimodal feature fusion, motivating its application to other vision modalities and tasks characterized by scale sensitivity and cross-modal heterogeneity.