PACGNet Architecture for Multimodal Detection
- The PACGNet architecture introduces a dual-stream backbone with symmetrical cross-gating (SCG) to refine features from RGB and IR modalities.
- It employs a pyramidal feature-aware multimodal gating (PFMG) module to progressively fuse hierarchical features and enhance small-object detection.
- Experiments on DroneVehicle and VEDAI datasets demonstrate PACGNet’s state-of-the-art performance, with significant mAP50 improvements.
The Pyramidal Adaptive Cross-Gating Network (PACGNet) is an architecture for multimodal object detection, explicitly designed to address the limitations of conventional feature fusion methods in aerial imagery-based detection tasks. PACGNet preserves hierarchical feature structures and reduces cross-modal noise through a combination of bidirectional “horizontal” gating (Symmetrical Cross-Gating, SCG) and progressive “vertical” fusion (Pyramidal Feature-aware Multimodal Gating, PFMG) within a dual-stream deep backbone. This approach produces state-of-the-art results on challenging benchmarks, especially in the detection of small or fine-grained objects in registered RGB and infrared (IR) imagery (Gu et al., 20 Dec 2025).
1. Network Structure and Dataflow
PACGNet processes registered RGB and IR image pairs—each resized to a common square input resolution—via a dual-stream YOLOv8 backbone. The backbone consists of two identical but independently parameterized branches (one for RGB, one for IR), using standard C2f blocks. Each stream generates hierarchical feature maps at four successively halved resolutions, with channel width increasing at each stage in the base model.
After each backbone stage except the final one, a Symmetrical Cross-Gating (SCG) module fuses and refines features between the two modalities at the corresponding resolution, so SCG operates at the three shallower scales.
Following the backbone, a series of PFMG modules performs progressive, top-down fusion across three successively coarser levels, with each level’s fusion guided by the fused features from the previous, higher-resolution stage. The final fused pyramid is processed by a standard PANet neck and a YOLOv8-style oriented-box detection head.
The main dataflow is as follows:
- registered RGB and IR inputs → dual-stream backbone (four stages per stream)
- SCG refinement after each backbone stage except the final one
- PFMG fusion over the three coarser levels, each guided by the fused output of the previous, higher-resolution level
- PANet neck → oriented-box detection head → final oriented detections
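Under illustrative assumptions (stride-2 convolutions standing in for C2f stages, simple averaging stand-ins for the SCG and PFMG modules, and made-up channel widths), the dataflow above can be sketched in PyTorch as:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def stage(cin, cout):
    # Stand-in for one YOLOv8 C2f backbone stage: stride-2 conv + SiLU.
    return nn.Sequential(nn.Conv2d(cin, cout, 3, 2, 1), nn.SiLU())

class PACGNetFlow(nn.Module):
    """Wiring of the PACGNet dataflow only: SCG and PFMG are reduced to
    averaging/gating stand-ins; the widths (16/32/64/128) are illustrative,
    not the paper's."""
    def __init__(self, widths=(16, 32, 64, 128)):
        super().__init__()
        chans = (3,) + tuple(widths)
        self.rgb = nn.ModuleList(stage(chans[i], chans[i + 1]) for i in range(4))
        self.ir = nn.ModuleList(stage(chans[i], chans[i + 1]) for i in range(4))

    def forward(self, x_rgb, x_ir):
        pairs = []
        for i in range(4):
            x_rgb, x_ir = self.rgb[i](x_rgb), self.ir[i](x_ir)
            if i < 3:                        # SCG at all stages except the final
                m = 0.5 * (x_rgb + x_ir)     # stand-in for cross-gated exchange
                x_rgb, x_ir = x_rgb + m, x_ir + m
            pairs.append((x_rgb, x_ir))
        # PFMG: fuse the three coarser scales top-down; each fusion is gated
        # by the previous (higher-resolution) fused map.
        fused, guide = [], None
        for f_rgb, f_ir in pairs[1:]:
            blend = 0.5 * (f_rgb + f_ir)     # stand-in for modality weighting
            if guide is not None:
                gate = torch.sigmoid(F.avg_pool2d(guide, 2)).mean(1, keepdim=True)
                blend = blend * gate         # hierarchical spatial gating
            fused.append(blend)
            guide = blend
        return fused                         # pyramid handed to the PANet neck

feats = PACGNetFlow()(torch.randn(1, 3, 256, 256), torch.randn(1, 3, 256, 256))
print([tuple(f.shape) for f in feats])  # [(1, 32, 64, 64), (1, 64, 32, 32), (1, 128, 16, 16)]
```

The three returned maps correspond to the pyramid levels handed to the neck; the real modules replace the averaging stand-ins with the gating described in the following sections.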
2. Symmetrical Cross-Gating (SCG) Module
At each of the three SCG scales, the module implements a symmetrical, bidirectional “horizontal” cross-gating function between the RGB and IR feature maps at that resolution. Both streams are first refined through intra-modal bottleneck layers before entering cross-modal gating.
SCG includes two gating routes per direction:
- Spatial Gating: Computes a spatial attention map from one modality and applies it to the other, e.g., $A_s = \sigma(\mathrm{Conv}_{1\times 1}(F_{\mathrm{ir}}))$, then $\tilde{F}^{s}_{\mathrm{rgb}} = F_{\mathrm{rgb}} \odot A_s$, where $\sigma$ denotes the sigmoid and $\odot$ element-wise multiplication.
- Channel Gating: Projects the guidance modality through a bottleneck and a 1x1 conv to yield a per-channel gate, e.g., $A_c = \sigma(\mathrm{Conv}_{1\times 1}(\mathrm{Bottleneck}(F_{\mathrm{ir}})))$, then $\tilde{F}^{c}_{\mathrm{rgb}} = F_{\mathrm{rgb}} \odot A_c$.

The final refined feature is obtained by residual addition and normalization:

$$F'_{\mathrm{rgb}} = \mathrm{Norm}\left(F_{\mathrm{rgb}} + \tilde{F}^{s}_{\mathrm{rgb}} + \tilde{F}^{c}_{\mathrm{rgb}}\right)$$

The operation is applied symmetrically in the opposite direction to refine $F_{\mathrm{ir}}$, and each direction employs an independent parameter set. This ensures that cross-modal enhancement selectively absorbs complementary information while suppressing noise; the residual path prevents destructive gating and preserves modality-specific semantics.
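A minimal PyTorch sketch of the two gating routes follows. Global average pooling inside the channel bottleneck and BatchNorm as the normalization are assumptions not specified in the source:

```python
import torch
import torch.nn as nn

class SCGDirection(nn.Module):
    """One direction of symmetrical cross-gating (e.g., IR guiding RGB).
    Layer details beyond the 1x1 gate convs are assumptions."""
    def __init__(self, c, r=4):
        super().__init__()
        self.spatial = nn.Conv2d(c, 1, 1)          # 1x1 conv -> spatial map
        self.channel = nn.Sequential(              # bottleneck channel gate
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(c, c // r, 1), nn.SiLU(),
            nn.Conv2d(c // r, c, 1))
        self.norm = nn.BatchNorm2d(c)
        # zero-init the gate convs so every gate starts at sigmoid(0) = 0.5
        for m in (self.spatial, self.channel[1], self.channel[3]):
            nn.init.zeros_(m.weight)
            nn.init.zeros_(m.bias)

    def forward(self, x, guide):
        a_s = torch.sigmoid(self.spatial(guide))   # (B,1,H,W) spatial gate
        a_c = torch.sigmoid(self.channel(guide))   # (B,C,1,1) channel gate
        return self.norm(x + x * a_s + x * a_c)    # residual refinement

class SCG(nn.Module):
    """Symmetric block: two directions with independent parameters."""
    def __init__(self, c):
        super().__init__()
        self.rgb_from_ir = SCGDirection(c)
        self.ir_from_rgb = SCGDirection(c)

    def forward(self, f_rgb, f_ir):
        return self.rgb_from_ir(f_rgb, f_ir), self.ir_from_rgb(f_ir, f_rgb)

scg = SCG(32)
out_rgb, out_ir = scg(torch.randn(2, 32, 16, 16), torch.randn(2, 32, 16, 16))
print(tuple(out_rgb.shape), tuple(out_ir.shape))  # (2, 32, 16, 16) (2, 32, 16, 16)
```

Note that both refined streams keep their input shapes, so SCG can be dropped in after any backbone stage without altering downstream dimensions.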
3. Pyramidal Feature-aware Multimodal Gating (PFMG) Module
The PFMG module implements a progressive vertical fusion mechanism at each pyramid level, guided by the fused feature from the previous, higher-resolution level. The detailed steps include:
- Hierarchical Spatial Gate: Computes a spatial gate from the higher-resolution fused feature, $G = \sigma(\mathrm{Conv}_{3\times 3,\, s=2}(F^{\mathrm{fused}}_{\mathrm{prev}}))$, downsampling it to the current level’s resolution.
- Modality Interaction & Weighting: Concatenates and projects $F_{\mathrm{rgb}}$ and $F_{\mathrm{ir}}$ through a 1x1 conv, then computes a softmax score map $[W_{\mathrm{rgb}}, W_{\mathrm{ir}}] = \mathrm{softmax}(\mathrm{Conv}_{1\times 1}([F_{\mathrm{rgb}}; F_{\mathrm{ir}}]))$ to yield a spatially adaptive blend.
- Gated Fusion: $F^{\mathrm{fused}} = G \odot (W_{\mathrm{rgb}} \odot F_{\mathrm{rgb}} + W_{\mathrm{ir}} \odot F_{\mathrm{ir}})$.
This process enforces the propagation of fine object detail down the feature pyramid, leveraging high-resolution spatial guidance to improve the preservation of object boundaries and small-object sensitivity.
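A hedged PyTorch sketch of these three steps, matching the kernel sizes given in the implementation details (3x3 stride-2 spatial gate, 1x1 interaction conv); the residual on the gated blend is an assumption:

```python
import torch
import torch.nn as nn

class PFMG(nn.Module):
    """Pyramidal feature-aware multimodal gating, sketched from the paper's
    description. c: channels at this level; c_prev: channels of the
    higher-resolution fused map from the level above."""
    def __init__(self, c, c_prev):
        super().__init__()
        self.h_gate = nn.Conv2d(c_prev, 1, 3, stride=2, padding=1)  # 3x3, s=2
        self.interact = nn.Conv2d(2 * c, 2, 1)                      # 1x1 conv

    def forward(self, f_rgb, f_ir, fused_prev=None):
        # Spatially adaptive modality weights: softmax over two score maps.
        w = torch.softmax(self.interact(torch.cat([f_rgb, f_ir], dim=1)), dim=1)
        blend = w[:, :1] * f_rgb + w[:, 1:] * f_ir
        if fused_prev is not None:
            # Hierarchical spatial gate from the higher-resolution fused map,
            # brought to this level's resolution by the stride-2 conv.
            g = torch.sigmoid(self.h_gate(fused_prev))              # (B,1,H,W)
            blend = blend + blend * g   # residual keeps the ungated signal
        return blend

m = PFMG(c=64, c_prev=32)
out = m(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32),
        fused_prev=torch.randn(1, 32, 64, 64))
print(tuple(out.shape))  # (1, 64, 32, 32)
```

Because the gate is a single-channel map, it broadcasts over all channels of the blend, which is what lets high-resolution spatial structure steer coarser levels cheaply.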
4. Integration with Backbone and Detection Head
The dual-stream backbone is based on the YOLOv8 architecture, with C2f blocks in both the RGB and IR branches (independent weights, identical topology). The SCG modules augment intra-scale feature interaction within the backbone, while PFMG modules replace the traditional lateral connections of an FPN. After fusion, the resulting pyramid is fed into a standard PANet neck and then to an oriented-box detection head, whose outputs consist of class scores and oriented box parameters (location, size, and rotation angle).
5. Implementation Strategies and Training Configuration
- Convolutional Kernels:
- SCG spatial gate: 1x1 conv + sigmoid
- SCG channel gate: bottleneck 1x1 conv (with channel reduction) + sigmoid
- PFMG spatial gate: 3x3 conv (stride=2) + sigmoid
- PFMG interaction: 1x1 conv
- Gate Initialization: All gating conv weights and biases set to zero, yielding initial gates of 0.5 post-sigmoid (a neutral starting point).
- Optimization: SGD with momentum 0.937, a decaying learning-rate schedule, and weight decay.
- Batch Size & Augmentation: Batch size 128 distributed across multiple RTX 3090 GPUs; data augmentation via Mosaic, random flipping, and translation.
- Loss Functions: WIoU-v3 for bounding box regression, BCE for classification.
- Epochs: 3-epoch warmup, 300 epochs in total.
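The gate-initialization choice can be verified directly: with zero weights and biases, every pre-activation is zero, so each gate starts at exactly sigmoid(0) = 0.5, regardless of the input.

```python
import torch
import torch.nn as nn

# Zero-initialized gating conv: every pre-activation is 0, so the sigmoid
# gate starts at exactly 0.5 everywhere -- a neutral, half-open gate that
# training can then open or close smoothly.
gate_conv = nn.Conv2d(64, 1, kernel_size=1)
nn.init.zeros_(gate_conv.weight)
nn.init.zeros_(gate_conv.bias)

g = torch.sigmoid(gate_conv(torch.randn(2, 64, 8, 8)))
print(bool(torch.all(g == 0.5)))  # True
```

This avoids destabilizing early training with randomly open or closed gates.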
6. Experimental Results and Comparative Evaluation
PACGNet was evaluated on DroneVehicle (5 classes, oriented boxes) and VEDAI (9 classes, small-object-heavy, oriented boxes) datasets. The principal metric was mAP50. Results are summarized:
| Model Variant | DroneVehicle mAP50 | VEDAI mAP50 |
|---|---|---|
| Baseline (no SCG, PFMG) | 80.1% | 74.1% |
| + PFMG only | 80.7% | 76.7% |
| + SCG only | 80.8% | 76.6% |
| SCG + PFMG (PACGNet) | 81.7% | 82.1% |
| Prior SOTA (RGFNet) | 81.4% | 81.2% |
PACGNet outperforms prior best methods, particularly on VEDAI, where it achieves an absolute gain of 8.0 mAP50 points over the baseline (82.1% vs. 74.1%), indicating a synergistic effect of the SCG and PFMG modules in enhancing cross-modal reinforcement and detail-aware gating for small-object detection. The effect on DroneVehicle is positive but more modest (+1.6 points), with gains from the two modules combining non-additively, confirming their complementary roles.
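The ablation deltas follow directly from the table:

```python
# mAP50 values (%) copied from the ablation table; deltas are relative to
# the baseline without SCG or PFMG.
results = {
    "baseline": {"DroneVehicle": 80.1, "VEDAI": 74.1},
    "+PFMG":    {"DroneVehicle": 80.7, "VEDAI": 76.7},
    "+SCG":     {"DroneVehicle": 80.8, "VEDAI": 76.6},
    "PACGNet":  {"DroneVehicle": 81.7, "VEDAI": 82.1},
}
base = results["baseline"]
deltas = {name: {d: round(v - base[d], 1) for d, v in row.items()}
          for name, row in results.items()}
print(deltas["PACGNet"])  # {'DroneVehicle': 1.6, 'VEDAI': 8.0}
```

On VEDAI the combined gain (+8.0) exceeds the sum of the individual module gains (+2.6 and +2.5), which is the super-additive behavior behind the "synergistic effect" claim.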
7. Architectural Significance and Practical Implications
PACGNet’s integration of both cross-modal horizontal fusion (SCG) and vertical, detail-aware refinement (PFMG) addresses two prevalent challenges: cross-modal noise and the disruption of multi-scale feature hierarchies observed in conventional fusion schemes. The architecture maintains semantic integrity for each modality while promoting selective absorption of complementary features, and enforces high-fidelity propagation of fine-grained details across pyramid levels. The empirical outcomes on diverse benchmarks support the efficacy of these principles, especially in domains characterized by small object density and challenging multimodal registration (Gu et al., 20 Dec 2025).
This suggests that progressive, attention-based gating at both intra- and inter-hierarchical levels may be an effective paradigm for multimodal feature fusion, motivating its application to other vision modalities and tasks characterized by scale sensitivity and cross-modal heterogeneity.