PACGNet: Adaptive Cross-Gating for Aerial Detection
- The paper introduces a dual-stream YOLOv8-based architecture that uses SCG and PFMG modules to fuse RGB and IR data for enhanced aerial object detection.
- It achieves super-additive gains on DroneVehicle and VEDAI benchmarks, notably improving small-object detection performance.
- Experimental results and ablation studies confirm that early cross-modal fusion effectively reduces noise while preserving pyramidal feature hierarchies.
The Pyramidal Adaptive Cross-Gating Network (PACGNet) is an architecture specifically designed for multimodal object detection in aerial imagery, notably UAV-based RGB and infrared (IR) data. It seeks to address key limitations of prevailing multimodal fusion approaches—namely, their proneness to cross-modal noise and their disruption of pyramidal feature hierarchies. PACGNet's design centers around two innovations: a Symmetrical Cross-Gating (SCG) module for bidirectional, modality-specific feature filtering and a Pyramidal Feature-aware Multimodal Gating (PFMG) module for progressive, hierarchy-preserving fusion. Together, these mechanisms deliver state-of-the-art performance on benchmarks such as DroneVehicle and VEDAI, particularly enhancing fine-grained, small-object detection (Gu et al., 20 Dec 2025).
1. Network Architecture and Backbone Integration
PACGNet utilizes a dual-stream variant of the YOLOv8 architecture as its backbone. One stream processes RGB data and the other IR; neither stream is pre-trained. Each stream produces a four-level feature pyramid using convolutional and C2f blocks. Crucially, PACGNet departs from simple post-hoc or naive summation by performing early "horizontal" cross-gating fusion through the SCG module after each of the three deeper pyramid stages. The output at each level $\ell$ is a pairwise-fused feature:

$$F_\ell = \mathrm{SCG}\big(F_\ell^{\mathrm{RGB}},\, F_\ell^{\mathrm{IR}}\big)$$
Following SCG, a top-down PFMG mechanism recursively fuses features along the pyramid:

$$\hat{F}_\ell = \mathrm{PFMG}\big(F_\ell,\, \hat{F}_{\ell-1}\big),$$

where $\hat{F}_{\ell-1}$ is the fused output of the previous (finer) level.
The output fused features are delivered to a standard YOLOv8 neck (PAN) and detection head, which produce oriented bounding box (OBB) predictions.
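The fusion order described above (per-level SCG, then progressive PFMG along the pyramid) can be sketched as a short loop. Here `scg` and `pfmg` are placeholder callables standing in for the modules, and the scalar "features" in the usage example are purely illustrative:

```python
def pacgnet_backbone_fusion(rgb_feats, ir_feats, scg, pfmg):
    """rgb_feats / ir_feats: per-level feature maps, finest level first.
    scg(a, b)       -> pairwise-fused feature at one pyramid level
    pfmg(cur, finer)-> fused feature, guided by the finer level's output
    Placeholder callables; this sketches only the data flow, not the modules."""
    # 1) horizontal SCG fusion at every pyramid level
    fused = [scg(a, b) for a, b in zip(rgb_feats, ir_feats)]
    # 2) progressive PFMG fusion, each level guided by the previous output
    out = [fused[0]]
    for level in fused[1:]:
        out.append(pfmg(level, out[-1]))
    return out  # delivered to the YOLOv8 neck (PAN) and OBB head

# toy usage with scalar stand-ins for feature maps
levels = pacgnet_backbone_fusion([1, 2, 3], [10, 20, 30],
                                 scg=lambda a, b: a + b,
                                 pfmg=lambda cur, finer: cur + finer)
print(levels)  # [11, 33, 66]
```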
2. Symmetrical Cross-Gating (SCG) Module
The SCG module introduces a bidirectional, "horizontal" gating mechanism to selectively update features in one modality with guidance from the other, while preventing noise propagation and maintaining semantic integrity via residual links. Each direction (i.e., IR→RGB and RGB→IR) operates as follows, with $A_{\mathrm{in}}, B_{\mathrm{in}} \in \mathbb{R}^{C \times H \times W}$ as input feature maps:
Processing Flow for IR→RGB:
- Intra-modal refinement: $A_{\mathrm{ref}} = \mathcal{R}(A_{\mathrm{in}})$ and $B_{\mathrm{ref}} = \mathcal{R}(B_{\mathrm{in}})$, using a depthwise-separable bottleneck $\mathcal{R}$.
- Spatial gating: $M = \sigma(\mathrm{Conv}_{1\times 1}(B_{\mathrm{ref}}))$, $F_{\mathrm{sp}} = A_{\mathrm{ref}} \odot (1 + M)$.
- Channel guidance: $B_{\mathrm{ref}}$ is projected by a $1\times 1$ conv bottleneck to $G \in \mathbb{R}^{C/r \times H \times W}$, then $g = \sigma(\mathrm{Conv}_{1\times 1}(G))$ and $F_{\mathrm{ch}} = g \odot G$.
- Fusion and residual: $A_{\mathrm{out}} = \mathrm{BN}(A_{\mathrm{in}}) + F_{\mathrm{sp}} + F_{\mathrm{ch}}$.
This process is executed in parallel for the opposite direction; in general, for a source modality $B$ guiding a target modality $A$, the module computes $A_{\mathrm{out}} = \mathrm{SCG}_{B \to A}(A_{\mathrm{in}}, B_{\mathrm{in}})$.
Pseudocode summarizing SCG:
```python
def SCG(A_in, B_in):                        # A_in, B_in ∈ ℝ^{C×H×W}
    A_ref = R(A_in)                         # depthwise-separable refinement
    B_ref = R(B_in)
    # 1) Spatial gate from B to A
    M = sigmoid(conv1x1(B_ref))             # shape (1, H, W)
    F_sp = A_ref * (1 + M)                  # spatial modulation
    # 2) Channel guidance from B to A
    G = P_projection(B_ref)                 # shape (C/r, H, W)
    g = sigmoid(conv1x1(G))                 # shape (C/r, 1, 1)
    F_ch = g * G                            # channel gating
    # 3) Residual fusion
    output_A = BatchNorm(A_in) + (F_sp + F_ch)
    return output_A
```
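The gating arithmetic of one SCG direction can be exercised with a minimal NumPy sketch. The refinement `R`, BatchNorm, and the learned convolutions are replaced by fixed stand-ins, and the channel-guidance branch is kept at full width (i.e., the reduction ratio is taken as $r=1$) so the residual sum is shape-compatible; this is an illustration of the data flow, not the trained module:

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W = 8, 4, 4

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv1x1(x, out_ch, key):
    # stand-in for a learned 1x1 convolution: per-pixel linear map over channels
    w = np.random.default_rng(key).standard_normal((out_ch, x.shape[0])) * 0.1
    return np.einsum('oc,chw->ohw', w, x)

def scg_direction(A_in, B_in):
    """One SCG direction (B guides A). R(.) and BatchNorm are identity
    stand-ins; the channel gate uses global average pooling (assumption)."""
    A_ref, B_ref = A_in, B_in                        # R(.) stand-in
    M = sigmoid(conv1x1(B_ref, 1, key=1))            # spatial gate, (1, H, W)
    F_sp = A_ref * (1.0 + M)                         # spatial modulation
    G = conv1x1(B_ref, C, key=2)                     # channel projection, r = 1
    g = sigmoid(G.mean(axis=(1, 2), keepdims=True))  # channel gate, (C, 1, 1)
    F_ch = g * G                                     # channel gating
    return A_in + F_sp + F_ch                        # residual fusion

A = rng.standard_normal((C, H, W))
B = rng.standard_normal((C, H, W))
out = scg_direction(A, B)
print(out.shape)  # (8, 4, 4)
```

Running the symmetric direction simply swaps the argument order, `scg_direction(B, A)`, mirroring the module's bidirectional design.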
3. Pyramidal Feature-aware Multimodal Gating (PFMG) Module
The PFMG module is responsible for hierarchy-preserving, top-down fusion of multimodal features. At each pyramid level $\ell$, the module fuses the pairwise-fused feature $F_\ell$, guided by the finer, higher-resolution fused feature $\hat{F}_{\ell-1}$ from the previous level.
Stepwise fusion:
- Hierarchical spatial gate: a sigmoid gate computed from the finer-level feature $\hat{F}_{\ell-1}$ spatially modulates the current level's feature $F_\ell$.
- Modality interaction: the gated features are mixed by convolution and split into two candidate branches.
- Adaptive weighting: per-pixel branch weights are produced by a softmax over the two-channel dimension for each pixel.
- Hierarchically-gated fusion: the softmax-weighted branches are combined into the fused output $\hat{F}_\ell$, which is passed down the pyramid.
All nonlinearities and softmax activations are applied as specified in the formulation.
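The per-pixel two-channel softmax weighting at the heart of PFMG can be illustrated with a NumPy sketch. The learned convolutions are replaced by fixed random linear maps, the two weighted branches are taken to be the gated current-level feature and the finer-level feature (an assumption; the paper's exact branch definitions may differ), and `F_prev` stands for an already-upsampled finer-level output:

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W = 8, 4, 4

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pfmg(F_l, F_prev):
    """Hierarchy-guided fusion sketch: spatial gate from the finer level,
    then a per-pixel softmax over two weight maps. Illustrative only."""
    # hierarchical spatial gate from the finer-level feature, shape (1, H, W)
    w_gate = np.random.default_rng(1).standard_normal((1, C)) * 0.1
    S = sigmoid(np.einsum('oc,chw->ohw', w_gate, F_prev))
    X = F_l * (1.0 + S)                          # gate the current level
    # two per-pixel weight maps, softmax over the 2-channel dimension
    w_mix = np.random.default_rng(2).standard_normal((2, C)) * 0.1
    logits = np.einsum('oc,chw->ohw', w_mix, X)  # shape (2, H, W)
    e = np.exp(logits - logits.max(axis=0, keepdims=True))
    w = e / e.sum(axis=0, keepdims=True)         # weights sum to 1 per pixel
    # adaptively re-weight the two branches and fuse
    return w[0:1] * X + w[1:2] * F_prev

F_l = rng.standard_normal((C, H, W))
F_prev = rng.standard_normal((C, H, W))
fused = pfmg(F_l, F_prev)
print(fused.shape)  # (8, 4, 4)
```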
4. Training Protocols, Implementation, and Model Variants
PACGNet is trained on two UAV-focused datasets—DroneVehicle and VEDAI—both containing paired RGB-IR imagery with oriented bounding box annotations for vehicles. Preprocessing includes cropping, resizing, and standard augmentations such as mosaic composition, random flips, and translation.
Optimization specifics:
- Framework: Ultralytics YOLOv8 v8.2.50
- Hardware: 8 NVIDIA RTX 3090 GPUs
- Batch size: 128 (16 per GPU)
- Optimizer: SGD with momentum 0.937 and weight decay
- Learning rate: initial 0.01, scheduled to a final factor of 0.01
- Warmup: 3 epochs (starting momentum 0.8, bias_lr 0.1)
- Total epochs: 300
- Losses: WIoU v3 for regression, standard cross-entropy for classification
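These settings map onto standard Ultralytics hyperparameter keys; a sketch of the corresponding training configuration (key names follow Ultralytics YOLOv8 conventions; the weight-decay value is not stated above and is left blank):

```yaml
# Ultralytics-style training hyperparameters (sketch)
epochs: 300
batch: 128            # 16 per GPU across 8 GPUs
optimizer: SGD
momentum: 0.937
weight_decay:         # value not stated above
lr0: 0.01             # initial learning rate
lrf: 0.01             # final LR factor
warmup_epochs: 3
warmup_momentum: 0.8
warmup_bias_lr: 0.1
```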
Model Growth (Ablation Table):
| Variant | Params (M) | GFLOPS | mAP50 (VEDAI) | mAP50 (DroneVehicle) |
|---|---|---|---|---|
| Baseline Dual YOLOv8 | 4.3 | 11.6 | 74.1 | 80.1 |
| +PFMG | 4.7 | 12.3 | 76.7 | 80.7 |
| +SCG | 4.8 | 12.5 | 76.6 | 80.8 |
| PACGNet (PFMG+SCG) | 5.2 | 13.2 | 82.1 | 81.7 |
Both modules individually yield clear improvement, with the combination delivering super-additive gains.
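The super-additivity claim can be checked directly from the VEDAI column of the table above:

```python
# mAP50 (VEDAI) values from the ablation table
baseline, pfmg_only, scg_only, combined = 74.1, 76.7, 76.6, 82.1

gain_pfmg = pfmg_only - baseline      # +2.6 from PFMG alone
gain_scg = scg_only - baseline        # +2.5 from SCG alone
gain_combined = combined - baseline   # +8.0 from both together

# the combined gain exceeds the sum of the individual gains -> super-additive
print(round(gain_combined, 1), round(gain_pfmg + gain_scg, 1))  # 8.0 5.1
```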
5. Experimental Results and Analysis
On DroneVehicle, PACGNet attains an mAP50 of 81.7% (IoU=0.50 for oriented boxes), compared to 77.4% for the best single-modality YOLOv8 and 81.4% for the top prior multimodal approach (RGFNet). On VEDAI, PACGNet reaches 82.1% mAP50, surpassing the highest previously reported result (S-MSTD: 81.2%) (Gu et al., 20 Dec 2025).
Ablation studies indicate that both SCG and PFMG contribute materially to detection accuracy; gains are especially pronounced on datasets characterized by many small objects (such as VEDAI, where joint use of SCG+PFMG yields +8.0 percentage points in mAP50). Qualitative analyses further demonstrate PACGNet's ability to correct low-light misses and suppress false positives compared to earlier baselines. Activation heatmaps confirm that PACGNet's attention concentrates on vehicle bodies, in contrast with baseline models, whose activations scatter into irrelevant background regions.
6. Limitations and Prospective Directions
PACGNet exhibits some underperformance on visually ambiguous classes (e.g., discriminating vans from cars), which is plausibly attributed to the one-stage detection architecture employed. Augmenting the architecture with two-stage heads may further enhance fine-grained discrimination.
Potential extensions include:
- Transfer of the cross-gating and pyramidal fusion paradigm to other multimodal remote-sensing tasks, such as semantic segmentation or change detection.
- Pre-training of SCG and PFMG modules on large-scale multimodal video datasets.
- Exploration of more advanced hierarchical gating mechanisms (e.g., transformer-based self- and cross-attention).
PACGNet establishes the principle that deep fusion within the backbone (leveraging cross-gating and pyramidal hierarchy) is superior to conventional late- or naive fusion, particularly for small-object detection in multimodal aerial imagery (Gu et al., 20 Dec 2025).