PACGNet: Adaptive Cross-Gating for Aerial Detection
- The paper introduces a dual-stream YOLOv8-based architecture that uses SCG and PFMG modules to fuse RGB and IR data for enhanced aerial object detection.
- It achieves super-additive gains on DroneVehicle and VEDAI benchmarks, notably improving small-object detection performance.
- Experimental results and ablation studies confirm that early cross-modal fusion effectively reduces noise while preserving pyramidal feature hierarchies.
The Pyramidal Adaptive Cross-Gating Network (PACGNet) is an architecture specifically designed for multimodal object detection in aerial imagery, notably UAV-based RGB and infrared (IR) data. It seeks to address key limitations of prevailing multimodal fusion approaches—namely, their proneness to cross-modal noise and their disruption of pyramidal feature hierarchies. PACGNet's design centers around two innovations: a Symmetrical Cross-Gating (SCG) module for bidirectional, modality-specific feature filtering and a Pyramidal Feature-aware Multimodal Gating (PFMG) module for progressive, hierarchy-preserving fusion. Together, these mechanisms deliver state-of-the-art performance on benchmarks such as DroneVehicle and VEDAI, particularly enhancing fine-grained, small-object detection (Gu et al., 20 Dec 2025).
1. Network Architecture and Backbone Integration
PACGNet utilizes a dual-stream variant of the YOLOv8 architecture as its backbone. One stream processes RGB data and the other IR; neither stream is pre-trained. Each stream produces a four-level feature pyramid using convolutional and C2f blocks. Crucially, PACGNet departs from simple post-hoc or naive summation by performing early "horizontal" cross-gating fusion through the SCG module after each of the three deeper pyramid stages. The output at each level $\ell$ is a pairwise-fused feature:

$$F_\ell = \mathrm{SCG}\big(F_\ell^{\mathrm{RGB}},\, F_\ell^{\mathrm{IR}}\big)$$
Following SCG, a top-down PFMG mechanism recursively fuses features along the pyramid:

$$\hat{F}_\ell = \mathrm{PFMG}\big(F_\ell,\, \hat{F}_{\ell-1}\big),$$

where $\hat{F}_{\ell-1}$ is the fused output of the previous (finer) level.
The output fused features are delivered to a standard YOLOv8 neck (PAN) and detection head, which produce oriented bounding box (OBB) predictions.
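The fusion order described above (per-level SCG, then progressive PFMG along the pyramid) can be sketched as a short loop. Here `scg` and `pfmg` are placeholder callables standing in for the modules, and the scalar "features" in the usage example are purely illustrative:

```python
def pacgnet_backbone_fusion(rgb_feats, ir_feats, scg, pfmg):
    """rgb_feats / ir_feats: per-level feature maps, finest level first.
    scg(a, b)       -> pairwise-fused feature at one pyramid level
    pfmg(cur, finer)-> fused feature, guided by the finer level's output
    Placeholder callables; this sketches only the data flow, not the modules."""
    # 1) horizontal SCG fusion at every pyramid level
    fused = [scg(a, b) for a, b in zip(rgb_feats, ir_feats)]
    # 2) progressive PFMG fusion, each level guided by the previous output
    out = [fused[0]]
    for level in fused[1:]:
        out.append(pfmg(level, out[-1]))
    return out  # delivered to the YOLOv8 neck (PAN) and OBB head

# toy usage with scalar stand-ins for feature maps
levels = pacgnet_backbone_fusion([1, 2, 3], [10, 20, 30],
                                 scg=lambda a, b: a + b,
                                 pfmg=lambda cur, finer: cur + finer)
print(levels)  # [11, 33, 66]
```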
2. Symmetrical Cross-Gating (SCG) Module
The SCG module introduces a bidirectional, "horizontal" gating mechanism to selectively update features in one modality with guidance from the other, while preventing noise propagation and maintaining semantic integrity via residual links. Each direction (i.e., IR→RGB and RGB→IR) operates as follows, with $A_{\mathrm{in}}, B_{\mathrm{in}} \in \mathbb{R}^{C \times H \times W}$ as input feature maps:
Processing Flow for IR→RGB:
- Intra-modal refinement: $A_{\mathrm{ref}} = \mathcal{R}(A_{\mathrm{in}})$ and $B_{\mathrm{ref}} = \mathcal{R}(B_{\mathrm{in}})$, using a depthwise-separable bottleneck $\mathcal{R}$.
- Spatial gating: $M = \sigma(\mathrm{Conv}_{1\times 1}(B_{\mathrm{ref}}))$, $F_{\mathrm{sp}} = A_{\mathrm{ref}} \odot (1 + M)$.
- Channel guidance: $B_{\mathrm{ref}}$ is projected by a $1\times 1$ conv bottleneck to $G \in \mathbb{R}^{C/r \times H \times W}$, then $g = \sigma(\mathrm{Conv}_{1\times 1}(G))$ and $F_{\mathrm{ch}} = g \odot G$.
- Fusion and residual: $A_{\mathrm{out}} = \mathrm{BN}(A_{\mathrm{in}}) + F_{\mathrm{sp}} + F_{\mathrm{ch}}$.
This process is executed in parallel for the opposite direction; in general, for a source modality $B$ guiding a target modality $A$, the module computes $A_{\mathrm{out}} = \mathrm{SCG}_{B \to A}(A_{\mathrm{in}}, B_{\mathrm{in}})$.
Pseudocode summarizing SCG:
```python
def SCG(A_in, B_in):                        # A_in, B_in ∈ ℝ^{C×H×W}
    A_ref = R(A_in)                         # depthwise-separable refinement
    B_ref = R(B_in)
    # 1) Spatial gate from B to A
    M = sigmoid(conv1x1(B_ref))             # shape (1, H, W)
    F_sp = A_ref * (1 + M)                  # spatial modulation
    # 2) Channel guidance from B to A
    G = P_projection(B_ref)                 # shape (C/r, H, W)
    g = sigmoid(conv1x1(G))                 # shape (C/r, 1, 1)
    F_ch = g * G                            # channel gating
    # 3) Residual fusion
    output_A = BatchNorm(A_in) + (F_sp + F_ch)
    return output_A
```
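The gating arithmetic of one SCG direction can be exercised with a minimal NumPy sketch. The refinement `R`, BatchNorm, and the learned convolutions are replaced by fixed stand-ins, and the channel-guidance branch is kept at full width (i.e., the reduction ratio is taken as $r=1$) so the residual sum is shape-compatible; this is an illustration of the data flow, not the trained module:

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W = 8, 4, 4

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv1x1(x, out_ch, key):
    # stand-in for a learned 1x1 convolution: per-pixel linear map over channels
    w = np.random.default_rng(key).standard_normal((out_ch, x.shape[0])) * 0.1
    return np.einsum('oc,chw->ohw', w, x)

def scg_direction(A_in, B_in):
    """One SCG direction (B guides A). R(.) and BatchNorm are identity
    stand-ins; the channel gate uses global average pooling (assumption)."""
    A_ref, B_ref = A_in, B_in                        # R(.) stand-in
    M = sigmoid(conv1x1(B_ref, 1, key=1))            # spatial gate, (1, H, W)
    F_sp = A_ref * (1.0 + M)                         # spatial modulation
    G = conv1x1(B_ref, C, key=2)                     # channel projection, r = 1
    g = sigmoid(G.mean(axis=(1, 2), keepdims=True))  # channel gate, (C, 1, 1)
    F_ch = g * G                                     # channel gating
    return A_in + F_sp + F_ch                        # residual fusion

A = rng.standard_normal((C, H, W))
B = rng.standard_normal((C, H, W))
out = scg_direction(A, B)
print(out.shape)  # (8, 4, 4)
```

Running the symmetric direction simply swaps the argument order, `scg_direction(B, A)`, mirroring the module's bidirectional design.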
3. Pyramidal Feature-aware Multimodal Gating (PFMG) Module
The PFMG module is responsible for hierarchy-preserving, top-down fusion of multimodal features. At each pyramid level $\ell$, the module fuses the pairwise-fused feature $F_\ell$, guided by the finer, higher-resolution fused feature $\hat{F}_{\ell-1}$ from the previous level.
Stepwise fusion:
- Hierarchical spatial gate: a sigmoid gate computed from the finer-level feature $\hat{F}_{\ell-1}$ spatially modulates the current level's feature $F_\ell$.
- Modality interaction: the gated features are mixed by convolution and split into two candidate branches.
- Adaptive weighting: per-pixel branch weights are produced by a softmax over the two-channel dimension for each pixel.
- Hierarchically-gated fusion: the softmax-weighted branches are combined into the fused output $\hat{F}_\ell$, which is passed down the pyramid.
All nonlinearities and softmax activations are applied as specified in the formulation.
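The per-pixel two-channel softmax weighting at the heart of PFMG can be illustrated with a NumPy sketch. The learned convolutions are replaced by fixed random linear maps, the two weighted branches are taken to be the gated current-level feature and the finer-level feature (an assumption; the paper's exact branch definitions may differ), and `F_prev` stands for an already-upsampled finer-level output:

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W = 8, 4, 4

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pfmg(F_l, F_prev):
    """Hierarchy-guided fusion sketch: spatial gate from the finer level,
    then a per-pixel softmax over two weight maps. Illustrative only."""
    # hierarchical spatial gate from the finer-level feature, shape (1, H, W)
    w_gate = np.random.default_rng(1).standard_normal((1, C)) * 0.1
    S = sigmoid(np.einsum('oc,chw->ohw', w_gate, F_prev))
    X = F_l * (1.0 + S)                          # gate the current level
    # two per-pixel weight maps, softmax over the 2-channel dimension
    w_mix = np.random.default_rng(2).standard_normal((2, C)) * 0.1
    logits = np.einsum('oc,chw->ohw', w_mix, X)  # shape (2, H, W)
    e = np.exp(logits - logits.max(axis=0, keepdims=True))
    w = e / e.sum(axis=0, keepdims=True)         # weights sum to 1 per pixel
    # adaptively re-weight the two branches and fuse
    return w[0:1] * X + w[1:2] * F_prev

F_l = rng.standard_normal((C, H, W))
F_prev = rng.standard_normal((C, H, W))
fused = pfmg(F_l, F_prev)
print(fused.shape)  # (8, 4, 4)
```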
4. Training Protocols, Implementation, and Model Variants
PACGNet is trained on two UAV-focused datasets—DroneVehicle and VEDAI—both containing paired RGB-IR imagery with oriented bounding box annotations for vehicles. Preprocessing includes cropping, resizing, and standard augmentations such as mosaic composition, random flips, and translation.
Optimization specifics:
- Framework: Ultralytics YOLOv8 v8.2.50
- Hardware: 8 NVIDIA RTX 3090 GPUs
- Batch size: 128 (16 per GPU)
- Optimizer: SGD with momentum 0.937 and weight decay
- Learning rate: initial 0.01, scheduled to a final factor of 0.01
- Warmup: 3 epochs (starting momentum 0.8, bias_lr 0.1)
- Total epochs: 300
- Losses: WIoU v3 for regression, standard cross-entropy for classification
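These settings map onto standard Ultralytics hyperparameter keys; a sketch of the corresponding training configuration (key names follow Ultralytics YOLOv8 conventions; the weight-decay value is not stated above and is left blank):

```yaml
# Ultralytics-style training hyperparameters (sketch)
epochs: 300
batch: 128            # 16 per GPU across 8 GPUs
optimizer: SGD
momentum: 0.937
weight_decay:         # value not stated above
lr0: 0.01             # initial learning rate
lrf: 0.01             # final LR factor
warmup_epochs: 3
warmup_momentum: 0.8
warmup_bias_lr: 0.1
```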
Model Growth (Ablation Table):
| Variant | Params (M) | GFLOPS | mAP50 (VEDAI) | mAP50 (DroneVehicle) |
|---|---|---|---|---|
| Baseline Dual YOLOv8 | 4.3 | 11.6 | 74.1 | 80.1 |
| +PFMG | 4.7 | 12.3 | 76.7 | 80.7 |
| +SCG | 4.8 | 12.5 | 76.6 | 80.8 |
| PACGNet (PFMG+SCG) | 5.2 | 13.2 | 82.1 | 81.7 |
Both modules individually yield clear improvement, with the combination delivering super-additive gains.
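The super-additivity claim can be checked directly from the VEDAI column of the table above:

```python
# mAP50 (VEDAI) values from the ablation table
baseline, pfmg_only, scg_only, combined = 74.1, 76.7, 76.6, 82.1

gain_pfmg = pfmg_only - baseline      # +2.6 from PFMG alone
gain_scg = scg_only - baseline        # +2.5 from SCG alone
gain_combined = combined - baseline   # +8.0 from both together

# the combined gain exceeds the sum of the individual gains -> super-additive
print(round(gain_combined, 1), round(gain_pfmg + gain_scg, 1))  # 8.0 5.1
```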
5. Experimental Results and Analysis
On DroneVehicle, PACGNet attains an mAP50 of 81.7% (IoU=0.50 for oriented boxes), compared to 77.4% for the best single-modality YOLOv8 and 81.4% for the top prior multimodal approach (RGFNet). On VEDAI, PACGNet reaches 82.1% mAP50, surpassing the highest previously reported result (S-MSTD: 81.2%) (Gu et al., 20 Dec 2025).
Ablation studies indicate that both SCG and PFMG contribute materially to detection accuracy; gains are especially pronounced on datasets characterized by many small objects (such as VEDAI, where joint use of SCG+PFMG yields +8.0 percentage points in mAP50). Qualitative analyses further demonstrate PACGNet's ability to correct low-light misses and suppress false positives compared to earlier baselines. Activation heatmaps confirm that PACGNet's attention concentrates on vehicle bodies, in contrast with baseline models, whose activations scatter into irrelevant background regions.
6. Limitations and Prospective Directions
PACGNet exhibits some underperformance on visually ambiguous classes (e.g., discriminating vans from cars), which is plausibly attributed to the one-stage detection architecture employed. Augmenting the architecture with two-stage heads may further enhance fine-grained discrimination.
Potential extensions include:
- Transfer of the cross-gating and pyramidal fusion paradigm to other multimodal remote-sensing tasks, such as semantic segmentation or change detection.
- Pre-training of SCG and PFMG modules on large-scale multimodal video datasets.
- Exploration of more advanced hierarchical gating mechanisms (e.g., transformer-based self- and cross-attention).
PACGNet establishes the principle that deep fusion within the backbone (leveraging cross-gating and pyramidal hierarchy) is superior to conventional late- or naive fusion, particularly for small-object detection in multimodal aerial imagery (Gu et al., 20 Dec 2025).