Symmetrical Cross-Gating Module for Robust Fusion
- SCG is defined as a deep feature-level multimodal fusion module that selectively integrates RGB and IR data using bidirectional spatial and channel gating.
- It utilizes a refined residual-gated architecture to suppress cross-modal noise while preserving modality-specific semantics in complex scenes.
- Empirical studies on VEDAI and DroneVehicle benchmarks show that SCG improves detection mAP by up to 2.5%, enhancing overall reliability.
The Symmetrical Cross-Gating (SCG) module is a deep feature-level multimodal fusion mechanism introduced for robust object detection in settings such as aerial imagery, where both visible (RGB) and infrared (IR) modalities are exploited. SCG is specifically engineered to address the challenges of cross-modal noise and semantic degradation commonly observed with naïve fusion strategies (e.g., addition or concatenation), ensuring both selective cross-modal interaction and preservation of modality-specific semantics via a bidirectional, residual-gated architecture (Gu et al., 20 Dec 2025).
1. Motivation for Symmetrical Cross-Gating
Conventional two-stream fusion, such as direct addition or concatenation at the feature level, exhibits two principal deficiencies: high susceptibility to cross-modal noise (e.g., overexposed RGB corrupting IR features) and loss of inherent semantics (irreversibly mixing streams and reducing feature interpretability). SCG is motivated by the need for controlled, "horizontal" interaction—enabling each modality to obtain complementary information from the other stream while actively suppressing detrimental or redundant signals and retaining its own semantic clarity. This is achieved through learnable gating and residual connections, ensuring the features of each branch remain interpretable and robust throughout the detection backbone (Gu et al., 20 Dec 2025).
2. Architectural Structure and Bidirectionality
SCG operates at every level of a feature pyramid (usually at pyramidal stages such as $P_3$, $P_4$, and $P_5$). Each module instance processes a pair of feature maps: $F^{in}_{rgb}$ from the RGB stream and $F^{in}_{ir}$ from the IR stream. These are refined into $F^{out}_{rgb}$ and $F^{out}_{ir}$ using a four-stage process in each direction:
- Intra-modal refinement: Each input is processed by a depthwise-separable bottleneck to produce refined features $F^{r}_{rgb}$ and $F^{r}_{ir}$.
- Cross-modal spatial gating: One modality (e.g., IR) produces a spatial attention mask that modulates the other modality.
- Cross-modal channel gating: The same guiding modality projects a guidance feature, further controlling the contribution via a learnable, channel-wise gate.
- Residual fusion: All modulated signals and the original feature are combined via direct addition, with batch normalization applied to the input to ensure stable learning.
This flow is perfectly symmetrical: IR modulates RGB and vice versa, each with its own gating parameters and residual pathways.
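As a concrete illustration of the intra-modal refinement stage, the following PyTorch sketch implements a depthwise-separable bottleneck of the kind described above; the kernel size, normalization, and activation choices are assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class RefineBlock(nn.Module):
    """Depthwise-separable bottleneck for intra-modal refinement (a sketch).

    A 3x3 depthwise conv mixes spatial context within each channel, then a
    1x1 pointwise conv mixes channels; BatchNorm + SiLU are assumed choices.
    """
    def __init__(self, channels: int):
        super().__init__()
        self.dw = nn.Conv2d(channels, channels, kernel_size=3,
                            padding=1, groups=channels, bias=False)
        self.pw = nn.Conv2d(channels, channels, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(channels)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.bn(self.pw(self.dw(x))))
```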
3. Detailed Mathematical Formulation
Let $\mathcal{R}(\cdot)$ denote the "Refined Feature Extractor" bottleneck and $\mathcal{B}(\cdot)$ the lightweight projection bottleneck. The IR→RGB flow is detailed as follows; the RGB→IR path is structurally analogous.
- Intra-Modal Refinement:

$$F^{r}_{rgb} = \mathcal{R}(F^{in}_{rgb}), \qquad F^{r}_{ir} = \mathcal{R}(F^{in}_{ir})$$

- Cross-Modal Spatial Gating:

$$M_{ir \to rgb} = \sigma\left(\mathrm{Conv}_{1\times 1}(F^{r}_{ir})\right), \qquad F_{spat} = F^{r}_{rgb} \odot \left(1 + M_{ir \to rgb}\right)$$

Here $\sigma$ is the sigmoid function; the $(1+M)$ design ensures that a null mask does not fully suppress the feature.
- Cross-Modal Channel Gating:

$$G_{ir \to rgb} = \mathcal{B}(F^{r}_{ir}), \qquad s_{ir \to rgb} = \sigma\left(\mathrm{Conv}_{1\times 1}(G_{ir \to rgb})\right), \qquad F_{chan} = s_{ir \to rgb} \odot G_{ir \to rgb}$$

- Residual Fusion:

$$F^{out}_{rgb} = \mathrm{BN}(F^{in}_{rgb}) + F_{spat} + F_{chan}$$

The entire process is repeated symmetrically in the RGB→IR direction, ensuring that both branches benefit from complementary modulation and feature isolation (Gu et al., 20 Dec 2025).
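Since $\sigma$ maps into $(0, 1)$, the $(1+M)$ gate bounds the spatial modulation between identity and doubling. A minimal numeric check of the two limiting cases:

```python
import torch

f_target = torch.ones(1, 4, 2, 2)    # stand-in for the refined target feature F^r
m_silent = torch.zeros(1, 1, 2, 2)   # null spatial mask (gate inactive)
m_active = torch.ones(1, 1, 2, 2)    # saturated spatial mask (gate fully on)

# With M in [0, 1], (1 + M) scales the feature between 1x and 2x: the guiding
# modality can amplify the other stream but never erase it.
assert torch.equal(f_target * (1 + m_silent), f_target)
assert torch.equal(f_target * (1 + m_active), 2 * f_target)
```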
4. Learnable Components and Training Criteria
The SCG module is composed of convolutional kernels and bottleneck architectures whose weights are updated end-to-end. Specifically:
- $\mathcal{R}(\cdot)$: Depthwise and pointwise convolutions for initial feature refinement.
- $\mathrm{Conv}_{1\times 1}$: Parametrized layers generating the spatial and channel gates.
- $\mathcal{B}(\cdot)$: A lightweight bottleneck, typically Conv1×1 → ReLU → Conv1×1.
All SCG parameters are optimized jointly under the full detection loss, which comprises classification cross-entropy and Wise-IoU for box regression. Stochastic Gradient Descent (SGD) is the optimizer of choice during end-to-end training (Gu et al., 20 Dec 2025).
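Because the SCG gates are ordinary learnable layers inside the detector, joint end-to-end optimization needs no special handling. A minimal sketch (the stand-in model and the SGD hyperparameter values below are assumptions, not reported settings):

```python
import torch
import torch.nn as nn

# Stand-in detector: any nn.Module tree containing SCG submodules behaves the
# same way, since .parameters() recursively collects every child's weights,
# so the gates and bottlenecks are updated jointly with backbone and head.
detector = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.SiLU(),
                         nn.Conv2d(16, 2, kernel_size=1))

optimizer = torch.optim.SGD(detector.parameters(), lr=0.01,
                            momentum=0.937, weight_decay=5e-4)
```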
5. Algorithmic Forward Pass
The following pseudocode encapsulates the bidirectional flow of the SCG module, with operations performed in both modality directions:
```
function SCG(F_rgb_in, F_ir_in):
    # 1) Intra-modal refinement
    F_rgb_r = RefineBlock(F_rgb_in)
    F_ir_r  = RefineBlock(F_ir_in)

    # 2) IR→RGB spatial gating
    M_ir2rgb = sigmoid(Conv1x1_spatial(F_ir_r))
    F_spat   = F_rgb_r * (1 + M_ir2rgb)

    # 3) IR→RGB channel gating
    G_ir2rgb = ProjectBottleneck(F_ir_r)
    s_ir2rgb = sigmoid(Conv1x1_channel(G_ir2rgb))
    F_chan   = s_ir2rgb * G_ir2rgb

    # 4) Residual fusion + normalization
    Out_rgb = BatchNorm(F_rgb_in) + F_spat + F_chan

    # Repeat steps 2–4 with roles swapped to get Out_ir
    Out_ir = symmetric_flow(F_ir_in, F_rgb_in)
    return Out_rgb, Out_ir
```
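The pseudocode translates directly into a runnable PyTorch sketch, reusing the RefineBlock sketched in Section 2. The bottleneck reduction ratio and the global average pooling that makes the channel gate channel-wise are assumptions layered on top of the pseudocode:

```python
import torch
import torch.nn as nn

class CrossGate(nn.Module):
    """One gating direction (guide -> target); the reverse direction is an
    independent instance with its own parameters, giving the symmetry."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.spatial = nn.Conv2d(channels, 1, kernel_size=1)   # spatial mask M
        self.project = nn.Sequential(                          # lightweight bottleneck B
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1))
        self.channel = nn.Conv2d(channels, channels, kernel_size=1)  # channel-gate logits
        self.pool = nn.AdaptiveAvgPool2d(1)  # assumed: pooling makes the gate channel-wise

    def forward(self, target_r, guide_r):
        m = torch.sigmoid(self.spatial(guide_r))         # (B,1,H,W) spatial mask
        f_spat = target_r * (1.0 + m)                    # "1+M": never fully suppresses
        g = self.project(guide_r)                        # guidance feature G
        s = torch.sigmoid(self.channel(self.pool(g)))    # (B,C,1,1) channel-wise gate
        return f_spat + s * g                            # F_spat + F_chan

class SCG(nn.Module):
    """Symmetrical Cross-Gating: both gating directions plus residual fusion."""
    def __init__(self, channels: int):
        super().__init__()
        self.refine_rgb, self.refine_ir = RefineBlock(channels), RefineBlock(channels)
        self.ir_to_rgb, self.rgb_to_ir = CrossGate(channels), CrossGate(channels)
        self.bn_rgb, self.bn_ir = nn.BatchNorm2d(channels), nn.BatchNorm2d(channels)

    def forward(self, f_rgb, f_ir):
        r_rgb, r_ir = self.refine_rgb(f_rgb), self.refine_ir(f_ir)
        out_rgb = self.bn_rgb(f_rgb) + self.ir_to_rgb(r_rgb, r_ir)  # residual fusion
        out_ir = self.bn_ir(f_ir) + self.rgb_to_ir(r_ir, r_rgb)
        return out_rgb, out_ir

# Shape check at one pyramid level (e.g., 256 channels at 40x40):
scg = SCG(256)
rgb, ir = torch.randn(2, 256, 40, 40), torch.randn(2, 256, 40, 40)
out_rgb, out_ir = scg(rgb, ir)
assert out_rgb.shape == rgb.shape and out_ir.shape == ir.shape
```

Instantiating two independent CrossGate modules, rather than sharing weights, reflects the statement above that each direction has its own gating parameters and residual pathway.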
6. Quantitative Performance and Effect of Ablation
Ablation studies illustrate the impact of incorporating SCG into two-stream detectors:
| Configuration | VEDAI mAP50 | DroneVehicle mAP50 |
|---|---|---|
| Baseline YOLOv8 (no SCG) | 74.1% | 80.1% |
| Baseline + SCG only | 76.6% | 80.8% |
SCG confers a +2.5% improvement on VEDAI and +0.7% on DroneVehicle over the two-stream YOLOv8 baseline, signifying effective cross-modal noise suppression and better preservation of semantic integrity. The results suggest that SCG's dual gating and explicit residual path are instrumental in reducing false positives stemming from clutter and in enhancing feature discriminability (Gu et al., 20 Dec 2025).
7. Comparison with Prior Fusion Strategies
Conventional feature fusion via additive or concatenation operations lacks the ability to modulate or suppress noise and leads to irreversible semantic blending between modalities. Likewise, prior attention-based mechanisms (e.g., GAFF, CMAFF) typically generate single-stream attention maps but neither ensure retention of modality-pure features through residual gating nor implement strictly bidirectional symmetry. SCG is distinctive in its explicit modeling of spatial and channel gating in both directions; the mechanism features learned gates per modality and a residual skip, maintaining single-modality semantics until gated fusion occurs.
This architectural choice not only ensures improved interpretability and control over cross-modal influence but is also empirically linked to state-of-the-art results on the DroneVehicle and VEDAI datasets. The design thus addresses both the preservation of semantic clarity and the suppression of destructive interference beyond what is achieved by earlier fusion schemes (Gu et al., 20 Dec 2025).