PFMG: Pyramidal Feature-aware Multimodal Gating
- The paper demonstrates that PFMG effectively integrates spatial priors to preserve fine details while mitigating cross-modal noise in object detection networks.
- PFMG leverages concatenation, convolution, and gating mechanisms to adaptively fuse multi-scale features from RGB and IR modalities.
- Ablation studies reveal that incorporating PFMG yields substantial improvements in small-object detection, achieving up to +8% mAP over baseline.
The Pyramidal Feature-aware Multimodal Gating (PFMG) module is a detail-preserving hierarchical fusion mechanism for multimodal object detection networks, designed to mitigate cross-modal noise and restore the integrity of feature pyramids in dual-stream backbones. Originally proposed as a core component of the PACGNet architecture for aerial detection tasks, PFMG leverages spatial guidance from higher-resolution feature maps and adaptively fuses current-level RGB and IR features to address limitations of naïve fusion strategies (Gu et al., 20 Dec 2025).
1. Architectural Role and Integration
PFMG operates within a dual-stream backbone—specifically, a YOLOv8-based network receiving RGB and IR inputs. The backbone produces multi-scale feature pyramid levels $\{P_i\}$; after each level except the highest, features are refined by a Symmetrical Cross-Gating (SCG) module. At every level $i$ that has a higher-resolution predecessor, a PFMG module executes hierarchical multimodal fusion before passing outputs to the neck (PAN). Key PFMG inputs at level $i$ are:
- SCG-refined features at the current level: $F_{\text{rgb}}^{i}, F_{\text{ir}}^{i} \in \mathbb{R}^{C \times H \times W}$
- SCG-refined features at the previous (higher-resolution) level: $F_{\text{rgb}}^{i-1}, F_{\text{ir}}^{i-1} \in \mathbb{R}^{C \times 2H \times 2W}$

PFMG fuses these modalities into a single output $F_{\text{out}}^{i} \in \mathbb{R}^{C \times H \times W}$, reconstructing a feature pyramid that retains small-object cues while suppressing background and cross-modal artifacts.
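To make the input/output contract concrete, here is a minimal shape-level sketch in plain Python. The level resolutions and channel count are illustrative choices, not values from the paper; the only constraint encoded is the one PFMG relies on, namely that the previous level has twice the spatial resolution of the current one.

```python
def pfmg_shapes(prev_level, curr_level):
    """Shape-level sketch of PFMG I/O: given per-modality (C, H, W) shapes at
    the previous (finer) and current pyramid levels, return the fused output
    shape. Illustrative only -- no real tensors or learned layers involved."""
    c_prev, h_prev, w_prev = prev_level
    c_curr, h_curr, w_curr = curr_level
    # The stride-2 gate branch downsamples the finer level onto the current grid.
    assert (h_prev, w_prev) == (2 * h_curr, 2 * w_curr), "finer level must be 2x resolution"
    # Gated fusion and additive injection both preserve the current-level shape.
    return (c_curr, h_curr, w_curr)

# Hypothetical sizes for two adjacent pyramid levels:
print(pfmg_shapes(prev_level=(256, 80, 80), curr_level=(256, 40, 40)))  # (256, 40, 40)
```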
2. Hierarchical Gating Mechanism: Mathematical Formulation
The PFMG fusion process is a four-stage operation:

- Hierarchical Spatial Gate Construction. At level $i$, concatenate the previous level's outputs,

$$G^{i-1} = \mathrm{Concat}\left(F_{\text{rgb}}^{i-1}, F_{\text{ir}}^{i-1}\right) \in \mathbb{R}^{2C \times 2H \times 2W}$$

Apply

$$M^{i} = \sigma\!\left(\mathrm{Conv}_{3\times3}^{s=2}\!\left(\mathrm{ReLU}\!\left(\mathrm{BN}\!\left(\mathrm{Conv}_{3\times3}\left(G^{i-1}\right)\right)\right)\right)\right) \in \mathbb{R}^{C \times H \times W}$$

with sigmoid $\sigma$, producing a fine-detail spatial gate.

- Modality Interaction and Mixing. Concatenate current SCG outputs and mix:

$$F_{\text{mix}} = \mathrm{ReLU}\!\left(\mathrm{BN}\!\left(\mathrm{Conv}_{1\times1}\!\left(\mathrm{Concat}\left(F_{\text{rgb}}^{i}, F_{\text{ir}}^{i}\right)\right)\right)\right) \in \mathbb{R}^{2C \times H \times W}$$

Split channel-wise to $F'_{\text{rgb}}, F'_{\text{ir}} \in \mathbb{R}^{C \times H \times W}$.

- Adaptive Fusion Weight Generation. Concatenate the interacted features and pass through a $1\times1$ conv with softmax:

$$\left[w_{\text{rgb}}; w_{\text{ir}}\right] = \mathrm{Softmax}\!\left(\mathrm{Conv}_{1\times1}\!\left(\mathrm{Concat}\left(F'_{\text{rgb}}, F'_{\text{ir}}\right)\right)\right) \in \mathbb{R}^{2 \times H \times W}$$

Produce weight maps $w_{\text{rgb}}, w_{\text{ir}} \in \mathbb{R}^{1 \times H \times W}$, constrained such that $w_{\text{rgb}} + w_{\text{ir}} = 1$ at every spatial location.

- Gated Fusion and Hierarchy Reconstruction. Compute weighted fusion:

$$F_{\text{fused}} = \mathrm{BN}\!\left(w_{\text{rgb}} \odot F'_{\text{rgb}} + w_{\text{ir}} \odot F'_{\text{ir}}\right)$$

Final output:

$$F_{\text{out}}^{i} = F_{\text{fused}} + M^{i}$$

This additive injection of the prior $M^{i}$ ensures detail preservation at each pyramid level.
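The weight-generation and fusion stages can be checked at a single spatial location with a toy NumPy calculation. All numeric values below (logits, feature vectors, gate pre-activations) are fabricated purely for illustration; the point is the constraint $w_{\text{rgb}} + w_{\text{ir}} = 1$ and the additive gate injection.

```python
import numpy as np

# Stage 3 at one pixel: two fusion-weight logits (one per modality) from the
# 1x1 conv, softmax-normalized so the weights sum to 1.
w_logits = np.array([1.2, -0.4])                 # [RGB, IR] logits (made up)
w = np.exp(w_logits) / np.exp(w_logits).sum()    # softmax over the modality axis
assert np.isclose(w.sum(), 1.0)                  # w_rgb + w_ir = 1 per pixel

# Interacted per-pixel feature vectors F'_rgb, F'_ir (C = 4, values made up):
f_rgb = np.array([0.9, 0.1, 0.5, 0.3])
f_ir  = np.array([0.2, 0.8, 0.4, 0.6])
fused = w[0] * f_rgb + w[1] * f_ir               # stage 4: gated fusion (pre-BN)

# Additive injection of the spatial gate M (a sigmoid output, values in (0,1)):
m = 1.0 / (1.0 + np.exp(-np.array([0.5, -1.0, 2.0, 0.0])))
out = fused + m
print(out.shape)  # (4,)
```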
3. Layer Dimensions and Workflow
The following table summarizes layerwise dimensions and operations for PFMG at level $i$ (per-modality current-level features are $[C, H, W]$; previous-level features are $[C, 2H, 2W]$):

| Block | Input Shape(s) | Output Shape(s) |
|---|---|---|
| Hierarchical Spatial Gate | $F_{\text{rgb}}^{i-1}, F_{\text{ir}}^{i-1}$: $[C, 2H, 2W]$ each | $M^{i}$: $[C, H, W]$ |
| Modality Interaction | $F_{\text{rgb}}^{i}, F_{\text{ir}}^{i}$: $[C, H, W]$ each | $[2C, H, W]$ (split to $[C, H, W]$ each) |
| Fusion Weights | $F'_{\text{rgb}}, F'_{\text{ir}}$: $[C, H, W]$ each | $w_{\text{rgb}}, w_{\text{ir}}$: $[1, H, W]$ each |
| Gated Fusion + BN | $F'_{\text{rgb}}, F'_{\text{ir}}$ with $w_{\text{rgb}}, w_{\text{ir}}$ | $F_{\text{fused}}$: $[C, H, W]$ |
| Additive Injection | $F_{\text{fused}}$ (+) $M^{i}$ | $F_{\text{out}}^{i}$: $[C, H, W]$ |

For each pyramid level $i$, the corresponding output $F_{\text{out}}^{i}$ is provided to the neck’s inputs for downstream detection.
4. Implementation Pseudocode
A concise pseudocode representation encapsulates the four stages described above:
```python
def PFMG(Frgb_i, Fir_i, Frgb_prev, Fir_prev):
    # Hierarchical Spatial Gate
    guidance = concat(Frgb_prev, Fir_prev, dim=channel)   # [2C, 2H, 2W]
    x = Conv3x3(guidance, out_channels=C, stride=1)
    x = ReLU(BN(x))
    x = Conv3x3(x, out_channels=C, stride=2)              # to [C, H, W]
    M = Sigmoid(x)                                        # [C, H, W]

    # Modality Interaction
    mix = Conv1x1(concat(Frgb_i, Fir_i, dim=channel), out_channels=2 * C)
    mix = ReLU(BN(mix))
    Frgb_prime, Fir_prime = split_channels(mix, C)        # [C, H, W] each

    # Fusion Weights
    w_logits = Conv1x1(concat(Frgb_prime, Fir_prime), out_channels=2)  # [2, H, W]
    w_soft = Softmax(dim=0)(w_logits)
    w_rgb, w_ir = w_soft[0:1, :, :], w_soft[1:2, :, :]    # [1, H, W] each

    # Gated Fusion and Hierarchical Prior Injection
    fused = w_rgb * Frgb_prime + w_ir * Fir_prime         # [C, H, W]
    fused = BN(fused)
    out = fused + M                                       # [C, H, W]
    return out
```
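The pseudocode translates directly into a small PyTorch module. The sketch below is an illustrative reimplementation under stated assumptions—equal channel width `C` at both levels, standard `Conv2d`/`BatchNorm2d` layers, and batched `[B, C, H, W]` tensors—not the authors' released code.

```python
import torch
import torch.nn as nn

class PFMG(nn.Module):
    """Illustrative PFMG sketch: hierarchical spatial gate plus adaptive
    softmax-weighted RGB/IR fusion (layer hyperparameters are assumptions)."""
    def __init__(self, c: int):
        super().__init__()
        # Stage 1: gate branch on the concatenated finer-level features.
        self.gate = nn.Sequential(
            nn.Conv2d(2 * c, c, 3, stride=1, padding=1),
            nn.BatchNorm2d(c), nn.ReLU(inplace=True),
            nn.Conv2d(c, c, 3, stride=2, padding=1),  # downsample to current grid
            nn.Sigmoid(),
        )
        # Stage 2: 1x1 modality-interaction mixing.
        self.mix = nn.Sequential(
            nn.Conv2d(2 * c, 2 * c, 1),
            nn.BatchNorm2d(2 * c), nn.ReLU(inplace=True),
        )
        # Stage 3: per-pixel fusion-weight logits, one channel per modality.
        self.weight = nn.Conv2d(2 * c, 2, 1)
        self.bn = nn.BatchNorm2d(c)
        self.c = c

    def forward(self, rgb, ir, rgb_prev, ir_prev):
        m = self.gate(torch.cat([rgb_prev, ir_prev], dim=1))        # [B, C, H, W]
        mix = self.mix(torch.cat([rgb, ir], dim=1))                 # [B, 2C, H, W]
        rgb_p, ir_p = mix.split(self.c, dim=1)
        w = torch.softmax(self.weight(torch.cat([rgb_p, ir_p], dim=1)), dim=1)
        fused = self.bn(w[:, 0:1] * rgb_p + w[:, 1:2] * ir_p)       # stage 4
        return fused + m                                            # additive injection

# Smoke test with hypothetical sizes (C=32, current grid 20x20, finer grid 40x40):
pfmg = PFMG(32).eval()
cur = [torch.randn(1, 32, 20, 20) for _ in range(2)]
prev = [torch.randn(1, 32, 40, 40) for _ in range(2)]
out = pfmg(*cur, *prev)
print(out.shape)  # torch.Size([1, 32, 20, 20])
```

Note the stride-2 convolution in the gate branch, which maps the finer level's `2H x 2W` grid onto the current level's `H x W` grid so the additive injection is shape-compatible.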
5. Preserving Fine-Grained Structure and Noise Attenuation
PFMG’s hierarchical gating leverages spatial priors from the immediately finer pyramid level, promoting the propagation of high-frequency, small-object details such as those encountered in aerial vehicle detection. Adaptive per-pixel fusion weights enable dynamic suppression of cross-modal noise, effectively gating out uninformative regions of each stream. The additive injection of the spatial gate $M^{i}$ reinforces fine detail without overwriting semantic context, as demonstrated in ablation heatmaps showing focused target activation and diminished background response.
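The per-pixel suppression behavior can be illustrated with a toy NumPy example. The weight-logit maps below are fabricated for illustration: imagine the IR stream is uninformative (e.g., thermally flat background) in the left column of a 2x2 grid but informative in the right column; the softmax over the modality axis then downweights IR on the left and upweights it on the right.

```python
import numpy as np

def softmax(x, axis=0):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Fabricated [modality, H, W] weight-logit maps for a 2x2 spatial grid:
logits = np.array([
    [[ 2.0, -1.0],    # RGB logits: strong on the left, weak on the right
     [ 2.0, -1.0]],
    [[-2.0,  1.0],    # IR logits: weak (noisy) left, strong right
     [-2.0,  1.0]],
])
w = softmax(logits, axis=0)          # per-pixel softmax over the modality axis
w_rgb, w_ir = w[0], w[1]

# Left column: RGB dominates; right column: IR dominates.
assert (w_rgb[:, 0] > 0.9).all() and (w_ir[:, 1] > 0.8).all()
print(np.round(w_rgb, 3))
```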
6. Quantitative Ablation and Performance Impact
Ablation studies on the VEDAI and DroneVehicle datasets (see Table 4 in (Gu et al., 20 Dec 2025)) report the following mAP50 scores:
| Model Variant | VEDAI mAP50 | DroneVehicle mAP50 |
|---|---|---|
| Baseline | 74.1% | 80.1% |
| +PFMG only | 76.7% | 80.7% |
| +SCG only | 76.6% | 80.8% |
| SCG+PFMG (PACGNet) | 82.1% | 81.7% |
PFMG alone yields significant improvements in small-object benchmarks, notably a +2.6% gain on VEDAI, attributed to enhanced detail preservation. In conjunction with SCG, PFMG drives synergistic gains, with PACGNet achieving state-of-the-art results: +8.0% over baseline for VEDAI and +1.6% for DroneVehicle. A plausible implication is that progressive hierarchical gating, when coupled to modality-aware cross-gating, sets a robust paradigm for fine-grained multimodal fusion in pyramidal detection backbones.
7. Context and Significance
The PFMG module addresses two primary flaws inherent in previous multimodal fusion schemes: susceptibility to cross-modal noise and disruption of feature pyramid fidelity. By reconstructing fused representations at multiple scales via additive hierarchical spatial priors and adaptive cross-modal weights, PFMG is central to the performance of PACGNet on small-object detection in aerial imagery. This approach sets a precedent for leveraging hierarchical guidance in multimodal vision networks, with empirical mAP improvements substantiating its role in state-of-the-art detection pipelines (Gu et al., 20 Dec 2025).