Global & Local Aggregate FPN

Updated 28 November 2025
  • The paper introduces GLAFPN, a network that integrates global and local feature extraction across pyramid levels to boost small object detection in remote sensing imagery.
  • It employs bidirectional top-down and bottom-up pathways with a dedicated GLFF module to refine multi-scale features and improve cross-scale information flow.
  • Empirical results on VisDrone2019 demonstrate a +3.8% mAP₅₀ gain with reduced parameters, affirming GLAFPN's efficacy and efficiency.

The Global and Local Aggregate Feature Pyramid Network (GLAFPN) is a multi-scale feature aggregation architecture developed to enhance detection accuracy, especially for small objects in remote sensing imagery. Originally introduced as the neck of the DMG-YOLO detector, GLAFPN systematically fuses both global context and local spatial detail across hierarchical pyramid levels, while preserving fine features critical for small-object localization. The core innovation is the integration of bidirectional (top-down and bottom-up) pathways and a dedicated global-local fusion module at each pyramid stage, resulting in improved representational capacity and efficient cross-scale information flow (Wang et al., 21 Nov 2025).

1. Architectural Role and High-Level Overview

GLAFPN is inserted between the backbone and the detection head of object detection frameworks. It ingests a set of hierarchical backbone feature maps, specifically P_5 (coarsest), P_4 (intermediate), and P_3 (finest), and outputs refined feature maps (P'_3, P'_4, P'_5) aligned with small, medium, and large object detection, respectively. Its principal roles are:

  • Preservation of fine spatial details by adding a shallow high-resolution pathway ("small-object layer") in the top-down branch.
  • Enrichment of cross-scale feature propagation through both top-down and bottom-up paths, advancing information flow beyond conventional unidirectional FPNs.
  • Explicit fusion of global context and local detail via the Global-Local Feature Fusion (GLFF) module at each pyramid node.

The architecture routes feature maps through a refinement sequence:

  • Top-down stream: Upsamples and aggregates P_5 → P_4 → P_3.
  • Bottom-up stream: Downsamples and fuses P'_3 → P'_4 → P'_5. At each fusion node, the GLFF module performs global-local attention-based refinement, enhancing both contextual reasoning and precise localization.

2. Component Breakdown

Data Flow

The simplified dataflow is:

  • Each backbone output (P_5, P_4, P_3) is first refined by a GLFF module.
  • In the top-down path: P_5 is upsampled and fused with P_4; the result is upsampled again and fused with P_3 to yield a high-resolution output for small-object detection.
  • In the bottom-up path: starting from the refined high-resolution output, features are sequentially downsampled and fused with the earlier stages, generating the medium- and large-scale outputs (P'_4, P'_5).

GLFF Module

GLFF operates on each single-scale map as follows (a code sketch follows the list):

  1. 1×1 Convolution: Projects the input F_i of shape H_i×W_i×C_i to X ∈ R^{H_i×W_i×C_out}.
  2. Channel Split: Divides X into X_short and X_process, each of shape H_i×W_i×(C_out/2).
  3. Global-Local Spatial Attention (GLSA) Blocks: Applies two serial GLSA blocks to X_process.
  4. Channel Concatenation: Merges X_short with the GLSA output, restoring the full channel dimension.
  5. Final 1×1 Convolution: Fuses and outputs the refined feature.
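
A minimal PyTorch sketch of this five-step pipeline, assuming hypothetical module and channel names; the GLSA block it reuses is sketched after the attention equations in Section 3, and the paper's exact layer configuration may differ:

import torch
import torch.nn as nn

class GLFF(nn.Module):
    """Global-Local Feature Fusion: 1x1 projection -> channel split ->
    two serial GLSA blocks on one half -> concat -> 1x1 fusion."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.proj = nn.Conv2d(c_in, c_out, kernel_size=1)              # step 1
        self.glsa = nn.Sequential(GLSA(c_out // 2), GLSA(c_out // 2))  # step 3
        self.fuse = nn.Conv2d(c_out, c_out, kernel_size=1)             # step 5

    def forward(self, x):
        x = self.proj(x)                          # (B, C_out, H, W)
        x_short, x_proc = x.chunk(2, dim=1)       # step 2: split channels
        x_proc = self.glsa(x_proc)                # step 3: global-local attention
        x = torch.cat([x_short, x_proc], dim=1)   # step 4: restore channels
        return self.fuse(x)                       # step 5: final 1x1 fusion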

GLSA Module

GLSA splits its input into two branches:

  • Global Spatial Attention (GSA): Computes global context via spatial self-attention with softmax-weighted aggregation and MLP-based projection.
  • Local Spatial Attention (LSA): Applies channel-mixed, depthwise 3×3 convolutions followed by a sigmoid activation, emphasizing local detail.

3. Mathematical Formulation

The operations within GLAFPN are formalized as:

  • Channel Split:

F_{i,1},\, F_{i,2} = \mathrm{Split}(F_i), \quad F_{i,1}, F_{i,2} \in \mathbb{R}^{H_i \times W_i \times \frac{C_i}{2}}

  • GLFF Fusion:

F_i' = \mathrm{Conv}_{1\times1}\Big(\mathrm{Concat}\big(\mathrm{GSA}(F_{i,1}),\, \mathrm{LSA}(F_{i,2})\big)\Big) \in \mathbb{R}^{H_i \times W_i \times C_\text{out}}

  • Global Spatial Attention (GSA):

Q = \mathrm{Conv}_{1\times1}(F_1), \quad K = \mathrm{Conv}_{1\times1}(F_1)

\mathrm{Att}_G = \mathrm{Softmax}\big(\mathrm{Transpose}(Q)\,K\big) \in \mathbb{R}^{HW \times HW}

Z = \mathrm{Att}_G\,\mathrm{reshape}(F_1) \in \mathbb{R}^{HW \times (C'/2)}

\mathrm{GSA}(F_1) = \mathrm{MLP}(Z) + F_1

  • Local Spatial Attention (LSA):

U = \mathrm{DWConv}_{3\times3}\big(\mathrm{Conv}_{1\times1}(F_2) + F_2\big)

\mathrm{Att}_L = \sigma(U)

\mathrm{LSA}(F_2) = \mathrm{Att}_L \odot F_2 + F_2
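
A PyTorch sketch of these equations, with shapes annotated. The full HW×HW attention matrix is written out literally for clarity, whereas the paper emphasizes lightweight formulations; the internal channel split in GLSA is an assumption inferred from the split-and-concat structure above:

import torch
import torch.nn as nn

class GSA(nn.Module):
    """Global Spatial Attention: softmax self-attention over all HW positions."""
    def __init__(self, c):
        super().__init__()
        self.q = nn.Conv2d(c, c, kernel_size=1)
        self.k = nn.Conv2d(c, c, kernel_size=1)
        self.mlp = nn.Sequential(nn.Linear(c, c), nn.GELU(), nn.Linear(c, c))

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.q(x).flatten(2)                             # (B, C, HW)
        k = self.k(x).flatten(2)                             # (B, C, HW)
        att = torch.softmax(q.transpose(1, 2) @ k, dim=-1)   # Att_G: (B, HW, HW)
        z = att @ x.flatten(2).transpose(1, 2)               # Z: (B, HW, C)
        z = self.mlp(z).transpose(1, 2).reshape(b, c, h, w)  # MLP(Z)
        return z + x                                         # residual

class LSA(nn.Module):
    """Local Spatial Attention: pointwise mix -> depthwise 3x3 -> sigmoid gate."""
    def __init__(self, c):
        super().__init__()
        self.pw = nn.Conv2d(c, c, kernel_size=1)
        self.dw = nn.Conv2d(c, c, kernel_size=3, padding=1, groups=c)  # depthwise

    def forward(self, x):
        att = torch.sigmoid(self.dw(self.pw(x) + x))  # Att_L = sigma(U)
        return att * x + x                            # gated residual

class GLSA(nn.Module):
    """Split channels, run GSA/LSA in parallel, fuse with a 1x1 convolution.
    Assumes an even channel count."""
    def __init__(self, c):
        super().__init__()
        self.gsa, self.lsa = GSA(c // 2), LSA(c // 2)
        self.fuse = nn.Conv2d(c, c, kernel_size=1)

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=1)
        return self.fuse(torch.cat([self.gsa(x1), self.lsa(x2)], dim=1))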

  • Cross-Scale Fusion:

P_\text{fused} = \mathrm{Conv}_{1\times1}\big(\mathrm{Concat}(P_\text{top},\, P_\text{side})\big)
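
Each fusion node therefore reduces to a channel concatenation followed by a 1×1 projection. A minimal sketch, assuming the top-path feature has already been resampled to the lateral feature's resolution:

import torch
import torch.nn as nn

def cross_scale_fuse(p_top, p_side, mix: nn.Conv2d):
    """Concatenate the resampled top-path feature with the lateral feature
    along channels, then mix them with a 1x1 convolution."""
    assert p_top.shape[-2:] == p_side.shape[-2:], "resample p_top first"
    return mix(torch.cat([p_top, p_side], dim=1))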

4. Forward Pass Pseudocode Outline

A succinct pseudocode capturing the GLAFPN computation is:

# Backbone features (strides 32 / 16 / 8)
P5 = backbone_output(level=5)
P4 = backbone_output(level=4)
P3 = backbone_output(level=3)

# Per-level GLFF refinement
R5 = GLFF(P5)
R4 = GLFF(P4)
R3 = GLFF(P3)

# Top-down path (coarse -> fine)
U54 = Upsample(R5)               # level 5 -> level 4 resolution
TD4 = Conv1x1(Concat(U54, R4))
TD4 = GLFF(TD4)

U43 = Upsample(TD4)              # level 4 -> level 3 resolution
TD3 = Conv1x1(Concat(U43, R3))
P3_out = GLFF(TD3)               # high-resolution output for small objects

# Bottom-up path (fine -> coarse)
D34 = Downsample(P3_out)         # level 3 -> level 4 resolution
BU4 = Conv1x1(Concat(D34, TD4))
BU4 = GLFF(BU4)

D45 = Downsample(BU4)            # level 4 -> level 5 resolution
P5_out = Conv1x1(Concat(D45, R5))
P5_out = GLFF(P5_out)

P4_out = BU4

return P3_out, P4_out, P5_out
(Wang et al., 21 Nov 2025)
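
For completeness, a runnable PyTorch mirror of the pseudocode above, reusing the GLFF sketch from Section 2. The shared width c, nearest-neighbor upsampling, and strided-convolution downsampling are illustrative choices rather than details taken from the paper:

import torch
import torch.nn as nn
import torch.nn.functional as F

class GLAFPN(nn.Module):
    """Bidirectional neck: per-level GLFF refinement, then top-down and
    bottom-up fusion with GLFF at every node."""
    def __init__(self, c3, c4, c5, c=128):
        super().__init__()
        self.r3, self.r4, self.r5 = GLFF(c3, c), GLFF(c4, c), GLFF(c5, c)
        self.mix_td4 = nn.Conv2d(2 * c, c, kernel_size=1)
        self.mix_td3 = nn.Conv2d(2 * c, c, kernel_size=1)
        self.mix_bu4 = nn.Conv2d(2 * c, c, kernel_size=1)
        self.mix_bu5 = nn.Conv2d(2 * c, c, kernel_size=1)
        self.glff_td4, self.glff_td3 = GLFF(c, c), GLFF(c, c)
        self.glff_bu4, self.glff_bu5 = GLFF(c, c), GLFF(c, c)
        self.down34 = nn.Conv2d(c, c, kernel_size=3, stride=2, padding=1)
        self.down45 = nn.Conv2d(c, c, kernel_size=3, stride=2, padding=1)

    def forward(self, p3, p4, p5):
        r3, r4, r5 = self.r3(p3), self.r4(p4), self.r5(p5)
        # Top-down path (coarse -> fine)
        td4 = self.glff_td4(self.mix_td4(
            torch.cat([F.interpolate(r5, scale_factor=2.0), r4], dim=1)))
        p3_out = self.glff_td3(self.mix_td3(
            torch.cat([F.interpolate(td4, scale_factor=2.0), r3], dim=1)))
        # Bottom-up path (fine -> coarse)
        bu4 = self.glff_bu4(self.mix_bu4(
            torch.cat([self.down34(p3_out), td4], dim=1)))
        p5_out = self.glff_bu5(self.mix_bu5(
            torch.cat([self.down45(bu4), r5], dim=1)))
        return p3_out, bu4, p5_out  # small / medium / large heads

# Quick shape check with a stride-8/16/32 pyramid for a 640x640 input
# (channel counts are illustrative):
neck = GLAFPN(c3=64, c4=128, c5=256, c=128)
p3_out, p4_out, p5_out = neck(torch.randn(1, 64, 80, 80),
                              torch.randn(1, 128, 40, 40),
                              torch.randn(1, 256, 20, 20))
# -> (1, 128, 80, 80), (1, 128, 40, 40), (1, 128, 20, 20)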

5. Design Rationale and Innovations

The motivations for GLAFPN's design include:

  • Small-object sensitivity: By preserving and processing the highest-resolution feature map through a dedicated path, GLAFPN improves the retention and usage of spatial signals essential for detecting objects occupying few pixels, a common trait in remote sensing imagery.
  • Balanced context-localization blend: The GLFF modules systematically combine transformer-style global spatial self-attention with convolution-based local spatial attention, ensuring each output map encodes both scene-wide context and fine structure.
  • Bidirectional cross-scale fusion: Unlike conventional FPNs, the bidirectional path design in GLAFPN strengthens both upwards and downwards information flow, facilitating robust gradient propagation and improving multi-scale feature coherence across object sizes and aspect ratios.
  • Efficiency: Despite incorporating attention mechanisms, GLAFPN relies on lightweight formulations and convolutional operations to avoid excessive parameter growth and keep FLOPs practical.

6. Empirical Impact and Ablation Analysis

The performance influence of GLAFPN is especially pronounced in small object detection tasks within remote sensing. On the VisDrone2019 dataset, supplementing the YOLOv8-n baseline (enhanced with DFE and MFF) with GLAFPN leads to the following outcomes (Wang et al., 21 Nov 2025):

Configuration    mAP₅₀ (%)    Params (M)    GFLOPs
Baseline         35.0         3.01          7.8
+ GLAFPN         38.8         2.15          12.4

Introducing GLAFPN yields a gain of 3.8 percentage points in mAP₅₀ (35.0 → 38.8), a reduction in parameter count owing to the lightweight modules, and a modest increase in computational cost attributable to the attention operations. These results demonstrate the network's capacity to improve detection accuracy with high parameter efficiency, especially in domains where fine spatial discrimination is paramount.

7. Relation to Broader Multi-Scale Fusion Approaches

The conceptual underpinnings of GLAFPN reflect a more general trend in multi-scale feature pyramids toward fusing both global and local signals. For instance, Content-Augmented Feature Pyramid Network (CA-FPN) (Gu et al., 2021) also introduces global context extraction and spatial transformers to address locality and misalignment limitations in standard FPNs. CA-FPN achieves this using a global content module (GCEM) and lightweight approximated self-attention. A plausible implication is that GLAFPN's architecture could be generalized to blend these global transformer-like aggregators with local or deformable attention branches, constituting a family of dual-scale aggregators that further optimize the trade-off between contextual reasoning and geometric precision across scales.

By explicitly embedding both global attention and local convolution paths within the feature pyramid, GLAFPN advances the representational richness and scale-robustness of detection frameworks—particularly in scenarios dominated by small, context-sensitive targets.
