Two-Way Guidance Fusion Module (TGFM)
- The paper introduces TGFM as a role-aware, bidirectional fusion block in which low-level features guide high-level features with spatial cues and high-level features guide low-level features with channel cues.
- It integrates into GGL-Net’s multi-scale fusion stage to combine gradient-enhanced shallow and deep representations, and in the paper’s ablation it improves IoU and nIoU over direct addition and over either one-way variant.
- Mathematically, TGFM uses CBAM-style attention operations to modulate features, effectively enhancing localization and suppressing background clutter.
The Two-Way Guidance Fusion Module (TGFM) is the feature-fusion mechanism introduced in GGL-Net for infrared small target detection, where the central difficulty is to preserve extremely weak, tiny, edge-sensitive target responses from shallow layers while still exploiting the stronger semantic discrimination of deeper layers. In this design, higher-level features are treated as semantically stronger but spatially coarser, whereas lower-level features are treated as detail-rich but semantically weak. TGFM addresses that asymmetry through role-aware, bidirectional guidance: low-level features guide high-level features spatially, and high-level features guide low-level features channel-wise, so that multi-scale fusion preserves localization fidelity and suppresses clutter rather than merely mixing features indiscriminately (Zhao et al., 10 Dec 2025).
1. Architectural placement in GGL-Net
Within GGL-Net, TGFM belongs to the feature fusion stage. The full network is divided into feature extraction, local contrast learning, and feature fusion. The feature extractor is dual-branch: the main branch processes the original infrared image through five Stage modules, while the supplementary branch processes gradient magnitude images. The two branches are connected by the Gradient Supplementary Module (GSM), whose purpose is to encode raw gradient information into deeper network layers so that edge details receive stronger emphasis during feature extraction. Afterward, the network applies a local contrast learning module inherited from previous work, and TGFM then fuses the resulting multi-scale feature maps (Zhao et al., 10 Dec 2025).
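As a minimal illustration of what a gradient magnitude image looks like for a small target, the sketch below uses NumPy central differences. The source does not say which gradient operator GGL-Net uses, so `np.gradient`, the function name `gradient_magnitude`, and the toy 9×9 image are all assumptions made for illustration.

```python
import numpy as np

def gradient_magnitude(image):
    """Per-pixel gradient magnitude using central finite differences (assumed operator)."""
    gy, gx = np.gradient(image.astype(np.float64))  # derivatives along rows, then columns
    return np.sqrt(gx ** 2 + gy ** 2)

# Toy infrared frame: a 3x3 bright target on a flat background.
img = np.zeros((9, 9))
img[3:6, 3:6] = 1.0

mag = gradient_magnitude(img)
print(mag[4, 4])  # 0.0 -> the flat interior of the target has no gradient
print(mag[3, 4])  # 0.5 -> the target boundary responds
```

The response concentrates on the target boundary and vanishes in flat regions, which is the kind of edge emphasis the supplementary branch is described as supplying.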
The paper’s network figure places TGFM as a repeated multi-level mechanism rather than a single terminal block. The fusion stage contains TGFM1, TGFM2, TGFM3, and TGFM4, indicating repeated use across the hierarchy. Conceptually, each TGFM receives a pair of features from different depths, denoted $X$ for a low-level feature map and $Y$ for a high-level feature map. These features already carry the effect of the gradient branch because GSM has already injected gradient information upstream. Accordingly, TGFM does not compute gradients itself; it fuses representations that have already been enhanced by gradient-guided extraction. The paper also states that no additional supervision is attached specifically to TGFM; optimization is end-to-end under the network’s output loss, namely soft-IoU loss (Zhao et al., 10 Dec 2025).
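The soft-IoU loss named above is commonly written as one minus the ratio of soft intersection to soft union over predicted probabilities. The sketch below uses that common form. The smoothing constant `eps`, the function name `soft_iou_loss`, and the flattened-list inputs are assumptions, and the paper's exact formulation may differ.

```python
def soft_iou_loss(pred, target, eps=1e-6):
    """Soft-IoU loss over flattened per-pixel probabilities and binary labels.

    loss = 1 - sum(p * t) / (sum(p) + sum(t) - sum(p * t)), with eps for stability.
    """
    inter = sum(p * t for p, t in zip(pred, target))
    union = sum(pred) + sum(target) - inter
    return 1.0 - (inter + eps) / (union + eps)

pred = [0.9, 0.8, 0.1, 0.0]    # predicted target probabilities
target = [1.0, 1.0, 0.0, 0.0]  # ground-truth mask
loss = soft_iou_loss(pred, target)
print(round(loss, 4))  # 0.1905
```

Because the loss is computed on the network output only, every TGFM instance receives its gradient signal through this single term.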
2. Task-specific motivation and design principle
TGFM is motivated by a failure mode that is especially severe in infrared small target detection: the target may occupy only a few pixels, may have scarce intrinsic texture or shape cues, and may be easily submerged by cluttered backgrounds. Under those conditions, conventional multi-scale fusion based on simple top-down fusion, bottom-up fusion, element-wise addition, or concatenation can be inadequate because it mixes features without respecting their distinct roles. High-level features contribute stronger semantics but poorer detail perception, while low-level features contribute richer detail and localization but insufficient scene semantic understanding. The paper treats this mismatch as a primary cause of inaccurate edge positioning, missed detections, and false alarms in clutter (Zhao et al., 10 Dec 2025).
The defining principle of TGFM is therefore role-aware fusion. The module assigns spatial/detail guidance to shallow features and semantic/channel guidance to deep features. Operationally, the two directions are complementary rather than symmetric for symmetry’s own sake. The low-level → high-level path uses spatial attention, because shallow features preserve precise target position, edge contour, and local structure. The high-level → low-level path uses channel attention, because deep features better encode which semantic patterns are target-like and which resemble clutter. In the paper’s own functional interpretation, one direction answers where the target-supporting evidence is, while the other answers what is target-like in channel space (Zhao et al., 10 Dec 2025).
This division is particularly well matched to infrared small targets. Losing shallow edge/detail information harms localization and target completeness, while lacking deep semantic filtering causes bright noise, strong background edges, or clutter structures to be mistaken for targets. TGFM is thus not described as a generic pyramid block, but as a task-adapted fusion rule for a regime where semantically coarse abstraction and detail preservation must be balanced explicitly rather than implicitly (Zhao et al., 10 Dec 2025).
3. Mathematical formulation and internal operators
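The module is defined by three equations. Eq. (1) is the fusion rule; Eqs. (2) and (3) are written here in standard CBAM-style notation for the channel and spatial attention maps:

$$Z = C(Y)\otimes X + S(X)\otimes Y \tag{1}$$

$$C(Y) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(Y)) + \mathrm{MLP}(\mathrm{MaxPool}(Y))\big) \tag{2}$$

$$S(X) = \sigma\big(\mathrm{Conv}([\mathrm{AvgPool}(X);\,\mathrm{MaxPool}(X)])\big) \tag{3}$$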
In this notation, $X$ is the low-level feature map, $Y$ is the high-level feature map, and $Z$ is the fused output. The attention term $C(Y)$ is the channel attention map generated from the high-level feature, while $S(X)$ is the spatial attention map generated from the low-level feature. The operators are average pooling, max pooling, a shared multilayer perceptron in the channel branch, a convolution in the spatial branch, concatenation, and sigmoid activation. The text states that the element-wise operator is multiplication, represented here by $\otimes$ (Zhao et al., 10 Dec 2025).
Equation (1) makes the cross-guidance structure explicit. The term $C(Y)\otimes X$ means that the low-level feature is reweighted by a channel attention map extracted from the high-level feature; this is the deep-to-shallow semantic guidance path. The term $S(X)\otimes Y$ means that the high-level feature is reweighted by a spatial attention map extracted from the low-level feature; this is the shallow-to-deep localization/detail guidance path. The two guided branches are then fused additively. The module is therefore not plain concatenation followed by convolution, and it is not simple addition after upsampling. Each branch is first modulated by guidance extracted from the other branch, then summed (Zhao et al., 10 Dec 2025).
The internal operators also reveal a familiar attention decomposition. The channel branch follows a CBAM-style pattern: global average pooling and global max pooling on $Y$, a shared MLP, summation, and sigmoid. The spatial branch similarly follows a CBAM-like pattern but with the roles inverted across levels: average pooling and max pooling across channels on $X$, concatenation of the resulting spatial descriptors, a convolution, and sigmoid. This inversion is central to TGFM’s asymmetry. High-level features are not used to provide spatial masks, and low-level features are not used to provide channel selection; the module preserves the mapping between feature level and information type (Zhao et al., 10 Dec 2025).
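The fusion rule of Eq. (1) and the two attention branches described above can be sketched in NumPy for a single sample. This is an illustrative sketch, not the authors' implementation: the ReLU inside the shared MLP, the 7×7 kernel (the CBAM default), zero padding, the random weights, and all function names are assumptions, and the low-level map `x` and high-level map `y` are assumed to be already aligned to the same shape.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def channel_attention(y, w1, w2):
    """C(Y): channel weights from the high-level map y of shape (C, H, W).

    w1 (C//r, C) and w2 (C, C//r) form the shared two-layer MLP (ReLU assumed).
    """
    def shared_mlp(v):
        return w2 @ np.maximum(w1 @ v, 0.0)
    avg = y.mean(axis=(1, 2))  # global average pooling -> (C,)
    mx = y.max(axis=(1, 2))    # global max pooling -> (C,)
    return sigmoid(shared_mlp(avg) + shared_mlp(mx))[:, None, None]  # (C, 1, 1)

def spatial_attention(x, kernel):
    """S(X): spatial mask from the low-level map x of shape (C, H, W).

    kernel has shape (2, k, k) with odd k; zero padding keeps the H x W size.
    """
    desc = np.stack([x.mean(axis=0), x.max(axis=0)])  # channel-wise pooling -> (2, H, W)
    k = kernel.shape[-1]
    pad = k // 2
    padded = np.pad(desc, ((0, 0), (pad, pad), (pad, pad)))
    h, w = x.shape[1:]
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[:, i:i + k, j:j + k] * kernel)
    return sigmoid(out)[None]  # (1, H, W)

def tgfm(x, y, w1, w2, kernel):
    """Eq. (1): Z = C(Y) * X + S(X) * Y, relying on broadcasting."""
    return channel_attention(y, w1, w2) * x + spatial_attention(x, kernel) * y

rng = np.random.default_rng(0)
C, H, W, r, k = 8, 16, 16, 4, 7  # hypothetical sizes
x = rng.standard_normal((C, H, W))  # low-level feature
y = rng.standard_normal((C, H, W))  # high-level feature
w1 = rng.standard_normal((C // r, C)) * 0.1
w2 = rng.standard_normal((C, C // r)) * 0.1
kernel = rng.standard_normal((2, k, k)) * 0.1
z = tgfm(x, y, w1, w2, kernel)
print(z.shape)  # (8, 16, 16)
```

With all weights set to zero, both attention maps become 0.5 everywhere and the output reduces to a scaled direct addition, which makes the relationship to the unguided baseline easy to see.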
Because Eq. (1) requires compatible tensor sizes for element-wise multiplication and addition, some resolution alignment is implicitly necessary. This suggests that the high-level feature is resized to match the low-level feature resolution, or that both are transformed to a common resolution before fusion. The paper, however, does not explicitly formulate the exact upsampling, downsampling, or channel-alignment operator, so any exact resizing statement remains an inference rather than an explicit architectural specification (Zhao et al., 10 Dec 2025).
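One way such alignment could be done is shown below, purely as an assumption: a 1×1 channel projection followed by nearest-neighbor upsampling of the high-level map. The paper does not specify either operator, and the shapes, weights, and function names here are hypothetical.

```python
import numpy as np

def upsample_nearest(y, factor):
    """Repeat each pixel `factor` times along height and width; y has shape (C, H, W)."""
    return y.repeat(factor, axis=1).repeat(factor, axis=2)

def project_channels(y, weight):
    """1x1 convolution written as a per-pixel linear map; weight has shape (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, y)

x = np.ones((16, 32, 32))      # low-level map: fewer channels, higher resolution
y = np.ones((32, 16, 16))      # high-level map: more channels, lower resolution
w = np.full((16, 32), 1 / 32)  # hypothetical projection weights (channel averaging)

y_aligned = upsample_nearest(project_channels(y, w), 2)
print(y_aligned.shape == x.shape)  # True
```

After this step the element-wise products and the sum in Eq. (1) are well defined; any other resizing or projection that produces matching shapes would serve the same purpose.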
4. Repeated multi-scale fusion and empirical behavior
TGFM is applied repeatedly across the feature hierarchy rather than once at a single resolution. The repeated blocks labeled TGFM1–TGFM4 indicate that GGL-Net uses the same two-way guidance principle at multiple scales. This repeated deployment is consistent with the paper’s broader claim that infrared small target detection benefits from multi-level fusion of gradient-enhanced shallow and deep representations, rather than reliance on a single fusion point (Zhao et al., 10 Dec 2025).
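The sketch below shows one hypothetical way to apply a pairwise fusion block repeatedly across a hierarchy: a top-down cascade in which each fused output becomes the high-level input of the next fusion. The source does not spell out how TGFM1 to TGFM4 are wired, so the cascade order, the `fuse` and `align` callables, and the string stand-ins for feature maps are all assumptions for illustration.

```python
def cascade_fuse(features, fuse, align):
    """Fuse a shallow-to-deep list of features from the deepest level upward.

    fuse(x_low, y_high) returns the fused map; align(y_high, x_low) returns
    y_high resized to match x_low. Returns one fused output per fusion step.
    """
    out = features[-1]  # start from the deepest feature
    fused = []
    for x_low in reversed(features[:-1]):
        out = fuse(x_low, align(out, x_low))
        fused.append(out)
    return fused

# String stand-ins make the wiring visible without any tensor math.
features = ["F1", "F2", "F3", "F4", "F5"]  # five hypothetical stage outputs
fused = cascade_fuse(
    features,
    fuse=lambda x, y: f"T({x},{y})",
    align=lambda y, x: y,  # identity stand-in for resizing
)
print(len(fused))  # 4
print(fused[-1])   # T(F1,T(F2,T(F3,T(F4,F5))))
```

Under this assumed wiring, five feature levels yield four pairwise fusions, which matches the count of blocks labeled TGFM1 to TGFM4, though the actual connections in GGL-Net may differ.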
The clearest quantitative evidence for TGFM comes from the ablation study comparing it with unguided fusion and one-way variants. The paper defines ADD as direct element-wise addition, CAM as only channel attention from high-level to low-level, SAM as only spatial attention from low-level to high-level, and TGFM as both directions together.
| Variant | IoU | nIoU |
|---|---|---|
| ADD | 0.8062 | 0.7798 |
| CAM | 0.8114 | 0.7810 |
| SAM | 0.8119 | 0.7812 |
| TGFM | 0.8142 | 0.7858 |
These results establish two points. First, each one-way variant alone improves over direct addition, which supports the claim that guided fusion is better than unguided fusion in this setting. Second, using both directions together yields the best result, which supports the interpretation that the two guidance flows are complementary rather than redundant. The paper summarizes the gain over direct addition as 0.99% in IoU and 0.77% in nIoU. These are relative gains: IoU rises from 0.8062 to 0.8142 (0.80 points absolute) and nIoU rises from 0.7798 to 0.7858 (0.60 points absolute). The authors also note qualitatively that better use of the respective advantages of high-level and low-level features improves the fusion effect of different scale features and contributes to better target completeness and background suppression (Zhao et al., 10 Dec 2025).
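The arithmetic behind the quoted percentages can be checked directly from the table values. The short sketch below (function names are illustrative, not from the paper) computes both the relative gain and the absolute gain in percentage points, which shows that 0.99% and 0.77% are measured relative to the ADD baseline.

```python
def relative_gain(baseline, improved):
    """Gain as a percentage of the baseline value."""
    return (improved - baseline) / baseline * 100.0

def absolute_gain_points(baseline, improved):
    """Gain in percentage points on a 0-100 scale."""
    return (improved - baseline) * 100.0

# (ADD, TGFM) pairs taken from the ablation table.
results = {"IoU": (0.8062, 0.8142), "nIoU": (0.7798, 0.7858)}
for name, (add, tgfm_score) in results.items():
    print(f"{name}: +{relative_gain(add, tgfm_score):.2f}% relative, "
          f"+{absolute_gain_points(add, tgfm_score):.2f} points absolute")
# IoU: +0.99% relative, +0.80 points absolute
# nIoU: +0.77% relative, +0.60 points absolute
```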
The paper additionally studies the reduction ratio $r$ in the channel attention branch. Although Eq. (2) does not explicitly write $r$, the authors report that multiple values were tested and that final detection performance changes little, while larger $r$ reduces parameter count. They therefore adopt a single fixed value of $r$. Beyond that, the spatial branch is specified only as a convolution, and the channel branch is specified only by pooling, shared MLP, and sigmoid. The text does not state additional details such as normalization layers, extra activations, or interpolation type inside TGFM (Zhao et al., 10 Dec 2025).
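The effect of the reduction ratio on parameter count follows from the shape of a two-layer shared MLP that maps C channels to C/r hidden units and back. The sketch below counts those weights. The channel width of 64, the bias-free default, and the function name are assumptions for illustration, with the reduction ratio written as `r`.

```python
def shared_mlp_params(channels, r, bias=False):
    """Weights in a two-layer MLP mapping channels -> channels // r -> channels."""
    hidden = channels // r
    n = channels * hidden + hidden * channels
    if bias:
        n += hidden + channels
    return n

for r in (1, 2, 4, 8, 16):
    print(r, shared_mlp_params(64, r))
# 1 8192
# 2 4096
# 4 2048
# 8 1024
# 16 512
```

The count scales as 2C²/r, so doubling `r` halves the channel-branch parameters, which is consistent with the reported trade-off between a nearly unchanged accuracy and a smaller model.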
5. Distinction from conventional fusion and neighboring module families
Relative to standard multi-scale fusion strategies, TGFM’s novelty lies in its cross-level, bidirectional, asymmetric attention design. In a typical FPN-like top-down architecture, semantic information mainly flows downward. In PAN-style structures, there is an additional bottom-up path, but the fusion often remains generic, such as addition or concatenation followed by convolution. In U-Net skip fusion, shallow and deep features are combined, but usually without explicit mutual guidance that distinguishes what kind of information each level should contribute. TGFM differs by encoding the principle that low-level features should provide spatial/detail guidance and high-level features should provide semantic/channel guidance (Zhao et al., 10 Dec 2025).
This distinction becomes clearer when TGFM is placed alongside nearby guidance-fusion designs in other domains. TGIF in multimodal LLMs performs prompt-conditioned visual layer fusion and is explicitly described as a one-way query-guided visual depth router rather than a true two-way mechanism (Lin et al., 6 Jan 2026). FORMULA combines foreground guidance with self-iterative refinement and multi-layer additive feature fusion for unsupervised object discovery, but does not define an explicit reciprocal guidance block (Lin et al., 2022). Depth Guidance Fusion Module in reliable image outpainting uses depth-to-RGB guidance through dynamic kernel generation and progressive multimodal fusion, yet remains asymmetric because depth is the guiding modality (Zhang et al., 2022). GAFusion uses sparse depth guidance, LiDAR occupancy guidance, and LiDAR-guided adaptive fusion for BEV detection, but the guidance remains mainly LiDAR-to-camera rather than reciprocal (Li et al., 2024). Across these cases, the common pattern is guided fusion, but TGFM’s specific contribution is to make the two directions explicit and role-specific within a single shallow/deep feature pair.
Other recent work distributes dual guidance across multiple components rather than embedding it in one cross-level block. DSPFusion separates Dual Prior Guidance Module (DPGM) and Prior-Guided Fusion Module (PGFM), combining degradation priors and semantic priors in a split restoration-and-fusion pipeline (Tang et al., 30 Mar 2025). TeSG uses Mask-Guided Cross-Attention (MGCA) followed by Text-Driven Attentional Fusion (TDAF), so spatial and semantic guidance act at different stages of infrared-visible image fusion (Zhu et al., 20 Jun 2025). GD4Fusion uses GFMSE for modality-specific frequency guidance and GSMAF for guidance-conditioned spatial aggregation, again distributing guidance across a broader architecture (Zhang et al., 5 Sep 2025). These comparisons suggest that TGFM occupies a specific point in the design space: a compact, repeated cross-level fusion block whose two directions are defined by feature-level asymmetry rather than by modality asymmetry or external prompt routing.
6. Interpretation, limits, and broader significance
Several misconceptions can be excluded directly from the paper’s description. TGFM is not the component that introduces gradient magnitude images; that role belongs to GSM in the feature extraction stage. TGFM does not directly compute gradients, and it is not supervised by a dedicated auxiliary loss. It is also not an indiscriminate bidirectional exchange block: the two directions are different in function and operator, because one is spatial attention generated from low-level features and the other is channel attention generated from high-level features. Finally, the paper does not provide exact per-level tensor shapes or a fully explicit resizing pipeline inside the module, so low-level implementation details such as interpolation type or channel-alignment layers are not specified as explicit facts (Zhao et al., 10 Dec 2025).
From the standpoint of infrared small target detection, the module is tightly matched to the appearance regime of the task. Tiny targets often have weak contrast, blurred boundaries, limited texture, and only a few supporting pixels. Under those conditions, ordinary deep fusion can wash out weak signals as feature maps become deeper and coarser. TGFM addresses that by preserving location and edge integrity through low-level spatial guidance, while using deep semantic channel guidance to reject clutter such as bright noise or strong background edges. A plausible implication is that TGFM exemplifies a broader design rule for small-object perception: when shallow and deep features fail in different ways, fusion benefits from assigning each level a distinct guidance role rather than forcing both into a single generic merge operator.
In that sense, TGFM is best understood not merely as a module inside GGL-Net, but as a concise formalization of role-specific bidirectional fusion. Its defining idea is not bidirectionality alone, but the coupling of direction with information type: shallow features contribute spatial evidence, deep features contribute semantic filtering, and the fused representation is formed only after each branch has been modulated by the other. Within the paper’s own empirical scope, that formulation improves over both unguided addition and either one-way variant alone on NUAA-SIRST, which is the most direct evidence that the two-way design is functionally meaningful rather than notationally decorative (Zhao et al., 10 Dec 2025).