Discrepancy Info Injection Module (DIIM)
- DIIM is a fusion module that explicitly extracts and injects modality-specific discrepancy features to enhance image fusion within the ATFusion framework.
- The module modifies standard cross-attention by computing common features, subtracting them to isolate modality-unique signals, and re-injecting those signals via a residual architecture.
- Empirical studies show DIIM improves performance metrics and preserves critical thermal and structural details compared to traditional fusion approaches.
The Discrepancy Information Injection Module (DIIM) is a fusion component introduced in the context of infrared and visible image fusion, specifically within the ATFusion framework. DIIM provides a mechanism for explicitly extracting and injecting modality-unique features—termed "discrepancy" information—into the fusion process, in contrast to vanilla cross-attention architectures that primarily focus on commonalities between input modalities. DIIM modifies cross-attention to promote the preservation and utilization of features unique to each input, yielding performance improvements in tasks that require the integration of modality-specific cues for downstream applications such as environmental monitoring and target detection (Yan et al., 2024).
1. Motivation and Conceptual Foundation
The canonical challenge in fusing infrared and visible images is to merge "common" scene elements (e.g., boundaries identified in both modalities) with "discrepancy" features, such as thermal highlights unique to infrared or fine-grained textures available only in visible images. Traditional cross-attention mechanisms, parameterized as
$$\mathrm{Attn}(Q_A, K_B, V_B) = \mathrm{softmax}\!\left(\frac{Q_A K_B^{\top}}{\sqrt{d}}\right) V_B,$$
are fundamentally designed to emphasize mutual information across modalities A and B, thus maximizing cross-correlations while largely ignoring non-overlapping, modality-specific signal. Here $Q_A$ denotes queries derived from modality A, $K_B$ and $V_B$ denote keys and values derived from modality B, and $d$ is the key dimension. This oversight yields fused representations that inadequately preserve the salient, unique information that is crucial for robust scene interpretation in remote sensing domains.
To address these limitations, DIIM incorporates three operations:
- Computing standard cross-attention to extract common information (CM),
- Subtracting the resultant commonality from one modality's features to isolate the discrepancy,
- Re-injecting this discrepancy stream into the query path via a residual architecture and projection.
This protocol is designed so that the fused representations are informed by both shared and unique patterns, thereby enhancing overall discriminability (Yan et al., 2024).
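The three operations can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the ATFusion implementation: the helper names `softmax` and `discrepancy_injection`, the output projection `w_out`, and the choices to subtract the common information from the value features and to add the projected discrepancy to the queries are all assumptions made for the sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def discrepancy_injection(q, k, v, w_out):
    """Hypothetical helper: q, k, v are (N, d) arrays and w_out is (d, d)."""
    d = q.shape[-1]
    attn = softmax(q @ k.T / np.sqrt(d))  # (N, N) cross-modal attention weights
    cm = attn @ v                         # step 1: common information
    di = v - cm                           # step 2: discrepancy (assumed: subtract from values)
    injected = q + di @ w_out             # step 3: project and add to the query path
    return cm, di, injected

rng = np.random.default_rng(0)
N, d = 6, 4
q, k, v = (rng.standard_normal((N, d)) for _ in range(3))
cm, di, injected = discrepancy_injection(q, k, v, np.eye(d))
print(cm.shape, di.shape, injected.shape)
```

Because each row of the attention matrix sums to one, every row of `cm` is a weighted average of value tokens, and `di` is whatever that average fails to explain. The subtraction requires both modalities to contribute the same number of tokens.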
2. Module Architecture and Computational Graph
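DIIM operates on tokenized input features sourced from the infrared and visible images and outputs a feature tensor that encodes the isolated discrepancy information. Below, $X_q$ denotes the token features that supply the queries and $X_{kv}$ the token features of the other modality, which supply the keys and values.
The information flow proceeds through the following steps:
- Partition: Both input tensors are divided into the same number $N$ of tokens, each of dimension $d$.
- Projection: Each token sequence is projected through its own linear layers to obtain queries ($Q$), keys ($K$), and values ($V$):
  - $Q = X_q W_Q$,
  - $K = X_{kv} W_K$,
  - $V = X_{kv} W_V$
- Cross-Attention: The projected tokens are stacked to form $Q, K, V \in \mathbb{R}^{N \times d}$. The standard cross-attention is evaluated to obtain the common information,
$$\mathrm{CM} = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V$$
- Discrepancy Extraction: The discrepancy stream is isolated by subtracting the common information,
$$\mathrm{DI} = V - \mathrm{CM}$$
- Injection: The discrepancy is projected and re-injected into the query stream,
$$Z = Q + \mathrm{DI}\,W_O$$
- Residual MLP Pipeline: The resulting tensor is normalized, passed through an MLP, and combined with its input via a residual connection,
$$Y = Z + \mathrm{MLP}(\mathrm{LN}(Z))$$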
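A compact overview of the major operations is presented below, where $N$ is the number of tokens per modality and $d$ is the token dimension:
| Step | Operation | Output Shape |
|---|---|---|
| Partition | Tokenize the query-side features $X_q$ and the key/value-side features $X_{kv}$ | $N \times d$ each |
| Linear Projections | $Q = X_q W_Q$, $K = X_{kv} W_K$, $V = X_{kv} W_V$ | $N \times d$ each |
| Cross-Attention | $\mathrm{CM} = \mathrm{softmax}(Q K^{\top} / \sqrt{d})\,V$ | $N \times d$ |
| Discrepancy | $\mathrm{DI} = V - \mathrm{CM}$ | $N \times d$ |
| Combine and Residual | $Z = Q + \mathrm{DI}\,W_O$, $Y = Z + \mathrm{MLP}(\mathrm{LN}(Z))$ | $N \times d$ |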
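The partition step can be illustrated with a small NumPy sketch. It assumes non-overlapping square patches and a channels-last feature map; the function name `partition_into_tokens` and the patch size are hypothetical, and the tokenization actually used in ATFusion may differ.

```python
import numpy as np

def partition_into_tokens(feat, patch):
    """Split an (H, W, C) feature map into non-overlapping patch tokens.

    Returns an array of shape (N, patch * patch * C),
    where N = (H // patch) * (W // patch).
    """
    h, w, c = feat.shape
    if h % patch or w % patch:
        raise ValueError("H and W must be divisible by the patch size")
    gh, gw = h // patch, w // patch
    x = feat.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)

# Two registered toy feature maps standing in for the two modalities.
ir = np.arange(8 * 8 * 2, dtype=np.float32).reshape(8, 8, 2)
vis = ir[::-1].copy()
ir_tokens = partition_into_tokens(ir, patch=4)
vis_tokens = partition_into_tokens(vis, patch=4)
print(ir_tokens.shape, vis_tokens.shape)  # (4, 32) (4, 32)
```

Applying the same partition to both inputs gives two token sequences of equal length, which the later subtraction of common information from one modality's features relies on.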
3. Mathematical Description and Pseudocode
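The mathematical formulation directly mirrors the computational pipeline. Let $X_q, X_{kv} \in \mathbb{R}^{N \times d}$ be the query-side and key/value-side token features, and let $W_Q, W_K, W_V, W_O \in \mathbb{R}^{d \times d}$ be learned projection matrices.
- Linear Projections:
  - $Q = X_q W_Q$,
  - $K = X_{kv} W_K$,
  - $V = X_{kv} W_V$
- Cross-Attention:
$$\mathrm{CM} = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V$$
- Discrepancy Isolation:
$$\mathrm{DI} = V - \mathrm{CM}$$
- Information Injection and Residual:
$$Z = Q + \mathrm{DI}\,W_O$$
$$Y = Z + \mathrm{MLP}(\mathrm{LN}(Z))$$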
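The pipeline described in this section can be sketched as a small NumPy block. This is an illustrative sketch, not the authors' code: the class name `DIIMSketch`, the random weight initialization, the hidden width, the ReLU activation, the layer normalization without learned scale and shift, and the assignment of `tokens_a` to the query side are all assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    """Per-token normalization without learned scale/shift (a simplification)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

class DIIMSketch:
    """Illustrative DIIM-style block; all names and sizes are hypothetical."""

    def __init__(self, dim, hidden, seed=0):
        rng = np.random.default_rng(seed)

        def init(rows, cols):
            # Scaled Gaussian weights keep activations at a moderate scale.
            return rng.standard_normal((rows, cols)) / np.sqrt(rows)

        self.w_q, self.w_k, self.w_v, self.w_o = (init(dim, dim) for _ in range(4))
        self.w1, self.w2 = init(dim, hidden), init(hidden, dim)

    def __call__(self, x_q, x_kv):
        q, k, v = x_q @ self.w_q, x_kv @ self.w_k, x_kv @ self.w_v
        cm = softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v  # common information
        di = v - cm                                       # discrepancy information
        z = q + di @ self.w_o                             # injection into the query path
        h = np.maximum(layer_norm(z) @ self.w1, 0.0)      # MLP hidden layer (ReLU assumed)
        return z + h @ self.w2                            # residual connection

block = DIIMSketch(dim=8, hidden=16)
rng = np.random.default_rng(1)
tokens_a = rng.standard_normal((5, 8))  # query-side tokens (assumed role)
tokens_b = rng.standard_normal((5, 8))  # key/value-side tokens (assumed role)
out = block(tokens_a, tokens_b)
print(out.shape)  # (5, 8)
```

The output keeps the token layout of the inputs, so it can be passed to later attention-based modules without reshaping.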
4. Integration within ATFusion Framework
DIIM is invoked as the initial component of the ATFusion fusion block. Its output encodes the discrepancy information between the two modalities. ATFusion then applies two alternate common information injection modules (ACIIMs) in sequence, starting from the DIIM output. The feature produced at the end of this sequence is the final fused feature used for reconstruction.
This operational sequence ensures that the fused output is both initialized with unique, modality-specific cues and subsequently refined with common features for comprehensive scene coverage (Yan et al., 2024).
5. Training Objectives and Optimization Interaction
DIIM is optimized as part of the end-to-end ATFusion network rather than via a bespoke loss function. The overall loss combines two terms: a segmented pixel loss and an auxiliary texture loss.
The segmented pixel loss partitions pixels by importance. Pixels are ranked by the product of intensity and gradient, and the top-ranked fraction forms the high-importance segment. A max-selection penalty is applied to that segment and an average penalty to the remainder. This strategy emphasizes fidelity for modality-relevant (often discrepancy-associated) structures while encouraging global consistency. The auxiliary texture loss regularizes high-frequency gradient content.
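The partitioning idea behind the segmented pixel loss can be made concrete with a hedged NumPy sketch. Every specific here is an assumption: the function names, the `top_fraction=0.2` default, the use of an L1 penalty, computing importance on the element-wise maximum of the two sources, and reading "max-selection" and "average" as targets built from the two source images. The definitions and constants in Yan et al. (2024) should be taken as authoritative.

```python
import numpy as np

def gradient_magnitude(img):
    """Per-pixel gradient magnitude from finite differences."""
    gy, gx = np.gradient(img)
    return np.hypot(gx, gy)

def segmented_pixel_loss(fused, ir, vis, top_fraction=0.2):
    """Hypothetical segmented L1 loss over two importance segments."""
    ref = np.maximum(ir, vis)
    importance = ref * gradient_magnitude(ref)         # intensity-gradient product
    threshold = np.quantile(importance, 1.0 - top_fraction)
    high = importance >= threshold                     # high-importance segment
    max_target = np.maximum(ir, vis)                   # max-selection target
    avg_target = 0.5 * (ir + vis)                      # average target
    err = np.where(high,
                   np.abs(fused - max_target),
                   np.abs(fused - avg_target))
    return err.mean()

rng = np.random.default_rng(0)
ir, vis = rng.random((16, 16)), rng.random((16, 16))
loss_max = segmented_pixel_loss(np.maximum(ir, vis), ir, vis)
loss_avg = segmented_pixel_loss(0.5 * (ir + vis), ir, vis)
print(loss_max, loss_avg)
```

Under these assumptions, a fused image equal to the element-wise maximum is penalized only on low-importance pixels, and one equal to the average only on high-importance pixels, which is the trade-off the segmentation encodes.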
A plausible implication is that DIIM's effectiveness relies heavily on its interaction with the overall network's loss structure, since its unique stream is indirectly shaped through these global optimization objectives.
6. Empirical Outcomes, Ablation, and Functional Significance
Empirical validation indicates that DIIM enables more robust preservation of modality-unique cues. Qualitative feature map analyses reveal that DIIM's outputs preserve thermal targets and modality-specific edge structures more effectively than those produced by vanilla cross-attention, which highlight only shared spatial features (see Fig. 4 in Yan et al., 2024).
Ablation studies (Section 4.3, Fig. 14, Table VI) demonstrate quantifiable impact: removing DIIM from ATFusion causes a pronounced reduction in Average Gradient (AG), Spatial Frequency (SF), and Qabf, metrics that capture infrared-unique saliency, with the magnitudes of the drops reported in Table VI. This outcome supports the view that explicit discrepancy injection is critical for advanced fusion performance, particularly in scenarios demanding sensitivity to modality-specific signals.
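For readers unfamiliar with two of the metrics named above, the following sketch computes Spatial Frequency and Average Gradient using their commonly cited definitions. Normalization conventions vary between fusion toolkits, so the exact values may not match those used for Table VI; the function names are illustrative, and Qabf is omitted because it requires an edge-preservation model that is beyond a short example.

```python
import numpy as np

def spatial_frequency(img):
    """SF = sqrt(RF^2 + CF^2) from horizontal and vertical first differences."""
    img = np.asarray(img, dtype=np.float64)
    rf = np.sqrt(np.mean(np.diff(img, axis=1) ** 2))  # row frequency
    cf = np.sqrt(np.mean(np.diff(img, axis=0) ** 2))  # column frequency
    return np.sqrt(rf ** 2 + cf ** 2)

def average_gradient(img):
    """AG = mean of sqrt((gx^2 + gy^2) / 2) over forward differences."""
    img = np.asarray(img, dtype=np.float64)
    gx = img[:-1, 1:] - img[:-1, :-1]
    gy = img[1:, :-1] - img[:-1, :-1]
    return np.mean(np.sqrt((gx ** 2 + gy ** 2) / 2.0))

flat = np.full((8, 8), 0.5)             # no detail at all
ramp = np.tile(np.arange(8.0), (8, 1))  # unit horizontal gradient
print(spatial_frequency(flat), average_gradient(flat))
print(spatial_frequency(ramp), average_gradient(ramp))  # SF is 1.0, AG is sqrt(0.5)
```

Both metrics grow with the amount of local intensity variation, so a fused image that loses fine detail scores lower on each.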
7. Summary and Comparative Perspective
DIIM constitutes a lightweight, explicit strategy for enhancing modality-unique information uptake in transformer-based fusion architectures. Its characteristic approach—extracting, isolating, and injecting discrepancy signals—positions it as a central enabler in ATFusion's ability to synthesize both shared and unique content from heterogeneous inputs. Ablation-based improvements substantiate its role as essential for achieving state-of-the-art outcomes in infrared and visible image fusion.
Further investigations may assess the generalizability of the discrepancy injection protocol for other multimodal fusion tasks and transformer-based models (Yan et al., 2024).