
Saliency Enhanced Feature Fusion

Updated 29 December 2025
  • SEFF is a paradigm that leverages saliency cues to guide fusion across modalities, scales, and architectures for robust feature integration.
  • It employs methods like attention modulation, learned saliency maps, and dynamic weighting to improve spatial and semantic alignment in tasks such as image fusion and segmentation.
  • Empirical studies demonstrate that SEFF enhances accuracy and efficiency in applications like remote sensing detection and EEG emotion estimation.

Saliency Enhanced Feature Fusion (SEFF) is a unifying paradigm in which feature fusion across modalities, scales, or architectures is guided, modulated, or initiated by saliency information (statistical, learned, or analytically derived) to maximize information transfer, robust discrimination, and modality complementarity. SEFF approaches have been systematically developed in domains such as image fusion, saliency detection, multimodal segmentation, remote-sensing detection, multimodal large language models (MLLMs), and bio-signal processing. The central motif is leveraging saliency at intermediate stages of feature extraction or fusion to dynamically weight, select, or transform the resulting features for optimal downstream task performance.

1. Core Principles and General Architectures

The defining principle of SEFF is the injection of saliency priors or learned saliency-induced weights into the feature fusion process. Saliency can be spatial, channel-wise, cross-scale, or semantic. Generic SEFF pipelines comprise:

  • Saliency Extraction: Saliency maps or vectors are derived from low-level cues (e.g., Zhai-Shah contrast (Lahoud et al., 2019)), deep activations (e.g., CNN feature activity, LayerCAM (Tao et al., 4 Sep 2025)), or explicit attention (e.g., Multi-Dimensional Collaborative Attention blocks (Jiang et al., 17 Jul 2025)).
  • Feature Decomposition and Preprocessing: Inputs are typically split (e.g., base/detail components, modality channels, or encoder/decoder paths) to expose complementary structures or semantics.
  • Saliency-Weighted Fusion: Saliency maps modulate fusion via multiplicative gating, attention reweighting, or as inputs to learnable fusion blocks, thus improving spatial and semantic alignment.
  • Multi-scale or Multi-modal Integration: Features from different modalities (RGB/depth, IR/visible, EEG-sequential/CNN-image) or scales are fused using saliency to preserve critical object cues or context.

The following pseudocode (adapted from (Lahoud et al., 2019)) illustrates a typical SEFF approach:

# Pseudocode: LowPass, SaliencyMap, CNN, softmax, and norm stand for the paper's
# two-scale decomposition, visual-contrast saliency, deep feature extractor,
# per-pixel weight normalization, and feature-activity norm, respectively.
bases, details, saliencies = [], [], []
for k in range(K):
    base_k = LowPass(inputs[k])            # base layer: low-frequency content
    bases.append(base_k)
    details.append(inputs[k] - base_k)     # detail layer: high-frequency residual
    saliencies.append(SaliencyMap(inputs[k]))

# Base fusion: saliency maps normalized across the K inputs give per-pixel weights.
base_weights = softmax(saliencies)
fused_base = sum(w * b for w, b in zip(base_weights, bases))

# Detail fusion: deep feature activity provides the weights.
detail_weights = softmax([norm(CNN(x)) for x in inputs])
fused_detail = sum(w * d for w, d in zip(detail_weights, details))

output = fused_base + fused_detail

2. Saliency Extraction: Modalities and Algorithms

The most widely used saliency cues in SEFF are:

  • Histogram-based Visual Contrast: The Zhai-Shah per-pixel histogram contrast (as implemented in (Lahoud et al., 2019)) is computationally efficient and domain-agnostic:

S_k(p) = \sum_{i=0}^{255} M_k(i)\,\lvert I_k(p) - i \rvert, \qquad M_k(i) = \#\{q : I_k(q) = i\}

  • Deep Feature Activity/LayerCAM: Class-score gradients with respect to feature maps, clipped at zero, weight the activations element-wise; the weighted maps are summed across channels and passed through a ReLU, yielding class-discriminative saliency (Tao et al., 4 Sep 2025):

w^k_{ij} = \max\Bigl(\frac{\partial y_c}{\partial A^k_{ij}},\, 0\Bigr), \qquad \widetilde{A}^k_{ij} = w^k_{ij}\, A^k_{ij}

These mechanisms may operate in tandem, with analytic saliency guiding initial filtering and learned attention modules conducting deeper adaptive weighting.
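
As a concrete illustration of the histogram-contrast cue, the per-pixel sum can be computed with a 256-entry lookup table. The following minimal NumPy sketch assumes an 8-bit grayscale input; the function name and the final normalization are illustrative rather than taken from the cited implementation.

import numpy as np

def histogram_contrast_saliency(img):
    """Per-pixel global contrast saliency for an 8-bit grayscale image (Zhai-Shah style)."""
    hist = np.bincount(img.ravel(), minlength=256).astype(np.float64)   # M_k(i)
    levels = np.arange(256, dtype=np.float64)
    # Lookup table: S(v) = sum_i M(i) * |v - i|, computed once per intensity level.
    lut = np.abs(levels[:, None] - levels[None, :]) @ hist
    sal = lut[img]                        # map every pixel through the table
    return sal / (sal.max() + 1e-12)      # normalize to [0, 1] for use as a weight

A similarly hedged PyTorch sketch of the LayerCAM-style weighting, assuming activations and gradients are (B, C, H, W) tensors already captured by forward/backward hooks on the chosen layer (the hook wiring is omitted):

import torch

def layercam_saliency(activations, gradients):
    """LayerCAM-style map: positive gradients gate the activations, channels are summed."""
    weights = torch.clamp(gradients, min=0)                  # w^k_ij = max(dy_c / dA^k_ij, 0)
    cam = torch.relu((weights * activations).sum(dim=1))     # collapse channels, post-ReLU
    cam = cam / (cam.amax(dim=(1, 2), keepdim=True) + 1e-8)
    return cam                                               # (B, H, W), normalized to [0, 1]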

3. Fusion Strategies and Mathematical Formulations

The fusion of features under SEFF follows well-characterized protocols:

  • Weighted Summation (Image Fusion): Fused output is a sum of input features weighted by normalized or saliency-guided weights:

\overline{B}(p) = \sum_{k=1}^K \overline{w}_k^B(p)\, B_k(p)

\overline{D}(p) = \sum_{k=1}^K \overline{w}_k^D(p)\, D_k(p)

(Lahoud et al., 2019)

  • Attention-Modulated Fusion (Deep Models): Fusion modules may implement channel, spatial, or joint attention, as in JAFF:

A = M_c \odot M_s

F_l' = \alpha (F_l \odot A) + F_l

F_{fuse} = [F_l';\ \mathrm{Upsample}(F_h)]

(Jiang et al., 5 Feb 2024)

  • Linear Combination of Semantic and Appearance Features: SEFF in unsupervised segmentation (Li et al., 2020) fuses unary and context features linearly with learned weights:

S_c(i) = w^T \varphi(i) + b,\quad \varphi(i) = [S_s(i),\, S_s^{ctx}(i),\, S_a(i),\, S_a^{ctx}(i)]^T

  • Cross-Modality and Multi-Scale Fusion: Cross-attention mechanisms (e.g., IAS-ViT (Tao et al., 4 Sep 2025), FGSE (Wang et al., 2023)) use saliency to modulate, query, or transform input features, followed by learnable fusion layers.
  • Guided Filters and Smoothing: The spatial weight maps are often regularized with edge-preserving (guided) filtering, enforcing smoothness without sacrificing locality or discriminability.
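
To make the weighted-summation protocol concrete, the following NumPy sketch (variable names are illustrative) turns K saliency maps into per-pixel weights via a softmax over the inputs and applies them to the corresponding layers:

import numpy as np

def saliency_weighted_fusion(layers, saliency_maps):
    """Fuse K spatially aligned layers with per-pixel weights derived from saliency."""
    sal = np.stack(saliency_maps, axis=0)               # (K, H, W)
    w = np.exp(sal - sal.max(axis=0, keepdims=True))    # numerically stable softmax over K
    w /= w.sum(axis=0, keepdims=True)
    return (w * np.stack(layers, axis=0)).sum(axis=0)   # weighted sum over the K inputs

# e.g. fused_base = saliency_weighted_fusion([B_1, B_2], [S_1, S_2])

In practice the weight maps can additionally be smoothed with an edge-preserving guided filter (e.g., cv2.ximgproc.guidedFilter from opencv-contrib) before the summation, in line with the last bullet above.

For attention-modulated fusion, a hedged PyTorch sketch of a joint channel/spatial attention module in the spirit of JAFF follows; the branch definitions, kernel sizes, and the learnable scale are simplified stand-ins rather than the cited architecture:

import torch
import torch.nn as nn
import torch.nn.functional as F

class JointAttentionFusion(nn.Module):
    """Fuse a low-level map F_l with an upsampled high-level map F_h via joint attention."""
    def __init__(self, channels):
        super().__init__()
        self.channel_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(channels, channels, 1), nn.Sigmoid())
        self.spatial_att = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=7, padding=3), nn.Sigmoid())
        self.alpha = nn.Parameter(torch.zeros(1))                 # learnable residual scale

    def forward(self, f_low, f_high):
        a = self.channel_att(f_low) * self.spatial_att(f_low)     # A = M_c * M_s
        f_low = self.alpha * (f_low * a) + f_low                  # F_l' = alpha*(F_l * A) + F_l
        f_high = F.interpolate(f_high, size=f_low.shape[-2:],
                               mode="bilinear", align_corners=False)
        return torch.cat([f_low, f_high], dim=1)                  # [F_l'; Upsample(F_h)]

The linear semantic/appearance combination reduces to a few lines; here w is assumed to be the learned 4-vector of weights and all four cue maps share one spatial shape:

import numpy as np

def linear_cue_fusion(S_s, S_s_ctx, S_a, S_a_ctx, w, b):
    """Per-pixel linear fusion S_c = w^T phi + b of semantic/appearance cues and contexts."""
    phi = np.stack([S_s, S_s_ctx, S_a, S_a_ctx], axis=-1)    # phi(i), shape (H, W, 4)
    return phi @ w + b                                        # S_c, shape (H, W)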

4. Application Domains

SEFF has been systematically established in the following domains:

| Domain | Saliency cue | Fusion mechanism | Representative papers |
|---|---|---|---|
| Multi-modal image fusion | Visual contrast, CNN activations | Saliency-guided weighted sum | (Lahoud et al., 2019); (Wang et al., 2023) |
| Saliency detection / SOD | Global context, attention | ASPP-like fusion, NEWLoss | (Park et al., 2021); (Huang et al., 22 Jan 2024) |
| Tiny object detection (remote sensing) | MDCA, ARB, PFDH | Stage-wise, attention-based, reversible fusion | (Jiang et al., 17 Jul 2025) |
| Aesthetic image captioning | LayerCAM, cross-attention | Cross-attention in ViT, token fusion | (Tao et al., 4 Sep 2025) |
| EEG emotion estimation | RNN saliency, spatial map | Saliency back-projection, dual-stream | (Delvigne et al., 2022) |

These approaches demonstrate that SEFF mechanisms are architecture-agnostic and apply to both supervised and unsupervised scenarios, as well as to spatial, temporal, and semantic feature fusion.

5. Quantitative Impact and Ablation Evidence

Comprehensive ablation studies support the efficacy of SEFF modules:

  • Image fusion (Lahoud et al., 2019): Saliency-based base fusion and CNN-based detail fusion outperformed simple max/average schemes across energy, mutual information, and visual quality metrics; runtime is near real-time (~0.16 s).
  • RGB-D saliency detection (Huang et al., 22 Jan 2024): Incorporation of SEFF reduced mean absolute error from 0.059 to 0.035 and increased F_\beta from 0.885 to 0.917; ablations confirmed that both local and global context attention are critical.
  • Surface defect SOD (Jiang et al., 5 Feb 2024): Removing JAFF degraded F_w by ~1% and removing DRF degraded it by ~0.9%, establishing the necessity of joint attention-guided fusion and dense context.
  • Tiny-object remote sensing (Jiang et al., 17 Jul 2025): MDCA+ARB+PFDH yields +4.0% AP (AI-TOD benchmark) and +6.5% AP75, outperforming strong YOLOv11m baselines.
  • EEG emotion (Delvigne et al., 2022): Saliency fusion achieves 74.42% accuracy (SEED-IV), compared to 71.48% for vanilla feature fusion and 69.34% for post-classification output fusion, while reducing variance.
  • Aesthetic image captioning (Tao et al., 4 Sep 2025): Saliency-fused MLLMs outperform generic MLLMs and classical approaches in all major AIC metrics, showing the universality of SEFF principles in multimodal generative tasks.

6. Design Variants and Implementation Considerations

Implementation varies by domain, but core design considerations emerge:

  • Attention Type and Granularity: Spatial vs. channel vs. joint attention; local vs. global pooling; saliency-cue origin (external/analytic vs. learned).
  • Integration Stage: Encoder fusion, cross-scale decoder fusion, or cross-modality projection; some pipelines embed SEFF at several levels (e.g., both encoder and decoder).
  • Supervision and Losses: Deep supervision (multiple side outputs (Jiang et al., 5 Feb 2024)), hybrid losses (BCE+IoU+SSIM), and explicit edge weighting (NEWLoss (Park et al., 2021)) reinforce boundary precision and context integration.
  • Parameter Efficiency and Speed: Efficient 1×1 and 3×3 convolutions (noted in RGB-D SOD (Huang et al., 22 Jan 2024)); real-time capabilities (e.g., 6 fps in fusion (Lahoud et al., 2019), 66 fps in SOD (Jiang et al., 5 Feb 2024)); small ViT adaptors in MLLMs (Tao et al., 4 Sep 2025).
  • Alternating Optimization: Interactively reinforced paradigms such as IRFS (Wang et al., 2023) train the fusion and saliency branches in coordinated loops to improve multi-task synergy.
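
A self-contained toy sketch of such an alternating loop is given below; the two single-convolution "networks", the loss terms, and the random tensors are placeholders chosen for brevity, not the IRFS design:

import torch
import torch.nn as nn
import torch.nn.functional as F

fusion_net = nn.Conv2d(2, 1, 3, padding=1)      # toy stand-in: fuses IR + visible inputs
saliency_net = nn.Conv2d(1, 1, 3, padding=1)    # toy stand-in: saliency from the fused image
opt_f = torch.optim.Adam(fusion_net.parameters(), lr=1e-4)
opt_s = torch.optim.Adam(saliency_net.parameters(), lr=1e-4)

ir, vis = torch.rand(4, 1, 64, 64), torch.rand(4, 1, 64, 64)    # dummy source images
mask = torch.rand(4, 1, 64, 64)                                  # dummy saliency target

for step in range(10):
    # Step 1: update the fusion branch; the frozen saliency branch emphasizes
    # salient regions in the (placeholder) reconstruction loss.
    fused = fusion_net(torch.cat([ir, vis], dim=1))
    with torch.no_grad():
        sal = torch.sigmoid(saliency_net(fused))
    loss_f = (sal * (fused - torch.maximum(ir, vis)).abs()).mean() \
           + ((1 - sal) * (fused - 0.5 * (ir + vis)).abs()).mean()
    opt_f.zero_grad(); loss_f.backward(); opt_f.step()

    # Step 2: update the saliency branch on the detached fused output.
    loss_s = F.binary_cross_entropy_with_logits(saliency_net(fused.detach()), mask)
    opt_s.zero_grad(); loss_s.backward(); opt_s.step()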

7. Future Directions and Limitations

Limitations and future work center on augmenting the robustness and expressivity of SEFF modules:

  • Dynamic and Task-Adaptive Saliency: Replacing analytic or static saliency maps with dynamic, input-conditioned attention and potential unsupervised or self-supervised saliency learning.
  • End-to-End Optimization: Integrating raw-signal encoders (e.g., for EEG (Delvigne et al., 2022)) or directly optimizing fusion parameters in large, transformer-based systems.
  • Expandability to New Tasks: Application to graph, sequential, or multi-hop data; further exploration of SEFF in foundation models and multimodal LLMs.
  • Limitation: Current SEFF paradigms in some fields (e.g., EEG) rely on pre-extracted features, not fully leveraging end-to-end spatial-temporal modeling capacities.

SEFF represents a unifying set of methodologies for fused feature processing governed by saliency or attention cues. Its extensions across modalities, task types, and network architectures are strongly supported by empirical ablation and large-scale experimentation, with continued development underway across vision, biomedical, and multimodal generative domains (Lahoud et al., 2019, Huang et al., 22 Jan 2024, Jiang et al., 5 Feb 2024, Jiang et al., 17 Jul 2025, Park et al., 2021, Wang et al., 2023, Delvigne et al., 2022, Tao et al., 4 Sep 2025, Li et al., 2020).
