Transformer-Derived Saliency Maps

Updated 26 November 2025
  • Transformer-derived saliency maps are attributions produced by transformer architectures that highlight key regions or tokens critical for both visual and textual interpretation.
  • They employ multi-scale self-attention, cross-modality fusion, and advanced decoder mechanisms to generate dense, high-resolution saliency outputs.
  • These methods achieve state-of-the-art performance on metrics like F-measure and MAE, supporting diverse applications including RGB-D imaging, video saliency, and language model interpretability.

A transformer-derived saliency map is a spatial or token-level attribution output produced by a transformer architecture that is designed with, or adapted to include, mechanisms for extracting saliency: the regions, pixels, or tokens most relevant to a given visual or textual target. Saliency maps, originally developed in visual attention research, have become central to computer vision, multimodal processing, and model interpretability, and transformer-based approaches now define the state of the art for dense saliency map generation, saliency-guided attention, and post hoc explanation in both vision and language domains.

1. Architectural Foundations in Vision Transformers

Transformer-derived saliency maps in vision primarily originate from self-attention–based architectures incorporating specialized backbones (e.g., Swin Transformer, T2T-ViT, PVT-v2) and attention-driven decoders. Typical pipelines tokenize an input image into patches, embed positional information, and then apply multi-head self-attention (MHSA) to all patches, resulting in feature representations with global receptive fields from the earliest layers (Liu et al., 2022, Liu et al., 2023, Ren et al., 2021, Djilali et al., 2023).
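
The minimal PyTorch sketch below illustrates this generic pipeline (patchify, add positional embeddings, apply MHSA so every patch attends to every other); the module and dimensions are illustrative and do not correspond to any specific backbone cited above.

```python
import torch
import torch.nn as nn

class TinyViTEncoder(nn.Module):
    """Generic ViT-style encoder: patchify, add positions, apply MHSA blocks."""
    def __init__(self, img_size=224, patch=16, dim=256, depth=4, heads=8):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        n_patches = (img_size // patch) ** 2
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):                           # x: (B, 3, H, W)
        tokens = self.patch_embed(x)                # (B, dim, H/16, W/16)
        B, D, h, w = tokens.shape
        tokens = tokens.flatten(2).transpose(1, 2)  # (B, h*w, dim)
        tokens = tokens + self.pos_embed
        tokens = self.blocks(tokens)                # global receptive field per token
        return tokens, (h, w)                       # globally contextualized patch features
```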

For example, SwinNet employs two parallel Swin Transformer backbones to process color and depth/thermal modalities, yielding multi-scale features at resolutions H/4×W/4 to H/32×W/32 (Liu et al., 2022). The hierarchical structure enables contextual information aggregation at multiple spatial scales. Multi-stream models can further exploit modality complementarity using cross-attention and fusion mechanisms, critical for robust saliency in challenging (e.g., RGB-D) settings (Liu et al., 2022, Liu et al., 2023, Jia et al., 2022, Zeng et al., 2022, Wang et al., 2021).

Saliency predictors such as SalTR (Djilali et al., 2023), GLSTR (Ren et al., 2021), GeleNet (Li et al., 2023), and VST++ (Liu et al., 2023) all operate by decoding stacks of globally contextualized transformer features into dense, high-resolution saliency maps. Transformer blocks can also be deployed atop CNN feature pyramids (e.g., TranSalNet (Lou et al., 2021)) to inject long-range dependencies otherwise missing from locally constrained convolutions.
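
As a hedged illustration of the latter hybrid idea, a transformer encoder can be run over a flattened CNN feature map to add long-range context before saliency decoding; this is a generic sketch, not TranSalNet's published implementation.

```python
import torch
import torch.nn as nn

class GlobalContextOnCNN(nn.Module):
    """Self-attention over a CNN feature map to inject long-range dependencies."""
    def __init__(self, channels=512, heads=8, depth=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=channels, nhead=heads,
                                           batch_first=True)
        self.attn = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, feat):                      # feat: (B, C, h, w) from a CNN stage
        B, C, h, w = feat.shape
        tokens = feat.flatten(2).transpose(1, 2)  # (B, h*w, C) location tokens
        tokens = self.attn(tokens)                # global interactions between locations
        return tokens.transpose(1, 2).reshape(B, C, h, w)
```

The contextualized feature map can then feed any of the decoding strategies described in the next section.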

2. Saliency Map Generation and Decoding Strategies

Saliency map construction from transformer backbones typically employs decoder mechanisms designed to upsample and refine feature representations. Approaches fall into several categories:

  • Progressive Transformer Decoders: Hierarchical decoders enable saliency map refinement through multi-scale upsampling and cross-level fusion. For example, GLSTR's "deeply-transformed decoder" concatenates outputs from numerous transformer layers at multiple stages, achieving both global context and spatial precision (Ren et al., 2021). A generic decoder of this shape is sketched after this list.
  • Multi-task Decoders with Boundary/Semantics Branches: VST++ uses token-wise multi-task heads for saliency and boundary prediction along with a "Reverse T2T" upsampling strategy, reconstructing dense outputs from transformer tokens alone (Liu et al., 2023).
  • Edge-Aware and Channel-Spatial Fusions: SwinNet employs an edge-guided module that fuses depth-enhanced edge cues with RGB saliency predictions. Spatial alignment and channel recalibration modules exploit cross-modality attention maps to align salient object regions before final saliency inference (Liu et al., 2022).
  • Set Prediction with Learned Queries: SalTR frames fixation-based saliency prediction as a set-prediction problem, outputting learned coordinates for fixation points by matching predicted and ground-truth fixations using bipartite assignment and global loss (Djilali et al., 2023).
  • Pixel-Level Generative Mechanisms: Generative transformer frameworks treat saliency maps as samples from a latent-variable distribution, yielding both mean saliency and pixel-wise uncertainty via sampling from a learned, energy-based prior (Zhang et al., 2021).
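
The sketch below, referenced in the first bullet, shows the general shape of a progressive decoder that upsamples coarse features and fuses them with finer ones before predicting a dense map; the feature shapes and channel counts are assumptions, and it is not a reproduction of GLSTR's or any other cited decoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProgressiveSaliencyDecoder(nn.Module):
    """Fuse coarse-to-fine encoder features into a dense saliency map."""
    def __init__(self, channels=(256, 256, 256)):   # one entry per encoder stage
        super().__init__()
        self.fuse = nn.ModuleList()
        prev = channels[-1]
        for c in reversed(channels[:-1]):
            self.fuse.append(nn.Sequential(
                nn.Conv2d(prev + c, c, kernel_size=3, padding=1),
                nn.ReLU(inplace=True)))
            prev = c
        self.head = nn.Conv2d(prev, 1, kernel_size=1)   # 1-channel saliency logits

    def forward(self, feats):            # feats: list of (B, C_i, H_i, W_i), coarsest last
        x = feats[-1]
        for fuse, skip in zip(self.fuse, reversed(feats[:-1])):
            x = F.interpolate(x, size=skip.shape[-2:], mode='bilinear',
                              align_corners=False)        # upsample coarse features
            x = fuse(torch.cat([x, skip], dim=1))          # cross-level fusion
        return torch.sigmoid(self.head(x))                 # dense saliency map
```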

3. Cross-Modality and Attention Mechanisms

Effective transformer-derived saliency relies on careful handling of multimodal information and advanced attention mechanisms:

  • Cross-Modality Fusion: In RGB-D/T tasks, transformers apply channel-spatial cross-fusion (e.g., via Cross-Modality Transformer or interactive attention) for joint feature enhancement (Liu et al., 2022, Liu et al., 2023, Jia et al., 2022, Zeng et al., 2022, Wang et al., 2021). Fusion is often hierarchical (e.g., intra-level alignment, inter-level decoding, and final edge-guidance in SwinNet (Liu et al., 2022)).
  • Attention Masking and Saliency Guidance: SalViT integrates external or self-derived saliency priors as soft "foreground" masks within self-attention blocks, restricting receptive fields to likely salient regions. Morphological learners adapt mask sharpness on the fly, and class-token-to-patch attention from self-supervised ViTs (e.g., DINO) can act as transformer-native saliency signals without additional detectors (Lu et al., 2023). A minimal sketch of this masking idea follows this list.
  • Spatial and Channel Recalibration: Spatial alignment maps (a shared spatial-attention map derived from elementwise multiplication of the modality features) and channel-attention maps (obtained via global max pooling, convolution, and sigmoid activation) reweight feature contributions to amplify salient cues (Liu et al., 2022).
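
A minimal sketch of the masking idea from the second bullet above: a soft foreground prior is turned into an additive bias on the attention logits so that queries attend preferentially to likely-salient tokens. The log-bias formulation and temperature are illustrative assumptions rather than SalViT's exact design.

```python
import torch

def saliency_masked_attention(q, k, v, saliency, temperature=1.0):
    """Scaled dot-product attention with an additive log-saliency bias on keys.

    q, k, v:   (B, heads, N, d) token projections
    saliency:  (B, N) soft foreground prior in [0, 1] per token
    """
    d = q.shape[-1]
    logits = q @ k.transpose(-2, -1) / d ** 0.5               # (B, heads, N, N)
    # Bias attention toward tokens the prior marks as salient.
    bias = torch.log(saliency.clamp_min(1e-6)) / temperature  # (B, N)
    logits = logits + bias[:, None, None, :]                  # broadcast over heads/queries
    attn = logits.softmax(dim=-1)
    return attn @ v
```

In practice the prior can come from an external detector or from the class-token-to-patch attention of a self-supervised ViT, as noted above.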

4. Advances in Specialized Domains

4.1. Video Saliency

UniST introduces a spatio-temporal saliency-aware transformer for video, using cascaded global attention and cross-scale fusion across multiple temporal and spatial resolutions. A semantic-guided block produces initial saliency cues, which guide higher-resolution stages through attention map transfer and multi-scale aggregation. The architecture can be switched between saliency prediction (continuous densities) and salient object mask detection (segmentation), supporting multiple video saliency tasks in a unified model (Xiong et al., 2023).

4.2. Remote Sensing and Orientation Sensitivity

GeleNet addresses salient object detection (SOD) in optical remote sensing imagery by combining a PVT-v2 transformer backbone with Direction-aware Shuffle Weighted Spatial Attention Modules (D-SWSAM/SWSAM) for enhanced orientation selectivity and fine-grained detail extraction. A Knowledge Transfer Module applies self-attention-based pixelwise context modeling across middle-stage features, resulting in superior saliency localization for objects with arbitrary orientation and complex backgrounds (Li et al., 2023).

4.3. Textual Saliency

In language transformers, decoded gradient-based saliency approaches (e.g., "decoded Grad-CAM") compute intermediate layer importances, project hidden states back to token-vocabulary space with the pre-trained masked-language-model (MLM) head, and aggregate across layers for semantic coherence, yielding token-level explanations with minimal computational overhead (Hou et al., 2023).
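
For illustration, the sketch below computes a simpler gradient-times-input token saliency aggregated over layers, using the Hugging Face transformers API; it omits the decoding step through the MLM head that distinguishes decoded Grad-CAM, and the checkpoint name is only an example.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Example checkpoint; any sequence-classification transformer works here.
name = "textattack/bert-base-uncased-SST-2"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, output_hidden_states=True)
model.eval()

inputs = tok("The film is a quietly moving portrait of resilience.", return_tensors="pt")
outputs = model(**inputs)
hidden = outputs.hidden_states              # embedding output + one tensor per layer
for h in hidden:
    h.retain_grad()                         # keep gradients of intermediate activations

pred = outputs.logits.argmax(-1).item()
outputs.logits[0, pred].backward()          # gradient of the predicted class score

# Gradient-times-input saliency per layer, averaged across layers.
per_layer = [(h.grad * h).sum(-1).abs() for h in hidden if h.grad is not None]
saliency = torch.stack(per_layer).mean(0).squeeze(0)        # (seq_len,)

tokens = tok.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
for token, score in zip(tokens, saliency):
    print(f"{token:>12s}  {score.item():.3f}")
```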

5. Supervision, Training, and Losses

Transformer-derived saliency models combine multiple objective terms, matching their supervision schemas: dense mask losses for salient object detection, boundary-prediction losses in multi-task decoders such as VST++ (Liu et al., 2023), bipartite set-prediction losses over fixation coordinates in SalTR (Djilali et al., 2023), and token-supervised losses that accompany attention-reduction schemes (Liu et al., 2023).
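
A hedged sketch of such a multi-term objective, assuming a typical salient-object-detection setup with a dense mask and an optional auxiliary boundary head; the specific terms and weights are illustrative, not taken from any cited paper.

```python
import torch
import torch.nn.functional as F

def saliency_objective(pred_sal, gt_sal, pred_edge=None, gt_edge=None, edge_weight=0.5):
    """Combine a dense saliency loss with an optional boundary-prediction term.

    pred_sal, gt_sal:   (B, 1, H, W) predicted probabilities and binary masks
    pred_edge, gt_edge: optional boundary maps for a multi-task decoder head
    """
    bce = F.binary_cross_entropy(pred_sal, gt_sal)
    inter = (pred_sal * gt_sal).sum(dim=(2, 3))
    union = (pred_sal + gt_sal - pred_sal * gt_sal).sum(dim=(2, 3))
    iou = 1.0 - (inter / union.clamp_min(1e-6)).mean()        # soft IoU loss
    loss = bce + iou
    if pred_edge is not None:
        loss = loss + edge_weight * F.binary_cross_entropy(pred_edge, gt_edge)
    return loss
```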

6. Performance, Ablation, and Benchmarking

Transformer-derived saliency models consistently advance SOD and fixation-prediction state of the art:

  • Quantitative Metrics: Across S-measure ($S_\alpha$), F-measure ($F_\beta$), E-measure ($E_\xi$), Mean Absolute Error (MAE), and human-viewing correlation metrics, transformer-based approaches achieve superior numbers compared to contemporary CNN architectures (Liu et al., 2022, Djilali et al., 2023, Liu et al., 2023, Ren et al., 2021, Zhang et al., 2021, Li et al., 2023). MAE and the F-measure are sketched after this list.
  • Ablation Studies: Removal of components such as spatial alignment, edge guidance, or channel recalibration directly reduces F-score and degrades MAE, verifying the necessity of transformer-based context fusion and edge mechanisms (Liu et al., 2022, Liu et al., 2023, Ren et al., 2021, Li et al., 2023).
  • Generalization: Transformer backbones and fusion strategies lead to robust cross-dataset performance in RGB-D, RGB-T, video, and remote sensing domains, as well as efficient computational footprints when advanced attention reduction (e.g., Select-Integrate Attention, token-supervised loss) is used (Liu et al., 2023).
  • Uncertainty Quantification: Energy-based generative transformers model saliency with pixel-wise predictive variance, with uncertainty maps aligning with human ambiguity on object boundaries or cluttered scenes (Zhang et al., 2021).
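
Two of these metrics are simple to state directly; the NumPy sketch below (referenced in the first bullet) computes MAE and the thresholded F-measure, with the conventional $\beta^2 = 0.3$ taken as an assumption.

```python
import numpy as np

def mae(pred, gt):
    """Mean Absolute Error between a [0, 1] saliency map and a binary ground truth."""
    return np.abs(pred.astype(np.float64) - gt.astype(np.float64)).mean()

def f_measure(pred, gt, threshold=0.5, beta_sq=0.3):
    """F-measure of the binarized saliency map; beta^2 = 0.3 weights precision higher."""
    binary = pred >= threshold
    gt = gt.astype(bool)
    tp = np.logical_and(binary, gt).sum()
    precision = tp / max(binary.sum(), 1)
    recall = tp / max(gt.sum(), 1)
    if precision + recall == 0:
        return 0.0
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)
```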

7. Interpretability and Extensions

Transformer-derived saliency maps furnish interpretable explanations for both model predictions and model structure:

  • Token-level Attribution in NLP: Layer-wise decoded saliency explanations in sequence transformers improve correlation with human semantic discrimination and outperform classic explainers in perturbation-based evaluation (revealing/hiding games, semantic coherence overlap) (Hou et al., 2023).
  • Saliency for Keypoint Detection: SalViT demonstrates that self-attention can be gated by external or transformer-internal (DINO) saliency to improve few-shot keypoint localization, highlighting the utility of transformer-derived saliency priors even as soft input masks (Lu et al., 2023).
  • Uncertainty and Probabilistic Modeling: Transformer generative models provide not only pointwise saliency but also uncertainty estimates, facilitating downstream risk-aware inference (Zhang et al., 2021).
  • Cross-domain Generalization: The modularity of transformer architectures allows end-to-end adaptation for distinct modalities, supervision schemas (mask, fixations, token overlap), and prediction types (dense, coordinate, uncertainty).
