SACNet: Unaligned RGBT Salient Object Detection

Updated 26 May 2026

SACNet is a deep learning architecture for unaligned RGB-Thermal salient object detection, fusing modalities with semantics-guided asymmetric correlation and deformable feature sampling.
It employs dual Swin-B backbones with specialized modules (ACM and AFSM) to robustly handle misalignment in real-world sensor data.
Evaluations on the UVT2000 benchmark show higher E-measure and lower MAE than prior RGBT SOD methods.

SACNet refers to a range of distinct deep learning architectures designed primarily for computer vision tasks. Most notably, the term denotes the Semantics-guided Asymmetric Correlation Network for alignment-free RGB-Thermal (RGBT) Salient Object Detection. This architecture introduces fundamental innovations targeting the longstanding challenge of fusing visible and thermal modalities captured under real-world, spatially unaligned conditions. The following survey focuses on SACNet as defined in "Alignment-Free RGBT Salient Object Detection: Semantics-guided Asymmetric Correlation Network and A Unified Benchmark" (Wang et al., 2024), while distinguishing it from similarly named but unrelated models in the literature.

1. Motivation: Alignment-Free RGBT Salient Object Detection

RGB-Thermal Salient Object Detection (SOD) aims to accurately segment visually distinctive objects by leveraging the complementary information present in both visible (RGB) and thermal image sensors. Crucially, most prior models assume pixelwise-aligned image pairs, neglecting the considerable misalignment that naturally occurs due to sensor calibration errors, perspective discrepancies, and physical hardware limitations. In practical settings, RGB and thermal salient regions are often mis-registered (translation, scale, and rotation), confounding naive feature fusion techniques based on concatenation, summation, or local attention mechanisms. Existing data augmentation approaches—such as those in DCNet using small-range random affine transformation—fail to address the full diversity and magnitude of real-world misalignment, yielding substantial performance degradation on unaligned data.

2. SACNet Architecture: Core Modules and Pipeline

SACNet introduces an end-to-end solution for unaligned RGBT SOD, architecturally distinguished by two key components:

Asymmetric Correlation Module (ACM): Employs semantics-guided, windowed cross-modal correlation restricted to asymmetric window pairs, allowing correlation between a small region in one modality and a larger spatial region in the second modality. This addresses misalignment by spatially covering probable locations of a salient object across modalities with explicit semantic guidance.
Associated Feature Sampling Module (AFSM): Adds alignment refinement after ACM, using cascaded deformable convolutions to sample and realign the thermal features conditioned on the corresponding RGB features, producing fused multi-modal representations suitable for sharp saliency segmentation.

The architecture comprises dual parallel backbones (one per modality), each a Swin-B transformer pretrained on ImageNet. Features are extracted as pyramids $\{f^i_{rgb}, f^i_{t}\}_{i=1}^{4}$, and at pyramid levels 2–4 the ACM and AFSM are applied in sequence. A U-Net-style decoder aggregates these enriched and realigned features to predict saliency maps.
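The pyramid layout described above can be sketched as a small shape calculation. This is only an illustration: the strides and channel widths are the standard Swin-B stage configuration, while the input size of 384 and the names `PyramidLevel` and `feature_pyramid` are assumptions made for the example, not values taken from the paper.

```python
from dataclasses import dataclass

# Standard Swin-B stage layout: downsampling strides and channel widths.
SWIN_B_STRIDES = (4, 8, 16, 32)
SWIN_B_CHANNELS = (128, 256, 512, 1024)


@dataclass(frozen=True)
class PyramidLevel:
    index: int           # i in f^i_rgb / f^i_t (1-based)
    size: int            # spatial side length of the feature map
    channels: int        # channel width at this level
    uses_acm_afsm: bool  # True where ACM followed by AFSM is applied


def feature_pyramid(input_size: int) -> list[PyramidLevel]:
    """Describe the four-level pyramid produced for one modality."""
    levels = []
    for i, (stride, ch) in enumerate(zip(SWIN_B_STRIDES, SWIN_B_CHANNELS), start=1):
        if input_size % stride:
            raise ValueError(f"input size {input_size} not divisible by stride {stride}")
        # The text states that ACM and AFSM run at pyramid levels 2-4 only.
        levels.append(PyramidLevel(i, input_size // stride, ch, uses_acm_afsm=i >= 2))
    return levels


# Hypothetical input size, chosen only to make the shapes concrete.
for level in feature_pyramid(384):
    print(level)
```

Both backbones produce the same layout, so the RGB and thermal maps at a given level have matching shapes before they reach the ACM.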

3. Details of ACM and AFSM

Asymmetric Correlation Module (ACM):

Asymmetric Window Partition (AWP): Given feature maps $f^i_{rgb}$ and $f^i_{t}$, non-overlapping small windows of size $M \times M$ are extracted from one modality, and for each, a larger $N \times N$ window is centered at the same location in the other modality, covering probable spatial shifts and scale mismatches. The partition is applied in both directions by swapping the roles of the two modalities. The best-performing configuration was $M=4$, $N=6$.
Semantics-Guided Attention: High-level semantic features are globally pooled and correlated using multi-head attention over the concatenated feature representations. Resulting semantic maps are upsampled and applied via element-wise multiplication to each modality's local features, suppressing background noise and biasing towards salient regions.
Windowed Cross-Modal Attention: For each asymmetric window pair, a transformer-like attention operation is applied to measure inter-modality affinity, producing enhanced features $\tilde{f}_{rgb}$, $\tilde{f}_{t}$.
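The asymmetric partition and the windowed attention can be illustrated with a minimal NumPy sketch. It makes several simplifying assumptions that are not from the paper: a single attention head, no learned query/key/value projections, zero padding at the borders, and no semantic guidance. The function names are hypothetical.

```python
import numpy as np


def small_windows(feat, M):
    """Split an (H, W, C) map into non-overlapping M x M windows: (nW, M*M, C)."""
    H, W, C = feat.shape
    assert H % M == 0 and W % M == 0, "H and W must be multiples of M"
    x = feat.reshape(H // M, M, W // M, M, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, M * M, C)


def large_windows(feat, M, N):
    """For each M x M grid cell, take the N x N window centered on it: (nW, N*N, C)."""
    H, W, C = feat.shape
    assert N >= M and (N - M) % 2 == 0, "N - M must be even to center the window"
    pad = (N - M) // 2
    # Zero padding at the borders is an assumption of this sketch.
    padded = np.pad(feat, ((pad, pad), (pad, pad), (0, 0)))
    wins = [padded[r:r + N, c:c + N].reshape(N * N, C)
            for r in range(0, H, M) for c in range(0, W, M)]
    return np.stack(wins)


def cross_window_attention(query_feat, context_feat, M=4, N=6):
    """Each small query window attends to the larger window of the other modality."""
    q = small_windows(query_feat, M)        # (nW, M*M, C)
    kv = large_windows(context_feat, M, N)  # (nW, N*N, C)
    scores = q @ kv.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=-1, keepdims=True)
    return attn @ kv                        # (nW, M*M, C)


rng = np.random.default_rng(0)
f_rgb = rng.standard_normal((8, 8, 16))
f_t = rng.standard_normal((8, 8, 16))
rgb_enhanced = cross_window_attention(f_rgb, f_t)  # RGB queries, thermal context
t_enhanced = cross_window_attention(f_t, f_rgb)    # roles swapped
print(rgb_enhanced.shape, t_enhanced.shape)
```

With $M=4$ and $N=6$, each 16-token query window sees 36 context tokens, so a salient object shifted by up to one feature-map pixel in any direction still falls inside the context window.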

Associated Feature Sampling Module (AFSM):

Deformable convolutions are deployed in a cascade (depth 4) to learn spatial sampling offsets for thermal features, conditioned on the corresponding RGB features. This learned offset mechanism allows for dynamic, spatially adaptive alignment at each location, correcting residual misalignments not addressed by ACM. The output features are then fused via a $1 \times 1$ convolution.
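The core idea behind offset-based realignment can be shown with a simplified NumPy sketch. Note that this is single-point bilinear resampling with a given per-pixel offset field, which is plainly simpler than a deformable convolution (where each kernel tap has its own learned offset). In the actual module the offsets would be predicted by learned layers; here they are supplied by hand, and `sample_with_offsets` is a hypothetical name.

```python
import numpy as np


def sample_with_offsets(feat, offsets):
    """Bilinearly resample feat (H, W, C) at (y + dy, x + dx) for every pixel.

    offsets has shape (H, W, 2) and stores (dy, dx) per location.
    Sampling positions are clamped to the valid image range.
    """
    H, W, _ = feat.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    y = np.clip(ys + offsets[..., 0], 0, H - 1)
    x = np.clip(xs + offsets[..., 1], 0, W - 1)
    y0 = np.floor(y).astype(int)
    x0 = np.floor(x).astype(int)
    y1 = np.minimum(y0 + 1, H - 1)
    x1 = np.minimum(x0 + 1, W - 1)
    wy = (y - y0)[..., None]  # fractional parts, broadcast over channels
    wx = (x - x0)[..., None]
    top = feat[y0, x0] * (1 - wx) + feat[y0, x1] * wx
    bottom = feat[y1, x0] * (1 - wx) + feat[y1, x1] * wx
    return top * (1 - wy) + bottom * wy


# Toy check: a thermal map that is the RGB map shifted right by two columns
# is realigned by a constant offset of dx = +2.
rng = np.random.default_rng(0)
f_rgb = rng.standard_normal((6, 8, 3))
f_t = np.roll(f_rgb, 2, axis=1)
offsets = np.zeros((6, 8, 2))
offsets[..., 1] = 2.0
f_t_aligned = sample_with_offsets(f_t, offsets)
print(np.allclose(f_t_aligned[:, :6], f_rgb[:, :6]))
```

Because the offsets vary per location, the same mechanism can in principle absorb local scale and rotation differences as well as global shifts.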

4. Loss Functions and Training Objective

SACNet uses a mixed loss to supervise the saliency prediction:

Binary Cross-Entropy (BCE) Loss: Penalizes pixel-level misclassifications.
Edge-Aware Smoothness Loss: Penalizes high-frequency errors, enforcing crisp boundaries (as per Godard et al.).
Dice Loss: Optimizes the overlap between predicted and ground-truth saliency maps, which helps when salient pixels are heavily outnumbered by background pixels.

This composite loss ensures both region-level and boundary-level accuracy.
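A minimal NumPy sketch of such a composite loss is given below. The equal weighting of the three terms, the smoothing constant in the Dice term, and the use of a single-channel guide image for the edge-aware term are assumptions of this example; the text does not specify them. The smoothness term follows the general Godard-style form, in which prediction gradients are down-weighted where the guide image itself has strong gradients.

```python
import numpy as np


def bce_loss(pred, target, eps=1e-7):
    """Mean binary cross-entropy between probabilities and a binary mask."""
    p = np.clip(pred, eps, 1 - eps)
    return float(-np.mean(target * np.log(p) + (1 - target) * np.log(1 - p)))


def dice_loss(pred, target, smooth=1.0):
    """1 - Dice coefficient, with a smoothing constant to avoid division by zero."""
    inter = np.sum(pred * target)
    return float(1 - (2 * inter + smooth) / (np.sum(pred) + np.sum(target) + smooth))


def edge_aware_smoothness(pred, guide):
    """Penalize prediction gradients, relaxed where the guide image has edges."""
    dpx = np.abs(np.diff(pred, axis=1))
    dpy = np.abs(np.diff(pred, axis=0))
    dgx = np.abs(np.diff(guide, axis=1))
    dgy = np.abs(np.diff(guide, axis=0))
    return float(np.mean(dpx * np.exp(-dgx)) + np.mean(dpy * np.exp(-dgy)))


def total_loss(pred, target, guide, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three terms; the weights are illustrative assumptions."""
    w_bce, w_smooth, w_dice = weights
    return (w_bce * bce_loss(pred, target)
            + w_smooth * edge_aware_smoothness(pred, guide)
            + w_dice * dice_loss(pred, target))


mask = np.zeros((8, 8))
mask[2:6, 2:6] = 1.0
print(total_loss(mask, mask, mask) < total_loss(1 - mask, mask, mask))
```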

5. UVT2000: A Unified Benchmark for Alignment-Free RGBT SOD

The SACNet paper also introduces UVT2000, presented as the first large-scale, alignment-free benchmark for RGBT SOD:

2000 RGB-thermal image pairs, captured directly from a FLIR SC620 rig, without any manual alignment.
RGB image resolution: […]; thermal: […].
Spans 295 scenes and 429 object categories, annotated with pixel-level saliency masks.
Includes 11 challenge attributes (e.g., low illumination, thermal crossover, small target, bad weather).
Standard split: 1000 train, 200 val, 800 test.

UVT2000 directly addresses the lack of real-world, diverse, unaligned multi-modal datasets for SOD.

6. Experimental Performance and Ablation Studies

SACNet was evaluated against 14 state-of-the-art RGBT SOD models across seven datasets (including both aligned and weakly aligned or unaligned protocols). Key metrics include E-measure, S-measure, F-measure, and MAE:

On UVT2000 (test): E-measure […], S-measure […], F-measure […], MAE […], outperforming the next-best SPNet ([…], […], […], MAE […]) with an average +7.9% […] and –39% MAE.
On aligned VT5000: E-measure […], S-measure […], F-measure […], MAE […], establishing state-of-the-art across both aligned and unaligned settings.
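Two of these metrics are simple enough to sketch directly. The example below computes MAE and a thresholded F-measure in NumPy; the fixed threshold of 0.5 and $\beta^2 = 0.3$ are common conventions in the SOD literature and are assumed here rather than taken from the paper. E-measure and S-measure are more involved and are not shown.

```python
import numpy as np


def mae(pred, gt):
    """Mean absolute error between a saliency map in [0, 1] and a binary mask."""
    return float(np.mean(np.abs(pred - gt)))


def f_measure(pred, gt, threshold=0.5, beta2=0.3):
    """F-measure of the binarized prediction; threshold and beta2 are assumed values."""
    binary = pred >= threshold
    positive = gt >= 0.5
    tp = np.sum(binary & positive)
    if tp == 0:
        return 0.0
    precision = tp / np.sum(binary)
    recall = tp / np.sum(positive)
    return float((1 + beta2) * precision * recall / (beta2 * precision + recall))


gt = np.array([[1.0, 0.0], [1.0, 0.0]])
pred = np.array([[0.9, 0.2], [0.4, 0.1]])
print(mae(pred, gt))        # mean of |0.1|, |0.2|, |0.6|, |0.1|
print(f_measure(pred, gt))  # precision 1.0, recall 0.5
```

A relative change such as "–39% MAE" is then computed as (new − old) / old on the MAE values of the two models.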

Ablation studies confirm that removing ACM increases MAE by 20–36%, omitting AWP results in ~21% higher MAE, removing semantic guidance increases MAE by 14%, and deleting AFSM causes a 15% MAE increase. These confirm the necessity of each component for robust unaligned SOD.

7. Implementation Notes and Future Directions

SACNet is implemented with Swin-B transformer backbones, trained at input size […] for both modalities, with AdamW (lr […], weight decay […]), batch size 8, for 200 epochs on two RTX 3090 GPUs (15 hours of training). The architecture is robust to the choice of backbone; ResNet-50 yields similar improvements.

Identified future directions include developing modules that target extreme conditions (e.g., hollow objects, strong glare) and expanding UVT2000 with more scenes, additional modalities (such as depth), and more complex misalignment patterns, to further bridge the sim-to-real gap in practical sensing deployments (Wang et al., 2024).


References (1)

Wang et al. (2024). Alignment-Free RGBT Salient Object Detection: Semantics-guided Asymmetric Correlation Network and A Unified Benchmark.