Prior-guided Fusion of Multimodal Features for Change Detection from Optical-SAR Images

Published 7 Apr 2026 in cs.CV | (2604.05527v1)

Abstract: Multimodal change detection (MMCD) identifies changed areas in multimodal remote sensing (RS) data, demonstrating significant application value in land use monitoring, disaster assessment, and urban sustainable development. However, literature MMCD approaches exhibit limitations in cross-modal interaction and exploiting modality-specific characteristics. This leads to insufficient modeling of fine-grained change information, thus hindering the precise detection of semantic changes in multimodal data. To address the above problems, we propose STSF-Net, a framework designed for MMCD between optical and SAR images. STSF-Net jointly models modality-specific and spatio-temporal common features to enhance change representations. Specifically, modality-specific features are exploited to capture genuine semantic change signals, while spatio-temporal common features are embedded to suppress pseudo-changes caused by differences in imaging mechanisms. Furthermore, we introduce an optical and SAR feature fusion strategy that adaptively adjusts feature importance based on semantic priors obtained from pre-trained foundational models, enabling semantic-guided adaptive fusion of multi-modal information. In addition, we introduce the Delta-SN6 dataset, the first openly-accessible multiclass MMCD benchmark consisting of very-high-resolution (VHR) fully polarimetric SAR and optical images. Experimental results on Delta-SN6, BRIGHT, and Wuhan-Het datasets demonstrate that our method outperforms the state-of-the-art (SOTA) by 3.21%, 1.08%, and 1.32% in mIoU, respectively. The associated code and Delta-SN6 dataset will be released at: https://github.com/liuxuanguang/STSF-Net.

Authors (10)

Summary

The paper introduces STSF-Net, which fuses optical and SAR features under semantic priors to improve fine-grained change detection.
It employs a dual-branch encoder with spatio-temporal common feature extraction to outperform 13 state-of-the-art methods.
The authors establish the Delta-SN6 benchmark, a high-resolution, multiclass dataset that enhances evaluation in urban and disaster scenarios.

Prior-Guided Fusion of Multimodal Features for Optical–SAR Change Detection

Introduction

This paper introduces STSF-Net, a multimodal change detection (MMCD) framework that fuses optical and synthetic aperture radar (SAR) features under semantic priors from visual foundation models (VFMs). In remote sensing, MMCD between optical and SAR imagery is critical for robust land use monitoring and disaster assessment because the two modalities are complementary: optical sensors provide rich spectral and textural detail under clear conditions, whereas SAR penetrates clouds and operates independently of lighting, ensuring all-weather monitoring.

However, significant challenges persist. The inherent discrepancies—spectral reflection in optical versus dielectric and structural scattering in SAR—induce substantial cross-modal gaps, making it difficult to align, fuse, and exploit both modality-specific and modality-agnostic cues for fine-grained semantic change detection. The literature predominantly relies on post-classification comparison or generative translation and feature alignment, which frequently ignore spatio-temporal context and underutilize modality-specific signals. These deficiencies result in suppressed discriminative power and vulnerability to pseudo-changes.

Core Contributions

Disentangled Feature Representation: STSF-Net employs dual branches—a pseudo-siamese network (with unshared weights) for modality-specific encoding and a spatio-temporal common feature encoder (STCFE) for semantically-aligned representations. This design avoids over-reliance on shared-space projections, balancing modality-specific discriminability and cross-modal consistency.
Semantic Prior-Guided Fusion: The fusion module leverages semantic priors from a frozen VFM (SAM2) as adaptive weights to control the fusion of specific and common features. Semantic priors are computed via per-pixel distance maps derived from VFM-pretrained optical/SAR encodings, allowing self-adaptive enhancement of salient changes and robust suppression of pseudo-changes.
Delta-SN6 Dataset: The authors contribute Delta-SN6, the first large-scale, multiclass MMCD benchmark with co-registered, fully polarimetric VHR SAR and optical imagery, annotated for fine-grained semantic change categories, including directionality (e.g., addition/removal of buildings, roads, water). This addresses a key bottleneck in multi-modal change detection research.

Methodological Details

Disentangled Feature Encoding

The dual-branch encoder is asymmetrical: the optical branch fine-tunes a VFM backbone (SAM2), thus inheriting strong general semantic priors, whereas the SAR branch utilizes a Swin Transformer to learn domain-specific scattering and structure responses. This avoids destructive interference between modalities, and ablation shows this design outperforms conventional shared-weight or naive concatenation approaches in both overall and per-class accuracy.

Spatio-Temporal Commonality Modeling

A two-stage module aligns bi-temporal features. First, the feature interaction module (FIM) generates initial channel attention via cross-modal concatenation and convolution. Second, GSFM, a graph convolutional network module, embeds spatial structure and contextual dependencies through node-wise aggregation within feature maps. Together, the two stages capture consistent spatial patterns while remaining sensitive to genuine temporal signals.
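
The node-wise aggregation idea can be illustrated with a minimal pure-Python sketch. It assumes each pixel of a feature map is a graph node connected to its 4 grid neighbours plus itself, and that aggregation is a plain mean. The graph construction, the function name grid_graph_aggregate, and the toy values are illustrative assumptions, not the paper's design; learned weights and nonlinearities of a real graph convolution layer are omitted.

```python
def grid_graph_aggregate(feat):
    """One mean-aggregation step over a 4-neighbour grid graph with self-loops.

    feat is an H x W x C feature map stored as nested lists.
    (Assumed graph construction; the paper's graph may differ.)
    """
    h, w = len(feat), len(feat[0])
    c = len(feat[0][0])
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            # The node itself plus its up/down/left/right neighbours.
            candidates = [(i, j), (i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
            valid = [(a, b) for a, b in candidates if 0 <= a < h and 0 <= b < w]
            # Average every channel over the valid neighbourhood.
            row.append([sum(feat[a][b][k] for a, b in valid) / len(valid)
                        for k in range(c)])
        out.append(row)
    return out


# Toy 2 x 2 map with a single channel.
toy_feat = [[[0.0], [3.0]],
            [[6.0], [9.0]]]
smoothed = grid_graph_aggregate(toy_feat)
```

Each output value mixes a pixel with its spatial context, which is the sense in which graph aggregation embeds structural dependency into the features.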

Semantic Prior-Guided Fusion

The fusion is dynamically controlled by change intensity maps, computed as per-pixel Euclidean distances between the VFM-derived optical and SAR encodings. For each spatial position, a fusion weight is generated: regions with high change probability up-weight the modality-specific residual, while unchanged or background regions preferentially propagate common features. The fusion is therefore spatially adaptive, which the paper reports improves discrimination in heterogeneous environments.
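
A minimal pure-Python sketch of this weighting scheme follows. It assumes the change intensity is min-max normalized to [0, 1] and used directly as the weight on the modality-specific residual, so that fused = common + w * specific. The normalization, the exact fusion formula, the function names, and the toy values are all illustrative assumptions rather than the authors' implementation.

```python
import math


def change_intensity(prior_a, prior_b):
    """Per-pixel Euclidean distance between two H x W x C prior feature maps."""
    return [[math.dist(pa, pb) for pa, pb in zip(row_a, row_b)]
            for row_a, row_b in zip(prior_a, prior_b)]


def normalize(intensity):
    """Min-max scale an H x W map to [0, 1] (assumed normalization)."""
    flat = [v for row in intensity for v in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span == 0:
        return [[0.0 for _ in row] for row in intensity]
    return [[(v - lo) / span for v in row] for row in intensity]


def prior_guided_fusion(common, specific, weights):
    """Assumed fusion rule: fused = common + w * specific, per pixel and channel."""
    return [[[c + w * s for c, s in zip(c_px, s_px)]
             for c_px, s_px, w in zip(c_row, s_row, w_row)]
            for c_row, s_row, w_row in zip(common, specific, weights)]


# Toy 1 x 2 map with 2 channels: the first pixel is unchanged, the second changed.
prior_optical = [[[0.0, 0.0], [1.0, 1.0]]]
prior_sar = [[[0.0, 0.0], [4.0, 5.0]]]
weights = normalize(change_intensity(prior_optical, prior_sar))

common = [[[1.0, 1.0], [1.0, 1.0]]]
specific = [[[2.0, 2.0], [2.0, 2.0]]]
fused = prior_guided_fusion(common, specific, weights)
```

In this toy case the unchanged pixel receives weight 0 and passes the common features through untouched, while the changed pixel receives weight 1 and adds the full modality-specific residual.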

Experimental Validation

The framework is extensively evaluated on Delta-SN6, BRIGHT, and Wuhan-Het benchmarks against 13 SOTA methods (CNN, Transformer, Mamba-based, and VFM-driven baselines). The key quantitative results are:

Delta-SN6: STSF-Net achieves 94.60% F1 (binary change), 91.33% mIoU, and 99.54% OA—improving over the second-best by 3.18% mIoU and demonstrating superior multiclass and localization performance.
BRIGHT: STSF-Net outperforms GSTM-SCD and DamageFormer, excelling in identifying severely damaged (56.8% IoU) and intact buildings (75.01% IoU), reducing false positives and omission errors.
Wuhan-Het: The approach achieves 57.79% F1 and 64.25% mIoU, balancing recall and precision better than all comparators, especially in urban heterogeneous change regions.
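
For readers unfamiliar with the reported metrics, the sketch below computes overall accuracy (OA), mean IoU, and a binary change F1 from flat label lists using their standard definitions. Whether the paper averages IoU over all classes including the no-change class, and how it binarizes multiclass labels for F1, are assumptions here; the function names and toy labels are hypothetical.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """cm[t][p] counts pixels with true class t predicted as class p."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def overall_accuracy(cm):
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total


def per_class_iou(cm):
    n = len(cm)
    ious = []
    for k in range(n):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(n)) - tp  # predicted k, true class differs
        fn = sum(cm[k]) - tp                       # true k, predicted class differs
        denom = tp + fp + fn
        ious.append(tp / denom if denom else 0.0)
    return ious


def mean_iou(cm):
    ious = per_class_iou(cm)
    return sum(ious) / len(ious)


def binary_f1(y_true, y_pred, background=0):
    """F1 for change vs. no change; any non-background label counts as change."""
    pairs = [(t != background, p != background) for t, p in zip(y_true, y_pred)]
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy labels: class 0 = no change, classes 1 and 2 = two change types.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(y_true, y_pred, num_classes=3)
oa = overall_accuracy(cm)
miou = mean_iou(cm)
f1 = binary_f1(y_true, y_pred)
```

OA is dominated by the majority (usually unchanged) class, which is why a very high OA can coexist with a noticeably lower mIoU or F1 on the same benchmark.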

In all cases, ablation confirms that each module (spatio-temporal feature modeling, pseudo-siamese encoding, and the prior-guided fusion module, PGFFM) contributes to the gains. Incorporating PGFFM into other baselines also consistently improves their mIoU, supporting its use as a generic cross-modal fusion block.

Advantages of the Delta-SN6 Benchmark

Delta-SN6 addresses critical gaps noted in prior benchmarks (see Table 2 in the paper):

Resolution and Modal Diversity: It provides co-registered VHR (0.5 m) optical and fully polarimetric SAR images, a significant improvement over previous datasets with coarser resolution and limited sensor modalities.
Semantic Granularity: 2,818 annotated change instances across buildings, roads, and water, including directionality, foster nuanced algorithm evaluation.
Temporal Coverage: With multi-temporal annotations spanning over a decade, it supports research beyond simple pre/post binary evaluations.

These features enable not only benchmarking but also comprehensive study of deep networks’ ability to discriminate subtle, complex surface dynamics in highly heterogeneous multimodal remote sensing data.

Implications and Future Research Outlook

The disentanglement of modality-specific and common representations, guided by high-level priors, advances the precision and reliability of MMCD networks, especially for scenarios marked by significant cross-domain gaps (e.g., urban change, disaster damage). The prior-guided fusion paradigm also transfers to other architectures, as indicated by the mIoU improvements observed when the fusion module is added to other baselines.

Practically, these results facilitate more robust and explainable change detection for large-scale, time-critical applications in urban planning, infrastructure monitoring, and rapid disaster assessment, especially under non-ideal conditions (e.g., cloud cover, irregular acquisition intervals).

Potential future directions enabled by this work include:

Unified MMCD Architectures: Models that dynamically adapt encoding/fusion strategies for both homogeneous and heterogeneous (e.g., optical-optical, optical-SAR) input pairs, as suggested by the Delta-SN6 support for both input types.
Temporal Generalization: Extension from bi-temporal to true time-series change detection, leveraging the modularity of the STSF-Net framework.
Model Compression: Knowledge distillation and efficient selection for real-time or edge deployment, as discussed in the paper’s efficiency analysis.
Multimodal Foundation Models: Further research into foundation models pretrained specifically for Earth observation, leveraging both natural and synthetic data, to improve transferability beyond optical imagery.

Conclusion

This work establishes a principled, modular approach to MMCD by leveraging disentangled feature spaces and semantic priors, validated on a newly curated, high-value dataset. The integration of VFM-driven adaptive fusion targeting both specificity and consistency sets a new technical standard, enabling robust and scalable change detection under challenging multimodal scenarios. The practical and theoretical implications are significant, opening avenues for unified, high-precision, and efficient cross-domain change analytics in Earth observation and remote sensing research.