- The paper presents PeftCD, a novel framework that couples parameter-efficient fine-tuning (PEFT) with vision foundation models (VFMs) to enhance remote sensing change detection.
- It shows that keeping backbone weights frozen and training only lightweight LoRA or Adapter modules substantially improves IoU and F1 scores.
- The method achieves scalability and resource efficiency while maintaining high boundary accuracy and robust detection across benchmarks.
PeftCD: Parameter-Efficient Fine-Tuning of Vision Foundation Models for Remote Sensing Change Detection
Introduction
The PeftCD framework addresses the persistent challenges in remote sensing change detection (RSCD), including pseudo-change suppression, limited labeled data, and cross-domain generalization. The approach leverages Vision Foundation Models (VFMs)—specifically, Segment Anything Model v2 (SAM2) and DINOv3—using parameter-efficient fine-tuning (PEFT) strategies, namely LoRA and Adapter modules. By freezing the majority of backbone parameters and only training lightweight adaptation modules, PeftCD achieves high task adaptability with minimal computational overhead, while maintaining the strong representational priors of large-scale VFMs.
Figure 1: Architecture of PeftCD. Bi-temporal images are encoded by a shared VFM backbone with injected PEFT modules. Features undergo exchange before decoding into a change map.
Architecture and Methodology
Backbone Selection and PEFT Integration
PeftCD is instantiated with two leading VFMs:
- SAM2: A segmentation-oriented ViT backbone with strong boundary priors, trained on over 1.1B masks.
- DINOv3: A self-supervised ViT trained on roughly 1.7B curated images, providing robust, generalizable semantic features.
PEFT is realized via:
- LoRA: Low-rank update matrices injected into the query/key/value (qkv) projections of the multi-head self-attention (MHSA) layers, with rank r=8 and scaling factor α=32.
- Adapter: Bottleneck MLP modules (bottleneck dim=32, GELU activation) inserted before each Transformer block.
All backbone weights are frozen; only PEFT module parameters are updated during fine-tuning.
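A minimal PyTorch sketch of the two PEFT variants under the stated hyperparameters (LoRA r=8, α=32; Adapter bottleneck dim 32 with GELU). Class and parameter names are illustrative, not the authors' implementation, and the freezing helper assumes PEFT parameters are identifiable by name:

```python
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer (e.g., a ViT qkv projection) plus a trainable
    low-rank update: y = Wx + (alpha / r) * B(A(x))."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # backbone weights stay frozen
        self.lora_a = nn.Linear(base.in_features, r, bias=False)
        self.lora_b = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)   # low-rank update starts at zero
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

class Adapter(nn.Module):
    """Bottleneck MLP with a residual connection, inserted before a block."""
    def __init__(self, dim: int, bottleneck: int = 32):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)       # residual branch starts at zero
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))

def freeze_all_but_peft(model: nn.Module):
    """Train only parameters whose names mark them as PEFT modules
    (assumes 'lora_' / 'adapter' appear in those parameter names)."""
    for name, p in model.named_parameters():
        p.requires_grad = ("lora_" in name) or ("adapter" in name)
```

Zero-initializing the second LoRA matrix and the Adapter up-projection means each PEFT branch starts as an identity mapping, so fine-tuning begins exactly from the pretrained VFM behavior.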
Bi-Temporal Feature Encoding and Exchange
A weight-sharing Siamese encoder processes the bi-temporal inputs. After feature extraction, a cross-temporal feature exchange is performed, swapping features at selected layers between the two streams to enhance change-relevant representation. This operation is critical for robust modeling of temporal differences.
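The summary does not spell out the exact exchange rule; a common instantiation, and one plausible reading, swaps a fixed fraction of channels between the two streams at the selected stages. A minimal sketch under that assumption:

```python
import torch

def exchange_features(f1: torch.Tensor, f2: torch.Tensor, ratio: float = 0.5):
    """Swap the first `ratio` fraction of channels between the two
    bi-temporal feature maps (each of shape B x C x H x W).
    The fraction and the set of exchange layers are assumptions here."""
    c = int(f1.size(1) * ratio)
    g1 = torch.cat([f2[:, :c], f1[:, c:]], dim=1)
    g2 = torch.cat([f1[:, :c], f2[:, c:]], dim=1)
    return g1, g2
```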
Decoder Design
SAM2CD Instantiation
Figure 2: SAM2CD: PeftCD with SAM2 Backbone. Features are exchanged between temporal streams and fused through FPN before dual-branch decoding with shared weights.
- FPN-based Decoder: Aggregates multi-scale features from the SAM2 encoder.
- ResBlock-based Progressive Upsampling: Refines and upsamples features to produce the final change map.
- Dual-Branch Decoding: Each branch outputs a change probability map; outputs are averaged at inference for stability.
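As a rough illustration of the ResBlock-based head and the weight-shared dual-branch averaging (channel widths and stage count are assumptions, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """3x3 convolutional residual block used between upsampling steps."""
    def __init__(self, ch: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(x + self.body(x))

class ChangeHead(nn.Module):
    """Progressive upsampling: ResBlock then 2x upsample, repeated,
    ending in a single-channel change logit."""
    def __init__(self, ch: int, stages: int = 2):
        super().__init__()
        self.blocks = nn.ModuleList([ResBlock(ch) for _ in range(stages)])
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.out = nn.Conv2d(ch, 1, kernel_size=1)

    def forward(self, x):
        for blk in self.blocks:
            x = self.up(blk(x))
        return self.out(x)

def predict(head: ChangeHead, feat_a: torch.Tensor, feat_b: torch.Tensor):
    """Dual-branch decoding with a weight-shared head; the two change
    probability maps are averaged at inference for stability."""
    return 0.5 * (head(feat_a).sigmoid() + head(feat_b).sigmoid())
```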
DINO3CD Instantiation
Figure 3: Architecture of the PeftCD model based on DINOv3 (denoted as DINO3CD).
- Same-Scale Multi-Layer Fusion: Features from multiple Transformer layers (e.g., 5, 11, 17, 23) are fused via position-adaptive attention.
- ASPP-based Contextual Enhancement: Multi-dilation convolutions capture diverse spatial context.
- Progressive Upsampling: Three-stage upsampling with depthwise separable convolutions restores spatial resolution.
This decoder design compensates for the lack of a spatial pyramid in ViT backbones, enabling precise boundary recovery and small-object detection.
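A compact sketch of two of the decoder ingredients named above, ASPP-style dilated context aggregation and depthwise separable upsampling; the dilation rates and channel handling are illustrative assumptions rather than the paper's settings:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    """Parallel 3x3 convolutions at several dilation rates capture
    context at multiple scales, then a 1x1 conv fuses the branches."""
    def __init__(self, in_ch: int, out_ch: int, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r, bias=False),
                nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
            for r in rates
        ])
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, 1)

    def forward(self, x):
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))

class DWSepUp(nn.Module):
    """One upsampling stage: depthwise separable conv, then 2x bilinear
    upsampling; three such stages would restore full resolution."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.dw = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=False)
        self.pw = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)

    def forward(self, x):
        x = F.relu(self.bn(self.pw(self.dw(x))))
        return F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
```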
Experimental Results
PeftCD is evaluated on seven public RSCD datasets: SYSU-CD, WHUCD, MSRSCD, MLCD, CDD, S2Looking, and LEVIR-CD. Across all benchmarks, PeftCD achieves state-of-the-art (SOTA) performance, with notable improvements in IoU, F1, and precision metrics.
- SYSU-CD: IoU 73.81%, F1 84.93%
- WHUCD: IoU 92.05%, F1 95.86%
- MSRSCD: IoU 64.07%, F1 78.10%
- MLCD: IoU 76.89%, F1 86.93%
- CDD: IoU 97.01%, F1 98.48%
- S2Looking: IoU 52.25%, F1 68.64%
- LEVIR-CD: IoU 85.62%, F1 92.25%
Qualitative comparisons demonstrate superior boundary delineation, reduced false positives, and robust detection of small and irregular change regions.
Figure 4: Qualitative comparison between PeftCD and competing methods on the SYSU-CD dataset.
Figure 5: Qualitative comparison between PeftCD and competing methods on the WHUCD dataset.
Figure 6: Qualitative comparison between PeftCD and competing methods on the S2Looking dataset.
Figure 7: Qualitative comparison between PeftCD and competing methods on the MSRSCD dataset.
Figure 8: Qualitative comparison between PeftCD and competing methods on the MLCD dataset.
Figure 9: Qualitative comparison between PeftCD and competing methods on the CDD dataset.
Figure 10: Qualitative comparison between PeftCD and competing methods on the LEVIR-CD dataset.
Ablation and Analysis
Ablation studies confirm:
- PEFT Efficacy: Both LoRA and Adapter yield substantial gains over frozen backbones, with DINOv3+LoRA providing the most consistent cross-dataset improvements.
- Decoder Design: The MFCE (multi-layer fusion and context enhancement) decoder in DINO3CD significantly outperforms baseline upsampling, validating the necessity of multi-layer fusion and context enhancement for ViT-based RSCD.
- Parameter Efficiency: PeftCD achieves SOTA results with 2.86M–11M trainable parameters, an order of magnitude fewer than many competing models (e.g., CDNeXt at 242M).
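The reported trainable-parameter budgets are straightforward to audit; a generic PyTorch helper such as the following (not from the paper) prints the trainable fraction after PEFT injection and backbone freezing:

```python
def count_params(model):
    """Report trainable vs. total parameters of a (partially frozen) model."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable: {trainable / 1e6:.2f}M / total: {total / 1e6:.2f}M "
          f"({100 * trainable / total:.2f}%)")
```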
Discussion
Decoding with ViT Backbones
The single-scale nature of ViT backbones (e.g., DINOv3) limits spatial detail recovery. The proposed MFCE decoder, combining deep attention fusion and ASPP, partially mitigates this, but further research is needed for efficient multi-layer fusion and adaptive context modeling, especially for small-object and boundary-sensitive tasks.
PEFT Adaptability and Limitations
LoRA and Adapter modules enable rapid adaptation to RSCD with minimal overhead. However, adaptation capacity is constrained when domain gaps are large (e.g., side-looking imagery in S2Looking). Current PEFT configurations are static; future work should explore adaptive PEFT strategies and cross-layer interaction mechanisms.
Practical Implications
PeftCD's parameter efficiency and strong generalization make it suitable for deployment in resource-constrained or real-time remote sensing applications. The framework demonstrates that large-scale VFMs can be effectively adapted to specialized tasks without full fine-tuning, supporting scalable and sustainable model deployment in operational RSCD systems.
Conclusion
PeftCD establishes a new paradigm for remote sensing change detection by integrating vision foundation models with parameter-efficient fine-tuning. The framework achieves SOTA performance across diverse datasets, with strong boundary accuracy, pseudo-change suppression, and minimal trainable parameters. The results validate the effectiveness of PEFT for adapting large vision models to specialized geospatial tasks and highlight promising directions for efficient, generalizable, and scalable RSCD solutions.