
UniVCD: Open-Vocabulary Change Detection

Updated 17 December 2025
  • UniVCD is an open-vocabulary method that uses foundation models to detect, localize, and classify changes between multi-temporal remote sensing images without relying on fixed category sets.
  • It integrates frozen foundation models (SAM2 and CLIP) with domain-specific modules and a feature alignment mechanism to achieve high spatial and semantic precision.
  • Quantitative results demonstrate improved metrics and robustness across datasets, marking a shift from supervised to unsupervised, open-world change detection.

Unified Open-Vocabulary Change Detection (UniVCD) refers to a new generation of methodologies designed to identify, localize, and categorize changes between multi-temporal remote sensing images using open-vocabulary vision-language models, without reliance on predefined category sets or labeled training data. These methods integrate frozen foundation models, notably large-scale vision transformers and multimodal architectures such as SAM2 and CLIP, combined with lightweight domain-specific modules. UniVCD systems enable detection of category-agnostic and user-specified changes across diverse imaging conditions, providing both high spatial resolution and semantic interpretability well beyond classical closed-set change detection approaches (Zhu et al., 15 Dec 2025, Li et al., 22 Jan 2025, Zhu et al., 12 Jan 2025).

1. Motivation and Problem Definition

Traditional change detection (CD) in remote sensing and Earth observation is predominantly supervised, operating over a small set of predefined semantic categories and requiring substantial annotated data. These systems generalize poorly in operational scenarios involving novel classes, sensor domains, or complex semantic shifts (Zhu et al., 15 Dec 2025, Li et al., 22 Jan 2025). Open-vocabulary change detection (OVCD) generalizes the CD task by leveraging the compositionality of vision-language models (VLMs), enabling detection for arbitrary user-supplied category prompts at inference.

The fundamental problem is, for a co-registered bi-temporal image pair $I_1$, $I_2$ of the same scene and a set of free-form target category prompts $\mathcal{C} = \{c_k\}$ (e.g., “building,” “vegetation,” “playground”), to output dense pixel-wise change masks $M^k$ for each class $k$, indicating where that semantic category has changed from $I_1$ to $I_2$ (Li et al., 22 Jan 2025, Zhu et al., 15 Dec 2025). These masks are not restricted to any closed label set, enabling completely open-ended scene understanding.
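As a minimal illustration of this input/output contract (the function name and array shapes below are hypothetical, not taken from the cited papers), an OVCD system maps a co-registered image pair and free-form prompts to one change mask per prompt:

```python
from typing import Dict, Sequence
import numpy as np

def open_vocabulary_change_detection(
    image_t1: np.ndarray,        # co-registered image at time 1, shape (H, W, 3)
    image_t2: np.ndarray,        # co-registered image at time 2, shape (H, W, 3)
    prompts: Sequence[str],      # free-form category prompts, e.g. ["building", "vegetation"]
) -> Dict[str, np.ndarray]:
    """Return one boolean change mask M^k of shape (H, W) per prompt c_k.

    Placeholder only: a concrete system such as UniVCD would run frozen
    SAM2/CLIP encoders, compare per-class similarity maps across dates, and
    post-process the resulting change-likelihood maps.
    """
    raise NotImplementedError
```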

Key challenges for UniVCD include:

  • The lack of densely labeled semantic-change datasets spanning diverse object types.
  • Large domain gaps between foundation model training (natural images) and aerial/overhead remote-sensing imagery.
  • Accumulation of segmentation, comparison, and classification errors when system modules are trained or designed in isolation.

2. Methodological Frameworks

2.1 Foundation Models

UniVCD pipelines exploit frozen, pretrained vision foundation models:

  • SAM2 encoder ($E_s$): A multi-level Transformer generating hierarchical spatial feature maps (for fine-grained object boundaries).
  • CLIP encoders ($E_c^{\mathrm{img}}$, $E_c^{\mathrm{txt}}$): Provide dense image features and semantic priors for open-vocabulary prompts.

The integration of high-resolution spatial detail from models such as SAM2 with the semantic generalization of CLIP underpins most current UniVCD systems (Zhu et al., 15 Dec 2025).
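A minimal PyTorch sketch of the freezing pattern, assuming the encoders have already been loaded as `nn.Module` instances (the loader names in the comments are hypothetical placeholders, not a real API):

```python
from torch import nn

def freeze(module: nn.Module) -> nn.Module:
    """Freeze all parameters and switch to eval mode so the backbone stays fixed."""
    for p in module.parameters():
        p.requires_grad_(False)
    return module.eval()

# Assumed usage (loading code omitted; the loader names below are hypothetical):
#   sam2_encoder   = freeze(load_sam2_image_encoder(...))   # E_s
#   clip_image_enc = freeze(load_clip_image_encoder(...))   # E_c^img
#   clip_text_enc  = freeze(load_clip_text_encoder(...))    # E_c^txt
# Only the lightweight SCFAM adapters, fusion blocks, and projection heads stay trainable.
```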

2.2 Feature Alignment and Fusion

A central technical contribution is the SAM–CLIP Feature Alignment Module (SCFAM), which spatially aligns and fuses representations from frozen SAM2 and CLIP encoders into a high-resolution feature tensor $F \in \mathbb{R}^{D \times H \times W}$ (Zhu et al., 15 Dec 2025). SCFAM consists of the following components (a schematic sketch follows the list):

  • Per-scale adapters: Two-layer conv blocks project each SAM2 feature scale to a common semantic channel space.
  • Hierarchical fusion: A ConvNeXt-inspired architecture aggregates feature maps from coarse to fine scale.
  • Projection heads: MLPs or convolutions separately reconstruct SAM2 features (supervised by MSE) and align the fused map to CLIP features (supervised by MSE and cosine losses).
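The following PyTorch sketch illustrates this three-part structure under assumed channel sizes; it is a schematic stand-in, not the reference SCFAM implementation (Zhu et al., 15 Dec 2025):

```python
from torch import nn

class SCFAMSketch(nn.Module):
    """Schematic SAM-CLIP feature alignment module (illustrative, not official code).

    Per-scale adapters project each SAM2 feature scale into a shared channel space,
    a coarse-to-fine fusion path aggregates them, and two heads (i) reconstruct the
    SAM2 features and (ii) align the fused map to CLIP image features.
    Channel sizes are illustrative assumptions.
    """

    def __init__(self, sam_channels=(256, 128, 64), clip_dim=512, fused_dim=256):
        super().__init__()
        # Per-scale adapters: two-layer conv blocks into a shared channel space.
        self.adapters = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(c, fused_dim, kernel_size=1),
                nn.GELU(),
                nn.Conv2d(fused_dim, fused_dim, kernel_size=3, padding=1),
            )
            for c in sam_channels
        ])
        # Coarse-to-fine fusion (stand-in for the ConvNeXt-style blocks).
        self.fuse = nn.Conv2d(fused_dim, fused_dim, kernel_size=3, padding=1)
        # Projection heads: SAM2 reconstruction and CLIP alignment.
        self.recon_heads = nn.ModuleList([
            nn.Conv2d(fused_dim, c, kernel_size=1) for c in sam_channels
        ])
        self.clip_head = nn.Conv2d(fused_dim, clip_dim, kernel_size=1)

    def forward(self, sam_feats):
        """sam_feats: list of (B, C_i, H_i, W_i) maps, ordered coarse to fine."""
        fused = None
        for feat, adapter in zip(sam_feats, self.adapters):
            x = adapter(feat)
            if fused is None:
                fused = x
            else:
                up = nn.functional.interpolate(fused, size=x.shape[-2:], mode="bilinear")
                fused = self.fuse(up + x)
        recon = [
            head(nn.functional.interpolate(fused, size=f.shape[-2:], mode="bilinear"))
            for head, f in zip(self.recon_heads, sam_feats)
        ]
        clip_aligned = self.clip_head(fused)   # F with D = clip_dim channels
        return fused, recon, clip_aligned
```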

Loss functions for SCFAM are unsupervised and self-supervised, with no manual change labels:

$$L_{\mathrm{total}} = \sum_{i} \lambda_{\mathrm{rec},i}\, L_{\mathrm{rec},i} + \lambda_{\mathrm{MSE}}\, L_{\mathrm{align}}^{\mathrm{MSE}} + \lambda_{\mathrm{cos}}\, L_{\mathrm{align}}^{\mathrm{cos}}$$

Typically, the alignment losses dominate ($\lambda_{\mathrm{MSE}}, \lambda_{\mathrm{cos}} \gg \lambda_{\mathrm{rec}}$).
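A minimal sketch of this objective under the same assumptions, using a single shared reconstruction weight in place of the per-scale $\lambda_{\mathrm{rec},i}$ and illustrative (not published) weight values:

```python
import torch.nn.functional as F

def scfam_loss(recon, sam_feats, clip_aligned, clip_feats,
               lam_rec=0.1, lam_mse=1.0, lam_cos=1.0):
    """Self-supervised SCFAM objective: per-scale SAM2 reconstruction plus
    MSE and cosine alignment to CLIP image features (no change labels)."""
    # Per-scale SAM2 reconstruction terms (one shared weight stands in for lambda_rec,i).
    loss_rec = sum(F.mse_loss(r, s) for r, s in zip(recon, sam_feats))
    # Resize CLIP features to the fused map's resolution before alignment.
    clip_feats = F.interpolate(clip_feats, size=clip_aligned.shape[-2:], mode="bilinear")
    loss_mse = F.mse_loss(clip_aligned, clip_feats)
    loss_cos = 1.0 - F.cosine_similarity(clip_aligned, clip_feats, dim=1).mean()
    # Alignment terms dominate (lam_mse, lam_cos >> lam_rec).
    return lam_rec * loss_rec + lam_mse * loss_mse + lam_cos * loss_cos
```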

2.3 Change Likelihood and Mask Estimation

  • For each class $c$ and each image $t \in \{1, 2\}$, compute the cosine similarity between spatial features and class text embeddings: $S_{t,c}(i,j) = \dfrac{F_t(i,j) \cdot T_c}{\|F_t(i,j)\|\,\|T_c\|}$
  • The per-category change-likelihood map is $D_c(i,j) = \left(S_{1,c}(i,j) - S_{2,c}(i,j)\right)^2$ (a minimal sketch of this computation follows the list).
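A compact sketch of these two steps, assuming CLIP-aligned feature maps of shape (B, D, H, W) and CLIP text embeddings of shape (K, D); tensor names are illustrative:

```python
import torch
import torch.nn.functional as F

def change_likelihood(feat_t1, feat_t2, text_emb):
    """Per-class change-likelihood maps D_c from aligned bi-temporal features.

    feat_t1, feat_t2: (B, D, H, W) CLIP-aligned feature maps for the two dates.
    text_emb:         (K, D) CLIP text embeddings for the K category prompts.
    Returns:          (B, K, H, W) maps with D_c = (S_1,c - S_2,c) ** 2.
    """
    def cosine_maps(feat):
        f = F.normalize(feat, dim=1)                   # unit-norm features per pixel
        t = F.normalize(text_emb, dim=1)               # unit-norm text embeddings
        return torch.einsum("bdhw,kd->bkhw", f, t)     # S_{t,c}(i, j)

    s1, s2 = cosine_maps(feat_t1), cosine_maps(feat_t2)
    return (s1 - s2) ** 2
```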

2.4 Post-processing

A streamlined post-processing pipeline improves mask quality:

  • Otsu thresholding followed by morphological opening and small-region filtering removes pseudo-changes and noise (sketched after this list).
  • SAM2-based boundary refinement sharpens object boundaries, especially for well-defined, coherent structures such as buildings and roads.
  • Empirically, post-processing can increase building and road change precision by 8–10 percentage points and IoU by 5–8 points, though with a minor recall loss (Zhu et al., 15 Dec 2025).
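A minimal sketch of the thresholding, opening, and small-region filtering steps using OpenCV; the minimum-area value is an illustrative assumption, and SAM2-based boundary refinement is omitted:

```python
import cv2
import numpy as np

def postprocess_change_map(change_map: np.ndarray, min_area: int = 64) -> np.ndarray:
    """Otsu thresholding, morphological opening, and small-region filtering.

    change_map: (H, W) float array of per-class change likelihoods D_c.
    min_area:   illustrative minimum component size in pixels (not a published value).
    Returns a binary uint8 mask; SAM2-based boundary refinement is not included.
    """
    # Rescale to 8-bit so OpenCV's Otsu thresholding can be applied.
    scaled = cv2.normalize(change_map, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    _, binary = cv2.threshold(scaled, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # Morphological opening suppresses isolated noisy pixels (pseudo-changes).
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (3, 3))
    opened = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)
    # Remove connected components smaller than min_area (default connectivity is 8).
    num, labels, stats, _ = cv2.connectedComponentsWithStats(opened)
    mask = np.zeros_like(opened)
    for i in range(1, num):
        if stats[i, cv2.CC_STAT_AREA] >= min_area:
            mask[labels == i] = 255
    return mask
```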

3. Representative Pipelines and Instantiations

The UniVCD paradigm encompasses several architectures and pipeline choices, including both training-free and parameter-efficient trainable models:

| Architecture / Framework | Key Components | Notes |
| --- | --- | --- |
| UniVCD (SAM2 + CLIP + SCFAM) | Frozen SAM2, CLIP, trainable SCFAM | Fully unsupervised, post-processing pipeline (Zhu et al., 15 Dec 2025) |
| DynamicEarth M-C-I | Mask proposal (SAM2), comparison (DINOv2), identifier (SegEarth-OV) | Training-free, instance-first, open-vocabulary (Li et al., 22 Jan 2025) |
| DynamicEarth I-M-C | Identifier (APE/Grounding DINO), mask (SAM2), compare (DINOv2) | Training-free, prompt-first, high precision in simple scenes (Li et al., 22 Jan 2025) |
| Semantic-CD | Bi-temporal CLIP + adapters, open-vocabulary prompter, binary and semantic decoders | RemoteCLIP, two-stage training, explicit SCD (Zhu et al., 12 Jan 2025) |

DynamicEarth's M-C-I and I-M-C

Both are training-free frameworks that combine off-the-shelf foundation models for mask discovery, bi-temporal comparison, and semantic assignment. M-C-I favors high recall, while I-M-C achieves high precision when strong grounding models are available (Li et al., 22 Jan 2025).
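As a structural illustration only, the M-C-I flow can be expressed as a chain of three pluggable components; the callables below are hypothetical placeholders, not the DynamicEarth API:

```python
def mci_pipeline(image_t1, image_t2, prompts,
                 propose_masks, compare, identify, change_threshold=0.5):
    """Mask -> Compare -> Identify flow with user-supplied components.

    propose_masks(img)            -> iterable of class-agnostic instance masks (e.g. SAM2)
    compare(img1, img2, mask)     -> change score in [0, 1]        (e.g. DINOv2 features)
    identify(img, mask, prompts)  -> best-matching category prompt (e.g. SegEarth-OV)
    """
    results = []
    # Propose instances on both dates so appearing and disappearing objects are covered.
    for mask in list(propose_masks(image_t1)) + list(propose_masks(image_t2)):
        score = compare(image_t1, image_t2, mask)
        if score >= change_threshold:
            results.append((mask, identify(image_t2, mask, prompts), score))
    return results
```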

Semantic-CD

Implements open-vocabulary SCD by leveraging CLIP features, meta-token context prompt engineering, and two decoders for binary and semantic tasks trained in a decoupled way on the SECOND dataset (Zhu et al., 12 Jan 2025).

4. Quantitative Results and Benchmarking

UniVCD methods are benchmarked on multiple public datasets: LEVIR-CD, WHU-CD, and SECOND.

Key metrics (a small helper computing them from confusion counts is sketched after the list):

  • Precision = $\mathrm{TP}/(\mathrm{TP}+\mathrm{FP})$
  • Recall = $\mathrm{TP}/(\mathrm{TP}+\mathrm{FN})$
  • F1 = $2 \cdot \mathrm{Precision} \cdot \mathrm{Recall} / (\mathrm{Precision}+\mathrm{Recall})$
  • IoU = $\mathrm{TP}/(\mathrm{TP}+\mathrm{FP}+\mathrm{FN})$
  • mIoU = mean IoU across categories
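A small helper that computes these metrics from pixel-level confusion counts (a hypothetical utility, shown only to make the definitions concrete):

```python
def cd_metrics(tp: int, fp: int, fn: int) -> dict:
    """Precision, recall, F1, and IoU from pixel-level confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "iou": iou}
```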

Sample results for UniVCD (SAM2+CLIP+SCFAM, post-processed):

  • LEVIR-CD: Precision 64.7%, Recall 77.9%, F1 70.7%, IoU 54.7%, mIoU 75.6
  • WHU-CD: Precision 70.2%, Recall 84.1%, F1 76.5%, IoU 61.9%, mIoU 79.6
  • SECOND (building): F1 58.4%, IoU 41.2%; tree: F1 32.5%, IoU 19.4%

These outcomes match or exceed prior open-vocabulary methods (e.g., APE-DINOv2, SCM, DynamicEarth) (Zhu et al., 15 Dec 2025, Li et al., 22 Jan 2025).

DynamicEarth's training-free OVCD pipelines, notably M-C-I variants, demonstrate substantial cross-dataset robustness, addressing the problem that state-of-the-art supervised methods collapse outside their training distribution (Li et al., 22 Jan 2025).

5. Analysis of Strengths and Limitations

Strengths:

  • Unsupervised change detection: All SCFAM and mask generation modules require no labeled change data for training (Zhu et al., 15 Dec 2025).
  • Open-vocabulary generalization: Categories are specified by CLIP text prompts at test time; the approach generalizes to arbitrary target semantics (Zhu et al., 15 Dec 2025, Zhu et al., 12 Jan 2025).
  • Parameter efficiency: Only lightweight adapters and projection heads are trainable; foundation models are frozen, yielding roughly 1–2M trainable parameters (Zhu et al., 15 Dec 2025).
  • Spatial and semantic fusion: High-resolution spatial information (SAM2) is enriched with semantic priors (CLIP), enabling fine-grained, context-aware mask prediction (Zhu et al., 15 Dec 2025, Zhu et al., 12 Jan 2025).
  • Plug-and-play post-processing: Refinement steps are modular and conceptually decoupled from the base feature backbone (Zhu et al., 15 Dec 2025).

Limitations:

  • No part-level change detection: Systems fail to flag geometric changes to objects that remain in the same broad category (e.g., flyover reconfiguration) (Zhu et al., 15 Dec 2025, Li et al., 22 Jan 2025).
  • Domain gap for amorphous classes: CLIP semantics transfer less effectively to categories underrepresented in natural image datasets (e.g., certain types of vegetation or water) (Zhu et al., 15 Dec 2025).
  • Post-processing trade-off: Boundary refinement improves precision (and IoU) at the expense of mild recall degradation; requires per-category tuning (Zhu et al., 15 Dec 2025).
  • Prompt engineering dependency: Manual selection and crafting of effective semantic prompts is non-trivial for certain land-cover types (Li et al., 22 Jan 2025).
  • Decoupled losses and incomplete unification: Some instantiations (e.g., Semantic-CD) use decoupled two-stage training, not a fully unified architecture (Zhu et al., 12 Jan 2025).

6. Directions for Future Research

Recent literature points toward fully unified, robust, and domain-adaptive approaches that minimize manual supervision. A plausible implication is that, as open-vocabulary change detection becomes central to remote sensing and environmental analysis pipelines, such approaches will become standard.

7. Relationship to Broader Literature

UniVCD operationalizes the shift from closed-set, supervised change detection to training-free or minimally supervised, open-world semantic reasoning, by leveraging advances in vision-language pretraining (CLIP, SAM2, DINOv2, Grounding DINO). Its development is consistent with trends observed in the broader OVCD literature, as seen in DynamicEarth (Li et al., 22 Jan 2025) and Semantic-CD (Zhu et al., 12 Jan 2025). These works collectively establish unified open-vocabulary change detection as both a critical methodological advance and a foundation for future large-scale, annotation-efficient remote sensing systems.
