Referring Change Detection (RCD)

Updated 19 December 2025

Referring Change Detection (RCD) is a dynamic paradigm that uses referring expressions to flexibly identify and segment changes in temporal data.
It integrates cross-modal vision-language fusion, prompt-based segmentation, and probabilistic rank modeling for precise detection in domains like remote sensing and gene expression.
The framework enhances flexibility and interoperability by decoupling fixed labels and supporting application-specific change queries through scalable synthetic augmentation.

Referring Change Detection (RCD) is a change detection paradigm that enables flexible, user- or context-driven identification of changes in temporal data. In contrast to traditional binary or semantic change detection, which operates on fixed categories and predefined label sets, RCD uses dynamic referring mechanisms (typically natural language or other prompts) to localize and segment only those changes of explicit interest. The paradigm has been formalized for remote sensing and earth observation imagery and for gene expression data, with implementations encompassing cross-modal fusion, promptable segmentation, and probabilistic rank modeling.

1. Formal Definition and Motivation

RCD generalizes the classical change detection problem by decoupling the target concept from a fixed output label set and making the “change query” dynamic at inference time. Given paired data $(I_1, I_2)$ (for example, bi-temporal remote sensing images) and a referring expression $T$ (e.g., “detect building changes,” “vegetation loss”), RCD seeks a mask $M$ where $M_{ij}=1$ denotes that pixel/location $(i,j)$ underwent a change stipulated by $T$ (Korkmaz et al., 12 Dec 2025, Ahmad et al., 13 Aug 2024, Zhao et al., 13 Jul 2024):

$\hat M = \arg\max_M P(M|I_1, I_2, T)$
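As a minimal illustration of this decision rule, suppose a model outputs a per-pixel probability that the queried change occurred, and suppose the posterior factorizes over pixels. Both are assumptions made here for illustration and are not taken from the cited works. Under them, the maximizing mask is obtained by thresholding each probability at 0.5. The function name `map_mask` is hypothetical.

```python
import numpy as np

def map_mask(prob, threshold=0.5):
    """Most probable binary mask under a pixel-wise independent posterior.

    prob[i, j] is an assumed model output: the probability that pixel (i, j)
    underwent the change described by the referring expression T.
    """
    prob = np.asarray(prob, dtype=float)
    if prob.min() < 0.0 or prob.max() > 1.0:
        raise ValueError("probabilities must lie in [0, 1]")
    # With independent pixels, each M_ij is maximized separately.
    return (prob > threshold).astype(np.uint8)

probs = np.array([[0.9, 0.2],
                  [0.4, 0.7]])
mask = map_mask(probs)
```

Models with structured outputs (for example, mask decoders that couple neighbouring pixels) would not reduce to this simple thresholding.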

This framework extends naturally to biological domains, where RCD models differential splicing as rank reversals among mutually exclusive transcript junctions. In this context, referring is not language-based but structural: detecting changes only among reference transcripts or splice events (Gelfond et al., 2011).

The rationale for RCD is twofold. First, it enhances flexibility, allowing practitioners to focus on meaningful or application-specific changes (e.g., disaster-driven building loss rather than generic scene change). Second, it facilitates data and model interoperability, supporting training and inference across diverse, heterogeneous datasets and concepts (Korkmaz et al., 12 Dec 2025, Zhao et al., 13 Jul 2024).

2. Methodological Approaches

RCD encompasses several architectural and statistical approaches, with the principal methodologies described below.

2.1 Cross-Modal Vision-Language Fusion

Several works employ vision-language models (VLMs), notably CLIP and its variants, to encode both the images and the referring text into a shared embedding space (Zhao et al., 13 Jul 2024, Korkmaz et al., 12 Dec 2025). The semantic-first paradigm (“SeFi-CD”) embeds the user’s referring expression, which specifies the Change Region of Interest (CRoI), together with each image, and computes token-wise similarity maps between them. Sparse prompts (point locations with high or low alignment) are generated from these similarity maps and used to guide foundation segmentation models.

$\text{Sim}_{t_i} = \text{Norm}\left(F_i \cdot T^\top\right)$
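The similarity map and the sparse prompts derived from it can be sketched as follows. This is an illustrative reading, not the cited implementation: it assumes per-token image features of shape (H, W, D), a single text embedding of dimension D, cosine similarity, and min-max scaling as the normalization. The function names and the choice of the top and bottom k tokens as positive and negative points are hypothetical.

```python
import numpy as np

def similarity_map(image_tokens, text_embedding):
    """Min-max normalized cosine similarity between each image token and the text.

    image_tokens: (H, W, D) array of non-zero per-token features (assumed layout).
    text_embedding: (D,) array for the referring expression.
    """
    f = image_tokens / np.linalg.norm(image_tokens, axis=-1, keepdims=True)
    t = text_embedding / np.linalg.norm(text_embedding)
    sim = f @ t                                   # (H, W) cosine similarities
    span = sim.max() - sim.min()
    return (sim - sim.min()) / span if span > 0 else np.zeros_like(sim)

def sparse_prompts(sim, k=1):
    """Return the k best-aligned tokens (positive points) and k worst (negative points)."""
    order = np.argsort(sim, axis=None)            # flat indices, ascending similarity

    def to_row_col(flat_indices):
        return [tuple(int(v) for v in np.unravel_index(i, sim.shape))
                for i in flat_indices]

    return to_row_col(order[::-1][:k]), to_row_col(order[:k])

tokens = np.array([[[1.0, 0.0], [0.0, 1.0]],
                   [[1.0, 1.0], [-1.0, 0.0]]])
text = np.array([1.0, 0.0])
sim = similarity_map(tokens, text)
positive, negative = sparse_prompts(sim, k=1)
```

In this toy input the token at (0, 0) is perfectly aligned with the text and becomes the positive point, while the token at (1, 1) points the opposite way and becomes the negative point.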

Masks are produced by passing these prompts to SAM (Segment Anything Model), and the final change map is obtained as the set difference between CRoI masks from both images.
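The final differencing step can be sketched on binary masks. The text above does not say whether the difference is one-sided or symmetric, so this sketch returns both directions and their union. The function name is hypothetical.

```python
import numpy as np

def change_from_masks(mask_pre, mask_post):
    """Compare two region-of-interest masks from the pre- and post-change images.

    Returns (disappeared, appeared, any_change) as boolean arrays.
    """
    pre = np.asarray(mask_pre, dtype=bool)
    post = np.asarray(mask_post, dtype=bool)
    if pre.shape != post.shape:
        raise ValueError("masks must have the same shape")
    disappeared = pre & ~post      # present before, absent after
    appeared = post & ~pre         # absent before, present after
    return disappeared, appeared, disappeared | appeared

pre = [[1, 1, 0],
       [0, 0, 0]]
post = [[1, 0, 0],
        [0, 0, 1]]
gone, new, changed = change_from_masks(pre, post)
```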

2.2 Prompt-Based Segmentation and Specialized Masking

Others, such as (Ahmad et al., 13 Aug 2024), use direct prompt construction from object masks in pre-change images. Connected components are extracted as candidate objects, with prompts generated from all but one component to probe for disappearance in post-change data. SAM’s internal confidence score for mask prediction is used to flag the most likely disappeared object without explicit IoU thresholding.

Pseudocode summary:

Input:  M_pre (binary mask), I_post (image)
Output: index k* of disappeared object
1. Find connected components O₁…O_N in M_pre
2. For each i=1…N, sample point prompts P_i ⊂ O_i
3. For k=1…N:
    Q_k ← ⋃_{i≠k} P_i        (prompts from every object except O_k)
    (M_post^{(k)}, c_k) ← SAM.segment(I_post; prompt=Q_k)
4. k* ← arg max_k c_k
5. Return k*
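The leave-one-out procedure above can be written as a short runnable sketch. The segmenter is passed in as a callable `segment(image, points) -> (mask, confidence)`, which is a hypothetical interface and not the actual SAM API. The `toy_segment` stand-in, the 4-connectivity, and the choice of the first pixel of each component as its prompt are all assumptions made so the example runs without model weights.

```python
from collections import deque
import numpy as np

def connected_components(mask):
    """Return 4-connected foreground regions as lists of (row, col) pixels."""
    mask = np.asarray(mask, dtype=bool)
    seen = np.zeros_like(mask)
    components = []
    for start in zip(*np.nonzero(mask)):
        if seen[start]:
            continue
        seen[start] = True
        queue, pixels = deque([start]), []
        while queue:                                   # breadth-first flood fill
            r, c = queue.popleft()
            pixels.append((int(r), int(c)))
            for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                inside = 0 <= nr < mask.shape[0] and 0 <= nc < mask.shape[1]
                if inside and mask[nr, nc] and not seen[nr, nc]:
                    seen[nr, nc] = True
                    queue.append((nr, nc))
        components.append(pixels)
    return components

def find_disappeared(mask_pre, image_post, segment, points_per_object=1):
    """Return (k*, scores): k* is the object whose omission from the prompt
    set yields the highest segmenter confidence on the post-change image."""
    objects = connected_components(mask_pre)
    if not objects:
        raise ValueError("mask_pre contains no objects")
    prompts = [obj[:points_per_object] for obj in objects]   # deterministic sampling
    scores = []
    for k in range(len(objects)):
        others = [p for i, pts in enumerate(prompts) if i != k for p in pts]
        _, confidence = segment(image_post, others)
        scores.append(float(confidence))
    return int(np.argmax(scores)), scores

def toy_segment(image, points):
    """Stand-in segmenter: confidence is the share of prompt points on bright pixels."""
    hits = [bool(image[r, c] > 0.5) for r, c in points]
    return None, (sum(hits) / len(hits) if hits else 0.0)

mask_pre = np.array([[1, 0, 1],
                     [0, 0, 0],
                     [0, 1, 0]])
image_post = mask_pre.astype(float)
image_post[0, 2] = 0.0          # the second object vanishes in the post-change image
k_star, scores = find_disappeared(mask_pre, image_post, toy_segment)
```

The intuition, under these assumptions, is that the prompt set omitting the vanished object contains only points that still land on real objects, so the segmenter is most confident for that set.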

2.3 Probabilistic Rank Modeling

In the domain of splice-junction microarrays, Rank Change Detection (RCD) assesses rank reversals among mutually exclusive junction sets (Gelfond et al., 2011). The model estimates the posterior probability that the ordering of latent probe intensities changes between conditions. Monte Carlo sampling from the Gaussian posterior provides robust estimates of rank-change events:

$U_{t_1 t_2 j} = P\left(R(\mu_{t_1 j}) < R(\mu_{t_2 j}) \mid \text{Data}\right)$

$D_{t_1 t_2 j} = P\left(R(\mu_{t_1 j}) > R(\mu_{t_2 j}) \mid \text{Data}\right)$

$S_{t_1 t_2 j} = \max\{U_{t_1 t_2 j}, D_{t_1 t_2 j}\}$

A threshold (e.g., $0.9$) is applied to $S$ to call differential splicing events.
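A Monte Carlo estimate of these quantities might look like the sketch below. It assumes an independent Gaussian posterior for each latent intensity, summarized by a mean and standard deviation per junction and condition, and it takes $R(\cdot)$ to be the rank of junction $j$ among the junctions of the same condition. Both are simplifying assumptions for illustration; the cited model is a mixed model whose posterior need not be diagonal. The function name and example numbers are hypothetical.

```python
import numpy as np

def rank_change_scores(mean1, sd1, mean2, sd2, n_samples=4000, seed=0):
    """Monte Carlo estimates of U, D and S for each junction j.

    mean*/sd* describe an assumed independent Gaussian posterior for the
    latent intensities under conditions t1 and t2 (one entry per junction).
    """
    rng = np.random.default_rng(seed)
    mu1 = rng.normal(mean1, sd1, size=(n_samples, len(mean1)))
    mu2 = rng.normal(mean2, sd2, size=(n_samples, len(mean2)))
    # Applying argsort twice converts each sampled row into ranks (0 = lowest).
    rank1 = mu1.argsort(axis=1).argsort(axis=1)
    rank2 = mu2.argsort(axis=1).argsort(axis=1)
    U = (rank1 < rank2).mean(axis=0)   # rank goes up from t1 to t2
    D = (rank1 > rank2).mean(axis=0)   # rank goes down from t1 to t2
    return U, D, np.maximum(U, D)

# Three junctions: the first two swap order between conditions, the third stays lowest.
U, D, S = rank_change_scores([8.0, 5.0, 2.0], 0.3, [5.0, 8.0, 2.0], 0.3)
calls = S > 0.9
```

With well-separated posteriors, as in this toy input, the first two junctions are called as rank changes and the third is not.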

3. Architectural Instantiations

The spectrum of RCD architectures spans zero-shot, few-shot, and supervised regimes.

RCDNet: A hybrid Siamese–Transformer network for referring CD combining visual state-space encoding with text-guided cross-modal fusion (Korkmaz et al., 12 Dec 2025).
AUWCD (Anything You Want Change Detection): A zero-shot pipeline based on the Semantic-First paradigm, comprising semantic alignment (CLIP Surgery), prompt generation, SAM-guided segmentation, and change map computation (Zhao et al., 13 Jul 2024).
SAM-based object disappearance: An unsupervised, prompt-based method for specialized disappearance detection adaptable across object classes (Ahmad et al., 13 Aug 2024).
Rank-based statistical RCD: Generalized linear mixed models with rank-change statistics for splicing analysis in biomedical arrays (Gelfond et al., 2011).

The flexibility of these frameworks enables seamless integration of new referring expressions (text prompts or structural queries) without retraining, supports dataset mixing, and addresses class imbalance via large-scale synthetic augmentation pipelines such as RCDGen—a latent diffusion and inpainting protocol generating diverse $(I_1, I_2, M)$ triples (Korkmaz et al., 12 Dec 2025).

4. Evaluation Strategies and Benchmark Results

RCD performance is typically assessed using standard change detection metrics (mean IoU, F1, Overall Accuracy, Separated Kappa) (Korkmaz et al., 12 Dec 2025, Zhao et al., 13 Jul 2024), together with task-specific measures such as point accuracy or prompt alignment.
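For reference, the mask-level metrics can be computed from a confusion matrix as in the following sketch. It covers IoU, F1, and overall accuracy for the change class of a single binary mask pair. Mean IoU would average the per-class IoU values, and Separated Kappa is not included. The function name and the convention of returning 1.0 when both masks are empty are choices made here.

```python
import numpy as np

def binary_change_metrics(pred, target):
    """IoU, F1 and overall accuracy for the positive (change) class."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    tp = int(np.sum(pred & target))
    fp = int(np.sum(pred & ~target))
    fn = int(np.sum(~pred & target))
    tn = int(np.sum(~pred & ~target))
    union = tp + fp + fn
    iou = tp / union if union else 1.0                 # both masks empty: perfect match
    f1 = 2 * tp / (2 * tp + fp + fn) if union else 1.0
    oa = (tp + tn) / pred.size
    return {"iou": iou, "f1": f1, "oa": oa}

metrics = binary_change_metrics([[1, 1], [0, 0]],
                                [[1, 0], [1, 0]])
```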

Key benchmarks include:

Dataset	RCDNet (mIoU)	RCDNet (+Synth)	AUWCD (F1 uplift vs SOTA Supervised)
SECOND	73.04	73.47	+5.01 pp avg, +13.17 pp max (Zhao et al., 13 Jul 2024)
CNAM‐CD	72.83	75.32	Not evaluated
WHU‐CD (BCD)	91.6	91.9	Not evaluated
LEVIR‐CD (BCD)	85.8	85.8	Not evaluated

In biomedical contexts, RCD methods yielded lower false-positive rates under realistic nonlinearity compared to standard ANOVA-based approaches (ANOSVA), and higher enrichment for known gene-splice events (Gelfond et al., 2011).

A plausible implication is that RCD can outperform rigid supervised baselines, especially for “unseen” referring concepts and in regimes dominated by class imbalance or semantic heterogeneity.

5. Limitations, Generalization, and Prospective Developments

Limitations of RCD arise primarily from the dependency on the quality and coverage of referring mechanisms:

Vision-language models may falter on genuinely out-of-vocabulary (OOV) categories (Korkmaz et al., 12 Dec 2025, Zhao et al., 13 Jul 2024).
Absolute zero-shot generalization is constrained by semantic similarity between prompt and training data (Korkmaz et al., 12 Dec 2025).
Synthetic data realism, though improved by diffusion and inpainting, may still lack fine structure for extreme geographical contexts (Korkmaz et al., 12 Dec 2025).

Generalization to new data types is facilitated by foundation models and by decoupling predictions from fixed class outputs (RCDNet, AUWCD, diffusion augmentation) (Korkmaz et al., 12 Dec 2025, Zhao et al., 13 Jul 2024). The rank-based variant is additionally robust to arbitrary monotonic distortions of the measurement scale, because ranks are unchanged by strictly increasing transforms (Gelfond et al., 2011).
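The invariance of ranks under monotonic transforms is easy to check numerically. The values and the particular distortion below are arbitrary and chosen only for illustration.

```python
import numpy as np

def ranks(values):
    """Rank of each entry (0 = smallest), assuming no ties."""
    return np.argsort(np.argsort(values))

intensities = np.array([3.2, 0.5, 7.1, 1.9])
# A strictly increasing, nonlinear distortion on positive values.
distorted = np.log1p(intensities) ** 3
same_ranks = np.array_equal(ranks(intensities), ranks(distorted))
```

Any statistic that depends on the data only through such ranks is therefore unaffected by the distortion, whereas mean-based comparisons generally are not.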

Emergent directions include:

Zero-shot and prompt-tuned RCD via larger VLMs and fine-tuning strategies.
Multi-turn and interactive RCD enabling mask refinement through language.
Multi-class and spatio-temporal RCD, extending the paradigm to compound or time-series queries.
Curriculum and focal-reweighting for extremely rare class scenarios (Korkmaz et al., 12 Dec 2025, Zhao et al., 13 Jul 2024).
Cross-domain and open-world extensions facilitating application to SAR, multispectral, or non-Earth data.

6. Privacy, Domain Adaptation, and Special Cases

Some RCD methodologies, notably prompt-based SAM approaches, provide privacy benefits by operating solely on object masks rather than raw images, which avoids transmitting or processing sensitive image content (Ahmad et al., 13 Aug 2024).

Domain adaptation is natively supported since the referring expression can flexibly encode any new target class, and synthetic augmentation (RCDGen) permits scalable expansion with minimal annotation overhead (Korkmaz et al., 12 Dec 2025).

Failure modes include mis-flagging unchanged objects when multiple disappearance events occur, as the model may require prior knowledge of the number of changes (Ahmad et al., 13 Aug 2024). Remedies such as iterative exclusion or multi-hypothesis testing have yet to be systematically explored.

7. Historical Context and Cross-Domain Perspectives

RCD has antecedents in biological statistical modeling, notably latent rank change detection in splice-junction microarrays (Gelfond et al., 2011), where the central objective is robust differential event detection insensitive to monotonic nonlinearities and fixed class definitions.

In earth observation and remote sensing imagery, the domain has expanded rapidly from rigid semantic CD models to “anything you want” paradigms using foundation models, semantic alignment, and promptable segmentation (Korkmaz et al., 12 Dec 2025, Ahmad et al., 13 Aug 2024, Zhao et al., 13 Jul 2024).

The transition from visual-first CD to semantic-first or referring CD appears to yield significant advantages in flexibility, interpretability, and task generality, particularly for applications requiring bespoke change queries, data privacy, or multi-domain interoperability.