CosimNet: Robust Siamese Metric Learning
- CosimNet is a fully convolutional Siamese metric learning network that distinguishes true semantic scene changes from nuisance factors like illumination and viewpoint variations.
- It employs a custom Thresholded Contrastive Loss and MultiLayer Side-Output strategy to directly learn per-pixel feature dissimilarities, boosting detection accuracy.
- Benchmarked on datasets like CDnet and VL-CMU-CD, CosimNet achieves competitive state-of-the-art performance under real-world conditions without explicit pre-alignment.
CosimNet is a fully convolutional Siamese metric learning network designed for robust scene change detection (SCD), focusing on discriminating genuine semantic changes from nuisance noise such as illumination, shadow, and viewpoint variation. The approach customizes implicit metrics through direct feature comparison and advances the state of metric-based SCD by introducing architectural, loss, and training innovations that yield high discriminability and quantitative performance on established benchmarks. CosimNet is representative of the direct metric-learning paradigm in modern change detection pipelines (Guo et al., 2018).
1. Core Principles and Design Rationale
CosimNet is motivated by the observation that conventional change detection methods often fail to disentangle semantic change from noise due to entangled variances in real-world conditions. Rather than relying on per-pixel or region-level classification with learned or hand-crafted invariances, CosimNet directly learns a feature space where unchanged pixel pairs (negative, background) are close, and changed pairs (positive, foreground) are maximally separated. The backbone is a fully convolutional Siamese network (FCSN) that supports local, per-pixel feature embedding, allowing the application of distance metrics to generate high-resolution change maps.
Key concepts:
- Direct feature-wise dissimilarity is the detector signal, not a threshold over a single-pixel classification logit.
- Metric learning via contrastive loss ensures robust separation despite nuisance transformations.
- Explicit handling of registration errors and viewpoint variation through loss function design (see Section 3).
2. Network Architecture and Feature Embedding
CosimNet's input comprises a pair of images, $I_{t_0}$ and $I_{t_1}$, captured at times $t_0$ and $t_1$. Both are processed by identical, weight-shared FCSNs (e.g., based on DeeplabV2).
Detailed flow:
- Feature Extraction: Both $I_{t_0}$ and $I_{t_1}$ propagate through deep convolutional branches, producing dense feature maps $F_{t_0}$ and $F_{t_1}$, respectively.
- Normalization: Per-pixel feature vectors are $\ell_2$-normalized (projected onto the unit hypersphere) to stabilize and regularize metric computation.
- Feature Distance Calculation: At each output location, a pre-defined distance metric is computed over the paired feature vectors $f_{t_0}$ and $f_{t_1}$:
  - Euclidean ($\ell_2$) distance: $D(f_{t_0}, f_{t_1}) = \lVert f_{t_0} - f_{t_1} \rVert_2$
  - Cosine similarity (alternative): $\cos(f_{t_0}, f_{t_1}) = \dfrac{f_{t_0} \cdot f_{t_1}}{\lVert f_{t_0} \rVert_2 \, \lVert f_{t_1} \rVert_2}$
- Change Map Generation: The resulting distance maps are interpreted as soft change likelihoods; high values correspond to likely semantic changes. A code sketch of this flow follows the table below.
Table: Key Architectural Components
| Component | Function | Implementation Detail |
|---|---|---|
| Siamese FCSN | Feature extraction, weight sharing | DeeplabV2 backbone (modifiable) |
| Normalization | Feature stability, learning regularization | Hypersphere projection |
| Distance Metric | Dissimilarity quantification | $\ell_2$ (default), cosine (optional) |
| Change Map | Soft/hard mask of candidate changes | Applied post-metric calculation |
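The embedding-and-distance flow above can be sketched compactly in PyTorch. This is a minimal illustration only, not the reference implementation: the `backbone` argument, the `SiameseChangeNet` class name, and the `distance_map` helper are hypothetical stand-ins for the DeeplabV2-based branches in the released code.

```python
import torch
import torch.nn.functional as F

def distance_map(feat_t0, feat_t1, metric="l2"):
    """Per-pixel dissimilarity between two feature maps of shape (B, C, H, W)."""
    # Project each per-pixel feature vector onto the unit hypersphere (l2 normalization).
    f0 = F.normalize(feat_t0, p=2, dim=1)
    f1 = F.normalize(feat_t1, p=2, dim=1)
    if metric == "l2":
        # Euclidean distance between the normalized vectors at every spatial location.
        return torch.norm(f0 - f1, p=2, dim=1)            # (B, H, W)
    if metric == "cosine":
        # Cosine dissimilarity: 1 - cos(theta), zero for identical directions.
        return 1.0 - (f0 * f1).sum(dim=1)                 # (B, H, W)
    raise ValueError(f"unknown metric: {metric}")

class SiameseChangeNet(torch.nn.Module):
    """Weight-shared fully convolutional branches followed by a distance layer."""
    def __init__(self, backbone):
        super().__init__()
        self.backbone = backbone                          # e.g., a DeeplabV2-style FCN

    def forward(self, img_t0, img_t1):
        feat_t0 = self.backbone(img_t0)                   # shared weights: the same module
        feat_t1 = self.backbone(img_t1)                   # is applied to both inputs
        d = distance_map(feat_t0, feat_t1, metric="l2")
        # Upsample the coarse distance map to input resolution as a soft change map.
        return F.interpolate(d.unsqueeze(1), size=img_t0.shape[-2:],
                             mode="bilinear", align_corners=False).squeeze(1)
```

Thresholding the returned map (or supervising it with the losses in Section 3) turns the soft dissimilarities into a binary change mask.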
3. Loss Functions: Contrastive and Thresholded Variants
CosimNet extends basic contrastive loss to address noise and unregistered content, introducing Thresholded Contrastive Loss (TCL) for enhanced robustness.
- Contrastive Loss:
  $\mathcal{L}(D, y) = \frac{1}{2}\big[\, y\, D^2 + (1 - y)\max(m - D, 0)^2 \big]$
  where $y = 1$ for unchanged pairs, $0$ for changed; $D$ is the per-pixel feature distance and $m$ is a separation margin.
- Thresholded Contrastive Loss (TCL):
  $\mathcal{L}_{\mathrm{TCL}}(D, y) = \frac{1}{2}\big[\, y\, \max(D - \tau, 0)^2 + (1 - y)\max(m - D, 0)^2 \big]$
  where the threshold $\tau > 0$ permits a nonzero acceptable distance for unchanged pairs, reducing over-penalization of unregistered or poorly aligned content (notably, under large viewpoint shifts). A code sketch of both losses appears after the list below.
This strategy allows CosimNet to:
- Tolerate moderate pixel misalignment without inflating false positives.
- Differentiate semantic from nuisance “change” (e.g., movement due to parallax versus real object displacement).
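A minimal sketch of the two losses, assuming the convention above ($y = 1$ for unchanged pairs, $y = 0$ for changed); the function names and the default `margin` and `tau` values are illustrative, not taken from the paper.

```python
import torch

def contrastive_loss(dist, y, margin=2.0):
    """Standard contrastive loss over a per-pixel distance map.
    dist, y: tensors of shape (B, H, W); y = 1 marks unchanged pixels, y = 0 changed."""
    pull = y * dist.pow(2)                                      # pull unchanged pairs together
    push = (1 - y) * torch.clamp(margin - dist, min=0).pow(2)   # push changed pairs beyond the margin
    return 0.5 * (pull + push).mean()

def thresholded_contrastive_loss(dist, y, margin=2.0, tau=0.3):
    """TCL: unchanged pairs are penalized only beyond the tolerance tau,
    which absorbs small registration and viewpoint misalignments."""
    pull = y * torch.clamp(dist - tau, min=0).pow(2)
    push = (1 - y) * torch.clamp(margin - dist, min=0).pow(2)
    return 0.5 * (pull + push).mean()
```

Setting `tau = 0` recovers the standard contrastive loss, so the threshold acts as a single tunable knob for alignment tolerance.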
4. Training Policy and Supervision
CosimNet introduces MultiLayer Side-Output (MLSO) supervision, which applies the contrastive objective to intermediate feature layers (not just the final output):
$\mathcal{L}_{\mathrm{MLSO}} = \sum_{k} w_k \, \mathcal{L}_k$
where $\mathcal{L}_k$ is the contrastive loss computed on the $k$-th side output and the $w_k$ are per-layer side-output weights. This encourages mid-level representations to develop strong change/no-change discriminability, which empirically improves both feature and detection quality.
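One plausible way to wire the MLSO objective, reusing `distance_map` and `contrastive_loss` from the sketches above; the layer taps and weights in the example are arbitrary illustrations, not the paper's configuration.

```python
import torch
import torch.nn.functional as F

def mlso_loss(feature_pairs, y, weights, loss_fn):
    """MultiLayer Side-Output objective: a weighted sum of contrastive losses,
    one per supervised intermediate layer.
    feature_pairs: list of (feat_t0, feat_t1) tuples taken from selected layers.
    y: ground-truth change mask of shape (B, H, W); weights: per-layer scalars."""
    total = 0.0
    for (f0, f1), w in zip(feature_pairs, weights):
        d = distance_map(f0, f1, metric="l2")                  # per-layer distance map
        # Resize the ground truth to each side output's spatial resolution.
        y_k = F.interpolate(y.unsqueeze(1).float(), size=d.shape[-2:],
                            mode="nearest").squeeze(1)
        total = total + w * loss_fn(d, y_k)
    return total

# Example wiring (hypothetical layer names and weights):
# loss = mlso_loss([(c4_t0, c4_t1), (fc7_t0, fc7_t1)], mask, [0.5, 1.0], contrastive_loss)
```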
Further details:
- Deep layers (e.g., fc7) yield greater change/no-change separation.
- Euclidean ($\ell_2$) distance consistently outperformed cosine similarity as the metric in RMS-contrast and qualitative discriminability experiments.
5. Robustness to Nuisance Change and Performance Benchmarks
Experiments on CDnet, PCD2015, and VL-CMU-CD demonstrate high robustness:
- Illumination/shadow invariance: Learned features remain stable under photometric disturbance.
- Viewpoint variation: TCL reduces false positives under both camera motion and zooming.
- No reliance on explicit geometric pre-alignment such as SfM or homography; CosimNet’s metric learning subsumes these invariances.
Summary Table: F-Scores
| Method | Tsunami (PCD) | GSV (PCD) | VL-CMU-CD |
|---|---|---|---|
| CosimNet-3layer-l2 | 0.806 | 0.692 | 0.706 |
| FCN-Late-Fusion | 0.809 | 0.685 | 0.714 |
| FCN-Metrics | 0.814 | 0.692 | 0.721 |
On VL-CMU-CD, CosimNet-3layer-l2 exceeds the best published baseline by approximately 15 percentage points (0.706 vs. 0.55 for the CDNet method). Precision on CDnet is 0.9383 and F-measure 0.8591, indicating state-of-the-art or competitive foreground change detection even under challenging nuisance conditions.
Ablation studies show:
- TCL is critical under strong viewpoint shift; standard contrastive loss can be inadequate.
- MLSO yields additional accuracy and feature separability gains.
6. Quantitative and Visual Analysis
Feature discriminability is substantiated via:
- RMS and Michelson contrast metrics, with CosimNet's features achieving the strongest foreground-background separation (a short computation sketch appears at the end of this section).
- t-SNE visualizations, where learned embeddings yield tight, well-separated clusters for changed and unchanged regions, in contrast to FCN baselines.
- Change heatmaps, accurately localizing semantic change and suppressing nuisance responses, especially under geometric misalignment.
Additionally, CosimNet maintains smooth, accurate object boundaries, whereas FCN-based baselines often produce noisy or spatially inconsistent results, as illustrated in the paper’s qualitative comparisons.
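As a rough illustration of how such contrast scores can be computed on a predicted distance map (NumPy, with hypothetical array names; the paper's exact evaluation protocol may differ):

```python
import numpy as np

def rms_contrast(dist_map):
    """RMS contrast: standard deviation of the min-max normalized distance map."""
    d = (dist_map - dist_map.min()) / (dist_map.max() - dist_map.min() + 1e-8)
    return float(np.sqrt(np.mean((d - d.mean()) ** 2)))

def michelson_contrast(dist_map, change_mask):
    """Michelson-style contrast between the mean response inside the changed
    (foreground) region and the unchanged (background) region."""
    fg = dist_map[change_mask].mean()
    bg = dist_map[~change_mask].mean()
    hi, lo = max(fg, bg), min(fg, bg)
    return float((hi - lo) / (hi + lo + 1e-8))
```

Higher values on both measures indicate stronger separation between changed and unchanged responses.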
7. Significance, Open Source Availability, and Further Implications
CosimNet demonstrates that scene change detection can be formulated as a metric learning problem, unifying feature representation and change quantification via learnable per-pixel dissimilarity. TCL provides a principled, tunable mechanism to increase tolerance to registration errors without hand-crafted pre-alignment.
The approach’s generality implies adaptation to other domains (e.g., remote sensing, medical imaging) where robust change detection under geometric and photometric drift is essential.
The full source is open at https://github.com/gmayday1997/ChangeDet, facilitating further research and real-world deployment.
Key formula summary:
- Contrastive loss: $\mathcal{L}(D, y) = \frac{1}{2}\big[\, y\, D^2 + (1 - y)\max(m - D, 0)^2 \big]$
- TCL: $\mathcal{L}_{\mathrm{TCL}}(D, y) = \frac{1}{2}\big[\, y\, \max(D - \tau, 0)^2 + (1 - y)\max(m - D, 0)^2 \big]$
- RMS contrast: $C_{\mathrm{RMS}} = \sqrt{\tfrac{1}{MN}\sum_{i=1}^{M}\sum_{j=1}^{N}\big(I_{ij} - \bar{I}\big)^2}$, where $\bar{I}$ is the mean intensity
CosimNet represents a cohesive, empirically validated approach to robust, high-precision change detection by learning to directly measure feature dissimilarity under adverse real-world conditions (Guo et al., 2018).