Fully Convolutional Siamese Metric Networks
- The paper introduces CosimNet, a fully convolutional Siamese network that learns per-pixel discriminative metrics to distinguish semantic changes from nuisances.
- It employs a novel thresholded contrastive loss to maintain robustness against geometric misalignments and suppress false change detections.
- Empirical results on benchmark datasets demonstrate improved F-scores and more reliable change detection compared to traditional FCN and descriptor-based methods.
CosimNet is a fully convolutional Siamese metric network designed to address scene change detection (SCD) by directly learning and applying discriminative per-pixel metrics for distinguishing semantic changes from noisy scene variations such as illumination shifts, shadows, and viewpoint differences. The method departs from traditional classification-based or decision-boundary approaches by employing explicit metric learning to separate feature representations of changed and unchanged scene regions, enabling robust detection in visually and geometrically challenging scenarios (Guo et al., 2018).
1. Network Architecture and Metric Learning Principle
CosimNet operates on a pair of images acquired at different times, $t_0$ and $t_1$. Each image is processed in parallel by a fully convolutional Siamese network (FCSN), which serves as the shared feature extractor (the backbone is instantiated from standard semantic segmentation models such as DeeplabV2). The extracted feature maps at each spatial position are compared via either Euclidean (L2) distance or cosine similarity to yield a dense change map, which quantifies local feature dissimilarity. This per-pixel metric is interpreted as the probability or confidence of change occurrence.
By structuring the network as a Siamese configuration with shared weights, CosimNet ensures that identical content in the two images is mapped to nearby points in feature space, regardless of nuisance changes, while enforcing separability for semantically altered regions.
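The comparison step can be made concrete with a short sketch. The following is a minimal PyTorch illustration of the shared-weight Siamese forward pass and dense distance map; the ResNet-18 backbone, layer selection, and bilinear upsampling are placeholder choices for readability, not the paper's exact DeeplabV2 configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision


class SiameseChangeNet(nn.Module):
    """Shared-weight feature extractor plus a per-pixel distance ('change') map."""

    def __init__(self):
        super().__init__()
        # Placeholder backbone; CosimNet instantiates a DeeplabV2-style segmentation model.
        resnet = torchvision.models.resnet18(weights=None)
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])  # keep conv stages only

    def forward(self, img_t0, img_t1, metric="l2"):
        # Shared weights: the same backbone processes both time steps.
        f0 = self.backbone(img_t0)                    # (B, C, h, w)
        f1 = self.backbone(img_t1)
        # Normalize each spatial feature vector so distances are comparable across locations.
        f0 = F.normalize(f0, p=2, dim=1)
        f1 = F.normalize(f1, p=2, dim=1)
        if metric == "l2":
            dist = torch.norm(f0 - f1, p=2, dim=1, keepdim=True)     # Euclidean distance
        else:
            dist = 1.0 - (f0 * f1).sum(dim=1, keepdim=True)          # cosine dissimilarity
        # Upsample the dense change map back to input resolution; larger = more likely changed.
        return F.interpolate(dist, size=img_t0.shape[-2:],
                             mode="bilinear", align_corners=False)


net = SiameseChangeNet()
x0, x1 = torch.randn(1, 3, 224, 224), torch.randn(1, 3, 224, 224)
change_map = net(x0, x1)   # shape (1, 1, 224, 224)
```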
2. Contrastive Loss and Thresholded Variant
The key optimization driver is an adaptation of the contrastive loss, a metric-learning objective originally designed for identity recognition:

$$
L(D, y) = \frac{1}{2}\, y\, D^{2} + \frac{1}{2}\,(1 - y)\,\max(m - D,\, 0)^{2},
$$

where $D$ is the L2 distance between corresponding features from the $t_0$ and $t_1$ images, and $y$ labels pairs as unchanged ($1$) or changed ($0$). This loss contracts unchanged-pixel feature pairs while repelling changed-pixel pairs by margin $m$.
With large viewpoint differences or geometric misalignment, penalizing any nonzero distance for unchanged pairs is overly strict. CosimNet therefore introduces a Thresholded Contrastive Loss (TCL):

$$
L_{\mathrm{TCL}}(D, y) = \frac{1}{2}\, y\, \max(D - \tau,\, 0)^{2} + \frac{1}{2}\,(1 - y)\,\max(m - D,\, 0)^{2},
$$

where $\tau$ is a per-layer threshold tolerating moderate feature discrepancy in unchanged areas, reducing spurious detections caused by minor perspective or registration differences.
An alternative formulation replaces the L2 distance with a cosine-similarity-based dissimilarity between the paired feature vectors, used within the same contrastive framework.
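A minimal sketch of the two objectives, following the formulas above; the default margin and threshold values and the elementwise mean reduction are illustrative assumptions.

```python
import torch


def contrastive_loss(dist, y, margin=2.0):
    """Standard contrastive loss.
    dist: per-pixel feature distance map; y: 1 for unchanged pixels, 0 for changed ones."""
    pull = y * dist.pow(2)                                      # contract unchanged pairs
    push = (1 - y) * torch.clamp(margin - dist, min=0).pow(2)   # repel changed pairs by the margin
    return 0.5 * (pull + push).mean()


def thresholded_contrastive_loss(dist, y, margin=2.0, tau=0.3):
    """TCL: distances up to tau for unchanged pairs are tolerated, so small
    viewpoint or registration discrepancies are not penalized."""
    pull = y * torch.clamp(dist - tau, min=0).pow(2)
    push = (1 - y) * torch.clamp(margin - dist, min=0).pow(2)
    return 0.5 * (pull + push).mean()
```

The same functions apply unchanged when `dist` is the cosine dissimilarity map from the earlier sketch rather than the L2 distance.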
3. Robustness to Nuisance Variability
The fully convolutional backbone, with large receptive fields, captures invariances to illumination, shadows, and minor geometric transformations; thus, extraneous appearance changes are suppressed in the learned embedding. For small viewpoint shifts, standard contrastive loss is sufficient. For strong geometric changes (e.g., from camera motion or zoom), TCL allows for compensation without requiring explicit image pre-alignment (registration or use of structure-from-motion preprocessing is unnecessary).
The network further applies the metric loss per layer, with MultiLayer Side-Output (MLSO) supervision over multiple stages of the backbone, improving mid-level feature discriminability.
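One possible reading of the MLSO supervision is sketched below: the same metric loss is evaluated on feature pairs taken from several backbone depths and summed. The chosen stages, per-stage weights, and label resizing are assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F


def multilayer_side_output_loss(feature_pairs, y, loss_fn, weights=None):
    """Sum the metric loss over feature pairs drawn from several backbone stages.
    feature_pairs: list of (f_t0, f_t1) maps of shape (B, C, h, w);
    y: ground-truth labels of shape (B, 1, H, W) with 1 = unchanged, 0 = changed."""
    weights = weights or [1.0] * len(feature_pairs)
    total = 0.0
    for w, (f0, f1) in zip(weights, feature_pairs):
        f0 = F.normalize(f0, p=2, dim=1)
        f1 = F.normalize(f1, p=2, dim=1)
        dist = torch.norm(f0 - f1, p=2, dim=1)                    # (B, h, w)
        # Resize labels to this stage's spatial resolution before comparison.
        y_s = F.interpolate(y.float(), size=dist.shape[-2:], mode="nearest").squeeze(1)
        total = total + w * loss_fn(dist, y_s)                    # e.g. thresholded_contrastive_loss
    return total
```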
4. Empirical Evaluation and Feature Analysis
CosimNet was extensively validated on three challenging SCD datasets:
- VL-CMU-CD: Outdoor scenes with both semantic and nuisance changes.
- PCD2015: Panoramic street-view image pairs (Tsunami and GSV subsets).
- CDnet: Video frame sequences, targeting robust foreground change under varied nuisances.
Key metrics include F-score and precision. Representative F-scores for CosimNet-3layer-l2 and FCN-based baselines:
| Method | Tsunami (PCD2015) | GSV (PCD2015) | VL-CMU-CD |
|---|---|---|---|
| CosimNet-3layer-l2 | 0.806 | 0.692 | 0.706 |
| FCN-Late-Fusion | 0.809 | 0.685 | 0.714 |
| FCN-Metrics | 0.814 | 0.692 | 0.721 |
On CDnet, CosimNet reaches F-measure = 0.8591 and precision = 0.9383, outperforming or matching state-of-the-art segmentation and descriptor-based baselines.
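For reference, precision and F-measure can be computed by thresholding the learned distance map against ground-truth change masks. The sketch below assumes a binary ground truth and a hypothetical threshold value; it is not the benchmark's official evaluation code.

```python
import numpy as np


def precision_f_measure(change_map, gt_change, threshold=1.0):
    """Binarize the distance map and score it against a binary ground-truth change mask."""
    pred = change_map > threshold
    tp = np.logical_and(pred, gt_change).sum()
    precision = tp / max(pred.sum(), 1)       # fraction of predicted changes that are real
    recall = tp / max(gt_change.sum(), 1)     # fraction of real changes that are detected
    f_measure = 2 * precision * recall / max(precision + recall, 1e-12)
    return precision, f_measure
```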
Analysis of feature discriminability via RMS contrast and t-SNE embeddings shows that CosimNet's L2-based maps exhibit greater foreground-background separation and clear intra-class clustering than alternatives. Deeper feature layers (e.g., fc7) offer the strongest change response.
5. Comparison with Contemporary SCD Approaches
Relative to prior work—feature descriptor matching (Dense SIFT, Depth), classical FCNs without explicit metric learning, and previous SCD pipelines—CosimNet provides superior robustness and interpretability:
- Direct metric learning yields smoother, sharper, and more contrastive change maps.
- TCL innovation is crucial for suppressing false positives under geometric misalignment.
- Visualizations display well-separated clusters for changed/unchanged classes; ablations confirm the efficacy of L2 metric over cosine and benefit of multi-layer loss.
Feature space analysis highlights improved intra-class compactness and inter-class separability, key to reliable SCD.
6. Implementation Considerations and Loss Formulas
The loss formulation supports multi-objective optimization through multitask training with an auxiliary segmentation objective:

$$
L_{\text{total}} = L_{\text{metric}} + L_{\text{seg}},
$$

where $L_{\text{metric}}$ is the contrastive (or TCL) loss applied to network features and $L_{\text{seg}}$ is a pixel-wise segmentation loss.
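A minimal sketch of this combination, reusing the loss functions from the earlier snippet; the cross-entropy segmentation term and the balancing weight `lam` are assumptions rather than values taken from the paper.

```python
import torch

seg_criterion = torch.nn.CrossEntropyLoss()


def multitask_loss(dist, y_pair, seg_logits, seg_labels, lam=1.0):
    """Combine the per-pixel metric loss with a pixel-wise segmentation loss."""
    metric_term = thresholded_contrastive_loss(dist, y_pair)   # or contrastive_loss
    seg_term = seg_criterion(seg_logits, seg_labels)           # logits (B, K, H, W) vs labels (B, H, W)
    return metric_term + lam * seg_term                        # lam balances the two tasks
```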
Quantitative feature contrast is measured via RMS contrast:

$$
C_{\text{RMS}} = \sqrt{\frac{1}{MN}\sum_{i=1}^{M}\sum_{j=1}^{N}\left(I_{ij} - \bar{I}\right)^{2}},
$$

where $I_{ij}$ is the change-map response at pixel $(i, j)$ and $\bar{I}$ is its mean over the $M \times N$ map.
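Applied to a change map, this is simply the standard deviation of its responses around their mean; a small NumPy sketch:

```python
import numpy as np


def rms_contrast(change_map):
    """RMS contrast: standard deviation of the map's values around their mean."""
    m = np.asarray(change_map, dtype=np.float64)
    return np.sqrt(np.mean((m - m.mean()) ** 2))
```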
Feature vectors are L2-normalized to unit length prior to distance calculation to ensure comparability across spatial locations and robustness to amplitude scaling.
7. Conclusions and Applications
CosimNet demonstrates that metric learning—supplemented with thresholded loss for viewpoint variation—enables disentangled, semantically meaningful scene change detection under noisy and real-world conditions. The approach generalizes across outdoor, aerial, and video SCD datasets, showing up to 15% F-score improvement over published methods on VL-CMU-CD and robust operation on CDnet and PCD2015.
The method requires only paired images and does not depend on prior alignment or geometric correction, making it suitable for automated urban change monitoring, surveillance, automated mapping, and dynamic scene understanding applications. The architecture and source code are publicly available for research and real-world deployment.