Siamese Networks

Updated 14 December 2025
  • Siamese Networks are neural architectures with two identical branches sharing parameters to learn similarity metrics from paired inputs.
  • They typically employ contrastive and triplet losses to enforce similarity for matched pairs and separation for dissimilar pairs.
  • Their versatile design supports applications in verification, retrieval, few-shot learning, and cross-modal matching across diverse domains.

A Siamese network is a neural network architecture composed of two or more identical subnetworks (branches) with shared parameters. Its principal function is to learn a representation or metric such that similar inputs yield similar embeddings and dissimilar inputs yield distant embeddings. This paradigm has found broad use in metric learning, verification, retrieval, few-shot classification, cross-modal matching, and self-supervised learning. The core architectural pattern is weight tying between branches and a similarity function (e.g., Euclidean distance, cosine similarity, or learned metric) between resulting embeddings. Training employs pairwise or triplet-based losses to enforce desired invariances or discriminate among classes, patterns, or structures.

1. Architectural Principles and Weight Sharing

The canonical Siamese network comprises two isomorphic or identical subnetworks, each ingesting one of the paired (or triplet/multiway) inputs. All parameters—including convolutional kernels, batch normalization statistics, or transformer weights—are exactly shared. This weight tying ensures that both branches perform the same transformation and the resulting representation is invariant to the branch origin.

Branches may be shallow or deep, and can be realized as MLPs, CNNs, LSTMs, or transformers depending on the input modality:

  • For image, speech, and signal processing, deep CNN branches predominate.
  • For sequential data or language, shared LSTM or transformer branches are used.
  • Hybrid architectures can process multi-modal inputs with tied or non-shared parameters for each view (Shaham et al., 2015).

The last layer(s) of each branch yields an embedding vector $f(x) \in \mathbb{R}^d$, over which a suitable similarity or distance measure is applied. For practical uses, various downstream layers or decoders may consume concatenated or fused features (e.g., for segmentation, tracking, or change detection).
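
A minimal PyTorch sketch of this pattern (the CNN backbone and layer sizes below are illustrative, not tied to any cited architecture): a single encoder module is applied to both inputs, so weight tying holds by construction, and a Euclidean distance is computed over the resulting embeddings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseEncoder(nn.Module):
    """One shared encoder applied to both inputs; weight tying is automatic
    because the same module (same parameters) processes each branch."""
    def __init__(self, embed_dim: int = 128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )

    def embed(self, x: torch.Tensor) -> torch.Tensor:
        return self.backbone(x)

    def forward(self, x1: torch.Tensor, x2: torch.Tensor) -> torch.Tensor:
        z1, z2 = self.embed(x1), self.embed(x2)
        # Euclidean distance between the two embeddings, one value per pair.
        return F.pairwise_distance(z1, z2)

model = SiameseEncoder()
d = model(torch.randn(8, 3, 64, 64), torch.randn(8, 3, 64, 64))  # shape (8,)
```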

2. Loss Functions and Supervised, Unsupervised, and Semi-Supervised Regimes

Contrastive and Triplet-based Losses

The most widely used formulation is the contrastive loss (Sahito et al., 2021, Shahtalebi et al., 2020, Yuan et al., 12 Mar 2025):

\mathcal{L}_{\text{contrastive}} = y\, D^2 + (1-y)\,[\max(0,\, m - D)]^2

where $D = \|f(x_1) - f(x_2)\|_2$, $y$ is the similarity label, and $m$ is a margin. The loss minimizes distance for positive (similar) pairs and enforces a boundary for negative (dissimilar) pairs (Sahito et al., 2021, Shahtalebi et al., 2020, Yuan et al., 12 Mar 2025).
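
A direct transcription of this loss into PyTorch (the margin value and the convention $y = 1$ for similar pairs follow the formula above; names are illustrative):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z1: torch.Tensor, z2: torch.Tensor, y: torch.Tensor,
                     margin: float = 1.0) -> torch.Tensor:
    """y = 1 for similar pairs, 0 for dissimilar, matching the formula above."""
    d = F.pairwise_distance(z1, z2)            # D = ||f(x1) - f(x2)||_2
    pos = y * d.pow(2)                         # pull similar pairs together
    neg = (1 - y) * F.relu(margin - d).pow(2)  # push dissimilar pairs beyond the margin
    return (pos + neg).mean()
```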

The triplet loss is often preferred for finer discrimination:

\mathcal{L}_{\text{triplet}} = \max\bigl(\|f(a) - f(p)\|_2 - \|f(a) - f(n)\|_2 + m,\; 0\bigr)

where $a$ is the anchor input, $p$ is a positive (same-class) input, and $n$ is a negative (different-class) input (Sahito et al., 2021).
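
The explicit form below is a sketch of this loss in PyTorch (the margin value is illustrative); it is equivalent to the built-in F.triplet_margin_loss.

```python
import torch
import torch.nn.functional as F

def triplet_loss(za: torch.Tensor, zp: torch.Tensor, zn: torch.Tensor,
                 margin: float = 0.2) -> torch.Tensor:
    """Explicit triplet loss; equivalent to
    F.triplet_margin_loss(za, zp, zn, margin=margin)."""
    d_ap = F.pairwise_distance(za, zp)  # anchor-positive distance
    d_an = F.pairwise_distance(za, zn)  # anchor-negative distance
    return F.relu(d_ap - d_an + margin).mean()
```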

Other loss innovations include Fisher Discriminant Triplet/Contrastive losses employing within-class and between-class scatter matrices to align feature geometry with linear discriminants (Ghojogh et al., 2020).

Self-Supervised and Hybrid Objectives

For unsupervised/self-supervised learning, Siamese structures are used with losses over positive pairs generated by random augmentations, enforcing invariance while avoiding collapse. The stop-gradient mechanism, as in SimSiam (Chen et al., 2020), is critical for preventing degenerate solutions in the absence of negatives:

\mathcal{L} = -\frac{1}{2}\left[\operatorname{sim}(p_1, \operatorname{stopgrad}(z_2)) + \operatorname{sim}(p_2, \operatorname{stopgrad}(z_1))\right]

This alternating optimization, where one side's representation is fixed per iteration, is interpreted as an EM-like step and is effective in applications such as MRI reconstruction (Sun et al., 18 Jan 2025).
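
A minimal sketch of the symmetrized negative cosine similarity with stop-gradient, following the formula above rather than any particular codebase: here $p$ denotes predictor outputs, $z$ projector outputs, and `.detach()` plays the role of stopgrad.

```python
import torch
import torch.nn.functional as F

def simsiam_loss(p1: torch.Tensor, p2: torch.Tensor,
                 z1: torch.Tensor, z2: torch.Tensor) -> torch.Tensor:
    """Symmetrized negative cosine similarity; .detach() implements stop-gradient."""
    def neg_cos(p, z):
        return -F.cosine_similarity(p, z.detach(), dim=-1).mean()
    return 0.5 * (neg_cos(p1, z2) + neg_cos(p2, z1))
```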

In geometric or manifold embedding tasks, the pairwise loss can encode structure-preserving objectives (e.g., Sammon's mapping) (Lei et al., 2019).

Semi-Supervised and Label Propagation

Siamese networks also underpin iterative pseudo-labeling and graph-based label propagation in semi-supervised settings. Label propagation is formulated with learned embeddings and affinity matrices, propagating class information smoothly across the induced metric space (Sahito et al., 2021).
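
A schematic sketch of label propagation over Siamese embeddings in the standard normalized-affinity form; the Gaussian affinity, bandwidth heuristic, and α value below are illustrative choices, not the exact construction of the cited work.

```python
import torch

def propagate_labels(z: torch.Tensor, y_onehot: torch.Tensor,
                     alpha: float = 0.99, iters: int = 50) -> torch.Tensor:
    """z: (n, d) embeddings; y_onehot: (n, c) float, all-zero rows for unlabeled points."""
    d = torch.cdist(z, z)                               # pairwise embedding distances
    w = torch.exp(-d.pow(2) / (2 * d.mean().pow(2)))    # Gaussian affinity (bandwidth heuristic)
    w.fill_diagonal_(0)
    deg = w.sum(dim=1, keepdim=True).clamp_min(1e-8)
    s = w / deg.sqrt() / deg.sqrt().t()                 # symmetric normalization D^-1/2 W D^-1/2
    f = y_onehot.clone()
    for _ in range(iters):
        f = alpha * s @ f + (1 - alpha) * y_onehot      # iterative propagation
    return f.argmax(dim=1)                              # pseudo-labels for all points
```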

3. Paradigmatic Applications and Domain-Specific Variants

Recognition, Retrieval, and Verification

Siamese networks are the standard approach for verification or matching tasks, such as face/fingerprint/iris recognition, product or document retrieval, or cross-modal search (e.g., code retrieval by natural language) (Yuan et al., 12 Mar 2025, Sinha et al., 2020, Benajiba et al., 2018). Here, the network optimizes for strong intra-class similarity and inter-class separation, permitting $k$-NN-style inference or direct verification.
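
Inference in this setting is typically a thresholded similarity test or a ranking over a gallery of precomputed embeddings; a minimal sketch (function names and the threshold value are illustrative):

```python
import torch
import torch.nn.functional as F

def verify(z_query: torch.Tensor, z_reference: torch.Tensor,
           threshold: float = 0.7) -> torch.Tensor:
    """Accept a match when cosine similarity exceeds a threshold
    calibrated on a validation set (0.7 is purely illustrative)."""
    return F.cosine_similarity(z_query, z_reference, dim=-1) >= threshold

def retrieve_topk(z_query: torch.Tensor, gallery: torch.Tensor, k: int = 5) -> torch.Tensor:
    """Rank gallery embeddings (n, d) by cosine similarity to a query embedding (d,)."""
    sims = F.cosine_similarity(z_query.unsqueeze(0), gallery, dim=-1)
    return sims.topk(k).indices
```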

Few-Shot and Metric-Based Learning

In few-shot image classification, Siamese transformer networks exploit global and local features—specifically, combining class-token and patch-token representations via parallel branches and using Euclidean and KL divergence metrics (Jiang et al., 16 Jul 2024). Prototypical nearest-centroid classifiers over learned embeddings are common.
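
A sketch of the nearest-centroid (prototypical) classification step over learned embeddings; this illustrates the general pattern, not the exact metric combination of the cited transformer method.

```python
import torch

def prototype_classify(support_z: torch.Tensor, support_y: torch.Tensor,
                       query_z: torch.Tensor, n_classes: int) -> torch.Tensor:
    """Class prototypes are mean support embeddings; queries get the nearest prototype.
    Assumes support_y holds integer labels 0..n_classes-1."""
    protos = torch.stack([support_z[support_y == c].mean(dim=0) for c in range(n_classes)])
    dists = torch.cdist(query_z, protos)   # (n_query, n_classes) Euclidean distances
    return dists.argmin(dim=1)             # predicted class per query
```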

Change Detection and Multi-Input Fusion

For change detection in remote sensing, Siamese architectures compare co-registered image pairs through parallel encoders, with feature fusion as late as possible. Augmenting such architectures with cross-branch mutual attention modules (MASNet) yields statistically significant gains in mIoU and F1 across CNN and transformer backbones. Information exchange at intermediate layers allows earlier suppression or enhancement of features correlated with changes (Zhou et al., 2022).

Object Tracking

Deep Siamese networks are central for similarity learning in object tracking. Architectures such as SiamFC and variants operate in a fully convolutional fashion, using cross-correlation between template and search features (Li et al., 2021, Shen et al., 2019, Li et al., 2018). Advanced formulations employ multi-branch ensembles with online selection, hierarchical feature fusion (e.g., via SE-blocks and dual backbones), and compression/distillation approaches for real-time performance and resource efficiency (Li et al., 2021, Shen et al., 2019, Li et al., 2018).
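
A sketch of SiamFC-style cross-correlation, where each template feature map acts as a convolution kernel over its corresponding search-region feature map; the grouped-convolution batching trick is a common implementation choice shown here for illustration.

```python
import torch
import torch.nn.functional as F

def cross_correlate(template_feat: torch.Tensor, search_feat: torch.Tensor) -> torch.Tensor:
    """template_feat: (b, c, h, w); search_feat: (b, c, H, W).
    Each template is correlated only with its own search region,
    producing a (b, 1, H-h+1, W-w+1) response map."""
    b, c, h, w = template_feat.shape
    # Fold the batch into channels so one grouped conv handles all pairs independently.
    search = search_feat.reshape(1, b * c, *search_feat.shape[-2:])
    kernels = template_feat.reshape(b, c, h, w)
    response = F.conv2d(search, kernels, groups=b)
    return response.reshape(b, 1, *response.shape[-2:])
```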

Signal and Manifold Learning

Siamese networks unify supervised and unsupervised geometric embedding, such as in wireless user positioning and channel charting (Lei et al., 2019). Pairwise geometric losses are used to enforce local (Sammon) or global (MDS) structure in low-dimensional embeddings, with applications in positioning and manifold learning.
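
A sketch of a Sammon-style pairwise loss over Siamese embeddings, penalizing deviations of embedding distances from target distances with inverse-distance weighting; the exact weighting and choice of target distances in the cited work may differ.

```python
import torch

def sammon_pairwise_loss(z1: torch.Tensor, z2: torch.Tensor,
                         d_target: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Penalize deviation of embedding distances from target (e.g., geodesic or
    feature-space) distances, with Sammon's 1/d_target weighting."""
    d_embed = torch.linalg.norm(z1 - z2, dim=-1)
    return (((d_embed - d_target) ** 2) / (d_target + eps)).mean()
```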

Speech, Language, and Semantic Pattern Mining

For unsupervised speech representation, type-frequency flattening in Siamese pair selection is critical due to Zipfian distributions, and pairwise frame alignment or truncation addresses variable-length inputs. An imbalanced mix of positive/negative class and same/different speaker pairs yields optimal subword discriminability (Riad et al., 2018). In NLP, Siamese LSTMs model semantic patterns (e.g., SQL template classification), and regression-based pairwise losses can be directly tied to structural similarity metrics (Benajiba et al., 2018).

Explainability

Embedding-space explanations can be obtained via Siamese prototypes and autoencoders. Perturbing important dimensions at the bottleneck and reconstructing to input space identifies input regions critical for similarity-based decisions (Utkin et al., 2019).

4. Feature Fusion, Attention, and Architecture Extensions

Siamese networks allow flexible feature fusion strategies:

  • Channel-wise concatenation, elementwise difference, or addition of final-layer embeddings (Zhou et al., 2022); see the fusion sketch after this list.
  • Early mutual attention modules implement cross-branch information exchange after each encoder stage, enhancing change detection and supporting modular insertion in any Siamese backbone (CNNs or transformers) with minor computational overhead (Zhou et al., 2022).
  • Hierarchical feature fusion combines multiple depths and/or multiple models, with feature calibration (e.g., SE-blocks) to select discriminative channels per context (Li et al., 2021).
  • In transformer-based networks, Siamese branches extract and fuse global and local features, integrating similarity scores using L2-normalization and weighted summation (Jiang et al., 16 Jul 2024).
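
A minimal sketch of the basic fusion operators over final-layer embeddings (function and mode names are illustrative):

```python
import torch

def fuse(z1: torch.Tensor, z2: torch.Tensor, mode: str = "concat") -> torch.Tensor:
    """Common fusion operators for final-layer Siamese embeddings."""
    if mode == "concat":
        return torch.cat([z1, z2], dim=-1)   # channel-wise concatenation
    if mode == "diff":
        return (z1 - z2).abs()               # elementwise absolute difference
    if mode == "sum":
        return z1 + z2                       # elementwise addition
    raise ValueError(f"unknown fusion mode: {mode}")
```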

Architecture search for Siamese "heads" (MLP projectors/predictors) via differentiable NAS has produced non-trivial compositions of pooling, smooth activations, and skip connections that robustly avoid representation collapse and outperform hand-designed baselines in self-supervised visual learning (Heuillet et al., 2023).

Multi-branch architectures with online branch selection or mutual knowledge sharing offer strong adaptive capacity to handle appearance variation and can be distilled for deployment on edge devices with minimal accuracy loss (Shen et al., 2019, Li et al., 2018).

5. Empirical Performance, Training Protocols, and Sampling Schemes

Comparative benchmarking reveals several empirical guidelines:

  • Early cross-branch interaction substantially improves performance in change detection, exemplified by the MASNet block yielding mean mIoU gains of +1.89 over HRNet-OCR and +3.54 for SegFormer on the SECOND dataset (Zhou et al., 2022).
  • Frequency-flattened sampling and carefully balanced positive/negative pair ratios are necessary for robust subword phonetic discrimination in speech (Riad et al., 2018); a pair-sampling sketch follows this list.
  • Data augmentation specific to lighting, occlusion, or domain improves generalization in retrieval and localization contexts (Cabrera et al., 15 Jul 2024).
  • Optimization often uses Adam or SGD, batch normalization, margin selection, and regularization according to the domain (Yuan et al., 12 Mar 2025, Sahito et al., 2021). Batch size and pair sampling ratios directly influence the geometry of the embedding space and downstream classifier performance (Riad et al., 2018).
  • In few-shot and SSL regimes, the meta-learning strategy with episodic sampling is standard, enabling strong performance on diverse tasks when using advanced feature caching and training of pseudo-Siamese transformers (Jiang et al., 16 Jul 2024).
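
A sketch of balanced positive/negative pair sampling from labeled data; the 50/50 ratio is illustrative, and the cited speech work argues that this ratio should be tuned rather than fixed.

```python
import random
from collections import defaultdict

def sample_pairs(labels, n_pairs: int, pos_fraction: float = 0.5, seed: int = 0):
    """Return (i, j, y) index pairs: y=1 same class, y=0 different class.
    Assumes at least two classes and at least one class with two or more samples."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_class[lab].append(idx)
    multi = [c for c in by_class if len(by_class[c]) >= 2]  # classes usable for positives
    pairs = []
    for _ in range(n_pairs):
        if rng.random() < pos_fraction:
            c = rng.choice(multi)
            i, j = rng.sample(by_class[c], 2)
            pairs.append((i, j, 1))
        else:
            c1, c2 = rng.sample(list(by_class.keys()), 2)
            pairs.append((rng.choice(by_class[c1]), rng.choice(by_class[c2]), 0))
    return pairs
```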

Table: Typical Quantitative Performance (selected domains)

| Domain / Task | Baseline(s) | Siamese Net Variant | Best Reported Metric |
|---|---|---|---|
| Change detection (Zhou et al., 2022) | HRNet-OCR, SegFormer | MASNet | SECOND mIoU = 55.59 (↑3.54 over SegFormer) |
| Wireless positioning (Lei et al., 2019) | FCNN, autoencoder | Siamese CC (unsupervised) | Kruskal stress ≈ 0.94–0.96 (test), TW ≈ 0.97 |
| Few-shot classification (Jiang et al., 16 Jul 2024) | CPEA (ViT) | Siamese Transformer Network | miniImageNet 5-shot acc. = 88.00% (ViT-small) |
| EEG BCI (Shahtalebi et al., 2020) | FBCSP, SCSSP, BSSFO | Siamese-CNN (OVR) | Avg. 4-class κ = 0.554 |
| MRI reconstruction (Sun et al., 18 Jan 2025) | SSDU, VS-Net, ISTA | SiamRecon | IXI 2D, 20%: PSNR = 34.7, SSIM = 0.94 |
| Biometric twins (Yuan et al., 12 Mar 2025) | Human benchmark | Siamese ResNet-18 | Accuracy ≈ 81% (raw input, test set) |

6. Generalization, Limitations, and Open Challenges

Siamese networks generalize well across data modalities, including visual, auditory, geometric, sensor, or multi-modal data, with minimal modification to the branch architecture. The model's invariance properties are determined by both network capacity and the structure of pair/triplet sampling (Shaham et al., 2015, Riad et al., 2018).

Practical limitations include marginal increases in parameter count and latency when integrating mutual attention (Zhou et al., 2022), the need for careful margin and sampling ratio calibration (Riad et al., 2018), imperfect global-context modeling in tasks requiring large receptive fields, and sensitivity to domain shifts or missing input modalities. For tasks involving more than two inputs (e.g., multi-temporal change detection), higher-order extension of the basic pairwise attention is required (Zhou et al., 2022).

Research challenges include robust semi-supervised extensions for highly nonlinear or cross-condition tasks, design of scalable mutual-attention for multi-branch scenarios, and cross-modal fusion beyond basic concatenation, as well as deeper theoretical understanding of loss function geometry and collapse-prevention mechanisms.

7. Conclusion and Future Prospects

Siamese networks are a central architecture in metric learning and representation learning, underpinning state-of-the-art methods in change detection, recognition, tracking, geometric embedding, bio-signal analysis, and self-supervised learning. Their power arises from shared-weight feature extraction, flexible loss designs, and compatibility with a variety of neural backbones. Recent innovations include plug-in mutual-attention modules, hierarchical and multi-branch fusion strategies, architecture search for projector/predictor heads, and advanced loss functions grounded in classical discriminant analysis. Empirical results demonstrate meaningful advances across a spectrum of domains at modest additional computational cost.

Ongoing research focuses on scaling Siamese architectures to more complex matching problems, integrating cross-modal and cross-task information, addressing data imbalance and long-tail distributions, and formalizing the stability properties of negative-free unsupervised objectives (Chen et al., 2020, Heuillet et al., 2023).
