Siamese Vision Transformer Encoder

Updated 12 March 2026
  • Siamese Vision Transformer Encoder is a dual-branch model that shares weights to independently process two images for similarity assessment.
  • It leverages patch embedding, positional encoding, and transformer blocks to produce comparable embeddings for metric learning tasks.
  • Weight sharing ensures consistent embedding production, making the architecture ideal for image retrieval, verification, and one-shot learning applications.

A Siamese Vision Transformer (ViT) encoder is a dual-branch architecture that applies the Vision Transformer encoder, as introduced in "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" (Dosovitskiy et al., 2020), in a weight-sharing configuration to two separate input images. Each branch processes an image independently with shared parameters, producing embeddings suitable for metric learning or similarity tasks. All architectural and procedural aspects—including patch embedding, positional encoding, transformer blocks, and output handling—are directly inherited from the foundational ViT encoder design.

1. Patch Embedding and Input Representation

Each input image $X \in \mathbb{R}^{H \times W \times C}$ is first divided into a grid of non-overlapping patches of size $P \times P$. This yields $N = HW/P^2$ patches $x_p^i \in \mathbb{R}^{P \times P \times C}$. Each patch is flattened to a vector in $\mathbb{R}^{P^2 \cdot C}$ and projected via a learnable embedding matrix $E \in \mathbb{R}^{(P^2 \cdot C) \times D}$ to obtain the patch embedding matrix:

$$X_p = \left[\, \text{flatten}(x_p^1)E;\ \text{flatten}(x_p^2)E;\ \ldots;\ \text{flatten}(x_p^N)E \,\right] \in \mathbb{R}^{N \times D}.$$

This process forms the input token sequence for the transformer backbone; both Siamese branches prepare their inputs identically, ensuring their representations are directly comparable.
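
As an illustration, the patch embedding stage can be sketched in PyTorch. The module name PatchEmbed is hypothetical, and the strided convolution is a standard equivalent of flattening each patch and multiplying by $E$, not a prescribed implementation:

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into P x P patches and linearly project each to dimension D.

    A Conv2d with kernel_size = stride = P is equivalent to flattening each
    non-overlapping patch and multiplying by the embedding matrix E.
    """
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2  # N = HW / P^2
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                    # x: (B, C, H, W)
        x = self.proj(x)                     # (B, D, H/P, W/P)
        return x.flatten(2).transpose(1, 2)  # (B, N, D)

# Example: a 224x224 RGB image yields N = (224/16)^2 = 196 tokens of dimension 768.
tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```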

2. Positional Encoding and Class Token Integration

A learnable class token $x_{\text{cls}} \in \mathbb{R}^{D}$ is prepended to the sequence of patch embeddings. Additionally, a learnable one-dimensional positional embedding matrix $E_{\text{pos}} \in \mathbb{R}^{(N+1) \times D}$ is added elementwise to the token sequence, including the class token. The composite input to the transformer is

$$Z^0 = [x_{\text{cls}};\, X_p] + E_{\text{pos}} \in \mathbb{R}^{(N+1) \times D},$$

where the class token enables the aggregation of global image information for downstream use, such as generating a compact embedding for metric learning.
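
A minimal sketch of this stage, assuming the $(N, D)$-shaped token output of the patch embedding above; the helper name TokenPrep is hypothetical:

```python
import torch
import torch.nn as nn

class TokenPrep(nn.Module):
    """Prepend the learnable class token and add positional embeddings."""
    def __init__(self, num_patches=196, embed_dim=768):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))  # x_cls
        self.pos_embed = nn.Parameter(
            torch.zeros(1, num_patches + 1, embed_dim))              # E_pos

    def forward(self, x_p):                                # x_p: (B, N, D)
        cls = self.cls_token.expand(x_p.shape[0], -1, -1)  # broadcast over batch
        return torch.cat([cls, x_p], dim=1) + self.pos_embed  # Z^0: (B, N+1, D)
```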

3. Transformer Encoder Architecture

The architecture employs $L$ sequential encoder blocks, each with two sub-layers: Multi-Head Self-Attention (MHSA) and a position-wise Multi-Layer Perceptron (MLP), both with LayerNorm pre-normalization (the "Pre-Norm" variant). For each layer $\ell = 1 \ldots L$:

  • Layer Normalization: $\widetilde{Z}^{\ell-1} = \text{LN}(Z^{\ell-1})$
  • MHSA Sub-block:
    • Each of the $h$ heads operates in dimension $d_h = D/h$.
    • Projections: $Q = \widetilde{Z}^{\ell-1} W_Q$, $K = \widetilde{Z}^{\ell-1} W_K$, $V = \widetilde{Z}^{\ell-1} W_V$ with $W_Q, W_K, W_V \in \mathbb{R}^{D \times D}$.
    • Per-head attention weights: $A^i = \text{Softmax}\left(Q^i (K^i)^\top / \sqrt{d_h}\right)$; each head outputs $A^i V^i$.
    • Output: the head outputs are concatenated over heads and projected by $W_O \in \mathbb{R}^{D \times D}$.
    • Residual connection: $Z' = Z^{\ell-1} + \text{MHSA}(\widetilde{Z}^{\ell-1})$
  • MLP Sub-block:
    • Pre-norm: $U = \text{LN}(Z')$
    • Feed-forward: $\text{MLP}(U) = \text{GELU}(U W_1 + b_1)\, W_2 + b_2$ with $W_1 \in \mathbb{R}^{D \times 4D}$, $W_2 \in \mathbb{R}^{4D \times D}$ (row-vector convention, matching the projections above).
    • Residual: $Z^{\ell} = Z' + \text{MLP}(U)$

After $L$ blocks, the output is $Z^L \in \mathbb{R}^{(N+1) \times D}$, from which the refined class token vector is extracted.
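
The block structure can be sketched as follows. This uses PyTorch's built-in nn.MultiheadAttention in place of an explicit per-head implementation, an equivalent but simplified choice:

```python
import torch.nn as nn

class PreNormBlock(nn.Module):
    """One Pre-Norm ViT encoder block: LN -> MHSA -> residual, LN -> MLP -> residual."""
    def __init__(self, dim=768, num_heads=12, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim),  # W_1, b_1
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),  # W_2, b_2
        )

    def forward(self, z):  # z: (B, N+1, D)
        h = self.norm1(z)
        z = z + self.attn(h, h, h, need_weights=False)[0]  # Z' = Z + MHSA(LN(Z))
        z = z + self.mlp(self.norm2(z))                    # Z^l = Z' + MLP(LN(Z'))
        return z
```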

4. Output Embedding and Metric Learning Setup

For embedding-based tasks, only the final class token vector $z_{\text{cls}} = Z^L[0] \in \mathbb{R}^D$ is used. In the Siamese setting, each branch processes its respective image to produce $z_{\text{cls}}^a$ and $z_{\text{cls}}^b$, both in $\mathbb{R}^D$. These embeddings are then compared (e.g., with cosine or Euclidean distance), or used as input to contrastive, triplet, or other metric learning losses. For classification, the output is

$$\hat{y} = z_{\text{cls}} W_{\text{cls}} \in \mathbb{R}^K$$

with $W_{\text{cls}} \in \mathbb{R}^{D \times K}$. For fine-tuning, $W_{\text{cls}}$ is generally re-initialized.
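
For concreteness, a sketch of the comparison stage on a batch of paired embeddings, using cosine similarity and a standard margin-based contrastive loss (one of the several losses mentioned above; the margin value and random inputs are purely illustrative):

```python
import torch
import torch.nn.functional as F

# z_a, z_b: final class-token embeddings from the two branches, shape (B, D).
z_a = torch.randn(8, 768)
z_b = torch.randn(8, 768)

# Cosine similarity between paired embeddings.
sim = F.cosine_similarity(z_a, z_b, dim=-1)  # (B,)

# Margin-based contrastive loss on Euclidean distance;
# y = 1 for similar pairs, 0 for dissimilar pairs.
y = torch.randint(0, 2, (8,)).float()
d = F.pairwise_distance(z_a, z_b)            # (B,)
margin = 1.0
loss = (y * d.pow(2) + (1 - y) * F.relu(margin - d).pow(2)).mean()
```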

5. Weight Sharing and Twin Branch Realization

The Siamese instantiation requires strict weight sharing. All learnable parameters are identical across both branches: the embedding matrix $E$, the positional embeddings $E_{\text{pos}}$, the per-layer attention and projection weights $\{W_Q, W_K, W_V, W_O\}$, the per-layer MLP weights and biases $\{W_1, W_2, b_1, b_2\}$, and optionally $W_{\text{cls}}$. Consequently, the encoder network for each image is functionally and parametrically identical, ensuring the resulting embeddings lie in the same $D$-dimensional feature space and are directly comparable. This configuration is essential for learning effective metrics or similarities.
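
In code, strict weight sharing is most simply realized by routing both images through a single encoder instance rather than duplicating and synchronizing parameters. A minimal sketch, assuming any encoder module that maps an image to a $D$-dimensional embedding:

```python
import torch.nn as nn

class SiameseViT(nn.Module):
    """Twin branches realized by calling one encoder instance twice.

    Because both forward passes go through the same module, every parameter
    (E, E_pos, attention and MLP weights) is shared by construction, and the
    shared weights accumulate gradients from both branches.
    """
    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder  # e.g., a ViT mapping an image to z_cls

    def forward(self, img_a, img_b):
        return self.encoder(img_a), self.encoder(img_b)
```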

6. Hyper-parameters and Training Paradigms

The ViT encoder supports several model sizes: ViT-Base ($L=12$, $D=768$, $h=12$, MLP size 3072, $\sim$86M parameters), ViT-Large ($L=24$, $D=1024$, $h=16$, MLP size 4096, $\sim$307M parameters), and ViT-Huge ($L=32$, $D=1280$, $h=16$, MLP size 5120, $\sim$632M parameters). The standard patch size is $16 \times 16$ with a default input size of $224 \times 224$. Pre-training is typically conducted on large-scale datasets such as ImageNet-21k ($\sim$14M images) or JFT-300M ($\sim$303M images) with the Adam optimizer, weight decay, and linear learning rate warm-up and decay. Fine-tuning involves SGD with momentum, smaller batch sizes, and sometimes larger image resolutions.
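
These variants can be summarized as configuration entries; the dictionary below is simply a restatement of the figures above (parameter counts are approximate, per Dosovitskiy et al., 2020):

```python
# Standard ViT variants, expressed as configs for the sketches above.
VIT_CONFIGS = {
    "ViT-Base":  dict(layers=12, dim=768,  heads=12, mlp_dim=3072),  # ~86M params
    "ViT-Large": dict(layers=24, dim=1024, heads=16, mlp_dim=4096),  # ~307M params
    "ViT-Huge":  dict(layers=32, dim=1280, heads=16, mlp_dim=5120),  # ~632M params
}
```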

7. Applications and Context in Metric Learning

The Siamese ViT encoder is directly applicable to similarity and metric learning because weight sharing makes its embeddings depend only on the input image, not on which branch processes it. Typical use cases include one-shot learning, image retrieval, clustering, and verification tasks, where the ability to compare feature vectors in a shared embedding space is operationally critical. By leveraging transformer-based sequence modeling in vision, the Siamese ViT encoder captures global contextual information in each image representation, and its strict architectural symmetry ensures that distance computations between embeddings are meaningful and consistent across diverse inputs (Dosovitskiy et al., 2020).

References

  • Dosovitskiy, A., et al. (2020). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv:2010.11929.
