Siamese Vision Transformer Encoder

Updated 12 March 2026
  • Siamese Vision Transformer Encoder is a dual-branch model that shares weights to independently process two images for similarity assessment.
  • It leverages patch embedding, positional encoding, and transformer blocks to produce comparable embeddings for metric learning tasks.
  • Weight sharing ensures consistent embedding production, making the architecture ideal for image retrieval, verification, and one-shot learning applications.

A Siamese Vision Transformer (ViT) encoder is a dual-branch architecture that applies the Vision Transformer encoder, as introduced in "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" (Dosovitskiy et al., 2020), in a weight-sharing configuration to two separate input images. Each branch processes an image independently with shared parameters, producing embeddings suitable for metric learning or similarity tasks. All architectural and procedural aspects—including patch embedding, positional encoding, transformer blocks, and output handling—are directly inherited from the foundational ViT encoder design.

1. Patch Embedding and Input Representation

Each input image $X \in \mathbb{R}^{H \times W \times C}$ is first divided into a grid of non-overlapping patches of size $P \times P$. This yields $N = HW/P^2$ patches $x_p^i \in \mathbb{R}^{P \times P \times C}$. Each patch is flattened to a vector in $\mathbb{R}^{P^2 \cdot C}$ and projected via a learnable embedding matrix $E \in \mathbb{R}^{(P^2 \cdot C) \times D}$ to obtain the patch embedding matrix:

$$X_p = \left[\, \text{flatten}(x_p^1)E;\ \text{flatten}(x_p^2)E;\ \ldots;\ \text{flatten}(x_p^N)E \,\right] \in \mathbb{R}^{N \times D}.$$

This process forms the input token sequence for the transformer backbone; both Siamese branches prepare their inputs identically, ensuring their representations are directly comparable.
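
As an illustration, the patch embedding stage can be sketched in PyTorch. The module name PatchEmbed is hypothetical, and the strided convolution is a standard equivalent of flattening each patch and multiplying by $E$, not a prescribed implementation:

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into P x P patches and linearly project each to dimension D.

    A Conv2d with kernel_size = stride = P is equivalent to flattening each
    non-overlapping patch and multiplying by the embedding matrix E.
    """
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2  # N = HW / P^2
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                    # x: (B, C, H, W)
        x = self.proj(x)                     # (B, D, H/P, W/P)
        return x.flatten(2).transpose(1, 2)  # (B, N, D)

# Example: a 224x224 RGB image yields N = (224/16)^2 = 196 tokens of dimension 768.
tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```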

2. Positional Encoding and Class Token Integration

A learnable class token $x_{\text{cls}} \in \mathbb{R}^{D}$ is prepended to the sequence of patch embeddings. Additionally, a learnable one-dimensional positional embedding matrix $E_{\text{pos}} \in \mathbb{R}^{(N+1) \times D}$ is added elementwise to the token sequence, including the class token. The composite input to the transformer is

$$Z^0 = [x_{\text{cls}};\, X_p] + E_{\text{pos}} \in \mathbb{R}^{(N+1) \times D},$$

where the class token enables the aggregation of global image information for downstream use, such as generating a compact embedding for metric learning.
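
A minimal sketch of this stage, assuming the $(N, D)$-shaped token output of the patch embedding above; the helper name TokenPrep is hypothetical:

```python
import torch
import torch.nn as nn

class TokenPrep(nn.Module):
    """Prepend the learnable class token and add positional embeddings."""
    def __init__(self, num_patches=196, embed_dim=768):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))  # x_cls
        self.pos_embed = nn.Parameter(
            torch.zeros(1, num_patches + 1, embed_dim))              # E_pos

    def forward(self, x_p):                                # x_p: (B, N, D)
        cls = self.cls_token.expand(x_p.shape[0], -1, -1)  # broadcast over batch
        return torch.cat([cls, x_p], dim=1) + self.pos_embed  # Z^0: (B, N+1, D)
```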

3. Transformer Encoder Architecture

The architecture employs $L$ sequential encoder blocks, each with two sub-layers: Multi-Head Self-Attention (MHSA) and a position-wise Multi-Layer Perceptron (MLP), both with LayerNorm pre-normalization (the "Pre-Norm" variant). For each layer $\ell = 1 \ldots L$:

  • Layer Normalization: $\widetilde{Z}^{\ell-1} = \text{LN}(Z^{\ell-1})$
  • MHSA Sub-block:
    • Each of the $h$ heads operates in dimension $d_h = D/h$.
    • Projections: $Q = \widetilde{Z}^{\ell-1} W_Q$, $K = \widetilde{Z}^{\ell-1} W_K$, $V = \widetilde{Z}^{\ell-1} W_V$ with $W_Q, W_K, W_V \in \mathbb{R}^{D \times D}$.
    • Per-head attention weights: $A^i = \text{Softmax}\left(Q^i (K^i)^\top / \sqrt{d_h}\right)$; each head outputs $A^i V^i$.
    • Output: the head outputs are concatenated over heads and projected by $W_O \in \mathbb{R}^{D \times D}$.
    • Residual connection: $Z' = Z^{\ell-1} + \text{MHSA}(\widetilde{Z}^{\ell-1})$
  • MLP Sub-block:
    • Pre-norm: $U = \text{LN}(Z')$
    • Feed-forward: $\text{MLP}(U) = \text{GELU}(U W_1 + b_1)\, W_2 + b_2$ with $W_1 \in \mathbb{R}^{D \times 4D}$, $W_2 \in \mathbb{R}^{4D \times D}$ (row-vector convention, matching the projections above).
    • Residual: $Z^{\ell} = Z' + \text{MLP}(U)$

After $L$ blocks, the output is $Z^L \in \mathbb{R}^{(N+1) \times D}$, from which the refined class token vector is extracted.
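
The block structure can be sketched as follows. This uses PyTorch's built-in nn.MultiheadAttention in place of an explicit per-head implementation, an equivalent but simplified choice:

```python
import torch.nn as nn

class PreNormBlock(nn.Module):
    """One Pre-Norm ViT encoder block: LN -> MHSA -> residual, LN -> MLP -> residual."""
    def __init__(self, dim=768, num_heads=12, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim),  # W_1, b_1
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),  # W_2, b_2
        )

    def forward(self, z):  # z: (B, N+1, D)
        h = self.norm1(z)
        z = z + self.attn(h, h, h, need_weights=False)[0]  # Z' = Z + MHSA(LN(Z))
        z = z + self.mlp(self.norm2(z))                    # Z^l = Z' + MLP(LN(Z'))
        return z
```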

4. Output Embedding and Metric Learning Setup

For embedding-based tasks, only the final class token vector $z_{\text{cls}} = Z^L[0] \in \mathbb{R}^D$ is used. In the Siamese setting, each branch processes its respective image to produce $z_{\text{cls}}^a$ and $z_{\text{cls}}^b$, both in $\mathbb{R}^D$. These embeddings are then compared (e.g., with cosine or Euclidean distance), or used as input to contrastive, triplet, or other metric learning losses. For classification, the output is

$$\hat{y} = z_{\text{cls}} W_{\text{cls}} \in \mathbb{R}^K$$

with $W_{\text{cls}} \in \mathbb{R}^{D \times K}$. For fine-tuning, $W_{\text{cls}}$ is generally re-initialized.
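
For concreteness, a sketch of the comparison stage on a batch of paired embeddings, using cosine similarity and a standard margin-based contrastive loss (one of the several losses mentioned above; the margin value and random inputs are purely illustrative):

```python
import torch
import torch.nn.functional as F

# z_a, z_b: final class-token embeddings from the two branches, shape (B, D).
z_a = torch.randn(8, 768)
z_b = torch.randn(8, 768)

# Cosine similarity between paired embeddings.
sim = F.cosine_similarity(z_a, z_b, dim=-1)  # (B,)

# Margin-based contrastive loss on Euclidean distance;
# y = 1 for similar pairs, 0 for dissimilar pairs.
y = torch.randint(0, 2, (8,)).float()
d = F.pairwise_distance(z_a, z_b)            # (B,)
margin = 1.0
loss = (y * d.pow(2) + (1 - y) * F.relu(margin - d).pow(2)).mean()
```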

5. Weight Sharing and Twin Branch Realization

The Siamese instantiation requires strict weight sharing. All learnable parameters are identical across both branches: the embedding matrix $E$, the positional embeddings $E_{\text{pos}}$, the per-layer attention and projection weights $\{W_Q, W_K, W_V, W_O\}$, the per-layer MLP weights and biases $\{W_1, W_2, b_1, b_2\}$, and optionally $W_{\text{cls}}$. Consequently, the encoder network for each image is functionally and parametrically identical, ensuring the resulting embeddings lie in the same $D$-dimensional feature space and are directly comparable. This configuration is essential for learning effective metrics or similarities.
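
In code, strict weight sharing is most simply realized by routing both images through a single encoder instance rather than duplicating and synchronizing parameters. A minimal sketch, assuming any encoder module that maps an image to a $D$-dimensional embedding:

```python
import torch.nn as nn

class SiameseViT(nn.Module):
    """Twin branches realized by calling one encoder instance twice.

    Because both forward passes go through the same module, every parameter
    (E, E_pos, attention and MLP weights) is shared by construction, and the
    shared weights accumulate gradients from both branches.
    """
    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder  # e.g., a ViT mapping an image to z_cls

    def forward(self, img_a, img_b):
        return self.encoder(img_a), self.encoder(img_b)
```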

6. Hyper-parameters and Training Paradigms

The ViT encoder supports several model sizes: ViT-Base ($L=12$, $D=768$, $h=12$, MLP size 3072, $\sim$86M parameters), ViT-Large ($L=24$, $D=1024$, $h=16$, MLP size 4096, $\sim$307M parameters), and ViT-Huge ($L=32$, $D=1280$, $h=16$, MLP size 5120, $\sim$632M parameters). The standard patch size is $16 \times 16$ with a default input size of $224 \times 224$. Pre-training is typically conducted on large-scale datasets such as ImageNet-21k ($\sim$14M images) or JFT-300M ($\sim$303M images) with the Adam optimizer, weight decay, and linear learning rate warm-up and decay. Fine-tuning involves SGD with momentum, smaller batch sizes, and sometimes larger image resolutions.
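
These variants can be summarized as configuration entries; the dictionary below is simply a restatement of the figures above (parameter counts are approximate, per Dosovitskiy et al., 2020):

```python
# Standard ViT variants, expressed as configs for the sketches above.
VIT_CONFIGS = {
    "ViT-Base":  dict(layers=12, dim=768,  heads=12, mlp_dim=3072),  # ~86M params
    "ViT-Large": dict(layers=24, dim=1024, heads=16, mlp_dim=4096),  # ~307M params
    "ViT-Huge":  dict(layers=32, dim=1280, heads=16, mlp_dim=5120),  # ~632M params
}
```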

7. Applications and Context in Metric Learning

The Siamese ViT encoder is directly applicable to similarity and metric learning because weight sharing makes its embeddings depend only on the input image, not on which branch processes it. Typical use cases include one-shot learning, image retrieval, clustering, and verification tasks, where the ability to compare feature vectors in a shared embedding space is operationally critical. By leveraging transformer-based sequence modeling in vision, the Siamese ViT encoder captures global contextual information in each image representation, and its strict architectural symmetry ensures that distance computations between embeddings are meaningful and consistent across diverse inputs (Dosovitskiy et al., 2020).

References

  • Dosovitskiy, A., et al. (2020). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv:2010.11929.
