Siamese Vision Transformer Encoder
- Siamese Vision Transformer Encoder is a dual-branch model that shares weights to independently process two images for similarity assessment.
- It leverages patch embedding, positional encoding, and transformer blocks to produce comparable embeddings for metric learning tasks.
- Weight sharing ensures consistent embedding production, making the architecture ideal for image retrieval, verification, and one-shot learning applications.
A Siamese Vision Transformer (ViT) encoder is a dual-branch architecture that applies the Vision Transformer encoder, as introduced in "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" (Dosovitskiy et al., 2020), in a weight-sharing configuration to two separate input images. Each branch processes an image independently with shared parameters, producing embeddings suitable for metric learning or similarity tasks. All architectural and procedural aspects—including patch embedding, positional encoding, transformer blocks, and output handling—are directly inherited from the foundational ViT encoder design.
1. Patch Embedding and Input Representation
Each input image $x \in \mathbb{R}^{H \times W \times C}$ is first divided into a grid of non-overlapping patches of size $P \times P$. This results in $N = HW/P^2$ patches, where each patch $x_p^i \in \mathbb{R}^{P^2 \cdot C}$. Each patch is flattened to a vector and projected via a learnable embedding matrix $E \in \mathbb{R}^{(P^2 \cdot C) \times D}$ to obtain the patch embedding matrix:

$$X_{\text{patch}} = \big[x_p^1 E;\; x_p^2 E;\; \dots;\; x_p^N E\big] \in \mathbb{R}^{N \times D}.$$
This process forms the input token sequence for the transformer backbone, preparing both images in the Siamese branches identically, ensuring direct comparability between their representations.
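As an illustrative sketch, the patch embedding step can be written in PyTorch. This is a minimal implementation assuming ViT-Base defaults ($P = 16$, $D = 768$, $224 \times 224$ RGB input); the stride-$P$ convolution is mathematically equivalent to flattening each patch and multiplying by the embedding matrix $E$:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into P x P patches and project each to a D-dim token."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A stride-P convolution is equivalent to flattening each P x P patch
        # and multiplying by a learnable (P^2 * C) x D matrix E.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                    # x: (B, C, H, W)
        x = self.proj(x)                     # (B, D, H/P, W/P)
        return x.flatten(2).transpose(1, 2)  # (B, N, D) token sequence

tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 196, 768])
```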
2. Positional Encoding and Class Token Integration
A learnable class token $x_{\text{class}} \in \mathbb{R}^{D}$ is prepended to the sequence of patch embeddings. Additionally, a learnable one-dimensional positional embedding matrix $E_{\text{pos}} \in \mathbb{R}^{(N+1) \times D}$ is added elementwise to the token sequence, including the class token. The composite input to the transformer is

$$z_0 = \big[x_{\text{class}};\; x_p^1 E;\; \dots;\; x_p^N E\big] + E_{\text{pos}},$$
where the class token enables the aggregation of global image information for downstream use, such as generating a compact embedding for metric learning.
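A brief sketch of the token assembly, again assuming ViT-Base dimensions ($N = 196$, $D = 768$) and using random tensors as stand-ins for the learned parameters and patch embeddings:

```python
import torch
import torch.nn as nn

D, N = 768, 196                       # embed dim and patch count, (224/16)^2
cls_token = nn.Parameter(torch.zeros(1, 1, D))       # learnable class token
pos_embed = nn.Parameter(torch.zeros(1, N + 1, D))   # includes class-token slot

patch_tokens = torch.randn(2, N, D)   # stand-in for the projected patches
cls = cls_token.expand(2, -1, -1)     # one class token per image in the batch
z0 = torch.cat([cls, patch_tokens], dim=1) + pos_embed
print(z0.shape)  # torch.Size([2, 197, 768])
```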
3. Transformer Encoder Architecture
The architecture employs $L$ sequential encoder blocks, each with two sub-layers: Multi-Head Self-Attention (MHSA) and a position-wise Multi-Layer Perceptron (MLP), both with LayerNorm pre-normalization ("Pre-Norm" variant). For each layer $\ell = 1, \dots, L$:
- Layer Normalization: $\bar{z}_{\ell-1} = \mathrm{LN}(z_{\ell-1})$
- MHSA Sub-block:
  - Each head operates in dimension $D_h = D/h$ across $h$ heads.
  - Projections: $Q = \bar{z}_{\ell-1} W_Q$, $K = \bar{z}_{\ell-1} W_K$, $V = \bar{z}_{\ell-1} W_V$, with $W_Q, W_K, W_V \in \mathbb{R}^{D \times D_h}$.
  - Per-head attention: $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\big(QK^{\top}/\sqrt{D_h}\big)\,V$
  - Output: Concatenation over heads, projected by $W_O \in \mathbb{R}^{h D_h \times D}$.
  - Residual connection: $z'_\ell = \mathrm{MHSA}(\mathrm{LN}(z_{\ell-1})) + z_{\ell-1}$
- MLP Sub-block:
  - Pre-norm: $\bar{z}'_\ell = \mathrm{LN}(z'_\ell)$
  - Feed-forward: $\mathrm{MLP}(x) = \mathrm{GELU}(x W_1 + b_1)\,W_2 + b_2$, with $W_1 \in \mathbb{R}^{D \times D_{\text{mlp}}}$, $b_1 \in \mathbb{R}^{D_{\text{mlp}}}$, $W_2 \in \mathbb{R}^{D_{\text{mlp}} \times D}$, $b_2 \in \mathbb{R}^{D}$.
  - Residual: $z_\ell = \mathrm{MLP}(\mathrm{LN}(z'_\ell)) + z'_\ell$
After $L$ blocks, the output is $z_L \in \mathbb{R}^{(N+1) \times D}$, from which the refined class token vector $z_L^0 \in \mathbb{R}^{D}$ is extracted.
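The two sub-layers above can be sketched as a single pre-norm block in PyTorch. This is a simplified sketch: `nn.MultiheadAttention` fuses the per-head projections and the output projection $W_O$ into one module, and dropout is omitted.

```python
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    """One pre-norm ViT block: LN -> MHSA -> residual, LN -> MLP -> residual."""
    def __init__(self, dim=768, heads=12, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # Fuses per-head W_Q, W_K, W_V and the output projection W_O.
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim))

    def forward(self, z):
        h = self.norm1(z)
        z = z + self.attn(h, h, h, need_weights=False)[0]  # MHSA residual
        return z + self.mlp(self.norm2(z))                 # MLP residual

z = PreNormBlock()(torch.randn(2, 197, 768))
print(z.shape)  # shape preserved: torch.Size([2, 197, 768])
```

Stacking $L$ such blocks (e.g. with `nn.Sequential`) yields the full encoder; the token-sequence shape is preserved throughout.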
4. Output Embedding and Metric Learning Setup
For embedding-based tasks, only the final class token vector $z_L^0$ is used. In the Siamese setting, each branch processes its respective image to produce $e_1$ and $e_2$, both in $\mathbb{R}^{D}$. These embeddings are then compared (e.g., with cosine or Euclidean distance), or used as input to contrastive, triplet, or other metric learning losses. For classification, the output is

$$y = \mathrm{LN}(z_L^0)\,W_{\text{head}},$$

with $W_{\text{head}} \in \mathbb{R}^{D \times K}$ for $K$ classes. For fine-tuning, $W_{\text{head}}$ is generally re-initialized.
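A sketch of how the two branch embeddings might be compared and fed to a loss. Random tensors stand in for the class-token embeddings, and the contrastive loss shown is one common margin-based formulation, not the only option:

```python
import torch
import torch.nn.functional as F

# Stand-ins for the class-token embeddings from the two branches (4 pairs).
e1 = torch.randn(4, 768)
e2 = torch.randn(4, 768)

# Similarity / distance measures between paired embeddings.
cos_sim = F.cosine_similarity(e1, e2, dim=-1)  # in [-1, 1], shape (4,)
d = (e1 - e2).norm(dim=-1)                     # Euclidean distance, shape (4,)

# Margin-based contrastive loss: pull matching pairs together,
# push non-matching pairs apart until they exceed the margin.
labels = torch.tensor([1., 0., 1., 0.])        # 1 = same identity
margin = 1.0
loss = (labels * d.pow(2)
        + (1 - labels) * (margin - d).clamp(min=0).pow(2)).mean()
```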
5. Weight Sharing and Twin Branch Realization
The Siamese instantiation requires strict weight sharing. All learnable parameters—embedding matrix $E$, positional embeddings $E_{\text{pos}}$, class token $x_{\text{class}}$, attention and projection weights $W_Q, W_K, W_V, W_O$ (per layer), MLP weights and biases ($W_1, b_1, W_2, b_2$ per MLP), and optionally $W_{\text{head}}$—are identical across both branches. Consequently, the encoder network for each image is functionally and parametrically identical, ensuring the resulting embeddings are produced in the same $D$-dimensional feature space and are directly comparable. This configuration is essential for learning effective metrics or similarities.
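In practice, weight sharing is usually realized not by synchronizing two copies but by instantiating a single encoder module and calling it on both inputs. A minimal sketch (a toy `nn.Linear` stands in for the full ViT encoder):

```python
import torch
import torch.nn as nn

class SiameseEncoder(nn.Module):
    """Weight sharing via a single encoder module reused for both branches."""
    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder  # one parameter set, two forward passes

    def forward(self, img_a, img_b):
        return self.encoder(img_a), self.encoder(img_b)

# Toy stand-in encoder; in the real setting this would be the full ViT.
enc = nn.Linear(16, 8)
twin = SiameseEncoder(enc)
ea, eb = twin(torch.randn(4, 16), torch.randn(4, 16))

# The branches share the *same* parameter objects, not copies:
same = all(p is q for p, q in zip(twin.parameters(), enc.parameters()))
print(same)  # True
```

Because both forward passes touch the same parameters, a single optimizer step updates "both branches" simultaneously, and gradients from the two passes accumulate on the shared weights.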
6. Hyper-parameters and Training Paradigms
The ViT encoder supports several model sizes: ViT-Base ($L = 12$, $D = 768$, $h = 12$, MLP size $3072$, 86M parameters), ViT-Large ($L = 24$, $D = 1024$, $h = 16$, MLP size $4096$, 307M parameters), and ViT-Huge ($L = 32$, $D = 1280$, $h = 16$, MLP size $5120$, 632M parameters). The standard patch size is $16 \times 16$ with default input size $224 \times 224$. Pre-training is typically conducted on large-scale datasets such as ImageNet-21k (14M images) or JFT-300M (303M images) with the Adam optimizer, weight decay, and linear learning rate warm-up and decay. Fine-tuning involves SGD with momentum, smaller batch sizes, and sometimes larger image resolutions.
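For reference, these variants can be captured in a small configuration table; the values follow those reported for ViT in Dosovitskiy et al. (2020):

```python
# Standard ViT variants (Dosovitskiy et al., 2020), patch size 16, input 224x224.
VIT_CONFIGS = {
    "ViT-Base":  {"layers": 12, "dim": 768,  "heads": 12, "mlp": 3072},
    "ViT-Large": {"layers": 24, "dim": 1024, "heads": 16, "mlp": 4096},
    "ViT-Huge":  {"layers": 32, "dim": 1280, "heads": 16, "mlp": 5120},
}

# Sanity check: each head dimension D_h = D / h is an integer.
for name, cfg in VIT_CONFIGS.items():
    assert cfg["dim"] % cfg["heads"] == 0, name
```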
7. Applications and Context in Metric Learning
The Siamese ViT encoder is directly applicable to similarity and metric learning by producing embeddings invariant to the input branch. Typical use cases include one-shot learning, image retrieval, clustering, and verification tasks, where the ability to compare feature vectors in a shared embedding space is operationally critical. By leveraging transformer-based sequence modeling in vision, the Siamese ViT encoder can capture global contextual information in each image representation, and its strict architectural symmetry ensures that distance computations between embeddings are meaningful and consistent across diverse inputs (Dosovitskiy et al., 2020).