SigLIP-400M Vision Encoder
- The model integrates a ViT-based vision encoder with a transformer text tower using 400M parameters for state-of-the-art image–text alignment.
- Training employs multiple objectives—including alignment, captioning, self-distillation, and masked patch prediction—to enhance dense prediction and localization.
- The encoder supports multi-resolution inputs and fair data curation from a 109-language dataset, driving effective zero-shot, retrieval, and transfer learning.
SigLIP-400M ("So400m") is a 400 million-parameter multilingual vision–language encoder in the SigLIP 2 family, designed for advanced image–text representation learning. It pairs a Vision Transformer (ViT) backbone with a transformer-based text tower, and unifies image–text alignment, captioning, self-supervised learning, and careful data curation in a single training recipe. The model supports zero-shot and transfer learning across classification, retrieval, dense prediction, and localization, leveraging a diverse, de-biased web-scale dataset spanning 109 languages and multi-resolution inputs (Tschannen et al., 20 Feb 2025).
1. Model Architecture
The vision encoder utilizes SoViT-14 ("So/14"), a standard Vision Transformer architecture, configured as follows:
- Patch size: 14×14 pixels.
- Spatial grid: 224 px input images, subdivided into a 16×16 grid to yield 256 tokens.
- Transformer depth: 27 blocks, each with hidden dimension 1152.
- MLP (feedforward) inner dimension: 4304.
- Attention heads: 16.
- Positional encoding: learned 2D embeddings.
- Final pooling: via a MAP (multihead attention pooling) head, introducing ≈4M parameters.
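The MAP pooling step can be illustrated with a single-head NumPy sketch: a learned query attends over all patch tokens and collapses them into one pooled embedding. The shapes and parameter names here are toy and hypothetical (the real head is multi-headed and far wider), not SigLIP's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def map_pool(tokens, q, Wk, Wv, Wo):
    """Single-head sketch of attention pooling: a learned query `q`
    attends over all patch tokens and returns one pooled vector."""
    k = tokens @ Wk                                  # (n, d) keys
    v = tokens @ Wv                                  # (n, d) values
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))   # (n,) attention weights
    return (attn @ v) @ Wo                           # (d,) pooled embedding

rng = np.random.default_rng(0)
d, n = 64, 256                     # toy dims, far smaller than the real encoder
tokens = rng.normal(size=(n, d))
q = rng.normal(size=(d,))
Wk, Wv, Wo = (rng.normal(size=(d, d)) for _ in range(3))
out = map_pool(tokens, q, Wk, Wv, Wo)
print(out.shape)  # (64,) -- the whole token grid is reduced to one vector
```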
The text encoder mirrors the vision backbone, differing primarily in:
- Text input length: 64 tokens.
- Tokenizer: multilingual Gemma tokenizer, vocabulary size 256,000.
- Token embedding dimension: 1152, matching the vision width.
Parameter allocation:
- Vision encoder: 205M
- Text encoder: 175M
- Contrastive heads (projection layers, logit scale): 8M
- Total: ≈388M (rounded to 400M).
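The allocation above can be checked with trivial arithmetic; the dictionary keys are just labels for the figures quoted in this section.

```python
# Sanity check of the parameter allocation listed above (in millions).
allocation = {"vision_encoder": 205, "text_encoder": 175, "contrastive_heads": 8}
total_m = sum(allocation.values())
print(total_m)  # 388, which the model name rounds to "400M"
```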
2. Training Objectives
Training employs a staged integration of four objectives per minibatch $\mathcal{B}$ of image–text pairs $(x_i, y_i)$:
- Sigmoid image–text alignment (SigLIP loss):

  $$\mathcal{L}_{\text{SigLIP}} = -\frac{1}{|\mathcal{B}|} \sum_{i=1}^{|\mathcal{B}|} \sum_{j=1}^{|\mathcal{B}|} \log \sigma\!\left(z_{ij}\,(t\,\mathbf{x}_i \cdot \mathbf{y}_j + b)\right)$$

  where $\mathbf{x}_i, \mathbf{y}_j$ are the normalized image and text embeddings, $z_{ij} = 1$ iff $i = j$ (else $-1$), $t$ is the temperature, and $b$ a learned bias.
- Decoder-based captioning and localization (LocCa):
- Image-level captioning (cross-entropy loss over predicted sequence).
- Referring expression: predicting bounding boxes for given noun phrases.
- Grounded captioning: predicting phrases given box coordinates. The three objectives are combined with equal weight.
- Self-distillation (SILC):
- Teacher (EMA of the student) and student process global and local (8 random crops) views.
- Loss: squared difference of projected features.
- Masked patch prediction (TIPS):
- 50% of the student's patch embeddings are masked; the student's projected features at masked locations are matched to the teacher's.
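The sigmoid alignment objective above can be sketched in NumPy. Every image–text pair in the batch is treated as an independent binary classification; the temperature and bias defaults follow SigLIP's reported initialization ($t = 10$, $b = -10$), and the embedding sizes are toy values.

```python
import numpy as np

def siglip_loss(img, txt, t=10.0, b=-10.0):
    """Pairwise sigmoid loss: each (i, j) image-text pair is an
    independent binary classification, positive iff i == j."""
    img = img / np.linalg.norm(img, axis=-1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=-1, keepdims=True)
    logits = t * img @ txt.T + b          # (B, B) pairwise similarity logits
    z = 2.0 * np.eye(len(img)) - 1.0      # z_ij: +1 on the diagonal, -1 off it
    # -log sigmoid(z * logits) == log(1 + exp(-z * logits))
    return np.log1p(np.exp(-z * logits)).sum(axis=-1).mean()

rng = np.random.default_rng(0)
img, txt = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
loss = siglip_loss(img, txt)
print(loss > 0)  # True: the loss is a sum of strictly positive terms
```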
The overall training loss schedule:
- First 80% of training: $\mathcal{L} = \mathcal{L}_{\text{SigLIP}} + \mathcal{L}_{\text{LocCa}}$.
- Last 20%: $\mathcal{L} = \mathcal{L}_{\text{SigLIP}} + \mathcal{L}_{\text{LocCa}} + \beta\,\mathcal{L}_{\text{SILC}} + \gamma\,\mathcal{L}_{\text{TIPS}}$, with fixed weights $\beta, \gamma$ on the added terms.
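The 80/20 staging amounts to a step-dependent loss. A minimal sketch, where `beta` and `gamma` are placeholder weights (the paper's exact values are not restated here):

```python
def total_loss(step, total_steps, l_siglip, l_locca, l_silc, l_tips,
               beta=1.0, gamma=1.0):
    """Staged objective: SigLIP + LocCa throughout training; the
    self-distillation (SILC) and masked-prediction (TIPS) terms are
    switched on only for the final 20% of steps. beta/gamma are
    placeholders, not values from the paper."""
    loss = l_siglip + l_locca
    if step >= 0.8 * total_steps:
        loss += beta * l_silc + gamma * l_tips
    return loss

print(total_loss(10, 100, 1.0, 1.0, 1.0, 1.0))  # 2.0 -- first 80% of training
print(total_loss(90, 100, 1.0, 1.0, 1.0, 1.0))  # 4.0 -- last 20%, all terms on
```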
3. Data Mixture and Fairness Curation
The primary data source is WebLI, furnishing approximately 10B images and 12B alt-texts in 109 languages. The sampled mixture is 90% English and 10% non-English image–text pairs. To mitigate bias, a two-stage filtering method is applied as per Alabdulmohsin et al., reducing both first- and second-order representational biases (e.g., gender/object associations). For ViT-B variants, active curation via ACID and implicit data distillation are deployed; these are not applied to the So400m model.
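The 90/10 language mixture amounts to a Bernoulli draw per sampled pair. A trivial sketch (the bucket names and helper are illustrative only):

```python
import random

def sample_language_bucket(rng, p_english=0.9):
    """Draw the language bucket for one sampled pair: 90% English,
    10% non-English, matching the mixture ratio described above."""
    return "en" if rng.random() < p_english else "non-en"

rng = random.Random(0)
draws = [sample_language_bucket(rng) for _ in range(100_000)]
print(round(draws.count("en") / len(draws), 2))  # ~0.9
```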
4. Multi-Resolution and Native Aspect Ratio Support
So400m supports both fixed-resolution variants and a "NaFlex" native aspect ratio protocol:
- Fixed resolution: checkpoints are branched at 95% of pretraining, with bilinear/pseudoinverse resizing of positional (and, if necessary, patch) embeddings, followed by continued training at the target resolution.
- NaFlex: a single checkpoint accommodates a range of resolutions and sequence lengths. During pre-processing, input images are resized so that spatial dimensions are multiples of 14, with minimal distortion and a patch grid no larger than the target sequence length. Positional embeddings are resized accordingly, and padding masks prevent attention to non-content patches. In the final 10% of training, one resolution per batch is randomly sampled to ensure balanced exposure.
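The NaFlex resize step can be sketched as a search for the largest patch grid that roughly preserves the native aspect ratio while staying within the token budget. The helper name and the greedy grid search are illustrative assumptions, not the reference implementation; patch size 14 follows the text above.

```python
import math

def naflex_grid(width, height, max_tokens, patch=14):
    """Pick a patch grid for a native-aspect-ratio image: both output
    sides are multiples of `patch`, the number of patches stays within
    `max_tokens`, and the aspect ratio is distorted as little as possible."""
    aspect = width / height
    gh = max(1, int(math.sqrt(max_tokens / aspect)))  # grid height (patches)
    gw = max(1, max_tokens // gh)                     # grid width (patches)
    while gw * gh > max_tokens:                       # enforce the token budget
        gh -= 1
        gw = max(1, max_tokens // gh)
    return gw * patch, gh * patch                     # resized (width, height) px

w, h = naflex_grid(1024, 512, max_tokens=256)
print(w, h)  # 322 154 -> a 23x11 grid, 253 tokens, aspect ~2:1 preserved
```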
5. Core Performance Characteristics
Zero-shot, retrieval, and transfer results are as follows:
| Model | Input (px) | ImageNet-1k top-1 | COCO R@1 (T→I / I→T) | Flickr R@1 (T→I / I→T) |
|---|---|---|---|---|
| So/14 | 256 | 83.2% | 55.1 / 71.5 | 84.3 / 94.6 |
| So/14 | 384 | 84.1% | 55.8 / 71.7 | 85.7 / 94.9 |
- XM3600 (36 languages) retrieval: 256 px, T→I 47.9%, I→T 57.5%; 384 px, T→I 48.4%, I→T 57.5%.
- 10-shot ImageNet: 256 px: 82.1%, 384 px: 82.5%.
- As a frozen encoder in the PaliGemma 2 recipe, SigLIP 2 So/14 surpasses the original SigLIP So/14 by ∼1.0–1.5 points across 35 downstream tasks (OCR, VQA, captioning, grounding; e.g., +2.2 on TextCaps, +1.6 on DocVQA).
6. Localization and Dense Prediction Capabilities
Dense feature extraction and localization are demonstrated via frozen-backbone probes and open-vocabulary detection:
| Task | Metric | SigLIP So/14 | SigLIP-2 So/14 |
|---|---|---|---|
| Pascal-VOC Seg | mIoU ↑ | 72.0 | 77.1 |
| ADE20k Seg | mIoU ↑ | 37.6 | 41.8 |
| NYU-v2 Depth | RMSE ↓ | 0.576 | 0.493 |
| NYU-v2 Normals | Mean angular error ↓ | 25.9 | 24.9 |
- Referring expression comprehension (RefCOCO, [email protected]): SigLIP 2 reaches testA/testB 89.4% / 82.5% (vs. SigLIP 71.2% / 58.4%).
- Open-vocabulary detection (OWL-ViT fine-tuned):
- COCO AP: SigLIP 44.3, SigLIP-2 45.2.
- LVIS (all): SigLIP 39.5, SigLIP-2 40.5; LVIS (rare): 40.9 → 42.3.
- Open-vocabulary segmentation: SigLIP-2 L/16 gains +1.5–2.3 mIoU vs. SigLIP L/16; matches OpenCLIP G/14.
7. Ablation Studies and Compute Considerations
- Objective staging:
- Stage 1 (SigLIP + LocCa) yields the core improvements: +1.2–1.8 points on classification/retrieval.
- Stage 2 (adding self-distillation and masked prediction for the last 20%) yields +3–4 mIoU on dense prediction, −0.05 depth RMSE, and +0.3–0.5 points on retrieval.
- Inference cost (GFLOPs per image):
- 256 px (seq = 256): ≈25 GFLOPs for 83.2% ImageNet-1k top-1.
- 384 px (seq = 576): ≈60 GFLOPs for 84.1% top-1.
- NaFlex: quality within 0.3 points (retrieval) and 1.0 point (classification) of the fixed-resolution checkpoints, with broad input-shape support.
A plausible implication is that NaFlex-style flexible resolution support democratizes model deployment across heterogeneous visual domains with minimal accuracy penalty.
SigLIP-400M unites a mid-sized SoViT backbone, multilingual contrastive and captioning objectives, and large-scale web data to achieve state-of-the-art open-weight performance for multimodal tasks, while addressing multilinguality, fairness, and flexible input requirements. The full training recipe, hyperparameters, and checkpoints are publicly available (Tschannen et al., 20 Feb 2025).