SigLIP-400M Vision Encoder
- The model integrates a ViT-based vision encoder with a transformer text tower using 400M parameters for state-of-the-art image–text alignment.
- Training employs multiple objectives—including alignment, captioning, self-distillation, and masked patch prediction—to enhance dense prediction and localization.
- The encoder supports multi-resolution inputs and fair data curation from a 109-language dataset, driving effective zero-shot, retrieval, and transfer learning.
SigLIP-400M ("So400m") is a 400 million-parameter multilingual vision–language encoder in the SigLIP 2 family, designed for advanced image–text representation learning. It pairs a Vision Transformer (ViT) backbone with a transformer-based text tower, and unifies image–text alignment, captioning, self-supervised learning, and careful data curation in a single training recipe. The model supports zero-shot and transfer learning across classification, retrieval, dense prediction, and localization, leveraging a diverse, de-biased web-scale dataset spanning 109 languages and multi-resolution inputs (Tschannen et al., 20 Feb 2025).
1. Model Architecture
The vision encoder utilizes SoViT-14 ("So/14"), a standard Vision Transformer architecture, configured as follows:
- Patch size: 14×14 pixels.
- Spatial grid: 224 px input images, subdivided into a 16×16 grid to yield 256 tokens.
- Transformer depth: 27 blocks, each with hidden dimension 1152.
- MLP (feedforward) inner dimension: 4304.
- Attention heads: 16.
- Positional encoding: learned 2D embeddings.
- Final pooling: via a MAP (multihead attention pooling) head, introducing ≈4M parameters.
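The MAP pooling step can be illustrated with a single-head NumPy sketch: a learned query attends over all patch tokens and collapses them into one pooled embedding. The shapes and parameter names here are toy and hypothetical (the real head is multi-headed and far wider), not SigLIP's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def map_pool(tokens, q, Wk, Wv, Wo):
    """Single-head sketch of attention pooling: a learned query `q`
    attends over all patch tokens and returns one pooled vector."""
    k = tokens @ Wk                                  # (n, d) keys
    v = tokens @ Wv                                  # (n, d) values
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))   # (n,) attention weights
    return (attn @ v) @ Wo                           # (d,) pooled embedding

rng = np.random.default_rng(0)
d, n = 64, 256                     # toy dims, far smaller than the real encoder
tokens = rng.normal(size=(n, d))
q = rng.normal(size=(d,))
Wk, Wv, Wo = (rng.normal(size=(d, d)) for _ in range(3))
out = map_pool(tokens, q, Wk, Wv, Wo)
print(out.shape)  # (64,) -- the whole token grid is reduced to one vector
```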
The text encoder mirrors the vision backbone, differing primarily in:
- Text input length: 64 tokens.
- Tokenizer: multilingual Gemma tokenizer, vocabulary size 256,000.
- Token embedding dimension: 1152, matching the vision width.
Parameter allocation:
- Vision encoder: 205M
- Text encoder: 175M
- Contrastive heads (projection layers, logit scale): 8M
- Total: ≈388M (rounded to 400M).
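The allocation above can be checked with trivial arithmetic; the dictionary keys are just labels for the figures quoted in this section.

```python
# Sanity check of the parameter allocation listed above (in millions).
allocation = {"vision_encoder": 205, "text_encoder": 175, "contrastive_heads": 8}
total_m = sum(allocation.values())
print(total_m)  # 388, which the model name rounds to "400M"
```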
2. Training Objectives
Training employs a staged integration of four objectives per minibatch $\mathcal{B}$ of image–text pairs $(x_i, y_i)$:
- Sigmoid image–text alignment (SigLIP loss):

  $$\mathcal{L}_{\text{SigLIP}} = -\frac{1}{|\mathcal{B}|} \sum_{i=1}^{|\mathcal{B}|} \sum_{j=1}^{|\mathcal{B}|} \log \sigma\!\left(z_{ij}\,(t\,\mathbf{x}_i \cdot \mathbf{y}_j + b)\right)$$

  where $\mathbf{x}_i, \mathbf{y}_j$ are the normalized image and text embeddings, $z_{ij} = 1$ iff $i = j$ (else $-1$), $t$ is the temperature, and $b$ a learned bias.
- Decoder-based captioning and localization (LocCa):
- Image-level captioning (cross-entropy loss over predicted sequence).
- Referring expression: predicting bounding boxes for given noun phrases.
- Grounded captioning: predicting phrases given box coordinates. The three objectives are combined with equal weight.
- Self-distillation (SILC):
- Teacher (EMA of the student) and student process global and local (8 random crops) views.
- Loss: squared difference of projected features.
- Masked patch prediction (TIPS):
- 50% of the student's patch embeddings are masked; the student's projected features at masked locations are matched to the teacher's.
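The sigmoid alignment objective above can be sketched in NumPy. Every image–text pair in the batch is treated as an independent binary classification; the temperature and bias defaults follow SigLIP's reported initialization ($t = 10$, $b = -10$), and the embedding sizes are toy values.

```python
import numpy as np

def siglip_loss(img, txt, t=10.0, b=-10.0):
    """Pairwise sigmoid loss: each (i, j) image-text pair is an
    independent binary classification, positive iff i == j."""
    img = img / np.linalg.norm(img, axis=-1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=-1, keepdims=True)
    logits = t * img @ txt.T + b          # (B, B) pairwise similarity logits
    z = 2.0 * np.eye(len(img)) - 1.0      # z_ij: +1 on the diagonal, -1 off it
    # -log sigmoid(z * logits) == log(1 + exp(-z * logits))
    return np.log1p(np.exp(-z * logits)).sum(axis=-1).mean()

rng = np.random.default_rng(0)
img, txt = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
loss = siglip_loss(img, txt)
print(loss > 0)  # True: the loss is a sum of strictly positive terms
```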
The overall training loss schedule:
- First 80% of training: $\mathcal{L} = \mathcal{L}_{\text{SigLIP}} + \mathcal{L}_{\text{LocCa}}$.
- Last 20%: $\mathcal{L} = \mathcal{L}_{\text{SigLIP}} + \mathcal{L}_{\text{LocCa}} + \beta\,\mathcal{L}_{\text{SILC}} + \gamma\,\mathcal{L}_{\text{TIPS}}$, with fixed weights $\beta, \gamma$ on the added terms.
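The 80/20 staging amounts to a step-dependent loss. A minimal sketch, where `beta` and `gamma` are placeholder weights (the paper's exact values are not restated here):

```python
def total_loss(step, total_steps, l_siglip, l_locca, l_silc, l_tips,
               beta=1.0, gamma=1.0):
    """Staged objective: SigLIP + LocCa throughout training; the
    self-distillation (SILC) and masked-prediction (TIPS) terms are
    switched on only for the final 20% of steps. beta/gamma are
    placeholders, not values from the paper."""
    loss = l_siglip + l_locca
    if step >= 0.8 * total_steps:
        loss += beta * l_silc + gamma * l_tips
    return loss

print(total_loss(10, 100, 1.0, 1.0, 1.0, 1.0))  # 2.0 -- first 80% of training
print(total_loss(90, 100, 1.0, 1.0, 1.0, 1.0))  # 4.0 -- last 20%, all terms on
```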
3. Data Mixture and Fairness Curation
The primary data source is WebLI, furnishing approximately 10B images and 12B alt-texts in 109 languages. The sampled mixture is 90% English and 10% non-English image–text pairs. To mitigate bias, a two-stage filtering method is applied as per Alabdulmohsin et al., reducing both first- and second-order representational biases (e.g., gender/object associations). For ViT-B variants, active curation via ACID and implicit data distillation are deployed; these are not applied to the So400m model.
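The 90/10 language mixture amounts to a Bernoulli draw per sampled pair. A trivial sketch (the bucket names and helper are illustrative only):

```python
import random

def sample_language_bucket(rng, p_english=0.9):
    """Draw the language bucket for one sampled pair: 90% English,
    10% non-English, matching the mixture ratio described above."""
    return "en" if rng.random() < p_english else "non-en"

rng = random.Random(0)
draws = [sample_language_bucket(rng) for _ in range(100_000)]
print(round(draws.count("en") / len(draws), 2))  # ~0.9
```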
4. Multi-Resolution and Native Aspect Ratio Support
So400m supports both fixed-resolution variants and a "NaFlex" native aspect ratio protocol:
- Fixed resolution: checkpoints are branched at 95% of pretraining, with bilinear/pseudoinverse resizing of positional (and, if necessary, patch) embeddings, followed by continued training at the target resolution.
- NaFlex: a single checkpoint accommodates a range of resolutions and sequence lengths. During pre-processing, input images are resized so that spatial dimensions are multiples of 14, with minimal distortion and a patch grid no larger than the target sequence length. Positional embeddings are resized accordingly, and padding masks prevent attention to non-content patches. In the final 10% of training, one resolution per batch is randomly sampled to ensure balanced exposure.
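The NaFlex resize step can be sketched as a search for the largest patch grid that roughly preserves the native aspect ratio while staying within the token budget. The helper name and the greedy grid search are illustrative assumptions, not the reference implementation; patch size 14 follows the text above.

```python
import math

def naflex_grid(width, height, max_tokens, patch=14):
    """Pick a patch grid for a native-aspect-ratio image: both output
    sides are multiples of `patch`, the number of patches stays within
    `max_tokens`, and the aspect ratio is distorted as little as possible."""
    aspect = width / height
    gh = max(1, int(math.sqrt(max_tokens / aspect)))  # grid height (patches)
    gw = max(1, max_tokens // gh)                     # grid width (patches)
    while gw * gh > max_tokens:                       # enforce the token budget
        gh -= 1
        gw = max(1, max_tokens // gh)
    return gw * patch, gh * patch                     # resized (width, height) px

w, h = naflex_grid(1024, 512, max_tokens=256)
print(w, h)  # 322 154 -> a 23x11 grid, 253 tokens, aspect ~2:1 preserved
```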
5. Core Performance Characteristics
Zero-shot, retrieval, and transfer results are as follows:
| Model | Input (px) | ImageNet-1k top-1 | COCO R@1 (T→I / I→T) | Flickr R@1 (T→I / I→T) |
|---|---|---|---|---|
| So/14 | 256 | 83.2% | 55.1 / 71.5 | 84.3 / 94.6 |
| So/14 | 384 | 84.1% | 55.8 / 71.7 | 85.7 / 94.9 |
- XM3600 (36 languages) retrieval: 256 px, T→I 47.9%, I→T 57.5%; 384 px, T→I 48.4%, I→T 57.5%.
- 10-shot ImageNet: 256 px: 82.1%, 384 px: 82.5%.
- As a frozen encoder in the PaliGemma 2 recipe, SigLIP 2 So/14 surpasses the original SigLIP So/14 by ∼1.0–1.5 points across 35 downstream tasks (OCR, VQA, captioning, grounding; e.g., +2.2 on TextCaps, +1.6 on DocVQA).
6. Localization and Dense Prediction Capabilities
Dense feature extraction and localization are demonstrated via frozen-backbone probes and open-vocabulary detection:
| Task | Metric | SigLIP So/14 | SigLIP-2 So/14 |
|---|---|---|---|
| Pascal-VOC Seg | mIoU ↑ | 72.0 | 77.1 |
| ADE20k Seg | mIoU ↑ | 37.6 | 41.8 |
| NYU-v2 Depth | RMSE ↓ | 0.576 | 0.493 |
| NYU-v2 Normals | Mean angular error ↓ | 25.9 | 24.9 |
- Referring expression comprehension (RefCOCO, [email protected]): SigLIP 2 reaches testA/testB 89.4% / 82.5% (vs. SigLIP 71.2% / 58.4%).
- Open-vocabulary detection (OWL-ViT fine-tuned):
- COCO AP: SigLIP 44.3, SigLIP-2 45.2.
- LVIS (all): SigLIP 39.5, SigLIP-2 40.5; LVIS (rare): 40.9 → 42.3.
- Open-vocabulary segmentation: SigLIP-2 L/16 gains +1.5–2.3 mIoU vs. SigLIP L/16; matches OpenCLIP G/14.
7. Ablation Studies and Compute Considerations
- Objective staging:
- Stage 1 (SigLIP + LocCa) yields the core improvements: +1.2–1.8 points on classification/retrieval.
- Stage 2 (adding self-distillation and masked prediction for the last 20%) yields +3–4 mIoU on dense prediction, −0.05 depth RMSE, and +0.3–0.5 points on retrieval.
- Inference cost (GFLOPs per image):
- 256 px (seq = 256): ≈25 GFLOPs for 83.2% ImageNet-1k top-1.
- 384 px (seq = 576): ≈60 GFLOPs for 84.1% top-1.
- NaFlex: quality within 0.3 points (retrieval) and 1.0 point (classification) of the fixed-resolution checkpoints, with broad input-shape support.
A plausible implication is that NaFlex-style flexible resolution support democratizes model deployment across heterogeneous visual domains with minimal accuracy penalty.
SigLIP-400M unites a mid-sized SoViT backbone, multilingual contrastive and captioning objectives, and large-scale web data to achieve state-of-the-art open-weight performance for multimodal tasks, while addressing multilinguality, fairness, and flexible input requirements. The full training recipe, hyperparameters, and checkpoints are publicly available (Tschannen et al., 20 Feb 2025).