The paper introduces SigLIP 2, a family of multilingual vision-language encoders that builds on the original SigLIP model. The core innovation is the integration of captioning-based pretraining, self-supervised losses (self-distillation, masked prediction), and online data curation into a unified training recipe. The SigLIP 2 models demonstrate improved performance across a range of tasks, including zero-shot classification, image-text retrieval, and transfer learning as vision encoders for Vision-Language Models (VLMs). The training methodology also enhances localization and dense prediction capabilities. The paper further introduces variants that accommodate multiple resolutions while preserving the input's native aspect ratio. Training is conducted on a diverse, de-biased dataset, leading to enhanced multilingual understanding and fairness. The authors release model checkpoints in four sizes: ViT-B (86M), L (303M), So400m (400M), and g (1B).
The primary contributions of SigLIP 2 are:
- Enhanced multilingual vision-language encoders: Strong performance on both English and multilingual vision-language tasks.
- Improved dense features: Self-supervised losses and a decoder-based loss enhance dense features for tasks like segmentation and depth estimation, and improve localization tasks like referring expression comprehension.
- Backward compatibility: The architecture is unchanged from SigLIP, allowing users to simply swap the model weights and the tokenizer.
- Native aspect ratio and variable resolution: A NaFlex variant supports multiple resolutions and preserves the native image aspect ratio.
- Optimized small models: Distillation via active data curation improves the performance of the smaller models.
The SigLIP 2 training recipe combines the original SigLIP training with decoder-based pretraining, self-distillation, and masked prediction. A staged approach keeps the computational and memory overhead manageable.
The architecture is based on SigLIP, using the standard ViT architecture with learned positional embeddings for the fixed-resolution variants. The image and text towers share the same architecture, except for the g-sized vision encoder, which is paired with an So400m-sized text encoder. Vision and text representations are pooled using a MAP head (attention pooling). The text sequence length is set to 64 tokens, and the multilingual Gemma tokenizer with a 256k vocabulary is used.
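To make the pooling step concrete, below is a minimal single-head NumPy sketch of attention pooling with a learned probe (a MAP head). The function and variable names are illustrative; the actual head is multi-head and followed by an MLP, so this only shows the core idea of a learned query cross-attending over the un-pooled tokens.

```python
import numpy as np

def map_pool(tokens, probe, w_k, w_v, w_o):
    """Single-head attention-pooling (MAP head) sketch.

    tokens: (seq_len, d)  un-pooled encoder outputs
    probe:  (1, d)        learned query vector
    w_k, w_v, w_o: (d, d) learned projection matrices
    Returns one pooled representation of shape (d,).
    """
    d = tokens.shape[-1]
    keys = tokens @ w_k                       # (seq_len, d)
    values = tokens @ w_v                     # (seq_len, d)
    scores = (probe @ keys.T) / np.sqrt(d)    # (1, seq_len)
    scores = scores - scores.max()
    attn = np.exp(scores) / np.exp(scores).sum()   # softmax over the sequence
    pooled = attn @ values                    # (1, d)
    return (pooled @ w_o)[0]
```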
The models are trained on the WebLI dataset, containing 10 billion images and 12 billion alt-texts in 109 languages. The training mixture is composed of 90% English and 10% non-English image-text pairs. Filtering techniques are applied to mitigate data biases.
Training uses the Adam optimizer with decoupled weight decay and gradient clipping to a maximum norm of 1.
The batch size is set to 32k, and a cosine schedule with 20k warmup steps is used, training for a total of 40B examples. The models are trained on up to 2048 TPUv5e chips using a fully-sharded data-parallel strategy (FSDP).
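As an illustration of the schedule, here is a minimal warmup-plus-cosine learning-rate function; the base learning rate and the decay endpoint of zero are assumptions, not values from the paper.

```python
import math

def lr_at_step(step, total_steps, base_lr, warmup_steps=20_000):
    """Linear warmup followed by cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

With a batch size of 32k, 40B seen examples correspond to roughly 40e9 / 32,768 ≈ 1.2M optimizer steps, which would be passed as `total_steps`.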
In the first pretraining stage, the SigLIP and LocCa losses are combined with equal weight. SigLIP casts image-text matching as a set of binary classification problems: every image embedding is paired with every text embedding in the mini-batch, and the embeddings are trained to classify matching and non-matching pairs via logistic regression (sigmoid loss), as sketched below.
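A minimal NumPy sketch of this sigmoid loss, assuming L2-normalized embeddings and learned scalar temperature `t` and bias `b` (variable names are illustrative):

```python
import numpy as np

def sigmoid_loss(img_emb, txt_emb, t, b):
    """Pairwise sigmoid (logistic) loss over a mini-batch.

    img_emb, txt_emb: (n, d) L2-normalized image/text embeddings.
    t, b: learned temperature and bias scalars.
    Each of the n*n image-text pairs is a binary classification problem:
    label +1 on the diagonal (matching pairs), -1 everywhere else.
    """
    n = img_emb.shape[0]
    logits = t * (img_emb @ txt_emb.T) + b        # (n, n) pairwise logits
    labels = 2.0 * np.eye(n) - 1.0                # +1 matching, -1 non-matching
    # -log sigmoid(label * logit) = log(1 + exp(-label * logit))
    nll = np.log1p(np.exp(-labels * logits))
    return nll.sum(axis=1).mean()                 # sum over pairs, mean over images
```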
For LocCa, a standard transformer decoder with cross-attention is attached to the un-pooled vision encoder representation. The decoder is trained to predict image captions, automatic referring expression predictions, and grounded captions. With 50% probability, the caption targets are predicted via parallel prediction (all tokens at once) rather than autoregressively.
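The parallel-prediction mechanic can be sketched as follows: with 50% probability the decoder inputs are replaced by mask tokens and all caption tokens are predicted simultaneously with full self-attention, otherwise standard shifted-right teacher forcing with a causal mask is used. The mask-token id, BOS id, and function name here are hypothetical.

```python
import numpy as np

MASK_ID = 0   # hypothetical mask-token id

def decoder_inputs(caption_ids, bos_id, rng, parallel_prob=0.5):
    """Choose between autoregressive and parallel captioning targets.

    caption_ids: (seq_len,) token ids of the caption to predict.
    Returns the decoder input ids and whether a causal attention mask is used.
    """
    if rng.random() < parallel_prob:
        inputs = np.full_like(caption_ids, MASK_ID)             # all-mask inputs
        causal = False                                          # predict all tokens in parallel
    else:
        inputs = np.concatenate(([bos_id], caption_ids[:-1]))   # shift right
        causal = True                                           # standard autoregressive decoding
    return inputs, causal
```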
Following SILC and TIPS, the training setup is augmented with local-to-global correspondence learning with self-distillation and masked prediction losses to improve the local semantics of the feature representation.
The first term is the local-to-global consistency loss: the vision encoder acts as the student network and, given only a local view of the image, is trained to match the teacher's representation derived from the full image. The teacher parameters are obtained as an exponential moving average (EMA) of the student parameters.
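A compact sketch of these two pieces, the EMA teacher update and a DINO-style consistency loss over prototype logits; the momentum and temperature values here are illustrative, not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def ema_update(teacher_params, student_params, momentum=0.996):
    """Teacher weights follow an exponential moving average of the student's."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]

def local_to_global_loss(student_logits, teacher_logits, temp_s=0.1, temp_t=0.04):
    """Cross-entropy between the teacher's prototype distribution (full image)
    and the student's (local view), averaged over the batch."""
    p_teacher = softmax(teacher_logits / temp_t)
    log_p_student = np.log(softmax(student_logits / temp_s) + 1e-9)
    return -(p_teacher * log_p_student).sum(axis=-1).mean()
```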
The second loss term is the masked prediction objective. A percentage of the embedded image patches in the student network are replaced with mask tokens, and the student is trained to match the features of the teacher at masked locations.
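The masked-prediction term can be sketched under the same assumptions, with the per-patch student and teacher features already computed; a plain L2 match is used here for concreteness, whereas the paper matches teacher features with a distillation-style loss as above.

```python
import numpy as np

def masked_prediction_loss(student_feats, teacher_feats, mask):
    """Match student features to teacher features at masked patch positions.

    student_feats: (num_patches, d) features from the student, whose input patch
                   embeddings were replaced by a mask token where mask is True.
    teacher_feats: (num_patches, d) features from the teacher on the unmasked image.
    mask:          (num_patches,) boolean array marking the masked positions.
    """
    diff = student_feats[mask] - teacher_feats[mask]
    return np.mean(np.sum(diff ** 2, axis=-1))
```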
To obtain fixed-resolution checkpoints at multiple resolutions, training is resumed from the 95%-complete checkpoint with the positional embeddings resized to the target resolution, and continues at that resolution with all losses active.
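The positional-embedding resizing can be sketched as a bilinear interpolation of the learned embedding grid; the helper below (its name and the use of SciPy are assumptions) reshapes the embeddings to their 2D patch grid, interpolates, and flattens back.

```python
import numpy as np
from scipy.ndimage import zoom

def resize_pos_embed(pos_embed, old_grid, new_grid):
    """Bilinearly resize learned positional embeddings to a new patch grid.

    pos_embed: (old_h * old_w, d) learned positional embeddings.
    old_grid, new_grid: (height, width) patch-grid sizes before and after.
    """
    (old_h, old_w), (new_h, new_w) = old_grid, new_grid
    d = pos_embed.shape[-1]
    grid = pos_embed.reshape(old_h, old_w, d)
    resized = zoom(grid, (new_h / old_h, new_w / old_w, 1.0), order=1)  # bilinear
    return resized.reshape(new_h * new_w, d)
```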
NaFlex combines ideas from FlexiViT, supporting multiple sequence lengths, and NaViT, processing images at their native aspect ratio. The input image is resized so that its height and width are multiples of the patch size, the aspect-ratio distortion is kept as small as possible, and the resulting sequence length does not exceed the desired target. The learned positional embeddings are bilinearly resized to the (generally non-square) patch grid of the resized image. When the resulting sequence is shorter than the target length, the attention layers are masked to ignore the padding tokens.
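A sketch of the resize-target selection: among all patch grids whose token count fits the target sequence length, pick the one minimizing aspect-ratio distortion, breaking ties toward more patches. The patch size, default sequence length, and exhaustive search procedure are illustrative assumptions, not the paper's exact algorithm.

```python
import math

def naflex_target_size(height, width, patch_size=16, max_seq_len=256):
    """Choose a resize target with sides that are multiples of the patch size,
    at most max_seq_len patches, and minimal aspect-ratio distortion."""
    native_ratio = width / height
    best = None  # (distortion, -num_patches, new_h, new_w)
    for h_patches in range(1, max_seq_len + 1):
        for w_patches in range(1, max_seq_len // h_patches + 1):
            new_h, new_w = h_patches * patch_size, w_patches * patch_size
            distortion = abs(math.log((new_w / new_h) / native_ratio))
            candidate = (distortion, -(h_patches * w_patches), new_h, new_w)
            if best is None or candidate < best:
                best = candidate
    return best[2], best[3]   # (new_height, new_width) to resize the image to
```

After resizing, the image is patchified into at most `max_seq_len` tokens; if fewer tokens result, the sequence is padded and the padding is masked out in attention, as described above.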
To maximize performance of the smallest fixed-resolution models (ViT-B/16 and ViT-B/32), knowledge is distilled from a teacher model during a short fine-tuning stage, using the ACID method.
In zero-shot classification and retrieval experiments, SigLIP 2 performs better than SigLIP and other baselines, despite supporting many languages. SigLIP 2's improvements are particularly significant for the B-sized models, owing to distillation. SigLIP 2's recall exceeds that of SigLIP by a large margin on Crossmodal-3600 (XM3600), while only lagging slightly behind mSigLIP.
The NaFlex variant outperforms the standard variant on the majority of OCR/document/screen-focused image-text benchmarks, particularly for small sequence lengths.
In experiments evaluating SigLIP 2 as a vision encoder for VLMs, SigLIP 2 outperforms SigLIP across resolutions and model sizes. For an L-sized vision encoder, SigLIP 2 also outperforms the recently released AIMv2 model.
In dense prediction tasks, SigLIP 2 outperforms several previous open, CLIP-style vision encoders, including SigLIP.
In open-vocabulary segmentation experiments, SigLIP 2 at L/16 improves on SigLIP and surpasses the much bigger OpenCLIP G/14 model.
In referring expression comprehension experiments, SigLIP 2 outperforms SigLIP as well as CLIP and pretraining via image captioning by a large margin, but is outperformed by LocCa.
In open-vocabulary detection experiments, SigLIP 2 achieves better performance than SigLIP on the two popular benchmarks COCO and LVIS.
In terms of cultural diversity, SigLIP 2 demonstrates an improvement in zero-shot classification accuracy on Dollar Street, GeoDE, and Google Landmarks Dataset v2 (GLDv2), and in 10-shot geolocalization using Dollar Street and GeoDE.
In terms of fairness, SigLIP 2 exhibits considerably lower representation bias than SigLIP.
In conclusion, SigLIP 2 achieves significant improvements in zero-shot classification, transfer performance as a vision encoder in VLMs, and in localization and dense prediction tasks. Additionally, SigLIP 2 attains more balanced quality across culturally diverse data and supports multiple resolutions with a single model checkpoint, while preserving the native image aspect ratio.