Sigmoid Contrastive Language–Image Pre-training
- SigLIP is a joint vision–language pretraining approach that employs a pairwise sigmoid loss to align normalized image and text embeddings.
- It eliminates batch-global softmax normalization and supports multiple positives per anchor, improving training robustness and flexibility.
- Empirical benchmarks show SigLIP and SigLIP 2 achieve competitive zero-shot and retrieval performance with reduced computational overhead.
Sigmoid Contrastive Language–Image Pre-training (SigLIP) refers to a contemporary paradigm for joint vision–language representation learning, in which paired embeddings for images and texts are aligned via a pairwise sigmoid (binary cross-entropy) loss, superseding the softmax-based (InfoNCE) contrastive objectives characteristic of earlier large-scale systems (e.g., CLIP, ALIGN). SigLIP and its direct successors—most notably SigLIP 2—introduce substantial architectural, algorithmic, and theoretical refinements that enhance performance, efficiency, and flexibility. The approach is characterized by its elimination of batch-global softmax normalization, accommodation of multiple positives per anchor, and decoupling of positive/negative pairwise treatment. These features enable robust training under realistic web-scale data noise and diversity, strengthen results on downstream tasks, and facilitate foundation model development with practical memory and compute requirements (Zhai et al., 2023, Tschannen et al., 20 Feb 2025, Bulat et al., 16 May 2024, Bangachev et al., 23 Sep 2025, Lee et al., 20 Feb 2024).
1. Mathematical Formulation and Loss Derivation
The core innovation of SigLIP is the use of a pairwise sigmoid-based binary classification loss applied to all possible image–text pairs in a training batch. For a mini-batch of $N$ image–text pairs, let $\mathbf{x}_i$ and $\mathbf{y}_i$ denote their $\ell_2$-normalized encodings: $\mathbf{x}_i = f(I_i)/\lVert f(I_i)\rVert_2$, $\mathbf{y}_i = g(T_i)/\lVert g(T_i)\rVert_2$. The scalar similarity is $s_{ij} = t\,\mathbf{x}_i \cdot \mathbf{y}_j + b$, where $t$ is a learnable temperature and $b$ is a learnable bias.
Let $\mathcal{P}$ denote the positive (matched) pairs and $\mathcal{N}$ the negatives. The SigLIP loss is
$$\mathcal{L} = -\frac{1}{N}\left[\sum_{(i,j)\in\mathcal{P}} \log \sigma(s_{ij}) + \sum_{(i,j)\in\mathcal{N}} \log \sigma(-s_{ij})\right],$$
where $\sigma(u) = 1/(1+e^{-u})$ is the logistic sigmoid. Generalizations to multi-positive training extend $\mathcal{P}$ to include multiple mined or synthetic positives per anchor (Zhai et al., 2023, Bulat et al., 16 May 2024).
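A minimal PyTorch sketch of this loss is given below; it assumes the embeddings are already $\ell_2$-normalized and that the learnable scalars $t$ and $b$ are supplied by the caller (names and signatures are illustrative, not the reference implementation).

```python
import torch
import torch.nn.functional as F

def siglip_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                t: float, b: float) -> torch.Tensor:
    """Pairwise sigmoid loss over all N*N image-text pairs in a batch.

    img_emb, txt_emb: (N, d) L2-normalized embeddings with matched rows.
    t, b: temperature and bias (learnable scalars in practice).
    """
    n = img_emb.shape[0]
    logits = t * img_emb @ txt_emb.T + b                   # s_ij = t * x_i . y_j + b
    labels = 2 * torch.eye(n, device=logits.device) - 1    # +1 on the diagonal, -1 elsewhere
    # -log sigma(label * s_ij), summed over pairs and averaged over anchors
    return -F.logsigmoid(labels * logits).sum(dim=1).mean()
```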
This pairwise decoupling contrasts with InfoNCE, which applies softmax normalization over all batch-pair logits, enforcing mutual exclusivity per anchor and necessitating global batch communication and numerically stable normalizations.
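For contrast, a corresponding sketch of the softmax (InfoNCE) objective in the same notation; each row and column of the logit matrix is normalized over the entire batch, so exactly one positive is assumed per anchor and all embeddings must be gathered before the loss can be evaluated.

```python
import torch
import torch.nn.functional as F

def infonce_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor, t: float) -> torch.Tensor:
    """Symmetric softmax (CLIP-style) contrastive loss, shown for comparison."""
    logits = t * img_emb @ txt_emb.T                       # (N, N) batch-global logits
    targets = torch.arange(logits.shape[0], device=logits.device)
    # softmax over each row (image -> texts) and each column (text -> images)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.T, targets))
```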
2. Theoretical Foundations and Geometric Analysis
The underlying geometry of SigLIP-trained embeddings has been characterized through the double-Constant Embedding Model (CCEM) and, more recently, the introduction of $(m, b_{\mathrm{rel}})$-constellations (Lee et al., 20 Feb 2024, Bangachev et al., 23 Sep 2025). The CCEM interpolates between the Equiangular Tight Frame (ETF) optimum reached at high temperature, where image and text embeddings are symmetrically distributed and aligned, and antipodal configurations at low temperature, reflecting repulsive regimes.
For a batch of $N$ pairs, empirical and theoretical analysis demonstrates that with appropriate selection of temperature $t$ and bias $b$, the optimal SigLIP embedding structure converges to a CCEM configuration, which is a global minimum for the sigmoid loss. $(m, b_{\mathrm{rel}})$-constellations formalize configurations in which a margin $m$ and relative bias $b_{\mathrm{rel}}$ admit perfect separability of matching and non-matching pairs; the loss attains zero in the limit $t \to \infty$ (i.e., a sharpened sigmoid) with the bias adjusted appropriately.
Crucially, the learnable bias $b$, free to vary independently of the temperature $t$, enables the optimizer to attain zero-loss global minima for a rich family of embedding constellations even when the batch size exceeds what an ETF-type configuration in dimension $d$ can accommodate. Rigorous bounds relate the maximal attainable batch size (i.e., the packing number) for a given margin $m$ to the embedding dimensionality $d$, connecting SigLIP's finite-$d$ expressivity to performance guarantees in retrieval (Bangachev et al., 23 Sep 2025). These results also explain the empirically observed "modality gap," where image and text embeddings form linearly separable clusters, a direct consequence of constellation geometry.
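The zero-loss limit can be checked numerically with a toy constellation: taking image $i$ and text $i$ to share the basis vector $e_i$ gives positives cosine similarity $1$ and negatives $0$, and placing the bias at the midpoint of that margin drives the loss toward zero as the temperature sharpens (a minimal sketch, not drawn from the cited analyses).

```python
import torch
import torch.nn.functional as F

# Orthonormal embeddings: positives have cosine 1, negatives cosine 0.
# With b = -t/2, every label-weighted logit equals t/2, so the per-anchor
# loss is N * log(1 + exp(-t/2)) and vanishes as t grows.
n = 8
emb = torch.eye(n)                          # image i and text i both equal e_i
labels = 2 * torch.eye(n) - 1               # +1 positives, -1 negatives
for t in (1.0, 10.0, 100.0):
    b = -t / 2.0                            # bias at the midpoint of the margin
    logits = t * emb @ emb.T + b
    loss = -F.logsigmoid(labels * logits).sum(dim=1).mean()
    print(f"t = {t:6.1f}   loss = {loss.item():.6f}")
```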
3. Model Architecture, Training Protocols, and Implementation
Both SigLIP and SigLIP 2 utilize a two-tower design with independent vision and text encoders:
- Vision tower: Patch-based Vision Transformers (ViT-B/16, ViT-L/16, So400m/14, “g”/16) with learned positional embeddings and attention-based pooling.
- Text tower: Transformer models with matching hidden size, processing up to 64 tokens and using the same attention-based pooling (MAP head; see the sketch after this list).
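A sketch of such an attention-based (MAP) pooling head is given below; the width, head count, and MLP shape are illustrative assumptions rather than the released configuration.

```python
import torch
import torch.nn as nn

class MAPHead(nn.Module):
    """Attention-based (MAP) pooling sketch: one learned query (probe)
    cross-attends over the encoder's output tokens, followed by a residual MLP."""
    def __init__(self, width: int = 768, heads: int = 12, mlp_ratio: int = 4):
        super().__init__()
        self.probe = nn.Parameter(torch.randn(1, 1, width) * 0.02)
        self.attn = nn.MultiheadAttention(width, heads, batch_first=True)
        self.norm = nn.LayerNorm(width)
        self.mlp = nn.Sequential(
            nn.Linear(width, mlp_ratio * width), nn.GELU(),
            nn.Linear(mlp_ratio * width, width),
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:   # tokens: (B, L, width)
        probe = self.probe.expand(tokens.shape[0], -1, -1)
        pooled, _ = self.attn(probe, tokens, tokens)            # (B, 1, width)
        pooled = pooled + self.mlp(self.norm(pooled))
        return pooled.squeeze(1)                                # (B, width)
```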
Variants include “locked-image tuning” (SigLiT), where vision encoders are frozen and only text encoders are trained, yielding significant speed-ups and reusing pre-trained visual features (Zhai et al., 2023, Tschannen et al., 20 Feb 2025).
Training and Optimization:
- Optimizers: Adafactor ($\beta_2 = 0.95$, as in standard ViT training), AdamW, or LION for more aggressive fine-tuning.
- Batch size: Effective batch scaling from $512$ up to $1M$ is feasible, but empirical saturation occurs around $32k$; larger batches yield diminishing returns (Zhai et al., 2023).
- Data augmentation: Standard augmentation as in CLIP (random crop, color jitter, etc.); text truncated to model capacity.
- Positives/negatives: Multiple positives per image via synthetic captions and batch-local mining using cosine-similarity thresholds (see the sketch after this list). No momentum encoders, hard-negative sampling, or global logit normalization are required (Bulat et al., 16 May 2024).
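A sketch of this batch-local mining step follows; the thresholds and the rule of letting near-duplicate images (or captions) share positives are illustrative assumptions rather than the exact criteria of the cited work.

```python
import torch

def mine_positive_mask(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                       thr_ii: float = 0.9, thr_tt: float = 0.9,
                       thr_it: float = 0.6) -> torch.Tensor:
    """Batch-local positive mining via cosine-similarity thresholds (sketch).

    Beyond the matched diagonal, text j is also marked as a positive for
    image i when the image-image, text-text, or image-text cosine similarity
    exceeds its threshold. Embeddings are assumed L2-normalized.
    """
    n = img_emb.shape[0]
    pos = torch.eye(n, dtype=torch.bool, device=img_emb.device)
    pos |= (img_emb @ img_emb.T) > thr_ii   # near-duplicate images share captions
    pos |= (txt_emb @ txt_emb.T) > thr_tt   # near-duplicate captions share images
    pos |= (img_emb @ txt_emb.T) > thr_it   # strongly matching cross-modal pairs
    return pos                              # usable as a multi-positive label mask (Section 5)
```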
Efficient Implementation: Because the loss decomposes over pairs, the full batch-level logit matrix never needs to be materialized on a single device: the loss is accumulated in per-device chunks while text representations circulate between devices via collective permutes (a chunked, ring-style all-permute pattern). This reduces peak memory from quadratic in the global batch size, as required by softmax-based approaches that gather all embeddings for a numerically stable normalization, to quadratic in the per-device batch size, supporting high-throughput multi-chip regimes.
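A single-device sketch of this chunked accumulation is shown below; in the multi-chip setting the text blocks would instead arrive from neighboring devices via collective permutes. The function name and chunk size are illustrative.

```python
import torch
import torch.nn.functional as F

def chunked_sigmoid_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                         t: float, b: float, chunk: int = 1024) -> torch.Tensor:
    """Pairwise sigmoid loss accumulated over text chunks (sketch).

    Only an (N, chunk) block of logits is materialized at a time, which is
    what ties the memory footprint to the chunk (per-device) size.
    """
    n = img_emb.shape[0]
    total = img_emb.new_zeros(())
    for start in range(0, n, chunk):
        txt_blk = txt_emb[start:start + chunk]                    # (c, d)
        logits = t * img_emb @ txt_blk.T + b                      # (N, c)
        labels = -torch.ones_like(logits)
        rows = torch.arange(start, min(start + chunk, n), device=img_emb.device)
        labels[rows, rows - start] = 1.0                          # matched pairs in this block
        total = total - F.logsigmoid(labels * logits).sum()
    return total / n
```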
4. Empirical Performance and Benchmarks
SigLIP and SigLIP 2 establish state-of-the-art performance across retrieval, zero-shot classification, dense prediction, localization, and fairness metrics. Key results include:
| Model/Experiment | Data/Setup | ImageNet ZS (%) | Flickr30k Img/Text R@1 (%) | COCO Img/Text R@1 (%) | Multilingual Recall@1 XM3600 |
|---|---|---|---|---|---|
| SigLiT (g/14) | LiT, batch 20k, 2d, 4×TPUv4 | 84.5 | – | – | – |
| SigLIP (B/16) | YFCC, batch 32k, from-scratch | 73.4 | – | – | – |
| SigLIP (ViT-B/32) | YFCC15M, k=5, all mining | 51.1 | 67.6/85.3 | 44.3/61.7 | – |
| SigLIP 2 (ViT-L/16@512) | WebLI, batch 32k | 83.5 | – | – | – |
| mSigLIP (WebLI, 109 lang) | Multilingual, batch 32k | – | – | – | 54.1 / 34.9 |
- SigLIP outperforms prior CLIP-like (softmax-based) models at moderate batch sizes by $2$ or more points absolute on zero-shot image classification.
- With synthetic and mined positives, SigLIP closes or exceeds the performance gap for multi-positive, noisy, or multilingual data (Zhai et al., 2023, Bulat et al., 16 May 2024, Tschannen et al., 20 Feb 2025).
- SigLIP 2 delivers further improvements on zero-shot, retrieval, dense prediction (e.g., segmentation, depth) and localization, as well as substantial reductions in bias and improvements in global/cultural fairness (Tschannen et al., 20 Feb 2025).
- Ablations confirm the robustness of SigLIP to noise, number of positives, and thresholding hyperparameters, with minimal sensitivity to batch mining thresholds (Bulat et al., 16 May 2024).
5. Extensions: Multi-Positive Learning and Robustness
A principal advantage of SigLIP is its intrinsic support for multiple positives per anchor. By training with multiple synthetic captions (e.g., via BLIP-2 or OFA) and batch-mined positives (via image-image, text-text, and image-text similarity thresholds), SigLIP avoids the normalization constraints of InfoNCE/softmax, which allocates the entire probability mass to a single positive (Bulat et al., 16 May 2024). The addition of multiple positives per sample drives substantial increases in downstream robustness and accuracy, mitigating issues stemming from caption duplication, web-scale data noise, and ambiguous semantic assignments.
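A hedged sketch of this multi-positive variant is given below: relative to the standard loss, the only change is that the label matrix may carry several $+1$ entries per row (e.g., from synthetic captions or a mined mask as in Section 3).

```python
import torch
import torch.nn.functional as F

def multi_positive_sigmoid_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                                pos_mask: torch.Tensor, t: float, b: float) -> torch.Tensor:
    """Pairwise sigmoid loss with an arbitrary boolean positive mask (sketch).

    pos_mask[i, j] = True marks text j as a positive for image i (the matched
    diagonal plus any mined or synthetic positives); all other pairs are negatives.
    """
    logits = t * img_emb @ txt_emb.T + b
    labels = pos_mask.float() * 2 - 1        # +1 for positives, -1 for negatives
    return -F.logsigmoid(labels * logits).sum(dim=1).mean()

# Example: image 0 has two valid captions (texts 0 and 2); no softmax constraint
# forces them to compete for a single unit of probability mass.
n, d = 4, 16
img = F.normalize(torch.randn(n, d), dim=-1)
txt = F.normalize(torch.randn(n, d), dim=-1)
mask = torch.eye(n, dtype=torch.bool)
mask[0, 2] = True
loss = multi_positive_sigmoid_loss(img, txt, mask, t=10.0, b=-10.0)
```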
The pairwise binary classification setup confers resilience to noisy and misassigned labels, since an error affects only the pairs involved rather than a global normalization, consistent with the noise tolerance observed empirically (Bulat et al., 16 May 2024).
6. SigLIP 2: Recipe Advances and Multitask Extensions
SigLIP 2 builds explicitly on the original SigLIP foundation by incorporating orthogonal advancements:
- Captioning-based multitask pretraining: A lightweight transformer decoder predicting full-image captions, region boxes, and grounded region captions, with the corresponding losses summed with the core sigmoid contrastive objective.
- Self-supervision: Local-to-global feature distillation and masked-prediction objectives preserve global semantics, improve dense and localization transfer, and stabilize convergence.
- Active data curation (ACID): Teacher–student batch scoring selects informative examples, implicitly distilling large-model knowledge into smaller models with minimal computational burden.
- Multilingual and fairness optimizations: Integrated training on WebLI (109 languages), bias filtering, and fixed-resolution plus NaFlex (native aspect ratio, flexible resolution) variants targeting aspect-ratio preservation and variable-resolution use.
- Performance: Substantial improvements over SigLIP on both English and multilingual retrieval, zero-shot, and dense prediction, including closing the gap to closed-weight models on diverse diagnostic tasks (Tschannen et al., 20 Feb 2025).
7. Practical Recommendations and Implementation Tips
SigLIP and SigLIP 2 are compatible with a wide range of batch sizes, but a batch size of $32k$ is typically optimal, enabling state-of-the-art results on relatively modest modern hardware (4 TPUv4 chips). For locked-image tuning, disable weight decay on frozen backbones. Use reduced $\beta_2$ values in Adam/Adafactor optimizers (e.g., $\beta_2 = 0.95$) to stabilize gradient updates at large scale. An efficient implementation uses a chunked, ring-style all-permute pattern to minimize memory overhead.
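For quick zero-shot use of a released checkpoint, the sketch below assumes the Hugging Face transformers integration and the publicly available google/siglip-base-patch16-224 model; consistent with the pairwise objective, per-prompt scores come from independent sigmoids rather than a batch softmax.

```python
import torch
import requests
from PIL import Image
from transformers import AutoModel, AutoProcessor

model = AutoModel.from_pretrained("google/siglip-base-patch16-224")
processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-224")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
texts = ["a photo of 2 cats", "a photo of 2 dogs"]

# padding="max_length" matches how the text tower was trained
inputs = processor(text=texts, images=image, padding="max_length", return_tensors="pt")
with torch.no_grad():
    logits_per_image = model(**inputs).logits_per_image    # (1, num_texts)

probs = torch.sigmoid(logits_per_image)   # independent probabilities, no softmax
print([f"{p:.3f}" for p in probs[0].tolist()])
```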
Future explorations include: hard negative mining, generative-contrastive hybrid objectives, extension to multimodal data streams, semi-supervised objectives, and aggressive adaptation to low-memory deployment (Zhai et al., 2023, Tschannen et al., 20 Feb 2025).
References:
(Zhai et al., 2023, Lee et al., 20 Feb 2024, Tschannen et al., 20 Feb 2025, Bulat et al., 16 May 2024, Bangachev et al., 23 Sep 2025)