
SigLIP: Sigmoid Loss for Language-Image Pre-Training

Updated 29 January 2026
  • The paper introduces SigLIP, replacing global softmax with independent sigmoid evaluations to reduce memory and computation while ensuring effective multimodal alignment.
  • Methodologically, SigLIP uses learnable temperature and bias to adjust embedding geometry, achieving stable performance across varying batch sizes and handling noisy data.
  • Empirical results show that SigLIP outperforms InfoNCE-based losses, delivering robust zero-shot and retrieval performance through efficient, scalable training protocols.

Sigmoid Loss for Language-Image Pre-Training (SigLIP) denotes a class of contrastive self-supervised objectives and training protocols for learning highly aligned multimodal embeddings. In SigLIP, similarity between image and text representations is optimized using a pairwise sigmoid loss, in contrast to the global softmax-based normalization in InfoNCE losses (e.g., CLIP). This architectural and loss-formulation paradigm yields improvements in computational efficiency, batch-size robustness, memory requirements, and tolerance to noisy or ambiguous data. SigLIP underlies recent state-of-the-art models in vision–language pre-training and has been rigorously analyzed for its embedding geometry, theoretical properties, and downstream task performance (Zhai et al., 2023, Lee et al., 2024, Bangachev et al., 2025, Bulat et al., 2024).

1. Mathematical Formulation and Training Objective

The SigLIP loss operates on sets of image–text pairs. Each image $I_i$ and caption $T_j$ are encoded via modality-specific deep architectures and projected into a shared $d$-dimensional unit-norm embedding space, yielding vectors $x_i = f(I_i)$ and $y_j = g(T_j)$ with $\|x_i\| = \|y_j\| = 1$ (Zhai et al., 2023, Bulat et al., 2024). The similarity score for a pair is

$$s_{ij} = t\, x_i^\top y_j + b$$

where $t > 0$ is a learnable temperature and $b$ is a learnable bias. A label matrix $z_{ij}$ is defined as $+1$ for positive pairs (usually $i = j$ or mined positives) and $-1$ for negatives. The SigLIP loss is a mean binary cross-entropy over all image–text pairs:

$$L_{\mathrm{sigmoid}} = -\frac{1}{N^2} \sum_{i=1}^N \sum_{j=1}^N \log \left[ \sigma \left( z_{ij} \left( t\, x_i^\top y_j + b \right) \right) \right]$$

with $\sigma(u) = 1/(1+e^{-u})$. In practice, the normalization can be adjusted to average separately over positives and negatives as needed.

Unlike InfoNCE, which normalizes via a global softmax over all pairs and thereby entangles positive and negative scores, SigLIP applies an independent sigmoid term to each pair (Zhai et al., 2023). The loss supports arbitrary positive:negative ratios and decouples the loss normalization from the training batch size.

2. Embedding Geometry: Double-Constant Embedding Model and Constellations

The geometric structure of learned embeddings under the sigmoid loss can be parameterized by the Double-Constant Embedding Model (CCEM), which captures symmetric configurations via a single parameter $\delta \geq 0$ (Lee et al., 2024). Let $\{e_i\}$ denote an $(N-1)$-simplex equiangular tight frame (ETF). The embeddings are constructed as:

$$x_i^{(\delta)} = \left( \frac{1}{\sqrt{1+\delta^2}} e_i \;\middle|\; \frac{\delta}{\sqrt{1+\delta^2}} \right)^\top$$

$$y_i^{(\delta)} = \left( \frac{1}{\sqrt{1+\delta^2}} e_i \;\middle|\; \frac{-\delta}{\sqrt{1+\delta^2}} \right)^\top$$

giving positive-pair inner product $(1-\delta^2)/(1+\delta^2)$ and negative-pair inner products $-[(1/(N-1)) + \delta^2]/(1+\delta^2)$. As $\delta$ varies from $0$ to $\infty$, the embeddings interpolate between a simplex ETF (full alignment) and an antipodal arrangement (strong separation).
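To make the construction concrete, here is a sketch that builds the CCEM pair for a given $N$ and $\delta$ (the simplex ETF is embedded in $\mathbb{R}^N$ rather than $\mathbb{R}^{N-1}$ for simplicity; names are illustrative):

```python
import numpy as np

def ccem_embeddings(N, delta):
    """Construct the double-constant (CCEM) image/text embeddings for one delta.

    Simplex-ETF vectors e_i are built as centered, normalized basis vectors,
    then an extra coordinate of +/- delta/sqrt(1+delta^2) separates the
    image side (+) from the text side (-).
    """
    E = np.eye(N) - np.ones((N, N)) / N
    E /= np.linalg.norm(E, axis=1, keepdims=True)   # e_i . e_j = -1/(N-1), i != j
    s = np.sqrt(1.0 + delta**2)
    extra = np.full((N, 1), delta / s)
    x = np.hstack([E / s,  extra])                  # image embeddings
    y = np.hstack([E / s, -extra])                  # text embeddings
    return x, y
```

Computing the Gram matrix `x @ y.T` reproduces the two constants above: $(1-\delta^2)/(1+\delta^2)$ on the diagonal and $-[(1/(N-1))+\delta^2]/(1+\delta^2)$ off it.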

Theoretical analysis demonstrates that the global minimizers of the sigmoid contrastive loss lie within the CCEM family (Lee et al., 2024). Furthermore, trainable temperature and bias extend this geometry into the richer class of $(m, b)$-Constellations, combinatorial objects related to spherical codes that define precise separation bounds between positive and negative pairs (Bangachev et al., 2025).

3. Batch Size, Loss Normalization, and Computational Efficiency

SigLIP fundamentally decouples the loss normalization from batch size and negative sampling (Zhai et al., 2023). Since each pair contributes independently, SigLIP models achieve stable and high performance across a wide range of batch sizes—from per-example streaming up to million-scale batches.

Key computational advantages include:

  • No global softmax normalization: each device processes its local mini-batch and negatives, reducing peak memory requirements from $O(D b^2)$ (InfoNCE) to $O(b^2)$ (SigLIP), where $b$ is the per-device batch size and $D$ the number of devices.
  • Memory and throughput gains: SigLIP eliminates global all-gather operations, lowers overhead, and supports efficient training on moderate hardware (e.g., 4 TPUv4 chips for strong ImageNet zero-shot performance).
  • Robustness to batch size: empirically, performance saturates at batch size $\approx 32{,}000$; further scaling gives diminishing returns ($<0.3\%$ additional accuracy at $>1$M batch size) (Zhai et al., 2023, Bulat et al., 2024).

This structure also enables probing of positive:negative ratios, selective negative mining, and curriculum strategies.
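Because each pair contributes an independent term, the loss can also be evaluated block-by-block without ever materializing the full $N \times N$ similarity matrix, which is the essence of the per-device chunked scheme. A simplified single-machine sketch (names and the chunking granularity are illustrative):

```python
import numpy as np

def siglip_loss_chunked(x, y, t, b, chunk=2):
    """Sigmoid loss accumulated over chunks of text embeddings.

    Only an (n, chunk) block of similarities is live at any time, mimicking
    how each device can sweep over negatives without a global all-gather.
    """
    n = x.shape[0]
    total = 0.0
    for j0 in range(0, n, chunk):
        y_c = y[j0:j0 + chunk]
        logits = t * (x @ y_c.T) + b                 # (n, chunk) block only
        cols = np.arange(j0, j0 + y_c.shape[0])
        z = np.where(cols[None, :] == np.arange(n)[:, None], 1.0, -1.0)
        total += np.sum(np.log1p(np.exp(-z * logits)))
    return total / n**2
```

The result is identical to the full-matrix computation for any chunk size, which is what makes the memory footprint a free parameter rather than a constraint tied to the batch size.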

4. Training Protocols, Implementation, and Positive Pair Mining

SigLIP-based pre-training typically follows these steps (Bulat et al., 2024):

  • Encode images and texts, project to shared space, apply L2 normalization.
  • Construct similarity matrices $S_{it}$ (image–text), $S_{ii}$ (image–image), and $S_{tt}$ (text–text).
  • Synthesize an assignment mask $M_{ij} \in \{\pm 1\}$: assign $+1$ for pairs matching semantic or syntactic criteria; mine true positives using thresholds on similarity scores (e.g., $p_1=0.27$, $p_2=0.92$, $p_3=0.99$).
  • Augment caption sets by generating $k$ pseudo-captions per image (via models like BLIP-2), forming an expanded text batch.
  • Compute and average per-pair sigmoid losses.
  • Update model parameters (encoders, projection heads, bias) via AdamW or similar optimizers.

This pairwise loss supports multiple true positives per image, tolerates noisy assignments, and is agnostic to the number of positives/negatives per batch, in contrast to softmax normalization (Bulat et al., 2024).
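The steps above can be condensed into a sketch in which an arbitrary $\pm 1$ mask replaces the identity assignment, allowing several true positives per image (e.g., mined pairs or pseudo-captions). Mask construction from the similarity thresholds is omitted, and all names are illustrative:

```python
import numpy as np

def masked_sigmoid_loss(x, y, M, t, b):
    """Per-pair sigmoid loss under an arbitrary +/-1 assignment mask M.

    x : (N, d) image embeddings.
    y : (K, d) text embeddings; K >= N when pseudo-captions expand the batch.
    M : (N, K) mask with +1 for assigned positives and -1 for negatives.
    """
    logits = t * (x @ y.T) + b
    return np.mean(np.log1p(np.exp(-M * logits)))
```

Note that the text batch need not be square with the image batch: the mask, not the loss formula, carries the assignment structure.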

5. Theoretical Properties: Margin, Bias, and the Modality Gap

Recent studies have formalized the combinatorics and representation limits of sigmoid contrastive learning. Every global minimum of the SigLIP loss (with fully trainable temperature and bias) corresponds to an $(m, b)$-Constellation: a collection of embedding pairs separated by a margin $m$ and relative bias $b$ (Bangachev et al., 2025). The feasible number of pairs with a given margin/bias scales exponentially with dimension, with boundaries governed by Shannon–Wyner spherical code bounds.

These minima guarantee perfect top-1 retrieval: the highest dot product for an image always corresponds to its matching caption. Importantly, SigLIP induces a "modality gap": linearly separable regions for image and text embeddings, observed empirically and justified theoretically. Embedding-dimension requirements follow directly from the combinatorial bounds.

Reparameterizing the loss via an explicit relative bias $\beta = b/t$ accelerates and stabilizes training and supports scenarios with locked encoders or multi-modality alignment.
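The reparameterization itself is a simple algebraic identity, $t \cdot s + b = t(s + \beta)$ with $\beta = b/t$; a minimal sketch (function names are illustrative):

```python
import numpy as np

def logits_standard(sim, t, b):
    """Standard SigLIP logits: scale similarities, then add the bias."""
    return t * sim + b

def logits_relative(sim, t, beta):
    """Relative-bias form: shift similarities by beta = b/t, then scale.

    Decoupling the shift from the scale lets the margin (beta) be tuned or
    frozen independently of the temperature, e.g. with locked encoders.
    """
    return t * (sim + beta)
```

Both forms produce identical logits; the benefit is purely in the optimization dynamics, since gradients with respect to $\beta$ are no longer coupled to the current value of $t$.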

6. Empirical Results and Comparison to InfoNCE-Based Models

Extensive experiments validate the efficacy and robustness of SigLIP (Zhai et al., 2023, Bulat et al., 2024):

  • ImageNet zero-shot: SigLiT with a frozen g/14 vision tower, batch size 20k, and four TPUv4 chips achieves 84.5% accuracy in two days, matching baselines trained on much larger hardware fleets (Zhai et al., 2023).
  • Batch size scaling: the sigmoid loss exhibits superior performance at small batch sizes ($<16$k) and matches or slightly exceeds InfoNCE at large batches; saturation is observed at batch size $\sim 32$k (no further gain at 1M).
  • Caption diversity and positive mining: Multiple pseudo-captions and mask-based mining for true positives yield large gains—up to +19 points on average for zero-shot classification and +40–50 points for retrieval across major benchmarks.
  • Multilingual retrieval: Saturation at batch size 32k; larger batches degrade multilingual performance.
  • Robustness to label noise: SigLIP models outperform supervised contrastive objectives under noisy assignments.
  • Hyperparameter stability: default settings for learning rate ($1\times10^{-3}$), weight decay ($1\times10^{-4}$), temperature ($t\approx 10$), and bias ($b \approx -10$) are effective across large-scale sweeps.

7. Practical Recommendations and Future Directions

Guidelines for deploying SigLIP in contrastive multimodal training (Zhai et al., 2023, Lee et al., 2024, Bangachev et al., 2025, Bulat et al., 2024):

  • Use temperature $t \gtrsim \log(N)$ to guarantee ETF geometry (uniform, well-aligned embeddings).
  • Set the bias $b$ close to $t$ for balanced positive/negative contributions; an explicit relative-bias parameterization further stabilizes optimization.
  • Employ moderate batch sizes (up to 32k); scaling beyond this yields diminishing returns.
  • Leverage online positive mining and caption augmentation to improve alignment.
  • Utilize small-batch or streaming training to maximize computational efficiency.
  • Consider relative bias parametrization for scenarios requiring locked encoders or modality adaptation.
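A minimal sketch of an initialization consistent with the defaults cited above (parameterizing the temperature in log space, a common practice, so it stays positive during optimization; the exact values are the $t\approx 10$, $b\approx -10$ settings from Section 6):

```python
import numpy as np

# Learnable scalars, following the defaults above: t ~ 10, b ~ -10.
t_prime = np.log(10.0)   # optimize t_prime; t = exp(t_prime) is always positive
b = -10.0
t = np.exp(t_prime)

# With this init, a perfectly aligned pair (cosine similarity 1) sits at the
# sigmoid decision boundary: t * 1 + b ~= 0, so sigma(logit) ~= 0.5, while
# typical negatives start deep in the confident-negative regime.
boundary_logit = t * 1.0 + b
```

Starting positives near the decision boundary and negatives well below it gives informative gradients from the first step without saturating the sigmoid.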

Current research suggests directions including optimal negative mining, disentangled loss normalization, and scaling SigLIP to accessible hardware for broader adoption. The SigLIP loss formulation offers a principled, efficient approach for high-quality language–image pre-training, facilitating advances in retrieval, classification, and alignment tasks across modalities.
