Contrastive Learning: CLIP & SigLIP
- Contrastive learning is a self-supervised method that brings similar data closer and pushes dissimilar data apart, exemplified by CLIP and SigLIP.
- It employs a loss formulation whose global minima form a tightly constrained set (permutation matrices in the sparse coding setting), ensuring semantically meaningful and diverse feature embeddings.
- Architectural enhancements like normalization and predictor modules are critical for non-contrastive methods to avoid suboptimal minima and recover latent structures.
Contrastive learning (CL) is a prominent self-supervised representation learning paradigm that relies on learning feature embeddings such that similar (positive) pairs are brought closer and dissimilar (negative) pairs are pushed apart in the representation space. This approach, exemplified by models such as CLIP and SigLIP, has become a central methodology in vision–language pretraining and broader multi-modal domains due to its scalability, effectiveness, and strong empirical and theoretical underpinnings. Recent work has provided detailed insights into the landscape and dynamics of contrastive learning, its theoretical guarantees, loss formulations, and its practical consequences when compared to non-contrastive alternatives.
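To make these objectives concrete, the sketch below implements a CLIP-style symmetric InfoNCE loss and a SigLIP-style pairwise sigmoid loss over a batch of paired embeddings. The batch size, embedding dimension, temperature, and bias values are illustrative assumptions, and the reduction choices are simplified relative to the published implementations.

```python
import torch
import torch.nn.functional as F

def clip_loss(img_emb, txt_emb, temperature=0.07):
    """CLIP-style symmetric InfoNCE over an in-batch similarity matrix."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature             # (B, B) cosine similarities
    targets = torch.arange(img.shape[0])             # matching pairs sit on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return 0.5 * (loss_i2t + loss_t2i)

def siglip_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    """SigLIP-style pairwise sigmoid loss: each (i, j) pair is a binary problem."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() * t + b                   # t and b are learnable in the paper
    labels = 2.0 * torch.eye(img.shape[0]) - 1.0     # +1 on the diagonal, -1 elsewhere
    return -F.logsigmoid(labels * logits).mean()     # simple mean over all pairs

# Toy usage with random embeddings standing in for encoder outputs.
if __name__ == "__main__":
    B, D = 8, 32
    img, txt = torch.randn(B, D), torch.randn(B, D)
    print("CLIP loss:  ", clip_loss(img, txt).item())
    print("SigLIP loss:", siglip_loss(img, txt).item())
```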
1. Loss Landscapes: Contrastive Versus Non-Contrastive Frameworks
Two dominant families of unsupervised/self-supervised learning objectives are: (1) contrastive losses (CL), as in SimCLR, CLIP, and related models, and (2) non-contrastive losses (NCL), as in BYOL and SimSiam. Theoretical analysis in the sparse coding setting shows that contrastive losses possess a highly constrained set of global minima, all corresponding (in the noiseless, identity-dictionary case) to permutation matrices that perfectly recover the latent feature structure. Formally, minimizers of the contrastive loss satisfy:

$$\arg\min_{W} \mathcal{L}_{\mathrm{CL}}(W) = \{\, P \in \mathbb{R}^{d \times d} : P \text{ is a permutation matrix} \,\}.$$
By contrast, non-contrastive losses have a dramatically larger set of global optima—including many “bad” non-collapsed solutions—comprising all matrices in a certain nonnegative, unit-norm column set. As a result, unless the optimization is staged with a “warm start” or aided by techniques like normalization or prediction heads, non-contrastive procedures are generically drawn toward “bad” minima that do not recover the underlying structure (Pokle et al., 2022).
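As a concrete picture of the synthetic setting analyzed above, the following minimal sketch samples data from a noiseless sparse-coding model with an identity dictionary and latent-masking augmentations (the dimension, sparsity level, and masking rate are illustrative assumptions) and shows the sense in which a permutation-matrix encoder recovers the latent code exactly, up to a relabeling of coordinates.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8            # latent dimension (illustrative)
p_active = 0.3   # probability a latent coordinate is active (illustrative)
p_keep = 0.8     # probability an augmentation keeps a coordinate (illustrative)

def sample_views():
    """Noiseless sparse-coding model with identity dictionary: x = I z = z.
    Two views are produced by independently masking latent coordinates."""
    z = rng.standard_normal(d) * (rng.random(d) < p_active)
    v1 = z * (rng.random(d) < p_keep)
    v2 = z * (rng.random(d) < p_keep)
    return z, v1, v2

# A permutation matrix as encoder: f(x) = P x returns the latent code exactly,
# up to a relabeling of coordinates -- the sense in which contrastive global
# minima "perfectly recover the latent structure".
P = np.eye(d)[rng.permutation(d)]
z, v1, v2 = sample_views()
x = z                                   # identity dictionary: the observation equals the code
print("latent code z :", np.round(z, 2))
print("encoder P @ x :", np.round(P @ x, 2))   # same entries, permuted order
```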
2. Training Dynamics and the Role of Initialization
The convergence behavior under contrastive and non-contrastive losses diverges. For linear networks trained with the non-contrastive loss and weight decay, the weight matrices remain confined to the linear span of their initialization:

$$W^{(o)}_t = C_{1,t}\, W^{(o)}_0 + C_{2,t}\,\bigl(W^{(o)}_0 + W^{(t)}_0\bigr),$$

where the superscripts denote the online and target branches, and the scalars $C_{1,t}$ and $C_{2,t}$ are determined by the optimization hyperparameters. If the initialization does not already point toward a permutation (i.e., is far from the ground truth), NCL from random initialization is ineffective and incapable of recovering the desired latent structure.
For ReLU or other non-linear networks, non-contrastive objectives can recover the correct representation, but only under a warm start and with row normalization or an added prediction head. Here the dynamics of alternating gradient and normalization steps systematically shrink the off-diagonal entries of $W_t$, while the diagonal elements monotonically approach unity. An update satisfying

$$\bigl|W_{t+1}[i,j]\bigr| \le \rho\, \bigl|W_t[i,j]\bigr| \quad \text{for } i \neq j$$

ensures convergence to the identity as long as a contraction factor $\rho < 1$ is maintained throughout the updates (Pokle et al., 2022).
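The following schematic simulation, which assumes a contraction-plus-row-normalization update as a stand-in for the exact dynamics, illustrates the stated behavior: from a warm start near the identity, repeatedly contracting the off-diagonal entries and re-normalizing the rows drives the off-diagonal mass toward zero and the diagonal entries toward one. The contraction factor, perturbation scale, and step count are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
d, rho, steps = 6, 0.9, 200        # dimension, contraction factor, iteration count (illustrative)

# Warm start: identity plus a small perturbation, as assumed by the analysis.
W = np.eye(d) + 0.3 * rng.standard_normal((d, d))

for _ in range(steps):
    diag = np.diag(np.diag(W))
    W = diag + rho * (W - diag)                        # contract the off-diagonal entries
    W /= np.linalg.norm(W, axis=1, keepdims=True)      # row-normalization step

print("max |off-diagonal| :", np.abs(W - np.diag(np.diag(W))).max())   # -> near 0
print("min diagonal entry :", np.diag(W).min())                        # -> near 1
```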
3. Empirical Characterization and Metrics
Extensive controlled experiments on synthetic sparse coding models validate the contrasting behaviors predicted by theory. Non-contrastive models without normalization or predictor modules fail to reach high max-cosine similarity (a direct evaluation against the ground-truth dictionary). Conversely, both contrastive objectives and non-contrastive objectives augmented with a prediction head consistently attain high alignment with the underlying codes.
Typical empirical trajectories are characterized by:
- “Minimum max-cosine” (alignment) metrics tracking the degree to which columns of learned weights match the ground truth (a sketch of this metric follows this list).
- Convergence properties mirroring theoretical predictions: NCL optimization can stagnate at poor local minima without architectural “tricks.”
- Augmentations such as normalization or auxiliary heads can, in some cases, rescue the NCL optimization, pushing it to desirable minima.
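Below is a minimal sketch of the alignment metric referenced above, assuming the learned features live in the rows of $W$ and the ground-truth dictionary columns are the targets; the permutation/random comparison simply illustrates the two extremes.

```python
import numpy as np

def min_max_cosine(W, D):
    """Alignment metric: for each ground-truth dictionary column, take the best-
    matching row of the learned weights W by absolute cosine similarity, then
    report the worst such match. Values near 1 mean every latent direction is
    recovered by some learned feature."""
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)    # normalize learned rows
    Dn = D / np.linalg.norm(D, axis=0, keepdims=True)    # normalize dictionary columns
    cos = np.abs(Wn @ Dn)                                # (learned rows) x (dictionary columns)
    return cos.max(axis=0).min()

rng = np.random.default_rng(0)
d = 16
D = np.eye(d)                                            # identity dictionary from the synthetic setup
print("permutation encoder:", min_max_cosine(np.eye(d)[rng.permutation(d)], D))  # exactly 1.0
print("random encoder     :", min_max_cosine(rng.standard_normal((d, d)), D))    # noticeably below 1
```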
4. Architectural and Algorithmic Design Principles
A decisive implication of contrastive learning theory is that the negative sampling (repulsive) term is essential for “spreading out” representations and guaranteeing diversity in the learned features—even in simple models. By forcing dissimilar data points apart, contrastive objectives prevent collapse and enforce alignment with semantic structure, recovering canonical or permuted basis representations.
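A small numerical illustration of this point, using noisy views of random data and an InfoNCE loss as a stand-in for the general contrastive objective: a collapsed encoder minimizes a purely attractive (alignment-only) loss, but the repulsive denominator of the contrastive loss heavily penalizes it. All sampling choices below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
B, d = 64, 16
z = rng.standard_normal((B, d))
x1 = z + 0.1 * rng.standard_normal((B, d))     # two noisy "views" of the same samples
x2 = z + 0.1 * rng.standard_normal((B, d))

def embed(x, collapse=False):
    h = np.ones_like(x) if collapse else x      # a collapsed encoder maps every input to one point
    return h / np.linalg.norm(h, axis=1, keepdims=True)

def alignment_only(h1, h2):
    """Purely attractive objective: pull positive pairs together, no repulsion."""
    return -np.mean(np.sum(h1 * h2, axis=1))

def info_nce(h1, h2, tau=0.1):
    """Contrastive objective: the log-sum-exp denominator repels in-batch negatives."""
    logits = h1 @ h2.T / tau
    logits -= logits.max(axis=1, keepdims=True)                       # numerical stability
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(logp))

for name, collapse in [("identity encoder ", False), ("collapsed encoder", True)]:
    h1, h2 = embed(x1, collapse), embed(x2, collapse)
    print(f"{name} alignment-only={alignment_only(h1, h2):+.3f}  InfoNCE={info_nce(h1, h2):.3f}")
```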
Architectural consequences include:
- The introduction of predictor modules, row/column normalization, or target networks (e.g., the devices used in BYOL and SimSiam) in non-contrastive SSL is a response to the suboptimal landscape of NCL objectives (see the sketch after this list).
- Warm starts and normalization control the explosion of undesirable global minima by shrinking off-diagonal components during learning.
- Successful state-of-the-art self-supervised algorithms rely critically on these devices to avoid “bad” minima.
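As a concrete illustration of the predictor-plus-stop-gradient device mentioned in the first bullet, the sketch below wires up a SimSiam-style head; the single linear encoder and the layer sizes are illustrative stand-ins, not the configuration studied by Pokle et al. (2022).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NonContrastiveHead(nn.Module):
    """Minimal SimSiam-style arrangement: an online encoder, a small predictor,
    and a stop-gradient on the target branch."""
    def __init__(self, in_dim=128, hidden=64):
        super().__init__()
        self.encoder = nn.Linear(in_dim, hidden)          # stand-in for a deep encoder
        self.predictor = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, hidden)
        )

    def forward(self, x1, x2):
        z1, z2 = self.encoder(x1), self.encoder(x2)
        p1, p2 = self.predictor(z1), self.predictor(z2)

        def neg_cos(p, z):
            # Negative cosine similarity with stop-gradient on the target embedding.
            return -F.cosine_similarity(p, z.detach(), dim=-1).mean()

        return 0.5 * (neg_cos(p1, z2) + neg_cos(p2, z1))

# Toy usage with two random "views" of a batch.
model = NonContrastiveHead()
x1, x2 = torch.randn(32, 128), torch.randn(32, 128)
loss = model(x1, x2)
loss.backward()
print("non-contrastive loss:", loss.item())
```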
5. Theoretical and Practical Implications for Future SSL Methods
This analysis provides a comprehensive rationale for the widespread success of contrastive approaches like CLIP, and equally, a cautionary perspective on non-contrastive algorithms absent strong architectural safeguards. The sharp contrast in the geometry of global minima between CL and NCL losses means:
- Contrastive losses embed a mechanism to avoid collapse and enforce semantically meaningful diversity in representations, ensuring that only a narrowly defined set of weight configurations minimizes the loss.
- Non-contrastive methods, by contrast, must employ secondary “guidance” strategies to prevent convergence to suboptimal (though non-collapsed) degenerate minima.
- Empirically, methods such as SimSiam or BYOL benefit precisely because they implement normalization, momentum targets, or predictor heads that mimic the repulsive effect of contrastive negatives.
Key representative expressions include:
- Contrastive loss minimizer: $\arg\min_{W} \mathcal{L}_{\mathrm{CL}}(W) = \{\, P : P \text{ is a permutation matrix} \,\}$ (noiseless, identity-dictionary case).
- Linear NCL update: $W^{(o)}_t = C_{1,t}\, W^{(o)}_0 + C_{2,t}\,\bigl(W^{(o)}_0 + W^{(t)}_0\bigr)$.
6. Summary Table: Contrastive vs Non-Contrastive SSL Properties
| Aspect | Contrastive Loss (CL) | Non-Contrastive Loss (NCL) |
|---|---|---|
| Global minima | Permutation matrices (sparse coding) | Large set, including many non-collapsed “bad” minima |
| Repulsion mechanism | Yes, via negative samples | Absent (unless predictor/normalization added) |
| Vulnerability to bad minima | No | Yes, unless carefully controlled |
| Training from scratch | Recovers latent structure | Often fails (without warm start or architectural tricks) |
| Architectural requirements | No specific constraints | Needs predictor, normalization, etc. |
Contrastive learning, as used in CLIP and analogous frameworks, exhibits a superior optimization landscape and intrinsic mechanisms that sharply constrain the solution set to only semantically meaningful, diverse representations. In contrast, non-contrastive frameworks require targeted architectural interventions to escape an overabundance of poor minima. Negative sampling and other forms of intrinsic repulsion are therefore foundational to the success and reliability of contrastive models for self-supervised representation learning (Pokle et al., 2022).