- The paper presents a probabilistic framework that generalizes contrastive learning by tilting the product of marginal distributions and minimizing divergences between true and learned conditionals.
- It compares cosine similarity and L2-distance tilting methods, showing how different loss functions affect the matching of conditional means and covariances in Gaussian models.
- The study demonstrates practical applications in crossmodal retrieval, classification, and Lagrangian data assimilation, validated by both theoretical analysis and numerical experiments.
This paper, "A Mathematical Perspective On Contrastive Learning" (2505.24134), provides a mathematical framework for understanding bimodal contrastive learning, interpreting it as a method for learning a joint probability distribution over two modalities by tilting the product of their marginal distributions. This probabilistic perspective allows the authors to generalize existing contrastive learning methods and analyze their properties, particularly in the tractable setting of multivariate Gaussian distributions. The paper emphasizes practical implications, analyzing how different formulations impact downstream tasks like retrieval and classification.
The core setup involves two data modalities, u and v, drawn in pairs from a joint distribution μ(u,v). Contrastive learning aims to find encoders g_u and g_v that map u and v into a common latent space R^{n_e}, typically with n_e much smaller than the original data dimensions. Standard approaches, like CLIP, normalize the encoder outputs to the unit sphere, yielding normalized encoders E_u, E_v, and use the cosine similarity ⟨E_u(u), E_v(v)⟩ as an alignment metric. The training objective, often a form of InfoNCE or cross-entropy loss, encourages high similarity for paired data and low similarity for unpaired data (negative samples from shuffled batches). The paper shows that the population limit of the standard contrastive loss minimizes the sum of KL divergences between the true conditional distributions μ_{u|v}, μ_{v|u} and the learned conditional distributions ν_{u|v}, ν_{v|u} derived from a parameterized joint distribution ν defined by an exponential tilting of the product of marginals μ_u ⊗ μ_v.
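As a concrete reference point, here is a minimal sketch of this standard batch-level objective, assuming PyTorch and placeholder encoders that already produce embeddings zu, zv; the temperature value and batch handling are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def clip_loss(zu, zv, tau=0.07):
    """Symmetric InfoNCE / cross-entropy loss for a batch of paired embeddings
    zu, zv of shape (B, n_e), using in-batch shuffles as negative samples."""
    zu, zv = F.normalize(zu, dim=-1), F.normalize(zv, dim=-1)  # project to the unit sphere
    logits = zu @ zv.T / tau             # pairwise cosine similarities, scaled by 1/τ
    targets = torch.arange(zu.shape[0])  # the i-th u is paired with the i-th v
    # The row-wise term (softmax over v for each u) targets the v|u conditional,
    # the column-wise term the u|v conditional; in the population limit their sum
    # corresponds to KL(μ_{u|v} || ν_{u|v}) + KL(μ_{v|u} || ν_{v|u}), as stated above.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))
```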
The probabilistic framework leads to two main classes of generalizations:
- Generalized Probabilistic Loss Functions: Instead of minimizing the sum of KL divergences between conditionals, one could:
  - Minimize a weighted sum of divergences for the conditionals, λ_u D(μ_{u|v} || ν_{u|v}) + λ_v D(μ_{v|u} || ν_{v|u}). Setting λ_v = 0 or λ_u = 0 focuses the learning on matching only one conditional, which is relevant for asymmetric tasks like classification.
  - Minimize the divergence between the true joint distribution μ and the learned joint distribution ν, D(μ || ν). The paper shows that for the KL divergence this leads to an objective that is computationally advantageous, since it requires only one batch from the joint and one batch from the product of marginals, compared to the per-sample negative batches needed for the conditional loss (see the Monte Carlo sketch after this list). The KL joint loss is also shown to provide an upper bound for the KL conditional loss.
  - Use alternative divergences or metrics, like Maximum Mean Discrepancy (MMD), which are shown to be actionable with empirical data.
- Generalized Tilting: The learned joint distribution ν is defined by a density ρ(u,v;θ) relative to μ_u ⊗ μ_v. The standard approach uses ρ(u,v;θ) ∝ exp(⟨E_u(u), E_v(v)⟩/τ). Generalizations can involve different functional forms for ρ, for instance:
  - Using unnormalized encoders g_u, g_v in the exponential tilting: ρ(u,v;θ) ∝ exp(⟨g_u(u), g_v(v)⟩/τ).
  - Using the L2 distance between latent vectors in the exponential tilting: ρ(u,v;θ) ∝ exp(−|g_u(u) − g_v(v)|²/(2τ)).
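To make the joint-loss option above concrete, the following is a minimal sketch of a Monte Carlo estimator for the θ-dependent part of KL(μ || ν), usable with either of the two tiltings just listed. It assumes PyTorch and embeddings produced by unspecified encoders; the function names, temperature values, and the use of an in-batch shuffle as the sample from the product of marginals are illustrative assumptions, not the paper's implementation.

```python
import torch

def cosine_log_tilt(zu, zv, tau=0.07):
    # log ρ(u, v) = <E_u(u), E_v(v)> / τ with unit-normalized embeddings
    zu = torch.nn.functional.normalize(zu, dim=-1)
    zv = torch.nn.functional.normalize(zv, dim=-1)
    return (zu * zv).sum(-1) / tau

def l2_log_tilt(zu, zv, tau=1.0):
    # log ρ(u, v) = -|g_u(u) - g_v(v)|^2 / (2τ) with unnormalized embeddings
    return -((zu - zv) ** 2).sum(-1) / (2 * tau)

def joint_kl_loss(zu, zv, log_tilt=cosine_log_tilt):
    """Estimate of the θ-dependent part of KL(μ || ν), namely
    -E_μ[log ρ] + log E_{μ_u⊗μ_v}[ρ]: one paired batch for the first term,
    one shuffled (unpaired) batch for the second."""
    positive = log_tilt(zu, zv).mean()       # batch drawn from the joint μ
    perm = torch.randperm(zu.shape[0])       # break the pairing -> sample from μ_u ⊗ μ_v
    log_mean_rho = torch.logsumexp(log_tilt(zu, zv[perm]), dim=0) \
        - torch.log(torch.tensor(float(zu.shape[0])))
    return -positive + log_mean_rho

# Usage with hypothetical encoders enc_u, enc_v mapping raw modalities to R^{n_e}:
#   loss = joint_kl_loss(enc_u(u_batch), enc_v(v_batch), log_tilt=l2_log_tilt)
#   loss.backward()
```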
The paper analyzes these generalizations in detail for the case where μ is a multivariate Gaussian distribution and the encoders are linear functions (g_u(u) = Gu, g_v(v) = Hv).
- Cosine Distance + Conditional Loss (Standard CLIP): Using the original exponential tilting with linear encoders, the paper shows that minimizing the conditional loss results in learning a matrix A = GᵀH that matches the conditional means of the Gaussian distribution (e.g., E[u|v] for μ) but not the conditional covariances. The learned conditional covariances are fixed to the marginal covariances of μ_u, μ_v, which are generally larger than the true conditional covariances unless u and v are independent. When restricted to low-rank matrices (corresponding to n_e < min(n_u, n_v)), the solution is related to a low-rank approximation of a matrix built from the covariance blocks of μ.
- Positive Quadratic Form + Conditional Loss: Using the L2-distance tilting, ρ(u,v;θ) ∝ exp(−|Gu − Hv|²/2), allows the model to learn matrices A = GᵀH and B = GᵀG (and C = HᵀH). Minimizing a one-sided conditional loss (e.g., matching only μ_{u|v}) in this setting allows the model to exactly match both the conditional mean and covariance of that specific conditional distribution, provided n_e is sufficiently large. The optimization for the rank-constrained case is also derived.
- Cosine Distance + Joint Loss: Using the original exponential tilting but minimizing the joint loss D(μ || ν), the paper shows that the optimal matrix A = GᵀH is obtained by applying a singular-value shrinkage function to the singular values of the matrix optimized by the conditional loss. This formulation yields a learned joint distribution whose marginal distributions are closer to the true marginals of μ than those of the distribution learned by minimizing the conditional loss.
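The mean/covariance claims in the first two cases can be checked with a few lines of linear algebra. The NumPy sketch below does so using the standard formulas for exponentially tilting a Gaussian product of marginals; the dimensions and random covariance are arbitrary, and the closed-form choices of A and B are illustrative solutions consistent with the stated results rather than the paper's derivation.

```python
import numpy as np

rng = np.random.default_rng(0)
nu, nv = 4, 3
# Random positive-definite joint covariance for (u, v) ~ μ, with blocks Σ_uu, Σ_uv, Σ_vv.
M = rng.standard_normal((nu + nv, nu + nv))
Sigma = M @ M.T + (nu + nv) * np.eye(nu + nv)
Suu, Svv, Suv = Sigma[:nu, :nu], Sigma[nu:, nu:], Sigma[:nu, nu:]

# True conditional μ_{u|v}: mean map Σ_uv Σ_vv⁻¹ and covariance Σ_uu − Σ_uv Σ_vv⁻¹ Σ_vu.
mean_map = Suv @ np.linalg.inv(Svv)
cond_cov = Suu - mean_map @ Suv.T

# (1) Bilinear (cosine-style) tilting: ν ∝ exp(uᵀAv) · μ_u ⊗ μ_v. Completing the
# square gives ν_{u|v} = N(Σ_uu A v, Σ_uu): choosing A = Σ_uu⁻¹ Σ_uv Σ_vv⁻¹ matches
# the conditional mean, but the conditional covariance stays at the marginal Σ_uu.
A_cos = np.linalg.inv(Suu) @ mean_map
print("means match:      ", np.allclose(Suu @ A_cos, mean_map))  # True
print("covariances match:", np.allclose(Suu, cond_cov))          # False (unless Σ_uv = 0)

# (2) Quadratic (L2-distance) tilting: ν ∝ exp(−½|Gu − Hv|²) · μ_u ⊗ μ_v, so the
# model also learns B = GᵀG and ν_{u|v} = N((Σ_uu⁻¹ + B)⁻¹ A v, (Σ_uu⁻¹ + B)⁻¹).
B = np.linalg.inv(cond_cov) - np.linalg.inv(Suu)   # PSD choice matching the covariance
A_l2 = (np.linalg.inv(Suu) + B) @ mean_map         # then the mean is matched as well
learned_cov = np.linalg.inv(np.linalg.inv(Suu) + B)
print("covariances match:", np.allclose(learned_cov, cond_cov))        # True
print("means match:      ", np.allclose(learned_cov @ A_l2, mean_map)) # True
```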
The practical applications discussed include:
- Crossmodal Retrieval: Given an instance of one modality (e.g., a text prompt v), find the most similar instances of the other modality (e.g., images u) in a dataset of size N. This is framed as finding the mode of the empirical conditional distribution ν^N_{u|v}, which corresponds to maximizing the cosine similarity ⟨E_u(u_i), E_v(v)⟩ over the dataset images u_i (see the sketch after this list).
- Crossmodal Classification: Given an instance of one modality (e.g., an image u), assign it a label from a predefined set of K options (e.g., text labels v_i). This is framed as finding the mode of the empirical conditional distribution ν^K_{v|u} over the label set, maximizing ⟨E_u(u), E_v(v_i)⟩. The paper shows how standard image classification networks (like LeNet on MNIST) can be interpreted within this framework using unnormalized image encoders and one-hot encoded labels with a one-sided conditional loss. The framework also supports fine-tuning to adapt pretrained models to specific classification tasks.
- Lagrangian Data Assimilation: This is presented as a novel application in science and engineering. The task is to recover an Eulerian velocity field (represented by coefficients of a potential, u) from Lagrangian trajectories of particles in the flow (v). The authors train a contrastive model with a transformer-based encoder for trajectories and a fixed encoder for potential coefficients. Experiments show that this purely data-driven approach successfully learns embeddings that enable accurate retrieval of potentials from trajectories and vice-versa.
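Here is a minimal sketch of the retrieval and classification recipes above, assuming NumPy and placeholder encoders enc_u, enc_v that map raw inputs to embedding vectors; all names and dimensions are illustrative, not components of the paper's experiments.

```python
import numpy as np

def embed(x, enc):
    """Apply an encoder and normalize to the unit sphere (the E_u, E_v maps above)."""
    z = enc(x)
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

def retrieve(query_v, dataset_u, enc_u, enc_v, top_k=5):
    """Retrieval: given one v (e.g., a text prompt), rank the N dataset items u_i
    by cosine similarity <E_u(u_i), E_v(v)>, i.e. find the modes of ν^N_{u|v}."""
    zu = embed(dataset_u, enc_u)              # shape (N, n_e)
    zv = embed(query_v[None, :], enc_v)[0]    # shape (n_e,)
    scores = zu @ zv
    return np.argsort(-scores)[:top_k], scores

def classify(query_u, label_set_v, enc_u, enc_v):
    """Classification is the same argmax in the other direction: the mode of
    ν^K_{v|u} over the K candidate labels v_i."""
    zv = embed(label_set_v, enc_v)            # shape (K, n_e)
    zu = embed(query_u[None, :], enc_u)[0]    # shape (n_e,)
    return int(np.argmax(zv @ zu))

# Toy usage with identity "encoders" acting on pre-embedded vectors:
rng = np.random.default_rng(0)
identity = lambda x: x
images, prompt = rng.standard_normal((100, 8)), rng.standard_normal(8)
top_indices, _ = retrieve(prompt, images, identity, identity)
print(top_indices)
```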
Numerical experiments on Gaussian data validate the theoretical findings on mean/covariance matching and the properties of the different loss functions. Experiments on MNIST demonstrate how one-sided versus two-sided loss functions trade off classification accuracy against the diversity of images sampled from the learned conditional distribution. The Lagrangian data assimilation experiment highlights the potential of applying contrastive learning methods to scientific problems involving disparate data modalities.
In summary, the paper provides a principled probabilistic foundation for contrastive learning, generalizes existing methods via novel loss functions and tiltings, offers analytical insights through Gaussian models, and demonstrates practical applicability to traditional AI tasks and novel scientific domains. The focus on the learned joint and conditional distributions provides a valuable perspective for understanding the capabilities and limitations of different contrastive learning formulations.