A Mathematical Perspective On Contrastive Learning (2505.24134v1)

Published 30 May 2025 in stat.ML, cs.CV, and cs.LG

Abstract: Multimodal contrastive learning is a methodology for linking different data modalities; the canonical example is linking image and text data. The methodology is typically framed as the identification of a set of encoders, one for each modality, that align representations within a common latent space. In this work, we focus on the bimodal setting and interpret contrastive learning as the optimization of (parameterized) encoders that define conditional probability distributions, for each modality conditioned on the other, consistent with the available data. This provides a framework for multimodal algorithms such as crossmodal retrieval, which identifies the mode of one of these conditional distributions, and crossmodal classification, which is similar to retrieval but includes a fine-tuning step to make it task specific. The framework we adopt also gives rise to crossmodal generative models. This probabilistic perspective suggests two natural generalizations of contrastive learning: the introduction of novel probabilistic loss functions, and the use of alternative metrics for measuring alignment in the common latent space. We study these generalizations of the classical approach in the multivariate Gaussian setting. In this context we view the latent space identification as a low-rank matrix approximation problem. This allows us to characterize the capabilities of loss functions and alignment metrics to approximate natural statistics, such as conditional means and covariances; doing so yields novel variants on contrastive learning algorithms for specific mode-seeking and for generative tasks. The framework we introduce is also studied through numerical experiments on multivariate Gaussians, the labeled MNIST dataset, and on a data assimilation application arising in oceanography.

Summary

  • The paper presents a probabilistic framework that generalizes contrastive learning by tilting the product of marginal distributions and minimizing divergences between true and learned conditionals.
  • It compares cosine similarity and L2-distance tilting methods, showing how different loss functions affect the matching of conditional means and covariances in Gaussian models.
  • The study demonstrates practical applications in crossmodal retrieval, classification, and Lagrangian data assimilation, validated by both theoretical analysis and numerical experiments.

This paper, "A Mathematical Perspective On Contrastive Learning" (2505.24134), provides a mathematical framework for understanding bimodal contrastive learning, interpreting it as a method for learning a joint probability distribution over two modalities by tilting the product of their marginal distributions. This probabilistic perspective allows the authors to generalize existing contrastive learning methods and analyze their properties, particularly in the tractable setting of multivariate Gaussian distributions. The paper emphasizes practical implications, analyzing how different formulations impact downstream tasks like retrieval and classification.

The core setup involves two data modalities, $u$ and $v$, drawn in pairs from a joint distribution $\mu(u,v)$. Contrastive learning aims to find encoders $g_u$ and $g_v$ that map $u$ and $v$ into a common latent space $\mathbb{R}^{n_e}$, typically with $n_e$ much smaller than the original data dimensions. Standard approaches, like CLIP, normalize the encoder outputs to the unit sphere, using the cosine similarity $\langle \mathcal{E}_u(u), \mathcal{E}_v(v) \rangle$ as an alignment metric. The training objective, often a form of InfoNCE or cross-entropy loss, encourages high similarity for paired data and low similarity for unpaired data (negative samples from shuffled batches). The paper shows that the population limit of the standard contrastive loss minimizes the sum of KL divergences between the true conditional distributions $\mu_{u|v}, \mu_{v|u}$ and the learned conditional distributions $\nu_{u|v}, \nu_{v|u}$ derived from a parameterized joint distribution $\nu$ defined by an exponential tilting of the product of marginals $\mu_u \otimes \mu_v$.
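To make the standard objective concrete, here is a minimal sketch of a symmetric CLIP-style batch loss, assuming PyTorch, normalized encoder outputs, and a placeholder temperature `tau`; it is a generic illustration of the loss described above, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def clip_style_loss(z_u, z_v, tau=0.07):
    """Symmetric contrastive (InfoNCE-style) loss for a batch of paired embeddings.

    z_u, z_v: (batch, n_e) encoder outputs for the two modalities.
    Diagonal entries of the similarity matrix are the positive pairs;
    off-diagonal entries act as negatives drawn from the shuffled batch.
    """
    z_u = F.normalize(z_u, dim=-1)            # project onto the unit sphere, as in CLIP
    z_v = F.normalize(z_v, dim=-1)
    logits = z_u @ z_v.T / tau                # cosine similarities scaled by temperature
    targets = torch.arange(z_u.shape[0], device=z_u.device)  # i-th u pairs with i-th v
    loss_u_given_v = F.cross_entropy(logits.T, targets)      # "which u goes with this v?"
    loss_v_given_u = F.cross_entropy(logits, targets)        # "which v goes with this u?"
    return 0.5 * (loss_u_given_v + loss_v_given_u)
```

In the population limit this symmetric batch objective corresponds to the sum of the two conditional KL divergences described above.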

The probabilistic framework leads to two main classes of generalizations:

  1. Generalized Probabilistic Loss Functions: Instead of minimizing the sum of KL divergences between conditionals, one could:
    • Minimize a weighted sum of divergences for the conditionals, $\lambda_u D(\mu_{u|v}\,\|\,\nu_{u|v}) + \lambda_v D(\mu_{v|u}\,\|\,\nu_{v|u})$. Setting $\lambda_v = 0$ or $\lambda_u = 0$ focuses the learning on matching only one conditional, which is relevant for asymmetric tasks like classification.
    • Minimize the divergence between the true joint distribution $\mu$ and the learned joint distribution $\nu$, $D(\mu\,\|\,\nu)$. The paper shows that for the KL divergence this leads to an objective that is computationally advantageous, since it only requires one batch from the joint and one batch from the product of marginals, compared to the per-sample negative batches needed for the conditional loss. The KL joint loss is shown to provide an upper bound for the KL conditional loss.
    • Use alternative divergences or metrics, such as Maximum Mean Discrepancy (MMD), which are shown to be actionable with empirical data.
  2. Generalized Tilting: The learned joint distribution $\nu$ is defined by a density $\rho(u,v;\theta)$ relative to $\mu_u \otimes \mu_v$. The standard approach uses $\rho(u,v;\theta) \propto \exp(\langle \mathcal{E}_u(u), \mathcal{E}_v(v) \rangle / \tau)$. Generalizations can involve different functional forms for $\rho$, for instance (both options are sketched in code after this list):
    • Using unnormalized encoders $g_u, g_v$ in the exponential tilting: $\rho(u,v;\theta) \propto \exp(\langle g_u(u), g_v(v) \rangle / \tau)$.
    • Using the $L^2$ distance between latent vectors in the exponential tilting: $\rho(u,v;\theta) \propto \exp\!\big(-\tfrac{1}{2\tau}|g_u(u) - g_v(v)|^2\big)$.
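As a rough illustration of the two generalized tiltings above, the following sketch contrasts the unnormalized inner-product score with the L2-distance score; the function names and the use of raw (unnormalized) encoder outputs are illustrative choices, not code from the paper.

```python
import torch

def inner_product_score(z_u, z_v, tau=1.0):
    """Log-tilting log rho proportional to <g_u(u), g_v(v)> / tau, with unnormalized encoders."""
    return (z_u * z_v).sum(dim=-1) / tau

def l2_distance_score(z_u, z_v, tau=1.0):
    """Log-tilting log rho proportional to -|g_u(u) - g_v(v)|^2 / (2 tau)."""
    return -0.5 * ((z_u - z_v) ** 2).sum(dim=-1) / tau
```

Either score can be used in place of the cosine-similarity logits in the batch loss sketched earlier; as summarized below, the L2-distance tilting is the variant that permits conditional covariances to be matched in the Gaussian analysis.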

The paper analyzes these generalizations in detail for the case where $\mu$ is a multivariate Gaussian distribution and the encoders are linear functions ($g_u(u) = Gu$, $g_v(v) = Hv$); the Gaussian conditional statistics that these models aim to match are sketched in code after the list below.

  • Cosine Distance + Conditional Loss (Standard CLIP): Using the original exponential tilting with linear encoders, the paper shows that minimizing the conditional loss results in learning a matrix $A = G^\top H$ that matches the conditional means of the Gaussian distribution (e.g., $\mathbb{E}[u|v]$ for $\mu$) but not the conditional covariances. The learned conditional covariances are fixed to the marginal covariances of $\mu_u, \mu_v$, which are generally larger than the true conditional covariances unless $u$ and $v$ are independent. When restricted to low-rank matrices (corresponding to $n_e < \min(n_u, n_v)$), the solution is related to a low-rank approximation of a matrix derived from the covariances.
  • Positive Quadratic Form + Conditional Loss: Using the L2-distance tilting, $\rho(u,v;\theta) \propto \exp\!\big(-\tfrac{1}{2}|Gu - Hv|^2\big)$, allows the model to learn matrices $A = G^\top H$ and $B = G^\top G$ (and $C = H^\top H$). Minimizing a one-sided conditional loss (e.g., matching only $\mu_{u|v}$) in this setting allows the model to exactly match both the conditional mean and covariance of that specific conditional distribution, provided $n_e$ is sufficiently large. The optimization for the rank-constrained case is also derived.
  • Cosine Distance + Joint Loss: Using the original exponential tilting but minimizing the joint loss $D(\mu\,\|\,\nu)$, the paper shows that the optimal matrix $A = G^\top H$ is obtained by applying a singular value shrinkage function to the singular values of the matrix optimized by the conditional loss. This formulation results in a learned joint distribution whose marginal distributions are closer to the true marginal distributions of $\mu$ than the distribution learned by minimizing the conditional loss.
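For reference, the conditional statistics that these linear-encoder models attempt to match are the standard Gaussian formulas; the NumPy sketch below is a generic illustration under that assumption, not code from the paper.

```python
import numpy as np

def gaussian_conditional(m_u, m_v, S_uu, S_uv, S_vv, v):
    """Conditional mean and covariance of u | v for a jointly Gaussian pair (u, v).

    E[u | v]   = m_u + S_uv S_vv^{-1} (v - m_v)
    Cov[u | v] = S_uu - S_uv S_vv^{-1} S_uv^T
    """
    K = S_uv @ np.linalg.inv(S_vv)       # gain matrix mapping v-deviations to u-deviations
    cond_mean = m_u + K @ (v - m_v)
    cond_cov = S_uu - K @ S_uv.T
    return cond_mean, cond_cov
```

As summarized above, the cosine-tilting conditional loss recovers the action of the gain matrix (the conditional mean) up to the rank constraint $n_e$, whereas the conditional covariance $S_{uu} - S_{uv} S_{vv}^{-1} S_{uv}^\top$ is only matched when the L2-distance tilting is combined with a one-sided conditional loss.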

The practical applications discussed include:

  • Crossmodal Retrieval: Given an instance of one modality (e.g., a text prompt $v$), find the most similar instances of the other modality (e.g., images $u$) in a dataset. This is framed as finding the mode of the empirical conditional distribution $\nu_{u|v}^N$, which corresponds to maximizing the cosine similarity $\langle \mathcal{E}_u(u^i), \mathcal{E}_v(v) \rangle$ over the dataset images $u^i$ (a small sketch of this mode-seeking step follows the list).
  • Crossmodal Classification: Given an instance of one modality (e.g., an image $u$), assign it a label from a predefined set (e.g., text labels $v^i$). This is framed as finding the mode of the empirical conditional distribution $\nu_{v|u}^K$ over the label set, maximizing $\langle \mathcal{E}_u(u), \mathcal{E}_v(v^i) \rangle$. The paper shows how standard image classification networks (such as LeNet on MNIST) can be interpreted within this framework using unnormalized image encoders and one-hot encoded labels with a one-sided conditional loss. The framework also supports fine-tuning to adapt pretrained models to specific classification tasks.
  • Lagrangian Data Assimilation: This is presented as a novel application in science and engineering. The task is to recover an Eulerian velocity field (represented by coefficients of a potential, $u$) from Lagrangian trajectories of particles in the flow ($v$). The authors train a contrastive model with a transformer-based encoder for trajectories and a fixed encoder for potential coefficients. Experiments show that this purely data-driven approach successfully learns embeddings that enable accurate retrieval of potentials from trajectories and vice versa.
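As a rough illustration of retrieval and classification as mode-seeking over an empirical conditional, the snippet below assumes precomputed, normalized embeddings; the array and function names are placeholders rather than the paper's code.

```python
import torch

def retrieve_top_k(query_embedding, gallery_embeddings, k=5):
    """Return indices of the k gallery items whose embeddings best align with the query.

    query_embedding:    (n_e,)   normalized embedding of v (e.g., a text prompt)
    gallery_embeddings: (N, n_e) normalized embeddings of the u's (e.g., images)
    The top-1 index is the (approximate) mode of the empirical conditional nu_{u|v}^N.
    """
    sims = gallery_embeddings @ query_embedding   # cosine similarities with the query
    return torch.topk(sims, k).indices

def classify(image_embedding, label_embeddings):
    """Crossmodal classification: pick the label whose embedding best aligns with the image."""
    return int(torch.argmax(label_embeddings @ image_embedding))
```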

Numerical experiments on Gaussian data validate the theoretical findings regarding mean/covariance matching and the properties of the different loss functions. Experiments on MNIST demonstrate how the choice of loss function (one-sided vs. two-sided) trades off classification accuracy against the diversity of images sampled from the learned conditional distribution. The Lagrangian data assimilation experiment highlights the potential of applying contrastive learning methods to scientific problems involving disparate data modalities.

In summary, the paper provides a principled probabilistic foundation for contrastive learning, generalizes existing methods via novel loss functions and tiltings, offers analytical insights through Gaussian models, and demonstrates practical applicability to traditional AI tasks and novel scientific domains. The focus on the learned joint and conditional distributions provides a valuable perspective for understanding the capabilities and limitations of different contrastive learning formulations.
