
Probabilistic Pairwise Contrastive Loss

Updated 29 November 2025
  • PPCL is a probabilistic contrastive learning approach that models embeddings as von Mises–Fisher distributions to capture uncertainty and class imbalance.
  • It employs online moment estimation for efficient parameter fitting and computes closed-form similarity expectations over infinitely many sample pairs.
  • The method dynamically weights samples via concentration parameters, boosting performance in both supervised and self-supervised settings while addressing long-tailed data.

Probabilistic Pairwise Contrastive Loss (PPCL) is a general term for a family of contrastive learning objectives that incorporate probabilistic modeling, especially via the von Mises–Fisher (vMF) distribution, into the calculation of sample similarities and loss values. PPCL approaches recast traditional contrastive frameworks, where embedding similarity is computed with inner products, as probabilistic ones in which embeddings are treated as random variables (typically parameterized by a mean direction and a concentration) and similarity scores, losses, and classification decisions are informed by distributional properties. These objectives address several key limitations of deterministic contrastive learning, such as the inability to robustly handle uncertainty, class imbalance, and restrictive data regimes.

1. Probabilistic Modeling with von Mises–Fisher Distributions

PPCL builds on the representation of neural embeddings as normalized vectors on the unit hypersphere, $z \in \mathbb{R}^d$ with $\|z\|_2 = 1$. For each class $y \in \{1, \ldots, K\}$, the conditional feature distribution is modeled as a vMF:

$$P(z \mid y) = C_d(\kappa_y)^{-1} \exp(\kappa_y \mu_y^\top z)$$

where $\mu_y \in \mathbb{R}^d$ is the class mean direction ($\|\mu_y\|_2 = 1$), $\kappa_y \geq 0$ is the concentration, and $C_d(\kappa)$ is the normalizer $(2\pi)^{d/2} I_{d/2-1}(\kappa) / \kappa^{d/2-1}$. The overall mixture distribution is $P(z) = \sum_{y=1}^K \pi_y P(z \mid y)$, with $\pi_y$ the estimated class prior, typically the fraction of training samples in class $y$ (Du et al., 11 Mar 2024, Li et al., 2021).
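
A minimal sketch of this class-conditional log-density, assuming NumPy/SciPy; the exponentially scaled Bessel function `scipy.special.ive` is used to keep $\log I_{d/2-1}(\kappa)$ stable for large $\kappa$, and the function names are illustrative rather than taken from the papers:

```python
import numpy as np
from scipy.special import ive  # exponentially scaled modified Bessel function: I_v(x) * exp(-x)


def log_norm_const(kappa, d):
    """log C_d(kappa) with C_d(kappa) = (2*pi)^{d/2} I_{d/2-1}(kappa) / kappa^{d/2-1}."""
    v = d / 2.0 - 1.0
    log_bessel = np.log(ive(v, kappa)) + kappa     # log I_v(kappa), overflow-safe
    return (d / 2.0) * np.log(2.0 * np.pi) + log_bessel - v * np.log(kappa)


def vmf_log_pdf(z, mu, kappa):
    """log P(z | y) = kappa * mu^T z - log C_d(kappa) for unit-norm vectors z and mu."""
    d = z.shape[0]
    return kappa * np.dot(mu, z) - log_norm_const(kappa, d)
```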

2. Moment Estimation and Parameter Fitting

PPCL methods estimate vMF parameters using online first-order moments over minibatches. For each class:

  • Maintain a class count $n_y$ and a running mean $r_y$.
  • Batch update: $r_y \leftarrow [\, n_y r_y + m_y^{(t)} r'_y \,] / (n_y + m_y^{(t)})$, then $n_y \leftarrow n_y + m_y^{(t)}$, where $m_y^{(t)}$ is the number of class-$y$ samples in batch $t$ and $r'_y$ is their mean embedding.
  • Compute $R_y = \|r_y\|_2$ and $\hat\mu_y = r_y / R_y$, and fit $\kappa_y$ by solving $A_d(\kappa_y) = R_y$, with $A_d(\kappa) = I_{d/2}(\kappa) / I_{d/2-1}(\kappa)$.

An efficient approximation (Sra, 2012) is $\hat\kappa_y \approx R_y (d - R_y^2) / (1 - R_y^2)$ (Du et al., 11 Mar 2024).
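
A sketch of the per-class running-moment update and the approximate concentration fit described above, under the assumption that `batch_feats` holds the L2-normalized embeddings of one class in the current minibatch (function names are illustrative):

```python
import numpy as np


def update_class_moments(n_y, r_y, batch_feats):
    """Online update of the class count n_y and running mean r_y.

    batch_feats: (m, d) array of L2-normalized embeddings of a single class in the batch.
    """
    m = batch_feats.shape[0]
    r_batch = batch_feats.mean(axis=0)               # r'_y: batch mean of class-y features
    r_y = (n_y * r_y + m * r_batch) / (n_y + m)      # uses the pre-update count n_y
    n_y = n_y + m
    return n_y, r_y


def fit_vmf_params(r_y):
    """Recover (mu_y, kappa_y) from the running mean via the Sra (2012) approximation."""
    d = r_y.shape[0]
    R = np.linalg.norm(r_y)
    mu = r_y / R
    kappa = R * (d - R**2) / (1.0 - R**2)            # approximate solution of A_d(kappa) = R
    return mu, kappa
```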

3. Contrastive Loss under the Probabilistic Framework

PPCL replaces finite sample pairings and inner-product-based similarity with analytic expectations over the vMF mixture. For an anchor $z$ of class $y$:

$$L_{\text{PPCL}}(z, y) = -\log\big[ \pi_y\, C_d(\tilde\kappa_y) / C_d(\kappa_y) \big] + \log\Big[ \sum_{j=1}^K \pi_j\, C_d(\tilde\kappa_j) / C_d(\kappa_j) \Big]$$

where $\tilde\kappa_j = \|\kappa_j \mu_j + z/\tau\|_2$ and $\tau$ is the temperature hyperparameter. All relevant terms (mean directions, concentrations, and priors) are computed globally, enabling a closed-form expectation over infinitely many possible sample pairs from each class (Du et al., 11 Mar 2024, Li et al., 2021). Alternate pairwise formulations reduce to the same closed form in the infinite-negative-sampling limit.
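
A minimal sketch of this closed-form loss for a single anchor, assuming per-class arrays `mu` (K × d), `kappa` (K,), and priors `pi` (K,) obtained as in Section 2; the names and array conventions are illustrative:

```python
import numpy as np
from scipy.special import ive


def log_norm_const(kappa, d):
    """log C_d(kappa) = (d/2) log(2*pi) + log I_{d/2-1}(kappa) - (d/2 - 1) log kappa."""
    v = d / 2.0 - 1.0
    return (d / 2.0) * np.log(2.0 * np.pi) + np.log(ive(v, kappa)) + kappa - v * np.log(kappa)


def ppcl_loss(z, y, mu, kappa, pi, tau):
    """Closed-form PPCL loss for one unit-norm anchor z with label y."""
    d = z.shape[0]
    kappa_tilde = np.linalg.norm(kappa[:, None] * mu + z[None, :] / tau, axis=1)   # (K,)
    # log[ pi_j * C_d(kappa_tilde_j) / C_d(kappa_j) ] for every class j
    log_terms = np.log(pi) + log_norm_const(kappa_tilde, d) - log_norm_const(kappa, d)
    # negative positive-class term plus log-sum-exp over all classes
    return -log_terms[y] + np.logaddexp.reduce(log_terms)
```

In practice the loss is averaged over the anchors in a batch, while $\mu_j$, $\kappa_j$, and $\pi_j$ come from the global running estimates rather than from the batch itself.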

4. PPCL in Self-Supervised and Supervised Contexts

PPCL is applied in both supervised and self-supervised setups. In self-supervised learning, each data point is modeled by a vMF on a sphere of radius $r = \sqrt{1/\tau}$, with mean direction $\mu(x)$ and concentration $\kappa(x)$. The similarity between two probabilistic embeddings is quantified by the mutual likelihood score:

$$s(x_i, x_j) = \log\left( \frac{\mathcal{C}_d(\kappa_i)\, \mathcal{C}_d(\kappa_j)}{\mathcal{C}_d(\tilde\kappa)} \right) - d \log r$$

with $\mathcal{C}_d(\kappa)$ as above and $\tilde\kappa = \|\kappa_i \mu_i + \kappa_j \mu_j\|_2$. The final loss integrates these scores into the standard InfoNCE contrastive structure, providing adaptive weighting and natural robustness to uncertainty (Li et al., 2021).
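
A sketch of this mutual likelihood score using the same Bessel-based log-normalizer as before; the function and variable names are illustrative:

```python
import numpy as np
from scipy.special import ive


def log_norm_const(kappa, d):
    """log C_d(kappa) = (d/2) log(2*pi) + log I_{d/2-1}(kappa) - (d/2 - 1) log kappa."""
    v = d / 2.0 - 1.0
    return (d / 2.0) * np.log(2.0 * np.pi) + np.log(ive(v, kappa)) + kappa - v * np.log(kappa)


def mutual_likelihood_score(mu_i, kappa_i, mu_j, kappa_j, tau):
    """s(x_i, x_j) = log[ C_d(k_i) C_d(k_j) / C_d(k_tilde) ] - d * log(r), with r = sqrt(1/tau)."""
    d = mu_i.shape[0]
    kappa_tilde = np.linalg.norm(kappa_i * mu_i + kappa_j * mu_j)
    r = np.sqrt(1.0 / tau)
    return (log_norm_const(kappa_i, d) + log_norm_const(kappa_j, d)
            - log_norm_const(kappa_tilde, d) - d * np.log(r))
```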

5. Advantages in Long-Tailed and Imbalanced Data Regimes

PPCL achieves robust performance on long-tailed visual recognition, where minority classes suffer from severe pair-sampling constraints. Unlike standard supervised contrastive learning (SupCon), which relies on sufficiently large and diverse batches, PPCL leverages global feature statistics and infinite-sample expectations:

  • Every class contributes contrastive pairs in expectation, regardless of batch composition.
  • Balanced gradients for head and tail classes are restored at the loss level.
  • The numerator and denominator in the closed-form PPCL implement a logit-adjustment-style margin that corrects class prior bias (Du et al., 11 Mar 2024).

These mechanisms make PPCL especially effective for class-imbalanced data, where standard SupCon exhibits poor gradients for minority classes.

6. Uncertainty Quantification via Concentration Parameters

A central feature of PPCL is the embedding of confidence/uncertainty directly into the loss via the concentration parameter $\kappa$. Large $\kappa$ values indicate high confidence (tight clustering), modulating each pairwise similarity and the corresponding gradient magnitude:

  • Pairs with large, aligned $\kappa$ yield stronger similarity signals.
  • High-confidence disagreement is strongly penalized.
  • Mixed-confidence pairs see moderate weighting, which aligns with “human-like” uncertainty handling.

This confidence-driven weighting mechanism dynamically focuses optimization on reliable examples and mitigates the effect of noisy samples (Li et al., 2021).

7. Practical Considerations and Extensions

PPCL objectives come with several practical recommendations:

  • Tuning the temperature $\tau$ and embedding dimension $d$ adjusts the sensitivity between the inner-product and dispersion terms; typical values of $\tau$ lie in $[0.05, 0.2]$.
  • Reliable computation of Bessel functions is necessary for normalization constants.
  • Delaying the PPCL term during an initial warm-up phase allows feature means to stabilize.
  • Loss weighting between PPCL and classification objectives (e.g., logit-adjusted cross-entropy) is straightforward, with $\alpha \approx 1$ working well empirically; a minimal sketch combining this weighting with the warm-up phase follows this list.
  • PPCL is naturally extendable to semi-supervised learning via pseudo-label updates to mixture parameters (Du et al., 11 Mar 2024).
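
As a minimal illustration of the warm-up and loss-weighting recommendations above (the epoch threshold and the default $\alpha$ are illustrative choices, not values prescribed by the cited papers):

```python
def combined_loss(epoch, ce_loss, ppcl_loss_value, alpha=1.0, warmup_epochs=5):
    """Add the PPCL term to a (logit-adjusted) cross-entropy term after a warm-up phase."""
    if epoch < warmup_epochs:
        return ce_loss                           # let the running feature means stabilize first
    return ce_loss + alpha * ppcl_loss_value     # alpha ~ 1 reported to work well empirically
```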

PPCL embodies an analytic and distributional approach to contrastive learning, achieving closed-form, infinite-sampling objectives that address uncertainty and data imbalance without the batch size and pairing limitations of vanilla contrastive losses such as InfoNCE and SupCon.
