
Probabilistic Pairwise Contrastive Loss

Updated 29 November 2025
  • PPCL is a probabilistic contrastive learning approach that models embeddings as von Mises–Fisher distributions to capture uncertainty and class imbalance.
  • It employs online moment estimation for efficient parameter fitting and computes closed-form similarity expectations over infinitely many sample pairs.
  • The method dynamically weights samples via concentration parameters, boosting performance in both supervised and self-supervised settings while addressing long-tailed data.

Probabilistic Pairwise Contrastive Loss (PPCL) is a general term for a family of contrastive learning objectives that incorporate probabilistic modeling, especially via the von Mises–Fisher (vMF) distribution, into the calculation of sample similarities and loss values. PPCL approaches recast traditional contrastive frameworks, where embedding similarity is computed with inner products, as probabilistic ones in which embeddings are treated as random variables (typically parameterized by a mean direction and a concentration) and similarity scores, losses, and classification decisions are informed by distributional properties. These objectives address several key limitations of deterministic contrastive learning, such as the inability to robustly handle uncertainty, class imbalance, and restrictive data regimes.

1. Probabilistic Modeling with von Mises–Fisher Distributions

PPCL builds on the representation of neural embeddings as normalized vectors on the unit hypersphere, $z \in \mathbb{R}^d$ with $\|z\|_2 = 1$. For each class $y \in \{1, \ldots, K\}$, the conditional feature distribution is modeled as a vMF:

$$P(z \mid y) = C_d(\kappa_y)^{-1} \exp(\kappa_y \mu_y^\top z)$$

where $\mu_y \in \mathbb{R}^d$ is the class mean direction ($\|\mu_y\|_2 = 1$), $\kappa_y \geq 0$ is the concentration, and $C_d(\kappa)$ is the normalizer $(2\pi)^{d/2} I_{d/2-1}(\kappa) / \kappa^{d/2-1}$. The overall mixture distribution is $P(z) = \sum_{y=1}^K \pi_y P(z \mid y)$, with $\pi_y$ the estimated class prior, typically the fraction of training samples in class $y$ (Du et al., 11 Mar 2024, Li et al., 2021).
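
A minimal sketch of this class-conditional log-density, assuming NumPy/SciPy; the exponentially scaled Bessel function `scipy.special.ive` is used to keep $\log I_{d/2-1}(\kappa)$ stable for large $\kappa$, and the function names are illustrative rather than taken from the papers:

```python
import numpy as np
from scipy.special import ive  # exponentially scaled modified Bessel function: I_v(x) * exp(-x)


def log_norm_const(kappa, d):
    """log C_d(kappa) with C_d(kappa) = (2*pi)^{d/2} I_{d/2-1}(kappa) / kappa^{d/2-1}."""
    v = d / 2.0 - 1.0
    log_bessel = np.log(ive(v, kappa)) + kappa     # log I_v(kappa), overflow-safe
    return (d / 2.0) * np.log(2.0 * np.pi) + log_bessel - v * np.log(kappa)


def vmf_log_pdf(z, mu, kappa):
    """log P(z | y) = kappa * mu^T z - log C_d(kappa) for unit-norm vectors z and mu."""
    d = z.shape[0]
    return kappa * np.dot(mu, z) - log_norm_const(kappa, d)
```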

2. Moment Estimation and Parameter Fitting

PPCL methods estimate vMF parameters using online first-order moments over minibatches. For each class:

  • Maintain a class count $n_y$ and a running mean $r_y$.
  • Batch update: $r_y \leftarrow [\, n_y r_y + m_y^{(t)} r'_y \,] / (n_y + m_y^{(t)})$, then $n_y \leftarrow n_y + m_y^{(t)}$, where $m_y^{(t)}$ is the number of class-$y$ samples in batch $t$ and $r'_y$ is their mean embedding.
  • Compute $R_y = \|r_y\|_2$ and $\hat\mu_y = r_y / R_y$, and fit $\kappa_y$ by solving $A_d(\kappa_y) = R_y$, with $A_d(\kappa) = I_{d/2}(\kappa) / I_{d/2-1}(\kappa)$.

An efficient approximation (Sra, 2012) is $\hat\kappa_y \approx R_y (d - R_y^2) / (1 - R_y^2)$ (Du et al., 11 Mar 2024).
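
A sketch of the per-class running-moment update and the approximate concentration fit described above, under the assumption that `batch_feats` holds the L2-normalized embeddings of one class in the current minibatch (function names are illustrative):

```python
import numpy as np


def update_class_moments(n_y, r_y, batch_feats):
    """Online update of the class count n_y and running mean r_y.

    batch_feats: (m, d) array of L2-normalized embeddings of a single class in the batch.
    """
    m = batch_feats.shape[0]
    r_batch = batch_feats.mean(axis=0)               # r'_y: batch mean of class-y features
    r_y = (n_y * r_y + m * r_batch) / (n_y + m)      # uses the pre-update count n_y
    n_y = n_y + m
    return n_y, r_y


def fit_vmf_params(r_y):
    """Recover (mu_y, kappa_y) from the running mean via the Sra (2012) approximation."""
    d = r_y.shape[0]
    R = np.linalg.norm(r_y)
    mu = r_y / R
    kappa = R * (d - R**2) / (1.0 - R**2)            # approximate solution of A_d(kappa) = R
    return mu, kappa
```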

3. Contrastive Loss under the Probabilistic Framework

PPCL replaces finite sample pairings and inner-product-based similarity with analytic expectations over the vMF mixture. For an anchor $z$ of class $y$:

$$L_{\text{PPCL}}(z, y) = -\log\big[ \pi_y\, C_d(\tilde\kappa_y) / C_d(\kappa_y) \big] + \log\Big[ \sum_{j=1}^K \pi_j\, C_d(\tilde\kappa_j) / C_d(\kappa_j) \Big]$$

where $\tilde\kappa_j = \|\kappa_j \mu_j + z/\tau\|_2$ and $\tau$ is the temperature hyperparameter. All relevant terms (mean directions, concentrations, and priors) are computed globally, enabling a closed-form expectation over infinitely many possible sample pairs from each class (Du et al., 11 Mar 2024, Li et al., 2021). Alternate pairwise formulations reduce to the same closed form in the infinite-negative-sampling limit.
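
A minimal sketch of this closed-form loss for a single anchor, assuming per-class arrays `mu` (K × d), `kappa` (K,), and priors `pi` (K,) obtained as in Section 2; the names and array conventions are illustrative:

```python
import numpy as np
from scipy.special import ive


def log_norm_const(kappa, d):
    """log C_d(kappa) = (d/2) log(2*pi) + log I_{d/2-1}(kappa) - (d/2 - 1) log kappa."""
    v = d / 2.0 - 1.0
    return (d / 2.0) * np.log(2.0 * np.pi) + np.log(ive(v, kappa)) + kappa - v * np.log(kappa)


def ppcl_loss(z, y, mu, kappa, pi, tau):
    """Closed-form PPCL loss for one unit-norm anchor z with label y."""
    d = z.shape[0]
    kappa_tilde = np.linalg.norm(kappa[:, None] * mu + z[None, :] / tau, axis=1)   # (K,)
    # log[ pi_j * C_d(kappa_tilde_j) / C_d(kappa_j) ] for every class j
    log_terms = np.log(pi) + log_norm_const(kappa_tilde, d) - log_norm_const(kappa, d)
    # negative positive-class term plus log-sum-exp over all classes
    return -log_terms[y] + np.logaddexp.reduce(log_terms)
```

In practice the loss is averaged over the anchors in a batch, while $\mu_j$, $\kappa_j$, and $\pi_j$ come from the global running estimates rather than from the batch itself.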

4. PPCL in Self-Supervised and Supervised Contexts

PPCL is applied in both supervised and self-supervised setups. In self-supervised learning, each data point is modeled by a vMF on a sphere of radius $r = \sqrt{1/\tau}$, with mean direction $\mu(x)$ and concentration $\kappa(x)$. The similarity between two probabilistic embeddings is quantified by the mutual likelihood score:

$$s(x_i, x_j) = \log\left( \frac{\mathcal{C}_d(\kappa_i)\, \mathcal{C}_d(\kappa_j)}{\mathcal{C}_d(\tilde\kappa)} \right) - d \log r$$

with $\mathcal{C}_d(\kappa)$ as above and $\tilde\kappa = \|\kappa_i \mu_i + \kappa_j \mu_j\|_2$. The final loss integrates these scores into the standard InfoNCE contrastive structure, providing adaptive weighting and natural robustness to uncertainty (Li et al., 2021).
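
A sketch of this mutual likelihood score using the same Bessel-based log-normalizer as before; the function and variable names are illustrative:

```python
import numpy as np
from scipy.special import ive


def log_norm_const(kappa, d):
    """log C_d(kappa) = (d/2) log(2*pi) + log I_{d/2-1}(kappa) - (d/2 - 1) log kappa."""
    v = d / 2.0 - 1.0
    return (d / 2.0) * np.log(2.0 * np.pi) + np.log(ive(v, kappa)) + kappa - v * np.log(kappa)


def mutual_likelihood_score(mu_i, kappa_i, mu_j, kappa_j, tau):
    """s(x_i, x_j) = log[ C_d(k_i) C_d(k_j) / C_d(k_tilde) ] - d * log(r), with r = sqrt(1/tau)."""
    d = mu_i.shape[0]
    kappa_tilde = np.linalg.norm(kappa_i * mu_i + kappa_j * mu_j)
    r = np.sqrt(1.0 / tau)
    return (log_norm_const(kappa_i, d) + log_norm_const(kappa_j, d)
            - log_norm_const(kappa_tilde, d) - d * np.log(r))
```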

5. Advantages in Long-Tailed and Imbalanced Data Regimes

PPCL achieves robust performance on long-tailed visual recognition, where minority classes suffer from severe pair-sampling constraints. Unlike standard supervised contrastive learning (SupCon), which relies on sufficiently large and diverse batches, PPCL leverages global feature statistics and infinite-sample expectations:

  • Every class contributes contrastive pairs in expectation, regardless of batch composition.
  • Balanced gradients for head and tail classes are restored at the loss level.
  • The numerator and denominator in the closed-form PPCL implement a logit-adjustment-style margin that corrects class prior bias (Du et al., 11 Mar 2024).

These mechanisms make PPCL especially effective for class-imbalanced data, where standard SupCon exhibits poor gradients for minority classes.

6. Uncertainty Quantification via Concentration Parameters

A central feature of PPCL is the embedding of confidence/uncertainty directly into the loss via the concentration parameter $\kappa$. Large $\kappa$ values indicate high confidence (tight clustering), modulating each pairwise similarity and the corresponding gradient magnitude:

  • Pairs with large, aligned $\kappa$ yield stronger similarity signals.
  • High-confidence disagreement is strongly penalized.
  • Mixed-confidence pairs see moderate weighting, which aligns with “human-like” uncertainty handling.

This confidence-driven weighting mechanism dynamically focuses optimization on reliable examples and mitigates the effect of noisy samples (Li et al., 2021).

7. Practical Considerations and Extensions

PPCL objectives come with several practical recommendations:

  • Tuning the temperature $\tau$ and embedding dimension $d$ adjusts the sensitivity between the inner-product and dispersion terms; typical values of $\tau$ lie in $[0.05, 0.2]$.
  • Reliable computation of Bessel functions is necessary for normalization constants.
  • Delaying the PPCL term during an initial warm-up phase allows feature means to stabilize.
  • Loss weighting between PPCL and classification objectives (e.g., logit-adjusted cross-entropy) is straightforward, with $\alpha \approx 1$ working well empirically; a minimal sketch combining this weighting with the warm-up phase follows this list.
  • PPCL is naturally extendable to semi-supervised learning via pseudo-label updates to mixture parameters (Du et al., 11 Mar 2024).
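
As a minimal illustration of the warm-up and loss-weighting recommendations above (the epoch threshold and the default $\alpha$ are illustrative choices, not values prescribed by the cited papers):

```python
def combined_loss(epoch, ce_loss, ppcl_loss_value, alpha=1.0, warmup_epochs=5):
    """Add the PPCL term to a (logit-adjusted) cross-entropy term after a warm-up phase."""
    if epoch < warmup_epochs:
        return ce_loss                           # let the running feature means stabilize first
    return ce_loss + alpha * ppcl_loss_value     # alpha ~ 1 reported to work well empirically
```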

PPCL embodies an analytic and distributional approach to contrastive learning, achieving closed-form, infinite-sampling objectives that address uncertainty and data imbalance without the batch size and pairing limitations of vanilla contrastive losses such as InfoNCE and SupCon.
