
ProjNCE: Unified Contrastive Learning

Updated 16 January 2026
  • ProjNCE is a generalized contrastive learning framework that extends InfoNCE by incorporating flexible projection functions and an adjustment term.
  • It unifies self-supervised and supervised approaches, enabling robust class separation and a tighter mutual information lower bound.
  • Empirical evaluations on multiple datasets and noise regimes demonstrate its superiority over SupCon and cross-entropy baselines.

ProjNCE is a generalized framework for contrastive learning that extends the classical InfoNCE objective to unify self-supervised and supervised contrastive approaches. By introducing flexible projection functions and an adjustment term, ProjNCE achieves a valid mutual information (MI) bound, enabling improved representation learning with robust class separation. This formulation accommodates diverse strategies for embedding class information and demonstrates empirical superiority over SupCon and cross-entropy baselines across various datasets, noise regimes, and evaluation criteria (Jeong et al., 11 Jun 2025).

1. Formal Definition and Mathematical Foundation

The multi-sample InfoNCE objective (for self-supervised scenarios) is traditionally:

$$I_{\mathrm{NCE}}^{\mathrm{self}}(X;C) = \frac{1}{N}\sum_{i=1}^N \mathbb{E}_{p(x_i \mid c_i)\prod_{j\neq i} p(x_j)} \left[ -\log \frac{\exp\left(\psi(f(x_i), f(x_i))/\tau\right)}{\sum_{j=1}^N \exp\left(\psi(f(x_i), f(x_j))/\tau\right)} \right]$$

where $f(\cdot)$ is a normalized encoder, $\psi(u,v) = u \cdot v$ is the (inner-product) critic, and $\tau$ is the temperature.

ProjNCE introduces two projection functions:

$$g_+: \{1, \ldots, M\} \rightarrow \mathbb{R}^{d_z}, \qquad g_-: \{1, \ldots, M\} \rightarrow \mathbb{R}^{d_z}$$

which enable positives and negatives to use separate projections, yielding the generalized objective:

$$I_{\mathrm{NCE}}^{\mathrm{self\text{-}p}}(X;C) = \frac{1}{N}\sum_{i=1}^N \mathbb{E}_{p(x_i \mid c_i)\prod_{j\neq i} p(x_j)} \left[ -\log \frac{\exp\left(\psi(f(x_i), g_+(c_i))\right)}{\sum_{j=1}^N \exp\left(\psi(f(x_i), g_-(c_j))\right)} \right]$$

To ensure this variant forms a valid MI lower bound, an adjustment term is introduced:

$$R(X,C) = \mathbb{E}_{p(x)\prod_{j=1}^N p(x_j)} \left[ \frac{\sum_{k=1}^N \exp(\psi(f(x), g_+(c_k)))}{\sum_{k=1}^N \exp(\psi(f(x), g_-(c_k)))} \right]$$

The ProjNCE loss is thus:

$$\mathcal{L}_{\mathrm{ProjNCE}}(X;C) = I_{\mathrm{NCE}}^{\mathrm{self\text{-}p}}(X;C) + R(X,C)$$

This setup enables the encoder to pull representations toward positive class projections and push them away from negative projections, with the adjustment term ensuring that the overall loss remains a tight MI lower bound.
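To make the objective concrete, here is a minimal NumPy sketch of an in-batch Monte Carlo estimate of the ProjNCE loss. The function name and vectorization are illustrative rather than the authors' reference implementation; the temperature is omitted as in the generalized objective above, and `beta` weights the adjustment term (with `beta = 1` recovering $\mathcal{L}_{\mathrm{ProjNCE}}$).

```python
import numpy as np

def projnce_loss(z, labels, g_pos, g_neg, beta=1.0):
    """In-batch estimate of the ProjNCE loss (illustrative sketch).

    z      : (N, d) L2-normalized embeddings f(x_i)
    labels : (N,) integer class labels c_i
    g_pos  : (M, d) positive projections g_+(c), one row per class
    g_neg  : (M, d) negative projections g_-(c), one row per class
    beta   : weight on the adjustment term R (beta = 1 gives L_ProjNCE)
    """
    # Critic psi(u, v) = u . v against the projected positives and negatives.
    pos_logits = np.einsum("nd,nd->n", z, g_pos[labels])  # psi(f(x_i), g_+(c_i))
    neg_logits = z @ g_neg[labels].T                      # psi(f(x_i), g_-(c_j))
    # Generalized InfoNCE term with projected positives and negatives.
    log_denom = np.log(np.exp(neg_logits).sum(axis=1))
    info_nce = -(pos_logits - log_denom).mean()
    # Adjustment term R: ratio of summed positive to summed negative scores.
    pos_sum = np.exp(z @ g_pos[labels].T).sum(axis=1)
    neg_sum = np.exp(neg_logits).sum(axis=1)
    r_term = (pos_sum / neg_sum).mean()
    return info_nce + beta * r_term
```

Note that when $g_+ = g_-$ the ratio inside $R$ is identically one, which matches the SoftNCE special case discussed in Section 3.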

2. Mutual Information Bound Properties

The principal theoretical result is a multi-sample NWJ-type bound:

$$I(X;C) \ge 1 + \log N - I_{\mathrm{NCE}}^{\mathrm{self\text{-}p}}(X;C) - R(X,C)$$

or equivalently,

$$-\mathcal{L}_{\mathrm{ProjNCE}} \le I(X;C) - (1 + \log N)$$

Here, minimizing the ProjNCE loss tightens the lower bound on $I(X;C)$ regardless of the specific choices of critic or projections. The proof leverages the NWJ variational MI estimator, rearranging terms to recover the generalized InfoNCE and adjustment expectations. The formulation encompasses both self-supervised and supervised scenarios and generalizes the relationship between SupCon and MI estimation.
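In outline, the rearrangement runs as follows (a reconstruction consistent with the bound above; the exact critic parameterization in the paper is an assumption here). Start from the NWJ bound

$$I(X;C) \ge \mathbb{E}_{p(x,c)}[T(x,c)] - e^{-1}\,\mathbb{E}_{p(x)p(c)}\left[e^{T(x,c)}\right]$$

and choose the critic

$$T(x,c) = 1 + \log \frac{N \exp(\psi(f(x), g_+(c)))}{\sum_{j=1}^N \exp(\psi(f(x), g_-(c_j)))}$$

The first expectation evaluates to $1 + \log N - I_{\mathrm{NCE}}^{\mathrm{self\text{-}p}}(X;C)$, while the second, after replacing the expectation over an independent label by its $N$-sample average, becomes $R(X,C)$; substituting both recovers the stated inequality.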

3. Projection Function Strategies

ProjNCE’s core flexibility lies in the arbitrary choice of $(g_+, g_-)$ projection strategies. Key variants include:

  • Centroid-based (SupCon-style):

$$g_+(c) = \frac{1}{|P(c)|} \sum_{x_j: c_j = c} f(x_j), \qquad g_-(c) = f(x)$$

This recovers the standard SupCon loss plus the $R$ term.

  • Orthogonal (conditional-expectation/Soft variants):

$$\bar{f}(c) = \mathbb{E}[f(X) \mid C = c]$$

Estimated via kernel regression (Nadaraya–Watson estimator):

$$\hat{f}(c) = \frac{\sum_{j=1}^N K_h(d(f(x_j), \cdot))\,\mathbf{1}_{\{c_j=c\}}\, f(x_j)}{\sum_{j=1}^N K_h(d(f(x_j), \cdot))\,\mathbf{1}_{\{c_j=c\}}}$$

  • SoftNCE: $g_+ = g_- = \bar{f}$ (no $R$ term; $R = 1$)
  • SoftSupCon: $g_+ = \bar{f}$, $g_- = f$ (with $R$)
  • Median-based (robust):

$$f_{\mathrm{med}}(c) = \mathrm{median}\{f(x_j): c_j = c\}$$

This yields the analogous MedNCE and MedSupCon objectives.

This generalization enables tailored class embedding selection, supporting robustness to label noise and feature corruption via median and kernel strategies.
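For illustration, the three projection families can be computed in-batch roughly as follows. This is a NumPy sketch under our own naming; in particular, evaluating the Nadaraya–Watson kernel weights at the class centroid is a simplifying assumption, since the paper's estimator is a function of the query point.

```python
import numpy as np

def class_projections(z, labels, n_classes, kind="centroid", h=0.6):
    """In-batch class projections g(c) for ProjNCE variants (sketch).

    kind = "centroid": per-class mean of embeddings (SupCon-style g_+)
    kind = "median"  : dimension-wise per-class median (MedNCE / MedSupCon)
    kind = "soft"    : Nadaraya-Watson estimate of E[f(X)|C=c] with an
                       Epanechnikov kernel on l1 distances (query point
                       assumed to be the class centroid, a simplification)
    """
    d = z.shape[1]
    g = np.zeros((n_classes, d))
    for c in range(n_classes):
        zc = z[labels == c]
        if zc.size == 0:
            continue  # class absent from this batch
        if kind == "centroid":
            g[c] = zc.mean(axis=0)
        elif kind == "median":
            g[c] = np.median(zc, axis=0)
        elif kind == "soft":
            dist = np.abs(zc - zc.mean(axis=0)).sum(axis=1)  # l1 distances
            u = dist / h
            w = np.maximum(1.0 - u**2, 0.0)                  # Epanechnikov kernel
            if w.sum() == 0:
                w = np.ones(len(zc))  # all points outside bandwidth: fall back
            g[c] = (w[:, None] * zc).sum(axis=0) / w.sum()
    return g
```

The returned array can be passed directly as `g_pos` or `g_neg` to a ProjNCE-style loss; mixing kinds (e.g., median positives with raw-embedding negatives) corresponds to the hybrid variants above.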

4. Experimental Evaluation and Quantitative Performance

Experiments employ a ResNet-18 encoder with $d_z = 128$, the AdamW optimizer, batch sizes 256 or 512, and temperature $\tau = 0.07$. Datasets include CIFAR-10/100, Tiny-ImageNet, Imagenette, Caltech256, Food101, STL-10, and synthetic mixtures for MI estimation.

Top-1 Accuracy (%) Across Variants

| Dataset | CE | SupCon | ProjNCE | SoftNCE | SoftSupCon |
| --- | --- | --- | --- | --- | --- |
| CIFAR-10 | 92.79 | 93.47 | 93.90 | 93.15 | 93.36 |
| CIFAR-100 | 64.71 | 68.89 | 69.47 | 70.44 | 68.52 |
| Tiny-ImageNet | 16.26 | 50.92 | 54.08 | 49.13 | 49.94 |
| Imagenette | 84.97 | 84.74 | 84.71 | 85.40 | 84.18 |
| Caltech256 | 75.63 | 83.18 | 81.08 | 80.94 | 80.94 |
| Food101 | 68.29 | 69.18 | 70.18 | 68.27 | 67.69 |

Robustness to Label Noise: Top-1 Accuracy (%) on STL-10 at label-flip probability $p$

| Method | $p=0.0$ | $0.1$ | $0.2$ | $0.3$ | $0.4$ | $0.5$ |
| --- | --- | --- | --- | --- | --- | --- |
| SupCon | 77.71 | 71.89 | 67.43 | 62.85 | 51.63 | 50.41 |
| ProjNCE | 79.19 | 75.41 | 70.96 | 64.14 | 55.36 | 52.21 |
| SoftNCE | 78.10 | 72.94 | 70.39 | 61.89 | 56.58 | 54.94 |
| MedSupCon | 79.04 | 75.19 | 72.70 | 66.36 | 60.78 | 57.11 |

Mutual information estimates (Mixed-KSG) corroborate that ProjNCE consistently achieves higher $I(f(X);C)$ than SupCon.

5. Ablation Studies and Empirical Insights

Experimental ablations illuminate the influence of projection choice, adjustment-term weighting, kernel parameters, and robustness properties:

  • Adjustment Term Weight ($\beta$): Using $\mathcal{L}_\beta = I_{\mathrm{NCE}}^{\mathrm{self\text{-}p}} + \beta R$, t-SNE visualizations show that $\beta = 5$ induces class-cluster dispersion, facilitating greater false-positive separation, while $\beta = 10$ can lead to excessive intra-class tightness.
  • Kernel Bandwidth ($h$): In SoftNCE, setting $h = 0.6$ with $\ell_1$ distance and an Epanechnikov kernel maximizes accuracy; $h > 0.8$ degrades performance via oversmoothing.
  • Projection Dependence: SoftNCE tightens MI bounds most for binary classification; centroid-based ProjNCE excels in multiclass contexts; median variants are most robust to feature or label noise.
  • Noisy Feature Robustness: MedSupCon achieves the highest accuracy under pixel-level Gaussian noise, and integrating ProjNCE into joint-training pipelines augments performance by approximately 1 percentage point.

A plausible implication is that the flexibility in (g+,g)(g_+, g_-) adaptation is directly responsible for the observed improvements, particularly under challenging conditions.

6. Guidelines for Practical Use

Implementation of ProjNCE requires several practical considerations:

  • Batch Size: A minimum of 256 is required to stabilize both the InfoNCE and $R$ terms.
  • Temperature ($\tau$): Default $\tau = 0.07$; tuning in $[0.05, 0.2]$ can optimize results.
  • Adjustment Term Weight ($\beta$): Start with 1; increase for greater cluster separation, decrease if clusters are too dispersed.
  • Projections:
    • Centroid: In-batch averaging over each class.
    • Orthogonal (Soft): Kernel regression; use $\ell_1$ distance, an Epanechnikov kernel, and $h \in [0.4, 0.8]$.
    • Median: Compute the median dimension-wise.
  • Negative Sampling: In-batch negatives suffice; consider a memory bank for large datasets, maintaining class-independent sampling to preserve the validity of $R$.
  • Optimization: AdamW, linear learning-rate warmup, weight decay $10^{-4}$; gradient clipping of $R$ if necessary.
  • Downstream Tasks: After contrastive pre-training, freeze the encoder and train a linear classifier for 50–100 epochs.
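As one concrete way to realize the downstream step, here is a compact sketch of a frozen-feature linear probe. A closed-form ridge probe on one-hot targets stands in for the 50–100-epoch softmax probe described above; the function name and regularization strength are our assumptions, not the paper's protocol.

```python
import numpy as np

def linear_probe(features, labels, n_classes, l2=1e-4):
    """Fit a linear classifier on frozen encoder features (illustrative).

    features : (N, d) encoder outputs with the encoder frozen
    labels   : (N,) integer class labels
    Returns the weight matrix (d+1, n_classes) and training accuracy.
    """
    N, d = features.shape
    Y = np.eye(n_classes)[labels]               # one-hot targets
    X = np.hstack([features, np.ones((N, 1))])  # append a bias column
    # Closed-form ridge solution: W = (X^T X + l2 I)^-1 X^T Y
    W = np.linalg.solve(X.T @ X + l2 * np.eye(d + 1), X.T @ Y)
    preds = (X @ W).argmax(axis=1)
    return W, (preds == labels).mean()
```

In practice one would replace this with a softmax classifier trained by SGD on the held-out protocol, but the frozen-encoder structure of the evaluation is the same.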

This methodology offers a unified view of contrastive objectives under valid MI bounds, with projection flexibility and adjustment-term refinement yielding consistent, broadly-applicable performance improvements (Jeong et al., 11 Jun 2025).
