
H-InfoNCE Loss: Enhancing Contrastive Learning

Updated 26 November 2025
  • H-InfoNCE Loss is a family of contrastive learning objectives that refines standard InfoNCE by integrating hard negative mining, adversarial optimization, and temperature-free scaling.
  • It employs hardness-aware tilting functions and adversarial ranking constraints to improve discrimination in self-supervised and collaborative filtering tasks.
  • These variants offer robust convergence and enhanced empirical performance across applications in vision, graph, recommendation, and anomaly detection.

H-InfoNCE Loss refers to a family of modifications to the InfoNCE contrastive loss that address limitations of the standard formulation by introducing hard negative mining, adversarial optimization, supervised sampling, or temperature-free scaling. These variants are designed to enhance discriminative representation learning in self-supervised and collaborative filtering tasks by refining the negative sampling process or similarity calibration. Notably, the term H-InfoNCE encompasses (i) hardness-aware/tilted contrastive losses for supervised and unsupervised learning (Jiang et al., 2022), (ii) adversarially optimized InfoNCE (AdvInfoNCE) for collaborative filtering (Zhang et al., 2023), and (iii) a temperature-free reparameterization replacing the scaling hyperparameter with an arctanh mapping (Kim et al., 29 Jan 2025).

1. Foundations: Standard InfoNCE and Its Limitations

InfoNCE is the canonical objective for contrastive learning, structured to maximize similarity between positive pairs while minimizing it to negatives (Jiang et al., 2022, Kim et al., 29 Jan 2025). For sample embeddings $z_i, z_j$ (e.g., $z_i = f(x_i)$), the loss for a positive pair $(x, x^+)$ is

$L_{\rm InfoNCE} = -\log \frac{ \exp(\mathrm{sim}(z_x, z_{x^+}) / \tau) }{ \sum_{x'} \exp(\mathrm{sim}(z_x, z_{x'}) / \tau) }$

where $\mathrm{sim}$ is typically cosine similarity, and $\tau > 0$ is a temperature hyperparameter. Drawbacks of InfoNCE include:

  • Sensitivity to temperature: requires dataset-specific tuning; poor choices can cause gradients to vanish during training or remain nonzero at the optimum.
  • Ineffective negative sampling: uniform random negatives often ignore "hard" negatives, and may include false negatives.
  • In supervised or collaborative filtering contexts, negatives may not be truly unrelated.
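
For concreteness, here is a minimal PyTorch-style sketch of the standard loss above, serving as a baseline for the variants discussed next; the batched interface, function name, and default temperature are illustrative assumptions rather than details from the cited papers.

```python
import torch
import torch.nn.functional as F

def info_nce(z_anchor, z_positive, tau=0.5):
    """Standard InfoNCE over a batch: row i of z_positive is the positive
    for row i of z_anchor; all other rows serve as in-batch negatives."""
    z_anchor = F.normalize(z_anchor, dim=-1)       # cosine similarity via
    z_positive = F.normalize(z_positive, dim=-1)   # unit-norm dot products
    logits = z_anchor @ z_positive.T / tau         # (B, B) similarities / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)        # -log softmax at each positive
```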

2. Hard Negative Mining: Theory and Implementation

Hard negative mining in H-InfoNCE is based on modifying the negative sampling distribution using a "hardening function" $\eta : \mathbb{R} \to [0, \infty)$, nondecreasing in its argument. For an anchor $x$ and negative $x^-$, similarity $g(x, x^-)$ (scaled dot product), and base negative sampler $q_0(x^-)$, the tilted distribution is defined as

$q_{\rm H-UCL}(x^- | x; f) = \frac{\eta(g(x, x^-))\, q_0(x^-)}{\mathbb{E}_{q_0}[\eta(g(x, x^-))]}$

and for supervised settings,

$q_{\rm H-SCL}(x^- | x; f) = \frac{1\{y(x^-) \neq y(x)\}\, \eta(g(x, x^-))\, q_0(x^-)}{\mathbb{E}_{q_0}[1\{y(x^-) \neq y(x)\}\, \eta(g(x, x^-))]}$

Prototypical choices for $\eta$ are:

  • Exponential tilt: $\eta_{\rm exp}(t) = e^{\beta t}$, $\beta > 0$
  • Threshold tilt: $\eta_{\text{thresh}}(t) = 1(e^t \geq \tau)$, for threshold $\tau > 0$

The H-InfoNCE loss in the large-negative (infinite-$k$) regime becomes

$L^{(\infty)}_{\rm H-UCL}(f) = \mathbb{E}_{(x,x^+)} \left[ \log\left(1 + e^{-g(x,x^+)}\, \frac{ \mathbb{E}_{q_0}\left[ \eta(g(x, x^-))\, e^{g(x, x^-)} \right] }{ \mathbb{E}_{q_0}\left[ \eta(g(x, x^-)) \right] } \right) \right]$

with a similar expression for $L^{(\infty)}_{\rm H-SCL}$, restricting to $y(x^-) \ne y(x)$ (Jiang et al., 2022).
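
The infinite-negative expression can be approximated within a batch by reweighting in-batch negatives with $\eta$; the sketch below does this for the exponential tilt, with the function name, batch interface, and score scale chosen for illustration rather than taken from Jiang et al. (2022).

```python
import torch
import torch.nn.functional as F

def h_ucl_loss(z_anchor, z_positive, beta=1.0, scale=1.0):
    """Hardness-tilted unsupervised contrastive loss (H-UCL) with the
    exponential tilt eta(t) = exp(beta * t); in-batch negatives stand in
    for the base sampler q0."""
    za = F.normalize(z_anchor, dim=-1)
    zp = F.normalize(z_positive, dim=-1)
    g = scale * za @ zp.T                          # g(x, x') for all pairs, (B, B)
    pos = g.diagonal()                             # g(x, x+)
    eye = torch.eye(g.size(0), dtype=torch.bool, device=g.device)
    g_neg = g.masked_fill(eye, float('-inf'))      # exclude the positive from negatives
    w = torch.softmax(beta * g_neg, dim=1)         # eta(g) / E_q0[eta(g)] over negatives
    tilted = (w * g_neg.exp()).sum(dim=1)          # E_tilted[exp(g(x, x^-))]
    return torch.log1p(torch.exp(-pos) * tilted).mean()
```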

3. Adversarial and Distributionally Robust H-InfoNCE (AdvInfoNCE)

In AdvInfoNCE, hardness offsets $\delta_j$ are introduced directly per negative, yielding a fine-grained, hardness-aware ranking constraint $s(u, j) - s(u, i) + \delta_j < 0$ for each negative $j$, where $s(u, i)$ is the scaled similarity (Zhang et al., 2023). This leads to the loss

$L_{\rm AdvInfoNCE}(u, i; \{ \delta_j \}) = -\log \frac{\exp(s(u, i))}{ \exp(s(u, i)) + \sum_{j \in N_u} \exp(\delta_j) \exp(s(u, j)) }$

The values $\delta_j$ are learned adversarially via a min-max formulation

$\min_\theta \max_{\delta \in C(\eta)} \sum_{(u,i)} L_{\rm AdvInfoNCE}(u, i; \delta)$

with $C(\eta)$ constraining the KL divergence between the hardness distribution and uniform. This DRO interpretation yields robustness to false negatives and enables explicit hard negative mining:

$\min_\theta \max_{p(\cdot):\, D_{\rm KL}(\text{Uniform} \,\|\, p) \leq \eta} \mathbb{E}_{j \sim p(\cdot)} \left[ \exp( s(u, j) - s(u, i) ) \right]$

Negatives with larger $\delta_j$ are upweighted, and likely false negatives (with $\delta_j < 0$) are downweighted (Zhang et al., 2023).
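
A minimal sketch of this objective for a batch of interactions, with learnable per-negative offsets $\delta_j$; the tensor layout and helper name are assumptions for illustration, and the projection enforcing the KL constraint on $\delta$ is deliberately omitted here.

```python
import torch

def adv_infonce(pos_score, neg_scores, delta):
    """AdvInfoNCE for a batch of (user, positive item) interactions.
    pos_score:  (B,)   scaled similarities s(u, i) for the positive items
    neg_scores: (B, K) scaled similarities s(u, j) for K sampled negatives
    delta:      (B, K) hardness offsets, learned adversarially"""
    logits = torch.cat([pos_score.unsqueeze(1), neg_scores + delta], dim=1)
    # -log( exp(s(u,i)) / (exp(s(u,i)) + sum_j exp(delta_j) * exp(s(u,j))) )
    return -torch.log_softmax(logits, dim=1)[:, 0].mean()
```

In training, the model parameters take descent steps on this loss while $\delta$ takes ascent steps within the KL ball, realizing the min-max formulation above.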

4. Temperature-Free H-InfoNCE: Log-Odds Scaling

To eliminate the temperature hyperparameter, H-InfoNCE (Kim et al., 29 Jan 2025) replaces $\mathrm{sim}(z_i, z_j)/\tau$ in InfoNCE with the log-odds (inverse sigmoid) of the similarity: $\ell_{ij} = \mathrm{logit}\left( \frac{1 + C_{ij}}{2} \right) = 2 \arctanh(C_{ij})$, where $C_{ij} = \mathrm{sim}(z_i, z_j) \in (-1, 1)$. The resulting loss is

$L_{\rm H-InfoNCE} = - \mathbb{E}_{(x, x^+)} \left[ \log \frac{ \exp\left(2 \arctanh(\mathrm{sim}(z_x, z_{x^+})) \right) }{ \sum_{x'} \exp\left(2 \arctanh(\mathrm{sim}(z_x, z_{x'})) \right) } \right]$

This ensures monotonic, non-vanishing gradients across the relevant similarity range, removes the need to tune $\tau$, and leads to improved or matched empirical performance in image, graph, anomaly, and recommendation tasks (Kim et al., 29 Jan 2025).
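
A minimal sketch of this temperature-free variant: cosine similarities are mapped through $2\,\mathrm{arctanh}$ and passed to a standard cross-entropy. The small clamp is an implementation assumption to keep the logits finite when a pair is numerically perfectly aligned.

```python
import torch
import torch.nn.functional as F

def h_infonce_temp_free(z_anchor, z_positive, eps=1e-6):
    """Temperature-free H-InfoNCE: logits are 2 * arctanh(cosine similarity)."""
    za = F.normalize(z_anchor, dim=-1)
    zp = F.normalize(z_positive, dim=-1)
    C = (za @ zp.T).clamp(-1 + eps, 1 - eps)   # cosine similarities kept inside (-1, 1)
    logits = 2.0 * torch.atanh(C)              # log-odds mapping, no temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)
```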

5. Theoretical Guarantees and Bounds

For hardness-tilted H-InfoNCE (Jiang et al., 2022), under the assumption that the average hardness of same-class negatives is at least that of cross-class negatives, it holds that $L_{\rm H-UCL} \geq L_{\rm H-SCL}$, where H-UCL and H-SCL denote the unsupervised and supervised settings, respectively. This provides justification for optimizing hard-tilted unsupervised InfoNCE as a proxy for the supervised objective when labels are absent. Conversely, with an extreme threshold tilt, the H-SCL loss can be lower-bounded by the UCL loss, indicating the limitations of vanilla InfoNCE for discriminating especially hard negatives.

For AdvInfoNCE (Zhang et al., 2023), Theorem 3.1 shows that the adversarially optimized loss is equivalent to a DRO negative sampling problem, with an explicit KL-divergence constraint on the negative sampler, providing a theoretical foundation for its robustness to out-of-distribution items and false negatives.

For temperature-free H-InfoNCE (Kim et al., 29 Jan 2025), the arctanh reparameterization ensures that the gradient magnitude is strictly positive for all $C < 1$ and decays monotonically, leading to automatic, reliable convergence and learning dynamics not obtainable through any fixed choice of $\tau$.

6. Practical Implementation and Empirical Results

Implementation of both hardness-tilted and adversarial H-InfoNCE requires sampling or maintaining hardness weights per negative. For AdvInfoNCE, this involves maintaining $\delta$ vectors per user-item pair and updating them adversarially, with projected gradient methods used to enforce the KL constraint (Zhang et al., 2023). Hardness can be parameterized via tiny MLPs or direct embeddings.
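
A toy schematic of the alternating updates, reusing the adv_infonce sketch from Section 3 and optimizing random score tensors in place of a real recommender; the clamp on $\delta$ is a crude stand-in for the projected-gradient step that enforces the KL constraint, not the procedure from Zhang et al. (2023).

```python
import torch

B, K = 32, 64                                      # interactions per batch, negatives each
pos_score = torch.randn(B, requires_grad=True)     # stand-ins for model outputs s(u, i)
neg_scores = torch.randn(B, K, requires_grad=True) # stand-ins for model outputs s(u, j)
delta = torch.zeros(B, K, requires_grad=True)      # per-negative hardness offsets

opt_model = torch.optim.Adam([pos_score, neg_scores], lr=1e-3)
opt_delta = torch.optim.SGD([delta], lr=1e-2)

for step in range(100):
    # Adversary: gradient ascent on delta (negate the loss and descend).
    opt_delta.zero_grad()
    (-adv_infonce(pos_score, neg_scores, delta)).backward()
    opt_delta.step()
    with torch.no_grad():
        delta.clamp_(-2.0, 2.0)                    # crude surrogate for the KL projection

    # Model: gradient descent on the scoring parameters.
    opt_model.zero_grad()
    adv_infonce(pos_score, neg_scores, delta.detach()).backward()
    opt_model.step()
```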

For tilted H-InfoNCE, large batches or memory banks approximate the infinite-negative regime. Exponential or threshold tilting functions are deployed, with $\beta \approx 1$–$2$ typically effective, and threshold schedules can be adapted over training (Jiang et al., 2022).

Temperature-free H-InfoNCE is implemented by applying $2 \arctanh(\cdot)$ to the similarity matrix and passing the logits to a standard cross-entropy routine (Kim et al., 29 Jan 2025).

Empirical results demonstrate that:

  • H-SCL improves downstream classification accuracy by 3–4% over SCL on CIFAR100, with similar gains seen across vision and graph benchmarks (Jiang et al., 2022).
  • AdvInfoNCE substantially enhances generalization and robustness to false negatives and out-of-distribution items in collaborative filtering applications, outperforming other contrastive losses (Zhang et al., 2023).
  • Temperature-free H-InfoNCE consistently matches or outperforms best-tuned InfoNCE baselines across vision, graph, anomaly, NLP, and recommendation tasks, and removes the need for hyperparameter search (Kim et al., 29 Jan 2025).

7. Comparative Summary and Applications

Variant | Key Feature | Application Domain
H-InfoNCE (hard tilting) | Hardness-weighted negatives | Image, graph, self-supervised learning
AdvInfoNCE (adversarial) | Adversarial hardness optimization, DRO | Collaborative filtering, recommendation
H-InfoNCE (arctanh scaling) | Temperature-free logit mapping | Self-supervised across modalities

The H-InfoNCE family provides a unifying framework for addressing key limitations in standard contrastive learning objectives by enhancing negative sample informativeness, providing robust optimization guarantees, or improving training dynamics without the need for temperature tuning. Each variant is applicable to a wide range of representation learning, recommendation, and graph-learning problems, and affords both theoretical clarity and demonstrated empirical benefits (Jiang et al., 2022, Zhang et al., 2023, Kim et al., 29 Jan 2025).
