
H-InfoNCE Loss: Enhancing Contrastive Learning

Updated 26 November 2025
  • H-InfoNCE Loss is a family of contrastive learning objectives that refines standard InfoNCE by integrating hard negative mining, adversarial optimization, and temperature-free scaling.
  • It employs hardness-aware tilting functions and adversarial ranking constraints to improve discrimination in self-supervised and collaborative filtering tasks.
  • These variants offer robust convergence and enhanced empirical performance across applications in vision, graph, recommendation, and anomaly detection.

H-InfoNCE Loss refers to a family of modifications to the InfoNCE contrastive loss that address limitations of the standard formulation by introducing hard negative mining, adversarial optimization, supervised sampling, or temperature-free scaling. These variants are designed to enhance discriminative representation learning in self-supervised and collaborative filtering tasks by refining the negative sampling process or similarity calibration. Notably, the term H-InfoNCE encompasses (i) hardness-aware/tilted contrastive losses for supervised and unsupervised learning (Jiang et al., 2022), (ii) adversarially optimized InfoNCE (AdvInfoNCE) for collaborative filtering (Zhang et al., 2023), and (iii) a temperature-free reparameterization replacing the scaling hyperparameter with an arctanh mapping (Kim et al., 29 Jan 2025).

1. Foundations: Standard InfoNCE and Its Limitations

InfoNCE is the canonical objective for contrastive learning, structured to maximize similarity between positive pairs while minimizing it to negatives (Jiang et al., 2022, Kim et al., 29 Jan 2025). For sample embeddings $z_i, z_j$ (e.g., $z_i = f(x_i)$), the loss for a positive pair $(x, x^+)$ is

$L_{\rm InfoNCE} = -\log \frac{ \exp(\mathrm{sim}(z_x, z_{x^+}) / \tau) }{ \sum_{x'} \exp(\mathrm{sim}(z_x, z_{x'}) / \tau) }$

where $\mathrm{sim}$ is typically cosine similarity, and $\tau > 0$ is a temperature hyperparameter. Drawbacks of InfoNCE include:

  • Sensitivity to temperature: requires dataset-specific tuning; poor choices can cause gradients to vanish during training or remain nonzero at the optimum.
  • Ineffective negative sampling: uniform random negatives often ignore "hard" negatives, and may include false negatives.
  • In supervised or collaborative filtering contexts, negatives may not be truly unrelated.
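
For concreteness, here is a minimal PyTorch-style sketch of the standard loss above, serving as a baseline for the variants discussed next; the batched interface, function name, and default temperature are illustrative assumptions rather than details from the cited papers.

```python
import torch
import torch.nn.functional as F

def info_nce(z_anchor, z_positive, tau=0.5):
    """Standard InfoNCE over a batch: row i of z_positive is the positive
    for row i of z_anchor; all other rows serve as in-batch negatives."""
    z_anchor = F.normalize(z_anchor, dim=-1)       # cosine similarity via
    z_positive = F.normalize(z_positive, dim=-1)   # unit-norm dot products
    logits = z_anchor @ z_positive.T / tau         # (B, B) similarities / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)        # -log softmax at each positive
```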

2. Hard Negative Mining: Theory and Implementation

Hard negative mining in H-InfoNCE is based on modifying the negative sampling distribution using a "hardening function" $\eta : \mathbb{R} \to [0, \infty)$, nondecreasing in its argument. For an anchor $x$ and negative $x^-$, similarity $g(x, x^-)$ (scaled dot product), and base negative sampler $q_0(x^-)$, the tilted distribution is defined as

$q_{\rm H-UCL}(x^- | x; f) = \frac{\eta(g(x, x^-))\, q_0(x^-)}{\mathbb{E}_{q_0}[\eta(g(x, x^-))]}$

and for supervised settings,

$q_{\rm H-SCL}(x^- | x; f) = \frac{1\{y(x^-) \neq y(x)\}\, \eta(g(x, x^-))\, q_0(x^-)}{\mathbb{E}_{q_0}[1\{y(x^-) \neq y(x)\}\, \eta(g(x, x^-))]}$

Prototypical choices for $\eta$ are:

  • Exponential tilt: $\eta_{\rm exp}(t) = e^{\beta t}$, $\beta > 0$
  • Threshold tilt: $\eta_{\text{thresh}}(t) = 1(e^t \geq \tau)$, for threshold $\tau > 0$

The H-InfoNCE loss in the large-negative (infinite-$k$) regime becomes

$L^{(\infty)}_{\rm H-UCL}(f) = \mathbb{E}_{(x,x^+)} \left[ \log\left(1 + e^{-g(x,x^+)}\, \frac{ \mathbb{E}_{q_0}\left[ \eta(g(x, x^-))\, e^{g(x, x^-)} \right] }{ \mathbb{E}_{q_0}\left[ \eta(g(x, x^-)) \right] } \right) \right]$

with a similar expression for $L^{(\infty)}_{\rm H-SCL}$, restricting to $y(x^-) \ne y(x)$ (Jiang et al., 2022).
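
The infinite-negative expression can be approximated within a batch by reweighting in-batch negatives with $\eta$; the sketch below does this for the exponential tilt, with the function name, batch interface, and score scale chosen for illustration rather than taken from Jiang et al. (2022).

```python
import torch
import torch.nn.functional as F

def h_ucl_loss(z_anchor, z_positive, beta=1.0, scale=1.0):
    """Hardness-tilted unsupervised contrastive loss (H-UCL) with the
    exponential tilt eta(t) = exp(beta * t); in-batch negatives stand in
    for the base sampler q0."""
    za = F.normalize(z_anchor, dim=-1)
    zp = F.normalize(z_positive, dim=-1)
    g = scale * za @ zp.T                          # g(x, x') for all pairs, (B, B)
    pos = g.diagonal()                             # g(x, x+)
    eye = torch.eye(g.size(0), dtype=torch.bool, device=g.device)
    g_neg = g.masked_fill(eye, float('-inf'))      # exclude the positive from negatives
    w = torch.softmax(beta * g_neg, dim=1)         # eta(g) / E_q0[eta(g)] over negatives
    tilted = (w * g_neg.exp()).sum(dim=1)          # E_tilted[exp(g(x, x^-))]
    return torch.log1p(torch.exp(-pos) * tilted).mean()
```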

3. Adversarial and Distributionally Robust H-InfoNCE (AdvInfoNCE)

In AdvInfoNCE, hardness offsets $\delta_j$ are introduced directly per negative, yielding a fine-grained, hardness-aware ranking constraint $s(u, j) - s(u, i) + \delta_j < 0$ for each negative $j$, where $s(u, i)$ is the scaled similarity (Zhang et al., 2023). This leads to the loss

$L_{\rm AdvInfoNCE}(u, i; \{ \delta_j \}) = -\log \frac{\exp(s(u, i))}{ \exp(s(u, i)) + \sum_{j \in N_u} \exp(\delta_j) \exp(s(u, j)) }$

The values $\delta_j$ are learned adversarially via a min-max formulation

$\min_\theta \max_{\delta \in C(\eta)} \sum_{(u,i)} L_{\rm AdvInfoNCE}(u, i; \delta)$

with $C(\eta)$ constraining the KL divergence between the hardness distribution and uniform. This DRO interpretation yields robustness to false negatives and enables explicit hard negative mining:

$\min_\theta \max_{p(\cdot):\, D_{\rm KL}(\text{Uniform} \,\|\, p) \leq \eta} \mathbb{E}_{j \sim p(\cdot)} \left[ \exp( s(u, j) - s(u, i) ) \right]$

Negatives with larger $\delta_j$ are upweighted, and likely false negatives (with $\delta_j < 0$) are downweighted (Zhang et al., 2023).
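
A minimal sketch of this objective for a batch of interactions, with learnable per-negative offsets $\delta_j$; the tensor layout and helper name are assumptions for illustration, and the projection enforcing the KL constraint on $\delta$ is deliberately omitted here.

```python
import torch

def adv_infonce(pos_score, neg_scores, delta):
    """AdvInfoNCE for a batch of (user, positive item) interactions.
    pos_score:  (B,)   scaled similarities s(u, i) for the positive items
    neg_scores: (B, K) scaled similarities s(u, j) for K sampled negatives
    delta:      (B, K) hardness offsets, learned adversarially"""
    logits = torch.cat([pos_score.unsqueeze(1), neg_scores + delta], dim=1)
    # -log( exp(s(u,i)) / (exp(s(u,i)) + sum_j exp(delta_j) * exp(s(u,j))) )
    return -torch.log_softmax(logits, dim=1)[:, 0].mean()
```

In training, the model parameters take descent steps on this loss while $\delta$ takes ascent steps within the KL ball, realizing the min-max formulation above.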

4. Temperature-Free H-InfoNCE: Log-Odds Scaling

To eliminate the temperature hyperparameter, H-InfoNCE (Kim et al., 29 Jan 2025) replaces $\mathrm{sim}(z_i, z_j)/\tau$ in InfoNCE with the log-odds (inverse sigmoid) of the similarity: $\ell_{ij} = \mathrm{logit}\left( \frac{1 + C_{ij}}{2} \right) = 2 \arctanh(C_{ij})$, where $C_{ij} = \mathrm{sim}(z_i, z_j) \in (-1, 1)$. The resulting loss is

$L_{\rm H-InfoNCE} = - \mathbb{E}_{(x, x^+)} \left[ \log \frac{ \exp\left(2 \arctanh(\mathrm{sim}(z_x, z_{x^+})) \right) }{ \sum_{x'} \exp\left(2 \arctanh(\mathrm{sim}(z_x, z_{x'})) \right) } \right]$

This ensures monotonic, non-vanishing gradients across the relevant similarity range, removes the need to tune $\tau$, and leads to improved or matched empirical performance in image, graph, anomaly, and recommendation tasks (Kim et al., 29 Jan 2025).
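
A minimal sketch of this temperature-free variant: cosine similarities are mapped through $2\,\mathrm{arctanh}$ and passed to a standard cross-entropy. The small clamp is an implementation assumption to keep the logits finite when a pair is numerically perfectly aligned.

```python
import torch
import torch.nn.functional as F

def h_infonce_temp_free(z_anchor, z_positive, eps=1e-6):
    """Temperature-free H-InfoNCE: logits are 2 * arctanh(cosine similarity)."""
    za = F.normalize(z_anchor, dim=-1)
    zp = F.normalize(z_positive, dim=-1)
    C = (za @ zp.T).clamp(-1 + eps, 1 - eps)   # cosine similarities kept inside (-1, 1)
    logits = 2.0 * torch.atanh(C)              # log-odds mapping, no temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)
```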

5. Theoretical Guarantees and Bounds

For hardness-tilted H-InfoNCE (Jiang et al., 2022), under the assumption that the average hardness of same-class negatives is at least that of cross-class negatives, it holds that $L_{\rm H-UCL} \geq L_{\rm H-SCL}$, where H-UCL and H-SCL denote the unsupervised and supervised settings, respectively. This provides justification for optimizing hard-tilted unsupervised InfoNCE as a proxy for the supervised objective when labels are absent. Conversely, with an extreme threshold tilt, the H-SCL loss can be lower-bounded by the UCL loss, indicating the limitations of vanilla InfoNCE for discriminating especially hard negatives.

For AdvInfoNCE (Zhang et al., 2023), Theorem 3.1 shows that the adversarially optimized loss is equivalent to a DRO negative sampling problem, with an explicit KL-divergence constraint on the negative sampler, providing a theoretical foundation for its robustness to out-of-distribution items and false negatives.

For temperature-free H-InfoNCE (Kim et al., 29 Jan 2025), the arctanh reparameterization ensures that the gradient magnitude is strictly positive for all $C < 1$ and decays monotonically, leading to automatic, reliable convergence and learning dynamics not obtainable through any fixed choice of $\tau$.

6. Practical Implementation and Empirical Results

Implementation of both hardness-tilted and adversarial H-InfoNCE requires sampling or maintaining hardness weights per negative. For AdvInfoNCE, this involves maintaining $\delta$ vectors per user-item pair and updating them adversarially, with projected gradient methods used to enforce the KL constraint (Zhang et al., 2023). Hardness can be parameterized via tiny MLPs or direct embeddings.
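
A toy schematic of the alternating updates, reusing the adv_infonce sketch from Section 3 and optimizing random score tensors in place of a real recommender; the clamp on $\delta$ is a crude stand-in for the projected-gradient step that enforces the KL constraint, not the procedure from Zhang et al. (2023).

```python
import torch

B, K = 32, 64                                      # interactions per batch, negatives each
pos_score = torch.randn(B, requires_grad=True)     # stand-ins for model outputs s(u, i)
neg_scores = torch.randn(B, K, requires_grad=True) # stand-ins for model outputs s(u, j)
delta = torch.zeros(B, K, requires_grad=True)      # per-negative hardness offsets

opt_model = torch.optim.Adam([pos_score, neg_scores], lr=1e-3)
opt_delta = torch.optim.SGD([delta], lr=1e-2)

for step in range(100):
    # Adversary: gradient ascent on delta (negate the loss and descend).
    opt_delta.zero_grad()
    (-adv_infonce(pos_score, neg_scores, delta)).backward()
    opt_delta.step()
    with torch.no_grad():
        delta.clamp_(-2.0, 2.0)                    # crude surrogate for the KL projection

    # Model: gradient descent on the scoring parameters.
    opt_model.zero_grad()
    adv_infonce(pos_score, neg_scores, delta.detach()).backward()
    opt_model.step()
```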

For tilted H-InfoNCE, large batches or memory banks approximate the infinite-negative regime. Exponential or threshold tilting functions are deployed, with $\beta \approx 1$–$2$ typically effective, and threshold schedules can be adapted over training (Jiang et al., 2022).

Temperature-free H-InfoNCE is implemented by applying $2 \arctanh(\cdot)$ to the similarity matrix and passing the logits to a standard cross-entropy routine (Kim et al., 29 Jan 2025).

Empirical results demonstrate that:

  • H-SCL improves downstream classification accuracy by 3–4% over SCL on CIFAR100, with similar gains seen across vision and graph benchmarks (Jiang et al., 2022).
  • AdvInfoNCE substantially enhances generalization and robustness to false negatives and out-of-distribution items in collaborative filtering applications, outperforming other contrastive losses (Zhang et al., 2023).
  • Temperature-free H-InfoNCE consistently matches or outperforms best-tuned InfoNCE baselines across vision, graph, anomaly, NLP, and recommendation tasks, and removes the need for hyperparameter search (Kim et al., 29 Jan 2025).

7. Comparative Summary and Applications

Variant | Key Feature | Application Domain
H-InfoNCE (hard tilting) | Hardness-weighted negatives | Image, graph, self-supervised learning
AdvInfoNCE (adversarial) | Adversarial hardness optimization, DRO | Collaborative filtering, recommendation
H-InfoNCE (arctanh scaling) | Temperature-free logit mapping | Self-supervised across modalities

The H-InfoNCE family provides a unifying framework for addressing key limitations in standard contrastive learning objectives by enhancing negative sample informativeness, providing robust optimization guarantees, or improving training dynamics without the need for temperature tuning. Each variant is applicable to a wide range of representation learning, recommendation, and graph-learning problems, and affords both theoretical clarity and demonstrated empirical benefits (Jiang et al., 2022, Zhang et al., 2023, Kim et al., 29 Jan 2025).
