Temperature-Free H-InfoNCE
- The paper introduces an arctanh-based InfoNCE loss that eliminates the need for temperature tuning while offering robust gradient dynamics.
- It replaces cosine similarity scaling with an unbounded arctanh mapping, facilitating more expressive logits and improved contrastive learning.
- Empirical benchmarks across vision, graphs, NLP, and recommender systems show that H-InfoNCE matches or outperforms manually tuned temperature approaches.
Temperature-Free H-InfoNCE (arctanh-based InfoNCE, “scaled log-odds”) is a hyperparameter-free alternative to the canonical temperature-scaled InfoNCE loss for contrastive representation learning. By replacing the usual temperature scaling of cosine similarities with an arctanh mapping, the loss removes the need to tune the critical temperature parameter and exhibits more robust gradient behavior, with empirical results matching or exceeding manually tuned temperature baselines across vision, graph, anomaly detection, NLP, and recommender system benchmarks (Kim et al., 29 Jan 2025).
1. Standard InfoNCE and Temperature Scaling
The InfoNCE loss is central to contrastive learning, operating on a batch of $N$ anchor examples $x_1, \dots, x_N$, each paired (e.g., via augmentations or alternative views) with a positive sample index $i^+$ and negatives $j \neq i^+$. An encoder maps inputs to $\ell_2$-normalized embeddings $z_i$, yielding cosine similarities $s_{ij} = z_i^\top z_j \in [-1, 1]$. The loss for one anchor is:
$L_i = -\log \frac{\exp(s_{i,i^+}/\tau)}{\sum_{j=1}^{N} \exp(s_{i,j}/\tau)}$
with $\tau > 0$ the temperature parameter. Dividing by $\tau$ rescales similarities before the softmax. Performance is highly sensitive to $\tau$: a small $\tau$ increases “contrast” but risks vanishing gradients for moderate similarities, while a large $\tau$ leaves a nonzero gradient even at the optimum, necessitating expensive hyperparameter tuning.
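The temperature-scaled loss can be written as a short, self-contained NumPy sketch (the batching convention, in which the positive for anchor $i$ sits at row $i$ of a second view matrix, is an illustrative assumption):

```python
import numpy as np

def info_nce(z1, z2, tau=0.25):
    """Temperature-scaled InfoNCE over a batch.

    z1, z2: (N, d) L2-normalized embeddings of two views;
    z2[i] is the positive for anchor z1[i], all other rows are negatives.
    """
    s = z1 @ z2.T                                         # cosine similarities, (N, N)
    logits = s / tau                                      # temperature scaling
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))                 # positives on the diagonal

# Toy usage: identical views, so every positive has similarity exactly 1.
rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
z = z / np.linalg.norm(z, axis=1, keepdims=True)
loss = info_nce(z, z, tau=0.25)
```

With identical views the diagonal dominates, so shrinking `tau` sharpens the softmax and lowers the loss, illustrating the sensitivity described above.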
2. Construction of Temperature-Free H-InfoNCE
To eliminate temperature dependence while preserving the ability of logits to take values across $(-\infty, \infty)$, H-InfoNCE (Editor’s term) leverages a statistical log-odds transformation. For $s_{ij} \in (-1, 1)$, define $p_{ij} = \frac{1 + s_{ij}}{2} \in (0, 1)$; then,
$\operatorname{logit}(p_{ij}) = \log\frac{1+s_{ij}}{1-s_{ij}} = 2\,\arctanh(s_{ij})$
The H-InfoNCE loss is then specified as:
$L^{\mathrm{H-InfoNCE}} = \sum_{i=1}^{N} \left[ -\log \frac{\exp(2\,\arctanh(s_{i,i^+}))}{\sum_{j=1}^N \exp(2\,\arctanh(s_{i,j}))} \right]$
Or, equivalently:
$L_i = -2 \arctanh(s_{i, i^+}) + \log \sum_{j=1}^N \exp(2 \arctanh(s_{i,j}))$
Thus, $\tau$ is eliminated; each logit is mapped as $s_{ij} \mapsto 2\,\arctanh(s_{ij})$, yielding unbounded logits and more expressive contrast.
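Relative to a standard InfoNCE implementation, the only change is the logit mapping. A minimal NumPy sketch follows; the small clip just below $|s| = 1$ is a numerical safeguard assumed here, not specified in the source:

```python
import numpy as np

def h_info_nce(z1, z2, eps=1e-6):
    """Temperature-free H-InfoNCE: logits are 2 * arctanh(cosine similarity)."""
    s = z1 @ z2.T                                         # cosine similarities, (N, N)
    s = np.clip(s, -1 + eps, 1 - eps)                     # keep arctanh finite
    logits = 2.0 * np.arctanh(s)                          # unbounded, hyperparameter-free
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))

rng = np.random.default_rng(1)
z = rng.normal(size=(8, 16))
z = z / np.linalg.norm(z, axis=1, keepdims=True)
aligned = h_info_nce(z, z)                         # positives fully aligned: loss near 0
mismatched = h_info_nce(z, np.roll(z, 1, axis=0))  # wrong pairing: large loss
```

Note the call signature carries no temperature argument; the function is a drop-in replacement for the scaled version.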
3. Gradient Properties and Analysis
H-InfoNCE’s theoretical advantage is evident in its gradient behavior. Under a toy model with one positive (similarity $s_{i,i^+} = C$) and one negative (similarity $s_{i,j} = 0$), standard InfoNCE yields:
$L = \log\left(1 + e^{-C/\tau}\right)$
whose gradient with respect to $C$ has magnitude $\frac{1}{\tau}\,\sigma(-C/\tau)$ (with $\sigma$ the logistic sigmoid), which can vanish for moderate $C$ when $\tau$ is small. In contrast, H-InfoNCE produces:
$L = \log(1 + e^{-2\arctanh(C)})$
This gradient is strictly positive for $C < 1$ and vanishes only as $C \to 1$, i.e., only at the optimum. There is no middle-range gradient collapse, and no hyperparameter to tune.
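The contrast between the two regimes can be checked numerically in a toy comparison of this kind (one positive at similarity $C$, one negative fixed at similarity $0$; the specific values $C = 0.6$, $\tau = 0.05$ and the finite-difference helper are illustrative choices):

```python
import numpy as np

def toy_infonce(C, tau):
    # Standard InfoNCE in the toy model: positive similarity C, negative 0.
    return np.log(1.0 + np.exp(-C / tau))

def toy_h_infonce(C):
    # H-InfoNCE in the same toy model; since exp(-2*arctanh(C)) = (1-C)/(1+C),
    # this equals log(2 / (1 + C)) and tends to 0 as C -> 1.
    return np.log(1.0 + np.exp(-2.0 * np.arctanh(C)))

def grad(f, C, h=1e-6):
    # Central finite difference stands in for the closed-form derivative.
    return (f(C + h) - f(C - h)) / (2.0 * h)

# Moderate similarity, small temperature: the InfoNCE gradient collapses,
# while the H-InfoNCE gradient stays well away from zero.
g_info = abs(grad(lambda c: toy_infonce(c, tau=0.05), 0.6))
g_h = abs(grad(toy_h_infonce, 0.6))
```

At $C = 0.6$ the temperature-scaled gradient is already orders of magnitude smaller than the temperature-free one, matching the middle-range collapse described above.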
4. Theoretical Insights
The main theoretical claim is that finite logit ranges in standard InfoNCE induce undesirable gradient trade-offs: nonzero gradients near optimum (impairing convergence), or vanishing gradients in the middle (impairing learning). The arctanh mapping resolves this, producing:
- Unbounded logits ($2\,\arctanh(s_{ij}) \in (-\infty, \infty)$) allowing the softmax to approximate the “hard” one-hot distribution.
- Gradients that vanish only at the true optimum, supporting reliable gradient-based cross-entropy minimization. No global convergence proof or Lipschitz constant estimates are provided, but closed-form analysis supports these conclusions.
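The first bullet can be made concrete: as the positive similarity saturates, the arctanh logits diverge and the softmax output approaches one-hot, whereas bounded $s/\tau$ logits plateau for any fixed $\tau$ (the similarity values and $\tau = 0.25$ below are illustrative):

```python
import numpy as np

def softmax(x):
    x = x - x.max()          # shift for numerical stability
    e = np.exp(x)
    return e / e.sum()

# One nearly saturated positive against two negatives.
s = np.array([0.9999, 0.1, -0.2])
p_arctanh = softmax(2.0 * np.arctanh(s))   # unbounded logits: near one-hot
p_temp = softmax(s / 0.25)                 # bounded logits at fixed tau
```

The arctanh-mapped distribution places essentially all mass on the positive, while the temperature-scaled one retains visible mass on the negatives.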
5. Empirical Performance
Extensive benchmarks across five domains compare H-InfoNCE (“Free”) against InfoNCE evaluated over a sweep of temperatures $\tau$; the best-performing $\tau$ is reported in parentheses. Representative results:
| Task | Best InfoNCE (best $\tau$) | H-InfoNCE (“Free”) |
|---|---|---|
| Imagenette kNN-1 | 84.43 (0.25) | 84.65 (0.27) |
| CiteSeer F1 (micro/macro) | 67.33/60.47 (0.50) | 67.95/60.56 |
| CIFAR-10 ROC-AUC (mean) | 97.22 (0.25) | 97.28 |
| MABEL LM/ICAT (StereoSet) | 80.6/71.3 | 81.0/71.7 |
| DCRec HR@1 | 0.1336 | 0.1360 |
Across all tasks, H-InfoNCE matches or exceeds the best manually tuned $\tau$, without any hyperparameter search.
6. Implementation Practices
Image experiments use SimCLR with a ResNet-18 backbone (800 epochs, SGD, batch size 256, standard augmentations); graph experiments employ GRACE with GCN encoders (1000 epochs, Adam); anomaly detection leverages a ResNet-152 on CIFAR-10; NLP bias mitigation uses BERT-base (MABEL, 2 epochs, AdamW, batch size 16); recommender experiments use DCRec (GNN with co-attention, batch size 512, contrastive/regularizer weights as in the baseline). Standard dataset sizes and batch defaults are observed; no extensive ablation over batch size or embedding dimension is reported.
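None of those training stacks is reproduced here, but the end-to-end mechanics can be sketched as a self-contained NumPy toy: two randomly initialized banks of $\ell_2$-normalized embeddings are optimized with analytic H-InfoNCE gradients until paired rows align. The gradient derivation, clip threshold, learning rate, and sizes are all illustrative choices, not taken from the source:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, lr, steps, eps = 8, 16, 0.5, 200, 1e-3

def normalize(Z):
    return Z / np.linalg.norm(Z, axis=1, keepdims=True)

def h_infonce_loss_and_grads(Z1, Z2):
    """Loss plus analytic gradients w.r.t. both embedding banks.

    With logits l = 2*arctanh(s), P = row-softmax(l), and s = Z1 @ Z2.T:
      dL/dl = (P - I) / N,   dl/ds = 2 / (1 - s^2).
    Clipped similarities receive zero gradient, mirroring np.clip.
    """
    s = Z1 @ Z2.T
    sc = np.clip(s, -1 + eps, 1 - eps)
    l = 2.0 * np.arctanh(sc)
    l = l - l.max(axis=1, keepdims=True)
    P = np.exp(l)
    P = P / P.sum(axis=1, keepdims=True)
    loss = -np.mean(np.log(P[np.arange(N), np.arange(N)]))
    G = (P - np.eye(N)) / N * (2.0 / (1.0 - sc ** 2))   # dL/ds
    G[np.abs(s) >= 1 - eps] = 0.0
    return loss, G @ Z2, G.T @ Z1

Z1 = normalize(rng.normal(size=(N, d)))
Z2 = normalize(rng.normal(size=(N, d)))
init_sim = np.mean(np.sum(Z1 * Z2, axis=1))
for _ in range(steps):
    loss, g1, g2 = h_infonce_loss_and_grads(Z1, Z2)
    Z1 = normalize(Z1 - lr * g1)
    Z2 = normalize(Z2 - lr * g2)
final_sim = np.mean(np.sum(Z1 * Z2, axis=1))   # paired rows end up aligned
```

With no temperature to schedule, the mean similarity of matched pairs rises from near zero toward saturation under plain gradient descent, illustrating the plug-in nature of the loss.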
7. Outlook and Potential Extensions
Temperature-Free H-InfoNCE offers a plug-in replacement for InfoNCE with robust gradient dynamics and total removal of temperature tuning across a wide spectrum of tasks. Its core innovation is replacing the scalar $1/\tau$ scaling with the monotonic, unbounded $2\,\arctanh(\cdot)$ mapping. A plausible implication is that H-InfoNCE could generalize to other self-supervised paradigms (e.g., cross-modal, masked prediction), though empirical validation and further theoretical study (convergence guarantees under stochastic dynamics) remain open for future research (Kim et al., 29 Jan 2025).