
SC-InfoNCE: Scaled Convergence in Contrastive Learning

Updated 7 June 2026
  • SC-InfoNCE is a contrastive learning method that generalizes InfoNCE by scaling feature alignment with a data-driven transition probability matrix.
  • It introduces a tunable scalar s to flexibly control clustering strength, balancing intra- and inter-cluster similarities based on augmentation dynamics.
  • Empirical results across vision, graph, and text domains demonstrate that moderate scaling improves representation fidelity and downstream accuracy.

Scaled Convergence InfoNCE (SC-InfoNCE) is a contrastive learning objective that generalizes the InfoNCE loss by introducing a tunable convergence target, thereby enabling flexible control over feature similarity alignment. Unlike standard InfoNCE, which promotes uniform clustering based on a constant target, SC-InfoNCE exploits a transition probability matrix (TPM) induced by data augmentations and scales it by a factor $s$ to modulate the influence of augmented view dynamics on learned representations. This framework yields a principled mechanism for tuning alignment strength in accordance with data statistics and downstream requirements (Cheng et al., 15 Nov 2025).

1. Recapitulation of the InfoNCE Objective and Transition Matrix Formalism

Let $\mathcal{D} = \{ x_1, \ldots, x_n \}$ be an unlabeled dataset, and let an augmentation distribution $T$ induce a finite feature space $S$ with cardinality $m = |S|$. The transition probability matrix (TPM) $T \in [0,1]^{m \times m}$ is defined as

$$T_{ij} = \Pr(\text{augmented view of feature } i \text{ equals feature } j).$$
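As a minimal sketch of how such a matrix could be estimated by sampling, the snippet below counts transitions on a toy discrete feature space. The augmentation `toy_augment` (a random step on a ring of `m` features) and the helper `estimate_tpm` are hypothetical names introduced only for illustration; real feature spaces are not enumerated this directly.

```python
import numpy as np

def toy_augment(i, m, rng):
    """Hypothetical augmentation: stay with prob. 0.6, else move to a ring neighbour."""
    step = rng.choice([-1, 0, 1], p=[0.2, 0.6, 0.2])
    return (i + step) % m

def estimate_tpm(m, n_samples, rng):
    """Monte Carlo estimate of T[i, j] = Pr(augmented view of feature i equals feature j)."""
    counts = np.zeros((m, m))
    for i in range(m):
        for _ in range(n_samples):
            counts[i, toy_augment(i, m, rng)] += 1
    return counts / n_samples  # each row is an empirical distribution

rng = np.random.default_rng(0)
T_hat = estimate_tpm(m=5, n_samples=2000, rng=rng)
```

Each row of `T_hat` sums to one by construction, and entries for features that the toy augmentation can never reach stay exactly zero.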

A parametric encoder $f_\theta$ produces embeddings $z_i = f_\theta(t_1(x_i))$ and $z_j = f_\theta(t_2(x_i))$ for independent augmentations $t_1, t_2$. Cosine similarity (typically after $\ell_2$-normalization) is used: $\mathrm{sim}(z_i, z_j) = z_i^\top z_j / (\lVert z_i \rVert \lVert z_j \rVert)$, with a temperature $\tau > 0$.

The predicted pairwise probability is

$$p_{ij} = \frac{\exp(\mathrm{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{N} \exp(\mathrm{sim}(z_i, z_k)/\tau)},$$

where $N$ is the batch size. The InfoNCE loss is

$$\mathcal{L}_{\text{InfoNCE}} = -\frac{1}{N} \sum_{i=1}^{N} \log p_{i\,i^{+}},$$

with $i^{+}$ the positive index for anchor $i$. In expectation, InfoNCE drives the predicted probabilities toward a constant target determined by the statistics of $T$, promoting uniform “clustering” in representation space (Cheng et al., 15 Nov 2025).
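The following NumPy sketch computes softmax probabilities and an InfoNCE loss for one batch. It assumes one common convention that the text does not confirm: row `i` of `z1` is the anchor, row `i` of `z2` is its positive, and the remaining rows of `z2` act as negatives. The function name `info_nce` and the default `tau=0.5` are illustrative choices.

```python
import numpy as np

def info_nce(z1, z2, tau=0.5):
    """InfoNCE over a batch: row i of z1 is the anchor, row i of z2 its positive."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)  # l2-normalise so dot = cosine
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / tau                              # (N, N) scaled similarities
    logits -= logits.max(axis=1, keepdims=True)           # numerical stability
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_p)), np.exp(log_p)

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
# Positives that are near-copies of the anchors versus unrelated "positives".
loss_aligned, p = info_nce(z, z + 0.01 * rng.normal(size=z.shape))
loss_random, _ = info_nce(z, rng.normal(size=z.shape))
```

With near-identical positives the loss falls well below the chance level $\log N$, whereas unrelated pairs stay close to it.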

2. SC-InfoNCE: Definition and Mathematical Formulation

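SC-InfoNCE extends InfoNCE by replacing the uniform convergence target with a scaled, data-driven target. A scalar $s$ is introduced to scale the TPM $T$ into a new target matrix, written $\tilde{T}$ here. The SC-InfoNCE objective $\mathcal{L}_{\text{SC}}$ aligns the predicted probabilities with this target, and can equivalently be expressed as a cross-entropy between the predicted probability matrix $P = (p_{ij})$ and the scaled TPM:

$$\mathcal{L}_{\text{SC}} = -\sum_{i,j} \tilde{T}_{ij} \log p_{ij}.$$

The gradient with respect to the similarity matrix entry $\ell_{ij} = \mathrm{sim}(z_i, z_j)/\tau$ follows from the softmax cross-entropy form: for a target $\tilde{T}$ whose rows sum to one, $\partial \mathcal{L}_{\text{SC}} / \partial \ell_{ij} = p_{ij} - \tilde{T}_{ij}$. At stationarity, $p_{ij} = \tilde{T}_{ij}$.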
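For a softmax cross-entropy against any target matrix whose rows sum to one, the gradient with respect to the logits is the predicted probability minus the target. The check below verifies this numerically with central finite differences. The matrix `target` is a random row-stochastic stand-in, not the actual scaled TPM, and all names are illustrative.

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(logits, target):
    """Sum over rows of the cross-entropy between target rows and softmax(logits) rows."""
    return -np.sum(target * np.log(softmax(logits)))

rng = np.random.default_rng(0)
target = rng.random((4, 4))
target /= target.sum(axis=1, keepdims=True)   # stand-in target with rows summing to one
logits = rng.normal(size=(4, 4))

analytic = softmax(logits) - target           # closed-form gradient w.r.t. the logits

# Central finite differences as an independent check of the closed form.
numeric = np.zeros_like(logits)
eps = 1e-6
for i in range(4):
    for j in range(4):
        d = np.zeros_like(logits)
        d[i, j] = eps
        numeric[i, j] = (cross_entropy(logits + d, target)
                         - cross_entropy(logits - d, target)) / (2 * eps)

max_err = np.abs(analytic - numeric).max()
```

Setting the logits to `np.log(target)` makes the analytic gradient vanish, which is the stationarity condition in this toy setting.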

3. Theoretical Properties and Feature Clustering

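Under the assumption of a sufficiently expressive encoder and an infinite data stream, any stationary point of SC-InfoNCE satisfies $p_{ij} = \tilde{T}_{ij}$ for all $i, j$, where $p_{ij}$ is the predicted probability and $\tilde{T}$ the scaled target matrix. Through the softmax link $p_{ij} \propto \exp(\mathrm{sim}(z_i, z_j)/\tau)$, this yields $\mathrm{sim}(z_i, z_j)/\tau = \log \tilde{T}_{ij} + c_i$ for a row-dependent constant $c_i$. Feature pairs with large $T_{ij}$ (frequent cross-augmentation) will attain higher similarity, naturally imposing a soft clustering structure where affinities are prescribed by $T$.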

The scaling parameter $s$ modulates the geometry: larger $s$ amplifies log differences in transition probabilities, facilitating cluster separation but risking mode collapse when $s$ is too large. Smaller $s$ sharpens sensitivity to local differences but may reduce inter-cluster distinctness. In downstream scenarios matching the co-occurrence pattern of $T$, proper normalization of the scaled target aligns pretraining geometry to test-time statistics (Cheng et al., 15 Nov 2025).
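To make the effect on log differences concrete, the sketch below assumes, purely for illustration, that the target is an element-wise power of the TPM renormalized per row. The text here does not spell out the exact scaling form, so `scaled_target` is a hypothetical helper rather than the paper's definition. Under that assumption, row-wise log ratios of the target are multiplied by the scaling parameter.

```python
import numpy as np

def scaled_target(T, s):
    """Hypothetical scaling: element-wise power T**s, renormalised so rows sum to one."""
    P = np.power(T, s)
    return P / P.sum(axis=1, keepdims=True)

# Toy row-stochastic transition matrix (illustrative values only).
T = np.array([[0.7, 0.2, 0.1],
              [0.2, 0.6, 0.2],
              [0.1, 0.3, 0.6]])

for s in (0.5, 1.0, 2.0):
    Ts = scaled_target(T, s)
    # Log ratio between the largest and smallest entry of the first row.
    gap = np.log(Ts[0, 0] / Ts[0, 2])
    print(f"s={s}: row 0 = {np.round(Ts[0], 3)}, log gap = {gap:.3f}")
```

In this toy form, values above one sharpen each row toward its dominant entry and values below one flatten it, which is one way to read the separation-versus-collapse trade-off described above.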

4. Algorithmic Implementation

The typical pipeline for SC-InfoNCE pretraining is:

  1. Estimate the transition matrix $T$ via Monte-Carlo simulation over augmentations.
  2. Form the target matrix by scaling $T$ with $s$.
  3. For each epoch and mini-batch:
    • Sample two augmentations per anchor and encode to obtain $2N$ representations for a batch of $N$ anchors.
    • Compute all pairwise similarities $\mathrm{sim}(z_i, z_j)/\tau$.
    • Compute softmax probabilities $p_{ij}$.
    • Calculate the cross-entropy loss between the predicted probabilities and the target matrix.
    • Backpropagate and update parameters $\theta$.
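The inner-loop steps above (similarities, softmax, cross-entropy, update) can be sketched on a toy problem. As a deliberate simplification, the encoder is replaced by a free matrix of logits that is updated directly, and `target` is a small hand-written row-stochastic matrix standing in for the scaled target; neither choice comes from the source.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Stand-in target matrix with rows summing to one (illustrative values only).
target = np.array([[0.7, 0.2, 0.1],
                   [0.2, 0.6, 0.2],
                   [0.1, 0.3, 0.6]])

logits = np.zeros_like(target)   # free parameters standing in for sim(z_i, z_j) / tau
lr = 1.0
losses = []
for step in range(500):
    p = softmax(logits)                              # softmax probabilities
    losses.append(-np.sum(target * np.log(p)))       # cross-entropy against the target
    logits -= lr * (p - target)                      # gradient step (closed-form gradient)

p_final = softmax(logits)
```

In this unconstrained setting the probabilities converge to the target and the logits match its logarithm up to a per-row constant. A real encoder constrains which logit matrices are reachable, so this only illustrates the idealized stationary point.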

Recommended hyperparameter ranges are:

  • Scaling factor $s$: […], with […] often a strong default.
  • Temperature $\tau$: […], where smaller values sharpen the output distribution.
  • Batch size: 256, 1024, 64, 256.
  • Learning rate and weight decay as used in base InfoNCE protocols.

5. Empirical Evaluation Across Domains

Experiments were performed using vision (CIFAR-10, CIFAR-100, STL-10, ImageNet-100; ResNet-50 pretrained for 200 epochs), graph (COLLAB, DD, NCI1, PROTEINS; 3-layer GCN), and text (STS-B, SICK-R; BERT-base on 1M Wikipedia sentences). Baselines included SCL, InfoNCE, DCL, DHEL, and f-MICL.

Performance was assessed through linear-probe accuracy. Representative results:

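| Dataset | Standard InfoNCE | SC-InfoNCE (best $s$) | Δ |
|---------|------------------|-----------------------|---|
| CIFAR-10 | 90.53 | 91.49 | +0.96 |
| CIFAR-100 | 50.90 | 51.95 | +1.05 |
| ImageNet-100 | 74.62 | 75.62 | +1.00 |
| STL-10 | 84.07 | 85.54 | +1.47 |
| COLLAB | 75.98 | 76.28 | +0.30 |
| DD | 73.92 | 75.89 | +1.97 |
| NCI1 | 75.38 | 75.72 | +0.34 |
| PROTEINS | 70.44 | 73.41 | +2.97 |
| STS-B | 74.95 | 77.64 | +2.69 |
| SICK-R | 73.87 | 75.48 | +1.61 |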

Ablation over $s$ on CIFAR-10 reveals:

| $s$ | 0.5 | 1.0 | 1.5 |
|---------|------|------|------|
| Acc. | 89.1 | 90.5 | 91.2 |

This suggests that moderate increases in $s$ can consistently improve performance; however, excessive scaling risks instability.

6. Practical Considerations and Limitations

Choosing $s$ should be guided by data characteristics:

  • For mild augmentations and low inter-class separation, increase $s$.
  • To mitigate embedding collapse (few large embedding covariance eigenvalues), decrease $s$.
  • For fine-grained tasks (e.g., STS-B), moderate increases in $s$ can enhance subtle representation fidelity.
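One way to monitor the collapse symptom mentioned above (a few large covariance eigenvalues) is to inspect the eigenvalue spectrum of the embedding covariance. The sketch below uses the participation ratio as an effective-rank summary. This diagnostic and the helper names are illustrative assumptions, not a procedure taken from the source.

```python
import numpy as np

def covariance_spectrum(Z):
    """Eigenvalues of the embedding covariance matrix, sorted in descending order."""
    Zc = Z - Z.mean(axis=0, keepdims=True)
    cov = Zc.T @ Zc / (Z.shape[0] - 1)
    return np.linalg.eigvalsh(cov)[::-1]

def effective_rank(eigvals):
    """Participation ratio (sum ev)^2 / sum ev^2; near 1 when one direction dominates."""
    eigvals = np.clip(eigvals, 0.0, None)   # guard against tiny negative round-off
    return eigvals.sum() ** 2 / np.sum(eigvals ** 2)

rng = np.random.default_rng(0)
healthy = rng.normal(size=(1000, 32))                               # isotropic embeddings
collapsed = rng.normal(size=(1000, 2)) @ rng.normal(size=(2, 32))   # rank-2 embeddings

r_healthy = effective_rank(covariance_spectrum(healthy))
r_collapsed = effective_rank(covariance_spectrum(collapsed))
```

An effective rank far below the embedding dimension, as in the rank-2 example, would be a signal to try a smaller scaling factor.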

Limitations include the necessity to estimate $T$, which can be challenging in high-dimensional settings; one may approximate $T$ among prototypes or clusters. The global scaling $s$ may not correct for class imbalance, and row- or time-adaptive scaling may be advantageous; this remains an open direction. Excessive $s$ may result in representation collapse if $T$ contains large entries (Cheng et al., 15 Nov 2025).
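A prototype-level approximation of the transition matrix could look like the following sketch: each sample and its augmented view are assigned to the nearest prototype, and transitions between prototype indices are counted. The prototypes, the Gaussian-jitter augmentation, and the function names are all hypothetical; in practice the prototypes might come from a clustering step that is not shown here.

```python
import numpy as np

def nearest_prototype(X, prototypes):
    """Index of the closest prototype (Euclidean distance) for each row of X."""
    d2 = ((X[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def prototype_tpm(X, X_aug, prototypes):
    """Row-normalised transition counts between prototype assignments of x and its view."""
    k = prototypes.shape[0]
    src = nearest_prototype(X, prototypes)
    dst = nearest_prototype(X_aug, prototypes)
    counts = np.zeros((k, k))
    np.add.at(counts, (src, dst), 1.0)          # accumulate repeated (src, dst) pairs
    row_sums = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)

rng = np.random.default_rng(0)
prototypes = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
labels = rng.integers(0, 3, size=600)
X = prototypes[labels] + 0.3 * rng.normal(size=(600, 2))
X_aug = X + 0.5 * rng.normal(size=X.shape)      # toy augmentation: additive Gaussian jitter
T_proto = prototype_tpm(X, X_aug, prototypes)
```

This reduces the matrix from the full feature space to a small grid over prototypes, at the cost of ignoring within-cluster structure.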

SC-InfoNCE enables principled, tunable alignment to augmentation-induced feature affinities, trading off inter- versus intra-cluster structure and yielding competitive results across multiple domains.

References (1)
