SC-InfoNCE: Scaled Convergence in Contrastive Learning
- SC-InfoNCE is a contrastive learning method that generalizes InfoNCE by scaling feature alignment with a data-driven transition probability matrix.
- It introduces a tunable scalar s to flexibly control clustering strength, balancing intra- and inter-cluster similarities based on augmentation dynamics.
- Empirical results across vision, graph, and text domains demonstrate that moderate scaling improves representation fidelity and downstream accuracy.
Scaled Convergence InfoNCE (SC-InfoNCE) is a contrastive learning objective that generalizes the InfoNCE loss by introducing a tunable convergence target, thereby enabling flexible control over feature similarity alignment. Unlike the standard InfoNCE, which promotes uniform clustering based on a constant target, SC-InfoNCE exploits a transition probability matrix (TPM) induced by data augmentations and scales it by a factor to modulate the influence of augmented view dynamics on learned representations. This framework yields a principled mechanism for tuning alignment strength in accordance with data statistics and downstream requirements (Cheng et al., 15 Nov 2025).
1. Recapitulation of the InfoNCE Objective and Transition Matrix Formalism
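Let $\mathcal{D}$ be an unlabeled dataset, and let an augmentation distribution $\mathcal{A}$ induce a finite feature space $\mathcal{X}$ with cardinality $|\mathcal{X}|$. The transition probability matrix (TPM) $P$ is defined over this space, with entry $P_{ij}$ reflecting how frequently augmentation connects features $i$ and $j$.
A parametric encoder $f_\theta$ produces embeddings $z_i = f_\theta(x_i)$ for independently augmented views $x_i$. Cosine similarity (typically after $\ell_2$-normalization) is used: $\mathrm{sim}(z_i, z_j) = z_i^\top z_j / (\lVert z_i \rVert \lVert z_j \rVert)$, with a temperature $\tau > 0$.
For a batch of $N$ anchors with two views each, the predicted pairwise probability is
$$q_{ij} = \frac{\exp(\mathrm{sim}(z_i, z_j)/\tau)}{\sum_{k \neq i} \exp(\mathrm{sim}(z_i, z_k)/\tau)},$$
where $N$ is the batch size and $k$ ranges over the $2N$ views. The InfoNCE loss is
$$\mathcal{L}_{\mathrm{InfoNCE}} = -\frac{1}{2N} \sum_{i=1}^{2N} \log q_{i, i^+},$$
with $i^+$ the positive index for anchor $i$. In expectation, InfoNCE drives $q_{ij}$ toward a constant determined by the statistics of the TPM, promoting uniform “clustering” in representation space (Cheng et al., 15 Nov 2025).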
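The standard InfoNCE computation described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names (`cosine`, `pairwise_probs`, `info_nce`), the toy embeddings, and the choice to exclude the anchor itself from the softmax denominator are assumptions made for this sketch.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pairwise_probs(z, tau):
    """Row-wise softmax over cosine similarities / tau, excluding the anchor itself."""
    n = len(z)
    probs = []
    for i in range(n):
        logits = [cosine(z[i], z[k]) / tau if k != i else -math.inf for k in range(n)]
        m = max(logits)                      # subtract the max for numerical stability
        exps = [math.exp(l - m) for l in logits]
        total = sum(exps)
        probs.append([e / total for e in exps])
    return probs

def info_nce(z, positives, tau=0.5):
    """Mean negative log-probability assigned to each anchor's positive view."""
    q = pairwise_probs(z, tau)
    return -sum(math.log(q[i][positives[i]]) for i in range(len(z))) / len(z)

# Toy batch (assumed): two anchors with two views each.
# Views 0/1 form one positive pair, views 2/3 the other.
z = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
positives = [1, 0, 3, 2]
loss = info_nce(z, positives, tau=0.5)
print(f"InfoNCE loss: {loss:.4f}")
```

Because the positives here are the most similar views, the loss is lower than it would be if the positive indices were mismatched, which is the alignment pressure InfoNCE exerts.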
2. SC-InfoNCE: Definition and Mathematical Formulation
SC-InfoNCE extends InfoNCE by replacing the uniform convergence target with a scaled, data-driven target. A scalar $s$ is introduced to form a new target matrix by scaling the TPM, and the SC-InfoNCE objective is defined against this target. The loss can equivalently be expressed as a cross-entropy between the predicted probability matrix and the scaled TPM.
The gradient with respect to each entry of the similarity matrix compares the predicted probability for that pair with the corresponding target entry; at stationarity, the two agree.
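To illustrate the cross-entropy view, the sketch below computes a generic soft-target cross-entropy for a single row of logits (similarities divided by the temperature) together with its analytic gradient. It is not the paper's exact objective: the target row is a hypothetical placeholder, and how the target is built from the TPM and the scaling factor is left to the source's definition. The gradient formula used, $(\sum_k t_k)\, q_j - t_j$, is the standard result for softmax cross-entropy with soft targets.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_target_ce(logits, target):
    """Cross-entropy -sum_j t_j * log q_j with q = softmax(logits)."""
    q = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target, q))

def soft_target_ce_grad(logits, target):
    """Analytic gradient w.r.t. the logits: (sum_k t_k) * q_j - t_j."""
    q = softmax(logits)
    mass = sum(target)
    return [mass * p - t for p, t in zip(q, target)]

logits = [0.3, -1.2, 0.8]   # one row of similarities / tau (illustrative values)
target = [0.6, 0.1, 0.3]    # hypothetical target row; here it sums to 1
grad = soft_target_ce_grad(logits, target)

# When softmax(logits) reproduces the target row, the gradient vanishes.
stationary_logits = [math.log(t) for t in target]
grad_at_stationary = soft_target_ce_grad(stationary_logits, target)
```

For a target row that sums to 1, the gradient is simply the predicted probability minus the target, so a stationary point is reached exactly when the two match.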
3. Theoretical Properties and Feature Clustering
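Under the assumption of a sufficiently expressive encoder and an infinite data stream, any stationary point of SC-InfoNCE has predicted probabilities that match the scaled target for all feature pairs $(i, j)$. Through the softmax link between similarities and probabilities, this fixes the pairwise similarities in terms of the target entries. Feature pairs with a large transition probability (frequent cross-augmentation) attain higher similarity, naturally imposing a soft clustering structure whose affinities are prescribed by the scaled TPM.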
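The softmax link can be made explicit with a short worked relation. The notation is assumed for this sketch: $q_{ij} \propto \exp(\mathrm{sim}_{ij}/\tau)$ denotes the predicted probability within row $i$, and $T$ stands for whatever target matrix the stationary point matches, so that $q_{ij} = T_{ij}$. Within a row, the normalizing constant cancels in ratios:

```latex
\frac{q_{ij}}{q_{ik}}
  = \exp\!\left(\frac{\mathrm{sim}_{ij} - \mathrm{sim}_{ik}}{\tau}\right)
\quad\Longrightarrow\quad
\mathrm{sim}_{ij} - \mathrm{sim}_{ik}
  = \tau \left(\log T_{ij} - \log T_{ik}\right).
```

Under these assumptions, similarity gaps within a row equal the temperature times the log-ratio of the corresponding target entries, which is why pairs with larger target values end up closer in representation space.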
The scaling parameter $s$ modulates the geometry: larger $s$ amplifies log differences in transition probabilities, facilitating cluster separation but risking mode collapse when $s$ is too large. Smaller $s$ sharpens sensitivity to local differences but may reduce inter-cluster distinctness. In downstream scenarios matching the co-occurrence pattern of the TPM, proper normalization of the scaled target aligns pretraining geometry to test-time statistics (Cheng et al., 15 Nov 2025).
4. Algorithmic Implementation
The typical pipeline for SC-InfoNCE pretraining is:
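- Estimate the transition matrix $P$ via Monte-Carlo simulation over augmentations.
- Form the target matrix by scaling $P$ with the factor $s$.
- For each epoch and mini-batch:
  - Sample two augmentations per anchor and encode to obtain $2N$ representations for a batch of $N$ anchors.
  - Compute all pairwise similarities $\mathrm{sim}(z_i, z_j)$.
  - Compute softmax probabilities $q_{ij}$ from the temperature-scaled similarities.
  - Calculate the cross-entropy loss between the predicted probabilities and the target matrix.
  - Backpropagate and update the encoder parameters $\theta$.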
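The first step of the pipeline, estimating the transition matrix by Monte-Carlo simulation, might look roughly like the sketch below. It rests on assumptions that are not taken from the source: the feature space is a small set of integer states, the matrix is estimated from co-occurrence counts of two independent augmentations of the same sample, and rows are normalized to sum to 1. The helper names `estimate_tpm` and `toy_augment` are hypothetical.

```python
import random

def estimate_tpm(samples, augment, num_states, draws_per_sample=2000, seed=0):
    """Monte-Carlo estimate of a row-stochastic co-occurrence matrix.

    Assumption: entry (a, b) approximates the probability that, given one
    augmented view lands in state a, an independent view of the same sample
    lands in state b.
    """
    rng = random.Random(seed)
    counts = [[0.0] * num_states for _ in range(num_states)]
    for x in samples:
        for _ in range(draws_per_sample):
            a = augment(x, rng)
            b = augment(x, rng)
            counts[a][b] += 1.0
            counts[b][a] += 1.0   # symmetrize: the pair of views is unordered
    tpm = []
    for row in counts:
        total = sum(row)
        tpm.append([c / total if total > 0 else 0.0 for c in row])
    return tpm

def toy_augment(x, rng):
    """Hypothetical augmentation on 4 states: stay put or hop to the paired state."""
    partner = x ^ 1               # states {0, 1} and {2, 3} form two clusters
    return x if rng.random() < 0.7 else partner

P_hat = estimate_tpm(samples=[0, 1, 2, 3], augment=toy_augment, num_states=4)
for row in P_hat:
    print([round(p, 3) for p in row])
```

In this toy setup the estimate is block-structured: states that augmentation never connects receive zero transition probability, which is the kind of structure the target matrix is meant to carry into training.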
Recommended hyperparameter ranges are:
- Scaling factor $s$, with a single recommended value often a strong default.
- Temperature $\tau$, where smaller values sharpen the output distribution.
- Batch size: ranges of 256–1024 and 64–256.
- Learning rate and weight decay as used in base InfoNCE protocols.
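The effect of the temperature on the output distribution can be checked numerically. The sketch below applies a temperature-scaled softmax to an illustrative row of cosine similarities and reports the peak probability and entropy; the similarity values and the temperatures tried are arbitrary choices for demonstration, not recommended settings.

```python
import math

def softmax_with_temperature(sims, tau):
    """Softmax over similarities divided by the temperature tau."""
    logits = [v / tau for v in sims]
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(p):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(x * math.log(x) for x in p if x > 0)

sims = [0.9, 0.5, 0.1, -0.3]   # illustrative cosine similarities for one anchor
for tau in (1.0, 0.5, 0.1):
    q = softmax_with_temperature(sims, tau)
    print(f"tau={tau}: max prob={max(q):.3f}, entropy={entropy(q):.3f}")
```

Lower temperatures concentrate probability mass on the most similar pair and reduce the entropy of the row, which is what "sharpening" refers to.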
5. Empirical Evaluation Across Domains
Experiments were performed using vision (CIFAR-10, CIFAR-100, STL-10, ImageNet-100; ResNet-50 pretrained for 200 epochs), graph (COLLAB, DD, NCI1, PROTEINS; 3-layer GCN), and text (STS-B, SICK-R; BERT-base on 1M Wikipedia sentences). Baselines included SCL, InfoNCE, DCL, DHEL, and f-MICL.
Performance was assessed through linear-probe accuracy. Representative results:
| Dataset | Std InfoNCE | SC-InfoNCE (best s) | Δ |
|---|---|---|---|
| CIFAR-10 | 90.53 | 91.49 | +0.96 |
| CIFAR-100 | 50.90 | 51.95 | +1.05 |
| ImageNet-100 | 74.62 | 75.62 | +1.00 |
| STL-10 | 84.07 | 85.54 | +1.47 |
| COLLAB | 75.98 | 76.28 | +0.30 |
| DD | 73.92 | 75.89 | +1.97 |
| NCI1 | 75.38 | 75.72 | +0.34 |
| PROTEINS | 70.44 | 73.41 | +2.97 |
| STS-B | 74.95 | 77.64 | +2.69 |
| SICK-R | 73.87 | 75.48 | +1.61 |
Ablation over $s$ on CIFAR-10 reveals:
| $s$ | 0.5 | 1.0 | 1.5 |
|---|---|---|---|
| Acc. | 89.1 | 90.5 | 91.2 |
This suggests that moderate increases in $s$ can consistently improve performance; however, excessive scaling risks instability.
6. Practical Considerations and Limitations
Choosing the scaling factor $s$ should be guided by data characteristics:
- For mild augmentations and low inter-class separation, increase $s$.
- To mitigate embedding collapse (few large embedding covariance eigenvalues), decrease $s$.
- For fine-grained tasks (e.g., STS-B), moderate increases in $s$ can enhance subtle representation fidelity.
Limitations include the need to estimate the TPM $P$, which can be challenging in high-dimensional settings; one may approximate $P$ among prototypes or clusters. The global scaling $s$ may not correct for class imbalance, and row- or time-adaptive scaling may be advantageous; this remains an open direction. Excessive $s$ may result in representation collapse if $P$ contains large entries (Cheng et al., 15 Nov 2025).
SC-InfoNCE enables principled, tunable alignment to augmentation-induced feature affinities, trading off inter- versus intra-cluster structure and yielding competitive results across multiple domains.