Multi-View InfoNCE Loss
- The paper introduces MV-InfoNCE, which unifies multiple view comparisons into a single loss term to reduce gradient conflicts.
- MV-InfoNCE aggregates all intra-instance positive interactions and inter-instance negatives, ensuring simultaneous alignment and uniformity of representations.
- Empirical results show that MV-InfoNCE outperforms pairwise methods, improving top-1 accuracy by up to 0.8% on benchmarks like ImageNet and CIFAR.
Multi-View InfoNCE (MV-InfoNCE) is a contrastive loss formulation designed for multi-view self-supervised learning scenarios, addressing limitations of conventional pairwise contrastive objectives when leveraging more than two data augmentations (views) per instance. MV-InfoNCE enables simultaneous alignment of all within-instance views and comprehensive modeling of cross-instance interactions, extending InfoNCE to a principled, single-term objective per instance with alignment and uniformity guarantees (Koromilas et al., 9 Jul 2025).
1. Problem Setup and Notation
Given a mini-batch of $N$ instances, each instance yields $M$ views via diverse stochastic augmentations. Formally, for input $x_i$, an encoder $f$ produces view-wise embeddings $z_i^{(1)}, \dots, z_i^{(M)}$, normalized to unit length for stability. These embeddings are indexed as $z_i^{(v)}$ in a tensor $Z \in \mathbb{R}^{N \times M \times d}$.
The similarity function is defined by scaled cosine similarity:

$$s(z, z') = \frac{z^{\top} z'}{\tau}$$

where $\tau > 0$ is a temperature parameter.
For each embedding $z_i^{(v)}$, the set of positives comprises the other views $z_i^{(v')}$, $v' \neq v$, of instance $i$, while negatives include all views $z_j^{(w)}$ from different instances $j \neq i$.
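The setup above can be sketched in NumPy. The encoder is stubbed out with random unit vectors, and the shapes ($N$, $M$, $d$) and temperature value are illustrative assumptions, not values from the paper:

```python
import numpy as np

# Illustrative shapes: N instances, M views each, d-dimensional embeddings.
rng = np.random.default_rng(0)
N, M, d, tau = 4, 3, 8, 0.5

# Stand-in for encoder outputs, L2-normalized to unit length for stability.
Z = rng.normal(size=(N, M, d))
Z /= np.linalg.norm(Z, axis=-1, keepdims=True)

def sim(a, b, tau=tau):
    """Scaled cosine similarity s(a, b) = a.b / tau for unit vectors."""
    return a @ b / tau

# A unit vector compared with itself has cosine similarity 1, so s = 1/tau.
print(sim(Z[0, 0], Z[0, 0]))
```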
2. MV-InfoNCE Loss Definition
MV-InfoNCE generalizes InfoNCE by aggregating all intra-instance view similarities and all inter-instance negatives into a single, per-instance loss. This reduces conflicting gradients and captures higher-order dependencies missed by pairwise summation. The core terms are:
- Positive sum:

$$\mathrm{Pos}_i = \sum_{v=1}^{M} \sum_{\substack{w=1 \\ w \neq v}}^{M} \exp\!\big(s(z_i^{(v)}, z_i^{(w)})\big)$$

- Negative sum:

$$\mathrm{Neg}_i = \sum_{j \neq i} \sum_{v=1}^{M} \sum_{w=1}^{M} \exp\!\big(s(z_i^{(v)}, z_j^{(w)})\big)$$

The MV-InfoNCE loss is then:

$$\mathcal{L}_{\text{MV-InfoNCE}} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\mathrm{Pos}_i}{\mathrm{Pos}_i + \mathrm{Neg}_i}$$
This structure ensures that every view for a given instance is encouraged to align with all other views of the same instance, while being uniformly separated from embeddings of different instances.
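The per-instance loss described above can be computed directly on a toy batch. This is a naive sketch of my own, with illustrative shapes and temperature rather than the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, d, tau = 3, 2, 4, 0.5   # tiny illustrative batch
Z = rng.normal(size=(N, M, d))
Z /= np.linalg.norm(Z, axis=-1, keepdims=True)

def instance_loss(Z, i, tau):
    """-log(Pos_i / (Pos_i + Neg_i)) for instance i."""
    N, M, _ = Z.shape
    # Positive sum: exp-similarities over distinct intra-instance view pairs.
    pos = sum(np.exp(Z[i, v] @ Z[i, w] / tau)
              for v in range(M) for w in range(M) if v != w)
    # Negative sum: exp-similarities against every view of every other instance.
    neg = sum(np.exp(Z[i, v] @ Z[j, w] / tau)
              for j in range(N) if j != i
              for v in range(M) for w in range(M))
    return -np.log(pos / (pos + neg))

batch_loss = np.mean([instance_loss(Z, i, tau) for i in range(N)])
print(batch_loss)
```

Because the positive sum is a strict subset of the denominator, each per-instance term is a positive scalar.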
3. Capturing All View Interactions
MV-InfoNCE structurally differs from conventional multi-view approaches, which typically aggregate $M(M-1)$ separate pairwise InfoNCE terms across view pairs. Instead, MV-InfoNCE consolidates interaction modeling into a single term per instance:
- One Loss Term per Instance: Each instance $i$ contributes one global loss term, eliminating conflicts arising from multiple, overlapping pairwise losses.
- Simultaneous Alignment: The positive sum $\mathrm{Pos}_i$ encompasses all intra-instance view pairs, requiring the encoder to align every view simultaneously.
- Comprehensive Negative Energy: The negative sum $\mathrm{Neg}_i$ incorporates all view interactions with other instances, maximizing uniformity across the batch.
This joint treatment yields an objective that forces holistic alignment and uniformity, rather than piecewise pairwise objectives that may introduce suboptimal local minima or miss collective dependencies (Koromilas et al., 9 Jul 2025).
4. Theoretical Characterization
In the large-batch regime ($N \to \infty$), the MV-InfoNCE objective asymptotically decomposes into alignment and uniformity penalties:

$$\lim_{N \to \infty} \mathcal{L}_{\text{MV-InfoNCE}} = \mathcal{L}_{\text{align}} + \mathcal{L}_{\text{uniform}}$$

where, schematically, $\mathcal{L}_{\text{align}} \propto -\mathbb{E}_x \big[ \sum_{v \neq v'} s(z_x^{(v)}, z_x^{(v')}) \big]$ rewards similarity among same-instance views and $\mathcal{L}_{\text{uniform}} \propto \mathbb{E}_x \big[ \log \mathbb{E}_{x'} \sum_{v, w} \exp\!\big(s(z_x^{(v)}, z_{x'}^{(w)})\big) \big]$ penalizes concentration of the embedding distribution.
The first term penalizes lack of alignment among same-instance views, while the second encourages the global embedding set to be uniformly distributed on the sphere. Global minimization is achieved when all within-instance views are identical (alignment) and all representations are distributed according to the uniform hyperspherical distribution (uniformity).
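The alignment side of this claim can be sanity-checked numerically: a batch whose within-instance views are identical should incur a lower loss than one whose views are independent random directions. This is an illustrative experiment of my own, not one from the paper:

```python
import numpy as np

def mv_loss(Z, tau=0.5):
    """Mean MV-InfoNCE loss over a batch (naive loops, clarity over speed)."""
    N, M, _ = Z.shape
    total = 0.0
    for i in range(N):
        pos = sum(np.exp(Z[i, v] @ Z[i, w] / tau)
                  for v in range(M) for w in range(M) if v != w)
        neg = sum(np.exp(Z[i, v] @ Z[j, w] / tau)
                  for j in range(N) if j != i
                  for v in range(M) for w in range(M))
        total += -np.log(pos / (pos + neg))
    return total / N

rng = np.random.default_rng(2)
N, M, d = 8, 4, 16

# Perfectly aligned: all M views of an instance collapse to one unit vector.
base = rng.normal(size=(N, 1, d))
base /= np.linalg.norm(base, axis=-1, keepdims=True)
Z_aligned = np.repeat(base, M, axis=1)

# No alignment: every view is an independent random direction.
Z_random = rng.normal(size=(N, M, d))
Z_random /= np.linalg.norm(Z_random, axis=-1, keepdims=True)

print(mv_loss(Z_aligned), mv_loss(Z_random))  # aligned loss is lower
```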
5. Comparison with Two-View InfoNCE
Contrasts between MV-InfoNCE and traditional pairwise (two-view) InfoNCE frameworks are summarized as follows:
| Aspect | Two-View InfoNCE | MV-InfoNCE |
|---|---|---|
| Loss Terms per Instance | 2 | 1 |
| Computational Order | $O(N^2)$ | $O(N^2 M^2)$ |
| Positive Interactions | Pairwise only | All cross-view pairings |
| Gradient Symmetry | View-of-interest asymmetry | Fully symmetric |
Pairwise objectives yield $M(M-1)$ terms per instance, each focusing on a particular anchor view, resulting in potential gradient interference and incomplete modeling of higher-order dependencies. MV-InfoNCE unifies all positive interactions and negatives, avoids the view-of-interest distinction, and captures all higher-order effects in a single term per instance (Koromilas et al., 9 Jul 2025).
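For contrast, here is a naive sketch of my own of the conventional pairwise-aggregation baseline, in which each of the $M(M-1)$ ordered view pairs per instance contributes a separate InfoNCE term with a single positive:

```python
import numpy as np

def pairwise_infonce(Z, tau=0.5):
    """Baseline: one InfoNCE term per ordered view pair (v anchor, w positive),
    averaged. Yields M*(M-1) terms per instance, versus MV-InfoNCE's one."""
    N, M, _ = Z.shape
    flat = Z.reshape(N * M, -1)             # all N*M embeddings, row i*M+v
    sims = np.exp(flat @ flat.T / tau)      # pairwise exp-similarities
    total, terms = 0.0, 0
    for i in range(N):
        for v in range(M):
            a = i * M + v
            for w in range(M):
                if w == v:
                    continue
                p = sims[a, i * M + w]                  # the single positive
                negs = sum(sims[a, j * M + u]           # cross-instance negatives
                           for j in range(N) if j != i
                           for u in range(M))
                total += -np.log(p / (p + negs))
                terms += 1
    return total / terms, terms

rng = np.random.default_rng(3)
Z = rng.normal(size=(4, 3, 8))
Z /= np.linalg.norm(Z, axis=-1, keepdims=True)
loss, n_terms = pairwise_infonce(Z)
print(n_terms)  # 4 instances x 3*2 ordered view pairs = 24 terms
```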
6. Algorithmic Implementation
Efficient implementation of MV-InfoNCE computes all pairwise similarities once, then, for each instance, accumulates the positive and negative energy sums and computes the log-ratio, averaged across the batch.
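This procedure can be sketched as a vectorized NumPy function. It is a reconstruction under my own notation, not the authors' released code:

```python
import numpy as np

def mv_infonce(Z, tau=0.5):
    """MV-InfoNCE over embeddings Z of shape (N, M, d): accumulate
    per-instance positive/negative energies, take the log-ratio,
    and average across the batch."""
    N, M, d = Z.shape
    flat = Z.reshape(N * M, d)
    # All pairwise exp-similarities, viewed as (instance, view) blocks.
    E = np.exp(flat @ flat.T / tau).reshape(N, M, N, M)
    inst = np.arange(N)
    intra = E[inst, :, inst, :]            # (N, M, M) same-instance blocks
    # Positive energy: all distinct intra-instance view pairs (drop v == w).
    pos = intra.sum(axis=(1, 2)) - np.trace(intra, axis1=1, axis2=2)
    # Negative energy: every cross-instance view interaction.
    neg = E.sum(axis=(1, 2, 3)) - intra.sum(axis=(1, 2))
    return float(np.mean(-np.log(pos / (pos + neg))))

rng = np.random.default_rng(0)
Z = rng.normal(size=(4, 3, 16))
Z /= np.linalg.norm(Z, axis=-1, keepdims=True)
print(mv_infonce(Z))
```

Computing the full exp-similarity matrix once and masking out the same-instance block keeps the whole loss to a single matrix multiply plus reductions.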
7. Empirical Evaluation and Scaling Behavior
MV-InfoNCE achieves superior performance relative to pairwise-aggregation baselines as the number of views increases:
- Linear Evaluation Protocols: On CIFAR-10/100, ImageNet-100, and ImageNet-1K, MV-InfoNCE consistently surpasses pairwise objectives, with top-1 accuracy improvements of up to 0.8%.
- Scaling with View Number: Unlike conventional approaches that saturate or degrade as further views are added, MV-InfoNCE yields continued accuracy and embedding-geometry improvements at higher view counts.
- Embedding Quality: k-Nearest Neighbor classification and neighborhood-separability metrics indicate more uniform and better-aligned representation spaces as the number of views $M$ increases.
MV-InfoNCE's empirical scaling properties underscore its suitability for high-multiplicity view regimes, both in unimodal and multimodal settings (Koromilas et al., 9 Jul 2025).