Multi-Similarity Contrastive Learning
- Multi-Similarity Contrastive Learning (MSCon) is a supervised framework that uses multiple categorical attributes to form diverse similarity metrics for representation learning.
- The method employs distinct projection heads and an uncertainty-based weighting mechanism to adaptively balance noisy or unreliable similarity signals.
- Empirical results demonstrate that MSCon significantly improves both in-domain and out-of-domain performance compared to traditional contrastive learning methods.
Multi-Similarity Contrastive Learning (MSCon) is a supervised representation learning framework designed to address the limitations of contrastive methods that optimize with respect to only a single similarity relation. In datasets where examples are annotated along multiple categorical attributes—each inducing a unique notion of similarity—MSCon leverages supervision from all available similarity metrics. The method introduces per-metric projection heads and integrates a principled uncertainty-based weighting mechanism, resulting in improved generalization, particularly for out-of-domain tasks and settings with noisy or unreliable similarity information (Mu et al., 2023).
1. Motivation and Problem Statement
Traditional contrastive learning frameworks such as SimCLR and SupCon assume a single notion of similarity (e.g., class membership) for forming positive and negative pairs in representation space. This approach is suboptimal in real-world datasets where each instance can simultaneously possess multiple attributes (e.g., category, style, gender), each defining a distinct relational structure among data points. Simply aggregating multiple supervised contrastive losses by summation assumes equal task reliability and can degrade generalization, especially when some metrics are noisy. MSCon resolves this by learning a separate projection for each metric and adaptively down-weighting uncertain similarity tasks, resisting overfitting due to corrupted or ambiguous attributes.
2. Formal Specification
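Let the dataset consist of examples $x_i$ with multi-relational annotation $(y_i^1, \dots, y_i^C)$, where each $y_i^c$ is a discrete label for attribute $c$ ($c = 1, \dots, C$), inducing a distinct similarity metric. The architecture comprises a shared encoder $f_\theta$, mapping $x_i$ to $h_i = f_\theta(x_i)$. For each metric $c$, a distinct projection head $g_c$ is followed by $\ell_2$-normalization to produce $z_i^c = g_c(h_i) / \lVert g_c(h_i) \rVert_2$. Positives under metric $c$ for anchor $i$ are $P_c(i) = \{p \neq i : y_p^c = y_i^c\}$; negatives, $N_c(i) = \{n : y_n^c \neq y_i^c\}$. The pairwise similarity is $s_{ij}^c = z_i^c \cdot z_j^c$.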
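As a minimal sketch of how each attribute induces its own positive and negative sets, the snippet below splits a toy batch by agreement with an anchor on one attribute. The attribute names, the labels, and the helper `positive_negative_sets` are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy annotations: one attribute -> label mapping per example.
LABELS: List[Dict[str, str]] = [
    {"category": "boot", "gender": "women"},
    {"category": "boot", "gender": "men"},
    {"category": "sandal", "gender": "women"},
    {"category": "sandal", "gender": "men"},
]


def positive_negative_sets(
    labels: List[Dict[str, str]], anchor: int, metric: str
) -> Tuple[Set[int], Set[int]]:
    """Split the other batch indices by agreement with the anchor on one attribute."""
    target = labels[anchor][metric]
    positives = {j for j, y in enumerate(labels) if j != anchor and y[metric] == target}
    negatives = {j for j, y in enumerate(labels) if y[metric] != target}
    return positives, negatives


# The same anchor has different positives under different metrics.
pos_cat, neg_cat = positive_negative_sets(LABELS, 0, "category")
pos_gen, neg_gen = positive_negative_sets(LABELS, 0, "gender")
print(sorted(pos_cat), sorted(neg_cat))  # [1] [2, 3]
print(sorted(pos_gen), sorted(neg_gen))  # [2] [1, 3]
```

Example 1 is a positive for anchor 0 under category but a negative under gender, which is the situation a single-similarity loss cannot represent.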
3. Multi-Similarity Contrastive Loss Definition
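For a given similarity metric $c$ and anchor $i$, the supervised contrastive loss is

$$\mathcal{L}_c(i) = -\frac{1}{|P_c(i)|} \sum_{p \in P_c(i)} \log \frac{\exp\left(z_i^c \cdot z_p^c / \tau\right)}{\sum_{a \neq i} \exp\left(z_i^c \cdot z_a^c / \tau\right)},$$

where $1/\tau$ is the inverse temperature, $z_i^c$ is the normalized projection of example $i$ under metric $c$, and $P_c(i)$ is the set of positives for anchor $i$. The complete MSCon loss for a batch $B$ is a weighted sum over metrics: $\mathcal{L}_{\mathrm{MSCon}} = \sum_{c=1}^{C} w_c \sum_{i \in B} \mathcal{L}_c(i)$, with learnable, nonnegative metric weights $w_c \geq 0$. Alternatively, this is expressed over positive and negative index pairs to clarify contributions per metric.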
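The per-metric loss and its weighted sum can be sketched in plain Python as follows. This assumes the usual SupCon form (softmax over all other batch elements, averaged over the positives of each anchor). Averaging over anchors rather than summing, the default `tau=0.1`, and the function names are assumptions of this sketch, not details taken from the source.

```python
import math
from typing import List, Sequence


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def supcon_loss(z: Sequence[Sequence[float]], labels: Sequence[int], tau: float = 0.1) -> float:
    """Supervised contrastive loss for ONE metric, averaged over anchors.

    Anchors with no positive in the batch are skipped.
    """
    z = [l2_normalize(v) for v in z]
    n = len(z)
    per_anchor = []
    for i in range(n):
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue
        # Temperature-scaled similarities to every other example in the batch.
        logits = {a: sum(x * y for x, y in zip(z[i], z[a])) / tau
                  for a in range(n) if a != i}
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits.values())
        log_denom = m + math.log(sum(math.exp(v - m) for v in logits.values()))
        per_anchor.append(-sum(logits[p] - log_denom for p in positives) / len(positives))
    return sum(per_anchor) / len(per_anchor) if per_anchor else 0.0


def mscon_loss(z_per_metric, labels_per_metric, weights, tau: float = 0.1) -> float:
    """Weighted sum of per-metric losses; each metric has its own projections."""
    return sum(w * supcon_loss(z, y, tau)
               for z, y, w in zip(z_per_metric, labels_per_metric, weights))


# Toy batch: the same 2-D projections are reused for both metrics for brevity.
z = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
aligned = [0, 0, 1, 1]      # labels that match the geometry
misaligned = [0, 1, 0, 1]   # labels that contradict it
loss_aligned = supcon_loss(z, aligned)
loss_misaligned = supcon_loss(z, misaligned)
total = mscon_loss([z, z], [aligned, misaligned], weights=[1.0, 0.5])
print(loss_aligned < loss_misaligned)  # True
```

In a real pipeline each metric would receive projections from its own head; reusing `z` here only keeps the example short.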
4. Uncertainty-Based Weighting Mechanism
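MSCon introduces an uncertainty parameter $\sigma_c$ for each metric $c$, controlling the effective temperature as $\tau_c = \tau \sigma_c^2$. The learning objective is justified via a pseudo-likelihood formulation for each metric, defined for anchor $i$ with positives $P_c(i)$ and normalized projections $z^c$:

$$p_c(i) = \frac{1}{|P_c(i)|} \sum_{p \in P_c(i)} \frac{\exp\left(z_i^c \cdot z_p^c / (\tau \sigma_c^2)\right)}{\sum_{a \neq i} \exp\left(z_i^c \cdot z_a^c / (\tau \sigma_c^2)\right)}.$$

Maximizing this pseudo-likelihood yields, up to Jensen's inequality, the standard supervised contrastive loss. Writing $\mathcal{L}_c$ for that loss on metric $c$ at temperature $\tau$, the batch negative log pseudo-likelihood for metric $c$ is then approximately

$$-\log p_c \approx \frac{1}{\sigma_c^2} \mathcal{L}_c + 2 \log \sigma_c.$$

Hence, the joint MSCon objective to be minimized is

$$\min_{\theta, \{g_c\}, \{\sigma_c\}} \; \sum_{c=1}^{C} \left( \frac{1}{\sigma_c^2} \mathcal{L}_c + 2 \log \sigma_c \right),$$

with $w_c = 1/\sigma_c^2$. The $\log \sigma_c$ penalty prevents the degenerate solution in which every $\sigma_c$ grows without bound and all weights collapse to zero.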
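To make the trade-off between the weighted loss and the log-penalty concrete, here is a small sketch that assumes the Kendall-style form $\sum_c (\mathcal{L}_c / \sigma_c^2 + 2 \log \sigma_c)$. The constant on the log term and the function names are assumptions. For a fixed positive loss, setting the derivative with respect to $\sigma_c$ to zero gives $\sigma_c^2 = \mathcal{L}_c$, so the implied weight is $1/\mathcal{L}_c$.

```python
import math
from typing import List, Sequence


def mscon_objective(losses: Sequence[float], sigmas: Sequence[float]) -> float:
    """Uncertainty-weighted objective: sum_c (L_c / sigma_c^2 + 2 log sigma_c)."""
    return sum(loss / s ** 2 + 2.0 * math.log(s) for loss, s in zip(losses, sigmas))


def metric_weights(sigmas: Sequence[float]) -> List[float]:
    """Weights implied by the uncertainties: w_c = 1 / sigma_c^2."""
    return [1.0 / s ** 2 for s in sigmas]


def optimal_sigma(loss: float) -> float:
    """Stationary point of L / sigma^2 + 2 log sigma for a fixed loss L > 0.

    d/d sigma = -2 L / sigma^3 + 2 / sigma = 0  =>  sigma^2 = L.
    """
    return math.sqrt(loss)


# Hypothetical per-metric losses: the second metric is harder (e.g. noisier).
losses = [0.5, 4.0]
sigmas = [optimal_sigma(l) for l in losses]
print([round(w, 3) for w in metric_weights(sigmas)])  # [2.0, 0.25]
```

The higher-loss metric ends up with the smaller weight, while the log term keeps the minimizer finite instead of letting every weight shrink to zero.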
5. Optimization Procedure and Implementation
The training regime consists of the following steps:
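- Initialize encoder parameters $\theta$, projection heads $g_1, \dots, g_C$, and uncertainties $\sigma_1, \dots, \sigma_C$.
- For each epoch and minibatch:
  - Encode and project inputs per metric.
  - Determine positive and negative sets for each metric $c$.
  - Compute the contrastive loss $\mathcal{L}_c$ per metric, using the current $\sigma_c$ for temperature scaling.
  - Sum weighted losses and log-penalty: $\sum_c \left( \mathcal{L}_c / \sigma_c^2 + 2 \log \sigma_c \right)$.
  - Backpropagate and update $\theta$, $g_c$, and $\sigma_c$ (SGD or Adam).
- After training, discard the projection heads $g_c$; use the encoder $f_\theta$ for downstream tasks.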
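The uncertainty part of the update step can be illustrated in isolation. The sketch below holds the per-metric losses fixed (a simplification, since in real training the encoder and heads change them at every step) and runs plain gradient descent on $s_c = \log \sigma_c^2$. That parameterization is a common choice and an assumption here, as are the learning rate and the helper name `update_log_variances`. Under it the objective per metric is $e^{-s_c} \mathcal{L}_c + s_c$.

```python
import math
from typing import List, Sequence


def update_log_variances(
    losses: Sequence[float],
    log_vars: Sequence[float],
    lr: float = 0.1,
    steps: int = 500,
) -> List[float]:
    """Gradient descent on s_c = log(sigma_c^2) with per-metric losses held fixed.

    Per-metric objective: exp(-s_c) * L_c + s_c, with gradient 1 - exp(-s_c) * L_c.
    """
    s = list(log_vars)
    for _ in range(steps):
        for c, loss in enumerate(losses):
            grad = 1.0 - math.exp(-s[c]) * loss
            s[c] -= lr * grad
    return s


# Hypothetical losses: metric 1 is noisier and therefore harder to fit.
losses = [0.5, 4.0]
log_vars = update_log_variances(losses, [0.0, 0.0])  # equal initialization
weights = [math.exp(-s) for s in log_vars]           # w_c = 1 / sigma_c^2
print([round(w, 3) for w in weights])  # [2.0, 0.25]
```

Starting from equal uncertainties, the weight of the higher-loss metric shrinks and the other grows, which is the adaptive behavior the training procedure relies on. A full implementation would update these parameters jointly with the encoder and heads through automatic differentiation.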
Key implementation details include:
- Normalizing outputs onto the unit sphere.
- Initializing all uncertainties $\sigma_c$ equally for stable early learning.
- Standard augmentations (random crop, flip, color jitter).
- Recommended optimizer and hyperparameters: SGD (momentum 0.9), learning rate 0.05, batch size 64, 200 training epochs, and projection dimension 32 (small datasets) or 64 (large); the temperature and weight decay are fixed hyperparameters whose values follow Mu et al. (2023).
6. Empirical Results and Comparative Analysis
MSCon has been empirically validated on multi-relational benchmarks:
- Zappos50k: 50K shoe images labeled by category (4), closure style (5), gender (4); held-out task: brand (20 classes). Encoder: ResNet-18, projection heads (32-dim).
- MEDIC: 71K disaster images annotated for damage severity (3), disaster type (7), humanitarian relevance (4), informativeness (2). Held-out: one metric at a time. Encoder: ResNet-50, projection heads (64-dim).
- All encoders pretrained on ImageNet and fine-tuned with MSCon; embeddings evaluated via frozen linear classifiers.
MSCon outperformed single-task and multi-task cross-entropy, SimCLR, SupCon, and Conditional Similarity Networks with triplet loss. In-domain top-1 accuracy (mean±std over 1,000 bootstrap trials), Zappos50k tasks: Category 97.17±0.27, Closure 94.37±0.35, Gender 85.98±0.56. Out-of-domain (Zappos brand): 42.62±1.52 vs. 32.10±1.48 for the best cross-entropy multi-task. On MEDIC hold-out, MSCon led or matched state-of-the-art except for the binary informativeness task (85.22±0.30 vs 86.18±0.30).
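The mean±std figures above are reported over bootstrap trials. The exact resampling protocol is not spelled out here, so the following is only one standard way to obtain such an estimate: resample the per-example correctness indicators with replacement and summarize the resulting accuracies. The function name, the seed, and the toy data are hypothetical.

```python
import random
import statistics
from typing import Sequence, Tuple


def bootstrap_accuracy(
    correct: Sequence[int], n_trials: int = 1000, seed: int = 0
) -> Tuple[float, float]:
    """Mean and standard deviation of top-1 accuracy (%) over bootstrap resamples.

    `correct` holds 1 for a correctly classified test example and 0 otherwise.
    """
    rng = random.Random(seed)
    n = len(correct)
    accuracies = []
    for _ in range(n_trials):
        sample = rng.choices(correct, k=n)  # resample with replacement
        accuracies.append(100.0 * sum(sample) / n)
    return statistics.mean(accuracies), statistics.stdev(accuracies)


# Hypothetical test set of 100 examples with 90 correct predictions.
correct = [1] * 90 + [0] * 10
mean_acc, std_acc = bootstrap_accuracy(correct)
print(f"{mean_acc:.2f} ± {std_acc:.2f}")
```

With only 100 test examples the spread is a few percentage points; the much smaller deviations in the reported results are what one would expect from larger test sets.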
Ablation studies on label corruption show that, as an increasing fraction of a metric's labels is corrupted, the learned weight for that metric decays toward zero, preserving performance except when all metrics are corrupted. Fixed-weight MSCon collapses under maximum corruption, indicating the efficacy of adaptive weighting.
7. Theoretical Foundations and Analysis
By supervising with respect to all available similarity metrics, MSCon drives the encoder to capture factors common to the relational structures present in the data. The uncertainty-based weighting mechanism is theoretically justified by pseudo-likelihood maximization under a task-specific noise model; weights correspond to maximum likelihood under Gaussian noise assumptions. The logarithmic penalty on each metric's uncertainty ensures non-trivial uncertainty estimates and prevents the trivial solution in which a metric's uncertainty grows without bound and its weight vanishes. Empirically, the learned weights respond dynamically to signal quality, effectively rejecting noisy or less-informative relational labels.
Ablation studies indicate that introducing additional similarity metrics, even if some are noisy, does not degrade performance provided adaptive metric weights are learned. The optimal temperature parameter was found to be robust across tasks. Analysis of weight dynamics demonstrates that the learned weight of a corrupted metric generally decays linearly with increasing corruption ratio in a synthetic task corruption setup.
8. Practical Considerations
MSCon is readily implemented atop standard deep metric learning pipelines. For each new similarity metric, a new projection head must be instantiated; however, only the encoder is retained for downstream applications after pretraining. Projected vectors should be $\ell_2$-normalized to the unit sphere. Batch size and the quality of augmentations critically affect contrastive sample diversity; larger batches are beneficial. Hyperparameter tuning for the temperature parameter and careful initialization of uncertainties are essential for stability. Linear probing is used for evaluation to isolate representation quality.
MSCon improves both in-domain and out-of-domain generalization, especially where true underlying tasks are not fully captured by any single similarity metric. Its uncertainty-based weighting renders it robust to overfitting from noisy relational information, making it suitable for complex, multi-relational datasets in vision and beyond (Mu et al., 2023).