Contrastive Learning-Based Model
- Contrastive Learning-Based Model is a framework that learns data representations by maximizing similarity between positive pairs and minimizing similarity between negatives.
- Recent multi-level variants such as MLCL use multiple projection heads to capture diverse semantic aspects, making the framework effective for hierarchical and multi-label tasks in vision and text.
- Empirical studies show this approach outperforms traditional methods in accuracy and robustness, especially in low-data and noisy-label scenarios.
A contrastive learning-based model is an architectural and algorithmic paradigm that learns data representations by contrasting “positive” and “negative” examples in a latent space. This framework explicitly maximizes the similarity between representations of semantically related (“positive”) sample pairs while minimizing the similarity between unrelated (“negative”) pairs. The approach has become foundational in self-supervised and supervised representation learning across domains as varied as computer vision, natural language processing, graph learning, and bioinformatics. Modern research increasingly explores multiple levels of semantic similarity, compositional augmentations, and task-adaptive contrast strategies, as exemplified by the Multi-level Supervised Contrastive Learning (MLCL) framework (Ghanooni et al., 4 Feb 2025).
1. General Principles of Contrastive Learning
The canonical contrastive objective constructs two (or more) “views” of a data sample—often through data augmentation or modality variation—and trains an encoder to produce high-dimensional representations that are close for positive pairs and distant for negatives. Formally, for a batch of data points with $2N$ views (two per example), and representations $z_i$ on the unit hypersphere, the InfoNCE loss (in its multi-positive, supervised form) is defined as:

$$
\mathcal{L} = \sum_{i=1}^{2N} \frac{-1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp\big(\mathrm{sim}(z_i, z_p)/\tau\big)}{\sum_{a \neq i} \exp\big(\mathrm{sim}(z_i, z_a)/\tau\big)},
$$

where $P(i)$ denotes the indices of positives for anchor $i$, $\tau$ is the temperature, and $\mathrm{sim}(\cdot,\cdot)$ is typically angular (cosine) similarity.
Traditional frameworks, including SimCLR and SupCon, define positive pairs according to instance- or class-level identity, and typically employ a single projection head. However, real data often exhibits labeling ambiguities, multi-label structure, or hierarchical organization that cannot be adequately captured by a single notion of similarity or a single embedding subspace.
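As a concrete illustration of this objective, the sketch below implements a SupCon-style loss in PyTorch under the assumptions above (L2-normalized embeddings, cosine similarity, positives defined by shared labels). The function name, signature, and default temperature are illustrative, not taken from any particular implementation.

```python
import torch
import torch.nn.functional as F

def supcon_loss(z: torch.Tensor, labels: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """SupCon-style loss. z: (B, d) embeddings; labels: (B,) integer labels.

    Augmented views of the same sample appear as separate rows with the same label.
    """
    z = F.normalize(z, dim=1)                                   # project onto unit hypersphere
    sim = (z @ z.T) / tau                                       # temperature-scaled cosine similarity
    self_mask = torch.eye(z.size(0), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float('-inf'))             # exclude self-pairs from the denominator
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)  # log-softmax over all other views
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    counts = pos_mask.sum(dim=1).clamp(min=1)                   # avoid division by zero
    per_anchor = -log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1) / counts
    return per_anchor[pos_mask.any(dim=1)].mean()               # average over anchors with >= 1 positive
```

When each anchor has exactly one positive (the other augmented view of the same instance), this reduces to the instance-level InfoNCE objective used by SimCLR.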
2. Multi-Level Supervised Contrastive Learning (MLCL) Approach
The MLCL framework (Ghanooni et al., 4 Feb 2025) generalizes standard supervised contrastive learning by introducing separate projection heads (“multi-level heads”), each tailored to a distinct semantic aspect or label granularity. This enables the model to simultaneously encode multiple types of similarity (e.g., fine-grained subclasses, coarse superclasses, or various label aspects in multi-label scenarios) in parallel subspaces.
Model overview (a minimal architecture sketch follows this list):
- Backbone encoder $f(\cdot)$:
- Image domain: ResNet-50 (up to global pooling, output dim 2048).
- Text domain: BERT-base (hidden dim 512).
- Projection heads $h_1, \dots, h_K$:
- Each $h_k$ is a two-layer MLP mapping the encoder output to a 128-dimensional normalized vector.
- Semantic specialization:
- For hierarchical multi-class tasks (e.g., CIFAR-100): $K = 2$ (one head for subclasses, one for superclasses).
- For multi-label tasks: $K = L + 1$ (one head per label plus a global head for intersectional similarity).
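A minimal PyTorch sketch of this architecture, assuming a generic pooled-feature backbone and $K$ two-layer MLP heads that emit 128-dimensional normalized embeddings. The class name, the 512-unit hidden width inside each head, and the constructor arguments are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadContrastiveModel(nn.Module):
    """Backbone encoder followed by K projection heads, one per semantic level."""

    def __init__(self, backbone: nn.Module, feat_dim: int, num_heads: int,
                 proj_dim: int = 128, hidden_dim: int = 512):
        super().__init__()
        self.backbone = backbone
        # One two-layer MLP head per semantic level (e.g., subclass / superclass).
        self.heads = nn.ModuleList([
            nn.Sequential(nn.Linear(feat_dim, hidden_dim), nn.ReLU(inplace=True),
                          nn.Linear(hidden_dim, proj_dim))
            for _ in range(num_heads)
        ])

    def forward(self, x: torch.Tensor) -> list[torch.Tensor]:
        feats = self.backbone(x)  # (B, feat_dim) pooled features
        # Each head yields an L2-normalized embedding in its own subspace.
        return [F.normalize(head(feats), dim=1) for head in self.heads]
```

For the hierarchical image setting described above, `backbone` would be a ResNet-50 truncated after global pooling (`feat_dim=2048`) with `num_heads=2`; for multi-label text, a BERT-base pooled representation with `num_heads=L+1`.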
Contrastive objectives per head:
For each head $k$ and augmented batch, a SupCon-type loss is defined:

$$
\mathcal{L}_k = \sum_{i=1}^{2N} \frac{-1}{|P_k(i)|} \sum_{p \in P_k(i)} \log \frac{\exp\big(\mathrm{sim}(z_i^{(k)}, z_p^{(k)})/\tau_k\big)}{\sum_{a \neq i} \exp\big(\mathrm{sim}(z_i^{(k)}, z_a^{(k)})/\tau_k\big)},
$$

where $z_i^{(k)}$ is the embedding of view $i$ under head $k$, $P_k(i)$ is the head-specific positive set, and $\tau_k$ is the head-specific temperature. The total contrastive loss is a convex combination,

$$
\mathcal{L}_{\mathrm{con}} = \sum_{k=1}^{K} \lambda_k \,\mathcal{L}_k, \qquad \lambda_k \ge 0, \quad \sum_{k} \lambda_k = 1.
$$

An additional cross-entropy term is included for multi-label text settings:

$$
\mathcal{L}_{\mathrm{total}} = \sum_{k=1}^{K} \lambda_k \,\mathcal{L}_k + \lambda_{\mathrm{CE}} \,\mathcal{L}_{\mathrm{CE}}.
$$
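A hedged sketch of this combination follows. `head_loss` consumes a boolean positive-pair mask (built per the sampling strategies listed next), and the function names, argument order, and optional cross-entropy hook are illustrative rather than the paper's code.

```python
import torch
import torch.nn.functional as F

def head_loss(z: torch.Tensor, pos_mask: torch.Tensor, tau: float) -> torch.Tensor:
    """SupCon-style loss for one head, given a (B, B) boolean positive-pair mask."""
    z = F.normalize(z, dim=1)
    sim = (z @ z.T) / tau
    self_mask = torch.eye(z.size(0), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float('-inf'))
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_mask = pos_mask & ~self_mask
    counts = pos_mask.sum(dim=1).clamp(min=1)
    per_anchor = -log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1) / counts
    return per_anchor[pos_mask.any(dim=1)].mean()

def mlcl_total_loss(head_embeddings, head_pos_masks, temperatures, weights,
                    ce_term=None, ce_weight=0.0):
    """Convex combination of per-head contrastive losses plus an optional CE term."""
    total = sum(w * head_loss(z, m, t)
                for z, m, t, w in zip(head_embeddings, head_pos_masks, temperatures, weights))
    if ce_term is not None:
        total = total + ce_weight * ce_term
    return total
```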
Pair sampling strategies (see the mask-construction sketch below):
- Hierarchical: positives for each level-specific head are samples sharing the anchor's label at that level (same subclass for the fine head, same superclass for the coarse head).
- Multi-label: positives for each per-label head are samples that share that label, with a “global” head using Jaccard similarity thresholding over full label sets.
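The sketch below constructs the per-head positive masks implied by these strategies. The Jaccard threshold of 0.5 is an assumption for illustration, not a value reported in the paper.

```python
import torch

def hierarchy_pos_mask(level_labels: torch.Tensor) -> torch.Tensor:
    """Positives for a hierarchy-level head: samples sharing that level's label."""
    return level_labels.unsqueeze(0) == level_labels.unsqueeze(1)

def label_pos_mask(multi_hot: torch.Tensor, label_idx: int) -> torch.Tensor:
    """Positives for a per-label head: pairs where both samples carry label `label_idx`."""
    has = multi_hot[:, label_idx].bool()
    return has.unsqueeze(0) & has.unsqueeze(1)

def global_pos_mask(multi_hot: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    """Positives for the global head: label-set Jaccard similarity above a threshold."""
    m = multi_hot.float()                                   # (B, L) multi-hot label matrix
    inter = m @ m.T                                         # pairwise intersection sizes
    union = m.sum(1, keepdim=True) + m.sum(1, keepdim=True).T - inter
    jaccard = inter / union.clamp(min=1e-8)                 # guard against empty label sets
    return jaccard >= threshold
```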
3. Training and Hyperparameterization
The MLCL paradigm has distinct training procedures for images (hierarchical tasks) and text (multi-label document classification):
- Images (hierarchical classification):
- Datasets: CIFAR-100 (100 subclasses, 20 superclasses), DeepFashion.
- Batch size: 256 samples (512 augmented views), trained for 250 epochs with SGD + momentum.
- Per-head temperatures: a lower temperature for the fine-level (subclass) head and a higher temperature for the coarse-level (superclass) head.
- Equal weights across heads (e.g., $\lambda_1 = \lambda_2 = 0.5$ in the two-head setting).
- Evaluation: encoder frozen, linear classifier trained atop pooled features (linear evaluation protocol; see the sketch after this list).
- Text (multi-label):
- Datasets: TripAdvisor (L=7), BeerAdvocate (L=5).
- Encoder: BERT-base.
- $K = L + 1$ heads (one per label plus a global head).
- Weights: equal weight on each contrastive head, with the remaining weight assigned to the cross-entropy term.
- Optimizer: Adam, 100 epochs, batch size 16.
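A sketch of the linear evaluation protocol used for the image experiments: the encoder is frozen and only a linear classifier is trained on its pooled features. The optimizer settings shown are illustrative defaults, not the paper's exact values.

```python
import torch
import torch.nn as nn

def linear_evaluation(encoder: nn.Module, train_loader, num_classes: int,
                      feat_dim: int = 2048, epochs: int = 100, lr: float = 0.1) -> nn.Module:
    """Freeze the encoder and train a linear classifier on its pooled features."""
    encoder.eval()
    for p in encoder.parameters():
        p.requires_grad_(False)                      # keep backbone weights fixed
    classifier = nn.Linear(feat_dim, num_classes)
    optimizer = torch.optim.SGD(classifier.parameters(), lr=lr, momentum=0.9)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in train_loader:
            with torch.no_grad():
                feats = encoder(images)              # (B, feat_dim) pooled features
            loss = criterion(classifier(feats), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return classifier
```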
4. Empirical Effectiveness and Analyses
Image classification (CIFAR-100):
| Method | Top-1 Accuracy (%) |
|---|---|
| SimCLR | 70.70 |
| Cross-Entropy (CE) | 75.30 |
| SupCon | 76.50 |
| Guided | 76.40 |
| MLCL | 77.70 |
Low-data ablation for CIFAR-100 (MLCL vs SupCon):
| #Train Samples | MLCL | SupCon | Delta (MLCL − SupCon) |
|---|---|---|---|
| 1K | 34.74 | 26.82 | +7.9% |
| 5K | 56.02 | 46.15 | +9.9% |
| 10K | 59.32 | 49.87 | +9.4% |
Text multi-label (fine-tuned BERT):
| Dataset | CE | MLCL w/o global | MLCL |
|---|---|---|---|
| TripAdvisor | 78.10 | 78.44 | 79.0 |
| BeerAdvocate | 70.54 | 71.22 | 71.81 |
The MLCL approach demonstrates strong performance benefits in both high-data and low-shot regimes, as well as notable robustness to label noise. For instance, with 50% uniform label noise on TripAdvisor, the CE method drops to ~51.8% while MLCL retains ~55%. Transfer learning (CIFAR-100 → CIFAR-10) also shows consistent improvements: SupCon 85.97% vs MLCL 86.88% on the full dataset.
Ablation analyses confirm:
- Distinct heads focus on orthogonal semantic aspects.
- Lower temperature parameters for fine-level heads encourage sharper separation among closely related classes.
- Coarse-level heads benefit from higher temperatures to avoid fragmentation of broad groups.
- MLCL delivers tighter and more semantically organized clusters in the embedding space, as visualized via t-SNE, compared to SupCon.
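For reference, embedding-space structure of this kind is typically inspected with a sketch like the following (scikit-learn t-SNE over frozen-encoder features); the perplexity, sample cap, and colormap are illustrative choices, not the paper's settings.

```python
import torch
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

@torch.no_grad()
def plot_tsne(encoder, data_loader, max_samples: int = 2000, perplexity: float = 30.0):
    """Project frozen-encoder features to 2D with t-SNE and color points by class."""
    encoder.eval()
    feats, labels = [], []
    for images, y in data_loader:
        feats.append(encoder(images).cpu())
        labels.append(y.cpu())
        if sum(f.shape[0] for f in feats) >= max_samples:
            break
    X = torch.cat(feats)[:max_samples].numpy()
    y = torch.cat(labels)[:max_samples].numpy()
    emb = TSNE(n_components=2, perplexity=perplexity, init="pca").fit_transform(X)
    plt.scatter(emb[:, 0], emb[:, 1], c=y, s=4, cmap="tab20")
    plt.title("t-SNE of encoder embeddings")
    plt.show()
```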
5. Methodological Limitations and Extensions
While MLCL provides substantial representation enhancements, several constraints exist:
- It requires explicit pre-definition of semantic aspects—whether label, hierarchy, or attribute—thus limiting applicability to settings where such metadata is unavailable or ambiguous.
- The number of heads increases with the number of supervised levels/aspects, which, while not dramatically increasing total parameters, does elevate the memory and compute required for the projection layers.
- Prospective work involves automatic discovery of similarity levels, interpretability of the learned subspaces, and extending the multi-level paradigm to new domains such as graph-structured or time-series data.
The method is most impactful in settings with multi-label or hierarchical structure, low-data or noisy-label regimes, or tasks where a single similarity axis is inadequate. In such cases, multi-level contrastive learning unlocks both class-level discrimination and broader structural awareness required for robust downstream generalization.
6. Relationship to Broader Contrastive Learning Paradigms
MLCL is part of an expanding family of contrastive learning-based models that move beyond instance-level signals towards broader, supervision-informed objectives. This family includes:
- SupCon [Khosla et al.], which uses supervised class labels in the contrastive objective.
- Guided and HiMulConE frameworks, which introduce various hierarchy- or attribute-aware supervision signals.
- Other multi-level or multi-branch architectures in text, graph, and multimodal domains.
The central insight underlying all such frameworks is the recognition that “similarity” is multi-faceted: robust representations must resolve not only identity but semantic structure, attribute coexistence, and hierarchical inclusion within a single embedding.
References:
- Multi-level Supervised Contrastive Learning (Ghanooni et al., 4 Feb 2025)