
Contrastive Learning-Based Model

Updated 16 November 2025
  • A contrastive learning-based model is a framework that learns data representations by maximizing similarity between positive pairs and minimizing similarity between negative pairs.
  • It uses multiple projection heads to capture diverse semantic aspects, making it effective for hierarchical and multi-label tasks in vision and text.
  • Empirical studies show this approach outperforms traditional methods in accuracy and robustness, especially in low-data and noisy-label scenarios.

A contrastive learning-based model is an architectural and algorithmic paradigm that learns data representations by contrasting “positive” and “negative” examples in a latent space. This framework explicitly maximizes the similarity between representations of semantically related (“positive”) sample pairs while minimizing the similarity between unrelated (“negative”) pairs. The approach has become foundational in self-supervised and supervised representation learning across domains as varied as computer vision, natural language processing, graph learning, and bioinformatics. Modern research increasingly explores multiple levels of semantic similarity, compositional augmentations, and task-adaptive contrast strategies, as exemplified by the Multi-level Supervised Contrastive Learning (MLCL) framework (Ghanooni et al., 4 Feb 2025).

1. General Principles of Contrastive Learning

The canonical contrastive objective constructs two (or more) “views” of a data sample (often through data augmentation or modality variation) and trains an encoder to produce high-dimensional representations that are close for positive pairs and distant for negatives. Formally, for a batch of $N$ data points with $2N$ views (two per example) and representations $\{z_i\}$ on the unit hypersphere, the InfoNCE loss is defined as:

$$\mathcal{L}_{\mathrm{InfoNCE}} = -\frac{1}{2N} \sum_{i=1}^{2N} \frac{1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp(\mathrm{sim}(z_i, z_p)/\tau)}{\sum_{a \neq i} \exp(\mathrm{sim}(z_i, z_a)/\tau)}$$

where $P(i)$ denotes the indices of positives for anchor $i$, $\tau$ is the temperature, and $\mathrm{sim}(u, v)$ is typically cosine similarity.
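
As a concrete illustration, the following is a minimal PyTorch-style sketch of this multi-positive InfoNCE objective. The function name `info_nce_loss` and the mask-based encoding of $P(i)$ are illustrative conventions, not taken from any particular codebase.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z, pos_mask, temperature=0.1):
    """Multi-positive InfoNCE over a batch of 2N embeddings.

    z:        (2N, d) tensor of projected representations (normalized inside).
    pos_mask: (2N, 2N) boolean tensor; pos_mask[i, j] is True iff j is a
              positive for anchor i (the diagonal must be False).
    """
    z = F.normalize(z, dim=1)                      # project onto the unit hypersphere
    sim = z @ z.t() / temperature                  # cosine similarities scaled by 1/tau
    # Exclude self-similarity from the denominator (the a != i constraint).
    self_mask = torch.eye(z.size(0), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, -1e9)
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    # Average log-probability over each anchor's positives, then over anchors (1/2N).
    pos_count = pos_mask.sum(dim=1).clamp(min=1)
    loss = -(log_prob * pos_mask).sum(dim=1) / pos_count
    return loss.mean()
```

Passing a mask whose only positive per anchor is the other augmented view of the same sample recovers the instance-level (SimCLR-style) setting; a class-identity mask recovers the SupCon-style objective.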

Traditional frameworks, including SimCLR and SupCon, define positive pairs according to instance- or class-level identity, and typically employ a single projection head. However, real data often exhibits labeling ambiguities, multi-label structure, or hierarchical organization that cannot be adequately captured by a single notion of similarity or a single embedding subspace.

2. Multi-Level Supervised Contrastive Learning (MLCL) Approach

The MLCL framework (Ghanooni et al., 4 Feb 2025) generalizes standard supervised contrastive learning by introducing $H$ separate projection heads (“multi-level heads”), each tailored to a distinct semantic aspect or label granularity. This enables the model to simultaneously encode multiple types of similarity (e.g., fine-grained subclasses, coarse superclasses, or various label aspects in multi-label scenarios) in parallel subspaces.

Model overview (a code sketch of this layout follows the list):

  • Backbone encoder $f(\cdot)$:
    • Image domain: ResNet-50 (up to global pool, output dim 2048).
    • Text domain: BERT-base (hidden dim 512).
  • Projection heads $\{g_h\}$, $h = 1, \ldots, H$:
    • Each $g_h$ is a two-layer MLP mapping the encoder output ($\mathbb{R}^d$) to a 128-dimensional normalized vector.
  • Semantic specialization:
    • For hierarchical multi-class (e.g., CIFAR-100): $H = L_{\mathrm{hi}}$ (e.g., 2 heads for subclass and superclass).
    • For multi-label: $H = L + 1$ (one head per label plus a global head for intersectional similarity).
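
A minimal PyTorch-style sketch of this backbone-plus-heads layout; the class name, constructor arguments, and hidden-layer width are illustrative assumptions, and the backbone is treated as any module that returns a $d$-dimensional feature vector.

```python
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadProjector(nn.Module):
    """Shared encoder followed by H independent two-layer projection heads."""

    def __init__(self, backbone, feat_dim, num_heads, proj_dim=128, hidden_dim=None):
        super().__init__()
        hidden_dim = hidden_dim or feat_dim          # hidden width is an assumption
        self.backbone = backbone                     # e.g. ResNet-50 up to global pool
        self.heads = nn.ModuleList([
            nn.Sequential(
                nn.Linear(feat_dim, hidden_dim),
                nn.ReLU(inplace=True),
                nn.Linear(hidden_dim, proj_dim),
            )
            for _ in range(num_heads)
        ])

    def forward(self, x):
        feats = self.backbone(x)                     # (B, feat_dim)
        # One L2-normalized proj_dim embedding per head, per sample.
        return [F.normalize(head(feats), dim=1) for head in self.heads]
```

Returning one normalized embedding per head keeps the per-level subspaces independent while the encoder is shared across all levels.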

Contrastive objectives per head:

For each head $h$ and augmented batch, a SupCon-type loss is defined:

$$\mathcal{L}_h = \sum_{i=1}^{2N} \frac{-1}{|P_h(i)|} \sum_{p \in P_h(i)} \log \frac{\exp(z_i^h \cdot z_p^h / \tau_h)}{\sum_{a \in A(i)} \exp(z_i^h \cdot z_a^h / \tau_h)}$$

with the total contrastive loss formed as a weighted combination,

$$\mathcal{L}_{\mathrm{contrast}} = \sum_{h=1}^{H} \alpha_h \mathcal{L}_h, \qquad \sum_{h} \alpha_h \le 1.$$

For multi-label text settings, an additional cross-entropy term absorbs the remaining weight, $\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{contrast}} + \left(1 - \sum_h \alpha_h\right) \mathcal{L}_{\mathrm{CE}}$; when no cross-entropy term is used (as in the hierarchical image experiments), the head weights sum to one.
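
Combining the pieces, a hedged sketch of the overall objective, reusing the illustrative `info_nce_loss` and multi-head model from the earlier sketches; the use of binary cross-entropy with logits for the multi-label CE term is an assumption, not a detail confirmed by the source.

```python
import torch.nn.functional as F

def mlcl_loss(model, x_views, pos_masks, alphas, temps, ce_logits=None, ce_targets=None):
    """Weighted multi-head contrastive loss with an optional CE term.

    x_views:   concatenated batch of 2N augmented views.
    pos_masks: list of H (2N, 2N) boolean positive masks, one per head.
    alphas:    list of H head weights; temps: list of H temperatures.
    """
    head_embeddings = model(x_views)                 # list of H tensors, each (2N, 128)
    loss = sum(
        alpha * info_nce_loss(z_h, mask_h, temperature=tau_h)
        for z_h, mask_h, alpha, tau_h in zip(head_embeddings, pos_masks, alphas, temps)
    )
    if ce_logits is not None:                        # multi-label text setting
        ce_weight = 1.0 - sum(alphas)                # remaining weight goes to CE
        # Multi-hot float targets; the exact CE variant is an assumption.
        loss = loss + ce_weight * F.binary_cross_entropy_with_logits(ce_logits, ce_targets)
    return loss
```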

Pair sampling strategies (a mask-construction sketch follows this list):

  • Hierarchical: $P_h(i) = \{\, j \mid y^h_{\mathrm{view}(i)} = y^h_{\mathrm{view}(j)} \,\}$.
  • Multi-label: $P_\ell(i) = \{\, j \mid y^\ell_{\mathrm{view}(i)} = y^\ell_{\mathrm{view}(j)} = 1 \,\}$, with a “global” head using Jaccard similarity thresholding.
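
A sketch of how these positive sets can be materialized as boolean masks over a batch of $2N$ views; the function names and the Jaccard threshold value of 0.5 are illustrative assumptions.

```python
import torch

def hierarchical_pos_mask(labels_h):
    """labels_h: (2N,) integer labels at one hierarchy level (e.g. superclass)."""
    mask = labels_h.unsqueeze(0) == labels_h.unsqueeze(1)      # (2N, 2N)
    mask.fill_diagonal_(False)                                 # exclude the anchor itself
    return mask

def label_pos_mask(multi_hot, label_idx):
    """Per-label head: positives are views where both entries carry label `label_idx`."""
    has_label = multi_hot[:, label_idx].bool()                 # (2N,)
    mask = has_label.unsqueeze(0) & has_label.unsqueeze(1)
    mask.fill_diagonal_(False)
    return mask

def global_pos_mask(multi_hot, threshold=0.5):
    """Global head: positives whose label sets have Jaccard similarity >= threshold."""
    multi_hot = multi_hot.float()
    inter = multi_hot @ multi_hot.t()                          # |A ∩ B|
    union = multi_hot.sum(1, keepdim=True) + multi_hot.sum(1) - inter
    jaccard = inter / union.clamp(min=1e-8)
    mask = jaccard >= threshold
    mask.fill_diagonal_(False)
    return mask
```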

3. Training and Hyperparameterization

The MLCL paradigm has distinct training procedures for images (hierarchical tasks) and text (multi-label document classification); the reported settings are gathered into a configuration sketch after the following list:

  • Images (hierarchical classification):
    • Datasets: CIFAR-100 (100 subclasses, 20 superclasses), DeepFashion.
    • Batch size: 256 samples ($\times 2$ views), trained for 250 epochs with SGD + momentum.
    • Per-head temperatures: fine level (subclass) $\tau_1 = 0.1$, coarse level (superclass) $\tau_2 = 0.5$.
    • Equal weights: $\alpha_1 = \alpha_2 = 0.5$.
    • Evaluation: encoder frozen, linear classifier trained atop pooled features (linear evaluation protocol).
  • Text (multi-label):
    • Datasets: TripAdvisor ($L = 7$), BeerAdvocate ($L = 5$).
    • Encoder: BERT-base, $d = 512$.
    • $H = L + 1$ heads, $\tau_{\mathrm{label}} = 0.1$, $\tau_{\mathrm{global}} = 0.5$.
    • Weights: $\alpha_{\mathrm{label}} = 0.03$ each, $\alpha_{\mathrm{global}} = 0.10$, remaining weight to cross-entropy.
    • Optimizer: Adam, 100 epochs, batch size 16.
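
For convenience, the reported settings can be collected into a plain configuration dictionary; the values are copied from the list above, while the dictionary layout and field names are illustrative.

```python
# Hyperparameters as reported above; the dictionary layout itself is illustrative.
MLCL_CONFIGS = {
    "cifar100_hierarchical": {
        "encoder": "resnet50",
        "num_heads": 2,                      # subclass + superclass
        "temperatures": [0.1, 0.5],          # fine, coarse
        "alphas": [0.5, 0.5],
        "batch_size": 256,                   # x2 augmented views
        "epochs": 250,
        "optimizer": "sgd_momentum",
    },
    "tripadvisor_multilabel": {
        "encoder": "bert-base",
        "num_labels": 7,
        "num_heads": 8,                      # one per label + global head
        "temperatures": [0.1] * 7 + [0.5],
        "alphas": [0.03] * 7 + [0.10],       # remaining weight goes to cross-entropy
        "batch_size": 16,
        "epochs": 100,
        "optimizer": "adam",
    },
}
```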

4. Empirical Effectiveness and Analyses

Image classification (CIFAR-100):

Method             | Top-1 Accuracy (%)
SimCLR             | 70.70
Cross-Entropy (CE) | 75.30
SupCon             | 76.50
Guided             | 76.40
MLCL               | 77.70

Low-data ablation for CIFAR-100 (MLCL vs SupCon):

#Train Samples | MLCL  | SupCon | Delta (MLCL − SupCon)
1K             | 34.74 | 26.82  | +7.9%
5K             | 56.02 | 46.15  | +9.9%
10K            | 59.32 | 49.87  | +9.4%

Text multi-label (fine-tuned BERT):

Dataset      | CE    | MLCL w/o global | MLCL
TripAdvisor  | 78.10 | 78.44           | 79.0
BeerAdvocate | 70.54 | 71.22           | 71.81

The MLCL approach demonstrates strong performance benefits in both high-data and low-shot regimes, as well as notable robustness to label noise. For instance, with 50% uniform label noise on TripAdvisor, the CE method drops to ~51.8% while MLCL retains ~55%. Transfer learning (CIFAR-100 → CIFAR-10) also shows consistent improvements: SupCon 85.97% vs MLCL 86.88% on the full dataset.

Ablation analyses confirm:

  • Distinct heads focus on orthogonal semantic aspects.
  • Lower temperature parameters for fine-level heads encourage sharper separation among closely related classes.
  • Coarse-level heads benefit from higher temperatures to avoid fragmentation of broad groups (see the numerical sketch after this list).
  • MLCL delivers tighter and more semantically organized clusters in the embedding space, as visualized via t-SNE, compared to SupCon.
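
To make the temperature effect concrete, a small numerical sketch (the similarity values are made up for illustration): the same pair of similarities yields a much sharper softmax weighting at $\tau = 0.1$ than at $\tau = 0.5$.

```python
import torch

# Two candidate similarities for an anchor: a near-duplicate (0.9) and a
# moderately related sample (0.7).
sims = torch.tensor([0.9, 0.7])

for tau in (0.1, 0.5):
    probs = torch.softmax(sims / tau, dim=0)
    print(f"tau={tau}: {probs.tolist()}")
# tau=0.1 -> roughly [0.88, 0.12]  (sharp separation, suits fine-grained heads)
# tau=0.5 -> roughly [0.60, 0.40]  (softer weighting, suits coarse heads)
```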

5. Methodological Limitations and Extensions

While MLCL provides substantial representation enhancements, several constraints exist:

  • It requires explicit pre-definition of semantic aspects—whether label, hierarchy, or attribute—thus limiting applicability to settings where such metadata is unavailable or ambiguous.
  • The number of heads increases with the number of supervised levels/aspects, which, while not dramatically increasing total parameters, does elevate the memory and compute required for the projection layers.
  • Prospective work involves automatic discovery of similarity levels, interpretability of the learned subspaces, and extending the multi-level paradigm to new domains such as graph-structured or time-series data.

The method is most impactful in settings with multi-label or hierarchical structure, low-data or noisy-label regimes, or tasks where a single similarity axis is inadequate. In such cases, multi-level contrastive learning unlocks both class-level discrimination and broader structural awareness required for robust downstream generalization.

6. Relationship to Broader Contrastive Learning Paradigms

MLCL is part of an expanding family of contrastive learning-based models that move beyond instance-level signals towards broader, supervision-informed objectives. This family includes:

  • SupCon [Khosla et al.], which uses supervised class labels in the contrastive objective.
  • Guided and HiMulConE frameworks, which introduce various hierarchy- or attribute-aware supervision signals.
  • Other multi-level or multi-branch architectures in text, graph, and multimodal domains.

The central insight underlying all such frameworks is the recognition that “similarity” is multi-faceted: robust representations must resolve not only identity but semantic structure, attribute coexistence, and hierarchical inclusion within a single embedding.

