Multi-Level Contrastive Learning (MLCL)

Updated 22 May 2026

Multi-Level Contrastive Learning (MLCL) is a paradigm for hierarchical representation learning across different levels of semantic, structural, or spatial granularity.
The methodology leverages multiple projection heads, each tuned to a specific hierarchy level, and applies supervised contrastive loss to enhance model robustness and versatility.
MLCL has been reported to outperform single-level approaches in accuracy, especially in low-resource or multi-label scenarios.

Multi-Level Contrastive Learning (MLCL) is a general paradigm for representation learning that leverages contrastive objectives at multiple semantic, structural, or spatial levels within or across samples. Rather than applying contrastive learning to a single aspect of the data (e.g., instance, class, or sentence level), MLCL decomposes similarity into a hierarchy—such as label granularity, spatial resolution, feature abstraction, semantic hierarchy, or modality—and constructs loss functions or architectural mechanisms to enforce both invariance and discriminability at each level. MLCL has found considerable success in domains such as supervised and self-supervised representation learning, vision, natural language processing, graph and multimodal learning, recommendation, clustering, and domain adaptation. Empirical evidence demonstrates its superiority over single-level or naive fusion approaches, especially under low-resource, multi-label, or hierarchical settings.

1. Theoretical and Architectural Foundations

The MLCL paradigm formalizes the idea that “similarity” is not monolithic but multi-faceted or hierarchical. For a sample $x_k$, one may have multiple levels of annotation or grouping:

$L$ levels of labels (hierarchy): e.g., subclass, superclass
Multi-label setting: multiple (possibly overlapping) attributes or properties
Structural hierarchy: instance, region (pixel, patch), global (image/document)
Modality or feature abstraction: high-level semantic vs. low-level syntactic

In the canonical MLCL framework for supervised settings, a shared encoder $f(x)$ produces a backbone embedding $h_k$. A set of $M$ projection heads $g_m$, each typically a small MLP, maps $h_k$ to an embedding $z_k^{(m)}$ specialized for one label level, attribute, or semantic aspect. For each head $m$, a supervised contrastive loss is constructed; for anchor $i$:

$\ell_i^{(m)} = -\frac{1}{|P_m(i)|} \sum_{p \in P_m(i)} \log \frac{\exp(z_i^{(m)} \cdot z_p^{(m)} / \tau_m)}{\sum_{a \in A(i)} \exp(z_i^{(m)} \cdot z_a^{(m)} / \tau_m)}$

where $P_m(i)$ is the set of positives at level $m$ (e.g., same class, same attribute, Jaccard similarity threshold for multi-label), and $\tau_m$ the head’s temperature.
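
The positive sets $P_m(i)$ can be illustrated with a minimal NumPy sketch. The function names `same_label_positives` and `jaccard_positives`, the toy labels, and the threshold of 0.5 are assumptions made for illustration; they are not taken from the cited papers.

```python
import numpy as np

def same_label_positives(labels):
    """P_m(i) for a single-label level: samples sharing the anchor's label."""
    labels = np.asarray(labels)
    mask = labels[:, None] == labels[None, :]
    np.fill_diagonal(mask, False)  # an anchor is never its own positive
    return mask

def jaccard_positives(multi_hot, threshold=0.5):
    """P_m(i) for a multi-label level: Jaccard similarity >= threshold."""
    y = np.asarray(multi_hot, dtype=bool)
    inter = (y[:, None, :] & y[None, :, :]).sum(-1)
    union = (y[:, None, :] | y[None, :, :]).sum(-1)
    # Guard against empty label sets (union == 0) by treating them as dissimilar.
    jacc = np.where(union > 0, inter / np.maximum(union, 1), 0.0)
    mask = jacc >= threshold
    np.fill_diagonal(mask, False)
    return mask

# Toy batch of four samples (hypothetical labels).
subclass = [0, 0, 1, 2]      # fine level
superclass = [0, 0, 0, 1]    # coarse level: subclasses 0 and 1 share a parent
fine_mask = same_label_positives(subclass)
coarse_mask = same_label_positives(superclass)

attrs = [[1, 1, 0], [1, 1, 1], [0, 0, 1], [1, 0, 0]]  # multi-hot attributes
attr_mask = jaccard_positives(attrs, threshold=0.5)
```

Row $i$ of each boolean mask marks the positives of anchor $i$ at that level. The coarse mask is a superset of the fine mask here, which is what lets a coarse head pull together samples that the fine head keeps apart.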

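The total MLCL loss is then a weighted sum over levels, with per-level weights $\lambda_m$:

$\mathcal{L}_{\mathrm{MLCL}} = \sum_{m=1}^{M} \lambda_m \sum_{i} \ell_i^{(m)}$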
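
As a rough illustration, the per-level loss and its weighted sum might be computed as follows. This is a sketch under stated assumptions rather than a reference implementation: $A(i)$ is taken to be every sample in the batch except the anchor, embeddings are L2-normalized before the dot product, anchors with no positives are skipped, and the names `supcon_level_loss`, `mlcl_loss`, and `lambdas` are hypothetical.

```python
import numpy as np

def supcon_level_loss(z, pos_mask, tau):
    """Sum over anchors of the per-anchor supervised contrastive loss at one level."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # assumption: unit-norm embeddings
    logits = z @ z.T / tau
    not_self = ~np.eye(len(z), dtype=bool)
    logits = np.where(not_self, logits, -np.inf)       # assumption: A(i) excludes the anchor
    # Numerically stable log-softmax over A(i).
    row_max = logits.max(axis=1, keepdims=True)
    log_denom = row_max + np.log(np.exp(logits - row_max).sum(axis=1, keepdims=True))
    log_prob = logits - log_denom
    n_pos = pos_mask.sum(axis=1)
    valid = n_pos > 0                                  # skip anchors with empty P_m(i)
    per_anchor = -np.where(pos_mask, log_prob, 0.0).sum(axis=1)[valid] / n_pos[valid]
    return per_anchor.sum()

def mlcl_loss(zs, pos_masks, taus, lambdas):
    """Weighted sum of per-level losses, one term per projection head."""
    return sum(lam * supcon_level_loss(z, pm, tau)
               for z, pm, tau, lam in zip(zs, pos_masks, taus, lambdas))

# Toy two-level example: fine and coarse labels for six samples.
rng = np.random.default_rng(0)
fine = np.array([0, 0, 1, 1, 2, 2])
coarse = np.array([0, 0, 0, 0, 1, 1])
masks = []
for y in (fine, coarse):
    m = y[:, None] == y[None, :]
    np.fill_diagonal(m, False)
    masks.append(m)
zs = [rng.normal(size=(6, 8)) for _ in masks]          # stand-ins for head outputs
loss = mlcl_loss(zs, masks, taus=[0.1, 0.5], lambdas=[1.0, 0.5])
```

Each head receives its own temperature and weight, so a level can be sharpened or down-weighted without touching the others.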

This architecture generalizes to unsupervised and graph contexts via the selection of appropriate contrastive objectives and positive/negative set definitions (Ghanooni et al., 4 Feb 2025, Shao et al., 2021).

2. Multi-Level Contrastive Objectives Across Domains

MLCL instantiations fall broadly into the following patterns, with domain-specific adaptations:

Supervised MLCL (vision, NLP): Multiple projection heads tied to hierarchy or label structure; levels include fine-to-coarse classes (e.g., subclass/superclass in CIFAR-100), multi-label attributes, or text aspects (Ghanooni et al., 4 Feb 2025).
Dense Prediction (dense vision, segmentation): MLCL enforces (a) pixel-to-pixel, (b) pixel-to-class, and (c) instance-to-class consistency, encouraging structure-aware clustering at multiple spatial resolutions and semantic scales (Yang et al., 2023, Guo et al., 2023).
Self-Supervised/Unsupervised Graph MLCL: Multiple graph views (e.g., topology vs. learned kNN-graph) and simultaneous node-level (local) and graph-level (global) contrast (Shao et al., 2021, Wang et al., 2024).
Feature Abstraction MLCL (deep networks): Contrastive losses applied at multiple depths of the encoder; each level learns representations tuned to a different layer’s abstraction, with downstream ensembling across layers (Chen et al., 2021).
Attribute/Head-level MLCL: Contrastive disentanglement where each feature head is forced to be both distinctive (via contrast) and active (via entropy), promoting fine-grained, class-agnostic compositionality (Jiang et al., 2024).
Recommendation and Knowledge Graphs: MLCL models contrast at interaction, intra-view, and inter-view graph levels (user-item interactions, neighborhood, and perturbed views), as well as knowledge distillation via multi-view projection (Hu et al., 8 May 2026, Zou et al., 2022, Wang et al., 2022, An et al., 2022).

3. Formal Loss Structures and Training Algorithms

The landscape of MLCL objectives includes but is not limited to:

InfoNCE: softmax-based contrast of positives vs. negatives, with temperature tuning per level.
Jensen-Shannon divergence: Probability-normalized similarity (notably for pixel-to-pixel contrast) for more nuanced similarity alignment (Yang et al., 2023).
Margin-based triplet loss: Used at utterance/slot/word level for fine-grained discrimination in language understanding (Liang et al., 2022, Cheng et al., 2024).
DGI/Discriminator-based graph contrast: Graph-level pooling followed by binary discrimination between real and corrupted views (An et al., 2022).
Entropy-regularized losses: Enforce feature-head balance and prevent collapse (Jiang et al., 2024).
Weighted sum of contrastive terms: Hyperparameters ($\lambda_m$) controlling level-wise importance, with ablation studies confirming sensitivity and interpretability.
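
Two of the objectives listed above, InfoNCE and the margin-based triplet loss, can be written in a few lines each. The sketch below handles a single anchor and uses hypothetical function names and default values (`tau=0.1`, `margin=1.0`); the cited works may differ in similarity measure, normalization, and batching.

```python
import numpy as np

def info_nce(anchor, positive, negatives, tau=0.1):
    """InfoNCE for one anchor: negative log-softmax score of the positive."""
    logits = np.concatenate(([anchor @ positive], negatives @ anchor)) / tau
    logits -= logits.max()                       # numerical stability
    return -(logits[0] - np.log(np.exp(logits).sum()))

def triplet_margin(anchor, positive, negative, margin=1.0):
    """Margin-based triplet loss on Euclidean distances (an assumed distance choice)."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

a = np.array([1.0, 0.0])
p = np.array([1.0, 0.0])                         # identical to the anchor
negs = np.array([[0.0, 1.0], [-1.0, 0.0]])       # orthogonal and opposite

easy_nce = info_nce(a, p, negs, tau=0.1)         # positive dominates: loss near 0
hard_nce = info_nce(a, negs[0], np.array([p]), tau=0.1)  # roles swapped: large loss
easy_triplet = triplet_margin(a, p, negs[0])     # margin already satisfied
hard_triplet = triplet_margin(a, negs[0], p)     # margin violated
```

In a multi-level setup, one such function would be called per level with its own `tau` or `margin`.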

Training of MLCL models usually alternates or aggregates level-wise contrasts per minibatch, with per-level temperatures, projection head parameters, and loss weights tuned by validation. For self-supervised scenarios, EMA-target networks and aggressive data augmentations are applied to stabilize and enrich the contrastive signal (Guo et al., 2023, Zeng et al., 2023).
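
The EMA-target idea mentioned above can be sketched as a single update rule in which the target network's parameters follow the online network's parameters with momentum. The dictionary-of-arrays parameter format, the name `ema_update`, and the momentum values are assumptions for illustration; the source does not specify them.

```python
import numpy as np

def ema_update(target, online, momentum=0.99):
    """Move each target parameter a small step toward its online counterpart."""
    return {k: momentum * target[k] + (1.0 - momentum) * online[k] for k in target}

online = {"w": np.ones(3)}     # parameters being trained by gradient descent
target = {"w": np.zeros(3)}    # slowly moving copy used to produce contrast targets

target = ema_update(target, online, momentum=0.9)
first = target["w"].copy()
target = ema_update(target, online, momentum=0.9)
second = target["w"].copy()
```

Because the target changes slowly, the embeddings it produces are more stable across steps than those of the online network.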

Typical MLCL training loop:

For each minibatch, extract base features
For each contrast level:
- Project to corresponding space
- Build positive/negative pairs
- Compute and sum per-level losses (possibly with per-head temperatures)
Aggregate loss terms and update all model parameters jointly
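
The steps above can be sketched as a forward pass that returns the aggregated loss for one minibatch. This is a NumPy illustration under several assumptions: a one-layer `tanh` encoder and linear heads stand in for real networks, positives are samples with the same label at each level, and the gradient step is omitted (in practice an autodiff framework would differentiate the returned loss and update encoder and heads jointly). All function and parameter names are hypothetical.

```python
import numpy as np

def level_loss(z, pos_mask, tau):
    """Supervised contrastive loss at one level, summed over anchors."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    logits = z @ z.T / tau
    np.fill_diagonal(logits, -np.inf)               # exclude the anchor itself
    m = logits.max(axis=1, keepdims=True)
    log_prob = logits - (m + np.log(np.exp(logits - m).sum(axis=1, keepdims=True)))
    n_pos = pos_mask.sum(axis=1)
    valid = n_pos > 0
    return (-np.where(pos_mask, log_prob, 0.0).sum(axis=1)[valid] / n_pos[valid]).sum()

def mlcl_batch_loss(x, level_labels, encoder_w, head_ws, taus, lambdas):
    h = np.tanh(x @ encoder_w)                      # 1. extract base features
    total = 0.0
    for labels, head_w, tau, lam in zip(level_labels, head_ws, taus, lambdas):
        z = h @ head_w                              # 2a. project to the level's space
        pos = labels[:, None] == labels[None, :]    # 2b. build positive pairs
        np.fill_diagonal(pos, False)
        total += lam * level_loss(z, pos, tau)      # 2c. weighted per-level loss
    return total                                    # 3. aggregated loss to minimize

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 5))                         # toy minibatch
fine = np.array([0, 0, 1, 1, 2, 2, 3, 3])
coarse = np.array([0, 0, 0, 0, 1, 1, 1, 1])
encoder_w = rng.normal(size=(5, 16))
head_ws = [rng.normal(size=(16, 4)) for _ in range(2)]

loss = mlcl_batch_loss(x, [fine, coarse], encoder_w, head_ws,
                       taus=[0.1, 0.5], lambdas=[1.0, 0.5])
```

All heads read the same base features `h`, so a joint update lets every level shape the shared encoder.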

At inference, the encoder’s base output or a selected head embedding serves for downstream tasks (e.g., classification, clustering, retrieval).

4. Variants and Instantiations in Practice

MLCL has been realized in diverse architectures:

Parallel Projection Heads: Each head encodes one semantic or hierarchical level; at inference, only the backbone is kept, unless ensembling is feasible (Ghanooni et al., 4 Feb 2025).
Layer-wise (Depth) Contrast: Losses applied at various encoder depths, with ensemble classifiers consuming intermediate representations to exploit scale/abstraction diversity (Chen et al., 2021).
Multi-view or Multi-modal MLCL: Views can be modalities (image, text, audio), raw vs. transformed graphs, or augmented semantic structures (Xu et al., 2021, Yang et al., 2023, Wang et al., 2024).
Graph-specific MLCL: Contrasts can be node/patch-level, global (graph), inter-view (e.g., topology-feature), intra-view (same view augmentations), or interaction-aligned (user-item, item-item, user-user edges) (Shao et al., 2021, Hu et al., 8 May 2026, Wang et al., 2022, Zou et al., 2022).
Label-structured MLCL: For tasks like spoken language understanding (SLU), MLCL applies to utterance, slot, and word levels, often coupled with margin-based losses and hard-negative augmentation (Cheng et al., 2024, Liang et al., 2022).

Table 1. MLCL Instantiations (selected papers):

Domain	Levels / Granularities	Reference
Vision	Class hierarchy (sub/super), attribute, multi-label, feature abstraction	(Ghanooni et al., 4 Feb 2025)
Segmentation	Pixel-to-pixel, pixel-to-class, instance-to-class	(Yang et al., 2023)
Graphs	Node (local) / Graph (global), topological vs. feature-space	(Shao et al., 2021)
Recommender	Interest-level, feature-level, attribute-level	(Wang et al., 2022)
NLP	Word/Span/Utterance, syntactic/semantic levels	(Chen et al., 2022)

5. Empirical Impact and Ablation

Reported results show MLCL improving over baselines across domains:

In hierarchical image classification (CIFAR-100), MLCL exceeds single-head supervised contrastive learning (SupCon) by up to 1.2% absolute accuracy, and by >9% in low-data regimes (Ghanooni et al., 4 Feb 2025).
For sequential recommendation, interest- and feature-level CL are both critical: ablating either causes a 4–8% recall drop (Wang et al., 2022).
In dense prediction, MLCL achieves +1–4 AP or mIoU improvements over standard self-supervised and supervised pre-training (Guo et al., 2023, Yang et al., 2023).
In medical segmentation pre-training, multi-level asymmetric CL yields up to +7.8% Dice over the state-of-the-art (Zeng et al., 2023).
For multi-view clustering, performing separate contrastive objectives on feature and semantic spaces, rather than naively fusing them, leads to >10–20% accuracy gains on benchmarks (Xu et al., 2021).
Ablations—removing levels, disabling certain contrastive terms, or setting suboptimal temperatures—lead to consistent, interpretable performance degradations (up to ~10% in recall, 2–3% in mIoU in vision tasks) (Yang et al., 2023, Ghanooni et al., 4 Feb 2025, Chen et al., 2021). Balanced or adaptively-tuned head weights/temperatures are crucial.

6. Practical Considerations and Limitations

Weighting and temperature tuning: Optimal MLCL performance is sensitive to weights ($\lambda_m$) and temperatures ($\tau_m$) per level; practitioners adopt grid search or ablation for calibration (Ghanooni et al., 4 Feb 2025).
Computational cost: MLCL increases per-batch compute and memory footprint due to multiple heads or projections, larger sets of positive/negative pairs, and per-level batch operations. However, head MLPs are typically lightweight compared to the backbone.
Label requirement: Supervised MLCL requires multi-level annotations; self-supervised variants require well-defined unsupervised proxies (e.g., kNN, view-augmentations) (Shao et al., 2021, Wang et al., 2024).
Interpretability: Multiple heads afford interpretability of learned features along semantic axes, but understanding the interplay of their learned spaces remains an open direction.
Extensions: Adaptive or learned weighting per head, application to self-supervised settings via pseudo-labels, and deeper architectural integration remain active areas of research (Ghanooni et al., 4 Feb 2025, Chen et al., 2021).
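
The grid search mentioned under weighting and temperature tuning can be sketched with the standard library alone. Here `validate` is a placeholder for "train with this configuration and return a validation score", and `toy_validate` is a hypothetical stand-in, not a real result; the grids are likewise illustrative.

```python
import itertools

def grid_search(validate, lambda_grid, tau_grid, n_levels):
    """Try every combination of per-level weights and temperatures; keep the best."""
    best_score, best_cfg = float("-inf"), None
    for lambdas in itertools.product(lambda_grid, repeat=n_levels):
        for taus in itertools.product(tau_grid, repeat=n_levels):
            score = validate(lambdas, taus)
            if score > best_score:
                best_score, best_cfg = score, (lambdas, taus)
    return best_cfg, best_score

# Hypothetical scoring function that peaks at one configuration.
# A real run would train a model and measure validation accuracy instead.
def toy_validate(lambdas, taus):
    return -(abs(lambdas[0] - 1.0) + abs(lambdas[1] - 0.5)
             + abs(taus[0] - 0.1) + abs(taus[1] - 0.5))

cfg, score = grid_search(toy_validate, [0.5, 1.0], [0.1, 0.5], n_levels=2)
```

The number of configurations grows exponentially with the number of levels, which is one reason adaptive or learned weighting is attractive.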

7. Applications, Generalizations, and Future Directions

MLCL’s core insight—explicitly targeting multiple, structured notions of similarity via architectural and loss modularity—facilitates robust learning in data-scarce, multi-label, hierarchical, and cross-modal environments. Demonstrated domains include but are not limited to hierarchical classification, multi-label classification, cross-lingual representation learning, domain-adaptive segmentation, dense vision tasks, recommendation, multi-view and multi-modal clustering, and graph representation learning (Ghanooni et al., 4 Feb 2025, Yang et al., 2023, Chen et al., 2022, Xu et al., 2021, Wang et al., 2022, Hu et al., 8 May 2026, Jiang et al., 2024).

Future directions include:

Automated discovery or adaptation of levels and their importance via attention or meta-learning mechanisms.
Integration with large-scale pre-training, especially for multi-modal or deeply hierarchical data.
Applying MLCL in settings with noisy or incomplete labels by regularizing with global heads or pseudo-labels.
Theoretical analysis of the trade-offs between contrast granularity and optimization stability.

By systematically decomposing similarity into multiple learnable axes and enforcing both invariance and discriminability at each, MLCL delivers more expressive, generalizable, and interpretable representations across domains. Its increasing adoption is pushing the research frontier beyond single-level or “flat” contrastive learning frameworks.