Bi-Level Contrastive Algorithm
- A bi-level contrastive algorithm is a learning paradigm that leverages two distinct contrastive losses to capture both fine-grained and coarse-grained similarities.
- It combines outputs from separate projection heads and uses tailored temperature parameters to enhance discrimination and convergence.
- Applications include hierarchical image/text classification, few-shot learning, and multi-modal tasks, enabling robust and discriminative feature representations.
A bi-level contrastive algorithm refers to a contrastive learning paradigm that explicitly leverages two levels (or “layers”) of similarity structure in the data, implementing distinct contrastive losses at each level and typically combining their gradients to better supervise representation learning. This approach generalizes single-level supervised or self-supervised contrastive frameworks by enabling the encoder to capture hierarchical, multi-grain, or multi-type notions of similarity—such as subclass and superclass in images, token-level and sequence-level in NLP, or instance-level and structure-level in graph and multi-view learning. The bi-level (two-level) instantiation of multi-level contrastive schemes has recently emerged as a leading strategy for problems involving limited supervision, transfer, or compositional structure.
1. Concept and Mathematical Formulation
The most systematic treatment of bi-level contrastive learning is provided by the two-head instantiation of Multi-level Supervised Contrastive Learning (MLCL) (Ghanooni et al., 4 Feb 2025). Here, a shared deep encoder $f$ is augmented with two separate projection heads $g^{(1)}$ and $g^{(2)}$, each defining a distinct contrastive space:

$$z_i^{(\ell)} = g^{(\ell)}\big(f(\tilde{x}_i)\big), \qquad \ell \in \{1, 2\},$$

where $\tilde{x}_i$ are augmented views. Each head operates on a distinct label dimension—such as fine-grained class ($y^{(1)}$) vs. coarse-grained superclass ($y^{(2)}$) in hierarchical datasets. For each level $\ell$, positives are instances sharing label $y^{(\ell)}$, and the denominator set is all other views.

The per-level loss is a supervised InfoNCE:

$$\mathcal{L}^{(\ell)} = \sum_{i=1}^{2N} \frac{-1}{|P^{(\ell)}(i)|} \sum_{p \in P^{(\ell)}(i)} \log \frac{\exp\!\big(z_i^{(\ell)} \cdot z_p^{(\ell)} / \tau_\ell\big)}{\sum_{a \in A(i)} \exp\!\big(z_i^{(\ell)} \cdot z_a^{(\ell)} / \tau_\ell\big)},$$

where $P^{(\ell)}(i)$ is the set of positives for anchor $i$ at level $\ell$ and $A(i)$ is the set of all other views.

The overall training objective is a convex combination:

$$\mathcal{L} = \lambda\, \mathcal{L}^{(1)} + (1 - \lambda)\, \mathcal{L}^{(2)}, \qquad \lambda \in [0, 1],$$

with temperatures $\tau_1, \tau_2$ and weight coefficient $\lambda$ individually tuned for stability and expressivity.
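The per-level objective is the standard supervised contrastive (SupCon) loss applied in each head's projection space. A minimal PyTorch sketch, assuming $\ell_2$-normalized projections and illustrative names (`sup_con_loss`, `z_fine`, `z_coarse` are not from the paper), is:

```python
import torch

def sup_con_loss(z: torch.Tensor, labels: torch.Tensor, tau: float) -> torch.Tensor:
    """Supervised InfoNCE at one level.
    z: (2N, d) L2-normalized projections of all augmented views.
    labels: (2N,) level-specific labels (fine-grained or coarse-grained).
    """
    n = z.size(0)
    sim = z @ z.t() / tau                                   # cosine similarities / temperature
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float('-inf'))         # exclude self-comparisons

    pos_mask = labels.unsqueeze(0).eq(labels.unsqueeze(1)) & ~self_mask
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)

    pos_counts = pos_mask.sum(dim=1)
    valid = pos_counts > 0                                  # anchors with at least one positive
    mean_log_prob_pos = log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1)[valid] / pos_counts[valid]
    return -mean_log_prob_pos.mean()

# Bi-level objective as a convex combination of the two levels:
# loss = lam * sup_con_loss(z_fine, y_fine, tau_1) + (1 - lam) * sup_con_loss(z_coarse, y_coarse, tau_2)
```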
2. Algorithmic Framework
The bi-level contrastive training loop proceeds as follows (Ghanooni et al., 4 Feb 2025); a minimal training-step sketch follows the list:
- Each data point is augmented twice, yielding $2N$ samples per batch.
- Both heads compute projections from the shared encoder.
- For each instance, positive and anchor sets at both levels are constructed according to the current label (fine/coarse, subclass/superclass, etc.).
- Individual contrastive losses are computed for each head.
- These are weighted and summed, and the gradient is backpropagated through all parameters.
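Under these steps, a single training iteration might look like the following sketch. It assumes a PyTorch setup; `encoder`, `head_fine`, `head_coarse`, `augment`, and `sup_con_loss` (sketched in Section 1) are illustrative names rather than the reference implementation's API, and the default hyperparameter values are placeholders.

```python
import torch

def train_step(x, y_fine, y_coarse, encoder, head_fine, head_coarse,
               optimizer, lam=0.7, tau_1=0.07, tau_2=0.2):
    """One bi-level contrastive update (hyperparameter defaults are placeholders)."""
    # two stochastic augmentations per sample -> 2N views; labels duplicated to match
    views = torch.cat([augment(x), augment(x)], dim=0)
    y_f, y_c = y_fine.repeat(2), y_coarse.repeat(2)

    h = encoder(views)                          # shared backbone features
    z_fine, z_coarse = head_fine(h), head_coarse(h)   # projection heads (sketched below)

    loss = lam * sup_con_loss(z_fine, y_f, tau_1) \
         + (1 - lam) * sup_con_loss(z_coarse, y_c, tau_2)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```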
A typical projection head is a 2-layer MLP with an intermediate (e.g., 512) and output (e.g., 128) dimension, followed by $\ell_2$ normalization. Temperature hyperparameters $\tau_1$ (fine-level) and $\tau_2$ (coarse-level) are separately tuned; a lower $\tau_1$ enables sharper discrimination among fine classes, while a higher $\tau_2$ regularizes coarse class clustering.
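A head of this shape can be written as a small module. The exact architecture below (ReLU activation, no batch normalization) is an assumption consistent with common SupCon-style implementations, not a prescription from the paper:

```python
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """Shallow 2-layer MLP projection head with L2-normalized output."""
    def __init__(self, in_dim: int, hidden_dim: int = 512, out_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, h):
        return F.normalize(self.net(h), dim=1)   # unit norm, so dot products are cosine similarities
```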
Empirically, this bi-level structure is found to regularize the representation, preventing semantic clusters from fragmenting and acting as an implicit prior under data-scarce conditions.
3. Applications of Bi-Level Contrastive Algorithms
Bi-level contrastive algorithms have been adopted across domains where it is necessary to encode multiple facets of similarity or hierarchy:
- Hierarchical image/text classification: The MLCL framework applies to multi-label and hierarchical problems, capturing both within-subclass and within-superclass similarities, leading to marked improvements on datasets such as CIFAR-100 where hierarchical semantics are present (Ghanooni et al., 4 Feb 2025).
- Few-shot learning: The ability to encode both local (fine labels) and global (coarse/hierarchical) structure regularizes learned features especially under small-sample regimes.
- Multi-view or multi-modal learning: Related but structurally distinct bi-level schemes regularize both sample-level and structure-level correspondences, as in multi-view data (Zhang, 2023).
- Molecule property prediction: Bi-level contrastive learning combines molecular graph-level and knowledge graph-level representations, linked by a contrastive objective, to improve the generalization of GNN-based molecular encoders (Jiang et al., 2023).
- Backdoor attacks in SSL: Some works use bi-level optimization for adversarial trigger design, where the inner level simulates victim contrastive learning and the outer level enforces alignment of attacked features to target classes (Sun et al., 2024).
4. Empirical and Theoretical Advantages
Systematic empirical studies (Ghanooni et al., 4 Feb 2025) show key benefits:
- Improved generalization: Bi-level contrastive training yields absolute accuracy gains (CIFAR-100: +1–10% top-1 acc. over single-level SupCon in limited-data experiments).
- Faster convergence: Fewer training epochs are required for comparable accuracy versus a single contrastive objective.
- Semantic structure: Embedding geometry reflects both fine-grained and coarse-grained class groupings, leading to more discriminative and clusterable representations.
- Robustness in few-shot settings: The addition of a coarse-level contrastive term acts as a regularizer, reducing overfitting and improving downstream task stability.
A plausible implication is that the multi-facet supervision enforced by two losses approximates the effect of multi-task learning, but with improved sample efficiency due to shared base features.
5. Implementation Details and Design Guidance
Key considerations for instantiating bi-level contrastive algorithms include (Ghanooni et al., 4 Feb 2025); a small sketch of the loss-monitoring heuristic follows the list:
- Projection Heads: Employ shallow 2-layer MLPs (hidden=512, output=128), applying $\ell_2$ normalization for projection stability.
- Temperature Selection: Grid search over candidate temperatures $\tau_1, \tau_2$; use a lower temperature for the fine-grained loss and a higher one for the coarse-grained loss.
- Weighting of Levels: If the downstream task depends mainly on fine distinctions, increase the fine-level weight $\lambda$ ($0.6$–$0.8$); for shallow or few-shot tasks relying on clustering, bias toward the coarse level by decreasing $\lambda$.
- Data Augmentation: Use stronger augmentations and higher dropout in the coarse-level head under low-data regimes.
- Loss Monitoring: Dynamically down-weight a loss if one level’s contrastive pairs collapse (e.g., no positives).
- Training Dynamics: Empirically, bi-level MLCL converges in fewer epochs (e.g., 250 for MLCL vs. 1000 for SupCon in CIFAR-100).
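As one way to realize the loss-monitoring point above, a batch-level guard can drop a level whose contrastive pairs have collapsed for that step. The function below is an illustrative heuristic under that assumption, not part of the published method:

```python
def combine_level_losses(loss_fine, loss_coarse, lam, n_pos_fine, n_pos_coarse):
    """Convex combination of the two levels; drop a level with no valid positives this batch."""
    if n_pos_fine == 0 and n_pos_coarse == 0:
        raise ValueError("no valid contrastive pairs at either level in this batch")
    if n_pos_fine == 0:
        return loss_coarse      # fine level collapsed: fall back to the coarse objective alone
    if n_pos_coarse == 0:
        return loss_fine        # coarse level collapsed: fall back to the fine objective alone
    return lam * loss_fine + (1.0 - lam) * loss_coarse
```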
6. Comparisons to Single-Level Baselines
Relative to single-level supervised contrastive methods (SupCon), the bi-level algorithm (Ghanooni et al., 4 Feb 2025):
- Encodes parallel similarity structures: e.g., subclass and superclass relations are captured simultaneously via independent loss branches.
- Prevents collapse and over-fragmentation: Coarse class regularization mitigates the risk that fine-grained representations split groups excessively.
- Empirically dominates in limited annotation settings: Substantially improved performance is observed with severely limited samples per class in both images and text.
Standard supervised contrastive learning is, in this context, a degenerate case with only one head and one target label.
7. Extensions and Related Paradigms
The two-level approach is a specific example of the general multi-level contrastive learning design space. Further generalizations (with more than two levels) and extensions to other settings—such as graph, multi-modal, task-adaptive, and adversarial or backdoor contexts—are an ongoing area of research. Notably, bi-level contrastive schemes have also been adapted for:
- Resource allocation in bilevel evolutionary algorithms (Xu et al., 3 Jun 2025)
- Conversation disentanglement with utterance-to-utterance and utterance-to-prototype objectives (Huang et al., 2022)
- Language modeling with token- and sequence-level contrastive objectives (Luo et al., 2021)
- Graph and multi-view subspace learning (Zhang, 2023)
- Adversarial robustness via bilevel optimization (Sun et al., 2024)
These schemes all share the fundamental strategy of supervising representations with at least two coupled but distinct views or granularities of similarity, implemented via layered or parallel contrastive objectives.
References
- Multi-level Supervised Contrastive Learning (Ghanooni et al., 4 Feb 2025)
- CR-BLEA: Contrastive Ranking for Adaptive Resource Allocation in Bilevel Evolutionary Algorithms (Xu et al., 3 Jun 2025)
- Conversation Disentanglement with Bi-Level Contrastive Learning (Huang et al., 2022)
- Backdoor Contrastive Learning via Bi-level Trigger Optimization (Sun et al., 2024)
- Multi-view Feature Extraction based on Dual Contrastive Head (Zhang, 2023)
- Bi-level Contrastive Learning for Knowledge-Enhanced Molecule Representations (Jiang et al., 2023)
- Bi-Granularity Contrastive Learning for Post-Training in Few-Shot Scene (Luo et al., 2021)