Bi-level Self-supervised Contrastive Loss (BSCL)
- BSCL is a bi-level loss that integrates instance-level similarity with feature-level clustering to capture both local invariances and global relational patterns.
- It balances two interacting objectives—optimizing discriminative features and clustering structures—to enhance representations in domains such as images and graphs.
- Empirical results show BSCL achieves improved performance and robustness in downstream tasks, with competitive results on ImageNet and knowledge tracing benchmarks.
A Bi-level Self-supervised Contrastive Learning Loss (BSCL) is any loss function or objective that organizes the optimization of representations in self-supervised learning across two separate but interacting levels. The two levels generally correspond to distinct semantic or structural hierarchies—such as instance- and feature-level (image), node- and graph-level (graph)—and the loss is structured to exploit both local discriminative invariances and global relational patterns without full supervision. The bi-level formulation is instantiated either by directly combining contrastive loss functions at different semantic granularity or by nesting an optimization over unsupervised (pretext) and downstream objectives in a bilevel framework. Representative approaches include the BiSSL framework for vision (Zakarias et al., 2024), the explicit BSCL construction for image clustering (Ge et al., 2023), and bi-level losses for structured prediction and graphs (Song et al., 2022).
1. Formal Definition and Mathematical Structure
The distinguishing feature of a BSCL is an architecture and loss that explicitly encodes bi-level objectives. The two main forms are:
(A) Bi-level combination of similarity and clustering losses (Ge et al., 2023)
Given an input $x$, sample two strong augmentations $x_1, x_2$. After shared encoding $f$ and a projection head $g$ with prediction head $h$:
- Prediction stream: $p_i = h(g(f(x_i)))$ for $i \in \{1, 2\}$
- Stop-gradient target: $z_i = \mathrm{sg}\big[g(f(x_i))\big]$ (no gradient back-propagation)
Define:
- Instance-level similarity loss: $\mathcal{L}_{\mathrm{sim}} = -\tfrac{1}{2}\big(\cos(p_1, z_2) + \cos(p_2, z_1)\big)$
- Feature-level (clustering) cross-entropy loss: for soft cluster assignments $q_1, q_2$ of the two views, $\mathcal{L}_{\mathrm{clu}} = -\tfrac{1}{2}\big(q_2^{\top}\log q_1 + q_1^{\top}\log q_2\big)$
with a batch-wise KL divergence to uniformity, $R = \mathrm{KL}\big(\bar{q}\,\|\,\mathcal{U}\big)$, where $\bar{q}$ is the mean assignment over the batch. The full BSCL is
$$\mathcal{L}_{\mathrm{BSCL}} = (1 - \lambda)\,\mathcal{L}_{\mathrm{sim}} + \lambda\big(\mathcal{L}_{\mathrm{clu}} + \lambda_r R\big)$$
for some schedule/interpolation $\lambda \in [0, 1]$.
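The following is a minimal PyTorch sketch of this combined objective, assuming the predictor outputs `p1, p2`, projector targets `z1, z2`, and soft cluster assignments `q1, q2` have already been computed; the function name, default weights, and numerical-stability constants are illustrative, not taken from the reference implementation.

```python
import torch
import torch.nn.functional as F

def bscl_loss(p1, p2, z1, z2, q1, q2, lam=0.5, lam_r=1.0):
    """Combine the instance-level similarity and feature-level clustering terms.

    p1, p2: predictor outputs for the two augmented views
    z1, z2: projector outputs used as stop-gradient targets
    q1, q2: soft cluster-assignment probabilities (batch x clusters)
    """
    # Instance-level similarity: negative cosine against stop-gradient targets
    l_sim = -0.5 * (F.cosine_similarity(p1, z2.detach(), dim=-1).mean()
                    + F.cosine_similarity(p2, z1.detach(), dim=-1).mean())

    # Feature-level clustering: symmetric cross-entropy between the views' assignments
    l_clu = -0.5 * ((q2.detach() * torch.log(q1 + 1e-8)).sum(dim=-1).mean()
                    + (q1.detach() * torch.log(q2 + 1e-8)).sum(dim=-1).mean())

    # Batch-wise KL divergence to the uniform distribution (anti-collapse regularizer)
    mean_q = 0.5 * (q1.mean(dim=0) + q2.mean(dim=0))
    uniform = torch.full_like(mean_q, 1.0 / mean_q.numel())
    r = (mean_q * (torch.log(mean_q + 1e-8) - torch.log(uniform))).sum()

    return (1.0 - lam) * l_sim + lam * (l_clu + lam_r * r)
```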
(B) Bilevel optimization with explicit upper and lower levels (Zakarias et al., 2024)
Let $\mathcal{L}_{\mathrm{pt}}(\theta)$ be a self-supervised contrastive pretext loss with encoder parameters $\theta$, and let $\mathcal{L}_{\mathrm{ds}}$ be a downstream supervised loss (cross-entropy) over parameters $\phi$.
- Lower level (pretext): Find $\theta^*(\phi) = \arg\min_{\theta}\, \mathcal{L}_{\mathrm{pt}}(\theta) + \lambda\, r(\theta, \phi)$, where $r(\theta, \phi)$ is a similarity regularizer between the lower- and upper-level parameters.
- Upper level (downstream): $\min_{\phi}\, \mathcal{L}_{\mathrm{ds}}\big(\theta^*(\phi), \phi\big)$
Here $\theta^*(\phi)$ is the optimum of the lower (pretext) objective regularized towards $\phi$.
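The following is a schematic PyTorch sketch of one lower/upper alternation, using a penalty-based first-order surrogate of the bilevel problem rather than the exact hypergradient update (see Section 2); all module, optimizer, and loss-function arguments are assumed placeholders rather than components of the cited framework.

```python
import torch

def bilevel_round(pretext_loss_fn, downstream_loss_fn,
                  theta_backbone, phi_backbone, phi_head,
                  theta_opt, phi_opt, lam=1.0, inner_steps=10):
    """One lower/upper alternation of a penalty-based surrogate of the bilevel problem.

    theta_backbone: encoder trained on the pretext task (lower level)
    phi_backbone, phi_head: downstream model (upper level)
    """
    # Lower level: pretext loss plus a proximity regularizer towards the downstream backbone
    for _ in range(inner_steps):
        theta_opt.zero_grad()
        prox = sum(((t - p.detach()) ** 2).sum()
                   for t, p in zip(theta_backbone.parameters(), phi_backbone.parameters()))
        (pretext_loss_fn(theta_backbone) + lam * prox).backward()
        theta_opt.step()

    # Upper level: downstream loss, with a proximity term that transfers the
    # (approximately) optimal pretext solution into the downstream backbone
    phi_opt.zero_grad()
    prox = sum(((p - t.detach()) ** 2).sum()
               for p, t in zip(phi_backbone.parameters(), theta_backbone.parameters()))
    (downstream_loss_fn(phi_backbone, phi_head) + lam * prox).backward()
    phi_opt.step()
```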
2. Theoretical Foundations and Gradient Analysis
A core theoretical result is the relationship between instance-level similarity and feature-level clustering losses (Ge et al., 2023):
- Cross-entropy on one-hot labels aligns gradients with cosine similarity between same-class instances.
- The modified cross-entropy (MCE) loss matches the gradient direction of the instance-level similarity loss $\mathcal{L}_{\mathrm{sim}}$ with controlled gradient magnitude, making the two loss components interchangeable in parameter space for certain schedules.
- For bilevel frameworks (Zakarias et al., 2024), optimal updates at the upper level require the implicit derivative $\partial \theta^*(\phi)/\partial \phi$, computed via the Implicit Function Theorem, and are typically approximated with conjugate-gradient solutions for tractable computation in deep architectures (see the sketch below).
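For concreteness, the hypergradient takes a standard implicit-function-theorem form under the notation of Section 1(B); this is a generic statement of the technique rather than a formula reproduced from the cited work. With lower-level objective $G(\theta, \phi) = \mathcal{L}_{\mathrm{pt}}(\theta) + \lambda\, r(\theta, \phi)$ and $\theta^*(\phi) = \arg\min_\theta G(\theta, \phi)$, stationarity $\nabla_\theta G(\theta^*(\phi), \phi) = 0$ yields
$$\frac{\partial \theta^*(\phi)}{\partial \phi} = -\big[\nabla^2_{\theta\theta} G(\theta^*, \phi)\big]^{-1} \nabla^2_{\theta\phi} G(\theta^*, \phi),$$
so the total upper-level gradient is
$$\frac{d}{d\phi}\, \mathcal{L}_{\mathrm{ds}}\big(\theta^*(\phi), \phi\big) = \nabla_\phi \mathcal{L}_{\mathrm{ds}} + \Big(\frac{\partial \theta^*(\phi)}{\partial \phi}\Big)^{\!\top} \nabla_\theta \mathcal{L}_{\mathrm{ds}}.$$
The inverse-Hessian-vector product $\big[\nabla^2_{\theta\theta} G\big]^{-1} \nabla_\theta \mathcal{L}_{\mathrm{ds}}$ is what the conjugate-gradient approximation targets, avoiding explicit Hessian construction.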
3. Optimization Strategies and Training Procedures
Image and Representation Models (Ge et al., 2023, Zakarias et al., 2024)
Pseudocode unifies common steps:
```python
for epoch in range(N_total):
    for minibatch in dataloader:
        # Data augmentations and forward pass
        ...
        # Compute instance and clustering losses, plus regularizer
        if epoch < N_total // 2:
            loss = L_sim
        else:
            loss = L_iccl + lambda_r * R
        # Optimizer step
        ...
```
Structured Data and Graphs (Song et al., 2022)
The loss is decomposed into node-level (local) and graph-level (global) NT-Xent components, combined with an equal or tunable weight $\beta$:
$$\mathcal{L}_{\text{Bi-CLKT}} = \mathcal{L}_{\mathrm{node}} + \beta\, \mathcal{L}_{\mathrm{graph}}.$$
Graph augmentations are performed through node/edge dropout respecting centrality scores. Optimization uses the Adam optimizer.
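A minimal sketch of a combined node- and graph-level NT-Xent objective of this general form is given below (a generic construction, not Bi-CLKT's exact implementation); the node and graph embeddings of the two augmented views are assumed to be precomputed and row-aligned.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.5):
    """NT-Xent between two views' embeddings (rows are aligned positive pairs)."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    z = torch.cat([z1, z2], dim=0)                 # (2N, d)
    sim = z @ z.t() / temperature                  # (2N, 2N) similarity logits
    n = z1.size(0)
    sim.fill_diagonal_(float('-inf'))              # exclude self-similarity
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

def bi_level_graph_loss(node_z1, node_z2, graph_z1, graph_z2, beta=1.0):
    """Weighted sum of node-level (local) and graph-level (global) NT-Xent terms."""
    return nt_xent(node_z1, node_z2) + beta * nt_xent(graph_z1, graph_z2)
```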
4. Empirical Results and Comparative Analysis
Image Classification
On ImageNet-1K, ResNet-50 (100 epochs, linear eval) (Ge et al., 2023):
- SimCLR: 66.5%
- MoCo-v2: 67.4%
- BYOL: 66.0%
- SwAV: 66.5%
- DINO: 67.8%
- BSCL: 68.2% (best among the methods listed)
At 300 epochs, BSCL remains competitive:
- BSCL: 71.7%
- MoCo-v3, BYOL, and DINO: ~72%
Imagenette (ResNet-18, 1000 epochs): BSCL achieves 91.23%, outperforming SimSiam, BYOL, and SwAV.
Transfer and Robustness
BSCL demonstrates robustness across pretraining lengths (100–600 epochs) and stability to hyperparameter scaling, with statistically significant improvements (+1.1–2.8 points) across the 14 downstream benchmarks reported in (Zakarias et al., 2024). No dataset shows accuracy deterioration. Feature-alignment visualizations show tighter intra-class clusters in the pretrained representations.
Graph and Knowledge Tracing
Bi-CLKT's joint bi-level loss achieves consistent improvements in knowledge tracing metrics across four real-world datasets (Song et al., 2022), establishing the benefit of capturing both exercise-level (local) and concept-level (global) structure.
5. Typical Use Cases and Research Impact
BSCL designs are primarily deployed in scenarios where representation learning must balance robust instance invariance against semantically meaningful global structure. Applications include:
- Self-supervised image representation (classification, detection, transfer)
- Knowledge tracing in educational assessment (discriminative exercise/concept embeddings)
- Node and graph-level understanding in relational and structural data domains
A central theme is that bi-level losses facilitate both the compactness of local clusters and the separability of global structures, yielding improved downstream alignment after fine-tuning.
6. Design Variants and Scheduling
Two broad schedules are reported:
- Two-stage: First half of training uses pure instance-level similarity loss; the second half switches to the joint bi-level loss (Ge et al., 2023).
- Single-stage interpolation: A weighted sum smoothly transitions between the similarity and clustering losses (both schedules are sketched below).
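An illustrative helper covering both schedules as a single weighting function (the cosine ramp and function name are assumptions, not taken from the cited papers):

```python
import math

def bscl_weight(epoch, total_epochs, mode="two_stage"):
    """Return the interpolation weight lambda in [0, 1] for the clustering term.

    two_stage:   pure similarity loss for the first half, joint loss afterwards.
    interpolate: smooth transition from similarity to the joint objective.
    """
    if mode == "two_stage":
        return 0.0 if epoch < total_epochs // 2 else 1.0
    # Single-stage interpolation: cosine ramp from 0 to 1 over training
    return 0.5 * (1.0 - math.cos(math.pi * epoch / max(total_epochs - 1, 1)))
```

The returned weight plugs into the combined objective as `loss = (1 - lam) * L_sim + lam * (L_clu + lambda_r * R)`, matching the pseudocode in Section 3.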
In graph contrastive settings, node- and graph-level losses are summed with equal or tunable weights without additional scheduling (Song et al., 2022).
For bilevel optimization frameworks (Zakarias et al., 2024), alternating inner (lower-level) and outer (upper-level) gradient steps is essential, often with mechanisms for exponential moving average synchronization.
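As one concrete (assumed) realization of such synchronization, the exponential moving average update can be sketched as:

```python
import torch

@torch.no_grad()
def ema_sync(target_model, source_model, momentum=0.99):
    """Exponential moving average update of target parameters towards the source."""
    for t, s in zip(target_model.parameters(), source_model.parameters()):
        t.mul_(momentum).add_(s, alpha=1.0 - momentum)
```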
7. Limitations and Open Challenges
Empirical reports confirm that bi-level losses consistently outperform or match monolithic objectives; however, computational demand increases notably due to second-order implicit gradients or extended bi-level updates. Efficient hypergradient computation remains an area of practical concern. While gradient norm analyses and empirical alignment provide support for interchangeability between loss components, broader theoretical underpinnings for general data modalities and schedules are an open avenue for research.
Summary Table: Bi-level Self-supervised Contrastive Loss Designs
| Scheme | Lower Level Objective | Upper Level or Joint Objective | Domains |
|---|---|---|---|
| BSCL (Ge et al., 2023) | Instance-level similarity $\mathcal{L}_{\mathrm{sim}}$ | Feature-level clustering $\mathcal{L}_{\mathrm{clu}} + \lambda_r R$ (joint, scheduled) | Images |
| BiSSL (Zakarias et al., 2024) | Pretext $\mathcal{L}_{\mathrm{pt}} + \lambda\, r(\theta, \phi)$ (with bilevel coupling) | Downstream $\mathcal{L}_{\mathrm{ds}}$ | Images, Vision |
| Bi-CLKT (Song et al., 2022) | Node-level NT-Xent (E2E) | Graph-level NT-Xent (C2C); weighted sum | Graphs, Knowledge Tracing |
In summary, Bi-level Self-supervised Contrastive Learning Losses systematically integrate multi-level invariances and discriminative cues using structured loss schedules or bilevel formulations, yielding demonstrable gains across self-supervised learning, transfer, and structured data domains (Zakarias et al., 2024, Ge et al., 2023, Song et al., 2022).