
SAG-VICReg: Stable & Global SSL

Updated 24 April 2026
  • SAG-VICReg is a self-supervised representation learning framework that extends VICReg by incorporating semantically meaningful cross-image pairs using a random-walk mechanism.
  • It replaces the standard instance-level invariance with a weighted invariance term that promotes global semantic structure in the embedding space.
  • Empirical evaluations on datasets like ImageNet and CIFAR-100 show improved hierarchical and global structure metrics compared to traditional VICReg.

SAG-VICReg (Stable and Generalizable VICReg) is a framework for self-supervised representation learning that modifies the Variance-Invariance-Covariance Regularization (VICReg) objective by injecting semantically meaningful cross-image positive pairs using a random-walk pairing mechanism. This approach targets both improved generalization to unseen data and enhanced capture of global semantic structures in the learned embeddings, addressing structural limitations inherent in VICReg's original formulation (Simai et al., 22 Jun 2025).

1. Background: VICReg and Spectral Embedding View

VICReg is a self-supervised learning (SSL) algorithm that learns image representations by optimizing three terms:

  • Invariance:

$$s(Y,Y') = \frac{1}{n} \sum_{i=1}^n \|y_i - y_i'\|_2^2$$

This term enforces that two random augmentations of the same image produce similar embeddings.

  • Variance:

$$v(Y) = \frac{1}{k} \sum_{j=1}^k \max\left(0, \gamma - S(y^j, \epsilon)\right), \quad S(y^j,\epsilon)=\sqrt{\mathrm{Var}(y^j) + \epsilon}$$

It ensures a minimal spread in each embedding dimension, preventing collapse.

  • Covariance:

$$c(Y) = \frac{1}{k} \sum_{p \neq q} [C(Y)]_{pq}^2$$

This encourages decorrelation across embedding dimensions.

The combined loss function is:

$$L_{\rm VICReg}(Y,Y') = \lambda s(Y,Y') + \mu\left[v(Y) + v(Y')\right] + \nu\left[c(Y) + c(Y')\right]$$
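The three terms and their combination can be sketched in NumPy as follows. This is a minimal forward-pass illustration, not the authors' code. The default weights `lam=25`, `mu=25`, `nu=1`, the threshold `gamma=1`, and `eps=1e-4` are assumptions taken from the commonly cited VICReg defaults, and the function names are hypothetical.

```python
import numpy as np

def invariance(y, y2):
    """s(Y, Y'): mean squared Euclidean distance between paired rows."""
    return float(np.mean(np.sum((y - y2) ** 2, axis=1)))

def variance(y, gamma=1.0, eps=1e-4):
    """v(Y): hinge on the per-dimension standard deviation (gamma, eps assumed)."""
    std = np.sqrt(y.var(axis=0, ddof=1) + eps)
    return float(np.mean(np.maximum(0.0, gamma - std)))

def covariance(y):
    """c(Y): sum of squared off-diagonal covariances divided by the dimension k."""
    n, k = y.shape
    yc = y - y.mean(axis=0)
    cov = yc.T @ yc / (n - 1)
    off_diag = cov - np.diag(np.diag(cov))
    return float(np.sum(off_diag ** 2) / k)

def vicreg_loss(y, y2, lam=25.0, mu=25.0, nu=1.0):
    """Weighted sum of the three terms (weights are assumed defaults)."""
    return (lam * invariance(y, y2)
            + mu * (variance(y) + variance(y2))
            + nu * (covariance(y) + covariance(y2)))
```

For a fully collapsed batch (all rows identical) the invariance and covariance terms vanish and only the variance hinge contributes, which is how the variance term penalizes collapse.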

VICReg can be interpreted as spectral embedding on a block-diagonal graph where edges connect only augmentations of the same image. Consequently, it does not capture global relationships between images, leading to sub-optimal representation of unseen data clusters and potentially distorted embeddings for out-of-distribution samples. This view is analogous to SpectralNet objectives but with the restriction that the graph only encodes instance-level locality.
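To make the block-diagonal picture concrete, the sketch below builds the graph over two views of each image and counts its connected components. It is an illustration only: the helper `view_graph` and the example cross-image edges are hypothetical, and SciPy's `connected_components` is used merely to count components.

```python
import numpy as np
from scipy.sparse.csgraph import connected_components

def view_graph(n_images, cross_edges=()):
    """Adjacency over 2n nodes: node i is view 1 of image i, node n+i is view 2."""
    a = np.zeros((2 * n_images, 2 * n_images))
    for i in range(n_images):
        a[i, n_images + i] = a[n_images + i, i] = 1.0  # same-image edge
    for i, j in cross_edges:
        a[i, n_images + j] = a[n_images + j, i] = 1.0  # hypothetical cross-image edge
    return a

# Instance-level graph: every image is its own isolated component.
n_instance, _ = connected_components(view_graph(4), directed=False)
# Two cross-image edges link images 0, 1, and 2 into one component.
n_cross, _ = connected_components(
    view_graph(4, cross_edges=[(0, 1), (1, 2)]), directed=False
)
print(n_instance, n_cross)  # prints "4 2"
```

With only same-image edges the graph has as many components as images, so nothing in the objective relates one image to another; each cross-image edge merges components and lets structure propagate between images.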

2. Random-Walk Pairing: Injecting Global Structure

To overcome the absence of cross-image relationships in VICReg, SAG-VICReg extends the local, block-diagonal neighborhood by constructing semantically informed, stochastic cross-image pairings through random walks on a learned affinity graph. The procedure is as follows:

  1. Generate two batches of embeddings:

$$Z = \{z_i\}_{i=1}^n,\quad Z' = \{z_i'\}_{i=1}^n$$

Each batch is derived from a random augmentation of the same $n$ images.

  2. Build an affinity matrix $W \in \mathbb{R}^{n \times n}$ based on pairwise cosine distances between $z_i$ and $z_j'$:

$$d_{ij} = 1 - \cos(z_i, z_j')$$

For each $i$, retain the $K$ nearest neighbors and set a local scale $\sigma_i$ as the 20th percentile of the distances $d_{ij}$. The affinity $W_{ij}$ is then computed from $d_{ij}$ and $\sigma_i$ over these neighbors.

  3. Define the normalized transition matrix for a one-step random walk:

$$P_{ij} = \frac{W_{ij}}{\sum_{l=1}^{n} W_{il}}$$

$P_{ij}$ gives the probability of transitioning from sample $i$ to sample $j$.

  4. Sample for each $i$ an index $j \sim P_{i\cdot}$ and set $\tilde{z}_i = z_j'$. The new batch $\tilde{Z} = \{\tilde{z}_i\}_{i=1}^n$ comprises these cross-image positive pairs.
  5. Define a weighted invariance term:

$$s_{\rm RW}(Z,Z') = \frac{1}{n} \sum_{i=1}^n \sum_{j=1}^n P_{ij} \|z_i - z_j'\|_2^2$$

(In implementation, one $j$ is sampled per $i$; the expectation matches the above expression.)

This process integrates semantically related cross-image pairs into the invariance objective, enriching the geometry underlying the embedding space.
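The pairing procedure above can be sketched in NumPy as follows. This is a hedged illustration rather than the reference implementation: the Gaussian-style kernel `exp(-(d / sigma)^2)`, the default `k_neighbors=5`, the small floor on `sigma`, and the function name `random_walk_pairs` are all assumptions introduced here, since the text only specifies cosine distances, a nearest-neighbor restriction, and a 20th-percentile local scale.

```python
import numpy as np

def random_walk_pairs(z, z2, k_neighbors=5, pct=20.0, rng=None):
    """Return (P, sampled indices j, paired batch z2[j]) for one random-walk step."""
    rng = np.random.default_rng() if rng is None else rng
    zn = z / np.linalg.norm(z, axis=1, keepdims=True)
    z2n = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    d = np.maximum(1.0 - zn @ z2n.T, 0.0)            # cosine distances d_ij
    sigma = np.percentile(d, pct, axis=1, keepdims=True)
    sigma = np.maximum(sigma, 1e-8)                   # guard against a zero scale
    w = np.exp(-(d / sigma) ** 2)                     # assumed kernel form
    nearest = np.argsort(d, axis=1)[:, :k_neighbors]  # nearest neighbors per row
    keep = np.zeros(d.shape, dtype=bool)
    np.put_along_axis(keep, nearest, True, axis=1)
    w = np.where(keep, w, 0.0)                        # zero affinity elsewhere
    p = w / w.sum(axis=1, keepdims=True)              # row-stochastic transitions
    j = np.array([rng.choice(p.shape[1], p=row) for row in p])
    return p, j, z2[j]
```

Because the nearest neighbor of each row lies at or below the row's 20th-percentile distance, its weight is bounded away from zero, so every row of `w` has a positive sum and the normalization is well defined.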

3. The SAG-VICReg Objective

The loss function for SAG-VICReg retains the original variance and covariance regularization but replaces instance-only invariance with random-walk weighted invariance. The objective is:

$$L_{\rm SAG}(Z,Z') = \lambda\, s_{\rm RW}(Z,Z') + \mu\, v(Z) + \mu\, v(Z') + \nu\, c(Z) + \nu\, c(Z')$$

or, grouping terms,

$$L_{\rm SAG}(Z,Z') = \lambda\, s_{\rm RW}(Z,Z') + \mu\left[v(Z) + v(Z')\right] + \nu\left[c(Z) + c(Z')\right]$$

with the canonical settings of the weights $\lambda$, $\mu$, and $\nu$.

The inclusion of the random-walk weighted invariance $s_{\rm RW}$ in place of the instance-only invariance $s$ promotes invariance not just among multiple views of the same image, but also across semantically proximate images, directly enhancing global semantic structure in the learned representation.
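The relation between the expected and the sampled form of the weighted invariance can be checked with a short sketch. The function names are hypothetical, and the transition matrix `p` is assumed to be row-stochastic, as produced by a one-step random walk.

```python
import numpy as np

def pairwise_sq_dists(z, z2):
    """Matrix of squared Euclidean distances ||z_i - z2_j||^2."""
    diff = z[:, None, :] - z2[None, :, :]
    return np.sum(diff ** 2, axis=2)

def weighted_invariance(z, z2, p):
    """Expected form: (1/n) * sum_ij P_ij * ||z_i - z2_j||^2."""
    return float(np.sum(p * pairwise_sq_dists(z, z2)) / len(z))

def sampled_invariance(z, z2, j):
    """Single-sample form: pair row i of z with row j[i] of z2."""
    return float(np.mean(np.sum((z - z2[j]) ** 2, axis=1)))
```

When `p` is the identity matrix, the weighted term reduces to the instance-level invariance, so the standard VICReg pairing is recovered as a special case. When each row of `p` is one-hot, the sampled and expected forms coincide exactly; for general `p` they agree only in expectation over the sampled indices.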

4. Standalone Evaluation Metric for Global Structure

To assess embedding quality on global semantic structure without requiring labels, SAG-VICReg introduces a dendrogram-based similarity metric:

  • Use agglomerative clustering (Ward linkage, cosine distance) to construct dendrograms $T_1$ and $T_2$ for two embedding sets $E_1$ and $E_2$ (e.g., train and test splits).
  • Compute Lowest Common Ancestor (LCA) distances for all pairs $(i,j)$: the dendrogram height where pair $(i,j)$ merges in $T_1$ yields $h^{(1)}_{ij}$, and analogously $h^{(2)}_{ij}$ for $T_2$.
  • Calculate Pearson, Spearman, and Kendall correlations between $h^{(1)}$ and $h^{(2)}$; high rank correlation indicates similar hierarchical structure.
  • Assess cophenetic correlation via:

$$c = \frac{\sum_{i<j} (x_{ij} - \bar{x})(t_{ij} - \bar{t})}{\sqrt{\sum_{i<j} (x_{ij} - \bar{x})^2 \, \sum_{i<j} (t_{ij} - \bar{t})^2}}$$

where $x_{ij}$ are the original distances, $t_{ij}$ are dendrogram distances, and $\bar{x}$, $\bar{t}$ are their respective means.

This metric directly quantifies the preservation of global structure across domains or data splits in the absence of labels.
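A possible way to compute these quantities with SciPy is sketched below. Several assumptions are made here and are not confirmed by the text: embeddings are L2-normalized and clustered with Euclidean Ward linkage as a stand-in for "Ward linkage, cosine distance" (on unit vectors the squared Euclidean distance is a monotone function of cosine distance), both embedding sets index the same items in the same order, and the function names are hypothetical. The LCA merge height of a pair equals its cophenetic distance, which `cophenet` returns.

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, linkage
from scipy.spatial.distance import pdist
from scipy.stats import kendalltau, pearsonr, spearmanr

def _ward_tree(emb):
    """Ward dendrogram on L2-normalized embeddings (assumed cosine surrogate)."""
    x = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return linkage(x, method="ward"), x

def lca_heights(emb):
    """Condensed vector of merge heights for every pair (i, j), i < j."""
    tree, _ = _ward_tree(emb)
    return cophenet(tree)

def lca_similarity(emb_a, emb_b):
    """Correlations between the pairwise LCA heights of two dendrograms."""
    h_a, h_b = lca_heights(emb_a), lca_heights(emb_b)
    r, _ = pearsonr(h_a, h_b)
    rho, _ = spearmanr(h_a, h_b)
    tau, _ = kendalltau(h_a, h_b)
    return {"pearson": float(r), "spearman": float(rho), "kendall": float(tau)}

def cophenetic_correlation(emb):
    """Correlation between original pairwise distances and dendrogram distances."""
    tree, x = _ward_tree(emb)
    c, _ = cophenet(tree, pdist(x))
    return float(c)
```

As a sanity check, comparing an embedding set with a rescaled copy of itself should give correlations of 1, since normalization removes the scale and the two dendrograms are identical.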

5. Algorithm and Hyperparameters

The SAG-VICReg procedure can be summarized as follows:

  1. Apply two random augmentations to each batch sample, yielding two views of each image.
  2. Compute representations and then embeddings $z_i$, $z_i'$ for both views.
  3. Form batches $Z = \{z_i\}_{i=1}^n$ and $Z' = \{z_i'\}_{i=1}^n$.
  4. Compute pairwise cosine distances $d_{ij} = 1 - \cos(z_i, z_j')$.
  5. For each $i$, construct local scale $\sigma_i$ from the 20th percentile of $d_{ij}$ and compute affinity $W_{ij}$ over the $K$ nearest neighbors.
  6. Normalize $W$ to transition matrix $P$.
  7. For each $i$, sample $j \sim P_{i\cdot}$, set $\tilde{z}_i = z_j'$, yielding $\tilde{Z}$.
  8. Calculate weighted invariance $s_{\rm RW}$, variances, and covariances on the embedding batches.
  9. Compute total loss: $L_{\rm SAG} = \lambda\, s_{\rm RW} + \mu\left[v(Z) + v(Z')\right] + \nu\left[c(Z) + c(Z')\right]$.
  10. Backpropagate gradients and update parameters.

Hyperparameters: loss weights $\lambda$, $\mu$, $\nu$; batch size $n$; number of random-walk neighbors $K$; local scale via the 20th percentile.

6. Empirical Results and Benchmarks

Evaluations on ImageNet-1k, CIFAR-100, and Caltech-256 reveal that SAG-VICReg consistently surpasses or matches VICReg and state-of-the-art SSL baselines on both label-free and conventional structure-preserving metrics:

  • LCA Similarity (Spearman): SAG-VICReg achieves 0.433 on ImageNet versus VICReg's 0.403 (+7%), and 0.296 on CIFAR-100 versus VICReg's 0.223 (+33%).
  • Cophenetic similarity: On CIFAR-100 D1$\to$P2, SAG-VICReg yields 0.405 compared to the next best 0.349 (+16%).
  • Hierarchical Rand Index: Superior results at all hierarchy levels, with larger gains at intermediate granularity.
  • Hierarchical $k$-NN and linear probes: Best or competitive performance on both coarse and fine levels, indicating no compromise in local discriminability.
  • Masking-based comparisons on CIFAR-100: Outperforms DINO, MAE, and I-JEPA in rank correlations despite differing architectural design philosophies.
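The percentage gains quoted in the list are relative improvements over the baseline score. The short check below recomputes them from the reported numbers; the helper name and the dictionary labels are illustrative only.

```python
def relative_gain(new, baseline):
    """Relative improvement of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline

# (SAG-VICReg score, comparison score) as reported in the list above.
reported = {
    "LCA Spearman, ImageNet (vs. VICReg)": (0.433, 0.403),
    "LCA Spearman, CIFAR-100 (vs. VICReg)": (0.296, 0.223),
    "Cophenetic, CIFAR-100 (vs. next best)": (0.405, 0.349),
}
for name, (sag, base) in reported.items():
    print(f"{name}: +{relative_gain(sag, base):.1f}%")
```

This prints gains of 7.4%, 32.7%, and 16.0%, matching the rounded figures of +7%, +33%, and +16%.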

These results indicate that the random-walk pairing mechanism meaningfully enhances global semantic coherence and stability of the learned representations, an effect captured distinctly by the proposed global-structure metric.

7. Significance and Implications

By integrating a principled random-walk–based extension over the original block-diagonal affinity graph of VICReg, SAG-VICReg addresses limitations in generalization and semantic fidelity for self-supervised representations. The approach leverages spectral embedding theory to diagnose sub-optimalities and prescribes an efficient and differentiable modification applicable in current deep learning workflows. The dendrogram-based metric facilitates objective, label-free benchmarking for global embedding structure, of practical value where annotated data is unavailable. A plausible implication is that random-walk pairing or similar spectral enrichment mechanisms may serve as a general template for augmenting contrastive and non-contrastive SSL objectives to better capture data manifold geometry, irrespective of domain (Simai et al., 22 Jun 2025).
