SAG-VICReg is a self-supervised representation learning framework that extends VICReg by incorporating semantically meaningful cross-image pairs using a random-walk mechanism.
It replaces the standard instance-level invariance with a weighted invariance term that promotes global semantic structure in the embedding space.
Empirical evaluations on datasets like ImageNet and CIFAR-100 show improved hierarchical and global structure metrics compared to traditional VICReg.
SAG-VICReg (Stable and Generalizable VICReg) is a framework for self-supervised representation learning that modifies the Variance-Invariance-Covariance Regularization (VICReg) objective by injecting semantically meaningful cross-image positive pairs using a random-walk pairing mechanism. This approach targets both improved generalization to unseen data and enhanced capture of global semantic structures in the learned embeddings, addressing structural limitations inherent in VICReg's original formulation (Simai et al., 22 Jun 2025).
1. Background: VICReg and Spectral Embedding View
VICReg is a self-supervised learning (SSL) algorithm that learns image representations by optimizing three terms:
Invariance:
$$s(Y, Y') = \frac{1}{n} \sum_{i=1}^{n} \lVert y_i - y_i' \rVert_2^2$$
This term enforces that two random augmentations of the same image produce similar embeddings.
Variance:
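$$v(Y) = \frac{1}{k} \sum_{j=1}^{k} \max\bigl(0,\ \gamma - S(y^j, \epsilon)\bigr), \qquad S(y^j, \epsilon) = \sqrt{\mathrm{Var}(y^j) + \epsilon}$$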
It ensures a minimal spread in each embedding dimension, preventing collapse.
Covariance:
$$c(Y) = \frac{1}{k} \sum_{p \neq q} [C(Y)]_{pq}^2$$
This encourages decorrelation across embedding dimensions.
The combined loss function is:
$$\mathcal{L}_{\mathrm{VICReg}}(Y, Y') = \lambda\, s(Y, Y') + \mu\,[v(Y) + v(Y')] + \nu\,[c(Y) + c(Y')]$$
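The three terms can be sketched in NumPy as follows. This is a minimal illustration rather than a reference implementation: it takes $S$ to be the regularized standard deviation as in the original VICReg paper, the defaults `gamma=1.0` and `eps=1e-4` are assumptions, and the loss weights passed at the bottom are placeholders, not values from the source.

```python
import numpy as np

def invariance(y, y_prime):
    """s(Y, Y'): mean squared Euclidean distance between paired rows."""
    return np.mean(np.sum((y - y_prime) ** 2, axis=1))

def variance(y, gamma=1.0, eps=1e-4):
    """v(Y): hinge on the regularized std of each embedding dimension.
    gamma and eps are assumed defaults, not taken from the source."""
    std = np.sqrt(y.var(axis=0, ddof=1) + eps)
    return np.mean(np.maximum(0.0, gamma - std))

def covariance(y):
    """c(Y): sum of squared off-diagonal covariance entries, divided by k."""
    n, k = y.shape
    centered = y - y.mean(axis=0)
    cov = centered.T @ centered / (n - 1)
    off_diag = cov - np.diag(np.diag(cov))
    return np.sum(off_diag ** 2) / k

def vicreg_loss(y, y_prime, lam, mu, nu):
    """Weighted sum of the invariance, variance, and covariance terms."""
    return (lam * invariance(y, y_prime)
            + mu * (variance(y) + variance(y_prime))
            + nu * (covariance(y) + covariance(y_prime)))

rng = np.random.default_rng(0)
y = rng.normal(size=(64, 8))
y_prime = y + 0.1 * rng.normal(size=(64, 8))
# Placeholder weights chosen only for the demonstration.
loss = vicreg_loss(y, y_prime, lam=1.0, mu=1.0, nu=1.0)
print(loss)
```

A fully collapsed batch (all rows identical) drives the variance term close to `gamma`, which is the behavior the hinge is meant to penalize.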
VICReg can be interpreted as spectral embedding on a block-diagonal graph where edges connect only augmentations of the same image. Consequently, it does not capture global relationships between images, leading to sub-optimal representation of unseen data clusters and potentially distorted embeddings for out-of-distribution samples. This view is analogous to SpectralNet objectives but with the restriction that the graph only encodes instance-level locality.
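The block-diagonal reading can be checked numerically. The sketch below builds a graph with one edge per image (joining its two views), forms the unnormalized Laplacian $L = D - W$, and compares the quadratic form $\tfrac{1}{n}\,\mathrm{tr}(U^\top L U)$ with the invariance term. The interleaved row ordering and all variable names are choices made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 5, 3
y = rng.normal(size=(n, k))        # embeddings of the first view
y_prime = rng.normal(size=(n, k))  # embeddings of the second view

# Stack all 2n embeddings so the two views of image i sit in adjacent rows.
u = np.empty((2 * n, k))
u[0::2] = y
u[1::2] = y_prime

# Adjacency with a single edge per image: n diagonal blocks of size 2x2.
w = np.kron(np.eye(n), np.array([[0.0, 1.0], [1.0, 0.0]]))
laplacian = np.diag(w.sum(axis=1)) - w

# tr(U^T L U) = (1/2) * sum_ab W_ab ||u_a - u_b||^2 = sum_i ||y_i - y'_i||^2
spectral_form = np.trace(u.T @ laplacian @ u) / n
invariance = np.mean(np.sum((y - y_prime) ** 2, axis=1))

print(spectral_form, invariance)
```

Because every off-block entry of `w` is zero, no term in the quadratic form couples embeddings of different images, which is the limitation the next section addresses.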
2. Random-Walk Pairing: Injecting Global Structure
To overcome the absence of cross-image relationships in VICReg, SAG-VICReg extends the local, block-diagonal neighborhood by constructing semantically-informed, stochastic cross-image pairings using random walks on a learned affinity graph. The procedure is as follows:
Generate two batches of embeddings, each derived from random augmentations:
$$Z = \{z_i\}_{i=1}^{n}, \qquad Z' = \{z_i'\}_{i=1}^{n}$$
Build an affinity matrix $W \in \mathbb{R}^{n \times n}$ based on pairwise cosine distances between $z_i$ and $z_j'$:
$$d_{ij} = 1 - \cos(z_i, z_j')$$
For each $i$, retain the $k$ nearest neighbors and set a local scale $\sigma_i$ as the 20th percentile of the distances $\{d_{ij}\}_{j}$; together these define the affinities $W_{ij}$. Row-normalizing $W$ yields the transition matrix
$$P_{ij} = \frac{W_{ij}}{\sum_{l} W_{il}}$$
where $P_{ij}$ gives the probability of transitioning from $z_i$ to $z_j'$.
Sample for each $i$ an index $j \sim P_{i\cdot}$ and set $\tilde{z}_i = z_j'$. The new batch $\tilde{Z} = \{\tilde{z}_i\}_{i=1}^{n}$ comprises these cross-image positive pairs.
Define a weighted invariance term:
$$s_{\mathrm{RW}}(Z, Z') = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{n} P_{ij}\, \lVert z_i - z_j' \rVert_2^2$$
(In implementation, one $j$ is sampled per $i$; the expectation matches the above expression.)
This process integrates semantically related cross-image pairs into the invariance objective, enriching the geometry underlying the embedding space.
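The pairing steps above can be sketched in NumPy. Several details are assumptions made for illustration, since the text does not pin them down: the exponential kernel `exp(-d / sigma)`, taking the percentile over the full row of distances, the default `k=5`, and the function name `random_walk_pairs` are all hypothetical.

```python
import numpy as np

def random_walk_pairs(z, z_prime, k=5, percentile=20.0, rng=None):
    """Sample one cross-image positive per row of z.

    Returns (p, idx): a row-stochastic transition matrix and sampled indices.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = z.shape[0]

    # Cosine distance d_ij = 1 - cos(z_i, z'_j).
    zn = z / np.linalg.norm(z, axis=1, keepdims=True)
    zpn = z_prime / np.linalg.norm(z_prime, axis=1, keepdims=True)
    d = np.clip(1.0 - zn @ zpn.T, 0.0, 2.0)

    # Local scale: 20th percentile of each row (assumed: over the full row).
    sigma = np.maximum(np.percentile(d, percentile, axis=1, keepdims=True), 1e-8)

    # ASSUMPTION: exponential kernel; the source does not state the formula.
    w = np.exp(-d / sigma)

    # Keep only the k nearest neighbors in each row.
    nearest = np.argsort(d, axis=1)[:, :k]
    mask = np.zeros_like(w, dtype=bool)
    np.put_along_axis(mask, nearest, True, axis=1)
    w = np.where(mask, w, 0.0)

    # Row-normalize to a transition matrix, then take one random-walk step.
    p = w / w.sum(axis=1, keepdims=True)
    idx = np.array([rng.choice(n, p=p[i]) for i in range(n)])
    return p, idx

rng = np.random.default_rng(0)
z = rng.normal(size=(16, 8))
z_prime = z + 0.1 * rng.normal(size=(16, 8))
p, idx = random_walk_pairs(z, z_prime, k=4, rng=rng)
z_tilde = z_prime[idx]  # cross-image positives, one per row of z
print(idx)
```

With lightly perturbed views, each row's own second view is usually the most probable neighbor, but the walk can also land on a different image whose embedding is close.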
3. The SAG-VICReg Objective
The loss function for SAG-VICReg retains the original variance and covariance regularization but replaces instance-only invariance with random-walk weighted invariance. The objective is:
$$\mathcal{L}_{\mathrm{SAG}}(Z, Z') = \lambda\, s_{\mathrm{RW}}(Z, Z') + \mu\,[v(Z) + v(Z')] + \nu\,[c(Z) + c(Z')]$$
or, grouping terms,
$$\mathcal{L}_{\mathrm{SAG}}(Z, Z') = \lambda\, s_{\mathrm{RW}}(Z, Z') + \sum_{U \in \{Z, Z'\}} \bigl[\mu\, v(U) + \nu\, c(U)\bigr]$$
with the canonical hyperparameters $\lambda$, $\mu$, and $\nu$.
The inclusion of the random-walk weighted invariance $s_{\mathrm{RW}}$ in place of the instance-only $s$ promotes invariance not just among multiple views of the same image, but also across semantically proximate images, directly enhancing global semantic structure in the learned representation.
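To make the substitution concrete, the sketch below assumes the weighted invariance is a $P$-weighted average of squared distances between $z_i$ and $z_j'$, and compares it with the one-sample estimate used in practice. The transition matrix here is a random row-stochastic matrix invented for the check, and the function names are hypothetical.

```python
import numpy as np

def weighted_invariance(z, z_prime, p):
    """Expected form: (1/n) * sum_ij P_ij * ||z_i - z'_j||^2."""
    sq_dist = np.sum((z[:, None, :] - z_prime[None, :, :]) ** 2, axis=2)
    return np.sum(p * sq_dist) / z.shape[0]

def sampled_invariance(z, z_prime, p, rng):
    """One-sample estimate: draw j from row i of P, then average ||z_i - z'_j||^2."""
    n = z.shape[0]
    idx = np.array([rng.choice(n, p=p[i]) for i in range(n)])
    return np.mean(np.sum((z - z_prime[idx]) ** 2, axis=1))

rng = np.random.default_rng(0)
n, k = 8, 4
z = rng.normal(size=(n, k))
z_prime = rng.normal(size=(n, k))

# Hypothetical transition matrix: any row-stochastic matrix works for this check.
p = rng.random((n, n))
p /= p.sum(axis=1, keepdims=True)

expected = weighted_invariance(z, z_prime, p)
estimate = np.mean([sampled_invariance(z, z_prime, p, rng) for _ in range(2000)])

# With P equal to the identity the walk never leaves the image,
# so the term reduces to instance-level invariance.
instance_level = weighted_invariance(z, z_prime, np.eye(n))
print(expected, estimate, instance_level)
```

Averaging many sampled estimates approaches the expected value, and setting $P = I$ recovers the instance-level term, so the original VICReg objective appears as a special case under these assumptions.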
4. Standalone Evaluation Metric for Global Structure
To assess embedding quality on global semantic structure without requiring labels, SAG-VICReg introduces a dendrogram-based similarity metric:
Use agglomerative clustering (Ward linkage, cosine distance) to construct dendrograms $T_1$ and $T_2$ for two embedding sets $E_1$ and $E_2$ (e.g., train and test splits).
Compute Lowest Common Ancestor (LCA) distances for all pairs $(i, j)$: the dendrogram height where pair $(i, j)$ merges in $T_1$ yields $h^{(1)}_{ij}$, and analogously $h^{(2)}_{ij}$ for $T_2$.
Calculate Pearson, Spearman, and Kendall correlations between $\{h^{(1)}_{ij}\}$ and $\{h^{(2)}_{ij}\}$; high rank correlation indicates similar hierarchical structure.
Assess cophenetic correlation via:
$$c = \frac{\sum_{i<j} (d_{ij} - \bar{d})(t_{ij} - \bar{t})}{\sqrt{\sum_{i<j} (d_{ij} - \bar{d})^2 \, \sum_{i<j} (t_{ij} - \bar{t})^2}}$$
where $d_{ij}$ are the original distances, $t_{ij}$ are dendrogram distances, and $\bar{d}$, $\bar{t}$ are their respective means.
This metric directly quantifies the preservation of global structure across domains or data splits in the absence of labels.
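A minimal SciPy sketch of this comparison follows. SciPy's Ward linkage is defined for Euclidean distances, so the sketch L2-normalizes the embeddings first; that is an assumption of this example (on unit vectors, Euclidean distance is a monotone function of cosine distance). It also assumes that row $i$ of both embedding sets refers to the same item, and the synthetic data and the name `lca_distances` are invented for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, linkage
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def lca_distances(embeddings):
    """Return (pairwise merge heights, cophenetic correlation) for one set."""
    # ASSUMPTION: normalize rows so Euclidean Ward linkage tracks cosine distance.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    tree = linkage(unit, method="ward")
    # cophenet returns the correlation with the original distances and the
    # condensed vector of heights at which each pair first merges.
    coph_corr, merge_heights = cophenet(tree, pdist(unit))
    return merge_heights, coph_corr

rng = np.random.default_rng(0)
centers = 5.0 * rng.normal(size=(4, 16))
labels = np.repeat(np.arange(4), 10)
emb_a = centers[labels] + rng.normal(size=(40, 16))  # first embedding set
emb_b = centers[labels] + rng.normal(size=(40, 16))  # same structure, new noise
emb_noise = rng.normal(size=(40, 16))                # no shared structure

lca_a, coph_a = lca_distances(emb_a)
lca_b, _ = lca_distances(emb_b)
lca_n, _ = lca_distances(emb_noise)

rho_same, _ = spearmanr(lca_a, lca_b)
rho_noise, _ = spearmanr(lca_a, lca_n)
print(rho_same, rho_noise, coph_a)
```

Two embedding sets that share the same cluster hierarchy should score a higher rank correlation than a structured set paired with unstructured noise; Pearson and Kendall variants can be obtained the same way with `scipy.stats.pearsonr` and `scipy.stats.kendalltau`.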
5. Algorithm and Hyperparameters
The SAG-VICReg procedure can be summarized as follows:
1. Apply two random augmentations to each batch sample, yielding two views of every image.
2. Compute representations and embeddings for both views.
3. Form the embedding batches $Z = \{z_i\}_{i=1}^{n}$ and $Z' = \{z_i'\}_{i=1}^{n}$.
4. Compute pairwise cosine distances $d_{ij} = 1 - \cos(z_i, z_j')$.
5. For each $i$, construct the local scale $\sigma_i$ from the 20th percentile of $\{d_{ij}\}_{j}$ and compute the affinity $W_{ij}$ over the $k$ nearest neighbors.
6. Normalize $W$ to the transition matrix $P$.
7. For each $i$, sample $j \sim P_{i\cdot}$ and set $\tilde{z}_i = z_j'$, yielding $\tilde{Z}$.
8. Calculate the weighted invariance, variances, and covariances on the two embedding batches.
9. Compute the total loss as the weighted sum of the invariance, variance, and covariance terms.
10. Backpropagate gradients and update parameters.
Hyperparameters: loss weights $\lambda$, $\mu$, $\nu$; batch size $n$; number of random-walk neighbors $k$; local scale via 20th percentile.
6. Empirical Results and Benchmarks
Evaluations on ImageNet-1k, CIFAR-100, and Caltech-256 reveal that SAG-VICReg consistently surpasses or matches VICReg and state-of-the-art SSL baselines on both label-free and conventional structure-preserving metrics:
LCA Similarity (Spearman): SAG-VICReg achieves 0.433 on ImageNet versus VICReg's 0.403 (+7%), and 0.296 on CIFAR-100 versus VICReg's 0.223 (+33%).
Cophenetic similarity: On CIFAR-100 (D1, P2), SAG-VICReg yields 0.405 compared to the next best 0.349 (+16%).
Hierarchical Rand Index: Superior results at all hierarchy levels, with larger gains at intermediate granularity.
Hierarchical $k$-NN and linear probes: Best or competitive performance on both coarse and fine levels, indicating no compromise in local discriminability.
Masking-based comparisons on CIFAR-100: Outperforms DINO, MAE, and I-JEPA in rank correlations despite differing architectural design philosophies.
These results indicate that the random-walk pairing mechanism meaningfully enhances global semantic coherence and stability of the learned representations, an effect captured distinctly by the proposed global-structure metric.
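The percentage gains quoted above are relative improvements over the baseline score. A short check of that arithmetic, using only the figures reported in this section (the dictionary keys and function name are labels chosen for this example):

```python
# (SAG-VICReg score, baseline score) pairs as reported in this section.
reported = {
    "LCA Spearman, ImageNet": (0.433, 0.403),
    "LCA Spearman, CIFAR-100": (0.296, 0.223),
    "Cophenetic, CIFAR-100": (0.405, 0.349),
}

def relative_gain(new, baseline):
    """Relative improvement over the baseline, in percent."""
    return 100.0 * (new - baseline) / baseline

gains = {name: round(relative_gain(new, base))
         for name, (new, base) in reported.items()}

for name, gain in gains.items():
    print(f"{name}: +{gain}%")
```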
7. Significance and Implications
By integrating a principled random-walk–based extension over the original block-diagonal affinity graph of VICReg, SAG-VICReg addresses limitations in generalization and semantic fidelity for self-supervised representations. The approach leverages spectral embedding theory to diagnose sub-optimalities and prescribes an efficient and differentiable modification applicable in current deep learning workflows. The dendrogram-based metric facilitates objective, label-free benchmarking for global embedding structure, of practical value where annotated data is unavailable. A plausible implication is that random-walk pairing or similar spectral enrichment mechanisms may serve as a general template for augmenting contrastive and non-contrastive SSL objectives to better capture data manifold geometry, irrespective of domain (Simai et al., 22 Jun 2025).