- The paper introduces an explicit regularization strategy combining variance, invariance, and covariance terms to prevent encoder collapse in self-supervised learning.
- It simplifies the architecture by removing the need for weight sharing between branches, batch-wise normalization, or memory banks, while maintaining robust performance across diverse tasks.
- Empirical results demonstrate competitive accuracy, with VICReg achieving 73.2% top-1 on ImageNet and strong performance in semi-supervised, transfer, and multi-modal scenarios.
VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning
Self-supervised learning methods for image representation largely focus on maximizing agreement between embedding vectors produced by encoders processing different views of the same image. A critical challenge in these methods is avoiding the collapse phenomenon, in which the encoders produce constant or otherwise non-informative vectors. The paper introduces VICReg (Variance-Invariance-Covariance Regularization), which avoids collapse explicitly through two regularization terms applied to the embeddings separately: one that keeps the variance of each embedding dimension above a threshold, and one that decorrelates each pair of embedding variables. These are combined with an invariance term that pulls the embeddings of the two views together.
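Concretely, in the paper's notation, the objective for two batches of n d-dimensional embeddings Z and Z' combines the three terms with trade-off weights λ, μ, ν, where Var(z^j) is the variance of dimension j across the batch, C(Z) is the covariance matrix of the centered embeddings, γ is the variance target, and ε is a small numerical stabilizer:

```latex
\ell(Z, Z') = \lambda\, s(Z, Z') + \mu\left[ v(Z) + v(Z') \right] + \nu\left[ c(Z) + c(Z') \right],
\quad\text{where}\quad
s(Z, Z') = \frac{1}{n} \sum_{i} \lVert z_i - z'_i \rVert_2^2,\;
v(Z) = \frac{1}{d} \sum_{j=1}^{d} \max\!\left(0,\, \gamma - \sqrt{\mathrm{Var}(z^{j}) + \epsilon}\right),\;
c(Z) = \frac{1}{d} \sum_{i \neq j} \left[C(Z)\right]_{i,j}^{2}
```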
Methodology
VICReg uses a joint embedding architecture similar to other recent self-supervised techniques: two branches, each composed of an encoder that produces representations and an expander that maps them to the embeddings on which the loss is computed, are trained to yield consistent embeddings for different views of the same image. Unlike other methods, VICReg requires neither weight sharing between the branches, nor batch-wise normalization, nor memory banks, which simplifies its design and makes it more generally applicable.
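A minimal PyTorch sketch of one branch is below. The ResNet-50 backbone, the three-layer expander, and the 8192-d output match the paper's default setup; the paper's expander also interleaves batch normalization, which VICReg works with but does not depend on, so it is omitted here for clarity.

```python
import torch.nn as nn
import torchvision


class VICRegNet(nn.Module):
    """One branch: an encoder f yielding representations, followed by an
    expander h mapping them to the embeddings the loss is applied to."""

    def __init__(self, embed_dim: int = 8192):
        super().__init__()
        backbone = torchvision.models.resnet50()
        backbone.fc = nn.Identity()           # keep the 2048-d pooled features
        self.encoder = backbone
        self.expander = nn.Sequential(        # three linear layers; the paper's
            nn.Linear(2048, embed_dim),       # default also adds batch norm
            nn.ReLU(),                        # between them
            nn.Linear(embed_dim, embed_dim),
            nn.ReLU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, x):
        return self.expander(self.encoder(x))
```

The two branches can be two instances of this module with independent weights, or a single shared instance applied to both views; VICReg works either way.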
The VICReg loss comprises three terms, implemented in the sketch after this list:
- Invariance Term: The mean squared distance between the embedding vectors of the two views, encouraging the network to learn features that are invariant to different transformations of the same image.
- Variance Term: A hinge loss that keeps the standard deviation of each embedding dimension above a threshold, preventing collapse by discouraging identical embeddings across a batch.
- Covariance Term: Decorrelates the embedding dimensions by penalizing the off-diagonal entries of the covariance matrix, thus maximizing the informative content of the embeddings.
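A compact PyTorch sketch of the full loss follows. The coefficients λ = μ = 25, ν = 1 and the constants γ = 1, ε = 1e-4 match the paper's reported defaults; the function and argument names are mine.

```python
import torch
import torch.nn.functional as F


def vicreg_loss(z_a, z_b, lam=25.0, mu=25.0, nu=1.0, gamma=1.0, eps=1e-4):
    """VICReg loss for two batches of embeddings z_a, z_b of shape (N, D)."""
    n, d = z_a.shape

    # Invariance: mean squared distance between paired embeddings.
    inv_loss = F.mse_loss(z_a, z_b)

    # Variance: hinge keeping each dimension's std above gamma, on both branches.
    std_a = torch.sqrt(z_a.var(dim=0) + eps)
    std_b = torch.sqrt(z_b.var(dim=0) + eps)
    var_loss = F.relu(gamma - std_a).mean() + F.relu(gamma - std_b).mean()

    # Covariance: penalize off-diagonal entries of each covariance matrix.
    z_a = z_a - z_a.mean(dim=0)
    z_b = z_b - z_b.mean(dim=0)
    cov_a = (z_a.T @ z_a) / (n - 1)
    cov_b = (z_b.T @ z_b) / (n - 1)
    off_diag = ~torch.eye(d, dtype=torch.bool, device=z_a.device)
    cov_loss = cov_a[off_diag].pow(2).sum() / d + cov_b[off_diag].pow(2).sum() / d

    return lam * inv_loss + mu * var_loss + nu * cov_loss
```

Note that gradients flow symmetrically through both branches: no stop-gradient, predictor head, or negative-pair mining is needed.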
Empirical Evaluation
The effectiveness of VICReg is validated across several downstream tasks, using established benchmarks with a ResNet-50 backbone pretrained on ImageNet. The evaluation focuses on:
- Linear Classification: VICReg achieves strong performance with 73.2% top-1 accuracy on ImageNet, comparable to state-of-the-art self-supervised methods such as BYOL and Barlow Twins (a minimal linear-probe sketch follows this list).
- Semi-Supervised Learning: It shows competitive results under semi-supervised learning settings where only a fraction of the labels are used for fine-tuning.
- Transfer Learning: The representations learned using VICReg are also evaluated on tasks like scene classification using Places205, multi-label classification with VOC07, and others. It performs on par with the best existing methods.
- Multi-modal Learning: VICReg's ability to use different architectures for its two branches makes it suitable for multi-modal tasks, such as jointly embedding images and text, where it shows superior performance on image and text retrieval on the MS-COCO dataset.
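To illustrate the linear evaluation protocol mentioned above, here is a minimal probe sketch. It assumes `model` is a pretrained `VICRegNet` from the earlier sketch and `train_loader` yields labeled ImageNet batches; the optimizer settings are illustrative, not the paper's.

```python
import torch
import torch.nn as nn

encoder = model.encoder               # probe the representations, not the expander
for p in encoder.parameters():        # embeddings, which are discarded
    p.requires_grad = False           # after pretraining
encoder.eval()                        # freeze the backbone

probe = nn.Linear(2048, 1000)         # 2048-d ResNet-50 features -> 1000 classes
opt = torch.optim.SGD(probe.parameters(), lr=0.02, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

for images, labels in train_loader:
    with torch.no_grad():
        feats = encoder(images)       # frozen features
    loss = loss_fn(probe(feats), labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
```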
Theoretical and Practical Implications
Theoretically, VICReg simplifies the mechanism required to prevent collapse: the variance and covariance terms rule out collapsed solutions directly in the loss, rather than relying on the implicit effects of normalization layers, stop-gradients, or architectural asymmetries. This explicit regularization makes VICReg a more interpretable and robust approach.
Practically, VICReg's fewer architectural constraints mean it can be more widely applied, including to tasks involving multi-modal data such as audio-visual learning and multi-sensor fusion. Because the branches need not share parameters, or even architectures, it extends to scenarios where the inputs to the two branches differ significantly in architecture or data modality, as in the sketch below.
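A hypothetical sketch of this flexibility, reusing `VICRegNet` and `vicreg_loss` from the earlier sketches: the two branches have entirely different architectures and no shared weights. The 768-d text features, the toy layer sizes, and the dummy batch are invented for illustration; a real setup would use an actual text encoder.

```python
import torch
import torch.nn as nn

img_branch = VICRegNet(embed_dim=8192)    # image branch from the earlier sketch
txt_branch = nn.Sequential(               # toy text branch over precomputed
    nn.Linear(768, 8192),                 # 768-d sentence features
    nn.ReLU(),
    nn.Linear(8192, 8192),
)

images = torch.randn(8, 3, 224, 224)      # dummy paired image/text batch
texts = torch.randn(8, 768)

# Same loss as before: invariance pulls paired embeddings together, while
# variance/covariance regularize each branch's embeddings separately.
loss = vicreg_loss(img_branch(images), txt_branch(texts))
loss.backward()
```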
Future Directions
Future research could optimize the expander network further, seeking the most computationally efficient configuration that maintains or improves representation quality. Another direction is to fully leverage VICReg's architectural flexibility by applying it in more diverse domains, such as video understanding or cross-domain transfer learning. Combining VICReg with regularization techniques or architectural components from methods like BYOL or SimCLR might also yield interesting hybrids that take the best of both worlds.
Conclusion
The VICReg method, with its explicit and effective regularization strategy, offers a substantial step towards more interpretable and flexible self-supervised learning architectures. Achieving results on par with the state of the art without complex architectural machinery or unusually heavy computational resources, VICReg presents a compelling case for broadening the scope and applicability of self-supervised learning in computer vision.