Whitening for Self-Supervised Representation Learning
The paper, "Whitening for Self-Supervised Representation Learning," presents an alternative approach to self-supervised learning (SSL) by introducing a new loss function based on whitening of latent-space features. Conventional SSL methods often rely on contrastive loss which contrasts positive samples (different augmentations of the same image) against negative samples (augmentations from different images). Such methods necessitate a large number of negatives to be effective and can be computationally costly. This research circumvents the need for negative samples by utilizing a whitening operation that naturally scatters batch samples in their latent space, thus preventing degenerate solutions—where all sample representations collapse to a single point—while permitting multiple positive pairs to be extracted from the same image instance.
Methodological Insights
The core of this approach is a whitening transform that maps the batch embeddings to a spherical distribution (zero mean, identity covariance), which keeps the representations from degenerating. This is coupled with a straightforward mean squared error (MSE) objective that minimizes the distance between positive pairs within the whitened space. By eliminating the dependency on negative samples, the method can exploit multiple positive pairs per image, which empirically improves performance, and it simplifies the model architecture because no asymmetric networks are required.
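As a concrete illustration, the sketch below implements a whitening-based MSE loss in PyTorch. It is a minimal sketch under the assumptions that batch whitening is done via a Cholesky factorization of the empirical covariance and that the whitened vectors are L2-normalized before the MSE is taken; the names `whiten` and `w_mse_loss` are illustrative and this is not the authors' official implementation.

```python
import torch
import torch.nn.functional as F

def whiten(z, eps=1e-4):
    """Whiten a batch of embeddings so the batch covariance is (close to) the identity."""
    z = z - z.mean(dim=0, keepdim=True)                        # center the batch
    cov = (z.T @ z) / (z.shape[0] - 1)                         # empirical covariance
    cov = cov + eps * torch.eye(z.shape[1], device=z.device)   # numerical stability
    L = torch.linalg.cholesky(cov)                             # cov = L @ L.T
    # Solve L @ x = z.T, i.e. whitened = z @ L^{-T}
    return torch.linalg.solve_triangular(L, z.T, upper=False).T

def w_mse_loss(z1, z2):
    """MSE between L2-normalized, whitened embeddings of two positive views."""
    n = z1.shape[0]
    z = whiten(torch.cat([z1, z2], dim=0))       # whiten both views jointly
    z = F.normalize(z, dim=1)                    # project onto the unit sphere
    return 2 - 2 * (z[:n] * z[n:]).sum(dim=1).mean()  # mean squared distance of unit vectors
```

For unit vectors, the squared Euclidean distance equals 2 minus twice the cosine similarity, which is why the last line is equivalent to an MSE over the normalized pairs.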
The contributions of this approach can be summarized as follows:
- Whitening MSE (W-MSE) Loss Function: This loss simplifies the learning task by constraining the batch embeddings to a spherical (whitened) distribution, removing the need to contrast positives against negatives.
- Efficient Use of Positive Pairs: Because the method no longer relies on a large pool of negatives, each batch can be devoted to multiple positives per image, a configuration empirically shown to enhance performance (see the sketch after this list).
- Competitive Empirical Results: Through experiments on CIFAR-10, CIFAR-100, STL-10, and Tiny ImageNet, the paper shows that the W-MSE loss outperforms traditional contrastive losses and is competitive with state-of-the-art methods such as BYOL and SimSiam.
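To make the multi-positive configuration concrete, the sketch below averages the pairwise loss over all pairs of d augmented views of the same batch, reusing the hypothetical `w_mse_loss` from the earlier sketch. It mirrors the setup described in the paper but is an illustrative implementation, not the authors' code.

```python
from itertools import combinations
import torch

def multi_positive_w_mse(views):
    """views: list of d tensors, each of shape (batch, dim), one per augmentation.
    Averages the whitened MSE loss over all d*(d-1)/2 pairs of views."""
    pair_losses = [w_mse_loss(v_i, v_j) for v_i, v_j in combinations(views, 2)]
    return torch.stack(pair_losses).mean()
```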
Theoretical and Practical Implications
The theoretical implication of this research is a broader understanding of SSL: it is feasible to dispense with the usual positive-negative paradigm without losing learning effectiveness. Whitening provides a robust way to ensure embedding diversity and prevent collapse, while the MSE term on positive pairs alone minimizes intra-class variance.
Practically, this approach could inspire more computationally efficient SSL models, particularly in resource-constrained settings. Removing the dependence on large pools of negatives means smaller batch sizes can be used, lowering computational cost and enabling SSL in scenarios traditionally considered challenging.
Future Directions
The insights from the paper suggest several avenues for future work. Combining asymmetric architectures with whitening transformations could further improve performance by uniting the strengths of both approaches. Extending these findings beyond vision, for example to natural language processing or multimodal learning, could likewise strengthen SSL techniques in those domains.
In conclusion, this paper challenges the standard conventions of contrastive learning in unsupervised contexts, showing that an intelligent feature normalization strategy such as whitening is sufficient to derive high-quality embeddings from positive pairs alone. This is a promising direction that may lead to faster, simpler, and more adaptable machine learning models.