Feature Normalization Prevents Collapse of Non-contrastive Learning Dynamics (2309.16109v1)

Published 28 Sep 2023 in cs.LG and stat.ML

Abstract: Contrastive learning is a self-supervised representation learning framework in which two positive views generated through data augmentation are pulled together by an attractive force in the representation space, while a repulsive force pushes them away from negative examples. Non-contrastive learning, represented by BYOL and SimSiam, further dispenses with negative examples and improves computational efficiency. Although the learned representations might, at first sight, be expected to collapse into a single point due to the lack of a repulsive force, Tian et al. (2021) showed through an analysis of the learning dynamics that the representations can avoid collapse if data augmentation is sufficiently stronger than regularization. However, their analysis does not take into account the commonly used feature normalization, a normalizer applied before measuring the similarity of representations; as a result, excessively strong regularization can still collapse the dynamics in their model, which is unnatural behavior in the presence of feature normalization. We therefore extend the previous theory, which is based on the L2 loss, by considering the cosine loss, which incorporates feature normalization. We show that the cosine loss induces sixth-order dynamics (whereas the L2 loss induces third-order dynamics), in which a stable equilibrium dynamically emerges even if only collapsed solutions exist under the given initial parameters. We thus offer a new understanding: feature normalization plays an important role in robustly preventing the dynamics from collapsing.

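To make the abstract's contrast between the L2 loss and the cosine loss concrete, below is a minimal NumPy sketch (the representation vectors, their dimension, and the 0.1 augmentation-noise scale are illustrative assumptions, not values from the paper). The cosine loss L2-normalizes each representation before comparison, so uniformly shrinking both views toward the origin, a crude proxy for collapse, keeps reducing the L2 loss while leaving the cosine loss essentially unchanged.

```python
import numpy as np


def l2_loss(z1, z2):
    """Squared Euclidean distance between two view representations."""
    return np.sum((z1 - z2) ** 2)


def cosine_loss(z1, z2, eps=1e-8):
    """Negative cosine similarity after L2 feature normalization."""
    n1 = z1 / (np.linalg.norm(z1) + eps)
    n2 = z2 / (np.linalg.norm(z2) + eps)
    return -np.dot(n1, n2)


rng = np.random.default_rng(0)
z = rng.normal(size=8)              # a hypothetical "clean" representation
z1 = z + 0.1 * rng.normal(size=8)   # view 1 (augmentation noise)
z2 = z + 0.1 * rng.normal(size=8)   # view 2

# Shrinking both views toward zero mimics representation collapse: the L2 loss
# vanishes, but the cosine loss is essentially invariant to the shared scale.
for scale in (1.0, 0.1, 0.01):
    print(f"scale={scale:>4}: L2={l2_loss(scale * z1, scale * z2):.4f}, "
          f"cosine={cosine_loss(scale * z1, scale * z2):.4f}")
```

This scale invariance is only an informal illustration of why feature normalization changes the collapse behavior; the paper's actual argument goes through the sixth-order learning dynamics induced by the cosine loss.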
References (49)
  1. Do more negative samples necessarily hurt in contrastive learning? In Proceedings of the 39th International Conference on Machine Learning, pages 1101–1116. PMLR, 2022.
  2. Richard Bellman. The stability of solutions of linear differential equations. Duke Mathematical Journal, 10(1):643–647, 1943.
  3. On the surrogate gap between contrastive and supervised losses. In Proceedings of the 39th International Conference on Machine Learning, pages 1585–1606. PMLR, 2022.
  4. Classification from pairwise similarity and unlabeled data. In Proceedings of the 35th International Conference on Machine Learning, pages 452–461. PMLR, 2018.
  5. VICReg: Variance-invariance-covariance regularization for self-supervised learning. In Proceedings of the 10th International Conference on Learning Representations, 2022.
  6. Pairwise supervision can provably elicit a decision boundary. In Proceedings of the 25th International Conference on Artificial Intelligence and Statistics, pages 2618–2640. PMLR, 2022.
  7. Exploring simple Siamese representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15750–15758, 2021.
  8. Learning a similarity metric discriminatively, with application to face verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 539–546, 2005.
  9. A simple framework for contrastive learning of visual representations. In Proceedings of the 37th International Conference on Machine Learning, pages 1597–1607. PMLR, 2020.
  10. Unsupervised learning of visual features by contrasting cluster assignments. Advances in Neural Information Processing Systems 33, pages 9912–9924, 2020.
  11. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9650–9660, 2021.
  12. Optimization theory for ReLU neural networks trained with normalization layers. In Proceedings of the 37th International Conference on Machine Learning, pages 2751–2760. PMLR, 2020.
  13. Whitening for self-supervised representation learning. In Proceedings of the 38th International Conference on Machine Learning, pages 3015–3024. PMLR, 2021.
  14. Implicit regularization of discrete gradient dynamics in linear neural networks. Advances in Neural Information Processing Systems 32, pages 3202–3211, 2019.
  15. Bootstrap your own latent - a new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, pages 21271–21284, 2020.
  16. Bootstrapping upper confidence bound. Advances in Neural Information Processing Systems 32, pages 12123–12133, 2019.
  17. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16000–16009, 2022.
  18. [Re] Understanding self-supervised learning dynamics without contrastive pairs. In ML Reproducibility Challenge 2021 (Fall Edition), 2022.
  19. Differential Equations, Dynamical Systems, and An Introduction to Chaos. Academic Press, 2012.
  20. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026–1034, 2015.
  21. Deep residual learning for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
  22. Neural tangent kernel: Convergence and generalization in neural networks. Advances in Neural Information Processing Systems 31, 2018.
  23. Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.
  24. Neural manifold clustering and embedding. arXiv preprint arXiv:2201.10000, 2022.
  25. What shapes the loss landscape of self-supervised learning? In Proceedings of the 11th International Conference on Learning Representations, 2023.
  26. Same pre-training loss, better downstream: Implicit bias matters for language models. In Proceedings of the 40th International Conference on Machine Learning, pages 22188–22214. PMLR, 2023.
  27. Understanding negative samples in instance discriminative self-supervised representation learning. Advances in Neural Information Processing Systems 34, pages 5784–5797, 2021.
  28. On variational bounds of mutual information. In Proceedings of the 36th International Conference on Machine Learning, pages 5171–5180, 2019.
  29. The Matrix Cookbook, 2012.
  30. Contrasting the landscape of contrastive and non-contrastive learning. In Proceedings of the 25th International Conference on Artificial Intelligence and Statistics, pages 8592–8618. PMLR, 2022.
  31. Understanding the limitations of variational mutual information estimators. In Proceedings of the 8th International Conference on Learning Representations, 2020.
  32. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In Proceedings of the 2nd International Conference on Learning Representations, 2014.
  33. A theoretical analysis of contrastive unsupervised representation learning. In Proceedings of the 36th International Conference on Machine Learning, pages 5628–5637. PMLR, 2019.
  34. Pushing the limits of self-supervised ResNets: Can we outperform supervised learning without labels on ImageNet? arXiv preprint arXiv:2201.05119, 2022.
  35. Understanding self-supervised learning dynamics without contrastive pairs. In Proceedings of the 38th International Conference on Machine Learning, pages 10268–10278. PMLR, 2021.
  36. Understanding self-predictive learning for reinforcement learning. In Proceedings of the 40th International Conference on Machine Learning, pages 33632–33656. PMLR, 2023.
  37. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
  38. Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science, volume 47. Cambridge University Press, 2018.
  39. Sub-Weibull distributions: Generalizing sub-Gaussian and sub-exponential properties to heavier-tailed distributions. Stat, 9(1):e318, 2020.
  40. Twan van Laarhoven. L2 regularization versus batch and weight normalization. arXiv preprint arXiv:1706.05350, 2017.
  41. Towards demystifying representation learning with non-contrastive self-supervision. arXiv preprint arXiv:2110.04947, 2021.
  42. On the importance of asymmetry for Siamese representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16570–16579, 2022.
  43. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In Proceedings of the 37th International Conference on Machine Learning, pages 9929–9939. PMLR, 2020.
  44. Toward understanding the feature learning process of self-supervised contrastive learning. In Proceedings of the 38th International Conference on Machine Learning, pages 11112–11122. PMLR, 2021.
  45. The mechanism of prediction head in non-contrastive self-supervised learning. Advances in Neural Information Processing Systems 35, pages 24794–24809, 2022.
  46. Chaos is a ladder: A new theoretical understanding of contrastive learning via augmentation overlap. In Proceedings of the 10th International Conference on Learning Representations, 2022.
  47. Spherical motion dynamics: Learning dynamics of normalized neural network using SGD and weight decay. Advances in Neural Information Processing Systems 34, pages 6380–6391, 2021.
  48. Barlow Twins: Self-supervised learning via redundancy reduction. In Proceedings of the 38th International Conference on Machine Learning, pages 12310–12320. PMLR, 2021.
  49. Towards a unified theoretical understanding of non-contrastive learning via rank differential mechanism. In Proceedings of the 11th International Conference on Learning Representations, 2023.
Authors (1)
  1. Han Bao (77 papers)
Citations (1)
