How JEPA Avoids Noisy Features: The Implicit Bias of Deep Linear Self Distillation Networks (2407.03475v1)

Published 3 Jul 2024 in cs.LG

Abstract: Two competing paradigms exist for self-supervised learning of data representations. Joint Embedding Predictive Architecture (JEPA) is a class of architectures in which semantically similar inputs are encoded into representations that are predictive of each other. A recent successful approach that falls under the JEPA framework is self-distillation, where an online encoder is trained to predict the output of a target encoder, sometimes through a lightweight predictor network. This is contrasted with the Masked Autoencoder (MAE) paradigm, where an encoder and decoder are trained to reconstruct missing parts of the input in the data space, rather than in its latent representation. A common motivation for using the JEPA approach over MAE is that the JEPA objective prioritizes abstract features over fine-grained pixel information (which can be unpredictable and uninformative). In this work, we seek to understand the mechanism behind this empirical observation by analyzing the training dynamics of deep linear models. We uncover a surprising mechanism: in a simplified linear setting where both approaches learn similar representations, JEPAs are biased to learn high-influence features, i.e., features characterized by having high regression coefficients. Our results point to a distinct implicit bias of predicting in latent space that may shed light on its success in practice.
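
To make the self-distillation setup described above concrete, the following is a minimal illustrative sketch (not the authors' code) of a deep linear JEPA-style model: an online encoder, written as a product of weight matrices, is trained to predict the output of an EMA-updated target encoder through a lightweight linear predictor. The dimensions, noise level, learning rate, and EMA coefficient are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not taken from the paper).
d_in, d_hid, d_out, n = 20, 16, 8, 512

# Two correlated "views" of the same underlying data.
x = rng.normal(size=(n, d_in))
x1 = x + 0.1 * rng.normal(size=x.shape)
x2 = x + 0.1 * rng.normal(size=x.shape)

# Deep linear online encoder: a composition of two weight matrices.
W1 = rng.normal(scale=0.1, size=(d_in, d_hid))
W2 = rng.normal(scale=0.1, size=(d_hid, d_out))
# Lightweight linear predictor on top of the online representation.
P = rng.normal(scale=0.1, size=(d_out, d_out))
# Target encoder: an exponential moving average (EMA) of the online weights.
W1_t, W2_t = W1.copy(), W2.copy()

lr, ema = 1e-3, 0.99

for step in range(2000):
    z_online = x1 @ W1 @ W2       # online representation of view 1
    z_target = x2 @ W1_t @ W2_t   # target representation of view 2 (no gradient)
    pred = z_online @ P           # predictor output
    err = pred - z_target         # latent-space prediction error

    # Gradients of the mean squared prediction loss w.r.t. the online parameters only.
    gP = z_online.T @ err / n
    gW2 = (x1 @ W1).T @ (err @ P.T) / n
    gW1 = x1.T @ (err @ P.T @ W2.T) / n

    P -= lr * gP
    W2 -= lr * gW2
    W1 -= lr * gW1

    # The target weights track the online weights via EMA (a stop-gradient branch).
    W1_t = ema * W1_t + (1 - ema) * W1
    W2_t = ema * W2_t + (1 - ema) * W2

print("final prediction loss:", float(np.mean(err ** 2)))
```

The key contrast with an MAE-style objective is the loss target: here the error is computed against the target encoder's latent representation rather than against the raw input, which is the design difference whose implicit bias the paper analyzes.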
