
Connecting Joint-Embedding Predictive Architecture with Contrastive Self-supervised Learning (2410.19560v1)

Published 25 Oct 2024 in cs.CV, cs.AI, cs.LG, eess.IV, and eess.SP

Abstract: In recent advances in unsupervised visual representation learning, the Joint-Embedding Predictive Architecture (JEPA) has emerged as a significant method for extracting visual features from unlabeled imagery through an innovative masking strategy. Despite its success, two primary limitations have been identified: the Exponential Moving Average (EMA) used in I-JEPA is insufficient to prevent complete collapse, and I-JEPA's prediction objective fails to accurately learn the mean of the patch representations. To address these challenges, this study introduces a novel framework, C-JEPA (Contrastive-JEPA), which integrates the Image-based Joint-Embedding Predictive Architecture with the Variance-Invariance-Covariance Regularization (VICReg) strategy. This integration learns the variance/covariance structure needed to prevent complete collapse and enforces invariance in the mean of augmented views, thereby overcoming the identified limitations. Through empirical and theoretical evaluations, our work demonstrates that C-JEPA significantly enhances the stability and quality of visual representation learning. When pre-trained on the ImageNet-1K dataset, C-JEPA exhibits rapid and improved convergence in both linear probing and fine-tuning performance metrics.
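The variance/covariance regularization the abstract refers to follows VICReg, which combines three terms: an invariance (similarity) term between two views' embeddings, a hinge on the per-dimension standard deviation to prevent collapse, and a penalty on off-diagonal covariance entries to decorrelate dimensions. Below is a minimal NumPy sketch of such a loss. The coefficients and the hinge target (γ = 1) follow VICReg's published defaults; applying this loss to I-JEPA embeddings is the paper's idea, but this sketch is illustrative, not the authors' exact implementation.

```python
import numpy as np

def vicreg_loss(z1, z2, sim_w=25.0, var_w=25.0, cov_w=1.0, gamma=1.0, eps=1e-4):
    """VICReg-style loss on two batches of embeddings z1, z2 of shape (N, D)."""
    n, d = z1.shape

    # Invariance: mean-squared error between the two views' embeddings.
    inv = np.mean((z1 - z2) ** 2)

    # Variance: hinge pushing each dimension's std dev up toward gamma,
    # which prevents all embeddings collapsing to a single point.
    def var_term(z):
        std = np.sqrt(z.var(axis=0) + eps)
        return np.mean(np.maximum(0.0, gamma - std))

    var = var_term(z1) + var_term(z2)

    # Covariance: penalize off-diagonal entries of the covariance matrix,
    # decorrelating embedding dimensions (prevents informational collapse).
    def cov_term(z):
        zc = z - z.mean(axis=0)
        c = (zc.T @ zc) / (n - 1)
        off_diag = c - np.diag(np.diag(c))
        return np.sum(off_diag ** 2) / d

    cov = cov_term(z1) + cov_term(z2)
    return sim_w * inv + var_w * var + cov_w * cov
```

Note how the variance term behaves on a collapsed batch: if every embedding is the same vector, the per-dimension std is ~0 and the hinge contributes roughly γ per dimension, so the loss is large even though the invariance term is zero.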

