Connecting Joint-Embedding Predictive Architecture with Contrastive Self-supervised Learning (2410.19560v1)
Abstract: In recent advances in unsupervised visual representation learning, the Joint-Embedding Predictive Architecture (JEPA) has emerged as a significant method for extracting visual features from unlabeled imagery through an innovative masking strategy. Despite its success, two primary limitations have been identified: the Exponential Moving Average (EMA) used in I-JEPA is ineffective at preventing complete collapse, and I-JEPA's prediction objective is inadequate for accurately learning the mean of patch representations. To address these challenges, this study introduces C-JEPA (Contrastive-JEPA), a framework that integrates the Image-based Joint-Embedding Predictive Architecture with the Variance-Invariance-Covariance Regularization (VICReg) strategy. The integration is designed to learn variance and covariance terms that prevent complete collapse and to enforce invariance in the mean of augmented views, thereby overcoming the identified limitations. Through empirical and theoretical evaluations, our work demonstrates that C-JEPA significantly enhances the stability and quality of visual representation learning. When pre-trained on the ImageNet-1K dataset, C-JEPA exhibits rapid and improved convergence in both linear probing and fine-tuning performance metrics.
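Since the abstract describes the combination only at a high level, the following is a minimal sketch, assuming PyTorch, of how VICReg's variance, invariance, and covariance terms could sit alongside a JEPA-style masked-prediction loss; every name and weight below is illustrative rather than taken from the paper's implementation.

```python
# Minimal sketch, assuming PyTorch, of the VICReg-style terms that C-JEPA
# adds on top of a JEPA prediction loss. All names and loss weights below
# are illustrative placeholders, not taken from the paper's code.
import torch
import torch.nn.functional as F


def vicreg_terms(z_a: torch.Tensor, z_b: torch.Tensor, eps: float = 1e-4):
    """Return invariance, variance, and covariance terms for two batches
    of embeddings of shape (batch, dim) from two augmented views."""
    # Invariance: mean-squared error pulling the two views together.
    inv = F.mse_loss(z_a, z_b)

    # Variance: hinge loss keeping each embedding dimension's std above 1,
    # which discourages collapse to constant representations.
    std_a = torch.sqrt(z_a.var(dim=0) + eps)
    std_b = torch.sqrt(z_b.var(dim=0) + eps)
    var = 0.5 * (F.relu(1.0 - std_a).mean() + F.relu(1.0 - std_b).mean())

    # Covariance: penalize off-diagonal covariance entries to decorrelate
    # embedding dimensions and avoid redundant features.
    def off_diag_cov(z: torch.Tensor) -> torch.Tensor:
        z = z - z.mean(dim=0)
        n, d = z.shape
        cov_mat = (z.T @ z) / (n - 1)
        off_diag = cov_mat - torch.diag(torch.diag(cov_mat))
        return off_diag.pow(2).sum() / d

    cov = 0.5 * (off_diag_cov(z_a) + off_diag_cov(z_b))
    return inv, var, cov


# Hypothetical combined objective: the JEPA masked-prediction loss plus the
# weighted VICReg terms (the 25/25/1 weights follow the original VICReg
# paper's defaults and are placeholders here).
# inv, var, cov = vicreg_terms(z_context, z_target)
# loss = jepa_prediction_loss + 25.0 * inv + 25.0 * var + 1.0 * cov
```

The variance and covariance terms are the ones the abstract credits with preventing complete collapse, while the invariance term enforces agreement between the augmented views' means.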
- Yann LeCun. A path towards autonomous machine intelligence, version 0.9.2, 2022-06-27. OpenReview, 62, 2022.
- Self-supervised learning from images with a joint-embedding predictive architecture. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15619–15629, 2023.
- Revisiting feature prediction for learning visual representations from video. arXiv preprint arXiv:2404.08471, 2024.
- Exploring simple siamese representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15750–15758, 2021.
- Understanding self-supervised learning dynamics without contrastive pairs. In Proceedings of the 38th International Conference on Machine Learning (ICML), pages 10268–10278. PMLR, 2021.
- VICReg: Variance-invariance-covariance regularization for self-supervised learning. In Proceedings of the International Conference on Learning Representations (ICLR), 2022.
- What makes for good views for contrastive learning? In Proceedings of Advances in Neural Information Processing Systems (NeurIPS), pages 6827–6839, 2020.
- Contrastive multiview coding. In Proceedings of the European Conference on Computer Vision (ECCV), 2020.
- Ishan Misra and Laurens van der Maaten. Self-supervised learning of pretext-invariant representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
- BEiT: BERT pre-training of image transformers. arXiv preprint arXiv:2106.08254, 2021.
- SiT: Self-supervised vision transformer. arXiv preprint arXiv:2104.03602, 2021.
- Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377, 2021.
- Masked feature prediction for self-supervised visual pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14668–14678, June 2022.
- SimMIM: A simple framework for masked image modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9653–9663, June 2022.
- Object-wise masked autoencoders for fast pre-training. arXiv preprint arXiv:2205.14338, 2022.
- Masked momentum contrastive learning for zero-shot semantic understanding. arXiv preprint arXiv:2308.11448, 2023.
- Beyond accuracy: Statistical measures and benchmark for evaluation of representation from self-supervised learning. arXiv preprint arXiv:2312.01118, 2023.
- DailyMAE: Towards pretraining masked autoencoders in one day. arXiv preprint arXiv:2404.00509, 2024.
- Context autoencoder for self-supervised representation learning. arXiv preprint arXiv:2202.03026, 2022.
- MST: Masked self-supervised transformer for visual representation. In Proceedings of Advances in Neural Information Processing Systems (NeurIPS), 2021.
- Adversarial masking for self-supervised learning. In Proceedings of the International Conference on Machine Learning (ICML), 2022.
- MVP: Multimodality-guided visual pre-training. In Proceedings of the European Conference on Computer Vision (ECCV), pages 337–353, 2022.
- SemMAE: Semantic-guided masking for learning masked autoencoders. In Proceedings of Advances in Neural Information Processing Systems (NeurIPS), 2022.
- Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020, 2021.
- A simple framework for contrastive learning of visual representations. In Proceedings of the International Conference on Machine Learning (ICML), 2020.