DMT-JEPA: Discriminative Masked Targets for Joint-Embedding Predictive Architecture (2405.17995v1)
Abstract: The joint-embedding predictive architecture (JEPA) has recently shown impressive results in extracting visual representations from unlabeled imagery under a masking strategy. However, we reveal its disadvantages, notably its insufficient understanding of local semantics. This deficiency originates from masked modeling in the embedding space, which reduces discriminative power and can even lead to the neglect of critical local semantics. To bridge this gap, we introduce DMT-JEPA, a novel masked modeling objective rooted in JEPA and specifically designed to generate discriminative latent targets from neighboring information. Our key idea is simple: we treat a set of semantically similar neighboring patches as the target of a masked patch. Specifically, DMT-JEPA (a) computes feature similarities between each masked patch and its corresponding neighboring patches to select patches with semantically meaningful relations, and (b) employs lightweight cross-attention heads to aggregate the features of the selected neighboring patches into the masked targets. Consequently, DMT-JEPA exhibits strong discriminative power, offering benefits across a diverse spectrum of downstream tasks. Through extensive experiments, we demonstrate its effectiveness on various visual benchmarks, including ImageNet-1K image classification, ADE20K semantic segmentation, and COCO object detection. Code is available at: \url{https://github.com/DMTJEPA/DMTJEPA}.
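To make the two steps (a) and (b) concrete, below is a minimal PyTorch sketch of how a discriminative target could be built for one masked patch. This is not the authors' implementation: the names `CrossAttentionHead` and `discriminative_target` are illustrative, and for brevity the top-k selection ranges over all patches rather than being restricted to a spatial neighborhood as in the paper.

```python
import torch
import torch.nn.functional as F
from torch import nn

class CrossAttentionHead(nn.Module):
    """Lightweight single-head cross-attention: the masked-patch query
    attends to its selected neighbor patches (illustrative module)."""
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, query, neighbors):
        # query: (B, 1, D); neighbors: (B, K, D)
        attn = (self.q(query) @ self.k(neighbors).transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)
        return attn @ self.v(neighbors)  # (B, 1, D) aggregated target

def discriminative_target(patch_feats, masked_idx, head, k=8):
    """Build a latent target for one masked patch from its k most
    similar patches. patch_feats: (B, N, D) target-encoder features."""
    query = patch_feats[:, masked_idx:masked_idx + 1]        # (B, 1, D)
    sim = F.cosine_similarity(query, patch_feats, dim=-1)    # (B, N)
    sim[:, masked_idx] = float('-inf')                       # exclude the patch itself
    topk = sim.topk(k, dim=-1).indices                       # (B, K)
    neighbors = torch.gather(
        patch_feats, 1,
        topk.unsqueeze(-1).expand(-1, -1, patch_feats.size(-1)))
    return head(query, neighbors)                            # (B, 1, D)
```

In this reading, the similarity step filters out semantically unrelated patches, and the cross-attention head learns how to weight the survivors, so the target for a masked patch is an aggregate of meaningful context rather than a single embedding.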
Authors: Shentong Mo, Sukmin Yun