
DMT-JEPA: Discriminative Masked Targets for Joint-Embedding Predictive Architecture (2405.17995v1)

Published 28 May 2024 in cs.CV, cs.AI, cs.LG, and eess.IV

Abstract: The joint-embedding predictive architecture (JEPA) has recently shown impressive results in extracting visual representations from unlabeled imagery under a masking strategy. However, we reveal its disadvantages, notably its insufficient understanding of local semantics. This deficiency originates from masked modeling in the embedding space, which reduces discriminative power and can even lead to the neglect of critical local semantics. To bridge this gap, we introduce DMT-JEPA, a novel masked modeling objective rooted in JEPA, specifically designed to generate discriminative latent targets from neighboring information. Our key idea is simple: we consider a set of semantically similar neighboring patches as the target of a masked patch. Specifically, DMT-JEPA (a) computes feature similarities between each masked patch and its corresponding neighboring patches to select patches with semantically meaningful relations, and (b) employs lightweight cross-attention heads to aggregate the features of neighboring patches as the masked targets. Consequently, DMT-JEPA demonstrates strong discriminative power, offering benefits across a diverse spectrum of downstream tasks. Through extensive experiments, we demonstrate its effectiveness across various visual benchmarks, including ImageNet-1K image classification, ADE20K semantic segmentation, and COCO object detection. Code is available at: https://github.com/DMTJEPA/DMTJEPA.
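
The two-step recipe in the abstract (similarity-based neighbor selection followed by cross-attention aggregation) can be sketched in a few lines of PyTorch. The class name, tensor shapes, top-k neighbor count, and single-layer attention head below are illustrative assumptions, not the authors' code; the linked repository contains the actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DMTTargetHead(nn.Module):
    """Illustrative sketch of DMT-JEPA-style target generation.

    For each masked patch: (a) select the k most similar patches by cosine
    similarity, then (b) aggregate them with one lightweight cross-attention
    head to form that patch's discriminative latent target.
    """

    def __init__(self, dim: int, num_heads: int = 4, k: int = 8):
        super().__init__()
        self.k = k
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, feats: torch.Tensor, masked_idx: torch.Tensor) -> torch.Tensor:
        # feats: (B, N, D) patch features from the target encoder
        # masked_idx: (B, M) indices of the masked patches
        B, N, D = feats.shape
        M = masked_idx.size(1)

        # Features of the masked patches, used as cross-attention queries.
        q = torch.gather(feats, 1, masked_idx[..., None].expand(-1, -1, D))  # (B, M, D)

        # (a) Cosine similarity between each masked patch and all patches;
        # each patch is excluded from its own neighbor set.
        sim = F.normalize(q, dim=-1) @ F.normalize(feats, dim=-1).transpose(1, 2)  # (B, M, N)
        sim.scatter_(2, masked_idx.unsqueeze(-1), float('-inf'))
        nbr_idx = sim.topk(self.k, dim=-1).indices  # (B, M, k)

        # Gather the selected neighbor features.
        nbrs = torch.gather(
            feats.unsqueeze(1).expand(-1, M, -1, -1), 2,
            nbr_idx[..., None].expand(-1, -1, -1, D),
        )  # (B, M, k, D)

        # (b) Each masked patch attends to its k neighbors; the attention
        # output serves as the masked target.
        tgt, _ = self.attn(q.reshape(B * M, 1, D),
                           nbrs.reshape(B * M, self.k, D),
                           nbrs.reshape(B * M, self.k, D))
        return tgt.reshape(B, M, D)
```

In a full JEPA-style pipeline, these targets would be built from the target (e.g., EMA) encoder's features and regressed by the context encoder's predictor; the sketch covers only target construction, not the surrounding training loop.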

Authors (2)
  1. Shentong Mo (56 papers)
  2. Sukmin Yun (10 papers)
Citations (3)
