Analyzing Local Representations of Self-supervised Vision Transformers (2401.00463v2)

Published 31 Dec 2023 in cs.CV

Abstract: In this paper, we present a comparative analysis of various self-supervised Vision Transformers (ViTs), focusing on their local representational power. Inspired by LLMs, we examine the ability of ViTs to perform various computer vision tasks with little to no fine-tuning. We design an evaluation framework to analyze the quality of local, i.e., patch-level, representations in the context of few-shot semantic segmentation, instance identification, object retrieval, and tracking. We find that contrastive-learning-based methods such as DINO produce more universal patch representations than masked image modeling does; these representations can be applied immediately to downstream tasks with no parameter tuning. The embeddings learned with the latter approach, e.g., in masked autoencoders, contain high-variance features that harm distance-based algorithms, such as k-NN, and do not carry useful information for most downstream tasks. Furthermore, we demonstrate that removing these high-variance features improves k-NN performance for MAE as well as for its recent extension, Scale-MAE. Finally, we identify an object instance retrieval setting in which DINOv2, a model pretrained on two orders of magnitude more data, falls short of its less compute-intensive counterpart, DINO.
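The abstract's k-NN finding lends itself to a short illustration. Below is a minimal, hypothetical Python sketch of the idea: extract patch-level embeddings, identify the highest-variance feature dimensions on the training set, and compare k-NN accuracy with and without them. The timm-style `forward_features` call, the 95% keep fraction, and the variable names in the usage comments are assumptions for illustration, not the paper's exact protocol.

```python
# Minimal sketch (not the paper's exact setup): evaluate patch-level
# embeddings with k-NN, before and after dropping the highest-variance
# feature dimensions.
import numpy as np
import torch
from sklearn.neighbors import KNeighborsClassifier

@torch.no_grad()
def patch_embeddings(model, images):
    """Return per-patch token embeddings with shape (B * P, D)."""
    tokens = model.forward_features(images)   # (B, 1 + P, D), assumed layout
    patches = tokens[:, 1:, :]                # drop the [CLS] token
    return patches.reshape(-1, patches.shape[-1]).cpu().numpy()

def low_variance_mask(features, keep_fraction=0.95):
    """Indices of the lowest-variance dimensions; the rest are removed."""
    order = np.argsort(features.var(axis=0))  # ascending variance
    return order[: int(keep_fraction * features.shape[1])]

def knn_accuracy(train_x, train_y, test_x, test_y, k=20):
    knn = KNeighborsClassifier(n_neighbors=k).fit(train_x, train_y)
    return knn.score(test_x, test_y)

# Hypothetical usage with patch-level labels (e.g. few-shot segmentation):
# train_x = patch_embeddings(mae_model, train_images)
# test_x = patch_embeddings(mae_model, test_images)
# mask = low_variance_mask(train_x)            # computed on train only
# acc_full = knn_accuracy(train_x, train_y, test_x, test_y)
# acc_trim = knn_accuracy(train_x[:, mask], train_y, test_x[:, mask], test_y)
```

Note that the variance mask is computed on the training features only and the same dimensions are then kept at test time, so the comparison is free of leakage.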

Authors (6)
  1. Ani Vanyan
  2. Alvard Barseghyan
  3. Hakob Tamazyan
  4. Vahan Huroyan
  5. Hrant Khachatrian
  6. Martin Danelljan
