Learning to Mask and Permute Visual Tokens for Vision Transformer Pre-Training (2306.07346v1)

Published 12 Jun 2023 in cs.CV, cs.AI, and cs.MM

Abstract: The use of self-supervised pre-training has emerged as a promising approach to enhance the performance of visual tasks such as image classification. In this context, recent approaches have employed the Masked Image Modeling paradigm, which pre-trains a backbone by reconstructing visual tokens associated with randomly masked image patches. This masking approach, however, introduces noise into the input data during pre-training, leading to discrepancies that can impair performance during the fine-tuning phase. Furthermore, input masking neglects the dependencies between corrupted patches, increasing the inconsistencies observed in downstream fine-tuning tasks. To overcome these issues, we propose a new self-supervised pre-training approach, named Masked and Permuted Vision Transformer (MaPeT), that employs autoregressive and permuted predictions to capture intra-patch dependencies. In addition, MaPeT employs auxiliary positional information to reduce the disparity between the pre-training and fine-tuning phases. In our experiments, we employ a fair setting to ensure reliable and meaningful comparisons and conduct investigations on multiple visual tokenizers, including our proposed $k$-CLIP which directly employs discretized CLIP features. Our results demonstrate that MaPeT achieves competitive performance on ImageNet, compared to baselines and competitors under the same model setting. Source code and trained models are publicly available at: https://github.com/aimagelab/MaPeT.
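As a rough illustration of the permuted-prediction objective described in the abstract, the sketch below builds the attention mask that permuted autoregressive prediction over patch tokens would use: a random permutation of the patches is split into a fully visible context part and a part to be predicted, and each predicted token may only attend to patches that appear earlier in the permutation. This is a minimal sketch under generic PyTorch assumptions; the helper name, the 50% prediction ratio, and the small encoder layer are illustrative and are not the authors' implementation (which additionally uses auxiliary positional information for the targets).

```python
# Minimal sketch of a permuted-prediction attention mask for ViT patch tokens.
# Assumptions (not from the paper): patch count, prediction ratio, and the
# generic nn.TransformerEncoderLayer used to consume the mask.
import torch
import torch.nn as nn


def permuted_attention_mask(num_patches: int, num_predicted: int,
                            perm: torch.Tensor) -> torch.Tensor:
    """Return a boolean (num_patches, num_patches) mask where entry (i, j) is
    True if patch token i may attend to patch token j under permutation `perm`.
    The first (num_patches - num_predicted) patches in the permutation act as
    fully visible context; the remaining ones are predicted autoregressively in
    permuted order, each seeing only patches that come earlier in the permutation."""
    rank = torch.empty(num_patches, dtype=torch.long)
    rank[perm] = torch.arange(num_patches)           # rank[p] = position of patch p in perm
    num_context = num_patches - num_predicted
    allowed = rank.unsqueeze(1) > rank.unsqueeze(0)  # j strictly earlier than i in perm
    allowed |= rank.unsqueeze(0) < num_context       # context patches visible to everyone
    return allowed


num_patches, num_predicted = 196, 98                 # e.g. 14x14 patches, predict half of them
perm = torch.randperm(num_patches)
attn_mask = permuted_attention_mask(num_patches, num_predicted, perm)

# A standard encoder layer can consume the mask directly; PyTorch interprets
# True in `src_mask` as "do not attend", hence the negation.
layer = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
patch_tokens = torch.randn(1, num_patches, 768)      # dummy patch embeddings
out = layer(patch_tokens, src_mask=~attn_mask)
```

In the full method, the targets for the predicted patches come from a visual tokenizer (for instance the proposed k-CLIP, which discretizes CLIP features), and auxiliary position embeddings reduce the gap between pre-training and fine-tuning; the official repository linked above contains the actual implementation.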

Authors (5)
  1. Lorenzo Baraldi
  2. Roberto Amoroso
  3. Marcella Cornia
  4. Andrea Pilzer
  5. Rita Cucchiara
Citations (2)