Morphing Tokens Draw Strong Masked Image Models (2401.00254v3)
Abstract: Masked image modeling (MIM) has emerged as a promising approach for training Vision Transformers (ViTs). The essence of MIM lies in the token-wise prediction of masked tokens, whose targets are tokenized from images or generated by pre-trained models such as vision-language models. While tokenizers and pre-trained models are plausible sources of MIM targets, they often yield spatially inconsistent targets even for neighboring tokens, making it harder for models to learn unified and discriminative representations. Our pilot study identifies such spatial inconsistencies and suggests that resolving them can accelerate representation learning. Building on this insight, we introduce a novel self-supervision signal called Dynamic Token Morphing (DTM), which dynamically aggregates contextually related tokens to yield contextualized targets, thereby mitigating spatial inconsistency. DTM is compatible with various SSL frameworks; we showcase improved MIM results with DTM while introducing barely any extra training cost. Our method facilitates training through consistent targets, resulting in 1) faster training and 2) reduced losses. Experiments on ImageNet-1K and ADE20K demonstrate the superiority of our method over state-of-the-art, more complex MIM methods. Furthermore, comparative evaluations on iNaturalist and fine-grained visual classification datasets further validate the transferability of our method to various downstream tasks. Code is available at https://github.com/naver-ai/dtm
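To make the core idea concrete, the sketch below smooths token-wise teacher targets by repeatedly averaging each token with its most similar token, so that contextually related tokens converge toward a shared, contextualized target. This is a minimal, hypothetical PyTorch sketch of similarity-based token aggregation (in the spirit of token merging), not the paper's actual DTM implementation; `morph_tokens`, its signature, and the pairing rule are illustrative assumptions, and the authors' real code lives in the linked repository.

```python
import torch
import torch.nn.functional as F

def morph_tokens(targets: torch.Tensor, num_iters: int = 2) -> torch.Tensor:
    """Smooth token-wise targets by averaging each token with its most
    similar token. Hypothetical sketch only; not the authors' DTM code.

    targets: (B, N, D) target features, e.g. from a frozen teacher model.
    Returns a tensor of the same shape in which contextually related
    tokens have been pulled toward a shared (morphed) representation.
    """
    out = targets.clone()
    B, N, D = out.shape
    for _ in range(num_iters):
        feats = F.normalize(out, dim=-1)
        sim = feats @ feats.transpose(1, 2)        # (B, N, N) cosine similarity
        sim.diagonal(dim1=1, dim2=2).fill_(-1.0)   # forbid self-matching
        nearest = sim.argmax(dim=-1)               # (B, N) most similar token index
        partner = torch.gather(out, 1, nearest.unsqueeze(-1).expand(-1, -1, D))
        out = 0.5 * (out + partner)                # pull each token toward its match
    return out

# Usage (illustrative): regress masked student predictions onto morphed targets
# instead of the raw, spatially inconsistent teacher features.
# teacher_feats = teacher(images)                  # (B, N, D)
# morphed = morph_tokens(teacher_feats)
# loss = F.smooth_l1_loss(student_preds[mask], morphed[mask])
```

The intuition this sketch captures is the one stated in the abstract: raw per-token targets can disagree even between neighboring patches, and aggregating contextually related tokens before computing the MIM loss gives the student a more spatially consistent regression target.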
Authors: Taekyung Kim, Dongyoon Han, Byeongho Heo