Limited Data, Unlimited Potential: A Study on ViTs Augmented by Masked Autoencoders (2310.20704v2)
Abstract: Vision Transformers (ViTs) have become ubiquitous in computer vision. Despite their success, ViTs lack inductive biases, which can make them difficult to train with limited data. To address this challenge, prior studies suggest pre-training ViTs with self-supervised learning (SSL) and then fine-tuning them sequentially. However, we observe that jointly optimizing ViTs for the primary task and a Self-Supervised Auxiliary Task (SSAT) is surprisingly beneficial when the amount of training data is limited. We explore which SSL tasks can be optimized alongside the primary task, the training schemes for these tasks, and the data scales at which they are most effective. Our findings reveal that SSAT is a powerful technique that enables ViTs to leverage the unique characteristics of both the self-supervised and primary tasks, achieving better performance than the typical sequential pipeline of SSL pre-training followed by fine-tuning. Our experiments, conducted on 10 datasets, demonstrate that SSAT significantly improves ViT performance while reducing the carbon footprint of training. We also confirm the effectiveness of SSAT in the video domain for deepfake detection, showcasing its generalizability. Our code is available at https://github.com/dominickrei/Limited-data-vits.
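To make the core idea concrete, here is a minimal PyTorch-style sketch of one SSAT training step. It is an illustration under stated assumptions, not the authors' implementation (that lives in the linked repository): the masking is done in pixel space (SimMIM-style) rather than by dropping tokens as a true masked autoencoder does, the encoder is assumed to pool to a feature vector, and the function names, reconstruction head, and loss weight `lam` are all hypothetical.

```python
import torch
import torch.nn.functional as F

def mask_patches(images, patch=16, mask_ratio=0.75):
    """Zero out a random subset of non-overlapping patches; return the
    masked images and a per-pixel mask (1 where a patch was hidden)."""
    B, C, H, W = images.shape
    gh, gw = H // patch, W // patch
    mask = (torch.rand(B, 1, gh, gw, device=images.device) < mask_ratio).float()
    mask_px = mask.repeat_interleave(patch, dim=2).repeat_interleave(patch, dim=3)
    return images * (1.0 - mask_px), mask_px

def ssat_step(encoder, cls_head, recon_head, optimizer, images, labels, lam=1.0):
    # Primary task: supervised classification on the unmasked image.
    loss_cls = F.cross_entropy(cls_head(encoder(images)), labels)

    # Auxiliary task: reconstruct the pixels of the hidden patches.
    masked, mask_px = mask_patches(images)
    recon = recon_head(encoder(masked))  # assumed to output full-resolution pixels
    loss_ssl = ((recon - images) ** 2 * mask_px).sum() / (
        mask_px.sum() * images.shape[1] + 1e-8)

    # Single joint update: gradients from both losses reach the shared encoder.
    loss = loss_cls + lam * loss_ssl
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss_cls.item(), loss_ssl.item()
```

The point of the sketch is the last few lines: the supervised loss and the reconstruction loss are summed and backpropagated in a single step, so the shared encoder is shaped by both objectives simultaneously, in contrast to the sequential pre-train-then-fine-tune pipeline the paper compares against.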