The effectiveness of MAE pre-pretraining for billion-scale pretraining (arXiv:2303.13496v3)
Abstract: This paper revisits the standard pretrain-then-finetune paradigm used in computer vision for visual recognition tasks. Typically, state-of-the-art foundation models are pretrained using large-scale (weakly) supervised datasets with billions of images. We introduce an additional pre-pretraining stage that is simple and uses the self-supervised MAE technique to initialize the model. While MAE has previously only been shown to scale with model size, we find that it scales with the size of the training dataset as well. Thus, our MAE-based pre-pretraining scales with both model and data size, making it applicable to training foundation models. Pre-pretraining consistently improves both model convergence and downstream transfer performance across a range of model scales (millions to billions of parameters) and dataset sizes (millions to billions of images). We measure the effectiveness of pre-pretraining on 10 different visual recognition tasks spanning image classification, video recognition, object detection, low-shot classification, and zero-shot recognition. Our largest model achieves new state-of-the-art results on iNaturalist-18 (91.7%), ImageNet-ReaL (91.1%), 1-shot ImageNet-1k (63.6%), and zero-shot transfer on Food-101 (96.2%). Our study reveals that model initialization plays a significant role, even for web-scale pretraining with billions of images, and our models are publicly available.
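To make the three-stage recipe concrete, below is a minimal sketch in PyTorch: an MAE-style masked-reconstruction stage initializes the encoder (pre-pretraining), a weakly supervised classification stage stands in for billion-scale pretraining, and a final stage transfers the encoder to a downstream task. The toy modules, data generators, loss simplifications, and hyperparameters are all illustrative assumptions, not the paper's actual architecture or training recipe.

```python
# Sketch of the pipeline the abstract describes: self-supervised MAE pre-pretraining
# initializes the encoder, which is then pretrained with weak supervision at scale
# and finally fine-tuned on a downstream task. Everything here is a placeholder.

import torch
import torch.nn as nn


class ToyViT(nn.Module):
    """Stand-in for the ViT backbone; the paper scales this to billions of parameters."""

    def __init__(self, patch_dim=768, dim=256, depth=2):
        super().__init__()
        self.embed = nn.Linear(patch_dim, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, patches):                      # patches: (B, N, patch_dim)
        return self.blocks(self.embed(patches))      # tokens:  (B, N, dim)


def mae_pre_pretrain(encoder, batch_fn, steps, mask_ratio=0.75, lr=1e-4):
    """Stage 1 (pre-pretraining): masked-patch reconstruction. Simplified: the real
    MAE encodes only the visible patches and reconstructs the masked ones with a
    lightweight transformer decoder; here we zero out masked patches instead."""
    decoder = nn.Linear(256, 768)                    # 256 = ToyViT embed dim
    opt = torch.optim.AdamW(list(encoder.parameters()) + list(decoder.parameters()), lr=lr)
    for _ in range(steps):
        patches = batch_fn()                         # (B, N, 768)
        mask = torch.rand(patches.shape[:2]) < mask_ratio
        corrupted = patches.masked_fill(mask.unsqueeze(-1), 0.0)
        pred = decoder(encoder(corrupted))           # regress pixels for every patch
        loss = ((pred - patches) ** 2)[mask].mean()  # loss on masked patches only
        opt.zero_grad()
        loss.backward()
        opt.step()


def supervised_stage(encoder, batch_fn, num_classes, steps, lr):
    """Stages 2 and 3 share the same shape: a linear head on mean-pooled tokens,
    trained with cross-entropy, starting from the current encoder weights."""
    head = nn.Linear(256, num_classes)
    opt = torch.optim.AdamW(list(encoder.parameters()) + list(head.parameters()), lr=lr)
    for _ in range(steps):
        patches, labels = batch_fn()
        logits = head(encoder(patches).mean(dim=1))
        loss = nn.functional.cross_entropy(logits, labels)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return head


if __name__ == "__main__":
    # Toy generators standing in for the billion-scale weakly labeled corpus
    # and a small downstream dataset.
    B, N = 8, 196
    images = lambda: torch.randn(B, N, 768)
    weak = lambda: (torch.randn(B, N, 768), torch.randint(0, 1000, (B,)))
    downstream = lambda: (torch.randn(B, N, 768), torch.randint(0, 10, (B,)))

    vit = ToyViT()
    mae_pre_pretrain(vit, images, steps=5)                        # MAE initialization
    supervised_stage(vit, weak, num_classes=1000, steps=5, lr=1e-4)   # weakly supervised pretraining
    supervised_stage(vit, downstream, num_classes=10, steps=5, lr=1e-5)  # downstream transfer
```

The key design point the sketch mirrors is that the MAE stage is used purely as an initializer: its decoder is discarded, and only the encoder weights are carried forward into the weakly supervised and downstream stages.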
Authors: Mannat Singh, Quentin Duval, Kalyan Vasudev Alwala, Haoqi Fan, Vaibhav Aggarwal, Aaron Adcock, Armand Joulin, Piotr Dollár, Christoph Feichtenhofer, Ross Girshick, Rohit Girdhar, Ishan Misra