EVA: Exploring the Limits of Masked Visual Representation Learning at Scale

Published 14 Nov 2022 in cs.CV, cs.CL, and cs.LG | (2211.07636v2)

Abstract: We launch EVA, a vision-centric foundation model to explore the limits of visual representation at scale using only publicly accessible data. EVA is a vanilla ViT pre-trained to reconstruct the masked out image-text aligned vision features conditioned on visible image patches. Via this pretext task, we can efficiently scale up EVA to one billion parameters, and sets new records on a broad range of representative vision downstream tasks, such as image recognition, video action recognition, object detection, instance segmentation and semantic segmentation without heavy supervised training. Moreover, we observe quantitative changes in scaling EVA result in qualitative changes in transfer learning performance that are not present in other models. For instance, EVA takes a great leap in the challenging large vocabulary instance segmentation task: our model achieves almost the same state-of-the-art performance on LVISv1.0 dataset with over a thousand categories and COCO dataset with only eighty categories. Beyond a pure vision encoder, EVA can also serve as a vision-centric, multi-modal pivot to connect images and text. We find initializing the vision tower of a giant CLIP from EVA can greatly stabilize the training and outperform the training from scratch counterpart with much fewer samples and less compute, providing a new direction for scaling up and accelerating the costly training of multi-modal foundation models. To facilitate future research, we release all the code and models at https://github.com/baaivision/EVA.


Authors (9)

Citations (556)


Summary

The paper demonstrates that scaling masked image modeling with a vanilla Vision Transformer and CLIP features leads to strong performance across multiple downstream tasks.
It pre-trains on 29.6M unlabeled, publicly accessible images by reconstructing masked image-text aligned features, so the results can be reproduced without proprietary datasets.
EVA serves as a versatile vision-centric model that improves results in image recognition, object detection, segmentation, and video action recognition, and it stabilizes and speeds up CLIP training when used to initialize the vision tower.

Exploring the Limits of Masked Visual Representation Learning at Scale

The paper "EVA: Exploring the Limits of Masked Visual Representation Learning at Scale" introduces a vision-centric foundation model and investigates how far masked image modeling (MIM) can be scaled for visual representation learning. The approach uses a vanilla Vision Transformer (ViT) pre-trained on a single pretext task: reconstructing masked image-text aligned vision features from the visible image patches. With this task the model scales to one billion parameters and performs strongly across a wide range of downstream tasks without heavy supervised training.
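To give a sense of what "one billion parameters" means for a plain ViT, the sketch below counts only the weight matrices of the transformer blocks. The depth, width, and MLP size used in the example are assumptions chosen for illustration; they are not stated in this summary, and the helper name `vit_block_weights` is hypothetical.

```python
def vit_block_weights(depth: int, width: int, mlp_dim: int) -> int:
    """Count weight-matrix parameters in the transformer blocks of a plain ViT.

    Biases, layer norms, patch embedding, and position embeddings are ignored,
    so this slightly underestimates the full model size.
    """
    attention = 4 * width * width  # query, key, value, and output projections
    mlp = 2 * width * mlp_dim      # two linear layers of the feed-forward block
    return depth * (attention + mlp)


# Assumed configuration (illustrative, not taken from this summary).
total = vit_block_weights(depth=40, width=1408, mlp_dim=6144)
print(f"{total / 1e9:.2f}B weights in transformer blocks")
```

Under these assumed values the blocks alone account for roughly one billion weights, which shows that a billion-parameter model can be reached by making a standard ViT deeper and wider, with no architectural changes.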

Methodology

The pre-training focuses on MIM, employing image-text aligned features, particularly CLIP-generated features, as prediction targets. This choice is rooted in an empirical investigation that showed the superiority of using CLIP features directly for masked prediction over alternative approaches like feature tokenization or feature distillation. The model architecture remains a vanilla ViT, which simplifies and streamlines the overall design, focusing on fundamental scaling principles without ornate modifications.
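The pretext task described above can be sketched as a loss computed only on masked patches, comparing the model's predicted features with the target features at the same positions. This is a minimal, dependency-free illustration, not the authors' implementation: the summary does not specify the loss or the masking scheme, so the negative cosine similarity and the uniform random masking used here are assumptions, and all function names are hypothetical.

```python
import math
import random


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def sample_mask(num_patches, mask_ratio, rng):
    """Pick which patches are hidden from the model (True = masked).

    Uniform random masking is an assumption made for this sketch.
    """
    num_masked = int(round(num_patches * mask_ratio))
    masked = set(rng.sample(range(num_patches), num_masked))
    return [i in masked for i in range(num_patches)]


def masked_feature_loss(predicted, target, mask):
    """Mean negative cosine similarity over masked patches only.

    `predicted` holds the model's per-patch outputs, `target` holds the
    per-patch features from the frozen image-text aligned teacher.
    """
    sims = [cosine_similarity(p, t)
            for p, t, m in zip(predicted, target, mask) if m]
    if not sims:
        raise ValueError("mask selects no patches")
    return -sum(sims) / len(sims)


# Toy example: 16 patches with 8-dimensional stand-in target features.
rng = random.Random(0)
target = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(16)]
mask = sample_mask(num_patches=16, mask_ratio=0.4, rng=rng)
noisy_prediction = [[x + rng.gauss(0, 0.5) for x in row] for row in target]
print(masked_feature_loss(noisy_prediction, target, mask))
```

Visible patches contribute nothing to the loss; they only serve as the context the model conditions on. A prediction that matches the targets exactly gives a loss close to -1, the minimum of this objective.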

The authors use only publicly accessible datasets, training the model on 29.6 million unlabeled images. This shows that high performance is achievable without proprietary data or supervised labels, which keeps the work reproducible and accessible.

Performance and Results

On downstream tasks, the model, referred to as EVA, surpasses prior results in image recognition, object detection, instance segmentation, semantic segmentation, and video action recognition. Notably, in large-vocabulary instance segmentation, EVA achieves almost the same performance on LVISv1.0, which has over a thousand categories, as on COCO, which has only eighty. This suggests that a sufficiently large model can close the gap that previously separated large-vocabulary benchmarks from smaller ones.

The model's robustness is further highlighted through comparisons on ImageNet variants, where it shows only a minimal performance drop, indicating strong generalization. The gains extend to video action recognition, where the model sets new records on Kinetics datasets, showing that the learned representation transfers beyond still images.

Beyond its function as a visual encoder, EVA serves as a vision-centric pivot for multi-modal learning, notably for CLIP models. Initializing the vision tower of a giant CLIP model from pre-trained EVA weights greatly stabilizes training and outperforms the trained-from-scratch counterpart while using far fewer samples and less compute.
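The initialization step described above amounts to copying pre-trained encoder weights into the vision branch of a two-tower model while leaving the text branch at its own initialization. The sketch below illustrates the idea with plain dictionaries of lists standing in for tensors; the `visual.` key prefix, the parameter names, and the function `init_vision_tower` are hypothetical and do not describe the released code.

```python
def init_vision_tower(clip_state, eva_state, prefix="visual."):
    """Return a copy of `clip_state` whose vision-tower entries come from `eva_state`.

    Only parameters that exist in the CLIP model under `prefix` and have a
    matching length are copied; everything else (for example the text tower)
    keeps its original values.
    """
    merged = dict(clip_state)
    loaded, skipped = [], []
    for name, value in eva_state.items():
        key = prefix + name
        if key in merged and len(merged[key]) == len(value):
            merged[key] = list(value)  # copy so the two models share no storage
            loaded.append(key)
        else:
            skipped.append(name)
    return merged, loaded, skipped


# Toy state dicts (hypothetical parameter names and values).
clip_state = {
    "visual.patch_embed.weight": [0.0, 0.0, 0.0],
    "visual.blocks.0.attn.weight": [0.0, 0.0],
    "text.token_embed.weight": [0.5, 0.5],
}
eva_state = {
    "patch_embed.weight": [0.1, 0.2, 0.3],
    "blocks.0.attn.weight": [0.4, 0.5],
    "mim_head.weight": [9.0],  # pre-training head with no counterpart in CLIP
}

merged, loaded, skipped = init_vision_tower(clip_state, eva_state)
print(loaded)
print(skipped)
```

Reporting the skipped names is a deliberate choice in this sketch: a silent mismatch in key names would leave the vision tower at random initialization and remove the benefit the paper describes.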

Implications and Future Directions

The results suggest that MIM pre-training can be effectively scaled to billion-parameter models, achieving state-of-the-art performance without extensive supervision. The implications of this work reach beyond visual recognition and bear on multi-modal models, particularly contrastive language-image pre-training. By bridging vision and language, EVA points to future work in which interleaved masked modeling and contrastive learning are combined to scale such models further.

Overall, the paper contributes a compelling case for large-scale MIM and its potential in both vision-specific and multi-modal applications, advancing the conversation on scalable and efficient AI model training.
