This paper examines the standard pretrain-then-finetune paradigm used for visual recognition tasks in computer vision.
It introduces an additional pre-pretraining stage based on the self-supervised MAE technique, and shows that this stage scales with both model size and dataset size.
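The three-stage recipe (MAE pre-pretraining, then large-scale supervised pretraining, then downstream fine-tuning) can be illustrated with a minimal sketch. The code below is an assumption-laden toy in PyTorch: the tiny encoder, synthetic data, per-pixel masking, and step counts are stand-ins for the paper's actual models, datasets, and patch-masked MAE objective.

```python
# Minimal sketch of the three-stage recipe, assuming PyTorch and synthetic data.
# The encoder, masking scheme, and datasets are simplified stand-ins, not the paper's code.
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Stand-in for a ViT backbone: flattens the image and embeds it."""
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, dim), nn.ReLU())

    def forward(self, x):
        return self.net(x)

def mae_pre_pretrain(encoder, images, mask_ratio=0.75, steps=10):
    """Stage 1: self-supervised MAE-style objective -- reconstruct the pixels that
    were masked out (crude per-pixel masking here, not the paper's patch masking)."""
    decoder = nn.Linear(128, 3 * 32 * 32)
    opt = torch.optim.AdamW(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
    for _ in range(steps):
        mask = (torch.rand_like(images) > mask_ratio).float()
        recon = decoder(encoder(images * mask)).view_as(images)
        loss = ((recon - images) ** 2 * (1 - mask)).mean()  # loss only on masked pixels
        opt.zero_grad(); loss.backward(); opt.step()

def supervised_pretrain(encoder, images, labels, num_classes=1000, steps=10):
    """Stage 2: (weakly) supervised pretraining on a large labeled corpus."""
    head = nn.Linear(128, num_classes)
    opt = torch.optim.AdamW(list(encoder.parameters()) + list(head.parameters()), lr=1e-3)
    for _ in range(steps):
        loss = nn.functional.cross_entropy(head(encoder(images)), labels)
        opt.zero_grad(); loss.backward(); opt.step()

def finetune(encoder, images, labels, num_classes=10, steps=10):
    """Stage 3: fine-tune on the downstream recognition task."""
    head = nn.Linear(128, num_classes)
    opt = torch.optim.AdamW(list(encoder.parameters()) + list(head.parameters()), lr=1e-4)
    for _ in range(steps):
        loss = nn.functional.cross_entropy(head(encoder(images)), labels)
        opt.zero_grad(); loss.backward(); opt.step()

if __name__ == "__main__":
    enc = TinyEncoder()
    x = torch.randn(16, 3, 32, 32)
    mae_pre_pretrain(enc, x)                                     # pre-pretraining (MAE)
    supervised_pretrain(enc, x, torch.randint(0, 1000, (16,)))   # large-scale pretraining
    finetune(enc, x, torch.randint(0, 10, (16,)))                # downstream fine-tune
```

The point of the sketch is the ordering: the same encoder weights flow through all three stages, with the MAE stage serving purely as an initialization for the subsequent supervised pretraining.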
Key terms:
Pretrain-then-finetune paradigm: The common approach in computer vision for visual recognition, in which models are pretrained on large-scale datasets before being fine-tuned on a target task
Foundation models: State-of-the-art models pretrained on large-scale (weakly) supervised datasets containing billions of images
Self-supervised MAE technique: The masked autoencoder method used to initialize models during the pre-pretraining stage; it has been shown to scale with both model size and dataset size
Model convergence: The point during training at which a model's loss stabilizes and further training yields little additional improvement
Downstream transfer performance: A measure of how well a pretrained model, once fine-tuned for a specific task, performs on new data