  • This paper examines the standard pretrain-then-finetune paradigm used in computer vision for visual recognition tasks.
  • It introduces an additional pre-pretraining stage that initializes the model with the self-supervised MAE technique, which scales with both model size and dataset size.

Key terms:

  • Pretrain-then-finetune paradigm: A common approach in computer vision for visual recognition tasks, in which models are pretrained on large-scale datasets and then fine-tuned on a target task.
  • Foundation models: State-of-the-art models pretrained on large-scale (weakly) supervised datasets containing billions of images.
  • Self-supervised MAE technique: Masked autoencoder (MAE) training, used to initialize models in the pre-pretraining stage; it has been shown to scale with both model size and dataset size.
  • Model convergence: The process by which a model learns to make accurate predictions over the course of training.
  • Downstream transfer performance: A measure of how well a pretrained model performs on specific tasks and new data after fine-tuning.
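
The key terms above name the MAE technique without describing how it works. As a rough illustration only: a masked autoencoder hides a random subset of image patches and trains the model to reconstruct them from the patches that remain visible. The sketch below covers just the patch-selection step in plain Python. The function name `random_patch_mask`, the 0.75 mask ratio, and the 196-patch grid are assumptions chosen for illustration, not details taken from the paper summarized here.

```python
import random


def random_patch_mask(num_patches, mask_ratio=0.75, seed=None):
    """Split patch indices into (visible, masked) lists, MAE-style.

    Hypothetical helper: mask_ratio=0.75 is an assumed value for
    illustration, not a setting reported in this summary.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    order = list(range(num_patches))
    rng.shuffle(order)  # random permutation of patch indices
    num_masked = int(num_patches * mask_ratio)
    masked = sorted(order[:num_masked])   # hidden from the encoder
    visible = sorted(order[num_masked:])  # the only patches the encoder sees
    return visible, masked


# Assumed example: a 224x224 image cut into 16x16 patches gives 14 * 14 = 196 patches.
visible, masked = random_patch_mask(num_patches=196, mask_ratio=0.75, seed=0)
print(len(visible), len(masked))  # prints "49 147"
```

Because the training signal comes from the image itself, this step needs no labels, which is why it can serve as an initialization stage before (weakly) supervised pretraining.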


Keywords: research, computer vision, visual recognition, pretrain-then-finetune paradigm, self-supervised MAE, model initialization, model convergence, downstream transfer performance, iNaturalist-18, Food-101