The Lottery Tickets Hypothesis for Pre-training in Computer Vision Models
The Lottery Ticket Hypothesis (LTH) has provided a compelling viewpoint on the architecture and training of deep neural networks. This paper extends the scope of LTH to assess its applicability to both supervised and self-supervised pre-training paradigms in computer vision (CV). Traditional pre-training relies on large datasets to produce models with versatile, generalizable representations. The paper questions whether such large dense models are necessary for subsequent downstream tasks, positing that sparse subnetworks identified through LTH could deliver equivalent performance with fewer computational resources.
The authors investigate CV models pre-trained with popular approaches, including supervised ImageNet classification and self-supervised methods such as SimCLR and MoCo. Using iterative magnitude pruning (IMP), they search for sparse subnetworks within the pre-trained dense networks. These subnetworks, termed "winning tickets", can be isolated and trained further with success comparable to their dense counterparts. A notable conclusion from the experiments is that matching subnetworks transfer to various downstream tasks without loss of performance, maintaining high accuracy at substantial sparsity levels ranging from 59.04% to 96.48%.
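To make the IMP procedure concrete, the following is a minimal PyTorch sketch of iterative magnitude pruning on a pre-trained backbone. The pre-trained ResNet-50, the 20% per-round pruning rate, the number of rounds, and the placeholder `train_one_round` routine are illustrative assumptions rather than the paper's exact recipe.

```python
# Minimal sketch of iterative magnitude pruning (IMP) on a pre-trained backbone.
import copy
import torch
import torchvision
import torch.nn.utils.prune as prune

model = torchvision.models.resnet50(weights="IMAGENET1K_V1")   # dense, pre-trained network
pretrained_state = copy.deepcopy(model.state_dict())           # weights to rewind to

# All conv/linear weights are candidates for unstructured pruning.
targets = [(m, "weight") for m in model.modules()
           if isinstance(m, (torch.nn.Conv2d, torch.nn.Linear))]

def train_one_round(net):
    """Placeholder: run the (pre-)training objective for a fixed budget (assumed)."""
    pass

for _ in range(10):                                  # number of IMP rounds (illustrative)
    train_one_round(model)
    # Globally remove the 20% of remaining weights with the smallest magnitudes.
    prune.global_unstructured(targets,
                              pruning_method=prune.L1Unstructured,
                              amount=0.2)
    # Rewind surviving weights to their pre-trained values; the pruning masks stay in place.
    with torch.no_grad():
        for name, module in model.named_modules():
            if isinstance(module, (torch.nn.Conv2d, torch.nn.Linear)):
                module.weight_orig.copy_(pretrained_state[f"{name}.weight"])
```

The surviving mask after the final round defines the candidate winning ticket; whether it "matches" the dense network is then judged by training it in isolation and comparing accuracy.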
From this analysis, the authors report that subnetworks found on the pre-training tasks transfer universally to multiple downstream tasks, including image classification, semantic segmentation, and detection. Notably, subnetwork sparsity could remain as high as 67% without degrading performance, particularly on classification tasks. This observation argues for deploying compressed models and renews the conversation about computational economy in machine learning.
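A sketch of how such transfer can be exercised in practice is shown below: the pruned backbone keeps its masks while a fresh task-specific head is attached and fine-tuned. The 10-class head and the fine-tuning loop are illustrative assumptions; `model` refers to the masked network from the IMP sketch above.

```python
# Reusing a pruned, pre-trained backbone on a downstream classification task.
import torch
import torch.nn as nn

def transfer(pruned_backbone: nn.Module, num_classes: int = 10) -> nn.Module:
    # Swap the ImageNet head for a task-specific one; pruned weights elsewhere
    # remain zero because the pruning masks multiply them out in every forward pass.
    in_features = pruned_backbone.fc.in_features
    pruned_backbone.fc = nn.Linear(in_features, num_classes)
    return pruned_backbone

# Usage (assumes `model` from the IMP sketch and a downstream data `loader`):
# downstream_model = transfer(model, num_classes=10)
# optimizer = torch.optim.SGD(downstream_model.parameters(), lr=0.01, momentum=0.9)
# for images, labels in loader:
#     loss = nn.functional.cross_entropy(downstream_model(images), labels)
#     optimizer.zero_grad(); loss.backward(); optimizer.step()
```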
The paper further examines the pre-training methods by analyzing the mask structures and the perturbation sensitivity of the subnetworks extracted from pre-trained models. One salient finding is that subnetwork behavior varies with the pre-training regime, supervised versus self-supervised, which in turn affects downstream efficacy. Subnetworks from self-supervised pre-training methods such as SimCLR transfer better on natural image classification tasks, whereas those derived from supervised pre-training prove more dependable across domains, including synthetic datasets.
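One simple way to quantify such mask-structure comparisons is a per-layer intersection-over-union between the binary masks produced under two pre-training regimes (e.g., supervised ImageNet versus SimCLR). The IoU metric here is an illustrative choice, not necessarily the paper's exact analysis.

```python
# Per-layer IoU between two binary pruning masks from different pre-training regimes.
import torch

def mask_iou(mask_a: torch.Tensor, mask_b: torch.Tensor) -> float:
    """IoU of the kept (non-zero) positions of two same-shaped binary masks."""
    kept_a, kept_b = mask_a.bool(), mask_b.bool()
    intersection = (kept_a & kept_b).sum().item()
    union = (kept_a | kept_b).sum().item()
    return intersection / union if union else 1.0

def compare_subnetworks(masks_a: dict, masks_b: dict) -> dict:
    """Per-layer IoU for two {layer_name: mask} dictionaries with matching keys."""
    return {name: mask_iou(masks_a[name], masks_b[name]) for name in masks_a}
```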
A pivotal perspective the paper offers lies in drawing parallels to techniques adopted in natural language processing, hinting at broader applications across machine learning if the hypothesis holds more generally. Importantly, it acknowledges two major directions for future work: understanding how larger models affect LTH efficacy, and investigating structured pruning strategies as a natural extension of unstructured pruning.
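For contrast with the unstructured IMP sketched earlier, the snippet below illustrates structured pruning, which removes whole channels rather than individual weights. It uses PyTorch's built-in utility and is only an illustration of the future direction named above, not a method from the paper.

```python
# Structured pruning: zero out entire output channels by L2 norm.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

conv = nn.Conv2d(64, 128, kernel_size=3)
# Remove 30% of output channels (dim=0) with the smallest L2 norm.
prune.ln_structured(conv, name="weight", amount=0.3, n=2, dim=0)
zeroed = (conv.weight.abs().sum(dim=(1, 2, 3)) == 0).sum().item()
print(f"{zeroed} of {conv.out_channels} channels zeroed")
```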
In conclusion, the findings encourage researchers to consider smaller, lottery-winning subnetworks in CV models across pre-training paradigms, uncovering efficiencies that could drive future innovations in deep learning architectures. The implications for AI development are significant, signaling a shift toward models that remain capable while staying lean in computational demand.