
The Lottery Tickets Hypothesis for Supervised and Self-supervised Pre-training in Computer Vision Models (2012.06908v2)

Published 12 Dec 2020 in cs.LG, cs.CV, and cs.NE

Abstract: The computer vision world has been re-gaining enthusiasm in various pre-trained models, including both classical ImageNet supervised pre-training and recently emerged self-supervised pre-training such as simCLR and MoCo. Pre-trained weights often boost a wide range of downstream tasks including classification, detection, and segmentation. Latest studies suggest that pre-training benefits from gigantic model capacity. We are hereby curious and ask: after pre-training, does a pre-trained model indeed have to stay large for its downstream transferability? In this paper, we examine supervised and self-supervised pre-trained models through the lens of the lottery ticket hypothesis (LTH). LTH identifies highly sparse matching subnetworks that can be trained in isolation from (nearly) scratch yet still reach the full models' performance. We extend the scope of LTH and question whether matching subnetworks still exist in pre-trained computer vision models, that enjoy the same downstream transfer performance. Our extensive experiments convey an overall positive message: from all pre-trained weights obtained by ImageNet classification, simCLR, and MoCo, we are consistently able to locate such matching subnetworks at 59.04% to 96.48% sparsity that transfer universally to multiple downstream tasks, whose performance see no degradation compared to using full pre-trained weights. Further analyses reveal that subnetworks found from different pre-training tend to yield diverse mask structures and perturbation sensitivities. We conclude that the core LTH observations remain generally relevant in the pre-training paradigm of computer vision, but more delicate discussions are needed in some cases. Codes and pre-trained models will be made available at: https://github.com/VITA-Group/CV_LTH_Pre-training.

The Lottery Tickets Hypothesis for Pre-training in Computer Vision Models

In recent developments within machine learning, the Lottery Ticket Hypothesis (LTH) has provided a compelling viewpoint on the architecture and training of deep neural networks. This paper extends the scope of LTH to assess its applicability to both supervised and self-supervised pre-training paradigms in computer vision (CV). Traditional pre-training uses large datasets to produce models with versatile, generalizable representations. The paper questions whether such large models are actually necessary for downstream tasks, positing that sparse subnetworks identified through the LTH could deliver equivalent performance with fewer computational resources.

The authors investigate pre-trained CV models obtained with popular pre-training approaches: supervised ImageNet classification and the self-supervised methods simCLR and MoCo. Through iterative magnitude pruning (IMP), they search for sparse subnetworks within the pre-trained dense networks. These subnetworks, termed "winning tickets", can be isolated and trained further with success comparable to their dense counterparts. A notable conclusion from the experiments is that matching subnetworks consistently transfer to a variety of downstream tasks with no loss in performance, even at substantial sparsity levels ranging from 59.04% to 96.48%.
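The IMP procedure can be sketched as follows. This is a minimal illustration in PyTorch, not the authors' released code: the ResNet-50 backbone, the roughly-20%-per-round pruning rate, the number of rounds, and the `fine_tune` placeholder are assumptions made for exposition.

```python
import copy

import torch
import torchvision
from torch.nn.utils import prune

# Load a pre-trained backbone; the paper studies ImageNet-supervised,
# simCLR, and MoCo weights, but the supervised checkpoint is used here
# for simplicity.
model = torchvision.models.resnet50(weights="IMAGENET1K_V1")
pretrained_state = copy.deepcopy(model.state_dict())  # rewind point

# Prune only convolutional weights, as is common in LTH studies.
conv_layers = [(name, m) for name, m in model.named_modules()
               if isinstance(m, torch.nn.Conv2d)]
prune_params = [(m, "weight") for _, m in conv_layers]

def fine_tune(model):
    """Placeholder for training on the pruning task (e.g., a few epochs)."""
    ...

for _ in range(10):  # illustrative number of IMP rounds
    fine_tune(model)
    # Globally remove roughly 20% of the remaining conv weights by magnitude.
    prune.global_unstructured(prune_params,
                              pruning_method=prune.L1Unstructured,
                              amount=0.2)
    # Rewind surviving weights to their pre-trained values, keeping the mask.
    with torch.no_grad():
        for name, module in conv_layers:
            module.weight_orig.copy_(pretrained_state[name + ".weight"])
```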

The analysis reports universal transferability of subnetworks found on the pre-training tasks to multiple downstream tasks, including image classification, semantic segmentation, and detection. Notably, subnetwork sparsity can remain as high as 67% without degrading performance, particularly on classification tasks. This observation supports deploying compressed models and renews the conversation about computational economy in machine learning.
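Transferring a found subnetwork amounts to applying its binary mask to the full pre-trained weights and then fine-tuning on the downstream task. A minimal sketch follows, assuming a mask dictionary saved from the IMP step and a 10-class downstream head; the file name, head size, and optimizer settings are hypothetical.

```python
import torch
import torchvision
from torch.nn.utils import prune

def build_subnetwork(masks, num_classes):
    """Apply IMP-found binary masks to the full pre-trained weights."""
    model = torchvision.models.resnet50(weights="IMAGENET1K_V1")
    for name, module in model.named_modules():
        if name in masks:
            # Fix the sparsity pattern: pruned entries stay zero during fine-tuning.
            prune.custom_from_mask(module, name="weight", mask=masks[name])
    # Replace the classification head for the downstream task.
    model.fc = torch.nn.Linear(model.fc.in_features, num_classes)
    return model

# Hypothetical mask file produced by an IMP run like the one sketched above.
masks = torch.load("imp_masks.pt")
model = build_subnetwork(masks, num_classes=10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
# ... standard fine-tuning loop over the downstream dataset goes here ...
```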

The paper further examines how the choice of pre-training method shapes the mask structures and perturbation sensitivity of the extracted subnetworks. One salient finding is that subnetwork behavior varies with the pre-training regime, supervised versus self-supervised, and this affects downstream efficacy. Subnetworks from self-supervised pre-training such as simCLR transfer better on natural image classification tasks, whereas those derived from supervised pre-training prove more dependable across domains, including synthetic datasets.
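One way to make the mask-structure comparison concrete is a per-layer overlap measure between masks produced by different pre-training schemes. The sketch below uses intersection-over-union of binary masks as one plausible similarity metric; the mask files and layer naming are hypothetical placeholders.

```python
import torch

def mask_iou(mask_a: torch.Tensor, mask_b: torch.Tensor) -> float:
    """Intersection-over-union of two binary masks with the same shape."""
    a, b = mask_a.bool(), mask_b.bool()
    union = (a | b).sum().item()
    return (a & b).sum().item() / union if union > 0 else 1.0

# Hypothetical mask dictionaries saved from IMP runs on two pre-training schemes.
imagenet_masks = torch.load("masks_imagenet.pt")
simclr_masks = torch.load("masks_simclr.pt")

for layer in imagenet_masks:
    overlap = mask_iou(imagenet_masks[layer], simclr_masks[layer])
    print(f"{layer}: mask IoU = {overlap:.3f}")
```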

A pivotal perspective the paper offers is the parallel to similar techniques adopted in natural language processing, hinting at broader applicability across machine learning if the hypothesis holds up under further scrutiny. The authors also identify two main directions for future work: understanding how larger models affect LTH efficacy, and investigating structured pruning as a natural extension of the unstructured pruning used here.

In conclusion, the findings encourage researchers to consider small, lottery-winning subnetworks of CV models across pre-training paradigms, revealing efficiencies that could shape future deep learning architectures. The implication is a shift toward models that remain capable while being lean in computational demand.

Authors (7)
  1. Tianlong Chen (202 papers)
  2. Jonathan Frankle (37 papers)
  3. Shiyu Chang (120 papers)
  4. Sijia Liu (204 papers)
  5. Yang Zhang (1129 papers)
  6. Michael Carbin (45 papers)
  7. Zhangyang Wang (374 papers)
Citations (119)