The Lottery Ticket Hypothesis for Pre-trained BERT Networks

Published 23 Jul 2020 in cs.LG, cs.CL, cs.NE, and stat.ML | (2007.12223v2)

Abstract: In NLP, enormous pre-trained models like BERT have become the standard starting point for training on a range of downstream tasks, and similar trends are emerging in other areas of deep learning. In parallel, work on the lottery ticket hypothesis has shown that models for NLP and computer vision contain smaller matching subnetworks capable of training in isolation to full accuracy and transferring to other tasks. In this work, we combine these observations to assess whether such trainable, transferrable subnetworks exist in pre-trained BERT models. For a range of downstream tasks, we indeed find matching subnetworks at 40% to 90% sparsity. We find these subnetworks at (pre-trained) initialization, a deviation from prior NLP research where they emerge only after some amount of training. Subnetworks found on the masked language modeling task (the same task used to pre-train the model) transfer universally; those found on other tasks transfer in a limited fashion if at all. As large-scale pre-training becomes an increasingly central paradigm in deep learning, our results demonstrate that the main lottery ticket observations remain relevant in this context. Codes available at https://github.com/VITA-Group/BERT-Tickets.

Citations (356)

Summary

  • The paper demonstrates the existence and identification of efficient, sparse subnetworks (winning tickets) within pre-trained BERT models using iterative magnitude pruning.
  • It shows that subnetworks found on the masked language modeling task transfer universally across downstream NLP tasks, reaching accuracy comparable to the full model with far fewer parameters, while subnetworks found on other tasks transfer only in a limited fashion.
  • The study finds matching subnetworks directly at the pre-trained initialization, without the additional task-specific training required in earlier NLP work, which supports more computationally efficient fine-tuning.

An Examination of the Lottery Ticket Hypothesis in Pre-trained BERT Networks

The paper "The Lottery Ticket Hypothesis for Pre-trained BERT Networks" explores the applicability of the lottery ticket hypothesis (LTH) within the context of large-scale, pre-trained BERT models. The authors address whether it is feasible to identify sparse yet efficient subnetworks within pre-trained models that can be utilized for various downstream tasks in NLP.

Overview

In NLP, BERT and other large pre-trained models have become standard starting points because of their efficacy across a range of downstream applications: despite their enormous parameter counts, they sharply reduce the effort of task-specific training by providing a strong initialization. In parallel, the LTH holds that such large models contain smaller subnetworks that, if correctly identified, can be trained in isolation to comparable accuracy. This work investigates whether such subnetworks exist in pre-trained BERT models and whether they can be transferred effectively across tasks.

Key Findings

The study presents several notable findings:

  1. Existence and Identification of Winning Tickets: Using iterative magnitude pruning (IMP), the researchers identified matching subnetworks ("winning tickets") in BERT at 40% to 90% sparsity, with the achievable sparsity depending on the downstream task (a minimal sketch of the pruning loop appears after this list). In a departure from earlier NLP results, where such subnetworks emerged only after some amount of training, these are found at the pre-trained initialization.
  2. Transferability Across Tasks: Subnetworks discovered through the masked language modeling (MLM) task—the same objective used to pre-train BERT—transferred universally: they could be fine-tuned on other downstream tasks to accuracy comparable to the full model. Subnetworks found on individual downstream tasks, by contrast, transferred in a limited fashion, if at all.
  3. Role of Pre-trained Initialization: The authors provide evidence that, unlike earlier research, matching subnetworks in this setting can be found directly at the pre-trained initialization without requiring additional training steps, reinforcing the value of using BERT's pre-trained weights as an effective starting point.
  4. Performance Comparisons: When comparing IMP-derived subnetworks to those generated by standard post-training pruning, results were mixed. Standard pruning sometimes surpassed, and at other times underperformed, the IMP method, particularly in small-data scenarios where overfitting may have been a concern.
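
For readers who want a concrete picture of the pruning procedure named in finding 1, the following is a minimal PyTorch-style sketch of iterative magnitude pruning with rewinding to the pre-trained BERT weights. It is an illustration under simplifying assumptions, not the authors' released implementation: the `run_glue_epochs` fine-tuning helper, the per-round pruning fraction, and the choice to prune only the 2-D encoder weight matrices are placeholders.

```python
import copy
import torch
from transformers import BertForSequenceClassification


def prunable_weights(model):
    """Yield (name, parameter) pairs for the 2-D encoder weight matrices."""
    for name, p in model.named_parameters():
        if "encoder" in name and name.endswith("weight") and p.dim() == 2:
            yield name, p


def imp_find_mask(model, fine_tune_fn, rounds=10, prune_frac=0.1):
    """Return {param_name: 0/1 mask} after `rounds` of train-prune-rewind."""
    theta0 = copy.deepcopy(model.state_dict())          # pre-trained initialization
    masks = {n: torch.ones_like(p) for n, p in prunable_weights(model)}
    for _ in range(rounds):
        fine_tune_fn(model, masks)                      # task-specific fine-tuning with masks applied
        # Rank the surviving weights by magnitude and drop the smallest fraction.
        scores = torch.cat([(p.detach().abs() * masks[n]).flatten()
                            for n, p in prunable_weights(model)])
        remaining = scores[scores > 0]
        k = max(1, int(prune_frac * remaining.numel()))
        threshold = torch.kthvalue(remaining, k).values
        for n, p in prunable_weights(model):
            masks[n] = torch.where(p.detach().abs() <= threshold,
                                   torch.zeros_like(masks[n]), masks[n])
        model.load_state_dict(theta0)                   # rewind to the pre-trained weights
    return masks


model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
# `run_glue_epochs` is a hypothetical helper: it should fine-tune `model` on the
# target task while re-applying each mask to its weight matrix after every step.
# masks = imp_find_mask(model, run_glue_epochs)
```

The surviving mask plus the pre-trained weights together define the candidate subnetwork; whether it is "matching" is decided by fine-tuning it and comparing accuracy against the full model.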

Implications and Speculations

This work underscores the potential of utilizing smaller, resource-efficient subnetworks within massive pre-trained models without sacrificing performance, making AI systems more accessible and cost-effective. The practical implications are substantial in terms of computational resources and energy efficiency, particularly in applications where deployment on edge devices or lower-end hardware is required.

Theoretically, these findings extend the LTH into the domain of large-scale, pre-trained models, suggesting that the initial training phase establishes a weight distribution conducive to identifying useful subnetworks right from initialization. This could influence how pre-training and pruning strategies evolve, potentially guiding new architectures and training methodologies.
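
To make concrete what it means to use a subnetwork "right from initialization," here is a hedged sketch of the transfer-style evaluation implied by findings 2 and 3: a previously found binary mask is applied to the pre-trained weights, and only the surviving weights are then fine-tuned on a downstream task. The `mlm_masks` dictionary and the gradient-hook mechanism are illustrative assumptions, not the paper's exact code.

```python
import torch
from transformers import BertForSequenceClassification


def apply_mask(model, masks):
    """Zero out pruned weights and keep them zero during fine-tuning."""
    for name, p in model.named_parameters():
        if name in masks:
            m = masks[name].to(p.device)
            with torch.no_grad():
                p.mul_(m)                                # sparsify at the (pre-trained) initialization
            p.register_hook(lambda grad, m=m: grad * m)  # block gradient updates to pruned entries


model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
# `mlm_masks` is assumed to come from an IMP run on the MLM objective (e.g. the
# sketch above). Fine-tune the masked model as usual and compare its accuracy to
# the unpruned baseline to decide whether the subnetwork is "matching".
# apply_mask(model, mlm_masks)
```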

Future research may explore methods of identifying these winning tickets more efficiently, assessing their transferability across diverse datasets or tasks beyond NLP, and leveraging finer pruning techniques to achieve even greater sparsity while maintaining task performance.

The findings open avenues for optimizing both neural architecture design and the training paradigm, potentially shaping the broader landscape of AI model development, which relies heavily on pre-trained models.
