An Examination of BERT and the Lottery Ticket Hypothesis
The paper, "When BERT Plays the Lottery, All Tickets Are Winning" by Sai Prasanna, Anna Rogers, and Anna Rumshisky, explores the intersection of BERT's overparametrization and the Lottery Ticket Hypothesis (LTH). The researchers investigate BERT's architecture's redundancy in the context of its use in NLP tasks. The primary focus is on determining whether BERT's smaller subnetworks can perform comparably to the full model.
Research Context and Background
BERT, a Transformer-based model, is renowned for its success in transfer learning across a wide range of NLP tasks. Previous studies have documented that BERT is overparameterized: substantial parts of the model, such as self-attention heads and even entire layers, can be pruned with minimal performance loss. This paper approaches the question through the LTH, which posits that large, dense neural networks contain smaller, sparse subnetworks ("winning tickets") that can achieve comparable performance when trained in isolation.
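In concrete terms, the LTH recipe is a train-prune-rewind loop. The Python sketch below illustrates that loop under stated assumptions: `train_model` and `prune_lowest_magnitude` are hypothetical helpers standing in for task fine-tuning and magnitude-based mask construction, and in the BERT setting studied here the weights are rewound to the pre-trained checkpoint rather than to a random initialization.

```python
import copy

def lottery_ticket_search(model, train_model, prune_lowest_magnitude,
                          rounds=5, fraction_per_round=0.1):
    """Minimal sketch of the iterative LTH loop: train, prune, rewind, repeat.

    `train_model(model, mask)` and `prune_lowest_magnitude(model, mask, fraction)`
    are hypothetical helpers, not code from the paper.
    """
    theta_0 = copy.deepcopy(model.state_dict())  # weights to rewind to (pre-trained BERT here)
    mask = None                                  # None means "keep every weight"

    for _ in range(rounds):
        train_model(model, mask)                 # fine-tune with the current mask applied
        mask = prune_lowest_magnitude(model, mask, fraction_per_round)  # shrink the ticket
        model.load_state_dict(theta_0)           # rewind the surviving weights
    train_model(model, mask)                     # final training of the candidate "winning ticket"
    return model, mask
```

If a sparse subnetwork trained this way matches the full model, it counts as a "winning ticket" in the LTH sense.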
Methodology and Experimentation
The paper employs two pruning approaches on BERT fine-tuned for tasks from the General Language Understanding Evaluation (GLUE) benchmark: unstructured magnitude pruning of individual weights, and structured pruning of self-attention heads and MLPs. Both aim to identify subnetworks that retain at least 90% of the full model's performance.
- Magnitude Pruning: The authors fine-tune BERT on each task and iteratively remove the lowest-magnitude weights while monitoring performance (a minimal sketch follows this list).
- Structured Pruning: They estimate the importance of each BERT component (self-attention heads and MLPs) and iteratively remove the least important ones (see the second sketch below).
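A rough sketch of the magnitude-pruning side is shown below, using PyTorch's built-in pruning utilities together with the 90%-of-full-performance criterion described above. The choice to prune only linear layers and the `is_good_subnetwork` helper are illustrative assumptions, not the authors' exact setup.

```python
import torch
import torch.nn.utils.prune as prune
from transformers import BertForSequenceClassification

def magnitude_prune(model: BertForSequenceClassification, amount: float):
    """Globally zero out the `amount` fraction of lowest-magnitude weights
    across all linear layers of a fine-tuned BERT (illustrative sketch)."""
    to_prune = [(m, "weight") for m in model.modules() if isinstance(m, torch.nn.Linear)]
    prune.global_unstructured(to_prune, pruning_method=prune.L1Unstructured, amount=amount)
    return model

def is_good_subnetwork(pruned_score: float, full_score: float, threshold: float = 0.9) -> bool:
    """Acceptance criterion: the subnetwork keeps at least 90% of the full model's score."""
    return pruned_score >= threshold * full_score
```

In the paper, pruning proceeds iteratively, removing a small fraction of weights at each step and stopping once the task score falls below the 90% threshold.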
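For the structured side, a common recipe, which the paper adapts from prior work on attention-head importance, scores each head by the absolute gradient of the loss with respect to a per-head gate and then removes the lowest-scoring heads (and, analogously, MLPs). The sketch below uses Hugging Face's `head_mask` argument; the dataloader format is an assumption, and the MLP scoring and the iterative removal loop are omitted for brevity.

```python
import torch
from transformers import BertForSequenceClassification

def head_importance_scores(model: BertForSequenceClassification, dataloader, device="cpu"):
    """Accumulate |d loss / d head_mask| for every self-attention head.

    Illustrative sketch: a per-head gate (`head_mask`) is passed to the model with
    requires_grad=True, so its gradient reflects how much the loss depends on each head.
    Assumes the dataloader yields dicts with input_ids, attention_mask, and labels.
    """
    model.to(device).eval()
    n_layers = model.config.num_hidden_layers
    n_heads = model.config.num_attention_heads
    importance = torch.zeros(n_layers, n_heads, device=device)

    for batch in dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        head_mask = torch.ones(n_layers, n_heads, device=device, requires_grad=True)
        loss = model(**batch, head_mask=head_mask).loss
        loss.backward()
        importance += head_mask.grad.abs()
        model.zero_grad(set_to_none=True)  # discard parameter gradients; only the gate gradients matter

    return importance  # heads with the lowest scores are pruning candidates
```

Heads (and MLPs) with the lowest scores are masked out step by step until the task score drops below 90% of the full model's, which defines the "good" subnetworks discussed below.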
Across nine GLUE tasks, both pruning techniques consistently found subnetworks meeting this threshold. Notably, even the "bad" subnetworks, composed of the elements deemed least important, could be re-fine-tuned to reach surprisingly strong results.
Key Findings and Implications
- Widespread Trainability: Nearly all pruned subnetworks of BERT can be re-fine-tuned to performance close to that of the full model, suggesting that most of BERT's parameters are potentially useful rather than redundant dead weight.
- Unstable "Good" Subnetworks: The paper reveals that "good" subnetworks lack stability across multiple fine-tuning runs, highlighting the significant role of weight initialization and suggesting randomness plays a critical role in capturing the linguistic information.
- Non-Linguistic Explanation: The success of the surviving subnetworks cannot be straightforwardly attributed to superior linguistic knowledge; their self-attention heads do not show noticeably more interpretable attention patterns than the pruned ones, which challenges assumptions about which parts of BERT encode linguistic structure.
- Future Considerations: The results suggest that fine-tuning success may owe more to the optimization landscape induced by the pre-trained weights than to specific linguistic knowledge encoded in particular components. They also point to practical opportunities for efficient model compression and motivate rethinking how such models are analyzed and improved.
Conclusion
The paper demonstrates that BERT's complexity can be reduced substantially without significant performance degradation, and it encourages a reevaluation of what counts as “linguistic knowledge” in analyses of model components. These insights are valuable for future BERT-derived models and for model optimization, and they advance our understanding of sparse networks in NLP. More broadly, the work provides compelling evidence about the nature of BERT's parameters and shows that pruning is useful not only for efficiency but also as a tool for interpretability and theoretical development in AI.