Authoritative Summary of "Deconstructing Lottery Tickets: Zeros, Signs, and the Supermask" (1905.01067)
Key findings:
- Preserving the signs of initial weights is critical for training effective sparse subnetworks from random initializations.
- Several mask criteria, notably final magnitude and movement, reliably identify weights trending toward inactivity during training.
- Supermasks, optimized binary masks applied to randomly initialized networks, achieve high test accuracies on standard benchmarks without any conventional weight training.
Context and Motivation
The Lottery Ticket Hypothesis (LTH) posits that large neural networks contain sparse subnetworks (termed "lottery tickets," LTs) that, when initialized with certain weights, can be trained in isolation to match the performance of the original dense network. The original formulation creates such subnetworks via magnitude pruning and rewinds the remaining weights to their initial values. However, fundamental questions persist about the mechanism behind the LT effect: how important is the pruning criterion, how should retained weights be initialized, and what should happen to the pruned weights? This paper systematically deconstructs the LT algorithm to interrogate these components, conducts extensive ablations, and introduces the concept of Supermasks: binary masks that, when applied to randomly initialized networks, yield strikingly high accuracy without any subsequent weight optimization.
Mask Criteria: Choice of Pruning Heuristic
The authors investigate nine distinct mask criteria, ranging from final and initial weight magnitude, movement, random masking, to combinations thereof. Experimental results on MNIST (FC) and CIFAR-10 (Conv2, Conv4, Conv6) reveal that:
- The large final magnitude criterion (|w_f|), as used in the original LTH, consistently produces sparse networks whose test accuracy matches or sometimes exceeds that of the dense baseline.
- The magnitude increase criterion (|w_f| − |w_i|) and the movement criterion (|w_f − w_i|) perform comparably or slightly better, especially for the smaller convolutional networks.
- Inverted criteria (small final magnitude, small movement) act as negative controls, yielding below-random performance.
- Random masking serves as a baseline and is outperformed by tailored criteria.
These results demonstrate that the selection of weights to prune need not hinge precisely on final magnitude; several mask criteria similarly bias toward pruning weights that naturally approach zero as training progresses. The main finding is that the efficacy of LT subnetworks is robust to a range of pruning heuristics, provided those heuristics remove weights that were headed toward inactivity during training.
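As a concrete illustration, each per-layer criterion above can be expressed as a score function followed by a top-k threshold. The following is a minimal NumPy sketch; the function and criterion names are ours, not the paper's API:

```python
import numpy as np

def make_mask(w_init, w_final, criterion, keep_frac):
    """Build a binary pruning mask for one layer.

    Scores each weight by the chosen heuristic, then keeps the
    top `keep_frac` fraction (mask = 1) and prunes the rest (mask = 0).
    Criterion labels are illustrative, not the paper's exact naming.
    """
    if criterion == "large_final":           # |w_f|, as in the original LTH
        score = np.abs(w_final)
    elif criterion == "magnitude_increase":  # |w_f| - |w_i|
        score = np.abs(w_final) - np.abs(w_init)
    elif criterion == "movement":            # |w_f - w_i|
        score = np.abs(w_final - w_init)
    elif criterion == "random":              # control baseline
        score = np.random.default_rng(0).random(w_final.shape)
    else:
        raise ValueError(criterion)
    k = int(round(keep_frac * score.size))
    threshold = np.sort(score, axis=None)[-k]  # k-th largest score
    return (score >= threshold).astype(np.float32)
```

The inverted (negative-control) criteria correspond to negating the score, so the same thresholding machinery covers all nine variants studied in the paper.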
Mask-1 Actions: Role of Weight Initialization and Sign
A pivotal observation from the original LTH is that rewinding remaining weights to their initial values is essential, whereas reinitializing destroys the LT effect. This study tests alternative mask-1 actions:
- Reinitialization of remaining weights from the original distribution cripples performance.
- Reshuffling initial values within layers (preserving statistics but not correspondence) does not recover the LT phenomenon.
- Setting retained weights to a signed constant, with magnitude equal to the layer's initialization standard deviation and sign taken from the original initial weight, maintains performance.
A salient result is the critical significance of weight signs: retaining the sign of initial weights, regardless of their exact value or distribution, is sufficient for successful LT training. This finding contradicts prior beliefs about the necessity of the precise initial values and demonstrates that the basin of attraction for LT subnetworks is considerably broader. The sign constraint enables optimization trajectories similar to the original LT setup.
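The signed-constant treatment can be sketched in a few lines; the helper name is ours, and the per-layer constant (the empirical standard deviation of the layer's initial weights) is one reasonable reading of the paper's setup:

```python
import numpy as np

def signed_constant_rewind(w_init, mask):
    """Replace retained weights with a constant of matching sign.

    Only the sign of each original initial weight survives; the
    magnitude is a single per-layer constant. Pruned entries stay zero.
    """
    alpha = w_init.std()                   # per-layer constant magnitude
    return mask * np.sign(w_init) * alpha
```

Comparing this against exact rewinding isolates the contribution of the sign pattern from that of the precise initial magnitudes.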
Mask-0 Actions: Importance of Zeroing Pruned Weights
Conventionally, pruned weights are set to zero and frozen. The authors dissect this standard and propose alternative treatments:
- Freezing pruned weights at their initial values (instead of zero) consistently underperforms zeroing, except at extremely high pruning rates.
- A hybrid treatment where weights are frozen at zero only if their magnitude decreased during training (and at initial values otherwise) recovers and in some cases enhances performance.
- Control experiments show that randomly freezing a subset of pruned weights to zero, or freezing to zero those that increased in magnitude, degrades performance.
The empirical evidence supports a hypothesis: the large final magnitude (and related) criteria are highly selective for weights that move toward zero during training; zeroing these weights post-pruning effectively completes their training trajectory. Masking, therefore, can be interpreted as a form of training: it solidifies a network state that would otherwise require further optimization to reach.
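The mask-0 treatments above can be sketched as follows. Mode names are illustrative, and retained weights are rewound to their initial values as in the LT procedure:

```python
import numpy as np

def apply_mask0_treatment(w_init, w_final, mask, mode):
    """Choose the value at which pruned (mask == 0) weights are frozen.

    'zero'   - standard LT treatment: freeze pruned weights at 0.
    'init'   - freeze pruned weights at their initial values.
    'hybrid' - freeze at 0 only those whose magnitude shrank during
               training; freeze the rest at their initial values.
    """
    kept = mask * w_init  # retained weights rewound to initialization
    if mode == "zero":
        frozen = np.zeros_like(w_init)
    elif mode == "init":
        frozen = w_init
    elif mode == "hybrid":
        shrank = np.abs(w_final) < np.abs(w_init)
        frozen = np.where(shrank, 0.0, w_init)
    else:
        raise ValueError(mode)
    return kept + (1 - mask) * frozen
```

The 'hybrid' mode encodes the paper's observation directly: zeroing helps precisely for weights that were already heading toward zero.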
Supermasks: Masking as Training and Beyond
Building on the masking-as-training hypothesis, the authors explore applying LT masks to randomly initialized networks without any training:
- Applying the large final magnitude mask yields remarkable test accuracy: 86% on MNIST and 41% on CIFAR-10, far exceeding random chance.
- Using signed constants (preserving initialization sign) further boosts accuracy.
- Optimizing the mask itself (the learned Supermask), while keeping weights frozen at their random initialization, yields test accuracies approaching those of trained dense networks: up to 98% on MNIST and 76.5% on CIFAR-10 (Conv6).
Dynamic weight rescaling during Supermask optimization further enhances performance by compensating for pruning-induced norm changes.
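The core mechanics of learning a Supermask over frozen random weights can be illustrated with a toy problem: only real-valued mask scores are trained, binary masks are sampled from them, and gradients are passed straight through the sampling step. The setup, sizes, and learning rate below are our illustrative choices, not the paper's exact training procedure:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy setup: frozen random weights w0; only the mask scores s are trained.
d, n = 20, 512
w0 = rng.normal(size=d)                      # frozen random initialization
m_true = (rng.random(d) < 0.5).astype(float)
X = rng.normal(size=(n, d))
y = (X @ (w0 * m_true) > 0).astype(float)    # labels realizable by some mask

def eval_loss(s):
    """Binary cross-entropy using the expected mask sigmoid(s)."""
    q = sigmoid(X @ (w0 * sigmoid(s)))
    return -np.mean(y * np.log(q + 1e-9) + (1 - y) * np.log(1 - q + 1e-9))

s = np.zeros(d)        # mask scores: the only learned parameters
lr = 0.5
loss0 = eval_loss(s)
for step in range(300):
    p = sigmoid(s)
    m = (rng.random(d) < p).astype(float)    # sample a binary mask
    q = sigmoid(X @ (w0 * m))                # forward with frozen weights
    dlogits = (q - y) / n                    # grad of BCE w.r.t. logits
    dw_eff = X.T @ dlogits                   # grad w.r.t. effective weights
    dm = dw_eff * w0                         # straight-through to the mask
    ds = dm * p * (1 - p)                    # ...and back through sigmoid
    s -= lr * ds
loss1 = eval_loss(s)                         # should be well below loss0
```

Even though w0 is never updated, the loss drops because the learned scores select a subnetwork of the random weights that already computes a useful function.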
Supermasks provide strong evidence for the existence of powerful, trainable subnetworks embedded in the parameter space of randomly initialized networks. The results challenge prior assumptions about the necessity of weight training, suggesting that sparsely masked architectures may already possess highly structured function mappings.
Implications and Future Directions
This paper highlights several theoretical and practical implications for neural network training, pruning, and compression:
- The sign of initial weights, rather than their precise value, is foundational for successful LT subnetworks, which refines the understanding of initialization's role.
- Mask selection criteria need only align with weights' training trajectories, preferentially pruning those with diminishing magnitude.
- Masking operations themselves can approximate training for certain weights, reframing classical pruning procedures as a training-like intervention.
- The existence of Supermasks with high classification accuracy opens new avenues for network compression: storing a binary mask and a random seed suffices to reconstruct performant models.
- The success of learned Supermasks points to possible intrinsic structure in random initialization and may inspire alternative approaches to network architecture search and rapid deployment.
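The compression idea above (a binary mask plus a random seed suffices to reconstruct the model) can be made concrete with a short sketch; the storage format and helper names are ours:

```python
import numpy as np

def save_supermask(mask, seed):
    """Pack a binary mask into bits; together with the seed, this
    record fully specifies the masked network (illustrative format)."""
    return {"seed": seed, "shape": mask.shape,
            "bits": np.packbits(mask.astype(np.uint8))}

def restore_weights(record):
    """Regenerate the random weights from the seed, then re-apply the mask."""
    rng = np.random.default_rng(record["seed"])
    w = rng.normal(size=record["shape"])
    mask = np.unpackbits(record["bits"])[: w.size].reshape(record["shape"])
    return w * mask
```

For a layer of N weights, this costs one bit per weight plus a constant for the seed, versus 32 bits per weight for a dense float32 checkpoint.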
Future research should explore generalization of these findings to larger-scale datasets and architectures (e.g., ResNet/ImageNet), investigate the dynamics of mask optimization in more challenging tasks, and develop efficient algorithms to exploit Supermasks for practical applications in edge computing and rapid model instantiation.
Conclusion
The deconstruction of the Lottery Ticket Hypothesis performed in this study elucidates the mechanisms underlying successful sparse subnetworks in deep neural networks. Ablations across mask criteria, weight initialization, and pruning treatment reveal the centrality of sign preservation and the effect of masking as a training proxy. The introduction and optimization of Supermasks substantially advance the understanding of subnetworks in untrained models and provide a practical route toward efficient network compression. These insights refine the theoretical landscape of over-parameterization, training dynamics, and pruning strategies, and suggest promising directions for both theoretical inquiry and practical methodology in neural network research.