Authoritative Summary of "Deconstructing Lottery Tickets: Zeros, Signs, and the Supermask" (1905.01067)
Key findings:
- Preserving the signs of initial weights is critical for training effective sparse subnetworks from random initializations.
- Several mask criteria, notably final magnitude and movement, reliably identify weights trending toward inactivity during training.
- Supermasks, optimized binary masks applied to randomly initialized networks, achieve high test accuracies on standard benchmarks without any conventional weight training.
Context and Motivation
The Lottery Ticket Hypothesis (LTH) posits that large neural networks contain sparse subnetworks (termed "lottery tickets," LTs) that, when initialized with certain weights, can be trained in isolation to match the performance of the original dense network. The original formulation creates such subnetworks via magnitude pruning and rewinds the remaining weights to their initial values. However, fundamental questions persist about the mechanism behind the LT effect: how important is the pruning criterion, how should retained weights be initialized, and what should happen to the pruned weights? This paper systematically deconstructs the LT algorithm to interrogate these components, conducts extensive ablations, and introduces the concept of Supermasks: binary masks that, when applied to randomly initialized networks, yield strikingly high accuracy without any subsequent weight optimization.
Mask Criteria: Choice of Pruning Heuristic
The authors investigate nine distinct mask criteria, ranging from final and initial weight magnitude, movement, random masking, to combinations thereof. Experimental results on MNIST (FC) and CIFAR-10 (Conv2, Conv4, Conv6) reveal that:
- The large final magnitude criterion (|w_f|), as used in the original LTH, consistently produces sparse networks whose test accuracy matches or sometimes exceeds that of the dense baseline.
- The magnitude increase criterion (|w_f| − |w_i|) and the movement criterion (|w_f − w_i|) perform comparably or slightly better, especially for the smaller convolutional networks.
- Inverted criteria (small final magnitude, small movement) act as negative controls, yielding below-random performance.
- Random masking serves as a baseline and is outperformed by tailored criteria.
These results demonstrate that the selection of weights to prune need not hinge precisely on final magnitude; several mask criteria similarly bias toward pruning weights that naturally approach zero as training progresses. The main finding is that the efficacy of LT subnetworks is robust to a range of pruning heuristics, provided those heuristics remove weights that were headed toward inactivity during training.
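As a concrete illustration, each per-layer criterion above can be expressed as a score function followed by a top-k threshold. The following is a minimal NumPy sketch; the function and criterion names are ours, not the paper's API:

```python
import numpy as np

def make_mask(w_init, w_final, criterion, keep_frac):
    """Build a binary pruning mask for one layer.

    Scores each weight by the chosen heuristic, then keeps the
    top `keep_frac` fraction (mask = 1) and prunes the rest (mask = 0).
    Criterion labels are illustrative, not the paper's exact naming.
    """
    if criterion == "large_final":           # |w_f|, as in the original LTH
        score = np.abs(w_final)
    elif criterion == "magnitude_increase":  # |w_f| - |w_i|
        score = np.abs(w_final) - np.abs(w_init)
    elif criterion == "movement":            # |w_f - w_i|
        score = np.abs(w_final - w_init)
    elif criterion == "random":              # control baseline
        score = np.random.default_rng(0).random(w_final.shape)
    else:
        raise ValueError(criterion)
    k = int(round(keep_frac * score.size))
    threshold = np.sort(score, axis=None)[-k]  # k-th largest score
    return (score >= threshold).astype(np.float32)
```

The inverted (negative-control) criteria correspond to negating the score, so the same thresholding machinery covers all nine variants studied in the paper.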
Mask-1 Actions: Role of Weight Initialization and Sign
A pivotal observation from the original LTH is that rewinding remaining weights to their initial values is essential, whereas reinitializing destroys the LT effect. This study tests alternative mask-1 actions:
- Reinitialization of remaining weights from the original distribution cripples performance.
- Reshuffling initial values within layers (preserving statistics but not correspondence) does not recover the LT phenomenon.
- Setting retained weights to a signed constant, with magnitude equal to the layer's initialization standard deviation and sign taken from the original initial weight, maintains performance.
A salient result is the critical significance of weight signs: retaining the sign of initial weights, regardless of their exact value or distribution, is sufficient for successful LT training. This finding contradicts prior beliefs about the necessity of the precise initial values and demonstrates that the basin of attraction for LT subnetworks is considerably broader. The sign constraint enables optimization trajectories similar to the original LT setup.
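The signed-constant treatment can be sketched in a few lines; the helper name is ours, and the per-layer constant (the empirical standard deviation of the layer's initial weights) is one reasonable reading of the paper's setup:

```python
import numpy as np

def signed_constant_rewind(w_init, mask):
    """Replace retained weights with a constant of matching sign.

    Only the sign of each original initial weight survives; the
    magnitude is a single per-layer constant. Pruned entries stay zero.
    """
    alpha = w_init.std()                   # per-layer constant magnitude
    return mask * np.sign(w_init) * alpha
```

Comparing this against exact rewinding isolates the contribution of the sign pattern from that of the precise initial magnitudes.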
Mask-0 Actions: Importance of Zeroing Pruned Weights
Conventionally, pruned weights are set to zero and frozen. The authors dissect this standard and propose alternative treatments:
- Freezing pruned weights at their initial values (instead of zero) consistently underperforms zeroing, except at extremely high pruning rates.
- A hybrid treatment where weights are frozen at zero only if their magnitude decreased during training (and at initial values otherwise) recovers and in some cases enhances performance.
- Control experiments show that randomly freezing a subset of pruned weights to zero, or freezing to zero those that increased in magnitude, degrades performance.
The empirical evidence supports a hypothesis: the large final magnitude (and related) criteria are highly selective for weights that move toward zero during training; zeroing these weights post-pruning effectively completes their training trajectory. Masking, therefore, can be interpreted as a form of training: it solidifies a network state that would otherwise require further optimization to reach.
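The mask-0 treatments above can be sketched as follows. Mode names are illustrative, and retained weights are rewound to their initial values as in the LT procedure:

```python
import numpy as np

def apply_mask0_treatment(w_init, w_final, mask, mode):
    """Choose the value at which pruned (mask == 0) weights are frozen.

    'zero'   - standard LT treatment: freeze pruned weights at 0.
    'init'   - freeze pruned weights at their initial values.
    'hybrid' - freeze at 0 only those whose magnitude shrank during
               training; freeze the rest at their initial values.
    """
    kept = mask * w_init  # retained weights rewound to initialization
    if mode == "zero":
        frozen = np.zeros_like(w_init)
    elif mode == "init":
        frozen = w_init
    elif mode == "hybrid":
        shrank = np.abs(w_final) < np.abs(w_init)
        frozen = np.where(shrank, 0.0, w_init)
    else:
        raise ValueError(mode)
    return kept + (1 - mask) * frozen
```

The 'hybrid' mode encodes the paper's observation directly: zeroing helps precisely for weights that were already heading toward zero.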
Supermasks: Masking as Training and Beyond
Building on the masking-as-training hypothesis, the authors explore applying LT masks to randomly initialized networks without any training:
- Applying the large final magnitude mask yields remarkable test accuracy: 86% on MNIST and 41% on CIFAR-10, far exceeding random chance.
- Using signed constants (preserving initialization sign) further boosts accuracy.
- Optimizing the mask itself (the learned Supermask), while keeping weights frozen at their random initialization, yields test accuracies approaching those of trained dense networks: up to 98% on MNIST and 76.5% on CIFAR-10 (Conv6).
Dynamic weight rescaling during Supermask optimization further enhances performance by compensating for pruning-induced norm changes.
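The core mechanics of learning a Supermask over frozen random weights can be illustrated with a toy problem: only real-valued mask scores are trained, binary masks are sampled from them, and gradients are passed straight through the sampling step. The setup, sizes, and learning rate below are our illustrative choices, not the paper's exact training procedure:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy setup: frozen random weights w0; only the mask scores s are trained.
d, n = 20, 512
w0 = rng.normal(size=d)                      # frozen random initialization
m_true = (rng.random(d) < 0.5).astype(float)
X = rng.normal(size=(n, d))
y = (X @ (w0 * m_true) > 0).astype(float)    # labels realizable by some mask

def eval_loss(s):
    """Binary cross-entropy using the expected mask sigmoid(s)."""
    q = sigmoid(X @ (w0 * sigmoid(s)))
    return -np.mean(y * np.log(q + 1e-9) + (1 - y) * np.log(1 - q + 1e-9))

s = np.zeros(d)        # mask scores: the only learned parameters
lr = 0.5
loss0 = eval_loss(s)
for step in range(300):
    p = sigmoid(s)
    m = (rng.random(d) < p).astype(float)    # sample a binary mask
    q = sigmoid(X @ (w0 * m))                # forward with frozen weights
    dlogits = (q - y) / n                    # grad of BCE w.r.t. logits
    dw_eff = X.T @ dlogits                   # grad w.r.t. effective weights
    dm = dw_eff * w0                         # straight-through to the mask
    ds = dm * p * (1 - p)                    # ...and back through sigmoid
    s -= lr * ds
loss1 = eval_loss(s)                         # should be well below loss0
```

Even though w0 is never updated, the loss drops because the learned scores select a subnetwork of the random weights that already computes a useful function.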
Supermasks provide strong evidence for the existence of powerful, trainable subnetworks embedded in the parameter space of randomly initialized networks. The results challenge prior assumptions about the necessity of weight training, suggesting that sparsely masked architectures may already possess highly structured function mappings.
Implications and Future Directions
This paper highlights several theoretical and practical implications for neural network training, pruning, and compression:
- The sign of initial weights, rather than their precise value, is foundational for successful LT subnetworks, which refines the understanding of initialization's role.
- Mask selection criteria need only align with weights' training trajectories, preferentially pruning those with diminishing magnitude.
- Masking operations themselves can approximate training for certain weights, reframing classical pruning procedures as a training-like intervention.
- The existence of Supermasks with high classification accuracy opens new avenues for network compression: storing a binary mask and a random seed suffices to reconstruct performant models.
- The success of learned Supermasks points to possible intrinsic structure in random initialization and may inspire alternative approaches to network architecture search and rapid deployment.
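The compression idea above (a binary mask plus a random seed suffices to reconstruct the model) can be made concrete with a short sketch; the storage format and helper names are ours:

```python
import numpy as np

def save_supermask(mask, seed):
    """Pack a binary mask into bits; together with the seed, this
    record fully specifies the masked network (illustrative format)."""
    return {"seed": seed, "shape": mask.shape,
            "bits": np.packbits(mask.astype(np.uint8))}

def restore_weights(record):
    """Regenerate the random weights from the seed, then re-apply the mask."""
    rng = np.random.default_rng(record["seed"])
    w = rng.normal(size=record["shape"])
    mask = np.unpackbits(record["bits"])[: w.size].reshape(record["shape"])
    return w * mask
```

For a layer of N weights, this costs one bit per weight plus a constant for the seed, versus 32 bits per weight for a dense float32 checkpoint.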
Future research should explore generalization of these findings to larger-scale datasets and architectures (e.g., ResNet/ImageNet), investigate the dynamics of mask optimization in more challenging tasks, and develop efficient algorithms to exploit Supermasks for practical applications in edge computing and rapid model instantiation.
Conclusion
The deconstruction of the Lottery Ticket Hypothesis performed in this study elucidates the mechanisms underlying successful sparse subnetworks in deep neural networks. Ablations across mask criteria, weight initialization, and pruning treatment reveal the centrality of sign preservation and the effect of masking as a training proxy. The introduction and optimization of Supermasks substantially advance the understanding of subnetworks in untrained models and provide a practical route toward efficient network compression. These insights refine the theoretical landscape of over-parameterization, training dynamics, and pruning strategies, and suggest promising directions for both theoretical inquiry and practical methodology in neural network research.