One-shot Magnitude Pruning
- The paper demonstrates that one-shot magnitude pruning efficiently identifies sparse subnetworks by zeroing the lowest-magnitude weights and then retraining for competitive accuracy.
- The methodology employs a threshold-based mask selection, which is reproducible across sibling networks with shared early training, ensuring consistent subnetwork performance.
- Empirical results show that union and intersection mask compositions yield nearly identical accuracy–sparsity tradeoffs, validating one-shot magnitude pruning as a near-optimal compression technique.
One-shot magnitude pruning (MP) is a widely used neural network sparsification method in which a sparse subnetwork is identified by directly zeroing the smallest-magnitude weights in a single pruning step, often followed by retraining or fine-tuning. This approach underpins much of the literature on efficient network compression, adaptive subnetwork selection, and the empirical exploration of the "Lottery Ticket Hypothesis." Recent research further elucidates its consistency, composability, and near-optimality on standard architectures.
1. Formal Definition and Methodology
Let $\theta \in \mathbb{R}^d$ (or $\theta_T$, the weights after $T$ training steps) denote the network's weight vector after training. A binary pruning mask $m \in \{0,1\}^d$ selects a subnetwork via the element-wise product $m \odot \theta$, where $\odot$ denotes element-wise multiplication.
Given trained weights $\theta_T$ and a target sparsity $s \in (0,1)$, the one-shot magnitude pruning mask $m^{\mathrm{MP}}$ sets exactly $\lfloor s \cdot d \rfloor$ of the smallest-magnitude weights to zero:

$$
m^{\mathrm{MP}}_i = \begin{cases} 0 & \text{if } |\theta_{T,i}| \le \tau \\ 1 & \text{otherwise,} \end{cases}
\qquad \tau = |\theta_T|_{(\lfloor s \cdot d \rfloor)},
$$

where $|\theta_T|_{(k)}$ denotes the $k$-th smallest entry of $(|\theta_{T,1}|, \dots, |\theta_{T,d}|)$ and ties are broken so that exactly $\lfloor s \cdot d \rfloor$ weights are pruned.
The subnetwork is defined by applying $m^{\mathrm{MP}}$ to the weights at initialization $\theta_0$ or at a "rewinding" point $\theta_t$ (e.g., after some early training epochs), and then training or fine-tuning the mask-fixed network to convergence.
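As a small worked illustration (the weight values are invented for exposition, not taken from the paper), take $d = 5$ and $s = 0.4$, so that $\lfloor s \cdot d \rfloor = 2$ weights are pruned:

$$
\theta_T = (0.7,\ -0.05,\ 1.3,\ 0.2,\ -0.9), \qquad \tau = |\theta_T|_{(2)} = 0.2, \qquad m^{\mathrm{MP}} = (1,\ 0,\ 1,\ 0,\ 1),
$$

so the same two coordinates are zeroed in whichever weight vector (initialization $\theta_0$ or rewind point $\theta_t$) the mask is applied to.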
Canonical One-Shot MP Workflow
- Initialize weights $\theta_0$.
- Train the dense network to convergence, yielding $\theta_T$.
- Compute the magnitudes $|\theta_{T,i}|$ and prune the $\lfloor s \cdot d \rfloor$ smallest entries, forming the mask $m^{\mathrm{MP}}$.
- Optionally, rewind the weights to $\theta_0$ (or to $\theta_t$ for some early iteration $t$).
- Train the masked network $m^{\mathrm{MP}} \odot \theta$ to convergence.
- Evaluate performance.
Pseudocode
```
Require: initialization θ₀, data D, target sparsity s
θ ← TrainDense(θ₀, D)
Compute a_i = |θ_i| for i = 1 … d
Sort {a_i}; set threshold τ = a_(⌊s·d⌋)          # ⌊s·d⌋-th smallest magnitude
For i = 1 … d:
    If a_i ≤ τ: m_i ← 0                          # prune the ⌊s·d⌋ smallest weights
    Else:       m_i ← 1
θ_rewind ← θ₀ (or θ_t for some early iteration t)
θ* ← TrainSparse(θ_rewind, m, D)                 # retrain with mask m held fixed
Return accuracy of the sparse network m ⊙ θ*
```
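For concreteness, a minimal NumPy sketch of the mask-extraction step is given below; the `TrainDense`/`TrainSparse` routines from the pseudocode are omitted, and the weight vector is a random stand-in for trained weights rather than an actual model.

```python
import numpy as np

def one_shot_mp_mask(theta: np.ndarray, sparsity: float) -> np.ndarray:
    """Binary mask that zeroes the `sparsity` fraction of smallest-magnitude
    entries of `theta` (flattened), keeping the rest."""
    k = int(sparsity * theta.size)                   # number of weights to prune
    mask = np.ones(theta.size, dtype=theta.dtype)
    if k > 0:
        order = np.argsort(np.abs(theta).ravel())    # indices by ascending magnitude
        mask[order[:k]] = 0.0                        # prune exactly k entries
    return mask.reshape(theta.shape)

# Usage: prune 80% of a (stand-in) trained weight vector and apply the mask.
theta_T = np.random.randn(10_000)
m = one_shot_mp_mask(theta_T, sparsity=0.8)
theta_sparse = m * theta_T                           # element-wise product m ⊙ θ
print(f"surviving fraction: {m.mean():.2f}")         # ≈ 0.20
```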
2. Consistency and Composability Across Sibling Networks
Beyond standard one-shot MP, "Studying the Consistency and Composability of Lottery Ticket Pruning Masks" investigates whether multiple independently trained sibling networks (initialized identically but trained with different data orders or seeds) yield compatible pruning masks and whether combining their masks improves the sparsity–accuracy tradeoff.
Sibling Training
- Clone $k$ sibling networks from a shared checkpoint after pretraining for $t$ steps; each sibling is then trained independently to completion with a different data order or seed, yielding weights $\theta^{(j)}_T$ for $j = 1, \dots, k$.
- Each sibling produces its own one-shot MP mask $m^{(j)}$ at sparsity $s$.
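A compact PyTorch sketch of this sibling protocol follows; the toy MLP, random data, and training hyperparameters are placeholders chosen for illustration, not the ResNet-20/CIFAR-10 setup studied in the paper.

```python
import copy
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-ins: a small MLP and random data (placeholders, not the paper's setup).
data = TensorDataset(torch.randn(512, 32), torch.randint(0, 10, (512,)))
shared = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))

def train(model, loader, epochs):
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            nn.functional.cross_entropy(model(x), y).backward()
            opt.step()

# Shared pretraining of length t (here one epoch), then k = 4 siblings that start
# from identical weights and differ only in their data order.
train(shared, DataLoader(data, batch_size=64, shuffle=True), epochs=1)
siblings = []
for seed in range(4):
    sib = copy.deepcopy(shared)                           # identical weights at θ_t
    g = torch.Generator().manual_seed(seed)               # sibling-specific data order
    train(sib, DataLoader(data, batch_size=64, shuffle=True, generator=g), epochs=2)
    siblings.append(sib)
```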
Mask Composition: Union and Intersection
Define, for each coordinate $i$:
- Union: $m^{\cup}_i = \max_{j} m^{(j)}_i$, i.e., $m^{\cup}_i = 1$ iff $m^{(j)}_i = 1$ for at least one sibling $j$.
- Intersection: $m^{\cap}_i = \min_{j} m^{(j)}_i$, i.e., $m^{\cap}_i = 1$ iff $m^{(j)}_i = 1$ for every sibling $j$.
Union keeps weights surviving in any sibling; intersection keeps weights surviving in all siblings.
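In NumPy terms, a sketch of the composition step is shown below, reusing the `one_shot_mp_mask` helper from the Section 1 sketch; the sibling weight vectors are random stand-ins for the independently trained siblings.

```python
import numpy as np

# Stand-in sibling weight vectors θ^(j); in practice these come from the k
# independently trained siblings.
sibling_thetas = [np.random.randn(10_000) for _ in range(4)]           # k = 4

# Per-sibling one-shot MP masks at sparsity s, stacked into shape (k, d).
masks = np.stack([one_shot_mp_mask(th, sparsity=0.8) for th in sibling_thetas])

m_union = masks.max(axis=0)    # keep a weight if ANY sibling keeps it (denser)
m_inter = masks.min(axis=0)    # keep a weight only if ALL siblings keep it (sparser)

# Post-composition sparsities differ from the nominal s, so accuracy is compared
# at the resulting sparsity rather than at s.
print("union sparsity:", 1 - m_union.mean(), "intersection sparsity:", 1 - m_inter.mean())
```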
Retraining Composed Masks
The network is retrained from the shared (rewind) initialization $\theta_t$ using $m^{\cup}$ or $m^{\cap}$ and evaluated for accuracy at the resulting (post-composition) sparsity.
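A sketch of the mask-fixed retraining step is given below, assuming a PyTorch model and a mask dictionary keyed by parameter name; re-applying the mask after every optimizer step is one common way to keep pruned weights at zero, not necessarily the paper's exact implementation.

```python
import torch
import torch.nn as nn

def train_with_fixed_mask(model: nn.Module, mask: dict, loader, epochs: int, lr: float = 0.1):
    """Retrain from the rewind point with a frozen binary mask: pruned weights are
    zeroed up front and re-zeroed after every optimizer step so they stay pruned."""
    def apply_mask():
        with torch.no_grad():
            for name, p in model.named_parameters():
                if name in mask:
                    p.mul_(mask[name])

    opt = torch.optim.SGD(model.parameters(), lr=lr)
    apply_mask()                                         # zero out pruned weights once
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            nn.functional.cross_entropy(model(x), y).backward()
            opt.step()
            apply_mask()                                 # keep pruned weights at zero
    return model
```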
Empirical Findings
Experiments on ResNet-20/CIFAR-10, varying the shared-pretraining length $t$ and the number of siblings $k$, reveal:
- Without shared pretraining ($t = 0$), union/intersection masks perform no better than random tickets.
- With sufficient shared pretraining (larger $t$), union and intersection masks match the one-shot MP baseline for up to $k = 10$ siblings.
Accuracy–Sparsity Summary Table (test accuracy, %):
| Sparsity | One-shot MP | Union ($m^{\cup}$) | Intersection ($m^{\cap}$) |
|---|---|---|---|
| 20% | 93.5 ±0.2 | 93.4 ±0.3 | 93.5 ±0.2 |
| 50% | 91.2 ±0.3 | 91.1 ±0.2 | 91.2 ±0.3 |
| 80% | 88.0 ±0.4 | 87.9 ±0.4 | 88.0 ±0.3 |
| 90% | 84.6 ±0.5 | 84.5 ±0.6 | 84.6 ±0.5 |
Union masks (which prune conservatively, keeping more weights) and intersection masks (which prune aggressively, keeping fewer) yield virtually identical accuracy–sparsity curves when compared at the resulting post-composition sparsity.
3. Analysis: Stability of Magnitude-Based Saliency
The observed compositional consistency is a direct consequence of the robustness of weight-magnitude rankings after shared pretraining. Once the initial SGD-induced randomness subsides (after $t$ steps of shared training), the ranking of weights by magnitude is highly stable across sibling runs, and the dominant (largest-magnitude) parameters are selected across siblings regardless of the data-order seed.
Consequently, whether the union or intersection mask is applied, the pruned subnetworks converge to nearly the same "important" subset, explaining the near-equal accuracy at a given resultant sparsity.
A plausible implication is that simple magnitude-based saliency becomes a deterministic, architecture- and initialization-dependent property after a sufficient early training phase; in this regime, training stochasticity (such as data ordering) becomes largely irrelevant to which weights are selected.
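As an illustrative diagnostic (not a measurement reported in the paper), this stability can be probed directly by comparing sibling magnitude rankings and mask overlaps:

```python
import numpy as np

def magnitude_rank_correlation(theta_a: np.ndarray, theta_b: np.ndarray) -> float:
    """Spearman correlation (ignoring ties) between the magnitude rankings of two
    sibling weight vectors; values near 1 indicate highly stable rankings."""
    ranks_a = np.argsort(np.argsort(np.abs(theta_a).ravel()))
    ranks_b = np.argsort(np.argsort(np.abs(theta_b).ravel()))
    return float(np.corrcoef(ranks_a, ranks_b)[0, 1])

def mask_jaccard(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """Jaccard overlap of the kept-weight sets of two binary masks."""
    a, b = mask_a.astype(bool).ravel(), mask_b.astype(bool).ravel()
    return float((a & b).sum() / (a | b).sum())
```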
4. Implications for Pruning Confidence and Subnetwork Selection
The observation that masks become consistent across siblings after early training supports an agreement-based conservative pruning criterion: weights pruned by all siblings (equivalently, the weights removed under the union mask $m^{\cup}$) can be discarded with high certainty, as their informative content is negligible across SGD trajectories. Conversely, the survival frequency of each weight across sibling masks could, in principle, serve as a proxy for pruning "confidence," suggesting refinements such as frequency-weighted mask compositions for future pruning criteria; a sketch of this idea follows.
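A minimal NumPy sketch of such a frequency-based composition is given below; the `min_freq` threshold and the thresholding rule are hypothetical illustrations of the idea, not a method evaluated in the paper.

```python
import numpy as np

def survival_frequency(masks: np.ndarray) -> np.ndarray:
    """Fraction of sibling masks in which each weight survives.
    `masks` has shape (k, d) with {0, 1} entries."""
    return masks.mean(axis=0)

def frequency_thresholded_mask(masks: np.ndarray, min_freq: float) -> np.ndarray:
    """Keep a weight only if it survives in at least a `min_freq` fraction of the
    siblings: min_freq = 1/k recovers the union mask, min_freq = 1.0 recovers the
    intersection mask, and intermediate values interpolate between the two."""
    return (survival_frequency(masks) >= min_freq).astype(masks.dtype)
```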
The findings also validate one-shot magnitude pruning with late rewinding as a near-optimal scheme for identifying high-accuracy, high-sparsity subnetworks at the architectural scale of ResNet-20/CIFAR-10.
5. Comparison with Alternative Mask Aggregation and Future Directions
While naive union/intersection mask aggregation does not strictly outperform standard one-shot MP, these methods point toward new classes of mask composition strategies—such as leveraging the survival frequency of individual weights across siblings—which may yield superior accuracy–sparsity tradeoffs.
More broadly, these insights motivate further inquiry into:
- The relationship between early-training stability and the generality of magnitude-based pruning in larger or more complex architectures.
- The development of confidence-calibrated or frequency-weighted pruning schemes.
- The integration of mask composability principles in scenarios demanding robustness over initialization or data stochasticity, such as federated or ensemble learning contexts.
6. Practical Considerations and Empirical Tradeoffs
The one-shot MP procedure and its compositional extensions offer high efficiency, deterministic mask extraction, and straightforward integration into dense-to-sparse model conversion pipelines. This efficiency makes it particularly attractive for large-scale, resource-constrained deployments where iterative pruning or extensive mask search is infeasible.
Empirical evidence shows that even with up to $10$ siblings, mask-composition strategies do not degrade performance relative to established one-shot MP baselines. As such, for practical purposes on standard convolutional architectures and datasets, elaborate mask aggregation does not currently confer additional benefit over classic one-shot magnitude pruning after shared pretraining.