Stochastic Subnetwork Annealing: A Regularization Technique for Fine Tuning Pruned Subnetworks (2401.08830v1)
Abstract: Pruning methods have recently grown in popularity as an effective way to reduce the size and computational complexity of deep neural networks. Large numbers of parameters can be removed from trained models with little discernible loss in accuracy after a small number of continued training epochs. However, pruning too many parameters at once often causes an initial steep drop in accuracy which can undermine convergence quality. Iterative pruning approaches mitigate this by gradually removing a small number of parameters over multiple epochs. However, this can still lead to subnetworks that overfit local regions of the loss landscape. We introduce a novel and effective approach to tuning subnetworks through a regularization technique we call Stochastic Subnetwork Annealing. Rather than removing parameters in a discrete manner, we represent subnetworks with stochastic masks, where each parameter has a probabilistic chance of being included or excluded on any given forward pass. We anneal these probabilities over time such that the subnetwork structure slowly evolves as mask values become more deterministic, allowing for a smoother and more robust optimization of subnetworks at high levels of sparsity.
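To make the idea concrete, below is a minimal sketch of how a stochastically masked layer with an annealed keep probability could look in PyTorch. It is not the authors' implementation; the class name, the linear annealing schedule, and parameters such as `prune_fraction`, `p_init`, and `anneal_epochs` are illustrative assumptions that follow the description in the abstract (pruning candidates start with a probabilistic chance of inclusion that decays toward zero as tuning progresses).

```python
# Hypothetical sketch of stochastic subnetwork annealing for one linear layer.
# Names and the linear schedule are assumptions, not the paper's exact method.
import torch
import torch.nn as nn
import torch.nn.functional as F


class StochasticallyMaskedLinear(nn.Module):
    """Linear layer whose pruning-candidate weights are dropped stochastically."""

    def __init__(self, in_features, out_features, prune_fraction=0.9,
                 p_init=0.5, anneal_epochs=10):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        # Randomly select which weights are candidates for removal.
        candidates = torch.rand_like(self.linear.weight) < prune_fraction
        self.register_buffer("candidates", candidates)
        self.p_init = p_init            # initial keep probability for candidates
        self.anneal_epochs = anneal_epochs
        self.keep_prob = p_init         # annealed toward 0 during tuning

    def set_epoch(self, epoch):
        # Linearly anneal the candidates' keep probability toward 0, so the
        # mask becomes deterministic (fully pruned) by `anneal_epochs`.
        frac = min(epoch / self.anneal_epochs, 1.0)
        self.keep_prob = self.p_init * (1.0 - frac)

    def forward(self, x):
        if self.training and self.keep_prob > 0:
            # Resample a Bernoulli mask over the candidate weights each pass.
            keep = torch.rand_like(self.linear.weight) < self.keep_prob
            mask = (~self.candidates) | keep
        else:
            # Deterministic subnetwork: candidate weights are fully removed.
            mask = ~self.candidates
        return F.linear(x, self.linear.weight * mask, self.linear.bias)
```

In a fine-tuning loop one would call `set_epoch(epoch)` at the start of each epoch so the stochastic mask gradually hardens into the final pruned subnetwork while gradients continue to flow through the surviving weights.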