Shaving Weights with Occam's Razor: Bayesian Sparsification for Neural Networks Using the Marginal Likelihood (2402.15978v2)
Abstract: Neural network sparsification is a promising avenue to save computational time and memory costs, especially in an age where many successful AI models are becoming too large to naïvely deploy on consumer hardware. While much work has focused on different weight pruning criteria, the overall sparsifiability of the network, i.e., its capacity to be pruned without quality loss, has often been overlooked. We present Sparsifiability via the Marginal likelihood (SpaM), a pruning framework that highlights the effectiveness of using the Bayesian marginal likelihood in conjunction with sparsity-inducing priors for making neural networks more sparsifiable. Our approach implements an automatic Occam's razor that selects the most sparsifiable model that still explains the data well, both for structured and unstructured sparsification. In addition, we demonstrate that the pre-computed posterior Hessian approximation used in the Laplace approximation can be re-used to define a cheap pruning criterion, which outperforms many existing (more expensive) approaches. We demonstrate the effectiveness of our framework, especially at high sparsity levels, across a range of different neural network architectures and datasets.
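To make the abstract's central idea concrete, the sketch below illustrates one way a posterior-Hessian approximation can be re-used as a cheap pruning criterion: each weight is scored by w_i^2 * H_ii, an Optimal-Brain-Damage-style saliency, with a diagonal empirical-Fisher estimate standing in for the Laplace posterior Hessian. This is a minimal, hypothetical sketch; the function names (`diag_fisher`, `hessian_pruning_masks`), the empirical-Fisher approximation, and the global magnitude threshold are assumptions for illustration, not the paper's exact SpaM procedure.

```python
import torch

def diag_fisher(model, loader, loss_fn, n_batches=10):
    # Diagonal empirical-Fisher estimate (mean squared gradients), used here
    # as a cheap stand-in for the diagonal of the Laplace posterior Hessian.
    diag = [torch.zeros_like(p) for p in model.parameters()]
    seen = 0
    for x, y in loader:
        if seen >= n_batches:
            break
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for d, p in zip(diag, model.parameters()):
            d += p.grad.detach() ** 2
        seen += 1
    return [d / max(seen, 1) for d in diag]

def hessian_pruning_masks(model, diag, sparsity=0.9):
    # OBD-style saliency: weights with small w_i^2 * H_ii are pruned first.
    saliency = torch.cat([(p.detach() ** 2 * d).flatten()
                          for p, d in zip(model.parameters(), diag)])
    k = max(1, int(sparsity * saliency.numel()))
    threshold = saliency.kthvalue(k).values
    return [(p.detach() ** 2 * d > threshold).float()
            for p, d in zip(model.parameters(), diag)]

# Hypothetical usage with an existing model and data loader:
#   diag = diag_fisher(model, train_loader, torch.nn.functional.cross_entropy)
#   masks = hessian_pruning_masks(model, diag, sparsity=0.95)
#   with torch.no_grad():
#       for p, m in zip(model.parameters(), masks):
#           p.mul_(m)
```

Because the curvature estimate is computed once and shared across all sparsity levels, re-scoring at a different sparsity only requires recomputing the threshold, which is what makes Hessian-reuse criteria of this kind inexpensive relative to methods that re-estimate sensitivity per pruning round.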