Shaving Weights with Occam's Razor: Bayesian Sparsification for Neural Networks Using the Marginal Likelihood (2402.15978v2)

Published 25 Feb 2024 in cs.LG and stat.ML

Abstract: Neural network sparsification is a promising avenue to save computational time and memory costs, especially in an age where many successful AI models are becoming too large to naïvely deploy on consumer hardware. While much work has focused on different weight pruning criteria, the overall sparsifiability of the network, i.e., its capacity to be pruned without quality loss, has often been overlooked. We present Sparsifiability via the Marginal likelihood (SpaM), a pruning framework that highlights the effectiveness of using the Bayesian marginal likelihood in conjunction with sparsity-inducing priors for making neural networks more sparsifiable. Our approach implements an automatic Occam's razor that selects the most sparsifiable model that still explains the data well, both for structured and unstructured sparsification. In addition, we demonstrate that the pre-computed posterior Hessian approximation used in the Laplace approximation can be re-used to define a cheap pruning criterion, which outperforms many existing (more expensive) approaches. We demonstrate the effectiveness of our framework, especially at high sparsity levels, across a range of different neural network architectures and datasets.


Summary

  • The paper introduces SpaM, a framework that uses the Bayesian marginal likelihood together with sparsity-inducing priors as an automatic Occam's razor, making neural networks more amenable to pruning.
  • It derives Optimal Posterior Damage (OPD), a cheap pruning criterion that re-uses the posterior Hessian approximation already computed for the Laplace approximation and outperforms many more expensive criteria, especially at high sparsity.
  • Experiments on architectures such as ResNet, MLP, and LeNet show that accuracy is largely retained even at high sparsity levels, supporting deployment on resource-constrained hardware.

An Analysis of Bayesian Sparsification for Neural Networks

The paper "Shaving Weights with Occam's Razor: Bayesian Sparsification for Neural Networks using the Marginal Likelihood" presents a novel framework for neural network sparsification, known as SpaM (Sparsifiability via the Marginal likelihood). This work addresses significant issues related to the growing computational overhead associated with large AI models, proposing a method that harmonizes Bayesian marginal likelihood with sparsity-inducing priors to improve a model's capacity for sparsification.

Core Contributions

Sparsifiability and Marginal Likelihood: The paper focuses on an often overlooked aspect of neural network design: sparsifiability, i.e., a model's capacity to have parameters pruned without degrading performance. By optimizing the Bayesian marginal likelihood, SpaM implements an automatic Occam's razor that favors models which are inherently more sparsifiable. Concretely, the parameters of sparsity-inducing priors are selected alongside a Laplace approximation to the marginal likelihood, which regularizes the weights in a principled way.
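For reference, the marginal likelihood itself is intractable for neural networks; the generic Laplace approximation to it (the paper may use a layer-wise or Kronecker-factored variant, so treat this as the standard textbook form rather than the authors' exact objective) is

```latex
\log p(\mathcal{D} \mid \mathcal{M}) \;\approx\;
    \log p(\mathcal{D} \mid \theta_*)
  + \log p(\theta_* \mid \mathcal{M})
  + \tfrac{P}{2}\log 2\pi
  - \tfrac{1}{2}\log\det \mathbf{H},
```

where θ* is the MAP estimate, P the number of parameters, and H the Hessian of the negative log joint at θ*. Maximizing this quantity with respect to the prior hyperparameters (e.g., the precisions of a sparsity-inducing prior) trades data fit against model complexity, which is the automatic Occam's razor the framework exploits.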

Pre-computed Posterior Hessian as a Pruning Criterion: The authors re-use the posterior Hessian approximation, already computed for the Laplace approximation, as a cheap pruning criterion termed Optimal Posterior Damage (OPD). Because this quantity comes essentially for free once the marginal likelihood has been estimated, OPD avoids the expensive computations of many traditional criteria while still outperforming them, particularly at high sparsity.
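The exact form of the criterion is not spelled out in this summary, so the following is a minimal sketch under the assumption that OPD scores weights in the spirit of Optimal Brain Damage, using a diagonal posterior precision already available from the Laplace approximation; the function names are illustrative, not the paper's API.

```python
import torch

def opd_saliency(weights: torch.Tensor, posterior_precision: torch.Tensor) -> torch.Tensor:
    """OBD-style saliency: estimated loss increase from zeroing each weight.

    Assumes a diagonal (per-weight) posterior precision, e.g. from a diagonal
    Laplace approximation; other Hessian structures would need their diagonal
    extracted first.
    """
    return 0.5 * posterior_precision * weights.pow(2)

def unstructured_prune(weights: torch.Tensor, posterior_precision: torch.Tensor,
                       sparsity: float) -> torch.Tensor:
    """Zero out the fraction `sparsity` of weights with the lowest saliency."""
    scores = opd_saliency(weights, posterior_precision).flatten()
    k = int(sparsity * scores.numel())
    if k == 0:
        return weights
    threshold = scores.kthvalue(k).values          # k-th smallest score
    mask = (scores > threshold).reshape(weights.shape).to(weights.dtype)
    return weights * mask
```

Because the precision is a by-product of estimating the marginal likelihood, computing such scores adds essentially no overhead beyond an elementwise product and a threshold.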

Methodological Insights

Structured and Unstructured Sparsification: SpaM accommodates both structured and unstructured sparsification, which broadens its utility across diverse architectures and datasets. The framework balances performance retention against computational cost, using tailored prior configurations to optimize sparsifiability (see the sketch below for the structured case).
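As an illustration of the structured case (a hypothetical sketch, not the paper's implementation), the same per-weight scores can be aggregated over a structural unit, such as an output channel of a convolution, and whole units pruned by their summed saliency:

```python
import torch

def structured_channel_prune(conv_weight: torch.Tensor,
                             posterior_precision: torch.Tensor,
                             sparsity: float) -> torch.Tensor:
    """Prune whole output channels of a conv weight of shape [C_out, C_in, kH, kW]
    by the summed OBD-style saliency of their weights. Illustrative sketch only."""
    saliency = 0.5 * posterior_precision * conv_weight.pow(2)
    channel_scores = saliency.sum(dim=(1, 2, 3))       # one score per output channel
    k = int(sparsity * channel_scores.numel())
    if k == 0:
        return conv_weight
    threshold = channel_scores.kthvalue(k).values
    mask = (channel_scores > threshold).to(conv_weight.dtype)
    return conv_weight * mask.view(-1, 1, 1, 1)        # broadcast mask over channels
```

Unlike unstructured masking, removing entire channels yields smaller dense tensors, which translates more directly into wall-clock and memory savings on standard hardware.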

Empirical Validation: A comprehensive suite of experiments demonstrates SpaM's robustness across architectures such as ResNet, MLP, and LeNet. Notably, accuracy is largely retained even at high sparsity levels, improving over standard Maximum A Posteriori (MAP) training. This provides empirical evidence of the framework's efficacy and reliability across different network types and datasets.

Implications and Future Perspectives

This research contributes to the development of efficient neural networks by enhancing model sparsifiability through Bayesian methods. The practical implications are clear: by lowering computational and memory overhead, SpaM facilitates deploying large AI models on resource-constrained consumer hardware.

Theoretically, the paper motivates further investigation into Bayesian regularization strategies, like those in SpaM, that increase network sparsity without compromising model quality. Future work could extend the size and diversity of the datasets and evaluate the framework's adaptability to more specialized architectures, such as graph neural networks or LLMs.

In conclusion, this work is a solid step forward for neural network sparsification, offering a Bayesian perspective that lays a foundation for further advances in AI efficiency without sacrificing model quality or performance.