Sparse Model Soups: A Recipe for Improved Pruning via Model Averaging (2306.16788v3)

Published 29 Jun 2023 in cs.LG and cs.AI

Abstract: Neural networks can be significantly compressed by pruning, yielding sparse models with reduced storage and computational demands while preserving predictive performance. Model soups (Wortsman et al., 2022) enhance generalization and out-of-distribution (OOD) performance by averaging the parameters of multiple models into a single one, without increasing inference time. However, achieving both sparsity and parameter averaging is challenging as averaging arbitrary sparse models reduces the overall sparsity due to differing sparse connectivities. This work addresses these challenges by demonstrating that exploring a single retraining phase of Iterative Magnitude Pruning (IMP) with varied hyperparameter configurations such as batch ordering or weight decay yields models suitable for averaging, sharing identical sparse connectivity by design. Averaging these models significantly enhances generalization and OOD performance over their individual counterparts. Building on this, we introduce Sparse Model Soups (SMS), a novel method for merging sparse models by initiating each prune-retrain cycle with the averaged model from the previous phase. SMS preserves sparsity, exploits sparse network benefits, is modular and fully parallelizable, and substantially improves IMP's performance. We further demonstrate that SMS can be adapted to enhance state-of-the-art pruning-during-training approaches.

References (79)
  1. Git re-basin: Merging models modulo permutation symmetries. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=CQsmMYmlP5T.
  2. Towards understanding ensemble, knowledge distillation and self-distillation in deep learning. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=Uuf2q9TfXGA.
  3. Random initialisations performing above chance and how to find them. September 2022.
  4. What is the state of neural network pruning? In I. Dhillon, D. Papailiopoulos, and V. Sze (eds.), Proceedings of Machine Learning and Systems, volume 2, pp.  129–146, 2020. URL https://proceedings.mlsys.org/paper/2020/file/d2ddea18f00665ce8623e36bd4e3c7c5-Paper.pdf.
  5. Findings of the 2016 conference on machine translation. In Proceedings of the First Conference on Machine Translation, pp.  131–198, Berlin, Germany, August 2016. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/W/W16/W16-2301.
  6. Learning-compression algorithms for neural net pruning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  7. Gradient perturbation-based efficient deep ensembles. In Proceedings of the 6th Joint International Conference on Data Science & Management of Data (10th ACM IKDD CODS and 28th COMAD), pp. 28–36, 2023.
  8. Fusing finetuned models for better pretraining. April 2022.
  9. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  3213–3223, 2016.
  10. Seasoning model soups for robustness to adversarial and natural distribution shifts. February 2023.
  11. Sparse networks from scratch: Faster training without losing performance. arXiv preprint arXiv:1907.04840, July 2019.
  12. Global sparse momentum sgd for pruning very deep neural networks. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/f34185c4ca5d58e781d4f14173d41e5d-Paper.pdf.
  13. The role of permutation invariance in linear mode connectivity of neural networks. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=dNigytemkL.
  14. Rigging the lottery: Making all tickets winners. In Hal Daumé III and Aarti Singh (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp.  2943–2952. PMLR, 13–18 Jul 2020. URL https://proceedings.mlr.press/v119/evci20a.html.
  15. Gradient flow in sparse neural networks and how lottery tickets win. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp.  6577–6586, 2022.
  16. Deep ensembles: A loss landscape perspective. December 2019.
  17. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations, 2018.
  18. Linear mode connectivity and the lottery ticket hypothesis. In International Conference on Machine Learning, pp. 3259–3269. PMLR, 2020.
  19. The state of sparsity in deep neural networks. arXiv preprint arXiv:1902.09574, 2019.
  20. Ensemble deep learning: A review. Engineering Applications of Artificial Intelligence, 2022. doi: 10.1016/j.engappai.2022.105151.
  21. Loss surfaces, mode connectivity, and fast ensembling of dnns. Advances in neural information processing systems, 31, 2018.
  22. Knowledge is a region in weight space for fine-tuned language models. February 2023.
  23. Learning both weights and connections for efficient neural networks. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015. URL https://proceedings.neurips.cc/paper/2015/file/ae0eb3eed39d2bcef4622b2499a05fe6-Paper.pdf.
  24. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), December 2015.
  25. Benchmarking neural network robustness to common corruptions and perturbations. March 2019.
  26. Distilling the knowledge in a neural network. March 2015.
  27. Sparsity in deep learning: Pruning and growth for efficient inference and training in neural networks. arXiv preprint arXiv:2102.00554, January 2021.
  28. What do compressed deep neural networks forget? November 2019.
  29. Characterising bias in compressed models. October 2020.
  30. Snapshot ensembles: Train 1, get m for free. In International Conference on Learning Representations, 2017. URL https://openreview.net/forum?id=BJYwwY9ll.
  31. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Francis R. Bach and David M. Blei (eds.), Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, volume 37 of JMLR Workshop and Conference Proceedings, pp.  448–456. JMLR.org, 2015. URL http://proceedings.mlr.press/v37/ioffe15.html.
  32. Averaging weights leads to wider optima and better generalization. March 2018.
  33. Instant soup: Cheap pruning ensembles in a single pass can draw lottery tickets from large models. June 2023.
  34. Population parameter averaging (PAPA). April 2023.
  35. Repair: Renormalizing permuted activations for interpolation repair. November 2022.
  36. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  37. Fair-ensemble: When fairness naturally emerges from deep ensembling. March 2023.
  38. Diverse lottery tickets boost ensemble from a single pretrained model. May 2022.
  39. Learning multiple layers of features from tiny images. Technical report, 2009.
  40. Soft threshold weight reparameterization for learnable sparsity. In Hal Daumé III and Aarti Singh (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp.  5544–5555. PMLR, 13–18 Jul 2020. URL https://proceedings.mlr.press/v119/kusupati20a.html.
  41. Simple and scalable predictive uncertainty estimation using deep ensembles. Advances in neural information processing systems, 30, 2017.
  42. Network pruning that matters: A case study on retraining variants. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=Cb54AMqHQFP.
  43. Layer-adaptive sparsity for the magnitude-based pruning. In International Conference on Learning Representations, October 2020.
  44. SNIP: Single-shot network pruning based on connection sensitivity. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=B1VZqjAcYX.
  45. Eagleeye: Fast sub-net evaluation for efficient neural network pruning. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16, pp.  639–654. Springer, 2020.
  46. Pruning filters for efficient convnets. August 2016.
  47. Dynamic model pruning with feedback. In International Conference on Learning Representations, 2020.
  48. Dynamic sparse training: Find efficient sparse network from scratch with trainable masked layers. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=SJlbGJrtDB.
  49. Deep ensembling with no overhead for either training or testing: The all-round blessings of dynamic sparsity. In International Conference on Learning Representations (ICLR 2022), June 2021.
  50. Deep learning face attributes in the wild. In Proceedings of the IEEE international conference on computer vision, pp.  3730–3738, 2015.
  51. Merging models with fisher-weighted averaging. Advances in Neural Information Processing Systems, 35:17703–17716, 2022.
  52. Pep: Parameter ensembling by perturbation. Advances in neural information processing systems, 33:8895–8906, 2020.
  53. Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science. Nature Communications, 9(1), June 2018. doi: 10.1038/s41467-018-04316-3.
  54. What is being transferred in transfer learning? Advances in neural information processing systems, 33:512–523, 2020.
  55. Michela Paganini. Prune responsibly. September 2020.
  56. Unmasking the lottery ticket hypothesis: What’s encoded in a winning ticket’s mask? In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=xSsW2Am-ukZ.
  57. Deep neural network training with frank-wolfe. arXiv preprint arXiv:2010.07243, 2020.
  58. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020. URL http://jmlr.org/papers/v21/20-074.html.
  59. Diverse weight averaging for out-of-distribution generalization. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=tq_J_MqB3UB.
  60. Comparing rewinding and fine-tuning in neural network pruning. In International Conference on Learning Representations, 2020.
  61. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015. doi: 10.1007/s11263-015-0816-y.
  62. Model fusion via optimal transport. Advances in Neural Information Processing Systems, 33:22045–22055, 2020.
  63. Federated progressive sparsification (purge, merge, tune)+. April 2022.
  64. Pruning neural networks without any data by iteratively conserving synaptic flow. In Advances in Neural Information Processing Systems, 2020.
  65. Pruning has a disparate impact on model accuracy. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=11nMVZK0WYM.
  66. Maxvit: Multi-axis vision transformer. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXIV, pp.  459–479. Springer, 2022.
  67. Neural networks with late-phase weights. In International Conference on Learning Representations, 2021.
  68. Prune and tune ensembles: Low-cost ensemble learning with sparse independent subnetworks. February 2022.
  69. Discovering neural wirings. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/d010396ca8abf6ead8cacc2c2f2f26c7-Paper.pdf.
  70. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In International Conference on Machine Learning, pp. 23965–23998. PMLR, 2022a.
  71. Robust fine-tuning of zero-shot models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  7959–7971, 2022b.
  72. Lottery pools: Winning more by interpolating tickets without increasing training or inference cost. August 2022a.
  73. Superposing many tickets into one: A performance booster for sparse neural network training. May 2022b.
  74. Wide residual networks. arXiv preprint arXiv:1605.07146, May 2016.
  75. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, November 2016.
  76. Pyramid scene parsing network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  2881–2890, 2017.
  77. To prune, or not to prune: Exploring the efficacy of pruning for model compression. arXiv preprint arXiv:1710.01878, October 2017.
  78. Compression-aware training of neural networks using frank-wolfe. arXiv preprint arXiv:2205.11921, 2022.
  79. How I Learned To Stop Worrying And Love Retraining. In International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=_nF5imFKQI.

Summary

  • The paper’s main contribution is Sparse Model Soups (SMS), a method that starts each prune-retrain phase from the averaged model of the previous phase, preserving sparse connectivity while enhancing generalization.
  • It integrates iterative magnitude pruning with model averaging to achieve up to a 2% accuracy improvement on benchmarks like CIFAR-10/100 and ImageNet.
  • The approach offers practical benefits for resource-constrained environments and opens avenues for further research in balancing sparsity with neural model robustness.

Sparse Model Soups: A Synthesis for Enhanced Pruning via Model Averaging

This paper addresses a central challenge in sparse neural networks: averaging the parameters of arbitrary sparse models, as in model soups, reduces sparsity because the models have differing sparse connectivities. Sparse Model Soups (SMS), the method introduced here, integrates pruning and parameter averaging so as to enhance generalization and out-of-distribution (OOD) performance while preserving sparsity.

Sparse neural networks obtained by pruning are known to significantly reduce model size, storage, and computational requirements without sacrificing predictive performance. However, merging multiple sparse models into a single one through parameter averaging (model soups) has not been straightforward, because independently pruned models typically end up with different sparsity patterns. SMS addresses this by averaging only models that share identical sparse connectivity, thereby preserving sparsity.
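To make the connectivity requirement concrete, the following is a minimal sketch (not from the paper; it assumes PyTorch-style state dicts and a hypothetical dictionary of per-layer binary masks) showing that averaging models which share one mask leaves the sparsity pattern, and hence the sparsity level, unchanged:

```python
import torch

def average_shared_mask_models(state_dicts, masks):
    """Average models whose prunable layers share identical binary masks.

    Because every model is zero in exactly the same positions, the
    elementwise mean is zero there too, so the shared sparsity pattern
    and the overall sparsity level are preserved by construction.
    """
    averaged = {}
    for name, ref in state_dicts[0].items():
        if ref.is_floating_point():
            averaged[name] = torch.stack(
                [sd[name] for sd in state_dicts]).mean(dim=0)
        else:
            # Non-float buffers (e.g. batch counters) are copied as-is.
            averaged[name] = ref.clone()
        if name in masks:
            # Reapplying the shared mask is mathematically redundant but
            # guards against tiny numerical deviations from exact zero.
            averaged[name] = averaged[name] * masks[name]
    return averaged
```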

Methodology and Contributions

The paper's primary innovation, SMS, extends Iterative Magnitude Pruning (IMP) by using the averaged model from each phase as the starting point of the next prune-retrain phase. This keeps the sparse connectivity consistent across phases and carries the previous phase's knowledge forward, improving both the sparse model's accuracy and its generalization.

SMS is developed through the following key steps (a schematic code sketch follows the list):

  • A pretrained model undergoes pruning to eliminate low-magnitude weights.
  • Derived models are retrained under different hyperparameters, ensuring diverse yet structurally consistent candidate models for averaging.
  • These models are averaged to form a Sparse Model Soup, which maintains the sparse structure due to shared connectivity.
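The pseudocode below is a schematic sketch of this loop, not the authors' implementation; prune_by_magnitude, retrain, average, and retrain_configs are hypothetical placeholders for the corresponding components described above:

```python
import copy

def sparse_model_soups(model, num_phases, retrain_configs,
                       prune_by_magnitude, retrain, average):
    """Schematic SMS loop: prune, retrain several copies, average, repeat.

    prune_by_magnitude(model) -> (pruned_model, mask)
    retrain(model, mask, config) -> retrained model; the mask stays fixed,
        so every candidate shares the same sparse connectivity.
    average(candidates) -> parameter-averaged model (the "soup").
    """
    for _ in range(num_phases):
        model, mask = prune_by_magnitude(model)        # raise sparsity
        candidates = [
            retrain(copy.deepcopy(model), mask, cfg)   # vary seed, weight decay, ...
            for cfg in retrain_configs                 # fully parallelizable
        ]
        model = average(candidates)                    # the soup seeds the next phase
    return model
```

Because each retraining run is independent, the candidates can be trained in parallel, which is where the method's modularity and parallelizability come from.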

Experiments on diverse benchmarks such as CIFAR-10/100 and ImageNet demonstrate SMS's effectiveness across architectures and tasks. The results show that SMS outperforms standard IMP as well as adaptations such as IMP with extended retraining and IMP with repruning (IMP-RePrune).

Numerical Results and Insights

Numerical experiments show that SMS delivers consistent improvements in test accuracy. The gains are especially pronounced at high target sparsities (98% and beyond), where SMS remains robust, whereas naively averaging models with diverging sparse connectivities degrades sparsity. SMS outperforms baselines such as IMP with extended retraining and IMP-RePrune by up to 2% in accuracy, reflecting improved generalization and OOD performance.

SMS's utility extends beyond sparse model averaging within IMP: it can also be adapted to pruning-during-training methods such as gradual magnitude pruning (GMP) and dynamic pruning with feedback (DPF). The performance gains from integrating SMS into these frameworks underscore its modular design and its potential for broader applicability across sparsification methods.

Theoretical and Practical Implications

Theoretically, SMS suggests a model-merging approach that balances sparsity and robustness, paving the way for more efficient learning strategies that make effective use of existing models without extensive retraining from scratch. Practically, its parallelizability and ease of embedding into existing frameworks promise substantial improvements in resource-limited environments where efficiency is a paramount concern.

Outlook and Future Research Directions

The SMS framework opens promising avenues for future research in both theoretical and applied AI. Exploring further integration within dynamic sparse training paradigms or extending its application into more complex transfer learning scenarios could yield deeper insights and broader applicability. Furthermore, investigating the interplay between sparsity, regularization, and neural architecture choices stands as a potential research frontier catalyzed by SMS's insights.

In summary, Sparse Model Soups is a significant contribution to neural model compression, revealing a productive interplay between pruning and parameter averaging. The work demonstrates the viability of model averaging under the structural constraint of shared sparse connectivity, underscoring the potential of such techniques for advancing the efficiency and adaptability of neural architectures.