Sparse Model Soups: A Recipe for Improved Pruning via Model Averaging (2306.16788v3)
Abstract: Neural networks can be significantly compressed by pruning, yielding sparse models with reduced storage and computational demands while preserving predictive performance. Model soups (Wortsman et al., 2022) enhance generalization and out-of-distribution (OOD) performance by averaging the parameters of multiple models into a single one, without increasing inference time. However, achieving both sparsity and parameter averaging is challenging as averaging arbitrary sparse models reduces the overall sparsity due to differing sparse connectivities. This work addresses these challenges by demonstrating that exploring a single retraining phase of Iterative Magnitude Pruning (IMP) with varied hyperparameter configurations such as batch ordering or weight decay yields models suitable for averaging, sharing identical sparse connectivity by design. Averaging these models significantly enhances generalization and OOD performance over their individual counterparts. Building on this, we introduce Sparse Model Soups (SMS), a novel method for merging sparse models by initiating each prune-retrain cycle with the averaged model from the previous phase. SMS preserves sparsity, exploits sparse network benefits, is modular and fully parallelizable, and substantially improves IMP's performance. We further demonstrate that SMS can be adapted to enhance state-of-the-art pruning-during-training approaches.
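To make the recipe concrete, below is a minimal Python sketch of the SMS loop described in the abstract, under simplifying assumptions: global magnitude pruning over all parameters, a uniform (unweighted) soup, and a hypothetical `train_one_phase` helper that retrains a masked copy under one varied hyperparameter configuration. It illustrates the idea and is not the authors' implementation.

```python
# Minimal sketch of Sparse Model Soups (SMS), assuming a PyTorch model pruned by
# global magnitude. `train_one_phase` is a hypothetical retraining helper; it must
# keep pruned weights at zero (e.g. by re-applying the mask after each step).
import copy
import torch


def magnitude_mask(model, sparsity):
    """Global magnitude pruning: keep the (1 - sparsity) fraction of largest-magnitude weights.
    For simplicity all parameters are scored; in practice only weight tensors are pruned."""
    scores = torch.cat([p.detach().abs().flatten() for p in model.parameters()])
    k = max(1, int((1 - sparsity) * scores.numel()))
    threshold = torch.topk(scores, k, largest=True).values.min()
    return [(p.detach().abs() >= threshold).float() for p in model.parameters()]


def apply_mask(model, mask):
    with torch.no_grad():
        for p, m in zip(model.parameters(), mask):
            p.mul_(m)


def average_models(models):
    """Uniform soup: elementwise parameter average (all models share one mask by design)."""
    soup = copy.deepcopy(models[0])
    with torch.no_grad():
        for params in zip(soup.parameters(), *(mod.parameters() for mod in models)):
            params[0].copy_(torch.stack(params[1:]).mean(dim=0))
    return soup


def sparse_model_soups(model, sparsity_schedule, m, train_one_phase):
    """Each prune-retrain phase starts from the averaged model of the previous phase."""
    for sparsity in sparsity_schedule:          # e.g. [0.5, 0.75, 0.875] as in IMP
        mask = magnitude_mask(model, sparsity)  # prune once -> shared sparse connectivity
        apply_mask(model, mask)
        # Retrain m copies with varied hyperparameters (seed, batch ordering, weight decay);
        # they stay averageable because all copies inherit the same mask.
        candidates = [train_one_phase(copy.deepcopy(model), mask, variant=i) for i in range(m)]
        model = average_models(candidates)
        apply_mask(model, mask)                 # averaging preserves zeros; re-apply defensively
    return model
```

Because the m retraining runs within a phase are independent, they can be executed fully in parallel; only the averaging step synchronizes them before the next prune-retrain cycle.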
- Git re-basin: Merging models modulo permutation symmetries. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=CQsmMYmlP5T.
- Towards understanding ensemble, knowledge distillation and self-distillation in deep learning. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=Uuf2q9TfXGA.
- Random initialisations performing above chance and how to find them. September 2022.
- What is the state of neural network pruning? In I. Dhillon, D. Papailiopoulos, and V. Sze (eds.), Proceedings of Machine Learning and Systems, volume 2, pp. 129–146, 2020. URL https://proceedings.mlsys.org/paper/2020/file/d2ddea18f00665ce8623e36bd4e3c7c5-Paper.pdf.
- Findings of the 2016 conference on machine translation. In Proceedings of the First Conference on Machine Translation, pp. 131–198, Berlin, Germany, August 2016. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/W/W16/W16-2301.
- Learning-compression algorithms for neural net pruning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
- Gradient perturbation-based efficient deep ensembles. In Proceedings of the 6th Joint International Conference on Data Science & Management of Data (10th ACM IKDD CODS and 28th COMAD), pp. 28–36, 2023.
- Fusing finetuned models for better pretraining. April 2022.
- The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3213–3223, 2016.
- Seasoning model soups for robustness to adversarial and natural distribution shifts. February 2023.
- Sparse networks from scratch: Faster training without losing performance. arXiv preprint arXiv:1907.04840, July 2019.
- Global sparse momentum sgd for pruning very deep neural networks. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/f34185c4ca5d58e781d4f14173d41e5d-Paper.pdf.
- The role of permutation invariance in linear mode connectivity of neural networks. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=dNigytemkL.
- Rigging the lottery: Making all tickets winners. In Hal Daumé III and Aarti Singh (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 2943–2952. PMLR, 13–18 Jul 2020. URL https://proceedings.mlr.press/v119/evci20a.html.
- Gradient flow in sparse neural networks and how lottery tickets win. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp. 6577–6586, 2022.
- Deep ensembles: A loss landscape perspective. December 2019.
- The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations, 2019.
- Linear mode connectivity and the lottery ticket hypothesis. In International Conference on Machine Learning, pp. 3259–3269. PMLR, 2020.
- The state of sparsity in deep neural networks. arXiv preprint arXiv:1902.09574, 2019.
- Ensemble deep learning: A review. Engineering Applications of Artificial Intelligence, 2022. doi: 10.1016/j.engappai.2022.105151.
- Loss surfaces, mode connectivity, and fast ensembling of dnns. Advances in neural information processing systems, 31, 2018.
- Knowledge is a region in weight space for fine-tuned language models. February 2023.
- Learning both weights and connections for efficient neural networks. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015. URL https://proceedings.neurips.cc/paper/2015/file/ae0eb3eed39d2bcef4622b2499a05fe6-Paper.pdf.
- Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), December 2015.
- Benchmarking neural network robustness to common corruptions and perturbations. March 2019.
- Distilling the knowledge in a neural network. March 2015.
- Sparsity in deep learning: Pruning and growth for efficient inference and training in neural networks. arXiv preprint arXiv:2102.00554, January 2021.
- What do compressed deep neural networks forget? November 2019.
- Characterising bias in compressed models. October 2020.
- Snapshot ensembles: Train 1, get m for free. In International Conference on Learning Representations, 2017. URL https://openreview.net/forum?id=BJYwwY9ll.
- Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Francis R. Bach and David M. Blei (eds.), Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, volume 37 of JMLR Workshop and Conference Proceedings, pp. 448–456. JMLR.org, 2015. URL http://proceedings.mlr.press/v37/ioffe15.html.
- Averaging weights leads to wider optima and better generalization. March 2018.
- Instant soup: Cheap pruning ensembles in a single pass can draw lottery tickets from large models. June 2023.
- Population parameter averaging (PAPA). April 2023.
- Repair: Renormalizing permuted activations for interpolation repair. November 2022.
- Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Fair-ensemble: When fairness naturally emerges from deep ensembling. March 2023.
- Diverse lottery tickets boost ensemble from a single pretrained model. May 2022.
- Learning multiple layers of features from tiny images. Technical report, 2009.
- Soft threshold weight reparameterization for learnable sparsity. In Hal Daumé III and Aarti Singh (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 5544–5555. PMLR, 13–18 Jul 2020. URL https://proceedings.mlr.press/v119/kusupati20a.html.
- Simple and scalable predictive uncertainty estimation using deep ensembles. Advances in neural information processing systems, 30, 2017.
- Network pruning that matters: A case study on retraining variants. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=Cb54AMqHQFP.
- Layer-adaptive sparsity for the magnitude-based pruning. In International Conference on Learning Representations, 2021.
- SNIP: Single-shot network pruning based on connection sensitivity. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=B1VZqjAcYX.
- Eagleeye: Fast sub-net evaluation for efficient neural network pruning. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16, pp. 639–654. Springer, 2020.
- Pruning filters for efficient convnets. August 2016.
- Dynamic model pruning with feedback. In International Conference on Learning Representations, 2020.
- Dynamic sparse training: Find efficient sparse network from scratch with trainable masked layers. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=SJlbGJrtDB.
- Deep ensembling with no overhead for either training or testing: The all-round blessings of dynamic sparsity. In International Conference on Learning Representations, 2022.
- Deep learning face attributes in the wild. In Proceedings of the IEEE international conference on computer vision, pp. 3730–3738, 2015.
- Merging models with fisher-weighted averaging. Advances in Neural Information Processing Systems, 35:17703–17716, 2022.
- Pep: Parameter ensembling by perturbation. Advances in neural information processing systems, 33:8895–8906, 2020.
- Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science. Nature Communications, 9(1), June 2018. doi: 10.1038/s41467-018-04316-3.
- What is being transferred in transfer learning? Advances in neural information processing systems, 33:512–523, 2020.
- Michela Paganini. Prune responsibly. September 2020.
- Unmasking the lottery ticket hypothesis: What’s encoded in a winning ticket’s mask? In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=xSsW2Am-ukZ.
- Deep neural network training with frank-wolfe. arXiv preprint arXiv:2010.07243, 2020.
- Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020. URL http://jmlr.org/papers/v21/20-074.html.
- Diverse weight averaging for out-of-distribution generalization. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=tq_J_MqB3UB.
- Comparing rewinding and fine-tuning in neural network pruning. In International Conference on Learning Representations, 2020.
- ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015. doi: 10.1007/s11263-015-0816-y.
- Model fusion via optimal transport. Advances in Neural Information Processing Systems, 33:22045–22055, 2020.
- Federated progressive sparsification (purge, merge, tune)+. April 2022.
- Pruning neural networks without any data by iteratively conserving synaptic flow. Advances in Neural Information Processing Systems, 33, 2020.
- Pruning has a disparate impact on model accuracy. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=11nMVZK0WYM.
- Maxvit: Multi-axis vision transformer. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXIV, pp. 459–479. Springer, 2022.
- Neural networks with late-phase weights. In International Conference on Learning Representations, 2021.
- Prune and tune ensembles: Low-cost ensemble learning with sparse independent subnetworks. February 2022.
- Discovering neural wirings. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/d010396ca8abf6ead8cacc2c2f2f26c7-Paper.pdf.
- Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In International Conference on Machine Learning, pp. 23965–23998. PMLR, 2022a.
- Robust fine-tuning of zero-shot models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7959–7971, 2022b.
- Lottery pools: Winning more by interpolating tickets without increasing training or inference cost. August 2022a.
- Superposing many tickets into one: A performance booster for sparse neural network training. May 2022b.
- Wide residual networks. arXiv preprint arXiv:1605.07146, May 2016.
- Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, November 2016.
- Pyramid scene parsing network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2881–2890, 2017.
- To prune, or not to prune: Exploring the efficacy of pruning for model compression. arXiv preprint arXiv:1710.01878, October 2017.
- Compression-aware training of neural networks using frank-wolfe. arXiv preprint arXiv:2205.11921, 2022.
- How I Learned To Stop Worrying And Love Retraining. In International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=_nF5imFKQI.