Decoupled Weight Decay for Any $p$ Norm (2404.10824v2)
Abstract: With the success of deep neural networks (NNs) in a variety of domains, the computational and storage requirements for training and deploying large NNs have become a bottleneck for further improvements. Sparsification has consequently emerged as a leading approach to tackle these issues. In this work, we consider a simple yet effective approach to sparsification based on Bridge, or $L_p$, regularization during training. We introduce a novel weight decay scheme that generalizes the standard $L_2$ weight decay to any $p$ norm. We show that this scheme is compatible with adaptive optimizers and avoids the gradient divergence associated with $0<p<1$ norms. We empirically demonstrate that it leads to highly sparse networks, while maintaining generalization performance comparable to standard $L_2$ regularization.
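To make the idea of decoupled $L_p$ weight decay concrete, here is a minimal sketch in the spirit of AdamW-style decoupling: the $L_p$ shrinkage is applied as a separate step after the adaptive gradient update, and the shrinkage magnitude is clamped so a weight can at most be driven to zero rather than overshooting when $0<p<1$. This is an illustrative assumption, not the paper's exact update rule; the function `lp_decoupled_decay_`, the `eps` regularizer, and the clamping choice are all hypothetical.

```python
# Hypothetical sketch of decoupled L_p weight decay (not the paper's exact rule).
import torch


@torch.no_grad()
def lp_decoupled_decay_(params, lr: float, weight_decay: float, p: float, eps: float = 1e-8):
    """Apply a decoupled L_p decay step in place, after the optimizer step.

    The penalty (lambda / p) * |w|^p has (sub)gradient lambda * sign(w) * |w|^(p - 1),
    which diverges as |w| -> 0 when p < 1. Here the shrinkage is clamped so that a
    weight is at most shrunk to exactly zero and never flips sign.
    """
    for w in params:
        decay = lr * weight_decay * w.abs().add(eps).pow(p - 1.0)
        # Shrink toward zero by at most |w| (soft-threshold-like clamp).
        shrink = torch.minimum(decay, w.abs())
        w.sub_(torch.sign(w) * shrink)


# Usage sketch: the decay is decoupled from the adaptive gradient statistics,
# so the optimizer is run with weight_decay=0 and the L_p step is applied manually.
model = torch.nn.Linear(16, 4)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x, y = torch.randn(8, 16), torch.randn(8, 4)
loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()
opt.step()
lp_decoupled_decay_(model.parameters(), lr=1e-3, weight_decay=1e-2, p=0.5)
opt.zero_grad()
```

As in decoupled $L_2$ weight decay, keeping the shrinkage outside the optimizer prevents the adaptive preconditioner from rescaling the regularization term; the clamp is one plausible way to sidestep the $|w|^{p-1}$ divergence near zero and is what drives weights exactly to zero, producing sparsity.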