Decoupled Weight Decay for Any $p$ Norm (2404.10824v2)

Published 16 Apr 2024 in cs.LG, cs.AI, cs.NE, and math.OC

Abstract: With the success of deep neural networks (NNs) in a variety of domains, the computational and storage requirements for training and deploying large NNs have become a bottleneck for further improvements. Sparsification has consequently emerged as a leading approach to tackle these issues. In this work, we consider a simple yet effective approach to sparsification, based on the Bridge, or $L_p$ regularization during training. We introduce a novel weight decay scheme, which generalizes the standard $L_2$ weight decay to any $p$ norm. We show that this scheme is compatible with adaptive optimizers, and avoids the gradient divergence associated with $0<p<1$ norms. We empirically demonstrate that it leads to highly sparse networks, while maintaining generalization performance comparable to standard $L_2$ regularization.
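The abstract describes a weight decay scheme that decouples the decay step from the gradient update (in the spirit of AdamW) and generalizes it from the $L_2$ norm to an arbitrary $p$ norm, while avoiding the gradient divergence of $L_p$ penalties with $0<p<1$. The sketch below is a minimal illustration of this general idea, not a reproduction of the paper's algorithm: the decay is applied directly to the weights after the optimizer step, and the $|w|^{p-1}$ magnitude is capped at 1 for $p<1$ as a crude stand-in for the paper's divergence-avoiding mechanism. The function name `decoupled_lp_decay_`, the cap, and all hyperparameter values are illustrative assumptions.

```python
import torch

@torch.no_grad()
def decoupled_lp_decay_(params, lr, weight_decay, p=2.0):
    """In-place decoupled L_p decay: w <- w - lr * weight_decay * sign(w) * |w|^(p-1).

    For p = 2 this reduces to standard decoupled L2 weight decay
    (the constant factor p is folded into weight_decay).
    """
    for w in params:
        mag = w.abs().pow(p - 1.0)
        if p < 1.0:
            # Crude cap so the decay stays finite as |w| -> 0 for 0 < p < 1
            # (illustrative only; the paper proposes its own scheme for this).
            mag = mag.clamp(max=1.0)
        w.add_(w.sign() * mag, alpha=-lr * weight_decay)

# Usage sketch: take the gradient-based optimizer step first, then apply the
# decoupled decay. Note the optimizer itself carries no weight_decay term.
model = torch.nn.Linear(16, 4)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
x, y = torch.randn(32, 16), torch.randn(32, 4)
loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()
opt.step()
decoupled_lp_decay_(model.parameters(), lr=1e-2, weight_decay=1e-3, p=0.8)
```

Because the decay is applied outside the loss, it composes with adaptive optimizers (Adam, RMSProp) without the penalty gradient being rescaled by the adaptive preconditioner, which is the motivation for decoupling in the first place.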
