Why Do We Need Weight Decay in Modern Deep Learning? (2310.04415v2)
Abstract: Weight decay is a broadly used technique for training state-of-the-art deep networks, from image classification models to LLMs. Despite its widespread use and extensive study in the classical literature, its role in deep learning remains poorly understood. In this work, we highlight that the role of weight decay in modern deep learning is different from its regularization effect studied in classical learning theory. For deep networks on vision tasks trained with multipass SGD, we show how weight decay modifies the optimization dynamics, enhancing the ever-present implicit regularization of SGD via the loss stabilization mechanism. In contrast, for LLMs trained for nearly one epoch, we describe how weight decay balances the bias-variance tradeoff in stochastic optimization, leading to lower training loss and improved training stability. Overall, we present a unifying perspective, from ResNets on vision tasks to LLMs: weight decay is never useful as an explicit regularizer but instead changes the training dynamics in a desirable way. The code is available at https://github.com/tml-epfl/why-weight-decay
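To make the object of study concrete, below is a minimal sketch (not the authors' code from the repository above) of the two standard ways weight decay enters a stochastic-gradient update, illustrated on a toy least-squares problem. The data, step size, and decay coefficient are illustrative assumptions.

```python
import numpy as np

# Toy regression data (illustrative only).
rng = np.random.default_rng(0)
d = 10
X = rng.normal(size=(256, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=256)

def sgd(lr=0.05, lam=1e-2, decoupled=False, steps=2000, batch=32, seed=1):
    """Mini-batch SGD on least squares, with weight decay applied in one of two ways."""
    rng_local = np.random.default_rng(seed)  # same mini-batch sequence for both variants
    w = np.zeros(d)
    for _ in range(steps):
        idx = rng_local.integers(0, X.shape[0], size=batch)
        grad = X[idx].T @ (X[idx] @ w - y[idx]) / batch  # gradient of the data loss only
        if decoupled:
            # Decoupled weight decay: shrink the weights, then take the gradient step.
            w = (1.0 - lr * lam) * w - lr * grad
        else:
            # Coupled L2 regularization: add lam * w to the gradient before the step.
            w = w - lr * (grad + lam * w)
    return w

# For plain SGD the two variants coincide (up to floating-point error).
print(np.linalg.norm(sgd(decoupled=False) - sgd(decoupled=True)))
```

For vanilla SGD, the coupled and decoupled forms implement the same update; the distinction only matters for adaptive optimizers, where Adam with L2 regularization and AdamW (decoupled weight decay) behave differently, which is the setting relevant to LLM training.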
Authors: Maksym Andriushchenko, Francesco D'Angelo, Aditya Varre, Nicolas Flammarion