Why Do We Need Weight Decay in Modern Deep Learning? (2310.04415v2)

Published 6 Oct 2023 in cs.LG

Abstract: Weight decay is a broadly used technique for training state-of-the-art deep networks from image classification to LLMs. Despite its widespread usage and being extensively studied in the classical literature, its role remains poorly understood for deep learning. In this work, we highlight that the role of weight decay in modern deep learning is different from its regularization effect studied in classical learning theory. For deep networks on vision tasks trained with multipass SGD, we show how weight decay modifies the optimization dynamics enhancing the ever-present implicit regularization of SGD via the loss stabilization mechanism. In contrast, for LLMs trained with nearly one-epoch training, we describe how weight decay balances the bias-variance tradeoff in stochastic optimization leading to lower training loss and improved training stability. Overall, we present a unifying perspective from ResNets on vision tasks to LLMs: weight decay is never useful as an explicit regularizer but instead changes the training dynamics in a desirable way. The code is available at https://github.com/tml-epfl/why-weight-decay

Authors (4)
  1. Maksym Andriushchenko (33 papers)
  2. Francesco D'Angelo (21 papers)
  3. Aditya Varre (4 papers)
  4. Nicolas Flammarion (63 papers)
Citations (17)

Summary

Overview of Weight Decay in Modern Deep Learning

The paper, "Why Do We Need Weight Decay in Modern Deep Learning?" explores the multifaceted role of weight decay in training contemporary neural networks. Notably, it challenges the classical view of weight decay as a mere regularization tool, proposing instead that it significantly modifies optimization dynamics across a spectrum of deep learning tasks.

Main Contributions

The authors examine weight decay's function through a detailed empirical study and present a theoretical framework to understand its effects. The paper's main contributions are as follows:

  1. Enhanced Implicit Regularization: For overparameterized networks trained on vision tasks with multi-pass stochastic gradient descent (SGD), weight decay is shown to influence the optimization dynamics by enhancing the implicit regularization effect of SGD noise, through a mechanism termed "loss stabilization" (a minimal update-rule sketch follows this list).
  2. Role in LLMs: In contrast, for LLMs trained for roughly a single epoch, weight decay does not serve as a traditional regularizer. Instead, it balances the bias-variance tradeoff in stochastic optimization, leading to lower training loss.
  3. Prevention of Divergences: The paper also highlights an unexpected benefit of weight decay: preventing sudden loss divergences during bfloat16 mixed-precision training, which is crucial for scalable LLM training.
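
To pin down what "weight decay" means operationally, the sketch below (PyTorch-style Python, not the authors' code) shows the two standard ways it enters training: as an L2 penalty folded into the gradient, and as a decoupled shrinkage applied directly in the update. For plain SGD the two coincide; for adaptive optimizers such as Adam they differ, which is why AdamW decouples the decay.

```python
import torch

def sgd_step_l2(params, lr, wd):
    """SGD on the penalized loss L(w) + (wd/2) * ||w||^2:
    w <- w - lr * (grad + wd * w)."""
    for p in params:
        if p.grad is None:
            continue
        p.data.add_(p.grad + wd * p.data, alpha=-lr)

def sgd_step_decoupled(params, lr, wd):
    """Decoupled weight decay: shrink the weights, then take the gradient step.
    For plain SGD this matches the L2 version above; for Adam it does not."""
    for p in params:
        if p.grad is None:
            continue
        p.data.mul_(1.0 - lr * wd)
        p.data.add_(p.grad, alpha=-lr)

# Toy usage on a linear least-squares objective.
w = torch.randn(10, requires_grad=True)
x, y = torch.randn(32, 10), torch.randn(32)
loss = ((x @ w - y) ** 2).mean()
loss.backward()
sgd_step_decoupled([w], lr=0.1, wd=5e-4)
```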

Empirical Analysis and Insights

The paper presents compelling empirical evidence supporting the view of weight decay as a tool that modifies training dynamics favorably:

  • Loss Stabilization Mechanism: In overparameterized networks, weight decay alters the effective learning rate, allowing the model to capitalize on the implicit, noise-driven regularization of SGD (see the effective-learning-rate sketch after this list). Evidence comes from experiments with VGG and ResNet models on CIFAR-10 and CIFAR-100.
  • Optimization and Stability in LLMs: For LLMs, the paper reproduces empirical findings showing that weight decay yields lower training loss, particularly toward the end of training, when paired with a decaying learning-rate schedule.
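
To make the effective-learning-rate argument concrete: for scale-invariant parameters (e.g., weights followed by normalization layers), the relevant step size scales as lr / ||w||^2, so keeping ||w|| small via weight decay keeps the effective step size from collapsing. Below is a minimal monitoring sketch; the helper is hypothetical and not taken from the paper's repository.

```python
import torch

def effective_lr(lr: float, params) -> float:
    """Rough effective step size lr / ||w||^2 for a group of parameters.
    For scale-invariant layers, weight decay keeps ||w|| bounded and hence
    keeps this quantity from shrinking as training proceeds."""
    sq_norm = sum(p.detach().pow(2).sum().item() for p in params)
    return lr / sq_norm

# Example: monitor the quantity for a (normalization-preceded) linear layer.
layer = torch.nn.Linear(128, 128, bias=False)
print(effective_lr(0.1, layer.parameters()))
```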

Theoretical Implications

The authors propose a unifying theory to explain these observations:

  • Regularization via Hessian Trace: A central conjecture posits that weight decay modifies optimization trajectories so that the SGD dynamics closely track a process that regularizes the trace of the loss Hessian, which is associated with better generalization (a sketch of one way to estimate this trace follows this list).
  • Effective Learning Rate: The work suggests that weight decay alters the effective learning rate by controlling parameter norms, thereby implicitly adjusting the learning-rate schedule, especially in LLMs.
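
The trace of the loss Hessian is the quantity this conjecture refers to. One standard way to estimate it is Hutchinson's estimator built from Hessian-vector products; the sketch below (plain PyTorch, not necessarily the paper's exact diagnostic) shows how such a measurement can be taken on a batch.

```python
import torch

def hessian_trace(loss, params, n_samples=10):
    """Hutchinson estimator of tr(H): average of v^T H v over Rademacher vectors v."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    estimates = []
    for _ in range(n_samples):
        vs = [torch.randint_like(p, 2) * 2 - 1 for p in params]  # entries in {-1, +1}
        gv = sum((g * v).sum() for g, v in zip(grads, vs))
        hvs = torch.autograd.grad(gv, params, retain_graph=True)  # Hessian-vector products
        estimates.append(sum((hv * v).sum() for hv, v in zip(hvs, vs)).item())
    return sum(estimates) / len(estimates)

# Example on a tiny model and batch.
model = torch.nn.Linear(20, 1)
x, y = torch.randn(64, 20), torch.randn(64, 1)
loss = torch.nn.functional.mse_loss(model(x), y)
print(hessian_trace(loss, list(model.parameters())))
```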

Future Perspectives and Practical Takeaways

This paper opens several avenues for future research and practical improvements in AI:

  • AI Model Training: Viewing weight decay as a modifier of training dynamics rather than as a static regularizer could inform better hyperparameter tuning and optimization strategies in future research.
  • Wider Applications: The insights into how weight decay prevents divergences in mixed-precision training can be crucial for developing more robust large-scale AI models.
  • Refinement of LLM Training: The bias-variance tradeoff analysis presents opportunities for developing more efficient training protocols for LLMs, potentially informing new adaptive learning algorithms.

In conclusion, this paper reframes the traditional understanding of weight decay within the deep learning community, offering a nuanced perspective that aligns its usage with improved training dynamics and model stability.
