
On the Overlooked Pitfalls of Weight Decay and How to Mitigate Them: A Gradient-Norm Perspective (2011.11152v6)

Published 23 Nov 2020 in cs.LG and cs.AI

Abstract: Weight decay is a simple yet powerful regularization technique that is widely used in training deep neural networks (DNNs). Although weight decay has attracted much attention, previous studies have overlooked a pitfall concerning the large gradient norms it can produce. In this paper, we show that weight decay can lead to large gradient norms in the final phase of training (i.e., at the terminated solution), which often indicates poor convergence and poor generalization. To mitigate these gradient-norm-centered pitfalls, we present the first practical scheduler for weight decay, called Scheduled Weight Decay (SWD), which dynamically adjusts the weight decay strength according to the gradient norm and strongly penalizes large gradient norms during training. Our experiments support that SWD indeed mitigates large gradient norms and often significantly outperforms the conventional constant weight decay strategy for Adaptive Moment Estimation (Adam).
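
Since the abstract does not spell out the exact SWD update rule, the following is a minimal illustrative sketch (in PyTorch) of one way to make the weight decay strength track the gradient norm: the base decay coefficient is scaled by the ratio of the current global gradient norm to its exponential running average, so that steps with large gradient norms receive a stronger decoupled decay. The toy model, the smoothing factor beta, and the scaling rule itself are assumptions for illustration, not the paper's exact schedule.

```python
# Minimal sketch of a gradient-norm-aware weight decay schedule around Adam.
# NOTE: the scaling rule below is an illustrative assumption, not the exact
# SWD schedule from the paper.
import torch
from torch import nn

model = nn.Linear(10, 1)                     # toy model (placeholder)
base_wd = 5e-4                               # base weight decay strength
lr = 1e-3
optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=0.0)

beta = 0.9                                   # smoothing factor (assumed)
running_grad_norm = None                     # running average of gradient norms


def training_step(x, y):
    global running_grad_norm
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()

    # Global gradient norm of the current step.
    grad_norm = torch.norm(
        torch.stack([p.grad.detach().norm() for p in model.parameters()
                     if p.grad is not None])
    ).item()
    running_grad_norm = (grad_norm if running_grad_norm is None
                         else beta * running_grad_norm + (1 - beta) * grad_norm)

    # Scale the weight decay by how large the current gradient norm is
    # relative to its running average, so large gradient norms are decayed harder.
    wd_t = base_wd * grad_norm / (running_grad_norm + 1e-12)

    # Apply decoupled (AdamW-style) weight decay with the scheduled strength,
    # then take the usual Adam step.
    with torch.no_grad():
        for p in model.parameters():
            p.mul_(1.0 - lr * wd_t)
    optimizer.step()
    return loss.item()


# Example usage on random data:
# training_step(torch.randn(32, 10), torch.randn(32, 1))
```

The sketch only conveys the idea of driving the weight decay coefficient with a gradient-norm statistic; the paper applies its scheduler to Adam and compares against a constant weight decay baseline.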

Authors (5)
  1. Zeke Xie (34 papers)
  2. Zhiqiang Xu (88 papers)
  3. Jingzhao Zhang (54 papers)
  4. Issei Sato (82 papers)
  5. Masashi Sugiyama (286 papers)
Citations (16)

Summary

We haven't generated a summary for this paper yet.
