
Neglected Hessian component explains mysteries in Sharpness regularization

Published 19 Jan 2024 in cs.LG (arXiv:2401.10809v2)

Abstract: Recent work has shown that methods such as sharpness-aware minimization (SAM), which explicitly or implicitly penalize second-order information, can improve generalization in deep learning. Seemingly similar methods like weight noise and gradient penalties often fail to provide such benefits. We show that these differences can be explained by the structure of the Hessian of the loss. First, we show that a common decomposition of the Hessian can be quantitatively interpreted as separating feature exploitation from feature exploration. The feature exploration term, which can be described by the Nonlinear Modeling Error (NME) matrix, is commonly neglected in the literature because it vanishes at interpolation. Our work shows that the NME is in fact important: it explains why gradient penalties are sensitive to the choice of activation function. Using this insight, we design interventions to improve performance. We also provide evidence that challenges the long-held equivalence of weight noise and gradient penalties. This equivalence relies on the assumption that the NME can be ignored, which we find does not hold for modern networks, since they involve significant feature learning. We find that regularizing feature exploitation but not feature exploration yields performance similar to gradient penalties.
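The decomposition the abstract refers to is the standard Gauss-Newton split of the loss Hessian: a Gauss-Newton term built from first derivatives of the network outputs ("feature exploitation"), plus an NME term that weights the second derivatives of the outputs by the loss residuals ("feature exploration") and therefore vanishes at interpolation. The sketch below is an illustration only, not the authors' code; the toy MLP, squared loss, and all shapes are assumptions. It verifies the split numerically for a single example in JAX.

```python
import jax
import jax.numpy as jnp

# Illustrative one-hidden-layer MLP with a flat parameter vector.
d_in, d_h, d_out = 3, 4, 2
n_params = d_h * d_in + d_h + d_out * d_h + d_out

def unpack(theta):
    i = 0
    W1 = theta[i:i + d_h * d_in].reshape(d_h, d_in); i += d_h * d_in
    b1 = theta[i:i + d_h]; i += d_h
    W2 = theta[i:i + d_out * d_h].reshape(d_out, d_h); i += d_out * d_h
    b2 = theta[i:i + d_out]
    return W1, b1, W2, b2

def f(theta, x):            # network outputs (the "features")
    W1, b1, W2, b2 = unpack(theta)
    return W2 @ jnp.tanh(W1 @ x + b1) + b2

def ell(z, y):              # per-example squared loss (an assumption)
    return 0.5 * jnp.sum((z - y) ** 2)

def loss(theta, x, y):
    return ell(f(theta, x), y)

theta = jax.random.normal(jax.random.PRNGKey(0), (n_params,)) * 0.3
x = jnp.array([1.0, -0.5, 2.0])
y = jnp.array([0.2, -1.0])

# Full Hessian of the loss in parameter space.
H = jax.hessian(loss)(theta, x, y)

# Gauss-Newton term: J^T (d^2 ell / dz^2) J  -- "feature exploitation".
J = jax.jacobian(f)(theta, x)                 # (d_out, n_params)
H_ell = jax.hessian(ell)(f(theta, x), y)      # (d_out, d_out)
GGN = J.T @ H_ell @ J

# NME term: sum_c (d ell / dz_c) * Hessian of f_c  -- "feature exploration".
# The residual d ell / dz goes to zero at interpolation, so this term vanishes there.
resid = jax.grad(ell)(f(theta, x), y)         # (d_out,)
Hf = jax.hessian(f)(theta, x)                 # (d_out, n_params, n_params)
NME = jnp.einsum('c,cij->ij', resid, Hf)

# The split is exact by the chain rule: H = GGN + NME.
print(jnp.max(jnp.abs(H - (GGN + NME))))      # ~0 up to float error
```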
