
Noise Stability Optimization for Finding Flat Minima: A Hessian-based Regularization Approach (2306.08553v4)

Published 14 Jun 2023 in cs.LG, cs.DS, math.OC, and stat.ML

Abstract: The training of over-parameterized neural networks has received much study in recent literature. An important consideration is the regularization of over-parameterized networks due to their highly nonconvex and nonlinear geometry. In this paper, we study noise injection algorithms, which can regularize the Hessian of the loss, leading to regions with flat loss surfaces. Specifically, by injecting isotropic Gaussian noise into the weight matrices of a neural network, we can obtain an approximately unbiased estimate of the trace of the Hessian. However, naively implementing the noise injection by adding noise to the weight matrices before backpropagation yields only limited empirical improvement. To address this limitation, we design a two-point estimate of the Hessian penalty, which injects noise into the weight matrices along both the positive and negative directions of the random noise. In particular, this two-point estimate eliminates the variance contributed by the first-order Taylor expansion term in the Hessian estimate. We show a PAC-Bayes generalization bound that depends on the trace of the Hessian (and the radius of the weight space), which can be measured from data. We conduct a detailed experimental study to validate our approach and show that it can effectively regularize the Hessian and improve generalization. First, our algorithm outperforms prior sharpness-reduction approaches, delivering up to a 2.4% test accuracy increase for fine-tuning ResNets on six image classification datasets. Moreover, the trace of the Hessian is reduced by 15.8% and the largest eigenvalue by 9.7% with our approach. We also find that Hessian regularization can be combined with weight decay and data augmentation, leading to stronger regularization. Second, our approach remains effective for improving generalization when pretraining multimodal CLIP models and in chain-of-thought fine-tuning.
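
To illustrate the two-point construction described in the abstract, below is a minimal sketch (assuming PyTorch) of a training step that evaluates gradients at w + u and w − u for isotropic Gaussian noise u and averages them. This is a hedged reading of the abstract rather than the authors' released implementation; the names `model`, `loss_fn`, `optimizer`, and `sigma` are placeholders.

```python
# Minimal sketch of a two-point noise-injection step, assuming PyTorch.
# Not the authors' code: `model`, `loss_fn`, `optimizer`, `sigma` are placeholders.
import torch


def two_point_noise_step(model, loss_fn, optimizer, sigma=0.01):
    """One step that averages gradients taken at w + u and w - u."""
    params = [p for p in model.parameters() if p.requires_grad]
    noise = [sigma * torch.randn_like(p) for p in params]
    avg_grads = [torch.zeros_like(p) for p in params]

    for sign in (1.0, -1.0):
        # Shift the weights to w + sign * u.
        with torch.no_grad():
            for p, n in zip(params, noise):
                p.add_(sign * n)
        # Backpropagate the loss at the perturbed weights.
        optimizer.zero_grad()
        loss = loss_fn()
        loss.backward()
        with torch.no_grad():
            for g, p in zip(avg_grads, params):
                if p.grad is not None:
                    g.add_(0.5 * p.grad)
            # Undo the perturbation before the next evaluation.
            for p, n in zip(params, noise):
                p.sub_(sign * n)

    # The averaged gradient is a stochastic estimate of the gradient of
    # E_u[L(w + u)] ~ L(w) + (sigma^2 / 2) * tr(Hessian of L at w);
    # averaging the +u and -u directions cancels the first-order Taylor
    # term grad(L)^T u sample-by-sample, which reduces variance.
    for p, g in zip(params, avg_grads):
        p.grad = g
    optimizer.step()
```

For a minibatch `(x, y)` one would pass something like `loss_fn = lambda: criterion(model(x), y)`; the noise scale `sigma` controls the strength of the implicit trace-of-Hessian penalty in this sketch.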

Authors (3)
  1. Haotian Ju (5 papers)
  2. Dongyue Li (27 papers)
  3. Hongyang R. Zhang (19 papers)
Citations (2)
