Noise Stability Optimization for Finding Flat Minima: A Hessian-based Regularization Approach (2306.08553v4)
Abstract: The training of over-parameterized neural networks has received much attention in the recent literature. An important consideration is how to regularize over-parameterized networks, given their highly nonconvex and nonlinear geometry. In this paper, we study noise injection algorithms, which regularize the Hessian of the loss and steer training toward regions with flat loss surfaces. Specifically, by injecting isotropic Gaussian noise into the weight matrices of a neural network, we can obtain an approximately unbiased estimate of the trace of the Hessian. However, naively implementing noise injection by adding noise to the weight matrices before backpropagation yields only limited empirical improvement. To address this limitation, we design a two-point estimate of the Hessian penalty, which perturbs the weight matrices along both the positive and negative directions of the sampled noise. In particular, this two-point estimate cancels the first-order term of the Taylor expansion, eliminating its contribution to the variance of the Hessian estimate. We prove a PAC-Bayes generalization bound that depends on the trace of the Hessian (and the radius of the weight space), which can be measured from data. We conduct a detailed experimental study to validate our approach and show that it effectively regularizes the Hessian and improves generalization. First, our algorithm outperforms prior approaches to sharpness-reduced training, delivering up to a 2.4% test accuracy increase for fine-tuning ResNets on six image classification datasets. Moreover, with our approach, the trace of the Hessian is reduced by 15.8% and the largest eigenvalue by 9.7%. We also find that Hessian regularization can be combined with weight decay and data augmentation, leading to stronger regularization. Second, our approach remains effective for improving generalization when pretraining multimodal CLIP models and for chain-of-thought fine-tuning.
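To make the two-point estimate concrete: by a standard second-order Taylor expansion with isotropic noise $U \sim \mathcal{N}(0, I)$ and scale $\sigma$,

$$\tfrac{1}{2}\big[L(W+\sigma U)+L(W-\sigma U)\big] \approx L(W) + \tfrac{\sigma^2}{2}\, U^\top \nabla^2 L(W)\, U, \qquad \mathbb{E}\big[U^\top \nabla^2 L(W)\, U\big] = \operatorname{tr}\big(\nabla^2 L(W)\big),$$

so averaging the losses at $W+\sigma U$ and $W-\sigma U$ cancels the first-order term and adds roughly $(\sigma^2/2)\,\operatorname{tr}(\nabla^2 L(W))$ as a penalty. The sketch below illustrates one training step under these assumptions; it is not the paper's released implementation, and names such as `two_point_sgd_step`, `loss_fn(model, batch)`, `sigma`, and `lr` are illustrative placeholders.

```python
# Illustrative sketch only: `model`, `loss_fn(model, batch)`, `sigma`, and `lr`
# are assumed placeholders, not the authors' released code.
import torch


def two_point_sgd_step(model, loss_fn, batch, sigma=0.01, lr=0.1):
    """One step on the two-point perturbed loss
    0.5 * [L(W + sigma*U) + L(W - sigma*U)] with U ~ N(0, I).
    The +/- perturbations cancel the first-order Taylor term, leaving an
    approximate penalty of (sigma^2 / 2) * trace(Hessian) on top of L(W)."""
    params = [p for p in model.parameters() if p.requires_grad]
    noise = [torch.randn_like(p) for p in params]      # isotropic Gaussian U
    grads = [torch.zeros_like(p) for p in params]      # accumulator for the averaged gradient

    for sign in (+1.0, -1.0):                          # evaluate at W + sigma*U, then W - sigma*U
        with torch.no_grad():
            for p, u in zip(params, noise):
                p.add_(sign * sigma * u)               # perturb the weights
        model.zero_grad()
        loss_fn(model, batch).backward()               # gradient at the perturbed weights
        with torch.no_grad():
            for g, p, u in zip(grads, params, noise):
                g.add_(0.5 * p.grad)                   # average the two perturbed gradients
                p.sub_(sign * sigma * u)               # restore the original weights

    with torch.no_grad():
        for p, g in zip(params, grads):
            p.sub_(lr * g)                             # plain gradient step on the averaged gradient
```

In practice the averaged two-point gradient would be handed to whatever optimizer is in use (e.g., SGD with momentum or AdamW); the plain gradient step above is only meant to show where that gradient enters the update.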
Authors: Haotian Ju, Dongyue Li, Hongyang R. Zhang