How to escape sharp minima with random perturbations (2305.15659v3)
Abstract: Modern machine learning applications have witnessed the remarkable success of optimization algorithms that are designed to find flat minima. Motivated by this design choice, we undertake a formal study that (i) formulates the notion of flat minima, and (ii) studies the complexity of finding them. Specifically, we adopt the trace of the Hessian of the cost function as a measure of flatness, and use it to formally define the notion of approximate flat minima. Under this notion, we then analyze algorithms that find approximate flat minima efficiently. For general cost functions, we discuss a gradient-based algorithm that finds an approximate flat local minimum efficiently. The main component of the algorithm is to use gradients computed from randomly perturbed iterates to estimate a direction that leads to flatter minima. For the setting where the cost function is an empirical risk over training data, we present a faster algorithm inspired by the recently proposed practical method of sharpness-aware minimization, lending theoretical support to its success in practice.
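The main component of the first algorithm can be illustrated with a short sketch. By a Taylor expansion, the average of gradients at Gaussian-perturbed iterates satisfies E[∇f(x + σu)] ≈ ∇f(x) + (σ²/2) ∇ tr(∇²f(x)) for u ~ N(0, I), so stepping along this average descends the flatness-regularized objective f(x) + (σ²/2) tr ∇²f(x). The code below is a minimal illustration of this idea, not the paper's exact procedure; the function names and hyperparameters are ours.

```python
import numpy as np

def perturbed_gradient_step(grad_f, x, lr=1e-2, sigma=1e-1, num_samples=16, rng=None):
    """One descent step using gradients at randomly perturbed iterates.

    Averaging grad_f over Gaussian perturbations of scale sigma estimates
    grad f(x) + (sigma**2 / 2) * grad tr(Hessian f(x)), so the step also
    descends a trace-of-Hessian (flatness) penalty. Hypothetical sketch;
    names and hyperparameters are not taken from the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    avg_grad = np.zeros_like(x)
    for _ in range(num_samples):
        u = rng.standard_normal(x.shape)   # isotropic Gaussian perturbation
        avg_grad += grad_f(x + sigma * u)  # gradient at the perturbed iterate
    return x - lr * avg_grad / num_samples
```

For the empirical-risk setting, the faster algorithm described in the abstract is inspired by sharpness-aware minimization (SAM; Foret et al., ICLR 2021). A minimal SAM-style update, again only a sketch of the general recipe rather than the paper's algorithm:

```python
import numpy as np

def sam_step(grad_f, x, lr=1e-2, rho=5e-2):
    """One SAM-style step: evaluate the gradient at a gradient-ascent
    perturbed point within an l2 ball of radius rho, then apply that
    gradient at the current iterate. Illustrative only."""
    g = grad_f(x)
    ascent = rho * g / (np.linalg.norm(g) + 1e-12)  # normalized ascent direction
    return x - lr * grad_f(x + ascent)
```

In practice, grad_f would be a mini-batch stochastic gradient of the training loss; the perturbation radius rho plays a role analogous to σ above in biasing the iterates toward flatter regions.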