A Precise Characterization of SGD Stability Using Loss Surface Geometry (2401.12332v1)
Abstract: Stochastic Gradient Descent (SGD) is a cornerstone optimization algorithm with well-documented empirical success but comparatively limited theoretical understanding. Recent research has identified a key factor behind its practical efficacy: the implicit regularization it induces. Several studies have investigated the linear stability of SGD in the vicinity of a stationary point as a predictive proxy for sharpness and generalization error in overparameterized neural networks (Wu et al., 2022; Jastrzebski et al., 2019; Cohen et al., 2021). In this paper, we examine the relationship between linear stability and sharpness more closely. Specifically, we delineate necessary and sufficient conditions for linear stability in terms of the hyperparameters of SGD and the sharpness at the optimum. To this end, we introduce a novel coherence measure of the loss Hessian that captures the geometric properties of the loss function relevant to the linear stability of SGD, and that yields a simplified sufficient condition for identifying linear instability at an optimum. Notably, compared to previous works, our analysis relies on significantly milder assumptions and applies to a broader class of loss functions than previously known, encompassing not only mean-squared error but also cross-entropy loss.
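To make the object of study concrete: near an optimum θ*, SGD is typically linearized as δ_{t+1} = (I − η H_{B_t}) δ_t, where δ_t is the deviation from θ*, η is the step size, and H_{B_t} is the Hessian averaged over the sampled mini-batch; linear stability asks whether these iterates remain bounded. The NumPy sketch below is a minimal illustration under assumed conditions: a toy quadratic problem with rank-one per-sample Hessians and an empirical growth-rate diagnostic, not the paper's coherence measure or its exact conditions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy problem (not the paper's setup): n per-sample quadratic losses
# l_i(theta) = 0.5 * (a_i . theta)^2, so each per-sample Hessian is the rank-one
# matrix a_i a_i^T, the full-batch Hessian is their mean, and theta* = 0 is a
# shared minimum of every sample.
n, d = 64, 10
A = rng.normal(size=(n, d))
H_i = np.stack([np.outer(a, a) for a in A])   # per-sample Hessians
H = H_i.mean(axis=0)                          # full-batch Hessian
sharpness = np.linalg.eigvalsh(H)[-1]         # largest eigenvalue of H

def avg_log_growth(eta, batch_size, steps=3000):
    """Simulate linearized SGD, delta_{t+1} = (I - eta * H_B) delta_t, and return
    the average per-step change of log ||delta_t||. A negative value means the
    deviation from theta* contracts on average (an empirical proxy for linear
    stability); a positive value signals linear instability."""
    delta = rng.normal(size=d)
    delta /= np.linalg.norm(delta)
    total = 0.0
    for _ in range(steps):
        batch = rng.choice(n, size=batch_size, replace=False)
        H_B = H_i[batch].mean(axis=0)         # Hessian of the sampled mini-batch
        delta = delta - eta * (H_B @ delta)
        norm = np.linalg.norm(delta)
        total += np.log(norm)
        delta /= norm                         # renormalize so the run never overflows
    return total / steps

# Full-batch GD on this quadratic is stable iff eta * sharpness <= 2; mini-batch
# noise can destabilize step sizes that the full batch tolerates.
for scale in (0.5, 1.9, 2.5):
    eta = scale / sharpness
    print(f"eta*sharpness={scale:.1f}  full batch: {avg_log_growth(eta, n):+.3f}  "
          f"batch of 4: {avg_log_growth(eta, 4):+.3f}")
```

At step sizes just below the full-batch threshold η = 2/λ_max(H), the full-batch recursion remains stable, whereas small mini-batches can already diverge; quantifying exactly when this happens, as a function of the SGD hyperparameters and the Hessian geometry, is what the paper's necessary and sufficient conditions address.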
- A. Agarwala and Y. Dauphin. SAM operates far from home: eigenvalue regularization as a dynamical phenomenon. In International Conference on Machine Learning, 2023.
- Z. Allen-Zhu, Y. Li, and Z. Song. A convergence theory for deep learning via over-parameterization. In International Conference on Machine Learning, pp. 242–252. PMLR, 2019.
- A modern look at the relationship between sharpness and generalization. arXiv preprint arXiv:2302.07011, 2023.
- P. L. Bartlett, P. M. Long, and O. Bousquet. The dynamics of sharpness-aware minimization: Bouncing across ravines and drifting towards wide minima, 2022. URL https://arxiv.org/abs/2210.01513.
- mSAM: Micro-batch-averaged sharpness-aware minimization. arXiv preprint arXiv:2302.09693, 2023.
- R. Bhatia. Matrix analysis, volume 169. Springer Science & Business Media, 2013.
- L. Bottou. Stochastic gradient learning in neural networks. In Proceedings of Neuro-Nîmes 91, Nîmes, France, 1991. EC2. URL http://leon.bottou.org/papers/bottou-91c.
- Entropy-SGD: Biasing gradient descent into wide valleys. In International Conference on Learning Representations, 2017. URL https://openreview.net/forum?id=B1YfAfcgl.
- J. Cohen, S. Kaur, Y. Li, J. Z. Kolter, and A. Talwalkar. Gradient descent on neural networks typically occurs at the edge of stability. arXiv preprint arXiv:2103.00065, 2021.
- Adaptive gradient methods at the edge of stability, 2022. URL https://arxiv.org/abs/2207.14484.
- Y. Cooper. Global minima of overparameterized neural networks. SIAM Journal on Mathematics of Data Science, 3(2):676–691, 2021.
- P. Foret, A. Kleiner, H. Mobahi, and B. Neyshabur. Sharpness-aware minimization for efficiently improving generalization. In International Conference on Learning Representations, 2020.
- S. Hochreiter and J. Schmidhuber. Flat Minima. Neural Computation, 9(1):1–42, 01 1997. ISSN 0899-7667.
- Matrix concentration for products. Foundations of Computational Mathematics, 22(6):1767–1799, 2022.
- P. Izmailov, D. Podoprikhin, T. Garipov, D. Vetrov, and A. G. Wilson. Averaging weights leads to wider optima and better generalization. In 34th Conference on Uncertainty in Artificial Intelligence 2018, UAI 2018, pp. 876–885, 2018.
- The break-even point on optimization trajectories of deep neural networks. In International Conference on Learning Representations, 2019.
- Y. Jiang, B. Neyshabur, H. Mobahi, D. Krishnan, and S. Bengio. Fantastic generalization measures and where to find them, 2019.
- N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang. On large-batch training for deep learning: Generalization gap and sharp minima. In International Conference on Learning Representations, 2017.
- Fisher SAM: Information geometry and sharpness aware minimisation, 2022. URL https://arxiv.org/abs/2206.04920.
- ASAM: Adaptive sharpness-aware minimization for scale-invariant learning of deep neural networks. In Proc. of ICML, volume 139, pp. 5905–5914, 2021.
- Q. Li, C. Tai, and W. E. Stochastic modified equations and adaptive stochastic gradient algorithms. In Doina Precup and Yee Whye Teh (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 2101–2110. PMLR, 06–11 Aug 2017.
- Bad global minima exist and SGD can reach them. Advances in Neural Information Processing Systems, 33:8543–8552, 2020.
- Towards efficient and scalable sharpness-aware minimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12360–12370, 2022.
- C. Ma and L. Ying. On linear stability of SGD and input-smoothness of neural networks. Advances in Neural Information Processing Systems, 34:16805–16817, 2021.
- B. Neyshabur, R. Tomioka, and N. Srebro. In search of the real inductive bias: On the role of implicit regularization in deep learning, 2015.
- C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pp. 2818–2826. IEEE Computer Society, 2016.
- Rethinking sharpness-aware minimization as variational inference, 2022. URL https://arxiv.org/abs/2210.10452.
- How does sharpness-aware minimization minimize sharpness?, 2022. URL https://arxiv.org/abs/2211.05729.
- L. Wu and W. Su. The implicit regularization of dynamical stability in stochastic gradient descent. In International Conference on Machine Learning, 2023.
- L. Wu, Z. Zhu, and W. E. Towards understanding generalization of deep learning: Perspective of loss landscapes. arXiv preprint arXiv:1706.10239, 2017.
- L. Wu, C. Ma, and W. E. How SGD selects the global minima in over-parameterized learning: A dynamical stability perspective. Advances in Neural Information Processing Systems, 31, 2018.
- L. Wu, M. Wang, and W. J. Su. The alignment property of SGD noise and how it helps select flat minima: A stability analysis. Advances in Neural Information Processing Systems, 35:4680–4693, 2022.
- Z. Xie, I. Sato, and M. Sugiyama. A diffusion theory for deep learning dynamics: Stochastic gradient descent exponentially favors flat minima. In International Conference on Learning Representations, 2021.
- On the power-law Hessian spectrums in deep learning. arXiv preprint arXiv:2201.13011, 2022.
- C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations, 2017.
- Surrogate gap minimization improves sharpness-aware training. arXiv preprint arXiv:2203.08065, 2022.
- The probabilistic stability of stochastic gradient descent. arXiv preprint arXiv:2303.13093, 2023.