No Wrong Turns: The Simple Geometry Of Neural Networks Optimization Paths (2306.11922v1)
Abstract: Understanding the optimization dynamics of neural networks is necessary for closing the gap between theory and practice. Stochastic first-order optimization algorithms are known to efficiently locate favorable minima in deep neural networks. This efficiency, however, contrasts with the non-convex and seemingly complex structure of neural loss landscapes. In this study, we delve into the fundamental geometric properties of sampled gradients along optimization paths. We focus on two key quantities that appear in the restricted secant inequality and the error bound, both of which are of high significance for first-order optimization. Our analysis reveals that these quantities exhibit predictable, consistent behavior throughout training, despite the stochasticity induced by sampling minibatches. Our findings suggest that optimization trajectories not only never encounter significant obstacles but also maintain stable dynamics during the majority of training. These observed properties are sufficiently expressive to theoretically guarantee linear convergence and to prescribe learning rate schedules that mirror empirical practice. We conduct our experiments on image classification, semantic segmentation, and language modeling across different batch sizes, network architectures, datasets, optimizers, and initialization seeds, and we discuss the impact of each factor. Our work provides novel insights into the properties of neural network loss functions, and opens the door to theoretical frameworks more relevant to prevalent practice.
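For concreteness, the two conditions referred to in the abstract can be stated in their standard form from the optimization literature. The sketch below assumes a differentiable objective f and a fixed reference point x*; taking x* to be the final iterate of training is an illustrative assumption here, not necessarily the exact measurement protocol of the paper.

```latex
% Lower restricted secant inequality (RSI_mu) and upper error bound (EB_L),
% stated for a differentiable objective f and a reference minimizer x*
% (here assumed, for illustration, to be the final iterate of training):
\begin{align}
  \langle \nabla f(x),\, x - x^\ast \rangle &\;\ge\; \mu\,\|x - x^\ast\|^{2}
    && \text{(lower RSI)} \\
  \|\nabla f(x)\| &\;\le\; L\,\|x - x^\ast\|
    && \text{(upper EB)}
\end{align}
```

Along a sampled trajectory, the corresponding per-step quantities are the ratios ⟨g_t, x_t − x*⟩ / ‖x_t − x*‖² and ‖g_t‖ / ‖x_t − x*‖, where g_t is the minibatch gradient at iterate x_t. A minimal PyTorch-style sketch of how such ratios could be computed from flattened parameters is given below; the helper name and the choice of reference point are hypothetical.

```python
import torch

def rsi_eb_ratios(params: torch.Tensor, ref_params: torch.Tensor, grads: torch.Tensor):
    """Hypothetical helper: given the flattened current parameters x_t, a flattened
    reference point x* (e.g. the final training checkpoint), and a flattened sampled
    gradient g_t, return the RSI-style and EB-style ratios for this step."""
    delta = params - ref_params              # x_t - x*
    dist_sq = delta.dot(delta)               # ||x_t - x*||^2
    rsi_ratio = grads.dot(delta) / dist_sq   # <g_t, x_t - x*> / ||x_t - x*||^2
    eb_ratio = grads.norm() / dist_sq.sqrt() # ||g_t|| / ||x_t - x*||
    return rsi_ratio.item(), eb_ratio.item()

# Usage sketch (assumes `model` is a trained torch.nn.Module with populated .grad):
# x_t    = torch.cat([p.detach().reshape(-1) for p in model.parameters()])
# g_t    = torch.cat([p.grad.reshape(-1) for p in model.parameters()])
# x_star = the same flattening applied to the final checkpoint's parameters
# rsi, eb = rsi_eb_ratios(x_t, x_star, g_t)
```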