Random Scaling and Momentum for Non-smooth Non-convex Optimization (2405.09742v1)
Abstract: Training neural networks requires optimizing a loss function that may be highly irregular, and in particular neither convex nor smooth. Popular training algorithms are based on stochastic gradient descent with momentum (SGDM), for which classical analysis applies only if the loss is either convex or smooth. We show that a very small modification to SGDM closes this gap: simply scale the update at each time point by an exponentially distributed random scalar. The resulting algorithm achieves optimal convergence guarantees. Intriguingly, this result is not derived by a specific analysis of SGDM: instead, it falls naturally out of a more general framework for converting online convex optimization algorithms to non-convex optimization algorithms.
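As a rough illustration (not the paper's own pseudocode), the sketch below applies the modification described in the abstract to a plain SGDM loop: each momentum-based update is multiplied by an Exp(1) random scalar (mean 1) before the step is taken. The names `sgdm_random_scaling` and `grad_fn`, and the default step size and momentum coefficient, are placeholder choices of ours; the paper's exact momentum/averaging scheme and tuning may differ.

```python
import numpy as np

def sgdm_random_scaling(grad_fn, x0, lr=0.01, beta=0.9, steps=1000, seed=0):
    """Sketch of SGD with momentum where every update is scaled by an
    exponentially distributed random scalar, as described in the abstract.
    `grad_fn(x)` should return a stochastic (sub)gradient estimate at x."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    m = np.zeros_like(x)
    for _ in range(steps):
        g = grad_fn(x)                    # stochastic (sub)gradient
        m = beta * m + (1 - beta) * g     # momentum: EMA of gradients
        s = rng.exponential(scale=1.0)    # random scale s ~ Exp(1), E[s] = 1
        x = x - lr * s * m                # randomly scaled SGDM step
    return x

# Toy usage on a non-smooth, non-convex objective f(x, y) = |x| * (1 + 0.5*sin(y)):
f_grad = lambda z: np.array([np.sign(z[0]) * (1 + 0.5 * np.sin(z[1])),
                             0.5 * np.abs(z[0]) * np.cos(z[1])])
x_final = sgdm_random_scaling(f_grad, x0=[2.0, -1.0], steps=2000)
```

Because the random scale has mean 1, the update equals the usual SGDM step in expectation; the randomization is what allows the analysis to cover losses that are neither convex nor smooth.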