SDEs for Minimax Optimization (2402.12508v1)
Abstract: Minimax optimization problems have attracted significant attention in recent years, with applications ranging from economics to machine learning. While advanced optimization methods exist for such problems, characterizing their dynamics in stochastic scenarios remains notably challenging. In this paper, we pioneer the use of stochastic differential equations (SDEs) to analyze and compare minimax optimizers. Our SDE models for Stochastic Gradient Descent-Ascent, Stochastic Extragradient, and Stochastic Hamiltonian Gradient Descent are provable approximations of their algorithmic counterparts, clearly showcasing the interplay between hyperparameters, implicit regularization, and implicit curvature-induced noise. This perspective also allows for a unified and simplified analysis strategy based on the principles of Itô calculus. Finally, our approach facilitates the derivation of convergence conditions and closed-form solutions for the dynamics in simplified settings, unveiling further insights into the behavior of different optimizers.
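To make the abstract's central object concrete, the following is a minimal sketch of the kind of SDE model involved, assuming the standard first-order weak-approximation form familiar from SDE analyses of SGD; the paper's exact higher-order drift corrections, which encode the implicit regularization mentioned above, are not reproduced here. For the problem $\min_x \max_y f(x, y)$ run with step size $\eta$ and minibatch-gradient noise covariance $\Sigma$, Stochastic Gradient Descent-Ascent would be modeled by

$$
dZ_t = -G(Z_t)\,dt + \sqrt{\eta}\,\Sigma(Z_t)^{1/2}\,dW_t,
\qquad
G(z) = \begin{pmatrix} \nabla_x f(x, y) \\ -\nabla_y f(x, y) \end{pmatrix},
$$

where $Z_t = (X_t, Y_t)$ stacks the two players' iterates and $W_t$ is a standard Brownian motion. The $\sqrt{\eta}$ scaling of the diffusion term is what couples the step size to the curvature-dependent noise effects the abstract refers to, and it vanishes in the deterministic ODE limit $\eta \to 0$.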