Accelerating optimization over the space of probability measures (2310.04006v4)

Published 6 Oct 2023 in math.OC and cs.LG

Abstract: The acceleration of gradient-based optimization methods is a subject of significant practical and theoretical importance, particularly within machine learning applications. While much attention has been directed towards optimizing within Euclidean space, the need to optimize over spaces of probability measures in machine learning motivates exploration of accelerated gradient methods in this context too. To this end, we introduce a Hamiltonian-flow approach analogous to momentum-based approaches in Euclidean space. We demonstrate that, in the continuous-time setting, algorithms based on this approach can achieve convergence rates of arbitrarily high order. We complement our findings with numerical examples.
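
To make the momentum analogy in the abstract concrete, here is a minimal, illustrative sketch, not the paper's algorithm: a damped Hamiltonian ("heavy-ball") flow applied to a cloud of particles representing an empirical probability measure, integrated with a semi-implicit Euler step. The quadratic potential V, the vanishing friction schedule gamma(t) = 3/t (borrowed from Nesterov's ODE), the step size, and the particle count are all placeholder assumptions.

```python
# Hedged sketch: momentum (damped Hamiltonian) dynamics on an empirical measure.
# Each particle carries a position x and a momentum p and follows
#   dx/dt = p,   dp/dt = -gamma(t) * p - grad V(x),
# which is the Euclidean heavy-ball / Nesterov ODE applied per particle.
# This is NOT the paper's specific scheme; V, gamma, dt are illustrative choices.
import numpy as np

def grad_V(x):
    # Gradient of the placeholder quadratic potential V(x) = 0.5 * ||x||^2.
    return x

def accelerated_particle_flow(x0, n_steps=2000, dt=1e-2):
    """Integrate the damped Hamiltonian flow for every particle in x0."""
    x = x0.copy()
    p = np.zeros_like(x)          # particles start at rest
    for k in range(1, n_steps + 1):
        t = k * dt
        gamma = 3.0 / t           # vanishing-friction schedule, as in Nesterov's ODE
        p += dt * (-gamma * p - grad_V(x))   # semi-implicit Euler: update momentum first
        x += dt * p                           # then move positions with the new momentum
    return x

# Particles drawn from a broad Gaussian concentrate near the minimizer of V.
particles = accelerated_particle_flow(np.random.randn(500, 2) * 5.0)
print("mean squared distance to minimizer:", np.mean(np.sum(particles**2, axis=1)))
```

The semi-implicit update (momentum first, then position) is the usual discretization choice for momentum flows; for this linear-in-the-measure potential the dynamics reduce exactly to the Euclidean heavy-ball ODE applied to each sample, which is only meant to illustrate the analogy the abstract draws.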
