Stochastic Modified Flows for Riemannian Stochastic Gradient Descent (2402.03467v1)

Published 2 Feb 2024 in cs.LG, math.OC, math.PR, and stat.ML

Abstract: We give quantitative estimates for the rate of convergence of Riemannian stochastic gradient descent (RSGD) to the Riemannian gradient flow and to a diffusion process, the so-called Riemannian stochastic modified flow (RSMF). Using tools from stochastic differential geometry, we show that, in the small learning rate regime, RSGD can be approximated by the solution to the RSMF driven by an infinite-dimensional Wiener process. The RSMF accounts for the random fluctuations of RSGD and, thereby, increases the order of approximation compared to the deterministic Riemannian gradient flow. RSGD is built using the concept of a retraction map, that is, a cost-efficient approximation of the exponential map, and we prove quantitative bounds for the weak error of the diffusion approximation under assumptions on the retraction map, the geometry of the manifold, and the random estimators of the gradient.
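
To make the setting concrete, below is a minimal sketch of an RSGD step built from a retraction, assuming the unit sphere as the manifold and renormalization as the retraction map; the function names and the quadratic test objective are illustrative assumptions, not taken from the paper.

    # Minimal RSGD sketch on the unit sphere S^2, assuming renormalization as the
    # retraction (a projection-like, cost-efficient stand-in for the exponential map).
    # Names and the quadratic test objective are illustrative, not from the paper.
    import numpy as np

    def tangent_projection(x, g):
        # Project an ambient-space gradient g onto the tangent space T_x of the sphere.
        return g - np.dot(g, x) * x

    def retraction(x, v):
        # Map the tangent step x + v back onto the sphere by renormalization.
        y = x + v
        return y / np.linalg.norm(y)

    def rsgd_step(x, grad_estimate, learning_rate):
        # One RSGD update: Riemannian (projected) stochastic gradient, then retraction.
        g = tangent_projection(x, grad_estimate(x))
        return retraction(x, -learning_rate * g)

    # Toy problem: minimize f(x) = x^T A x on the sphere using noisy gradient estimates.
    rng = np.random.default_rng(0)
    A = np.diag([3.0, 1.0, 0.5])
    noisy_grad = lambda x: 2.0 * (A @ x) + 0.1 * rng.standard_normal(3)

    x = np.ones(3) / np.sqrt(3.0)
    for _ in range(2000):
        x = rsgd_step(x, noisy_grad, learning_rate=1e-2)
    # x now fluctuates around the eigenvector of A with the smallest eigenvalue;
    # these fluctuations are what the RSMF models beyond the gradient flow.

In this picture, the deterministic Riemannian gradient flow describes the leading-order trajectory of the iterates, while the RSMF additionally tracks the learning-rate-dependent noise around it, which is what yields the higher-order weak approximation described in the abstract.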

