Continuous-time Riemannian SGD and SVRG Flows on Wasserstein Probabilistic Space (2401.13530v3)
Abstract: Recently, optimization on Riemannian manifolds has provided new insights to the optimization community. In this regard, the manifold given by the space of probability measures equipped with the second-order Wasserstein distance is of particular interest, since optimization on it can be linked to practical sampling processes. In general, the standard (continuous) optimization method on Wasserstein space is the Riemannian gradient flow (i.e., Langevin dynamics when minimizing the KL divergence). In this paper, we aim to enrich the continuous optimization methods on Wasserstein space by extending its gradient flow to a stochastic gradient descent (SGD) flow and a stochastic variance reduced gradient (SVRG) flow. The Euclidean analogues of these two flows are standard continuous-time stochastic methods, but their Riemannian counterparts remain unexplored. By leveraging the structure of Wasserstein space, we construct stochastic differential equations (SDEs) that approximate the discrete dynamics of the corresponding Euclidean stochastic methods. The resulting flows of probability measures are then obtained via the Fokker-Planck equation. Finally, we prove convergence rates for our Riemannian stochastic flows that match the corresponding results in Euclidean space.
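For orientation, the correspondence the abstract alludes to can be written out explicitly. The display below is the classical deterministic case (the Wasserstein gradient flow of the KL divergence, with target measure π ∝ e^{-f}), not the paper's new stochastic flows:

```latex
% Classical correspondence (not the paper's new SGD/SVRG flows):
% the Wasserstein gradient flow of KL(. || pi), pi \propto e^{-f},
% is the Fokker--Planck equation of the Langevin diffusion.
\begin{aligned}
  \mathrm{d}X_t &= -\nabla f(X_t)\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}B_t
  && \text{(Langevin dynamics)}\\
  \partial_t \rho_t &= \nabla\!\cdot\!\bigl(\rho_t \nabla f\bigr) + \Delta \rho_t
  = \nabla\!\cdot\!\Bigl(\rho_t\,\nabla \tfrac{\delta}{\delta\rho}\,\mathrm{KL}(\rho_t \,\|\, \pi)\Bigr)
  && \text{(Fokker--Planck / Wasserstein gradient flow)}
\end{aligned}
```

The discrete Euclidean dynamics that such SDE constructions are meant to approximate are the familiar stochastic gradient Langevin dynamics (SGLD) and its SVRG variant. The sketch below is a minimal illustration under the assumption of a finite-sum potential f(x) = (1/n) Σ_i f_i(x) with toy quadratic components; the components A[i], step size, and epoch length are illustrative choices and are not taken from the paper.

```python
# Hedged sketch: discrete Euclidean analogues (SGLD and SVRG-Langevin dynamics)
# whose continuous-time, Wasserstein-space counterparts the paper studies.
# Target pi(x) ~ exp(-f(x)) with f(x) = (1/n) * sum_i 0.5 * ||x - A[i]||^2 (toy choice).
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 2                      # number of component potentials, dimension
A = rng.normal(size=(n, d))        # centers of the toy quadratic components

def grad_fi(x, i):
    """Gradient of the i-th component potential f_i(x) = 0.5 * ||x - A[i]||^2."""
    return x - A[i]

def full_grad(x):
    """Full-batch gradient (1/n) * sum_i grad f_i(x)."""
    return x - A.mean(axis=0)

def sgld(x0, step, n_iter):
    """SGLD update: x <- x - step * grad f_i(x) + sqrt(2*step) * N(0, I)."""
    x = x0.copy()
    for _ in range(n_iter):
        i = rng.integers(n)
        x = x - step * grad_fi(x, i) + np.sqrt(2 * step) * rng.normal(size=d)
    return x

def svrg_ld(x0, step, n_epochs, epoch_len):
    """SVRG-LD: recenter the stochastic gradient around a snapshot's full gradient."""
    x = x0.copy()
    for _ in range(n_epochs):
        x_tilde = x.copy()
        g_tilde = full_grad(x_tilde)          # anchor gradient at the snapshot
        for _ in range(epoch_len):
            i = rng.integers(n)
            g = grad_fi(x, i) - grad_fi(x_tilde, i) + g_tilde
            x = x - step * g + np.sqrt(2 * step) * rng.normal(size=d)
    return x

print("SGLD sample:   ", sgld(np.zeros(d), 1e-2, 5000))
print("SVRG-LD sample:", svrg_ld(np.zeros(d), 1e-2, 50, 100))
```

The only difference between the two updates is the gradient estimate: SGLD uses a single ∇f_i, while the SVRG variant recenters it around a full gradient computed at a periodically refreshed snapshot. This variance-reduction mechanism is the discrete ingredient whose Riemannian, continuous-time analogue the paper constructs on Wasserstein space.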