Streamlining in the Riemannian Realm: Efficient Riemannian Optimization with Loopless Variance Reduction (2403.06677v1)
Abstract: In this study, we investigate stochastic optimization on Riemannian manifolds, focusing on the crucial variance reduction mechanism used in both Euclidean and Riemannian settings. Riemannian variance-reduced methods usually involve a double-loop structure, computing a full gradient at the start of each loop. Determining the optimal inner-loop length is challenging in practice, as it depends on strong convexity or smoothness constants that are often unknown or hard to estimate. Motivated by Euclidean loopless methods, we introduce the Riemannian Loopless SVRG (R-LSVRG) and Riemannian PAGE (R-PAGE) methods. These methods replace the outer loop with a probabilistic full-gradient computation triggered by a coin flip at each iteration, which yields simpler proofs, efficient hyperparameter selection, and sharp convergence guarantees. Using R-PAGE as a framework for non-convex Riemannian optimization, we demonstrate its applicability to several important settings. For example, we derive Riemannian MARINA (R-MARINA) for distributed settings with communication compression, providing the best known theoretical communication complexity guarantees for non-convex distributed optimization over Riemannian manifolds. Experimental results support our theoretical findings. A sketch of the coin-flip estimator appears below.
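To make the coin-flip mechanism concrete, here is a minimal NumPy sketch of a PAGE-style loopless estimator on the unit sphere. The toy PCA-type objective, the projection-based vector transport, the normalization retraction, and all function names are illustrative assumptions, not the paper's exact algorithm or experiments: at each step the full Riemannian gradient is recomputed with probability `p`, and otherwise the previous estimate is transported to the new tangent space and corrected with a cheap minibatch difference.

```python
# Minimal sketch of a loopless (R-PAGE-style) estimator on the unit sphere.
# Toy objective, step size, and batch size are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 20
A = rng.standard_normal((n, d))           # data rows a_i

def egrad(x, idx):                        # Euclidean gradient of f_i(x) = -(a_i^T x)^2
    Ai = A[idx]
    return -2.0 * Ai.T @ (Ai @ x) / len(idx)

def rgrad(x, idx):                        # Riemannian gradient: project onto T_x(sphere)
    g = egrad(x, idx)
    return g - (x @ g) * x

def transport(y, v):                      # vector transport to T_y by tangent projection
    return v - (y @ v) * y

def retract(x, v):                        # retraction: step, then renormalize to the sphere
    z = x + v
    return z / np.linalg.norm(z)

def r_page(steps=500, eta=0.1, p=0.1, batch=8):
    x = rng.standard_normal(d)
    x /= np.linalg.norm(x)
    g = rgrad(x, np.arange(n))            # initialize with a full Riemannian gradient
    for _ in range(steps):
        x_new = retract(x, -eta * g)
        if rng.random() < p:              # coin flip: refresh with the full gradient
            g = rgrad(x_new, np.arange(n))
        else:                             # otherwise: transported minibatch correction
            idx = rng.integers(0, n, batch)
            g = (transport(x_new, g)
                 + rgrad(x_new, idx)
                 - transport(x_new, rgrad(x, idx)))
        x = x_new
    return x

x_star = r_page()
```

Setting `p = 1` recovers Riemannian gradient descent, while small `p` amortizes the full-gradient cost over many cheap minibatch steps; the single probability `p` replaces the inner-loop length that double-loop methods must tune.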