Stochastic Gradient Descent on Riemannian Manifolds

Updated 1 June 2026

Stochastic Gradient Descent on Riemannian Manifolds is an extension of classical gradient methods that operates on curved spaces using tangent space projections and retraction mappings.
Each update projects a stochastic gradient onto the tangent space at the current iterate and maps the resulting step back onto the manifold, with convergence guaranteed under conditions on the manifold, the retraction, and the step sizes.
Practical applications span matrix factorization, reinforcement learning, and adversarial deep network training, highlighting its versatility in complex optimization tasks.

Stochastic Gradient Descent on Riemannian Manifolds is an extension of classical stochastic gradient techniques to the setting where the optimization variable is constrained to lie on a non-Euclidean, smooth manifold endowed with a Riemannian metric. Such problems arise in diverse applications, including matrix factorization with orthogonality constraints, optimization over symmetric positive-definite matrices, policy evaluation in reinforcement learning, and adversarial robustness in deep networks. The core challenge is to generalize stochastic approximation and descent methods in a way that fully leverages and respects manifold geometry, nonlinearity, and curvature.

1. Algorithmic Foundations

The canonical stochastic gradient descent (SGD) iteration on a Riemannian manifold $(\mathcal{M},g)$ is defined for the objective

$f(x) = \mathbb{E}_{\xi\sim\mathcal{D}}[Q(x,\xi)],$

where $x \in \mathcal{M}$ and $Q(x,\xi)$ is the sample loss. At each step $t$, given a stochastic estimate $H(x_t,\xi_t)\in T_{x_t}\mathcal{M}$ of the Riemannian gradient and a step size $\gamma_t > 0$, the iterate is updated by moving along a geodesic or, more commonly in practice, via a retraction $R_{x_t}$: $x_{t+1} = R_{x_t}\left(-\gamma_t H(x_t, \xi_t)\right).$ The retraction $R_{x_t}:T_{x_t}\mathcal{M}\to\mathcal{M}$ is a smooth mapping satisfying $R_{x_t}(0) = x_t$ and $\mathrm{D}R_{x_t}(0) = \mathrm{id}_{T_{x_t}\mathcal{M}}$. In cases where the exponential map is computationally tractable, it may be used directly; otherwise, a suitable retraction provides a numerically efficient alternative (Bonnabel, 2011, Sakai et al., 2023).

For batch or mini-batch stochasticity, the update becomes

$x_{t+1} = R_{x_t}\left(-\frac{\gamma_t}{b}\sum_{i=1}^{b} H(x_t, \xi_{t,i})\right),$

where $\xi_{t,1},\dots,\xi_{t,b}$ are IID samples from $\mathcal{D}$ and $b$ is the mini-batch size.
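A minimal sketch of the mini-batch iteration on the unit sphere may help fix ideas. It assumes the sample loss $Q(x,\xi) = -\tfrac{1}{2}(\xi^\top x)^2$ (a streaming-PCA objective chosen here purely for illustration), the tangent projection $g \mapsto g - (x^\top g)\,x$, and the projection retraction $R_x(v) = (x+v)/\|x+v\|$. The function names, the step schedule $\gamma_t = \gamma_0 t^{-0.75}$, and all parameter values are assumptions of this sketch rather than choices made in the cited works.

```python
import numpy as np

def project_to_tangent(x, g):
    """Orthogonal projection of an ambient vector g onto the tangent space at x."""
    return g - np.dot(x, g) * x

def retract(x, v):
    """Projection retraction on the sphere: ambient step, then renormalize."""
    y = x + v
    return y / np.linalg.norm(y)

def rsgd_sphere(samples, x0, gamma0=0.1, decay=0.75, batch_size=8,
                steps=2000, seed=0):
    """Mini-batch Riemannian SGD for Q(x, xi) = -0.5 * (xi^T x)^2 on the sphere."""
    rng = np.random.default_rng(seed)
    x = x0 / np.linalg.norm(x0)
    for t in range(1, steps + 1):
        # Draw a mini-batch with replacement.
        batch = samples[rng.integers(0, len(samples), size=batch_size)]
        # Euclidean gradient of the averaged sample loss.
        egrad = -(batch.T @ (batch @ x)) / batch_size
        # Riemannian gradient estimate: project onto the tangent space at x.
        H = project_to_tangent(x, egrad)
        # Retraction step with a polynomially decaying step size.
        x = retract(x, -(gamma0 / t**decay) * H)
    return x

# Synthetic data whose leading principal direction is the first coordinate axis.
rng = np.random.default_rng(1)
samples = rng.normal(size=(5000, 5)) * np.array([2.0, 1.0, 1.0, 1.0, 1.0])
x_hat = rsgd_sphere(samples, x0=np.ones(5))
alignment = abs(x_hat[0])  # |cosine| of the angle to the first axis
```

With this synthetic data, an `alignment` close to 1 indicates that the iterate has found the leading principal direction up to sign, while the retraction keeps every iterate on the sphere.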

2. Geometric Assumptions and Convergence Theory

Standard convergence analysis for Riemannian SGD relies on the following geometric and analytical requirements:

Manifold regularity: $\mathcal{M}$ is connected and complete (typically Hadamard when global nonpositive curvature is desired) and has a well-defined injectivity radius.
Retraction quality: $R$ is at least a first-order retraction, with higher-order retractions affording improved error bounds (notably for weak-approximation order in diffusion limits) (Gess et al., 2024).
Objective regularity: The cost $f$ is geodesically $L$-smooth, i.e., $\|\operatorname{grad} f(x) - \Gamma_{y}^{x}\operatorname{grad} f(y)\|_x \le L\, d(x,y)$ for all $x, y \in \mathcal{M}$, where $\operatorname{grad} f$ is the Riemannian gradient, $\Gamma_{y}^{x}$ denotes parallel transport from $T_y\mathcal{M}$ to $T_x\mathcal{M}$ along a geodesic, and $d$ is the Riemannian distance.
Stochastic gradient model: Unbiasedness and bounded second moments, $\mathbb{E}_{\xi}[H(x,\xi)] = \operatorname{grad} f(x)$, $\mathbb{E}_{\xi}\|H(x,\xi) - \operatorname{grad} f(x)\|_x^2 \le \sigma^2$, extending naturally to mini-batch gradients with variance decaying as $\sigma^2/b$ for batch size $b$ (Sakai et al., 2023).
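To make the retraction requirement concrete, the following sketch compares the projection retraction on the unit sphere with the exponential map. It is an illustrative check only, assuming the sphere embedded in $\mathbb{R}^3$; the helper names are hypothetical. A first-order retraction must return the base point for a zero step and agree with the exponential map up to second-order terms in the step length.

```python
import numpy as np

def exp_map(x, v):
    """Exponential map on the unit sphere for a tangent vector v at x."""
    nv = np.linalg.norm(v)
    if nv == 0.0:
        return x.copy()
    return np.cos(nv) * x + np.sin(nv) * (v / nv)

def retract(x, v):
    """Projection retraction: ambient step followed by renormalization."""
    y = x + v
    return y / np.linalg.norm(y)

x = np.array([1.0, 0.0, 0.0])
v = np.array([0.0, 0.6, 0.8])        # unit tangent vector at x (x . v = 0)
step_lengths = [0.1, 0.05, 0.025]

# Distance between the retraction and the exponential map for shrinking steps.
gaps = [np.linalg.norm(retract(x, s * v) - exp_map(x, s * v))
        for s in step_lengths]
```

If the agreement is first order, halving the step length should shrink the gap by at least a factor of about four. For this particular retraction on the sphere, the gap shrinks faster than that minimum.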

Under standard Robbins-Monro step-size conditions ($\sum_{t} \gamma_t = \infty$, $\sum_{t} \gamma_t^2 < \infty$), almost sure convergence to a critical point (in the sense that the Riemannian gradient vanishes, $\operatorname{grad} f(x_t) \to 0$) is guaranteed on compact manifolds or under suitable growth controls (for Hadamard settings, possibly with step-size normalization by local geometry) (Bonnabel, 2011, Sakai et al., 2023).
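As a worked check of the step-size conditions, consider the polynomial schedule $\gamma_t = \gamma_0 t^{-a}$ with $\gamma_0 > 0$. The symbols $\gamma_0$ and $a$ are notation introduced here for illustration. Comparing with the standard $p$-series gives:

```latex
\sum_{t=1}^{\infty} \gamma_t = \gamma_0 \sum_{t=1}^{\infty} t^{-a} = \infty
\iff a \le 1,
\qquad
\sum_{t=1}^{\infty} \gamma_t^2 = \gamma_0^2 \sum_{t=1}^{\infty} t^{-2a} < \infty
\iff a > \tfrac{1}{2}.
```

Both conditions therefore hold exactly when $1/2 < a \le 1$. For example, $\gamma_t = \gamma_0/t$ qualifies, while $\gamma_t = \gamma_0/\sqrt{t}$ fails the square-summability condition.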

For geodesically convex, $L$-smooth objectives and a constant mini-batch size $b$, non-asymptotic convergence rates in the iteration count $T$ are available; for polynomially decaying step sizes, these rates mirror the Euclidean results (Sakai et al., 2023).

3. Architectural Variants and Acceleration Techniques

Beyond vanilla Riemannian SGD, a suite of variance-reduced and accelerated schemes has been developed:

Riemannian SVRG (R-SVRG): Organizes SGD into epochs, correcting each stochastic gradient with a full-gradient anchor and parallel-transported corrections. This reduces variance and accelerates convergence for finite-sum objectives. In the geodesically strongly convex case R-SVRG achieves a linear rate, and for nonconvex objectives it admits a gradient-complexity bound for reaching an $\epsilon$-stationary point (Zhang et al., 2016, Sato et al., 2017).
Riemannian Stochastic Hybrid Gradient (R-SHG): Blends R-SGD, R-SVRG, and stochastic recursive gradients with time-varying coefficients. This offers single-loop variance reduction with convergence guarantees under decaying steps and improved asymptotics under fixed-step regimes (Yang, 2021).
Variance-reduced saddle escaping: In nonconvex settings, methods such as perturbed Riemannian SRG inject isotropic noise in the tangent space and use recursive gradients for saddle point escape, with optimal (up to log factors) second-order convergence complexities in both finite-sum and online settings (Han et al., 2020).
Averaging and Polyak–Ruppert extensions: Retraction-based iterate averaging improves the convergence rate of RSGD for strongly convex problems to $\mathcal{O}(1/T)$ for the averaged iterate after $T$ iterations, matching the optimal asymptotic rate and distribution (Tripuraneni et al., 2018).
Learning-rate-free RSGD: Adaptive step-size procedures such as RDoG use geometric quantities to adjust learning rates on the fly, removing the need for meticulous hyperparameter tuning while still attaining optimal (up to logarithmic factors) $\mathcal{O}(1/\sqrt{T})$ rates (Dodd et al., 2024).
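The epoch structure of R-SVRG can be sketched on the unit sphere for a finite-sum objective $f(x) = \frac{1}{n}\sum_i f_i(x)$ with $f_i(x) = -\tfrac{1}{2}(a_i^\top x)^2$. This is a simplified illustration, not the algorithm as stated in the cited papers: it assumes a projection retraction, and it moves the correction term between tangent spaces by orthogonal projection (a simple vector transport) instead of parallel transport. The function names, step size, and epoch count are hypothetical.

```python
import numpy as np

def proj(x, g):
    """Orthogonal projection of g onto the tangent space of the sphere at x."""
    return g - np.dot(x, g) * x

def retract(x, v):
    """Projection retraction: ambient step followed by renormalization."""
    y = x + v
    return y / np.linalg.norm(y)

def rgrad_i(x, a):
    """Riemannian gradient of the component f_i(x) = -0.5 * (a^T x)^2."""
    return proj(x, -np.dot(a, x) * a)

def full_rgrad(x, A):
    """Riemannian gradient of f(x) = (1/n) * sum_i f_i(x)."""
    return proj(x, -(A.T @ (A @ x)) / len(A))

def rsvrg_sphere(A, x0, step=0.01, epochs=20, seed=0):
    rng = np.random.default_rng(seed)
    n = len(A)
    x_anchor = x0 / np.linalg.norm(x0)
    for _ in range(epochs):
        g_anchor = full_rgrad(x_anchor, A)   # full gradient at the anchor
        x = x_anchor.copy()
        for _ in range(n):                   # one pass worth of inner steps
            a = A[rng.integers(n)]
            # The correction lives in the tangent space at the anchor; it is
            # moved to the tangent space at x by projection (assumed transport).
            correction = proj(x, rgrad_i(x_anchor, a) - g_anchor)
            x = retract(x, -step * (rgrad_i(x, a) - correction))
        x_anchor = x                         # last inner iterate is the new anchor
    return x_anchor

rng = np.random.default_rng(1)
A = rng.normal(size=(200, 5)) * np.array([2.0, 1.0, 1.0, 1.0, 1.0])
x_svrg = rsvrg_sphere(A, x0=np.ones(5))

# Reference solution: leading eigenvector of the sample covariance.
eigvals, eigvecs = np.linalg.eigh(A.T @ A / len(A))
alignment = abs(np.dot(x_svrg, eigvecs[:, -1]))
```

Because the corrected direction equals the full gradient at the anchor and its noise shrinks as the iterate approaches the anchor, a constant step size suffices here, in contrast to the decaying steps plain RSGD needs.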

4. Extensions: Non-Smooth, Composite, and Decentralized Optimization

Nonsmooth/tame objectives: SGD with retraction remains convergent for locally Lipschitz, Whitney-stratifiable objectives (e.g., those arising from ReLU activations, batch normalization, or composite regularization) under essentially the same step-size policies as in the smooth setting, with almost sure convergence to Clarke-stationary points (Aspman et al., 2023).
Composition/nested objectives: Riemannian Stochastic Composition Gradient Descent (R-SCGD) addresses objectives that are nested compositions of expectations, for which plain stochastic gradient estimates are biased. Its key innovation is to track the inner expectation with an auxiliary sequence, which controls this bias and yields $\mathcal{O}(\epsilon^{-2})$ iteration complexity (Zhang et al., 2022).
Decentralized/distributed settings: Methods have been developed for networks of agents that each optimize a local objective under consensus constraints, which is crucial for federated and distributed learning. Convergence rates have been established for both the consensus error and global optimality, for intrinsic (exponential and logarithm map) as well as consensus-algorithmic approaches, over general manifolds including the Stiefel and Grassmann cases (Chen et al., 2021, Nguyen et al., 17 Mar 2026, Zhao et al., 2024).

5. Practical Considerations and Applications

Implementation requires careful construction of stochastic Riemannian gradients—generally, by projecting the Euclidean gradient onto the tangent space via the metric, followed by retraction. On matrix manifolds (Stiefel, Grassmann, SPD), explicit closed-form retractions and projections are exploited for computational efficiency (Kasai et al., 2019).
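As an illustration of the closed-form operations on a matrix manifold, the sketch below implements a tangent-space projection and a QR-based retraction on the Stiefel manifold $\{X \in \mathbb{R}^{n\times p} : X^\top X = I_p\}$ under the embedded (Euclidean) metric. The function names are hypothetical, and this is one common choice among several available retractions rather than the specific construction of any cited work.

```python
import numpy as np

def stiefel_project(X, G):
    """Project an ambient matrix G onto the tangent space at X.

    Under the embedded metric this is G - X * sym(X^T G).
    """
    XtG = X.T @ G
    return G - X @ ((XtG + XtG.T) / 2.0)

def stiefel_retract_qr(X, V):
    """QR retraction: the Q factor of X + V, with signs fixed for uniqueness."""
    Q, R = np.linalg.qr(X + V)
    # Flip column signs so that R has a positive diagonal.
    signs = np.sign(np.diag(R))
    signs[signs == 0] = 1.0
    return Q * signs

rng = np.random.default_rng(0)
n, p = 6, 2
X, _ = np.linalg.qr(rng.normal(size=(n, p)))   # a point on the manifold
G = rng.normal(size=(n, p))                    # stand-in Euclidean gradient

xi = stiefel_project(X, G)                     # Riemannian gradient estimate
X_next = stiefel_retract_qr(X, -0.1 * xi)      # one descent step
```

The projected direction satisfies the tangency condition $X^\top \xi + \xi^\top X = 0$, and the retracted point has orthonormal columns again, so the iterate never leaves the constraint set.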

Mini-batch size selection introduces a trade-off between gradient variance and per-iteration cost: increasing the batch size $b$ reduces the gradient estimator variance (which scales as $\mathcal{O}(1/b)$), giving faster convergence per iteration at a higher per-iteration cost. Theory predicts that the number of iterations needed to reach a given accuracy is a strictly decreasing, convex function of the batch size, and that a critical batch size minimizes the overall cost (Sakai et al., 2023).
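The $1/b$ scaling of the estimator variance can be checked empirically. The sketch below is an illustration under assumptions: it uses the sphere with the sample loss $Q(x,\xi) = -\tfrac{1}{2}(\xi^\top x)^2$, draws mini-batches with replacement from a synthetic dataset, and estimates the variance of the mini-batch Riemannian gradient by Monte Carlo. The function name and all parameter values are hypothetical.

```python
import numpy as np

def minibatch_grad_variance(samples, x, batch_size, trials=4000, seed=0):
    """Monte Carlo estimate of E || mini-batch gradient - full gradient ||^2."""
    rng = np.random.default_rng(seed)
    # Per-sample Euclidean gradients of Q(x, xi) = -0.5 * (xi^T x)^2.
    per_sample = -(samples @ x)[:, None] * samples
    # Project each one onto the tangent space at x (x has unit norm).
    per_sample = per_sample - np.outer(per_sample @ x, x)
    full = per_sample.mean(axis=0)
    # Draw `trials` mini-batches with replacement and average within each.
    idx = rng.integers(0, len(samples), size=(trials, batch_size))
    batch_means = per_sample[idx].mean(axis=1)
    return np.mean(np.sum((batch_means - full) ** 2, axis=1))

rng = np.random.default_rng(1)
samples = rng.normal(size=(2000, 5)) * np.array([2.0, 1.0, 1.0, 1.0, 1.0])
x = np.ones(5) / np.sqrt(5.0)

var_b1 = minibatch_grad_variance(samples, x, batch_size=1)
var_b16 = minibatch_grad_variance(samples, x, batch_size=16)
ratio = var_b1 / var_b16   # expected to be near 16, up to Monte Carlo error
```

A ratio near 16 reflects the variance reduction from averaging 16 independent draws. Whether the extra per-iteration cost pays off depends on the cost model, which is what the critical batch size captures.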

Empirical work demonstrates robust convergence across statistical manifold learning, decentralized covariance estimation, streaming PCA, distributionally robust DNN training, and policy evaluation for reinforcement learning (Huang et al., 2020, Zhao et al., 2024, Zhang et al., 2022, Bonnabel, 2011).

6. Stochastic Riemannian Optimization in Adversarial and Minimax Settings

Riemannian stochastic gradient descent forms the basis for gradient-based solvers in manifold-constrained minimax problems, where one seeks, for example,

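$\min_{x\in\mathcal{M}}\ \max_{y\in\mathcal{Y}}\ f(x,y),$

where $\mathcal{Y}$ is the feasible set of the maximizing variable $y$.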

Dedicated algorithms such as Riemannian stochastic gradient descent ascent (RSGDA) have been developed with rigorous sample complexity bounds. In geodesically nonconvex, strongly concave (GNSC) settings, RSGDA achieves sample complexity $\mathcal{O}(\epsilon^{-4})$ for $\epsilon$-stationarity, with further acceleration to $\tilde{\mathcal{O}}(\epsilon^{-3})$ via momentum-based variance reduction (e.g., STORM-like updates with vector transport) (Huang et al., 2020).
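A descent-ascent iteration of this kind can be sketched on a toy problem. The example below is an assumption-laden illustration, not the algorithm or experiment of the cited work: it takes $\mathcal{M}$ to be the unit sphere, lets $y$ range over all of $\mathbb{R}^m$, and uses the hypothetical objective $f(x,y) = y^\top A x - \tfrac{\mu}{2}\|y\|^2$, which is strongly concave in $y$. Gradient noise is simulated with additive Gaussian perturbations, and the step sizes `gamma` and `lam` are illustrative values.

```python
import numpy as np

def proj(x, g):
    """Orthogonal projection of g onto the tangent space of the sphere at x."""
    return g - np.dot(x, g) * x

def retract(x, v):
    """Projection retraction: ambient step followed by renormalization."""
    y = x + v
    return y / np.linalg.norm(y)

def rsgda(A, x0, mu=1.0, gamma=0.01, lam=0.1, steps=3000, noise=0.1, seed=0):
    """Simultaneous stochastic descent in x (on the sphere) and ascent in y."""
    rng = np.random.default_rng(seed)
    x = x0 / np.linalg.norm(x0)
    y = np.zeros(A.shape[0])
    for _ in range(steps):
        # Noisy partial gradients of f(x, y) = y^T A x - 0.5 * mu * ||y||^2.
        gx = A.T @ y + noise * rng.normal(size=x.shape)
        gy = A @ x - mu * y + noise * rng.normal(size=y.shape)
        # Descent on the manifold variable via projection and retraction.
        x_next = retract(x, -gamma * proj(x, gx))
        # Plain ascent on the Euclidean variable, using the old x.
        y = y + lam * gy
        x = x_next
    return x, y

def phi(A, x, mu=1.0):
    """Inner maximum over y in closed form: ||A x||^2 / (2 * mu)."""
    return 0.5 * np.dot(A @ x, A @ x) / mu

A = np.diag([3.0, 2.0, 1.0])
x_start = np.ones(3) / np.sqrt(3.0)
x_final, y_final = rsgda(A, x_start)
phi_start, phi_end = phi(A, x_start), phi(A, x_final)
```

For this toy objective the inner maximization has the closed form `phi`, so the outer problem reduces to minimizing $\|Ax\|^2$ over the sphere, whose solution is the singular direction of $A$ with the smallest singular value (here the third coordinate axis). The ascent step is taken larger than the descent step so that $y$ can track its best response.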

Key to the analysis are geometric Lipschitz constants governing regularity of the manifold gradients and the effects of retraction, as well as Lyapunov-type descent arguments adapted to the Riemannian context.

7. Diffusion Approximations and Continuous-Time Limits

In the small-step regime, Riemannian SGD can be rigorously approximated by a diffusion process, the Riemannian stochastic modified flow (RSMF), whose drift and covariance structure are determined by the geometry, the retraction, and the gradient noise. Under appropriate regularity of the retraction and manifold curvature, quantitative bounds on the weak error of this approximation are available in terms of the step size (Gess et al., 2024). These insights connect discrete SGD dynamics to well-posed stochastic differential equations on manifolds, clarifying algorithmic scaling and the effect of geometric structures on noise propagation.

Additionally, for infinite-dimensional settings such as Wasserstein space, Riemannian SGD and SVRG flows have been derived and analyzed at the level of Fokker–Planck equations, matching the Euclidean continuous-time rates and clarifying connections to Langevin sampling and geometric MCMC (Yi et al., 2024).

References:

(Bonnabel, 2011) "Stochastic gradient descent on Riemannian manifolds"
(Sakai et al., 2023) "Convergence of Riemannian Stochastic Gradient Descent on Hadamard Manifold"
(Zhang et al., 2016) "Riemannian SVRG: Fast Stochastic Optimization on Riemannian Manifolds"
(Tripuraneni et al., 2018) "Averaging Stochastic Gradient Descent on Riemannian Manifolds"
(Han et al., 2020) "Escape saddle points faster on manifolds via perturbed Riemannian stochastic recursive gradient"
(Huang et al., 2020) "Gradient Descent Ascent for Minimax Problems on Riemannian Manifolds"
(Zhang et al., 2022) "Riemannian Stochastic Gradient Method for Nested Composition Optimization"
(Gess et al., 2024) "Stochastic Modified Flows for Riemannian Stochastic Gradient Descent"
(Chen et al., 2021) "Decentralized Riemannian Gradient Descent on the Stiefel Manifold"
(Zhao et al., 2024) "Distributed Riemannian Stochastic Gradient Tracking Algorithm on the Stiefel Manifold"
(Dodd et al., 2024) "Learning-Rate-Free Stochastic Optimization over Riemannian Manifolds"
(Kasai et al., 2019) "Riemannian adaptive stochastic gradient algorithms on matrix manifolds"
(Aspman et al., 2023) "Riemannian Stochastic Approximation for Minimizing Tame Nonsmooth Objective Functions"
(Yi et al., 2024) "Continuous-time Riemannian SGD and SVRG Flows on Wasserstein Probabilistic Space"
(Yang, 2021) "Riemannian Stochastic Hybrid Gradient Algorithm for Nonconvex Optimization"
(Nguyen et al., 17 Mar 2026) "Intrinsic Decentralized Stochastic Riemannian Optimization on Manifolds with Bounded Sectional Curvature"
(Sato et al., 2017) "Riemannian stochastic variance reduced gradient algorithm with retraction and vector transport"