Riemannian Gradient Descent

Updated 23 September 2025
  • Riemannian gradient descent is an optimization method on manifolds that computes the intrinsic steepest descent direction using the geometry of the space.
  • The method updates iterates via retractions along geodesics, carefully accounting for curvature and local smoothness to ensure convergence.
  • Applications range from Bayesian inference to low-rank matrix completion and decentralized optimization, demonstrating its practical and theoretical impact.

The Riemannian gradient descent method generalizes classical first-order optimization to Riemannian manifolds, replacing the Euclidean notion of “steepest descent” with an intrinsic steepest descent direction that respects the geometry of curved spaces. The fundamental step moves iterates along geodesics by following the negative Riemannian gradient, the tangent direction of steepest local decrease of the objective with respect to the manifold’s metric. This method forms the basis of modern non-Euclidean machine learning and statistical inference, enabling efficient algorithms for manifold-constrained problems arising in Bayesian inference, matrix and tensor completion, low-rank recovery, robust learning, distributed nonconvex optimization, and geometric inverse problems.

1. Mathematical Principles of Riemannian Gradient Descent

In Riemannian optimization, one considers an objective $f:\mathcal{M}\to\mathbb{R}$ defined on a Riemannian manifold $(\mathcal{M},g)$. The Riemannian gradient, $\operatorname{grad} f(x)$, is the unique tangent vector at $x\in\mathcal{M}$ satisfying $g_x(\operatorname{grad} f(x),v) = D_v f(x)$ for all $v\in T_x\mathcal{M}$. Given a smooth retraction $\mathcal{R}_x$, the canonical update at iteration $k$ is:

$$x_{k+1} = \mathcal{R}_{x_k}\!\left(-\eta_k\,\operatorname{grad} f(x_k)\right),$$

where $\eta_k$ is a step size.

Distinct from Euclidean algorithms, descent occurs along geodesics or, more generally, retraction curves, which are curves on the manifold starting at $x_k$ in the tangent direction $-\operatorname{grad} f(x_k)$. The convergence rates and step size selection critically depend on the manifold’s sectional curvature and the local smoothness and convexity of $f$ (see (Martínez-Rubio et al., 15 Mar 2024)).
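
As a concrete illustration of the update above, the following is a minimal sketch of Riemannian gradient descent on the unit sphere, where the Riemannian gradient is the Euclidean gradient projected onto the tangent space and the retraction is renormalization. The Rayleigh-quotient objective, step-size choice, and function names are illustrative assumptions of this sketch rather than any cited implementation.

```python
import numpy as np

def rgd_sphere(A, x0, num_iters=1000):
    """Minimize f(x) = x^T A x over the unit sphere S^{n-1} (a Rayleigh quotient).

    Riemannian gradient: Euclidean gradient 2Ax projected onto the tangent
    space T_x S^{n-1} = {v : x^T v = 0}. Retraction: renormalization.
    """
    eta = 1.0 / (4.0 * np.linalg.norm(A, 2))   # conservative step of order 1/L (assumption)
    x = x0 / np.linalg.norm(x0)
    for _ in range(num_iters):
        egrad = 2.0 * A @ x                    # Euclidean gradient of x^T A x
        rgrad = egrad - (x @ egrad) * x        # project onto the tangent space at x
        x = x - eta * rgrad                    # move along -grad f(x)
        x = x / np.linalg.norm(x)              # retract back onto the sphere
    return x

# Usage: the iterates approach an eigenvector of the smallest eigenvalue of A.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 50)); A = (A + A.T) / 2
x = rgd_sphere(A, rng.standard_normal(50))
print(x @ A @ x, np.linalg.eigvalsh(A)[0])     # the two values should be close
```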

For concrete matrix manifolds:

  • In the manifold $\mathbb{P}_n$ of $n\times n$ positive definite matrices, the Riemannian metric is $g_A(X,Y)=\operatorname{tr}(A^{-1}XA^{-1}Y)$. The Riemannian gradient and geodesic are explicitly computable via matrix logarithms and exponentials (see (Duan et al., 2019)).
  • In the manifold of rank-$r$ matrices, the tangent space projection of a perturbation $Y$ at $X=U\Sigma V^\top$ is $P_{T_X}(Y)=UU^\top Y + YVV^\top - UU^\top YVV^\top$, and retraction is given by truncating the SVD (see (Hsu et al., 2022, Bian et al., 2023) and the sketch below).
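
A minimal sketch of these two ingredients (the tangent-space projection and the truncated-SVD retraction) is given below; the function names and the simple project-then-retract step in the comments are illustrative assumptions rather than the exact algorithms of the cited papers.

```python
import numpy as np

def tangent_projection(U, V, Y):
    """Project Y onto the tangent space of the rank-r manifold at X = U @ S @ V.T:
    P_{T_X}(Y) = U U^T Y + Y V V^T - U U^T Y V V^T."""
    UtY = U.T @ Y
    return U @ UtY + (Y @ V) @ V.T - U @ (UtY @ V) @ V.T

def svd_retraction(Z, r):
    """Retract an arbitrary matrix Z onto the rank-r manifold by truncating its SVD."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]

# One Riemannian gradient step at a rank-r point X = U @ np.diag(s) @ Vt,
# with Euclidean gradient `egrad` and step size `eta`:
#   rgrad  = tangent_projection(U, Vt.T, egrad)
#   X_next = svd_retraction(X - eta * rgrad, r)
```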

2. Curvature, Step Size, and Convergence

The interplay between manifold curvature and gradient descent dynamics is quantitatively analyzed in (Martínez-Rubio et al., 15 Mar 2024). For $L$-smooth, geodesically convex $f$ defined in a ball of radius $R$ around an optimizer $x^*$, choosing $\eta=1/L$ ensures that iterates remain in a ball of radius at most $\varphi\zeta_{\rm geo} R$, where $\varphi=(1+\sqrt{5})/2$ and $\zeta_{\rm geo}$ captures the influence of curvature. Sublinear rates $O(LR^2/\varepsilon)$ are attained in the convex case, and linear (exponential) rates $O((L/\mu)\ln(LR^2/\varepsilon))$ if $f$ is $\mu$-strongly convex. The bound tightens by choosing a curvature-aware step size $\eta=1/(L\zeta_{\rm geo})$, containing the iterates within $O(R)$ of $x^*$.

Curvature further impacts the local validity of smoothness and convexity assumptions: high positive curvature necessitates restricting the diameter of the feasible region (e.g., to below $\pi/\sqrt{\kappa_{\max}}$). These non-Euclidean geometric constants appear in both the step size and the contraction factor, governing convergence.
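
To make the role of these geometric constants concrete, the sketch below computes a curvature-dependent constant and the corresponding curvature-aware step size for a manifold whose sectional curvature is bounded below by $-\kappa$ on a ball of radius $R$. The particular formula $\zeta(\kappa,R)=\sqrt{\kappa}\,R/\tanh(\sqrt{\kappa}\,R)$ is the standard constant from the geodesically convex optimization literature, used here as a stand-in for $\zeta_{\rm geo}$; the exact constant in (Martínez-Rubio et al., 15 Mar 2024) may differ.

```python
import numpy as np

def curvature_constant(kappa, R):
    """Geometric constant zeta(kappa, R) = sqrt(kappa)*R / tanh(sqrt(kappa)*R) for
    sectional curvature bounded below by -kappa on a ball of radius R.
    Tends to 1 as kappa -> 0, recovering the Euclidean setting."""
    if kappa <= 0.0:
        return 1.0
    t = np.sqrt(kappa) * R
    return t / np.tanh(t)

def curvature_aware_step(L, kappa, R):
    """Step size eta = 1 / (L * zeta), tightening the curvature-agnostic choice eta = 1/L."""
    return 1.0 / (L * curvature_constant(kappa, R))

print(curvature_aware_step(L=10.0, kappa=1.0, R=2.0))   # smaller than 1/L = 0.1 on a curved space
```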

3. Algorithmic Variants and Practical Implementation

Multiple variants expand the core method, adapting RGD to challenges such as inexact gradients, stochasticity, and compositional objectives.

  • Inexact Gradients: When only approximate gradients are available (due to sub-sampling, adversarial perturbation, or surrogate steps), the update $x_{k+1} = \mathcal{R}_{x_k}(-\eta_k g_k)$ is used, with $g_k$ satisfying either the absolute-error condition $\|g_k - \operatorname{grad} f(x_k)\| \le \epsilon_k$ or the relative-error condition $\|g_k - \operatorname{grad} f(x_k)\| \le \nu \|\operatorname{grad} f(x_k)\|$. Under suitable summability and decay of $\epsilon_k$ (or bounded relative error $\nu<1$), and under the Riemannian Kurdyka–Łojasiewicz property, convergence to stationary points with explicit rates is established; applications include sharpness-aware minimization and extragradient methods (see (Zhou et al., 17 Sep 2024) and the skeleton after this list).
  • Adaptive Step Sizes: A natural adaptive scheme approximates the local Lipschitz constant via the difference of parallel transported gradients (using the derivative of the exponential map or parallel transport). This strategy, effective on manifolds with nonnegative curvature, adjusts $\eta_k$ automatically to the function’s local geometry, avoiding line search and leveraging parallel transport for curvature compensation (Ansari-Önnestam et al., 23 Apr 2025).
  • Stochastic and Compositional Methods: Riemannian Stochastic Gradient Descent (RSGD) extends the approach to stochastic objectives, often with minibatch averaging. The expected squared gradient norm converges at rate $O(1/K + \sigma^2/b)$ for constant step size $\eta<2/L$ and minibatch size $b$, with the total complexity minimized at a critical batch size dependent on the noise level $\sigma^2$ and target precision $\epsilon$ (see (Sakai et al., 2023)). For nested expectations, policy evaluation, and related compositional problems, auxiliary “tracking” variables address gradient bias and retain $O(1/\epsilon^2)$ oracle complexity (Zhang et al., 2022).
  • Preconditioning and Advanced Metrics: Incorporating preconditioners in the tangent space—e.g., diagonal approximations reflecting local gradient energy—substantially accelerates convergence in matrix completion and sensing, with empirical speedups up to an order of magnitude (Bian et al., 2023).
  • Retraction and Projection Schemes: Efficient retractions (e.g., QR-based, SVD-based, exponential map, or normalization) are critical for manifold-constrained problems. Case-specific projections and coordinate embeddings (such as onto the Stiefel manifold or a product of spheres) are used where global coordinate charts are unavailable.
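
As a structural illustration of the inexact-gradient variant in the first bullet, the skeleton below accepts an approximate gradient oracle and enforces the relative-error condition before each step. The manifold interface (`grad`, `approx_grad`, `retract`), the fallback to the exact gradient, and all parameter names are simplifying assumptions of this sketch, not the algorithm of (Zhou et al., 17 Sep 2024).

```python
import numpy as np

def inexact_rgd(x0, grad, approx_grad, retract, eta=0.1, nu=0.5, num_iters=100):
    """Riemannian gradient descent with inexact gradients g_k.

    Each step uses g_k = approx_grad(x_k) if it satisfies the relative-error
    bound ||g_k - grad f(x_k)|| <= nu * ||grad f(x_k)|| with nu < 1; otherwise
    it falls back to the exact gradient (a simple safeguard for this sketch).
    """
    x = x0
    for _ in range(num_iters):
        exact = grad(x)                     # exact Riemannian gradient (tangent vector)
        g = approx_grad(x)                  # cheap/inexact gradient estimate
        if np.linalg.norm(g - exact) > nu * np.linalg.norm(exact):
            g = exact                       # relative-error test failed: use the exact gradient
        x = retract(x, -eta * g)            # inexact update x_{k+1} = R_{x_k}(-eta * g_k)
    return x
```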

4. Applications Across Domains

Riemannian gradient descent forms the core of manifold-based optimization in diverse scientific and engineering contexts, with direct instantiations including:

  • Bayesian Inference and Stein Methods: RSVGD generalizes Stein Variational Gradient Descent to arbitrary manifolds, yielding coordinate-invariant, particle-efficient inference engines. The method exploits information geometry for preconditioning even in Euclidean spaces and enables inference over distributions defined on hyperspheres, Stiefel manifolds, and other curved spaces (Liu et al., 2017).
  • Low-Rank Recovery and Tensor Completion: For matrix and tensor completion under rank constraints, RGD (and its conjugate or preconditioned variants) achieves nearly linear recovery rates with random initialization, robust to degeneracy and “spurious” critical points (Hou et al., 2020, Song et al., 2020). Theoretical error bounds and phase transitions are characterized numerically and analytically; a minimal completion sketch is given after this list.
  • Quantum State Tomography: RGD deployed on the manifold of rank-$r$ positive semidefinite matrices reconstructs quantum states exponentially fast, with error contraction independent of the physical state’s spectral condition number, bypassing the dimensionality barrier (Hsu et al., 2022).
  • Distributed and Decentralized Optimization: Decentralized adaptations using local gradient steps and consensus or gradient-tracking mechanisms on manifolds such as the Stiefel manifold enable globally convergent, communication-efficient algorithms for nonconvex, networked objectives (Chen et al., 2021, Chen et al., 2023).
  • Geometric Inverse Problems: Inverse eigenvalue problems, surface parameterizations, and other physically constrained inverse tasks benefit from the RGD abstraction by framing constraints as projections or retractions on manifolds, leveraging efficient partial eigendecompositions or special geometries (Riley et al., 10 Apr 2025, Sutti et al., 18 Mar 2024).
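
To make the low-rank instantiation concrete, here is a minimal, self-contained sketch of RGD for matrix completion on the fixed-rank manifold, combining the tangent projection and SVD retraction from Section 1. The sampling model, spectral initialization, unit step size, and function names are illustrative assumptions, not the exact algorithms of the cited works.

```python
import numpy as np

def rgd_matrix_completion(M_obs, mask, r, eta=1.0, num_iters=200):
    """Recover a rank-r matrix from the entries of M_obs observed where `mask` is True.

    Objective: f(X) = 0.5 * ||mask * (X - M_obs)||_F^2, minimized over the manifold
    of rank-r matrices via tangent-space projection and truncated-SVD retraction.
    """
    mask = mask.astype(float)
    # Spectral initialization from the zero-filled observations (a 1/p rescaling
    # by the sampling rate p is common in the literature but omitted here).
    U, s, Vt = np.linalg.svd(mask * M_obs, full_matrices=False)
    X = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]
    for _ in range(num_iters):
        egrad = mask * (X - M_obs)                       # Euclidean gradient
        U, _, Vt = np.linalg.svd(X, full_matrices=False)
        U, V = U[:, :r], Vt[:r, :].T
        UtG = U.T @ egrad                                # project onto the tangent space at X
        rgrad = U @ UtG + (egrad @ V) @ V.T - U @ (UtG @ V) @ V.T
        U2, s2, Vt2 = np.linalg.svd(X - eta * rgrad, full_matrices=False)
        X = U2[:, :r] @ np.diag(s2[:r]) @ Vt2[:r, :]     # SVD retraction back to rank r
    return X
```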

5. Variants, Extensions, and Methodological Advances

Multiple methodological developments and variants have been proposed to improve and generalize Riemannian gradient descent:

  • Sufficient Descent and Conjugate Directions: Hybrid conjugate gradient schemes with well-designed momentum coefficients and properly adapted vector transport strategies guarantee sufficient descent and accelerated convergence on Riemannian manifolds, paralleling and extending Euclidean CG methods (Sakai et al., 2020).
  • Minimax and Saddle-Point Optimization: Algorithms integrating Riemannian gradient descent in the primal manifold variable and projected ascent in a dual (possibly Euclidean) variable achieve provable rates (e.g., $\tilde O(1/\epsilon^3)$) in nonconvex–strongly-concave minimax problems, with practical impact for robust and fair machine learning (Huang et al., 2020, Xu et al., 2022).
  • Handling Inexactness and Bias: Systematic frameworks for accommodating inexactness—either due to samplers, adversarial perturbations, or intermediate tracking variables—allow global convergence guarantees to be preserved provided the error is appropriately controlled and matches the structure of the manifold (Zhou et al., 17 Sep 2024, Zhang et al., 2022).
  • Sharpness-Aware and Adversarial Techniques: Variants such as Riemannian Sharpness-Aware Minimization (RSAM) and extragradient methods perform updates at adversarial or predictor points to promote flat minima or robustify against nonconvexity, inheriting the main convergence features of RGD under adequate error bounds (Zhou et al., 17 Sep 2024); a structural sketch of one such update follows this list.
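
For the sharpness-aware variant, a structural sketch of one RSAM-style update on a generic manifold is shown below. The use of a tangent-space projection in place of parallel transport, the manifold interface (`grad`, `retract`, `project`), and the parameter names are all simplifying assumptions of this sketch, not the method analyzed in (Zhou et al., 17 Sep 2024).

```python
import numpy as np

def rsam_step(x, grad, retract, project, eta=0.1, rho=0.05):
    """One sharpness-aware Riemannian update.

    1. Ascend to an adversarial point within a radius-rho tangent ball at x.
    2. Evaluate the Riemannian gradient at that adversarial point.
    3. Project it back to T_x (a crude substitute for parallel transport) and descend from x.
    """
    g = grad(x)
    g_norm = np.linalg.norm(g)
    if g_norm == 0.0:
        return x                              # already stationary
    x_adv = retract(x, rho * g / g_norm)      # adversarial ascent step of length rho
    g_adv = project(x, grad(x_adv))           # gradient at x_adv, mapped back to T_x
    return retract(x, -eta * g_adv)           # descent step taken from the original x
```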

6. Empirical Behavior and Practical Considerations

Empirical evaluations in high dimensions, both synthetic and application-driven, demonstrate that Riemannian gradient descent and its modern variants exhibit:

  • Robustness to Initialization and Geometry: With carefully chosen step sizes and proper handling of curvature (e.g., via adaptive or curvature-aware steps), iterates remain well-behaved, typically exhibiting monotonic reduction in distance to the optimizer and objective function value beyond the theoretical worst-case predictions (Martínez-Rubio et al., 15 Mar 2024, Ansari-Önnestam et al., 23 Apr 2025).
  • Acceleration via Preconditioning and Adaptivity: Adaptive and preconditioned variants empirically decrease function value and gradient norm substantially faster than vanilla RGD, especially in ill-conditioned or large-scale settings (Bian et al., 2023), and outperform gradient methods with line search in iteration count due to large admissible steps.
  • Scalability in Large-Scale and Real-World Problems: The methods scale to very large problems (e.g., reconstruction of $32,400 \times 32,400$ matrices or mesh parameterizations on half-million-vertex surfaces) with implementations leveraging partial eigendecompositions, tangent space projections, and parallel updates (Riley et al., 10 Apr 2025, Sutti et al., 18 Mar 2024).
  • Effectiveness in Distributed and Structured Settings: Decentralized and communication-efficient schemes on manifolds such as the Stiefel manifold show robust performance and exact convergence even with limited communication and local data (Chen et al., 2021).

7. Theoretical and Methodological Impact

The Riemannian gradient descent method and its modern extensions have dramatically expanded the reach of first-order optimization to settings with complex constraints and non-Euclidean structure, enabling rigorous analysis and effective computation in scenarios where classical theory is inapplicable. Key results now include curvature-dependent convergence rates for geodesically convex and strongly convex objectives, robustness guarantees under inexact, stochastic, and biased gradients, and provably convergent decentralized, minimax, and preconditioned variants.

The method thus constitutes a foundational pillar for modern optimization and inference where geometry cannot be ignored and intrinsic structure is central to algorithmic progress.
