Riemannian Stochastic Gradient Descent

Updated 21 April 2026
  • Riemannian Stochastic Gradient Descent (RSGD) is an optimization method that generalizes classical SGD to curved spaces by enforcing manifold consistency through tangent-space projections and retractions.
  • It incorporates sharpness-aware minimization (RSAM), adjusting local adversarial perturbations within the manifold for improved robustness and convergence.
  • Empirical evaluations show that RSGD variants, including Monge SAM, outperform standard SGD and Euclidean SAM in classification, pretraining, and multi-modal alignment tasks.

Riemannian Stochastic Gradient Descent (RSGD) encompasses a family of optimization algorithms that generalize the classical stochastic gradient descent paradigm to settings where the model parameters live on a Riemannian manifold rather than flat Euclidean space. This framework is essential when the feasible set or the natural geometry of the problem is non-Euclidean, as in learning on the Stiefel or Grassmann manifold, or when parameter space symmetry/constraints are most naturally enforced with manifold structure. Recent advances have extended sharpness-aware minimization (SAM)—originally formulated for flat, Euclidean spaces—to the Riemannian context, resulting in Riemannian Sharpness-Aware Minimization (RSAM). Notable modern instantiations include Monge SAM (M-SAM) and geometric approaches tailored for learning on manifolds.

1. Mathematical Foundations of Riemannian SAM

Let $\mathcal{M} \subset \mathbb{R}^k$ be a $d$-dimensional Riemannian manifold, embedding the parameter set of a model $f_\theta$ with $\theta \in \mathcal{M}$. The loss function, typically sample-averaged,

$$\mathcal{L}_S(\theta) = \frac{1}{n} \sum_{i=1}^n \ell(f_\theta(x_i), y_i),$$

is defined for $\theta$ on $\mathcal{M}$. At a given iterate $\theta \in \mathcal{M}$, the sharpness-aware objective is

$$L_{\mathrm{RSAM}}(\theta) = \max_{\Delta \in T_\theta \mathcal{M},\, \|\Delta\|_2 \leq \rho} \; \mathcal{L}_S\left(\mathrm{Exp}_\theta(\Delta)\right),$$

where $T_\theta \mathcal{M}$ is the tangent space at $\theta$, $\mathrm{Exp}_\theta$ is the Riemannian exponential map (or a computationally tractable retraction $R_\theta$), and $\|\cdot\|_2$ is the standard Euclidean norm in $T_\theta \mathcal{M}$ as a subset of $\mathbb{R}^k$ (Truong et al., 2023).

The optimization of this objective requires machinery unique to Riemannian geometry—such as projections onto tangent spaces, manipulation through retraction/exponential maps, and computation of Riemannian gradients and transports.
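As a minimal sketch of this machinery, the block below implements the tangent-space projection, a projection-based retraction, and the Riemannian gradient on the unit sphere in $\mathbb{R}^k$. The sphere, the linear loss, and all function names are illustrative assumptions rather than anything prescribed by the cited works; plain Python lists keep the example dependency-free.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def project_to_tangent(x, v):
    """Orthogonal projection of an ambient vector v onto the tangent space at x."""
    c = dot(x, v)
    return [vi - c * xi for vi, xi in zip(v, x)]

def retract(x, xi):
    """Projection-based retraction: step in the ambient space, then renormalize."""
    y = [a + b for a, b in zip(x, xi)]
    n = norm(y)
    return [a / n for a in y]

def riemannian_grad(x, euclid_grad):
    """Riemannian gradient as the tangent projection of the Euclidean gradient."""
    return project_to_tangent(x, euclid_grad)

# Hypothetical loss L(x) = <c, x> on the unit sphere; its Euclidean gradient is c.
x = [1.0, 0.0, 0.0]
c = [1.0, 2.0, 0.0]
g = riemannian_grad(x, c)                     # tangent at x, so orthogonal to x
x_new = retract(x, [-0.1 * gi for gi in g])   # one descent step, back on the sphere
```

The retraction here is the cheap substitute for the exponential map: it agrees with the geodesic to first order while avoiding trigonometric evaluations.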

2. RSGD Algorithmic Structure: Teleportation and Descent

Riemannian SAM (RSAM) generalizes SAM via a two-step process at each iteration $t$:

  1. Inner maximization ("Teleportation" step):

    • Compute the Riemannian gradient $\mathrm{grad}\,\mathcal{L}_S(\theta_t)$, the projection of the Euclidean gradient onto $T_{\theta_t}\mathcal{M}$.
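    • Solve the adversarial ascent direction in the tangent space, typically via

    $$\Delta_t = \rho \, \frac{P_{\theta_t}\!\left(D_{\theta_t} \nabla \mathcal{L}_S(\theta_t)\right)}{\left\| P_{\theta_t}\!\left(D_{\theta_t} \nabla \mathcal{L}_S(\theta_t)\right) \right\|_2},$$

    where $D_{\theta_t}$ is a metric-adjustment matrix (for instance the identity), and $P_{\theta_t}$ projects back to the tangent space (Truong et al., 2023).
    • Teleport the parameters onto the manifold: $\theta_t^{\mathrm{adv}} = R_{\theta_t}(\Delta_t)$.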

  2. Outer minimization (Riemannian descent):

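    • Compute the Riemannian gradient at the perturbed point $\theta_t^{\mathrm{adv}}$.
    • Update by retracting along the negative gradient with step size $\eta_t$:

    $$\theta_{t+1} = R_{\theta_t}\!\left(-\eta_t \, \mathrm{grad}\,\mathcal{L}_S(\theta_t^{\mathrm{adv}})\right)$$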

Complete pseudocode is given in Truong et al. (2023). This framework accommodates both exact and approximate solutions to the inner maximization; the latter (e.g., the direct relaxation that takes the metric-adjustment matrix to be the identity) is faster, with negligible loss in empirical accuracy.
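The two-step iteration above can be sketched on the unit sphere as follows. This is an illustration under stated assumptions, not the pseudocode of Truong et al.: the metric-adjustment matrix is taken to be the identity, the retraction is projection-based, and the gradient at the perturbed point is simply re-projected onto the tangent space at the current iterate as a stand-in for a vector transport. The function `rsam_step`, its default `rho` and `lr`, and the toy quadratic loss are all hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def project(x, v):
    """Projection of v onto the tangent space of the unit sphere at x."""
    c = dot(x, v)
    return [vi - c * xi for vi, xi in zip(v, x)]

def retract(x, xi):
    """Projection-based retraction onto the unit sphere."""
    y = [a + b for a, b in zip(x, xi)]
    n = norm(y)
    return [a / n for a in y]

def rsam_step(x, loss_grad, rho=0.05, lr=0.1):
    """One RSAM-style iteration (identity metric adjustment, assumed)."""
    # Inner maximization: first-order ascent direction of norm rho in the tangent space.
    g = project(x, loss_grad(x))
    g_norm = norm(g)
    if g_norm == 0.0:
        return x  # Riemannian critical point, no direction to perturb along
    delta = [rho * gi / g_norm for gi in g]
    x_adv = retract(x, delta)  # "teleport" onto the manifold
    # Outer minimization: gradient at the perturbed point, re-projected at x
    # (assumed substitute for a vector transport), then a retracted descent step.
    g_adv = project(x, project(x_adv, loss_grad(x_adv)))
    return retract(x, [-lr * gi for gi in g_adv])

# Toy problem: minimize x^T A x on the sphere with A = diag(3, 2, 1).
# The minimizers are +/- e_3, where the loss equals 1.
A = [3.0, 2.0, 1.0]
loss = lambda x: sum(a * xi * xi for a, xi in zip(A, x))
loss_grad = lambda x: [2.0 * a * xi for a, xi in zip(A, x)]

x = retract([1.0, 1.0, 1.0], [0.0, 0.0, 0.0])  # normalized starting point
for _ in range(200):
    x = rsam_step(x, loss_grad)
```

Because the descent direction is evaluated at a perturbed point of fixed radius, the iterates settle into a small neighborhood of the minimizer rather than converging to it exactly, which is the expected behavior for a constant `rho`.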

3. Geometric and Theoretical Properties

RSAM inherits and augments critical geometric properties relative to its Euclidean counterparts:

  • Manifold Consistency: All updates and perturbations are confined to $\mathcal{M}$, leveraging projection, retraction, and manipulation in tangent spaces.
  • Reparametrization Invariance (M-SAM): Monge SAM generalizes SAM by introducing a loss-induced Riemannian metric $g_\theta$ (the Monge metric). Adversarial directions and steps are computed with respect to this geometry, yielding invariance under smooth reparametrizations: if $\varphi$ is a diffeomorphism, the metric and steps transform covariantly, and the constrained step size is preserved under change of variables (Jacobsen et al., 12 Feb 2025).
  • Generalization Bound: Under compactness, an $L$-Lipschitz loss, and controlled retraction error, RSAM admits a generalization bound that depends on the intrinsic dimension $d$ (as opposed to the ambient dimension $k$ in Euclidean SAM). This dependence underpins improved statistical guarantees in manifold-constrained problems (Truong et al., 2023).

  • Critical Point Behavior: RSAM, and especially M-SAM, is less prone than Euclidean SAM to becoming trapped at suboptimal saddle points, because the step size modulates itself. The effective radius

$$\rho_{\mathrm{eff}} = \frac{\rho}{\sqrt{1 + \|\nabla \mathcal{L}_S(\theta)\|_2^2}}$$

endows M-SAM with self-damping properties that make it more robust to hyperparameter choices and gradient magnitude (Jacobsen et al., 12 Feb 2025).
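The self-damping effect can be illustrated numerically. The sketch below assumes the Monge metric takes the standard graph-of-a-function form $G = I + \nabla\mathcal{L}\,\nabla\mathcal{L}^\top$, in which case the steepest-ascent direction $G^{-1}\nabla\mathcal{L}$ is parallel to the gradient and normalizing it to metric length $\rho$ gives a closed form. The function name and the sample gradients are hypothetical.

```python
import math

def euclidean_norm(v):
    return math.sqrt(sum(vi * vi for vi in v))

def monge_sam_perturbation(grad, rho):
    """Perturbation of length rho under the assumed Monge metric G = I + g g^T.

    G^{-1} g is parallel to g, so delta = s * g with s chosen so that
    delta^T G delta = rho^2, i.e. s = rho / (|g| * sqrt(1 + |g|^2)).
    """
    g_norm = euclidean_norm(grad)
    if g_norm == 0.0:
        return [0.0 for _ in grad]
    scale = rho / (g_norm * math.sqrt(1.0 + g_norm ** 2))
    return [scale * gi for gi in grad]

rho = 0.1
# Small gradient: the Euclidean length of the perturbation stays close to rho (SAM-like).
small = euclidean_norm(monge_sam_perturbation([0.01, 0.0], rho))
# Large gradient (norm 50): the perturbation shrinks to rho / sqrt(1 + 2500).
large = euclidean_norm(monge_sam_perturbation([30.0, 40.0], rho))
```

With a vanishing perturbation the update approaches plain gradient descent, so under this assumed metric the method interpolates between SAM-like and SGD-like behavior depending on the gradient magnitude.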

4. Implementation Details and Computational Overhead

Riemannian SAM-type methods require neither explicit Hessians nor high-rank matrix inverses. Key operations per iteration include:

  • One forward and one backward pass for the base point.
  • Computation of Riemannian (projected) gradients and norm/scaling.
  • One additional forward-backward for the adversarial direction.
  • Retraction/exponential-map computations, and (optionally) tangent-space projections.

The runtime increase relative to standard SAM is marginal: RSAM adds only a small per-epoch overhead over Euclidean SAM, and both are roughly twice as slow as vanilla SGD because every step needs two forward-backward passes. All empirical evidence in the literature uses tractable choices of retraction and projection, such as the identity map in unconstrained $\mathbb{R}^k$, or projection-based retraction for embedded manifolds (Truong et al., 2023; Jacobsen et al., 12 Feb 2025).
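The source of the extra cost can be made concrete by counting gradient evaluations. The sketch below works in flat, unconstrained space with the identity retraction, so it shows the Euclidean SAM baseline only; the `CountingGrad` wrapper, the step functions, and the quadratic loss are hypothetical helpers for illustration.

```python
import math

class CountingGrad:
    """Wraps a gradient function and counts how often it is evaluated."""
    def __init__(self, fn):
        self.fn = fn
        self.calls = 0

    def __call__(self, x):
        self.calls += 1
        return self.fn(x)

def sgd_step(x, grad, lr=0.1):
    g = grad(x)                                      # one forward-backward pass
    return [xi - lr * gi for xi, gi in zip(x, g)]

def sam_step(x, grad, lr=0.1, rho=0.05):
    g = grad(x)                                      # pass 1: base point
    n = math.sqrt(sum(gi * gi for gi in g)) or 1.0   # guard against a zero gradient
    x_adv = [xi + rho * gi / n for xi, gi in zip(x, g)]
    g_adv = grad(x_adv)                              # pass 2: perturbed point
    return [xi - lr * gi for xi, gi in zip(x, g_adv)]

quad_grad = lambda x: [2.0 * xi for xi in x]         # gradient of |x|^2

sgd_grad, sam_grad = CountingGrad(quad_grad), CountingGrad(quad_grad)
x_sgd = x_sam = [1.0, -2.0]
for _ in range(10):
    x_sgd = sgd_step(x_sgd, sgd_grad)
    x_sam = sam_step(x_sam, sam_grad)
print(sgd_grad.calls, sam_grad.calls)   # 10 20
```

A Riemannian variant adds projections and retractions on top of these two passes, which are typically cheap compared with backpropagation through the model.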

5. Empirical Performance and Benchmarks

Empirical investigations show RSAM and its Monge metric variant outperform standard SGD and Euclidean SAM in multiple settings, particularly:

  • Supervised Classification (ResNet50, CIFAR-100): RSAM achieves 77.78% top-1 accuracy compared to 75.04% for SAM and 74.62% for SGD at identical hyperparameters (Truong et al., 2023).
  • Contrastive Pretraining (SupCon + RSAM): Linear evaluation after SupCon pretraining with RSAM reaches 81.62% accuracy (vs. 76.73% for SAM and 75.29% for SGD) (Truong et al., 2023).
  • Robustness to Hyperparameter Choices: On CIFAR-10, both SAM and M-SAM can escape local minima inaccessible to SGD, but only M-SAM avoids catastrophic divergence with large $\rho$, demonstrating its conservative step size adaptation (Jacobsen et al., 12 Feb 2025).
  • Multi-modal Representation Alignment (CLIP Fine-tuning): On WIT/MS-COCO, M-SAM achieved higher mutual-kNN similarity than SAM or SGD/Adam, while exhibiting less sensitivity to $\rho$ (Jacobsen et al., 12 Feb 2025).
  • Ablations: Empirical evaluations demonstrate that approximate inner maximization for adversarial perturbations is nearly as accurate as exact projection, and that choice of metric-adjustment matrix has limited effect on final performance. RSAM is robust to auxiliary constraints, such as orthogonality in autoencoders, where Euclidean regularization strategies struggle (Truong et al., 2023).

6. Limitations and Open Directions

Current limitations of Riemannian SAM-type methods include the following:

  • Approximate Inner Maximization: The practical implementations use heuristics for the inner maximization in the tangent space; more accurate manifold-specific solvers (e.g., geodesic searches) remain an open area (Truong et al., 2023).
  • Extension to Quotient Manifolds: While existing RSAM treats embedded manifolds, extension to quotient structures (e.g., the Grassmann manifold) with explicit parallel transport between tangent spaces poses technical challenges.
  • Second-order Corrections: Incorporating Riemannian analogues of curvature- or sharpness-aware corrections beyond first-order remains unexplored (Truong et al., 2023).
  • Empirical Overheads: Although the per-iteration cost is only marginally higher than Euclidean SAM, the overall training cost remains a barrier compared to vanilla SGD, particularly for large-scale applications.

7. Comparison of Methodological Variants

The following table summarizes core distinctions and similarities between Monge SAM, RSAM, and classical SAM, as discussed in the referenced works:

| Method | Geometry | Reparametrization Invariant | Manifold Support |
|---|---|---|---|
| SAM | Euclidean | No | Flat $\mathbb{R}^k$ only |
| Monge SAM (M-SAM) | Loss-induced | Yes | Any model, $\mathbb{R}^k$ |
| RSAM | Manifold | Yes (by construction) | General $\mathcal{M}$ |

Monge SAM is a special case of manifold-aware sharpness-aware minimization where the metric is induced by the loss graph in ambient space (Jacobsen et al., 12 Feb 2025), while RSAM is a general framework for optimizing over arbitrary Riemannian manifolds (Truong et al., 2023).

