
Manifold Constrained Steepest Descent (MCSD)

Updated 3 February 2026
  • MCSD is a unified optimization framework for minimizing smooth or nonsmooth functions under manifold constraints by integrating steepest descent with Riemannian geometry.
  • It employs tangent space projections and retractions to maintain feasibility on diverse manifolds, such as the Stiefel and Grassmann, enabling practical deep learning and signal processing applications.
  • The method guarantees convergence under various settings, adapting to multiobjective, stochastic, and nonconvex problems through norm-induced descent directions and efficient algorithmic updates.

Manifold Constrained Steepest Descent (MCSD) is a general optimization paradigm for minimizing smooth or nonsmooth objective functions subject to manifold constraints, where update steps are restricted to the feasible set described by the manifold geometry. MCSD unifies steepest descent principles, generalized norm-induced descent directions, and Riemannian geometry to enable scalable, projection-based algorithms with theoretical guarantees for equality, inequality, or general nonlinear constraints, including cases such as the Stiefel or Grassmann manifolds, quotient spaces, and multiobjective setups (Yang et al., 29 Jan 2026, Ramesh et al., 2023, A. et al., 27 Feb 2025, Zhang et al., 2021, Alimisis et al., 2022, Huang et al., 2017, Birtea et al., 2017, Ferreira et al., 2019).

1. Mathematical Formulation and General Principles

Let $f:\mathbb{R}^d\to\mathbb{R}$ be a differentiable function and $\mathcal{M}\subset\mathbb{R}^d$ an embedded, smooth manifold. The MCSD method seeks to solve

$$\min_{x\in\mathcal{M}} f(x)$$

by iteratively selecting descent directions $d_t$ in the tangent space $T_{x_t}\mathcal{M}$ and using a retraction or projection operator to maintain feasibility: $$x_{t+1} = \operatorname{Retr}_{x_t}(\alpha_t d_t)$$ where $\operatorname{Retr}_x$ is a retraction map (an approximation of the exponential map) and $\alpha_t>0$ is a stepsize (A. et al., 27 Feb 2025, Yang et al., 29 Jan 2026, Huang et al., 2017).

The Riemannian gradient at $x\in\mathcal{M}$ is the projection of the ambient gradient onto the tangent space: $$\operatorname{grad}_{\mathcal{M}} f(x) = P_{T_x\mathcal{M}}(\nabla f(x)).$$ A general MCSD update is

$$x_{k+1} = \operatorname{Retr}_{x_k}\left(-\alpha_k \operatorname{grad}_{\mathcal{M}} f(x_k)\right)$$

with additional flexibility for norm choices and subproblem formulations, as detailed below (A. et al., 27 Feb 2025, Yang et al., 29 Jan 2026, Birtea et al., 2017).
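
As a concrete instance of this update, the following sketch runs MCSD on the unit sphere, where the tangent projection and the renormalization retraction specialize the general formulas above. The quadratic objective, step size, and iteration count are illustrative assumptions, not details from the cited papers.

```python
import numpy as np

# MCSD on the unit sphere S^{n-1}, minimizing f(x) = x^T A x.
# The minimizer is a unit eigenvector of A for the smallest eigenvalue.
def mcsd_sphere(A, x0, alpha=0.1, iters=500):
    x = x0 / np.linalg.norm(x0)        # start on the manifold
    for _ in range(iters):
        g = 2 * A @ x                  # ambient (Euclidean) gradient
        rg = g - (x @ g) * x           # project onto the tangent space T_x
        x = x - alpha * rg             # steepest-descent step
        x = x / np.linalg.norm(x)      # retraction: renormalize to the sphere
    return x

A = np.diag([3.0, 2.0, 1.0])
x = mcsd_sphere(A, np.array([1.0, 1.0, 1.0]))
# f(x) = x^T A x approaches the smallest eigenvalue of A, here 1.0
```

The renormalization step is the sphere's standard retraction; for other manifolds only the projection and retraction lines change.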

2. Model Variants and Subproblem Structure

MCSD encompasses several instantiations determined by the constraint geometry and norm selection:

  • Equality-constrained MCSD in the 1-norm: For $f:\mathbb{R}^n\to\mathbb{R}$ with $\mathcal{M} = \{x : 1^\top x = y\}$, MCSD seeks

$$\min_{d\in\mathbb{R}^n,\ 1^\top d=0} \nabla f(x)^\top d + \frac{1}{2\alpha}\|d\|_1^2$$

yielding a sparse, two-coordinate update per iteration:

$$d^* = \theta(e_i - e_j), \quad \theta = -\frac{\nabla_i f(x) - \nabla_j f(x)}{L_2}$$

where $i=\arg\max_k \nabla_k f(x)$ and $j=\arg\min_k \nabla_k f(x)$ (Ramesh et al., 2023).
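
A minimal sketch of this two-coordinate update, applied to the illustrative quadratic $f(x)=\tfrac12\|x-c\|^2$ under $1^\top x = y$ (the objective and the constant $L_2$ are assumptions for demonstration, not values from the cited paper):

```python
import numpy as np

# Greedy two-coordinate step for min f(x) s.t. 1^T x = y.
# The direction theta * (e_i - e_j) has zero coordinate sum,
# so every iterate stays on the constraint hyperplane.
def two_coord_step(grad, L2=2.0):
    i = int(np.argmax(grad))
    j = int(np.argmin(grad))
    theta = -(grad[i] - grad[j]) / L2
    d = np.zeros_like(grad)
    d[i], d[j] = theta, -theta
    return d

c = np.array([1.0, 4.0, 2.0, 3.0])
y = 6.0
x = np.full(4, y / 4)                  # feasible start: sum(x) = y
for _ in range(100):
    x = x + two_coord_step(x - c)      # grad f(x) = x - c
# x converges to the projection of c onto {x : 1^T x = y}
```

Each step equalizes the largest and smallest gradient components, so for this quadratic the iterates reach the constrained minimizer after a handful of updates.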

  • Norm-induced MCSD on general manifolds: For a chosen norm $\|\cdot\|$, the steepest descent direction solves

$$d_t = \arg\min_{\|d\|\leq 1,\ d\in T_{x_t}\mathcal{M}} \langle \operatorname{grad}_{\mathcal{M}} f(x_t), d\rangle,$$

followed by projection of $x_t + \alpha_t d_t$ onto $\mathcal{M}$ (Yang et al., 29 Jan 2026).

  • Multiobjective and nonsmooth settings: For a multiobjective function $F=(f_1,\ldots,f_m)$ on $\mathcal{M}$, the pointwise steepest descent direction is

$$v_p = \arg\min_{v\in T_p\mathcal{M}} \left\{ \max_i \langle \operatorname{grad} f_i(p), v\rangle + \frac12\|v\|^2 \right\}$$

(Ferreira et al., 2019). For nonsmooth $f$, smoothing functions $\tilde{f}(x,\mu)$ are constructed and descent is performed on the smoothed objective, with careful control of subdifferentials and limiting stationary points (Zhang et al., 2021).
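
For intuition in the Euclidean special case (a flat manifold) with $m=2$ objectives, this subproblem has the known closed-form reformulation $v_p=-g^*$, where $g^*$ is the minimum-norm element of the convex hull of the two gradients; the sketch below uses that reformulation purely for illustration.

```python
import numpy as np

# Minimum-norm element of conv{g1, g2}: minimize ||g1 + t*(g2 - g1)||
# over t in [0, 1].  The common steepest descent direction is -g_star.
def steepest_common_descent(g1, g2):
    d = g2 - g1
    denom = d @ d
    t = 0.0 if denom == 0 else float(np.clip(-(g1 @ d) / denom, 0.0, 1.0))
    g_star = g1 + t * d
    return -g_star

g1 = np.array([1.0, 0.0])
g2 = np.array([0.0, 1.0])
v = steepest_common_descent(g1, g2)
# v = [-0.5, -0.5]: a descent direction for both objectives,
# since <g1, v> = <g2, v> = -0.5 < 0
```

On a curved manifold the same computation runs in the tangent space $T_p\mathcal{M}$ using the Riemannian gradients and metric.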

3. Tangent Spaces, Retractions, and Projections

A central component of MCSD methods lies in characterizing the tangent space $T_x\mathcal{M}$ and the corresponding projection and retraction operators. Key examples:

  • Stiefel manifold $\operatorname{St}(n,p)=\{X\in\mathbb{R}^{n\times p}: X^\top X=I_p\}$:

$$T_X \operatorname{St} = \{Z : X^\top Z + Z^\top X = 0\}$$

with gradient projection

$$P_X(G) = G - X\,\operatorname{sym}(X^\top G)$$

and retraction via polar decomposition or QR (Birtea et al., 2017, Yang et al., 29 Jan 2026).
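
The Stiefel projection and a QR retraction can be sketched as follows, here driving Riemannian steepest descent on the illustrative cost $f(X)=\operatorname{trace}(X^\top A X)$; the objective, step size, and the sign convention in the QR factor are assumptions of this example, not details from the cited papers.

```python
import numpy as np

# Tangent projection on the Stiefel manifold: P_X(G) = G - X sym(X^T G).
def proj_stiefel(X, G):
    XtG = X.T @ G
    return G - X @ ((XtG + XtG.T) / 2)

# QR retraction: the orthonormal factor of the (thin) QR decomposition,
# with column signs fixed so the map is well defined.
def retract_qr(Y):
    Q, R = np.linalg.qr(Y)
    s = np.sign(np.diag(R))
    s[s == 0] = 1.0
    return Q * s

rng = np.random.default_rng(0)
A = np.diag([5.0, 4.0, 1.0, 0.5])
X = retract_qr(rng.standard_normal((4, 2)))   # random feasible start
for _ in range(300):
    G = 2 * A @ X                              # Euclidean gradient of trace(X^T A X)
    X = retract_qr(X - 0.05 * proj_stiefel(X, G))
# the columns of X converge to a basis of the eigenspace for the
# two smallest eigenvalues of A, so trace(X^T A X) -> 1.0 + 0.5
```

A polar-decomposition retraction could replace `retract_qr` without changing the rest of the loop.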

  • Grassmann manifold $\operatorname{Gr}(n,k)$: Tangent vectors are $G$ such that $X^\top G=0$, and the exponential map or first-order retractions can be used for the update (Alimisis et al., 2022).
  • Quotient manifolds: In blind deconvolution, equivalence classes under scaling yield a quotient manifold geometry. MCSD is implemented by projecting gradients to the horizontal space and retracting via ambient addition (Huang et al., 2017).
  • Bound-constrained scenarios: Under equality and box constraints, MCSD employs $O(n\log n)$ two-pointer sorting and heap strategies to identify and update the most promising directions, with step truncation at variable bounds (Ramesh et al., 2023).

4. Convergence Rates and Theoretical Guarantees

MCSD exhibits multiple regimes of convergence, depending on manifold geometry, function properties (smoothness, strong convexity, Polyak–Łojasiewicz), and algorithmic variant:

  • Dimension-independent linear rates (equality-constrained, 1-norm):

$$f(x_k) - f^* \leq \left(1 - 2\mu_1/L_2\right)^k \bigl[f(x_0) - f^*\bigr]$$

under $L_2$-smoothness on any two coordinates and a proximal Polyak–Łojasiewicz (p-PL) condition in the 1-norm (Ramesh et al., 2023).

  • Global sublinear convergence (Riemannian smoothness):

$$f(x_k) - f^* \leq \frac{L\, d_g(x_0, x^*)^2}{2k}$$

where $L$ is a Lipschitz constant of the Riemannian gradient and $d_g$ denotes geodesic distance (A. et al., 27 Feb 2025). For multiobjective problems, $O(1/\sqrt{N})$ complexity bounds are obtained for stationarity (Ferreira et al., 2019).

  • Exponential convergence on manifolds with local convexity (Grassmannian Rayleigh quotient, block eigenvalue):

$$\operatorname{dist}^2(\mathcal{X}_t, V_\alpha) \leq \left(1 - 2c_Q \cos(\operatorname{dist}(\mathcal{X}_0,V_\alpha))\,\delta\eta\right)^t \operatorname{dist}^2(\mathcal{X}_0, V_\alpha)$$

when the eigengap $\delta>0$, with the rate depending linearly on this gap (Alimisis et al., 2022).

  • Non-Lipschitz and nonconvex settings: Under mild assumptions of lower-semicontinuity and the existence of smoothing functions with gradient sub-consistency, any accumulation point is a limiting stationary point (Zhang et al., 2021).
  • Stochastic settings: With non-Gaussian noise and bounded moments, expected suboptimality decays as $O(1/k^{\gamma-0.5})$ for $\alpha_k\sim 1/k^\gamma$, $1/2<\gamma\leq 1$ (A. et al., 27 Feb 2025).
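
The effect of a schedule $\alpha_k \sim 1/k^\gamma$ can be sketched on a sphere-constrained test problem; the noise model, constants, and objective below are illustrative assumptions, not the setting analyzed in the cited paper.

```python
import numpy as np

# Stochastic MCSD on the unit sphere with decaying steps alpha_k = k^{-gamma}.
def stochastic_mcsd_sphere(A, x0, gamma=0.7, iters=2000, noise=0.1, seed=0):
    rng = np.random.default_rng(seed)
    x = x0 / np.linalg.norm(x0)
    for k in range(1, iters + 1):
        g = 2 * A @ x + noise * rng.standard_normal(x.shape)  # noisy gradient
        rg = g - (x @ g) * x           # tangent projection
        x = x - rg / k**gamma          # alpha_k = 1 / k^gamma
        x = x / np.linalg.norm(x)      # retraction
    return x

A = np.diag([3.0, 2.0, 1.0])
x = stochastic_mcsd_sphere(A, np.ones(3))
# x concentrates near the constrained minimizer despite gradient noise,
# because the decaying steps average the noise out
```

With $1/2<\gamma\leq 1$ the step sum diverges (so the iterates keep making progress) while the step sizes vanish (so the noise is damped), matching the regime in the rate above.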

5. Algorithmic Implementations and Practical Variants

MCSD encompasses a range of practical algorithms adapted to manifold type, cost function, and computational constraints:

  • Exact or first-order retractions: Updates can use the exponential map (geodesic stepping) or approximate retractions (e.g., normalization for the sphere, QR or Cayley transforms for the Stiefel manifold) (Birtea et al., 2017, A. et al., 27 Feb 2025).
  • Norm-induced directions via LMO: In general MCSD (Yang et al., 29 Jan 2026), the step dtd_t solves the LMO problem induced by the chosen norm, followed by projection/retraction to the manifold.
  • Single-loop vs. nested-loop frameworks: MCSD avoids nested tangent-space subproblem solvers, emphasizing single-loop schemes with closed-form or efficient updates (Yang et al., 29 Jan 2026).
  • Spectral-norm MCSD ("SPEL") on Stiefel: The update uses matrix sign computations for both the LMO and projection:

$$X_{t+1} = \operatorname{msign}\left(X_t - \alpha_t\, \operatorname{msign}(\operatorname{grad}_{\mathcal{M}} f(X_t))\right)$$

Implementation employs Polar Express or Newton–Schulz algorithms (Yang et al., 29 Jan 2026).
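
A sketch of the msign primitive: for $M=U\Sigma V^\top$, $\operatorname{msign}(M)=UV^\top$, the orthogonal polar factor. An SVD-based reference and a basic Newton–Schulz iteration are shown below; the cubic coefficients and the spectral-norm prescaling are standard choices assumed here, not details taken from the cited paper.

```python
import numpy as np

# Reference msign via SVD: the orthogonal polar factor U V^T.
def msign_svd(M):
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

# Newton-Schulz sketch: X <- 1.5 X - 0.5 X X^T X drives all singular
# values toward 1 when they start in (0, sqrt(3)); prescaling by the
# spectral norm guarantees that precondition.
def msign_newton_schulz(M, iters=30):
    X = M / np.linalg.norm(M, 2)       # ord=2 gives the spectral norm
    for _ in range(iters):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

rng = np.random.default_rng(1)
M = rng.standard_normal((5, 3))
P = msign_svd(M)
# P has orthonormal columns, and the iteration agrees with the SVD answer
```

The iterative variant avoids an explicit SVD, which is what makes the single-loop SPEL update cheap in high dimensions.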

  • Greedy 2-coordinate MCSD for box constraints: Efficient $O(n\log n)$ per-iteration complexity is achieved via sorted gradient structures and two-pointer methods (Ramesh et al., 2023).
  • Blind deconvolution on quotient manifolds: MCSD operates with spectral initialization, Barzilai–Borwein and Armijo step-size strategies, and explicit quotient-metric handling (Huang et al., 2017).
  • Multiobjective MCSD: Each step involves solving a QP for the maximal directional derivative and supports adaptive, Lipschitz, or Armijo-type step sizes (Ferreira et al., 2019).

6. Applications and Empirical Results

MCSD has demonstrated empirical and computational efficacy in a variety of structured optimization tasks:

  • PCA and Brockett cost minimization: On the Stiefel manifold, MCSD/SPEL matches or outperforms nested-loop manifold Muon-type methods and Riemannian gradient descent, achieving up to $4\times$ wall-clock speedups in high dimensions (Yang et al., 29 Jan 2026).
  • Orthogonality-constrained deep learning: In Wide ResNet-28 on CIFAR-100, SPEL attains competitive test accuracy and epoch times versus custom SGD/Adam variants, maintaining robustness under layerwise scaling (Yang et al., 29 Jan 2026).
  • LLM adapter tuning: Application to Stiefel-constrained LoRA factors with MCSD/SPEL matches specialist optimizers in downstream QA tasks, achieving comparable accuracy with reduced optimizer state (Yang et al., 29 Jan 2026).
  • Blind deconvolution: MCSD on the natural quotient manifold achieves linear convergence up to noise, surpasses Wirtinger and alternating minimization variants, and attains near-optimal sample complexity (Huang et al., 2017).
  • Multiobjective convex and nonconvex benchmarks: Riemannian MCSD rapidly approaches Pareto fronts and outperforms Euclidean variants in both iteration count and function/gradient evaluations (Ferreira et al., 2019).

7. Extensions, Generalizations, and Outlook

MCSD admits numerous extensions:

  • Stochastic and momentum MCSD: Stochastic MCSD variants with Polyak-type or heavy-ball momentum provide provable rates in both deterministic and noisy settings (Yang et al., 29 Jan 2026, A. et al., 27 Feb 2025).
  • Infinite-dimensional Riemannian Hilbert spaces: Weak convergence and $O(1/k^2)$ rates are established for MCSD with momentum and adaptive step sizes (A. et al., 27 Feb 2025).
  • Non-Lipschitz and nonsmooth functions: Smoothing-based MCSD achieves stationarity with respect to the limiting subdifferential, broadening the class of admissible objective functions (Zhang et al., 2021).
  • Manifold selection: MCSD frameworks are implemented for spheres, Stiefel, Grassmann, quotient, positive-definite cone, and other non-Euclidean geometries, each requiring specific projection and retraction operations (Birtea et al., 2017, Alimisis et al., 2022, Huang et al., 2017, Ferreira et al., 2019).
  • Algorithmic complexity: Through tailored updates and data structures, MCSD implementations can achieve computational complexity per iteration that matches or improves upon classical methods—e.g., O(nlogn)O(n\log n) in SVM-dual optimization with box and summation constraints (Ramesh et al., 2023).

In summary, MCSD provides a unified, theoretically grounded, and practically efficient approach to manifold-constrained optimization, blending Riemannian geometry, norm-induced directional search, and application-specific algorithmic engineering (Yang et al., 29 Jan 2026, Ramesh et al., 2023, A. et al., 27 Feb 2025, Zhang et al., 2021, Alimisis et al., 2022, Huang et al., 2017, Birtea et al., 2017, Ferreira et al., 2019).
