Orthogonality-Based Optimizers
- Orthogonality-Based Optimizers are frameworks that enforce orthonormality constraints on matrices (e.g., on the Stiefel manifold) to ensure decorrelation and stability.
- They implement methods like Riemannian gradient descent, structured quasi-Newton, and block coordinate updates for efficient and scalable optimization.
- Applications include eigenvalue computation, neural network training, sparse matrix factorization, and quantum-compatible algorithms, backed by rigorous convergence theories.
Orthogonality-based optimizers are algorithmic frameworks that enforce, exploit, or bias toward orthogonality constraints in continuous optimization problems, most characteristically problems on the Stiefel manifold $\mathrm{St}(n,p) = \{X \in \mathbb{R}^{n \times p} : X^\top X = I_p\}$ and its variants, such as optimization over the orthogonal group, nonnegative orthogonality, or J-orthogonality (hyperbolic signature). These methods are foundational in large-scale eigenproblems, manifold learning, deep neural network training, structured low-rank factorization, and statistical models requiring decorrelated or redundancy-free bases.
1. Mathematical Formulation and Problem Classes
Orthogonality constraints encode the feasible set as the Stiefel manifold (or its specializations) and are ubiquitous in problems including:
- Eigenvalue/eigenvector computation for symmetric/Hermitian matrices and electronic structure problems, where the solution basis must be orthonormal (Hu et al., 2018).
- Neural network weight learning, where imposing $W^\top W = I$ on weight matrices stabilizes gradients and preserves dynamical isometry (Huang et al., 2020, Bu et al., 2022, Kerenidis et al., 2021).
- Sparse and nonnegative matrix factorization, where both sparsity and orthogonality are required (Jiang et al., 2019, Yuan, 2023).
- Low-rank learning with explicit SVD-type decompositions, leveraging orthogonality of factors for adaptive parameter-efficient training (Coquelin et al., 16 Jan 2024).
- Hyperbolic or J-orthogonality in metric learning and knowledge embeddings (He et al., 14 Jun 2024).
Formally, the constraints are realized as $X^\top X = I_p$ (Stiefel), $Q^\top Q = Q Q^\top = I_n$ (orthogonal group), or $X^\top J X = J$ (J-orthogonality), with the tangent space at $X \in \mathrm{St}(n,p)$ given by $T_X \mathrm{St}(n,p) = \{\xi \in \mathbb{R}^{n \times p} : X^\top \xi + \xi^\top X = 0\}$, and equivalent structures for the other settings.
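The following NumPy sketch makes these objects concrete for the Stiefel case: it projects an ambient matrix onto the tangent space at $X$ and maps a step back to the manifold with a thin-QR retraction. The helper names, dimensions, and step size are illustrative and not taken from any cited implementation.

```python
# Minimal sketch of the basic Stiefel-manifold operations defined above (NumPy only).
import numpy as np

def sym(A):
    return 0.5 * (A + A.T)

def project_tangent(X, G):
    """Project an ambient matrix G onto the tangent space of St(n, p) at X,
    i.e. enforce X^T xi + xi^T X = 0 (Euclidean-metric projection)."""
    return G - X @ sym(X.T @ G)

def qr_retraction(X, xi):
    """Map X + xi back onto the Stiefel manifold via a thin QR factorization."""
    Q, R = np.linalg.qr(X + xi)
    # Fix the column-sign ambiguity of QR so the retraction is well defined.
    return Q * np.sign(np.sign(np.diag(R)) + 0.5)

n, p = 8, 3
rng = np.random.default_rng(0)
X, _ = np.linalg.qr(rng.standard_normal((n, p)))   # a point on St(n, p)
G = rng.standard_normal((n, p))                    # an ambient "gradient"
xi = project_tangent(X, G)
X_new = qr_retraction(X, -0.1 * xi)
print(np.allclose(X_new.T @ X_new, np.eye(p)))     # True: orthonormality preserved
```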
2. Riemannian, Quasi-Newton, and Block-Coordinate Optimization on the Stiefel Manifold
Canonical approaches for orthogonality constraints utilize manifold-based methods:
- Riemannian Gradient Descent and Retraction: At each iterate $X_k$, the Riemannian gradient is obtained by projecting the Euclidean gradient $\nabla f(X_k)$ onto the tangent space $T_{X_k}\mathrm{St}(n,p)$; a retraction then maps the update back onto the manifold, using QR, polar, Cayley, or exponential-map retractions (see the sketch after this list). These steps are standard in eigenproblems, manifold learning, and optimization software such as ManOpt (Hu et al., 2018, Siegel, 2019, Kerenidis et al., 2021, Han et al., 18 May 2025).
- Structured Quasi-Newton (SQN) Methods: These preserve orthogonality by exploiting a splitting of the Hessian into a cheap part and an expensive part, approximating only the expensive part with quasi-Newton updates (e.g., L-BFGS) while keeping the cheap part exact. Subproblems are formulated as quadratic models on the manifold and solved in the ambient space, followed by orthonormalization of the iterates. Convergence analysis establishes global convergence and local q-superlinear rates (Hu et al., 2018).
- Block Coordinate Descent (BCD): BCD alternately updates small blocks (e.g., 2 rows or columns) via subproblems with closed-form or efficiently solvable orthogonality-preserving steps (Givens rotations for the orthogonal group, or specialized CS decompositions for J-orthogonality). Recent work extends this to nonsmooth composite and hyperbolic constraints, ensuring convergence to strong block-$k$ stationary points (Shalit et al., 2013, Yuan, 2023, He et al., 14 Jun 2024). Parallelism is inherent, especially in column-wise BCD.
- Parallelizable and Infeasible Methods: PLAM and PCAL avoid costly per-iteration orthonormalization by taking steps in the ambient space, invoking retraction only at the final stage. These methods yield high parallelism and are suitable for large-scale problems (Gao et al., 2018).
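As a concrete instance of the Riemannian gradient descent scheme in the first bullet above, the sketch below maximizes $\operatorname{tr}(X^\top A X)$ over $\mathrm{St}(n,p)$ (a symmetric eigenvalue problem) using tangent-space projection and a QR retraction; the step-size rule, iteration count, and problem sizes are illustrative, not tuned.

```python
# Riemannian gradient descent with QR retraction for max tr(X^T A X) on St(n, p);
# a hedged toy sketch, not a reference implementation of any cited solver.
import numpy as np

def retract_qr(Y):
    """Thin-QR retraction onto St(n, p), with column signs fixed for consistency."""
    Q, R = np.linalg.qr(Y)
    return Q * np.sign(np.sign(np.diag(R)) + 0.5)

rng = np.random.default_rng(1)
n, p, steps = 50, 4, 1000
A = rng.standard_normal((n, n))
A = 0.5 * (A + A.T)                              # symmetric test matrix
eta = 0.5 / np.linalg.norm(A, 2)                 # Lipschitz-informed step size (illustrative)
X = retract_qr(rng.standard_normal((n, p)))      # random feasible starting point

for _ in range(steps):
    G = -2.0 * A @ X                             # Euclidean gradient of f(X) = -tr(X^T A X)
    xi = G - X @ (0.5 * (X.T @ G + G.T @ X))     # Riemannian gradient: project onto tangent space
    X = retract_qr(X - eta * xi)                 # descent step followed by retraction

# The trace should approach the sum of the p largest eigenvalues of A.
print(np.trace(X.T @ A @ X), np.sort(np.linalg.eigvalsh(A))[-p:].sum())
```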
3. Stochastic, Momentum-based, and Global Orthogonality-based Algorithms
Recent algorithms introduce stochasticity, momentum, and new dynamics:
- Stochastic Diffusion on Manifolds: Stochastic differential equations (SDEs) defined on the Stiefel manifold, discretized via structure-preserving schemes (e.g., the Cayley update; see the sketch after this list), are used to escape local minima in highly nonconvex settings. Theoretical results establish convergence to global minimizers in diffusion–descent cycles, and empirical tests show superior exploration in polynomial and combinatorial optimization (Yuan et al., 2017).
- High-order and Momentum-integrated Flows: Variational principles yield second-order (momentum) dynamics for the iterate $X$ and its momentum, respecting the manifold constraint without explicit projection of the momentum. Discretizations such as the Momentum Stiefel Optimizer (MSO) achieve efficient and accurate manifold-aware momentum, empirically outperforming both projected and soft-regularizer baselines. Adaptive, Adam-style extensions are feasible (Kong et al., 2022).
- Feedback-based and Attraction-driven Algorithms: Continuous-time flows such as feedback gradient descent (FGD) and the "landing" algorithm augment standard dynamics with orthogonality-attracting feedback, maintaining proximity to the Stiefel manifold (Bu et al., 2022, Ablin et al., 2023). These methods afford computational speedups by eliminating retractions or orthonormalizations during most iterations.
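A minimal sketch of the structure-preserving Cayley update referenced in the stochastic-diffusion item above: because the generator is skew-symmetric, column orthonormality is preserved exactly at every step, with or without injected noise. The drift, noise scale, and step size are illustrative.

```python
# Cayley update X <- (I - tau/2 W)^{-1} (I + tau/2 W) X with W skew-symmetric,
# applied here with a noisy descent direction to mimic a diffusion-style iteration.
import numpy as np

def cayley_step(X, G, tau, noise_scale=0.0, rng=None):
    n = X.shape[0]
    D = -G                                      # descent direction from the Euclidean gradient G
    if noise_scale > 0.0:
        D = D + noise_scale * rng.standard_normal(D.shape)
    W = D @ X.T - X @ D.T                       # skew-symmetric generator
    I = np.eye(n)
    return np.linalg.solve(I - 0.5 * tau * W, (I + 0.5 * tau * W) @ X)

rng = np.random.default_rng(2)
n, p = 30, 3
A = rng.standard_normal((n, n)); A = 0.5 * (A + A.T)
X, _ = np.linalg.qr(rng.standard_normal((n, p)))
for _ in range(200):
    X = cayley_step(X, -2.0 * A @ X, tau=0.02, noise_scale=0.1, rng=rng)
print(np.linalg.norm(X.T @ X - np.eye(p)))      # tiny: feasibility maintained up to round-off
```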
4. Orthogonality-based Methods in Deep Learning and Large-scale Optimization
Orthogonality-based optimization is central to scalable modern deep learning:
- Layer-wise Orthogonalization and Gradient Pre-processing: Methods such as ONI and Turbo-Muon orthogonalize the weight updates via Newton–Schulz or polynomial iterations, or precondition via almost-orthogonal layer scaling (Huang et al., 2020, Boissin et al., 4 Dec 2025); see the Newton–Schulz sketch after this list. These achieve near-exact orthogonality at controllable computational cost and enable adaptive trade-offs between expressivity and regularization. Turbo-Muon introduces a preconditioning step that accelerates convergence of the polar decomposition, offering up to 2.8× speedup in the Newton–Schulz stage with no loss in model quality (Boissin et al., 4 Dec 2025).
- Gradient Orthogonalization: Orthogonalization of the per-layer gradient matrix ensures diversified filter updates in convolutional or transformer architectures, accelerating convergence, reducing overfitting, and delivering higher empirical accuracy, with mild computational overhead due to SVD or Gram-Schmidt computations (Tuddenham et al., 2022).
- Parameter-efficient Fine-tuning (LoRA, DoRA, OIALR): Explicit Stiefel-manifold optimizers enforce B-matrix orthogonality during LoRA PEFT, yielding maximal effective rank, basis decorrelation, and significantly improved downstream accuracy on LLM benchmarks, outperforming standard AdamW optimizers (Park et al., 25 Aug 2025). OIALR shows that, after an SVD-based decomposition, updating only the singular values suffices once the associated bases stabilize, enabling highly compressed and efficient low-rank neural network training (Coquelin et al., 16 Jan 2024).
- Quantum-compatible Orthogonality: Pyramidal circuit constructions enable forward and backward passes with exact orthogonality at classical cost, and quantum-hardware inference at linear circuit depth (Kerenidis et al., 2021).
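The sketch below isolates the mechanism behind Newton–Schulz-based update orthogonalization used by ONI and Muon-style optimizers discussed above, here with the classic cubic iteration; production variants use tuned higher-order polynomials, fewer iterations, and low-precision arithmetic. The function name, shapes, and iteration count are illustrative.

```python
# Orthogonalize a per-layer update by approximating its orthogonal polar factor
# with a cubic Newton-Schulz iteration (a hedged sketch, not the cited optimizers).
import numpy as np

def newton_schulz_orthogonalize(G, iters=12):
    """Approximate the orthogonal polar factor U V^T of G (shape m x n, m >= n)."""
    Y = G / (np.linalg.norm(G) + 1e-12)        # Frobenius normalization keeps all sigma <= 1
    I = np.eye(G.shape[1])
    for _ in range(iters):
        Y = 0.5 * Y @ (3.0 * I - Y.T @ Y)      # pushes every singular value toward 1
    return Y

rng = np.random.default_rng(3)
G = rng.standard_normal((256, 64))             # stand-in for a layer's gradient/update matrix
U_orth = newton_schulz_orthogonalize(G)
print(np.linalg.norm(U_orth.T @ U_orth - np.eye(64)))   # small: columns are near-orthonormal
# An optimizer would then apply W <- W - lr * U_orth in place of the raw gradient step.
```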
5. Algorithms for Nonsmooth, Composite, and Nonconvex Problems
Advanced orthogonality-based optimizers address nonconvex, composite, or nonsmooth loss functions:
- Constrained ADMM (OADMM, RADMM): ADMM variants for the Stiefel manifold leverage Moreau-smoothed splitting or Riemannian retraction. These methods establish ergodic iteration-complexity bounds for nonsmooth composite objectives and provide the first non-ergodic convergence rates in this class under the Kurdyka-Łojasiewicz (KL) property. Two primal update choices, projection (OADMM-EP) and retraction (OADMM-RR), offer flexibility and strong theoretical backing (Yuan, 24 May 2024).
- Exact Penalty and Projection-Penalty Methods: For nonnegative orthogonality, exact penalty approaches convert the problem into a nonlinear penalty form, ensuring solution equivalence under a sufficiently large penalty parameter, tractable per-iteration cost, and guaranteed convergence to stationary points after a single-step post-processing (Jiang et al., 2019).
- Nonsmooth Block-Coordinate Descent (OBCD, JOBCD): Block-coordinate frameworks (for both classical and J-orthogonality) update small variable blocks per iteration, maintaining feasibility, scaling to high-dimensional nonsmooth settings, and providing convergence to block-stationary points with ergodic sublinear rates or non-ergodic rates under the KL property (Yuan, 2023, He et al., 14 Jun 2024); a toy block-coordinate sketch follows this list.
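As a toy illustration of orthogonality-preserving block-coordinate descent on a nonsmooth composite objective (a sparse-PCA-like problem), the sketch below rotates two randomly chosen rows by a Givens rotation selected with a crude 1-D grid search. The cited OBCD/JOBCD methods instead solve their small subproblems in closed form or with specialized solvers; the objective, block size, and search grid here are illustrative.

```python
# Givens-rotation block-coordinate descent: every step multiplies X by an orthogonal
# matrix acting on two rows, so feasibility on St(n, p) is preserved exactly.
import numpy as np

def objective(X, A, lam):
    """Nonsmooth composite: smooth eigen-term plus an L1 sparsity term."""
    return -np.trace(X.T @ A @ X) + lam * np.abs(X).sum()

def givens_bcd_step(X, A, lam, i, j, n_angles=64):
    """Rotate rows i and j of X by the best angle found on a coarse grid."""
    best_val, best_X = objective(X, A, lam), X
    for t in np.linspace(0.0, 2.0 * np.pi, n_angles, endpoint=False):
        c, s = np.cos(t), np.sin(t)
        Xt = X.copy()
        Xt[[i, j], :] = np.array([[c, -s], [s, c]]) @ X[[i, j], :]
        val = objective(Xt, A, lam)
        if val < best_val:
            best_val, best_X = val, Xt
    return best_X

rng = np.random.default_rng(4)
n, p, lam = 20, 4, 0.1
A = rng.standard_normal((n, n)); A = 0.5 * (A + A.T)
X, _ = np.linalg.qr(rng.standard_normal((n, p)))
for _ in range(200):
    i, j = rng.choice(n, size=2, replace=False)
    X = givens_bcd_step(X, A, lam, int(i), int(j))
print(objective(X, A, lam), np.linalg.norm(X.T @ X - np.eye(p)))  # value decreases; feasibility exact
```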
6. Scalability and Randomized, Parallel Submanifold Approaches
Recent advances focus on reducing the scaling bottlenecks of classic retraction-based methods:
- Randomized Submanifold Methods: Each iteration restricts the update to a randomly selected low-dimensional submanifold embedded in $\mathrm{St}(n,p)$; this allows most computation (projection, retraction) to be performed in the much cheaper low-dimensional setting (see the sketch after this list). High-probability and expected-case convergence rates match full Riemannian approaches up to a dimension-dependent factor, while the per-iteration linear-algebra cost drops from that of a full retraction to that of the small sampled block, and further still for permutation-based submanifold choices (Han et al., 18 May 2025).
- Parallelizable, Deferred Orthonormalization Frameworks: Algorithms such as PLAM/PCAL and block Jacobi coordinate updates admit per-column or per-block parallelism, invoking orthogonality restoration only once at the end or infrequently thereafter, enabling near-linear or even super-linear speedup on modern multicore architectures (Gao et al., 2018, He et al., 14 Jun 2024).
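A hedged sketch of the randomized-submanifold idea: only $r$ sampled rows of $X$ are rotated by an $r \times r$ Cayley factor built from the restricted gradient, so the expensive linear algebra is $r$-dimensional while feasibility on $\mathrm{St}(n,p)$ is preserved exactly. The sampling scheme, subproblem, and step size in the cited work differ in detail; this is one simple realization with illustrative parameters.

```python
# One randomized-submanifold iteration: an r x r Cayley solve replaces the full
# n-dimensional retraction, while the untouched rows of X are left as-is.
import numpy as np

def random_submanifold_step(X, G, r, tau, rng):
    n = X.shape[0]
    S = rng.choice(n, size=r, replace=False)        # random row block (the sampled submanifold)
    D = -G[S, :]                                    # restricted descent direction
    W = D @ X[S, :].T - X[S, :] @ D.T               # r x r skew-symmetric generator
    I = np.eye(r)
    Q = np.linalg.solve(I - 0.5 * tau * W, I + 0.5 * tau * W)   # r x r orthogonal Cayley factor
    X = X.copy()
    X[S, :] = Q @ X[S, :]                           # rotate only the sampled rows
    return X

rng = np.random.default_rng(5)
n, p, r = 200, 5, 20
A = rng.standard_normal((n, n)); A = 0.5 * (A + A.T)
X, _ = np.linalg.qr(rng.standard_normal((n, p)))
for _ in range(500):
    X = random_submanifold_step(X, -2.0 * A @ X, r=r, tau=0.02, rng=rng)
print(np.trace(X.T @ A @ X), np.linalg.norm(X.T @ X - np.eye(p)))  # trace tends to rise; feasibility to round-off
```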
7. Empirical and Theoretical Guarantees; Applications and Limitations
Theoretical analyses for most orthogonality-based optimizers guarantee:
- Global convergence to critical points or block-stationary points under mild smoothness/Lipschitz assumptions; local linear or even q-superlinear rates under suitable second-order conditions (Hu et al., 2018, Siegel, 2019, Yuan, 24 May 2024).
- Oracle complexity and scalability: ergodic iteration-complexity bounds for nonsmooth objectives; sublinear or linear rates in the Riemannian Polyak-Łojasiewicz regime; linear scaling with column- or block-level parallelism (Yuan, 24 May 2024, Yuan, 2023, Han et al., 18 May 2025).
- Empirical superiority: Across Kohn–Sham DFT, large-scale PCA, low-rank vision transformers, sparse PCA, matrix factorization, and LLM fine-tuning, orthogonality-based optimizers deliver either better accuracy, lower runtime, or superior robustness compared to feasible or unconstrained baselines (Gao et al., 2018, Park et al., 25 Aug 2025, Huang et al., 2020, Tuddenham et al., 2022).
Principal limitations identified include:
- Retraction cost: $\mathcal{O}(np^2)$ for a Stiefel retraction via QR or SVD (cubic in the dimension when $p \approx n$), mitigated by random submanifold restriction, blockwise updating, or deferred orthonormalization (Han et al., 18 May 2025, Gao et al., 2018).
- Expressivity–regularization tradeoff: Over-enforcing exact orthogonality can reduce expressivity if the number of orthonormal columns becomes a bottleneck (e.g., in deep or overparameterized networks) (Huang et al., 2020).
- Approximation constraints: “Soft” penalty or feedback approaches induce only approximate orthogonality unless suitably regularized or post-processed (Ablin et al., 2023, Bu et al., 2022); see the sketch after this list.
- Adaptive and second-order acceleration: True manifold-adapted Nesterov or adaptive-momentum methods are less developed; most momentum schemes act in the ambient space with manifold-corrective terms (Siegel, 2019, Kong et al., 2022).
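To make the approximation point above concrete, the sketch below runs a soft, orthogonality-attracting feedback step of the kind used by feedback/landing-style methods: the iterates hover near, but not exactly on, the Stiefel manifold until a final orthonormalization. The specific relative-gradient term, feedback weight, and step size are illustrative and differ from the cited algorithms.

```python
# Soft feedback toward orthogonality: X is never retracted during the loop, so it is
# only approximately feasible; one QR at the end restores exact orthonormality.
import numpy as np

rng = np.random.default_rng(6)
n, p = 30, 3
A = rng.standard_normal((n, n)); A = 0.5 * (A + A.T)
A = A / np.linalg.norm(A, 2)                      # normalize so step sizes are easy to pick
X, _ = np.linalg.qr(rng.standard_normal((n, p)))
eta, lam = 0.02, 10.0                             # illustrative step size and feedback weight

for _ in range(2000):
    G = -2.0 * A @ X                              # Euclidean gradient of f(X) = -tr(X^T A X)
    xi = G - X @ (0.5 * (X.T @ G + G.T @ X))      # tangent-like descent component
    feedback = X @ (X.T @ X - np.eye(p))          # gradient of (1/4) ||X^T X - I||_F^2
    X = X - eta * (xi + lam * feedback)           # no retraction inside the loop

err_before = np.linalg.norm(X.T @ X - np.eye(p))  # small but nonzero: only approximate orthogonality
X, _ = np.linalg.qr(X)                            # single post-processing step restores feasibility
err_after = np.linalg.norm(X.T @ X - np.eye(p))
print(err_before, err_after)
```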
Orthogonality-based optimizers now set the state-of-the-art for a range of large-scale, high-dimensional, or highly structured learning and optimization problems, providing both rigorous convergence guarantees and practical runtime and accuracy benefits.