
Orthogonality-Constrained Update Rules

Updated 20 April 2026
  • Orthogonality-constrained update rules are mechanisms that maintain the feasibility of matrix variables on the Stiefel manifold through Riemannian gradient projections and retractions.
  • They employ diverse strategies like QR retractions, block-coordinate updates, and penalty methods to balance computational efficiency with convergence guarantees in applications such as machine learning and quantum chemistry.
  • Adaptive step sizes, stochastic variants, and quasi-Newton methods further enhance performance, offering tradeoffs between accuracy, scalability, and numerical stability.

Orthogonality-constrained update rules are the foundational mechanisms underlying optimization procedures on matrix manifolds where the variables are restricted to satisfy orthogonality constraints, notably the Stiefel manifold $\mathrm{St}(n, p) = \{ X \in \mathbb{R}^{n \times p} : X^\top X = I_p \}$. Such constraints are pervasive in problems across statistics, machine learning, signal processing, quantum chemistry, and computational physics. The design of update rules that preserve or efficiently control orthogonality is central to the scalability, convergence, and correctness of algorithms in these domains.

1. Riemannian Geometry and Tangent-space Projections

Orthogonality constraints endow the feasible set with a Riemannian manifold structure, which leads to algorithms that perform ascent or descent intrinsically on the manifold. The canonical approach replaces the Euclidean gradient with the Riemannian gradient, computed as the projection of the Euclidean gradient onto the tangent space: $\mathrm{grad}\, f(X) = \nabla f(X) - X \, \mathrm{sym}(X^\top \nabla f(X))$, where $\mathrm{sym}(A) = \frac{1}{2}(A + A^\top)$ and $\nabla f(X)$ denotes the Euclidean gradient (Harandi et al., 2016, Zhu et al., 2018, Han et al., 18 May 2025). Steps are taken in the tangent space at the current iterate, which enforces feasibility to first order.

In Riemannian methods, each iteration typically consists of:

  • Computing the Riemannian gradient or a geometry-consistent search direction.
  • Taking a step along this direction.
  • Applying a retraction, which maps the resulting point back onto the feasible manifold.

Typical retractions include QR-based retractions, polar decompositions, or Cayley transforms. The QR retraction is implemented as follows: given $X$ and a tangent vector $\xi$, set $Y = X + \xi$, compute the thin QR factorization $Y = QR$, and take $Q$ (with sign correction as needed) as the next iterate (Harandi et al., 2016, Zhu et al., 2018, Hu et al., 2018).
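
The following is a minimal NumPy sketch of one such iteration: tangent-space projection of the Euclidean gradient, a step, then a QR retraction. The function names and the toy quadratic objective are illustrative assumptions, not taken from the cited papers.

```python
import numpy as np

def riemannian_grad(X, G):
    """Project the Euclidean gradient G onto the tangent space of St(n, p) at X."""
    return G - X @ (0.5 * (X.T @ G + G.T @ X))

def qr_retraction(Y):
    """Map a full-rank n x p matrix onto St(n, p) via its thin QR factorization."""
    Q, R = np.linalg.qr(Y)
    d = np.sign(np.diag(R))
    d[d == 0] = 1.0                                  # sign correction so the retraction is uniquely defined
    return Q * d

# Toy objective f(X) = -0.5 * tr(X^T A X); its Euclidean gradient is -A X.
rng = np.random.default_rng(0)
n, p = 50, 5
A = rng.standard_normal((n, n)); A = A + A.T
X = qr_retraction(rng.standard_normal((n, p)))       # random feasible starting point

for _ in range(200):
    G = -A @ X                                       # Euclidean gradient
    xi = riemannian_grad(X, G)                       # geometry-consistent search direction
    X = qr_retraction(X - 0.01 * xi)                 # step in the tangent space, then retract

print(np.linalg.norm(X.T @ X - np.eye(p)))           # ~1e-15: orthogonality preserved at every iterate
```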

2. Structured Update Rules and Algorithmic Variants

Several algorithmic families have been developed to exploit the manifold structure:

  • Manifold SGD and Backpropagation: In deep learning, Stiefel layers enforce orthogonality on fully connected or convolutional filters by integrating the Riemannian projection and retraction into the SGD or backpropagation routines. This results in update rules that maintain $W^\top W = I$ throughout training (Harandi et al., 2016).
  • Block-Coordinate and Randomized Updates: Block coordinate descent (BCD) methods update a subset (block) of rows or columns and resolve the subproblem on the corresponding lower-dimensional Stiefel manifold, e.g., OBCD (Yuan, 2023). Randomized submanifold methods apply local manifold optimization on randomly chosen subspaces, greatly reducing per-iteration complexity and enabling scalability for large $n$ and $p$ (Han et al., 18 May 2025).
  • Augmented Lagrangian and Penalty Methods: Proximal linearized augmented Lagrangian (PLAM) or column-wise block minimization (PCAL) methods replace expensive per-iteration retraction with a penalty term and update the dual variables in closed form. Iterates remain close to the manifold and a final retraction provides exact feasibility (Gao et al., 2018).
  • Non-feasible ("Infeasible") and Landing Algorithms: The landing algorithm introduces an "attracting field" penalizing deviations from orthogonality so that iterates are driven toward the manifold, while ensuring the limiting point and convergence rate coincide with those of exact manifold methods. The update has the form

grad f(X)=∇f(X)−X⋅sym(X⊤∇f(X)),\mathrm{grad} \, f(X) = \nabla f(X) - X \cdot \mathrm{sym}(X^\top \nabla f(X)),2

where grad f(X)=∇f(X)−X⋅sym(X⊤∇f(X)),\mathrm{grad} \, f(X) = \nabla f(X) - X \cdot \mathrm{sym}(X^\top \nabla f(X)),3 is the off-manifold Riemannian gradient (Ablin et al., 2023).

  • Augmented ADMM and Variants for Nonsmooth Problems: Algorithms such as OADMM apply ADMM splitting to structured optimization under orthogonality, using either exact projection onto the Stiefel manifold via SVD or retraction with a Riemannian gradient. Both variants ensure $X^\top X = I_p$ is satisfied after each update (Yuan, 2024).
  • Quasi-Newton and Second-order Methods: Quasi-Newton frameworks (e.g., LSR1, L-BFGS) approximate the Riemannian Hessian or exploit the structure of the problem (e.g., cheap/high-cost Hessian splits) and resolve the subproblem with orthogonality-preserving solvers (Hu et al., 2018).
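
As referenced in the landing item above, here is a minimal sketch of a landing-style update under the form stated there. The choice of off-manifold gradient extension and the step sizes are simplified assumptions, not the exact construction of (Ablin et al., 2023).

```python
import numpy as np

def landing_step(X, G, eta=0.1, lam=1.0):
    """One landing-style update: no retraction; an attracting field pulls X toward St(n, p)."""
    rel_grad = G - X @ (0.5 * (X.T @ G + G.T @ X))    # off-manifold extension of the Riemannian gradient
    attract = X @ (X.T @ X - np.eye(X.shape[1]))      # gradient of N(X) = (1/4) ||X^T X - I||_F^2
    return X - eta * (rel_grad + lam * attract)

rng = np.random.default_rng(0)
n, p = 50, 5
A = rng.standard_normal((n, n)); A = (A + A.T) / np.linalg.norm(A + A.T, 2)   # normalized symmetric matrix
X = 0.1 * rng.standard_normal((n, p))                 # deliberately infeasible starting point

for _ in range(2000):
    X = landing_step(X, -A @ X)                       # Euclidean gradient of f(X) = -0.5 * tr(X^T A X)

print(np.linalg.norm(X.T @ X - np.eye(p)))            # small: iterates "land" near the manifold;
                                                      # a final retraction restores exact feasibility
```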

3. Multiplicative and Nonnegative-Orthogonality Updates

In the context of nonnegative matrix factorization and clustering, orthogonality is imposed alongside nonnegativity. Multiplicative update rules preserve nonnegativity automatically and are modified by adding orthogonality penalties to the objective, e.g., for a bi-orthogonal tri-factorization,

$$\min_{B,\, S,\, C \ge 0} \; \|A - B S C^\top\|_F^2 + \frac{\alpha}{2} \|B^\top B - I\|_F^2 + \frac{\beta}{2} \|C^\top C - I\|_F^2 .$$

These rules, together with majorization-minimization arguments, guarantee convergence to stationary points under increasing penalty parameters (Mirzal, 2017).
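
Below is a generic sketch of a penalized multiplicative update of this kind. The two-factor model $A \approx WH$, the choice of penalizing only $W$, and the specific numerator/denominator split are illustrative assumptions, not the exact update rules of (Mirzal, 2017).

```python
import numpy as np

def mu_step_orth(A, W, H, alpha=1.0, eps=1e-12):
    """One multiplicative update for W under ||A - W H||_F^2 + (alpha/2) ||W^T W - I||_F^2.

    The gradient is split into nonnegative numerator and denominator parts, so the elementwise
    ratio keeps W >= 0 automatically while alpha penalizes deviations of W^T W from the identity.
    """
    numer = A @ H.T + alpha * W
    denom = W @ (H @ H.T) + alpha * W @ (W.T @ W) + eps
    return W * (numer / denom)

rng = np.random.default_rng(0)
A = rng.random((100, 40))                              # nonnegative data matrix
W, H = rng.random((100, 5)), rng.random((5, 40))
for _ in range(500):
    W = mu_step_orth(A, W, H, alpha=10.0)
    H = H * ((W.T @ A) / (W.T @ W @ H + 1e-12))        # standard multiplicative update for H
print(np.linalg.norm(W.T @ W - np.eye(5)))             # orthogonality violation, controlled by alpha
```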

Support-set algorithms for nonnegative and orthogonality constraints exploit the property that every feasible $X$ with $X \ge 0$ and $X^\top X = I_p$ has at most one positive entry per row. A block-wise support update, coupled with closed-form column updates, ensures all iterates remain feasible (Wang et al., 5 Nov 2025).
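
To illustrate that structural property, the sketch below builds a feasible nonnegative orthogonal matrix from a row-to-column assignment and checks both constraints. The helper `feasible_from_support` and the particular assignment are hypothetical illustrations, not the algorithm of (Wang et al., 5 Nov 2025).

```python
import numpy as np

def feasible_from_support(n, p, assign, values):
    """Build X >= 0 with X^T X = I_p: row i supports only column assign[i] (or no column if -1)."""
    X = np.zeros((n, p))
    for i, j in enumerate(assign):
        if j >= 0:
            X[i, j] = values[i]
    norms = np.linalg.norm(X, axis=0)
    X[:, norms > 0] /= norms[norms > 0]         # unit columns; disjoint supports give zero off-diagonals
    return X

rng = np.random.default_rng(0)
n, p = 12, 3
assign = np.arange(n) % p                        # rows cycle through columns, so no column is empty
assign[0] = -1                                   # a row may also support no column at all
X = feasible_from_support(n, p, assign, rng.random(n) + 0.1)
print(np.allclose(X.T @ X, np.eye(p)),           # orthonormal columns
      bool((X >= 0).all()),                      # nonnegativity
      int(np.count_nonzero(X, axis=1).max()))    # at most one positive entry per row -> 1
```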

4. Adaptive Step Size, Retraction, and Complexity Considerations

Adaptive step size strategies have been developed to eliminate the need for computationally expensive backtracking line searches. For the Stiefel or Grassmann setting, the step is computed from an explicit local Taylor expansion of the objective along the search direction, using function and Hessian information. This leads to provable convergence and significant wall-clock time savings, especially in electronic structure problems (Dai et al., 2019).
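
A minimal sketch of this idea follows: the step minimizes a local quadratic model along the search direction, using a Hessian-vector product. The function `taylor_step_size` and the fallback cap are assumptions for illustration, not the exact formula of (Dai et al., 2019).

```python
import numpy as np

def taylor_step_size(X, xi, grad_f, hess_vec, cap=1.0, floor=1e-12):
    """Step size minimizing the quadratic model t -> f(X) - t <g, xi> + (t^2 / 2) <xi, H[xi]>."""
    g = grad_f(X)
    curvature = np.tensordot(xi, hess_vec(X, xi))      # <xi, Hess f(X)[xi]>
    if curvature <= floor:                             # non-convex model along xi: fall back to a cap
        return cap
    return min(np.tensordot(g, xi) / curvature, cap)

# Usage on a toy quadratic f(X) = 0.5 * tr(X^T A X) with A positive semidefinite.
rng = np.random.default_rng(0)
n, p = 50, 5
B = rng.standard_normal((n, n)); A = B.T @ B
X, _ = np.linalg.qr(rng.standard_normal((n, p)))
grad_f = lambda X: A @ X                               # Euclidean gradient
hess_vec = lambda X, xi: A @ xi                        # Euclidean Hessian-vector product
G = grad_f(X)
xi = G - X @ (0.5 * (X.T @ G + G.T @ X))               # Riemannian gradient as the descent direction
print(taylor_step_size(X, xi, grad_f, hess_vec))
```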

Retraction dominates the computational cost of feasible manifold methods. Randomized and block submanifold updates can reduce per-iteration complexity from the $O(np^2)$ cost of a full QR retraction to a cost governed by the block size $r \ll n$, allowing scalable large-dimensional optimization (Han et al., 18 May 2025).
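
The sketch below illustrates why such block updates are cheap and feasibility-preserving: left-multiplying a selected block of rows of a feasible $X$ by a small orthogonal matrix leaves $X^\top X$ unchanged, so only an $r \times r$ orthogonal factor has to be produced per iteration. The construction is an illustration of the mechanism, not the exact update of (Han et al., 18 May 2025) or (Yuan, 2023).

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, r = 1000, 20, 50
X, _ = np.linalg.qr(rng.standard_normal((n, p)))         # feasible point on St(n, p)

rows = rng.choice(n, size=r, replace=False)              # randomly selected block of r rows
Q, _ = np.linalg.qr(rng.standard_normal((r, r)))         # r x r orthogonal mixer: O(r^3) work, not O(n p^2)
X[rows, :] = Q @ X[rows, :]                              # only the selected rows change

print(np.linalg.norm(X.T @ X - np.eye(p)))               # ~1e-14: feasibility preserved exactly
```

In an actual method, the small orthogonal factor would be chosen (e.g., by a retraction-based step on the $r \times r$ orthogonal group) to decrease the objective restricted to the selected block, rather than sampled at random as in this sketch.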

5. Stochastic, Variance-reduced, and Distributed Updates

Stochastic (SGD-type) and variance-reduction (e.g., SAGA, SVRG) variants of manifold methods use unbiased or variance-reduced Riemannian gradients. Recent advances enable the same sample complexity and convergence rates as full Riemannian updates while avoiding full retraction at each iteration (Ablin et al., 2023). In online PCA, implicit and distributed updates eliminate the explicit orthonormalization step entirely and lead to minibatch, parallelizable learning schemes (Amid et al., 2019). Empirical results demonstrate that, for a wide class of machine-learning tasks, these infeasible but manifold-attracting updates offer strict control over orthogonality violations while maintaining competitive accuracy and time-to-solution compared to classical feasible methods (Ablin et al., 2023).

6. Convergence Theory and Global/Local Rates

A broad spectrum of theoretical guarantees exists for orthogonality-constrained update rules.

  • Feasible Riemannian gradient methods on compact manifolds with L-smooth objectives satisfy non-asymptotic sublinear rates, e.g., $\min_{k \le K} \|\mathrm{grad}\, f(X_k)\|^2 = O(1/K)$ (Han et al., 18 May 2025).
  • Accelerated first-order methods (Nesterov-type) achieve accelerated local convergence rates under local strong convexity and smoothness; global sublinear convergence is always guaranteed (Siegel, 2019).
  • For nonsmooth and nonconvex composite objectives, ADMM variants guarantee sublinear ergodic convergence to approximate stationarity, with sharper non-ergodic rates under KL assumptions (Yuan, 2024, Yuan, 2023).
  • Infeasible, landing-type algorithms have provable Lyapunov descent and matching rates with exact Riemannian analogues, with explicit polynomial complexity for target stationarity (Ablin et al., 2023).
  • Quasi-Newton and low-memory methods can achieve local q-superlinear convergence so long as the Hessian approximations satisfy secant constraints and are accurate in critical directions (Hu et al., 2018).

7. Applications and Algorithmic Tradeoffs

Orthogonality-constrained optimization is central to:

  • Principal component analysis and semiparametric dimension reduction.
  • Deep networks with orthogonally constrained (Stiefel) layers.
  • Nonnegative matrix factorization and clustering under orthogonality constraints.
  • Electronic structure calculations in quantum chemistry and computational physics.

Algorithmic choices depend on balancing feasibility (exact or asymptotic), per-iteration cost, scalability, numerical stability, and required stationarity accuracy. Recent advances focus on scalable, parallelizable, and randomized methods, with explicit control of feasibility and strong theoretical convergence results.


References:

  • (Harandi et al., 2016) Generalized BackPropagation, Étude De Cas: Orthogonality
  • (Zhu et al., 2018) orthoDr: Semiparametric Dimension Reduction via Orthogonality Constrained Optimization
  • (Hu et al., 2018) Structured Quasi-Newton Methods for Optimization with Orthogonality Constraints
  • (Mirzal, 2017) A Convergent Algorithm for Bi-orthogonal Nonnegative Matrix Tri-Factorization
  • (Gao et al., 2018) Parallelizable Algorithms for Optimization Problems with Orthogonality Constraints
  • (Dai et al., 2019) Adaptive Step Size Strategy for Orthogonality Constrained Line Search Methods
  • (Siegel, 2019) Accelerated Optimization With Orthogonality Constraints
  • (Amid et al., 2019) An Implicit Form of Krasulina's k-PCA Update without the Orthonormality Constraint
  • (Ablin et al., 2023) Infeasible Deterministic, Stochastic, and Variance-Reduction Algorithms for Optimization under Orthogonality Constraints
  • (Yuan, 2023) A Block Coordinate Descent Method for Nonsmooth Composite Optimization under Orthogonality Constraints
  • (Yuan, 2024) ADMM for Nonsmooth Composite Optimization under Orthogonality Constraints
  • (Han et al., 18 May 2025) Efficient Optimization with Orthogonality Constraint: a Randomized Riemannian Submanifold Method
  • (Wang et al., 5 Nov 2025) A Support-Set Algorithm for Optimization Problems with Nonnegative and Orthogonal Constraints
  • (Shustin et al., 2021) Faster Randomized Methods for Orthogonality Constrained Problems
