Iterative Cayley Retraction Overview

Updated 17 June 2026
  • Iterative Cayley retraction is a computational method for enforcing orthonormality in Riemannian optimization by leveraging fixed-point iterations to avoid high-cost matrix inversions.
  • It recasts retraction as a fixed-point update, reducing complexity from O(n^3) to O(np^2) and proving highly effective in large-scale, non-Euclidean optimization tasks.
  • Adaptive and generalized Cayley parametrizations further enhance numerical stability and allow integration with standard Euclidean solvers in deep learning and matrix analysis.

The iterative Cayley retraction is a computational technique for Riemannian optimization over the Stiefel manifold that enables efficient enforcement of orthonormality constraints on matrix parameters. Leveraging the Cayley transform, this approach provides a numerically effective alternative to classical retraction methods such as QR or polar decompositions, with significant advantages in computational scaling, storage, and practical implementation. Iterative Cayley retractions have been further generalized and localized using adaptive and chart-based parametrizations to enhance both robustness and efficiency in large-scale and non-Euclidean optimization tasks.

1. Mathematical Foundations: The Stiefel Manifold and Retractions

The real Stiefel manifold $\operatorname{St}(n, p)$ is defined as the set of $n \times p$ matrices with orthonormal columns:

$$\operatorname{St}(n, p) = \left\{ X \in \mathbb{R}^{n \times p} : X^\top X = I_p \right\}, \quad n \geq p.$$

Tangent vectors $Z$ at $X \in \operatorname{St}(n,p)$ satisfy $X^\top Z + Z^\top X = 0$, and tangents can be written as $\Delta = W X$ for a skew-symmetric $W \in \mathbb{R}^{n \times n}$. A retraction $R_X : T_X\operatorname{St}(n,p) \to \operatorname{St}(n,p)$ is a smooth map agreeing with the exponential map to first order but with lower computational complexity (2002.01113).
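The definitions above can be checked numerically. The following is a minimal NumPy sketch, not taken from the cited papers: the helper names `random_stiefel`, `project_tangent`, and `skew_generator` are hypothetical, and the particular choice of $W$ (built from $B = Z - \tfrac{1}{2} X X^\top Z$) is one convenient skew-symmetric generator among several that satisfy $WX = Z$.

```python
import numpy as np

def random_stiefel(n, p, rng):
    """Draw a point on St(n, p) from the thin QR factor of a Gaussian matrix."""
    q, _ = np.linalg.qr(rng.standard_normal((n, p)))
    return q

def project_tangent(x, z):
    """Project z onto the tangent space at x by removing x @ sym(x^T z)."""
    xtz = x.T @ z
    return z - x @ ((xtz + xtz.T) / 2)

def skew_generator(x, z):
    """Return a skew-symmetric W with W @ x == z for a tangent vector z (one possible choice)."""
    b = z - 0.5 * x @ (x.T @ z)
    return b @ x.T - x @ b.T

rng = np.random.default_rng(0)
n, p = 8, 3
X = random_stiefel(n, p, rng)
Z = project_tangent(X, rng.standard_normal((n, p)))
W = skew_generator(X, Z)

orth_err = np.linalg.norm(X.T @ X - np.eye(p))     # checks X^T X = I_p
tangent_err = np.linalg.norm(X.T @ Z + Z.T @ X)    # checks X^T Z + Z^T X = 0
skew_err = np.linalg.norm(W + W.T)                 # checks W^T = -W
generator_err = np.linalg.norm(W @ X - Z)          # checks Delta = W X
print(orth_err, tangent_err, skew_err, generator_err)
```

All four residuals should be at the level of floating-point round-off.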

2. Classical and Iterative Cayley Retraction

Given a tangent vector $\eta \in T_X\operatorname{St}(n,p)$, a canonical skew-symmetric generator is

$$A = \eta X^\top - X \eta^\top,$$

with $A \in \mathbb{R}^{n \times n}$ and $A^\top = -A$. The Cayley retraction then writes, for step size $\alpha > 0$ and $X \in \operatorname{St}(n,p)$,

$$Y(\alpha) = \left(I - \tfrac{\alpha}{2} A\right)^{-1} \left(I + \tfrac{\alpha}{2} A\right) X,$$

guaranteeing $Y(\alpha)^\top Y(\alpha) = I_p$ and $Y(0) = X$. This closed form, however, involves the inversion of an $n \times n$ matrix, imposing prohibitive $O(n^3)$ costs for large $n$ (2002.01113).
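As an illustration of the closed form, here is a minimal NumPy sketch that assumes the standard Cayley update $Y(\alpha) = (I - \tfrac{\alpha}{2}A)^{-1}(I + \tfrac{\alpha}{2}A)X$ with $A = \eta X^\top - X\eta^\top$. The function name `cayley_retraction` is hypothetical, and the dense $n \times n$ solve is exactly the expensive step discussed above.

```python
import numpy as np

def cayley_retraction(x, eta, alpha):
    """Closed-form Cayley retraction via an explicit n x n linear solve (illustrative sketch)."""
    n = x.shape[0]
    a = eta @ x.T - x @ eta.T                 # skew-symmetric generator A
    lhs = np.eye(n) - 0.5 * alpha * a         # always invertible because A is skew-symmetric
    rhs = (np.eye(n) + 0.5 * alpha * a) @ x
    return np.linalg.solve(lhs, rhs)          # dense solve, O(n^3)

rng = np.random.default_rng(0)
n, p = 30, 4
X, _ = np.linalg.qr(rng.standard_normal((n, p)))
G = rng.standard_normal((n, p))
eta = G - X @ ((X.T @ G + G.T @ X) / 2)       # tangent vector at X

Y = cayley_retraction(X, eta, alpha=0.3)
orth_err = np.linalg.norm(Y.T @ Y - np.eye(p))               # orthonormality is preserved
zero_step_err = np.linalg.norm(cayley_retraction(X, eta, 0.0) - X)
print(orth_err, zero_step_err)
```

Because the Cayley transform of a skew-symmetric matrix is orthogonal, `orth_err` stays at round-off level for any step size.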

The iterative Cayley retraction circumvents high-cost inversion by recasting the update as a fixed-point equation:

$$Y(\alpha) = X + \tfrac{\alpha}{2} A \left(X + Y(\alpha)\right),$$

which is solved for $Y(\alpha)$ via a small number $s$ of inner iterations. This yields an $O(np^2)$-cost update by exploiting the low-rank structure of $A$, making the method highly competitive for large $n$ compared to QR, polar/SVD, or closed-form Cayley ($O(n^3)$) retractions. Empirically, two inner fixed-point steps ($s = 2$) suffice for high accuracy (2002.01113).
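The fixed-point form can be sketched as follows, assuming the equation $Y = X + \tfrac{\alpha}{2} A (X + Y)$ with $A = \eta X^\top - X \eta^\top$. The names `apply_generator` and `iterative_cayley` are hypothetical, and the first-order initial guess $Y^{(0)} = X + \alpha A X$ is an assumption of this sketch. The product $A M$ is computed from the two $n \times p$ factors, so $A$ is never formed inside the iteration; the dense solve appears only to produce a reference value.

```python
import numpy as np

def apply_generator(x, eta, m):
    """Compute A @ m for A = eta x^T - x eta^T without forming the n x n matrix A."""
    return eta @ (x.T @ m) - x @ (eta.T @ m)

def iterative_cayley(x, eta, alpha, s=2):
    """Approximate the Cayley retraction with s fixed-point sweeps (illustrative sketch)."""
    y = x + alpha * apply_generator(x, eta, x)      # assumed initial guess Y^0 = X + alpha A X
    for _ in range(s):
        y = x + 0.5 * alpha * apply_generator(x, eta, x + y)
    return y

rng = np.random.default_rng(0)
n, p = 50, 5
X, _ = np.linalg.qr(rng.standard_normal((n, p)))
G = rng.standard_normal((n, p))
eta = G - X @ ((X.T @ G + G.T @ X) / 2)             # tangent vector at X

A = eta @ X.T - X @ eta.T                           # dense A, used only for the reference solution
alpha = 0.2 / np.linalg.norm(A, 2)                  # gives contraction factor alpha * ||A||_2 / 2 = 0.1
Y_exact = np.linalg.solve(np.eye(n) - 0.5 * alpha * A, (np.eye(n) + 0.5 * alpha * A) @ X)

errors = [np.linalg.norm(iterative_cayley(X, eta, alpha, s) - Y_exact) for s in range(4)]
print(errors)
```

With this step size each extra sweep shrinks the distance to the closed-form result by roughly a factor of ten.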

3. Generalized and Adaptive Cayley Parametrizations

Recent research extends the Cayley retraction to generalized and adaptive schemes suitable for broader subclasses of Stiefel-type optimization problems. The generalized Cayley map uses a center point to parameterize an open dense subset of the Stiefel manifold by a vector space; its explicit form is given in Kume et al. (2023). The inverse map is a diffeomorphism from that vector space back onto the parameterized subset of the manifold, acting as a retraction (Kume et al., 2023, Kume et al., 2023).

This strategy allows any Euclidean optimization algorithm to be applied in the parameter vector space. When iterates approach the singular-point set of the current chart, an adaptive scheme "re-centers" at a new center point chosen (e.g., via SVD of the new iterate) to maintain numerical stability and efficiency. Such adaptivity can eliminate slow convergence associated with poor center choice in naive Cayley parametrization (Kume et al., 2023).
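To make the idea of a centered chart and re-centering concrete, the sketch below uses the classical Cayley transform on square orthogonal matrices ($p = n$). This is a deliberate simplification and is not the generalized map of Kume et al.; the names `to_chart`, `from_chart`, and `recenter_threshold` are hypothetical, and the norm-based trigger is only an assumed stand-in for the papers' re-centering rule.

```python
import numpy as np

def to_chart(center, u):
    """Map an orthogonal u to skew-symmetric coordinates in the chart centred at `center`."""
    q = center.T @ u
    eye = np.eye(q.shape[0])
    return (eye - q) @ np.linalg.inv(eye + q)       # undefined when q has eigenvalue -1

def from_chart(center, v):
    """Inverse map: skew-symmetric coordinates v back to an orthogonal matrix."""
    eye = np.eye(v.shape[0])
    return center @ (eye - v) @ np.linalg.inv(eye + v)

rng = np.random.default_rng(0)
center, _ = np.linalg.qr(rng.standard_normal((4, 4)))
v = rng.standard_normal((4, 4))
v = v - v.T                                         # arbitrary skew-symmetric coordinates

u = from_chart(center, v)
orth_err = np.linalg.norm(u.T @ u - np.eye(4))      # the chart always lands on the manifold
roundtrip_err = np.linalg.norm(to_chart(center, u) - v)

recenter_threshold = 1.0                            # assumed trigger, not from the papers
if np.linalg.norm(v, 2) > recenter_threshold:
    center = u                                      # re-center at the current point
    v = to_chart(center, u)                         # coordinates of u in the new chart
recentered_norm = np.linalg.norm(v)
print(orth_err, roundtrip_err, recentered_norm)
```

Large coordinate norms signal that the point is drifting toward the chart's singular set; after re-centering, the same point has coordinates at the origin of the new chart, where the parametrization is best conditioned.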

4. Algorithmic Schemes and Computational Complexity

For iterative Cayley retraction, each update involves:

  1. Momentum calculation: update the momentum from its previous value and the current gradient
  2. Tangent-space projection: compute the skew-symmetric generator $A$ from the skew part of the projected momentum
  3. Step-size selection: choose $\alpha$ small enough to ensure contraction of the fixed-point map
  4. $s$ fixed-point iterations: initialize $Y^{(0)}$, update $Y^{(i)} = X + \tfrac{\alpha}{2} A \left(X + Y^{(i-1)}\right)$ for $i = 1, \dots, s$, and set the next iterate to $Y^{(s)}$
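The four steps can be sketched as a single update function. This is a hedged illustration rather than the algorithm as published: the function name `cayley_sgd_step`, the hyperparameters `lr`, `beta`, `q`, `eps`, the heavy-ball momentum rule, the factored generator $W = B X^\top - X B^\top$ with $B = M - \tfrac{1}{2} X X^\top M$, and the step-size cap are all assumptions chosen so that the contraction factor $\alpha \|W\|_2 / 2$ stays below `q`. The toy objective $f(X) = -\tfrac{1}{2}\operatorname{tr}(X^\top C X)$ is likewise only for demonstration.

```python
import numpy as np

def cayley_sgd_step(x, m, grad, lr=0.05, beta=0.5, q=0.5, s=2, eps=1e-8):
    """One momentum step on St(n, p) with an iterative Cayley retraction (sketch)."""
    m = beta * m - grad                          # 1. momentum update (assumed heavy-ball form)
    b = m - 0.5 * x @ (x.T @ m)                  # 2. generator W = b x^T - x b^T, kept factored

    def apply_w(z):
        return b @ (x.T @ z) - x @ (b.T @ z)     # W @ z without forming the n x n matrix

    m = apply_w(x)                               # momentum projected onto the tangent space
    w_bound = 2.0 * np.linalg.norm(b, 2)         # upper bound on ||W||_2 when ||x||_2 is about 1
    alpha = min(lr, 2.0 * q / (w_bound + eps))   # 3. keeps alpha * ||W||_2 / 2 at or below q
    y = x + alpha * m                            # 4. initial guess, then s fixed-point sweeps
    for _ in range(s):
        y = x + 0.5 * alpha * apply_w(x + y)
    return y, m

rng = np.random.default_rng(0)
n, p = 20, 3
C = np.diag(np.linspace(0.0, 1.0, n))            # toy symmetric matrix

def loss(x):
    return -0.5 * np.trace(x.T @ C @ x)

X, _ = np.linalg.qr(rng.standard_normal((n, p)))
M = np.zeros((n, p))
loss_start = loss(X)
for _ in range(100):
    X, M = cayley_sgd_step(X, M, grad=-C @ X)    # Euclidean gradient of the toy loss
loss_end = loss(X)
orth_err = np.linalg.norm(X.T @ X - np.eye(p))
print(loss_start, loss_end, orth_err)
```

Because only $s$ sweeps are taken, orthonormality holds approximately rather than exactly, and the residual `orth_err` grows slowly with the number of steps; smaller step sizes or more sweeps reduce it.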

For the generalized Cayley parametrization, the descent is performed in the chart's vector space, with Armijo line search, and recentering when parameter norms indicate approach to chart singularities (Kume et al., 2023, Kume et al., 2023).

Per-iteration cost matches or improves upon that of QR and polar retractions, and avoids the explicit tangent-vector transports required by classical Riemannian CG/quasi-Newton methods. The Cayley-parametrization strategy stays entirely in one vector space between re-centering steps, simplifying the use of advanced Euclidean solvers such as accelerated gradients, conjugate gradient (CG), and BFGS without additional vector transport (Kume et al., 2023).

5. Convergence Properties and Theoretical Guarantees

For the iterative Cayley retraction, the fixed-point iteration is a contraction when the step size $\alpha$ is sufficiently small relative to $1/\|A\|$, and the error of the inner iterates decays superlinearly. Under a standard Lipschitz-gradient assumption, the Cayley SGD algorithm achieves a sublinear convergence rate on the Stiefel manifold (2002.01113).
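The contraction property can be seen from a short calculation, sketched here under the assumption that the inner iteration has the form $Y^{(i)} = X + \tfrac{\alpha}{2} A (X + Y^{(i-1)})$ with exact fixed point $Y^\star$. The resulting bound is a sufficient condition derived for this sketch; the condition stated in the cited paper may be more conservative.

```latex
\begin{aligned}
Y^{(i)} &= X + \tfrac{\alpha}{2} A \left(X + Y^{(i-1)}\right),
\qquad
Y^\star = X + \tfrac{\alpha}{2} A \left(X + Y^\star\right), \\
Y^{(i)} - Y^\star &= \tfrac{\alpha}{2} A \left(Y^{(i-1)} - Y^\star\right), \\
\left\| Y^{(i)} - Y^\star \right\|_F
&\le \left( \frac{\alpha \,\|A\|_2}{2} \right)^{i} \left\| Y^{(0)} - Y^\star \right\|_F .
\end{aligned}
```

The map is therefore a contraction whenever $\alpha \|A\|_2 < 2$, and each inner sweep reduces the error by at least the factor $\alpha \|A\|_2 / 2$.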

The adaptive and localized Cayley approaches extend these guarantees: under smoothness of the objective, Lipschitz gradients, and bounded step sizes, every limit point of the iterates is stationary on the Stiefel manifold, i.e., the Riemannian gradient vanishes there. This is a standard "liminf of the gradient norm is zero" stationarity result (Kume et al., 2023, Kume et al., 2023). The equivalence of stationarity conditions between the chart space and the manifold is formalized via gradient-chart correspondence theorems (Kume et al., 2023).

6. Empirical Performance and Applications

In practical deep learning and matrix optimization tasks, the iterative Cayley retraction offers competitive or superior empirical performance. For convolutional neural networks (CNNs) on CIFAR10/CIFAR100 using Wide ResNet-28-10, Cayley SGD and Cayley ADAM achieved competitive error rates, with per-epoch training time considerably lower than QR, polar, or closed-form Cayley retractions. For unitary RNNs, iterative Cayley reduced the per-iteration training time relative to closed-form Cayley while maintaining comparable test accuracy (2002.01113).

Generalized and adaptive Cayley schemes have demonstrated efficient optimization in eigen-basis extraction and other problems, with CPU time to convergence roughly half that of QR/polar methods and 2 to 3× faster than classical implementations of the Cayley retraction. The adaptive recentering scheme effectively mitigates the slowdowns induced by chart singularities (Kume et al., 2023, Kume et al., 2023).

7. Connections, Extensions, and Implementation Considerations

The iterative Cayley and generalized Cayley parametrization frameworks provide a foundation for embedding momentum dynamics and vector transport directly into the retraction step. In particular, implicit vector transport is achieved by projecting the momentum update into the tangent space and applying the Cayley retraction, obviating the need for separate, explicit vector transport operations (2002.01113).

Further, the flexibility of these approaches enables "local trivialization" of the Stiefel manifold, allowing standard Euclidean optimizers to be used transparently. Adaptive chart strategies can be implemented with negligible additional computational cost by leveraging SVD-based center selection. The avoidance of $O(n^3)$ operations, together with reduced per-iteration flops and storage, recommends the iterative Cayley retraction and its generalizations for large-scale learning tasks with strict orthogonality constraints (Kume et al., 2023, Kume et al., 2023).
