Rotational Variants of Optimizers
- Rotational variants of optimizers are a class of algorithms that exploit rotation invariance by leveraging manifold geometry, matrix whitening, and EGOP reparameterization to improve convergence.
- They incorporate techniques such as intrinsic momentum on the Stiefel manifold, cyclic scheduling, and explicit angular control to enforce orthogonality and lower error rates, e.g., 8.3% test error on CIFAR-10 for Vision Transformers.
- Matrix-whitening and EGOP-based methods yield 2–5× speedup and enhanced generalization by adapting update directions to the underlying geometric structure of parameter space.
Rotational variants of optimizers are a class of optimization algorithms designed to respect, exploit, or impose invariance with respect to rotations (orthogonal transformations) in parameter or data space. Compared to standard gradient-based methods, which often operate in an axis-aligned (coordinate-wise) manner or ignore the geometric structure of constraints, rotationally-aware optimizers utilize manifold geometry, matrix preconditioning, or reparameterization to achieve rotation-equivariant or -invariant learning dynamics. Recent work spans intrinsic momentum methods for the Stiefel manifold, matrix-whitening strategies, EGOP-based reparameterizations, cyclic schedule coupling, and explicit control of angular update statistics.
1. Manifold-Aware Optimization: Stiefel and Orthogonality-Constrained Methods
Gradient-based optimization on orthogonality-constrained manifolds, notably the Stiefel manifold $\mathrm{St}(n,m) = \{X \in \mathbb{R}^{n \times m} : X^\top X = I\}$, requires preserving orthonormality throughout training. The Momentum Stiefel Optimizer (Kong et al., 2022) develops a continuous-time ODE for the pair $(X, V)$ on the tangent bundle $T\,\mathrm{St}(n,m)$, incorporating intrinsic momentum and friction:
- The velocity $V$ evolves on the tangent space $T_X \mathrm{St}(n,m) = \{V : X^\top V + V^\top X = 0\}$.
- The ODE is discretized by splitting into sub-flows, each exactly preserving constraints such as $X^\top X = I$ and $X^\top V + V^\top X = 0$.
- Momentum is tracked in both "rotation" (the skew-symmetric component $X^\top V$) and "drift" (the remaining tangent component) directions, with updates maintaining geometric feasibility.
- Adaptive variants generalize Adam: per-step second-moment buffers for the skew and tangent directions, normalization, and polar-retraction recovery of orthonormality; a simplified momentum-plus-retraction step is sketched below.
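As a much simpler stand-in for the paper's splitting scheme, the sketch below implements a generic momentum step on the Stiefel manifold with tangent-space projection and a polar retraction. The function names and the toy trace objective are illustrative assumptions, not the authors' algorithm.

```python
import numpy as np

def polar_retraction(Y):
    """Map a full-rank matrix back onto the Stiefel manifold via its polar factor,
    the closest matrix with orthonormal columns."""
    U, _, Vt = np.linalg.svd(Y, full_matrices=False)
    return U @ Vt

def tangent_project(X, G):
    """Project an ambient matrix G onto the tangent space at X,
    T_X St = {V : X^T V + V^T X = 0}."""
    sym = 0.5 * (X.T @ G + G.T @ X)
    return G - X @ sym

def stiefel_momentum_step(X, M, grad, lr=1e-2, beta=0.9):
    """One heavy-ball style step: keep momentum in the tangent space,
    then retract so that X^T X = I holds exactly after the update."""
    M = beta * tangent_project(X, M) - lr * tangent_project(X, grad)
    X_new = polar_retraction(X + M)
    return X_new, M

# Toy usage: descend tr(X^T A X) over 6x3 orthonormal frames.
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6)); A = A + A.T
X = np.linalg.qr(rng.standard_normal((6, 3)))[0]   # random point on St(6, 3)
M = np.zeros_like(X)
for _ in range(200):
    X, M = stiefel_momentum_step(X, M, 2 * A @ X)  # gradient of tr(X^T A X)
print("orthonormality error:", np.linalg.norm(X.T @ X - np.eye(3)))
```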
In empirical studies:
- For projection-robust Wasserstein distance, Stiefel momentum optimizers outperform both projected SGDs and Riemannian-Adam on tasks requiring subspace projections.
- In Vision Transformer training with orthogonality enforced per attention head, imposing "within-head" constraints and using Stiefel optimizers achieves lower error rates (8.3% on CIFAR-10) and faster convergence compared to vanilla ViT and competing orthogonalization schemes.
- The updates move along (infinitesimal) geodesic rotations and damped drift, ensuring stability and efficiency whenever parameters are intrinsically "rotational" (Kong et al., 2022).
2. Matrix-Whitening and Spectral Variants: Rotation-Invariant Preconditioning
Matrix-whitening optimizers deploy preconditioners derived from the (regularized) covariance of recent gradients, $C = \mathbb{E}[g g^\top]$, to yield update directions
$$\Delta(g) = -(C + \epsilon I)^{-1/2}\, g,$$
which are provably equivariant under any orthogonal transformation $R$, i.e., $\Delta(Rg) = R\,\Delta(g)$ (Frans et al., 28 Oct 2025). This is in contrast to Adam's axis-aligned preconditioning, which is broken by rotations.
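A minimal numerical check of this equivariance claim, assuming a plain full-matrix whitening preconditioner $(C + \epsilon I)^{-1/2}$ rather than the Kronecker-factored estimators used by practical optimizers; the gradient window and variable names are illustrative:

```python
import numpy as np

def whitened_update(grad_window, g, eps=1e-6):
    """Full-matrix whitening: precondition the current gradient g by
    (C + eps*I)^{-1/2}, with C the empirical covariance of recent gradients."""
    C = grad_window @ grad_window.T / grad_window.shape[1]
    w, V = np.linalg.eigh(C + eps * np.eye(len(g)))
    return -(V @ np.diag(w ** -0.5) @ V.T) @ g

rng = np.random.default_rng(1)
d, k = 8, 32
G = rng.standard_normal((d, k))                    # recent gradients as columns
g = rng.standard_normal(d)                         # current gradient
R = np.linalg.qr(rng.standard_normal((d, d)))[0]   # a random orthogonal matrix

# Rotating every gradient by R rotates the whitened update by the same R.
print(np.allclose(R @ whitened_update(G, g), whitened_update(R @ G, R @ g), atol=1e-6))

# Adam-style diagonal (axis-aligned) preconditioning lacks this property.
adam_dir = lambda G, g: -g / (np.sqrt((G ** 2).mean(axis=1)) + 1e-8)
print(np.allclose(R @ adam_dir(G, g), adam_dir(R @ G, R @ g), atol=1e-6))
```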
Two key mechanisms underpin matrix-whitening:
- Spectral normalization: Equalizes or orthogonalizes the magnitudes of principal components (singular values) in the gradient. Exact updates recover the signed SVD direction (e.g., $U V^\top$ for gradient $G = U \Sigma V^\top$), closely related to spectral-norm steepest descent.
- Variance adaptation: Scales each direction by the inverse of its signal power, $1/\sqrt{\mathbb{E}[\tilde g_i^{\,2}]}$, in some eigenbasis (with $\tilde g$ the gradient expressed in that basis). This is fundamental to performance; empirical ablation shows that variance-adapted variants (e.g., SOAP, AdaMuon) outperform sign-descent or pure spectral methods, and as much as 80–100% of the whitening benefit can be attributed to variance adaptation (Frans et al., 28 Oct 2025). A rough sketch of this mechanism follows below.
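The sketch below illustrates per-eigenbasis variance adaptation in the spirit of SOAP's "diagonal rule in a whitening eigenbasis" idea, but it ignores Kronecker factorization, momentum, bias correction, and basis-update scheduling; all names and the toy gradient window are assumptions.

```python
import numpy as np

def eigenbasis_variance_step(grad_window, g, lr=1e-3, eps=1e-8):
    """Rotate the gradient into the eigenbasis of its covariance, apply a
    diagonal second-moment (Adam-like) scaling there, and rotate back."""
    C = grad_window @ grad_window.T / grad_window.shape[1]
    _, V = np.linalg.eigh(C)                       # eigenbasis of gradient covariance
    second_moment = ((V.T @ grad_window) ** 2).mean(axis=1)
    return -lr * V @ ((V.T @ g) / np.sqrt(second_moment + eps))

rng = np.random.default_rng(2)
grad_window = rng.standard_normal((8, 64))         # recent gradients as columns
print(eigenbasis_variance_step(grad_window, grad_window[:, -1]))
```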
Notable algorithms and complexity characteristics:
- Shampoo approximates full matrix whitening via Kronecker-factored covariance estimation and applies blockwise preconditioners.
- SOAP combines blockwise whitening with per-basis variance adaptation, further improving convergence.
- Muon and AdaMuon use iterative Newton-Schulz orthogonalization (a textbook version is sketched below); AdaMuon adds variance scaling.
Empirical results indicate that for large-scale models (e.g., GPT-2), matrix-whitening methods reach target validation losses 1.3–1.4× faster than Adam. Low-rank or blockwise variance estimates trade memory for a negligible accuracy drop (Frans et al., 28 Oct 2025).
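As a sketch of the orthogonalization step used by Muon-style methods, the snippet below approximates the signed SVD direction $UV^\top$ with the classic Newton-Schulz iteration; the released Muon implementation uses a tuned odd polynomial, whereas these are the textbook coefficients, and the function name is an assumption.

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=15):
    """Approximate the 'signed SVD' direction U V^T of G without an SVD,
    via the classic Newton-Schulz iteration X <- 1.5 X - 0.5 X X^T X."""
    X = G / (np.linalg.norm(G) + 1e-12)   # Frobenius normalization keeps all singular values <= 1
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

rng = np.random.default_rng(3)
G = rng.standard_normal((32, 16))
U, _, Vt = np.linalg.svd(G, full_matrices=False)
print("deviation from exact U V^T:", np.linalg.norm(newton_schulz_orthogonalize(G) - U @ Vt))
```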
3. Rotational Equivariance via EGOP Reparameterization
Coordinate-wise adaptive methods such as Adam and Adagrad are not inherently rotation-invariant: a simple orthogonal change of basis dramatically alters their optimization trajectories and sometimes eliminates favorable implicit bias (the "richness bias") (DePavia et al., 27 Oct 2025, DePavia et al., 3 Feb 2025). Orthogonal reparameterization using the expected gradient outer product (EGOP)
$$M = \mathbb{E}_x\!\left[\nabla f(x)\,\nabla f(x)^\top\right]$$
achieves full $O(d)$-equivariance:
- Compute the EGOP matrix and decompose it as $M = U \Lambda U^\top$.
- Reparameterize $\tilde w = U^\top w$, optimize with a standard optimizer, then map back via $w = U \tilde w$ (sketched after this list).
- In the EGOP basis, coordinate-wise updates naturally align with the principal axes of gradient variation, absorbing the curvature geometry of the underlying problem.
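A minimal sketch of this recipe on a toy quadratic, assuming a Monte-Carlo EGOP estimate from sampled gradients and a plain Adam loop in the rotated basis; the gradient oracle, sampler, and hyperparameters below are illustrative assumptions rather than the cited procedure.

```python
import numpy as np

def egop_eigenbasis(grad_fn, sampler, n_samples=256):
    """Monte-Carlo estimate of the EGOP M = E_x[grad f(x) grad f(x)^T]
    and its eigendecomposition M = U Lambda U^T."""
    d = len(sampler())
    M = np.zeros((d, d))
    for _ in range(n_samples):
        g = grad_fn(sampler())
        M += np.outer(g, g) / n_samples
    lam, U = np.linalg.eigh(M)
    return lam, U

# Toy problem: an ill-conditioned quadratic whose curvature axes are rotated.
rng = np.random.default_rng(4)
d = 10
Q = np.linalg.qr(rng.standard_normal((d, d)))[0]
H = Q @ np.diag(np.logspace(0, 3, d)) @ Q.T
grad_fn = lambda w: H @ w                       # gradient of 0.5 w^T H w
sampler = lambda: rng.standard_normal(d)        # points at which gradients are sampled

_, U = egop_eigenbasis(grad_fn, sampler)

# Optimize in the EGOP basis (w = U w_tilde) with plain Adam, then map back.
w_tilde = U.T @ rng.standard_normal(d)
m, v = np.zeros(d), np.zeros(d)
lr, b1, b2, eps = 0.01, 0.9, 0.999, 1e-8
for t in range(1, 2001):
    g = U.T @ grad_fn(U @ w_tilde)              # reparameterized gradient
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g ** 2
    w_tilde -= lr * (m / (1 - b1 ** t)) / (np.sqrt(v / (1 - b2 ** t)) + eps)
w = U @ w_tilde                                 # map the solution back
print("final loss:", 0.5 * w @ H @ w)
```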
Key findings:
- Even small random rotations can cause Adam to lose its richness bias, collapsing back to less expressive solutions; EGOP reparameterization universally restores the preferred implicit bias regardless of rotation (DePavia et al., 27 Oct 2025).
- Theoretical analysis links the convergence benefits of EGOP reparameterization to spectral decay in the EGOP matrix; empirical tests show a 2–5× speedup for Adam/Adagrad on both convex and nonconvex benchmarks when reparameterized (DePavia et al., 3 Feb 2025).
- For high-dimensional models, block or low-rank EGOP approximations offer scalability.
4. Cyclic and Angular-Controlled Rotational Dynamics
Rotational variants also arise in the coupling of cyclic schedules or through explicit angular update control:
- CLMR (Cyclic Learning/Momentum Rate) employs triangle-shaped, periodically synchronized schedules in both learning rate and momentum rate (Mortazi et al., 2023). When step size is low, momentum is high, and vice versa, creating a rotational dynamic in the (momentum, gradient) plane. This mechanism avoids rapid convergence to sharp minima by steering trajectory rotation and balances exploration and exploitation.
- Experiments on medical image segmentation demonstrate that CLMR improves generalization beyond both adaptive (Adam) and fixed-schedule momentum baselines, with a 2–3% improvement (Dice metric) and no extra overhead.
- Rotational equilibrium and explicit angular-speed control (Kosson et al., 2023): Weight decay causes weight vectors to reach a steady-state where the expected magnitude and instantaneous rotation (as measured by the expected angle between successive weight vectors) are stable. For optimizers like AdamW, SGD (with momentum), and Lion, formulas relate steady-state norm and angular step size to hyperparameters; AdamW (decoupled weight decay) yields uniform angular speed across layers, in contrast to classical L2-regularization, which results in imbalanced rotation and degraded accuracy.
- Explicit control of a target angular update via a "Rotational Wrapper" enforces homogeneity of learning dynamics across neurons/layers, removes the need for learning-rate warmup, and leads to stable, robust optimization; the sketch below shows how per-step rotation can be measured.
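To make "angular speed" concrete, the sketch below tracks the angle between successive weight vectors for a single weight vector driven by synthetic gradient noise under SGD with momentum and decoupled weight decay. It only instruments the equilibrium described above; it is not the cited Rotational Wrapper, and all constants are illustrative.

```python
import numpy as np

def angle_between(w_prev, w_new):
    """Per-step angular update: the angle between successive weight vectors."""
    cos = np.dot(w_prev, w_new) / (np.linalg.norm(w_prev) * np.linalg.norm(w_new))
    return np.arccos(np.clip(cos, -1.0, 1.0))

rng = np.random.default_rng(5)
d, lr, wd, beta = 256, 0.05, 0.1, 0.9
w = rng.standard_normal(d)                 # one neuron's weight vector
m = np.zeros(d)

angles, norms = [], []
for _ in range(2000):
    g = rng.standard_normal(d)             # stand-in for a noisy, roughly weight-orthogonal gradient
    m = beta * m + g
    w_new = w - lr * m - lr * wd * w       # SGD with momentum and decoupled weight decay
    angles.append(angle_between(w, w_new))
    norms.append(np.linalg.norm(w_new))
    w = w_new

# After a transient, both the norm and the per-step angle hover around steady values.
print("late-phase mean angle (rad):", np.mean(angles[-500:]))
print("late-phase mean norm:       ", np.mean(norms[-500:]))
```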
5. Practical Consequences, Limitations, and Recommendations
- Rotation-invariant/aware optimizers outperform standard axis-aligned methods in tasks where the parameter geometry or the problem itself is rotation-sensitive—orthogonality-constrained learning, subspace methods, network initializations, and when seeking robustness to input reparametrization.
- For whitening-type optimizers, variance adaptation in preconditioning is typically more impactful than perfect spectral normalization; practical guidelines suggest favoring algorithms (e.g., SOAP with low-rank variance, AdaMuon) that decouple the two (Frans et al., 28 Oct 2025).
- For coordinate-wise adaptive optimizers (e.g., Adam, Adagrad), without explicit reparameterization or EGOP-adaptation, arbitrary rotations can degrade generalization and slow down or compromise convergence (DePavia et al., 27 Oct 2025, DePavia et al., 3 Feb 2025).
- In deep models, enforcing rotational homogeneity at the per-layer or per-neuron level—either via decoupled weight decay or explicit angular control—improves both stability and final accuracy, especially for architectures with normalization layers (Kosson et al., 2023).
- Empirical observations across domains (vision, language, optimal transport, segmentation) corroborate the efficiency, robustness, and convergence improvements of rotational variants over purely Euclidean or axis-aligned alternatives.
6. Algorithmic Summary
| Rotational Variant | Principle | Core Implementation Features |
|---|---|---|
| Momentum Stiefel Optimizer | Manifold-geometry & momentum | Intrinsic ODE flow, tangent bundle updates, splitting |
| Matrix-Whitening (SOAP) | Invariant preconditioning | Kronecker/blockwise eigenbasis, variance-adapted scaling |
| EGOP-based Reparam | $O(d)$-equivariant updates | Axes aligned to EGOP eigenbasis, standard optimizers |
| CLMR | Rotational coupling in the (momentum, gradient) plane | Cyclic learning/momentum rate, Nesterov dynamics |
| Rotational Equilibrium | Angular update control | Decoupled weight decay, explicit target angular update |
Each method leverages rotation-group symmetry—either by preserving geometric constraints (Stiefel, whitening), erasing axis-alignment artifacts (EGOP), or balancing rotation across layers (rotational equilibrium/CLMR)—to achieve more stable, expressive, and robust learning in settings where rotation matters.
7. Significance and Outlook
Rotational variants of optimizers represent a crucial evolution in gradient-based optimization: they dissolve artificial biases toward coordinate axes and enable methods to adapt step-sizes, momentum, and drift in a rotationally-symmetric way, aligning more closely with both the algebraic constraints and natural invariances of modern machine learning tasks. Theoretical and empirical results confirm that such methods yield faster convergence, improved generalization, more stable dynamics, and robustness to reparameterizations and geometry-induced pathologies. Future work may further unify rotation-invariant preconditioning, low-rank scalable EGOP estimation, and explicit angular control as fundamental building blocks for foundation model optimization.