Rotational Adam Optimizer

Updated 6 May 2026

The Rotational Adam Optimizer restores rotation equivariance by modifying second moment adaptation to be insensitive to orthogonal transformations.
It integrates techniques such as full-matrix adaptation, dynamic basis diagonalization, and vector-grouped updates to align parameter updates with inherent data structures.
Empirical results demonstrate improved convergence metrics and reduced training steps in transformer and vision architectures compared to standard Adam.

Rotational Adam Optimizer is an extension of adaptive gradient methods that seeks to restore or enforce rotation-equivariance—insensitivity to orthogonal changes of basis—in Adam-style preconditioning. Standard Adam operates with strictly per-coordinate second-moment adaptation, which breaks rotation equivariance and can result in significant algorithmic artifacts, degraded training speed, and lost generalization benefits when the parameterization or data undergoes orthogonal transformation. Recent work has developed both theoretically principled and pragmatically scalable forms of Rotational Adam, including matrix preconditioner diagonalization, expected gradient outer product reparameterization, symmetry-aware adaptation, Riemannian manifold generalizations, and vector-grouped updates.

1. Failure Modes of Standard Adam under Rotations

Adam maintains running averages of the first and second moments of stochastic gradients for each parameter entry, updating as

$m_t = \beta_1 m_{t-1} + (1-\beta_1) \nabla f(\theta_{t-1}), \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2) [\nabla f(\theta_{t-1})]^2,$

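with elementwise operations. With the bias-corrected moments $\hat m_t = m_t / (1-\beta_1^t)$ and $\hat v_t = v_t / (1-\beta_2^t)$, the parameter update is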

$\theta_t = \theta_{t-1} - \alpha \frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon}.$

This formulation is sensitive to the coordinate basis. Under an arbitrary rotation $R$, the first moment transforms as $m_t \mapsto R m_t$, but the second moment does not in general transform as $v_t \mapsto R v_t$ because of the elementwise square, and the elementwise division does not commute with $R$. Standard SGD, by contrast, is fully rotation-equivariant. Empirically, both the convergence speed and the implicit bias of Adam degrade under random, layerwise, or global basis rotations in transformer and vision architectures, with the degree of slowdown correlated with the scale and type of rotation. For instance, global rotations can increase GPT-2 training time by ~16% and decrease ViT/S ImageNet-1K convergence rate by up to 96%, while ResNet-50 is robust to such transformations (Maes et al., 2024).
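The contrast between SGD and Adam can be checked numerically on a single step. The sketch below is illustrative only: the helper names `sgd_step` and `adam_first_step`, the dimension, and the random seed are assumptions, and only the first Adam step from zero moments is considered.

```python
import numpy as np

def sgd_step(g, alpha=1e-3):
    """Displacement produced by one plain SGD step."""
    return -alpha * g

def adam_first_step(g, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Displacement produced by the first bias-corrected Adam step (zero initial moments)."""
    m_hat = (1 - beta1) * g / (1 - beta1)       # bias correction at t = 1 recovers g
    v_hat = (1 - beta2) * g**2 / (1 - beta2)    # and g**2, elementwise
    return -alpha * m_hat / (np.sqrt(v_hat) + eps)

rng = np.random.default_rng(0)
d = 4
g = rng.standard_normal(d)
R, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random orthogonal matrix

# Equivariance would mean step(R g) == R step(g).
sgd_gap = np.linalg.norm(sgd_step(R @ g) - R @ sgd_step(g))
adam_gap = np.linalg.norm(adam_first_step(R @ g) - R @ adam_first_step(g))
print(f"SGD gap:  {sgd_gap:.2e}")
print(f"Adam gap: {adam_gap:.2e}")
```

The SGD gap sits at floating-point noise, while the Adam gap is comparable to the step size itself, because the first Adam step is approximately $-\alpha\,\mathrm{sign}(g)$ and the sign pattern is tied to the coordinate axes.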

2. Mathematical Foundations of Rotation Equivariance

An optimizer $A$ is rotation-equivariant if for any orthogonal $R$ and all $t$,

$\tilde\theta_t = R\, \theta_t,$

where $\theta_t$ are the iterates of $A$ on $f$ started from $\theta_0$, and $\tilde\theta_t$ are its iterates on the rotated objective $\tilde f(\theta) = f(R^\top \theta)$ started from $\tilde\theta_0 = R \theta_0$. Adam’s coordinatewise adaptation (division by $\sqrt{\hat v_t}$) violates this property because, in general, $\frac{R g}{\sqrt{(R g)^2}} \neq R\, \frac{g}{\sqrt{g^2}}$ for a gradient $g$ (squares, roots, and division taken elementwise); the update trajectory depends on the choice of parameter basis (Maes et al., 2024, Ling et al., 2022).

Several approaches have been developed to achieve rotation-equivariance by restructuring Adam’s adaptation, including:

Full-matrix second moment adaptation:

$V_t = \beta_2 V_{t-1} + (1-\beta_2)\, g_t g_t^\top, \qquad g_t = \nabla f(\theta_{t-1}),$

$\theta_t = \theta_{t-1} - \alpha\, (\hat V_t^{1/2} + \epsilon I)^{-1} \hat m_t, \qquad \hat V_t = V_t / (1-\beta_2^t).$

(This update is formally rotation-equivariant.)

Dynamic basis diagonalization: Diagonalize $V_t$ as $V_t = Q_t \Lambda_t Q_t^\top$, then update in the rotated basis, mapping back to original coordinates. This generalizes SVD-based rotation strategies to full preconditioning (Maes et al., 2024, Nguyen et al., 11 Feb 2025).
Block-wise or vector-grouped moments: Aggregate second moments over structured subsets (e.g., rows, channels, global vectors), yielding equivariance and eliminating axis artifacts (Ling et al., 2022).
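As a minimal sketch of the full-matrix variant, the code below keeps a dense second-moment matrix built from gradient outer products and preconditions with its inverse square root, then checks equivariance for one step from a shared state. The function name `full_matrix_adam_step`, the placement of $\epsilon$ inside the inverse root, and the random test state are assumptions made for illustration.

```python
import numpy as np

def full_matrix_adam_step(theta, g, m, V, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam-style step with a dense second-moment matrix V (d x d)."""
    m = beta1 * m + (1 - beta1) * g
    V = beta2 * V + (1 - beta2) * np.outer(g, g)
    m_hat = m / (1 - beta1**t)
    V_hat = V / (1 - beta2**t)
    # Symmetric eigendecomposition gives (V_hat^{1/2} + eps I)^{-1} = Q diag(scale) Q^T
    w, Q = np.linalg.eigh(V_hat)
    scale = 1.0 / (np.sqrt(np.clip(w, 0.0, None)) + eps)
    precond = (Q * scale) @ Q.T
    return theta - alpha * precond @ m_hat, m, V

rng = np.random.default_rng(0)
d = 5
theta, g, m = (rng.standard_normal(d) for _ in range(3))
S = rng.standard_normal((d, 8))
V = S @ S.T / 8                                   # well-conditioned test state
R, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random orthogonal matrix

theta_new, _, _ = full_matrix_adam_step(theta, g, m, V, t=10)
# Rotate every input consistently: vectors by R, the matrix by R V R^T.
theta_rot, _, _ = full_matrix_adam_step(R @ theta, R @ g, R @ m, R @ V @ R.T, t=10)
gap = np.linalg.norm(theta_rot - R @ theta_new)
print(f"equivariance gap: {gap:.2e}")
```

The dense eigendecomposition costs $O(d^3)$ per step, which is why this form serves as a reference point rather than a practical optimizer at scale.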

3. Rotation-Equivariant Adam Variants

A variety of practical rotational strategies for Adam optimization have emerged:

3.1 EGOP-Reparameterized (Covariance-Aligned) Adam

Using the Expected Gradient Outer Product (EGOP) matrix,

$\mathrm{EGOP}(f) = \mathbb{E}_{\theta \sim \rho}\left[\nabla f(\theta)\, \nabla f(\theta)^\top\right],$

where $\rho$ is a sampling distribution over parameters, with eigendecomposition $\mathrm{EGOP}(f) = V \Lambda V^\top$, a fixed orthonormal rotation $V$ is applied to parameters, mapping $\theta \mapsto \tilde\theta = V^\top \theta$. The Adam update is performed in $\tilde\theta$-space and mapped back. This approach is rotation-equivariant and exploits the dominant principal gradient directions present in tasks with spectral decay, yielding improved step efficiency and insensitivity to parameterization (DePavia et al., 3 Feb 2025, DePavia et al., 27 Oct 2025).
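The EGOP matrix can be estimated by Monte Carlo sampling of gradients. The sketch below assumes a hypothetical quadratic objective $f(\theta) = \tfrac12 \theta^\top H \theta$ and a standard normal sampling distribution; the helper `estimate_egop` and all numerical values are illustrative, not taken from the cited work.

```python
import numpy as np

def estimate_egop(grad_fn, sample_theta, n_samples=2000):
    """Monte Carlo estimate of E[grad f(theta) grad f(theta)^T]."""
    grads = np.stack([grad_fn(sample_theta()) for _ in range(n_samples)])
    return grads.T @ grads / n_samples

rng = np.random.default_rng(0)
d = 6
# Hypothetical objective with a rapidly decaying Hessian spectrum
U, _ = np.linalg.qr(rng.standard_normal((d, d)))
H = U @ np.diag([10.0, 3.0, 1.0, 0.1, 0.01, 0.001]) @ U.T

egop = estimate_egop(lambda th: H @ th, lambda: rng.standard_normal(d))

# eigh returns ascending eigenvalues; reorder so the dominant direction comes first
eigvals, V = np.linalg.eigh(egop)
eigvals, V = eigvals[::-1], V[:, ::-1]
print("EGOP spectrum:", eigvals)
```

For this toy objective the exact EGOP is $H^2$, so the estimated spectrum decays steeply and the leading columns of `V` capture the dominant gradient directions; a flat spectrum would leave no preferred basis to rotate into.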

3.2 Adaptive Preconditioner Diagonalization (AdaDiag++/Rotational Adam)

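The empirical gradient covariance $C_t = \beta_2 C_{t-1} + (1-\beta_2)\, G_t G_t^\top$ is diagonalized via periodic SVD (on matrix-shaped parameters),

$C_t = U_t \Sigma_t U_t^\top,$

and parameter updates are performed in the basis diagonalizing $C_t$, with diagonal preconditioning by the Adam moments $\tilde m_t, \tilde v_t$ of the rotated gradient $\tilde G_t = U_t^\top G_t$, and mapped back:

$W_t = W_{t-1} - \alpha\, U_t\, \frac{\hat{\tilde m}_t}{\sqrt{\hat{\tilde v}_t} + \epsilon}.$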

This yields large improvements in convergence metrics for large-scale language and vision models, with significant step and epoch reductions (LLaMA pretraining: 2x fewer steps vs Adam; ResNet/ImageNet-1K: 1.2–1.5x fewer epochs) (Nguyen et al., 11 Feb 2025).

3.3 VectorAdam

VectorAdam groups vector-valued parameters (e.g., 3D coordinates, neural features) and computes a scalar second moment for each block $i$ from the gradient $g_t^{(i)}$ of that block,

$v_t^{(i)} = \beta_2 v_{t-1}^{(i)} + (1-\beta_2)\, \|g_t^{(i)}\|_2^2,$

removing axis-aligned update artifacts in geometric learning and adversarial point-cloud optimization (Ling et al., 2022).
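A vector-grouped step can be sketched in a few lines. In the code below, the function name `vector_adam_step` and the array layout (one row per vector, one scalar second moment per row) are assumptions for illustration; the check rotates every vector by the same orthogonal matrix and compares the results.

```python
import numpy as np

def vector_adam_step(theta, g, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """theta, g, m have shape (n_vectors, k); v has shape (n_vectors,)."""
    m = beta1 * m + (1 - beta1) * g
    # Squared Euclidean norm per vector replaces the elementwise square
    v = beta2 * v + (1 - beta2) * np.sum(g**2, axis=1)
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat)[:, None] + eps)
    return theta, m, v

rng = np.random.default_rng(0)
n, k = 100, 3
theta, g, m = (rng.standard_normal((n, k)) for _ in range(3))
v = rng.random(n)
R, _ = np.linalg.qr(rng.standard_normal((k, k)))  # rotation applied to every vector

theta_new, _, _ = vector_adam_step(theta, g, m, v, t=5)
theta_rot, _, _ = vector_adam_step(theta @ R.T, g @ R.T, m @ R.T, v, t=5)
gap = np.max(np.abs(theta_rot - theta_new @ R.T))
print(f"per-vector equivariance gap: {gap:.2e}")
```

Because the norm of each gradient vector is unchanged by rotation, the scalar second moment is identical in both runs and the update direction rotates with the data.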

3.4 Adaptive and Symmetry-Aware Rotational Policies (ARO)

ARO (Adaptively Rotated Optimization) introduces an adaptive rotation for each parameter matrix, selecting $R_t$ via a Procrustes-QR policy to maximize a dual-norm loss decrease proxy, then performing the Adam-style or alternative normed descent in this rotated basis. ARO unifies and extends prior “eigen-rotation,” “SOAP,” and “Muon” schemes, and delivers a 1.3–1.35x step-speedup over AdamW at LLM scales with negligible added wall-clock cost (Gong et al., 9 Feb 2026).

4. Algorithms and Implementation

4.1 EGOP-Reparameterized Rotational Adam

Estimate the EGOP matrix $\widehat{E}$ by sampling gradients under a distribution $\rho$ over parameters.
Compute the eigendecomposition $\widehat{E} = V \Lambda V^\top$; set the rotation to $V$.
Transform initial parameters: $\tilde\theta_0 = V^\top \theta_0$.
For $t = 1, \dots, T$:
- $g_t = \nabla f(\theta_{t-1})$; $\tilde g_t = V^\top g_t$.
- Update Adam moments in $\tilde\theta$-space: $\tilde m_t = \beta_1 \tilde m_{t-1} + (1-\beta_1)\, \tilde g_t$, $\tilde v_t = \beta_2 \tilde v_{t-1} + (1-\beta_2)\, \tilde g_t^2$.
- Bias-correct, step update: $\tilde\theta_t = \tilde\theta_{t-1} - \alpha\, \hat{\tilde m}_t / (\sqrt{\hat{\tilde v}_t} + \epsilon)$.
- Map back: $\theta_t = V \tilde\theta_t$.
Use block-wise or low-rank $V$ for tractability at scale (DePavia et al., 3 Feb 2025, DePavia et al., 27 Oct 2025).
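The steps above can be sketched as a thin wrapper around ordinary Adam: rotate into the EGOP eigenbasis, run Adam there, and rotate back. The sketch assumes a hypothetical quadratic objective whose EGOP under a standard normal sampling distribution is exactly $H^2$, so no sampling is needed; the names `adam` and `egop_adam` are illustrative.

```python
import numpy as np

def adam(grad_fn, theta0, steps, alpha=1e-2, beta1=0.9, beta2=0.999, eps=1e-8):
    """Plain Adam with bias correction; returns the final iterate."""
    theta = np.array(theta0, dtype=float)
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**t)
        v_hat = v / (1 - beta2**t)
        theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta

def egop_adam(grad_fn, theta0, V, steps, **adam_kwargs):
    """Adam run in the coordinates z = V^T theta, then mapped back to theta = V z."""
    rotated_grad = lambda z: V.T @ grad_fn(V @ z)   # chain rule for f(V z)
    z_final = adam(rotated_grad, V.T @ theta0, steps, **adam_kwargs)
    return V @ z_final

rng = np.random.default_rng(0)
d = 4
U, _ = np.linalg.qr(rng.standard_normal((d, d)))
H = U @ np.diag([4.0, 2.0, 1.0, 0.5]) @ U.T         # f(theta) = 0.5 theta^T H theta
R, _ = np.linalg.qr(rng.standard_normal((d, d)))    # arbitrary reparameterization
H_rot = R @ H @ R.T                                 # same objective in rotated coordinates
theta0 = rng.standard_normal(d)

_, V = np.linalg.eigh(H @ H)                        # exact EGOP eigenbasis (assumed known)
_, V_rot = np.linalg.eigh(H_rot @ H_rot)

theta_a = egop_adam(lambda th: H @ th, theta0, V, steps=50)
theta_b = egop_adam(lambda th: H_rot @ th, R @ theta0, V_rot, steps=50)
gap = np.linalg.norm(theta_b - R @ theta_a)
print(f"trajectory gap after 50 steps: {gap:.2e}")
```

The two runs start from the same point expressed in different bases and end at the same point up to floating-point error. Eigenvector sign ambiguity does not matter here because diagonal Adam is itself equivariant to sign flips of individual coordinates.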

4.2 Preconditioner Diagonalization (AdaDiag++)

Aggregate $C_t$ (second-moment estimator) as a moving average of gradient outer products $G_t G_t^\top$.
Periodically diagonalize $C_t$ via SVD (reshaping the gradient to a matrix), yielding rotation $U_t$.
Apply Adam updates in rotated space, then inverse-rotate updates back.
AdafacDiag integrates with Adafactor for memory efficiency; only row/column second-moment statistics are needed (Nguyen et al., 11 Feb 2025).
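A minimal sketch of a periodically refreshed rotation for a single matrix parameter follows. It is not the AdaDiag++ implementation: the class name `PeriodicRotatedAdam`, the one-sided (row-space) rotation, the choice to store the first moment in original coordinates, and the reuse of the second moment across basis refreshes are all simplifying assumptions, and the toy objective is hypothetical.

```python
import numpy as np

class PeriodicRotatedAdam:
    """Adam on a matrix parameter, run in a periodically refreshed eigenbasis."""

    def __init__(self, shape, alpha=1e-2, beta1=0.9, beta2=0.999, eps=1e-8, period=10):
        rows = shape[0]
        self.alpha, self.beta1, self.beta2, self.eps = alpha, beta1, beta2, eps
        self.period = period
        self.t = 0
        self.C = np.zeros((rows, rows))  # moving average of G @ G.T
        self.U = np.eye(rows)            # current rotation
        self.m = np.zeros(shape)         # first moment, original coordinates
        self.v = np.zeros(shape)         # second moment, rotated coordinates

    def step(self, W, G):
        self.t += 1
        self.C = self.beta2 * self.C + (1 - self.beta2) * (G @ G.T)
        if (self.t - 1) % self.period == 0:
            # C is symmetric PSD, so its left singular vectors are its eigenvectors
            self.U, _, _ = np.linalg.svd(self.C)
        G_rot = self.U.T @ G
        self.m = self.beta1 * self.m + (1 - self.beta1) * G
        # Assumption: v is carried over unchanged when the basis is refreshed
        self.v = self.beta2 * self.v + (1 - self.beta2) * G_rot**2
        m_hat = (self.U.T @ self.m) / (1 - self.beta1**self.t)
        v_hat = self.v / (1 - self.beta2**self.t)
        update_rot = m_hat / (np.sqrt(v_hat) + self.eps)
        return W - self.alpha * (self.U @ update_rot)  # inverse-rotate the update

rng = np.random.default_rng(0)
target = rng.standard_normal((4, 3))
W = np.zeros((4, 3))
opt = PeriodicRotatedAdam(W.shape)

loss_start = 0.5 * np.sum((W - target) ** 2)
for _ in range(50):
    W = opt.step(W, W - target)          # gradient of 0.5 * ||W - target||_F^2
loss_end = 0.5 * np.sum((W - target) ** 2)
print(f"loss: {loss_start:.3f} -> {loss_end:.3f}")
```

Refreshing the basis only every `period` steps amortizes the SVD cost; the trade-off is that the second-moment statistics are briefly expressed in a stale basis after each refresh.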

4.3 Riemannian/Stiefel Manifold Rotational Adam

For optimization on orthogonality-constrained spaces (e.g., the Stiefel manifold):
Project gradients and accumulate moments in the intrinsic tangent space.
Retract by computing an exponential-map update: $Y_t = \exp_{Y_{t-1}}(-\alpha\, \Delta_t)$, where $\Delta_t$ is the Adam direction in the tangent space at $Y_{t-1}$.
This preserves manifold constraints and carries Adam’s adaptivity over to non-Euclidean geometries (Brantner, 2023).
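The project-then-retract pattern can be illustrated with a simplified sketch. It departs from the description above in two labeled ways: a QR retraction is substituted for the exponential map, and moments are re-projected onto the current tangent space instead of being transported. It is therefore not the method of Brantner (2023). All function names and the toy eigenvalue objective are assumptions.

```python
import numpy as np

def tangent_project(Y, Z):
    """Project Z onto the tangent space of the Stiefel manifold at Y (Y.T @ Y = I)."""
    YtZ = Y.T @ Z
    return Z - Y @ (YtZ + YtZ.T) / 2

def qr_retract(Y, xi):
    """Map Y + xi back onto the manifold via a sign-fixed thin QR factorization."""
    Q, Rf = np.linalg.qr(Y + xi)
    signs = np.sign(np.diag(Rf))
    signs[signs == 0] = 1.0
    return Q * signs

def stiefel_adam_step(Y, G, m, v, t, alpha=1e-2, beta1=0.9, beta2=0.999, eps=1e-8):
    xi = tangent_project(Y, G)                           # Riemannian gradient
    m = beta1 * tangent_project(Y, m) + (1 - beta1) * xi
    v = beta2 * v + (1 - beta2) * xi**2
    direction = (m / (1 - beta1**t)) / (np.sqrt(v / (1 - beta2**t)) + eps)
    direction = tangent_project(Y, direction)            # elementwise scaling can leave the tangent space
    return qr_retract(Y, -alpha * direction), m, v

rng = np.random.default_rng(0)
n, p = 6, 2
B = rng.standard_normal((n, n))
A = (B + B.T) / 2                                        # symmetric test matrix
Y, _ = np.linalg.qr(rng.standard_normal((n, p)))         # orthonormal starting point
m, v = np.zeros((n, p)), np.zeros((n, p))

objective = lambda Y: -0.5 * np.trace(Y.T @ A @ Y)       # minimized by the top-p eigenspace
f_start = objective(Y)
for t in range(1, 51):
    Y, m, v = stiefel_adam_step(Y, -A @ Y, m, v, t)
f_end = objective(Y)
orth_err = np.linalg.norm(Y.T @ Y - np.eye(p))
print(f"objective: {f_start:.3f} -> {f_end:.3f}, orthogonality error: {orth_err:.1e}")
```

The retraction keeps the iterate exactly orthonormal after every step, so the constraint never has to be enforced by a penalty term.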

4.4 Adaptively Rotated Optimization (ARO)

Maintain running momentum $M_t$ (matrix).
Compute lookahead step via base optimizer (e.g., Adam).
Select $R_t$ by maximizing a dual-norm proxy; implement via Cholesky–QR on Gram matrices.
Transform gradients and moments, and apply the update in the $R_t$-rotated basis.
Keep the per-iteration overhead low for matrix-shaped parameters, so the method stays compatible with LLM-training workloads.
Supports hierarchical/global/shared rotation schemes (Gong et al., 9 Feb 2026).

5. Theoretical and Empirical Properties

Rotational Adam methodologies inherit the convergence properties of standard Adam in the Euclidean or Riemannian sense, as all bias-correction, adaptation, and isometric mappings (rotations, eigen-bases) preserve the necessary convexity and bounded-gradient assumptions in online convex optimization (DePavia et al., 3 Feb 2025, Nguyen et al., 11 Feb 2025, Brantner, 2023). When the EGOP spectrum decays, or when curvature structure is block-diagonalizable, rotational approaches deliver substantial convergence acceleration and restore invariance. Edge cases where the curvature is isotropic (flat EGOP) do not benefit from rotation.

Empirical results include:

2x step reduction in LLaMA pretraining compared to Adam (Nguyen et al., 11 Feb 2025).
Restoration of “richness bias” in small-rotation ReLU nets (decision boundaries remain nonlinear and Bayes-optimal) (DePavia et al., 27 Oct 2025).
Elimination of axis-aligned artifacts in geometry and adversarial optimization (Ling et al., 2022).
Stable, memory-efficient integration with Adafactor yields similar performance at order-of-magnitude lower storage (Nguyen et al., 11 Feb 2025).
ARO delivers 1.3–1.35x step-speedup over AdamW and 1.1–1.15x over orthogonalization methods across a range of LLM scales, with controlled benchmarking (Gong et al., 9 Feb 2026).

6. Practical Considerations and Memory/Complexity

Rotational Adam variants incur additional costs:

EGOP or full-matrix approaches: $O(d^2)$ memory and at least $O(d^2)$ time per iteration for $d$-dimensional parameters (impractical for very large $d$).
Block-wise or low-rank schemes decrease per-step cost to $O(k b^2)$, with $k$ blocks of size $b$.
Periodic SVD/amortized QR rotation cost is small relative to forward/backward passes in large models, especially with windowed basis updates (Nguyen et al., 11 Feb 2025, Gong et al., 9 Feb 2026).
VectorAdam reduces the second-moment memory footprint for vector-structured blocks by a factor of $n$ (for $n$-dimensional parameter blocks), and slightly decreases computation by eliminating per-coordinate variances (Ling et al., 2022).
AdafacDiag maintains sublinear memory comparable to Adafactor, suitable for large-scale deployment.
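A back-of-envelope count makes the storage gap concrete. The helper below is hypothetical and counts only second-moment entries, ignoring first moments, rotation matrices, and implementation overhead; the parameter count, block size, and vector dimension are arbitrary illustrative choices.

```python
def second_moment_entries(d, scheme, block_size=None, vector_dim=None):
    """Number of stored second-moment scalars for d parameters under each scheme."""
    if scheme == "diagonal":                 # standard Adam: one entry per parameter
        return d
    if scheme == "full":                     # dense d x d matrix
        return d * d
    if scheme == "block":                    # block-diagonal: one b x b matrix per block
        n_blocks, remainder = divmod(d, block_size)
        if remainder:
            raise ValueError("d must be divisible by block_size")
        return n_blocks * block_size**2
    if scheme == "vector":                   # one scalar per vector-valued group
        n_vectors, remainder = divmod(d, vector_dim)
        if remainder:
            raise ValueError("d must be divisible by vector_dim")
        return n_vectors
    raise ValueError(f"unknown scheme: {scheme}")

d = 3_000_000
counts = {
    "diagonal": second_moment_entries(d, "diagonal"),
    "full": second_moment_entries(d, "full"),
    "block": second_moment_entries(d, "block", block_size=1000),
    "vector": second_moment_entries(d, "vector", vector_dim=3),
}
for name, n_entries in counts.items():
    print(f"{name:>8}: {n_entries:.3e} entries, {4 * n_entries / 1e9:.3f} GB at float32")
```

Even at a few million parameters the dense matrix is far outside practical memory budgets, while blocking brings the count back within a constant factor (the block size) of the diagonal baseline.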

7. Connections, Limitations, and Future Work

Rotational Adam aligns with a broader framework of symmetry-aware optimization, where natural group actions (e.g., rotations) leave the objective function invariant. Generalizations extend to:

Dynamic or per-layer rotation sharing; cross-module rotational symmetry exploitation.
Hybrid permutation-equivariant and gauge-invariant extensions.
Riemannian gradient and moment transport for non-Euclidean manifolds.
Adaptive rotation selection via data-driven or curvature-informed policies (e.g., via dual-norm maximization) (Gong et al., 9 Feb 2026).

Current limitations include:

Intractability of full-matrix preconditioning at extreme scale unless blocked/low-rank.
Diminishing returns as the eigenvalue spectrum of the second-moment matrix flattens.
The need for expert tuning of block granularity, SVD update periods, and memory-efficient representation in large transformer settings.

By exposing and counteracting Adam’s rotation pathologies, Rotational Adam optimizers provide a unified, theoretically grounded, and empirically validated path to efficient and robust adaptive optimization in modern large-scale learning (Maes et al., 2024, DePavia et al., 3 Feb 2025, DePavia et al., 27 Oct 2025, Nguyen et al., 11 Feb 2025, Ling et al., 2022, Gong et al., 9 Feb 2026).