
Rotational Adam Optimizer

Updated 6 May 2026
  • The Rotational Adam Optimizer restores rotation equivariance by modifying second moment adaptation to be insensitive to orthogonal transformations.
  • It integrates techniques such as full-matrix adaptation, dynamic basis diagonalization, and vector-grouped updates to align parameter updates with inherent data structures.
  • Empirical results demonstrate improved convergence metrics and reduced training steps in transformer and vision architectures compared to standard Adam.

Rotational Adam Optimizer is an extension of adaptive gradient methods that seeks to restore or enforce rotation-equivariance—insensitivity to orthogonal changes of basis—in Adam-style preconditioning. Standard Adam operates with strictly per-coordinate second-moment adaptation, which breaks rotation equivariance and can result in significant algorithmic artifacts, degraded training speed, and lost generalization benefits when the parameterization or data undergoes orthogonal transformation. Recent work has developed both theoretically principled and pragmatically scalable forms of Rotational Adam, including matrix preconditioner diagonalization, expected gradient outer product reparameterization, symmetry-aware adaptation, Riemannian manifold generalizations, and vector-grouped updates.

1. Failure Modes of Standard Adam under Rotations

Adam maintains running averages of the first and second moments of stochastic gradients for each parameter entry, updating as

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\, \nabla f(\theta_{t-1}), \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)\, [\nabla f(\theta_{t-1})]^2,$$

with elementwise operations. The parameter update is

$$\theta_t = \theta_{t-1} - \alpha \frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon}.$$

This formulation is sensitive to the coordinate basis. Under an arbitrary rotation $R$, the first moment transforms as $m_t \mapsto R m_t$, but the second moment $v_t$ does not generally map to $R v_t$ because of the elementwise square, and the elementwise division does not commute with $R$. Standard SGD, by contrast, is fully rotation-equivariant. Empirically, Adam's performance (both convergence speed and implicit bias) degrades under random, layerwise, or global basis rotations in transformer and vision architectures, with the degree of slowdown correlated with the scale and type of rotation. For instance, global rotations can increase GPT-2 training time by ~16% and decrease ViT/S ImageNet-1K convergence rate by up to 96%, while ResNet-50 is robust to such transformations (Maes et al., 2024).
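The basis sensitivity described above can be checked numerically. The following is a minimal sketch, assuming a single bias-corrected Adam step from zero-initialized moments and a 45-degree rotation in two dimensions; the function names `adam_first_step` and `sgd_step` are illustrative and do not come from any of the cited papers.

```python
import numpy as np

def adam_first_step(g, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    """Displacement of the first bias-corrected Adam step for gradient g."""
    m = (1 - b1) * g              # first moment from zero init
    v = (1 - b2) * g ** 2         # elementwise second moment from zero init
    m_hat = m / (1 - b1)          # bias correction at t = 1
    v_hat = v / (1 - b2)
    return -lr * m_hat / (np.sqrt(v_hat) + eps)

def sgd_step(g, lr=0.1):
    """Displacement of a plain SGD step."""
    return -lr * g

angle = np.pi / 4
R = np.array([[np.cos(angle), -np.sin(angle)],
              [np.sin(angle),  np.cos(angle)]])
g = np.array([1.0, 0.0])          # gradient in the original basis

# Rotating the parameters rotates the gradient: g -> R g.
# An equivariant optimizer must then produce R times the original step.
sgd_equivariant = np.allclose(sgd_step(R @ g), R @ sgd_step(g))
adam_equivariant = np.allclose(adam_first_step(R @ g), R @ adam_first_step(g))

print("SGD equivariant: ", sgd_equivariant)
print("Adam equivariant:", adam_equivariant)
```

In this toy setting the first Adam step is essentially the sign of the gradient, so the rotated problem moves along the diagonal with a different step length than the rotated original step, while SGD commutes with $R$ exactly.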

2. Mathematical Foundations of Rotation Equivariance

An optimizer $A$ is rotation-equivariant if for any orthogonal $R$ and all $t$,

$$\tilde\theta_t = R\,\theta_t,$$

where $\theta_t$ are the iterates of $A$ on $f$ started from $\theta_0$, and $\tilde\theta_t$ are the iterates of $A$ on the rotated objective $\tilde f(\theta) = f(R^\top \theta)$ started from $\tilde\theta_0 = R\,\theta_0$. Adam's coordinatewise adaptation (division by $\sqrt{\hat v_t}$) violates this property because, in general, $\frac{R\,\hat m_t}{\sqrt{\hat{\tilde v}_t}} \neq R\,\frac{\hat m_t}{\sqrt{\hat v_t}}$, where $\hat{\tilde v}_t$ is the second moment accumulated from the rotated gradients; the update trajectory depends on the choice of parameter basis (Maes et al., 2024, Ling et al., 2022).

Several approaches have been developed to achieve rotation-equivariance by restructuring Adam’s adaptation, including:

  1. Full-matrix second moment adaptation:

$$V_t = \beta_2 V_{t-1} + (1-\beta_2)\, g_t g_t^\top, \qquad g_t = \nabla f(\theta_{t-1}),$$

$$\theta_t = \theta_{t-1} - \alpha\, V_t^{-1/2} m_t.$$

(This update is formally rotation-equivariant: under $g_t \mapsto R g_t$ the matrix transforms as $V_t \mapsto R V_t R^\top$, so the step maps to $R$ times the original step.)

  2. Dynamic basis diagonalization: Diagonalize $V_t$ as $V_t = Q_t \Lambda_t Q_t^\top$, then update in the rotated basis, mapping back to original coordinates. This generalizes SVD-based rotation strategies to full preconditioning (Maes et al., 2024, Nguyen et al., 11 Feb 2025).
  3. Block-wise or vector-grouped moments: Aggregate second moments over structured subsets (e.g., rows, channels, global vectors), yielding equivariance and eliminating axis artifacts (Ling et al., 2022).
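The equivariance of the full-matrix variant in the list above can be verified on a small example. This is a sketch under stated assumptions: bias correction is omitted, a small damping `eps` is added to the eigenvalues before the inverse square root, and the helper names `inv_sqrt_psd` and `full_matrix_step` are hypothetical.

```python
import numpy as np

def inv_sqrt_psd(V, eps=1e-8):
    """Inverse square root of a symmetric PSD matrix via eigendecomposition."""
    w, Q = np.linalg.eigh(V)
    # Q diag(1 / sqrt(w + eps)) Q^T; eps is an assumed damping term.
    return (Q / np.sqrt(w + eps)) @ Q.T

def full_matrix_step(grads, lr=0.1, b1=0.9, b2=0.999):
    """Displacement after accumulating a sequence of gradients (rows of grads)."""
    d = grads.shape[1]
    m, V = np.zeros(d), np.zeros((d, d))
    for g in grads:
        m = b1 * m + (1 - b1) * g
        V = b2 * V + (1 - b2) * np.outer(g, g)   # full second-moment matrix
    return -lr * inv_sqrt_psd(V) @ m

rng = np.random.default_rng(0)
grads = rng.standard_normal((10, 3))
R, _ = np.linalg.qr(rng.standard_normal((3, 3)))   # random orthogonal matrix

step = full_matrix_step(grads)
step_rot = full_matrix_step(grads @ R.T)           # every gradient mapped g -> R g
equivariant = np.allclose(step_rot, R @ step)
print("full-matrix step equivariant:", equivariant)
```

The damping term does not break the property, since $R (V + \epsilon I) R^\top = R V R^\top + \epsilon I$.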

3. Rotation-Equivariant Adam Variants

A variety of practical rotational strategies for Adam optimization have emerged:

3.1 EGOP-Reparameterized (Covariance-Aligned) Adam

Using the Expected Gradient Outer Product (EGOP) matrix,

$$\mathrm{EGOP}(f) = \mathbb{E}_{\theta \sim \rho}\left[\nabla f(\theta)\, \nabla f(\theta)^\top\right],$$

with eigendecomposition $\mathrm{EGOP}(f) = V \Lambda V^\top$, where $\rho$ is a sampling distribution over parameters, a fixed orthonormal rotation $V$ is applied to parameters, mapping $\theta \mapsto \tilde\theta = V^\top \theta$. The Adam update is performed in $\tilde\theta$-space and mapped back. This approach is rotation-equivariant and exploits the dominant principal gradient directions present in tasks with spectral decay, yielding improved step efficiency and insensitivity to parameterization (DePavia et al., 3 Feb 2025, DePavia et al., 27 Oct 2025).

3.2 Adaptive Preconditioner Diagonalization (AdaDiag++/Rotational Adam)

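The empirical gradient covariance $C_t = G_t G_t^\top$ is diagonalized via periodic SVD (on matrix-shaped parameters),

$$C_t = P_t \Sigma_t P_t^\top,$$

and parameter updates are performed in the basis diagonalizing $C_t$, with diagonal preconditioning, and mapped back:

$$W_t = W_{t-1} - \alpha\, P_t\, \frac{\hat{\tilde M}_t}{\sqrt{\hat{\tilde V}_t} + \epsilon}, \qquad \tilde G_t = P_t^\top G_t,$$

where $\tilde M_t$ and $\tilde V_t$ are the Adam moments of the rotated gradient $\tilde G_t$.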

This yields large improvements in convergence metrics for large-scale language and vision models, with significant step and epoch reductions (LLaMA pretraining: 2x fewer steps vs Adam; ResNet/ImageNet-1K: 1.2–1.5x fewer epochs) (Nguyen et al., 11 Feb 2025).

3.3 VectorAdam

VectorAdam groups vector-valued parameters (e.g., 3D coordinates, neural features) and computes a scalar second moment for each block $i$ from the gradient $g_t^{(i)}$ of that block,

$$v_t^{(i)} = \beta_2 v_{t-1}^{(i)} + (1-\beta_2)\, \lVert g_t^{(i)} \rVert_2^2,$$

removing axis-aligned update artifacts in geometric learning and adversarial point-cloud optimization (Ling et al., 2022).
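A minimal sketch of the vector-grouped update may help make this concrete. It assumes parameters stored as an `(n, k)` array of `n` vectors of dimension `k`, with one scalar second moment per vector; the function name `vector_adam_step` and the helper `run` are illustrative rather than taken from the reference implementation.

```python
import numpy as np

def vector_adam_step(x, g, m, v, t, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    """One update; x, g, m have shape (n, k), v has shape (n, 1)."""
    m = b1 * m + (1 - b1) * g
    # Squared Euclidean norm per vector instead of per-coordinate squares.
    v = b2 * v + (1 - b2) * np.sum(g ** 2, axis=1, keepdims=True)
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    x = x - lr * m_hat / (np.sqrt(v_hat) + eps)
    return x, m, v

def run(x, grads):
    """Apply a fixed sequence of gradients, starting from zero moments."""
    m = np.zeros_like(x)
    v = np.zeros((x.shape[0], 1))
    for t, g in enumerate(grads, start=1):
        x, m, v = vector_adam_step(x, g, m, v, t)
    return x

rng = np.random.default_rng(0)
R, _ = np.linalg.qr(rng.standard_normal((3, 3)))   # random orthogonal matrix
x0 = rng.standard_normal((4, 3))                   # four 3D points
grads = rng.standard_normal((5, 4, 3))             # five steps of gradients

x_plain = run(x0, grads)
x_rot = run(x0 @ R.T, grads @ R.T)                 # rotate every point and gradient
equivariant = np.allclose(x_rot, x_plain @ R.T)
print("vector-grouped update equivariant:", equivariant)
```

Because the norm of each gradient vector is unchanged by $R$, the scalar second moments coincide in both runs and the trajectories differ only by the rotation.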

3.4 Adaptive and Symmetry-Aware Rotational Policies (ARO)

ARO (Adaptively Rotated Optimization) introduces an adaptive rotation for each parameter matrix, selecting $R_t$ via a Procrustes-QR policy to maximize a dual-norm loss decrease proxy, then performing the Adam-style or alternative normed descent in this rotated basis. ARO unifies and extends prior “eigen-rotation,” “SOAP,” and “Muon” schemes, and delivers up to 1.3–1.35x step-speedup over AdamW at LLM scales with negligible added wall-clock cost (Gong et al., 9 Feb 2026).

4. Algorithms and Implementation

4.1 EGOP-Reparameterized Rotational Adam

  • Estimate the EGOP matrix $\widehat{\mathrm{EGOP}}$ by sampling gradients under distribution $\rho$.
  • Compute the eigendecomposition $\widehat{\mathrm{EGOP}} = V \Lambda V^\top$; set the rotation to $V$.
  • Transform initial parameters: $\tilde\theta_0 = V^\top \theta_0$.
  • For $t = 1, \dots, T$:
    • $g_t = \nabla f(V \tilde\theta_{t-1})$; $\tilde g_t = V^\top g_t$.
    • Update Adam moments in $\tilde\theta$-space: $m_t = \beta_1 m_{t-1} + (1-\beta_1)\, \tilde g_t$, $v_t = \beta_2 v_{t-1} + (1-\beta_2)\, \tilde g_t^2$.
    • Bias-correct, step update: $\tilde\theta_t = \tilde\theta_{t-1} - \alpha\, \hat m_t / (\sqrt{\hat v_t} + \epsilon)$.
    • Map back: $\theta_t = V \tilde\theta_t$.
  • Use block-wise or low-rank $V$ for tractability at scale (DePavia et al., 3 Feb 2025, DePavia et al., 27 Oct 2025).
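The steps above can be sketched on a toy problem. This is an illustration under several assumptions that are not in the source: the objective is a small non-axis-aligned quadratic, the sampling distribution is a standard normal, the EGOP is estimated by plain Monte Carlo, and the names `estimate_egop` and `egop_adam` are hypothetical.

```python
import numpy as np

def estimate_egop(grad_fn, d, n_samples, rng):
    """Monte Carlo EGOP estimate with theta ~ N(0, I) (assumed distribution)."""
    E = np.zeros((d, d))
    for _ in range(n_samples):
        g = grad_fn(rng.standard_normal(d))
        E += np.outer(g, g)
    return E / n_samples

def egop_adam(grad_fn, theta0, V, steps, lr=0.05, b1=0.9, b2=0.999, eps=1e-8):
    """Adam run in the rotated coordinates z = V^T theta, then mapped back."""
    z = V.T @ theta0
    m, v = np.zeros_like(z), np.zeros_like(z)
    for t in range(1, steps + 1):
        g = V.T @ grad_fn(V @ z)               # gradient in rotated coordinates
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g ** 2
        m_hat = m / (1 - b1 ** t)
        v_hat = v / (1 - b2 ** t)
        z = z - lr * m_hat / (np.sqrt(v_hat) + eps)
    return V @ z                               # map back to original coordinates

rng = np.random.default_rng(0)
d = 5
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
H = Q @ np.diag([10.0, 3.0, 1.0, 0.3, 0.1]) @ Q.T   # rotated, decaying spectrum

def grad_fn(theta):
    return H @ theta

def loss(theta):
    return 0.5 * theta @ H @ theta

E = estimate_egop(grad_fn, d, n_samples=2000, rng=rng)
_, V = np.linalg.eigh(E)                       # columns are the rotation basis
theta0 = 2.0 * np.ones(d)
theta_T = egop_adam(grad_fn, theta0, V, steps=500)
print("loss before:", loss(theta0), "loss after:", loss(theta_T))
```

For this quadratic the EGOP under a standard normal is $H^2$, so its eigenvectors approximately diagonalize the curvature and the per-coordinate scaling acts along the principal directions. At scale a dense $V$ would be replaced by a block-wise or low-rank factor.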

4.2 Preconditioner Diagonalization (AdaDiag++)

  • Aggregate $C_t$ (second-moment estimator) as a moving average of gradient outer products.
  • Periodically diagonalize $C_t$ via SVD (reshaping gradient), yielding rotation $P_t$.
  • Apply Adam updates in rotated space, then inverse-rotate updates back.
  • AdafacDiag integrates with Adafactor for memory efficiency; only row/column second-moment statistics are needed (Nguyen et al., 11 Feb 2025).
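A simplified one-sided version of this procedure can be sketched as follows. It is not the authors' implementation: the least-squares objective, the refresh period, the choice to keep the first moment in original coordinates, and the re-expression of the second moment on each basis refresh are all assumptions of this sketch, as is the name `rotated_adam`.

```python
import numpy as np

def rotated_adam(grad_fn, W, steps, period=10, lr=0.05,
                 b1=0.9, b2=0.999, eps=1e-8):
    rows = W.shape[0]
    C = np.zeros((rows, rows))     # moving average of G G^T
    P = np.eye(rows)               # current rotation (left singular vectors of C)
    M = np.zeros_like(W)           # first moment, kept in original coordinates
    V = np.zeros_like(W)           # second moment, kept in rotated coordinates
    for t in range(1, steps + 1):
        G = grad_fn(W)
        C = b2 * C + (1 - b2) * (G @ G.T)
        M = b1 * M + (1 - b1) * G
        if (t - 1) % period == 0:  # periodic basis refresh
            P_new, _, _ = np.linalg.svd(C)
            T = P_new.T @ P        # change of basis from old to new rotation
            V = (T ** 2) @ V       # assumed transport of the diagonal statistic
            P = P_new
        G_rot = P.T @ G
        V = b2 * V + (1 - b2) * G_rot ** 2
        m_hat = (P.T @ M) / (1 - b1 ** t)
        v_hat = V / (1 - b2 ** t)
        W = W - lr * (P @ (m_hat / (np.sqrt(v_hat) + eps)))   # rotate step back
    return W, P

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 4))
W_true = rng.standard_normal((4, 3))
B = A @ W_true

def grad_fn(W):
    return A.T @ (A @ W - B) / A.shape[0]

def loss(W):
    return 0.5 * np.sum((A @ W - B) ** 2) / A.shape[0]

W0 = np.zeros((4, 3))
W_T, P = rotated_adam(grad_fn, W0, steps=300)
print("loss before:", loss(W0), "loss after:", loss(W_T))
```

Keeping the first moment in original coordinates avoids stale momentum when the basis changes; a two-sided variant would additionally rotate the column space of the gradient.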

4.3 Riemannian/Stiefel Manifold Rotational Adam

  • Targets optimization on orthogonality-constrained spaces (e.g., the Stiefel manifold).
  • Project gradients and accumulate moments in the intrinsic tangent space.
  • Retract by computing an exponential-map update: $Y_t = \exp_{Y_{t-1}}(-\alpha\, \Delta_t)$, where $\Delta_t$ denotes the Adam step formed from the tangent-space moments.
  • Preserves manifold constraints and carries Adam's adaptivity over to non-Euclidean geometries (Brantner, 2023).
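The project, accumulate, and retract pattern above can be illustrated with a much simpler scheme than the one in the cited work. In this sketch a QR retraction is swapped in for the exponential map, the second moment is a single scalar so that the step stays in the tangent space, and the first moment is re-projected at each iterate as a crude transport. All function names are hypothetical.

```python
import numpy as np

def sym(A):
    return 0.5 * (A + A.T)

def project_tangent(Y, G):
    """Project G onto the tangent space of the Stiefel manifold at Y."""
    return G - Y @ sym(Y.T @ G)

def qr_retract(Y):
    """Map a matrix back to orthonormal columns (QR retraction, not exp map)."""
    Q, R = np.linalg.qr(Y)
    return Q * np.sign(np.diag(R))     # fix column signs for uniqueness

def stiefel_adam(grad_fn, Y, steps, lr=0.05, b1=0.9, b2=0.999, eps=1e-8):
    M = np.zeros_like(Y)
    v = 0.0                            # scalar second moment (assumption)
    for t in range(1, steps + 1):
        G = project_tangent(Y, grad_fn(Y))
        M = b1 * project_tangent(Y, M) + (1 - b1) * G
        v = b2 * v + (1 - b2) * np.sum(G ** 2)
        step = (M / (1 - b1 ** t)) / (np.sqrt(v / (1 - b2 ** t)) + eps)
        Y = qr_retract(Y - lr * step)
    return Y

rng = np.random.default_rng(0)
B = rng.standard_normal((6, 6))
A = B + B.T                            # symmetric test matrix

def grad_fn(Y):
    return -2.0 * A @ Y                # gradient of -trace(Y^T A Y)

def loss(Y):
    return -np.trace(Y.T @ A @ Y)

Y0 = qr_retract(rng.standard_normal((6, 2)))
Y_T = stiefel_adam(grad_fn, Y0, steps=200)
print("loss before:", loss(Y0), "loss after:", loss(Y_T))
print("orthonormal:", np.allclose(Y_T.T @ Y_T, np.eye(2)))
```

The iterate stays on the manifold by construction, since every update ends with a retraction.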

4.4 Adaptively Rotated Optimization (ARO)

  • Maintain running momentum $M_t$ (matrix).
  • Compute lookahead step via base optimizer (e.g., Adam).
  • Select $R_t$ by maximizing a dual-norm proxy; implement via Cholesky–QR on Gram matrices.
  • Transform gradients and moments, and apply the update in the $R_t$-rotated basis.
  • Keep per-iteration overhead low for matrix-shaped parameters, compatible with LLM-training workloads.
  • Supports hierarchical/global/shared rotation schemes (Gong et al., 9 Feb 2026).

5. Theoretical and Empirical Properties

Rotational Adam methodologies inherit the convergence properties of standard Adam in the Euclidean or Riemannian sense, as all bias-correction, adaptation, and isometric mappings (rotations, eigen-bases) preserve the necessary convexity and bounded-gradient assumptions in online convex optimization (DePavia et al., 3 Feb 2025, Nguyen et al., 11 Feb 2025, Brantner, 2023). When the EGOP spectrum decays, or when curvature structure is block-diagonalizable, rotational approaches deliver substantial convergence acceleration and restore invariance. Edge cases where the curvature is isotropic (flat EGOP) do not benefit from rotation.

Empirical results include:

  • 2x step reduction in LLaMA pretraining compared to Adam (Nguyen et al., 11 Feb 2025).
  • Restoration of “richness bias” in small-rotation ReLU nets (decision boundaries remain nonlinear and Bayes-optimal) (DePavia et al., 27 Oct 2025).
  • Elimination of axis-aligned artifacts in geometry and adversarial optimization (Ling et al., 2022).
  • Stable, memory-efficient integration with Adafactor yields similar performance at order-of-magnitude lower storage (Nguyen et al., 11 Feb 2025).
  • ARO delivers 1.3–1.35x step-speedup over AdamW and 1.1–1.15x over orthogonalization methods across a range of LLM scales, with controlled benchmarking (Gong et al., 9 Feb 2026).

6. Practical Considerations and Memory/Complexity

Rotational Adam variants incur additional costs:

  • EGOP or full-matrix approaches: $O(d^2)$ per iteration for $d$-dimensional parameters (impractical for very large $d$).
  • Block-wise or low-rank schemes decrease per-step cost to $O(k b^2)$ for $k$ blocks of size $b$.
  • Periodic SVD/amortized QR rotation cost is small relative to forward/backward passes in large models, especially with windowed basis updates (Nguyen et al., 11 Feb 2025, Gong et al., 9 Feb 2026).
  • VectorAdam reduces the second-moment memory footprint for vector-structured blocks by a factor of $k$ (for $k$-dimensional parameter blocks), and slightly decreases computation by eliminating per-coordinate variances (Ling et al., 2022).
  • AdafacDiag maintains sublinear memory comparable to Adafactor, suitable for large-scale deployment.
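The relative storage costs in the list above can be made concrete with a small counting sketch. The counts below are assumptions of this illustration (one float per stored statistic, an `m x n` weight matrix, a first moment kept in every scheme) and the function name `optimizer_state_floats` is hypothetical; they are not figures reported in the cited papers.

```python
def optimizer_state_floats(m, n, k=3):
    """Approximate optimizer-state size, in floats, for an m x n weight matrix."""
    d = m * n
    return {
        # First and second moment per coordinate.
        "adam": 2 * d,
        # First moment per coordinate, one second moment per k-vector.
        "vector_adam": d + d // k,
        # First moment plus a dense d x d second-moment matrix.
        "full_matrix": d + d * d,
        # First moment plus row and column statistics (Adafactor-style).
        "factored_second_moment": d + m + n,
        # Adam state plus an m x m covariance and an m x m rotation.
        "one_sided_rotation": 2 * d + 2 * m * m,
    }

sizes = optimizer_state_floats(m=6, n=4, k=3)
for name, count in sizes.items():
    print(f"{name:>24}: {count}")
```

For realistic layer sizes the dense `full_matrix` entry dominates by orders of magnitude, which is why blocked, low-rank, or factored statistics are used in practice.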

7. Connections, Limitations, and Future Work

Rotational Adam aligns with a broader framework of symmetry-aware optimization, where natural group actions (e.g., rotations) leave the objective function invariant. Generalizations extend to:

  • Dynamic or per-layer rotation sharing; cross-module rotational symmetry exploitation.
  • Hybrid permutation-equivariant and gauge-invariant extensions.
  • Riemannian gradient and moment transport for non-Euclidean manifolds.
  • Adaptive rotation selection via data-driven or curvature-informed policies (e.g., via dual-norm maximization) (Gong et al., 9 Feb 2026).

Current limitations include:

  • Intractability of full-matrix preconditioning at extreme scale unless blocked/low-rank.
  • Diminishing returns as the eigenvalue spectrum of the second-moment matrix flattens.
  • The need for expert tuning of block granularity, SVD update periods, and memory-efficient representation in large transformer settings.

By exposing and counteracting Adam's rotation pathologies, Rotational Adam optimizers provide a unified, theoretically grounded, and empirically validated path to efficient and robust adaptive optimization in modern large-scale learning (Maes et al., 2024, DePavia et al., 3 Feb 2025, DePavia et al., 27 Oct 2025, Nguyen et al., 11 Feb 2025, Ling et al., 2022, Gong et al., 9 Feb 2026).
