Advantage of AOL preconditioning in high-iteration Newton–Schulz regimes
Determine whether Almost Orthogonal Layer (AOL) preconditioning provides any advantage over Frobenius normalization when Newton–Schulz orthogonalization is run for many iterations (large $t$), measured by convergence behavior and by how accurately the polar factor is approximated, within orthogonality-based optimizers such as Muon and Turbo-Muon.
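One way to probe the question numerically is sketched below. It is a minimal illustration under stated assumptions, not the procedure used in the Turbo-Muon paper. The assumptions are: the classical cubic Newton–Schulz iteration $X \leftarrow \tfrac{3}{2}X - \tfrac{1}{2}XX^\top X$ stands in for the tuned polynomial iterations used in practice; AOL preconditioning is taken to be the column rescaling $d_j = \big(\sum_i |G^\top G|_{ji}\big)^{-1/2}$; and the function names (`frobenius_normalize`, `aol_precondition`, `newton_schulz`, `compare`) are hypothetical. The sketch reports two errors for each iteration count $t$: the departure from orthogonality, and the distance to the polar factor of the original matrix computed by SVD. Because a column rescaling generally changes the polar factor, the two errors can behave differently for the AOL branch.

```python
import numpy as np

def frobenius_normalize(G):
    # Divide by the Frobenius norm, which bounds the spectral norm by 1.
    return G / (np.linalg.norm(G) + 1e-12)

def aol_precondition(G):
    # Assumed AOL-style rescaling: column j is scaled by
    # d_j = (sum_i |G^T G|_{ji})^(-1/2), which also bounds the spectral norm by 1.
    A = np.abs(G.T @ G)
    d = 1.0 / np.sqrt(A.sum(axis=1) + 1e-12)
    return G * d  # broadcasting scales column j by d[j]

def newton_schulz(X, t):
    # Classical cubic Newton-Schulz iteration (an assumption of this sketch,
    # not the tuned coefficients used by Muon-style optimizers).
    for _ in range(t):
        X = 1.5 * X - 0.5 * X @ (X.T @ X)
    return X

def polar_factor(G):
    # Reference polar factor U V^T from a thin SVD.
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def compare(G, iteration_counts):
    # Returns tuples (t, name, orthogonality error, distance to polar factor of G).
    Q_ref = polar_factor(G)
    eye = np.eye(G.shape[1])
    rows = []
    for t in iteration_counts:
        for name, pre in (("frobenius", frobenius_normalize),
                          ("aol", aol_precondition)):
            Q = newton_schulz(pre(G), t)
            orth_err = np.linalg.norm(Q.T @ Q - eye)
            polar_err = np.linalg.norm(Q - Q_ref)
            rows.append((t, name, orth_err, polar_err))
    return rows

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    G = rng.standard_normal((32, 16))  # tall matrix, so Q^T Q should approach I
    for t, name, orth_err, polar_err in compare(G, [3, 5, 10, 50]):
        print(f"t={t:3d}  {name:9s}  orth_err={orth_err:.3e}  polar_err={polar_err:.3e}")
```

A single random Gaussian matrix is only a toy setting. Gradient matrices met during training may have very different spectra, so a run like this can suggest what to measure but does not settle the question.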
References
While it is unclear if AOL preconditioning confers an advantage in high $t$ regimes.
— Turbo-Muon: Accelerating Orthogonality-Based Optimization with Pre-Conditioning
(2512.04632 - Boissin et al., 4 Dec 2025) in Section “Beyond Iterative Approximates: On the Asymptotic Behavior of Turbo-Muon”, paragraph “Confirming That the Estimation Bias Does Not Affect Training”