Advantage of AOL preconditioning in high-iteration Newton–Schulz regimes

Determine whether Almost Orthogonal Layer (AOL) preconditioning provides any advantage when the Newton–Schulz orthogonalization is run with a high number of iterations (large t), specifically in terms of convergence behavior and approximation accuracy of the polar factor, compared to Frobenius normalization within orthogonality-based optimizers such as Muon and Turbo-Muon.

Background

The paper shows that AOL preconditioning significantly improves convergence and reduces polar error in low-iteration regimes (t ≤ 5), enabling the removal of one Newton–Schulz iteration without loss of accuracy and yielding runtime benefits. It also decomposes the remaining error into an approximation term and a bias term introduced by AOL, and proves the preconditioned update remains a strict descent direction.

However, the authors note that when the iteration count is large, the bias introduced by AOL persists, while the practical benefit of preconditioning may be absorbed by the additional iterations. They explicitly state that it is unclear whether AOL preconditioning confers any advantage in high-t regimes, leaving this as an unresolved question.
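
One way to probe the question numerically is to compare the two normalizations at increasing iteration counts. The sketch below is only an illustration under stated assumptions, not the paper's implementation: it uses the classical cubic Newton–Schulz update X ← 1.5X − 0.5XXᵀX rather than the tuned polynomial coefficients used in Muon or Turbo-Muon, it takes the AOL rescaling to be the column scaling d_j = (Σ_i |WᵀW|_ji)^(−1/2) from the Almost Orthogonal Layer construction, and all function names are hypothetical. The polar error is measured against the exact polar factor of the original matrix W, so any bias from the rescaling shows up in the AOL error.

```python
import numpy as np

def frobenius_normalize(W):
    # Baseline: divide by the Frobenius norm so all singular values are <= 1.
    return W / np.linalg.norm(W)

def aol_precondition(W):
    # Assumed AOL-style rescaling: column j is scaled by
    # d_j = (sum_i |W^T W|_{ji})^(-1/2), which bounds the spectral norm by 1.
    gram = W.T @ W
    d = 1.0 / np.sqrt(np.abs(gram).sum(axis=1))
    return W * d  # broadcasting scales column j by d[j]

def newton_schulz(X, t):
    # Classical cubic Newton-Schulz iteration (an assumption; not the
    # tuned coefficients used by Muon or Turbo-Muon).
    for _ in range(t):
        X = 1.5 * X - 0.5 * X @ (X.T @ X)
    return X

def polar_factor(W):
    # Exact polar factor U V^T from the thin SVD, used as the reference.
    U, _, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ Vt

def polar_error(W, X):
    # Frobenius distance between X and the polar factor of the ORIGINAL W.
    return np.linalg.norm(X - polar_factor(W))

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 6))  # small synthetic "gradient" matrix

for t in (3, 5, 10, 40):
    err_frob = polar_error(W, newton_schulz(frobenius_normalize(W), t))
    err_aol = polar_error(W, newton_schulz(aol_precondition(W), t))
    print(f"t={t:2d}  frobenius={err_frob:.3e}  aol={err_aol:.3e}")
```

On a single small random matrix this only shows how the two error curves can be compared as t grows; it says nothing conclusive about training behavior, which is the part the authors leave open.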

References

While it is unclear if AOL preconditioning confers an advantage in high $t$ regimes.

Turbo-Muon: Accelerating Orthogonality-Based Optimization with Pre-Conditioning (2512.04632 - Boissin et al., 4 Dec 2025) in Section “Beyond Iterative Approximates: On the Asymptotic Behavior of Turbo-Muon”, paragraph “Confirming That the Estimation Bias Does Not Affect Training”