
Newton–Schulz Orthogonalization

Updated 10 March 2026
  • Newton–Schulz orthogonalization is an iterative algorithm that leverages matrix-polynomial methods to extract the orthogonal factor without explicit QR or SVD.
  • It achieves quadratic convergence using only matrix–matrix multiplications, which makes it highly efficient for large-scale and GPU-accelerated applications.
  • Key applications include eigensolvers and advanced optimization methods in machine learning, with refinements using polynomial acceleration and preconditioning.

Newton–Schulz Orthogonalization refers to a class of iterative, matrix-polynomial algorithms designed to efficiently orthogonalize matrices, particularly in scenarios where explicit QR factorization or singular value decomposition (SVD) is computationally prohibitive. By leveraging only matrix–matrix multiplications, the Newton–Schulz (NS) scheme is highly suitable for large-scale problems, GPU acceleration, and applications requiring repeated or controlled approximate orthogonalization. Key domains of application include eigensolvers, modern stochastic and geometry-aware optimizers (such as Muon), and Riemannian optimization on matrix manifolds.

1. Mathematical Formulation and Derivation

The Newton–Schulz orthogonalization process is rooted in the polar decomposition of a nonsingular matrix $A \in \mathbb{R}^{n \times n}$, which admits the unique factorization

A = H P

where $H$ is orthogonal ($H^T H = I$) and $P = (A^T A)^{1/2}$ is symmetric positive definite. In terms of the SVD $A = U \Sigma V^T$, the factors are $H = U V^T$ and $P = V \Sigma V^T$, since $U \Sigma V^T = (U V^T)(V \Sigma V^T)$.

To extract the orthogonal factor $H$ without an explicit SVD, NS applies Newton's method to the matrix equation $F(X) = X^T X - I = 0$:

X_{k+1} = \frac{1}{2}\left(X_k + X_k^{-T}\right)

This iteration converges quadratically to $H$ for nonsingular $A$. To avoid the explicit inverse, a Schulz-type approximation replaces $X_k^{-T}$ with the polynomial $X_k(2I - X_k^T X_k)$; substituting this into the Newton step delivers the update

X_{k+1} = \frac{1}{2}\, X_k \left(3I - X_k^T X_k\right)

or equivalently,

X_{k+1} = \frac{3}{2}\, X_k - \frac{1}{2}\, X_k X_k^T X_k

Quadratic convergence is preserved as long as the singular values of $A$ lie inside $(0, \sqrt{3})$ (equivalently, the spectrum of $X_k^T X_k$ lies inside $(0, 3)$), and convergence is rapid once the singular values of $A$ are well scaled (Zhou, 30 Aug 2025).
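
As a concrete illustration, the following minimal NumPy sketch implements the cubic iteration above; the function name, the Frobenius-norm prescaling, and the test matrix are illustrative choices rather than details taken from the cited papers.

```python
import numpy as np

def newton_schulz(A, num_iters=10):
    """Approximate the orthogonal polar factor of a square matrix A via
    the cubic Newton-Schulz iteration X <- X (3I - X^T X) / 2."""
    # Prescale so all singular values land inside (0, sqrt(3));
    # dividing by the Frobenius norm is a cheap sufficient choice.
    X = A / np.linalg.norm(A, "fro")
    I = np.eye(A.shape[1])
    for _ in range(num_iters):
        X = X @ (3.0 * I - X.T @ X) / 2.0
    return X

# Test matrix with known singular values 0.6 ... 1.2 (condition number 2).
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((4, 4)))
V, _ = np.linalg.qr(rng.standard_normal((4, 4)))
A = U @ np.diag([0.6, 0.8, 1.0, 1.2]) @ V.T
H = newton_schulz(A)
print(np.linalg.norm(H.T @ H - np.eye(4)))  # orthogonality residual near machine epsilon
```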

2. Algorithmic Implementation and GPU Efficiency

Newton–Schulz orthogonalization is particularly attractive for large matrices and GPU contexts because it relies solely on matrix–matrix products, with no data-dependent branching or explicit orthonormalization. The practical mixed-precision eigensolver in (Zhou, 30 Aug 2025) provides MATLAB pseudocode that directly implements the above iteration, requiring only 2–3 matrix multiplications for double-precision accuracy. Similarly, in Muon-style optimizer kernels, each NS step is implemented as a series of batched fused GEMMs, enabling hardware-level acceleration (Boissin et al., 4 Dec 2025, Grishina et al., 12 Jun 2025, Zhou, 30 Aug 2025).

Polynomial variants, such as fixed-degree (cubic/quintic) and Chebyshev-optimized schemes, can be constructed to further tailor accuracy/cost tradeoffs, with higher-degree polynomials produced using a Remez algorithm for best approximation on the target spectral interval (Grishina et al., 12 Jun 2025). This enhances convergence per matrix-multiply, at the cost of increased per-iteration operations.
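
As a sketch of one possible realization (not the exact kernels of the cited works), a degree-5 step can be written with three GEMMs; the default coefficients below are the classical Taylor-based quintic $(15x - 10x^3 + 3x^5)/8$, while Muon-style kernels and CANS substitute re-tuned or Remez-optimized coefficients in the same slots.

```python
import numpy as np

def ns_quintic_step(X, a=15/8, b=-10/8, c=3/8):
    """One degree-5 Newton-Schulz step X <- a*X + b*(X X^T)X + c*(X X^T)^2 X.
    Defaults are the classical Taylor coefficients (15, -10, 3)/8; tuned
    coefficients (e.g. Remez/Chebyshev-optimized ones) drop into a, b, c."""
    S = X @ X.T                  # Gram matrix: 1st GEMM
    Y = b * X + c * (S @ X)      # b*X + c*S*X: 2nd GEMM
    return a * X + S @ Y         # a*X + b*S*X + c*S^2*X: 3rd GEMM

# Usage on a prescaled rectangular matrix (e.g. a gradient block).
rng = np.random.default_rng(1)
G = rng.standard_normal((8, 16))
X = G / np.linalg.norm(G, "fro")
for _ in range(8):
    X = ns_quintic_step(X)
print(np.linalg.norm(X @ X.T - np.eye(8)))  # small residual (cubic convergence per step)
```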

3. Theoretical Convergence, Error Bounds, and Conditioning

Each iteration contracts the orthogonality residual $R_k = X_k^T X_k - I$ quadratically:

R_{k+1} = -\frac{1}{4}\left(3I - R_k\right) R_k^2

Hence, for any subordinate matrix norm,

\|R_{k+1}\| \leq \frac{3 + \|R_k\|}{4}\, \|R_k\|^2 = O\left(\|R_k\|^2\right)

In practice, with appropriate initial scaling (e.g., by the spectral or Frobenius norm), 2–5 steps suffice to reduce errors to machine epsilon (Zhou, 30 Aug 2025, Boissin et al., 4 Dec 2025). For applications involving ill-conditioned matrices (e.g., low-rank or anisotropic gradient moments), the convergence constants depend on the condition number $\kappa$, and residual errors decay geometrically as $(1 - 1/\kappa)^{2^i}$ after $i$ NS steps (Refael et al., 30 May 2025). This dependence motivates either specialization to well-conditioned subspaces (as in SUMO) or improvement via polynomial tuning (Grishina et al., 12 Jun 2025).
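
The contraction and its condition-number dependence are easy to check numerically; the following illustrative NumPy script (test-matrix construction and parameter names are assumptions for demonstration) prints the spectral-norm residual after each cubic NS step for a well-conditioned and an ill-conditioned input.

```python
import numpy as np

def residual_trace(cond, n=64, steps=8, seed=0):
    """Print ||X_k^T X_k - I||_2 after each cubic NS step for a test matrix
    with condition number `cond` and spectral norm 1 (already well scaled)."""
    rng = np.random.default_rng(seed)
    U, _ = np.linalg.qr(rng.standard_normal((n, n)))
    V, _ = np.linalg.qr(rng.standard_normal((n, n)))
    s = np.linspace(1.0 / cond, 1.0, n)          # singular values in [1/cond, 1]
    X = U @ np.diag(s) @ V.T
    I = np.eye(n)
    for k in range(steps):
        X = X @ (3.0 * I - X.T @ X) / 2.0
        r = np.linalg.norm(X.T @ X - I, 2)
        print(f"cond={cond}  step {k + 1}:  residual {r:.2e}")

residual_trace(cond=2)    # residual reaches machine epsilon within the 8 steps
residual_trace(cond=100)  # residual is still O(1) after 8 steps: the kappa-dependence above
```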

4. Newton–Schulz Orthogonalization in Optimization Algorithms

Newton–Schulz orthogonalization has been extensively adopted in modern optimizers designed for highly anisotropic or constrained landscapes. The Muon optimizer, in particular, performs spectral norm–inducing step normalization by applying $q$ NS steps of a degree-$\kappa$ polynomial to the momentum matrix, constructing

O_t = X_{t,q}, \qquad X_{t,j+1} = p_\kappa\left(X_{t,j} X_{t,j}^T\right) X_{t,j}

where $p_\kappa(\lambda)$ is a carefully constructed Taylor or Chebyshev polynomial (Kim et al., 27 Jan 2026, Shulgin et al., 22 Oct 2025). For typical $q = 2$–$5$, the orthogonality error decays doubly exponentially, so the gap in convergence rate relative to the ideal SVD-based polar update becomes negligible after a few iterations (Kim et al., 27 Jan 2026). SUMO, in contrast, uses an exact SVD in a low-rank subspace to circumvent the condition-number dependence encountered by standard NS (Refael et al., 30 May 2025).
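
A deliberately simplified NumPy sketch of such an update is given below; the function names, the cubic choice of $p$, the prescaling, and the default hyperparameters are illustrative assumptions, and production Muon kernels instead run tuned quintic polynomials as batched low-precision GEMMs on the accelerator.

```python
import numpy as np

def cubic_p(S):
    """p(S) for the classical cubic NS step, applied to the Gram matrix S = X X^T."""
    return 0.5 * (3.0 * np.eye(S.shape[0]) - S)

def muon_like_update(W, M, lr=0.02, q=5, p=cubic_p):
    """Illustrative Muon-style step: run q Newton-Schulz iterations
    X_{j+1} = p(X_j X_j^T) X_j on the momentum matrix M, then update W
    with the (approximately) orthogonalized direction O ~= U V^T."""
    X = M / (np.linalg.norm(M, "fro") + 1e-12)   # prescale into the convergence region
    for _ in range(q):
        X = p(X @ X.T) @ X
    return W - lr * X
```

With only a handful of NS steps the direction $O_t$ need not be exactly orthogonal, which is precisely why the approximation error interacts with the learning rate and momentum as described next.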

A critical practical insight is the fundamental coupling between NS approximation error and optimizer hyperparameters: a smaller NS iteration count (i.e., looser orthogonalization) demands a smaller learning rate and higher momentum to maintain optimization stability (Shulgin et al., 22 Oct 2025).

5. Polynomial Acceleration and Chebyshev Optimized Variants

The generic NS update can be recast in a polynomial framework: writing the SVD of the current iterate as $X_k = U S_k V^T$, each step takes the form

X_{k+1} = U\, p_d(S_k)\, V^T, \qquad p_d(x) = \sum_{j=0}^{d-1} \alpha_{2j+1}\, x^{2j+1}

Chebyshev-optimized Newton–Schulz (CANS) replaces the fixed polynomial of classical NS with the best uniform odd polynomial approximation, constructed via the alternation theorem and the Remez algorithm, on the estimated spectral interval $[a, b]$ of the current iterate (Grishina et al., 12 Jun 2025). In practice, this polynomial tailoring reduces the number of iterations needed to reach a given residual, translating directly into computational savings on accelerators. However, the construction becomes numerically unstable for polynomial degrees beyond 5, so practical implementations favor low-degree CANS.
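
As a rough illustration of the idea only (the actual CANS construction uses the Remez exchange algorithm for a true minimax fit), the sketch below fits a low-degree odd polynomial that maps an assumed singular-value interval $[a, b]$ toward 1 by least squares, and applies it through odd matrix powers.

```python
import numpy as np

def odd_poly_coeffs(a, b, degree=5, samples=200):
    """Least-squares stand-in for the minimax (Remez) construction: find
    coefficients c_j so that sum_j c_j x^(2j+1) ~ 1 for x in [a, b]."""
    x = np.linspace(a, b, samples)
    powers = np.arange(1, degree + 1, 2)          # odd exponents 1, 3, 5, ...
    basis = x[:, None] ** powers[None, :]         # design matrix of odd monomials
    coeffs, *_ = np.linalg.lstsq(basis, np.ones_like(x), rcond=None)
    return powers, coeffs

def apply_odd_poly(X, powers, coeffs):
    """One polynomial step X <- U p(S_k) V^T, evaluated as sum_j c_j X (X^T X)^j."""
    S = X.T @ X
    out = np.zeros_like(X)
    P = np.eye(X.shape[1])
    for coef in coeffs:            # coefficients correspond to exponents 1, 3, 5, ...
        out = out + coef * (X @ P)
        P = P @ S
    return out

# Example: coefficients tailored to singular values assumed to lie in [0.1, 1.0].
powers, coeffs = odd_poly_coeffs(0.1, 1.0)
```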

6. Preconditioning and Algorithmic Enhancements

Preconditioning is key to further improving Newton–Schulz orthogonalization performance, especially for high aspect-ratio or badly conditioned matrices. The “Almost-Orthogonal Layer” (AOL) diagonal preconditioner rescales the input columns to shrink the spectral spread before the NS iteration, reducing the necessary step count and the total number of expensive matrix multiplications (Boissin et al., 4 Dec 2025). Empirically, this enables the removal of one iteration from the Muon pipeline, translating to a ∼20% reduction in orthogonalization kernel cost and up to a 10% end-to-end optimizer speedup on modern hardware, with no loss in output accuracy.
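
A minimal sketch of an AOL-style rescaling is shown below, assuming the standard AOL bound that the column-rescaled matrix has spectral norm at most one; how the rescaling is fused into the Muon kernel is specific to (Boissin et al., 4 Dec 2025) and not reproduced here.

```python
import numpy as np

def aol_precondition(A, eps=1e-12):
    """AOL-style diagonal preconditioner: with d_i = (sum_j |A^T A|_{ij})^(-1/2),
    the rescaled matrix A @ diag(d) has spectral norm at most 1, so it enters
    the Newton-Schulz iteration well scaled at the top of its spectrum."""
    gram = np.abs(A.T @ A)
    d = 1.0 / np.sqrt(gram.sum(axis=1) + eps)
    return A * d[None, :]             # equivalent to A @ np.diag(d)

# Note: rescaling the columns changes which orthogonal factor is approximated;
# how this interacts with the Muon update is handled in the cited work.
```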

7. Applications, Limitations, and Comparative Analysis

Applications of Newton–Schulz orthogonalization encompass:
  • eigensolvers, including the practical mixed-precision eigensolver of (Zhou, 30 Aug 2025);
  • modern stochastic and geometry-aware optimizers such as Muon and SUMO, which repeatedly orthogonalize momentum or gradient matrices (Kim et al., 27 Jan 2026, Shulgin et al., 22 Oct 2025, Refael et al., 30 May 2025);
  • Riemannian optimization on matrix manifolds and other settings requiring repeated or controlled approximate orthogonalization.

Comparative performance analyses show that Newton–Schulz orthogonalization outperforms classical Householder QR in both speed and accuracy for large $n$, owing to optimized BLAS matmul throughput and quadratic error decay. While exact SVD or QR orthogonalization is more stable in highly ill-conditioned or subspace-adaptive scenarios, matrix-polynomial/NS variants often dominate in overall wall-clock efficiency for moderately conditioned, full-rank contexts (Zhou, 30 Aug 2025, Grishina et al., 12 Jun 2025, Refael et al., 30 May 2025).

Limitations include slower convergence in the presence of extreme condition numbers (e.g., during LLM training with rapidly evolving gradients) unless augmented by preconditioning or restricted to adaptively chosen subspaces (as in SUMO).

Summary Table: Algorithmic Tradeoffs

Method                Approx. Cost (flops)    Orthogonalization Error
Householder QR        ≈ (10/3) n^3            ≲ 10^{-13} n
Classical NS          2–3 n^3                 ≲ 10^{-14}
Chebyshev NS (CANS)   d n^3                   O(ε^k)
SUMO (SVD)            O(n r^2)                exact in subspace

This table summarizes the computational and accuracy properties for the main orthogonalization approaches, as numerically reported in (Zhou, 30 Aug 2025, Grishina et al., 12 Jun 2025, Refael et al., 30 May 2025).

