
Newton–Schulz Orthogonalization

Updated 10 March 2026
  • Newton–Schulz orthogonalization is an iterative algorithm that leverages matrix-polynomial methods to extract the orthogonal factor without explicit QR or SVD.
  • It achieves quadratic convergence using only matrix–matrix multiplications, which makes it highly efficient for large-scale and GPU-accelerated applications.
  • Key applications include eigensolvers and advanced optimization methods in machine learning, with refinements using polynomial acceleration and preconditioning.

Newton–Schulz Orthogonalization refers to a class of iterative, matrix-polynomial algorithms designed to efficiently orthogonalize matrices, particularly in scenarios where explicit QR factorization or singular value decomposition (SVD) is computationally prohibitive. By leveraging only matrix–matrix multiplications, the Newton–Schulz (NS) scheme is highly suitable for large-scale problems, GPU acceleration, and applications requiring repeated or controlled approximate orthogonalization. Key domains of application include eigensolvers, modern stochastic and geometry-aware optimizers (such as Muon), and Riemannian optimization on matrix manifolds.

1. Mathematical Formulation and Derivation

The Newton–Schulz orthogonalization process is rooted in the polar decomposition of a nonsingular matrix $A \in \mathbb{R}^{n \times n}$, which admits the unique factorization

A = H P

where $H$ is orthogonal ($H^T H = I$) and $P = (A^T A)^{1/2}$ is symmetric positive definite. In terms of the SVD $A = U \Sigma V^T$, the factors are $H = U V^T$ and $P = V \Sigma V^T$, since $U \Sigma V^T = (U V^T)(V \Sigma V^T)$.

To extract the orthogonal factor $H$ without an explicit SVD, NS applies Newton's method to the matrix equation $F(X) = X^T X - I = 0$:

X_{k+1} = \frac{1}{2}\left(X_k + X_k^{-T}\right)

This iteration converges quadratically to $H$ for nonsingular $A$. To avoid the explicit inverse, a Schulz-type approximation replaces $X_k^{-T}$ with the polynomial $X_k(2I - X_k^T X_k)$; substituting this into the Newton step delivers the update

X_{k+1} = \frac{1}{2}\, X_k \left(3I - X_k^T X_k\right)

or equivalently,

X_{k+1} = \frac{3}{2}\, X_k - \frac{1}{2}\, X_k X_k^T X_k

Quadratic convergence is preserved as long as the singular values of $A$ lie inside $(0, \sqrt{3})$ (equivalently, the spectrum of $X_k^T X_k$ lies inside $(0, 3)$), and convergence is rapid once the singular values of $A$ are well scaled (Zhou, 30 Aug 2025).
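
As a concrete illustration, the following minimal NumPy sketch implements the cubic iteration above; the function name, the Frobenius-norm prescaling, and the test matrix are illustrative choices rather than details taken from the cited papers.

```python
import numpy as np

def newton_schulz(A, num_iters=10):
    """Approximate the orthogonal polar factor of a square matrix A via
    the cubic Newton-Schulz iteration X <- X (3I - X^T X) / 2."""
    # Prescale so all singular values land inside (0, sqrt(3));
    # dividing by the Frobenius norm is a cheap sufficient choice.
    X = A / np.linalg.norm(A, "fro")
    I = np.eye(A.shape[1])
    for _ in range(num_iters):
        X = X @ (3.0 * I - X.T @ X) / 2.0
    return X

# Test matrix with known singular values 0.6 ... 1.2 (condition number 2).
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((4, 4)))
V, _ = np.linalg.qr(rng.standard_normal((4, 4)))
A = U @ np.diag([0.6, 0.8, 1.0, 1.2]) @ V.T
H = newton_schulz(A)
print(np.linalg.norm(H.T @ H - np.eye(4)))  # orthogonality residual near machine epsilon
```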

2. Algorithmic Implementation and GPU Efficiency

Newton–Schulz orthogonalization is particularly attractive for large matrices and GPU contexts because it relies solely on matrix–matrix products, with no data-dependent branching or explicit orthonormalization. The practical mixed-precision eigensolver in (Zhou, 30 Aug 2025) provides MATLAB pseudocode that directly implements the above iteration, requiring only 2–3 matrix multiplications for double-precision accuracy. Similarly, in Muon-style optimizer kernels, each NS step is implemented as a series of batched fused GEMMs, enabling hardware-level acceleration (Boissin et al., 4 Dec 2025, Grishina et al., 12 Jun 2025, Zhou, 30 Aug 2025).

Polynomial variants, such as fixed-degree (cubic/quintic) and Chebyshev-optimized schemes, can be constructed to further tailor accuracy/cost tradeoffs, with higher-degree polynomials produced using a Remez algorithm for best approximation on the target spectral interval (Grishina et al., 12 Jun 2025). This enhances convergence per matrix-multiply, at the cost of increased per-iteration operations.
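
As a sketch of one possible realization (not the exact kernels of the cited works), a degree-5 step can be written with three GEMMs; the default coefficients below are the classical Taylor-based quintic $(15x - 10x^3 + 3x^5)/8$, while Muon-style kernels and CANS substitute re-tuned or Remez-optimized coefficients in the same slots.

```python
import numpy as np

def ns_quintic_step(X, a=15/8, b=-10/8, c=3/8):
    """One degree-5 Newton-Schulz step X <- a*X + b*(X X^T)X + c*(X X^T)^2 X.
    Defaults are the classical Taylor coefficients (15, -10, 3)/8; tuned
    coefficients (e.g. Remez/Chebyshev-optimized ones) drop into a, b, c."""
    S = X @ X.T                  # Gram matrix: 1st GEMM
    Y = b * X + c * (S @ X)      # b*X + c*S*X: 2nd GEMM
    return a * X + S @ Y         # a*X + b*S*X + c*S^2*X: 3rd GEMM

# Usage on a prescaled rectangular matrix (e.g. a gradient block).
rng = np.random.default_rng(1)
G = rng.standard_normal((8, 16))
X = G / np.linalg.norm(G, "fro")
for _ in range(8):
    X = ns_quintic_step(X)
print(np.linalg.norm(X @ X.T - np.eye(8)))  # small residual (cubic convergence per step)
```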

3. Theoretical Convergence, Error Bounds, and Conditioning

Each iteration contracts the orthogonality residual $R_k = X_k^T X_k - I$ quadratically:

R_{k+1} = -\frac{1}{4}\left(3I - R_k\right) R_k^2

Hence, for any subordinate matrix norm,

\|R_{k+1}\| \leq \frac{3 + \|R_k\|}{4}\, \|R_k\|^2 = O\left(\|R_k\|^2\right)

In practice, with appropriate initial scaling (e.g., by the spectral or Frobenius norm), 2–5 steps suffice to reduce errors to machine epsilon (Zhou, 30 Aug 2025, Boissin et al., 4 Dec 2025). For applications involving ill-conditioned matrices (e.g., low-rank or anisotropic gradient moments), the convergence constants depend on the condition number $\kappa$, and residual errors decay geometrically as $(1 - 1/\kappa)^{2^i}$ after $i$ NS steps (Refael et al., 30 May 2025). This dependence motivates either specialization to well-conditioned subspaces (as in SUMO) or improvement via polynomial tuning (Grishina et al., 12 Jun 2025).
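
The contraction and its condition-number dependence are easy to check numerically; the following illustrative NumPy script (test-matrix construction and parameter names are assumptions for demonstration) prints the spectral-norm residual after each cubic NS step for a well-conditioned and an ill-conditioned input.

```python
import numpy as np

def residual_trace(cond, n=64, steps=8, seed=0):
    """Print ||X_k^T X_k - I||_2 after each cubic NS step for a test matrix
    with condition number `cond` and spectral norm 1 (already well scaled)."""
    rng = np.random.default_rng(seed)
    U, _ = np.linalg.qr(rng.standard_normal((n, n)))
    V, _ = np.linalg.qr(rng.standard_normal((n, n)))
    s = np.linspace(1.0 / cond, 1.0, n)          # singular values in [1/cond, 1]
    X = U @ np.diag(s) @ V.T
    I = np.eye(n)
    for k in range(steps):
        X = X @ (3.0 * I - X.T @ X) / 2.0
        r = np.linalg.norm(X.T @ X - I, 2)
        print(f"cond={cond}  step {k + 1}:  residual {r:.2e}")

residual_trace(cond=2)    # residual reaches machine epsilon within the 8 steps
residual_trace(cond=100)  # residual is still O(1) after 8 steps: the kappa-dependence above
```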

4. Newton–Schulz Orthogonalization in Optimization Algorithms

Newton–Schulz orthogonalization has been extensively adopted in modern optimizers designed for highly anisotropic or constrained landscapes. The Muon optimizer, in particular, performs spectral norm–inducing step normalization by applying $q$ NS steps of a degree-$\kappa$ polynomial to the momentum matrix, constructing

O_t = X_{t,q}, \qquad X_{t,j+1} = p_\kappa\left(X_{t,j} X_{t,j}^T\right) X_{t,j}

where $p_\kappa(\lambda)$ is a carefully constructed Taylor or Chebyshev polynomial (Kim et al., 27 Jan 2026, Shulgin et al., 22 Oct 2025). For typical $q = 2$–$5$, the orthogonality error decays doubly exponentially, so the gap in convergence rate relative to the ideal SVD-based polar update becomes negligible after a few iterations (Kim et al., 27 Jan 2026). SUMO, in contrast, uses an exact SVD in a low-rank subspace to circumvent the condition-number dependence encountered by standard NS (Refael et al., 30 May 2025).
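
A deliberately simplified NumPy sketch of such an update is given below; the function names, the cubic choice of $p$, the prescaling, and the default hyperparameters are illustrative assumptions, and production Muon kernels instead run tuned quintic polynomials as batched low-precision GEMMs on the accelerator.

```python
import numpy as np

def cubic_p(S):
    """p(S) for the classical cubic NS step, applied to the Gram matrix S = X X^T."""
    return 0.5 * (3.0 * np.eye(S.shape[0]) - S)

def muon_like_update(W, M, lr=0.02, q=5, p=cubic_p):
    """Illustrative Muon-style step: run q Newton-Schulz iterations
    X_{j+1} = p(X_j X_j^T) X_j on the momentum matrix M, then update W
    with the (approximately) orthogonalized direction O ~= U V^T."""
    X = M / (np.linalg.norm(M, "fro") + 1e-12)   # prescale into the convergence region
    for _ in range(q):
        X = p(X @ X.T) @ X
    return W - lr * X
```

With only a handful of NS steps the direction $O_t$ need not be exactly orthogonal, which is precisely why the approximation error interacts with the learning rate and momentum as described next.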

A critical practical insight is the fundamental coupling between NS approximation error and optimizer hyperparameters: a smaller NS iteration count (i.e., looser orthogonalization) demands a smaller learning rate and higher momentum to maintain optimization stability (Shulgin et al., 22 Oct 2025).

5. Polynomial Acceleration and Chebyshev Optimized Variants

The generic NS update can be recast in a polynomial framework: writing the SVD of the current iterate as $X_k = U S_k V^T$, each step takes the form

X_{k+1} = U\, p_d(S_k)\, V^T, \qquad p_d(x) = \sum_{j=0}^{d-1} \alpha_{2j+1}\, x^{2j+1}

Chebyshev-optimized Newton–Schulz (CANS) replaces the fixed polynomial of classical NS with the best uniform odd polynomial approximation, constructed via the alternation theorem and the Remez algorithm, on the estimated spectral interval $[a, b]$ of the current iterate (Grishina et al., 12 Jun 2025). In practice, this polynomial tailoring reduces the number of iterations needed to reach a given residual, translating directly into computational savings on accelerators. However, the construction becomes numerically unstable for polynomial degrees beyond 5, so practical implementations favor low-degree CANS.
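
As a rough illustration of the idea only (the actual CANS construction uses the Remez exchange algorithm for a true minimax fit), the sketch below fits a low-degree odd polynomial that maps an assumed singular-value interval $[a, b]$ toward 1 by least squares, and applies it through odd matrix powers.

```python
import numpy as np

def odd_poly_coeffs(a, b, degree=5, samples=200):
    """Least-squares stand-in for the minimax (Remez) construction: find
    coefficients c_j so that sum_j c_j x^(2j+1) ~ 1 for x in [a, b]."""
    x = np.linspace(a, b, samples)
    powers = np.arange(1, degree + 1, 2)          # odd exponents 1, 3, 5, ...
    basis = x[:, None] ** powers[None, :]         # design matrix of odd monomials
    coeffs, *_ = np.linalg.lstsq(basis, np.ones_like(x), rcond=None)
    return powers, coeffs

def apply_odd_poly(X, powers, coeffs):
    """One polynomial step X <- U p(S_k) V^T, evaluated as sum_j c_j X (X^T X)^j."""
    S = X.T @ X
    out = np.zeros_like(X)
    P = np.eye(X.shape[1])
    for coef in coeffs:            # coefficients correspond to exponents 1, 3, 5, ...
        out = out + coef * (X @ P)
        P = P @ S
    return out

# Example: coefficients tailored to singular values assumed to lie in [0.1, 1.0].
powers, coeffs = odd_poly_coeffs(0.1, 1.0)
```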

6. Preconditioning and Algorithmic Enhancements

Preconditioning is key to further improving Newton–Schulz orthogonalization performance, especially for high aspect-ratio or badly conditioned matrices. The “Almost-Orthogonal Layer” (AOL) diagonal preconditioner rescales the input columns to shrink the spectral spread before the NS iteration, reducing the necessary step count and the total number of expensive matrix multiplications (Boissin et al., 4 Dec 2025). Empirically, this enables the removal of one iteration from the Muon pipeline, translating to a ∼20% reduction in orthogonalization kernel cost and up to a 10% end-to-end optimizer speedup on modern hardware, with no loss in output accuracy.
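
A minimal sketch of an AOL-style rescaling is shown below, assuming the standard AOL bound that the column-rescaled matrix has spectral norm at most one; how the rescaling is fused into the Muon kernel is specific to (Boissin et al., 4 Dec 2025) and not reproduced here.

```python
import numpy as np

def aol_precondition(A, eps=1e-12):
    """AOL-style diagonal preconditioner: with d_i = (sum_j |A^T A|_{ij})^(-1/2),
    the rescaled matrix A @ diag(d) has spectral norm at most 1, so it enters
    the Newton-Schulz iteration well scaled at the top of its spectrum."""
    gram = np.abs(A.T @ A)
    d = 1.0 / np.sqrt(gram.sum(axis=1) + eps)
    return A * d[None, :]             # equivalent to A @ np.diag(d)

# Note: rescaling the columns changes which orthogonal factor is approximated;
# how this interacts with the Muon update is handled in the cited work.
```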

7. Applications, Limitations, and Comparative Analysis

Applications of Newton–Schulz orthogonalization encompass:
  • eigensolvers, including the practical mixed-precision eigensolver of (Zhou, 30 Aug 2025);
  • modern stochastic and geometry-aware optimizers such as Muon and SUMO, which repeatedly orthogonalize momentum or gradient matrices (Kim et al., 27 Jan 2026, Shulgin et al., 22 Oct 2025, Refael et al., 30 May 2025);
  • Riemannian optimization on matrix manifolds and other settings requiring repeated or controlled approximate orthogonalization.

Comparative performance analyses show that Newton–Schulz orthogonalization outperforms classical Householder QR in both speed and accuracy for large $n$, owing to optimized BLAS matmul throughput and quadratic error decay. While exact SVD or QR orthogonalization is more stable in highly ill-conditioned or subspace-adaptive scenarios, matrix-polynomial/NS variants often dominate in overall wall-clock efficiency for moderately conditioned, full-rank contexts (Zhou, 30 Aug 2025, Grishina et al., 12 Jun 2025, Refael et al., 30 May 2025).

Limitations include slower convergence in the presence of extreme condition numbers (e.g., during LLM training with rapidly evolving gradients) unless augmented by preconditioning or restricted to adaptively chosen subspaces (as in SUMO).

Summary Table: Algorithmic Tradeoffs

Method                Approx. Cost (flops)    Orthogonalization Error
Householder QR        ≈ (10/3) n^3            ≲ 10^{-13} n
Classical NS          2–3 n^3                 ≲ 10^{-14}
Chebyshev NS (CANS)   d n^3                   O(ε^k)
SUMO (SVD)            O(n r^2)                exact in subspace

This table summarizes the computational and accuracy properties for the main orthogonalization approaches, as numerically reported in (Zhou, 30 Aug 2025, Grishina et al., 12 Jun 2025, Refael et al., 30 May 2025).

