Newton–Schulz Iterations: Matrix Inversion & Beyond
- Newton–Schulz iterations are a family of fixed-point schemes that compute matrix inverses and polar factors with quadratic or higher-order convergence.
- High-order extensions employ optimized polynomial factorizations to minimize matrix multiplications while achieving efficiency indices around 1.41–1.49.
- Integrating these iterations with Richardson’s method enables robust preconditioning and fast orthogonalization in applications like failure detection and machine learning.
The Newton–Schulz iteration is a family of high-order fixed-point schemes for constructing matrix inverses, orthogonal polar factors, and polynomial approximations of linear operators. Originating in classical work on iterative inversion and now finding rapidly expanding applications in computational linear algebra, optimization, and machine learning, the family encompasses both the original second-order Newton–Schulz iteration and its numerous high-order and composite generalizations. Central to its efficacy is its amenability to efficient implementation using only matrix-matrix multiplications, making it especially relevant in contexts such as GPU-accelerated computation, preconditioning of ill-conditioned or rank-deficient systems, and fast orthogonalization in large-scale optimization algorithms.
1. General Formulation and Basic Properties
The classical (second-order) Newton–Schulz iteration for inverting a nonsingular matrix $A$ is given by

$$ X_{k+1} = X_k\,(2I - A X_k), \qquad k = 0, 1, 2, \ldots, $$

where $X_0$ is an initial preconditioner. Defining the residual $E_k = I - A X_k$, the iteration can be recast as

$$ E_{k+1} = E_k^{2}, $$

which reveals the quadratic convergence: the norm of the residual is squared at each step. This iteration is a special case of the family of hyperpower or Durand–Schulz methods, whose $p$-th order generalization employs the truncated Neumann series: $X_{k+1} = X_k \sum_{i=0}^{p-1} E_k^{\,i}$ and $E_{k+1} = E_k^{\,p}$, ensuring local convergence of order $p$ (i.e., $\|E_{k+1}\| \le \|E_k\|^{p}$) (Stotsky, 2022, Stotsky, 2020).
The convergence of these schemes is ensured under the spectral condition $\rho(I - A X_0) < 1$ (where $\rho(\cdot)$ denotes the spectral radius), typically guaranteed by choosing $X_0 = \alpha A^\top$ for appropriately small $\alpha > 0$. For symmetric positive definite (SPD) $A$, taking $X_0 = \alpha I$ with $0 < \alpha < 2/\lambda_{\max}(A)$ suffices.
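The following is a minimal NumPy sketch of the classical second-order iteration with a scaled-transpose initialization; the function name and test matrix are illustrative.

```python
import numpy as np

def newton_schulz_inverse(A, num_iters=20):
    """Second-order Newton-Schulz: X_{k+1} = X_k (2I - A X_k).

    Minimal sketch assuming A is nonsingular; the scaled-transpose
    initialization X_0 = A^T / (||A||_1 ||A||_inf) guarantees
    rho(I - A X_0) < 1, so the residual is squared at every step.
    """
    n = A.shape[0]
    I = np.eye(n)
    X = A.T / (np.linalg.norm(A, 1) * np.linalg.norm(A, np.inf))
    for _ in range(num_iters):
        X = X @ (2.0 * I - A @ X)   # E_{k+1} = E_k^2
    return X

A = np.random.rand(100, 100) + 100.0 * np.eye(100)   # well-conditioned test matrix
X = newton_schulz_inverse(A)
print(np.linalg.norm(np.eye(100) - A @ X))           # ~ machine precision
```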
2. High-Order Extensions and Unified Polynomial Factorization
While a naïve implementation of the $p$-th order Newton–Schulz scheme would entail computing each power $E_k^{\,i}$ for $i = 2, \ldots, p-1$, sophisticated factorization of the underlying residual polynomial significantly reduces computational cost. Given

$$ X_{k+1} = X_k\,T_p(E_k), \qquad T_p(E) = I + E + E^{2} + \cdots + E^{p-1}, $$

one can decompose $T_p$ into products of lower-degree factors, enabling a reduction in the number of matrix multiplications per iteration.
- If $p = 2^m$, the factorization $T_p(E) = (I + E)(I + E^{2})(I + E^{4})\cdots\bigl(I + E^{2^{m-1}}\bigr)$ requires only $O(\log_2 p)$ multiplications (rather than the naïve $O(p)$).
- The efficiency index $\mathrm{EI} = p^{1/M}$, where $M$ is the number of matrix-matrix multiplications per cycle, quantifies the per-multiply convergence acceleration.
Typical factorization examples (with 6 products per cycle):

| $p$ | Polynomial factorization | EI $= p^{1/6}$ |
|-----|--------------------------------------------|----------------|
| 8   | $(I+E)(I+E^{2})(I+E^{4})$                  | $\approx 1.41$ |
| 9   | $(I+E+E^{2})(I+E^{3}+E^{6})$               | $\approx 1.44$ |
| 10  | Composite even/odd-power factorization     | $\approx 1.47$ |
| 11* | Bespoke (prime) factorization              | $\approx 1.49$ |
Optimal choice of $p$ and factorization yields a balance between per-step cost and asymptotic convergence (Stotsky, 2022, Stotsky, 2020).
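As a concrete illustration, the sketch below performs one 8th-order step using the geometric-series factorization $T_8(E) = (I+E)(I+E^2)(I+E^4)$, one possible realization costing 6 matrix multiplications per cycle; the function name and test problem are illustrative.

```python
import numpy as np

def ns_step_order8(X, A, I):
    """One 8th-order Newton-Schulz step via T_8(E) = (I+E)(I+E^2)(I+E^4).

    Representative factorization (6 matrix multiplies); E_{k+1} = E_k^8.
    """
    E  = I - A @ X            # multiply 1
    E2 = E @ E                # multiply 2
    E4 = E2 @ E2              # multiply 3
    T  = (I + E) @ (I + E2)   # multiply 4
    T  = T @ (I + E4)         # multiply 5
    return X @ T              # multiply 6

n = 200
A = np.random.rand(n, n) + n * np.eye(n)
I = np.eye(n)
X = A.T / (np.linalg.norm(A, 1) * np.linalg.norm(A, np.inf))  # rho(I - A X_0) < 1
for _ in range(4):
    X = ns_step_order8(X, A, I)
print(np.linalg.norm(I - A @ X))   # drops to roundoff within a few steps
```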
3. Composite and Parallelized Newton–Schulz Schemes
Composite expansions can further enhance computational throughput, especially on parallel hardware. By writing the high-order polynomial as a product of several low-order terms, one schedules independent blocks across compute units with different capabilities. Given exponents $p_1, p_2, \ldots, p_s$, define

$$ \Phi(E) = \prod_{j=1}^{s}\Bigl(I + E^{q_j} + E^{2 q_j} + \cdots + E^{(p_j - 1) q_j}\Bigr), \qquad q_1 = 1, \quad q_{j+1} = q_j\,p_j, $$

and perform the update $X_{k+1} = X_k\,\Phi(E_k)$, ensuring rapid decay of the residual controlled by the composite contraction $E_{k+1} = E_k^{\,p_1 p_2 \cdots p_s}$. Recursive Horner-type schemes and nested factorizations enable scaling to high orders while maintaining minimal multiply count. For instance, a 45th-order Newton–Schulz iteration can be realized in only 10 matrix-matrix multiplies (Stotsky, 2020).
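To make the composite block structure concrete, here is a sketch of one $3 \cdot 3 \cdot 5 = 45$th-order step built from three low-order factors; the factorization and its 11-product schedule are illustrative and are not claimed to match the optimized 10-multiply realization cited above.

```python
import numpy as np

def ns_step_order45(X, A, I):
    """Composite 45th-order step:
    T_45(E) = (I+E+E^2)(I+E^3+E^6)(I+E^9+E^18+E^27+E^36).

    Each factor is a low-order block that can be scheduled on a separate
    compute unit; the overall residual contraction is E_{k+1} = E_k^45.
    """
    E   = I - A @ X                        # 1
    E2  = E @ E                            # 2
    E3  = E @ E2                           # 3
    E6  = E3 @ E3                          # 4
    E9  = E3 @ E6                          # 5
    E18 = E9 @ E9                          # 6
    E27 = E9 @ E18                         # 7
    E36 = E18 @ E18                        # 8
    T   = (I + E + E2) @ (I + E3 + E6)     # 9
    T   = T @ (I + E9 + E18 + E27 + E36)   # 10
    return X @ T                           # 11
```

With $\rho(I - A X_0) < 1$, a single such step already reduces the residual norm to at most $\|E_0\|^{45}$, so only a handful of steps are ever needed.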
4. Integration with Richardson Iteration
Richardson’s method solves $Ax = b$ via the fixed-point scheme

$$ x_{k+1} = x_k + \omega\,(b - A x_k), $$

with slow practical convergence. By embedding a Newton–Schulz approximate inverse $X_k \approx A^{-1}$,

$$ x_{k+1} = x_k + X_k\,(b - A x_k), $$

or, equivalently,

$$ x_{k+1} = (I - X_k A)\,x_k + X_k b, $$

and updating $X_k$ via a high-order Newton–Schulz step at each outer Richardson iteration, one achieves both high-order convergence of the inverse and accelerated reduction of the solution residual. Skeleton pseudocode for the combined scheme (a NumPy sketch follows the outline):
- Initialize $X_0$ (preconditioner), $x_0$ (e.g., zero).
- For $k = 0, 1, 2, \ldots$ do:
- Compute residual $E_k = I - A X_k$.
- Newton–Schulz update: $X_{k+1} = X_k\,T_p(E_k)$ (with efficient factorization).
- Set $r_k = b - A x_k$.
- Richardson update: $x_{k+1} = x_k + X_{k+1}\,r_k$.
- Repeat until convergence.
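A self-contained NumPy sketch of this nested scheme is given below; for simplicity the inner update evaluates the truncated Neumann series in Horner form rather than with an optimized factorization, and the function name and defaults are illustrative.

```python
import numpy as np

def ns_richardson_solve(A, b, order=4, num_iters=50, tol=1e-10):
    """Richardson iteration preconditioned by a Newton-Schulz approximate inverse.

    Each outer step refines X with one Newton-Schulz update of the given
    order, then applies the preconditioned Richardson update to x.
    """
    n = A.shape[0]
    I = np.eye(n)
    X = A.T / (np.linalg.norm(A, 1) * np.linalg.norm(A, np.inf))  # rho(I - A X_0) < 1
    x = np.zeros(n)
    for _ in range(num_iters):
        r = b - A @ x
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        E = I - A @ X                 # inverse residual
        T = I.copy()
        for _ in range(order - 1):    # T = I + E + ... + E^{order-1} (Horner form)
            T = I + E @ T
        X = X @ T                     # Newton-Schulz update of the preconditioner
        x = x + X @ r                 # preconditioned Richardson update
    return x

# Usage on an ill-conditioned SPD system
n = 300
Q, _ = np.linalg.qr(np.random.randn(n, n))
A = Q @ np.diag(np.logspace(0, 4, n)) @ Q.T      # condition number ~1e4
b = np.random.randn(n)
x = ns_richardson_solve(A, b)
print(np.linalg.norm(A @ x - b) / np.linalg.norm(b))
```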
This nested structure enables dramatic acceleration, especially when robust preconditioning is required (e.g., ill-conditioned or rank-deficient ) (Stotsky, 2022, Stotsky, 2020).
5. Practical Guidelines and Computational Trade-Offs
Key considerations for real-world use:
- For SPD problems of moderate size, 8th–9th order factorizations (EI $\approx 1.41$–$1.44$) provide a favorable balance of speed and stability.
- In very large or sparse systems, repeated low-order (e.g., $p = 2$ or $3$) Newton–Schulz steps per Richardson iteration can increase robustness and reduce implementation complexity.
- For matrices with moderate condition number, a single 4th–6th order NS update per iteration generally suffices; more extreme ill-conditioning may warrant higher order $p$ or tighter inner-outer iteration nesting.
- The spectral radius condition $\rho(I - A X_0) < 1$ is ensured in practice by setting $X_0 = \alpha A^\top$ with $0 < \alpha < 2/\|A\|_2^2$; a common choice is $\alpha = 1/(\|A\|_1 \|A\|_\infty)$ (see the sketch after this list).
- Parallel compute resources can be exploited by distributing factorizations across units according to their computational capabilities (Stotsky, 2022, Stotsky, 2020).
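For the preconditioner guideline, the short sketch below builds the scaled-transpose initialization and verifies the spectral radius condition numerically; the helper name and test matrix are illustrative.

```python
import numpy as np

def scaled_transpose_init(A):
    """X_0 = alpha * A^T with alpha = 1 / (||A||_1 ||A||_inf).

    Since ||A||_1 * ||A||_inf >= ||A||_2^2, this conservative scaling
    guarantees rho(I - A X_0) < 1 for any nonsingular A.
    """
    alpha = 1.0 / (np.linalg.norm(A, 1) * np.linalg.norm(A, np.inf))
    return alpha * A.T

A = np.random.randn(50, 50) + 50.0 * np.eye(50)
X0 = scaled_transpose_init(A)
rho = np.max(np.abs(np.linalg.eigvals(np.eye(50) - A @ X0)))
print(rho)   # strictly below 1, so Newton-Schulz of any order converges
```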
6. Applications in Robust Failure Detection and Machine Learning
Failure Detection in Electrical Networks
A central application is robust parameter estimation for failure detection in electrical networks, where streaming least-squares problems are afflicted by nearly rank-deficient design matrices (e.g., due to missing or faulty data channels). Even after regularization with a small parameter $\delta > 0$ (e.g., solving $(A^\top A + \delta I)\,x = A^\top b$), the systems frequently become severely ill-conditioned. Adaptive preconditioning with a scaled initial inverse $X_0$ chosen so that $\rho\bigl(I - (A^\top A + \delta I)\,X_0\bigr) < 1$ ensures convergence of the inner Newton–Schulz updates. The combined Newton–Schulz–Richardson iteration yields stable solutions even as $\delta \to 0$ (condition number growing without bound), enabling accurate tracking of voltage sags/swells where direct (e.g., LU-based) solvers fail. On real DOE/EPRI data, this approach achieves real-time performance with fewer than 10 Richardson steps per window shift and recovers first-harmonic amplitudes with high fidelity (Stotsky, 2022).
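A hypothetical sketch of this setting, with a dead data channel producing rank deficiency and a tiny regularization parameter, shows the combined updates still driving the residual down; all names, dimensions, and parameter values are illustrative.

```python
import numpy as np

# Streaming regression window with a faulty (all-zero) data channel,
# so Phi^T Phi alone is singular and delta-regularization is needed.
m, p, delta = 256, 8, 1e-6
Phi = np.random.randn(m, p)
Phi[:, 3] = 0.0                               # dead channel -> rank deficiency
y = Phi @ np.random.randn(p) + 0.01 * np.random.randn(m)

G = Phi.T @ Phi + delta * np.eye(p)           # severely ill-conditioned normal matrix
rhs = Phi.T @ y
I = np.eye(p)

# Combined scheme: 4th-order Newton-Schulz refinement of X plus Richardson updates.
X = G / (np.linalg.norm(G, 1) * np.linalg.norm(G, np.inf))   # G is symmetric
theta = np.zeros(p)
for _ in range(100):                          # many cheap outer steps (delta is tiny)
    E = I - G @ X
    X = X @ (I + E @ (I + E @ (I + E)))       # T_4(E) in Horner form
    theta = theta + X @ (rhs - G @ theta)

print(np.linalg.cond(G))                                       # huge condition number
print(np.linalg.norm(G @ theta - rhs) / np.linalg.norm(rhs))   # small relative residual
```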
Orthogonalization in Machine Learning Optimizers
In large-scale optimization (e.g., the Muon optimizer), Newton–Schulz iterations provide fast, inexact polar decomposition updates as cheap alternatives to SVD-based projection. The key recurrence, an odd matrix polynomial of the form

$$ X_{k+1} = a\,X_k + b\,(X_k X_k^\top)\,X_k + c\,(X_k X_k^\top)^{2}\,X_k, $$

is run for a small, fixed number of steps (typically 5), balancing approximation error and computational budget (Shulgin et al., 22 Oct 2025).
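A minimal NumPy sketch of Newton–Schulz orthogonalization using the classical cubic recurrence is shown below; practical optimizers such as Muon instead use tuned odd-polynomial (quintic) coefficients so that roughly five steps suffice, whereas the plain cubic shown here needs more iterations to reach comparable accuracy. The function name and step count are illustrative.

```python
import numpy as np

def newton_schulz_orthogonalize(G, num_steps=15):
    """Approximate polar factor of G via the cubic Newton-Schulz recurrence
    X_{k+1} = 1.5 X_k - 0.5 X_k X_k^T X_k.

    Frobenius normalization puts all singular values in (0, 1], where the
    cubic map sigma -> 1.5 sigma - 0.5 sigma^3 converges to 1.
    """
    X = G / (np.linalg.norm(G) + 1e-12)   # Frobenius-norm normalization
    tall = X.shape[0] > X.shape[1]
    if tall:                              # keep the Gram matrix X X^T small
        X = X.T
    for _ in range(num_steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X.T if tall else X

G = np.random.randn(128, 64)                   # e.g., a gradient block
U = newton_schulz_orthogonalize(G)
print(np.linalg.norm(U.T @ U - np.eye(64)))    # small orthogonality error
```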
Theory predicts that lowering the number of Newton–Schulz steps (incurring greater orthogonalization inexactness) degrades the convergence rate by explicit factors depending on that inexactness. Optimal learning rate and momentum must therefore be co-tuned with the number of Newton–Schulz steps: fewer iterations require a smaller step size and increased momentum, as confirmed in empirical training of deep models such as NanoGPT (Shulgin et al., 22 Oct 2025).
Accelerated variants, such as Chebyshev-optimized Newton–Schulz (CANS), further reduce matmul cost and improve contraction rates by minimizing the worst-case approximation error over the relevant singular value interval, and their use in Muon or Stiefel-manifold optimization achieves faster convergence and lower wall-clock time than classical Newton–Schulz or QR/Cayley retractions (Grishina et al., 12 Jun 2025).
7. Conclusions and Outlook
The Newton–Schulz family, bolstered by unified polynomial factorizations and composite-block constructions, constitutes a highly flexible and tunable class of iterative solvers for inverse, orthogonal, and system-solution computations. Their seamless integration into outer iterative schemes (notably Richardson iteration) imparts both finite-precision safety and exceptional adaptability to scale, conditioning, and computational infrastructure. In both numerical linear algebra and modern machine learning, these schemes have demonstrated robust real-world utility—outperforming direct solvers in speed, robustness, and flexibility across challenging application domains (Stotsky, 2022, Stotsky, 2020, Shulgin et al., 22 Oct 2025, Grishina et al., 12 Jun 2025). Future research directions involve optimal composite design for particular hardware or problem structure, improved contraction analysis for blockwise or layerwise errors, and principled adaptive selection of the order and composition at runtime in large-scale distributed systems.