
Newton–Schulz Iterations: Matrix Inversion & Beyond

Updated 15 November 2025
  • Newton–Schulz iterations are a family of fixed-point schemes that compute matrix inverses and polar factors with quadratic or higher-order convergence.
  • High-order extensions employ optimized polynomial factorizations to minimize matrix multiplications while achieving efficiency indices around 1.41–1.49.
  • Integrating these iterations with Richardson’s method enables robust preconditioning and fast orthogonalization in applications like failure detection and machine learning.

The Newton–Schulz iteration is a family of high-order fixed-point schemes for constructing matrix inverses, orthogonal polar factors, and polynomial approximations of linear operators. Originating in classical work on iterative inversion, and with rapidly expanding applications in computational linear algebra, optimization, and machine learning, the family encompasses both the original second-order Newton–Schulz iteration and its numerous high-order and composite generalizations. Central to its efficacy is its amenability to efficient implementation using only matrix–matrix multiplications, making it especially relevant in contexts such as GPU-accelerated computation, preconditioning of ill-conditioned or rank-deficient systems, and fast orthogonalization in large-scale optimization algorithms.

1. General Formulation and Basic Properties

The classical (second-order) Newton–Schulz iteration for inverting a nonsingular matrix $A$ is given by

$$X_{k+1} = 2X_k - X_k A X_k, \qquad X_0 \approx A^{-1},$$

where $X_0$ is an initial preconditioner. Defining the residual $F_k = I - X_k A$, the iteration can be recast as

$$X_{k+1} = (I + F_k)\, X_k, \qquad F_{k+1} = F_k^2,$$

which reveals the quadratic convergence: the norm of the residual is squared at each step. This iteration is a special case of the family of hyperpower or Durand–Schulz methods, whose $n$-th order generalization employs the truncated Neumann series $X_{k+1} = \left(I + F_k + F_k^2 + \cdots + F_k^{n-1}\right) X_k$ with $F_{k+1} = F_k^n$, ensuring local convergence of order $n$ (i.e., $\|F_{k+1}\| \leq C \|F_k\|^n$) (Stotsky, 2022, Stotsky, 2020).

The convergence of these schemes is ensured under the spectral condition $\rho(I - X_0 A) < 1$ (where $\rho(\cdot)$ denotes the spectral radius), typically guaranteed by choosing $X_0 = \alpha A^{\top}$ for an appropriately small $\alpha > 0$. For symmetric positive definite (SPD) $A$, taking $X_0 = \alpha I$ with $\alpha$ in the reciprocal range of the spectrum of $A$ (i.e., $0 < \alpha < 2/\lambda_{\max}(A)$) suffices.
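
As a concrete illustration, the quadratic residual decay of the second-order scheme can be observed numerically. A minimal NumPy sketch (the scaling $X_0 = A^{\top}/(\|A\|_1\|A\|_\infty)$ is a classical initialization choice, not specific to the cited papers):

```python
import numpy as np

def newton_schulz_inverse(A, num_iters=20):
    """Second-order Newton-Schulz iteration X_{k+1} = 2 X_k - X_k A X_k."""
    # Classical scaling: guarantees rho(I - X0 A) < 1 for nonsingular A,
    # since ||A||_2^2 <= ||A||_1 * ||A||_inf.
    X = A.T / (np.linalg.norm(A, 1) * np.linalg.norm(A, np.inf))
    for _ in range(num_iters):
        X = 2 * X - X @ A @ X
    return X

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5)) + 5 * np.eye(5)  # well-conditioned test matrix
X = newton_schulz_inverse(A)
print(np.linalg.norm(np.eye(5) - X @ A))  # residual norm after 20 steps
```

Printing $\|I - X_k A\|$ inside the loop shows the exponent doubling at each step, the signature of quadratic convergence.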

2. High-Order Extensions and Unified Polynomial Factorization

While a naïve implementation of the $n$-th order Newton–Schulz scheme would entail computing each power $F_k^j$ for $j = 2, \dots, n-1$, sophisticated factorization of the underlying residual polynomial significantly reduces computational cost. Given

$$X_{k+1} = P_{n-1}(F_k)\, X_k, \qquad P_{n-1}(F) = I + F + F^2 + \cdots + F^{n-1},$$

one can decompose $P_{n-1}$ into products of lower-degree factors, enabling a reduction in the number of matrix multiplications per iteration.

  • If $n = n_1 n_2$ is composite, the factorization

$$I + F + \cdots + F^{n-1} = \left(I + F + \cdots + F^{n_1 - 1}\right)\left(I + F^{n_1} + F^{2 n_1} + \cdots + F^{(n_2 - 1) n_1}\right)$$

requires only on the order of $n_1 + n_2$ multiplications (rather than the naïve $n-1$).

  • The efficiency index $\mathrm{EI} = n^{1/m}$, where $m$ is the number of matrix multiplications per cycle, quantifies the per-multiply convergence acceleration.

Typical factorization examples (with 6 products per cycle; representative factorizations shown):

| Order $n$ | Polynomial factorization of $P_{n-1}(F)$ | EI $= n^{1/6}$ |
|-----------|------------------------------------------|----------------|
| 8 | $(I + F)(I + F^2)(I + F^4)$ | $\approx 1.414$ |
| 9 | $(I + F + F^2)(I + F^3 + F^6)$ | $\approx 1.442$ |
| 10 | $(I + F + F^2 + F^3 + F^4)(I + F^5)$ | $\approx 1.468$ |
| 11* | Bespoke (prime-order) factorization | $\approx 1.491$ |

Optimal choice of the order $n$ and its factorization yields a balance between per-step cost and asymptotic convergence (Stotsky, 2022, Stotsky, 2020).
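
For instance, the $n = 8$ row of the table can be implemented directly; each cycle below uses exactly six matrix multiplications, counting the residual formation (a sketch, not the papers' exact code):

```python
import numpy as np

def ns_order8(A, X, cycles=4):
    """8th-order Newton-Schulz cycle via the factorization
    I + F + ... + F^7 = (I + F)(I + F^2)(I + F^4); 6 multiplies per cycle."""
    I = np.eye(A.shape[0])
    for _ in range(cycles):
        F = I - X @ A              # multiply 1: residual
        F2 = F @ F                 # multiply 2
        F4 = F2 @ F2               # multiply 3
        P = (I + F) @ (I + F2)     # multiply 4
        P = P @ (I + F4)           # multiply 5: P = I + F + ... + F^7
        X = P @ X                  # multiply 6: X_{k+1} = P(F_k) X_k
    return X

A = np.diag([1.0, 2.0, 4.0])
X0 = A.T / (np.linalg.norm(A, 1) * np.linalg.norm(A, np.inf))
X = ns_order8(A, X0)               # residual is raised to the 8th power per cycle
```

The residual norm contracts with exponent 8 per cycle while costing only six multiplies, which is exactly what $\mathrm{EI} = 8^{1/6} \approx 1.414$ expresses.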

3. Composite and Parallelized Newton–Schulz Schemes

Composite expansions can further enhance computational throughput, especially on parallel hardware. By writing the high-order polynomial as a product of several low-order terms, one schedules independent blocks across compute units with different capabilities. Given exponents $n_1, n_2, \dots, n_m$, define

$$\mathcal{P}(F_k) = P_{n_1 - 1}(F_k)\; P_{n_2 - 1}\!\left(F_k^{\,n_1}\right) \cdots P_{n_m - 1}\!\left(F_k^{\,n_1 n_2 \cdots n_{m-1}}\right)$$

and perform the update $X_{k+1} = \mathcal{P}(F_k)\, X_k$, ensuring rapid decay of the residual $F_{k+1} = F_k^{\,n_1 n_2 \cdots n_m}$ controlled by the composite contraction of order $n_1 n_2 \cdots n_m$. Recursive Horner-type schemes and nested factorizations enable scaling to high orders while maintaining a minimal multiply count. For instance, a 45th-order Newton–Schulz iteration can be realized in only 10 matrix–matrix multiplies (Stotsky, 2020).
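
The order-multiplication property can be checked with a straightforward (unoptimized) chaining of hyperpower steps; the nested factorizations of the source achieve the same overall order with fewer multiplies:

```python
import numpy as np

def ns_composite(A, X, orders=(3, 3, 5), cycles=2):
    """Chained hyperpower steps of orders n_1, ..., n_m: one cycle maps the
    residual F to F**(n_1 * n_2 * ... * n_m) (here 3 * 3 * 5 = 45)."""
    I = np.eye(A.shape[0])
    for _ in range(cycles):
        for n in orders:
            F = I - X @ A
            P, Fp = I.copy(), I.copy()
            for _ in range(n - 1):
                Fp = Fp @ F        # F^j
                P = P + Fp         # accumulate I + F + ... + F^{n-1}
            X = P @ X
    return X

A = np.diag([1.0, 2.0, 3.0])
X0 = A.T / (np.linalg.norm(A, 1) * np.linalg.norm(A, np.inf))
X = ns_composite(A, X0)            # overall order 45 per cycle
```

Each inner factor is independent of the others' internal powers, which is what allows the blocks to be scheduled on separate compute units.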

4. Integration with Richardson Iteration

Richardson’s method solves $Ax = b$ via the fixed-point scheme

$$x_{k+1} = x_k + \omega\,(b - A x_k),$$

with slow practical convergence. By embedding a Newton–Schulz approximate inverse $X_k \approx A^{-1}$,

$$x_{k+1} = x_k + X_k\,(b - A x_k),$$

or, equivalently,

$$x_{k+1} = (I - X_k A)\, x_k + X_k b = F_k\, x_k + X_k b,$$

and updating $X_k$ via a high-order Newton–Schulz step at each outer Richardson iteration, one achieves both high-order convergence of the inverse and accelerated reduction of the solution residual. Skeleton pseudocode for the combined scheme:

  1. Initialize $X_0 \approx A^{-1}$ (preconditioner) and $x_0$ (e.g., zero).
  2. For $k = 0, 1, 2, \dots$ do:
    • Compute the residual $F_k = I - X_k A$.
    • Newton–Schulz update: $X_{k+1} = P_{n-1}(F_k)\, X_k$ (with efficient factorization).
    • Compute the solution residual $r_k = b - A x_k$.
    • Richardson update: $x_{k+1} = x_k + X_{k+1}\, r_k$.
  3. Repeat until convergence.
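
The steps above can be sketched as follows (a hypothetical minimal implementation; the order, tolerance, and initialization are illustrative choices, not prescriptions from the cited papers):

```python
import numpy as np

def ns_richardson_solve(A, b, order=4, tol=1e-10, max_iters=50):
    """Nested scheme: one Newton-Schulz update of X per outer Richardson step."""
    n = A.shape[0]
    I = np.eye(n)
    # Classical preconditioner scaling ensuring rho(I - X0 A) < 1.
    X = A.T / (np.linalg.norm(A, 1) * np.linalg.norm(A, np.inf))
    x = np.zeros(n)
    for _ in range(max_iters):
        F = I - X @ A                  # residual of the approximate inverse
        P, Fp = I.copy(), I.copy()
        for _ in range(order - 1):     # P = I + F + ... + F^{order-1}
            Fp = Fp @ F
            P = P + Fp
        X = P @ X                      # Newton-Schulz update of X
        r = b - A @ x                  # solution residual
        x = x + X @ r                  # Richardson update of x
        if np.linalg.norm(r) < tol:
            break
    return x

rng = np.random.default_rng(0)
M = rng.standard_normal((10, 10))
A = M @ M.T + np.eye(10)               # SPD test system
b = rng.standard_normal(10)
x = ns_richardson_solve(A, b)
```

As $X_k \to A^{-1}$ at high order, the contraction factor $\|I - X_k A\|$ of the outer iteration collapses, so only a handful of Richardson steps remain expensive.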

This nested structure enables dramatic acceleration, especially when robust preconditioning is required (e.g., for ill-conditioned or rank-deficient $A$) (Stotsky, 2022, Stotsky, 2020).

5. Practical Guidelines and Computational Trade-Offs

Key considerations for real-world use:

  • For SPD problems of moderate size, 8th–9th order factorizations ($\mathrm{EI} \approx 1.41$–$1.44$) provide a favorable balance of speed and stability.
  • In very large or sparse systems, repeated low-order (e.g., second- or third-order) Newton–Schulz steps per Richardson iteration can increase robustness and reduce implementation complexity.
  • For moderately ill-conditioned matrices, a single 4th–6th order Newton–Schulz update per iteration generally suffices; more extreme ill-conditioning may warrant a higher order $n$ or tighter inner–outer iteration nesting.
  • The spectral radius condition $\rho(I - X_0 A) < 1$ is ensured in practice by setting $X_0 = \alpha A^{\top}$ with $0 < \alpha < 2/\|A\|_2^2$. A typical choice is $\alpha = 1/(\|A\|_1 \|A\|_\infty)$.
  • Parallel compute resources can be exploited by distributing factorizations across units according to their computational capabilities (Stotsky, 2022, Stotsky, 2020).
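
The initialization rule $\alpha = 1/(\|A\|_1 \|A\|_\infty)$ is easy to verify numerically; this snippet (a sketch) confirms the contraction condition for a random matrix:

```python
import numpy as np

# X0 = alpha * A.T with alpha = 1/(||A||_1 ||A||_inf) satisfies
# rho(I - X0 A) < 1, because ||A||_2^2 <= ||A||_1 * ||A||_inf
# bounds the eigenvalues of alpha * A.T A inside (0, 1].
rng = np.random.default_rng(1)
A = rng.standard_normal((50, 50))
alpha = 1.0 / (np.linalg.norm(A, 1) * np.linalg.norm(A, np.inf))
X0 = alpha * A.T
rho = np.max(np.abs(np.linalg.eigvals(np.eye(50) - X0 @ A)))
print(rho)  # strictly below 1 for any nonsingular A
```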

6. Applications in Robust Failure Detection and Machine Learning

Failure Detection in Electrical Networks

A central application is robust parameter estimation for failure detection in electrical networks, where streaming least-squares problems are afflicted by nearly rank-deficient design matrices (e.g., due to missing or faulty data channels). The normal equations $A\theta = b$, with $A = \Phi^{\top}\Phi$ and $b = \Phi^{\top} y$ for a design matrix $\Phi$, are regularized with a small $\delta > 0$ as $A_\delta = \Phi^{\top}\Phi + \delta I$, yet the resulting systems frequently remain severely ill-conditioned. Adaptive preconditioning with

$$X_0 = \alpha I, \qquad 0 < \alpha < \frac{2}{\lambda_{\max}(A_\delta)},$$

ensures $\rho(I - X_0 A_\delta) < 1$. The combined Newton–Schulz–Richardson iteration yields stable solutions even as $\delta \to 0$ (i.e., as the condition number grows without bound), enabling accurate tracking of voltage sags/swells where direct (e.g., LU-based) solvers fail. On real DOE/EPRI data, this approach achieves real-time performance with fewer than 10 Richardson steps per window shift and recovers first-harmonic amplitudes with high fidelity (Stotsky, 2022).

Orthogonalization in Machine Learning Optimizers

In large-scale optimization (e.g., the Muon optimizer), Newton–Schulz iterations provide fast, inexact polar-decomposition updates as cheap alternatives to SVD-based projection. For a gradient matrix $G$, the classical cubic recurrence toward the orthogonal polar factor is

$$X_{k+1} = \tfrac{3}{2}\, X_k - \tfrac{1}{2}\, X_k X_k^{\top} X_k, \qquad X_0 = G / \|G\|_F$$

(Muon in practice uses a tuned odd-polynomial variant of this map), and the recurrence is run for a small, fixed number of steps (typically 5), balancing approximation error and computational budget (Shulgin et al., 22 Oct 2025).
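
A minimal sketch of this orthogonalization step, using the classical cubic map (Muon's tuned coefficients differ, and the `steps` value here is illustrative):

```python
import numpy as np

def polar_newton_schulz(G, steps=5):
    """Cubic Newton-Schulz iteration toward the orthogonal polar factor of G.

    Normalizing by the Frobenius norm puts all singular values in (0, 1];
    each step X <- 1.5 X - 0.5 X X^T X pushes them toward 1."""
    X = G / np.linalg.norm(G)          # Frobenius-norm scaling
    for _ in range(steps):
        X = 1.5 * X - 0.5 * (X @ X.T @ X)
    return X

rng = np.random.default_rng(0)
G = rng.standard_normal((4, 4))
Q = polar_newton_schulz(G, steps=12)
print(np.linalg.norm(Q.T @ Q - np.eye(4)))  # small: Q is nearly orthogonal
```

In an optimizer, `G` would be a gradient (or momentum) matrix, and the result stands in for the exact polar factor $UV^{\top}$ of the SVD $G = U\Sigma V^{\top}$.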

Theory predicts that lowering the number of Newton–Schulz steps (incurring greater inexactness in the computed polar factor) degrades the convergence rate by explicit, inexactness-dependent factors. Optimal learning rate and momentum must therefore be co-tuned with the number of Newton–Schulz steps: fewer iterations require a smaller step size and increased momentum, as confirmed in empirical training of deep models such as NanoGPT (Shulgin et al., 22 Oct 2025).

Accelerated variants, such as Chebyshev-optimized Newton–Schulz (CANS), further reduce matmul cost and improve contraction rates by minimizing the worst-case approximation error over the relevant singular value interval, and their use in Muon or Stiefel-manifold optimization achieves faster convergence and lower wall-clock time than classical Newton–Schulz or QR/Cayley retractions (Grishina et al., 12 Jun 2025).

7. Conclusions and Outlook

The Newton–Schulz family, bolstered by unified polynomial factorizations and composite-block constructions, constitutes a highly flexible and tunable class of iterative solvers for inverse, orthogonal, and system-solution computations. Their seamless integration into outer iterative schemes (notably Richardson iteration) imparts both finite-precision safety and exceptional adaptability to scale, conditioning, and computational infrastructure. In both numerical linear algebra and modern machine learning, these schemes have demonstrated robust real-world utility—outperforming direct solvers in speed, robustness, and flexibility across challenging application domains (Stotsky, 2022, Stotsky, 2020, Shulgin et al., 22 Oct 2025, Grishina et al., 12 Jun 2025). Future research directions involve optimal composite design for particular hardware or problem structure, improved contraction analysis for blockwise or layerwise errors, and principled adaptive selection of the order and composition at runtime in large-scale distributed systems.
