Newton–Schulz Iterations: Matrix Inversion & Beyond

Updated 15 November 2025
  • Newton–Schulz iterations are a family of fixed-point schemes that compute matrix inverses and polar factors with quadratic or higher-order convergence.
  • High-order extensions employ optimized polynomial factorizations to minimize matrix multiplications while achieving efficiency indices around 1.41–1.49.
  • Integrating these iterations with Richardson’s method enables robust preconditioning and fast orthogonalization in applications like failure detection and machine learning.

The Newton–Schulz iteration is a family of high-order fixed-point schemes for constructing matrix inverses, orthogonal polar factors, and polynomial approximations of linear operators. Originating in classical work on iterative inversion and now applied widely in computational linear algebra, optimization, and machine learning, the family encompasses both the original second-order Newton–Schulz iteration and its numerous high-order and composite generalizations. Central to its efficacy is its amenability to efficient implementation using only matrix-matrix multiplications, making it especially relevant in contexts such as GPU-accelerated computation, preconditioning of ill-conditioned or rank-deficient systems, and fast orthogonalization in large-scale optimization algorithms.

1. General Formulation and Basic Properties

The classical (second-order) Newton–Schulz iteration for inverting a nonsingular matrix $A$ is given by

$$X_{k+1} = 2X_k - X_k A X_k, \qquad X_0 \approx A^{-1},$$

where $X_0$ is an initial preconditioner. Defining the residual $F_k = I - X_k A$, the iteration can be recast as

$$X_{k+1} = (I + F_k) X_k, \qquad F_{k+1} = F_k^2,$$

which reveals the quadratic convergence: the norm of the residual is squared at each step. This iteration is a special case of the family of hyperpower or Durand–Schulz methods, whose $n$-th order generalization employs the truncated Neumann series $X_{k+1} = \left(I + F_k + F_k^2 + \cdots + F_k^{n-1}\right) X_k$ with $F_{k+1} = F_k^n$, ensuring local convergence of order $n$ (i.e., $\|F_{k+1}\| \leq C \|F_k\|^n$) (Stotsky, 2022, Stotsky, 2020).

The convergence of these schemes is ensured under the spectral condition $\rho(I - X_0 A) < 1$ (where $\rho$ denotes the spectral radius), typically guaranteed by choosing $X_0 = \alpha I$ for appropriate $\alpha > 0$. For symmetric positive definite (SPD) $A$, taking $X_0$ with eigenvalues in the reciprocal range of those of $A$ suffices.
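
The following NumPy sketch illustrates the second-order iteration for an SPD matrix with the $X_0 = \alpha I$ initialization discussed above; the function name, stopping test, and iteration cap are illustrative choices rather than prescriptions from the cited papers.

```python
import numpy as np

def newton_schulz_inverse(A, tol=1e-12, max_iter=60):
    """Second-order Newton-Schulz iteration X_{k+1} = (I + F_k) X_k for SPD A.

    X_0 = alpha * I with alpha = 1 / (||A||_inf / 2 + eps) keeps
    rho(I - X_0 A) < 1, so the residual norm ||F_k|| is squared at each step.
    """
    n = A.shape[0]
    I = np.eye(n)
    eps = 1e-12
    alpha = 1.0 / (np.linalg.norm(A, np.inf) / 2 + eps)
    X = alpha * I
    for _ in range(max_iter):
        F = I - X @ A                    # residual F_k = I - X_k A
        if np.linalg.norm(F) < tol:      # converged: X approximates A^{-1}
            break
        X = (I + F) @ X                  # X_{k+1} = (I + F_k) X_k
    return X
```

For a well-conditioned SPD test matrix this typically reaches machine precision within a few tens of sweeps; the high-order factorizations of the next section reduce the sweep count further.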

2. High-Order Extensions and Unified Polynomial Factorization

While a naïve implementation of the $n$-th order Newton–Schulz scheme would entail computing each power $F_k^j$ for $j = 1, \ldots, n-1$, sophisticated factorization of the underlying residual polynomial significantly reduces the computational cost. Given

$$R_n(F) = I + F + \cdots + F^{n-1},$$

one can decompose $R_n$ into products of lower-degree factors, reducing the number of matrix multiplications per iteration.

  • If $n = w(p+1)$, the factorization

$$R_n(F) = \Big(\sum_{j=0}^{p} F^{j}\Big)\Big(\sum_{d=0}^{w-1} F^{(p+1)d}\Big)$$

requires only $p + w + 1$ matrix multiplications (rather than the naïve $n$).

  • The efficiency index $\mathrm{EI} = n^{1/(p+w+1)}$ quantifies the per-multiply convergence acceleration.

Typical factorization examples (with 6 products per cycle):

| $n$ | Polynomial factorization $R_n(F)$ | EI |
|-----|-----------------------------------|----|
| 8   | $(I+F)(I+F^2)(I+F^4)$             | $\approx 1.4142$ |
| 9   | $(I+F+F^2)(I+F^3)$                | $\approx 1.4422$ |
| 10  | $(I+F)(I+F^2)(I+F^4)(I+F^8)$      | $\approx 1.4678$ |
| 11* | Bespoke (prime) factorization     | $\approx 1.4913$ |

Optimal choice of $n$ and factorization yields a balance between per-step cost and asymptotic convergence (Stotsky, 2022, Stotsky, 2020).
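
As a concrete instance of such a factorization, the sketch below applies one 8th-order sweep $R_8(F) = (I+F)(I+F^2)(I+F^4)$ from the table above; the explicit multiply count in the comments is the point of the example, and the helper name is an illustrative choice.

```python
import numpy as np

def ns_step_order8(X, A):
    """One 8th-order Newton-Schulz sweep using the factorization
    R_8(F) = (I + F)(I + F^2)(I + F^4), costing 6 matrix-matrix multiplies.
    """
    n = A.shape[0]
    I = np.eye(n)
    F = I - X @ A            # multiply 1: residual F_k
    F2 = F @ F               # multiply 2
    F4 = F2 @ F2             # multiply 3
    R = (I + F) @ (I + F2)   # multiply 4
    R = R @ (I + F4)         # multiply 5: R = R_8(F)
    return R @ X             # multiply 6: X_{k+1} = R_8(F_k) X_k
```

Repeatedly calling `X = ns_step_order8(X, A)` from the same $X_0$ as in Section 1 drives $\|I - XA\|$ down with eighth-order speed, i.e., roughly $\|F_{k+1}\| \le \|F_k\|^8$ per six multiplies.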

3. Composite and Parallelized Newton–Schulz Schemes

Composite expansions can further enhance computational throughput, especially on parallel hardware. By writing the high-order polynomial as a product of several low-order terms, one can schedule independent blocks across compute units with different capabilities. Given exponents $(x_1, \ldots, x_w)$, define

$$T_i = \sum_{j=0}^{x_i-1} (I - A X)^{j}, \qquad T_c = T_1 T_2 \cdots T_w,$$

and perform the update $X_{k+1} = T_c X_k$, ensuring rapid decay of the residual $\|F_{k+1}\|$ controlled by the composite contraction $\rho(T_c)$. Recursive Horner-type schemes and nested factorizations enable scaling to high orders while maintaining a minimal multiply count. For instance, a 45th-order Newton–Schulz iteration can be realized in only 10 matrix-matrix multiplies (Stotsky, 2020).
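
One way to realize such a high composite order is the nested (telescoping) factorization $R_{x_1 x_2 \cdots x_w}(F) = R_{x_1}(F)\, R_{x_2}(F^{x_1})\, R_{x_3}(F^{x_1 x_2}) \cdots$. The sketch below uses this reconstruction (with $F = I - X_k A$ for consistency with Section 1) and the order split $(3, 3, 5)$ as an illustrative 45th-order sweep; it is consistent with the multiplication counts quoted above but is not claimed to be the exact scheduling of the cited papers.

```python
import numpy as np

def nested_ns_step(X, A, orders=(3, 3, 5)):
    """One composite Newton-Schulz sweep of order prod(orders) (here 3*3*5 = 45).

    Uses the nested (telescoping) factorization
        R_{x1*x2*...*xw}(F) = R_{x1}(F) * R_{x2}(F^{x1}) * R_{x3}(F^{x1*x2}) * ...
    with F = I - X A, so that the residual satisfies F_{k+1} = F_k^(prod orders).
    """
    n = A.shape[0]
    I = np.eye(n)
    F = I - X @ A          # base residual F_k
    Tc = None
    G = F                  # G holds F^(x1*...*x_{i-1}) for the current factor
    for x in orders:
        Ti = I + G         # build R_x(G) = I + G + G^2 + ... + G^{x-1}
        P = G
        for _ in range(x - 2):
            P = P @ G
            Ti = Ti + P
        Tc = Ti if Tc is None else Tc @ Ti
        G = P @ G          # advance to G^x for the next (coarser) factor
    return Tc @ X          # X_{k+1} = T_c X_k
```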

4. Integration with Richardson Iteration

Richardson’s method solves $A\theta = b$ via the fixed-point scheme

$$\theta_k = \theta_{k-1} + \omega (b - A\theta_{k-1})$$

with slow practical convergence. By embedding a Newton–Schulz approximate inverse $X_k \approx A^{-1}$,

$$\theta_k = \theta_{k-1} - X_k (A\theta_{k-1} - b)$$

or, equivalently,

$$\theta_k = F_k \theta_{k-1} + X_k b, \qquad F_k = I - X_k A,$$

and updating $X_k$ via a high-order Newton–Schulz step at each outer Richardson iteration, one achieves both high-order convergence of the inverse and accelerated reduction of the solution residual. Skeleton pseudocode for the combined scheme:

  1. Initialize $X_0$ (preconditioner) and $\theta_0$ (e.g., zero).
  2. For $k = 1, 2, \ldots$ do:
    • Compute the residual $R_{k-1} = I - X_{k-1} A$.
    • Newton–Schulz update: $X_k = R_n(R_{k-1})\, X_{k-1}$ (with efficient factorization).
    • Set $F_k = I - X_k A$.
    • Richardson update: $\theta_k = F_k \theta_{k-1} + X_k b$.
  3. Repeat until convergence.

This nested structure enables dramatic acceleration, especially when robust preconditioning is required (e.g., for ill-conditioned or rank-deficient $A$) (Stotsky, 2022, Stotsky, 2020).
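
A minimal NumPy rendering of this skeleton is sketched below; the unfactored order-$n$ polynomial update, the relative-residual stopping test, and the default `order=8` are illustrative assumptions rather than settings taken from the cited papers.

```python
import numpy as np

def ns_richardson_solve(A, b, order=8, max_iter=50, tol=1e-10):
    """Richardson iteration with a Newton-Schulz-updated preconditioner."""
    n = A.shape[0]
    I = np.eye(n)
    eps = 1e-12
    X = (1.0 / (np.linalg.norm(A, np.inf) / 2 + eps)) * I   # X_0 = alpha * I
    theta = np.zeros(n)                                      # theta_0 = 0
    for _ in range(max_iter):
        R = I - X @ A                     # residual R_{k-1} = I - X_{k-1} A
        # Newton-Schulz update X_k = R_n(R_{k-1}) X_{k-1}, naive (unfactored) form
        S = I.copy()
        P = I
        for _ in range(order - 1):
            P = P @ R
            S = S + P
        X = S @ X
        F = I - X @ A                     # F_k = I - X_k A
        theta = F @ theta + X @ b         # Richardson update
        if np.linalg.norm(A @ theta - b) < tol * np.linalg.norm(b):
            break
    return theta, X
```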

5. Practical Guidelines and Computational Trade-Offs

Key considerations for real-world use:

  • For SPD problems of moderate size ($n \lesssim 500$), 8th–9th order factorizations ($\mathrm{EI} \approx 1.42$–$1.44$) provide a favorable balance of speed and stability.
  • In very large or sparse systems, repeated low-order ($n = 2, 3$) Newton–Schulz steps per Richardson iteration can increase robustness and reduce implementation complexity.
  • For matrices with condition number $\kappa(A) \lesssim 10^{4}$, a single 4th–6th order Newton–Schulz update per iteration generally suffices; more extreme ill-conditioning may warrant higher $n$ or tighter inner–outer iteration nesting.
  • The spectral radius condition $\rho(I - X_0 A) < 1$ is ensured in practice by setting $X_0 = \alpha I$ with $0 < \alpha < 2/\lambda_{\max}(A)$. A typical choice is $\alpha \approx 1/(\|A\|_\infty / 2)$, as illustrated in the snippet after this list.
  • Parallel compute resources can be exploited by distributing factorizations across units according to their computational capabilities (Stotsky, 2022, Stotsky, 2020).
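
A quick numerical check of the spectral-radius guideline above on a synthetic SPD matrix (the test matrix size and tolerance are illustrative):

```python
import numpy as np

# Verify rho(I - alpha * A) < 1 for the alpha = 1 / (||A||_inf / 2 + eps) choice.
rng = np.random.default_rng(0)
M = rng.standard_normal((200, 200))
A = M @ M.T + 1e-3 * np.eye(200)             # SPD test matrix
eps = 1e-8
alpha = 1.0 / (np.linalg.norm(A, np.inf) / 2 + eps)
rho = np.max(np.abs(np.linalg.eigvalsh(np.eye(200) - alpha * A)))
assert rho < 1.0                              # guarantees Newton-Schulz convergence
print(f"spectral radius of I - alpha*A: {rho:.4f}")
```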

6. Applications in Robust Failure Detection and Machine Learning

Failure Detection in Electrical Networks

A central application is robust parameter estimation for failure detection in electrical networks, where streaming least-squares problems are afflicted by nearly rank-deficient design matrices (e.g., due to missing or faulty data channels):

$$A_k \theta_k^* = b_k, \qquad A_k = \sum_{p=k-s+1}^k \phi_p \phi_p^T, \qquad b_k = \sum_{p=k-s+1}^k \phi_p y_p.$$

Even after regularizing with a small $\delta > 0$ as $A_{r,k} = A_k + \delta I$, the systems frequently remain severely ill-conditioned. Adaptive preconditioning with

$$S_k = \left(\|A_{r,k}\|_\infty/2 + \epsilon\right)^{-1} I$$

ensures $\rho(I - S_k A_{r,k}) < 1$. The combined Newton–Schulz–Richardson iteration yields stable solutions even as $\kappa(A_{r,k}) \to \infty$ ($\delta \to 0$), enabling accurate tracking of voltage sags and swells where direct (e.g., LU-based) solvers fail. On real DOE/EPRI data, this approach achieves real-time performance with fewer than 10 Richardson steps per window shift and recovers first-harmonic amplitudes with high fidelity (Stotsky, 2022).
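
A compact sketch of this pipeline is given below; the window assembly, the low-order Newton–Schulz update, and the fixed budget of Richardson steps are illustrative choices, and variable names such as `phis` and `ns_order` are ours rather than taken from the cited work.

```python
import numpy as np

def windowed_ls_ns_richardson(phis, ys, s, delta=1e-8, ns_order=4, n_rich=10):
    """Sliding-window least squares A_k theta = b_k with A_k = sum phi_p phi_p^T
    and b_k = sum phi_p y_p over the last s samples, regularized as
    A_r = A_k + delta * I and solved by the Newton-Schulz-Richardson combination.
    """
    phi_win = phis[-s:]                       # last s regressor vectors (s x m)
    y_win = ys[-s:]
    m = phi_win.shape[1]
    A = phi_win.T @ phi_win                   # A_k = sum of outer products
    b = phi_win.T @ y_win                     # b_k = sum phi_p y_p
    Ar = A + delta * np.eye(m)
    I = np.eye(m)
    eps = 1e-8
    S = (1.0 / (np.linalg.norm(Ar, np.inf) / 2 + eps)) * I   # adaptive preconditioner S_k
    X, theta = S, np.zeros(m)
    for _ in range(n_rich):                   # ~10 Richardson steps per window shift
        R = I - X @ Ar
        T = I + R                             # low-order R_n(R) Newton-Schulz update
        P = R
        for _ in range(ns_order - 2):
            P = P @ R
            T = T + P
        X = T @ X
        theta = (I - X @ Ar) @ theta + X @ b
    return theta
```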

Orthogonalization in Machine Learning Optimizers

In large-scale optimization (e.g., the Muon optimizer), Newton–Schulz iterations provide fast, inexact polar decomposition updates as cheap alternatives to SVD-based projection. The key recurrence

$$X_{k+1} = \tfrac{1}{2} X_k \left(3I - X_k^\top X_k\right)$$

is run for a small, fixed number of steps (typically 5), balancing approximation error and computational budget (Shulgin et al., 22 Oct 2025).
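
A minimal NumPy sketch of this recurrence, as used for approximate orthogonalization of a gradient-like matrix $G$, is given below; the initial Frobenius-norm scaling (to place the singular values inside the convergence region) is a standard but assumed detail, and the function name is illustrative.

```python
import numpy as np

def ns_orthogonalize(G, steps=5):
    """Approximate polar factor of G via the cubic Newton-Schulz recurrence
    X_{k+1} = 0.5 * X_k (3 I - X_k^T X_k), run for a small fixed number of steps.
    """
    X = G / (np.linalg.norm(G) + 1e-12)       # Frobenius-norm scaling (assumption)
    I = np.eye(G.shape[1])
    for _ in range(steps):
        X = 0.5 * X @ (3 * I - X.T @ X)       # pushes singular values toward 1
    return X
```

After a handful of steps the singular values of the iterate approach 1, giving an inexact polar factor whose error shrinks with the step count, which is the trade-off quantified below.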

Theory predicts that lowering the number of Newton–Schulz steps (incurring greater inexactness $\delta$) degrades the convergence rate by explicit factors (e.g., a deterministic rate inflated by $(1+\delta)/(1-\delta)$). Optimal learning rate and momentum must therefore be co-tuned with the number of Newton–Schulz steps: fewer iterations require a smaller step size and increased momentum, as confirmed in empirical training of deep models such as NanoGPT (Shulgin et al., 22 Oct 2025).

Accelerated variants, such as Chebyshev-optimized Newton–Schulz (CANS), further reduce matmul cost and improve contraction rates by minimizing the worst-case approximation error over the relevant singular value interval, and their use in Muon or Stiefel-manifold optimization achieves faster convergence and lower wall-clock time than classical Newton–Schulz or QR/Cayley retractions (Grishina et al., 12 Jun 2025).

7. Conclusions and Outlook

The Newton–Schulz family, bolstered by unified polynomial factorizations and composite-block constructions, constitutes a highly flexible and tunable class of iterative solvers for inverse, orthogonal, and system-solution computations. Their seamless integration into outer iterative schemes (notably Richardson iteration) imparts both finite-precision safety and exceptional adaptability to scale, conditioning, and computational infrastructure. In both numerical linear algebra and modern machine learning, these schemes have demonstrated robust real-world utility—outperforming direct solvers in speed, robustness, and flexibility across challenging application domains (Stotsky, 2022, Stotsky, 2020, Shulgin et al., 22 Oct 2025, Grishina et al., 12 Jun 2025). Future research directions involve optimal composite design for particular hardware or problem structure, improved contraction analysis for blockwise or layerwise errors, and principled adaptive selection of the order and composition at runtime in large-scale distributed systems.
