Chebyshev-Optimized Newton-Schulz Method

Updated 27 December 2025
  • The CANS method is a matrix polynomial algorithm that accelerates Newton-Schulz iterations by employing Chebyshev minimax polynomials to minimize worst-case spectral error.
  • It replaces fixed coefficients with Chebyshev-optimal ones, reducing matrix multiplies by 20–50% and improving convergence in applications like orthogonalization and inversion.
  • Empirical benchmarks demonstrate significant speedups in deep learning optimizers, Stiefel manifold retraction, and conjugate gradient preconditioning for large SPD systems.

The Chebyshev-Optimized Newton-Schulz (CANS) method is a family of matrix polynomial algorithms that accelerates the classical Newton-Schulz iteration by leveraging Chebyshev (minimax) polynomial theory. CANS is primarily used to compute matrix functions such as optimal orthogonal projections, inverses, and preconditioners with improved convergence and computational efficiency. The method replaces fixed Newton-Schulz coefficients with Chebyshev-optimal ones, yielding minimal worst-case spectral error over relevant intervals. This approach has demonstrated substantial speedups in applications including orthogonalization in stochastic optimizers (e.g., Muon), efficient retraction for Riemannian optimization on the Stiefel manifold, and scalable polynomial preconditioning in conjugate gradient methods for large symmetric positive definite (SPD) systems (Grishina et al., 12 Jun 2025, Bergamaschi et al., 2020).

1. Foundations and Classical Newton-Schulz Iteration

The Newton-Schulz iteration is a matrix algorithm for approximating the polar factor of a full-rank matrix $X \in \mathbb{R}^{m \times n}$, with polar decomposition $X = WH$, $W^T W = I_n$, $H = H^T \succeq 0$. Here $W$ solves the orthogonal Procrustes problem $\min_{Q^TQ=I}\|Q-X\|_F$. The iteration $X_{k+1} = \tfrac{3}{2}X_k - \tfrac{1}{2}X_k X_k^T X_k$ relies exclusively on matrix multiplications, making it well-suited for parallel and GPU computation. For inversion of SPD matrices, a related iteration is

$$P_{j+1} = 2P_j - P_j A P_j.$$

Quadratic convergence is assured when the initial condition is spectrally well-scaled (e.g., $\sigma_1(X) < \sqrt{3}$ for polar decomposition, or $\|I - P_0 A\| < 1$ for inversion), but fixed polynomial coefficients do not exploit problem-specific spectral distributions (Grishina et al., 12 Jun 2025, Bergamaschi et al., 2020).
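
For concreteness, a minimal NumPy sketch of both classical iterations follows; the function names and the initial scalings (Frobenius norm for the polar case, $\|A\|_1\|A\|_\infty$ for inversion) are illustrative choices, not prescribed by the cited papers.

```python
import numpy as np

def newton_schulz_polar(X, iters=15):
    """Classical Newton-Schulz iteration for the polar factor W of X (m >= n).
    Scaling by the Frobenius norm guarantees sigma_1 < sqrt(3) at the start."""
    Y = X / np.linalg.norm(X)                  # Frobenius-norm scaling
    for _ in range(iters):
        Y = 1.5 * Y - 0.5 * Y @ (Y.T @ Y)
    return Y

def newton_schulz_inverse(A, iters=20):
    """Newton-Schulz iteration P <- 2P - P A P for A^{-1} (A SPD).
    P_0 = A / (||A||_1 * ||A||_inf) satisfies ||I - P_0 A||_2 < 1."""
    P = A / (np.linalg.norm(A, 1) * np.linalg.norm(A, np.inf))
    for _ in range(iters):
        P = 2.0 * P - P @ A @ P
    return P

X = np.random.randn(500, 100)
W = newton_schulz_polar(X)
print(np.linalg.norm(W.T @ W - np.eye(100)))   # orthogonality error (should be tiny)
```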

2. Chebyshev Optimization and Minimax Polynomials

The CANS paradigm replaces the fixed polynomial in Newton-Schulz with a degree-$(2d-1)$ odd polynomial $p_{d,a,b}(s)$ that optimally approximates 1 on a spectral interval $[a,b] \supset [\sigma_n(X), \sigma_1(X)]$, minimizing

$$\varepsilon = \max_{s\in[a,b]}|p(s) - 1|$$

according to the Chebyshev alternance theorem. The minimax polynomial is characterized by equioscillation at $d+1$ alternation points and ensures the smallest worst-case spectral error after one step, which in turn accelerates convergence and reduces iterations.

For the cubic case, the optimal polynomial is $p_{2,a,b}(s) = \alpha_1 s + \alpha_3 s^3$, with $\alpha_1, \alpha_3$ determined by alternance and with explicit construction via the spectral endpoints $a, b$ and auxiliary quantities such as $E = \left(\frac{a^2 + ab + b^2}{3}\right)^{3/2}$ (Grishina et al., 12 Jun 2025).
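
The cubic alternance conditions can be solved in closed form. The sketch below assumes the equioscillation pattern $p(a) = p(b) = 1-\varepsilon$ and $p(s^*) = 1+\varepsilon$ at the interior maximum $s^* = \sqrt{(a^2+ab+b^2)/3}$; the resulting coefficients reduce to the classical $(3/2, -1/2)$ when $a = b$. The helper name is illustrative.

```python
import numpy as np

def cans3_coefficients(a, b):
    """Chebyshev-optimal odd cubic p(s) = a1*s + a3*s^3 on [a, b], obtained from
    the equioscillation conditions p(a) = p(b) = 1 - eps and p(s*) = 1 + eps
    at the interior maximum s* = sqrt((a^2 + a*b + b^2)/3)."""
    E = ((a*a + a*b + b*b) / 3.0) ** 1.5       # auxiliary quantity E from the text
    denom = a*b*(a + b) + 2.0*E
    a1 = 2.0 * (a*a + a*b + b*b) / denom
    a3 = -2.0 / denom
    eps = (2.0*E - a*b*(a + b)) / denom        # one-step minimax error on [a, b]
    return a1, a3, eps

a1, a3, eps = cans3_coefficients(0.1, 1.0)
s = np.linspace(0.1, 1.0, 20001)
print(np.abs(a1*s + a3*s**3 - 1.0).max(), eps)   # agree up to grid resolution
print(cans3_coefficients(1.0, 1.0)[:2])          # -> (1.5, -0.5): classical Newton-Schulz
```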

In SPD inversion and CG preconditioning, the CANS polynomial coincides with the minimax (Chebyshev) polynomial for $A^{-1}$ on $[\alpha, \beta]$:

$$p_m(x) = \frac{1}{x} \left[ 1 - \frac{T_{m+1}\left(\frac{\alpha + \beta - 2x}{\beta-\alpha}\right)}{T_{m+1}\left(\frac{\alpha + \beta}{\beta-\alpha}\right)} \right]$$

where $T_k$ are Chebyshev polynomials of the first kind (Bergamaschi et al., 2020).
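
A scalar sketch of this polynomial, with $T_k$ evaluated by its three-term recurrence (valid also for arguments outside $[-1,1]$); the function names are illustrative:

```python
import numpy as np

def cheb_T(k, y):
    """Chebyshev polynomial of the first kind T_k(y) via the three-term recurrence."""
    t_prev, t = np.ones_like(y), y
    if k == 0:
        return t_prev
    for _ in range(k - 1):
        t_prev, t = t, 2.0*y*t - t_prev
    return t

def p_minimax_inv(x, m, alpha, beta):
    """Degree-m minimax approximation to 1/x on [alpha, beta] (formula above)."""
    y  = (alpha + beta - 2.0*x) / (beta - alpha)
    y0 = (alpha + beta) / (beta - alpha)
    return (1.0 - cheb_T(m + 1, y) / cheb_T(m + 1, np.asarray(y0))) / x

# The residual 1 - x*p_m(x) is the scaled Chebyshev polynomial q_{m+1}(x),
# so x*p_m(x) stays in a narrow band around 1 over the whole interval.
alpha, beta, m = 1e-2, 1.0, 7
x = np.linspace(alpha, beta, 5000)
print(np.max(np.abs(1.0 - x * p_minimax_inv(x, m, alpha, beta))))
```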

3. Algorithmic Construction and Higher-Degree Extensions

For degrees higher than three, the optimal CANS polynomial is computed using a Remez exchange algorithm, iteratively updating alternation points and solving

$$\sum_{j=1}^d \alpha_{2j-1}\, x_k^{2j-1} = 1 - (-1)^k \varepsilon, \quad k = 0, \ldots, d$$

until equioscillation is attained. The general iteration becomes

$$X_{k+1} = \sum_{j=1}^d \alpha_{2j-1}\, X_k (X_k^T X_k)^{j-1},$$

or, equivalently, a matrix polynomial in $S_k = X_k^T X_k$. Small-degree coefficients are given in closed form, while quintic and higher orders are tabulated (Grishina et al., 12 Jun 2025).
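
A generic step of this iteration can be applied with a Horner-style evaluation in $S_k$, so a degree-$(2d-1)$ update costs $d$ matrix multiplications: one to form $S_k$ and $d-1$ in the Horner recursion. A minimal sketch, where the coefficient tuple would come from the cubic closed form or the Remez computation above (classical coefficients are used in the usage line only to keep the snippet self-contained):

```python
import numpy as np

def cans_step(X, alphas):
    """One CANS step  X <- sum_j alpha_{2j-1} * X * (X^T X)^(j-1),
    with alphas = (alpha_1, alpha_3, ..., alpha_{2d-1}), evaluated Horner-style."""
    S = X.T @ X                         # one matrix multiply
    Y = alphas[-1] * X
    for c in reversed(alphas[:-1]):     # d-1 further multiplies
        Y = Y @ S + c * X
    return Y

# Usage with the classical cubic coefficients (3/2, -1/2); Chebyshev-optimal
# coefficients for the current spectral interval would be substituted here.
X = np.random.randn(300, 50)
X = X / np.linalg.norm(X)
for _ in range(10):
    X = cans_step(X, (1.5, -0.5))
print(np.linalg.norm(X.T @ X - np.eye(50)))   # orthogonality error
```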

In preconditioning, the iterative application naturally yields polynomials of degree $m = 2^j - 1$, and with careful scaling (e.g., using a slightly larger $\theta$ than $(\alpha+\beta)/2$), eigenvalue clustering is mitigated and CG convergence is improved (Bergamaschi et al., 2020).
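
The degree count follows from composing the scalar analogue $p_{j+1}(x) = 2\,p_j(x) - x\,p_j(x)^2$ of the inversion iteration, as the short check below illustrates. The constant initial guess $1/\theta$ is a placeholder; as noted above, $\theta$ is taken slightly larger than $(\alpha+\beta)/2$ in practice.

```python
from numpy.polynomial import Polynomial

theta = 1.0                        # placeholder scaling parameter
p = Polynomial([1.0 / theta])      # p_0(x) = 1/theta, i.e. P_0 = I/theta
x = Polynomial([0.0, 1.0])
for j in range(1, 6):
    p = 2*p - x*p*p                # scalar analogue of P_{j+1} = 2P_j - P_j A P_j
    print(j, p.degree())           # degrees 1, 3, 7, 15, 31, i.e. 2^j - 1
```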

| Polynomial Degree $d$ | Iteration Formula | Notable Use |
|---|---|---|
| 1 | Classical Newton-Schulz | Baseline, fast per step |
| 2 | Chebyshev-optimized cubic, explicit form | Muon, Stiefel projection |
| $>2$ | Remez-optimized, tabulated coefficients | High-precision, CG preconditioning |

4. Convergence Analysis and Computational Cost

CANS iterations inherit quadratic convergence from Newton-Schulz, $\varepsilon_{n+1} \leq \varepsilon_n^2$, with the asymptotic improvement $\lim_{n\to\infty} \varepsilon_{n+1} / \varepsilon_n^2 = 3/4$ for the cubic case, i.e., CANS$_3$. In empirical settings, CANS reduces matrix-multiply-heavy iterations by 20%–50% relative to classical Newton-Schulz for a given accuracy (Grishina et al., 12 Jun 2025).

For polynomial preconditioning in PCG, using degree-$m$ minimax polynomials,

$$\max_{x \in [\alpha,\beta]} |q_{m+1}(x)| \leq 2\left( \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} \right)^{m+1}, \qquad \kappa = \beta/\alpha$$

yields an exponential reduction in the condition number of the preconditioned system, permitting substantial decreases in CG iterations at modest extra computational cost. Practical performance is further enhanced by parameter shifts (e.g., $\theta \rightarrow 1.01\,\theta$) that break eigenvalue clustering (Bergamaschi et al., 2020).
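
As an illustration of how such a preconditioner can be applied matrix-free, the sketch below realizes $z = p_m(A)r$ as $m+1$ steps of the standard Chebyshev iteration started from the zero vector, whose residual polynomial is the scaled Chebyshev polynomial $q_{m+1}$ above, and plugs it into SciPy's CG. The test matrix, eigenvalue bounds, degree, and iteration counter are illustrative, and this is one standard way to apply the polynomial, not necessarily the implementation used in the cited work.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator, cg

def chebyshev_precond(A, alpha, beta, m):
    """LinearOperator applying z = p_m(A) r, with p_m the degree-m minimax
    approximation of A^{-1} on [alpha, beta], realized as m+1 steps of
    Chebyshev iteration (three-term recurrence) started from the zero vector."""
    theta, delta = 0.5 * (beta + alpha), 0.5 * (beta - alpha)
    sigma1 = theta / delta

    def apply(r):
        z = np.zeros_like(r)
        rho = 1.0 / sigma1
        d = r / theta
        resid = r.copy()
        for _ in range(m + 1):
            z = z + d
            resid = resid - A @ d
            rho_next = 1.0 / (2.0 * sigma1 - rho)
            d = rho_next * rho * d + (2.0 * rho_next / delta) * resid
            rho = rho_next
        return z

    n = A.shape[0]
    return LinearOperator((n, n), matvec=apply, dtype=A.dtype)

# Illustrative SPD test problem: 1D Laplacian with known eigenvalue bounds.
n = 400
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")
alpha, beta = 4.0 * np.sin(np.pi / (2 * (n + 1))) ** 2, 4.0
M = chebyshev_precond(A, alpha, beta, m=31)       # dyadic degree m = 2^5 - 1
b = np.ones(n)

iters = [0]
x, info = cg(A, b, M=M, callback=lambda xk: iters.__setitem__(0, iters[0] + 1))
print(info, iters[0], np.linalg.norm(b - A @ x))  # far fewer CG iterations than without M
```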

5. Principal Applications

a. Orthogonalization in Deep Learning Optimizers (Muon):

Muon replaces computationally intensive SVD/QR with fixed-degree NS steps for parameter-update orthonormalization. CANS, with polynomials optimized for the empirical singular-value interval (e.g., $[1-\delta, 1+\delta]$), yields compositions that maximize $\phi'(0)$ and accelerate convergence, outperforming brute-force tuned polynomials and improving test loss in transformer training workloads such as NanoGPT (Grishina et al., 12 Jun 2025).
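
A sketch of one plausible composition scheme in this spirit: normalize the update matrix so its singular values fall in an assumed interval $[\ell, 1]$, then apply cubic steps whose coefficients are recomputed for the interval $[1-\varepsilon_k, 1+\varepsilon_k]$ left by the previous step. The lower-bound parameter, step count, and helper (reusing the cubic closed form sketched in Section 2) are illustrative, not Muon's or the paper's exact recipe.

```python
import numpy as np

def cans3_coefficients(a, b):
    """Chebyshev-optimal cubic on [a, b] (same closed form as the Section 2 sketch)."""
    E = ((a*a + a*b + b*b) / 3.0) ** 1.5
    denom = a*b*(a + b) + 2.0*E
    return 2.0*(a*a + a*b + b*b)/denom, -2.0/denom, (2.0*E - a*b*(a + b))/denom

def cans_orthogonalize(G, lower=0.1, steps=4):
    """Approximate the polar factor of an update matrix G with a few cubic CANS
    steps; `lower` is an assumed lower bound on sigma_min(G)/sigma_max(G)."""
    # Exact spectral norm here for simplicity; a cheap estimate (e.g., a few
    # power iterations) would replace it in an optimizer.
    Y = G / np.linalg.norm(G, 2)              # singular values now in (0, 1]
    a, b = lower, 1.0
    for _ in range(steps):
        a1, a3, eps = cans3_coefficients(a, b)
        Y = a1*Y + a3 * Y @ (Y.T @ Y)
        a, b = 1.0 - eps, 1.0 + eps           # interval holding the new singular values
    return Y

G = np.random.randn(256, 128)                 # stand-in for a gradient-shaped update
Q = cans_orthogonalize(G, lower=0.15, steps=4)
print(np.linalg.norm(Q.T @ Q - np.eye(128)))  # approximate orthogonality error
```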

b. Retraction on the Stiefel Manifold:

For Riemannian SGD/Adam on $St(n, p)$, the computational bottleneck is retraction. CANS$_3$ provides a cubic surrogate, normalized by an estimated largest singular value and leveraging spectral-interval data. On Wide ResNet-16-10/CIFAR-10, this achieves final accuracy comparable to QR/Cayley approaches with substantially reduced per-epoch wall time (Grishina et al., 12 Jun 2025).
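
A minimal sketch of such a retraction; the power-iteration normalization and classical cubic coefficients are simplifications, and the paper's CANS$_3$ would substitute interval-adapted $(\alpha_1, \alpha_3)$.

```python
import numpy as np

def cans3_retraction(X, xi, ns_steps=3, power_steps=5):
    """Approximate the Stiefel retraction R_X(xi) = polar(X + xi) with a few cubic
    Newton-Schulz steps after normalizing by an estimated largest singular value."""
    Y = X + xi
    # Cheap estimate of sigma_1(Y) via power iteration on Y^T Y.
    v = np.random.randn(Y.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(power_steps):
        v = Y.T @ (Y @ v)
        v /= np.linalg.norm(v)
    sigma1 = np.sqrt(v @ (Y.T @ (Y @ v)))
    Y = Y / sigma1
    for _ in range(ns_steps):
        Y = 1.5*Y - 0.5*Y @ (Y.T @ Y)   # classical cubic coefficients for simplicity
    return Y

# Example: retract a small perturbation of a random Stiefel point.
X, _ = np.linalg.qr(np.random.randn(64, 16))
xi = 0.01 * np.random.randn(64, 16)
R = cans3_retraction(X, xi)
print(np.linalg.norm(R.T @ R - np.eye(16)))
```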

c. Polynomial Preconditioning for PCG:

In the solution of large-scale SPD linear systems, CANS preconditioners yield contraction factors and scalability superior to incomplete Cholesky or Jacobi approaches. On platforms such as CINECA Marconi, wall-time reductions of up to $2.35\times$ were observed, with near-ideal weak scaling for matrices with up to $8.6 \times 10^9$ unknowns (Bergamaschi et al., 2020).

| Application | Benefit | Observed Speedup or Accuracy |
|---|---|---|
| Muon-style orthogonalization | Fewer matrix multiplies, better loss curve | 20–50% fewer multiplies |
| Stiefel manifold retraction | Comparable accuracy, reduced time | 43–45 s vs. 61–71 s per epoch |
| CG preconditioning | Fewer iterations, strong scalability | 1.4–2.4$\times$ in wall time |

6. Practical Implementation and Recommendations

CANS algorithms require only basic matrix operations and are backward-compatible with existing matrix-multiplication-centric pipelines. For higher polynomial degrees, parameter selection via eigenvalue estimates (power/Lanczos methods) and a mild scaling modification are recommended. Dyadic polynomial degrees ($m = 2^j - 1$ with $j = 4$–$6$) balance the accuracy gain against extra computational cost in preconditioning contexts. The Remez algorithm is effective for coefficient computation in applications requiring tolerance-controlled orthogonalization or inversion (Bergamaschi et al., 2020).
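
For the spectral-interval estimates, a rough sketch using SciPy's Lanczos-based eigensolver on a hypothetical SPD test matrix; the loose tolerances and small safety margins are deliberate, since CANS needs bounds rather than exact eigenvalues, and the 1.01 factor mirrors the $\theta$ shift mentioned earlier.

```python
import numpy as np
from scipy.sparse.linalg import eigsh

rng = np.random.default_rng(0)
B = rng.standard_normal((1000, 1000))
A = B @ B.T / 1000 + 0.1 * np.eye(1000)     # hypothetical SPD test matrix

# Loose Lanczos estimates of the extreme eigenvalues (ARPACK via eigsh).
lam_max = eigsh(A, k=1, which='LA', return_eigenvectors=False, tol=1e-2)[0]
lam_min = eigsh(A, k=1, which='SA', return_eigenvectors=False, tol=1e-2)[0]

# Safety margins; theta slightly above (alpha + beta)/2, as recommended above.
alpha, beta = 0.95 * lam_min, 1.05 * lam_max
theta = 1.01 * 0.5 * (alpha + beta)
print(alpha, beta, theta)
```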

Parallel implementations distribute the workload in block-row format, requiring only local-neighbor communication for matrix-vector products; the dominant global-reduction overhead shrinks as CG iteration counts decrease. This yields excellent scalability in high-performance computing environments.

7. Summary and Empirical Benchmarks

CANS replaces fixed Newton-Schulz polynomials with Chebyshev-minimax odd polynomials adapted to the spectral interval of interest, thus minimizing worst-case one-step spectral error and achieving faster practical convergence. In polar factor computation, this translates to reductions in iterative cost (matrix multiplies) by 20–50% for a given orthogonality tolerance. In conjugate gradient solvers, the CANS preconditioner results in significant reductions in iteration count and total runtime for very large problems, without loss of final solution accuracy.

Empirical benchmarks spanning deep learning (NanoGPT, Wide ResNet) and large-scale SPD system solution substantiate these gains:

  • $1000 \times 1000$ Gaussian matrix: classical NS requires $\sim 8$ iterations, CANS$_3$ needs $\sim 5$, and CANS$_5$ only $\sim 3$.
  • Stiefel-constrained Riemannian SGD/Adam: accuracy nearly identical to QR/Cayley (94.8–95.8%), with epochs 1.4–1.7$\times$ faster.
  • In parallel CG, dyadic CANS ($m = 31$) achieves 1.4–2.4$\times$ speedup, with strong scalability for billions of unknowns (Grishina et al., 12 Jun 2025, Bergamaschi et al., 2020).
