Chebyshev-Optimized Newton-Schulz Method

Updated 27 December 2025
  • The CANS method is a matrix polynomial algorithm that accelerates Newton-Schulz iterations by employing Chebyshev minimax polynomials to minimize worst-case spectral error.
  • It replaces fixed coefficients with Chebyshev-optimal ones, reducing matrix multiplies by 20–50% and improving convergence in applications like orthogonalization and inversion.
  • Empirical benchmarks demonstrate significant speedups in deep learning optimizers, Stiefel manifold retraction, and conjugate gradient preconditioning for large SPD systems.

The Chebyshev-Optimized Newton-Schulz (CANS) method is a family of matrix polynomial algorithms that accelerates the classical Newton-Schulz iteration by leveraging Chebyshev (minimax) polynomial theory. CANS is primarily used to compute matrix functions such as optimal orthogonal projections, inverses, and preconditioners with improved convergence and computational efficiency. The method replaces fixed Newton-Schulz coefficients with Chebyshev-optimal ones, yielding minimal worst-case spectral error over relevant intervals. This approach has demonstrated substantial speedups in applications including orthogonalization in stochastic optimizers (e.g., Muon), efficient retraction for Riemannian optimization on the Stiefel manifold, and scalable polynomial preconditioning in conjugate gradient methods for large symmetric positive definite (SPD) systems (Grishina et al., 12 Jun 2025, Bergamaschi et al., 2020).

1. Foundations and Classical Newton-Schulz Iteration

The Newton-Schulz iteration is a matrix algorithm for approximating the polar factor of a full-rank matrix $X \in \mathbb{R}^{m \times n}$, with polar decomposition $X = WH$, $W^T W = I_n$, $H = H^T \succeq 0$. Here $W$ solves the orthogonal Procrustes problem $\min_{Q^TQ=I}\|Q-X\|_F$. The iteration $X_{k+1} = \tfrac{3}{2}X_k - \tfrac{1}{2}X_k X_k^T X_k$ relies exclusively on matrix multiplications, making it well-suited for parallel and GPU computation. For inversion of SPD matrices, a related iteration is

$$P_{j+1} = 2P_j - P_j A P_j.$$

Quadratic convergence is assured when the initial condition is spectrally well-scaled (e.g., $\sigma_1(X) < \sqrt{3}$ for polar decomposition, or $\|I - P_0 A\| < 1$ for inversion), but fixed polynomial coefficients do not exploit problem-specific spectral distributions (Grishina et al., 12 Jun 2025, Bergamaschi et al., 2020).
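
For concreteness, a minimal NumPy sketch of both classical iterations follows; the function names and the initial scalings (Frobenius norm for the polar case, $\|A\|_1\|A\|_\infty$ for inversion) are illustrative choices, not prescribed by the cited papers.

```python
import numpy as np

def newton_schulz_polar(X, iters=15):
    """Classical Newton-Schulz iteration for the polar factor W of X (m >= n).
    Scaling by the Frobenius norm guarantees sigma_1 < sqrt(3) at the start."""
    Y = X / np.linalg.norm(X)                  # Frobenius-norm scaling
    for _ in range(iters):
        Y = 1.5 * Y - 0.5 * Y @ (Y.T @ Y)
    return Y

def newton_schulz_inverse(A, iters=20):
    """Newton-Schulz iteration P <- 2P - P A P for A^{-1} (A SPD).
    P_0 = A / (||A||_1 * ||A||_inf) satisfies ||I - P_0 A||_2 < 1."""
    P = A / (np.linalg.norm(A, 1) * np.linalg.norm(A, np.inf))
    for _ in range(iters):
        P = 2.0 * P - P @ A @ P
    return P

X = np.random.randn(500, 100)
W = newton_schulz_polar(X)
print(np.linalg.norm(W.T @ W - np.eye(100)))   # orthogonality error (should be tiny)
```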

2. Chebyshev Optimization and Minimax Polynomials

The CANS paradigm replaces the fixed polynomial in Newton-Schulz with a degree-$(2d-1)$ odd polynomial $p_{d,a,b}(s)$ that optimally approximates 1 on a spectral interval $[a,b] \supset [\sigma_n(X), \sigma_1(X)]$, minimizing

$$\varepsilon = \max_{s\in[a,b]}|p(s) - 1|$$

according to the Chebyshev alternance theorem. The minimax polynomial is characterized by equioscillation at $d+1$ alternation points and ensures the smallest worst-case spectral error after one step, which in turn accelerates convergence and reduces iterations.

For the cubic case, the optimal polynomial is $p_{2,a,b}(s) = \alpha_1 s + \alpha_3 s^3$, with $\alpha_1, \alpha_3$ determined by alternance and with explicit construction via the spectral endpoints $a, b$ and auxiliary quantities such as $E = \left(\frac{a^2 + ab + b^2}{3}\right)^{3/2}$ (Grishina et al., 12 Jun 2025).
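
The cubic alternance conditions can be solved in closed form. The sketch below assumes the equioscillation pattern $p(a) = p(b) = 1-\varepsilon$ and $p(s^*) = 1+\varepsilon$ at the interior maximum $s^* = \sqrt{(a^2+ab+b^2)/3}$; the resulting coefficients reduce to the classical $(3/2, -1/2)$ when $a = b$. The helper name is illustrative.

```python
import numpy as np

def cans3_coefficients(a, b):
    """Chebyshev-optimal odd cubic p(s) = a1*s + a3*s^3 on [a, b], obtained from
    the equioscillation conditions p(a) = p(b) = 1 - eps and p(s*) = 1 + eps
    at the interior maximum s* = sqrt((a^2 + a*b + b^2)/3)."""
    E = ((a*a + a*b + b*b) / 3.0) ** 1.5       # auxiliary quantity E from the text
    denom = a*b*(a + b) + 2.0*E
    a1 = 2.0 * (a*a + a*b + b*b) / denom
    a3 = -2.0 / denom
    eps = (2.0*E - a*b*(a + b)) / denom        # one-step minimax error on [a, b]
    return a1, a3, eps

a1, a3, eps = cans3_coefficients(0.1, 1.0)
s = np.linspace(0.1, 1.0, 20001)
print(np.abs(a1*s + a3*s**3 - 1.0).max(), eps)   # agree up to grid resolution
print(cans3_coefficients(1.0, 1.0)[:2])          # -> (1.5, -0.5): classical Newton-Schulz
```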

In SPD inversion and CG preconditioning, the CANS polynomial coincides with the minimax (Chebyshev) polynomial for $A^{-1}$ on $[\alpha, \beta]$:

$$p_m(x) = \frac{1}{x} \left[ 1 - \frac{T_{m+1}\left(\frac{\alpha + \beta - 2x}{\beta-\alpha}\right)}{T_{m+1}\left(\frac{\alpha + \beta}{\beta-\alpha}\right)} \right]$$

where $T_k$ are Chebyshev polynomials of the first kind (Bergamaschi et al., 2020).
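
A scalar sketch of this polynomial, with $T_k$ evaluated by its three-term recurrence (valid also for arguments outside $[-1,1]$); the function names are illustrative:

```python
import numpy as np

def cheb_T(k, y):
    """Chebyshev polynomial of the first kind T_k(y) via the three-term recurrence."""
    t_prev, t = np.ones_like(y), y
    if k == 0:
        return t_prev
    for _ in range(k - 1):
        t_prev, t = t, 2.0*y*t - t_prev
    return t

def p_minimax_inv(x, m, alpha, beta):
    """Degree-m minimax approximation to 1/x on [alpha, beta] (formula above)."""
    y  = (alpha + beta - 2.0*x) / (beta - alpha)
    y0 = (alpha + beta) / (beta - alpha)
    return (1.0 - cheb_T(m + 1, y) / cheb_T(m + 1, np.asarray(y0))) / x

# The residual 1 - x*p_m(x) is the scaled Chebyshev polynomial q_{m+1}(x),
# so x*p_m(x) stays in a narrow band around 1 over the whole interval.
alpha, beta, m = 1e-2, 1.0, 7
x = np.linspace(alpha, beta, 5000)
print(np.max(np.abs(1.0 - x * p_minimax_inv(x, m, alpha, beta))))
```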

3. Algorithmic Construction and Higher-Degree Extensions

For degrees higher than three, the optimal CANS polynomial is computed using a Remez exchange algorithm, iteratively updating alternation points and solving

$$\sum_{j=1}^d \alpha_{2j-1}\, x_k^{2j-1} = 1 - (-1)^k \varepsilon, \quad k = 0, \ldots, d$$

until equioscillation is attained. The general iteration becomes

$$X_{k+1} = \sum_{j=1}^d \alpha_{2j-1}\, X_k (X_k^T X_k)^{j-1},$$

or, equivalently, a matrix polynomial in $S_k = X_k^T X_k$. Small-degree coefficients are given in closed form, while quintic and higher orders are tabulated (Grishina et al., 12 Jun 2025).
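
A generic step of this iteration can be applied with a Horner-style evaluation in $S_k$, so a degree-$(2d-1)$ update costs $d$ matrix multiplications: one to form $S_k$ and $d-1$ in the Horner recursion. A minimal sketch, where the coefficient tuple would come from the cubic closed form or the Remez computation above (classical coefficients are used in the usage line only to keep the snippet self-contained):

```python
import numpy as np

def cans_step(X, alphas):
    """One CANS step  X <- sum_j alpha_{2j-1} * X * (X^T X)^(j-1),
    with alphas = (alpha_1, alpha_3, ..., alpha_{2d-1}), evaluated Horner-style."""
    S = X.T @ X                         # one matrix multiply
    Y = alphas[-1] * X
    for c in reversed(alphas[:-1]):     # d-1 further multiplies
        Y = Y @ S + c * X
    return Y

# Usage with the classical cubic coefficients (3/2, -1/2); Chebyshev-optimal
# coefficients for the current spectral interval would be substituted here.
X = np.random.randn(300, 50)
X = X / np.linalg.norm(X)
for _ in range(10):
    X = cans_step(X, (1.5, -0.5))
print(np.linalg.norm(X.T @ X - np.eye(50)))   # orthogonality error
```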

In preconditioning, the iterative application naturally yields polynomials of degree $m = 2^j - 1$, and with careful scaling (e.g., using a slightly larger $\theta$ than $(\alpha+\beta)/2$), eigenvalue clustering is mitigated and CG convergence is improved (Bergamaschi et al., 2020).
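
The degree count follows from composing the scalar analogue $p_{j+1}(x) = 2\,p_j(x) - x\,p_j(x)^2$ of the inversion iteration, as the short check below illustrates. The constant initial guess $1/\theta$ is a placeholder; as noted above, $\theta$ is taken slightly larger than $(\alpha+\beta)/2$ in practice.

```python
from numpy.polynomial import Polynomial

theta = 1.0                        # placeholder scaling parameter
p = Polynomial([1.0 / theta])      # p_0(x) = 1/theta, i.e. P_0 = I/theta
x = Polynomial([0.0, 1.0])
for j in range(1, 6):
    p = 2*p - x*p*p                # scalar analogue of P_{j+1} = 2P_j - P_j A P_j
    print(j, p.degree())           # degrees 1, 3, 7, 15, 31, i.e. 2^j - 1
```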

| Polynomial Degree $d$ | Iteration Formula | Notable Use |
|---|---|---|
| 1 | Classical Newton-Schulz | Baseline, fast per step |
| 2 | Chebyshev-optimized cubic, explicit form | Muon, Stiefel projection |
| $>2$ | Remez-optimized, tabulated coefficients | High-precision, CG preconditioning |

4. Convergence Analysis and Computational Cost

CANS iterations inherit quadratic convergence from Newton-Schulz, $\varepsilon_{n+1} \leq \varepsilon_n^2$, with the asymptotic improvement $\lim_{n\to\infty} \varepsilon_{n+1} / \varepsilon_n^2 = 3/4$ for the cubic case, i.e., CANS$_3$. In empirical settings, CANS reduces matrix-multiply-heavy iterations by 20%–50% relative to classical Newton-Schulz for a given accuracy (Grishina et al., 12 Jun 2025).

For polynomial preconditioning in PCG, using degree-$m$ minimax polynomials,

$$\max_{x \in [\alpha,\beta]} |q_{m+1}(x)| \leq 2\left( \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} \right)^{m+1}, \qquad \kappa = \beta/\alpha$$

yields an exponential reduction in the condition number of the preconditioned system, permitting substantial decreases in CG iterations at modest extra computational cost. Practical performance is further enhanced by parameter shifts (e.g., $\theta \rightarrow 1.01\,\theta$) that break eigenvalue clustering (Bergamaschi et al., 2020).
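
As an illustration of how such a preconditioner can be applied matrix-free, the sketch below realizes $z = p_m(A)r$ as $m+1$ steps of the standard Chebyshev iteration started from the zero vector, whose residual polynomial is the scaled Chebyshev polynomial $q_{m+1}$ above, and plugs it into SciPy's CG. The test matrix, eigenvalue bounds, degree, and iteration counter are illustrative, and this is one standard way to apply the polynomial, not necessarily the implementation used in the cited work.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator, cg

def chebyshev_precond(A, alpha, beta, m):
    """LinearOperator applying z = p_m(A) r, with p_m the degree-m minimax
    approximation of A^{-1} on [alpha, beta], realized as m+1 steps of
    Chebyshev iteration (three-term recurrence) started from the zero vector."""
    theta, delta = 0.5 * (beta + alpha), 0.5 * (beta - alpha)
    sigma1 = theta / delta

    def apply(r):
        z = np.zeros_like(r)
        rho = 1.0 / sigma1
        d = r / theta
        resid = r.copy()
        for _ in range(m + 1):
            z = z + d
            resid = resid - A @ d
            rho_next = 1.0 / (2.0 * sigma1 - rho)
            d = rho_next * rho * d + (2.0 * rho_next / delta) * resid
            rho = rho_next
        return z

    n = A.shape[0]
    return LinearOperator((n, n), matvec=apply, dtype=A.dtype)

# Illustrative SPD test problem: 1D Laplacian with known eigenvalue bounds.
n = 400
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")
alpha, beta = 4.0 * np.sin(np.pi / (2 * (n + 1))) ** 2, 4.0
M = chebyshev_precond(A, alpha, beta, m=31)       # dyadic degree m = 2^5 - 1
b = np.ones(n)

iters = [0]
x, info = cg(A, b, M=M, callback=lambda xk: iters.__setitem__(0, iters[0] + 1))
print(info, iters[0], np.linalg.norm(b - A @ x))  # far fewer CG iterations than without M
```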

5. Principal Applications

a. Orthogonalization in Deep Learning Optimizers (Muon):

Muon replaces computationally intensive SVD/QR with fixed-degree NS steps for parameter-update orthonormalization. CANS, with polynomials optimized for the empirical singular-value interval (e.g., $[1-\delta, 1+\delta]$), yields compositions that maximize $\phi'(0)$ and accelerate convergence, outperforming brute-force tuned polynomials and improving test loss in transformer training workloads such as NanoGPT (Grishina et al., 12 Jun 2025).
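
A sketch of one plausible composition scheme in this spirit: normalize the update matrix so its singular values fall in an assumed interval $[\ell, 1]$, then apply cubic steps whose coefficients are recomputed for the interval $[1-\varepsilon_k, 1+\varepsilon_k]$ left by the previous step. The lower-bound parameter, step count, and helper (reusing the cubic closed form sketched in Section 2) are illustrative, not Muon's or the paper's exact recipe.

```python
import numpy as np

def cans3_coefficients(a, b):
    """Chebyshev-optimal cubic on [a, b] (same closed form as the Section 2 sketch)."""
    E = ((a*a + a*b + b*b) / 3.0) ** 1.5
    denom = a*b*(a + b) + 2.0*E
    return 2.0*(a*a + a*b + b*b)/denom, -2.0/denom, (2.0*E - a*b*(a + b))/denom

def cans_orthogonalize(G, lower=0.1, steps=4):
    """Approximate the polar factor of an update matrix G with a few cubic CANS
    steps; `lower` is an assumed lower bound on sigma_min(G)/sigma_max(G)."""
    # Exact spectral norm here for simplicity; a cheap estimate (e.g., a few
    # power iterations) would replace it in an optimizer.
    Y = G / np.linalg.norm(G, 2)              # singular values now in (0, 1]
    a, b = lower, 1.0
    for _ in range(steps):
        a1, a3, eps = cans3_coefficients(a, b)
        Y = a1*Y + a3 * Y @ (Y.T @ Y)
        a, b = 1.0 - eps, 1.0 + eps           # interval holding the new singular values
    return Y

G = np.random.randn(256, 128)                 # stand-in for a gradient-shaped update
Q = cans_orthogonalize(G, lower=0.15, steps=4)
print(np.linalg.norm(Q.T @ Q - np.eye(128)))  # approximate orthogonality error
```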

b. Retraction on the Stiefel Manifold:

For Riemannian SGD/Adam on $St(n, p)$, the computational bottleneck is retraction. CANS$_3$ provides a cubic surrogate, normalized by an estimated largest singular value and leveraging spectral-interval data. On Wide ResNet-16-10/CIFAR-10, this achieves final accuracy comparable to QR/Cayley approaches with substantially reduced per-epoch wall time (Grishina et al., 12 Jun 2025).
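
A minimal sketch of such a retraction; the power-iteration normalization and classical cubic coefficients are simplifications, and the paper's CANS$_3$ would substitute interval-adapted $(\alpha_1, \alpha_3)$.

```python
import numpy as np

def cans3_retraction(X, xi, ns_steps=3, power_steps=5):
    """Approximate the Stiefel retraction R_X(xi) = polar(X + xi) with a few cubic
    Newton-Schulz steps after normalizing by an estimated largest singular value."""
    Y = X + xi
    # Cheap estimate of sigma_1(Y) via power iteration on Y^T Y.
    v = np.random.randn(Y.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(power_steps):
        v = Y.T @ (Y @ v)
        v /= np.linalg.norm(v)
    sigma1 = np.sqrt(v @ (Y.T @ (Y @ v)))
    Y = Y / sigma1
    for _ in range(ns_steps):
        Y = 1.5*Y - 0.5*Y @ (Y.T @ Y)   # classical cubic coefficients for simplicity
    return Y

# Example: retract a small perturbation of a random Stiefel point.
X, _ = np.linalg.qr(np.random.randn(64, 16))
xi = 0.01 * np.random.randn(64, 16)
R = cans3_retraction(X, xi)
print(np.linalg.norm(R.T @ R - np.eye(16)))
```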

c. Polynomial Preconditioning for PCG:

In the solution of large-scale SPD linear systems, CANS preconditioners yield contraction factors and scalability superior to incomplete Cholesky or Jacobi approaches. On platforms such as CINECA Marconi, wall-time reductions of up to $2.35\times$ were observed, with near-ideal weak scaling for matrices with up to $8.6 \times 10^9$ unknowns (Bergamaschi et al., 2020).

| Application | Benefit | Observed Speedup or Accuracy |
|---|---|---|
| Muon-style orthogonalization | Fewer matrix multiplies, better loss curve | 20–50% fewer multiplies |
| Stiefel manifold retraction | Comparable accuracy, reduced time | 43–45 s vs. 61–71 s per epoch |
| CG preconditioning | Fewer iterations, strong scalability | 1.4–2.4$\times$ in wall time |

6. Practical Implementation and Recommendations

CANS algorithms require only basic matrix operations and are backward-compatible with existing matrix-multiplication-centric pipelines. For higher polynomial degrees, parameter selection via eigenvalue estimates (power/Lanczos methods) and a mild scaling modification are recommended. Dyadic polynomial degrees ($m = 2^j - 1$ with $j = 4$–$6$) balance the accuracy gain against extra computational cost in preconditioning contexts. The Remez algorithm is effective for coefficient computation in applications requiring tolerance-controlled orthogonalization or inversion (Bergamaschi et al., 2020).
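
For the spectral-interval estimates, a rough sketch using SciPy's Lanczos-based eigensolver on a hypothetical SPD test matrix; the loose tolerances and small safety margins are deliberate, since CANS needs bounds rather than exact eigenvalues, and the 1.01 factor mirrors the $\theta$ shift mentioned earlier.

```python
import numpy as np
from scipy.sparse.linalg import eigsh

rng = np.random.default_rng(0)
B = rng.standard_normal((1000, 1000))
A = B @ B.T / 1000 + 0.1 * np.eye(1000)     # hypothetical SPD test matrix

# Loose Lanczos estimates of the extreme eigenvalues (ARPACK via eigsh).
lam_max = eigsh(A, k=1, which='LA', return_eigenvectors=False, tol=1e-2)[0]
lam_min = eigsh(A, k=1, which='SA', return_eigenvectors=False, tol=1e-2)[0]

# Safety margins; theta slightly above (alpha + beta)/2, as recommended above.
alpha, beta = 0.95 * lam_min, 1.05 * lam_max
theta = 1.01 * 0.5 * (alpha + beta)
print(alpha, beta, theta)
```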

Parallel implementations distribute the workload in block-row format, requiring only local-neighbor communication for matrix-vector products; the dominant global-reduction overhead shrinks as CG iteration counts decrease. This yields excellent scalability in high-performance computing environments.

7. Summary and Empirical Benchmarks

CANS replaces fixed Newton-Schulz polynomials with Chebyshev-minimax odd polynomials adapted to the spectral interval of interest, thus minimizing worst-case one-step spectral error and achieving faster practical convergence. In polar factor computation, this translates to reductions in iterative cost (matrix multiplies) by 20–50% for a given orthogonality tolerance. In conjugate gradient solvers, the CANS preconditioner results in significant reductions in iteration count and total runtime for very large problems, without loss of final solution accuracy.

Empirical benchmarks spanning deep learning (NanoGPT, Wide ResNet) and large-scale SPD system solution substantiate these gains:

  • $1000 \times 1000$ Gaussian matrix: classical NS requires $\sim 8$ iterations, CANS$_3$ needs $\sim 5$, and CANS$_5$ only $\sim 3$.
  • Stiefel-constrained Riemannian SGD/Adam: accuracy nearly identical to QR/Cayley (94.8–95.8%), with epochs 1.4–1.7$\times$ faster.
  • In parallel CG, dyadic CANS ($m = 31$) achieves 1.4–2.4$\times$ speedup, with strong scalability for billions of unknowns (Grishina et al., 12 Jun 2025, Bergamaschi et al., 2020).
