
Chebyshev-Optimized Newton-Schulz (CANS) Method

Updated 14 November 2025
  • CANS is a polynomial-iteration framework for matrix function approximation that combines the classical Newton–Schulz iteration with Chebyshev-optimal (minimax) polynomial coefficients adapted to the spectrum of the input matrix.
  • It significantly improves convergence rates and reduces computational time in applications like deep learning optimizers and large-scale numerical linear algebra by leveraging efficient matrix multiplications.
  • Using techniques such as the Remez algorithm for higher-degree polynomials, CANS offers exponential error reduction while maintaining scalability on modern parallel architectures.

The Chebyshev-Optimized Newton-Schulz (CANS) method is a polynomial-iteration framework for matrix function approximation, specifically tailored for tasks that involve matrix inversion, inverse square roots, or matrix orthogonalization. It synergistically combines the structure of Newton–Schulz (or Hotelling’s) iteration with the minimax property of Chebyshev polynomials, resulting in a spectrally-aware, highly parallelizable, and computationally efficient approach. CANS has demonstrated measurable improvements in convergence and wall-clock performance for large-scale numerical linear algebra, deep learning optimizers (such as Muon), and constrained optimization on matrix manifolds, among other applications (Grishina et al., 12 Jun 2025, Bergamaschi et al., 2020).

1. Classical Newton–Schulz Iteration and Its Limitations

The Newton–Schulz iteration is a canonical method for computing the inverse or the inverse square root of a matrix using only matrix multiplications, which makes it highly efficient on modern parallel computing architectures:

X_{k+1} = \frac{1}{2} X_k \left(3I - A X_k^2\right)

for the inverse square root problem, converging quadratically provided $\|I - A X_0^2\| < 1$. The method is also applicable to orthogonalization, where, given a rectangular $X$, one iterates

X_{k+1} = \frac{3}{2} X_k - \frac{1}{2} X_k (X_k^T X_k).

However, the scalar coefficients (e.g., $\frac{3}{2}, -\frac{1}{2}$) are fixed Taylor approximants about the expansion point $t = 1$ and do not adapt to the spectral distribution of the matrix at hand. This lack of spectral adaptivity limits convergence speed and error uniformity, especially when the eigenvalues of the input matrix are poorly clustered or widely spread (Grishina et al., 12 Jun 2025, Bergamaschi et al., 2020).
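
To make the fixed-coefficient baseline concrete, the following minimal NumPy sketch applies the Newton–Schulz orthogonalization iteration to a rectangular matrix. The function name and the normalization of the initial iterate by the spectral norm (so the singular values lie in $(0, 1]$, where the iteration converges) are illustrative assumptions, not details fixed by the text above.

import numpy as np

def newton_schulz_orthogonalize(G, num_iters=8):
    """Fixed-coefficient Newton-Schulz orthogonalization (baseline sketch).

    Iterates X <- 1.5 * X - 0.5 * X (X^T X) after scaling G so that its
    singular values lie in (0, 1], where the iteration converges.
    """
    X = G / np.linalg.norm(G, 2)           # scale by the spectral norm (assumption)
    for _ in range(num_iters):
        X = 1.5 * X - 0.5 * X @ (X.T @ X)  # fixed Taylor coefficients 3/2, -1/2
    return X

# Example: the singular values of the result approach 1.
G = np.random.randn(64, 32)
print(np.linalg.svd(newton_schulz_orthogonalize(G), compute_uv=False)[:3])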

2. Chebyshev-Optimality: Deriving Spectrum-Aware Polynomial Updates

CANS introduces Chebyshev-optimality to the construction of matrix polynomial iterations. For the cubic case, the goal is to find the best odd cubic polynomial $p(x) = \alpha_1 x + \alpha_3 x^3$ minimizing the maximum uniform deviation from unity on the spectral interval $[a, b]$:

\varepsilon = \max_{x \in [a, b]} |p(x) - 1|.

By Chebyshev's alternance theorem, there exist extremal points $x_0, x_1, x_2$ at which the error attains $\pm\varepsilon$ with alternating sign, yielding explicit closed-form optimal coefficients:

\alpha_1 = \frac{2(a^2 + ab + b^2)}{D}, \qquad \alpha_3 = -\frac{2}{D}, \qquad \text{with} \quad D = 2\left(\frac{a^2 + ab + b^2}{3}\right)^{3/2} + a^2 b + a b^2.

Applying $p$ to the singular values of $X_k$ gives the cubic CANS update:

X_{k+1} = \alpha_1 X_k + \alpha_3 X_k (X_k^T X_k) = \frac{2}{D}\left((a^2 + ab + b^2)\, X_k - X_k (X_k^T X_k)\right).

This analytic solution is only available in the cubic case; for higher degrees, the construction is performed numerically.
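
A minimal NumPy sketch of the cubic update is given below, using the closed-form coefficients above. The helper names are hypothetical, and the toy usage estimates the interval $[a, b]$ with a full SVD purely for clarity; in practice cheaper norm or power-iteration bounds would be used.

import numpy as np

def cans_cubic_coefficients(a, b):
    """Closed-form minimax coefficients of the cubic CANS polynomial
    p(x) = alpha1 * x + alpha3 * x^3 on the singular-value interval [a, b]."""
    s = a * a + a * b + b * b
    D = 2.0 * (s / 3.0) ** 1.5 + a * a * b + a * b * b
    return 2.0 * s / D, -2.0 / D             # (alpha1, alpha3)

def cans_cubic_step(X, a, b):
    """One cubic CANS orthogonalization step X <- alpha1 X + alpha3 X (X^T X)."""
    alpha1, alpha3 = cans_cubic_coefficients(a, b)
    return alpha1 * X + alpha3 * X @ (X.T @ X)

# Toy usage: singular-value bounds from an SVD, purely for illustration.
X = np.random.randn(64, 32)
sv = np.linalg.svd(X, compute_uv=False)
a, b = sv[-1], sv[0]
for _ in range(4):
    X = cans_cubic_step(X, a, b)
    sv = np.linalg.svd(X, compute_uv=False)
    a, b = sv[-1], sv[0]                      # refresh the spectral interval each step
print(sv[:3])                                 # singular values are drawn toward 1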

3. Remez Algorithm and Higher-Degree CANS Polynomials

For higher-order odd-degree polynomial approximations ($2n-1 > 3$), CANS employs a discrete Remez algorithm on the spectral interval $[a, b]$:

  • Initialize with a set of alternation points $a = x_0 < x_1 < \cdots < x_n = b$.
  • Solve the linear system $p(x_j) - 1 = (-1)^j \epsilon$, $j = 0, \dots, n$, to obtain the polynomial coefficients and the uniform error $\epsilon$.
  • Identify the new extremal points of the error, update the alternation set, and iterate until convergence in $\epsilon$.

This algorithm computes the unique best uniform odd polynomial pn,a,b(x)p_{n,a,b}(x) that approximates unity over the specified spectral interval, ensuring robust minimax optimality regardless of the underlying spectrum (Grishina et al., 12 Jun 2025).
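
As a concrete illustration, the following NumPy sketch implements one possible discrete Remez exchange for the best uniform odd polynomial approximating unity on $[a, b]$. It follows the alternation conditions above but is not the implementation from the cited papers; the grid resolution, the Chebyshev-node initialization of the reference set, the extremum-selection heuristic, and the stopping tolerance are all assumptions.

import numpy as np

def remez_odd_poly(a, b, n, grid_size=4000, tol=1e-12, max_iter=50):
    """Discrete Remez exchange for the best uniform odd polynomial
    p(x) = sum_{i=1..n} c_i x^(2i-1)  (degree 2n-1)
    approximating the constant 1 on [a, b].  Returns (c, eps)."""
    grid = np.linspace(a, b, grid_size)
    # Initial reference set: n+1 Chebyshev nodes mapped to [a, b].
    x = 0.5 * (a + b) - 0.5 * (b - a) * np.cos(np.pi * np.arange(n + 1) / n)

    c, eps = None, 0.0
    for _ in range(max_iter):
        # Solve p(x_j) - (-1)^j * eps = 1 for c_1..c_n and eps.
        M = np.empty((n + 1, n + 1))
        for i in range(n):
            M[:, i] = x ** (2 * i + 1)            # odd powers x, x^3, ...
        M[:, n] = -((-1.0) ** np.arange(n + 1))   # alternating column for eps
        sol = np.linalg.solve(M, np.ones(n + 1))
        c, eps = sol[:n], sol[n]

        # Error on a dense grid and its (approximate) local extrema.
        err = sum(ci * grid ** (2 * i + 1) for i, ci in enumerate(c)) - 1.0
        ext = ([0]
               + [j for j in range(1, grid_size - 1)
                  if (err[j] - err[j - 1]) * (err[j + 1] - err[j]) <= 0]
               + [grid_size - 1])
        # Exchange step: keep the n+1 largest-magnitude extrema, in order.
        ext = sorted(sorted(ext, key=lambda j: -abs(err[j]))[: n + 1])
        if np.max(np.abs(err)) - abs(eps) < tol:
            break
        x = grid[ext]

    return c, abs(eps)

# Cubic case (n = 2): should closely match the closed-form alpha_1, alpha_3 above.
print(remez_odd_poly(0.1, 1.0, 2))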

4. General Formulation and Iteration Structure

Defining the residual matrix at iteration $k$ as $R_k = I - A X_k^2$, the degree-$d$ CANS iteration applies the minimax polynomial $P_d(t) = \sum_{i=0}^{d} a_i t^i$:

X_{k+1} = \sum_{i=0}^{d} a_i X_k R_k^i = X_k P_d(R_k).

Each term in the polynomial expansion corresponds to a matrix–matrix multiplication, which is efficient on GPU hardware for moderate $d$; the algorithm avoids explicit factorizations or square roots, and does not require explicit knowledge of the full spectrum, only its extremal points.
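
A minimal NumPy sketch of this iteration for the symmetric inverse-square-root case is shown below. The scaled-identity starting guess, the residual-norm stopping test, and the `coeffs` array (which would come from a Remez solve for the relevant spectral interval; the toy usage simply plugs in the classical fixed Newton–Schulz coefficients as a sanity check of the scaffold) are illustrative assumptions.

import numpy as np

def cans_inv_sqrt(A, coeffs, num_iters=20, tol=1e-10):
    """Degree-d CANS iteration X_{k+1} = X_k P_d(R_k), R_k = I - A X_k^2,
    approximating A^{-1/2} for symmetric positive definite A.
    `coeffs` holds the polynomial coefficients a_0..a_d."""
    n = A.shape[0]
    I = np.eye(n)
    # X_0 = c * I with c chosen so that ||I - A X_0^2|| < 1 (assumption).
    X = I / np.sqrt(np.linalg.norm(A, 2))
    for _ in range(num_iters):
        R = I - A @ X @ X                   # residual R_k
        if np.linalg.norm(R, 2) < tol:
            break
        # Horner evaluation; X_k and R_k commute here because both are
        # polynomials in A, so P_d(R) X equals X P_d(R).
        Z = coeffs[-1] * X
        for a_i in coeffs[-2::-1]:
            Z = R @ Z + a_i * X
        X = Z
    return X

# Toy check: the classical Newton-Schulz update 0.5 X (3I - A X^2) = X (I + 0.5 R)
# corresponds to coeffs = [1.0, 0.5] (a degree-1 polynomial in R).
A = np.diag(np.linspace(0.5, 4.0, 6))
X = cans_inv_sqrt(A, coeffs=[1.0, 0.5])
print(np.linalg.norm(X @ X @ A - np.eye(6)))  # should be close to zero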

5. Theoretical Error Analysis and Convergence Behavior

Letting $\kappa = \lambda_{\max}/\lambda_{\min}$ describe the effective spectral condition number, the minimax polynomial's maximum error takes the form:

E_d(\kappa) \approx 2\left( \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} \right)^{d}

so that after one iteration $\|I - A X_{k+1}^2\| \leq E_d(\kappa)\, \|I - A X_k^2\|$ and, after $k$ iterations, the residual decays as $\left(E_d(\kappa)\right)^k$. Thus, increasing the degree $d$ rapidly reduces the one-step error, yielding an exponential convergence rate improvement compared to the classical fixed-coefficient iteration for a fixed spectrum (Grishina et al., 12 Jun 2025, Bergamaschi et al., 2020).
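
As a numerical illustration of this bound (values computed directly from the formula above, not drawn from the cited experiments): for $\kappa = 16$ (so $\sqrt{\kappa} = 4$) and degree $d = 5$,

E_5(16) \approx 2\left(\frac{4 - 1}{4 + 1}\right)^{5} = 2 \cdot 0.6^{5} \approx 0.16,

so each degree-5 step shrinks the residual by roughly a factor of six, whereas the cubic case gives $E_3(16) \approx 2 \cdot 0.6^{3} \approx 0.43$.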

In the context of the Conjugate Gradient (CG) method, a CANS/Chebyshev polynomial preconditioner of degree $m$ maps the spectrum of $A$, assumed to lie in $[\alpha, \beta]$, into the interval $[1 - 1/T_{m+1}(\sigma),\, 1 + 1/T_{m+1}(\sigma)]$, where $T_{m+1}$ is the Chebyshev polynomial of the first kind and $\sigma = (\beta + \alpha)/(\beta - \alpha)$. The effective condition number therefore becomes $\kappa_{\rm eff} = \frac{T_{m+1}(\sigma) + 1}{T_{m+1}(\sigma) - 1}$, which yields a corresponding reduction in the CG residual-norm convergence rate.
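
The small helper below (an illustrative check, not code from the cited works) evaluates this quantity via $T_{m+1}(\sigma) = \cosh\big((m+1)\,\operatorname{arccosh}\sigma\big)$ for $\sigma > 1$; the example spectral bounds are assumed.

import numpy as np

def cheb_effective_kappa(alpha, beta, m):
    """Effective condition number after a degree-m Chebyshev polynomial
    preconditioner, assuming the spectrum of A lies in [alpha, beta]."""
    sigma = (beta + alpha) / (beta - alpha)       # sigma > 1 for 0 < alpha < beta
    T = np.cosh((m + 1) * np.arccosh(sigma))      # T_{m+1}(sigma)
    return (T + 1.0) / (T - 1.0)

# Example with an assumed condition number beta/alpha = 1e4 and the
# polynomial degrees mentioned in Section 7.
for m in (15, 31):
    print(m, cheb_effective_kappa(1e-4, 1.0, m))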

6. Computational Complexity and Parallelization

Each degree-dd CANS step entails:

  • One matrix–matrix multiplication for $X_k^2 = X_k X_k^T$.
  • One multiplication to form $A X_k^2$.
  • $d$ multiplications to evaluate the powers of $R_k$ applied to $X_k$.

Thus, the per-iteration complexity is $(d+2)$ large matrix products. On modern GPUs, these operations approach peak compute throughput due to architectural optimization for such kernels; increasing $d$ trades per-step work for fewer total iterations. In practice, typical choices are $d \in \{2, 3, 5\}$, balancing convergence rate and resource consumption. For parallel sparse linear systems, the method is implemented in a fully matrix-free manner with block-row data distribution across MPI ranks and no global reductions except for the scalar products required by CG (Bergamaschi et al., 2020).

7. Applications in Large-Scale and Machine Learning Contexts

CANS is incorporated in several computational contexts:

A. Deep Learning Orthogonalization (Muon Optimizer):

Muon requires per-step approximate orthogonalization of gradient matrices $G$, wherein CANS is used with spectrum estimates from $G^T G$. Empirical studies with NanoGPT document that composite CANS polynomials (of varying degree per iteration) attain singular-value deviation $\delta \simeq 0.3$ and match or slightly improve convergence rates versus alternative polynomials, all with identical matrix-multiplication costs (Grishina et al., 12 Jun 2025).

B. Riemannian Optimization on the Stiefel Manifold:

Retraction of perturbed points $X + \xi$ to the manifold $\{Y : Y^T Y = I\}$ is expedited by using a low-degree CANS iteration as an efficient substitute for explicit polar- or SVD-based retraction. For instance, in Wide-ResNet experiments on CIFAR-10, CANS retraction matches the accuracy of NLA-based schemes (Cayley or QR) but improves per-epoch training time by 30–40% on a V100 GPU (Grishina et al., 12 Jun 2025).

C. Preconditioning for Large-Scale Conjugate Gradient:

For systems with up to billions of unknowns, CANS preconditioners of degree $m \in \{15, 31\}$ reduce CG iteration counts by factors of 10–100 with minor per-iteration overhead. Weak-scaling experiments report time-to-solution speedups between $1.3\times$ and $2.4\times$ relative to diagonal preconditioning, and up to 97% reduction in synchronization overhead on over 2,000 MPI ranks (Bergamaschi et al., 2020).

| Use Case | Performance Metric | Outcome |
| --- | --- | --- |
| Muon/NanoGPT | Singular-value deviation | $\delta \simeq 0.3$ |
| Stiefel (WRN/CIFAR-10) | Epoch time / speedup | 30–40% faster retraction |
| CG preconditioning | CG iteration reduction | Factors of 10–100; up to $2.4\times$ speedup |

8. High-Level Algorithmic Realization

For the general degree-$d$ method, CANS computes one update as follows (see procedural details in (Grishina et al., 12 Jun 2025) and (Bergamaschi et al., 2020)):

  1. Form $Y \leftarrow A X$.
  2. Form $S \leftarrow X^T X$.
  3. Compute the residual $R \leftarrow I - A X^2$.
  4. Apply Horner's rule to evaluate $P_d(R)$ acting on $X$:

Z = a[d] * X                     # leading term a_d * X
for i in range(d - 1, -1, -1):   # Horner's rule on the matrix polynomial
    Z = R @ Z + a[i] * X         # Z <- R Z + a_i X
X_new = Z                        # equals P_d(R) X (= X P_d(R) when X and R commute)

  5. Return $X_{\rm new}$.

This composition is compatible with both dense (GPU) and large-scale sparse (distributed, matrix-free) environments. Polynomial coefficients may be precomputed offline using the Remez algorithm for the relevant spectral interval.

9. Relationships, Limitations, and Extensions

CANS generalizes and strictly subsumes both fixed-coefficient Newton–Schulz iterations and Chebyshev polynomial methods, with a proved exact equivalence for parameters $(m = 2^j - 1)$. A "de-clustering" modification, implemented by inflating the Chebyshev interval slightly, mitigates extremal eigenvalue condensation and yields measurable acceleration for Krylov methods (Bergamaschi et al., 2020). The method retains its efficacy across orthogonalization, matrix function evaluation, and preconditioning, and requires only approximate knowledge of the spectral extremities of the argument matrix.

Limitations stem from the up-front cost of computing spectral bounds (potentially 5–10 power method or DACG steps) and increased matrix-multiplication cost for higher-degree polynomials. Nevertheless, the algorithm is highly resilient to problem size growth and is particularly advantageous on architectures where matrix multiplications are inexpensive relative to reductions or factorizations.

A plausible implication is that CANS constitutes a unifying computational paradigm for iterative polynomial matrix functions in large-scale numerical optimization and machine learning, especially where matrix-multiplication efficiency is paramount and spectral information is approximately available.
