
Chebyshev-Optimized Newton–Schulz (CANS)

Updated 11 April 2026
  • CANS is a matrix computation framework that leverages Chebyshev minimax polynomials to optimize Newton–Schulz iterations for rapid convergence.
  • It accelerates matrix orthogonalization and inversion by aligning polynomial coefficients with spectral properties, benefiting deep learning and scientific computing.
  • Empirical results show that CANS achieves superior convergence rates and reduced computational times compared to traditional methods.

The Chebyshev-Optimized Newton–Schulz (CANS) method is a unified framework that extends the classical Newton–Schulz (NS) iteration for fast matrix orthogonalization and matrix inversion by drawing on the theory of Chebyshev-type polynomial approximation. CANS uses the minimax optimality of Chebyshev polynomials, constructed algorithmically via the Remez algorithm, to obtain fixed-point or iterative update steps with provably optimal convergence in spectral norm. In deep learning, CANS is used for efficient polar decomposition and Stiefel manifold retraction; in scientific computing, it serves as a high-performance polynomial preconditioner for large-scale Krylov solvers. CANS converges faster than traditional NS by aligning polynomial coefficients with the singular spectrum of the input or operator, exploiting parallel hardware, and controlling spectral clustering.

1. Mathematical Principles and Foundational Iterations

The classical NS iteration addresses the problem of computing the nearest matrix with orthonormal columns (the polar factor) to a given full-rank matrix $X \in \mathbb{R}^{m \times n}$ ($m \geq n$). The NS update is

$$X_{k+1} = \frac{3}{2} X_k - \frac{1}{2} X_k (X_k^T X_k), \quad X_1 = X.$$

This fixed-point iteration is equivalent, at the singular value level, to repeated application of the cubic polynomial

$$p_{\rm NS}(s) = \frac{3}{2} s - \frac{1}{2} s^3.$$

NS converges quadratically provided $\sigma_1(X) < \sqrt{3}$ and $\sigma_n(X) > 0$.
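As a minimal sketch of the classical update above, the following NumPy snippet runs the cubic iteration on a small random matrix. The function name `newton_schulz`, the step count, and the Frobenius-norm scaling (used here only to guarantee $\sigma_1 \le 1 < \sqrt{3}$) are illustrative assumptions, not choices taken from the cited papers.

```python
import numpy as np

def newton_schulz(X, steps=30):
    """Classical Newton-Schulz iteration for the polar factor of X (m >= n)."""
    # Assumed scaling: the Frobenius norm bounds the spectral norm, so sigma_1 <= 1.
    Y = X / np.linalg.norm(X)
    for _ in range(steps):
        # Y <- 3/2 Y - 1/2 Y (Y^T Y): two matrix multiplications per step
        Y = 1.5 * Y - 0.5 * Y @ (Y.T @ Y)
    return Y

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 5))
Q = newton_schulz(X)
# Distance of Q^T Q from the identity; small when Q has orthonormal columns
orth_error = np.linalg.norm(Q.T @ Q - np.eye(5))
```

The scaling does not change the polar factor, since multiplying $X$ by a positive scalar only rescales its singular values.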

In matrix inversion contexts, the Newton (Hotelling) iteration is $P_{j+1} = 2P_j - P_j A P_j$ for SPD $A$. Both scenarios rely solely on matrix multiplications, affording efficient GPU or distributed execution (Grishina et al., 12 Jun 2025, Bergamaschi et al., 2020).
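The Hotelling iteration can be sketched in the same style. The starting guess $P_0 = I / \operatorname{tr}(A)$ is an assumption made for this example: for SPD $A$ it places the eigenvalues of $P_0 A$ in $(0, 1]$, so the residual $I - P_j A$ is squared at every step.

```python
import numpy as np

def hotelling_inverse(A, steps=30):
    """Newton (Hotelling) iteration P <- 2P - P A P for an SPD matrix A."""
    n = A.shape[0]
    # Assumed starting guess: eigenvalues of P A lie in (0, 1], hence ||I - P A||_2 < 1.
    P = np.eye(n) / np.trace(A)
    for _ in range(steps):
        P = 2.0 * P - P @ A @ P
    return P

rng = np.random.default_rng(1)
B = rng.standard_normal((6, 6))
A = B @ B.T + 6.0 * np.eye(6)      # SPD test matrix
P = hotelling_inverse(A)
inv_error = np.linalg.norm(P @ A - np.eye(6))
```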

2. Chebyshev Optimality and Coefficient Selection

The NS polynomial is optimal in a Taylor sense at the center point $x = 1$, not in maximum error over the spectral range. CANS formulates the minimax polynomial approximation

$$\min_{p \in L_n} \max_{x \in [a, b]} |p(x) - 1|,$$

where $L_n$ is the class of odd polynomials of degree $2n-1$, and the singular values or eigenvalues lie in $[a, b]$.

Chebyshev’s alternance theorem guarantees the existence and uniqueness of the minimax solution, with the error curve “oscillating” between extremal values at $n+1$ alternation points. For the cubic case ($n = 2$), CANS derives explicit optimal coefficients:

$$p_{2,a,b}(x) = \alpha x - \beta x^3,$$

where $e = a^2 + ab + b^2$, with coefficients

$$\alpha = \frac{2e}{2\,(e/3)^{3/2} + ab(a+b)}, \qquad \beta = \frac{2}{2\,(e/3)^{3/2} + ab(a+b)}.$$

Higher-degree optimal polynomials are obtained constructively via the Remez algorithm (Grishina et al., 12 Jun 2025).
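To make the contrast between Taylor optimality and minimax optimality concrete, the sketch below computes the equioscillating odd cubic on $[a, b]$ and compares its maximum error with that of the NS cubic. The coefficients follow from requiring $p(a) = p(b) = 1 - \varepsilon$ and $p(x^*) = 1 + \varepsilon$ at the interior stationary point $x^* = \sqrt{e/3}$, with $e = a^2 + ab + b^2$. The function name `optimal_cubic` and the interval $[0.1, 1]$ are illustrative assumptions.

```python
import numpy as np

def optimal_cubic(a, b):
    """Minimax odd cubic p(x) = alpha*x - beta*x**3 for the constant 1 on [a, b], 0 < a < b.

    Returns (alpha, beta, eps), where eps = max over [a, b] of |p(x) - 1|.
    """
    e = a * a + a * b + b * b          # forces p(a) == p(b)
    peak = 2.0 * (e / 3.0) ** 1.5      # value of e*x - x^3 at x* = sqrt(e/3)
    edge = a * b * (a + b)             # value of e*x - x^3 at x = a and at x = b
    beta = 2.0 / (peak + edge)         # scales the curve so the errors are +eps and -eps
    alpha = beta * e
    eps = (peak - edge) / (peak + edge)
    return alpha, beta, eps

a, b = 0.1, 1.0
alpha, beta, eps = optimal_cubic(a, b)
xs = np.linspace(a, b, 10001)
err_minimax = np.max(np.abs(alpha * xs - beta * xs**3 - 1.0))
err_ns = np.max(np.abs(1.5 * xs - 0.5 * xs**3 - 1.0))
print(f"max error on [{a}, {b}]: minimax cubic {err_minimax:.4f}, NS cubic {err_ns:.4f}")
```

On an interval whose lower end is far from 1, the NS cubic leaves the smallest singular values almost untouched, while the minimax cubic spreads the error evenly across the whole interval.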

3. High-Order Construction and Algorithmic Framework

The optimal polynomial of the chosen (odd) degree is determined by solving a minimax problem over the spectrum. For degree $2n-1$:

$$p_{n,a,b} = \arg\min_{p \in L_n} \max_{x \in [a, b]} |p(x) - 1|.$$

The Remez algorithm iteratively adjusts alternation points and solves a linear system for coefficients and uniform error, typically converging in a few iterations. For degrees beyond 5, numerical stability deteriorates, so in practice CANS implementations rarely exceed degree 5 (Grishina et al., 12 Jun 2025).
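The exchange loop described above can be sketched on a fine grid. This is a simplified, grid-based variant written for illustration, not the cited authors' implementation; the function name `remez_odd`, the grid size, and the stopping tolerance are assumptions. It relies on the fact that $\{1, x, x^3, \dots, x^{2n-1}\}$ is a Chebyshev system on $(0, \infty)$, so the error curve splits into exactly $n + 1$ stretches of constant sign.

```python
import numpy as np

def remez_odd(n, a, b, grid_size=20001, max_iter=50, tol=1e-12):
    """Minimax fit of 1 on [a, b] (0 < a < b) by c_1 x + c_2 x^3 + ... + c_n x^(2n-1).

    Returns (coeffs, E), where |E| is the levelled error on the final reference set.
    """
    grid = np.linspace(a, b, grid_size)
    powers = 2 * np.arange(n) + 1                     # 1, 3, ..., 2n-1
    # Initial reference: n + 1 Chebyshev extrema mapped to [a, b]
    ref = 0.5 * (a + b) - 0.5 * (b - a) * np.cos(np.pi * np.arange(n + 1) / n)
    signs = (-1.0) ** np.arange(n + 1)
    for _ in range(max_iter):
        # Solve sum_j c_j ref_i^(2j-1) + (-1)^i E = 1 for the coefficients and E
        M = np.column_stack([ref[:, None] ** powers, signs])
        sol = np.linalg.solve(M, np.ones(n + 1))
        coeffs, E = sol[:n], sol[n]
        err = (grid[:, None] ** powers) @ coeffs - 1.0
        if np.max(np.abs(err)) - abs(E) <= tol:
            break                                      # error is levelled on the grid
        # Exchange step: take one extremum of |err| from each sign-constant stretch
        s = np.where(err >= 0.0, 1, -1)
        cuts = np.nonzero(np.diff(s))[0] + 1
        pieces = np.split(np.arange(grid_size), cuts)
        if len(pieces) != n + 1:
            raise RuntimeError("error curve does not have n + 1 alternating pieces")
        ref = np.array([grid[p[np.argmax(np.abs(err[p]))]] for p in pieces])
    return coeffs, E

cubic_coeffs, cubic_E = remez_odd(2, 0.1, 1.0)      # degree 3
quintic_coeffs, quintic_E = remez_odd(3, 0.1, 1.0)  # degree 5
```

For `n = 2` the result can be checked against the closed-form cubic coefficients; for larger `n` the linear system in monomial powers becomes increasingly ill-conditioned, which is consistent with the stability remark above.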

For matrix inversion/preconditioning, CANS polynomials can be generated via Chebyshev recurrence or Newton-type doubling recurrences, with parameterized spectral shifts to avoid undesirable eigenvalue clustering (Bergamaschi et al., 2020).

4. Convergence Properties and Error Analysis

CANS delivers uniform, quadratically accelerating convergence. When the best cubic polynomial […] is applied repeatedly, with error […],

[…]

For an initial interval […], the number of iterations required to reduce the error below […] is approximately

[…]

In matrix preconditioning, the error for the Chebyshev polynomial […] satisfies

[…]

This translates into robust reduction in residual per iteration in preconditioned conjugate gradient (PCG) methods (Grishina et al., 12 Jun 2025, Bergamaschi et al., 2020).
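The quadratic contraction claimed at the start of this section can be checked directly for the cubic case. The derivation below is a sketch worked out here, not necessarily the bound as stated in the cited paper. It assumes that after a minimax step with uniform error $\varepsilon_k$ the singular values lie in $[1 - \varepsilon_k, 1 + \varepsilon_k]$, and it uses the equioscillation error of the best odd cubic on $[a, b]$, namely $\varepsilon(a, b) = \dfrac{2(e/3)^{3/2} - ab(a+b)}{2(e/3)^{3/2} + ab(a+b)}$ with $e = a^2 + ab + b^2$.

```latex
\begin{align*}
a &= 1-\varepsilon_k, \quad b = 1+\varepsilon_k
  \;\Longrightarrow\;
  a^2+ab+b^2 = 3+\varepsilon_k^2, \quad ab(a+b) = 2\left(1-\varepsilon_k^2\right), \\
\varepsilon_{k+1}
  &= \frac{2\left(1+\varepsilon_k^2/3\right)^{3/2} - 2\left(1-\varepsilon_k^2\right)}
          {2\left(1+\varepsilon_k^2/3\right)^{3/2} + 2\left(1-\varepsilon_k^2\right)}
   = \frac{3\varepsilon_k^2 + O(\varepsilon_k^4)}{4 + O(\varepsilon_k^2)}
   = \tfrac{3}{4}\,\varepsilon_k^2 + O(\varepsilon_k^4).
\end{align*}
```

Under these assumptions the error is roughly squared at each step, so the number of correct digits about doubles per cubic application once $\varepsilon_k$ is small.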

5. Algorithmic Realization and Computational Aspects

Orthogonalization via CANS for a matrix $X \in \mathbb{R}^{m \times n}$ proceeds as follows:

  1. Initialize by estimating or normalizing the spectral bounds $[a, b]$.
  2. For each of the prescribed iterations, compute the optimal polynomial coefficients (closed form for the cubic, Remez otherwise) and update the interval $[a, b]$ using the last step's error.
  3. Compose the sequence of polynomials, applying each via efficient matmul evaluation.
  4. Output is an approximately orthogonal matrix.
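The steps above can be sketched for the cubic case as follows. This is a minimal illustration under stated assumptions, not the reference implementation: the names `optimal_cubic` and `cans_orthogonalize` are hypothetical, the spectral bounds are known exactly because the test matrix is built from prescribed singular values, and the interval update $[a, b] \leftarrow [1 - \varepsilon, 1 + \varepsilon]$ uses the equioscillation error of the cubic just applied.

```python
import numpy as np

def optimal_cubic(a, b):
    """Coefficients (alpha, beta) of the minimax cubic alpha*x - beta*x^3 on [a, b] and its error."""
    e = a * a + a * b + b * b
    peak = 2.0 * (e / 3.0) ** 1.5
    edge = a * b * (a + b)
    beta = 2.0 / (peak + edge)
    eps = max((peak - edge) / (peak + edge), 0.0)   # clamp rounding noise near convergence
    return beta * e, beta, eps

def cans_orthogonalize(X, a, b, steps=7):
    """Compose optimal cubics; the singular values of X are assumed to lie in [a, b]."""
    Y = X.copy()
    for _ in range(steps):
        alpha, beta, eps = optimal_cubic(a, b)      # step 2: coefficients for current interval
        Y = alpha * Y - beta * Y @ (Y.T @ Y)        # step 3: two matmuls per cubic
        a, b = 1.0 - eps, 1.0 + eps                 # step 2: interval from the last step's error
    return Y

# Test matrix with prescribed singular values in [0.2, 1] (demonstration only)
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((8, 5)))
V, _ = np.linalg.qr(rng.standard_normal((5, 5)))
sigma = np.linspace(0.2, 1.0, 5)
X = (U * sigma) @ V.T

Q = cans_orthogonalize(X, a=0.2, b=1.0)
orth_error = np.linalg.norm(Q.T @ Q - np.eye(5))
```

Once the interval has collapsed to $[1, 1]$, the coefficients reduce to $\alpha = 3/2$, $\beta = 1/2$, so the loop continues with the classical NS step.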

The number of matrix multiplications per CANS step grows with the polynomial degree (e.g., 2 for a cubic). QR decomposition and SVD are avoided entirely. For inversion/preconditioning in PCG, the CANS preconditioner requires recursive mat-vecs with $A$, combined using locally computable recurrences. The implementation is highly parallel, with SpMV and dot products as the main primitives, and preconditioner application decoupled from global communication (Grishina et al., 12 Jun 2025, Bergamaschi et al., 2020).
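To illustrate how a polynomial preconditioner fits into PCG using only mat-vecs, the sketch below applies a fixed number of steps of the textbook Chebyshev iteration as the preconditioner. This is a swapped-in, simpler technique: it is not the Newton-type doubling recurrence or the shifted variant of the cited work. The names `chebyshev_apply` and `pcg`, the 1D Laplacian test matrix, and the use of exact eigenvalue bounds from `np.linalg.eigvalsh` are assumptions made for demonstration.

```python
import numpy as np

def chebyshev_apply(A, r, lmin, lmax, steps):
    """Return z = p(A) r, with p the Chebyshev approximation of 1/x on [lmin, lmax].

    p has degree steps - 1 and is applied with steps - 1 mat-vecs via a three-term recurrence.
    """
    theta, delta = 0.5 * (lmax + lmin), 0.5 * (lmax - lmin)
    sigma1 = theta / delta
    rho = 1.0 / sigma1
    z = np.zeros_like(r)
    res = r.copy()
    d = res / theta
    for k in range(steps):
        z = z + d
        if k + 1 == steps:
            break
        res = res - A @ d
        rho_new = 1.0 / (2.0 * sigma1 - rho)
        d = rho_new * rho * d + (2.0 * rho_new / delta) * res
        rho = rho_new
    return z

def pcg(A, b, precond=None, rtol=1e-8, maxiter=5000):
    """Preconditioned conjugate gradients; returns (solution, iteration count)."""
    x = np.zeros_like(b)
    r = b.copy()
    z = precond(r) if precond else r.copy()
    p = z.copy()
    rz = r @ z
    bnorm = np.linalg.norm(b)
    for k in range(1, maxiter + 1):
        Ap = A @ p
        step = rz / (p @ Ap)
        x = x + step * p
        r = r - step * Ap
        if np.linalg.norm(r) <= rtol * bnorm:
            return x, k
        z = precond(r) if precond else r
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x, maxiter

n = 200
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # 1D Laplacian (SPD)
lam = np.linalg.eigvalsh(A)                               # exact bounds, demonstration only
b = np.random.default_rng(0).standard_normal(n)

x_plain, it_plain = pcg(A, b)
x_poly, it_poly = pcg(A, b, precond=lambda r: chebyshev_apply(A, r, lam[0], lam[-1], steps=10))
print(f"CG iterations: {it_plain} without preconditioner, {it_poly} with the polynomial preconditioner")
```

Each preconditioned iteration costs more mat-vecs but fewer dot products overall, which is the trade-off that makes polynomial preconditioning attractive when global reductions dominate.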

6. Empirical Performance and Use Cases

CANS accelerates polar factor computation in neural optimizers (e.g., Muon) and serves as a retraction in Riemannian optimization:

  • In Muon, CANS with degree-3 or degree-5 polynomials and tailored step counts achieves reduced wall-clock time and faster loss convergence for NanoGPT (125M parameters, 0.8B tokens) compared to the baseline Muon polynomial; 12–15 CANS matmuls outperform 4-step (8 matmul) Muon baselines (Grishina et al., 12 Jun 2025).
  • For Stiefel retraction in Wide ResNet-16-10 on CIFAR-10, CANS achieves accuracy comparable to QR or Cayley retraction while reducing epoch time (e.g., Adam + CANS: 95.82% accuracy at 45.1 s/epoch vs. Adam + QR: 95.57% at 61.7 s/epoch, about 27% less time per epoch).

In large-scale linear system solving, CANS as a polynomial preconditioner for PCG improves both sequential and strong scaling wall-clock performance:

  • On Opt_Transp ([…]): unpreconditioned PCG requires 3433 iterations/26.4 s, while CANS degree-15 reduces this to 222 iterations/19.6 s.
  • On Emilia_923 ([…]), scaling from 16 to 512 ranks, CANS (degree 31) boosts parallel efficiency from 25% to 58%, reducing total time by a factor of ~2.35 (Bergamaschi et al., 2020).

CANS also matches or surpasses advanced AMG/FSAI preconditioners in high-core-count regimes.

7. Limitations and Practical Considerations

CANS performance is limited by several factors:

  • Remez instability for degrees above 5–7 restricts practical usage to cubics or quintics.
  • Accurate lower spectral bound estimation is important; underestimation slows convergence, and overestimation beyond […] may prevent convergence. A “[…]-orthogonalization” pre-step can enforce a tight spectral interval if needed.
  • The method assumes well-conditioned inputs; extremely ill-conditioned matrices may require classical SVD/QR fallback or stronger preconditioning.
  • In large parallel preconditioning, optimal performance demands tuning the spectral shift parameter […] to avoid eigenvalue clustering, with typical […] values in […].

CANS is a fully matrix-free, storage-optimal alternative for orthogonalization and preconditioning, controlled by polynomial degree and spectrum adaptation, and is broadly applicable in machine learning and large-scale numerical linear algebra (Grishina et al., 12 Jun 2025, Bergamaschi et al., 2020).
