Newton-Muon: Newton-Based Optimization

Updated 3 July 2026

Newton-Muon is an optimization algorithm that employs a Newton-type quadratic surrogate in matrix-structured parameter spaces for data-adaptive right-preconditioning.
It uses an SVD-free Newton–Schulz polynomial iteration to approximate the matrix-sign operation, avoiding the cost of an explicit SVD.
Empirical results show that Newton-Muon enhances convergence and stability in large-scale neural network training compared to standard Muon.

Newton-Muon is an optimization algorithm that extends the Muon optimizer family by incorporating an explicit Newton-type local quadratic surrogate model in matrix-structured parameter spaces, yielding a step that right-preconditions the update direction using input second-moment information. It applies the matrix-sign function to the product of the gradient with a right inverse of the data covariance, implemented efficiently via an SVD-free Newton–Schulz polynomial iteration. Newton-Muon can be interpreted as a specialization of the general polar-step Muon method with an explicit, data-adaptive right-preconditioner, and is empirically demonstrated to improve both optimization efficiency and training convergence on modern large-scale neural network tasks (Du et al., 1 Apr 2026).

1. Theoretical Foundations and Motivating Surrogate

Newton-Muon originates from a local quadratic model of the empirical risk in matrix coordinates. Specifically, for a layerwise parameter matrix $W\in\mathbb R^{m\times n}$ and a loss $f(W)$, the method minimizes the Taylor-expansion-based surrogate

$J(Q) = -\operatorname{tr}(QG^\top) + \frac{1}{2N}\operatorname{tr}(H Q(Z Z^\top) Q^\top)$

where $G = \nabla_W f(W)$ is the batch gradient, $H\in\mathbb R^{m\times m}$ is an output-space curvature approximation (often treated as identity or isotropic), and $Z\in\mathbb R^{n\times N}$ concatenates the $N$ layer inputs (Du et al., 1 Apr 2026). Under a Kronecker-factored curvature approximation, the unique minimizer emerges as

$Q^* = \Sigma_W^{1/2} \, \mathrm{msgn}(\Sigma_W^{1/2} G (Z Z^\top)^{-1}),$

where $\Sigma_W$ encodes the left-side covariance of the parameter update, and $\mathrm{msgn}(X)$ denotes the orthogonally normalized matrix sign operator (mapping $X = U S V^\top \mapsto U V^\top$). In practical regimes, the isotropic proxy $\Sigma_W \propto I$ is adopted, reducing the update to

$Q^* \propto \mathrm{msgn}(G (Z Z^\top)^{-1}).$

This derivation demonstrates that Newton-Muon is a "right-preconditioned" polar-step optimizer, in contrast to standard Muon which assumes isotropic (identity) input covariance and omits the right-preconditioning (Du et al., 1 Apr 2026).
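The contrast between the two directions can be sketched numerically. The following is a minimal illustration, not the authors' implementation: it computes `msgn` with an exact SVD for clarity, and the function names, the `ridge` regularizer, and the toy dimensions are all assumptions introduced here.

```python
import numpy as np

def msgn(X):
    """Matrix sign via a thin SVD: X = U S V^T  ->  U V^T."""
    U, _, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ Vt

def newton_muon_direction(G, Z, ridge=1e-3):
    """msgn(G (Z Z^T + ridge*I)^{-1}); the ridge term is an assumption added for stability."""
    n = Z.shape[0]
    C = Z @ Z.T + ridge * np.eye(n)
    # C is symmetric, so G C^{-1} = (C^{-1} G^T)^T; solve rather than invert.
    P = np.linalg.solve(C, G.T).T
    return msgn(P)

rng = np.random.default_rng(0)
m, n, N = 5, 4, 64
G = rng.standard_normal((m, n))
# Anisotropic inputs: each input coordinate has a different scale.
Z = np.diag([4.0, 2.0, 1.0, 0.25]) @ rng.standard_normal((n, N))

Q_muon = msgn(G)                        # standard Muon direction
Q_newton = newton_muon_direction(G, Z)  # right-preconditioned direction
```

When `Z @ Z.T` is a multiple of the identity, the two directions coincide, which is the sense in which standard Muon assumes isotropic inputs.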

2. Efficient Implementation: Newton–Schulz-Orthonormalization

Newton-Muon requires computation of $\mathrm{msgn}(G (Z Z^\top)^{-1})$. Direct SVD factorization is prohibitively expensive for large-scale training. Instead, Newton–Schulz polynomial iterations are used to approximate the necessary matrix functions without explicit decompositions (Mehta et al., 29 Sep 2025, Du et al., 1 Apr 2026). In general, the Newton–Schulz map for a symmetric positive definite matrix $X$ is of the form

$X_{k+1} = \tfrac{1}{2} X_k \left(3I - X_k^\top X_k\right),$

and for more stable/rapid convergence, practitioners often use quintic polynomials, e.g.,

$X_{k+1} = a X_k + b X_k X_k^\top X_k + c X_k (X_k^\top X_k)^2,$

with coefficients $a$, $b$, $c$ optimized to minimize the uniform polynomial approximation error over the spectral interval of $X$ (Mehta et al., 29 Sep 2025). For Newton-Muon, this iteration is applied to the preconditioned gradient matrix, leveraging the fact that a handful of such steps suffices to flatten the spectrum in practice.
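As a rough illustration of SVD-free orthogonalization, the sketch below uses the classical cubic Newton–Schulz iteration $Y \leftarrow \tfrac{3}{2}Y - \tfrac{1}{2} Y Y^\top Y$ rather than a tuned quintic, since the tuned coefficients are not reproduced here. The function name, the Frobenius-norm scaling, and the step count are assumptions chosen for this toy example.

```python
import numpy as np

def newton_schulz_msgn(X, steps=30):
    """Approximate msgn(X) with the cubic Newton-Schulz iteration (no SVD inside)."""
    # Scale so every singular value is at most 1 (Frobenius norm >= spectral norm).
    Y = X / np.linalg.norm(X)
    for _ in range(steps):
        # Each step pushes every nonzero singular value of Y toward 1.
        Y = 1.5 * Y - 0.5 * Y @ (Y.T @ Y)
    return Y

rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((6, 4)))   # orthonormal columns
V, _ = np.linalg.qr(rng.standard_normal((4, 4)))   # orthogonal matrix
X = U @ np.diag([3.0, 2.0, 1.0, 0.5]) @ V.T        # known singular values

approx = newton_schulz_msgn(X)
exact = U @ V.T                                     # msgn(X) by construction
```

The loop uses only matrix multiplications, which is why this family of iterations maps well onto accelerator hardware. The step count of 30 is deliberately generous for a small, well-conditioned test matrix.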

3. Comparison to Muon and Spectral Normalization Perspective

Standard Muon applies the normalized-matrix-sign step $\mathrm{msgn}(G)$, implicitly assuming isotropic input activation statistics. Newton-Muon, by contrast, explicitly incorporates the right-preconditioning $(Z Z^\top)^{-1}$ (with $Z Z^\top$ typically estimated as a running or blockwise moment of activations) and computes $\mathrm{msgn}(G (Z Z^\top)^{-1})$ (Du et al., 1 Apr 2026). This aligns the update directions to the input data geometry, providing orthogonal equivariance under joint right-rotations of the weights and inputs, which standard Muon lacks if $Z Z^\top$ is far from the identity (Du et al., 1 Apr 2026).

More generally, Newton-Muon can be regarded as a member of a parametric spectral-normalization family $U S V^\top \mapsto U S^{p} V^\top$ ($0 \le p \le 1$), with $p = 0$ (Muon), $p = 1$ (SGD/Adam), and fractional $p$ providing intermediate degrees of spectral compression (Qi et al., 4 Feb 2026). Theoretical and empirical analyses reveal that full flattening ($p = 0$) is not always optimal, but for raw first-moment inputs, it provides strong stabilization, while for RMS-normalized (Adam-type) updates, mild compression often suffices (Qi et al., 4 Feb 2026).
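The spectral-normalization family can be sketched by raising the singular values of the update to a power. In this illustration the exponent name `p`, the function name `spectral_power`, and the use of an exact SVD are assumptions: `p = 0` flattens the spectrum completely (the Muon-style matrix sign) and `p = 1` leaves the matrix unchanged.

```python
import numpy as np

def spectral_power(G, p):
    """Map G = U S V^T to U S^p V^T (p=0: full flattening, p=1: identity map)."""
    U, S, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ np.diag(S ** p) @ Vt

rng = np.random.default_rng(0)
G = rng.standard_normal((5, 3))

flat = spectral_power(G, 0.0)   # all singular values become 1
half = spectral_power(G, 0.5)   # partial compression: singular values become sqrt(s)
same = spectral_power(G, 1.0)   # reproduces G up to rounding
```

Intermediate exponents shrink the ratio between the largest and smallest singular values without removing it, which is one way to read "mild compression".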

4. Convergence, Practical Tuning, and Theoretical Guarantees

Newton-Muon inherits nonconvex convergence guarantees from the Muon family when implemented with a finite number of Newton–Schulz steps. The convergence rate to a stationary point matches the SVD-polar idealization up to a constant factor that converges doubly exponentially to $1$ in the number of Newton–Schulz steps and the polynomial degree (Kim et al., 27 Jan 2026, Shulgin et al., 22 Oct 2025). In practical terms, a small number of steps with a low-degree polynomial yields negligible overhead compared to the ideal case, along with a wall-clock speedup versus SVD-based orthogonalization.
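The rapid decay in the number of steps can be observed on the polynomial iteration itself. The sketch below tracks the orthogonality error $\|Y^\top Y - I\|_F$ of the classical cubic Newton–Schulz iteration on a small matrix with prescribed singular values. It illustrates only the error of the iteration, not the optimizer-level constant from the cited analyses, and the function name and test matrix are assumptions.

```python
import numpy as np

def orthogonality_errors(X, steps):
    """Frobenius norm of Y^T Y - I before and after each cubic Newton-Schulz step."""
    I = np.eye(X.shape[1])
    Y = X.copy()
    errors = [np.linalg.norm(Y.T @ Y - I)]
    for _ in range(steps):
        Y = 1.5 * Y - 0.5 * Y @ (Y.T @ Y)
        errors.append(np.linalg.norm(Y.T @ Y - I))
    return errors

rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((6, 4)))
V, _ = np.linalg.qr(rng.standard_normal((4, 4)))
X = U @ np.diag([0.9, 0.7, 0.5, 0.3]) @ V.T   # singular values already inside (0, 1]

errors = orthogonality_errors(X, steps=8)
for k, e in enumerate(errors):
    print(f"step {k}: error {e:.3e}")
```

Once the singular values are close to 1, each step roughly squares the remaining error, so the printed values fall slowly at first and then collapse within a few steps.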

Recent analysis further demonstrates that as right-preconditioning (through $(Z Z^\top)^{-1}$) becomes more accurate, the effective descent direction is better conditioned with respect to the layer's input geometry. Empirically, Newton-Muon reaches the same target validation loss as standard Muon in fewer iteration steps and less wall-clock time on GPT-2 pretraining setups, with only a modest per-step computational overhead from Cholesky inversion and Newton–Schulz polynomial evaluation (Du et al., 1 Apr 2026).

However, estimating the preconditioning matrix $Z Z^\top$ carries its own tradeoffs, requiring periodic batchwise updates, stability regularization (e.g., ridge penalties), and blockwise or low-rank approximations to avoid excessive computational cost in very wide layers (Du et al., 1 Apr 2026).
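One way these pieces might fit together is sketched below: an exponential moving average of the batch second moment, a ridge term, and a Cholesky-based solve in place of an explicit inverse. The function names, the decay `beta`, the ridge value, and the `1/N` normalization are assumptions for illustration. The normalization only rescales the preconditioned gradient, which the matrix sign ignores.

```python
import numpy as np

def update_second_moment(C, Z, beta=0.9):
    """Exponential moving average of the batch second moment Z Z^T / N."""
    N = Z.shape[1]
    return beta * C + (1.0 - beta) * (Z @ Z.T) / N

def right_precondition(G, C, ridge=1e-4):
    """Return G (C + ridge*I)^{-1} using a Cholesky factor instead of an explicit inverse."""
    n = C.shape[0]
    L = np.linalg.cholesky(C + ridge * np.eye(n))   # C + ridge*I = L L^T
    # Solve (L L^T) X^T = G^T in two steps, then transpose back.
    # np.linalg.solve is a general solver; a triangular solver would be cheaper.
    Y = np.linalg.solve(L, G.T)
    return np.linalg.solve(L.T, Y).T

rng = np.random.default_rng(0)
m, n, N = 3, 4, 32
C = np.zeros((n, n))
for _ in range(5):                       # a few synthetic activation batches
    C = update_second_moment(C, rng.standard_normal((n, N)))

G = rng.standard_normal((m, n))
P = right_precondition(G, C)             # preconditioned gradient, input to msgn
```

Refreshing `C` only every few steps and reusing the Cholesky factor in between is one plausible way to amortize the cost, though the schedule is a tuning choice not specified here.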

5. Practical Variants, Extensions, and Empirical Findings

Block-diagonal and low-rank variants of Newton-Muon have been shown to suffice in practice, particularly for multi-branch layers or large MLP contractions (Du et al., 1 Apr 2026). Efficient polynomial inversion routines (e.g., via specialized batched GEMM kernels or operator-specific kernels) are commonly employed. Newton-Muon has been deployed in hybrid schemes, e.g., applying AdamW on small parameters and Newton-Muon on hidden-layer weights, yielding improved convergence and generalization on tasks such as CIFAR-10 and large language modeling.
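A block-diagonal variant can be sketched by preconditioning each group of input coordinates with only its own diagonal block of the second-moment matrix. The function name, block size, and ridge value below are hypothetical. The example checks the special case where the full matrix is exactly block-diagonal, so the blockwise and full solves agree.

```python
import numpy as np

def blockwise_right_precondition(G, C, block_size, ridge=1e-4):
    """Apply the inverse of the block-diagonal part of C to the columns of G."""
    n = C.shape[0]
    out = np.empty_like(G)
    for start in range(0, n, block_size):
        stop = min(start + block_size, n)
        Cb = C[start:stop, start:stop] + ridge * np.eye(stop - start)
        # Cb is symmetric, so G_b Cb^{-1} = (Cb^{-1} G_b^T)^T.
        out[:, start:stop] = np.linalg.solve(Cb, G[:, start:stop].T).T
    return out

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 8))
B = rng.standard_normal((2, 8))
C = np.zeros((4, 4))
C[:2, :2] = A @ A.T          # first input group
C[2:, 2:] = B @ B.T          # second input group, uncorrelated with the first
G = rng.standard_normal((3, 4))

blockwise = blockwise_right_precondition(G, C, block_size=2)
full = np.linalg.solve(C + 1e-4 * np.eye(4), G.T).T
```

For a general `C` the blockwise result only approximates the full solve, trading accuracy for a cost that grows with the block size rather than the full layer width.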

Notably, the empirical results indicate that the inclusion of right-preconditioning systematically improves both iteration efficiency and stability, particularly in the presence of pronounced input anisotropy, a regime common to realistic deep architectures (Du et al., 1 Apr 2026). Fixed or blockwise $Z Z^\top$ updates, as well as regularized Cholesky/symmetrized inverses, are effective in practice.

6. Limitations and Open Problems

A crucial limitation of current Newton-Muon implementations is the reliance on the isotropic-weight approximation for the left covariance $\Sigma_W$, as unbiased and practical alternatives remain an open research challenge (Du et al., 1 Apr 2026). The Kronecker-factored surrogate for the Hessian omits off-token coupling in transformers; more accurate yet efficient surrogates may further improve second-order adaptation. For extremely large-scale distributed scenarios, the estimation and communication of $Z Z^\top$ (activation second moments) can be a bottleneck, motivating work on randomized or structured sketches for distributed preconditioning (Du et al., 1 Apr 2026).

Newton-Muon thus represents a principled, theoretically sound, and empirically effective extension of polar-step geometry-aware optimization. Its key distinguishing feature is explicit adaptation to data geometry via input-moment right-preconditioning, efficiently realized through SVD-free Newton–Schulz polynomial iteration and carefully regulated batchwise updates (Du et al., 1 Apr 2026).

References (5)

The Newton-Muon Optimizer (2026)

Muon: Training and Trade-offs with Latent Attention and MoE (2025)

Delving into Muon and Beyond: Deep Analysis and Extensions (2026)

Convergence of Muon with Newton-Schulz (2026)

Beyond the Ideal: Analyzing the Inexact Muon Update (2025)
