
Newton-Muon: Newton-Based Optimization

Updated 3 July 2026
  • Newton-Muon is an optimization algorithm that employs a Newton-type quadratic surrogate in matrix-structured parameter spaces for data-adaptive right-preconditioning.
  • It uses an SVD-free Newton–Schulz polynomial iteration to approximate the matrix-sign operation, avoiding the cost of an explicit decomposition.
  • Empirical results show that Newton-Muon enhances convergence and stability in large-scale neural network training compared to standard Muon.

Newton-Muon is an optimization algorithm that extends the Muon optimizer family by incorporating an explicit Newton-type local quadratic surrogate model in matrix-structured parameter spaces, yielding a step that right-preconditions the update direction using input second-moment information. It applies the matrix-sign function to the gradient right-multiplied by the inverse of the data covariance, implemented efficiently via an SVD-free Newton–Schulz polynomial iteration. Newton-Muon can be interpreted as a specialization of the general polar-step Muon method with an explicit, data-adaptive right-preconditioner, and is empirically demonstrated to improve both optimization efficiency and training convergence on modern large-scale neural network tasks (Du et al., 1 Apr 2026).

1. Theoretical Foundations and Motivating Surrogate

Newton-Muon originates from a local quadratic model of the empirical risk in matrix coordinates. Specifically, for a layerwise parameter matrix $W\in\mathbb R^{m\times n}$ and a loss $f(W)$, one minimizes the Taylor-expansion-based surrogate

$$J(Q) = -\operatorname{tr}(QG^\top) + \frac{1}{2N}\operatorname{tr}\left(H Q (Z Z^\top) Q^\top\right),$$

where $G = \nabla_W f(W)$ is the batch gradient, $H\in\mathbb R^{m\times m}$ is an output-space curvature approximation (often treated as identity or isotropic), and $Z\in\mathbb R^{n\times N}$ concatenates the layer inputs (Du et al., 1 Apr 2026). Under a Kronecker-factored curvature approximation, the unique minimizer emerges as

$$Q^* = \Sigma_W^{1/2} \, \mathrm{msgn}\left(\Sigma_W^{1/2} G (Z Z^\top)^{-1}\right),$$

where $\Sigma_W$ encodes the left-side covariance of the parameter update, and $\mathrm{msgn}(X)$ denotes the orthogonally normalized matrix sign operator (mapping $X = U S V^\top \mapsto U V^\top$). In practical regimes, the isotropic proxy $\Sigma_W \propto I$ is adopted, reducing the update to

$$Q^* \propto \mathrm{msgn}\left(G (Z Z^\top)^{-1}\right).$$

This derivation shows that Newton-Muon is a "right-preconditioned" polar-step optimizer, in contrast to standard Muon, which assumes isotropic (identity) input covariance and omits the right-preconditioning (Du et al., 1 Apr 2026).
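As a reference point for the update described above, the following minimal sketch computes the direction with an explicit SVD rather than the SVD-free iteration used in practice. The function names, the `ridge` damping term, and the toy shapes are assumptions for illustration, not values from Du et al.

```python
import numpy as np

def msgn(X):
    """Polar factor of X: X = U S V^T  ->  U V^T (thin SVD)."""
    U, _, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ Vt

def newton_muon_direction(G, Z, ridge=1e-3):
    """Reference direction msgn(G (Z Z^T / N + ridge * I)^{-1}).

    G: (m, n) gradient, Z: (n, N) stacked layer inputs.
    `ridge` is a hypothetical damping value chosen for this sketch.
    """
    n, N = Z.shape
    C = Z @ Z.T / N + ridge * np.eye(n)   # damped input second moment
    # C is symmetric, so G C^{-1} = (C^{-1} G^T)^T: solve instead of inverting.
    P = np.linalg.solve(C, G.T).T
    return msgn(P)

rng = np.random.default_rng(0)
m, n, N = 4, 6, 64
G = rng.standard_normal((m, n))
# Anisotropic inputs: each feature gets a different scale.
Z = rng.standard_normal((n, N)) * np.linspace(0.2, 3.0, n)[:, None]
Q = newton_muon_direction(G, Z)
```

Dividing the second moment by $N$ only rescales the argument of msgn by a positive constant, so it does not change the resulting direction; it simply keeps the ridge term on a comparable scale.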

2. Efficient Implementation: Newton–Schulz-Orthonormalization

Newton-Muon requires computation of $\mathrm{msgn}\left(G (Z Z^\top)^{-1}\right)$. Direct SVD factorization is prohibitively expensive for large-scale training. Instead, Newton–Schulz polynomial iterations are used to approximate the necessary matrix functions without explicit decompositions (Mehta et al., 29 Sep 2025, Du et al., 1 Apr 2026). In general, the Newton–Schulz map for a symmetric positive definite matrix $X$ is of the form

$$X \mapsto \tfrac{3}{2}X - \tfrac{1}{2}X^{3},$$

and for more stable or rapid convergence, practitioners often use quintic polynomials, e.g.,

$$X \mapsto aX + bX^{3} + cX^{5},$$

with coefficients $a$, $b$, $c$ optimized to minimize the uniform polynomial approximation error over the spectral interval of $X$ (Mehta et al., 29 Sep 2025). For Newton-Muon, this iteration is applied to the preconditioned gradient matrix, leveraging the fact that a handful of such steps suffices to flatten the spectrum in practice.
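To make the SVD-free idea concrete, here is a minimal sketch of the classical cubic Newton–Schulz iteration on a rectangular matrix, checked against the exact polar factor. It assumes Frobenius-norm scaling and the textbook coefficients 3/2 and -1/2. The tuned quintic coefficients mentioned in the text are not reproduced here, so this variant needs more steps than a tuned quintic would to reach a tight tolerance.

```python
import numpy as np

def newton_schulz_msgn(X, steps=25):
    """Approximate msgn(X) with the classical cubic Newton-Schulz iteration."""
    # Frobenius norm >= spectral norm, so all singular values land in (0, 1].
    Y = X / np.linalg.norm(X)
    for _ in range(steps):
        # Each singular value s is mapped to 1.5*s - 0.5*s**3, which pushes it toward 1.
        Y = 1.5 * Y - 0.5 * (Y @ Y.T) @ Y
    return Y

# Build a test matrix with known singular vectors, so the exact answer is U V^T.
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((4, 4)))
V, _ = np.linalg.qr(rng.standard_normal((6, 4)))
s = np.array([1.0, 0.5, 0.2, 0.05])
X = U @ np.diag(s) @ V.T
exact = U @ V.T

err_few = np.linalg.norm(newton_schulz_msgn(X, steps=5) - exact)
err_many = np.linalg.norm(newton_schulz_msgn(X, steps=25) - exact)
```

Only matrix multiplications are involved, which is why this style of iteration maps well onto accelerator hardware. Small singular values grow by roughly a factor of 1.5 per step until they approach 1, so the number of steps needed depends on how ill-conditioned the input is.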

3. Comparison to Muon and Spectral Normalization Perspective

Standard Muon applies the normalized matrix-sign step $\mathrm{msgn}(G)$, implicitly assuming isotropic input activation statistics. Newton-Muon, by contrast, explicitly incorporates the right-preconditioner $(Z Z^\top)^{-1}$, with $Z Z^\top$ typically estimated as a running or blockwise moment of activations, and computes $\mathrm{msgn}\left(G (Z Z^\top)^{-1}\right)$ (Du et al., 1 Apr 2026). This aligns the update directions with the input data geometry, providing orthogonal equivariance under joint right-rotations, which standard Muon lacks if $Z Z^\top$ is far from the identity (Du et al., 1 Apr 2026).
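The relationship between the two steps can be checked numerically. The sketch below, on toy data with hypothetical helper names, shows that the right-preconditioned direction coincides with the plain Muon direction when the inputs are exactly whitened and departs from it when the input second moment is anisotropic.

```python
import numpy as np

def msgn(X):
    """Polar factor of X via thin SVD."""
    U, _, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ Vt

def right_preconditioned(G, Z):
    """Return G (Z Z^T / N)^{-1} using a linear solve (C is symmetric)."""
    C = Z @ Z.T / Z.shape[1]
    return np.linalg.solve(C, G.T).T

rng = np.random.default_rng(1)
m, n, N = 3, 5, 40
G = rng.standard_normal((m, n))

# Whitened inputs: orthonormal columns scaled so that Z Z^T / N = I exactly.
Qn, _ = np.linalg.qr(rng.standard_normal((N, n)))
Z_white = np.sqrt(N) * Qn.T
# Anisotropic inputs: per-feature scales between 0.2 and 3.
Z_aniso = np.linspace(0.2, 3.0, n)[:, None] * Z_white

gap_white = np.linalg.norm(msgn(right_preconditioned(G, Z_white)) - msgn(G))
gap_aniso = np.linalg.norm(msgn(right_preconditioned(G, Z_aniso)) - msgn(G))
```

Here `gap_white` is at the level of floating-point round-off, while `gap_aniso` is not, which is the sense in which standard Muon implicitly assumes isotropic inputs.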

More generally, Newton-Muon can be regarded as a member of a parametric spectral-normalization family whose parameter controls how strongly the singular-value spectrum of the update is compressed: one endpoint gives full flattening (Muon), the other corresponds to SGD/Adam-type updates, and fractional values provide intermediate degrees of spectral compression (Qi et al., 4 Feb 2026). Theoretical and empirical analyses reveal that full flattening is not always optimal: for raw first-moment inputs it provides strong stabilization, while for RMS-normalized (Adam-type) updates mild compression often suffices (Qi et al., 4 Feb 2026).
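One simple way to picture such a family is to raise the singular values to a power. The exponent parametrization below is an assumption made for illustration and may differ from the one used by Qi et al.; in this sketch `p = 0` gives the fully flattened polar factor and `p = 1` returns the input unchanged.

```python
import numpy as np

def spectral_power(G, p):
    """Map G = U diag(s) V^T to U diag(s**p) V^T (hypothetical parametrization)."""
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    return (U * s**p) @ Vt   # broadcasting scales the columns of U by s**p

rng = np.random.default_rng(2)
G = rng.standard_normal((4, 6))

flat = spectral_power(G, 0.0)   # all singular values equal to 1
half = spectral_power(G, 0.5)   # condition number is the square root of G's
raw = spectral_power(G, 1.0)    # identical to G up to round-off

cond_G = np.linalg.cond(G)
cond_half = np.linalg.cond(half)
cond_flat = np.linalg.cond(flat)
```

Intermediate exponents shrink the ratio between the largest and smallest singular values without removing it entirely, which is one reading of "mild compression".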

4. Convergence, Practical Tuning, and Theoretical Guarantees

Newton-Muon inherits nonconvex convergence guarantees from the Muon family when implemented with a finite number of Newton–Schulz steps. The convergence rate to a stationary point matches that of the idealized SVD-based polar step up to a constant factor, and this factor converges doubly exponentially to its ideal value in the number of Newton–Schulz steps and the polynomial degree (Kim et al., 27 Jan 2026, Shulgin et al., 22 Oct 2025). In practical terms, a small number of steps with a low-degree polynomial incurs a negligible penalty relative to the ideal case, while running faster in wall-clock terms than SVD-based orthogonalization.
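The doubly exponential behaviour can be seen on a single singular value. The sketch below assumes the classical cubic map $\sigma \mapsto \tfrac{3}{2}\sigma - \tfrac{1}{2}\sigma^3$ (not the tuned polynomial or the exact constant from the cited analyses), for which the error $e = 1 - \sigma$ satisfies $e_{k+1} = \tfrac{1}{2} e_k^2 (2 + \sigma_k) \le \tfrac{3}{2} e_k^2$.

```python
def newton_schulz_scalar_errors(sigma, steps):
    """Track 1 - sigma under the cubic map sigma -> 1.5*sigma - 0.5*sigma**3."""
    errors = []
    for _ in range(steps):
        sigma = 1.5 * sigma - 0.5 * sigma**3
        errors.append(1.0 - sigma)
    return errors

errors = newton_schulz_scalar_errors(0.5, 5)
# Quadratic contraction: each error is at most 1.5 times the previous one squared.
contracts = all(b <= 1.5 * a * a + 1e-15 for a, b in zip(errors, errors[1:]))
```

Because the error is roughly squared at every step once it is small, the number of correct digits about doubles per iteration, which is why a few steps are enough after the spectrum has been brought near 1.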

Recent analysis further demonstrates that as the right-preconditioning (through $(Z Z^\top)^{-1}$) becomes more accurate, the effective descent direction is better conditioned with respect to the layer's input geometry. Empirically, Newton-Muon reaches the same target validation loss as standard Muon in fewer iteration steps and less wall-clock time on GPT-2 pretraining setups, with only a modest per-step computational overhead from Cholesky inversion and Newton–Schulz polynomial evaluation (Du et al., 1 Apr 2026).

However, estimating the preconditioning matrix $(Z Z^\top)^{-1}$ carries its own tradeoffs: it requires periodic batchwise updates, stability regularization (e.g., ridge penalties), and blockwise or low-rank approximations to avoid excessive computational cost in very wide layers (Du et al., 1 Apr 2026).
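A minimal sketch of these ingredients follows: an exponential moving average of the batch second moment, a ridge term scaled by the mean diagonal, and a Cholesky-based right-solve. The decay rate, ridge scale, and function names are hypothetical choices for illustration, not settings from Du et al.

```python
import numpy as np

def update_moment(C, Z, decay=0.9):
    """Exponential moving average of the batch second moment Z Z^T / N."""
    return decay * C + (1.0 - decay) * (Z @ Z.T) / Z.shape[1]

def damped(C, ridge=1e-2):
    """Add a ridge term proportional to the mean diagonal entry of C."""
    n = C.shape[0]
    return C + ridge * (np.trace(C) / n) * np.eye(n)

def right_precondition(G, C, ridge=1e-2):
    """Return G A^{-1} with A = damped(C), using A = L L^T."""
    L = np.linalg.cholesky(damped(C, ridge))
    # np.linalg.solve does not exploit triangularity; a dedicated
    # triangular solver would be the natural choice in production code.
    Y = np.linalg.solve(L, G.T)      # L Y = G^T
    X = np.linalg.solve(L.T, Y)      # L^T X = Y, so X = A^{-1} G^T
    return X.T

rng = np.random.default_rng(3)
n, m, N = 8, 4, 32
C = np.eye(n)
for _ in range(10):                  # ten simulated batches of layer inputs
    C = update_moment(C, rng.standard_normal((n, N)))
G = rng.standard_normal((m, n))
P = right_precondition(G, C)
```

The ridge term keeps the factorization well defined even when a batch is rank-deficient, at the price of biasing the preconditioner toward the identity.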

5. Practical Variants, Extensions, and Empirical Findings

Block-diagonal and low-rank variants of Newton-Muon have been shown to suffice in practice, particularly for multi-branch layers or large MLP contractions (Du et al., 1 Apr 2026). Efficient polynomial inversion routines (e.g., via specialized batched GEMM kernels or operator-specific kernels) are commonly employed. Newton-Muon has been deployed in hybrid schemes, e.g., applying AdamW on small parameters and Newton-Muon on hidden-layer weights, yielding improved convergence and generalization on tasks such as CIFAR-10 and large language modeling.
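A block-diagonal variant can be sketched as follows, under the assumption that the input features are split into contiguous blocks and cross-block second moments are ignored. The block size, ridge value, and function name are hypothetical.

```python
import numpy as np

def blockwise_right_precondition(G, Z, block_size, ridge=1e-3):
    """Apply G C_b^{-1} block by block, ignoring cross-block input correlations."""
    n, N = Z.shape
    out = np.empty_like(G)
    for start in range(0, n, block_size):
        sl = slice(start, min(start + block_size, n))
        Zb = Z[sl]                                        # inputs for this block
        Cb = Zb @ Zb.T / N + ridge * np.eye(Zb.shape[0])  # damped block moment
        out[:, sl] = np.linalg.solve(Cb, G[:, sl].T).T    # G_b C_b^{-1}
    return out

rng = np.random.default_rng(4)
m, n, N, block = 3, 6, 50, 2
G = rng.standard_normal((m, n))
Z = rng.standard_normal((n, N))
P_block = blockwise_right_precondition(G, Z, block)

# Dense reference: zero out the off-block entries of the full damped moment.
C_full = Z @ Z.T / N + 1e-3 * np.eye(n)
mask = np.kron(np.eye(n // block), np.ones((block, block)))
P_ref = np.linalg.solve(C_full * mask, G.T).T
```

Each block costs a solve of size `block_size` instead of `n`, so the cubic cost of the full factorization is replaced by a sum of much smaller ones.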

Notably, the empirical results indicate that the inclusion of right-preconditioning systematically improves both iteration efficiency and stability, particularly in the presence of pronounced input anisotropy, a regime common to realistic deep architectures (Du et al., 1 Apr 2026). Fixed or blockwise updates of the input second-moment estimate, as well as regularized Cholesky/symmetrized inverses, are effective in practice.

6. Limitations and Open Problems

A crucial limitation of current Newton-Muon implementations is the reliance on the isotropic-weight approximation for the left covariance $\Sigma_W$, as unbiased and practical alternatives remain an open research challenge (Du et al., 1 Apr 2026). The Kronecker-factored surrogate for the Hessian omits off-token coupling in transformers; more accurate yet efficient surrogates may further improve second-order adaptation. For extremely large-scale distributed scenarios, the estimation and communication of $Z Z^\top$ (activation second moments) can be a bottleneck, motivating work on randomized or structured sketches for distributed preconditioning (Du et al., 1 Apr 2026).

Newton-Muon thus represents a principled, theoretically sound, and empirically effective extension of polar-step geometry-aware optimization. Its key distinguishing feature is explicit adaptation to data geometry via input-moment right-preconditioning, efficiently realized through SVD-free Newton–Schulz polynomial iteration and carefully regulated batchwise updates (Du et al., 1 Apr 2026).
