Newton-Muon: Newton-Based Optimization
- Newton-Muon is an optimization algorithm that employs a Newton-type quadratic surrogate in matrix-structured parameter spaces for data-adaptive right-preconditioning.
- It utilizes an SVD-free Newton–Schulz polynomial iteration to approximate the matrix-sign operation, dramatically reducing computational overhead.
- Empirical results show that Newton-Muon enhances convergence and stability in large-scale neural network training compared to standard Muon.
Newton-Muon is an optimization algorithm that extends the Muon optimizer family by incorporating an explicit Newton-type local quadratic surrogate model in matrix-structured parameter spaces. The resulting step right-preconditions the update direction using input second-moment information. It applies the matrix-sign function to the product of the gradient with the inverse of the input second-moment (data covariance) matrix, implemented efficiently via an SVD-free Newton–Schulz polynomial iteration. Newton-Muon can be interpreted as a specialization of the general polar-step Muon method with an explicit, data-adaptive right-preconditioner, and is empirically demonstrated to improve both optimization efficiency and training convergence on modern large-scale neural network tasks (Du et al., 1 Apr 2026).
1. Theoretical Foundations and Motivating Surrogate
Newton-Muon originates from a local quadratic model of the empirical risk in matrix coordinates. Specifically, for a layerwise parameter matrix $W$ and a loss $\mathcal{L}$, a Taylor-expansion-based quadratic surrogate in the update is minimized. Its ingredients are the batch gradient $G = \nabla_W \mathcal{L}$, an output-space curvature approximation (often treated as identity or isotropic), and the matrix $Z$ that concatenates the layer inputs (Du et al., 1 Apr 2026). Under a Kronecker-factored curvature approximation, the unique minimizer is expressed through a factor that encodes the left-side covariance of the parameter update and through the orthogonally normalized matrix-sign operator $\mathrm{msign}(\cdot)$ (mapping a matrix with SVD $U \Sigma V^\top$ to $U V^\top$). In practical regimes, an isotropic proxy is adopted for the left-side factor, reducing the update to
$$W \leftarrow W - \eta \, \mathrm{msign}\big(G (Z Z^\top)^{-1}\big),$$
where $\eta$ denotes the learning rate.
This derivation demonstrates that Newton-Muon is a "right-preconditioned" polar-step optimizer, in contrast to standard Muon which assumes isotropic (identity) input covariance and omits the right-preconditioning (Du et al., 1 Apr 2026).
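The right-preconditioned polar step described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the layout of `Z` (inputs stored as columns), the batch averaging, and the small `ridge` term are all choices made here, and the matrix sign is computed with an exact SVD purely for reference.

```python
import numpy as np

def msign(X):
    """Orthogonal polar factor U V^T from the reduced SVD X = U S V^T."""
    U, _, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ Vt

def newton_muon_direction(G, Z, ridge=1e-6):
    """Right-precondition G by the inverse input second moment, then apply msign."""
    d_in, n = Z.shape
    # Batch-averaged input second moment plus a small ridge (an assumption here).
    S = Z @ Z.T / n + ridge * np.eye(d_in)
    # G @ inv(S) via a linear solve; S is symmetric, so solve S X^T = G^T.
    P = np.linalg.solve(S, G.T).T
    return msign(P)

rng = np.random.default_rng(0)
d_out, d_in, n = 4, 6, 64
# Anisotropic inputs: each input coordinate has a different scale.
scales = np.array([5.0, 3.0, 1.0, 0.5, 0.2, 0.1])
Z = rng.standard_normal((d_in, n)) * scales[:, None]
dY = rng.standard_normal((d_out, n))  # stand-in for the output-side gradient
G = dY @ Z.T / n                      # gradient of a linear layer Y = W Z

muon_dir = msign(G)                           # standard Muon direction
newton_muon_dir = newton_muon_direction(G, Z)  # right-preconditioned direction
```

Because the matrix sign is invariant to positive rescaling of its argument, averaging versus summing over the batch only changes the relative weight of the ridge term. When the input second moment is a multiple of the identity, the two directions coincide.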
2. Efficient Implementation: Newton–Schulz Orthonormalization
Newton-Muon requires computing the matrix sign of the right-preconditioned gradient, $\mathrm{msign}\big(G (Z Z^\top)^{-1}\big)$, where $G$ is the gradient and $Z$ is the layer-input matrix. Direct SVD factorization is prohibitively expensive for large-scale training. Instead, Newton–Schulz polynomial iterations are used to approximate the necessary matrix functions without explicit decompositions (Mehta et al., 29 Sep 2025, Du et al., 1 Apr 2026). In general, the classical Newton–Schulz map for a symmetric matrix $X$ is of the form
$$X_{k+1} = \tfrac{1}{2} X_k \big(3I - X_k^2\big),$$
and for more stable or more rapid convergence, practitioners often use quintic polynomials, e.g.,
$$X_{k+1} = a X_k + b X_k^3 + c X_k^5,$$
with coefficients $(a, b, c)$ optimized to minimize the uniform polynomial approximation error over the spectral interval of the iterate (Mehta et al., 29 Sep 2025). For rectangular matrices the odd powers are formed as $(X_k X_k^\top) X_k$ and $(X_k X_k^\top)^2 X_k$. For Newton-Muon, this iteration is applied to the preconditioned gradient matrix, since a handful of such steps suffices to flatten the spectrum in practice.
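As a rough illustration of the SVD-free iteration, the sketch below applies the classical cubic Newton–Schulz map (coefficients 1.5 and -0.5) to a rectangular matrix and compares the result with the exact polar factor. The cubic coefficients are a stand-in chosen here; the tuned quintic coefficients discussed in the text are not reproduced. The function name, step count, and test matrix are likewise assumptions for the example.

```python
import numpy as np

def newton_schulz_msign(X, steps=20):
    """Approximate U V^T for X = U S V^T with the classical cubic Newton-Schulz map."""
    # The Frobenius norm bounds the spectral norm, so after scaling every
    # singular value lies in (0, 1], inside the convergence region of the map.
    Y = X / np.linalg.norm(X)
    for _ in range(steps):
        # Acts on each singular value as s -> 1.5 s - 0.5 s^3 (fixed point s = 1).
        Y = 1.5 * Y - 0.5 * (Y @ Y.T) @ Y
    return Y

rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((4, 4)))
V, _ = np.linalg.qr(rng.standard_normal((6, 4)))  # 6 x 4, orthonormal columns
X = U @ np.diag([3.0, 2.0, 1.0, 0.5]) @ V.T        # 4 x 6 with known singular values

approx = newton_schulz_msign(X)
exact = U @ V.T                                    # exact polar factor of X
err = np.max(np.abs(approx - exact))
```

Each step uses only matrix multiplications, which is what makes the approach attractive on accelerators. The cubic map converges slowly for very small singular values, which is one motivation for the tuned higher-degree polynomials mentioned above.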
3. Comparison to Muon and Spectral Normalization Perspective
Standard Muon applies the normalized matrix-sign step $\mathrm{msign}(G)$ to the gradient $G$, implicitly assuming isotropic input activation statistics. Newton-Muon, by contrast, explicitly incorporates the right-preconditioner $(Z Z^\top)^{-1}$, where the input second moment $Z Z^\top$ is typically estimated as a running or blockwise moment of activations, and computes $\mathrm{msign}\big(G (Z Z^\top)^{-1}\big)$ (Du et al., 1 Apr 2026). This aligns the update directions with the input data geometry and provides orthogonal equivariance under joint right-rotations, which standard Muon lacks if the input second moment is far from the identity (Du et al., 1 Apr 2026).
More generally, Newton-Muon can be regarded as a member of a parametric spectral-normalization family indexed by a spectral-compression parameter. One endpoint of the family recovers Muon (full flattening of the singular values), the other recovers SGD/Adam-type updates (no compression), and fractional values provide intermediate degrees of spectral compression (Qi et al., 4 Feb 2026). Theoretical and empirical analyses reveal that full flattening is not always optimal: for raw first-moment inputs it provides strong stabilization, while for RMS-normalized (Adam-type) updates mild compression often suffices (Qi et al., 4 Feb 2026).
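One way to picture such a family is to raise the singular values of the update to a power between 0 and 1. The sketch below assumes that parameterization, $U \Sigma^{p} V^\top$ with $p \in [0, 1]$; this form and the names `spectral_power` and `p` are assumptions made for illustration, and the cited work may define the family differently.

```python
import numpy as np

def spectral_power(M, p):
    """Return U diag(s**p) V^T for M = U diag(s) V^T (assumed form of the family)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    # Broadcasting scales column j of U by s[j]**p.
    return (U * s**p) @ Vt

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 6))

flat = spectral_power(M, 0.0)  # every singular value becomes 1 (Muon-like step)
raw = spectral_power(M, 1.0)   # reproduces M (no spectral compression)
half = spectral_power(M, 0.5)  # singular values replaced by their square roots
```

Under this assumed form, moving `p` from 1 toward 0 progressively shrinks the ratio between the largest and smallest singular values of the update.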
4. Convergence, Practical Tuning, and Theoretical Guarantees
Newton-Muon inherits nonconvex convergence guarantees from the Muon family when implemented with a finite number of Newton–Schulz steps. The convergence rate to a stationary point matches that of the idealized SVD-based polar step up to a constant factor, and this factor approaches its ideal value doubly exponentially in the number of Newton–Schulz steps, at a rate that also depends on the polynomial degree (Kim et al., 27 Jan 2026, Shulgin et al., 22 Oct 2025). In practical terms, modest step counts and polynomial degrees incur a negligible loss relative to the ideal case while delivering a wall-clock speedup over SVD-based orthogonalization.
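The doubly exponential behaviour can be seen on a single singular value. The sketch below tracks the distance to 1 under the classical cubic map $s \mapsto 1.5s - 0.5s^3$, used here only as an illustrative stand-in (the cited analyses cover general step counts and degrees). Writing $s = 1 - e$ gives a new error of $1.5e^2 - 0.5e^3$, so the error is roughly squared at every step.

```python
def cubic_step(s):
    """One classical Newton-Schulz step acting on a single singular value."""
    return 1.5 * s - 0.5 * s**3

s = 0.5          # hypothetical starting singular value after normalization
errors = []
for _ in range(6):
    s = cubic_step(s)
    errors.append(1.0 - s)  # distance from the fixed point s = 1

for k, e in enumerate(errors, start=1):
    print(f"step {k}: error {e:.3e}")
```

Once the error is small, each additional step roughly doubles the number of correct digits, which is why a few steps are typically enough in practice.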
Recent analysis further demonstrates that as the right-preconditioning (through the inverse input second moment) becomes more accurate, the effective descent direction is better conditioned with respect to the layer's input geometry. Empirically, Newton-Muon reaches the same target validation loss as standard Muon in fewer iteration steps and less wall-clock time on GPT-2 pretraining setups, at the cost of a small per-step computational overhead from Cholesky inversion and Newton–Schulz polynomial evaluation (Du et al., 1 Apr 2026).
However, estimating the preconditioning matrix carries its own tradeoffs: it requires periodic batchwise updates, stability regularization (e.g., ridge penalties), and blockwise or low-rank approximations to avoid excessive computational cost in very wide layers (Du et al., 1 Apr 2026).
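The estimation pipeline described here (batchwise updates, ridge regularization, Cholesky-based inversion) might look roughly like the following. This is a sketch under assumptions: the exponential-moving-average form, the decay `beta`, the `ridge` value, and both function names are hypothetical and not taken from the source.

```python
import numpy as np

def update_second_moment(S, Z, beta=0.9):
    """Exponential moving average of the batch second moment Z Z^T / n."""
    batch = Z @ Z.T / Z.shape[1]
    return beta * S + (1.0 - beta) * batch

def right_precondition(G, S, ridge=1e-3):
    """Return G @ inv(S + ridge * I) using a Cholesky factorization."""
    d = S.shape[0]
    L = np.linalg.cholesky(S + ridge * np.eye(d))  # S + ridge * I = L L^T
    # Solve (L L^T) X^T = G^T in two stages. np.linalg.solve is a general
    # solver; a dedicated triangular solver would be cheaper in practice.
    tmp = np.linalg.solve(L, G.T)
    return np.linalg.solve(L.T, tmp).T

rng = np.random.default_rng(0)
d_out, d_in, n = 3, 5, 32
S = np.zeros((d_in, d_in))
for _ in range(10):                  # ten hypothetical training batches
    Z = rng.standard_normal((d_in, n))
    S = update_second_moment(S, Z)

G = rng.standard_normal((d_out, d_in))
P = right_precondition(G, S)         # matrix that would be passed to msign
```

The ridge term keeps the factorization well defined even when the running estimate is rank-deficient, for example early in training or when the batch is smaller than the input dimension.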
5. Practical Variants, Extensions, and Empirical Findings
Block-diagonal and low-rank variants of Newton-Muon have been shown to suffice in practice, particularly for multi-branch layers or large MLP contractions (Du et al., 1 Apr 2026). Efficient polynomial inversion routines (e.g., via specialized batched GEMM kernels or operator-specific kernels) are commonly employed. Newton-Muon has been deployed in hybrid schemes, e.g., applying AdamW on small parameters and Newton-Muon on hidden-layer weights, yielding improved convergence and generalization on tasks such as CIFAR-10 and large language modeling.
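A hybrid scheme of this kind needs a rule for deciding which optimizer handles which parameter. The sketch below shows one plausible routing rule; the threshold `min_dim`, the function name, and the parameter names are hypothetical, and the source does not specify how the split is made.

```python
import numpy as np

def route_parameters(params, min_dim=64):
    """Split parameters into a Newton-Muon group and an AdamW group.

    Rule (an assumption for illustration): 2-D weights whose dimensions are
    both at least `min_dim` go to Newton-Muon; everything else goes to AdamW.
    """
    matrix_group, adamw_group = {}, {}
    for name, p in params.items():
        if p.ndim == 2 and min(p.shape) >= min_dim:
            matrix_group[name] = p
        else:
            adamw_group[name] = p
    return matrix_group, adamw_group

# Hypothetical parameter dictionary for a small model.
params = {
    "mlp.fc1.weight": np.zeros((256, 128)),
    "mlp.fc1.bias": np.zeros(256),
    "norm.scale": np.zeros(128),
    "head.small_proj": np.zeros((8, 128)),
}
matrix_group, adamw_group = route_parameters(params)
```

Biases, normalization scales, and small projections fall into the AdamW group under this rule, while large hidden-layer weight matrices receive the matrix-structured update.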
Notably, the empirical results indicate that the inclusion of right-preconditioning systematically improves both iteration efficiency and stability, particularly in the presence of pronounced input anisotropy, a regime common to realistic deep architectures (Du et al., 1 Apr 2026). Fixed or blockwise updates of the preconditioner, as well as regularized Cholesky or symmetrized inverses, are effective in practice.
6. Limitations and Open Problems
A crucial limitation of current Newton-Muon implementations is the reliance on the isotropic-weight approximation for the left covariance, as unbiased and practical alternatives remain an open research challenge (Du et al., 1 Apr 2026). The Kronecker-factored surrogate for the Hessian omits off-token coupling in transformers; more accurate yet efficient surrogates may further improve second-order adaptation. For extremely large-scale distributed scenarios, the estimation and communication of the activation second moments can be a bottleneck, motivating work on randomized or structured sketches for distributed preconditioning (Du et al., 1 Apr 2026).
Newton-Muon thus represents a principled, theoretically sound, and empirically effective extension of polar-step geometry-aware optimization. Its key distinguishing feature is explicit adaptation to data geometry via input-moment right-preconditioning, efficiently realized through SVD-free Newton–Schulz polynomial iteration and carefully regulated batchwise updates (Du et al., 1 Apr 2026).