Practical Efficiency of Muon for Pretraining (2505.02222v4)
Abstract: We demonstrate that Muon, the simplest instantiation of a second-order optimizer, explicitly expands the Pareto frontier over AdamW on the compute-time tradeoff. We find that Muon is more effective than AdamW in retaining data efficiency at large batch sizes, far beyond the so-called critical batch size, while remaining computationally efficient, thus enabling more economical training. We study the combination of Muon and the maximal update parameterization (muP) for efficient hyperparameter transfer and present a simple telescoping algorithm that accounts for all sources of error in muP while introducing only a modest overhead in resources. We validate our findings through extensive experiments with model sizes up to four billion parameters and ablations on the data distribution and architecture.
Summary
- The paper demonstrates that Muon, a straightforward second-order optimizer, expands the Pareto frontier on compute-time tradeoffs compared to AdamW.
- Muon performs matrix-structured steepest descent with spectral-norm regularization, computed via the Newton-Schulz iteration; it tracks only the first moment, giving it a lighter memory footprint than AdamW while keeping update magnitudes comparable.
- A telescoping algorithm for hyperparameter tuning with muP is proposed that controls both sources of transfer error at modest extra cost; the approach is validated on models up to 4B parameters.
This paper investigates the practical efficiency of Muon (2505.02222), a second-order optimizer, for large-scale LLM pretraining, comparing it directly against the widely used AdamW (Loshchilov et al., 2017). The core argument is that Muon offers superior practical benefits by expanding the Pareto frontier on the compute-time tradeoff, enabling more flexible and economical training, and that it is compatible with efficient hyperparameter tuning techniques like the maximal update parameterization (muP) (Yang et al., 2022).
The paper defines Muon as the simplest instantiation of a second-order optimizer, essentially performing matrix-structured steepest descent with spectral norm regularization. For a weight matrix $W_t \in \mathbb{R}^{m \times n}$, the update replaces the gradient $G_t \in \mathbb{R}^{m \times n}$ with a transformed matrix $O_t = UV^\top$, where $G_t = U\Sigma V^\top$ is the SVD. In practice, Muon avoids an explicit SVD by using the Newton-Schulz iteration for matrix orthogonalization. The standard update rule combines Nesterov momentum, learning rate scaling, and coupled weight decay, as shown in Equation 1. A crucial practical detail is the normalization constant $0.2\,n$ (where $n$ is the output dimension), which scales the Muon update to have an RMS similar to AdamW's, so the learning rate schedule and weight decay values can be shared between Muon and AdamW (the latter is still used for the embedding and normalization layers). The paper provides a JAX implementation adapted from Optax (Appendix B), in which orthogonalization is applied layer-wise based on tensor shape and naming conventions (e.g., 'mlp', 'attention'). Because the implementation maintains only the first moment, Muon has a lighter memory footprint than AdamW, a practical advantage for distributed training.
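To make the update concrete, here is a minimal JAX sketch of the two ingredients described above: Newton-Schulz orthogonalization of the Nesterov-adjusted momentum, followed by the weight update with coupled weight decay. This is not the paper's Optax-based implementation; the quintic coefficients and iteration count are borrowed from common open-source Muon variants, and `rms_scale` stands in for the width-dependent constant from Equation 1.

```python
import jax.numpy as jnp

def newton_schulz(g, steps=5, eps=1e-7):
    """Approximate O = U V^T for G = U S V^T without an explicit SVD."""
    a, b, c = 3.4445, -4.7750, 2.0315   # quintic coefficients used in open-source Muon variants
    x = g / (jnp.linalg.norm(g) + eps)  # Frobenius normalization so singular values lie in (0, 1]
    tall = x.shape[0] > x.shape[1]
    if tall:                            # iterate on the wide orientation for efficiency
        x = x.T
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * (s @ s)) @ x
    return x.T if tall else x

def muon_step(w, grad, mom, lr, beta=0.95, weight_decay=0.0, rms_scale=1.0):
    """One Muon update with Nesterov-style momentum and coupled weight decay (sketch).
    rms_scale stands in for the width-dependent constant (0.2·n in the text) that
    matches the update RMS to AdamW's; its exact form follows the paper's Equation 1."""
    mom = beta * mom + grad                    # the first moment is the only optimizer state
    o = newton_schulz(beta * mom + grad)       # orthogonalize the Nesterov lookahead
    w = (1.0 - lr * weight_decay) * w - lr * rms_scale * o
    return w, mom
```

In a full training setup this step would be applied per weight matrix in the attention and MLP blocks, with AdamW retained for embeddings and normalization parameters, as the paper describes.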
A key contribution is the characterization of optimizer efficiency using the compute-time tradeoff Pareto frontier. Standard comparisons based on wall-clock time or FLOPs at fixed resources are deemed insufficient. Instead, the paper plots the total training time versus the number of devices (and corresponding batch size) required to reach a specific target loss. The experiments use decoder-only transformer models based on Gemma 3 (Team et al., 25 Mar 2025) with sizes up to 4 billion parameters, trained on diverse data (text and code) on TPU v5p chips. The results show that Muon's iso-loss curves lie below and to the left of AdamW's on the compute-time plane (Figure 1, Figure 2), meaning Muon can reach the same loss faster with the same compute or achieve the same speed with less compute. This demonstrates a strict improvement in resource allocation flexibility.
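As an illustration of how such an iso-loss curve can be assembled, the sketch below maps hypothetical measurements (tokens needed to reach the target loss and seconds per step at each batch size) onto (device count, total training time) points on the compute-time plane. The function and all numbers are illustrative, not the paper's measurement pipeline.

```python
import jax.numpy as jnp

def iso_loss_curve(batch_sizes, tokens_to_loss, step_time_s, seq_len, devices):
    """One (num_devices, training hours) point per batch size, at a fixed target loss."""
    batch_sizes = jnp.asarray(batch_sizes, dtype=jnp.float32)
    steps = jnp.asarray(tokens_to_loss) / (batch_sizes * seq_len)   # steps to reach the target loss
    hours = steps * jnp.asarray(step_time_s) / 3600.0               # wall-clock time at that batch size
    return jnp.asarray(devices), hours

# Placeholder numbers: larger batches use more devices and, if data efficiency
# holds up (as the paper argues it does for Muon), finish in less time.
devices, hours = iso_loss_curve(
    batch_sizes=[256, 512, 1024],
    tokens_to_loss=[3.0e10, 3.1e10, 3.3e10],
    step_time_s=[2.0, 2.1, 2.2],
    seq_len=4096,
    devices=[64, 128, 256],
)
```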
The paper attributes Muon's improved compute-time tradeoff to its superior data efficiency, particularly at large batch sizes. As batch size increases in data-parallel training, data efficiency typically degrades beyond a "critical batch size" (McCandlish et al., 2018; Li et al., 2019). To measure relative data efficiency, the paper proposes the token ratio $R_L(B) = T_{L,A}(B) / T_{L,M}(B)$ (Equation 2), where $T_{L,A}(B)$ and $T_{L,M}(B)$ are the numbers of tokens AdamW and Muon need, respectively, to reach loss $L$ at batch size $B$. Empirically, $R_L(B)$ stays consistently above 1 and is nondecreasing at large batch sizes (Figure 3): AdamW carries a persistent, or even growing, token overhead relative to Muon (Muon needs roughly 10-15% fewer tokens) in the large-batch regime, which translates into an overhead in non-optimizer FLOPs. This enduring data-efficiency advantage lets Muon scale better as batch size grows, enabling faster training when more devices are available, even though Muon's per-step FLOPs can be higher than AdamW's because of the orthogonalization step.
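A minimal sketch of the token-ratio diagnostic, assuming per-batch-size arrays of measured tokens-to-target-loss for each optimizer (the numbers below are placeholders, not the paper's data):

```python
import jax.numpy as jnp

def token_ratio(tokens_adamw, tokens_muon):
    """R_L(B) = T_{L,A}(B) / T_{L,M}(B); values above 1 mean Muon reaches the
    target loss with fewer tokens at that batch size."""
    return jnp.asarray(tokens_adamw) / jnp.asarray(tokens_muon)

r = token_ratio(tokens_adamw=[3.4e10, 3.6e10, 3.9e10],   # placeholder measurements
                tokens_muon=[3.0e10, 3.1e10, 3.3e10])     # at batch sizes [256, 512, 1024]
nondecreasing = bool(jnp.all(jnp.diff(r) >= 0))           # the paper's observation at large batch sizes
```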
Efficient hyperparameter tuning for large models is another practical challenge. The paper addresses the open question of whether muP (Yang et al., 2022) is compatible with Muon (Liu et al., 24 Feb 2025). muP allows transferring hyperparameters found on smaller proxy models to larger ones by prescribing specific width-dependent scalings for initialization variance, weight multipliers, and learning rates (Table 1). The paper identifies two sources of error in muP transfer: finite-width bias (the optimal hyperparameter shifts by $O(1/n)$ at width $n$) and sampling error from discrete grid searches (Equations 3 and 4, Appendix H). To mitigate both, the paper introduces a simple "telescoping" algorithm (Algorithm 1): train proxy models at successively doubled widths, shrinking the hyperparameter search grid geometrically at each doubling. The intuition is that this tracks the optimal hyperparameter's $O(1/n)$ drift while keeping the computational cost of each stage roughly constant. Empirically, muP transfer of the learning rate and weight decay works with Muon on models up to 3.7B parameters (Figure 4, Appendix E). The telescoping algorithm is shown to refine hyperparameter estimates and control both error sources at an added compute cost of $O(C \log N)$, where $C$ is the cost of training the final model and $N$ is its width (Figure 5). As a result, a significant fraction of the total compute budget (over 20% in their experiments) still goes to training the final model with near-optimal hyperparameters.
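A minimal sketch of the telescoping idea under assumed details: `train_and_eval` is a hypothetical callback that trains a proxy model of a given width with a given learning rate and returns its validation loss, the search is over the log learning rate only, and the halving schedule for the window and grid is illustrative; Algorithm 1 in the paper also covers weight decay and specifies the exact reduction schedule.

```python
import jax.numpy as jnp

def telescoping_search(train_and_eval, base_width, final_width,
                       log_lr_center=-2.5, log_lr_halfwidth=1.0, grid_points=9):
    """Tune a hyperparameter on proxy models of doubling width, shrinking both the
    search window and the grid so each stage tracks the O(1/n) drift of the optimum
    at roughly constant cost."""
    width = base_width
    while width < final_width:
        grid = jnp.linspace(log_lr_center - log_lr_halfwidth,
                            log_lr_center + log_lr_halfwidth, grid_points)
        losses = jnp.asarray([train_and_eval(width, float(10.0 ** g)) for g in grid])
        log_lr_center = float(grid[jnp.argmin(losses)])   # recenter on the best grid point
        log_lr_halfwidth *= 0.5                           # halve the window at each width doubling
        grid_points = max(3, grid_points // 2 + 1)        # fewer runs as each run gets more expensive
        width *= 2                                        # next proxy model is twice as wide
    return 10.0 ** log_lr_center   # hand this value to the final model via the muP scalings
```

Shrinking the grid as width grows is what keeps the per-stage cost roughly flat even though each individual proxy run becomes more expensive.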
In summary, the paper makes a strong case for replacing AdamW with Muon in large-scale LLM pretraining. It provides empirical evidence that Muon offers a better compute-time tradeoff by maintaining superior data efficiency at large batch sizes. Furthermore, it demonstrates that Muon can be efficiently tuned using muP combined with the proposed telescoping hyperparameter search strategy, providing a practical, unified recipe for more efficient and flexible pretraining.
Follow-up Questions
- How does the spectral norm regularization in Muon influence its generalization compared to AdamW, especially in large-scale language models?
- What are the practical tradeoffs between the increased per-step computational cost of Muon due to orthogonalization and its overall data efficiency?
- How does Muon's memory footprint advantage impact distributed training setups for very large models compared to optimizers like AdamW?
- In what scenarios might the telescoping algorithm for hyperparameter tuning fail, and are there cases where muP transfer does not yield near-optimal results?
Related Papers
- Muon is Scalable for LLM Training (2025)
- Don't be lazy: CompleteP enables compute-efficient deep transformers (2025)
- Power Lines: Scaling Laws for Weight Decay and Batch Size in LLM Pre-training (2025)
- The Polar Express: Optimal Matrix Sign Methods and Their Application to the Muon Algorithm (2025)
- MuLoCo: Muon is a practical inner optimizer for DiLoCo (2025)