Practical Efficiency of Muon for Pretraining (2505.02222v4)
Abstract: We demonstrate that Muon, the simplest instantiation of a second-order optimizer, explicitly expands the Pareto frontier over AdamW on the compute-time tradeoff. We find that Muon is more effective than AdamW in retaining data efficiency at large batch sizes, far beyond the so-called critical batch size, while remaining computationally efficient, thus enabling more economical training. We study the combination of Muon and the maximal update parameterization (muP) for efficient hyperparameter transfer and present a simple telescoping algorithm that accounts for all sources of error in muP while introducing only a modest overhead in resources. We validate our findings through extensive experiments with model sizes up to four billion parameters and ablations on the data distribution and architecture.
Summary
- The paper demonstrates that Muon, a straightforward second-order optimizer, expands the Pareto frontier on compute-time tradeoffs compared to AdamW.
- Muon performs matrix-structured steepest descent with spectral-norm regularization, computed via the Newton-Schulz iteration; it tracks only the first moment, giving it a lighter memory footprint than AdamW while keeping update magnitudes comparable.
- A telescoping algorithm for hyperparameter tuning with muP is proposed that controls both sources of transfer error at modest extra cost; the approach is validated on models up to 4B parameters.
This paper investigates the practical efficiency of Muon (2505.02222), a second-order optimizer, for large-scale LLM pretraining, comparing it directly against the widely used AdamW (Loshchilov et al., 2017). The core argument is that Muon offers superior practical benefits by expanding the Pareto frontier on the compute-time tradeoff, enabling more flexible and economical training, and that it is compatible with efficient hyperparameter tuning techniques like the maximal update parameterization (muP) (Yang et al., 2022).
The paper defines Muon as the simplest instantiation of a second-order optimizer, essentially performing matrix-structured steepest descent with spectral norm regularization. For a weight matrix $W_t \in \mathbb{R}^{m \times n}$, the update replaces the gradient $G_t \in \mathbb{R}^{m \times n}$ with a transformed matrix $O_t = UV^\top$, where $G_t = U\Sigma V^\top$ is the SVD. In practice, Muon avoids an explicit SVD by using the Newton-Schulz iteration for matrix orthogonalization. The standard update rule combines Nesterov momentum, learning rate scaling, and coupled weight decay, as shown in Equation 1. A crucial practical detail is the normalization constant $0.2\,n$ (where $n$ is the output dimension), which scales the Muon update to have an RMS similar to AdamW's, so the learning rate schedule and weight decay values can be shared between Muon and AdamW (the latter is still used for the embedding and normalization layers). The paper provides a JAX implementation adapted from Optax (Appendix B), in which orthogonalization is applied layer-wise based on tensor shape and naming conventions (e.g., 'mlp', 'attention'). Because the implementation maintains only the first moment, Muon has a lighter memory footprint than AdamW, a practical advantage for distributed training.
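To make the update concrete, here is a minimal JAX sketch of the two ingredients described above: Newton-Schulz orthogonalization of the Nesterov-adjusted momentum, followed by the weight update with coupled weight decay. This is not the paper's Optax-based implementation; the quintic coefficients and iteration count are borrowed from common open-source Muon variants, and `rms_scale` stands in for the width-dependent constant from Equation 1.

```python
import jax.numpy as jnp

def newton_schulz(g, steps=5, eps=1e-7):
    """Approximate O = U V^T for G = U S V^T without an explicit SVD."""
    a, b, c = 3.4445, -4.7750, 2.0315   # quintic coefficients used in open-source Muon variants
    x = g / (jnp.linalg.norm(g) + eps)  # Frobenius normalization so singular values lie in (0, 1]
    tall = x.shape[0] > x.shape[1]
    if tall:                            # iterate on the wide orientation for efficiency
        x = x.T
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * (s @ s)) @ x
    return x.T if tall else x

def muon_step(w, grad, mom, lr, beta=0.95, weight_decay=0.0, rms_scale=1.0):
    """One Muon update with Nesterov-style momentum and coupled weight decay (sketch).
    rms_scale stands in for the width-dependent constant (0.2·n in the text) that
    matches the update RMS to AdamW's; its exact form follows the paper's Equation 1."""
    mom = beta * mom + grad                    # the first moment is the only optimizer state
    o = newton_schulz(beta * mom + grad)       # orthogonalize the Nesterov lookahead
    w = (1.0 - lr * weight_decay) * w - lr * rms_scale * o
    return w, mom
```

In a full training setup this step would be applied per weight matrix in the attention and MLP blocks, with AdamW retained for embeddings and normalization parameters, as the paper describes.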
A key contribution is the characterization of optimizer efficiency using the compute-time tradeoff Pareto frontier. Standard comparisons based on wall-clock time or FLOPs at fixed resources are deemed insufficient. Instead, the paper plots the total training time versus the number of devices (and corresponding batch size) required to reach a specific target loss. The experiments use decoder-only transformer models based on Gemma 3 (Team et al., 25 Mar 2025) with sizes up to 4 billion parameters, trained on diverse data (text and code) on TPU v5p chips. The results show that Muon's iso-loss curves lie below and to the left of AdamW's on the compute-time plane (Figure 1, Figure 2), meaning Muon can reach the same loss faster with the same compute or achieve the same speed with less compute. This demonstrates a strict improvement in resource allocation flexibility.
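As an illustration of how such an iso-loss curve can be assembled, the sketch below maps hypothetical measurements (tokens needed to reach the target loss and seconds per step at each batch size) onto (device count, total training time) points on the compute-time plane. The function and all numbers are illustrative, not the paper's measurement pipeline.

```python
import jax.numpy as jnp

def iso_loss_curve(batch_sizes, tokens_to_loss, step_time_s, seq_len, devices):
    """One (num_devices, training hours) point per batch size, at a fixed target loss."""
    batch_sizes = jnp.asarray(batch_sizes, dtype=jnp.float32)
    steps = jnp.asarray(tokens_to_loss) / (batch_sizes * seq_len)   # steps to reach the target loss
    hours = steps * jnp.asarray(step_time_s) / 3600.0               # wall-clock time at that batch size
    return jnp.asarray(devices), hours

# Placeholder numbers: larger batches use more devices and, if data efficiency
# holds up (as the paper argues it does for Muon), finish in less time.
devices, hours = iso_loss_curve(
    batch_sizes=[256, 512, 1024],
    tokens_to_loss=[3.0e10, 3.1e10, 3.3e10],
    step_time_s=[2.0, 2.1, 2.2],
    seq_len=4096,
    devices=[64, 128, 256],
)
```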
The paper attributes Muon's improved compute-time tradeoff to its superior data efficiency, particularly at large batch sizes. As batch size increases in data-parallel training, data efficiency typically degrades beyond a "critical batch size" (McCandlish et al., 2018; Li et al., 2019). To measure relative data efficiency, the paper proposes the token ratio $R_L(B) = T_{L,A}(B) / T_{L,M}(B)$ (Equation 2), where $T_{L,A}(B)$ and $T_{L,M}(B)$ are the numbers of tokens AdamW and Muon need, respectively, to reach loss $L$ at batch size $B$. Empirically, $R_L(B)$ stays consistently above 1 and is nondecreasing at large batch sizes (Figure 3): AdamW carries a persistent, or even growing, token overhead relative to Muon (Muon needs roughly 10-15% fewer tokens) in the large-batch regime, which translates into an overhead in non-optimizer FLOPs. This enduring data-efficiency advantage lets Muon scale better as batch size grows, enabling faster training when more devices are available, even though Muon's per-step FLOPs can be higher than AdamW's because of the orthogonalization step.
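A minimal sketch of the token-ratio diagnostic, assuming per-batch-size arrays of measured tokens-to-target-loss for each optimizer (the numbers below are placeholders, not the paper's data):

```python
import jax.numpy as jnp

def token_ratio(tokens_adamw, tokens_muon):
    """R_L(B) = T_{L,A}(B) / T_{L,M}(B); values above 1 mean Muon reaches the
    target loss with fewer tokens at that batch size."""
    return jnp.asarray(tokens_adamw) / jnp.asarray(tokens_muon)

r = token_ratio(tokens_adamw=[3.4e10, 3.6e10, 3.9e10],   # placeholder measurements
                tokens_muon=[3.0e10, 3.1e10, 3.3e10])     # at batch sizes [256, 512, 1024]
nondecreasing = bool(jnp.all(jnp.diff(r) >= 0))           # the paper's observation at large batch sizes
```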
Efficient hyperparameter tuning for large models is another practical challenge. The paper addresses the open question of whether muP (Yang et al., 2022) is compatible with Muon (Liu et al., 24 Feb 2025). muP allows transferring hyperparameters found on smaller proxy models to larger ones by prescribing specific width-dependent scalings for initialization variance, weight multipliers, and learning rates (Table 1). The paper identifies two sources of error in muP transfer: finite-width bias (the optimal hyperparameter shifts by $O(1/n)$ at width $n$) and sampling error from discrete grid searches (Equations 3 and 4, Appendix H). To mitigate both, the paper introduces a simple "telescoping" algorithm (Algorithm 1): train proxy models at successively doubled widths, shrinking the hyperparameter search grid geometrically at each doubling. The intuition is that this tracks the optimal hyperparameter's $O(1/n)$ drift while keeping the computational cost of each stage roughly constant. Empirically, muP transfer of the learning rate and weight decay works with Muon on models up to 3.7B parameters (Figure 4, Appendix E). The telescoping algorithm is shown to refine hyperparameter estimates and control both error sources at an added compute cost of $O(C \log N)$, where $C$ is the cost of training the final model and $N$ is its width (Figure 5). As a result, a significant fraction of the total compute budget (over 20% in their experiments) still goes to training the final model with near-optimal hyperparameters.
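A minimal sketch of the telescoping idea under assumed details: `train_and_eval` is a hypothetical callback that trains a proxy model of a given width with a given learning rate and returns its validation loss, the search is over the log learning rate only, and the halving schedule for the window and grid is illustrative; Algorithm 1 in the paper also covers weight decay and specifies the exact reduction schedule.

```python
import jax.numpy as jnp

def telescoping_search(train_and_eval, base_width, final_width,
                       log_lr_center=-2.5, log_lr_halfwidth=1.0, grid_points=9):
    """Tune a hyperparameter on proxy models of doubling width, shrinking both the
    search window and the grid so each stage tracks the O(1/n) drift of the optimum
    at roughly constant cost."""
    width = base_width
    while width < final_width:
        grid = jnp.linspace(log_lr_center - log_lr_halfwidth,
                            log_lr_center + log_lr_halfwidth, grid_points)
        losses = jnp.asarray([train_and_eval(width, float(10.0 ** g)) for g in grid])
        log_lr_center = float(grid[jnp.argmin(losses)])   # recenter on the best grid point
        log_lr_halfwidth *= 0.5                           # halve the window at each width doubling
        grid_points = max(3, grid_points // 2 + 1)        # fewer runs as each run gets more expensive
        width *= 2                                        # next proxy model is twice as wide
    return 10.0 ** log_lr_center   # hand this value to the final model via the muP scalings
```

Shrinking the grid as width grows is what keeps the per-stage cost roughly flat even though each individual proxy run becomes more expensive.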
In summary, the paper makes a strong case for replacing AdamW with Muon in large-scale LLM pretraining. It provides empirical evidence that Muon offers a better compute-time tradeoff by maintaining superior data efficiency at large batch sizes. Furthermore, it demonstrates that Muon can be efficiently tuned using muP combined with the proposed telescoping hyperparameter search strategy, providing a practical, unified recipe for more efficient and flexible pretraining.
Follow-up Questions
- How does the spectral norm regularization in Muon influence its generalization compared to AdamW, especially in large-scale language models?
- What are the practical tradeoffs between the increased per-step computational cost of Muon due to orthogonalization and its overall data efficiency?
- How does Muon's memory footprint advantage impact distributed training setups for very large models compared to optimizers like AdamW?
- In what scenarios might the telescoping algorithm for hyperparameter tuning fail, and are there cases where muP transfer does not yield near-optimal results?
Related Papers
- Muon is Scalable for LLM Training (2025)
- Don't be lazy: CompleteP enables compute-efficient deep transformers (2025)
- Power Lines: Scaling Laws for Weight Decay and Batch Size in LLM Pre-training (2025)
- The Polar Express: Optimal Matrix Sign Methods and Their Application to the Muon Algorithm (2025)
- MuLoCo: Muon is a practical inner optimizer for DiLoCo (2025)