LiMuon Optimizer for Large-Scale Deep Learning
- LiMuon is a stochastic first-order optimizer that integrates momentum-based variance reduction and randomized SVD to reduce memory usage and improve sample complexity in deep learning.
- It leverages low-rank approximations to compress momentum matrices, reducing storage from O(mn) to O((m+n)r) while preserving key descent directions.
- The optimizer achieves an optimal O(ε⁻³) sample complexity under mild smoothness conditions and shows practical benefits for language models and vision transformers.
LiMuon is a stochastic first-order optimizer developed to address the computational and memory limitations of existing matrix-structured optimizers (such as Muon) for large-scale deep neural networks. By integrating momentum-based variance reduction and low-rank randomized singular value decomposition (RSVD), LiMuon offers lower memory usage and improved sample complexity, enabling efficient training of models with millions to billions of parameters. Its theoretical and empirical advantages are demonstrated for both LLMs and vision transformers, positioning LiMuon as a state-of-the-art technique for scalable deep learning optimization (Huang et al., 18 Sep 2025).
1. Design Principles and Algorithmic Structure
LiMuon is structured around two core principles: variance reduction through momentum and memory-efficient low-rank compression of matrix variables via randomized SVD. Its fundamental update rule maintains a momentum variable $M_t \in \mathbb{R}^{m \times n}$, which aggregates stochastic gradients through a variance-reduced (STORM-style) recursion:

$$M_t = \nabla f(X_t; \xi_t) + (1 - \beta)\big(M_{t-1} - \nabla f(X_{t-1}; \xi_t)\big),$$

where $\nabla f(X_t; \xi_t)$ is the stochastic gradient evaluated at weight matrix $X_t$ with data sample $\xi_t$, and $\beta \in (0, 1]$ is the momentum parameter. For very large parameter matrices, storing $M_t$ in full is prohibitive; LiMuon addresses this by optionally maintaining a low-rank approximation

$$M_t \approx U_t \Sigma_t V_t^\top,$$

where $U_t \in \mathbb{R}^{m \times r}$, $\Sigma_t \in \mathbb{R}^{r \times r}$, and $V_t \in \mathbb{R}^{n \times r}$ have rank at most $r$ (with $r \ll \min(m, n)$ for a weight matrix of size $m \times n$). This allows memory reduction from $O(mn)$ to $O((m+n)r)$.

The optimizer then computes the SVD (or its low-rank version) of $M_t$ and takes an update step along the product $U_t V_t^\top$, i.e.,

$$X_{t+1} = X_t - \eta_t\, U_t V_t^\top,$$

where $\eta_t$ is the step size. This matrix update preserves the essential directions of descent with minimal storage, leveraging the principal subspaces of the momentum.
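To make the update concrete, here is a minimal single-matrix sketch in PyTorch, assuming the STORM-style momentum recursion written above; the function name `limuon_step`, the `grad_fn` callback, and the hyperparameter defaults are illustrative, not the authors' reference implementation.

```python
import torch

def limuon_step(X, X_prev, M_prev, grad_fn, batch, eta=0.02, beta=0.1, rank=None):
    """One hypothetical LiMuon step on a single weight matrix X (m x n).

    grad_fn(W, batch) -> stochastic gradient of the loss at W on `batch`.
    """
    g_cur = grad_fn(X, batch)          # grad f(X_t; xi_t)
    g_old = grad_fn(X_prev, batch)     # grad f(X_{t-1}; xi_t), same sample
    # Variance-reduced momentum: M_t = g_cur + (1 - beta) * (M_{t-1} - g_old)
    M = g_cur + (1.0 - beta) * (M_prev - g_old)
    if rank is None:
        U, S, Vh = torch.linalg.svd(M, full_matrices=False)   # full SVD
        direction = U @ Vh
    else:
        U, S, V = torch.svd_lowrank(M, q=rank)                # randomized SVD
        direction = U @ V.T
    X_next = X - eta * direction       # step along the orthogonalized momentum
    return X_next, M
```

The `rank=None` branch corresponds to the full-rank variant; passing a small `rank` gives the memory-efficient RSVD variant described next.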
2. Memory-Efficient Low-Rank Approximation via RSVD
A key innovation in LiMuon is the application of randomized SVD to the momentum matrix. At each iteration (or at controlled intervals), RSVD is used to compute a rank-$r$ factorization

$$(U_t, \Sigma_t, V_t) = \mathrm{RSVD}(M_t, r).$$

Theoretical guarantees, standard for randomized SVD, ensure that

$$\mathbb{E}\,\big\|M_t - U_t \Sigma_t V_t^\top\big\|_F \le (1 + c)\,\Big(\sum_{j > r} \sigma_j^2(M_t)\Big)^{1/2}$$

for a known constant $c$, provided $r$ and the oversampling parameter $p$ are chosen appropriately. The discarded singular values (those not captured in the rank-$r$ truncation) are controlled by the tail energy $\big(\sum_{j > r} \sigma_j^2(M_t)\big)^{1/2}$, which is small when momentum matrices are approximately low-rank, ensuring reliable descent despite memory compression. Optionally, a full-rank (non-compressed) variant can be used if memory is not a concern.
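As a quick numerical sanity check of this bound, one can compare `torch.svd_lowrank` (which implements oversampled randomized SVD) against the optimal rank-$r$ tail error on a synthetic, approximately low-rank matrix; the sizes and noise level below are arbitrary.

```python
import torch

torch.manual_seed(0)
m, n, r = 1024, 512, 16
# Approximately rank-r matrix plus small noise, mimicking a momentum matrix.
M = torch.randn(m, r) @ torch.randn(r, n) + 1e-3 * torch.randn(m, n)

U, S, V = torch.svd_lowrank(M, q=r + 8, niter=2)     # q includes oversampling
M_hat = U[:, :r] @ torch.diag(S[:r]) @ V[:, :r].T    # rank-r reconstruction

sigma = torch.linalg.svdvals(M)
tail = sigma[r:].square().sum().sqrt()               # optimal rank-r error
err = torch.linalg.norm(M - M_hat)                   # Frobenius norm
print(f"RSVD error: {err:.4e}  vs. optimal tail energy: {tail:.4e}")
```

On such near-low-rank inputs the two errors are close, which is exactly the regime the momentum compression relies on.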
3. Theoretical Guarantees and Sample Complexity
LiMuon improves the sample complexity of reaching an $\epsilon$-stationary point for non-convex stochastic optimization under both classical and generalized smoothness conditions.
- Under Lipschitz smoothness (an $L$-Lipschitz-continuous gradient):
- After $T$ steps with step size $\eta_t$, momentum parameter $\beta_t$, and approximation rank $r$, the averaged expected gradient norm is bounded in terms of $\eta_t$, $\beta_t$, and the low-rank approximation error.
- Setting $\eta_t = O(t^{-1/3})$ and $\beta_t = O(t^{-2/3})$ (the usual variance-reduction schedule) gives a rate $\frac{1}{T}\sum_{t=1}^{T} \mathbb{E}\,\|\nabla f(X_t)\| = O(T^{-1/3})$, so that achieving $\epsilon$-stationarity requires $O(\epsilon^{-3})$ stochastic gradient samples.
- Under generalized smoothness (where the gradient's effective Lipschitz constant may grow with the gradient norm; see the displayed conditions after this list), the same $O(\epsilon^{-3})$ sample complexity is maintained.
These results hold for both the full-rank and RSVD variants, with careful control of the low-rank approximation error.
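For reference, the two smoothness regimes can be written explicitly; the generalized form below is the standard $(L_0, L_1)$-smoothness condition from the optimization literature, given here as a plausible reading of the relaxed condition the analysis accommodates.

```latex
% Classical L-smoothness: one global Lipschitz constant for the gradient.
\|\nabla f(X) - \nabla f(Y)\|_F \le L \,\|X - Y\|_F .

% Generalized (L_0, L_1)-smoothness: the effective Lipschitz constant may
% grow with the local gradient norm, matching behavior observed in LLM training.
\|\nabla f(X) - \nabla f(Y)\|_F \le \bigl(L_0 + L_1 \|\nabla f(X)\|_F\bigr)\,\|X - Y\|_F .
```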
4. Empirical Performance and Practical Integration
LiMuon demonstrates practical efficacy for large models, validated on both language and vision domains. Experiments on DistilGPT-2 and ViT architectures show:
- Lower loss and reduced perplexity compared to AdamW, Lion, and (conventional) Muon;
- Faster convergence with respect to training steps;
- For the RSVD variant, substantially lower GPU memory usage, enabling training of larger or deeper models on fixed hardware resources.
Implementation follows standard deep learning practices: each optimizer step updates the matrix momentum (and its low-rank approximation if desired), performs the SVD or RSVD, and takes a projected matrix step. No non-standard hardware or software is required.
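A sketch of how such a step slots into an ordinary training loop follows, reusing the hypothetical `limuon_step` logic from Section 1. Note that the variance-reduced momentum needs a second gradient evaluation at the previous iterate on the same mini-batch, so two backward passes per step are kept explicit here; non-matrix parameters (biases, norms) would in practice be handled by a standard optimizer such as AdamW, as is common for Muon-family methods.

```python
import copy
import torch

def train(model, data_loader, loss_fn, eta=0.02, beta=0.1, rank=16):
    """Illustrative loop: LiMuon-style steps applied to 2-D weights only."""
    prev = copy.deepcopy(model)                        # holds X_{t-1}
    momenta = {name: torch.zeros_like(p)
               for name, p in model.named_parameters() if p.ndim == 2}
    for batch in data_loader:
        loss_fn(model, batch).backward()               # grads at X_t
        loss_fn(prev, batch).backward()                # grads at X_{t-1}, same batch
        with torch.no_grad():
            for (name, p), p_old in zip(model.named_parameters(),
                                        prev.parameters()):
                if p.ndim != 2:
                    continue                           # leave these to AdamW etc.
                M = momenta[name]
                M.copy_(p.grad + (1 - beta) * (M - p_old.grad))
                U, S, V = torch.svd_lowrank(M, q=rank + 8)   # oversampled RSVD
                p_old.copy_(p)                         # remember X_t for next step
                p.add_(U[:, :rank] @ V[:, :rank].T, alpha=-eta)
        model.zero_grad()
        prev.zero_grad()
```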
5. Comparative Advantages and Relation to Prior Methods
LiMuon advances beyond Muon, AdamW, and other first-order optimizers in several dimensions:
- Memory footprint: By compressing the momentum state via RSVD, LiMuon scales efficiently to very wide/deep networks where storing a full momentum matrix is prohibitive.
- Sample complexity: The algorithm achieves the theoretically optimal $O(\epsilon^{-3})$ rate for finding $\epsilon$-stationary points among stochastic methods of this class.
- Smoothness flexibility: Unlike the original Muon, whose convergence analyses demand strong Lipschitz smoothness, LiMuon’s theoretical guarantees extend to relaxed generalized smoothness—fitting observed behavior in LLM training and other real-world regimes.
- Variance reduction: Momentum-based updates supply bias-variance trade-offs essential for stability in highly stochastic environments.
Compared to contemporaneous advances, such as memory-efficient adaptive optimizers built on Fisher information approximations (Gong et al., 11 Feb 2025), LiMuon achieves competitive adaptivity with even lower storage requirements via explicit low-rank structure.
6. Applicability to Modern Deep Learning Architectures
LiMuon targets training settings where network parameter matrices are high-dimensional—typical in LLMs, vision transformers, wide convolutional blocks, and other deep learning architectures. Its design aligns with the observed empirical low-rankness of gradients and updates in such models. The method is also suitable for pipeline and model-parallel contexts, where reducing optimizer memory directly allows larger submodules to fit on accelerator devices.
Applications validated in experiments include:
| Model | Benchmark | Loss/Perplexity | Memory Usage (Full vs. RSVD) |
|---|---|---|---|
| DistilGPT-2 | Language modeling | Lower than baselines | Reduced with RSVD |
| ViT | Image classification | Lower than baselines | Reduced with RSVD |
Empirical results show tighter loss curves and robustness to batch size and learning rate schedules. Integration into standard PyTorch or TensorFlow pipelines is straightforward, using either their built-in SVD/RSVD functions or highly optimized external routines.
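To put the memory column of the table in perspective, here is a back-of-the-envelope comparison of the momentum state for a single large weight matrix; the sizes and rank are chosen purely for illustration.

```python
# Momentum-state memory for one 4096 x 4096 fp32 weight matrix,
# full O(mn) storage vs. a rank-64 O((m+n)r) factorization.
m = n = 4096
r, bytes_per_elem = 64, 4
full_bytes = m * n * bytes_per_elem
lowrank_bytes = (m * r + r + n * r) * bytes_per_elem   # U, Sigma diagonal, V
print(f"full: {full_bytes / 2**20:.1f} MiB, rank-{r}: {lowrank_bytes / 2**20:.2f} MiB")
# -> full: 64.0 MiB, rank-64: 2.00 MiB (roughly a 32x reduction)
```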
7. Implications and Future Directions
LiMuon provides a template for optimizer design in high-dimensional matrix-structured parameter spaces under memory and compute constraints. Its convergence rates under mild smoothness conditions are optimal for variance-reduced first-order schemes, and the algorithm’s use of randomized SVD demonstrates synergy between numerical linear algebra and stochastic optimization.
Further research might investigate:
- Automated adaptive rank selection for RSVD to balance memory and accuracy;
- Hybrid schemes combining diagonal and matrix momentum tracking for different layers;
- Extensions to distributed and federated setups, leveraging compressibility for communication efficiency.
By addressing both statistical (sample complexity) and systems-level (memory, computation) challenges, LiMuon marks a significant advancement in the toolbox for scalable deep learning optimization (Huang et al., 18 Sep 2025).