LiMuon: Light and Fast Muon Optimizer for Large Models (2509.14562v1)
Abstract: Large models are now widely applied in artificial intelligence, so efficient training of large models has received widespread attention. More recently, the Muon optimizer was specifically designed for the matrix-structured parameters of large models. Although some works have begun to study the Muon optimizer, the existing Muon and its variants still suffer from high sample complexity or high memory cost for large models. To fill this gap, we propose a light and fast Muon (LiMuon) optimizer for training large models, which builds on the momentum-based variance reduced technique and randomized Singular Value Decomposition (SVD). Our LiMuon optimizer has a lower memory footprint than the current Muon and its variants. Moreover, we prove that our LiMuon has a lower sample complexity of $O(\epsilon^{-3})$ for finding an $\epsilon$-stationary solution of non-convex stochastic optimization under the smooth condition. The existing convergence analysis of the Muon optimizer mainly relies on the strict Lipschitz smoothness assumption, while some artificial intelligence tasks such as training LLMs do not satisfy this condition. We also prove that our LiMuon optimizer has a sample complexity of $O(\epsilon^{-3})$ under the generalized smooth condition. Numerical experimental results on training DistilGPT2 and ViT models verify the efficiency of our LiMuon optimizer.
Explain it Like I'm 14
Overview
This paper introduces LiMuon, a new way to train very large AI models (like language or vision models) faster and with less memory. It improves on a recent optimizer called Muon, which is designed for parameters that naturally form matrices (grids of numbers). LiMuon keeps Muon’s good ideas but uses smarter tricks so it needs less memory and reaches good solutions in fewer training steps.
Objectives
The researchers wanted to answer three simple questions:
- Can we make Muon lighter on memory so it’s easier to train big models on regular GPUs?
- Can we make Muon reach good solutions faster, using fewer data samples or steps?
- Can we prove it works under realistic conditions (not just perfect mathematical assumptions) that better match how large AI models are trained?
Methods and Approach
To understand LiMuon, it helps to know what an “optimizer” does. When training a model, the optimizer decides how to change the model’s parameters to make its predictions better. Muon is special because it treats parameters as matrices and updates them using an “orthogonal” direction, which often stabilizes training.
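To make the "orthogonal direction" idea concrete, here is a minimal NumPy sketch of a Muon-style step on one matrix parameter. The momentum form, the hyperparameter values, and the use of an exact SVD (instead of Newton–Schulz iterations) are simplifying assumptions for illustration, not the paper's exact algorithm.

```python
import numpy as np

def muon_style_step(W, M, G, beta=0.95, lr=0.02):
    """One illustrative Muon-style step on a matrix parameter W.

    M is the running gradient momentum and G the current stochastic gradient.
    The update direction is the orthogonal factor U @ Vt of the momentum,
    computed here with an exact reduced SVD for clarity; Muon itself
    approximates this step with Newton-Schulz iterations.
    """
    M = beta * M + (1.0 - beta) * G                    # smoothed gradient (momentum)
    U, _, Vt = np.linalg.svd(M, full_matrices=False)   # M = U diag(S) Vt
    O = U @ Vt                                         # orthogonalized update direction
    return W - lr * O, M
```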
LiMuon builds on Muon with two key ideas:
1) Momentum-based variance reduction (like STORM), sketched in the code after this list:
- Imagine pushing a grocery cart down a bumpy road. The bumps (random noise in gradients) make it wobbly. Momentum reduces the wobble by carrying forward a smoothed version of your push.
- Variance reduction carefully corrects this momentum using fresh information so the "push" stays accurate even with noise. This helps the optimizer stay on track with fewer samples.
2) Randomized SVD (RSVD) to save memory:
- SVD (Singular Value Decomposition) breaks a big matrix into simpler parts (think: taking a large LEGO build apart into a few key shapes). It's a way to find the most important directions in the matrix.
- Randomized SVD does this approximately and much faster by sampling smart "random directions."
- LiMuon uses RSVD to store a compressed version of its momentum (the "push history"), keeping only the most important parts. This drastically reduces memory use while keeping updates effective.
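Here is a minimal sketch of the STORM-style recursion on a toy quadratic loss; the function names, the toy gradient, and the beta value are illustrative assumptions, not the paper's exact update rule.

```python
import numpy as np

def stoch_grad(W, batch):
    """Toy stochastic gradient: gradient of 0.5 * ||W - batch||_F^2,
    where `batch` stands in for one noisy mini-batch estimate."""
    return W - batch

def storm_momentum(m_prev, W_curr, W_prev, batch, beta=0.1):
    """STORM-style variance-reduced momentum:
        m_t = g_t(W_t) + (1 - beta) * (m_{t-1} - g_t(W_{t-1})),
    where BOTH gradients are evaluated on the same fresh mini-batch.
    The correction term cancels much of the stale noise carried in m_prev,
    which is what lets the momentum stay accurate with fewer samples.
    """
    return stoch_grad(W_curr, batch) + (1.0 - beta) * (m_prev - stoch_grad(W_prev, batch))
```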
LiMuon offers two practical options:
- Option #1: Use momentum with variance reduction (fast and accurate). It keeps full momentum matrices, so it doesn’t save memory as much.
- Option #2: Compress the momentum with RSVD (low memory). It stores only a small number of columns and singular values, not the whole big matrix, so it uses much less memory. It’s still accurate enough for training large models.
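To see how Option #2's compression works, here is a minimal NumPy sketch of randomized SVD applied to a stand-in momentum matrix. The rank, oversampling value, matrix size, and function names are illustrative assumptions; the paper applies this idea to the momentum of each matrix-shaped layer.

```python
import numpy as np

def rsvd(M, rank, oversample=5, rng=None):
    """Randomized SVD (Halko et al. style): returns factors U, S, Vt
    of a rank-`rank` approximation M ~= U @ np.diag(S) @ Vt."""
    rng = np.random.default_rng() if rng is None else rng
    n = M.shape[1]
    k = min(rank + oversample, n)
    Omega = rng.standard_normal((n, k))     # random "test" directions
    Q, _ = np.linalg.qr(M @ Omega)          # orthonormal basis for the range of M
    B = Q.T @ M                             # small projected matrix (k x n)
    Ub, S, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :rank], S[:rank], Vt[:rank, :]

# "Option #2"-style storage: keep only the small factors, not the full momentum.
M = np.random.default_rng(0).standard_normal((1024, 1024))   # stand-in momentum matrix
U, S, Vt = rsvd(M, rank=32)
M_approx = U @ np.diag(S) @ Vt              # reconstructed only when the update needs it
```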
What do these terms mean in everyday language?
- Matrix parameters: Many parts of neural networks are naturally grids of numbers (matrices) or stacks of grids (tensors).
- Orthogonal update: Choosing a direction that’s “balanced” and well-behaved, so updates don’t blow up or get stuck.
- Momentum: Like pushing a cart—each new push adds to the previous, smoothing the ride.
- Variance reduction: Adjusting that push to cancel out random bumps (noise).
- SVD/RSVD: Breaking a big block into simpler pieces, keeping only the pieces that matter most to save memory and time.
Main Findings
The paper shows several important results:
- Lower memory use: LiMuon’s Option #2 stores a compressed momentum using RSVD, reducing memory from roughly “size of full matrices” to “size of a few columns.” In simple terms, it needs much less GPU memory during training (a rough back-of-the-envelope count follows after this list of findings).
- Faster convergence in theory:
- To get “error” down to ε (a small number), you need about 1/(ε³) samples or steps.
- This beats earlier Muon analyses that needed O(ε⁻⁴), i.e., more steps.
- Works under more realistic conditions: Earlier proofs assumed very strict smoothness (like perfectly gentle hills on the loss landscape). LiMuon’s theory also works under “generalized smoothness,” which better matches how real AI models behave (sometimes the hills are steeper or uneven).
- Real-world experiments:
- On DistilGPT2 (a smaller GPT-2), trained on WikiText-103:
- LiMuon Option #1 achieved lower perplexity (meaning it predicts text better) than AdamW, Lion, SUMO, Muon, and Muon++.
- LiMuon Option #2 still performed well and used less memory.
- On ViT (Vision Transformer), trained on Tiny ImageNet:
- LiMuon converged faster (training loss dropped quicker) and had better validation loss compared to others.
- Option #2 used the least memory, showing strong practical benefits.
These results suggest LiMuon is both efficient and effective for training large models on standard hardware.
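To make the memory claim concrete, here is a back-of-the-envelope count for a single hypothetical 4096 x 4096 layer; the shape and rank are chosen for illustration and are not taken from the paper.

```python
# Entries stored for the momentum of one 4096 x 4096 weight matrix.
m, n, k = 4096, 4096, 64                    # assumed matrix shape and compression rank
full_momentum_entries = m * n               # full momentum (Muon / LiMuon Option #1)
lowrank_entries = m * k + k + k * n         # U (m x k), S (k), V^T (k x n) factors
print(full_momentum_entries, lowrank_entries)   # 16777216 vs 524352, roughly 32x smaller
```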
Implications and Impact
Why this matters:
- Training big AI models is expensive, slow, and memory-hungry. An optimizer that uses less memory and reaches good solutions faster can save costs and make large-model training more accessible.
- LiMuon’s combination of smarter momentum (variance reduction) and compressed storage (RSVD) enables training on GPUs with limited memory without sacrificing performance much.
- Since it works under more realistic assumptions, LiMuon’s theory is more relevant to real-world tasks like LLMs, not just idealized math scenarios.
In short, LiMuon could help researchers and engineers train big models more efficiently, making advanced AI more practical and affordable.
Knowledge Gaps
Below is a single, consolidated list of what remains missing, uncertain, or unexplored in the paper, phrased to be directly actionable for follow-up work:
- Clarify and validate the memory model: peak GPU memory vs. persistent optimizer state. The paper claims reduced “state memory,” yet RSVD(M_t) is computed from the full M_t each step—does this require storing M_t transiently, and how does that affect true peak memory in practice across layers and model scales?
- Provide wall-clock performance analysis. The theoretical SFO improvement does not guarantee faster training. Quantify per-step runtime, throughput, and total time-to-target across Muon, LiMuon(#1), LiMuon(#2), and Muon++ on realistic model sizes.
- Characterize the computational overhead of SVD/RSVD/Newton–Schulz per layer. Report FLOPs and latency for orthogonalization and momentum compression, and identify when RSVD is beneficial vs. when it becomes the bottleneck.
- Analyze convergence with approximate orthogonalization. Current proofs assume exact orthogonalization via SVD; provide guarantees under finite-iteration Newton–Schulz or RSVD-based orthogonalization errors in both Lipschitz and generalized smooth settings.
- Develop adaptive rank selection strategies. The theory for Option #2 requires γ ≤ β/(c√r(1−β)), linking the target rank, the oversampling parameter s, and the spectral tail parameter ρ. Propose and test procedures to (i) estimate ρ online, (ii) choose the rank and s per layer/step to satisfy the bound while minimizing memory.
- Empirically validate the low-rank momentum assumption. Measure the spectral decay of M_t across layers and training time to justify Assumption 5 (tail singular values bounded relative to gradient norm) and quantify typical ρ in modern LLMs/ViTs.
- Extend theory to tensor-structured parameters. The method and analysis focus on matrices; formalize and analyze tensor variants (e.g., Tucker/TT decompositions) for convolutional and multi-dimensional layers.
- Assess robustness to mixed precision and numerical stability. Provide results with float16/bfloat16, quantify sensitivity of RSVD/QR to precision, and analyze error accumulation in long runs.
- Hyperparameter sensitivity and scheduling. The proofs place conditions on the step size and momentum parameters, while the experiments use fixed values. Explore schedules vs. constants, report sensitivity curves for the step size, momentum, rank, and oversampling parameter s, and offer practical tuning guidance.
- Interplay with gradient clipping and heavy-tailed noise. Many large-model trainings clip gradients and exhibit heavy-tailed noise; extend analysis beyond bounded variance, and test with/without clipping.
- Interaction with weight decay and norm constraints. Muon variants have connections to spectral norm constraints and decoupled weight decay; analyze whether LiMuon preserves or alters these properties and how to best combine them.
- Generalized smoothness realism. The paper assumes $(L_0, L_1)$-smoothness but does not verify whether typical loss landscapes satisfy it. Estimate these constants empirically or relax the assumptions to more realistic conditions (e.g., Hölder continuity, local smoothness).
- Empirical validation of sample complexity. The O(ε⁻³) result is not corroborated by experiments that vary ε; design experiments that track accuracy/gradient norm vs. number of SFOs to test scaling laws.
- Broader and stronger baselines. Include AdaFactor, Adagrad, Shampoo (matrix/tensor preconditioners), K-FAC, and recent memory- or structure-aware optimizers to strengthen the empirical comparison.
- Larger-scale evaluations. Results are on DistilGPT2 and Tiny-ImageNet ViT; test on larger LLMs/ViTs, longer training horizons, and real-world fine-tuning tasks to assess scalability and stability.
- Layerwise heterogeneity. Investigate per-layer rank choices, different β per layer, and whether certain layers benefit more from low-rank momentum than others.
- Effect on generalization. Low-rank momentum may bias the update geometry; measure its impact on generalization, calibration, and robustness (e.g., OOD, adversarial).
- Communication and distributed training costs. Analyze how LiMuon interacts with data/tensor parallelism, optimizer state sharding, and communication overhead for U, S, V factors.
- Memory–accuracy trade-offs. Provide systematic ablations over the rank and the oversampling parameter s to map the frontier between memory savings, convergence speed, and final performance.
- Stability under non-i.i.d. and curriculum/packing strategies. Extend theory and tests to non-i.i.d. sampling regimes common in large-scale pretraining (packing, curriculum, replay).
- Convergence constants and dependence on r. The bounds have explicit dependence on r = min(m, n); investigate whether this dependence can be reduced, and compare with other structured optimizers’ constants.
- Practical guidance for choosing orthogonalization method. Offer decision criteria for when to use SVD vs. Newton–Schulz vs. RSVD for the update step itself (line 4), including error–cost trade-offs and their effect on convergence.
- Code availability and reproducibility. Provide open-source implementation with reproducible scripts, seeds, and logging to validate memory claims and performance across settings.
- Compatibility with common training tricks. Evaluate with gradient accumulation, activation checkpointing, ZeRO-style optimizer partitioning, and mixed-precision loss scaling to ensure practical deployability.
- Monitoring and diagnostics for spectral tails. Introduce tools to monitor per-layer singular value tails of M_t online to inform adaptive rank decisions and detect when Option #2 could harm convergence (a minimal monitoring sketch follows after this list).
- Theoretical treatment of compounded randomness. Analyze how RSVD randomness interacts with stochastic gradient noise (two noise sources), and its effect on bias/variance of M_{t+1}.
- Applicability beyond Frobenius/nuclear-norm criteria. Explore convergence in terms of other gaps (e.g., Frank-Wolfe gap, KKT conditions) under the LiMuon updates, aligning with recent Muon analyses.
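As a starting point for the spectral-tail monitoring and low-rank validation items above, here is a minimal NumPy sketch; the function name, the rank value, and the `momentum_buffers` dictionary are hypothetical.

```python
import numpy as np

def spectral_tail_ratio(M, rank):
    """Fraction of the momentum's spectral mass that a rank-`rank` compression
    would discard: ||tail singular values|| / ||all singular values||.
    Small values suggest low-rank momentum compression is safe for this layer."""
    s = np.linalg.svd(M, compute_uv=False)
    return float(np.linalg.norm(s[rank:]) / max(np.linalg.norm(s), 1e-12))

# Hypothetical usage: log the ratio for each layer's momentum buffer every few hundred steps.
# for name, M in momentum_buffers.items():
#     print(name, spectral_tail_ratio(M, rank=64))
```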
Glossary
- Adam: An adaptive gradient optimization algorithm that estimates first-order and second-order moments of gradients to adjust learning rates. "Recently, adaptive gradient methods such as Adam~\citep{kingma2014adam}"
- AdamW: A variant of Adam that decouples weight decay from the gradient update to improve generalization. "and AdamW~\citep{loshchilov2017decoupled} have become standard choice for training large-scale models"
- Attention mechanisms: Neural network components that compute weighted combinations of inputs (queries, keys, values) to focus on relevant information. "attention mechanisms~\citep{vaswani2017attention}"
- Bounded variance: An assumption that the variance of the stochastic gradient estimator is finite and bounded by a constant. "Bounded Variance"
- Conditional gradient (Frank-Wolfe): An optimization method that uses linear minimization oracles over constraint sets (norm-balls) instead of projections. "stochastic conditional gradient (a.k.a., stochastic Frank-Wolfe)~\citep{hazan2016variance}"
- Decoupled weight decay: A regularization technique where weight decay is applied separately from the gradient-based update rule. "the Muon with decoupled weight decay implicitly solves an optimization problem that enforces a constraint on the spectral norm of weight matrices."
- DistilGPT2: A distilled (smaller) version of GPT-2 used for language modeling tasks. "train the DistilGPT2 on the wikitext-103 dataset~\citep{merity2016pointer}"
- Frank-Wolfe gap: A measure of optimality in Frank-Wolfe algorithms indicating progress toward a solution. "established its convergence guarantees in terms of the Frank-Wolfe gap, which implies convergence to a KKT point"
- Frobenius inner product: The matrix inner product defined as the trace of the product of one matrix transpose and another. "the Frobenius inner product defined as $\langle A, B \rangle = \mathrm{tr}(A^\top B)$"
- Frobenius norm: The norm of a matrix defined as the square root of the sum of squares of its entries. "the Frobenius norm defined as $\|A\|_F = \sqrt{\sum_{i,j} A_{ij}^2}$"
- Frobenius-norm Lipschitz smoothness: A smoothness condition where gradient differences are bounded in Frobenius norm by a Lipschitz constant times the parameter difference. "are $L$-Frobenius norm Lipschitz smooth"
- Generalized gradient norm clipping (GGNC): Methods that clip gradients under generalized smoothness to stabilize training. "studied the convergence properties of the generalized gradient norm clipping methods under the $(L_0, L_1)$-smoothness~\citep{zhang2019gradient}."
- Generalized smooth condition: A relaxed smoothness assumption characterized by parameters allowing gradient Lipschitz constants to depend on gradient norms. "under the generalized smooth condition."
- KKT point: A point satisfying Karush–Kuhn–Tucker conditions, indicating stationarity under constraints. "which implies convergence to a KKT point of the Muon under a norm constraint"
- LiMuon optimizer: A light and fast variant of the Muon optimizer using momentum-based variance reduction and randomized SVD. "To fill this gap, we propose a light and fast Muon (LiMuon) optimizer"
- Linear Minimization Oracle (LMO): An oracle that returns a point minimizing a linear function over a constraint set. "by leveraging the linear minimization oracle (LMO) over a norm-ball"
- Low-rank approximated estimate: An approximation of a matrix using a small number of singular components to reduce memory. "is a low-rank approximated estimate of momentum ."
- Muon optimizer: A matrix-gradient optimizer that orthogonalizes gradient momentum, often via Newton–Schulz iterations. "A typical noteworthy optimization algorithm is the Muon optimizer~\citep{jordanmuon}"
- Muon++: A variant of the Muon optimizer analyzed under Frank-Wolfe frameworks and showing improved complexity under large batch sizes. "Muon++~\citep{sfyraki2025lions}"
- Newton–Schulz iterations: Iterative method to approximate matrix functions (e.g., orthogonalization) without explicit SVD. "Newton-Schulz iterations~\citep{jordanmuon,bernstein2024old} to approximate the orthogonalization process."
- Non-Euclidean trust-region gradient method: An optimization method using trust regions defined under non-Euclidean geometries. "studied the convergence properties of the stochastic non-Euclidean trust-region gradient method with momentum"
- Nuclear norm: The sum of singular values of a matrix, used as a convex surrogate for rank. "$\|A\|_*$ denotes its nuclear norm."
- Nuclear-norm stationary point: A stationarity notion measured with respect to the nuclear norm of the gradient. "for finding an $\epsilon$-nuclear-norm stationary point of nonconvex stochastic optimization."
- Orthogonalization process: The procedure to produce an orthogonal (or orthonormal) matrix from gradients or momentum. "to approximate the orthogonalization process."
- Orthogonalized gradient momentum: A technique that orthogonalizes momentum directions before applying updates to matrix parameters. "updates matrix parameters with orthogonalized gradient momentum using Newton-Schulz iteration."
- Oversampling parameter: An RSVD parameter controlling extra sampled dimensions beyond target rank to improve approximation quality. "oversampling parameter $s$"
- Preconditioning perspective: Viewing optimizers as applying transformations to gradients to improve conditioning and convergence. "showed that the Muon is a class of matrix-gradient optimizers from a unifying preconditioning perspective."
- QR decomposition: Factorization of a matrix into an orthonormal matrix Q and an upper-triangular matrix R. "compute the QR decomposition "
- Randomized Singular Value Decomposition (RSVD): A probabilistic algorithm that computes low-rank SVD approximations efficiently. "RSVD($M_t$)"
- Sample complexity: The number of stochastic gradient samples required to achieve a target accuracy. "has a lower sample complexity of $O(\epsilon^{-3})$"
- Singular Value Decomposition (SVD): Matrix factorization into singular vectors and singular values, used for orthogonalization and low-rank structures.
- Spectral norm: The largest singular value of a matrix, measuring its maximum amplification of a vector. "a constraint on the spectral norm of weight matrices."
- SFO (Stochastic First-order Oracle) complexity: Complexity measured by the number of stochastic gradient calls. "has a Stochastic First-order Oracle (SFO) complexity of "
- STORM: A momentum-based variance reduction method for stochastic optimization. "the momentum-based variance reduced technique of STORM~\citep{cutkosky2019momentum}"
- Subspace-aware moment-orthogonalization (SUMO): An optimizer that orthogonalizes momentum in a low-dimensional subspace to reduce memory. "a subspace-aware moment-orthogonalization (SUMO) optimizer."
- Variance reduction: Techniques that reduce the variance of stochastic gradient estimates to improve convergence rates. "momentum-based variance reduced technique"
- ViT: Vision Transformer, a transformer-based model for image classification using patch embeddings. "training DistilGPT2 and ViT models"
- Weight decay: L2 regularization applied to parameters during optimization to prevent overfitting. "the Muon with weight decay can be viewed as special instances of a stochastic Frank-Wolfe algorithm"