LoRA-Muon: Spectral Low-Rank Optimizer

Updated 14 June 2026

LoRA-Muon is a fine-tuning paradigm combining spectral steepest descent with low-rank adaptation to enhance optimization efficiency and stability.
It mitigates optimizer mismatch by constraining updates to a low-dimensional subspace, preserving pretrained structures and ensuring uniform spectral growth.
Empirical results show that LoRA-Muon improves memory efficiency and convergence, achieving robust performance across NLU, NLG, and vision benchmarks.

LoRA-Muon is an optimizer and fine-tuning paradigm that integrates the Muon optimizer—a spectral steepest-descent method—with the Low-Rank Adaptation (LoRA) framework. Originally motivated by the need for efficient and robust low-rank optimization in deep learning, LoRA-Muon offers unique advantages in stability, rank-invariant hyperparameter transferability, and spectral regularization. It addresses both the theoretical underpinnings of spectral descent on the low-rank manifold and practical limitations of existing factorwise adaptive optimizers. LoRA-Muon is especially salient for transfer and fine-tuning tasks, notably when navigating “optimizer mismatch” between Adam-pretrained models and spectral optimizers (Cesista et al., 11 Jun 2026, Qu et al., 11 May 2026, Kang et al., 6 Feb 2026).

1. Mathematical Foundations: Spectral Steepest Descent on the Low-Rank Manifold

LoRA-Muon is derived from steepest descent under a spectral-norm constraint. For a loss $f:\mathbb{R}^{m\times n}\to\mathbb{R}$ with parameter matrix $W_t\in\mathbb{R}^{m\times n}$ and gradient $G_t=\nabla f(W_t)$, the spectral step is

$\Delta W_t^{\star} = -\eta\,\mathrm{msign}(G_t),$

where $\mathrm{msign}(G_t) = U V^T$ for the singular value decomposition $G_t = U\Sigma V^T$. In LoRA, the weight update is parameterized as $W = A B^T$, with $A\in\mathbb{R}^{m\times r}$ and $B\in\mathbb{R}^{n\times r}$. The tangent space for updates is $T_W \mathcal{M}_r = \{\Delta A\,B^T + A\,\Delta B^T\}$, and LoRA-Muon splits the update between the two factors. Specializing to the spectral norm, the factorwise updates are

\begin{align*}
\Delta A &= -\frac{\eta}{2}\,\mathrm{msign}\left(\nabla_A f\,S_B^{-1/2}\right) S_B^{-1/2} \\
\Delta B &= -\frac{\eta}{2}\,\mathrm{msign}\left(\nabla_B f\,S_A^{-1/2}\right) S_A^{-1/2}
\end{align*}

with $S_A = A^T A$ and $S_B = B^T B$ the $r\times r$ Gram matrices of the factors, and with $\mathrm{msign}$ and the inverse square roots implemented via Newton–Schulz polynomial iterations (Cesista et al., 11 Jun 2026).
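The factorwise updates can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' implementation: it assumes $S_A = A^T A$ and $S_B = B^T B$, uses the chain rule $\nabla_A f = G B$ and $\nabla_B f = G^T A$ for $W = AB^T$, and computes `msign` and the inverse square roots exactly (SVD and eigendecomposition) instead of by Newton–Schulz. The function names `msign`, `inv_sqrt_psd`, and `lora_muon_step` are hypothetical.

```python
import numpy as np

def msign(X):
    """Matrix sign via the reduced SVD: X = U diag(s) V^T -> U V^T."""
    U, _, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ Vt

def inv_sqrt_psd(S):
    """Inverse square root of a symmetric positive-definite matrix."""
    w, Q = np.linalg.eigh(S)
    return (Q / np.sqrt(w)) @ Q.T

def lora_muon_step(A, B, G, eta):
    """Return (dA, dB) for W = A B^T, where G is the gradient w.r.t. W."""
    grad_A = G @ B            # chain rule, shape (m, r)
    grad_B = G.T @ A          # chain rule, shape (n, r)
    SA_inv_half = inv_sqrt_psd(A.T @ A)   # assumed S_A = A^T A
    SB_inv_half = inv_sqrt_psd(B.T @ B)   # assumed S_B = B^T B
    dA = -0.5 * eta * msign(grad_A @ SB_inv_half) @ SB_inv_half
    dB = -0.5 * eta * msign(grad_B @ SA_inv_half) @ SA_inv_half
    return dA, dB

rng = np.random.default_rng(0)
m, n, r, eta = 8, 6, 2, 0.1
A, B = rng.standard_normal((m, r)), rng.standard_normal((n, r))
G = rng.standard_normal((m, n))
dA, dB = lora_muon_step(A, B, G, eta)

# Each half of the tangent step has spectral norm eta / 2.
norm_A_part = np.linalg.norm(dA @ B.T, 2)
norm_B_part = np.linalg.norm(A @ dB.T, 2)
```

Under these assumptions each of the two tangent components $\Delta A\,B^T$ and $A\,\Delta B^T$ has spectral norm $\eta/2$, so the combined first-order step has spectral norm at most $\eta$.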

Compared to standard LoRA-Adam (which optimizes $A$ and $B$ with AdamW), this spectral structure leads to gauge invariance (updates depend only on the product $W = AB^T$, not the specific factorization) and a uniform spectral geometry aligned with that of full-rank Muon and Shampoo.
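Gauge invariance can be checked numerically. The sketch below assumes the Gram-matrix form $S_A = A^T A$, $S_B = B^T B$ and exact `msign` and inverse roots; it regauges the factors as $A \to AR$, $B \to BR^{-T}$ for an invertible $R$ (so that $AB^T$ is unchanged) and compares the first-order change $\Delta A\,B^T + A\,\Delta B^T$ before and after. Helper names are hypothetical.

```python
import numpy as np

def msign(X):
    U, _, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ Vt

def inv_sqrt_psd(S):
    w, Q = np.linalg.eigh(S)
    return (Q / np.sqrt(w)) @ Q.T

def delta_W(A, B, G, eta):
    """First-order change of W = A B^T under the factorwise spectral updates."""
    SA, SB = inv_sqrt_psd(A.T @ A), inv_sqrt_psd(B.T @ B)
    dA = -0.5 * eta * msign(G @ B @ SB) @ SB
    dB = -0.5 * eta * msign(G.T @ A @ SA) @ SA
    return dA @ B.T + A @ dB.T

rng = np.random.default_rng(1)
m, n, r = 7, 5, 2
A, B = rng.standard_normal((m, r)), rng.standard_normal((n, r))
G = rng.standard_normal((m, n))   # gradient w.r.t. W, identical in both gauges

# Regauge: A -> A R, B -> B R^{-T} leaves A B^T unchanged for invertible R.
R = rng.standard_normal((r, r)) + 3.0 * np.eye(r)
A2, B2 = A @ R, B @ np.linalg.inv(R).T

same_product = np.allclose(A @ B.T, A2 @ B2.T)
same_update = np.allclose(delta_W(A, B, G, 0.1), delta_W(A2, B2, G, 0.1))
```

Both flags come out true: the factors change under the regauging, but the induced change in $W$ does not.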

2. Comparison to Adam, Muon, and Optimizer Mismatch

Adam and Muon have fundamentally distinct implicit biases. Adam leverages per-parameter second moments and converges to minimum max-norm solutions, embedding a “max-norm” bias. Muon, via spectral normalization, equalizes step sizes across singular directions and induces a spectral-norm bias in the learned weights (Qu et al., 11 May 2026). This dichotomy leads to optimizer mismatch: attempting Muon fine-tuning on Adam-pretrained parameters can degrade performance, as the spectral structure imposed by Muon disrupts the max-norm “shape” established by Adam.
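The two geometries can be contrasted on a single gradient. As a simplifying assumption, Adam is replaced here by its limiting case without moment averaging, elementwise sign descent, which is steepest descent under the max-norm. Muon is represented by `msign` computed from an exact SVD.

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.standard_normal((5, 3))

# Stand-in for Adam (assumption): elementwise sign, every entry has magnitude 1.
sign_step = np.sign(G)

# Muon-style step: msign(G) = U V^T, every singular value equals 1.
U, _, Vt = np.linalg.svd(G, full_matrices=False)
msign_step = U @ Vt

sv_sign = np.linalg.svd(sign_step, compute_uv=False)
sv_msign = np.linalg.svd(msign_step, compute_uv=False)
```

The sign step is uniform across entries but has unequal singular values, while the `msign` step is uniform across singular directions. The two normalizations therefore equalize different quantities, which is the source of the mismatch discussed in this section.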

Empirically, full fine-tuning with Muon on Adam-pretrained models results in a relative perplexity regression compared to Adam-matched runs, along with a shifted and degraded learning-rate curve. This mismatch is proportional to update magnitude: stronger updates more readily override pretrained knowledge, exacerbating forgetting (Qu et al., 11 May 2026).

Low-rank adaptation with LoRA mitigates this mismatch, as the update is constrained to a rank-$r$ subspace, inherently limiting the model's deviation from its pretrained state. The worst-case inflation of mismatch in the spectral norm depends on $r$, vanishing as the rank shrinks and recovering the full fine-tuning case as the rank approaches full rank.
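The constraint on the update can be made concrete: any first-order change of $W = AB^T$ produced by perturbing the factors has the form $\Delta A\,B^T + A\,\Delta B^T$ and therefore rank at most $2r$. The small check below uses random factors and perturbations; the shapes are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 10, 8, 2
A, B = rng.standard_normal((m, r)), rng.standard_normal((n, r))
dA, dB = rng.standard_normal((m, r)), rng.standard_normal((n, r))

# First-order change of W = A B^T induced by perturbing the factors.
delta_W = dA @ B.T + A @ dB.T

rank_of_update = np.linalg.matrix_rank(delta_W)
full_rank = min(m, n)
```

A dense update could reach rank `full_rank`, whereas the adapter update stays within rank `2 * r`.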

3. Dynamical Properties: Uniform Spectral Growth and Global Convergence

A distinctive empirical and theoretical signature of LoRA-Muon is uniform spectral growth under spectral orthogonalization. In LLM fine-tuning, the singular values of the adapter product $AB^T$ grow in parallel ("lock-step growth"), in stark contrast to AdamW, where larger modes dominate initially ("largest-first"). This effect, observed across practical settings (e.g., RoBERTa-Base and LLaMA-3.2-1B), persists with momentum and multi-factor extensions (Kang et al., 6 Feb 2026).

Theoretical analysis of the continuous-time spectral gradient flow (SpecGF) for LoRA factorization proves that, for almost all initializations, all singular values of the product grow in lock step rather than largest-first. With regularization, global convergence to the best rank-$r$ solution occurs almost surely. This spectral equalization is not present in AdamW or standard gradient flow, where rank learning proceeds unevenly (Kang et al., 6 Feb 2026).

4. Implementation: Algorithmic Formulation and Complexity

The LoRA-Muon algorithm avoids QR decompositions and the storage of second moments. Each update consists of gradient evaluation, first-moment momentum, Gram matrix construction, inverse square-root computation (Newton–Schulz), and msign evaluation, all expressed as standard GEMMs plus small $r\times r$ operations. Weight decay employs a split rule that divides the decay between the two factors so that the product follows the desired linear dynamics. This split ensures that the low-rank update for $W = AB^T$ aligns with the correct regularization schedule and preserves gauge symmetry (Cesista et al., 11 Jun 2026).
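One way to realize a split weight decay is sketched below. The specific rule is an assumption chosen for illustration and is not taken from the source: each factor is scaled by $\sqrt{1-\eta\lambda}$, so the product $AB^T$ shrinks by exactly $1-\eta\lambda$, the factor a dense decoupled weight-decay step would apply to $W$. The function name `split_weight_decay` is hypothetical.

```python
import numpy as np

def split_weight_decay(A, B, eta, lam):
    """Scale both factors equally so that A B^T decays by (1 - eta * lam).

    Assumed rule for illustration; requires eta * lam < 1.
    """
    scale = np.sqrt(1.0 - eta * lam)
    return scale * A, scale * B

rng = np.random.default_rng(0)
A, B = rng.standard_normal((6, 2)), rng.standard_normal((5, 2))
eta, lam = 0.1, 0.01

A_new, B_new = split_weight_decay(A, B, eta, lam)
dense_decay = (1.0 - eta * lam) * (A @ B.T)   # what dense decay would do to W
```

Because both factors receive the same scalar, the rule commutes with any regauging $A \to AR$, $B \to BR^{-T}$, which is consistent with the gauge symmetry mentioned above.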

Per-update computational cost and persistent state scale with the size of the low-rank factors for LoRA-Muon, rather than with the full $m\times n$ matrix as for dense Muon, which is a substantial saving when $r \ll \min(m, n)$. Compared to other spectral low-rank methods (Spectron, LoRA-RITE), LoRA-Muon is both memory- and compute-efficient, and is invariant under arbitrary factor scaling, unlike Spectron (Cesista et al., 11 Jun 2026).

Table: Per-Update Characteristics for Various Methods

The table compares Dense Muon, LoRA-Muon, and Spectron by optimizer FLOPs and persistent state per update. The FLOP entries are parameterized by the Newton–Schulz step count for msign and by a separate step count for the inverse root.
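The two Newton–Schulz routines referred to here can be sketched as follows. The polynomial, the Frobenius-norm scaling, and the step count of 40 are assumptions made for a simple, convergent illustration; the source does not specify them, and practical implementations may use different polynomials and far fewer steps. The first routine is the classical cubic iteration for `msign`; the second is the standard coupled iteration for the inverse square root of a symmetric positive-definite matrix.

```python
import numpy as np

def newton_schulz_msign(G, steps=40):
    """Approximate msign(G) with the cubic iteration X <- 1.5 X - 0.5 X X^T X."""
    X = G / np.linalg.norm(G)        # Frobenius scaling: singular values in (0, 1]
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ (X.T @ X)
    return X

def newton_schulz_inv_sqrt(S, steps=40):
    """Approximate S^{-1/2} for symmetric positive-definite S (coupled iteration)."""
    c = np.linalg.norm(S)            # scale so eigenvalues of S / c lie in (0, 1]
    I = np.eye(S.shape[0])
    Y, Z = S / c, I.copy()
    for _ in range(steps):
        T = 0.5 * (3.0 * I - Z @ Y)
        Y, Z = Y @ T, T @ Z          # Y -> (S/c)^{1/2}, Z -> (S/c)^{-1/2}
    return Z / np.sqrt(c)            # undo the scaling

rng = np.random.default_rng(0)
G = rng.standard_normal((8, 3))
S = G.T @ G                          # r x r Gram matrix, positive definite here

U, _, Vt = np.linalg.svd(G, full_matrices=False)
msign_error = np.abs(newton_schulz_msign(G) - U @ Vt).max()

S_inv_half = newton_schulz_inv_sqrt(S)
inv_sqrt_error = np.abs(S_inv_half @ S @ S_inv_half - np.eye(3)).max()
```

Both routines use only matrix multiplications, which is why the per-update cost can be counted in GEMMs and Newton–Schulz steps.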

5. Empirical Performance and Hyperparameter Recommendations

In compute-matched studies (e.g., TinyShakespeare), a rank-2 LoRA-Muon proxy attains the same optimal learning rate and validation loss as dense Muon, and rank-32 LoRA-Muon outperforms the dense baseline (Cesista et al., 11 Jun 2026). On extensive benchmarks spanning NLU (GLUE/T5-Base), NLG (Llama 2-7B), and vision (CLIP ViT-B/32), LoRA-Muon reliably matches or outperforms LoRA-Adam, closes the Adam–Muon gap, and halves optimizer state memory (Qu et al., 11 May 2026).
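The halving of optimizer state follows from simple bookkeeping. The sketch below assumes that AdamW keeps two moment buffers per adapter parameter and that LoRA-Muon keeps a single first-moment buffer, with any transient $r\times r$ workspace ignored. The layer shape and rank are arbitrary illustrative values, and `optimizer_state_entries` is a hypothetical helper.

```python
def optimizer_state_entries(m, n, r, buffers_per_param):
    """Persistent optimizer entries for one adapter pair A (m x r), B (n x r)."""
    return buffers_per_param * (m + n) * r

m, n, r = 4096, 4096, 16   # illustrative layer shape and rank (assumed)

lora_adam_state = optimizer_state_entries(m, n, r, buffers_per_param=2)
lora_muon_state = optimizer_state_entries(m, n, r, buffers_per_param=1)
dense_muon_state = m * n   # one momentum buffer for the full matrix
```

With these numbers the LoRA-Muon state is half the LoRA-Adam state and a small fraction of the dense Muon state.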

Learning-rate robustness is a key property: LoRA-Muon selects the same optimal linear adapter rate across rank, width, depth, and gauge scaling. This is not the case for Adam-based factorwise optimizers, whose optimal rates degrade or fail to transfer. Moderate rank (up to 32) is recommended to balance expressiveness with mismatch suppression. For regularization, split weight decay with a coefficient matched to dense Muon is effective.

6. LoRA-Muon in Fine-Tuning and Optimizer Mismatch Mitigation

When fine-tuning Adam-pretrained models, naive application of Muon induces optimizer mismatch and catastrophic forgetting. LoRA-Muon constrains update magnitude, preserves Adam-learned structure, and thus sharply reduces the mismatch both theoretically and empirically. On tasks where the mismatch is severe (e.g., MetaMath $\to$ GSM8K), LoRA-Muon outperforms LoRA-Adam at low and moderate ranks; where mismatch is mild, LoRA-Muon matches or slightly exceeds Adam across all ranks. LoRA-Muon also incurs less catastrophic forgetting than full Muon fine-tuning.

Variants of LoRA (rsLoRA, LoRA-One, PiSSA, AdaLoRA, LoRA-RITE, DoRA) improve LoRA-Adam to a degree, but optimizer-agnostic enhancements do not further boost LoRA-Muon’s peak accuracy. Techniques that amplify update norms can reinstate mismatch effects.

Practical guidance: Always use LoRA when fine-tuning Adam-pretrained models with Muon. Tune learning rates empirically, as optimal rates do not transfer from Adam to Muon. Recalibrate rank and hyperparameters; avoid transplanting LoRA variants tuned for Adam without verification on Muon (Qu et al., 11 May 2026).

7. Limitations and Future Directions

Current empirical validation is limited primarily to TinyShakespeare and moderate-scale NLU/NLG/vision benchmarks. Full LLM pretraining, downstream fine-tuning at larger scale, and large-scale distributed settings remain to be systematically tested. Open areas include the transfer of full hyperparameter landscapes beyond learning rate (e.g., joint sweeps over pairs of hyperparameters), further unitary-invariant norm adaptations (e.g., Ky-Fan), and adaptive second-order moment transport on the low-rank manifold. Interactions with quantization regimes, mixed precision, and distributed computation are unexplored (Cesista et al., 11 Jun 2026).

A plausible implication is that the spectral geometry and optimizer-invariant properties of LoRA-Muon could yield increased resilience to initialization pathologies, scaling instabilities, and irregular optimization landscapes that hamper existing LoRA-Adam implementations, particularly as model size and heterogeneity grow.

References

"LoRA-Muon: Spectral Steepest Descent on the Low-Rank Manifold" (Cesista et al., 11 Jun 2026)
"Can Muon Fine-tune Adam-Pretrained Models?" (Qu et al., 11 May 2026)
"Uniform Spectral Growth and Convergence of Muon in LoRA-Style Matrix Factorization" (Kang et al., 6 Feb 2026)
