Multi-step recovery and convergence for exact Muon iterates

Establish that, in the linear associative memory model with Gaussian embeddings and power-law item frequencies p_i ∝ i^{-α} (α > 1), the exact Muon iterates W_{t+1} = W_t + η h_{λ_t}(G_t), where G_t = -∇_W L(W_t; 𝔅_t) and h_{λ}(z) = z / √(z^2 + λ^2), achieve the same multi-step recovery and convergence rates as those proved under the thresholded-gradient approximation. Concretely, for schedules η ≍ √d and λ_t = ˜Θ(d_{t+1}^{-α} √d), prove that after t steps all items of rank up to d_t = ˜Θ(min{ d^{2 - (1 - 1/(2α))^t}, B^{1/α} }) are recovered with high probability, and that the loss satisfies L(W_t) ≤ ˜O(d_t^{1-α}).
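As a rough illustration of why λ_t behaves like a soft threshold, the scalar map h_λ can be evaluated directly. The sketch below assumes that h_λ is applied to the singular values of G_t, which the statement above does not spell out. The value of `lam` and the two test inputs are hypothetical, chosen only to show the two regimes.

```python
import math


def h(z: float, lam: float) -> float:
    """Scalar map h_lambda(z) = z / sqrt(z^2 + lambda^2) from the problem statement."""
    return z / math.sqrt(z * z + lam * lam)


# Hypothetical threshold, for illustration only (not a value from the paper).
lam = 1e-2

# |z| much larger than lambda: the output saturates near sign(z) = 1.
large = h(1.0, lam)

# |z| much smaller than lambda: the output is approximately z / lambda.
small = h(1e-4, lam)

print(f"h(1.0)  = {large:.6f}")
print(f"h(1e-4) = {small:.6f}  (z / lambda = {1e-4 / lam:.6f})")
```

Under that assumption, spectral components well above λ_t are pushed toward unit magnitude, while components well below it are scaled down by roughly 1/λ_t and stay small.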

Background

The paper studies Muon—an optimizer based on spectral orthogonalization—on a linear associative memory task with Gaussian input/output embeddings and a Zipf-like power-law item frequency. For multi-step training, the authors introduce a simplifying deflation/thresholded-gradient approximation that removes already-recovered items from the gradient and analyze the resulting Muon dynamics.

Under this thresholded update, they prove that t steps of Muon recover the top ˜Θ(min{ d^{2 − (1 − 1/(2α))^t}, B^{1/α} }) items with a fixed learning rate and suitable λ_t, and that the population loss decreases accordingly. They conjecture that these same recovery and convergence rates hold for the exact Muon iterates computed using the true gradient, which would remove the simplifying approximation and fully justify the observed empirical scaling laws.
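To get a feel for how the recovered-rank scale min{ d^{2 − (1 − 1/(2α))^t}, B^{1/α} } grows with the step count t, the expression can be evaluated numerically. This is a minimal sketch only: it drops every constant and logarithmic factor hidden by the ˜Θ and ˜O notation, the function names are hypothetical, and the values of d, B, and α are made up for illustration.

```python
def recovered_rank_scale(d: float, B: float, alpha: float, t: int) -> float:
    """Leading-order scale min{ d^(2 - (1 - 1/(2*alpha))^t), B^(1/alpha) }.

    Constants and log factors hidden by the tilde notation are ignored.
    """
    exponent = 2.0 - (1.0 - 1.0 / (2.0 * alpha)) ** t
    return min(d ** exponent, B ** (1.0 / alpha))


def loss_scale(d: float, B: float, alpha: float, t: int) -> float:
    """Leading-order loss scale d_t^(1 - alpha), again without constants."""
    return recovered_rank_scale(d, B, alpha, t) ** (1.0 - alpha)


# Hypothetical parameters, for illustration only.
d, B, alpha = 100, 1e12, 2.0
for t in range(1, 6):
    d_t = recovered_rank_scale(d, B, alpha, t)
    print(f"t={t}: d_t ~ {d_t:.1f}, loss ~ {loss_scale(d, B, alpha, t):.2e}")
```

With these inputs the exponent of d rises from 1 + 1/(2α) at t = 1 toward 2 as t grows, and a smaller B would cap the scale at B^{1/α}.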

References

The recovery and convergence rates of Theorem~\ref{thm:multi} also hold for the exact Muon iterates $\bW_{t+1} = \bW_t + \eta h_{\lam_t}(\bG_t)$.

Sharp Capacity Scaling of Spectral Optimizers in Learning Associative Memory (2603.26554 - Kim et al., 27 Mar 2026) in Conjecture, Section 5 (Multiple Steps of Muon)