Muon Converges under Heavy-Tailed Noise: Nonconvex Hölder-Smooth Empirical Risk Minimization
Abstract: Muon is a recently proposed optimizer that enforces orthogonality in parameter updates by projecting gradients onto the Stiefel manifold, leading to stable and efficient training of large-scale deep neural networks. Meanwhile, previously reported results indicate that stochastic noise in practical machine learning may exhibit heavy-tailed behavior, violating the bounded-variance assumption. In this paper, we consider the problem of minimizing a nonconvex Hölder-smooth empirical risk, a setting that is compatible with heavy-tailed stochastic noise. We then show that Muon converges to a stationary point of the empirical risk under a boundedness condition that accounts for heavy-tailed stochastic noise. In addition, we show that Muon converges faster than mini-batch SGD.
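To make the orthogonalization idea in the abstract concrete, the following is a minimal NumPy sketch, not the paper's algorithm. It assumes the projection is computed exactly as the polar factor of the matrix (via an SVD), and it assumes a simple heavy-ball momentum buffer; the function names `orthogonalize` and `muon_like_step` and the values of `lr` and `beta` are hypothetical and chosen only for illustration.

```python
import numpy as np


def orthogonalize(matrix: np.ndarray) -> np.ndarray:
    """Return the polar factor U @ Vt of `matrix`, computed with a thin SVD.

    For a full-rank matrix this is the closest matrix with orthonormal
    columns (or rows) in Frobenius norm. Assumption: an exact SVD is used
    here purely for clarity.
    """
    u, _, vt = np.linalg.svd(matrix, full_matrices=False)
    return u @ vt


def muon_like_step(weight, grad, momentum, lr=0.02, beta=0.95):
    """One illustrative update: accumulate momentum, then step along its
    orthogonalized direction. `lr` and `beta` are assumed values."""
    momentum = beta * momentum + grad
    weight = weight - lr * orthogonalize(momentum)
    return weight, momentum


# Toy usage on a random 4x3 weight matrix and gradient.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))
G = rng.standard_normal((4, 3))
M = np.zeros_like(W)

W_new, M_new = muon_like_step(W, G, M)

# Recover the applied direction; its columns should be orthonormal.
step = (W - W_new) / 0.02
```

In this sketch every singular value of the applied direction equals one, so the step size does not depend on the gradient's magnitude. This is one informal way to see why an occasional very large stochastic gradient need not produce a very large update.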