Characterize Muon’s inductive biases and training trajectory

Determine the inductive biases of the Muon optimizer when training deep neural networks, characterize the trajectory it follows through the loss landscape that yields rapid convergence, and ascertain the implications of this trajectory for the properties of the final solution to which Muon-optimized models converge.

Background

The paper emphasizes that optimizer choice can fundamentally alter learning trajectories and the functional properties of solutions, beyond affecting training speed. Muon—an optimizer based on orthogonalizing gradient updates—has gained popularity due to strong empirical performance and faster convergence, yet its inductive biases and the path it takes in the loss landscape remain insufficiently understood.
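To make the mechanism concrete, here is a minimal sketch of a Muon-style update in NumPy: a momentum buffer is accumulated, and the update direction is approximately orthogonalized with a Newton-Schulz iteration that pushes its singular values toward one while keeping its singular vectors. The function names, coefficients, and hyperparameters are illustrative assumptions, not the paper's reference implementation.

```python
import numpy as np

def orthogonalize(G, steps=5):
    """Approximately orthogonalize G with a Newton-Schulz iteration:
    keep G's singular vectors but push its singular values toward 1.
    Coefficients follow the commonly used quintic iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)  # Frobenius normalization so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:                      # iterate on the short side for efficiency
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

def muon_step(W, G, M, lr=0.02, beta=0.95):
    """One illustrative Muon-style step (hypothetical hyperparameters):
    accumulate momentum, then step along the orthogonalized direction."""
    M = beta * M + G
    W = W - lr * orthogonalize(M)
    return W, M
```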

The authors analyze a simplified variant, Spectral Gradient Descent (Spectral GD), to provide initial theoretical insights. Their results suggest Spectral GD removes the simplicity bias seen with gradient descent, offering speed benefits but potentially harming structural generalization in some settings. Despite these initial steps, the paper explicitly notes that Muon's biases, its precise training trajectory, and the consequences for converged solutions remain to be fully characterized.
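For contrast, the spectral update admits an exact closed form: if the gradient factors as G = U diag(S) V^T, the step direction is the orthogonal polar factor U V^T, so every singular direction is weighted equally. A minimal NumPy sketch follows; the step size and absence of extra scaling are assumptions, and the paper's precise definition of Spectral GD may differ in such details.

```python
import numpy as np

def spectral_gd_step(W, G, lr=0.01):
    """One Spectral GD step: replace the gradient G = U diag(S) V^T
    with its orthogonal polar factor U V^T, i.e. set all singular
    values to 1 before stepping. This equal weighting of gradient
    directions is what removes the uneven emphasis plain GD retains."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return W - lr * (U @ Vt)
```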

References

But we still don't know Muon's biases, we don't know which trajectory in the loss landscape Muon takes to be so quick, and we don't know the implications of that for the solution a Muon-optimized model converges to.

To Use or not to Use Muon: How Simplicity Bias in Optimizers Matters (2603.00742 - Dragutinović et al., 28 Feb 2026) in Section 1 (Introduction)