Rethinking Muon Beyond Pretraining: Spectral Failures and High-Pass Remedies for VLA and RLVR

Published 19 May 2026 in cs.LG | (2605.19282v1)

Abstract: Muon is a matrix-aware optimizer that leverages Newton-Schulz (NS) iterations to enforce spectral gradient orthogonalization by driving all singular values of the momentum matrix toward 1. While this uniform spectral whitening enhances exploration and outperforms AdamW in LLM pretraining, we show it could lead to fundamental limitations beyond pretraining in two regimes: (i) cross-modality vision-language-action (VLA) training, where inherently low-rank action-module gradients cause amplification of noisy tail directions, and (ii) reinforcement learning with verifiable rewards (RLVR), where low-SNR gradients and the need to preserve per-head specialization from prior training make whitening unstable. To address these challenges, we propose Pion, a drop-in replacement for Muon that preserves its computational efficiency while replacing uniform spectral whitening with a two-stage Promotion+Suppression mechanism, which we call the high-pass NS iteration. This design induces a sharp spectral high-pass effect, anchoring dominant singular values at 1 while suppressing noisy tail components toward 0, with controllable filter strength. To preserve pretrained per-head heterogeneity, Pion also supports a per-head mode that applies updates independently across attention heads via a simple reshape, at no extra cost. In VLA training on LIBERO and LIBERO-Plus, Pion consistently outperforms both baselines across $\ell_1$-regression (VLA-Adapter) and flow-matching (VLANeXt) architectures, e.g., reaching 100% success rate on LIBERO Object after 1,500 training steps with VLA-Adapter, vs. 97.0% for Muon and only 32.2% for AdamW. The advantage of Pion further extends to a real Franka Research 3 robot with a $\pi_{0.5}$ backbone under the DROID setup on three grasp-and-place tasks. In RLVR post-training on Qwen3-1.7B/4B with GRPO and GMPO, Pion also outperforms AdamW on MATH and GSM8K while Muon collapses to zero.


Authors (5)

Summary

The paper identifies Muon’s isotropic spectral whitening as a key deficit in handling low-rank and noisy gradients in VLA and RLVR applications.
It introduces Pion, a high-pass variant that strategically promotes dominant singular values while suppressing spectral noise to enhance learning stability.
Empirical evaluations show Pion’s superior convergence speed, robustness, and task performance across simulated and real-world benchmarks.


Introduction and Motivation

The paper "Rethinking Muon Beyond Pretraining: Spectral Failures and High-Pass Remedies for VLA and RLVR" (2605.19282) examines the limits of matrix-aware optimizers, focusing on Muon, a spectral optimizer that uses Newton–Schulz (NS) iterations to orthogonalize the momentum matrix. Muon's spectral whitening (driving all singular values toward 1) is effective in large-scale LLM pretraining, but the authors show that it breaks down in two settings beyond pretraining: vision-language-action (VLA) multimodal training and reinforcement learning with verifiable rewards (RLVR). The paper traces these failures to the spectrum of the update and proposes Pion, a high-pass variant of Muon designed to preserve informative singular components and suppress spectral noise.

Spectral Analysis of Muon’s Limitations

Muon’s update rule reduces to $\boldsymbol{\Theta}_t = \boldsymbol{\Theta}_{t-1} - \eta \, \mathrm{msign}(\mathbf{M}_t)$, where $\boldsymbol{\Theta}_t$ is the weight matrix, $\eta$ is the learning rate, and $\mathrm{msign}$ maps all singular values of the momentum matrix $\mathbf{M}_t$ to one while keeping its singular vectors. This isotropic spectral whitening enhances exploration and outperforms AdamW in LLM pretraining, but the paper shows that it does not adapt to the rank or noise level of the gradient, which causes problems in two regimes:

1. VLA (Vision-Language-Action):

VLA models are composed of vision, language, and action modules, each with a distinct spectral gradient profile. Empirical analysis shows that the action-module gradients are consistently low-rank. Because Muon's isotropic whitening sets every singular value to one, it amplifies the spectral tail, which is dominated by noise, to the same magnitude as the few informative directions. The result is slower convergence and lower final task success.

Figure 1: Limitations of Muon in VLA training. Gradient effective rank separates the vision, language, and action modules, and the action module's rank stays persistently low.
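To make the low-rank argument concrete, the sketch below measures effective rank before and after whitening on a synthetic gradient. It is an illustration under stated assumptions, not the paper's measurement code: effective rank is taken to be the exponential of the Shannon entropy of the normalized singular values (a common definition; the summary does not say which one the authors use), `msign` is computed exactly by SVD, and the matrix sizes, rank, and noise scale are arbitrary choices.

```python
import numpy as np

def effective_rank(mat):
    """Effective rank: exp of the Shannon entropy of the normalized singular values.

    Assumption: this entropy-based definition stands in for the paper's erank.
    """
    s = np.linalg.svd(mat, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]  # drop exact zeros so the logarithm is defined
    return float(np.exp(-(p * np.log(p)).sum()))

def msign_svd(mat):
    """Exact msign via SVD: keep the singular vectors, set every singular value to 1."""
    u, _, vh = np.linalg.svd(mat, full_matrices=False)
    return u @ vh

rng = np.random.default_rng(0)
m, n, r = 64, 32, 2  # hypothetical layer shape and signal rank

# Synthetic "action-module" gradient: a rank-2 signal plus small dense noise.
signal = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
noise = 0.01 * rng.standard_normal((m, n))
grad = signal + noise

erank_raw = effective_rank(grad)
erank_whitened = effective_rank(msign_svd(grad))
print(f"erank of raw gradient: {erank_raw:.2f}")
print(f"erank after msign:     {erank_whitened:.2f}")
```

In this toy setting the raw gradient has an effective rank close to the signal rank, while the whitened update has full effective rank `min(m, n)`, because the 30 noise directions now carry the same weight as the 2 signal directions.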

2. RLVR (Reinforcement Learning with Verifiable Rewards):

RLVR post-training produces low-SNR policy gradients, with the informative signal concentrated in a handful of dominant spectral directions. Muon’s uniform whitening again elevates the noisy components, which destabilizes fine-tuning and causes rapid model collapse, whereas AdamW converges stably in the same settings.

These observations are corroborated by gradient SNR measurements, per-module effective-rank (erank) measurements, and downstream task success rates, which together quantify Muon’s failure to adapt to non-uniform spectral statistics.
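The update rule and the low-SNR failure described in this section can be sketched in a few lines. The code below is a minimal illustration under explicit assumptions: it approximates `msign` with the textbook cubic Newton-Schulz iteration rather than Muon's tuned polynomial, and it uses a hypothetical gradient made of one dominant direction plus dense noise. The helper `signal_fraction` is introduced here for illustration only; it reports how much of an update's squared Frobenius norm lies along the true signal direction.

```python
import numpy as np

def msign_newton_schulz(mat, steps=30):
    """Approximate msign(mat) with the classic cubic Newton-Schulz iteration.

    Assumption: X <- 1.5 X - 0.5 X (X^T X) is the textbook iteration,
    not the tuned polynomial used in Muon implementations.
    """
    # Frobenius normalization puts all singular values in (0, 1].
    x = mat / (np.linalg.norm(mat) + 1e-12)
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ (x.T @ x)
    return x

def signal_fraction(update, u, v):
    """Share of the update's squared Frobenius norm lying along u v^T (u, v unit vectors)."""
    return float((u @ update @ v) ** 2 / np.sum(update ** 2))

rng = np.random.default_rng(0)
m, n = 16, 8  # hypothetical layer shape

# One informative direction u v^T plus dense noise in every direction.
u = rng.standard_normal(m)
u /= np.linalg.norm(u)
v = rng.standard_normal(n)
v /= np.linalg.norm(v)
grad = 5.0 * np.outer(u, v) + 0.1 * rng.standard_normal((m, n))

whitened = msign_newton_schulz(grad)
frac_raw = signal_fraction(grad, u, v)
frac_whitened = signal_fraction(whitened, u, v)
print(f"signal share of raw gradient:    {frac_raw:.3f}")
print(f"signal share of whitened update: {frac_whitened:.3f}")
```

With a single informative direction in an update of rank `min(m, n) = 8`, whitening leaves roughly one eighth of the update's energy on the signal, and the rest goes to directions that were noise in the raw gradient. This is the mechanism the section describes, shown on synthetic data rather than on real policy gradients.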

High-Pass NS Remedy: Pion

Motivated by the spectral mismatch, the authors introduce Pion (Spectral High-pass Optimization on Momentum). Pion replaces Muon’s default NS iteration with a two-stage polynomial: Promotion and Suppression. The Promotion stage amplifies leading singular values, anchoring them at one, while Suppression contracts the spectral tail toward zero, implementing a sharp high-pass filter. The coefficients for both stages are derived analytically to enforce fixed points and monotonicity.

Figure 2: Visualization of the scalar NS polynomial mapping under Muon, Promotion, Suppression, and their composition in Pion, illustrating sharply differentiated spectral filtering.
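The summary does not reproduce the paper's analytically derived coefficients, so the sketch below uses stand-in polynomials chosen only to show the qualitative two-stage shape. Everything specific here is an assumption: the Promotion stand-in is the classic cubic map $s \mapsto 1.5s - 0.5s^3$, the Suppression stand-in is $s \mapsto 0.5s + 1.5s^3 - s^5$ (which has attracting fixed points at 0 and 1 and a repelling one at $1/\sqrt{2}$), and the ordering, step counts, and function names are hypothetical.

```python
import numpy as np

def odd_poly_step(x, a, b, c=0.0):
    """One matrix step X <- a X + b X A + c X A^2 with A = X^T X.

    Each singular value s of X is mapped to a*s + b*s**3 + c*s**5,
    while the singular vectors are left unchanged.
    """
    gram = x.T @ x
    return a * x + b * (x @ gram) + c * (x @ gram @ gram)

# Hypothetical stand-in coefficients, NOT the values derived in the paper.
PROMOTE = (1.5, -0.5, 0.0)   # s -> 1.5 s - 0.5 s^3: lifts values in (0, 1) toward 1
SUPPRESS = (0.5, 1.5, -1.0)  # s -> 0.5 s + 1.5 s^3 - s^5: keeps 1 fixed, shrinks small s toward 0

def high_pass_toy(x, promote_steps=2, suppress_steps=5):
    """Toy high-pass map: a few promotion steps followed by suppression steps."""
    for _ in range(promote_steps):
        x = odd_poly_step(x, *PROMOTE)
    for _ in range(suppress_steps):
        x = odd_poly_step(x, *SUPPRESS)
    return x

# Build a matrix with known singular values (already scaled to at most 1).
rng = np.random.default_rng(0)
u, _ = np.linalg.qr(rng.standard_normal((6, 4)))
v, _ = np.linalg.qr(rng.standard_normal((4, 4)))
s_in = np.array([1.0, 0.9, 0.1, 0.05])
x0 = u @ np.diag(s_in) @ v.T

s_out = np.linalg.svd(high_pass_toy(x0), compute_uv=False)
print("input singular values: ", s_in)
print("output singular values:", np.round(s_out, 4))
```

In this toy, the two dominant singular values end up at approximately 1 and the two tail values are driven close to 0. Increasing `suppress_steps` sharpens the cutoff, which loosely mirrors the controllable filter strength mentioned in the abstract. Like Muon's NS iteration, each step uses only matrix multiplications.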

Pion also supports a per-head mode for transformers: via a simple reshape, the update is computed independently for each attention head at no extra cost. This preserves the pretrained per-head heterogeneity that RLVR post-training relies on to maintain headwise specialization.
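The reshape idea can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the momentum of a projection weight is assumed to have shape `(n_heads * head_dim, d_model)` with each head owning a contiguous block of rows, and exact SVD-based `msign` stands in for whatever spectral map the optimizer applies. The function name `per_head_msign` is hypothetical.

```python
import numpy as np

def msign_svd(mat):
    """Exact msign via SVD; accepts a stack of matrices on the leading axis."""
    u, _, vh = np.linalg.svd(mat, full_matrices=False)
    return u @ vh

def per_head_msign(momentum, n_heads):
    """Apply the spectral map to each head's slice of the momentum separately.

    Assumed layout: momentum has shape (n_heads * head_dim, d_model), and
    each head owns a contiguous block of head_dim rows.
    """
    out_dim, d_model = momentum.shape
    if out_dim % n_heads != 0:
        raise ValueError("row count must be divisible by n_heads")
    head_dim = out_dim // n_heads
    # Reshape to (n_heads, head_dim, d_model) so each head is its own matrix.
    heads = momentum.reshape(n_heads, head_dim, d_model)
    whitened = msign_svd(heads)  # batched over the leading (head) axis
    return whitened.reshape(out_dim, d_model)

rng = np.random.default_rng(0)
n_heads, head_dim, d_model = 4, 8, 16  # hypothetical sizes
momentum = rng.standard_normal((n_heads * head_dim, d_model))
update = per_head_msign(momentum, n_heads)
print(update.shape)
```

Because each head is processed as its own matrix, the spectrum of one head's update cannot leak into another's, which is the property the per-head mode is meant to provide.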

Figure 3: Success rates comparison of AdamW, Muon, and Pion for VLA-Adapter on LIBERO, showing Pion’s pronounced gains and faster convergence across diverse task suites.

Empirical Evaluation and Strong Numerical Claims

Experiments encompass both simulated and real-world robotic VLA settings (LIBERO, LIBERO-Plus, Franka Research 3 robot), and RLVR post-training with GRPO/GMPO on Qwen3-1.7B/4B (MATH, GSM8K benchmarks). Key empirical findings include:

In VLA-Adapter training, Pion reaches a 100% success rate on LIBERO Object after 1,500 training steps, vs. 97.0% for Muon and 32.2% for AdamW, indicating faster convergence as well as higher task success.
On the perturbed settings of LIBERO-Plus, Pion is more robust under distribution shift, outperforming Muon and AdamW on the language, noise, and robot variants.
In the real-robot evaluation on three grasp-and-place tasks, Pion achieves an 85.6% average success rate, compared to 38.9% for Muon and 31.1% for AdamW.
In RLVR post-training, Muon collapses to zero accuracy, while Pion matches or exceeds AdamW, with faster convergence and higher gradient SNR.

Figure 4: Validation accuracy across RLVR settings, showing Pion’s stability and AdamW-level performance; Muon yields near-zero accuracy in all cases.

A reverse ablation with a low-pass Muon variant (LPMuon) indicates that high-pass, rather than low-pass, spectral shaping is what stabilizes learning in these regimes.

Figure 5: Scalar map of LPMuon (low-pass Muon) and the corresponding accuracy trends, supporting the claim that spectral high-pass filtering (Pion) is necessary for RLVR.

Theoretical and Practical Implications

The study reframes spectral momentum orthogonalization as a flexible polynomial design, directly linking optimizer efficacy to spectral filtering profiles. Practically, Pion delivers adaptive robustness across heterogeneous model modules and noise-prone post-training, without overhead beyond Muon's NS iterations. The per-head mode provides an effective mechanism for respecting pretrained attention head structure, crucial for transformer architectures. These insights portend broader applicability of matrix-aware optimizers in multimodal, reinforcement-driven, and embodied agent contexts.

Conclusion

The paper rigorously details Muon’s spectral failures in VLA and RLVR, attributing them to a universal limitation: indiscriminate spectral whitening that corrupts informative updates in low-rank and low-SNR regimes. Pion’s high-pass NS, analytically derived and efficiently implemented, consistently outperforms both Muon and AdamW across tasks, architectures, and hardware. The findings underpin future advances in adaptive spectral optimization, suggesting further exploration around dynamic cutoff mechanisms, heterogeneity-aware updates, and cross-domain applications of matrix-aware optimizers.
