AMUSE: Anytime Muon with Stable Gradient Evaluation

Published 21 May 2026 in cs.LG | (2605.22432v1)

Abstract: Modern deep learning commonly relies on AdamW with prescribed learning rate schedules, but recent works challenge both components: Schedule-Free optimization removes explicit schedules via iterate averaging, and Muon improves the update geometry by orthogonalizing momentum for matrix parameters. Despite Muon's strong empirical performance, its underlying mechanism remains partially understood. We study Muon through the river-valley loss landscape, where useful training progress occurs along a flat, low-curvature bulk subspace (the river), while high-curvature dominant directions form steep valley walls that induce oscillations. We empirically show that while Muon's orthogonalization accelerates river progress by increasing the bulk component, it also amplifies dominant-direction noise, causing oscillatory trajectories. Building on this, we propose Anytime MUon with Stable gradient Evaluation (AMUSE), which integrates Muon's rapid bulk progress with the stabilizing effect of Schedule-Free averaging. AMUSE uses a time-varying interpolation coefficient that initially evaluates gradients near the fast Muon sequence for rapid adaptation, then gradually shifts toward the stable averaged sequence to suppress valley-wall oscillations. As a result, AMUSE requires no learning rate schedules and supports anytime training. Across vision tasks and LLM pretraining, AMUSE consistently improves the performance-iteration Pareto frontier over (Schedule-Free) AdamW and Muon.

Authors (6)

Summary

The paper introduces AMUSE, which fuses Muon’s geometric update with schedule-free stabilization using a dynamic, time-varying interpolation coefficient.
It empirically outperforms traditional optimizers by efficiently navigating flat loss landscapes while suppressing high-curvature oscillations.
Its minimal hyperparameter overhead and anytime training capability demonstrate strong potential for scalable and robust deep learning optimization.

Motivation and Background

The AMUSE optimizer is designed to address two fundamental challenges in deep learning training: unstable dynamics in high-curvature directions and inefficiency in exploiting flat, low-curvature directions in the loss landscape. Standard recipes, such as AdamW with learning rate decay schedules, have dominated training for modern architectures, particularly transformers. However, recent developments in schedule-free optimization and the Muon optimizer—where momentum is orthogonalized for matrix-valued parameters—have demonstrated notable empirical improvements over traditional methods.

Muon exploits matrix structure, promoting faster progress along flat directions (“the river”), but it amplifies noise and high-curvature oscillations (“valley walls”). Schedule-Free (SF) optimizers stabilize trajectories by gradient evaluation at interpolated points between the current and averaged iterates, filtering high-curvature fluctuations and enabling anytime training without explicit learning rate schedules. AMUSE integrates Muon's geometric update with SF’s stabilizing mechanism using a time-varying interpolation coefficient that shifts gradient evaluation from Muon’s fast sequence to the stable averaged sequence, facilitating rapid early adaptation and late-stage stability.

River-Valley Loss Landscape: Analysis and Principles

The optimization landscape of deep neural networks is highly anisotropic. The Hessian spectrum typically includes a few large outlier eigenvalues (dominant subspace) and a broad, lower spectrum (bulk subspace). This geometric structure underpins the river-valley analogy: dominant directions form steep valley walls with high curvature, whereas the bulk forms a flat river conducive to stable progress.
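As a rough illustration of the dominant/bulk split described above, the sketch below projects an update vector onto the top-k eigenvectors of a Hessian and reports the fraction of its squared norm lying in that dominant subspace. The toy diagonal Hessian, the helper names `dominant_bulk_split` and `dominant_ratio`, and the choice k=2 are assumptions made for this example, not the paper's measurement code.

```python
import numpy as np

def dominant_bulk_split(hessian, update, k):
    """Split `update` into parts inside and outside the top-k Hessian eigenspace."""
    # eigh returns eigenvalues in ascending order for a symmetric matrix.
    _, eigvecs = np.linalg.eigh(hessian)
    top_k = eigvecs[:, -k:]                   # columns spanning the dominant subspace
    dominant = top_k @ (top_k.T @ update)     # projection onto the dominant subspace
    bulk = update - dominant                  # remainder lies in the bulk subspace
    return dominant, bulk

def dominant_ratio(hessian, update, k):
    """Fraction of the update's squared norm that lies in the dominant subspace."""
    dominant, _ = dominant_bulk_split(hessian, update, k)
    return float(dominant @ dominant) / float(update @ update)

# Hypothetical toy Hessian: two steep "wall" directions and eight flat "river" directions.
hessian = np.diag([100.0, 50.0] + [0.01] * 8)
update = np.ones(10)
dominant, bulk = dominant_bulk_split(hessian, update, k=2)
ratio = dominant_ratio(hessian, update, k=2)
```

A lower ratio means the update points mostly along the river. For real networks the Hessian is far too large to form explicitly, so any practical version would need an approximate eigensolver.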

Optimizers must traverse the river efficiently and suppress unnecessary updates along valley walls. Empirically, Muon updates contain substantially larger bulk components than SGD or AdamW, accelerating progress in low-curvature directions. Orthogonalization shifts the update further away from dominant directions:

Figure 1: Muon produces smaller dominant-subspace components compared to SGD and AdamW; AMUSE further suppresses dominant updates.

However, Muon's orthogonalization is non-selective, amplifying noise and small dominant components, which can cause oscillatory motion across valley walls and trajectory instability. This is reflected in the Hessian eigenspectrum and confirmed by measuring the dominant/bulk ratios:

Figure 2: Comparison of dominant component ratios illustrates the increased bulk orientation of Muon and AMUSE updates.
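The non-selective nature of orthogonalization can be seen in a small sketch. Here an exact SVD-based orthogonalization stands in for the Newton–Schulz iteration that Muon-style optimizers typically use; that substitution, the toy momentum matrix, and the helper name `orthogonalize` are assumptions for illustration. The example only shows that all singular values are equalized, so a weak, noise-like direction ends up with the same weight as the strongest one. It does not model which directions are dominant in the Hessian sense.

```python
import numpy as np

def orthogonalize(matrix):
    """Return U @ V^T from the thin SVD (exact stand-in for Newton-Schulz)."""
    u, _, vt = np.linalg.svd(matrix, full_matrices=False)
    return u @ vt

# Hypothetical momentum: one strong direction (singular value 10)
# and one weak, noise-like direction (singular value 0.01).
u_basis = np.eye(3)[:, :2]
v_basis = np.eye(2)
momentum = u_basis @ np.diag([10.0, 0.01]) @ v_basis.T

before = np.linalg.svd(momentum, compute_uv=False)
after = np.linalg.svd(orthogonalize(momentum), compute_uv=False)

# Share of squared norm carried by the weak direction, before and after.
weak_share_before = before[1] ** 2 / np.sum(before ** 2)
weak_share_after = after[1] ** 2 / np.sum(after ** 2)
```

In this toy case the weak direction goes from a negligible share of the update to an equal share, which is the amplification effect the text describes.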

Scaling experiments demonstrate that amplifying bulk components accelerates training, while dominant components can induce instability—supporting the notion that low-curvature directions are key to optimization efficiency.

Schedule-Free Stabilization and AMUSE Mechanism

Schedule-Free optimizers interpolate between a fast base sequence and an averaged sequence, with $\beta$ controlling where the gradient is evaluated. Moving the evaluation point toward the averaged sequence suppresses dominant high-curvature contributions before orthogonalization, yielding more stable updates. However, a fixed $\beta$ induces a tradeoff: a low $\beta$ allows rapid early progress but harms stability, while a high $\beta$ stabilizes late-stage progress but impairs early adaptation.
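To make the role of $\beta$ concrete, here is a minimal sketch of a Schedule-Free SGD step in its commonly stated form: a fast sequence `z`, a uniformly averaged sequence `x`, and a gradient evaluation point `y` between them. It uses plain SGD as the base step rather than Muon, and the function name `sf_sgd_step`, the toy quadratic, and the hyperparameter values are assumptions for illustration.

```python
import numpy as np

def sf_sgd_step(z, x, t, grad_fn, lr, beta):
    """One Schedule-Free SGD step; t is the 1-indexed step count."""
    y = (1.0 - beta) * z + beta * x       # gradient evaluation point
    z_next = z - lr * grad_fn(y)          # fast base sequence
    c = 1.0 / (t + 1)                     # uniform-averaging weight
    x_next = (1.0 - c) * x + c * z_next   # averaged sequence
    return z_next, x_next, y

def grad_fn(w):
    """Gradient of the toy loss 0.5 * (w[0]**2 + 0.1 * w[1]**2)."""
    return np.array([1.0, 0.1]) * w

z = np.array([1.0, 1.0])
x = z.copy()
z_history = [z.copy()]
for t in range(1, 51):
    z, x, y = sf_sgd_step(z, x, t, grad_fn, lr=0.5, beta=0.9)
    z_history.append(z.copy())
```

With `beta=0` the gradient is evaluated at the fast iterate `z`, and with `beta=1` at the average `x`; values in between trade responsiveness against smoothing.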

AMUSE addresses this by employing a time-varying interpolation coefficient $\beta_t$, which starts low for rapid adaptation and increases toward 1, shifting gradient evaluation toward the stable averaged sequence. The aim is to filter valley-wall noise consistently while maintaining rapid bulk traversal.

Figure 3: Validation perplexity and update norm showcase the Pareto-optimal performance of AMUSE compared to fixed-beta SF-Muon and AMUSE’s superior training stability and speed.

The schedule for $\beta_t$ adjusts the effective averaging window so that it does not shrink over training, interpolating between the fixed-$\beta$ and constant-window regimes.

Figure 4: Evolution of $\beta_t$ controlled by $\rho$ demonstrates smooth interpolation and preservation of effective averaging window size.
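The Knowledge Gaps section of this page quotes the schedule as $\beta_t = 1-((T_0-1)/(t-1))^\rho(1-\beta_1)$. The sketch below evaluates that expression directly. Holding $\beta_t = \beta_1$ for $t \le T_0$ is an assumption (the page does not state the warmup case), and the specific values of `beta_1`, `rho`, and `t0` are illustrative only.

```python
def beta_schedule(t, beta_1, rho, t0):
    """Evaluate beta_t = 1 - ((T0 - 1) / (t - 1))**rho * (1 - beta_1) for t > T0."""
    if t <= t0:
        # Assumption: hold beta_1 through warmup; the source does not state this case.
        return beta_1
    return 1.0 - ((t0 - 1) / (t - 1)) ** rho * (1.0 - beta_1)

# Illustrative settings: beta_1 and rho taken from the ranges mentioned on this page,
# t0 chosen arbitrarily for the example.
steps = [101, 201, 1001, 10001]
betas = [beta_schedule(t, beta_1=0.6, rho=0.8, t0=101) for t in steps]

# With rho = 0 the expression reduces to a fixed beta equal to beta_1.
fixed = [beta_schedule(t, beta_1=0.6, rho=0.0, t0=101) for t in steps]
```

Under this form, $\beta_t$ equals $\beta_1$ at $t = T_0$ and rises toward 1 as $t$ grows, with larger $\rho$ giving a faster rise.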

Empirical Evaluation

AMUSE was evaluated across a range of vision and LLM benchmarks. In image classification tasks (CIFAR-10, CIFAR-100, SVHN, ImageNet-1k), AMUSE consistently matches or outperforms all baselines, including schedule-free and Muon optimizers, both in accuracy and convergence speed. In segmentation and MAE fine-tuning contexts, AMUSE remains robust even when transferred to models pretrained with AdamW.

Figure 5: Test accuracy across image-domain experiments highlights AMUSE’s superior anytime performance, averaged over multiple seeds.

In LLM pretraining (Llama architectures at 124M, 720M, and 1B parameters, trained on FineWeb), AMUSE achieves lower validation perplexity throughout training and remains stable and efficient even at large batch sizes, where SF-AdamW struggles. AMUSE improves the performance–iteration Pareto frontier, requiring $1.51\times$, $1.12\times$, and about $3.1\times$ fewer steps to reach Muon's final performance on the 720M LLM, ResNet-50/ImageNet, and ViT/ImageNet, respectively.

Figure 6: Validation perplexity on FineWeb pretraining across Llama model scales demonstrates AMUSE’s consistent superiority across all scales.

Analysis shows that adding exponential weight averaging or learning rate decay to constant learning-rate Muon improves performance but does not match AMUSE, reinforcing that its trajectory remains inherently stable and efficient.

Experimental Design and Hyperparameter Handling

AMUSE requires only one additional hyperparameter ($\rho$) compared to standard SF optimizers. Muon momentum is fixed and not treated as a tunable parameter. Hyperparameter sweeps demonstrate AMUSE’s mild sensitivity and robust performance across tested ranges, with every configuration outperforming tuned Muon baselines. Implementation details ensure practical memory overhead similar to AdamW and SF-AdamW, with seamless integration into both vision and language domain settings.

Theoretical and Practical Implications

AMUSE demonstrates how optimizer design can leverage loss landscape geometry—by promoting bulk traversal and suppressing valley-wall oscillations—to transcend traditional learning-rate scheduling schemes. The schedule-free, anytime paradigm facilitates flexible training and inference, enabling better resource efficiency and model adaptation. The separation of bulk and dominant subspaces, and deliberate gradient evaluation control, inform a principled approach for future optimizer development.

Theoretical analyses (e.g., quadratic bouncing with matrix-normalized updates) reinforce the importance of controlling dominant components both algorithmically and geometrically. AMUSE also shows strong potential for scale transfer and cross-domain applicability, evidenced by superior results in both vision and language tasks and efficient scaling to larger models and batches.

Speculations for Future Developments

Future research should focus on eliminating the remaining memory overhead incurred by additional averaged states, possibly by devising more selective orthogonalization and averaging mechanisms. Exploration of more adaptive, curvature-aware schedules, integration with alternative normalization or preconditioning strategies (such as Mousse or AdamNamo), and further theoretical analysis of matrix-valued optimizer dynamics in highly anisotropic spaces will continue to push the boundaries of optimizer design.

Moreover, model architecture coupling (e.g., leveraging Muon and AMUSE for hybrid transformer-conv architectures) and distributed training regimes stand to benefit from AMUSE's scale-agnostic and schedule-free properties.

Conclusion

AMUSE combines Muon's matrix orthogonalization and fast bulk progress with schedule-free stable gradient evaluation, using a time-varying interpolation parameter. It consistently outperforms conventional Adam(W), Muon, and schedule-based optimizers across diverse deep learning tasks, delivering improved stability, efficiency, and anytime capability. The geometric and algorithmic insights underlying AMUSE highlight the importance of systematic landscape analysis for principled optimizer design and robust deep learning training (2605.22432).

Explain it Like I'm 14

Brief overview

This paper introduces a new way to train deep neural networks called AMUSE. It combines two ideas:

Muon, an optimizer that makes fast progress but can be bouncy/unstable.
Schedule-Free training, a way to keep training smooth without using a hand‑designed learning rate schedule.

The goal is to move quickly in the directions that really improve the model, while avoiding wasteful bouncing in directions that don’t help much. AMUSE does this automatically and lets you stop training at any time and still get good results.

Key questions the paper asks

Why does Muon often train models faster than the popular AdamW optimizer?
Why can Muon be unstable (it “wobbles”) during training?
Can we keep Muon’s speed but make it stable—without using complex learning rate schedules?
Will this work across different tasks, like image recognition and LLMs?

What did they do and how?

The “river–valley” picture (a simple analogy)

Imagine the training process like walking through a long valley:

The flat riverbed is where most useful progress happens (you can move forward easily).
The steep valley walls are directions that change the loss a lot with tiny steps (they make you bounce side to side).

Good training means:

Take big, steady steps along the river (forward progress).
Avoid zig‑zagging into the steep walls (wasted motion and instability).
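Here is a tiny made-up example of the river and the walls. It is not from the paper: the loss is a simple bowl that is very steep in one direction (the wall) and almost flat in the other (the river), and the numbers are chosen just to show the bouncing.

```python
import numpy as np

# Toy valley: loss = 0.5 * (wall_curv * w[0]**2 + river_curv * w[1]**2).
wall_curv, river_curv = 100.0, 0.01
curvature = np.array([wall_curv, river_curv])

def run_gd(lr, steps, start=(1.0, 1.0)):
    """Plain gradient descent; returns every position visited."""
    w = np.array(start, dtype=float)
    path = [w.copy()]
    for _ in range(steps):
        w = w - lr * curvature * w   # gradient of the toy loss is curvature * w
        path.append(w.copy())
    return np.array(path)

path = run_gd(lr=0.019, steps=20)

# How many times the wall coordinate jumps to the other side of the valley.
wall_sign_flips = int(np.sum(np.sign(path[1:, 0]) != np.sign(path[:-1, 0])))
# Fraction of the distance covered along the river after all steps.
river_progress = 1.0 - path[-1, 1] / path[0, 1]
```

In this toy run the wall coordinate jumps across the valley on every step while the river coordinate barely moves. That is the zig-zag pattern a good optimizer tries to avoid.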

What is Muon, in plain words?

Muon updates model layers that are matrices in a “balanced” way. Think of it like turning a music mixer so no single instrument is too loud; everything stays more even. That balance helps it move faster along the flat riverbed. But there’s a catch: by making every part more balanced, Muon can also accidentally boost tiny bits of noise, which makes the path wobble against the valley walls.

What is Schedule-Free training, in plain words?

Most training uses a learning rate schedule (like: warm up, then slowly cool down). Schedule-Free skips that. Instead, it:

Keeps two versions of the model: a fast‑changing one and a smooth, averaged one.
Chooses a point between these two to measure the gradient (the direction to step). This “averaging” acts like a natural stabilizer—it reduces wall-bouncing without needing a hand‑designed schedule.

The AMUSE idea

AMUSE combines Muon’s fast, balanced steps with Schedule-Free’s stability:

Early on, AMUSE looks more at the fast Muon model to adapt quickly.
Over time, AMUSE smoothly shifts to evaluate gradients closer to the averaged (stable) model. This reduces wobbling while keeping speed. Importantly, it doesn’t need a learning rate schedule and works well if you stop training at any time.

How they tested it

They analyzed the directions of updates to see how much movement was along the “river” vs. into the “walls.”
They ran experiments on:
- Image tasks (like CIFAR, ImageNet, segmentation with U‑Net, and ViT fine‑tuning).
- LLM pretraining (Llama‑style models with hundreds of millions to a billion parameters).
They compared AMUSE against AdamW, Schedule‑Free AdamW, and Muon.

Main findings and why they matter

Muon’s “balance” increases progress along the river (good), but also strengthens small noisy parts that cause oscillations against the walls (bad).
Schedule-Free gradient evaluation calms down those oscillations by measuring gradients closer to the stable average.
AMUSE blends the two: it starts fast, then becomes steadily more stable by shifting where it evaluates the gradient. This gives both speed and stability without learning rate schedules.

Here are a few concrete results (why they’re important is noted in parentheses):

AMUSE reaches Muon’s final performance with fewer steps:
- About 1.5× fewer steps on a 720M‑parameter LLM (saves time and compute).
- About 1.1× fewer steps on ImageNet with ResNet‑50 (faster image training).
- About 3.1× fewer steps on ImageNet ViT fine‑tuning (big speedup for modern vision models).
AMUSE improves the performance‑vs‑training‑steps trade‑off across many tasks (more efficient training).
AMUSE naturally supports “anytime” training (you can stop early and still be in a good place), because it doesn’t rely on a carefully timed learning rate schedule.

What does this mean for the future?

Simpler training: You don’t need to hand‑tune learning rate schedules. AMUSE learns fast early on and stabilizes later—by design.
Better use of compute: Faster progress along the directions that matter means fewer training steps for similar (or better) results.
Scales to big models: It works well for LLMs and common vision tasks.
Next steps: AMUSE keeps an extra “average” copy of the model, which uses some memory. A future goal is to get the same speed‑and‑stability benefits with even less memory.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a concise list of unresolved issues and concrete avenues for future work suggested by the paper.

Lack of formal convergence and stability guarantees for AMUSE
- No theoretical proof that time-varying interpolation $\beta_t$ yields convergence or suppresses oscillations under stochastic noise; establish conditions (e.g., Lipschitz smoothness, curvature anisotropy, noise models) under which AMUSE provably converges and remains stable.
- Quantify how orthogonalization and SF averaging jointly affect convergence rates and bias/variance trade-offs relative to AdamW and Muon.
Missing principled design of the interpolation schedule
- The schedule $\beta_t = 1-((T_0-1)/(t-1))^\rho(1-\beta_1)$ is heuristic; derive optimal or near-optimal schedules from theory (e.g., minimizing a bound on dominant-direction energy or regret).
- Provide data-/model-adaptive rules for selecting $(\beta_1,\rho,T_0)$ online using observable statistics (e.g., gradient variance, curvature proxies, cosine similarities).
Limited subspace diagnostics at scale
- Dominant/bulk analyses rely on small models or proxies; develop scalable, online estimators (e.g., GGN/Lanczos sketches, randomized Hutchinson methods) to monitor dominant/bulk energy during large-scale LLM training.
- Validate whether the “river-valley” geometry persists across larger models and datasets beyond small MLPs and MNIST-like settings.
Unclear choice of dominant subspace dimension k outside classification
- For classification experiments $k$ is set to the number of classes; for LLMs and other tasks, a principled method to set or infer $k$ is not provided. Develop data-driven and adaptive methods to determine $k$ over training.
Orthogonalization–noise interaction not fully quantified
- Precisely characterize how Muon’s orthogonalization amplifies noise in high-curvature directions under stochastic gradients, and how AMUSE modulates this amplification pre-/post-orthogonalization.
- Develop metrics that directly measure “valley-wall oscillation amplitude” over training and validate reductions due to AMUSE.
Layer/block heterogeneity and parameter-type treatment
- AMUSE inherits Muon’s restriction to matrix-valued parameters and relies on SF-AdamW/SGD for others; examine whether non-matrix blocks (embeddings, LayerNorm, heads) contribute dominant components that limit stability.
- Explore blockwise/layerwise $\beta_t$ schedules (or selective application) based on curvature or spectral diagnostics rather than a single global $\beta_t$.
Computational and numerical overheads insufficiently characterized
- Provide a comprehensive cost analysis: Newton–Schulz iteration counts, FLOPs, wall-clock throughput, and mixed-precision stability (FP16/BF16) across hardware (A100/H100) versus AdamW/Muon.
- Study numerical stability of orthogonalization for ill-conditioned or near-rank-deficient matrices and the effect of per-layer shapes and weight tying.
Memory footprint and scaling limits
- Although AMUSE requires only one extra model copy relative to Muon, quantify peak memory under activation checkpointing, large batch sizes, and multi-GPU sharding; propose memory-saving variants (e.g., low-rank or quantized states).
Interaction with standard training components
- Gradient clipping: examine compatibility and optimal clipping thresholds under orthogonalization; quantify impact on stability and final performance.
- Weight decay: analyze theoretically and empirically how decoupled weight decay interacts with orthogonalized momentum and SF averaging; study implicit regularization.
- Data augmentation and regularizers (Mixup, CutMix, RandAugment, SAM): evaluate whether AMUSE’s benefits persist or change with strong regularization.
Warmup dependence and “anytime” claims
- Despite no LR decay, AMUSE still uses warmup; ablate or eliminate warmup to test truly schedule-free behavior and define guidelines for $T_0$ selection.
- Validate “anytime” training under realistic scenarios (pause/resume, mid-run extension, changing token budgets) on larger LLMs beyond 124M.
Hyperparameter robustness and scaling laws
- Provide broader sweeps for $(\beta_1,\rho)$ across model sizes, batch sizes, sequence lengths, and datasets; quantify robustness bands and scaling laws for optimal settings.
- Study sensitivity to the Muon momentum coefficient $\mu$ (held fixed here) and its interaction with $\beta_t$.
Broader task coverage and transferability
- Extend evaluations beyond classification/segmentation and LM next-token prediction: detection/instance segmentation, diffusion/vision-LLMs, seq2seq tasks, reinforcement learning.
- Assess transfer and fine-tuning scenarios (instruction tuning, domain shift, continual learning) and whether AMUSE aids stability under distribution shift.
Generalization, robustness, and fairness outcomes
- Move beyond accuracy/perplexity: evaluate calibration, OOD robustness, adversarial robustness, and group fairness; test claims about balanced component learning in imbalanced regimes within AMUSE.
Comparative baseline breadth and consistency
- Add strong and diverse baselines (e.g., Adafactor, Shampoo/Adafactor hybrids, Sophia, Lion, K-FAC, Shampoo-MAE recipes) under matched budgets; report tokens/sec and hardware configs for apples-to-apples comparisons.
- Compare to WSD-tuned Muon (state-of-practice in open LLM recipes) under equivalent wall-clock and compute.
Direct river-following verification at scale
- EWA/decay responsiveness and cosine similarity are indirect proxies; develop direct, scalable diagnostics that confirm the averaged trajectory hugs the river for large LLMs (e.g., tracking dominant/bulk alignment over time).
Per-layer spectral effects and anisotropy mapping
- Quantify how AMUSE alters per-layer singular spectra and curvature distributions (e.g., attention vs MLP blocks), and correlate these with training speed and generalization.
Failure modes and extreme regimes
- Characterize regimes where AMUSE underperforms or diverges: ultra-small batches, extremely high curvature (very deep or narrow models), sparse gradients, or noisy data.
- Study catastrophic loss spikes and recovery behavior compared to AdamW and Muon.
Integration with parallelism and distributed training
- Analyze communication overheads and numerical inconsistencies under data/model/pipeline parallelism when interpolating or orthogonalizing states; propose synchronization-efficient variants.
Automated or feedback-controlled variants
- Explore control-theoretic or bandit-style mechanisms to adjust $\beta_t$ based on online performance/curvature signals; compare to fixed schedules for both stability and speed.
Theoretical links to central flows and edge-of-stability
- Bridge the river-valley perspective with recent central-flow and edge-of-stability theories; formalize when AMUSE shifts dynamics away from the edge without sacrificing river progress.
Application to non-matrix parameters via structured updates
- Investigate extensions that impose structure (e.g., low-rank or block-orthogonal updates) for vector parameters to unify treatment and potentially reduce dominant-direction noise.
Tokenization, corpus, and multilingual effects in LLMs
- Validate AMUSE across tokenizers (BPE, unigram) and corpora (C4, Pile, multilingual datasets); quantify how data composition affects curvature anisotropy and AMUSE’s gains.
Reproducibility and variance reporting
- Provide multiple seeds and error bars for LLM experiments; release complete hyperparameter grids and training traces to enable rigorous replication and meta-analysis.

Practical Applications

Immediate Applications

Below are concrete, near-term uses you can deploy now based on the paper’s findings and AMUSE implementation.

LLM pretraining without learning-rate schedules
- Sectors: Software/AI, Cloud, Foundation-model labs
- What to do: Replace AdamW+cosine decay or vanilla Muon with AMUSE for decoder-only Transformers (Llama-like) at 100M–1B scales. Keep linear warmup, drop decay. Expect better performance-per-iteration and strong large-batch scaling.
- Tools/workflows: Integrate the AMUSE optimizer into PyTorch/DeepSpeed/Megatron-LM/Hugging Face Transformers training loops; keep non-matrix parameters on SF-AdamW as in the paper; log update norms and cosine similarity of successive updates for stability monitoring.
- Assumptions/dependencies: Wall-clock gains depend on efficient Newton–Schulz kernels for orthogonalization; per-step overhead may rise vs AdamW, but fewer steps to target perplexity can offset this. Some tuning of learning rate, weight decay, β1, and ρ is still required. Linear warmup T0 remains.
Vision training with simpler, schedule-free recipes
- Sectors: Computer Vision (CV), Edge/cloud ML services
- What to do: Use AMUSE for ResNet-50 ImageNet training, U-Net segmentation (ISIC 2018), and ViT-B/16 MAE fine-tuning. Expect faster anytime gains and fewer steps than AdamW/Muon to reach target accuracy.
- Tools/workflows: Plug into timm/MMCV/lightning recipes; keep SF-SGD or SF-AdamW for non-matrix params; drop cosine decay. For MAE fine-tuning, AMUSE works even when the pretraining used AdamW.
- Assumptions/dependencies: Slight extra state vs Muon (similar to AdamW); matrix orthogonalization adds compute cost; verify mixed-precision and gradient checkpointing compatibility in your stack.
Budget- and interruption-friendly “anytime training”
- Sectors: MLOps/Cloud (spot/preemptible instances), Platform engineering
- What to do: Adopt AMUSE to enable safe early stopping and interruption-resilient training (no schedule coupling to total horizon). Improve GPU cluster utilization by pausing/resuming jobs without re-tuning schedules.
- Tools/workflows: Add early-exit policies keyed on validation loss, update norms, and update-to-update cosine similarity; remove LR-decay logic from orchestration.
- Assumptions/dependencies: Persist AMUSE state correctly (parameters, SF state, Muon momentum). Gains are larger when training horizons are uncertain or fragmented.
Reduced hyperparameter and schedule search in AutoML
- Sectors: AutoML, ML platform tooling, Consulting
- What to do: Replace LR-schedule sweeps with AMUSE; narrow tuning to LR, WD, β1, ρ. Use AMUSE as the default optimizer in self-tuning systems.
- Tools/workflows: Integrate in Ray Tune/Optuna pipelines; search β1 in {0.4, 0.6}, ρ in {0.6, 0.8} as paper suggests, alongside LR/WD.
- Assumptions/dependencies: Still requires warmup T0 tuning for some tasks; ensure orthogonalization step is not a per-trial bottleneck.
Medical imaging model development with faster iteration
- Sectors: Healthcare (research/clinical AI), Biotech
- What to do: Use AMUSE to train and iterate U-Net-like segmentation models (e.g., dermoscopy, radiology) faster and with fewer schedule knobs.
- Tools/workflows: Integrate into MONAI pipelines; keep inference on averaged weights as prescribed by SF formulation.
- Assumptions/dependencies: Clinical deployment still requires rigorous validation; verify that reduced oscillations translate to consistent calibration and segmentation quality.
Training diagnostics using river–valley metrics
- Sectors: Academia, ML research, Tooling
- What to do: Adopt the dominant/bulk subspace lens to debug instability. Track update norms, cosine similarity, and (when feasible) bulk/dominant proxies via Hessian approximations.
- Tools/workflows: pyHessian or low-rank Hessian probes; alerts when dominant components or oscillations spike; compare Muon vs AMUSE behavior.
- Assumptions/dependencies: Curvature estimation is approximate and incurs overhead; use sparsely or on small validation batches.
Recipe upgrades where Muon is already used
- Sectors: Open-model training (e.g., Kimi-K2, GLM-5, DeepSeek-like recipes)
- What to do: Swap Muon+decay schedules for AMUSE (no decay), preserving Muon for matrix params and SF-AdamW/SF-SGD for others.
- Tools/workflows: Configuration-level replacement in existing codebases; shared momentum buffers retained.
- Assumptions/dependencies: Ensure per-step latency increase is acceptable; re-tune LR/β1/ρ minimally.
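Several items above suggest logging update norms and the cosine similarity of successive updates as stability signals. A minimal sketch of those two quantities follows; the helper name `update_diagnostics`, the `eps` guard, and the toy update sequences are assumptions for illustration and are not tied to any particular training framework.

```python
import numpy as np

def update_diagnostics(prev_update, update, eps=1e-12):
    """Return (Frobenius norm of `update`, cosine similarity with `prev_update`)."""
    norm = float(np.linalg.norm(update))
    prev_norm = float(np.linalg.norm(prev_update))
    # Treat both matrices as flat vectors; eps avoids division by zero.
    cosine = float(np.sum(prev_update * update)) / (norm * prev_norm + eps)
    return norm, cosine

# Hypothetical update sequences: one steady, one flipping sign every step.
base = np.array([[1.0, 0.0], [0.0, 1.0]])
steady = [base, base, base]
bouncy = [base * s for s in (1.0, -1.0, 1.0)]

steady_cos = [update_diagnostics(a, b)[1] for a, b in zip(steady, steady[1:])]
bouncy_cos = [update_diagnostics(a, b)[1] for a, b in zip(bouncy, bouncy[1:])]
```

Cosine values near 1 indicate consecutive updates pointing the same way, while values near -1 indicate back-and-forth motion of the kind associated with valley-wall oscillation.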

Long-Term Applications

These opportunities require additional research, scaling, or ecosystem development before broad deployment.

Optimizer autopilots with river tracking
- Sectors: Software/AI, AutoML, MLOps
- Concept: Dynamically adapt βt using online estimates of curvature or dominant/bulk ratios; automatically keep the iterate near the “river” and adjust when oscillations appear.
- Potential tools/products: “River-aware optimizer” service; training-time controllers that modulate βt/LR based on Hessian traces, update cosine, or loss curvature probes.
- Assumptions/dependencies: Reliable, low-overhead curvature proxies; robust controllers that don’t destabilize training.
Hardware and kernel support for orthogonalized updates
- Sectors: Semiconductors, Systems/compilers, Cloud providers
- Concept: Add fast Newton–Schulz or orthogonalization primitives in GPUs/TPUs or libraries (cuBLAS, CUTLASS, Triton kernels) to reduce AMUSE’s per-step cost.
- Potential tools/products: Vendor-optimized AMUSE kernels; fused matrix-momentum ortho ops.
- Assumptions/dependencies: Vendor investment; standardized APIs; evidence of significant wall-clock gains at scale.
Federated/on-device and intermittent training
- Sectors: Mobile/Edge, IoT, Privacy-preserving ML
- Concept: Use anytime, schedule-free AMUSE for devices with intermittent connectivity and variable training budgets; stabilize local updates before aggregation.
- Potential tools/products: Federated SDKs offering AMUSE as the default, with low-precision ortho kernels.
- Assumptions/dependencies: Efficient, memory-light orthogonalization; quantization-friendly variants; empirical validation on non-IID federated data.
Energy- and carbon-aware training policies
- Sectors: Policy, Sustainability, Corporate governance
- Concept: Encourage schedule-free, anytime training methods (like AMUSE) to cut wasted compute from mis-tuned schedules and allow earlier convergence in step count.
- Potential tools/products: Procurement or grant guidelines promoting anytime optimization; reporting standards on “steps-to-target” and energy per token/epoch.
- Assumptions/dependencies: Wall-clock energy measurements at scale; transparent benchmarks showing net energy savings vs strong AdamW baselines.
Continual and robust learning with fewer oscillations
- Sectors: Enterprise ML, Cybersecurity, Healthcare
- Concept: Pair AMUSE with replay/regularization (e.g., EWC) to reduce catastrophic oscillations during domain shifts or task transitions.
- Potential tools/products: Continual-learning frameworks that default to AMUSE; “stability scores” guiding when to consolidate or adapt.
- Assumptions/dependencies: Empirical studies in non-stationary settings; interaction with task rehearsal and evaluation metrics.
Low-precision training enablement (FP8/INT8)
- Sectors: Systems/compilers, Cloud
- Concept: Exploit AMUSE’s stabilized updates to tolerate lower precision arithmetic without divergence, reducing memory/energy.
- Potential tools/products: FP8/INT8 AMUSE kernels; recipes combining AMUSE with quantization-aware training.
- Assumptions/dependencies: Careful error analysis of orthogonalization under quantization; validation on large models.
River-aware debugging and safety monitors
- Sectors: Platform engineering, Safety
- Concept: Online monitors that detect “valley-wall” oscillations and automatically intervene (adjust βt/LR, pause, checkpoint).
- Potential tools/products: Training dashboards exposing dominant/bulk proxies; auto-remediation hooks.
- Assumptions/dependencies: Fast, reliable proxies; policies that avoid overreacting to noise.
RL/robotics training stabilization
- Sectors: Robotics, Autonomous systems
- Concept: Test AMUSE in actor–critic or imitation learning where oscillations are common; leverage bulk-oriented updates for more sample-efficient learning.
- Potential tools/products: RL libraries offering AMUSE backends for policy/value networks.
- Assumptions/dependencies: Evidence on non-stationary objectives; integration with target network updates and entropy schedules.
Cloud cost-optimization and “anytime checkpoints”
- Sectors: Cloud, FinOps, MLOps
- Concept: Products that price and schedule training around AMUSE’s anytime property—budget-aware early exits, elastic scaling, and preemptible-friendly retries.
- Potential tools/products: “Anytime Trainer” services in managed ML platforms; budget-to-quality SLAs.
- Assumptions/dependencies: Estimators linking budget to expected quality; robust checkpointing and resume semantics.
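The "River-aware debugging and safety monitors" idea above can be made concrete with a very rough proxy. The sketch below assumes that valley-wall oscillation shows up as consecutive parameter updates pointing in nearly opposite directions, and flags it when the average cosine similarity over a sliding window drops below a threshold. The class name `OscillationMonitor`, the window size, and the threshold are all hypothetical; the paper does not prescribe this proxy.

```python
from collections import deque

import numpy as np


class OscillationMonitor:
    """Hypothetical monitor: flags runs whose consecutive updates keep reversing."""

    def __init__(self, window=20, threshold=-0.5):
        self.window = window          # number of recent cosines to average
        self.threshold = threshold    # flag when the mean cosine falls below this
        self._prev = None
        self._cosines = deque(maxlen=window)

    def update(self, step):
        """Record one flattened parameter update and return the current flag."""
        step = np.asarray(step, dtype=np.float64).ravel()
        if self._prev is not None:
            denom = np.linalg.norm(step) * np.linalg.norm(self._prev)
            if denom > 0.0:
                self._cosines.append(float(step @ self._prev) / denom)
        self._prev = step
        return self.is_oscillating()

    def is_oscillating(self):
        # Wait for a full window so a single noisy step cannot trigger the flag.
        if len(self._cosines) < self.window:
            return False
        return float(np.mean(self._cosines)) < self.threshold


# Toy demo: gradient descent on a quadratic with one steep and one flat direction.
# The step size is deliberately close to the stability limit of the steep axis.
curvature = np.array([100.0, 1.0])
w = np.array([1.0, 1.0])
monitor = OscillationMonitor(window=5)
flags = []
for _ in range(10):
    step = -0.019 * curvature * w
    w = w + step
    flags.append(monitor.update(step))
```

A real monitor would likely need a cheaper signal than a full flattened update (for example per-layer updates or random projections) and a policy for what to do once the flag is raised; both are left open here.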

Notes on feasibility and general assumptions

The paper’s improvements are shown in steps-to-quality; wall-clock gains depend on high-quality, fused implementations of orthogonalization and momentum operations.
Memory usage is slightly higher than vanilla Muon (extra SF state) but comparable to AdamW; verify fit at large batch sizes.
AMUSE removes LR decay but still uses warmup and adds two hyperparameters, β1 and ρ; modest tuning remains necessary.
Benefits span CV and LLMs; additional validation is advisable for domains like diffusion, RL, and federated learning.
Open-source implementation is available at the provided GitHub link, easing immediate adoption.
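To illustrate the note that AMUSE drops LR decay but keeps warmup, the sketch below contrasts a warmup-then-constant learning rate with a warmup-then-cosine-decay one. The function names and the linear warmup shape are assumptions for illustration, not taken from the paper. The point to notice is that the constant variant never needs to know the total number of steps, which is what makes stopping at an arbitrary step meaningful.

```python
import math


def warmup_constant_lr(step, base_lr, warmup_steps):
    """Linear warmup followed by a constant rate; no training horizon required."""
    if warmup_steps > 0 and step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr


def warmup_cosine_lr(step, base_lr, warmup_steps, total_steps):
    """Linear warmup followed by cosine decay to zero at total_steps."""
    if warmup_steps > 0 and step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup budget already consumed, clipped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


# The constant variant is independent of the horizon; the cosine one is not.
lr_anytime = [warmup_constant_lr(s, 1e-3, 100) for s in (0, 99, 5000)]
lr_cosine = [warmup_cosine_lr(s, 1e-3, 100, 5000) for s in (0, 99, 5000)]
```

With a cosine schedule, changing the training budget changes the learning rate at every post-warmup step, so a run planned for one horizon cannot simply be extended or cut short without affecting the schedule.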


Glossary

AdamW: An adaptive gradient optimizer that combines Adam with decoupled weight decay to improve generalization in deep learning. "Modern deep learning commonly relies on AdamW with prescribed learning rate schedules"
AMUSE: An optimizer that combines Muon with Schedule-Free gradient evaluation using a time-varying interpolation to stabilize training and enable anytime use. "we propose Anytime MUon with Stable gradient Evaluation (AMUSE)"
anisotropic: Having direction-dependent properties; here, referring to loss landscapes where curvature varies greatly across directions. "loss landscapes are highly anisotropic."
anytime training: A training regime where the model can be stopped at any time and still provide a strong checkpoint without dependence on a fixed schedule. "AMUSE naturally supports anytime training."
bulk subspace: The high-dimensional, low-curvature subspace of the parameter space where most useful training progress occurs. "the bulk subspace forms a relatively flat river"
condition-number-free linear convergence: Convergence at a rate that does not depend on the condition number of the problem, indicating robustness to ill-conditioning. "yielding condition-number-free linear convergence in matrix factorization and linear transformer settings."
cosine decay: A learning rate schedule that reduces the learning rate following a cosine function over time. "such as cosine decay"
decoupled weight decay: A regularization technique that applies weight decay separately from the gradient-based parameter update. "with decoupled weight decay"
dominant subspace: The low-dimensional, high-curvature subspace spanned by the top Hessian eigenvectors, associated with steep directions. "the dominant subspace forms steep valley walls"
Exponential Weight Averaging (EWA): A post-hoc averaging technique that exponentially decays past parameter contributions to smooth the trajectory. "Exponential Weight Averaging (EWA)"
Hessian spectrum: The set of eigenvalues of the Hessian matrix that reveals curvature properties of the loss landscape. "the Hessian spectrum often contains a small number of large outlier eigenvalues"
interpolation coefficient: A scalar controlling how much the gradient evaluation point is interpolated between the fast and averaged iterates. "AMUSE uses a time-varying interpolation coefficient"
iterate averaging: Averaging successive parameter iterates during training to stabilize and improve convergence. "Schedule-Free optimization removes explicit schedules via iterate averaging"
Muon: An optimizer that normalizes and orthogonalizes matrix-valued momentum to balance updates across singular directions. "Muon exploits matrix structure by applying momentum to matrix-valued parameters and orthogonalizing the resulting update direction."
Newton-Schulz iteration: An iterative method to approximate matrix functions (e.g., inverse square roots), used here to implement fast orthogonalization. "approximated using a Newton-Schulz iteration"
orthogonalization operator: An operator that transforms a matrix update to have orthonormal structure, reducing dominance of certain directions. "an orthogonalization operator, which is approximated using a Newton-Schulz iteration"
orthogonalized momentum: Momentum that has been transformed to have orthonormal components, promoting balanced updates. "orthogonalized momentum accelerates progress along the river directions."
performance–iteration Pareto frontier: The optimal trade-off curve between achieved performance and number of training iterations. "improves the performance-iteration Pareto frontier"
Polyak–Ruppert (PR) averaging: An averaging method that takes the mean of iterates to reduce variance and improve convergence rates. "as an interpolation between Polyak–Ruppert (PR) averaging and primal averaging"
preconditioner: A transformation applied to the gradient or parameters to improve conditioning and accelerate convergence. "act as an effective preconditioner"
primal averaging: An averaging scheme in which gradients are evaluated at the running average of the iterates, rather than at the latest iterate as in Polyak–Ruppert averaging. "as an interpolation between Polyak–Ruppert (PR) averaging and primal averaging"
river-valley loss landscape: A geometric view of optimization where low-curvature “river” directions enable progress and high-curvature “valley walls” cause oscillations. "river-valley loss landscape"
RoPE: Rotary positional embeddings; a method for encoding token positions in Transformers. "using tied input/output embeddings, SwiGLU, RMSNorm, and RoPE."
RMSNorm: A normalization technique using root mean square statistics instead of mean/variance normalization. "using tied input/output embeddings, SwiGLU, RMSNorm, and RoPE."
RMSProp: An adaptive optimizer that scales learning rates by a moving average of recent squared gradients. "using an RMSProp update with decoupled weight decay gives SF-AdamW."
Schedule-Free (SF) optimizer: An optimizer that avoids explicit learning rate schedules by evaluating gradients at interpolated points between fast and averaged iterates. "Defazio et al. (2024) introduce the schedule-free (SF) optimizer"
singular components: Contributions along singular vectors (from SVD) that can dominate matrix updates. "a few large singular components"
SwiGLU: A gated activation function variant used in modern Transformers for improved expressivity and stability. "using tied input/output embeddings, SwiGLU, RMSNorm, and RoPE."
tied input/output embeddings: Sharing the same embedding matrix for both input tokens and output softmax to reduce parameters and improve learning. "using tied input/output embeddings, SwiGLU, RMSNorm, and RoPE."
Warmup-Stable-Decay (WSD): A learning rate schedule with distinct warmup, stable, and decay phases tailored for large model training. "Warmup-Stable-Decay (WSD) schedules"
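As a companion to the "Newton-Schulz iteration" and "orthogonalization operator" entries above, here is a minimal NumPy sketch of orthogonalizing a matrix with the classic cubic Newton-Schulz iteration. This is not the paper's implementation: practical Muon code typically uses a tuned higher-order polynomial and only a few iterations, whereas this sketch uses the textbook cubic form and many iterations for clarity. The function name `newton_schulz_orthogonalize` and the iteration count are illustrative assumptions.

```python
import numpy as np


def newton_schulz_orthogonalize(M, num_iters=50):
    """Approximate the polar factor U @ Vt of M = U @ diag(s) @ Vt.

    Each step maps every singular value s to 1.5*s - 0.5*s**3, which pushes
    values in (0, 1] toward 1 while leaving the singular vectors unchanged.
    """
    X = np.asarray(M, dtype=np.float64)
    norm = np.linalg.norm(X)  # Frobenius norm upper-bounds the spectral norm
    if norm == 0.0:
        return X.copy()
    X = X / norm  # all singular values are now at most 1
    for _ in range(num_iters):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X


# Example: orthogonalize a stand-in for a momentum matrix.
rng = np.random.default_rng(0)
M = rng.standard_normal((6, 4))
O = newton_schulz_orthogonalize(M)

# For a tall, full-rank input the result has orthonormal columns.
gram_error = np.max(np.abs(O.T @ O - np.eye(4)))
```

Because every nonzero singular value is driven to 1, large and small singular directions of the input end up contributing equally to the update, which is the balancing effect the "Muon" and "orthogonalized momentum" entries describe.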
