Improving Neural Network Training by Decoupling the Magnitude and Direction of Weight Vectors

Published 24 Jun 2026 in cs.LG | (2606.25971v1)

Abstract: Modern neural network training relies on optimizers such as Adam and Muon, which act on each weight matrix as a single object. Yet every weight matrix carries two distinct quantities, a magnitude and a direction, and all optimizers stepping in the matrix as a whole couple their dynamics: the directional change from an update depends on the current magnitude, while the magnitude drifts as a byproduct of learning the direction, so neither is governed directly by the learning rate. Typical training therefore leans on surrounding recipes such as weight decay and warmup to keep learning stable at scale, though these regulate the coupling only indirectly; other recent methods instead constrain the weight to a fixed-norm sphere, but add no learnable magnitude, leaving scale control to normalization layers alone. We propose Magnitude–Direction (MD) Decoupling, an optimizer modification that factorizes each weight into a fixed-norm direction on a hypersphere and learnable per-row and per-column magnitude gains, updated at separate learning rates, all while the model still sees a single fused weight tensor. The method is agnostic to the base optimizer and removes the need for weight decay and warmup. Across both Adam and Muon, MD Decoupling improves on well-tuned baselines, transfers the optimal LR across model width without retuning, and continues to help at scale on large Mixture-of-Experts (MoE) models. Treating magnitude and direction as separately controlled quantities thus yields more predictable training dynamics and a simple, broadly applicable improvement to modern optimizers.

Authors (4)

Summary

The paper presents MD Decoupling, which separates weight magnitude and direction to allow independent control via learning rates.
It introduces a factorized weight representation that uses fixed-norm directions and distinct magnitude gains, eliminating the need for weight decay and warmup.
Empirical studies demonstrate enhanced compute efficiency and zero-shot hyperparameter transfer across various architectures and model scales.

Decoupling Magnitude and Direction in Neural Network Training: A Detailed Analysis

Motivation and Theoretical Foundation

The paper "Improving Neural Network Training by Decoupling the Magnitude and Direction of Weight Vectors" (2606.25971) investigates the inherent coupling between weight magnitude and direction in gradient-based optimization of neural networks. Standard optimizers such as Adam, Muon, and their derivatives operate on weight matrices as atomic objects, implicitly combining two distinct quantities: magnitude (norm) and direction. This coupling is problematic in practice: the angular update induced by a step is inversely proportional to the current magnitude, and the magnitude grows even under scale-invariant losses, because stochastic updates perpendicular to the weight lengthen it. As a result, the learning rate (LR) does not precisely control either quantity, and training often requires auxiliary mechanisms such as weight decay or warmup scheduling.

Figure 1: Illustration of weight norm-induced distortion in optimizer updates – identical steps rotate weights more at small magnitude and inflate the norm even when only direction matters.
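
Both effects follow from elementary geometry. As a sketch, assume a single update $\Delta w$ that is exactly perpendicular to the weight $w$ (the case of a plain gradient step on a scale-invariant loss, whose gradient is orthogonal to the weight). The symbol $\theta$ for the rotation angle is introduced here for illustration and is not taken from the paper.

```latex
\[
\tan\theta = \frac{\lVert \Delta w \rVert}{\lVert w \rVert}
\quad\Longrightarrow\quad
\theta \approx \frac{\lVert \Delta w \rVert}{\lVert w \rVert}
\quad \text{(small steps)}
\]
\[
\lVert w + \Delta w \rVert^{2}
= \lVert w \rVert^{2} + \lVert \Delta w \rVert^{2}
> \lVert w \rVert^{2}
\]
```

The first line shows why an identical step rotates a small weight more than a large one; the second shows why the norm grows under perpendicular steps unless something such as weight decay pulls it back.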

The authors provide quantitative evidence for this interference, observing that the learning rate fails to independently regulate either directional or magnitude change. The equilibrium reached by standard methods is dictated by the interplay of LR and decay, rather than by any explicit or optimal scale implied by downstream normalization layers. This is particularly problematic for models featuring pervasive normalization (e.g., RMSNorm), where magnitude is often redundant yet still influences learning dynamics.

Magnitude–Direction Decoupling: Algorithmic Formulation

The central contribution is Magnitude–Direction (MD) Decoupling, an optimizer-side modification that factorizes each weight matrix into a fixed-norm direction (on a hypersphere) and separately learned per-row/per-column magnitude gains updated at distinct LRs:

$W = \operatorname{diag}(\gamma_\text{row}) \cdot W_\text{dir} \cdot \operatorname{diag}(\gamma_\text{col}),$

where $W_\text{dir}$ is strictly normalized (Frobenius or axiswise), and the $\gamma$ gains (initialized to unity) are updated independently (typically via Adam). The forward pass operates on the fused tensor, ensuring architectural compatibility and minimal overhead, while the optimizer internally manages the decoupled parameterization.
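
As a minimal sketch of this factorization, the plain-Python snippet below builds the fused weight from a Frobenius-normalized direction and per-row/per-column gains. The helper names (`project_to_sphere`, `fuse`), the unit radius, and the toy values are illustrative assumptions, not the paper's implementation.

```python
import math

def frobenius_norm(mat):
    """Square root of the sum of squared entries."""
    return math.sqrt(sum(x * x for row in mat for x in row))

def project_to_sphere(mat, radius):
    """Rescale mat so that its Frobenius norm equals radius."""
    norm = frobenius_norm(mat)
    return [[x * radius / norm for x in row] for row in mat]

def fuse(w_dir, gain_row, gain_col):
    """W = diag(gain_row) @ w_dir @ diag(gain_col), written elementwise."""
    return [
        [gain_row[i] * w_dir[i][j] * gain_col[j] for j in range(len(gain_col))]
        for i in range(len(gain_row))
    ]

raw = [[3.0, 0.0, 4.0], [0.0, 12.0, 0.0]]      # Frobenius norm is 13
w_dir = project_to_sphere(raw, radius=1.0)     # assumed radius for the demo

# Gains start at one, so the fused weight initially equals the direction.
w_init = fuse(w_dir, [1.0, 1.0], [1.0, 1.0, 1.0])

# Later in training the gains rescale rows and columns independently.
w_scaled = fuse(w_dir, [2.0, 1.0], [1.0, 1.0, 0.5])
```

Only the fused matrix would be handed to the forward pass; the direction and the gains stay inside the optimizer.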

Figure 2: Two degrees of freedom for MD Decoupling: direction normalization axis (rows, columns, Frobenius) and gain axis (scalar, per row/column, row+column).

This design is agnostic to the underlying base optimizer, removes the need for weight decay and warmup, and allows precise control of the angular learning schedule. Notably, fine-grained magnitude gains are not simply redundant with normalization; the model exploits them for activation scaling, as evidenced by the gain dynamics monitored during training.

Empirical Ablations on Dense and Sparse Architectures

In-depth ablation studies on GPT-style LMs (181M–1.29B parameters) examine sensitivity to the normalization axis, embedding constraints, and gain parameterization:

Normalization Axis: Row, column, and matrix (Frobenius) constraints yield nearly identical optimal losses; Frobenius is selected for maximal flexibility.

Figure 3: LR sweeps show normalization axis has negligible impact on final loss; Frobenius constraint is preferred for flexibility.

Gain Axis: Among scalar, row-wise, column-wise, and combined row+column gains, the combined variant significantly outperforms alternatives. Higher-rank gains provide no additional benefit.
Gain Parameterization: Direct, exponential, and smooth softplus maps perform similarly; softplus is marginally more stable and is adopted as the default.
Learning Rate Sensitivity: Matrix LR is dominant; embedding and gain group LRs show flat loss curves across orders of magnitude, further demonstrating the protocol's robustness.

Figure 4: LR sweeps confirm that per-row + per-column gains perform best; the choice of gain parameterization matters little, with softplus holding a slight edge.
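
The three gain parameterizations compared above can be sketched as maps from a raw, unconstrained parameter to a gain. The snippet is an illustration under assumptions: the names `GAIN_MAPS` and `inverse_softplus` are hypothetical, and the raw initial values are simply the ones that make every gain start at one, matching the unit initialization described earlier.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x)); always positive."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def inverse_softplus(y):
    """Return x such that softplus(x) == y, for moderate y > 0."""
    return math.log(math.expm1(y))

# name -> (map from raw parameter to gain, raw value that yields gain 1)
GAIN_MAPS = {
    "direct": (lambda p: p, 1.0),               # gain may cross zero
    "exp": (math.exp, 0.0),                     # positive, grows quickly
    "softplus": (softplus, inverse_softplus(1.0)),  # positive, linear growth
}

for name, (to_gain, raw_init) in GAIN_MAPS.items():
    print(name, raw_init, to_gain(raw_init))
```

The exponential and softplus maps keep gains positive by construction, while the direct map leaves the sign unconstrained.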

Scaling Behavior and Hyperparameter Transfer

One of the strongest empirical claims is that the optimal matrix LR under MD Decoupling is invariant to model width (and also depth given proper scaling), enabling zero-shot LR transfer across sizes. This is validated across width, depth, and joint scaling sweeps.

Figure 5: Optimal matrix LR stays fixed as model is scaled; transferability without retuning is demonstrated.

Relative weight updates and activation statistics confirm constant dynamics when scaling, consistent with theory. In addition, standard warmup, whose purpose is to tame large early updates, becomes unnecessary and even detrimental: dropping it improves loss both when training from initialization and after checkpoint re-warming.

Learning Rate Schedules: Direct Control Enabled by MD Decoupling

Training on the sphere makes the LR schedule directly control the relative weight update, so the schedule's shape matters more than it does under weight decay. The paper shows that linear decay and Warmup-Stable-Decay (WSD) produce distinct update profiles; annealing schedules need to be rethought for on-sphere optimization.

Figure 6: Relative weight update follows the LR schedule directly under MD Decoupling; schedule selection is critically important.
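
To make the comparison concrete, the sketch below implements the two schedule shapes. It assumes a warmup-free setting, so the WSD variant here keeps only the stable and decay phases; the 20% decay fraction, the peak LR, and the step count are placeholder values, not the paper's settings.

```python
def linear_decay(step, total_steps, peak_lr, final_lr=0.0):
    """Anneal linearly from peak_lr at step 0 toward final_lr at total_steps."""
    frac = step / total_steps
    return peak_lr + (final_lr - peak_lr) * frac

def wsd(step, total_steps, peak_lr, decay_frac=0.2, final_lr=0.0):
    """Hold peak_lr, then decay linearly over the last decay_frac of training.

    The warmup phase is omitted on purpose (assumed warmup-free training).
    """
    decay_steps = max(1, int(decay_frac * total_steps))
    decay_start = total_steps - decay_steps
    if step < decay_start:
        return peak_lr
    frac = (step - decay_start) / decay_steps
    return peak_lr + (final_lr - peak_lr) * frac

total, peak = 1000, 0.02
linear = [linear_decay(s, total, peak) for s in range(total)]
stable_decay = [wsd(s, total, peak) for s in range(total)]

# If the per-step relative update equals the LR, these sums are the total
# relative movement each schedule spends over the run.
print(sum(linear), sum(stable_decay))
```

Because the relative update follows the LR on the sphere, the two shapes differ not only in their final annealing but in how much total movement they allot to training.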

Large-Scale Validation on Mixture-of-Experts Models

MD Decoupling is benchmarked on DeepSeekMoE-style transformers (up to 6.7B total parameters, 810M active), showing persistent gains: MuonMD achieves the same loss as AdamW with roughly half the compute. Scaling-law fits show that the improvement is a downward level shift (reduced coefficient $A$) that leaves the scaling exponent $\alpha$ unchanged.

Figure 7: Decoupling magnitude/direction yields improved compute-efficiency for MoEs; MuonMD reaches AdamW's loss at half the compute required.
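
The link between a compute multiplier and a shift in the coefficient can be worked out under a simplifying assumption. The summary does not state the fitted functional form, so the pure power law $L(C) = A\,C^{-\alpha}$ below is an assumption, as are the subscripts and the symbol $k$ for the compute multiplier.

```latex
\[
A_{\mathrm{MD}} \left(\frac{C}{k}\right)^{-\alpha}
= A_{\mathrm{AdamW}}\, C^{-\alpha}
\quad\Longrightarrow\quad
\frac{A_{\mathrm{MD}}}{A_{\mathrm{AdamW}}} = k^{-\alpha}
\]
```

With a shared exponent $\alpha$, a constant multiplier $k \approx 2$ at every compute budget is therefore equivalent to scaling the coefficient by $2^{-\alpha}$, which is what a pure level shift looks like on a log-log plot.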

Batch-size scaling and checkpoint re-warming experiments corroborate practical feasibility at scale.

Optimizer Configuration and Computational Efficiency

The implementation is compatible with memory-efficient sharding and distributed setups. Throughput measurements verify negligible runtime overhead, especially for MuonMD ($<2\%$ slowdown); AdamMD's overhead shrinks rapidly as the global batch size increases.

Practical and Theoretical Implications

No Weight Decay/No Warmup: MD Decoupling obviates auxiliary hyperparameters, simplifying tuning and enabling direct schedule control.
Hyperparameter Transfer: LR invariance across width/depth provides a practical recipe for scaling; tuning on small models generalizes to much larger systems.
Architectural Compatibility: The fused-parameter approach ensures MD Decoupling integrates with standard architectures and optimizer stacks.
Research Directions: The precise role of learned magnitudes, optimal LR schedules on the sphere, loss-landscape sharpness, behavior in RL regimes, and efficacy in low-precision training are posed as open questions.

Relation to Prior Work

The method sits among several recent lines of work: spherical optimization (nGPT (Loshchilov et al., 2024), HyperP (Ren et al., 30 Mar 2026)), weight normalization (Salimans et al., 2016), scale-vector learning (Wang et al., 26 May 2026), and operator-norm constraints (Xie et al., 13 Jan 2026). Its distinguishing features are decoupled learning rates and explicit learnable gains inside the optimizer. Zero-shot LR transfer aligns with $\mu$P (Yang et al., 2022), but is achieved via decoupled direction/magnitude updates rather than tensor programs. Recent variants of Muon (Muown (Lion et al., 11 May 2026), AngularMuown (Hübler et al., 22 Jun 2026)) further corroborate the theoretical and empirical premises.

Conclusion

Magnitude–Direction Decoupling is an optimizer-centric approach to disentangling weight magnitude and direction, giving precise control over learning dynamics and enabling a robust, scalable recipe for large-scale pretraining. Empirical results across dense and sparse architectures show significant improvements in both stability and compute efficiency, with lightweight hyperparameter tuning and direct schedule control. The method aligns with theoretical advances in scale-invariant dynamics and opens avenues for further research into optimizer parametrization, schedule design, and interpretability of learned magnitudes.

Explain it Like I'm 14

Overview

This paper is about a simple idea to train neural networks more smoothly and predictably. It says: instead of adjusting a whole weight matrix (a big table of numbers in a neural network) all at once, treat two parts separately:

the direction the weights point, and
how big those weights are (their magnitude).

By controlling these two parts independently, training becomes more stable and often better, without needing extra tricks like weight decay or warmup.

Key Questions

The authors ask:

Why do current training methods mix up “how far” and “which way” weights move, and how does that cause problems?
Can we separate magnitude and direction during training so the learning rate directly controls the rotation (direction change) while magnitude is learned cleanly?
Does this help across different optimizers (like Adam and Muon) and for both small and large models, including Mixture-of-Experts (MoE) models?

Methods and Approach (Explained Simply)

Think of each weight matrix like lots of arrows:

Each arrow has a direction (where it points) and a length (how long it is). In math, that’s “direction” and “magnitude.”

What goes wrong today:

Most optimizers update the whole arrow at once. The same step rotates small arrows a lot but big arrows only a little. That means the learning rate doesn’t consistently control “how much we turn.”
Even when only direction matters (common if there’s a normalization layer after the weight), updates tend to accidentally make arrows longer over time. This can cause runaway growth unless you add fixes like weight decay and warmup.

The proposed fix: Magnitude–Direction (MD) Decoupling

Keep directions on a “hypersphere.” That’s just a fancy term for “all arrows are kept at a fixed length” so only their direction changes.
Add learnable magnitude “gains” for rows and columns of the matrix. Imagine per-row and per-column volume knobs that make certain parts louder or quieter. These control the scale of outputs in a fine-grained way.
Update direction and magnitude separately, each with its own learning rate:
- Direction updates are normalized and then projected back onto the sphere, so the relative change is set exactly by the learning rate.
- Magnitude gains are learned like regular parameters (often with Adam), giving the network the scale control it needs.
The model still sees a single fused weight tensor, so there’s no architectural change—this logic lives inside the optimizer.
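
Here is a small sketch of what one direction update could look like. It makes two assumptions that the summary does not spell out: "normalized" is taken to mean that the update is rescaled to the same length as the weight, and the weight is treated as one flat list of numbers. The function names are made up for the example.

```python
import math

def norm(vec):
    return math.sqrt(sum(x * x for x in vec))

def direction_step(w, update, lr):
    """One on-sphere step; the sphere radius is the current length of w."""
    radius = norm(w)
    scale = radius / norm(update)       # rescale update to the weight's length
    stepped = [wi - lr * scale * ui for wi, ui in zip(w, update)]
    shrink = radius / norm(stepped)     # project back onto the sphere
    return [s * shrink for s in stepped]

def angle(a, b):
    """Angle in radians between two vectors."""
    cos = sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))
    return math.acos(max(-1.0, min(1.0, cos)))

update = [0.0, 5.0]                     # the same raw update for both weights
small, big = [1.0, 0.0], [10.0, 0.0]    # same direction, different lengths

small_new = direction_step(small, update, lr=0.01)
big_new = direction_step(big, update, lr=0.01)
print(angle(small, small_new), angle(big, big_new))
```

In this toy case both arrows turn by the same angle, and neither changes length, which is the behavior the list above describes.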

Small glossary to make terms clearer:

Optimizer: the method that decides how to adjust weights to make the model better (e.g., Adam, Muon).
Learning rate (LR): how big a step the optimizer takes each update.
Weight decay: a rule that slowly shrinks weights to prevent them from growing too large.
Warmup: starting with small learning rates to avoid unstable big steps early in training.
Mixture-of-Experts (MoE): a model with many “experts” and a router that chooses a few experts for each input.

Main Findings and Why They Matter

Here are the key results, described in everyday terms:

More stable, predictable updates:
- On the sphere, the learning rate directly controls how much weights rotate. This removes the hidden coupling between direction and magnitude.
- Magnitude gains let the model naturally amplify or dampen specific rows/columns, which improves learning compared to only fixing norms.
Better performance across optimizers:
- MD Decoupling improves well-tuned baselines with both Adam and Muon.
- It reduces or removes the need for weight decay and warmup, simplifying training.
Learning rate “transfer” without retuning:
- With the sphere constraint, the optimal learning rate for direction stays almost the same when you make the model wider (more channels) or deeper (more layers). You can tune LR on a small model and reuse it on bigger ones.
Works at scale with MoE models:
- On large MoE models, the MD approach beats well-tuned AdamW and Muon.
- It reaches the AdamW loss with about 2× less compute, meaning it’s more efficient.
Practical training improvements:
- Warmup-free training is stable and often better, since early steps aren’t wasted at reduced LR.
- Resuming training from checkpoints behaves well (loss and gradients stay stable).
- The shape of the learning-rate schedule matters more on the sphere because the relative update follows the LR directly; smoother, gradual decay can be beneficial.

Implications and Impact

Simpler recipes: You can drop weight decay and warmup in many cases, making training setups cleaner.
Fewer retuning headaches: The same learning rate often works when scaling model width and depth, saving time and compute.
Better efficiency at scale: For big models, especially MoEs, you can hit the same loss with less compute—useful for training LLMs cost-effectively.
Broad applicability: The method is optimizer-agnostic, fits cleanly into existing pipelines, and keeps the model architecture unchanged, so it’s easy to try.

In short, treating “which way” and “how big” as separate, controllable parts during training makes neural networks learn more reliably. It’s a small change inside the optimizer that can yield big improvements across sizes and optimizers.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a single, consolidated list of what the paper leaves missing, uncertain, or unexplored, phrased to enable follow‑up work.

Optimizer-agnosticism: Validate MD Decoupling beyond Adam/Muon (e.g., Shampoo, AdEMAMix, SOAP, LAMB, SGD without normalized updates), and characterize when non-normalized updates break LR transfer on the sphere.
Learning-rate schedules on the sphere: Determine optimal on-sphere decay shapes (cosine, 1−sqrt, exponential, cyclical) and whether gains and directions should use different schedules or phases.
LR transfer theory: Provide conditions under which width/depth LR transfer holds across base optimizers and architectures, and quantify deviations when update normalization differs (e.g., Muon scaling choices, Adam variance estimates).
Training-length scaling: Derive and validate the exponent p in LR ∝ T^−p for MD across datasets and sizes (the paper uses p≈0.25 heuristically; Complete(d)P uses 0.5), and clarify how this interacts with the spherical constraint.
Batch-size scaling: Establish principled LR–batch scaling under MD (e.g., sqrt(k) or alternative) using gradient-noise-scale analysis in the spherical setting.
Normalization axis selection: Identify regimes (architectures, layers, tasks) where row-, column-, or Frobenius-sphere constraints outperform others; extend beyond transformer MLP/attention to convs and structured matrices.
Sphere radius choice: Move beyond “match initialization norm”—develop principled per-layer sphere radii (learned or scheduled), including adaptation over training and compatibility with different initializations (e.g., scaled output projections).
Direction-update geometry: Assess benefits/limits of projecting gradients onto the tangent space (Riemannian step) vs. simple projection for larger models and diverse losses, especially where radial components might matter.
Muon scaling factor: Provide a principled calibration of Muon’s update scaling (and its dependence on initialization and layer shapes) rather than fixed heuristics; automate robust per-layer calibration.
Embedding/LM-head constraints: Quantify how per-token unit-norm constraints affect rare-token representations, tied embeddings, and downstream tasks; compare forward RMSNorm vs. optimizer-sphere enforcement.
Gains design (expressivity): Determine when higher-rank gain factorizations (rank>1) yield benefits (if any) at scale or in specific layers (e.g., attention Q/K vs. FFN), and characterize compute/overfitting trade-offs.
Gains parameterization and LR: Systematically study gains’ parameterization (softplus/exp/unconstrained), positivity constraints near zero, and separate LR/schedule for gains vs. directions; assess effects on stability and generalization.
Redundancy with normalization layers: Clarify when per-row/column gains are redundant with normalization-layer gains (e.g., RMSNorm scale) and whether MD should disable or regularize one set to avoid over-parameterization.
Norm-free architectures: Test MD in settings without pervasive normalization (e.g., post-norm vs. pre-norm variants, or architectures explicitly lacking RMSNorm after blocks) and diagnose where scale invariance assumptions break.
Router-specific choices in MoE: Explore the best normalization axis, scaling factors, and schedules for routers; analyze impact on load balance, routing entropy, and expert collapse across different routing schemes.
Warmup boundaries: Identify conditions where warmup might still be needed (very large models, cold momentum, small batches) and provide diagnostics for safe warmup-free starts under MD.
Extremely long training stability: Monitor gain dynamics over very long runs (norm drift, extreme gain values) and determine whether additional constraints (clipping, regularizers on gains) are required without weight decay.
Generalization and loss landscape: Measure how MD affects sharpness/flatness, spectral properties (e.g., singular/spectral norms), calibration, and OOD robustness; link to generalization performance beyond pretraining loss.
Downstream performance: Verify that loss improvements translate to benchmark gains (reasoning, multilingual, code, retrieval) and analyze which components (sphere, gains, schedules) contribute most downstream.
Interaction with regularizers: Revisit weight decay’s role under MD (e.g., decay on gains only), and study interactions with dropout, label smoothing, Stochastic Depth, and other regularization techniques.
Finetuning/PEFT compatibility: Determine how MD interacts with LoRA/QLoRA/adapters and whether to constrain base weights, adapters, or both; provide recipes for instruction tuning and SFT with MD.
Modalities and architectures: Evaluate MD in CNNs, ViTs, diffusion/speech models, and RL; assess whether constraints/gains need modality-specific choices (e.g., conv kernel axes, multi-head attention sharing).
Mixed precision and systems: Quantify throughput/latency and memory overheads from MD (splitting/reassembly, optimizer state) across hardware and distributed setups; test numerical stability in fp16/bfloat16.
Hyperparameter interactions: Explore joint tuning of direction LR, gain LR, and schedule shapes; assess sensitivity and automated selection strategies (e.g., adaptive schedules tied to relative update statistics).
Theoretical convergence: Develop formal analysis of MD dynamics for scale-invariant vs. non-invariant losses, including stationary points, convergence rates, and coupling between magnitude and direction under stochastic gradients.
Pathological optimizer interactions: Identify cases where certain preconditioners or grafting techniques interact poorly with sphere projection or gain learning, and propose safeguards or diagnostics.

Practical Applications

Immediate Applications

Below are concrete, deployable uses of Magnitude–Direction (MD) Decoupling that follow directly from the paper’s results. Each item lists sectors, what to build/do, and key assumptions or dependencies.

Training cost and energy reduction for LLMs and MoEs — Sectors: software/AI infrastructure, energy, finance
- What to do: Wrap existing optimizers (Adam, Muon) with MD Decoupling to fix weight norms (Frobenius sphere) and add learnable per-row/per-column gains; remove weight decay and warmup. Apply the paper’s default recipe: Frobenius constraint for matrices, unit L2 rows for embeddings and LM head, softplus-parametrized gains, and separate LRs for direction vs. gains.
- Tools/products: md-optimizer wrappers for PyTorch/JAX; Hugging Face Trainer plugin; config templates to disable weight decay and warmup.
- Assumptions/dependencies: Base optimizer provides normalized updates (Adam/Muon as used in paper). Use residual downscaling (e.g., α = 1/L) for depth. Adopt paper’s Muon scaling factor and router row-normalization for MoE.
Compute-efficient MoE training (up to ~2× less compute to match AdamW loss) — Sectors: software/AI infrastructure, energy, finance
- What to do: For DeepSeekMoE-style models, adopt MuonMD (Muon + MD Decoupling) with the paper’s router normalization (rows) and LR transfer rule; scale LR with training length by ≈ T^−0.25 as in the paper.
- Tools/products: MoE training recipe cards; gating/router configs with on-sphere updates.
- Assumptions/dependencies: Results shown on 270M–810M active parameters; downstream task parity assumed via loss–performance correlation.
Hyperparameter tuning compression via width-wise LR transfer — Sectors: software/AI infrastructure, academia
- What to do: Tune the matrix LR once on a small model; reuse it unchanged across widths (and with a simple rule across depth). Cut LR sweeps for scaled models.
- Tools/products: AutoML/HParam services that lock a matrix LR for families of widths; registries that store “width-stable” LRs.
- Assumptions/dependencies: Use the sphere constraint so that the relative update equals the LR; keep embedding/output LRs separate if still on Adam.
Warmup-free training and safe resume/continual training — Sectors: software/AI infrastructure, healthcare, finance, government
- What to do: Remove warmup from pretraining, SFT, and staged training; resume from checkpoints without re-warmup. Monitor gradient norms and relative updates.
- Tools/products: Training pipelines without warmup blocks; “re-warming” tests; dashboards for angular/relative updates.
- Assumptions/dependencies: For extremely large models with Adam, a very short warmup might still help early momentum states.
More stable magnitude control without weight decay — Sectors: software/AI infrastructure
- What to do: Replace weight decay with on-sphere direction updates and learned gains; reduce reliance on ad-hoc fixes (e.g., QK-clipping).
- Tools/products: Config simplifications; training checklists that remove decay and clipping unless needed.
- Assumptions/dependencies: Presence of normalization layers makes many losses scale-invariant; gains restore fine-grained scaling the sphere removes.
Embedding stabilization via per-row unit-norm — Sectors: software/AI infrastructure
- What to do: Enforce unit L2 norm per token embedding vector (and per LM-head row). Optionally remove or simplify post-embedding RMSNorm.
- Tools/products: Embedding-normalization hooks in the optimizer; lightweight validators for embedding RMS and angular updates.
- Assumptions/dependencies: Follow the paper’s fused-weight implementation to avoid extra memory traffic.
Training schedule modernization for on-sphere optimization — Sectors: software/AI infrastructure, academia
- What to do: Prefer gradual annealing (e.g., linear decay) over Warmup–Stable–Decay, since on-sphere relative updates mirror the nominal LR exactly. Start simple; monitor relative updates per layer.
- Tools/products: LR schedule components “for-sphere”; per-layer relative-update telemetry.
- Assumptions/dependencies: No weight decay; relative updates should match schedule by design.
Reproducible baselines and pedagogy — Sectors: academia, education
- What to do: Use MD Decoupling to demonstrate magnitude–direction interference and angular updates in teaching labs; publish width-transferable LRs to aid replication.
- Tools/products: Notebooks showing toy scale-invariant losses, angular-change tracking, and LR transfer; baseline configs for 100M–1B parameter models.
- Assumptions/dependencies: Same architecture choices as in the paper (RMSNorm, residual scaling) ease replication.
ESG and procurement quick wins — Sectors: policy, enterprise IT
- What to do: Update internal training standards to include on-sphere optimizers and warmup removal to reduce energy per experiment; require reporting of relative-update telemetry in RFPs.
- Tools/products: Policy addenda for AI training; internal green-AI scorecards that credit on-sphere training and reduced tuning.
- Assumptions/dependencies: Organizations can adopt optimizer-level changes without vendor lock-in.
Domain model fine-tuning on limited budgets — Sectors: healthcare, legal, education, SMBs
- What to do: Fine-tune or continually train domain-specific LLMs using MD Decoupling to reduce divergence risk and compute. Resume runs safely after pauses.
- Tools/products: Turnkey fine-tuning scripts with MD optimizers; cost calculators that estimate savings vs. AdamW+decay+warmup.
- Assumptions/dependencies: Pretraining loss improvements translate to target tasks; check with lightweight evals.
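
The per-row unit-norm constraint mentioned for embeddings and LM-head rows in the list above can be sketched in a few lines. This is an illustration only: the function name, the `eps` guard against all-zero rows, and the toy table are assumptions, and a real optimizer hook would operate on tensors rather than nested lists.

```python
import math

def normalize_rows(table, eps=1e-12):
    """Rescale every row of table (a list of lists) to unit L2 norm."""
    out = []
    for row in table:
        length = math.sqrt(sum(x * x for x in row))
        out.append([x / max(length, eps) for x in row])
    return out

# One row per token; rows start with very different lengths.
embeddings = [[3.0, 4.0], [0.5, 0.0], [-1.0, 1.0]]
unit = normalize_rows(embeddings)

for row in unit:
    print(row, math.sqrt(sum(x * x for x in row)))
```

Applying this after each optimizer step would keep every token vector on its own unit sphere, so only its direction is learned.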

Long-Term Applications

These opportunities are plausible but need further validation, scaling, or productization beyond the paper’s experiments.

RL and control policy training on spheres — Sectors: robotics, autonomy, gaming
- What to explore: Apply MD Decoupling to policy/value networks where update predictability matters; study interactions with nonstationary gradients and entropy regularization.
- Dependencies: Empirical validation in on-policy/off-policy RL; stability under high variance updates.
Multi-modal and diffusion models — Sectors: media, healthcare imaging, industrial inspection
- What to explore: Extend MD Decoupling to vision backbones, VLMs, and diffusion U-Nets; check if angular-control improves scaling and reduces warmup-specific heuristics.
- Dependencies: Benchmarks on image/text/video tasks; adaptation of sphere/scale rules to convolutional and attention–conv blocks.
Hardware and compiler fusion — Sectors: semiconductors, cloud
- What to explore: Fuse “update + project-to-sphere + reassemble” into accelerator kernels; Triton/XLA passes for in-optimizer factorization; memory-aware fused ops.
- Dependencies: Kernel engineering; verification that fused steps retain numerical stability and speed.
Automated scaling rules across training length, batch size, and MoE granularity — Sectors: software/AI infrastructure
- What to explore: Move from the paper’s T^−0.25 heuristic toward principled, theory-backed schedules for tokens, depth, width, and expert counts.
- Dependencies: Theory of rotational equilibrium on spheres; broad-scale ablations; integration with Complete(d)P-like recipes.
Higher-rank and structured magnitude gains — Sectors: software/AI infrastructure, research
- What to explore: Learn rank-k gain matrices or structured gains (e.g., Kronecker factorizations) when rank-1 (row×column) isn’t enough.
- Dependencies: Evidence beyond initial null results; regularization to prevent overfitting/instability; memory–compute trade studies.
Standardization and governance of optimizer telemetry — Sectors: policy, enterprise IT
- What to explore: Define metrics (relative update, angular change, gain dispersion) as required reporting for large training runs to reduce silent failures and improve comparability.
- Dependencies: Community consensus; integration in logging standards and training cards.
On-device/private personalization — Sectors: mobile, healthcare, finance
- What to explore: Use warmup-free, predictable updates for brief on-device adaptation (privacy-preserving personalization) without instability spikes.
- Dependencies: Efficient mobile kernels; validation under intermittent power/thermal constraints.
Safety and reliability tooling — Sectors: safety, compliance
- What to explore: Use sphere-constrained dynamics to detect out-of-family training behavior (e.g., runaway magnitudes) and to bound angular changes per step for risk controls.
- Dependencies: Thresholds correlated with undesirable behaviors; interfaces for halting/rolling back.
Curriculum and education content — Sectors: education
- What to explore: Develop modules that teach geometry-aware optimization (spherical updates, angular LRs) and the difference between magnitude and direction learning.
- Dependencies: Stable open-source implementations and visualizations.
Interoperability with advanced optimizers — Sectors: software/AI infrastructure
- What to explore: Combine MD Decoupling with Shampoo/SOAP or Riemannian optimizers; compare shape-aware preconditioners on-sphere vs. off-sphere.
- Dependencies: Benchmarks across scales; kernel support for mixed-precision preconditioning.

Cross-cutting assumptions and dependencies to watch

Architectural context: Results are strongest for transformer-style models with normalization layers; embeddings and LM-head rows held at unit norm.
Optimizer behavior: Base optimizer should produce normalized-like updates (Adam/Muon); Muon uses a scale factor that should match the target weight norm.
Schedules: On-sphere relative updates mirror LR exactly; decay must be chosen deliberately; training-length scaling exponent (≈0.25 suggested) is not yet settled.
Integration details: Use fused-weight implementation to avoid memory overhead; project back to the sphere each step; use softplus (or similar) for positive gains.
External validation: The paper reports pretraining loss and scaling benefits; downstream-task confirmation is advisable before high-stakes deployment.
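
The integration points listed above (a fused weight, projection back to the sphere each step, softplus for positive gains) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's algorithm: the names `fuse` and `md_step` are hypothetical, the base optimizer is replaced by a plain normalized-gradient step rather than Adam or Muon, and the gains are updated with simple gradient descent.

```python
import numpy as np

def softplus(x):
    # Numerically stable softplus: log(1 + exp(x)), always positive.
    return np.logaddexp(0.0, x)

def sigmoid(x):
    # Derivative of softplus.
    return 1.0 / (1.0 + np.exp(-x))

def fuse(direction, row_raw, col_raw):
    """Fused weight seen by the model: positive row/column gains times the direction."""
    return softplus(row_raw)[:, None] * direction * softplus(col_raw)[None, :]

def md_step(direction, row_raw, col_raw, grad_w, lr_dir, lr_gain, target_norm):
    """One illustrative step, given grad_w = dLoss/dW for the fused weight W."""
    r, c = softplus(row_raw), softplus(col_raw)
    # Chain rule through W = r[:, None] * D * c[None, :].
    grad_dir = r[:, None] * grad_w * c[None, :]
    gd = grad_w * direction
    grad_row_raw = (gd @ c) * sigmoid(row_raw)
    grad_col_raw = (r @ gd) * sigmoid(col_raw)
    # Assumption: stand-in for a normalized base-optimizer update (not Adam/Muon).
    update = grad_dir * (target_norm / (np.linalg.norm(grad_dir) + 1e-12))
    new_dir = direction - lr_dir * update
    # Project back onto the fixed-norm Frobenius sphere.
    new_dir *= target_norm / np.linalg.norm(new_dir)
    # Gains get their own learning rate, separate from the direction.
    return new_dir, row_raw - lr_gain * grad_row_raw, col_raw - lr_gain * grad_col_raw

rng = np.random.default_rng(0)
D = rng.standard_normal((4, 3))
D /= np.linalg.norm(D)                      # start on the unit Frobenius sphere
row_raw, col_raw = np.zeros(4), np.zeros(3)
G = rng.standard_normal((4, 3))             # pretend gradient w.r.t. the fused weight
D, row_raw, col_raw = md_step(D, row_raw, col_raw, G, 0.1, 0.01, 1.0)
print(np.linalg.norm(D), fuse(D, row_raw, col_raw).shape)
```

Because the direction is renormalized after every step, the size of its update relative to the weight is set by `lr_dir` rather than by a drifting weight norm, while `lr_gain` controls how quickly the magnitudes move.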


Glossary

AdEMAMix: An Adam variant whose update mixes a fast and a slow exponential moving average (EMA) of past gradients. "the weight update can be treated as a black box, so it naturally fits different optimizers (AdEMAMix, Muon, Shampoo, \dots)."
Adam: A widely used adaptive gradient optimizer that normalizes updates by running estimates of first and second moments. "Beyond Adam~\citep{kingma2014adam}, matrix-aware optimizers such as Shampoo~\citep{gupta2018shampoo}, SOAP~\citep{vyas2024soap}, and Muon~\citep{jordan2024muon} improve the update by accounting for the geometry of the weight space"
AdamW: A variant of Adam with decoupled weight decay for better regularization and scale control. "reaching AdamW's loss with ${\sim}2\times$ less compute"
Angular update: The change in direction (angle) of a weight tensor after an optimizer step, often approximated by the relative update size. "We can measure the directional change caused by an optimizer update $\Delta W$ through the angular update $\angle(W, W+\Delta W)$"
Complete(d)P: A hyperparameter parameterization scheme for scaling learning rate and weight decay across model and data scales. "we follow the Complete(d)P~\citep{dey2025completep,mlodozeniec2025completed} parametrization to set the LR and weight decay for all parameter groups"
DeepSeekMoE: A Mixture-of-Experts transformer architecture used to evaluate scaling behavior. "We use a DeepSeekMoE-style architecture~\citep{dai2024deepseekmoe}"
Effective LR: A combined quantity (e.g., $\sqrt{\eta\lambda}$) that governs dynamics under certain regimes like rotational equilibrium. "motivated by intuitions from rotational equilibrium~\citep{kosson2023rotational}, where the effective LR $\sqrt{\eta\lambda}$ governs the dynamics."
Frobenius norm: The square root of the sum of squared entries of a matrix, used to define matrix magnitude. "Throughout, unless otherwise noted, $\lVert\cdot\rVert$ is the Frobenius norm"
Frobenius sphere: The set of matrices constrained to a fixed Frobenius norm, used to control relative updates. "AdamH/MuonH~\citep{muonh2026} constrain the hidden weights of LLMs to a fixed Frobenius sphere"
GQA: Grouped-Query Attention; an attention variant in which groups of query heads share key and value heads, reducing key-value memory and compute. "each with head-dimension 128, GQA~\citep{ainslie2023gqa}, QK-norm~\citep{dehghani2023scaling,wortsman2023small}, and Sandwich Norm~\citep{ding2021cogview,kim2025peri} with RMSNorm~\citep{zhang2019rmsnorm}."
HyperP: A method extending fixed-norm training to transfer learning rates across multiple axes (width, depth, tokens, MoE granularity). "HyperP~\citep{ren2026rethinking} builds on top of this and investigates how to achieve LR transfer across width, depth, training tokens, and MoE granularity."
Hypersphere: The manifold of vectors (or matrices) with fixed norm; used to constrain directions during training. "a fixed-norm direction on a hypersphere and learnable per-row and per-column magnitude gains"
LionAR: A method emphasizing relative-update control that demonstrated LR transfer from a spherical/relative perspective. "the same effect was shown with LionAR in earlier work~\citep{kosson2025weight}"
Mixture-of-Experts (MoE): A sparse model architecture that routes tokens to a subset of expert networks per layer. "large Mixture-of-Experts (MoE) models."
Muon: A matrix-aware optimizer that approximately orthogonalizes the momentum update of each weight matrix, typically with a few Newton--Schulz iterations. "Beyond Adam~\citep{kingma2014adam}, matrix-aware optimizers such as Shampoo~\citep{gupta2018shampoo}, SOAP~\citep{vyas2024soap}, and Muon~\citep{jordan2024muon}"
MuonMD: The combination of Muon with Magnitude–Direction Decoupling, including normalization choices tailored to MoE routers. "For MuonMD we additionally normalize the routers along the expert axis (rows)"
Newton--Schulz: An iterative method used to compute matrix inverse square roots, here referenced for orthogonalization quality. "assuming proper orthogonalization through Newton--Schulz"
nGPT: A training approach that fixes weight norms on a sphere without weight decay and bundles architectural changes. "Our fixed-norm motivation is closely related to nGPT~\citep{loshchilov2024ngpt}"
QK-clip: A stabilization trick that clips query–key interactions to prevent runaway growth. "warmup~\citep{goyal2017accurate,xiong2020layer} and fixes like QK-clip~\citep{kimiteam2025kimi} patch similar symptoms of runaway growth."
QK-norm: A normalization technique applied to queries and keys in attention to stabilize training. "GQA~\citep{ainslie2023gqa}, QK-norm~\citep{dehghani2023scaling,wortsman2023small}, and Sandwich Norm"
Relative weight update: The size of the parameter change relative to its magnitude, which directly controls angular update on the sphere. "The relative weight update is then determined by the LR at every step"
Riemannian gradient: The gradient defined with respect to a manifold’s geometry, used when optimizing on spheres. "and supports it with Riemannian gradient theory."
RMS grafting: A technique that matches the root-mean-square scale of an update to another update or target; used in Muon variants. "which we found to be noticeably better than the RMS grafting when sweeping"
RMSNorm: Root-Mean-Square Layer Normalization that scales activations by their RMS without mean-centering. "with RMSNorm~\citep{zhang2019rmsnorm}"
Rotational Equilibrium: A framework describing how weight norms and updates balance under decay and noise, explaining norm growth. "Prior work on Rotational Equilibrium~\citep{kosson2023rotational} showed how weight decay partially fixes this"
Sandwich Norm: A normalization placement strategy that inserts normalization around sublayers to stabilize training. "Sandwich Norm~\citep{ding2021cogview,kim2025peri}"
Scale-invariant: A property where scaling weights does not change the loss, making only direction matter. "the loss is scale-invariant: This means that only the direction of the weights affects the output, not the magnitude."
Scaling laws: Empirical relations (often power laws) between performance and compute/model/data scale. "Scaling laws for sparse MoEs, where the improvement holds across a wide range of compute."
Shampoo: A preconditioning optimizer that rescales each weight matrix's gradient with left and right (Kronecker-factored) statistics matrices. "Shampoo~\citep{gupta2018shampoo}"
SOAP: A matrix-aware optimizer that runs Adam in the eigenbasis of Shampoo's preconditioner. "SOAP~\citep{vyas2024soap}"
Spectral norm: The largest singular value of a matrix, used to control or analyze stability and drift. "motivated by Muon's tendency to let the spectral norm drift upward"
Top-2 routing: An MoE mechanism where each token is routed to the two most relevant experts. "with 64 experts (1 shared) and top-2 routing."
Warmup: An initial phase of gradually increasing the learning rate to avoid instability at the start of training. "Like Muon, we no longer need warmup, since the large early updates it exists to prevent never appear."
Warmup-Stable-Decay (WSD): A learning-rate schedule with phases of warmup, constant LR, and decay. "We compare the established recipe of a Warmup-Stable-Decay (WSD) schedule"
Weight decay: A regularization technique that penalizes large weights, often implemented as decoupled L2 shrinkage. "standard optimizers struggle to learn the magnitude of weight matrices and lean on weight decay to keep learning the direction"
Weight Normalization: A reparameterization that separates weight magnitude and direction to improve conditioning. "The factorization echoes classic Weight Normalization~\citep{salimans2016weight}"
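
The Scale-invariant and RMSNorm entries above can be checked numerically. The snippet below is a small sketch, assuming a linear map followed by an RMSNorm with no learned gain and `eps = 0`; the helper `rms_norm` is written here for illustration and is not taken from the paper. Multiplying the weight by a positive constant leaves the normalized output unchanged, so only the direction of the weight matters.

```python
import numpy as np

def rms_norm(x, eps=0.0):
    """Divide each row of x by its root-mean-square (no learned gain, no mean-centering)."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8))   # two input vectors
w = rng.standard_normal((8, 5))   # weight matrix feeding the normalization

out = rms_norm(x @ w)
out_scaled = rms_norm(x @ (3.0 * w))   # same direction, three times the magnitude
print(np.allclose(out, out_scaled))    # True when eps == 0
```

With a nonzero `eps` the invariance holds only approximately, which is one reason practical scale control still depends on how normalization is implemented.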
