A Proof of Learning Rate Transfer under μP (2511.01734v1)

Published 3 Nov 2025 in stat.ML, cs.AI, cs.CL, and cs.LG

Abstract: We provide the first proof of learning rate transfer with width in a linear multi-layer perceptron (MLP) parametrized with μP, a neural network parameterization designed to "maximize" feature learning in the infinite-width limit. We show that under μP, the optimal learning rate converges to a non-zero constant as width goes to infinity, providing a theoretical explanation to learning rate transfer. In contrast, we show that this property fails to hold under alternative parametrizations such as Standard Parametrization (SP) and Neural Tangent Parametrization (NTP). We provide intuitive proofs and support the theoretical findings with extensive empirical results.

Summary

  • The paper demonstrates that the optimal learning rate under μP converges to a constant, enabling consistent hyperparameter transfer from small to large models.
  • It employs a polynomial analysis of training loss in linear MLPs, revealing a dominant linear term that simplifies learning rate tuning as width increases.
  • Empirical results confirm that μP preserves feature learning and stable training dynamics, unlike standard and neural tangent parametrizations.

Proof and Implications of Learning Rate Transfer under μP

Introduction and Motivation

The paper establishes a rigorous theoretical foundation for learning rate (LR) transfer in neural networks parametrized with Maximal Update Parametrization (μP). The central result is that, for deep linear Multi-Layer Perceptrons (MLPs) trained with gradient descent, the optimal learning rate converges to a non-zero constant as the model width n → ∞. This property, termed "learning rate transfer," enables hyperparameter tuning on small models and direct transfer to larger models, significantly reducing computational cost and complexity in large-scale neural network training. The analysis contrasts μP with Standard Parametrization (SP) and Neural Tangent Parametrization (NTP), showing that only μP supports robust LR transfer.

Figure 1: Optimal learning rate as a function of model width, demonstrating convergence to a non-zero constant under μP and vanishing optimal LR under SP.

Theoretical Framework

Neural Parametrizations

The paper formalizes neural parametrizations as explicit scaling rules for initialization and learning rate with respect to model width. For μP, the output layer V is initialized with variance n^{-2}, and the learning rate is held constant for gradient descent (or scaled as η n^{-1} for Adam). In contrast, SP uses variance n^{-1} for V and does not scale the learning rate, leading to suboptimal training dynamics as width increases.
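
To make these scaling rules concrete, here is a minimal NumPy sketch (function names and shapes are ours, not the paper's code) of how the quoted initialization variances and learning-rate scaling differ between μP and SP for the linear MLP f(x) = Vᵀ W_L ⋯ W_1 W_0 x.

```python
import numpy as np

def init_linear_mlp(width, in_dim, depth, parametrization="muP", seed=0):
    """Initialize f(x) = V^T W_L ... W_1 W_0 x following the width-scaling rules
    quoted in this summary (an illustrative sketch, not the paper's code)."""
    rng = np.random.default_rng(seed)
    W0 = rng.normal(scale=in_dim ** -0.5, size=(width, in_dim))      # W0 ~ N(0, 1/d)
    Ws = [rng.normal(scale=width ** -0.5, size=(width, width))       # Wl ~ N(0, 1/n)
          for _ in range(depth)]
    if parametrization == "muP":
        V = rng.normal(scale=1.0 / width, size=width)                # V ~ N(0, n^-2)
    elif parametrization == "SP":
        V = rng.normal(scale=width ** -0.5, size=width)              # V ~ N(0, n^-1)
    else:
        raise ValueError(f"unknown parametrization: {parametrization}")
    return W0, Ws, V

def scaled_learning_rate(base_lr, width, optimizer="gd", parametrization="muP"):
    """muP keeps the GD learning rate width-independent and scales it as n^-1
    for Adam; SP applies no width-dependent scaling."""
    if parametrization == "muP" and optimizer == "adam":
        return base_lr / width
    return base_lr
```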

Definition of Learning Rate Transfer

Learning rate transfer is defined as the convergence in probability of the optimal learning rate η_n for width n to a deterministic constant η_∞ > 0 as n → ∞. This property is crucial for practical scalability, as it allows hyperparameter settings to remain stable across model sizes.
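
Spelled out in the notation used elsewhere on this page (see the glossary entries for argmin and convergence in probability), the definition reads:

\eta_n^{(t)} \in \mathrm{argmin}_{\eta > 0}\, L_n^{(t)}(\eta), \qquad \eta_n^{(t)} \;\overset{\mathbb{P}}{\longrightarrow}\; \eta_\infty^{(t)} > 0 \quad \text{as } n \to \infty.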

Main Results

Polynomial Structure of Training Loss

The analysis leverages the fact that, for linear MLPs, the training loss after t steps can be expressed as a polynomial in the learning rate η, with coefficients determined by the random initialization. Under μP, higher-order coefficients vanish as n increases, leaving a dominant linear term. This leads to a quasi-quadratic loss landscape in η and enables precise characterization of the optimal learning rate.
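
As a hedged sketch of what this looks like at t = 1 (the squared-error loss with a 1/(2m) normalization is our convention, and f^{(0)}(x) → 0 under μP per the glossary's mean-field initialization entry), the limiting one-step loss takes the quadratic form

L_\infty^{(1)}(\eta) \;=\; \frac{1}{2m}\,\Bigl\| y - \eta\,\frac{L}{m}\,K y \Bigr\|^2,

whose unique minimizer over η > 0 is exactly the closed-form value given in the next subsection.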

Convergence of Optimal Learning Rate

The paper proves that, under μP, the optimal learning rate η_n^{(t)} converges to a non-zero constant η_∞^{(t)} at each training step t, with convergence rate O(n^{-1/2}). The limiting value is given by a closed-form expression involving the input Gram matrix K and target vector y:

\eta_\infty^{(1)} = \frac{m}{L}\,\frac{y^\top K y}{\|K y\|^2}

where m is the dataset size and L is the network depth.
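
For a quick numerical check of this formula, the sketch below (function name and toy data are ours; K = X Xᵀ / d follows the glossary's Gram-matrix definition) evaluates the closed form on synthetic inputs.

```python
import numpy as np

def eta_infty_step1(X, y, depth):
    """Limiting optimal LR at t = 1 under muP: (m / L) * (y^T K y) / ||K y||^2,
    with K = X X^T / d the normalized input Gram matrix."""
    m, d = X.shape
    K = (X @ X.T) / d
    Ky = K @ y
    return (m / depth) * float(y @ Ky) / float(Ky @ Ky)

# Toy usage with synthetic data (values are illustrative only).
rng = np.random.default_rng(0)
X, y = rng.normal(size=(16, 8)), rng.normal(size=16)
print(eta_infty_step1(X, y, depth=3))
```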

Figure 2: Train loss as a function of learning rate at t = 5 and t = 10 for both μP and SP, illustrating stable optimal LR under μP and shifting optimum under SP.

Failure of LR Transfer in SP and NTP

For SP, the optimal learning rate vanishes as n → ∞, and for NTP, it diverges. This is attributed to the scaling of the output layer and the lack of feature learning in the infinite-width limit, resulting in either vanishing or exploding feature updates.

Generalization to Arbitrary Training Steps

The proof is extended to arbitrary training steps t, showing that the polynomial structure persists and the optimal learning rate continues to converge to a non-zero constant under μP, provided the limiting loss has a unique minimizer.

Empirical Validation

Extensive experiments confirm the theoretical predictions. Linear and non-linear MLPs (with ReLU activation), varying depths, and different optimizers (SGD, Adam) all exhibit robust LR transfer under μP. The convergence rate of the optimal LR matches theoretical expectations for moderate widths and accelerates for larger widths.
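
The experiment below is a minimal reproduction sketch in NumPy (our code, with arbitrary toy dimensions and a 1/(2m) loss convention): it trains the inner layers of a linear MLP under μP with full-batch gradient descent, keeping W_0 and V frozen as in the analyzed setting, and sweeps the learning rate at several widths to check that the minimizer stabilizes.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, L = 16, 8, 3
X = rng.normal(size=(m, d))
y = rng.normal(size=m)

def train_loss_after_t_steps(n, eta, t, seed=1):
    """Full-batch GD on a linear MLP f(x) = V^T W_L ... W_1 W_0 x under muP,
    training W_1..W_L with W_0 and V frozen (the setting described in this
    summary); the 1/(2m) loss normalization is our assumption."""
    init = np.random.default_rng(seed)
    W0 = init.normal(scale=d ** -0.5, size=(n, d))                      # W0 ~ N(0, 1/d)
    Ws = [init.normal(scale=n ** -0.5, size=(n, n)) for _ in range(L)]  # Wl ~ N(0, 1/n)
    V = init.normal(scale=1.0 / n, size=n)                              # V ~ N(0, n^-2)

    def forward():
        H = [W0 @ X.T]                          # H[i] is the input to W_{i+1}, shape (n, m)
        for W in Ws:
            H.append(W @ H[-1])
        return H, V @ H[-1]                     # hidden states and predictions f, shape (m,)

    for _ in range(t):
        H, f = forward()
        r = (f - y) / m                         # dLoss/df for Loss = ||f - y||^2 / (2m)
        u, grads = V.copy(), [None] * L         # u accumulates W_L^T ... W_{i+2}^T V
        for i in range(L - 1, -1, -1):
            grads[i] = np.outer(u, H[i] @ r)    # rank-1 gradient of W_{i+1}
            u = Ws[i].T @ u
        for i in range(L):
            Ws[i] -= eta * grads[i]             # constant (width-independent) LR under muP

    _, f = forward()
    return 0.5 * np.mean((f - y) ** 2)

# Sweep the LR at several widths; under muP the minimizer should stabilize with n.
etas = np.linspace(0.1, 8.0, 80)
for n in (64, 256, 1024):
    losses = [train_loss_after_t_steps(n, eta, t=1) for eta in etas]
    print(f"width {n:5d}: best eta ~ {etas[int(np.argmin(losses))]:.2f}")

# Paper's t = 1 closed form for rough comparison (exact agreement depends on conventions).
K = X @ X.T / d
print("closed form:", (m / L) * float(y @ K @ y) / float(np.linalg.norm(K @ y) ** 2))
```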

Figure 3: Train loss as a function of learning rate at t = 20 for linear and ReLU MLPs of varying depth, showing consistent LR transfer under μP.

Figure 4: Train loss as a function of learning rate at t = 100 for a deep ReLU MLP trained with Adam, demonstrating LR transfer near convergence.

Practical and Theoretical Implications

Hyperparameter Tuning Efficiency

The results provide a rigorous justification for the empirical practice of tuning learning rates on small models and transferring them to large models when using μ\muP. This dramatically reduces the computational burden of hyperparameter search in large-scale neural network training.

Feature Learning in the Infinite-Width Limit

The analysis clarifies that μP is unique in preserving feature learning as width increases, in contrast to SP and NTP, which either suppress or exaggerate feature updates. This has implications for the design of scalable neural architectures and the theoretical understanding of training dynamics.

Extension to Non-Linear Networks and Other Optimizers

While the proofs are restricted to linear MLPs and gradient descent, empirical results suggest that LR transfer under μP extends to non-linear architectures and optimizers such as Adam. Future work may generalize the theoretical framework to these cases.

Scaling Laws and Depth Dependence

The limiting optimal learning rate depends on network depth, suggesting that depth-aware versions of μP (e.g., Depth-μP) may be necessary for optimal scaling in very deep networks.

Limitations and Future Directions

The current analysis is limited to linear networks and does not address the full complexity of non-linear activations or adaptive optimizers. Extending the proof techniques to these settings will require new mathematical tools, particularly for controlling large-width deviations and non-quadratic loss landscapes. Additionally, the empirical observation of accelerated convergence rates for very large widths warrants further investigation.

Conclusion

This work provides the first rigorous proof of learning rate transfer under μP, establishing that the optimal learning rate converges to a non-zero constant as model width increases. The results have significant practical implications for scalable neural network training and deepen the theoretical understanding of feature learning in the infinite-width regime. The findings motivate further research into generalizing LR transfer to non-linear networks, alternative optimizers, and more complex architectures.


Explain it Like I'm 14

Overview

This paper studies a simple but important question in training neural networks: how should we choose the learning rate when we make a network wider? The authors prove that, under a specific way of setting up a network called μP (Maximal Update Parametrization), the best learning rate stays roughly the same even as the network gets very wide. This is called “learning rate transfer.” They also show that this helpful property does not hold under other common setups.

Key Objectives

The paper focuses on three main goals:

  • Explain, with a proof, why the best learning rate doesn’t need to change as we make a network wider if we use μP.
  • Show that this “learning rate transfer” fails under other ways of setting up neural networks (like the Standard Parametrization, SP, and Neural Tangent Parametrization, NTP).
  • Support the theory with experiments on different network sizes, depths, and training setups.

Methods and Approach

Think of a neural network as a machine with layers that transform inputs into outputs. Two important ideas here:

  • Width: how many “units” each layer has. Making the network wider means adding more units per layer.
  • Learning rate: a knob that controls how big each training step is. If it’s too big, training can explode; if it’s too small, training is slow or stalls.

What the authors did:

  • They studied a simple kind of network called a “linear MLP.” It has multiple layers, but no activation functions like ReLU. This makes math cleaner.
  • They showed that, after one training step, the network’s loss (how wrong it is) can be written as a polynomial in the learning rate. A polynomial is just a sum like a×(LR) + b×(LR)² + c×(LR)³, and so on.
  • Under μP, most of the higher powers of the learning rate fade away as the network gets very wide, leaving a simple curve with a clear best learning rate that doesn’t go to zero (see the tiny worked example after this list).
  • They then extended the idea to more training steps. The math gets more complicated (more terms matter), but the key trick remains: analyze how those polynomial terms behave as width grows.
  • They used standard probability results to show the coefficients of those polynomials settle to fixed values when the network gets very wide, which pins down a stable best learning rate.
  • Finally, they ran lots of experiments to confirm their proofs. They checked linear networks trained with gradient descent, and also tried ReLU networks and Adam optimizers to see how well the idea holds up in practice.
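
For a tiny made-up example (our numbers, just to illustrate the idea): suppose that after one step the loss happened to be

loss(LR) = 4 − 4×LR + 1×(LR)²

Then the best learning rate is LR = 2, where the curve bottoms out. The paper's point is that, under μP, the extra higher-power terms shrink away as the network gets wider, so the bottom of the curve stays in roughly the same place.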

To make the technical terms friendlier:

  • μP (Maximal Update Parametrization): a careful way to set the scale of the network’s weights and learning rate so that the network’s “features” (the internal patterns it learns) keep changing meaningfully even when the network is very wide.
  • SP (Standard Parametrization): the usual default setup many libraries use. It doesn’t guarantee stable behavior as you make the network wider.
  • NTP (Neural Tangent Parametrization): a setup that makes extremely wide networks behave like fixed kernels (their features barely change). That often means you need a very different learning rate behavior.

Main Findings

Here are the most important results:

  • Under μP, the best learning rate converges to a non-zero constant as the network width grows. Translation: you can tune the learning rate on a smaller, narrower network and reuse it for much wider versions.
  • Under SP, the best learning rate shrinks toward zero as you make the network wider. So you would need to keep retuning it, which is costly and inconvenient.
  • Under NTP, the needed learning rate tends to blow up as the network gets wider, which is also bad for easy tuning.
  • They proved this carefully after one training step and showed the convergence rate is fast (roughly like 1 divided by the square root of the width), and then extended the proof idea to multiple training steps.
  • Experiments matched the theory:
    • In linear networks trained with gradient descent, μP showed clean learning rate transfer; SP did not.
    • With ReLU networks and Adam optimizer, μP still showed strong learning rate transfer in practice, even though the formal proof was for linear networks.
    • Deeper networks tend to prefer smaller learning rates, but the “transfer” property still holds under μP.
    • Later in training, the range of “good enough” learning rates widens, making it less sensitive to the exact value.

Implications and Impact

This research provides a solid mathematical reason to trust learning rate transfer under μP. That’s a big deal because:

  • It saves time and money: you can tune your learning rate once on a smaller model and reuse it on larger ones.
  • It reduces guesswork and instability when scaling models.
  • It helps bridge the gap between theory and practice: we now know when and why learning rate transfer should happen.

Limitations and future directions:

  • The formal proofs cover linear networks trained with gradient descent. The experiments suggest similar behavior in more realistic setups (ReLU and Adam), but full proofs there are future work.
  • The result depends on mild assumptions (like there being a clear best learning rate at each step), which usually hold in practice.

In short: if you set up your network using μP, the best learning rate doesn’t drift as you make it wider. You can tune once and reuse—making scaling smoother and more reliable.


Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a concise list of what remains missing, uncertain, or unexplored in the paper, framed to enable concrete follow-up research.

  • Extension beyond linear MLPs: The proofs are restricted to deep linear networks with GD and quadratic loss. A rigorous theory for non-linear activations (e.g., ReLU), modern losses (e.g., cross-entropy), and non-linear feature dynamics is not provided.
  • Optimizer generality: Learning rate (LR) transfer is only proved for full-batch gradient descent. The behavior under common optimizers (SGD with momentum, Adam with c = 1 scaling, Adagrad, RMSProp) and their width-dependent preconditioning remains theoretically unresolved.
  • Stochastic training: The impact of gradient noise (mini-batching), batch size scaling, and stochastic optimization on LR transfer and its asymptotics is not analyzed.
  • Training deeper “input/output” layers: The theory fixes W_0 and V. A formal analysis when W_0 and V are trainable (including the specified μP LR scaling for W_0) is missing, as is the effect of training these layers on LR transfer.
  • Multi-output networks: Although the text states results “can be generalized” to multi-dimensional outputs, no explicit proof or scaling rules (especially for V) are given for multi-output architectures.
  • General-step characterization: For t > 1, LR transfer is shown to exist but without:
    • An explicit expression for the limiting optimal LR η_∞^{(t)} (analogous to the closed-form at t = 1).
    • Non-asymptotic or asymptotic convergence rates for η_n^{(t)} → η_∞^{(t)}.
    • Conditions guaranteeing convexity/uniqueness of the limiting minimizer for general t (the uniqueness assumption is taken as “mild” but unproven/generic).
  • Convergence-rate anomaly: Empirically at t = 1, the observed rate changes abruptly beyond n ≈ 1024, deviating from the O(n^{-1/2}) prediction. The cause of this regime shift and whether tighter (or different) rates hold asymptotically is left unexplained.
  • Assumption sensitivity: The results rely on Gaussian i.i.d. initialization and independence assumptions in Tensor Programs. Robustness to the following is not studied:
    • Non-Gaussian initializations (e.g., orthogonal, truncated, heavy-tailed).
    • Architectural inductive biases (e.g., weight sharing in CNNs, residual connections).
    • Bias parameters and normalization layers (BatchNorm, LayerNorm).
  • Dataset structure: The assumption Ky ≠ 0 is crucial but its failure modes are not characterized. What happens when Ky ≈ 0 (near-degenerate Gram–label alignment), or when K has poor conditioning?
  • Depth scaling: While experiments vary depth, the theory does not treat depth-to-infinity regimes or Depth-μP. Formal depth-scaling exponents and their effect on LR transfer (including potential joint width–depth limits) remain open.
  • Alternative parameterizations: For SP/NTP, only qualitative failure modes are provided. A full characterization of optimal LR scaling (e.g., can SP regain LR transfer with LR ∝ n^{-1/2} or another exponent?) is not analyzed.
  • Finite-width corrections: No non-asymptotic bounds quantify how large n must be for practical LR transfer, or provide finite-n error terms between η_n^{(t)} and η_∞^{(t)} beyond t = 1.
  • Generalization vs. training loss: The results focus on training loss. Whether LR transfer under μP correlates with generalization performance, stability, and edge-of-stability phenomena (and how these differ across parameterizations) is not investigated.
  • Loss landscapes and multimodality: For t > 1, the limiting polynomial could, in principle, have multiple minima depending on data and initialization. A rigorous analysis of when uniqueness holds and how to detect/handle multimodality is lacking.
  • Real-data validation: Experiments use synthetic data. Whether the theoretical predictions and LR transfer properties hold across real datasets and architectures (CNNs, Transformers) remains empirically and theoretically open.
  • LR schedules: The analysis covers constant LR only. The interaction of width scaling with practical LR schedules (warmup, cosine decay, step decay) and their transferability is unaddressed.
  • Regularization: The effects of explicit regularizers (weight decay, dropout), implicit bias from optimization, and their scaling under μP on LR transfer are unknown.
  • Classification settings: The theory uses quadratic loss; LR transfer under standard classification losses (e.g., logistic/cross-entropy) is not proved.
  • Polynomial structure at higher steps: Beyond t = 2, the degree and coefficient structure of the limiting polynomials are not characterized, making it hard to derive analytic properties (e.g., convexity, minimizer location, sensitivity) at general t.
  • Quantifying failure under NTP: While LR under NTP is argued to blow up, a formal width-dependent rate of divergence and its implications for practical tuning are not derived.
  • Multi-step dynamics and interaction terms: The proofs do not track how cross-terms across steps (e.g., dependence of a^{(t)}, b^{(t)}, and χ^{(t)} on η) shape the limiting loss landscape, leaving open a systematic way to compute or bound these higher-order contributions.
  • Robustness to noisy labels and misspecification: Sensitivity of LR transfer to label noise, model misspecification, and non-linear ground truth (beyond the empirical ReLU case) lacks theoretical treatment.
  • Scaling with sample size and dimension: How LR transfer depends on m and d (beyond the Gram dependence at t = 1), including high-dimensional asymptotics and conditioning of K, is not analyzed.
  • Practical tuning guidance: The paper shows existence and convergence but does not translate the asymptotic theory into actionable finite-width tuning rules (e.g., recommended base width, LR search ranges, depth-dependent scaling constants).

Practical Applications

Immediate Applications

The following applications can be deployed now, leveraging the paper’s theoretical guarantees for linear networks and its empirical evidence for ReLU MLPs and common optimizers.

  • Hyperparameter tuning cost reduction in ML pipelines (software/MLOps)
    • Use case: Tune learning rate (LR) on a small-width model under μP, then reuse the same LR for larger-width models without re-tuning.
    • Workflow: Adopt μP initialization and LR scaling (GD: constant LR; Adam: LR ∝ n⁻¹). Tune LR at width n₀ (e.g., 2¹⁰), validate transfer at larger widths (e.g., 2¹⁴–2¹⁶).
    • Tools: μP-aware wrappers for PyTorch/TensorFlow layers; AutoML integration for “tune-small, deploy-big.”
    • Assumptions/dependencies: Requires μP parameterization (not SP/NTP); projection layer V variance ∝ n⁻²; training with GD/SGD/Adam; non-degenerate input Gram matrix (Ky ≠ 0); unique minimizer of the limiting loss.
  • Faster experimentation and reproducibility in academia (academia/software)
    • Use case: Standardize experimental setups with μP so learning rate settings remain stable across width, improving reproducibility in scaling studies.
    • Workflow: Report LR settings alongside μP initialization rules; run ablations at small widths to infer LRs for larger widths.
    • Tools: Experimental protocol checklists; μP “recipe cards” for linear and ReLU MLPs; notebooks that verify LR transfer via convergence plots.
    • Assumptions/dependencies: Applicability is strongest for linear MLPs (proven), with empirical support for ReLU+Adam; caution with more complex architectures.
  • Energy and cost savings for model training (energy/finance/software)
    • Use case: Reduce hyperparameter sweeps that scale poorly with width, lowering cloud/HPC costs and energy consumption.
    • Workflow: Limit LR sweeps to small widths; apply learned LR to larger widths; monitor training stability (loss vs. LR).
    • Tools: Training cost dashboards; LR transfer monitors; CI/CD hooks to gate large-width runs until LR is validated.
    • Assumptions/dependencies: Savings depend on eliminating width-dependent LR retuning; broader savings scale with how often width is increased.
  • Practical guidance for model scaling in product teams (software/consumer tech)
    • Use case: Teams scaling feed-forward or simple ReLU MLPs for recommendation, ranking, or tabular prediction can adopt μP to stabilize LR across width.
    • Workflow: Convert SP to μP by changing V initialization variance to n⁻²; adopt width-aware LR choice; implement guardrails to avoid SP/NTP defaults.
    • Tools: μP migration guides; linters that flag non-μP parameterization; “μP presets” in framework configs.
    • Assumptions/dependencies: Most effective for architectures close to those tested (MLPs); SP/NTP defaults in popular frameworks may need overriding.
  • AutoML and hyperparameter search enhancements (software)
    • Use case: Shrink the search space for LR in width-scaling experiments; bias search toward μP-consistent LRs.
    • Workflow: Two-stage search: (1) small-width LR tuning; (2) width scaling with fixed LR. Optionally refine LR in a narrow band near the theoretical optimum.
    • Tools: AutoML plugins that enforce μP scaling and carry LRs forward across widths; Bayesian optimization seeded with μP priors.
    • Assumptions/dependencies: Requires compact LR intervals for optimization; relies on convergence rates (often O(n⁻¹/²)) for practical widths.
  • Curriculum and lab design for scaling behaviors (education)
    • Use case: Teach students scaling laws and parameterizations via μP, demonstrating LR transfer vs. failure under SP/NTP.
    • Workflow: Labs comparing train loss vs. LR at different widths; measure convergence of the optimal LR; discuss Gram matrix role and depth effects.
    • Tools: Course modules; interactive visualizations of LR transfer; exercises replicating figures similar to the paper’s setups.
    • Assumptions/dependencies: Students should have basics in initialization, optimization, and large-width limits.
  • Benchmarking and diagnostics for LR sensitivity (software/quality assurance)
    • Use case: Establish LR transfer diagnostics to catch width-induced LR drift in SP/NTP pipelines.
    • Workflow: Plot loss vs. LR across widths; flag cases where the optimal LR shifts toward zero (SP) or diverges (NTP).
    • Tools: “LR transfer score” metrics; automated reports in training logs; comparison harnesses between μP and SP/NTP.
    • Assumptions/dependencies: Diagnostic thresholds calibrated on known convergence rates; uses compact LR intervals.

Long-Term Applications

These applications require further research and engineering, particularly to generalize proofs beyond linear networks and integrate into complex production systems.

  • Standardization of μP-like parameterizations across frameworks (software/standards)
    • Use case: Make μP default or available as a first-class option in PyTorch/TensorFlow/JAX for MLPs/CNNs/Transformers.
    • Workflow: Develop library-level parameterization APIs; certify “μP-compliant” initializations and optimizer scalings; include Depth-μP variants.
    • Tools: μP core libraries; compliance tests; reference implementations for common architectures.
    • Assumptions/dependencies: Proofs for nonlinear and attention-based architectures; compatibility with mixed-precision and distributed training.
  • μP-aware optimizers and schedules (software/optimization research)
    • Use case: Design optimizers or LR schedules that exploit polynomial-in-η loss structure and μP transfer properties.
    • Workflow: Derive adaptive schedules from early-step polynomial approximation; create μP-aware Adam variants; integrate with “edge of stability” insights.
    • Tools: Optimizer plugins; LR schedule generators; training telemetry that estimates polynomial coefficients online.
    • Assumptions/dependencies: Robust estimation of coefficients in non-linear regimes; guarantees for complex architectures.
  • Scaling laws and governance for green AI (policy/energy/academia)
    • Use case: Reduce hyperparameter sweep emissions by institutionalizing μP-based “tune-small” standards in grants, benchmarks, and procurement.
    • Workflow: Policies requiring documented use of LR transfer methods; carbon budgets that penalize width-dependent retuning.
    • Tools: Reporting templates; audit tools tracking tuning runs; benchmark guidelines rewarding μP-based efficiency.
    • Assumptions/dependencies: Broad acceptance of μP; metrics linking μP adoption to measurable energy savings.
  • Federated and on-device scaling strategies (software/edge/robotics)
    • Use case: Transfer LRs from compact on-device models to larger server models, or from simulation to higher-capacity real robots.
    • Workflow: μP-tuned LR at small width on-device/in-simulation; scale width for server-side or production hardware without re-tuning LR.
    • Tools: Federated training orchestration with μP parameterization; simulators using μP-aware model scaling.
    • Assumptions/dependencies: Extension of LR transfer to architectures used on edge/robots; constraints from heterogeneous hardware.
  • Sector-specific adoption in high-stakes domains (healthcare/finance)
    • Use case: Accelerate model iteration by stabilizing LR across width when scaling tabular/MLP models for risk scoring or imaging.
    • Workflow: Validate μP LR transfer on domain datasets; incorporate μP into risk management pipelines and clinical ML development.
    • Tools: Compliance-ready μP training recipes; LR transfer validation reports for audits; domain-specific AutoML with μP.
    • Assumptions/dependencies: Clinical/financial models may use architectures beyond linear/MLP; regulatory acceptance requires rigorous validation.
  • μP-based training autopilots and dashboards (software/MLOps)
    • Use case: End-to-end systems that tune at small width, predict LR transfer, orchestrate scaling, and monitor deviations.
    • Workflow: Implement “HP Transfer Toolkit” that handles initialization, LR tuning, scaling plans, and alerting on transfer failures.
    • Tools: Orchestration engines; LR transfer monitors; width-scaling simulators; integration with Ray/Tune, Kubernetes, and major DL frameworks.
    • Assumptions/dependencies: Reliable transfer across architectures; strong telemetry to detect exceptions and re-tune only when necessary.
  • Deeper theoretical generalization and guarantees (academia)
    • Use case: Extend proofs to non-linear networks (CNNs, Transformers), residual connections, and modern training tricks (weight decay, normalization).
    • Workflow: Combine Tensor Programs with mean-field and NTK/feature-learning analyses; formalize conditions for unique minimizers and convergence rates.
    • Tools: Theoretical toolkits; synthetic benchmarks designed to map boundaries of LR transfer.
    • Assumptions/dependencies: Mathematical tractability, particularly for attention/residual dynamics; careful control of large-width deviations.

Key Assumptions and Dependencies

  • μP parameterization is correctly implemented:
    • Initialization: W₀ ∼ N(0, d⁻¹), Wₗ ∼ N(0, n⁻¹), V ∼ N(0, n⁻²).
    • Learning rate: GD uses constant LR; Adam uses LR ∝ n⁻¹ (c = 1).
  • Dataset and geometry:
    • Input Gram matrix non-degenerate (Ky ≠ 0).
    • Loss landscape with a unique minimizer at each step t.
  • Proven scope vs. empirical scope:
    • Theoretical guarantees for linear MLPs with GD (and detailed one-step convergence at O(n⁻¹/²)).
    • Empirical support for ReLU MLPs and Adam; mixed results reported in LLM literature.
  • Framework defaults:
    • SP/NTP defaults in common libraries can break LR transfer (optimal LR shifts to zero or diverges); must override to μP.

Glossary

  • Adam: An adaptive gradient-based optimizer that uses estimates of first and second moments of gradients to adjust learning rates per parameter. "For the learning rate, μP scaling parametrizes the learning rate as η n^{-1} for Adam [yang2022tensor] and η for gradient descent."
  • Almost sure convergence: A strong form of convergence of random variables where the sequence converges with probability 1. "As n→∞, all the coefficients converge almost surely to deterministic constants."
  • Argmin: The set or value of the argument that minimizes a function. "Given width n and training step t, an optimal LR can be defined as η_n^{(t)} ∈ argmin_{η > 0} L_n^{(t)}(η)."
  • Bayesian neural network: A neural network with probabilistic weights, treating weights as random variables with specified priors. "Neal1996 was the first to introduce the “1/fan-in” initialization in the context of Bayesian neural networks."
  • Big-O notation: Asymptotic notation that describes an upper bound on the growth rate of a function. "we write c_n = O(d_n) when c_n < κ d_n for n large enough, for some constant κ > 0."
  • Convergence in probability: A mode of convergence where random variables get arbitrarily close to a constant with high probability as the sample size grows. "We say that LR transfers with width n if there exists a deterministic constant η_∞^{(t)} > 0 such that the optimal learning rate η_n^{(t)} converges in probability to η_∞^{(t)} as n goes to infinity."
  • Edge of stability: A regime in training where the learning rate is close to the threshold that destabilizes gradient descent dynamics. "Learning rate transfer was empirically studied in \cite{noci2024superconsistencyneuralnetwork} from the angle of Hessian geometry (and its connection to the edge of stability \cite{cohen2022gradientdescentneuralnetworks})"
  • Fan-in: The number of input connections to a neuron; used to scale initialization to maintain activation magnitudes. "He initialization \citep{he2016deep} introduced the “1/fan-in” initialization which normalizes the weights to achieve order one activations as width grows"
  • Feature learning: The process by which a network modifies internal representations (features) during training, rather than behaving like a fixed kernel. "The authors showed that under the neural tangent parametrization, training dynamics converge to a kernel regime in the infinite-width limit, a phenomenon known as lazy training ... which exhibit significant feature learning."
  • Gaussian process: A stochastic process where any finite set of points has a joint Gaussian distribution; emerges as a limit of certain wide networks. "showed that single-layer Bayesian networks converge to a Gaussian process in the infinite-width limit"
  • Gram matrix: A matrix of inner products between vectors, often used to analyze data geometry. "we define the m×m normalized input Gram matrix K = (d^{-1} ⟨x_i, x_j⟩)_{1 ≤ i,j ≤ m} ∈ ℝ^{m×m}"
  • He initialization: A weight initialization scheme scaling by 1/fan-in to stabilize activations and gradients in deep networks. "He initialization \citep{he2016deep} sets the initialization weights as centred gaussian random variables with “1/fan_in” variance"
  • Hessian geometry: Geometric properties derived from the Hessian (second derivative) of the loss, used to understand optimization behavior. "from the angle of Hessian geometry (and its connection to the edge of stability \cite{cohen2022gradientdescentneuralnetworks})"
  • Hyperparameter (HP) transfer: The phenomenon where optimal hyperparameters remain stable across model scales, enabling reuse without retuning. "A nice by-product of μP is HP transfer, or where optimal HPs seem to converge as width increases"
  • Infinite-width limit: The theoretical regime where the number of neurons per layer tends to infinity, revealing simplified training dynamics. "a neural network parameterization designed to “maximize” feature learning in the infinite-width limit."
  • Kernel regime: A training regime where the network behaves like a fixed kernel method, with features not changing significantly. "training dynamics converge to a kernel regime in the infinite-width limit"
  • Lazy training: A regime where network features stay close to their initialization and training is approximately linear around that point. "a phenomenon known as lazy training \citep{chizat2020lazytrainingdifferentiableprogramming}"
  • Learning rate (LR) transfer: Stability of the optimal learning rate as model width increases, allowing reuse across scales. "We provide the first proof of learning rate transfer with width in a linear multi-layer perceptron (MLP) parametrized with μP"
  • L2 norm: The Euclidean norm; also used to define convergence in second moment for random variables. "asymptotics are defined in the sense of the second moment (L_2 norm)."
  • Maximal Update Parametrization (μP): A parameterization that sets scaling exponents to maximize feature learning and enable hyperparameter transfer. "introduced the Maximal Update Parametrization (μP) which sets precise scaling exponents for the initialization and learning rate."
  • Mean-field initialization: An initialization scaling motivated by mean-field theory, often leading to nontrivial limits as width grows. "the convergence of φ_ℓ to 0 is a result of the fact that f^{(0)}(x) converges to zero because of the Mean-field-type initialization of the projection layer V ∼ N(0, n^{-2})."
  • Mean-field neural networks: A theoretical framework analyzing networks via mean-field limits and partial differential equations. "and the literature on mean-field neural networks \citep{sirignano2019meanfieldanalysisneural, mei2019meanfieldtheorytwolayersneural, Mignacco_2021, chizat2022infinitewidthlimitdeeplinear}"
  • Multi-Layer Perceptron (MLP): A feedforward neural network with multiple layers of linear transformations (and optionally nonlinearities). "We consider a linear Multi-Layer Perceptron (MLP) given by f(x) = V^⊤ W_L W_{L-1} ⋯ W_1 W_0 x"
  • Neural parametrization: A specification of how initialization and learning rate scale with width and depth. "A neural parametrization for the MLP model specifies the constants (α_ℓ)_{0 ≤ ℓ ≤ L}, α_V, and α_η"
  • Neural Tangent Kernel (NTK): A kernel that describes training dynamics of infinitely wide networks under certain parameterizations. "The Neural Tangent Kernel (NTK, \citep{jacot2018neural}) was one of the first attempts to understand training dynamics of large-width neural networks."
  • Neural Tangent Parametrization (NTP): A parameterization in which training dynamics converge to the NTK kernel regime. "In contrast, we show that this property fails to hold under alternative parametrizations such as Standard Parametrization (SP) and Neural Tangent Parametrization (NTP)."
  • Outer product: The product of two vectors producing a matrix, used in expressing rank-1 updates. "For two vectors z, z' ∈ ℝ^n, z' ⊗ z denotes the outer product."
  • Projection layer: The final layer vector/matrix mapping hidden activations to outputs; denoted by V in the model. "the Mean-field-type initialization of the projection layer V ∼ N(0, n^{-2})."
  • Rank-1 matrix: A matrix that can be expressed as the outer product of two vectors; appears in gradient expressions. "the gradients are given by rank-1 matrices"
  • ReLU: A nonlinear activation function defined as max(0, x), commonly used in deep networks. "MLP with ReLU activation of varying depth trained with Adam."
  • Standard Parametrization (SP): A common parameterization (e.g., PyTorch defaults) without width-dependent learning rate scaling. "Standard Parametrization (SP): α_ℓ = 1 for ℓ ∈ {0, …, L}, α_V = 1, and c = 0."
  • Strong Law of Large Numbers (SLLN): A theorem ensuring sample averages converge almost surely to expected values. "Strong Law of Large Numbers (SLLN) as n → ∞ yields convergence to y L ‖x‖² d^{-1} almost surely."
  • Tensor Programs: A framework for analyzing wide-network limits via program-like derivations and master theorems. "The convergence of the coefficients to deterministic limit follows from the “Master Theorem” in \cite{yang2021tensor}."
  • Uniform convergence on compact sets: Convergence of functions such that the maximum deviation over any compact domain tends to zero. "The t-step loss L_n^{(t)}(η) converges almost surely to L_∞^{(t)}(η) uniformly over η on any compact set."
