Learning Rate Transfer in Normalized Transformers
Abstract: The Normalized Transformer, or nGPT (arXiv:2410.01131) achieves impressive training speedups and does not require weight decay or learning rate warmup. However, despite having hyperparameters that explicitly scale with model size, we observe that nGPT does not exhibit learning rate transfer across model dimension and token horizon. To rectify this, we combine numerical experiments with a principled use of alignment exponents (arXiv:2407.05872) to revisit and modify the $μ$P approach to hyperparameter transfer (arXiv:2011.14522). The result is a novel nGPT parameterization we call $ν$GPT. Through extensive empirical validation, we find $ν$GPT exhibits learning rate transfer across width, depth, and token horizon.
Explain it Like I'm 14
What is this paper about?
This paper looks at how to pick a good learning rate for Transformer models when you change their size or how long you train them. The authors focus on a special kind of Transformer called a “Normalized Transformer” (nGPT) that trains fast and doesn’t need extra tricks like weight decay or learning rate warmup. They show that nGPT still loses “learning rate transfer” when you make the model bigger or train for more tokens. Then they design a new version, called νGPT, that keeps the learning rate “just right” as you scale the model’s width (how wide layers are), depth (how many layers), and token horizon (how many tokens you train on).
The questions they asked
In simple terms, they asked:
- If a learning rate works well for a small model, can we reuse it automatically for a bigger or deeper model, or for longer training?
- Why does the original nGPT fail to do this across model width and long training?
- Can we change how we set the learning rate so it “transfers” correctly when models get wider, deeper, or see more data?
How did they study it?
Think of training a model like driving a car:
- The learning rate is your speed.
- The model’s width and depth are like the size and number of floors in a building you’re navigating.
- The token horizon is how long your trip is.
If you keep the same speed while moving to a much bigger car or a much longer road, you might crash or go too slowly. The goal is to adjust speed in a predictable way so driving stays smooth no matter the car size or trip length.
Here’s what they did, in everyday language:
- They measured “alignment exponents”: These numbers tell you how well the model’s weights and activations point in the same direction. You can think of alignment as “how well the model’s parts cooperate when learning.” Prior methods assumed a certain kind of alignment that didn’t match what actually happens, so the authors re-measured it carefully.
- They combined these measurements with math-based rules for safe updates: The rules say how much each part of the model should change per training step so learning is steady (not exploding or stalling).
- They proposed new scaling rules (νGPT): These rules say how to scale the learning rate for different parts of the model when you increase width, depth, or the number of training tokens.
Key ideas behind the approach:
- Normalization everywhere: nGPT keeps activations and weights at a stable size each step. This makes the math more predictable.
- Different parts, different speeds: Embeddings, hidden layers, and the output layer should not all use the same learning rate when you scale the model. νGPT gives each of them its own “speed scaling.”
- Longer training needs a smaller peak learning rate: As you train on more tokens, you should lower the learning rate by a specific power law so learning stays stable.
What did they find?
The authors propose simple, practical rules that let learning rates transfer well when you scale the model:
- Across number of tokens (how long you train): Multiply the global learning rate by about (number of tokens)^{-1/3}. In math: η_global ∝ tokens^{-1/3}. This matches independent measurements from other studies.
- Across width (how wide layers are):
  - Embedding layer learning rate scales like width^{-1/2}.
  - Hidden and output layer learning rates scale like width^{-3/4}.
  - These choices come from their measured “mid alignment” (not too little, not too much) between weights and activations.
- Across depth (how many layers): Keep certain “rescaler” parameters small but scale their initial values with depth so very deep models stay stable, without having to reduce the hidden-layer learning rate.
- Keep key normalization scales constant (like 0.03) instead of shrinking them with model size.
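As a worked illustration of the width and token rules listed above, the arithmetic below assumes that the multipliers combine by multiplication, and that m_width and m_data mean "target width divided by base width" and "target tokens divided by base tokens". Neither assumption is spelled out in this summary, so treat the numbers as a sketch rather than the paper's exact recipe.

$$
\begin{aligned}
\eta_{\text{emb}} &= \eta_{\text{base}} \, m_{\text{width}}^{-1/2} \, m_{\text{data}}^{-1/3}, \\
\eta_{\text{hidden}} = \eta_{\text{out}} &= \eta_{\text{base}} \, m_{\text{width}}^{-3/4} \, m_{\text{data}}^{-1/3}.
\end{aligned}
$$

For a model 4 times wider trained on 8 times more tokens, $4^{-1/2} = 0.5$, $4^{-3/4} = 2^{-3/2} \approx 0.354$, and $8^{-1/3} = 0.5$, so the embedding learning rate becomes $0.25\,\eta_{\text{base}}$ and the hidden and output learning rates become about $0.177\,\eta_{\text{base}}$.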
Why this matters:
- With these rules (νGPT), they show “learning rate transfer” across width, depth, and token horizon. That means a learning rate tuned on a small, short run still works when you scale up, saving lots of trial-and-error.
- νGPT trains as well as (or slightly better than) the original nGPT, and better matches real training behavior than older theories that assumed stronger alignment.
- A commonly used older method (“μP”/Depth-μP) doesn’t match normalized Transformers as well, especially for depth, while νGPT does.
Why it matters
Tuning learning rates for very large models is expensive and slow. If you can find a simple recipe that scales automatically, you:
- Save time and money because you don’t need to re-run big hyperparameter searches.
- Reduce training instability and surprises when moving to wider or deeper models.
- Get predictable performance when you increase the amount of training data (tokens).
Key takeaways in plain words
- One-size-fits-all learning rates don’t hold when you grow Transformers; different parts of the model need different “speed” adjustments.
- For normalized Transformers, lowering the learning rate as you train on more tokens by about the cube-root rule (power −1/3) keeps training smooth.
- Using “mid alignment” measurements leads to width rules that make small-model settings transfer well to big models.
- The new νGPT setup provides simple scaling rules that work across width, depth, and training length, without sacrificing performance.
Knowledge Gaps
Unresolved gaps, limitations, and open questions
Below is a single, consolidated list of concrete gaps and open problems suggested by the paper that future researchers could act on:
- Theory for depth transfer in normalized models: The paper observes good transfer over depth without scaling η_hidden and with only scaling α_A,init and α_M,init by m_depth, yet provides no complete theoretical account for why “no η_hidden depth correction” works for nGPT/νGPT. Develop a principled theory of depth transfer for residual-post-norm, LERP-based normalized Transformers that explains when (and why) η_hidden should or should not scale with depth.
- Dynamics of trainable rescalers: The trainable LERP parameters (α_A, α_M) appear to adapt in ways that obviate Depth-μP-style corrections, but the mechanism is not analyzed. Characterize the per-layer/component distributions and dynamics of α_A, α_M (and their gradients) over training and depth to understand their stabilizing role and whether explicit schedules could improve stability or transfer.
- Alignment exponents are time-varying: The derivations assume fixed alignment exponents, while measurements show pronounced time dependence (early training ≫ late training). Build a dynamic theory of alignment exponents and derive update/learning-rate prescriptions that adapt to evolving alignment rather than assuming fixed “mid” alignment.
- Justification for “mid alignment” (3/4) choice: The selection of max{α, ν} = 3/4 is motivated by empirical fits weighted by loss decrease, but the weighting is ad hoc. Provide a principled criterion for aggregating time-varying alignment into a single scaling prescription and test alternative aggregations (e.g., step-weighted, gradient-norm-weighted, Fisher-weighted).
- Generality across datasets: All results are on FineWeb-Edu with a single tokenizer and domain. Validate width/depth/token-horizon transfer on diverse datasets (code, multilingual, speech-text, math, reasoning) and different tokenizers to assess robustness of the scaling exponents and constants.
- Batch size and gradient noise scale: Token-horizon scaling η ∝ T^{-1/3} is fit at fixed batch size 64. Systematically study how optimal LR scales with both token horizon and batch size (and thus with gradient noise scale), including microbatching and data-parallel variations.
- Optimizer dependence: The theory leans on a signGD-like view of Adam and uses fixed Adam hyperparameters (β1=0.9, β2=0.95, ε=1e-16). Assess how transfer exponents change under different optimizers (e.g., Lion, Muon, SGD+momentum, Adafactor), different Adam β1/β2/ε settings, decoupled vs coupled weight decay, and with/without gradient clipping.
- Weight decay effects: Experiments set weight decay to 0, yet several cited works argue weight decay changes alignment and transfer behavior. Quantify how nonzero weight decay modifies alignment exponents and the νGPT scaling rules; determine if the “mid alignment” prescription remains valid.
- Scaling of head dimension d_key: Derivations assume constant d_key and note the need to adjust if d_key scales. Empirically test width transfer when increasing d_key (at fixed n_heads) and derive the corresponding attention scaling (including whether to remove/adjust the √d_key factor).
- Alternative normalization placements: Results are specific to residual-post-norm with vector L2 Norm(.). Evaluate whether the proposed scaling transfers to pre-LN, RMSNorm, or other normalization variants, and characterize how normalization placement and type alter alignment and LR transfer.
- Matrix renormalization frequency: The approach renormalizes columns/rows of multiple matrices before each training step. Investigate how reducing the frequency (e.g., every K steps) or omitting renormalization affects stability, alignment exponents, and LR transfer, and whether scalings need adjustment.
- Sensitivity to fixed constants (0.03 rescaler scales): Several *_scale hyperparameters are set to a constant 0.03 instead of d_model^{-1/2} without rigorous justification. Provide a sensitivity analysis and theoretical rationale for the choice and its dependence on width, depth, and optimizer.
- Embedding/unembedding extra tuning knobs: The recommendation allows “optional” constant multipliers for η_input and η_output tuned on a base model. Identify principled rules (or measurements) to eliminate these free knobs, or provide recipes to select them that provably transfer.
- Stability at extreme depths: Sweeps go up to 128 layers; baseline shows some instability at high depth. Test substantially deeper regimes (e.g., 256–1k layers) to map the limits of the current prescription and whether additional per-depth or per-layer corrections become necessary.
- Transfer via different width axes: Width transfer is validated by scaling the number of heads with fixed head dimension. Test transfer when scaling hidden widths in MLPs and attention projections (or head dimension) at fixed n_heads to confirm generality of the m_width exponents.
- Broader architectural variants: Examine whether νGPT scaling persists for Transformers with GLU/GEGLU/Swish-GLU MLPs, attention variants (multi-query, grouped-query, linear attention), ALiBi/relative positions, dropout, MoE layers, tied embeddings, and shared projections.
- LR schedule dependence: All runs use cosine decay to 10% of the peak. Evaluate whether transfer quality and optimal peak LR scalings change under alternative schedules (linear, step, exponential, no decay), and whether schedule hyperparameters themselves transfer across scales.
- Precision and numerics: The method may interact with numerical precision (BF16/FP8), loss scaling, fused kernels, and optimizer implementations. Quantify how numeric formats and mixed-precision strategies affect alignment measurements and the validity of the scaling laws.
- Beyond next-token LM: Validate whether LR transfer carries over to finetuning, instruction tuning, multilingual alignment, and RLHF/SFT stages, where gradient statistics and alignment could differ.
- Compute-optimality and 20 tokens-per-parameter: The paper adopts 20 TPP heuristically and notes ambiguity about including embeddings. Fit compute–data scaling laws specifically for νGPT to determine the compute-optimal horizon under the proposed scaling, with and without embeddings.
- Formal metric of “transfer quality”: Plots suggest good transfer, but a quantitative definition/metric (e.g., shift of optimal LR, regret relative to oracle LR, area under performance–LR curve) is not provided. Define and report a standardized metric to compare parameterizations.
- Variance across seeds and runs: Some results average 3 seeds; others use EMA of validation loss; a consistent uncertainty analysis is missing. Provide systematic variance estimates (seeds, batches, data orders) to quantify robustness of transfer claims.
- Interaction with sequence length: Token horizon is varied via steps; sequence length is fixed at 4096. Measure whether optimal LR also depends on sequence length (context length) and whether T^{-1/3} holds when scaling tokens via longer contexts at fixed steps.
- Role of attention scaling at very high width: The proofs note potential adjustment to attention scaling if alignment increases or d_key grows. Explore whether emergent alignment at scale necessitates rescaling attention logits or changing per-head/output projections to preserve transfer.
- Empirical upper bounds of update size: The analysis assumes “small enough updates” so per-step renormalization does not change scales. Quantify actual update magnitudes across layers and training to verify this assumption and identify regimes where it fails.
- Alternative aggregation of alignment across layers: Alignment exponents are averaged across layers and measured on a single validation batch. Study per-layer, per-head, and token-position heterogeneity; design aggregation methods that better predict per-layer LR scaling or suggest layerwise LR schedules.
- Token-horizon exponent origin: The -1/3 scaling is empirical (and concurs with other work) but lacks a first-principles derivation for normalized Transformers. Develop a theory that predicts the 1/3 exponent under the normalization and renormalization used here.
- Effect of regularization: No dropout, stochastic depth, or augmentations are used. Evaluate how such regularizers affect alignment exponents, stability, and LR transfer, especially at large depths.
- Practical overhead of per-step normalization: The method normalizes matrices every step; the compute/memory overhead and its scalability are not quantified. Benchmark costs and explore approximations (e.g., periodic or low-rank normalization) that retain transfer.
- Cross-architecture comparatives: While μP baselines are included, broader comparisons to other scaling schemes (e.g., Unit Scaling, MaxViT-style residual scaling, per-module LR normalization) on identical setups would clarify when νGPT is preferable.
- Out-of-distribution generalization: Transfer is evaluated on validation loss within the same data distribution. Assess whether LR transfer tuned on small models generalizes to OOD validation/tasks for larger models, and whether misalignment grows.
- Safety of constant 0.03 choice for S_* scales: The new constants for key/query/value and MLP rescaler scales may affect gradient flow and saturation. Provide ablations across a range of constants and characterize failure modes (e.g., vanishing gradients, attention collapse).
- Interaction with parameter tying and weight sharing: Many practical LMs tie input/output embeddings or share projections. Test whether νGPT scalings remain valid under tying, and if not, derive adjusted exponents for tied parameters.
- Extension to mixture-of-experts (MoE): Investigate whether router parameters, expert fan-in/out, and sparse updates preserve the “mid alignment” assumption and how LR should scale with number of experts and expert capacity.
- Predictive tools for practitioners: Given the residual dependence on tuned constants, provide a step-by-step diagnostic procedure (e.g., measure early alignment exponents, estimate safe LR band) that practitioners can apply to new settings to select scalings without large sweeps.
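One of the gaps above asks for a quantitative definition of transfer quality. As a hedged illustration of two candidates named there (the shift of the optimal LR and the regret relative to an oracle LR), the sketch below compares a base-model sweep with a target-model sweep. The function names and the sweep format (a dict mapping learning rate to final validation loss) are hypothetical choices for this example and do not come from the paper.

```python
import math


def best_lr(sweep):
    """Return the learning rate with the lowest loss in a {lr: loss} sweep."""
    return min(sweep, key=sweep.get)


def transfer_report(base_sweep, target_sweep, predicted_lr_scale=1.0):
    """Compare the LR predicted from the base sweep with the target optimum.

    predicted_lr_scale is the multiplier a parameterization prescribes when
    moving from the base model to the target model (1.0 means "reuse as is").
    """
    lr_pred = best_lr(base_sweep) * predicted_lr_scale
    lr_star = best_lr(target_sweep)  # oracle LR on the target model
    # Evaluate the prediction at the nearest swept LR (distance in log space).
    nearest = min(target_sweep, key=lambda lr: abs(math.log(lr / lr_pred)))
    return {
        # 0 means the optimum landed exactly where predicted; +1 means the
        # true optimum is twice the predicted LR.
        "log2_lr_shift": math.log2(lr_star / lr_pred),
        # Extra loss paid for using the predicted LR instead of the oracle LR.
        "regret": target_sweep[nearest] - target_sweep[lr_star],
    }
```

A report with both numbers near zero would indicate good transfer under this particular definition; other definitions (such as area under the loss-versus-LR curve) would need a different function.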
Practical Applications
Immediate Applications
Below are concrete, deployable uses of the paper’s νGPT parameterization and findings for learning-rate transfer in Normalized Transformers (nGPT). Each item includes suggested sectors, potential tools/workflows, and key assumptions or dependencies that affect feasibility.
- HP transfer across model size without new sweeps
- What it enables: Take learning rates tuned on a small nGPT and auto-derive per-parameter-group LRs for wider/deeper/larger-token-horizon models using the νGPT rules (m_width, m_depth, m_data).
- Sectors: Software/AI (foundation model training), Cloud/HPC.
- Tools/workflows: PyTorch param-group generator that applies m_width^{-1/2} (embeddings), m_width^{-3/4} (hidden/unembedding), and m_data^{-1/3}; integration into training launchers (e.g., TorchTitan, Megatron-LM forks), CI pipelines for “transfer-only” LR configs.
- Assumptions/dependencies: nGPT-like architecture (Norm post-residual, Q/K normalization, LERP rescalers, per-step weight/column normalization); Adam(-like) optimizer without weight decay, no warmup; batch size held fixed when applying m_data; base LR tuned on a representative small model.
- No-warmup, no-weight-decay recipes for normalized Transformers
- What it enables: Simplified, faster training recipes that remove warmup schedules and weight decay while preserving stability.
- Sectors: Software/AI, MLOps.
- Tools/workflows: Template configs with νGPT scalings; guardrails that verify early-step stability on a small batch before full-scale runs.
- Assumptions/dependencies: Normalized Transformer blocks as defined in the paper; Adam setup (β1=0.9, β2=0.95, ε≈1e-16); external validation for non-nGPT architectures.
- Faster model family development (small-to-large)
- What it enables: Build families of models (e.g., 300M→1B→3B) with one HP sweep at the base size and transfer LRs upward, cutting exploration compute.
- Sectors: Software/AI, Startups/SMEs, Open-source model releases.
- Tools/workflows: “Family builder” script that takes base LR + aspect ratios and emits LR configs across SKUs; minimal “sanity sweep” (2–3 points) at target size to validate.
- Assumptions/dependencies: Width scaling largely via n_heads; head dimension and architectural constants close to base runs.
- Resume/extend pretraining with token horizon-aware LR
- What it enables: When increasing tokens/steps, reduce the peak LR by m_data^{-1/3} to keep runs stable and performant.
- Sectors: Software/AI, Cloud/HPC.
- Tools/workflows: LR scheduler that automatically recomputes peak LR when extending training; dashboards that track cumulative tokens and trigger the m_data correction.
- Assumptions/dependencies: Fixed (or similar) batch size; exponent ≈1/3 validated in your stack; schedule interaction (cosine decay) remains consistent.
- Stable deep model training out-of-the-box
- What it enables: Train deeper nGPT models (e.g., 64–128 layers) with robust LR transfer and reduced instability.
- Sectors: Software/AI, Multimodal/NLP teams.
- Tools/workflows: Depth-aware initialization of LERP rescalers (set α_A,init and α_M,init ∝ m_depth); unchanged hidden LR as recommended.
- Assumptions/dependencies: Trainable LERP rescalers; depth correction applied to rescaler initialization, not to hidden LR.
- Per-parameter LR separation for embeddings vs hidden/unembedding
- What it enables: Different LR scaling for embeddings (m_width^{-1/2}) vs hidden + unembedding (m_width^{-3/4}), reducing over/under-updates as model widens.
- Sectors: NLP, Recommendation systems (large embeddings), ASR/TTS with text front-ends.
- Tools/workflows: Param-group definitions that isolate E_input/E_output LRs; optional constant multipliers tuned on the base model.
- Assumptions/dependencies: Weight normalization per step; clear separation of embedding and hidden parameter groups.
- Reduced cost/energy via fewer HP sweeps
- What it enables: Lower compute spend and carbon footprint by replacing large-scale LR sweeps with νGPT transfer.
- Sectors: Energy/ESG reporting, Finance (cost control for ML), Cloud providers.
- Tools/workflows: “Budget planner” that estimates avoided sweeps and energy; ESG dashboards mapping LR-transfer adoption to emissions reduction.
- Assumptions/dependencies: Comparable tokens-per-parameter policies; organizational acceptance of transfer-based HP setting.
- Reproducible, portable training recipes for academia
- What it enables: Repeatable experiments across scales with predictable LR behavior; better cross-lab comparability.
- Sectors: Academia, Education.
- Tools/workflows: Lab handouts/notebooks implementing νGPT in PyTorch; small→large lab assignments demonstrating transfer; benchmark baselines with and without νGPT.
- Assumptions/dependencies: Access to nGPT code or compatible normalized Transformer implementations.
- HPC scheduling and procurement planning
- What it enables: Compute-optimal token planning (e.g., 20 tokens/parameter) tied to LR scaling; improved utilization and risk management.
- Sectors: Cloud/HPC Ops, Finance (capex/opex planning).
- Tools/workflows: Job configuration generators that ingest planned width/depth/tokens and output LR and schedule; “what-if” dashboards for cost/time curves.
- Assumptions/dependencies: Policies on tokens-per-parameter; reliable token horizon exponent in your workload.
- Diagnostic monitoring via alignment proxies
- What it enables: Lightweight proxies (e.g., gradient norms, update magnitudes per group) to detect mis-specified LR scaling early if alignment deviates from “mid”.
- Sectors: MLOps, Reliability/SRE for ML.
- Tools/workflows: Train-time metrics and alerts if update/activation scales drift; rapid rollback to adjusted multipliers.
- Assumptions/dependencies: Direct measurement of alignment exponents is costly; use proxies and validate occasionally.
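The param-group generator mentioned under "HP transfer across model size" can be sketched in a few lines. This is a minimal sketch rather than the authors' code: the function name, the group names, and the definitions m_width = width / base_width and m_data = tokens / base_tokens are assumptions made for illustration, as is the choice to multiply the width and token-horizon factors together. It covers only the learning-rate multipliers, not the depth-dependent initialization of the LERP rescalers.

```python
def nu_gpt_peak_lrs(base_lr, base_width, width, base_tokens, tokens):
    """Derive per-group peak learning rates from a tuned base run.

    Assumed (not stated in the summary): multipliers are ratios to the base
    run, and the width and token-horizon factors combine multiplicatively.
    """
    m_width = width / base_width    # assumed definition of the width multiplier
    m_data = tokens / base_tokens   # assumed definition of the data multiplier
    data_factor = m_data ** (-1.0 / 3.0)  # token-horizon rule: exponent -1/3
    return {
        # Embeddings: width exponent -1/2.
        "embedding": base_lr * m_width ** -0.5 * data_factor,
        # Hidden and unembedding layers: width exponent -3/4.
        "hidden": base_lr * m_width ** -0.75 * data_factor,
        "unembedding": base_lr * m_width ** -0.75 * data_factor,
    }


if __name__ == "__main__":
    lrs = nu_gpt_peak_lrs(base_lr=1e-2, base_width=512, width=2048,
                          base_tokens=1e9, tokens=8e9)
    for group, lr in lrs.items():
        print(f"{group}: {lr:.6f}")
```

In a PyTorch setup, each entry of the returned dictionary would typically become the `lr` of one optimizer parameter group; how parameters are assigned to the three groups depends on the model code and is left out here.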
Long-Term Applications
These items require additional research, broader validation, or engineering to scale beyond the paper’s scope.
- Generalize νGPT beyond nGPT to other normalized architectures
- What it enables: Apply transfer rules to RMSNorm/LN-based post-norm Transformers, Vision Transformers, and multimodal stacks.
- Sectors: Vision (ViT), Multimodal, Speech.
- Tools/workflows: Cross-architecture benchmarks; adapters that translate νGPT scalings to non-L2-normalized blocks.
- Assumptions/dependencies: The role of Norm vs RMSNorm and LERP-equivalents; need to confirm mid-alignment and LR exponents.
- Optimizer-agnostic transfer (beyond Adam)
- What it enables: LR transfer for SGD, Lion, AdaFactor, Muon, and hybrid optimizers.
- Sectors: Software/AI, Hardware/Compiler co-design.
- Tools/workflows: Empirical alignment exponent measurement under alternate optimizers; theory bridging signGD-like assumptions to specific optimizers.
- Assumptions/dependencies: Update-direction statistics differ by optimizer; exponents may change.
- Joint scaling with batch size and context length
- What it enables: LR policies that jointly adjust for token horizon, batch size, and sequence length (context window growth).
- Sectors: Software/AI, LLM-as-a-Service.
- Tools/workflows: SDE-based or empirical fits for m_data exponent vs batch; context-length-aware attention scaling policies.
- Assumptions/dependencies: The 1/3 exponent is measured at fixed batch sizes; new exponents likely depend on batch and optimizer.
- Closed-loop AutoLR via online alignment estimation
- What it enables: Systems that estimate alignment exponents during training and dynamically tune LR multipliers to preserve transfer conditions.
- Sectors: MLOps, AutoML.
- Tools/workflows: Lightweight estimators (or reliable proxies) baked into training loops; controller that adjusts per-group LRs/schedules.
- Assumptions/dependencies: Measurement overhead and noise; stability guarantees for on-policy LR updates.
- Depth transfer theory and tooling for trainable rescalers
- What it enables: Predictive rules for LR when residual rescalers (α_A, α_M) are trainable, supporting very deep (>200 layers) stacks.
- Sectors: Research, Foundation model engineering.
- Tools/workflows: Analytical models; ablators that separate rescaler dynamics from hidden LR to propose depth-aware policies.
- Assumptions/dependencies: Current evidence suggests hidden LR need not change with depth in nGPT; broader regimes may differ.
- Hardware- and compiler-aware LR transfer
- What it enables: Compiler passes (e.g., PyTorch 2, XLA) that detect model scale and auto-inject νGPT LR param-groups; distributed trainers that enforce per-step normalization efficiently.
- Sectors: Hardware/Systems, Cloud providers.
- Tools/workflows: Optimizer plugins, graph rewrites for matrix normalization, fused kernels for Norm and LERP.
- Assumptions/dependencies: Engineering for per-step renormalization overhead; correctness and determinism in distributed settings.
- Safety/robustness baselines for large-scale training
- What it enables: Predictable scaling reduces catastrophic divergences; establish preflight checks (sanity bounds) before multi-million-dollar runs.
- Sectors: Safety/Compliance, Finance (risk management).
- Tools/workflows: “Go/No-Go” checklists using νGPT scalings; regression tests that compare predicted vs observed LR optima.
- Assumptions/dependencies: Organizational process integration; model-specific exceptions.
- Domain-specific model training under constrained compute
- What it enables: Smaller hospitals, banks, or public agencies train competent domain LMs with minimal HP searches.
- Sectors: Healthcare, Finance, Public sector.
- Tools/workflows: Turnkey nGPT recipes + νGPT LR transfer; templated data pipelines; low-cost reproducibility guides.
- Assumptions/dependencies: Availability of normalized Transformer code; data governance constraints.
- Long-horizon agents and robotics
- What it enables: Apply token horizon LR scaling to decision/trajectory Transformers with very long episodes.
- Sectors: Robotics, Autonomous systems, RL.
- Tools/workflows: Training curricula with step-count-aware LR scaling; evaluation of stability under long horizons.
- Assumptions/dependencies: Differences between supervised pretraining and RL/behavior cloning may alter exponents.
- On-device continual learning with normalized micro-Transformers
- What it enables: Stable LR schedules that adapt as token counts accrue on device (privacy-preserving personalization).
- Sectors: Mobile/Edge AI, Consumer devices.
- Tools/workflows: Lightweight normalized Transformer variants; LR controllers using m_data updates during on-device updates.
- Assumptions/dependencies: Compute/memory constraints; need for energy-efficient per-step normalization.
Cross-cutting assumptions and dependencies to keep in mind
- Architectural: Results assume an nGPT-style normalized Transformer with L2 Norm at block outputs, post-norm residuals, normalized Q/K, trainable LERP rescalers, and per-step matrix normalization. Deviations may require recalibration.
- Optimizer/settings: Validated with Adam (β1=0.9, β2=0.95, ε≈1e-16), no weight decay, no warmup, fixed batch size. Different optimizers, weight decay, or warmups can change transfer behavior.
- Scaling exponents: Width scalings rely on a “mid alignment” assumption (max alignment exponents ≈3/4) and empirical w=1/2 for hidden/output. If alignment dynamics differ materially, LR multipliers may need adjustment.
- Token horizon: The m_data^{-1/3} law is empirical (fit ≈0.34) and may vary with batch size, data, and architectures; validate locally.
- Parameterization details: Keeping scalers (e.g., α_A,scale, α_M,scale, Sqk,scale, Sz,scale) constant (≈0.03) and scaling α_A,init/α_M,init with depth are part of the recipe.
- Scope: Experiments used FineWeb-Edu, sequence length 4096, batch size 64, and parameter ranges up to ~3B with depths up to 128; extrapolation beyond these regimes requires caution.
- Implementation overhead: Per-step weight/column normalization and LERP operations must be efficient to realize net speedups.
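The token-horizon and schedule assumptions above can be combined into a small scheduling sketch: rescale the peak learning rate by the cube-root rule when the token budget changes, then apply a cosine decay to 10% of that peak with no warmup. The function names are hypothetical, the exponent is taken as exactly 1/3 even though the reported fit is about 0.34, and batch size is assumed fixed so that tokens are proportional to steps.

```python
import math


def rescaled_peak_lr(peak_lr, old_tokens, new_tokens):
    """Apply the empirical rule peak LR ∝ tokens^(-1/3) at fixed batch size."""
    return peak_lr * (new_tokens / old_tokens) ** (-1.0 / 3.0)


def cosine_lr(step, total_steps, peak_lr, final_fraction=0.1):
    """Cosine decay from peak_lr at step 0 to final_fraction * peak_lr.

    No warmup phase is included, matching the no-warmup setting described
    for normalized Transformers in this summary.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    floor = final_fraction * peak_lr
    return floor + 0.5 * (peak_lr - floor) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Extending a run tuned for 1B tokens to 8B tokens halves the peak LR.
    new_peak = rescaled_peak_lr(3e-3, old_tokens=1e9, new_tokens=8e9)
    for step in (0, 500, 1000):
        print(step, cosine_lr(step, total_steps=1000, peak_lr=new_peak))
```

Because the exponent is an empirical fit, a short validation sweep around the rescaled peak is a reasonable safeguard before committing to a long run.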
Glossary
- abc-Parametrizations: A class of neural network parameterizations used to study large-width limits in the Tensor Programs framework. "using a natural class they call 'abc-Parametrizations'"
- AdamW: An optimizer variant of Adam that decouples weight decay from the gradient update. "We use Adam (AdamW with weight decay 0.0) with β1 = 0.9, β2 = 0.95, ε = 10^{-16}"
- alignment exponents: Quantities that characterize matrix-vector alignment between weights, activations, and updates, used to predict safe learning-rate scaling. "a principled use of alignment exponents (Everett et al., 2024)"
- attention head: An individual attention mechanism within multi-head self-attention that computes weighted combinations of values. "the nth token output of an attention head is given by the standard expression"
- CompleteP: A parameterization approach that sets depth scaling to preserve “complete” feature learning (nonlinear dependence on parameters). "CompleteP (Dey et al., 2025) sets α_depth = 1"
- compute-optimal: Refers to a heuristic token budget per parameter that is near-optimal for compute efficiency. '"compute-optimal" 20 tokens per parameter.'
- cosine schedule: A learning-rate schedule that decays the rate following a cosine curve. "the learning rate decayed to 10% of its peak (initial) value using a cosine schedule."
- cross-entropy loss: A standard loss for classification and language modeling measuring divergence between predicted and true distributions. "We train nGPT as defined in section 2 with cross-entropy loss on the FineWeb-Edu dataset"
- Depth-μP: A depth-scaled extension of μP that adjusts learning rates with depth (often with α_depth = 1/2). "Depth-μP (Yang et al., 2024b) style parametrization sets α_depth = 1/2"
- dynamical mean field theory: A statistical physics framework used to analyze high-dimensional learning dynamics in neural networks. "This limit was studied with dynamical mean field theory in Bordelon and Pehlevan (2022)."
- feature diversity: The property that different layers or units learn diverse features rather than become linearized. "argue that α_depth = 1/2 is optimal because it admits 'feature diversity'"
- FineWeb-Edu: A curated web-text dataset used for training and evaluating LLMs. "FineWeb-Edu dataset (Penedo et al., 2024)"
- head dimension: The dimensionality of each attention head’s key/query/value vectors. "we fix the head dimension at5 102"
- HP transfer: The practice of extrapolating hyperparameters found on small models or short runs to larger models or longer runs. "for transferring HPs across model depth, width, and token horizon."
- lazy infinite-width limits: Regimes where very wide networks behave like linear models around initialization, with little feature learning. "identified 'lazy' infinite-width limits"
- learning rate warmup: A strategy that increases the learning rate gradually at the start of training. "does not require weight decay or learning rate warmup."
- LERP: A learned “linear interpolation” residual connection that blends the input with the block output using trainable, nonnegative rescalers. 'replaced in (2) and (4) by a "linear interpolation" function (LERP)'
- logits: The pre-softmax scores produced by a model, whose scale is monitored to ensure stability. "and the logits do not blow up"
- Maximal Update Parametrization (μP): A parameterization designed to maximize stable parameter updates without blowup, promoting learning-rate transfer across width. "which they called the Maximal Update Parametrization (μP)."
- mid alignment assumption: A modeling assumption that weights and activations exhibit partial alignment (intermediate between none and full). "mid alignment assumption (in which weights and activations are partially but not completely aligned)"
- Norm(.): An L2 normalization layer used in nGPT to normalize vectors without trainable parameters. "Standard RMSNorm(.) or LayerNorm(.) are replaced with Norm(.), which l2-normalizes the vector"
- normalized Transformer (nGPT): A Transformer variant that heavily uses normalization and trainable rescalers to enable faster, more stable training without weight decay or warmup. "The Normalized Transformer, or nGPT (Loshchilov et al., 2025) achieves impressive training speedups"
- Post-LN: A Transformer normalization scheme applying normalization after the residual addition; used to maintain stability at depth. "absent techniques like Post-LN"
- pre-LN: A Transformer normalization scheme applying normalization before the residual block; the standard baseline referenced for comparison. "when compared with a standard pre-LN Transformer:"
- residual-post-norm: Performing normalization after the output of each block in the residual stream. "which is sometimes called residual-post-norm (Liu et al., 2021; OLMo et al., 2025)."
- rotary positional embedding: A method for encoding positional information by rotating key and query vectors in a latent space. "rotary positional embedding (Su et al., 2023) maps"
- signGD: A simplified optimizer that updates parameters in the direction of the sign of the gradient, used to motivate Adam’s scaling behavior. "a smoothed version of the signGD update"
- SiLU: An activation function defined as x·sigmoid(x), used in the MLP blocks of the Transformer. "SiLU(v) = v ⊙ σ(v) with σ(.) the standard sigmoid"
- spectral norms: Matrix norms based on the largest singular value, used to analyze and derive stable parameterizations like μP. "A simple theoretical perspective deriving μP using spectral norms is provided in Yang et al. (2024a)."
- Tensor Programs framework: A theoretical framework for analyzing large-width/depth limits of neural networks via program-like descriptions. "The Tensor Programs framework (Yang, 2019, 2020; Yang and Littwin, 2021; Yang, 2021)"
- token horizon: The total number of tokens seen during training (or per sweep), affecting optimal learning-rate scaling. "small models trained on limited token horizons"
- Unit Scaling: A parameterization technique that normalizes activations, weights, and gradients to unit scale at initialization. "Unit Scaling (Blake et al., 2023) ensuring that activations, weights and gradients have scale one at initialization."
- unembedding: The final linear projection from the model’s hidden state back to vocabulary logits. "passes through a linear unembedding"
- weight-activation alignments: The degree to which weight matrices align with activation directions, influencing safe update sizes and transfer. "weight-activation alignments in nGPT do not satisfy the hypotheses that underlie typical μP-type parameterizations."
- weight decay: An L2 regularization technique that penalizes large weights; notably not used in nGPT. "does not require weight decay"
- width corrections: Scaling rules for learning rates or parameters with model width to preserve stable, non-trivial training dynamics. "Width corrections (powers of m_width) are important for width transfer in νGPT."
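Three of the glossary entries above (Norm(.), LERP, and SiLU) describe small operations that are easy to show concretely. The sketch below uses plain Python lists and assumes the interpolation form h + α ⊙ (h_block − h) followed by Norm, which is how the nGPT paper writes its residual update; the function names are illustrative, and a real implementation would operate on batched tensors with trainable rescalers.

```python
import math


def l2_normalize(v):
    """Norm(.): scale a vector to unit L2 length (no trainable parameters)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def silu(x):
    """SiLU(x) = x * sigmoid(x), written as x / (1 + exp(-x))."""
    return x / (1.0 + math.exp(-x))


def lerp_residual(h, block_out, alpha):
    """Blend the residual stream h toward block_out, then renormalize.

    alpha holds one nonnegative rescaler per coordinate: 0 keeps h unchanged,
    1 replaces it by block_out (both up to the final normalization).
    """
    mixed = [hi + a * (bi - hi) for hi, bi, a in zip(h, block_out, alpha)]
    return l2_normalize(mixed)


if __name__ == "__main__":
    h = l2_normalize([3.0, 4.0])
    block_out = l2_normalize([1.0, 0.0])
    print(lerp_residual(h, block_out, alpha=[0.05, 0.05]))
    print(silu(1.0))
```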