Scale-Aware Hyperparameters

Updated 4 January 2026
  • Scale-aware hyperparameters are tuning strategies that adjust for changes in model width, depth, dataset size, and compute budget to achieve near-optimal performance without re-tuning for each scale.
  • Empirical studies reveal that key hyperparameters, such as the learning rate and regularization strength, follow power-law scaling rules that enable efficient zero-shot transfer across training regimes.
  • Recent methodologies, including power-law extrapolation, surrogate-based Bayesian optimization, and multi-objective Pareto searches, reduce tuning costs and computational waste in large-scale learning.

Scale-aware hyperparameters are hyperparameter choices or tuning methodologies that explicitly account for changes in problem scale—such as model width, depth, dataset size (token horizon), compute budget, or architecture—so that optimal or near-optimal performance can be achieved without independently re-tuning for each new scale. This paradigm addresses the prohibitive cost and inefficiency of exhaustive, scale-specific grid or random search, especially in large-scale deep learning, kernel methods, or scientific computing. Recent advances in scale-aware hyperparameter transfer, power-law scaling laws, architecture- and optimizer-aware scaling prescriptions, and principled optimization frameworks enable reliable “zero-shot” or efficiently optimized hyperparameter selection as scale varies across training regimes, architectures, and data sizes.

1. Theoretical Foundations and Transfer Principles

The central theoretical objective for scale-aware hyperparameters is robust transfer: ensuring that hyperparameters tuned at small scale (e.g., on a model of width $n$) remain (asymptotically) near-optimal when transferred to a larger scale (e.g., width $N \gg n$), thereby minimizing suboptimality and compute waste.

A principled formalism quantifies “fast transfer” as follows. Let $\phi_n(h)$ denote the validation loss for width $n$ with hyperparameters $h$, and let $h^*(n) = \arg\min_h \phi_n(h)$. If $h^*(n)$ converges sufficiently quickly to the infinite-width optimum $h^*(\infty)$, specifically so that the transfer suboptimality $c_n = |\phi_\infty(h^*(n)) - \phi_\infty^*|$ satisfies $c_n = o(a_n)$ with $a_n = |\phi_n^* - \phi_\infty^*|$, then scale-aware hyperparameter transfer is both fast and compute-efficient. This property is often realized under scale-appropriate parameterizations such as Maximal Update Parameterization ($\mu$P) or under certain alignment conditions between gradients and parameters (Ghosh et al., 28 Dec 2025, Everett et al., 2024).
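
As a concrete illustration (not drawn from the cited papers), the sketch below estimates $a_n$ and $c_n$ from hyperparameter sweeps run at several widths, using the largest swept width as a stand-in for the infinite-width limit; the function name and data layout are hypothetical.

```python
import numpy as np

def transfer_gaps(losses):
    """Estimate the scale gap a_n and transfer suboptimality c_n from
    validation-loss sweeps phi_n(h) evaluated on a shared hyperparameter grid.

    losses: dict mapping width n -> 1-D array of losses over the grid.
    The largest width is used as a proxy for the infinite-width limit.
    """
    widths = sorted(losses)
    phi_inf = losses[widths[-1]]                 # proxy for phi_infinity(h)
    phi_inf_star = phi_inf.min()                 # proxy for phi_infinity^*
    gaps = {}
    for n in widths[:-1]:
        phi_n = losses[n]
        h_star_n = int(np.argmin(phi_n))         # argmin_h phi_n(h)
        a_n = abs(phi_n.min() - phi_inf_star)    # loss gap between scales
        c_n = abs(phi_inf[h_star_n] - phi_inf_star)  # transfer suboptimality
        gaps[n] = (a_n, c_n)
    return gaps

# Fast transfer corresponds to c_n shrinking much faster than a_n as n grows.
```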

Empirical and synthetic studies show that fast transfer is nontrivial: in random-feature regression, transfer is provably fast, while in certain “hard” tasks no such benefit appears even under $\mu$P (Ghosh et al., 28 Dec 2025).

2. Empirical Scaling Laws Across Model and Data Scale

Empirical research demonstrates that many optimal hyperparameters follow power-law (or other functional) scaling rules as the scale of the model or dataset changes. For LLMs, the optimal learning rate $\eta^*(D)$ decreases with token horizon $D$ as a power law:

$\eta^*(D) = B D^{-\beta}, \quad \beta \approx 0.32$

(Bjorck et al., 2024). When jointly considering model size $N$ and token horizon $D$, a unified law emerges:

$\eta^*(N, D) = C N^{-\alpha} D^{-\beta}, \quad \alpha \approx 0.23, \ \beta \approx 0.32$

This enables “zero-overhead” transfer: practitioners tune on a small token budget and apply

$\eta^*(D_1) \approx \eta^*(D_2) \, (D_1/D_2)^{-0.32}$

at much larger data scale, as validated across families of transformer models.
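
A minimal sketch of this zero-shot rule, directly instantiating the power laws above; the function names, reference values, and example token budgets are hypothetical, and the exponents should be whatever was fitted at small scale.

```python
def lr_from_tokens(lr_ref, d_ref, d_target, beta=0.32):
    """Transfer a learning rate tuned at token horizon d_ref to d_target
    using eta*(D) = B * D**(-beta)."""
    return lr_ref * (d_target / d_ref) ** (-beta)

def lr_from_model_and_tokens(lr_ref, n_ref, d_ref, n_target, d_target,
                             alpha=0.23, beta=0.32):
    """Joint rule eta*(N, D) = C * N**(-alpha) * D**(-beta)."""
    return (lr_ref
            * (n_target / n_ref) ** (-alpha)
            * (d_target / d_ref) ** (-beta))

# Example: an LR tuned on a 10B-token run, reused for a 1T-token run.
lr_large = lr_from_tokens(lr_ref=3e-3, d_ref=10e9, d_target=1e12)
```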

In kernel methods, complexity penalties and regularization ($\lambda$) scale automatically with dataset size $n$, and the Nyström rank $m$ grows sublinearly with $n$ to balance variance and approximation error (Meanti et al., 2022). In deep kernel or deep learning settings, global and per-layer learning rates are found to follow functional relationships with width, depth, and data processed (see the table below). Parameters such as weight decay and regularization coefficients can follow nontrivial, optimizer-dependent scaling rules (Qiu et al., 5 Dec 2025, Everett et al., 2024).

Setting | Scaling law for hyperparameter | Reference
LLM training (LR vs. tokens) | $\eta^*(D) \propto D^{-0.32}$ | (Bjorck et al., 2024)
Joint model + data (LLM) | $\eta^*(N,D) \propto N^{-0.23} D^{-0.32}$ | (Bjorck et al., 2024)
Kernel ridge regression | $\lambda$ increases with $n$ | (Meanti et al., 2022)
Transformer (width $w$, depth $d$) | $\eta \propto (w\sqrt{d})^{-1}$ | (McLeish et al., 7 Feb 2025)
Shampoo/Muon ($\mu$P, width) | see Table 1 formulas in (Qiu et al., 5 Dec 2025) | (Qiu et al., 5 Dec 2025)
Standard param. (per-layer LR) | $\eta_{\text{hidden}} \sim n^{-\gamma}$ | (Everett et al., 2024)
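
For example, the transformer row can be applied as a simple rescaling from a tuned reference configuration. The sketch below assumes $w$ denotes width and $d$ depth, with the proportionality constant absorbed into the reference learning rate; names and example values are hypothetical.

```python
import math

def scale_lr_width_depth(lr_ref, w_ref, d_ref, w_new, d_new):
    """Rescale a tuned LR using eta proportional to 1 / (w * sqrt(d))."""
    return lr_ref * (w_ref * math.sqrt(d_ref)) / (w_new * math.sqrt(d_new))

# Example: LR tuned at width 1024 / depth 12, reused at width 4096 / depth 48.
lr_big = scale_lr_width_depth(lr_ref=1e-3, w_ref=1024, d_ref=12,
                              w_new=4096, d_new=48)
```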

3. Methodological Advances: Frameworks for Scale-Aware Tuning

Several recent algorithmic frameworks are designed to enable scalable, scale-aware hyperparameter optimization:

  • Hyperparameter Transfer and Power-Law Extrapolation: For LLM training, zero-shot transfer relies on fitting a learning rate optimum at a small token horizon and extrapolating via a power law in token count or model size, drastically reducing the need for high-cost LR sweeps at ultimate scale (Bjorck et al., 2024).
  • Zeroth-Order Hypergradient Methods: The HOZOG algorithm computes average finite-difference hypergradients along random directions to scale bilevel hyperparameter optimization to hundreds or thousands of variables, with cost almost independent of hyperparameter dimensionality (Gu et al., 2021); a minimal sketch of this estimator appears after this list.
  • Surrogate-Based Bayesian Optimization: Deep Power Laws (DPL) uses a neural surrogate constrained to power-law learning curves for efficient gray-box hyperparameter optimization over variable epoch/training budget scales (Kadra et al., 2023). Bayesian approaches with closed-form expectations can reduce the number of function evaluations for scale parameters by $40\times$ (Yadav et al., 7 Oct 2025).
  • Multi-Objective Pareto Frontier Search: CARBS traces out the performance–cost Pareto frontier and fits joint scaling laws for all hyperparameters with respect to compute budget, enabling automated discovery of scaling exponents and prescriptions for large deep learning systems (Fetterman et al., 2023).
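
To make the zeroth-order estimator concrete, the following is a minimal sketch, assuming a black-box validation-loss oracle that retrains the inner model for each hyperparameter vector; the function name, smoothing radius, and number of directions are illustrative choices rather than values from HOZOG itself.

```python
import numpy as np

def zeroth_order_hypergradient(val_loss, h, mu=1e-2, num_dirs=16, rng=None):
    """Average forward finite differences along random unit directions to
    approximate the hypergradient d val_loss / d h, without differentiating
    through the inner training loop.

    val_loss: callable mapping a hyperparameter vector to validation loss
              (assumed to retrain or fine-tune the model internally).
    h:        1-D NumPy float array of hyperparameters.
    """
    rng = np.random.default_rng() if rng is None else rng
    base = val_loss(h)
    grad = np.zeros_like(h)
    for _ in range(num_dirs):
        u = rng.standard_normal(h.shape)
        u /= np.linalg.norm(u)                      # random unit direction
        grad += (val_loss(h + mu * u) - base) / mu * u
    return grad / num_dirs

# One hypothetical outer-loop update (step size 0.1 is a placeholder):
# h = h - 0.1 * zeroth_order_hypergradient(val_loss, h)
```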

4. Architecture- and Optimizer-Aware Scaling Strategies

Scale-awareness must encompass not only the quantity of computation or model/data size but also architectural and optimizer-specific considerations:

  • Architecture-Aware Learning Rate Scaling: Closed-form rules for the maximal learning rate of arbitrary DAG-structured neural networks take the form $\eta_{\max} \propto (\sum_{p} L_p^3)^{-1/2} k^{-1}$, where $L_p$ is the depth along path $p$ and $k$ is the kernel size for convolutional architectures (see the sketch after this list). This strategy adapts seamlessly to MLPs, CNNs, and residual networks, ensuring stable and efficient training across topologies (Chen et al., 2024).
  • Optimizer and Parameterization Dependence: Analysis reveals that the optimal scaling exponent for per-layer learning rates varies with the chosen parameterization (standard, $\mu$P, mean-field, etc.) and the empirical alignment between parameters, data, and gradients (Everett et al., 2024). Importantly, all parameterizations can support hyperparameter transfer if the correct per-layer exponents are applied, with standard parameterization plus per-layer LRs often outperforming $\mu$P for scale transfer when combined with appropriate scaling of Adam's $\epsilon$ parameter or employing Adam-atan2 to bypass underflow (Everett et al., 2024).
  • Preconditioned Optimizers: For Shampoo, SOAP, Muon, and related matrix-preconditioned optimizers, width-only and joint width–depth scaling rules for learning rate and damping are rigorously derived. Empirically, independent weight decay should scale as $1/\text{width}$ for compute-optimal performance. Explicit block-wise preconditioning (“blocking”) and spectral-normalization steps are critical to eliminating finite-width drifts and securing robust transfer (Qiu et al., 5 Dec 2025).
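
As a sketch of the architecture-aware rule from the first bullet, the helper below evaluates $\eta_{\max} \propto (\sum_p L_p^3)^{-1/2} k^{-1}$; the path enumeration and the base proportionality constant are assumptions left to the user, not part of the cited prescription.

```python
import math

def max_lr_dag(path_depths, kernel_size=1, base=1.0):
    """Architecture-aware maximal LR: eta_max proportional to
    (sum_p L_p**3)**(-1/2) * k**(-1).

    path_depths: depths L_p of the input-to-output paths of the DAG.
    kernel_size: k for convolutional layers (use 1 for MLP-style networks).
    base:        proportionality constant, treated here as a tunable reference.
    """
    return base / (math.sqrt(sum(L ** 3 for L in path_depths)) * kernel_size)

# Example: a residual-style network with several short paths vs. one deep chain.
eta_residual = max_lr_dag(path_depths=[2, 4, 8, 16], kernel_size=3)
eta_chain = max_lr_dag(path_depths=[30], kernel_size=3)
```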

5. Hyperparameter Scaling in Practice: Implementation and Limitations

Practical deployment of scale-aware hyperparameter strategies follows a sequence of steps: tuning at a reference scale, analytic or regression-based extrapolation, and composition with architecture- or optimizer-specific rules. The main practical methodologies are:

  • Fit the optimal hyperparameter(s) (such as learning rate) at a tractable scale (smaller model, shorter token horizon).
  • Extrapolate to the target scale using empirically validated scaling laws (e.g., $\eta^*(D_1) = \eta^*(D_2)(D_1/D_2)^{-0.32}$, or per-layer exponents from analytic tables); a minimal end-to-end sketch follows this list.
  • For multi-hyperparameter and multi-objective settings, jointly fit all hyperparameters against cost or compute via Pareto-optimization frameworks (CARBS, DPL).
  • Where architectural changes are present, apply architecture-aware rules (e.g., adjust $\eta$ for depth, width, path count, and kernel size as prescribed in (Chen et al., 2024)).
  • In nonparametric or kernel methods, adapt regularization and rank parameters with scale based on penalized risk decompositions (Meanti et al., 2022).
  • For stochastic models, exploit structural statistical surrogates to determine scale/precision hyperparameters efficiently under uncertainty (Yadav et al., 7 Oct 2025).
  • In scenario-specific cases (e.g., Quickshift segmentation), scale all kernel and distance hyperparameters linearly with image size to preserve invariances in the number and shape of output objects (Garreau, 2022).
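
A hypothetical end-to-end composition of the first two steps with a per-layer width rule; the sweep values, exponents, and model sizes below are illustrative placeholders rather than fitted quantities.

```python
def tune_then_extrapolate(sweep, d_ref, d_target, n_ref, n_target,
                          beta=0.32, gamma=1.0):
    """Pick the best LR from a small-scale sweep, then rescale it.

    sweep:  dict mapping candidate LR -> validation loss measured at the
            reference scale (width n_ref, token horizon d_ref).
    beta:   token-horizon exponent (eta* proportional to D**-beta).
    gamma:  per-layer width exponent for hidden layers (eta_hidden ~ n**-gamma);
            the value 1.0 is a placeholder, not a fitted exponent.
    """
    lr_ref = min(sweep, key=sweep.get)                       # step 1: tune small
    lr_global = lr_ref * (d_target / d_ref) ** (-beta)       # step 2: extrapolate
    lr_hidden = lr_global * (n_target / n_ref) ** (-gamma)   # per-layer width rule
    return lr_global, lr_hidden

sweep = {3e-4: 2.31, 1e-3: 2.27, 3e-3: 2.35}                 # illustrative losses
print(tune_then_extrapolate(sweep, d_ref=10e9, d_target=1e12,
                            n_ref=1024, n_target=8192))
```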

Limitations identified include the restriction of validated power-law exponents to certain recipes, the need for re-derivation for non-Adam/GPT-3 settings, and a dearth of studies in regimes such as multi-modality, extreme long-tail token horizons, or newly emerging architectures (e.g., MoE, SSMs) (Bjorck et al., 2024, McLeish et al., 7 Feb 2025). Extrapolation outside of well-characterized domains should be approached with caution.

6. Case Studies, Empirical Validation, and Impact

Concrete case studies substantiate the efficacy and necessity of scale-aware hyperparameters:

  • LLaMA-1 Case Study: Extrapolation from fits at smaller horizons revealed that the LR used in LLaMA-1 (3e-4) was over $2.5\times$ larger than the extrapolated optimum (1.15e-4 at 1T tokens), with an upper-bound validation-loss penalty $\Delta L \approx 0.027$, a substantial accuracy gap attributed to violating scale-awareness (Bjorck et al., 2024).
  • Kernel Ridge Regression: Adaptive penalty design matched performance of much larger hand-tuned configurations, with computational cost reduced by over an order of magnitude (Meanti et al., 2022).
  • Deep Power Laws (DPL): Across 59 tasks, DPL's scale-aware surrogate efficiently identified $2\times$ more top-1% configurations and found tuning oracles $3$–$6\times$ faster than competitive baselines (Kadra et al., 2023).
  • Matrix-Preconditioned Optimizers: Using $\mu$P plus width-aware weight decay and explicit normalization, Muon and Shampoo achieved $1.4\times$ and $1.3\times$ speedups over AdamW, with the speedup collapsing in the absence of correct scaling (Qiu et al., 5 Dec 2025).

These studies confirm the considerable performance and efficiency improvements that arise from explicit scale-awareness in hyperparameter selection across domains.

7. Open Problems and Future Directions

Unsolved directions include understanding the theoretical origins of observed scaling exponents (e.g., why $\beta \approx 0.32$ for $\eta^*$ vs. $D$ in LLMs), investigating breakdown regimes of power-law scaling at ultra-large scales or new architectures, integrating schedule and batch-size scaling with trajectory-invariance principles, and recalibrating scaling rules for shifting hardware and training recipes. Open questions remain about the transferability of scale-aware prescriptions for unstudied modalities or loss functions, and the optimal compositionality when jointly scaling model, data, batch, and optimizer hyperparameters (Bjorck et al., 2024, McLeish et al., 7 Feb 2025, Ghosh et al., 28 Dec 2025).

Continued progress will likely depend on datasets that support scaling law fitting across all hyperparameters, further algorithmic advances in meta-optimization under scale, and theoretical connections between scaling exponents, representation learning, and dynamical systems regimes.
