
Model Scaling Laws in ML

Updated 24 March 2026
  • Model scaling laws in machine learning are power-law equations linking model size, dataset size, and compute to test loss, enabling performance prediction at large scales.
  • They are derived through systematic log-linear regression on small models and validated across domains such as language, vision, and code.
  • These laws inform compute-optimal resource allocation and model design, though they may break down in data-limited or high-noise regimes.

Model scaling laws in machine learning formalize the empirical observation that as one increases model size, dataset size, or compute resources, loss typically decreases according to simple power-law relationships, subject to irreducible task and noise floors. These quantitative laws enable practical prediction of model performance at resource scales orders of magnitude beyond initial experiments, and underpin resource allocation in the development of large-scale neural networks, including state-of-the-art LLMs. This article provides a comprehensive survey of the mathematical forms, methods of estimation, empirical validation, theoretical foundations, domain-specific variants, and known breakdowns of scaling laws, primarily citing technical results from “Unraveling the Mystery of Scaling Laws: Part I” (Su et al., 2024) and supporting contemporary literature.

1. Canonical Power Laws: Formulation and Core Relations

Model scaling laws specify the asymptotic relationship between test (cross-entropy) loss $L$ and the key resource variables: model size ($N$), dataset size ($D$), optimization steps ($S$), and compute ($C$). Under fixed architecture, data distribution, and well-tuned hyperparameters, the empirically robust forms are:

  • Model-size scaling (infinite data/compute):

$$L(N) = (N_c / N)^{\alpha_N}$$

where $N$ = number of (non-embedding) parameters, $\alpha_N$ = scaling exponent, $N_c$ = characteristic scale.

  • Data-size scaling (infinite model/compute):

$$L(D) = (D_c / D)^{\alpha_D}$$

where $D$ = dataset size (tokens), $\alpha_D$ = data exponent (commonly $\approx 0.095$), $D_c$ = characteristic token count.

  • Combined regime (infinite compute):

$$L(N, D) = (N_c/N)^{\alpha_N} + (D_c/D)^{\alpha_D}$$

This form captures trade-offs when both $N$ and $D$ are varied (Su et al., 2024).

  • Compute-limited regime:

$$L(N, S_{\min}) = (N_c/N)^{\alpha_N} + (S_c/S_{\min})^{\alpha_S}$$

where $S_{\min}$ = minimal number of steps needed to reach a given $L$ at infinite batch size, $\alpha_S$ = step exponent, $S_c$ = step constant. The critical batch size at loss $L$ is

$$B_{\rm crit}(L) = B_*/L^{1/\alpha_B}$$

where $B_*$ and $\alpha_B$ are fitted per experiment.

  • Finite-batch trajectory (implicit in $L$):

$$L = (N_c/N)^{\alpha_N} + (S_c/S)^{\alpha_S} \left[1 + \frac{B_*}{B \, L^{1/\alpha_B}} \right]^{\alpha_S}$$

This formula predicts the full loss-versus-step trajectory at an arbitrary batch size $B$.

All exponents $(\alpha_N, \alpha_D, \alpha_S, \alpha_B)$ and prefactors are determined via log-linear regression on experiments with small-scale models ($1$–$60$M parameters) and are then used to extrapolate to models of up to $33$B parameters, as confirmed empirically (Su et al., 2024).
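The relations above can be evaluated numerically. The following is a minimal sketch, not the authors' code: it plugs in the C4/1024 constants quoted in this article's table of fitted values and solves the implicit finite-batch law by bisection. The function names are illustrative, and measuring batch size in tokens is an assumption, since the text does not state the unit.

```python
# Constants for the C4 / 1024-context setting, as quoted in this article.
ALPHA_N, N_C = 0.076, 1.5e14
ALPHA_S, S_C = 0.67, 2.6e3
ALPHA_B, B_STAR = 0.205, 1.7e8


def loss_model_size(n_params: float) -> float:
    """L(N) = (N_c / N)^alpha_N, the infinite-data, infinite-compute limit."""
    return (N_C / n_params) ** ALPHA_N


def critical_batch(loss: float) -> float:
    """B_crit(L) = B_* / L^(1 / alpha_B)."""
    return B_STAR / loss ** (1.0 / ALPHA_B)


def loss_finite_batch(n_params: float, steps: float, batch: float) -> float:
    """Solve the implicit finite-batch law for L by bisection.

    The right-hand side decreases as L grows (B_crit shrinks), so
    g(L) = L - rhs(L) is increasing and has exactly one root.
    Batch size is assumed to be in tokens (an assumption of this sketch).
    """
    def rhs(loss: float) -> float:
        penalty = (1.0 + critical_batch(loss) / batch) ** ALPHA_S
        return loss_model_size(n_params) + (S_C / steps) ** ALPHA_S * penalty

    lo = loss_model_size(n_params)  # rhs(lo) > lo because the step term is positive
    hi = lo + 1.0
    while rhs(hi) > hi:             # expand until the root is bracketed
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if rhs(mid) > mid:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Illustrative query: a 1B-parameter model after 1e5 steps at a 1M-token batch.
floor = loss_model_size(1e9)
predicted = loss_finite_batch(1e9, steps=1e5, batch=1e6)
print(f"model-size floor: {floor:.3f}, finite-batch prediction: {predicted:.3f}")
```

Bisection is used here because $L$ appears on both sides of the finite-batch relation; any bracketing root finder would serve equally well.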

2. Practical Estimation of Scaling Parameters

Accurate scaling-law predictions hinge on systematic small-scale experimentation:

  • Estimating $(\alpha_N, N_c)$:

Train $k \simeq 7$ decoder-only Transformers ($N = 1$M–$60$M) to convergence on corpora large enough that data-limited effects are negligible, then fit $\log L_i = \alpha_N (\log N_c - \log N_i)$.

  • Estimating $(\alpha_S, S_c)$:

For fixed $N$, use very large batch sizes to minimize step noise, record $L(S)$ at many $S$, then fit $\log[L(S) - \text{const}]$ vs. $\log S$, where the constant is the model-size term $(N_c/N)^{\alpha_N}$.

  • Estimating $(\alpha_B, B_*)$:

For fixed $N$, train short runs at various batch sizes $B_j$, compute $B_{\rm crit}(L)$ from $S/B$ contours of constant $L$, then fit $\log B_{\rm crit} = \log B_* - (1/\alpha_B)\log L$.

Example fitted values:

| Corpus/Context | $\alpha_N$ | $N_c$ | $\alpha_S$ | $S_c$ | $\alpha_B$ | $B_*$ |
|-----------------|:----------:|:-------------:|:----------:|:---------:|:----------:|:-----------:|
| C4/1024 | 0.076 | $1.5\times10^{14}$ | 0.67 | $2.6\times 10^3$ | 0.205 | $1.7\times 10^8$ |
| 3T-mix/4096 | 0.0615 | $4.85\times10^{17}$ | 0.672 | $1.54\times 10^3$ | 0.139 | $2.15\times 10^{11}$ |

Constants depend sensitively on context length, tokenization, and data specifics (Su et al., 2024).
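The first fit in this section, for $(\alpha_N, N_c)$, reduces to ordinary least squares in log-log space. The sketch below illustrates this on synthetic data. The seven "converged losses" are hypothetical, generated from the C4/1024 constants with a small amount of multiplicative noise, and `predict_loss` is an illustrative helper rather than anything from the cited work.

```python
import numpy as np

# Hypothetical converged losses for seven small models (1M to 60M parameters),
# generated from the C4/1024 constants plus small multiplicative noise.
rng = np.random.default_rng(0)
true_alpha_n, true_n_c = 0.076, 1.5e14
n_params = np.geomspace(1e6, 6e7, num=7)
noise = np.exp(rng.normal(0.0, 0.002, size=n_params.size))
losses = (true_n_c / n_params) ** true_alpha_n * noise

# log L = alpha_N * log N_c - alpha_N * log N is a straight line in log-log space.
slope, intercept = np.polyfit(np.log(n_params), np.log(losses), deg=1)
alpha_n_hat = -slope
n_c_hat = np.exp(intercept / alpha_n_hat)


def predict_loss(n: float) -> float:
    """Extrapolate the fitted law L(N) = (N_c / N)^alpha_N to a new model size."""
    return (n_c_hat / n) ** alpha_n_hat


print(f"alpha_N ~ {alpha_n_hat:.4f}, N_c ~ {n_c_hat:.3g}")
print(f"extrapolated loss at 33B parameters: {predict_loss(33e9):.3f}")
```

Because $N_c$ is recovered as $\exp(\text{intercept}/\alpha_N)$, small errors in the slope are amplified in $N_c$. This is one reason to judge the fit by its extrapolated loss rather than by the recovered constant itself.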

3. Empirical Validation and Cross-Domain Generality

Scaling laws have been robustly validated across language modeling, vision, code understanding, recommendation, acoustic modeling, and more.

  • Language and code models: Exponents are typically $\alpha_N \approx 0.03$–$0.08$ and $\alpha_D \approx 0.04$–$0.10$ for cross-entropy loss, both in large-scale NLP (Su et al., 2024) and in masked LLMs for code (Lin et al., 2024).
  • Vision/TinyML: For models below 20M parameters, exponents are substantially steeper, $\alpha = 0.10$–$0.16$ for error rate in ConvNets (Alnemari et al., 7 Mar 2026), but local exponents decay and saturate at scale.
  • Multi-output and kernel regression: Theoretical results confirm two-term power-law expansions: $L - \sigma^2 \sim M^{-(a-1)} + N_{\text{eff}}^{-(a-1)/a}$, where $a$ reflects the spectrum of the data covariance; these predict larger exponents than found in large-scale deep models, supporting the universality-but-nonuniversality hypothesis (Chen et al., 3 Mar 2025).
  • Cases where scaling breaks: In small-data or "critical size" regimes scaling laws break down; tasks with fewer than 10K examples, a mismatch between pretraining and downstream tasks, or tasks dominated by irreducible noise often do not exhibit simple scaling (Ivgi et al., 2022; Alnemari et al., 7 Mar 2026).
  • Composition bias and architecture sensitivity: For NMT, scaling exponents differ notably for encoder and decoder, and composition bias in train/test data can dominate the scaling phase and even suppress BLEU improvements beyond a threshold (Ghorbani et al., 2021).

4. Theoretical Foundations: Mechanisms and Universality

Scaling-law phenomena originate from a combination of statistical and dynamical effects in high-dimensional learning:

  • Polynomial-spectrum mechanism: When the eigenvalues of the data covariance (or kernel) decay as a power law, optimal generalization yields excess risk falling off as $n^{-\alpha}$ with $\alpha = 2s/(2s + 1/\beta)$, where $s$ quantifies target smoothness and $1/\beta$ is the redundancy index (Bi et al., 25 Sep 2025).
  • Random-feature and kernel regimes: Solvable models (random-feature ridge regression, NTK, field-theory dualities) display exact $N \leftrightarrow P$ symmetry and identical scaling exponents for model and sample size, with breakdown or plateau when $N$ or $P$ approaches the intrinsic dimension of the data (Maloney et al., 2022; Zhang, 2024).
  • SGD implicit regularization: In linear settings with power-law spectrum and Gaussian prior, one-pass SGD suppresses variance terms, yielding the empirical scaling law $L - \sigma^2 = \Theta(M^{-(a-1)} + N^{-(a-1)/a})$, in sharp contrast to classical variance-limited bias-variance trade-offs (Lin et al., 2024).
  • Capacity and redundancy: The sharpness of the spectrum (i.e., the degree of redundancy) directly modulates scaling exponents; flatter spectra (higher redundancy) slow down returns-to-scale, motivating investigations of representation learning and spectrum regularization to accelerate power-law decay (Bi et al., 25 Sep 2025).
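To make the polynomial-spectrum exponent concrete, the short sketch below evaluates $\alpha = 2s/(2s + 1/\beta)$ for a few settings. The chosen values of $s$ and $\beta$ are purely illustrative assumptions, not values from the cited work. The sketch only shows the direction of the effect described above: a larger redundancy index $1/\beta$ yields a smaller exponent.

```python
def risk_exponent(s: float, beta: float) -> float:
    """Excess-risk exponent alpha = 2s / (2s + 1/beta) from the formula above.

    s    : target smoothness (must be positive)
    beta : inverse of the redundancy index (must be positive)
    """
    if s <= 0 or beta <= 0:
        raise ValueError("s and beta must be positive")
    return 2.0 * s / (2.0 * s + 1.0 / beta)


# Illustrative sweep: hold smoothness fixed and increase redundancy (1 / beta).
for redundancy in (0.5, 1.0, 2.0, 4.0):
    alpha = risk_exponent(s=1.0, beta=1.0 / redundancy)
    print(f"redundancy 1/beta = {redundancy:>3}: alpha = {alpha:.3f}")
```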

5. Extensions: Compressed Models and Composite Laws

Unified scaling laws extend to models trained or deployed under compression (quantization, sparsity, etc.):

  • Unified capacity law: All compressed formats obey

$$\text{Loss}(N, D; R) = A [N \rho(R)]^{-\alpha} + B D^{-\beta} + E$$

where $\rho(R)$ is the "dense-equivalent" capacity of format $R$, measured by the per-dimension Gaussian MSE of $R$ (Panferov et al., 2 Jun 2025).

  • Compositionality: For combinations (e.g., sparse+quantized), $\rho(R)$ factors multiplicatively, greatly simplifying cross-format scaling prediction.
  • Empirical validation: For Llama-style Transformers, $\alpha \approx 0.07$–$0.09$ and $\beta \approx 0.3$–$0.5$ remain stable across INT, FP formats, and sparse/quantized hybrids. RMSE-injection predicts scaling-law fit parameter efficiency directly from format statistics without retraining (Panferov et al., 2 Jun 2025).
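The unified capacity law and its multiplicative composition rule can be sketched in a few lines. All numeric values below are hypothetical placeholders chosen for illustration: the prefactors `A`, `B`, the floor `E`, and the per-format capacities are not fitted values, and only $\alpha$ and $\beta$ are picked from inside the ranges quoted above.

```python
from math import prod

# Hypothetical constants for illustration only (not fitted values).
A, ALPHA = 20.0, 0.08   # alpha chosen inside the quoted 0.07-0.09 range
B, BETA = 400.0, 0.4    # beta chosen inside the quoted 0.3-0.5 range
E = 1.7                 # irreducible loss floor

# Hypothetical dense-equivalent capacities rho(R) for two compression formats.
RHO = {"int4": 0.8, "sparse50": 0.6}


def compressed_loss(n_params: float, n_tokens: float, formats=()) -> float:
    """Loss(N, D; R) = A [N rho(R)]^-alpha + B D^-beta + E.

    For a combination of formats, rho is taken as the product of the
    individual capacities, following the compositionality rule above.
    """
    rho = prod(RHO[f] for f in formats)  # empty product is 1.0 (dense model)
    return A * (n_params * rho) ** (-ALPHA) + B * n_tokens ** (-BETA) + E


dense = compressed_loss(1e9, 2e10)
hybrid = compressed_loss(1e9, 2e10, formats=("int4", "sparse50"))
print(f"dense: {dense:.4f}  int4+sparse50: {hybrid:.4f}")
```

Under this form, a compressed model with $N$ parameters behaves like a dense model with $N\rho(R)$ parameters, so the comparison across formats is determined by the value of $\rho(R)$ alone.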

6. Methodological Best Practices for Scaling Law Estimation

Robust scaling law estimation requires precise methodology:

  • Data: Preferably train 5–10 small to moderate models spanning a $10^2$–$10^3\times$ range beneath the target scale, holding architecture and training protocol constant (Choshen et al., 2024).
  • Checkpoint inclusion: Always include intermediate training checkpoints (excluding the first 10% of steps), substantially improving predictive accuracy.
  • Goodness-of-fit and extrapolation: Only trust extrapolations when the fit achieves $R^2 \geq 0.95$; out-of-family extrapolation requires careful cross-family parameter transfer with fixed exponents.
  • Uncertainty quantification: Use at least 5 seeds per scale and a hierarchical bootstrap (≥1000 resamples) to obtain confidence intervals on scaling parameters and predictions.
  • Small-scale protocol efficiency: Training only small models to fit power laws can forecast attributes (loss, steps-to-loss, compute, optimal batch size) of 10B+ models in advance, achieving up to $5\times$ compute savings in model design (Su et al., 2024; Choshen et al., 2024).
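The goodness-of-fit threshold and the bootstrap recommendation above can be combined in one short routine. The sketch below is one possible reading of a "hierarchical bootstrap", in which model scales are resampled first and seeds are then resampled within each scale; the cited work may define it differently. The loss table is synthetic, drawn from a known power law so that the recovered interval can be sanity-checked.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 6 model sizes x 5 seeds of final loss from a known power law.
n_params = np.geomspace(1e6, 1e8, num=6)
true_alpha, true_n_c = 0.07, 1e14
mean_loss = (true_n_c / n_params) ** true_alpha
losses = mean_loss[:, None] * np.exp(rng.normal(0.0, 0.01, size=(6, 5)))


def fit_power_law(n, loss):
    """Least-squares line in log-log space; returns (alpha, R^2)."""
    x, y = np.log(n), np.log(loss)
    slope, intercept = np.polyfit(x, y, deg=1)
    resid = y - (slope * x + intercept)
    return -slope, 1.0 - resid.var() / y.var()


def bootstrap_alpha_ci(n, loss_by_seed, rng, n_resamples=1000):
    """Two-level bootstrap: resample scales, then seeds within each scale."""
    n_scales, n_seeds = loss_by_seed.shape
    alphas = np.empty(n_resamples)
    for i in range(n_resamples):
        scale_idx = rng.integers(0, n_scales, size=n_scales)
        while np.unique(scale_idx).size < 2:  # a slope needs two distinct sizes
            scale_idx = rng.integers(0, n_scales, size=n_scales)
        seed_idx = rng.integers(0, n_seeds, size=(n_scales, n_seeds))
        resampled = loss_by_seed[scale_idx[:, None], seed_idx]
        x = np.repeat(n[scale_idx], n_seeds)  # matches row-major ravel order
        alphas[i], _ = fit_power_law(x, resampled.ravel())
    return np.percentile(alphas, [2.5, 97.5])


alpha_hat, r2 = fit_power_law(np.repeat(n_params, 5), losses.ravel())
ci_low, ci_high = bootstrap_alpha_ci(n_params, losses, rng)
trustworthy = r2 >= 0.95  # threshold quoted in the list above
print(f"alpha = {alpha_hat:.4f}, R^2 = {r2:.3f}, 95% CI = [{ci_low:.4f}, {ci_high:.4f}]")
```

Resampling at the scale level as well as the seed level lets the interval reflect sensitivity to which model sizes were included, not only seed-to-seed noise.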

7. Implications, Limitations, and Open Directions

Scaling laws are central to pretraining strategy, compute budgeting, and architecture search for frontier models.

  • Compute-optimal allocation: For the two-term power law $L(N,D) \sim A N^{-\alpha} + B D^{-\beta}$ with fixed budget $C \propto ND$, minimizing the loss over $N$ gives $N \sim C^{\beta/(\alpha+\beta)}$ and $D \sim C^{\alpha/(\alpha+\beta)}$, aligning with the empirical "Chinchilla-optimal" prescription and spectral-theory predictions (Su et al., 2024; Lin et al., 2024).
  • Limits and breakdowns: When approaching irreducible task noise, as in acoustic modeling or in translationese-dominated NMT, returns from further scale diminish rapidly. Architecture, domain, training regime, and data distribution can all induce deviations, saturation, or even reversal (e.g., systematic error redistribution in TinyML regimes or plateau if source/data spectrum is exhausted) (Alnemari et al., 7 Mar 2026; Ghorbani et al., 2021).
  • Future work: Open technical fronts include inference-aware scaling laws (test-time compute, per-query adaptivity), multi-objective or fairness-aware laws, spectrum-specific and compositional architectures, and predictive scaling in safety-critical, data-limited, or highly non-i.i.d. settings (Sengupta et al., 17 Feb 2025).
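The compute-optimal allocation in the first item of this list follows from substituting $D = C/N$ into the two-term law and setting the derivative with respect to $N$ to zero, which gives $N^{\alpha+\beta} = (\alpha A / \beta B)\, C^{\beta}$. The sketch below checks this numerically by brute-force minimization over a grid. The constants are hypothetical illustration values, not fitted ones, and the budget is taken as exactly $C = ND$.

```python
import numpy as np

# Hypothetical constants for L(N, D) = A N^-alpha + B D^-beta (illustration only).
A, ALPHA, B, BETA = 400.0, 0.34, 410.0, 0.28


def closed_form_n(compute: float) -> float:
    """Stationary point of A N^-alpha + B (C/N)^-beta with respect to N."""
    prefactor = (ALPHA * A / (BETA * B)) ** (1.0 / (ALPHA + BETA))
    return prefactor * compute ** (BETA / (ALPHA + BETA))


def grid_search_n(compute: float, grid: int = 20001) -> float:
    """Brute-force minimiser over a log-spaced grid of model sizes."""
    log_c = np.log(compute)
    n = np.exp(np.linspace(0.05 * log_c, 0.95 * log_c, grid))
    loss = A * n ** (-ALPHA) + B * (compute / n) ** (-BETA)
    return float(n[np.argmin(loss)])


budgets = np.logspace(18, 24, num=7)
n_opt = np.array([grid_search_n(c) for c in budgets])
d_opt = budgets / n_opt

# Slopes of log N_opt and log D_opt against log C recover the two exponents.
n_slope = np.polyfit(np.log(budgets), np.log(n_opt), deg=1)[0]
d_slope = np.polyfit(np.log(budgets), np.log(d_opt), deg=1)[0]
print(f"N exponent: {n_slope:.3f} (theory {BETA / (ALPHA + BETA):.3f})")
print(f"D exponent: {d_slope:.3f} (theory {ALPHA / (ALPHA + BETA):.3f})")
```

The two exponents sum to one by construction, so a budget increase is split between parameters and tokens in proportions set by the ratio of $\beta$ to $\alpha$.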

Scaling laws provide a precise predictive lens but must be contextualized to architecture, domain, data spectrum, and operational constraints for optimal practical use.
