Model Scaling Laws in ML

Updated 24 March 2026

Model scaling laws in machine learning are power-law equations linking model size, dataset size, and compute to test loss, enabling performance prediction at large scales.
They are derived through systematic log-linear regression on small models and validated across domains such as language, vision, and code.
These laws inform compute-optimal resource allocation and model design, though they may break down in data-limited or high-noise regimes.

Model scaling laws in machine learning formalize the empirical observation that as one increases model size, dataset size, or compute resources, loss typically decreases according to simple power-law relationships, subject to irreducible task and noise floors. These quantitative laws enable practical prediction of model performance at resource scales orders of magnitude beyond initial experiments, and underpin resource allocation in the development of large-scale neural networks, including state-of-the-art LLMs. This article provides a comprehensive survey of the mathematical forms, methods of estimation, empirical validation, theoretical foundations, domain-specific variants, and known breakdowns of scaling laws, primarily citing technical results from “Unraveling the Mystery of Scaling Laws: Part I” (Su et al., 2024) and supporting contemporary literature.

1. Canonical Power Laws: Formulation and Core Relations

Model scaling laws specify the asymptotic relationship between test (cross-entropy) loss $L$ and the key resource variables: model size ( $N$ ), dataset size ( $D$ ), optimization steps ( $S$ ), and compute ( $C$ ). Under fixed architecture, data distribution, and well-tuned hyperparameters, the empirically robust forms are:

Model-size scaling (infinite data/compute):

$L(N) = (N_c / N)^{\alpha_N}$

where $N$ = number of (non-embedding) parameters, $\alpha_N$ = scaling exponent, $N_c$ = characteristic scale.

Data-size scaling (infinite model/compute):

$L(D) = (D_c / D)^{\alpha_D}$

where $D$ = dataset size (tokens), $\alpha_D$ = data exponent (commonly $\approx 0.095$), $D_c$ = characteristic token count.

Combined regime (infinite compute):

$L(N,D) = (N_c/N)^{\alpha_N} + (D_c/D)^{\alpha_D}$

This form captures trade-offs when both $N$ and $D$ are varied (Su et al., 2024).

Compute-limited regime:

$L(N, S_{\min}) = (N_c/N)^{\alpha_N} + (S_c/S_{\min})^{\alpha_S}$

where $S_{\min}$ = minimal number of steps to reach a given $L$ at infinite batch size, $\alpha_S$ = step exponent, $S_c$ = step constant.

Critical batch size scaling:

$B_{\rm crit}(L) = B_*/L^{1/\alpha_B}$

where $B_*$ and $\alpha_B$ are constants fitted per experiment.

Finite-batch trajectory (implicit in $L$ ):

$L = (N_c/N)^{\alpha_N} + (S_c/S)^{\alpha_S} \left[1 + (B_*/(B \, L^{1/\alpha_B})) \right]^{\alpha_S}$

This formula predicts the full time/loss trajectory with arbitrary batch size.
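
Because $L$ appears on both sides of the finite-batch formula, evaluating it requires a root find. The following is a minimal sketch in Python, assuming plain bisection is acceptable: the right-hand side decreases as $L$ grows, so the root is unique and lies between the model-size term and the right-hand side evaluated at that term. The function names are hypothetical, the constants are taken from the C4/1024 row of the example fitted values in Section 2, and the chosen $N$, $S$, and $B$ are illustrative only (with $B$ assumed to be in the same units as $B_*$).

```python
def critical_batch(loss, b_star, alpha_b):
    """B_crit(L) = B_* / L^(1/alpha_B)."""
    return b_star / loss ** (1.0 / alpha_b)


def finite_batch_loss(n, s, b, *, n_c, alpha_n, s_c, alpha_s,
                      b_star, alpha_b, tol=1e-10):
    """Solve L = (N_c/N)^a_N + (S_c/S)^a_S * [1 + B_crit(L)/B]^a_S for L."""
    model_term = (n_c / n) ** alpha_n

    def rhs(loss):
        batch_factor = 1.0 + critical_batch(loss, b_star, alpha_b) / b
        return model_term + (s_c / s) ** alpha_s * batch_factor ** alpha_s

    # rhs is decreasing in L, so L - rhs(L) is increasing: bisection applies.
    lo = model_term   # here L - rhs(L) < 0
    hi = rhs(lo)      # here L - rhs(L) >= 0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mid < rhs(mid):
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Constants from the C4/1024 row of the example fitted values.
C4_PARAMS = dict(n_c=1.5e14, alpha_n=0.076, s_c=2.6e3, alpha_s=0.67,
                 b_star=1.7e8, alpha_b=0.205)

# Illustrative (hypothetical) run: 1B parameters, 10k steps, batch 5e5.
loss_small_batch = finite_batch_loss(1e9, 1e4, 5e5, **C4_PARAMS)
loss_large_batch = finite_batch_loss(1e9, 1e4, 5e6, **C4_PARAMS)
print(loss_small_batch, loss_large_batch)
```

At a fixed step count, the larger batch gives the lower predicted loss, and as $B \to \infty$ the result approaches the compute-limited form $(N_c/N)^{\alpha_N} + (S_c/S)^{\alpha_S}$.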

All exponents $(\alpha_N, \alpha_D, \alpha_S, \alpha_B)$ and prefactors are determined via log-linear regression on experiments with small-scale models ($1$–$60$M parameters) and are then used to extrapolate to models up to $33$B parameters, as confirmed empirically (Su et al., 2024).

2. Practical Estimation of Scaling Parameters

Accurate scaling-law predictions hinge on systematic small-scale experimentation:

Estimating $(\alpha_N, N_c)$ :

Train $k\simeq7$ decoder-only Transformers ( $N=1$ M–$60$M) to convergence on massive corpora (negligible data-limited effects), then fit $\log L_i = \alpha_N (\log N_c - \log N_i)$ .

Estimating $(\alpha_S, S_c)$ :

For fixed $N$ , use very large batch sizes to minimize step noise, record $L(S)$ at many $S$ , then fit $\log[L(S)-\text{const}]$ vs. $\log S$ .

Estimating $(\alpha_B, B_*)$ :

For fixed $N$ , train short runs at various $B_j$ , compute $B_{\rm crit}(L)$ from $S/B$ contours of constant $L$ , then fit $\log B_{\rm crit} = \log B_* - (1/\alpha_B)\log L$ .

Example fitted values:

| Corpus / context length | $\alpha_N$ | $N_c$ | $\alpha_S$ | $S_c$ | $\alpha_B$ | $B_*$ |
|-----------------|:----------:|:-------------:|:----------:|:---------:|:----------:|:-----------:|
| C4/1024 | 0.076 | $1.5\times10^{14}$ | 0.67 | $2.6\times 10^3$ | 0.205 | $1.7\times 10^8$ |
| 3T-mix/4096 | 0.0615 | $4.85\times10^{17}$ | 0.672 | $1.54\times 10^3$ | 0.139 | $2.15\times 10^{11}$ |

Constants depend sensitively on context length, tokenization, and data specifics (Su et al., 2024).
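
The model-size fit described above reduces to a straight-line regression in log-log space: $\log L = \alpha_N \log N_c - \alpha_N \log N$, so the slope gives $-\alpha_N$ and the intercept gives $\alpha_N \log N_c$. The sketch below illustrates this with NumPy on synthetic data. The function names, noise level, and random seed are assumptions made for the example; the "true" constants are the C4/1024 values from the table, used only to generate fake losses, not to reproduce any reported experiment.

```python
import numpy as np


def fit_model_size_law(n_params, losses):
    """Fit log L = alpha_N * (log N_c - log N) by least squares."""
    slope, intercept = np.polyfit(np.log(n_params), np.log(losses), 1)
    alpha_n = -slope
    n_c = np.exp(intercept / alpha_n)
    return alpha_n, n_c


def predict_loss(n, alpha_n, n_c):
    """Evaluate L(N) = (N_c / N)^alpha_N."""
    return (n_c / n) ** alpha_n


# Synthetic stand-in for 7 converged small models (1M to 60M parameters).
rng = np.random.default_rng(0)
n_params = np.logspace(6, np.log10(6e7), 7)
true_alpha, true_n_c = 0.076, 1.5e14
losses = predict_loss(n_params, true_alpha, true_n_c)
losses = losses * np.exp(rng.normal(0.0, 0.002, size=n_params.shape))

alpha_hat, n_c_hat = fit_model_size_law(n_params, losses)
extrapolated = predict_loss(33e9, alpha_hat, n_c_hat)
print(f"alpha_N ~ {alpha_hat:.4f}, N_c ~ {n_c_hat:.3g}")
print(f"extrapolated loss at 33B parameters: {extrapolated:.3f}")
```

Because $N_c$ is recovered by exponentiating a ratio of fitted quantities, it is far more sensitive to noise than $\alpha_N$; in practice the extrapolated loss is usually the more stable quantity to compare across fits.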

3. Empirical Validation and Cross-Domain Generality

Scaling laws have been robustly validated across language modeling, vision, code understanding, recommendation, acoustic modeling, and more.

Language and code models: Exponents are typically $\alpha_N \approx 0.03$–$0.08$ and $\alpha_D\approx 0.04$–$0.10$ for cross-entropy loss, both in large-scale NLP (Su et al., 2024) and masked LLMs for code (Lin et al., 2024).
Vision/TinyML: For models below 20M parameters, exponents are substantially steeper, $\alpha = 0.10$–$0.16$ for error rate in ConvNets (Alnemari et al., 7 Mar 2026), but local exponents decay and saturate at scale.
Multi-output and kernel regression: Theoretical results confirm two-term power-law expansions: $L - \sigma^2 \sim M^{-(a-1)} + N_{\text{eff}}^{-(a-1)/a}$ , where $a$ reflects the spectrum of the data covariance; these predict larger exponents than found in large-scale deep models, supporting the universality-but-nonuniversality hypothesis (Chen et al., 3 Mar 2025).
Cases where scaling breaks: In small-data or "critical size" regimes scaling laws break down; tasks with fewer than 10K examples, a mismatch between pretraining and downstream tasks, or dominance of irreducible noise often do not exhibit simple scaling (Ivgi et al., 2022; Alnemari et al., 7 Mar 2026).
Composition bias and architecture sensitivity: For NMT, scaling exponents differ notably for encoder and decoder, and composition bias in train/test data can dominate the scaling phase and even suppress BLEU improvements beyond a threshold (Ghorbani et al., 2021).

4. Theoretical Foundations: Mechanisms and Universality

Scaling-law phenomena originate from a combination of statistical and dynamical effects in high-dimensional learning:

Polynomial-spectrum mechanism: When the eigenvalues of the data covariance (or kernel) decay as a power law, optimal generalization yields excess risk falling off as $n^{-\alpha}$ with $\alpha = 2s/(2s + 1/\beta)$ , where $s$ quantifies target smoothness and $1/\beta$ is the redundancy index (Bi et al., 25 Sep 2025).
Random-feature and kernel regimes: Solvable models (random-feature ridge regression, NTK, field-theory dualities) display exact $N\leftrightarrow P$ symmetry and identical scaling exponents for model and sample size, with breakdown or plateau when $N$ or $P$ approaches the intrinsic dimension of the data (Maloney et al., 2022; Zhang, 2024).
SGD implicit regularization: In linear settings with power-law spectrum and Gaussian prior, one-pass SGD suppresses variance terms, yielding the empirical scaling law $L-\sigma^2 = \Theta(M^{-(a-1)} + N^{-(a-1)/a})$ , in sharp contrast to classical variance-limited bias-variance trade-offs (Lin et al., 2024).
Capacity and redundancy: The sharpness of the spectrum (i.e., the degree of redundancy) directly modulates scaling exponents; flatter spectra (higher redundancy) slow down returns-to-scale, motivating investigations of representation learning and spectrum regularization to accelerate power-law decay (Bi et al., 25 Sep 2025).
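
The dependence of the exponent on redundancy in the polynomial-spectrum mechanism can be made concrete by evaluating $\alpha = 2s/(2s + 1/\beta)$ directly. The short sketch below does only that; the function name and the sample values of $s$ and $1/\beta$ are hypothetical and chosen for illustration.

```python
def spectral_exponent(s, redundancy):
    """alpha = 2s / (2s + 1/beta), with `redundancy` standing for 1/beta."""
    return 2.0 * s / (2.0 * s + redundancy)


# Fixed target smoothness, increasing redundancy index 1/beta.
for redundancy in (0.5, 1.0, 2.0, 4.0):
    alpha = spectral_exponent(1.0, redundancy)
    print(f"1/beta = {redundancy:>3}: alpha = {alpha:.3f}")
```

For fixed smoothness $s$, the exponent falls monotonically as the redundancy index grows, which is the "flatter spectra slow down returns-to-scale" statement in numerical form.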

5. Extensions: Compressed Models and Composite Laws

Unified scaling laws extend to models trained or deployed under compression (quantization, sparsity, etc.):

Unified capacity law: All compressed formats obey

$\text{Loss}(N, D; R) = A [N \rho(R)]^{-\alpha} + B D^{-\beta} + E$

where $\rho(R)$ is the "dense-equivalent" capacity of format $R$ , measured by the per-dimension Gaussian MSE of $R$ (Panferov et al., 2 Jun 2025).

Compositionality: For combinations (e.g., sparse+quantized), $\rho(R)$ factors multiplicatively, greatly simplifying cross-format scaling prediction.
Empirical validation: For Llama-style Transformers, $\alpha\approx0.07$–$0.09$ and $\beta\approx 0.3$–$0.5$ remain stable across INT and FP formats and sparse/quantized hybrids. RMSE injection predicts the parameter-efficiency term of the scaling-law fit directly from format statistics, without retraining (Panferov et al., 2 Jun 2025).
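
As a rough illustration of how the unified capacity law and the multiplicative composition rule fit together, the sketch below evaluates $\text{Loss}(N, D; R)$ for a dense model and for a hypothetical sparse-plus-quantized format. All numeric constants ($A$, $B$, $E$, the exponents, and the per-format capacities) are placeholders invented for the example, not values reported by Panferov et al.; the function names are likewise hypothetical.

```python
def compose_capacity(*rhos):
    """Combine per-format capacities multiplicatively."""
    combined = 1.0
    for rho in rhos:
        combined *= rho
    return combined


def compressed_loss(n, d, rho, *, a, alpha, b, beta, e):
    """Loss(N, D; R) = A * (N * rho(R))^(-alpha) + B * D^(-beta) + E."""
    return a * (n * rho) ** (-alpha) + b * d ** (-beta) + e


# Placeholder constants, for illustration only.
LAW = dict(a=20.0, alpha=0.08, b=500.0, beta=0.4, e=1.5)

rho_sparse = 0.8   # hypothetical capacity of a sparse format
rho_quant = 0.5    # hypothetical capacity of a quantized format
rho_hybrid = compose_capacity(rho_sparse, rho_quant)

n, d = 1e9, 1e11
dense = compressed_loss(n, d, 1.0, **LAW)
hybrid = compressed_loss(n, d, rho_hybrid, **LAW)
print(f"rho(hybrid) = {rho_hybrid:.2f}, dense = {dense:.4f}, hybrid = {hybrid:.4f}")
```

Under this form, a compressed model with capacity $\rho$ is predicted to match a dense model with $N\rho(R)$ parameters trained on the same data, which is what "dense-equivalent" means here.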

6. Methodological Best Practices for Scaling Law Estimation

Robust scaling law estimation requires precise methodology:

Data: Preferably train 5–10 small-to-moderate models spanning a $10^2$–$10^3\times$ range of scales below the target, holding architecture and training protocol constant (Choshen et al., 2024).
Checkpoint inclusion: Always include intermediate training checkpoints (excluding the first 10% of steps); doing so substantially improves predictive accuracy.
Goodness-of-fit and extrapolation: Only trust extrapolations when the fit achieves $R^2 \geq 0.95$ ; out-of-family extrapolation requires careful cross-family parameter transfer with fixed exponents.
Uncertainty quantification: Use at least 5 seeds per scale and a hierarchical bootstrap (≥1000 resamples) to obtain confidence intervals on scaling parameters and predictions.
Small-scale protocol efficiency: Training only small models to fit power laws can forecast attributes (loss, steps-to-loss, compute, optimal batch size) of 10B+ models in advance, achieving up to $5\times$ compute savings in model design (Su et al., 2024; Choshen et al., 2024).
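
The uncertainty-quantification step can be sketched as follows. This is a simplified, seed-level bootstrap rather than a full hierarchical one: within each model scale it resamples seeds with replacement, averages the losses, refits the log-linear model, and reads a percentile interval for $\alpha_N$ from the refits. The function name, the synthetic data, and the noise level are assumptions made for illustration.

```python
import numpy as np


def bootstrap_alpha_ci(n_params, losses, n_resamples=1000, level=0.95, seed=0):
    """Percentile CI for alpha_N; `losses` has shape (n_scales, n_seeds)."""
    rng = np.random.default_rng(seed)
    n_scales, n_seeds = losses.shape
    log_n = np.log(n_params)
    alphas = np.empty(n_resamples)
    for r in range(n_resamples):
        # Resample seeds independently within each scale.
        idx = rng.integers(0, n_seeds, size=(n_scales, n_seeds))
        resampled = np.take_along_axis(losses, idx, axis=1)
        mean_loss = resampled.mean(axis=1)
        slope, _ = np.polyfit(log_n, np.log(mean_loss), 1)
        alphas[r] = -slope
    tail = 100.0 * (1.0 - level) / 2.0
    low, high = np.percentile(alphas, [tail, 100.0 - tail])
    return float(low), float(high)


# Synthetic stand-in: 6 scales, 5 seeds each, multiplicative noise.
rng = np.random.default_rng(1)
n_params = np.logspace(6, np.log10(6e7), 6)
clean = (1.5e14 / n_params) ** 0.076
losses = clean[:, None] * np.exp(rng.normal(0.0, 0.005, size=(6, 5)))

ci_low, ci_high = bootstrap_alpha_ci(n_params, losses)
print(f"95% bootstrap interval for alpha_N: [{ci_low:.4f}, {ci_high:.4f}]")
```

With only 5 seeds per scale, percentile intervals of this kind tend to be somewhat too narrow, so they are better read as a lower bound on the real uncertainty.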

7. Implications, Limitations, and Open Directions

Scaling laws are central to pretraining strategy, compute budgeting, and architecture search for frontier models.

Compute-optimal allocation: For the two-term power law $L(N,D)\sim A N^{-\alpha} + B D^{-\beta}$ with fixed budget $C\approx ND$, minimizing the loss over $N$ gives $N\sim C^{\beta/(\alpha+\beta)}$ and $D\sim C^{\alpha/(\alpha+\beta)}$, so the resource whose loss term decays more slowly receives the larger share of additional compute. This aligns with the empirical "Chinchilla-optimal" prescription and spectral-theory predictions (Su et al., 2024; Lin et al., 2024).
Limits and breakdowns: When approaching irreducible task noise, as in acoustic modeling or in translationese-dominated NMT, returns from further scale diminish rapidly. Architecture, domain, training regime, and data distribution can all induce deviations, saturation, or even reversal (e.g., systematic error redistribution in TinyML regimes or plateau if source/data spectrum is exhausted) (Alnemari et al., 7 Mar 2026; Ghorbani et al., 2021).
Future work: Open technical fronts include inference-aware scaling laws (test-time compute, per-query adaptivity), multi-objective or fairness-aware laws, spectrum-specific and compositional architectures, and predictive scaling in safety-critical, data-limited, or highly non-i.i.d. settings (Sengupta et al., 17 Feb 2025).
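
The compute-optimal allocation for the two-term law can be checked numerically. Substituting $D = C/N$ and setting the derivative with respect to $N$ to zero gives $N^{\alpha+\beta} = (\alpha A / \beta B)\, C^{\beta}$, so the optimal model size should grow as $C^{\beta/(\alpha+\beta)}$. The sketch below verifies that slope by brute-force grid search; the constants $A$, $B$, $\alpha$, $\beta$, the budget range, and the function name are hypothetical choices for illustration, and compute is taken as $C = ND$ as in the text.

```python
import numpy as np


def optimal_model_size(c, *, a, alpha, b, beta, grid=None):
    """Grid-search the N that minimizes A*N^-alpha + B*(C/N)^-beta."""
    if grid is None:
        grid = np.logspace(6, 13, 7001)  # candidate model sizes
    d = c / grid                         # data implied by the budget C = N*D
    loss = a * grid ** (-alpha) + b * d ** (-beta)
    return grid[np.argmin(loss)]


# Illustrative constants only.
A, ALPHA, B, BETA = 400.0, 0.34, 400.0, 0.28

budgets = np.logspace(18, 24, 13)
n_opt = np.array([
    optimal_model_size(c, a=A, alpha=ALPHA, b=B, beta=BETA) for c in budgets
])

# Slope of log N_opt against log C, compared with beta / (alpha + beta).
slope, _ = np.polyfit(np.log(budgets), np.log(n_opt), 1)
predicted = BETA / (ALPHA + BETA)
print(f"fitted exponent {slope:.4f} vs predicted {predicted:.4f}")
```

The data exponent follows from $D = C/N$ as $1 - \beta/(\alpha+\beta) = \alpha/(\alpha+\beta)$, so the two exponents always sum to one under this budget constraint.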

Scaling laws provide a precise predictive lens but must be contextualized to architecture, domain, data spectrum, and operational constraints for optimal practical use.