Model Scaling Laws in ML
- Model scaling laws in machine learning are power-law equations linking model size, dataset size, and compute to test loss, enabling performance prediction at large scales.
- They are derived through systematic log-linear regression on small models and validated across domains such as language, vision, and code.
- These laws inform compute-optimal resource allocation and model design, though they may break down in data-limited or high-noise regimes.
Model scaling laws in machine learning formalize the empirical observation that as one increases model size, dataset size, or compute resources, loss typically decreases according to simple power-law relationships, subject to irreducible task and noise floors. These quantitative laws enable practical prediction of model performance at resource scales orders of magnitude beyond initial experiments, and underpin resource allocation in the development of large-scale neural networks, including state-of-the-art LLMs. This article provides a comprehensive survey of the mathematical forms, methods of estimation, empirical validation, theoretical foundations, domain-specific variants, and known breakdowns of scaling laws, primarily citing technical results from “Unraveling the Mystery of Scaling Laws: Part I” (Su et al., 2024) and supporting contemporary literature.
1. Canonical Power Laws: Formulation and Core Relations
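Model scaling laws specify the asymptotic relationship between test (cross-entropy) loss $L$ and the key resource variables: model size ($N$), dataset size ($D$), optimization steps ($S$), and compute ($C$). Under fixed architecture, data distribution, and well-tuned hyperparameters, the empirically robust forms are:
- Model-size scaling (infinite data/compute):
$$L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}$$
where $N$ = number of (non-embedding) parameters, $\alpha_N$ = scaling exponent, $N_c$ = characteristic scale.
- Data-size scaling (infinite model/compute):
$$L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}$$
$D$ = dataset size (tokens), $\alpha_D$ = data exponent, $D_c$ = characteristic token count.
- Combined regime (infinite compute):
$$L(N, D) = \left[\left(\frac{N_c}{N}\right)^{\alpha_N/\alpha_D} + \frac{D_c}{D}\right]^{\alpha_D}$$
This form captures trade-offs when both $N$ and $D$ are varied (Su et al., 2024).
- Compute-limited regime:
$$L(N, S_{\min}) = \left(\frac{N_c}{N}\right)^{\alpha_N} + \left(\frac{S_c}{S_{\min}}\right)^{\alpha_S}$$
$S_{\min}$: minimal steps to reach a given $L$ at infinite batch size, $\alpha_S$ = step exponent, $S_c$ = step constant.
- Critical batch size scaling:
$$B_{\mathrm{crit}}(L) = \frac{B_*}{L^{1/\alpha_B}}$$
$B_*$ and $\alpha_B$ are tuned per experiment.
- Finite-batch trajectory (implicit in $L$):
$$L(N, S, B) = \left(\frac{N_c}{N}\right)^{\alpha_N} + \left(\frac{S_c}{S}\right)^{\alpha_S}\left(1 + \frac{B_{\mathrm{crit}}(L)}{B}\right)^{\alpha_S}$$
This follows from substituting $S_{\min} = S / (1 + B_{\mathrm{crit}}(L)/B)$ into the compute-limited form, and predicts the full time/loss trajectory for an arbitrary batch size $B$.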
All exponents and prefactors are determined via log-linear regression on experiments with small-scale models ($1$–$60$M parameters) and are then used to extrapolate to models up to $33$B parameters, as confirmed empirically (Su et al., 2024).
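Because the finite-batch relation is implicit in the loss, evaluating it requires a root solve. The sketch below assumes the form $L = (N_c/N)^{\alpha_N} + (S_c/S_{\min})^{\alpha_S}$ with $S_{\min} = S/(1 + B_{\mathrm{crit}}(L)/B)$ and $B_{\mathrm{crit}}(L) = B_*/L^{1/\alpha_B}$, and solves it by bisection. The function name `finite_batch_loss` and the constants in `DEMO` are hypothetical illustration values, not fitted results from the cited work.

```python
import math

def finite_batch_loss(n_params, steps, batch, *, n_c, alpha_n, s_c, alpha_s,
                      b_star, alpha_b, tol=1e-10, max_iter=200):
    """Solve L = (N_c/N)^a_N + (S_c/S_min(L))^a_S for L by bisection."""
    def rhs(loss):
        b_crit = b_star / loss ** (1.0 / alpha_b)   # critical batch size at this loss
        s_min = steps / (1.0 + b_crit / batch)      # equivalent infinite-batch steps
        return (n_c / n_params) ** alpha_n + (s_c / s_min) ** alpha_s

    # rhs(L) decreases as L grows, so rhs(L) - L has exactly one root.
    lo = (n_c / n_params) ** alpha_n                # rhs(L) > lo for every L > 0
    hi = max(2.0 * lo, 1.0)
    while rhs(hi) > hi:                             # expand until the root is bracketed
        hi *= 2.0
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if rhs(mid) > mid:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

# Hypothetical constants for illustration only (not values from any paper).
DEMO = dict(n_c=1e14, alpha_n=0.076, s_c=2e3, alpha_s=0.67,
            b_star=1e8, alpha_b=0.205)

if __name__ == "__main__":
    for batch in (2.5e5, 1e6, 4e6):
        loss = finite_batch_loss(1e9, 1e5, batch, **DEMO)
        print(f"batch={batch:.1e} tokens -> predicted loss {loss:.4f}")
```

Bisection is used here because the right-hand side is monotone in $L$, which guarantees a unique solution; a general-purpose root finder would work equally well.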
2. Practical Estimation of Scaling Parameters
Accurate scaling-law predictions hinge on systematic small-scale experimentation:
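- Estimating $N_c$, $\alpha_N$:
Train decoder-only Transformers ($1$M–$60$M) to convergence on massive corpora (negligible data-limited effects), then fit $\log L$ against $\log N$ to obtain $L(N) = (N_c/N)^{\alpha_N}$.
- Estimating $S_c$, $\alpha_S$:
For fixed $N$, use very large batch sizes to minimize step noise, record $L$ at many $S$, then fit $\log\left(L - (N_c/N)^{\alpha_N}\right)$ vs. $\log S$.
- Estimating $B_*$, $\alpha_B$:
For fixed $N$, train short runs at various $B$, compute $B_{\mathrm{crit}}$ from contours of constant $L$, then fit $\log B_{\mathrm{crit}}$ vs. $\log L$.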
Example fitted values:

| Corpus/Context | $\alpha_N$ | $N_c$ | $\alpha_S$ | $S_c$ | $\alpha_B$ | $B_*$ |
|-----------------|:----------:|:-------------:|:----------:|:---------:|:----------:|:-----------:|
| C4/1024 | 0.076 | | 0.67 | | 0.205 | |
| 3T-mix/4096 | 0.0615 | | 0.672 | | 0.139 | |
Constants depend sensitively on context length, tokenization, and data specifics (Su et al., 2024).
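The model-size fit described above reduces to ordinary least squares in log-log space. The following is a minimal sketch under stated assumptions: the loss follows $L = (N_c/N)^{\alpha_N}$ exactly up to small multiplicative noise, and the "measurements" are synthetic data generated from hypothetical constants (`TRUE_ALPHA`, `TRUE_N_C`), not results from any experiment.

```python
import numpy as np

def fit_power_law(sizes, losses):
    """Fit L = (N_c / N)^alpha by least squares on log L vs. log N."""
    slope, intercept = np.polyfit(np.log(sizes), np.log(losses), deg=1)
    alpha = -slope                       # log L = alpha*log N_c - alpha*log N
    n_c = np.exp(intercept / alpha)
    return float(alpha), float(n_c)

def predict_loss(n_params, alpha, n_c):
    """Evaluate the fitted power law at a (possibly much larger) model size."""
    return (n_c / n_params) ** alpha

# Synthetic small-model losses: hypothetical constants plus 0.2% noise.
TRUE_ALPHA, TRUE_N_C = 0.076, 1e14
rng = np.random.default_rng(0)
sizes = np.geomspace(1e6, 6e7, 8)        # 1M to 60M parameters
noise = np.exp(rng.normal(0.0, 0.002, sizes.size))
losses = predict_loss(sizes, TRUE_ALPHA, TRUE_N_C) * noise

alpha_hat, n_c_hat = fit_power_law(sizes, losses)
extrapolated = predict_loss(3.3e10, alpha_hat, n_c_hat)   # 33B parameters
print(f"alpha_N ~ {alpha_hat:.4f}; predicted loss at 33B ~ {extrapolated:.3f}")
```

Note that $N_c$ is recovered by dividing the intercept by a small exponent, so it is far more sensitive to noise than the extrapolated loss itself; in practice the prediction is usually the more trustworthy quantity.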
3. Empirical Validation and Cross-Domain Generality
Scaling laws have been robustly validated across language modeling, vision, code understanding, recommendation, acoustic modeling, and more.
- Language and code models: Exponents for cross-entropy loss fall in a similar range in large-scale NLP (Su et al., 2024) and in masked LLMs for code (Lin et al., 2024).
- Vision/TinyML: For models below 20M parameters, exponents for ConvNet error rate are substantially steeper (Alnemari et al., 7 Mar 2026), but local exponents decay and saturate at scale.
- Multi-output and kernel regression: Theoretical results confirm two-term power-law expansions whose parameters reflect the spectrum of the data covariance; these predict larger exponents than found in large-scale deep models, supporting the universality-but-nonuniversality hypothesis (Chen et al., 3 Mar 2025).
- Cases where scaling breaks: In small-data or "critical size" regimes scaling laws break down; tasks with fewer than 10K examples, a mismatch between pretraining and downstream tasks, or tasks dominated by irreducible noise often do not exhibit simple scaling (Ivgi et al., 2022; Alnemari et al., 7 Mar 2026).
- Composition bias and architecture sensitivity: For NMT, scaling exponents differ notably for encoder and decoder, and composition bias in train/test data can dominate the scaling phase and even suppress BLEU improvements beyond a threshold (Ghorbani et al., 2021).
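One way to see the "local exponents decay and saturate" behavior noted above is to compute the local slope $-\mathrm{d}\log L / \mathrm{d}\log N$ along a loss curve. The sketch below assumes a loss of the form floor plus power law; the floor, scale, and exponent are hypothetical values chosen for illustration.

```python
import numpy as np

def local_exponents(sizes, losses):
    """Local scaling exponent -d(log L)/d(log N) via finite differences."""
    return -np.gradient(np.log(losses), np.log(sizes))

# Hypothetical curve: irreducible floor 0.5 plus a power law with exponent 0.3.
sizes = np.geomspace(1e5, 1e9, 9)
losses = 0.5 + (1e6 / sizes) ** 0.3
exps = local_exponents(sizes, losses)

for n, e in zip(sizes, exps):
    print(f"N={n:.1e}  local exponent ~ {e:.3f}")
```

A single global fit on such data would report an exponent somewhere between the small-scale and large-scale slopes, which is why a steadily shrinking local exponent is a useful warning sign before extrapolating.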
4. Theoretical Foundations: Mechanisms and Universality
Scaling-law phenomena originate from a combination of statistical and dynamical effects in high-dimensional learning:
- Polynomial-spectrum mechanism: When the eigenvalues of the data covariance (or kernel) decay as a power law, optimal generalization yields excess risk that itself falls off as a power law, with an exponent set jointly by the target smoothness and the redundancy index (Bi et al., 25 Sep 2025).
- Random-feature and kernel regimes: Solvable models (random-feature ridge regression, NTK, field-theory dualities) display exact symmetry and identical scaling exponents for model and sample size, with breakdown or plateau when the model size or the sample size approaches the intrinsic dimension of the data (Maloney et al., 2022; Zhang, 2024).
- SGD implicit regularization: In linear settings with power-law spectrum and Gaussian prior, one-pass SGD suppresses variance terms, so that the empirically observed power-law scaling emerges, in sharp contrast to classical variance-limited bias-variance trade-offs (Lin et al., 2024).
- Capacity and redundancy: The sharpness of the spectrum—i.e., degree of redundancy—directly modulates scaling exponents; flatter spectra (higher redundancy) slow down returns-to-scale, motivating investigations of representation learning and spectrum regularization to accelerate power-law decay (Bi et al., 25 Sep 2025).
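A toy calculation can make the spectral intuition concrete. Assume, purely for illustration, that eigenvalues decay as $\lambda_i = i^{-\beta}$ and that a model of size $N$ captures the top $N$ modes, so its residual error is the unresolved spectral mass $\sum_{i>N} \lambda_i \approx N^{-(\beta-1)}/(\beta-1)$. This is a simplified stand-in, not the derivation used in the cited papers; `tail_mass` and `BETA` are hypothetical names.

```python
import numpy as np

def tail_mass(beta, cutoffs, n_modes=2_000_000):
    """Sum of eigenvalues lambda_i = i^(-beta) over modes i > cutoff."""
    lam = np.arange(1, n_modes + 1, dtype=np.float64) ** (-beta)
    # suffix[k] = lam[k] + lam[k+1] + ..., i.e. the modes with index i > k.
    suffix = np.cumsum(lam[::-1])[::-1]
    return suffix[np.asarray(cutoffs)]

BETA = 2.0
cutoffs = np.array([32, 64, 128, 256, 512, 1024, 2048])
tails = tail_mass(BETA, cutoffs)

# Log-log slope of residual mass vs. cutoff; expected to be close to -(BETA - 1).
slope = np.polyfit(np.log(cutoffs), np.log(tails), deg=1)[0]
print(f"fitted slope ~ {slope:.3f}, expected ~ {-(BETA - 1):.3f}")
```

In this toy setting a flatter spectrum (smaller `BETA`) gives a shallower slope, which mirrors the point above that higher redundancy slows returns to scale.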
5. Extensions: Compressed Models and Composite Laws
Unified scaling laws extend to models trained or deployed under compression (quantization, sparsity, etc.):
- Unified capacity law: All compressed formats obey a single scaling law parameterized by the "dense-equivalent" capacity of the format, measured by the format's per-dimension Gaussian MSE (Panferov et al., 2 Jun 2025).
- Compositionality: For combinations (e.g., sparse+quantized), the capacity factors multiplicatively, greatly simplifying cross-format scaling prediction.
- Empirical validation: For Llama-style Transformers, the fitted scaling parameters remain stable across INT, FP formats, and sparse/quantized hybrids. RMSE-injection predicts the parameter efficiency in the scaling-law fit directly from format statistics without retraining (Panferov et al., 2 Jun 2025).
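The per-dimension Gaussian MSE mentioned above can be measured without training anything. The sketch below does so for one assumed format, symmetric uniform integer quantization with absmax scaling; it covers only the measurement step and does not reproduce how the cited work maps MSE to capacity. The function name and the scaling choice are assumptions.

```python
import numpy as np

def gaussian_mse_uniform_quant(bits, n_samples=200_000, seed=0):
    """Per-dimension MSE of symmetric uniform quantization on N(0, 1) samples."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n_samples)
    levels = 2 ** (bits - 1) - 1            # e.g. 7 positive levels for 4 bits
    scale = np.max(np.abs(x)) / levels      # absmax scaling (an assumed choice)
    x_hat = np.clip(np.round(x / scale), -levels, levels) * scale
    return float(np.mean((x - x_hat) ** 2))

if __name__ == "__main__":
    for bits in (2, 3, 4, 8):
        print(f"INT{bits}: Gaussian MSE ~ {gaussian_mse_uniform_quant(bits):.5f}")
```

Other formats (floating-point grids, sparsity masks, or hybrids) could be compared by swapping in a different quantize-dequantize step while keeping the same Gaussian input.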
6. Methodological Best Practices for Scaling Law Estimation
Robust scaling law estimation requires precise methodology:
- Data: Preferably train 5–10 small to moderate models spanning a range of scales beneath the target, holding architecture and training protocol constant (Choshen et al., 2024).
- Checkpoint inclusion: Always include intermediate training checkpoints (excluding the first 10% of steps), which substantially improves predictive accuracy.
- Goodness-of-fit and extrapolation: Only trust extrapolations when the fit meets a stringent goodness-of-fit threshold; out-of-family extrapolation requires careful cross-family parameter transfer with fixed exponents.
- Uncertainty quantification: Use at least 5 seeds per scale and a hierarchical bootstrap (≥1000 resamples) to obtain confidence intervals on scaling parameters and predictions.
- Small-scale protocol efficiency: Training only small models to fit power laws can forecast attributes (loss, steps-to-loss, compute, optimal batch size) of 10B+ models in advance, achieving large compute savings in model design (Su et al., 2024; Choshen et al., 2024).
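The uncertainty-quantification step above can be sketched as a two-level bootstrap: resample which scales enter the fit, then resample seeds within each chosen scale, and refit the exponent each time. This is a minimal illustration on synthetic losses generated from a hypothetical exponent (`TRUE_ALPHA`); the function name and the percentile interval are assumptions, not a prescribed procedure from the cited work.

```python
import numpy as np

def bootstrap_exponent_ci(sizes, losses, n_boot=1000, seed=0):
    """95% percentile CI for the exponent; losses has shape (n_scales, n_seeds)."""
    rng = np.random.default_rng(seed)
    n_scales, n_seeds = losses.shape
    log_n = np.log(sizes)
    alphas = np.empty(n_boot)
    for b in range(n_boot):
        # Level 1: resample scales; redraw if fewer than two distinct sizes.
        while True:
            scale_idx = rng.integers(0, n_scales, size=n_scales)
            if np.unique(scale_idx).size >= 2:
                break
        # Level 2: resample seeds independently within each chosen scale.
        seed_idx = rng.integers(0, n_seeds, size=(n_scales, n_seeds))
        mean_loss = losses[scale_idx[:, None], seed_idx].mean(axis=1)
        slope, _ = np.polyfit(log_n[scale_idx], np.log(mean_loss), deg=1)
        alphas[b] = -slope
    return np.percentile(alphas, [2.5, 97.5])

# Synthetic data: 6 scales x 5 seeds, 1% seed-to-seed noise (hypothetical).
TRUE_ALPHA = 0.076
rng = np.random.default_rng(1)
sizes = np.geomspace(1e6, 6e7, 6)
clean = (1e14 / sizes) ** TRUE_ALPHA
losses = clean[:, None] * np.exp(rng.normal(0.0, 0.01, size=(6, 5)))

lo, hi = bootstrap_exponent_ci(sizes, losses)
print(f"95% CI for alpha_N: [{lo:.4f}, {hi:.4f}]")
```

The same resampled fits can be pushed through the extrapolation formula to obtain an interval on the predicted loss at the target scale rather than on the exponent alone.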
7. Implications, Limitations, and Open Directions
Scaling laws are central to pretraining strategy, compute budgeting, and architecture search for frontier models.
- Compute-optimal allocation: For a two-term power law with fixed budget $C$, the optimal $N$ and $D$ each scale as a power of $C$, aligning with the empirical "Chinchilla-optimal" prescription and spectral-theory predictions (Su et al., 2024; Lin et al., 2024).
- Limits and breakdowns: When approaching irreducible task noise, as in acoustic modeling or in translationese-dominated NMT, returns from further scale diminish rapidly. Architecture, domain, training regime, and data distribution can all induce deviations, saturation, or even reversal (e.g., systematic error redistribution in TinyML regimes or plateau if source/data spectrum is exhausted) (Alnemari et al., 7 Mar 2026; Ghorbani et al., 2021).
- Future work: Open technical fronts include inference-aware scaling laws (test-time compute, per-query adaptivity), multi-objective or fairness-aware laws, spectrum-specific and compositional architectures, and predictive scaling in safety-critical, data-limited, or highly non-i.i.d. settings (Sengupta et al., 17 Feb 2025).
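The compute-optimal allocation mentioned in the first bullet can be worked out in a few lines. The derivation below is a sketch under two assumptions that are not stated in the text: a generic two-term loss with coefficients $K_N$, $K_D$, exponents $\alpha$, $\beta$, and floor $L_\infty$, and the common approximation $C \approx 6ND$ for training compute. The symbols are illustrative notation, not the cited papers' own.

$$
\begin{aligned}
L(N, D) &= K_N N^{-\alpha} + K_D D^{-\beta} + L_\infty, \qquad C \approx 6 N D \\
\frac{\partial}{\partial N}\left[ K_N N^{-\alpha} + K_D \left(\frac{6N}{C}\right)^{\beta} \right] = 0
\;&\Rightarrow\; \alpha K_N N^{-\alpha} = \beta K_D \left(\frac{6N}{C}\right)^{\beta} \\
N_{\mathrm{opt}}(C) &= G \left(\frac{C}{6}\right)^{\beta/(\alpha+\beta)}, \qquad
D_{\mathrm{opt}}(C) = G^{-1} \left(\frac{C}{6}\right)^{\alpha/(\alpha+\beta)}, \qquad
G = \left(\frac{\alpha K_N}{\beta K_D}\right)^{1/(\alpha+\beta)}
\end{aligned}
$$

When $\alpha \approx \beta$, both exponents are close to $1/2$, so parameters and tokens grow roughly in proportion as the budget increases, which is the qualitative content of the "Chinchilla-optimal" prescription.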
Scaling laws provide a precise predictive lens but must be contextualized to architecture, domain, data spectrum, and operational constraints for optimal practical use.