Scaling Laws in Machine Learning
- Model scaling laws are power-law relationships that define how performance metrics improve with increased model parameters, dataset size, and compute resources.
- Empirical studies across language, speech, recommendation, and reinforcement learning validate these laws, revealing predictable regimes of diminishing returns and optimal resource allocation.
- Scaling laws offer actionable guidance for early model selection, hyperparameter tuning, and forecasting resource needs by quantifying the trade-offs between model complexity and training data.
Model scaling laws in machine learning formalize the empirical observation that task performance (typically test loss or error) improves as a smooth power law relative to the resources committed to training—namely, model size, dataset size, and total computation. These laws allow practitioners to predict the performance of larger models or datasets by extrapolating from smaller variants, and they provide principled guidelines for optimizing model development, resource allocation, and architectural choices across diverse domains of supervised, unsupervised, and reinforcement learning.
1. Mathematical Formulation and Universality
Scaling laws describe the dependence of model performance metrics (e.g., cross-entropy loss, word error rate) on resources such as the number of parameters $N$, dataset size $D$, and compute budget $C$. A canonical form underlying many domains is

$$L(N, D) = L_\infty + \left(\frac{N_c}{N}\right)^{\alpha_N} + \left(\frac{D_c}{D}\right)^{\alpha_D},$$

where $L_\infty$ is the irreducible loss, $N_c$ and $D_c$ are scale parameters, and $\alpha_N$, $\alpha_D$ are exponents controlling the rate of improvement with model size and data. When only a single resource dominates, this reduces to simpler power laws:

$$L(N) \approx L_\infty + \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad L(D) \approx L_\infty + \left(\frac{D_c}{D}\right)^{\alpha_D}.$$
Similar forms appear in other settings, such as kernel regression, where the excess risk scales as

$$\mathcal{E}(M) \propto M^{-(\alpha - 1)},$$

with $M$ the effective model dimension and $\alpha$ the spectral decay exponent of the data covariance matrix (Lin et al., 12 Jun 2024).
This universality across tasks, modalities, and even theoretical models (e.g., random feature models, kernel methods) is rooted in underlying properties such as the eigenvalue spectrum of the covariance of real-world data, which empirically exhibits power-law decay (Maloney et al., 2022, Bi et al., 25 Sep 2025).
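In practice, the canonical form above is fit to a small grid of (model size, data size, loss) measurements by nonlinear least squares and then extrapolated to larger configurations. The sketch below uses SciPy's `curve_fit` for such a fit; the grid values, initial guess, and resulting constants are illustrative assumptions rather than numbers from the cited studies.

```python
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(X, L_inf, A, B, alpha_N, alpha_D):
    """L(N, D) = L_inf + A * N**(-alpha_N) + B * D**(-alpha_D).

    Equivalent to the (N_c/N)**alpha_N + (D_c/D)**alpha_D form with
    A = N_c**alpha_N and B = D_c**alpha_D; this parametrization keeps
    the fitted constants at moderate magnitudes.
    """
    N, D = X
    return L_inf + A * N ** (-alpha_N) + B * D ** (-alpha_D)

# Hypothetical (N, D, loss) grid from small-scale runs (illustrative numbers only).
N = np.array([1e7, 1e7, 1e7, 1e8, 1e8, 1e8, 1e9, 1e9, 1e9])
D = np.array([1e9, 1e10, 1e11, 1e9, 1e10, 1e11, 1e9, 1e10, 1e11])
loss = np.array([5.99, 4.99, 4.50, 4.99, 4.00, 3.50, 4.50, 3.50, 3.00])

# Fit the five constants; positivity bounds keep the fit physical.
p0 = [1.0, 100.0, 100.0, 0.2, 0.2]
popt, _ = curve_fit(scaling_law, (N, D), loss, p0=p0, bounds=(0, np.inf))
L_inf, A, B, aN, aD = popt
print(f"L_inf={L_inf:.2f}  alpha_N={aN:.2f}  alpha_D={aD:.2f}")

# Extrapolate to a configuration larger than anything in the fit.
print("predicted loss at N=1e10, D=1e12:", scaling_law((1e10, 1e12), *popt))
```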
2. Empirical Validation Across Domains
Scaling laws have been empirically validated in a wide variety of machine learning paradigms and domains:
- Language Modeling: Transformer-based models, as demonstrated by Kaplan et al. and extended to models with up to 33B parameters, follow scaling laws for test loss in $N$ and $D$, with clear regimes of diminishing returns and predictable performance floors (Su et al., 11 Mar 2024).
- Acoustic/Speech Models: Auto-predictive coding loss in acoustic models exhibits power-law scaling over two orders of magnitude in both $N$ and $D$, with precise prediction of performance limits (Droppo et al., 2021, Gu et al., 2023).
- Recommendation Systems: DLRM-style click-through rate prediction models obey a power-law plus constant form in all three axes (data, parameter count, compute), with data scaling retaining efficiency even when parameter scaling saturates (Ardalani et al., 2022).
- Reinforcement Learning: In multi-agent RL (e.g., AlphaZero), the Elo rating (playing strength) scales as a power law in both network size and training compute, with the existence of an optimal size–compute allocation and observed sample efficiency gains in larger models (Neumann et al., 2022).
- Code Understanding, Time Series, Multilingual LMs: Code search/clone detection (Lin et al., 20 Feb 2024), forecasting with TSFMs (Yao et al., 16 Oct 2024), and multilingual pretraining (He et al., 15 Oct 2024) show robust power-law scaling, occasionally complicated by cross-task interactions or the need for family-level aggregation.
- Material Property Prediction: Neural models for predicting molecular/material properties follow scaling laws in data, parameter count, and compute, with exponents depending on architecture and resource (Trikha et al., 26 Sep 2025).
Power-law behaviors have also been demonstrated in kernel regression (Chen et al., 3 Mar 2025, Bi et al., 25 Sep 2025), with scaling exponents determined by data spectrum redundancy.
3. Interpretation of Scaling Exponents and Limitations
The scaling exponents $\alpha_N$, $\alpha_D$, $\alpha_C$, etc., quantify how rapidly task error decreases with increased resources. Larger exponents indicate more efficient scaling: a modest increase in resources yields substantial performance gain, whereas small exponents denote diminishing returns. For example, in speech models, a 5% improvement in loss from increasing $N$ alone requires a 25× increase in parameters, while the corresponding data scaling exponent $\alpha_D$ implies a 14× data increase for the same reduction (Droppo et al., 2021).
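To make this arithmetic explicit: if a loss term scales as $x^{-\alpha}$ in resource $x$, the multiplier on $x$ needed to shrink that term by a fraction $\delta$ is $(1-\delta)^{-1/\alpha}$. The helper below is a minimal sketch of that relationship; reading the quoted 5% figure as a reduction of the power-law term itself (rather than of the total loss including the irreducible floor) is a simplifying assumption.

```python
import math

def resource_multiplier(alpha: float, rel_reduction: float) -> float:
    """Multiplier on a resource (N or D) needed to shrink the corresponding
    power-law loss term by `rel_reduction` (e.g. 0.05 for a 5% reduction),
    assuming that term scales as resource**(-alpha)."""
    return (1.0 - rel_reduction) ** (-1.0 / alpha)

def implied_exponent(multiplier: float, rel_reduction: float) -> float:
    """Inverse view: the exponent implied by observing that a `multiplier`-fold
    resource increase buys a `rel_reduction` fractional improvement."""
    return math.log(1.0 / (1.0 - rel_reduction)) / math.log(multiplier)

# A large exponent scales efficiently: a 5% reduction needs only ~1.2x resources.
print(resource_multiplier(alpha=0.3, rel_reduction=0.05))

# The quoted speech-model figures (25x params or 14x data for ~5%) correspond,
# under this simplified reading, to very small effective exponents.
print(implied_exponent(25.0, 0.05), implied_exponent(14.0, 0.05))
```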
A universal irreducible loss term or asymptotic limit $L_\infty$ often emerges, representing the fundamental lower bound achievable by any model given infinite data or parameters. Once the predicted achievable loss nears this floor, further scaling is futile without architectural innovation or better data.
Limitations are domain- and architecture-specific. For some tasks (e.g., discriminative rescoring (Gu et al., 2023), recommendation systems (Ardalani et al., 2022)), parameter scaling yields minimal gains beyond a certain point, suggesting a saturated regime. In other settings, scaling laws break down when model size or dataset size exceeds the intrinsic latent dimension of the data (as seen in random feature models, where risk plateaus once $N$ or $D$ exceeds the latent dimensionality) (Maloney et al., 2022).
Sensitivity to experimental design is a recurrent theme: The prescriptions derived from scaling laws (e.g., optimal token–parameter ratio, width–depth tradeoffs) can vary sharply with training schedules, hyperparameter settings, or architectural details. The Gemstones dataset (McLeish et al., 7 Feb 2025) demonstrates that narrow experimental design leads to fragile prescriptions, and practitioners should account for such sensitivity.
4. Theoretical Mechanisms: Covariance Spectrum and Redundancy
Theoretical work connects scaling exponents to the redundancy in data representations. When the input data covariance matrix has a polynomial spectral tail, $\lambda_i \propto i^{-\alpha}$, the effective dimension and hence the error decay are governed by

$$L(D) - L_\infty \propto D^{-\beta},$$

with the exponent $\beta$ increasing in both the spectral decay $\alpha$ and the source regularity $s$ (Bi et al., 25 Sep 2025). Higher redundancy (a flatter spectrum, smaller $\alpha$) slows learning (smaller $\beta$).
Random feature models and kernel regression provide analytic tractability to explore how power-law spectra of features or kernels translate into observable scaling laws in predictive accuracy, exposing the role of nonlinear activations in “extending” the effective range of the spectrum (Maloney et al., 2022, Chen et al., 3 Mar 2025). Universality is established: the exponent remains invariant under invertible linear transformation of features and is set by the heaviest-tailed spectral component in mixture distributions.
This anchoring of the empirical scaling law to data statistics explains cross-domain ubiquity and why architecture- or resource-based improvements ultimately “run into” the data’s intrinsic redundancy limit.
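The tail-sum intuition behind these exponents can be checked numerically: if a model of effective dimension $M$ captures only the top-$M$ eigendirections of a spectrum $\lambda_i \propto i^{-\alpha}$, the residual risk $\sum_{i>M}\lambda_i$ decays as $M^{-(\alpha-1)}$, so flatter spectra give slower decay. The sketch below verifies this under the simplifying assumption of a pure capacity limit (the source-regularity term is ignored).

```python
import numpy as np
from scipy.special import zeta  # Hurwitz zeta: zeta(a, q) = sum_{k>=0} (q + k)**(-a)

def residual_risk(M, alpha):
    """Risk left unexplained when only the top-M eigendirections of a covariance
    spectrum lambda_i ~ i**(-alpha) are captured: sum over i > M of i**(-alpha)."""
    return zeta(alpha, M + 1)

for alpha in (1.2, 1.5, 2.0):
    Ms = np.array([1e2, 1e3, 1e4, 1e5])
    risks = np.array([residual_risk(int(M), alpha) for M in Ms])
    # Fit the decay exponent on a log-log scale; the tail-sum argument
    # predicts beta = alpha - 1, i.e. slower decay for flatter spectra.
    beta = -np.polyfit(np.log(Ms), np.log(risks), 1)[0]
    print(f"alpha={alpha}: fitted beta ~ {beta:.3f} (alpha - 1 = {alpha - 1:.1f})")
```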
5. Practical Guidance for Model and Resource Allocation
The predictive accuracy of scaling laws enables quantitative planning:
- Resource Optimization: For fixed budgets, scaling laws indicate optimal trade-offs between model size and training data (or compute). For example, in the scaling law $L(N, D) = L_\infty + (N_c/N)^{\alpha_N} + (D_c/D)^{\alpha_D}$, the exponent ratio $\alpha_N/\alpha_D$ prescribes how much to grow $D$ when $N$ is increased, since keeping the two terms balanced requires $D \propto N^{\alpha_N/\alpha_D}$ (Droppo et al., 2021); see the sketch after this list.
- Early-Stage Model Selection: Extrapolation from small-scale models accurately forecasts large-scale performance, enabling efficient pre-selection of architecture, optimization strategy, and pretraining objectives (Ivgi et al., 2022).
- Hyperparameter Tuning: Explicit expressions for test loss as a function of batch size, training steps, and learning rate enable estimation of optimal batch sizes and convergence trajectories with minimal experimentation (Su et al., 11 Mar 2024, Wang et al., 8 Oct 2024).
- Training Regimes: Results suggest that full convergence is not always computationally optimal; training larger models for fewer updates may reach a performance frontier more efficiently (Droppo et al., 2021). For MoE models, the scaling law justifies allocation of budget to increase model width rather than only data (Wang et al., 8 Oct 2024).
- Performance Forecasting: Scaling laws allow scientists to estimate the required resource expansion to reach specified error thresholds, crucial for long-term research and hardware planning in large-scale AI system development (Ardalani et al., 2022, Trikha et al., 26 Sep 2025).
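A minimal numerical sketch of the resource-optimization point above: given illustrative fitted constants and the common approximation that transformer training compute scales as $C \approx 6ND$ (an assumption, not a figure from the cited studies), one can sweep model sizes under a fixed budget and read off the loss-minimizing split. The constants and helper names are hypothetical.

```python
import numpy as np

# Illustrative fitted constants (not taken from any cited study).
L_inf, A, B, alpha_N, alpha_D = 2.0, 250.0, 250.0, 0.30, 0.30

def loss(N, D):
    """L(N, D) = L_inf + A * N**(-alpha_N) + B * D**(-alpha_D)."""
    return L_inf + A * N ** (-alpha_N) + B * D ** (-alpha_D)

def optimal_allocation(C, flops_per_param_token=6.0):
    """Sweep candidate model sizes under a fixed compute budget C ~ 6 * N * D
    (assumed compute model) and return the loss-minimizing (N, D) pair."""
    N = np.logspace(6, 12, 2000)                 # candidate parameter counts
    D = C / (flops_per_param_token * N)          # data size implied by the budget
    i = np.argmin(loss(N, D))
    return N[i], D[i]

for C in (1e19, 1e21, 1e23):
    N_opt, D_opt = optimal_allocation(C)
    print(f"C={C:.0e}: N*~{N_opt:.2e}, D*~{D_opt:.2e}, D*/N*~{D_opt / N_opt:.1f}")

# With alpha_N = alpha_D, the optimum keeps D proportional to N
# (D ~ N**(alpha_N / alpha_D)), so the token-to-parameter ratio stays constant.
```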
6. Extensions and Open Questions
Recent research generalizes scaling laws to new contexts:
- Uncertainty Quantification: Predictive uncertainty, both epistemic and aleatoric, across vision and language models is found to follow scaling laws with dataset size, contradicting the popular notion that "so much data" makes Bayesian treatment unnecessary (Rosso et al., 11 Jun 2025).
- Multilingual and Multi-domain Models: A proposed scaling law for multilingual LMs reveals that, for each language family, performance improvement depends only on its own sampling ratio, not on the data distribution of other families—thus, global optimal sampling proportions can be derived from small-scale models and scale to large settings (He et al., 15 Oct 2024).
- Architectural Breadth: Comparative studies show that MoE and dense models share underlying scaling frameworks, but differences in noise scale and regularization affect optimal batch size and learning rate prescriptions (Wang et al., 8 Oct 2024).
- Time Series Models: Power-law scaling persists for time series foundation models (TSFMs) in both in- and out-of-distribution generalization, with encoder-only architectures displaying superior scalability (Yao et al., 16 Oct 2024).
- Material Science: Scaling exponents in material property prediction are architecture-dependent; physically-informed models (e.g., EquiformerV2) more efficiently convert added capacity or data into error reduction than unconstrained Transformers (Trikha et al., 26 Sep 2025).
Open challenges include understanding architectural factors that break or accelerate the observed power law, scaling behaviors in highly multi-modal or imbalanced domains, and universality limits in extremely overparameterized regimes or non-i.i.d. training scenarios.
7. Summary Table: Representative Scaling Law Instances
Domain/Task | Scaling Formula | Exponent behavior/Notes |
---|---|---|
Language modeling (Su et al., 11 Mar 2024) | $L(N, D) = L_\infty + (N_c/N)^{\alpha_N} + (D_c/D)^{\alpha_D}$ | Valid up to 33B params; constants depend on setup |
Acoustic models (Droppo et al., 2021) | Same form as above | Irreducible loss $L_\infty$ sets floor |
Recommendation (Ardalani et al., 2022) | Power law plus constant in each of $N$, $D$, $C$ | Parameter scaling saturated; data/compute scaling still help |
Kernel regression (Bi et al., 25 Sep 2025) | $\mathcal{E}(D) \propto D^{-\beta}$ | $\beta$ set by spectral tail exponent $\alpha$ and source regularity $s$ |
Multilingual LMs (He et al., 15 Oct 2024) | Power law in each family's sampling ratio | Opt. ratios from small models generalize to large |
Uncertainty (Rosso et al., 11 Jun 2025) | Power law in dataset size | Epistemic uncertainty decays with more data, nonzero floor |
RL (Neumann et al., 2022) | Elo (playing strength) power law in $N$ and compute | Optimal size–compute allocation computable |
Materials (Trikha et al., 26 Sep 2025) | Power laws in $N$, $D$, $C$ | Exponent a strong function of architecture |
This table typifies the generality and predictive power of scaling laws. The actual formulas and exponent values are determined empirically or theoretically based on the domain, model, and data spectrum.
Scaling laws provide a unified, predictive, and theoretically motivated framework for understanding how model performance improves with increased resources. They reveal the central role of data redundancy, architectural inductive biases, and resource allocation in dictating learning efficiency and guiding future advances in large-scale machine learning.