Scaling Law Discovery (SLD)
- Scaling Law Discovery (SLD) is the study of mathematical relationships that predict how key system performance metrics change as factors like model size and data scale.
- It employs methodologies such as power-law, log-linear, and dimensionless analyses, combining theoretical derivation with empirical curve fitting.
- SLD drives optimal resource allocation, automated scientific discovery, and improved forecasting across fields like deep learning, physics, and urban studies.
Scaling Law Discovery (SLD) encompasses the identification and formalization of mathematical relationships that predict how key system properties—such as generalization error, test loss, or output distributions—change as critical variables (like model size, dataset size, compute, or other system resources) are scaled. SLD provides a rigorous basis for understanding empirical regularities (e.g., Zipf’s law, neural network scaling, dimensionless laws in physics and engineering), enabling principled performance forecasting, optimal resource allocation, and new avenues for automated scientific discovery across scientific and engineering domains.
1. Mathematical Principles and Representative Scaling Laws
Scaling law discovery is grounded in the empirical observation that many systems—across linguistics, urban studies, physics, and machine learning—exhibit regularities expressible as simple functional forms relating variables of interest. The most prevalent include:
- Power-Law Scaling:
  $L(x) = a\,x^{-\alpha} + \epsilon_\infty$, where $L$ is the response metric (test loss, error, etc.), $x$ is a scaling variable (model parameters, data size, computation), $\alpha$ is the scaling exponent, $a$ a coefficient, and $\epsilon_\infty$ an irreducible minimum or noise floor (Su et al., 11 Mar 2024, Zhang et al., 2023, Rosenfeld, 2021).
- Log-Linear Laws:
  $y = a \log x + b$, where $y$ is the acceptance rate or throughput and $x$ could be pretraining tokens, draft model depth, or batch size in decoding tasks (Yan et al., 8 May 2025).
- Dimensionless/Homogeneous Group Laws in Physics:
  $\Pi = \prod_i q_i^{\gamma_i}$, where $\Pi$ is a dimensionless group, the $q_i$ are physical variables, and $(\gamma_i)$ is a combination of exponents ensuring dimensional balance via the dimension matrix (Xie et al., 2021).
- Rank-Size and Hierarchical Scaling:
  $P(r) = P_1\,r^{-q}$, where $P(r)$ is the size of the rank-$r$ element, as in Zipf's law and its generative models for urban systems (1104.5630).
- Generalization Error Decomposition:
  $\epsilon(m, n) = a\,m^{-\alpha} + b\,n^{-\beta} + c_\infty$, where $m$ (model size) and $n$ (data size) control decays from independent sources, and $c_\infty$ is the noise or irreducible error (Rosenfeld, 2021).
- DP Law Semiparametric Forms:
  $L = f(N, T, \hat{\sigma})$, where $L$ is loss, $N$ is model parameters, $T$ iterations, and $\hat{\sigma}$ the DP noise-batch ratio (McKenna et al., 31 Jan 2025).
Critically, the scaling exponents and lower-bound constants (e.g., $\alpha$ and $\epsilon_\infty$ above) are typically determined empirically and often depend on experimental setup, data distribution, and architectural choices (Su et al., 11 Mar 2024).
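As a concrete illustration of these functional forms, here is a minimal sketch (with made-up coefficients chosen purely for illustration) that evaluates the power-law-plus-floor and joint decomposition forms above:

```python
import numpy as np

def power_law(x, a, alpha, eps_inf):
    """Power law with an irreducible floor: L(x) = a * x**(-alpha) + eps_inf."""
    return a * x ** (-alpha) + eps_inf

def joint_error(m, n, a, alpha, b, beta, c_inf):
    """Generalization-error decomposition: eps(m, n) = a*m**(-alpha) + b*n**(-beta) + c_inf."""
    return a * m ** (-alpha) + b * n ** (-beta) + c_inf

# Hypothetical coefficients; real values must be fitted per setup (Section 2).
print(power_law(np.array([1e6, 1e8, 1e10]), a=5.0, alpha=0.3, eps_inf=1.7))
print(joint_error(m=1e8, n=1e10, a=3.0, alpha=0.35, b=4.0, beta=0.28, c_inf=1.7))
```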
2. Methodologies for Discovery and Estimation
Most scaling law discovery proceeds via two intertwined paths: theoretical derivation and empirical curve fitting or optimization.
Theoretical Approaches
- Statistical Mechanics & Entropy Maximization:
Employ variational principles to derive rank-size and upscaling laws in complex systems, resulting in exponential or power law relations under maximum-entropy constraints (1104.5630).
- Spectral Analysis/RMT:
Use eigenvalue spectra and random matrix theory to analytically derive scaling behavior, particularly in Gaussian and random feature generative models (Maloney et al., 2022, Ding et al., 13 Feb 2025, Chen et al., 3 Mar 2025).
- Dimensional Analysis:
Formulate dimensionless scaling laws by solving the dimensional-homogeneity constraints encoded in the dimension matrix and fitting the resulting analytic representations to data (dimensionless learning) (Xie et al., 2021); a sketch of the dimension-matrix step follows below.
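A minimal sketch of that step, assuming a hypothetical three-variable example (velocity, length, kinematic viscosity); exponent vectors in the null space of the dimension matrix yield dimensionless groups:

```python
import sympy as sp

# Dimension matrix D: rows are base dimensions (M, L, T), columns are the
# variables. Hypothetical example: velocity v [L T^-1], length l [L],
# kinematic viscosity nu [L^2 T^-1].
D = sp.Matrix([
    [0,  0,  0],   # mass M
    [1,  1,  2],   # length L
    [-1, 0, -1],   # time T
])

# Each null-space basis vector gives exponents (gamma_v, gamma_l, gamma_nu)
# of a dimensionless group Pi = v**gamma_v * l**gamma_l * nu**gamma_nu.
for vec in D.nullspace():
    print(vec.T)  # a multiple of (1, 1, -1), i.e. Pi = v*l/nu (Reynolds number)
```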
Empirical and Data-Driven Approaches
- Empirical Curve Fitting:
Fit log-log linear or double power-law models to performance curves across model/data/compute scale, typically via least-squares regression, robust error minimization, or normalized mean squared error as the objective (Su et al., 11 Mar 2024, 2303.0705); a fitting sketch follows this list.
- Batch Rescaling and Data Collapse:
Verify a scaling ansatz (e.g., for word frequencies) by rescaling the data and confirming that curves measured at different scales collapse onto a universal scaling function (Font-Clos et al., 2013, Font-Clos et al., 2014).
- Automated Symbolic Regression and Evolutionary Search:
Use program synthesis or evolutionary algorithms to search function space for universal scaling expressions, sometimes guided and mutated by LLMs (EvoSLD) (Lin et al., 27 Jul 2025).
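A minimal curve-fitting sketch (synthetic data, hypothetical constants) that recovers the power-law-plus-floor form of Section 1 by non-linear least squares; log-space or normalized-MSE objectives are common robust alternatives:

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(x, a, alpha, eps_inf):
    return a * x ** (-alpha) + eps_inf

# Synthetic performance curve: made-up ground truth plus mild multiplicative noise.
rng = np.random.default_rng(0)
x = np.logspace(3, 9, 30)                              # e.g. parameter counts
y = power_law(x, 5.0, 0.3, 0.5) * rng.lognormal(0.0, 0.01, x.size)

popt, _ = curve_fit(power_law, x, y, p0=(1.0, 0.5, 0.5), maxfev=10_000)
a_hat, alpha_hat, eps_hat = popt
print(f"a={a_hat:.2f}, alpha={alpha_hat:.3f}, eps_inf={eps_hat:.3f}")
```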
Estimation of Constants
- Small-Scale Estimation:
Estimate scaling constants (e.g., $a$ and $\alpha$) via regressions on small-scale experiments (models with 1M–60M parameters) and extrapolate accurately to multi-billion-parameter regimes (Su et al., 11 Mar 2024).
- Two-Level Optimization:
Alternate between fitting coefficients of parametric forms (within groups defined by control variables) and searching over candidate dimensionless groups or functional forms (Xie et al., 2021, Lin et al., 27 Jul 2025); a schematic follows below.
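A schematic of the alternating two-level scheme, assuming a single candidate dimensionless group $\Pi = q_1 q_2^{\gamma}$ with a grid search over $\gamma$ (outer level) and a closed-form log-log regression (inner level); all names and data are hypothetical:

```python
import numpy as np

def fit_inner(pi, y):
    """Inner level: closed-form log-log fit of y ~ c * pi**k."""
    k, log_c = np.polyfit(np.log(pi), np.log(y), 1)
    resid = np.log(y) - (k * np.log(pi) + log_c)
    return (np.exp(log_c), k), float(np.mean(resid ** 2))

def search_outer(q1, q2, y, gammas):
    """Outer level: scan candidate exponents gamma for Pi = q1 * q2**gamma."""
    best = None
    for gamma in gammas:
        params, err = fit_inner(q1 * q2 ** gamma, y)
        if best is None or err < best[2]:
            best = (gamma, params, err)
    return best  # (best gamma, (c, k), mean squared log-residual)

# Hypothetical data generated from Pi = q1/q2 with y = 2 * Pi**0.7.
rng = np.random.default_rng(1)
q1, q2 = rng.uniform(1, 10, 200), rng.uniform(1, 10, 200)
y = 2.0 * (q1 / q2) ** 0.7 * rng.lognormal(0.0, 0.01, 200)
print(search_outer(q1, q2, y, gammas=np.linspace(-2, 2, 41)))
```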
The interplay of mathematical derivations and robust data-driven estimation is central to reliable SLD across practical and theoretical settings.
3. Types of Scaling Laws and Contexts
SLD encompasses a diversity of system types and scientific domains, with distinct scaling laws in each context:
| Domain | Scaling Law Type | Functional Example |
|---|---|---|
| Urban systems | Rank-size/Zipf, exponent law | $P(r) = P_1 r^{-q}$ |
| Linguistics | Zipf, Heaps, scaling ansatz | $f(r) \propto r^{-\alpha}$, $V(N) \propto N^{\beta}$ |
| Deep learning | Power law, law + constant | $L(x) = a x^{-\alpha} + \epsilon_\infty$ |
| Physics/Engineering | Dimensionless (Pi) law | $\Pi = \prod_i q_i^{\gamma_i}$ |
| Differential privacy | Semi-parametric, non-linear | $L = f(N, T, \hat{\sigma})$ |
| Decoding algorithms | Log-linear | $y = a \log x + b$ |
| Regression (kernel, multi) | Power-law decay | $\epsilon(n) \propto n^{-\beta}$ |
In each domain, the scaling law formalizes the empirically robust relationship between a scaling variable and an outcome, occasionally incorporating control or nuisance variables (task, domain, or architecture) (Font-Clos et al., 2013, Rosenfeld, 2021, Xie et al., 2021, Lin et al., 27 Jul 2025).
4. Interpretability, Limitations, and Range
SLD is often valid only within a "scaling range"—typically excluding the lowest and highest scales or data points due to outlier effects, finite-size corrections, or breakdown of approximations (e.g., Stirling's approximation in entropy maximization, or finite-data plateaus in spectral theories) (1104.5630, Maloney et al., 2022). For example, the largest or smallest cities may fall outside the valid scaling regime for urban systems; scaling formulas for deep learning error may only extrapolate over regions where data, model size, and compute are not grossly mismatched (Su et al., 11 Mar 2024, Rosenfeld, 2021).
Misinterpretation can occur if a "power law" is fitted to a distribution that is, in fact, a transformed exponential ("spurious" Zipf), a distinction clarified by reducibility tests and analysis of underlying fractality or Euclidean dimensionality (Chen, 2013). The universal scaling function is invariant up to rescaling, but may exhibit double power-law behavior, e.g., in word frequencies (Zipf's exponent at large argument, exponent 1 at small argument) (Font-Clos et al., 2013).
Critically, the constants (exponents, irreducible minima) in scaling formulas depend sensitively on context: data distribution, domain, architecture, training regime, and hyperparameters. Thus, empirical estimation or theoretical derivation under matched assumptions is mandatory for meaningful extrapolation (Su et al., 11 Mar 2024).
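A small synthetic sketch of the scaling-range issue: when data include an irreducible floor, fitting a pure power law in log-log space over the full range biases the exponent, while restricting to the scaling range recovers it (all constants made up):

```python
import numpy as np

x = np.logspace(2, 10, 50)
y = 5.0 * x ** (-0.3) + 0.01        # power law with an irreducible floor

def loglog_slope(x, y):
    return np.polyfit(np.log(x), np.log(y), 1)[0]

print(loglog_slope(x, y))               # ~ -0.24: biased by the large-x plateau
mask = x < 1e6                          # restrict to the scaling range
print(loglog_slope(x[mask], y[mask]))   # ~ -0.29: close to the true -0.3
```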
5. Impact, Applications, and Automation
Scaling law discovery informs:
- Predictive Performance Estimation:
Forecast loss, accuracy, or error from small-scale experiments across model sizes, training steps, or data volumes, with quantifiable uncertainty (Ivgi et al., 2022, Zhang et al., 2023).
- Resource-Optimal Configuration:
Choose model/data/compute parameters for desired trade-offs; for example, maximizing throughput under privacy or compute constraints, or selecting the optimal batch size for fastest convergence (Su et al., 11 Mar 2024, McKenna et al., 31 Jan 2025, Yan et al., 8 May 2025). A toy allocation sketch follows this list.
- Scientific Discovery and Interpretation:
Uncover fundamental system invariants (e.g., extracting Rayleigh, Nusselt, or keyhole numbers in fluid/thermal systems) or diagnose spurious vs. genuine scaling (e.g., fractality in city-size distributions) (Xie et al., 2021, Chen, 2013, 1104.5630).
- Automated SLD Frameworks:
EvoSLD automates the search for universal, parsimonious scaling laws via evolutionary algorithms augmented with LLM-guided program mutation, optimizing for normalized fitting error and explicit parameter complexity (Lin et al., 27 Jul 2025). This approach can identify or surpass prior human-expert scaling laws efficiently across diverse domains.
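As a toy illustration of resource-optimal configuration, the sketch below minimizes the joint power-law loss of Section 1 over model/data splits under a fixed compute budget; the constants, and the assumption that compute scales as $m \cdot n$, are hypothetical:

```python
import numpy as np

def loss(m, n, a=3.0, alpha=0.35, b=4.0, beta=0.28, c_inf=1.7):
    """Joint scaling form: eps(m, n) = a*m**(-alpha) + b*n**(-beta) + c_inf."""
    return a * m ** (-alpha) + b * n ** (-beta) + c_inf

C = 1e20                       # fixed compute budget, modeled here as C = m * n
m = np.logspace(6, 12, 600)    # candidate model sizes
n = C / m                      # data size implied by the budget
m_opt = m[np.argmin(loss(m, n))]
print(f"compute-optimal split: m = {m_opt:.2e}, n = {C / m_opt:.2e}")
```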
Scaling laws also guide data acquisition (whether to increase data/compute/model size), inform the design of recommendation and code models (Ardalani et al., 2022, Lin et al., 20 Feb 2024), and underpin recent advances in efficient decoding protocols in LLM inference (Yan et al., 8 May 2025).
6. Future Research and Open Challenges
- Extension to New Paradigms:
Investigate non-linear, adaptive, mixture-of-experts architectures and the impact of RLHF or other non-standard training regimes on established scaling forms (Maloney et al., 2022, Yan et al., 8 May 2025).
- Integration of Privacy and Data Constraints:
Jointly model compute, privacy loss, and data budgets for regulated or sensitive data scenarios, potentially shifting the optimal configuration towards smaller models but higher data/model ratios (McKenna et al., 31 Jan 2025).
- Theoretical Underpinnings:
Expand understanding of the origins of scaling exponents, their dependency on spectral properties, intrinsic dimensionalities, or representation learning dynamics (Maloney et al., 2022, Ding et al., 13 Feb 2025, Chen et al., 3 Mar 2025).
- Further Automation and Generalization:
Develop frameworks capable of interpreting non-stationary scaling regimes, higher-order effects, and integrating domain knowledge for more robust, automated scaling law extraction (Lin et al., 27 Jul 2025, Xie et al., 2021).
- Diagnostic and Regulatory Use:
Employ deviations from predicted scaling as diagnostic signals for model misconfiguration or architecture limits; determine when empirical or "spurious" power laws signal underlying mechanistic shifts (Chen, 2013, Ivgi et al., 2022).
Expanding the frontiers of scaling law discovery will increasingly blend theoretical, empirical, and automated approaches, potentially transforming both scientific inquiry and engineering practice by rendering the fundamental laws of scaling both discoverable and exploitable across new disciplines and systems.