
Empirical Scaling in Architecture Discovery

Updated 28 July 2025
  • The paper demonstrates that empirical scaling laws predict neural network performance through power-law relationships between training examples, model size, and computational cost.
  • It shows that architecture innovations shift error curves downward without altering the underlying scaling exponents, indicating that intrinsic task complexity governs sample efficiency.
  • The study offers actionable guidance on resource allocation by quantifying the trade-offs of model scaling and dataset expansion, improving model selection and benchmarking.

Empirical scaling laws for architecture discovery refer to rigorously observed quantitative relationships that predictably link model performance to key resource variables—including training set size, model size, and computational scale—across different architectures and domains. These laws provide both operational and theoretical guidance for the efficient identification and development of neural network architectures, enabling practitioners to forecast accuracy improvements, debug anomalous learning curves, and inform resource allocation strategies. The following sections detail the foundational principles, mathematical formulations, domain-specific findings, and practical implications as established in leading research.

1. Foundations and Formulation of Empirical Scaling Laws

Systematic empirical studies, most notably by Hestness et al. (Hestness et al., 2017) and Kaplan et al. (Kaplan et al., 2020), reveal that deep neural network generalization error exhibits a predictable, smooth power-law decay as a function of resource scale. In the canonical case, if $m$ denotes the number of training examples and $s$ denotes model size (e.g., parameter count), the generalization error $\epsilon$ and the required model size scale as:

  • $\epsilon(m) = \alpha m^{\beta_g} + \gamma$
  • $s(m) \propto \alpha m^{\beta_p}$

Here, $\alpha$ and $\gamma$ are problem- and architecture-dependent constants, $\beta_g < 0$ is the empirical power-law exponent for generalization error (“learning curve steepness”), and $\beta_p$ (typically $0.5 \leq \beta_p < 1$) is the exponent governing sublinear model-size growth with dataset size. Empirical values of $\beta_g$ typically fall in $[-0.35, -0.07]$, indicating that error reductions per additional data sample are modest but consistent across domains (Hestness et al., 2017).
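
The following minimal sketch illustrates these two relationships numerically. All constants ($\alpha$, $\gamma$, $\beta_g$, $\beta_p$, and the model-size prefactor) are illustrative assumptions, not values reported in the cited papers.

```python
# Illustrative power-law learning curve and sublinear model-size growth.
# All constants below are assumed for demonstration, not taken from the papers.
import numpy as np

def generalization_error(m, alpha=5.0, beta_g=-0.12, gamma=0.02):
    """epsilon(m) = alpha * m**beta_g + gamma, with beta_g < 0."""
    return alpha * np.power(m, beta_g) + gamma

def required_model_size(m, c=100.0, beta_p=0.7):
    """s(m) ~ c * m**beta_p: parameter count grows sublinearly with data."""
    return c * np.power(m, beta_p)

for m in [1e5, 1e6, 1e7, 1e8]:
    print(f"m={m:.0e}  predicted error={generalization_error(m):.4f}  "
          f"required params~{required_model_size(m):.2e}")
```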

2. Model Improvements, Architecture Effects, and Scaling Law Robustness

A key empirical observation is that innovations in model architecture, optimizer design, or training procedures manifest as downward shifts in the error curve (i.e., lower $\alpha$ or reduced irreducible error $\gamma$), while the scaling exponent $\beta_g$ remains essentially unchanged (Hestness et al., 2017). This invariance was noted across recurrent, convolutional, and attention-based models. The implication is that, within a given task domain, sample efficiency and scaling rate are determined predominantly by the task's intrinsic complexity rather than by model family specifics.

The invariance of $\beta_g$ under architectural modification suggests a separation between incremental improvements (curve shifts) and potential paradigm shifts (altering the exponent). Fundamental advances in representation, expressivity, or data augmentation that can change $\beta_g$ may yield substantially higher returns per resource spent.
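
The distinction can be made concrete with a small numerical comparison. The curves below use assumed constants: an architecture that lowers $\alpha$ shifts the curve down at every scale, while a (hypothetical) change to $\beta_g$ alters the rate of improvement itself.

```python
# Assumed constants: compare a curve shift (lower alpha, same exponent)
# against a hypothetical exponent change (steeper beta_g).
import numpy as np

m = np.logspace(4, 9, 6)              # dataset sizes from 1e4 to 1e9
baseline = 5.0 * m**-0.12 + 0.02      # reference architecture
shifted  = 3.0 * m**-0.12 + 0.02      # better architecture: lower alpha only
steeper  = 5.0 * m**-0.20 + 0.02      # hypothetical paradigm shift: new exponent

for mi, b, sh, st in zip(m, baseline, shifted, steeper):
    print(f"m={mi:.0e}  baseline={b:.4f}  shifted={sh:.4f}  steeper={st:.4f}")
```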

3. Sublinear Model Growth and the Role in Architecture Discovery

Empirical studies demonstrate that the minimal model size required for optimal prediction grows sublinearly with dataset size: $s(m) \sim m^{\beta_p}$, with typical $\beta_p$ in $[0.57, 0.78]$ for architectures such as LSTMs and ResNets (Hestness et al., 2017). This sublinear scaling underpins the computational feasibility of leveraging massive datasets, enabling practitioners to exploit larger datasets for accuracy gains without commensurate expansion in parameter count.

For architecture discovery, this means that systematic increases in dataset size should be paired with carefully balanced, but not necessarily proportional, model enlargement. It is both unnecessary and inefficient to scale parameter counts linearly with data volumes.
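
As a concrete illustration under an assumed $\beta_p = 0.7$ (within the reported range): $s(10m)/s(m) = 10^{0.7} \approx 5$, so a tenfold increase in training data calls for only about a fivefold larger model, and a hundredfold increase for roughly a 25-fold larger model rather than a 100-fold one.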

4. Comparative Scaling Across Domains

Scaling laws empirically hold across diverse machine learning domains, as evidenced in machine translation, language modeling, image recognition, and speech recognition tasks (Hestness et al., 2017, Kaplan et al., 2020). Despite varied input modalities and model architectures, power-law generalization error scaling is consistently observed. The exponents ($\beta_g$) and sublinearity parameters ($\beta_p$) vary by task, reflecting underlying difficulty but not the specifics of the architecture (beyond vertical shifts from better models). For example, neural machine translation models report steep power-law exponents (up to –0.36), language models exhibit exponents around –0.128 for “best-fit” models, and vision and speech architectures fall within similar exponent bands.
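
Because the exponent sets the rate of improvement, it also determines how much additional data a given error reduction costs. The snippet below applies this generic property of the power-law form to the exponents quoted above (ignoring the irreducible term $\gamma$); it is an illustration, not a result from the cited papers.

```python
# How much more data is needed to halve the reducible (power-law) part of the
# error, given a task-specific exponent beta_g? Solve (f*m)**beta_g = 0.5 * m**beta_g.
for task, beta_g in [("neural machine translation", -0.36),
                     ("language modeling",          -0.128)]:
    factor = 2.0 ** (1.0 / abs(beta_g))
    print(f"{task}: ~{factor:.0f}x more data to halve the reducible error")
```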

5. Implications for Architecture Discovery and Model Selection

The main operational use of empirical scaling laws in architecture discovery is threefold:

  • Predictive validation and debugging: If a candidate architecture does not realize the predicted power-law scaling (as measured by a log–log linear fit of error vs. data/model size), it signals potential issues in data quality or underutilized model capacity (a minimal fitting sketch follows this list).
  • Setting accurate performance targets: Scaling laws provide principled estimates for what accuracy is achievable at a given data or model size—setting realistic expectations, guiding hyperparameter search, and avoiding overfitting.
  • Guiding resource allocation: When planning experiments, scaling laws quantify expected returns from increasing data, compute, or model size, enabling researchers to prioritize investments that yield the greatest performance dividend. For instance, as shown in (Hestness et al., 2017), model parameters should be scaled sublinearly with dataset growth; blindly increasing parameter count is inefficient beyond the scaling law’s prescription.
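
A minimal sketch of the log–log validation check, using placeholder measurements in place of a real learning-curve sweep and ignoring the irreducible term $\gamma$ for simplicity:

```python
# Fit a line to log(error) vs log(dataset size) and inspect fit quality.
# A poor linear fit in log-log space can flag data-quality or capacity issues.
import numpy as np

m = np.array([1e5, 3e5, 1e6, 3e6, 1e7])          # training-set sizes (placeholders)
err = np.array([0.42, 0.37, 0.31, 0.27, 0.23])   # measured validation errors (placeholders)

log_m, log_err = np.log(m), np.log(err)
beta_g, log_alpha = np.polyfit(log_m, log_err, 1)   # slope estimates the exponent

pred = beta_g * log_m + log_alpha
r2 = 1.0 - np.sum((log_err - pred) ** 2) / np.sum((log_err - log_err.mean()) ** 2)

print(f"estimated beta_g={beta_g:.3f}  alpha={np.exp(log_alpha):.3f}  R^2={r2:.4f}")
if r2 < 0.98:   # arbitrary threshold, for illustration only
    print("warning: learning curve deviates from a clean power law")
```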

Empirical scaling laws thus replace heuristic or intuition-driven decisions with quantitative procedures in architecture discovery workflows.

6. Methodological Advances for Reliable Scaling Law Estimation

To robustly fit scaling-law exponents from empirical learning curves, advanced estimation methodologies have been developed. These include non-linear function classes that interpolate between sigmoidal and asymptotic power-law behavior, together with block coordinate descent algorithms for parameter fitting (Alabdulmohsin et al., 2022). Such methods allow accurate forecasting of performance at scales well beyond those directly observed, supporting efficient neural architecture search and large-scale model planning. Additionally, the practice of partitioning datasets into multiple shards and fitting models just large enough to overfit each shard (“hyperparameter-reduction”) enhances accuracy and fairness in cross-architecture comparisons (Hestness et al., 2017).
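
As a simple stand-in for this workflow, the sketch below fits the three-parameter power law from Section 1 to small-scale measurements with plain non-linear least squares and extrapolates to a larger regime. It does not implement the specific function class or block coordinate descent procedure of Alabdulmohsin et al. (2022), and the measurements are placeholders.

```python
# Fit epsilon(m) = alpha * m**beta_g + gamma to small-scale runs, then forecast
# error at a much larger data scale. Ordinary non-linear least squares only.
import numpy as np
from scipy.optimize import curve_fit

def power_law(m, alpha, beta_g, gamma):
    return alpha * np.power(m, beta_g) + gamma

m_obs = np.array([1e5, 3e5, 1e6, 3e6, 1e7])         # observed scales (placeholders)
err_obs = np.array([0.42, 0.37, 0.31, 0.27, 0.23])  # observed errors (placeholders)

popt, _ = curve_fit(power_law, m_obs, err_obs,
                    p0=[5.0, -0.2, 0.01],
                    bounds=([0.0, -1.0, 0.0], [np.inf, 0.0, 1.0]))
alpha, beta_g, gamma = popt
print(f"alpha={alpha:.3f}  beta_g={beta_g:.3f}  gamma={gamma:.4f}")
print(f"forecast error at m=1e9: {power_law(1e9, *popt):.4f}")
```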

7. Broader Impacts on Systems, Operations, and Future Theoretical Development

Scaling laws extend beyond immediate model and architecture development. For system engineers, knowing the functional form and exponents enables anticipation of compute, memory, and storage demands under projected scaling regimes. Furthermore, the inability of architecture tweaks to alter $\beta_g$ underscores the need for foundational research aiming to break current empirical scaling limitations. The absence of a rigorous theoretical account for the observed exponents remains an open question and a motivation for future inquiry.
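
A back-of-the-envelope sketch of this systems-planning use: invert the fitted power law to estimate the data scale needed for a target error, then apply the sublinear model-growth relation to estimate parameter count and weight memory. Every constant here (fitted values, model-size prefactor, bytes per parameter) is an assumption for illustration.

```python
# Invert epsilon(m) = alpha*m**beta_g + gamma for a target error, then apply
# s(m) ~ c*m**beta_p to estimate parameter count and weight memory.
alpha, beta_g, gamma = 1.9, -0.13, 0.02   # assumed learning-curve fit
c, beta_p = 100.0, 0.7                    # assumed model-size scaling
bytes_per_param = 2                       # e.g., fp16 weights

target_error = 0.15
m_needed = ((target_error - gamma) / alpha) ** (1.0 / beta_g)
params_needed = c * m_needed ** beta_p

print(f"data needed   ~{m_needed:.2e} examples")
print(f"model size    ~{params_needed:.2e} parameters")
print(f"weight memory ~{params_needed * bytes_per_param / 1e9:.2f} GB")
```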

These laws also influence deployment, hardware co-design, and strategic data acquisition, aligning system-level investment with domains of fastest accuracy improvement as prescribed by generalization scaling.


In summary, empirical scaling laws constitute a powerful, empirically derived framework for architecture discovery. Through robust predictive relationships and cross-domain invariance, these laws inform the growth, optimization, and comparative evaluation of architectures, making clear that both the pace and the limit of accuracy improvement are fundamentally constrained by the underlying task complexity rather than by model-family minutiae (Hestness et al., 2017, Kaplan et al., 2020).
