
Derivation of Scaling Laws in Neural Networks

Updated 20 March 2026
  • Derivation of scaling laws is the process of formulating power-law relationships that predict neural network performance as model size and dataset scale increase.
  • The methodology leverages mathematical optimization under compute constraints and reconciles differences between Kaplan and Chinchilla empirical approaches.
  • Practical insights from this topic guide effective resource allocation and robust model design by correcting biases in parameter counting and small-scale effects.

A scaling law describes how a system’s characteristic quantities change systematically with system size, parameters, or resources—often manifesting as power-law relationships. The derivation of scaling laws, both in statistical physics and in modern machine learning, provides a rigorous framework for predicting system behavior as one increases dataset size, model complexity, computational budget, or other relevant variables. In contemporary AI research, systematic derivations of scaling laws are foundational for model design and for setting optimal resource allocation. This article details the mathematical derivations, conceptual sources of bias, and best-practice recommendations for scaling laws in the context of neural networks, highlighting the reconciliation between the Kaplan [2020] and Chinchilla [2022] scaling laws (Pearce et al., 2024).

1. Mathematical Foundations of Neural Scaling Law Derivation

Neural scaling laws typically posit that the achievable loss for large models trained on large datasets decays as a sum of separate power laws in the number of parameters ($N$) and the number of training tokens ($D$), plus an irreducible error floor. The generic two-term ansatz is

$$\mathrm{Loss}(N,D) = A\,N^{-\alpha} + B\,D^{-\beta} + E$$

where $A, B, E$ are empirical constants and $\alpha, \beta$ are scaling exponents. This ansatz is motivated by the observation that as either $N$ or $D$ becomes large, performance becomes limited by the other, smaller resource: a trade-off between model capacity and data coverage.

For compute-limited training, where the total compute cost of a transformer is approximated as $C = 6ND$, the compute-optimal allocation of $N$ and $D$ can be derived by minimizing the loss under this constraint. Substituting $D = C/(6N)$ and optimizing over $N$ yields

$$N^* \propto C^{\frac{\beta}{\alpha + \beta}}, \qquad D^* \propto C^{\frac{\alpha}{\alpha + \beta}}$$

which describes the parametric trade-off between parameters and tokens as a function of compute (Pearce et al., 2024).
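For readers who want the intermediate steps, the optimization can be written out as follows. This is a sketch of the standard calculation under the ansatz and constraint stated above; it introduces no symbols beyond those already defined.

```latex
\begin{aligned}
L(N) &= A\,N^{-\alpha} + B\left(\frac{C}{6N}\right)^{-\beta} + E
      = A\,N^{-\alpha} + B\left(\frac{6}{C}\right)^{\beta} N^{\beta} + E,\\[4pt]
\frac{dL}{dN} &= -\alpha A\,N^{-\alpha-1}
      + \beta B\left(\frac{6}{C}\right)^{\beta} N^{\beta-1} = 0
\;\Longrightarrow\;
N^{\alpha+\beta} = \frac{\alpha A}{\beta B}\left(\frac{C}{6}\right)^{\beta},\\[4pt]
N^* &= \left(\frac{\alpha A}{\beta B}\right)^{\frac{1}{\alpha+\beta}}
       \left(\frac{C}{6}\right)^{\frac{\beta}{\alpha+\beta}},
\qquad
D^* = \frac{C}{6N^*} \propto C^{\,1-\frac{\beta}{\alpha+\beta}}
    = C^{\frac{\alpha}{\alpha+\beta}}.
\end{aligned}
```

The irreducible term $E$ drops out of the derivative, which is why it affects the achievable loss but not the allocation exponents.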

2. The Kaplan and Chinchilla Scaling Laws: Derivation and Discrepancy

Kaplan et al. [2020] empirically fit the compute-optimal frontier using non-embedding parameters and non-embedding compute, reporting

$$N^*_{\setminus E} \propto C_{\setminus E}^{\,0.73}, \qquad D^*_{\setminus E} \propto C_{\setminus E}^{\,0.27}$$

This finding is implicitly based on a two-term power-law ansatz, but it was fitted over a small-scale regime and with a parameter count that excludes embedding parameters. The derivation proceeds by minimizing

$$A\,N_{\setminus E}^{-\alpha_K} + B\,D^{-\beta_K}, \quad \text{subject to } C_{\setminus E} = 6N_{\setminus E}D$$

which, after Lagrange multiplier optimization, yields

$$N^*_{\setminus E} \propto C_{\setminus E}^{\frac{\beta_K}{\alpha_K + \beta_K}}$$

with their empirically fitted value yielding an exponent approximately 0.73 (Pearce et al., 2024).
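The closed-form exponent can be checked numerically. The sketch below brute-forces the constrained minimum on a grid of model sizes for several compute budgets and fits the log-log slope. The prefactors and the exponent pair are hypothetical values chosen only so that the ratio equals 0.73; they are not Kaplan's fitted coefficients.

```python
import numpy as np

def grid_optimal_n(compute, a, b, alpha, beta, log_n_grid):
    """Brute-force argmin over N of a*N^-alpha + b*D^-beta with D = C/(6N)."""
    n = np.exp(log_n_grid)
    d = compute / (6.0 * n)
    loss = a * n ** (-alpha) + b * d ** (-beta)
    return n[np.argmin(loss)]

# Hypothetical values: chosen so that beta/(alpha+beta) = 0.73, nothing more.
alpha_k, beta_k = 0.27, 0.73
a, b = 1.0, 1.0

log_n_grid = np.linspace(np.log(1e0), np.log(1e16), 200_001)
computes = np.logspace(12, 20, 9)

n_star = np.array(
    [grid_optimal_n(c, a, b, alpha_k, beta_k, log_n_grid) for c in computes]
)

# Slope of log N* against log C is the empirical allocation exponent.
slope, _ = np.polyfit(np.log(computes), np.log(n_star), 1)
predicted = beta_k / (alpha_k + beta_k)
print(f"fitted exponent {slope:.3f}, closed form {predicted:.3f}")
```

Because the ansatz is an exact power law here, the fitted slope agrees with the closed form up to grid resolution; a fit to real training runs would also carry noise and the small-scale curvature discussed in Section 3.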

Chinchilla [2022] extended this analysis, including embedding parameters in the count ($N_T = N_{\setminus E} + N_E$) and fitting across a broader range of scales. Their closed-form loss is

$$\mathrm{Loss}(N_T, D) = E + \frac{N_c}{N_T^{\alpha}} + \frac{D_c}{D^{\beta}}$$

from which, under the same compute constraint, the optimal allocation is

$$N_T^* \propto C_T^{\frac{\beta}{\alpha + \beta}}, \qquad D_T^* \propto C_T^{\frac{\alpha}{\alpha + \beta}}$$

with fitted exponents $\alpha \approx 0.34$, $\beta \approx 0.28$. These rounded values give $\beta/(\alpha+\beta) \approx 0.45$, close to the $N_T^* \propto C_T^{0.50}$ scaling attributed to Chinchilla and substantially different from the Kaplan scaling (Pearce et al., 2024).
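As a quick arithmetic check, the allocation exponents follow directly from the two loss exponents. The helper below is a minimal sketch; the function name is illustrative and the inputs are the rounded values quoted in this section.

```python
def compute_optimal_exponents(alpha: float, beta: float) -> tuple[float, float]:
    """Return (a, b) such that N* ~ C^a and D* ~ C^b for the two-term ansatz."""
    total = alpha + beta
    return beta / total, alpha / total

a, b = compute_optimal_exponents(alpha=0.34, beta=0.28)
print(f"N* ~ C^{a:.3f}, D* ~ C^{b:.3f}")  # N* ~ C^0.452, D* ~ C^0.548
```

The two exponents always sum to one, which is just the constraint $C = 6ND$ expressed in log space.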

3. Sources of Bias and Small-Scale Effects

There are two principal sources of bias responsible for the discrepancy in scaling exponents between Kaplan and Chinchilla:

(a) Non-embedding parameter counts: At moderate scales, embedding parameters (token and position embeddings) are a non-negligible fraction of total model size; excluding them causes the power-law fit of $N$ against $C$ to overestimate the true scaling exponent. Adjusting Kaplan's analysis to total parameter count ($N_T$) collapses the apparent higher exponent to Chinchilla’s asymptotic value (Pearce et al., 2024).

(b) Small-scale regime bias: The power-law relationship is only asymptotically accurate at large scales. In the small–medium scale regime studied by Kaplan, curvature in the $N^*(C)$ function leads to a locally higher exponent: analytic expansion yields a small-scale exponent

$$g_\mathrm{small} = \frac{\beta}{\alpha/3 + \beta}$$

which Pearce et al. (2024) report to lie in the 0.74–0.78 range for the Chinchilla-fitted coefficients, close to Kaplan’s 0.73 (the rounded values $\alpha \approx 0.34$, $\beta \approx 0.28$ quoted above give about 0.71). This demonstrates that both the parameter-counting convention and the finite range of scales must be accounted for to quantify scaling exponents accurately (Pearce et al., 2024).
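The two limiting exponents can be illustrated with a toy calculation. The sketch below assumes a hypothetical relation between total and non-embedding parameters, $N_T = N_{\setminus E} + \gamma N_{\setminus E}^{1/3}$; this functional form and the value of `gamma` are assumptions of the sketch, not something stated in the text above. Under that assumption, the stationarity condition of the loss gives $\log C$ as an explicit function of $\log N_{\setminus E}$, so the local exponent is the reciprocal of a numerical derivative.

```python
import numpy as np

def local_exponent(n_ne, alpha, beta, gamma):
    """Local slope d log N_ne* / d log C along the stationary curve.

    Assumes (hypothetically) N_T = n_ne + gamma * n_ne**(1/3).
    """
    embed = gamma * n_ne ** (1.0 / 3.0)
    n_total = n_ne + embed
    # s = d log N_T / d log n_ne: tends to 1/3 for small models, 1 for large.
    s = (n_ne + embed / 3.0) / n_total
    # Setting the derivative of N_c*N_T^-alpha + D_c*(C/(6*n_ne))^-beta with
    # respect to log n_ne to zero gives, up to an additive constant,
    #   log C = log n_ne + (alpha*log N_T - log s) / beta
    log_c = np.log(n_ne) + (alpha * np.log(n_total) - np.log(s)) / beta
    return 1.0 / np.gradient(log_c, np.log(n_ne))

alpha, beta = 0.34, 0.28          # rounded exponents quoted in this article
n_ne = np.logspace(0, 16, 2000)   # non-embedding parameter counts
slopes = local_exponent(n_ne, alpha, beta, gamma=1e6)  # gamma is hypothetical

print(f"small-scale limit: {slopes[0]:.3f} vs {beta / (alpha / 3 + beta):.3f}")
print(f"large-scale limit: {slopes[-1]:.3f} vs {beta / (alpha + beta):.3f}")
```

Only the two ends of the curve are meaningful here: they recover $\beta/(\alpha/3+\beta)$ and $\beta/(\alpha+\beta)$ respectively. The behavior in between depends on the assumed form of $N_T$ and should not be read as a prediction.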

4. Universal and Reconciled Scaling Laws

With proper parameter accounting and a wide-scale regime, both empirical and analytic results converge to the Chinchilla scaling:

$$N_T^*(C) = \left(\frac{\alpha}{\beta} \frac{N_c}{D_c}\right)^{1/(\alpha+\beta)} \left(\frac{C}{6}\right)^{\beta/(\alpha+\beta)}, \qquad D_T^*(C) = \frac{C}{6N_T^*(C)}$$

and the minimum achievable loss along this compute-efficient frontier is

$$L(C) = E + \frac{N_c}{(N_T^*(C))^{\alpha}} + \frac{D_c}{(D_T^*(C))^{\beta}}$$

with exponent $\beta/(\alpha+\beta) \approx 0.50$, supporting compute-optimal balancing between model size and dataset size (Pearce et al., 2024).
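The frontier formulas translate directly into code. The following sketch evaluates $N_T^*$, $D_T^*$ and the frontier loss for one compute budget, using the additive three-term loss from Section 1. The coefficient values are hypothetical placeholders for illustration, not fitted constants from any paper.

```python
def loss_at(n, compute, n_c, d_c, alpha, beta, e):
    """Loss at model size n when all of `compute` is spent, with D = C/(6N)."""
    d = compute / (6.0 * n)
    return e + n_c / n ** alpha + d_c / d ** beta

def chinchilla_frontier(compute, n_c, d_c, alpha, beta, e):
    """Closed-form compute-optimal (N_T*, D_T*, loss) under C = 6*N*D."""
    s = alpha + beta
    n_star = (alpha * n_c / (beta * d_c)) ** (1.0 / s) * (compute / 6.0) ** (beta / s)
    d_star = compute / (6.0 * n_star)
    return n_star, d_star, loss_at(n_star, compute, n_c, d_c, alpha, beta, e)

# Hypothetical coefficients, for illustration only.
coeffs = dict(n_c=400.0, d_c=400.0, alpha=0.34, beta=0.28, e=1.7)
compute = 1e21

n_star, d_star, best = chinchilla_frontier(compute, **coeffs)
print(f"N* = {n_star:.3e}, D* = {d_star:.3e}, loss = {best:.4f}")

# Moving away from N* at fixed compute should only increase the loss.
for factor in (0.5, 0.9, 1.1, 2.0):
    print(factor, loss_at(factor * n_star, compute, **coeffs) - best)
```

Comparing the loss at perturbed model sizes is a cheap sanity check that the closed form is a minimum rather than another stationary point.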

5. Practical Methodology and Best-Practice Recommendations

Empirical fitting of scaling laws requires:

  • Counting total parameters $N_T$ and total compute $C_T$.
  • Fitting over the widest possible scale regime to reach the large-$N$ limit and avoid small-scale, non-universal deviations.
  • If non-embedding parameter counts must be reported for legacy comparison, they should be corrected to total parameter-equivalent exponents to avoid extrapolation artifacts.
  • Providing the full functional form of the scaling law (including all fitted exponents and prefactors), allowing for assessment of both local (small-$N$) and asymptotic (large-$N$) behavior.

These practices ensure that reported exponents and resource allocations are accurate when extrapolating to future large-scale models (Pearce et al., 2024).
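To see why the counting convention matters, the sketch below compares embedding and non-embedding parameter counts for two model shapes. It assumes the common decoder-only approximation of $12\,n_\text{layer}\,d_\text{model}^2$ non-embedding parameters (feed-forward width of $4\,d_\text{model}$) and $(n_\text{vocab} + n_\text{ctx})\,d_\text{model}$ embedding parameters. Both the approximation and the two configurations are assumptions of this example rather than details taken from the text above.

```python
def parameter_counts(n_layer, d_model, n_vocab, n_ctx):
    """Approximate (non_embedding, embedding, total) parameter counts."""
    non_embedding = 12 * n_layer * d_model ** 2
    embedding = (n_vocab + n_ctx) * d_model  # token + position embeddings
    return non_embedding, embedding, non_embedding + embedding

# Hypothetical model shapes sharing one vocabulary and context length.
configs = {
    "small": dict(n_layer=4, d_model=256, n_vocab=50257, n_ctx=1024),
    "large": dict(n_layer=48, d_model=8192, n_vocab=50257, n_ctx=1024),
}

embedding_fraction = {}
for name, cfg in configs.items():
    non_emb, emb, total = parameter_counts(**cfg)
    embedding_fraction[name] = emb / total
    print(f"{name}: N_nonemb={non_emb:,} N_emb={emb:,} "
          f"embedding share={embedding_fraction[name]:.1%}")
```

In this toy comparison the embedding share dominates the small model and is marginal for the large one, which is the mechanism by which excluding embeddings distorts fits made at small scale.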

6. Implications, Broader Context, and Impact

This reconciliation of Kaplan and Chinchilla scaling laws aligns the empirical design principles of deep LLMs with robust theoretical foundations. It implies that previously reported exponents larger than $0.5$ resulted from small-scale bias and improper parameter accounting. Correct derivations directly inform resource allocation for next-generation model training, supporting joint scaling of dataset size and model capacity for compute-optimal performance. This understanding also guides comparative and theoretical scaling-law work across architectures, datasets, and tasks.

The findings of Pearce et al. (2024) establish a clear methodological standard: correct parameter counting and broad-scale fitting are essential for accurate and universal characterization of neural scaling behavior. The Chinchilla law provides the large-scale regime exponents central to optimal resource scheduling in language modeling and generative modeling work.
