Derivation of Scaling Laws in Neural Networks
- Derivation of scaling laws is the process of formulating power-law relationships that predict neural network performance as model size and dataset scale increase.
- The methodology leverages mathematical optimization under compute constraints and reconciles differences between Kaplan and Chinchilla empirical approaches.
- Practical insights from this topic guide effective resource allocation and robust model design by correcting biases in parameter counting and small-scale effects.
A scaling law describes how a system’s characteristic quantities change systematically with system size, parameters, or resources—often manifesting as power-law relationships. The derivation of scaling laws, both in statistical physics and in modern machine learning, provides a rigorous framework for predicting system behavior as one increases dataset size, model complexity, computational budget, or other relevant variables. In contemporary AI research, systematic derivations of scaling laws are foundational for model design and for setting optimal resource allocation. This article details the mathematical derivations, conceptual sources of bias, and best-practice recommendations for scaling laws in the context of neural networks, highlighting the reconciliation between the Kaplan [2020] and Chinchilla [2022] scaling laws (Pearce et al., 2024).
1. Mathematical Foundations of Neural Scaling Law Derivation
Neural scaling laws typically posit that the achievable loss for large models trained on large datasets decays as a sum of separate power laws in the number of parameters ($N$) and the number of training tokens ($D$), plus an irreducible error floor. The generic two-term ansatz is
$$L(N, D) = \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}} + E,$$
where $A$, $B$, and $E$ are empirical constants and $\alpha$, $\beta$ are scaling exponents. This ansatz is motivated by observing that as either $N$ or $D$ becomes large, performance becomes limited by the smaller resource, reflecting a trade-off between model capacity and data coverage.
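For compute-limited training, where the total compute cost is $C \approx 6ND$ (for transformers), the compute-optimal allocation of parameters $N$ and tokens $D$ can be derived by minimizing the loss under this constraint. Substituting $D = C/(6N)$ and optimizing over $N$ yields
$$N^{*}(C) \propto C^{\frac{\beta}{\alpha+\beta}}, \qquad D^{*}(C) \propto C^{\frac{\alpha}{\alpha+\beta}},$$
where $\alpha$ and $\beta$ are the parameter and data exponents of the ansatz. This describes the parametric trade-off between parameters and tokens as a function of compute (Pearce et al., 2024).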
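The closed-form allocation can be checked numerically. The following is a minimal sketch, assuming the two-term form $A/N^{\alpha} + B/D^{\beta} + E$ and the constraint $C = 6ND$; the function names and all constants are hypothetical and chosen only for illustration, not fitted values from any paper.

```python
import math

def loss(n, d, a, b, alpha, beta, e):
    """Two-term power-law ansatz: A / N^alpha + B / D^beta + E."""
    return a / n**alpha + b / d**beta + e

def compute_optimal(c, a, b, alpha, beta):
    """Closed-form minimiser of the ansatz subject to C = 6 * N * D."""
    g = (alpha * a / (beta * b)) ** (1.0 / (alpha + beta))
    n_opt = g * (c / 6.0) ** (beta / (alpha + beta))
    d_opt = c / (6.0 * n_opt)
    return n_opt, d_opt

# Hypothetical constants, for illustration only (not fitted values).
A, B, ALPHA, BETA, E = 400.0, 1000.0, 0.35, 0.30, 1.7
C = 1e21

n_star, d_star = compute_optimal(C, A, B, ALPHA, BETA)

# Brute-force check: scan N on a log grid, with D fixed by the constraint.
grid = [n_star * 10 ** (k / 100.0) for k in range(-200, 201)]
n_best = min(grid, key=lambda n: loss(n, C / (6.0 * n), A, B, ALPHA, BETA, E))

print(f"closed form N* = {n_star:.3e}, D* = {d_star:.3e}")
print(f"grid search N* = {n_best:.3e}")
```

The grid search should select the same point as the closed form, because the constrained loss is convex in $\log N$ and so has a single minimum.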
2. The Kaplan and Chinchilla Scaling Laws: Derivation and Discrepancy
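Kaplan et al. [2020] empirically fit the compute-optimal frontier using non-embedding parameters $N_{\setminus E}$ and non-embedding compute $C_{\setminus E}$, reporting
$$N^{*}_{\setminus E} \propto C_{\setminus E}^{0.73}.$$
This finding is implicitly based on a two-term power-law ansatz, but it was fitted over a small-scale regime and with a parameter count that omits embeddings. The derivation proceeds by minimizing
$$L(N_{\setminus E}, D) = \frac{A}{N_{\setminus E}^{\alpha}} + \frac{B}{D^{\beta}} + E \quad \text{subject to} \quad C_{\setminus E} = 6 N_{\setminus E} D,$$
which after Lagrange multiplier optimization yields
$$N^{*}_{\setminus E} \propto C_{\setminus E}^{\frac{\beta}{\alpha+\beta}},$$
with their empirically fitted value yielding an exponent of approximately 0.73 (Pearce et al., 2024).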
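The Lagrange-multiplier step can be written out explicitly. The following is a sketch assuming the generic two-term form with constants $A$, $B$, $E$ and exponents $\alpha$, $\beta$, where $N$, $D$, and $C$ stand for whichever counting convention is in use (non-embedding quantities in Kaplan's case); the algebra is the same either way.

```latex
\begin{aligned}
\mathcal{L}(N, D, \lambda) &= \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}} + E + \lambda\,(6ND - C), \\
\frac{\partial \mathcal{L}}{\partial N} = 0 &\;\Rightarrow\; \alpha A N^{-\alpha-1} = 6\lambda D
  \;\Rightarrow\; \alpha A N^{-\alpha} = 6\lambda N D, \\
\frac{\partial \mathcal{L}}{\partial D} = 0 &\;\Rightarrow\; \beta B D^{-\beta-1} = 6\lambda N
  \;\Rightarrow\; \beta B D^{-\beta} = 6\lambda N D, \\
\alpha A N^{-\alpha} &= \beta B D^{-\beta}, \qquad D = \frac{C}{6N}, \\
N^{*} &= \left(\frac{\alpha A}{\beta B}\right)^{\frac{1}{\alpha+\beta}}
         \left(\frac{C}{6}\right)^{\frac{\beta}{\alpha+\beta}} .
\end{aligned}
```

The prefactor depends on the fitted constants, but the exponent of $C$ depends only on the ratio $\beta/(\alpha+\beta)$.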
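Chinchilla [2022] extended this analysis, including embedding parameters in the total count ($N_T$) and fitting across a broader range of scales. Their closed-form loss is
$$L(N_T, D) = \frac{A}{N_T^{\alpha}} + \frac{B}{D^{\beta}} + E,$$
from which, under the same compute constraint $C_T = 6 N_T D$, the optimal allocation is
$$N_T^{*} \propto C_T^{\frac{\beta}{\alpha+\beta}}, \qquad D^{*} \propto C_T^{\frac{\alpha}{\alpha+\beta}},$$
with observed exponents $\alpha \approx \beta$, leading to $N_T^{*} \propto C_T^{0.5}$, which differs substantially from the Kaplan scaling (Pearce et al., 2024).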
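To see how the loss exponents map onto the allocation exponents, the short sketch below evaluates $\beta/(\alpha+\beta)$ and $\alpha/(\alpha+\beta)$ for a few exponent pairs. The pairs are hypothetical, picked only to show the effect of symmetric versus asymmetric exponents, and the function name is an assumption of this example.

```python
def allocation_exponents(alpha, beta):
    """Return (a, b, r) with N* ~ C^a, D* ~ C^b and D*/N* ~ C^r under C = 6 N D."""
    a = beta / (alpha + beta)
    b = alpha / (alpha + beta)
    return a, b, b - a

# Hypothetical exponent pairs, for illustration only.
for alpha, beta in [(0.35, 0.35), (0.34, 0.28), (0.30, 0.45)]:
    a, b, r = allocation_exponents(alpha, beta)
    print(f"alpha={alpha:.2f} beta={beta:.2f} -> "
          f"N* ~ C^{a:.3f}, D* ~ C^{b:.3f}, D*/N* ~ C^{r:+.3f}")
```

When the two exponents are equal, both allocations grow as $C^{0.5}$ and the tokens-per-parameter ratio stays constant along the frontier.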
3. Sources of Bias and Small-Scale Effects
There are two principal sources of bias responsible for the discrepancy in scaling exponents between Kaplan and Chinchilla:
(a) Non-embedding parameter counts: At moderate scales, embedding parameters (token and position embeddings) are a non-negligible fraction of total model size; excluding them causes the $N_{\setminus E}$–$C_{\setminus E}$ power-law fit to overestimate the true scaling exponent. Adjusting Kaplan's analysis to the total parameter count ($N_T$) collapses the apparent higher exponent to Chinchilla’s asymptotic value (Pearce et al., 2024).
(b) Small-scale regime bias: The power-law relationship is only asymptotically accurate at large scales. In the small-to-medium scale regime studied by Kaplan, curvature in the compute-optimal relation between non-embedding parameters and non-embedding compute leads to a locally higher exponent. An analytic expansion yields a small-scale exponent which, with the Chinchilla-fitted values, falls in the 0.74–0.78 range, closely matching Kaplan’s reported value of about 0.73. This demonstrates that both the parameter-counting convention and the finite range of scales must be accounted for when quantifying scaling exponents (Pearce et al., 2024).
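The small-scale effect can be illustrated with a toy calculation. The sketch below assumes, purely for illustration, that total and non-embedding parameters are related by $N_T = N_{\setminus E} + \omega N_{\setminus E}^{1/3}$ (embedding size growing with model width) and that the total-count frontier follows $N_T^{*} \propto C_T^{g}$. The functional form, the value of `omega`, and the function name are assumptions of this example rather than fitted quantities.

```python
def local_exponent(n_non_emb, omega, g=0.5):
    """Local slope d log N_nonemb / d log C_nonemb along the frontier.

    Assumes N_total = N_nonemb + omega * N_nonemb**(1/3) and a total-count
    frontier N_total* ~ C_total**g, with C_nonemb = C_total * N_nonemb / N_total.
    """
    emb = omega * n_non_emb ** (1.0 / 3.0)
    # s = d log N_total / d log N_nonemb; ranges from 1/3 (small) to 1 (large).
    s = (n_non_emb + emb / 3.0) / (n_non_emb + emb)
    return 1.0 / (1.0 + (1.0 / g - 1.0) * s)

OMEGA = 5e4  # hypothetical embedding coefficient, for illustration only

for n in [1e3, 1e6, 1e9, 1e12]:
    print(f"N_nonemb = {n:.0e}: local exponent = {local_exponent(n, OMEGA):.3f}")
```

Under these assumptions the local exponent approaches 0.75 when embeddings dominate and decays toward the asymptotic value $g$ as the model grows, which is the qualitative behavior described above.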
4. Universal and Reconciled Scaling Laws
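With proper parameter accounting and a wide range of scales, both empirical and analytic results converge to the Chinchilla scaling,
$$N_T^{*} \propto C_T^{0.5}, \qquad D^{*} \propto C_T^{0.5},$$
and the minimum achievable loss along this compute-efficient frontier is
$$L^{*}(C_T) = E + K\, C_T^{-\frac{\alpha\beta}{\alpha+\beta}},$$
with a constant prefactor $K$ and exponents $\alpha \approx \beta$, supporting compute-optimal balancing between model size and dataset size (Pearce et al., 2024).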
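The frontier loss can be evaluated numerically to confirm that its reducible part is itself a power law in compute. This is a minimal sketch assuming the two-term form $A/N^{\alpha} + B/D^{\beta} + E$ with $C = 6ND$; the constants and the function name are hypothetical and not taken from any fit.

```python
import math

def reducible_frontier_loss(c, a, b, alpha, beta):
    """L*(C) - E: the ansatz minimised over N with D = C / (6 N)."""
    g = (alpha * a / (beta * b)) ** (1.0 / (alpha + beta))
    n_opt = g * (c / 6.0) ** (beta / (alpha + beta))
    d_opt = c / (6.0 * n_opt)
    return a / n_opt**alpha + b / d_opt**beta

# Hypothetical constants, for illustration only.
A, B, ALPHA, BETA = 400.0, 1000.0, 0.35, 0.30

c1, c2 = 1e20, 1e24
slope = (math.log(reducible_frontier_loss(c2, A, B, ALPHA, BETA))
         - math.log(reducible_frontier_loss(c1, A, B, ALPHA, BETA))) / math.log(c2 / c1)
predicted = -ALPHA * BETA / (ALPHA + BETA)

print(f"measured log-log slope: {slope:.6f}")
print(f"-alpha*beta/(alpha+beta): {predicted:.6f}")
```

Both terms of the ansatz scale with the same power of $C$ at the optimum, so the measured slope should agree with $-\alpha\beta/(\alpha+\beta)$ up to floating-point error.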
5. Practical Methodology and Best-Practice Recommendations
Empirical fitting of scaling laws requires:
- Counting total parameters $N_T$ and total compute $C_T$.
- Fitting over the widest possible scale regime to ensure the large-$N$ limit and avoid small-scale, non-universal deviations.
- If non-embedding parameter counts must be reported for legacy comparison, they should be corrected to total parameter-equivalent exponents to avoid extrapolation artifacts.
- Providing the full functional form of the scaling law (including all fitted exponents and prefactors), allowing for assessment of both local (small-$N$) and asymptotic (large-$N$) behavior.
These practices ensure that reported exponents and resource allocations are accurate when extrapolating to future large-scale models (Pearce et al., 2024).
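The effect of the counting convention on a fitted exponent can be sketched with synthetic data. The example below builds a frontier that follows $N_T^{*} = K\,C_T^{0.5}$ exactly, converts it to non-embedding quantities using an assumed relation $N_T = N_{\setminus E} + \omega N_{\setminus E}^{1/3}$, and fits a log-log slope under each convention. The prefactor `K`, the coefficient `OMEGA`, the embedding relation, and the model-size range are all hypothetical choices for illustration.

```python
import math

def loglog_slope(xs, ys):
    """Ordinary least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    sxx = sum((a - mx) ** 2 for a in lx)
    return sxy / sxx

# Hypothetical setup: frontier N_total* = K * C_total**0.5 and an assumed
# embedding relation N_total = N_nonemb + OMEGA * N_nonemb**(1/3).
K, OMEGA = 0.1, 5e4
n_nonemb = [10 ** (5 + 0.25 * i) for i in range(17)]      # 1e5 .. 1e9 parameters
n_total = [n + OMEGA * n ** (1.0 / 3.0) for n in n_nonemb]
c_total = [(nt / K) ** 2 for nt in n_total]                # invert the frontier
c_nonemb = [ct * n / nt for ct, n, nt in zip(c_total, n_nonemb, n_total)]

slope_total = loglog_slope(c_total, n_total)       # total counts
slope_nonemb = loglog_slope(c_nonemb, n_nonemb)    # non-embedding counts
print(f"total counts: {slope_total:.3f}   non-embedding counts: {slope_nonemb:.3f}")
```

On this synthetic frontier the total-count fit recovers the underlying exponent, while the non-embedding fit over the same models returns a larger value, which is the kind of artifact the recommendations above are meant to avoid.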
6. Implications, Broader Context, and Impact
This reconciliation of Kaplan and Chinchilla scaling laws aligns the empirical design principles of deep LLMs with robust theoretical foundations. It implies that previously reported exponents larger than $0.5$ resulted from small-scale bias and improper parameter accounting. Correct derivations directly inform resource allocation for next-generation model training, supporting joint scaling of dataset size and model capacity for compute-optimal performance. This understanding also guides comparative and theoretical scaling-law work across architectures, datasets, and tasks.
The findings of Pearce et al. (2024) establish a clear methodological standard: correct parameter counting and broad-scale fitting are essential for accurate and universal characterization of neural scaling behavior. The Chinchilla law provides the large-scale regime exponents central to optimal resource allocation in language modeling and generative modeling work.