
Kaplan & Chinchilla Scaling Laws

Updated 8 February 2026
  • Kaplan and Chinchilla Scaling Laws are empirical and theoretical relationships that predict how test loss in neural language models scales with parameter count, dataset size, and compute cost.
  • They provide power-law formulas for optimal resource allocation, guiding the balance between increasing model size and training data to achieve efficient performance improvements.
  • Extensions incorporate architecture-specific factors like latency and sparsity, enabling principled LLM design and deployment under practical constraints.

Kaplan and Chinchilla Scaling Laws are a class of empirical and theoretical relationships that characterize how neural LLMs' test loss scales with respect to model parameter count, dataset size, compute cost, and, in modern extensions, architecture-dependent inference characteristics. These scaling laws provide quantitative recipes for allocating compute among model capacity and data, predict training behavior across orders of magnitude in scale, and now underpin principled design of LLM architectures under deployment constraints.

1. Origins and Mathematical Forms of Scaling Laws

The original scaling law formulation by Kaplan et al. (2020) established that trained transformer models exhibit predictable decreases in test cross-entropy loss as a power law in both parameter count $N$ and number of unique training tokens $D$:

$$L(N, D) \simeq A\,N^{-\alpha} + B\,D^{-\beta} + E$$

where $A$, $B$, $E$, $\alpha$, and $\beta$ are empirical constants determined by fitting ($\alpha \approx 0.076$, $\beta \approx 0.095$ for their original GPT-style models). Compute-optimal model size under the constraint $C \approx 6ND$ was derived as $N_{\mathrm{opt}} \propto C^{0.73}$, $D_{\mathrm{opt}} \propto C^{0.27}$, indicating an allocation favoring larger models.

Hoffmann et al. (2022) ("Chinchilla") revised these exponents using an expanded parameter and data regime, and by accounting for total (including embedding) parameter count and compute:

$$L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}$$

Their fitting yields $E \approx 1.69$, $A \approx 406.4$, $B \approx 410.7$, $\alpha \approx 0.34$, $\beta \approx 0.28$, and critically, a compute-optimal allocation $N_{\mathrm{opt}} \propto C^{0.5}$, $D_{\mathrm{opt}} \propto C^{0.5}$, implying a balanced "50:50" split between increasing model and data for fixed compute (Pearce et al., 2024, Porian et al., 2024).
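As a minimal sketch of how a parametric loss of this kind turns into an allocation rule: minimizing $A N^{-\alpha} + B D^{-\beta}$ subject to a fixed budget has a closed-form solution. The function name `compute_optimal`, the approximation $C \approx 6ND$ for training FLOPs, and the constants in the example call (the fitted values reported by Hoffmann et al., 2022) are assumptions of this illustration, not a reproduction of any paper's code.

```python
def compute_optimal(C, A, B, alpha, beta, flops_per_param_token=6.0):
    """Minimize A*N**-alpha + B*D**-beta subject to C = k*N*D.

    Returns (N_opt, D_opt, a, b) where N_opt scales as C**a and D_opt as C**b.
    """
    # Substituting D = (C/k)/N and setting dL/dN = 0 gives the balance condition
    #   alpha * A * N**-alpha == beta * B * D**-beta,
    # which solves to N = G * (C/k)**a and D = (C/k)**b / G.
    a = beta / (alpha + beta)
    b = alpha / (alpha + beta)
    G = (alpha * A / (beta * B)) ** (1.0 / (alpha + beta))
    budget = C / flops_per_param_token
    N = G * budget ** a
    D = budget ** b / G
    return N, D, a, b


# Illustrative inputs (assumed): a 1e21 FLOP budget and Chinchilla-style constants.
N_opt, D_opt, a, b = compute_optimal(1e21, A=406.4, B=410.7, alpha=0.34, beta=0.28)
print(f"N_opt ~ {N_opt:.3e} params, D_opt ~ {D_opt:.3e} tokens, a={a:.3f}, b={b:.3f}")
```

The split between model and data depends only on the ratio of the two fitted exponents: $a = \beta/(\alpha+\beta)$ and $b = \alpha/(\alpha+\beta)$. For the example constants $a \approx 0.45$, close to the balanced split.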

Empirical tests and analytic reparametrizations, correcting for embedding layer and small-scale biases, demonstrate that Kaplan's originally steeper exponent is a finite-size artifact. Simulating Chinchilla under the original Kaplan conventions replicates the $N_{\mathrm{opt}} \propto C^{0.73}$ scaling, with both analyses converging to $N_{\mathrm{opt}} \propto C^{0.5}$ at large scale (Pearce et al., 2024, Porian et al., 2024).

2. Theoretical Explanations and Unification

Contemporary work provides rigorous theoretical underpinnings for these empirical power laws:

  • An information-theoretic analysis establishes, in Barron-like single-hidden-layer settings, tight upper bounds whose compute-optimal minimizer fulfills $N_{\mathrm{opt}} \propto D_{\mathrm{opt}}$ (i.e., a linear scaling), matching Chinchilla's empirical law up to logarithmic corrections (Jeon et al., 2022).
  • The "Effective Frontier" framework abstracts task learning as progressive coverage of patterns from a long-tailed (Zipf) distribution. Writing $p_k$ for the probability of the pattern at rank $k$, loss is attributed to the probability mass of unlearned tail patterns beyond a cutoff rank $k^{*}$ (the "Effective Frontier"), which depends on the available resource $R$:

$$L(R) \approx \sum_{k > k^{*}(R)} p_k$$

Distinct scaling exponents emerge for model size $N$, dataset size $D$, and the number of training steps. The key result is a Max-Bottleneck principle: when multiple resource constraints apply (e.g., on model size, data, and training steps), loss is dictated by the slowest-decaying term, i.e., the largest of the per-resource losses $L_i$:

$$L \approx \max_i L_i$$

Constrained optimization over $N$ and $D$ under a fixed compute budget $C$ yields the two limiting regimes, the Kaplan law (compute-limited) and the Chinchilla law (data-limited), reconciling them as equilibrium solutions of a unified scaling equation (Zou et al., 1 Feb 2026).
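The tail-mass picture and the Max-Bottleneck principle can be illustrated numerically. The sketch below is a toy model, not the cited paper's formulation: it assumes a finite Zipf distribution $p_k \propto k^{-s}$, treats each resource as permitting coverage up to its own frontier rank, and uses made-up frontier values. The names `zipf_tail_mass` and `max_bottleneck` are hypothetical.

```python
def zipf_tail_mass(cutoff, s, n_patterns):
    """Probability mass of patterns ranked above `cutoff` when p_k is proportional to k**-s."""
    weights = [k ** -s for k in range(1, n_patterns + 1)]
    # weights[cutoff:] holds ranks cutoff+1 .. n_patterns, i.e. the unlearned tail.
    return sum(weights[cutoff:]) / sum(weights)


def max_bottleneck(frontiers, s, n_patterns):
    """Return the binding resource and the resulting loss.

    The resource with the smallest frontier leaves the largest unlearned tail,
    so the overall loss is the maximum of the per-resource tail masses.
    """
    losses = {name: zipf_tail_mass(k, s, n_patterns) for name, k in frontiers.items()}
    binding = max(losses, key=losses.get)
    return binding, losses[binding]


# Made-up frontier ranks for three resources (assumed values for illustration).
frontiers = {"model": 500, "data": 200, "steps": 1000}
binding, loss = max_bottleneck(frontiers, s=1.5, n_patterns=10_000)
print(f"binding resource: {binding}, loss ~ {loss:.4f}")
```

In this toy setting, raising the frontier of a non-binding resource leaves the loss unchanged, which is the practical content of the bottleneck view: only the slowest-decaying term matters.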

3. Practical Fitting, Methodological Refinements, and Calibration

Subsequent analyses identified methodological artifacts impacting exponent estimates and fitting consistency:

  • Exclusion of embedding and output layers from the parameter count $N$ produced an $N_{\mathrm{opt}} \propto C^{0.73}$ exponent at small scale; total parameter and compute accounting resolves this to Chinchilla's $N_{\mathrm{opt}} \propto C^{0.5}$ (Pearce et al., 2024).
  • Warmup duration, last-layer computational cost, and optimizer hyperparameter scaling systematically skew power-law estimates. Incorporating per-run, size-adaptive warmup, counting all FLOPs (including head and embedding), and per-model optimizer tuning yields scaling curves with $N_{\mathrm{opt}} \propto C^{0.5}$, matching Chinchilla (Porian et al., 2024).
  • Out-of-sample accuracy of parametric scaling forms is improved by robust fitting, such as a Huber-loss objective minimized with L-BFGS. Nonparametric ML regression (e.g., a kernel or neural network surface for $L(N, D)$) further improves frontier estimation in some empirical settings (Barkeshli et al., 15 Jan 2026).
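A robust fitting objective of the kind mentioned in the last bullet might be sketched as follows. This is an illustration under stated assumptions: the loss form $E + A N^{-\alpha} + B D^{-\beta}$ is fitted in log space with a Huber penalty, the threshold `delta=1e-3` and the log-parametrization $(\log A, \log B, \log E)$ are choices made here, and the runs are synthetic data generated from known constants rather than real measurements.

```python
import math


def huber(r, delta=1e-3):
    """Huber penalty: quadratic for small residuals, linear for large ones."""
    a = abs(r)
    return 0.5 * r * r if a <= delta else delta * (a - 0.5 * delta)


def log_pred(params, N, D):
    """log(E + A*N**-alpha + B*D**-beta) with params = (log A, log B, log E, alpha, beta)."""
    log_a, log_b, log_e, alpha, beta = params
    terms = [log_a - alpha * math.log(N), log_b - beta * math.log(D), log_e]
    m = max(terms)  # log-sum-exp for numerical stability
    return m + math.log(sum(math.exp(t - m) for t in terms))


def objective(params, runs, delta=1e-3):
    """Sum of Huber penalties on log-loss residuals over (N, D, loss) runs."""
    return sum(huber(log_pred(params, N, D) - math.log(L), delta) for N, D, L in runs)


# Synthetic runs generated from assumed "true" constants, for illustration only.
true_params = (math.log(406.4), math.log(410.7), math.log(1.69), 0.34, 0.28)
runs = [(N, D, math.exp(log_pred(true_params, N, D)))
        for N in (1e8, 1e9, 1e10) for D in (1e9, 1e10, 1e11)]
perturbed = (true_params[0], true_params[1], true_params[2], 0.30, 0.28)
print(objective(true_params, runs), objective(perturbed, runs))
```

Any general-purpose minimizer can be applied to `objective`. One option is `scipy.optimize.minimize` with `method="L-BFGS-B"` and `args=(runs,)`, restarted from several initial points because the surface is not convex.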

4. Extensions: Architecture, Latency, and Sparsity

Modern scaling law frameworks explicitly incorporate architecture- and efficiency-aware factors:

  • Latency and memory bandwidth: Empirical runtime is much better predicted by the volume of memory copy operations (dominant term in accelerator-bound environments) than by pure FLOP count. The closed-form throughput and loss prediction as a function of transformer hyperparameters allows analytic architecture optimization for fixed wall-clock time budgets, pointing to wide–shallow configurations as preferable for a given parameter count (Inbar et al., 2024).
  • Inference cost: Shape-aware scaling laws co-optimize $N$, $D$, and the model "aspect ratio" (width relative to depth), with a loss penalty that depends on the aspect ratio.

These forms enable Pareto-efficient inference-optimized model selection and demonstrate that for a fixed $N$, wider, shallower models achieve substantially faster inference at identical task accuracy, confirming model-shape-dependence in scaling (Bian et al., 30 Jan 2025).

  • Conditional scaling over architecture: Parameterizing loss exponents as functions of hidden size and MLP-to-attention ratio yields models (e.g., "Panda" and "Surefire") that outperform baselines in both loss and inference throughput (Bian et al., 21 Oct 2025).

  • Sparsity: The "average parameter count" over pre-training, $\bar{N}$, replaces the static $N$ in the Chinchilla law, unifying sparse and dense pre-training in a single scaling framework. Empirical results confirm that, for matched $\bar{N}$ and $D$, "lossless" compression through sparse pre-training yields no training or downstream accuracy deficit, decoupling training-time compute from inference speed (Jin et al., 21 Jan 2025).
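The average parameter count in the sparsity bullet is an arithmetic mean of the active parameter count over training steps. The sketch below computes it for a hypothetical pruning schedule: dense training, then a linear ramp to the target sparsity, then sparse training. The schedule shape, the function names, and all numbers are assumptions for illustration and are not taken from the cited work.

```python
def pruning_schedule(n_dense, final_sparsity, total_steps, prune_start, prune_end):
    """Active parameter count at each step under a hypothetical linear pruning ramp."""
    counts = []
    for step in range(total_steps):
        if step < prune_start:
            frac = 0.0          # still dense
        elif step >= prune_end:
            frac = 1.0          # fully pruned to the target sparsity
        else:
            frac = (step - prune_start) / (prune_end - prune_start)
        counts.append(n_dense * (1.0 - final_sparsity * frac))
    return counts


def average_parameter_count(counts):
    """Arithmetic mean of the per-step active parameter counts."""
    return sum(counts) / len(counts)


# Assumed example: 1e9 dense parameters pruned to 50% sparsity between steps 250 and 750.
counts = pruning_schedule(1e9, 0.5, total_steps=1000, prune_start=250, prune_end=750)
n_avg = average_parameter_count(counts)
print(f"final params: {counts[-1]:.3e}, average params: {n_avg:.3e}")
```

Under the average-parameter view, `n_avg` rather than the final count `counts[-1]` would be the model-size argument passed to the loss formula, while the final count determines inference cost.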

5. Origin and Robustness of Scaling Laws

Recent theoretical and synthetic experiments reinforce the universality and origin of scaling laws:

  • Scaling exponents' values are not reducible solely to data distribution tails; robust power-law scaling is observed in synthetic settings devoid of Zipf structure (e.g., transformers on random walks), implying an architectural emergence of power-law spectra or optimization effects (Barkeshli et al., 15 Jan 2026).
  • Data complexity, as modulated via the dataset's generative process or architecture parameterization, alters the data exponent $\beta$, while the parameter exponent $\alpha$ typically remains stable in 2-layer transformers and task-agnostic settings.
  • Accounting for the irreducible offset $E$ is necessary for fit validity. One-dimensional power-law fits that include such an offset outperform both fixed functional forms and exponentials in predictive accuracy.
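To make the role of the irreducible offset concrete, here is a minimal sketch of a one-dimensional fit $y \approx E + A x^{-\alpha}$. The method is a simple stand-in chosen for this example, not the procedure of the cited studies: candidate offsets are scanned on a grid and, for each, $\log(y - E)$ is regressed on $\log x$ by ordinary least squares. The data are synthetic and noise-free.

```python
import math


def fit_power_law_with_offset(xs, ys, n_grid=200):
    """Fit y ~ E + A * x**-alpha by scanning E and regressing log(y - E) on log(x)."""
    lx = [math.log(x) for x in xs]
    mean_lx = sum(lx) / len(lx)
    sxx = sum((u - mean_lx) ** 2 for u in lx)
    y_min = min(ys)
    best = None
    for i in range(n_grid):
        E = y_min * i / n_grid  # candidate offsets in [0, min(y))
        ly = [math.log(y - E) for y in ys]
        mean_ly = sum(ly) / len(ly)
        slope = sum((u - mean_lx) * (v - mean_ly) for u, v in zip(lx, ly)) / sxx
        intercept = mean_ly - slope * mean_lx
        sse = sum((intercept + slope * u - v) ** 2 for u, v in zip(lx, ly))
        if best is None or sse < best[0]:
            best = (sse, E, math.exp(intercept), -slope)
    _, E, A, alpha = best
    return E, A, alpha


# Synthetic curve with assumed constants E=1.5, A=20, alpha=0.3.
xs = [10.0 ** k for k in range(3, 9)]
ys = [1.5 + 20.0 * x ** -0.3 for x in xs]
E_hat, A_hat, alpha_hat = fit_power_law_with_offset(xs, ys)
print(f"E ~ {E_hat:.3f}, A ~ {A_hat:.2f}, alpha ~ {alpha_hat:.3f}")
```

Forcing the offset to zero (the first grid point) corresponds to a pure power-law fit. On curves that flatten toward a floor, that fit tends to underestimate the exponent, which is why the offset matters for extrapolation.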

6. Empirical Design Implications and Current Best Practices

Compiled results across multiple studies yield convergent recipes for LLM design under budget:

  • Use total (including embeddings) parameter and compute counts in all scaling law reporting (Pearce et al., 2024, Porian et al., 2024).
  • For compute-optimal training, allocate compute such that $N_{\mathrm{opt}} \propto C^{0.5}$ for $C$ sufficiently large, ensuring the "Chinchilla frontier" is approached (Pearce et al., 2024).
  • For practical deployment, augment Chinchilla or related scaling laws with shape (aspect ratio) or conditional architectural parameters to optimize for latency-constrained inference, with wider, shallower models yielding substantial efficiency gains for a fixed $N$ (Inbar et al., 2024, Bian et al., 30 Jan 2025, Bian et al., 21 Oct 2025).
  • In sparse pre-training, the effective parameter count controlling train loss is the arithmetic mean $\bar{N}$ over pre-training; use the average-parameter version of the scaling law for both dense and sparse regimes (Jin et al., 21 Jan 2025).
  • For high-complexity or high-dimensional input, theories predict that more of the budget should be channeled into increasing $N$ rather than $D$; this is supported by both analytic upper bound derivations and empirical sweeps (Jeon et al., 2022).
  • Nonparametric frontier fitting and sensitivity analysis allow for more robust extrapolation at new scales and in atypical regimes (Barkeshli et al., 15 Jan 2026).

The combined legacy of Kaplan and Chinchilla scaling laws is a rigorously grounded, empirically validated, and now architecture-aware methodology for predicting LLM scaling and guiding efficient allocation of resources in pretraining and deployment. Extensions to account for architectural details, memory, and inference cost have established that contemporary model selection and scaling optimization can be performed entirely with closed-form, hyperparameter-only loss models (Inbar et al., 2024, Bian et al., 30 Jan 2025, Bian et al., 21 Oct 2025).
