
Chinchilla Law: Neural Scaling for Transformers

Updated 28 February 2026
  • Chinchilla Law is a compute-optimal neural scaling law that balances model size and dataset tokens to minimize loss.
  • It prescribes optimal allocations for parameters (N) and tokens (D) under fixed compute budgets, typically suggesting tokens-per-parameter ratios near 1–20.
  • Empirical validations across transformer architectures show robust power-law behavior with near-equal scaling exponents, guiding modern language model design.

Chinchilla Law is a compute-optimal neural scaling law that characterizes the asymptotic relationship between model size, dataset size, and achievable loss for large-scale transformer LLMs. The law prescribes the allocation of model parameters ($N$) and training token budget ($D$) as a function of the available training compute budget ($C$) to minimize validation loss, and has become an empirical standard in both academic and industrial LLM design. The law’s mathematical form and exponents arise from comprehensive regression fits to large-scale transformer model families and are robust to architectural and optimizer variations within standard training regimes.

1. Formulation of the Chinchilla Scaling Law

The Chinchilla Law posits that the minimum achievable next-token cross-entropy loss $L(N,D)$ for a transformer model with $N$ parameters trained on $D$ tokens is well approximated by a two-dimensional power-law sum with an irreducible offset:

$$L(N,D) = E + \frac{A}{N^\alpha} + \frac{B}{D^\beta}$$

where $E$, $A$, $B$, $\alpha$, and $\beta$ are fitting parameters. $E$ is typically interpreted as the loss floor (the entropy of the data distribution), while $A$, $B$, $\alpha$, and $\beta$ describe the rate at which the reducible loss components decay as model and data scale.
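As a minimal sketch, the parametric form can be evaluated directly. The function name `chinchilla_loss` is hypothetical, and the default constants are the Besiroglu et al. (2024) point estimates quoted in Section 4 of this article; any other fit could be substituted.

```python
def chinchilla_loss(n_params, n_tokens,
                    E=1.8172, A=482.01, B=2085.43, alpha=0.3478, beta=0.3658):
    """Evaluate L(N, D) = E + A / N**alpha + B / D**beta.

    Defaults are the point estimates quoted for Besiroglu et al. (2024);
    they are used here purely for illustration.
    """
    return E + A / n_params**alpha + B / n_tokens**beta


# Scaling either axis lowers the predicted loss, but never below the floor E.
small = chinchilla_loss(1e9, 2e10)
more_params = chinchilla_loss(1e10, 2e10)
more_tokens = chinchilla_loss(1e9, 2e11)
print(f"{small:.3f} {more_params:.3f} {more_tokens:.3f}")
```

Because both reducible terms are positive, the prediction approaches $E$ from above as $N$ and $D$ grow and never crosses it.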

Given a total compute budget $C$ (in FLOPs), which is proportional to $N \times D$, the compute-optimal regime is obtained by solving the constrained optimization problem

$$\min_{N D = C/\kappa} L(N, D)$$

with the Chinchilla fit, where $\kappa$ is the proportionality constant between $C$ and $N D$. This yields

$$N_{\mathrm{opt}}(C) \propto C^{\beta/(\alpha + \beta)}, \quad D_{\mathrm{opt}}(C) \propto C^{\alpha/(\alpha + \beta)}$$

Typical empirical fits give $\alpha \approx 0.34$–$0.37$ and $\beta \approx 0.28$–$0.37$, so both allocation exponents are close to 0.5 and $N_{\mathrm{opt}}(C)$ and $D_{\mathrm{opt}}(C)$ each grow roughly as $\sqrt{C}$ (Pearce et al., 2024, Besiroglu et al., 2024, Barkeshli et al., 15 Jan 2026). This leads to the practical prescription that a model should be trained on a number of tokens roughly proportional to its number of parameters, with a tokens-per-parameter ratio near 1–20 depending on context (Song et al., 2024).
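The allocation exponents follow from a short substitution argument. The sketch below assumes the constraint $N D = C/\kappa$ holds exactly and introduces the shorthand $G$, which is notation added here for illustration.

```latex
\begin{aligned}
L\!\left(N, \tfrac{C}{\kappa N}\right) &= E + A N^{-\alpha} + B \left(\tfrac{\kappa N}{C}\right)^{\beta}, \\
\frac{\partial L}{\partial N} = 0 \;&\Longrightarrow\; \alpha A N^{-\alpha} = \beta B \left(\tfrac{\kappa N}{C}\right)^{\beta}
\;\Longrightarrow\; N^{\alpha+\beta} = \tfrac{\alpha A}{\beta B} \left(\tfrac{C}{\kappa}\right)^{\beta}, \\
N_{\mathrm{opt}}(C) &= G \left(\tfrac{C}{\kappa}\right)^{\beta/(\alpha+\beta)}, \qquad
D_{\mathrm{opt}}(C) = G^{-1} \left(\tfrac{C}{\kappa}\right)^{\alpha/(\alpha+\beta)}, \qquad
G = \left(\tfrac{\alpha A}{\beta B}\right)^{1/(\alpha+\beta)}.
\end{aligned}
```

When $\alpha = \beta$, both exponents equal exactly $1/2$, which is the square-root rule in its cleanest form.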

2. Theoretical and Empirical Basis

The Chinchilla Law stands in contrast to earlier scaling laws, notably Kaplan et al. (2020), which prescribed a much steeper $N_{\mathrm{opt}}(C) \propto C^{0.73}$. Subsequent work has shown that the Kaplan scaling law overestimates the allocation to $N$ primarily because it excluded embedding parameters and fitted the scaling relationship only at small model scales (Pearce et al., 2024, Porian et al., 2024). Correcting for these factors and performing scale-aware hyperparameter tuning recovers the Chinchilla allocation exponent $a \approx 0.5$ in $N_{\mathrm{opt}}(C) \propto C^{a}$ (Porian et al., 2024).

Empirically, the Chinchilla Law has been validated across transformer architectures, tokenization strategies, and open web corpora. Advances in regression methodology, such as fitting one-dimensional slices $L(N)_D$ and then using kernel ridge or neural-net regressors for $L(N,D)$, have yielded improved fits and robust exponents, confirming the law’s stability (Barkeshli et al., 15 Jan 2026).

3. Origin and Interpretation

A resource-theoretic explanation for the Chinchilla Law is provided by modeling neurons as allocatable resources over the network. The key hypotheses are:

  • Loss per subtask scales as $1/N$, where $N$ is the subtask’s neuron count;
  • As the model is widened (and deepened), every subtask receives a homogeneously increased budget;
  • For transformer-style models with $N_p \sim W^3$ (parameters cubic in width), total neurons per subtask scale as $N_p^{1/3}$, yielding loss $\propto N_p^{-1/3}$, which matches the empirically observed $\alpha \approx 0.34$ (Song et al., 2024).

Thus, the Chinchilla exponent emerges from both “neurons as resources” and from a convex-quadratic spectral theory of optimization and approximation error, confirming its relevance to both practical and theoretical settings (Volkova et al., 7 Feb 2026).

4. Practical Computation and Regimes of Validity

Best-fit Chinchilla parameters, derived via bootstrapped nonlinear least-squares fits, for large transformer LMs include, for example (Besiroglu et al., 2024):

$$A = 482.01 \pm 124.58, \quad B = 2085.43 \pm 1293.23, \quad E = 1.8172 \pm 0.03, \quad \alpha = 0.3478 \pm 0.02, \quad \beta = 0.3658 \pm 0.02$$

so that

$$N_{\mathrm{opt}}(C) \approx 0.12\,(C/\kappa)^{0.5126}, \quad D_{\mathrm{opt}}(C) \approx 8.35\,(C/\kappa)^{0.4874}$$

Here the prefactor $(\alpha A / \beta B)^{1/(\alpha+\beta)} \approx 0.12$ and its reciprocal multiply powers of the product $N D = C/\kappa$, where $\kappa$ is the proportionality constant from the compute constraint (commonly taken as 6 FLOPs per parameter per token).

The optimal tokens/parameter ratio then weakly decreases as $C^{-0.025}$ and is empirically near 20 at large scale. Confidence intervals on the exponents are at the few-percent level, confirming robust near-equality of the growth rates for $N$ and $D$.
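The allocation can be reproduced numerically from the quoted point estimates. This is a sketch under stated assumptions: the constant `KAPPA = 6` (training FLOPs per parameter per token) is a common approximation not fixed by the text above, the function name `compute_optimal` is hypothetical, and the prefactor `g` of roughly 0.12 multiplies a power of the product $N D = C/\kappa$ rather than of $C$ itself.

```python
# Point estimates quoted above for Besiroglu et al. (2024).
A, B, ALPHA, BETA = 482.01, 2085.43, 0.3478, 0.3658
KAPPA = 6.0  # assumption: C ~= 6 * N * D training FLOPs


def compute_optimal(c_flops):
    """Return (N_opt, D_opt) minimising A/N**ALPHA + B/D**BETA with KAPPA*N*D = c_flops."""
    a = BETA / (ALPHA + BETA)   # exponent of N_opt, about 0.5126
    b = ALPHA / (ALPHA + BETA)  # exponent of D_opt, about 0.4874
    g = (ALPHA * A / (BETA * B)) ** (1.0 / (ALPHA + BETA))
    nd = c_flops / KAPPA        # the product N * D allowed by the budget
    return g * nd**a, nd**b / g


for c in (1e20, 1e22, 1e24):
    n_opt, d_opt = compute_optimal(c)
    print(f"C={c:.0e}  N_opt={n_opt:.3e}  D_opt={d_opt:.3e}  D/N={d_opt / n_opt:.1f}")
```

Under these assumptions the printed tokens-per-parameter ratio drifts slowly downward as the budget grows, consistent with the weak $C^{-0.025}$ dependence.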

The practical regime of validity is $N \in [10^7, 10^{11}]$, $D$ up to $10^{13}$ tokens, and compute budgets up to $10^{24}$ FLOPs. For out-of-domain architectures or training at extreme $D/N$ ratios (e.g., $D/N \gg 10^4$), the exponents may drift and empirical gains from extra data diminish, as revealed by coefficient ablation (Sardana et al., 2023).

5. Modifications, Extensions, and Limitations

Impact of Inference and Deployment

Standard Chinchilla Law optimizes only for pretraining FLOPs. Incorporating inference cost (responsible for $2N \cdot R$ FLOPs, with $R$ the total lifetime inference tokens) shifts the compute-optimal prescription toward smaller $N$ and larger $D$ as inference demand increases (Sardana et al., 2023). The resulting optimization,

$$\min_{N, D_{\text{tr}}} \ \text{Total}_{\text{Compute}}(N, D_{\text{tr}}; R) \quad \text{subject to} \quad L(N, D_{\text{tr}}) = \ell$$

requires numerical solution, but in the limit of $R \gg D$, the optimal $N$ falls and $D$ rises relative to the Chinchilla-only optimum.
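A small numerical experiment illustrates the shift. The sketch below assumes a cost model of $6 N D$ training FLOPs plus $2 N R$ inference FLOPs, reuses the Besiroglu et al. (2024) point estimates quoted in Section 4, and picks an arbitrary target loss of 2.0. The log-spaced grid search and all function names are illustrative choices, not the method of the cited work.

```python
import math

# Parametric-fit constants quoted in this article for Besiroglu et al. (2024).
E, A, B, ALPHA, BETA = 1.8172, 482.01, 2085.43, 0.3478, 0.3658


def tokens_for_target_loss(n_params, target_loss):
    """Training tokens D such that L(N, D) == target_loss, or None if unreachable."""
    gap = target_loss - E - A / n_params**ALPHA
    if gap <= 0:
        return None  # this model size can never reach the target
    return (B / gap) ** (1.0 / BETA)


def total_flops(n_params, target_loss, inference_tokens):
    """Assumed cost model: 6*N*D for training plus 2*N*R for inference."""
    d = tokens_for_target_loss(n_params, target_loss)
    if d is None:
        return math.inf
    return 6.0 * n_params * d + 2.0 * n_params * inference_tokens


def best_model_size(target_loss, inference_tokens, lo=1e8, hi=1e12, steps=4000):
    """Log-spaced grid search over N for the cheapest way to hit target_loss."""
    best_n, best_cost = None, math.inf
    for i in range(steps + 1):
        n = lo * (hi / lo) ** (i / steps)
        cost = total_flops(n, target_loss, inference_tokens)
        if cost < best_cost:
            best_n, best_cost = n, cost
    return best_n, tokens_for_target_loss(best_n, target_loss)


for r in (0.0, 1e12, 1e13):
    n, d = best_model_size(target_loss=2.0, inference_tokens=r)
    print(f"R={r:.0e}: N~{n:.2e}, D~{d:.2e}, D/N~{d / n:.0f}")
```

With $R = 0$ the search recovers the training-only optimum for that loss level; as $R$ grows, the selected $N$ shrinks and the required $D$ increases.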

Optimizer- and Hyperparameter-Aware Extensions

Traditional Chinchilla Law assumes all non-$(N,D)$ hyperparameters are optimally tuned. Configuration-to-performance laws (e.g., NCPL) explicitly learn $L(\Phi)$ as a function of the full training configuration, capturing effects of batch size, learning rate, optimizer type, schedule, etc., and reducing per-run loss prediction error by 20–40% (Zhang et al., 10 Feb 2026).

Optimizer-aware Chinchilla extensions introduce rescaling factors $(\rho_N^{(o)}, \rho_D^{(o)})$ per optimizer $o$, holding $(A, \alpha, B, \beta, E)$ constant across optimizers and enabling direct cross-optimizer comparison. Empirically, new optimizers (Muon, SOAP) achieve $\rho_D \in [1.5, 2.5]$, increasing data efficiency beyond vanilla AdamW (Volkova et al., 7 Feb 2026).
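One way to read the rescaling factors is as multipliers on the effective model size and token count. The sketch below assumes that form, $L_o(N,D) = E + A/(\rho_N N)^\alpha + B/(\rho_D D)^\beta$, which is not spelled out in the text above and may differ from the cited parameterization. The function name and the example value $\rho_D = 2$ (inside the quoted range) are illustrative.

```python
def optimizer_aware_loss(n_params, n_tokens, rho_n=1.0, rho_d=1.0,
                         E=1.8172, A=482.01, B=2085.43, alpha=0.3478, beta=0.3658):
    """Assumed form: rho_n and rho_d rescale the effective N and D.

    (E, A, B, alpha, beta) are shared across optimizers; defaults are the
    Besiroglu et al. (2024) point estimates, used only for illustration.
    """
    return E + A / (rho_n * n_params)**alpha + B / (rho_d * n_tokens)**beta


n, d = 1e9, 2e10
baseline = optimizer_aware_loss(n, d)             # reference optimizer, rho = 1
improved = optimizer_aware_loss(n, d, rho_d=2.0)  # hypothetical optimizer
# Under this assumed form, rho_d = 2 is equivalent to doubling the token budget.
doubled = optimizer_aware_loss(n, 2 * d)
print(f"baseline={baseline:.4f} improved={improved:.4f} doubled-data={doubled:.4f}")
```

Under this reading, comparing optimizers reduces to comparing how many baseline-equivalent tokens each one extracts from the same data.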

Fitting Methodologies and Robustness

The standard fitting procedure (used by Hoffmann et al.) fits the parametric form $L(N,D) = E + A/N^\alpha + B/D^\beta$ globally to $(N,D,L)$ data. Alternative methods fit 1D slices (robust to drift of the exponents in $D$ and $N$), then regress the surface using fully connected nets or RBF ridge regressors, yielding lower validation MSEs and improved compute-optimal predictions (Barkeshli et al., 15 Jan 2026). Replication efforts highlight the importance of correct data extraction, proper initialization, and rigorous bootstrapping for valid uncertainty quantification; overly tight confidence intervals in the original Chinchilla study likely resulted from statistical underestimation (Besiroglu et al., 2024).
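To make the slice-fitting idea concrete, the sketch below fits a single fixed-$D$ slice of the form $L(N)_D = E_D + A/N^\alpha$. It uses a simple substitute technique, scanning candidate offsets $E_D$ and running ordinary least squares in log space, rather than the procedure of any cited paper. The data are noise-free and synthetic, and the ground-truth constants and function name are hypothetical.

```python
import math


def fit_slice(ns, losses, offset_grid):
    """Fit L(N) = E_D + A / N**alpha on a single fixed-D slice.

    For each candidate offset E_D, log(L - E_D) is regressed on log(N) by
    ordinary least squares; the candidate with the smallest squared residual
    wins. Returns (E_D, A, alpha).
    """
    xs = [math.log(n) for n in ns]
    mx = sum(xs) / len(xs)
    sxx = sum((x - mx) ** 2 for x in xs)
    best = None
    for e in offset_grid:
        if e >= min(losses):
            continue  # log(L - E_D) would be undefined
        ys = [math.log(l - e) for l in losses]
        my = sum(ys) / len(ys)
        slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
        intercept = my - slope * mx
        sse = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
        if best is None or sse < best[0]:
            best = (sse, e, math.exp(intercept), -slope)
    return best[1], best[2], best[3]


# Noise-free synthetic slice with known (hypothetical) ground truth.
TRUE_E, TRUE_A, TRUE_ALPHA = 2.0, 400.0, 0.34
ns = [1e7, 3e7, 1e8, 3e8, 1e9, 3e9]
losses = [TRUE_E + TRUE_A / n**TRUE_ALPHA for n in ns]
grid = [1.5 + 0.01 * i for i in range(101)]
e_d, a, alpha = fit_slice(ns, losses, grid)
print(f"E_D={e_d:.2f}  A={a:.1f}  alpha={alpha:.3f}")
```

On real, noisy measurements the offset grid would need to be finer, and resampling the points (bootstrapping) before each fit is one way to obtain the uncertainty estimates discussed above.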

6. Practical Implications and Design Guidelines

  • Chinchilla Law dictates that, under a fixed compute budget, practitioners should allocate parameters and tokens in near-equal (square-root) proportion, with tokens per parameter typically $\sim 20$ but weakly declining with scale.
  • All parameters, including embeddings, must be included when measuring $N$; compute must account for all major FLOP contributors.
  • For model deployment with significant inference load, the pretraining-optimal Chinchilla ratio is suboptimal: a smaller model should be trained on more tokens to save on inference costs.
  • Hyperparameter settings, optimizer choice, and hardware constraints interact with (and may violate) Chinchilla predictions; configuration-aware extensions or direct residual learning atop the Chinchilla fit are required for accurate large-scale forecasts (Zhang et al., 10 Feb 2026).
  • The law is robust across natural language, synthetic graphs, and simplified LLMs, suggesting its applicability to a broad range of transformer-based systems (Barkeshli et al., 15 Jan 2026).

7. Theoretical Foundations and Scope

The Chinchilla exponents are further justified by spectral theory arguments: loss decomposes into approximation and optimization errors, each admitting power-law decay when the data exhibit a polynomial eigenvalue decay in the Hessian. The sum of these contributions yields the empirical scaling law,

$$L(N,D) \sim E + \frac{A}{N^\alpha} + \frac{B}{D^\beta}$$

with $\alpha$, $\beta$ determined by the spectral “dimension” of the data/model interaction (Volkova et al., 7 Feb 2026).

A plausible implication is that any architectural or optimizer innovation that alters the effective spectral decay or exploits resource allocation more efficiently could yield sharper scaling (either reducing the constants $A$, $B$ or increasing the exponents), but rigorous validation at large scale remains essential.


For further mathematical details and empirical fits, see (Barkeshli et al., 15 Jan 2026, Pearce et al., 2024, Song et al., 2024, Besiroglu et al., 2024, Sardana et al., 2023, Volkova et al., 7 Feb 2026, Zhang et al., 10 Feb 2026, Porian et al., 2024).
