Scaling Laws of Large Language Models

Updated 13 December 2025
  • Scaling laws are quantifiable relationships that define how model size, dataset scale, and compute influence performance in large language models.
  • The mathematical formulations use power laws with empirical exponents, enabling accurate performance predictions and optimized resource allocation.
  • Extensions to these laws cover downstream tasks, architectural variants like MoE, and temporal dynamics, offering practical guidelines for designing next-gen LLMs.

LLMs exhibit systematic and highly regular trade-offs between model size, dataset scale, compute, architecture, and performance—patterns empirically captured by scaling laws. These empirical or semi-theoretical relations serve as quantitative maps for model design, resource allocation, and performance forecasting in large-scale pretraining and fine-tuning regimes. While initial laws related upstream (pretraining) loss to size and data through power laws, recent work extends these principles to downstream tasks, architectural sparsity, data domain, temporal dynamics, and even direct benchmark scores. This article surveys core mathematical forms, theoretical underpinnings, domain-specific variants, practical implications, and current research directions as documented in recent literature.

1. Mathematical Forms of Scaling Laws

Scaling laws describe how key metrics (typically loss or downstream accuracy) depend on design and training variables. The archetypal form is a multi-term power law in model size (N or P: non-embedding parameters or total parameters), dataset size (D: tokens), and sometimes compute (C: FLOPs):

Law / Paper | Formula | Regime
Hoffmann et al. / Chinchilla | $L(N, D) = E + A N^{-\alpha} + B D^{-\beta}$ ($E$: loss floor) | Dense models
Farseer (Li et al., 12 Jun 2025) | $L(N, D) = \exp(a_3 N^{\gamma} + b_3) + \exp(a_2 N^{\beta} + b_2)\, D^{-\exp(a_1 N^{\alpha} + b_1)}$ | Unified
Code LLMs (Farseer fit) | Same form as Farseer, with distinct coefficients reflecting higher data hunger | Code models
MoE / Sparse (Hossain et al., 8 Aug 2025) | $L(N, D, S) = e(1-S)^{\gamma} + [a(1-S)^{\alpha} + cS]\, N^{-\alpha} + b D^{-\beta}$ | Dense/Sparse

Parameters $A, B, E, \alpha, \beta$ are empirically fitted. Power-law exponents typically fall in $[0.2, 0.6]$; $E$ is set by data and architecture, and the constants reflect domain properties.
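In practice these surfaces are recovered from training grids by nonlinear least squares. The sketch below fits the Chinchilla-style form $L(N, D) = E + A N^{-\alpha} + B D^{-\beta}$ to synthetic measurements; the constants used to generate the synthetic points are placeholders in the spirit of published dense-model fits, not values drawn from this article's sources.

```python
# Sketch: fit the Chinchilla-style surface L(N, D) = E + A*N^-alpha + B*D^-beta
# by nonlinear least squares. The "measurements" are synthetic, generated from
# placeholder constants; real fits use large (N, D) training grids.
import numpy as np
from scipy.optimize import curve_fit

def chinchilla_loss(ND, E, A, B, alpha, beta):
    N, D = ND
    return E + A * N ** (-alpha) + B * D ** (-beta)

# Synthetic grid: model sizes (parameters) x dataset sizes (tokens).
Ns = np.array([1e8, 3e8, 1e9, 3e9, 1e10])
Ds = np.array([2e10, 1e11, 5e11, 2e12])
N, D = (g.ravel() for g in np.meshgrid(Ns, Ds))
loss = chinchilla_loss((N, D), 1.69, 406.4, 410.7, 0.34, 0.28)   # placeholder "truth"
loss = loss + np.random.default_rng(0).normal(0.0, 0.005, size=loss.shape)

p0 = [1.5, 300.0, 300.0, 0.3, 0.3]                               # rough initial guess
(E, A, B, alpha, beta), _ = curve_fit(chinchilla_loss, (N, D), loss, p0=p0, maxfev=20000)
print(f"E={E:.3f}  A={A:.1f}  B={B:.1f}  alpha={alpha:.3f}  beta={beta:.3f}")
```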

Mixture-of-Experts (MoE) models admit structurally analogous laws with extra scaling in the number of experts $E$:

$$\hat{L}(N, D, E) = \frac{A}{N^{\alpha} E^{\gamma}} + \frac{B}{D^{\beta}} + \sigma$$

with compute budget $C = N D$ (Wang et al., 8 Oct 2024).
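A minimal sketch of how this MoE form behaves at a fixed compute budget, with all coefficients chosen as illustrative placeholders rather than the fitted values of Wang et al. (8 Oct 2024):

```python
# Sketch of the MoE scaling form above. All coefficients are illustrative
# placeholders, not the fitted values reported by Wang et al. (8 Oct 2024).
def moe_loss(N, D, E, A=400.0, B=400.0, sigma=1.7, alpha=0.34, beta=0.28, gamma=0.05):
    return A / (N ** alpha * E ** gamma) + B / D ** beta + sigma

C = 1e22                       # compute budget, using the paper's bookkeeping C = N * D
N = 1e10                       # fixed model size
D = C / N                      # tokens implied by the budget
for E in (1, 8, 64):
    print(f"E={E:>3}  predicted loss = {moe_loss(N, D, E):.4f}")
```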

Code LLMs obey similar power laws, but optimal data-to-parameter ratios ($D/N$) are $7$–$20\times$ higher than for natural language tasks: $D/N \sim 150$–$400$ vs. $D/N \sim 20$ at equal FLOPs (Luo et al., 9 Oct 2025).
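As a worked example, a target $D/N$ ratio plus a compute budget pins down $N$ and $D$ directly. The sketch below assumes the common approximation $C \approx 6ND$ for training FLOPs; that constant is an assumption of this illustration, not a statement from the cited paper.

```python
# Worked example: a compute budget plus a target data-to-parameter ratio pins
# down N and D. Assumes the common approximation C ≈ 6*N*D for training FLOPs,
# which is an assumption of this sketch, not a claim from the cited papers.
import math

def allocate(C_flops, ratio):
    N = math.sqrt(C_flops / (6.0 * ratio))   # solve C = 6 * N * (ratio * N) for N
    D = ratio * N
    return N, D

for ratio, label in [(20, "natural-language-style D/N"), (150, "code-style D/N")]:
    N, D = allocate(1e21, ratio)
    print(f"{label:>27}: N ≈ {N:.2e} params, D ≈ {D:.2e} tokens")
```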

Refined laws, such as Farseer, further include non-separable interactions and dynamic data exponents, significantly lowering prediction errors and supporting reliable extrapolation across orders of magnitude in $N$ and $D$ (Li et al., 12 Jun 2025).
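For concreteness, the Farseer functional form from the table above can be written out directly. The coefficients below are arbitrary placeholders chosen only to give readable numbers (with $N$ measured in billions of parameters purely for convenience), not the published fit.

```python
# The Farseer functional form from the table in Section 1, written out directly.
# All a_i, b_i and exponents are arbitrary placeholders chosen for readable
# numbers; N is measured in billions of parameters purely for convenience here.
import math

def farseer_loss(N, D, a1=-0.01, b1=-1.2, a2=-0.05, b2=6.0, a3=-0.03, b3=0.8,
                 alpha=0.5, beta=0.5, gamma=0.5):
    floor = math.exp(a3 * N ** gamma + b3)           # N-dependent irreducible term
    amplitude = math.exp(a2 * N ** beta + b2)        # N-dependent data-term amplitude
    data_exponent = math.exp(a1 * N ** alpha + b1)   # N-dependent data exponent
    return floor + amplitude * D ** (-data_exponent)

for N in (1, 10, 100):                               # billions of parameters (sketch units)
    for D in (1e10, 1e12):                           # tokens
        print(f"N={N:>4}B  D={D:.0e}  predicted loss = {farseer_loss(N, D):.3f}")
```

The $N$-dependent exponent on $D$ is what makes the surface non-separable, in contrast to the additive Chinchilla form.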

2. Theoretical Justifications and Regimes

Statistical models shed light on why power laws emerge. Maloney et al. (2022) analytically derive scaling regimes by coupling the eigenspectrum of the natural data distribution (typically power-law-tailed) to model complexity via random features and ridge regression. Key implications:

  • If the eigenvalue spectrum follows $\lambda_i \sim i^{-(1+\alpha)}$, test loss asymptotically behaves as $L \sim N^{-\alpha}$ or $L \sim D^{-\alpha}$, saturating at a noise floor dictated by data entropy (a numerical sketch follows this list).
  • Optimal compute allocation occurs at equiparameterization ($N \propto D$), aligning with empirical findings that the lowest loss per FLOP comes when $N$ and $D$ are increased in lockstep.
  • When model size or data exceed the effective dimension of the latent space (i.e., spectrum support), power-law scaling breaks and loss plateaus.
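The following numerical sketch mimics this setup at small scale: data with a power-law covariance spectrum, a linear teacher, and a random-feature ridge-regression student. It is a qualitative illustration only; the dimensions, feature map, and ridge strength are arbitrary choices, not those of Maloney et al.

```python
# Qualitative sketch of the random-feature picture: data with a power-law
# covariance spectrum, a linear teacher, and a random-feature ridge student.
# Test error should fall as the number of features N grows and eventually
# flatten once N stops being the limiting resource (the latent dimension M
# and the fixed training-set size also bottleneck the error).
import numpy as np

rng = np.random.default_rng(0)
M = 256                                          # latent (effective) data dimension
alpha = 0.6                                      # spectrum tail exponent
lam = np.arange(1, M + 1) ** (-(1 + alpha))      # eigenvalues lambda_i ~ i^-(1+alpha)

def sample(n):
    return rng.normal(size=(n, M)) * np.sqrt(lam)    # features with the target spectrum

w_teacher = rng.normal(size=M)
X_train, X_test = sample(4096), sample(2048)
y_train, y_test = X_train @ w_teacher, X_test @ w_teacher

for N in (16, 32, 64, 128, 256, 512, 1024):
    W = rng.normal(size=(M, N)) / np.sqrt(M)         # random projection
    phi_tr = np.maximum(X_train @ W, 0.0)            # ReLU random features
    phi_te = np.maximum(X_test @ W, 0.0)
    ridge = 1e-6
    theta = np.linalg.solve(phi_tr.T @ phi_tr + ridge * np.eye(N), phi_tr.T @ y_train)
    mse = np.mean((phi_te @ theta - y_test) ** 2)
    print(f"N={N:>5}  test MSE = {mse:.4e}")
```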

Farseer’s improvement arises from explicitly modeling the non-separable interaction between $N$ and $D$, accurately capturing the empirical loss surfaces found in massive LLM training campaigns (Li et al., 12 Jun 2025).

3. Domain- and Architecture-Specific Scaling Patterns

Empirical research demonstrates that scaling laws broadly generalize across LLM families but with substantial domain or architecture-induced shifts in coefficients and exponents.

  • MoE architectures exhibit scaling exponents identical to those of dense models, but a larger fraction of compute should be allocated to model scale ($N \sim C^{0.59}$, $D \sim C^{0.41}$ for $E = 8$) (Wang et al., 8 Oct 2024). MoE models achieve up to 16% better data efficiency at fixed compute than comparable dense models.
  • Code LLMs demand significantly higher data-to-parameter ratios: optimal $D/N$ grows super-linearly with compute; mixture experiments confirm that a moderate natural language fraction can help small code models in data-scarce regimes but degrades performance at large scale (Luo et al., 9 Oct 2025).
  • Sparse/pruned models interpolate between dense and fully sparse regimes, with performance given by $L(N, D, S)$, where $S$ is the sparsity. This general law exactly recovers dense scaling at $S = 0$ and matches MoE/pruning curves up to $S = 0.98$, enabling optimal resource trade-offs between $N$, $D$, and $S$ for a fixed compute budget (Hossain et al., 8 Aug 2025); see the sketch after this list.
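A minimal sketch of the generalized sparsity law, with placeholder coefficients, verifying the $S = 0$ reduction to the dense form:

```python
# Sketch of the generalized sparsity law from Section 1:
#   L(N, D, S) = e*(1-S)^gamma + [a*(1-S)^alpha + c*S] * N^-alpha + b*D^-beta
# Coefficients are placeholders, not the fitted values of Hossain et al.
def sparse_loss(N, D, S, e=1.69, a=406.4, b=410.7, c=50.0,
                alpha=0.34, beta=0.28, gamma=0.1):
    return e * (1 - S) ** gamma + (a * (1 - S) ** alpha + c * S) * N ** (-alpha) + b * D ** (-beta)

# At S = 0 the law reduces exactly to the dense Chinchilla-style form.
N, D = 1e10, 1e12
dense = 1.69 + 406.4 * N ** (-0.34) + 410.7 * D ** (-0.28)
assert abs(sparse_loss(N, D, S=0.0) - dense) < 1e-12
print(f"dense-equivalent loss at S=0: {sparse_loss(N, D, S=0.0):.4f}")
```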

4. Extensions: Temporal, Downstream, and Context-Aware Laws

Modern scaling law research now addresses aspects beyond upstream loss:

  • Temporal Laws: These model test-loss evolution through training, at both sequence-level and token-position granularity. The temporal scaling law uses a dynamic hyperbolic fit for per-token loss, parameterized by position and training step. Unlike power-law fits, the temporal law achieves near-perfect $R^2$ on both in-domain and out-of-distribution sets, enabling early hyperparameter selection and accurate prediction of training trajectories (Xiong et al., 27 Apr 2024).
  • Downstream Performance: For metrics such as BLEU (MT), benchmark accuracy (MMLU), or few-shot task scores, specialized scaling forms are required.
    • Machine translation scales as a log-law in BLEU (if pretraining and target distributions are well aligned), but cross-entropy can monotonically decrease even when BLEU no longer improves, indicating distribution misalignment risk (Isik et al., 6 Feb 2024).
    • The “Performance Law” directly predicts MMLU from model shape, data scale, and a training-instability penalty, achieving roughly 3-point average error on held-out models and extending to both dense and MoE architectures (Wu et al., 19 Aug 2024).
  • Context-aware Laws: Downstream task performance depends jointly on training compute and the length of in-context demonstrations. These are captured by a saturating power-law in compute, a saturating power-law in context length, and a context-window penalty. Different tasks exhibit distinct “characteristic context” scales; e.g., arithmetic reasoning benefits from many-shots, translation saturates after a few demonstrations (Montgomery et al., 16 Oct 2025).
  • Loss-to-loss Laws: The downstream loss on any evaluation task is a universal, shifted power law of pretraining validation loss; the dominant influence is the pretraining data distribution—not model size, architecture, or hyperparameters (Mayilvahanan et al., 17 Feb 2025).
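To illustrate the loss-to-loss idea, the sketch below fits a shifted power law mapping pretraining loss to downstream loss on synthetic points. The specific parameterization $L_{\text{task}} = K (L_{\text{pre}} - E_0)^{\kappa} + E_1$ is an assumption of this illustration, not necessarily the exact form used by Mayilvahanan et al.

```python
# Sketch of a loss-to-loss fit on synthetic points. The parameterization
#   L_task = K * (L_pre - E0)**kappa + E1
# is an assumed "shifted power law" chosen for illustration; it is not
# necessarily the exact form used by Mayilvahanan et al. (17 Feb 2025).
import numpy as np
from scipy.optimize import curve_fit

def loss_to_loss(L_pre, K, E0, kappa, E1):
    return K * (L_pre - E0) ** kappa + E1

L_pre = np.linspace(2.2, 3.5, 12)                          # synthetic pretraining losses
L_task = loss_to_loss(L_pre, 1.3, 1.8, 1.4, 0.5)           # synthetic downstream losses
L_task = L_task + np.random.default_rng(1).normal(0.0, 0.01, L_task.shape)

popt, _ = curve_fit(loss_to_loss, L_pre, L_task, p0=[1.0, 1.5, 1.0, 0.3],
                    bounds=([0.0, 0.0, 0.1, 0.0], [10.0, 2.0, 5.0, 5.0]))
K, E0, kappa, E1 = popt
print(f"K={K:.2f}  E0={E0:.2f}  kappa={kappa:.2f}  E1={E1:.2f}")
```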

5. Empirical Methodologies and Validation

Scaling law analysis unites large-scale grid sweeps, statistical surface fitting, and robust extrapolation validation.

  • Comprehensive grids of (model size, dataset size) are constructed, with held-out validation runs and test losses measured on high-quality data splits (Li et al., 12 Jun 2025, Luo et al., 9 Oct 2025).
  • Nonlinear and log-linear fits are benchmarked by mean relative error, $R^2$, and the magnitude of extrapolation errors on out-of-domain (off-grid or ultra-large-scale) data points (Li et al., 12 Jun 2025).
  • Modern methodologies emphasize piecewise and differential fitting, ensuring that fitted exponents and offset constants generalize across regimes rather than overfit a narrow slice.
  • Newer laws (e.g., Farseer) demonstrably outperform older power-law models, reducing average relative error by several multiples and maintaining percent-level accuracy when predicting models more than $10\times$ larger than any in the training set (Li et al., 12 Jun 2025, Luo et al., 9 Oct 2025).
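The validation metrics named above are straightforward to compute; a small helper sketch (with hypothetical held-out numbers) follows.

```python
# Helpers for the fit-validation metrics described above: mean relative error
# and R^2, typically reported separately for in-grid and extrapolation points.
import numpy as np

def mean_relative_error(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs(y_pred - y_true) / np.abs(y_true)))

def r_squared(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Hypothetical held-out measurements vs. a fitted law's predictions.
held_out = [2.31, 2.12, 1.98]
predicted = [2.29, 2.15, 2.02]
print(mean_relative_error(held_out, predicted), r_squared(held_out, predicted))
```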

6. Practical Guidelines and Trade-Offs in LLM Design

Scaling laws provide closed-form formulas and explicit recipes for model design, resource allocation, and cost–performance optimization.

  • For a fixed compute budget, optimal model/data allocations are given by scaling exponents (e.g., $N_{\rm opt} \sim C^{0.464}$, $D_{\rm opt} \sim C^{0.536}$ for vanilla dense models) (Shuai et al., 2 Dec 2024).
  • In code LLMs, always budget for much higher $D/N$; for $C \sim 10^{21}$ FLOPs, allocate $D/N \sim 150$–$200$ (Luo et al., 9 Oct 2025).
  • Sparse/Pruned networks: Use joint optimization of NN, DD, and SS via the generalized law to maximize gains for a specific workload (Hossain et al., 8 Aug 2025).
  • For merging specialist models (such as adapters), performance follows the law $L(N, k) = L_\infty(N) + A(N)/(k + b)$. Most gains arrive by $k = 5$–$6$ experts, and diminishing returns thereafter are mathematically predictable; a quick numerical sketch follows this list. This enables an optimal mix of expert acquisition and base-model scaling for efficient ensemble construction (Wang et al., 29 Sep 2025).
  • For downstream score prediction (e.g., MMLU), performance is not a trivial function of loss. Use log-linear regression on model configuration and data scale, and apply hardware/instability discounts. This allows performance planning and leakage/quality diagnosis (Wu et al., 19 Aug 2024).
  • In few-shot and context-extended deployment, leverage context-aware scaling to decide between investing in model scale or context-window extensions based on the task’s saturation profile (Montgomery et al., 16 Oct 2025).
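As an example of the merging law in use, the sketch below tabulates the marginal gain per added expert under $L(N, k) = L_\infty(N) + A(N)/(k + b)$; the values of $L_\infty$, $A$, and $b$ are illustrative placeholders.

```python
# Diminishing returns under the merging law L(N, k) = L_inf(N) + A(N)/(k + b).
# L_inf, A, and b are illustrative placeholders, not fitted values from
# Wang et al. (29 Sep 2025).
def merge_loss(k, L_inf=1.80, A=0.50, b=1.0):
    return L_inf + A / (k + b)

previous = merge_loss(1)
for k in range(2, 11):
    current = merge_loss(k)
    print(f"k={k:>2}  loss={current:.4f}  marginal gain from expert #{k}: {previous - current:.4f}")
    previous = current
```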

7. Open Directions and Future Methodologies

Recent position work advocates recasting scaling law discovery as an inverse problem: Given a desired performance threshold, find the minimal ingredients (model/data/compute/annotation) to achieve it under resource constraints, potentially exposing stage-wise and hybrid scaling laws that extend well beyond classical power-law regimes (Verma et al., 9 Sep 2025). New target areas include:

  • Data selection scaling: Discovering minimal necessary data for target metrics under optimal curation.
  • Inference scaling: Jointly optimizing over architectures, inference strategies, and context to minimize cost for a given utility.
  • Machine unlearning scaling: Quantifying cost–performance trade-offs for removing (unlearning) data while retaining utility on retained data.

Actual derivations, empirical coefficients, and validation in these new inverse-problem contexts remain open for future research.


Scaling laws in LLMs currently anchor quantitative planning for model training, compute investment, and deployment policy. The research trajectory now moves toward generalizing these laws to cover sparsity, domain shift, explicit downstream utility, temporally evolving training, context-driven inference, and compositional/ensemble intelligence, with the aim of precisely mapping performance landscapes for next-generation AI systems.
