Scaling Laws of Large Language Models
- Scaling laws are quantifiable relationships that define how model size, dataset scale, and compute influence performance in large language models.
- The mathematical formulations use power laws with empirical exponents, enabling accurate performance predictions and optimized resource allocation.
- Extensions to these laws cover downstream tasks, architectural variants like MoE, and temporal dynamics, offering practical guidelines for designing next-gen LLMs.
LLMs exhibit systematic and highly regular trade-offs between model size, dataset scale, compute, architecture, and performance—patterns empirically captured by scaling laws. These empirical or semi-theoretical relations serve as quantitative maps for model design, resource allocation, and performance forecasting in large-scale pretraining and fine-tuning regimes. While initial laws related upstream (pretraining) loss to size and data through power laws, recent work extends these principles to downstream tasks, architectural sparsity, data domain, temporal dynamics, and even direct benchmark scores. This article surveys core mathematical forms, theoretical underpinnings, domain-specific variants, practical implications, and current research directions as documented in recent literature.
1. Mathematical Forms of Scaling Laws
Scaling laws describe how key metrics (typically loss or downstream accuracy) depend on design and training variables. The archetypal form is a multi-term power law in model size ($N$ or $P$: non-embedding or total parameters), dataset size ($D$: tokens), and sometimes compute ($C$: FLOPs):
| Law / Paper | Formula | Regime |
|---|---|---|
| Hoffmann et al. / Chinchilla | $L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}$ ($E$: loss floor) | Dense models |
| Farseer (Li et al., 12 Jun 2025) | Non-separable $L(N, D)$ surface with a size-dependent data exponent | Unified (dense) |
| Code LLMs (Farseer fit) | Same form as Farseer, with distinct coefficients reflecting higher data hunger | Code models |
| MoE / Sparse (Hossain et al., 8 Aug 2025) | Generalized law in $N$, $D$, and sparsity $S$ that recovers the dense law at $S = 0$ | Dense/Sparse |
The parameters $E$, $A$, $B$, $\alpha$, and $\beta$ are empirically fitted. Power-law exponents typically fall in $(0, 1)$; the irreducible loss $E$ is set by data and architecture, and the constants reflect domain properties.
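As a concrete illustration, the following minimal Python sketch fits the Chinchilla-style form above to a handful of synthetic $(N, D, \text{loss})$ points. All "measurements", coefficients, and the grid itself are made up for demonstration; a real analysis uses many training runs and a more robust objective (the Chinchilla paper itself fits a log-space Huber objective, while plain bounded least squares is used here only for brevity).

```python
# Minimal sketch: fitting L(N, D) = E + A/N**alpha + B/D**beta to measured
# losses. The "measurements" below are synthetic and purely illustrative.
import numpy as np
from scipy.optimize import curve_fit

def chinchilla_loss(ND, E, A, B, alpha, beta):
    N, D = ND
    return E + A / N**alpha + B / D**beta

# Hypothetical training runs: (parameters, tokens) -> observed loss,
# spanning two different tokens-per-parameter ratios so the fit is identifiable.
N = np.array([1e8, 1e8, 4e8, 4e8, 1e9, 1e9, 4e9, 4e9])
D = np.array([2e9, 2e10, 8e9, 8e10, 2e10, 2e11, 8e10, 8e11])
loss = chinchilla_loss((N, D), 1.7, 400.0, 410.0, 0.34, 0.28)
loss += np.random.default_rng(0).normal(0.0, 0.002, size=loss.shape)

p0 = [1.5, 300.0, 300.0, 0.3, 0.3]                 # rough initial guess
bounds = ([0, 0, 0, 0, 0], [5, 1e4, 1e4, 1, 1])    # keep parameters positive
popt, _ = curve_fit(chinchilla_loss, (N, D), loss, p0=p0,
                    bounds=bounds, maxfev=20000)
E, A, B, alpha, beta = popt
print(f"fitted: E={E:.2f}, A={A:.0f}, B={B:.0f}, "
      f"alpha={alpha:.2f}, beta={beta:.2f}")
```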
Mixture-of-Experts (MoE) models admit structurally analogous laws with an additional scaling term in the number of experts, fitted jointly with the compute budget $C$ (Wang et al., 8 Oct 2024).
Code LLMs obey similar power laws, but optimal data-to-parameter ratios ($D/N$) are several-fold (roughly $7\times$ or more) higher than for natural language at equal FLOPs, with compute-optimal ratios reaching up to roughly $400$ tokens per parameter (Luo et al., 9 Oct 2025).
Refined laws, such as Farseer, further include non-separable interactions and dynamic data exponents, significantly lowering prediction errors and supporting reliable extrapolation across orders of magnitude in model and data scale (Li et al., 12 Jun 2025).
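To make non-separability concrete, the sketch below shows one hypothetical way to let the data exponent drift with model size. This is not Farseer's published parameterization; it only illustrates why such a surface cannot be written as a sum of independent $N$- and $D$-terms.

```python
import numpy as np

def nonseparable_loss(N, D, E=1.7, A=400.0, B=400.0,
                      alpha=0.34, beta0=0.28, gamma=0.01):
    """Illustrative non-separable surface (not Farseer's actual form):
    the data exponent drifts with model size, so the N- and D-terms
    no longer decompose into independent power laws."""
    beta = beta0 + gamma * np.log(N / 1e8)   # size-dependent data exponent
    return E + A / N**alpha + B / D**beta

# Under this toy form, larger models extract more from each training token.
print(nonseparable_loss(1e9, 1e11), nonseparable_loss(1e10, 1e11))
```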
2. Theoretical Justifications and Regimes
Statistical models shed light on why power laws emerge. Maloney et al. (Maloney et al., 2022) analytically derive scaling regimes by coupling the eigenspectrum of the natural data distribution (typically power-law-tailed) to model complexity via random features and ridge regression. Key implications:
- If the eigenvalue spectrum of the data covariance follows a power law $\lambda_i \propto i^{-(1+\alpha)}$, test loss asymptotically behaves as $N^{-\alpha}$ or $D^{-\alpha}$ (whichever resource is the bottleneck), saturating at a noise floor dictated by data entropy (a toy simulation of this setup is sketched at the end of this section).
- Optimal compute allocation occurs at equiparameterization ($N \approx D$), aligning with empirical findings that the lowest loss per FLOP comes when $N$ and $D$ are increased in lockstep.
- When model size or data exceed the effective dimension of the latent space (i.e., spectrum support), power-law scaling breaks and loss plateaus.
Farseer’s improvement arises from explicitly modeling the non-separability of the $N$–$D$ interaction, accurately capturing the empirical loss surfaces found in massive LLM training campaigns (Li et al., 12 Jun 2025).
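The toy simulation referenced in the list above: a linear teacher with a diagonal power-law covariance, approximated by ridge regression on linear random features. The dimensions, ridge constant, and 4:1 sample-to-feature ratio are arbitrary choices made for illustration; the qualitative point is that test loss falls roughly as a power law until the latent dimension is exhausted.

```python
import numpy as np

rng = np.random.default_rng(0)
M, a = 2048, 0.6                              # latent dim, spectrum exponent
lam = np.arange(1, M + 1) ** (-(1 + a))       # power-law data spectrum
w_true = rng.normal(size=M)                   # linear "teacher"

def test_loss(n_features, n_samples, ridge=1e-6):
    """Ridge regression on linear random features of the teacher."""
    X = rng.normal(size=(n_samples, M)) * np.sqrt(lam)   # train inputs
    Xt = rng.normal(size=(1000, M)) * np.sqrt(lam)       # test inputs
    y, yt = X @ w_true, Xt @ w_true
    V = rng.normal(size=(M, n_features)) / np.sqrt(M)    # random projection
    F, Ft = X @ V, Xt @ V
    w = np.linalg.solve(F.T @ F + ridge * np.eye(n_features), F.T @ y)
    return np.mean((Ft @ w - yt) ** 2)

for n in (32, 64, 128, 256, 512):
    print(f"features={n:4d}  loss={test_loss(n, 4 * n):.4f}")
```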
3. Domain- and Architecture-Specific Scaling Patterns
Empirical research demonstrates that scaling laws broadly generalize across LLM families but with substantial domain or architecture-induced shifts in coefficients and exponents.
- MoE architectures exhibit scaling exponents essentially identical to dense models, but a larger fraction of compute should be allocated to model scale than in the dense case (Wang et al., 8 Oct 2024). MoE models achieve up to $16\%$ better data efficiency at fixed compute than comparable dense models.
- Code LLMs demand significantly higher data-to-parameter ratios: the optimal $D/N$ grows super-linearly with compute; mixture experiments confirm that a moderate natural-language fraction can help small code models in data-scarce regimes but degrades performance at large scale (Luo et al., 9 Oct 2025). A back-of-the-envelope allocation under such ratios is sketched after this list.
- Sparse/pruned models interpolate between dense and fully sparse regimes, with performance given by a generalized law $L(N, D, S)$, where $S$ is the sparsity level. This law recovers dense scaling exactly at $S = 0$ and matches MoE/pruning curves up to high sparsity, enabling an optimal resource trade-off between $N$, $D$, and $S$ for a fixed compute budget (Hossain et al., 8 Aug 2025).
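The allocation sketch referenced above: using the standard approximation that training cost is roughly $C \approx 6ND$ FLOPs, a target tokens-per-parameter ratio pins down both $N$ and $D$ for a given budget. The ratios below (about $20$ for natural language, a few hundred for code) follow the text; the specific budget is arbitrary.

```python
import math

def allocate(compute_flops, tokens_per_param):
    """Split a fixed training budget between parameters N and tokens D,
    using the standard C ~ 6*N*D training-FLOP approximation."""
    n = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    return n, tokens_per_param * n

budget = 1e23  # FLOPs, illustrative
for label, ratio in [("natural language, D/N ~ 20", 20),
                     ("code, D/N ~ 200", 200)]:
    n, d = allocate(budget, ratio)
    print(f"{label}: N ~ {n:.1e} params, D ~ {d:.1e} tokens")
```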
4. Extensions: Temporal, Downstream, and Context-Aware Laws
Modern scaling law research now addresses aspects beyond upstream loss:
- Temporal Laws: These model the evolution of test loss through training, at both sequence-level and token-position granularity. The temporal scaling law uses a dynamic hyperbolic fit for per-token loss, parameterized by position and training step. Unlike static power-law fits, the temporal law achieves near-perfect $R^2$ on both in-domain and out-of-distribution sets, enabling early hyperparameter selection and accurate prediction of training trajectories (Xiong et al., 27 Apr 2024).
- Downstream Performance: For metrics such as BLEU (MT), benchmark accuracy (MMLU), or few-shot task scores, specialized scaling forms are required.
- Machine translation scales as a log-law in BLEU (if pretraining and target distributions are well aligned), but cross-entropy can monotonically decrease even when BLEU no longer improves, indicating distribution misalignment risk (Isik et al., 6 Feb 2024).
- The “Performance Law” directly predicts MMLU from model shape, data scale, and a training-instability penalty, achieving a 3-point average error on held-out models and extending to both dense and MoE architectures (Wu et al., 19 Aug 2024).
- Context-aware Laws: Downstream task performance depends jointly on training compute and the length of in-context demonstrations. These effects are captured by a saturating power law in compute, a saturating power law in context length, and a context-window penalty (a minimal functional sketch follows this list). Different tasks exhibit distinct “characteristic context” scales; e.g., arithmetic reasoning benefits from many-shot prompting, while translation saturates after a few demonstrations (Montgomery et al., 16 Oct 2025).
- Loss-to-loss Laws: The downstream loss on any evaluation task is a universal, shifted power law of pretraining validation loss; the dominant influence is the pretraining data distribution—not model size, architecture, or hyperparameters (Mayilvahanan et al., 17 Feb 2025).
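The functional sketch referenced in the context-aware bullet above, written as one hypothetical composition of the three ingredients named there: a saturating gain from compute, a saturating gain from demonstrations, and a penalty past the context window. The functional form and every constant are assumptions chosen for illustration, not the fitted law from Montgomery et al.

```python
def context_aware_score(compute, n_shots, s_max=0.9, a=0.3, b=0.5,
                        c0=1e20, k0=4.0, window=64, penalty=0.005):
    """Hypothetical context-aware scaling form (illustration only):
    saturating gain from training compute, saturating gain from
    in-context demonstrations, and a penalty once demonstrations
    overflow the usable context window."""
    gain_compute = 1.0 - (1.0 + compute / c0) ** (-a)
    gain_context = 1.0 - (1.0 + n_shots / k0) ** (-b)
    overflow = max(0.0, n_shots - window)
    return s_max * gain_compute * gain_context - penalty * overflow

# Scores rise with shots, then fall once the assumed context window overflows.
for shots in (0, 2, 8, 32, 128):
    print(shots, round(context_aware_score(1e22, shots), 3))
```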
5. Empirical Methodologies and Validation
Scaling law analysis unites large-scale grid sweeps, statistical surface fitting, and robust extrapolation validation.
- Comprehensive grids of (model size, dataset size) are constructed, with held-out validation runs and test losses measured on high-quality data splits (Li et al., 12 Jun 2025, Luo et al., 9 Oct 2025).
- Nonlinear and log-linear fits are benchmarked by mean relative error, goodness of fit, and the magnitude of extrapolation errors on out-of-domain (off-grid or ultra-large-scale) data points (Li et al., 12 Jun 2025); a toy version of this extrapolation protocol is sketched after this list.
- Modern methodologies emphasize piecewise and differential fitting, ensuring that fitted exponents and offset constants generalize across regimes rather than overfit a narrow slice.
- Newer laws (e.g., Farseer) demonstrably outperform older power-law models, reducing average relative error by several multiples and maintaining percent-level accuracy when predicting models larger than any in the training set (Li et al., 12 Jun 2025, Luo et al., 9 Oct 2025).
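A toy version of the fit-small, test-large protocol, with entirely made-up numbers: fit a single-variable power law on a small compute grid in log space, extrapolate one order of magnitude, and score the relative error against a held-out "ground truth" that is itself synthetic here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "measured" losses on a small compute grid (floor E assumed known).
E, A, c = 2.0, 5.0, 0.05
compute = np.array([1e19, 3e19, 1e20, 3e20, 1e21])
loss = E + A * compute ** (-c) * np.exp(rng.normal(0.0, 0.01, compute.size))

# Fit log(reducible loss) against log(compute) on the small grid only.
slope, intercept = np.polyfit(np.log(compute), np.log(loss - E), 1)

# Extrapolate one order of magnitude beyond the grid and score the error.
c_big = 1e22
pred = E + np.exp(intercept) * c_big ** slope
true = E + A * c_big ** (-c)
print(f"relative extrapolation error: {abs(pred - true) / true:.3%}")
```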
6. Practical Guidelines and Trade-Offs in LLM Design
Scaling laws provide closed-form formulas and explicit recipes for model design, resource allocation, and cost–performance optimization.
- For a fixed compute budget, optimal model/data allocations follow power laws of compute, $N^{*} \propto C^{a}$ and $D^{*} \propto C^{b}$, with fitted exponents $a$ and $b$ both close to $0.5$ for vanilla dense models (Shuai et al., 2 Dec 2024).
- In code LLMs, always budget for a much higher $D/N$; at large FLOP budgets the compute-optimal ratio is on the order of $200$ tokens per parameter or more (Luo et al., 9 Oct 2025).
- Sparse/Pruned networks: Use joint optimization of $N$, $D$, and sparsity $S$ via the generalized law to maximize gains for a specific workload (Hossain et al., 8 Aug 2025).
- For merging specialist models (such as adapters), performance follows a predictable scaling law in the number of merged experts. Most gains arrive by roughly $6$ experts, and the diminishing returns thereafter are mathematically predictable. This enables an optimal mix of expert acquisition and base-model scaling for efficient ensemble construction (Wang et al., 29 Sep 2025).
- For downstream score prediction (e.g., MMLU), performance is not a trivial function of loss. Use log-linear regression on model configuration and data scale, and apply hardware/instability discounts; this supports performance planning and leakage/quality diagnosis (Wu et al., 19 Aug 2024). A minimal regression sketch of this approach follows this list.
- In few-shot and context-extended deployment, leverage context-aware scaling to decide between investing in model scale or context-window extensions based on the task’s saturation profile (Montgomery et al., 16 Oct 2025).
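The regression sketch referenced above. Everything here is hypothetical: the feature set, the made-up model sizes and scores, and the plain least-squares fit stand in for the published Performance Law's own features and fitted coefficients.

```python
import numpy as np

# Illustrative only: regress a benchmark score on log model size, log data
# size, and a training-instability feature. Data and coefficients are made up.
log_n = np.log(np.array([1e9, 3e9, 7e9, 13e9, 34e9, 70e9]))
log_d = np.log(np.array([3e11, 6e11, 1e12, 1.4e12, 2e12, 2e12]))
instab = np.array([0.0, 0.0, 0.1, 0.0, 0.2, 0.1])     # e.g. loss-spike penalty
mmlu = np.array([0.30, 0.38, 0.46, 0.55, 0.62, 0.68])  # hypothetical scores

X = np.column_stack([log_n, log_d, instab, np.ones_like(log_n)])
coef, *_ = np.linalg.lstsq(X, mmlu, rcond=None)
pred = X @ coef
print("mean absolute error (points):", np.mean(np.abs(pred - mmlu)) * 100)
```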
7. Open Directions and Future Methodologies
Recent position work advocates recasting scaling-law discovery as an inverse problem: given a desired performance threshold, find the minimal ingredients (model/data/compute/annotation) needed to achieve it under resource constraints, potentially exposing stage-wise and hybrid scaling laws that extend well beyond classical power-law regimes (Verma et al., 9 Sep 2025). A toy numeric inversion of this framing appears at the end of this section. New target areas include:
- Data selection scaling: Discovering minimal necessary data for target metrics under optimal curation.
- Inference scaling: Jointly optimizing over architectures, inference strategies, and context to minimize cost for a given utility.
- Machine unlearning scaling: Quantifying cost–performance trade-offs for removing (unlearning) data while retaining utility on retained data.
Actual derivations, empirical coefficients, and validation in these new inverse-problem contexts remain open for future research.
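As a toy example of the inverse framing, the sketch below inverts a Chinchilla-style frontier numerically: given a target loss, bisect for the smallest compute budget whose compute-optimal model reaches it. The coefficients are in the neighborhood of the published Chinchilla fit, while the fixed 20 tokens-per-parameter frontier and the $C \approx 6ND$ rule are simplifying assumptions; none of this reflects the methodology proposed in the position paper.

```python
import math

# Toy inverse problem: smallest compute budget whose compute-optimal model
# reaches a target loss, under an assumed Chinchilla-style fit.
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28   # illustrative fit

def frontier_loss(compute):
    """Loss along a simplified compute-optimal frontier (C ~ 6*N*D, D = 20*N)."""
    n = math.sqrt(compute / (6 * 20))
    d = 20 * n
    return E + A / n**alpha + B / d**beta

def min_compute(target, lo=1e18, hi=1e27):
    """Bisect (in log-compute) for the cheapest budget hitting the target loss."""
    if frontier_loss(hi) > target:
        return None                      # target unreachable in this range
    while hi / lo > 1.01:
        mid = math.sqrt(lo * hi)
        lo, hi = (mid, hi) if frontier_loss(mid) > target else (lo, mid)
    return hi

print(f"compute for loss 2.0: {min_compute(2.0):.2e} FLOPs")
```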
Scaling laws in LLMs currently anchor quantitative planning for model training, compute investment, and deployment policy. The research trajectory now moves toward generalizing these laws to cover sparsity, domain shift, explicit downstream utility, temporally evolving training, context-driven inference, and compositional/ensemble intelligence, with the aim of precisely mapping performance landscapes for next-generation AI systems.