Compute-Optimal Scaling Law
- A compute-optimal scaling law is a mathematical framework that defines the optimal balance between neural network parameters and training data using empirical power-law relationships.
- The formulation minimizes a loss function L(N, D)=E+A/N^α+B/D^β under a compute constraint (N×D ∝ C), yielding domain-specific exponents that guide model design.
- Practical applications include optimizing model performance in language modeling, vision, reinforcement learning, and other tasks by balancing training efficiency and inference costs.
A compute-optimal scaling law describes how to allocate a fixed computational budget between neural network model size and training dataset size to achieve minimal loss in large-scale deep learning. This principle enables the design of models and training regimens that maximize the return on computational investment for a given domain and task. Compute-optimal scaling laws are formulated and empirically tested across modalities including language modeling, vision, reinforcement learning, motion forecasting, and symbolic regression, with distinct but related exponents encoding the efficiency of parameter and data scaling under a fixed compute constraint.
1. Mathematical Formulation
Compute-optimal scaling laws are derived from empirical power-law relationships between model performance and the primary scaling axes: parameter count (N), dataset size (D), and total training compute (C). The canonical functional form for the loss is L(N, D) = E + A/N^α + B/D^β, where L is the loss (e.g., cross-entropy), E is the irreducible minimum, A and B are prefactors, and α and β encode how loss improves with N and D respectively.
Under a compute constraint (C, measured in FLOPs), typically approximated as C ≈ 6ND, the optimization problem becomes finding (N, D) that minimizes L(N, D) subject to 6ND = C. Analytical minimization yields characteristic exponents N_opt ∝ C^a and D_opt ∝ C^b, with a = β/(α+β) and b = α/(α+β), and the data-to-model allocation governed by the ratio b/a = α/β. For symmetric exponents (α = β), this ratio is approximately 1, but it can deviate substantially between domains and even between subskills within the same modality (Baniodeh et al., 9 Jun 2025, Sengupta et al., 17 Feb 2025, Porian et al., 27 Jun 2024, Roberts et al., 13 Mar 2025).
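A minimal sketch of this constrained minimization, assuming the canonical loss form above and the C ≈ 6ND cost model. The closed form follows from the stationarity condition αA/N^α = βB/D^β; all constants passed in are illustrative placeholders, not fitted values from the cited papers:

```python
# Sketch: analytic compute-optimal allocation under L(N, D) = E + A/N**alpha + B/D**beta
# with the constraint C = 6*N*D (FLOPs approximation).

def optimal_allocation(C, alpha, beta, A, B):
    """Return (N_opt, D_opt) minimizing A/N**alpha + B/D**beta subject to 6*N*D = C."""
    a = beta / (alpha + beta)   # exponent of N_opt in C
    # Prefactor from equating marginal losses: alpha*A/N**alpha = beta*B/D**beta.
    G = (alpha * A / (beta * B)) ** (1.0 / (alpha + beta))
    N_opt = G * (C / 6.0) ** a
    D_opt = (C / 6.0) / N_opt    # the constraint fixes D once N is chosen
    return N_opt, D_opt
```

For α = β and A = B the allocation is symmetric (N_opt = D_opt), recovering the a = b = 1/2 "Chinchilla" regime; asymmetric exponents tilt the budget toward whichever axis improves loss faster.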
2. Empirical Determination of Exponents
The exponents α and β, and thereby a and b, are empirically estimated via large-scale sweeps over N, D, and C. For example, in autonomous driving agents performing motion forecasting and planning, a detailed fit yields:
- N_opt ∝ C^0.63, D_opt ∝ C^0.44 (Baniodeh et al., 9 Jun 2025)
For general language modeling, the latest robust multi-dataset, multi-hyperparameter synthesis gives a = b ≈ 0.5 and a nearly fixed token-to-parameter ratio D/N when all sources of discrepancy, such as omitted decoder head FLOPs, warmup duration, and scale-dependent optimizer tuning, are properly corrected (Porian et al., 27 Jun 2024). In symbolic regression, the optimal exponents are a ≈ 0.40, b ≈ 0.43, with a token-to-parameter ratio exhibiting only mild drift over a broad compute range (Otte et al., 30 Oct 2025).
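A minimal sketch of such an exponent fit, assuming the irreducible loss E has already been estimated: at fixed large D the reducible loss follows L(N) − E ≈ A/N^α, so an ordinary least-squares fit in log-log space recovers α. The sweep below is synthetic, generated from a known power law purely to illustrate the procedure:

```python
# Sketch: estimating a scaling exponent from a model-size sweep.
# log(L - E) = log A - alpha * log N, so the OLS slope in log-log space is -alpha.
import math

def fit_exponent(ns, losses, e_hat):
    """OLS fit of log(loss - e_hat) vs log(n); returns the estimated exponent."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(l - e_hat) for l in losses]  # subtract irreducible loss first
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    return -slope

# Synthetic sweep drawn from L(N) = 1.8 + 400 / N**0.34
ns = [1e6, 3e6, 1e7, 3e7, 1e8]
losses = [1.8 + 400.0 / n**0.34 for n in ns]
alpha_hat = fit_exponent(ns, losses, e_hat=1.8)  # recovers alpha ≈ 0.34
```

In practice E, A, and α are fit jointly (and analogously for the D axis), but the log-log regression above is the core of every such sweep.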
The following table summarizes representative exponents across domains:
| Domain | $a$ (N exponent) | $b$ (D exponent) | Source |
|---|---|---|---|
| Language Modeling | $0.50$ | $0.50$ | (Porian et al., 27 Jun 2024) |
| Motion Forecasting | $0.63$ | $0.44$ | (Baniodeh et al., 9 Jun 2025) |
| Symbolic Regression | $0.40$ | $0.43$ | (Otte et al., 30 Oct 2025) |
| Single-Agent RL | $0.4$–$0.8$ | - | (Hilton et al., 2023) |
| AlphaZero RL | $0.62$ | - | (Neumann et al., 2022) |
| ViT Image Classification | $0.22$ (width), $0.45$ (depth) | - | (Alabdulmohsin et al., 2023) |
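Read as allocation rules of thumb, these exponents say how a growing budget should be split. A small sketch, treating the tabulated values as the compute exponents a and b (N_opt ∝ C^a, D_opt ∝ C^b):

```python
# Sketch: per-domain allocation of a 10x compute increase, using the exponent
# pairs (a, b) from the table above. A 10x budget multiplies the optimal model
# size by 10**a and the optimal token count by 10**b.
domain_exponents = {
    "Language Modeling":   (0.50, 0.50),
    "Motion Forecasting":  (0.63, 0.44),
    "Symbolic Regression": (0.40, 0.43),
}

def growth_per_10x(a, b):
    """(model-size multiplier, data-size multiplier) per 10x compute."""
    return 10.0 ** a, 10.0 ** b

for domain, (a, b) in domain_exponents.items():
    n_mul, d_mul = growth_per_10x(a, b)
    print(f"{domain}: model x{n_mul:.2f}, data x{d_mul:.2f}")
```

Language modeling splits the increase evenly (both ×3.16), while motion forecasting tilts toward model size (×4.27 vs ×2.75 for data), making the domain differences concrete.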
3. Generalization, Quantization, and Performance Bounds
Compute-optimal scaling also governs the generalization gap and the quantizability of models. Under Chinchilla-level scaling (D ∝ N), the token-wise generalization gap—the excess population loss over the empirical loss—provably shrinks with scale
because both the loss variance and quantization gap terms decay with increased parameterization and training set size. Specifically, the Freedman-type martingale bound establishes that larger models, when scaled according to compute-optimal allocation, generalize better and are more amenable to quantization (Finzi et al., 21 Apr 2025).
4. Inference-Time and Task-Specific Trade-Offs
Compute-optimal scaling laws are principally derived for training time; however, inference-time compute efficiency introduces a further axis of optimization. With a fixed inference budget, performance can be traded between model size and the number of samples (e.g., trajectory rollouts in planning). There exists a crossover point such that for constrained FLOPs, exhaustive sampling from a smaller model can outperform a larger model, but beyond this point, increasing model size is more efficient (Baniodeh et al., 9 Jun 2025). Furthermore, different tasks or skill domains may have distinct scaling exponents and optimal trade-offs. For instance, knowledge QA tasks exhibit larger compute-optimal model sizes (they are more parameter-hungry), while code generation tasks are more data-efficient and benefit from smaller models trained on more tokens. This skill dependence implies that validation set composition can shift compute-optimal allocations by as much as 50% for the same compute budget (Roberts et al., 13 Mar 2025).
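The inference-time crossover can be sketched with a toy model: under a fixed inference budget, do we prefer many rollouts from a small model or one rollout from a large model? The failure model below (each rollout fails independently with probability c/N^α) and every constant are hypothetical illustrations, not quantities from the cited papers:

```python
# Sketch: model size vs sample count under a fixed inference FLOP budget.

def best_of_k_success(n_params, budget, alpha=0.3, c=2.0, cost_per_sample=2.0):
    """Success probability of best-of-k sampling at model size n_params,
    where k is however many samples the FLOP budget affords."""
    k = max(1, int(budget / (cost_per_sample * n_params)))  # affordable rollouts
    p_fail = min(1.0, c / n_params ** alpha)                # per-rollout failure
    return 1.0 - p_fail ** k                                # best-of-k succeeds unless all fail

budget = 1e7
small = best_of_k_success(n_params=1e4, budget=budget)  # ~500 cheap rollouts
large = best_of_k_success(n_params=4e6, budget=budget)  # a single strong rollout
# In this toy regime, exhaustive sampling from the smaller model comes out ahead.
```

The comparison illustrates the qualitative claim only; locating the real crossover requires the domain's fitted exponents and an empirical quality-per-sample curve.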
5. Theoretical Underpinnings and Information-Theoretic Foundations
The structure and exponents of compute-optimal scaling laws are now being linked to underlying information-theoretic and statistical mechanics models. For example, the optimal "Chinchilla rule" arises in both:
- A graph-based analogy to decoding in LDPC codes, where the learning of concepts from text is mapped to iterative belief propagation, and finite-size scaling arguments dictate the $1/2$ exponent (Nayak et al., 2 Oct 2024).
- Information-theoretic bounds for shallow networks, where the compute-optimal error decreases as a power law in compute, with the optimal allocation maintaining a fixed balance between model size and dataset size (Jeon et al., 2022).
These theoretical perspectives align with, and in some cases predict, the empirical exponents observed across domains.
6. Domain-Specific and Architectural Extensions
Although the core compute-optimal scaling law holds across a range of tasks, significant deviations arise in certain architectures, modalities, and operational regimes:
- Mixture-of-Experts and sparse models display bifurcated or modified law structure, demanding new optimization axes (e.g., number of experts, routing decisions).
- In vision and multimodal settings, exponents are sensitive to architectural parameters like patch size, depth, and hidden dimension, necessitating joint shape-compute optimization (Alabdulmohsin et al., 2023).
- In symbolic regression and motion generation, steeper loss-versus-compute exponents indicate that compute is converted into performance improvement more rapidly than in language modeling (Otte et al., 30 Oct 2025, Lu et al., 19 Dec 2024).
- In reinforcement learning, environment horizon and domain structure alter the scaling, with compute exponents spanning a broader range ($0.4$–$0.8$ in single-agent settings) depending on task horizon and intrinsic difficulty (Hilton et al., 2023, Neumann et al., 2022).
7. Practical Guidelines and Limitations
Applying compute-optimal scaling laws in practice involves:
- Empirically fitting small-scale experiments to extract exponents for a given model, data, and objective.
- Allocating a fixed compute budget between and according to the inferred and exponents.
- Validating predicted loss curves on held-out data to ensure power-law adherence.
- Accounting for operational constraints (memory, inference cost, data availability) and adapting the allocation accordingly.
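The validation step above can be sketched as a fit-then-extrapolate check: fit a power law on small-scale runs, then compare its prediction against a held-out larger run. The synthetic data (a known reducible-loss law, ignoring the irreducible term E) and the 3% threshold are illustrative assumptions:

```python
# Sketch: validating power-law adherence by extrapolating a small-scale fit.
import math

def fit_power_law(cs, losses):
    """OLS fit of log L = log K - g*log C; returns (K, g)."""
    xs = [math.log(c) for c in cs]
    ys = [math.log(l) for l in losses]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    return math.exp(my - slope * mx), -slope

cs_fit = [1e17, 3e17, 1e18, 3e18]              # small-scale fit region (FLOPs)
losses_fit = [5.0 * c ** -0.05 for c in cs_fit]
K, g = fit_power_law(cs_fit, losses_fit)

c_test, loss_test = 1e20, 5.0 * 1e20 ** -0.05  # held-out larger run
rel_err = abs(K * c_test ** -g - loss_test) / loss_test
power_law_holds = rel_err < 0.03               # flag extrapolation failures
```

A real check would fit the full L(N, D) form with its irreducible term and compare against measured held-out loss; the structure of the loop is the same.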
Open challenges include modeling inference-aware scaling, data curation effects, adaptation to hybrid architectures, hardware-specific scaling dynamics, and skill-mixed (multi-objective) training (Sengupta et al., 17 Feb 2025). The laws also assume idealized regimes (large scale, power-law loss falloff, unlimited data), so extrapolation beyond validated regimes or to dissimilar domains remains a point of active research.
References:
- "Scaling Laws of Motion Forecasting and Planning -- A Technical Report" (Baniodeh et al., 9 Jun 2025)
- "Compute-Optimal LLMs Provably Generalize Better With Scale" (Finzi et al., 21 Apr 2025)
- "Towards Scaling Laws for Symbolic Regression" (Otte et al., 30 Oct 2025)
- "Resolving Discrepancies in Compute-Optimal Scaling of LLMs" (Porian et al., 27 Jun 2024)
- "How to Upscale Neural Networks with Scaling Law? A Survey and Practical Guidelines" (Sengupta et al., 17 Feb 2025)
- "Compute Optimal Scaling of Skills: Knowledge vs Reasoning" (Roberts et al., 13 Mar 2025)
- "A Dynamical Model of Neural Scaling Laws" (Bordelon et al., 2 Feb 2024)
- "4+3 Phases of Compute-Optimal Neural Scaling Laws" (Paquette et al., 23 May 2024)
- "An Information Theory of Compute-Optimal Size Scaling, Emergence, and Plateaus in LLMs" (Nayak et al., 2 Oct 2024)
- "An Information-Theoretic Analysis of Compute-Optimal Neural Scaling Laws" (Jeon et al., 2022)
- "Scaling laws for single-agent reinforcement learning" (Hilton et al., 2023)
- "Scaling Laws for a Multi-Agent Reinforcement Learning Model" (Neumann et al., 2022)
- "Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design" (Alabdulmohsin et al., 2023)
- "ScaMo: Exploring the Scaling Law in Autoregressive Motion Generation Model" (Lu et al., 19 Dec 2024)