Compute-Optimal Size Scaling

Updated 20 March 2026
  • Compute-optimal size scaling is a strategy that balances model parameters and training tokens to minimize loss under a fixed compute budget.
  • Empirical methods such as training-curve envelopes and iso-FLOP profiles confirm a balanced power-law relation with exponents near 0.5 for both parameters and data.
  • This principle has reshaped deep learning practices by enabling more efficient training, reduced inference costs, and robust performance across diverse domains.

Compute-optimal size scaling is the strategy for allocating a fixed training compute budget between model size (number of parameters) and data exposure (number of training tokens) so as to minimize loss in large-scale neural models. The principle is now foundational for deep learning model design, especially for large language models (LLMs), and has been extended to multimodal transformers, protein language models, reinforcement learning, and generative modeling. The canonical “Chinchilla rule” asserts that, for next-token prediction with a Transformer under a fixed compute budget, the lowest loss is achieved when model size and the total number of training tokens are scaled in tandem. This contradicts earlier strategies that favored disproportionately large models. Empirical validation of the rule has shifted both the design and the practical deployment of state-of-the-art models.

1. Formulation of the Compute-Optimal Frontier

Let $N$ denote the trainable parameter count and $D$ the number of tokens processed during pretraining. Given a total compute budget $C$ (measured in FLOPs), the standard approximation is

$$C \approx 6 N D$$

where the factor 6 approximates the per-parameter, per-token cost of the forward and backward passes (roughly 2 FLOPs forward and 4 backward). The compute-optimal scaling problem is

$$(N_{\text{opt}}(C), D_{\text{opt}}(C)) = \arg\min_{N, D} L(N, D) \quad \text{s.t.} \quad 6 N D = C$$

where $L(N, D)$ is the final validation loss.

Empirically and theoretically, the optimal allocation exhibits power-law structure:

$$N_{\text{opt}}(C) \propto C^{\alpha}, \qquad D_{\text{opt}}(C) \propto C^{\beta}$$

In the language modeling setting, the best-fit exponents are $\alpha \approx 0.5$ and $\beta \approx 0.5$, so $N$ and $D$ should be scaled in lockstep as compute increases (Hoffmann et al., 2022). This is the defining prescription of compute-optimal size scaling.
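One consequence of the budget constraint is worth making explicit. Under the $C \approx 6 N D$ approximation, substituting the two power laws into the constraint forces the exponents to sum to one. The prefactors $k_N$ and $k_D$ below are illustrative names introduced for this sketch, not notation from the cited papers.

```latex
\begin{aligned}
N_{\text{opt}}(C) &= k_N C^{\alpha}, \qquad D_{\text{opt}}(C) = k_D C^{\beta}, \\
6\, N_{\text{opt}}(C)\, D_{\text{opt}}(C) &= 6\, k_N k_D\, C^{\alpha + \beta} = C \quad \text{for all } C \\
&\Longrightarrow \quad \alpha + \beta = 1, \qquad k_N k_D = \tfrac{1}{6}.
\end{aligned}
```

Exponents that are fitted independently from data need not sum to exactly one, since each is estimated with its own error.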

2. Empirical Derivation and Validation (“Chinchilla Rule”)

The “Chinchilla” work establishes this scaling law by three complementary methods (Hoffmann et al., 2022):

  • Training-curve envelope: For each compute budget $C$, interpolate the lower envelope of the observed loss curves across hundreds of Transformer models ($N$ from 70M to 16B, $D$ from 5B to 500B). The minimizing $(N, D)$ at each $C$ yields $\alpha = \beta = 0.50$ in the power-law fits.
  • Iso-FLOP profiles: For a fixed $C$, train models at a range of $N$. The $N$ with the lowest final loss again follows a power law, $N \propto C^{0.49}$, $D \propto C^{0.51}$.
  • Parametric loss fitting: Decompose

$$L(N, D) \approx E + \frac{A}{N^{\alpha_{\mathrm{params}}}} + \frac{B}{D^{\beta_{\mathrm{data}}}}$$

and solve for the constrained minimizer. Robust regression produces $\alpha \approx 0.46$, $\beta \approx 0.54$.
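The iso-FLOP procedure can be sketched on a synthetic loss surface. This is a minimal illustration, not the authors' pipeline: real iso-FLOP analysis trains a model at every point, whereas here the `loss` function is a stand-in built from the parametric form above, and the constants `E`, `A`, `B`, `ALPHA_PARAMS`, `BETA_DATA`, the grid, and the budget list are all assumptions chosen for illustration.

```python
import math

# Assumed, illustrative constants for L(N, D) = E + A / N**a_p + B / D**b_d.
E, A, B = 1.69, 406.4, 410.7
ALPHA_PARAMS, BETA_DATA = 0.34, 0.28


def loss(n_params: float, n_tokens: float) -> float:
    """Synthetic stand-in for the final loss of a trained model."""
    return E + A / n_params**ALPHA_PARAMS + B / n_tokens**BETA_DATA


def isoflop_optimum(compute: float, n_grid: list[float]) -> float:
    """Return the grid value of N with the lowest loss at fixed compute."""
    # At fixed compute, D is determined by N through C = 6 N D.
    return min(n_grid, key=lambda n: loss(n, compute / (6.0 * n)))


def fit_exponent(xs: list[float], ys: list[float]) -> float:
    """Least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den


# Log-spaced grid of candidate model sizes from 1e7 to 1e12 parameters.
N_GRID = [10 ** (7 + 5 * i / 4000) for i in range(4001)]
BUDGETS = [10.0**k for k in range(18, 25)]  # 1e18 ... 1e24 FLOPs

n_opt = [isoflop_optimum(c, N_GRID) for c in BUDGETS]
d_opt = [c / (6.0 * n) for c, n in zip(BUDGETS, n_opt)]

alpha_hat = fit_exponent(BUDGETS, n_opt)
beta_hat = fit_exponent(BUDGETS, d_opt)
print(f"N_opt ~ C^{alpha_hat:.3f}, D_opt ~ C^{beta_hat:.3f}")
```

With these assumed constants the recovered exponents come out near 0.45 and 0.55. That reflects the chosen synthetic surface rather than the 0.49 and 0.51 figures quoted for the real experiments.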

Experiments training a 70B-parameter Chinchilla model on 1.4T tokens (roughly the same compute as Gopher, with 280B parameters and 300B tokens) confirm that this allocation outperforms prior, larger LLMs across a wide range of downstream evaluations, while offering lower fine-tuning and inference cost (Hoffmann et al., 2022).
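As a quick arithmetic check, the $C \approx 6 N D$ rule can be applied to the two configurations quoted above. The helper name `train_flops` is hypothetical, and the result is only as accurate as the approximation itself.

```python
def train_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute with the C ~ 6 N D rule."""
    return 6.0 * n_params * n_tokens


# Parameter and token counts as quoted in the text.
MODELS = {
    "Chinchilla": (70e9, 1.4e12),
    "Gopher": (280e9, 300e9),
}

for name, (n, d) in MODELS.items():
    flops = train_flops(n, d)
    print(f"{name}: C ~ {flops:.2e} FLOPs, {d / n:.1f} tokens per parameter")
```

Under this approximation the two budgets agree to within about 17%, while the tokens-per-parameter ratio differs by more than an order of magnitude (about 20 versus about 1).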

3. Closed-Form Solution and Key Scaling Exponents

Given the loss ansatz $L(N, D) = E + \frac{A}{N^{\alpha_{\mathrm{params}}}} + \frac{B}{D^{\beta_{\mathrm{data}}}}$, the closed-form compute-optimal allocations are

$$N_{\text{opt}}(C) = G \left(\frac{C}{6}\right)^{\alpha}, \qquad D_{\text{opt}}(C) = G^{-1} \left(\frac{C}{6}\right)^{\beta}$$

with

$$\alpha = \frac{\beta_{\mathrm{data}}}{\alpha_{\mathrm{params}} + \beta_{\mathrm{data}}}, \quad \beta = \frac{\alpha_{\mathrm{params}}}{\alpha_{\mathrm{params}} + \beta_{\mathrm{data}}}, \quad G = \left(\frac{\alpha_{\mathrm{params}} A}{\beta_{\mathrm{data}} B}\right)^{1/(\alpha_{\mathrm{params}} + \beta_{\mathrm{data}})}$$
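The closed form follows from eliminating $D$ through the constraint and setting the derivative with respect to $N$ to zero. A short derivation is given below, writing $p = \alpha_{\mathrm{params}}$ and $q = \beta_{\mathrm{data}}$ for brevity (shorthand introduced here, not taken from the source):

```latex
\begin{aligned}
L(N) &= E + A N^{-p} + B \left(\frac{6N}{C}\right)^{q}
  && \text{(substituting } D = C/(6N)\text{)} \\
\frac{dL}{dN} &= -p A N^{-p-1} + q B \left(\frac{6}{C}\right)^{q} N^{q-1} = 0 \\
N^{p+q} &= \frac{p A}{q B} \left(\frac{C}{6}\right)^{q} \\
N_{\text{opt}}(C) &= \left(\frac{p A}{q B}\right)^{\frac{1}{p+q}} \left(\frac{C}{6}\right)^{\frac{q}{p+q}}
  = G \left(\frac{C}{6}\right)^{\alpha} \\
D_{\text{opt}}(C) &= \frac{C}{6 N_{\text{opt}}(C)} = G^{-1} \left(\frac{C}{6}\right)^{1 - \alpha}
  = G^{-1} \left(\frac{C}{6}\right)^{\beta}
\end{aligned}
```

The stationary point is unique, and $L$ grows without bound as $N \to 0$ or $N \to \infty$ at fixed $C$, so it is the constrained minimum.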

For observed values $\alpha_{\mathrm{params}} \sim 0.3$–$0.4$ and $\beta_{\mathrm{data}} \sim 0.3$–$0.4$, both exponents are close to 0.5. Thus, a doubling of compute should scale both $N$ and $D$ by roughly $\sqrt{2} \approx 1.41$ (equivalently, quadrupling compute doubles both); alternative ratios result in suboptimal loss (Hoffmann et al., 2022). This rule is robust to modeling details and was independently derived in information-theoretic and LDPC-based frameworks (Jeon et al., 2022; Nayak et al., 2024).
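The closed-form allocation can be evaluated numerically and checked against nearby points on the same iso-FLOP line. The sketch below is illustrative only: the constants `E`, `A`, `B`, `ALPHA_PARAMS`, and `BETA_DATA` are assumed values for the loss ansatz, the budget `C` is hypothetical, and the function names are not from any library.

```python
# Assumed, illustrative fit constants for L = E + A / N**a_p + B / D**b_d.
E, A, B = 1.69, 406.4, 410.7
ALPHA_PARAMS, BETA_DATA = 0.34, 0.28


def loss(n_params: float, n_tokens: float) -> float:
    """Parametric loss ansatz evaluated at (N, D)."""
    return E + A / n_params**ALPHA_PARAMS + B / n_tokens**BETA_DATA


def compute_optimal(compute: float) -> tuple[float, float]:
    """Closed-form (N_opt, D_opt) under the constraint 6 N D = C."""
    total = ALPHA_PARAMS + BETA_DATA
    alpha = BETA_DATA / total    # exponent of N_opt in C
    beta = ALPHA_PARAMS / total  # exponent of D_opt in C
    g = (ALPHA_PARAMS * A / (BETA_DATA * B)) ** (1.0 / total)
    n_opt = g * (compute / 6.0) ** alpha
    d_opt = (compute / 6.0) ** beta / g
    return n_opt, d_opt


C = 1e21  # hypothetical budget in FLOPs
n_opt, d_opt = compute_optimal(C)
print(f"N_opt ~ {n_opt:.3e}, D_opt ~ {d_opt:.3e}, loss ~ {loss(n_opt, d_opt):.4f}")

# Moving along the same iso-FLOP line in either direction should raise the loss.
for factor in (0.5, 2.0):
    n_alt = n_opt * factor
    d_alt = C / (6.0 * n_alt)
    print(f"N x {factor}: loss ~ {loss(n_alt, d_alt):.4f}")
```

The comparison at the end keeps $6 N D = C$ fixed while halving or doubling $N$, which is the sense in which alternative ratios are suboptimal.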

4. Contrasts with Earlier and Alternative Scaling Laws

Earlier guidance (e.g., Kaplan et al. 2020) proposed different exponents ($N \propto C^{0.73}$, $D \propto C^{0.27}$), leading to the practice of scaling model size aggressively while keeping dataset size nearly fixed. Careful analysis has attributed the discrepancy to three factors: failure to account for the FLOPs of the final linear layer, mis-scaled warmup duration, and the lack of scale-appropriate optimizer tuning (Porian et al., 2024). Once these are corrected, the Chinchilla law is consistently reproduced on multiple corpora and model scales.

Recent work further suggested a “unified” law in which only the product $N D$ (proportional to compute, since $C \approx 6 N D$) determines the final loss:

$$L(N, D) = \alpha \log(N D) + \beta$$

Here $\alpha$ and $\beta$ are fitted constants of this law, distinct from the allocation exponents of Section 1. The conclusion is that, under fixed compute, any allocation $(N, D)$ with the same product $N D$ reaches the same final loss, so choices along this frontier should be dictated by secondary considerations such as inference efficiency (Guo, 2024).

5. Implications for Model Design and Training Practice

The practical ramifications of compute-optimal size scaling are extensive:

  • Performance Maximization: Given $C$, set $N \sim k\,(C/6)^{0.5}$ and $D \sim k^{-1} (C/6)^{0.5}$ (with $k$ tuned to the observed loss surface) to minimize final pretraining loss.
  • Efficiency: Models sized according to this principle, such as Chinchilla, require significantly less compute for downstream usage while achieving state-of-the-art accuracy at the time of publication, as shown by a substantial margin over Gopher on the MMLU benchmark (Hoffmann et al., 2022).
  • Resource Allocation: Scaling both parameters and data equally avoids the diminishing returns regime associated with undertraining large models or overfitting to limited datasets.
  • Inference and Fine-tuning: Smaller, well-trained models are more inference-efficient for a fixed training compute, facilitating cost-effective deployment.
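The sizing rule in the first bullet can be turned into a small helper. The sketch below assumes equal exponents of 0.5, so that the tokens-per-parameter ratio stays constant as compute grows, and it assumes a default ratio of 20, which is simply the 1.4T / 70B ratio of the Chinchilla configuration quoted earlier. The function name `size_for_budget` and its `tokens_per_param` parameter are hypothetical.

```python
import math


def size_for_budget(compute: float, tokens_per_param: float = 20.0) -> tuple[float, float]:
    """Split a FLOP budget into (N, D) with D = tokens_per_param * N and C = 6 N D."""
    # 6 * N * (r * N) = C  =>  N = sqrt(C / (6 r))
    n_params = math.sqrt(compute / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


for budget in (1e21, 1e22, 1e23, 5.88e23):
    n, d = size_for_budget(budget)
    print(f"C = {budget:.2e} FLOPs -> N ~ {n:.2e} params, D ~ {d:.2e} tokens")
```

In this parameterization the constant $k$ from the first bullet corresponds to $1/\sqrt{r}$, where $r$ is the tokens-per-parameter ratio. A different loss surface would call for a different ratio.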

A contrast of results across modalities and domains demonstrates that the specific exponents can vary (see Table).

| Domain / Model Class | Exponent $\alpha$ (model) | Exponent $\beta$ (data) | Reference |
|---|---|---|---|
| LLMs (Chinchilla) | ≈ 0.5 | ≈ 0.5 | (Hoffmann et al., 2022) |
| Protein LMs | 0.27 | 0.71 | (Serrano et al., 2024) |
| NVS Transformers (SVSM) | 0.52 | 0.47 | (Kim et al., 24 Feb 2026) |
| RL Single-Agent (Procgen) | ≈ 0.4–0.8 (env.-dependent) | not listed | (Hilton et al., 2023) |

The determination of exponents in each setting is empirical and may depend on dataset complexity, input dimension, or architecture class.
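To make the differences between domains concrete, the exponents in the table can be converted into growth factors for a tenfold increase in compute. This is plain arithmetic on the tabulated values, with `growth_factors` a hypothetical helper name. The RL row is left out because only a range is given for it.

```python
# (alpha, beta) pairs copied from the table above.
EXPONENTS = {
    "LLMs (Chinchilla)": (0.5, 0.5),
    "Protein LMs": (0.27, 0.71),
    "NVS Transformers (SVSM)": (0.52, 0.47),
}


def growth_factors(compute_multiplier: float, alpha: float, beta: float) -> tuple[float, float]:
    """Multipliers for N and D when compute grows by compute_multiplier."""
    return compute_multiplier**alpha, compute_multiplier**beta


for domain, (alpha, beta) in EXPONENTS.items():
    n_mult, d_mult = growth_factors(10.0, alpha, beta)
    print(f"{domain}: 10x compute -> N x {n_mult:.2f}, D x {d_mult:.2f}")
```

For the balanced case each quantity grows by about 3.16x, whereas the protein exponents put most of the extra compute into data.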

6. Extensions, Caveats, and Open Problems

Compute-optimal size scaling forms the basis for optimal allocation but is not universal across all architectures, domains, or tasks. Notable extensions and caveats include:

  • Architectural Variants: Sparse models and Mixture-of-Experts require appropriately modified scaling laws; e.g., for MoE, an expansion factor must be introduced into the optimality criterion (Sengupta et al., 17 Feb 2025).
  • Skill Dependence: Compute-optimal $N$ and $D$ can be skill-dependent; e.g., code generation tasks are more data-hungry, while knowledge QA is more capacity-hungry, altering optimal scaling curves by up to ±50% depending on the validation set composition (Roberts et al., 13 Mar 2025).
  • Inference-Optimal Frontiers: In inference-heavy contexts, the optimal allocation may favor smaller $N$ and larger $D$, maximizing sample efficiency under downstream compute constraints (Guo, 2024).
  • Non-Language Domains: In protein language modeling, the exponents favor a much more aggressive scaling of $D$ over $N$, reflecting a rapidly saturating performance plateau in $N$ (Serrano et al., 2024).
  • Adaptive and Dynamic Schedulers: For models capable of adjusting shape (width, depth) during training, adaptive “shape scheduling” can further reduce required compute by up to 40–60% vs. static sizing, outpacing the traditional compute-optimal frontier (Anagnostidis et al., 2023).
  • Test-Time Scaling: Compute-optimal size scaling can be framed analogously for inference policies, determining test-time budget allocation across candidate answers or search breadth (Liu et al., 10 Feb 2025, Snell et al., 2024).

7. Theoretical Foundations and Universality

From an information-theoretic perspective, compute-optimal scaling emerges as the point where misspecification error (from finite capacity) and estimation error (from finite data) are balanced (Jeon et al., 2022). A bipartite “concept learning” model supports the universality of the $\alpha = 0.5$ exponent under broad graph-theoretic assumptions (Nayak et al., 2024). Experimentally, normalized loss curves of compute-optimally trained networks collapse onto universal curves across model sizes and architectures, providing diagnostic and predictive power for practitioners. This supports the notion that power-law scaling and compute-optimality are robust and transferable principles in deep learning (Qiu et al., 2 Jul 2025).

