Compute-Optimal Model Size

Updated 4 April 2026
  • Compute-optimal model size is the parameter count that minimizes task loss under a fixed compute budget, as estimated from empirical scaling laws.
  • It relies on power-law relationships among loss, model size, and data quantity to determine the optimal allocation, as in Chinchilla and in specialized-domain settings.
  • Practical guidelines incorporate adjustments for domain-specific tasks and fine-tuning strategies like LoRA, balancing pretraining and inference compute requirements.

A compute-optimal model size is the parameter count $N^*(C)$ that minimizes task loss (or maximizes downstream utility) subject to a fixed compute budget $C$ (measured in floating-point operations, FLOP) for pre-training, fine-tuning, or inference. This concept is grounded in empirical and theoretical scaling laws for parameter–data tradeoffs in neural networks, and it underpins efficient resource allocation for large-scale models in natural language processing, computer vision, protein modeling, embedding specialization, deep reinforcement learning, and test-time scaling.

1. Canonical Compute-Optimal Scaling Laws

Classical compute-optimal scaling arises from empirical power-law fits to the relationship between generalization loss $L(N, D)$, model size $N$, and data quantity $D$, subject to a fixed-compute constraint $C \approx \kappa N D$. The most widely used ansatz for language modeling loss is

$$L(N,D) \approx L_{\infty} + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}$$

with constants $\alpha, \beta > 0$ determined by grid sweeps over $N$ and $D$. The optimal allocation of compute is determined by Lagrangian minimization under this constraint, leading to the Chinchilla scaling law (Hoffmann et al., 2022, Porian et al., 2024, Nayak et al., 2024):

$$N^*(C) \propto C^{\beta/(\alpha+\beta)}, \qquad D^*(C) \propto C^{\alpha/(\alpha+\beta)}$$

Empirical fits (e.g., Hoffmann et al., 2022) yield $\alpha \approx 0.34$, $\beta \approx 0.28$, so that

$$N^*(C) \propto C^{0.45}, \qquad D^*(C) \propto C^{0.55}$$

In practice, this yields an approximately equal allocation of compute to model size and data in contemporary LLMs, with finer corrections explained by input dimensionality, data quality, and optimizer particulars (Nayak et al., 2024, Porian et al., 2024, Jeon et al., 2022, Junior et al., 3 Jan 2025).
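As a minimal sketch of the allocation rule in this section, the following Python snippet evaluates the closed-form minimizer of the loss ansatz under $C \approx \kappa N D$. The function name `compute_optimal_allocation`, the constants `A`, `B`, `ALPHA`, `BETA`, and the default `kappa = 6` (the common $6ND$ training-cost approximation) are illustrative assumptions, not fitted values from the cited works.

```python
import math

def compute_optimal_allocation(C, A, B, alpha, beta, kappa=6.0):
    """Minimize L_inf + A/N**alpha + B/D**beta subject to kappa*N*D = C.

    Returns (N_opt, D_opt). L_inf does not affect the minimizer.
    """
    a = beta / (alpha + beta)                        # exponent of N* in C
    G = (alpha * A / (beta * B)) ** (1.0 / (alpha + beta))
    N_opt = G * (C / kappa) ** a
    D_opt = C / (kappa * N_opt)                      # stay on the compute constraint
    return N_opt, D_opt

# Illustrative constants only (assumed, not fitted here).
A, B, ALPHA, BETA = 400.0, 400.0, 0.34, 0.28

for C in (1e20, 1e22, 1e24):
    N_opt, D_opt = compute_optimal_allocation(C, A, B, ALPHA, BETA)
    print(f"C={C:.0e}  N*={N_opt:.3e}  D*={D_opt:.3e}  tokens/param={D_opt / N_opt:.1f}")

# The log-log slope of N* against C recovers the exponent beta / (alpha + beta).
N_lo, _ = compute_optimal_allocation(1e20, A, B, ALPHA, BETA)
N_hi, _ = compute_optimal_allocation(1e24, A, B, ALPHA, BETA)
slope = math.log(N_hi / N_lo) / math.log(1e24 / 1e20)
print(f"log-log slope of N* vs C: {slope:.3f}")
```

Because the exponent depends only on $\alpha$ and $\beta$, changing `A`, `B`, or `kappa` rescales the optimum but leaves the slope unchanged.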

2. Analytical Derivation and Corrections

The closed-form solution to the compute-optimal trade-off is obtained via Lagrange multipliers:

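$$\mathcal{L}(N, D, \lambda) = L_{\infty} + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}} + \lambda\,(\kappa N D - C)$$

Setting partial derivatives to zero yields

$$\alpha A N^{-\alpha} = \beta B D^{-\beta}, \qquad N^*(C) = G \left(\frac{C}{\kappa}\right)^{a}, \qquad D^*(C) = G^{-1} \left(\frac{C}{\kappa}\right)^{1-a}, \qquad G = \left(\frac{\alpha A}{\beta B}\right)^{1/(\alpha+\beta)}$$

with the scaling exponent $a = \beta/(\alpha+\beta)$ (Nayak et al., 2024, Jeon et al., 2022).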
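The Lagrangian solution can be sanity-checked numerically. The sketch below, under assumed illustrative constants, scans model sizes along the constraint $D = C/(\kappa N)$ and compares the brute-force minimizer with the closed form $N^* = (\alpha A / (\beta B))^{1/(\alpha+\beta)} (C/\kappa)^{\beta/(\alpha+\beta)}$. All function names, the grid range, and the parameter values are hypothetical choices made for this example.

```python
import math

def loss(N, D, L_inf, A, B, alpha, beta):
    """Parametric loss ansatz L_inf + A/N^alpha + B/D^beta."""
    return L_inf + A / N**alpha + B / D**beta

def closed_form_N(C, A, B, alpha, beta, kappa):
    """Stationary point of the Lagrangian, solved for N."""
    G = (alpha * A / (beta * B)) ** (1.0 / (alpha + beta))
    return G * (C / kappa) ** (beta / (alpha + beta))

def brute_force_N(C, L_inf, A, B, alpha, beta, kappa, points=20001):
    """Scan a log-spaced grid of N in [1e6, 1e13] with D fixed by the constraint."""
    best_N, best_loss = None, math.inf
    for i in range(points):
        N = 10 ** (6 + 7 * i / (points - 1))
        D = C / (kappa * N)
        current = loss(N, D, L_inf, A, B, alpha, beta)
        if current < best_loss:
            best_N, best_loss = N, current
    return best_N

# Assumed, illustrative parameters.
params = dict(L_inf=1.7, A=400.0, B=400.0, alpha=0.34, beta=0.28, kappa=6.0)
C = 1e22

N_closed = closed_form_N(C, params["A"], params["B"],
                         params["alpha"], params["beta"], params["kappa"])
N_grid = brute_force_N(C, **params)
print(f"closed form N* = {N_closed:.4e}, grid search N* = {N_grid:.4e}")
```

The two estimates should agree up to the grid resolution, which is a quick way to catch sign or exponent mistakes when adapting the ansatz.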

Recent corrections to the original Kaplan law, which account for the last-layer head cost, warmup duration, and tuning of batch size and the AdamW $\beta_2$ parameter, yield exponents that converge to the Chinchilla value of approximately 0.5 (Porian et al., 2024). Practical formulas are

[…]

across a wide range of budgets […] FLOPs.

3. Domain Adaptations and Empirical Extensions

Domain specialization, data quality, and modality significantly modulate the optima and the scaling exponents:

  • Domain-specialized LMs: A lower data-quality constant shifts $N^*$ upward for fixed $C$ (Junior et al., 3 Jan 2025). In legal, medical, and finance domains, specialization offers greater compute efficiency as model size increases; e.g., the 14B-legal model attains its loss floor at […] lower compute than a general-domain model.
  • Protein language modeling: CLM and MLM objectives on protein sequences yield exponents that differ from those in NLP, e.g., for MLM, […] (Cheng et al., 2024, Serrano et al., 2024). The optimal allocation is more parameter-dominated for MLMs (smaller vocabulary, high compositional complexity).
  • Vision Transformers: For shape-optimized ViTs, the total parameter count scales as […], obtained by fitting error power laws to width, depth, and MLP size, then combining exponents (Alabdulmohsin et al., 2023).

Optimal trade-offs are modality- and architecture-specific and require empirical calibration for each domain, but all obey a parametric frontier of the form […].
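Empirical calibration for a new domain typically amounts to fitting a power law to the loss-minimizing model sizes observed at several compute budgets. The sketch below does this with an ordinary least-squares fit in log-log space. The helper `fit_power_law` and the synthetic `(budget, optimal size)` pairs are hypothetical and stand in for real IsoFLOP measurements.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of log y = log k + a * log x. Returns (k, a)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x, mean_y = sum(lx) / n, sum(ly) / n
    sxx = sum((u - mean_x) ** 2 for u in lx)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    a = sxy / sxx                          # scaling exponent
    k = math.exp(mean_y - a * mean_x)      # prefactor
    return k, a

# Synthetic IsoFLOP optima generated from N* = 0.09 * C**0.5 (assumed, for illustration).
budgets = [1e18, 1e19, 1e20, 1e21]
optimal_sizes = [0.09 * C ** 0.5 for C in budgets]

k_fit, a_fit = fit_power_law(budgets, optimal_sizes)
print(f"fitted N*(C) ~ {k_fit:.3g} * C^{a_fit:.3f}")
```

With real measurements the points are noisy, so the fitted exponent should be reported with an uncertainty estimate (for example from bootstrapping the budgets) rather than as a single number.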

4. Compute-Optimal Model Size in Fine-Tuning and Transfer

Specialized tasks such as contrastive embedding model training exhibit their own compute-optimal laws, incorporating the spectrum of fine-tuning methods:

  • Embedding models: For a given compute budget, the optimal model size $N^*$ depends on the choice of fine-tuning method (full vs. partial/LoRA). Below a crossover ([…] FLOP), full fine-tuning is optimal; above it, LoRA with rank 32–128 is preferred (Ziarko et al., 2024).
  • The optimum $N^*$ grows sublinearly with $C$; at high budgets, LoRA and full fine-tuning yield almost identical optima, dictated by the task's empirical IsoFLOP curves. Bias-only fine-tuning is never optimal.

Empirical recipe for text embedding models under a fixed $C$:

  1. For $C$ below the crossover: full fine-tuning with the largest feasible $N$.
  2. For $C$ above the crossover: LoRA with high rank and the largest feasible $N$.
  3. Spend the remaining budget $C$ on data $D$ via $C \approx \kappa N D$ (full) or the appropriate fine-tuning cost formula (LoRA).
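The recipe above can be sketched as a small planning function. Everything specific here is an assumption for illustration: the function name `plan_embedding_finetune`, the crossover value passed by the caller, the candidate model sizes, and the per-parameter-per-token cost factors (6 for full fine-tuning, and 4 as a rough stand-in for LoRA, which skips most weight-gradient computation). The cited work should be consulted for the actual crossover and cost formulas.

```python
def plan_embedding_finetune(C, crossover_flop, feasible_sizes, flop_per_param_token):
    """Pick a fine-tuning method and data budget for a fixed compute budget C.

    C                    -- total fine-tuning compute in FLOP
    crossover_flop       -- budget at which LoRA overtakes full fine-tuning (caller-supplied)
    feasible_sizes       -- parameter counts that fit the available hardware
    flop_per_param_token -- dict mapping method name to its cost factor (assumed values)
    """
    method = "full" if C < crossover_flop else "lora"
    N = max(feasible_sizes)                        # steps 1-2: largest feasible model
    D = C / (flop_per_param_token[method] * N)     # step 3: spend the rest on data
    return {"method": method, "N": N, "D_tokens": D}

# Hypothetical inputs.
COSTS = {"full": 6.0, "lora": 4.0}
SIZES = [1e8, 4e8, 1e9]

small_budget_plan = plan_embedding_finetune(1e17, 1e18, SIZES, COSTS)
large_budget_plan = plan_embedding_finetune(1e19, 1e18, SIZES, COSTS)
print(small_budget_plan)
print(large_budget_plan)
```

Keeping the crossover and cost factors as arguments makes it explicit that they are empirical quantities to be measured per task, not constants of the method.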

5. Compute-Optimal Model Size at Test-Time and End-to-End Budgets

Classical scaling laws focus on pretraining, but modern deployments optimize for total (pretrain+inference) compute or inference-specific FLOPs:

  • Train-to-Test (T²) Scaling: Introducing test-time sampling (pass@$k$), where inference cost scales as […], shifts the pretraining optimum. When a fixed inference compute budget is accounted for, the compute-optimal configuration jointly minimizes loss or maximizes accuracy: […] with […], typical exponents […] (Roberts et al., 1 Apr 2026).
  • Test-time scaling and inference: Smaller, overtrained models combined with many test samples outperform large models at a fixed inference budget (Wu et al., 2024, Liu et al., 10 Feb 2025, Roberts et al., 1 Apr 2026). Empirically, […] for inference-only scaling (Wu et al., 2024).

Empirical observations from T² and test-time scaling: as soon as inference compute is included, the optimal region shifts toward smaller models trained much longer (high tokens/parameter), supporting aggressive test-time sampling or search.
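To illustrate why a fixed inference budget favors smaller models, the sketch below counts how many samples each model can afford and converts that into pass@k. It assumes a forward-pass cost of roughly $2N$ FLOP per generated token and statistically independent samples, so that pass@k $= 1 - (1 - p)^k$. The model sizes, per-sample success rates, and budget are hypothetical numbers chosen only to show the mechanism.

```python
def samples_affordable(inference_flop, n_params, tokens_per_sample):
    """Number of full samples that fit in the budget at ~2*N FLOP per token (assumed)."""
    return inference_flop // (2 * n_params * tokens_per_sample)

def pass_at_k(p_single, k):
    """Probability that at least one of k independent samples succeeds."""
    return 1.0 - (1.0 - p_single) ** k

# Hypothetical per-query budget and models.
BUDGET_FLOP = 10**15
TOKENS_PER_SAMPLE = 512
models = {
    "large (70B params)": {"n_params": 70 * 10**9, "p_single": 0.10},
    "small (7B params)": {"n_params": 7 * 10**9, "p_single": 0.04},
}

results = {}
for name, cfg in models.items():
    k = samples_affordable(BUDGET_FLOP, cfg["n_params"], TOKENS_PER_SAMPLE)
    results[name] = (k, pass_at_k(cfg["p_single"], k))
    print(f"{name}: k={k}, pass@k={results[name][1]:.3f}")
```

In this toy setting the smaller model wins despite a lower single-sample success rate, because it can draw about ten times as many samples. Whether that holds in practice depends on how correlated the samples are and on the quality of the verifier or selection rule.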

| Paradigm | $N^*$ scaling | $D^*$ scaling | Key context |
|---|---|---|---|
| Chinchilla LM | $\propto C^{\approx 0.5}$ | $\propto C^{\approx 0.5}$ | Language modeling (Hoffmann et al., 2022, Porian et al., 2024) |
| ViT | […] | n/a (data unlimited) | Computer vision (Alabdulmohsin et al., 2023) |
| pLM MLM | […] | […] | Protein masked LM (Cheng et al., 2024) |
| Inference-opt | […] | - | LLM inference, test-time (Wu et al., 2024) |
| T² end-to-end | […] | see formulas | Train-to-test, many […] (Roberts et al., 1 Apr 2026) |

6. Unified Laws: Limits, Anomalies, and Practical Considerations

Attempts to collapse model performance onto total compute, decoupling data and parameter allocation, lead to different "unified" scaling laws:

  • Unified Law (Guo 2024): Some recent works fit bits per character (BPC) as a function of total compute $C$ only, with no unique optimum in $(N, D)$; any allocation with the same $C \approx \kappa N D$ yields identical compression (Guo, 2024). A compute-optimal split then requires an external constraint (inference cost, data supply) to pick $(N, D)$ (e.g., for inference-optimality, push $N$ to its minimum and let $D$ absorb the remaining compute).
  • These regimes highlight boundaries of the classic two-variable scaling law, with implications for exabyte-scale and web-limited data scenarios.

Practical guidelines and caveats:

  • Proper accounting of "head" parameters and scale-dependent tuning of learning rate, batch size, and AdamW $\beta_2$ are essential for correct exponents (Porian et al., 2024).
  • For domain-specialized settings with high-quality data, exponents shift to favor larger $N^*$ at fixed $C$ (Junior et al., 3 Jan 2025).
  • Under strong test-time constraints or aggressive sampling/search protocols, the optimal $N^*$ can be orders of magnitude smaller than pure pretraining scaling suggests, with significant overtraining (tokens per parameter many times higher than canonical ratios) (Roberts et al., 1 Apr 2026).

7. Cross-Domain Generalization and Future Directions

Compute-optimal scaling laws have been extended to a range of architectures (ViT, RL), domains (protein, law, medicine), fine-tuning protocols, and both training and inference regimes, as summarized in the literature above.

Ongoing and open topics:

  • Understanding exponents' dependence on data complexity and modality (e.g., protein MLM vs. CLM vs. text).
  • Joint optimization of pretraining and deployment (test-time search, sample-efficient inference).
  • Constant-factor improvements via optimizer/batch/hyperparameter scaling.
  • Information-theoretic characterizations adding […] corrections and sharp thresholds based on semantic graph analogies (Nayak et al., 2024).

The resulting body of work provides practitioners and theorists with explicit, empirically validated recipes for determining $N^*(C)$ under a wide range of practical constraints, enabling rigorous design of compute-scaled models across modern machine learning systems.
