Compute-Optimal Model Size
- Compute-optimal model size is the parameter count that minimizes task loss under a fixed compute budget, as estimated from empirical scaling laws.
- It relies on power-law relationships linking loss to model size and data quantity to determine the optimal allocation, as seen in Chinchilla and in specialized-domain scenarios.
- Practical guidelines incorporate adjustments for domain-specific tasks and fine-tuning strategies like LoRA, balancing pretraining and inference compute requirements.
A compute-optimal model size is the parameter count that minimizes task loss (or maximizes downstream utility) subject to a fixed compute budget (measured in floating-point operations, FLOP) for pre-training, fine-tuning, or inference. This concept is grounded in empirical and theoretical scaling laws for parameter–data tradeoffs in neural networks, and it underpins efficient resource allocation for large-scale models in natural language processing, computer vision, protein modeling, embedding specialization, deep reinforcement learning, and test-time scaling.
1. Canonical Compute-Optimal Scaling Laws
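Classical compute-optimal scaling arises from empirical power-law fits to the relationship between generalization loss $L$, model size $N$ (parameters), and data quantity $D$ (training tokens), subject to a fixed-compute constraint $C \approx 6ND$. The most widely used ansatz for language modeling loss is
$$L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},$$
with constants $E$, $A$, $B$, $\alpha$, $\beta$ determined by grid sweeps over $N$ and $D$. The optimal allocation of compute is determined by Lagrangian minimization under this constraint, leading to the Chinchilla scaling law (Hoffmann et al., 2022, Porian et al., 2024, Nayak et al., 2024):
$$N_{\mathrm{opt}}(C) \propto C^{\beta/(\alpha+\beta)}, \qquad D_{\mathrm{opt}}(C) \propto C^{\alpha/(\alpha+\beta)}.$$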
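Empirical fits (e.g., Hoffmann et al., 2022) yield $\alpha \approx 0.34$, $\beta \approx 0.28$, so that
$$N_{\mathrm{opt}} \propto C^{0.46}, \qquad D_{\mathrm{opt}} \propto C^{0.54}.$$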
In practice, this yields an approximately equal allocation of compute to model size and data in contemporary LLMs, with finer corrections explained by input dimensionality, data quality, and optimizer particulars (Nayak et al., 2024, Porian et al., 2024, Jeon et al., 2022, Junior et al., 3 Jan 2025).
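The IsoFLOP reasoning in this section can be illustrated numerically. The following is a minimal sketch, assuming the parametric loss $L(N, D) = E + A/N^{\alpha} + B/D^{\beta}$ and the common approximation $C \approx 6ND$. The constant values are illustrative assumptions (roughly the parametric fit reported by Hoffmann et al., 2022), and the function names `loss`, `isoflop_optimum`, and `fitted_exponent` are hypothetical helpers, not part of any library.

```python
import math

# Illustrative constants (assumed; roughly the Hoffmann et al., 2022 parametric fit).
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28


def loss(n_params: float, n_tokens: float) -> float:
    """Parametric loss L(N, D) = E + A / N**alpha + B / D**beta."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA


def isoflop_optimum(compute: float, n_min: float = 1e6,
                    n_max: float = 1e13, steps: int = 2000):
    """Scan N on a log grid along the iso-FLOP line D = C / (6 N).

    Returns the (N, D, loss) triple with the lowest loss on the grid.
    """
    best = None
    for i in range(steps + 1):
        n = n_min * (n_max / n_min) ** (i / steps)
        d = compute / (6.0 * n)  # tokens affordable at this model size
        candidate = (loss(n, d), n, d)
        if best is None or candidate < best:
            best = candidate
    best_loss, best_n, best_d = best
    return best_n, best_d, best_loss


def fitted_exponent(c_lo: float = 1e19, c_hi: float = 1e25) -> float:
    """Slope of log N_opt against log C between two compute budgets."""
    n_lo, _, _ = isoflop_optimum(c_lo)
    n_hi, _, _ = isoflop_optimum(c_hi)
    return math.log(n_hi / n_lo) / math.log(c_hi / c_lo)
```

Under these assumed constants, `fitted_exponent()` should land close to $\beta/(\alpha+\beta)$, which is one way to sanity-check a fitted law against a brute-force sweep.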
2. Analytical Derivation and Corrections
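The closed-form solution to the compute-optimal trade-off is obtained via Lagrange multipliers:
$$\mathcal{L}(N, D, \lambda) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}} + \lambda\,(6ND - C).$$
Setting partial derivatives to zero yields $\alpha A N^{-\alpha} = \beta B D^{-\beta}$, and hence
$$N_{\mathrm{opt}} = G \left(\frac{C}{6}\right)^{a}, \qquad D_{\mathrm{opt}} = G^{-1} \left(\frac{C}{6}\right)^{b}, \qquad G = \left(\frac{\alpha A}{\beta B}\right)^{1/(\alpha+\beta)},$$
with the scaling exponent $a = \beta/(\alpha+\beta)$ (and $b = \alpha/(\alpha+\beta) = 1 - a$) (Nayak et al., 2024, Jeon et al., 2022).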
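The closed-form optimum can be written as a short function. This is a sketch under stated assumptions: the loss has the form $E + A/N^{\alpha} + B/D^{\beta}$, compute is $C \approx 6ND$, and the name `compute_optimal` and its parameters are hypothetical. The constants passed in the usage lines are illustrative only.

```python
def compute_optimal(compute: float, A: float, B: float,
                    alpha: float, beta: float,
                    flop_per_param_token: float = 6.0):
    """Closed-form minimizer of A / N**alpha + B / D**beta subject to C = k * N * D.

    Returns (N_opt, D_opt). The irreducible term E does not affect the optimum.
    """
    g = (alpha * A / (beta * B)) ** (1.0 / (alpha + beta))
    a = beta / (alpha + beta)   # exponent of N_opt in C
    b = alpha / (alpha + beta)  # exponent of D_opt in C; a + b == 1
    scaled = compute / flop_per_param_token
    n_opt = g * scaled**a
    d_opt = scaled**b / g
    return n_opt, d_opt


# Usage with illustrative (assumed) constants:
n_opt, d_opt = compute_optimal(1e22, A=406.4, B=410.7, alpha=0.34, beta=0.28)
tokens_per_param = d_opt / n_opt
```

The returned pair satisfies both the budget constraint and the stationarity condition $\alpha A N^{-\alpha} = \beta B D^{-\beta}$, which makes it easy to cross-check against a numerical search.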
Recent corrections to the original Kaplan law, accounting for the last-layer head cost, warmup duration, and optimal batch size and AdamW $\beta_2$, yield exponents converging to the Chinchilla value of roughly $0.5$ (Porian et al., 2024). Practical formulas are
2
across a wide range of budgets 3 FLOPs.
3. Domain Adaptations and Empirical Extensions
Domain specialization, data quality, and modality significantly modulate the optima and the scaling exponents:
- Domain-specialized LMs: Lower data-quality constant 4 shifts 5 upward for fixed 6 (Junior et al., 3 Jan 2025). Legal, medical, finance: as model size increases, specialization offers greater compute efficiency, e.g. the 14B-legal model attains its loss floor at 7 lower compute than a general-domain model.
- Protein language modeling: CLM and MLM objectives on protein sequences yield exponents differing from NLP: e.g., for MLM, 8, 9 (Cheng et al., 2024, Serrano et al., 2024). The optimal allocation is more parameter-dominated for MLMs (smaller vocabulary, high compositional complexity).
- Vision Transformers: For shape-optimized ViTs, the total parameter count scales as 0, obtained by fitting error power laws to width, depth, and MLP-size, then combining exponents (Alabdulmohsin et al., 2023).
Optimal trade-offs are modality- and architecture-specific and require empirical calibration for each domain, but all obey a parametric frontier of the form $N_{\mathrm{opt}} \propto C^{a}$ with a domain-specific exponent $a$.
4. Compute-Optimal Model Size in Fine-Tuning and Transfer
Specialized tasks such as contrastive embedding model training exhibit their own compute-optimal laws, incorporating the spectrum of fine-tuning methods:
- Embedding models: For given compute, optimal model size 2 depends on the choice of fine-tuning (full vs. partial/LoRA). Below a crossover (3 FLOP), full fine-tuning is optimal; above, LoRA with rank 32–128 is preferred (Ziarko et al., 2024).
- The optimum $N_{\mathrm{opt}}$ grows sublinearly with $C$; at high budgets, LoRA and full fine-tuning yield almost identical optima, dictated by the task's empirical IsoFLOP curves. Bias-only fine-tuning is never optimal.
Empirical recipe for text embedding models under a fixed compute budget $C$:
- For 7 FLOP: full fine-tuning with largest feasible 8.
- For 9 FLOP: use LoRA with high rank, largest feasible 0.
- Saturate 1 with 2 via 3 (full) or the appropriate fine-tuning cost formula (LoRA).
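The recipe above can be sketched as a small decision helper. Everything here is hypothetical scaffolding: the function name `embedding_finetune_plan`, the caller-supplied `crossover_flop` threshold (its value is not filled in here), the default LoRA rank (chosen from the 32 to 128 range mentioned above), and the assumption that full fine-tuning costs roughly $6ND$ FLOP. The LoRA token count is deliberately left out because it depends on a method-specific cost formula.

```python
def embedding_finetune_plan(budget_flop: float, crossover_flop: float,
                            feasible_sizes, lora_rank: int = 64) -> dict:
    """Pick a fine-tuning method and model size for a fixed compute budget.

    crossover_flop is supplied by the caller; below it full fine-tuning is
    chosen, at or above it LoRA is chosen. The largest feasible size is used.
    """
    n_params = max(feasible_sizes)
    if budget_flop < crossover_flop:
        # Assumption: full fine-tuning costs about 6 * N * D FLOP,
        # so the budget is saturated with D = C / (6 N) tokens.
        return {
            "method": "full",
            "n_params": n_params,
            "n_tokens": budget_flop / (6.0 * n_params),
        }
    # For LoRA the token count needs the method's own cost formula (not modeled).
    return {"method": "lora", "n_params": n_params, "lora_rank": lora_rank}
```

Keeping the crossover as an argument makes it straightforward to recalibrate the helper from a task's own IsoFLOP measurements.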
5. Compute-Optimal Model Size at Test-Time and End-to-End Budgets
Classical scaling laws focus on pretraining, but modern deployments optimize for total (pretrain+inference) compute or inference-specific FLOPs:
- Train-to-Test (T²) Scaling: Introduction of test-time sampling (pass@4), where inference cost scales as 5, shifts the pretraining optimum. When accounting for fixed inference compute, the compute-optimal 6 jointly minimize loss or maximize accuracy: 7 with 8, typical exponents 9, 0, 1 (Roberts et al., 1 Apr 2026).
- Test-time scaling and inference: Smaller, overtrained models, plus many test samples, outperform large models at fixed inference budget (Wu et al., 2024, Liu et al., 10 Feb 2025, Roberts et al., 1 Apr 2026). Empirically, 2 for inference-only scaling (Wu et al., 2024).
Empirical observations from T² and test-time scaling: as soon as inference compute is included, the optimal region shifts toward smaller models trained much longer (high tokens/parameter), supporting aggressive test-time sampling or search.
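The shift toward smaller, longer-trained models can be reproduced with a toy end-to-end budget. This is a simplified sketch, not the T² formulation itself: it assumes training costs about $6ND$ FLOP, inference costs about $2N$ FLOP per generated token, and pretraining loss (rather than pass@k accuracy) is the objective. The loss constants and the names `loss` and `best_size` are illustrative assumptions.

```python
def loss(n: float, d: float, E: float = 1.69, A: float = 406.4,
         B: float = 410.7, alpha: float = 0.34, beta: float = 0.28) -> float:
    """Assumed parametric loss E + A / N**alpha + B / D**beta."""
    return E + A / n**alpha + B / d**beta


def best_size(total_flop: float, inference_tokens: float = 0.0,
              samples: int = 1, steps: int = 4000):
    """Log-grid search over N under a combined train + inference budget.

    Inference is charged 2 * N * inference_tokens * samples FLOP; whatever
    remains is spent on training with D = train_flop / (6 N) tokens.
    Returns (N, D) with the lowest loss among feasible grid points.
    """
    n_lo, n_hi = 1e6, 1e13
    best = None
    for i in range(steps + 1):
        n = n_lo * (n_hi / n_lo) ** (i / steps)
        train_flop = total_flop - 2.0 * n * inference_tokens * samples
        if train_flop <= 0:
            continue  # this model size cannot even pay for inference
        d = train_flop / (6.0 * n)
        candidate = (loss(n, d), n, d)
        if best is None or candidate < best:
            best = candidate
    return best[1], best[2]


# Usage: same total budget, with and without an inference workload.
n_train_only, d_train_only = best_size(1e23)
n_end_to_end, d_end_to_end = best_size(1e23, inference_tokens=1e11, samples=8)
```

In this toy setting the end-to-end optimum has fewer parameters and a higher tokens-per-parameter ratio than the training-only optimum, matching the qualitative trend described above.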
| Paradigm | $N_{\mathrm{opt}}$ scaling | $D_{\mathrm{opt}}$ scaling | Key context |
|---|---|---|---|
| Chinchilla LM | $\propto C^{\approx 0.5}$ | $\propto C^{\approx 0.5}$ | Language modeling (Hoffmann et al., 2022, Porian et al., 2024) |
| ViT | 7 | n/a (data unlimited) | Computer vision (Alabdulmohsin et al., 2023) |
| pLM MLM | 8 | 9 | Protein masked LM (Cheng et al., 2024) |
| Inference-opt | 0 | - | LLM inference, test-time (Wu et al., 2024) |
| T² end-to-end | 1 | see formulas | Train-to-test, many 2 (Roberts et al., 1 Apr 2026) |
6. Unified Laws: Limits, Anomalies, and Practical Considerations
Attempts to collapse model performance onto total compute, decoupling data and parameter allocation, lead to different "unified" scaling laws:
- Unified Law (Guo 2024): Some recent works fit BPC as a function of 3 only, with no unique optimum in (N, D); any allocation with 4 yields identical compression (Guo, 2024). Compute-optimal split then requires an external constraint (inference cost, data supply) to pick 5 (e.g., for inference-optimality, push 6 minimal, 7).
- These regimes highlight boundaries of the classic two-variable scaling law, with implications for exabyte-scale and web-limited data scenarios.
Practical guidelines and caveats:
- Proper accounting of "head" parameters and scale-dependent tuning of learning rate and batch size (including AdamW $\beta_2$) are essential for correct exponents (Porian et al., 2024).
- For domain-specialized settings with high-quality data, exponents shift to favor larger $N$ at fixed $C$ (Junior et al., 3 Jan 2025).
- Under strong test-time constraints or aggressive sampling/search protocols, the optimal $N$ can be orders of magnitude smaller than pure pretraining scaling suggests, with significant overtraining (tokens per parameter many times higher than canonical ratios) (Roberts et al., 1 Apr 2026).
7. Cross-Domain Generalization and Future Directions
Compute-optimal scaling laws have been extended to a range of architectures (ViT, RL), domains (protein, law, medicine), fine-tuning protocols, and both training and inference regimes, as summarized in the literature above.
Ongoing and open topics:
- Understanding exponents' dependence on data complexity and modality (e.g., protein MLM vs. CLM vs. text).
- Joint optimization of pretraining and deployment (test-time search, sample-efficient inference).
- Constant-factor improvements via optimizer/batch/hyperparameter scaling.
- Information-theoretic characterizations adding 2 corrections and sharp thresholds based on semantic graph analogies (Nayak et al., 2024).
The resulting body of work provides practitioners and theorists with explicit, empirically validated recipes for determining $N_{\mathrm{opt}}$ under a wide range of practical constraints, supporting principled design of compute-scaled models across modern machine learning systems.