Compute-Optimal Training

Updated 16 October 2025
  • Compute-optimal training is a strategy that balances model scale and data usage to maximize performance under fixed compute constraints.
  • It leverages theoretical scaling laws, adaptive hyperparameter tuning, and simulation-driven methods to optimize resource allocation.
  • Empirical studies in language, vision, and reinforcement learning validate its effectiveness, achieving significant efficiency and performance gains.

Compute-optimal training refers to the principled allocation of computational resources throughout model development and training to achieve maximal performance for a fixed compute budget. The concept, which is central to modern deep learning, quantifies the optimal balance among model size, data size, training steps, and other variables—subject to hardware and cost constraints—so that additional compute is not wasted on negligible improvements. This paradigm has yielded both empirically validated scaling laws in LLMs and theoretical understanding across a wide array of machine learning settings, including vision, protein modeling, reinforcement learning, and quantization-aware training. Approaches for compute-optimal training now encompass not just model and dataset scaling, but also adaptive learning schedules, data composition, parallelization, and resource-aware algorithmic strategies.

1. Theoretical Foundations and Scaling Laws

The basis for compute-optimal training lies in the empirical and theoretical scaling laws that relate model parameters (N), training tokens (D), and total compute (C) to expected loss or error. Foundational works (Hoffmann et al., 2022, Jeon et al., 2022) establish that, under a fixed compute budget (C ∝ N·D; commonly approximated as C ≈ 6ND for dense transformers), the optimal allocation scales parameters and tokens nearly equally, N_opt ∝ C^{0.5} and D_opt ∝ C^{0.5}, minimizing a loss function of the form

L(N, D) = E + A/N^\alpha + B/D^\beta

where E is the irreducible loss and α, β are empirically fitted decay rates for the model-limited and data-limited error terms. The compute-allocation exponents follow as β/(α+β) for N and α/(α+β) for D; since the fitted α and β are of similar magnitude, both exponents come out near 0.5.
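
To make the allocation concrete, the optimum under the constraint C ≈ 6ND has a closed form. The Python sketch below derives it by substituting D = C/(6N) and setting dL/dN = 0; the default constants are of the order reported by Hoffmann et al. (2022) and should be treated as illustrative rather than definitive.

```python
def compute_optimal_allocation(C, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Minimize L(N, D) = E + A/N**alpha + B/D**beta subject to C ~ 6*N*D.

    Substituting D = C/(6N) and solving dL/dN = 0 gives
        N_opt = G * (C/6)**(beta/(alpha+beta)),
        G     = (alpha*A / (beta*B))**(1/(alpha+beta)).
    With alpha ~ beta, both compute exponents come out near 0.5.
    """
    G = (alpha * A / (beta * B)) ** (1.0 / (alpha + beta))
    N_opt = G * (C / 6.0) ** (beta / (alpha + beta))
    D_opt = (C / 6.0) / N_opt  # tokens implied by the remaining budget
    return N_opt, D_opt

# Example: at a roughly Chinchilla-scale budget of ~5.9e23 FLOPs, this fit
# suggests a model in the tens of billions of parameters and trillions of tokens.
N, D = compute_optimal_allocation(5.9e23)
print(f"N_opt ~ {N:.3g} params, D_opt ~ {D:.3g} tokens")
```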

Theoretical analysis demonstrates that this optimal frontier arises when neither estimation error (primarily mitigated by more data) nor approximation error (addressed by more parameters) dominates (Jeon et al., 2022). Under increasing complexity (e.g., higher input dimension or longer sequences), the allocation may shift further in favor of model size.

Table summarizing compute-optimal allocation exponents (N_opt ∝ C^a, D_opt ∝ C^b) across domains:

| Domain | Model exponent a | Data exponent b | Reference |
|---|---|---|---|
| Language LMs | ≈0.5 | ≈0.5 | (Hoffmann et al., 2022) |
| Protein LMs | 0.27 | 0.71 | (Serrano et al., 11 Jun 2024) |
| Adaptive models | varies by configuration | varies | (Anagnostidis et al., 2023) |

This highlights both the generality and the domain-specific nuances of compute-optimal scaling.

2. Optimization Methodologies and Practical Algorithms

Compute-optimal training is instantiated through algorithms that adjust compute allocations during training. For hyperparameter selection and early stopping under strict budgets, formulations such as AutoBCT (Cironis et al., 2021) recast hyperparameter optimization as a sequential decision process. Here, the objective is

\sup_{(u_n),\,\tau \geq 1} \mathbb{E}\Bigl[ H(u_{\tau-1}) - \gamma \sum_{n=1}^{\tau} (t_n)^+ \Bigr]

where u_n denotes the hyperparameter settings at epoch n, H(u) is the expected model quality, t_n is the training cost per epoch, and γ is a cost-sensitivity parameter. AutoBCT uses Bayesian surrogate modeling (via Kalman filtering) and a Markov decision process solved by dynamic programming to determine both hyperparameter schedules and adaptive stopping criteria, balancing potential model-quality gains against additional compute cost.

Table: core steps in AutoBCT (all from Cironis et al., 2021):

| Step | Description |
|---|---|
| Bayesian surrogate modeling | Linear basis expansion of the score and cost functions, with recursive Kalman updates |
| MDP embedding | Posterior statistics form the state; the decision is to continue (select u_{n+1}) or stop |
| Value function computation | Bellman recursion, solved numerically by regression Monte Carlo |
| Adaptive control | At each step, select u to maximize the Q-value; monitor the stopping criterion dynamically |
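
The following Python sketch illustrates the shape of such a cost-aware control loop. It is not the authors' implementation: the surrogate interface (expected_score, expected_cost, update) and the train_epoch callback are hypothetical stand-ins for AutoBCT's Kalman-filter surrogate and training procedure.

```python
import numpy as np

def cost_aware_control_loop(surrogate, train_epoch, candidates, gamma,
                            max_epochs=50):
    """Schematic AutoBCT-style loop: greedily pick the hyperparameter
    setting with the best cost-penalized expected score, and stop once
    no candidate is expected to improve on the quality already achieved.
    """
    best_score = -np.inf
    for _ in range(max_epochs):
        # One-step lookahead: expected quality net of the compute penalty.
        net = [surrogate.expected_score(u) - gamma * surrogate.expected_cost(u)
               for u in candidates]
        u_next = candidates[int(np.argmax(net))]

        # Adaptive stopping: continue only if the best candidate is
        # expected to beat the best quality observed so far.
        if max(net) <= best_score:
            break

        score, cost = train_epoch(u_next)       # observed H(u) and t_n
        surrogate.update(u_next, score, cost)   # recursive (e.g. Kalman) update
        best_score = max(best_score, score)
    return best_score
```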

Direct search strategies for hyperparameter scaling, such as CARBS (Fetterman et al., 2023), and simulation-based approaches to parallelization planning, such as vTrain (Bang et al., 2023), further extend compute optimization to pipeline-wide planning.

3. Empirical Validation and Task-Specific Scaling

Systematic large-scale experiments demonstrate the universality of compute-optimal scaling, but also its dependence on domain and task structure.

In large language modeling, models such as Chinchilla (Hoffmann et al., 2022), trained at the compute-optimal allocation (a smaller model on more tokens than prior practice), outperform substantially larger models given the same compute budget: for example, Chinchilla (70B parameters, 1.4T tokens) vs. Gopher (280B parameters, 300B tokens). Comparable findings are reproduced in open models such as Cerebras-GPT (Dey et al., 2023) and further validated across vision (Ahmadi et al., 17 Oct 2024), protein language modeling (Serrano et al., 11 Jun 2024, Cheng et al., 4 Nov 2024), and deep RL (Fu et al., 20 Aug 2025).
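
A back-of-the-envelope check with the common C ≈ 6ND training-FLOPs approximation (a simplification; the papers' exact accounting differs) shows the two runs sit at comparable budgets:

```python
# Training FLOPs per token ~ 6N for a dense transformer, so C ~ 6*N*D.
chinchilla = 6 * 70e9 * 1.4e12   # ~ 5.9e23 FLOPs
gopher     = 6 * 280e9 * 300e9   # ~ 5.0e23 FLOPs
print(f"Chinchilla: {chinchilla:.2g} FLOPs, Gopher: {gopher:.2g} FLOPs")
# Comparable budgets, yet the 4x-smaller model trained on ~4.7x more
# tokens achieves lower loss: the essence of compute-optimal allocation.
```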

Recent results indicate that skill-dependent scaling laws exist (Roberts et al., 13 Mar 2025): knowledge-based tasks (e.g., question answering) are "capacity-hungry" (requiring more parameters), while reasoning and code tasks are "data-hungry" (favoring more training tokens). The composition of the pretraining data mix and the choice of validation metric have been shown to shift the computed optimal allocation by as much as 50% in parameter count.

4. Resource-Aware Data, Model, and Optimization Strategies

Compute-optimal training extends beyond just model and dataset scaling.

  • Data selection: Methods such as AutoScale (Kang et al., 29 Jul 2024) fit parametric models to optimize domain mixture weights at small scales, then extrapolate to larger budgets using power-law relationships (e.g., N^{(3)*} = (N^{(2)*})^2 / N^{(1)*}), ensuring that the data composition is scale-aware and adapts as training compute increases (see the sketch after this list).
  • Sampling and synthetic data: For data generation under compute constraints, more (but weaker) model samples can be preferable to fewer, expensive (strong model) generations for finetuning LM reasoners, as evidenced by coverage and diversity metrics (Bansal et al., 29 Aug 2024).
  • Finetuning recipes: In resource-constrained embedding model training, systematic search across model sizes, data, and fine-tuning methods (e.g., full, LoRA/block-freezing) produces empirical recipes where full fine-tuning is optimal for small budgets, while LoRA dominates at larger scales (Ziarko et al., 6 Jun 2024).
  • Data selection compute cost: The total cost of selection plus training must be considered. Simple methods are nearly always optimal at modest budgets; expensive selection becomes worthwhile only when the training model is at least ~5× (perplexity-based selection) or ~10× (gradient-based selection) larger than the selection model (Yin et al., 21 Oct 2024).
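
As a concrete illustration of the AutoScale-style extrapolation in the first bullet, the sketch below applies the per-domain power-law update and renormalizes to obtain mixture weights; the function name and example numbers are hypothetical.

```python
import numpy as np

def extrapolate_domain_mixture(n1, n2):
    """Sketch of scale-aware mixture extrapolation in the AutoScale style.
    n1, n2: optimal per-domain token counts fitted at two successive
    (small) training scales. The update N^(3)* = (N^(2)*)^2 / N^(1)*
    extrapolates each domain geometrically; renormalizing yields
    mixture weights for the next, larger scale.
    """
    n1, n2 = np.asarray(n1, float), np.asarray(n2, float)
    n3 = n2 ** 2 / n1        # per-domain geometric extrapolation
    weights = n3 / n3.sum()  # normalized domain mixture for the next scale
    return n3, weights

# Hypothetical example with three domains (web, code, books):
n3, w = extrapolate_domain_mixture([1e9, 5e8, 2e8], [3e9, 2e9, 4e8])
print(w)  # domains whose optimal share grows fastest gain weight
```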

5. Adaptive and Flexible Compute-Optimal Regimes

Emerging work demonstrates that non-static allocation further improves compute efficiency:

  • Adaptive model training (Anagnostidis et al., 2023): By dynamically adjusting architecture parameters (e.g., patch size in vision, context length in LMs) during training based on real-time scaling curve derivatives, one can traverse optimal scaling regimes, achieving up to 50–60% FLOPs savings compared to static schedules.
  • Flexible optimization schedules (Hägele et al., 28 May 2024): Replacing cycle-length-locked cosine learning-rate schedules with a constant LR plus cooldown (optionally with stochastic weight averaging) enables a single long run to support many performance checkpoints, drastically reducing duplicated compute in scaling-law experiments (a schedule sketch follows this list).
  • QAT/FP allocation (Dremov et al., 26 Sep 2025): Quantization-aware training becomes compute-optimal when the allocation of tokens between full-precision and QAT phases is tuned according to a tokens-per-parameter-byte statistic; optimal fractions and losses can be predicted via empirically fitted scaling laws.
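
A minimal sketch of the constant-LR-plus-cooldown schedule from the second bullet, assuming linear warmup and a linear cooldown over the final fraction of steps (the cooldown shape and parameter names are illustrative, not the paper's exact recipe):

```python
def lr_constant_plus_cooldown(step, total_steps, peak_lr,
                              warmup_steps=1000, cooldown_frac=0.2):
    """Linear warmup, long constant plateau, linear cooldown to zero.

    Because the plateau is checkpoint-agnostic, one long run can be
    "cooled down" from many intermediate checkpoints, unlike a cosine
    schedule whose shape is tied to a fixed total length.
    """
    cooldown_start = int(total_steps * (1 - cooldown_frac))
    if step < warmup_steps:                 # linear warmup
        return peak_lr * step / warmup_steps
    if step < cooldown_start:               # constant plateau
        return peak_lr
    # Linear decay over the final cooldown_frac of training.
    remaining = total_steps - step
    return peak_lr * remaining / (total_steps - cooldown_start)
```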

6. Simulation and System-Level Compute Optimization

Full-system simulation is crucial for realizing compute-optimal plans:

  • Profiling-driven simulators such as vTrain (Bang et al., 2023): accurately predict effective utilization, communication overhead, and completion times across candidate pipeline and parallelization plans. This corrects naïve overestimation of achievable compute, grounding decisions on architecture scale and cluster configuration (a toy cost model is sketched after this list).
  • Multi-tenant / cluster scheduling: Incorporating simulation-guided estimates into job schedulers increases on-time completion rates and reduces average job times and makespan, improving cluster efficiency while maintaining cost-effectiveness under shared resource constraints.
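
For intuition only, the toy cost model below estimates per-step time from compute and communication terms with partial overlap. It is a schematic stand-in: vTrain itself is profiling-driven rather than analytic, and all parameter names and numbers here are hypothetical.

```python
def estimate_step_time(flops_per_step, bytes_exchanged, peak_flops,
                       mfu=0.4, link_bandwidth=100e9, overlap=0.7):
    """Toy step-time estimate: compute time at a measured model-FLOPs
    utilization (mfu), plus the non-overlapped share of communication.
    """
    t_compute = flops_per_step / (peak_flops * mfu)
    t_comm = bytes_exchanged / link_bandwidth
    # Only the fraction of communication not hidden behind compute
    # adds to the step time.
    return t_compute + (1 - overlap) * t_comm

# Compare two hypothetical parallelization plans before committing
# cluster hours: plan A trades extra compute for less communication.
plan_a = estimate_step_time(2.0e15, 4e9, 1e15)
plan_b = estimate_step_time(1.5e15, 2e10, 1e15)
print(f"plan A: {plan_a:.2f}s/step, plan B: {plan_b:.2f}s/step")
```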

7. Future Directions and Broader Implications

Current research opens several ongoing avenues:

  • Refinement and validation of scaling laws for new modalities and skills, especially as data quality, domain-specific plateau effects, and algorithmic innovations (e.g., transfer scaling in protein LMs (Cheng et al., 4 Nov 2024)) are characterized more deeply.
  • Robust adaptive data mixing and selection strategies that match the practical realities of evolving web-scale datasets and emerging modalities (Kang et al., 29 Jul 2024).
  • Exploration of compute-optimality in reinforcement learning (Fu et al., 20 Aug 2025), quantization (Dremov et al., 26 Sep 2025), and fine-grained skills/task compositions (Roberts et al., 13 Mar 2025).
  • Integration of full-stack simulation into resource allocation, including not only training but also inference, and accounting for memory, network, and energy constraints.

Compute-optimal training is now a core principle underpinning efficient, large-scale machine learning. The paradigm supports principled decision-making at all levels of research and deployment, enabling practitioners to maximize achievable model performance within hard resource constraints and to avoid the diminishing returns and wasted compute of earlier overprovisioned regimes.

