Compute-Optimal Model and Data Allocation

Updated 6 March 2026

Compute-optimal model and data allocation is a framework that optimally divides a finite compute budget between model parameters and training data to maximize performance.
It compares foundational laws like Chinchilla’s proportional scaling with unified log-linear models that allow flexible splits under fixed compute constraints.
The article also explores applications in inference, distributed computing, reinforcement learning, and wireless systems, with empirical validations and task-specific adaptations.

Compute-optimal model and data allocation refers to the principled division of a finite computational budget between a model's size (number of parameters, $N$ ) and the amount of data used for training (number of tokens, $D$ ) or inference, in order to maximize empirical or theoretical performance. Modern research has extended this concept beyond standard pretraining to data selection, adaptive model shaping, distributed computing, RL, inference-time allocation, and resource-constrained wireless systems. This article systematically surveys core laws, methodologies, and practical rules for compute-optimal allocation in both foundational and emerging scenarios.

1. Foundational Laws: From Chinchilla to One-Dimensional Scaling

Traditional approaches to compute-optimal allocation in transformer-based language modeling posit that, for a fixed compute budget $C$ , both model size $N$ and dataset size $D$ should be increased in lock-step. The classic “Chinchilla rule” is derived from empirical two-dimensional power-law fits of the form

$L(N, D) \approx A N^{-\alpha} + B D^{-\beta} + L_\infty$

subject to $C \sim N D$ FLOPs. Minimizing loss under this constraint yields

$N^* \propto C^{\beta/(\alpha+\beta)}\,,\quad D^* \propto C^{\alpha/(\alpha+\beta)}$

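For LLMs, the fitted exponents are roughly equal ( $\alpha\approx\beta$ ), so both allocation exponents are close to $1/2$ : $N^*\sim D^*\sim C^{1/2}$ and $D^*/N^* = \text{constant}$ . Model size and dataset size must therefore grow in proportion as $C$ increases; for example, quadrupling $C$ doubles both $N$ and $D$ (Hoffmann et al., 2022).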
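As a minimal sketch of the closed form above, the following Python function solves the constrained minimization under the simplifying assumption that $C = N D$ holds exactly (the text only states $C \sim N D$ ). The function names and the constants `A = B = 1`, `alpha = beta = 0.3` are illustrative placeholders, not fitted values from Hoffmann et al.

```python
def loss(N, D, A, B, alpha, beta, L_inf=0.0):
    """Two-term power-law loss L(N, D) = A*N^-alpha + B*D^-beta + L_inf."""
    return A * N ** -alpha + B * D ** -beta + L_inf


def chinchilla_optimum(C, A, B, alpha, beta):
    """Minimise the loss subject to N * D = C (assumed exact constraint).

    Substituting D = C / N and setting the derivative in N to zero gives
    N^(alpha + beta) = (alpha * A) / (beta * B) * C^beta.
    """
    prefactor = ((alpha * A) / (beta * B)) ** (1.0 / (alpha + beta))
    n_opt = prefactor * C ** (beta / (alpha + beta))
    d_opt = C / n_opt
    return n_opt, d_opt


# Hypothetical constants: equal exponents give N* and D* both scaling as C^(1/2).
n_small, d_small = chinchilla_optimum(1e20, A=1.0, B=1.0, alpha=0.3, beta=0.3)
n_large, d_large = chinchilla_optimum(4e20, A=1.0, B=1.0, alpha=0.3, beta=0.3)
print(n_large / n_small, d_large / d_small)  # both ratios are close to 2
```

With equal exponents, the ratio $D^*/N^*$ returned by the function stays fixed as `C` changes, which is the proportional-scaling behaviour described above.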

However, "More Compute Is What You Need" (Guo, 2024) provides empirical evidence for a single-variable scaling law: $\text{BPC}(C) = \alpha \log C + \beta$ with $\alpha = -0.031, \beta = 0.572$ , where $\alpha$ and $\beta$ are regression coefficients rather than the power-law exponents above. Here, compression performance, measured in bits per character (BPC), depends almost exclusively on total compute $C = N D$ , not on the split between $N$ and $D$ : any $(N,D)$ satisfying $C = N D$ achieves approximately the same final compression. This unified law implies no intrinsic trade-off: for fixed $C$ , practitioners are free to choose $(N,D)$ according to inference constraints.
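The following sketch illustrates two consequences of the log-linear law. Because the text does not state the units of $C$ here, the code evaluates only differences in BPC between two budgets, which do not depend on those units. It assumes a natural logarithm, and the helper names are hypothetical.

```python
import math

SLOPE = -0.031  # slope of the log-linear law quoted in the text


def bpc_change(compute_ratio, slope=SLOPE):
    """Predicted change in BPC when total compute is multiplied by compute_ratio.

    Assumes BPC = slope * ln(C) + intercept, so the intercept cancels.
    """
    return slope * math.log(compute_ratio)


def iso_compute_splits(C, model_sizes):
    """Return (N, D) pairs that all spend the same budget C = N * D."""
    return [(N, C / N) for N in model_sizes]


# Under the law, every split below is predicted to reach the same BPC.
for N, D in iso_compute_splits(1e22, [1e9, 1e10, 1e11]):
    print(f"N={N:.0e}  D={D:.0e}")

print(bpc_change(10.0))  # predicted BPC change for 10x more compute
```

Under the natural-log assumption, a tenfold increase in compute corresponds to a predicted change of about $-0.071$ BPC, regardless of how the budget is split.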

2. Compute Allocation Strategies, Data Regimes, and Inference Efficiency

The unified log-linear law (Guo, 2024) motivates two distinct regimes:

Data-rich regime: When abundant high-quality web data is available ( $D < D_{\rm max}$ ), minimizing $N$ reduces inference cost while maximizing $D$ absorbs the remaining compute. The optimal allocation therefore uses the smallest $N$ compatible with latency and throughput requirements and spends the remaining compute on additional tokens.
Data-exhaustion regime: Once $D$ reaches the available supply of web-scale data (typically $D_{\rm max}\approx 4$ to $10$ T tokens), further gains come only from increasing $N$ (and thus $C$ ).

Crucially, inference cost grows with $N$ , whereas $D$ affects only training cost. For deployment-oriented pretraining, shifting compute towards larger $D$ therefore preserves the loss predicted for a given $C$ while improving inference efficiency. In contrast, the Chinchilla rule fixes $D/N$ at a constant, so at a given $C$ it prescribes a single $(N, D)$ pair and cannot trade model size for data (Hoffmann et al., 2022; Guo, 2024).
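The two regimes can be sketched as a simple decision rule. This is an illustrative reading of the text rather than a procedure given by Guo (2024). The inputs `n_floor` (the smallest model size the practitioner is willing to deploy) and `d_max` (the usable token supply) are hypothetical, and $C = N D$ is assumed to hold exactly.

```python
def allocate(C, n_floor, d_max):
    """Split a budget C = N * D between parameters N and training tokens D.

    n_floor: smallest acceptable model size (hypothetical input).
    d_max:   number of usable training tokens (hypothetical input).
    Returns (N, D, regime_name).
    """
    d = C / n_floor
    if d <= d_max:
        # Data-rich regime: keep the model small, spend the rest on tokens.
        return n_floor, d, "data-rich"
    # Data-exhaustion regime: tokens are capped, so extra compute goes to N.
    return C / d_max, d_max, "data-exhaustion"


print(allocate(1e21, n_floor=1e9, d_max=1e13))  # small budget: data-rich
print(allocate(1e23, n_floor=1e9, d_max=1e13))  # large budget: data-exhaustion
```

In both branches the returned pair spends the full budget, so the log-linear law predicts the same compression either way; only inference cost and data requirements differ.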

3. Derivation and Empirical Validation

The unified log-linear law arises from a regression of BPC on $\ln (N D)$ , fit to open-source transformer models (e.g., Llama 1/2/3, Falcon, Qwen, DeepSeek, Yi, Mistral). Models such as Llama 3, trained with $\sim10\times$ more data than the Chinchilla rule recommends, nevertheless lie on the same log-linear trend, which argues against any requirement for special balancing. Models trained at extreme $D/N$ ratios do not deviate systematically from the fit, supporting the thesis that $C$ alone governs final compression (Guo, 2024).
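A regression of this kind can be sketched with ordinary least squares in plain Python. The data points below are synthetic placeholders generated from an arbitrary slope and intercept plus noise; they are not measurements of any real model and do not reproduce the published coefficients. The function name `fit_log_linear` is hypothetical.

```python
import math
import random


def fit_log_linear(models):
    """Least-squares fit of BPC = slope * ln(N * D) + intercept.

    models: iterable of (N, D, bpc) tuples.
    """
    xs = [math.log(n * d) for n, d, _ in models]
    ys = [bpc for _, _, bpc in models]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


# Synthetic demo data (arbitrary generating line, NOT real measurements).
TRUE_SLOPE, TRUE_INTERCEPT = -0.03, 2.0
rng = random.Random(0)
sizes_and_tokens = [(7e9, 1e12), (7e9, 2e12), (13e9, 1e12), (70e9, 2e12), (8e9, 15e12)]
synthetic = [
    (n, d, TRUE_SLOPE * math.log(n * d) + TRUE_INTERCEPT + rng.gauss(0.0, 0.002))
    for n, d in sizes_and_tokens
]

slope, intercept = fit_log_linear(synthetic)
print(slope, intercept)  # should land near the generating slope and intercept
```

Points with very different $D/N$ ratios enter the fit only through the product $N D$ , which is the modelling choice the log-linear law makes.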

4. Comparative Perspective: Power-Law Versus Log-Linear Scaling

Allocation Law	C-Dependence	D/N Constraint	Optimum Structure
Chinchilla (Power-Law)	$N,D\propto C^{1/2}$	$D/N = \text{const}$	Equal scaling: $N$ and $D$ both increase as $C^{1/2}$
Unified (Log-Linear) (Guo, 2024)	$(N \times D) = C$	$D/N$ unconstrained	Any $N,D$ with $N D = C$ , allowing arbitrary splits

Compared with Chinchilla and other two-term power-law models (Hoffmann et al., 2022; Nayak et al., 2024), the log-linear law gives the better empirical fit for current-generation transformers in practical compute ranges. However, it remains untested in extreme asymptotic settings.

5. Skills, Validation, and Sensitivity to Task

Skill-specific analysis demonstrates that optimal allocation is sensitive to the downstream evaluation mix. Empirical scaling exponents differ sharply between, e.g., knowledge-based QA (capacity-hungry, $\beta/\alpha \gg1$ ) and code generation (data-hungry, $\alpha/\beta\gg1$ ), with a swing of up to 50% in the optimal $N^*$ depending on validation set composition (Roberts et al., 13 Mar 2025). Correct allocation must therefore be informed by application-specific priorities, and practitioners must align validation metrics accordingly. If the target is "satisficing" across heterogeneous skills, blending the pretraining data mix (e.g., $2{:}1$ code:knowledge) brings the optimal $N$ for the critical subskills into closer agreement.
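To see why the exponent ratio controls the allocation, the optimum from Section 1 can be rewritten in terms of $r = \beta/\alpha$ . This is a worked restatement that assumes the two-term power-law form holds separately for each skill. That assumption is made here for illustration and is not a result quoted from Roberts et al.

```latex
N^* \propto C^{\beta/(\alpha+\beta)} = C^{r/(1+r)}, \qquad
D^* \propto C^{\alpha/(\alpha+\beta)} = C^{1/(1+r)}, \qquad
r = \beta/\alpha .
% r \gg 1 (capacity-hungry): N^* grows almost linearly in C, D^* barely grows.
% r \ll 1 (data-hungry):     D^* grows almost linearly in C, N^* barely grows.
% r = 1   (balanced):        N^* \propto D^* \propto C^{1/2}.
```

A validation set dominated by one skill type therefore pulls the fitted $r$ , and with it the recommended $N^*$ , towards that skill's optimum.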

6. Compute-Optimal Allocation in Broader Contexts

Beyond language modeling, compute-optimal allocation extends to diverse settings:

Data selection under compute constraints: The joint minimization of data-selection and training compute reveals that, at small budgets, lexical or embedding-based scoring dominates, while computationally heavy selection (perplexity or gradient) is only optimal when the training model is much larger than the scorer (by roughly $5\times$ to $10\times$ or more) (Yin et al., 2024). Compute should be prioritized for cheaper selection and more tokens until this threshold is crossed.
Inference compute allocation: OSCA algorithmically divides a fixed sample budget across a set of LLM inference configurations (model, temperature, language, etc.) using a convex program, leading to order-of-magnitude compute savings over best single configurations (Zhang et al., 2024).
RL: Compute-optimal scaling in value-based deep RL involves allocating compute among model size, update-to-data ratio, and batch size. The empirical trade-off is governed by phenomena such as TD-overfitting; power laws for the optimal allocation are derived for Pareto-efficient sample and compute use (Fu et al., 20 Aug 2025).
Distributed computing: Universal, deterministic allocation of data and subfunction loads (as a $d$ -uniform hypergraph edge partitioning) achieves order-optimal communication and computation scaling, e.g., $n/N^{1/d}$ communication cost and bounded load imbalance, via interweaved clique design (Maheri et al., 9 Jan 2026).
Wireless/federated resource allocation: In edge scenarios, joint model adapter compression and local compute allocation is solved via fractional programming and KKT updates under system/energy/delay constraints (Wang et al., 2024). Dinkelbach’s method yields superlinear convergence to the utility/consumption optimum.
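Dinkelbach's method, mentioned in the last item above, maximizes a ratio such as utility over consumption by repeatedly solving a parametric subproblem. The sketch below shows the generic iteration over a finite candidate set. It is not the formulation of Wang et al. (2024): the utility and cost functions, the candidate grid, and the function name are hypothetical, and the KKT-based inner updates are replaced by a brute-force search.

```python
import math


def dinkelbach(candidates, utility, cost, tol=1e-9, max_iter=100):
    """Maximise utility(x) / cost(x) over a finite candidate set (cost > 0).

    Returns (best_candidate, best_ratio).
    """
    lam = 0.0
    best = candidates[0]
    for _ in range(max_iter):
        # Parametric subproblem: maximise utility(x) - lam * cost(x).
        best = max(candidates, key=lambda x: utility(x) - lam * cost(x))
        gap = utility(best) - lam * cost(best)
        if abs(gap) < tol:
            break  # lam equals the best achievable ratio
        lam = utility(best) / cost(best)
    return best, lam


# Hypothetical example: x is the fraction of local compute that is used.
fractions = [i / 20 for i in range(1, 21)]
best_x, best_ratio = dinkelbach(
    fractions,
    utility=lambda x: math.log1p(10 * x),  # diminishing returns (assumed)
    cost=lambda x: 0.5 + x,                # fixed plus linear cost (assumed)
)
print(best_x, best_ratio)
```

Each iteration raises the ratio estimate `lam` until the subproblem's optimum is zero, at which point no candidate offers a better utility-to-cost ratio.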

7. Caveats, Assumptions, and Limitations

The unified computational scaling law (Guo, 2024) is strictly empirical for transformer-based autoregressive LLMs. Deviations may occur:

For architectures other than transformers, or with task-specific objective functions outside next-token prediction.
At extreme values of $N$ or $D$ , where the log-linear law has not been tested.
When data quality or heterogeneity is significant: higher-quality data can shift optimal allocations (as shown by DeepSeek, 2024).
BPC is not a comprehensive measure of downstream or emergent abilities.
The $N D$ -based compute proxy omits model depth, sequence length, optimizer and pipeline parallelization factors.

Regardless, for practical compute budgets and modern open-source LMs, allocating compute primarily based on the total product $N D$ , guided by inference constraints and data availability, is the new empirical optimum. The previous practice of fixed $D/N$ balancing is now superseded except under data exhaustion or specialized skill-optimized scenarios.

References:

"More Compute Is What You Need" (Guo, 2024)
"Training Compute-Optimal LLMs" (Hoffmann et al., 2022)
"An Information Theory of Compute-Optimal Size Scaling, Emergence, and Plateaus in LLMs" (Nayak et al., 2024)
"Compute-Constrained Data Selection" (Yin et al., 2024)
"Scaling LLM Inference with Optimized Sample Compute Allocation" (Zhang et al., 2024)
"Compute Optimal Scaling of Skills: Knowledge vs Reasoning" (Roberts et al., 13 Mar 2025)
"Compute-Optimal Scaling for Value-Based Deep RL" (Fu et al., 20 Aug 2025)
"Universal and Asymptotically Optimal Data and Task Allocation in Distributed Computing" (Maheri et al., 9 Jan 2026)
"Resource Allocation and Secure Wireless Communication in the Large Model-based Mobile Edge Computing System" (Wang et al., 2024)