Compute-Optimal Model and Data Allocation

Updated 6 March 2026
  • Compute-optimal model and data allocation is a framework that optimally divides a finite compute budget between model parameters and training data to maximize performance.
  • It compares foundational laws like Chinchilla’s proportional scaling with unified log-linear models that allow flexible splits under fixed compute constraints.
  • The article also explores applications in inference, distributed computing, reinforcement learning, and wireless systems, with empirical validations and task-specific adaptations.

Compute-optimal model and data allocation refers to the principled division of a finite computational budget between a model's size (number of parameters, $N$) and the amount of data used for training (number of tokens, $D$) or inference, in order to maximize empirical or theoretical performance. Modern research has extended this concept beyond standard pretraining to data selection, adaptive model shaping, distributed computing, reinforcement learning (RL), inference-time allocation, and resource-constrained wireless systems. This article surveys core laws, methodologies, and practical rules for compute-optimal allocation in both foundational and emerging scenarios.

1. Foundational Laws: From Chinchilla to One-Dimensional Scaling

Traditional approaches to compute-optimal allocation in transformer-based language modeling posit that, for a fixed compute budget $C$, both model size $N$ and dataset size $D$ should be increased in lock-step. The classic “Chinchilla rule” is derived from empirical two-dimensional power-law fits of the form

$$L(N, D) \approx A N^{-\alpha} + B D^{-\beta} + L_\infty$$

subject to $C \sim N D$ FLOPs. Minimizing loss under this constraint yields

$$N^* \propto C^{\beta/(\alpha+\beta)}\,,\quad D^* \propto C^{\alpha/(\alpha+\beta)}$$

For LLMs, the fitted exponents are roughly equal ($\alpha\approx\beta$), so both allocation exponents are close to $0.5$: $N^*\sim D^*\sim C^{1/2}$ and $D^*/N^* = \text{constant}$. $N$ and $D$ must therefore grow in equal proportion as $C$ increases (Hoffmann et al., 2022).
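The allocation exponents follow from substituting the constraint into the loss. A short derivation, assuming the compute proxy is taken as exactly $C = ND$ (a constant factor in the proxy would only rescale the prefactors):

```latex
\begin{aligned}
L(N) &= A N^{-\alpha} + B\,C^{-\beta} N^{\beta} + L_\infty
  && \text{(substituting } D = C/N\text{)} \\
\frac{dL}{dN} &= -\alpha A N^{-\alpha-1} + \beta B\,C^{-\beta} N^{\beta-1} = 0
  \;\Longrightarrow\; N^{\alpha+\beta} = \frac{\alpha A}{\beta B}\,C^{\beta} \\
N^{*} &= \left(\frac{\alpha A}{\beta B}\right)^{\frac{1}{\alpha+\beta}} C^{\frac{\beta}{\alpha+\beta}},
\qquad
D^{*} = \frac{C}{N^{*}} = \left(\frac{\beta B}{\alpha A}\right)^{\frac{1}{\alpha+\beta}} C^{\frac{\alpha}{\alpha+\beta}}
\end{aligned}
```

Setting $\alpha = \beta$ gives exponent $1/2$ for both $N^*$ and $D^*$, which is the proportional-scaling rule.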

However, "More Compute Is What You Need" (Guo, 2024) provides empirical evidence for a single-parameter scaling law: $\text{BPC}(C) = \alpha \log C + \beta$ with $\alpha = -0.031$, $\beta = 0.572$ (here $\alpha$ and $\beta$ denote the slope and intercept of the fit, not the power-law exponents above). Under this law, model compression (bits-per-character, BPC) depends almost exclusively on total compute $C = N D$, not on the split between $N$ and $D$. Any $(N, D)$ satisfying $C = N D$ is predicted to achieve the same final compression. This unified law implies no intrinsic trade-off: for fixed $C$, practitioners are free to select any $(N, D)$ according to inference constraints.
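The split-invariance can be checked directly from the formula. The sketch below is illustrative only: it assumes the logarithm is natural (consistent with a fit on $\ln(ND)$) and uses arbitrary units for $N$ and $D$, since the units behind the quoted intercept are not stated here. The function names and the example splits are hypothetical.

```python
import math

ALPHA = -0.031  # slope quoted in the text (Guo, 2024)
BETA = 0.572    # intercept quoted in the text

def predicted_bpc(n: float, d: float) -> float:
    """Log-linear law: BPC = ALPHA * ln(N * D) + BETA (units of N and D are assumed)."""
    return ALPHA * math.log(n * d) + BETA

def bpc_change(compute_ratio: float) -> float:
    """Predicted change in BPC when total compute is multiplied by compute_ratio.

    This difference does not depend on the units of N and D.
    """
    return ALPHA * math.log(compute_ratio)

# Three hypothetical splits of the same budget C = N * D = 16 (arbitrary units).
splits = [(1.0, 16.0), (4.0, 4.0), (16.0, 1.0)]
predictions = [predicted_bpc(n, d) for n, d in splits]

print(predictions)        # identical up to floating-point rounding
print(bpc_change(10.0))   # predicted BPC change for a 10x compute increase
```

Under these assumptions, every tenfold increase in compute lowers the predicted BPC by about 0.07, regardless of how the budget is split.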

2. Compute Allocation Strategies, Data Regimes, and Inference Efficiency

The unified log-linear law (Guo, 2024) motivates two distinct regimes:

  • Data-rich regime: When abundant high-quality web data is available ($D < D_{\rm max}$), minimizing $N$ reduces inference cost while maximizing $D$ absorbs the remaining compute. Thus, optimal allocation is achieved by using the smallest $N$ compatible with latency/throughput requirements, dedicating the maximum possible compute to more data.
  • Data-exhaustion regime: Once $D$ saturates available web-scale data (typically $D_{\rm lim}\approx 4$ to $10$T tokens), only increasing $N$ (and thus $C$) yields further gains.

Crucially, inference cost grows with $N$, whereas $D$ affects only training cost. For deployment-oriented pretraining, shifting compute toward larger $D$ therefore preserves the loss predicted by the law while improving inference efficiency. In contrast, the Chinchilla law ($N \sim D$) constrains $D/N$ to a constant and so cannot exploit this freedom to move along the iso-compute curve (Hoffmann et al., 2022; Guo, 2024).
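The two regimes amount to a simple decision rule. The following is a minimal sketch, assuming the proxy $C = N D$ in arbitrary units; `allocate`, `n_deploy`, and `d_limit` are hypothetical names introduced for illustration, not part of any cited method.

```python
def allocate(compute: float, n_deploy: float, d_limit: float):
    """Split compute = N * D following the data-rich / data-exhaustion regimes.

    n_deploy: hypothetical model size chosen from deployment constraints.
    d_limit:  hypothetical cap on usable training tokens.
    """
    d = compute / n_deploy
    if d <= d_limit:
        # Data-rich: keep the small model and spend the budget on tokens.
        return n_deploy, d, "data-rich"
    # Data-exhaustion: hold D at the cap and let N absorb the remaining compute.
    return compute / d_limit, d_limit, "data-exhaustion"

small_budget = allocate(compute=100.0, n_deploy=2.0, d_limit=80.0)
large_budget = allocate(compute=400.0, n_deploy=2.0, d_limit=80.0)
print(small_budget)  # (2.0, 50.0, 'data-rich')
print(large_budget)  # (5.0, 80.0, 'data-exhaustion')
```

In the second call the requested model size would need 200 units of data, more than the cap of 80, so the model grows to 5 units instead.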

3. Derivation and Empirical Validation

The unified log-linear law arises from regression of BPC on $\ln(N D)$, fit to open-source transformer models (e.g., Llama 1/2/3, Falcon, Qwen, DeepSeek, Yi, Mistral). Models such as Llama 3, trained with $\sim10\times$ more data than Chinchilla recommends, nevertheless lie on the same log-linear trend, which argues against any requirement for special balancing. Models trained with extreme $D/N$ ratios do not deviate systematically from the fit, supporting the thesis that $C$ alone governs final compression (Guo, 2024).
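The regression itself is ordinary least squares on $\ln(ND)$. The sketch below shows the mechanics only: the data rows are SYNTHETIC, generated from the quoted coefficients rather than taken from any real model, and the helper name `fit_log_linear` is hypothetical.

```python
import math

def fit_log_linear(points):
    """Ordinary least squares fit of BPC = slope * ln(N * D) + intercept."""
    xs = [math.log(n * d) for n, d, _ in points]
    ys = [bpc for _, _, bpc in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# SYNTHETIC (N, D, BPC) rows in arbitrary units; not real model measurements.
synthetic = [
    (n, d, -0.031 * math.log(n * d) + 0.572)
    for n, d in [(1, 1), (7, 2), (13, 2), (70, 2), (8, 15), (70, 15)]
]
slope, intercept = fit_log_linear(synthetic)
print(slope, intercept)  # recovers the generating coefficients
```

With real measurements, the residuals of models with unusual $D/N$ ratios are what would test the claim that the split does not matter.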

4. Comparative Perspective: Power-Law Versus Log-Linear Scaling

| Allocation Law | $C$-Dependence | $D/N$ Constraint | Optimum Structure |
|---|---|---|---|
| Chinchilla (Power-Law) | $N, D \propto C^{1/2}$ | $D/N = \text{const}$ | Equal scaling: $N$ and $D$ both increase as $C^{1/2}$ |
| Unified (Log-Linear) (Guo, 2024) | $N \times D = C$ | $D/N$ unconstrained | Any $N, D$ with $N D = C$, allowing arbitrary splits |

Compared with Chinchilla and other two-term power-law models (Hoffmann et al., 2022; Nayak et al., 2024), the log-linear law is reported to fit current-generation transformers better across practical compute ranges. However, it remains to be tested in extreme asymptotic settings.

5. Skills, Validation, and Sensitivity to Task

Skill-specific analysis demonstrates that optimal allocation is sensitive to the downstream evaluation mix. Empirical scaling exponents differ sharply between, e.g., knowledge-based QA (capacity-hungry, $\beta/\alpha \gg 1$) and code generation (data-hungry, $\alpha/\beta \gg 1$), with up to a 50% swing in the optimal $N^*$ depending on validation set composition (Roberts et al., 13 Mar 2025). Thus, correct allocation must be informed by application-specific priorities, and practitioners must align validation metrics accordingly. If the goal is to "satisfice" across heterogeneous skills, blending pretraining data (e.g., a $2{:}1$ code:knowledge mix) brings the optimal $N$ for the critical subskills into closer agreement.
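The direction of this sensitivity can be seen from the closed-form optimum of the two-term power law. The sketch below assumes $C = N D$ in arbitrary units and uses HYPOTHETICAL exponent pairs chosen only to contrast a capacity-hungry skill with a data-hungry one; they are not fitted values from any paper.

```python
def optimal_split(compute, a, b, alpha, beta):
    """Closed-form minimizer of a*N**-alpha + b*D**-beta subject to N*D = compute."""
    n_star = (alpha * a / (beta * b)) ** (1.0 / (alpha + beta)) \
        * compute ** (beta / (alpha + beta))
    return n_star, compute / n_star

def loss(n, compute, a, b, alpha, beta):
    """Reducible loss along the iso-compute curve D = compute / N."""
    return a * n ** -alpha + b * (compute / n) ** -beta

C = 1e6  # arbitrary compute units

# HYPOTHETICAL exponents, for illustration only.
n_knowledge, d_knowledge = optimal_split(C, 1.0, 1.0, alpha=0.2, beta=0.6)  # beta/alpha > 1
n_code, d_code = optimal_split(C, 1.0, 1.0, alpha=0.6, beta=0.2)            # alpha/beta > 1

print(n_knowledge, d_knowledge)  # larger model, fewer tokens
print(n_code, d_code)            # smaller model, more tokens
```

At the same budget, the capacity-hungry exponents favor a much larger $N^*$ than the data-hungry ones, which is why the validation mix used to fit the exponents changes the recommended model size.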

6. Compute-Optimal Allocation in Broader Contexts

Beyond language modeling, compute-optimal allocation extends to diverse settings:

  • Data selection under compute constraints: The joint minimization of data-selection and training compute reveals that, at small budgets, lexical or embedding-based scoring dominates, while computationally heavy selection (perplexity- or gradient-based) is only optimal when the training model is much larger (at least $5\times$ or $10\times$) than the scorer (Yin et al., 2024). Compute should be prioritized for cheaper selection and more tokens until this threshold is crossed.
  • Inference compute allocation: OSCA algorithmically divides a fixed sample budget across a set of LLM inference configurations (model, temperature, language, etc.) using a convex program, leading to order-of-magnitude compute savings over best single configurations (Zhang et al., 2024).
  • RL: Compute-optimal scaling in value-based deep RL involves allocating compute among model size, update-to-data ratio, and batch size. The empirical trade-off is governed by phenomena such as TD-overfitting; power laws for the optimal allocation are derived for Pareto-efficient sample and compute use (Fu et al., 20 Aug 2025).
  • Distributed computing: Universal, deterministic allocation of data and subfunction loads (as a $d$-uniform hypergraph edge partitioning) achieves order-optimal communication and computation scaling, e.g., $n/N^{1/d}$ communication cost and bounded load imbalance, via interweaved clique design (Maheri et al., 9 Jan 2026).
  • Wireless/federated resource allocation: In edge scenarios, joint model adapter compression and local compute allocation is solved via fractional programming and KKT updates under system/energy/delay constraints (Wang et al., 2024). Dinkelbach’s method yields superlinear convergence to the utility/consumption optimum.
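Dinkelbach's method, mentioned in the last item, maximizes a ratio $f(x)/g(x)$ by repeatedly solving the parametric problem $\max_x f(x) - \lambda g(x)$ and updating $\lambda$ to the current ratio. The sketch below applies it to a TOY utility/consumption ratio chosen for its closed-form inner step; it is not the formulation of Wang et al., and all function names and constants are hypothetical.

```python
import math

def dinkelbach(f, g, argmax_param, x0, tol=1e-10, max_iter=50):
    """Maximize f(x) / g(x), with g > 0, by Dinkelbach's iteration.

    argmax_param(lam) must return a maximizer of f(x) - lam * g(x).
    """
    x = x0
    lam = f(x) / g(x)
    for _ in range(max_iter):
        x = argmax_param(lam)
        gap = f(x) - lam * g(x)  # non-negative; reaches zero at the optimal ratio
        lam = f(x) / g(x)
        if gap < tol:
            break
    return x, lam

# Toy problem: utility log(1 + x) per unit of consumption (x + 1), with 0 <= x <= 10.
X_MAX = 10.0
utility = lambda x: math.log(1.0 + x)
consumption = lambda x: x + 1.0

def inner_argmax(lam):
    # d/dx [log(1 + x) - lam * (x + 1)] = 0  gives  x = 1/lam - 1, clipped to the box.
    return min(max(1.0 / lam - 1.0, 0.0), X_MAX)

x_star, lam_star = dinkelbach(utility, consumption, inner_argmax, x0=1.0)
print(x_star, lam_star)
```

For this toy ratio the optimum is known analytically ($x^* = e - 1$ with ratio $1/e$), which makes it a convenient check that the iteration converges in a handful of steps.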

7. Caveats, Assumptions, and Limitations

The unified computational scaling law (Guo, 2024) is strictly empirical for transformer-based autoregressive LLMs. Deviations may occur:

  • For architectures other than transformers, or with task-specific objective functions outside next-token prediction.
  • At extreme values of $N$ or $D$, where the log-linear law has not been tested.
  • When data quality or heterogeneity is significant: higher-quality data can shift optimal allocations (as shown by DeepSeek, 2024).
  • BPC is not a comprehensive measure of downstream or emergent abilities.
  • The $N D$-based compute proxy omits model depth, sequence length, optimizer, and pipeline parallelization factors.

Regardless, for practical compute budgets and modern open-source LMs, allocating compute primarily based on the total product $N D$, guided by inference constraints and data availability, is the new empirical optimum. The previous practice of fixed $D/N$ balancing is now superseded except under data exhaustion or specialized skill-optimized scenarios.

