Predictable Capability Gains with Scale

Updated 28 February 2026

The paper demonstrates that scaling laws, expressed as power-law and sigmoidal functions, reliably forecast ML performance improvements across various domains.
It details a systematic methodology including data sharding, hyperparameter sweeps, and Pareto front analysis to identify optimal resource allocation thresholds.
Practical insights highlight that predictable scaling informs efficient planning for data, model, and compute resources while acknowledging limits such as irreducible error saturation.

Predictable capability gains with increasing scale refer to the empirically robust phenomenon that, across a wide range of machine learning domains, model performance can be forecast as a simple function of scale—whether in terms of dataset size, model parameters, or computational expenditure—following specific mathematical scaling laws. This regularity enables practitioners to set realistic expectations for progress by scaling data, model size, or compute, and to optimize the allocation of resources for a desired performance level. Predictability of capability gains is now established not only in canonical supervised domains (image, language, and speech), but also in complex settings including end-to-end autonomous driving, reinforcement learning, symbolic regression, and even chart understanding and model merging.

1. Core Scaling Laws: Power-laws and Sigmoid Boundaries

The central mathematical framework grounding predictable capability gains is built on power-law and monotone-sigmoidal relationships:

Power-law error decay: For data size $N$ in supervised domains, the held-out error typically follows

$E(N) = a \cdot N^{-\alpha} + E_\infty$

where $\alpha$ is the scaling exponent, $E_\infty$ is the irreducible error floor (Bayes/annotation noise), and $a$ is a domain-specific prefactor. This form is observed in machine translation, language modeling, image classification, and speech recognition, with exponents $\alpha$ stable within each task (e.g., $\alpha \approx 0.128$ for translation, $\alpha \approx 0.066$ for word-level LM, $\alpha \approx 0.309$ for ImageNet top-1) (Hestness et al., 2017).

Sublinear model-size scaling: The parameter count $S(N)$ required for best-fit at data size $N$ scales sublinearly,

$S(N) = b \cdot N^\beta,\quad \beta < 1$

with $\beta$ in the range 0.57–0.92 depending on task and optimizer (Hestness et al., 2017).

Saturating sigmoid boundaries: As scale increases, performance on some tasks saturates following a quantile boundary,

$Q_\tau(F) = y_0 + L \cdot \sigma(a + \beta \cdot \log_{10} F)$

where $\sigma(t) = 1/(1+e^{-t})$ , and $F$ is the pretraining FLOPs; $y_0$ , $L$ , $a$ , $\beta$ are fitted parameters (Zhang et al., 17 Feb 2026).

These relationships extend cleanly across various domains and are validated by high-quality fits ( $R^2 > 0.97$ ) over multiple orders of magnitude in scale.

2. Methodological Foundations and Empirical Protocols

Predictability is established by standardized experimental methodology:

Sharding and hyperparameter search: Datasets are partitioned across several orders of magnitude in size; for each size, light hyperparameter sweeps determine the smallest model that overfits the data but minimizes held-out loss (Hestness et al., 2017).
Pareto front analysis: In RL and resource tradeoff settings (data vs. compute vs. UTD ratio in deep RL), combinations that yield frontier performance at given budget constraints are identified and fit by low-parameter power laws (Rybkin et al., 6 Feb 2025).
Quantile regression and coverage validation: For performance surfaces showing saturation or breakthroughs, monotone sigmoid curves are fit to upper quantiles and validated for reliability out-of-distribution, often across successive generations of models (Zhang et al., 17 Feb 2026).
Multimodal low-rank capability mapping: Observational approaches synthesize measurements from hundreds of public checkpoints and families into a low-dimensional “capability space,” onto which downstream performance maps linearly after log-transforming compute (Ruan et al., 2024).

This systematized approach allows for the robust forecasting of capability gains at previously untested scales in diverse application settings.

3. Domain-General Predictability and Modality-Specific Variations

Scaling law regularity applies across modalities, often with domain-specific exponents:

Domain/Task	Exponent $\alpha$ , $\beta$ , $c$ , etc.	Best-Fit Formulation	Reference
Machine translation (seq2seq+attn)	$\alpha \approx 0.128$	$E(N) = a N^{-\alpha} + E_\infty$	(Hestness et al., 2017)
ImageNet classification (ResNet)	$\alpha \approx 0.309$ , $\beta \approx 0.57$	$E(N) = a N^{-\alpha} + E_\infty$ , $S(N) = b N^\beta$	(Hestness et al., 2017)
Acoustic modeling (APC, transformer)	$\alpha_D = 0.01946$ , $\alpha_N = 0.01601$	$L(D) = L_\infty + (D_C/D)^{\alpha_D}$	(Droppo et al., 2021)
End-to-end autonomous driving (FDE)	$c \approx -0.396$	$E_{FDE}(N) = 1.358 N^{-0.396} + 0.543$	(Naumann et al., 6 Apr 2025)
Symbolic regression (solved rate)	$\beta \approx 0.359$	$S(C) \approx 7.9 \times 10^{-8} C^{0.359}$	(Otte et al., 30 Oct 2025)
Neural material models (EquiformerV2)	$\beta_D=0.242$ , $\beta_P=0.383$ , $\beta_C=0.339$	$L = \alpha \cdot D^{-\beta_D}$ , etc.	(Trikha et al., 26 Sep 2025)

In all cases, provided the scale is within two orders of magnitude of held-out fits, predictive error in performance extrapolation is $<10\%$ for aggregate metrics.

4. Limits, Subtleties, and Breakdown Regimes

Irreducible error saturation: All scaling laws approach an asymptotic floor ( $E_\infty$ , $L_\infty$ ), determined by Bayes-optimal error or dataset/label noise. No amount of scale can surpass this ceiling (Hestness et al., 2017, Droppo et al., 2021).
Emergence and distributional multimodality: Apparent “emergent” breakthroughs correspond not to fundamentally discrete jumps, but to transitions in the weights of bimodal or multimodal seed distributions, with the mean, standard deviation, and success modes of capability scaling smoothly with $N$ (Zhao et al., 24 Feb 2025).
Downstream unpredictability and metric discontinuity: Standard pretraining loss metrics (cross-entropy/NLL) scale smoothly, but post-hoc downstream metrics (accuracy on k-way multiple choice) may lose predictability due to argmax or thresholding; only by modeling probability scaling for each alternative (“incorrect” choices) alongside correct can downstream curves be forecast accurately (Schaeffer et al., 2024).
Domain/task-specific plateauing: For some tasks (e.g., math reasoning at high compute), capability boundaries continue to advance with architectural or data innovations, drifting beyond the original sigmoid fit’s plateau (Zhang et al., 17 Feb 2026).

5. Practical Forecasting: Planning and Resource Allocation

Predictable scaling laws enable precise planning across data, model, and compute axes:

Dataset budgeting: Given current performance and target threshold, invert the scaling law to compute the required data/model/compute increase. For

$N^* = N_0 \left( \frac{E_0 - E_\infty}{E^* - E_\infty} \right)^{1/\alpha}$

the required increase can grow exponentially with small $\alpha$ (e.g., in word-level LMs, $\alpha \approx 0.066$ ).

Model capacity matching: Sublinear growth of $S(N)$ means that doubling data only requires $2^\beta \approx 1.5\times$ more parameters for $\beta \approx 0.57$ (ImageNet).
Compute provisioning: If cost scales as $N^{1+\beta}$ , practitioners can extrapolate GPU-hours needed for target gains, revealing that diminishing error returns always accompany rapidly increasing compute (Hestness et al., 2017).
Subset selection and efficient evaluation: Monotone log-linear relations between training subset entropy/coverage and performance allow extrapolation of full-set gains from small-scale, high-diversity subset runs, thus minimizing experimental expense (Liu et al., 4 Feb 2026).

6. Special Cases: Multi-agent, Generative, and Compositional Scaling

The general scaling principles extend to specialized contexts:

Model merging: Predictable capability gains when assembling $k$ domain experts follow a law

$E[L | N, k] = L_\infty(N) + \frac{A(N)}{k + b}$

with $1/k$ merging tail and a size-dependent floor, enabling optimized planning of merge depth vs. base-scale investment (Wang et al., 29 Sep 2025).

Generative evaluations (pass@k): Scaling laws for pass-at- $k$ rates in code generation and problem solving can be forecast using compute, parameter/token, or gold-reference log-likelihood as covariates; gold-reference fits are stable even >4 orders below target, supporting robust, long-range extrapolation (Schaeffer et al., 28 Sep 2025).
Reinforcement learning: Value-based, off-policy RL exhibits a one-dimensional Pareto frontier between data and compute, indexed by the updates-to-data (UTD) ratio. Budget-optimal splits and required scale for a performance threshold follow simple, fitted power laws (Rybkin et al., 6 Feb 2025).

7. Forecasting, Uncertainty, and Future Directions

While predictable scaling law behavior is now robustly documented, several complexities and caveats remain:

Cross-family universality and capability space: Observational approaches show that, after mapping models into a latent “capability space” by principal component analysis of benchmark errors, cross-family scaling is governed by linear relationships in this space, and diverse emergent, agentic, and post-training behaviors can all be forecast from the same low-dimensional representation (Ruan et al., 2024).
Uncertainty quantification and backtesting: Predictive errors must always be reported alongside point forecasts—e.g., extrapolating BBH accuracy to $10\times$ more compute yields a mean absolute error of $\approx6$ percentage points on unseen checkpoints (Owen, 2024). Confidence bands and out-of-distribution holdout validation are required for robust operationalization.
Metric and benchmark design: For maximal scaling predictability, downstream evaluation protocols should report rich continuous metrics (log-probabilities), high $k$ in multiple-choice, and coverage of both correct and incorrect response probability scaling, to suppress non-invertible discretization artifacts (Schaeffer et al., 2024).
Limits of universality: Key empirical exponents are stable for a given task and metric but may shift under architectural breakthroughs, new pretraining regimes, or significant shifts in task composition (e.g., math reasoning frontiers (Zhang et al., 17 Feb 2026)); it is an open research question whether deeper theoretical explanations can capture and predict such epochal changes.