
Prescriptive Scaling for Machine Learning

Updated 21 February 2026
  • Prescriptive scaling is a novel approach that shifts from average-case predictions to quantile-based performance guarantees, mapping resource budgets to assured outcomes.
  • It employs a four-parameter saturating sigmoid model with smoothed quantile loss to estimate high-confidence performance curves from log-compute metrics.
  • Optimal experimental design using Fisher information and bin-balanced regularization enables near-full-data accuracy with significantly reduced evaluation overhead.

Prescriptive scaling is an emerging paradigm in machine learning and model development that reframes classical scaling-law analysis into actionable, decision-theoretic procedures for practitioners. Instead of predicting mean performance trends under controlled experimental conditions, prescriptive scaling addresses the computation of conservative, high-confidence upper quantiles of attainable performance, conditioned on contemporary post-training and deployment practices. This approach yields principled, quantile-anchored maps from resource budgets (such as pretraining FLOPs, data quantity, or model size) to guaranteed downstream performance, accommodating the full heterogeneity and drift of real-world post-training pipelines (Zhang et al., 17 Feb 2026).

1. Conceptual Shift: From Predictive to Prescriptive Scaling

Traditional scaling laws (as in Kaplan et al. 2020 and Hoffmann et al. 2022) typically describe the average case: the expectation of loss or perplexity is modeled as a smooth power law in compute, model size, or data,

$$\mathbb{E}[\text{loss}] \approx \alpha\,C^{-\gamma} + \beta,$$

valid under tightly controlled, single-recipe pretraining. However, these laws do not account for variance induced by subsequent fine-tuning, post-training recipe changes (e.g., RLHF, instruction tuning), architectural tweaks, or temporal shifts in deployed models.
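As a point of reference, the average-case law above can be evaluated directly. The constants below are illustrative placeholders, not fitted values from Kaplan et al. or Hoffmann et al.:

```python
import numpy as np

def expected_loss(C, alpha=2.0e3, gamma=0.15, beta=1.8):
    """Average-case power law E[loss] ~ alpha * C^(-gamma) + beta.

    alpha, gamma, beta are illustrative values for demonstration only.
    """
    return alpha * np.power(C, -gamma) + beta

# Diminishing returns: each 100x of compute shaves a shrinking slice off
# the reducible term, and the irreducible floor beta is never crossed.
for C in (1e20, 1e22, 1e24):
    print(f"C = {C:.0e}  E[loss] = {expected_loss(C):.3f}")
```

Note that the law says nothing about the spread around this mean, which is exactly the gap prescriptive scaling targets.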

Prescriptive scaling, by contrast, aggregates all post-trained model checkpoints—across heterogeneous leaderboards and recipes—into a single empirical population. It estimates capability boundaries by fitting upper quantiles (e.g., the $98^{\text{th}}$ percentile) of task accuracy as a function of log-compute. The core map is

$$z = \log_{10} C \;\mapsto\; q_\tau(z) \approx Q_\tau(\text{accuracy} \mid Z = z),$$

where $q_\tau(z)$ answers, for example: “If I spend $10^z$ FLOPs, what is the best accuracy I can guarantee with $98\%$ probability, under today’s ecosystem of post-training recipes?” (Zhang et al., 17 Feb 2026).

2. Mathematical Framework for Capability Boundaries

Prescriptive scaling models the conditional upper-quantile ($\tau$) boundary with a four-parameter, monotone, saturating sigmoid in $z = \log_{10}(\text{FLOPs})$:

$$q_\tau(z;\,\theta) = y_0 + L\,\sigma(a + \beta\,z), \qquad \sigma(t) = \frac{1}{1 + e^{-t}}.$$

Parameters satisfy:

  • $y_0 \in [0,1]$: baseline as $z \to -\infty$
  • $L \in [0,\, 1 - y_0]$: sigmoid height, ensuring $q_\tau(z) \leq 1$
  • $\beta \geq 0$: ensures monotonicity in $z$
  • $a$: sigmoid location

This parameterization naturally models saturation: as $z \to \infty$, $q_\tau \to y_0 + L \leq 1$; as $z \to -\infty$, $q_\tau \to y_0$.
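A minimal sketch of this boundary model (parameter values are illustrative, not fitted):

```python
import numpy as np

def q_tau(z, y0, L, a, beta):
    """Four-parameter saturating sigmoid boundary q_tau(z) = y0 + L*sigmoid(a + beta*z).

    z is log10(FLOPs); y0 in [0, 1] is the low-compute baseline, L in [0, 1 - y0]
    is the sigmoid height, beta >= 0 enforces monotonicity, and a sets the location.
    """
    return y0 + L / (1.0 + np.exp(-(a + beta * z)))

# Limits: q_tau -> y0 as z -> -inf, and q_tau -> y0 + L <= 1 as z -> +inf.
z = np.array([-100.0, 100.0])
print(q_tau(z, y0=0.25, L=0.7, a=-10.0, beta=0.5))  # values close to [0.25, 0.95]
```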

Fitting is performed via smoothed quantile (pinball) loss,

$$\mathcal{L}(\theta) = \sum_{i=1}^N \ell_\tau\big(y_i - q_\tau(z_i;\theta)\big) + \lambda\|\theta\|_2^2,$$

with

$$\ell_\tau(u) = \frac{1}{\kappa}\log\big(1 + e^{\kappa u}\big) + (\tau - 1)\,u, \qquad \kappa \approx 50, \quad \lambda \sim 10^{-3}.$$

Box constraints on $(y_0, L, \beta)$ are enforced to guarantee monotonicity and headroom (Zhang et al., 17 Feb 2026).
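A sketch of the fitting procedure, assuming SciPy's L-BFGS-B for the box-constrained minimization (the paper's exact optimizer is not specified here, and the coupled constraint $L \leq 1 - y_0$ is simplified to the independent bound $L \leq 1$):

```python
import numpy as np
from scipy.optimize import minimize

def smoothed_pinball(u, tau, kappa=50.0):
    # (1/kappa)*log(1 + exp(kappa*u)) + (tau - 1)*u, which approaches the
    # pinball loss as kappa -> inf; logaddexp(0, .) computes log1p(exp(.)) stably.
    return np.logaddexp(0.0, kappa * u) / kappa + (tau - 1.0) * u

def fit_boundary(z, y, tau=0.98, lam=1e-3):
    """Fit theta = (y0, L, a, beta) by smoothed quantile loss + L2 penalty.

    Box constraints keep y0, L in [0, 1] and beta >= 0. Simplification:
    the coupling L <= 1 - y0 is relaxed to L <= 1 in this sketch.
    """
    def objective(theta):
        y0, L, a, beta = theta
        u = y - (y0 + L / (1.0 + np.exp(-(a + beta * z))))
        return smoothed_pinball(u, tau).sum() + lam * np.dot(theta, theta)

    res = minimize(objective, x0=[0.2, 0.5, 0.0, 1.0],
                   bounds=[(0, 1), (0, 1), (None, None), (0, None)],
                   method="L-BFGS-B")
    return res.x
```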

3. Efficient Experimental Design for Boundary Estimation

Evaluating all models on all tasks is typically infeasible. Prescriptive scaling leverages optimal experimental design to select a small ($\sim 20\%$) FLOP-weighted subset of evaluations sufficient to recover boundaries with near-optimal fidelity:

  • The information-matrix approximation computes the Fisher information over candidate models given the Jacobians of $q_\tau$;
  • Bin-balanced regularization ensures the sampled subset covers the entire log-compute range;
  • The final acquisition maximizes a combined criterion (I-optimality plus bin coverage) under an overall evaluation budget via a greedy, gain-per-cost heuristic, using efficient rank-one updates for statistics.

Empirically, this subsampling achieves boundaries within 1–2% of full-data fit accuracy on typical benchmarks, with as little as 5% sampling sufficing for certain cases (Zhang et al., 17 Feb 2026).
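The selection loop can be illustrated with a simplified acquisition: a D-optimal (log-determinant) gain stands in for the paper's I-optimality criterion, a fixed bonus for empty bins stands in for bin-balanced regularization, and the information-matrix update is done naively rather than via rank-one algebra:

```python
import numpy as np

def greedy_select(J, cost, bins, budget, n_bins, ridge=1e-6, bin_bonus=0.1):
    """Greedy gain-per-cost subset selection (simplified sketch).

    J[i]    : Jacobian of q_tau w.r.t. theta at candidate i (shape [n, p])
    cost[i] : evaluation cost of candidate i (e.g. FLOP-weighted)
    bins[i] : log-compute bin index of candidate i
    Gain = increase in log det(M + ridge*I) of the information matrix
    (a D-optimal proxy), plus a bonus for covering an empty bin.
    """
    n, p = J.shape
    M = ridge * np.eye(p)
    chosen, spent = [], 0.0
    covered = np.zeros(n_bins, dtype=bool)
    remaining = set(range(n))
    while remaining:
        base = np.linalg.slogdet(M)[1]
        best, best_gain = None, -np.inf
        for i in remaining:
            if spent + cost[i] > budget:
                continue  # candidate does not fit in the remaining budget
            gain = np.linalg.slogdet(M + np.outer(J[i], J[i]))[1] - base
            if not covered[bins[i]]:
                gain += bin_bonus  # reward covering an untouched compute bin
            if gain / cost[i] > best_gain:
                best, best_gain = i, gain / cost[i]
        if best is None:
            break  # nothing affordable remains
        M += np.outer(J[best], J[best])
        covered[bins[best]] = True
        spent += cost[best]
        chosen.append(best)
        remaining.remove(best)
    return chosen
```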

4. Temporal Robustness and Monitoring Boundary Shifts

A critical feature of prescriptive scaling is temporal reliability: capability boundaries should transfer across successive model generations. Chronologically partitioning leaderboard data, fitting the boundary on period $\mathcal{P}_t$, and evaluating it on $\mathcal{P}_{t+1}$ yields two diagnostics:

  • Coverage error (deviation from the target $\tau$ in each bin)
  • Out-of-distribution quantile loss

Observations:

  • On knowledge-intensive tasks (e.g., MMLU-Pro, BBH, GPQA, MuSR), boundaries remain robust, with coverage error within $\pm 1$–$2\%$.
  • On mathematical reasoning (e.g., MATH Lvl 5) and instruction-following tasks, boundaries show persistent under-coverage in later periods, indicating an advancing frontier—the map is not yet saturated and keeps moving as new algorithms and data emerge (Zhang et al., 17 Feb 2026).
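The per-bin coverage diagnostic described above can be sketched as follows (bin edges and the exact binning scheme are assumptions of this sketch):

```python
import numpy as np

def coverage_error(z, y, q_vals, bin_edges, tau=0.98):
    """Per-bin coverage error for a fitted boundary.

    For each log-compute bin, compare the empirical fraction of points lying
    at or below the boundary against the target quantile tau. Persistent
    negative values (under-coverage) signal an advancing frontier.
    """
    which = np.digitize(z, bin_edges)
    errors = {}
    for b in np.unique(which):
        mask = which == b
        errors[int(b)] = float(np.mean(y[mask] <= q_vals[mask]) - tau)
    return errors
```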

5. Deployment Scenarios and Prescriptive Utility

The prescriptive scaling map facilitates a range of actionable workflows:

  • Budget–Performance Translation: For a target accuracy $y^\star$, invert $q_\tau(z)$ to obtain the required $z^\star = \log_{10} C^\star$. Invest $C^\star$ FLOPs, confident that $98\%$ of post-training runs will reach or exceed $y^\star$.
  • Dynamic Boundary Monitoring: Regularly re-fit on new data and monitor boundary shifts. Persistent under-coverage signals architectural advances outside the previously characterized envelope.
  • “Ceiling” and Model Family Effects: Small-model ceilings manifest as sigmoid saturation; high-accuracy requirements necessitate larger-scale pretraining if the task boundary is saturating. For knowledge-heavy tasks, smaller models may suffice with extensive post-training.
  • Efficient Benchmarking: Apply balanced I-optimal sampling to minimize task evaluation overhead, preserving performance guarantees with limited experiment budgets.
  • Contamination and Saturation Detection: Comparative shift tests across related benchmarks expose potential data contamination (post-publication leakage). Temporal analysis of the slope of $\log(\text{param count}) \mapsto \operatorname{logit}(q_\tau)$ quantifies progress towards or beyond small-model “ceilings.”
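The Budget–Performance Translation step has a closed-form inversion under the sigmoid boundary model; parameter values in the example are illustrative, not fitted:

```python
import numpy as np

def required_log_compute(y_star, y0, L, a, beta):
    """Invert q_tau(z) = y0 + L*sigmoid(a + beta*z) for a target accuracy y_star.

    Solving for z gives z* = (logit((y_star - y0)/L) - a) / beta.
    Only valid for y0 < y_star < y0 + L, i.e. below the saturation ceiling.
    """
    s = (y_star - y0) / L
    if not 0.0 < s < 1.0:
        raise ValueError("target accuracy is outside the attainable range (y0, y0 + L)")
    return (np.log(s / (1.0 - s)) - a) / beta

# Illustrative: log10 of the FLOPs needed to guarantee 80% accuracy
# at the tau-quantile boundary with the example parameters.
z_star = required_log_compute(0.80, y0=0.25, L=0.70, a=-10.0, beta=0.5)
```

A saturating boundary makes the inversion fail loudly when the target exceeds $y_0 + L$, which is exactly the small-model “ceiling” effect described above.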

6. Comparison to Prescriptive Scaling in Other Domains

Prescriptive scaling, while motivated by LLMs, has analogues in acoustic modeling (Droppo et al., 2021), generative model evaluation (Schaeffer et al., 28 Sep 2025), classification with feature normalization (Amorim et al., 2022), and clustering with shape complexity optimization (Aguilar et al., 2022). All share a focus on turning statistical prediction into resource allocation procedures:

  • In acoustic modeling, joint scaling laws prescribe $(N, D)$ (parameters, data) for a fixed compute limit using empirically fitted exponents, enforcing irreducible error floors and budget trade-offs (Droppo et al., 2021).
  • In generative model evaluations, compute-optimal allocations between parameters and data are derived via theoretically grounded envelopes of scaling laws; quantile predictions are matched to target “pass@k” rates (Schaeffer et al., 28 Sep 2025).
  • In clustering and normalization, prescriptive approaches optimize over candidate scaling transformations or feature scalings to maximize downstream task indices under explicit constraints (Amorim et al., 2022, Aguilar et al., 2022).

7. Implications and Limitations

Prescriptive scaling transforms compute budgeting from an empirical art into a data-driven, quantile-anchored protocol. It allows practitioners to engineer for high-confidence performance, monitor for boundary drift, and allocate experimental budget with maximal efficiency. A caveat is that the saturating envelope assumed in capability boundary modeling may be broken by paradigm-shifting approaches or recipe drift, as observed in advancing math-reasoning tasks. Regular updating and robust model evaluation are thus essential to maintain the validity of prescriptive projections (Zhang et al., 17 Feb 2026).

