Parametric Test-Time Scaling Laws

Updated 10 November 2025

The paper introduces explicit power-law forms that relate evaluation metrics to resource variables like compute and context length through empirical curve fitting.
It combines architectural analysis with regression techniques to derive actionable guidelines for optimal inference compute allocation across diverse models.
The framework provides practical insights for enhancing performance in time series models, large language models, and game-playing AI while addressing extrapolation and saturation challenges.

Parametric test-time scaling laws provide a quantitative framework for modeling how predictive performance, uncertainty, or efficiency metrics change as a function of compute, context length, or other controllable variables at inference rather than training time. These laws play a critical role in modern neural scaling studies, spanning domains such as time series foundation models, large reasoning models, autoregressive LLMs, world foundation models, and high-throughput agents. The foundational results combine empirical curve fitting, architectural analysis, and resource-aware modeling to reveal universal patterns and practical guidelines for allocating inference compute.

1. Mathematical Forms of Parametric Test-Time Scaling Laws

Across diverse application areas, parametric test-time scaling laws take the form of explicit functional relationships between an evaluation metric (e.g., predictive loss, accuracy, Elo, or uncertainty) and one or more resource variables exercised at inference. The canonical structure is a simple or piecewise power law, frequently parameterized as follows:

Time Series Foundation Models (TSFM) Loss Scaling (Yao et al., 16 Oct 2024):

$L(N) \approx \left( \frac{N_0}{N} \right)^{\alpha_N}, \quad L(C) \approx \left( \frac{C_0}{C} \right)^{\alpha_C}, \quad L(D) \approx \left( \frac{D_0}{D} \right)^{\alpha_D}$

where $L$ is log-likelihood loss per token; $N$ is number of parameters; $C$ is training compute; $D$ is training data volume; $\{\alpha_N, \alpha_C, \alpha_D\}$ are empirically fit exponents; $\{N_0, C_0, D_0\}$ are normalization constants.

Reasoning Model "Scaling Plateau" Laws (Wang et al., 26 May 2025):

$F(C) = F_{\max} \cdot \bigl( 1 - (1-p_x)^C \bigr)$

where $F(C)$ is expected accuracy as a function of test-time sampling/refinement budget $C$ ; $p_x$ is single-sample or roundwise success probability; $F_{\max}$ is asymptotic performance.

World Foundation Models – Power Law Score (Cong et al., 31 Mar 2025):

$P(N) \approx P_0 + k N^\beta, \qquad 0 < \beta \lesssim 0.2$

where $P(N)$ is an evaluation metric (FVD, FID) and $N$ is number of inference samples.

Game Playing Elo Laws – Train/Test Compute Frontier (Jones, 2021):

$P(S, C_{\mathrm{train}}, C_{\mathrm{test}}) = \min\left\{\max\left[ m_S^i S + m^f_i \log_{10}(C_{\mathrm{train}}) + m^t_i \log_{10}(C_{\mathrm{test}}) + c^i, m^p_S S + c^p \right], 0 \right\}$

Uncertainty Contracting Laws in Parametric Models (Rosso et al., 11 Jun 2025):

$\mathrm{Var}_{\text{epistemic}}(x) = \frac{1}{N} g(x)^\top I(\theta^*)^{-1} g(x) + O(1/N^2)$

The selection of parametric form depends on architecture, metric, and resource considered.

2. Fitting Methodology and Empirical Exponents

The empirical validation of these laws proceeds via large-scale evaluations spanning several orders of magnitude in the scaling variable(s), typically leveraging the following workflow:

Fit Range: Experiments are conducted across $>5$ orders of magnitude in parameter count, data, or compute (TSFM: $N, C, D$ ; World models: $N$ ).
Regression: Log–log regression is used to fit exponents (e.g., $\alpha_N$ for TSFM loss scaling, $\beta$ for WFM/FVD curves). For test-time plateau models, nonlinear curve fitting is performed to estimate $p_x$ and $F_{\max}$ per problem or dataset.
Goodness-of-fit: $R^2$ values above 0.97 are routinely reported (TSFM, vision scaling (Yao et al., 16 Oct 2024, Jones, 2021)), and in uncertainty contraction, $R^2 \ge 0.95$ is observed across several model/dataset pairs (Rosso et al., 11 Jun 2025).
Critical Exponents:
- For encoder-only TSFM parameter scaling, $\alpha_N \simeq 0.065$ (ID), $\simeq 0.062$ (OOD).
- WFM test-time scaling exponents $\beta \sim -0.15$ (FVD versus $N$ ).
- Reasoning models (TTSPM) do not yield power laws but exhibit logarithmic saturation as $C \to \infty$ .
Piecewise/plateau regions: For board games (Jones, 2021), three-segment fits quantify a sharp transition from compute-limited to perfect-play regions, with constant slopes numerically consistent with idealized sampling theory.

The fitting protocols include dense evaluations, out-of-sample predictions, and cross-architecture/benchmark comparisons.

3. Architectural and Task-Dependent Phenomena

Parametric test-time scaling laws are modulated by both model architecture and task regime:

Encoder-Only vs Decoder-Only (TSFM): Encoder-only transformers yield exponents $\sim$ 20% higher than decoder-only ( $\alpha_N$ 0.065 vs 0.054 ID), leading to more rapid improvement as $N$ scales.
SOTA Architectures (Moirai, Chronos): Architectural enhancements often reduce intercept but weaken exponents; Moirai improves ID intercept by $\sim$ 10% at $10$–$100$M parameters, but $\alpha_N$ drops to 0.060; Chronos shows even lower exponents and poorer OOD scaling.
Sparse vs. Dense Attention (Kinetics) (Sadhukhan et al., 5 Jun 2025): For inference regimes dominated by attention memory bandwidth, sparse attention dramatically reduces the quadratic scaling with generation length ( $L_{\mathrm{out}}$ ) in the cost function.
Task Difficulty and Training Coverage (Javanmard et al., 4 Oct 2025): In in-context learning, task hardness (quantified as $H = \mathrm{Tr}(\Lambda)/\lambda_{\min}(\Lambda)$ ) modulates the exponential rate of test-time error decay, and poor training coverage can result in "overthinking", i.e., error increase with more test-time compute.

Tabular summary of key architectural scaling differences (TSFM):

Model Type	$\alpha_N$ (ID)	$\alpha_N$ (OOD)	ID Intercept	OOD Intercept
Encoder-only	0.065	0.062	$3\times10^5$	$4\times10^5$
Decoder-only	0.054	0.050	higher	higher
Moirai	0.060	0.058	$\sim$ 10% lower	marginally better
Chronos-T5	$\leq$ 0.03	$\leq$ 0.03	lower at N $\ll$	much weaker scaling

4. Resource Allocation and Scaling Guidelines

The derived models yield actionable prescriptions for allocating inference‐time compute for maximal performance:

Optimize for ID/OOD: Model size dominates improvements in ID metrics; data volume is the main lever for OOD generalization, with recommended minima $N \ge 50$ –$100$M (encoder-only), $D \ge 1$ B points (Yao et al., 16 Oct 2024).
Marginal Returns & Saturation Points (Reasoning Plateau): TTSPM provides the closed-form saturation budget

$C_{\mathrm{sat}} = \lceil \ln (\varepsilon/(F_{\max} p_x)) / \ln (1-p_x) \rceil$

for diminishing marginal gain below tolerance $\varepsilon$ , supporting adaptive allocation (Wang et al., 26 May 2025).

Test-Time Strategies: On WFMs, modest best-of- $N$ ( $N=4$ –$8$) often matches large model inferences at lower cost (Cong et al., 31 Mar 2025); but Kinetics law suggests that, past a critical threshold in parameter count (e.g., $P_{\mathrm{thresh}}\approx$ 14B for Qwen3), allocating extra test-time compute to longer generations or more trials becomes more effective than further scaling $P$ (Sadhukhan et al., 5 Jun 2025).
Compute/Memory-Awareness: The Kinetics equivalent-FLOPs cost function penalizes memory-bound attention operations, favoring larger models and the introduction of sparse attention for extended generation (Sadhukhan et al., 5 Jun 2025).

Illustrative tradeoff table (WFM, COSMOS):

Model & Strategy	Relative FLOPs	Composite Gain vs. Baseline
4B $\times$ 1	1.0×	baseline
4B $\times$ 4	1.8×	+4–5 %
4B $\times$ 8	2.5×	+5–6 %
12B $\times$ 1	3.0×	0% (vs. 4B×1)

5. Limitations, Caveats, and Open Questions

Despite their utility, parametric test-time scaling laws rest on several assumptions and have regime-specific limits:

Extrapolation Validity: Most fits are single-factor and do not capture all interaction terms (e.g., N/C/D jointly), nor do they always generalize beyond ranges studied (e.g., WFM scaling only validated for nuScenes/Waymo, not robotics) (Yao et al., 16 Oct 2024, Cong et al., 31 Mar 2025).
Saturation & Diminishing Returns: Nearly all scaling curves show clear saturation or plateau, well-captured by logarithmic or power-law forms, but real applications may encounter abrupt breakdowns if search or process-level pruning is suboptimal.
Measurement Sensitivity: SOTA improvements on ID may reduce scaling exponents, leading to worse OOD outcomes at scale (Moirai, Chronos examples) (Yao et al., 16 Oct 2024).
Domain Dependency: Domain, task hardness, and train/test distribution match have large effects; for ICL and chain-of-thought, inadequate training task diversity can yield negative returns (worsening) as more test-time reasoning is performed (Javanmard et al., 4 Oct 2025).
Practical Implementability: Test-time scaling methods relying on fast tokenizers, sparse verifiers, or chain-of-thought require supporting infrastructure and may not transfer to all architectures (e.g., non-tokenized decoders (Cong et al., 31 Mar 2025)).

6. Implications and Practice-Oriented Prescriptions

Parametric test-time scaling laws underpin a set of practical design principles for modern foundation models:

For TSFMs: Start with encoder-only models at scale ( $N \ge 50$ –$100$M), prioritize data volume for OOD, and validate inductive architectural modifications, particularly on OOD domains (Yao et al., 16 Oct 2024).
In Reasoning: Use closed-form TTSPM (Eq. 3) to select per-task sampling/refinement budgets, adaptively trading off compute against expected marginal gain (Wang et al., 26 May 2025).
For LLM/Attention Models: Incorporate memory-aware cost (Kinetics) to allocate compute between model scaling, generation length, and number of trials, with a distinct shift to favor model size above hardware/attention bottleneck thresholds. Employ sparse attention to maintain scaling efficiency at long sequence lengths (Sadhukhan et al., 5 Jun 2025).
Test-Time Scaling in World/Foundation Models: Prefer smaller models with best-of- $N$ /beam search under tight compute budgets; introduce process-level acceleration infrastructure for practical gains (Cong et al., 31 Mar 2025).
In ICL/CoT Regimes: Balance training context length against test-time step length for exponential error decay, but monitor for training coverage gaps that can induce "overthinking," where increased k harms performance (Javanmard et al., 4 Oct 2025).

These results collectively demonstrate that parametric test-time scaling laws provide predictive—and prescriptive—power for efficient design and deployment of high-capacity neural systems, contingent on resource accounting, workload characterization, and thoughtful validation against target use cases.