Multi-Draw Surrogacy Theorems

Updated 23 June 2026

Multi-Draw Surrogacy Theorems are statistical results that use averages of multiple stochastic surrogate draws (e.g., from LLMs) to estimate treatment effects.
They rely on rigorous assumptions such as surrogacy and comparability to map surrogate outputs to human outcomes in randomized experiments.
Increasing the number of surrogate draws reduces attenuation bias and variance, enhancing efficiency in A/B testing, behavioral simulations, and safety-critical systems.

Multi-Draw Surrogacy Theorems are a class of statistical results that formalize the conditions and procedures under which averages of multiple independent surrogate outcomes (such as repeated draws from a stochastic model like an LLM) provide valid estimates of treatment effects on target outcomes of interest, such as human responses in randomized experiments. These theorems are central to the emerging methodology of using computational surrogates, particularly LLMs, to accelerate experimentation in fields like causal inference, A/B testing, and safety-critical system validation, while rigorously quantifying the implications of model stochasticity and calibration.

1. Formal Framework for Surrogate-Based Causal Inference

The surrogacy framework considers experiments in which a randomized treatment $W \in \{0,1\}$ is assigned alongside observed covariates $X$. The target (e.g., human) outcome under treatment $W$ is denoted $Y_h = Y(W)$, while the surrogate outcome $Y_s$ represents an observable generated or predicted by a surrogate model (e.g., the output of an LLM given the same stimulus) (Persson et al., 15 Jun 2026). The population of interest may be split into a reference sample $P=0$, in which both $Y_h$ and $Y_s$ are observed, and a deployment sample $P=1$, in which only $Y_s$ is observable.

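Identification of the average treatment effect (ATE) on the target outcome, $\tau = \mathbb{E}[Y(1) - Y(0)]$, via surrogacy requires two principal assumptions:

Surrogacy (Prentice Criterion): $Y_h \perp W \mid (Y_s, X)$, ensuring conditional independence of $Y_h$ from $W$ given $(Y_s, X)$.
Comparability: $Y_h \perp P \mid (Y_s, X)$ with appropriate support overlap, enabling valid transport of the surrogate calibration across samples.

The key quantity is the calibrated mapping $\mu(s, x) = \mathbb{E}[Y_h \mid Y_s = s, X = x, P = 0]$. Under both surrogacy and comparability, the plug-in estimator

$\hat{\tau} = \frac{1}{n_1} \sum_{i: W_i = 1} \hat{\mu}(Y_{s,i}, X_i) - \frac{1}{n_0} \sum_{i: W_i = 0} \hat{\mu}(Y_{s,i}, X_i)$

where $n_w$ is the number of deployment-sample units in arm $w$ and $\hat{\mu}$ is the calibration fitted on the reference sample, identifies the human ATE from surrogate-only experiments (Persson et al., 15 Jun 2026).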
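The calibrate-then-plug-in procedure can be sketched in a few lines. This is a minimal illustration on synthetic data, not the procedure of the cited work: the linear calibration, the data-generating process, and the helper names `fit_calibration`, `plug_in_ate`, and `simulate` are all assumptions made for the example.

```python
import numpy as np

def design_matrix(y_s, x):
    """Stack an intercept, the surrogate outcome, and the covariate."""
    return np.column_stack([np.ones_like(y_s), y_s, x])

def fit_calibration(y_s, x, y_h):
    """Least-squares fit of the human outcome on (1, Y_s, X) in the reference sample."""
    coef, *_ = np.linalg.lstsq(design_matrix(y_s, x), y_h, rcond=None)
    return coef

def plug_in_ate(coef, y_s, x, w):
    """Difference in mean calibrated predictions between treated and control units."""
    mu_hat = design_matrix(y_s, x) @ coef
    return mu_hat[w == 1].mean() - mu_hat[w == 0].mean()

def simulate(n, rng):
    """Synthetic data (assumed): W affects Y_h only through Y_s, so surrogacy holds."""
    x = rng.normal(size=n)
    w = rng.integers(0, 2, size=n)
    y_s = 1.0 * w + 0.5 * x + rng.normal(scale=0.5, size=n)
    y_h = 2.0 * y_s + x + rng.normal(scale=0.5, size=n)
    return x, w, y_s, y_h

rng = np.random.default_rng(0)
x_ref, w_ref, ys_ref, yh_ref = simulate(20_000, rng)  # reference sample (P=0)
x_dep, w_dep, ys_dep, _ = simulate(20_000, rng)       # deployment sample (P=1), Y_h discarded

coef = fit_calibration(ys_ref, x_ref, yh_ref)
ate_hat = plug_in_ate(coef, ys_dep, x_dep, w_dep)
print(f"plug-in ATE estimate: {ate_hat:.3f} (true effect in this simulation: 2.0)")
```

Because both samples come from the same simulated process, comparability holds by construction here; in practice the calibration would typically be a more flexible regressor and the assumptions would need to be checked.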

2. Stochastic Surrogates and Multi-Draw Averaging

Many surrogate models, including LLMs, produce stochastic outcomes: repeated queries with the same input $(W, X)$ yield independent realizations $Y_s^{(1)}, \dots, Y_s^{(K)}$ drawn from the conditional distribution $F_{Y_s \mid W, X}$. This introduces both noise and bias relative to the latent expectation $m(W, X) = \mathbb{E}[Y_s \mid W, X]$ (Persson et al., 15 Jun 2026). The multi-draw surrogacy theorems formalize several key results:

Identification with Infinite Replication: Conditioning on the average surrogate $\bar{Y}_s^{(K)} = \frac{1}{K} \sum_{k=1}^{K} Y_s^{(k)}$ recovers the identification property as $K \to \infty$. Specifically, if $Y_h$ and $W$ are independent given $(m(W, X), X)$, then surrogacy holds in the limit of infinite surrogate draws [Proposition 2, (Persson et al., 15 Jun 2026)].
Attenuation Bias and Its Correction: For finite $K$, calibration regressions of $Y_h$ on noisy $\bar{Y}_s^{(K)}$ exhibit classical errors-in-variables attenuation. If $\sigma_m^2$ is the signal variance and $\sigma_\varepsilon^2$ is the surrogate's conditional noise variance, the reliability $\lambda_K = \sigma_m^2 / (\sigma_m^2 + \sigma_\varepsilon^2 / K)$ approaches 1 as $K$ increases. The resulting bias and variance inflation disappear as $K \to \infty$ [Proposition 3, (Persson et al., 15 Jun 2026)].
Variance Decomposition: The estimator $\hat{\tau}$ based on the calibrated surrogate exhibits increased variance, with effective sample size $n_{\text{eff}} = n / (1 + \rho / K)$, where $\rho = \sigma_\varepsilon^2 / \sigma_m^2$ is the noise-to-signal ratio. Increasing $K$ thus directly boosts statistical efficiency (Persson et al., 15 Jun 2026).
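The attenuation result can be illustrated with a small simulation. The sketch below assumes the classical errors-in-variables form of the reliability, $\lambda_K = \sigma_m^2 / (\sigma_m^2 + \sigma_\varepsilon^2 / K)$, and an effective sample size of $n / (1 + \rho / K)$ with $\rho = \sigma_\varepsilon^2 / \sigma_m^2$. Both formulas, the Gaussian data-generating process, and the function names are assumptions of this example rather than statements taken from the cited propositions.

```python
import numpy as np

def reliability(signal_var, noise_var, k):
    """Share of the K-draw average's variance that is signal (assumed classical form)."""
    return signal_var / (signal_var + noise_var / k)

def effective_sample_size(n, signal_var, noise_var, k):
    """Assumed form n / (1 + rho / K), with rho the noise-to-signal ratio."""
    rho = noise_var / signal_var
    return n / (1.0 + rho / k)

def calibration_slope(k, n=50_000, signal_var=1.0, noise_var=4.0, seed=0):
    """OLS slope of the human outcome on the average of k noisy surrogate draws."""
    rng = np.random.default_rng(seed)
    m = rng.normal(0.0, np.sqrt(signal_var), size=n)            # latent surrogate mean
    draws = m[:, None] + rng.normal(0.0, np.sqrt(noise_var), size=(n, k))
    y_bar = draws.mean(axis=1)                                   # multi-draw average
    y_h = m + rng.normal(0.0, 0.5, size=n)                       # true slope on m is 1
    cov = np.cov(y_bar, y_h)
    return cov[0, 1] / cov[0, 0]

for k in (1, 4, 16):
    print(f"K={k:2d}  simulated slope={calibration_slope(k):.3f}  "
          f"reliability={reliability(1.0, 4.0, k):.3f}")
```

In this setup the simulated slope should track the reliability rather than the true slope of 1, and both move toward 1 as the number of draws grows.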

3. Falsification, Partial Identification, and Robustness Bounds

Surrogacy and comparability are empirically falsifiable but not verifiable for novel treatments or domains. Multi-draw surrogacy estimation incorporates several auxiliary diagnostic and bounding techniques:

Falsification of Surrogacy: Historical data ($P=0$) with both $Y_h$ and $Y_s$ permit the construction of testable moment conditions, validating whether the surrogacy relation holds for observed treatments (Persson et al., 15 Jun 2026). Passing these tests is necessary but not sufficient for out-of-sample validity.
Worst-Case Bias Bound under Limited Overlap: When distributions of $(Y_s, X)$ differ between experimental ($P=0$) and deployment ($P=1$) samples, the maximum bias in transported ATE is bounded by

$|\text{bias}| \le 2B \, (\mathrm{TV}_1 + \mathrm{TV}_0)$

where $B$ bounds the calibration function, $|\mu(s, x)| \le B$, and $\mathrm{TV}_w$ is the total variation distance between $Q_0^{(w)}$ and $Q_1^{(w)}$, the distributions of $(Y_s, X)$ in the two samples, for arm $w$ [Proposition 4, (Persson et al., 15 Jun 2026)]. This bound holds for any calibration and does not rely on surrogacy, providing distribution-free partial identification.
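A rough numerical illustration of the overlap bound follows. It assumes the bound has the standard form $2B(\mathrm{TV}_1 + \mathrm{TV}_0)$ for a calibration bounded in absolute value by $B$, with total variation computed on binned (discrete) distributions. The bin probabilities and the function names are hypothetical.

```python
import numpy as np

def total_variation(p, q):
    """Total variation distance between two discrete distributions on the same bins."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return 0.5 * np.abs(p - q).sum()

def worst_case_bias_bound(b, tv_treated, tv_control):
    """Assumed bound 2 * B * (TV_1 + TV_0) on the bias of the transported ATE."""
    return 2.0 * b * (tv_treated + tv_control)

# Hypothetical binned surrogate distributions per sample and arm.
ref_treated, dep_treated = [0.2, 0.5, 0.3], [0.3, 0.4, 0.3]
ref_control, dep_control = [0.4, 0.4, 0.2], [0.45, 0.35, 0.2]

tv_1 = total_variation(ref_treated, dep_treated)
tv_0 = total_variation(ref_control, dep_control)
bound = worst_case_bias_bound(1.0, tv_1, tv_0)

# Sanity check of the underlying inequality for one arbitrary function with |f| <= 1:
# the shift in its mean between samples cannot exceed 2 * B * TV.
rng = np.random.default_rng(0)
f = rng.uniform(-1.0, 1.0, size=3)
shift = abs(np.dot(ref_treated, f) - np.dot(dep_treated, f))
assert shift <= 2.0 * 1.0 * tv_1 + 1e-12

print(f"TV_1={tv_1:.3f}  TV_0={tv_0:.3f}  worst-case bias bound={bound:.3f}")
```

The bound grows linearly in both the calibration range and the distributional shift, so it becomes uninformative quickly when overlap is poor.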

4. Practical Calibration and Empirical Performance

Empirical studies confirm both the attenuation and variance phenomena predicted by the theory and the corrective power of multi-draw averaging and flexible nonparametric calibration:

In Upworthy headline experiments, raw surrogate ATEs from single LLM draws are severely attenuated and noisy. Calibrated regressions using averages of multiple LLM draws recover the human ATE within sampling error, with nonparametric regressors (random forests, gradient-boosted trees) outperforming linear models (Persson et al., 15 Jun 2026).
Variance reductions, as predicted, scale with the number of draws $K$. Attenuation bias is rapidly eliminated even with modest replication. Diagnostic tests sometimes reveal persistent violations of surrogacy in certain prompt or model settings or for novel interventions.
Diagnostics including hold-out moment testing, overlap quantification, positive control injections, and memorization probes are essential for trustworthy surrogate estimation.
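One simple form a hold-out moment test could take is sketched below: if surrogacy holds, the calibration residual $Y_h - \hat{\mu}(Y_s, X)$ should have the same mean in both treatment arms. The two-sample z-test, the simulated data, and the function name `residual_moment_test` are assumptions of this example, not the specific diagnostics of the cited work.

```python
import numpy as np
from math import erfc, sqrt

def residual_moment_test(residuals, w):
    """Two-sample z-test for equal mean calibration residuals across treatment arms."""
    r1, r0 = residuals[w == 1], residuals[w == 0]
    diff = r1.mean() - r0.mean()
    se = sqrt(r1.var(ddof=1) / r1.size + r0.var(ddof=1) / r0.size)
    z = diff / se
    p_value = erfc(abs(z) / sqrt(2.0))  # two-sided p-value under a normal approximation
    return diff, z, p_value

rng = np.random.default_rng(1)
n = 4000
w = rng.integers(0, 2, size=n)
y_s = 1.0 * w + rng.normal(size=n)
noise = rng.normal(size=n)

mu_hat = 2.0 * y_s                        # stand-in for a calibration fitted elsewhere
y_h_valid = 2.0 * y_s + noise             # W acts only through Y_s: surrogacy holds
y_h_leaky = 2.0 * y_s + 0.5 * w + noise   # W also acts directly: surrogacy fails

diff_valid, z_valid, p_valid = residual_moment_test(y_h_valid - mu_hat, w)
diff_leaky, z_leaky, p_leaky = residual_moment_test(y_h_leaky - mu_hat, w)
print(f"valid surrogate: diff={diff_valid:.3f}, p={p_valid:.3f}")
print(f"leaky surrogate: diff={diff_leaky:.3f}, p={p_leaky:.2e}")
```

A small p-value counts as evidence against surrogacy for the treatments in the hold-out data; a large one does not establish validity for new treatments.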

5. Design Implications and Theoretical Limitations

Multi-Draw Surrogacy Theorems provide rigorously justified procedures for using computational surrogates in causal inference and system validation, subject to specific caveats:

Prompt and Temperature Tuning: Prompt formulation and model temperature modulate both the informativeness (signal) and randomness (noise) of LLM surrogates. Optimal design maximizes the reliability ratio $\lambda_K$, the share of the averaged surrogate's variance that is attributable to signal. Empirically, a modest number of draws per condition is generally sufficient for practical de-attenuation (Persson et al., 15 Jun 2026).
Unverifiable Out-of-Sample Validity: Even if surrogacy is not falsified on historical data, validity for new treatments or external populations is fundamentally untestable absent new human evaluation (Persson et al., 15 Jun 2026).
Long-Term Outcome Surrogacy: For long-term effects, multi-stage surrogacy (LLM → short-term, short-term → long-term) is necessary. This compounds the assumptions and increases susceptibility to violation.
Limited Generalization: The results rely on the foundation model’s alignment and the calibration dataset’s coverage. Out-of-distribution queries, low-overlap cases, or foundation models with insufficient supervised alignment may invalidate surrogacy or drastically inflate bias bounds.
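For the design question of how many draws to request per condition, a back-of-the-envelope calculation is possible once a noise-to-signal ratio has been estimated. The sketch below assumes the classical reliability $\lambda_K = 1 / (1 + \rho / K)$, which rearranges to $K \ge \rho \lambda^* / (1 - \lambda^*)$ for a target reliability $\lambda^*$. The function name and the example values are hypothetical.

```python
import math

def draws_for_reliability(noise_to_signal, target):
    """Smallest K with 1 / (1 + rho / K) >= target, under the assumed reliability form."""
    if not 0.0 < target < 1.0:
        raise ValueError("target reliability must lie strictly between 0 and 1")
    if noise_to_signal < 0.0:
        raise ValueError("noise-to-signal ratio must be non-negative")
    required = noise_to_signal * target / (1.0 - target)
    # Subtract a tiny tolerance so floating-point round-off does not add an extra draw.
    return max(1, math.ceil(required - 1e-9))

for rho in (0.5, 4.0, 10.0):
    k = draws_for_reliability(rho, 0.8)
    print(f"rho={rho:4.1f}: K={k:3d} gives reliability {1.0 / (1.0 + rho / k):.3f}")
```

Because the required number of draws scales linearly with the noise-to-signal ratio, lowering surrogate noise through prompt or temperature choices reduces the replication needed for a given reliability.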

6. Applications and Broader Context

Multi-draw surrogacy theory enables a principled methodology for accelerating and reducing the cost of randomized experiments in areas where human evaluation is infeasible or expensive. It is particularly influential in the context of LLM-based behavioral simulation, fast A/B testing, and surrogate-based validation in safety-critical domains (Persson et al., 15 Jun 2026). The same principles underlie gradient-based falsification with neural surrogates in control and cyber-physical systems (CPS) domains, where differentiable surrogate models enable efficient optimization, though these fields typically address deterministic surrogates (Kötz et al., 8 May 2026, Kundu et al., 6 May 2025).

Within machine learning and causal inference, multi-draw surrogacy represents a minimal and testable relaxation of full distributional equivalence, providing a balance between efficiency and robustness as long as the implied assumptions are empirically justified and carefully diagnosed.