Statistical Foundations of LLM-based A/B Testing: A Surrogacy Framework for Human Causal Inference

Published 15 Jun 2026 in stat.ME, cs.AI, econ.EM, and math.ST | (2606.17165v1)

Abstract: Organizations and researchers show increasing interest in using LLMs in place of human participants in A/B tests, in the hope of experimenting faster and at lower cost. We study when a treatment effect estimated on LLM outcomes recovers the effect that would have been measured on the human population of interest. Distributional equivalence between LLM and human outcomes would make any standard estimator valid but is unrealistic. We therefore develop a statistical framework that adapts surrogate endpoint theory to LLMs. The framework shows that calibrating LLM outcomes to human outcomes identifies the average treatment effect under surrogacy and comparability conditions that are jointly weaker than distributional equivalence. When these conditions fail, the effect of interest is only partially identified, and we provide diagnostics that can falsify surrogacy on historical experiments together with a bound on the worst-case bias from limited overlap. We further show that the stochasticity inherent to LLMs introduces both bias and variance, but using an average of multiple draws as the surrogate mitigates both. We illustrate the methods and theory in simulations and an application to A/B tests on Upworthy headlines. A central takeaway from our work is that the validity of LLM outcomes as surrogates can only be falsified for past treatments and never verified for new ones, so human experiments remain indispensable for novel interventions. We discuss the role of LLM choice, prompting, and temperature as design variables, and how to size human experiments for validation.


Authors (3)

Summary

The paper introduces a principled surrogacy framework grounded in surrogate endpoint theory to enable valid causal inference on human ATEs using LLM-generated outcomes.
The methodology leverages multi-draw surrogacy to mitigate LLM stochasticity, demonstrating that increasing replication reduces bias and mean squared error.
Practical diagnostics and empirical validations, including nonparametric calibration on headline A/B tests, underscore the necessity of human pilot studies.

Statistical Surrogacy Foundations for LLM-Based A/B Testing

Introduction

The use of LLMs as proxies for human participants in A/B testing has rapidly become both technically feasible and economically attractive. However, leveraging LLM outcomes for causal inference concerning human populations introduces substantial identification challenges. "Statistical Foundations of LLM-based A/B Testing: A Surrogacy Framework for Human Causal Inference" (2606.17165) develops a principled framework, grounded in surrogate endpoint theory, that rigorously specifies the assumptions, estimation strategy, and diagnostic procedures necessary for LLM-generated outcomes to support valid causal inference on human average treatment effects (ATEs).

Formal Problem Setup

The paper models experiments with randomized assignment $W\in\{0,1\}$ (control/treatment), covariates $X$, and human outcome $Y$, with $Y^*$ denoting an outcome stochastically generated by an LLM, conditioned on $(W, X)$. The LLM surrogate framework considers two samples:

The experimental (human) sample, in which $(W, X, Y^*, Y)$ are all observed;
The artificial (LLM) sample, in which only $(W, X, Y^*)$ are observed.

The fundamental question is: under what conditions can inference on the human ATE, $\tau = \mathbb{E}[Y(1) - Y(0)]$, be carried out using only LLM data?

Identification Theory: Surrogacy and Comparability

Perfect distributional equivalence between LLM and human outcomes is too strong and rarely plausible. The paper hence introduces a surrogacy perspective, mirroring classical surrogate endpoint literature, and rigorously formalizes the conditions for point identification:

Surrogacy (Prentice Criterion): $Y \perp W \mid X, Y^*$ (i.e., conditional on covariates and the surrogate, treatment has no further effect on the human outcome);
Comparability: The conditional distribution of $Y$ given $(X, Y^*)$ is invariant across human and LLM samples, i.e., $Y \perp S \mid X, Y^*$, where $S$ indexes the sample.

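Under these assumptions, the calibration function $\mu(x, y^*) = \mathbb{E}[Y \mid X = x, Y^* = y^*]$, learned on the human sample, enables unbiased estimation of the human ATE from the artificial sample via

$\tau = \mathbb{E}[\mu(X, Y^*) \mid W = 1] - \mathbb{E}[\mu(X, Y^*) \mid W = 0]$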

(Figure 1)

Figure 1: Sampling distribution of the calibrated ATE (blue) versus the raw LLM ATE (orange), illustrating removal of bias via calibration.

This result shows that systematic differences between the LLM surrogate and the human outcome are absorbed by the nonparametric calibration, which aligns LLM-based estimates with the estimand of interest.
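The calibrate-then-difference recipe can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not the paper's implementation: covariates $X$ are omitted, the calibration function is estimated with a simple binned mean (a stand-in for any flexible regressor), and the data-generating process (`simulate`, with $Y = (Y^*)^2$ plus noise) is hypothetical and chosen so that surrogacy and comparability hold by construction.

```python
import random
from statistics import mean

def fit_binned_calibration(y_star, y, n_bins=10):
    """Histogram estimate of mu(y*) = E[Y | Y* = y*] from the human sample."""
    lo, hi = min(y_star), max(y_star)
    width = (hi - lo) / n_bins or 1.0
    sums, counts = [0.0] * n_bins, [0] * n_bins
    for s, t in zip(y_star, y):
        b = min(int((s - lo) / width), n_bins - 1)
        sums[b] += t
        counts[b] += 1
    overall = mean(y)  # fallback for empty bins
    bin_means = [sums[b] / counts[b] if counts[b] else overall
                 for b in range(n_bins)]

    def mu_hat(s):
        b = int((s - lo) / width)
        return bin_means[max(0, min(b, n_bins - 1))]  # clamp out-of-range values

    return mu_hat

def calibrated_ate(mu_hat, w, y_star):
    """Difference in arm means of the calibrated surrogate."""
    treated = [mu_hat(s) for wi, s in zip(w, y_star) if wi == 1]
    control = [mu_hat(s) for wi, s in zip(w, y_star) if wi == 0]
    return mean(treated) - mean(control)

# Hypothetical data: Y depends on W only through Y*, so surrogacy holds.
rng = random.Random(0)

def simulate(n):
    w = [rng.randint(0, 1) for _ in range(n)]
    y_star = [0.5 * wi + rng.random() for wi in w]       # LLM outcome
    y = [s * s + rng.gauss(0.0, 0.1) for s in y_star]    # human outcome
    return w, y_star, y

w_h, ys_h, y_h = simulate(5000)     # human sample: Y observed
w_a, ys_a, _ = simulate(20000)      # artificial sample: Y discarded

mu_hat = fit_binned_calibration(ys_h, y_h, n_bins=20)
raw_ate = (mean([s for wi, s in zip(w_a, ys_a) if wi == 1])
           - mean([s for wi, s in zip(w_a, ys_a) if wi == 0]))
cal_ate = calibrated_ate(mu_hat, w_a, ys_a)
print(f"raw LLM ATE: {raw_ate:.3f}, calibrated ATE: {cal_ate:.3f}")
```

In this toy setup the human ATE is 0.75 by construction, while the raw LLM contrast targets 0.5; the calibrated estimate should land near the former. In practice the binned mean would be replaced by a regressor that also conditions on covariates.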

Stochasticity and Multi-Draw Surrogacy

A key technical contribution is the explicit analysis of LLM stochasticity. Because LLM outputs are inherently noisy, single-draw surrogates introduce both variance inflation and, critically, attenuation bias in the calibrated estimator, a direct analog of classical measurement error.

To address this, the paper develops and proves multi-draw surrogacy theorems. When $K$ independent LLM draws are averaged per unit, the surrogate approaches the latent conditional mean, restoring identification, and the estimator's bias vanishes as $K \to \infty$. This yields an operational guideline: increasing the replication count $K$ both improves efficiency (reducing MSE as $K$ grows) and debiases ATE estimation. These findings are supported by extensive simulation.

Figure 3: Averaging $K$ LLM draws per unit asymptotically restores surrogacy when only the latent mean contains treatment signal.

Figure 5: RMSE of the estimator drops and the mean estimate approaches the true ATE as $K$ increases, confirming variance reduction and debiasing.
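The measurement-error analogy can be made concrete with a small simulation. The sketch below is illustrative only and rests on assumptions that are not taken from the paper: a linear-Gaussian model in which treatment shifts a latent mean, each LLM draw equals that latent mean plus independent noise of standard deviation `sigma`, the human outcome depends on the latent mean alone, and calibration is a pooled linear regression. In that model the calibration slope is attenuated by the factor $\mathrm{Var}(M) / (\mathrm{Var}(M) + \sigma^2 / k)$, where $M$ is the latent mean and `k` is the number of averaged draws.

```python
import random

rng = random.Random(1)

def avg(values):
    return sum(values) / len(values)

def ols_slope(x, y):
    """Slope of the least-squares regression of y on x."""
    mx, my = avg(x), avg(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def calibrated_ate_with_k_draws(k, n=4000, tau=1.0, sigma=2.0):
    """Linear-calibrated ATE when the surrogate is the mean of k noisy draws."""
    w = [rng.randint(0, 1) for _ in range(n)]
    latent = [tau * wi + rng.gauss(0.0, 1.0) for wi in w]   # latent mean M
    y = [m + rng.gauss(0.0, 0.5) for m in latent]           # human outcome
    surrogate = [avg([m + rng.gauss(0.0, sigma) for _ in range(k)])
                 for m in latent]
    slope = ols_slope(surrogate, y)                         # calibration fit
    diff = (avg([s for wi, s in zip(w, surrogate) if wi == 1])
            - avg([s for wi, s in zip(w, surrogate) if wi == 0]))
    # With mu_hat(s) = a + b*s, the calibrated contrast is b times the
    # difference in arm means of the surrogate.
    return slope * diff

results = {k: calibrated_ate_with_k_draws(k) for k in (1, 4, 16, 64)}
for k, ate in results.items():
    print(f"k = {k:>2}: calibrated ATE = {ate:.3f} (true ATE = 1.0)")
```

With these hypothetical parameters the single-draw estimate recovers only about a quarter of the true effect, and the estimate climbs toward the true ATE as `k` grows.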

Diagnostics: Falsification and Sensitivity Analysis

The framework distinguishes between empirical diagnostics that can falsify necessary assumptions and those that merely assess plausibility:

Surrogacy Falsification Test: On held-out historical data, the fitted calibration function $\hat{\mu}$ must recover per-arm means; significant residuals indicate failure.
Comparability Sensitivity Bound: The worst-case bias from limited distributional overlap between the experimental and artificial samples is quantified by a bound (a function of the total variation distance and the outcome range), which supports a robust assessment when point identification fails.

Figure 6: Bias of the calibrated ATE scales linearly with the magnitude of the violation of surrogacy (direct effect of $W$) and of comparability (shift in calibration slope).

Figure 2: Theoretical worst-case bound holds strictly above observed ATE discrepancies under deliberate overlap violations.
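One simple way to operationalize the falsification idea is a per-arm residual check on a historical experiment. The sketch below is an assumption-laden illustration rather than the paper's test: it treats the calibration predictions as fixed (ignoring their estimation error), uses a plain z-test on the mean residual in each arm, and relies on a hypothetical simulator (`simulate_historical`) in which a nonzero `direct_effect` of treatment on the human outcome violates surrogacy.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev

def per_arm_residual_test(w, y, y_pred):
    """z-test of H0: mean(Y - mu_hat) = 0, run separately in each arm."""
    results = {}
    for arm in (0, 1):
        residuals = [yi - pi for wi, yi, pi in zip(w, y, y_pred) if wi == arm]
        avg = mean(residuals)
        z = avg / (stdev(residuals) / sqrt(len(residuals)))
        p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
        results[arm] = {"mean_residual": avg, "z": z, "p_value": p_value}
    return results

def simulate_historical(direct_effect, n=2000, seed=0):
    """Hypothetical historical experiment; direct_effect != 0 breaks surrogacy."""
    rng = random.Random(seed)
    w = [rng.randint(0, 1) for _ in range(n)]
    y_star = [0.5 * wi + rng.random() for wi in w]
    y = [2.0 * s + direct_effect * wi + rng.gauss(0.0, 0.5)
         for wi, s in zip(w, y_star)]
    # Assumed calibration function, as if fitted on earlier experiments
    # in which surrogacy held: mu_hat(y*) = 2 * y*.
    y_pred = [2.0 * s for s in y_star]
    return w, y, y_pred

ok = per_arm_residual_test(*simulate_historical(direct_effect=0.0))
violated = per_arm_residual_test(*simulate_historical(direct_effect=0.3))
for label, res in (("surrogacy holds", ok), ("surrogacy violated", violated)):
    for arm, stats in res.items():
        print(f"{label}, arm {arm}: mean residual = "
              f"{stats['mean_residual']:.3f}, p = {stats['p_value']:.3g}")
```

A small p-value in either arm is evidence against surrogacy for that historical treatment. A large p-value does not establish surrogacy for a new treatment, which is consistent with the falsify-but-not-verify point made in the source.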

Empirical Study: Upworthy Headline Experiments

The framework is validated on the Upworthy Research Archive (Matias et al., 2021), which consists of thousands of live headline A/B tests.

Treatment is whether a headline is phrased as a question; outcome is click-through rate (CTR).
GPT-4o-mini is prompted to predict CTRs for held-out headlines, with a varying number of draws $K$ per headline.
Nonparametric calibration (random forests, gradient-boosted trees) yields LLM-calibrated ATE estimates within statistical error of the true human ATE for sufficiently large $K$; the uncalibrated estimate and linear calibration both fail, producing significant attenuation.

Figure 4: Estimated ATEs for raw and calibrated LLM surrogates as a function of $K$; nonparametric calibration rapidly eliminates attenuation.

Additional diagnostics indicate no LLM memorization of the headlines, show robustness to shifting test populations, and validate recovery of synthetic treatment effects.

Figure 7: Token-level F1 assessment rules out LLM memorization of Upworthy headlines.
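Token-level F1 is a standard overlap score between two token sequences. The sketch below shows the usual bag-of-tokens definition with whitespace tokenization and lowercasing; how the paper tokenizes, and exactly which texts it compares (for example, an LLM's attempted completion of a truncated headline against the true remainder), are assumptions here, and the example strings are made up.

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall (bag-of-tokens overlap)."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        # Both empty counts as a perfect match; one empty counts as no match.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

# Hypothetical strings, not taken from the Upworthy archive.
print(token_f1("the cat sat", "the cat ran away"))  # 2 shared tokens -> 4/7
```

Scores near 1 across many headlines would suggest that the model reproduces the text verbatim, whereas consistently low scores are evidence against memorization.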

Practical Implications for Experimentation

The calibration-plus-surrogacy method brings LLM design choices (model, prompt template, temperature) and the replication count into the experimental workflow as design variables. The calibration function can and should be fit flexibly, but LLM stochasticity sets a lower bound on the replication needed to control bias and variance. Diagnostics on historical data can only falsify surrogacy for past treatments, not verify it for novel ones, which implies that human experimentation remains essential for new interventions. The work further discusses how to size human pilot studies relative to purely LLM-based simulation, depending on risk and impact.

Conclusion

This work rigorously specifies when and how LLM-generated outcomes can be used for human causal inference in A/B testing. The statistical surrogacy formalism exposes the precise role of LLM calibration, the essential nature of surrogacy and comparability, and quantifies the implications of LLM stochasticity and finite draws. Importantly, the framework demonstrates that LLM-based causal inference is not a turnkey replacement for human experimentation: diagnostics can only falsify, not verify, key identification assumptions, making human experiments indispensable for genuinely novel treatment effects. Further directions include relaxing SUTVA, handling multiple outcomes/agents, and extending estimation to long-term surrogate settings.
