Design-Based Confidence Sequences: A General Approach to Risk Mitigation in Online Experimentation

Published 16 Oct 2022 in stat.ME and stat.AP | (2210.08639v3)

Abstract: Randomized experiments have become the standard method for companies to evaluate the performance of new products or services. In addition to augmenting managers' decision-making, experimentation mitigates risk by limiting the proportion of customers exposed to innovation. Since many experiments are on customers arriving sequentially, a potential solution is to allow managers to "peek" at the results when new data becomes available and stop the test if the results are statistically significant. Unfortunately, peeking invalidates the statistical guarantees for standard statistical analysis and leads to uncontrolled type-1 error. Our paper provides valid design-based confidence sequences, sequences of confidence intervals with uniform type-1 error guarantees over time for various sequential experiments in an assumption-light manner. In particular, we focus on finite-sample estimands defined on the study participants as a direct measure of the risks incurred by companies. Our proposed confidence sequences are valid for a large class of experiments, including multi-arm bandits, time series, and panel experiments. We further provide a variance reduction technique incorporating modeling assumptions and covariates. Finally, we demonstrate the effectiveness of our proposed approach through a simulation study and three real-world applications from Netflix. Our results show that by using our confidence sequence, harmful experiments could be stopped after only observing a handful of units; for instance, an experiment that Netflix ran on its sign-up page on 30,000 potential customers would have been stopped by our method on the first day before 100 observations.
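
The abstract's claim that peeking leads to uncontrolled type-1 error can be illustrated with a small simulation. The sketch below is not from the paper: it assumes a simplified A/A setting in which per-unit outcome differences are i.i.d. standard normal with known unit variance, and the function name `peeking_false_positive_rates` and all parameter values are hypothetical. It compares the false-positive rate of a single fixed-sample z-test with that of stopping at the first "significant" interim look.

```python
import numpy as np


def peeking_false_positive_rates(n_sims=2000, n_obs=1000, peek_every=10,
                                 z_crit=1.96, seed=0):
    """Return (fixed_rate, peeking_rate) under a true null effect.

    Assumption for illustration: differences are i.i.d. N(0, 1) with
    known variance, so the z-statistic after n units is sum / sqrt(n).
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_sims, n_obs))      # null is true: mean zero
    n = np.arange(1, n_obs + 1)
    z = np.cumsum(x, axis=1) / np.sqrt(n)         # z-statistic after each unit

    # Indices of the interim looks: after peek_every, 2 * peek_every, ... units.
    looks = np.arange(peek_every - 1, n_obs, peek_every)
    reject_at_look = np.abs(z[:, looks]) > z_crit

    fixed_rate = float((np.abs(z[:, -1]) > z_crit).mean())   # one look at the end
    peeking_rate = float(reject_at_look.any(axis=1).mean())  # stop at first rejection
    return fixed_rate, peeking_rate


if __name__ == "__main__":
    fixed, peeking = peeking_false_positive_rates()
    print(f"fixed-sample false-positive rate: {fixed:.3f}")
    print(f"with repeated peeking:            {peeking:.3f}")
```

Under these assumptions the single-look rate stays near the nominal 5%, while the rate with repeated looks is several times larger. A confidence sequence is designed so that the second rate is also bounded by the nominal level.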

Authors (4)

Glossary

A/B test: A controlled experiment comparing two variants, typically a control (A) and a treatment (B). "In a typical A/B test,"
Adaptive Probabilistic Treatment Assignment: An assignment mechanism where treatment probabilities can change over time based on past data, while remaining bounded away from 0 and 1. "Adaptive Probabilistic Treatment Assignment"
Adaptive Testing: Sequential experimentation where allocation or testing strategy adapts as data accrues. "Adaptive Testing"
anytime-valid inference: Statistical inference that maintains error guarantees regardless of when analysis or stopping occurs. "companies... have recently started incorporating anytime-valid inference in their experimentation platforms."
Asymptotic confidence sequence: A sequence of intervals whose time-uniform coverage guarantee holds approximately, in a large-sample sense, rather than exactly at every sample size. "Asymptotic confidence sequences were first introduced by \citep{time_uniform},"
Bonferroni correction: A multiple-comparison adjustment that divides the significance level by the number of tests to control the family-wise error rate. "Although any naive approach such as Bonferroni correction will technically lead to valid type-1 error guarantees,"
Bounded Potential Outcomes: An assumption that each unit’s potential outcomes are uniformly bounded by a finite constant. "Bounded Potential Outcomes"
carryover effects: Situations where past treatments affect current outcomes in longitudinal experiments. "with carryover effects"
causal estimands: Target quantities in causal inference that summarize effects of interventions (e.g., ATE). "the relevant causal estimands and estimators"
confidence sequence: A sequence of confidence sets with time-uniform coverage, enabling valid peeking and stopping. "Confidence sequences are sequences of confidence sets with time-uniform coverage guarantees,"
contemporaneous treatment effect: The immediate effect of the treatment at a given time point in time series/panel settings. "the contemporaneous treatment effect"
design-based framework: An inferential approach that treats the potential outcomes as fixed and takes the random treatment assignment as the only source of uncertainty. "the design-based framework"
empirical Bernstein inequalities: Concentration bounds that incorporate empirical variance to tighten probabilistic guarantees. "We now apply the empirical Bernstein inequalities"
filtration: An increasing sequence of sigma-algebras representing information available over time in stochastic processes. "with respect to a filtration"
finite-population framework: A perspective focusing on the realized (finite) set of units in the experiment rather than an infinite superpopulation. "In the finite-population framework,"
Finite-Sample Average Treatment Effect: The average treatment effect defined over the actually observed sample of N units. "Finite-Sample Average Treatment Effect"
interference: When one unit’s treatment assignment affects another unit’s outcome. "no interference between experimental units"
inverse propensity score estimator: An estimator that reweights observed outcomes by the inverse of their assignment probabilities to obtain an unbiased estimate of causal effects. "inverse propensity score estimator"
martingales: Stochastic processes whose conditional expectation of the next value, given the past, equals the current value; used to build time-uniform tests. "how confidence sequences are constructed through martingales."
martingale difference sequence: A sequence of increments whose conditional mean, given the past information in a filtration, is zero; its partial sums form a martingale. "using strong martingale difference sequence approximation"
mixture distribution: A distribution formed as a weighted average of component distributions; in confidence-sequence constructions it is used to average a bound over a tuning parameter rather than fixing a single value. "leverage a mixture distribution with the truncated gamma distribution"
multi-arm bandits: Sequential decision problems balancing exploration and exploitation across multiple treatments/arms. "multi-arm bandits"
non-asymptotic confidence sequence: A confidence sequence with finite-sample validity that holds uniformly over time. "We now improve the design-based non-asymptotic confidence sequence introduced in Section~\ref{subsection:nonasympc_CS}"
panel data: Data with multiple units observed repeatedly over time. "panel data settings"
panel experiments: Experiments conducted on panel data, assigning treatments across units and time. "panel experiments"
peeking: Monitoring results during data collection and making decisions before the planned end. "peeking invalidates the statistical guarantees"
potential outcomes: The pair (or set) of outcomes that would be observed under each possible treatment for a unit. "Under the potential outcomes formulation of causal inference,"
positivity assumption: The requirement that treatment assignment probabilities are bounded away from 0 and 1. "we make the following positivity assumption."
propensity scores: The probabilities of receiving each treatment, possibly dependent on past information. "the propensity scores for every individual"
regret-minimizing properties: Characteristics of algorithms that aim to minimize cumulative performance loss relative to the best action. "their regret-minimizing properties"
sigma-algebra: A collection of events closed under complements and countable unions, used in probability theory to formalize the information available. "the sigma-algebra that contains all pairs of N potential outcomes"
stable treatment value assumption: The assumption that treatments are well-defined with no hidden versions and that one unit's treatment does not affect another unit's outcome (no interference). "called the stable treatment value assumption"
stopping time: A random time at which a decision is made, measurable with respect to the information up to that time. "a well-defined stopping time"
super population: A hypothetical infinite population from which the sample is drawn, used for population-level estimands. "super population"
supermartingale: A stochastic process whose conditional expectation of the next value, given the past, is at most its current value; central to time-uniform bounds. "a non-negative supermartingale"
switchback experiments: Time-based experiments that alternate treatments over periods for the same unit(s), common in marketplaces. "switchback experiments"
Thompson Sampling: A Bayesian bandit algorithm that samples actions according to posterior probabilities of being optimal. "Thompson Sampling"
time series experiments: Experiments where a single unit receives treatments over time, allowing temporal dependence and carryover. "time series experiments"
time-uniform coverage guarantees: Guarantees that confidence sets simultaneously cover the target at all times with high probability. "time-uniform coverage guarantees,"
truncated gamma distribution: A gamma distribution restricted to a subset of its support, used in mixture constructions for bounds. "the truncated gamma distribution"
type-1 error: The probability of incorrectly rejecting a true null hypothesis (false positive). "type-1 error"
Ville's Maximal Inequality: A result providing tail bounds for the maximum of a nonnegative supermartingale, enabling anytime control. "Ville's Maximal Inequality"
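
To make the glossary entries on the inverse propensity score estimator, propensity scores, and the positivity assumption concrete, here is a minimal sketch. It is an illustration under stated assumptions, not the paper's implementation: it assumes a binary treatment with known assignment probabilities, and the function name `ipw_ate_estimate` and the example data are hypothetical.

```python
import numpy as np


def ipw_ate_estimate(y, w, p):
    """Inverse-propensity-weighted estimate of the average treatment effect.

    y: observed outcomes; w: treatment indicators (1 = treated, 0 = control);
    p: known probability that each unit is treated (scalar or array).
    Illustrative assumption: binary treatment with known propensities.
    """
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    p = np.asarray(p, dtype=float)

    # Positivity: every assignment probability must lie strictly in (0, 1).
    if np.any((p <= 0.0) | (p >= 1.0)):
        raise ValueError("propensity scores must lie strictly between 0 and 1")

    # Per-unit term: Y/p for treated units, -Y/(1 - p) for control units.
    # Averaged over the random assignment, each term has mean Y(1) - Y(0).
    tau_hat = w * y / p - (1.0 - w) * y / (1.0 - p)
    return float(tau_hat.mean())


if __name__ == "__main__":
    # Four units, each treated with probability 0.5.
    estimate = ipw_ate_estimate(y=[2.0, 0.0, 4.0, 1.0], w=[1, 0, 1, 0], p=0.5)
    print(estimate)
```

Because the weights use only the known assignment probabilities, the estimate is unbiased with the potential outcomes held fixed, which is what a design-based analysis requires. The positivity check guards against division by zero and against weights that grow without bound.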
