Evaluating Decision Rules Across Many Weak Experiments

Published 12 Feb 2025 in stat.ME | (2502.08763v2)

Abstract: Technology firms conduct randomized controlled experiments ("A/B tests") to learn which actions to take to improve business outcomes. In firms with mature experimentation platforms, experimentation programs can consist of many thousands of tests. To effectively scale experimentation, firms rely on decision rules: standard operating procedures for mapping the results of an experiment to a choice of treatment arm to launch to the general user population. Despite the critical role of decision rules in translating experimentation into business decisions, rigorous guidance on how to evaluate and choose decision rules is scarce. This paper proposes to evaluate decision rules based on their cumulative returns to business north star metrics. Although intuitive and easy to explain to decision-makers, this quantity can be difficult to estimate, especially when experiments have weak signal-to-noise ratios. We develop a cross-validation estimator that is much less biased than the naive plug-in estimator under conditions realistic to digital experimentation. We demonstrate the efficacy of our approach via a case study of 123 historical A/B tests at Netflix, where we used it to show that a new decision rule would have increased cumulative returns to the north star metric by an estimated $33\%$, directly leading to the adoption of the new rule.


Authors (5)

Summary

The paper proposes a framework to evaluate A/B testing decision rules based on their cumulative returns to long-term business metrics across many experiments.
It introduces a cross-validation estimator with experiment-splitting to mitigate the winner's curse bias when estimating cumulative returns, a bias that the simpler plug-in estimator does not correct.
Applying this method to historical A/B tests at Netflix, the authors demonstrated that adopting a new decision rule based on a better proxy metric would significantly increase estimated cumulative returns, leading to its implementation.

This paper, "Evaluating Decision Rules Across Many Weak Experiments," addresses the challenge of choosing effective decision rules in mature A/B testing environments within technology companies. Decision rules are standard operating procedures that translate the results of an experiment into a decision about which treatment arm to launch. The paper argues that rigorous methods for evaluating and selecting decision rules are lacking. The authors propose evaluating decision rules based on their cumulative returns to business north star metrics (e.g., long-term user retention, revenue).

Here's a breakdown of the key elements:

Problem:

Technology firms rely on A/B testing for product innovation.
Large-scale experimentation programs necessitate the use of decision rules to standardize launch decisions.
Evaluating and choosing between decision rules is challenging, especially when experiments have weak signal-to-noise ratios.
A simple "plug-in" estimator of cumulative returns can be biased due to the winner's curse (winning arms are sometimes chosen due to noise, leading to an overestimation of their true effect).
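
The winner's curse described in the last point can be reproduced in a few lines. The sketch below is illustrative only and is not the paper's code: the Gaussian effect and noise distributions, the parameter values, and the launch-if-positive rule are all assumptions chosen to mimic weak experiments, where noise is much larger than the true effects.

```python
import random


def simulate_plug_in_bias(n_experiments=2000, effect_sd=0.1, noise_sd=1.0, seed=0):
    """Compare true and plug-in cumulative returns for a launch-if-positive rule.

    All distributions and parameters here are hypothetical.
    """
    rng = random.Random(seed)
    true_return = 0.0
    plug_in_return = 0.0
    for _ in range(n_experiments):
        true_effect = rng.gauss(0.0, effect_sd)               # small true effect
        observed = true_effect + rng.gauss(0.0, noise_sd)     # noisy estimate
        if observed > 0:                                      # decision rule: launch
            true_return += true_effect                        # what was actually gained
            plug_in_return += observed                        # what the plug-in reports
    return true_return, plug_in_return


true_return, plug_in_return = simulate_plug_in_bias()
print(f"true cumulative return:    {true_return:.1f}")
print(f"plug-in cumulative return: {plug_in_return:.1f}")
```

Because the same noisy estimate is used both to pick the winner and to score it, the launched arms are disproportionately those with positive noise, so the plug-in total overstates the true total.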

Proposed Solution:

Evaluate decision rules based on their cumulative returns to business north star metrics across many past experiments.
Use a cross-validation estimator with experiment-splitting to mitigate the winner's curse. This estimator separates the data used to select the winning arm from the data used to evaluate its performance.
Show that the cross-validation estimator is significantly less biased than the plug-in estimator under realistic conditions.
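
The experiment-splitting idea can be sketched as follows. This is a simplified two-fold version under assumed Gaussian effects and noise, not the paper's estimator (which the summary describes as a leave-one-out scheme with a scaling factor). Each experiment is split into two halves; one half decides whether to launch and the other half scores the decision, then the roles are swapped and the two results are averaged. Note that this scores the rule as applied to half of the data.

```python
import random


def compare_estimators(n_experiments=20000, effect_sd=0.1, noise_sd=1.0, seed=0):
    """Return (true, cross-validated, plug-in) returns for a launch-if-positive rule.

    Hypothetical setup: each half of an experiment yields an independent
    estimate whose noise sd is sqrt(2) times the full-sample noise sd.
    """
    rng = random.Random(seed)
    half_noise_sd = noise_sd * 2 ** 0.5
    true_return = cv_return = plug_in_return = 0.0
    for _ in range(n_experiments):
        effect = rng.gauss(0.0, effect_sd)
        est_a = effect + rng.gauss(0.0, half_noise_sd)
        est_b = effect + rng.gauss(0.0, half_noise_sd)
        # Use each half once for selection and once for evaluation.
        for select, evaluate in ((est_a, est_b), (est_b, est_a)):
            if select > 0:                       # decision made on the selection half
                true_return += 0.5 * effect
                cv_return += 0.5 * evaluate      # scored on the held-out half
                plug_in_return += 0.5 * select   # scored on the same half (biased)
    return true_return, cv_return, plug_in_return


true_return, cv_return, plug_in_return = compare_estimators()
print(f"true:    {true_return:.1f}")
print(f"cv:      {cv_return:.1f}")
print(f"plug-in: {plug_in_return:.1f}")
```

The held-out half's noise is independent of the selection, so it averages out across experiments, whereas the plug-in score keeps the positively selected noise.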

Key Concepts:

Decision Rules: Standard operating procedures for mapping experiment results to launch decisions.
North Star Metric: The primary business outcome (e.g., retention, revenue).
Proxy Metrics: Metrics correlated with the north star metric but easier to measure (higher signal-to-noise ratio).
Guardrail Metrics: Metrics that must not decline for a launch to be considered.
Cumulative Returns: The total impact on the north star metric if all experiments were decided using a specific decision rule.
Winner's Curse: The tendency to overestimate the true effect of a winning arm due to noise in the experiment.
Cross-Validation Estimator: An estimator that splits the experimental data and uses one portion to choose the winning arm and the other to estimate the return of that arm, mitigating the winner's curse.
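
The last three definitions can be written compactly. The notation here is an assumption chosen for illustration and is not taken from the paper: $i = 1, \dots, N$ indexes experiments, $\Delta_{i,k}$ is the true effect of arm $k$ in experiment $i$ on the north star metric, $\hat{\Delta}_{i,k}$ is its estimate, and a decision rule $D$ maps the observed data $X_i$ to a chosen arm.

$$
R(D) = \sum_{i=1}^{N} \Delta_{i,\,D(X_i)}, \qquad
\hat{R}_{\text{plug-in}}(D) = \sum_{i=1}^{N} \hat{\Delta}_{i,\,D(X_i)}, \qquad
\hat{R}_{\text{CV}}(D) = \sum_{i=1}^{N} \hat{\Delta}^{(B)}_{i,\,D(X_i^{(A)})}
$$

In the plug-in version the same data $X_i$ both selects the arm and produces $\hat{\Delta}_{i,k}$, so the estimation noise is correlated with the choice; this is the winner's curse. In the cross-validated version, $X_i$ is split into disjoint parts $A$ and $B$, the arm is chosen from $X_i^{(A)}$, and its effect is estimated from part $B$ only. This two-part form is a simplification of the estimator described in the summary.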

Methodology:

Framework: Define a framework for evaluating decision rules based on cumulative returns. This framework is applicable to choosing p-value thresholds and selecting proxy metrics.
Estimator: Develop a cross-validation estimator to address the winner's curse.
Theory: Provide theoretical conditions under which the cross-validation estimator consistently selects the best rule from a finite set of candidates as the number of experiments increases.
Simulation: Demonstrate the benefits of the framework in a simplified example, evaluating candidate proxy metrics in simulated data.
Application: Apply the framework to select decision rules at Netflix, showing that a new decision rule would increase cumulative returns.

Theoretical Results:

The paper provides a theorem that states that if the number of users enrolled in each experiment is Poisson distributed, then, with an appropriate scaling factor, the leave-one-out cross-validation estimator is unbiased.
The paper also provides a theorem guaranteeing that, with a finite set of decision rules, the cross-validation estimator will select the best rule (in terms of the expected reward) as the number of experiments goes to infinity. This holds under the realistic asymptotic regime in which the number of experiments goes to infinity but the number of units per experiment remains bounded.
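
The summary does not reproduce the proofs. One standard fact that makes a Poisson enrollment assumption convenient for sample splitting is Poisson thinning: if the number of users is Poisson distributed and each user is assigned to a fold at random, the fold sizes are independent of each other. Whether the paper's argument relies on exactly this property is an assumption here. The sketch below checks the property numerically; the helper names and parameter values are hypothetical.

```python
import math
import random


def sample_poisson(rng, mean):
    """Draw a Poisson variate with Knuth's multiplication method (small means only)."""
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def fold_covariance(poisson_enrollment, n_trials=20000, mean_users=20, seed=0):
    """Empirical covariance between the sizes of two random folds."""
    rng = random.Random(seed)
    sizes = []
    for _ in range(n_trials):
        # Either a Poisson number of users or a fixed number for comparison.
        n_users = sample_poisson(rng, mean_users) if poisson_enrollment else mean_users
        fold_a = sum(1 for _ in range(n_users) if rng.random() < 0.5)
        sizes.append((fold_a, n_users - fold_a))
    mean_a = sum(a for a, _ in sizes) / n_trials
    mean_b = sum(b for _, b in sizes) / n_trials
    return sum((a - mean_a) * (b - mean_b) for a, b in sizes) / n_trials


print("Poisson enrollment:", round(fold_covariance(True), 2))
print("fixed enrollment:  ", round(fold_covariance(False), 2))
```

With a fixed number of users the two fold sizes are negatively correlated (a larger fold A forces a smaller fold B), while under Poisson enrollment the covariance is close to zero.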

Simulation Results:

The simulation demonstrates that the naive estimator is biased and can prefer "bad" proxy metrics (metrics whose true effects correlate only weakly with the north star metric but whose measurement error correlates strongly with the north star's measurement error) to "good" proxy metrics.
The cross-validation estimator provides a more accurate estimate of the true reward and correctly ranks the proxy metrics.
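
A toy version of this comparison can be sketched as below. It is not the paper's simulation design: the Gaussian model, the parameter values, and the two stylized proxies are assumptions. The "good" proxy tracks the true north star effect with independent noise; the "bad" proxy carries no true signal but shares its measurement noise with the north star estimate. Each rule launches when the proxy estimate is positive, and returns are scored on the north star.

```python
import random


def compare_proxies(n_experiments=20000, effect_sd=0.2, north_star_noise_sd=1.0,
                    good_proxy_noise_sd=0.1, bad_proxy_extra_noise_sd=0.5, seed=0):
    """Return true, naive, and two-fold cross-validated returns per proxy (hypothetical setup)."""
    rng = random.Random(seed)
    root2 = 2 ** 0.5  # each half-sample has sqrt(2) times the full-sample noise sd
    totals = {name: {"true": 0.0, "naive": 0.0, "cv": 0.0} for name in ("good", "bad")}
    for _ in range(n_experiments):
        effect = rng.gauss(0.0, effect_sd)
        ns_noise = [rng.gauss(0.0, north_star_noise_sd * root2) for _ in range(2)]
        north_star = [effect + e for e in ns_noise]          # per-fold north star estimates
        proxies = {
            # Good proxy: true signal plus independent noise.
            "good": [effect + rng.gauss(0.0, good_proxy_noise_sd * root2) for _ in range(2)],
            # Bad proxy: no signal, noise shared with the north star estimate.
            "bad": [e + rng.gauss(0.0, bad_proxy_extra_noise_sd * root2) for e in ns_noise],
        }
        north_star_full = sum(north_star) / 2
        for name, folds in proxies.items():
            if sum(folds) / 2 > 0:                           # rule applied to the full data
                totals[name]["true"] += effect
                totals[name]["naive"] += north_star_full     # same data selects and scores
            for select, evaluate in ((0, 1), (1, 0)):
                if folds[select] > 0:                        # rule applied to one fold
                    totals[name]["cv"] += 0.5 * north_star[evaluate]  # scored on the other
    return totals


for name, result in compare_proxies().items():
    print(name, {key: round(value, 1) for key, value in result.items()})
```

In this setup the naive totals favor the bad proxy because its selection picks up the north star's own noise, while the held-out scoring removes that shared noise and ranks the good proxy first.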

Netflix Case Study:

The authors applied their method to 123 historical A/B tests at Netflix.
They compared the performance of the current decision rule to alternative rules based on a new proxy metric.
They found that a new decision rule would increase cumulative returns by an estimated 33%, leading to its adoption.

Contributions:

The paper provides a practical and intuitive framework for evaluating decision rules in A/B testing environments.
The cross-validation estimator effectively mitigates the winner's curse, leading to more accurate estimates of cumulative returns.
The paper demonstrates the importance of using low-bias estimators for decision rule selection, especially when experiments have weak signal-to-noise ratios.
The Netflix case study provides a real-world example of the benefits of the proposed methodology.

Future Directions:

Exploring estimators that allow for dependence across experiments.
Evaluating other dimensions of experiment decision-making, such as statistical significance thresholds.
Learning good decision rules rather than selecting from a predefined set.