
MultiBanAbs: Online FDR Control for A/B/n Testing

Updated 2 February 2026
  • MultiBanAbs is a doubly-sequential online framework that integrates multi-armed bandit (MAB) testing with always-valid p-values and online FDR control.
  • It employs an adaptive best-arm identification method using a LUCB algorithm variant to optimize sample efficiency while rigorously controlling error rates.
  • The approach enables interleaved A/B/n experiments with near-optimal discovery power and a 50%–70% reduction in sample complexity compared to traditional fixed-sample methods.

MultiBanAbs refers to a doubly-sequential online framework for running a sequence of bandit-based “A/B/n” tests, each with optimal sample efficiency and anytime false discovery rate (FDR) control. Unlike classical workflows where each A/B test is separately fixed-sample or sequential, MultiBanAbs encapsulates a protocol in which each experiment is an adaptive best-arm identification instance, and the decision to declare a discovery is governed by an online FDR algorithm. This approach enables massive-scale, interleaved (possibly overlapping) multiple hypothesis testing while ensuring statistical rigor and near-optimal efficiency (Yang et al., 2017).

1. Formal Problem Definition

MultiBanAbs addresses scenarios where a stream of experiments is tested sequentially, each corresponding to a multi-armed bandit (MAB) with one control arm ($\mu_0$) and $K$ alternatives ($\mu_1, \dots, \mu_K$). For each experiment $j$, the hypotheses are

$$H_0^j: \mu_0 \ge \mu_i - \epsilon\ \ \forall i \quad \text{vs.} \quad H_1^j: \exists\, i \text{ such that } \mu_i \ge \mu_0 + \epsilon$$

where $\epsilon \ge 0$ determines the minimum margin for declaring a successful alternative.

The core requirement is to test many such MAB instances, adaptively and in parallel, while controlling the online false discovery rate at level $\alpha$, globally across all tests at all times (i.e., under arbitrary data-dependent stopping).

2. Always-Valid Sequential p-Values for Bandit Instances

A central technical innovation in MultiBanAbs is the construction of always-valid $p$-values for each best-arm MAB instance, ensuring validity under optional stopping. At any stopping time $t$ (potentially dependent on prior data and test outcomes),

$$\Pr_{H_0}(P_t \le \alpha) \le \alpha \quad \forall\, \alpha \in (0,1)$$

The $p$-value process is built using non-asymptotic law-of-the-iterated-logarithm confidence bands: $$\ell_n(\delta) = \sqrt{\frac{\ln(1/\delta) + 3\ln\ln(1/\delta) + \frac{3}{2}\ln\ln(en)}{n}}$$ with upper and lower confidence bounds per arm,

$$\mathrm{LCB}_i(t) = \widehat{\mu}_{i,n_i(t)} - \ell_{n_i(t)}\!\left(\frac{\alpha}{2K}\right), \quad \mathrm{UCB}_i(t) = \widehat{\mu}_{i,n_i(t)} + \ell_{n_i(t)}\!\left(\frac{\alpha}{2}\right)$$

The per-instance $p$-value is then

$$P_{i,t} = \sup\left\{\gamma \in (0,1) : \mathrm{LCB}_i^{\gamma}(t) \le \mathrm{UCB}_0^{\gamma}(t) + \epsilon\right\}$$

where the superscript indicates that the confidence bounds are computed at level $\gamma$; the overall $p$-value for the experiment is the running minimum across arms and timepoints.
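As a concrete illustration, the band and a per-arm $p$-value can be computed by bisection over $\gamma$, since narrowing bands make the defining condition monotone in $\gamma$. This is a minimal sketch: the function names, the bisection bounds, and the cap at $p = 1$ are implementation choices, not part of the paper.

```python
import math

def lil_band(n: int, delta: float) -> float:
    """Width ell_n(delta) of the non-asymptotic LIL confidence band."""
    num = (math.log(1 / delta) + 3 * math.log(math.log(1 / delta))
           + 1.5 * math.log(math.log(math.e * n)))
    return math.sqrt(num / n)

def always_valid_p(mu_i, n_i, mu_0, n_0, K, eps=0.0, tol=1e-6):
    """P_{i,t} = sup{gamma : LCB_i^gamma(t) <= UCB_0^gamma(t) + eps}.
    The condition is monotone in gamma (larger gamma -> narrower bands),
    so the supremum can be found by bisection."""
    def holds(gamma):
        lcb_i = mu_i - lil_band(n_i, gamma / (2 * K))
        ucb_0 = mu_0 + lil_band(n_0, gamma / 2)
        return lcb_i <= ucb_0 + eps

    lo, hi = 1e-12, 0.7   # keep both band levels below 1/e, where the band is defined
    if holds(hi):
        return 1.0        # no usable evidence against the null: report p = 1
    if not holds(lo):
        return lo         # evidence overwhelming even at tiny levels
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if holds(mid) else (lo, mid)
    return hi
```

With a clearly superior arm the returned $p$-value is tiny; with identical means it stays at 1, as an always-valid $p$-value should.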

3. Best-Arm Identification with Controlled Error

Within each MAB instance, MultiBanAbs runs a variant of the LUCB algorithm parameterized by the required error tolerance ($\epsilon$) and the confidence parameter set by the associated FDR procedure:

  • Each arm is initially pulled once.
  • At each round, the empirically best arm and its strongest challenger (the arm with the largest upper confidence bound among the rest) are sampled.
  • Sampling stops when either no variant can demonstrably beat the control, or some treatment arm is confidently superior.
  • On stopping, an always-valid $p$-value is returned, corresponding to the smallest time at which evidence for an alternative arises.

Sample complexity is near-optimal: the bandit subroutine halts in

$$O\!\left(\sum_{i=0}^{K} \widetilde{\Delta}_i^{-2} \ln\!\left(\frac{K \ln(\widetilde{\Delta}_i^{-2})}{\delta}\right)\right)$$

where $\widetilde{\Delta}_i$ denotes the effective gap of arm $i$ after accounting for $\epsilon$ (Yang et al., 2017).

4. Online FDR Control Framework Integration

MultiBanAbs pipelines the bandit-level $p$-values into an online FDR control algorithm, such as LORD, SAFFRON, or a general $\alpha$-investing protocol. These algorithms sequentially assign a test-specific level $\alpha_j$ to each experiment $j$, properly accounting for prior rejections, via the following protocol:

  1. Obtain the test level $\alpha_j$ from the online FDR rule and past history.
  2. Run the MAB best-arm algorithm at confidence $\delta = \alpha_j$; on stopping, produce the $p$-value $P^j$.
  3. If $P^j \le \alpha_j$, declare a discovery; otherwise accept the null.
  4. Update the FDR wealth and record the outcome for use in subsequent tests.
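The four steps can be sketched with a LORD-style level sequence. This is a simplified member of the LORD family with $\gamma_j \propto 1/j^2$, not the exact LORD++ rule, and the `run_experiment` callable stands in for the bandit subroutine:

```python
import math

GAMMA_NORM = sum(1 / (j * j) for j in range(1, 10_000))   # ~ pi^2 / 6

def gamma(j: int) -> float:
    """Discount sequence gamma_j proportional to 1/j^2, summing to ~1."""
    return (1.0 / (j * j)) / GAMMA_NORM

def run_online_fdr(experiments, alpha=0.05):
    """LORD-style online FDR loop (a simplified sketch, not the exact
    LORD++ rule).  Each entry of `experiments` is a callable that runs
    one bandit test at the given level and returns its p-value."""
    w0 = alpha / 2                     # initial alpha-wealth
    rejection_times, decisions = [], []
    for t, run_experiment in enumerate(experiments, start=1):
        # Step 1: test level from initial wealth plus earlier rejections.
        alpha_t = gamma(t) * w0 + alpha * sum(gamma(t - tau) for tau in rejection_times)
        # Steps 2-3: run the bandit at delta = alpha_t, compare its p-value.
        p = run_experiment(alpha_t)
        reject = p <= alpha_t
        decisions.append(reject)
        if reject:
            rejection_times.append(t)  # Step 4: record rejection for future levels
    return decisions
```

Each rejection adds a discounted stream of future testing budget, which is what lets the procedure stay powerful over a long sequence of experiments.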

This architecture ensures

$$\mathrm{FDR}(J) = \mathbb{E}\left[\frac{\sum_{j \in H_0} R_j}{\left(\sum_{j=1}^{J} R_j\right) \vee 1}\right] \le \alpha$$

uniformly over all horizons $J$ and arbitrary (possibly data-dependent) stopping (Yang et al., 2017).
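In simulation, the false discovery proportion inside the expectation above can be checked directly (a trivial helper; the name and 1-based indexing are illustrative):

```python
def false_discovery_proportion(rejections, null_indices):
    """FDP(J) = (# rejected true nulls) / max(# rejections, 1);
    FDR(J) is the expectation of this ratio over repeated runs."""
    false = sum(1 for j, r in enumerate(rejections, start=1)
                if r and j in null_indices)
    total = sum(rejections)
    return false / max(total, 1)
```

Averaging this quantity over many simulated runs is how the empirical FDR curves in Section 6 are obtained.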

5. Statistical Guarantees and Sample Efficiency

The core theoretical results established are:

  • Anytime mFDR and FDR control: The protocol is guaranteed to maintain $\mathrm{mFDR}(J) \le \alpha$ and $\mathrm{FDR}(J) \le \alpha$ at every $J$, regardless of adaptive sampling or stopping.
  • Sample-optimal discovery rate: The best-arm MAB subroutine, run at confidence $\delta = \alpha_j$, terminates in $O\!\left(\sum_{i=0}^{K} \Delta_i^{-2} \log(K/\alpha_j)\right)$ pulls, matching classical MAB efficiency up to log factors.
  • High power: The best-arm discovery rate (fraction of true alternatives declared discoveries) remains bounded away from zero; power is competitive with non-MAB fixed-sample alternatives.

6. Empirical Validation and Practical Impact

Evaluation on both simulated bandit data (Gaussian and Bernoulli arms) and real-world settings (e.g., New Yorker Cartoon Caption Contest) validates the MultiBanAbs framework:

  • Achieves a $50\%$–$70\%$ reduction in sample complexity relative to uniform-sampling A/B/n strategies at equivalent power.
  • Maintains realized FDR close to the prescribed $\alpha$ even under massive scale and adaptive monitoring.
  • Outperforms naive combinations of bandit selection and independent tests, which fail to provide rigorous error bounds, and outperforms Bonferroni-FWER correction (which is overly conservative in this adaptive regime) (Yang et al., 2017).


MultiBanAbs unifies classical online multiple testing and multi-armed bandit best-arm identification into a single adaptive process. It is particularly impactful in large-scale settings where continuous monitoring, efficiency, and statistical validity are critical (e.g., digital A/B/n experimentation, high-throughput scientific discovery).

Key directions identified for further research include:

  • Adapting the framework to contextual and structured bandit settings.
  • Incorporating early stopping rules and dynamic resource allocation.
  • Extending to high-dimensional treatment selection and streaming data environments.

The framework is foundational for practitioners requiring rigorous discovery with minimal sampling overhead and robust error control, supporting scalable experimentation in both academic and industrial domains (Yang et al., 2017).
