
MultiBanAbs: Online FDR Control for A/B/n Testing

Updated 2 February 2026
  • MultiBanAbs is a doubly-sequential online framework that integrates multi-armed bandit (MAB) testing with always-valid p-values and online FDR control.
  • It employs an adaptive best-arm identification method using a LUCB algorithm variant to optimize sample efficiency while rigorously controlling error rates.
  • The approach enables interleaved A/B/n experiments with near-optimal discovery power and a 50%–70% reduction in sample complexity compared to traditional fixed-sample methods.

MultiBanAbs refers to a doubly-sequential online framework for running a sequence of bandit-based “A/B/n” tests, each with optimal sample efficiency and anytime false discovery rate (FDR) control. Unlike classical workflows where each A/B test is separately fixed-sample or sequential, MultiBanAbs encapsulates a protocol in which each experiment is an adaptive best-arm identification instance, and the decision to declare a discovery is governed by an online FDR algorithm. This approach enables massive-scale, interleaved (possibly overlapping) multiple hypothesis testing while ensuring statistical rigor and near-optimal efficiency (Yang et al., 2017).

1. Formal Problem Definition

MultiBanAbs addresses scenarios where a stream of experiments is tested sequentially, each corresponding to a multi-armed bandit (MAB) with one control arm ($\mu_0$) and $K$ alternatives ($\mu_1, \dots, \mu_K$). For each experiment $j$, the hypotheses are

$$H_0^j: \mu_0 \ge \mu_i - \epsilon\ \ \forall i \quad \text{vs.} \quad H_1^j: \exists\, i \text{ such that } \mu_i \ge \mu_0 + \epsilon$$

where $\epsilon \ge 0$ determines the minimum margin for declaring a successful alternative.

The core requirement is to test many such MAB instances, adaptively and in parallel, while controlling the online false discovery rate at level $\alpha$, globally across all tests at all times (i.e., under arbitrary data-dependent stopping).

2. Always-Valid Sequential p-Values for Bandit Instances

A central technical innovation in MultiBanAbs is the construction of always-valid $p$-values for each best-arm MAB instance, ensuring validity under optional stopping. At any stopping time $t$ (potentially dependent on prior data and test outcomes),

$$\Pr_{H_0}(P_t \le \alpha) \le \alpha \quad \forall\, \alpha \in (0,1)$$

The $p$-value process is built using non-asymptotic law-of-the-iterated-logarithm confidence bands: $$\ell_n(\delta) = \sqrt{\frac{\ln(1/\delta) + 3\ln\ln(1/\delta) + \frac{3}{2}\ln\ln(en)}{n}}$$ with upper and lower confidence bounds per arm,

$$\mathrm{LCB}_i(t) = \widehat{\mu}_{i,n_i(t)} - \ell_{n_i(t)}\!\left(\frac{\alpha}{2K}\right), \quad \mathrm{UCB}_i(t) = \widehat{\mu}_{i,n_i(t)} + \ell_{n_i(t)}\!\left(\frac{\alpha}{2}\right)$$

The per-instance $p$-value is then

$$P_{i,t} = \sup\left\{\gamma \in (0,1) : \mathrm{LCB}_i^{\gamma}(t) \le \mathrm{UCB}_0^{\gamma}(t) + \epsilon\right\}$$

where the superscript indicates that the confidence bounds are computed at level $\gamma$; the overall $p$-value for the experiment is the running minimum across arms and timepoints.
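As a concrete illustration, the band and a per-arm $p$-value can be computed by bisection over $\gamma$, since narrowing bands make the defining condition monotone in $\gamma$. This is a minimal sketch: the function names, the bisection bounds, and the cap at $p = 1$ are implementation choices, not part of the paper.

```python
import math

def lil_band(n: int, delta: float) -> float:
    """Width ell_n(delta) of the non-asymptotic LIL confidence band."""
    num = (math.log(1 / delta) + 3 * math.log(math.log(1 / delta))
           + 1.5 * math.log(math.log(math.e * n)))
    return math.sqrt(num / n)

def always_valid_p(mu_i, n_i, mu_0, n_0, K, eps=0.0, tol=1e-6):
    """P_{i,t} = sup{gamma : LCB_i^gamma(t) <= UCB_0^gamma(t) + eps}.
    The condition is monotone in gamma (larger gamma -> narrower bands),
    so the supremum can be found by bisection."""
    def holds(gamma):
        lcb_i = mu_i - lil_band(n_i, gamma / (2 * K))
        ucb_0 = mu_0 + lil_band(n_0, gamma / 2)
        return lcb_i <= ucb_0 + eps

    lo, hi = 1e-12, 0.7   # keep both band levels below 1/e, where the band is defined
    if holds(hi):
        return 1.0        # no usable evidence against the null: report p = 1
    if not holds(lo):
        return lo         # evidence overwhelming even at tiny levels
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if holds(mid) else (lo, mid)
    return hi
```

With a clearly superior arm the returned $p$-value is tiny; with identical means it stays at 1, as an always-valid $p$-value should.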

3. Best-Arm Identification with Controlled Error

Within each MAB instance, MultiBanAbs runs a variant of the LUCB algorithm parameterized by the required error tolerance ($\epsilon$) and the confidence parameter set by the associated FDR procedure:

  • Each arm is initially pulled once.
  • At each round, the empirically best arm and its strongest challenger (the arm with the largest upper confidence bound among the rest) are sampled.
  • Sampling stops when either no variant can demonstrably beat the control, or some treatment arm is confidently superior.
  • On stopping, an always-valid $p$-value is returned, corresponding to the smallest time at which evidence for an alternative arises.

Sample complexity is near-optimal: the bandit subroutine halts in

$$O\!\left(\sum_{i=0}^{K} \widetilde{\Delta}_i^{-2} \ln\!\left(\frac{K \ln(\widetilde{\Delta}_i^{-2})}{\delta}\right)\right)$$

where $\widetilde{\Delta}_i$ denotes the effective gap of arm $i$ after accounting for $\epsilon$ (Yang et al., 2017).

4. Online FDR Control Framework Integration

MultiBanAbs pipelines the bandit-level $p$-values into an online FDR control algorithm, such as LORD, SAFFRON, or a general $\alpha$-investing protocol. These algorithms sequentially assign a test-specific level $\alpha_j$ to each experiment $j$, properly accounting for prior rejections, via the following protocol:

  1. Obtain the test level $\alpha_j$ from the online FDR rule and past history.
  2. Run the MAB best-arm algorithm at confidence $\delta = \alpha_j$; on stopping, produce the $p$-value $P^j$.
  3. If $P^j \le \alpha_j$, declare a discovery; otherwise accept the null.
  4. Update the FDR wealth and record the outcome for use in subsequent tests.
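The four steps can be sketched with a LORD-style level sequence. This is a simplified member of the LORD family with $\gamma_j \propto 1/j^2$, not the exact LORD++ rule, and the `run_experiment` callable stands in for the bandit subroutine:

```python
import math

GAMMA_NORM = sum(1 / (j * j) for j in range(1, 10_000))   # ~ pi^2 / 6

def gamma(j: int) -> float:
    """Discount sequence gamma_j proportional to 1/j^2, summing to ~1."""
    return (1.0 / (j * j)) / GAMMA_NORM

def run_online_fdr(experiments, alpha=0.05):
    """LORD-style online FDR loop (a simplified sketch, not the exact
    LORD++ rule).  Each entry of `experiments` is a callable that runs
    one bandit test at the given level and returns its p-value."""
    w0 = alpha / 2                     # initial alpha-wealth
    rejection_times, decisions = [], []
    for t, run_experiment in enumerate(experiments, start=1):
        # Step 1: test level from initial wealth plus earlier rejections.
        alpha_t = gamma(t) * w0 + alpha * sum(gamma(t - tau) for tau in rejection_times)
        # Steps 2-3: run the bandit at delta = alpha_t, compare its p-value.
        p = run_experiment(alpha_t)
        reject = p <= alpha_t
        decisions.append(reject)
        if reject:
            rejection_times.append(t)  # Step 4: record rejection for future levels
    return decisions
```

Each rejection adds a discounted stream of future testing budget, which is what lets the procedure stay powerful over a long sequence of experiments.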

This architecture ensures

$$\mathrm{FDR}(J) = \mathbb{E}\left[\frac{\sum_{j \in H_0} R_j}{\left(\sum_{j=1}^{J} R_j\right) \vee 1}\right] \le \alpha$$

uniformly over all horizons $J$ and arbitrary (possibly data-dependent) stopping (Yang et al., 2017).
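In simulation, the false discovery proportion inside the expectation above can be checked directly (a trivial helper; the name and 1-based indexing are illustrative):

```python
def false_discovery_proportion(rejections, null_indices):
    """FDP(J) = (# rejected true nulls) / max(# rejections, 1);
    FDR(J) is the expectation of this ratio over repeated runs."""
    false = sum(1 for j, r in enumerate(rejections, start=1)
                if r and j in null_indices)
    total = sum(rejections)
    return false / max(total, 1)
```

Averaging this quantity over many simulated runs is how the empirical FDR curves in Section 6 are obtained.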

5. Statistical Guarantees and Sample Efficiency

The core theoretical results established are:

  • Anytime mFDR and FDR control: The protocol is guaranteed to maintain $\mathrm{mFDR}(J) \le \alpha$ and $\mathrm{FDR}(J) \le \alpha$ at every $J$, regardless of adaptive sampling or stopping.
  • Sample-optimal discovery rate: The best-arm MAB subroutine, run at confidence $\delta = \alpha_j$, terminates in $O\!\left(\sum_{i=0}^{K} \Delta_i^{-2} \log(K/\alpha_j)\right)$ pulls, matching classical MAB efficiency up to log factors.
  • High power: The best-arm discovery rate (fraction of true alternatives declared discoveries) remains bounded away from zero; power is competitive with non-MAB fixed-sample alternatives.

6. Empirical Validation and Practical Impact

Evaluation on both simulated bandit data (Gaussian and Bernoulli arms) and real-world settings (e.g., New Yorker Cartoon Caption Contest) validates the MultiBanAbs framework:

  • Achieves a $50\%$–$70\%$ reduction in sample complexity relative to uniform-sampling A/B/n strategies at equivalent power.
  • Maintains realized FDR close to the prescribed $\alpha$ even under massive scale and adaptive monitoring.
  • Outperforms naive combinations of bandit selection and independent tests, which fail to provide rigorous error bounds, and outperforms Bonferroni-FWER correction (which is overly conservative in this adaptive regime) (Yang et al., 2017).


MultiBanAbs unifies classical online multiple testing and multi-armed bandit best-arm identification into a single adaptive process. It is particularly impactful in large-scale settings where continuous monitoring, efficiency, and statistical validity are critical (e.g., digital A/B/n experimentation, high-throughput scientific discovery).

Key directions identified for further research include:

  • Adapting the framework to contextual and structured bandit settings.
  • Incorporating early stopping rules and dynamic resource allocation.
  • Extending to high-dimensional treatment selection and streaming data environments.

The framework is foundational for practitioners requiring rigorous discovery with minimal sampling overhead and robust error control, supporting scalable experimentation in both academic and industrial domains (Yang et al., 2017).
