Robust Batched Bandit Algorithms

Updated 11 October 2025

Robust batched bandit algorithms are adaptive sampling procedures that operate in fixed communication rounds with robust estimators like the median-of-means to manage heavy-tailed reward distributions.
They achieve near-optimal regret by balancing batch complexity and statistical uncertainty, using tailored batch schedules that adjust to the tail index and problem instance.
These methods are important in practical applications such as clinical trials and online experiments where real-time updates are impractical and data may include adversarial corruption.

Robust batched bandit algorithms constitute a class of adaptive sampling procedures in the multi-armed bandit (MAB) paradigm where interaction with the environment is restricted to a small number of batched communication rounds. These methods are critically important when feedback is naturally grouped—such as in clinical trials, online experimentation, high-throughput industrial testing, or large-scale digital platforms—where real-time or fully sequential updating is impractical. Over recent years, robustification has addressed two major real-world challenges: heavy-tailed reward distributions and adversarial corruptions. The theoretical and algorithmic developments in this space delineate new trade-offs between batch complexity, statistical uncertainty, and computational efficiency.

1. Algorithmic Foundation and Modeling of Heavy-Tailed Noise

The robust batched bandit setting is motivated by applications where rewards are sampled in batches and the reward distributions may exhibit heavy tails, meaning they possess only limited moments (often only a finite $1+\epsilon$ moment). Classical empirical mean estimators and sub-Gaussian concentration inequalities are inadequate in these regimes. The key innovation in this setting is to replace standard mean estimators with robust alternatives, such as the median-of-means estimator, which underpins Batched Successive Elimination for Heavy-Tailed rewards (“BaSE-H”, (Guo et al., 4 Oct 2025)).

For a set of arms, the algorithm proceeds by allocating a fixed budget of samples per arm within each batch, then using robust estimators to determine which arms to eliminate. The median-of-means estimator guarantees with high probability,

$|\widehat{\mu} - \mu| \leq v^{1/(1+\epsilon)} \left( \frac{c\log(1/\delta)}{n} \right)^{\epsilon/(1+\epsilon)},$

where $n$ is the number of samples per batch, $v$ is the upper bound on the $(1+\epsilon)$ moment, and $\epsilon>0$ characterizes the tail index. This property gives rise to confidence intervals valid even under massively heavy-tailed distributions, allowing elimination rules and confidence bounds in the absence of sub-Gaussianity.

The batch schedule, i.e., the communication grid, is specifically tuned depending on the statistical regime:

Instance-independent: Batching points follow a geometric progression with exponents derived from the tail index,

$t_1 = l, \quad t_m = l \cdot t_{m-1}^{\epsilon/(1+\epsilon)}, \quad l = T^{1/(1+\epsilon-\epsilon(\epsilon/(1+\epsilon))^{M-1})},$

where $M$ is the number of batches and $T$ is the time horizon.

Instance-dependent: The grid is $t_m' = l' \cdot t_{m-1}'$ , $l' = T^{1/M}$ , matching standard phased elimination in lighter-tailed cases.

In high-dimensional continuous-armed or Lipschitz settings, the action space is partitioned recursively into shrinking hypercubes or balls, and robust elimination is performed based on batch-wise robust means over each region.

2. Regret Guarantees and Batch Complexity Characterization

The main result establishes sharp upper and matching lower bounds for the regret and communication complexity as functions of the reward distribution's heavy-tailedness:

Finite-arm, instance-independent regret: The expected cumulative regret is bounded by

$\mathbb{E}[R_T] = O\left(v^{1/(1+\epsilon)} \log K\; (K \log(TK))^{\epsilon/(1+\epsilon)} T^{1/(1+\epsilon-\epsilon (\epsilon/(1+\epsilon))^{M-1})}\right),$

and to achieve near-optimal regret, the minimal number of batches scales as

$\Omega\left( \log^{-1}\left( \frac{1+\epsilon}{\epsilon} \right)\log \log T \right).$

This means that heavier tails (i.e., smaller $\epsilon$ ) paradoxically reduce the minimal number of required batches for optimal regret in the instance-independent regime—unlike in the light-tailed setting.

Instance-dependent regime: The regret scales as

$O\left( [1/\Delta_{\min}]^{1/\epsilon} v^{1/\epsilon} \log K\; K \log(TK)\; T^{1/M} \right),$

and $\Omega(\log T)$ batches are required, independent of the tail index. Thus, increased tail breaks the dependence of batch complexity on $\epsilon$ ; heavier tails do not reduce the number of batches needed for near-optimal, instance-dependent regret.

Lipschitz (continuous) case: In a $d$ -dimensional action space, with zooming dimension $d_z$ , the worst-case regret is

$\mathbb{E}[R_T] = O\left( v^{1/\epsilon} C_z \log \log(T/\log T) (\log T)^{1/(d_z+1+1/\epsilon)} T^{(d_z+1/\epsilon)/(d_z+1+1/\epsilon)} \right),$

and only $O(\log \log T)$ batches suffice for optimal scaling.

Lower bound constructions in both static and adaptive grid cases confirm that these rates are unavoidable up to logarithmic factors.

3. Robust Mean Estimation and Elimination Process

A central aspect of these algorithms is the batch-wise robust mean estimation. Standard sample averages are not resilient to outliers in heavy-tailed distributions, as their estimation error diverges. The median-of-means estimator, as used in the BaSE-H algorithm, divides the $n$ samples per batch into $k$ groups, computes means per group, and takes the median across groups. This technique ensures that the estimator's deviation from the true mean is controlled in probability, even when the distribution lacks finite variance.

These robust estimates are used to construct confidence intervals for each arm. An arm is eliminated at the end of a batch if its (robust) upper confidence bound is below the lower confidence bound of the currently best observed arm. The crucial innovation is that, due to the altered rate of concentration, the sample size per batch and the elimination thresholds explicitly depend on the heavy-tail parameter and on the batch index.

For continuous-armed (Lipschitz) settings, the elimination and partitioning process employs recursive partitioning schemes tied to the zooming dimension and the tail index, with each batch using a new covering at a smaller scale and robustified empirical means within each region.

4. Communication Grid Design and Its Surprising Effects

Batch schedule design is central: in the instance-independent regime, fewer batches suffice as tails become heavier, contrasting starkly with the intuition that heavier noises require more frequent updating. This is a direct consequence of the slower concentration rates provided by robust mean estimators—delaying updates amortizes the uncertainty over a larger number of samples.

The communication grid is constructed recursively, as described above, with geometric or multiplicative expansion, so the number of communication rounds to achieve near-optimal regret shrinks as the reward distribution becomes heavier tailed. In the instance-dependent regime, the required batch complexity remains essentially static, reflecting the necessity of rapid feedback to tease apart small gaps under local exploration.

In higher-dimensional continuous settings, the batch schedule is further guided by the zooming dimension of the arm space, with partition parameters adapted to the geometry and tail parameters to maintain optimal regret while minimizing rounds of communication.

5. Applications, Implications, and Limitations

These algorithms are directly motivated by clinical trials, drug testing, or financial decision-making, where outcomes are often heavy-tailed and it is logistically or ethically infeasible to update per-subject or per-instance. The insight that heavier-tailed settings can be managed with fewer global decision rounds underpins the potential for efficient, adaptive clinical trial designs with rare but extreme outcomes.

Software and algorithmic implementation is streamlined because only batch-wise communication/policy changes are necessary. However, the curse of dimensionality persists in continuous settings: regret is exponentially dependent on intrinsic or zooming dimension, even with optimal batch adaption.

Practical limitations remain. The work highlights the need for empirical validation (not provided in the original paper (Guo et al., 4 Oct 2025)), acknowledges minor logarithmic gaps between upper and lower bounds, and suggests potential for further refinements using alternative robust estimators beyond median-of-means. Extending these approaches to contextual or reinforcement learning environments with heavy-tailed transitions or rewards is a plausible avenue for future research.

6. Comparative Perspective and Future Work

Compared with prior batched bandit algorithms designed for sub-Gaussian or light-tailed settings (Esfandiari et al., 2019), robust batched bandits are tailored to regimes where classical statistical guarantees break down. Their regret matches lower bounds up to logarithmic factors, are computationally efficient due to their batchwise design, and—counter-intuitively—often benefit in communication complexity with increased tail heaviness.

Potential future directions include:

Empirical testing on realistic, heavy-tailed datasets.
Sharpening (removing logarithmic) gaps in both regret and batch requirements.
Extending to contextual bandits or reinforcement learning with complex stochastic structures.
Investigating more efficient or alternative robust mean estimators (e.g., Catoni’s estimator) and their impact on performance and communication costs.
Deriving a deeper theoretical understanding of the interplay between tail index and optimal batch schedule in generalized adaptive settings.

7. Summary Table: Regime Dependence of Batch Complexity

Bandit Setting	Regret (Heavy-tailed)	Min. Batches for Near-Optimality	Batch Complexity Dependence
Instance-independent, finite K	$\tilde O(T^{1/(1+\epsilon-\epsilon(\epsilon/(1+\epsilon))^{M-1})})$	$O(\log^{-1}((1+\epsilon)/\epsilon)\log\log T)$	Decreases with heavier tails ( $\downarrow\epsilon$ )
Instance-dependent (min gap Δ)	$\tilde O([1/\Delta_{\min}]^{1/\epsilon}T^{1/M})$	$O(\log T)$	Invariant to tail index
Lipschitz arm space ([0,1]^d)	$\tilde O(T^{(d_z+1/\epsilon)/(d_z+1+1/\epsilon)})$	$O(\log\log T)$	Weak dimension effect (via $d_z$ )

This table summarizes headline theoretical findings from (Guo et al., 4 Oct 2025).

In conclusion, robust batched bandit algorithms for heavy-tailed reward models achieve near-optimal regret with efficiently designed batch schedules. The mathematical structure of robust mean estimation techniques, the dependence of both regret and required batch complexity on the tail index and problem instance, and the principled adaptation to geometric and batchwise updating, establish a new foundation for adaptive experimentation in high-uncertainty regimes with severe communication or adaptivity constraints.

PDF Markdown Chat (Pro)

References (2)

Robust Batched Bandits (2025)

Regret Bounds for Batched Bandits (2019)

Whiteboard

Generate a whiteboard explanation of this topic.

Follow Topic

Get notified by email when new papers are published related to Robust Batched Bandit Algorithms.