
Multiple-Play Stochastic Bandits

Updated 1 January 2026
  • Multiple-play stochastic bandits are a framework where multiple arms are selected per round under complex combinatorial and feasibility constraints, advancing the theory beyond classical single-play MABs.
  • Algorithmic approaches like MP-TS, KL-UCB variants, and combinatorial heuristics balance exploration and exploitation while addressing issues such as capacity limits, delays, and position biases.
  • Empirical studies in sectors like web ranking and telecom demonstrate that specialized multiple-play bandit algorithms attain regret performance near theoretical lower bounds, outperforming traditional black-box methods.

Multiple-play stochastic bandits extend the classical multi-armed bandit (MAB) framework by allowing the learner to select more than one arm per round, often modeled as pulling $K$ arms from $N$ candidates at each decision time. This generalization, motivated by applications such as web ranking, resource allocation, and system scheduling, introduces nontrivial combinatorial and statistical challenges. The multiple-play setting encompasses various structures, including distinct arms per round, capacity and priority constraints, blocking (delays), position bias, budget restrictions, and non-equivalent order effects. Recent research directions formalize and analyze regret bounds, sample complexity, and optimal action policies, with specialized algorithms (UCB variants, Thompson sampling extensions, combinatorial heuristics) achieving performance near theoretical lower bounds.

1. Formal Problem Definitions and Taxonomy

Multiple-play stochastic bandits are formally characterized by a set of $N$ arms, each with a potentially unknown reward distribution; in every round, the learner chooses a multi-arm action $A_t$ subject to feasibility constraints. The basic variants are:

  • Top-K selection ("standard MP-MAB"): At time $t$, select exactly $K$ distinct arms, each yielding an independent reward; regret is measured against playing the $K$ arms with largest means in every round (Komiyama et al., 2015).
  • Combinatorial constraints: Feasible actions are encoded by an independence system $I \subset 2^A$ (matchings, knapsacks), supporting arbitrary combinatorial selection (Atsidakou et al., 2021).
  • Shareable/capacitated arms: Arms may serve multiple plays per round, up to capacity $m_k$; rewards depend on load, and capacity estimation becomes critical (Wang et al., 2022).
  • Blocking bandits: Once an arm is pulled, it becomes unavailable for $D_{i,t}-1$ rounds (stochastic or deterministic blocking) (Atsidakou et al., 2021).
  • Prioritized resource-sharing: Multiple plays compete for arm capacity, which is stochastically allocated by priority weights $\alpha_k$ (Xie et al., 25 Dec 2025).
  • Position-based models (PBM): Arms are displayed in ordered lists, and observed feedback depends on position-specific examination probabilities $\kappa_\ell$ (see the click-model sketch after this list) (Lagrée et al., 2016).
  • Budget constraints: Each arm carries a reward and a cost; cumulative cost must not exceed a budget $B$, and the game horizon is defined by budget exhaustion (Zhou et al., 2017).
  • Non-equivalent multiple plays: Arms' reward contributions depend on order/position with slot-dependent payoff and information structure (Vorobev et al., 2015).
  • Markovian restarts: Arms evolve as Markov chains only when played, and selection per round is multi-arm (Moulos, 2020).
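
As an illustration of the position-based feedback model above, the following minimal sketch (the examination probabilities, per-arm attraction parameters, and displayed ranking are assumed values, not taken from any cited paper) generates the click feedback for a displayed list, where the click probability of the arm placed in slot $\ell$ factorizes as $\kappa_\ell \theta_k$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed position-based model: P(click on slot l) = kappa[l] * theta[arm shown in slot l].
kappa = np.array([1.0, 0.6, 0.3])        # examination probabilities per slot (illustrative)
theta = np.array([0.5, 0.4, 0.3, 0.2])   # per-arm attraction probabilities (illustrative)

ranking = np.array([0, 1, 2])            # arms displayed in the L ordered slots
clicks = rng.binomial(1, kappa * theta[ranking])
print(clicks)                            # one binary click observation per slot
```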

Regret is typically defined as the cumulative difference between the reward of an optimal strategy (oracle) that knows the true arm parameters, capacities, priorities, or delays, and the reward achieved by the algorithm.
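
To make this concrete in the standard top-$K$ setting, the following minimal sketch (the arm means, the uniform-random baseline policy, and the horizon are illustrative assumptions) simulates Bernoulli arms and accumulates pseudo-regret against the oracle that always plays the $K$ arms with the largest means:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative instance: N Bernoulli arms, K plays per round (means are assumed).
means = np.array([0.9, 0.8, 0.7, 0.4, 0.3, 0.2])
N, K, T = len(means), 3, 10_000

oracle_value = np.sort(means)[-K:].sum()   # per-round expected reward of the top-K oracle
regret = 0.0

for t in range(T):
    # Placeholder policy for illustration: choose K distinct arms uniformly at random.
    action = rng.choice(N, size=K, replace=False)
    rewards = rng.binomial(1, means[action])        # observed feedback (used by a learning policy)
    regret += oracle_value - means[action].sum()    # pseudo-regret: gap in expected reward

print(f"pseudo-regret of the random policy after {T} rounds: {regret:.1f}")
```

Any of the algorithms discussed below can be dropped into the action-selection step in place of the random policy.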

2. Theoretical Regret Bounds and Sample Complexity

Regret analysis in multiple-play settings generalizes the classical Lai–Robbins lower bounds and reveals nuanced combinatorial dependencies:

  • Standard top-$L$ play from $K$ arms: For independent Bernoulli arms with $L$ arms played per round, Anantharam et al.'s lower bound gives $E[\text{Reg}(T)] \geq \sum_{i=L+1}^{K} \frac{\Delta_{i,L}}{d(\mu_i,\mu_L)}\log T$, with $\Delta_{i,L}=\mu_L-\mu_i$ and $d(\cdot,\cdot)$ the Bernoulli KL divergence (Komiyama et al., 2015); this rate is achieved by MP-TS and MP-KL-UCB (a numeric evaluation of the leading constant appears after this list).
  • Blocking and delays: Approximability is fundamentally limited by the maximum blocking delay $d_{\max}$. If delay realizations are observable, no algorithm guarantees a ratio better than $O(1/d_{\max})$; hereditary constraints are necessary for any constant-factor approximation (Atsidakou et al., 2021). Bandit regret scales as $O((kr/\Delta)\log T + d_{\max}k)$; $d_{\max}\to 1$ recovers classical bounds.
  • Shareable finite-capacity arms: The per-arm regret lower bound includes a KL-type term for distinguishing suboptimal arms, plus a sample-complexity term proportional to $m_k^2/\mu_k^2$ for learning capacities (Wang et al., 2022). The overall $E[\text{Reg}(T)]$ is matched by a combination of capacity-estimation sampling and standard reward-based arm selection.
  • Prioritized capacity sharing: The instance-independent lower bound is $\Omega(\alpha_1\sigma\sqrt{MKT})$, and the instance-dependent rate is $\Omega(\alpha_1\sigma^2 \frac{M}{\Delta}\ln T)$, where the priority $\alpha_1$ and the reward tail $\sigma$ dominate the scaling (Xie et al., 25 Dec 2025).
  • Budget-constrained multiple play: Regret is bounded as $O(NK^4\ln B)$; the budget induces a random early-stopping horizon, and exploration must control reward-to-cost ratios (Zhou et al., 2017).
  • Position-based models: The regret lower bound generalizes KL-divergence rates to position-examination probabilities: $\liminf_{T\to\infty} R(T)/\log T \geq \sum_{k=L+1}^{K}\min_{1\leq \ell\leq L}\frac{\Delta_{v_{k,\ell}}}{d(\kappa_\ell\theta_k,\kappa_\ell\theta_L)}$, where $v_{k,\ell}$ represents a list with arm $k$ substituted at position $\ell$ (Lagrée et al., 2016).
  • Non-equivalent plays: Leading-order regret is slot-dependent; identifying arms' effects per slot yields $R_T \geq \sum_{j>n} \max_{i\leq n}\min_{k=1..m} \mathrm{Reg}(k,j)/I_k(a_j,a_i)\cdot\log T$ (Vorobev et al., 2015).
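
As a worked numeric example of the top-$L$ bound above, the following sketch (the arm means are assumed for illustration) evaluates the leading constant $\sum_{i=L+1}^{K}\Delta_{i,L}/d(\mu_i,\mu_L)$ that multiplies $\log T$:

```python
import numpy as np

def bernoulli_kl(p, q, eps=1e-12):
    """KL divergence d(p, q) between Bernoulli(p) and Bernoulli(q)."""
    p, q = np.clip(p, eps, 1 - eps), np.clip(q, eps, 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

# Illustrative instance in the notation of the first bullet: K arms, L plays per round.
mu = np.array([0.9, 0.8, 0.7, 0.5, 0.4, 0.2])   # assumed means, sorted in decreasing order
L = 3
mu_L = mu[L - 1]                                 # mean of the L-th best arm

# Leading constant: sum over suboptimal arms i = L+1..K of (mu_L - mu_i) / d(mu_i, mu_L).
constant = sum((mu_L - mu_i) / bernoulli_kl(mu_i, mu_L) for mu_i in mu[L:])
print(f"asymptotic lower bound ~ {constant:.3f} * log T")
```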

Lower bounds on the sample complexity of capacity estimation in the shareable-arms setting match the corresponding regret terms, demonstrating that precise capacity learning is necessary for optimal policy performance (Wang et al., 2022).

3. Algorithmic Approaches: Greedy, UCB, and Thompson Sampling Extensions

Algorithm design in multiple-play settings adapts classical bandit ideas to combinatorial, capacity, and structural constraints:

  • MP-TS (Multiple-play Thompson Sampling): For binary rewards, sample each arm's posterior independently and play the top $L$ arms by sampled value; this achieves optimal regret matching the lower bound (Komiyama et al., 2015) (see the sketch after this list). The improved IMP-TS restricts exploration to a single slot per round, empirically reducing regret.
  • KL-UCB and Round-Robin KL-UCB: KL-UCB extends to multiple play by selecting the arms with the highest KL-UCB index values; round-robin index calculations maintain regret performance at a fraction of the computational cost (Moulos, 2020) (a bisection sketch of the index computation appears at the end of this section).
  • CBBSD-UCB (Combinatorial Blocking Bandits with Stochastic Delays - UCB): For unknown rewards under stochastic delays, estimate means and select via combinatorial approximation oracle using optimistic indices (Atsidakou et al., 2021).
  • Capacity Estimation via Uniform Confidence Intervals (UCI): Learn arm capacities in shareable-arm settings using parsimonious individual and united explorations; sample-complexity-tight intervals enable correct identification of $m_k$ as rounds progress (Wang et al., 2022).
  • MSB-PRS-OffOpt and ApUCB: For prioritized sharing, combinatorial matching algorithms (Hungarian/Crouse) identify optimal assignments based on learned reward/capacity parameters; ApUCB leverages confidence intervals for exploration-exploitation balance (Xie et al., 25 Dec 2025).
  • PBM-UCB, PBM-PIE: In position-biased displays, UCB and KL-UCB-style parsimonious exploration indices decomposed over positions guide arm ranking and placement, yielding regret rates matching position-weighted lower bounds (Lagrée et al., 2016).
  • UCB-MB: The adaptation to budget constraints uses optimism in the reward-to-cost ratio and controls stopping via budget exhaustion, maintaining regret polynomial in $K$ (Zhou et al., 2017).
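
As a minimal sketch of MP-TS as described in the first bullet (the Beta(1,1) priors, arm means, and horizon are illustrative assumptions), each round draws one posterior sample per arm and plays the $L$ arms with the largest samples:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative instance: Bernoulli arms with assumed means, L plays per round.
means = np.array([0.9, 0.8, 0.7, 0.5, 0.3, 0.2])
N, L, T = len(means), 3, 10_000

alpha = np.ones(N)   # Beta(1, 1) prior on each arm's mean: successes + 1
beta = np.ones(N)    #                                      failures  + 1

for t in range(T):
    theta = rng.beta(alpha, beta)             # one posterior sample per arm
    action = np.argsort(theta)[-L:]           # play the L arms with the largest samples
    rewards = rng.binomial(1, means[action])  # observe one binary reward per played arm
    alpha[action] += rewards                  # Bayesian update of the played arms
    beta[action] += 1 - rewards

print("posterior means:", np.round(alpha / (alpha + beta), 3))
```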

Technical innovations include combinatorial oracle integration, exploration strategies exploiting structure (e.g., single-arm focus in non-equivalent plays), and sample-efficient confidence interval construction for latent parameters such as capacities and delays.
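
For the KL-UCB-style indices referenced above, the per-arm index is the largest mean that remains statistically consistent with the empirical mean under a KL-divergence budget on the order of $\log t / n$; a minimal bisection sketch (the exploration budget $\log t$ and the example inputs are illustrative assumptions, and published variants use slightly different budgets such as $\log t + c\log\log t$) is:

```python
import math

def bernoulli_kl(p, q, eps=1e-12):
    """KL divergence d(p, q) between Bernoulli(p) and Bernoulli(q)."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(emp_mean, n_pulls, t, iters=50):
    """Largest q >= emp_mean with n_pulls * d(emp_mean, q) <= log t, found by bisection."""
    budget = math.log(max(t, 2)) / n_pulls
    lo, hi = emp_mean, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if bernoulli_kl(emp_mean, mid) <= budget:
            lo = mid          # mid is still consistent with the data: index is at least mid
        else:
            hi = mid
    return lo

# Example: an arm with 30 successes in 100 pulls, evaluated at round t = 1000.
print(round(kl_ucb_index(0.30, 100, 1000), 4))
```

In the multiple-play variants, the $L$ arms with the largest such indices are played each round.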

4. Structural and Computational Hardness Results

Structural constraints sharply impact approximability and computational tractability:

| Constraint Type | Hardness/Barrier | Sufficient Condition for Tractability |
| --- | --- | --- |
| Delay realizations observable | $\omega(1/d_{\max})$ approximation impossible (Atsidakou et al., 2021) | Delay ignorance (only distributions known) |
| Non-hereditary action family $I$ | No poly-time approximation better than $\Omega(k^{-1/2+\epsilon})$ (Atsidakou et al., 2021) | Hereditary $I$ (independence system) |
| Max-$k$-Cover structure | No poly-time approximation better than $1-1/e+\epsilon$ (Atsidakou et al., 2021) | Linear optimization over $I$ |
| Priority-based utility (shareable arms) | Nonlinear in allocation; matching complexity $O(M^3K^3)$ (Xie et al., 25 Dec 2025) | Weighted matching + monotonicity |

These hardness results delineate inherent barriers, with positive results contingent upon action family heredity, delay ignorance, and combinatorially tractable structures like independence systems, bipartite matchings, or monotonic allocation graphs.
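
As one concrete example of the tractable structure in the last table row, a weighted bipartite matching between plays and arms can be solved exactly by Hungarian-type methods; the sketch below (the optimistic utility matrix and its dimensions are illustrative assumptions) uses SciPy's assignment solver as such an oracle:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def matching_oracle(utility):
    """Assign plays (rows) to distinct arms (columns), maximizing total estimated utility."""
    rows, cols = linear_sum_assignment(-utility)   # the solver minimizes cost, so negate
    return list(zip(rows.tolist(), cols.tolist())), float(utility[rows, cols].sum())

# Example: 3 plays, 4 arms, with optimistic (UCB-style) utility estimates (assumed values).
ucb_utilities = np.array([
    [0.9, 0.5, 0.3, 0.4],
    [0.2, 0.8, 0.6, 0.1],
    [0.4, 0.3, 0.7, 0.5],
])
assignment, value = matching_oracle(ucb_utilities)
print(assignment, round(value, 2))
```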

5. Empirical Performance and Applied Contexts

Empirical studies validate theoretical findings and demonstrate the relevance of multiple-play stochastic bandits in practical domains:

  • Advertising and web ranking: MP-TS, PBM-TS, and PBM-PIE achieve near-optimal regret in large-scale click-through prediction; position bias models yield substantial improvements over models ignoring examination probability (Komiyama et al., 2015, Lagrée et al., 2016).
  • Resource allocation in LLM/edge AI: Prioritized sharing models mirror practical priority-aware scheduling in LLM cloud instances and edge computing; empirical algorithms match theoretical bounds (Xie et al., 25 Dec 2025).
  • Telecom/base station selection: OrchExplore outperforms successive elimination and two-phase ETC-UCB in 5G/4G settings; capacity estimation is fundamental (Wang et al., 2022).
  • Budgeted operations: UCB-MB algorithms yield logarithmic regret in total budget while controlling cumulative cost in stochastic environments (Zhou et al., 2017).
  • Computational efficiency: Round-robin KL-UCB reduces wall-clock cost by an order of magnitude while preserving optimal regret scaling (Moulos, 2020).

A plausible implication is that specialized algorithms—tuned to the combinatorial and information structure of the specific application—substantially outperform black-box approaches, especially when action feasibility, delays, and resource sharing are present.

6. Variants, Open Questions, and Research Directions

Active research focuses on novel structural variants and open problems:

  • Capacity estimation tightness: Upper and lower bounds on sample complexity and regret in capacity-limited settings are tight up to constant factors, but exploring settings with correlated loads or adversarial capacity processes remains open (Wang et al., 2022).
  • Non-equivalent permutations: Regret lower bounds highlight the combinatorial complexity in slot-dependent order models; future work may refine upper bounds by adaptive slot targeting (Vorobev et al., 2015).
  • Budgeted adversarial bandits: Improving regret bounds in budgeted adversarial environments (Exp3.M.B) requires new martingale and concentration techniques (Zhou et al., 2017).
  • Higher-order blocking/delay dependencies: Beyond i.i.d. blocking times, incorporating Markovian or adversarially chosen delays raises computational and analysis complexity (Atsidakou et al., 2021).
  • Priority-sharing under dynamic system constraints: Extending the prioritized resource-sharing framework to dynamic or adversarial arm capacities may prove valuable for robust inference and scheduling in nonstationary systems (Xie et al., 25 Dec 2025).
  • Tightening the polynomial dependence on $K$ in regret bounds: Several analyses (UCB-MB, combinatorial UCB) reveal $K^4$ factors; it is unclear whether $K^2$ or even $K$ scaling is achievable with tractable oracle structure (Zhou et al., 2017).

Fundamentally, the multiple-play stochastic bandit paradigm provides a flexible statistical decision-theoretic framework for complex sequential allocation, with rapidly expanding theoretical and practical reach driven by combinatorial generalizations and structural constraints.
