
Multi-Player Multi-Armed Bandit

Updated 28 December 2025
  • Multi-Player Multi-Armed Bandits (MP-MAB) are a framework where multiple agents select arms with unknown reward distributions, facing collisions that affect outcomes.
  • Key algorithmic techniques include batched exploration, cyclic scheduling, and error-correcting code communication to minimize regret in decentralized settings.
  • Empirical results show that methods like BEACON achieve near-optimal logarithmic regret, even under asynchronous, adversarial, and non-stationary conditions.

A multi-player multi-armed bandit (MP-MAB) problem generalizes the classical stochastic multi-armed bandit (MAB) by considering multiple sequential decision-makers (players) interacting with a shared set of arms with unknown reward distributions. At each time step, every player selects an arm and receives a reward, typically subject to a collision model: when two or more players select the same arm, they collide, which affects the rewards they collect. The MP-MAB framework models decentralized learning and resource allocation under limited or implicit communication and is fundamentally motivated by problems in cognitive radio, wireless networking, and distributed multi-agent control.

1. Formal Problem Setup and Collision Models

Let $M$ denote the number of players and $K$ the number of arms. Each player $m \in [M]$ at time $t \in [T]$ selects an arm $s_m(t) \in [K]$, receiving a reward determined both by the player's action and by the collision configuration on that arm.

Reward and collision settings include:

  • Classic zero-reward collision: Each arm $k$ provides i.i.d. samples $X_{k,m}(t)$ from a distribution $\varphi_{k,m}$ with mean $\mu_{k,m} \in [0,1]$. If more than one player chooses arm $k$ (i.e., $n_k(t) > 1$), each colliding player receives reward $0$; otherwise, the reward is $X_{k,m}(t)$ (Shi et al., 2021, Proutiere et al., 2019). A minimal simulation sketch of this model follows the list.
  • Collision-dependent rewards: If $M_k(t) \ge 1$ players select arm $k$, the reward may be split (e.g., $X_k(t)/M_k(t)$) or degraded (distinct distributions for collision versus no collision), reflecting partial resource sharing (Xu et al., 2023, Shi et al., 2021).
  • Finite shareable resources: Each arm $i$ has a capacity $R_i$; when $m_i(t)$ players select arm $i$, each can collect at most one reward, but at most $R_i$ rewards are granted on that arm in total (Wang et al., 2022).
  • Heterogeneous rewards: Each (player, arm) pair $(k,m)$ has a distinct distribution, modeling diverse user/channel combinations (Shi et al., 2021, Magesh et al., 21 Jan 2025).
  • Abruptly changing or adversarial environments: Arms' means may change at unknown breakpoints, or adversaries may attack (block) some arms (Wei et al., 2018, Magesh et al., 21 Jan 2025).
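
To make the classic zero-reward collision model concrete, here is a minimal simulation sketch in Python; the class name, Bernoulli rewards, and homogeneous means are illustrative assumptions, not drawn from any particular paper.

```python
import numpy as np

class CollisionMABEnv:
    """Homogeneous MP-MAB with the classic zero-reward collision model (sketch)."""

    def __init__(self, means, num_players, rng=None):
        self.means = np.asarray(means)        # mu_k for each arm k
        self.num_players = num_players        # M
        self.rng = rng or np.random.default_rng()

    def step(self, actions):
        """actions[m] is the arm chosen by player m; returns rewards and collision flags."""
        actions = np.asarray(actions)
        counts = np.bincount(actions, minlength=len(self.means))  # n_k(t)
        draws = self.rng.binomial(1, self.means[actions])         # Bernoulli draws X_k(t)
        collided = counts[actions] > 1                            # per-player collision flag
        rewards = np.where(collided, 0, draws)                    # colliding players get 0
        return rewards, collided

# Example: 3 players, 5 arms; players 0 and 1 collide on arm 2 and both receive 0.
env = CollisionMABEnv(means=[0.1, 0.3, 0.5, 0.7, 0.9], num_players=3)
print(env.step([2, 2, 4]))
```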

Feedback and sensing regimes:

  • Collision sensing: Players observe both received rewards and explicit collision flags indicating whether their selection resulted in a collision (Shi et al., 2021, Besson et al., 2017).
  • No-sensing: Players observe only the reward, not whether a collision occurred, requiring inference about collisions from outcome distributions (Shi et al., 2021, Shi et al., 2020).
  • Communication models: Explicit leader-to-follower or forced-collision protocols for information exchange; in some works, communication is purely implicit (e.g., via actions), or prohibited except for minimal coordination bits (Shi et al., 2021, Magesh et al., 21 Jan 2025, Zhou et al., 8 Oct 2025). A toy forced-collision signaling sketch follows this list.
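
The forced-collision idea behind implicit communication can be illustrated with a toy two-player protocol: the transmitter encodes a bit by either colliding with the receiver's arm for a block of rounds or staying on an idle arm, and the receiver decodes from whether it observed any nonzero reward. This is a simplified illustration of the idea (the function names, block length, and decoding rule are assumptions), not the coding scheme of any cited paper.

```python
import numpy as np

rng = np.random.default_rng(0)
means = np.array([0.6, 0.2, 0.4, 0.5])            # illustrative arm means

def step(actions):
    """One round of the zero-reward collision model for the two toy players."""
    actions = np.asarray(actions)
    counts = np.bincount(actions, minlength=len(means))
    draws = rng.binomial(1, means[actions])
    return np.where(counts[actions] > 1, 0, draws)

def send_bit_via_collisions(bit, tx=1, rx=0, rx_arm=0, idle_arm=3, block_len=25):
    """Transmitter encodes bit 1 by colliding with the receiver's arm for block_len rounds,
    bit 0 by staying on an idle arm; the receiver decodes 1 iff it saw only zero rewards.
    Bit 0 is misdecoded with probability (1 - mu_rx_arm)^block_len."""
    rx_rewards = []
    for _ in range(block_len):
        actions = [idle_arm, idle_arm]
        actions[rx] = rx_arm                       # receiver stays on its arm
        actions[tx] = rx_arm if bit == 1 else idle_arm
        rewards = step(actions)
        rx_rewards.append(rewards[rx])
    return int(all(r == 0 for r in rx_rewards))

print(send_bit_via_collisions(bit=1), send_bit_via_collisions(bit=0))
```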

The system objective is typically to maximize the cumulative system reward (social welfare), measured against an "oracle" allocation that assigns the $M$ best arms to the players each round. Regret is defined as the gap between the oracle's reward and the reward the algorithm actually collects, sometimes normalized by the horizon $T$.
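
Under the zero-reward collision model with homogeneous means, the oracle simply places the $M$ players on the $M$ arms with the largest means. A small sketch of the corresponding per-round oracle value and pseudo-regret, with illustrative helper names:

```python
import numpy as np

def oracle_per_round_reward(means, num_players):
    """Expected per-round reward of the collision-free assignment to the M best arms."""
    return np.sort(means)[::-1][:num_players].sum()

def pseudo_regret(means, num_players, collected_rewards):
    """Oracle's expected cumulative reward minus the reward actually collected.

    collected_rewards: array of shape (T, M) holding realized per-player rewards.
    """
    horizon = len(collected_rewards)
    return horizon * oracle_per_round_reward(means, num_players) - np.sum(collected_rewards)

# Example: 2 players, so the oracle value per round is 0.9 + 0.7 = 1.6.
print(oracle_per_round_reward([0.5, 0.9, 0.7, 0.2], num_players=2))
```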

2. Lower Bounds, Centralized versus Decentralized Regret

The optimal lower bounds for MP-MAB match the order of the centralized multi-play bandit problem [Anantharam, Varaiya, and Walrand]. For the setup with $M$ players and $K$ arms (assuming i.i.d. rewards), the centralized problem's minimal possible regret is

$$R(T) = \Omega\left(\sum_{k > M} \frac{\mu_M - \mu_k}{\operatorname{kl}(\mu_k, \mu_M)} \log T\right)$$

where $\operatorname{kl}(\mu_k, \mu_M)$ is the binary KL-divergence between Bernoulli distributions with means $\mu_k$ and $\mu_M$ (Proutiere et al., 2019, Besson et al., 2017).
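
For a given instance, the constant multiplying $\log T$ in this bound can be evaluated directly from the means. A short sketch, assuming Bernoulli arms; the helper names are illustrative:

```python
import math

def binary_kl(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def lower_bound_constant(means, num_players):
    """Evaluates the sum over suboptimal arms k > M of (mu_M - mu_k) / kl(mu_k, mu_M)."""
    mu = sorted(means, reverse=True)
    mu_M = mu[num_players - 1]                     # mean of the M-th best arm
    return sum((mu_M - mu_k) / binary_kl(mu_k, mu_M) for mu_k in mu[num_players:])

# Example instance: K = 5 Bernoulli arms, M = 2 players.
print(lower_bound_constant([0.9, 0.8, 0.7, 0.5, 0.3], num_players=2))
```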

Decentralized algorithms historically exhibited a gap to this lower bound, largely due to the cost of implicit communication (via forced collisions or protocol design) and to inefficiency in avoiding collisions or in learning the optimal matching. For instance,

  • Early distributed policies could incur $\tilde O(M^3 K \log T / \Delta_{\min})$ regret, with multiplicative overheads in $M$ and $K$ (Shi et al., 2021).
  • Modern policies close this gap, achieving optimal $O(\log T)$ instance-dependent regret for both linear (sum) and certain nonlinear reward functions under various feedback models (Shi et al., 2021, Proutiere et al., 2019, Besson et al., 2017, Pacchiano et al., 2021, Zhou et al., 8 Oct 2025).
  • In abruptly changing or adversarial settings, minimax regret rates are often $O(T^{(1+\nu)/2} \log T)$, where $\nu < 1$ quantifies the rate of non-stationarity (Wei et al., 2018).

3. Algorithmic Techniques and Protocols

Adaptive Communication and Decentralized Learning

  • Batched exploration and communication: BEACON (Shi et al., 2021) periodically synchronizes empirical statistics through an adaptive differential communication (ADC) protocol. By quantizing and transmitting only differences of means, BEACON incurs $O(MK \log T)$ total forced-collision steps, matching centralized lower bounds for system regret.
  • Single-leader exploitation: DPE (Proutiere et al., 2019) coordinates only one exploring player (the "leader"), with followers exploiting greedily. Leader updates are broadcast to followers using rare, finite-length collision signals, achieving a finite (i.e., $O(1)$) expected number of communication rounds.
  • Round-robin/orthogonalization: Many algorithms initialize by randomly assigning players to arms to obtain a collision-free (orthogonal) starting configuration (Zhou et al., 8 Oct 2025, Hanawal et al., 2018, Wang et al., 2022). This forms a basis for structured communication (e.g., for rank assignment or block schedule formation).
  • Error-correcting code-based communication: In collision-dependent or no-sensing feedback models, implicit communication is formulated as a noisy channel coding problem, with Hamming or similar codes used to send quantized means via forced or inferred collisions (Shi et al., 2021, Shi et al., 2020).
  • Elimination and UCB-style learning: Most protocols are ultimately elimination-based: agents estimate means with (KL-)UCB indices and iteratively accept or reject arms or matchings (Shi et al., 2021, Zhou et al., 8 Oct 2025, Besson et al., 2017). A generic index computation is sketched after this list.
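
As an example of the index computations these protocols rely on, the following sketch evaluates a standard KL-UCB index by bisection. It is a generic formulation (the $\log t$ exploration threshold is one common choice), not the exact index of any specific cited algorithm.

```python
import math

def binary_kl(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(emp_mean, pulls, t, iters=30):
    """Largest q in [emp_mean, 1] with pulls * kl(emp_mean, q) <= log(t), via bisection."""
    if pulls == 0:
        return 1.0                                 # unexplored arms get the maximal index
    threshold = math.log(max(t, 2))
    lo, hi = emp_mean, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if pulls * binary_kl(emp_mean, mid) <= threshold:
            lo = mid                               # mid still within the confidence region
        else:
            hi = mid
    return lo

# Example: an arm pulled 40 times with empirical mean 0.6, evaluated at round t = 1000.
print(kl_ucb_index(0.6, pulls=40, t=1000))
```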

Coordination without Explicit Communication

  • Cyclic and schedule-based policies: Players maintain local arm schedules, often in cycles. Upon detecting a collision, conflicting players randomize their cyclic positions, converging to a collision-free schedule (Evirgen et al., 2017). A simplified re-randomization sketch follows this list.
  • Successive elimination and implicit gap estimation: Protocols estimate suboptimality gaps per player via successive elimination, alternating between collision-free round-robin exploration and collision-based bitwise encoding to partition arms (Pacchiano et al., 2021).
  • Trekking mechanisms: Players independently "trek" toward higher-valued arms, using deterministic test phases to confirm availability and irrevocably commit once collision-free (Hanawal et al., 2018). No knowledge of MM or global synchronization is required.
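
A simplified re-randomization routine in the spirit of the schedule-based policies above: players keep their slot after a collision-free round and otherwise re-draw a random arm until the configuration is orthogonal. This is a generic sketch (it re-randomizes arm slots rather than cyclic positions, and the function name is illustrative):

```python
import numpy as np

def orthogonalize(step, num_players, num_arms, max_rounds=1000, seed=None):
    """Random hopping until all players occupy distinct arms.

    step(actions) is assumed to return (rewards, collided) per player, as in the
    CollisionMABEnv sketch given in Section 1.
    """
    rng = np.random.default_rng(seed)
    slots = rng.integers(0, num_arms, size=num_players)       # initial random assignment
    for _ in range(max_rounds):
        _, collided = step(slots)
        if not collided.any():
            return slots                                       # collision-free configuration
        # Colliding players re-draw a uniformly random arm; the rest keep their slot.
        slots = np.where(collided, rng.integers(0, num_arms, size=num_players), slots)
    return slots                                               # best effort after max_rounds

# Example (uses the CollisionMABEnv sketch from Section 1):
# env = CollisionMABEnv(means=[0.1, 0.3, 0.5, 0.7, 0.9], num_players=3)
# print(orthogonalize(env.step, num_players=3, num_arms=5))
```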

Handling Asynchrony, Sharable Capacity, and Adversarial Scenarios

  • Asynchronous and dynamic participation: Adaptive protocols, such as ACE (Fan et al., 30 Sep 2025), allow players to enter/leave at arbitrary unknown times, maintaining a dynamic set of "occupied" arms through collision pattern detection and UCB learning.
  • Stochastic or finite arm capacity: Arms serve "arriving requests" with uncertain stochastic capacity and/or known finite capacity; optimal allocations and consensus are achieved via greedy or iterative distributed routines, even in the absence of communication (Xie et al., 20 Aug 2024, Wang et al., 2022). A greedy oracle sketch follows this list.
  • Robustness to adversarial attacks: Algorithms combine minimal synchronization (e.g., single-bit phase signals) and payoff-based regularization, achieving $O(\log^{1+\delta} T + W)$ regret, where $W$ is the total number of time steps in which arms are attacked (Magesh et al., 21 Jan 2025).
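
For the finite shareable-capacity model, the oracle allocation maximizes $\sum_i \min(m_i, R_i)\,\mu_i$ over player counts summing to $M$; because this objective is separable and concave in each $m_i$, a greedy marginal-gain assignment is optimal. A minimal sketch (function and variable names are illustrative, not taken from the cited papers):

```python
import heapq

def greedy_capacity_allocation(means, capacities, num_players):
    """Assign players one by one to the arm with the largest marginal gain.

    Arm i with m_i <= R_i players contributes m_i * mu_i in expectation, so the
    marginal gain of one more player is mu_i while m_i < R_i and 0 afterwards.
    Returns the number of players placed on each arm.
    """
    counts = [0] * len(means)
    # Max-heap of (negative marginal gain, arm index); gain stays mu_i while capacity remains.
    heap = [(-mu, i) for i, mu in enumerate(means)]
    heapq.heapify(heap)
    placed = 0
    while placed < num_players and heap:
        gain, i = heapq.heappop(heap)
        counts[i] += 1
        placed += 1
        if counts[i] < capacities[i]:
            heapq.heappush(heap, (gain, i))        # arm still has residual capacity
    return counts

# Example: 3 arms with means 0.9, 0.6, 0.4, capacities 2, 1, 3, and 4 players -> [2, 1, 1].
print(greedy_capacity_allocation([0.9, 0.6, 0.4], [2, 1, 3], num_players=4))
```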

4. Main Theoretical Results and Regret Bounds

Key instance-dependent results for decentralized heterogeneous and homogeneous MP-MAB include:

| Setting/type | Regret Upper Bound | Centralized Lower Bound | Communication Cost | Reference |
|---|---|---|---|---|
| Heterogeneous MP-MAB, linear reward | $\tilde O((M^2 K/\Delta_{\min}) \log T)$ | $\Omega((M^2 K/\Delta_{\min}) \log T)$ | $O(MK \log T)$ | (Shi et al., 2021) |
| Homogeneous, collision-sensing | $O(\sum_{k>M} \frac{\log T}{\Delta_k})$ | $O(\sum_{k>M} \frac{\log T}{\Delta_k})$ | Finite or $O(\log T)$ | (Proutiere et al., 2019, Besson et al., 2017) |
| No-sensing, collision-dependent | $O(\sum_{k>M} \frac{\log T}{\Delta_k} + M^2 K \log T)$ | $O(\sum_{k>M} \frac{\log T}{\Delta_k})$ | $O(\log T)$ | (Shi et al., 2021, Shi et al., 2020) |
| Finite shareable resources/capacity | $O(\log T)$ | $O(\log T)$ | $O(K^3)$ (initialization) | (Wang et al., 2022, Xie et al., 20 Aug 2024) |
| Adversarial attacks, heterogeneous rewards | $O(\log^{1+\delta} T + W)$ | $O(\log T + W)$ | $O(\log T)$ | (Magesh et al., 21 Jan 2025) |
| Abruptly changing (non-stationary) | $O(T^{(1+\nu)/2} \log T)$ | N/A | – | (Wei et al., 2018) |
| Asynchronous (unknown entry/exit times) | $O(\sqrt{T \log T} + (\log T)/\Delta^2)$ | N/A | None | (Fan et al., 30 Sep 2025) |

5. Extension to Generalized Rewards and Structural Models

Recent work extends MP-MAB to more general reward structures:

  • General (nonlinear) system rewards: If the system reward is a monotone, bounded-smooth function $v(\cdot)$ of the per-player rewards, BEACON achieves $\tilde O\left(\sum_{k,m} \frac{\Delta_{k,m}^{\max}}{[f^{-1}(\Delta_{k,m}^{\min})]^2} \log T + M^2 K \Delta_c \log T\right)$ regret (Shi et al., 2021).
  • Sharable and finite resource arms: Algorithms such as DPE-SDI and OptArmPulProfile solve for optimal allocations accounting for the number of players per arm and capacity constraints. Logarithmic regret is attainable even when players must coordinate without direct communication (Wang et al., 2022, Xie et al., 20 Aug 2024).
  • Averaging collision model: SMAA achieves instance-optimal per-player and system regret by modeling reward sharing as Nash equilibria of one-shot games, with decentralized policies that are robust to strategic deviations (Xu et al., 2023).

6. Empirical Validation and Practical Applications

Empirical studies consistently confirm theoretical regret bounds, stability, and coordination efficiency across scenarios:

  • Regret reduction: BEACON achieves 6–7× lower regret than previous METC and matches centralized CUCB (Shi et al., 2021); SynCD outperforms previous fully distributed and leader-follower schemes in both group and individual regret (Zhou et al., 8 Oct 2025).
  • Scalability: Practical protocols (e.g., BEACON, SMAA, ACE) show robust performance for $M$ and $K$ up to 50–100 and horizons $T > 10^6$.
  • Networking applications: Resource allocation in D2D, edge computing, and 5G networks, where varying feedback and collision models reflect physical-layer constraints (Neogi et al., 2018, Wang et al., 2022).
  • Stability and robustness: Algorithms adapt successfully to asynchronous dynamics, abrupt change-points, adversarial interference, and partial observability, demonstrating applicability to IoT and wireless systems (Fan et al., 30 Sep 2025, Magesh et al., 21 Jan 2025).

7. Open Problems and Future Directions

Ongoing challenges and active topics in MP-MAB research include:

  • Sub-logarithmic regret: Lowering the poly-log factor or achieving instance-optimality in broad settings (e.g., $\Theta(\log T)$ for all variants).
  • Heterogeneous and correlated reward models: Extending optimal regret and communication-efficient protocols to fully heterogeneous, time-correlated, and/or adversarial settings.
  • Unknown system size and non-stationarity: Removing requirements for known MM or KK, handling unbounded or adversarial player dynamics, or adapting efficiently to non-stationary means.
  • Hybrid resource models: Further generalizations combining stochastic arm capacities, time-varying sharable loads, and multi-resource (vector) constraints.
  • Communication-efficient protocols: Reducing communication/coordination costs below logarithmic scaling, or developing practical coding methods for implicit communication under severe feedback constraints.

For detailed constructions, regret analyses, and concrete pseudocode, see (Shi et al., 2021, Proutiere et al., 2019, Zhou et al., 8 Oct 2025), and references therein.
