Syndicated Bandits: Multi-Agent Online Learning
- Syndicated bandits are multi-agent extensions of the classical multi-armed bandit model that enable independent agents to learn and act under constraints such as collisions and privacy requirements.
- They employ diverse coordination strategies, including collision-based ranking, decentralized epsilon-greedy methods, and private distributed elimination, achieving sublinear regret or PAC identification guarantees.
- Applications span cognitive radio networks, edge computing, and recommender systems, highlighting scalable, robust, and privacy-preserving online decision-making.
Syndicated bandits refer to multi-agent extensions of the classical multi-armed bandit (MAB) model where multiple players (agents, users, devices) independently interact with a shared set of arms, often under additional constraints such as collisions, privacy requirements, lack of communication, or distributed system architectures. This class of problems emerges in environments such as cognitive radio networks, edge computing, recommender systems, and online resource allocation with decentralized actors. The core interest lies in rigorously addressing exploration-exploitation trade-offs for a group, where local learning and global efficiency must be balanced despite partial observability and conflicting incentives.
1. Formal Models of Syndicated Bandits
The syndicated bandit paradigm encompasses a variety of formal models, including:
- Adversarial Multi-Player Bandit Model: $m$ cooperative players share $K$ arms. At round $t$, each player chooses an arm (or abstains), and an oblivious adversary assigns losses to all arms. Collisions arise if multiple players choose the same arm, incurring a fixed penalty (loss 1); solo players receive the actual arm loss. Players observe only local feedback (a collision indicator, plus the loss if no collision occurred), with no inter-player communication allowed (Alatur et al., 2019).
- Stochastic Multi-User Bandit Model: $N$ users and $K$ arms with unknown reward distributions. Each user independently selects one arm per round; collisions yield zero reward (with binary collision feedback). The socially optimal solution maximizes joint reward by allocating the $N$ best arms to the users without overlap, but users act without central coordination or knowledge of $N$ (Avner et al., 2014). A minimal environment simulation follows this list.
- Networked Bandits for Privacy: multiple distributed players attempt best-arm identification without sharing raw data. Only aggregated elimination votes are exchanged, protecting user-level privacy. A shared-means assumption ($\mu_k^{(i)} = \mu_k$ for all players $i$ and arms $k$) simplifies the analysis. The objective is to find $\epsilon$-optimal arms with high probability using minimal inter-node communication (Féraud, 2016).
- Contextual Syndicated Bandits for Hyper-parameter Tuning: contextual bandit algorithms (LinUCB, LinTS, UCB-GLM) depend on hyper-parameters (e.g., the exploration coefficient $\alpha$ and the regularizer $\lambda$). Offline tuning is infeasible in online deployment; the framework introduces layered bandit models (EXP3 atop the base bandit) to select values for multiple hyper-parameters online (Ding et al., 2021).
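To make the feedback structure concrete, here is a minimal Python simulation of the stochastic multi-user model above. It is an illustrative sketch, not code from the cited papers: the class name `CollisionBanditEnv` and its interface are assumptions, and rewards are Bernoulli for simplicity.

```python
import numpy as np

class CollisionBanditEnv:
    """Toy environment: K Bernoulli arms shared by several users.
    Colliding users get zero reward plus a binary collision flag."""

    def __init__(self, means, seed=None):
        self.means = np.asarray(means)          # unknown arm means (Bernoulli)
        self.rng = np.random.default_rng(seed)

    def step(self, choices):
        """choices: one arm index per user. Returns a (reward, collided)
        pair per user; only solo users observe the arm's actual reward."""
        counts = np.bincount(choices, minlength=len(self.means))
        feedback = []
        for arm in choices:
            if counts[arm] > 1:                 # collision: zero reward
                feedback.append((0.0, True))
            else:                               # solo play: Bernoulli draw
                feedback.append((float(self.rng.random() < self.means[arm]), False))
        return feedback

# Three users on five arms; users 0 and 1 collide on arm 2.
env = CollisionBanditEnv([0.9, 0.8, 0.7, 0.4, 0.2], seed=0)
print(env.step([2, 2, 0]))   # [(0.0, True), (0.0, True), (reward, False)]
```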
2. Algorithmic Architectures and Coordination Mechanisms
Coordination in syndicated bandits is achieved via diverse algorithmic strategies:
- Collision-Based Distributed Coordination: Algorithms like C-Play (Alatur et al., 2019) employ three phases:
- Ranking: all players pick arms uniformly at random for an initial block of rounds, establishing distinct player ranks with high probability.
- Blocked Coordination: a single coordinator player runs blocked-EXP3 over meta-arms (assignments of distinct arms to players), coordinating assignments within blocks, while followers discover their assigned arms via round-robin trial-and-error (see the meta-arm enumeration sketch after this list).
- Play and Feedback: Arm assignments persist throughout the block; unbiased meta-arm loss estimates feed back into EXP3 updates.
- Decentralized $\epsilon$-Greedy with Collision Avoidance (MEGA): each user tracks arm availability and empirical means locally. Upon collision, the user persists with probability $p$; otherwise it declares the arm unavailable for a randomized quiet period. This mechanism passively orthogonalizes users across arms and enables adaptation to unknown and changing user counts (Avner et al., 2014); a simplified sketch follows this list.
- Private Distributed Elimination (DME, EDME): each player runs a local Median Elimination. Players cannot see others' statistics; when local estimates suggest an arm should be eliminated, they vote. Once sufficient votes are received, the central server broadcasts the arm's elimination to all nodes (see the vote-aggregation sketch after this list). EDME runs multiple DME instances in parallel to handle uncertainty in the size of the active user set (Féraud, 2016).
- Layered Online Hyper-parameter Selection: a top layer of adversarial bandit learners (one EXP3 per hyper-parameter) selects values for each hyper-parameter; a bottom layer runs the contextual bandit algorithm with the chosen configuration. Reward feedback is propagated upward using importance-weighted updates, avoiding naive joint tuning over the full configuration grid (Ding et al., 2021); a minimal sketch follows this list.
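To illustrate the meta-arm space that C-Play's coordinator optimizes over, the short sketch below enumerates ordered assignments of distinct arms to players; the helper name `meta_arms` is hypothetical.

```python
from itertools import permutations

def meta_arms(n_arms, n_players):
    """Meta-arms: ordered assignments of distinct arms to players,
    i.e., the action set the coordinator's blocked-EXP3 runs over."""
    return list(permutations(range(n_arms), n_players))

print(len(meta_arms(5, 3)))   # 5 * 4 * 3 = 60 meta-arms for K=5, m=3
print(meta_arms(3, 2))        # [(0, 1), (0, 2), (1, 0), (1, 2), (2, 0), (2, 1)]
```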
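The persist-or-back-off rule of MEGA can be sketched as follows. This is a simplified illustration: the fixed constants stand in for the adaptive parameter schedules of Avner et al. (2014), and the class name `MegaUser` is hypothetical.

```python
import numpy as np

class MegaUser:
    """One decentralized epsilon-greedy user with collision avoidance."""

    def __init__(self, n_arms, persist_prob=0.6, eps=0.1, max_quiet=50, seed=None):
        self.rng = np.random.default_rng(seed)
        self.persist_prob = persist_prob     # chance of contending again after a collision
        self.eps = eps                       # exploration probability
        self.max_quiet = max_quiet           # longest random quiet period
        self.counts = np.zeros(n_arms)
        self.means = np.zeros(n_arms)        # local empirical means
        self.quiet_until = np.zeros(n_arms)  # arm k unavailable before round quiet_until[k]

    def choose(self, t):
        avail = np.flatnonzero(self.quiet_until <= t)  # assumes some arm is available
        if self.rng.random() < self.eps:
            return self.rng.choice(avail)              # explore
        return avail[np.argmax(self.means[avail])]     # exploit

    def update(self, t, arm, reward, collided):
        if collided:
            if self.rng.random() >= self.persist_prob:
                # back off: declare the arm unavailable for a random quiet period
                self.quiet_until[arm] = t + self.rng.integers(1, self.max_quiet + 1)
        else:
            self.counts[arm] += 1
            self.means[arm] += (reward - self.means[arm]) / self.counts[arm]
```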
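The vote-aggregation step of DME can be sketched as below; only votes cross the network, never raw samples. The strict-majority threshold is an illustrative placeholder for the paper's vote rule, and `EliminationServer` is a hypothetical name.

```python
from collections import defaultdict

class EliminationServer:
    """Central aggregator: collects per-arm elimination votes and
    broadcasts an elimination once enough players agree."""

    def __init__(self, n_players):
        self.threshold = n_players // 2 + 1   # placeholder: strict majority
        self.votes = defaultdict(set)         # arm -> ids of players that voted
        self.eliminated = set()

    def vote(self, player_id, arm):
        """Register one player's vote; return newly eliminated arms to broadcast."""
        if arm in self.eliminated:
            return set()
        self.votes[arm].add(player_id)        # duplicate votes are ignored (set)
        if len(self.votes[arm]) >= self.threshold:
            self.eliminated.add(arm)
            return {arm}
        return set()

server = EliminationServer(n_players=5)
for pid in range(3):
    print(server.vote(pid, arm=7))   # the third vote crosses the 3-of-5 threshold
```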
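Finally, a minimal sketch of the layered selection scheme: one EXP3 learner per hyper-parameter on top, with the base contextual bandit (not shown) run underneath using the chosen configuration. The candidate grids for `alpha` and `lambda` are illustrative, and rewards are assumed to lie in [0, 1]; the same EXP3 update is the building block that C-Play's coordinator applies over meta-arms.

```python
import numpy as np

class Exp3:
    """Adversarial bandit learner over a finite candidate set."""

    def __init__(self, n, gamma=0.1, seed=None):
        self.w = np.ones(n)
        self.gamma = gamma
        self.rng = np.random.default_rng(seed)

    def draw(self):
        p = (1 - self.gamma) * self.w / self.w.sum() + self.gamma / len(self.w)
        return self.rng.choice(len(self.w), p=p), p

    def update(self, i, p, reward):
        # importance-weighted estimate: only the chosen candidate is credited
        self.w[i] *= np.exp(self.gamma * reward / (p[i] * len(self.w)))

# One EXP3 per hyper-parameter (illustrative candidate grids).
grids = {"alpha": [0.5, 1.0, 2.0], "lambda": [0.01, 0.1, 1.0]}
tuners = {name: Exp3(len(vals), seed=0) for name, vals in grids.items()}

def select_config():
    """Top layer: each EXP3 independently picks one candidate value."""
    picks = {name: t.draw() for name, t in tuners.items()}
    return {name: grids[name][i] for name, (i, _) in picks.items()}, picks

def feed_back(picks, reward):
    """Propagate the base bandit's observed reward to every layer."""
    for name, (i, p) in picks.items():
        tuners[name].update(i, p, reward)

config, picks = select_config()
print(config)                 # e.g. {'alpha': 1.0, 'lambda': 0.1}
feed_back(picks, reward=0.7)  # after running the base bandit with `config`
```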
3. Theoretical Guarantees and Regret Bounds
Regret bounds in syndicated bandits quantify the efficiency of decentralized learning:
- Adversarial Multi-Player Regret (Alatur et al., 2019):
- The expected collective regret is $\tilde{O}\!\left(T^{2/3}\right)$ up to polynomial factors in the number of players $m$ and arms $K$, where $T$ is the horizon. This is sublinear in $T$ even under adversarial losses and with no communication.
- Stochastic Multi-User Regret (Avner et al., 2014):
- The combined regret bound for MEGA balances exploration, collision, and backoff terms; optimizing the algorithm's free parameters yields regret $O\!\left(T^{2/3}\right)$. The protocol is robust to unknown and varying user counts.
- Distributed Median Elimination Communication and Sample Complexity (Féraud, 2016):
- DME exchanges only aggregated elimination votes, so the number of transmitted bits is negligible relative to the sampling cost.
- The wall-clock speed-up factor approaches the number of participating players $n$: per-player sample complexity is roughly a $1/n$ fraction of the centralized Median Elimination sample complexity.
- Layered Auto-Tuning Regret (Ding et al., 2021):
- With $L$ hyper-parameters and $n_\ell$ candidate values for the $\ell$-th, the tuning overhead scales with $\sum_{\ell=1}^{L} n_\ell$ rather than with the full grid size $\prod_{\ell=1}^{L} n_\ell$. When the theory-optimal hyper-parameter values are contained in the candidate sets, the regret matches that of the optimally tuned base algorithm up to this overhead. The additive dependence avoids exponential scaling in $L$; see the worked comparison below.
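A small worked comparison makes the additive-versus-multiplicative distinction concrete; the candidate counts below are illustrative, not taken from the paper.

```latex
% L = 3 hyper-parameters with n_1 = n_2 = n_3 = 5 candidates each.
% Layered scheme (one EXP3 per hyper-parameter): overhead scales with the sum
\sum_{\ell=1}^{3} n_\ell = 5 + 5 + 5 = 15 .
% Naive joint tuning (one bandit over the full grid): overhead scales with the product
\prod_{\ell=1}^{3} n_\ell = 5 \cdot 5 \cdot 5 = 125 .
```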
4. Practical Implementations and Scalability
Several architectures have demonstrated practical viability:
- Decentralized Bandits in Cognitive Radio Networks (Avner et al., 2014): MEGA exhibits sublinear regret and a vanishing collision rate, adaptively handling nonstationary user populations (users joining and leaving) with minimal overhead and no explicit inter-user coordination.
- Privacy-Preserving Best Arm Identification (Féraud, 2016): DME/EDME protocols yield near-optimal wall-clock speed-ups (approaching a factor of $n$ for $n$ devices) with negligible communication cost, outperforming per-user bandit baselines in regret. EDME dynamically adapts the number of parallel instances to varied user activity profiles.
- Contextual Auto-Tuning Frameworks (Ding et al., 2021): the Syndicated Bandits framework is compatible with LinUCB, LinTS, UCB-GLM, and other contextual bandit algorithms. Empirical benchmarks on synthetic and MovieLens-100K data show consistent advantages over generic online parameter-tuning baselines (OP, Corral) and single-layer approaches, especially with multiple hyper-parameters.
5. Connections to Related Methods and Limitations
- Relation to Networked and Social Bandits: Approaches such as "A Gang of Bandits" leverage social network structures for signal sharing and collective contextual learning, outperforming network-agnostic baselines by exploiting relational information among users (Cesa-Bianchi et al., 2013).
- Collision Feedback and Orthogonalization: Explicit collision signals (binary indicators) are essential for decentralized protocols like MEGA; without them, practical regret and collision rates degrade sharply.
- Regret Rate Limitations: decentralized schemes ($O(T^{2/3})$ regret) are slower than centralized or known-$N$ baselines (logarithmic regret) (Avner et al., 2014). Frequent changes in the user population can deteriorate guarantees toward linear regret.
- Communication-Accuracy Trade-offs: Privacy constraints restrict raw data exchange. DME/EDME achieve optimal communication efficiency but require aggregation servers and may need adaptation for highly dynamic networks (Féraud, 2016).
6. Applications and Implications
Syndicated bandit methods are deployed or directly applicable in:
- Cognitive Radio and Wireless Resource Allocation: Decentralized approaches efficiently assign spectral channels to users, mitigating collisions and adapting to adversarial or nonstationary environments (Alatur et al., 2019, Avner et al., 2014).
- Privacy-Sensitive Edge Networks: DME/EDME enable best-arm identification near devices, minimizing core network exposure and complying with privacy-by-design principles (Féraud, 2016).
- Online Auto-tuning for Bandit-Based Recommendation Systems: Syndicated Bandits automate hyper-parameter selection, integrating seamlessly with online contextual learning systems without manual tuning or offline data requirements (Ding et al., 2021).
- Networked Content Serving: Socially informed and clustered contextual bandit algorithms, as in "A Gang of Bandits," exploit network structure for improved prediction and allocation (Cesa-Bianchi et al., 2013).
A plausible implication is that syndicated bandit frameworks provide a principled solution to scalable, robust, and privacy-preserving online decision-making in multiparty stochastic and adversarial domains, with broad applicability from communications networks to machine learning model selection.
7. Summary Table: Algorithmic Properties
| Algorithm | Communication | Regret / PAC Guarantee | Adaptivity & Privacy |
|---|---|---|---|
| MEGA (Avner et al., 2014) | None | $O(T^{2/3})$ regret | Adapts to unknown $N$; privacy-neutral |
| C-Play (Alatur et al., 2019) | None | $\tilde{O}(T^{2/3})$ regret under adversarial losses | Robust to adversarial losses; no communication |
| DME (Féraud, 2016) | Aggregated votes only | $\epsilon$-optimal identification (PAC) | User-level privacy via aggregation |
| EDME (Féraud, 2016) | Aggregated votes only | $\epsilon$-optimal identification (PAC); near-optimal speed-up | Adapts to unknown number of active players |
| Syndicated Bandits (Ding et al., 2021) | None (single-agent) | Sublinear regret, additive in candidate-set sizes | Hyper-parameter auto-tuning |
Further algorithm selection should account for feedback modalities (availability of collision signals), communication constraints, and the degree of adversarial nonstationarity in the loss or reward structure.