Finite-Time Frequentist Regret Bounds of Multi-Agent Thompson Sampling on Sparse Hypergraphs (2312.15549v1)
Abstract: We study the multi-agent multi-armed bandit (MAMAB) problem, where $m$ agents are factored into $\rho$ overlapping groups. Each group represents a hyperedge, forming a hypergraph over the agents. At each round of interaction, the learner pulls a joint arm (composed of individual arms for each agent) and receives a reward according to the hypergraph structure. Specifically, we assume there is a local reward for each hyperedge, and the reward of the joint arm is the sum of these local rewards. Previous work introduced the multi-agent Thompson sampling (MATS) algorithm \citep{verstraeten2020multiagent} and derived a Bayesian regret bound; however, deriving a frequentist regret bound for Thompson sampling in this multi-agent setting has remained an open problem. To address it, we propose an efficient variant of MATS, the $\epsilon$-exploring Multi-Agent Thompson Sampling ($\epsilon$-MATS) algorithm, which performs MATS exploration with probability $\epsilon$ and adopts a greedy policy otherwise. We prove that $\epsilon$-MATS achieves a worst-case frequentist regret bound that is sublinear in both the time horizon and the local arm size. We also derive a lower bound for this setting, which implies that our frequentist regret upper bound is optimal up to constant and logarithmic factors when the hypergraph is sufficiently sparse. Thorough experiments on standard MAMAB problems demonstrate the superior performance and improved computational efficiency of $\epsilon$-MATS compared with existing algorithms in the same setting.
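To make the $\epsilon$-MATS loop concrete, here is a minimal Python sketch on a toy Gaussian MAMAB instance. The hyperedge layout, noise scale, and unit-variance Gaussian posterior updates are illustrative assumptions, and the joint-arm argmax is brute-forced for the toy size (the factored setting would use variable elimination over the coordination graph); this is a sketch of the idea, not the paper's exact specification.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# --- Toy MAMAB instance (hypothetical sizes): m agents, k local arms each ---
m, k = 3, 2                          # 3 agents, 2 arms per agent
hyperedges = [(0, 1), (1, 2)]        # rho = 2 overlapping groups
# True mean local reward per hyperedge, indexed by that group's local arm tuple
true_means = [rng.uniform(0, 1, size=(k,) * len(e)) for e in hyperedges]

# Gaussian posterior statistics per hyperedge and local arm tuple
counts = [np.zeros((k,) * len(e)) for e in hyperedges]
sums = [np.zeros((k,) * len(e)) for e in hyperedges]

def local_estimates(explore):
    """Posterior samples (MATS exploration) or posterior means (greedy)."""
    ests = []
    for c, s in zip(counts, sums):
        mean = np.where(c > 0, s / np.maximum(c, 1), 0.0)
        if explore:
            std = 1.0 / np.sqrt(np.maximum(c, 1))   # assumes unit-variance noise
            ests.append(rng.normal(mean, std))
        else:
            ests.append(mean)
    return ests

epsilon, T = 0.1, 2000
for t in range(T):
    # The epsilon coin flip is the only change relative to plain MATS
    theta = local_estimates(explore=rng.random() < epsilon)
    # Joint arm maximizing the sum of local estimates (brute force for this toy size)
    joint = max(itertools.product(range(k), repeat=m),
                key=lambda a: sum(th[tuple(a[i] for i in e)]
                                  for th, e in zip(theta, hyperedges)))
    # Observe one noisy local reward per hyperedge and update its posterior
    for j, e in enumerate(hyperedges):
        idx = tuple(joint[i] for i in e)
        counts[j][idx] += 1
        sums[j][idx] += true_means[j][idx] + rng.normal(0, 0.1)
```

The only difference from plain MATS is the coin flip: with probability $1-\epsilon$ the posterior means replace the posterior samples, so most rounds skip sampling entirely, which is where the claimed computational savings come from.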
- Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables, volume 55. US Government Printing Office.
- Multi-Agent Multi-Armed Bandits with Limited Communication. Journal of Machine Learning Research, 23(212): 1–24.
- Analysis of Thompson Sampling for the Multi-Armed Bandit Problem. In Conference on Learning Theory, 39–1. JMLR Workshop and Conference Proceedings.
- Near-Optimal Regret Bounds for Thompson Sampling. Journal of the ACM (JACM), 64(5): 1–24.
- The Nonstochastic Multiarmed Bandit Problem. SIAM Journal on Computing, 32(1): 48–77.
- Learning to Coordinate with Coordination Graphs in Repeated Single-Stage Multi-Agent Decision Problems. In International Conference on Machine Learning, 482–490. PMLR.
- Multi-Player Bandits Revisited. In Algorithmic Learning Theory, 56–92. PMLR.
- Selfish Robustness and Equilibria in Multi-Player Bandits. In Conference on Learning Theory, 530–581. PMLR.
- Online Learning for Cooperative Multi-Player Multi-Armed Bandits. In IEEE Conference on Decision and Control (CDC), 7248–7253. IEEE.
- An Empirical Evaluation of Thompson Sampling. In Advances in Neural Information Processing Systems, 2249–2257.
- Decentralised Online Planning for Multi-Robot Warehouse Commissioning. In Conference on Autonomous Agents and MultiAgent Systems, 492–500.
- Learning Multi-Agent State Space Representations. In International Conference on Autonomous Agents and Multiagent Systems, Volume 1, 715–722.
- PhyGCN: Pre-trained Hypergraph Convolutional Neural Networks with Self-supervised Learning. bioRxiv preprint, 2023.
- Contextual Combinatorial Volatile Bandits with Satisficing via Gaussian Processes. arXiv preprint arXiv:2111.14778.
- Maximum Power-Point Tracking Control for Wind Farms. Wind Energy, 18(3): 429–447.
- Online Clustering of Bandits. In International Conference on Machine Learning, 757–765. PMLR.
- Multiagent Planning with Factored MDPs. Advances in Neural Information Processing Systems, 14.
- Multi-Armed Bandits with Correlated Arms. IEEE Transactions on Information Theory, 67(10): 6711–6732.
- Budget Allocation as a Multi-Agent System of Contextual & Continuous Bandits. In ACM SIGKDD Conference on Knowledge Discovery & Data Mining, 2937–2945.
- Multi-Objective Coordination Graphs for the Expected Scalarised Returns with Generative Flow Models. arXiv preprint arXiv:2207.00368.
- Towards Optimal Algorithms for Multi-Player Bandits without Collision Sensing Information. In Conference on Learning Theory, 1990–2012. PMLR.
- Provable and Practical: Efficient Exploration in Reinforcement Learning via Langevin Monte Carlo. arXiv preprint arXiv:2305.18246.
- MOTS: Minimax Optimal Thompson Sampling. In International Conference on Machine Learning, 5074–5083. PMLR.
- Finite-Time Regret of Thompson Sampling Algorithms for Exponential Family Multi-Armed Bandits. Advances in Neural Information Processing Systems, 35: 38475–38487.
- Double Explore-Then-Commit: Asymptotic Optimality and Beyond. In Conference on Learning Theory, 2584–2633. PMLR.
- Thompson Sampling with Less Exploration is Fast and Optimal. In International Conference on Machine Learning, 15239–15261. PMLR.
- Thompson Sampling: An Asymptotically Optimal Finite-Time Analysis. In Algorithmic Learning Theory, 199–213. Springer.
- Best Arm Identification in Spectral Bandits. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, 2220–2226.
- Collaborative Multiagent Reinforcement Learning by Payoff Propagation. Journal of Machine Learning Research, 7: 1789–1828.
- Thompson Sampling for 1-Dimensional Exponential Family Bandits. Advances in Neural Information Processing Systems, 26.
- Distributed Cooperative Decision-Making in Multiarmed Bandits: Frequentist and Bayesian Algorithms. In IEEE Conference on Decision and Control (CDC), 167–172. IEEE.
- Collaborative Filtering Bandits. In ACM SIGIR Conference on Research and Development in Information Retrieval, 539–548.
- Multiplayer Bandits without Observing Collision Information. Mathematics of Operations Research, 47(2): 1247–1265.
- Active Search and Bandits on Graphs Using Sigma-Optimality. In Uncertainty in Artificial Intelligence, 542–551.
- Multi-User MABs with User Dependent Rewards for Uncoordinated Spectrum Access. In Asilomar Conference on Signals, Systems, and Computers, 969–972. IEEE.
- A Practical Algorithm for Multiplayer Bandits When Arm Means Vary among Players. In International Conference on Artificial Intelligence and Statistics, 1211–1221. PMLR.
- On Regret-Optimal Learning in Decentralized Multiplayer Multiarmed Bandits. IEEE Transactions on Control of Network Systems, 5(1): 597–606.
- Optimal Algorithms for Latent Bandits with Cluster Structure. arXiv preprint arXiv:2301.07040.
- Computing Convex Coverage Sets for Faster Multi-Objective Coordination. Journal of Artificial Intelligence Research, 52: 399–443.
- Social Learning in Multi Agent Multi Armed Bandits. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 3(3): 1–35.
- Solving Transition-Independent Multi-Agent MDPs with Sparse Interactions. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30.
- Decentralized Multi-Player Multi-Armed Bandits with No Collision Information. In International Conference on Artificial Intelligence and Statistics, 1519–1528. PMLR.
- Heterogeneous Multi-Player Multi-Armed Bandits: Closing the Gap and Generalization. Advances in Neural Information Processing Systems, 34: 22392–22404.
- DCOPs and Bandits: Exploration and Exploitation in Decentralised Coordination. In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems, Volume 1, 289–296.
- Gossip-Based Distributed Stochastic Bandit Algorithms. In International Conference on Machine Learning, 19–27. PMLR.
- Maximizing and Satisficing in Multi-Armed Bandits with Graph Information. Advances in Neural Information Processing Systems, 35: 2019–2032.
- Thompson, W. R. 1933. On the Likelihood That One Unknown Probability Exceeds Another in View of the Evidence of Two Samples. Biometrika, 25(3-4): 285–294.
- Spectral Bandits for Smooth Graph Functions. In International Conference on Machine Learning, 46–54. PMLR.
- Multi-Agent Thompson Sampling for Bandit Applications with Sparse Neighbourhood Structures. Scientific Reports, 10(1): 1–13.
- Scalable Optimization for Wind Farm Control Using Coordination Graphs. In International Conference on Autonomous Agents and MultiAgent Systems, 1362–1370.
- Fleetwide Data-Enabled Reliability Improvement of Wind Turbines. Renewable and Sustainable Energy Reviews, 109: 428–437.
- Optimal Algorithms for Multiplayer Multi-Armed Bandits. In International Conference on Artificial Intelligence and Statistics, 4120–4129. PMLR.
- Distributed Bandit Learning: Near-optimal Regret with Efficient Communication. In International Conference on Learning Representations.
- On Distributed Multi-Player Multiarmed Bandit Problems in Abruptly Changing Environment. In IEEE Conference on Decision and Control (CDC), 5783–5788. IEEE.
- Wiering, M. A.; et al. 2000. Multi-Agent Reinforcement Learning for Traffic Light Control. In International Conference on Machine Learning, 1151–1158.
- Neural Contextual Bandits with Deep Representation and Shallow Exploration. In International Conference on Learning Representations.
- Langevin Monte Carlo for Contextual Bandits. In International Conference on Machine Learning, 24830–24850. PMLR.
- Laplacian-Regularized Graph Bandits: Algorithms and Theoretical Analysis. In International Conference on Artificial Intelligence and Statistics, 3133–3143. PMLR.
- Neural Thompson Sampling. In International Conference on Learning Representations.
- Global Convergence of Localized Policy Iteration in Networked Multi-Agent Reinforcement Learning. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 7(1): 1–51.
- Federated Bandit: A Gossiping Approach. In Abstract Proceedings of the 2021 ACM SIGMETRICS/International Conference on Measurement and Modeling of Computer Systems, 3–4.
Authors: Tianyuan Jin, Hao-Lun Hsu, William Chang, Pan Xu