Bridging Adversarial and Nonstationary Multi-armed Bandit (2201.01628v3)
Published 5 Jan 2022 in cs.LG and stat.ML
Abstract: In the multi-armed bandit framework, two formulations are commonly employed to handle time-varying reward distributions: the adversarial bandit and the nonstationary bandit. Although their oracles, algorithms, and regret analyses differ significantly, we provide a unified formulation in this paper that smoothly bridges the two as special cases. The formulation uses an oracle that plays the best fixed arm within each time window. Depending on the window size, this benchmark recovers the best-fixed-arm oracle in hindsight of the adversarial bandit or the dynamic oracle of the nonstationary bandit. We provide algorithms that attain the optimal regret, together with a matching lower bound.
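To make the windowed benchmark concrete, here is a minimal sketch of how such a regret notion can be written, assuming $K$ arms, horizon $T$, per-round mean rewards $\mu_t(k)$, a window length $W$, and played arms $\pi_t$; this notation is introduced here for illustration and the paper's exact definitions may differ.

```latex
% Sketch (illustrative notation): windowed-oracle regret with window length W.
% Partition {1,...,T} into consecutive windows \mathcal{T}_1,...,\mathcal{T}_{\lceil T/W \rceil}.
R_W(T)
  \;=\;
  \sum_{j=1}^{\lceil T/W \rceil} \max_{k \in [K]} \sum_{t \in \mathcal{T}_j} \mu_t(k)
  \;-\;
  \mathbb{E}\!\left[ \sum_{t=1}^{T} \mu_t(\pi_t) \right]
% W = T : a single window, so the benchmark is the best fixed arm in hindsight (adversarial bandit).
% W = 1 : per-round best arms, so the benchmark is the dynamic oracle (nonstationary bandit).
```

Under this reading, one window of length $T$ recovers the adversarial benchmark, unit-length windows recover the dynamic oracle, and intermediate window sizes interpolate between the two.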