Restless Bandit Problem with Rewards Generated by a Linear Gaussian Dynamical System (2405.09584v2)
Abstract: Decision-making under uncertainty is a frequently encountered, fundamental problem that can be formulated as a stochastic multi-armed bandit problem. In this problem, the learner interacts with an environment by choosing an action at each round, where a round is one instance of the interaction. In response, the environment reveals to the learner a reward sampled from a stochastic process. The learner's goal is to maximize the cumulative reward. In this work, we assume that each reward is the inner product of an action vector and a state vector generated by a linear Gaussian dynamical system. To predict each action's next reward, we propose a method that forms a linear combination of previously observed rewards. We show that, regardless of the sequence of previously chosen actions, the reward sampled for any previously chosen action can be used to predict another action's future reward, e.g., the reward sampled for action $1$ at round $t-1$ can be used to predict the reward for action $2$ at round $t$. This is accomplished by designing a modified Kalman filter whose matrix representation can be learned for reward prediction. Numerical evaluations are carried out on a set of linear Gaussian dynamical systems, and the proposed method is compared with two other well-known stochastic multi-armed bandit algorithms.
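The reward model and the role of the Kalman filter can be made concrete with a small simulation. The sketch below is not the paper's method: it assumes the system matrices are known and uses a standard Kalman filter, whereas the paper learns a matrix representation directly from observed rewards. The transition matrix `A`, noise covariances `Q` and `R`, the two action vectors, and the greedy action-selection rule are all illustrative assumptions; the point is only to show how a reward observed for one action updates the shared latent-state estimate and thereby sharpens the prediction for every other action.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy system (not the paper's learned representation):
#   latent state:  x_{t+1} = A x_t + w_t,   w_t ~ N(0, Q)
#   reward of action a at round t:  r_t = a^T x_t + v_t,   v_t ~ N(0, R)
n = 2                                      # state dimension (assumption)
A = np.array([[0.9, 0.1], [0.0, 0.95]])    # state transition (assumption)
Q = 0.01 * np.eye(n)                       # process noise covariance (assumption)
R = 0.05                                   # reward noise variance (assumption)
actions = [np.array([1.0, 0.0]),           # action vectors (assumption)
           np.array([0.0, 1.0])]

x = rng.normal(size=n)                     # true latent state
x_hat = np.zeros(n)                        # Kalman state estimate
P = np.eye(n)                              # estimate covariance

for t in range(50):
    # Predict the next reward of every action from the shared state estimate,
    # then greedily play the action with the largest predicted reward.
    x_pred = A @ x_hat
    P_pred = A @ P @ A.T + Q
    preds = [a @ x_pred for a in actions]
    k = int(np.argmax(preds))
    a = actions[k]

    # Environment: evolve the latent state and reveal the chosen action's reward.
    x = A @ x + rng.multivariate_normal(np.zeros(n), Q)
    r = a @ x + rng.normal(scale=np.sqrt(R))

    # Kalman update with the chosen action as the (time-varying) observation row.
    # Because all actions share the latent state, this one scalar observation
    # also improves the predictions for the actions that were not played.
    S = a @ P_pred @ a + R                 # innovation variance (scalar)
    K = P_pred @ a / S                     # Kalman gain
    x_hat = x_pred + K * (r - a @ x_pred)
    P = P_pred - np.outer(K, a) @ P_pred

    print(f"t={t:2d} chose action {k}, reward {r:+.3f}, predicted {preds[k]:+.3f}")
```

In this sketch the chosen action vector plays the role of a time-varying observation matrix, which is why a single scalar reward observation tightens the estimate of the latent state shared by all actions; the paper's contribution is to achieve the analogous effect when the system matrices are unknown and must be learned.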