Restless Linear Bandits (2405.10817v1)
Abstract: A more general formulation of the linear bandit problem is considered, allowing for dependencies over time. Specifically, it is assumed that there exists an unknown $\mathbb{R}^d$-valued stationary $\varphi$-mixing sequence of parameters $(\theta_t,~t \in \mathbb{N})$ which gives rise to pay-offs. This instance of the problem can be viewed as a generalization of both the classical linear bandit with i.i.d. noise and the finite-armed restless bandit. In light of the well-known computational hardness of optimal policies for restless bandits, an approximation is proposed whose error is shown to be controlled by the $\varphi$-dependence between consecutive $\theta_t$. An optimistic algorithm, called LinMix-UCB, is proposed for the case where $\theta_t$ has an exponential mixing rate. The proposed algorithm is shown to incur a sub-linear regret of $\mathcal{O}\left(\sqrt{d n\,\mathrm{polylog}(n)}\right)$ with respect to an oracle that always plays a multiple of $\mathbb{E}\theta_t$. The main challenge in this setting is to ensure that the exploration-exploitation strategy is robust against long-range dependencies. The proposed method relies on Berbee's coupling lemma to carefully select near-independent samples and to construct confidence ellipsoids around empirical estimates of $\mathbb{E}\theta_t$.
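To make the optimistic strategy concrete, here is a minimal LinUCB-style sketch of the two ingredients the abstract mentions: a regularized least-squares estimate of $\mathbb{E}\theta_t$ from past action/pay-off pairs, and an action choice that adds an ellipsoid-width bonus $\beta\,\|a\|_{V^{-1}}$ to the estimated pay-off. This is a generic illustration under standard linear-bandit assumptions, not the paper's LinMix-UCB: the function names are made up, and the Berbee-coupling step that thins the data to near-independent samples (and the mixing-aware choice of $\beta$) is deliberately omitted.

```python
import numpy as np

def ridge_estimate(X, y, lam=1.0):
    """Regularized least-squares estimate of E[theta] from past plays.

    X: (n, d) matrix of played actions; y: (n,) observed pay-offs.
    Returns the estimate theta_hat and the design matrix V used for
    the confidence ellipsoid {theta : ||theta - theta_hat||_V <= beta}.
    """
    d = X.shape[1]
    V = lam * np.eye(d) + X.T @ X          # regularized Gram matrix
    theta_hat = np.linalg.solve(V, X.T @ y)
    return theta_hat, V

def ucb_action(actions, theta_hat, V, beta):
    """Optimistic choice: argmax over a of <a, theta_hat> + beta * ||a||_{V^{-1}}."""
    V_inv = np.linalg.inv(V)
    scores = [a @ theta_hat + beta * np.sqrt(a @ V_inv @ a) for a in actions]
    return int(np.argmax(scores))
```

In LinMix-UCB the estimate would be built only from near-independent samples selected via Berbee's coupling lemma, and the radius $\beta$ would absorb both the usual self-normalized deviation term and the residual $\varphi$-dependence between consecutive $\theta_t$.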
- P. Auer, “Using confidence bounds for exploitation-exploration trade-offs,” Journal of Machine Learning Research, vol. 3, no. Nov, pp. 397–422, 2002.
- Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári, “Improved algorithms for linear stochastic bandits,” in Advances in Neural Information Processing Systems, vol. 24, 2011.
- S. Bubeck and N. Cesa-Bianchi, “Regret analysis of stochastic and nonstochastic multi-armed bandit problems,” Foundations and Trends in Machine Learning, vol. 5, no. 1, pp. 1–122, 2012.
- R. Ortner, D. Ryabko, P. Auer, and R. Munos, “Regret bounds for restless Markov bandits,” Theoretical Computer Science, vol. 558, pp. 62–76, 2014.
- S. Grünewälder and A. Khaleghi, “Approximations of the restless bandit problem,” The Journal of Machine Learning Research, vol. 20, no. 1, pp. 514–550, 2019.
- C. H. Papadimitriou and J. N. Tsitsiklis, “The complexity of optimal queuing network control,” Mathematics of Operations Research, vol. 24, no. 2, pp. 293–305, 1999.
- Q. Chen, N. Golrezaei, and D. Bouneffouf, “Non-stationary bandits with auto-regressive temporal dependency,” Advances in Neural Information Processing Systems, vol. 36, pp. 7895–7929, 2023.
- H. C. Berbee, “Random walks with stationary increments and renewal theory,” Mathematisch Centrum, 1979.
- S. Grünewälder and A. Khaleghi, “Estimating the mixing coefficients of geometrically ergodic Markov processes,” arXiv preprint arXiv:2402.07296, 2024.
- A. Khaleghi and G. Lugosi, “Inferring the mixing properties of a stationary ergodic process from a single sample-path,” IEEE Transactions on Information Theory, 2023.