Context-lumpable stochastic bandits (2306.13053v2)
Abstract: We consider a contextual bandit problem with $S$ contexts and $K$ actions. In each round $t=1,2,\dots$, the learner observes a random context and chooses an action based on its past experience. The learner then observes a random reward whose mean is a function of the context and the action for the round. Under the assumption that the contexts can be lumped into $r\le \min\{S,K\}$ groups such that the mean reward for the various actions is the same for any two contexts that are in the same group, we give an algorithm that outputs an $\epsilon$-optimal policy after using at most $\widetilde O(r(S+K)/\epsilon^2)$ samples with high probability, and we provide a matching $\Omega(r(S+K)/\epsilon^2)$ lower bound. In the regret minimization setting, we give an algorithm whose cumulative regret up to time $T$ is bounded by $\widetilde O(\sqrt{r^3(S+K)T})$. To the best of our knowledge, we are the first to show near-optimal sample complexity in the PAC setting and $\widetilde O(\sqrt{\mathrm{poly}(r)(S+K)T})$ minimax regret in the online setting for this problem. We also show that our algorithms can be applied to more general low-rank bandits and obtain improved regret bounds in some scenarios.
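To make the setting concrete, here is a minimal Python sketch of the context-lumpable bandit model described in the abstract. It is not the paper's algorithm; the Bernoulli rewards, the uniform context distribution, and the random group assignment are illustrative assumptions, while the paper only requires that any two contexts in the same group share the same mean-reward vector over actions.

```python
# Minimal sketch of a context-lumpable contextual bandit environment.
# Assumptions for illustration: uniform context arrivals, Bernoulli rewards.
import numpy as np


class LumpableContextualBandit:
    def __init__(self, S, K, r, seed=0):
        assert r <= min(S, K), "number of groups must satisfy r <= min(S, K)"
        self.rng = np.random.default_rng(seed)
        self.S, self.K, self.r = S, K, r
        # Each context belongs to one of r latent groups (unknown to the learner).
        self.group_of = self.rng.integers(r, size=S)
        # Mean rewards depend only on the (group, action) pair, not on the context itself.
        self.group_means = self.rng.uniform(size=(r, K))

    def sample_context(self):
        # Contexts arrive i.i.d.; uniform is an illustrative choice.
        return int(self.rng.integers(self.S))

    def mean_reward(self, context, action):
        return self.group_means[self.group_of[context], action]

    def pull(self, context, action):
        # Bernoulli reward whose mean depends only on the context's group.
        return float(self.rng.random() < self.mean_reward(context, action))


# Usage: a naive baseline that ignores the group structure and estimates all
# S*K means directly needs on the order of SK/eps^2 samples; the paper's result
# shows that roughly r(S+K)/eps^2 samples suffice, matching the lower bound.
env = LumpableContextualBandit(S=20, K=10, r=3)
context = env.sample_context()
reward = env.pull(context, action=0)
```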
Authors: Chung-Wei Lee, Qinghua Liu, Yasin Abbasi-Yadkori, Chi Jin, Tor Lattimore, Csaba Szepesvári