A Time and Space Efficient Algorithm for Contextual Linear Bandits (1207.3024v4)
Abstract: We consider a multi-armed bandit problem where payoffs are a linear function of an observed stochastic contextual variable. In the scenario where there exists a gap between optimal and suboptimal rewards, several algorithms have been proposed that achieve $O(\log T)$ regret after $T$ time steps. However, the proposed methods either have a computational complexity per iteration that scales linearly with $T$ or achieve regret that grows linearly with the number of contexts $|\mathcal{X}|$. We propose an $\epsilon$-greedy type of algorithm that overcomes both limitations. In particular, when contexts are variables in $\mathbb{R}^d$, we prove that our algorithm has a constant computational complexity per iteration of $O(\mathrm{poly}(d))$ and can achieve a regret of $O(\mathrm{poly}(d) \log T)$ even when $|\mathcal{X}| = \Omega(2^d)$. In addition, unlike previous algorithms, its space complexity scales like $O(Kd^2)$ and does not grow with $T$.
- José Bento
- Stratis Ioannidis
- S. Muthukrishnan
- Jinyun Yan
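
To make the complexity claims in the abstract concrete, here is a minimal sketch of a generic $\epsilon$-greedy linear bandit. It is not the authors' algorithm: the class name `EpsilonGreedyLinearBandit`, the exploration schedule $\epsilon_t = \min(1, c/t)$, the ridge term, and the reading of $K$ as the number of arms are all assumptions made for illustration. The point it shows is that keeping one $d \times d$ matrix and one $d$-vector per arm gives $O(Kd^2)$ memory and a per-step cost polynomial in $d$, neither of which depends on $T$.

```python
import numpy as np


class EpsilonGreedyLinearBandit:
    """Illustrative epsilon-greedy linear bandit (hypothetical, not the paper's method)."""

    def __init__(self, n_arms, dim, c=10.0, ridge=1e-6, seed=0):
        self.K, self.d, self.c = n_arms, dim, c
        # Per-arm sufficient statistics for least squares: O(K d^2) memory in total.
        # The small ridge term (an assumption) keeps each matrix invertible.
        self.A = np.stack([ridge * np.eye(dim) for _ in range(n_arms)])  # shape (K, d, d)
        self.b = np.zeros((n_arms, dim))                                 # shape (K, d)
        self.t = 0
        self.rng = np.random.default_rng(seed)

    def estimates(self):
        # Solve A_k theta_k = b_k for every arm: O(K d^3) work, independent of t.
        return np.linalg.solve(self.A, self.b[..., None])[..., 0]

    def select(self, x):
        self.t += 1
        eps = min(1.0, self.c / self.t)  # assumed decaying exploration schedule
        if self.rng.random() < eps:
            return int(self.rng.integers(self.K))        # explore: uniform random arm
        return int(np.argmax(self.estimates() @ x))      # exploit: best estimated payoff

    def update(self, arm, x, reward):
        # Rank-one update of the chosen arm's statistics; no history is stored.
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x
```

A caller would loop over rounds, calling `select(x)` on each observed context and then `update(arm, x, reward)` with the observed payoff. The state stays the same size no matter how many rounds are played.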