On Frequentist Regret of Linear Thompson Sampling (2006.06790v3)

Published 11 Jun 2020 in cs.LG and stat.ML

Abstract: This paper studies the stochastic linear bandit problem, where a decision-maker chooses actions from possibly time-dependent sets of vectors in $\mathbb{R}^d$ and receives noisy rewards. The objective is to minimize regret, the difference between the cumulative expected reward of the decision-maker and that of an oracle with access to the expected reward of each action, over a sequence of $T$ decisions. Linear Thompson Sampling (LinTS) is a popular Bayesian heuristic, supported by theoretical analysis that shows its Bayesian regret is bounded by $\widetilde{\mathcal{O}}(d\sqrt{T})$, matching minimax lower bounds. However, previous studies demonstrate that the frequentist regret bound for LinTS is $\widetilde{\mathcal{O}}(d\sqrt{dT})$, which requires posterior variance inflation and is by a factor of $\sqrt{d}$ worse than the best optimism-based algorithms. We prove that this inflation is fundamental and that the frequentist bound of $\widetilde{\mathcal{O}}(d\sqrt{dT})$ is the best possible, by demonstrating a randomization bias phenomenon in LinTS that can cause linear regret without inflation.We propose a data-driven version of LinTS that adjusts posterior inflation using observed data, which can achieve minimax optimal frequentist regret, under additional conditions. Our analysis provides new insights into LinTS and settles an open problem in the field.

Authors (2)

Nima Hamidi (6 papers)
Mohsen Bayati (31 papers)

Citations (6)

View on Semantic Scholar

Summary

We haven't generated a summary for this paper yet.

Summarize Now

Tweets

https://twitter.com/kwangsungjun/status/1918462623160598966

On Frequentist Regret of Linear Thompson Sampling (2006.06790v3)

Summary

Related Papers

Tweets