
Top $K$ Ranking for Multi-Armed Bandit with Noisy Evaluations (2112.06517v4)

Published 13 Dec 2021 in cs.LG and stat.ML

Abstract: We consider a multi-armed bandit setting where, at the beginning of each round, the learner receives noisy, independent, and possibly biased \emph{evaluations} of the true reward of each arm and selects $K$ arms with the objective of accumulating as much reward as possible over $T$ rounds. Under the assumption that at each round the true reward of each arm is drawn from a fixed distribution, we derive different algorithmic approaches and theoretical guarantees depending on how the evaluations are generated. First, we show a $\widetilde{O}(T^{2/3})$ regret in the general case when the observation functions are generalized linear functions of the true rewards. On the other hand, we show that an improved $\widetilde{O}(\sqrt{T})$ regret can be derived when the observation functions are noisy linear functions of the true rewards. Finally, we report an empirical validation that confirms our theoretical findings, provides a thorough comparison to alternative approaches, and further supports the interest of this setting in practice.
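The setting in the abstract can be made concrete with a short simulation. The sketch below is a minimal illustration, not the paper's algorithm: it uses a naive greedy baseline that picks the $K$ arms with the highest noisy evaluations each round, and all parameter values (number of arms, noise level, bias) are hypothetical.

```python
# Minimal simulation of the noisy-evaluation top-K bandit setting (illustrative only).
# Each round: true rewards are drawn from fixed distributions, the learner observes
# noisy, possibly biased evaluations, selects K arms greedily by evaluation, and
# regret is measured against the best K arms by true reward in that round.
import numpy as np

rng = np.random.default_rng(0)
n_arms, K, T = 10, 3, 1000                         # hypothetical problem sizes
true_means = rng.uniform(0.0, 1.0, size=n_arms)    # fixed reward distributions
bias = rng.normal(0.0, 0.1, size=n_arms)           # per-arm evaluation bias
noise_std = 0.2                                     # evaluation noise level

regret = 0.0
for t in range(T):
    true_rewards = true_means + rng.normal(0.0, 0.05, size=n_arms)
    # Noisy, possibly biased evaluations observed before selecting arms.
    evaluations = true_rewards + bias + rng.normal(0.0, noise_std, size=n_arms)
    chosen = np.argsort(evaluations)[-K:]           # greedy top-K by evaluation
    best = np.argsort(true_rewards)[-K:]            # oracle top-K by true reward
    regret += true_rewards[best].sum() - true_rewards[chosen].sum()

print(f"cumulative regret of the greedy baseline over {T} rounds: {regret:.2f}")
```

Because the evaluations are biased, this greedy baseline accumulates linear regret, which is the gap the paper's algorithms are designed to close under the generalized linear and noisy linear observation models.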

Authors (4)
  1. Evrard Garcelon (13 papers)
  2. Vashist Avadhanula (11 papers)
  3. Alessandro Lazaric (78 papers)
  4. Matteo Pirotta (45 papers)
Citations (5)
