Optimal Learning for Structured Bandits (2007.07302v3)
Abstract: We study structured multi-armed bandits, which is the problem of online decision-making under uncertainty in the presence of structural information. In this problem, the decision-maker needs to discover the best course of action despite observing only uncertain rewards over time. The decision-maker is aware of certain convex structural information regarding the reward distributions; that is, the decision-maker knows that the reward distributions of the arms belong to a convex compact set. In the presence of such structural information, the decision-maker would like to minimize regret by exploiting this information, where regret is the performance difference against a benchmark policy that knows the best action ahead of time. In the absence of structural information, the classical upper confidence bound (UCB) and Thompson sampling algorithms are well known to achieve minimal regret. As recently pointed out, however, neither algorithm is capable of exploiting structural information that is commonly available in practice. We propose a novel learning algorithm that we call "DUSA" whose regret matches the information-theoretic regret lower bound up to a constant factor and can handle a wide range of structural information. Our algorithm DUSA solves a dual counterpart of the regret lower bound at the empirical reward distribution and follows its suggested play. We show that this idea leads to the first computationally viable learning policy with asymptotically minimal regret for various structural information, including well-known structured bandits such as linear, Lipschitz, and convex bandits, and novel structured bandits that have not been studied in the literature due to the lack of a unified and flexible framework.
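The dual-based play described in the abstract can be made concrete in the unstructured special case, where the regret lower bound of Lai and Robbins (1985) prescribes pulling each suboptimal arm $a$ at rate $1/\mathrm{KL}(\mu_a, \mu^\star)$. Below is a minimal Python sketch in that spirit: it solves the lower-bound allocation at the empirical reward distribution and alternates between meeting those exploration rates and exploiting the empirically best arm. The oracle `lower_bound_allocation`, the forced-exploration rule, and the Bernoulli reward setting are illustrative assumptions, not the paper's actual DUSA implementation; with convex structural information, the oracle would instead solve the dual convex program over the structure set.

```python
import numpy as np

def kl_bernoulli(p, q, eps=1e-9):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    p, q = np.clip(p, eps, 1 - eps), np.clip(q, eps, 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def lower_bound_allocation(emp_means):
    """Hypothetical stand-in for the dual lower-bound oracle.

    For unstructured Bernoulli bandits, the Lai-Robbins bound says each
    suboptimal arm must be pulled ~ log(t) / KL(mu_a, mu*) times; a
    structured version would solve a convex program over the structure set.
    """
    best = emp_means.max()
    rates = np.zeros_like(emp_means)
    for a, mu in enumerate(emp_means):
        if mu < best:
            rates[a] = 1.0 / kl_bernoulli(mu, best)
    return rates

def run(means, horizon, rng=np.random.default_rng(0)):
    k = len(means)
    counts = np.ones(k)                      # one forced pull per arm
    sums = rng.binomial(1, means).astype(float)
    for t in range(k, horizon):
        emp = sums / counts
        target = lower_bound_allocation(emp) * np.log(t + 1)
        if (counts < target).any():
            arm = int(np.argmax(target - counts))   # explore the most under-sampled arm
        else:
            arm = int(np.argmax(emp))               # exploit the empirical best arm
        sums[arm] += rng.binomial(1, means[arm])
        counts[arm] += 1
    return counts

# Pull counts concentrate on the best arm (mean 0.7) as t grows.
print(run(np.array([0.3, 0.5, 0.7]), horizon=5000))
```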
- Aubin JP, Frankowska H (2009) Set-Valued Analysis (Springer Science & Business Media).
- Barvinok A (2002) A Course in Convexity, volume 54 (American Mathematical Society).
- Bertsekas D (2009) Convex Optimization Theory (Athena Scientific).
- Besbes O, Zeevi A (2009) Dynamic pricing without knowing the demand function: Risk bounds and near-optimal algorithms. Operations Research 57(6):1407–1420.
- Boyd S, Vandenberghe L (2004) Convex Optimization (Cambridge University Press).
- Bubeck S, Cesa-Bianchi N (2012) Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning 5(1):1–122.
- Chandrasekaran V, Shah P (2017) Relative entropy optimization and its applications. Mathematical Programming 161(1-2):1–32.
- Cover T, Thomas J (2012) Elements of Information Theory (John Wiley & Sons).
- Graves T, Lai T (1997) Asymptotically efficient adaptive choice of control laws in controlled Markov chains. SIAM Journal on Control and Optimization 35(3):715–743.
- Hoeffding W (1994) Probability inequalities for sums of bounded random variables. The Collected Works of Wassily Hoeffding, 409–426 (Springer).
- Jun KS, Zhang C (2020) Crush optimism with pessimism: Structured bandits beyond asymptotic optimality. Advances in Neural Information Processing Systems 33.
- Keskin N, Zeevi A (2014) Dynamic pricing with an unknown demand model: Asymptotically optimal semi-myopic policies. Operations Research 62(5):1142–1167.
- Lai T, Robbins H (1985) Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics 6(1):4–22.
- Lattimore T, Munos R (2014) Bounded regret for finite-armed structured bandits. Advances in Neural Information Processing Systems 27:550–558.
- Nesterov Y (2004) Introductory Lectures on Convex Optimization: A Basic Course (Kluwer Academic Publishers).
- Rusmevichientong P, Tsitsiklis JN (2010) Linearly parameterized bandits. Mathematics of Operations Research 35(2):395–411.
- Russo D, Van Roy B (2018) Learning to optimize via information-directed sampling. Operations Research 66(1):230–252.
- Slivkins A (2011) Contextual bandits with similarity information. Proceedings of the 24th Annual Conference on Learning Theory, 679–702.
- Thompson W (1933) On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25(3/4):285–294.