- The paper shows that posterior sampling can achieve Bayesian regret bounds comparable to those of UCB algorithms in multi-armed bandit problems.
- It derives a Bayesian regret bound based on the eluder dimension, a new complexity measure, that matches the best known results for linear models and improves on existing UCB bounds for generalized linear models.
- The approach offers design simplicity and computational efficiency by eliminating the need for explicitly constructed confidence bounds, as supported by simulations.
Learning to Optimize Via Posterior Sampling
This paper by Daniel Russo and Benjamin Van Roy studies posterior sampling as a method for learning to optimize actions, particularly in multi-armed bandit (MAB) problems. The authors analyze its advantages over traditional upper confidence bound (UCB) algorithms and make two key theoretical contributions.
Theoretical Contributions
- Connection to UCB Algorithms: The paper establishes a theoretical link between posterior sampling and UCB algorithms. By translating regret bounds that hold for UCB algorithms into Bayesian regret bounds for posterior sampling, it shows that many desirable properties of UCB algorithms carry over to posterior sampling without any explicitly designed confidence bounds (see the decomposition sketched after this list).
- Bayesian Regret Bound: A new Bayesian regret bound for posterior sampling is derived that applies across a broad range of model classes. The bound depends on the "eluder dimension," a new measure of the degree of dependence among action rewards (defined in the sketch after this list). For linear models the resulting bounds match the best available results, while for generalized linear models they improve on existing UCB-based bounds.
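To make the UCB connection concrete: the central observation is that, conditioned on the history H_t, posterior sampling's action A_t and the optimal action A* are identically distributed, so E[U_t(A_t) | H_t] = E[U_t(A*) | H_t] for any upper-confidence sequence U_t constructed from the history. A sketch of the resulting decomposition, in notation close to the paper's (f_theta denotes the true mean-reward function):

```latex
\mathrm{BayesRegret}(T)
  = \mathbb{E}\sum_{t=1}^{T}\bigl[f_\theta(A^*) - f_\theta(A_t)\bigr]
  = \underbrace{\mathbb{E}\sum_{t=1}^{T}\bigl[U_t(A_t) - f_\theta(A_t)\bigr]}_{\text{confidence width at played actions}}
  + \underbrace{\mathbb{E}\sum_{t=1}^{T}\bigl[f_\theta(A^*) - U_t(A^*)\bigr]}_{\le\, 0 \text{ when } U_t \text{ upper-bounds } f_\theta}
```

Because the second sum is nonpositive whenever U_t is a valid upper confidence bound, any analysis that controls the width of a UCB algorithm's confidence bounds immediately yields a Bayesian regret bound for posterior sampling.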
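For reference, the eluder dimension can be stated as follows (a sketch of the paper's definition; minor notational details may differ). An action a is epsilon-dependent on previously chosen actions a_1, ..., a_n with respect to a function class F if any two functions in F that are close on the observed actions are also close at a:

```latex
\sqrt{\sum_{i=1}^{n}\bigl(f(a_i) - \tilde f(a_i)\bigr)^{2}} \le \epsilon
\quad\Longrightarrow\quad
f(a) - \tilde f(a) \le \epsilon
\qquad \forall\, f, \tilde f \in \mathcal{F}.
```

The epsilon-eluder dimension dim_E(F, epsilon) is the length of the longest sequence of actions in which each action is epsilon'-independent (i.e., not epsilon'-dependent) of its predecessors for some epsilon' >= epsilon. Intuitively, it measures how long the environment can keep producing actions whose rewards are not pinned down by the rewards already observed.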
Implications of Posterior Sampling
Posterior sampling, also referred to as Thompson Sampling, offers several potential advantages over UCB algorithms:
- Design Simplicity: Unlike UCB algorithms, which require carefully designed confidence bounds, posterior sampling can be implemented directly with standard Bayesian inference techniques (see the sketch after this list).
- Computational Efficiency: For complex problems with large or infinite action spaces, posterior sampling avoids the computational burden associated with optimizing over confidence bounds.
- Empirical Performance: In simulations, posterior sampling has outperformed UCB algorithms, particularly when the confidence bounds underlying UCB are loose.
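As an illustration of the design-simplicity point, here is a minimal sketch of posterior sampling for a Bernoulli bandit with independent Beta(1, 1) priors. This toy setting and all names are illustrative, not taken from the paper; note that no confidence bound is constructed anywhere.

```python
import numpy as np

def thompson_bernoulli(true_means, horizon, seed=0):
    """Posterior (Thompson) sampling for a Bernoulli bandit, Beta(1,1) priors."""
    rng = np.random.default_rng(seed)
    k = len(true_means)
    alpha = np.ones(k)  # posterior Beta alpha parameters (successes + 1)
    beta = np.ones(k)   # posterior Beta beta parameters (failures + 1)
    regret, best = 0.0, max(true_means)
    for _ in range(horizon):
        theta = rng.beta(alpha, beta)    # one draw per arm from its posterior
        arm = int(np.argmax(theta))      # act greedily w.r.t. the sample
        reward = rng.random() < true_means[arm]
        alpha[arm] += reward             # conjugate Beta-Bernoulli update
        beta[arm] += 1 - reward
        regret += best - true_means[arm]
    return regret

print(thompson_bernoulli([0.3, 0.5, 0.7], horizon=10_000))
```

Exploration here comes entirely from posterior randomness: arms whose posteriors are still wide occasionally produce the largest sample and get played.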
Simulation Study
The authors present simulation results to empirically validate the performance advantages of posterior sampling over UCB algorithms. In scenarios involving linear reward functions and Gaussian noise, posterior sampling exhibited notably lower regret than state-of-the-art UCB methods. Manually tuning UCB's confidence parameter narrowed the gap, but such tuning requires fixing the time horizon in advance, limiting its practical applicability.
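In the linear-Gaussian setting the posterior stays Gaussian, so posterior sampling amounts to a conjugate update and one multivariate-normal draw per round. The sketch below is our illustration of that setting; the prior scale, noise level, and action set are assumptions, not the paper's exact experimental configuration.

```python
import numpy as np

def linear_thompson(actions, theta_star, horizon, noise_sd=1.0,
                    prior_var=10.0, seed=0):
    """Posterior sampling for a linear bandit with Gaussian noise.

    actions:    (n, d) feasible action feature vectors.
    theta_star: (d,) true parameter; reward = x @ theta_star + N(0, noise_sd^2).
    """
    rng = np.random.default_rng(seed)
    d = actions.shape[1]
    precision = np.eye(d) / prior_var  # posterior precision, initialized at prior
    b = np.zeros(d)                    # precision-weighted reward accumulator
    best = (actions @ theta_star).max()
    regret = 0.0
    for _ in range(horizon):
        cov = np.linalg.inv(precision)
        theta_tilde = rng.multivariate_normal(cov @ b, cov)  # posterior draw
        x = actions[np.argmax(actions @ theta_tilde)]        # greedy w.r.t. draw
        r = x @ theta_star + noise_sd * rng.standard_normal()
        precision += np.outer(x, x) / noise_sd**2            # conjugate update
        b += r * x / noise_sd**2
        regret += best - x @ theta_star
    return regret

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 5))
print(linear_thompson(A, theta_star=rng.standard_normal(5), horizon=5_000))
```

A UCB counterpart would instead maximize x @ mean + width(x) over actions, where width requires an explicitly designed confidence ellipsoid; the posterior-sampling version sidesteps that construction entirely.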
Broader Impact and Future Implications
The introduction of the eluder dimension broadens the theoretical understanding of decision-making in MAB settings. Posterior sampling's ability to balance exploration and exploitation effectively, without bespoke statistical analysis, positions it as a versatile tool for practical applications ranging from adaptive sampling to dynamic decision-making.
In future developments, the principles and methodologies presented could be extended to even more complex AI applications, including reinforcement learning environments where structured priors and dependencies exist. The insights into the relationship between posterior sampling and UCB algorithms could catalyze further exploration into other combinations of Bayesian and frequentist approaches, fostering advancements in the theoretical and applied aspects of online learning and optimization.
In conclusion, posterior sampling combines Bayesian principles and computational simplicity with the strong theoretical guarantees established in this paper, making it a compelling approach for learning to optimize complex actions in uncertain environments.