- The paper benchmarks traditional and novel Bayesian deep network methods, including variational inference, dropout, and bootstrapping, within the Thompson Sampling framework.
- It reveals that methods excelling in static supervised learning often underperform in dynamic, online reinforcement learning settings.
- The study highlights neural linear regression and bootstrapped networks as promising approaches that improve exploration and uncertainty estimation in practice.
An Empirical Analysis of Bayesian Approximations in Deep Bayesian Bandits
The research paper offers a thorough empirical evaluation of various Bayesian deep network approaches applied within the Thompson Sampling framework for contextual bandits. The paper's core objective is to scrutinize the effectiveness of these methodologies, especially in online settings where balancing exploration and exploitation is critical. This exploration-exploitation dilemma remains a formidable challenge in sequential decision-making tasks central to reinforcement learning.
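To make the setting concrete, the sketch below shows the generic Thompson Sampling loop for a contextual bandit: at each round the agent draws one sample from its (approximate) posterior over reward models, acts greedily with respect to that sample, and updates the posterior with the observed reward. The `PosteriorModel` interface and the `env` object are hypothetical stand-ins used only for illustration, not the paper's code.

```python
import numpy as np

class PosteriorModel:
    """Hypothetical interface for any approximate-posterior reward model.

    sample_predictions(context): draw one set of parameters from the
    (approximate) posterior and return a predicted reward per action.
    update(context, action, reward): incorporate the new observation.
    """
    def __init__(self, num_actions, context_dim):
        self.num_actions = num_actions
        self.context_dim = context_dim

    def sample_predictions(self, context):
        raise NotImplementedError

    def update(self, context, action, reward):
        raise NotImplementedError


def thompson_sampling_loop(model, env, num_rounds):
    """Generic Thompson Sampling: act greedily w.r.t. a single posterior draw."""
    total_reward = 0.0
    for _ in range(num_rounds):
        context = env.observe_context()                     # hypothetical env API
        sampled_values = model.sample_predictions(context)  # one posterior sample
        action = int(np.argmax(sampled_values))             # greedy on that sample
        reward = env.pull(action)
        model.update(context, action, reward)
        total_reward += reward
    return total_reward
```

The quality of exploration hinges entirely on how faithful `sample_predictions` is to a true posterior, which is exactly what the paper's benchmark varies.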
Key Insights and Methodological Approach
The authors benchmark well-established and newer Bayesian approximation techniques on a common suite of contextual bandit problems. The evaluation spans traditional methods, such as variational inference and expectation-propagation, alongside more recent approaches like bootstrapping and dropout. A central question is how each method handles approximate posterior sampling and how that affects Thompson Sampling, since maintaining exact posteriors is intractable for deep models.
The results reveal a clear performance gap when these Bayesian methods are moved from a supervised to a sequential decision-making context. Notably, several methods that perform well in a static supervised setting (where models are not continually updated with new data) degrade in the dynamic, feedback-driven environment of reinforcement learning.
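As one illustration of approximate posterior sampling, dropout can be kept active at prediction time so that each stochastic forward pass behaves like a draw from an approximate posterior over reward functions. The fragment below is a minimal NumPy sketch under that assumption, not the paper's implementation; the layer sizes and weights are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_dropout_forward(x, W1, b1, W2, b2, keep_prob=0.8):
    """One stochastic forward pass: dropout stays ON at prediction time,
    so repeated calls act like samples from an approximate posterior."""
    h = np.maximum(0.0, x @ W1 + b1)             # ReLU hidden layer
    mask = rng.random(h.shape) < keep_prob       # random dropout mask
    h = h * mask / keep_prob                     # inverted-dropout scaling
    return h @ W2 + b2                           # predicted reward per action

# Hypothetical shapes: 10-dim context, 50 hidden units, 4 actions.
context_dim, hidden, num_actions = 10, 50, 4
W1 = rng.normal(0, 0.1, (context_dim, hidden)); b1 = np.zeros(hidden)
W2 = rng.normal(0, 0.1, (hidden, num_actions)); b2 = np.zeros(num_actions)

context = rng.normal(size=context_dim)
sampled_rewards = mc_dropout_forward(context, W1, b1, W2, b2)
action = int(np.argmax(sampled_rewards))         # Thompson step: argmax of one sample
```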
Numerical Results and Comparative Analysis
The paper documents clear shortcomings of certain methods under the demands of online learning. For instance, standard variational inference and Black Box α-divergence methods, while theoretically well-grounded estimators of posterior distributions, adapt poorly to online learning because their uncertainty estimates converge slowly as data arrive incrementally. This highlights a critical gap between theoretically motivated Bayesian methods and their behavior in practical, application-specific contexts.
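For intuition about what these variational methods maintain: each weight of the reward model gets a factorized Gaussian posterior whose mean and variance are learned, and Thompson Sampling draws a full weight vector from it before every decision. The fragment below sketches only the sampling step under that mean-field assumption for a linear reward model; fitting the variational parameters is omitted and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Mean-field Gaussian posterior over the weights of a linear reward model:
# one (mu, rho) pair per weight, with sigma = softplus(rho) kept positive.
context_dim, num_actions = 10, 4
mu = np.zeros((context_dim, num_actions))
rho = np.full((context_dim, num_actions), -3.0)

def sample_weights(mu, rho):
    sigma = np.log1p(np.exp(rho))      # softplus
    eps = rng.normal(size=mu.shape)
    return mu + sigma * eps            # reparameterized posterior draw

context = rng.normal(size=context_dim)
W = sample_weights(mu, rho)            # one draw per decision
action = int(np.argmax(context @ W))
# In the online setting, mu and rho must be re-fit after every new observation,
# which is where slowly converging uncertainty estimates become costly.
```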
By contrast, the neural linear approach shows strong potential by decoupling representation learning from uncertainty estimation: a network learns the features, while Bayesian linear regression on top of those features supplies the posterior used for sampling. This separation proves advantageous, offering computational simplicity and robustness across diverse benchmarks. The study also finds that bootstrapped networks provide better exploration than greedy baseline networks, as validated in experiments on standard datasets such as Mushroom, Statlog, and Covertype.
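A minimal sketch of the neural-linear idea follows, assuming a fixed feature map for brevity (in the paper the features come from a periodically retrained network) and a known noise variance so the posterior stays in closed form: each action keeps a conjugate Gaussian posterior over its linear weights, and Thompson Sampling draws from that posterior. Class and function names are illustrative, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(2)

class NeuralLinearArm:
    """Bayesian linear regression on top of a feature map phi(context).

    With prior N(0, prior_precision^-1 * I) and noise variance sigma2:
        precision = prior_precision * I + Phi^T Phi / sigma2
        mean      = precision^-1 @ (Phi^T y / sigma2)
    """
    def __init__(self, feature_dim, prior_precision=1.0, sigma2=1.0):
        self.precision = prior_precision * np.eye(feature_dim)
        self.b = np.zeros(feature_dim)
        self.sigma2 = sigma2

    def update(self, phi_x, reward):
        # Rank-one update of the posterior after one observation.
        self.precision += np.outer(phi_x, phi_x) / self.sigma2
        self.b += phi_x * reward / self.sigma2

    def sample_weights(self):
        cov = np.linalg.inv(self.precision)
        mean = cov @ self.b
        return rng.multivariate_normal(mean, cov)   # posterior draw for Thompson Sampling

def phi(context):
    """Stand-in for the last-layer representation of a trained network."""
    return np.tanh(context)

# Hypothetical 3-action bandit with an 8-dimensional context.
arms = [NeuralLinearArm(feature_dim=8) for _ in range(3)]
context = rng.normal(size=8)
sampled_values = [arm.sample_weights() @ phi(context) for arm in arms]
action = int(np.argmax(sampled_values))
```

Because the linear posterior is exact given the features, the only approximation left is in the representation itself, which is one plausible reason the method travels well from supervised to online settings.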
Practical and Theoretical Implications
This comprehensive empirical study carries several implications for the future of AI in decision-making applications. It raises important questions about the suitability of approximate Bayesian methods in reinforcement learning and argues for re-evaluating how these tools are used in environments that require rapid, continuous adaptation.
Moreover, by releasing a suite of benchmark problems and open-source implementations, the research establishes a valuable resource for subsequent work aimed at refining posterior approximations or developing hybrid approaches that better balance exploration and exploitation under uncertainty.
Future Directions
Potential directions for future work include further refinement of methods like NeuralLinear, which combine the representational power of neural networks with the tractable posteriors of classical linear models. Another avenue is investigating the intrinsic noise of stochastic gradient-based optimization as an implicit source of exploration; the results also suggest that simpler models can be advantageous in scenarios where interpretability and transparency are paramount.
In sum, the paper contributes a significant empirical framework to the discourse on Bayesian methods in reinforcement learning, demonstrating the need for methodologies that are not only theoretically sound but also practically viable in complex, real-world decision-making scenarios.