- The paper establishes a frequentist regret bound for Thompson Sampling in complex multi-armed bandit settings with coupled nonlinear rewards.
- It recasts posterior concentration as a path-based optimization problem and relies on particle filtering to handle complex actions and nonlinear reward feedback.
- Practical experiments demonstrate its effectiveness in job scheduling and subset selection, paving the way for advanced reinforcement learning applications.
An Evaluation of Thompson Sampling in Complex Multi-Armed Bandit Problems
The paper "Thompson Sampling for Complex Online Problems" investigates the application of Thompson Sampling (TS) in complex multi-armed bandit (MAB) scenarios, enhancing the exploration-exploitation balance strategy typically employed in simpler bandit problems. This research addresses the challenge that arises when decision-makers require complex actions, which involve subsets of basic arms, influencing the nature of reward observations and coupling across action feedback.
Overview of the Contribution
A key contribution of this work is the establishment of a frequentist regret bound for Thompson Sampling, extending its applicability to MAB settings with general parameter, action, and observation spaces. Unlike conventional analyses that rely on independence across arms or specially structured priors, the bound developed by the authors only requires a discretely-supported prior. Crucially, the regret retains its logarithmic scaling with time, with a preconstant that reflects the information complexity of the problem and the coupling of rewards across complex actions.
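Schematically, and suppressing the paper's exact constants, the bound has the form

$$ R(T) \;\le\; B + C \log T \quad \text{with high probability,} $$

where $B$ is a problem-dependent term independent of the horizon $T$, and $C$ captures the information complexity of distinguishing the optimal complex action from its competitors under the coupled feedback model (this display is a paraphrase of the bound's shape, not the theorem's precise statement).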
Theoretical Framework and Results
The paper demonstrates improved regret bounds for subsets of arms played under coupled feedback, presenting the first substantial results for nonlinear reward aggregation scenarios such as observing only the maximum reward of a chosen subset. The algorithmic approach remains Bayesian-inspired and admits efficient numerical implementations, such as particle filtering, for complex bandit problems where reward feedback is aggregated or nonlinear.
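Because the exact posterior is typically intractable under such aggregated feedback, a particle filter can stand in for it. The sketch below is a hypothetical implementation (assuming Bernoulli basic arms and binary MAX-of-subset feedback; the class and parameter names are our own, not the paper's): it maintains weighted parameter particles, reweights them by the likelihood of each observed aggregate, and resamples with jitter when the weights degenerate.

```python
import numpy as np

rng = np.random.default_rng(1)

class ParticlePosterior:
    """Particle-filter approximation of the posterior over per-arm means."""

    def __init__(self, n_particles, n_arms):
        self.particles = rng.uniform(0.0, 1.0, size=(n_particles, n_arms))
        self.weights = np.full(n_particles, 1.0 / n_particles)

    def sample(self):
        # Draw one particle according to the current weights (the TS "posterior sample").
        idx = rng.choice(len(self.weights), p=self.weights)
        return self.particles[idx]

    def update(self, subset, y):
        # Likelihood of observing MAX feedback y in {0, 1} under each particle.
        p_max = 1.0 - np.prod(1.0 - self.particles[:, list(subset)], axis=1)
        lik = p_max if y == 1 else 1.0 - p_max
        self.weights *= np.clip(lik, 1e-12, None)
        self.weights /= self.weights.sum()
        # Resample with small jitter when the effective sample size collapses.
        if 1.0 / np.sum(self.weights ** 2) < 0.5 * len(self.weights):
            idx = rng.choice(len(self.weights), size=len(self.weights), p=self.weights)
            jitter = rng.normal(0.0, 0.02, self.particles[idx].shape)
            self.particles = np.clip(self.particles[idx] + jitter, 0.0, 1.0)
            self.weights = np.full(len(self.weights), 1.0 / len(self.weights))
```

The resampling threshold and jitter scale here are arbitrary illustrative choices; any standard sequential Monte Carlo scheme could be substituted.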
A significant highlight is the transformation of the posterior concentration argument into a path-based optimization problem, a novel proof technique tailored to deriving regret bounds. This approach circumvents the need for specific structural assumptions on the prior distribution and provides insight into posterior dynamics in MAB settings with complex actions.
Practical Implications
Practically, the robustness and flexibility of TS in handling complex feedback are underscored by numerical studies of its performance in job scheduling and subset-selection bandit scenarios. The algorithm, implemented with particle filters, demonstrates the practicality of TS in environments where per-arm observations are infeasible and the information model deviates from the standard per-arm reward structure.
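As a usage illustration, continuing the hypothetical ParticlePosterior sketch above (problem sizes are made up), a subset-selection run with MAX feedback would look like:

```python
# Thompson Sampling for subset selection with MAX feedback, using the
# particle-filter posterior from the sketch above in place of an exact posterior.
N, K, T = 10, 3, 2000
true_theta = rng.uniform(0.1, 0.9, size=N)          # unknown means of the basic arms
posterior = ParticlePosterior(n_particles=500, n_arms=N)

for t in range(T):
    theta_hat = posterior.sample()                   # sample a parameter vector
    subset = tuple(np.argsort(theta_hat)[-K:])       # best size-K subset under the sample
    y = int(rng.random() < 1.0 - np.prod(1.0 - true_theta[list(subset)]))
    posterior.update(subset, y)                      # reweight particles on the aggregate
```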
Speculations for Future Developments
Looking ahead, this pseudo-Bayesian framework opens pathways toward adversarial settings, where Bayesian-inspired algorithms such as TS could be adapted. There is also potential for TS to manage large-scale reinforcement learning (RL) problems with complex, state-dependent Markovian dynamics, with applications to online advertising and scheduling through optimal bidding and policy iteration methodologies.
Moreover, extending TS to handle continuous state spaces and X-armed bandit problems could yield substantial advances in computational efficiency and theoretical understanding, deepening insight into how learning systems adapt in complex scenarios.
In summary, this paper establishes a foundational understanding of Thompson Sampling for complex online problems, paving the way for its application and refinement in diverse decision-making environments. Such advancements promise to augment AI systems' capabilities in optimizing strategies across increasingly intricate and interconnected action spaces.