
Thompson Sampling in Bayesian Bandits

Updated 25 February 2026
  • Thompson Sampling is a Bayesian algorithm for sequential decision-making that balances exploration and exploitation by sampling models from the posterior distribution.
  • Policy distillation approximates the original TS policy with a compact learned policy, significantly reducing computation and latency in complex action spaces.
  • Empirical evaluations show that distilled TS closely matches the regret performance of exact TS while enabling scalable deployment in industrial applications.

Thompson Sampling (TS) is a Bayesian algorithm for sequential decision-making problems that require balancing exploration and exploitation, such as multi-armed bandits and contextual bandits. At each decision point, TS samples a model from the posterior distribution over unknown parameters given observed data and selects the action that appears optimal for this sampled model. This randomized probability matching achieves Bayesian optimality in terms of exploration, often yielding provably low regret. However, practical limitations involving posterior inference and online optimization motivate approximations or distillation methods for complex action spaces and large models (Namkoong et al., 2020).

1. Core Principles and Bayesian Foundations

Thompson Sampling is defined by maintaining a Bayesian posterior over unknown parameters and using this updated distribution to make sequential decisions. At each round $t$:

  • Posterior update: Compute $P(\theta \mid H_t) \propto P(\theta)\prod_{i < t} P(R_i \mid A_i, S_i, \theta)$, where $H_t$ is the observed history (Namkoong et al., 2020).
  • Posterior sampling: Draw $\theta_t \sim P(\cdot \mid H_t)$.
  • Action selection: Choose $A_t = \arg\max_{a \in \mathcal{A}} f_{\theta_t}(a, S_t)$.

The frequentist regret (against a fixed $\theta$) after $T$ rounds is:

$$\text{Reg}^{fr}(T) = \sum_{t=1}^T \mathbb{E}\left[\max_a f_\theta(a, S_t) - f_\theta(A_t, S_t)\right]$$

TS generalizes across bandit and contextual bandit models, including exponential families, heavy-tailed reward models, and settings with latent structure (Russo et al., 2017, Lee et al., 2023, Hong et al., 2021). Its Bayesian formulation explicitly quantifies uncertainty and drives information-directed sampling.
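As a minimal, self-contained sketch, the update–sample–select loop above can be run for a Beta-Bernoulli bandit, tracking the frequentist regret defined above along the way (the arm means, horizon, and seed here are hypothetical):

```python
import random

def thompson_sampling(mu, n_rounds, seed=0):
    """Beta-Bernoulli Thompson Sampling; `mu` holds the true (unknown)
    arm means, used only to simulate rewards and score regret."""
    rng = random.Random(seed)
    k = len(mu)
    # Beta(1, 1) priors per arm: alpha = successes + 1, beta = failures + 1.
    alpha, beta = [1.0] * k, [1.0] * k
    best, regret, pulls = max(mu), 0.0, [0] * k
    for _ in range(n_rounds):
        # Posterior sampling: draw one plausible mean per arm.
        theta = [rng.betavariate(alpha[i], beta[i]) for i in range(k)]
        # Action selection: act greedily w.r.t. the sampled model.
        a = max(range(k), key=lambda i: theta[i])
        pulls[a] += 1
        regret += best - mu[a]  # expected per-round shortfall
        # Posterior update from the observed Bernoulli reward.
        r = 1 if rng.random() < mu[a] else 0
        alpha[a] += r
        beta[a] += 1 - r
    return pulls, regret

pulls, regret = thompson_sampling([0.3, 0.5, 0.7], n_rounds=2000)
# Play concentrates on the best arm, so cumulative regret grows sublinearly.
```

Randomized probability matching emerges with no explicit exploration bonus: an arm is pulled exactly as often as its posterior makes it look best.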

2. Computational Bottlenecks and Distillation

A key challenge in deploying TS on complex models is the computational cost of posterior updates and optimization in large or continuous action spaces, particularly for deep Bayesian neural networks or when fast decision time is required (Namkoong et al., 2020, Osband et al., 2023, Zhang et al., 2020). For a generic contextual bandit with deep models, these costs are often:

  • $O(n^2)$–$O(n^3)$ for each posterior sample (due to numerically demanding sampling or inference).
  • $O(|\mathcal{A}| \cdot \text{cost}_f)$ for optimization over the action space.

To address this, a major innovation is policy distillation for TS:

  • Offline distillation trains an explicit policy $\pi^m(a \mid s)$ (typically a small neural network) to imitate the stochastic decision map induced by the true TS policy.
  • During online operation, actions are sampled from $\pi^m(\cdot \mid s)$, trading expensive online sampling/optimization for a single policy network forward pass—resulting in 5–10$\times$ lower latency in experiments (Namkoong et al., 2020).

This approach preserves the exploration behavior of TS within the capacity of the policy class and scales to billions of queries (e.g., video upload optimization at Meta) (Namkoong et al., 2020).
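In a contextless toy setting, the distillation idea can be sketched as follows (the posterior counts below are hypothetical): sample many actions from the true TS decision rule offline, then fit a categorical policy by maximum likelihood, which is simply the empirical action frequencies; online serving reduces to a cheap categorical draw.

```python
import random
from collections import Counter

def distill_ts_policy(alpha, beta, n_samples=10000, seed=0):
    """Estimate pi^TS(a) for a Beta-Bernoulli posterior by repeated
    posterior sampling -- the expensive offline step."""
    rng = random.Random(seed)
    k = len(alpha)
    counts = Counter()
    for _ in range(n_samples):
        theta = [rng.betavariate(alpha[i], beta[i]) for i in range(k)]
        counts[max(range(k), key=lambda i: theta[i])] += 1
    # MLE in a categorical policy class = empirical action frequencies.
    return [counts[a] / n_samples for a in range(k)]

# Hypothetical posterior counts: arm 1 looks better, but both are uncertain.
pi_m = distill_ts_policy(alpha=[3, 8], beta=[7, 4])
# Online serving: a cheap categorical draw instead of posterior sampling
# plus an argmax over sampled models.
action = random.Random(1).choices(range(2), weights=pi_m)[0]
```

With contexts, the lookup table becomes a policy network $\pi^m(a \mid s)$, but the division of labor is the same: posterior sampling happens offline, and the online path stays cheap.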

3. Algorithmic Workflow and Pseudocode

The Distilled Thompson Sampling pipeline organizes the workflow as:

  1. Online phase:
    • Deploy $\pi^{m_t}$ on-device; for each context $S_t$, sample action $A_t \sim \pi^{m_t}(\cdot \mid S_t)$.
    • Log $(S_t, A_t, R_t)$ for offline training.
  2. Offline batch update:
    • Given all collected history $H_t$, update the full TS posterior $P(\theta \mid H_t)$.
    • For a large pool of contexts $\{s_i\}$, sample $\theta \sim P(\cdot \mid H_t)$ and compute $A^{TS}_i = \arg\max_a f_\theta(a, s_i)$.
    • Fit $m_{t+1}$ by maximizing the average log-likelihood:

    $$m_{t+1} = \arg\max_{m \in \mathcal{M}} \mathbb{E}_{S \sim \mathcal{P}_S,\, A^{TS} \sim \pi_t^{TS}(\cdot \mid S)} \left[\log \pi^{m}(A^{TS} \mid S)\right]$$

  3. Policy network architecture:

    • Two hidden layers (width 100), tanh/ReLU activations, final softmax over $|\mathcal{A}|$ actions.
    • Training via stochastic gradient ascent.
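A minimal sketch of the offline fitting step under the architecture above, assuming a synthetic labeling rule in place of true TS argmax targets (the contexts, the rule, and all hyperparameters here are hypothetical): a two-hidden-layer tanh network with a softmax head, trained by stochastic gradient ascent on the log-likelihood of the logged actions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical offline distillation set: contexts paired with the actions a
# TS oracle would have chosen (a synthetic rule stands in for argmax_a f_theta).
X = rng.normal(size=(512, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # two "actions"

# Two hidden layers of width 100 (tanh), softmax over actions, as in the text.
W1 = rng.normal(scale=0.1, size=(4, 100)); b1 = np.zeros(100)
W2 = rng.normal(scale=0.1, size=(100, 100)); b2 = np.zeros(100)
W3 = rng.normal(scale=0.1, size=(100, 2)); b3 = np.zeros(2)

def forward(x):
    h1 = np.tanh(x @ W1 + b1)
    h2 = np.tanh(h1 @ W2 + b2)
    logits = h2 @ W3 + b3
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(logits)
    return h1, h2, p / p.sum(axis=1, keepdims=True)

def avg_log_lik():
    """Average log pi^m(A^TS | S) over the offline pool."""
    _, _, p = forward(X)
    return float(np.log(p[np.arange(len(y)), y]).mean())

lr, before = 0.1, avg_log_lik()
for _ in range(500):
    idx = rng.integers(0, len(X), size=64)        # minibatch
    xb, yb = X[idx], y[idx]
    h1, h2, p = forward(xb)
    # Gradient of the mean negative log-likelihood w.r.t. the logits.
    g = p.copy(); g[np.arange(len(yb)), yb] -= 1.0; g /= len(yb)
    gW3, gb3 = h2.T @ g, g.sum(0)
    gh2 = (g @ W3.T) * (1.0 - h2 ** 2)            # backprop through tanh
    gW2, gb2 = h1.T @ gh2, gh2.sum(0)
    gh1 = (gh2 @ W2.T) * (1.0 - h1 ** 2)
    gW1, gb1 = xb.T @ gh1, gh1.sum(0)
    # Descending the NLL = ascending the log-likelihood.
    W3 -= lr * gW3; b3 -= lr * gb3
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1
after = avg_log_lik()
```

In practice the targets come from posterior draws of the full Bayesian model rather than a fixed rule, and training runs over the logged context pool from step 2.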

The imitation error is controlled by the KL divergence: the expected KL between the TS policy and the distilled policy decreases as $O(1/\sqrt{N})$, where $N$ is the number of offline contexts.
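This decay can be checked in a toy categorical setting (the target TS action distribution and sample sizes below are hypothetical): the distilled policy is the smoothed empirical action distribution, and its KL from the target shrinks as the offline sample size grows.

```python
import math
import random

def fit_categorical(pi_ts, n, seed):
    """Distill n sampled TS actions into a categorical policy
    (MLE = empirical frequencies; add-one smoothing avoids log 0)."""
    rng = random.Random(seed)
    counts = [1] * len(pi_ts)
    for _ in range(n):
        counts[rng.choices(range(len(pi_ts)), weights=pi_ts)[0]] += 1
    total = sum(counts)
    return [c / total for c in counts]

def kl(p, q):
    """KL(p || q) for categorical distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

pi_ts = [0.15, 0.25, 0.6]   # hypothetical TS action distribution

def avg_kl(n, trials=20):
    """Imitation error averaged over independent offline samples of size n."""
    return sum(kl(pi_ts, fit_categorical(pi_ts, n, seed=s))
               for s in range(trials)) / trials

# Imitation error drops as the offline sample size N grows.
small_n, large_n = avg_kl(100), avg_kl(10000)
```

For richer policy classes the same trade-off holds, with an additional approximation-error floor set by the capacity of the network.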

4. Empirical Evaluation and Deployment

Distilled TS closely tracks the regret of true batch TS policies across a suite of benchmarks (Mushroom, Warfarin, Wheel bandit, video transcoding). Notably, in real-world deployment on Meta’s video upload pipeline for mobile, the distilled TS policy improves watch-time by approximately $3\%$ (95% CI: $[2.5\%, 3.5\%]$) and increases upload success rate by $1\%$ relative to uniform allocation (Namkoong et al., 2020).

Measured per-query decision times demonstrate substantial reductions:

| Algorithm | Latency (ms/query) |
|-------------------------------|-------|
| Linear-TS | 0.715 |
| Linear-TS-IL (Distilled) | 0.184 |
| NeuralLinear-TS | 1.142 |
| NeuralLinear-TS-IL (Distilled)| 0.178 |

These results demonstrate that policy distillation achieves order-of-magnitude speedups without sacrificing exploration efficiency or regret performance.

5. Limitations and Theoretical Challenges

While the imitation approach closely matches TS regret in practice, the following limitations are noted:

  • Theoretical analysis of regret focuses on Bayes regret; there are no formal frequentist guarantees when the generative model is misspecified or in adversarial settings.
  • Accumulation of imitation errors may degrade performance if errors are compounded adversarially over time.
  • Distillation assumes abundant unlabeled contexts; in limited-data settings, imitation error may be non-negligible.
  • The representational power of the policy class (e.g., compact neural networks) may be insufficient for capturing the nuanced exploration behavior of complex Bayesian TS (Namkoong et al., 2020).
  • Extensions such as alternative discrepancy metrics (e.g., Wasserstein for continuous action spaces), online or semi-online policy updates, and formal worst-case analysis for imitation-induced regret remain open (Namkoong et al., 2020).

6. Broader Implications and Future Directions

The policy distillation framework operationalizes Thompson Sampling for applications with strict computational and latency constraints, unlocking scalable deployment in online platforms. The approach is generalizable:

  • Can be extended to the imitation of other exploration strategies (e.g., UCB, information-directed sampling).
  • Supports integration of richer model classes in the TS step (e.g., deep Gaussian processes), with the distilled policy serving as a fast proxy.
  • Ongoing work seeks to provide formal frequentist imitation bounds, improve distillation robustness in low-data regimes, and adapt to rapidly nonstationary environments (Namkoong et al., 2020).

Table summarizing computational and empirical highlights:

| Aspect | Vanilla TS | Distilled TS |
|----------------------|----------------------------|------------------------------|
| Online complexity | Posterior sampling, argmax | Single network forward pass |
| Typical latency | 0.7–1.1 ms | 0.18 ms |
| Exploration fidelity | Exact | Near-exact (sampling error) |
| Regret curve | Bayes optimal | Indistinguishable |
| Deployment | Challenging (latency) | Applied at Meta scale |

This distilled imitation approach enables principled Bayesian exploration at industrial scale, provided that the offline policy fitting remains aligned with the true TS-induced randomized policy (Namkoong et al., 2020).
