
Thompson Sampling in Bayesian Bandits

Updated 25 February 2026
  • Thompson Sampling is a Bayesian algorithm for sequential decision-making that balances exploration and exploitation by sampling models from the posterior distribution.
  • Policy distillation approximates the original TS policy with a compact learned policy, significantly reducing computation and latency in complex action spaces.
  • Empirical evaluations show that distilled TS closely matches the regret performance of exact TS while enabling scalable deployment in industrial applications.

Thompson Sampling (TS) is a Bayesian algorithm for sequential decision-making problems that require balancing exploration and exploitation, such as multi-armed bandits and contextual bandits. At each decision point, TS samples a model from the posterior distribution over unknown parameters given observed data and selects the action that appears optimal for this sampled model. This randomized probability matching achieves Bayesian optimality in terms of exploration, often yielding provably low regret. However, practical limitations involving posterior inference and online optimization motivate approximations or distillation methods for complex action spaces and large models (Namkoong et al., 2020).

1. Core Principles and Bayesian Foundations

Thompson Sampling is defined by maintaining a Bayesian posterior over unknown parameters and using this updated distribution to make sequential decisions. At each round $t$:

  • Posterior update: Compute $P(\theta \mid H_t) \propto P(\theta)\prod_{i < t} P(R_i \mid A_i, S_i, \theta)$, where $H_t$ is the observed history (Namkoong et al., 2020).
  • Posterior sampling: Draw $\theta_t \sim P(\cdot \mid H_t)$.
  • Action selection: Choose $A_t = \arg\max_{a \in \mathcal{A}} f_{\theta_t}(a, S_t)$.

The frequentist regret (against a fixed $\theta$) after $T$ rounds is:

$$\text{Reg}^{fr}(T) = \sum_{t=1}^T \mathbb{E}\left[\max_a f_\theta(a, S_t) - f_\theta(A_t, S_t)\right]$$

TS generalizes across bandit and contextual bandit models, including exponential families, heavy-tailed reward models, and settings with latent structure (Russo et al., 2017, Lee et al., 2023, Hong et al., 2021). Its Bayesian formulation explicitly quantifies uncertainty and drives information-directed sampling.
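As a minimal, self-contained sketch, the update–sample–select loop above can be run for a Beta-Bernoulli bandit, tracking the frequentist regret defined above along the way (the arm means, horizon, and seed here are hypothetical):

```python
import random

def thompson_sampling(mu, n_rounds, seed=0):
    """Beta-Bernoulli Thompson Sampling; `mu` holds the true (unknown)
    arm means, used only to simulate rewards and score regret."""
    rng = random.Random(seed)
    k = len(mu)
    # Beta(1, 1) priors per arm: alpha = successes + 1, beta = failures + 1.
    alpha, beta = [1.0] * k, [1.0] * k
    best, regret, pulls = max(mu), 0.0, [0] * k
    for _ in range(n_rounds):
        # Posterior sampling: draw one plausible mean per arm.
        theta = [rng.betavariate(alpha[i], beta[i]) for i in range(k)]
        # Action selection: act greedily w.r.t. the sampled model.
        a = max(range(k), key=lambda i: theta[i])
        pulls[a] += 1
        regret += best - mu[a]  # expected per-round shortfall
        # Posterior update from the observed Bernoulli reward.
        r = 1 if rng.random() < mu[a] else 0
        alpha[a] += r
        beta[a] += 1 - r
    return pulls, regret

pulls, regret = thompson_sampling([0.3, 0.5, 0.7], n_rounds=2000)
# Play concentrates on the best arm, so cumulative regret grows sublinearly.
```

Randomized probability matching emerges with no explicit exploration bonus: an arm is pulled exactly as often as its posterior makes it look best.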

2. Computational Bottlenecks and Distillation

A key challenge in deploying TS on complex models is the computational cost of posterior updates and optimization in large or continuous action spaces, particularly for deep Bayesian neural networks or when fast decision time is required (Namkoong et al., 2020, Osband et al., 2023, Zhang et al., 2020). For a generic contextual bandit with deep models, these costs are often:

  • $O(n^2)$–$O(n^3)$ for each posterior sample (due to numerically demanding sampling or inference).
  • $O(|\mathcal{A}| \cdot \text{cost}_f)$ for optimization over the action space.

To address this, a major innovation is policy distillation for TS:

  • Offline distillation trains an explicit policy $\pi^m(a \mid s)$ (typically a small neural network) to imitate the stochastic decision map induced by the true TS policy.
  • During online operation, actions are sampled from $\pi^m(\cdot \mid s)$, trading expensive online sampling/optimization for a single policy network forward pass—resulting in 5–10$\times$ lower latency in experiments (Namkoong et al., 2020).

This approach preserves the exploration behavior of TS within the capacity of the policy class and scales to billions of queries (e.g., video upload optimization at Meta) (Namkoong et al., 2020).
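In a contextless toy setting, the distillation idea can be sketched as follows (the posterior counts below are hypothetical): sample many actions from the true TS decision rule offline, then fit a categorical policy by maximum likelihood, which is simply the empirical action frequencies; online serving reduces to a cheap categorical draw.

```python
import random
from collections import Counter

def distill_ts_policy(alpha, beta, n_samples=10000, seed=0):
    """Estimate pi^TS(a) for a Beta-Bernoulli posterior by repeated
    posterior sampling -- the expensive offline step."""
    rng = random.Random(seed)
    k = len(alpha)
    counts = Counter()
    for _ in range(n_samples):
        theta = [rng.betavariate(alpha[i], beta[i]) for i in range(k)]
        counts[max(range(k), key=lambda i: theta[i])] += 1
    # MLE in a categorical policy class = empirical action frequencies.
    return [counts[a] / n_samples for a in range(k)]

# Hypothetical posterior counts: arm 1 looks better, but both are uncertain.
pi_m = distill_ts_policy(alpha=[3, 8], beta=[7, 4])
# Online serving: a cheap categorical draw instead of posterior sampling
# plus an argmax over sampled models.
action = random.Random(1).choices(range(2), weights=pi_m)[0]
```

With contexts, the lookup table becomes a policy network $\pi^m(a \mid s)$, but the division of labor is the same: posterior sampling happens offline, and the online path stays cheap.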

3. Algorithmic Workflow and Pseudocode

The Distilled Thompson Sampling pipeline organizes the workflow as:

  1. Online phase:
    • Deploy $\pi^{m_t}$ on-device; for each context $S_t$, sample action $A_t \sim \pi^{m_t}(\cdot \mid S_t)$.
    • Log $(S_t, A_t, R_t)$ for offline training.
  2. Offline batch update:
    • Given all collected history $H_t$, update the full TS posterior $P(\theta \mid H_t)$.
    • For a large pool of contexts $\{s_i\}$, sample $\theta \sim P(\cdot \mid H_t)$ and compute $A^{TS}_i = \arg\max_a f_\theta(a, s_i)$.
    • Fit $m_{t+1}$ by maximizing the average log-likelihood:

    $$m_{t+1} = \arg\max_{m \in \mathcal{M}} \mathbb{E}_{S \sim \mathcal{P}_S,\, A^{TS} \sim \pi_t^{TS}(\cdot \mid S)} \left[\log \pi^{m}(A^{TS} \mid S)\right]$$

  3. Policy network architecture:

    • Two hidden layers (width 100), tanh/ReLU activations, final softmax over $|\mathcal{A}|$ actions.
    • Training via stochastic gradient ascent.
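A minimal sketch of the offline fitting step under the architecture above, assuming a synthetic labeling rule in place of true TS argmax targets (the contexts, the rule, and all hyperparameters here are hypothetical): a two-hidden-layer tanh network with a softmax head, trained by stochastic gradient ascent on the log-likelihood of the logged actions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical offline distillation set: contexts paired with the actions a
# TS oracle would have chosen (a synthetic rule stands in for argmax_a f_theta).
X = rng.normal(size=(512, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # two "actions"

# Two hidden layers of width 100 (tanh), softmax over actions, as in the text.
W1 = rng.normal(scale=0.1, size=(4, 100)); b1 = np.zeros(100)
W2 = rng.normal(scale=0.1, size=(100, 100)); b2 = np.zeros(100)
W3 = rng.normal(scale=0.1, size=(100, 2)); b3 = np.zeros(2)

def forward(x):
    h1 = np.tanh(x @ W1 + b1)
    h2 = np.tanh(h1 @ W2 + b2)
    logits = h2 @ W3 + b3
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(logits)
    return h1, h2, p / p.sum(axis=1, keepdims=True)

def avg_log_lik():
    """Average log pi^m(A^TS | S) over the offline pool."""
    _, _, p = forward(X)
    return float(np.log(p[np.arange(len(y)), y]).mean())

lr, before = 0.1, avg_log_lik()
for _ in range(500):
    idx = rng.integers(0, len(X), size=64)        # minibatch
    xb, yb = X[idx], y[idx]
    h1, h2, p = forward(xb)
    # Gradient of the mean negative log-likelihood w.r.t. the logits.
    g = p.copy(); g[np.arange(len(yb)), yb] -= 1.0; g /= len(yb)
    gW3, gb3 = h2.T @ g, g.sum(0)
    gh2 = (g @ W3.T) * (1.0 - h2 ** 2)            # backprop through tanh
    gW2, gb2 = h1.T @ gh2, gh2.sum(0)
    gh1 = (gh2 @ W2.T) * (1.0 - h1 ** 2)
    gW1, gb1 = xb.T @ gh1, gh1.sum(0)
    # Descending the NLL = ascending the log-likelihood.
    W3 -= lr * gW3; b3 -= lr * gb3
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1
after = avg_log_lik()
```

In practice the targets come from posterior draws of the full Bayesian model rather than a fixed rule, and training runs over the logged context pool from step 2.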

The imitation error is controlled by the KL divergence: the expected KL between the TS policy and the distilled policy decreases as $O(1/\sqrt{N})$, where $N$ is the number of offline contexts.
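This decay can be checked in a toy categorical setting (the target TS action distribution and sample sizes below are hypothetical): the distilled policy is the smoothed empirical action distribution, and its KL from the target shrinks as the offline sample size grows.

```python
import math
import random

def fit_categorical(pi_ts, n, seed):
    """Distill n sampled TS actions into a categorical policy
    (MLE = empirical frequencies; add-one smoothing avoids log 0)."""
    rng = random.Random(seed)
    counts = [1] * len(pi_ts)
    for _ in range(n):
        counts[rng.choices(range(len(pi_ts)), weights=pi_ts)[0]] += 1
    total = sum(counts)
    return [c / total for c in counts]

def kl(p, q):
    """KL(p || q) for categorical distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

pi_ts = [0.15, 0.25, 0.6]   # hypothetical TS action distribution

def avg_kl(n, trials=20):
    """Imitation error averaged over independent offline samples of size n."""
    return sum(kl(pi_ts, fit_categorical(pi_ts, n, seed=s))
               for s in range(trials)) / trials

# Imitation error drops as the offline sample size N grows.
small_n, large_n = avg_kl(100), avg_kl(10000)
```

For richer policy classes the same trade-off holds, with an additional approximation-error floor set by the capacity of the network.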

4. Empirical Evaluation and Deployment

Distilled TS closely tracks the regret of true batch TS policies across a suite of benchmarks (Mushroom, Warfarin, Wheel bandit, video transcoding). Notably, in real-world deployment on Meta’s video upload pipeline for mobile, the distilled TS policy improves watch-time by approximately $3\%$ (95% CI: $[2.5\%, 3.5\%]$) and increases upload success rate by $1\%$ relative to uniform allocation (Namkoong et al., 2020).

Measured per-query decision times demonstrate substantial reductions:

| Algorithm | Latency (ms/query) |
|-------------------------------|-------|
| Linear-TS | 0.715 |
| Linear-TS-IL (Distilled) | 0.184 |
| NeuralLinear-TS | 1.142 |
| NeuralLinear-TS-IL (Distilled)| 0.178 |

These results demonstrate that policy distillation achieves order-of-magnitude speedups without sacrificing exploration efficiency or regret performance.

5. Limitations and Theoretical Challenges

While the imitation approach closely matches TS regret in practice, the following limitations are noted:

  • Theoretical analysis of regret focuses on Bayes regret; there are no formal frequentist guarantees when the generative model is misspecified or in adversarial settings.
  • Accumulation of imitation errors may degrade performance if errors are compounded adversarially over time.
  • Distillation assumes abundant unlabeled contexts; in limited-data settings, imitation error may be non-negligible.
  • The representational power of the policy class (e.g., compact neural networks) may be insufficient for capturing the nuanced exploration behavior of complex Bayesian TS (Namkoong et al., 2020).
  • Extensions such as alternative discrepancy metrics (e.g., Wasserstein for continuous action spaces), online or semi-online policy updates, and formal worst-case analysis for imitation-induced regret remain open (Namkoong et al., 2020).

6. Broader Implications and Future Directions

The policy distillation framework operationalizes Thompson Sampling for applications with strict computational and latency constraints, unlocking scalable deployment in online platforms. The approach is generalizable:

  • Can be extended to the imitation of other exploration strategies (e.g., UCB, information-directed sampling).
  • Supports integration of richer model classes in the TS step (e.g., deep Gaussian processes), with the distilled policy serving as a fast proxy.
  • Ongoing work seeks to provide formal frequentist imitation bounds, improve distillation robustness in low-data regimes, and adapt to rapidly nonstationary environments (Namkoong et al., 2020).

Table summarizing computational and empirical highlights:

| Aspect | Vanilla TS | Distilled TS |
|----------------------|----------------------------|------------------------------|
| Online complexity | Posterior sampling, argmax | Single network forward pass |
| Typical latency | 0.7–1.1 ms | 0.18 ms |
| Exploration fidelity | Exact | Near-exact (sampling error) |
| Regret curve | Bayes optimal | Indistinguishable |
| Deployment | Challenging (latency) | Applied at Meta scale |

This distilled imitation approach enables principled Bayesian exploration at industrial scale, provided that the offline policy fitting remains aligned with the true TS-induced randomized policy (Namkoong et al., 2020).
