Thompson Sampling: Bayesian Decision Making
- Thompson Sampling is a Bayesian sequential decision-making method that uses probability matching to balance exploration and exploitation.
- It computes and samples from the posterior distribution over model parameters to choose actions based on their estimated optimality.
- The method offers strong theoretical regret guarantees and has been extended with scalable approximations and applications in complex, high-dimensional settings.
Thompson Sampling (TS) is a Bayesian strategy for sequential decision-making in uncertain environments, originally introduced by W.R. Thompson in 1933. The essence of TS is "probability matching": at each decision step, the agent samples model parameters from their posterior (or an approximation) given historical data, then chooses an action that is optimal for that sampled model. This randomized approach to balancing exploitation and exploration has enabled TS to achieve strong empirical and theoretical performance in domains ranging from classical multi-armed bandits to high-dimensional reinforcement learning and Bayesian optimization. TS has been widely adopted for its simplicity, adaptability, and statistical efficiency, with a growing body of research focused on its analytical foundations, scalable implementation for complex models, and principled extensions for structured and nonstandard settings.
1. Algorithmic Principles and Mathematical Foundations
Thompson Sampling operates within the Bayesian framework. The key procedural steps are:
- Posterior computation: Given the data $\mathcal{D}_t$ observed up to round $t$, compute (or approximate) the posterior $p(\theta \mid \mathcal{D}_t)$ over parameters $\theta$ that index environment models or reward distributions.
- Sampling: Draw a realization $\tilde{\theta}_t \sim p(\theta \mid \mathcal{D}_t)$.
- Action optimization: Choose the action $a_t = \arg\max_{a} \mathbb{E}\!\left[r \mid a, \tilde{\theta}_t\right]$ that is optimal under the sampled model.
- Posterior update: Observe the outcome, update the posterior.
For binary rewards (Bernoulli bandits), with a Beta prior, this yields closed-form updates and efficient computation. More generally, for non-conjugate models, MCMC or particle-based approximation methods may be required. The probability that TS chooses arm $a$ at round $t$ is the posterior probability that $a$ is optimal, $\Pr(a_t = a \mid \mathcal{D}_t) = \Pr\!\left(a \in \arg\max_{a'} \mathbb{E}[r \mid a', \theta] \,\middle|\, \mathcal{D}_t\right)$; TS thus implements a randomized policy that matches current epistemic uncertainty about optimality.
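A minimal sketch of the Beta-Bernoulli case just described, assuming NumPy is available; the arm means, horizon, and uniform Beta(1, 1) prior below are illustrative choices, not values from any cited work.

```python
import numpy as np

rng = np.random.default_rng(0)

def thompson_sampling_bernoulli(true_means, horizon, prior=(1.0, 1.0)):
    """Beta-Bernoulli Thompson Sampling on a K-armed bandit (illustrative sketch)."""
    k = len(true_means)
    alpha = np.full(k, prior[0])               # prior/posterior success counts
    beta = np.full(k, prior[1])                # prior/posterior failure counts
    rewards = np.zeros(horizon)
    for t in range(horizon):
        theta = rng.beta(alpha, beta)          # 1. sample one model from the posterior
        arm = int(np.argmax(theta))            # 2. act greedily with respect to the sample
        r = rng.binomial(1, true_means[arm])   # 3. observe a Bernoulli reward
        alpha[arm] += r                        # 4. conjugate posterior update
        beta[arm] += 1 - r
        rewards[t] = r
    return rewards, alpha, beta

rewards, a, b = thompson_sampling_bernoulli([0.3, 0.5, 0.7], horizon=5000)
print("average reward:", rewards.mean())
```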
2. Exploration-Exploitation Trade-off
TS balances exploration and exploitation by sampling models from the posterior. When confidence in the optimal arm is high, sampled parameters tend to favor exploitation; when uncertainty is high, the randomization induces more exploration. Unlike algorithms built on explicit optimism (e.g., UCB), TS does not maintain confidence intervals, but instead allocates actions in proportion to the current belief in their optimality.
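Probability matching can be made concrete with a short Monte Carlo computation of the posterior probability that each arm is optimal, which is exactly the probability TS assigns to playing each arm; the Beta parameters below are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical posterior Beta parameters for three arms after some observations
alpha = np.array([12.0, 30.0, 3.0])
beta = np.array([10.0, 22.0, 4.0])

# Estimate P(arm a is optimal | data): the action probabilities of Thompson Sampling
samples = rng.beta(alpha, beta, size=(100_000, 3))            # joint posterior draws
p_optimal = np.bincount(samples.argmax(axis=1), minlength=3) / samples.shape[0]
print(p_optimal)   # more uncertain arms still receive a share of the plays
```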
Recent research has recast TS as the solution to an online optimization problem regularized by a point-biserial correlation term, revealing that the mechanism for balancing exploration and exploitation can be quantitatively described by the biserial covariance between reward gaps and arm identities (Qu et al., 8 Oct 2025). This framework allows stationarization of finite-horizon regret via a squared regret surrogate, yielding a stationary Bellman equation and a time-invariant optimal policy, with a TS regularizer that explicitly measures residual uncertainty about which arm is best.
3. Theoretical Guarantees and Regret Analysis
TS enjoys several strong regret bounds:
- For $N$-armed Bernoulli bandits, expected regret over horizon $T$ is $O\!\left(\left(\sum_{i=2}^{N} \frac{1}{\Delta_i^{2}}\right)^{2} \ln T\right)$, where $\Delta_i$ is the gap between the optimal arm and arm $i$ (Agrawal et al., 2011). For two arms, the bound tightens to $O\!\left(\frac{\ln T}{\Delta} + \frac{1}{\Delta^{3}}\right)$; a small simulation after this list illustrates the logarithmic growth.
- TS matches Lai-Robbins asymptotic optimality in many settings, and regret bounds for linear, kernelized, and generalized function classes scale as $\tilde{O}(d\sqrt{T})$, where $d$ is the parameter dimension (Russo et al., 2017).
- In reinforcement learning, TS-based algorithms for unknown MDPs (e.g., TSDE) achieve Bayesian expected regret of $\tilde{O}(HS\sqrt{AT})$, where $S$ is the number of states, $A$ the number of actions, $T$ the time horizon, and $H$ a bound on the bias span (Ouyang et al., 2017). In highly general stochastic environments, TS is asymptotically optimal in mean under a recoverability assumption and enjoys sublinear regret (Leike et al., 2016).
- Robustness results apply to modifications of TS governed by a tuning parameter, showing that logarithmic regret holds in a problem-dependent range of that parameter but can degrade to linear regret when it is tuned too aggressively (Ha, 2017).
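As a rough empirical check on the logarithmic rates above, the sketch below averages the cumulative pseudo-regret of Beta-Bernoulli TS over a few runs; the arm means, horizon, and number of runs are illustrative choices rather than values from the cited analyses. If regret grows like $\ln t$, the ratio printed in the last column roughly stabilizes.

```python
import numpy as np

rng = np.random.default_rng(2)

def ts_cumulative_regret(true_means, horizon):
    """Cumulative pseudo-regret of Beta-Bernoulli Thompson Sampling for one run."""
    k = len(true_means)
    alpha = np.ones(k)
    beta = np.ones(k)
    best = max(true_means)
    regret = np.zeros(horizon)
    for t in range(horizon):
        arm = int(np.argmax(rng.beta(alpha, beta)))   # posterior sample + greedy action
        r = rng.binomial(1, true_means[arm])
        alpha[arm] += r
        beta[arm] += 1 - r
        regret[t] = best - true_means[arm]            # expected gap of the pulled arm
    return regret.cumsum()

means = [0.45, 0.50, 0.55]
avg = np.mean([ts_cumulative_regret(means, 20_000) for _ in range(20)], axis=0)
for t in (1_000, 5_000, 20_000):
    print(t, round(avg[t - 1], 1), round(avg[t - 1] / np.log(t), 2))
```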
4. Scalable Posterior Approximation and Modern Extensions
Exact posterior computations in TS become intractable for complex or high-dimensional models. A range of research has addressed scalable approximation:
- Particle-based TS and Gradient Flows: Particle-based interactive TS leverages optimal transport and Wasserstein gradient flows for scalable approximation, moving a set of interacting particles toward the posterior via joint optimization with convexity guarantees (Zhang et al., 2019). The update rule combines SVGD-style repulsion, log-posterior gradients, and an entropy-regularized Wasserstein force; a simplified particle-update sketch appears after this list.
- Bootstrap Thompson Sampling (BTS): BTS replaces the posterior with an online bootstrap distribution, offering pipeline parallelism and improved robustness to model misspecification, with empirical regret comparable to TS (Eckles et al., 2014); a minimal online-bootstrap sketch also follows this list.
- Neural and Local-Uncertainty TS: Deep contextual bandits have motivated approaches such as Neural Thompson Sampling (NeuralTS), which constructs a pseudo-posterior over arm rewards using neural network mean and variance from neural tangent features, matching the best-known regret bounds in terms of an effective NTK dimension (Zhang et al., 2020). Local uncertainty methods sample from variational approximations of context-specific latent variables to provide expressive and computationally tractable uncertainty quantification (Wang et al., 2019).
- Mixture and Nonstandard Priors: MixTS generalizes TS to latent mixture priors for multi-task and population-structured bandits, with regret bounds scaling with the mixture structure (Hong et al., 2021). Regenerative Particle TS (RPTS) periodically deletes low-weight particles and reinjects new ones near surviving particles, remedying particle degeneracy (Zhou et al., 2022).
- Heavy-Tailed and Noncompliant Bandits: Specialized algorithms for symmetric $\alpha$-stable bandits (TS with the SMiN representation and robust truncation) provide finite-time polynomial regret bounds for TS in heavy-tailed regimes (Dubey et al., 2019). Models of noncompliance revise the notion of regret and extend TS with latent-variable formulations that combine TS with variational inference over compliance behavior (Stirn et al., 2018).
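As a simplified illustration of the particle-based direction above, the sketch below runs plain Stein variational gradient descent (kernel-smoothed log-posterior gradients plus a kernel repulsion term) to approximate one arm's posterior over its mean reward; it omits the entropy-regularized Wasserstein force of the cited method, and the Gaussian reward model, prior, step size, and particle count are illustrative assumptions.

```python
import numpy as np

def svgd_step(theta, grad_logp, step=0.05):
    """One SVGD update for 1-D particles `theta` approximating a target posterior."""
    n = theta.shape[0]
    diffs = theta[:, None] - theta[None, :]                 # diffs[i, j] = theta_i - theta_j
    sq = diffs ** 2
    h = np.median(sq) / np.log(n + 1) + 1e-12               # median-heuristic bandwidth
    k = np.exp(-sq / h)                                     # RBF kernel matrix
    grad = grad_logp(theta)                                 # per-particle log-posterior gradients
    drift = k @ grad / n                                    # kernel-smoothed gradient (attraction)
    repulsion = (2.0 / h) * (k * diffs).sum(axis=1) / n     # kernel-gradient term (repulsion)
    return theta + step * (drift + repulsion)

# One arm with Gaussian rewards, N(mu0, s0^2) prior on its mean, known noise s^2
mu0, s0, s = 0.0, 1.0, 0.5
observations = np.array([0.8, 1.1, 0.9])

def grad_log_posterior(theta):
    prior = (mu0 - theta) / s0 ** 2
    likelihood = ((observations[:, None] - theta[None, :]) / s ** 2).sum(axis=0)
    return prior + likelihood

rng = np.random.default_rng(3)
particles = rng.normal(0.0, 1.0, size=50)
for _ in range(500):
    particles = svgd_step(particles, grad_log_posterior)

# A TS step would draw one particle uniformly as the posterior sample for this arm
print(particles.mean(), particles.std(), rng.choice(particles))
```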
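And a minimal sketch of the bootstrap idea: the posterior is replaced by a set of online bootstrap replicates, one of which is drawn uniformly at each round and used greedily. The Poisson(1) replicate weights, pseudo-count initialization, and per-replicate point estimates below are assumptions of this sketch, not details taken from the cited paper.

```python
import numpy as np

rng = np.random.default_rng(4)

def bootstrap_ts_bernoulli(true_means, horizon, n_boot=100):
    """Bootstrap Thompson Sampling sketch: bootstrap replicates stand in for the posterior."""
    k = len(true_means)
    succ = np.ones((n_boot, k))                   # weighted successes per replicate and arm
    pulls = np.full((n_boot, k), 2.0)             # weighted pulls (pseudo-counts avoid 0/0)
    rewards = np.zeros(horizon)
    for t in range(horizon):
        j = rng.integers(n_boot)                  # "sampling" step: pick one replicate uniformly
        arm = int(np.argmax(succ[j] / pulls[j]))  # act greedily under that replicate's estimate
        r = rng.binomial(1, true_means[arm])
        w = rng.poisson(1.0, size=n_boot)         # online bootstrap: random weight per replicate
        succ[:, arm] += w * r
        pulls[:, arm] += w
        rewards[t] = r
    return rewards

print(bootstrap_ts_bernoulli([0.3, 0.5, 0.7], 5000).mean())
```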
5. Applications and Structured Decision Problems
TS now permeates areas beyond simple bandits:
- Contextual Bandits: TS deploys effective probability matching in settings where action rewards depend on observed side information; state-of-the-art contextual bandit methods employ structured posteriors (linear, kernel, deep) (Russo et al., 2017). A minimal linear-posterior sketch appears after this list.
- Combinatorial and Variable Selection: TVS frames variable selection as a combinatorial bandit problem, extending TS to subset selection with regret guarantees for combinatorial super-arms (Liu et al., 2020).
- Bayesian Optimization with GP-TS: Efficient global optimization strategies for TS acquisition functions (e.g., GP-TS via rootfinding) systematically identify local optima using separable spectral representations, leading to decisive empirical improvement over UCB/Expected Improvement in high-dimensional settings (Adebiyi et al., 2024).
- Large-Scale Decision Problems: TS variants using normal approximations of plug-in estimators, Monte Carlo truncation, and parametric policy classes enable scalable learning and control in epidemic management and ecological planning, with proven consistency (Hu et al., 2019).
- Adaptive and Game-theoretic Extensions: Generalized Thompson sampling formalizes TS as a Bayesian mixture over optimal policies, allowing transfer to adaptive control, multi-agent games, causal inference, and dynamic environments. These setups leverage TS for strategic co-adaptation and active structure learning (Ortega et al., 2013).
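For the linear-posterior case mentioned above, a minimal Thompson Sampling sketch maintains a Gaussian posterior over a shared weight vector via Bayesian linear regression; the prior and noise variances, dimensions, and toy environment are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)

class LinearTS:
    """Linear Thompson Sampling: Gaussian posterior over a shared reward-weight vector."""

    def __init__(self, dim, noise_var=0.25, prior_var=1.0):
        self.noise_var = noise_var
        self.precision = np.eye(dim) / prior_var       # posterior precision matrix
        self.b = np.zeros(dim)                         # accumulated x * r / noise_var

    def choose(self, contexts):
        cov = np.linalg.inv(self.precision)
        mean = cov @ self.b
        theta = rng.multivariate_normal(mean, cov)     # posterior sample of the weights
        return int(np.argmax(contexts @ theta))        # greedy action under the sample

    def update(self, x, r):
        self.precision += np.outer(x, x) / self.noise_var
        self.b += x * r / self.noise_var

# Toy run: rewards are linear in the chosen arm's context plus Gaussian noise
dim, n_arms, horizon = 5, 10, 2000
true_theta = rng.normal(size=dim)
agent = LinearTS(dim)
total = 0.0
for t in range(horizon):
    contexts = rng.normal(size=(n_arms, dim))
    a = agent.choose(contexts)
    r = contexts[a] @ true_theta + 0.5 * rng.normal()
    agent.update(contexts[a], r)
    total += r
print("average reward:", total / horizon)
```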
6. Contemporary Insights, Challenges, and Rigorous Enhancements
Recent work has illuminated TS from a broader optimization perspective, answering longstanding questions about its exploration-exploitation calibration and providing new regularization-informed frameworks for time-invariant policy synthesis (Qu et al., 8 Oct 2025). TS variants with virtual helping agents provide a flexible knob for exploration via ensembles and combiners, supporting applications in best-arm identification and time-sensitive learning with tunable regret performance (Pant et al., 2022). Open questions remain in settings with high-dimensional actions, hierarchical dependencies, or adverse reward distributions, and the field continues to advance scalable, robust, and interpretable TS architectures for modern sequential decision tasks.