Double Thompson Sampling (D-TS)
- Double Thompson Sampling is a Bayesian sequential decision method that uses two independent posterior draws to enhance exploration–exploitation balance.
- It applies dual sampling in dueling bandits, contextual bandits, and finite MDPs to refine candidate selection and reduce regret.
- The dual mechanism improves robustness and statistical or computational efficiency, yielding tighter regret guarantees than single-draw Thompson Sampling in several structured decision problems.
Double Thompson Sampling (D-TS) is a family of Bayesian sequential decision algorithms that extends the classic Thompson Sampling (TS) paradigm via the introduction of a second randomized sampling or estimation mechanism. This “double” procedural structure has been realized in several settings—such as dueling bandits, contextual bandits with doubly robust estimation, and reinforcement learning in finite Markov Decision Processes (MDPs)—where it is used to further refine the exploration–exploitation trade-off or to improve statistical and computational efficiency. The D-TS approach is distinguished by its use of two independent or hierarchically staged posterior draws, which may act on different layers (actions, policies, models, or estimator corrections), yielding distinct theoretical regimes and regret properties relative to TS.
1. Algorithmic Formulations of Double Thompson Sampling
The D-TS principle manifests as two distinct posterior sampling steps, whose interaction depends on problem structure:
- Dueling Bandits: D-TS for dueling (pairwise comparison) bandits maintains Beta posteriors for the pairwise win probabilities between arms. At each round, two independent sets of samples are drawn from these posteriors. The first sample set is used, after elimination by upper confidence bounds, to identify arm candidates most likely to be Copeland winners (arms with the highest normalized Copeland score). The second sample set, after lower-confidence-based elimination, selects the most informative comparison partner. The two stages rely on separate, independently drawn samples and distinct confidence-based elimination phases (Wu et al., 2016); a minimal code sketch of this selection step appears below.
- Doubly Robust Contextual Bandits: The “doubly robust” TS algorithm for linear contextual bandits employs a doubly robust estimator to impute pseudo-rewards for all arms, using the actual observed reward only for the chosen arm, and model-based predictions for all others. Parameter estimation incorporates these imputed values with a ridge regression-style update, and arm selection proceeds by TS. The twofold robustness comes from the combination of imputation (regression) and inverse probability weighting, and the algorithm introduces a resampling step to ensure bounded selection probabilities. The “double” nature refers both to the estimator and to the resampling mechanism, which together enable an additive regret decomposition (Kim et al., 2021).
- Finite Discounted MDPs: In D-TS for finite discounted MDPs and stochastic games, the algorithm maintains (a) a Bayesian posterior over the state transition dynamics of the MDP, and (b) a posterior over stationary policies, which is updated in a count-based manner. At each round, the transition model is sampled, and, independently, a stationary policy is sampled according to its performance relative to local optima and count-based exploration weights. Actions are selected by combining the two sampled objects (Shi et al., 2022).
These variants are unified by the key property that both stages of sampling are performed with respect to independent posteriors or conditional distributions, ensuring that the additional layer is not simply redundant but structurally adds modeling capability, information gain, or correction for bias.
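To make the two-stage selection concrete, the following is a minimal Python sketch of a D-TS-style round for Copeland dueling bandits. It is an illustration under simplifying assumptions, not the exact specification of Wu et al. (2016): the function name `dts_select`, the exploration parameter `alpha`, the treatment of unexplored pairs, and the fallback when no feasible opponent remains are choices made here for brevity.

```python
import numpy as np

def dts_select(wins, t, alpha=0.5, rng=None):
    """One D-TS-style round for Copeland dueling bandits.

    wins[i, j] counts how often arm i has beaten arm j so far.
    Returns the pair (first_arm, second_arm) to duel at round t.
    """
    rng = rng or np.random.default_rng()
    K = wins.shape[0]
    n = wins + wins.T                                  # comparisons per pair
    phat = np.where(n > 0, wins / np.maximum(n, 1), 0.5)
    bonus = np.sqrt(alpha * np.log(max(t, 2)) / np.maximum(n, 1))
    ucb = np.where(n > 0, phat + bonus, 1.0)           # optimistic when unexplored
    lcb = np.where(n > 0, phat - bonus, 0.0)
    np.fill_diagonal(ucb, 0.5)
    np.fill_diagonal(lcb, 0.5)

    # Stage 1: upper-confidence elimination, then a first posterior draw to
    # pick a likely Copeland winner among the surviving candidates.
    copeland_ub = (ucb >= 0.5).sum(axis=1) - 1         # exclude self-comparison
    candidates = np.flatnonzero(copeland_ub == copeland_ub.max())
    theta1 = np.full((K, K), 0.5)
    iu = np.triu_indices(K, k=1)
    theta1[iu] = rng.beta(wins[iu] + 1, wins.T[iu] + 1)
    theta1[(iu[1], iu[0])] = 1.0 - theta1[iu]
    sampled_wins = (theta1 > 0.5).sum(axis=1)
    first = int(candidates[np.argmax(sampled_wins[candidates])])

    # Stage 2: an independent second draw against the first arm, restricted by
    # lower-confidence elimination to arms that could still beat it.
    theta2 = rng.beta(wins[:, first] + 1, wins[first, :] + 1)
    feasible = np.flatnonzero(lcb[:, first] <= 0.5)
    feasible = feasible[feasible != first]
    second = int(feasible[np.argmax(theta2[feasible])]) if feasible.size else first
    return first, second
```

After the duel is observed, the caller increments `wins[winner, loser]`, which sharpens both Beta posteriors used in subsequent rounds.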
2. Regret Analysis and Theoretical Properties
The D-TS approach aims to achieve improved, or at least competitive, regret guarantees relative to classic TS by leveraging its double sampling architecture.
- Dueling Bandits: For general Copeland dueling bandits, D-TS achieves a regret bound of $O(K^2 \log T)$, where $K$ is the number of arms and $T$ the number of rounds. In the Condorcet regime (a single Condorcet winner beats all other arms), a refined analysis yields regret $O(K \log T + K^2 \log\log T)$. These bounds are established via decompositions that leverage the elimination steps (RUCB/RLCB) and the statistical independence of the two sampling stages, resulting in an effective reduction of redundant or uninformative exploration (Wu et al., 2016).
- Doubly Robust TS (Contextual Bandits): The cumulative regret scales as $\tilde{O}(\phi^{-2}\sqrt{T})$, where $\phi^2$ denotes the minimum eigenvalue of the context covariance matrix. When $\phi^2$ is bounded below by a constant, this yields $\tilde{O}(\sqrt{T})$ regret, a polynomial-in-dimension improvement over previous LinTS bounds of order $\tilde{O}(d^{3/2}\sqrt{T})$, due both to the doubly robust estimator aggregating all available contexts and rewards, and to a resampling step that ensures stable selection probabilities (Kim et al., 2021).
- Finite Stochastic Games/MDPs: The DTS-RL algorithm admits a sublinear regret bound stated in terms of the number of states $S$, the number of actions $A$, the diameter $D$, and the horizon $T$, with an additional term attributable to the count-based policy update. These rates represent improvements over single-sampling posterior approaches by reducing the risk of premature lock-in to locally optimal but globally suboptimal policies (Shi et al., 2022). A schematic of the underlying two-posterior structure appears at the end of this section.
In all settings, theoretical improvements rely upon careful analysis of how double sampling facilitates refined partitioning of action or policy space, tighter control over statistical uncertainty, and prevention of over-exploitation of early misleading information.
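To illustrate the decoupled two-posterior structure used in the MDP setting, the following schematic pairs a Dirichlet posterior over transition dynamics with a count-weighted draw over a finite set of candidate stationary policies. It is a generic illustration rather than the DTS-RL algorithm of Shi et al. (2022): the softmax policy scoring, the bonus form, and the fixed candidate-policy set are placeholder assumptions.

```python
import numpy as np

def double_posterior_step(trans_counts, candidate_policies,
                          policy_returns, policy_visits,
                          state, bonus_scale=1.0, rng=None):
    """One decision step with two independent posterior draws (illustrative only).

    trans_counts:       (S, A, S) Dirichlet counts for the transition model.
    candidate_policies: (P, S) array; row p maps each state to an action.
    policy_returns:     (P,) empirical average return of each candidate policy.
    policy_visits:      (P,) number of times each candidate policy was played.
    """
    rng = rng or np.random.default_rng()
    S, A, _ = trans_counts.shape

    # Draw 1: a transition model from its Dirichlet posterior, per state-action pair.
    sampled_model = np.zeros_like(trans_counts, dtype=float)
    for s in range(S):
        for a in range(A):
            sampled_model[s, a] = rng.dirichlet(trans_counts[s, a] + 1.0)

    # Draw 2 (independent): a stationary policy, weighted by its empirical
    # return plus a count-based exploration bonus.
    total = policy_visits.sum() + 1.0
    bonus = bonus_scale * np.sqrt(np.log(total) / np.maximum(policy_visits, 1.0))
    scores = policy_returns + bonus
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    p = rng.choice(len(candidate_policies), p=probs)

    # The action follows the sampled policy; the sampled model is returned so
    # the caller can plan with or evaluate against it.
    action = int(candidate_policies[p, state])
    return action, sampled_model, p
```

A complete algorithm would additionally use the sampled transition model to evaluate or refine the candidate policies between rounds; that step is omitted here.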
3. Mechanisms for Refinement of Exploration–Exploitation
The double sampling structure supports nuanced management of the exploration–exploitation trade-off:
- Elimination and Filtering: In D-TS dueling bandits, aggressive elimination of suboptimal comparisons via confidence bounds prior to sampling reduces wasted exploration.
- Doubly Robust Estimation: In contextual bandits, fusing imputation-based and inverse-probability-weighted pseudo-rewards allows more robust parameter learning, combining the strengths of model-based and importance-weighted updates. This duality corrects for the missing rewards of unchosen arms and for bias in estimator convergence; a sketch follows at the end of this section.
- Policy and Model Posterior Decoupling: DTS for finite MDPs separates local dynamic modeling from long-term strategy optimization, balancing immediate reward against global optimality by explicitly maintaining and sampling from both distributions.
This dual structure typically involves a first-stage filtering, projection, or candidate-selection step, followed by a second, often more refined, sampling or estimation step, with each stage focusing on a different aspect of the uncertainty landscape.
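The imputation-plus-IPW mechanism can be sketched as follows. This is a minimal illustration rather than the exact estimator of Kim et al. (2021): the class name `DRLinTS`, the Gaussian posterior form, the Monte Carlo approximation of selection probabilities, and the probability clipping (standing in for the paper's resampling step) are assumptions made for brevity.

```python
import numpy as np

class DRLinTS:
    """Sketch of Thompson Sampling with doubly robust pseudo-rewards.

    Every arm (not only the chosen one) contributes a pseudo-reward that
    combines regression imputation with an inverse-probability-weighted
    correction, and the ridge-regression posterior is updated on all of them.
    """

    def __init__(self, dim, ridge=1.0, v=1.0, rng=None):
        self.V = ridge * np.eye(dim)   # posterior precision (ridge Gram matrix)
        self.b = np.zeros(dim)         # accumulated context * pseudo-reward
        self.v = v                     # posterior scale controlling exploration
        self.rng = rng or np.random.default_rng()

    def choose(self, contexts):
        """contexts: (K, dim) array, one row per arm. Returns (arm, probs)."""
        beta_hat = np.linalg.solve(self.V, self.b)
        cov = self.v ** 2 * np.linalg.inv(self.V)
        beta_tilde = self.rng.multivariate_normal(beta_hat, cov)
        scores = contexts @ beta_tilde
        # Selection probabilities are needed for the IPW correction; here they
        # are crudely approximated by repeated posterior draws, then clipped
        # away from zero (a stand-in for a resampling step).
        draws = self.rng.multivariate_normal(beta_hat, cov, size=200)
        winners = np.argmax(contexts @ draws.T, axis=0)
        probs = np.bincount(winners, minlength=len(contexts)) / len(winners)
        probs = np.clip(probs, 0.05, 1.0)
        return int(np.argmax(scores)), probs

    def update(self, contexts, chosen, reward, probs):
        beta_hat = np.linalg.solve(self.V, self.b)
        imputed = contexts @ beta_hat                  # regression imputation
        pseudo = imputed.copy()
        pseudo[chosen] += (reward - imputed[chosen]) / probs[chosen]  # IPW correction
        # Ridge-style update using the contexts and pseudo-rewards of ALL arms.
        self.V += contexts.T @ contexts
        self.b += contexts.T @ pseudo
```

The key point is in `update`: every arm's context enters the regression statistics with a pseudo-reward, while only the chosen arm receives the inverse-probability-weighted correction based on the actually observed reward.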
4. Connections to Information-Directed Sampling and Posterior Optimization
Theoretical work establishes close relationships between D-TS approaches and Information-Directed Sampling (IDS), Generalized TS (GTS), and recent online optimization perspectives on TS.
- IDS: IDS explicitly optimizes over the trade-off between immediate regret and information gain. The D-TS structure, in several variants, can be interpreted as implementing an implicit or explicit information regularization, granting the algorithm additional ability to hedge against uncertainty by a double layer of stochasticity or uncertainty quantification (Zhou, 2015).
- Online Optimization and Stationary Bellman Formulations: TS can be recast as solving at each time step an online quadratic minimization regularized by a measure of posterior uncertainty, such as a point-biserial covariance. D-TS, by introducing an additional sampling layer, may be seen as “doubling” the regularizer or incorporating higher-order uncertainty, leading to priors or posteriors that better reflect information-theoretic risk. A plausible implication is that D-TS algorithms may be analyzed as optimizing similar objectives with enhanced regularization structures (Qu et al., 8 Oct 2025).
These connections provide the theoretical underpinning of why and how D-TS can achieve both empirical robustness and provable regret improvements in structured domains.
5. Empirical Performance and Benchmarking
Empirical assessments substantiate the theoretical advantages of D-TS:
- In dueling bandits, D-TS and its tie-breaking refinement D-TS+ consistently exhibit lower cumulative regret, faster convergence to asymptotic optimality, and reduced variance on a variety of synthetic and real-world datasets (such as MSLR and arXiv data), often outperforming UCB-type and alternative TS-based dueling algorithms (Wu et al., 2016).
- In contextual bandits with high-dimensional contexts, Doubly Robust TS variants outperform classic LinTS and Balanced LinTS (BLTS), particularly in early rounds and with ill-conditioned covariates, due to the exploitation of all available context data and improved parameter stabilizing properties (Kim et al., 2021).
- In finite stochastic games, DTS-RL converges more rapidly to ε-optimal policies and demonstrates lower regret growth under increasing state/action complexity, attributed directly to its dual posterior sampling and count-based adaptive exploration (Shi et al., 2022).
Algorithmic enhancements, such as the improved tie-breaking in D-TS+, further improve practical performance, especially when the problem structure (e.g., existence of a Condorcet winner) is favorable.
6. Extensions, Limitations, and Prospects
D-TS is inherently modular and has been considered or proposed in a range of applications:
- Extensions to Large and Complex Decision Spaces: Approximate D-TS variants, including those based on state-space truncation or approximate posterior computations (e.g., via MLE perturbation), have demonstrated statistical consistency with scalable computational properties, as in epidemic control and resource management (Hu et al., 2019).
- Integration of Local and Global Uncertainty: Techniques based on local latent variable uncertainty and variational inference (LU-TS, SIVI-TS) suggest that hybrid D-TS could combine local, context-specific exploration with global model robustness and stability. A plausible implication is that such architectures may harmonize fast context adaptation with error control (Wang et al., 2019).
- Count-Based and Policy-Regularized Exploration: In stochastic games and control, count-based regularization in policy sampling may further reduce regret, particularly in sparse or underexplored environments.
- Limitations: The primary practical limitation across D-TS variants is the potential computational cost of sampling or updating two independent posteriors, especially as model or action space size grows. The design and calibration of confidence thresholds, resampling steps, and estimator aggregation methods require problem-specific tuning.
Prospective research directions include extension to continuous state/action spaces, adversarial and non-stationary environments, and deeper theoretical integration of D-TS within the online optimization and dynamic programming frameworks.
In summary, Double Thompson Sampling constitutes a principled and theoretically robust extension of Bayesian sequential decision-making, combining dual layers of stochastic estimation or sampling to enhance the reliability, adaptability, and performance of classic TS across a variety of complex bandit, contextual, and reinforcement learning regimes. Its design is supported by careful probabilistic modeling, regret analysis, and empirical benchmarking, with multiple pathways for further refinement and application in modern sequential decision problems.