
Minimax Optimal Thompson Sampling

Updated 30 June 2025
  • Minimax Optimal Thompson Sampling is an approach that achieves regret bounds matching the minimax lower bounds in multi-armed bandit and reinforcement learning problems.
  • The method employs adaptive clipping and greedy modifications to control over-optimism, ensuring robust performance with theoretical guarantees like O(√(KT)) regret bounds.
  • It underpins advancements in sequential decision-making by improving gap dependence, enabling efficient implementations in stochastic, contextual, and batched settings.

Minimax Optimal Thompson Sampling refers to algorithmic strategies within the Thompson Sampling (TS) paradigm that achieve regret bounds matching the minimax lower bounds for all problem instances, up to constant factors. The topic occupies a central position in the study of sequential decision-making under uncertainty, particularly in stochastic multi-armed bandits, contextual bandits, and reinforcement learning. Below is a detailed exposition summarizing the historical development, theoretical results, practical methods, and persistent challenges associated with minimax optimal TS.


1. Theoretical Foundations of Minimax Optimality

Minimax optimality in the multi-armed bandit setting is defined by achieving, in the worst case over all problem instances, a regret that matches the lower bound up to constants. For a $K$-armed stochastic bandit and time horizon $T$, the minimax regret lower bound is $\Omega(\sqrt{KT})$.

Classical Thompson Sampling, as analyzed in "Analysis of Thompson Sampling for the multi-armed bandit problem" (1111.1797), achieves the following regret bounds in the standard stochastic $K$-armed bandit:

  • Two-armed bandit:

$$E[\mathcal{R}(T)] = O\left( \frac{\ln T}{\Delta} + \frac{1}{\Delta^3} \right)$$

where $\Delta$ is the gap between the best and second-best arm.

  • $N$-armed bandit:

$$E[\mathcal{R}(T)] = O\left( \left( \sum_{i=2}^{N} \frac{1}{\Delta_i^2} \right)^2 \ln T \right)$$

where $\Delta_i = \mu_1 - \mu_i$ are the gaps between the best arm and the others.

These rates are optimal in $T$ (i.e., logarithmic) but not in the gaps $\Delta_i$: the dependence on $\Delta_i$ is polynomially worse than the $O(1/\Delta_i)$ dependence of the instance-dependent lower bound. The UCB family, by contrast, achieves $E[\mathcal{R}(T)] = O\left( \sum_{i=2}^N \frac{1}{\Delta_i} \ln T \right)$, which matches the gap dependence of the lower bound.
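
To make the link between gap dependence and worst-case regret explicit, a standard argument (not specific to any one of the cited papers) converts a per-arm guarantee of the form $\Delta_i^2\,\mathbb{E}[T_i(T)] = O(\ln T)$ into a worst-case bound via Cauchy-Schwarz:

$$\mathcal{R}(T) = \sum_{i:\Delta_i>0} \Delta_i\, \mathbb{E}[T_i(T)] \le \sqrt{\Big(\sum_{i} \Delta_i^2\, \mathbb{E}[T_i(T)]\Big)\Big(\sum_{i} \mathbb{E}[T_i(T)]\Big)} \le \sqrt{C\, K\, T \ln T}$$

for some constant $C$. The residual $\sqrt{\ln T}$ factor relative to the $\Omega(\sqrt{KT})$ lower bound is precisely what minimax-optimal designs are built to remove.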

Minimax optimality for TS thus refers to variants of TS that avoid the loose gap dependence present in traditional TS regret bounds.
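
For reference, the following is a minimal sketch of classical Thompson Sampling with Beta-Bernoulli posteriors, the baseline that the variants in the next section modify; the function name and the `reward_fn` interface are illustrative choices, not code from the cited papers.

```python
import numpy as np

def thompson_sampling_bernoulli(reward_fn, K, T, rng=None):
    """Classical Thompson Sampling with Beta(1, 1) priors for Bernoulli rewards.

    reward_fn(arm) should return a reward in {0, 1}; this interface is illustrative.
    """
    rng = rng or np.random.default_rng()
    successes = np.zeros(K)   # posterior alpha - 1 for each arm
    failures = np.zeros(K)    # posterior beta - 1 for each arm
    for _ in range(T):
        # Sample one value from each arm's Beta posterior and play the argmax.
        theta = rng.beta(successes + 1, failures + 1)
        arm = int(np.argmax(theta))
        r = reward_fn(arm)
        successes[arm] += r
        failures[arm] += 1 - r
    return successes, failures
```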


2. Minimax Optimal TS: Key Algorithms and Major Advances

a. MOTS: Minimax Optimal Thompson Sampling

The MOTS algorithm, as introduced and analyzed in "MOTS: Minimax Optimal Thompson Sampling" (2003.01803), is the first Thompson Sampling-type algorithm rigorously shown to achieve the minimax optimal regret $O(\sqrt{KT})$ for arbitrary finite horizon $T$. The algorithm modifies standard TS via adaptive "clipping" of samples drawn from posterior distributions:

  1. For arm $i$, after $T_i(t)$ pulls, the empirical mean reward is $\hat{\mu}_i(t)$. Draw a sample $\tilde{\theta}_i$ from $\mathcal{N}(\hat{\mu}_i(t), 1/(\rho T_i(t)))$ for some $\rho \in (1/2, 1)$.
  2. Define a clip threshold $\tau_i(t) = \hat{\mu}_i(t) + \sqrt{\frac{\alpha}{T_i(t)} \log^+\left(\frac{T}{K T_i(t)}\right)}$, with a suitable constant $\alpha$.
  3. Use $\theta_i(t) = \min\{\tilde{\theta}_i, \tau_i(t)\}$ as the sample for TS selection.
  4. Pull the arm with the largest $\theta_i(t)$ and update its statistics (see the code sketch at the end of this subsection).

This adaptive clipping averts rare but costly over-optimism, which is the main bottleneck in problem-independent regret bounds for classical TS, and achieves:

$$R_\mu(T) = O\left( \sqrt{KT} + \sum_{i=2}^K \Delta_i \right)$$

and, for Gaussian rewards,

$$\lim_{T \rightarrow \infty} \frac{R_\mu(T)}{\log T} = \sum_{i: \Delta_i > 0} \frac{2}{\Delta_i}$$

matching the asymptotic lower bound of Lai and Robbins.
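
Below is a minimal sketch of the clipped sampling step described above, assuming Gaussian rewards with unit variance; the function name `mots_select` and the array-based interface are illustrative, not code from (2003.01803).

```python
import numpy as np

def mots_select(mu_hat, pulls, T, K, rho=0.75, alpha=4.0, rng=None):
    """One MOTS selection step: Gaussian posterior sampling with adaptive clipping.

    mu_hat[i] -- empirical mean of arm i
    pulls[i]  -- number of pulls of arm i (assumed >= 1 after initialization)
    rho       -- constant in (1/2, 1) scaling the sampling variance
    alpha     -- clipping constant from the MOTS analysis
    """
    rng = rng or np.random.default_rng()
    # Step 1: sample theta_tilde_i ~ N(mu_hat_i, 1 / (rho * T_i)).
    theta_tilde = rng.normal(mu_hat, np.sqrt(1.0 / (rho * pulls)))
    # Step 2: clipping threshold tau_i, with log^+(x) = max(log x, 0).
    log_plus = np.maximum(np.log(T / (K * pulls)), 0.0)
    tau = mu_hat + np.sqrt(alpha / pulls * log_plus)
    # Steps 3-4: clip the samples and play the arm with the largest clipped value.
    theta = np.minimum(theta_tilde, tau)
    return int(np.argmax(theta))
```

After the selected arm is pulled, its empirical mean and pull count are updated exactly as in standard TS; a single initial pull per arm ensures $T_i(t) \ge 1$.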

b. ExpTS and ExpTS$^+$: Exponential Family Bandits

"Finite-Time Regret of Thompson Sampling Algorithms for Exponential Family Multi-Armed Bandits" (2206.03520) introduces the ExpTS algorithm for one-dimensional exponential family bandits. The algorithm constructs a novel sampling distribution with tight anti-concentration properties, addressing underestimation and overestimation problems that limit classical TS. The refined method, ExpTS+^+, further adds a "greedy" exploitation step (playing the empirical mean with high probability) to eliminate the extra logK\sqrt{\log K} factor in the minimax regret:

$$R_\mu(T) = O\left(\sum_{i=2}^{K} \Delta_i + \sqrt{V K T}\right)$$

where $V$ is the maximum reward variance across arms.

This matches the minimax rate for exponential family rewards and is anytime (requires no knowledge of $T$).
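
A structural sketch of the ExpTS$^+$ greedy modification is given below, assuming for illustration that each arm's index equals its empirical mean with probability $1 - 1/K$ and is otherwise drawn from the exploratory sampling distribution; `sample_expts` is a placeholder for the paper's anti-concentration construction, which is not reproduced here.

```python
import numpy as np

def expts_plus_select(mu_hat, pulls, sample_expts, K, rng=None):
    """One ExpTS+-style selection step (structural sketch only).

    sample_expts(mu, n) -- placeholder for the exploratory sampling distribution
    of ExpTS; the exact construction from the paper is not reproduced here.
    """
    rng = rng or np.random.default_rng()
    theta = np.empty(K)
    for i in range(K):
        if rng.random() < 1.0 - 1.0 / K:
            theta[i] = mu_hat[i]                          # greedy: empirical mean
        else:
            theta[i] = sample_expts(mu_hat[i], pulls[i])  # exploratory sample
    return int(np.argmax(theta))
```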

c. Batched and Contextual Settings

"Batched Thompson Sampling" (2110.00202) demonstrates that Thompson Sampling, with adaptive batching, retains minimax regret up to a logT\sqrt{\log T} factor in bandits with limited feedback frequency:

$$\mathbb{E}[R(T)] \leq C \sqrt{T \log T}$$
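
As a minimal sketch of batched posterior sampling in the Bernoulli case: the doubling batch schedule below is a common heuristic used for illustration and is not necessarily the adaptive scheme of (2110.00202).

```python
import numpy as np

def batched_ts_bernoulli(reward_fn, K, T, rng=None):
    """Thompson Sampling with geometrically growing batches (illustrative).

    Posterior updates happen only at batch boundaries, modelling limited
    feedback frequency; the doubling schedule is a heuristic, not the
    adaptive rule of the cited paper.
    """
    rng = rng or np.random.default_rng()
    alpha = np.ones(K)
    beta = np.ones(K)
    t, batch_size = 0, 1
    while t < T:
        b = min(batch_size, T - t)
        # Within a batch, arms are drawn from the *frozen* posterior.
        theta = rng.beta(alpha, beta, size=(b, K))
        arms = theta.argmax(axis=1)
        rewards = np.array([reward_fn(a) for a in arms])
        # Feedback for the whole batch arrives at once; update the posterior.
        np.add.at(alpha, arms, rewards)
        np.add.at(beta, arms, 1 - rewards)
        t += b
        batch_size *= 2
    return alpha, beta
```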

"Adaptive Data Augmentation for Thompson Sampling" (2506.14479) shows minimax-optimal regret for linear contextual bandits can be achieved via carefully constructed data augmentation and estimator coupling, bypassing prior context distribution assumptions:

$$R(T) \leq O\left(d \sqrt{T} \log T \right)$$

This matches the minimax lower bound for linear contextual bandits up to logarithmic factors.
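
For context, here is a minimal sketch of standard frequentist LinTS with a ridge-regression posterior, i.e., the baseline that the augmentation scheme of (2506.14479) builds on; the code is not that scheme, and the inflation parameter `v` is the quantity referred to as variance inflation later in the text.

```python
import numpy as np

def lints_select(contexts, A, b, v=1.0, rng=None):
    """One step of standard linear Thompson Sampling (baseline sketch).

    contexts -- (K, d) array of feature vectors for the current round
    A, b     -- ridge statistics: A = lam*I + sum x x^T, b = sum r x
    v        -- posterior-inflation scale (frequentist analyses require v ~ sqrt(d))
    """
    rng = rng or np.random.default_rng()
    A_inv = np.linalg.inv(A)
    theta_hat = A_inv @ b
    # Sample a parameter vector from the Gaussian posterior N(theta_hat, v^2 A^{-1}).
    theta = rng.multivariate_normal(theta_hat, v**2 * A_inv)
    # Play the arm whose context maximizes the sampled linear reward.
    return int(np.argmax(contexts @ theta))

def lints_update(A, b, x, r):
    """Update ridge statistics after observing reward r for played context x."""
    return A + np.outer(x, x), b + r * x
```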


3. Practical Implementation and Efficiency

Minimax optimal TS algorithms are designed to be as practical and efficient as standard TS, preserving the advantages of ease of implementation and computational tractability. Typical algorithmic steps include:

  • Posterior sampling of arm means (often via conjugate priors or as Gaussian approximations).
  • Clipping or augmentation mechanisms that add minimal computational overhead.
  • For ExpTS$^+$, sampling is via explicit, invertible distributions for one-dimensional exponential families.
  • In contextual and linear settings, "adaptive" augmentation adds only $O(d)$ samples per round instead of inflating by the number of arms $K$.

Empirical evaluations across multiple works show that minimax optimal TS variants achieve regret uniformly close to known lower bounds and outcompete classical TS or UCB in worst-case settings, while retaining robust performance in benign regimes and in the presence of delayed or batched feedback.
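
As a toy illustration of such an empirical comparison, the driver below runs the `mots_select` sketch from Section 2 on a synthetic Gaussian instance and reports cumulative regret; the instance, horizon, and seed are arbitrary.

```python
import numpy as np

def run_mots(means, T, rng=None):
    """Toy driver: run the MOTS sketch on a Gaussian bandit with the given
    arm means (unit-variance noise). Returns the cumulative regret."""
    rng = rng or np.random.default_rng(0)
    K = len(means)
    mu_hat = np.zeros(K)
    pulls = np.zeros(K)
    regret = 0.0
    # Initialize by pulling each arm once.
    for i in range(K):
        r = means[i] + rng.standard_normal()
        pulls[i] += 1
        mu_hat[i] = r
        regret += max(means) - means[i]
    for _ in range(K, T):
        arm = mots_select(mu_hat, pulls, T, K)
        r = means[arm] + rng.standard_normal()
        # Incremental empirical-mean update.
        pulls[arm] += 1
        mu_hat[arm] += (r - mu_hat[arm]) / pulls[arm]
        regret += max(means) - means[arm]
    return regret

# Example: 10-armed instance with a single 0.2 gap.
print(run_mots([0.2] + [0.0] * 9, T=20_000))
```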


4. Comparison of Algorithms

| Algorithm | Regret (in $T$) | Gap Dependence | Minimax Optimal | Notes |
| --- | --- | --- | --- | --- |
| UCB | $O(\log T)$ | $O(1/\Delta_i)$ | Yes | Classical non-Bayesian optimism |
| Classical TS | $O(\log T)$ | $O(1/\Delta_i^4)$ | No | Not minimax in the gaps, still optimal in $T$ |
| MOTS, ExpTS$^+$ | $O(\sqrt{KT})$ | $O(1/\Delta_i)$ | Yes | Clipping / greedy modification |
| Batched TS (adaptive) | $O(\sqrt{T\log T})$ | $O(1/\Delta_i)$ | Near-optimal | With adaptive batching |
| LinTS (frequentist) | $O(d\sqrt{dT})$ | $d$ | No | Requires inflation in $d$ |
| Adaptive Augmentation TS | $O(d\sqrt{T})$ | $d$ | Yes | Arbitrary contexts, no assumptions |

Standard Thompson Sampling is empirically robust and nearly optimal; however, worst-case regret bounds can be poor in adversarial, high-dimensional, or poorly designed contextual scenarios unless augmented as described above.


5. Limitations, Extensions, and Open Problems

While the minimax optimality of TS variants is established for canonical stochastic bandits and some generalized settings (contextual bandits, exponential family rewards, and reinforcement learning with certain structure), notable caveats and ongoing areas of research remain:

  • Combinatorial Bandits: Standard TS suffers exponential regret in combinatorial semi-bandit problems under uniform priors (2102.05502). Establishing minimax optimal variants for these regimes remains an open research direction.
  • Linear Bandits (Frequentist): Variance inflation is necessary for LinTS to be minimax optimal in the frequentist sense unless geometric conditions on the contexts hold (2006.06790); data-driven inflation resolves some cases, but minimax optimality under fully adversarial contexts remains challenging.
  • Pure Exploration and Best-Arm Identification: TS is suboptimal for PAC-style pure exploration, but recent advances show that top-two style sampling with TS primitives can be minimax optimal for linear pure exploration (2310.06069).
  • General RL and Nonparametric Settings: Asymptotic optimality (sublinear regret) is established under TS for countable, non-Markov, and partially observable environments (1602.07905), provided the environment satisfies a recoverability condition.

6. Summary Table

| Setting | Proven Minimax-Optimal TS Variant | Regret Bound | Further Notes |
| --- | --- | --- | --- |
| Finite stochastic bandit | MOTS, ExpTS$^+$ | $O(\sqrt{KT})$ | Both anytime and gap-optimal |
| Linear contextual bandit | Adaptive augmentation TS (2506.14479) | $O(d\sqrt{T}\log T)$ | Arbitrary contexts, no diversity assumed |
| Exponential family bandit | ExpTS$^+$ | $O(\sqrt{VKT})$ | $V$ is the maximum arm variance |
| General RL | Resampled TS with recoverability (1602.07905) | $o(T)$ (sublinear) | Very general environments |
| Batched setting | Adaptive batched TS | $O(\sqrt{T\log T})$ | Near-minimax, anytime, few batches |
| Combinatorial / high-dimensional | None with uniform prior | Exponential regret | Requires further research |

7. Outlook

Minimax optimal Thompson Sampling is realized through principled algorithmic modifications—clipping, greedy sampling, or adaptive augmentation—that rigorously control posterior over- and under-exploration across all problem instances. These developments close theoretical gaps left by earlier analyses and ground the empirical efficacy of TS in robust worst-case regret guarantees. Remaining challenges include optimality in combinatorial, high-dimensional, and adversarially structured problems, where further advances in posterior design and adaptive control are expected to play a critical role.