
Minimax Optimal Thompson Sampling

Updated 30 June 2025
  • Minimax Optimal Thompson Sampling is an approach that achieves regret bounds matching the minimax lower bounds in multi-armed bandit and reinforcement learning problems.
  • The method employs adaptive clipping and greedy modifications to control over-optimism, ensuring robust performance with theoretical guarantees like O(√(KT)) regret bounds.
  • It underpins advancements in sequential decision-making by improving gap dependence, enabling efficient implementations in stochastic, contextual, and batched settings.

Minimax Optimal Thompson Sampling refers to algorithmic strategies within the Thompson Sampling (TS) paradigm that achieve regret bounds matching the minimax lower bounds for all problem instances, up to constant factors. The topic occupies a central position in the study of sequential decision-making under uncertainty, particularly in stochastic multi-armed bandits, contextual bandits, and reinforcement learning. Below is a detailed exposition summarizing the historical development, theoretical results, practical methods, and persistent challenges associated with minimax optimal TS.


1. Theoretical Foundations of Minimax Optimality

Minimax optimality in the multi-armed bandit setting is defined by achieving, in the worst case over all problem instances, a regret that matches the lower bound up to constants. For a $K$-armed stochastic bandit and time horizon $T$, the minimax regret lower bound is $\Omega(\sqrt{KT})$.

Classical Thompson Sampling, as analyzed in "Analysis of Thompson Sampling for the multi-armed bandit problem" (1111.1797), achieves the following regret bounds in the standard stochastic $K$-armed bandit:

  • Two-armed bandit:

$$E[\mathcal{R}(T)] = O\left( \frac{\ln T}{\Delta} + \frac{1}{\Delta^3} \right)$$

where $\Delta$ is the gap between the best and second-best arm.

  • $N$-armed bandit:

$$E[\mathcal{R}(T)] = O\left( \left( \sum_{i=2}^{N} \frac{1}{\Delta_i^2} \right)^2 \ln T \right)$$

where $\Delta_i = \mu_1 - \mu_i$ are the gaps between the best arm and the others.

These rates are optimal in $T$ (i.e., logarithmic) but not in the gaps $\Delta_i$: the dependence on $\Delta_i$ is polynomially worse than the $O(1/\Delta_i)$ dependence of the instance-dependent lower bound. The UCB family, by contrast, achieves $E[\mathcal{R}(T)] = O\left( \sum_{i=2}^N \frac{1}{\Delta_i} \ln T \right)$, which matches the gap dependence of the lower bound.
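
To make the link between gap dependence and worst-case regret explicit, a standard argument (not specific to any one of the cited papers) converts a per-arm guarantee of the form $\Delta_i^2\,\mathbb{E}[T_i(T)] = O(\ln T)$ into a worst-case bound via Cauchy-Schwarz:

$$\mathcal{R}(T) = \sum_{i:\Delta_i>0} \Delta_i\, \mathbb{E}[T_i(T)] \le \sqrt{\Big(\sum_{i} \Delta_i^2\, \mathbb{E}[T_i(T)]\Big)\Big(\sum_{i} \mathbb{E}[T_i(T)]\Big)} \le \sqrt{C\, K\, T \ln T}$$

for some constant $C$. The residual $\sqrt{\ln T}$ factor relative to the $\Omega(\sqrt{KT})$ lower bound is precisely what minimax-optimal designs are built to remove.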

Minimax optimality for TS thus refers to variants of TS that avoid the loose gap dependence present in traditional TS regret bounds.
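
For reference, the following is a minimal sketch of classical Thompson Sampling with Beta-Bernoulli posteriors, the baseline that the variants in the next section modify; the function name and the `reward_fn` interface are illustrative choices, not code from the cited papers.

```python
import numpy as np

def thompson_sampling_bernoulli(reward_fn, K, T, rng=None):
    """Classical Thompson Sampling with Beta(1, 1) priors for Bernoulli rewards.

    reward_fn(arm) should return a reward in {0, 1}; this interface is illustrative.
    """
    rng = rng or np.random.default_rng()
    successes = np.zeros(K)   # posterior alpha - 1 for each arm
    failures = np.zeros(K)    # posterior beta - 1 for each arm
    for _ in range(T):
        # Sample one value from each arm's Beta posterior and play the argmax.
        theta = rng.beta(successes + 1, failures + 1)
        arm = int(np.argmax(theta))
        r = reward_fn(arm)
        successes[arm] += r
        failures[arm] += 1 - r
    return successes, failures
```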


2. Minimax Optimal TS: Key Algorithms and Major Advances

a. MOTS: Minimax Optimal Thompson Sampling

The MOTS algorithm, as introduced and analyzed in "MOTS: Minimax Optimal Thompson Sampling" (2003.01803), is the first Thompson Sampling-type algorithm rigorously shown to achieve the minimax optimal regret $O(\sqrt{KT})$ for arbitrary finite horizon $T$. The algorithm modifies standard TS via adaptive "clipping" of samples drawn from posterior distributions:

  1. For arm $i$, after $T_i(t)$ pulls, the empirical mean reward is $\hat{\mu}_i(t)$. Draw a sample $\tilde{\theta}_i$ from $\mathcal{N}(\hat{\mu}_i(t), 1/(\rho T_i(t)))$ for some $\rho \in (1/2, 1)$.
  2. Define a clip threshold $\tau_i(t) = \hat{\mu}_i(t) + \sqrt{\frac{\alpha}{T_i(t)} \log^+\left(\frac{T}{K T_i(t)}\right)}$, with a suitable constant $\alpha$.
  3. Use $\theta_i(t) = \min\{\tilde{\theta}_i, \tau_i(t)\}$ as the sample for TS selection.
  4. Pull the arm with the largest $\theta_i(t)$ and update its statistics (see the code sketch at the end of this subsection).

This adaptive clipping averts rare but costly over-optimism, which is the main bottleneck in problem-independent regret bounds for classical TS, and achieves:

$$R_\mu(T) = O\left( \sqrt{KT} + \sum_{i=2}^K \Delta_i \right)$$

and, for Gaussian rewards,

$$\lim_{T \rightarrow \infty} \frac{R_\mu(T)}{\log T} = \sum_{i: \Delta_i > 0} \frac{2}{\Delta_i}$$

matching the asymptotic lower bound of Lai and Robbins.
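
Below is a minimal sketch of the clipped sampling step described above, assuming Gaussian rewards with unit variance; the function name `mots_select` and the array-based interface are illustrative, not code from (2003.01803).

```python
import numpy as np

def mots_select(mu_hat, pulls, T, K, rho=0.75, alpha=4.0, rng=None):
    """One MOTS selection step: Gaussian posterior sampling with adaptive clipping.

    mu_hat[i] -- empirical mean of arm i
    pulls[i]  -- number of pulls of arm i (assumed >= 1 after initialization)
    rho       -- constant in (1/2, 1) scaling the sampling variance
    alpha     -- clipping constant from the MOTS analysis
    """
    rng = rng or np.random.default_rng()
    # Step 1: sample theta_tilde_i ~ N(mu_hat_i, 1 / (rho * T_i)).
    theta_tilde = rng.normal(mu_hat, np.sqrt(1.0 / (rho * pulls)))
    # Step 2: clipping threshold tau_i, with log^+(x) = max(log x, 0).
    log_plus = np.maximum(np.log(T / (K * pulls)), 0.0)
    tau = mu_hat + np.sqrt(alpha / pulls * log_plus)
    # Steps 3-4: clip the samples and play the arm with the largest clipped value.
    theta = np.minimum(theta_tilde, tau)
    return int(np.argmax(theta))
```

After the selected arm is pulled, its empirical mean and pull count are updated exactly as in standard TS; a single initial pull per arm ensures $T_i(t) \ge 1$.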

b. ExpTS and ExpTS$^+$: Exponential Family Bandits

"Finite-Time Regret of Thompson Sampling Algorithms for Exponential Family Multi-Armed Bandits" (2206.03520) introduces the ExpTS algorithm for one-dimensional exponential family bandits. The algorithm constructs a novel sampling distribution with tight anti-concentration properties, addressing underestimation and overestimation problems that limit classical TS. The refined method, ExpTS+^+, further adds a "greedy" exploitation step (playing the empirical mean with high probability) to eliminate the extra logK\sqrt{\log K} factor in the minimax regret:

$$R_\mu(T) = O\left(\sum_{i=2}^{K} \Delta_i + \sqrt{V K T}\right)$$

where $V$ is the maximum reward variance across arms.

This matches the minimax rate for exponential family rewards and is anytime (requires no knowledge of $T$).
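
A structural sketch of the ExpTS$^+$ greedy modification is given below, assuming for illustration that each arm's index equals its empirical mean with probability $1 - 1/K$ and is otherwise drawn from the exploratory sampling distribution; `sample_expts` is a placeholder for the paper's anti-concentration construction, which is not reproduced here.

```python
import numpy as np

def expts_plus_select(mu_hat, pulls, sample_expts, K, rng=None):
    """One ExpTS+-style selection step (structural sketch only).

    sample_expts(mu, n) -- placeholder for the exploratory sampling distribution
    of ExpTS; the exact construction from the paper is not reproduced here.
    """
    rng = rng or np.random.default_rng()
    theta = np.empty(K)
    for i in range(K):
        if rng.random() < 1.0 - 1.0 / K:
            theta[i] = mu_hat[i]                          # greedy: empirical mean
        else:
            theta[i] = sample_expts(mu_hat[i], pulls[i])  # exploratory sample
    return int(np.argmax(theta))
```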

c. Batched and Contextual Settings

"Batched Thompson Sampling" (2110.00202) demonstrates that Thompson Sampling, with adaptive batching, retains minimax regret up to a logT\sqrt{\log T} factor in bandits with limited feedback frequency:

$$\mathbb{E}[R(T)] \leq C \sqrt{T \log T}$$
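
As a minimal sketch of batched posterior sampling in the Bernoulli case: the doubling batch schedule below is a common heuristic used for illustration and is not necessarily the adaptive scheme of (2110.00202).

```python
import numpy as np

def batched_ts_bernoulli(reward_fn, K, T, rng=None):
    """Thompson Sampling with geometrically growing batches (illustrative).

    Posterior updates happen only at batch boundaries, modelling limited
    feedback frequency; the doubling schedule is a heuristic, not the
    adaptive rule of the cited paper.
    """
    rng = rng or np.random.default_rng()
    alpha = np.ones(K)
    beta = np.ones(K)
    t, batch_size = 0, 1
    while t < T:
        b = min(batch_size, T - t)
        # Within a batch, arms are drawn from the *frozen* posterior.
        theta = rng.beta(alpha, beta, size=(b, K))
        arms = theta.argmax(axis=1)
        rewards = np.array([reward_fn(a) for a in arms])
        # Feedback for the whole batch arrives at once; update the posterior.
        np.add.at(alpha, arms, rewards)
        np.add.at(beta, arms, 1 - rewards)
        t += b
        batch_size *= 2
    return alpha, beta
```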

"Adaptive Data Augmentation for Thompson Sampling" (2506.14479) shows minimax-optimal regret for linear contextual bandits can be achieved via carefully constructed data augmentation and estimator coupling, bypassing prior context distribution assumptions:

$$R(T) \leq O\left(d \sqrt{T} \log T \right)$$

This matches the minimax lower bound for linear contextual bandits up to logarithmic factors.
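
For context, here is a minimal sketch of standard frequentist LinTS with a ridge-regression posterior, i.e., the baseline that the augmentation scheme of (2506.14479) builds on; the code is not that scheme, and the inflation parameter `v` is the quantity referred to as variance inflation later in the text.

```python
import numpy as np

def lints_select(contexts, A, b, v=1.0, rng=None):
    """One step of standard linear Thompson Sampling (baseline sketch).

    contexts -- (K, d) array of feature vectors for the current round
    A, b     -- ridge statistics: A = lam*I + sum x x^T, b = sum r x
    v        -- posterior-inflation scale (frequentist analyses require v ~ sqrt(d))
    """
    rng = rng or np.random.default_rng()
    A_inv = np.linalg.inv(A)
    theta_hat = A_inv @ b
    # Sample a parameter vector from the Gaussian posterior N(theta_hat, v^2 A^{-1}).
    theta = rng.multivariate_normal(theta_hat, v**2 * A_inv)
    # Play the arm whose context maximizes the sampled linear reward.
    return int(np.argmax(contexts @ theta))

def lints_update(A, b, x, r):
    """Update ridge statistics after observing reward r for played context x."""
    return A + np.outer(x, x), b + r * x
```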


3. Practical Implementation and Efficiency

Minimax optimal TS algorithms are designed to be as practical and efficient as standard TS, preserving the advantages of ease of implementation and computational tractability. Typical algorithmic steps include:

  • Posterior sampling of arm means (often via conjugate priors or as Gaussian approximations).
  • Clipping or augmentation mechanisms that add minimal computational overhead.
  • For ExpTS$^+$, sampling is via explicit, invertible distributions for one-dimensional exponential families.
  • In contextual and linear settings, "adaptive" augmentation adds only $O(d)$ samples per round instead of inflating by the number of arms $K$.

Empirical evaluations across multiple works show that minimax optimal TS variants achieve regret uniformly close to known lower bounds and outcompete classical TS or UCB in worst-case settings, while retaining robust performance in benign regimes and in the presence of delayed or batched feedback.
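
As a toy illustration of such an empirical comparison, the driver below runs the `mots_select` sketch from Section 2 on a synthetic Gaussian instance and reports cumulative regret; the instance, horizon, and seed are arbitrary.

```python
import numpy as np

def run_mots(means, T, rng=None):
    """Toy driver: run the MOTS sketch on a Gaussian bandit with the given
    arm means (unit-variance noise). Returns the cumulative regret."""
    rng = rng or np.random.default_rng(0)
    K = len(means)
    mu_hat = np.zeros(K)
    pulls = np.zeros(K)
    regret = 0.0
    # Initialize by pulling each arm once.
    for i in range(K):
        r = means[i] + rng.standard_normal()
        pulls[i] += 1
        mu_hat[i] = r
        regret += max(means) - means[i]
    for _ in range(K, T):
        arm = mots_select(mu_hat, pulls, T, K)
        r = means[arm] + rng.standard_normal()
        # Incremental empirical-mean update.
        pulls[arm] += 1
        mu_hat[arm] += (r - mu_hat[arm]) / pulls[arm]
        regret += max(means) - means[arm]
    return regret

# Example: 10-armed instance with a single 0.2 gap.
print(run_mots([0.2] + [0.0] * 9, T=20_000))
```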


4. Comparison of Algorithms

| Algorithm | Regret (in $T$) | Gap Dependence | Minimax Optimal | Notes |
| --- | --- | --- | --- | --- |
| UCB | $O(\log T)$ | $O(1/\Delta_i)$ | Yes | Classical non-Bayesian optimism |
| Classical TS | $O(\log T)$ | $O(1/\Delta_i^4)$ | No | Not minimax in the gaps, still optimal in $T$ |
| MOTS, ExpTS$^+$ | $O(\sqrt{KT})$ | $O(1/\Delta_i)$ | Yes | Clipping / greedy modification |
| Batched TS (adaptive) | $O(\sqrt{T\log T})$ | $O(1/\Delta_i)$ | Near-optimal | With adaptive batching |
| LinTS (frequentist) | $O(d\sqrt{dT})$ | $d$ | No | Requires inflation in $d$ |
| Adaptive Augmentation TS | $O(d\sqrt{T})$ | $d$ | Yes | Arbitrary contexts, no assumptions |

Standard Thompson Sampling is empirically robust and nearly optimal; however, worst-case regret bounds can be poor in adversarial, high-dimensional, or poorly designed contextual scenarios unless augmented as described above.


5. Limitations, Extensions, and Open Problems

While the minimax optimality of TS variants is established for canonical stochastic bandits and some generalized settings (contextual bandits, exponential family rewards, and reinforcement learning with certain structure), notable caveats and ongoing areas of research remain:

  • Combinatorial Bandits: Standard TS suffers exponential regret in combinatorial semi-bandit problems under uniform priors (2102.05502). Establishing minimax optimal variants for these regimes remains an open research direction.
  • Linear Bandits (Frequentist): Variance inflation is necessary for LinTS to be minimax optimal in the frequentist sense unless geometric conditions on the contexts hold (2006.06790); data-driven inflation resolves some cases, but minimax optimality under fully adversarial contexts remains challenging.
  • Pure Exploration and Best-Arm Identification: TS is suboptimal for PAC-style pure exploration, but recent advances show that top-two style sampling with TS primitives can be minimax optimal for linear pure exploration (2310.06069).
  • General RL and Nonparametric Settings: Asymptotic optimality (sublinear regret) is established under TS for countable, non-Markov, and partially observable environments (1602.07905), provided the environment satisfies a recoverability condition.

6. Summary Table

| Setting | Proven Minimax-Optimal TS Variant | Regret Bound | Further Notes |
| --- | --- | --- | --- |
| Finite stochastic bandit | MOTS, ExpTS$^+$ | $O(\sqrt{KT})$ | Both anytime and gap-optimal |
| Linear contextual bandit | Adaptive augmentation TS (2506.14479) | $O(d\sqrt{T}\log T)$ | Arbitrary contexts, no diversity assumed |
| Exponential family bandit | ExpTS$^+$ | $O(\sqrt{VKT})$ | $V$ is the maximum arm variance |
| General RL | Resampled TS with recoverability (1602.07905) | $o(T)$ (sublinear) | Very general environments |
| Batched setting | Adaptive batched TS | $O(\sqrt{T\log T})$ | Near-minimax, anytime, few batches |
| Combinatorial / high-dimensional | None with uniform prior | Exponential regret | Requires further research |

7. Outlook

Minimax optimal Thompson Sampling is realized through principled algorithmic modifications—clipping, greedy sampling, or adaptive augmentation—that rigorously control posterior over- and under-exploration across all problem instances. These developments close theoretical gaps left by earlier analyses and ground the empirical efficacy of TS in robust worst-case regret guarantees. Remaining challenges include optimality in combinatorial, high-dimensional, and adversarially structured problems, where further advances in posterior design and adaptive control are expected to play a critical role.