Feel-Good Thompson Sampling (FGTS)

Updated 9 November 2025
  • Feel-Good Thompson Sampling (FGTS) is a posterior-sampling method that adds an optimism bonus to classical TS, promoting aggressive exploration in decision-making problems.
  • It achieves minimax-optimal regret in linear bandits, contextual dueling bandits, and reinforcement learning by biasing the posterior toward high-reward outcomes.
  • FGTS employs advanced sampling techniques like LMC, MALA, and HMC, with extensions for smoothing and variance awareness to balance exploration and computational efficiency.

Feel-Good Thompson Sampling (FGTS) is a posterior-sampling-based methodology for sequential decision making that augments classical Thompson Sampling (TS) with an explicit optimism bonus. This construction incentivizes exploration more aggressively than standard TS by biasing the posterior toward high-reward explanations, achieving minimax-optimal regret guarantees in both contextual bandit and reinforcement learning settings under appropriate conditions. FGTS and its extensions have been systematically studied in both exact and approximate posterior regimes, with substantive implications for scalability, sampling algorithms, and empirical performance on bandit and reinforcement learning benchmarks.

1. Core Algorithmic Principle and Mathematical Formulation

Feel-Good Thompson Sampling operates within the contextual bandit framework. At each round $t$:

  • The agent observes a context $\mathbf{x}_t$, with a (possibly finite) action set $\mathcal{X}_t \subset \mathbb{R}^d$;
  • For each arm $a \in \mathcal{X}_t$, the associated feature is $\phi(\mathbf{x}_t, a) \in \mathbb{R}^d$;
  • Pulling arm $a$ yields a reward $r_t = f_{\theta^*}(\phi(\mathbf{x}_t, a)) + \xi_t$, where $\theta^* \in \mathbb{R}^d$ is unknown.

Standard TS maintains a Bayesian posterior

$$\pi_t(\theta) \propto \exp\left(-\eta \sum_{s<t}\left[ f_\theta(\phi(\mathbf{x}_s, a_s)) - r_s \right]^2 \right) p_0(\theta)$$

and draws $\theta_t \sim \pi_t(\theta)$. The action choice is

$$a_t = \arg\max_{a \in \mathcal{X}_t} \langle \theta_t, \phi(\mathbf{x}_t, a) \rangle.$$

FGTS modifies the likelihood through the inclusion of a feel-good bonus:

$$\mathcal{L}_t^{\mathrm{FG}}(\theta) = \sum_{s=1}^{t-1} \left[ \eta \left(f_\theta(\phi(\mathbf{x}_s, a_s)) - r_s\right)^2 - \lambda \min\left(b, f_\theta(\phi(\mathbf{x}_s, a_s))\right) \right],$$

with $\lambda > 0$ (bonus scale), $b$ (bonus cap), and $\eta$ the inverse noise variance. The posterior becomes $\pi_t^{\mathrm{FG}}(\theta) \propto \exp\left(-\mathcal{L}_t^{\mathrm{FG}}(\theta)\right) p_0(\theta)$, from which $\theta_t$ is drawn, and the per-action bonus is $b_t(\phi(\mathbf{x}, a)) = \lambda \min(b, f_{\theta_t}(\phi(\mathbf{x}, a)))$. The action is selected via

$$a_t = \arg\max_{a \in \mathcal{X}_t} \left[ \langle \theta_t, \phi(\mathbf{x}_t, a) \rangle + b_t(\phi(\mathbf{x}_t, a)) \right].$$

A smoothed variant, SFG-TS, replaces the hard minimum with the smooth surrogate

$$\Phi_s(u) = \frac{1}{s} \log\left(1 + \exp(su)\right),$$

enabling gradient-based sampling in nonconvex neural bandits by smoothing the non-differentiability in the bonus.
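
To make the construction concrete, the sketch below instantiates the feel-good loss, the smoothed bonus, and the bonus-augmented action rule for a linear model $f_\theta(\phi) = \theta^\top \phi$ with a Gaussian prior. All function and variable names are illustrative assumptions of this sketch, not identifiers from the cited papers.

```python
import numpy as np

def feel_good_loss(theta, Phi, r, eta=1.0, lam=0.01, b=1000.0, prior_var=1.0):
    """Negative log of the FGTS posterior (up to a constant), linear model assumed.

    Phi : (t-1, d) array of features phi(x_s, a_s) for the arms pulled so far
    r   : (t-1,)  array of observed rewards
    """
    preds = Phi @ theta                                   # f_theta(phi(x_s, a_s))
    squared_error = eta * np.sum((preds - r) ** 2)        # data-fit term
    feel_good_bonus = lam * np.sum(np.minimum(b, preds))  # capped optimism term
    neg_log_prior = 0.5 * np.sum(theta ** 2) / prior_var  # Gaussian prior p_0 (assumed)
    return squared_error - feel_good_bonus + neg_log_prior

def smoothed_bonus(u, s=10.0):
    """SFG-TS surrogate Phi_s(u) = (1/s) * log(1 + exp(s*u)), computed stably."""
    return np.logaddexp(0.0, s * u) / s

def select_action(theta_sample, candidate_features, lam=0.01, b=1000.0):
    """Pick the arm maximizing <theta, phi> + lambda * min(b, f_theta(phi))."""
    preds = candidate_features @ theta_sample
    scores = preds + lam * np.minimum(b, preds)
    return int(np.argmax(scores))
```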

2. Theoretical Regret Guarantees

FGTS admits minimax-optimal regret bounds in several regimes:

  • Linear Bandits (Exact Posterior): For $f_\theta(\phi) = \theta^\top \phi$, with exact Gaussian posterior inference and well-chosen $(\lambda, b)$, the regret is

$$R(T) = \mathbb{E}\left[ \sum_{t=1}^T r_t(x_t^*) - r_t(x_t) \right] = O(d \sqrt{T}),$$

matching the information-theoretic lower bound $\Omega(d \sqrt{T})$ (Zhang, 2021).

  • Frequentist Regret with Finite Actions: With sub-Gaussian rewards and $|\mathcal{A}(x_t)| \leq K$, applying FGTS with $\eta \leq \frac{1}{4}$, $b = 1$, and appropriate $\lambda$ yields

$$O\left( \sqrt{K T \ln |\Omega|} \right)$$

regret, attaining the minimax rate.

  • Infinite/Linearly Embeddable Actions: Assuming a bilinear structure $f(\theta, x, a) = w(\theta, x)^\top \phi(x, a)$ and $f(\theta, x, a) \geq -b$, the regret is

$$O\left( \sqrt{K T d \ln T} \right)$$

in a $d$-dimensional linear parametric setting.

These results are underpinned by a decoupling-coefficient analysis, which connects the regret decomposition to the analysis of online least squares, leveraging tools from the theory of online prediction and aggregation (Zhang, 2021, Li et al., 3 Nov 2025).

3. Practical Implementation and Sampling Algorithms

Implementation of FGTS for large-scale or non-Gaussian models (e.g., neural bandits) requires approximate sampling from the modified posterior. The main strategies are Langevin Monte Carlo (LMC), the Metropolis-adjusted Langevin algorithm (MALA), and Hamiltonian Monte Carlo (HMC); the simplest is LMC:

  • Langevin Monte Carlo (LMC): For the loss $\mathcal{L}_t^{\mathrm{FG}}(\theta)$, LMC iterates

$$\theta \leftarrow \theta - \tau \nabla \mathcal{L}_t^{\mathrm{FG}}(\theta) + \sqrt{2 \beta^{-1} \tau}\, \xi, \quad \xi \sim \mathcal{N}(0, I)$$

with step-size $\tau$ and inverse temperature $\beta$; a minimal implementation sketch follows.
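
The sketch below instantiates this update for the linear model used earlier. The analytic gradient uses a subgradient of the hard cap $\min(b, \cdot)$ and includes a Gaussian-prior term; both choices, and all names, are assumptions of this illustration rather than prescriptions from the referenced papers.

```python
import numpy as np

def feel_good_grad(theta, Phi, r, eta=1.0, lam=0.01, b=1000.0, prior_var=1.0):
    """Gradient of the linear-model feel-good loss (subgradient through min(b, .))."""
    preds = Phi @ theta
    grad_sq = 2.0 * eta * Phi.T @ (preds - r)      # from the squared-error term
    active = (preds < b).astype(float)             # min(b, pred) has slope 1 below the cap
    grad_bonus = lam * Phi.T @ active              # from the feel-good bonus
    grad_prior = theta / prior_var                 # from the assumed Gaussian prior
    return grad_sq - grad_bonus + grad_prior

def lmc_sample(Phi, r, d, n_steps=100, tau=1e-3, beta=1.0, rng=None, **loss_kw):
    """Unadjusted Langevin iterations approximating a draw from the FGTS posterior
    exp(-L_t^FG(theta)) * p_0(theta), tempered by beta."""
    rng = np.random.default_rng() if rng is None else rng
    theta = rng.normal(size=d) * np.sqrt(loss_kw.get("prior_var", 1.0))  # init from prior
    for _ in range(n_steps):
        noise = rng.normal(size=d)
        theta = (theta
                 - tau * feel_good_grad(theta, Phi, r, **loss_kw)
                 + np.sqrt(2.0 * tau / beta) * noise)
    return theta
```

Adding a Metropolis-Hastings accept/reject step to each iteration turns this proposal into MALA, the adjusted variant covered by the cost estimates below.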

Computational cost depends on the model class and sampler:

  • Exact linear FGTS: $O(d^3)$ per round.
  • LMC/MALA: $O(K \cdot d)$ per step, with $K \approx 100$ for practical mixing.
  • HMC: $O(L \cdot d)$ per leapfrog step, with $L \approx 10$.

Memory usage is dominated by the posterior structure and is especially significant when backpropagating through unrolled MCMC steps for neural networks.

Hyperparameter tuning for $(\lambda, b, \eta, s)$ is typically not prohibitive; defaults such as $\lambda \in \{0.01, 0.1\}$ and smoothing $s \in [5, 20]$ are sufficient in most settings (Anand et al., 21 Jul 2025).

4. Empirical Performance and Trade-offs

Comprehensive benchmarking across synthetic and real datasets highlights the following patterns (Anand et al., 21 Jul 2025):

  • Exact Posterior Regimes (Linear/Logistic Bandits): FG-TS (MALA or HMC) yields 10–20% lower cumulative regret vs. TS or LinUCB; SFG-TS matches or slightly improves over closed-form LinTS in logistic bandits.
  • Approximate Posterior Regimes (Neural Bandits, Stochastic-Gradient MCMC): FG-TS can degrade regret due to amplification of sampling noise (especially with large bonuses), leading to instability in neural settings. Vanilla stochastic-gradient LMC-TS is typically more reliable for these cases.
  • Bonus Scale Sensitivity: $\lambda \approx 0.01$ is generally optimal; larger values ($\lambda \geq 0.5$) harm regret unless the sampler approximates the posterior very accurately.
  • Preconditioning and Prior Strength: HMC benefits from preconditioning for ill-conditioned problems, but aggressive preconditioning can be detrimental in LMC without MH filtering. Mild prior regularization stabilizes exploration, but excessively tight priors suppress effective exploration.

Empirical ablations confirm that the exploration benefit of the FGTS bonus is meaningful only when the posterior approximation is reliable—otherwise, the bonus exacerbates estimation noise and leads to erratic exploration.

5. Extensions: Smoothed, Variance-Aware, Dueling, and RL Regimes

FGTS forms the basis for multiple variants and extensions:

  • Smoothed FGTS (SFG-TS): Substitutes the hard cap $\min(b, f)$ with the smooth surrogate $\Phi_s$ to facilitate MCMC in models with nondifferentiable activations.
  • Variance-Aware FGTS (FGTS-VA): Weights exploration and loss contributions by observed noise variances, with regret $\tilde{O}\left(\sqrt{\mathrm{dc} \cdot \log|\mathcal{F}| \sum_{t=1}^T \sigma_t^2} + \mathrm{dc}\right)$ in the finite model-class case, matching state-of-the-art UCB-based rates for weighted linear bandits (Li et al., 3 Nov 2025); an illustrative weighting sketch follows this list.
  • FGTS for Contextual Dueling Bandits (FGTS.CDB): Adapts the FG bonus to the dueling-bandit formulation, leverages conditional independence of the sampled arms, and achieves minimax-optimal $\tilde{O}(d\sqrt{T})$ regret (Li et al., 9 Apr 2024).
  • Reinforcement Learning (RL): The FGTS principle applies to linear MDPs and general RL by introducing a feel-good prior at initial stages and using squared Bellman error in subsequent steps. Empirical results with approximate sampling (LMC/ULMC) integrated into DQN architectures demonstrate superior deep exploration and performance on RL benchmarks (e.g., N-chain, Atari hard exploration games) (Ishfaq et al., 18 Jun 2024).
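
As a purely illustrative reading of the variance-aware idea (the precise weighting used by FGTS-VA is specified in Li et al., 3 Nov 2025 and may differ), the sketch below down-weights noisy rounds in the data-fit term by inverse variance estimates while keeping the feel-good bonus unchanged; the names and the weighting choice are assumptions of this example.

```python
import numpy as np

def variance_weighted_fg_loss(theta, Phi, r, sigma2, eta=1.0, lam=0.01, b=1000.0):
    """Illustrative variance-aware feel-good loss with inverse-variance weights.

    sigma2 : (t-1,) per-round noise-variance estimates.
    """
    preds = Phi @ theta
    weights = 1.0 / np.maximum(sigma2, 1e-8)              # down-weight noisy rounds
    data_term = eta * np.sum(weights * (preds - r) ** 2)  # weighted squared error
    bonus_term = lam * np.sum(np.minimum(b, preds))       # feel-good bonus, as before
    return data_term - bonus_term
```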

The table summarizes key variants and their regret guarantees:

| FGTS Variant | Setting | Regret Bound |
| --- | --- | --- |
| FG-TS (base) | Linear bandit (exact posterior) | $O(d\sqrt{T})$ |
| SFG-TS | Logistic/neural bandit | $O(d\sqrt{T})$ (with accurate sampler) |
| FGTS-VA | Linear bandit, variance-aware | $\tilde{O}(d\sqrt{\Lambda} + d)$, with $\Lambda$ the cumulative noise variance |
| FGTS.CDB | Contextual dueling bandit | $\tilde{O}(d\sqrt{T})$ |
| FGTS-RL | Linear MDP | $O(dH^{3/2}\sqrt{T}\log(dT))$ |

6. Implementation Guidance and Recommendations

Best practices for applying FGTS and its variants are well-characterized (Anand et al., 21 Jul 2025, Ishfaq et al., 18 Jun 2024):

  • Linear/Logistic Bandits (Exact): Prefer FG-TS or SFG-TS with $\lambda = 0.01$ and $b \sim 1000$; use MALA or HMC sampling.
  • Neural/Approximate Settings: Default to LMC-TS or neural-specific methods; small or zero FG bonuses are safer.
  • Smoothing: Use SFG-TS with $s \approx 10$ for nondifferentiable models.
  • Parameter Tuning: Use a small grid search over $\lambda$ and moderate regularization; tuning can usually be limited to a narrow parameter regime (a configuration sketch follows this list).
  • Computational Cost: Consider sampler selection based on mixing efficiency versus implementation complexity; ULMC is preferable in strongly log-concave posterior landscapes for accelerated mixing.
  • Empirical Reliability: Aggressive optimism bonuses are only beneficial when the posterior sampling is accurate; otherwise, bonus-driven exploration should be moderated.
  • Open-source Reference: Code and experimental framework are available at https://github.com/SarahLiaw/ctx-bandits-mcmc-showdown, facilitating reproducibility in bandit experiments.
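
The recommendations above can be condensed into a small configuration helper. The setting names, the function itself, and the $\eta = 0.25$ default (motivated by the $\eta \leq \frac{1}{4}$ condition cited earlier) are illustrative assumptions, not values fixed by the referenced papers.

```python
def recommended_fgts_config(setting: str) -> dict:
    """Heuristic starting-point hyperparameters distilled from the guidance above."""
    if setting == "linear_or_logistic_exact":
        return dict(algorithm="FG-TS", sampler="mala_or_hmc",
                    lam=0.01, b=1000.0, eta=0.25)
    if setting == "nondifferentiable_model":
        return dict(algorithm="SFG-TS", sampler="mala_or_hmc",
                    lam=0.01, b=1000.0, eta=0.25, smoothing_s=10.0)
    if setting == "neural_approximate":
        # a small or zero feel-good bonus is safer when posterior sampling is approximate
        return dict(algorithm="LMC-TS", sampler="sg_lmc", lam=0.0)
    raise ValueError(f"unknown setting: {setting}")

# A narrow tuning grid usually suffices:
LAMBDA_GRID = [0.01, 0.1]           # bonus scale
SMOOTHING_GRID = [5.0, 10.0, 20.0]  # only relevant for SFG-TS
```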

7. Connections, Limitations, and Perspective

FGTS provides a unifying framework for optimism-driven exploration in posterior sampling. The decoupling-coefficient theory undergirds minimax-optimal regret in both bandit and RL contexts. However, empirical investigations reveal that the core advantage of FGTS, bonus-driven optimism, is contingent upon high-fidelity sampling. In high-noise or large-scale neural regimes, excessive optimism can degrade performance. A plausible implication is that scalability to deep models is fundamentally limited by sampler fidelity and the stability of the modified posterior. Thus, FGTS serves as a robust, theoretically grounded baseline in medium-scale linear/logistic environments with exact or accurate approximate posteriors, but should be employed with caution in high-dimensional regimes where sampling accuracy is low.

FGTS and its smoothed/variance-aware extensions represent significant theoretical and empirical milestones in posterior sampling and exploration research. They establish a rigorous bridge between optimism-in-the-face-of-uncertainty and randomized exploration, highlight the necessity of sampling accuracy, and provide practical algorithmic pathways for a wide range of contextual decision-making problems.
