
Constrained Thompson Sampling Overview

Updated 10 February 2026
  • Constrained Thompson Sampling is a Bayesian decision-making strategy that extends classic TS by incorporating explicit probabilistic and operational constraints.
  • It leverages posterior sampling and constrained optimization (e.g., chance constraints and linear programming) to balance reward maximization with safety, budget, and risk limits.
  • Empirical studies demonstrate that Con-TS achieves logarithmic regret and stringent violation bounds across domains like power grids, wireless networks, and video streaming.

Constrained Thompson Sampling (Con-TS) is a class of Bayesian sequential decision-making algorithms that extend the classical Thompson Sampling (TS) heuristic to optimization under explicit, typically probabilistic, constraints. Con-TS has emerged as a central approach for safe exploration and learning in multi-armed bandits (MABs), contextual bandits, and linear bandit models, where reward maximization must be balanced against requirements such as safety, resource budgets, reliability, or regulatory risk limits. The essential idea is to incorporate constraints into the TS action-selection step, often by conditioning choices on sampled parameters while ensuring that selected actions are likely to satisfy the imposed constraints, either in expectation or with high probability. This framework achieves principled exploration while providing operational and safety guarantees across a variety of applied and theoretical domains.

1. Core Principles and Algorithmic Structure

Constrained Thompson Sampling modifies the action selection step of TS to enforce constraints, typically involving chance constraints, budget, safety, or risk. Con-TS proceeds by maintaining a posterior distribution over unknown parameters encoding rewards and constraint-relevant metrics. At each round, the algorithm samples these parameters from the posterior and solves a constrained optimization—e.g., a linear program (LP), a chance-constrained program, or a feasible super-arm selection—under the sampled model.

A canonical Con-TS iteration consists of:

  • Contextual and reward observation.
  • Sampling model parameters (or complete models) from the posterior.
  • Solving a stochastic optimization—maximizing expected reward subject to the sampled or posterior-probabilistic constraint(s).
  • Taking the action, receiving feedback, and updating the posterior in closed form.
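
The four steps above can be sketched for the simplest case: a Bernoulli bandit with Beta posteriors, known per-arm costs, and a per-round cost budget. All numbers and the feasibility rule here are illustrative, not drawn from any cited paper:

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.3, 0.5, 0.7])   # unknown to the learner
costs = np.array([0.2, 0.5, 0.9])        # known per-arm costs
c_max = 0.6                              # per-round cost budget
alpha = np.ones(3)                       # Beta posterior parameters
beta = np.ones(3)

for t in range(1000):
    # 1. Sample a model from the posterior.
    theta = rng.beta(alpha, beta)
    # 2. Constrained selection: best sampled reward among feasible arms.
    feasible = np.where(costs <= c_max)[0]
    arm = feasible[np.argmax(theta[feasible])]
    # 3. Act and observe feedback.
    reward = rng.random() < true_means[arm]
    # 4. Closed-form conjugate (Beta-Bernoulli) posterior update.
    alpha[arm] += reward
    beta[arm] += 1 - reward

print(alpha / (alpha + beta))  # posterior means for each arm
```

Note that the infeasible arm is simply excluded from the sampled-reward argmax; richer Con-TS variants replace this filter with a chance-constrained or LP-based selection over the sampled model.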

For example, in "Constrained Thompson Sampling for Real-Time Electricity Pricing with Grid Reliability Constraints" (Con-TS-RTP), the step is:

$$p_\tau = \arg\min_{p \in P} \; \mathbb{E}_{\theta = \tilde{\theta}}\big[f(D(p), V_\tau)\big] \quad \text{s.t.} \quad \mathbb{P}_{\theta = \tilde{\theta}}\big[g_j(D(p)) \leq 0\big] \geq 1 - \mu$$

where $p$ is the control, $\tilde{\theta}$ is the sampled model, $f(\cdot)$ is the cost, $g_j$ are the reliability constraints, and $\mu$ is a risk tolerance (Tucker et al., 2019).

2. Constraint Types and Problem Domains

Con-TS frameworks address a variety of constraint classes and domains:

  • Chance Constraints: Bound constraint violations probabilistically, e.g., ensuring that voltage/line-flow limits are maintained with high probability in power systems (Tucker et al., 2019).
  • Budget or Resource Constraints: Maintain cumulative or per-action resource use under a threshold (Budgeted-MAB, knapsack bandits) (Xia et al., 2015, Saxena et al., 2020).
  • Safety/Reliability Measures: Require satisfaction of reliability or safety baselines, as in wireless link optimization (latency constraints) (Saxena et al., 2019).
  • Auxiliary Constraints: Enforce instance-level or cumulative constraints on other outcome metrics, including reward or cost metrics relative to a baseline policy (Daulton et al., 2019).
  • Risk Constraints: Limit risk via measures like Conditional Value at Risk (CVaR), e.g., in risk-constrained bandits (Chang et al., 2020).
  • Feasibility under Linear Constraints: Ensure actions reside in polytopes or under linear system constraints, as in safe linear bandits (Gangrade et al., 3 Mar 2025).
  • Stochastic Feasibility Constraints: Select optimal arms among those satisfying unknown stochastic feasibility limits, e.g., best-arm identification with constraints (Yang et al., 7 Jan 2025).
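
As one concrete instance of the risk-constraint class above, the empirical Conditional Value at Risk at level α is simply the mean of the worst α-fraction of observed losses. A minimal sketch of this standard estimator (not specific to any cited algorithm):

```python
import numpy as np

def empirical_cvar(losses, alpha=0.1):
    """Mean of the worst alpha-fraction of losses (higher = riskier)."""
    losses = np.sort(np.asarray(losses, dtype=float))[::-1]  # descending
    k = max(1, int(np.ceil(alpha * len(losses))))            # worst k samples
    return losses[:k].mean()

losses = [1.0, 2.0, 3.0, 10.0, 1.5, 2.5, 0.5, 4.0, 3.5, 2.0]
print(empirical_cvar(losses, alpha=0.2))  # mean of the 2 worst losses: 7.0
```

A risk-constrained bandit would compare such an estimate (or its posterior analogue) against a threshold when deciding whether an arm is admissible.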

3. Posterior Updates and Bayesian Inference

All Con-TS algorithms rely on recursive Bayesian updates of posteriors for both reward and constraint-relevant parameters. With conjugate priors (e.g., Beta-Bernoulli, Normal-Gamma for Gaussian arms, or regularized least squares for linear models), closed-form updates are efficient and maintain statistical consistency. The key update takes the form:

$$\pi_\tau(\theta) \propto \ell(Y_\tau; p_\tau, \theta) \cdot \pi_{\tau-1}(\theta)$$

where the likelihood $\ell$ depends on the observed feedback and the action (Tucker et al., 2019).

This Bayesian formalism supports exploration, enforces persistence in learning near the constraint boundary, and enables closed-form regret and safety analyses.
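
For a finite parameter set Θ, the recursive update above reduces to elementwise multiplication by the likelihood followed by renormalization. A generic sketch, using an illustrative Bernoulli likelihood (the specific model class is an assumption for the example):

```python
import numpy as np

# Finite model class: each theta is a candidate Bernoulli success probability.
thetas = np.array([0.2, 0.5, 0.8])
prior = np.full(3, 1 / 3)                 # pi_0: uniform over Theta

def update(posterior, y):
    """One step of pi_tau(theta) ∝ likelihood(y; theta) * pi_{tau-1}(theta)."""
    likelihood = thetas**y * (1 - thetas)**(1 - y)
    unnormalized = likelihood * posterior
    return unnormalized / unnormalized.sum()

post = prior
for y in [1, 1, 0, 1, 1]:                 # observed Bernoulli feedback
    post = update(post, y)
print(post)                               # mass shifts toward theta = 0.8
```

With conjugate families the same recursion collapses to simple parameter updates (e.g., Beta counts), which is what makes the posterior maintenance in Con-TS computationally cheap.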

4. Theoretical Guarantees: Regret and Safety

The performance of Con-TS is characterized by regret, which measures the performance gap to an oracle with a priori parameter knowledge, and by rigorous safety or constraint-violation bounds. Results across model variants include:

  • Logarithmic Regret: Under identifiability and a finite parameter/arm space, Con-TS achieves $R_T = O(\log T)$ expected regret, matching standard TS up to constants determined by constraint tightness and problem complexity (Tucker et al., 2019, Saxena et al., 2020).
  • Constraint Violation Guarantees: By enforcing prior-based or chance constraints at each selection, the empirical frequency of violations can be bounded with high probability. For instance, in power networks, enforcing the posterior-based chance constraint (parameter $\nu$) ensures overall violation probability $\leq \delta$ (Tucker et al., 2019).
  • Risk/Regret Rates in Specific Models: For linearly constrained bandits, cumulative regret and violation are $O(\log T)$ for all but the optimal arms (Saxena et al., 2020). In risk-constrained (CVaR) settings, regret is $O(\log n)$, matching information-theoretic lower bounds, and constraint satisfaction is reliable even in finite horizons (Chang et al., 2020).
  • Best-Arm Identification Under Constraints: In BFAI-TS, the posterior probability of wrongly identifying the best feasible arm decays exponentially, at a rate characterized by an explicit optimization over mixture parameters (Yang et al., 7 Jan 2025).

These guarantees are typically established using concentration inequalities adapted to the stochastic constraint structure, duality theory for LPs with random coefficients, and Bayesian large-deviation arguments.

5. Algorithmic Instantiations and Pseudocode

A representative pseudocode for Con-TS with chance constraints (Tucker et al., 2019, condensed) is:

Input: Θ finite, P finite, reliability parameters μ, ν; prior π₀ on Θ.
For τ = 1…T do
    1. Observe context V_τ
    2. Sample θ̃ ∼ π_{τ-1}
    3. Solve p_τ = argmin_{p∈P} E_{θ=θ̃}[f(D(p),V_τ)]
        subject to: P_{θ=θ̃}[g_j(D(p)) ≤ 0] ≥ 1−μ
    4. Broadcast p_τ; observe outcome Y_τ
    5. Update ∀θ∈Θ: π_τ(θ) ∝ ℓ(Y_τ; p_τ, θ)·π_{τ−1}(θ)
end
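
A minimal executable rendering of this pseudocode, with a toy finite Θ and P, a Monte-Carlo check of the chance constraint, and a Gaussian likelihood for the posterior update. The demand model, cap, and all constants are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

# Finite model class: each theta scales a noisy demand response D(p) = theta*(2-p) + noise.
thetas = np.array([0.5, 1.0, 1.5])
true_theta = 1.0
prices = np.array([0.5, 1.0, 1.5])        # finite action set P
mu = 0.1                                  # chance-constraint tolerance
post = np.full(3, 1 / 3)                  # pi_0 over Theta

def demand_samples(theta, p, n=500):
    return theta * (2.0 - p) + 0.1 * rng.standard_normal(n)

for tau in range(200):
    theta_s = rng.choice(thetas, p=post)                 # 2. sample theta from posterior
    best_p, best_cost = None, np.inf
    for p in prices:                                     # 3. constrained minimization over P
        d = demand_samples(theta_s, p)
        # Chance constraint: demand stays below a cap with prob >= 1 - mu.
        if np.mean(d <= 1.2) >= 1 - mu:
            cost = np.mean((d - 1.0) ** 2)               # cost f: deviation from a target
            if cost < best_cost:
                best_p, best_cost = p, cost
    p_tau = best_p if best_p is not None else prices[-1] # fallback: most conservative price
    y = true_theta * (2.0 - p_tau) + 0.1 * rng.standard_normal()  # 4. observe outcome
    # 5. Discrete Bayes update with a Gaussian likelihood on the observed demand.
    lik = np.exp(-0.5 * ((y - thetas * (2.0 - p_tau)) / 0.1) ** 2)
    post = lik * post / (lik * post).sum()

print(post)  # posterior concentrates on the true model
```

The Monte-Carlo feasibility check stands in for the exact chance-constrained program solved in the paper; with finite Θ and P, each round is a handful of vectorized evaluations.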

In budgeted-MAB, the ratio of sampled reward and cost guides selection up to exhaustion of the budget (Xia et al., 2015). In linear safe bandits, the action is selected via a perturbed LP with sampled reward and constraint parameters, leveraging a coupled-noise design for optimism and feasibility (Gangrade et al., 3 Mar 2025).
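
A minimal sketch of the budgeted ratio rule, assuming Bernoulli rewards and costs with independent Beta posteriors; the setup and numbers are illustrative, not taken from the cited papers:

```python
import numpy as np

rng = np.random.default_rng(2)
r_means = np.array([0.2, 0.6, 0.8])       # unknown Bernoulli reward means
c_means = np.array([0.3, 0.5, 0.9])       # unknown Bernoulli cost means
ra, rb = np.ones(3), np.ones(3)           # Beta posteriors for rewards
ca, cb = np.ones(3), np.ones(3)           # Beta posteriors for costs
budget, total_reward = 100.0, 0.0

while budget > 0:
    r_s = rng.beta(ra, rb)                            # sampled rewards
    c_s = rng.beta(ca, cb)                            # sampled costs
    arm = int(np.argmax(r_s / np.maximum(c_s, 1e-6))) # best sampled reward/cost ratio
    r = float(rng.random() < r_means[arm])
    c = float(rng.random() < c_means[arm])
    ra[arm] += r; rb[arm] += 1 - r                    # conjugate updates
    ca[arm] += c; cb[arm] += 1 - c
    total_reward += r
    budget -= c

print(total_reward)
```

Sampling both reward and cost posteriors preserves the TS exploration principle while steering spending toward arms with favorable reward-per-unit-cost.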

6. Empirical Validation and Application Case Studies

Empirical studies across disciplines corroborate the theoretical properties of Con-TS variants:

  • Power Grids: On realistic feeders (38-bus, LinDistFlow), Con-TS-RTP achieves near-optimal demand shaping and exhibits zero violations under chance constraints, outperforming unconstrained TS and alternative exploration heuristics (Tucker et al., 2019).
  • Wireless Communication: Con-TS for rate selection substantially outperforms UCB and unconstrained TS in throughput/violation ratios and safety metrics in nonstationary wireless environments (Saxena et al., 2019).
  • Video Transcoding: In large-scale, contextual settings, TS with auxiliary safety constraints efficiently navigates reward–reliability tradeoffs, outperforming baselines and providing tunable slack via a constraint parameter α (Daulton et al., 2019).
  • Linear Bandits: COLTS demonstrates $5$–$20\times$ speedups and regret parity or improvements versus prior SOCP-based safe linear-bandit methods, scaling to high dimensions and many constraints (Gangrade et al., 3 Mar 2025).
  • Combinatorial Selection: In variable selection and BAI, TVS and BFAI-TS outperform LASSO, UCB, and baseline allocation in both power and false discovery metrics, maintaining $O(p \log T)$ regret and adapting flexibly to matroid or cardinality constraints (Liu et al., 2020, Yang et al., 7 Jan 2025).

7. Extensions, Open Challenges, and Limitations

Con-TS models generalize flexibly to multiple constraints, non-convex and matroid-structured constraint sets, and combinatorial settings:

  • Multiple or Nonlinear Constraints: Extensions using vectorized dual variables or feasibility resampling can handle more complex constraint geometries (Saxena et al., 2020, Gangrade et al., 3 Mar 2025).
  • Contextual and Sequential Problems: Instance-level and cumulative auxiliary constraints are handled in multi-outcome contextual bandits (Daulton et al., 2019).
  • Scalability and Computation: Recent work (COLTS) replaces SOCPs with LPs or even line search for efficiency with many constraints (Gangrade et al., 3 Mar 2025).
  • Limitations: Existing analyses assume accurate model posteriors and do not cover non-Gaussian or heavy-tailed noise in all instances. Many approaches lack instance-optimal regret for optimal arms, and constrained TS for non-contextual, non-i.i.d., or delayed feedback regimes remains open.
  • Theoretical Gaps: Some Con-TS algorithms have conjectured but unproven regret and safety probability bounds; rigorous analysis is highlighted as an open problem in certain contextual settings (Daulton et al., 2019).

Constrained Thompson Sampling thus constitutes a theoretically principled, empirically validated, and versatile methodology for safe and efficient learning under uncertainty and operational constraints across many sequential decision-making domains.
