
Posterior Sampling in Sequential Decision

Updated 8 February 2026
  • Posterior Sampling in Sequential Decision is a Bayesian framework that samples from posterior distributions to balance exploration and exploitation in complex environments.
  • It extends Thompson sampling to various domains including multi-armed bandits, reinforcement learning, contextual bandits, and design of experiments, demonstrating strong statistical efficiency.
  • Empirical evaluations show that these methods achieve near-optimal regret bounds and ensure safety in constrained settings by integrating efficient Bayesian updates with optimization strategies.

Posterior sampling in sequential decision refers to a class of Bayesian algorithms, notably including Thompson sampling and its generalizations to reinforcement learning (RL) and design of experiments (DOE), that balance exploration and exploitation by sampling models or value functions from their posterior distributions and optimizing actions under the sampled parameters. This approach natively incorporates uncertainty and delivers principled exploration in complex, unknown environments, including Markov Decision Processes (MDPs), constrained MDPs (CMDPs), and general adaptive control and learning frameworks.

1. Foundations of Posterior Sampling in Sequential Decision

Posterior sampling maintains a Bayesian posterior over unknown quantities—such as model parameters, reward/cost functions, or value functions—and, at each decision point, draws a sample from the posterior to induce a candidate model or policy. Actions or policies are then optimized under this sample, yielding a randomized, uncertainty-aware exploration strategy.

In classical multi-armed bandits, this reduces to Thompson sampling, where the agent samples the mean reward of each arm from its posterior, then selects the arm with the highest sample. For general sequential decision tasks, including RL and DOE, the algorithm samples parameters (model transitions, reward vectors, Q-functions, etc.) or strategies and solves the resulting (possibly complex) planning or optimization problem (Russo et al., 2013, Kandasamy et al., 2018).
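For concreteness, the classical Bernoulli-bandit case can be sketched with conjugate Beta posteriors; the arm means and horizon below are illustrative, not taken from the cited papers:

```python
import numpy as np

rng = np.random.default_rng(0)

def thompson_bernoulli(true_means, horizon):
    """Thompson sampling with an independent Beta(1, 1) prior per arm."""
    n_arms = len(true_means)
    successes = np.ones(n_arms)   # Beta alpha parameters (prior pseudo-counts)
    failures = np.ones(n_arms)    # Beta beta parameters
    total_reward = 0.0
    for _ in range(horizon):
        # Sample a mean reward for each arm from its posterior ...
        theta = rng.beta(successes, failures)
        # ... and play the arm with the highest sample.
        arm = int(np.argmax(theta))
        reward = rng.binomial(1, true_means[arm])
        successes[arm] += reward
        failures[arm] += 1 - reward
        total_reward += reward
    return total_reward

# The best arm (mean 0.9) should dominate play after a short burn-in.
print(thompson_bernoulli([0.3, 0.5, 0.9], horizon=2000))
```

Because exploration comes entirely from posterior randomness, no confidence-bonus tuning is needed: arms with few pulls have wide posteriors and are sampled optimistically by chance.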

Posterior sampling has seen significant adoption due to its strong statistical efficiency, computational simplicity, and favorable regret properties across a wide variety of problem classes (Russo et al., 2013, Theocharous et al., 2017).

2. Posterior Sampling Algorithms Across Domains

Posterior sampling generalizes across sequential decision domains:

  • Multi-armed bandits: Thompson sampling samples reward means from the posterior at each round and plays the best sample (Russo et al., 2013). Bayesian regret is controlled by concentration of the posterior.
  • Contextual bandits: The policy samples model parameters or expected reward functions, inducing context-sensitive randomized policies (Russo et al., 2013).
  • Reinforcement learning (RL): Posterior Sampling for RL (PSRL) samples the transition and/or reward model from its posterior at each episode, computes the optimal policy for that sample, and commits to it for the episode (Theocharous et al., 2017). Extensions include deterministic episode schedules (DS-PSRL) and non-episodic updates.
  • Q-learning with posterior sampling (PSQL): Maintains Gaussian posteriors over Q-values and uses Thompson sampling for exploration, yielding a provably near-optimal regret guarantee in episodic tabular RL (Agrawal et al., 1 Jun 2025).
  • Value or policy function posteriors: Bayesian methods directly construct and sample from the posterior over optimal Q-functions (via “noisy Bellman equation” likelihoods) and derive policy decisions by greedily acting with respect to Q* samples (Guo et al., 3 May 2025).
  • Design of Experiments: Myopic Posterior Sampling (MPS) chooses the next experiment to minimize the expected penalty relative to a sampled parameter and data history, a direct generalization of Thompson sampling to structured DOE (Kandasamy et al., 2018).
  • Constrained RL (CMDPs): Sampling and primal-dual methods enforce constraints while optimizing objectives under sampled models (Kalagarla et al., 2023, Provodin et al., 2023).

3. Theoretical Guarantees and Regret Bounds

Posterior sampling often achieves near-optimal regret rates, sometimes matching information-theoretic lower bounds up to polylogarithmic or model-dependent factors:

  • Bayesian Regret Framework: For arbitrary function classes, Bayesian regret is linked to model complexity (captured by the “eluder dimension”), with bounds scaling as $O(\sqrt{\dim_E(\mathcal{F},\varepsilon)\,\dim_K(\mathcal{F})\,T\log T})$ for general classes, and $O(d\sqrt{T}\log T)$ for $d$-dimensional linear models (Russo et al., 2013).
  • RL and MDPs: Episodic PSRL and Q-learning with posterior sampling achieve regret bounds of the form $\tilde{O}(H^2\sqrt{SAT})$ and, in model-based settings, $O(H\sqrt{SAT})$ (with $S$, $A$ the state/action space sizes, $H$ the horizon, and $T$ the total number of steps) (Agrawal et al., 1 Jun 2025, Theocharous et al., 2017). Closing the remaining $H$-factor gap in regret remains a primary challenge (Agrawal et al., 1 Jun 2025).
  • Constrained RL (CMDPs): Safe PSRL and variants bound cumulative constraint violations uniformly (typically $O(1)$), while Bayesian reward regret scales as $\tilde{O}(H^{2.5}|S|^2|A|K)$ for $K$ episodes in finite-horizon CMDPs (Kalagarla et al., 2023). For infinite-horizon communicating CMDPs, Bayesian regret for each cost component is $\tilde{O}(HS\sqrt{AT})$, matching lower bounds in $T$ (Provodin et al., 2023).
  • DOE and general adaptive experimentation: Myopic posterior sampling achieves sublinear regret bounds, scaling with the square root of sample size, number of actions, and the model’s maximum information gain (Kandasamy et al., 2018).

These bounds typically rely on a posterior-sampling lemma: conditional on the current data, the sampled parameter and the true parameter are identically distributed, enabling information-theoretic and coupling arguments.
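In symbols (with $\theta^*$ the true parameter, $\hat{\theta}_t$ the sampled one, and $\mathcal{H}_t$ the history up to time $t$ — notation chosen here for illustration, not fixed by the source):

```latex
% Probability-matching property underlying posterior sampling:
% conditioned on the history, the sampled and true parameters
% share the same (posterior) distribution.
\mathbb{P}\!\left(\theta^* \in \cdot \,\middle|\, \mathcal{H}_t\right)
  \;=\;
\mathbb{P}\!\left(\hat{\theta}_t \in \cdot \,\middle|\, \mathcal{H}_t\right),
\qquad \hat{\theta}_t \sim p_t(\cdot \mid \mathcal{H}_t).
```

This matching lets regret analyses couple the agent's randomized choices to the behavior of an agent that knows the true parameter.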

4. Methodological Structure and Algorithmic Patterns

Posterior sampling in sequential decision problems is typically organized as follows:

  1. Bayesian update: Maintain a posterior $p_t$ on unknown model/item parameters (e.g., transition kernels, rewards, $Q^*$).
  2. Sampling step: At each decision epoch or episode, sample a parameter/model $p^*_t \sim p_t$.
  3. Planning/Optimization: Compute the optimal policy or action under the sampled model (solving DP, LP, or maximizing Q* as appropriate).
  4. Action selection and execution: Execute the selected action or policy; observe data.
  5. Posterior update: Incorporate the observed data into the posterior.
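Steps 1–5 can be sketched for a tabular, finite-horizon MDP using a Dirichlet posterior over transition kernels and known rewards; the 2-state, 2-action MDP below is a hypothetical example, not drawn from the cited papers:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 2-state, 2-action, horizon-5 MDP used purely for illustration.
S, A, H = 2, 2, 5
true_P = np.array([[[0.9, 0.1], [0.2, 0.8]],
                   [[0.5, 0.5], [0.1, 0.9]]])   # true_P[s, a] = next-state dist.
R = np.array([[0.0, 0.1], [0.2, 1.0]])          # known rewards R[s, a]

# Step 1: Dirichlet(1, ..., 1) prior, stored as transition pseudo-counts.
counts = np.ones((S, A, S))

def sample_and_plan():
    # Step 2: sample a transition model from the posterior.
    P = np.array([[rng.dirichlet(counts[s, a]) for a in range(A)]
                  for s in range(S)])
    # Step 3: finite-horizon dynamic programming under the sampled model.
    V = np.zeros(S)
    policy = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        Q = R + P @ V                # Q[s,a] = R[s,a] + sum_s' P[s,a,s'] V[s']
        policy[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return policy

for episode in range(200):
    policy = sample_and_plan()       # one posterior sample per episode (PSRL-style)
    s = 0
    for h in range(H):               # Steps 4-5: execute, observe, update posterior.
        a = policy[h, s]
        s_next = int(rng.choice(S, p=true_P[s, a]))
        counts[s, a, s_next] += 1
        s = s_next
```

Sampling once per episode and committing to the induced policy is what produces deep, temporally extended exploration rather than per-step dithering.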

Advanced variants, particularly in safe or constrained RL, embed posterior sampling within primal-dual frameworks, where Lagrangian, dual variable, or pessimism penalties are dynamically updated to enforce constraints and ensure safety (Kalagarla et al., 2023, Provodin et al., 2023).

For Q-learning, posterior sampling can be combined with bootstrapped target distributions, regularized Bayes updates, and optimistic action-value estimation strategies to achieve powerful exploration without explicit upper confidence bonuses (Agrawal et al., 1 Jun 2025).
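As intuition for the sampling step alone (a deliberate simplification, not the PSQL algorithm of the paper, which adds bootstrapped targets and regularized Bayes updates), consider a single-state toy problem with conjugate known-variance Gaussian posteriors over action values:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy: Gaussian posteriors over Q-values for one state and three actions.
# All constants below are illustrative assumptions.
n_actions = 3
mu = np.zeros(n_actions)           # posterior means of Q
tau = np.full(n_actions, 1e-2)     # posterior precisions (1 / variance)
obs_tau = 1.0                      # assumed known observation precision
true_q = np.array([0.2, 0.5, 1.0])

for t in range(3000):
    # Thompson step: sample one Q-value per action, act greedily on the sample.
    q_sample = mu + rng.standard_normal(n_actions) / np.sqrt(tau)
    a = int(q_sample.argmax())
    reward = true_q[a] + 0.1 * rng.standard_normal()
    # Conjugate Gaussian update of the chosen action's posterior.
    mu[a] = (tau[a] * mu[a] + obs_tau * reward) / (tau[a] + obs_tau)
    tau[a] += obs_tau

print(mu.round(2))  # posterior means concentrate near true_q
```

Wide posteriors on rarely tried actions keep generating occasional optimistic samples, so exploration decays only as uncertainty genuinely shrinks.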

5. Extensions: Constraints, Incentivized Exploration, and Beyond

Posterior sampling frameworks have been extended far beyond standard bandits and RL settings:

  • Constrained decision processes: Safe PSRL introduces pessimism by dynamically tightening safety constraints ($\tau - \epsilon_k$) at each episode, updating dual multipliers, and leveraging a primal-dual decomposition for bounded constraint violation regret (Kalagarla et al., 2023). PSConRL similarly achieves near-optimal regret for both primary and constraint costs in communicating CMDPs (Provodin et al., 2023).
  • Incentivized exploration: Filtered posterior sampling schemes exploit information asymmetry to align agent incentives, ensuring Bayesian incentive compatibility (BIC) even with private types, correlated priors, or complex recommendation semantics, and reduce to standard Thompson sampling as a special case (Kalvit et al., 2024).
  • Design of Experiments: MPS leverages posterior sampling to generalize along multi-objective, non-Gaussian, or non-myopic DOE goals. Algorithms utilize posterior sampling over arbitrary model classes, combined with probabilistic programming and approximate inference (Kandasamy et al., 2018).
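The primal-dual pattern used in the constrained extensions can be illustrated on a toy constrained bandit; the arms, cost budget, and step size below are invented for illustration and do not reproduce the cited algorithms:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy constrained bandit: maximize reward subject to mean cost <= tau,
# combining posterior sampling with projected dual (Lagrangian) ascent.
means_r = np.array([0.9, 0.6, 0.3])    # true Bernoulli reward means
means_c = np.array([0.8, 0.4, 0.1])    # true Bernoulli cost means
tau, lam, eta = 0.5, 0.0, 0.05         # cost budget, dual variable, step size

r_ab = np.ones((3, 2))                 # Beta(1, 1) posteriors for rewards
c_ab = np.ones((3, 2))                 # Beta(1, 1) posteriors for costs
total_cost = 0

for t in range(5000):
    # Sample reward and cost means from their posteriors.
    r_hat = rng.beta(r_ab[:, 0], r_ab[:, 1])
    c_hat = rng.beta(c_ab[:, 0], c_ab[:, 1])
    # Primal step: act greedily on the sampled Lagrangian r - lam * c.
    a = int(np.argmax(r_hat - lam * c_hat))
    r = rng.binomial(1, means_r[a])
    c = rng.binomial(1, means_c[a])
    r_ab[a] += [r, 1 - r]
    c_ab[a] += [c, 1 - c]
    total_cost += c
    # Dual step: raise lam when realized cost exceeds the budget, floor at 0.
    lam = max(0.0, lam + eta * (c - tau))
```

The dual variable rises while the budget is exceeded, steering the sampled-Lagrangian optimization toward cheaper arms, so the long-run average cost hovers near the budget $\tau$.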

6. Empirical Evaluations and Application Domains

Posterior sampling methods have been empirically validated in a wide range of settings:

  • Reinforcement learning benchmarks: In tabular and parametrized MDPs, PSRL and PSQL outperform or match UCB-based and count-based exploration baselines, especially in sparse-reward and deep exploration scenarios (Theocharous et al., 2017, Agrawal et al., 1 Jun 2025, Guo et al., 3 May 2025).
  • Constrained RL: Safe PSRL empirically maintains bounded constraint violations and achieves lower reward regret than safe-policy-based algorithms in streaming and buffer management CMDPs (Kalagarla et al., 2023).
  • Incentivized recommendation: Filtered posterior sampling attains optimal BIC properties and regret, enabling unbiased incentive alignment for social learning, recommendation, and experimentation (Kalvit et al., 2024).
  • DOE tasks: MPS matches or exceeds problem-specific algorithms in logistic regression active learning, linear regression, astrophysical density estimation, and multi-objective battery design (Kandasamy et al., 2018).

7. Conceptual Insights and Future Directions

Posterior sampling offers a unifying Bayesian methodology for exploration in sequential decision domains:

  • Principled exploration: By sampling entire models or value functions, posterior sampling naturally induces deep, temporally extended exploration and avoids excessive myopia (Guo et al., 3 May 2025).
  • Computational simplicity: Most algorithms require only posterior updates and optimization under sampled parameters, circumventing the need to design tight confidence intervals or explicit exploration bonuses (Russo et al., 2013).
  • Scalability: Deterministic schedule methods (e.g., DS-PSRL) enable scaling to non-episodic settings and high-dimensional parameterizations (Theocharous et al., 2017).
  • Safety and constraints: Integration with primal-dual and pessimistic mechanisms provides guarantees for safety-critical domains, even without access to safe base policies (Kalagarla et al., 2023).
  • Adaptive and randomized policies: Posterior sampling generalizes classic Thompson sampling, enabling robust, randomized, and often asymptotically optimal adaptive control strategies (Agrawal et al., 1 Jun 2025, Guo et al., 3 May 2025).

Challenges remain in closing theoretical gaps (e.g., reducing multiplicative horizon factors in regret for Q-learning), extending rigorous guarantees to general function approximation, and unifying model-free and model-based posterior sampling regimes.


Key references:

  • Russo et al., 2013: Learning to Optimize via Posterior Sampling
  • Theocharous et al., 2017: Posterior Sampling for Large Scale Reinforcement Learning
  • Kandasamy et al., 2018: Myopic Bayesian Design of Experiments via Posterior Sampling and Probabilistic Programming
  • Kalagarla et al., 2023: Safe Posterior Sampling for Constrained MDPs with Bounded Constraint Violation
  • Provodin et al., 2023: Provably Efficient Exploration in Constrained Reinforcement Learning: Posterior Sampling Is All You Need
  • Kalvit et al., 2024: Incentivized Exploration via Filtered Posterior Sampling
  • Guo et al., 3 May 2025: Bayesian Learning of the Optimal Action-Value Function in a Markov Decision Process
  • Agrawal et al., 1 Jun 2025: Q-learning with Posterior Sampling
