Bayesian Regret Bounds Overview

Updated 21 April 2026

Bayesian regret bounds are metrics that quantify the expected excess loss of an algorithm compared to the optimal Bayesian strategy under stochastic uncertainty.
They use information-theoretic measures, such as maximum information gain and kernel-dependent rates, to derive tight, sublinear, and often instance-dependent performance guarantees.
Applications span multi-armed bandits, Bayesian optimization, and reinforcement learning, where these bounds inform the trade-off between exploration and exploitation.

Bayesian regret bounds quantify the performance gap between a learning or optimization algorithm and the optimal Bayesian strategy, under a stochastic prior on model parameters or reward-generating functions. Across bandit, sequential decision-making, and Bayesian optimization regimes, these bounds express how well an algorithm adapts to the underlying uncertainty, emphasizing the trade-off between information acquisition and exploitation. Recent developments provide tight, often instance-dependent, Bayesian regret guarantees for bandit, Bayesian optimization, and reinforcement learning settings, under Gaussian process priors, linear models, and broader nonparametric regimes.

1. Definitions and Fundamental Quantities

Bayesian cumulative regret is typically defined as the expectation (w.r.t. the prior over the data-generating mechanism) of the total excess loss or cost incurred by an algorithm relative to the optimal oracle policy after $T$ rounds.

Multi-Armed Bandits: For arm set $A=[K]$, underlying parameter $\theta\sim h$, with mean rewards $\mu_a(\theta)$ and $A_*$ the optimal arm, the Bayes regret after $n$ rounds is

$R(n) = \mathbb{E}_{\theta\sim h} \mathbb{E}_\pi\left[\sum_{t=1}^n (\mu_{A_*}(\theta) - \mu_{A_t}(\theta))\right].$
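
As a rough illustration of this definition, the Bayes regret can be estimated by Monte Carlo: draw an instance from the prior, run the policy, and average the accumulated gaps $\mu_{A_*}(\theta) - \mu_{A_t}(\theta)$. The sketch below assumes Bernoulli rewards, an independent uniform (Beta(1,1)) prior on each arm mean, and Thompson sampling as the policy. These modelling choices and the name `bayes_regret_thompson` are illustrative assumptions, not taken from the cited works.

```python
import random


def bayes_regret_thompson(num_arms, horizon, num_prior_draws, seed=0):
    """Monte Carlo estimate of R(n) for Thompson sampling.

    Assumptions of this sketch: Bernoulli rewards and an independent
    Beta(1, 1) prior on every arm mean.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_prior_draws):
        # Outer expectation: draw one bandit instance theta ~ h.
        means = [rng.random() for _ in range(num_arms)]
        best = max(means)
        alpha = [1] * num_arms  # Beta posterior: 1 + number of successes
        beta = [1] * num_arms   # Beta posterior: 1 + number of failures
        for _ in range(horizon):
            # Thompson sampling: sample each posterior, play the argmax.
            draws = [rng.betavariate(alpha[a], beta[a]) for a in range(num_arms)]
            arm = max(range(num_arms), key=draws.__getitem__)
            # Accumulate the gap between the optimal and the played arm mean.
            total += best - means[arm]
            if rng.random() < means[arm]:
                alpha[arm] += 1
            else:
                beta[arm] += 1
    return total / num_prior_draws


if __name__ == "__main__":
    for n in (50, 200, 800):
        print(n, round(bayes_regret_thompson(3, n, 200), 2))
```

The estimate accumulates mean gaps rather than realized reward differences; both have the same expectation, and the former has lower variance.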

Bayesian Optimization: For an unknown function $f$ drawn from a GP prior over domain $\mathcal{X}$,

$R_T = \mathbb{E}\left[\sum_{t=1}^T (f(x^*) - f(x_t))\right],\, x^* = \arg\max_{x\in\mathcal{X}} f(x).$

RL/MDP Setting: For a Bayesian prior over MDPs, Bayes regret is

$BR(T) = \mathbb{E}_{\mathcal{M}}\left[\sum_{t=1}^T (V_1^{*,\mathcal{M}} - V_1^{\pi^t, \mathcal{M}})\right],$

over $T$ episodes, each of horizon $H$.
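
To make the summand concrete, the per-episode gap $V_1^{*,\mathcal{M}} - V_1^{\pi^t,\mathcal{M}}$ can be computed exactly for a small tabular MDP by backward induction. The sketch below is a minimal illustration under assumed conventions: `P[s][a]` is a list of next-state probabilities, `R[s][a]` an expected reward, and `policy[h][s]` the action at step `h`. The two-state MDP and all function names are hypothetical.

```python
def policy_value(P, R, policy, horizon, start):
    """V_1^pi(start) for a finite-horizon tabular MDP, by backward induction."""
    n = len(P)
    V = [0.0] * n  # value after the final step
    for h in reversed(range(horizon)):
        V = [
            R[s][policy[h][s]]
            + sum(p * V[s2] for s2, p in enumerate(P[s][policy[h][s]]))
            for s in range(n)
        ]
    return V[start]


def optimal_value(P, R, horizon, start):
    """V_1^*(start): same recursion with a max over actions at every step."""
    n = len(P)
    V = [0.0] * n
    for _ in range(horizon):
        V = [
            max(
                R[s][a] + sum(p * V[s2] for s2, p in enumerate(P[s][a]))
                for a in range(len(P[s]))
            )
            for s in range(n)
        ]
    return V[start]


# Hypothetical MDP: state 1 pays reward 1 per step and is absorbing;
# from state 0, action 1 moves to state 1 and action 0 stays put (reward 0).
P = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [0.0, 1.0]]]
R = [[0.0, 0.0], [1.0, 1.0]]
H = 3
lazy_policy = [[0, 0] for _ in range(H)]  # always plays action 0

gap = optimal_value(P, R, H, start=0) - policy_value(P, R, lazy_policy, H, start=0)
print(gap)  # 2.0
```

The Bayes regret averages such gaps over the episodes and over MDPs drawn from the prior.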

Key complexity measures appear in the bounds:

Maximum Information Gain (MIG), denoted $\gamma_T$, quantifies the mutual information between the function values and observations up to $T$ rounds, governing the effective dimensionality of the learning problem (Iwazaki, 2 Jun 2025).
Kolmogorov $\ell_1$-dimension in RL quantifies the effective covering size of the environment class (Moradipari et al., 2023).
Prior gaps and covering numbers determine instance-dependent minimax rates in finite-armed and linear settings (Atsidakou et al., 2023).
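
For a fixed set of query points under a GP prior with Gaussian observation noise, the information gain has the closed form $\frac{1}{2}\log\det(I + \sigma^{-2}K)$, where $K$ is the kernel matrix of the queried points. The maximum information gain is the maximum of this quantity over all point sets of a given size. The sketch below evaluates the closed form for one given point set only. The 1-D SE kernel, the noise variance, and the function names are assumptions made for illustration.

```python
import math


def se_kernel(x, y, lengthscale=1.0):
    """Squared exponential kernel on scalars."""
    return math.exp(-((x - y) ** 2) / (2.0 * lengthscale ** 2))


def information_gain(points, noise_var=0.1, lengthscale=1.0):
    """0.5 * log det(I + K / noise_var) for the given 1-D query points."""
    n = len(points)
    M = [
        [
            (1.0 if i == j else 0.0)
            + se_kernel(points[i], points[j], lengthscale) / noise_var
            for j in range(n)
        ]
        for i in range(n)
    ]
    # Cholesky factorization M = L L^T, so log det M = 2 * sum(log L_ii).
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = M[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    # The factor 0.5 and the factor 2 from the Cholesky identity cancel.
    return sum(math.log(L[i][i]) for i in range(n))


if __name__ == "__main__":
    print(information_gain([0.0, 0.0]))    # repeated query: little extra information
    print(information_gain([0.0, 100.0]))  # distant queries: nearly independent
```

Repeating a query adds less information than querying a distant point, which is the sense in which the information gain tracks the effective dimensionality of the problem.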

2. Regret Bounds in Bayesian Optimization

In Bayesian optimization with Gaussian process priors, the canonical regret rate scales as $\tilde{O}(\sqrt{T\gamma_T})$, where the sublinear growth (in $T$) of $\gamma_T$ is kernel-dependent. For SE (RBF) kernels in $d$ dimensions, $\gamma_T = O(\log^{d+1} T)$; for Matérn-$\nu$ kernels with sufficiently large smoothness $\nu$, improved analyses yield $\tilde{O}(\sqrt{T})$ in 1D and nearly so in higher dimensions.
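
For orientation, the classical route from information gain to a regret rate (the standard GP-UCB argument, not the refined analysis of the works cited in this section) can be sketched as follows. The symbols $r_t$ (instantaneous regret), $\beta_t$ (a nondecreasing confidence parameter), $\sigma_{t-1}(x_t)$ (posterior standard deviation at the queried point), and $\sigma^2$ (noise variance) are notation assumed for this sketch, and the kernel is assumed to satisfy $k(x,x)\le 1$. On the event that the confidence bounds hold,

$r_t = f(x^*) - f(x_t) \le 2\beta_t^{1/2}\sigma_{t-1}(x_t),$

and summing the squared terms against the information gain gives

$\sum_{t=1}^T r_t^2 \le 4\beta_T \sum_{t=1}^T \sigma_{t-1}^2(x_t) \le C_1 \beta_T \gamma_T, \quad C_1 = 8/\log(1+\sigma^{-2}).$

The Cauchy–Schwarz inequality then yields

$\sum_{t=1}^T r_t \le \sqrt{T \sum_{t=1}^T r_t^2} \le \sqrt{C_1 T \beta_T \gamma_T}.$

Since $\beta_T$ grows only polylogarithmically in typical settings, the rate is driven by how fast the information gain grows with $T$.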

Main high-probability bounds for GP-UCB:

Matérn kernel, under suitable smoothness conditions:

$\sum_{t=1}^T (f(x^*) - f(x_t)) = \tilde{O}(\sqrt{T}).$

This matches the minimax $\Omega(\sqrt{T})$ lower bound up to logarithmic factors (Iwazaki, 2 Jun 2025, Scarlett, 2018).

Squared Exponential kernel:

$\sum_{t=1}^T (f(x^*) - f(x_t)) = O(\sqrt{T \ln^2 T}).$

This sharpens earlier bounds by removing extraneous logarithmic factors through algorithm-dependent, local information-gain bounds rather than worst-case global ones (Iwazaki, 2 Jun 2025).

The technical novelty involves decomposing regret into regions of the search space and exploiting the geometric concentration of query points near the maximizer, allowing much tighter information gain evaluations.

For Thompson Sampling (TS) in BO:

Expected cumulative regret is, up to logarithmic factors, $O(\sqrt{T\gamma_T})$, matching GP-UCB (Takeno et al., 2023, Takeno et al., 10 Mar 2026).
High-probability regret for GP-TS can be larger, with polynomial dependence on $1/\delta$ at failure probability $\delta$ (Takeno et al., 10 Mar 2026).

In preference-based (dueling) BO with only pairwise comparisons, the MR-LPF algorithm achieves regret

$\tilde{O}(\sqrt{\Gamma(T)\,T}),$

where $\Gamma(T)$ is the maximum information gain of the dueling kernel. This matches scalar-feedback BO, showing that pairwise feedback does not increase the sample complexity (Kayal et al., 29 May 2025).

For unknown hyperparameters (e.g., lengthscales), the Length-Scale Balancing (LB) algorithm achieves regret only a logarithmic factor away from that of an oracle using the optimally chosen hyperparameter, eliminating the polynomial-factor penalty of A-GP-UCB (Ziomek et al., 2024).

In Bayesian optimization over unknown domains, regret remains sublinear with high probability provided the search volume is expanded at a hyperharmonic rate, with the resulting regret rate set by the rate of volume expansion and the kernel properties (Tran-The et al., 2020).

3. Bandit and Linear Bandit Settings

For Bayesian multi-armed and linear bandit problems, recent advances yield finite-time, gap- and prior-dependent logarithmic regret bounds.

Gap-dependent (Bayesian UCB):

$R(n) = O(c_\Delta \log n),$

where $c_\Delta$ is a constant depending on the instance gaps (Atsidakou et al., 2023).

Prior-dependent:

$R(n) = O(c_h \log^2 n),$

where $c_h$ depends on the prior's mass on near-optimal arms. This matches Lai's asymptotic lower bound (Atsidakou et al., 2023).
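
A small simulation can make the logarithmic growth tangible. The sketch below runs a UCB-style policy whose index is the posterior mean plus a posterior-width bonus, on Gaussian bandits with arm means drawn from a Gaussian prior. The index form $\hat\mu_a + \sqrt{2\hat\sigma_a^2\log(1/\delta)}$, the choice $\delta = 1/n$, and the name `bayes_ucb_regret` are assumptions of this illustration rather than the exact algorithm analyzed in the cited work.

```python
import math
import random


def bayes_ucb_regret(num_arms, horizon, num_prior_draws,
                     prior_var=1.0, noise_var=1.0, seed=0):
    """Monte Carlo Bayes regret of a posterior-based UCB index policy.

    Assumptions: arm means ~ N(0, prior_var), rewards ~ N(mean, noise_var),
    confidence level delta = 1 / horizon.
    """
    rng = random.Random(seed)
    log_term = math.log(float(horizon))  # log(1 / delta) with delta = 1 / horizon
    total = 0.0
    for _ in range(num_prior_draws):
        means = [rng.gauss(0.0, math.sqrt(prior_var)) for _ in range(num_arms)]
        best = max(means)
        pulls = [0] * num_arms
        sums = [0.0] * num_arms
        for _ in range(horizon):
            index = []
            for a in range(num_arms):
                # Conjugate Gaussian posterior for the mean of arm a.
                precision = 1.0 / prior_var + pulls[a] / noise_var
                post_mean = (sums[a] / noise_var) / precision
                post_var = 1.0 / precision
                index.append(post_mean + math.sqrt(2.0 * post_var * log_term))
            arm = max(range(num_arms), key=index.__getitem__)
            total += best - means[arm]
            sums[arm] += rng.gauss(means[arm], math.sqrt(noise_var))
            pulls[arm] += 1
    return total / num_prior_draws


if __name__ == "__main__":
    for n in (100, 400, 1600):
        print(n, round(bayes_ucb_regret(3, n, 100), 2))
```

Printing the estimate at geometrically spaced horizons gives a quick visual check of whether the growth looks closer to logarithmic than to $\sqrt{n}$.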

For offline Bayesian linear bandits, the high-confidence Bayes regret (VaR) can be tightly bounded via convex conic optimization, and the resulting bound is shown to be essentially optimal (Petrik et al., 2023). Algorithms that minimize such certificates strictly outperform those based on pessimistic lower confidence bounds.

4. Reinforcement Learning: Bayesian Regret in MDPs

Bayesian regret in RL/MDPs is governed by the structure of uncertainty over the environment class, the horizon $H$, and the environment's statistical covering (via Kolmogorov $\ell_1$-dimension or mutual information over rate-distortion partitions).

Key results:

Information-directed RL (IDS), for tabular finite-horizon MDPs with $S$ states, $A$ actions, and horizon $H$:

$BR(T) = \tilde{O}(\sqrt{S^3 A^2 H^4 T}).$

By learning a less-informative surrogate environment (via rate-distortion partitions), this is improved to

$BR(T) = \tilde{O}(\sqrt{S^2 A^2 H^4 T}),$

with matching instance-dependent certificates (Hao et al., 2022).

Thompson Sampling in RL (MDPs with Kolmogorov $\ell_1$-dimension $d_{\ell_1}$):

$BR(T) = \tilde{O}(H\sqrt{d_{\ell_1} T}),$

with concrete bounds on $d_{\ell_1}$ for tabular, linear function approximation, and finite-mixture models (Moradipari et al., 2023).

GP-based RL (continuous control with Gaussian process prior): for GP-PSRL with horizon $H$ and maximum information gain $\gamma_T$, the Bayesian regret bound is sublinear in the number of episodes, at a rate governed by $H$ and $\gamma_T$.

This holds for unbounded state spaces and leverages chaining and concentration inequalities for GP sample paths (Flynn et al., 9 Mar 2026).

Variational Bayesian RL/Boltzmann entropy-regularized RL: for tabular MDPs with $S$ states, $A$ actions, and horizon $H$, K-learning yields a Bayesian regret of

$\tilde{O}(H^{3/2}\sqrt{SAT}),$

where $T$ here counts the total number of elapsed time steps, matching TS up to log factors and unifying risk-seeking utility with entropy-regularized RL (O'Donoghue, 2018).

5. Information-Theoretic Lower and Upper Bounds

A unified information-theoretic perspective establishes matching (up to logarithms) lower and upper bounds for Bayesian regret in terms of the information acquired, measured in bits. For $K$-armed bandits, the regret is bounded in terms of the number of bits gained about the optimal policy; similar scaling holds for linear and general policy spaces (Shufaro et al., 2024). Entropy-constrained versions yield analogous bounds, and for more general decision sets, lower bounds are derived via Fano's inequality and packing arguments. Upper bounds are achieved by TS via the information-ratio principle, reaching the minimax rates with explicit trade-offs between information acquisition and cumulative regret.
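
As one concrete instance of the information-ratio principle, the classical analysis of Thompson sampling by Russo and Van Roy bounds the Bayesian regret of a $K$-armed bandit with rewards in $[0,1]$ by $\sqrt{\frac{1}{2} K\, H(A_*)\, T}$, where $H(A_*)$ is the entropy (in nats) of the prior distribution of the optimal arm. The sketch below only evaluates this closed-form bound. It is not the bits-based bound of the work cited above, and the function names are hypothetical.

```python
import math


def entropy_nats(probs):
    """Shannon entropy in nats; zero-probability outcomes contribute nothing."""
    return sum(p * math.log(1.0 / p) for p in probs if p > 0.0)


def ts_information_ratio_bound(optimal_arm_prior, horizon):
    """sqrt(0.5 * K * H(A_*) * T) for a K-armed bandit with rewards in [0, 1].

    optimal_arm_prior[a] is the prior probability that arm a is optimal.
    """
    k = len(optimal_arm_prior)
    return math.sqrt(0.5 * k * entropy_nats(optimal_arm_prior) * horizon)


if __name__ == "__main__":
    # An uninformative prior over which arm is optimal gives the largest bound.
    print(ts_information_ratio_bound([0.25] * 4, horizon=100))
    # A prior concentrated on one arm leaves less to learn, so the bound shrinks.
    print(ts_information_ratio_bound([0.85, 0.05, 0.05, 0.05], horizon=100))
```

An entropy stated in bits can be converted to nats by multiplying by $\ln 2$ before it is used in this formula.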

6. Bayesian Regret Bounds in Meta- and Hierarchical Bayesian Learning

Meta Bayesian optimization, where the prior is estimated from offline data across multiple tasks, achieves regret converging to the noise level, with the estimation error decaying as the number of offline functions observed grows (Wang et al., 2018).

For hierarchical priors, e.g., Student-$t$ or hierarchical Gaussian, log-loss Bayesian regret bounds are explicit in the parameter regularity and structure:

Student-$t$ prior yields only logarithmic dependence of regret on the magnitude of the parameters, conferring robustness.
Hierarchical Gaussian structures encourage statistical strength sharing across tasks, reducing regret if parameters are similar (Huggins et al., 2015).

7. Robustness to Approximate Inference and Frequentist Regimes

Recent work demonstrates that Bayesian regret guarantees can be retained under bounded approximate inference error, provided certain $\alpha$-divergences between the approximate and true posterior are controlled (both above and below), yielding $O(\log T)$ regret in the bandit setting (Huang et al., 2022). In the frequentist setting, sharper high-probability GP regression error analysis closes the gap between Bayesian and frequentist optimization regret rates, giving a bound that depends on the RKHS norm bound $B$ and the domain dimension $d$ (Wang et al., 2024).

References

"Improved Regret Bounds for Gaussian Process Upper Confidence Bound in Bayesian Optimization" (Iwazaki, 2 Jun 2025)
"Tight Regret Bounds for Bayesian Optimization in One Dimension" (Scarlett, 2018)
"Posterior Sampling-Based Bayesian Optimization with Tighter Bayesian Regret Bounds" (Takeno et al., 2023)
"On Regret Bounds of Thompson Sampling for Bayesian Optimization" (Takeno et al., 10 Mar 2026)
"Bayesian Optimization from Human Feedback: Near-Optimal Regret Bounds" (Kayal et al., 29 May 2025)
"Bayesian Optimisation with Unknown Hyperparameters: Regret Bounds Logarithmically Closer to Optimal" (Ziomek et al., 2024)
"Sub-linear Regret Bounds for Bayesian Optimisation in Unknown Search Spaces" (Tran-The et al., 2020)
"Finite-Time Logarithmic Bayes Regret Upper Bounds" (Atsidakou et al., 2023)
"Bayesian Regret Minimization in Offline Bandits" (Petrik et al., 2023)
"Regret Bounds for Information-Directed Reinforcement Learning" (Hao et al., 2022)
"Improved Bayesian Regret Bounds for Thompson Sampling in Reinforcement Learning" (Moradipari et al., 2023)
"Posterior Sampling Reinforcement Learning with Gaussian Processes for Continuous Control: Sublinear Regret Bounds for Unbounded State Spaces" (Flynn et al., 9 Mar 2026)
"On Bits and Bandits: Quantifying the Regret-Information Trade-off" (Shufaro et al., 2024)
"Risk and Regret of Hierarchical Bayesian Learners" (Huggins et al., 2015)
"Regret bounds for meta Bayesian optimization with an unknown Gaussian process prior" (Wang et al., 2018)
"Optimal Regret Is Achievable with Bounded Approximate Inference Error: An Enhanced Bayesian Upper Confidence Bound Framework" (Huang et al., 2022)
"Variational Bayesian Reinforcement Learning with Regret Bounds" (O'Donoghue, 2018)