
Bayesian Regret Bounds Overview

Updated 21 April 2026
  • Bayesian regret bounds are metrics that quantify the expected excess loss of an algorithm compared to the optimal Bayesian strategy under stochastic uncertainty.
  • They use information-theoretic measures, such as maximum information gain and kernel-dependent rates, to derive tight, sublinear, and often instance-dependent performance guarantees.
  • Applications span multi-armed bandits, Bayesian optimization, and reinforcement learning, where these bounds inform the trade-off between exploration and exploitation.

Bayesian regret bounds quantify the performance gap between a learning or optimization algorithm and the optimal Bayesian strategy, under a stochastic prior on model parameters or reward-generating functions. Across bandit, sequential decision-making, and Bayesian optimization regimes, these bounds express how well an algorithm adapts to the underlying uncertainty, emphasizing the trade-off between acquiring information and exploiting it. Recent developments provide tight, often instance-dependent, Bayesian regret guarantees for bandit, Bayesian optimization, and reinforcement learning settings, under Gaussian process priors, linear models, and broader nonparametric regimes.

1. Definitions and Fundamental Quantities

Bayesian cumulative regret is typically defined as the expectation (w.r.t. the prior over the data-generating mechanism) of the total excess loss or cost incurred by an algorithm relative to the optimal oracle policy after $T$ rounds.

  • Multi-Armed Bandits: For arm set $A=[K]$, underlying parameter $\theta\sim h$, with rewards $\mu_a(\theta)$ and $A_*$ the optimal arm, the Bayes regret is

$$R(n) = \mathbb{E}_{\theta\sim h}\, \mathbb{E}_\pi\left[\sum_{t=1}^n \left(\mu_{A_*}(\theta) - \mu_{A_t}(\theta)\right)\right].$$

  • Bayesian Optimization: For an unknown function $f$ drawn from a GP prior over domain $\mathcal{X}$,

$$R_T = \mathbb{E}\left[\sum_{t=1}^T \left(f(x^*) - f(x_t)\right)\right], \quad x^* = \arg\max_{x\in\mathcal{X}} f(x).$$

  • RL/MDP Setting: For a Bayesian prior over MDPs, Bayes regret is

$$BR(T) = \mathbb{E}_{\mathcal{M}}\left[\sum_{t=1}^N \left(V_1^{*,\mathcal{M}} - V_1^{\pi^t, \mathcal{M}}\right)\right],$$

over $N$ episodes, each of a fixed finite horizon.
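The bandit definition above can be estimated by Monte Carlo: draw the parameter from the prior, run the algorithm, and average the accumulated gaps. The following is a minimal sketch, assuming Bernoulli rewards, independent Beta(1, 1) priors, and Thompson sampling as the algorithm; none of these choices come from the text, and the function name `bayes_regret_thompson` is hypothetical.

```python
import numpy as np

def bayes_regret_thompson(K=5, n=500, n_prior_draws=200, seed=0):
    """Monte Carlo estimate of the Bayes regret R(n) of Thompson sampling.

    Assumptions (illustrative, not from the text): Bernoulli rewards and
    independent Beta(1, 1) priors, so theta ~ h is a uniform vector in [0, 1]^K.
    """
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_prior_draws):
        theta = rng.uniform(size=K)        # outer expectation: theta ~ h
        best = theta.max()                 # mu_{A_*}(theta)
        alpha = np.ones(K)                 # Beta posterior parameters per arm
        beta = np.ones(K)
        regret = 0.0
        for _ in range(n):
            # Sample one mean per arm from the posterior and play the argmax (A_t).
            a = int(np.argmax(rng.beta(alpha, beta)))
            reward = float(rng.random() < theta[a])
            alpha[a] += reward
            beta[a] += 1.0 - reward
            regret += best - theta[a]      # mu_{A_*}(theta) - mu_{A_t}(theta)
        total += regret
    return total / n_prior_draws
```

Calling `bayes_regret_thompson(K=3, n=200, n_prior_draws=50)` returns a single number, the averaged cumulative gap. Because the gap is computed from the true means rather than the noisy rewards, the estimate only carries the randomness of the prior draws and of the algorithm.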

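Key complexity measures that appear in the bounds include the maximum information gain of the kernel in Bayesian optimization, the instance gap and the prior mass on near-optimal arms in bandits, and the Kolmogorov dimension or mutual information of the environment class in RL.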

2. Regret Bounds in Bayesian Optimization

In Bayesian optimization with Gaussian process priors, the canonical regret rate scales as A=[K]A=[K]5, where the sublinear growth (in A=[K]A=[K]6) of A=[K]A=[K]7 is kernel-dependent. For SE (RBF) kernels, A=[K]A=[K]8; for Matérn-A=[K]A=[K]9 kernels with θh\theta\sim h0, improved analyses yield θh\theta\sim h1 in 1D and nearly so in higher dimensions.
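The kernel dependence of the maximum information gain can be probed numerically. For noisy observations at a set of points with kernel matrix $K$ and noise variance $\sigma^2$, the information gain is $\tfrac{1}{2}\log\det(I + \sigma^{-2}K)$, and a greedy rule that always queries the point of largest posterior variance gives a standard approximation to its maximum over point sets. The sketch below assumes a one-dimensional grid, a unit-variance SE kernel, and illustrative values for the length scale and noise; the helper names are hypothetical.

```python
import numpy as np

def rbf_kernel(X, Y, lengthscale=0.2):
    """Unit-variance squared exponential kernel on 1-D inputs."""
    return np.exp(-0.5 * (X[:, None] - Y[None, :]) ** 2 / lengthscale**2)

def information_gain(K, noise_var):
    """0.5 * log det(I + K / noise_var) for the kernel matrix of the queried points."""
    _, logdet = np.linalg.slogdet(np.eye(len(K)) + K / noise_var)
    return 0.5 * logdet

def greedy_information_gain(grid, T, noise_var=0.01, lengthscale=0.2):
    """Information gain after 1..T greedy (maximum posterior variance) queries."""
    chosen, gains = [], []
    for _ in range(T):
        if chosen:
            Xc = grid[chosen]
            Kcc = rbf_kernel(Xc, Xc, lengthscale) + noise_var * np.eye(len(chosen))
            Kgc = rbf_kernel(grid, Xc, lengthscale)
            # Posterior variance on the grid: 1 - diag(Kgc Kcc^{-1} Kgc^T).
            var = 1.0 - np.einsum("ij,ji->i", Kgc, np.linalg.solve(Kcc, Kgc.T))
        else:
            var = np.ones(len(grid))       # prior variance of the unit-variance kernel
        chosen.append(int(np.argmax(var)))
        Xc = grid[chosen]
        gains.append(information_gain(rbf_kernel(Xc, Xc, lengthscale), noise_var))
    return np.array(gains)
```

For example, `greedy_information_gain(np.linspace(0.0, 1.0, 100), 40)` returns a concave, slowly growing sequence, which is the sublinear growth that the regret rates rely on. Rougher kernels would be expected to give faster growth, but this sketch only implements the SE case.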

Main high-probability bounds for GP-UCB:

  • Matérn kernel, θh\theta\sim h2 (with suitable θh\theta\sim h3):

θh\theta\sim h4

This matches the minimax θh\theta\sim h5 lower bound (Iwazaki, 2 Jun 2025, Scarlett, 2018).

  • Squared Exponential kernel:

θh\theta\sim h6

This sharpens earlier bounds by removing extraneous logarithmic factors through algorithm-dependent, local information-gain bounds rather than worst-case global ones (Iwazaki, 2 Jun 2025).

The technical novelty involves decomposing regret into regions of the search space and exploiting the geometric concentration of query points near the maximizer, allowing much tighter information gain evaluations.
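To make the object of these bounds concrete, the following sketch runs GP-UCB on a single function drawn from a GP prior and records the cumulative regret. It is a simplified illustration, not the analysis or the exact algorithm of the cited papers: the finite grid, the SE kernel, the length scale, and in particular the confidence schedule `beta_t` are assumptions chosen for readability.

```python
import numpy as np

def rbf(X, Y, ls=0.2):
    """Unit-variance squared exponential kernel on 1-D inputs."""
    return np.exp(-0.5 * (X[:, None] - Y[None, :]) ** 2 / ls**2)

def gp_ucb_regret(T=60, n_grid=200, noise_std=0.1, seed=0):
    """Cumulative regret of GP-UCB on one prior sample path over a grid."""
    rng = np.random.default_rng(seed)
    grid = np.linspace(0.0, 1.0, n_grid)
    K = rbf(grid, grid)
    # Draw f from the GP prior (jitter keeps the Cholesky factorization stable).
    f = np.linalg.cholesky(K + 1e-6 * np.eye(n_grid)) @ rng.standard_normal(n_grid)
    f_star = f.max()
    idx, ys, regret = [], [], []
    for t in range(1, T + 1):
        if idx:
            Kxx = K[np.ix_(idx, idx)] + noise_std**2 * np.eye(len(idx))
            Kgx = K[:, idx]
            sol = np.linalg.solve(Kxx, Kgx.T)           # Kxx^{-1} Kgx^T
            mean = sol.T @ np.array(ys)                 # posterior mean on the grid
            var = np.clip(1.0 - np.sum(Kgx * sol.T, axis=1), 0.0, None)
        else:
            mean, var = np.zeros(n_grid), np.ones(n_grid)
        beta_t = 2.0 * np.log(n_grid * t**2)            # assumed schedule, illustrative only
        i = int(np.argmax(mean + np.sqrt(beta_t * var)))  # upper confidence bound rule
        idx.append(i)
        ys.append(f[i] + noise_std * rng.standard_normal())
        regret.append(f_star - f[i])                    # f(x^*) - f(x_t)
    return np.cumsum(regret)
```

Averaging the final entry of `gp_ucb_regret(...)` over many seeds would give a Monte Carlo estimate of the Bayesian cumulative regret $R_T$ for this toy setting.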

For Thompson Sampling (TS) in BO, Bayesian regret bounds are studied in (Takeno et al., 2023, Takeno et al., 10 Mar 2026).

In preference-based (dueling) BO with only pairwise comparisons, the MR-LPF algorithm achieves

μa(θ)\mu_a(\theta)0

where μa(θ)\mu_a(\theta)1 is the information gain of the dueling kernel. This matches scalar-feedback BO, showing that pairwise-feedback sample complexity is not increased (Kayal et al., 29 May 2025).

For unknown hyperparameters (e.g., lengthscales), the Length-Scale Balancing (LB) algorithm achieves regret only μa(θ)\mu_a(\theta)2 away from the oracle with optimally chosen hyperparameter, eliminating the polynomial factor penalty of A-GP-UCB (Ziomek et al., 2024).

In Bayesian optimization over unknown domains, regret remains sublinear with high probability provided the search volume is increased at a hyperharmonic rate: μa(θ)\mu_a(\theta)3 set by the rate of volume expansion and kernel properties (Tran-The et al., 2020).

3. Bandit and Linear Bandit Settings

For Bayesian multi-armed and linear bandit problems, recent advances yield finite-time, gap- and prior-dependent logarithmic regret bounds.

  • Gap-dependent (Bayesian UCB):

μa(θ)\mu_a(\theta)4

where μa(θ)\mu_a(\theta)5 is the instance gap (Atsidakou et al., 2023).

  • Prior-dependent:

μa(θ)\mu_a(\theta)6

where μa(θ)\mu_a(\theta)7 depends on the prior's mass on near-optimal arms. This matches Lai's asymptotic lower bound (Atsidakou et al., 2023).
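A Bayesian UCB rule of the kind these bounds concern can be sketched for a Gaussian bandit, where the posterior is available in closed form. The code below assumes a zero-mean Gaussian prior on each arm, Gaussian reward noise with known variance, and a confidence level `delta = 1/n`; these are illustrative assumptions, and the index used here is a generic posterior-based upper confidence bound rather than the exact form analyzed by Atsidakou et al.

```python
import numpy as np

def bayes_ucb_regret(K=5, n=1000, prior_var=1.0, noise_var=1.0,
                     n_prior_draws=100, seed=0):
    """Monte Carlo Bayes regret of a posterior-based UCB rule (Gaussian bandit)."""
    rng = np.random.default_rng(seed)
    delta = 1.0 / n                                   # assumed confidence level
    total = 0.0
    for _ in range(n_prior_draws):
        theta = rng.normal(0.0, np.sqrt(prior_var), size=K)   # theta ~ h
        best = theta.max()
        counts = np.zeros(K)
        sums = np.zeros(K)
        regret = 0.0
        for _ in range(n):
            # Conjugate Gaussian posterior for each arm mean (prior mean 0).
            post_var = 1.0 / (1.0 / prior_var + counts / noise_var)
            post_mean = post_var * sums / noise_var
            index = post_mean + np.sqrt(2.0 * post_var * np.log(1.0 / delta))
            a = int(np.argmax(index))
            sums[a] += theta[a] + np.sqrt(noise_var) * rng.standard_normal()
            counts[a] += 1
            regret += best - theta[a]
        total += regret
    return total / n_prior_draws
```

Running the function for several values of `n` and plotting the result against `log(n)` is one way to look for the logarithmic growth described above, although a small simulation cannot confirm the constants.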

For offline Bayesian linear bandits, the high-confidence Bayes regret (VaR) can be tightly bounded via convex conic optimization: μa(θ)\mu_a(\theta)8 which is shown to be essentially optimal (Petrik et al., 2023). Algorithms that minimize such certificates strictly outperform those based on pessimistic lower confidence bounds.

4. Reinforcement Learning: Bayesian Regret in MDPs

Bayesian regret in RL/MDPs is governed by the structure of uncertainty over the environment class, the horizon μa(θ)\mu_a(\theta)9, and the environment's statistical covering (via Kolmogorov AA_*0-dimension or mutual information over rate-distortion partitions).

Key results:

  • Information-directed RL (IDS), for tabular finite-horizon MDPs with AA_*1 states, AA_*2 actions, and horizon AA_*3:

AA_*4

By learning a less-informative surrogate environment (via rate-distortion partitions), this is improved to

AA_*5

with matching instance-dependent certificates (Hao et al., 2022).

  • Thompson Sampling in RL (MDPs with Kolmogorov AA_*6-dimension AA_*7):

AA_*8

with concrete AA_*9 for tabular (R(n)=EθhEπ[t=1n(μA(θ)μAt(θ))].R(n) = \mathbb{E}_{\theta\sim h} \mathbb{E}_\pi\left[\sum_{t=1}^n (\mu_{A_*}(\theta) - \mu_{A_t}(\theta))\right].0), linear function approximation, or finite-mixture models (Moradipari et al., 2023).

  • GP-based RL (continuous control with Gaussian process prior): for GP-PSRL with horizon R(n)=EθhEπ[t=1n(μA(θ)μAt(θ))].R(n) = \mathbb{E}_{\theta\sim h} \mathbb{E}_\pi\left[\sum_{t=1}^n (\mu_{A_*}(\theta) - \mu_{A_t}(\theta))\right].1 and maximum information gain R(n)=EθhEπ[t=1n(μA(θ)μAt(θ))].R(n) = \mathbb{E}_{\theta\sim h} \mathbb{E}_\pi\left[\sum_{t=1}^n (\mu_{A_*}(\theta) - \mu_{A_t}(\theta))\right].2,

R(n)=EθhEπ[t=1n(μA(θ)μAt(θ))].R(n) = \mathbb{E}_{\theta\sim h} \mathbb{E}_\pi\left[\sum_{t=1}^n (\mu_{A_*}(\theta) - \mu_{A_t}(\theta))\right].3

This holds for unbounded state spaces and leverages chaining and concentration inequalities for GP sample paths (Flynn et al., 9 Mar 2026).

  • Variational Bayesian RL/Boltzmann entropy-regularized RL: K-learning yields

R(n)=EθhEπ[t=1n(μA(θ)μAt(θ))].R(n) = \mathbb{E}_{\theta\sim h} \mathbb{E}_\pi\left[\sum_{t=1}^n (\mu_{A_*}(\theta) - \mu_{A_t}(\theta))\right].4

matching TS up to log factors and unifying risk-seeking utility with entropy-regularized RL (O'Donoghue, 2018).
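As an illustration of the RL definition of Bayes regret, the sketch below runs posterior sampling (Thompson sampling) in a small tabular finite-horizon MDP and accumulates the per-episode value gap. It assumes known rewards, independent Dirichlet(1) priors on each transition row, and a fixed start state 0; these choices and the function names are hypothetical and are not taken from the cited papers.

```python
import numpy as np

def value_iteration(P, R, H):
    """Finite-horizon planning. P: (S, A, S), R: (S, A). Returns policy (H, S) and V_1 (S,)."""
    S, _ = R.shape
    V = np.zeros(S)
    policy = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        Q = R + P @ V                       # (S, A) action values at step h
        policy[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return policy, V

def policy_value(P, R, H, policy):
    """Value V_1 of a fixed step-dependent policy in the MDP (P, R)."""
    S = R.shape[0]
    states = np.arange(S)
    V = np.zeros(S)
    for h in reversed(range(H)):
        a = policy[h]
        V = R[states, a] + P[states, a] @ V
    return V

def psrl_bayes_regret(S=4, A=2, H=5, N=50, n_prior_draws=20, seed=0):
    """Monte Carlo Bayes regret of posterior sampling over N episodes."""
    rng = np.random.default_rng(seed)
    R = rng.uniform(size=(S, A))            # known rewards in [0, 1] (assumption)
    total = 0.0
    for _ in range(n_prior_draws):
        P_true = rng.dirichlet(np.ones(S), size=(S, A))     # M ~ prior
        _, V_star = value_iteration(P_true, R, H)
        counts = np.ones((S, A, S))         # Dirichlet(1) prior pseudo-counts
        for _ in range(N):
            # Sample an MDP from the posterior and act optimally for it.
            P_samp = np.array([[rng.dirichlet(counts[s, a]) for a in range(A)]
                               for s in range(S)])
            policy, _ = value_iteration(P_samp, R, H)
            total += V_star[0] - policy_value(P_true, R, H, policy)[0]
            s = 0
            for h in range(H):              # roll out one episode in the true MDP
                a = policy[h, s]
                s_next = rng.choice(S, p=P_true[s, a])
                counts[s, a, s_next] += 1
                s = s_next
    return total / n_prior_draws
```

The regret term uses exact policy evaluation in the true MDP, so the only sampling noise comes from the prior draws, the posterior samples, and the observed transitions.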

5. Information-Theoretic Lower and Upper Bounds

A unified information-theoretic perspective establishes matching (up to logarithms) lower and upper bounds for Bayesian regret in terms of information acquired (measured in bits). For R(n)=EθhEπ[t=1n(μA(θ)μAt(θ))].R(n) = \mathbb{E}_{\theta\sim h} \mathbb{E}_\pi\left[\sum_{t=1}^n (\mu_{A_*}(\theta) - \mu_{A_t}(\theta))\right].5 bits gained about the optimal policy,

R(n)=EθhEπ[t=1n(μA(θ)μAt(θ))].R(n) = \mathbb{E}_{\theta\sim h} \mathbb{E}_\pi\left[\sum_{t=1}^n (\mu_{A_*}(\theta) - \mu_{A_t}(\theta))\right].6

for R(n)=EθhEπ[t=1n(μA(θ)μAt(θ))].R(n) = \mathbb{E}_{\theta\sim h} \mathbb{E}_\pi\left[\sum_{t=1}^n (\mu_{A_*}(\theta) - \mu_{A_t}(\theta))\right].7-armed bandits; similar scaling holds for linear and general policy spaces (Shufaro et al., 2024). Entropy-constrained versions yield

R(n)=EθhEπ[t=1n(μA(θ)μAt(θ))].R(n) = \mathbb{E}_{\theta\sim h} \mathbb{E}_\pi\left[\sum_{t=1}^n (\mu_{A_*}(\theta) - \mu_{A_t}(\theta))\right].8

and for more general decision sets, lower bounds are derived via Fano's inequality and packing arguments. Upper bounds are achieved by TS via the information-ratio principle, reaching the minimax rates with explicit trade-offs between information acquisition and cumulative regret.
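The information-ratio principle can be illustrated with a widely known bound that is external to the works cited here: in the analysis of Russo and Van Roy, Thompson sampling on a $K$-armed bandit with rewards in $[0,1]$ has Bayesian regret at most $\sqrt{\tfrac{1}{2} K\, H(A_*)\, T}$, where $H(A_*)$ is the entropy (in nats) of the prior distribution of the optimal arm. The sketch below simply evaluates that expression; the function names are hypothetical, and the bound is shown as an example of the regret-information scaling rather than as the result of Shufaro et al.

```python
import numpy as np

def entropy_nats(p):
    """Shannon entropy in nats of a probability vector (zero entries ignored)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def ts_information_ratio_bound(prior_over_optimal_arm, T):
    """Evaluate sqrt(0.5 * K * H(A_*) * T).

    Assumes rewards in [0, 1] and that the input is the prior probability
    of each arm being optimal (length K).
    """
    K = len(prior_over_optimal_arm)
    return float(np.sqrt(0.5 * K * entropy_nats(prior_over_optimal_arm) * T))
```

The bound shrinks as the prior becomes more informative about which arm is optimal: a uniform prior over the optimal arm gives the largest entropy, while a prior that already identifies the optimal arm gives a bound of zero.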

6. Bayesian Regret Bounds in Meta- and Hierarchical Bayesian Learning

Meta Bayesian optimization, where the prior is estimated from offline data across multiple tasks, achieves regret converging to the noise level, with the estimation error decaying as R(n)=EθhEπ[t=1n(μA(θ)μAt(θ))].R(n) = \mathbb{E}_{\theta\sim h} \mathbb{E}_\pi\left[\sum_{t=1}^n (\mu_{A_*}(\theta) - \mu_{A_t}(\theta))\right].9, where ff0 is the number of offline functions observed (Wang et al., 2018).

For hierarchical priors, e.g., Student-ff1 or hierarchical Gaussian, log-loss Bayesian regret bounds are explicit in the parameter regularity and structure:

  • Student-ff2 prior yields only logarithmic dependence of regret on ff3, conferring robustness.
  • Hierarchical Gaussian structures encourage statistical strength sharing across tasks, reducing regret if parameters are similar (Huggins et al., 2015).

7. Robustness to Approximate Inference and Frequentist Regimes

Recent work demonstrates that Bayesian regret guarantees can be retained under bounded approximate inference error, provided certain ff4-divergences between the approximate and true posterior are controlled (both above and below), yielding ff5 regret in the bandit setting (Huang et al., 2022). In the frequentist setting, sharper high-probability GP regression error analysis closes the gap between Bayesian and frequentist optimization regret rates, giving

ff6

where ff7 is the RKHS bound and ff8 domain dimension (Wang et al., 2024).


References

  • "Improved Regret Bounds for Gaussian Process Upper Confidence Bound in Bayesian Optimization" (Iwazaki, 2 Jun 2025)
  • "Tight Regret Bounds for Bayesian Optimization in One Dimension" (Scarlett, 2018)
  • "Posterior Sampling-Based Bayesian Optimization with Tighter Bayesian Regret Bounds" (Takeno et al., 2023)
  • "On Regret Bounds of Thompson Sampling for Bayesian Optimization" (Takeno et al., 10 Mar 2026)
  • "Bayesian Optimization from Human Feedback: Near-Optimal Regret Bounds" (Kayal et al., 29 May 2025)
  • "Bayesian Optimisation with Unknown Hyperparameters: Regret Bounds Logarithmically Closer to Optimal" (Ziomek et al., 2024)
  • "Sub-linear Regret Bounds for Bayesian Optimisation in Unknown Search Spaces" (Tran-The et al., 2020)
  • "Finite-Time Logarithmic Bayes Regret Upper Bounds" (Atsidakou et al., 2023)
  • "Bayesian Regret Minimization in Offline Bandits" (Petrik et al., 2023)
  • "Regret Bounds for Information-Directed Reinforcement Learning" (Hao et al., 2022)
  • "Improved Bayesian Regret Bounds for Thompson Sampling in Reinforcement Learning" (Moradipari et al., 2023)
  • "Posterior Sampling Reinforcement Learning with Gaussian Processes for Continuous Control: Sublinear Regret Bounds for Unbounded State Spaces" (Flynn et al., 9 Mar 2026)
  • "On Bits and Bandits: Quantifying the Regret-Information Trade-off" (Shufaro et al., 2024)
  • "Risk and Regret of Hierarchical Bayesian Learners" (Huggins et al., 2015)
  • "Regret bounds for meta Bayesian optimization with an unknown Gaussian process prior" (Wang et al., 2018)
  • "Optimal Regret Is Achievable with Bounded Approximate Inference Error: An Enhanced Bayesian Upper Confidence Bound Framework" (Huang et al., 2022)
  • "Variational Bayesian Reinforcement Learning with Regret Bounds" (O'Donoghue, 2018)