
GKB-UCB: Unified Kernelized Bandits

Updated 3 July 2026
  • GKB-UCB is an algorithmic framework that generalizes kernelized and generalized linear bandits using novel self-normalized Bernstein inequalities.
  • It employs a variance-adaptive, dimension-free concentration inequality to achieve sharp high-probability regret bounds in RKHS settings.
  • The framework integrates maximum likelihood estimation, confidence set construction, and optimistic arm selection for efficient decision-making.

Generalized Kernelized Bandits Upper Confidence Bound (GKB-UCB) is an algorithmic framework for regret minimization in a setting where a learner optimizes an unknown reward function $f^*$ belonging to a Reproducing Kernel Hilbert Space (RKHS) and observes outcomes corrupted by exponential family (EF) noise, with a mean response $\mu(f^*)$ obtained by passing $f^*$ through a general nonlinear link. This model strictly generalizes both kernelized bandits (KBs) and generalized linear bandits (GLBs), providing a unified analytical and algorithmic treatment. GKB-UCB employs a novel self-normalized, Bernstein-type, dimension-free concentration inequality to derive sharp high-probability regret bounds, overcoming the obstacles posed by heteroscedastic, non-Gaussian, and potentially infinite-dimensional settings (Metelli et al., 3 Aug 2025).

1. Problem Formulation

The decision domain is a (possibly infinite) subset $\mathcal{X}\subseteq\mathbb{R}^d$. At each round $t=1,\dots,T$, the learner selects $x_t\in\mathcal{X}$ and observes stochastic feedback $y_t$ drawn according to an exponential family model,

$$y_t \sim p(y \mid x_t; f^*) = \exp\left(\frac{y f^*(x_t) - m(f^*(x_t))}{g} + h(y)\right),$$

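where $m$ is a convex log-partition function, $g>0$ a known scale, and $h$ a normalization function. The reward mean and variance at a point $x\in\mathcal{X}$ satisfy

$$\mathbb{E}[y \mid x] = m'(f^*(x)) = \mu(f^*(x)), \qquad \mathrm{Var}[y \mid x] = g\, m''(f^*(x)).$$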
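As a quick sanity check of the mean and variance identities for this model, the sketch below differentiates two standard log-partition functions numerically: the Bernoulli case $m(z)=\log(1+e^z)$, whose mean is the logistic link, and the Gaussian case $m(z)=z^2/2$, whose mean is the identity. The function names and the finite-difference step are illustrative assumptions, not part of the source.

```python
import math

def bernoulli_log_partition(z: float) -> float:
    """m(z) = log(1 + e^z), written in a numerically stable form."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def gaussian_log_partition(z: float) -> float:
    """m(z) = z^2 / 2, which yields the identity mean function."""
    return 0.5 * z * z

def sigmoid(z: float) -> float:
    """Logistic function, equal to m'(z) in the Bernoulli case."""
    return 0.5 * (1.0 + math.tanh(0.5 * z))

def mean_and_variance(m, z: float, g: float = 1.0, eps: float = 1e-4):
    """Return (m'(z), g * m''(z)) using central finite differences."""
    first = (m(z + eps) - m(z - eps)) / (2.0 * eps)
    second = (m(z + eps) - 2.0 * m(z) + m(z - eps)) / (eps * eps)
    return first, g * second

if __name__ == "__main__":
    z = 0.3  # stands in for the latent value f*(x) at some action x
    print(mean_and_variance(bernoulli_log_partition, z))
    print(mean_and_variance(gaussian_log_partition, z, g=2.0))
```

In the Bernoulli case the variance $\mu(1-\mu)$ changes with the latent value, which is the heteroscedasticity the framework has to handle; in the Gaussian case it is constant and equal to $g$.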

Key assumptions include a bounded RKHS norm of $f^*$, a bounded kernel, bounded noise, and self-concordance of the link. The regret is measured by

$$R(T) = \sum_{t=1}^{T} \left[\mu(f^*(x^*)) - \mu(f^*(x_t))\right],$$

with $x^* \in \arg\max_{x\in\mathcal{X}} \mu(f^*(x))$.
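The regret measure can be made concrete with a small sketch that accumulates the gap between the best achievable mean reward and the mean reward of each played action. It assumes a finite candidate domain so that the optimum can be found by enumeration; `cumulative_regret` and the toy reward function are hypothetical names introduced only for illustration.

```python
import math
from typing import Callable, List, Sequence

def sigmoid(z: float) -> float:
    """Logistic mean function, used here as an example link."""
    return 0.5 * (1.0 + math.tanh(0.5 * z))

def cumulative_regret(
    f_star: Callable[[float], float],
    mu: Callable[[float], float],
    actions: Sequence[float],
    domain: Sequence[float],
) -> List[float]:
    """Running sum of mu(f*(x*)) - mu(f*(x_t)) over the played actions."""
    best = max(mu(f_star(x)) for x in domain)  # optimum by enumeration
    total = 0.0
    curve = []
    for x in actions:
        total += best - mu(f_star(x))
        curve.append(total)
    return curve

if __name__ == "__main__":
    domain = [0.0, 0.5, 1.0]
    f_star = lambda x: 2.0 * x - 1.0  # toy latent reward, an assumption
    print(cumulative_regret(f_star, sigmoid, [0.0, 0.5, 1.0, 1.0], domain))
```

Once the learner plays only the optimal action, the curve stops growing, which is what a sublinear regret bound captures asymptotically.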

A central parameter quantifies the maximum curvature (slackness) of the reward mean function over the domain.

2. Self-Normalized Bernstein-Type, Dimension-Free Inequality

The exponential family structure induces noise whose variance depends on $f^*(x_t)$. Standard self-normalized concentration results either do not leverage the variance structure (Hoeffding-type, and thus overly pessimistic) or become dimension-dependent (ill-suited to an infinite-dimensional RKHS).

GKB-UCB establishes a Freedman-style martingale concentration result that is both variance-adaptive (Bernstein-like) and dimension-free, using a stitching argument across variance levels. For a martingale sequence with bounded absolute value and given conditional variances, the result controls, with high probability and simultaneously over all rounds, the self-normalized noise term measured in a norm induced by a data-dependent weighted matrix. The resulting width is only logarithmic in the relevant quantities and in the kernel/scale parameters. Critically, the bound is independent of the ambient or feature-space dimension.
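To see why a variance-adaptive (Bernstein-type) width can be much tighter than a Hoeffding-type one, the sketch below compares the two classical scalar bounds for a sum of `n` independent zero-mean terms bounded in absolute value by `R`, each with variance `sigma2`. This is a textbook scalar illustration under those assumptions, not the self-normalized RKHS inequality of the source, and all names are hypothetical.

```python
import math

def hoeffding_width(n: int, R: float, delta: float) -> float:
    """Deviation bound for a sum of n terms in [-R, R]; ignores the variance."""
    return R * math.sqrt(2.0 * n * math.log(1.0 / delta))

def bernstein_width(n: int, R: float, sigma2: float, delta: float) -> float:
    """Deviation bound for the same sum that adapts to the variance sigma2."""
    log_term = math.log(1.0 / delta)
    return math.sqrt(2.0 * n * sigma2 * log_term) + (2.0 * R / 3.0) * log_term

if __name__ == "__main__":
    n, R, delta = 1000, 1.0, 0.05
    for sigma2 in (0.01, 0.25, 1.0):
        print(sigma2, hoeffding_width(n, R, delta),
              bernstein_width(n, R, sigma2, delta))
```

When the variance is small relative to `R**2`, the Bernstein width is dominated by the square-root variance term and shrinks accordingly, while the Hoeffding width stays fixed. When the variance is as large as `R**2`, the Bernstein width offers no gain.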

3. Algorithmic Structure

At each round $t$, GKB-UCB performs:

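(a) Maximum Likelihood Estimation: Compute a regularized maximum likelihood estimate of $f^*$ over the RKHS from the actions and outcomes observed so far.

By the representer theorem, this reduces to a finite-dimensional problem with one coefficient per past action.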
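A minimal sketch of such a finite-dimensional fit is given below for Bernoulli feedback. It assumes the objective is the negative log-likelihood plus a penalty $\frac{\lambda}{2}\|f\|^2$ in the RKHS norm, an RBF kernel, one-dimensional actions, and a plain Newton iteration without safeguards. These choices, and every function name, are illustrative assumptions rather than the procedure of the source.

```python
import numpy as np

def rbf_kernel(A: np.ndarray, B: np.ndarray, lengthscale: float = 0.3) -> np.ndarray:
    """RBF Gram matrix between two arrays of 1-D inputs."""
    sq = (A[:, None] - B[None, :]) ** 2
    return np.exp(-0.5 * sq / lengthscale**2)

def sigmoid(z: np.ndarray) -> np.ndarray:
    """Logistic mean function mu = m' for Bernoulli feedback."""
    return 0.5 * (1.0 + np.tanh(0.5 * z))

def fit_kernel_mle(K, y, lam=1.0, g=1.0, iters=50, tol=1e-10):
    """Newton iterations on the stationarity condition
    (mu(K alpha) - y) / g + lam * alpha = 0, with f(x_s) = (K alpha)_s."""
    n = len(y)
    alpha = np.zeros(n)
    for _ in range(iters):
        mu = sigmoid(K @ alpha)
        residual = (mu - y) / g + lam * alpha
        if np.linalg.norm(residual) < tol:
            break
        W = mu * (1.0 - mu)  # derivative of the mean function at each point
        J = (W[:, None] * K) / g + lam * np.eye(n)  # Jacobian of the residual
        alpha = alpha - np.linalg.solve(J, residual)
    return alpha

def predict_mean(alpha, X_train, X_new, lengthscale=0.3):
    """Estimated mean reward mu(f_hat(x)) at new inputs."""
    return sigmoid(rbf_kernel(X_new, X_train, lengthscale) @ alpha)

if __name__ == "__main__":
    X = np.linspace(0.0, 1.0, 8)
    y = np.array([0, 0, 0, 1, 0, 1, 1, 1], dtype=float)
    alpha = fit_kernel_mle(rbf_kernel(X, X), y)
    print(predict_mean(alpha, X, np.array([0.1, 0.9])))
```

The coefficient vector has one entry per observation, so the linear solve in each Newton step grows cubically with the number of past actions unless an approximation is used.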

(b) Statistical Confidence Set Construction: The confidence set contains every $f$ (or, equivalently, every associated coefficient vector) whose deviation from the estimate, measured in a data-dependent weighted norm, stays within a radius given by the novel concentration bound.

(c) Optimistic Arm Selection: Play

$$x_t \in \arg\max_{x\in\mathcal{X}} \max_{f} \mu(f(x)),$$

where the inner maximum ranges over the functions $f$ in the confidence set. Since $\mu$ is nondecreasing, this is equivalent to maximizing $f(x)$ over the confidence set.
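The optimistic step can be sketched for a finite set of candidate arms as follows. The exploration width used here is the standard kernel-ridge (GP-UCB-style) width, which is a deliberately swapped-in, homoscedastic stand-in for the variance-weighted confidence set of the source. The multiplier `beta`, the RBF kernel, and all function names are assumptions made for illustration.

```python
import numpy as np

def rbf_kernel(A: np.ndarray, B: np.ndarray, lengthscale: float = 0.3) -> np.ndarray:
    """RBF Gram matrix between two arrays of 1-D inputs."""
    sq = (A[:, None] - B[None, :]) ** 2
    return np.exp(-0.5 * sq / lengthscale**2)

def kernel_width(K, K_cand, k_diag, lam=1.0):
    """sqrt(k(x,x) - k_x^T (K + lam I)^{-1} k_x) for every candidate x."""
    n = K.shape[0]
    solved = np.linalg.solve(K + lam * np.eye(n), K_cand.T)  # shape (n, c)
    var = k_diag - np.sum(K_cand.T * solved, axis=0)
    return np.sqrt(np.maximum(var, 0.0))  # clip tiny negative round-off

def select_optimistic_arm(f_hat, width, beta):
    """Index of the candidate with the largest upper bound on f.
    Because mu is nondecreasing, this also maximizes mu(upper bound)."""
    return int(np.argmax(f_hat + beta * width))

if __name__ == "__main__":
    X_past = np.array([0.0, 0.5])        # actions already played
    cand = np.array([0.0, 0.25, 1.0])    # candidate arms
    K = rbf_kernel(X_past, X_past)
    K_cand = rbf_kernel(cand, X_past)
    width = kernel_width(K, K_cand, np.ones(len(cand)), lam=0.1)
    f_hat = np.zeros(len(cand))          # placeholder estimate of f at candidates
    print(width, select_optimistic_arm(f_hat, width, beta=1.0))
```

With a flat estimate, the rule picks the candidate farthest from past actions, since its width is largest; a non-trivial estimate trades this exploration bonus against the predicted reward.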

Practical implementations employ finite-dimensional approximations and loss-based formulations (Appendix A of Metelli et al., 3 Aug 2025).

4. Regret Analysis and Theoretical Guarantees

The regret decomposition relies on the event that $f^*$ lies in the confidence set at every round. The analysis yields a high-probability regret bound expressed through three quantities: a term collecting supremal bounds on the confidence width, the maximal information gain, and a negligible second-order term arising from self-concordance. This matches, up to logarithmic and multiplicative factors, the minimax rates for both the KB and GLB regimes. The analysis requires precise control of the elliptical potential via the new Bernstein-type bound, which ensures optimal dependence on the key problem quantities.

5. Unification of Kernelized and Generalized Linear Bandits

GKB-UCB recovers and unifies prior literature:

  • Kernelized Bandits: With the identity link, so that the mean response is $f^*$ itself, the confidence ellipsoid reduces to the classical Gaussian process bandit structure, and the regret bound specializes to the corresponding kernelized-bandit rate.
  • Generalized Linear Bandits: With a linear kernel and finite dimension $d$, the maximal information gain scales as $\mathcal{O}(d\log T)$, and the regret matches established GLB results.

GKB-UCB thus subsumes both classes, providing a unified approach for nonparametric, nonlinear, and heteroscedastic bandit models.

6. Implementation Considerations

At each iteration, GKB-UCB requires solving:

  1. A finite-dimensional maximum likelihood problem in the coefficient vector;
  2. A convex (typically quadratic) program to select the optimistic arm, subject to a loss-based confidence region.

Key hyperparameters are the regularization parameter and the Freedman-stitching parameters.

The resulting computational complexity per round remains manageable for moderate horizons $T$ but scales with the number of past actions.

7. Significance and Scope

The introduction of GKB-UCB and its associated concentration theory closes an open problem regarding dimension-free, variance-aware self-normalized inequalities in RKHS-valued, EF-noise bandit processes. The method offers a robust, theoretically sound, and practically relevant solution for bandit learning under highly flexible reward models, smoothly interpolating between classical linear/GLB and nonparametric/kernelized scenarios (Metelli et al., 3 Aug 2025).
