GKB-UCB: Unified Kernelized Bandits

Updated 3 July 2026

GKB-UCB is an algorithmic framework that generalizes kernelized and generalized linear bandits using novel self-normalized Bernstein inequalities.
It employs a variance-adaptive, dimension-free concentration inequality to achieve sharp high-probability regret bounds in RKHS settings.
The framework integrates maximum likelihood estimation, confidence set construction, and optimistic arm selection for efficient decision-making.

Generalized Kernelized Bandits Upper Confidence Bound (GKB-UCB) is an algorithmic framework for regret minimization in the setting where a learner optimizes an unknown reward function $f^*$ belonging to a Reproducing Kernel Hilbert Space (RKHS) and observes outcomes corrupted by exponential family (EF) noise whose mean is $\mu(f^*)$, where $\mu$ is a general nonlinear link. This model strictly generalizes both kernelized bandits (KBs) and generalized linear bandits (GLBs), providing a unified analytical and algorithmic treatment. GKB-UCB employs a novel self-normalized, Bernstein-type, dimension-free concentration inequality to derive sharp high-probability regret bounds, overcoming the obstacles posed by heteroscedastic, non-Gaussian, and potentially infinite-dimensional settings (Metelli et al., 3 Aug 2025).

1. Problem Formulation

The decision domain is a subset $\mathcal{X}\subseteq\mathbb{R}^d$ (possibly infinite). At each round $t=1,\dots,T$, the learner selects $x_t\in\mathcal{X}$ and observes stochastic feedback $y_t$ drawn according to an exponential family model,

$y_t \sim p(y \mid x_t; f^*) = \exp\left(\frac{y f^*(x_t) - m(f^*(x_t))}{g} + h(y)\right),$

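where $m$ is a convex log-partition function, $g>0$ a known scale, and $h$ a normalization function. The reward mean and variance at $x_t$ satisfy

$\mathbb{E}[y_t \mid x_t] = m'(f^*(x_t)) = \mu(f^*(x_t)), \qquad \mathrm{Var}[y_t \mid x_t] = g\, m''(f^*(x_t)) = g\,\dot{\mu}(f^*(x_t)).$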
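As a sanity check of the mean and variance relations, the sketch below takes the Bernoulli case as an assumed example (log-partition $m(z)=\log(1+e^z)$, $g=1$, $h\equiv 0$) and verifies numerically that the mean equals $m'(z)$ and the variance equals $g\,m''(z)$, which are the standard exponential-family identities. The value `z` stands in for $f^*(x_t)$ and is hypothetical.

```python
import numpy as np

def log_partition(z):
    # Bernoulli log-partition m(z) = log(1 + e^z), computed stably.
    return np.logaddexp(0.0, z)

def link_mean(z):
    # mu(z) = m'(z): the logistic sigmoid.
    return 1.0 / (1.0 + np.exp(-z))

def link_slope(z):
    # mu'(z) = m''(z) = mu(z) * (1 - mu(z)).
    p = link_mean(z)
    return p * (1.0 - p)

g = 1.0   # scale parameter (equal to 1 for the Bernoulli family)
z = 0.7   # hypothetical value of f*(x_t)

# Evaluate the density exp((y z - m(z)) / g + h(y)) on the support {0, 1}, with h = 0.
ys = np.array([0.0, 1.0])
probs = np.exp((ys * z - log_partition(z)) / g)

mean = np.sum(ys * probs)
var = np.sum((ys - mean) ** 2 * probs)

print(mean, link_mean(z))        # both are the mean reward
print(var, g * link_slope(z))    # both are the reward variance
```

The same check can be repeated for other families (for instance Poisson, with $m(z)=e^z$) by swapping the log-partition function and summing over a truncated support.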

Key assumptions include a bounded RKHS norm of $f^*$, a bounded kernel, bounded noise, and self-concordance of the link $\mu$. The regret is measured by

$R(T) = \sum_{t=1}^{T} \left[\mu(f^*(x^*)) - \mu(f^*(x_t))\right],$

with $x^* \in \arg\max_{x\in\mathcal{X}} \mu(f^*(x))$.
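A minimal sketch of how such a regret can be computed in simulation, assuming the usual definition (the cumulative gap between the best achievable mean reward and the mean reward of the chosen actions), a finite arm set, and a logistic link. The function name `cumulative_regret` and the arm values are illustrative, not taken from the source.

```python
import numpy as np

def cumulative_regret(f_star_values, chosen, mu):
    """Cumulative regret of a sequence of arm indices over a finite arm set."""
    means = mu(np.asarray(f_star_values, dtype=float))  # mean reward of every arm
    best = means.max()                                  # mean reward of the optimal arm
    return np.cumsum(best - means[np.asarray(chosen)])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

f_star_values = np.array([-1.0, 0.0, 2.0])  # hypothetical values of f* on three arms
chosen = [0, 1, 2, 2]                       # arms pulled in rounds 1..4
regret = cumulative_regret(f_star_values, chosen, sigmoid)
print(regret)
```

Once the learner settles on the optimal arm (index 2 here), the cumulative regret stops growing.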

A central parameter is a constant that quantifies the maximum curvature (slackness) of the reward mean function over the domain $\mathcal{X}$.

2. Self-Normalized Bernstein-Type, Dimension-Free Inequality

The exponential family structure induces noise whose variance depends on the selected action through $f^*(x_t)$. Standard self-normalized concentration results either do not leverage the variance structure (Hoeffding-type, thus overly pessimistic) or become dimension-dependent (ill-suited for infinite-dimensional RKHSs).

GKB-UCB establishes a Freedman-style martingale concentration that is both variance-adaptive (Bernstein-like) and dimension-free, using a stitching argument across variance levels. For a martingale sequence with bounded absolute value and a given conditional variance, the result guarantees, with high probability and simultaneously over all rounds, a bound on the self-normalized martingale sum. The normalization is induced by a data-dependent weighted norm matrix, and the bound is logarithmic in the confidence level and in kernel/scale parameters. Critically, the bound is independent of the ambient or feature-space dimension.
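To illustrate why variance adaptivity matters, the sketch below compares the classical scalar Hoeffding and Bernstein deviation bounds for a sum of $n$ independent, zero-mean variables bounded by $R$. These are the textbook scalar inequalities, used here only as an analogy; they are not the self-normalized, dimension-free result of the paper, and all numeric values are assumptions.

```python
import math

def hoeffding_radius(n, R, delta):
    # |X_i| <= R, zero mean: P(S_n >= t) <= exp(-t^2 / (2 n R^2)).
    return R * math.sqrt(2.0 * n * math.log(1.0 / delta))

def bernstein_radius(n, R, sigma2, delta):
    # P(S_n >= t) <= exp(-t^2 / (2 (n sigma2 + R t / 3))).
    # Setting the right-hand side to delta gives a quadratic in t; return its positive root.
    L = math.log(1.0 / delta)
    return R * L / 3.0 + math.sqrt((R * L / 3.0) ** 2 + 2.0 * n * sigma2 * L)

n, R, delta = 10_000, 1.0, 0.05
hoeff = hoeffding_radius(n, R, delta)
low_var = bernstein_radius(n, R, sigma2=0.01, delta=delta)   # small-variance noise
high_var = bernstein_radius(n, R, sigma2=1.0, delta=delta)   # variance as large as R^2

print(f"Hoeffding:            {hoeff:.1f}")
print(f"Bernstein (var=0.01): {low_var:.1f}")
print(f"Bernstein (var=1.00): {high_var:.1f}")
```

When the variance is far below the squared range, the Bernstein radius is much smaller than the Hoeffding one; when the variance is as large as the range allows, the two are comparable.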

3. Algorithmic Structure

At round $t$, GKB-UCB performs:

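(a) Maximum Likelihood Estimation: Compute the regularized maximum likelihood estimate of $f^*$, that is, the minimizer over the RKHS of the negative log-likelihood of the observations collected so far plus an RKHS-norm penalty.

By the representer theorem, this is a finite-dimensional problem with one coefficient per past action.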
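A minimal sketch of a representer-theorem reduction for maximum likelihood in an RKHS, assuming a Bernoulli model with logistic link, an RBF kernel, and a penalty of the form $\frac{\lambda}{2}\|f\|^2$. With $f = \sum_s \alpha_s k(\cdot, x_s)$, the problem becomes a convex program in the coefficient vector $\alpha$, solved here by a damped Newton method. The function names, the penalty scaling, and the synthetic data are illustrative assumptions, not the paper's exact estimator.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=0.3):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / lengthscale**2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def objective(alpha, K, y, lam, g):
    # Negative log-likelihood of the Bernoulli model plus (lam / 2) * ||f||^2, with f = K alpha.
    f = K @ alpha
    return np.sum(np.logaddexp(0.0, f) - y * f) / g + 0.5 * lam * alpha @ K @ alpha

def fit_kernel_mle(K, y, lam=1.0, g=1.0, max_iter=50, tol=1e-10):
    alpha = np.zeros(len(y))
    for _ in range(max_iter):
        mu = sigmoid(K @ alpha)
        resid = (mu - y) / g + lam * alpha       # the gradient in alpha is K @ resid
        grad = K @ resid
        if np.linalg.norm(grad) < tol:
            break
        # Newton direction: the Hessian is K @ J, so it suffices to solve J @ step = resid.
        J = (mu * (1.0 - mu))[:, None] * K / g + lam * np.eye(len(y))
        step = np.linalg.solve(J, resid)
        # Backtracking line search, with a tiny absolute slack against round-off.
        t, base, slope = 1.0, objective(alpha, K, y, lam, g), grad @ step
        while t > 1e-8 and (objective(alpha - t * step, K, y, lam, g)
                            > base - 1e-4 * t * slope + 1e-12):
            t *= 0.5
        alpha = alpha - t * step
    return alpha

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=20)                                   # past actions
y = rng.binomial(1, sigmoid(3.0 * np.sin(2 * np.pi * x))).astype(float)
K = rbf_kernel(x, x)
alpha = fit_kernel_mle(K, y)

# Norm of the gradient at the returned coefficients (lam = g = 1).
grad_norm = np.linalg.norm(K @ ((sigmoid(K @ alpha) - y) + alpha))
print(grad_norm)
```

The estimate at a new point $x$ is then $\sum_s \alpha_s k(x, x_s)$, so the cost of each fit grows with the number of past actions.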

(b) Statistical Confidence Set Construction: The confidence set contains every $f$ in the RKHS (or the associated coefficient vector) whose deviation from the maximum likelihood estimate, measured in a data-dependent norm, stays within a radius given by the novel concentration bound.

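(c) Optimistic Arm Selection: Select

$x_t \in \arg\max_{x\in\mathcal{X}} \max_{f\in\mathcal{C}_t} \mu(f(x)),$

where $\mathcal{C}_t$ denotes the confidence set built in step (b). Since $\mu$ is nondecreasing, this is equivalent to maximizing $f(x)$ over the confidence set.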

Practical implementations employ finite-dimensional approximations and loss-based formulations of the confidence set (Appendix A in Metelli et al., 3 Aug 2025).
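As a simplified illustration of optimistic selection, the sketch below handles only the identity-link special case with an ellipsoidal confidence set around a kernel ridge estimate, where the optimistic value has the familiar closed form "estimate plus radius times width". The radius `beta`, the RBF kernel, and the data are assumptions for illustration; the general GKB-UCB confidence set is not implemented here.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=0.2):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / lengthscale**2)

def optimistic_arm(x_obs, y_obs, arms, beta, lam=1.0):
    """Return the index of the arm maximizing estimate + beta * width, plus both arrays."""
    K = rbf_kernel(x_obs, x_obs)
    k_star = rbf_kernel(arms, x_obs)               # shape (n_arms, n_obs)
    A = K + lam * np.eye(len(x_obs))
    mean = k_star @ np.linalg.solve(A, y_obs)      # kernel ridge estimate at every arm
    # width^2(x) = k(x, x) - k(x)^T (K + lam I)^{-1} k(x); k(x, x) = 1 for the RBF kernel.
    var = 1.0 - np.sum(k_star * np.linalg.solve(A, k_star.T).T, axis=1)
    width = np.sqrt(np.clip(var, 0.0, None))
    return int(np.argmax(mean + beta * width)), mean, width

arms = np.linspace(0.0, 1.0, 51)                   # finite grid of candidate actions
x_obs = np.array([0.10, 0.15, 0.20])               # past actions
y_obs = np.array([0.50, 0.60, 0.55])               # observed rewards

greedy, mean, width = optimistic_arm(x_obs, y_obs, arms, beta=0.0)
explore, _, _ = optimistic_arm(x_obs, y_obs, arms, beta=10.0)
print(arms[greedy], arms[explore])
```

With `beta=0` the rule is greedy and stays near the observed actions; a large `beta` pushes the choice toward regions where the width is still large.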

4. Regret Analysis and Theoretical Guarantees

The regret decomposition relies on the event that $f^*$ lies in the confidence set at every round. The analysis yields a high-probability regret bound that depends on the supremal confidence width, the maximal information gain $\gamma_T$, and a negligible second-order term arising from self-concordance. This matches, up to logarithmic and multiplicative factors, the state-of-the-art bounds for both KB and GLB regimes. The analysis requires precise control of the elliptical potential via the new Bernstein-type bound, ensuring tight dependence on the key problem parameters.

5. Unification of Kernelized and Generalized Linear Bandits

GKB-UCB recovers and unifies prior literature:

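Kernelized Bandits: With the identity link $\mu(z)=z$, which corresponds to the Gaussian log-partition function $m(z)=z^2/2$, the confidence ellipsoid reduces to the classical Gaussian process bandit structure, and the regret bound specializes to the corresponding kernelized-bandit rate.
Generalized Linear Bandits: With a linear kernel and finite dimension $d$, the maximal information gain satisfies $\gamma_T = O(d\log T)$, and the regret matches established GLB results.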

GKB-UCB thus subsumes both classes, providing a unified approach for nonparametric, nonlinear, and heteroscedastic bandit models.
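The logarithmic growth of the information gain under a linear kernel can be checked numerically. The sketch below assumes unit-norm actions in $\mathbb{R}^d$ and the common definition of the information gain of a set of actions as $\frac{1}{2}\log\det(I + \lambda^{-1}K)$, and compares it with the determinant-trace bound $\frac{d}{2}\log(1 + T/(d\lambda))$. The bound holds for every set of $T$ unit-norm actions, hence also for the maximal information gain. The random actions and the value of `lam` are illustrative.

```python
import numpy as np

def information_gain(K, lam=1.0):
    """0.5 * log det(I + K / lam) for a kernel matrix K."""
    _, logdet = np.linalg.slogdet(np.eye(K.shape[0]) + K / lam)
    return 0.5 * logdet

rng = np.random.default_rng(1)
d, lam = 3, 1.0
gains, bounds = [], []
for T in (10, 100, 400):
    X = rng.normal(size=(T, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)          # unit-norm actions
    gains.append(information_gain(X @ X.T, lam))           # linear kernel: K = X X^T
    bounds.append(0.5 * d * np.log(1.0 + T / (d * lam)))   # determinant-trace bound

for T, gain, bound in zip((10, 100, 400), gains, bounds):
    print(T, round(gain, 3), round(bound, 3))
```

For kernels with infinite-dimensional feature maps, no such dimension-based bound is available, which is why the regret is stated in terms of the information gain itself.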

6. Implementation Considerations

At each iteration, GKB-UCB requires solving:

A finite-dimensional maximum likelihood problem in the coefficient vector;
A convex (typically quadratic) program to select the optimistic arm, subject to a loss-based confidence region.

Key hyperparameters are the regularization parameter and the Freedman-stitching parameters of the concentration bound.

The resulting computational complexity per round remains manageable for moderate horizons but scales with the number of past actions.

7. Significance and Scope

The introduction of GKB-UCB and its associated concentration theory closes an open problem regarding dimension-free, variance-aware self-normalized inequalities in RKHS-valued, EF-noise bandit processes. The method offers a robust, theoretically sound, and practically relevant solution for bandit learning under highly flexible reward models, smoothly interpolating between classical linear/GLB and nonparametric/kernelized scenarios (Metelli et al., 3 Aug 2025).

References

Metelli et al. (2025). Generalized Kernelized Bandits: Self-Normalized Bernstein-Like Dimension-Free Inequality and Regret Bounds.
