GKB-UCB: Unified Kernelized Bandits
- GKB-UCB is an algorithmic framework that generalizes kernelized and generalized linear bandits using novel self-normalized Bernstein inequalities.
- It employs a variance-adaptive, dimension-free concentration inequality to achieve sharp high-probability regret bounds in RKHS settings.
- The framework integrates maximum likelihood estimation, confidence set construction, and optimistic arm selection for efficient decision-making.
Generalized Kernelized Bandits Upper Confidence Bound (GKB-UCB) is an algorithmic framework for regret minimization in the setting where a learner optimizes an unknown reward function $f^*$ belonging to a Reproducing Kernel Hilbert Space (RKHS) and observes outcomes corrupted by exponential family (EF) noise with mean response $\mu(f^*(x))$, where $\mu$ is a general nonlinear link. This model strictly generalizes both kernelized bandits (KBs) and generalized linear bandits (GLBs), providing a unified analytical and algorithmic treatment. GKB-UCB employs a novel self-normalized, Bernstein-type, dimension-free concentration inequality to derive sharp high-probability regret bounds, overcoming the obstacles posed by heteroscedastic, non-Gaussian, and potentially infinite-dimensional settings (Metelli et al., 3 Aug 2025).
1. Problem Formulation
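The decision domain is a (possibly infinite) set $\mathcal{X}$. At each round $t$, the learner selects $x_t \in \mathcal{X}$ and observes stochastic feedback $y_t$ drawn according to an exponential family model,
$$p(y_t \mid x_t) = \exp\left( \frac{y_t\, f^*(x_t) - m(f^*(x_t))}{g(\tau)} + h(y_t, \tau) \right),$$
where $m$ is a convex log-partition function, $g(\tau)$ a known scale, and $h$ a normalization function. The reward mean and variance at $x_t$ satisfy
$$\mathbb{E}[y_t \mid x_t] = m'(f^*(x_t)) = \mu(f^*(x_t)), \qquad \mathrm{Var}[y_t \mid x_t] = g(\tau)\, \mu'(f^*(x_t)).$$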
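As a concrete instance of the exponential-family model, the sketch below takes Bernoulli feedback with a logistic link. This is an illustrative choice, not one fixed by the source: the log-partition function is assumed to be log(1 + e^z), the scale is set to 1, and the names `log_partition`, `mean_link`, and `variance` are hypothetical. The code checks numerically that the mean is the derivative of the log-partition function and that the variance is the derivative of the mean.

```python
import math


def log_partition(z: float) -> float:
    """Bernoulli log-partition log(1 + exp(z)), evaluated in a numerically stable way."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def mean_link(z: float) -> float:
    """Mean response: the logistic sigmoid, i.e. the derivative of log_partition."""
    return 1.0 / (1.0 + math.exp(-z))


def variance(z: float) -> float:
    """Variance for unit scale: derivative of the mean, p * (1 - p)."""
    p = mean_link(z)
    return p * (1.0 - p)


def numerical_derivative(fn, z: float, eps: float = 1e-5) -> float:
    """Central finite difference, used only to check the identities above."""
    return (fn(z + eps) - fn(z - eps)) / (2.0 * eps)


z = 0.7  # arbitrary value of the latent reward function at some arm
mean_gap = abs(numerical_derivative(log_partition, z) - mean_link(z))
var_gap = abs(numerical_derivative(mean_link, z) - variance(z))
```

Because the variance changes with the latent value `z`, the noise is heteroscedastic, which is what motivates a variance-adaptive analysis.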
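Key assumptions include a bounded RKHS norm of $f^*$, a bounded kernel, bounded noise, and self-concordance of the link $\mu$. The regret is measured by
$$R(T) = \sum_{t=1}^{T} \left( \mu(f^*(x^*)) - \mu(f^*(x_t)) \right),$$
with $x^* \in \arg\max_{x \in \mathcal{X}} \mu(f^*(x))$.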
A central parameter of the analysis quantifies the maximum curvature (slackness) of the reward mean function over the domain.
2. Self-Normalized Bernstein-Type, Dimension-Free Inequality
The exponential family structure induces noise whose variance depends on the unknown function $f^*$. Standard self-normalized concentration results either ignore this variance structure (Hoeffding-type, and thus overly pessimistic) or become dimension-dependent (ill-suited to an infinite-dimensional RKHS).
GKB-UCB establishes a Freedman-style martingale concentration that is both variance-adaptive (Bernstein-like) and dimension-free, using a stitching argument across variance levels. For a martingale difference sequence with bounded absolute value and given conditional variances, the result controls, with high probability and simultaneously over all rounds, the self-normalized deviation of the noise process, measured in a norm induced by a data-dependent, variance-weighted operator. The resulting threshold is logarithmic in the relevant kernel and scale parameters. Critically, the bound is independent of the ambient or feature-space dimension.
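To illustrate why a variance-adaptive bound can be much tighter than a Hoeffding-type one, the sketch below compares two classical scalar deviation widths for a sum of bounded martingale differences. These are the textbook Azuma-Hoeffding and Freedman tail bounds, not the self-normalized RKHS inequality of the source, and the function names and numerical values are hypothetical.

```python
import math


def hoeffding_width(n: int, b: float, delta: float) -> float:
    """Azuma-Hoeffding: width for a sum of n martingale differences bounded by b in absolute value."""
    return b * math.sqrt(2.0 * n * math.log(1.0 / delta))


def bernstein_width(total_variance: float, b: float, delta: float) -> float:
    """Freedman/Bernstein: solve exp(-t^2 / (2 (V + b t / 3))) = delta for t."""
    log_term = math.log(1.0 / delta)
    c = b * log_term / 3.0
    return c + math.sqrt(c * c + 2.0 * total_variance * log_term)


n, b, delta = 10_000, 1.0, 0.05
per_step_variance = 0.01  # low-variance regime, far below the worst case b^2
loose = hoeffding_width(n, b, delta)
tight = bernstein_width(n * per_step_variance, b, delta)
```

In this low-variance regime the Bernstein-type width is roughly an order of magnitude smaller, which is the gain a variance-aware confidence set aims to capture.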
3. Algorithmic Structure
At round $t$, GKB-UCB performs:
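(a) Maximum Likelihood Estimation: Compute the (regularized) maximum likelihood estimate $\hat{f}_t$ of $f^*$ from the observations collected so far. By the representer theorem, this is a finite-dimensional problem whose number of parameters equals the number of past observations.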
(b) Statistical Confidence Set Construction: Build a confidence set $\mathcal{C}_t$ of functions $f$ (or of the associated coefficient vectors) that are statistically consistent with the data, with a radius given by the novel concentration bound.
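(c) Optimistic Arm Selection: Play
$$x_t \in \arg\max_{x \in \mathcal{X}} \max_{f \in \mathcal{C}_t} \mu(f(x)),$$
where $\mathcal{C}_t$ denotes the confidence set of step (b). Since $\mu$ is nondecreasing, this is equivalent to maximizing $f(x)$ over the confidence set.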
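The optimistic selection rule can be sketched on a toy problem. The code below assumes a finite list of arms and replaces the confidence set with a short, hypothetical list of plausible functions. It is an illustration of the max-max structure only, not the source's procedure. It also checks that applying a nondecreasing link (here a logistic sigmoid, chosen for illustration) does not change the selected arm.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic link, used here as an example of a nondecreasing mean function."""
    return 1.0 / (1.0 + math.exp(-z))


def optimistic_arm(arms, plausible_functions, link=None):
    """Return the arm with the largest best-case value over the plausible functions."""
    transform = link if link is not None else (lambda z: z)
    best_arm, best_value = None, -math.inf
    for arm in arms:
        # Inner maximization: most favourable plausible function for this arm.
        value = max(transform(f(arm)) for f in plausible_functions)
        if value > best_value:
            best_arm, best_value = arm, value
    return best_arm


arms = [0.0, 0.5, 1.0, 1.5]
# Hypothetical stand-in for the confidence set: three candidate reward functions.
plausible_functions = [
    lambda x: 1.0 - (x - 0.5) ** 2,
    lambda x: 0.8 * x,
    lambda x: -x,
]

choice_latent = optimistic_arm(arms, plausible_functions)
choice_mean = optimistic_arm(arms, plausible_functions, link=sigmoid)
```

Both calls select the same arm, reflecting that a monotone link preserves the ordering of optimistic values.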
Practical implementations employ finite-dimensional approximations and loss-based formulations of the confidence set (Appendix A of Metelli et al., 3 Aug 2025).
4. Regret Analysis and Theoretical Guarantees
The regret decomposition relies on the event that $f^*$ lies in the confidence set for all rounds $t$. The resulting high-probability bound depends on a factor collecting supremal bounds for the confidence width, on the maximal information gain $\gamma_T$, and on a negligible second-order term arising from self-concordance. It matches, up to logarithmic and multiplicative factors, the minimax rates for both the KB and GLB regimes. The analysis requires precise control of the elliptical potential via the new Bernstein-type bound, which ensures optimal dependence on the leading problem quantities.
5. Unification of Kernelized and Generalized Linear Bandits
GKB-UCB recovers and unifies prior literature:
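- Kernelized Bandits: With an identity link and Gaussian noise, the confidence set reduces to the classical Gaussian process bandit structure, and the regret bound specializes to the familiar $\widetilde{O}(\gamma_T \sqrt{T})$ rate.
- Generalized Linear Bandits: With a linear kernel and finite dimension $d$, the maximal information gain satisfies $\gamma_T = O(d \log T)$, and the regret matches established GLB results.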
GKB-UCB thus subsumes both classes, providing a unified approach for nonparametric, nonlinear, and heteroscedastic bandit models.
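The logarithmic growth of the information gain under a linear kernel can be checked numerically. The sketch below computes the information gain of one fixed set of points (the maximal information gain is the supremum over such sets) in two equivalent ways, through the n-by-n Gram matrix and through the d-by-d feature covariance, and compares it with the standard bound (d/2) log(1 + n/(d λ)) for points of norm at most 1. The regularization `lam`, the random points, and all function names are assumptions made for illustration.

```python
import math
import random


def log_det_spd(matrix):
    """Log-determinant of a symmetric positive-definite matrix via Cholesky."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    total = 0.0
    for i in range(n):
        for j in range(i + 1):
            s = matrix[i][j] - sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(s)
                total += 2.0 * math.log(lower[i][j])
            else:
                lower[i][j] = s / lower[j][j]
    return total


def information_gain_kernel(points, lam):
    """0.5 * log det(I_n + K / lam) with the linear kernel K = X X^T."""
    n = len(points)
    mat = [[(1.0 if i == j else 0.0)
            + sum(a * b for a, b in zip(points[i], points[j])) / lam
            for j in range(n)] for i in range(n)]
    return 0.5 * log_det_spd(mat)


def information_gain_features(points, lam):
    """0.5 * log det(I_d + X^T X / lam), equal to the kernel form by Sylvester's identity."""
    d = len(points[0])
    mat = [[(1.0 if i == j else 0.0)
            + sum(p[i] * p[j] for p in points) / lam
            for j in range(d)] for i in range(d)]
    return 0.5 * log_det_spd(mat)


random.seed(0)
d, n, lam = 3, 40, 1.0
# Each coordinate lies in [-1/sqrt(d), 1/sqrt(d)], so every point has norm at most 1.
points = [[random.uniform(-1.0, 1.0) / math.sqrt(d) for _ in range(d)] for _ in range(n)]

gain_kernel = information_gain_kernel(points, lam)
gain_features = information_gain_features(points, lam)
bound = 0.5 * d * math.log(1.0 + n / (d * lam))
```

The two computations agree, and the gain stays below a bound that grows only logarithmically in the number of points, consistent with the linear-kernel specialization.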
6. Implementation Considerations
At each iteration, GKB-UCB requires solving:
- A finite-dimensional maximum likelihood problem in the coefficient vector;
- A convex (typically quadratic) program to select the optimistic arm, subject to a loss-based confidence region.
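The finite-dimensional maximum likelihood problem can be sketched for Bernoulli feedback with a logistic link. Everything specific here is an assumption made for illustration: scalar arms, an RBF kernel, a ridge penalty `lam`, and a plain functional-gradient iteration used as a simple stand-in solver rather than the source's method. By the representer theorem the estimate is a kernel expansion over past arms, so only one coefficient per observation is optimized.

```python
import math


def rbf_kernel(x: float, y: float, lengthscale: float = 1.0) -> float:
    return math.exp(-((x - y) ** 2) / (2.0 * lengthscale ** 2))


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_kernel_mle(xs, ys, lam: float = 1.0, iters: int = 500):
    """Minimize the penalized Bernoulli negative log-likelihood over expansion coefficients."""
    n = len(xs)
    gram = [[rbf_kernel(a, b) for b in xs] for a in xs]
    # Safe step size: Bernoulli variance is at most 1/4 and the top Gram
    # eigenvalue is at most the trace.
    trace = sum(gram[i][i] for i in range(n))
    step = 1.0 / (0.25 * trace + lam)
    alpha = [0.0] * n
    for _ in range(iters):
        f_vals = [sum(gram[i][j] * alpha[j] for j in range(n)) for i in range(n)]
        # Coefficients of the RKHS gradient: prediction error plus ridge term.
        grad = [sigmoid(f_vals[i]) - ys[i] + lam * alpha[i] for i in range(n)]
        alpha = [a - step * g for a, g in zip(alpha, grad)]
    return alpha


def predict(alpha, xs, x: float) -> float:
    """Evaluate the fitted function as a kernel expansion over past arms."""
    return sum(a * rbf_kernel(xi, x) for a, xi in zip(alpha, xs))


xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]   # hypothetical past arms
ys = [0.0, 0.0, 1.0, 0.0, 1.0, 1.0]      # hypothetical binary feedback
lam = 1.0
alpha = fit_kernel_mle(xs, ys, lam=lam)
# Stationarity residual of the penalized likelihood at the returned coefficients.
residual = max(abs(sigmoid(predict(alpha, xs, xi)) - yi + lam * ai)
               for xi, yi, ai in zip(xs, ys, alpha))
```

Each iteration costs time quadratic in the number of past observations because of the Gram-matrix product, which is why the per-round cost grows with the history.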
Key hyperparameters are the regularization parameter and the parameters of the Freedman-stitching construction.
The resulting per-round computational cost remains manageable for moderate horizons but scales with the number of past actions.
7. Significance and Scope
The introduction of GKB-UCB and its associated concentration theory closes an open problem regarding dimension-free, variance-aware self-normalized inequalities in RKHS-valued, EF-noise bandit processes. The method offers a robust, theoretically sound, and practically relevant solution for bandit learning under highly flexible reward models, smoothly interpolating between classical linear/GLB and nonparametric/kernelized scenarios (Metelli et al., 3 Aug 2025).