Generalized Linear Bandits: UCB Algorithms
- Generalized linear bandits are sequential decision frameworks that extend classical linear models by incorporating nonlinear reward functions through generalized linear models (GLMs).
- They employ UCB-based algorithms, such as UCB-GLM and SupCB-GLM, to balance exploration and exploitation using directional confidence intervals tailored to GLMs.
- These methods achieve near-optimal regret bounds of order $\tilde{O}(\sqrt{dT})$ (up to logarithmic factors) and offer computational efficiency for applications like personalized advertising, news recommendation, and adaptive clinical trials.
Generalized linear bandits (GLBs) formalize a class of sequential decision problems in which the reward model for each round is a nonlinear function of context, parameterized by a generalized linear model (GLM). This framework generalizes classical linear bandits to cover settings with binary, count, or otherwise non-Gaussian outcomes, as commonly encountered in applications such as personalized advertising, news recommendation, and adaptive clinical trials. GLBs are of particular interest because they combine statistical expressiveness (handling nonlinearity in rewards) with the theoretical tractability needed for deriving sharp regret guarantees and designing principled exploration–exploitation algorithms.
1. Core Methodology: UCB-Based Algorithms for Generalized Linear Bandits
GLB algorithms select actions over $T$ rounds. In round $t$, the agent receives a contextual feature vector $x_{t,a} \in \mathbb{R}^d$ for each arm $a \in \{1, \dots, K\}$, chooses an arm $a_t$, and then observes a (possibly binary) reward $y_t$ drawn from a distribution with mean $\mu(x_{t,a_t}^\top \theta^*)$, for a fixed but unknown parameter $\theta^* \in \mathbb{R}^d$. The mean function $\mu$ is determined by the inverse link of the canonical GLM (e.g., logistic, probit, or Poisson). The primary challenge lies in balancing pure exploitation (choosing the arm with the highest predicted reward) and exploration (acquiring sufficient information to distinguish optimal actions).
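For concreteness, the reward model and the standard mean functions referenced above (textbook GLM facts rather than anything specific to this paper) can be written as
$$\mathbb{E}[y_t \mid x_{t,a_t}] = \mu\bigl(x_{t,a_t}^\top \theta^*\bigr), \qquad \mu(z) = \tfrac{1}{1+e^{-z}} \ \text{(logistic)}, \quad \mu(z) = \Phi(z) \ \text{(probit)}, \quad \mu(z) = e^z \ \text{(Poisson)}.$$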
The primary methodology introduced in (Li et al., 2017) is an upper confidence bound (UCB)–driven approach for this nonlinear setting. Two main variants are presented:
- UCB-GLM: After an initial exploration epoch of $\tau$ rounds ensuring identifiability, the agent at each round $t$ computes the maximum-likelihood estimator (MLE) $\hat{\theta}_t$ by solving the score equation
$$\sum_{s=1}^{t-1} \bigl( y_s - \mu(x_s^\top \theta) \bigr) x_s = 0.$$
The next arm is selected as
$$a_t = \arg\max_{a} \Bigl\{ x_{t,a}^\top \hat{\theta}_t + \alpha \, \lVert x_{t,a} \rVert_{V_t^{-1}} \Bigr\},$$
with $V_t = \sum_{s=1}^{t-1} x_s x_s^\top$ and $\alpha > 0$ a tunable parameter. The second term quantifies directional uncertainty through the norm $\lVert x \rVert_{V_t^{-1}} = \sqrt{x^\top V_t^{-1} x}$ induced by the empirical covariance, and serves as the exploration bonus.
- SupCB-GLM: This more refined elimination algorithm iteratively partitions rounds into stages, screens arms with confidence intervals computed from independent reward sets, and eliminates arms with insufficient potential for optimality. Rather than using projection steps onto confidence sets (as in earlier works), it leverages directional confidence intervals derived from the new finite-sample analysis.
The essential advantage of these methods is the move from generic $\ell_2$-norm confidence balls toward confidence intervals tailored to each direction, directly reflecting the nonlinearity and local curvature of the GLM.
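Below is a minimal, self-contained sketch of the UCB-GLM loop for a logistic link (Bernoulli rewards). It is illustrative rather than a faithful transcription of the paper's pseudocode: the helper names (`fit_mle`, `contexts`, `rewards_fn`), the tiny ridge term for numerical invertibility, and the full MLE refit at every round are choices of this sketch.

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_mle(X, y):
    """Logistic MLE: solves sum_s (y_s - mu(x_s^T theta)) x_s = 0 by
    minimizing the (convex) negative log-likelihood. In practice a small
    L2 penalty keeps the optimum bounded when the data are separable."""
    def nll(theta):
        z = X @ theta
        return np.sum(np.logaddexp(0.0, z) - y * z)   # log(1+e^z) - y*z
    def grad(theta):
        return X.T @ (sigmoid(X @ theta) - y)
    return minimize(nll, np.zeros(X.shape[1]), jac=grad, method="L-BFGS-B").x

def ucb_glm(contexts, rewards_fn, T, tau, alpha, seed=0):
    """contexts(t) -> (K, d) array of arm features; rewards_fn(t, a) -> reward."""
    rng = np.random.default_rng(seed)
    d = contexts(0).shape[1]
    V = 1e-6 * np.eye(d)              # tiny ridge so V_t is invertible early on
    X_hist, y_hist = [], []
    for t in range(T):
        X_t = contexts(t)
        if t < tau:                   # forced-exploration epoch (identifiability)
            a = int(rng.integers(X_t.shape[0]))
        else:
            theta_hat = fit_mle(np.asarray(X_hist), np.asarray(y_hist))
            V_inv = np.linalg.inv(V)
            # exploration bonus alpha * ||x||_{V^{-1}}, computed for all arms
            bonus = alpha * np.sqrt(np.einsum("kd,de,ke->k", X_t, V_inv, X_t))
            a = int(np.argmax(X_t @ theta_hat + bonus))
        x, y = X_t[a], rewards_fn(t, a)
        X_hist.append(x)
        y_hist.append(y)
        V += np.outer(x, x)           # rank-one design-matrix update
    return y_hist
```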
2. Theoretical Regret Guarantees
A principal result is the establishment of $\tilde{O}(\sqrt{dT})$ worst-case regret (with $\tilde{O}$ hiding logarithmic factors) for both the UCB-GLM (under certain technical conditions) and SupCB-GLM algorithms. This matches, up to logarithmic terms, the minimax lower bound for stochastic linear bandits and removes the suboptimal $\sqrt{d}$ factor present in earlier work on nonlinear models, such as Filippi et al. (2010), which gave regret $\tilde{O}(d\sqrt{T})$.
The improvement is attributed to a new finite-sample confidence bound on the MLE for GLMs. Under regularity conditions (sub-Gaussian noise, invertibility of the design matrix, a lower bound on the derivative of $\mu$), for any $x \in \mathbb{R}^d$ and any $\delta > 0$,
$$\bigl| x^\top (\hat{\theta}_t - \theta^*) \bigr| \;\le\; \frac{3\sigma}{\kappa} \sqrt{\log(1/\delta)}\; \lVert x \rVert_{V_t^{-1}}$$
with high probability. Here, $\sigma$ is the sub-Gaussian noise parameter and $\kappa$ is a lower bound on the derivative of $\mu$. The regret proof proceeds by showing that the sum of these directional uncertainties over the time horizon can be controlled, ensuring cumulative regret does not exceed $\tilde{O}(\sqrt{dT})$.
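To make that last step concrete, here is a compressed version of the argument (a sketch only; constants, the Lipschitz factor of $\mu$, and the handling of the initialization rounds are suppressed). On the high-probability event, per-round regret is bounded by the confidence width at the chosen arm, and the summed widths are controlled via Cauchy–Schwarz together with the standard elliptical potential lemma, $\sum_t \lVert x_{t,a_t} \rVert_{V_t^{-1}}^2 = O(d \log T)$:
$$R_T \;\lesssim\; \frac{\sigma}{\kappa} \sqrt{\log(1/\delta)} \sum_{t > \tau} \lVert x_{t,a_t} \rVert_{V_t^{-1}} \;\le\; \frac{\sigma}{\kappa} \sqrt{\log(1/\delta)} \sqrt{T \sum_{t > \tau} \lVert x_{t,a_t} \rVert_{V_t^{-1}}^2} \;=\; \tilde{O}\bigl(\sqrt{dT}\bigr).$$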
For SupCB-GLM, this analysis is particularly instrumental in avoiding traditional projection steps required by earlier methods, leading to both computational and theoretical simplification.
3. Finite-Sample Confidence Intervals for GLM MLEs
A substantial contribution is the proof of sharp nonasymptotic (finite-sample) confidence intervals for the GLM MLE, which extend traditional normality-type results beyond asymptotics. Under natural eigenvalue conditions on the observed covariance matrix $V_t = \sum_{s=1}^{t-1} x_s x_s^\top$, the MLE exhibits sub-Gaussian concentration along any direction, i.e.,
$$\bigl| x^\top (\hat{\theta}_t - \theta^*) \bigr| \;\le\; \frac{3\sigma}{\kappa} \sqrt{\log(1/\delta)}\; \lVert x \rVert_{V_t^{-1}} \quad \text{for all } x \in \mathbb{R}^d,$$
with probability at least $1 - 3\delta$, provided $\lambda_{\min}(V_t)$ exceeds a threshold of order $d^2 + \log(1/\delta)$. This result, structurally similar to what is available for linear models, is notable given the nonlinearity and the absence of closed forms for GLMs. The bound serves as the analytical basis for UCB construction and analysis and removes the need for projection (or "shrinking") of MLEs in practical algorithms, a significant simplification and performance improvement.
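To illustrate the role of $\kappa$ (a standard computation, not specific to the paper; assume the relevant linear predictors are bounded, $|x^\top \theta| \le S$): for the logistic link,
$$\dot{\mu}(z) = \mu(z)\bigl(1 - \mu(z)\bigr), \qquad \kappa \;\ge\; \inf_{|z| \le S} \dot{\mu}(z) \;=\; \mu(S)\bigl(1 - \mu(S)\bigr),$$
so, for example, $S = 3$ gives $\kappa \approx 0.045$, inflating the confidence width by roughly $1/\kappa \approx 22$ relative to the linear case. Flat regions of the link function thus translate directly into wider directional intervals.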
4. Algorithmic and Practical Aspects
The UCB-GLM algorithm is particularly attractive for deployment, given its simplicity and low per-round computational cost when the feature dimension is moderate and the number of arms is not excessive. It only requires calculation of the MLE (a convex optimization problem) and standard rank-one matrix updates for $V_t$. The exploration bonus is an explicit, closed-form function of past data; there is no need for sample partitioning or computationally intensive projections.
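As one concrete instance of these standard matrix updates (the Sherman–Morrison identity is a textbook tool; using it here is a choice of this sketch, not a detail fixed by the paper), $V_t^{-1}$ can be maintained in $O(d^2)$ per round instead of re-inverting:

```python
import numpy as np

def sherman_morrison(V_inv, x):
    """Return (V + x x^T)^{-1} given V^{-1} and a new feature x, in O(d^2)."""
    Vx = V_inv @ x
    return V_inv - np.outer(Vx, Vx) / (1.0 + x @ Vx)
```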
The algorithm is robust in settings with "dense" feature spaces or meaningful reward gaps between arms: when the summed inverse square roots of the minimum eigenvalues, $\sum_{t} \lambda_{\min}(V_t)^{-1/2}$, grow no faster than $O(\sqrt{T})$, optimal regret is maintained. This condition is often met unless there is severe degeneracy in the context distribution.
In contrast, algorithms requiring projection often incur significant computational overhead per round—this work avoids such costs except in carefully constructed edge cases.
5. Key Mathematical Formulations
Central to the analysis are the following:
| Concept | Formula | Interpretation |
|---|---|---|
| MLE definition | $\sum_{s=1}^{t-1} \bigl( y_s - \mu(x_s^\top \theta) \bigr) x_s = 0$ | Score equation for parameter estimation in GLMs |
| Directional confidence | $\lvert x^\top (\hat{\theta}_t - \theta^*) \rvert \le \frac{3\sigma}{\kappa} \sqrt{\log(1/\delta)}\, \lVert x \rVert_{V_t^{-1}}$ | High-probability error for any direction $x$ |
| UCB-GLM selection rule | $a_t = \arg\max_a \bigl\{ x_{t,a}^\top \hat{\theta}_t + \alpha \lVert x_{t,a} \rVert_{V_t^{-1}} \bigr\}$ | Exploration–exploitation balance per round |
| Regret bound (high-prob.) | $\tilde{O}(\sqrt{dT})$ | Performance guarantee for UCB-GLM |
| SupCB-GLM regret | $\tilde{O}(\sqrt{dT \log K})$ | For multi-stage elimination, matched to minimax up to logs |
These formulas make explicit the roles of the design matrix, sub-Gaussian noise, curvature of the link function, and the effect of the dimension.
6. Applications and Real-World Contexts
GLBs are particularly well-suited for personalized online systems involving binary or otherwise bounded discrete outcomes. In news recommendation, user and article features form the contexts, click/no-click is modeled as a Bernoulli reward, and the logistic function serves as the mean function. Here, GLB algorithms select articles to recommend based on both expected reward and uncertainty, efficiently trading off exploitation and exploration.
Similarly, in online advertising, the binary nature of observed engagement (e.g., click/no click) naturally motivates employing a GLM bandit with logistic link; the UCB-based algorithms allow advertisers to learn effective allocation policies adaptively, adjusting as more data accumulates. The theoretical guarantees ensure that, up to logarithmic factors, optimal performance is attainable without over-exploring or being overly conservative.
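A toy simulation in this spirit (fully synthetic, reusing the `ucb_glm` sketch from Section 1, with random vectors standing in for user–article context features):

```python
import numpy as np

rng = np.random.default_rng(1)
d, K, T = 5, 10, 1000
theta_star = rng.normal(size=d) / np.sqrt(d)      # unknown "true" parameter
feats = rng.normal(size=(T, K, d)) / np.sqrt(d)   # per-round arm contexts

def contexts(t):
    return feats[t]

def rewards_fn(t, a):
    # Bernoulli click with logistic mean, as in the news-recommendation setting
    p = 1.0 / (1.0 + np.exp(-feats[t, a] @ theta_star))
    return float(rng.random() < p)

clicks = ucb_glm(contexts, rewards_fn, T=T, tau=50, alpha=1.0)
print(f"average click rate: {np.mean(clicks):.3f}")
```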
7. Implications, Significance, and Extensions
The primary significance of (Li et al., 2017) is the closing of the gap between the theoretical best regret bounds for linear and generalized linear contextual bandits. The sharp finite-sample confidence bounds for the MLE provide a foundation for future extensions: these techniques may be adapted to more complex or structured bandit models (e.g., generalized low-rank or nonparametric models). In addition, the established machinery supports rigorous design and deployment in large-scale applications where statistical guarantees and real-time computational efficiency must coexist.
A particularly notable advance is the avoidance of expensive projection steps and the elimination of loose confidence sets in prior work, leading to both strong regret guarantees and clean, scalable algorithms. These contributions position GLB algorithms as a default choice whenever sequential decision making must combine contextual information with nonlinear reward structures, which is ubiquitous in modern personalized information systems.