Generalized Linear Bandit Models

Updated 18 July 2025
  • Generalized Linear Bandits (GLBs) are contextual bandit models that use nonlinear link functions to map linear predictors to diverse reward distributions.
  • They employ maximum likelihood estimation and confidence sets to balance exploration and exploitation, achieving near-optimal regret bounds.
  • GLBs are applied in personalized recommendations, online advertising, and clinical decision-making, with ongoing research to improve scalability and robustness.

Generalized Linear Bandit (GLB) models constitute a broad class of contextual multi-armed bandits in which the expected reward for an action is governed by a generalized linear model—a nonlinear mapping of a linear predictor through a known link function. GLBs have emerged as a theoretical and practical extension of linear contextual bandits, enabling the modeling of binary, count, and other non-Gaussian reward distributions, essential for domains such as personalized recommendations, advertising, clinical decision-making, and adaptive experimentation.

1. Foundations of Generalized Linear Bandits

A generalized linear bandit is defined over $T$ rounds. At each round $t$, a learner observes a set of $d$-dimensional feature vectors $\{x_{t,1},\ldots,x_{t,K}\}$, selects an arm $a_t\in[K]$, and observes the reward:

$$Y_t = \mu(x_{t,a_t}^\top \theta^*) + \eta_t,$$

where $\theta^*\in\mathbb{R}^d$ is the unknown parameter vector, $\mu$ is a known strictly increasing link function (such as the logistic sigmoid for binary rewards), and $\eta_t$ is noise, typically assumed sub-Gaussian or bounded. The learner's objective is to minimize cumulative regret: the difference between the total expected reward of an oracle policy that knows $\theta^*$ and the reward obtained by the algorithm over $T$ rounds.

GLBs generalize the linear bandit model ($\mu(v)=v$) and include important cases such as logistic, Poisson, and probit bandits. The nonlinearity of $\mu$ introduces substantial theoretical and algorithmic challenges in estimation and exploration.
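
To make the interaction protocol concrete, the following is a minimal simulation sketch of a logistic GLB instance. It assumes a randomly drawn unit-norm $\theta^*$, random unit-norm feature vectors, Bernoulli rewards, and a uniformly random placeholder policy standing in for a real bandit algorithm; none of these choices are prescribed by the source.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, T = 5, 10, 1000

# Hypothetical problem instance: an unknown unit-norm parameter and a logistic link.
theta_star = rng.normal(size=d)
theta_star /= np.linalg.norm(theta_star)

def mu(v):
    """Logistic link: maps the linear predictor to a mean reward in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-v))

regret = 0.0
for t in range(T):
    # Contexts x_{t,1}, ..., x_{t,K} arrive (here drawn as random unit vectors).
    X = rng.normal(size=(K, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)

    a_t = rng.integers(K)                  # placeholder policy; a GLB algorithm would choose here
    means = mu(X @ theta_star)
    reward = rng.binomial(1, means[a_t])   # Y_t: Bernoulli reward with mean mu(x^T theta*)

    regret += means.max() - means[a_t]     # instantaneous regret of the chosen arm

print(f"Cumulative regret of the random policy over {T} rounds: {regret:.1f}")
```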

2. Algorithmic Principles and Methodologies

The backbone of GLB algorithms rests on two key principles:

Maximum Likelihood Estimation and Confidence Sets:

Most GLB approaches, including UCB-GLM and its variants, iteratively compute the maximum likelihood estimate (MLE) of $\theta^*$ by solving:

$$\sum_{i=1}^{t-1}\left[Y_i - \mu(x_i^\top \theta)\right]x_i = 0$$

and maintain the empirical design matrix $V_t=\sum_{i=1}^{t-1} x_i x_i^\top$. To track the uncertainty in $\theta^*$, tight finite-sample confidence sets are constructed of the form

$$\{\theta : |x^\top (\hat\theta_t - \theta)| \leq \beta_t \|x\|_{V_t^{-1}}\},$$

where $\beta_t$ is calibrated via concentration inequalities or PAC-Bayes bounds.
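
As an illustration, the score equation above can be solved by Newton's method when $\mu$ is the logistic link. The sketch below is one possible implementation; the small ridge term added for numerical stability and the treatment of $\beta_t$ as a user-supplied constant are assumptions, not prescriptions from the source.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def glm_mle(X, y, n_iter=25, tol=1e-8):
    """Solve sum_i [y_i - mu(x_i^T theta)] x_i = 0 for the logistic link via Newton's method.

    X: (n, d) matrix of past feature vectors; y: (n,) vector of observed rewards.
    """
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ theta)
        score = X.T @ (y - p)                        # gradient of the log-likelihood
        hess = X.T @ (X * (p * (1 - p))[:, None])    # Fisher information at theta
        step = np.linalg.solve(hess + 1e-8 * np.eye(X.shape[1]), score)  # ridge for stability
        theta = theta + step
        if np.linalg.norm(step) < tol:
            break
    return theta

def confidence_width(x, V_inv, beta_t):
    """Directional half-width beta_t * ||x||_{V_t^{-1}} defining the confidence set."""
    return beta_t * np.sqrt(x @ V_inv @ x)
```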

Action Selection via Optimism or Randomization:

GLB algorithms select actions by maximizing an upper confidence bound (UCB) or via randomization (such as Thompson Sampling). UCB-GLM chooses

$$a_t = \arg\max_{a\in[K]} \left\{x_{t,a}^\top \hat\theta_t + \alpha\, \|x_{t,a}\|_{V_t^{-1}}\right\},$$

where $\alpha$ is a parameter that controls the exploration/exploitation tradeoff. SupCB-GLM uses staged exploration to obtain nearly independent samples, achieving sharper regret bounds when the action set is small. Randomized approaches, including those based on the Laplace approximation (GLM-TSL) and reward perturbation (GLM-FPL), sample parameter estimates to induce diversified exploration and often achieve comparable theoretical guarantees.
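
A minimal sketch of the UCB-GLM selection rule is given below; how $\alpha$, $\hat\theta_t$, and $V_t^{-1}$ are computed and passed in is an implementation choice rather than part of the algorithm's specification.

```python
import numpy as np

def ucb_glm_select(X_t, theta_hat, V_inv, alpha):
    """UCB-GLM selection: argmax over arms of x_{t,a}^T theta_hat + alpha * ||x_{t,a}||_{V_t^{-1}}.

    X_t: (K, d) array of round-t feature vectors; V_inv: inverse design matrix V_t^{-1}.
    """
    estimates = X_t @ theta_hat                                   # plug-in linear predictor per arm
    bonuses = np.sqrt(np.einsum("kd,de,ke->k", X_t, V_inv, X_t))  # ||x_{t,a}||_{V_t^{-1}} per arm
    return int(np.argmax(estimates + alpha * bonuses))
```

Note that the index is computed on the linear-predictor scale rather than through $\mu$; because $\mu$ is strictly increasing, maximizing the bonus-augmented linear predictor selects the same arm as maximizing the corresponding upper bound on the mean reward.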

3. Theoretical Regret Analysis and Confidence Bounds

The primary performance metric for GLBs is the cumulative regret $R_T = \sum_{t=1}^T \left[\mu(x_{t,*}^\top\theta^*) - \mu(x_{t,a_t}^\top\theta^*)\right]$, where $x_{t,*}$ denotes the feature vector of the optimal arm at round $t$. Regret analysis leverages concentration inequalities adapted to the nonlinear structure of GLMs.

  • Minimax rates: UCB-GLM and SupCB-GLM obtain regret $R_T = \tilde{O}(\sqrt{dT})$ (logarithmic factors hidden), matching the minimax lower bound for linear bandits and substantially improving (by a factor of $\sqrt{d}$) over previous GLM bandit analyses for fixed arm sets.
  • Sharper finite-sample guarantees: The central contribution is a directional nonasymptotic confidence bound for the GLM MLE (Theorem 1), which holds uniformly over all $x\in\mathbb{R}^d$:

$$|x^\top (\hat\theta_n - \theta^*)| \leq (3\sigma/\kappa)\, \sqrt{\log(1/\delta)}\,\|x\|_{V_n^{-1}}$$

if $\lambda_{\min}(V_n)\geq C\,[d^2 + \log(1/\delta)]$. Here, $\kappa$ is a lower bound on the derivative of $\mu(\cdot)$, and $\sigma$ is the sub-Gaussian noise level. This bound is strictly sharper than those derived via naive Gaussian approximations, enabling tighter control of the exploration bonus and thus improved regret rates (a small numerical sketch of this bonus appears after this list).

  • Generality and implications: The finite-sample normality-type result for the GLM MLE can be directly applied to other online estimation and adaptive experimental design problems beyond bandits.
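
For intuition, the bound translates directly into an exploration bonus and a check on the design matrix; the sketch below assumes user-supplied values for $\sigma$, $\kappa$, $\delta$, and the otherwise unspecified constant $C$.

```python
import numpy as np

def theorem_bonus(sigma, kappa, delta):
    """Multiplier (3 * sigma / kappa) * sqrt(log(1 / delta)) from the confidence bound above."""
    return (3.0 * sigma / kappa) * np.sqrt(np.log(1.0 / delta))

def design_condition_holds(V, d, delta, C=1.0):
    """Check lambda_min(V_n) >= C * (d^2 + log(1 / delta)); C is treated as an input here."""
    return bool(np.linalg.eigvalsh(V)[0] >= C * (d**2 + np.log(1.0 / delta)))
```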

4. Practical Implementation and Considerations

  • Computational requirements: UCB-GLM is computationally efficient, as it only requires solving the MLE (typically via Newton's method) and updating the $d\times d$ design matrix at each round, yielding a per-round cost of $O(d^2)$ (see the sketch after this list). SupCB-GLM, while optimal in low-arm regimes, requires elaborate staged sampling and is more complex.
  • Scalability challenges: In very high dimensions or when the action set is extremely large, the cost of matrix inversion and parameter estimation may become significant, motivating recent research in scalable GLBs using online Newton or SGD updates (1706.00136), hash-based approximations, and federated/distributed frameworks.
  • Exploration phase: An explicit or implicit initial exploration phase is typically required to ensure that $V_t$ is invertible. In practice, random arm selection for the first $\tau\geq d$ rounds suffices.
  • Robustness and model misspecification: The performance of UCB-GLM-type algorithms relies on the correct specification of the link function $\mu(\cdot)$; misspecification may lead to suboptimal or even linear regret.
  • Applications: GLBs have been deployed in settings where outcomes are binary or categorical (e.g., news click-through, medical trials), as well as for decision support systems where the action context is high-dimensional and rich.
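
The $O(d^2)$ per-round bookkeeping mentioned above can be realized by maintaining $V_t^{-1}$ directly with a Sherman-Morrison rank-one update, as in the following sketch; the function name and the comment on the burn-in phase are illustrative.

```python
import numpy as np

def sherman_morrison_update(V_inv, x):
    """Update V^{-1} after the rank-one change V <- V + x x^T, in O(d^2) time per round."""
    Vx = V_inv @ x
    return V_inv - np.outer(Vx, Vx) / (1.0 + x @ Vx)

# After a burn-in of tau >= d randomly chosen arms (so that V_t is invertible),
# the inverse design matrix can be maintained incrementally each round:
#     V_inv = sherman_morrison_update(V_inv, x_chosen)
```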

5. Comparative Analysis and Algorithmic Refinements

  • Advances over prior methods: UCB-GLM and SupCB-GLM significantly improve on classical GLM-UCB [Filippi et al. 2010] by employing sharper, directional confidence bounds rather than loose $\ell_2$ balls, eliminating unnecessary dependence on $\sqrt{d}$ in the regret rate (for finite $K$), and reducing computational burden compared to projection-based or randomized approaches.
  • Limitations: Both algorithms depend on matrix inversion per round, which may be computationally intensive for very high-dimensional problems. SupCB-GLM’s complexity grows with the number of stages and candidate sets. Both also require a well-conditioned design matrix and an initial burn-in, though the practical impact diminishes in the “dense” regime.
  • Related methods: Alternative strategies, such as those employing randomized exploration (GLM-TSL, GLM-FPL (1906.08947)) and reward-biased likelihood maximization (2010.04091), offer tradeoffs between computation, variance of exploration, and ease of tuning. Extensions to accommodate sparsity, high-dimensionality, and non-stationarity have been explored in subsequent research.
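
For comparison with the optimistic rule in Section 2, the following is a rough sketch of Laplace-approximation Thompson sampling in the spirit of GLM-TSL: the Gaussian approximation centered at the MLE with covariance $a^2 H^{-1}$ and the exploration scale $a$ are assumptions, and the algorithm in (1906.08947) differs in its details.

```python
import numpy as np

def glm_tsl_select(X_t, theta_hat, hess, a=1.0, rng=None):
    """Laplace-approximation Thompson sampling: sample theta_tilde ~ N(theta_hat, a^2 * H^{-1})
    and act greedily under the sampled parameter.

    hess: Hessian of the negative log-likelihood at theta_hat; a: assumed exploration scale.
    """
    rng = rng or np.random.default_rng()
    cov = a**2 * np.linalg.inv(hess)
    theta_tilde = rng.multivariate_normal(theta_hat, cov)
    return int(np.argmax(X_t @ theta_tilde))
```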

6. Applications and Future Directions

GLBs are widely applied in scenarios where rewards are nonlinear in covariates:

  • Personalization and recommendation: Binary click prediction and news recommendation, using logistic (or probit) GLMs.
  • Online advertising: Modeling of user behavior in digital advertising with categorical outcomes.
  • Clinical trials and adaptive experimentation: Treatment assignment with binary or ordinal outcomes using logistic/probit GLMs.

Future research is addressing several outstanding challenges:

  • Scalability to massive arm sets and high dimensions: Online/SGD approaches, hash-based algorithms, and federated/distributed GLBs are being developed to reduce time and space complexity.
  • Lower bounds with explicit arm dependence: New lower bounds that scale with the number of arms K are being investigated.
  • Randomized algorithms for GLBs: Thompson Sampling and its variants are being tuned for GLMs to address delayed feedback and other practical constraints.
  • Sharper finite-sample analysis: Advances in finite sample theory for MLEs in GLMs may enable further tightening of regret bounds and flexibility to broader model classes.

This synthesis distills and contextualizes the contributions of "Provably Optimal Algorithms for Generalized Linear Contextual Bandits" (1703.00048), which established foundational methods and theoretical rates for GLB algorithms and opened avenues for both theoretical and practical advances in contextual decision-making under non-linear reward models.