
Parameterized Bandit Algorithms Overview

Updated 20 November 2025
  • Parameterized bandit algorithms are a set of strategies driven by tunable parameters, such as priors and thresholds, that systematically control exploration, computation, and regret trade-offs.
  • They unify classical approaches like EXP3 and Thompson Sampling by smoothly interpolating behavior through parameter tuning to achieve statistical adaptivity and near-optimal regret bounds.
  • These algorithms enable meta-learning and data-driven hyperparameter tuning, offering a practical framework for scalable and adaptable decision-making in complex bandit settings.

A parameterized family of bandit algorithms consists of a collection of bandit strategies, each indexed by a set of tunable parameters—such as priors, thresholds, regularization strengths, or exploration-exploitation trade-off coefficients—that systematically control their behavior and adaptivity. This rich structure enables the formulation of bandit models that interpolate between classic algorithms, achieve statistical adaptivity to unknown problem structure, and facilitate meta-learning or data-driven hyperparameter tuning. Such parameterizations are central to both the analysis and deployment of modern multi-armed, linear, structured, and nonparametric bandit approaches.

1. Canonical Examples of Parameterized Bandit Families

Parameterized bandit families appear across stochastic, adversarial, and structured bandit problems. Seminal archetypes include:

  • Tsallis-entropy regularization and FTPL families: The Tsallis-entropy parameter $\alpha\in(0,1)$ interpolates between EXP3 (as $\alpha\to1$) and a minimax-optimal mirror descent at $\alpha=1/2$, controlling the exploration-exploitation bias in adversarial bandit learning. Similarly, perturbation-based (Follow-the-Perturbed-Leader, FTPL) algorithms form a parameterized family in which the hazard rate and moment structure of the perturbation distribution (e.g., Gumbel, Pareto, Weibull, Fréchet) govern the regret (Abernethy et al., 2015); a minimal FTPL step is sketched after this list.
  • Thompson Sampling and Bayesian priors: The choice of prior (e.g., Gaussian, Beta, Jeffreys, or conjugate within exponential families) indexes a continuum of TS algorithms. For one-dimensional exponential families, the Jeffreys prior yields a parameter-free, asymptotically optimal algorithm that achieves the Lai–Robbins regret lower bound (Korda et al., 2013).
  • Stochastic linear bandits (ROFUL family): The ROFUL framework is parameterized by the confidence-radius parameter $\rho$, the optimism-in-expectation parameter $p$, and the sieving rate $\alpha$ in sieved-greedy (SG) variants, smoothly spanning optimistic (OFUL), randomized (Thompson), and greedy behaviors. Regret bounds are explicit functions of these parameters (Hamidi et al., 2020).
  • Computation-regret tradeoff families: Recent Thompson Sampling variants such as TS-MA-$\alpha$ and TS-TD-$\alpha$ use a parameter $\alpha\in[0,1]$ to mediate the trade-off between computational efficiency (i.e., the number of posterior samples or resamplings) and regret, enabling smooth interpolation between full TS (optimal regret, maximal sampling) and highly efficient heuristics (slightly worse regret, polylogarithmic sampling) (Hu et al., 2 May 2024).
  • Regionally, structurally, or finitely parameterized bandits: In regional bandit algorithms (UCB-g, SW-UCB-g), the control parameter is the sliding window or group-level confidence schedule; in finitely parameterized multi-armed bandits (FP-UCB), the structure of the finite parameter set directly determines arm-elimination protocols and their regret scaling (Panaganti et al., 2020, Wang et al., 2018).
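To make the idea of a family indexed by its perturbation distribution concrete, the following is a minimal, hypothetical full-information FTPL step (the bandit variants in (Abernethy et al., 2015) additionally require unbiased loss estimates, omitted here); the function name, arguments, and scaling are illustrative assumptions rather than the paper's pseudocode.

```python
import numpy as np

rng = np.random.default_rng(0)

def ftpl_choose(cum_losses, perturb=rng.gumbel, scale=1.0):
    """Follow-the-Perturbed-Leader: play the arm minimizing perturbed loss.

    The perturbation distribution is the family parameter: Gumbel,
    Weibull, Pareto, or Frechet draws (heavy-tailed choices need their
    shape parameter bound first, e.g., via functools.partial) each give
    a different family member, with regret governed by the hazard rate.
    """
    noise = perturb(size=cum_losses.shape)           # fresh draw per arm
    return int(np.argmin(cum_losses - scale * noise))

# Example: three arms with cumulative losses 3.0, 2.5, 4.0.
print(ftpl_choose(np.array([3.0, 2.5, 4.0])))
```

The `scale` knob plays the role of a learning rate; tuning it against the horizon is itself an instance of the data-driven parameter selection discussed below.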

2. Theoretical Frameworks and Regret Analysis

Parameterized families are analyzed by making regret bounds explicit functions of the parameters—often exposing precise exploration-exploitation-computation trade-offs, optimality criteria, or tuning guidelines.

  • Asymptotic optimality in one-parameter exponential families: In the exponential-family bandit setting, the TS family indexed by the prior achieves regret

$$R(T) \leq \sum_{a:\Delta_a>0} \frac{\Delta_a}{\mathrm{KL}(\theta_a,\theta^*)} \log T + o(\log T),$$

where $\mathrm{KL}(\theta_a,\theta^*)$ is the Kullback–Leibler divergence between the arm and the optimal arm (computable as the Bregman divergence of the cumulant function), and the prior affects only lower-order terms when chosen as Jeffreys (Korda et al., 2013); a Bernoulli instance with the Jeffreys prior is sketched below.
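As a concrete member of the prior-indexed TS family, here is a minimal Beta-Bernoulli Thompson Sampling sketch in which the prior hyperparameters $(a_0, b_0)$ are the family index; $a_0 = b_0 = 1/2$ is the Jeffreys prior for Bernoulli rewards. The interface and names are illustrative, not taken from (Korda et al., 2013).

```python
import numpy as np

rng = np.random.default_rng(0)

def thompson_bernoulli(true_means, horizon, a0=0.5, b0=0.5):
    """Beta-Bernoulli Thompson Sampling; a0 = b0 = 0.5 is Jeffreys."""
    k = len(true_means)
    succ = np.zeros(k)
    fail = np.zeros(k)
    total = 0.0
    for _ in range(horizon):
        # One posterior draw per arm from Beta(a0 + successes, b0 + failures).
        samples = rng.beta(a0 + succ, b0 + fail)
        arm = int(np.argmax(samples))
        reward = float(rng.random() < true_means[arm])  # Bernoulli reward
        succ[arm] += reward
        fail[arm] += 1.0 - reward
        total += reward
    return total

# Example: thompson_bernoulli([0.3, 0.5, 0.7], horizon=10_000)
```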

  • Family-level regret bounds: In ROFUL, the general regret bound is

$$\mathrm{Reg}(T) \leq 2\sqrt{(2K/p + D)T},$$

where $K$ is the “uncertainty complexity” and $p$ reflects the optimism-in-expectation parameter. OFUL ($p=1$), TS, and sieved greedy ($p=\alpha^2$) are all specializations (Hamidi et al., 2020); a schematic sieve-then-greedy step is sketched below.
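The following sketch illustrates, under stated assumptions, how a single confidence-radius parameter $\rho$ and sieving rate $\alpha$ can enter one arm-selection step of a linear bandit with a ridge estimator. The sieving threshold here is an illustrative placeholder, not the actual SG criterion of (Hamidi et al., 2020).

```python
import numpy as np

def sieve_then_greedy(arms, V, b, rho, alpha):
    """One hypothetical sieve-then-greedy step for a linear bandit.

    arms  : (n, d) array, one feature row per arm
    V, b  : ridge-regression statistics (X^T X + lam*I and X^T r)
    rho   : confidence-radius parameter
    alpha : sieving rate; alpha = 0 keeps only max-UCB arms
            (optimistic), alpha = 1 degenerates to pure greedy

    NOTE: the threshold below is a placeholder for illustration only.
    """
    theta_hat = np.linalg.solve(V, b)                # ridge estimate
    means = arms @ theta_hat                         # estimated rewards
    V_inv = np.linalg.inv(V)
    # Ellipsoidal confidence widths rho * sqrt(x^T V^{-1} x) per arm.
    widths = rho * np.sqrt(np.einsum('ij,jk,ik->i', arms, V_inv, arms))
    ucb = means + widths
    threshold = alpha * means.max() + (1 - alpha) * ucb.max()
    survivors = np.flatnonzero(ucb >= threshold)     # sieve step
    return survivors[np.argmax(means[survivors])]    # greedy step
```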

  • Computation-adaptivity in TS-MA-$\alpha$ / TS-TD-$\alpha$: Regret scales as $O\left(K \ln^{1+\alpha} T/\Delta\right)$ under the parameter $\alpha$. Lower $\alpha$ improves regret but increases sampling complexity; higher $\alpha$ decreases total computation but yields an extra logarithmic term (Hu et al., 2 May 2024).
  • Finitely parameterized FP-UCB: When the “confusion set” is empty, regret is $O(1)$, independent of $T$; otherwise, regret grows logarithmically, but with smaller constants than classical UCB due to the parameter structure (Panaganti et al., 2020).
  • General recipe for randomized bandit families: Parameterization appears not only in the choice of divergence $D_\pi$ (KL, empirical divergence, moment-based duals) but also in the sampling rule, and in verifying that the two-sided bounds on arm-sampling probabilities are satisfied. Asymptotically optimal regret is recovered across parametric, bounded, and heavy-tailed families; for example, for exponential families with the canonical KL divergence, TS and MED match the Lai–Robbins lower bound (Baudry et al., 2023). A minimal divergence-indexed index computation is sketched after this list.
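As one concrete instance of a divergence-indexed family, the sketch below computes a KL-UCB-style index for Bernoulli arms by bisection; swapping the `divergence` argument (the family parameter $D_\pi$) yields a different family member. This is a minimal illustration, not the general recipe of (Baudry et al., 2023).

```python
import math

def kl_bernoulli(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(mu_hat, n_pulls, t, divergence=kl_bernoulli, tol=1e-6):
    """Index = max q >= mu_hat with n_pulls * D(mu_hat, q) <= log t.

    The divergence argument is the family parameter: plugging in a
    different D yields a different member of the index family.
    """
    budget = math.log(max(t, 2)) / n_pulls
    lo, hi = mu_hat, 1.0
    # Bisection works because D(mu_hat, q) is increasing in q >= mu_hat.
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if divergence(mu_hat, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo

# Example: kl_ucb_index(mu_hat=0.4, n_pulls=25, t=1000)
```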

3. Representative Table of Parameterized Bandit Families

| Family/Algorithm | Parameter(s) | Regret/Optimality Properties |
|---|---|---|
| Tsallis-Exp family | $\alpha$ | $O(\sqrt{TN})$ at $\alpha=1/2$; recovers EXP3 as $\alpha\to1$ (Abernethy et al., 2015) |
| FTPL family | Perturbation distribution (hazard rate) | $O(\sqrt{TN\log N})$ for bounded hazard rate, e.g., Gumbel (Abernethy et al., 2015) |
| TS (exponential families) | Prior (e.g., Jeffreys) | Asymptotically optimal: $R(T)=O(\log T)$, matches the lower bound (Korda et al., 2013) |
| ROFUL (linear bandit) | $\rho$, $p$, $\alpha$ | $O\big((d/\alpha)\sqrt{T\log T}\big)$ for sieved greedy (Hamidi et al., 2020) |
| FP-UCB | Structure of $\Theta$ | $O(1)$ when the confusion set is empty; $O(\log T)$ otherwise (Panaganti et al., 2020) |
| TS-MA-$\alpha$ / TS-TD-$\alpha$ | $\alpha\in[0,1]$ | $O\big(K\ln^{1+\alpha}T/\Delta\big)$; sample efficiency tunable (Hu et al., 2 May 2024) |

This table demonstrates explicit trade-offs governed by parameters, with regret or sample complexity analytically characterizable as a function of the family index.

4. Rationale for Parameterization: Flexibility, Adaptivity, Meta-Learning

Parameterized bandit families enable systematic exploration of algorithmic behavior and statistical adaptivity. Key advantages include:

  • Problem-structure exploitation: Families such as FP-UCB, regional bandits, and parametric/PGLB bandits adapt to finite or grouped reward models, reducing regret relative to structure-agnostic baselines (Panaganti et al., 2020, Wang et al., 2018).
  • Computational tractability: Computation-regret trade-off families (e.g., TS-MA-$\alpha$) allow practitioners to set $\alpha$ to balance statistical efficiency against the sampling budget, which is crucial for very large-scale applications (Hu et al., 2 May 2024).
  • Meta-learning and data-driven selection: The sample complexity of optimizing over a parameterized family (hyperparameter transfer) is precisely quantifiable in improving bandits, allowing for data-driven tuning without performance loss (Blum et al., 13 Nov 2025).
  • Analytic unification: A multitude of classical and modern strategies (UCB, KL-UCB, TS, MED, h-NPTS, RB-SDA) can be unified as parameter choices in abstract frameworks (e.g., divergence-based, randomized, or regularization-based families) (Baudry et al., 2023, Baudry et al., 2020).

5. Parameterized Algorithm Design Principles

Designing effective parameterized bandit families involves the following key principles:

  • Explicit parameter indexing: Specify continuous or discrete parameters (e.g., prior variance, entropy order, confidence width, threshold power, batch size) and analyze their effect on exploration probability, statistical estimation, and regret bounds.
  • Ensuring adaptivity: Analytical guarantees must cover both worst-case and instance-adaptive regimes, often requiring uniform finite-time concentration and carefully chosen confidence/threshold schedules or priors (Korda et al., 2013, Hu et al., 2 May 2024).
  • Universality vs. specialization: Some families (e.g., RB-SDA) are distribution-free and require no parameter tuning, while others (e.g., TS with prior, FTPL with hazard) deliver optimality only for suitable parameter choices, emphasizing the need for hyperparameter robustness (Baudry et al., 2020, Abernethy et al., 2015).
  • Efficient meta-optimization: The hyperparameter-transfer complexity is $O((H/\epsilon)^2 \log(kT))$ for piecewise-constant losses in improving bandits, enabling practical model selection over rich families (Blum et al., 13 Nov 2025); a naive grid-search sketch follows this list.
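As a naive illustration of data-driven selection over a family index, the sketch below grid-searches a scalar parameter by averaging cumulative reward over repeated runs; `run_bandit` is a hypothetical callback, and this brute-force procedure only gestures at the far more sample-efficient transfer methods of (Blum et al., 13 Nov 2025).

```python
import numpy as np

def tune_family_parameter(run_bandit, alphas, n_trials=20, horizon=5000):
    """Brute-force data-driven selection over a parameterized family.

    run_bandit(alpha, horizon, seed) -> cumulative reward; hypothetical
    interface supplied by the caller. Returns the best grid point and
    the per-point average scores.
    """
    scores = []
    for a in alphas:
        # Average over seeds to reduce the variance of the comparison.
        trials = [run_bandit(a, horizon, seed) for seed in range(n_trials)]
        scores.append(float(np.mean(trials)))
    return alphas[int(np.argmax(scores))], scores
```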

6. Recent Advances and Open Problems

Contemporary work continues to generalize and sharpen the role of parameterized families:

  • Non-parametric and heavy-tailed families: TS-like and divergence-based families parameterize over moment constraints, with concrete minimax-optimal bounds for $h$-moment-limited distributions and nonparametric settings (Baudry et al., 2023).
  • Functional parameterizations: Algorithms can be indexed by functionals or norm-approximators (gradient-stable surrogates), capturing entire classes of performance metrics and enabling unified bandit/online-load-balancing analyses for exotic norms (Kesselheim et al., 2022).
  • Best-of-both-worlds data-dependent families: Hybrid parameterized designs (e.g., Hybrid$_{\alpha,B}$) are capable of best-arm identification when problem structure permits and automatically revert to worst-case guarantees otherwise, driven by explicit parameterized switching rules (Blum et al., 13 Nov 2025).
  • Online semi-bandit/algorithm-selection problems: When the loss function itself is parameterized (hyperparameter optimization), the semi-bandit optimization framework yields instance-optimal regret in parameter space via continuous Exp3-SET, exploiting dispersion and Lipschitz continuity (Balcan et al., 2019); a discretized Exp3 sketch follows this list.
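For intuition, the sketch below runs one step of standard Exp3 over a discretized parameter grid as a stand-in for the continuous Exp3-SET machinery referenced above; the grid discretization, names, and learning-rate handling are assumptions, and the dispersion/Lipschitz analysis of (Balcan et al., 2019) is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def exp3_step(weights, loss_fn, eta):
    """One Exp3 step over a discretized parameter grid.

    weights : current (unnormalized) weights, one per grid point
    loss_fn : maps the chosen index to an observed loss in [0, 1]
    eta     : learning rate
    """
    probs = weights / weights.sum()
    i = rng.choice(len(weights), p=probs)
    loss = loss_fn(i)
    # Importance-weighted loss estimate for the played point only.
    est = loss / probs[i]
    weights[i] *= np.exp(-eta * est)
    return i, loss

# Example: 10 grid points, dummy loss favoring index 3.
w = np.ones(10)
for _ in range(100):
    exp3_step(w, lambda i: 0.1 if i == 3 else 0.9, eta=0.05)
```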

Potential open directions include: adaptive parameter selection under nonstationarity, robustification to model misspecification, and scalable meta-optimization in high-dimensional or combinatorial parameter spaces.

7. Summary and Impact

Parameterized families of bandit algorithms provide a mechanism to encode, analyze, and computationally tune exploration strategies, adaptivity, and exploitation across a vast array of bandit settings. They unify classic and modern methods, enable meta-learning and structural exploitation, and admit explicit regret and (sometimes) sample complexity guarantees as functions of the family index. The theoretical characterization of such families, the methods for hyperparameter transfer, and the development of distribution-free or instance-adaptive versions are among the most significant advances in the last decade of bandit research (Korda et al., 2013, Abernethy et al., 2015, Hamidi et al., 2020, Panaganti et al., 2020, Kesselheim et al., 2022, Hu et al., 2 May 2024, Blum et al., 13 Nov 2025).
