Quick-Draw Bandits: Quickly Optimizing in Nonstationary Environments with Extremely Many Arms

Published 30 May 2025 in cs.LG and stat.ML | (2505.24692v1)

Abstract: Canonical algorithms for multi-armed bandits typically assume a stationary reward environment where the size of the action space (number of arms) is small. More recently developed methods typically relax only one of these assumptions: existing non-stationary bandit policies are designed for a small number of arms, while Lipschitz, linear, and Gaussian process bandit policies are designed to handle a large (or infinite) number of arms in stationary reward environments under constraints on the reward function. In this manuscript, we propose a novel policy to learn reward environments over a continuous space using Gaussian interpolation. We show that our method efficiently learns continuous Lipschitz reward functions with $\mathcal{O}^*(\sqrt{T})$ cumulative regret. Furthermore, our method naturally extends to non-stationary problems with a simple modification. We finally demonstrate that our method is computationally favorable (100-10000x faster) and experimentally outperforms sliding Gaussian process policies on datasets with non-stationarity and an extremely large number of arms.

Summary

  • The paper introduces the Quick-Draw bandit policy that efficiently estimates Lipschitz reward functions over both spatial and temporal dimensions.
  • The proposed method leverages conditional Normal likelihoods to derive closed-form estimates of mean and variance for effective UCB-based arm selection.
  • Experiments on simulated and real-world datasets show Quick-Draw outperforms baselines by achieving lower regret and higher performance metrics.

The paper "Quick-Draw Bandits: Quickly Optimizing in Nonstationary Environments with Extremely Many Arms" (2505.24692) addresses the challenging multi-armed bandit (MAB) problem in scenarios where the reward environment is non-stationary and the number of arms (action space) is extremely large, potentially continuous. Existing MAB algorithms typically handle either non-stationarity with a small number of arms or a large/continuous action space in stationary environments, but not both simultaneously. Canonical methods break down when the number of arms $K$ is comparable to or larger than the timescale of environmental change ($K \gtrsim T_w$), making exhaustive exploration within a stable period infeasible.

The authors propose the Quick-Draw bandit policy, a novel approach designed to efficiently learn Lipschitz reward functions over a continuous feature space that also change smoothly over time. The core idea is to model the expected payout function $\mu(x, t)$ probabilistically, specifically assuming a conditional Normal likelihood for an observation at point $x$ at time $t$ given a past observation $(x_s, y_s)$ at time $t_s$. The variance of this conditional likelihood, $\hat{\sigma}_s^2(x,t)$, is modeled as a function of the spatial distance $D(x, x_s)$ and temporal distance $(t - t_s)$:

$$\hat{\sigma}^2_s(x,t) \equiv \rho^2 + \left(\frac{D(x, x_s)}{\ell_x}\right)^2 + \left(\frac{t - t_s}{\ell_t}\right)^2$$

where $\rho^2$ represents irreducible noise, $\ell_x$ is a spatial bandwidth, and $\ell_t$ is a temporal bandwidth.

The policy combines information from all past observations $\mathcal{D}_T = \{(x_s, y_s, t_s)\}_{s=1}^T$ by assuming the joint likelihood is a product of these conditional likelihoods. For Gaussian distributions, this product is also Gaussian, resulting in closed-form expressions for the estimated mean $\hat{\mu}_T(x, t)$ and variance $\hat{\Sigma}_T^2(x, t)$. These estimates are given by:

$$\hat{\Sigma}^2_T(x,t) = \left[\sum_{s=1}^{T} \frac{1}{\hat{\sigma}_s^2(x,t)}\right]^{-1}$$

$$\hat{\mu}_T(x,t) = \left[\sum_{s=1}^{T} \frac{y_s}{\hat{\sigma}_s^2(x,t)}\right] \hat{\Sigma}_T^2(x,t)$$
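The two closed-form estimates above amount to an inverse-variance-weighted average of past payouts. A minimal sketch under stated assumptions (hypothetical names; 1-D arm locations with absolute-value distance standing in for $D$):

```python
import numpy as np

def quick_draw_estimate(x, t, xs, ys, ts, rho=0.1, ell_x=1.0, ell_t=1.0):
    """Closed-form mean/variance estimate at point x and time t.

    xs, ys, ts: arrays of past arm locations, payouts, and observation times.
    """
    # Per-observation conditional variance:
    # sigma_s^2 = rho^2 + (D(x, x_s)/ell_x)^2 + ((t - t_s)/ell_t)^2
    spatial = (np.abs(x - xs) / ell_x) ** 2   # 1-D; use a norm for higher dims
    temporal = ((t - ts) / ell_t) ** 2
    sigma2 = rho ** 2 + spatial + temporal

    # Product of Gaussian conditionals -> inverse-variance-weighted combination
    Sigma2 = 1.0 / np.sum(1.0 / sigma2)       # estimated variance
    mu = Sigma2 * np.sum(ys / sigma2)         # estimated mean
    return mu, Sigma2
```

Observations that are far away in space or time contribute a large $\hat{\sigma}_s^2$ and therefore receive little weight, which is how the estimator encodes the decay of information.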

The Quick-Draw policy is an Upper Confidence Bound (UCB) type algorithm. At each round $t$, it calculates a UCB index for each arm $k$ (or potential point $x_k$) using the estimated mean and variance:

$$\mathrm{UCB}_{k,t} = \min\left(\hat{\mu}_T(x_k, t) + \gamma_{T+1}\hat{\Sigma}_T(x_k, t),\ 1\right)$$

where $\gamma_{T+1}$ is a scaling constant, and 1 is a ceiling based on the assumed bounded payout range $[0, 1]$. The policy then selects the arm with the maximum UCB index.
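A single selection round under this capped UCB rule might look like the following self-contained sketch (hypothetical helper; 1-D arm locations and a fixed $\gamma$ are assumptions, not the paper's exact schedule):

```python
import numpy as np

def quick_draw_select(arms, t, xs, ys, ts, gamma=2.0, rho=0.1, ell_x=1.0, ell_t=1.0):
    """Return the index of the arm with the largest (capped) UCB at round t."""
    ucb = np.empty(len(arms))
    for k, xk in enumerate(arms):
        # Conditional variance of each past observation relative to (xk, t)
        sigma2 = rho**2 + ((xk - xs) / ell_x) ** 2 + ((t - ts) / ell_t) ** 2
        Sigma2 = 1.0 / np.sum(1.0 / sigma2)
        mu = Sigma2 * np.sum(ys / sigma2)
        # Cap at 1, since payouts are assumed bounded in [0, 1]
        ucb[k] = min(mu + gamma * np.sqrt(Sigma2), 1.0)
    return int(np.argmax(ucb)), ucb
```

The exploration bonus $\gamma\,\hat{\Sigma}_T(x_k, t)$ grows for arms that have not been sampled recently or nearby, so stale regions are automatically revisited in non-stationary settings.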

From an implementation perspective, the Quick-Draw policy requires storing past observations (arm, payout, time). At each step, to compute the UCB for a given arm, it iterates through all past observations to calculate the weighted mean and variance. The weights are determined by the inverse of the uncertainty $\hat{\sigma}_s^2$, which depends on the distance in space and time to the current arm and round. A brute-force implementation calculates $\hat{\Sigma}^2_T$ and $\hat{\mu}_T$ for all $K$ arms at each round $T$, leading to $\mathcal{O}(K \cdot T)$ complexity per round if all past observations are processed naively. However, if past uncertainty contributions $1/\hat{\sigma}_s^2(x_k, t)$ are cached for each arm-time pair relative to previous observations, each update might approach $\mathcal{O}(K)$ if only the new observation needs to be added. A more efficient implementation could maintain the sums $\sum_s 1/\hat{\sigma}_s^2$ and $\sum_s y_s/\hat{\sigma}_s^2$ for each arm iteratively.
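For instance, in the stationary limit ($\ell_t \to \infty$, so the temporal term vanishes and per-observation weights no longer depend on the current round), the two per-arm sums can be carried between rounds exactly, giving $\mathcal{O}(K)$ work per new observation. A hypothetical sketch of that bookkeeping:

```python
import numpy as np

class QuickDrawStationary:
    """Incremental O(K)-per-round sums (stationary case: no temporal term)."""

    def __init__(self, arms, rho=0.1, ell_x=1.0):
        self.arms = np.asarray(arms, dtype=float)
        self.rho2 = rho ** 2
        self.ell_x = ell_x
        # Running sums over past observations, one entry per arm
        self.inv_var_sum = np.zeros(len(self.arms))     # sum_s 1/sigma_s^2(x_k)
        self.weighted_y_sum = np.zeros(len(self.arms))  # sum_s y_s/sigma_s^2(x_k)

    def update(self, x_s, y_s):
        """Fold one new observation into the cached sums: O(K)."""
        sigma2 = self.rho2 + ((self.arms - x_s) / self.ell_x) ** 2
        self.inv_var_sum += 1.0 / sigma2
        self.weighted_y_sum += y_s / sigma2

    def estimates(self):
        """Closed-form mean and variance at every arm, from the cached sums."""
        Sigma2 = 1.0 / self.inv_var_sum
        mu = self.weighted_y_sum * Sigma2
        return mu, Sigma2
```

With a finite temporal bandwidth the weights depend on the current round, so the sums cannot be reused verbatim; caching or approximation along the lines the paragraph above describes would then be needed.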

Comparing to Gaussian Process (GP) bandits, the Quick-Draw policy can be viewed as a form of kernel interpolation similar to Nadaraya-Watson estimation, while GP bandits rely on matrix inversion of a kernel matrix. This difference gives Quick-Draw a significant computational advantage. While exact GP bandits have $\mathcal{O}(T^3)$ complexity per round (for updating the model based on $T$ observations), Quick-Draw is much faster, scaling linearly or near-linearly in the number of past observations $T$ for calculating the mean and variance. Experiments show Quick-Draw being orders of magnitude faster than GP-UCB.

The paper provides a theoretical regret bound for the stationary case (only spatial Lipschitzness). Under assumptions, the cumulative regret is shown to be $\mathcal{O}(\sqrt{T}\ln^2 T)$, which is comparable to the regret bounds achieved by GP bandit policies in similar stationary settings, but Quick-Draw's bound applies to Lipschitz reward functions.

The effectiveness of the Quick-Draw policy is demonstrated through extensive experiments on both simulated data and a real-world dataset.

  1. Simulated Experiments: Using Gaussian random fields with controllable spatial and temporal correlations, noise levels, and reward function sharpness, Quick-Draw is compared against Sliding-Window GP-UCB (SW-GP-UCB), sliding $\epsilon$-greedy, restless bandit, and random sampling. Quick-Draw consistently outperforms the baselines across various challenging non-stationary settings, showing better adaptation to changes and more efficient exploration guided by spatial and temporal dependencies. It is particularly effective when spatial and temporal correlations are significant. The hyperparameters $\ell_x$ and $\ell_t$ exhibit robustness; performance is relatively insensitive to their exact values within a reasonable range (e.g., around 1 when distances are normalized to $[0, 1]$).
  2. Open Bandit Dataset Evaluation: The policy is evaluated on a large-scale public dataset from an online advertising platform, aiming to maximize click-through rate (CTR) for different products (arms) presented to users (contexts). The problem involves 46 arms and exhibits non-stationarity over time. Using Inverse Propensity Scoring (IPS) for off-policy evaluation, Quick-Draw achieves an estimated CTR of 3.51%, significantly higher than random (0.49%), SW-GP-UCB (0.57%), restless bandit (0.98%), and sliding $\epsilon$-greedy (2.12%). This demonstrates the policy's practical applicability and superior performance in real-world, non-stationary environments with a moderately large number of arms.

In summary, the Quick-Draw bandit policy offers a practical and computationally efficient solution for multi-armed bandit problems in challenging real-world scenarios characterized by both non-stationarity and a large action space. By explicitly modeling the decay of information over spatial and temporal distance, it effectively balances exploration and exploitation, achieving lower regret and higher performance compared to existing methods while being significantly faster than methods like GP bandits. The policy's robustness to hyperparameter tuning further simplifies its implementation and deployment.
