Streaming Bandit Problems: Theory & Applications

Updated 8 January 2026
  • Streaming bandit problems are sequential decision-making models that operate on data streams with limited memory and adaptive constraints.
  • Novel algorithms employ multi-pass elimination, threshold policies, and optimal pull allocations to balance exploration and exploitation under memory limits.
  • Theoretical regret analyses reveal tradeoffs among memory, pass count, and statistical efficiency, guiding the design of order-optimal streaming methods.

The bandit problem in data streams encompasses a family of sequential decision-making models where the set of available actions ("arms") is accessed via a data stream, under constraints that fundamentally differ from the classical multi-armed bandit (MAB) setting. In this streaming context, arms may arrive online, memory for storing statistics is typically sublinear in the number of arms, and algorithms must operate under resource, adaptivity, and data-access limitations. This paradigm has motivated novel lower bounds, adaptive algorithms, and regret and sample-complexity analyses that delineate the fundamental tradeoffs between space, passes over data, and statistical efficiency.

1. Formal Models of Streaming Bandit Problems

In the streaming bandit framework, the learner sequentially observes a potentially adversarial stream of $K$ distinct arms over a time horizon $T$. Each arm $x \in [K]$ has an unknown reward distribution $\nu_x$ supported on $[0,1]$, with mean $\mu_x$. At each round $t = 1, \ldots, T$, the learner may select ("pull") only arms currently stored in a working memory of size $M \ll K$ and receives i.i.d. samples $r_t \sim \nu_{x_t}$. If the memory is full and a new arm arrives, eviction is mandatory, and all associated statistics are lost. The primary goal is to minimize cumulative regret,

$$R(T) = \sum_{t=1}^T (\mu^* - \mu_{x_t}), \qquad \mu^* := \max_{x \in [K]} \mu_x,$$

subject to memory and pass constraints. The multi-pass model allows the learner $B \geq 1$ full scans (passes) of the stream, where only the $M$ arms retained at the end of a pass can be revisited in subsequent passes, further shaping the information flow and the accessible exploration policies. This model is a strict generalization of the classical MAB, reducing to it only when $M = K$ and $B$ is unbounded (Li et al., 2023, Maiti et al., 2020).
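
To make this protocol concrete, the following Python sketch simulates a single pass of the model above with Bernoulli arms. The fixed per-arm pull budget and the evict-lowest-mean rule are illustrative assumptions for exposition, not a cited algorithm.

```python
import random

def single_pass_stream(means, M, pulls_per_arm, rng=random.Random(0)):
    """Single pass over a stream of K Bernoulli arms with memory M.

    Each arriving arm is pulled `pulls_per_arm` times; when the memory
    exceeds M arms, the stored arm with the lowest empirical mean is
    evicted and its statistics are lost (an illustrative eviction rule,
    not an optimal schedule). Returns the surviving arms and the
    cumulative regret against the best mean.
    """
    mu_star = max(means)
    memory = {}  # arm index -> (empirical mean, pull count)
    regret = 0.0
    for x, mu in enumerate(means):  # arms arrive one at a time
        rewards = [1.0 if rng.random() < mu else 0.0
                   for _ in range(pulls_per_arm)]
        regret += pulls_per_arm * (mu_star - mu)
        memory[x] = (sum(rewards) / pulls_per_arm, pulls_per_arm)
        if len(memory) > M:  # mandatory eviction on overflow
            worst = min(memory, key=lambda a: memory[a][0])
            del memory[worst]
    return memory, regret

survivors, reg = single_pass_stream([0.2, 0.9, 0.5, 0.4, 0.8],
                                    M=2, pulls_per_arm=50)
print(survivors, reg)
```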

Alternate streaming models include the irreversible online bandit scenario, in which the learner must irrevocably accept or reject each arm in a single pass and may not revisit discarded arms (Roy et al., 2017). Further, resource-constrained contextual and distributed streaming bandits address settings where, beyond the streaming constraints themselves, context, communication, and cost structures interact with memory and adaptivity (Tekin et al., 2013, Gisselbrecht, 2018).

2. Fundamental Lower Bounds and Tradeoffs

Streaming bandit problems display intrinsic separations from the centralized bandit regime due to information loss under bounded memory. For a learner restricted to $B$ passes and memory $M = o(K/B)$ over $T$ rounds and $K$ arms, the tight worst-case regret lower bound is (Li et al., 2023)

$$R(T) = \Omega\left((TB)^{\alpha} K^{1-\alpha}\right), \qquad \alpha = \frac{2^B}{2^{B+1} - 1}.$$

In the regime $B = \Theta(\log\log T)$, this yields $R(T) = \Omega(\sqrt{KT \log\log T})$: an unavoidable $\sqrt{\log\log T}$ penalty relative to the classical $\Omega(\sqrt{KT})$ lower bound.
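
As a quick numerical check of how this exponent behaves, the short snippet below evaluates $\alpha = 2^B/(2^{B+1}-1)$ for a few pass budgets; $\alpha$ decreases from $2/3$ at $B=1$ toward $1/2$ as $B$ grows, recovering $\sqrt{KT}$-type scaling in the limit.

```python
# Lower-bound exponent alpha = 2^B / (2^(B+1) - 1) for several pass
# budgets B. At B=1 a single-pass learner faces Omega(T^(2/3) K^(1/3))
# worst-case regret; alpha approaches 1/2 as passes accumulate.
for B in [1, 2, 3, 5, 10]:
    alpha = 2**B / (2**(B + 1) - 1)
    print(f"B={B:2d}  alpha={alpha:.4f}")
```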

The derivation employs a reduction from regret to a sequence of $\epsilon$-optimal arm identification tasks: any memory-limited algorithm with regret $O(\epsilon T)$ must, with constant probability, have identified an $O(\epsilon)$-optimal arm, which under streaming constraints has sample complexity $\Omega(K \epsilon^{-2})$ per single pass. Optimizing the pull allocations across $B$ passes (via a hierarchy of thresholds $\epsilon_1 > \ldots > \epsilon_B$) and summing the exploration and exploitation phases yields the optimal dependence on $(T, K, B)$.

Instance-dependent lower bounds further demonstrate that, for gaps $\Delta_x := \mu^* - \mu_x > 0$,

$$R(T) = \Omega\left(T^{1/(B+1)} \sum_{x:\, \Delta_x > 0} \frac{\mu^*}{\Delta_x}\right),$$

matching the best-known rates in centralized bandits only when memory is unbounded (Li et al., 2023). These bounds also extend to best-arm identification, where for $r$-round adaptive algorithms storing $O(r)$ arms, the sample-complexity lower bound is $\Omega\left(\frac{n}{\epsilon^2 r^4} \mathrm{ilog}^{(r)}(n)\right)$ for $n$ arms and accuracy $\epsilon$, where $\mathrm{ilog}^{(r)}$ denotes the $r$-times iterated logarithm (Maiti et al., 2020).

3. Algorithms and Achievability Results

Streaming bandit algorithms are structured to exploit the limited adaptivity and memory available. The multi-pass streaming successive elimination algorithm matches the lower bounds up to logarithmic factors with $O(1)$ memory: in each of the $B$ passes, only two arms, the "best-so-far" and a current challenger, are compared using confidence bounds and capped pull allocations ($b_p = \Theta((TB)^{2\beta_p} K^{-2\beta_p})$ for pass $p$), determined by an optimal schedule from the lower-bound optimization. After $B$ passes, the survivor is played exclusively (Li et al., 2023). This yields

$$R(T) = O\left((TB)^{\alpha} K^{1-\alpha} \sqrt{\log T}\right)$$

with high probability, confirming the order-optimality of the strategy under memory constraints.
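
The two-arm comparison at the heart of this strategy can be sketched in a few lines. In the sketch below, the Hoeffding-style confidence radius and the flat pull cap stand in for the paper's exact per-pass schedule $b_p$; both are illustrative assumptions.

```python
import math
import random

def duel(pull_best, pull_challenger, cap, delta=0.01):
    """Compare the best-so-far arm against a challenger in O(1) memory.

    Pulls both arms in lockstep, maintaining Hoeffding-style confidence
    intervals, and returns the winner once the intervals separate or the
    pull cap is reached (then the empirical leader wins). The radius and
    cap are illustrative, not the exact schedule of Li et al. (2023).
    """
    sum_b = sum_c = 0.0
    for n in range(1, cap + 1):
        sum_b += pull_best()
        sum_c += pull_challenger()
        radius = math.sqrt(math.log(2 * cap / delta) / (2 * n))
        if sum_b / n - radius > sum_c / n + radius:
            return pull_best
        if sum_c / n - radius > sum_b / n + radius:
            return pull_challenger
    return pull_best if sum_b >= sum_c else pull_challenger

rng = random.Random(0)
best = lambda: 1.0 if rng.random() < 0.8 else 0.0
challenger = lambda: 1.0 if rng.random() < 0.5 else 0.0
print(duel(best, challenger, cap=2000) is best)  # True w.h.p.
```

Each pass can then stream the remaining arms through this comparison, keeping only the current winner in memory, which is how the algorithm survives on $O(1)$ storage.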

In the online streaming Bernoulli-bandit model (accept/reject per arm, no revisits), threshold policies yield a loss per pull matching the fundamental rate $\Omega(\max\{N^{-1/m}, K^{-1/(m+1)}\})$, where $N$ is the number of arms, $K$ is the pull budget, and the CDF of the underlying distribution of means is $m$-times differentiable at its left tail (Roy et al., 2017).
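
A hedged sketch of such a policy: explore each arriving arm briefly, accept the first arm whose empirical mean clears a fixed bar, and spend the rest of the budget on it. The specific threshold and per-arm trial count below are illustrative choices, not the tuned values from Roy et al. (2017).

```python
import random

def threshold_policy(stream, budget, threshold, trials,
                     rng=random.Random(1)):
    """Single-pass irreversible accept/reject over Bernoulli arms.

    Each arriving arm receives `trials` exploratory pulls; the first
    arm whose empirical mean reaches `threshold` is accepted
    irrevocably and pulled for the remaining budget. Discarded arms
    are never revisited. Returns the average reward per pull.
    """
    total, used = 0.0, 0
    for mu in stream:
        if used + trials > budget:
            break
        wins = sum(rng.random() < mu for _ in range(trials))
        total, used = total + wins, used + trials
        if wins / trials >= threshold:  # irrevocable accept
            remaining = budget - used
            total += sum(rng.random() < mu for _ in range(remaining))
            used = budget
            break
    return total / max(used, 1)

means = [random.Random(2).random() for _ in range(500)]
print(f"avg reward per pull: {threshold_policy(means, 5000, 0.85, 25):.3f}")
```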

Further, $r$-round adaptive best-arm identification with $O(r)$ memory achieves

$$N = O\left(\frac{n}{\epsilon^2}\left( \mathrm{ilog}^{(r)}(n) + \log(1/\delta) \right)\right)$$

sample complexity, matching the lower bounds up to polynomial factors in $r$ (Maiti et al., 2020). Two-arm heuristics (memory-2) attain near-optimal identification probability under random arm arrival.

These results reveal that streaming bandits demand fundamentally new algorithmic designs—elimination and comparison procedures, pull-capping schedules, and structurally minimal memory logic—which are not compatible with classical MAB methods relying on complete historical statistics.

4. Extensions: Streaming Bandits in High-Dimensional, Contextual, and Nonstationary Settings

Extensions of the streaming bandit model arise in high-dimensional and dynamic data-stream mining environments, such as subspace search for pattern detection (Fouché et al., 2020), decentralized contextual classification (Tekin et al., 2013), and change-point detection under sampling constraints (Zhang et al., 2020, Gopalan et al., 2021).

In high-dimensional streaming subspace search, the Streaming Greedy Maximum Random Deviation (SGMRD) algorithm formulates the search for informative subspaces as a multiple-play bandit, where each arm corresponds to a dimension's subspace. Pulling (searching) an arm yields a reward if a new subspace with a higher dependence score is discovered. SGMRD employs multiple-play Thompson sampling with exponentially smoothed reward signals to adaptively allocate exploration, achieving state-of-the-art results on anomaly detection tasks and inheriting $O(\log T)$ gap-dependent regret behavior under nonstationarity (Fouché et al., 2020).
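
The multiple-play Thompson sampling core that SGMRD builds on is compact: sample a posterior score per arm, play the top $k$, and decay old evidence so the posterior tracks nonstationary rewards. The Beta-Bernoulli model and the discount factor below are illustrative assumptions, not SGMRD's exact update.

```python
import random

def multiplay_ts_step(alpha, beta, k, get_reward, gamma=0.99,
                      rng=random.Random(3)):
    """One round of multiple-play Thompson sampling with decay.

    `alpha`/`beta` hold per-arm Beta posterior counts. Samples a score
    per arm, plays the k highest, observes binary rewards through
    `get_reward(arm)`, and shrinks all counts toward the prior by
    `gamma` so stale evidence fades (an illustrative smoothing
    variant, not the exact SGMRD update).
    """
    scores = [rng.betavariate(a, b) for a, b in zip(alpha, beta)]
    chosen = sorted(range(len(alpha)), key=scores.__getitem__)[-k:]
    for i in range(len(alpha)):  # exponential forgetting
        alpha[i] = 1 + gamma * (alpha[i] - 1)
        beta[i] = 1 + gamma * (beta[i] - 1)
    for arm in chosen:
        r = get_reward(arm)
        alpha[arm] += r
        beta[arm] += 1 - r
    return chosen

means = [0.3, 0.7, 0.5, 0.9]
alpha, beta = [1.0] * len(means), [1.0] * len(means)
reward = lambda arm: 1.0 if random.random() < means[arm] else 0.0
for _ in range(500):
    picked = multiplay_ts_step(alpha, beta, k=2, get_reward=reward)
print(sorted(picked))  # concentrates on arms 1 and 3 w.h.p.
```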

Distributed online classification of large-scale data streams is cast as a cooperative contextual bandit in which each learner selects either from its own classifier set or offloads to another learner, with reward being classification accuracy minus communication or computation cost. The Classify-or-Send (CoS) algorithm achieves regret $O(T^{(2\alpha+d)/(3\alpha+d)} \,\mathrm{polylog}\, T)$ in a $d$-dimensional context space, capturing both local and decentralized learning (Tekin et al., 2013).

Nonstationary and resource-constrained streaming bandits, especially in change-point or anomaly detection regimes, leverage bandit-based adaptive sampling. For sequential change-point detection over high-dimensional streams in which only a subset of sensors can be monitored at each time step, methods such as Thompson-Sampling Shiryaev-Roberts-Pollak (TSSRP) adaptively allocate sampling effort and exploit sum-shrinkage decision statistics to optimize detection delay under false alarm constraints (Zhang et al., 2020). Coupled with theoretical guarantees, such models ensure that exploitation and exploration are optimally balanced even under stringent data-access budgets. Relatedly, bandit quickest change-point detection admits information-theoretic lower bounds on detection delay that are matched by efficient $\epsilon$-greedy algorithms, establishing the necessity of adaptive (bandit) sensing in structured data-stream monitoring (Gopalan et al., 2021).

5. Practical Applications: Social Media, Health Interventions, Data Mining

Streaming bandit frameworks have been instantiated in several real-world streaming-data applications. In social media data capture, users are modeled as arms, and the learner sequentially selects subsets of users to follow so as to maximize relevance under listening constraints. Stochastic, contextual, and latent-space bandit models (e.g., LinUCB, nonstationary contextual bandits) allow rapid identification of pertinent sources, with regret bounds and empirical performance validated on live Twitter streams and synthetic datasets (Gisselbrecht, 2018).

Restless multi-armed bandit models for health intervention planning (streaming RMABs) account for arms (patients or signals) that arrive and depart dynamically, each described by a partially observed Markov process with a finite lifetime. Index-based allocation using computed (or approximated) Whittle indices that account for horizon decay achieves a two-orders-of-magnitude runtime speed-up while maintaining solution quality, with provably near-optimal intervention benefit on large patient-cohort streams (Mate et al., 2021).
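
For intuition, the Whittle index of a two-state arm can be computed by bisecting on the passive subsidy at which acting and waiting have equal value under finite-horizon value iteration. The sketch below is a simplified illustration of this idea (it assumes a known two-state model, reward equal to the current state, and indexability); it is not the approximation scheme of Mate et al. (2021).

```python
def whittle_index(P_passive, P_active, state, horizon, tol=1e-4):
    """Finite-horizon Whittle index of a 2-state restless arm.

    Binary-searches the subsidy `lam` paid for passivity until the
    active and passive actions have equal value in `state` at the
    start of the horizon. Reward is the current state (0 or 1);
    indexability is assumed rather than verified.
    """
    def action_gap(lam):
        V = [0.0, 0.0]  # terminal values at the end of the horizon
        gap = 0.0
        for _ in range(horizon):  # backward induction
            newV = []
            for s in (0, 1):
                passive = s + lam + sum(P_passive[s][t] * V[t] for t in (0, 1))
                active = s + sum(P_active[s][t] * V[t] for t in (0, 1))
                if s == state:
                    gap = active - passive
                newV.append(max(passive, active))
            V = newV
        return gap  # > 0 means acting still beats passivity

    lo, hi = -1.0, 2.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if action_gap(mid) > 0 else (lo, mid)
    return (lo + hi) / 2

P_passive = [[0.9, 0.1], [0.4, 0.6]]  # state decays without intervention
P_active = [[0.5, 0.5], [0.1, 0.9]]   # intervention improves the state
print(whittle_index(P_passive, P_active, state=0, horizon=10))
```

An index-based planner would recompute such indices as arms' remaining lifetimes shrink and, at each step, intervene on the arms with the highest indices.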

In high-volume classification and anomaly detection, streaming bandit principles enable real-time adaptation in resource-constrained, high-dimensional, and distributed learning scenarios. Contextual and multiple-play approaches have been shown to scale with sublinear regret under nonstationary and cooperative settings (Tekin et al., 2013, Fouché et al., 2020).

6. Open Problems, Limitations, and Future Directions

Several open problems and research frontiers remain in streaming bandit theory and practice. Prominent challenges include closing the $\sqrt{\log T}$ and $\sqrt{\log\log T}$ gaps between upper and lower regret bounds in the memory-limited, multi-pass regime (Li et al., 2023), designing algorithms with optimal instance-dependent regret, and developing adaptive memory-allocation strategies across multiple passes.

Open directions include handling non-i.i.d. rewards (e.g., heavy-tailed or context-dependent distributions), adversarial stream orders, and dynamic or correlated arms in both streaming and distributed contexts (Li et al., 2023, Tekin et al., 2013, Fouché et al., 2020). In change-point detection and monitoring, rigorous proofs of first-order (minimax) delay optimality under complex sampling controls remain an active topic (Zhang et al., 2020).

Given the growing prominence of massive, high-velocity data streams in scientific and industrial applications, practical algorithms for streaming bandits will continue to necessitate principled trade-offs between space, adaptivity, statistical efficiency, and real-world robustness.
