Non-Stochastic Infinite-Armed Bandit

Updated 18 March 2026
  • Non-stochastic infinite-armed bandit is an online learning framework with an infinite set of actions where each arm’s reward evolves non-stationarily and adversarially after being played.
  • A novel blackbox conversion method reduces the infinite-armed problem to finite-armed instances by subsampling arms and triggering adaptive resets based on empirical regret thresholds.
  • The approach achieves minimax-optimal, parameter-free regret bounds by leveraging reservoir regularity and significant shift measures, with bounds adapting to cumulative and rotting variations.

A non-stochastic infinite-armed bandit is an online learning problem in which the decision maker faces an infinite set of possible actions (arms). Each arm, when first selected, draws its initial mean reward independently from a reservoir distribution. Unlike the stationary stochastic setting, arm rewards in this non-stationary, adversarial variant can evolve adaptively or adversarially over time, subject to the rested constraint that only the arm pulled in a given round can change its value. This setting generalizes stationary and rotting infinite-armed bandits and introduces new challenges in exploration, exploitation, and the handling of dynamically shifting reward distributions (Suk et al., 31 Jan 2025).

1. Formal Definition and Problem Structure

Let $\mathcal{A} = \{1,2,\ldots\}$ denote the infinite set of arms and $T$ the time horizon. At round $t \in \{1,\ldots,T\}$, the learner either draws a new arm $a_t \in \mathcal{A}$ from the reservoir or selects one of the previously sampled arms to play. The initial mean reward $\mu_0(a) \in [0,1]$ for each freshly drawn arm is sampled i.i.d. from a reservoir distribution, unknown to the learner and assumed to be $\beta$-regular. Subsequently, the individual reward processes $\{Y_t(a)\}$, and thus the means $\mu_t(a)$, evolve according to an adaptive or adversarial process, but only for arms that are actually played at each step.

The performance metric is cumulative regret, defined as:

$$R_T = \sum_{t=1}^T \delta_t(a_t), \quad \delta_t(a) = 1 - \mu_t(a)$$

where $\delta_t(a)$ is the instantaneous gap of arm $a$ at time $t$.
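
To make the rested model and the regret definition concrete, here is a minimal simulation sketch. It is not taken from the source: the class name `RestedReservoirEnv`, the uniform reservoir, the Bernoulli rewards, and the `drift` callback standing in for the adversary are all illustrative assumptions.

```python
import random

class RestedReservoirEnv:
    """Toy rested environment: only the arm that is pulled may change its mean."""

    def __init__(self, drift, seed=0):
        self.rng = random.Random(seed)
        self.drift = drift   # hypothetical adversary: old mean -> new mean
        self.means = []      # current mean mu_t(a) of every arm drawn so far

    def new_arm(self):
        # Assumption: a uniform reservoir on [0, 1]; the source only asks
        # for a beta-regular reservoir.
        self.means.append(self.rng.random())
        return len(self.means) - 1

    def pull(self, arm):
        mu = self.means[arm]
        reward = 1.0 if self.rng.random() < mu else 0.0   # Bernoulli reward (assumption)
        gap = 1.0 - mu                                    # delta_t(a) = 1 - mu_t(a)
        # Rested model: the mean of the pulled arm, and of no other arm, is updated.
        self.means[arm] = min(1.0, max(0.0, self.drift(mu)))
        return reward, gap   # the gap is returned for bookkeeping only

env = RestedReservoirEnv(drift=lambda mu: mu - 0.1)   # a steadily rotting arm
first, second = env.new_arm(), env.new_arm()
untouched = env.means[second]
regret = sum(env.pull(first)[1] for _ in range(5))    # R_T along this play path
print(regret, env.means[second] == untouched)
```

The learner would only observe `reward`. The gap is exposed here so that the cumulative regret of a play path can be summed directly.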

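Non-stationarity is measured along the play path $a_1, \ldots, a_T$ via:

  • Total (realized) variation: the cumulative magnitude of the changes in mean reward of the arms played.
  • Total rotting variation: the cumulative magnitude of the decreases in mean reward of the arms played.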

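Reservoir regularity assumes the existence of constants $\beta > 0$ and $\kappa_1, \kappa_2 > 0$ such that for all $x \in [0,1]$:

$$\kappa_1 x^{\beta} \le \mathbb{P}\big(\mu_0(a) > 1 - x\big) \le \kappa_2 x^{\beta}$$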
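
As an illustration, assume regularity means that the tail probability $\mathbb{P}(\mu_0(a) > 1 - x)$ scales like $x^{\beta}$. One simple reservoir with exactly this tail is then obtained by inverse-transform sampling. The sketch below is an illustrative choice rather than the reservoir of the source, and the function names are hypothetical.

```python
import random

def sample_initial_mean(beta, rng):
    """Draw mu_0 = 1 - U**(1/beta) with U uniform on [0, 1).

    Then P(mu_0 > 1 - x) = P(U < x**beta) = x**beta for x in [0, 1].
    """
    return 1.0 - rng.random() ** (1.0 / beta)

def empirical_tail(beta, x, n=100_000, seed=0):
    """Monte Carlo estimate of P(mu_0 > 1 - x) under the sampler above."""
    rng = random.Random(seed)
    hits = sum(sample_initial_mean(beta, rng) > 1.0 - x for _ in range(n))
    return hits / n

# With beta = 2 the estimate should be close to 0.5**2 = 0.25.
print(empirical_tail(beta=2.0, x=0.5))
```

Larger values of `beta` make near-optimal arms rarer in the reservoir, which is why the number of arms worth subsampling depends on it.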

2. Parameter-Free Blackbox Conversion for Non-Stationary Infinite-Armed Bandits

A central methodological innovation is a blackbox conversion scheme which reduces the infinite-armed non-stationary bandit problem to a sequence of finite-armed bandit instances. The approach proceeds as follows:

  • Subsample a finite set of arms from the reservoir.
  • Run a gap-dependent finite-armed MAB base algorithm on the subsampled arms in doubling time blocks.
  • Monitor empirical regret within each block; trigger parameter-free resets/restarts if empirical regret exceeds a controlled threshold.
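
The three steps above can be sketched as follows. This is a schematic under stated assumptions, not the algorithm of the source. UCB1 stands in for the base algorithm, the empirical regret of a round is taken to be one minus the observed reward, and `num_arms` and `threshold` are hypothetical tuning rules supplied by the caller.

```python
import math
import random

def ucb_choice(counts, sums, t):
    """UCB1 index policy, used here as a stand-in base algorithm (an assumption)."""
    for i, n in enumerate(counts):
        if n == 0:
            return i   # play every subsampled arm once first
    return max(range(len(counts)),
               key=lambda i: sums[i] / counts[i]
               + math.sqrt(2 * math.log(t + 1) / counts[i]))

def run_blackbox(horizon, pull, new_arm, num_arms, threshold):
    """Doubling blocks with resets triggered by empirical regret.

    pull(arm) -> reward in [0, 1]; new_arm() -> id of a fresh reservoir arm.
    num_arms(length) and threshold(length) are hypothetical tuning rules.
    Returns the rounds at which a reset was triggered.
    """
    t, m, resets = 0, 0, []
    while t < horizon:
        length = min(2 ** m, horizon - t)
        arms = [new_arm() for _ in range(num_arms(length))]   # fresh subsample
        counts, sums = [0] * len(arms), [0.0] * len(arms)
        emp_regret, reset = 0.0, False
        for s in range(length):
            i = ucb_choice(counts, sums, s)
            y = pull(arms[i])
            counts[i] += 1
            sums[i] += y
            emp_regret += 1.0 - y   # observable proxy for the gap
            t += 1
            if emp_regret > threshold(length):
                reset = True
                break
        if reset:
            resets.append(t)
            m = 0        # restart from the shortest block
        else:
            m += 1       # block finished without alarm: double its length

    return resets

# Usage on a stationary toy problem with a uniform reservoir (all assumptions).
rng = random.Random(0)
means = []

def new_arm():
    means.append(rng.random())
    return len(means) - 1

def pull(arm):
    return 1.0 if rng.random() < means[arm] else 0.0

resets = run_blackbox(
    horizon=2000, pull=pull, new_arm=new_arm,
    num_arms=lambda n: max(1, math.isqrt(n)),
    threshold=lambda n: 4.0 * math.sqrt(n) * math.log(n + 1) + 4.0,  # hypothetical
)
print(len(resets))
```

The base algorithm needs no knowledge of the non-stationarity. The outer loop only checks whether the regret it observes is consistent with the base algorithm still working.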

The decomposition of regret for such a strategy is:

[…]

where […]. The first term represents "missed-arm" regret, arising from subsampling only finitely many arms, and can be controlled by the reservoir's $\beta$-regularity as […]. The second term corresponds to the static regret for the finite-armed problem.
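
A short calculation shows why regularity controls what a finite subsample can miss. This is a sketch under assumptions not stated in the source: $K$ denotes the number of i.i.d. subsampled arms, $\kappa_1$ a lower regularity constant with $\mathbb{P}(\mu_0(a) > 1 - x) \ge \kappa_1 x^{\beta}$, and $\eta$ a failure probability. It concerns initial gaps only, before any adversarial change.

```latex
% Assumptions (not from the source): K i.i.d. subsampled arms, and a reservoir
% with P(mu_0(a) > 1 - x) >= kappa_1 x^beta for all x in [0, 1].
\begin{align*}
\mathbb{P}\Big(\min_{a \le K} \delta_0(a) \ge x\Big)
  &= \big(1 - \mathbb{P}(\mu_0(a) > 1 - x)\big)^{K} \\
  &\le \big(1 - \kappa_1 x^{\beta}\big)^{K} \\
  &\le \exp\big(-\kappa_1 K x^{\beta}\big).
\end{align*}
% Choosing x = (\log(1/\eta) / (\kappa_1 K))^{1/\beta}, when this is at most 1,
% makes the right-hand side equal to \eta.
```

So with probability at least $1 - \eta$, some subsampled arm has initial gap below $(\log(1/\eta)/(\kappa_1 K))^{1/\beta}$. A larger subsample shrinks this term but leaves the base algorithm more arms to explore, which is the trade-off the choice of subsample size has to balance.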

Resets are governed by empirical regret thresholds, ensuring that if the base algorithm remains effective, regret is bounded; otherwise, significant non-stationarity or base algorithm suboptimality is detected, triggering re-exploration.

3. Regret Analysis and Main Theoretical Results

Regret bounds are established with high probability (w.h.p., i.e., with probability at least […]). In terms of the realized count of detected resets/changes and the realized variation, the following hold:

  • For […] and block sizes […]:

[…]

  • For […] and […]:

[…]

Here, […] characterizes the cost of resets; […] corresponds to total-variation regret bounds.

This approach achieves minimax-optimal and parameter-free regret rates for all regimes of the regularity parameter $\beta$, even without knowledge of any non-stationarity parameters (Suk et al., 31 Jan 2025).

4. Significant Shift Measures

A refined measure of non-stationarity is the "significant shift" count, which captures only those shifts that force a restart or re-exploration. Informally, an interval of rounds is safe if there exists an arm, among the first arms sampled at its start, whose cumulative regret on the interval does not exceed […].

Epochs between significant shifts are defined recursively: the first epoch begins at the first round, and each subsequent significant shift is the first round at which no safe arm exists for the interval since the previous shift. The significant shift count is the number of such shifts occurring within the horizon. Crucially, […] w.h.p., emphasizing that the significant shift count discounts regime changes that do not affect the exploration policy.

The main theoretical result with respect to this metric for the randomized elimination algorithm is:

[…]

demonstrating that regret depends adaptively on the truly significant non-stationarity (Suk et al., 31 Jan 2025).

5. Randomized Elimination and Adaptive Algorithms

The randomized elimination variant replaces the base finite-armed algorithm with a uniform sampling and importance-weighted elimination strategy within each block. For each block:

  • Draw a subsample of arms from the reservoir.
  • Maintain an active set of candidate arms. In each round $t$ of the block, play an arm $a_t$ drawn uniformly at random from the active set and observe $Y_t(a_t)$.
  • Compute importance-weighted losses for the active arms. Eliminate an arm if its cumulative importance-weighted loss exceeds a threshold.

This mechanism detects significant shifts directly, through the elimination of every arm in the active set, while avoiding overreaction to mild variation. The resulting regret bound is shift-adaptive, incurring no penalty for non-critical regime changes.
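
A minimal sketch of one such block is given below. The fixed `threshold`, the loss `1 - y`, and the function name are assumptions made for illustration. The source's actual elimination rule and threshold are not reproduced here.

```python
import random

def randomized_elimination(arms, pull, length, threshold, rng):
    """Play uniformly from the active set and eliminate on importance-weighted loss.

    Returns (rounds_played, active). An empty active set is the signal that
    would trigger a restart in the surrounding blackbox scheme.
    """
    active = list(arms)
    cum_loss = {a: 0.0 for a in arms}
    for s in range(length):
        if not active:
            return s, active
        a = rng.choice(active)   # uniform over the active set
        y = pull(a)
        # Importance weighting: arm a is played with probability 1/|active|,
        # so (1 - y) * |active| is an unbiased estimate of its current gap.
        cum_loss[a] += (1.0 - y) * len(active)
        # Only the played arm's estimate changed, so only it needs checking.
        if cum_loss[a] > threshold:
            active.remove(a)
    return length, active

# Usage with hypothetical stationary arm means.
rng = random.Random(0)
means = {0: 0.9, 1: 0.2, 2: 0.5}
rounds, survivors = randomized_elimination(
    arms=list(means), pull=lambda a: float(rng.random() < means[a]),
    length=500, threshold=60.0, rng=rng)
print(rounds, survivors)
```

Because every active arm is sampled at the same rate, no single arm's estimate goes stale. A shift that makes every candidate bad eventually empties the active set, while variation that leaves at least one good arm does not.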

6. Connections to Prior Work and Advances

Prior foundational results on stationary infinite-armed bandits (Berry & Fristedt, Wang et al., Carpentier & Locatelli) achieve regret […] under the assumption […]. Non-stationary results were previously limited to monotone (rotting rewards) settings, often requiring explicit knowledge of regularity or rotting parameters and restricted to certain non-stationarity regimes (Kim et al. 2022, 2024).

The non-stochastic paradigm investigated here achieves several advances (Suk et al., 31 Jan 2025):

  • Fully adversarial, adaptive reward evolution for played arms.
  • Parameter-free algorithms applicable to all regimes of the regularity parameter $\beta$.
  • Regret rates depending only on realized non-stationarity and significant shifts, not on global knowledge of variation parameters.
  • Introduction of the significant-shift measure of non-stationarity.
  • High probability regret guarantees and adaptivity in the presence of arbitrary non-stationarity.

A plausible implication is that shift-adaptive methodologies could inform robust exploration in large-scale, open-world online learning applications where the reward landscape evolves adversarially and arm discovery is unbounded.
