Non-Stochastic Infinite-Armed Bandit
- Non-stochastic infinite-armed bandit is an online learning framework with an infinite set of actions where each arm’s reward evolves non-stationarily and adversarially after being played.
- A novel blackbox conversion method reduces the infinite-armed problem to finite-armed instances by subsampling arms and triggering adaptive resets based on empirical regret thresholds.
- The approach achieves minimax-optimal, parameter-free regret bounds by leveraging reservoir regularity and significant shift measures, with bounds adapting to cumulative and rotting variations.
A non-stochastic infinite-armed bandit is an online learning problem in which the decision maker faces an infinite set of possible actions (arms). Each arm, when first selected, draws its initial mean reward independently from a reservoir distribution. Unlike the stochastic setting, the non-stationary and adversarial variant lets arm rewards evolve adaptively or adversarially over time, subject to the constraint that only the arm pulled at a given round can change its value (the rested model). This setting generalizes stationary and rotting infinite-armed bandits and introduces new challenges in exploration, exploitation, and the handling of dynamically shifting reward distributions (Suk et al., 31 Jan 2025).
1. Formal Definition and Problem Structure
Let $\mathcal{A}$ denote the infinite set of arms and $T$ the time horizon. At round $t$, the learner either draws a new arm from the reservoir or selects one of the previously sampled arms to play. The initial mean reward of each freshly drawn arm is sampled i.i.d. from a reservoir distribution that is unknown to the learner and assumed to be $\beta$-regular. Subsequently, the individual reward processes $Y_t(a)$, and thus the means $\mu_t(a)$, evolve according to an adaptive or adversarial process, but only for arms that are actually played at each step.
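The performance metric is cumulative regret, defined as:
$$R_T := \sum_{t=1}^{T} \delta_t(a_t),$$
where $a_t$ is the arm played at round $t$ and $\delta_t(a) := 1 - \mu_t(a)$ is the instantaneous gap of arm $a$ at time $t$, with $\mu_t(a)$ its mean reward.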
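Non-stationarity is captured along the play path $a_1, \dots, a_T$ via:
- Total (realized) variation: $V := \sum_{t=2}^{T} \lvert \mu_t(a_{t-1}) - \mu_{t-1}(a_{t-1}) \rvert$
- Total rotting variation: $V_R := \sum_{t=2}^{T} \big( \mu_{t-1}(a_{t-1}) - \mu_t(a_{t-1}) \big)_+$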
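The two variation measures can be illustrated with a small sketch. It assumes that the total variation sums the absolute change in the played arm's mean at each round, and that the rotting variation sums only the decreases; the function name `realized_variations` and the toy numbers are hypothetical, not taken from the source.

```python
from typing import Sequence, Tuple


def realized_variations(mu_before: Sequence[float],
                        mu_after: Sequence[float]) -> Tuple[float, float]:
    """Return (total variation, rotting variation) along a play path.

    mu_before[k] is the mean of the arm played at step k when it is pulled,
    mu_after[k] is the mean of that same arm right after the pull.
    In the rested model only the played arm can change, so these pairs
    capture all realized non-stationarity.
    """
    total = 0.0
    rotting = 0.0
    for before, after in zip(mu_before, mu_after):
        change = after - before
        total += abs(change)            # any change counts
        rotting += max(-change, 0.0)    # only decreases count
    return total, rotting


# Toy play path: two decreases of 0.1, one unchanged step, one increase of 0.2.
mu_before = [0.9, 0.8, 0.95, 0.7]
mu_after = [0.8, 0.8, 0.85, 0.9]
total_variation, rotting_variation = realized_variations(mu_before, mu_after)
```

By construction the rotting variation never exceeds the total variation, and the two coincide when rewards only decrease.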
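Reservoir regularity assumes the existence of $\beta > 0$ and constants $\kappa_1, \kappa_2 > 0$ such that for all $x \in [0, 1]$:
$$\kappa_1 x^{\beta} \le \mathbb{P}\big( \mu_0(a) > 1 - x \big) \le \kappa_2 x^{\beta},$$
where $\mu_0(a)$ is the initial mean reward of a freshly drawn arm $a$.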
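As a concrete illustration, the sketch below samples initial means from one particular reservoir and checks its tail empirically. It assumes the regularity condition takes the common form "the probability that a fresh arm's mean exceeds 1 - x scales like x^beta"; the specific reservoir (mean = 1 - U^(1/beta) with U uniform) and the names `sample_reservoir_mean` and `empirical_tail` are illustrative choices, not the source's.

```python
import random


def sample_reservoir_mean(beta: float, rng: random.Random) -> float:
    """Draw an initial mean reward from a toy beta-regular reservoir.

    With U uniform on [0, 1), mu = 1 - U**(1/beta) satisfies
    P(mu > 1 - x) = P(U < x**beta) = x**beta for x in [0, 1].
    """
    return 1.0 - rng.random() ** (1.0 / beta)


def empirical_tail(beta: float, x: float, n_samples: int, seed: int = 0) -> float:
    """Estimate P(mu > 1 - x) by Monte Carlo."""
    rng = random.Random(seed)
    hits = sum(sample_reservoir_mean(beta, rng) > 1.0 - x for _ in range(n_samples))
    return hits / n_samples


beta, x = 2.0, 0.3
tail = empirical_tail(beta, x, n_samples=200_000)
# For this reservoir the estimate should be close to x**beta = 0.09.
```

Larger values of beta make near-optimal arms rarer in the reservoir, which is why the regret rates depend on this parameter.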
2. Parameter-Free Blackbox Conversion for Non-Stationary Infinite-Armed Bandits
A central methodological innovation is a blackbox conversion scheme which reduces the infinite-armed non-stationary bandit problem to a sequence of finite-armed bandit instances. The approach proceeds as follows:
- Subsample a finite set $\mathcal{A}_0$ of $S$ arms from the reservoir.
- Run a gap-dependent finite-armed MAB base algorithm on $\mathcal{A}_0$ in doubling time blocks.
- Monitor empirical regret within each block; trigger parameter-free resets/restarts if empirical regret exceeds a controlled threshold.
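The steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the algorithm from the paper: the base learner is plain UCB1, the subsample size per block (`length ** (beta / (beta + 1))`) and the restart threshold (`c * sqrt(n_arms * length * log(horizon + 1))`) are hypothetical choices, and `Reservoir` is a toy rested environment in which a played arm's mean decays by `rot` per pull.

```python
import math
import random


class Reservoir:
    """Toy rested environment: Bernoulli arms; only the played arm's mean changes."""

    def __init__(self, beta, rot=0.0, seed=0):
        self.beta, self.rot = beta, rot
        self.rng = random.Random(seed)
        self.means = []

    def new_arm(self):
        # Initial mean from a toy reservoir with P(mean > 1 - x) = x**beta.
        self.means.append(1.0 - self.rng.random() ** (1.0 / self.beta))
        return len(self.means) - 1

    def pull(self, arm):
        reward = 1.0 if self.rng.random() < self.means[arm] else 0.0
        self.means[arm] = max(0.0, self.means[arm] - self.rot)
        return reward


def ucb_pick(counts, sums, step):
    """UCB1 index rule over the current subsample (stand-in base algorithm)."""
    for i, count in enumerate(counts):
        if count == 0:
            return i
    return max(range(len(counts)),
               key=lambda i: sums[i] / counts[i]
               + math.sqrt(2 * math.log(step + 1) / counts[i]))


def blackbox_restart(env, horizon, beta, c=2.0):
    """Run the base algorithm on fresh subsamples in doubling blocks,
    restarting from the shortest block when empirical regret is too large."""
    t, restarts, total_reward, m = 0, 0, 0.0, 0
    while t < horizon:
        length = 2 ** m
        n_arms = max(1, math.ceil(length ** (beta / (beta + 1))))  # hypothetical size
        arms = [env.new_arm() for _ in range(n_arms)]
        counts, sums = [0] * n_arms, [0.0] * n_arms
        threshold = c * math.sqrt(n_arms * length * math.log(horizon + 1))  # hypothetical
        block_regret, restarted = 0.0, False
        for step in range(length):
            if t >= horizon:
                break
            i = ucb_pick(counts, sums, step)
            reward = env.pull(arms[i])
            counts[i] += 1
            sums[i] += reward
            total_reward += reward
            block_regret += 1.0 - reward  # empirical regret against the ideal reward 1
            t += 1
            if block_regret > threshold:
                restarted = True
                break
        if restarted:
            restarts += 1
            m = 0       # reset the doubling schedule
        else:
            m += 1      # block survived: double its length
    return total_reward, restarts, t


env = Reservoir(beta=1.0, rot=0.05, seed=1)
total_reward, restarts, pulls = blackbox_restart(env, horizon=2000, beta=1.0, c=0.5)
```

In this toy run every arm rots to zero after a few pulls, so long blocks accumulate empirical regret well above the threshold and the scheme restarts with fresh arms. None of the inputs to `blackbox_restart` describe the amount of non-stationarity, which is the sense in which the restart rule is parameter-free.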
The decomposition of regret for such a strategy is:
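$$\sum_{t} \delta_t(a_t) = \sum_{t} \delta_t(\hat{a}) + \sum_{t} \big( \delta_t(a_t) - \delta_t(\hat{a}) \big),$$
where $a_t$ is the arm played at round $t$, $\delta_t(a)$ is the gap of arm $a$ at round $t$, and $\hat{a}$ is the best fixed arm in the subsample. The first term represents "missed-arm" regret, arising from subsampling only $S$ arms, and can be controlled by the reservoir's $\beta$-regularity as roughly $S^{-1/\beta}$ per round, up to logarithmic factors. The second term corresponds to the static regret for the finite-armed problem.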
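The way regularity controls the missed-arm term can be made explicit with a short calculation. It is a sketch under assumptions not spelled out here: the gap of a fresh arm is measured against the maximal mean 1, $\kappa_1$ denotes a lower regularity constant with $\mathbb{P}(\delta < x) \ge \kappa_1 x^{\beta}$, $S$ is the subsample size, and $\eta$ is a failure probability.

```latex
% Initial gaps \delta^{(1)}, \dots, \delta^{(S)} of S i.i.d. arms from the reservoir.
\begin{align*}
\mathbb{P}\Big( \min_{i \le S} \delta^{(i)} \ge x \Big)
  &= \big( 1 - \mathbb{P}(\delta^{(1)} < x) \big)^{S}
   \le \big( 1 - \kappa_1 x^{\beta} \big)^{S}
   \le \exp\big( -\kappa_1 S x^{\beta} \big). \\
\intertext{Choosing $x = \big( \log(1/\eta) / (\kappa_1 S) \big)^{1/\beta} \le 1$ gives}
\mathbb{P}\Big( \min_{i \le S} \delta^{(i)} \ge x \Big) &\le \eta .
\end{align*}
% Hence, with probability at least 1 - \eta, the best subsampled arm has initial gap
% of order (\log(1/\eta)/S)^{1/\beta}; over n rounds in which that arm's mean does
% not change, it contributes at most n (\log(1/\eta)/(\kappa_1 S))^{1/\beta} regret.
```

Increasing $S$ shrinks this term but enlarges the finite-armed problem handed to the base algorithm, so the subsample size trades off the two terms of the decomposition.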
Resets are governed by empirical regret thresholds, ensuring that if the base algorithm remains effective, regret is bounded; otherwise, significant non-stationarity or base algorithm suboptimality is detected, triggering re-exploration.
3. Regret Analysis and Main Theoretical Results
Regret bounds are established with high probability (w.h.p., i.e., with probability at least 7). For realized count 8 (number of detected resets/changes) and variation 9, the following hold:
- For 0 and block sizes 1:
2
- For 3 and 4:
5
Here, 6 characterizes the cost of resets; 7 corresponds to total-variation regret bounds.
This approach achieves minimax-optimal, parameter-free regret rates in all regimes of the regularity parameter $\beta$, even without knowledge of the number of changes, the total variation, or other non-stationarity parameters (Suk et al., 31 Jan 2025).
4. Significant Shift Measures
A refined measure of non-stationarity is the "significant shift" count $\tilde{L}$, which captures only those shifts that force a restart or re-exploration. Informally, an interval $[s_1, s_2]$ is safe if, among an initial batch of arms sampled at its start, there exists an arm whose cumulative regret on the interval does not exceed a prescribed threshold.
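Epochs between significant shifts are defined recursively: set $\tau_0 = 1$, and let $\tau_{i+1}$ be the first $t > \tau_i$ for which no safe arm exists on $[\tau_i, t]$. The significant-shift count $\tilde{L}$ is the largest $i$ with $\tau_i \le T$. Crucially, $\tilde{L} \le L$ w.h.p., where $L$ is the number of changes, emphasizing that $\tilde{L}$ discounts regime changes that do not affect exploration policy.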
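The recursive definition can be sketched as follows. The safety test here is a deliberately simplified stand-in: gaps are given as a fixed table (ignoring that, in the rested model, they depend on which arms are played), and the threshold `sqrt(interval length)` is a hypothetical choice, since the actual threshold is not recoverable from the text above. The names `significant_shift_times` and `make_is_safe` are illustrative.

```python
import math
from typing import Callable, List, Sequence


def significant_shift_times(horizon: int,
                            is_safe: Callable[[int, int], bool]) -> List[int]:
    """Return [tau_0, tau_1, ...]: tau_0 = 1, and tau_{i+1} is the first
    t > tau_i such that the interval [tau_i, t] is not safe."""
    taus = [1]
    for t in range(2, horizon + 1):
        if not is_safe(taus[-1], t):
            taus.append(t)  # a significant shift starts a new epoch at t
    return taus


def make_is_safe(gaps: Sequence[Sequence[float]],
                 threshold: Callable[[int], float]) -> Callable[[int, int], bool]:
    """Toy safety test: [start, end] is safe if some arm's cumulative gap
    on the interval stays within threshold(interval length)."""
    def is_safe(start: int, end: int) -> bool:
        length = end - start + 1
        # Rounds are 1-indexed, so round u lives at list index u - 1.
        return any(sum(arm_gaps[start - 1:end]) <= threshold(length)
                   for arm_gaps in gaps)
    return is_safe


# Two arms over 8 rounds: arm 0 is good then bad, arm 1 is bad then good.
gaps = [[0.1] * 4 + [0.9] * 4,
        [0.9] * 4 + [0.1] * 4]
is_safe = make_is_safe(gaps, threshold=math.sqrt)
taus = significant_shift_times(8, is_safe)
n_shifts = len(taus) - 1
```

In this toy instance both arms change at round 5, but a shift is only declared once no arm remains safe on the current epoch, which illustrates how the count ignores changes that do not yet force re-exploration.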
The main theoretical result with respect to this metric for the randomized elimination algorithm is:
4
demonstrating that regret depends adaptively on the truly significant non-stationarity (Suk et al., 31 Jan 2025).
5. Randomized Elimination and Adaptive Algorithms
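The randomized elimination variant replaces the base finite-armed algorithm with a uniform sampling and importance-weighted elimination strategy within each block. For each block $m$:
- Draw a subsample $\mathcal{A}_m$ of $S_m$ arms.
- Maintain an active set $\mathcal{G}_t \subseteq \mathcal{A}_m$. For each round $t$ in the block, play $a_t$ drawn uniformly from $\mathcal{G}_t$ and observe the reward $Y_t(a_t)$.
- Compute importance-weighted losses $\hat{\delta}_t(a) = \lvert \mathcal{G}_t \rvert \, (1 - Y_t(a_t)) \, \mathbf{1}\{a_t = a\}$. Eliminate $a$ if the cumulative weight exceeds a threshold.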
This mechanism detects significant shifts directly, through the elimination of every arm in the active set, while avoiding overreaction to mild variation. The resulting regret bound is shift-adaptive, incurring no penalty for non-critical regime changes.
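A minimal sketch of one block of this procedure is given below. It makes simplifying assumptions that are not from the source: losses are measured against the ideal reward 1, each arm's estimate is accumulated from the start of the block only, and the elimination threshold is supplied by the caller as a hypothetical function `threshold(t)`. The function name `randomized_elimination` is illustrative.

```python
import random
from typing import Callable, Hashable, List, Optional, Sequence, Tuple


def randomized_elimination(pull: Callable[[Hashable], float],
                           arms: Sequence[Hashable],
                           n_rounds: int,
                           threshold: Callable[[int], float],
                           seed: int = 0) -> Tuple[Optional[int], List[Hashable]]:
    """Play uniformly over an active set and eliminate arms whose cumulative
    importance-weighted loss exceeds threshold(t).

    Returns (round at which the active set became empty, or None; surviving arms).
    An empty active set is the signal that a significant shift has occurred.
    """
    rng = random.Random(seed)
    active = list(arms)
    cum_loss = {a: 0.0 for a in arms}
    for t in range(1, n_rounds + 1):
        a_t = rng.choice(active)                  # uniform play over active arms
        reward = pull(a_t)
        # Importance weighting: a_t is chosen with probability 1/len(active),
        # so len(active) * (1 - reward) is an unbiased estimate of its gap.
        cum_loss[a_t] += len(active) * (1.0 - reward)
        active = [a for a in active if cum_loss[a] <= threshold(t)]
        if not active:
            return t, active                      # every arm eliminated
    return None, active


# Toy usage with deterministic rewards: one perfect arm and one useless arm.
rewards = {"good": 1.0, "bad": 0.0}
shift_round, survivors = randomized_elimination(
    pull=lambda a: rewards[a], arms=["good", "bad"],
    n_rounds=200, threshold=lambda t: 5.0)
```

Here the useless arm is dropped after a few pulls while the perfect arm survives, so no shift is signalled. If every arm in the subsample were bad, the active set would empty out and the first return value would report the round at which that happened.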
6. Connections to Prior Work and Advances
Prior foundational results on stationary infinite-armed bandits (Berry & Fristedt, Wang et al., Carpentier & Locatelli) achieve regret 5 under the assumption 6. Non-stationary results were previously limited to monotone (rotting rewards) settings, often requiring explicit knowledge of regularity or rotting parameters and restricted to certain non-stationarity regimes (Kim et al. 2022, 2024).
The non-stochastic paradigm investigated here achieves several advances (Suk et al., 31 Jan 2025):
- Fully adversarial, adaptive reward evolution for played arms.
- Parameter-free algorithms applicable to all $\beta > 0$.
- Regret rates depending only on realized non-stationarity and significant shifts, not on global knowledge of variation parameters.
- Introduction of the significant-shift count as a measure of non-stationarity.
- High probability regret guarantees and adaptivity in the presence of arbitrary non-stationarity.
A plausible implication is that shift-adaptive methodologies could inform robust exploration in large-scale, open-world online learning applications where the reward landscape evolves adversarially and arm discovery is unbounded.