Aging Bandit with Adaptive Reset
- Aging bandits with adaptive reset are non-stationary multi-armed bandit strategies that dynamically balance exploration and exploitation by detecting drift and resetting learned statistics.
- They utilize adaptive windowing, change detection, and periodic monitoring to maintain near-optimal performance in environments with continuous decay or abrupt shifts.
- Empirical and theoretical studies demonstrate sublinear regret bounds and effective performance in applications such as edge network AoI minimization and infinite rotting bandit scenarios.
An aging bandit with adaptive reset is a class of non-stationary multi-armed bandit (MAB) algorithms designed to accommodate environments where expected rewards of arms evolve gradually (aging/rot), abruptly (change-points), or both. The unifying characteristic of this regime is the use of adaptive detection and resetting mechanisms to maintain near-optimal decision making as distributional properties drift or shift over time. These techniques have emerged as essential tools for systems ranging from edge networks with freshness constraints (Zhuang et al., 13 Jan 2026), to infinitely-armed rotting bandit problems (Kim et al., 2024), and general nonstationary MABs (Komiyama et al., 2021). The following sections review key models, algorithmic frameworks, theoretical guarantees, and empirical findings in the study of aging bandits with adaptive reset.
1. Problem Formulation and Nonstationary Bandit Models
The aging bandit problem generalizes the classical MAB by allowing reward distributions to change over time, either continuously (aging, “rested rotting”), abruptly (piecewise-constant), or in a mixed regime. In canonical models:
- Rested Rotting Bandits: Each arm possesses a mean reward that decreases (rots) upon being pulled, with the decay possibly adversarially controlled but globally bounded by a total-decay or switch-count budget (Kim et al., 2024). With infinitely many arms, regret is measured against an oracle that always holds a fresh, near-optimal arm.
- Abrupt and Gradual Nonstationarity: Mean rewards of arms can undergo sudden jumps (change-points) or bounded per-round drift (Komiyama et al., 2021, Zhuang et al., 13 Jan 2026).
- Edge Network AoI Minimization: Clients decide which access nodes (ANs) to request fresh content from, with AoI (age of information) reduction as the reward (Zhuang et al., 13 Jan 2026). The objective is to minimize time-averaged AoI, reformulated as maximizing cumulative age reduction over the decision horizon.
Nonstationarity is often partially observable, and may involve history- and coupling-dependencies between arms, as in collaborative edge scenarios (Zhuang et al., 13 Jan 2026).
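The AoI-reduction objective above can be made concrete with a minimal simulation. This is a hedged sketch under simplifying assumptions (unit age growth per slot, age resetting to zero on a successful fresh-content request); the cited edge-network model is richer, and `hit` is an illustrative placeholder for the outcome of an AN request:

```python
def simulate_aoi(horizon, hit):
    """Track one client's age of information (AoI) over `horizon`
    slots. `hit(t)` returns True when the round-t request reaches an
    access node with fresh content (age resets to zero -- a
    simplifying assumption; the cited model is richer).
    Returns (time-averaged AoI, cumulative age reduction)."""
    age = 0
    total_age = 0
    total_reduction = 0
    for t in range(horizon):
        age += 1                    # information ages by one slot
        if hit(t):
            total_reduction += age  # reward: the age reduction earned
            age = 0                 # fresh content resets the age
        total_age += age
    return total_age / horizon, total_reduction
```

With a request that succeeds every fourth slot, `simulate_aoi(100, lambda t: t % 4 == 3)` gives an average AoI of 1.5 and cumulative age reduction of 100; policies earning more cumulative reduction achieve lower time-averaged AoI, which is the reformulation used above.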
2. Adaptive Resetting: Algorithmic Designs
The central architectural feature in this class is adaptive resetting, which combines:
- Sliding/Adaptive Windows: Maintain reward statistics over a window whose span adapts to data-driven nonstationarity, either via explicit window-length optimizations (Kim et al., 2024) or change-point/evidence-based resizing (Komiyama et al., 2021, Zhuang et al., 13 Jan 2026).
- Change Detection: Employ detection schemes such as ADWIN (Adaptive Windowing) (Komiyama et al., 2021, Zhuang et al., 13 Jan 2026) to trigger resets. ADWIN maintains a variable-length window $W$ over recent samples and signals a change as soon as some bipartition $W = W_0 \cup W_1$ has empirical means differing by more than a cut threshold:
$$\left|\hat\mu_{W_0} - \hat\mu_{W_1}\right| \ge \epsilon_{\mathrm{cut}} = \sqrt{\frac{1}{2m}\ln\frac{4|W|}{\delta}}, \qquad m = \frac{1}{1/|W_0| + 1/|W_1|},$$
for a chosen confidence parameter $\delta \in (0,1)$.
- Periodic Monitoring: To address nonstationarity in rarely-pulled arms, algorithms enforce periodic exploration of each arm within “blocks” to ensure changes are detectable (Zhuang et al., 13 Jan 2026, Komiyama et al., 2021).
- Reset and Re-initialization: Upon detection, empirical means, counters, and relevant algorithmic state are purged and the procedure restarts, often with an adjusted time horizon.
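An ADWIN-style cut rule can be sketched as follows. This is a hedged simplification: production ADWIN maintains compressed exponential buckets for efficiency, while this version naively checks every prefix/suffix bipartition of the raw window:

```python
import math

def adwin_change(window, delta=0.05):
    """Flag a change if some bipartition of `window` into a prefix
    W0 and suffix W1 has a mean gap exceeding a Hoeffding-style
    cut threshold (after Bifet & Gavalda's ADWIN rule)."""
    n = len(window)
    total = sum(window)
    head_sum = 0.0
    for i in range(1, n):                  # every bipartition (prefix | suffix)
        head_sum += window[i - 1]
        n0, n1 = i, n - i
        m = 1.0 / (1.0 / n0 + 1.0 / n1)    # harmonic mean of the two sizes
        eps_cut = math.sqrt(1.0 / (2.0 * m) * math.log(4.0 * n / delta))
        if abs(head_sum / n0 - (total - head_sum) / n1) > eps_cut:
            return True
    return False

stable = [0.5] * 60                        # no bipartition exceeds the cut
shifted = [0.1] * 30 + [0.9] * 30          # a 0.8 mean jump trips the cut
```

On detection, the algorithms above would shrink the window and purge statistics; here the caller is left to decide how to reset.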
A representative algorithm is ABAR (Aging Bandit with Adaptive Reset) (Zhuang et al., 13 Jan 2026), structured as follows:
- Time is divided into exponentially growing blocks, each interleaving monitoring pulls with UCB-driven exploitation.
- Arm statistics are updated online; resets are triggered upon drift detection by ADWIN.
- Monitoring frequency and window sizes are tuned adaptively according to theoretical nonstationarity parameters.
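A minimal sketch of such a block-structured loop, assuming a pluggable change detector and time-indexed arm reward functions (both illustrative, not the exact ABAR specification):

```python
import math

def block_bandit(arms, n_blocks, detector):
    """Block-structured bandit loop: forced monitoring pulls, then
    UCB-driven exploitation over an exponentially growing block,
    with a full reset whenever `detector` flags drift in any arm's
    reward history. Returns the number of resets performed."""
    counts = [0] * len(arms)
    sums = [0.0] * len(arms)
    history = [[] for _ in arms]
    t, resets = 0, 0

    def pull(a):
        nonlocal t
        r = arms[a](t)                         # reward may depend on time
        counts[a] += 1
        sums[a] += r
        history[a].append(r)
        t += 1

    for b in range(n_blocks):
        for a in range(len(arms)):             # forced monitoring pulls
            pull(a)
        for _ in range(2 ** b):                # exponentially growing block
            n = sum(counts)
            a = max(range(len(arms)),
                    key=lambda i: sums[i] / counts[i]
                    + math.sqrt(2 * math.log(n) / counts[i]))
            pull(a)                            # UCB-driven exploitation
        if any(detector(h) for h in history):  # drift detected: full reset
            counts = [0] * len(arms)
            sums = [0.0] * len(arms)
            history = [[] for _ in arms]
            resets += 1
            for a in range(len(arms)):         # re-initialize statistics
                pull(a)
    return resets
```

In practice the detector would be the ADWIN rule and the monitoring frequency would scale with the assumed drift parameters; here both are left pluggable.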
For infinitely-armed settings, an adaptive sliding-window UCB discards arms as soon as their sliding-window UCB drops below a threshold, with the window span selected to optimally trade off bias (due to rotting) and variance (Kim et al., 2024).
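The discard rule for the infinitely-armed setting can be sketched as below. The window length `w`, confidence `delta`, and fixed `threshold` are illustrative placeholders; Kim et al. instead tune the window to the rotting budget to balance bias and variance:

```python
import math

def sw_ucb(rewards, w, delta=0.01):
    """Sliding-window UCB index computed from an arm's last `w` rewards."""
    recent = rewards[-w:]
    n = len(recent)
    return sum(recent) / n + math.sqrt(math.log(1.0 / delta) / (2 * n))

def play_until_eliminated(arm, w, threshold, max_pulls=10_000):
    """Pull a single arm until its sliding-window UCB drops below
    `threshold`, at which point the arm is dropped forever (the
    infinitely-armed policy would then sample a fresh arm).
    Returns the number of pulls spent on this arm."""
    rewards = []
    for t in range(max_pulls):
        rewards.append(arm(t))
        if sw_ucb(rewards, w) < threshold:
            return t + 1
    return max_pulls
```

A stationary arm above the threshold is never discarded, while a rotting arm is abandoned soon after its windowed mean decays, which is the intended bias/variance trade: a short window tracks the rot quickly at the cost of a noisier estimate.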
3. Theoretical Regret Guarantees
These algorithms aim for sublinear regret with respect to relevant nonstationarity metrics. Universal patterns in guarantees include:
- Gradual Drift (Aging): For drift with total-variation budget $V_T$, ABAR and ADR-bandit both achieve regret of order
$$\tilde O\!\left(V_T^{1/3}\, T^{2/3}\right),$$
which is sublinear whenever $V_T = o(T)$ (Zhuang et al., 13 Jan 2026, Komiyama et al., 2021). This rate matches the minimax lower bound $\Omega(V_T^{1/3} T^{2/3})$ up to logarithmic factors (Komiyama et al., 2021).
- Abrupt Changes: With $N_T$ detectable global change-points among $K$ arms, regret scales as
$$\tilde O\!\left(\sqrt{N_T\, K\, T}\right)$$
(Komiyama et al., 2021, Allesiardo et al., 2016). Detection delays are controlled by block scheduling and statistical monitoring.
- Rested Rotting (Infinite Arms): For slow rotting (bounded cumulative-decay budget) and abrupt rotting (bounded number of decay events), minimax regret rates polynomial in the horizon have been established, with exponents governed by the rotting budget and by the density of near-optimal initial mean rewards (Kim et al., 2024).
- Stationary Sub-case: Both ABAR and ADR-bandit recover the classical logarithmic regret of the base bandit in stationary environments (no drift and no change-points) (Zhuang et al., 13 Jan 2026, Komiyama et al., 2021).
Critical to these results is the adaptivity of the reset/monitoring schedule and the drift-tolerant character of the underlying bandit subroutine (e.g., UCB, Thompson Sampling, KL-UCB), all of which have been analyzed to satisfy concentration and regret conditions under nonstationary regimes (Komiyama et al., 2021).
4. Empirical Evaluation and Comparative Performance
Extensive simulations and real-world benchmarks demonstrate:
- In decentralized edge networks optimizing AoI, ABAR achieves average AoI within 5–10% of a centralized oracle, outperforming D-UCB, SW-UCB, and multi-agent MAB baselines by large margins in both well-separated and closely-spaced AN scenarios (Zhuang et al., 13 Jan 2026). Its cumulative AoI regret flattens (sublinear growth) while other methods display linear increase.
- ADR-bandit matches theoretical regret rates on synthetic environments with both abrupt and gradual nonstationarity, outperforming fixed-window, SW-UCB, D-UCB, and discounting variants (Komiyama et al., 2021). On real datasets with strongly time-varying rewards, ADR-bandit adapts more rapidly than windowed competitors, requiring only a single confidence parameter.
- In infinitely-armed slow and abrupt rotting, adaptive sliding-window UCB with Bandit-over-Bandits meta-tuning approaches minimax regret, showing substantial practical improvement over pessimistic UCB-TP and other windowed strategies (Kim et al., 2024).
- Random-shuffling and adaptive-reset elimination (SER) gives uniformly improved regret and sample complexity over both vanilla and fixed-reset successive elimination in environments with arbitrary or adversarial switching (Allesiardo et al., 2016), providing a robust template for best-arm identification under nonstationarity.
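A sketch in the spirit of the SER template: successive elimination with a freshly randomized round-robin order each round, so that arm comparisons are decorrelated from where within a round a switch occurs. The adaptive-reset component is omitted for brevity, and the confidence radius is a generic Hoeffding width rather than the paper's exact constants:

```python
import math
import random

def ser_round_robin(arms, n_rounds, delta=0.05, seed=0):
    """Successive elimination with randomized round-robin order:
    each round pulls every surviving arm once in a shuffled order,
    then eliminates arms whose empirical mean falls a full
    confidence width below the current leader."""
    rng = random.Random(seed)
    active = list(range(len(arms)))
    sums = [0.0] * len(arms)
    for r in range(1, n_rounds + 1):
        order = active[:]
        rng.shuffle(order)                 # random shuffle within the round
        for a in order:
            sums[a] += arms[a]()
        radius = math.sqrt(
            math.log(2 * len(arms) * n_rounds / delta) / (2 * r))
        best = max(sums[a] / r for a in active)
        active = [a for a in active        # drop clearly suboptimal arms
                  if sums[a] / r >= best - 2 * radius]
    return active
```

With well-separated arms, all clearly suboptimal arms are eliminated once the confidence width shrinks below half the gap; a reset policy would periodically restore `active`, `sums`, and the round counter to handle switches.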
5. Extensions, Open Questions, and Limitations
Several algorithmic and theoretical extensions remain active research topics:
- Parameter-Free Adaptivity: While meta-algorithms such as Bandit-over-Bandits offer automatic tuning for unknown nonstationarity, a parameter-free solution matching optimal regret without extra cost is open (Kim et al., 2024).
- Locally Non-Stationary/Partial Drift: Most adaptive-reset frameworks presuppose global change affecting all arms similarly. Performance can degrade if only a subset of arms ages or shifts (violating the global assumption), as ADWIN-driven resets may not detect all local changes (Komiyama et al., 2021).
- Structured/Augmented Nonstationarity: Regret bounds and algorithmic effectiveness under feature-dependent decay, adversarial drift in non-rested settings, or continuous arm spaces are not fully characterized (Kim et al., 2024).
- Trade-offs between Change Detection and Exploitation: Monitoring overhead, estimation bias/variance trade-off, and the choice of window/block sizes must be carefully balanced, particularly in environments with mixtures of slow and fast-changing arms.
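The window-size trade-off noted above can be illustrated numerically. Under a linear drift of rate `d`, a length-`w` sliding mean carries bias of roughly `d*w/2` while its standard error shrinks as `sigma/sqrt(w)`; the error-minimizing window balances the two (a stylized calculation, not a bound from the cited papers):

```python
def window_rmse(w, d, sigma):
    """Approximate RMSE of a length-w sliding mean under linear
    drift of rate d and per-sample noise scale sigma."""
    bias = d * w / 2.0            # stale samples drag the mean behind
    var = sigma ** 2 / w          # averaging shrinks the noise
    return (bias ** 2 + var) ** 0.5

def best_window(d, sigma, w_max=10_000):
    """Window length minimizing the approximate RMSE."""
    return min(range(1, w_max + 1), key=lambda w: window_rmse(w, d, sigma))
```

Faster drift calls for a shorter window (the minimizer scales like $(2\sigma^2/d^2)^{1/3}$), which is why mixtures of slow- and fast-changing arms make a single global window choice problematic.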
6. Synthesis and Comparative Algorithmic Taxonomy
Multiple instantiations of the aging bandit with adaptive reset paradigm exist, differentiated by the interaction of detection, resetting, and monitoring mechanisms. Representative designs are summarized below:
| Algorithm | Detection Mechanism | Reset Policy | Monitoring Strategy |
|---|---|---|---|
| ABAR (Zhuang et al., 13 Jan 2026) | ADWIN-based window | Full statistics/horizon reset | Periodic block-based pulls |
| ADR-bandit (Komiyama et al., 2021) | ADWIN (per-arm) | Full base-bandit + window reset | Block-doubling, minimal forced |
| UCB-sliding window (Kim et al., 2024) | Windowed UCB scores | Arm elimination (drop forever) | Exponentially spaced windows |
| SER (Allesiardo et al., 2016) | Block length/random | Full SE reset, random timing | Random shuffle within block |
These algorithms share a common core: adaptive resetting, combined with controlled exploitation and a minimum level of forced exploration, is essential for tracking nonstationary, aging, or rotting reward processes. Their high-probability regret guarantees, empirical resilience, and broad adaptability make them foundational in modern research on robust online learning in evolving environments.