Aging Bandit with Adaptive Reset
- Aging bandits with adaptive reset are non-stationary multi-armed bandit strategies that dynamically balance exploration and exploitation by detecting drift and resetting learned statistics.
- They utilize adaptive windowing, change detection, and periodic monitoring to maintain near-optimal performance in environments with continuous decay or abrupt shifts.
- Empirical and theoretical studies demonstrate sublinear regret bounds and effective performance in applications such as edge network AoI minimization and infinite rotting bandit scenarios.
An aging bandit with adaptive reset is a class of non-stationary multi-armed bandit (MAB) algorithms designed to accommodate environments where expected rewards of arms evolve gradually (aging/rot), abruptly (change-points), or both. The unifying characteristic of this regime is the use of adaptive detection and resetting mechanisms to maintain near-optimal decision making as distributional properties drift or shift over time. These techniques have emerged as essential tools for systems ranging from edge networks with freshness constraints (Zhuang et al., 13 Jan 2026), to infinitely-armed rotting bandit problems (Kim et al., 2024), and general nonstationary MABs (Komiyama et al., 2021). The following sections review key models, algorithmic frameworks, theoretical guarantees, and empirical findings in the study of aging bandits with adaptive reset.
1. Problem Formulation and Nonstationary Bandit Models
The aging bandit problem generalizes the classical MAB by allowing reward distributions to change over time, either continuously (aging, “rested rotting”), abruptly (piecewise-constant), or in a mixed regime. In canonical models:
- Rested Rotting Bandits: Each arm possesses a mean reward that decreases (rots) upon being pulled, with the decay possibly adversarially controlled but globally bounded by a total-decay or switch-count budget (Kim et al., 2024). With infinitely many arms, regret is measured against an oracle that always holds a fresh, near-optimal arm.
- Abrupt and Gradual Nonstationarity: Mean rewards of arms can undergo sudden jumps (change-points) or bounded per-round drift (Komiyama et al., 2021, Zhuang et al., 13 Jan 2026).
- Edge Network AoI Minimization: Clients decide which access nodes (ANs) to request fresh content from, with AoI (age of information) reduction as the reward (Zhuang et al., 13 Jan 2026). The objective is to minimize time-averaged AoI, reformulated as maximizing cumulative age reduction over the decision horizon.
Nonstationarity is often partially observable, and may involve history- and coupling-dependencies between arms, as in collaborative edge scenarios (Zhuang et al., 13 Jan 2026).
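The AoI-reduction objective above can be made concrete with a minimal simulation. This is a hedged sketch under simplifying assumptions (unit age growth per slot, age resetting to zero on a successful fresh-content request); the cited edge-network model is richer, and `hit` is an illustrative placeholder for the outcome of an AN request:

```python
def simulate_aoi(horizon, hit):
    """Track one client's age of information (AoI) over `horizon`
    slots. `hit(t)` returns True when the round-t request reaches an
    access node with fresh content (age resets to zero -- a
    simplifying assumption; the cited model is richer).
    Returns (time-averaged AoI, cumulative age reduction)."""
    age = 0
    total_age = 0
    total_reduction = 0
    for t in range(horizon):
        age += 1                    # information ages by one slot
        if hit(t):
            total_reduction += age  # reward: the age reduction earned
            age = 0                 # fresh content resets the age
        total_age += age
    return total_age / horizon, total_reduction
```

With a request that succeeds every fourth slot, `simulate_aoi(100, lambda t: t % 4 == 3)` gives an average AoI of 1.5 and cumulative age reduction of 100; policies earning more cumulative reduction achieve lower time-averaged AoI, which is the reformulation used above.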
2. Adaptive Resetting: Algorithmic Designs
The central architectural feature in this class is adaptive resetting, which combines:
- Sliding/Adaptive Windows: Maintain reward statistics over a window whose span adapts to data-driven nonstationarity, either via explicit window-length optimizations (Kim et al., 2024) or change-point/evidence-based resizing (Komiyama et al., 2021, Zhuang et al., 13 Jan 2026).
- Change Detection: Employ detection schemes such as ADWIN (Adaptive Windowing) (Komiyama et al., 2021, Zhuang et al., 13 Jan 2026) to trigger resets. ADWIN maintains a variable-length window $W$ over recent samples and signals a change as soon as some bipartition $W = W_0 \cup W_1$ has empirical means differing by more than a cut threshold:
$$\left|\hat\mu_{W_0} - \hat\mu_{W_1}\right| \ge \epsilon_{\mathrm{cut}} = \sqrt{\frac{1}{2m}\ln\frac{4|W|}{\delta}}, \qquad m = \frac{1}{1/|W_0| + 1/|W_1|},$$
for a chosen confidence parameter $\delta \in (0,1)$.
- Periodic Monitoring: To address nonstationarity in rarely-pulled arms, algorithms enforce periodic exploration of each arm within “blocks” to ensure changes are detectable (Zhuang et al., 13 Jan 2026, Komiyama et al., 2021).
- Reset and Re-initialization: Upon detection, empirical means, counters, and relevant algorithmic state are purged and the procedure restarts, often with an adjusted time horizon.
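An ADWIN-style cut rule can be sketched as follows. This is a hedged simplification: production ADWIN maintains compressed exponential buckets for efficiency, while this version naively checks every prefix/suffix bipartition of the raw window:

```python
import math

def adwin_change(window, delta=0.05):
    """Flag a change if some bipartition of `window` into a prefix
    W0 and suffix W1 has a mean gap exceeding a Hoeffding-style
    cut threshold (after Bifet & Gavalda's ADWIN rule)."""
    n = len(window)
    total = sum(window)
    head_sum = 0.0
    for i in range(1, n):                  # every bipartition (prefix | suffix)
        head_sum += window[i - 1]
        n0, n1 = i, n - i
        m = 1.0 / (1.0 / n0 + 1.0 / n1)    # harmonic mean of the two sizes
        eps_cut = math.sqrt(1.0 / (2.0 * m) * math.log(4.0 * n / delta))
        if abs(head_sum / n0 - (total - head_sum) / n1) > eps_cut:
            return True
    return False

stable = [0.5] * 60                        # no bipartition exceeds the cut
shifted = [0.1] * 30 + [0.9] * 30          # a 0.8 mean jump trips the cut
```

On detection, the algorithms above would shrink the window and purge statistics; here the caller is left to decide how to reset.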
A representative algorithm is ABAR (Aging Bandit with Adaptive Reset) (Zhuang et al., 13 Jan 2026), structured as follows:
- Time is divided into exponentially growing blocks, each interleaving monitoring pulls with UCB-driven exploitation.
- Arm statistics are updated online; resets are triggered upon drift detection by ADWIN.
- Monitoring frequency and window sizes are tuned adaptively according to theoretical nonstationarity parameters.
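A minimal sketch of such a block-structured loop, assuming a pluggable change detector and time-indexed arm reward functions (both illustrative, not the exact ABAR specification):

```python
import math

def block_bandit(arms, n_blocks, detector):
    """Block-structured bandit loop: forced monitoring pulls, then
    UCB-driven exploitation over an exponentially growing block,
    with a full reset whenever `detector` flags drift in any arm's
    reward history. Returns the number of resets performed."""
    counts = [0] * len(arms)
    sums = [0.0] * len(arms)
    history = [[] for _ in arms]
    t, resets = 0, 0

    def pull(a):
        nonlocal t
        r = arms[a](t)                         # reward may depend on time
        counts[a] += 1
        sums[a] += r
        history[a].append(r)
        t += 1

    for b in range(n_blocks):
        for a in range(len(arms)):             # forced monitoring pulls
            pull(a)
        for _ in range(2 ** b):                # exponentially growing block
            n = sum(counts)
            a = max(range(len(arms)),
                    key=lambda i: sums[i] / counts[i]
                    + math.sqrt(2 * math.log(n) / counts[i]))
            pull(a)                            # UCB-driven exploitation
        if any(detector(h) for h in history):  # drift detected: full reset
            counts = [0] * len(arms)
            sums = [0.0] * len(arms)
            history = [[] for _ in arms]
            resets += 1
            for a in range(len(arms)):         # re-initialize statistics
                pull(a)
    return resets
```

In practice the detector would be the ADWIN rule and the monitoring frequency would scale with the assumed drift parameters; here both are left pluggable.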
For infinitely-armed settings, an adaptive sliding-window UCB discards arms as soon as their sliding-window UCB drops below a threshold, with the window span selected to optimally trade off bias (due to rotting) and variance (Kim et al., 2024).
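The discard rule for the infinitely-armed setting can be sketched as below. The window length `w`, confidence `delta`, and fixed `threshold` are illustrative placeholders; Kim et al. instead tune the window to the rotting budget to balance bias and variance:

```python
import math

def sw_ucb(rewards, w, delta=0.01):
    """Sliding-window UCB index computed from an arm's last `w` rewards."""
    recent = rewards[-w:]
    n = len(recent)
    return sum(recent) / n + math.sqrt(math.log(1.0 / delta) / (2 * n))

def play_until_eliminated(arm, w, threshold, max_pulls=10_000):
    """Pull a single arm until its sliding-window UCB drops below
    `threshold`, at which point the arm is dropped forever (the
    infinitely-armed policy would then sample a fresh arm).
    Returns the number of pulls spent on this arm."""
    rewards = []
    for t in range(max_pulls):
        rewards.append(arm(t))
        if sw_ucb(rewards, w) < threshold:
            return t + 1
    return max_pulls
```

A stationary arm above the threshold is never discarded, while a rotting arm is abandoned soon after its windowed mean decays, which is the intended bias/variance trade: a short window tracks the rot quickly at the cost of a noisier estimate.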
3. Theoretical Regret Guarantees
These algorithms aim for sublinear regret with respect to relevant nonstationarity metrics. Universal patterns in guarantees include:
- Gradual Drift (Aging): For drift with total-variation budget $V_T$, ABAR and ADR-bandit both achieve regret of order
$$\tilde O\!\left(V_T^{1/3}\, T^{2/3}\right),$$
which is sublinear whenever $V_T = o(T)$ (Zhuang et al., 13 Jan 2026, Komiyama et al., 2021). This rate matches the minimax lower bound $\Omega(V_T^{1/3} T^{2/3})$ up to logarithmic factors (Komiyama et al., 2021).
- Abrupt Changes: With $N_T$ detectable global change-points among $K$ arms, regret scales as
$$\tilde O\!\left(\sqrt{N_T\, K\, T}\right)$$
(Komiyama et al., 2021, Allesiardo et al., 2016). Detection delays are controlled by block scheduling and statistical monitoring.
- Rested Rotting (Infinite Arms): For slow rotting (bounded cumulative-decay budget) and abrupt rotting (bounded number of decay events), minimax regret rates polynomial in the horizon have been established, with exponents governed by the rotting budget and by the density of near-optimal initial mean rewards (Kim et al., 2024).
- Stationary Sub-case: Both ABAR and ADR-bandit recover the classical logarithmic regret of the base bandit in stationary environments (no drift and no change-points) (Zhuang et al., 13 Jan 2026, Komiyama et al., 2021).
Critical to these results is the adaptivity of the reset/monitoring schedule and the drift-tolerant character of the underlying bandit subroutine (e.g., UCB, Thompson Sampling, KL-UCB), all of which have been analyzed to satisfy concentration and regret conditions under nonstationary regimes (Komiyama et al., 2021).
4. Empirical Evaluation and Comparative Performance
Extensive simulations and real-world benchmarks demonstrate:
- In decentralized edge networks optimizing AoI, ABAR achieves average AoI within 5–10% of a centralized oracle, outperforming D-UCB, SW-UCB, and multi-agent MAB baselines by large margins in both well-separated and closely-spaced AN scenarios (Zhuang et al., 13 Jan 2026). Its cumulative AoI regret flattens (sublinear growth) while other methods display linear increase.
- ADR-bandit matches theoretical regret rates on synthetic environments with both abrupt and gradual nonstationarity, outperforming fixed-window, SW-UCB, D-UCB, and discounting variants (Komiyama et al., 2021). On real datasets with strongly time-varying rewards, ADR-bandit adapts more rapidly than windowed competitors, requiring only a single confidence parameter.
- In infinitely-armed slow and abrupt rotting, adaptive sliding-window UCB with Bandit-over-Bandits meta-tuning approaches minimax regret, showing substantial practical improvement over pessimistic UCB-TP and other windowed strategies (Kim et al., 2024).
- Random-shuffling and adaptive-reset elimination (SER) gives uniformly improved regret and sample complexity over both vanilla and fixed-reset successive elimination in environments with arbitrary or adversarial switching (Allesiardo et al., 2016), providing a robust template for best-arm identification under nonstationarity.
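A sketch in the spirit of the SER template: successive elimination with a freshly randomized round-robin order each round, so that arm comparisons are decorrelated from where within a round a switch occurs. The adaptive-reset component is omitted for brevity, and the confidence radius is a generic Hoeffding width rather than the paper's exact constants:

```python
import math
import random

def ser_round_robin(arms, n_rounds, delta=0.05, seed=0):
    """Successive elimination with randomized round-robin order:
    each round pulls every surviving arm once in a shuffled order,
    then eliminates arms whose empirical mean falls a full
    confidence width below the current leader."""
    rng = random.Random(seed)
    active = list(range(len(arms)))
    sums = [0.0] * len(arms)
    for r in range(1, n_rounds + 1):
        order = active[:]
        rng.shuffle(order)                 # random shuffle within the round
        for a in order:
            sums[a] += arms[a]()
        radius = math.sqrt(
            math.log(2 * len(arms) * n_rounds / delta) / (2 * r))
        best = max(sums[a] / r for a in active)
        active = [a for a in active        # drop clearly suboptimal arms
                  if sums[a] / r >= best - 2 * radius]
    return active
```

With well-separated arms, all clearly suboptimal arms are eliminated once the confidence width shrinks below half the gap; a reset policy would periodically restore `active`, `sums`, and the round counter to handle switches.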
5. Extensions, Open Questions, and Limitations
Several algorithmic and theoretical extensions remain active research topics:
- Parameter-Free Adaptivity: While meta-algorithms such as Bandit-over-Bandits offer automatic tuning for unknown nonstationarity, a parameter-free solution matching optimal regret without extra cost is open (Kim et al., 2024).
- Locally Non-Stationary/Partial Drift: Most adaptive-reset frameworks presuppose global change affecting all arms similarly. Performance can degrade if only a subset of arms ages or shifts (violating the global assumption), as ADWIN-driven resets may not detect all local changes (Komiyama et al., 2021).
- Structured/Augmented Nonstationarity: Regret bounds and algorithmic effectiveness under feature-dependent decay, adversarial drift in non-rested settings, or continuous arm spaces are not fully characterized (Kim et al., 2024).
- Trade-offs between Change Detection and Exploitation: Monitoring overhead, estimation bias/variance trade-off, and the choice of window/block sizes must be carefully balanced, particularly in environments with mixtures of slow and fast-changing arms.
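The window-size trade-off noted above can be illustrated numerically. Under a linear drift of rate `d`, a length-`w` sliding mean carries bias of roughly `d*w/2` while its standard error shrinks as `sigma/sqrt(w)`; the error-minimizing window balances the two (a stylized calculation, not a bound from the cited papers):

```python
def window_rmse(w, d, sigma):
    """Approximate RMSE of a length-w sliding mean under linear
    drift of rate d and per-sample noise scale sigma."""
    bias = d * w / 2.0            # stale samples drag the mean behind
    var = sigma ** 2 / w          # averaging shrinks the noise
    return (bias ** 2 + var) ** 0.5

def best_window(d, sigma, w_max=10_000):
    """Window length minimizing the approximate RMSE."""
    return min(range(1, w_max + 1), key=lambda w: window_rmse(w, d, sigma))
```

Faster drift calls for a shorter window (the minimizer scales like $(2\sigma^2/d^2)^{1/3}$), which is why mixtures of slow- and fast-changing arms make a single global window choice problematic.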
6. Synthesis and Comparative Algorithmic Taxonomy
Multiple instantiations of the aging bandit with adaptive reset paradigm exist, differentiated by the interaction of detection, resetting, and monitoring mechanisms. Representative designs are summarized below:
| Algorithm | Detection Mechanism | Reset Policy | Monitoring Strategy |
|---|---|---|---|
| ABAR (Zhuang et al., 13 Jan 2026) | ADWIN-based window | Full statistics/horizon reset | Periodic block-based pulls |
| ADR-bandit (Komiyama et al., 2021) | ADWIN (per-arm) | Full base-bandit + window reset | Block-doubling, minimal forced |
| UCB-sliding window (Kim et al., 2024) | Windowed UCB scores | Arm elimination (drop forever) | Exponentially spaced windows |
| SER (Allesiardo et al., 2016) | Block length/random | Full SE reset, random timing | Random shuffle within block |
These algorithms share a common core: adaptive resetting, combined with controlled exploitation and a minimum level of forced exploration, is essential for tracking nonstationary, aging, or rotting reward processes. Their high-probability regret guarantees, empirical resilience, and broad adaptability make them foundational in modern research on robust online learning in evolving environments.