SW-UCB: Adaptive Bandits in Dynamic Environments
- Sliding-Window Upper Confidence Bound (SW-UCB) algorithms are designed for non-stationary multi-armed bandit and reinforcement learning problems, using a finite recent observation window to promptly adapt to shifting reward distributions.
- They modify the classic UCB strategy by computing empirical means and confidence bounds over a limited window, balancing rapid adaptation with estimation accuracy through tunable window sizes.
- Theoretical analyses demonstrate near-optimal regret bounds in diverse settings, and variants like SW-UCB♯ extend the approach to distributed, context-driven, and slowly drifting environments.
Sliding-Window Upper Confidence Bound (SW-UCB) is an algorithmic family designed for multi-armed bandit (MAB) and reinforcement learning problems in non-stationary environments, where statistical properties of the arms (or states/actions) vary over time. SW-UCB approaches differ fundamentally from classical UCB algorithms by using only a finite, recent “window” of observations for estimation and confidence calculation, thereby enabling rapid adaptation to abrupt or drifting changes in the environment. The primary mechanisms, theoretical properties, variants, and applications are outlined below.
1. Algorithm Construction and Principles
The SW-UCB algorithm modifies the classic UCB framework by maintaining empirical statistics exclusively over a sliding window of the $\tau$ most recent rounds (or transitions, in RL), rather than the full observation history. At each time $t$, the agent computes for each arm $i$:
- Sliding-window mean:
$$\bar{X}_t(\tau, i) = \frac{1}{N_t(\tau, i)} \sum_{s=t-\tau+1}^{t} X_s(i)\,\mathbb{1}\{I_s = i\},$$
where $N_t(\tau, i) = \sum_{s=t-\tau+1}^{t} \mathbb{1}\{I_s = i\}$ counts how often arm $i$ appeared within the window.
- Confidence (padding) term:
$$c_t(\tau, i) = B \sqrt{\frac{\xi \log\left(\min(t, \tau)\right)}{N_t(\tau, i)}},$$
with $B$ an upper bound on rewards, $\xi$ a positive constant, and $\min(t, \tau)$ the effective window length.
- Decision rule:
$$I_t = \arg\max_i \left\{ \bar{X}_t(\tau, i) + c_t(\tau, i) \right\}.$$
The critical innovation is the temporal restriction: only the most recent $\tau$ samples (the “window length”) are used, so the algorithm “forgets” older data. This allows the policy to swiftly track changes in the best arm when reward distributions shift abruptly.
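To make the mechanics concrete, below is a minimal Python sketch of SW-UCB under the notation above. The class name, the default $\xi$ value, and the Bernoulli test harness are illustrative assumptions rather than details from (0805.3415); a shared deque gives O(1) per-round updates and O($\tau$) memory.

```python
import math
import random
from collections import deque

class SWUCB:
    """Minimal SW-UCB sketch (after 0805.3415): a shared deque holds the
    last tau (arm, reward) pairs, so each round costs constant time."""

    def __init__(self, n_arms, tau, B=1.0, xi=0.6):
        self.n_arms = n_arms
        self.tau = tau                # window length
        self.B = B                    # upper bound on rewards
        self.xi = xi                  # exploration constant (assumed default)
        self.window = deque()         # at most tau recent (arm, reward) pairs
        self.counts = [0] * n_arms    # N_t(tau, i): pulls of arm i in window
        self.sums = [0.0] * n_arms    # windowed reward sums
        self.t = 0

    def select(self):
        self.t += 1
        for i in range(self.n_arms):  # play any arm unseen in the window
            if self.counts[i] == 0:
                return i
        log_term = math.log(min(self.t, self.tau))
        def index(i):
            mean = self.sums[i] / self.counts[i]
            pad = self.B * math.sqrt(self.xi * log_term / self.counts[i])
            return mean + pad         # sliding-window mean + padding term
        return max(range(self.n_arms), key=index)

    def update(self, arm, reward):
        self.window.append((arm, reward))
        self.counts[arm] += 1
        self.sums[arm] += reward
        if len(self.window) > self.tau:      # evict the oldest observation
            old_arm, old_reward = self.window.popleft()
            self.counts[old_arm] -= 1
            self.sums[old_arm] -= old_reward

# Usage: two Bernoulli arms whose best arm swaps at t = 500.
agent = SWUCB(n_arms=2, tau=200)
for t in range(1000):
    p = [0.9, 0.1] if t < 500 else [0.1, 0.9]
    arm = agent.select()
    agent.update(arm, 1.0 if random.random() < p[arm] else 0.0)
```

Because evicted samples are subtracted from the running sums, the per-round cost is independent of $\tau$; only the deque itself grows with the window.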
2. Theoretical Regret Bounds and Adaptation
The primary metric for algorithmic evaluation in SW-UCB is the expected cumulative regret or the number of suboptimal selections. For the basic SW-UCB, it was shown (0805.3415) that for any fixed window $\tau$ and any suboptimal arm $i$, up to lower-order terms,
$$\mathbb{E}[N_T(i)] \;\leq\; C\,\frac{T \log \tau}{\tau} \;+\; \tau\,\Upsilon_T,$$
where $N_T(i)$ is the count of times arm $i$ is pulled up to time $T$, $\Upsilon_T$ is the number of breakpoints (changes in distribution), and $C$ depends on $B$, $\xi$, and the reward gap.
Key implications:
- A larger $\tau$ decreases stationary-phase regret (first term), but increases adaptation cost across breakpoints (second term).
- Tuning $\tau$ (e.g., $\tau \propto \sqrt{T \log T / \Upsilon_T}$) yields regret of order $O(\sqrt{\Upsilon_T T}\,\log T)$ when breakpoints are sparse, which matches the lower bound up to log factors.
- When changes are more frequent, say $\Upsilon_T = O(T^\beta)$ for $\beta \in (0,1)$, the regret scales as $O(T^{(1+\beta)/2}\sqrt{\log T})$; a numeric sketch of this trade-off follows below.
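As a quick numeric illustration of the tuning trade-off (the horizon and breakpoint counts are made-up values, and only the orders of the two regret terms are computed):

```python
import math

T = 100_000                          # horizon (assumed for illustration)
for breakpoints in (1, 10, 100):     # candidate values of Υ_T
    tau = int(math.sqrt(T * math.log(T) / breakpoints))  # τ ∝ √(T log T / Υ_T)
    stationary = T * math.log(tau) / tau    # order of the first (stationary) term
    adaptation = tau * breakpoints          # order of the second (breakpoint) term
    print(f"Υ_T={breakpoints:4d}  τ={tau:6d}  "
          f"stationary≈{stationary:9.0f}  adaptation≈{adaptation:9.0f}")
```

The tuned window roughly balances the two terms: more breakpoints force a shorter window, trading higher stationary-phase regret for cheaper adaptation.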
Extensions have formalized rigorous guarantees for SW-UCB variants with time-varying window size (“SW-UCB♯”) (Wei et al., 2018) and proved minimax optimality under variation budgets (Wei et al., 2021). For slowly drifting environments, adaptively scaling the window mitigates estimation bias while preserving sublinear regret.
3. SW-UCB Variants and Generalizations
SW-UCB’s principles have been adapted to various settings:
- Time-varying window (SW-UCB♯): The window size automatically increases with time to balance responsiveness and confidence (Wei et al., 2018); see the sketch at the end of this section.
- SW-UCB-g: In regional bandit models with intra-group correlations, SW-UCB-g uses sliding windows to adapt group parameter estimates under slow drift, outperforming SW-UCB in correlated settings (Wang et al., 2018).
- Distributed SW-UCB (RR-SW-UCB♯, SW-DLP): Multi-player versions integrate sliding windows with decentralized scheduling and prioritization for collision-averse coordination (Wei et al., 2018).
- SWUCRL2-CW (sliding-window UCRL2 with confidence widening): In non-stationary MDPs, a sliding window is combined with confidence widening to counter statistical uncertainty in estimated transitions (Cheung et al., 2020).
- Sliding-window MOSS/UCB: Order-optimal regret of order $\tilde{O}(V_T^{1/3} T^{2/3})$ under a variation budget $V_T$ is achieved by choosing the window length appropriately (e.g., $\tau \propto (T/V_T)^{2/3}$); this outperforms resetting and discount approaches in variation-bounded environments (Wei et al., 2021).
Algorithmic variants may exploit window-setting heuristics, incorporate oracle quantities (see below), or apply contextual adaptation in non-stationary settings.
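As one illustration, the time-varying window behind SW-UCB♯ can be sketched as below; the schedule constants (`alpha`, the scale factor 4.0) and the padding form are assumptions for illustration, not the tuned schedule from Wei et al. (2018).

```python
import math
from typing import List, Tuple

def sw_ucb_sharp_index(history: List[Tuple[int, float]], arm: int, t: int,
                       alpha: float = 0.8, B: float = 1.0,
                       xi: float = 0.6) -> float:
    """UCB index for `arm` computed over the last tau_t rounds only.

    tau_t grows like t**alpha, so confidence still improves with time,
    while the window covers a shrinking fraction of the past and stays
    responsive to drift. Constants are illustrative only.
    """
    tau_t = max(1, min(t, int(4.0 * t ** alpha)))    # growing window length
    recent = history[-tau_t:]                        # window-restricted data
    rewards = [r for (a, r) in recent if a == arm]
    if not rewards:
        return float("inf")                          # force exploration
    mean = sum(rewards) / len(rewards)
    pad = B * math.sqrt(xi * math.log(tau_t) / len(rewards))
    return mean + pad
```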
4. Unified Theory and Oracle Quantities
A unified theoretical framework for UCB-type policies (Kikkawa et al., 2024) demonstrates that the decision index can be generalized to any “oracle quantity” (e.g., mean reward, expected maximum, probability of improvement). The UCB policy pulls the arm with the largest upper confidence bound on the chosen oracle quantity, provided:
- The confidence interval shrinks appropriately: its width tends to zero at a sufficient rate as the number of pulls of the arm grows.
- Failure rates (i.e., the frequency of non-optimal pulls) decay quickly enough that suboptimal selections remain asymptotically negligible.
- For sliding-window implementations, the confidence term and empirical statistics must be computed over window-restricted data, retaining the shrinking-interval property and order-optimality conditions.
This approach allows systematic design of new UCB algorithms for arbitrary targets, including maximal and improvement-based bandit objectives. For SW-UCB, this suggests constructing windowed oracle quantities and window-based confidence radii (Kikkawa et al., 2024).
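A minimal sketch of that recipe follows: a window-restricted UCB index over a pluggable oracle quantity. The probability-of-improvement oracle, its 0.8 threshold, and the reuse of the SW-UCB padding radius are illustrative assumptions; in the unified theory each oracle quantity requires its own valid confidence radius (Kikkawa et al., 2024).

```python
import math
from collections import deque
from typing import Callable, Deque, List

def windowed_oracle_ucb(windows: List[Deque[float]], t: int,
                        oracle: Callable[[List[float]], float],
                        B: float = 1.0, xi: float = 0.6) -> int:
    """Pull the arm with the largest UCB on an arbitrary oracle quantity.

    windows[i] holds the rewards of arm i inside the current window;
    `oracle` maps those samples to the target quantity (mean, estimated
    max, probability of improvement, ...). Reusing the SW-UCB padding
    radius is an assumption, not the paper's construction.
    """
    def index(i: int) -> float:
        samples = list(windows[i])
        if not samples:
            return float("inf")   # arm unexplored within the window
        pad = B * math.sqrt(xi * math.log(max(t, 2)) / len(samples))
        return oracle(samples) + pad
    return max(range(len(windows)), key=index)

# Example oracle: probability of improving on an (assumed) target of 0.8.
prob_improve = lambda xs: sum(x > 0.8 for x in xs) / len(xs)

wins = [deque([0.7, 0.9]), deque([0.5]), deque()]   # toy windowed samples
best = windowed_oracle_ucb(wins, t=10, oracle=prob_improve)
print(best)  # arm 2 has no samples in the window, so it is selected
```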
5. Comparison with Related Algorithms
SW-UCB differs from other bandit algorithms as follows:
- Classic UCB (UCB-1): Uses all historical data; optimal for stationary rewards but slow to react to changes.
- Discounted UCB (D-UCB): Applies exponential decay to older samples; adapts gradually and is preferred when reward changes are smooth (0805.3415). The sketch after this list contrasts its forgetting mechanism with SW-UCB's hard cutoff.
- Periodic resetting: Restarts estimates at fixed intervals; adaptation is abrupt and may produce regret spikes around resets (Wei et al., 2021).
- EXP3 / EXP3.S: Designed for adversarial/non-stationary settings but not specialized for abrupt-change adaptation (0805.3415).
- Meta-UCB: Combines multiple algorithmic policies via UCB-indexing at the policy level; does not discard data but eliminates entire algorithms if their cumulative regret deviates (Cutkosky et al., 2020).
SW-UCB and its variants are favored for real-time adaptation where reward distributions undergo frequent and abrupt shifts.
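To contrast the two forgetting mechanisms compared above, here is a minimal sketch of a discounted mean (D-UCB style) next to a windowed mean (SW-UCB style); the discount factor and the toy reward trace are illustrative assumptions.

```python
def discounted_stats(rewards, pulls, gamma=0.99):
    """D-UCB style: all past samples kept, geometrically down-weighted."""
    num = den = 0.0
    for pulled, r in zip(pulls, rewards):    # chronological order
        num = gamma * num + (r if pulled else 0.0)
        den = gamma * den + (1.0 if pulled else 0.0)
    return num / den if den > 0 else float("nan")

def windowed_stats(rewards, pulls, tau=100):
    """SW-UCB style: samples older than tau rounds discarded outright."""
    recent_r, recent_p = rewards[-tau:], pulls[-tau:]
    n = sum(recent_p)
    s = sum(r for pulled, r in zip(recent_p, recent_r) if pulled)
    return s / n if n > 0 else float("nan")

# Toy trace: one arm pulled every round; its mean drops at round 500.
rs = [0.9] * 500 + [0.1] * 100
ps = [True] * 600
print(discounted_stats(rs, ps))          # still pulled toward the old 0.9
print(windowed_stats(rs, ps, tau=100))   # exactly 0.1: old data is gone
```

After a breakpoint, the discounted estimate approaches the new mean only geometrically in the discount factor, while the windowed estimate discards stale data exactly after $\tau$ rounds.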
6. Practical Applications and Implementation Issues
SW-UCB algorithms offer responsiveness and adaptability in non-stationary environments:
- Adaptive network/channel selection: For wireless or IoT systems with time-varying interference (Bonnefoi et al., 2019), SW-UCB enables rapid reacquisition of optimal transmission choices.
- Online recommender systems: Where user preferences drift, only recent interactions are relevant.
- Robotics/multi-agent coordination: SW-UCB ensures robots can “forget” positions of resources that no longer yield high reward (Wei et al., 2018).
- Reinforcement learning in dynamic MDPs: SWUCRL2-CW and variants combine windowing with optimism to handle changing state transitions and rewards (Cheung et al., 2020).
- Material exploration and max bandit problems: By tracking oracle quantities over windows, high-reward materials or compounds can be discovered efficiently, as supported by the unified UCB theory and PIUCB algorithms (Kikkawa et al., 2024).
Implementation requires tracking and storing only the most recent $\tau$ rounds of history for each arm (or state-action pair), usually a modest computational overhead. Window size tuning is crucial: smaller windows yield fast reaction but high variance; larger windows reduce estimation error but slow adaptation, as the sketch below illustrates.
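A back-of-the-envelope sketch of that trade-off (the arm count and window sizes are assumed values): with rewards in $[0, 1]$ and pulls spread evenly, an arm collects roughly $\tau/K$ samples in the window, so estimation error shrinks like $1/\sqrt{\tau/K}$ while the forgetting lag grows linearly in $\tau$.

```python
K = 10                               # number of arms (assumed)
for tau in (100, 1_000, 10_000):
    n = tau / K                      # rough per-arm samples in the window
    std_err = (0.25 / n) ** 0.5      # worst-case (Bernoulli) standard error
    print(f"τ={tau:6d}  per-arm n≈{n:6.0f}  est. error≈{std_err:.3f}  "
          f"forgetting lag = {tau} rounds")
```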
7. Limitations and Future Directions
SW-UCB algorithms are limited by:
- The necessity of estimating or prespecifying the frequency of change in order to tune the window length $\tau$ or the window-scaling parameters.
- Potential underperformance in settings with continuous, slow drift, where discounted or adaptive window approaches may yield tighter regret bounds.
- Increased variance with short windows and the potential for instability.
- Computational costs in distributed or high-dimensional contexts, due to multiple instances or required memory.
A plausible implication is that integrating window-based estimation with adaptive or oracle-driven confidence construction (per the unified UCB theory (Kikkawa et al., 2024)) will generate a new family of robust algorithms tailored for diverse, time-varying optimization objectives.
Summary Table: SW-UCB Variants and Their Core Properties
| Algorithm | Window Mechanism | Regret Bound |
|---|---|---|
| SW-UCB (fixed $\tau$) (0805.3415) | Fixed-size window | $O(\sqrt{\Upsilon_T T}\,\log T)$ for sparse changes |
| SW-UCB♯ (Wei et al., 2018) | Growing window | Sublinear, tuned to the frequency of change |
| SW-MOSS (Wei et al., 2021) | Fixed window | Minimax-optimal under variation budget |
| SW-UCB-g (Wang et al., 2018) | Window per group | Vanishing per-step regret under slow drift |
| RR-SW-UCB♯, SW-DLP (Wei et al., 2018) | Window, distributed | Sublinear regret under collisions |
| SWUCRL2-CW (Cheung et al., 2020) | Window + confidence widening | Sublinear dynamic regret under a variation budget |
SW-UCB algorithms, through their mechanism of temporally localized estimation and padded exploration, furnish a principled and theoretically grounded strategy for adaptive control and learning in dynamic environments. Their near-optimal regret performance, tractable design, and extensibility to diverse bandit and RL formulations make them a foundational component in the theory and practice of adaptive online learning in non-stationary contexts.