Sliding-Window UCRL2-CW for Non-Stationary MDPs
- The paper introduces a sliding-window extension of UCRL2 that adapts to adversarial changes in rewards and transitions, achieving improved regret scaling.
- SW-UCRL employs a time-limited window, empirical estimation, and optimistic planning via extended value iteration to robustly estimate model dynamics in non-stationary environments.
- Choosing the window size optimally yields sharper regret and sample-complexity bounds than traditional episodic restart methods, and empirical evaluations on switching MDPs confirm the advantage.
Sliding-Window UCRL2-CW refers to a class of reinforcement learning algorithms tailored to finite Markov Decision Processes (MDPs) in which both the reward and state-transition dynamics can abruptly and adversarially change at unknown time points. The sliding-window approach, as instantiated in the SW-UCRL algorithm, modifies the optimism-in-the-face-of-uncertainty principle of UCRL2 (Jaksch et al., 2010), equipping it for environments with arbitrary non-stationarity in both rewards and transitions by employing a time-limited window of experience for model estimation. This method provides performance guarantees on regret relative to the (possibly time-varying) optimal policy sequence and achieves superior adaptation and regret scaling compared to the previously standard approach of episodic restarts, referred to below as UCRL2-CW (Gajane et al., 2018).
1. Non-Stationary MDP Setting
The setting involves an agent interacting with a finite MDP where reward functions and transition kernels are subject to instantaneous and unannounced changes, the timing and nature of which are dictated by an adversary. The agent is only informed of its current state and observed reward after each action, and is never told when or how the underlying MDP has changed. The goal is to minimize cumulative regret relative to the sequence of optimal stationary policies for each MDP configuration, an evaluative standard denoted as "switching-MDP" regret:

$$\Delta(T) \;=\; \sum_{t=1}^{T} \bigl(\rho^*(M_t) - r_t\bigr),$$

where $M_t$ is the active MDP at time $t$, $\rho^*(M_t)$ its optimal average reward, and $r_t$ the reward collected by the agent at time $t$.
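As a concrete illustration of this definition, the sketch below computes the switching-MDP regret from a logged trajectory; the arrays `optimal_gain` and `rewards` and the single change point are hypothetical inputs for illustration, not data from the paper.

```python
import numpy as np

def switching_mdp_regret(optimal_gain: np.ndarray, rewards: np.ndarray) -> float:
    """Cumulative switching-MDP regret: sum_t (rho*(M_t) - r_t)."""
    assert optimal_gain.shape == rewards.shape
    return float(np.sum(optimal_gain - rewards))

# Hypothetical example: a single change point at t = 500 over a horizon of 1000 steps.
T = 1000
optimal_gain = np.concatenate([np.full(500, 0.8), np.full(500, 0.6)])  # rho*(M_t) per step
rewards = np.random.default_rng(0).uniform(0.0, 1.0, size=T)           # stand-in reward trajectory
print(switching_mdp_regret(optimal_gain, rewards))
```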
2. Algorithmic Structure of SW-UCRL
SW-UCRL is structured as a sliding-window analogue of UCRL2. At each episode $k$, starting at time $t_k$:
- Data Extraction: For each state–action pair $(s,a)$, counts are extracted based solely on observations from the previous $W$ time steps (or from all available steps if fewer than $W$ have elapsed):
- $N_k(s,a)$: number of times action $a$ was taken in state $s$ during the last $W$ steps.
- $R_k(s,a)$: corresponding cumulative reward.
- $P_k(s,a,s')$: number of observed transitions from $s$ to $s'$ via $a$.
- Estimation: Empirical estimates are formed as $\hat r_k(s,a) = R_k(s,a)/\max\{1, N_k(s,a)\}$ and $\hat p_k(s' \mid s,a) = P_k(s,a,s')/\max\{1, N_k(s,a)\}$.
- Confidence Set Construction: For each $(s,a)$, a set of plausible rewards and transitions is constructed with UCRL2-style confidence intervals around the empirical estimates:
$$|\tilde r(s,a) - \hat r_k(s,a)| \le \sqrt{\frac{7 \log(2 S A t_k / \delta)}{2 \max\{1, N_k(s,a)\}}}, \qquad \bigl\|\tilde p(\cdot \mid s,a) - \hat p_k(\cdot \mid s,a)\bigr\|_1 \le \sqrt{\frac{14\, S \log(2 A t_k / \delta)}{\max\{1, N_k(s,a)\}}}.$$
- Optimistic Planning: Using extended value iteration, the most optimistic plausible MDP (in terms of average reward) is determined, and a near-optimal policy for this model is derived.
- Policy Execution and Episode Termination: The chosen policy is executed until the within-episode count $v_k(s,a)$ of some state–action pair reaches $\max\{1, N_k(s,a)\}$, i.e., matches its count from the sliding window; since these window counts sum to at most $W$, no episode can exceed $W$ steps.
This approach “forgets” older data potentially contaminated by previous environments, providing rapid adaptation after changes in the MDP.
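The following sketch illustrates the data-extraction, estimation, and confidence-set steps above for a tabular MDP; the data layout (a list of `(s, a, r, s_next)` transitions), the helper names, and the use of UCRL2's confidence constants are assumptions made for illustration rather than the paper's exact specification.

```python
import numpy as np

def window_statistics(history, W, S, A):
    """Counts and empirical estimates from the last W entries of `history`,
    a list of (s, a, r, s_next) transitions."""
    window = history[-W:]
    N = np.zeros((S, A))             # visit counts N_k(s, a)
    R = np.zeros((S, A))             # cumulative rewards R_k(s, a)
    P = np.zeros((S, A, S))          # transition counts P_k(s, a, s')
    for s, a, r, s_next in window:
        N[s, a] += 1
        R[s, a] += r
        P[s, a, s_next] += 1
    N_plus = np.maximum(N, 1.0)      # avoid division by zero for unvisited pairs
    r_hat = R / N_plus               # empirical mean rewards
    p_hat = P / N_plus[:, :, None]   # empirical transition probabilities
    return N, r_hat, p_hat

def confidence_radii(N, t, delta, S, A):
    """UCRL2-style confidence radii (constants as in Jaksch et al., 2010),
    evaluated with the sliding-window counts N."""
    N_plus = np.maximum(N, 1.0)
    conf_r = np.sqrt(7.0 * np.log(2 * S * A * t / delta) / (2.0 * N_plus))
    conf_p = np.sqrt(14.0 * S * np.log(2 * A * t / delta) / N_plus)
    return conf_r, conf_p
```

Optimistic planning would then run extended value iteration over all MDPs whose rewards and transition rows stay within `conf_r` and `conf_p` of the empirical estimates, exactly as in UCRL2, and the resulting policy would be executed until the episode-termination criterion above triggers.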
3. Theoretical Regret and Sample Complexity
Let $S$ be the number of states, $A$ the number of actions, $D$ the maximal diameter across all MDPs encountered, $\ell$ the number of change-points, and $T$ the time horizon. For a fixed window size $W$, SW-UCRL satisfies, with probability at least $1-\delta$,
$$\Delta(\text{SW-UCRL}, T) \;=\; \tilde O\!\left(\ell\, W \;+\; \frac{D S \sqrt{A}\, T}{\sqrt{W}}\right),$$
where $\tilde O(\cdot)$ hides constant and logarithmic factors in $T$, $S$, $A$, and $1/\delta$.
Optimal Window Size
Regret can be optimized (if $\ell$ and $T$ are known) by choosing
$$W^* = \Theta\!\left(\left(\frac{D S \sqrt{A}\, T}{\ell}\right)^{2/3}\right),$$
which yields
$$\Delta(\text{SW-UCRL}, T) = \tilde O\!\left(\ell^{1/3}\, T^{2/3}\, D^{2/3}\, S^{2/3}\, A^{1/3}\right).$$
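The choice of $W^*$ follows from balancing the two terms of the fixed-window bound above. Collapsing the constant and logarithmic factors into a single constant $c$ (a simplification for exposition), a short calculation gives:

$$f(W) = \ell W + \frac{c\, D S \sqrt{A}\, T}{\sqrt{W}}, \qquad f'(W) = \ell - \frac{c\, D S \sqrt{A}\, T}{2\, W^{3/2}} = 0 \;\Longrightarrow\; W^* = \left(\frac{c\, D S \sqrt{A}\, T}{2\,\ell}\right)^{2/3},$$

$$f(W^*) = \Theta\!\left(\ell^{1/3} \bigl(D S \sqrt{A}\, T\bigr)^{2/3}\right) = \Theta\!\left(\ell^{1/3}\, T^{2/3}\, D^{2/3}\, S^{2/3}\, A^{1/3}\right).$$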
Sample Complexity (PAC Bound)
The analysis is complemented by a PAC-style guarantee: the number of steps on which the average per-step regret exceeds a threshold $\epsilon$ is bounded by a polynomial in $D$, $S$, $A$, $\ell$, and $1/\epsilon$ (up to logarithmic factors).
A plausible implication is that adaptive sliding-window parameterization can tightly control the tradeoff between responsiveness to change (which favors a small $W$) and low estimation noise (which favors a large $W$).
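To make this tradeoff concrete, the following sketch evaluates the two terms of the fixed-window bound for a range of window sizes; the problem parameters and the constant `c` are arbitrary placeholders chosen only for illustration.

```python
import numpy as np

# Hypothetical problem parameters, chosen only to illustrate the tradeoff.
S, A, D, ell, T = 10, 4, 5, 20, 10**6
c = 1.0  # stands in for the constant and logarithmic factors of the bound

for W in [10**2, 10**3, 10**4, 10**5, 10**6]:
    change_term = ell * W                                   # grows with W: slow adaptation
    noise_term = c * D * S * np.sqrt(A) * T / np.sqrt(W)    # shrinks with W: less estimation noise
    print(f"W = {W:>7d}   change term = {change_term:.2e}   estimation term = {noise_term:.2e}")

# Window size at which the two terms balance (cf. the derivation above).
W_star = (c * D * S * np.sqrt(A) * T / (2 * ell)) ** (2 / 3)
print(f"balancing window size W* ~ {W_star:.0f}")
```

Small windows keep the $\ell W$ term low but inflate the estimation term, while large windows do the opposite; the balance point matches the $W^*$ derived above.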
4. Comparison with UCRL2-CW and Related Methods
UCRL2 with restarts (UCRL2-CW) periodically resets episode counts and statistical estimates, essentially discarding all experience before the most recent restart. While this provides some degree of adaptation, its granularity is coarser than the continuous, stepwise adaptation of SW-UCRL. Regret for UCRL2-CW scales as
$$\tilde O\!\left(\ell^{1/3}\, T^{2/3}\, D\, S\, \sqrt{A}\right),$$
whereas for SW-UCRL with the optimal window size it is
$$\tilde O\!\left(\ell^{1/3}\, T^{2/3}\, D^{2/3}\, S^{2/3}\, A^{1/3}\right),$$
providing improved exponents with respect to $D$, $S$, and $A$. This suggests that SW-UCRL can exploit finer adaptivity to recent data, especially in rapidly changing environments; a rough numerical illustration of the gap follows the table below.
| Algorithm | Regret Scaling | Adaptivity | Data Used |
|---|---|---|---|
| UCRL2 (stationary) | $\tilde O(D S \sqrt{A T})$ | None | All history |
| UCRL2 with restarts | $\tilde O(\ell^{1/3} T^{2/3} D S \sqrt{A})$ | Coarse (restarts) | Post-restart |
| SW-UCRL (sliding window) | $\tilde O(\ell^{1/3} T^{2/3} D^{2/3} S^{2/3} A^{1/3})$ | Finer (windowed) | Past $W$ steps |
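As a rough numerical illustration of the gap between the two non-stationary scalings above (ignoring constants, logarithms, and the common $\ell^{1/3} T^{2/3}$ factor), the ratio of the restart bound to the sliding-window bound is $D^{1/3} S^{1/3} A^{1/6}$; the parameter values below are arbitrary.

```python
# Ratio of the UCRL2-with-restarts scaling D * S * sqrt(A) to the
# SW-UCRL scaling D^(2/3) * S^(2/3) * A^(1/3); the l and T factors cancel.
def improvement_factor(D: float, S: float, A: float) -> float:
    return (D * S) ** (1 / 3) * A ** (1 / 6)

for D, S, A in [(5, 10, 4), (20, 50, 10)]:
    print(f"D={D}, S={S}, A={A}: factor ~ {improvement_factor(D, S, A):.2f}")
```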
5. Empirical Evaluation
Experiments on synthetic switching MDPs with a varying number of switches $\ell$ demonstrate that all compared algorithms register regret "bumps" at each switch, reflecting the unavoidable adaptation period. SW-UCRL consistently achieves lower cumulative regret than UCRL2 with restarts, particularly as the number of switches increases. Selecting the theoretically prescribed optimal window $W^*$ enables SW-UCRL to surpass even UCRL2-CW with restart intervals optimized via tuning, aligning the experimental findings closely with the theoretical predictions.
6. Contributions and Significance
The sliding-window methodology introduced in SW-UCRL extends prior sliding-window bandit techniques to full reinforcement learning in the presence of arbitrary, adversarial non-stationarity in both rewards and transitions, providing nontrivial regret guarantees under these general change assumptions. Its regret and sample-complexity bounds improve upon the prior restart-based approach, yielding sharper exponents in the key parameters $D$, $S$, and $A$. The theoretical analysis is substantiated by empirical results, affirming both the practical effectiveness and robustness of the sliding-window mechanism in dynamic MDP scenarios.
7. Formulas and Key Expressions
- Switching-MDP Regret: $\Delta(T) = \sum_{t=1}^{T} \bigl(\rho^*(M_t) - r_t\bigr)$
- Regret Bound (Optimal $W$): $\tilde O\!\left(\ell^{1/3}\, T^{2/3}\, D^{2/3}\, S^{2/3}\, A^{1/3}\right)$
- Sample Complexity of $\epsilon$-Suboptimal Steps: polynomial in $D$, $S$, $A$, $\ell$, and $1/\epsilon$, up to logarithmic factors
- Optimal Window Size: $W^* = \Theta\!\left(\left(D S \sqrt{A}\, T / \ell\right)^{2/3}\right)$
The sliding-window algorithm for non-stationary MDPs provides a theoretically grounded method for rapid adaptation and regret minimization, marking a significant advancement in model-based reinforcement learning where both rewards and transitions can shift adversarially (Gajane et al., 2018).