Sliding-Window UCRL2-CW for Non-Stationary MDPs
- The paper introduces a sliding-window extension of UCRL2 that adapts to adversarial changes in rewards and transitions, achieving improved regret scaling.
- SW-UCRL employs a time-limited window, empirical estimation, and optimistic planning via extended value iteration to robustly estimate model dynamics in non-stationary environments.
- Choosing the window size optimally yields sharper regret and sample-complexity bounds than traditional episodic restart methods, and empirical evaluations on switching MDPs confirm the advantage.
Sliding-Window UCRL2-CW refers to a class of reinforcement learning algorithms tailored to finite Markov Decision Processes (MDPs) in which both the reward and state-transition dynamics can abruptly and adversarially change at unknown time points. The sliding-window approach, as instantiated in the SW-UCRL algorithm, modifies the optimism-in-the-face-of-uncertainty principle of UCRL2 (Jaksch et al., 2010), equipping it for environments with arbitrary non-stationarity in both rewards and transitions by employing a time-limited window of experience for model estimation. This method provides performance guarantees on regret relative to the (possibly time-varying) optimal policy sequence and achieves superior adaptation and regret scaling compared to the previously standard approach of episodic restarts, referred to below as UCRL2-CW (Gajane et al., 2018).
1. Non-Stationary MDP Setting
The setting involves an agent interacting with a finite MDP where reward functions and transition kernels are subject to instantaneous and unannounced changes, the timing and nature of which are dictated by an adversary. The agent is only informed of its current state and observed reward after each action, and is never told when or how the underlying MDP has changed. The goal is to minimize cumulative regret relative to the sequence of optimal stationary policies for each MDP configuration, an evaluative standard denoted as "switching-MDP" regret:

$$\Delta(T) \;=\; \sum_{t=1}^{T} \bigl(\rho^*(M_t) - r_t\bigr),$$

where $M_t$ is the active MDP at time $t$, $\rho^*(M_t)$ its optimal average reward, and $r_t$ the reward collected by the agent at time $t$.
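As a concrete illustration of this definition, the sketch below computes the switching-MDP regret from a logged trajectory; the arrays `optimal_gain` and `rewards` and the single change point are hypothetical inputs for illustration, not data from the paper.

```python
import numpy as np

def switching_mdp_regret(optimal_gain: np.ndarray, rewards: np.ndarray) -> float:
    """Cumulative switching-MDP regret: sum_t (rho*(M_t) - r_t)."""
    assert optimal_gain.shape == rewards.shape
    return float(np.sum(optimal_gain - rewards))

# Hypothetical example: a single change point at t = 500 over a horizon of 1000 steps.
T = 1000
optimal_gain = np.concatenate([np.full(500, 0.8), np.full(500, 0.6)])  # rho*(M_t) per step
rewards = np.random.default_rng(0).uniform(0.0, 1.0, size=T)           # stand-in reward trajectory
print(switching_mdp_regret(optimal_gain, rewards))
```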
2. Algorithmic Structure of SW-UCRL
SW-UCRL is structured as a sliding-window analogue of UCRL2. At each episode $k$, starting at time $t_k$:
- Data Extraction: For each state–action pair $(s,a)$, counts are extracted based solely on observations from the previous $W$ time steps (or from all available steps if fewer than $W$ have elapsed):
- $N_k(s,a)$: number of times action $a$ was taken in state $s$ during the last $W$ steps.
- $R_k(s,a)$: corresponding cumulative reward.
- $P_k(s,a,s')$: number of observed transitions from $s$ to $s'$ via $a$.
- Estimation: Empirical estimates are formed as $\hat r_k(s,a) = R_k(s,a)/\max\{1, N_k(s,a)\}$ and $\hat p_k(s' \mid s,a) = P_k(s,a,s')/\max\{1, N_k(s,a)\}$.
- Confidence Set Construction: For each $(s,a)$, a set of plausible rewards and transitions is constructed with UCRL2-style confidence intervals around the empirical estimates:
$$|\tilde r(s,a) - \hat r_k(s,a)| \le \sqrt{\frac{7 \log(2 S A t_k / \delta)}{2 \max\{1, N_k(s,a)\}}}, \qquad \bigl\|\tilde p(\cdot \mid s,a) - \hat p_k(\cdot \mid s,a)\bigr\|_1 \le \sqrt{\frac{14\, S \log(2 A t_k / \delta)}{\max\{1, N_k(s,a)\}}}.$$
- Optimistic Planning: Using extended value iteration, the most optimistic plausible MDP (in terms of average reward) is determined, and a near-optimal policy for this model is derived.
- Policy Execution and Episode Termination: The chosen policy is executed until the within-episode count $v_k(s,a)$ of some state–action pair reaches $\max\{1, N_k(s,a)\}$, i.e., matches its count from the sliding window; since these window counts sum to at most $W$, no episode can exceed $W$ steps.
This approach “forgets” older data potentially contaminated by previous environments, providing rapid adaptation after changes in the MDP.
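The following sketch illustrates the data-extraction, estimation, and confidence-set steps above for a tabular MDP; the data layout (a list of `(s, a, r, s_next)` transitions), the helper names, and the use of UCRL2's confidence constants are assumptions made for illustration rather than the paper's exact specification.

```python
import numpy as np

def window_statistics(history, W, S, A):
    """Counts and empirical estimates from the last W entries of `history`,
    a list of (s, a, r, s_next) transitions."""
    window = history[-W:]
    N = np.zeros((S, A))             # visit counts N_k(s, a)
    R = np.zeros((S, A))             # cumulative rewards R_k(s, a)
    P = np.zeros((S, A, S))          # transition counts P_k(s, a, s')
    for s, a, r, s_next in window:
        N[s, a] += 1
        R[s, a] += r
        P[s, a, s_next] += 1
    N_plus = np.maximum(N, 1.0)      # avoid division by zero for unvisited pairs
    r_hat = R / N_plus               # empirical mean rewards
    p_hat = P / N_plus[:, :, None]   # empirical transition probabilities
    return N, r_hat, p_hat

def confidence_radii(N, t, delta, S, A):
    """UCRL2-style confidence radii (constants as in Jaksch et al., 2010),
    evaluated with the sliding-window counts N."""
    N_plus = np.maximum(N, 1.0)
    conf_r = np.sqrt(7.0 * np.log(2 * S * A * t / delta) / (2.0 * N_plus))
    conf_p = np.sqrt(14.0 * S * np.log(2 * A * t / delta) / N_plus)
    return conf_r, conf_p
```

Optimistic planning would then run extended value iteration over all MDPs whose rewards and transition rows stay within `conf_r` and `conf_p` of the empirical estimates, exactly as in UCRL2, and the resulting policy would be executed until the episode-termination criterion above triggers.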
3. Theoretical Regret and Sample Complexity
Let $S$ be the number of states, $A$ the number of actions, $D$ the maximal diameter across all MDPs encountered, $\ell$ the number of change-points, and $T$ the time horizon. For a fixed window size $W$, SW-UCRL satisfies, with probability at least $1-\delta$,
$$\Delta(\text{SW-UCRL}, T) \;=\; \tilde O\!\left(\ell\, W \;+\; \frac{D S \sqrt{A}\, T}{\sqrt{W}}\right),$$
where $\tilde O(\cdot)$ hides constant and logarithmic factors in $T$, $S$, $A$, and $1/\delta$.
Optimal Window Size
Regret can be optimized (if $\ell$ and $T$ are known) by choosing
$$W^* = \Theta\!\left(\left(\frac{D S \sqrt{A}\, T}{\ell}\right)^{2/3}\right),$$
which yields
$$\Delta(\text{SW-UCRL}, T) = \tilde O\!\left(\ell^{1/3}\, T^{2/3}\, D^{2/3}\, S^{2/3}\, A^{1/3}\right).$$
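The choice of $W^*$ follows from balancing the two terms of the fixed-window bound above. Collapsing the constant and logarithmic factors into a single constant $c$ (a simplification for exposition), a short calculation gives:

$$f(W) = \ell W + \frac{c\, D S \sqrt{A}\, T}{\sqrt{W}}, \qquad f'(W) = \ell - \frac{c\, D S \sqrt{A}\, T}{2\, W^{3/2}} = 0 \;\Longrightarrow\; W^* = \left(\frac{c\, D S \sqrt{A}\, T}{2\,\ell}\right)^{2/3},$$

$$f(W^*) = \Theta\!\left(\ell^{1/3} \bigl(D S \sqrt{A}\, T\bigr)^{2/3}\right) = \Theta\!\left(\ell^{1/3}\, T^{2/3}\, D^{2/3}\, S^{2/3}\, A^{1/3}\right).$$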
Sample Complexity (PAC Bound)
The analysis is complemented by a PAC-style guarantee: the number of steps on which the average per-step regret exceeds a threshold $\epsilon$ is bounded by a polynomial in $D$, $S$, $A$, $\ell$, and $1/\epsilon$ (up to logarithmic factors).
A plausible implication is that adaptive sliding-window parameterization can tightly control the tradeoff between responsiveness to change (which favors a small $W$) and low estimation noise (which favors a large $W$).
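To make this tradeoff concrete, the following sketch evaluates the two terms of the fixed-window bound for a range of window sizes; the problem parameters and the constant `c` are arbitrary placeholders chosen only for illustration.

```python
import numpy as np

# Hypothetical problem parameters, chosen only to illustrate the tradeoff.
S, A, D, ell, T = 10, 4, 5, 20, 10**6
c = 1.0  # stands in for the constant and logarithmic factors of the bound

for W in [10**2, 10**3, 10**4, 10**5, 10**6]:
    change_term = ell * W                                   # grows with W: slow adaptation
    noise_term = c * D * S * np.sqrt(A) * T / np.sqrt(W)    # shrinks with W: less estimation noise
    print(f"W = {W:>7d}   change term = {change_term:.2e}   estimation term = {noise_term:.2e}")

# Window size at which the two terms balance (cf. the derivation above).
W_star = (c * D * S * np.sqrt(A) * T / (2 * ell)) ** (2 / 3)
print(f"balancing window size W* ~ {W_star:.0f}")
```

Small windows keep the $\ell W$ term low but inflate the estimation term, while large windows do the opposite; the balance point matches the $W^*$ derived above.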
4. Comparison with UCRL2-CW and Related Methods
UCRL2 with restarts (UCRL2-CW) periodically resets episode counts and statistical estimates, essentially discarding all experience before the most recent restart. While this provides some degree of adaptation, its granularity is coarser than the continuous, stepwise adaptation of SW-UCRL. Regret for UCRL2-CW scales as
$$\tilde O\!\left(\ell^{1/3}\, T^{2/3}\, D\, S\, \sqrt{A}\right),$$
whereas for SW-UCRL with the optimal window size it is
$$\tilde O\!\left(\ell^{1/3}\, T^{2/3}\, D^{2/3}\, S^{2/3}\, A^{1/3}\right),$$
providing improved exponents with respect to $D$, $S$, and $A$. This suggests that SW-UCRL can exploit finer adaptivity to recent data, especially in rapidly changing environments; a rough numerical illustration of the gap follows the table below.
| Algorithm | Regret Scaling | Adaptivity | Data Used |
|---|---|---|---|
| UCRL2 (stationary) | $\tilde O(D S \sqrt{A T})$ | None | All history |
| UCRL2 with restarts | $\tilde O(\ell^{1/3} T^{2/3} D S \sqrt{A})$ | Coarse (restarts) | Post-restart |
| SW-UCRL (sliding window) | $\tilde O(\ell^{1/3} T^{2/3} D^{2/3} S^{2/3} A^{1/3})$ | Finer (windowed) | Past $W$ steps |
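As a rough numerical illustration of the gap between the two non-stationary scalings above (ignoring constants, logarithms, and the common $\ell^{1/3} T^{2/3}$ factor), the ratio of the restart bound to the sliding-window bound is $D^{1/3} S^{1/3} A^{1/6}$; the parameter values below are arbitrary.

```python
# Ratio of the UCRL2-with-restarts scaling D * S * sqrt(A) to the
# SW-UCRL scaling D^(2/3) * S^(2/3) * A^(1/3); the l and T factors cancel.
def improvement_factor(D: float, S: float, A: float) -> float:
    return (D * S) ** (1 / 3) * A ** (1 / 6)

for D, S, A in [(5, 10, 4), (20, 50, 10)]:
    print(f"D={D}, S={S}, A={A}: factor ~ {improvement_factor(D, S, A):.2f}")
```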
5. Empirical Evaluation
Experiments on synthetic switching MDPs with a varying number of switches $\ell$ demonstrate that all compared algorithms register regret "bumps" at each switch, reflecting the unavoidable adaptation period. SW-UCRL consistently achieves lower cumulative regret than UCRL2 with restarts, particularly as the number of switches increases. Selecting the theoretically prescribed optimal window $W^*$ enables SW-UCRL to surpass even UCRL2-CW with restart intervals optimized via tuning, aligning the experimental findings closely with the theoretical predictions.
6. Contributions and Significance
The sliding-window methodology introduced in SW-UCRL extends prior sliding-window bandit techniques to full reinforcement learning in the presence of arbitrary, adversarial non-stationarity in both rewards and transitions, providing nontrivial regret guarantees under these general change assumptions. Its regret and sample-complexity bounds improve upon the prior restart-based approach, yielding sharper exponents in the key parameters $D$, $S$, and $A$. The theoretical analysis is substantiated by empirical results, affirming both the practical effectiveness and robustness of the sliding-window mechanism in dynamic MDP scenarios.
7. Formulas and Key Expressions
- Switching-MDP Regret: $\Delta(T) = \sum_{t=1}^{T} \bigl(\rho^*(M_t) - r_t\bigr)$
- Regret Bound (Optimal $W$): $\tilde O\!\left(\ell^{1/3}\, T^{2/3}\, D^{2/3}\, S^{2/3}\, A^{1/3}\right)$
- Sample Complexity of $\epsilon$-Suboptimal Steps: polynomial in $D$, $S$, $A$, $\ell$, and $1/\epsilon$, up to logarithmic factors
- Optimal Window Size: $W^* = \Theta\!\left(\left(D S \sqrt{A}\, T / \ell\right)^{2/3}\right)$
The sliding-window algorithm for non-stationary MDPs provides a theoretically grounded method for rapid adaptation and regret minimization, marking a significant advancement in model-based reinforcement learning where both rewards and transitions can shift adversarially (Gajane et al., 2018).