Periodic Non-Stationary Policies
- Periodic non-stationary policies are decision rules that incorporate explicit cyclic patterns to align with temporal rhythms, enabling optimal actions in time-varying environments.
- They extend traditional Markov frameworks by using period-dependent actions, which enhances performance in RL, queueing, and delayed execution scenarios.
- Recent learning algorithms like PASQL and PFNN-based actor-critic methods demonstrate faster convergence and improved cumulative rewards under periodic policy structures.
Periodic non-stationary policies are decision rules in sequential environments that explicitly exploit time-dependent or cyclic structure, making the policy's actions a periodic function of time or of the interaction phase. Unlike stationary policies, which prescribe actions as a function only of the current state (typically assuming the Markov property and time-invariance), periodic non-stationary policies introduce temporal periodicity that matches structural or environmental rhythms. This approach is essential in reinforcement learning (RL), Markov decision processes (MDPs), partially observable Markov decision processes (POMDPs), and queueing or bandit models where system dynamics, observations, rewards, or even the agent-environment interface itself are periodic or non-stationary.
1. Formal Definitions and Illustrative Models
A periodic non-stationary policy $\pi = (\pi_0, \pi_1, \pi_2, \dots)$ satisfies
$$\pi_{t+L} = \pi_t \qquad \text{for all } t \ge 0,$$
for a fixed period $L$. The suprapolicy may be decomposed as a tuple $(\pi_0, \dots, \pi_{L-1})$ with $\pi_t = \pi_\ell$ whenever $t \equiv \ell \pmod{L}$. Such policies arise naturally when modeling time-varying reward and transition functions, agent execution delays, observation cycles, or environmental regimes with intrinsic periodicity.
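A minimal sketch of this decomposition in code, assuming a tabular setting (the `Policy` type alias and all names below are illustrative, not taken from the cited works):

```python
from typing import Callable, Sequence

State = int
Action = int
Policy = Callable[[State], Action]  # one phase's decision rule (illustrative type alias)

class PeriodicPolicy:
    """Dispatch to sub-policy pi_{t mod L}, so pi_{t+L} = pi_t holds by construction."""

    def __init__(self, subpolicies: Sequence[Policy]):
        self.subpolicies = list(subpolicies)  # the tuple (pi_0, ..., pi_{L-1})
        self.period = len(self.subpolicies)   # the period L

    def act(self, t: int, state: State) -> Action:
        # Periodicity: the action depends on time only through the phase t mod L.
        return self.subpolicies[t % self.period](state)

# Example: a period-2 policy alternating between two stationary decision rules.
pi = PeriodicPolicy([lambda s: 0, lambda s: 1])
assert pi.act(0, state=5) == pi.act(2, state=5)  # pi_{t+L} = pi_t
```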
In agent-state-based settings for POMDPs, suppose the Markov property does not hold for agent states (i.e., recursively computed statistics of the observation-action history). Policies of the form $\pi(a \mid z)$, which act on the agent state $z$, can then be profitably made periodic, breaking the constraint of stationarity. This principle generalizes to MDPs and bandit problems where periodic resets or cyclical decision rules offer increased robustness in the presence of non-stationarity or model mismatch (Sinha et al., 2024, Scherrer, 2012, Derman et al., 2021, Chen et al., 17 Nov 2025, Emami et al., 2023, Wei et al., 2021).
2. Periodic Policy Bellman Equations and Dynamic Programming
Periodic policies lead to "periodic Bellman equations," which generalize stationary fixed-point equations to a system of $L$ coupled value functions. For a discounted MDP or a periodic agent-state-based policy, these take the form
$$Q_\ell(z,a) = r(z,a) + \gamma \sum_{z'} P(z' \mid z, a)\, Q_{(\ell+1) \bmod L}\big(z', \pi_{(\ell+1) \bmod L}(z')\big)$$
for $\ell = 0, \dots, L-1$. For MDPs with non-stationary transitions and rewards of period $L$,
$$Q^*_\ell(s,a) = r_\ell(s,a) + \gamma \sum_{s'} P_\ell(s' \mid s, a)\, \max_{a'} Q^*_{(\ell+1) \bmod L}(s', a'),$$
with analogous forms for state-value functions. This yields a block-cyclic system whose fixed point characterizes the optimal periodic policy (Chen et al., 17 Nov 2025, Sinha et al., 2024, Derman et al., 2021, Emami et al., 2023).
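A sketch of solving this block-cyclic optimality system by periodic value iteration, assuming tabular phase-indexed kernels `P[l]` and rewards `r[l]` supplied as NumPy arrays (the array shapes and names are assumptions for illustration):

```python
import numpy as np

def periodic_value_iteration(P, r, gamma, iters=1000):
    """Iterate the L-coupled periodic Bellman optimality equations.

    P[l][s, a, s'] : transition probabilities at phase l (L arrays of shape S x A x S)
    r[l][s, a]     : rewards at phase l (L arrays of shape S x A)
    Returns Q of shape (L, S, A) and the greedy periodic policy of shape (L, S).
    """
    L, S, A = len(P), P[0].shape[0], P[0].shape[1]
    Q = np.zeros((L, S, A))
    for _ in range(iters):
        for l in reversed(range(L)):
            V_next = Q[(l + 1) % L].max(axis=1)   # V_{l+1}(s') = max_a' Q_{l+1}(s', a')
            Q[l] = r[l] + gamma * (P[l] @ V_next)  # Q_l(s,a) = r_l(s,a) + gamma E[V_{l+1}(s')]
    return Q, Q.argmax(axis=2)  # one greedy sub-policy per phase
```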
In queueing systems, periodic routing policies are defined so that each queue or queue class is assigned a deterministic job count per period. In the appropriate scaling regime (a large number of parallel queues), the per-queue input becomes nearly deterministic, and all policies adhering to this periodic regimen are asymptotically equivalent and optimal (Anselmi et al., 2014).
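A toy sketch of such a deterministic periodic routing schedule, where each queue $i$ receives a fixed quota of jobs per period (the interface is illustrative only; the cited analysis concerns the scaling limit, not this particular code):

```python
from itertools import cycle

def periodic_router(quotas):
    """Yield queue indices so that, within each period, queue i receives exactly quotas[i] jobs."""
    # Fixed per-period schedule: queue 0 repeated quotas[0] times, then queue 1, and so on.
    schedule = [q for q, n in enumerate(quotas) for _ in range(n)]
    return cycle(schedule)  # deterministic, with period equal to sum(quotas)

router = periodic_router([2, 1, 1])               # queue 0 gets 2 jobs/period, queues 1 and 2 get 1 each
assignments = [next(router) for _ in range(8)]    # [0, 0, 1, 2, 0, 0, 1, 2]
```

Any interleaving of the per-period quotas (e.g., round-robin within the period) preserves the same periodic job counts; the blocked schedule above simply makes the quota explicit.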
3. Learning Algorithms for Periodic Policies
Numerous RL algorithms have been proposed to learn periodic non-stationary policies across model classes:
- PASQL (Periodic Agent-State Q-Learning) maintains $L$ Q-tables, one per phase, updating only the table for the current phase (see the sketch after this list). It applies standard stochastic approximation theory under irreducibility and periodicity assumptions for the agent-state-driven Markov chain, ensuring convergence to the unique periodic solution (Sinha et al., 2024).
- Non-stationary Q-learning for Delayed Execution leverages a forward model to predict the future state at which actions will be executed, maintaining an $m$-periodic Q-function and policy (where $m$ is the execution delay), effectively resolving delay-induced non-Markovianity with complexity linear in $m$ (Derman et al., 2021).
- Periodic Dynamic Programming and Q-Learning in non-stationary/varying-discount MDPs generalizes tabular value iteration and policy improvement to time-indexed Q-functions, ensuring contraction and convergence across a period (Chen et al., 17 Nov 2025).
- Phasic Actor-Critic with PFNN (Phase-Functioned Neural Networks): for multi-timescale MARL, actor and critic networks are parameterized by phase, inducing periodicity in policy weights and enabling efficient learning under time-coupled dynamics (Emami et al., 2023).
- Periodic Reset and Epoch-Based RL: in non-stationary MABs and MDPs, periodic reset (e.g., R-MOSS) or periodically restarted policy optimization (e.g., PROPO) mitigate the accumulation of error and drift, yielding provably order-optimal regret bounds (Wei et al., 2021, Zhong et al., 2021).
- Augmented State-Space UCB: in periodic MDPs, augmenting the state with the phase converts the non-stationary problem into an equivalent stationary one, enabling the use of tabular or function-approximation RL with regret that scales optimally with the period (Aniket et al., 2022, Aniket et al., 2023).
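The sketch below (referenced from the PASQL item above) illustrates the phase-indexed tabular update shared by several of these methods: one Q-table per phase, with only the current phase's table updated at each step. The environment interface and hyperparameters are assumptions for illustration, not the cited authors' code.

```python
import numpy as np

def periodic_q_learning(env, L, n_states, n_actions, gamma=0.99,
                        alpha=0.1, eps=0.1, episodes=500, horizon=200):
    """Tabular Q-learning with one Q-table per phase l = t mod L (PASQL-style sketch).

    Assumes a gym-like env with reset() -> state and step(a) -> (state, reward, done).
    """
    rng = np.random.default_rng(0)
    Q = np.zeros((L, n_states, n_actions))       # one table per phase
    for _ in range(episodes):
        s, t = env.reset(), 0
        for _ in range(horizon):
            l = t % L                            # current phase
            a = (int(rng.integers(n_actions)) if rng.random() < eps
                 else int(Q[l, s].argmax()))     # epsilon-greedy within this phase's table
            s2, r, done = env.step(a)
            target = r + gamma * (0.0 if done else Q[(l + 1) % L, s2].max())
            Q[l, s, a] += alpha * (target - Q[l, s, a])   # update only phase l's entry
            s, t = s2, t + 1
            if done:
                break
    return Q
```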
4. Rigorous Performance Analysis
Periodic non-stationary policies frequently admit strictly stronger error guarantees and regret bounds compared to stationary policies in both exact and approximate regimes:
- When the underlying process is non-Markovian in observed or agent states, stationary policies are generally sub-optimal; periodic policies can be optimal (Sinha et al., 2024, Derman et al., 2021, Chen et al., 17 Nov 2025).
- For infinite-horizon discounted MDPs under approximate value iteration, executing a periodic policy that cycles through the last $m$ greedy policies improves the worst-case error constant from $\frac{2\gamma}{(1-\gamma)^2}\varepsilon$ (stationary) to $\frac{2\gamma}{(1-\gamma)(1-\gamma^m)}\varepsilon$ (periodic with period $m$); already for $m = 2$ this roughly halves the constant as $\gamma \to 1$, as illustrated numerically after this list (Scherrer, 2012, Lesner et al., 2013).
- In large-scale parallel queueing and resource allocation, periodic policies minimize the stationary mean waiting time and are asymptotically equivalent within this class, attaining optimality under aggregate demand constraints (Anselmi et al., 2014).
- Empirical results in delayed environments, non-stationary bandits, and MARL confirm the theoretical prediction that periodic policies achieve higher cumulative reward, faster convergence, and sublinear regret scaling in the horizon (Derman et al., 2021, Wei et al., 2021, Emami et al., 2023, Aniket et al., 2022, Aniket et al., 2023).
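As a quick numerical check of the approximate value iteration bullet, the following compares the two bound constants stated there for representative $\gamma$ and $m$ (this presumes the bound forms as written above):

```python
def stationary_constant(gamma):
    # Worst-case error constant for the stationary greedy policy under approximate VI.
    return 2 * gamma / (1 - gamma) ** 2

def periodic_constant(gamma, m):
    # Constant when cycling through the last m greedy policies (period m).
    return 2 * gamma / ((1 - gamma) * (1 - gamma ** m))

for gamma in (0.9, 0.99):
    for m in (2, 10):
        ratio = periodic_constant(gamma, m) / stationary_constant(gamma)
        print(f"gamma={gamma}, m={m}: periodic/stationary bound ratio = {ratio:.3f}")
# For gamma=0.99 and m=2 the ratio is about 0.503, i.e., roughly half the stationary constant.
```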
5. Practical Design Considerations and Computational Structures
Successful deployment of periodic non-stationary policies involves selecting or identifying a period that matches system or environmental rhythms:
- Period choice should correspond to known or inferred cyclical environmental features (e.g., observation aliasing, operational constraints, exogenous regime switches) (Chen et al., 17 Nov 2025, Emami et al., 2023).
- Parameterizations: tabular policies and value functions require $L$ times the storage of their stationary counterparts (one table per phase); function approximation can embed the phase index $t \bmod L$ or a continuous phase encoding directly into network architectures (see the sketch after this list) (Chen et al., 17 Nov 2025, Emami et al., 2023, Derman et al., 2021).
- Policy improvement and evaluation: dynamic programming and Q-learning generalize naturally to periodic regimes by updating only the appropriate subpolicy at phase $\ell = t \bmod L$ (Sinha et al., 2024, Chen et al., 17 Nov 2025).
- Exploration: sufficient periodic exploration guarantees are required for convergence in non-stationary RL settings (Chen et al., 17 Nov 2025, Aniket et al., 2022).
- Robustness: in multi-agent and bandit environments, periodic reset policies automatically adapt to non-stationarity at epoch boundaries, reducing the need for adaptive control and parameter tuning (Wei et al., 2021, Zhong et al., 2021).
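Below is a minimal sketch of one way to embed phase information into a function approximator, as mentioned in the parameterization item: a sinusoidal encoding of $t \bmod L$ concatenated to the state features. This is a generic design choice for illustration; PFNN-style approaches instead make the network weights themselves a function of the phase.

```python
import numpy as np

def phase_features(t, L, k=2):
    """Encode the phase t mod L as k sinusoidal pairs, so the input is periodic in t with period L."""
    phase = 2 * np.pi * (t % L) / L
    return np.concatenate([[np.sin(j * phase), np.cos(j * phase)] for j in range(1, k + 1)])

def phase_conditioned_input(state_features, t, L):
    # Concatenate the phase encoding to the state features; any standard MLP can consume the result.
    return np.concatenate([state_features, phase_features(t, L)])

x = phase_conditioned_input(np.array([0.3, -1.2]), t=7, L=4)  # shape (2 + 2*k,) = (6,)
```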
6. Extensions and Open Problems
Current research extends the scope and capabilities of periodic non-stationary policies:
- Unknown Period Recovery: Ensemble methods or spectral techniques can be used to estimate the period on the fly, with regret bounds that scale with the identification delay (a generic spectral sketch follows this list) (Aniket et al., 2023).
- Complex structural non-stationarities: Mixed periodic/non-periodic structure, multi-timescale agent composition, and adversarial non-stationarity are active areas for both theory and scalable RL algorithm development (Emami et al., 2023, Zhong et al., 2021).
- Function approximation and deep RL: Incorporating periodicity via PFNN, concatenated phase vectors, or attention mechanisms remains a strong theme in multi-agent and high-dimensional settings (Emami et al., 2023).
- Analytic ergodicity and occupation measures: Further understanding of the limiting distributions and occupation measures (e.g., in PASQL for POMDPs) is needed to generalize convergence proofs and error bounds (Sinha et al., 2024).
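As an illustration of the spectral route to period identification mentioned above, a generic FFT-based estimator applied to an observed reward trace might look as follows (a sketch under simplifying assumptions, not the cited papers' estimator):

```python
import numpy as np

def estimate_period(signal):
    """Estimate the dominant period of a 1-D signal via the discrete Fourier transform."""
    x = np.asarray(signal, dtype=float)
    x = x - x.mean()                     # remove the DC component
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x))
    k = spectrum[1:].argmax() + 1        # dominant non-zero frequency bin
    return int(round(1.0 / freqs[k]))

t = np.arange(400)
rewards = np.sin(2 * np.pi * t / 12) + 0.1 * np.random.default_rng(0).normal(size=t.size)
print(estimate_period(rewards))          # close to 12 for this synthetic trace
```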
7. Application Domains
Periodic non-stationary policies have demonstrated impact in diverse domains:
- POMDPs with model-free summaries: PASQL and related approaches outperform stationary RL in partially observable regimes (Sinha et al., 2024).
- Delayed action execution in robotics, finance, cloud computing: non-stationary Markov policies with periodic structure efficiently address delays (Derman et al., 2021).
- Multi-agent coordination in energy management, transportation, industrial systems: periodic MARL policies enable coordination over complex time schedules (Emami et al., 2023).
- Queueing and batching in large-scale computing systems: deterministic periodic routing achieves asymptotic optimality in mean waiting time (Anselmi et al., 2014).
- Non-stationary bandits and planning under reward recovery: periodic scheduling approaches achieve minimax-order regret and near-optimality (Wei et al., 2021, Simchi-Levi et al., 2021).
- Reinforcement learning with time-varying or cyclic reward/discount functions: explicit periodic NVMDP approaches provide tractable dynamic programming and policy shaping (Chen et al., 17 Nov 2025).
Periodic non-stationary policies represent a principled and tractable solution class for challenging non-stationary and history-dependent sequential decision problems. Their rigorous theoretical foundations, practical algorithmic instantiations, and proven applicability across RL and allied fields underscore their importance to state-of-the-art performance and analysis.