Non-Markovian Reinforcement Learning
- Non-Markovian reinforcement learning is a framework that relaxes the Markov property by leveraging entire histories to inform policy, reward, and dynamics modeling.
- Algorithmic paradigms such as automata-based encodings and neural state summarization enable effective handling of temporal dependencies and complex reward structures.
- Reduction results demonstrate that occupancy measures of history-dependent and Markovian policies can be equivalent, informing convergence and safety in RL strategies.
Non-Markovian reinforcement learning (RL) extends classical RL by relaxing the Markov property, allowing policies, dynamics, or reward functions to depend on longer or unbounded past histories rather than solely the current state or state-action pair. This generalization leads to richer models and algorithms that can capture temporally extended dependencies, reward structures specified by logical formulas, or environments with intrinsic memory effects. The field encompasses methodologically diverse advances, including symbolic automata-based encodings, neural/semi-supervised history summarization, compositional reward modeling, and robust learning under uncertainty about past-dependent system dynamics. This article surveys formal foundations, theoretical reduction results, algorithmic paradigms, and implications for practical RL in non-Markovian settings.
1. Formal Characterizations of Non-Markovianity
Non-Markovianity in RL refers to situations where dynamics, reward, or policy mappings depend on the full or partially summarized trajectory history. Three core types arise:
- Non-Markovian dynamics: The transition kernel $P(s_{t+1} \mid h_t)$ depends on the history $h_t = (s_0, a_0, \ldots, s_t, a_t)$, not just the current pair $(s_t, a_t)$ (Gupta et al., 2021).
- Non-Markovian rewards: The reward function depends on the full state-action (or label) history, e.g., through finite automata, reward machines, or temporal logic (Gaon et al., 2019, Miao et al., 2023, Umili et al., 16 Aug 2024, Tang et al., 26 Oct 2024).
- Non-Markovian policies: Policies depend on arbitrary histories, including policies over options or policy mixtures (Laroche et al., 2022).
Non-Markovian Reward Decision Processes (NMRDPs), Non-Markovian Decision Processes (nMDPs), and Transition-Markov Decision Processes (TMDPs) formally encode such settings (Gaon et al., 2019, Huang et al., 12 Nov 2024, Dohmen et al., 2021).
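To make the distinction concrete, the snippet below is a minimal illustrative sketch (the states, actions, and temporal pattern are assumptions for illustration, not drawn from the cited papers) of a reward that is a function of the history rather than of the current state–action pair:

```python
from typing import List, Tuple

State, Action = int, int

def non_markovian_reward(history: List[Tuple[State, Action]]) -> float:
    """Hypothetical temporal pattern: +1 on reaching state 2, but only if
    state 0 was visited earlier in the episode ('a then b')."""
    current_state, _ = history[-1]
    visited_a_before = any(s == 0 for s, _ in history[:-1])
    return 1.0 if (current_state == 2 and visited_a_before) else 0.0

# The reward at state 2 depends on the past, so no function R(s, a) reproduces it:
print(non_markovian_reward([(0, 1), (1, 0), (2, 0)]))  # 1.0
print(non_markovian_reward([(1, 1), (2, 0)]))          # 0.0
```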
2. Theoretical Equivalence and Occupancy Reduction
A core result is that, for any history-dependent policy $\pi$ in an MDP $M$, there exists a stationary Markovian policy $\tilde{\pi}$ that induces the same discounted state–action occupancy measure $d^{\pi}$ as $\pi$ (Laroche et al., 2022). Explicitly, the construction takes the conditional (Radon–Nikodym derivative) of the state–action occupancy with respect to the state occupancy:

$$\tilde{\pi}(a \mid s) = \frac{d^{\pi}(s, a)}{\sum_{a'} d^{\pi}(s, a')},$$

so that $d^{\tilde{\pi}} = d^{\pi}$. This result (Theorem 4.1 of Laroche et al., 2022) implies that, at the level of discounted state–action frequencies, the class of history-dependent policies is no richer than the Markovian class. It enables convergence-rate, sample-complexity, and optimality results for policy evaluation and improvement to transfer directly from Markovian to non-Markovian policies.
Notably, this reduction holds for occupancy-based analysis in settings such as:
- Non-stationary or ensemble policies (replay buffer mixtures in deep RL, data from expert mixtures in offline RL, or option-based controllers),
- Hierarchical RL where the active “option” is part of the history-dependent policy,
- Any policy mixture, including those used in experience replay or dataset aggregation.
However, only quantities that are functions of the discounted occupancy measure $d^{\pi}$ (e.g., those estimated by bootstrapped or off-policy methods) directly benefit. Full-trajectory-level statistics or return-conditioned methods may not (Laroche et al., 2022).
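The occupancy collapse can be checked numerically. The sketch below (a toy tabular MDP and Monte Carlo estimates, with all details assumed for illustration) estimates the discounted occupancy of a deliberately history-dependent policy and rebuilds the Markovian policy $\tilde{\pi}(a \mid s) \propto d^{\pi}(s, a)$; the two occupancies agree up to sampling error.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 3, 2, 0.9

# Toy MDP: P[s, a] is an arbitrary row-stochastic distribution over next states.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))

def history_policy(history):
    """A deliberately non-Markovian policy: the action depends on how many
    times state 0 has appeared so far (hypothetical example)."""
    return sum(1 for s in history if s == 0) % n_actions

def discounted_occupancy(policy, episodes=10000, horizon=40):
    """Monte Carlo estimate of the (unnormalized) discounted occupancy d(s, a)."""
    d = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, history = 0, [0]
        for t in range(horizon):
            a = policy(history)
            d[s, a] += gamma ** t
            s = rng.choice(n_states, p=P[s, a])
            history.append(s)
    return d / episodes

d_hist = discounted_occupancy(history_policy)

# Collapse to a stationary Markovian policy: pi_tilde(a|s) = d(s, a) / sum_a' d(s, a').
pi_tilde = d_hist / d_hist.sum(axis=1, keepdims=True)
d_markov = discounted_occupancy(lambda h: rng.choice(n_actions, p=pi_tilde[h[-1]]))

print(np.abs(d_hist - d_markov).max())  # small relative to d, up to Monte Carlo error
```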
3. Algorithmic Paradigms for Non-Markovian RL
3.1 Automata-Based Encodings
When reward dependencies are regular in state-action histories (or propositional traces), the canonical reduction is via product MDP construction:
- Compile reward specifications (possibly LTL or LTL$_f$ finite-trace formulas) into deterministic finite automata (DFA), reward machines (RM), or their probabilistic variants (Gaon et al., 2019, Miao et al., 2023, Umili et al., 16 Aug 2024, Dohmen et al., 2021).
- Augment the environment’s state space with the automaton state; now the reward is Markovian in the augmented space, enabling standard RL algorithms with convergence guarantees.
- Experience classification, priority replay, or modularization schemes further improve sample efficiency and exploration by exploiting automaton structure (Miao et al., 2023, Miao et al., 17 Dec 2024).
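The following is a minimal sketch of the product construction (the reward machine encoding, environment interface, and labelling function are assumptions for illustration, not a specific paper's format): the automaton state is tracked alongside the environment state, so the reward depends only on the augmented state.

```python
class RewardMachine:
    """A DFA over labels that pays reward 1 the first time 'a then b' occurs."""
    def __init__(self):
        self.delta = {("u0", "a"): "u1", ("u1", "b"): "u_acc"}
        self.initial, self.accepting = "u0", "u_acc"

    def step(self, u, label):
        u_next = self.delta.get((u, label), u)
        reward = 1.0 if (u_next == self.accepting and u != self.accepting) else 0.0
        return u_next, reward

class ProductEnv:
    """Product construction: the agent observes (env_state, rm_state), so the
    reward is Markovian in the augmented space and standard RL applies."""
    def __init__(self, env, rm, labelling_fn):
        self.env, self.rm, self.label = env, rm, labelling_fn

    def reset(self):
        self.u = self.rm.initial
        return (self.env.reset(), self.u)

    def step(self, action):
        s_next, done = self.env.step(action)            # assumed env interface
        self.u, reward = self.rm.step(self.u, self.label(s_next))
        return (s_next, self.u), reward, done

# Hypothetical 3-state chain environment and labelling function for illustration.
class ChainEnv:
    def reset(self):
        self.s = 0
        return self.s
    def step(self, action):
        self.s = min(2, max(0, self.s + (1 if action == 1 else -1)))
        return self.s, False

labelling = {0: "a", 1: "none", 2: "b"}.get
env = ProductEnv(ChainEnv(), RewardMachine(), labelling)
obs = env.reset()
obs, r, done = env.step(1)   # returns ((s, u), reward, done); reward fires once 'a' then 'b' is seen
```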
3.2 Neural and Statistical State Compression
In truly non-Markovian (e.g., partially observed) environments, or where automaton constructions are intractable, RL can proceed by:
- Learning recursively computable approximate sufficient statistics (RCASS) of the history (Chandak et al., 2022), e.g., via learned autoencoders.
- Using transformer, RNN, or GRU-based encoders to summarize rolling windows or entire trajectories for use by value/policy networks (Wang, 27 Jul 2025, Qu et al., 2023).
- Model-based approaches: Fit explicitly history-dependent or fractional-order dynamics for environments with memory or long-range correlations (Gupta et al., 2021).
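As a minimal illustration of neural history summarization (dimensions and architecture are illustrative assumptions, not a specific cited model), a recurrent encoder can compress the observation history into a fixed-size statistic that plays the role of the state for the policy head:

```python
import torch
import torch.nn as nn

class RecurrentPolicy(nn.Module):
    """Sketch of a history-summarizing policy: a GRU compresses the observation
    history into a fixed-size hidden state that stands in for the Markov state."""
    def __init__(self, obs_dim, act_dim, hidden_dim=64):
        super().__init__()
        self.encoder = nn.GRU(obs_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, act_dim)

    def forward(self, obs_history):
        # obs_history: (batch, time, obs_dim) -- the trajectory observed so far.
        _, h = self.encoder(obs_history)                  # h: (1, batch, hidden_dim)
        return torch.softmax(self.head(h[-1]), dim=-1)    # action probabilities

policy = RecurrentPolicy(obs_dim=4, act_dim=2)
probs = policy(torch.randn(1, 10, 4))   # action distribution after a 10-step history
```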
3.3 Reward Modeling and Credit Assignment
For delayed or composite non-Markovian rewards, transformer-based reward-modeling architectures with in-sequence (bi-directional) attention or weighted-sum decompositions can recover high-fidelity credit assignment, outperforming step-wise or Markovian models as the length and complexity of the delay increase (Tang et al., 26 Oct 2024).
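As a concrete illustration of this idea, the sketch below is a simplified return-decomposition variant under assumed dimensions and training objective; it is not the architecture of Tang et al. (26 Oct 2024). A bidirectional transformer encoder reads the whole trajectory and predicts per-step rewards whose sum is trained to match the delayed episodic return.

```python
import torch
import torch.nn as nn

class RewardRedistributor(nn.Module):
    """Every time step attends to the full episode, and the per-step reward
    estimates are trained so that their sum matches the delayed return."""
    def __init__(self, feat_dim, d_model=64):
        super().__init__()
        self.embed = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)   # in-sequence attention
        self.reward_head = nn.Linear(d_model, 1)

    def forward(self, traj):                        # traj: (batch, T, feat_dim)
        h = self.encoder(self.embed(traj))          # bi-directional over the episode
        return self.reward_head(h).squeeze(-1)      # per-step reward estimates (batch, T)

model = RewardRedistributor(feat_dim=6)
traj, episodic_return = torch.randn(8, 50, 6), torch.randn(8)
r_hat = model(traj)                                 # dense credit assignment
loss = ((r_hat.sum(dim=-1) - episodic_return) ** 2).mean()
```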
3.4 Robust and Inverse RL for Non-Markovian Spaces
Recent advances establish efficient and robust policy learning in non-Markovian nMDPs, including offline distributional robustness under uncertainty sets, even under low-rank or PSR structure (Huang et al., 12 Nov 2024). For inverse RL, Bayesian and LTL-based approaches address non-Markovian reward recovery, e.g., via reward machine posteriors or temporal logic formulas, using (possibly compositional) supervised or simulated annealing frameworks (Topper et al., 20 Jun 2024, Afzal et al., 2021).
4. Specific Applications and Empirical Domains
- Hierarchical and Option-based Control: Semi-MDPs and hierarchical policies map naturally onto non-Markovian policies and can be collapsed via product-MDP or automaton constructions (Laroche et al., 2022).
- Offline RL from Mixtures: Datasets aggregated from expert pools or mixed behavior policies are inherently non-Markovian; the equivalence results guarantee that standard convergence and safety bounds apply, provided the analysis is conducted at the level of discounted occupancy measures (Laroche et al., 2022).
- Quantum control: For open quantum systems with environment-induced memory, RL is performed over an embedded (Markovianized) joint system-reservoir state, learned via maximum-likelihood embedding (Neema et al., 7 Feb 2024) or combined with explicit HEOM solvers (Jaouadi et al., 2023).
- Networked RL with spatial/temporal memory: Stacking scene graphs (GNNs) with temporal encodings (GRUs/transformers) addresses both non-Markovian traffic and dynamic topologies in routing or communication domains (Wang, 27 Jul 2025).
- Safety and Constraints: Safety signals that depend on the entire trajectory or on sub-sequences are modeled via meta-safety variables and specialized actor-critic/lifting methods with dual-gradient Lagrangian adaptation (Low et al., 5 May 2024).
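The snippet below is a hedged sketch of that dual-gradient pattern for a trajectory-level constraint; the cost definition, threshold, and update rule are illustrative assumptions, not the formulation of Low et al. (5 May 2024).

```python
def trajectory_safety_cost(tau):
    """Hypothetical non-Markovian safety signal: fraction of the episode spent
    in an unsafe region after the first violation (depends on the whole trajectory)."""
    first = next((t for t, step in enumerate(tau) if step.get("unsafe", False)), None)
    if first is None:
        return 0.0
    return sum(step.get("unsafe", False) for step in tau[first:]) / len(tau)

def dual_update(lmbda, trajectories, threshold, lr=0.01):
    """One dual-gradient step: increase lambda when the average trajectory-level
    safety cost exceeds the threshold, otherwise let it decay toward zero."""
    avg_cost = sum(trajectory_safety_cost(tau) for tau in trajectories) / len(trajectories)
    return max(0.0, lmbda + lr * (avg_cost - threshold))

# The actor maximizes  E[return] - lambda * E[cost(tau)]  while lambda tracks
# the constraint via dual_update after each batch of rollouts.
batch = [[{"unsafe": False}, {"unsafe": True}, {"unsafe": True}],
         [{"unsafe": False}, {"unsafe": False}, {"unsafe": False}]]
lmbda = dual_update(lmbda=0.0, trajectories=batch, threshold=0.1)
```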
5. Limitations, Open Problems, and Trade-offs
- Occupancy measure equivalence does not preserve full trajectory distributions—methods relying on full trajectory statistics remain outside the direct reach of the reduction (Laroche et al., 2022).
- Automata and logic-based methods are limited by expressivity (regular languages), symbolic abstraction, and state blowup in highly structured or continuous environments (Gaon et al., 2019, Miao et al., 2023, Umili et al., 16 Aug 2024, Miao et al., 17 Dec 2024).
- Sample efficiency and credit assignment become challenging with long-range rewards, rare-event discovery, or exploration in sparse-reward non-Markovian settings (Chandak et al., 2022, Tang et al., 26 Oct 2024).
- Grounding high-level objectives in raw observations without perfect symbol grounding requires semi-supervised learning or groundability analysis of temporal logic specifications (Umili et al., 16 Aug 2024).
- Robust RL in non-Markovian models demands new concentrability coefficients and dual formulations for tractable policy learning and safety guarantees in the face of structural model and distributional uncertainty (Huang et al., 12 Nov 2024).
6. Implications and Unification Across RL Disciplines
At the foundational level, the equivalence of discounted occupancy measures between Markovian and non-Markovian policies (with an explicit constructive formula) unifies the analysis of mixture, ensemble, hierarchical, and history-dependent controllers. This result substantially broadens the scope of sample-based RL convergence, fairness, and safety analysis, provided the objectives are occupancy-based, and suggests general routes for merging theoretical and practical lines of RL research (Laroche et al., 2022).
Simultaneously, the wealth of representational, algorithmic, and inferential techniques developed across domains, spanning automata- and logic-based RL, robust estimation, transformer-based history modeling, and specification-grounded representation learning, demonstrates that non-Markovianity is not an impediment but a structural avenue for expressivity, improved policy diversity, and robustness across real-world control and decision systems.
Key References:
- [2205.139