
Non-Markovian Reinforcement Learning

Updated 13 November 2025
  • Non-Markovian reinforcement learning is a framework that relaxes the Markov property by leveraging entire histories to inform policy, reward, and dynamics modeling.
  • Algorithmic paradigms such as automata-based encodings and neural state summarization enable effective handling of temporal dependencies and complex reward structures.
  • Reduction results demonstrate that occupancy measures of history-dependent and Markovian policies can be equivalent, informing convergence and safety in RL strategies.

Non-Markovian reinforcement learning (RL) extends classical RL by relaxing the Markov property, allowing policies, dynamics, or reward functions to depend on longer or unbounded past histories rather than solely the current state or state-action pair. This generalization leads to richer models and algorithms that can capture temporally extended dependencies, reward structures specified by logical formulas, or environments with intrinsic memory effects. The field encompasses methodologically diverse advances, including symbolic automata-based encodings, neural/semi-supervised history summarization, compositional reward modeling, and robust learning under uncertainty about past-dependent system dynamics. This article surveys formal foundations, theoretical reduction results, algorithmic paradigms, and implications for practical RL in non-Markovian settings.

1. Formal Characterizations of Non-Markovianity

Non-Markovianity in RL refers to situations where dynamics, reward, or policy mappings depend on the full or partially summarized trajectory history. Three core types arise:

  • Non-Markovian dynamics: the transition kernel $p(s_{t+1}\mid h_t, a_t)$ depends on the history $h_t=(s_0,a_0,\dots,s_t)$, not just on $s_t, a_t$ (Gupta et al., 2021).
  • Non-Markovian rewards: the reward function $R(h_t)$ depends on the full state-action (or label) history, e.g., through finite automata, reward machines, or temporal logic (Gaon et al., 2019, Miao et al., 2023, Umili et al., 2024, Tang et al., 2024).
  • Non-Markovian policies: policies $\pi(a_t\mid h_t)$ depend on arbitrary histories, including policies over options or policy mixtures (Laroche et al., 2022).

Non-Markovian Reward Decision Processes (NMRDPs), Non-Markovian Decision Processes (nMDPs), and Transition-Markov Decision Processes (TMDPs) formally encode such settings (Gaon et al., 2019, Huang et al., 2024, Dohmen et al., 2021).
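A minimal toy example can make the dynamics case concrete. Here the probability of the next observation depends on the entire history (via a Laplace-style count of past 1s, an illustrative choice not taken from any cited paper), so two histories ending in the same observation induce different transition probabilities and the Markov property fails:

```python
# Hedged toy example of non-Markovian dynamics: the next observation depends
# on the whole history (how many 1s have appeared so far), so no transition
# kernel over the current observation alone can reproduce it.

from fractions import Fraction

def p_next_one(history):
    # probability that the next bit is 1 grows with the count of past 1s
    # (Laplace-style smoothing rule, chosen purely for illustration)
    k = sum(history)
    return Fraction(1 + k, 2 + len(history))

# Same current observation (last bit = 1), different histories,
# different transition probabilities -> the Markov property fails.
assert p_next_one([0, 0, 1]) != p_next_one([1, 1, 1])
print(p_next_one([0, 0, 1]), p_next_one([1, 1, 1]))
```

Any faithful model of such a process must condition on a history summary (here, the running count of 1s), which is exactly the augmentation that nMDP formalisms make explicit.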

2. Theoretical Equivalence and Occupancy Reduction

A core result is that, for any history-dependent policy $\pi \in \Pi$ in an MDP $m=(S,A,p_0,p,r,\gamma)$, there exists a stationary Markovian policy $\pi'\in\Pi_m$ that induces the same discounted state–action occupancy measure $\mu^\pi_\gamma$ as $\pi$ (Laroche et al., 2022). Explicitly, the construction uses the Radon–Nikodym derivative:

$$\pi'(a \mid s) = \frac{\mu^\pi_\gamma(s,a)}{\mu^\pi_\gamma(s)}$$

This result (Theorem 4.1 of Laroche et al., 2022) implies that, at the level of discounted state–action frequencies, the class of history-dependent policies is no richer than the Markovian class. It enables direct transfer of convergence-rate, sample-complexity, and optimality results for policy evaluation and improvement from Markovian to non-Markovian policies.
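The occupancy-matching construction can be checked numerically on a tiny MDP. The transition kernel, the "sticky" history-dependent policy (which prefers repeating its previous action), and the truncation horizon below are illustrative choices, not taken from the cited work; the point is only that the ratio $\mu(s,a)/\mu(s)$ reproduces the occupancy measure exactly:

```python
import numpy as np

gamma = 0.9
nS, nA = 2, 2
# transition kernel P[s, a, s'] (arbitrary illustrative numbers)
P = np.array([[[0.7, 0.3], [0.2, 0.8]],
              [[0.4, 0.6], [0.9, 0.1]]])
p0 = np.array([1.0, 0.0])

def pi_hist(s, a_prev):
    # history-dependent policy: strongly prefers repeating the previous
    # action; uniform at the first step (illustrative assumption)
    if a_prev is None:
        return np.array([0.5, 0.5])
    probs = np.full(nA, 0.1)
    probs[a_prev] = 0.9
    return probs

# Exact discounted occupancy of the history-dependent policy, computed on
# the augmented state (s, a_prev); a_prev index 0 means "no previous action".
T = 600                              # horizon where gamma**T is negligible
d = np.zeros((nS, nA + 1))           # distribution over (s, a_prev)
d[:, 0] = p0
mu = np.zeros((nS, nA))              # discounted state-action occupancy
for t in range(T):
    d_next = np.zeros_like(d)
    for s in range(nS):
        for ap in range(nA + 1):
            if d[s, ap] == 0:
                continue
            probs = pi_hist(s, None if ap == 0 else ap - 1)
            for a in range(nA):
                w = d[s, ap] * probs[a]
                mu[s, a] += (gamma ** t) * w
                for s2 in range(nS):
                    d_next[s2, a + 1] += w * P[s, a, s2]
    d = d_next

# Markovian policy from the Radon-Nikodym construction pi'(a|s) = mu(s,a)/mu(s)
pi_markov = mu / mu.sum(axis=1, keepdims=True)

# Discounted occupancy induced by the Markovian policy
d2 = p0.copy()
mu2 = np.zeros((nS, nA))
for t in range(T):
    sa = d2[:, None] * pi_markov     # joint (s, a) distribution at time t
    mu2 += (gamma ** t) * sa
    d2 = np.einsum('sa,sap->p', sa, P)

print(np.allclose(mu, mu2, atol=1e-8))  # True: identical occupancy measures
```

The history-dependent policy is evaluated exactly by augmenting the state with the previous action, so the comparison is between two exact occupancies rather than Monte Carlo estimates.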

Notably, this reduction holds for occupancy-based analysis in settings such as:

  • Non-stationary or ensemble policies (replay buffer mixtures in deep RL, data from expert mixtures in offline RL, or option-based controllers),
  • Hierarchical RL where the active “option” is part of the history-dependent policy,
  • Any policy mixture, including those used in experience replay or dataset aggregation.

However, only statistics that depend on $\mu_\gamma$ (e.g., bootstrapped or off-policy methods) directly benefit. Full-trajectory-level statistics or return-conditioned methods may not (Laroche et al., 2022).

3. Algorithmic Paradigms for Non-Markovian RL

3.1 Automata-Based Encodings

When reward dependencies are regular in state-action histories (or propositional traces), the canonical reduction is via product MDP construction: the non-Markovian reward is compiled into a finite automaton (a reward machine, or a DFA derived from a temporal logic formula), and learning proceeds on the synchronous product of the environment MDP and the automaton. Because the automaton state summarizes exactly the reward-relevant portion of the history, the reward becomes Markovian on the product state space (Gaon et al., 2019, Umili et al., 2024).
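As a hedged sketch of the product construction (the DFA, labeling function, and reward convention below are illustrative assumptions, not from any cited paper), consider the regular property "see label a, then later label b". The DFA state carried alongside the environment state makes the reward a function of the product state alone:

```python
# Hedged sketch: a reward machine as a DFA and the induced product-MDP step.
# DFA for "label 'a' occurs, and is later followed by label 'b'".
DFA = {
    ('u0', 'a'): 'u1', ('u0', 'b'): 'u0',
    ('u1', 'a'): 'u1', ('u1', 'b'): 'u2',
    ('u2', 'a'): 'u2', ('u2', 'b'): 'u2',
}
ACCEPT = {'u2'}

def labeling(s):
    # propositional labels emitted by environment states (assumed mapping)
    return {0: 'a', 1: 'b', 2: '-'}[s]

def product_step(env_state, dfa_state, next_env_state):
    """One synchronous step: the DFA reads the label of the next env state.
    Reward is now Markovian in the product state (env_state, dfa_state)."""
    next_dfa = DFA.get((dfa_state, labeling(next_env_state)), dfa_state)
    reward = 1.0 if next_dfa in ACCEPT and dfa_state not in ACCEPT else 0.0
    return (next_env_state, next_dfa), reward

# The non-Markovian reward "reach a, then b" evaluated on product states:
state, r_total = (0, 'u0'), 0.0
for nxt in [1, 0, 1]:            # env trajectory emitting labels b, a, b
    state, r = product_step(state[0], state[1], nxt)
    r_total += r
print(state, r_total)            # reward fires once, when 'b' follows 'a'
```

Any off-the-shelf Markovian RL algorithm can then be run on the product state space; the cost is the blowup of that space, discussed in Section 5.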

3.2 Neural and Statistical State Compression

In true non-Markovian (e.g., partially observed) environments or where automata are intractable, RL can proceed by:

  • Learning recursively computable approximate sufficient statistics (RCASS) of the history (Chandak et al., 2022), e.g., via learned autoencoders.
  • Using transformer, RNN, or GRU-based encoders to summarize rolling windows or entire trajectories for use by value/policy networks (Wang, 27 Jul 2025, Qu et al., 2023).
  • Model-based approaches: Fit explicitly history-dependent or fractional-order dynamics for environments with memory or long-range correlations (Gupta et al., 2021).
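The recursive-statistic idea in the first bullet can be sketched with a deliberately simple, hand-picked summary function (in practice the update would be a learned encoder such as an autoencoder, RNN/GRU, or transformer over a rolling window; the exponential moving average here is purely illustrative):

```python
# Hedged sketch of a recursively computable history summary (RCASS-style):
# instead of storing the growing history, maintain a fixed-size statistic
# z_t updated online as z_{t+1} = f(z_t, o_{t+1}). Here f is a hand-picked
# exponential moving average; a real system would learn f.

def update_summary(z, obs, alpha=0.2):
    # constant-memory recursive update replacing the full history
    return (1 - alpha) * z + alpha * obs

history = [1.0, 0.0, 0.0, 1.0, 1.0]
z = 0.0
for obs in history:
    z = update_summary(z, obs)

# A policy/value network would consume z (fixed size) instead of `history`.
print(round(z, 4))
```

The summary is approximate rather than sufficient in general; the approximation quality of $z_t$ bounds how much of the history's predictive content the downstream policy can exploit.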

3.3 Reward Modeling and Credit Assignment

For delayed or composite non-Markovian rewards, transformer-based reward modeling architectures with in-sequence (bi-directional) attention or weighted-sum decompositions can recover high-fidelity credit assignment, outperforming step-wise or Markovian models as the length and complexity of the delay grow (Tang et al., 2024).
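The weighted-sum idea can be illustrated with a linear stand-in for the learned decomposition (the features, weights, and least-squares fit below are assumptions for the sketch, not the cited architecture): if each episode's delayed terminal return is a sum of unknown per-step contributions $w \cdot \phi(s_t, a_t)$, fitting $w$ across episodes yields dense per-step proxy rewards for credit assignment.

```python
import numpy as np

# Hedged sketch of weighted-sum return decomposition for delayed rewards.
# Assumed model: episodic return = sum_t w . phi(s_t, a_t), with only the
# terminal return observed. A linear least-squares fit stands in for a
# learned (e.g., transformer-based) decomposition.

rng = np.random.default_rng(0)
d = 3                                   # feature dimension (illustrative)
w_true = np.array([1.0, -2.0, 0.5])     # ground-truth step weights (hidden)

episodes = []
for _ in range(50):
    T = rng.integers(5, 15)             # variable episode lengths
    phi = rng.normal(size=(T, d))       # per-step features phi(s_t, a_t)
    episodes.append(phi)

# Design matrix: summed features per episode; target: delayed episodic return.
X = np.stack([phi.sum(axis=0) for phi in episodes])
y = X @ w_true                          # only the terminal return is observed
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Dense per-step credit for episode 0, recovered from delayed returns only.
proxy_rewards = episodes[0] @ w_hat
print(np.allclose(w_hat, w_true, atol=1e-8))
```

The proxy rewards can then feed a standard step-wise RL update, converting the delayed, non-Markovian signal into a dense surrogate.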

3.4 Robust and Inverse RL for Non-Markovian Spaces

Recent advances establish efficient and robust policy learning in non-Markovian nMDPs, including offline distributional robustness under uncertainty sets, even under low-rank or PSR structure (Huang et al., 2024). For inverse RL, Bayesian and LTL-based approaches address non-Markovian reward recovery, e.g., via reward machine posteriors or temporal logic formulas, using (possibly compositional) supervised or simulated annealing frameworks (Topper et al., 2024, Afzal et al., 2021).

4. Specific Applications and Empirical Domains

  • Hierarchical and Option-based Control: Analysis of semi-MDPs and hierarchical policies maps naturally to non-Markovian policies, with product-MDP or automaton collapses (Laroche et al., 2022).
  • Offline RL from Mixtures: Dataset aggregation or behavior derived from expert pools is inherently non-Markovian; equivalence results guarantee the applicability of standard convergence and safety bounds, provided discounted occupancy matching (Laroche et al., 2022).
  • Quantum control: For open quantum systems with environment-induced memory, RL is performed over an embedded (Markovianized) joint system-reservoir state, learned via maximum-likelihood embedding (Neema et al., 2024) or combined with explicit HEOM solvers (Jaouadi et al., 2023).
  • Networked RL with spatial/temporal memory: Stacking scene graphs (GNNs) with temporal encodings (GRUs/transformers) addresses both non-Markovian traffic and dynamic topologies in routing or communication domains (Wang, 27 Jul 2025).
  • Safety and Constraints: Safety signals depending on entire trajectory or sub-sequences are modeled via meta-safety variables and specialized actor-critic/lifting methods with dual-gradient Lagrangian adaptation (Low et al., 2024).

5. Limitations, Open Problems, and Trade-offs

  • Occupancy measure equivalence does not preserve full trajectory distributions—methods relying on full trajectory statistics remain outside the direct reach of the reduction (Laroche et al., 2022).
  • Automata and logic-based methods are limited by expressivity (regular languages), symbolic abstraction, and state blowup in highly structured or continuous environments (Gaon et al., 2019, Miao et al., 2023, Umili et al., 2024, Miao et al., 2024).
  • Sample efficiency and credit assignment become challenging with long-range rewards, rare-event discovery, or exploration in sparse-reward non-Markovian settings (Chandak et al., 2022, Tang et al., 2024).
  • Grounding high-level objectives in raw observations without perfect symbol grounding requires semi-supervised learning or groundability analysis of temporal logic specifications (Umili et al., 2024).
  • Robust RL in non-Markovian models demands new concentrability coefficients and dual formulations for tractable policy learning and safety guarantees in the face of structural model and distributional uncertainty (Huang et al., 2024).

6. Implications and Unification Across RL Disciplines

At the foundational level, the equivalence of discounted occupancy measures between Markovian and non-Markovian policies (with explicit constructive formula) unifies the analysis of mixture, ensemble, hierarchical, and history-dependent controllers. This result drastically broadens the scope of sample-based RL convergence, fairness, and safety analysis, subject only to occupancy-based objectives, and suggests general methods for merging theoretical and practical lines of RL research (Laroche et al., 2022).

Simultaneously, the wealth of representational, algorithmic, and inferential techniques developed across domains—ranging from automata and logic-based RL, robust estimation, transformer-based history modeling, and specification-grounded representation learning—demonstrates that non-Markovianity is not an impediment but a structural avenue for expressivity, improved policy diversity, and robustness across real-world control and decision systems.

