Memory-Augmented Markov Decision Process (M-MDP)

Updated 26 August 2025
  • Memory-Augmented MDP is a formalism in reinforcement learning that integrates explicit memory of past states to address non-Markovian challenges.
  • It leverages memory structures to overcome partial observability and delayed rewards, enhancing decision-making in complex environments.
  • Research in M-MDPs focuses on benchmark evaluation, adaptive memory design, and integrating deep learning with temporal memory dynamics.

A Memory-Augmented Markov Decision Process (M-MDP) is a formalism in sequential decision theory and reinforcement learning that extends the classical Markov Decision Process (MDP) to agents possessing an explicit memory of historical observations, actions, or events. This memory component allows the agent to cope with partial observability, delayed rewards, and system dynamics that exhibit strong temporal dependencies. The M-MDP paradigm integrates state-based policies with internal or external memory states, enabling optimal decision-making in non-Markovian domains where the current observation alone is insufficient for long-term planning or reward maximization.

1. Formal Definitions and Memory Structures

In a classical MDP, the system state $s_t$ at time $t$ is assumed to contain all requisite information for predicting transitions and rewards via Markovian dynamics:

$$P(s_{t+1}, r_t \mid s_t, a_t)$$

However, in many real-world and synthetic environments, the agent's sensory input $z_t$ does not reveal the complete underlying state, giving rise to Partially Observable MDPs (POMDPs). An M-MDP augments the state with memory, typically formalized as a trajectory history $h_t$ of past (observation, action, reward) tuples or as a latent memory state $m_t$ updated through an explicit mechanism.
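
As a concrete (and intentionally generic) illustration of this formalization, the following Python sketch runs an agent that carries a latent memory state alongside the current observation; the environment interface, `policy`, and `update_memory` are assumed placeholders rather than constructs from the cited papers.

```python
from typing import Any, Callable

def run_episode(env,
                policy: Callable[[Any, Any], Any],              # pi(m_t, z_t) -> a_t
                update_memory: Callable[[Any, Any, Any], Any],  # f(m_t, z_t, a_t) -> m_{t+1}
                initial_memory: Any) -> float:
    """Roll out one episode of a memory-augmented agent and return its total reward.

    The environment is assumed to expose reset() -> z_0 and step(a) -> (z', r, done);
    both the policy and the memory-update rule are user-supplied placeholders.
    """
    z, m = env.reset(), initial_memory
    total_reward, done = 0.0, False
    while not done:
        a = policy(m, z)               # the decision uses memory AND the current observation
        z_next, r, done = env.step(a)
        m = update_memory(m, z, a)     # explicit memory mechanism: m_{t+1} = f(m_t, z_t, a_t)
        z = z_next
        total_reward += r
    return total_reward
```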

Recent frameworks employ the concept of Memory Demand Structure (MDS), wherein a set $D \subseteq \{0, \ldots, t\}$ of key time steps must be remembered for optimal transition prediction:

$$P(s_{t+1}, r_t \mid z_{0:t}, a_{0:t}) = P\!\left(s_{t+1}, r_t \,\middle|\, \bigcap_{\tau \in D} \{ Z_\tau = z_\tau, A_\tau = a_\tau \}\right)$$

The difficulty of a task is thus determined by the minimal size or complexity of $D$ required for policy optimality (Wang et al., 6 Aug 2025). Transition invariance (stationarity, consistency) further modulates the need for memory, as agents may need to remember not just recent steps, but trajectory categories or long-range patterns.
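
To make the MDS notion concrete, the toy task below (an assumed construction in the spirit of cue-recall/T-maze problems, not taken from the cited benchmark suite) shows a cue only at $t = 0$ and pays a terminal reward that depends on that cue; its minimal dependency set is therefore $D = \{0\}$, and a memoryless policy cannot beat chance.

```python
import random

class CueRecallEnv:
    """Toy episodic task whose terminal reward depends only on the t = 0 cue.

    The minimal dependency set for optimal behaviour is D = {0}: any agent that
    forgets the initial cue can do no better than guessing at the final step.
    (Illustrative construction; not from the cited benchmark suite.)
    """

    def __init__(self, horizon: int = 10):
        self.horizon = horizon

    def reset(self):
        self.cue = random.choice([0, 1])
        self.t = 0
        return self.cue                # the cue is observable only at t = 0

    def step(self, action: int):
        self.t += 1
        done = self.t >= self.horizon
        if not done:
            return 0, 0.0, done        # uninformative observations in between
        reward = 1.0 if action == self.cue else 0.0
        return 0, reward, done
```

An agent that copies the $t = 0$ observation into its memory state solves this task exactly; an agent conditioning only on $z_t$ cannot.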

2. Construction and Analysis of Synthetic M-MDPs

Synthetic environments play a critical role in evaluating the capabilities of memory-augmented RL algorithms. Methodologies for constructing tunable M-MDPs include:

  • Linear Process Dynamics (AR processes):

$$z_{t+1} = \left( \sum_{i=0}^{k-1} w_i^{h_t} z_{t-i} \right) \bmod 1$$

Varying the order $k$ and coefficient pattern $w_i$ manipulates memory requirements, with non-consistency and non-stationarity imposing distinct challenges (a minimal sketch of such an environment appears after this list).

  • State Aggregation via Convolution (HAS):

$$z_t = \sum_{i=0}^{t} w_i s_{t-i}$$

Here, current observations are "echoes" of multiple past states, increasing MDS complexity.

  • Reward Redistribution:

Rewards may be shifted or delayed such that the critical information for policy execution can be temporally distant from the corresponding action, e.g. delivered only at episode termination (Wang et al., 6 Aug 2025).
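
The sketch below implements the AR-process observation dynamics from the first item; the order and coefficients are illustrative, and the dependence of the coefficient pattern on $h_t$ used for non-stationary variants is omitted. It is a minimal stand-in for, not a reproduction of, the benchmark generator in Wang et al.

```python
import numpy as np

def ar_process_observations(w: np.ndarray, z_init: np.ndarray, T: int) -> np.ndarray:
    """Generate observations z_{t+1} = (sum_{i<k} w_i * z_{t-i}) mod 1.

    `w` has length k (the process order); larger k forces the agent to remember
    a longer suffix of observations to predict the next one. Coefficient values
    here are illustrative, not the benchmark's.
    """
    k = len(w)
    z = list(z_init[:k])               # seed the first k observations
    for t in range(k - 1, T - 1):
        z_next = sum(w[i] * z[t - i] for i in range(k)) % 1.0
        z.append(z_next)
    return np.array(z)

# Example: an order-3 process; increasing k increases the memory demand.
obs = ar_process_observations(np.array([0.9, 0.3, 0.5]),
                              np.array([0.1, 0.2, 0.3]), T=20)
```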

The analysis of agent performance across these synthetic benchmarks elucidates the strengths and limitations of memory architectures (LSTM, LMU, Elman, MLP), especially as environment order and consistency properties are modified.

3. Memory-Augmented RL Algorithms and Applications

Modern deep reinforcement learning (DRL) algorithms frequently incorporate explicit memory mechanisms to address M-MDPs. The integration strategy typically distinguishes between current feature extraction and memory feature extraction:

  • For example, in LSTM-TD3, the actor and critic employ LSTM modules to process histories of observations/actions, which are concatenated with current features to output policies or $Q$-values:

$$Q(o_t, a_t, h_t^l) = Q^{\mathrm{pi}}\big(Q^{\mathrm{me}}(h_t^l) \,\|\, Q^{\mathrm{cf}}(o_t, a_t)\big)$$

$$\mu(o_t, h_t^l) = \mu^{\mathrm{pi}}\big(\mu^{\mathrm{me}}(h_t^l) \,\|\, \mu^{\mathrm{cf}}(o_t)\big)$$

The explicit separation of memory and current features, as opposed to flat window concatenation, enhances robustness against missing or noisy data and allows for latent integration of temporal dependencies (Meng et al., 2021).
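
A minimal PyTorch-style sketch of this two-branch critic is given below; layer sizes, module names, and the history encoding are illustrative assumptions rather than the published LSTM-TD3 configuration.

```python
import torch
import torch.nn as nn

class MemoryAugmentedCritic(nn.Module):
    """Critic with separate memory and current-feature branches (illustrative sizes)."""

    def __init__(self, obs_dim: int, act_dim: int, mem_hidden: int = 128, cf_hidden: int = 128):
        super().__init__()
        # Memory branch Q^me: encodes the history of (observation, action) pairs.
        self.memory_branch = nn.LSTM(obs_dim + act_dim, mem_hidden, batch_first=True)
        # Current-feature branch Q^cf: encodes the current observation and action.
        self.current_branch = nn.Sequential(
            nn.Linear(obs_dim + act_dim, cf_hidden), nn.ReLU())
        # Integration head Q^pi: maps the concatenated features to a Q-value.
        self.head = nn.Sequential(
            nn.Linear(mem_hidden + cf_hidden, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, history: torch.Tensor, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        # history: (batch, seq_len, obs_dim + act_dim)
        _, (h_n, _) = self.memory_branch(history)
        mem_feat = h_n[-1]                                             # Q^me(h_t^l)
        cur_feat = self.current_branch(torch.cat([obs, act], dim=-1))  # Q^cf(o_t, a_t)
        return self.head(torch.cat([mem_feat, cur_feat], dim=-1))      # Q^pi(. || .)
```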

Performance evaluations demonstrate that such memory augmentation yields substantial gains in POMDP environments, with notable improvements in episodic return and resilience under sensor imperfections.

4. Memory in Physical Systems: SWIPT and Energy Harvesting

Memory effects also arise in physical system modeling, such as in simultaneous wireless information and power transfer (SWIPT) with non-linear energy harvester (EH) circuits. Reactive elements (e.g., capacitors) induce temporal voltage persistence, necessitating state quantization (thresholds $v_0, \ldots, v_{S+1}$) and transition modeling via an MDP:

$$\rho_{i,j}(r_x) = \frac{1}{v_i - v_{i-1}} \int_{v \in [v_{i-1}, v_i)} \mathbb{1}_{[v_{j-1}, v_j)}\big(f_v(v, |h_e| r_x)\big)\, dv$$

Learning-based models employing deep neural networks approximate the otherwise intractable transition and reward mappings, enabling tractable optimization under memory-dependent dynamics (Shanin et al., 2020).
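
For intuition, the uniform-averaging integral above can be approximated numerically once a voltage-update model is fixed; the sketch below substitutes a simple saturating update `f_v` as a placeholder for the actual non-linear EH circuit model of Shanin et al.

```python
import numpy as np

def transition_matrix(thresholds: np.ndarray, r_x: float, h_e: float,
                      f_v, n_samples: int = 1000) -> np.ndarray:
    """Approximate rho_{i,j}(r_x) by averaging over voltages within each bin.

    `thresholds` = [v_0, ..., v_{S+1}] quantize the EH capacitor voltage;
    `f_v(v, p)` is an assumed voltage-update function (placeholder model).
    """
    S = len(thresholds) - 1
    rho = np.zeros((S, S))
    for i in range(S):
        # Sample voltages uniformly in bin i = [v_{i-1}, v_i).
        v = np.linspace(thresholds[i], thresholds[i + 1], n_samples, endpoint=False)
        v_next = f_v(v, abs(h_e) * r_x)
        # Count how many updated voltages land in each destination bin j.
        j = np.clip(np.searchsorted(thresholds, v_next, side='right') - 1, 0, S - 1)
        rho[i] = np.bincount(j, minlength=S) / n_samples
    return rho

def saturating_update(v, p):
    """Placeholder saturating voltage update (illustrative only)."""
    return np.minimum(v + 0.5 * p * (1.0 - v), 1.0)

rho = transition_matrix(np.linspace(0.0, 1.0, 6), r_x=0.8, h_e=0.9, f_v=saturating_update)
```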

Optimization problems contrast the convex case (EH state known) with the non-convex case (EH state hidden), showing the superiority of adaptive, memory-aware policies in maximizing harvested power under mutual information constraints.

5. Logic-Based Representation and Policy Synthesis

Decision-theoretic logic programming extensions, notably the pBC+ action language, provide a framework for elaboration-tolerant M-MDP representations. States are interpreted as stable models of fluent valuations; utility laws allow reward assignment to transitions or states:

$$v \quad \mathrm{if}\ F\ \mathrm{after}\ G$$

This law is compiled into LPMLN rules defining utility atoms that sum to the interpretation’s overall utility. Expectation over candidate actions yields a formal maximum expected utility (MEU) decision principle:

$$E[U_\pi(A)] = \sum_{I \models A} U_\pi(I)\, P_\pi(I \mid A)$$

The pbcplus2mdp system translates high-level pBC+ action descriptions with utility extensions into an MDP instance, enabling optimal policy discovery through classical solvers. This approach is directly extensible to M-MDPs by expanding fluent representations to encode memory, thereby supporting reward functions and transitions that depend on historical context (Wang et al., 2019).
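
Abstracting away the LPMLN compilation, the MEU rule itself is a weighted sum over interpretations consistent with each candidate action; the sketch below assumes the interpretations, their utilities, and their conditional probabilities have already been enumerated (a simplification of what pbcplus2mdp actually produces).

```python
from typing import Dict, List, Tuple

def max_expected_utility(candidates: List[str],
                         models: Dict[str, List[Tuple[float, float]]]) -> Tuple[str, float]:
    """Pick the action maximizing E[U_pi(A)] = sum_I U_pi(I) * P_pi(I | A).

    `models[a]` lists (utility, probability) pairs for the interpretations I
    consistent with action a (assumed already enumerated and normalized).
    """
    best_action, best_value = None, float('-inf')
    for a in candidates:
        expected_utility = sum(u * p for u, p in models[a])
        if expected_utility > best_value:
            best_action, best_value = a, expected_utility
    return best_action, best_value

# Illustrative inputs: two candidate actions with hand-specified model sets.
models = {"go_left": [(10.0, 0.3), (-2.0, 0.7)],
          "go_right": [(4.0, 0.5), (3.0, 0.5)]}
print(max_expected_utility(["go_left", "go_right"], models))   # -> ('go_right', 3.5)
```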

6. Guidelines for Memory-Augmented Environment and Agent Design

Theoretical constructs such as MDS and transition invariance provide actionable criteria for environment and agent specification:

  • Characterize minimal dependency sets $D$ for the environment to quantify intrinsic task memory load.
  • Manipulate intra-trajectory stationarity and inter-trajectory consistency to control transition dynamics and agent memory requirements.
  • Use wrappers that preserve the optimal policy while increasing memory complexity for diagnostic benchmarking (see the sketch after this list).
  • Evaluate agent architectures systematically across synthetic series of controlled difficulty (e.g., AR process order, convolution depth, reward delay) to map the effective representational and forgetting capacity of each model (Wang et al., 6 Aug 2025).
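
As an example of the wrapper guideline above, the sketch below (an assumed construction in the spirit of the reward-redistribution benchmarks, not code from the cited paper) withholds all reward until episode termination: the optimal policy with respect to episodic return is preserved, while the credit-assignment gap, and hence the memory burden, grows.

```python
class DelayedRewardWrapper:
    """Wrap an episodic env so that all reward is paid out only at termination.

    The optimal policy w.r.t. episodic return is preserved, but the agent must
    bridge a longer temporal gap between actions and the reward signal.
    (Illustrative wrapper; assumes the reset()/step() interface used above.)
    """

    def __init__(self, env):
        self.env = env
        self._accumulated = 0.0

    def reset(self):
        self._accumulated = 0.0
        return self.env.reset()

    def step(self, action):
        obs, reward, done = self.env.step(action)
        self._accumulated += reward
        if done:
            return obs, self._accumulated, done   # pay everything at the end
        return obs, 0.0, done                     # zero reward mid-episode
```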

A plausible implication is that robust M-MDP policy design depends critically on matching environment-driven memory demands to an agent’s architectural decisions regarding history integration, memory separation, and irrelevant information suppression.

7. Significance and Research Directions

The M-MDP formalism provides a unified template for reasoning about sequential decision processes in non-Markovian or partially observed environments. It underpins advances in model-based RL, physical system modeling (as in wireless energy harvesting), logic-based decision frameworks, and synthetic benchmarking for memory architectures.

Current challenges include adaptive memory length selection, interpretability of learned latent memory states, scaling to high-dimensional or long-horizon settings, and development of efficient architectures beyond recurrent networks (e.g., attention-based, continuous memory) (Meng et al., 2021).

The field is poised for further progress in:

  • Synthesizing diagnostic POMDPs with diverse invariance and MDS parameters for benchmarking.
  • Integrating deep learning modeling of physical memory effects with decision-theoretic criteria.
  • Bridging declarative, logic-based planning languages with reinforcement learning for memory-augmented policy synthesis.

These research strands collectively clarify the role of memory augmentation in sequential decision making, furnish rigorous tools for agent and environment evaluation, and inform the principled deployment of M-MDPs in technically challenging domains.