Memory-augmented MDP: Theory & Applications
- Memory-augmented MDPs extend classical MDPs by incorporating historical memory to address non-Markovian dynamics and optimize long-run average objectives.
- They employ both finite-memory and infinite-memory strategies, with 2-memory stochastic-update strategies notably achieving Pareto optimal trade-offs for expectation objectives.
- Applications include multi-objective control systems, energy harvesting circuits, and partially observable reinforcement learning, supported by efficient linear programming methods.
A Memory-augmented Markov Decision Process (M-MDP) is an extension of the classical Markov Decision Process (MDP) framework in which the agent, environment, or underlying process requires memory—explicit tracking of aspects of history or previous states—to optimize policies, to faithfully model dynamics shaped by temporal dependencies, or to address limits of observability. This paradigm arises naturally in settings involving multiple long-run average objectives, realistic physical systems with inherent memory effects, and reinforcement learning in partially observable environments.
1. Formal Definition and Motivation
In a standard MDP, decisions are made based on current state information under the Markov assumption, i.e., future transitions depend only on the current state and action. A memory-augmented MDP (M-MDP) generalizes this model by introducing a memory mechanism, either in the agent’s policy (history-dependent strategies), the state representation (augmented with memory traces), or the environment’s transition structure (involving non-Markovian features).
This augmentation is essential in scenarios where the objective or behavior depends on long-run averages over multiple reward functions, in systems with physical memory (e.g., energy harvesters with capacitor dynamics), or in tasks where observability is fundamentally limited and historical information must be retained to make optimal decisions. Formally, such processes can be characterized by trajectories with a memory demand structure (MDS), which specifies that only certain time indices in the history need to be remembered for prediction or control (Wang et al., 6 Aug 2025).
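To make the state-augmentation view concrete, here is a minimal Python sketch (a hypothetical illustration, not taken from the cited papers) in which a history-dependent process is made Markovian by carrying a bounded window of past observations in the state; the fixed window length plays the role of a very simple memory demand structure.

```python
from collections import deque

class MemoryAugmentedMDP:
    """Minimal sketch: make a history-dependent process Markovian by
    augmenting the state with a bounded window of past observations."""

    def __init__(self, step_fn, memory_len, init_obs):
        # step_fn(memory, obs, action) -> (next_obs, reward); its dependence
        # on `memory` is what breaks the plain Markov assumption.
        self.step_fn = step_fn
        self.memory_len = memory_len
        self.init_obs = init_obs

    def reset(self):
        self.obs = self.init_obs
        self.memory = deque([self.init_obs] * self.memory_len,
                            maxlen=self.memory_len)
        return (self.obs, tuple(self.memory))   # augmented (Markovian) state

    def step(self, action):
        next_obs, reward = self.step_fn(tuple(self.memory), self.obs, action)
        self.memory.append(self.obs)            # retain only the demanded history
        self.obs = next_obs
        return (self.obs, tuple(self.memory)), reward


# Toy dynamics: the reward depends on the observation two steps back, so the
# process is non-Markovian in `obs` alone but Markovian in the augmented state.
def lagged_reward_step(memory, obs, action):
    next_obs = (obs + action) % 4
    reward = 1.0 if memory[-2] == next_obs else 0.0
    return next_obs, reward

env = MemoryAugmentedMDP(lagged_reward_step, memory_len=2, init_obs=0)
state = env.reset()
state, reward = env.step(action=1)
```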
2. Memory Requirements in Strategy Synthesis
Research in multi-objective MDPs with limit-average (mean-payoff) objectives demonstrates a fundamental need for memory in policy design. Specifically, when optimizing for the expectation of a vector-valued long-run reward, the set of achievable vectors is strictly larger than what can be realized via memoryless or pure strategies. The essential findings include:
- For expectation objectives with limit-average functions, both randomization and memory are required; every achievable vector is “witnessed” by a 2-memory stochastic-update strategy (Brázdil et al., 2011).
- The relevant formulation of the long-run average (limit-average) reward along a run $\omega = s_1 a_1 s_2 a_2 \ldots$ is
  $$\mathrm{lr}(\vec r)(\omega) = \liminf_{T\to\infty} \frac{1}{T}\sum_{t=1}^{T} \vec r(a_t).$$
  Strategies $\sigma$ must ensure that
  $$\mathbb{E}^{\sigma}\big[\mathrm{lr}(\vec r)\big] \ge \vec v \quad \text{(componentwise)}$$
  for the desired vector $\vec v$.
Memory demands diverge significantly for different objectives:

| Objective Type | Memory Requirement | Approximation Possibility |
|---|---|---|
| Expectation objective | 2-memory randomized (finite memory, sometimes deterministic-update) | ε-approximation with the same memory |
| Satisfaction objective | Infinite memory required for exact achievement | ε-approximation with memoryless randomized strategies |
This establishes that for trade-offs among objectives, a finite and explicitly small amount of memory suffices, but for satisfaction guarantees involving thresholds, infinite memory may be necessary unless ε-approximation is acceptable.
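In the notation introduced above, the two objective types in the table can be written as follows (with $\vec v$ the target vector and $\nu$ a probability threshold, following Brázdil et al., 2011):

$$\text{Expectation:}\;\; \mathbb{E}^{\sigma}\big[\mathrm{lr}(\vec r)\big] \ge \vec v, \qquad\qquad \text{Satisfaction:}\;\; \mathbb{P}^{\sigma}\big(\{\omega : \mathrm{lr}(\vec r)(\omega) \ge \vec v\}\big) \ge \nu.$$

Intuitively, an expectation constraint can be met by mixing different behaviors across runs, whereas a satisfaction constraint must be met on a prescribed fraction of individual runs, which in general forces a single run to alternate between behaviors for ever longer stretches—hence the infinite-memory requirement for exact achievement.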
3. Classes of Memory-augmented Strategies
The spectrum of strategies in M-MDPs includes:
- Finite-memory strategies: Utilize a fixed set of memory states, with stochastic or deterministic memory updates.
- Infinite-memory strategies: Required for exact satisfaction objectives in certain MDPs, as all finite-memory randomized strategies may fail to reach the target vector with nonzero probability (Brázdil et al., 2011).
- Memoryless randomized strategies: In satisfaction objectives, these suffice for ε-approximations.
For expectation objectives, the strategy is structured in two phases: the first uses randomized selection to reach maximal end components (MECs) with prescribed frequencies (encoded by the $y$-values of a linear constraint system), and the second fixes the action frequencies within the chosen MEC (encoded by the $x$-values). This explicit structure shows that Pareto optimal values—the best trade-offs among objectives—can be synthesized using 2-memory strategies, whereas memoryless strategies are strictly suboptimal.
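The two-phase structure can be sketched in Python as follows; this is a simplified illustration under stated assumptions, not the construction from Brázdil et al. (2011) verbatim, and names such as `transient_policy`, `mec_policies`, and `mec_choice` are hypothetical.

```python
import random

class TwoMemoryStrategy:
    """Sketch of a 2-memory stochastic-update strategy: memory state 1 steers
    the run toward a MEC, memory state 2 plays the in-MEC frequencies."""

    def __init__(self, transient_policy, mec_policies, mec_membership, mec_choice):
        self.transient_policy = transient_policy  # state -> {action: prob}, phase 1
        self.mec_policies = mec_policies          # mec_id -> (state -> {action: prob})
        self.mec_membership = mec_membership      # state -> mec_id, or None outside MECs
        self.mec_choice = mec_choice              # mec_id -> switching probability
        self.memory = 1
        self.committed_mec = None

    def next_action(self, state):
        if self.memory == 1:
            mec = self.mec_membership.get(state)
            # Stochastic memory update: inside a MEC, switch to memory 2 with a
            # prescribed probability (simplified here; in the actual construction
            # the switching probabilities are derived from the constraint system).
            if mec is not None and random.random() < self.mec_choice[mec]:
                self.memory, self.committed_mec = 2, mec
        if self.memory == 1:
            dist = self.transient_policy[state]
        else:
            dist = self.mec_policies[self.committed_mec][state]
        actions, probs = zip(*dist.items())
        return random.choices(actions, weights=probs)[0]
```

The point of the 2-memory bound is that the only history such a strategy needs to carry is which phase it is in and, once committed, which MEC it has committed to.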
4. Computational Methods and Decision Procedures
Efficient decision algorithms for M-MDPs exist based on linear programming formulations:
- A linear-constraint system is central, comprising equations governing frequency and reward requirements. Representative constraints (for the in-MEC $x$-values) include:
  - Frequency (flow) constraint: $\sum_{a\in A} x_a\,\delta(a)(s) = \sum_{a\in A(s)} x_a$ for every state $s$, where $x_a$ is the long-run frequency of action $a$ and $\delta(a)(s)$ is the probability that $a$ leads to $s$.
  - Reward constraint: $\sum_{a\in A} x_a\, r_i(a) \ge v_i$ for every reward function $r_i$ and target component $v_i$.
- For expectation objectives, constructing a 2-memory strategy realizing an achievable vector is solvable in polynomial time, with the Pareto curve (trade-off frontier) ε-approximable in time polynomial in the size of the MDP and $1/\varepsilon$, though exponential in the number of reward functions.
- Satisfaction objectives admit memoryless ε-approximate strategies with similar complexity guarantees.
These algorithms correct the misconception in prior studies that memoryless strategies suffice; counterexamples prove the necessity of memory even for ε-approximation in expectation objectives.
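For illustration, the sketch below sets up the in-MEC part of such a constraint system for a hypothetical toy MDP (assumed to consist of a single MEC) with `scipy.optimize.linprog`, checks whether a target vector is achievable, and reads off a memoryless randomized strategy from the $x$-values; the transient $y$-part and the MEC decomposition are omitted.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical toy MDP, assumed to form a single maximal end component:
# two states, two actions each; rewards are 2-dimensional vectors.
states = [0, 1]
actions = {0: ["a", "b"], 1: ["a", "b"]}
delta = {                      # delta[(s, a)][s'] = transition probability
    (0, "a"): {0: 1.0},
    (0, "b"): {1: 1.0},
    (1, "a"): {1: 1.0},
    (1, "b"): {0: 1.0},
}
rewards = {                    # rewards[(s, a)] = (r_1, r_2)
    (0, "a"): np.array([1.0, 0.0]),
    (0, "b"): np.array([0.0, 0.0]),
    (1, "a"): np.array([0.0, 1.0]),
    (1, "b"): np.array([0.0, 0.0]),
}
target = np.array([0.4, 0.4])  # desired expectation vector v

pairs = [(s, a) for s in states for a in actions[s]]
idx = {p: i for i, p in enumerate(pairs)}
n = len(pairs)

# Equalities: flow balance per state plus normalisation of the frequencies x_a.
A_eq, b_eq = [], []
for s in states:
    row = np.zeros(n)
    for (s2, a), i in idx.items():
        row[i] += delta[(s2, a)].get(s, 0.0)   # expected inflow into s
        if s2 == s:
            row[i] -= 1.0                      # outflow: actions played in s
    A_eq.append(row)
    b_eq.append(0.0)
A_eq.append(np.ones(n))
b_eq.append(1.0)

# Inequalities: sum_a x_a * r_i(a) >= v_i, written as -R x <= -v for linprog.
A_ub = [-np.array([rewards[p][k] for p in pairs]) for k in range(len(target))]
b_ub = list(-target)

res = linprog(c=np.zeros(n), A_ub=A_ub, b_ub=b_ub,
              A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * n)

if res.success:
    x = res.x
    for s in states:                           # memoryless randomized strategy
        total = sum(x[idx[(s, a)]] for a in actions[s])
        dist = {a: x[idx[(s, a)]] / total for a in actions[s]} if total > 1e-12 else {}
        print(f"state {s}: {dist}")
else:
    print("target vector not achievable in this (single-MEC) toy MDP")
```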
5. Applications of Memory-augmented MDPs
The need for memory arises in diverse domains:
- Control systems and resource allocation: Controllers frequently must coordinate actions to balance multiple conflicting long-term requirements, necessitating history-dependent policies.
- Energy harvesting circuits: In simultaneous wireless information and power transfer (SWIPT) systems, the non-instantaneous charging and discharging of capacitors imparts memory to the physical device. The quantized capacitor voltage serves as the MDP state, with transitions depending on both the previous voltage and the currently transmitted symbol (Shanin et al., 2020).
- Planning and logic programming: Probabilistic action languages such as pBC+ can be extended to encompass decision-theoretic constructs, enabling modeling of M-MDPs where memory-encoding fluents and static laws capture dependence on historical events (Wang et al., 2019).
- Partially observable RL environments: Synthetic benchmarks for memory-augmented RL have advanced, allowing precise grading of tasks based on “memory demand structure” (MDS), linear process dynamics, state aggregation, and reward redistribution. This provides both theoretical analysis and empirical evidence for selecting suitable memory models (Wang et al., 6 Aug 2025).
6. Design and Analysis Guidelines
Systematic approaches for constructing and benchmarking M-MDPs and related environments are grounded in:
- Memory demand quantification via MDS, enabling principled specification of which historical elements are requisite for accurate prediction or control.
- Manipulation of transition invariance through equivalence relations (e.g., recent-history and initial-condition categories), thereby tuning the stationarity and consistency properties of the process; experimental findings indicate that non-consistency significantly increases the difficulty for memory models.
- State aggregation via reversible convolution-based wrappers, ensuring the transformed process preserves the inherent difficulty of the underlying MDP.
- Reward delay and redistribution techniques test an agent’s ability to “assign credit” and selectively forget irrelevant intermediate history, while preserving the optimality of Markov policies.
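As a concrete instance of the last point, a reward-delay wrapper might look like the following sketch (assuming a gym-style 4-tuple `step` interface; the class and parameter names are illustrative): withheld rewards are released in a lump every `delay` steps and at episode end, so the episode return—and hence the optimality of Markov policies on the underlying MDP—is preserved, while per-step credit assignment now requires memory.

```python
class DelayedRewardWrapper:
    """Sketch of a reward-delay wrapper: rewards are withheld and released
    every `delay` steps and at episode end, preserving the episode return."""

    def __init__(self, env, delay=10):
        self.env = env
        self.delay = delay
        self._buffer = 0.0
        self._steps = 0

    def reset(self, **kwargs):
        self._buffer = 0.0
        self._steps = 0
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self._buffer += reward
        self._steps += 1
        if done or self._steps % self.delay == 0:
            delayed, self._buffer = self._buffer, 0.0   # release accumulated reward
        else:
            delayed = 0.0                               # withhold for now
        return obs, delayed, done, info
```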
7. Impact and Future Directions
The theory and practice of M-MDPs have significantly expanded the understanding of strategy synthesis in multi-objective, memory-reliant processes. The provision of polynomial-time algorithms for controller synthesis, the explicit architectural requirements for policy memory, and benchmark environments sensitive to graded memory demand collectively clarify both the limits of traditional MDP approaches and the advances offered by memory augmentation. Further work is expected to refine the granularity of memory requirement quantification, extend elaboration-tolerant formal languages for M-MDP specification, and develop new benchmarks and analytic tools for evaluating RL methods under controlled memory demand.
In summary, the memory-augmented MDP framework formalizes and addresses the necessity of historical information in policy optimization, system modeling, and algorithm benchmarking, with implications spanning both theoretical foundations and operational applications in decision-making under uncertainty.