
Stateful Reflective Decision Process

Updated 1 January 2026
  • Stateful Reflective Decision Process is a decision-making framework that augments traditional MDPs by integrating persistent memory and self-modifying rules to adapt over time.
  • It employs explicit memory retrieval, meta-inference, and reflective coding to guide policy decisions, ensuring safety and cross-task generalization.
  • Practical instantiations in LLM agents, reflective sequential algorithms, and game-theoretic frameworks demonstrate its theoretical robustness and real-world applicability.

A stateful reflective decision process (SRDP) is a decision-making architecture in which the agent’s policy is persistently modified or guided by explicit representations of past interactions, meta-inference, or self-modifying rules, allowing the agent to adapt, generalize, or introspect in a temporally extended manner. The formal study of SRDPs spans reinforcement learning, symbolic decision theory, reflective sequential algorithms, and game-theoretic reflexion, with recent computational instantiations in LLM agents that employ memory-augmented or meta-policy approaches (Wu et al., 4 Sep 2025, Wang, 27 Dec 2025, Schewe et al., 2020, Novikov et al., 2018, Tarasenko, 2012, Fox et al., 2013, Kim et al., 21 May 2025).

1. Formal Foundations and Core Definitions

SRDPs generalize the classical Markov decision process (MDP) by coupling the agent’s state with explicit representational or inferential memory, which is read and/or written at each decision step. In the formalization proposed in "Memento-II" (Wang, 27 Dec 2025), an SRDP is described as a tuple $\langle \mathcal S,\ \mathcal A,\ \mathcal P,\ \mathcal R,\ \gamma,\ \mathcal M,\ \mu,\ p_{\rm LLM} \rangle$, where $\mathcal A$, $\mathcal P$, $\mathcal R$, $\gamma$ are the usual MDP action space, transition kernel, reward, and discount, and:

  • $\mathcal S$ is the set of environment states,
  • $\mathcal M$ is the space of agent memory/episodic knowledge,
  • At time $t$, the agent holds $s_t \in \mathcal S$ and $M_t \in \mathcal M$,
  • Read: The agent retrieves a memory case $c_t \sim \mu(\cdot \mid s_t, M_t)$,
  • Write: The agent stores $(s_t, a_t, r_t)$ in $M$,
  • Action Generation: $a_t$ is produced by an LLM (or other policy) conditioned on $s_t$ and $c_t$: $a_t \sim p_{\rm LLM}(a \mid s_t, c_t)$.

The process thus forms an augmented state $x_t = (s_t, M_t)$ and induces a "reflected" MDP on this extended space (Wang, 27 Dec 2025). The core operator of reflection is defined via either explicit memory retrieval and writeback, or self-inspection and modification of code, beliefs, or symbolic inference rules (Schewe et al., 2020, Fox et al., 2013).
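
The read/write/act cycle above can be made concrete in a few lines. The following is a minimal Python sketch, assuming a flat list-valued memory and caller-supplied retrieve, policy, and env.step functions; all of these names and interfaces are illustrative, not taken from the cited papers:

    from dataclasses import dataclass, field
    from typing import Any, Callable

    @dataclass
    class SRDPAgent:
        """Carries the augmented state x_t = (s_t, M_t); the memory M persists across steps."""
        retrieve: Callable  # mu(. | s, M): select a relevant case from memory
        policy: Callable    # p_LLM(a | s, c): propose an action from the state and retrieved case
        memory: list = field(default_factory=list)  # M, grows as experience accumulates

        def step(self, env, state):
            case = self.retrieve(state, self.memory)      # Read:  c_t ~ mu(. | s_t, M_t)
            action = self.policy(state, case)             # Act:   a_t ~ p_LLM(. | s_t, c_t)
            next_state, reward, done = env.step(action)   # Environment transition
            self.memory.append((state, action, reward))   # Write: store (s_t, a_t, r_t)
            return next_state, reward, done

Adaptation happens entirely through growth of the memory and retrieval over it, with no change to the policy's parameters, which is the property the LLM-agent instantiations below exploit.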

2. Algorithmic Realizations: Memory, Rules, and Reflective Coding

The main feature distinguishing particular SRDPs is the structure and algorithm used for stateful memory or self-reflection.

Meta-Policy Memory (MPM) in LLM Agents

In Meta-Policy Reflexion (MPR) (Wu et al., 4 Sep 2025), the agent maintains an external Meta-Policy Memory ($\mathcal M$): $\mathcal{M} = \{ m_i \}_{i=1}^{N}, \qquad m_i = (\mathrm{pre}_i, \mathrm{act}_i, w_i)$, with:

  • $\mathrm{pre}_i$, a Boolean predicate over states $\mathcal S$,
  • $\mathrm{act}_i$, a recommended or forbidden action pattern,
  • $w_i$, a confidence score.

SRDP operation is described recursively (a schematic sketch follows the numbered list):

  1. Retrieval: At state $s_t$, extract those $m_i$ where $\mathrm{pre}_i(s_t) = 1$.
  2. Soft Memory-Guided Decoding: Condition the action policy on the set $\mathcal{M}_t = \{ m \mid \mathrm{pre}_m(s_t) = 1 \}$, e.g., by prompt injection.
  3. Hard Rule Admissibility Check (HAC): Post-filter actions so that forbidden actions are never executed.
  4. Episode-End Reflection: On failure, synthesize new $(\mathrm{pre}, \mathrm{act}, w)$ rules from the trajectory through an LLM reflection function $f(\tau)$ and merge them into $\mathcal{M}$.
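
A schematic Python sketch of one pass through this cycle is given below; the Rule container and the llm_policy, admissible, and reflect_fn callables are hypothetical stand-ins for the paper's components, not the authors' implementation:

    from dataclasses import dataclass
    from typing import Any, Callable

    @dataclass
    class Rule:
        pre: Callable[[Any], bool]   # Boolean predicate over states
        act: str                     # recommended or forbidden action pattern
        w: float                     # confidence score
        forbid: bool = False         # True if the rule bans matching actions

    def mpr_step(state, memory, llm_policy, admissible):
        active = [m for m in memory if m.pre(state)]                # 1. Retrieval
        hint = "; ".join(f"{m.act} (w={m.w:.2f})" for m in active)
        candidates = llm_policy(state, hint)                        # 2. Soft memory-guided decoding (e.g., prompt injection)
        banned = [m for m in active if m.forbid]
        for a in candidates:                                        # 3. Hard rule admissibility check (HAC)
            if admissible(a, banned):
                return a
        return None                                                 # no admissible candidate proposed

    def reflect_and_merge(trajectory, failed, reflect_fn, memory):
        if failed:                                                  # 4. Episode-end reflection: f(tau) -> new rules
            memory.extend(reflect_fn(trajectory))
        return memory

Because the admissibility check runs after decoding, forbidden actions are filtered out before execution even when the soft guidance alone would not prevent them.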

Reflective Sequential Algorithms (RSA)

Behaviorally, RSAs (Schewe et al., 2020) formalize SRDPs as abstract state machines whose state $S$ is a pair $\langle D, R \rangle$ of task data and a finite tree representing the current algorithm. The one-step transition rule jointly updates both the data and its own algorithmic code. Reflection is realized by tree-algebraic manipulations of the rule-encoding subtree, enabling the system to inspect, modify, or extend its operating procedures as part of the standard decision cycle.
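
As a toy illustration of this idea (using a plain list of callables in place of the paper's tree-encoded rules), the sketch below shows a transition that updates data and rule set together, so a rule can retire itself or install a successor within an ordinary step:

    def rsa_step(state):
        """State is a pair (data, rules); each rule may rewrite both components."""
        data, rules = state
        for rule in list(rules):          # iterate over a snapshot of the current rule set
            data, rules = rule(data, rules)
        return data, rules

    def bootstrap(data, rules):
        """After three invocations, remove itself and install a 'phase 2' rule."""
        data = {**data, "n": data.get("n", 0) + 1}
        if data["n"] >= 3:
            rules = [r for r in rules if r is not bootstrap] + [phase_two]
        return data, rules

    def phase_two(data, rules):
        return {**data, "phase": 2}, rules

    state = ({"n": 0}, [bootstrap])
    for _ in range(5):
        state = rsa_step(state)           # the algorithm has rewritten itself by step 3

The list-of-callables encoding replaces the paper's tree algebra, but the key point carries over: the object being executed is itself part of the state that each step may rewrite.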

Symbolic and Game-Theoretic Approaches

SRDPs also arise within symbolic decision and reflexive game theory (Fox et al., 2013, Novikov et al., 2018, Tarasenko, 2012), where the agent’s state encompasses beliefs, goals, questions, and multistage influence parameters. Here, a reflection operator acts at the meta-level, initiating or closing decisions, generating new questions (i.e., discovering knowledge gaps), or recursively updating the mutual influences in group decision making. State transitions combine symbolic argumentation, meta-level rule application, and algebraic aggregation of influences.

3. Decision Loop Structure and Theoretical Properties

A unifying feature is the tight feedback loop between world state, memory, and self-modification. At each step, the process includes:

  • Retrieval of relevant memory or belief meta-data,
  • Computation of next action or sub-policy, possibly modified by explicit rules, retrieved exemplars, or meta-inference,
  • Outcome observation and possible writeback or meta-level update.

This results in an augmented Markov chain over $(s_t, M_t)$, allowing formal analysis with dynamic programming and RL tools. In an SRDP with growing memory and local retrieval fidelity, composite policies can be shown to converge to optimality under mild smoothness/coverage conditions (Wang, 27 Dec 2025).
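
For example, if the write-back is modeled as a deterministic update function $W$ (introduced here only for exposition; it is not named above), the augmented transition kernel factorizes as

$P(x_{t+1} \mid x_t, a_t) = \mathcal P(s_{t+1} \mid s_t, a_t)\,\mathbf{1}\!\left[M_{t+1} = W(M_t, s_t, a_t, r_t)\right], \qquad x_t = (s_t, M_t),$

so standard Bellman machinery applies on the extended space, with the memory coordinate evolving deterministically given each transition.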

In RSAs, updates apply simultaneously to both data and algorithm, preserving bounded exploration and invariance under state isomorphism (Schewe et al., 2020). In game-theoretic reflexion, iterated application of reflexion-rank update rules converges to equilibria under diminishing step-size and standard concavity assumptions (Novikov et al., 2018).
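
As a generic illustration of the kind of scheme involved (the precise update rule is specified in the cited work), such damped adjustment processes are often written as

$x_{k+1} = x_k + \alpha_k\left(\mathrm{BR}(x_k) - x_k\right), \qquad \alpha_k > 0,\quad \sum_k \alpha_k = \infty,\quad \sum_k \alpha_k^2 < \infty,$

where $\mathrm{BR}$ is a best-response-style map over the agents' strategy profile; the diminishing step-size condition referenced above is of this Robbins–Monro form.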

4. Instantiations and Case Studies

SRDPs have been realized in several domains:

  • LLM Agents with MPR: Tasks such as object manipulation in AlfWorld demonstrate the consolidation of corrective rules (e.g., "open drawer before picking up apple") into $\mathcal M$, yielding generalization and safety under cross-task evaluations. Empirically, training set accuracy reaches 100% by round 3; held-out accuracy improves from 86.9% (Reflexion) to 87.8% (MPR) and 91.4% with HAC (Wu et al., 4 Sep 2025).
  • Reflective Policy Iteration: In code generation and program synthesis, memory retrieval provides concrete exemplars leading to rapid policy improvement purely through inference, without parametric updates (Wang, 27 Dec 2025).
  • Symbolic Diagnostic Agents: SRDPs using argumentation and meta-goal reasoning autonomously initiate sub-decisions, detect knowledge gaps, and execute extended reasoning cycles, as in medical diagnosis or planning (Fox et al., 2013).
  • Reflexive Game Theory: Multistage and group decision processes are expressed as cascades of sessions, where each stage's outcome forms the state transition for the next, and the final decision follows from the reflexive solution on the terminal state (Tarasenko, 2012).

5. Mechanisms for Reflection, Memory, and Self-Modifying Policies

SRDPs instantiate reflection via several mechanisms:

  • External Episodic/Meta-Policy Memory: Accumulation of failure traces, rules, and cases that bias future policy decisions (MPR, SRDP) (Wu et al., 4 Sep 2025, Wang, 27 Dec 2025).
  • Argumentation Engines: Symbolic construction and aggregation of status-annotated propositions with meta-rules for decision initiation, gap detection, and termination (Fox et al., 2013).
  • Self-Representing State Machines: Embedding and manipulation of a code tree representing the running algorithm directly in the agent’s state, enabling dynamic self-modification (RSAs) (Schewe et al., 2020).
  • Influence Matrix Updating: In reflexive group decisions, each session updates the influence matrix, thereby setting up the boundary for the next reflexive solution (Tarasenko, 2012).
  • Belief Hierarchies: Recursive belief structures, allowing agents to encode beliefs about others' beliefs and strategies, feed back into the dynamic adjustment process (Novikov et al., 2018).

SRDPs thus externalize or factor reflection—explicit mechanisms for updating not only environment state but also internal models or policies—within the standard decision loop. This sharply distinguishes them from non-reflective RL or planning approaches.

6. Cross-Task Generalization, Safety, and Adaptability

Empirical and theoretical findings across instantiations indicate that SRDPs:

  • Support continual, cross-task adaptation without gradient updates to the policy network, which is especially advantageous for LLM agents (Wu et al., 4 Sep 2025, Wang, 27 Dec 2025).
  • Incorporate hard safety constraints by enforcing rule admissibility at inference time, thereby preventing forbidden or unsafe actions (Wu et al., 4 Sep 2025).
  • Ground decisions in persistent agent-specific knowledge or rules, increasing robustness against compounding errors, state-goal drift, or hallucination (as shown by improved performance of ReflAct over ReAct-style approaches) (Kim et al., 21 May 2025).

The process decouples learning from parameter fine-tuning, instead achieving adaptation through incremental memory or meta-structure growth and retrieval-based inference, which under identifiable conditions leads to asymptotic optimality and robust transfer (Wang, 27 Dec 2025).

7. Conceptual Scope and Relation to Broader Theories

The SRDP framework unifies views from operational reinforcement learning, symbolic inference, inductive memory architectures, and algorithmic self-modification:

  • Reflective Sequential Algorithms provide a behavioral theory for self-modification and code-as-data reasoning (Schewe et al., 2020).
  • Argumentation-based Decision Theory formalizes meta-level control and self-questioning in symbolic agents (Fox et al., 2013).
  • Reflexion in Game/Collective Behavior Theory supplies explicit, recursive models of belief and strategy, with rigorous convergence theory (Novikov et al., 2018).
  • Reflexive Game Theory structures multistage and session-based group decision making with clear algebraic state transitions and mutual influence encoding (Tarasenko, 2012).

All such systems, at their core, exploit a tightly coupled, stateful loop of decision, memory, reflection, and adaptation, underpinned by both theoretical guarantees (convergence, expressiveness) and robust empirical performance across tasks and domains.
