Multi-Turn MDPs: Framework & Applications

Updated 24 June 2026

Multi-turn MDPs are a mathematical framework for sequential decision-making, modeling states, actions, rewards, and transitions over multiple stages.
They encompass both finite and infinite horizons and integrate methods like backward induction, LP-based policies, and deep RL for handling constraints and non-stationarity.
Applications include mobile health, supply chain management, and automated decision support, with ongoing research focusing on scalability through dimensionality reduction and structure learning.

A Multi-turn Markov Decision Process (MDP) is a mathematical framework for modeling sequential decision-making under uncertainty over multiple stages (“turns”). This structure encompasses both finite-horizon and infinite-horizon formulations, with each “turn” representing a discrete time step at which the agent observes a state, selects an action, and receives a reward or incurs a cost, followed by a stochastic transition to the next state. Multi-turn MDPs support both discrete and continuous state/action spaces, accommodate constraints, and underpin modern approaches for control, planning, and reinforcement learning in dynamic environments.

1. Formal Structure and Notation

Let $T$ denote the total number of stages in the finite-horizon case; in the infinite-horizon case the process continues indefinitely. The canonical specification of a finite-horizon multi-turn MDP is the tuple

$\left(\{\mathcal{S}_t\}_{t=1}^T,\;\{\mathcal{A}_t(s)\}_{t=1}^T,\;\{P_t\}_{t=1}^T,\;\{r_t\}_{t=1}^T\right)$

where:

$\mathcal{S}_t$ : state space at turn $t$
$\mathcal{A}_t(s)$ : admissible action set at state $s$ in turn $t$
$P_t(s'|s,a)$ : transition kernel, i.e., probability law over $s'$, conditioned on current state $s$ and action $a$
$r_t(s,a)$ : instantaneous reward or cost for $(s,a)$ at turn $t$

A policy $\pi = (\pi_1, \dots, \pi_T)$ specifies the agent’s action selection rule at each turn, where each decision rule $\pi_t$ maps states in $\mathcal{S}_t$ to admissible actions and may differ across turns, so the policy is generally non-stationary. The Bellman recursion for the value function (expected optimal reward-to-go) is

$V_t(s) = \max_{a \in \mathcal{A}_t(s)} \left\{ r_t(s,a) + \mathbb{E}_{s' \sim P_t(\cdot|s,a)}\left[ V_{t+1}(s') \right] \right\}, \quad t = T, \dots, 1, \qquad V_{T+1} \equiv 0$

with backward induction yielding an optimal policy that is deterministic and possibly time-varying (Morton et al., 26 Sep 2025, Chamie et al., 2015, Wang et al., 2019).
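The recursion can be illustrated with a minimal Python sketch for finite state and action sets. The dictionary layout, the zero terminal value, and the toy "lo"/"hi" problem are assumptions made for illustration; they do not come from the cited works.

```python
def backward_induction(P, r):
    """Finite-horizon backward induction.

    P[t][s][a] is a dict {next_state: probability} and r[t][s][a] a float,
    for turns t = 0, ..., T-1 (0-based). Returns (values, policy), where
    values[t][s] is the optimal reward-to-go and policy[t][s] a maximizing action.
    """
    T = len(r)
    values, policy = [None] * T, [None] * T
    v_next = {}  # terminal condition: value 0 after the last turn (assumed)
    for t in reversed(range(T)):
        v_t, pi_t = {}, {}
        for s, actions in r[t].items():
            best_a, best_q = None, float("-inf")
            for a, reward in actions.items():
                # Expected value of the next turn under action a.
                expected = sum(p * v_next.get(s2, 0.0) for s2, p in P[t][s][a].items())
                q = reward + expected
                if q > best_q:
                    best_a, best_q = a, q
            v_t[s], pi_t[s] = best_q, best_a
        values[t], policy[t] = v_t, pi_t
        v_next = v_t
    return values, policy


# Toy two-turn problem (all names and numbers are illustrative).
P_turn = {
    "lo": {"wait": {"lo": 1.0}, "invest": {"hi": 0.8, "lo": 0.2}},
    "hi": {"wait": {"hi": 0.5, "lo": 0.5}},
}
r_turn = {
    "lo": {"wait": 0.0, "invest": -1.0},
    "hi": {"wait": 2.0},
}
values, policy = backward_induction([P_turn, P_turn], [r_turn, r_turn])
print(policy[0]["lo"], policy[1]["lo"])  # invest wait
```

In this toy instance the model is the same at both turns, yet the computed policy changes with the turn index: investing is worthwhile only while a later turn remains to collect the payoff.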

2. Extensions: Constraints, Non-Stationarity, and Structure

State Constraints and CMDPs

When trajectory or state-visitation constraints are imposed (e.g., $x_t \le d$ for occupation measure $x_t$ and a bound vector $d$), the policy space is convexified: solutions generally require time-dependent randomized policies, and the optimality of deterministic policies is lost (Chamie et al., 2015). Linear programming and duality approaches are used to synthesize non-stationary randomized policies for such finite-horizon CMDPs, with explicit lower bounds on performance. The constraints couple the decision rules across time, which complicates the backward-induction solution.
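A small sketch can show why randomization matters under a state-distribution cap. It propagates the state distribution under a randomized policy and checks an upper bound at every turn. It does not implement the LP synthesis itself, and the two-state "safe"/"risky" example, the function names, and the cap values are hypothetical.

```python
def propagate(x0, policy, P):
    """Return the state distributions [x_1, ..., x_{T+1}] under a randomized policy.

    policy[t][s] is a dict {action: probability}; P[t][s][a] is a dict
    {next_state: probability}. Both layouts are illustrative assumptions.
    """
    xs = [dict(x0)]
    for t, pi_t in enumerate(policy):
        x_next = {}
        for s, mass in xs[-1].items():
            for a, pa in pi_t[s].items():
                for s2, p in P[t][s][a].items():
                    x_next[s2] = x_next.get(s2, 0.0) + mass * pa * p
        xs.append(x_next)
    return xs


def satisfies_cap(xs, d, tol=1e-12):
    """Check x_t(s) <= d(s) for every turn t and every capped state s."""
    return all(x.get(s, 0.0) <= cap + tol for x in xs for s, cap in d.items())


# Toy model: "go" switches state, "stay" keeps it (deterministic transitions).
P_turn = {
    "safe": {"stay": {"safe": 1.0}, "go": {"risky": 1.0}},
    "risky": {"stay": {"risky": 1.0}, "go": {"safe": 1.0}},
}
P = [P_turn, P_turn]
x0 = {"safe": 1.0}
d = {"risky": 0.5}  # at most half of the probability mass may sit in "risky"

deterministic = [{"safe": {"go": 1.0}, "risky": {"stay": 1.0}}] * 2
randomized = [{"safe": {"stay": 0.5, "go": 0.5}, "risky": {"go": 1.0}}] * 2

xs_det = propagate(x0, deterministic, P)
xs_rand = propagate(x0, randomized, P)
print(satisfies_cap(xs_det, d), satisfies_cap(xs_rand, d))  # False True
```

Here the deterministic rule moves all mass into the capped state and violates the bound, while the randomized rule splits the mass and stays feasible at every turn.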

Decision-Dependent Transitions and Learning

Policy-graph representations and multi-stage stochastic programming generalize the MDP backbone by enabling node-based transitions, decision-dependent uncertainty, and embedded learning (Bayesian update of model beliefs), producing dynamic programming equations over graphs with mixed continuous-discrete states (Morton et al., 26 Sep 2025). Cut-based approximation methods (e.g., SDDP) are required for large-scale or nonconvex instances.
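To make the idea of a cut-based approximation concrete, the following sketch represents a convex cost-to-go function by the pointwise maximum of affine cuts. It shows only the cut representation, not SDDP itself: in an actual SDDP implementation the value and slope of each cut would come from solving a stage subproblem, whereas here they are taken from a known toy function $V(x) = x^2$. All names are hypothetical, and a minimization (convex) setting is assumed.

```python
def make_cut(x_trial, value, slope):
    """Affine cut tangent at x_trial, stored as (intercept, slope)."""
    return (value - slope * x_trial, slope)


def evaluate(cuts, x):
    """Lower approximation of a convex cost-to-go: maximum over all cuts."""
    return max(intercept + slope * x for intercept, slope in cuts)


# Toy convex cost-to-go V(x) = x^2 with derivative 2x (illustrative only).
def true_value(x):
    return x * x


trial_points = [-1.0, 0.0, 2.0]
cuts = [make_cut(x, true_value(x), 2.0 * x) for x in trial_points]

# The approximation is exact at trial points and never exceeds V elsewhere.
print(evaluate(cuts, 2.0), evaluate(cuts, 1.0))  # 4.0 0.0
```

Adding more trial points tightens the approximation from below, which is the mechanism that cut-based methods rely on.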

Environmental Non-Stationarity

To model exogenous temporal effects, the MDP can be augmented by an external process whose history alters transition dynamics. The resulting system may no longer be Markovian in original state variables. By augmenting the state to include event-history up to a finite lag, the process is rendered Markov again, and approximate optimality can be achieved by considering only recent exogenous events, with formal error–memory–sample complexity tradeoffs (Ayyagari et al., 2023).
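The augmentation step can be sketched as a sliding window over exogenous events. In this hypothetical inventory example, `base_step`, the event encoding (1 = demand shock), and the lag value are all assumptions made for illustration. The toy dynamics depend on exactly the last two events, so a lag of 2 is exact here, whereas in general truncating the history gives only an approximation.

```python
def augmented_step(aug_state, action, new_event, base_step, lag):
    """One transition of the lag-augmented process.

    aug_state is (state, history), where history holds the most recent
    `lag` exogenous events, oldest first.
    """
    state, history = aug_state
    next_state, reward = base_step(state, action, history)
    # Slide the window: append the newest event, drop anything older than `lag`.
    next_history = (history + (new_event,))[-lag:] if lag > 0 else ()
    return (next_state, next_history), reward


def base_step(stock, order, history):
    """Toy dynamics: demand doubles while a shock (event 1) is in the window."""
    demand = 2 if any(history) else 1
    available = stock + order
    sold = min(available, demand)
    return available - sold, float(sold)


aug_state, rewards = (3, ()), []
for event in [1, 0, 0, 0]:  # exogenous event stream
    aug_state, reward = augmented_step(aug_state, 1, event, base_step, lag=2)
    rewards.append(reward)
print(aug_state, rewards)  # (1, (0, 0)) [1.0, 2.0, 2.0, 1.0]
```

Because the pair (state, history) carries everything the toy dynamics depend on, a policy defined on the augmented state can be computed with standard MDP methods.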

3. Low-Dimensional and Sufficient Multi-Turn MDPs

In high-dimensional settings (e.g., mobile health applications), it becomes crucial to reduce the dimension of the state. Sufficient MDP reduction seeks a mapping $\phi: \mathcal{S} \to \mathbb{R}^q$, with $q \ll \dim(\mathcal{S})$, such that the process $\{(z_t, a_t, r_t)\}$ with $z_t = \phi(s_t)$ remains Markov, and an optimal policy for the reduced space induces optimality for the original system. This sufficiency is tested by conditional independence of next reward and next state from the full state, given $z_t$ and $a_t$, and can be efficiently achieved by alternating deep neural networks with feature selection penalties (Wang et al., 2017). Training objectives combine predictive loss with a group-lasso penalty for structured sparsity; any RL algorithm (Q-learning, etc.) can then be applied to the learned low-dimensional MDP.
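The penalized objective can be sketched as follows, assuming that the groups are the rows of a first-layer weight matrix, one row per input feature. That grouping, the function names, and the numbers are illustrative assumptions rather than details taken from Wang et al. (2017).

```python
import math


def group_lasso_penalty(W, lam):
    """lam times the sum of Euclidean norms of the rows of W (one row per feature)."""
    return lam * sum(math.sqrt(sum(w * w for w in row)) for row in W)


def objective(predictive_loss, W, lam):
    """Penalized training objective: predictive loss plus group-lasso term."""
    return predictive_loss + group_lasso_penalty(W, lam)


def selected_features(W, tol=1e-8):
    """Indices of features whose whole weight row has not been shrunk to zero."""
    return [j for j, row in enumerate(W)
            if math.sqrt(sum(w * w for w in row)) > tol]


# Three input features, two hidden units (illustrative weights).
W = [[3.0, 4.0], [0.0, 0.0], [0.0, 1.0]]
print(selected_features(W))  # [0, 2]
```

Because the penalty acts on whole rows, it tends to zero out every weight attached to an uninformative feature at once, which is what makes the learned summary a form of feature selection.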

4. Policy and Solution Methodologies

Algorithmic Approaches

Approach	Policy Class	Representation
Backward Induction	Deterministic (finite-horizon)	Dynamic Programming
LP-based (CMDP)	Non-stationary randomized	Occupation measures
SDDP	Piecewise affine/linear	Cut-approximations (Morton et al., 26 Sep 2025)
Deep RL on $\phi$-reduced (sufficient) MDP	Deterministic or stochastic	Function approximation (Wang et al., 2017)

Backward induction on the Bellman equation yields exact optimal policies in the classical unconstrained case. For state-trajectory-constrained or decision-dependent problems, convex (or convex-relaxed) optimization and approximation methods are required. Structure-aware methods such as the Maturing MDP (MMDP) framework introduce stage-wise information and action-set symmetries, enhance sample efficiency, and produce policies aligned with operational deadlines (Liu et al., 17 Jun 2026).

5. Knowledge Representation and Elaboration Tolerance

Action languages (such as the decision-theoretic extension of pBC+) encode MDPs declaratively, supporting succinct representation and elaboration tolerance. This enables the automatic construction of multi-turn MDPs, including reward structure, from high-level action descriptions (Wang et al., 2019). The system pbcplus2mdp generates transition and reward matrices from a logical program and interfaces with standard MDP solvers, providing formal guarantees (policy equivalence theorems) between the logical and probabilistic semantics.

6. Empirical Characterization and Applications

Empirical evaluation spans diverse domains:

Mobile health interventions: Sufficient-reduced MDPs enable tractable, interpretable, high-quality policies on $q$-dimensional summaries (with $q$ much smaller than the original state dimension) for real-time decision support (Wang et al., 2017).
Constrained multi-agent planning: LP-based multi-turn CMDPs maintain joint state-distribution constraints for safety and resource allocation (Chamie et al., 2015).
Large-scale supply chain and inventory: Policy-graph MDPs with SDDP enable optimization under non-Markovian uncertainty and Bayesian learning (Morton et al., 26 Sep 2025).
Sequential cash management: Maturing MDPs with expiring-action abstraction reduce sample complexity and improve RL convergence as complexity scales (Liu et al., 17 Jun 2026).
Automated knowledge representation: The pBC+ action-language framework enables elaboration-tolerant encoding and rapid policy extraction even as the number of domain objects grows (scaling bottleneck arises in logical inference, not in MDP solution) (Wang et al., 2019).

7. Theoretical Guarantees, Limitations, and Open Challenges

Multi-turn MDPs admit optimal policies via dynamic programming when all standard assumptions hold. Augmentation to handle constraints, external non-stationarity, or decision-dependent uncertainty introduces complexity, often requiring randomized, non-stationary, or approximate policies. Cut-based, duality-based, or function-approximation algorithms supply practical methods with explicit bounds on suboptimality, sample complexity, and computational tradeoffs (Wang et al., 2017, Chamie et al., 2015, Ayyagari et al., 2023, Morton et al., 26 Sep 2025). A key limitation is scalability to high-cardinality state/action spaces; dimensionality reduction, structured representations, and sufficient-statistic learning mitigate but do not eliminate this challenge.

A plausible implication is that continued research in structure-exploiting algorithms, automated knowledge compilation, and scalable function approximation will remain central to further progress in multi-turn MDP methodology.