Finite-Horizon Markov Decision Process

Updated 17 June 2026
  • Finite-horizon MDPs are discrete-time stochastic models defined over a fixed number of stages, used to optimize target reachability and cumulative rewards.
  • They employ dynamic programming with Bellman recursion to calculate nonstationary, counter-based optimal policies over a finite time horizon.
  • Intrinsic memory and periodicity bounds highlight the complexity of constructing scalable controllers and achieving high-precision outcomes in practical applications.

A finite-horizon Markov Decision Process (MDP) is a discrete-time stochastic control framework in which the decision-maker operates over a fixed sequence of time steps, rather than an indefinite or infinite horizon. The objective is typically to optimize the probability of attaining designated goals, expected cumulative rewards, or related objectives within the specified horizon. Finite-horizon MDPs are ubiquitous in stochastic planning, operations research, reinforcement learning, and probabilistic verification.

1. Formal Definition and Model Construction

A finite-horizon MDP is specified as the tuple $(S, A, P, s_0, F, H)$, where:

  • $S$ is a finite set of $n$ states
  • $A$ is a finite set of actions, with $A_s \subseteq A$ denoting the admissible actions in state $s$
  • $P: S \times A \times S \to [0,1]$ is the transition probability function, so that $P(s,a,s')$ is the probability of transitioning from $s$ to $s'$ under action $a$
  • $s_0 \in S$ is the initial state
  • $F \subseteq S$ is a set of target (goal) states
  • $H \in \mathbb{N}$ is the finite horizon

At each stage $t \in \{0, \dots, H-1\}$, if the system is in state $s_t$, the controller chooses an action $a_t \in A_{s_t}$, and the next state $s_{t+1}$ is drawn from $P(s_t, a_t, \cdot)$. If $s_t \in F$, the process is absorbed in $F$. The canonical reachability objective is to maximize $\Pr[\exists\, t \le H : s_t \in F]$, the probability of reaching a target state within $H$ steps (Chatterjee et al., 2012). More general reward-based objectives can also be defined, using cumulative per-stage rewards and a terminal reward.
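The stage-by-stage dynamics described above can be made concrete with a short simulation. The following is a minimal sketch, not taken from the cited work: the dictionary encoding of $P$, the function name `reaches_target`, and the toy model with states `start` and `goal` are all illustrative assumptions.

```python
import random

def reaches_target(P, policy, s0, targets, horizon, rng):
    """Simulate one episode; True if a target state is visited within `horizon` steps."""
    s = s0
    for t in range(horizon):
        if s in targets:
            return True  # targets are absorbing, so the episode can stop early
        a = policy(t, s)  # the policy may depend on the stage t
        successors = P[(s, a)]  # assumed encoding: {next_state: probability}
        s = rng.choices(list(successors), weights=list(successors.values()))[0]
    return s in targets

# Hypothetical toy model: a single action that reaches the goal with probability 0.3.
P = {("start", "try"): {"goal": 0.3, "start": 0.7}}
policy = lambda t, s: "try"

rng = random.Random(0)
runs = 20_000
hits = sum(reaches_target(P, policy, "start", {"goal"}, 3, rng) for _ in range(runs))
estimate = hits / runs  # Monte Carlo estimate; the exact value is 1 - 0.7**3 = 0.657
```

With horizon 3 the exact reachability probability is $1 - 0.7^3 = 0.657$, so the Monte Carlo estimate should land close to that value.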

2. Value Functions, Bellman Recursion, and Optimal Policies

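The optimal value function $V_t^*(s)$, the best achievable expected payoff from state $s$ at stage $t$, is defined recursively by

$$V_H^*(s) = g(s),$$

$$V_t^*(s) = \max_{a \in A_s} \Big[ r(s,a) + \sum_{s' \in S} P(s,a,s')\, V_{t+1}^*(s') \Big]$$

for $t = H-1, \dots, 0$, where $r(s,a)$ denotes the stage reward and $g(s)$ the terminal reward. The corresponding optimal policy $\pi^* = (\pi_0^*, \dots, \pi_{H-1}^*)$ satisfies $\pi_t^*(s) \in \arg\max_{a \in A_s} \big[ r(s,a) + \sum_{s' \in S} P(s,a,s')\, V_{t+1}^*(s') \big]$, with $V_0^*(s_0)$ the optimal value from the initial state (Chatterjee et al., 2012).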
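A backward recursion of this kind can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions rather than the cited authors' implementation: the array layout `P[a, s, s2]`, the stage-reward array `r`, the terminal-reward vector `g`, and the simplification that every action is admissible in every state are all assumptions made for the example.

```python
import numpy as np

def backward_induction(P, r, g, H):
    """Finite-horizon backward recursion.

    P: array of shape (A, S, S), P[a, s, s2] = transition probability.
    r: array of shape (S, A), stage rewards.
    g: array of shape (S,), terminal rewards.
    Returns V with shape (H + 1, S) and a stage-indexed policy pi with shape (H, S).
    """
    n_states = P.shape[1]
    V = np.zeros((H + 1, n_states))
    pi = np.zeros((H, n_states), dtype=int)
    V[H] = g
    for t in range(H - 1, -1, -1):
        # Q[s, a] = r[s, a] + sum over s2 of P[a, s, s2] * V[t + 1, s2]
        Q = r + (P @ V[t + 1]).T
        V[t] = Q.max(axis=1)
        pi[t] = Q.argmax(axis=1)
    return V, pi

# Hypothetical model: state 0 = start, 1 = goal (absorbing), 2 = sink (absorbing).
P = np.array([
    [[0.7, 0.3, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],  # action 0: retryable attempt
    [[0.0, 0.5, 0.5], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],  # action 1: one-shot gamble
])
r = np.zeros((3, 2))            # reachability: no stage rewards
g = np.array([0.0, 1.0, 0.0])   # terminal reward 1 only in the goal state
V, pi = backward_induction(P, r, g, H=2)
# V[0, 0] is 0.65: attempt first (0.3), then gamble if still in state 0 (0.7 * 0.5)
```

Encoding reachability as a terminal indicator works here because the goal state is absorbing, so being in the goal at stage $H$ is equivalent to having reached it at some earlier stage.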

For reachability, rewards are typically indicators for entering $F$, and for general total or average reward, the terminal term of the recursion captures the terminal payoff. The time dependence of the policy (nonstationarity) is intrinsic: optimal actions generally depend on the number of remaining steps $H - t$ (Chen et al., 17 Nov 2025, Chatterjee et al., 2012).
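A small example makes the dependence on the remaining steps visible. The sketch below is a hypothetical four-state model (the state and action names, the dictionary encoding, and the function `best_actions` are assumptions for illustration): from state `s`, a `safe` action reaches the goal with certainty but needs two steps, while a `risky` action needs one step and succeeds half the time.

```python
def best_actions(P, targets, state, max_remaining):
    """For k = 1..max_remaining, return the action at `state` that maximizes
    the probability of reaching `targets` within k remaining steps."""
    states = {s for (s, _a) in P} | {s2 for dist in P.values() for s2 in dist}
    W = {s: 1.0 if s in targets else 0.0 for s in states}  # zero steps remaining
    choice = {}
    for k in range(1, max_remaining + 1):
        W_next = {}
        for s in states:
            if s in targets:
                W_next[s] = 1.0  # targets are absorbing
                continue
            # Value of each available action with k steps remaining.
            q = {a: sum(p * W[s2] for s2, p in dist.items())
                 for (s1, a), dist in P.items() if s1 == s}
            W_next[s] = max(q.values()) if q else 0.0  # no actions: dead end
            if s == state and q:
                choice[k] = max(q, key=q.get)
        W = W_next
    return choice

# Hypothetical model: "sink" has no actions and is treated as a dead end.
P = {
    ("s", "safe"): {"mid": 1.0},
    ("s", "risky"): {"goal": 0.5, "sink": 0.5},
    ("mid", "step"): {"goal": 1.0},
}
choice = best_actions(P, {"goal"}, "s", 3)
# choice == {1: "risky", 2: "safe", 3: "safe"}
```

With one step left only the gamble can succeed, whereas with two or more steps the certain route is better, so no single stationary choice at `s` is optimal for every stage.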

3. Strategy Complexity and Memory Requirements

Strategy complexity in finite-horizon MDPs is governed by matching upper and lower bounds on memory (Chatterjee et al., 2012):

  • For any $\varepsilon > 0$, $\varepsilon$-optimality can be achieved by a counter-based strategy using at most $\log\log(1/\varepsilon) + n + 1$ bits of memory, and memory of size $\Omega(\log\log(1/\varepsilon) + n)$ is required in the worst case.
  • This memory is necessary to distinguish both long paths (requiring $\Omega(n)$ bits) and precise timing to $\varepsilon$-precision ($\Omega(\log\log(1/\varepsilon))$ bits).

Counter-based strategies, which update their memory state as a deterministic function of the stage count, suffice for $\varepsilon$-optimality. For exact optimality, the minimal period of any finite-memory optimal strategy may be as large as $2^{\Omega(\sqrt{n \log n})}$ (Chatterjee et al., 2012).

| Complexity Metric | Upper Bound | Lower Bound |
|---|---|---|
| Bits to $\varepsilon$-optimality | $\log\log(1/\varepsilon) + n + 1$ | $\Omega(\log\log(1/\varepsilon) + n)$ |
| Period of exact-optimal strategy | … | $2^{\Omega(\sqrt{n \log n})}$ |

This establishes asymptotically tight memory bounds, together with a sub-exponential lower bound on the period, for finite-horizon reachability MDPs.

4. Structural Properties and Periodicity of Strategies

A counter-based strategy $\sigma$ is characterized by an eventually periodic memory sequence $m_0, m_1, m_2, \dots$: after a finite pre-period, controls repeat with some period $p$. There are explicit constructions (combining clocked "gadgets" controlled by primes) that necessitate sub-exponential periods $2^{\Omega(\sqrt{n \log n})}$ for optimality (Chatterjee et al., 2012). This demonstrates that even for moderate $n$, optimal control can entail highly complex periodic switching.
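The two ingredients mentioned here, an eventually periodic counter and clocks with prime cycle lengths, can be illustrated with simple arithmetic. The sketch below does not reproduce the gadget construction of the cited paper; the function names and the chosen primes are assumptions. It only shows why combining small prime-length cycles forces a long joint period: cycles of pairwise coprime lengths realign only after the product of their lengths.

```python
from math import lcm

def counter_memory(t, pre_period, period):
    """Memory state after t steps of an eventually periodic counter.

    The counter runs through pre_period distinct states once, then cycles
    through `period` further states forever.
    """
    if t < pre_period:
        return t
    return pre_period + (t - pre_period) % period

def combined_period(cycle_lengths):
    """Number of steps until independent cyclic clocks all realign."""
    return lcm(*cycle_lengths)

# A counter with pre-period 3 and period 4 uses 3 + 4 = 7 memory states.
trace = [counter_memory(t, 3, 4) for t in range(12)]
# trace == [0, 1, 2, 3, 4, 5, 6, 3, 4, 5, 6, 3]

# Clocks with prime lengths use 2 + 3 + 5 + 7 = 17 states in total,
# yet their joint behaviour only repeats every 2 * 3 * 5 * 7 = 210 steps.
primes = [2, 3, 5, 7]
total_states = sum(primes)          # 17
joint_period = combined_period(primes)  # 210
```

The gap between the sum of the cycle lengths and their product grows quickly as more primes are added, which is the intuition behind periods that are far larger than the number of states.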

5. Algorithms and Computational Methods

Backward dynamic programming (value iteration) is the standard method for solving finite-horizon MDPs (Chatterjee et al., 2012). For $H$ steps, $n$ states, and $|A|$ actions, the computational complexity is $O(n^2 |A|)$ per iteration, for a total of $O(H n^2 |A|)$. In cases requiring explicit strategy extraction, one reconstructs the nonstationary optimal policy from the stored value functions $V_t^*$.

For more complex specifications (e.g., vector-valued objectives or constraints), additional linear programming or bilinear programming techniques are introduced. Notably, linear program formulations for occupancy measures cover both standard and multi-objective formulations (Mifrani et al., 19 Feb 2025). In the context of strategy complexity, the focus is strictly on counter-based strategies, whose memory and periodicity requirements sharply govern implementation complexity (Chatterjee et al., 2012).

6. Bounds, Limitations, and Reachability MDPs

The structural results detailed above reveal fundamental limits in memory and periodicity that cannot be bypassed by algorithmic improvements: in the worst case, $\varepsilon$-optimal policies in finite-horizon MDPs cannot be implemented with fewer than $\Omega(\log\log(1/\varepsilon) + n)$ bits of memory, and the period of exact-optimal strategies has a provable sub-exponential lower bound (Chatterjee et al., 2012). These findings underscore intrinsic trade-offs in policy synthesis for high-precision planning or systems with large state spaces.

For reachability objectives over an infinite horizon, memoryless strategies suffice in the unconstrained case (Condon '92). Imposing a finite horizon, however, drastically increases the required memory and periodicity, as the lower bounds above demonstrate.

7. Relevance and Implications in Modern Research

These results have direct consequences for probabilistic verification, controller synthesis in robotics, and stochastic games. The explicit dependence of memory requirements on both the problem size $n$ and the precision parameter $\varepsilon$ serves as a practical and theoretical guide for scalable controller and policy synthesis in finite-horizon environments. The sub-exponential periodicity lower bound further implies that, for certain MDPs, any exact-optimal controller must implement a schedule that runs for an extremely long time before it repeats (Chatterjee et al., 2012).

These bounds also set optimality benchmarks for current and future research into symbolic, memory-efficient, and computationally tractable planning in stochastic finite-horizon settings.
