Finite-Horizon Markov Decision Process

Updated 17 June 2026

Finite-horizon MDPs are discrete-time stochastic models defined over a fixed number of stages, used to optimize target reachability and cumulative rewards.
They employ dynamic programming with Bellman recursion to calculate nonstationary, counter-based optimal policies over a finite time horizon.
Intrinsic memory and periodicity bounds highlight the complexity of constructing scalable controllers and achieving high-precision outcomes in practical applications.

A finite-horizon Markov Decision Process (MDP) is a discrete-time stochastic control framework in which the decision-maker operates over a fixed sequence of time steps, rather than an indefinite or infinite horizon. The objective is typically to optimize the probability of attaining designated goals, expected cumulative rewards, or related objectives within the specified horizon. Finite-horizon MDPs are ubiquitous in stochastic planning, operations research, reinforcement learning, and probabilistic verification.

1. Formal Definition and Model Construction

A finite-horizon MDP is specified as the tuple $(S, A, P, s_0, F, H)$, where:

$S$ is a finite set of $n$ states
$A$ is a finite set of actions, with $A_s\subseteq A$ denoting admissible actions in state $s$
$P: S \times A \times S \to [0,1]$ is the transition probability function, so that $P(s,a,s')$ is the probability of transitioning from $s$ to $s'$ under action $a$
$s_0 \in S$ is the initial state
$F \subseteq S$ is a set of target (goal) states
$H \in \mathbb{N}$ is the finite horizon

At each stage $t = 0, 1, \dots, H-1$, if the system is in state $s_t$, the controller chooses $a_t \in A_{s_t}$, and the next state $s_{t+1}$ is drawn from $P(s_t, a_t, \cdot)$. If $s_t \in F$, the process is absorbed in $F$. The canonical reachability objective is to maximize $\Pr(\exists\, t \le H : s_t \in F)$ (Chatterjee et al., 2012). More general reward-based objectives can also be defined, using cumulative rewards $\sum_{t=0}^{H-1} r_t(s_t, a_t)$ and terminal rewards $r_H(s_H)$.
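The model above can be sketched as a small data structure. This is a minimal illustration, not a reference implementation: the class name `FiniteHorizonMDP`, its field names, and the two-state example are assumptions made for the example.

```python
from dataclasses import dataclass
import math
import random


@dataclass
class FiniteHorizonMDP:
    """Hypothetical container for the tuple (S, A, P, s_0, F, H)."""
    states: set      # S
    actions: dict    # state -> list of admissible actions A_s
    P: dict          # (s, a) -> {s': probability}
    s0: str          # initial state
    targets: set     # F
    horizon: int     # H

    def validate(self):
        """Check that the tuple describes a well-formed model."""
        assert self.s0 in self.states and self.targets <= self.states
        assert self.horizon >= 0
        for s in self.states:
            assert self.actions[s], f"no admissible action in {s}"
            for a in self.actions[s]:
                dist = self.P[(s, a)]
                assert set(dist) <= self.states
                assert math.isclose(sum(dist.values()), 1.0)

    def reaches_target(self, policy, rng):
        """Simulate at most H stages; return True if F is visited."""
        s = self.s0
        for t in range(self.horizon):
            if s in self.targets:
                return True  # absorbed in F
            a = policy(t, s)
            succ = list(self.P[(s, a)])
            s = rng.choices(succ, weights=[self.P[(s, a)][x] for x in succ])[0]
        return s in self.targets


# Illustrative example: each attempt reaches the goal with probability 0.5.
mdp = FiniteHorizonMDP(
    states={"start", "goal"},
    actions={"start": ["try"], "goal": ["stay"]},
    P={("start", "try"): {"start": 0.5, "goal": 0.5},
       ("goal", "stay"): {"goal": 1.0}},
    s0="start", targets={"goal"}, horizon=3,
)
mdp.validate()
```

In this example the probability of reaching the goal within three stages is $1 - 0.5^3 = 0.875$, which repeated calls to `reaches_target` with a seeded `random.Random` should approximate.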

2. Value Functions, Bellman Recursion, and Optimal Policies

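The optimal value function $V_t^*(s)$, where $r_t(s,a)$ denotes the stage reward and $r_H(s)$ the terminal reward, is defined recursively by

$$V_H^*(s) = r_H(s),$$

$$V_t^*(s) = \max_{a \in A_s} \Big[ r_t(s,a) + \sum_{s' \in S} P(s,a,s')\, V_{t+1}^*(s') \Big]$$

for $t = H-1, \dots, 0$. The corresponding optimal policy $\pi^* = (\pi_0^*, \dots, \pi_{H-1}^*)$ satisfies $\pi_t^*(s) \in \arg\max_{a \in A_s} \big[ r_t(s,a) + \sum_{s' \in S} P(s,a,s')\, V_{t+1}^*(s') \big]$, with $V_0^*(s_0)$ the optimal value from the initial state (Chatterjee et al., 2012).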

For reachability, rewards are typically indicators for entering $F$, and for general total or average reward, the terminal term $r_H$ captures the terminal payoff. The time dependence of the policy (nonstationarity) is intrinsic: optimal actions generally depend on the remaining steps $H - t$ (Chen et al., 17 Nov 2025, Chatterjee et al., 2012).
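The recursion can be sketched for the reachability objective as follows. This is a minimal illustration under stated assumptions: the function name `backward_induction`, the nested-dictionary encoding of $P$, and the three-state example are invented for the example, and target states are treated as absorbing with value 1.

```python
def backward_induction(P, F, H):
    """Return (V, pi) for the reachability objective.

    V[t][s] is the maximal probability of visiting F from s within the
    H - t remaining stages; pi[t][s] is an action attaining it.
    P[s][a] maps successor states to probabilities.
    """
    states = list(P)
    V = [dict() for _ in range(H + 1)]
    pi = [dict() for _ in range(H)]
    for s in states:
        V[H][s] = 1.0 if s in F else 0.0  # terminal condition
    for t in range(H - 1, -1, -1):        # backward over stages
        for s in states:
            if s in F:
                V[t][s] = 1.0             # already absorbed in F
                continue
            best_a, best_q = None, -1.0
            for a, dist in P[s].items():
                q = sum(p * V[t + 1][s2] for s2, p in dist.items())
                if q > best_q:
                    best_a, best_q = a, q
            V[t][s] = best_q
            pi[t][s] = best_a
    return V, pi


# Illustrative example: "safe" may be retried, "risky" may fall into a trap.
P = {
    "s0": {"safe": {"s0": 0.5, "goal": 0.5},
           "risky": {"goal": 0.9, "trap": 0.1}},
    "goal": {"stay": {"goal": 1.0}},
    "trap": {"stay": {"trap": 1.0}},
}
F = {"goal"}
V, pi = backward_induction(P, F, H=2)
```

With $H = 2$ the computed policy plays `safe` at stage 0 and `risky` at stage 1, for a value of 0.95 from `s0`. The best action changes with the number of remaining steps, which is the nonstationarity described above.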

3. Strategy Complexity and Memory Requirements

Strategy complexity in finite-horizon MDPs is governed by two primary bounds (Chatterjee et al., 2012):

For any $\epsilon > 0$, counter-based strategies need at most $\log\log(1/\epsilon) + n + 1$ bits of memory to be $\epsilon$-optimal, and memory of size $\Omega(\log\log(1/\epsilon) + n)$ is required in the worst case.
This memory is necessary both to distinguish long paths (requiring $n$ bits) and to time actions to $\epsilon$-precision ($\log\log(1/\epsilon)$ bits).

Counter-based strategies, which update their memory state as a deterministic function of stage count, suffice for $\epsilon$-optimality. For exact optimality, the minimal period of any finite-memory optimal strategy may be as large as $2^{\Omega(\sqrt{n \log n})}$ (Chatterjee et al., 2012).

Complexity Metric	Upper Bound	Lower Bound
Bits to $\epsilon$-optimality	$\log\log(1/\epsilon) + n + 1$	$\Omega(\log\log(1/\epsilon) + n)$
Period of exact-optimal strategy	$A_s\subseteq A$ 6	$A_s\subseteq A$ 7

This establishes tight memory and period bounds for finite-horizon reachability MDPs.
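To make the notion of counter memory concrete, the sketch below models a strategy whose only memory is a saturating stage counter of a given bit width. It is an illustrative assumption about how such a controller could be coded, not the construction from the cited work; the class `CounterStrategy` and its fields are hypothetical.

```python
class CounterStrategy:
    """Strategy whose only memory is a b-bit saturating stage counter."""

    def __init__(self, bits, table, default):
        self.cap = 2 ** bits - 1  # largest value the counter can hold
        self.table = table        # (counter value, state) -> action
        self.default = default    # state -> fallback action
        self.counter = 0

    def act(self, state):
        action = self.table.get((self.counter, state), self.default[state])
        # The memory update depends only on the stage count, not on the state.
        self.counter = min(self.counter + 1, self.cap)
        return action


# One bit of memory separates stage 0 from all later stages.
strategy = CounterStrategy(bits=1,
                           table={(0, "s0"): "safe"},
                           default={"s0": "risky"})
choices = [strategy.act("s0") for _ in range(3)]
```

With a single bit the controller can play one action at stage 0 and another afterwards. Distinguishing more stages requires a wider counter, which is the quantity the bounds above measure.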

4. Structural Properties and Periodicity of Strategies

A counter-based strategy $\sigma$ is characterized by an eventually periodic memory sequence $(m_t)_{t \ge 0}$: after a finite pre-period, controls repeat with some period $p$. There are explicit constructions (combining clocked "gadgets" controlled by primes) that necessitate sub-exponential periods $2^{\Omega(\sqrt{n \log n})}$ for optimality (Chatterjee et al., 2012). This demonstrates that even for moderate $n$, optimal control can entail highly complex periodic switching.
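The arithmetic behind prime-controlled clocks can be illustrated with a short sketch. It assumes, purely for illustration, that each gadget is a cyclic clock whose length is a distinct prime and that a budget of $n$ states is spent greedily on the smallest primes. It does not reproduce the construction from the cited work, and the function names are hypothetical.

```python
import math


def primes_with_sum_at_most(n):
    """Greedily collect the smallest primes whose sum stays <= n."""
    primes, total, cand = [], 0, 2
    while total + cand <= n:
        # Every smaller prime is already in the list, so trial division works.
        if all(cand % p for p in primes):
            primes.append(cand)
            total += cand
        cand += 1
    return primes


def first_joint_reset(lengths):
    """Smallest t > 0 at which all cyclic clocks are back at zero."""
    t = 1
    while any(t % p for p in lengths):
        t += 1
    return t


clocks = primes_with_sum_at_most(10)   # uses 2 + 3 + 5 = 10 states
period = math.prod(clocks)             # clocks realign only after 30 steps
```

Clocks with pairwise coprime lengths realign only after the product of their lengths, so a state budget of 10 already yields a joint period of 30. The gap between budget and period widens quickly as $n$ grows.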

5. Algorithms and Computational Methods

Backward dynamic programming (value iteration) is the standard method for solving finite-horizon MDPs (Chatterjee et al., 2012). For $H$ steps, $n$ states, and $|A|$ actions, the computational complexity is $O(n^2 |A|)$ per iteration. In cases requiring explicit strategy extraction, one reconstructs the nonstationary optimal policy from the stored value functions $V_t^*$.

For more complex specifications (e.g., vector-valued objectives or constraints), additional linear programming or bilinear programming techniques are introduced. Notably, linear program formulations for occupancy measures cover both standard and multi-objective formulations (Mifrani et al., 19 Feb 2025). In the context of strategy complexity, the focus is strictly on counter-based strategies, whose memory and periodicity requirements sharply govern implementation complexity (Chatterjee et al., 2012).
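One common way to write the occupancy-measure linear program mentioned above is sketched below for the single-objective case. The symbols $x_t(s,a)$ (the probability of being in state $s$ and playing $a$ at stage $t$), $x_H(s)$, $r_t$, and $r_H$ are notation assumed here for illustration and are not taken from the cited work.

```latex
\begin{align*}
\max_{x \ge 0} \quad
  & \sum_{t=0}^{H-1} \sum_{s \in S} \sum_{a \in A_s} r_t(s,a)\, x_t(s,a)
    + \sum_{s \in S} r_H(s)\, x_H(s) \\
\text{s.t.} \quad
  & \sum_{a \in A_s} x_0(s,a) = \mathbf{1}[s = s_0]
    && \forall s \in S, \\
  & \sum_{a \in A_{s'}} x_{t+1}(s',a)
    = \sum_{s \in S} \sum_{a \in A_s} P(s,a,s')\, x_t(s,a)
    && \forall s' \in S,\ t = 0, \dots, H-2, \\
  & x_H(s') = \sum_{s \in S} \sum_{a \in A_s} P(s,a,s')\, x_{H-1}(s,a)
    && \forall s' \in S.
\end{align*}
```

A feasible $x$ induces a randomized nonstationary policy by playing $a$ in state $s$ at stage $t$ with probability $x_t(s,a) / \sum_{a'} x_t(s,a')$ whenever the denominator is positive. Multi-objective variants typically add linear constraints on further reward sums over the same variables.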

6. Bounds, Limitations, and Reachability MDPs

The structural results above reveal limits on memory and periodicity that no algorithmic improvement can bypass: in the worst case, $\epsilon$-optimal counter-based strategies in finite-horizon MDPs need $\Omega(\log\log(1/\epsilon) + n)$ bits of memory, and the period of exact-optimal strategies has a provable sub-exponential lower bound (Chatterjee et al., 2012). These findings underscore intrinsic trade-offs in policy synthesis for high-precision planning or systems with large state spaces.

For infinite-horizon reachability objectives, where no step bound is imposed, memoryless strategies suffice (Condon '92). Imposing a finite horizon, however, sharply increases the required memory and period, as the matching lower bounds demonstrate.

7. Relevance and Implications in Modern Research

These results have direct consequences for probabilistic verification, controller synthesis in robotics, and stochastic games. The explicit dependence of memory requirements on both problem size and the precision parameter $\epsilon$ serves as a practical and theoretical guide for scalable controller and policy synthesis in finite-horizon environments. The sub-exponential lower bound on the period further implies that, for certain MDPs, any exact-optimal controller must follow a schedule that runs for a very long time before it repeats (Chatterjee et al., 2012).

These bounds also set optimality benchmarks for current and future research into symbolic, memory-efficient, and computationally tractable planning in stochastic finite-horizon settings.