
Finite-Horizon Episodic MDPs

Updated 21 April 2026
  • Finite-horizon episodic MDPs are mathematical models for sequential decision-making over a fixed number of time steps with time-dependent transitions and rewards.
  • They rely on nonstationary policies and backward dynamic programming (Bellman recursion) to optimize the expected cumulative reward over the finite horizon.
  • Applications include reinforcement learning, resource allocation, and risk-sensitive control, employing techniques like occupancy measure LPs and bilinear programming.

A finite-horizon episodic Markov Decision Process (MDP) is a mathematical framework for modeling sequential decision problems in which an agent interacts with an environment over repeated episodes, each consisting of a fixed, finite number of time steps (the horizon). Each episode consists of a sequence of decisions, transitions, and rewards, and the agent's objective is to optimize an expected sum of rewards or costs, possibly subject to constraints. The finite horizon necessitates time-dependent (nonstationary) policies and sample-complexity and algorithmic considerations distinct from those of infinite-horizon or continuing MDPs.

1. Formal Structure and Policy Classes

A finite-horizon episodic MDP is described by the tuple $(S, A, H, (P_t)_{t=0}^{H-1}, (r_t)_{t=0}^{H-1}, p_0)$, where:

  • $S$: finite state space.
  • $A$: finite action space.
  • $H$: integer-valued horizon, i.e., the episode ends after $H$ steps.
  • $P_t(s' \mid s, a)$: time- (stage-) dependent transition kernels.
  • $r_t(s, a)$ (or $r_t(s, a, s')$): possibly time-dependent reward or cost.
  • $p_0$: initial state distribution.

A nonstationary policy $\pi = (\pi_0, \ldots, \pi_{H-1})$, with each $\pi_t(\cdot \mid s)$ a distribution over $A$, governs the agent's choices at each time step, potentially allowing randomized (stochastic) actions. Trajectories consist of $(s_0, a_0, s_1, a_1, \ldots, s_{H-1}, a_{H-1}, s_H)$, where $s_0 \sim p_0$ and each $s_{t+1} \sim P_t(\cdot \mid s_t, a_t)$.

The value function at stage $t$ under policy $\pi$ from state $s$ is defined as $V_t^\pi(s) = \mathbb{E}^\pi\left[\sum_{k=t}^{H-1} r_k(s_k, a_k) \,\middle|\, s_t = s\right]$, with optimal value $V_t^*(s) = \max_\pi V_t^\pi(s)$ (Chowdhury et al., 2021).
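
As a concrete illustration of these definitions, the following minimal Python sketch (the array layout and toy dimensions are assumptions made for the example, not taken from any cited paper) rolls out a single episode under a nonstationary randomized policy and accumulates the reward sum whose expectation defines $V_0^\pi$.

```python
import numpy as np

def rollout_episode(P, r, p0, policy, rng):
    """Sample one episode of a tabular finite-horizon MDP.

    P      : (H, S, A, S) array, P[t, s, a, s'] = P_t(s' | s, a)
    r      : (H, S, A) array,    r[t, s, a]     = r_t(s, a)
    p0     : (S,) initial state distribution
    policy : (H, S, A) array,    policy[t, s, a] = pi_t(a | s)
    """
    H, S, A, _ = P.shape
    s = rng.choice(S, p=p0)                    # s_0 ~ p_0
    total_reward = 0.0
    for t in range(H):
        a = rng.choice(A, p=policy[t, s])      # a_t ~ pi_t(. | s_t)
        total_reward += r[t, s, a]             # collect r_t(s_t, a_t)
        s = rng.choice(S, p=P[t, s, a])        # s_{t+1} ~ P_t(. | s_t, a_t)
    return total_reward

# Toy instance: horizon 3, two states, two actions, uniform dynamics and policy.
rng = np.random.default_rng(0)
H, S, A = 3, 2, 2
P = np.full((H, S, A, S), 1.0 / S)
r = rng.random((H, S, A))
p0 = np.array([0.5, 0.5])
policy = np.full((H, S, A), 1.0 / A)
print(rollout_episode(P, r, p0, policy, rng))
```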

2. Dynamic Programming and Bellman Recursion

The backbone of computation in finite-horizon episodic MDPs is backward dynamic programming (DP) using the stage-indexed Bellman equations:

$$Q_t^*(s, a) = r_t(s, a) + \sum_{s'} P_t(s' \mid s, a)\, V_{t+1}^*(s'), \qquad t = H-1, \ldots, 0,$$

$$V_t^*(s) = \max_{a \in A} Q_t^*(s, a), \qquad V_H^*(s) \equiv 0.$$

The finite horizon necessitates distinct (nonstationary) policies for each stage $t = 0, \ldots, H-1$, in contrast to the stationary structure of infinite-horizon discounted or average-reward MDPs (Chowdhury et al., 2021, Dann et al., 2015). The optimal policy at each stage chooses actions maximizing $Q_t^*(s, a)$.
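
A minimal backward-induction sketch of this recursion for the tabular case follows; the array layout mirrors the rollout example above and is an illustrative assumption rather than a specific cited implementation.

```python
import numpy as np

def finite_horizon_value_iteration(P, r):
    """Backward DP (stage-indexed Bellman recursion) for a tabular finite-horizon MDP.

    P : (H, S, A, S) stage-dependent transition kernels P_t(s' | s, a)
    r : (H, S, A)    stage-dependent rewards r_t(s, a)
    Returns optimal values V_t^*(s) and a greedy nonstationary policy pi_t(s).
    """
    H, S, A, _ = P.shape
    V = np.zeros((H + 1, S))                 # boundary condition V_H^* = 0
    pi = np.zeros((H, S), dtype=int)
    for t in reversed(range(H)):             # t = H-1, ..., 0
        # Q_t^*(s, a) = r_t(s, a) + sum_{s'} P_t(s' | s, a) V_{t+1}^*(s')
        Q = r[t] + P[t] @ V[t + 1]
        V[t] = Q.max(axis=1)                 # V_t^*(s) = max_a Q_t^*(s, a)
        pi[t] = Q.argmax(axis=1)             # greedy action at stage t
    return V, pi
```

The returned policy is deterministic but indexed by stage, reflecting the nonstationary optimal structure discussed above.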

3. Sample Complexity and Regret: PAC-Learning, Lower and Upper Bounds

Finite-horizon episodic MDPs have been central to the theory of sample complexity and regret minimization. Key results include:

  • PAC (Probably Approximately Correct) sample complexity for learning $\varepsilon$-optimal policies scales as $\tilde{O}(|S|^2 |A| H^2 / \varepsilon^2)$ episodes in the general (tabular) case, given known rewards and unknown transitions. Matching lower bounds are $\Omega(|S| |A| H^2 / \varepsilon^2)$ (Dann et al., 2015). This quadratic dependence on the horizon $H$ is optimal up to logarithmic factors.
  • A crucial technical advance was the Bellman-variance analysis, improving the horizon exponent from $H^3$ (typical for reductions to discounted MDPs or Hoeffding-style bounds) to $H^2$ via refined Bernstein/variance-matching concentration (Dann et al., 2015).
  • For constrained MDPs ("CMDPs"), where policies must satisfy constraints on secondary costs, occupancy measure LPs yield sample complexity of order $|S| |A| C H^2 / \varepsilon^2$ episodes (up to logarithmic factors), with $C$ the maximal possible number of successor states (Kalagarla et al., 2020).
  • For standard reward-only problems, recent non-constructive analyses establish a horizon-free upper bound of $\tilde{O}(|S| |A| / \varepsilon^2)$ episodes (for fixed accuracy $\varepsilon$ and confidence $\delta$), i.e., the number of episodes to PAC-optimality need not grow with $H$, under a bounded-total-reward normalization and at the price of computational intractability (Li et al., 2021).

4. Algorithmic Methodologies: LPs, Bilinear Programs, and Reinforcement Learning

Multiple algorithmic frameworks are available for finite-horizon MDPs and their constrained generalizations:

  • Dynamic programming/value iteration is efficient for moderate problem sizes, exploiting the backward-recursive Bellman structure (Chowdhury et al., 2021, Dann et al., 2015).
  • Occupancy measure LPs: The set of feasible state-action occupancy sequences over the $H$ steps forms a polytope described by flow-conservation constraints. The optimal policy is encoded as the solution to an LP minimizing expected cost (or maximizing expected reward), subject to constraints on expected secondary costs (Kalagarla et al., 2020, Dann et al., 2015); a minimal solver sketch appears after this list.
  • Bilinear programming in CMDPs: For CMDPs with both additive and multiplicative utilities (e.g., risk-sensitive or multiplicative cost criteria), a model transformation via augmented binary state variables turns the problem into a purely additive CMDP on an expanded state space, solvable via a finite-dimensional bilinear program. The augmented state space doubles for each multiplicative component, but practical scenarios often keep this manageable (a factor of $2^k$ for $k$ multiplicative indices) (M et al., 2023).
| Problem Type | Formulation | Notes/Comments |
| --- | --- | --- |
| Unconstrained MDP | LP over occupancy measures | Flow constraints only; solution gives policy |
| Additive + multiplicative CMDP | Bilinear program (augmented state) | Polynomial in horizon, exponential in # mult. objectives (M et al., 2023) |
| CMDP (constraints) | LP with cost constraints | Scales as $\tilde{O}(|S| |A| C H^2 / \varepsilon^2)$ (worst-case) episodes (Kalagarla et al., 2020) |
  • Reinforcement learning algorithms: Finite-horizon Q-learning algorithms have been developed for this setting, utilizing nonstationary Q-functions $Q_t(s, a)$ for each stage $t$, with sample-based updates and convergence established by ODE-based stochastic approximation (VP et al., 2021).
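
The occupancy-measure LP referenced above can be sketched with a generic solver. The following illustrative Python sketch uses scipy.optimize.linprog under the same assumed tabular array layout as the earlier examples; the secondary-cost constraints of a CMDP would enter as additional inequality rows on the same variables and are omitted here for brevity.

```python
import numpy as np
from scipy.optimize import linprog

def occupancy_lp(P, r, p0):
    """Solve a finite-horizon MDP as an LP over occupancy measures mu_t(s, a).

    Flow constraints:  sum_a mu_0(s, a) = p0(s)
                       sum_a mu_{t+1}(s', a) = sum_{s,a} P_t(s'|s,a) mu_t(s, a)
    Objective:         maximize sum_{t,s,a} r_t(s, a) mu_t(s, a)
    """
    H, S, A, _ = P.shape
    n = H * S * A

    def idx(t, s, a):                                # C-order index into mu
        return (t * S + s) * A + a

    # One flow-conservation equality per (stage, state): H * S rows.
    A_eq = np.zeros((H * S, n))
    b_eq = np.zeros(H * S)
    for s in range(S):                               # stage-0 rows: initial flow
        for a in range(A):
            A_eq[s, idx(0, s, a)] = 1.0
        b_eq[s] = p0[s]
    for t in range(1, H):                            # stage-t rows: in = out
        for s2 in range(S):
            row = t * S + s2
            for a in range(A):
                A_eq[row, idx(t, s2, a)] = 1.0       # outflow at stage t
            for s in range(S):
                for a in range(A):
                    A_eq[row, idx(t - 1, s, a)] -= P[t - 1, s, a, s2]  # inflow
    # Maximize total expected reward  <=>  minimize its negation.
    c = -r.reshape(n)
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    mu = res.x.reshape(H, S, A)
    # Recover a (possibly randomized) nonstationary policy: pi_t(a|s) ~ mu_t(s, a).
    pi = mu / np.maximum(mu.sum(axis=2, keepdims=True), 1e-12)
    return mu, pi
```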

5. Strategy and Memory Complexity

Finite-horizon MDPs generally require policies that keep explicit track of elapsed time (or steps-to-go). Characterizing the memory requirements:

  • Any $\varepsilon$-optimal policy for a fixed-horizon MDP can be implemented by a counter-based strategy whose memory, measured in bits, grows only logarithmically with the horizon; this is tight, in that there exist MDPs requiring that much memory (Chatterjee et al., 2012).
  • For exactly optimal strategies, the period of counter-based implementations can grow exponentially in the number of states for some MDPs, due to periodic dependence of the optimal action on the horizon modulo various primes (Chatterjee et al., 2012).

6. Extensions: Resource Constraints, Risk Sensitivity, and Advanced Utility Criteria

Recent research has generalized finite-horizon episodic MDPs to accommodate a range of advanced objectives:

  • Mean-variance optimization: Formulated as a bilevel MDP with an augmented state (tracking accumulated returns), solved by alternating optimization over the pseudo-mean and a dynamic program over the remainder. The optimal policy may be history-dependent and the value function is piecewise quadratic-concave. This paradigm extends to multi-period mean-variance portfolio optimization, queueing control, and inventory management (Xia et al., 30 Jul 2025).
  • Additive and multiplicative constraints/objectives: The use of binary auxiliary state variables allows multiplicative utility components (e.g., risk-sensitive criteria, survival probabilities) to be incorporated by transforming the problem into a purely additive CMDP on an augmented state space (M et al., 2023); a minimal augmentation sketch appears after this list.
  • Resource allocation: Online learning frameworks based on dual mirror descent solve episodic finite-horizon CMDPs with unknown kernels and stochastic reward/resource consumption, attaining tight regret bounds in both observe-then-decide and decide-then-observe regimes (Lee et al., 2023).
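
To make the binary-auxiliary-state idea concrete, the sketch below augments each state with a single flag that stays 1 only while every visited state lies in a designated "safe" set, so a multiplicative survival-probability objective becomes the expectation of an additive terminal reward on the augmented chain. The helper name, the safe-set interpretation, and the array layout are illustrative assumptions, not the exact construction of (M et al., 2023).

```python
import numpy as np

def augment_with_survival_flag(P, p0, safe):
    """Augmented-state construction for one multiplicative (survival) component.

    P    : (H, S, A, S) stage-dependent kernels, p0 : (S,), safe : bool array (S,).
    Each state s is paired with a flag b in {0, 1}; the pair is indexed as
    2*s + b, so the state space doubles (one factor of 2 per flag).  The flag
    stays 1 only while every visited state lies in the safe set, and the
    multiplicative objective P(all visited states safe) becomes the expectation
    of the additive terminal reward 1{b = 1}.
    """
    H, S, A, _ = P.shape
    P_aug = np.zeros((H, 2 * S, A, 2 * S))
    for t in range(H):
        for s in range(S):
            for b in (0, 1):
                for a in range(A):
                    for s2 in range(S):
                        b2 = b * int(safe[s2])          # flag drops to 0 on an unsafe visit
                        P_aug[t, 2 * s + b, a, 2 * s2 + b2] += P[t, s, a, s2]
    p0_aug = np.zeros(2 * S)
    for s in range(S):
        p0_aug[2 * s + int(safe[s])] = p0[s]            # initial flag reflects s_0
    r_term = np.zeros(2 * S)
    r_term[1::2] = 1.0                                  # terminal bonus 1{b = 1}
    return P_aug, p0_aug, r_term
```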

7. Connections, Complexity, and Theoretical Implications

Finite-horizon episodic MDPs form a universal formalism encompassing a wide range of sequential stochastic optimization problems, from basic stochastic control, resource allocation, and inventory to learning under bandit feedback (Zhuo et al., 2 Feb 2026).

  • Computational complexity is governed by the underlying structure: standard MDPs are polynomially tractable (in $|S|$, $|A|$, and $H$) via DP; constrained and risk-sensitive variants may induce exponential complexity in the number of constraints or multiplicative utilities, but remain linear in the horizon when that number is modest (M et al., 2023).
  • Sample complexity is optimally quadratic in the horizon for tabular RL under classical regret/PAC guarantees, but can be made logarithmic or even constant in the horizon under bounded-total-reward normalization, albeit via computationally inefficient procedures (Dann et al., 2015, Li et al., 2021).
  • Algorithmic frameworks offer tractable solutions for small- and medium-scale problems; novel structures (low-rank tensors (Rozada et al., 17 Jan 2025), tensor networks (Gillman et al., 2020), quantum algorithms (Luo et al., 7 Aug 2025)) provide scalability or computational advantages in high dimensions or under special structural assumptions.

These developments position finite-horizon episodic MDPs as a highly expressive, deeply analyzed model at the core of sequential decision theory, with ongoing advances driven by the intersection of learning, control, and optimization.
