
Finite-Horizon Episodic MDPs

Updated 21 April 2026
  • Finite-horizon episodic MDPs are mathematical models for sequential decision-making over a fixed number of time steps with time-dependent transitions and rewards.
  • They rely on nonstationary policies and backward dynamic programming (Bellman recursion) to optimize the expected cumulative reward over the finite horizon.
  • Applications include reinforcement learning, resource allocation, and risk-sensitive control, employing techniques like occupancy measure LPs and bilinear programming.

A finite-horizon episodic Markov Decision Process (MDP) is a mathematical framework for modeling sequential decision problems in which an agent interacts with an environment over repeated episodes, each consisting of a fixed, finite number of time steps (the horizon). Each episode consists of a sequence of decisions, transitions, and rewards, and the agent's objective is to optimize an expected sum of rewards or costs, possibly subject to constraints. The finite horizon necessitates time-dependent (nonstationary) policies and sample-complexity and algorithmic considerations distinct from those of infinite-horizon or continuing MDPs.

1. Formal Structure and Policy Classes

A finite-horizon episodic MDP is described by the tuple $(S, A, H, (P_t)_{t=0}^{H-1}, (r_t)_{t=0}^{H-1}, p_0)$, where:

  • $S$: finite state space.
  • $A$: finite action space.
  • $H$: integer-valued horizon, i.e., the episode ends after $H$ steps.
  • $P_t(s' \mid s, a)$: time- (stage-) dependent transition kernels.
  • $r_t(s, a)$ (or $r_t(s, a, s')$): possibly time-dependent reward or cost.
  • $p_0$: initial state distribution.

A nonstationary policy $\pi = (\pi_0, \ldots, \pi_{H-1})$, with each $\pi_t(\cdot \mid s)$ a distribution over $A$, governs the agent's choices at each time step, potentially allowing randomized (stochastic) actions. Trajectories consist of $(s_0, a_0, s_1, a_1, \ldots, s_{H-1}, a_{H-1}, s_H)$, where $s_0 \sim p_0$ and each $s_{t+1} \sim P_t(\cdot \mid s_t, a_t)$.

The value function at stage $t$ under policy $\pi$ from state $s$ is defined as $V_t^\pi(s) = \mathbb{E}^\pi\left[\sum_{k=t}^{H-1} r_k(s_k, a_k) \,\middle|\, s_t = s\right]$, with optimal value $V_t^*(s) = \max_\pi V_t^\pi(s)$ (Chowdhury et al., 2021).
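
As a concrete illustration of these definitions, the following minimal Python sketch (the array layout and toy dimensions are assumptions made for the example, not taken from any cited paper) rolls out a single episode under a nonstationary randomized policy and accumulates the reward sum whose expectation defines $V_0^\pi$.

```python
import numpy as np

def rollout_episode(P, r, p0, policy, rng):
    """Sample one episode of a tabular finite-horizon MDP.

    P      : (H, S, A, S) array, P[t, s, a, s'] = P_t(s' | s, a)
    r      : (H, S, A) array,    r[t, s, a]     = r_t(s, a)
    p0     : (S,) initial state distribution
    policy : (H, S, A) array,    policy[t, s, a] = pi_t(a | s)
    """
    H, S, A, _ = P.shape
    s = rng.choice(S, p=p0)                    # s_0 ~ p_0
    total_reward = 0.0
    for t in range(H):
        a = rng.choice(A, p=policy[t, s])      # a_t ~ pi_t(. | s_t)
        total_reward += r[t, s, a]             # collect r_t(s_t, a_t)
        s = rng.choice(S, p=P[t, s, a])        # s_{t+1} ~ P_t(. | s_t, a_t)
    return total_reward

# Toy instance: horizon 3, two states, two actions, uniform dynamics and policy.
rng = np.random.default_rng(0)
H, S, A = 3, 2, 2
P = np.full((H, S, A, S), 1.0 / S)
r = rng.random((H, S, A))
p0 = np.array([0.5, 0.5])
policy = np.full((H, S, A), 1.0 / A)
print(rollout_episode(P, r, p0, policy, rng))
```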

2. Dynamic Programming and Bellman Recursion

The backbone of computation in finite-horizon episodic MDPs is backward dynamic programming (DP) using the stage-indexed Bellman equations:

$$Q_t^*(s, a) = r_t(s, a) + \sum_{s'} P_t(s' \mid s, a)\, V_{t+1}^*(s'), \qquad t = H-1, \ldots, 0,$$

$$V_t^*(s) = \max_{a \in A} Q_t^*(s, a), \qquad V_H^*(s) \equiv 0.$$

The finite horizon necessitates distinct (nonstationary) policies for each stage $t = 0, \ldots, H-1$, in contrast to the stationary structure of infinite-horizon discounted or average-reward MDPs (Chowdhury et al., 2021, Dann et al., 2015). The optimal policy at each stage chooses actions maximizing $Q_t^*(s, a)$.
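
A minimal backward-induction sketch of this recursion for the tabular case follows; the array layout mirrors the rollout example above and is an illustrative assumption rather than a specific cited implementation.

```python
import numpy as np

def finite_horizon_value_iteration(P, r):
    """Backward DP (stage-indexed Bellman recursion) for a tabular finite-horizon MDP.

    P : (H, S, A, S) stage-dependent transition kernels P_t(s' | s, a)
    r : (H, S, A)    stage-dependent rewards r_t(s, a)
    Returns optimal values V_t^*(s) and a greedy nonstationary policy pi_t(s).
    """
    H, S, A, _ = P.shape
    V = np.zeros((H + 1, S))                 # boundary condition V_H^* = 0
    pi = np.zeros((H, S), dtype=int)
    for t in reversed(range(H)):             # t = H-1, ..., 0
        # Q_t^*(s, a) = r_t(s, a) + sum_{s'} P_t(s' | s, a) V_{t+1}^*(s')
        Q = r[t] + P[t] @ V[t + 1]
        V[t] = Q.max(axis=1)                 # V_t^*(s) = max_a Q_t^*(s, a)
        pi[t] = Q.argmax(axis=1)             # greedy action at stage t
    return V, pi
```

The returned policy is deterministic but indexed by stage, reflecting the nonstationary optimal structure discussed above.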

3. Sample Complexity and Regret: PAC-Learning, Lower and Upper Bounds

Finite-horizon episodic MDPs have been central to the theory of sample complexity and regret minimization. Key results include:

  • PAC (Probably Approximately Correct) sample complexity for learning $\varepsilon$-optimal policies scales as $\tilde{O}(|S|^2 |A| H^2 / \varepsilon^2)$ episodes in the general (tabular) case, given known rewards and unknown transitions. Matching lower bounds are $\Omega(|S| |A| H^2 / \varepsilon^2)$ (Dann et al., 2015). This quadratic dependence on the horizon $H$ is optimal up to logarithmic factors.
  • A crucial technical advance was the Bellman-variance analysis, improving the horizon exponent from $H^3$ (typical for reductions to discounted MDPs or Hoeffding-style bounds) to $H^2$ via refined Bernstein/variance-matching concentration (Dann et al., 2015).
  • For constrained MDPs ("CMDPs"), where policies must satisfy constraints on secondary costs, occupancy measure LPs yield sample complexity of order $|S| |A| C H^2 / \varepsilon^2$ episodes (up to logarithmic factors), with $C$ the maximal possible number of successor states (Kalagarla et al., 2020).
  • For standard reward-only problems, recent non-constructive analyses establish a horizon-free upper bound of $\tilde{O}(|S| |A| / \varepsilon^2)$ episodes (for fixed accuracy $\varepsilon$ and confidence $\delta$), i.e., the number of episodes to PAC-optimality need not grow with $H$, under a bounded-total-reward normalization and at the price of computational intractability (Li et al., 2021).

4. Algorithmic Methodologies: LPs, Bilinear Programs, and Reinforcement Learning

Multiple algorithmic frameworks are available for finite-horizon MDPs and their constrained generalizations:

  • Dynamic programming/value iteration is efficient for moderate problem sizes, exploiting the backward-recursive Bellman structure (Chowdhury et al., 2021, Dann et al., 2015).
  • Occupancy measure LPs: The set of feasible state-action occupancy sequences over the $H$ steps forms a polytope described by flow-conservation constraints. The optimal policy is encoded as the solution to an LP minimizing expected cost (or maximizing expected reward), subject to constraints on expected secondary costs (Kalagarla et al., 2020, Dann et al., 2015); a minimal solver sketch appears after this list.
  • Bilinear programming in CMDPs: For CMDPs with both additive and multiplicative utilities (e.g., risk-sensitive or multiplicative cost criteria), a model transformation via augmented binary state variables turns the problem into a purely additive CMDP on an expanded state space, solvable via a finite-dimensional bilinear program. The augmented state space doubles for each multiplicative component, but practical scenarios often keep this manageable (a factor of $2^k$ for $k$ multiplicative indices) (M et al., 2023).
| Problem Type | Formulation | Notes/Comments |
| --- | --- | --- |
| Unconstrained MDP | LP over occupancy measures | Flow constraints only; solution gives policy |
| Additive + multiplicative CMDP | Bilinear program (augmented state) | Polynomial in horizon, exponential in # mult. objectives (M et al., 2023) |
| CMDP (constraints) | LP with cost constraints | Scales as $\tilde{O}(|S| |A| C H^2 / \varepsilon^2)$ (worst-case) episodes (Kalagarla et al., 2020) |
  • Reinforcement learning algorithms: Finite-horizon Q-learning algorithms have been developed for this setting, utilizing nonstationary Q-functions $Q_t(s, a)$ for each stage $t$, with sample-based updates and convergence established by ODE-based stochastic approximation (VP et al., 2021).
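
The occupancy-measure LP referenced above can be sketched with a generic solver. The following illustrative Python sketch uses scipy.optimize.linprog under the same assumed tabular array layout as the earlier examples; the secondary-cost constraints of a CMDP would enter as additional inequality rows on the same variables and are omitted here for brevity.

```python
import numpy as np
from scipy.optimize import linprog

def occupancy_lp(P, r, p0):
    """Solve a finite-horizon MDP as an LP over occupancy measures mu_t(s, a).

    Flow constraints:  sum_a mu_0(s, a) = p0(s)
                       sum_a mu_{t+1}(s', a) = sum_{s,a} P_t(s'|s,a) mu_t(s, a)
    Objective:         maximize sum_{t,s,a} r_t(s, a) mu_t(s, a)
    """
    H, S, A, _ = P.shape
    n = H * S * A

    def idx(t, s, a):                                # C-order index into mu
        return (t * S + s) * A + a

    # One flow-conservation equality per (stage, state): H * S rows.
    A_eq = np.zeros((H * S, n))
    b_eq = np.zeros(H * S)
    for s in range(S):                               # stage-0 rows: initial flow
        for a in range(A):
            A_eq[s, idx(0, s, a)] = 1.0
        b_eq[s] = p0[s]
    for t in range(1, H):                            # stage-t rows: in = out
        for s2 in range(S):
            row = t * S + s2
            for a in range(A):
                A_eq[row, idx(t, s2, a)] = 1.0       # outflow at stage t
            for s in range(S):
                for a in range(A):
                    A_eq[row, idx(t - 1, s, a)] -= P[t - 1, s, a, s2]  # inflow
    # Maximize total expected reward  <=>  minimize its negation.
    c = -r.reshape(n)
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    mu = res.x.reshape(H, S, A)
    # Recover a (possibly randomized) nonstationary policy: pi_t(a|s) ~ mu_t(s, a).
    pi = mu / np.maximum(mu.sum(axis=2, keepdims=True), 1e-12)
    return mu, pi
```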

5. Strategy and Memory Complexity

Finite-horizon MDPs generally require policies that keep explicit track of elapsed time (or steps-to-go). Characterizing the memory requirements:

  • Any $\varepsilon$-optimal policy for a fixed-horizon MDP can be implemented by a counter-based strategy whose memory, measured in bits, grows only logarithmically with the horizon; this is tight, in that there exist MDPs requiring that much memory (Chatterjee et al., 2012).
  • For exactly optimal strategies, the period of counter-based implementations can grow exponentially in the number of states for some MDPs, due to periodic dependence of the optimal action on the horizon modulo various primes (Chatterjee et al., 2012).

6. Extensions: Resource Constraints, Risk Sensitivity, and Advanced Utility Criteria

Recent research has generalized finite-horizon episodic MDPs to accommodate a range of advanced objectives:

  • Mean-variance optimization: Formulated as a bilevel MDP with an augmented state (tracking accumulated returns), solved by alternating optimization over the pseudo-mean and a dynamic program over the remainder. The optimal policy may be history-dependent and the value function is piecewise quadratic-concave. This paradigm extends to multi-period mean-variance portfolio optimization, queueing control, and inventory management (Xia et al., 30 Jul 2025).
  • Additive and multiplicative constraints/objectives: The use of binary auxiliary state variables allows multiplicative utility components (e.g., risk-sensitive criteria, survival probabilities) to be incorporated by transforming the problem into a purely additive CMDP on an augmented state space (M et al., 2023); a minimal augmentation sketch appears after this list.
  • Resource allocation: Online learning frameworks based on dual mirror descent solve episodic finite-horizon CMDPs with unknown kernels and stochastic reward/resource consumption, attaining tight regret bounds in both observe-then-decide and decide-then-observe regimes (Lee et al., 2023).
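
To make the binary-auxiliary-state idea concrete, the sketch below augments each state with a single flag that stays 1 only while every visited state lies in a designated "safe" set, so a multiplicative survival-probability objective becomes the expectation of an additive terminal reward on the augmented chain. The helper name, the safe-set interpretation, and the array layout are illustrative assumptions, not the exact construction of (M et al., 2023).

```python
import numpy as np

def augment_with_survival_flag(P, p0, safe):
    """Augmented-state construction for one multiplicative (survival) component.

    P    : (H, S, A, S) stage-dependent kernels, p0 : (S,), safe : bool array (S,).
    Each state s is paired with a flag b in {0, 1}; the pair is indexed as
    2*s + b, so the state space doubles (one factor of 2 per flag).  The flag
    stays 1 only while every visited state lies in the safe set, and the
    multiplicative objective P(all visited states safe) becomes the expectation
    of the additive terminal reward 1{b = 1}.
    """
    H, S, A, _ = P.shape
    P_aug = np.zeros((H, 2 * S, A, 2 * S))
    for t in range(H):
        for s in range(S):
            for b in (0, 1):
                for a in range(A):
                    for s2 in range(S):
                        b2 = b * int(safe[s2])          # flag drops to 0 on an unsafe visit
                        P_aug[t, 2 * s + b, a, 2 * s2 + b2] += P[t, s, a, s2]
    p0_aug = np.zeros(2 * S)
    for s in range(S):
        p0_aug[2 * s + int(safe[s])] = p0[s]            # initial flag reflects s_0
    r_term = np.zeros(2 * S)
    r_term[1::2] = 1.0                                  # terminal bonus 1{b = 1}
    return P_aug, p0_aug, r_term
```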

7. Connections, Complexity, and Theoretical Implications

Finite-horizon episodic MDPs form a universal formalism encompassing a wide range of sequential stochastic optimization problems, from basic stochastic control, resource allocation, and inventory to learning under bandit feedback (Zhuo et al., 2 Feb 2026).

  • Computational complexity is governed by the underlying structure: standard MDPs are polynomially tractable (in $|S|$, $|A|$, and $H$) via DP; constrained and risk-sensitive variants may induce exponential complexity in the number of constraints or multiplicative utilities, but remain linear in the horizon when that number is modest (M et al., 2023).
  • Sample complexity is optimally quadratic in the horizon for tabular RL under classical regret/PAC guarantees, but can be made logarithmic or even constant in the horizon under bounded-total-reward normalization, albeit via computationally inefficient procedures (Dann et al., 2015, Li et al., 2021).
  • Algorithmic frameworks offer tractable solutions for small- and medium-scale problems; novel structures (low-rank tensors (Rozada et al., 17 Jan 2025), tensor networks (Gillman et al., 2020), quantum algorithms (Luo et al., 7 Aug 2025)) provide scalability or computational advantages in high dimensions or under special structural assumptions.

These developments position finite-horizon episodic MDPs as a highly expressive, deeply analyzed model at the core of sequential decision theory, with ongoing advances driven by the intersection of learning, control, and optimization.
