
Infinite-Horizon Average-Cost MDPs

Updated 8 August 2025
  • Infinite-horizon average-cost MDPs form a mathematical framework for sequential stochastic control in which performance is measured by the long-run average cost per time step, including over infinite or continuous state spaces.
  • The approach leverages behavioral metrics to quantify state similarity and ensure Lipschitz continuity, which supports robust state aggregation and controlled approximation errors.
  • It provides theoretical guarantees and practical strategies for addressing high-dimensional decision-making challenges in operations research, stochastic control, and reinforcement learning.

An infinite-horizon average-cost Markov decision process (MDP) is a mathematical framework for sequential stochastic control in which the performance criterion is the long-run average of accrued costs per time step, optimized over an infinite time horizon. This framework is foundational for operations research, stochastic control, learning theory, and mathematical finance, and is especially critical for studying systems with continuous or large state spaces, constraints, and nontrivial limiting behavior.
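For concreteness, the average-cost criterion evaluates a stationary policy $\pi$ by its long-run expected cost per step; a standard formulation (the per-step cost $g(s_t, a_t)$ and the symbol $\rho$ are conventional notation assumed here, with $g$ chosen to avoid clashing with the metric constant $c$ below) is

$$\rho^\pi(s) = \limsup_{T \to \infty} \frac{1}{T}\, \mathbb{E}_s^\pi \left[ \sum_{t=0}^{T-1} g(s_t, a_t) \right], \qquad \rho^*(s) = \inf_\pi \rho^\pi(s).$$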

1. Metrics and State Similarity in Infinite State Spaces

A central theoretical advancement for infinite-horizon average-cost MDPs with infinite or continuous state spaces is the development of behavioral metrics for state similarity (Ferns et al., 2012). These metrics generalize the classical notion of probabilistic bisimulation into a quantitative semimetric framework:

  • For any two states $s, s' \in S$, the behavioral distance $d(s, s')$ is constructed from maximal differences in immediate rewards and Kantorovich (Wasserstein) distances between their transition probability measures, iterated to a fixed point via a contraction operator $F_c$.
  • Formally,

$$F_c(h)(s, s') = \max_{a \in A} \Bigl[\, |r(s,a) - r(s',a)| + c \cdot T_K(h)\bigl(P(s,a), P(s',a)\bigr) \Bigr],$$

where $T_K(h)$ is the Kantorovich distance induced by the semimetric $h$.

This metric supplies a continuous analogue to bisimulation: the metric's kernel recovers exactly the bisimulation relation (distance zero), while nonzero values quantify graded behavioral dissimilarity between states. The continuity of this metric with respect to the MDP parameters (rewards, transitions) confers stability to state aggregation schemes: aggregates with small metric diameters guarantee bounded errors in approximated value functions and policies.
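As a concrete, minimal sketch of this construction for a finite MDP (e.g., one obtained by sampling or discretizing a continuous state space), the fixed point of $F_c$ can be computed by iterating the operator, with each Kantorovich distance solved as a small transportation linear program. The function names and the use of scipy.optimize.linprog are illustrative choices, not prescribed by the source:

```python
import numpy as np
from scipy.optimize import linprog

def kantorovich(p, q, d):
    """Kantorovich (Wasserstein-1) distance between distributions p, q over
    n points with ground semimetric d (n x n), via the primal transportation
    LP: minimize <coupling, d> subject to marginals p and q."""
    n = len(p)
    A_eq = np.zeros((2 * n, n * n))
    for i in range(n):
        A_eq[i, i * n:(i + 1) * n] = 1.0  # row i of the coupling sums to p[i]
        A_eq[n + i, i::n] = 1.0           # column i of the coupling sums to q[i]
    res = linprog(d.ravel(), A_eq=A_eq, b_eq=np.concatenate([p, q]),
                  bounds=(0, None), method="highs")
    return res.fun

def bisim_metric(R, P, c=0.9, tol=1e-8):
    """Iterate F_c(h)(s,s') = max_a [|r(s,a)-r(s',a)| + c*T_K(h)(P(s,a),P(s',a))]
    to its fixed point. R: (S, A) rewards; P: (S, A, S) transition kernel."""
    S, A = R.shape
    d = np.zeros((S, S))
    while True:
        d_new = np.zeros_like(d)
        for s in range(S):
            for t in range(s + 1, S):
                d_new[s, t] = d_new[t, s] = max(
                    abs(R[s, a] - R[t, a]) + c * kantorovich(P[s, a], P[t, a], d)
                    for a in range(A))
        if np.abs(d_new - d).max() < tol:
            return d_new
        d = d_new
```

Because $F_c$ contracts with factor $c$, the loop converges geometrically; the $O(S^2)$ Kantorovich LPs per sweep dominate the cost, which is exactly the computational burden revisited in Section 5.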

2. Approximation and Aggregation in Infinite-State MDPs

Infinite or continuous state spaces preclude direct application of classical discrete MDP solution algorithms. The behavioral metric framework enables principled state aggregation and approximation:

  • By partitioning the state space into aggregates of small metric diameter and constructing a finite MDP in which each aggregate is represented by a prototype state, one can guarantee that the value function of the approximate MDP differs from that of the original by no more than the maximal aggregate diameter, thanks to the value function's Lipschitz regularity.
  • Theoretical guarantees: if the value function $V^*$ is Lipschitz with constant 1 with respect to $d_{\mathrm{fix}}$, then for any $s, s'$:

$$|V^*(s) - V^*(s')| \leq d_{\mathrm{fix}}(s, s').$$

  • This justifies both uniform discretization and more sophisticated, metric-adaptive aggregation schemes for high-dimensional and continuous-state MDPs, which are crucial for tractable infinite-horizon planning; a minimal aggregation sketch follows this list.
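As referenced above, a minimal aggregation sketch: greedily build an $\varepsilon$-net of prototypes under the computed metric, then lump transition mass cluster-by-cluster into a finite quotient MDP. The helper names and the greedy strategy are illustrative assumptions, not the source's prescription:

```python
import numpy as np

def epsilon_net(d, eps):
    """Greedy epsilon-net under metric matrix d: each state joins the first
    prototype within distance eps, or becomes a new prototype itself."""
    prototypes, assign = [], np.empty(d.shape[0], dtype=int)
    for s in range(d.shape[0]):
        for k, p in enumerate(prototypes):
            if d[s, p] <= eps:
                assign[s] = k
                break
        else:  # no prototype within eps: state s starts a new cluster
            assign[s] = len(prototypes)
            prototypes.append(s)
    return prototypes, assign

def quotient_mdp(R, P, prototypes, assign):
    """Finite aggregate MDP: each cluster inherits its prototype's rewards,
    and transition probability mass is summed over destination clusters."""
    K, A = len(prototypes), R.shape[1]
    R_q = R[prototypes, :]                    # (K, A) prototype rewards
    P_q = np.zeros((K, A, K))
    for k, p in enumerate(prototypes):
        for a in range(A):
            for t, mass in enumerate(P[p, a]):
                P_q[k, a, assign[t]] += mass  # lump mass into clusters
    return R_q, P_q
```

Since every state lies within $\varepsilon$ of its prototype, the Lipschitz bound above implies that the quotient MDP's value function tracks the original up to an error controlled by the aggregate diameter $\varepsilon$.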

The main technical burden for continuous spaces is the computation of the Kantorovich metric and of the fixed-point semimetric. This is addressed by embedding the metric construction in measure-theoretic and duality-based functional-analytic frameworks, including exploiting contraction-mapping properties on the semilattice of lower semicontinuous semimetrics.

3. Continuity and Robustness of Value Functions

For infinite-horizon planning, an essential property is the continuity of the optimal value function with respect to the behavioral metric:

  • For any two states that are close under $d_{\mathrm{fix}}$, their optimal values cannot differ by more than their distance:

$$|V^*(s) - V^*(s')| \leq d_{\mathrm{fix}}(s, s').$$

  • This property provides both a theoretical foundation for the reliability of metric-based aggregation and a guarantee of robustness of the optimal value function and derived policies to small model perturbations (changes in transition probabilities or rewards).

This is critical for structure-preserving approximation and for quantifying the effect of numerical or modeling errors in large or continuous MDPs.
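A quick numerical sanity check of this bound is possible on a small random MDP, assuming the standard condition that the planning discount factor $\gamma$ satisfies $\gamma \le c$, and reusing the illustrative bisim_metric sketch from Section 1:

```python
import numpy as np

def value_iteration(R, P, gamma, tol=1e-10):
    """Optimal discounted values via standard value iteration; the reward vs.
    cost convention does not affect the Lipschitz check below."""
    V = np.zeros(R.shape[0])
    while True:
        Q = R + gamma * (P @ V)       # (S, A); P @ V contracts the last axis
        V_new = Q.max(axis=1)
        if np.abs(V_new - V).max() < tol:
            return V_new
        V = V_new

rng = np.random.default_rng(0)
S, A, gamma, c = 6, 3, 0.9, 0.9       # gamma <= c, so the bound applies
R = rng.random((S, A))
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)     # normalize each P(s, a, .) to a distribution

d = bisim_metric(R, P, c=c)           # fixed-point semimetric (Section 1 sketch)
V = value_iteration(R, P, gamma)
assert all(abs(V[s] - V[t]) <= d[s, t] + 1e-6
           for s in range(S) for t in range(S))
```

The assertion holds because value iteration's iterates remain 1-Lipschitz with respect to $d_{\mathrm{fix}}$ whenever $\gamma \le c$, which is exactly the mechanism behind the robustness claims above.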

4. Application to Infinite-Horizon Planning and Average-Cost Criteria

The metric approach is not restricted to the discounted-cost setting; it is inherently suitable for infinite-horizon average-cost criteria and long-run planning (Ferns et al., 2012). Its key elements are as follows:

  • By ensuring that the value function is Lipschitz with respect to the behavioral metric, one can relate the error introduced by approximation schemes (including partition-based truncations or aggregations) directly to the metric diameter of aggregates, yielding explicit, uniform performance bounds.
  • The framework facilitates iterative multi-resolution approximations: as aggregate diameters decrease, the approximate value functions and optimal policies converge (in the metric) to those of the original infinite-state system.

Robustness also extends to planning and learning algorithms, as the value function's continuity ensures that small perturbations in the MDP do not result in discontinuities or abrupt changes in decision-making behavior.
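Putting the pieces together, the multi-resolution claim can be exercised end-to-end by shrinking the aggregation radius and comparing aggregate values with the originals; this sketch composes the hypothetical helpers and variables (R, P, gamma, d) from the previous sections:

```python
# Compare original vs. aggregate optimal values at decreasing resolutions.
V_full = value_iteration(R, P, gamma)
for eps in (0.5, 0.2, 0.1, 0.05):
    prototypes, assign = epsilon_net(d, eps)
    R_q, P_q = quotient_mdp(R, P, prototypes, assign)
    V_q = value_iteration(R_q, P_q, gamma)
    err = np.abs(V_full - V_q[assign]).max()  # lift cluster values back to states
    print(f"eps={eps:4.2f}  clusters={len(prototypes):2d}  max value error={err:.4f}")
```

As eps shrinks, the number of clusters grows toward the full state count and the value error is bounded in terms of eps, illustrating the claimed convergence of the approximate value functions and policies.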

5. Theoretical and Computational Considerations

Advantages and limitations of the behavioral metric and aggregation approach include:

  • Advantages:
    • Provides rigorous performance guarantees for state aggregation in arbitrary (possibly high-dimensional or continuous) MDPs.
    • Supplies a systematic, quantitative approach to model approximation and reduction, applicable to both value iteration and policy iteration frameworks.
    • Robust to small model errors due to the topological continuity of value functions w.r.t. the metric.
  • Limitations:
    • Computing the fixed-point semimetric and the Kantorovich metric may be computationally intensive, particularly in high-dimensional or nonparametric settings.
    • Construction of aggregates requires explicit control over aggregate diameters in the metric, which may be highly nontrivial in practice.
    • The choice of the metric's discount constant $c$, and its relation to the actual planning discount factor $\gamma$, requires careful tuning to ensure tight Lipschitz bounds.

6. Implications for Theory and Practice

The behavioral metric perspective fundamentally advances the theory of infinite-horizon average-cost MDPs by extending classical, combinatorial notions of state equivalence into tools that are robust, continuous, and applicable to general, possibly uncountable state spaces (Ferns et al., 2012). Practically, this enables:

  • Rigorous design and analysis of approximate dynamic programming and reinforcement learning algorithms for systems with infinite or continuous state space.
  • Quantitative assessment of the tradeoff between state-space reduction (aggregation) and performance loss.
  • Robust guarantees of the convergence and reliability of approximate solution methods for infinite-horizon average-cost control tasks.

In summary, the behavioral metric and associated fixed-point construction provide a scalable, theoretically justified framework for the approximation, aggregation, and robust solution of infinite-horizon average-cost MDPs, particularly in settings where state space cardinality or continuity preclude direct application of classical techniques.

References

 1. Ferns, N., Panangaden, P., & Precup, D. (2012). Metrics for Markov Decision Processes with Infinite State Spaces. arXiv preprint (originally presented at UAI 2005).