Computing the Exact Pareto Front in Average-Cost Multi-Objective Markov Decision Processes

Published 2 Apr 2026 in eess.SY, cs.IT, cs.LG, and cs.NI | (2604.02196v1)

Abstract: Many communication and control problems are cast as multi-objective Markov decision processes (MOMDPs). The complete solution to an MOMDP is the Pareto front. Much of the literature approximates this front via scalarization into single-objective MDPs. Recent work has begun to characterize the full front in discounted or simple bi-objective settings by exploiting its geometry. In this work, we characterize the exact front in average-cost MOMDPs. We show that the front is a continuous, piecewise-linear surface lying on the boundary of a convex polytope. Each vertex corresponds to a deterministic policy, and adjacent vertices differ in exactly one state. Each edge is realized as a convex combination of the policies at its endpoints, with the mixing coefficient given in closed form. We apply these results to a remote state estimation problem, where each vertex on the front corresponds to a threshold policy. The exact Pareto front and solutions to certain non-convex MDPs can be obtained without explicitly solving any MDP.

Abstract PDF Upgrade to Chat

Authors (2)

Summary

The paper introduces a geometric occupancy measure framework that computes the exact Pareto front in average-cost MOMDPs.
It demonstrates that deterministic policies and their convex mixtures can yield all Pareto-optimal cost vectors, overcoming limitations of scalarization.
A numerical study on remote state estimation validates threshold-based policy efficiency and shows practical implications for nonconvex objectives.

Exact Pareto Front Computation in Average-Cost Multi-Objective Markov Decision Processes

Problem Formulation and Background

The paper addresses the challenge of determining the exact Pareto front in average-cost multi-objective Markov Decision Processes (MOMDPs). MOMDPs model scenarios with multiple conflicting objectives, such as control-communication tradeoffs in networked systems. Each policy yields a vectorized cost, and Pareto optimality is defined as the set of policies that cannot be strictly improved in all objectives simultaneously.

Previous literature has mostly relied on scalarization, reducing multi-objective problems to a series of single-objective MDPs via linear combinations of the objectives. While computationally tractable, this approach does not recover the full Pareto front, especially when objectives interact nonlinearly or are nonconvex. The paper thus targets a fundamental limitation: for average-cost MOMDPs, how can one compute the entire Pareto front exactly, rather than approximate or select points via scalarization?

Structural Results: Occupancy Measure Polytope and Pareto Front Geometry

The core theoretical advance is a geometric characterization of the Pareto front. The paper demonstrates that, under standard assumptions (finite state and action spaces, unichain Markov transitions), the set of attainable average-cost objective vectors forms a convex polytope. Each deterministic stationary policy corresponds to a vertex of this polytope; policies at adjacent vertices differ in exactly one state. Edges between vertices represent convex mixtures (randomization in a single state), with closed-form coefficients. Every point on the Pareto front (boundary of the polytope) can be realized by stationary policies or mixtures thereof.

Formally, the occupancy measure induced by any stationary policy lies in the polytope $\Phi$ defined by normalization, balance, and nonnegativity constraints—completely characterized by linear programming. The mapping from occupancy measures to objective vectors is linear. The Pareto front is a continuous, piecewise-linear surface lying on the boundary, consisting of connected edges between deterministic policies.

For scalarized MDPs with strictly increasing (not necessarily linear) scalarization functions, the paper establishes that the optimal solutions always reside on the Pareto front and can be realized by convex combinations of at most $K$ deterministic policies, where $K$ is the number of objectives. This is a strong claim: computing the full Pareto front requires no explicit solution of the underlying MDPs, and nonlinear objectives are admissible as long as they are strictly increasing.

Numerical Case Study: Remote State Estimation

The theoretical results are applied to remote state estimation in a linear Gaussian system, balancing estimation error and communication rate. Each policy is parameterized as a threshold; transmission decisions depend on the elapsed time since the last update. The occupancy measure polytope allows computation of all achievable (error, rate) pairs. The study reveals that each vertex of the Pareto front corresponds to a threshold policy and that any interior point can be realized by mixing two adjacent thresholds. Significantly, for certain ranges of the scalarization weight $\lambda$ , a single threshold policy uniquely attains the minimum cost over the entire interval.

Figure 1: Pareto front of the estimation system for varying communication policies, highlighting threshold-based structure and linear segments.

Further analysis demonstrates that the optimal cost for nonlinear scalarization (e.g., with a sigmoid penalty for communication rate) remains on the Pareto front despite nonconvexity, and the optimal policy can be constructed by combining deterministic threshold policies on an edge.

Theoretical and Practical Implications

The geometric occupancy measure framework unifies the multi-objective average-cost MDP literature by clarifying that stationary policies suffice for Pareto optimality, and that deterministic policies yield all vertices. This result means any strictly increasing objective function—potentially nonlinear and application-specific—can be optimized by selecting suitable mixtures of deterministic policies corresponding to Pareto front vertices.

Practically, this enables efficient enumeration and policy selection for MOMDPs without the necessity of repeated dynamic programming or reinforcement learning for each scalarization. For theoretical research, the connection to convex polytopes opens avenues for advanced MOLP techniques, Carathéodory representations, and policy synthesis so long as the underlying system meets the stated finiteness and unichain conditions.

The methodology also has implications in multi-agent and multi-objective reinforcement learning, semantic communications, and constrained control, where trade-offs must be analyzed exactly and not heuristically. Notably, the fact that nonlinear and even nonconvex operational objectives (as long as they are strictly increasing) can be handled is a contradiction to classical assertions of computational intractability.

Future Directions

Future work can focus on generalizing the approach to infinite state spaces, broader classes of policies (history-dependent, randomized), and relaxing the unichain assumption. Extensions to robust settings, adaptive polytopes under uncertainty, and algorithmic efficiency for high-dimensional objectives are also pertinent. Integration with existing RL approaches for policy mixing and Pareto front exploration is promising for nonconvex, nonstationary environments.

Conclusion

The paper rigorously characterizes the exact Pareto front in average-cost MOMDPs via the geometry of the occupancy measure polytope. Detailing a linear mapping from stationary policies to objective vectors, it asserts—contrary to prevailing beliefs—that the complete Pareto front and optimal solutions for strictly increasing scalarization functions can be obtained via deterministic policies and simple mixtures. These insights have broad theoretical and practical implications for multi-objective optimization in MDPs, enabling efficient, exact policy design for systems with conflicting operational goals.

Markdown Report Issue