Optimal Markov Policies in Unichain MDPs
- Optimal Markov policies are decision rules that assign actions to states to maximize the long-term average reward in unichain MDPs.
- The set of optimal policies is closed under combination and mixture, meaning any policy constructed by selecting or mixing optimal actions state by state remains optimal.
- This structural property simplifies computational strategies and supports robust policy design in reinforcement learning and adaptive control.
Optimal Markov policies are a central concept in Markov decision processes (MDPs), representing mappings from states to actions (possibly with randomization) that achieve maximal expected performance according to a given criterion, such as average reward per time step or long-term discounted reward. In the context of unichain MDPs—where every stationary policy induces a Markov chain with a single recurrent class and possibly some transient states—structural properties of the set of optimal policies are particularly pronounced, with far-reaching implications for both theory and practice.
1. Formal Definition and Unichain MDP Setting
A stationary Markov policy $\pi$ assigns an action (either deterministically or randomly) to each state $s \in S$. In the unichain setting, the induced Markov chain under any stationary policy, including optimal ones, has a single irreducible recurrent class (plus, possibly, transient states). This structural property is significant because it ensures the average reward (per time step) does not depend on the initial state, leading to a well-defined infinite-horizon performance criterion.
Let $M = (S, A, p, r)$ be a unichain MDP with state space $S$, action space $A$, transition probabilities $p(s' \mid s, a)$, and reward function $r(s, a)$. Consider stationary policies $\pi_i$ (for $i = 1, \dots, n$), each achieving the optimal average infinite-horizon reward $\rho^*$ (i.e., $\rho^{\pi_i} = \rho^*$ for every $i$). These are the foundational objects for the analysis of combinations and mixtures of optimal policies.
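To make the setting concrete, the following minimal sketch (the two-state toy MDP, the `average_reward` helper, and all numerical values are illustrative assumptions, not taken from any particular source) evaluates the long-run average reward of a deterministic stationary policy by solving for the stationary distribution of the chain it induces. The same toy MDP is reused in the later sketches.

```python
import numpy as np

# Illustrative toy unichain MDP (an assumption for this sketch): 2 states, 3 actions.
# P[s, a, s'] are transition probabilities, R[s, a] are expected one-step rewards.
# All transition probabilities are strictly positive, so every stationary policy
# induces a single recurrent class (in fact an irreducible chain).
P = np.array([
    [[0.5, 0.5], [0.9, 0.1], [0.5, 0.5]],   # from state 0 under actions 0, 1, 2
    [[0.6, 0.4], [0.2, 0.8], [0.5, 0.5]],   # from state 1 under actions 0, 1, 2
])
R = np.array([
    [0.5, 0.9, 0.2],                        # rewards in state 0
    [1.6, 1.2, 1.0],                        # rewards in state 1
])

def average_reward(policy, P, R):
    """Long-run average reward of a deterministic stationary policy (state -> action).

    Under the unichain assumption the stationary distribution of the induced chain
    is unique, so the result does not depend on the initial state.
    """
    n = P.shape[0]
    P_pi = np.array([P[s, policy[s]] for s in range(n)])   # induced transition matrix
    r_pi = np.array([R[s, policy[s]] for s in range(n)])   # induced reward vector
    # Solve mu @ P_pi = mu together with sum(mu) = 1 for the stationary distribution mu.
    A = np.vstack([P_pi.T - np.eye(n), np.ones(n)])
    b = np.concatenate([np.zeros(n), [1.0]])
    mu = np.linalg.lstsq(A, b, rcond=None)[0]
    return float(mu @ r_pi)

print(average_reward((0, 0), P, R))   # 1.0 -- this policy turns out to be optimal
print(average_reward((2, 2), P, R))   # 0.6 -- a strictly worse policy
```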
2. Statement of Main Results
The principal result established is that, within a unichain MDP, the set of stationary optimal policies is "closed under combination and mixture":
- Combination: For any collection $\pi_1, \dots, \pi_n$ of optimal stationary policies, any policy $\pi$ that at each state $s$ selects the action of some $\pi_i$ (i.e., $\pi(s) = \pi_i(s)$ for some $i \in \{1, \dots, n\}$) is itself an optimal stationary policy. Thus, one can "combine" optimal policies by performing arbitrary selections across states without loss of optimality.
- Mixture: Any (possibly state-dependent) randomization over $\{\pi_1, \dots, \pi_n\}$ that, at each visit to a state $s$ (or in each round), selects an action from the set $\{\pi_1(s), \dots, \pi_n(s)\}$ also yields the optimal average reward. That is, at each time step, for the current state $s$, the controller may choose any action from $\{\pi_1(s), \dots, \pi_n(s)\}$ with any probability distribution (possibly depending on the visit), and the resulting (possibly randomized) policy is optimal. A numerical check of the combination statement follows the table below.
These results can be concisely expressed:
Form | Description | Optimality Guarantee |
---|---|---|
Combination | For all $s$, set $\pi(s) = \pi_i(s)$ for some $i \in \{1, \dots, n\}$ | Optimal average reward $\rho^*$ |
Mixture | At each visit to $s$, any $a \in \{\pi_1(s), \dots, \pi_n(s)\}$ is allowed, possibly at random | Optimal average reward $\rho^*$ |
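As a brute-force illustration of the combination statement (a sketch only, using the same assumed toy MDP as in Section 1, redefined here so the snippet runs on its own), one can enumerate all deterministic stationary policies, collect the optimal ones, and check that every pointwise combination of their actions is again optimal.

```python
import itertools
import numpy as np

# Same illustrative toy unichain MDP as in the Section 1 sketch (an assumption, not from the source).
P = np.array([
    [[0.5, 0.5], [0.9, 0.1], [0.5, 0.5]],
    [[0.6, 0.4], [0.2, 0.8], [0.5, 0.5]],
])
R = np.array([
    [0.5, 0.9, 0.2],
    [1.6, 1.2, 1.0],
])

def average_reward(policy):
    """Average reward of a deterministic stationary policy via its stationary distribution."""
    n = P.shape[0]
    P_pi = np.array([P[s, policy[s]] for s in range(n)])
    r_pi = np.array([R[s, policy[s]] for s in range(n)])
    A = np.vstack([P_pi.T - np.eye(n), np.ones(n)])
    b = np.concatenate([np.zeros(n), [1.0]])
    mu = np.linalg.lstsq(A, b, rcond=None)[0]
    return float(mu @ r_pi)

n_states, n_actions = R.shape
policies = list(itertools.product(range(n_actions), repeat=n_states))
gains = {pi: average_reward(pi) for pi in policies}
rho_star = max(gains.values())
optimal = [pi for pi, g in gains.items() if np.isclose(g, rho_star)]
print("optimal deterministic policies:", optimal)

# Combination: any policy that picks, at each state, the action of *some* optimal policy.
choices_per_state = [sorted({pi[s] for pi in optimal}) for s in range(n_states)]
for combo in itertools.product(*choices_per_state):
    assert np.isclose(average_reward(combo), rho_star), combo
print("every pointwise combination of optimal policies is optimal")
```

On this toy instance the optimal gain is 1.0 and it is attained by exactly the four policies that use one of the first two actions in each state, so the combination check is not vacuous.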
3. Key Methodological Elements and Proof Strategies
The proof utilizes linearity of the average reward function, stationarity, and the unichain property. The Bellman optimality equations for the average-reward case are leveraged, taking advantage of their structure under unichain dynamics:
- The optimal average reward $\rho^*$ and the relative value (bias) function $h$ satisfy the average-reward optimality equation
  $$\rho^* + h(s) = \max_{a \in A} \Big[ r(s, a) + \sum_{s'} p(s' \mid s, a)\, h(s') \Big] \quad \text{for all } s \in S.$$
- For any optimal stationary policy $\pi_i$ and any state $s$, one has
  $$\rho^* + h(s) = r(s, \pi_i(s)) + \sum_{s'} p(s' \mid s, \pi_i(s))\, h(s'),$$
  i.e., the action $\pi_i(s)$ attains the maximum above. The key point is that all the $\pi_i$ share this property at each state; thus, picking any of the maximizing actions $\pi_i(s)$ at any state maintains the equality, and hence the optimal gain.
- The unichain property ensures that the long-run (steady-state) behavior, and hence the average reward, does not depend on the initial state or on the transient portion of the trajectory, so the gain of a stationary policy is a single well-defined scalar.
In the mixture case, the average reward of the randomized policy is a convex combination of the one-step rewards of the mixed actions, weighted by the stationary distribution of the induced chain, which by the unichain property still has a single recurrent class. Since each mixture component is optimal, the mixture remains optimal. This is formalized by considering occupancy measures or by writing the expected average reward as an expectation over the randomization, as in the short derivation below.
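To make the averaging argument explicit, here is the standard calculation sketched in the notation above (not a quotation from a specific source). Let $\pi(a \mid s)$ be a stationary randomized policy whose support at each state is contained in the set of maximizing actions of the optimality equation. Averaging the per-action equality over $\pi(\cdot \mid s)$ gives

$$\rho^* + h(s) = r_\pi(s) + (P_\pi h)(s), \qquad r_\pi(s) = \sum_a \pi(a \mid s)\, r(s, a), \quad (P_\pi h)(s) = \sum_a \pi(a \mid s) \sum_{s'} p(s' \mid s, a)\, h(s').$$

Multiplying by the stationary distribution $\mu_\pi$ of the induced chain (unique precisely because of the unichain property) and summing over $s$, the bias terms cancel since $\mu_\pi P_\pi = \mu_\pi$, leaving

$$\rho^\pi = \sum_s \mu_\pi(s)\, r_\pi(s) = \rho^*.$$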
4. Implications for the Structure of the Set of Optimal Policies
The set of optimal stationary Markov policies in a unichain MDP is closed under both pointwise combination and randomization (mixture); in particular, it is convex. More precisely, the optimal actions at each state can be selected independently of those at other states (a product structure over states), and closure under randomization means that the extreme points, namely the deterministic stationary optimal policies, generate the entire optimal set via convex combinations; a numerical illustration appears after the list below.
This property supplies several practical and theoretical tools:
- It facilitates robustness: one may arbitrarily combine optimal strategies, even when facing model uncertainty or perturbations, without risking suboptimality.
- It enables the design and justification of randomized control schemes, especially in applications where randomization is required for exploration, fairness, or handling adversarial phenomena.
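The convex-combination claim can also be checked numerically. The sketch below (again on the assumed toy MDP from the earlier sketches, with an assumed helper `average_reward_randomized`) evaluates a stationary randomized policy that mixes only the optimal actions at each state and recovers the optimal gain.

```python
import numpy as np

# Same illustrative toy unichain MDP as in the earlier sketches (an assumption, not from the source).
P = np.array([
    [[0.5, 0.5], [0.9, 0.1], [0.5, 0.5]],
    [[0.6, 0.4], [0.2, 0.8], [0.5, 0.5]],
])
R = np.array([
    [0.5, 0.9, 0.2],
    [1.6, 1.2, 1.0],
])

def average_reward_randomized(pi, P, R):
    """Average reward of a stationary randomized policy pi[s, a] = Prob(action a | state s)."""
    P_pi = np.einsum("sa,sat->st", pi, P)   # induced transition matrix
    r_pi = np.einsum("sa,sa->s", pi, R)     # induced expected one-step reward per state
    n = P.shape[0]
    A = np.vstack([P_pi.T - np.eye(n), np.ones(n)])
    b = np.concatenate([np.zeros(n), [1.0]])
    mu = np.linalg.lstsq(A, b, rcond=None)[0]
    return float(mu @ r_pi)

# Mix only the two optimal actions {0, 1} at each state, with arbitrary weights:
# the resulting randomized stationary policy still attains the optimal gain (1.0 here).
pi_mixed = np.array([
    [0.3, 0.7, 0.0],   # state 0: randomize over the optimal actions 0 and 1 only
    [0.5, 0.5, 0.0],   # state 1: likewise
])
print(average_reward_randomized(pi_mixed, P, R))   # 1.0
```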
5. Computational and Algorithmic Consequences
From a computational perspective, these results guarantee that once all deterministic stationary optimal policies are found, no further search for optimal stationary policies is needed—all combinations and mixtures of these policies are optimal.
If the policy space is large, but the number of distinct optimal action choices per state is moderate, these results allow significant dimension reduction in policy search. For algorithms that output multiple optimal policies (e.g., via multiple optimal solutions of a linear programming formulation or by non-unique greedy choice in policy iteration), the assurance that all pointwise combinations and mixtures remain optimal prevents unnecessary recomputation and justifies strategies such as randomized tie-breaking in implementations.
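As a sketch of randomized tie-breaking in practice (under the assumption that the gain and bias are estimated by relative value iteration on the same toy MDP as above; the helper names and tolerances are illustrative), one can compute, per state, the set of actions attaining the maximum of the optimality equation and then sample a deterministic policy from those sets.

```python
import numpy as np

# Same illustrative toy unichain MDP as in the earlier sketches (an assumption, not from the source).
P = np.array([
    [[0.5, 0.5], [0.9, 0.1], [0.5, 0.5]],
    [[0.6, 0.4], [0.2, 0.8], [0.5, 0.5]],
])
R = np.array([
    [0.5, 0.9, 0.2],
    [1.6, 1.2, 1.0],
])

def relative_value_iteration(P, R, ref=0, iters=10_000, tol=1e-10):
    """Estimate the optimal gain rho*, a bias vector h, and the Q-values of the optimality equation."""
    h = np.zeros(R.shape[0])
    for _ in range(iters):
        Q = R + np.einsum("sat,t->sa", P, h)   # Q[s, a] = r(s, a) + sum_s' p(s'|s, a) h(s')
        V = Q.max(axis=1)
        g, h_new = V[ref], V - V[ref]          # normalize the bias at the reference state
        if np.max(np.abs(h_new - h)) < tol:
            return g, h_new, Q
        h = h_new
    return g, h, Q

rho_star, h, Q = relative_value_iteration(P, R)

# Per-state sets of actions attaining the maximum of the optimality equation (up to tolerance).
optimal_actions = [np.flatnonzero(np.isclose(Q[s], Q[s].max(), atol=1e-8)) for s in range(R.shape[0])]
print("rho* ~", round(rho_star, 6), "optimal action sets:", [a.tolist() for a in optimal_actions])

# Randomized tie-breaking: any such draw is a deterministic stationary optimal policy.
rng = np.random.default_rng(0)
policy = tuple(int(rng.choice(acts)) for acts in optimal_actions)
print("sampled optimal policy:", policy)
```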
6. Applications and Impact
This property of closure under combination and mixture is leveraged in various domains:
- Reinforcement learning: exploration schemes such as epsilon-greedy or softmax that randomize among actions known to be optimal are justified by these results; randomizing within the optimal action set incurs no suboptimality in the exploitation mode.
- Operations research and adaptive control: Randomized or composite policies constructed for robustness or risk sensitivity retain optimal performance under worst-case or adversarial models.
- Policy synthesis in multi-objective or constrained MDPs: If multiple optimal policies arise due to constraints, any scheduling or mixing among these optimal policies remains optimal under the original criteria.
7. Open Problems and Further Directions
Several avenues extend or build on these results:
- For multichain MDPs (where more than one recurrent class exists under some policies), similar closure properties often fail; optimal policies may not be closed under combination, as the average reward may vary across recurrent classes, raising new questions in the multichain regime.
- Extension to MDPs with partial observability (POMDPs), where stochastic policies are often necessary for optimality, but closure under mixture may require stronger or alternative conditions.
- Analysis of convergence rates and policy improvement dynamics when combining or mixing optimal policies, especially in approximate or reinforcement learning settings.
The convexity and closure properties of the set of optimal Markov policies in unichain MDPs provide foundational guarantees for both the design and deployment of robust, flexible, and efficient stochastic control policies.