Chattering Stationary Policies in MDPs
- Chattering stationary policies are defined as randomized mixtures of a finite set of deterministic policies that rapidly alternate among actions to achieve optimal outcomes in constrained MDPs.
- Algorithmic constructions leverage these policies in value and policy iteration methods, reducing error bounds and enhancing performance for infinite-horizon and approximate dynamic programming problems.
- In non-atomless and hybrid control settings, chattering policies enable tractable approximations of complex switching behaviors, ensuring effective regularization and verification in practical applications.
Chattering stationary policies are policies in Markov decision processes and control systems characterized by rapid or periodic alternation among a finite set of stationary (state-dependent, time-invariant) decision rules. The term “chattering” refers to the phenomenon where the control signal or policy oscillates among competing actions or sub-policies, potentially at high frequency or in a periodic non-stationary fashion. Such behaviors arise in both discrete-time Markov Decision Processes (MDPs) with complex constraints or multiple criteria, as well as in continuous-time or hybrid control systems with discontinuous or highly sensitive switching policies. Theoretical developments over the past decade have formalized the sufficiency and utility of chattering stationary policies—either as optimal solutions in constrained MDPs, as efficient algorithmic constructs in approximate dynamic programming, or as practical control strategies to regularize or exploit oscillatory switching.
1. Mathematical Formalization and General Properties
Chattering stationary policies can be rigorously defined as randomized stationary policies whose randomization kernel at each state assigns all probability mass to a finite set of deterministic stationary policies. More formally, a chattering stationary policy of order m is a stationary randomized policy such that, at each state, the action is chosen by mixing among at most m deterministic stationary policies. This notion is directly connected to the concept of finitely supported Young measures in stochastic control and occupation measure analysis.
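As a minimal sketch of this definition (with an invented toy state space, action assignments, and mixing weights), a chattering stationary policy of order m = 2 can be represented as a state-dependent mixture over two deterministic stationary policies:

```python
import numpy as np

rng = np.random.default_rng(0)

n_states, m = 5, 2  # m is the chattering order (illustrative values)

# Two deterministic stationary policies, each a map state -> action.
d1 = np.array([0, 1, 2, 0, 1])
d2 = np.array([2, 2, 0, 1, 0])
components = (d1, d2)

# State-dependent mixing weights over the m deterministic components;
# each row is a probability vector, so the randomization kernel at each
# state is supported on at most m deterministic policies.
weights = np.tile([0.7, 0.3], (n_states, 1))

def chattering_action(state: int) -> int:
    """Sample an action: pick a deterministic component according to the
    state's mixing weights, then follow that component at this state."""
    k = rng.choice(m, p=weights[state])
    return int(components[k][state])
```

Repeated calls at the same state alternate ("chatter") between the actions prescribed by d1 and d2, even though the policy itself is stationary.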
The foundational result from (Dufour et al., 6 Oct 2025) asserts that, in discrete-time, uniformly absorbing Markov Decision Processes (MDPs) with multiple reward criteria and a general measurable state space, the set of chattering stationary policies of order d+1 (where d is the dimension of the vector reward criterion) is sufficient. That is, any vector of rewards achievable via general (possibly history-dependent or randomized) policies can also be achieved by a chattering stationary policy mixing at most d+1 deterministic components. This sufficiency extends the classical Feinberg–Piunovskiy Theorem, which guarantees that deterministic stationary policies suffice in the special case of atomless MDPs.
This structure can be summarized:
| Model Class | Sufficient Policy Family | Reference |
|---|---|---|
| Atomless, absorbing MDP | Deterministic stationary policies | (Dufour et al., 6 Oct 2025) |
| General absorbing MDP | Chattering stationary policies | (Dufour et al., 6 Oct 2025) |
The mathematical characterization leverages convex analytic tools: every extreme point of the convex set of occupation measures (subject to linear constraints) corresponds to an occupation measure induced by some chattering stationary policy. Young measure theory (see Balder’s results cited in (Dufour et al., 6 Oct 2025)) ensures that any occupation measure can be approximated by such finitely supported kernels.
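The convex-analytic picture can be illustrated on a toy constrained problem. In the sketch below (a hypothetical single-state discounted MDP with invented rewards and costs, where the normalized occupation measure reduces to a probability vector over three actions), a single linear cost constraint forces the optimal occupation measure to randomize over exactly two actions, i.e., an extreme point corresponding to a chattering mixture of two deterministic policies:

```python
import numpy as np
from scipy.optimize import linprog

# Toy single-state MDP: three actions, reward r and cost c per action.
# The normalized discounted occupation measure is then just a
# probability vector rho over actions.
r = np.array([1.0, 0.5, 0.0])   # rewards (invented numbers)
c = np.array([1.0, 0.4, 0.0])   # costs (invented numbers)
budget = 0.5                    # constraint: expected cost <= budget

# Maximize r . rho  s.t.  rho >= 0, sum(rho) = 1, c . rho <= budget.
res = linprog(
    -r,                          # linprog minimizes, so negate rewards
    A_ub=[c], b_ub=[budget],
    A_eq=[np.ones(3)], b_eq=[1.0],
    bounds=[(0, None)] * 3,
)
rho = res.x
support = np.flatnonzero(rho > 1e-9)

print("occupation measure:", np.round(rho, 4))
print("support size:", support.size)
```

The LP solver returns a vertex of the feasible polytope; here the unique optimum puts mass on actions 0 and 1 only, so the optimal policy is a mixture of two deterministic decision rules (order 2, with d = 1 constraint).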
2. Algorithmic Construction and Approximation
Chattering stationary policies naturally arise in value and policy iteration algorithms for infinite-horizon discounted MDPs with approximate computations. In (Scherrer, 2012) and (Lesner et al., 2013), the concept is instantiated via periodic non-stationary policies that cycle through the last m policies generated during value or policy iteration. Instead of deploying only the last greedy stationary policy π_k, one executes the periodic sequence π_k, π_{k−1}, …, π_{k−m+1}, π_k, π_{k−1}, …, repeating with period m.
This cyclic use of the last m policies can be interpreted as chattering among them. The critical finding is that, for approximate dynamic programming with per-iteration errors bounded by ε, the performance bound for such a periodic non-stationary policy improves from 2γε/(1−γ)² (stationary case) to 2γε/((1−γ)(1−γᵐ)) when chattering over the last m policies. The improvement is especially pronounced as the discount factor γ approaches 1, i.e., for long-horizon problems.
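The two loss bounds reported in (Scherrer, 2012), 2γε/(1−γ)² for the last stationary greedy policy versus 2γε/((1−γ)(1−γᵐ)) for the m-periodic (chattering) policy, can be compared numerically; a small sketch with illustrative values of γ and ε:

```python
def stationary_bound(gamma: float, eps: float) -> float:
    """Classical loss bound for the last greedy stationary policy."""
    return 2 * gamma * eps / (1 - gamma) ** 2

def cyclic_bound(gamma: float, eps: float, m: int) -> float:
    """Loss bound when chattering periodically over the last m policies."""
    return 2 * gamma * eps / ((1 - gamma) * (1 - gamma ** m))

gamma, eps = 0.99, 0.1
for m in (1, 2, 10, 100):
    print(m, round(cyclic_bound(gamma, eps, m), 2))
```

For m = 1 the two bounds coincide; as m grows, the cyclic bound decreases monotonically toward 2γε/(1−γ), which is why the gain is largest when γ is close to 1.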
In approximate modified policy iteration (MPI), chattering policies also yield significantly tighter error propagation bounds. Explicit constructions and tight deterministic counterexamples show that these bounds are sharp (Lesner et al., 2013).
3. Role in Constrained, Multi-Objective, and Non-Atomless MDPs
In classical MDP theory with Borel state and action spaces, atomless transition structure ensures that deterministic stationary policies suffice for achieving optimal vectors of expected rewards in constrained or multi-objective formulations. When the atomless condition fails, (Dufour et al., 6 Oct 2025) proves that the role of deterministic stationary policies can be taken over by chattering stationary policies. Every occupation measure (subject to integral constraints) can be represented (or closely approximated) as a convex combination of at most d+1 occupation measures generated by deterministic stationary policies, each corresponding to a component of a chattering stationary policy.
The practical implication is that, even in non-atomless settings, policy search and verification can be restricted to a finite-dimensional space formed by these chattering mixtures, dramatically reducing the complexity of policy characterization and verification.
4. Connections to Control-Theoretic and Hybrid Systems
The term “chattering” also appears in optimal control theory to describe infinite or rapidly alternating switching, as in the Fuller, Zeno, or Robbins phenomenon (Caponigro et al., 2013). While in these continuous or hybrid control systems true optimal control may exhibit countably infinite switchings, practical implementation—especially for numerical methods—relies on regularization to produce chattering stationary (or periodic) control laws of bounded variation. Penalizing the total variation in the control (or number of boundary contacts in state-constrained problems) leads to quasi-optimal chattering controls whose performance can be made arbitrarily close to that of true optimal (but nonrealizable) chattering solutions as the penalty parameter vanishes. For instance, for hybrid systems experiencing Zeno behavior, the strategy is to approximate Zeno trajectories by chattering among a finite set of mode sequences to ensure numerical tractability while maintaining convergence guarantees (Caponigro et al., 2013).
In modern safety filtering and networked control, chattering stationary policies are either exploited or mitigated depending on their effect on system performance and physical implementability. Quantized stationary policies (Saldi et al., 2013) are used to approximate continuous stationary policies by policies that only take finitely many actions, thereby taming inadvertent high-frequency switching (chattering) while providing explicit error bounds.
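A uniform quantizer of this kind can be sketched as follows (a hedged illustration with an invented policy and grid; (Saldi et al., 2013) treats the general construction and its explicit error bounds): each continuous action is snapped to the nearest point of a finite action grid, yielding a stationary policy with finitely many actions.

```python
import numpy as np

def quantize_policy(policy, action_grid):
    """Return a stationary policy taking finitely many actions, obtained
    by mapping each continuous action to the nearest grid point."""
    grid = np.asarray(action_grid, dtype=float)

    def quantized(state):
        return float(grid[np.argmin(np.abs(grid - policy(state)))])

    return quantized

# Example: a hypothetical continuous-action policy and a 5-point grid.
continuous = lambda s: 0.3 * s                      # actions in [-1, 1]
q = quantize_policy(continuous, np.linspace(-1.0, 1.0, 5))
print(q(1.0))   # 0.3 snaps to the nearest grid point, 0.5
```

Refining the grid shrinks the gap between the continuous policy and its quantized version, which is the mechanism behind the rate-dependent error bounds mentioned above.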
5. Practical and Structural Implications
The sufficiency of chattering stationary policies in absorbing and multi-criteria MDPs enables a concrete and minimalistic policy class for both policy optimization and verification. This facilitates:
- Reduction in policy search: Only finitely supported randomizations over deterministic stationary policies need be considered.
- Convexity exploitation: The set of all occupation measures is convex, allowing efficient characterization of Pareto frontiers by analyzing extreme points, each of which corresponds to a chattering stationary policy (Dufour et al., 6 Oct 2025).
- Implementation advantages: Policy execution can be randomized over a small set of stationary policies, which is easier to store and communicate in distributed/decentralized scenarios.
- Analytic tractability: Occupation measure methods and linear programming approaches can be employed more effectively, with extreme points mapped to chattering stationary policies of bounded order.
However, practical considerations include:
- Switching complexity: Chattering among multiple policies may complicate actuation logic, particularly if physical switching costs are non-negligible.
- Non-stationarity in approximate methods: While theoretically advantageous for error minimization, periodic non-stationary (chattering) policies may be incongruous with actuator expectations of stationary behavior unless appropriate smoothing or quantization is imposed.
6. Extensions, Limitations, and Related Theories
Several generalizations and caveats about chattering stationary policies are identified:
- In systems with atomless probability transitions, deterministic stationary policies are enough, making chattering policies unnecessary for optimality (Dufour et al., 6 Oct 2025).
- In non-atomless settings or with noncompact action spaces, chattering stationary policies fill the gap, as proved using occupation measures, Young measure theory, and convex analysis.
- For continuous-time and average-cost problems with countable state space, continuity of the average performance function relative to a suitable metric on the policy space can rule out chattering phenomena and ensure existence of genuine stationary optimal policies (Xia et al., 2020).
- In dynamic systems with priorities or piecewise linear dynamics, the existence of stationary regimes (steady-state solutions as invariant half-lines) can be asserted even in the presence of “chattering” among linear regimes, as long as spectral and structural conditions are met (Allamigeon et al., 19 Nov 2024).
- Regularization by total variation penalization in continuous control transforms inherently chattering controls into quasi-optimal controls with bounded variation, with explicit convergence rates to optimal performance (Caponigro et al., 2013).
- Extensions to multiple criteria and discrete or continuous state/action spaces remain valid, with the order of chattering (number of deterministic policies mixed at any state) determined by the problem dimensionality and the structure of the occupation measure polytope.
- Synthesis and reduced representation of chattering policies often rely on extremal characterizations of convex sets of feasible occupation measures (Dubins, Rockafellar), measurable selection, and approximations via finitely supported kernels.
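The total-variation penalization idea in the list above can be illustrated on a toy discretized problem (all numbers invented; this is only the selection principle, not the actual scheme of (Caponigro et al., 2013)): an unpenalized tracking cost prefers a rapidly switching control, while adding a penalty λ·TV(u) selects a bounded-variation one.

```python
import numpy as np

def total_variation(u):
    """Total variation of a discretized control signal."""
    return float(np.sum(np.abs(np.diff(u))))

def penalized_cost(u, target, lam):
    """Quadratic tracking cost plus a total-variation switching penalty."""
    return float(np.sum((u - target) ** 2)) + lam * total_variation(u)

target = np.array([1.0, -1.0] * 4)   # a target that rewards chattering
chattering = target.copy()           # perfect tracking, but TV = 14
constant = np.zeros(8)               # no switching, tracking cost = 8

for lam in (0.0, 1.0):
    best = min((chattering, constant),
               key=lambda u: penalized_cost(u, target, lam))
    print(lam, "chattering" if total_variation(best) > 0 else "constant")
```

With λ = 0 the chattering control wins; with λ = 1 the constant control does, mirroring how vanishing penalty parameters recover quasi-optimal chattering behavior.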
7. Summary Table: Chattering Stationary Policies Across Contexts
| Domain/Setting | Chattering Stationary Policy Role | Key Impact / Theoretical Result |
|---|---|---|
| Absorbing, non-atomless discrete MDPs | Sufficient for (multi-criteria) optimality | Extreme points correspond to chattering policies |
| Infinite-horizon discounted MDPs (approx.) | Periodic non-stationary (chattering) | Improved performance guarantees in value iteration |
| Approximate dynamic programming (MPI, VI) | Periodic chattering improves error | Bound reduces from 2γε/(1−γ)² to 2γε/((1−γ)(1−γᵐ)) |
| Networked control/quantized policies | Taming chattering via quantization | Error bound in terms of quantizer rate |
| Continuous/hybrid control (Fuller/Zeno/etc.) | Regularizing chattering (bounded variation) | Quasi-optimal, numerically feasible approximations |
| Piecewise linear systems with priorities | Chattering among regimes defines steady state | Steady state "averaged" by rapid switching |
The concept of chattering stationary policies thus serves both as an analytic tool for representing optimal and near-optimal policies in the presence of constraints, non-compactness, or multiple criteria, and as an algorithmic strategy for improving bounds and tractability in dynamic programming and optimal control. Their sufficiency, structural properties, and practical implications depend sensitively on the class of stochastic process, the presence or absence of atomless transitions, and the performance criteria of interest.