Group-Invariant Latent-Noise MDP
- Group-Invariant Latent-Noise MDP is a stochastic control model capturing large agent populations with shared latent common noise and permutation invariance.
- It employs a lifted MDP framework on the space of probability measures with dynamic programming to optimize open-loop controls over an infinite horizon.
- Relaxed (randomized) controls and optimal coupling techniques are essential for ensuring near-optimal collective policies under mean-field influences.
A Group-Invariant Latent-Noise Markov Decision Process (MDP), formalized as a conditional McKean–Vlasov MDP (CMKV-MDP), is a stochastic control framework modeling a large population of interacting agents under mean-field influences, incorporating a shared latent noise source. Optimization is performed over open-loop controls on an infinite time horizon. The defining features include permutation invariance across agents and the presence of common (macro) noise affecting the system collectively. Central constructs include the lifting of the MDP onto the space of probability measures, a dynamic programming formulation on this lifted space, and the necessity of relaxed (randomized) controls due to inherent continuity requirements. CMKV-MDPs have foundational applications in areas where social planners or influencers seek optimal collective strategies without access to individual-level information, operating only via environmental noises and population-level statistics (Motte et al., 2019).
1. Model Structure: Dynamics with Common Noise
The CMKV-MDP is specified on a compact Polish state space $\mathcal{X}$, a compact Polish action space $A$, and noise spaces $E^0$ (common) and $E$ (idiosyncratic). Each agent receives initial information $\Gamma$, i.i.d. across the population. The agent's open-loop policy is a sequence $\alpha = (\alpha_t)_{t \in \mathbb{N}}$ with $\alpha_t$ measurable with respect to $\sigma\big(\Gamma, (\varepsilon^0_s)_{s \le t}, (\varepsilon_s)_{s \le t}\big)$, so that at time $t$ the initial information together with the observed noise history determines the agent's action. State evolution follows

$$X_{t+1} = F\big(X_t, \alpha_t, \mathbb{P}^0_{(X_t, \alpha_t)}, \varepsilon_{t+1}, \varepsilon^0_{t+1}\big),$$

where $\mathbb{P}^0_{(X_t, \alpha_t)}$ is the conditional law of $(X_t, \alpha_t)$ given the common-noise filtration $\mathcal{F}^0_t = \sigma(\varepsilon^0_s,\, s \le t)$. The reward for an agent is $f\big(X_t, \alpha_t, \mathbb{P}^0_{(X_t, \alpha_t)}\big)$, and the planner aims to maximize the total discounted gain

$$V(\alpha) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \beta^t\, f\big(X_t, \alpha_t, \mathbb{P}^0_{(X_t, \alpha_t)}\big)\Big], \qquad \beta \in [0, 1),$$

by selecting over open-loop policies $\alpha$.
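The following minimal sketch simulates a finite population of $N$ agents under dynamics of this form, using the empirical mean as a stand-in for the conditional law $\mathbb{P}^0_{(X_t, \alpha_t)}$. The transition map `F`, reward `f`, control rule, and all numerical constants are hypothetical placeholders; only the information structure (one shared common noise per period, i.i.d. idiosyncratic noises, interaction through a population statistic) mirrors the model.

```python
# Finite-population sketch of the CMKV dynamics (hypothetical F, f, and control rule).
import numpy as np

rng = np.random.default_rng(0)
N, T, beta = 1000, 50, 0.9

def F(x, a, mean_field, eps, eps0):
    # Hypothetical dynamics: pull toward the population mean, plus idiosyncratic and common noise.
    return 0.8 * x + 0.1 * a + 0.1 * mean_field + eps + eps0

def f(x, a, mean_field):
    # Hypothetical reward: penalize dispersion around the population mean and control effort.
    return -((x - mean_field) ** 2 + 0.1 * a ** 2)

X = rng.normal(size=N)                   # initial states (i.i.d. initial information)
gain = 0.0
for t in range(T):
    eps0 = rng.normal(scale=0.1)         # common (macro) noise, shared by all agents
    eps = rng.normal(scale=0.1, size=N)  # idiosyncratic noises, i.i.d. across agents
    mean_field = X.mean()                # empirical stand-in for the conditional law
    A = -0.5 * (X - mean_field)          # an arbitrary control rule standing in for an open-loop policy
    gain += beta ** t * f(X, A, mean_field).mean()
    X = F(X, A, mean_field, eps, eps0)

print("empirical discounted gain:", gain)
```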
2. Permutation Invariance and the Role of Latent Common Noise
Permutation invariance, also referred to as mean-field or de Finetti invariance, arises because agents interact only through population empirical measures; relabeling the indices has no effect on the system's law. The common noise component influences all agents identically and acts as a latent public or macro noise—the agents observe and condition their strategies on this macro-level uncertainty, which can model phenomena such as macroeconomic shocks (Motte et al., 2019).
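A tiny illustration of this invariance, under the assumption of a finite label set: any quantity that depends on the agents only through their empirical measure is unchanged by relabeling.

```python
# Relabeling agents leaves the empirical measure (and anything built from it) unchanged.
import numpy as np

rng = np.random.default_rng(1)
states = rng.integers(0, 5, size=10)               # states of 10 agents, labels 0..4
perm = rng.permutation(10)                         # arbitrary relabeling of the agents

hist_original = np.bincount(states, minlength=5) / 10
hist_permuted = np.bincount(states[perm], minlength=5) / 10
assert np.allclose(hist_original, hist_permuted)   # empirical law is permutation-invariant
```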
3. Lifting to Probability Measure Space and the Bellman Equation
The system admits a lifted MDP representation on the space of probability measures $\mathcal{P}(\mathcal{X})$:
- At time $t$, the population law is $\mu_t = \mathbb{P}^0_{X_t} \in \mathcal{P}(\mathcal{X})$,
- Relaxed controls are kernels $\hat{a} : \mathcal{X} \to \mathcal{P}(A)$,
- The joint law is $\nu = \mu \cdot \hat{a}$, i.e. $\nu(dx, da) = \mu(dx)\, \hat{a}(x)(da)$.

The population law evolves via a measurable update

$$\mu_{t+1} = \hat{F}\big(\mu_t, \hat{a}_t, \varepsilon^0_{t+1}\big), \qquad \hat{F}(\mu, \hat{a}, e^0)(dx') = \int_{\mathcal{X} \times A} \mathbb{P}\big(F(x, a, \mu \cdot \hat{a}, \varepsilon_1, e^0) \in dx'\big)\, \mu(dx)\, \hat{a}(x)(da),$$

with stage reward

$$\hat{f}(\mu, \hat{a}) = \int_{\mathcal{X} \times A} f\big(x, a, \mu \cdot \hat{a}\big)\, \mu(dx)\, \hat{a}(x)(da).$$

The equivalent MDP is defined on $\mathcal{P}(\mathcal{X})$ (state), the set of kernels $\hat{a} : \mathcal{X} \to \mathcal{P}(A)$ (action), and transition $\hat{F}$. The dynamic programming operator for bounded measurable $W : \mathcal{P}(\mathcal{X}) \to \mathbb{R}$ is

$$\mathcal{T} W(\mu) = \sup_{\hat{a}} \Big\{ \hat{f}(\mu, \hat{a}) + \beta\, \mathbb{E}\big[ W\big(\hat{F}(\mu, \hat{a}, \varepsilon^0_1)\big) \big] \Big\}.$$

Under a Lipschitz continuity assumption on $F$ and $f$, $\mathcal{T}$ admits a unique fixed point $V$, which matches the planner's value function (Motte et al., 2019).
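A toy instance makes the lifted formulation concrete. The sketch below runs fixed-point (value) iteration for an operator of the form $\mathcal{T}$ on a quantized version of $\mathcal{P}(\mathcal{X})$, assuming a two-point state space, a two-point action space, and a two-valued common noise. The kernel `p`, reward `r`, discretizations, and constants are hypothetical; the point is the structure: measure-valued states, relaxed feedback kernels as actions, and averaging over the common noise.

```python
# Value iteration on a quantized P(X) for a hypothetical two-state, two-action model.
import itertools
import numpy as np

beta = 0.9
mu_grid = np.linspace(0.0, 1.0, 21)           # mu represented by mu({1}); quantized simplex
noise_vals, noise_probs = [0, 1], [0.5, 0.5]  # common noise distribution

def p(x, a, mean1, e0):
    # Hypothetical probability that the next individual state is 1.
    return min(max(0.2 + 0.5 * a + 0.2 * mean1 + 0.1 * e0 - 0.3 * x, 0.0), 1.0)

def r(x, a, mean1):
    # Hypothetical individual reward.
    return x - 0.1 * a - 0.5 * (x - mean1) ** 2

# Relaxed controls: a_hat maps each state x to a probability of playing action 1.
ctrl_grid = list(itertools.product(np.linspace(0.0, 1.0, 6), repeat=2))

def step(mu1, a_hat, e0):
    # Lifted transition: next population law under kernel a_hat and common noise e0.
    nxt = 0.0
    for x, wx in ((0, 1 - mu1), (1, mu1)):
        for a in (0, 1):
            pa = a_hat[x] if a == 1 else 1 - a_hat[x]
            nxt += wx * pa * p(x, a, mu1, e0)
    return nxt

def stage_reward(mu1, a_hat):
    return sum(wx * (a_hat[x] * r(x, 1, mu1) + (1 - a_hat[x]) * r(x, 0, mu1))
               for x, wx in ((0, 1 - mu1), (1, mu1)))

V = np.zeros_like(mu_grid)
for _ in range(100):                          # fixed-point iteration of the Bellman operator
    V_new = np.empty_like(V)
    for i, mu1 in enumerate(mu_grid):
        best = -np.inf
        for a_hat in ctrl_grid:               # sup over (discretized) relaxed controls
            cont = sum(pe * np.interp(step(mu1, a_hat, e0), mu_grid, V)
                       for e0, pe in zip(noise_vals, noise_probs))
            best = max(best, stage_reward(mu1, a_hat) + beta * cont)
        V_new[i] = best
    V = V_new

print("approximate value at mu({1}) = 0.5:", np.interp(0.5, mu_grid, V))
```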
4. Necessity of Relaxed (Randomized) Controls and Optimal Coupling
Standard deterministic feedback controls are not generally sufficient for optimality under continuity requirements. It is necessary to employ relaxed (measure-valued) controls, i.e., for each population law $\mu \in \mathcal{P}(\mathcal{X})$ a kernel $\hat{a} : \mathcal{X} \to \mathcal{P}(A)$ that randomizes the action for each state $x \in \mathcal{X}$. A technical foundation for this approach is an optimal coupling construction for measures:
- There exists a measurable $\xi : \mathcal{P}(\mathcal{X})^2 \times \mathcal{X} \times [0, 1] \to \mathcal{X}$, $(\mu, \mu', x, u) \mapsto \xi_{\mu, \mu'}(x, u)$, such that for $X \sim \mu$ and $U \sim \mathcal{U}([0, 1])$ independent,

$$\xi_{\mu, \mu'}(X, U) \sim \mu' \qquad \text{and} \qquad \mathbb{E}\big[ d\big(X, \xi_{\mu, \mu'}(X, U)\big) \big] = \mathcal{W}(\mu, \mu'),$$

where $\mathcal{W}$ denotes the Wasserstein distance. The $\xi$-map is central to verifying value function continuity and constructing $\epsilon$-optimal feedback policies via quantization of $\mathcal{P}(\mathcal{X})$ (Motte et al., 2019).
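A one-dimensional sketch of such a coupling map is given below: on the real line the comonotone (inverse-CDF) coupling is $\mathcal{W}_1$-optimal, and an extra uniform draw $U$ randomizes within the atoms of $\mu$, mirroring the role of $U$ in the measurable coupling above. The two discrete laws are hypothetical examples chosen for illustration.

```python
# Inverse-CDF coupling of two hypothetical discrete laws; E|X - Y| matches W1.
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical discrete laws mu and mu' (support points, probabilities).
mu_vals,  mu_probs  = np.array([0.0, 1.0, 2.0]), np.array([0.5, 0.3, 0.2])
mup_vals, mup_probs = np.array([0.5, 1.5, 3.0]), np.array([0.2, 0.5, 0.3])

def quantile(vals, probs, u):
    # Generalized inverse CDF of a discrete law, evaluated at u in (0, 1).
    return vals[np.searchsorted(np.cumsum(probs), u)]

def xi(x, u):
    # Coupling map: place x ~ mu uniformly inside its own CDF interval using the
    # independent uniform u, then push through the quantile function of mu'.
    i = np.searchsorted(mu_vals, x)
    cdf_left = np.cumsum(mu_probs)[i] - mu_probs[i]
    return quantile(mup_vals, mup_probs, cdf_left + u * mu_probs[i])

n = 200_000
X = quantile(mu_vals, mu_probs, rng.random(n))     # X ~ mu
U = rng.random(n)                                  # independent uniforms
Y = xi(X, U)                                       # Y ~ mu' by construction

def cdf(vals, probs, t):
    # CDF of a discrete law evaluated on a grid t.
    return probs @ (vals[:, None] <= t).astype(float)

# In one dimension, W1(mu, mu') is the integral of |CDF_mu - CDF_mu'|;
# the comonotone coupling attains it, so E|X - Y| should match this value.
grid = np.linspace(-1.0, 4.0, 5001)
diff = np.abs(cdf(mu_vals, mu_probs, grid) - cdf(mup_vals, mup_probs, grid))
w1 = np.sum(diff[:-1] * np.diff(grid))
print("E|X - Y| ~", np.abs(X - Y).mean(), "   W1(mu, mu') ~", w1)
```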
5. Existence and Construction of $\epsilon$-Optimal Randomized Feedback Policies
Assuming Lipschitz continuity of $(F, f)$ and "richness" (atomlessness) of the initial information $\sigma$-algebra $\sigma(\Gamma)$, measurable selection and quantization arguments guarantee, for all $\epsilon > 0$, a measurable randomized feedback rule

$$\mathfrak{a}_\epsilon : \mathcal{P}(\mathcal{X}) \times \mathcal{X} \to \mathcal{P}(A),$$

such that, with $(U_t)_t$ i.i.d. uniform random variables independent of $(\Gamma, (\varepsilon_t)_t, (\varepsilon^0_t)_t)$, taking $\alpha_t \sim \mathfrak{a}_\epsilon(\mu_t, X_t)$ (sampled using $U_t$) yields a policy achieving value within $\epsilon$ of the optimum. Therefore, the optimal value can be attained (up to $\epsilon$) using stationary randomized feedback strategies. Theorem 4.1 guarantees that, for each $\epsilon > 0$, one can construct a randomized stationary policy $\alpha^\epsilon$ (utilizing the $\xi$-map and a quantized $\mathcal{P}(\mathcal{X})$) satisfying

$$V(\alpha^\epsilon) \geq V - \epsilon,$$

with $V$ the planner's value (Motte et al., 2019).
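The sketch below shows how such a stationary randomized feedback policy is executed in a finite population: each agent draws its action from a kernel $\mathfrak{a}(\mu_t, X_t)$ using its own independent uniform $U_t$, which is exactly the extra randomization that relaxed controls require. The kernel and dynamics are hypothetical stand-ins, not the construction of Theorem 4.1 itself.

```python
# Executing a (hypothetical) stationary randomized feedback policy for N agents.
import numpy as np

rng = np.random.default_rng(3)
N, T = 500, 30
actions = np.array([0, 1])

def a_hat(mu1, x):
    # Hypothetical randomized feedback kernel a_hat(mu, x) in P(A), given by the
    # probability of playing action 1 as a function of mu({1}) and the own state x.
    p1 = 0.3 + 0.4 * mu1 if x == 0 else 0.8 - 0.3 * mu1
    return np.array([1 - p1, p1])

def next_state(x, a, mu1, eps, eps0):
    # Hypothetical individual transition driven by idiosyncratic and common noise.
    p1 = min(max(0.2 + 0.5 * a + 0.2 * mu1 + 0.1 * eps0 - 0.1 * x, 0.0), 1.0)
    return int(eps < p1)

X = rng.integers(0, 2, size=N)                       # initial states of the N agents
for t in range(T):
    mu1 = X.mean()                                   # empirical population law (mass at state 1)
    eps0 = rng.integers(0, 2)                        # common noise, shared by all agents
    eps = rng.random(N)                              # idiosyncratic noises
    U = rng.random(N)                                # independent uniforms, one per agent
    # Inverse-transform sampling: each agent draws its action from a_hat(mu_t, X_t) via U_t.
    A = np.array([actions[np.searchsorted(np.cumsum(a_hat(mu1, x)), u)]
                  for x, u in zip(X, U)])
    X = np.array([next_state(x, a, mu1, e, eps0)
                  for x, a, e in zip(X, A, eps)])

print("final empirical law mu({1}) ~", X.mean())
```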
6. Significance and Procedural Implications
The CMKV-MDP framework reformulates mean-field control problems with latent group-level noise as Bellman fixed-point equations on spaces of population measures, admitting rigorous solution procedures even when optimization is over open-loop controls and only population-level distributions and environmental noises are observable. The need for relaxed controls and optimal coupling arguments reflects deep differences from classical finite-agent MDPs, and leads to constructive procedures for generating near-optimal stationary randomized policies for large populations of cooperative agents (Motte et al., 2019).