Group-Invariant Latent-Noise MDP
- Group-Invariant Latent-Noise MDP is a stochastic control model capturing large agent populations with shared latent common noise and permutation invariance.
- It employs a lifted MDP framework on the space of probability measures with dynamic programming to optimize open-loop controls over an infinite horizon.
- Relaxed (randomized) controls and optimal coupling techniques are essential for ensuring near-optimal collective policies under mean-field influences.
A Group-Invariant Latent-Noise Markov Decision Process (MDP), formalized as a conditional McKean–Vlasov MDP (CMKV-MDP), is a stochastic control framework modeling a large population of interacting agents under mean-field influences, incorporating a shared latent noise source. Optimization is performed over open-loop controls on an infinite time horizon. The defining features include permutation invariance across agents and the presence of common (macro) noise affecting the system collectively. Central constructs include the lifting of the MDP onto the space of probability measures, a dynamic programming formulation on this lifted space, and the necessity of relaxed (randomized) controls due to inherent continuity requirements. CMKV-MDPs have foundational applications in areas where social planners or influencers seek optimal collective strategies without access to individual-level information, operating only via environmental noises and population-level statistics (Motte et al., 2019).
1. Model Structure: Dynamics with Common Noise
The CMKV-MDP is specified on a compact Polish state space $\mathcal{X}$, a compact Polish action space $A$, and noise spaces $E^0$ (common) and $E$ (idiosyncratic). Each agent receives initial information $\Gamma$, i.i.d. across the population. The agent's open-loop policy is a sequence $\alpha = (\alpha_t)_{t \in \mathbb{N}}$ with $\alpha_t$ measurable with respect to $\sigma\big(\Gamma, (\varepsilon^0_s)_{s \le t}, (\varepsilon_s)_{s \le t}\big)$, so that at time $t$ the initial information together with the observed noise history determines the agent's action. State evolution follows

$$X_{t+1} = F\big(X_t, \alpha_t, \mathbb{P}^0_{(X_t, \alpha_t)}, \varepsilon_{t+1}, \varepsilon^0_{t+1}\big),$$

where $\mathbb{P}^0_{(X_t, \alpha_t)}$ is the conditional law of $(X_t, \alpha_t)$ given the common-noise filtration $\mathcal{F}^0_t = \sigma(\varepsilon^0_s,\, s \le t)$. The reward for an agent is $f\big(X_t, \alpha_t, \mathbb{P}^0_{(X_t, \alpha_t)}\big)$, and the planner aims to maximize the total discounted gain

$$V(\alpha) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \beta^t\, f\big(X_t, \alpha_t, \mathbb{P}^0_{(X_t, \alpha_t)}\big)\Big], \qquad \beta \in [0, 1),$$

by selecting over open-loop policies $\alpha$.
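The following minimal sketch simulates a finite population of $N$ agents under dynamics of this form, using the empirical mean as a stand-in for the conditional law $\mathbb{P}^0_{(X_t, \alpha_t)}$. The transition map `F`, reward `f`, control rule, and all numerical constants are hypothetical placeholders; only the information structure (one shared common noise per period, i.i.d. idiosyncratic noises, interaction through a population statistic) mirrors the model.

```python
# Finite-population sketch of the CMKV dynamics (hypothetical F, f, and control rule).
import numpy as np

rng = np.random.default_rng(0)
N, T, beta = 1000, 50, 0.9

def F(x, a, mean_field, eps, eps0):
    # Hypothetical dynamics: pull toward the population mean, plus idiosyncratic and common noise.
    return 0.8 * x + 0.1 * a + 0.1 * mean_field + eps + eps0

def f(x, a, mean_field):
    # Hypothetical reward: penalize dispersion around the population mean and control effort.
    return -((x - mean_field) ** 2 + 0.1 * a ** 2)

X = rng.normal(size=N)                   # initial states (i.i.d. initial information)
gain = 0.0
for t in range(T):
    eps0 = rng.normal(scale=0.1)         # common (macro) noise, shared by all agents
    eps = rng.normal(scale=0.1, size=N)  # idiosyncratic noises, i.i.d. across agents
    mean_field = X.mean()                # empirical stand-in for the conditional law
    A = -0.5 * (X - mean_field)          # an arbitrary control rule standing in for an open-loop policy
    gain += beta ** t * f(X, A, mean_field).mean()
    X = F(X, A, mean_field, eps, eps0)

print("empirical discounted gain:", gain)
```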
2. Permutation Invariance and the Role of Latent Common Noise
Permutation invariance, also referred to as mean-field or de Finetti invariance, arises because agents interact only through population empirical measures; relabeling the indices has no effect on the system's law. The common noise component influences all agents identically and acts as a latent public or macro noise—the agents observe and condition their strategies on this macro-level uncertainty, which can model phenomena such as macroeconomic shocks (Motte et al., 2019).
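A tiny illustration of this invariance, under the assumption of a finite label set: any quantity that depends on the agents only through their empirical measure is unchanged by relabeling.

```python
# Relabeling agents leaves the empirical measure (and anything built from it) unchanged.
import numpy as np

rng = np.random.default_rng(1)
states = rng.integers(0, 5, size=10)               # states of 10 agents, labels 0..4
perm = rng.permutation(10)                         # arbitrary relabeling of the agents

hist_original = np.bincount(states, minlength=5) / 10
hist_permuted = np.bincount(states[perm], minlength=5) / 10
assert np.allclose(hist_original, hist_permuted)   # empirical law is permutation-invariant
```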
3. Lifting to Probability Measure Space and the Bellman Equation
The system admits a lifted MDP representation on the space of probability measures $\mathcal{P}(\mathcal{X})$:
- At time $t$, the population law is $\mu_t = \mathbb{P}^0_{X_t} \in \mathcal{P}(\mathcal{X})$,
- Relaxed controls are kernels $\hat{a} : \mathcal{X} \to \mathcal{P}(A)$,
- The joint law is $\nu = \mu \cdot \hat{a}$, i.e. $\nu(dx, da) = \mu(dx)\, \hat{a}(x)(da)$.

The population law evolves via a measurable update

$$\mu_{t+1} = \hat{F}\big(\mu_t, \hat{a}_t, \varepsilon^0_{t+1}\big), \qquad \hat{F}(\mu, \hat{a}, e^0)(dx') = \int_{\mathcal{X} \times A} \mathbb{P}\big(F(x, a, \mu \cdot \hat{a}, \varepsilon_1, e^0) \in dx'\big)\, \mu(dx)\, \hat{a}(x)(da),$$

with stage reward

$$\hat{f}(\mu, \hat{a}) = \int_{\mathcal{X} \times A} f\big(x, a, \mu \cdot \hat{a}\big)\, \mu(dx)\, \hat{a}(x)(da).$$

The equivalent MDP is defined on $\mathcal{P}(\mathcal{X})$ (state), the set of kernels $\hat{a} : \mathcal{X} \to \mathcal{P}(A)$ (action), and transition $\hat{F}$. The dynamic programming operator for bounded measurable $W : \mathcal{P}(\mathcal{X}) \to \mathbb{R}$ is

$$\mathcal{T} W(\mu) = \sup_{\hat{a}} \Big\{ \hat{f}(\mu, \hat{a}) + \beta\, \mathbb{E}\big[ W\big(\hat{F}(\mu, \hat{a}, \varepsilon^0_1)\big) \big] \Big\}.$$

Under a Lipschitz continuity assumption on $F$ and $f$, $\mathcal{T}$ admits a unique fixed point $V$, which matches the planner's value function (Motte et al., 2019).
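A toy instance makes the lifted formulation concrete. The sketch below runs fixed-point (value) iteration for an operator of the form $\mathcal{T}$ on a quantized version of $\mathcal{P}(\mathcal{X})$, assuming a two-point state space, a two-point action space, and a two-valued common noise. The kernel `p`, reward `r`, discretizations, and constants are hypothetical; the point is the structure: measure-valued states, relaxed feedback kernels as actions, and averaging over the common noise.

```python
# Value iteration on a quantized P(X) for a hypothetical two-state, two-action model.
import itertools
import numpy as np

beta = 0.9
mu_grid = np.linspace(0.0, 1.0, 21)           # mu represented by mu({1}); quantized simplex
noise_vals, noise_probs = [0, 1], [0.5, 0.5]  # common noise distribution

def p(x, a, mean1, e0):
    # Hypothetical probability that the next individual state is 1.
    return min(max(0.2 + 0.5 * a + 0.2 * mean1 + 0.1 * e0 - 0.3 * x, 0.0), 1.0)

def r(x, a, mean1):
    # Hypothetical individual reward.
    return x - 0.1 * a - 0.5 * (x - mean1) ** 2

# Relaxed controls: a_hat maps each state x to a probability of playing action 1.
ctrl_grid = list(itertools.product(np.linspace(0.0, 1.0, 6), repeat=2))

def step(mu1, a_hat, e0):
    # Lifted transition: next population law under kernel a_hat and common noise e0.
    nxt = 0.0
    for x, wx in ((0, 1 - mu1), (1, mu1)):
        for a in (0, 1):
            pa = a_hat[x] if a == 1 else 1 - a_hat[x]
            nxt += wx * pa * p(x, a, mu1, e0)
    return nxt

def stage_reward(mu1, a_hat):
    return sum(wx * (a_hat[x] * r(x, 1, mu1) + (1 - a_hat[x]) * r(x, 0, mu1))
               for x, wx in ((0, 1 - mu1), (1, mu1)))

V = np.zeros_like(mu_grid)
for _ in range(100):                          # fixed-point iteration of the Bellman operator
    V_new = np.empty_like(V)
    for i, mu1 in enumerate(mu_grid):
        best = -np.inf
        for a_hat in ctrl_grid:               # sup over (discretized) relaxed controls
            cont = sum(pe * np.interp(step(mu1, a_hat, e0), mu_grid, V)
                       for e0, pe in zip(noise_vals, noise_probs))
            best = max(best, stage_reward(mu1, a_hat) + beta * cont)
        V_new[i] = best
    V = V_new

print("approximate value at mu({1}) = 0.5:", np.interp(0.5, mu_grid, V))
```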
4. Necessity of Relaxed (Randomized) Controls and Optimal Coupling
Standard deterministic feedback controls are not generally sufficient for optimality under continuity requirements. It is necessary to employ relaxed (measure-valued) controls, i.e., for each population law $\mu \in \mathcal{P}(\mathcal{X})$ a kernel $\hat{a} : \mathcal{X} \to \mathcal{P}(A)$ that randomizes the action for each state $x \in \mathcal{X}$. A technical foundation for this approach is an optimal coupling construction for measures:
- There exists a measurable $\xi : \mathcal{P}(\mathcal{X})^2 \times \mathcal{X} \times [0, 1] \to \mathcal{X}$, $(\mu, \mu', x, u) \mapsto \xi_{\mu, \mu'}(x, u)$, such that for $X \sim \mu$ and $U \sim \mathcal{U}([0, 1])$ independent,

$$\xi_{\mu, \mu'}(X, U) \sim \mu' \qquad \text{and} \qquad \mathbb{E}\big[ d\big(X, \xi_{\mu, \mu'}(X, U)\big) \big] = \mathcal{W}(\mu, \mu'),$$

where $\mathcal{W}$ denotes the Wasserstein distance. The $\xi$-map is central to verifying value function continuity and constructing $\epsilon$-optimal feedback policies via quantization of $\mathcal{P}(\mathcal{X})$ (Motte et al., 2019).
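A one-dimensional sketch of such a coupling map is given below: on the real line the comonotone (inverse-CDF) coupling is $\mathcal{W}_1$-optimal, and an extra uniform draw $U$ randomizes within the atoms of $\mu$, mirroring the role of $U$ in the measurable coupling above. The two discrete laws are hypothetical examples chosen for illustration.

```python
# Inverse-CDF coupling of two hypothetical discrete laws; E|X - Y| matches W1.
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical discrete laws mu and mu' (support points, probabilities).
mu_vals,  mu_probs  = np.array([0.0, 1.0, 2.0]), np.array([0.5, 0.3, 0.2])
mup_vals, mup_probs = np.array([0.5, 1.5, 3.0]), np.array([0.2, 0.5, 0.3])

def quantile(vals, probs, u):
    # Generalized inverse CDF of a discrete law, evaluated at u in (0, 1).
    return vals[np.searchsorted(np.cumsum(probs), u)]

def xi(x, u):
    # Coupling map: place x ~ mu uniformly inside its own CDF interval using the
    # independent uniform u, then push through the quantile function of mu'.
    i = np.searchsorted(mu_vals, x)
    cdf_left = np.cumsum(mu_probs)[i] - mu_probs[i]
    return quantile(mup_vals, mup_probs, cdf_left + u * mu_probs[i])

n = 200_000
X = quantile(mu_vals, mu_probs, rng.random(n))     # X ~ mu
U = rng.random(n)                                  # independent uniforms
Y = xi(X, U)                                       # Y ~ mu' by construction

def cdf(vals, probs, t):
    # CDF of a discrete law evaluated on a grid t.
    return probs @ (vals[:, None] <= t).astype(float)

# In one dimension, W1(mu, mu') is the integral of |CDF_mu - CDF_mu'|;
# the comonotone coupling attains it, so E|X - Y| should match this value.
grid = np.linspace(-1.0, 4.0, 5001)
diff = np.abs(cdf(mu_vals, mu_probs, grid) - cdf(mup_vals, mup_probs, grid))
w1 = np.sum(diff[:-1] * np.diff(grid))
print("E|X - Y| ~", np.abs(X - Y).mean(), "   W1(mu, mu') ~", w1)
```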
5. Existence and Construction of $\epsilon$-Optimal Randomized Feedback Policies
Assuming Lipschitz continuity of $(F, f)$ and "richness" (atomlessness) of the initial information $\sigma$-algebra $\sigma(\Gamma)$, measurable selection and quantization arguments guarantee, for all $\epsilon > 0$, a measurable randomized feedback rule

$$\mathfrak{a}_\epsilon : \mathcal{P}(\mathcal{X}) \times \mathcal{X} \to \mathcal{P}(A),$$

such that, with $(U_t)_t$ i.i.d. uniform random variables independent of $(\Gamma, (\varepsilon_t)_t, (\varepsilon^0_t)_t)$, taking $\alpha_t \sim \mathfrak{a}_\epsilon(\mu_t, X_t)$ (sampled using $U_t$) yields a policy achieving value within $\epsilon$ of the optimum. Therefore, the optimal value can be attained (up to $\epsilon$) using stationary randomized feedback strategies. Theorem 4.1 guarantees that, for each $\epsilon > 0$, one can construct a randomized stationary policy $\alpha^\epsilon$ (utilizing the $\xi$-map and a quantized $\mathcal{P}(\mathcal{X})$) satisfying

$$V(\alpha^\epsilon) \geq V - \epsilon,$$

with $V$ the planner's value (Motte et al., 2019).
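The sketch below shows how such a stationary randomized feedback policy is executed in a finite population: each agent draws its action from a kernel $\mathfrak{a}(\mu_t, X_t)$ using its own independent uniform $U_t$, which is exactly the extra randomization that relaxed controls require. The kernel and dynamics are hypothetical stand-ins, not the construction of Theorem 4.1 itself.

```python
# Executing a (hypothetical) stationary randomized feedback policy for N agents.
import numpy as np

rng = np.random.default_rng(3)
N, T = 500, 30
actions = np.array([0, 1])

def a_hat(mu1, x):
    # Hypothetical randomized feedback kernel a_hat(mu, x) in P(A), given by the
    # probability of playing action 1 as a function of mu({1}) and the own state x.
    p1 = 0.3 + 0.4 * mu1 if x == 0 else 0.8 - 0.3 * mu1
    return np.array([1 - p1, p1])

def next_state(x, a, mu1, eps, eps0):
    # Hypothetical individual transition driven by idiosyncratic and common noise.
    p1 = min(max(0.2 + 0.5 * a + 0.2 * mu1 + 0.1 * eps0 - 0.1 * x, 0.0), 1.0)
    return int(eps < p1)

X = rng.integers(0, 2, size=N)                       # initial states of the N agents
for t in range(T):
    mu1 = X.mean()                                   # empirical population law (mass at state 1)
    eps0 = rng.integers(0, 2)                        # common noise, shared by all agents
    eps = rng.random(N)                              # idiosyncratic noises
    U = rng.random(N)                                # independent uniforms, one per agent
    # Inverse-transform sampling: each agent draws its action from a_hat(mu_t, X_t) via U_t.
    A = np.array([actions[np.searchsorted(np.cumsum(a_hat(mu1, x)), u)]
                  for x, u in zip(X, U)])
    X = np.array([next_state(x, a, mu1, e, eps0)
                  for x, a, e in zip(X, A, eps)])

print("final empirical law mu({1}) ~", X.mean())
```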
6. Significance and Procedural Implications
The CMKV-MDP framework reformulates mean-field control problems with latent group-level noise as Bellman fixed-point equations on spaces of population measures, admitting rigorous solution procedures even when optimization is over open-loop controls and only population-level distributions and environmental noises are observable. The need for relaxed controls and optimal coupling arguments reflects deep differences from classical finite-agent MDPs, and leads to constructive procedures for generating near-optimal stationary randomized policies for large populations of cooperative agents (Motte et al., 2019).