
Group-Invariant Latent-Noise MDP

Updated 15 December 2025
  • Group-Invariant Latent-Noise MDP is a stochastic control model capturing large agent populations with shared latent common noise and permutation invariance.
  • It lifts the problem to an MDP on the space of probability measures and applies dynamic programming to optimize open-loop controls over an infinite horizon.
  • Relaxed (randomized) controls and optimal coupling techniques are essential for ensuring near-optimal collective policies under mean-field influences.

A Group-Invariant Latent-Noise Markov Decision Process (MDP), formalized as a conditional McKean–Vlasov MDP (CMKV-MDP), is a stochastic control framework modeling a large population of interacting agents under mean-field influences, incorporating a shared latent noise source. Optimization is performed over open-loop controls on an infinite time horizon. The defining features include permutation invariance across agents and the presence of common (macro) noise affecting the system collectively. Central constructs include the lifting of the MDP onto the space of probability measures, a dynamic programming formulation on this lifted space, and the necessity of relaxed (randomized) controls due to inherent continuity requirements. CMKV-MDPs have foundational applications in areas where social planners or influencers seek optimal collective strategies without access to individual-level information, operating only via environmental noises and population-level statistics (Motte et al., 2019).

1. Model Structure: Dynamics with Common Noise

The CMKV-MDP is specified on a compact Polish state space $X$, a compact Polish action space $A$, and noise spaces $E^0$ (common) and $E$ (idiosyncratic). Each agent receives initial information $\Gamma$, valued in a space $G$ and i.i.d. across the population. The agent's open-loop policy is a sequence $\pi = (\pi_0, \pi_1, \ldots)$ with $\pi_t : G \times E^t \times (E^0)^t \to A$, so that at time $t$,

$$\alpha_t = \pi_t(\Gamma, \epsilon_{0:t}, \epsilon^0_{0:t}),$$

determines the agent's action. State evolution follows

$$X_0 = \xi, \qquad X_{t+1} = F\big(X_t, \alpha_t, \text{Law}^0(X_t, \alpha_t), \epsilon_{t+1}, \epsilon^0_{t+1}\big),$$

where $\text{Law}^0(X_t, \alpha_t)$ is the conditional law given the common-noise filtration $\sigma(\epsilon^0_{0:t})$. The reward for an agent is

$$f\big(X_t, \alpha_t, \text{Law}^0(X_t, \alpha_t)\big),$$

and the planner aims to maximize the total discounted gain

$$J^{\pi} = \mathbb{E}\left[\sum_{t \geq 0} \beta^t f\big(X_t, \alpha_t, \text{Law}^0(X_t, \alpha_t)\big)\right],$$

by optimizing over open-loop policies $\pi$; the expectation averages over the initial condition, the idiosyncratic noises, and the common noise.
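
A minimal particle sketch can make these dynamics concrete. The code below simulates an $N$-agent approximation in which the conditional law $\text{Law}^0(X_t, \alpha_t)$ is replaced by the population's empirical measure (only its mean is used); the specific transition map $F$, reward $f$, and policy are illustrative assumptions, not taken from the source.

```python
import numpy as np

def simulate_cmkv(n_agents=1000, horizon=50, beta=0.9, seed=0):
    """Particle (empirical-measure) approximation of the CMKV dynamics.

    All agents share the common noise eps0_t and draw idiosyncratic noises
    independently; the conditional law Law^0(X_t, alpha_t) is replaced by the
    empirical distribution of the population (here summarized by its mean).
    The transition F, reward f, and policy are illustrative placeholders.
    """
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1.0, 1.0, size=n_agents)    # initial states xi, i.i.d.
    gain = 0.0
    for t in range(horizon):
        eps0 = rng.normal()                       # common (macro) noise, shared by all
        eps = rng.normal(size=n_agents)           # idiosyncratic noises
        m = X.mean()                              # population statistic of Law^0(X_t)
        # Randomized policy: own state, population mean, private randomization.
        alpha = -0.5 * X + 0.1 * m + 0.05 * rng.normal(size=n_agents)
        # Stage reward f(x, a, mu), averaged over the population.
        gain += beta ** t * (-((X - m) ** 2 + 0.1 * alpha ** 2)).mean()
        # Transition F(x, a, mu, eps, eps0) with a mean-field term 0.2 * m.
        X = 0.9 * X + alpha + 0.2 * m + 0.1 * eps + 0.1 * eps0
    return gain

if __name__ == "__main__":
    print("empirical discounted gain:", round(simulate_cmkv(), 4))
```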

2. Permutation Invariance and the Role of Latent Common Noise

Permutation invariance, also referred to as mean-field or de Finetti invariance, arises because agents interact only through population empirical measures; relabeling the indices has no effect on the system's law. The common noise component $\epsilon^0$ influences all agents identically and acts as a latent public or macro noise—the agents observe and condition their strategies on this macro-level uncertainty, which can model phenomena such as macroeconomic shocks (Motte et al., 2019).
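
As a small numerical illustration of this invariance (a toy check, not from the source): relabeling agents permutes the state vector but leaves the empirical measure unchanged, so any population-level statistic entering the dynamics or the planner's objective is unaffected.

```python
import numpy as np

# Relabeling the agents permutes the state vector but not the empirical measure.
rng = np.random.default_rng(0)
states = rng.integers(0, 3, size=10)              # states of 10 agents in X = {0, 1, 2}
perm = rng.permutation(10)                        # an arbitrary relabeling of the agents
hist_before = np.bincount(states, minlength=3) / 10
hist_after = np.bincount(states[perm], minlength=3) / 10
assert np.array_equal(hist_before, hist_after)    # same population law
print("empirical measure:", hist_before)
```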

3. Lifting to Probability Measure Space and the Bellman Equation

The system admits a lifted MDP representation on the space of probability measures:

  • At time $t$, the population law is $\mu_t = \text{Law}^0(X_t) \in \mathcal{P}(X)$,
  • Relaxed controls are kernels $\hat{\nu}_t(x) = \text{Law}^0(\alpha_t \mid X_t = x) \in \mathcal{P}(A)$,
  • The joint law is $\text{Law}^0(X_t, \alpha_t) = \mu_t \otimes \hat{\nu}_t$.

The population law evolves via a measurable update

$$\mu_{t+1} = \Phi(\mu_t, \hat{\nu}_t, \epsilon^0_{t+1}),$$

with stage reward

$$\hat{f}(\mu, \hat{\nu}) = \int_{X \times A} f\big(x, a, \mu \otimes \hat{\nu}\big)\, (\mu \otimes \hat{\nu})(dx, da).$$

The equivalent lifted MDP has state space $\mathcal{P}(X)$, action space $\mathcal{P}(X \times A)$ (joint laws whose first marginal equals the current state $\mu$), and transition $\mu \mapsto \Phi(\mu, \nu, \epsilon^0)$. The dynamic programming operator $T$, acting on bounded measurable $V : \mathcal{P}(X) \to \mathbb{R}$, is

$$(TV)(\mu) = \sup_{\substack{\nu \in \mathcal{P}(X \times A) \\ \nu(\,\cdot\, \times A) = \mu}} \left\{ \hat{f}(\mu, \nu) + \beta\, \mathbb{E}\left[V\big(\Phi(\mu, \nu, \epsilon^0_1)\big)\right] \right\}.$$

The operator $T$ is a $\beta$-contraction on bounded functions and thus admits a unique fixed point $V^*$; under a Lipschitz continuity assumption $\text{HF}_\text{Lip}$, this fixed point coincides with the planner's value function (Motte et al., 2019).
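
As a concrete illustration of the lifted Bellman fixed point, the sketch below runs value iteration for $T$ on a toy model with state space $X = \{0,1\}$ and action space $A = \{0,1\}$: a population law $\mu$ is identified with $p = \mu(\{1\})$ and quantized to a grid, relaxed controls are kernels specified by randomization probabilities $q_x = \hat{\nu}(\{1\} \mid x)$, and the common noise takes two equally likely values. The transition map $\Phi$, the reward, and the grids are illustrative assumptions, not the model of the source.

```python
import itertools
import numpy as np

BETA = 0.9
P_GRID = np.linspace(0.0, 1.0, 51)   # quantization of P(X) via p = mu({1})
Q_GRID = np.linspace(0.0, 1.0, 11)   # randomization probabilities q = P(a=1 | x)

def next_p(p, q0, q1, eps0):
    """p' = Phi(mu, nu_hat, eps0)({1}) for the toy transition used here."""
    drift = 0.3 if eps0 == 1 else -0.3              # common (macro) shock
    def g(x, a):                                    # P(X_{t+1}=1 | x, a, mu, eps0)
        return float(np.clip(0.5 + 0.4 * a - 0.2 * x + 0.2 * (p - 0.5) + drift, 0.0, 1.0))
    return p * (q1 * g(1, 1) + (1 - q1) * g(1, 0)) \
        + (1 - p) * (q0 * g(0, 1) + (1 - q0) * g(0, 0))

def stage_reward(p, q0, q1):
    """f_hat(mu, nu_hat) with toy reward f(x, a, mu) = a*(1 - p) - 0.5*x."""
    def f(x, a):
        return a * (1 - p) - 0.5 * x
    return (1 - p) * (q0 * f(0, 1) + (1 - q0) * f(0, 0)) \
        + p * (q1 * f(1, 1) + (1 - q1) * f(1, 0))

def bellman(V):
    """One application of the lifted operator T on the quantized grid."""
    V_new = np.empty_like(V)
    for i, p in enumerate(P_GRID):
        best = -np.inf
        for q0, q1 in itertools.product(Q_GRID, Q_GRID):   # sup over relaxed controls
            cont = 0.0
            for eps0 in (0, 1):                             # expectation over common noise
                j = int(np.abs(P_GRID - next_p(p, q0, q1, eps0)).argmin())
                cont += 0.5 * V[j]                          # project p' onto the grid
            best = max(best, stage_reward(p, q0, q1) + BETA * cont)
        V_new[i] = best
    return V_new

V = np.zeros(len(P_GRID))
for _ in range(300):                         # fixed-point iteration (beta-contraction)
    V_next = bellman(V)
    done = np.max(np.abs(V_next - V)) < 1e-8
    V = V_next
    if done:
        break
print("approximate V* at p = 0, 0.5, 1:", V[[0, 25, 50]].round(3))
```

Because the relaxed controls here are kernels, each candidate action corresponds to the joint law $\nu = \mu \otimes \hat{\nu}$, whose first marginal is automatically $\mu$, matching the constraint in the lifted formulation.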

4. Necessity of Relaxed (Randomized) Controls and Optimal Coupling

Standard deterministic feedback controls are not generally sufficient for optimality under the continuity requirements of the lifted problem. It is necessary to employ relaxed (measure-valued) controls: for each population law $\mu$, a joint law $\nu \in \mathcal{P}(X \times A)$ whose disintegration kernel randomizes the action at each state $x \in X$. A technical foundation for this approach is an optimal coupling construction for measures:

  • There exists a measurable map $\zeta : \mathcal{P}(X) \times \mathcal{P}(X) \times X \times [0,1] \to X$ such that, for $\xi \sim \mu$ and $U \sim \text{Unif}[0,1]$ independent,

$$\zeta(\mu, \mu', \xi, U) \sim \mu', \quad \text{and} \quad \mathbb{E}\big[d\big(\xi, \zeta(\mu, \mu', \xi, U)\big)\big] = W(\mu, \mu'),$$

where $W$ denotes the Wasserstein distance. The $\zeta$-map is central to verifying continuity of the value function and to constructing $\epsilon$-optimal feedback policies via quantization of $\mathcal{P}(X)$ (Motte et al., 2019).
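
On the real line the map $\zeta$ can be written explicitly: the quantile (comonotone) coupling attains the Wasserstein distance $W$ in one dimension, and the auxiliary uniform variable spreads the mass of any atom of $\mu$ across the CDF. The sketch below implements this for discrete laws on $\mathbb{R}$ (a simplifying assumption for illustration, not the general construction of the source) and checks the mean displacement against the exact one-dimensional Wasserstein distance.

```python
import numpy as np

def zeta(mu_supp, mu_w, nu_supp, nu_w, xi, u):
    """One-dimensional instance of the coupling map zeta.

    Given xi ~ mu (discrete law with sorted support mu_supp, weights mu_w) and
    u ~ Unif[0,1] independent of xi, returns a sample distributed as nu via the
    quantile (comonotone) coupling, which attains W(mu, nu) on the line.
    """
    i = int(np.searchsorted(mu_supp, xi))           # index of the atom xi in mu
    v = mu_w[:i].sum() + u * mu_w[i]                # v ~ Unif[0,1] by construction
    j = int(np.searchsorted(np.cumsum(nu_w), v))    # generalized inverse CDF of nu
    return nu_supp[min(j, len(nu_supp) - 1)]

# Sanity check on two laws with common support {0, 1, 2}: the empirical mean
# displacement E|xi - zeta(...)| should match the exact 1-D Wasserstein distance.
rng = np.random.default_rng(0)
mu_supp = np.array([0.0, 1.0, 2.0]); mu_w = np.array([0.5, 0.3, 0.2])
nu_supp = np.array([0.0, 1.0, 2.0]); nu_w = np.array([0.2, 0.3, 0.5])

samples = rng.choice(mu_supp, size=20_000, p=mu_w)
coupled = np.array([zeta(mu_supp, mu_w, nu_supp, nu_w, x, rng.uniform())
                    for x in samples])
exact_w1 = np.sum(np.abs(np.cumsum(mu_w)[:-1] - np.cumsum(nu_w)[:-1])
                  * np.diff(mu_supp))               # integral of |F_mu - F_nu|
print("mean displacement:", np.abs(samples - coupled).mean(), " W_1:", exact_w1)
```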

5. Existence and Construction of $\epsilon$-Optimal Randomized Feedback Policies

Assuming $\text{HF}_\text{Lip}$ and "richness" (atomlessness) of the initial $\sigma$-algebra, measurable selection and quantization arguments guarantee, for every $\epsilon > 0$, a measurable randomized feedback rule

$$a : \mathcal{P}(X) \times X \times [0,1] \to A,$$

such that, with i.i.d. uniform draws $U_t \sim \text{Unif}[0,1]$ independent of the state, taking $\alpha_t = a(\mu_t, X_t, U_t)$ yields a near-optimal policy: the optimal value $V^*$ can be attained, up to any prescribed tolerance, by stationary randomized feedback strategies. Specifically, Theorem 4.1 guarantees that for each $\epsilon > 0$ one can construct a randomized stationary policy $\alpha^\epsilon$ (using the $\zeta$ map and a quantization of $\mathcal{P}(X)$) satisfying

$$V(\xi) - V^{\alpha^\epsilon}(\xi) \leq \frac{\epsilon}{1-\beta},$$

with $V(\xi)$ the planner's value (Motte et al., 2019); the factor $1/(1-\beta)$ reflects an $\epsilon$-size loss incurred per step and accumulated over the discounted infinite horizon.
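
Structurally, such an $\epsilon$-optimal rule is a lookup: project the current population law onto a finite quantization of $\mathcal{P}(X)$, retrieve the stored randomization kernel for that cell, and use the agent's private uniform draw to select the action. The sketch below shows only this mechanism, with placeholder kernels standing in for the near-optimizers of the Bellman equation; it is not the construction of Theorem 4.1 itself.

```python
import numpy as np

# Randomized stationary feedback a : P(X) x X x [0,1] -> A via quantization.
# Toy setting: X = {0, 1}, A = {0, 1}, and a population law mu is summarized
# by p = mu({1}). KERNELS[i, x] is the probability of playing action 1 in
# state x when mu is quantized to P_GRID[i]; the values are placeholders.
P_GRID = np.linspace(0.0, 1.0, 11)
KERNELS = np.column_stack([1.0 - P_GRID, 0.5 * np.ones_like(P_GRID)])

def feedback(p, x, u):
    """a(mu, x, u): project mu onto the grid, then let u randomize the action."""
    i = int(np.abs(P_GRID - p).argmin())      # nearest quantization point of mu
    return int(u < KERNELS[i, x])             # action 1 with the stored probability

# The same rule is applied to every agent with its own independent uniform U_t,
# so actions depend on mu only through p and the profile stays exchangeable.
rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=8)                # a small sample of agent states
p = X.mean()
actions = [feedback(p, x, rng.uniform()) for x in X]
print("states :", X.tolist())
print("actions:", actions)
```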

6. Significance and Procedural Implications

The CMKV-MDP framework equivalently reformulates mean-field control problems with latent group-level noise as Bellman-fixed-point equations on population-measure spaces, admitting rigorous solution procedures even when optimization is over open-loop controls and only population-level distributions and environment noises are observable. The requirement for relaxed controls and optimal coupling arguments reflects deep differences from classical finite-agent MDPs, and leads to constructive procedures for generating near-optimal stationary randomized policies for large populations of cooperative agents (Motte et al., 2019).

