
MrCoM: Meta-Regularized Contextual World-Model

Updated 16 November 2025
  • The paper presents MrCoM, a model-based reinforcement learning framework that decomposes latent states and applies meta-regularization to achieve robust cross-scenario generalization.
  • MrCoM employs a modularized architecture with a shallow Transformer for contextual encoding and a three-part latent-state decomposition handling stochastic, deterministic, and auxiliary elements.
  • Empirical evaluations show MrCoM outperforms baselines in handling dynamics, reward, and observation perturbations, backed by theoretical error bounds on generalization.

The Meta-Regularized Contextual World-Model (MrCoM) is a model-based reinforcement learning (MBRL) framework that addresses generalization in multi-scenario settings by building a unified, meta-regularized world model. MrCoM isolates latent representations aligned with dynamic characteristics and scenario relevance, regularizes both state and value representations via meta-objectives, and provides theoretical guarantees on the generalization gap. Empirical evaluations demonstrate that MrCoM attains superior generalization and robustness compared to contemporary world-model baselines under diverse alterations in environmental dynamics, rewards, and observations (Xiong et al., 9 Nov 2025).

1. Architecture and Components

MrCoM introduces a modularized architecture structured around scenario-agnostic and scenario-specific elements to facilitate cross-scenario transfer. The core elements are:

  • Contextual Encoder: At each time step $t$, a context window $C_t = \{o_{t-m}, a_{t-m}, \ldots, o_{t-1}, a_{t-1}\}$ of length $m$ is ingested by a shallow Transformer (1–2 layers, 3 heads) to extract contextual embeddings. This architectural choice enables scenario-conditional inference and prediction.
  • Latent-State Decomposition: The unified latent state $\tilde{s}_t$ is factorized into:
    • $u_t$ (stochastic): Encodes aleatoric uncertainty; governed by a Gaussian prior $p_0(u_t \mid C_t, a_t)$ and posterior $q(u_t \mid C_t, a_t, o_t)$.
    • $d_t$ (deterministic): A recurrent hidden state evolving as $p(d_t \mid d_{t-1}, u_t, a_t)$.
    • $h_t$ (auxiliary): Captures residual structure, with its own prior and posterior distributions.

A probabilistic decoder $p(o_t \mid u_t, d_t, h_t)$ reconstructs the original observations, tying the latent components to observed data. All modules map diverse input scenarios to a shared latent space $\tilde{S}$.

  • Policy and Value Heads: The learning framework integrates (a) scenario-specific value heads $v_{\psi_i}(\tilde{s}_t)$, (b) a shared meta-value head $v_\psi(\tilde{s}_t)$, and (c) a policy $\pi_\phi(a_t \mid \tilde{s}_t)$, all operating on the unified latent embedding.

This design partitions scenario-relevant structure within $u_t$ and $h_t$, captures temporal dependencies in $d_t$, and ensures that policies and value estimates generalize across scenarios through a shared representation.
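
The following minimal PyTorch sketch shows how such a contextual encoder, three-part latent decomposition, and decoder might be wired together. All class and attribute names, dimensions, the mean-pooling of context tokens, and the four-head attention configuration are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class ContextualWorldModel(nn.Module):
    """Illustrative MrCoM-style encoder + latent decomposition (not the official code)."""

    def __init__(self, obs_dim: int, act_dim: int, embed_dim: int = 128, latent_dim: int = 128):
        super().__init__()
        # Contextual encoder: a shallow Transformer over the (o, a) tokens in the window C_t.
        self.token_proj = nn.Linear(obs_dim + act_dim, embed_dim)
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=4, batch_first=True)
        self.context_encoder = nn.TransformerEncoder(layer, num_layers=1)

        # Stochastic latent u_t: Gaussian prior p0(u_t | C_t, a_t) and posterior q(u_t | C_t, a_t, o_t).
        self.prior_head = nn.Linear(embed_dim + act_dim, 2 * latent_dim)
        self.post_head = nn.Linear(embed_dim + act_dim + obs_dim, 2 * latent_dim)

        # Deterministic latent d_t: recurrent state p(d_t | d_{t-1}, u_t, a_t).
        self.rnn = nn.GRUCell(latent_dim + act_dim, latent_dim)

        # Auxiliary latent h_t: modelled here with its own Gaussian head over the context embedding.
        self.aux_head = nn.Linear(embed_dim, 2 * latent_dim)

        # Decoder p(o_t | u_t, d_t, h_t) reconstructing observations from the unified latent state.
        self.decoder = nn.Sequential(nn.Linear(3 * latent_dim, 256), nn.ELU(), nn.Linear(256, obs_dim))

    @staticmethod
    def _gaussian(params: torch.Tensor) -> torch.distributions.Normal:
        mean, log_std = params.chunk(2, dim=-1)
        return torch.distributions.Normal(mean, log_std.clamp(-5, 2).exp())

    def forward(self, context_obs, context_act, a_t, o_t, d_prev):
        # context_obs: (B, m, obs_dim); context_act: (B, m, act_dim); a_t, o_t: current step.
        tokens = self.token_proj(torch.cat([context_obs, context_act], dim=-1))
        c_t = self.context_encoder(tokens).mean(dim=1)            # pooled context embedding

        prior = self._gaussian(self.prior_head(torch.cat([c_t, a_t], dim=-1)))
        post = self._gaussian(self.post_head(torch.cat([c_t, a_t, o_t], dim=-1)))
        u_t = post.rsample()                                      # stochastic component

        d_t = self.rnn(torch.cat([u_t, a_t], dim=-1), d_prev)     # deterministic component
        h_t = self._gaussian(self.aux_head(c_t)).rsample()        # auxiliary component

        o_hat = self.decoder(torch.cat([u_t, d_t, h_t], dim=-1))  # observation reconstruction
        return o_hat, prior, post, (u_t, d_t, h_t)
```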

2. Meta-State Regularization

Meta-state regularization is designed to enforce that $u_t$ encodes only the information in $o_t$ that is relevant given $(C_t, a_t)$. To achieve this, MrCoM directly penalizes the conditional mutual information $I(u_t; o_t \mid C_t, a_t)$, effectively discouraging the encoding of scenario-irrelevant noise.

Formally, employing a variational upper bound [Poole et al. 2019]: $I(u; o \mid C, a) \leq \mathbb{E}_{p(C, a, o)} \big[\mathrm{KL}\big(p(u \mid C, a, o) \,\Vert\, q(u \mid C, a)\big)\big].$

The meta-state loss is: $L_s = \mathbb{E}_{(C_t, a_t, o_t) \sim \mathcal{D}} \left[ \mathrm{KL}\left( p_\theta(u_t \mid C_t, a_t) \,\Vert\, q_\theta(u_t \mid C_t, a_t, o_t) \right) \right].$

This procedure strips $u_t$ of features from $o_t$ that cannot be predicted from context and action, yielding latent representations robust to irrelevant observation noise and scenario-specific peculiarities.
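
Continuing the sketch above, the meta-state term reduces to a KL divergence between the two Gaussian heads; the KL direction follows the loss as written here and may differ from the released implementation.

```python
import torch

def meta_state_loss(prior: torch.distributions.Normal,
                    posterior: torch.distributions.Normal) -> torch.Tensor:
    """L_s sketch: KL between p_theta(u_t | C_t, a_t) and q_theta(u_t | C_t, a_t, o_t),
    summed over latent dimensions and averaged over the batch (illustrative)."""
    kl = torch.distributions.kl_divergence(prior, posterior)  # per-dimension KL for diagonal Gaussians
    return kl.sum(dim=-1).mean()
```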

3. Meta-Value Regularization

Meta-value regularization aligns policy learning and world-model optimization across diverse objectives. It incorporates three core loss terms:

  • Scenario-Specific Bellman Update:

$$L_{\text{value}_i} = \mathbb{E}_{\mathcal{T}_i \sim p(\mathcal{T})} \big\| v_{\psi_i}(\tilde{s}_t) - \big(r_t + \gamma\, v_{\psi_i}(\tilde{s}_{t+1}) \big) \big\|^2$$

This enforces value consistency per scenario.

  • Meta-Value Alignment:

$$L_{\text{value}} = \mathbb{E}_{\mathcal{T}_i \sim p(\mathcal{T})} \big\| v_{\psi_i}(\tilde{s}_t) - v_{\psi}(\tilde{s}_t) \big\|^2$$

This loss encourages all scenario-specific values to align with a unified meta-value function.

  • Meta-Value Rollout Consistency:

$$L_v = \mathbb{E}_{\mathcal{T}_i \sim p(\mathcal{T})} \left\| v_\psi(\tilde{s}_{t+1}) - \mathbb{E}_{\hat{s}_{t+1} \sim \hat{T}_\theta(\cdot \mid \tilde{s}_t, a_t)} \left[ v_\psi(\hat{s}_{t+1}) \right] \right\|$$

This tripartite value regularization ensures effective Bellman propagation in all scenarios and constrains the learned world-model to support meta-policy learning.
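
A sketch of how the three terms could be computed, assuming value heads that act on the unified latent state; the helper names, the detached bootstrap targets, and the Monte Carlo treatment of the inner expectation are assumptions not spelled out in the text.

```python
import torch
import torch.nn.functional as F

def value_regularization(v_i, v_meta, s_t, s_next, r_t, s_next_imagined, gamma=0.99):
    """Sketch of the three value-regularization terms (illustrative, not the official code).

    v_i / v_meta    : scenario-specific and shared meta value heads (callables on latent states)
    s_t, s_next     : latent states from real transitions
    s_next_imagined : successors imagined by the learned world-model, shape (K, B, latent_dim)
    """
    # Scenario-specific Bellman update L_{value_i}; the bootstrap target is detached,
    # a common stabilization choice.
    bellman_target = (r_t + gamma * v_i(s_next)).detach()
    l_value_i = F.mse_loss(v_i(s_t), bellman_target)

    # Meta-value alignment L_value: align the shared meta head with the scenario-specific
    # values (one possible stop-gradient choice).
    l_value = F.mse_loss(v_meta(s_t), v_i(s_t).detach())

    # Meta-value rollout consistency L_v: compare v_meta on the real next latent state with
    # its expectation over model-imagined successors (Monte Carlo estimate over K samples).
    expected_v = v_meta(s_next_imagined).mean(dim=0)
    l_v = torch.norm(v_meta(s_next) - expected_v)

    return l_value_i, l_value, l_v
```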

4. Generalization Error Bound

The theoretical framework established in MrCoM provides generalization error upper bounds under multi-scenario settings, assuming dynamics homogeneity and an encoder approximation error $\varepsilon_S$.

  • Lemma 1 (dynamics representation error): $\max_i \mathbb{E}_{\mathcal{T}_i}\big[ D_{\mathrm{TV}}\big( \tilde{T}(f(o') \mid f(o), a) \,\Vert\, T_i(s' \mid s, a) \big) \big] \leq \varepsilon_T + C_T \varepsilon_S$
  • Lemma 2 (policy representation error): $D_{\mathrm{TV}}\big( \pi(a \mid f(o)) \,\Vert\, \pi(a \mid s) \big) \leq \varepsilon_\pi + \tfrac{1}{2} C_\pi \varepsilon_S$
  • Lemma 3 (performance gap): $|G^1(\pi_1) - G^2(\pi_2)| \leq \dfrac{2R\gamma(\varepsilon_\pi + \varepsilon_T)}{(1-\gamma)^2} + \dfrac{2R\varepsilon_\pi}{1-\gamma}$
  • Theorem 2: $\bigl|\,\tilde{G}_i(\pi) - \tilde{G}_\theta(\pi)\,\bigr| \leq \dfrac{R\gamma\,[\,4\varepsilon_\pi + 2\varepsilon_T + (C_\pi + 2C_T)\,\varepsilon_S\,]}{(1-\gamma)^2} + \dfrac{2R\,[\,2\varepsilon_\pi + C_\pi \varepsilon_S\,]}{1-\gamma}$

The bound decomposes the total generalization error into contributions from dynamics modeling error ($\varepsilon_T$), encoder error ($\varepsilon_S$), and policy mismatch ($\varepsilon_\pi$). MrCoM’s regularization objectives map directly onto these error sources.
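
For intuition only, plugging hypothetical values into the Theorem 2 bound, say $R=1$, $\gamma=0.9$, $\varepsilon_\pi=\varepsilon_T=\varepsilon_S=0.01$, and $C_\pi=C_T=1$ (these numbers are not from the paper):

```latex
\bigl|\tilde{G}_i(\pi) - \tilde{G}_\theta(\pi)\bigr|
  \;\leq\; \frac{0.9\,\bigl[\,4(0.01) + 2(0.01) + 3(0.01)\,\bigr]}{(1-0.9)^{2}}
         + \frac{2\,\bigl[\,2(0.01) + 1\cdot(0.01)\,\bigr]}{1-0.9}
  \;=\; 8.1 + 0.6 \;=\; 8.7
```

The example illustrates how the $(1-\gamma)^{-2}$ factor amplifies the first group of errors quadratically in the effective horizon $1/(1-\gamma)$, so the bound tightens rapidly as the regularizers shrink $\varepsilon_T$, $\varepsilon_S$, and $\varepsilon_\pi$.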

5. Training Algorithms and Procedural Details

The MrCoM training process consists of two stages:

  • World-Model Training:
    • Sample a scenario $\mathcal{T}_i \sim p(\mathcal{T})$ and collect transitions using $\pi_{\phi_i}$.
    • Update $v_{\psi_i}$ with respect to $L_{\text{value}_i}$ (Bellman loss).
    • Update the policy $\phi_i$ via standard actor-critic updates on both real and model-simulated rollouts.
    • Store $(o_t, v_{\psi_i}(\tilde{s}_t))$ pairs for meta-value alignment.
    • Update the meta-value head $\psi$ using $L_{\text{value}}$.
    • Optimize the overall world-model loss (a runnable weighting sketch follows this list):

    $$L_{\text{MrCoM}} = \lambda_{\text{var}} L_{\text{var}} + \lambda_s L_s + \lambda_v L_v$$

    Key hyperparameters: $\lambda_{\text{var}}=1$, $\lambda_s=0.1$, $\lambda_v=1$, batch size 32, learning rates $1\times10^{-4}$ (actor) and $2\times10^{-4}$ (critic), rollout horizon $H=5$, and a latent-state size of 128 per component.

  • Scenario Adaptation:

    • For a new scenario $\mathcal{T}^\star$, fix the world-model $\hat{T}_\theta$, learn the policy and value heads via mixed real and simulated rollouts, and optionally fine-tune $\theta$ with $L_{\text{MrCoM}}$.
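
As a minimal runnable sketch of the weighting step referenced in the list above, the combined objective can be formed from placeholder scalar losses standing in for $L_{\text{var}}$, $L_s$, and $L_v$; the numeric values and variable names are illustrative, only the weights come from the paper.

```python
import torch

# Loss weights as reported above; the three loss values below are placeholders for the
# variational/reconstruction, meta-state, and meta-value terms produced by the world-model.
lambda_var, lambda_s, lambda_v = 1.0, 0.1, 1.0

l_var = torch.tensor(0.80, requires_grad=True)  # placeholder for L_var
l_s   = torch.tensor(0.30, requires_grad=True)  # placeholder for L_s
l_v   = torch.tensor(0.50, requires_grad=True)  # placeholder for L_v

l_mrcom = lambda_var * l_var + lambda_s * l_s + lambda_v * l_v
l_mrcom.backward()     # gradients reach all three components, as in joint world-model training
print(float(l_mrcom))  # 0.80 + 0.03 + 0.50 = 1.33
```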

6. Empirical Evaluation and Results

Experiments are conducted on the MuJoCo-based DeepMind Control Suite (Hopper, Walker, Cheetah) with controlled scenario variations:

  • Dynamics changes: Uniform random perturbation of limb size/length by $\alpha\%$, for $\alpha \in \{5, 10, 20, \ldots\}$.
  • Reward changes: Randomization of the target speed $v_i \sim \mathrm{Uniform}(0, \beta\% \cdot v_{\text{max}})$, for $\beta \in \{20, 50, 100\}$; a hypothetical sampling sketch is given below.
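
The following snippet illustrates how such per-scenario perturbation parameters could be drawn; the function name, ranges, and return format mirror the description above, not the benchmark code.

```python
import random

def sample_scenario(alpha_pct: float = 10.0, beta_pct: float = 50.0, v_max: float = 1.0) -> dict:
    """Draw one scenario: a limb size/length scale within +/- alpha percent and a target
    speed uniform in [0, beta% of v_max]. Purely illustrative."""
    limb_scale = 1.0 + random.uniform(-alpha_pct, alpha_pct) / 100.0
    target_speed = random.uniform(0.0, beta_pct / 100.0 * v_max)
    return {"limb_scale": limb_scale, "target_speed": target_speed}

print(sample_scenario(alpha_pct=5, beta_pct=20))  # e.g. {'limb_scale': 1.03, 'target_speed': 0.12}
```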

Training is performed in a multi-scenario manner by merging trajectories from all environments and fitting a unified world-model. Baselines considered include DreamerV3, CaDM, and MAMBA.

  • In-distribution and out-of-distribution generalization is evaluated by training and testing under disjoint perturbation settings (e.g., train at $(\alpha=5, \beta=20)$, test at $(\alpha=10, \beta=50)$).
  • Performance Comparison:
    • MrCoM outperforms all baselines in 11/12 multi-scenario in-distribution runs and 11/12 out-of-distribution runs (see Table 1 in (Xiong et al., 9 Nov 2025)).
    • Under pure dynamics shifts, MrCoM achieves the highest return in 5/6 cases.
    • For observation corruptions (Gaussian noise, dimension addition, random masking), MrCoM attains top performance in 8/12 scenarios.

Ablation studies indicate that removing the latent component ($d_t$), the context prompt ($C_t$), the meta-state loss ($L_s$), or the meta-value loss ($L_v$) degrades performance, with the context prompt being most crucial in the multi-scenario regime.

7. Context and Significance

MrCoM's unified world-model approach, three-fold latent decomposition, and regularization mechanisms are designed to meet the challenges of scenario transfer in MBRL by structurally decoupling scenario-dependent and -independent information. The explicit theoretical error bounds allow precise control of the sources of generalization loss, tightly linking architecture and training procedure to expected empirical performance. Main empirical findings demonstrate that its design increases robustness and transferability under broad changes in underlying transition dynamics, reward functions, and observation corruptions. A plausible implication is that this paradigm could provide a scalable route to robust MBRL in real-world, non-stationary domains where scenario variation is the norm.
