
MrCoM: Meta-Regularized Contextual World-Model

Updated 16 November 2025
  • The paper presents MrCoM, a model-based reinforcement learning framework that decomposes latent states and applies meta-regularization to achieve robust cross-scenario generalization.
  • MrCoM employs a modularized architecture with a shallow Transformer for contextual encoding and a three-part latent-state decomposition handling stochastic, deterministic, and auxiliary elements.
  • Empirical evaluations show MrCoM outperforms baselines in handling dynamics, reward, and observation perturbations, backed by theoretical error bounds on generalization.

The Meta-Regularized Contextual World-Model (MrCoM) is a model-based reinforcement learning (MBRL) framework that addresses generalization in multi-scenario settings by building a unified, meta-regularized world model. MrCoM isolates latent representations aligned with dynamic characteristics and scenario relevance, regularizes both state and value representations via meta-objectives, and provides theoretical guarantees on the generalization gap. Empirical evaluations demonstrate that MrCoM attains superior generalization and robustness compared to contemporary world-model baselines under diverse alterations in environmental dynamics, rewards, and observations (Xiong et al., 9 Nov 2025).

1. Architecture and Components

MrCoM introduces a modularized architecture structured around scenario-agnostic and scenario-specific elements to facilitate cross-scenario transfer. The core elements are:

  • Contextual Encoder: At each time step $t$, a context window $C_t = \{o_{t-m}, a_{t-m}, \ldots, o_{t-1}, a_{t-1}\}$ of length $m$ is ingested by a shallow Transformer (1–2 layers, 3 heads) to extract contextual embeddings. This architectural choice enables scenario-conditional inference and prediction.
  • Latent-State Decomposition: The unified latent state $\tilde{s}_t$ is factorized into:
    • $u_t$ (stochastic): Encodes aleatoric uncertainty; governed by a Gaussian prior $p_0(u_t \mid C_t, a_t)$ and posterior $q(u_t \mid C_t, a_t, o_t)$.
    • $d_t$ (deterministic): A recurrent hidden state evolving as $p(d_t \mid d_{t-1}, u_t, a_t)$.
    • $h_t$ (auxiliary): Captures residual structure, with its own prior and posterior distributions.

A probabilistic decoder $p(o_t \mid u_t, d_t, h_t)$ reconstructs the original observations, tying the latent components to observed data. All modules map diverse input scenarios to a shared latent space $\tilde{S}$.

  • Policy and Value Heads: The learning framework integrates (a) scenario-specific value heads $v_{\psi_i}(\tilde{s}_t)$, (b) a shared meta-value head $v_\psi(\tilde{s}_t)$, and (c) a policy $\pi_\phi(a_t \mid \tilde{s}_t)$, all operating on the unified latent embedding.

This design partitions scenario-relevant structure within $u_t$ and $h_t$, captures temporal dependencies in $d_t$, and ensures that policies and value estimates generalize across scenarios through a shared representation.
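
The following minimal PyTorch sketch shows how such a contextual encoder, three-part latent decomposition, and decoder might be wired together. All class and attribute names, dimensions, the mean-pooling of context tokens, and the four-head attention configuration are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class ContextualWorldModel(nn.Module):
    """Illustrative MrCoM-style encoder + latent decomposition (not the official code)."""

    def __init__(self, obs_dim: int, act_dim: int, embed_dim: int = 128, latent_dim: int = 128):
        super().__init__()
        # Contextual encoder: a shallow Transformer over the (o, a) tokens in the window C_t.
        self.token_proj = nn.Linear(obs_dim + act_dim, embed_dim)
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=4, batch_first=True)
        self.context_encoder = nn.TransformerEncoder(layer, num_layers=1)

        # Stochastic latent u_t: Gaussian prior p0(u_t | C_t, a_t) and posterior q(u_t | C_t, a_t, o_t).
        self.prior_head = nn.Linear(embed_dim + act_dim, 2 * latent_dim)
        self.post_head = nn.Linear(embed_dim + act_dim + obs_dim, 2 * latent_dim)

        # Deterministic latent d_t: recurrent state p(d_t | d_{t-1}, u_t, a_t).
        self.rnn = nn.GRUCell(latent_dim + act_dim, latent_dim)

        # Auxiliary latent h_t: modelled here with its own Gaussian head over the context embedding.
        self.aux_head = nn.Linear(embed_dim, 2 * latent_dim)

        # Decoder p(o_t | u_t, d_t, h_t) reconstructing observations from the unified latent state.
        self.decoder = nn.Sequential(nn.Linear(3 * latent_dim, 256), nn.ELU(), nn.Linear(256, obs_dim))

    @staticmethod
    def _gaussian(params: torch.Tensor) -> torch.distributions.Normal:
        mean, log_std = params.chunk(2, dim=-1)
        return torch.distributions.Normal(mean, log_std.clamp(-5, 2).exp())

    def forward(self, context_obs, context_act, a_t, o_t, d_prev):
        # context_obs: (B, m, obs_dim); context_act: (B, m, act_dim); a_t, o_t: current step.
        tokens = self.token_proj(torch.cat([context_obs, context_act], dim=-1))
        c_t = self.context_encoder(tokens).mean(dim=1)            # pooled context embedding

        prior = self._gaussian(self.prior_head(torch.cat([c_t, a_t], dim=-1)))
        post = self._gaussian(self.post_head(torch.cat([c_t, a_t, o_t], dim=-1)))
        u_t = post.rsample()                                      # stochastic component

        d_t = self.rnn(torch.cat([u_t, a_t], dim=-1), d_prev)     # deterministic component
        h_t = self._gaussian(self.aux_head(c_t)).rsample()        # auxiliary component

        o_hat = self.decoder(torch.cat([u_t, d_t, h_t], dim=-1))  # observation reconstruction
        return o_hat, prior, post, (u_t, d_t, h_t)
```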

2. Meta-State Regularization

Meta-state regularization is designed to enforce that $u_t$ encodes only the information in $o_t$ that is relevant given $(C_t, a_t)$. To achieve this, MrCoM directly penalizes the conditional mutual information $I(u_t; o_t \mid C_t, a_t)$, effectively discouraging the encoding of scenario-irrelevant noise.

Formally, employing a variational upper bound [Poole et al. 2019]: $I(u; o \mid C, a) \leq \mathbb{E}_{p(C, a, o)} \big[\mathrm{KL}\big(p(u \mid C, a, o) \,\Vert\, q(u \mid C, a)\big)\big].$

The meta-state loss is: $L_s = \mathbb{E}_{(C_t, a_t, o_t) \sim \mathcal{D}} \left[ \mathrm{KL}\left( p_\theta(u_t \mid C_t, a_t) \,\Vert\, q_\theta(u_t \mid C_t, a_t, o_t) \right) \right].$

This procedure strips $u_t$ of features from $o_t$ that cannot be predicted from context and action, yielding latent representations robust to irrelevant observation noise and scenario-specific peculiarities.
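
Continuing the sketch above, the meta-state term reduces to a KL divergence between the two Gaussian heads; the KL direction follows the loss as written here and may differ from the released implementation.

```python
import torch

def meta_state_loss(prior: torch.distributions.Normal,
                    posterior: torch.distributions.Normal) -> torch.Tensor:
    """L_s sketch: KL between p_theta(u_t | C_t, a_t) and q_theta(u_t | C_t, a_t, o_t),
    summed over latent dimensions and averaged over the batch (illustrative)."""
    kl = torch.distributions.kl_divergence(prior, posterior)  # per-dimension KL for diagonal Gaussians
    return kl.sum(dim=-1).mean()
```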

3. Meta-Value Regularization

Meta-value regularization aligns policy learning and world-model optimization across diverse objectives. It incorporates three core loss terms:

  • Scenario-Specific Bellman Update:

$$L_{\text{value}_i} = \mathbb{E}_{\mathcal{T}_i \sim p(\mathcal{T})} \big\| v_{\psi_i}(\tilde{s}_t) - \big(r_t + \gamma\, v_{\psi_i}(\tilde{s}_{t+1}) \big) \big\|^2$$

This enforces value consistency per scenario.

  • Meta-Value Alignment:

$$L_{\text{value}} = \mathbb{E}_{\mathcal{T}_i \sim p(\mathcal{T})} \big\| v_{\psi_i}(\tilde{s}_t) - v_{\psi}(\tilde{s}_t) \big\|^2$$

This loss encourages all scenario-specific values to align with a unified meta-value function.

  • Meta-Value Rollout Consistency:

$$L_v = \mathbb{E}_{\mathcal{T}_i \sim p(\mathcal{T})} \left\| v_\psi(\tilde{s}_{t+1}) - \mathbb{E}_{\hat{s}_{t+1} \sim \hat{T}_\theta(\cdot \mid \tilde{s}_t, a_t)} \left[ v_\psi(\hat{s}_{t+1}) \right] \right\|$$

This tripartite value regularization ensures effective Bellman propagation in all scenarios and constrains the learned world-model to support meta-policy learning.
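
A sketch of how the three terms could be computed, assuming value heads that act on the unified latent state; the helper names, the detached bootstrap targets, and the Monte Carlo treatment of the inner expectation are assumptions not spelled out in the text.

```python
import torch
import torch.nn.functional as F

def value_regularization(v_i, v_meta, s_t, s_next, r_t, s_next_imagined, gamma=0.99):
    """Sketch of the three value-regularization terms (illustrative, not the official code).

    v_i / v_meta    : scenario-specific and shared meta value heads (callables on latent states)
    s_t, s_next     : latent states from real transitions
    s_next_imagined : successors imagined by the learned world-model, shape (K, B, latent_dim)
    """
    # Scenario-specific Bellman update L_{value_i}; the bootstrap target is detached,
    # a common stabilization choice.
    bellman_target = (r_t + gamma * v_i(s_next)).detach()
    l_value_i = F.mse_loss(v_i(s_t), bellman_target)

    # Meta-value alignment L_value: align the shared meta head with the scenario-specific
    # values (one possible stop-gradient choice).
    l_value = F.mse_loss(v_meta(s_t), v_i(s_t).detach())

    # Meta-value rollout consistency L_v: compare v_meta on the real next latent state with
    # its expectation over model-imagined successors (Monte Carlo estimate over K samples).
    expected_v = v_meta(s_next_imagined).mean(dim=0)
    l_v = torch.norm(v_meta(s_next) - expected_v)

    return l_value_i, l_value, l_v
```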

4. Generalization Error Bound

The theoretical framework established in MrCoM provides generalization error upper bounds under multi-scenario settings, assuming dynamics homogeneity and an encoder approximation error $\varepsilon_S$.

  • Lemma 1 (dynamics representation error): $\max_i \mathbb{E}_{\mathcal{T}_i}\big[ D_{\mathrm{TV}}\big( \tilde{T}(f(o') \mid f(o), a) \,\Vert\, T_i(s' \mid s, a) \big) \big] \leq \varepsilon_T + C_T \varepsilon_S$
  • Lemma 2 (policy representation error): $D_{\mathrm{TV}}\big( \pi(a \mid f(o)) \,\Vert\, \pi(a \mid s) \big) \leq \varepsilon_\pi + \tfrac{1}{2} C_\pi \varepsilon_S$
  • Lemma 3 (performance gap): $|G^1(\pi_1) - G^2(\pi_2)| \leq \dfrac{2R\gamma(\varepsilon_\pi + \varepsilon_T)}{(1-\gamma)^2} + \dfrac{2R\varepsilon_\pi}{1-\gamma}$
  • Theorem 2: $\bigl|\,\tilde{G}_i(\pi) - \tilde{G}_\theta(\pi)\,\bigr| \leq \dfrac{R\gamma\,[\,4\varepsilon_\pi + 2\varepsilon_T + (C_\pi + 2C_T)\,\varepsilon_S\,]}{(1-\gamma)^2} + \dfrac{2R\,[\,2\varepsilon_\pi + C_\pi \varepsilon_S\,]}{1-\gamma}$

The bound decomposes the total generalization error into contributions from dynamics modeling error ($\varepsilon_T$), encoder error ($\varepsilon_S$), and policy mismatch ($\varepsilon_\pi$). MrCoM’s regularization objectives map directly onto these error sources.
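
For intuition only, plugging hypothetical values into the Theorem 2 bound, say $R=1$, $\gamma=0.9$, $\varepsilon_\pi=\varepsilon_T=\varepsilon_S=0.01$, and $C_\pi=C_T=1$ (these numbers are not from the paper):

```latex
\bigl|\tilde{G}_i(\pi) - \tilde{G}_\theta(\pi)\bigr|
  \;\leq\; \frac{0.9\,\bigl[\,4(0.01) + 2(0.01) + 3(0.01)\,\bigr]}{(1-0.9)^{2}}
         + \frac{2\,\bigl[\,2(0.01) + 1\cdot(0.01)\,\bigr]}{1-0.9}
  \;=\; 8.1 + 0.6 \;=\; 8.7
```

The example illustrates how the $(1-\gamma)^{-2}$ factor amplifies the first group of errors quadratically in the effective horizon $1/(1-\gamma)$, so the bound tightens rapidly as the regularizers shrink $\varepsilon_T$, $\varepsilon_S$, and $\varepsilon_\pi$.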

5. Training Algorithms and Procedural Details

The MrCoM training process consists of two stages:

  • World-Model Training:
    • Sample a scenario $\mathcal{T}_i \sim p(\mathcal{T})$ and collect transitions using $\pi_{\phi_i}$.
    • Update $v_{\psi_i}$ with respect to $L_{\text{value}_i}$ (Bellman loss).
    • Update the policy $\phi_i$ via standard actor-critic updates on both real and model-simulated rollouts.
    • Store $(o_t, v_{\psi_i}(\tilde{s}_t))$ pairs for meta-value alignment.
    • Update the meta-value head $\psi$ using $L_{\text{value}}$.
    • Optimize the overall world-model loss (a runnable weighting sketch follows this list):

    $$L_{\text{MrCoM}} = \lambda_{\text{var}} L_{\text{var}} + \lambda_s L_s + \lambda_v L_v$$

    Key hyperparameters: $\lambda_{\text{var}}=1$, $\lambda_s=0.1$, $\lambda_v=1$, batch size 32, learning rates $1\times10^{-4}$ (actor) and $2\times10^{-4}$ (critic), rollout horizon $H=5$, and a latent-state size of 128 per component.

  • Scenario Adaptation:

    • For a new scenario $\mathcal{T}^\star$, fix the world-model $\hat{T}_\theta$, learn the policy and value heads via mixed real and simulated rollouts, and optionally fine-tune $\theta$ with $L_{\text{MrCoM}}$.
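
As a minimal runnable sketch of the weighting step referenced in the list above, the combined objective can be formed from placeholder scalar losses standing in for $L_{\text{var}}$, $L_s$, and $L_v$; the numeric values and variable names are illustrative, only the weights come from the paper.

```python
import torch

# Loss weights as reported above; the three loss values below are placeholders for the
# variational/reconstruction, meta-state, and meta-value terms produced by the world-model.
lambda_var, lambda_s, lambda_v = 1.0, 0.1, 1.0

l_var = torch.tensor(0.80, requires_grad=True)  # placeholder for L_var
l_s   = torch.tensor(0.30, requires_grad=True)  # placeholder for L_s
l_v   = torch.tensor(0.50, requires_grad=True)  # placeholder for L_v

l_mrcom = lambda_var * l_var + lambda_s * l_s + lambda_v * l_v
l_mrcom.backward()     # gradients reach all three components, as in joint world-model training
print(float(l_mrcom))  # 0.80 + 0.03 + 0.50 = 1.33
```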

6. Empirical Evaluation and Results

Experiments are conducted on the MuJoCo-based DeepMind Control Suite (Hopper, Walker, Cheetah) with controlled scenario variations:

  • Dynamics changes: Uniform random perturbation of limb size/length by $\alpha\%$, for $\alpha \in \{5, 10, 20, \ldots\}$.
  • Reward changes: Randomization of the target speed $v_i \sim \mathrm{Uniform}(0, \beta\% \cdot v_{\text{max}})$, for $\beta \in \{20, 50, 100\}$; a hypothetical sampling sketch is given below.
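
The following snippet illustrates how such per-scenario perturbation parameters could be drawn; the function name, ranges, and return format mirror the description above, not the benchmark code.

```python
import random

def sample_scenario(alpha_pct: float = 10.0, beta_pct: float = 50.0, v_max: float = 1.0) -> dict:
    """Draw one scenario: a limb size/length scale within +/- alpha percent and a target
    speed uniform in [0, beta% of v_max]. Purely illustrative."""
    limb_scale = 1.0 + random.uniform(-alpha_pct, alpha_pct) / 100.0
    target_speed = random.uniform(0.0, beta_pct / 100.0 * v_max)
    return {"limb_scale": limb_scale, "target_speed": target_speed}

print(sample_scenario(alpha_pct=5, beta_pct=20))  # e.g. {'limb_scale': 1.03, 'target_speed': 0.12}
```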

Training is performed in a multi-scenario manner by merging trajectories from all environments and fitting a unified world-model. Baselines considered include DreamerV3, CaDM, and MAMBA.

  • In-distribution and out-of-distribution generalization is evaluated by training and testing under disjoint perturbation settings (e.g., train at $(\alpha=5, \beta=20)$, test at $(\alpha=10, \beta=50)$).
  • Performance Comparison:
    • MrCoM outperforms all baselines in 11/12 multi-scenario in-distribution runs and 11/12 out-of-distribution runs (see Table 1 in (Xiong et al., 9 Nov 2025)).
    • Under pure dynamics shifts, MrCoM achieves the highest return in 5/6 cases.
    • For observation corruptions (Gaussian noise, dimension addition, random masking), MrCoM attains top performance in 8/12 scenarios.

Ablation studies indicate that removing the latent component ($d_t$), the context prompt ($C_t$), the meta-state loss ($L_s$), or the meta-value loss ($L_v$) degrades performance, with the context prompt being most crucial in the multi-scenario regime.

7. Context and Significance

MrCoM's unified world-model approach, three-fold latent decomposition, and regularization mechanisms are designed to meet the challenges of scenario transfer in MBRL by structurally decoupling scenario-dependent and -independent information. The explicit theoretical error bounds allow precise control of the sources of generalization loss, tightly linking architecture and training procedure to expected empirical performance. Main empirical findings demonstrate that its design increases robustness and transferability under broad changes in underlying transition dynamics, reward functions, and observation corruptions. A plausible implication is that this paradigm could provide a scalable route to robust MBRL in real-world, non-stationary domains where scenario variation is the norm.
