
LED-WM: Language-Aware Dreamer Model

Updated 6 December 2025
  • The paper introduces LED-WM, which improves zero-shot policy generalization by grounding natural language into latent dynamics through cross-modal attention.
  • It integrates structured grid observations with natural-language manuals using a dedicated encoder to produce semantic features for reinforcement learning.
  • Experiments on compositional grid environments demonstrate that LED-WM outperforms baselines without requiring expert demonstrations or inference-time planning.

The Language-Aware Encoder for Dreamer World Model (LED-WM) is a model-based reinforcement learning (RL) approach designed to improve policy generalization through explicit grounding of natural language environment descriptions into a world model. Built atop DreamerV3, LED-WM introduces a cross-modal attention mechanism that aligns language describing environment dynamics with entity representations in observations, producing latent features that inform RL policy learning. This method emphasizes zero-shot policy generalization to unseen environmental dynamics and language, removing the need for inference-time planning or expert demonstrations, and demonstrates state-of-the-art results in compositional grid-based environments (Nguyen et al., 28 Nov 2025).

1. Architectural Components

LED-WM integrates natural-language manuals and structured observations into a world model that supports policy learning. Its architecture consists of the following components:

  • Inputs:
    • Natural-language manual $\mathcal{L} = \{\ell_1, \ldots, \ell_N\}$, one sentence per entity, embedded via a frozen T5 encoder, producing $s_1, \ldots, s_N \in \mathbb{R}^{d_s}$.
    • Grid-world observation $o_t \in \{0,1\}^{10 \times 10 \times C}$, where each grid cell contains one-hot entity or agent symbols mapped to learned embeddings $sb_i \in \mathbb{R}^{d_{sb}}$ or $a \in \mathbb{R}^{d_{sb}}$.
    • Time embedding $\mathrm{time}_t \in \mathbb{R}^{d_{\mathrm{time}}}$ for step $t$.
    • Directional history features $D_i^t$, computed for each entity $i$ as

    $$D_i^t = \left( \frac{p_i^t - p_a^t}{\|p_i^t - p_a^t\|} \right) \cdot \left( \frac{p_i^t - p_i^{t-1}}{\|p_i^t - p_i^{t-1}\|} \right)$$

    where $p_a^t$ and $p_i^t$ denote the grid-cell locations of the agent and entity $i$ at step $t$.

  • Language-Grounding via Cross-Attention (a minimal code sketch follows this section):

    • For each entity $i$, a query $q_i = \mathrm{MLP}_q([sb_i, D_i^t]) \in \mathbb{R}^d$ is constructed.
    • All sentence embeddings produce key-value pairs: $k_j = \mathrm{MLP}_k(s_j) \in \mathbb{R}^d$, $v_j = W_v s_j \in \mathbb{R}^{d_{\mathrm{val}}}$.
    • Attention weights: $A_i = \mathrm{softmax}(q_i K^T / \sqrt{d}) \in \mathbb{R}^N$.
    • Grounded entity vector: $e_i = \sum_j A_{i,j} v_j$ encodes the language for each grid entity.
  • Grid Feature Construction:
    • Each $e_i$ is placed back at its corresponding grid location, forming $G_l \in \mathbb{R}^{10 \times 10 \times d_{\mathrm{val}}}$ alongside the agent embedding.
    • $G_l$ is processed by a CNN, flattened, concatenated with the time embedding, and passed through an MLP to produce the visual-language feature $x_t$.
  • Latent Dynamics Model:
    • LED-WM uses DreamerV3’s latent RSSM; at each step $t$:
    • Deterministic state update $h_t = f_\phi(h_{t-1}, z_{t-1}, a_{t-1})$,
    • Latent variable $z_t \sim q_\phi(z_t \mid h_t, x_t)$,
    • Prior prediction $\hat{z}_{t+1} \sim p_\phi(\hat{z}_{t+1} \mid h_t)$,
    • Reward and continuation predictors.

This architecture explicitly aligns language with perceptual observations, facilitating compositional and robust environment understanding.
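
To make the grounding step concrete, here is a minimal PyTorch sketch of the cross-attention encoder described above. The module names, dimensions, and the scatter-to-grid bookkeeping are illustrative assumptions, not the authors’ implementation; only the computation pattern (queries from symbol and motion features, keys/values from frozen sentence embeddings, attention-weighted grounding, CNN-plus-MLP readout) follows the description.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def directional_feature(p_i, p_i_prev, p_a, eps=1e-8):
    """D_i^t: unit agent-to-entity offset dotted with the entity's unit motion vector."""
    u = (p_i - p_a) / (torch.norm(p_i - p_a, dim=-1, keepdim=True) + eps)
    m = (p_i - p_i_prev) / (torch.norm(p_i - p_i_prev, dim=-1, keepdim=True) + eps)
    return (u * m).sum(-1, keepdim=True)                                 # (E, 1)

class LanguageGroundingEncoder(nn.Module):
    """Illustrative cross-modal attention encoder (all sizes are assumptions)."""

    def __init__(self, d_s=512, d_sb=32, d=64, d_val=64, d_time=16, grid=10, d_x=256):
        super().__init__()
        self.mlp_q = nn.Sequential(nn.Linear(d_sb + 1, d), nn.ReLU(), nn.Linear(d, d))
        self.mlp_k = nn.Sequential(nn.Linear(d_s, d), nn.ReLU(), nn.Linear(d, d))
        self.w_v = nn.Linear(d_s, d_val, bias=False)
        self.cnn = nn.Sequential(
            nn.Conv2d(d_val, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(), nn.Flatten())
        self.head = nn.Sequential(nn.Linear(64 * grid * grid + d_time, d_x), nn.ReLU())
        self.grid, self.d_val = grid, d_val

    def forward(self, sent_emb, entity_emb, dir_feat, positions, time_emb):
        # sent_emb:   (N, d_s)  frozen T5 sentence embeddings s_1..s_N
        # entity_emb: (E, d_sb) learned symbol embeddings sb_i for on-grid entities
        # dir_feat:   (E, 1)    directional history features D_i^t
        # positions:  (E, 2)    integer grid coordinates of each entity
        # time_emb:   (d_time,) embedding of the current step t
        q = self.mlp_q(torch.cat([entity_emb, dir_feat], dim=-1))        # (E, d)
        k = self.mlp_k(sent_emb)                                         # (N, d)
        v = self.w_v(sent_emb)                                           # (N, d_val)
        attn = F.softmax(q @ k.T / k.shape[-1] ** 0.5, dim=-1)           # (E, N)
        grounded = attn @ v                                              # (E, d_val)

        # Scatter grounded vectors back to their cells to form G_l, then read out x_t.
        g = grounded.new_zeros(self.grid, self.grid, self.d_val)
        g[positions[:, 0].long(), positions[:, 1].long()] = grounded
        flat = self.cnn(g.permute(2, 0, 1).unsqueeze(0)).squeeze(0)      # (64 * grid * grid,)
        x_t = self.head(torch.cat([flat, time_emb], dim=-1))             # (d_x,)
        return x_t, attn
```

Returning the attention matrix alongside $x_t$ also makes the kind of qualitative grounding analysis discussed in Section 5 straightforward to run.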

2. Optimization Objectives

LED-WM employs a training objective that omits pixel reconstruction and prioritizes multi-step semantic prediction:

  • World-Model Loss:

$$\mathcal{L}(\phi) = \mathbb{E}_{q_\phi} \left[ \sum_{t=1}^T \left( \mathcal{L}_{\mathrm{pred}}(t) + \beta_{\mathrm{dyn}} \mathcal{L}_{\mathrm{dyn}}(t) + \beta_{\mathrm{rep}} \mathcal{L}_{\mathrm{rep}}(t) \right) \right]$$

  • Prediction loss $\mathcal{L}_{\mathrm{pred}}(t)$ penalizes errors in multi-step reward and continuation prediction, discounted by rollout parameter $\lambda$.
  • Dynamics loss $\mathcal{L}_{\mathrm{dyn}}(t)$ is a KL-divergence regularizer between posterior and prior latents.
  • Representation loss $\mathcal{L}_{\mathrm{rep}}(t)$ is a free-nats regularizer.

A key architectural choice is dropping DreamerV3’s pixel decoder, focusing all model capacity on semantic dynamics.
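
As a rough sketch under assumed DreamerV3-style conventions (diagonal-Gaussian latents for brevity, stop-gradient KL balancing, free nats), the objective above could be assembled as follows; the default weights and free-nats threshold are placeholders, not values from the paper.

```python
import torch
import torch.distributions as td

def world_model_loss(pred_dists, targets, post_mean, post_std, prior_mean, prior_std,
                     beta_dyn=0.5, beta_rep=0.1, free_nats=1.0):
    """Sketch of the LED-WM objective: prediction + KL terms, no pixel decoder.
    Gaussian latents are assumed purely for brevity (DreamerV3 uses categoricals)."""
    # Prediction loss: negative log-likelihood of reward / continuation targets.
    loss_pred = sum(-pred_dists[k].log_prob(targets[k]).mean() for k in pred_dists)

    post = td.Normal(post_mean, post_std)
    prior = td.Normal(prior_mean, prior_std)
    post_sg = td.Normal(post_mean.detach(), post_std.detach())
    prior_sg = td.Normal(prior_mean.detach(), prior_std.detach())

    # Dynamics loss: move the prior toward the (stop-gradient) posterior.
    loss_dyn = td.kl_divergence(post_sg, prior).sum(-1)
    # Representation loss: move the posterior toward the (stop-gradient) prior.
    loss_rep = td.kl_divergence(post, prior_sg).sum(-1)

    # Free nats: do not penalize KL terms below a small threshold.
    loss_dyn = torch.clamp(loss_dyn, min=free_nats).mean()
    loss_rep = torch.clamp(loss_rep, min=free_nats).mean()

    return loss_pred + beta_dyn * loss_dyn + beta_rep * loss_rep
```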

3. Policy Learning and Rollout

The policy in LED-WM is learned using the Dreamer paradigm, relying exclusively on the latent state produced by the language-aware world model:

  • Actor ($\mu_\theta$) and Critic ($v_\psi$):
    • Both receive only the world-model latent $(h_t, z_t)$, not pixels or language directly.
    • Training uses imagined multi-step latent rollouts, generated solely via $p_\phi$ and $\mu_\theta$ from recent posteriors.
    • The critic is regressed on $\lambda$-returns from these rollouts (sketched after this section); the actor maximizes value plus an entropy bonus over imagined trajectories.
  • Training Regime:
    • No inference-time planning (e.g., MCTS) or expert demonstrations are used for policy learning.
    • The policy heads remain identical to DreamerV3’s; the only modification is the encoder.

A supplementary policy fine-tuning procedure leverages synthetic rollouts: for a new test game, if predicted returns fall below a threshold, the policy is adapted for a small number of gradient steps on world-model-generated trajectories, yielding marginal but statistically significant gains.
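
A sketch of the imagination-side computations may help: the $\lambda$-return recursion used to train the critic, and the kind of threshold-triggered adaptation described above. The horizon, threshold, and the `world_model.imagine` / `policy.update` interfaces are hypothetical stand-ins, not the paper’s API.

```python
import torch

def lambda_returns(rewards, values, continues, lam=0.95):
    """Bootstrapped lambda-returns over an imagined rollout.
    rewards, continues: (H,); values: (H + 1,) including the bootstrap value.
    The discount factor is assumed to be folded into `continues`."""
    out = []
    next_return = values[-1]
    for t in reversed(range(rewards.shape[0])):
        next_return = rewards[t] + continues[t] * (
            (1 - lam) * values[t + 1] + lam * next_return)
        out.append(next_return)
    return torch.stack(out[::-1])

def maybe_finetune(policy, world_model, manual, init_latent,
                   return_threshold=0.5, num_grad_steps=50, horizon=15):
    """Threshold-triggered adaptation on synthetic rollouts (hypothetical interface)."""
    with torch.no_grad():
        rollout = world_model.imagine(policy, init_latent, manual, horizon)
        predicted_return = rollout["rewards"].sum()
    if predicted_return >= return_threshold:
        return policy                        # the pretrained policy looks good enough
    for _ in range(num_grad_steps):          # a few actor-critic steps on imagined data
        rollout = world_model.imagine(policy, init_latent, manual, horizon)
        targets = lambda_returns(rollout["rewards"], rollout["values"], rollout["continues"])
        policy.update(rollout, targets)
    return policy
```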

4. Experimental Evaluation

LED-WM is evaluated primarily in the MESSENGER and MESSENGER-WM environments, each constructed to probe language-conditioned dynamics generalization:

  • MESSENGER: 10×10 grid with 3–5 entities, each described in a natural-language manual with roles (messenger, goal, enemy) and movements (chasing, fleeing, stationary). Evaluation stages are:
    • S1: Name grounding only.
    • S2/S2-dev: Different movement combinations; S2-dev matches train, S2 uses novel combos.
    • S3: Increased N, distractors, synonyms, complex language.
  • MESSENGER-WM: 3-entity tasks with various test splits:
    • NewCombo: Unseen entity sets.
    • NewAttr: Known entities, novel assignments.
    • NewAll: Both aspects novel.
  • Metrics:
    • MESSENGER: Win Rate over 1000 episodes.
    • MESSENGER-WM: Average Sum of Scores over 1000 games × 60 rollouts/game.
  • Baselines and Ablations:
    • Model-free: EMMA (with and without curriculum), CRL.
    • Model-based: Dynalang (no explicit grounding), EMMA-LWM (expert demos + IL or BC), Reader (MCTS planning).
    • Ablations: Remove cross-attention (equivalent to Dynalang), remove pixel decoder, vary $\beta$ weights.

Performance of LED-WM and baselines on the MESSENGER stages (win rate, mean ± std; – marks values not reported) is summarized as follows:

Method             S1          S2          S2-dev      S3
Dynalang           0.03±0.02   0.04±0.05   –           0.03±0.05
CRL                0.88±0.03   0.76±0.05   –           0.32±0.02
EMMA (no curr.)    0.85±0.01   0.45±0.12   –           0.10±0.01
EMMA (w/ curr.)    0.88±0.02   0.95±0.00   –           0.22±0.04
LED-WM             1.00±0.00   0.52±0.03   0.97±0.01   0.35±0.02

For MESSENGER-WM, LED-WM outperforms EMMA-LWM (both IL and BC settings) in all generalization splits, without requiring expert trajectories.

5. Grounding Evaluation and Generalization

Qualitative analysis demonstrates that LED-WM’s cross-attention encoder achieves explicit and compositional language grounding:

  • Synonyms (e.g., “scholar” vs “researcher”), paraphrases (e.g., “won’t move” vs “stationary”), and distractor sentences are correctly mapped to grid entities.
  • Uniform fusion baselines (Dynalang) fail to distinguish entities with similar names but divergent behaviors, leading to policy collapse.
  • Attention weights localize linguistic information to the correct visual entities, even under unseen language or novel behavioral assignments (a probe sketch follows below).

A notable aspect is generalization to both novel entity combinations (compositionality) and novel natural-language instructions (linguistic robustness).
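
One way to probe this behaviour, reusing the illustrative encoder sketched in Section 1, is to compare attention rows across paraphrased manuals. The T5 checkpoint, the mean pooling, and the equality check below are assumptions made for illustration.

```python
import torch
from transformers import T5EncoderModel, T5Tokenizer

@torch.no_grad()
def embed_manual(sentences, model_name="t5-small"):
    """Mean-pooled frozen T5 embeddings, one vector per manual sentence."""
    tok = T5Tokenizer.from_pretrained(model_name)
    enc = T5EncoderModel.from_pretrained(model_name).eval()
    batch = tok(sentences, return_tensors="pt", padding=True)
    hidden = enc(**batch).last_hidden_state                      # (N, T, d_s)
    mask = batch["attention_mask"].unsqueeze(-1).float()
    return (hidden * mask).sum(1) / mask.sum(1)                  # (N, d_s)

manual_a = ["the scholar chases you", "the airplane is the goal", "the dog won't move"]
manual_b = ["the researcher chases you", "the airplane is the goal", "the dog is stationary"]

# With a trained LanguageGroundingEncoder (see the sketch in Section 1), each entity's
# attention row should peak on the same sentence index under either phrasing:
#   _, attn_a = encoder(embed_manual(manual_a), entity_emb, dir_feat, positions, time_emb)
#   _, attn_b = encoder(embed_manual(manual_b), entity_emb, dir_feat, positions, time_emb)
#   assert attn_a.argmax(-1).equal(attn_b.argmax(-1))
```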

6. Comparative Performance and Limitations

LED-WM matches or surpasses all baseline and ablated methods in zero-shot generalization, without relying on expert demonstrations or costly online planning. EMMA with curriculum slightly outperforms it in one S2 condition, attributed to dataset bias, suggesting that integrating CRL’s debiasing may further improve LED-WM. Synthetic-rollout fine-tuning offers additional, albeit small, statistically significant gains.

Key contrasts:

  • Model-based, language-conditioned methods without explicit grounding (Dynalang) underperform on compositional generalization.
  • Methods that rely on curricula or expert demonstrations (EMMA with curriculum, EMMA-LWM) require task-specific tuning or demonstration data.
  • Explicit cross-modal attention in LED-WM is critical; ablations removing this mechanism revert performance to Dynalang levels.

LED-WM’s use of a cross-modal encoder and multi-step prediction is central to its improved generalization in language-conditioned policy learning (Nguyen et al., 28 Nov 2025).
