Latent World Model Formulation

Updated 27 February 2026
  • Latent World Model Formulation is a method that learns compact representations of the world state and agent actions for efficient planning.
  • It utilizes structured probabilistic frameworks and variational objectives to decouple observations from control, enabling robust policy learning.
  • The formulation supports diverse architectures—from CNNs to transformers—and demonstrates strong generalization in autonomous planning and control tasks.

A latent world model is a structured probabilistic system that encodes the dynamics of an environment in a learned latent space, decoupling observations and control from raw sensory inputs and actions. By inferring compact latent representations of both the world state and agent actions, such models enable efficient planning, prediction, and policy learning from high-dimensional, potentially unstructured data, including passive observations and limited action labels. Latent world model formulations encompass diverse graphical structures, learning objectives, and representational assumptions, supporting controllable generative modeling, autonomous decision-making, and robust generalization.

1. Core Mathematical Foundations and Generative Factorizations

Latent world models employ latent-variable graphical factorizations that encode the dynamics and observation processes underlying a system. The essential random variables and relationships are:

  • Latent state $z_t$ or $s_t$: compact, possibly disentangled embedding of the world at time $t$.
  • Latent action $u_t$ or $a_t$: abstract control code, potentially inferred from passive data, encoding the transformation from $z_t$ to $z_{t+1}$ (or generating $x_{t+1}$).
  • Action $a_t$ (when observed): physical action label, used to align latent actions if available.
  • Observation $x_t$: high-dimensional raw input (image, video frame, sensory vector).

Common Generative Model Structures

The following are representative generative model factorizations found in latent world model formulations:

  • Action-conditioned VAE/RSSM factorization (Dreamer, LAWM, AdaWorld, CoLA-World, Motus):

$$p(x_{1:T}, s_{1:T}, u_{1:T} \mid a_{1:T}) = \prod_{t=1}^{T} p(x_t \mid s_t)\, p(s_t \mid s_{t-1}, u_{t-1})\, p(u_t \mid a_t)$$

Here, $s_t$ encodes the latent state, $u_t$ is the latent action (abstract or grounded), and $a_t$ is the true action (when available) (Alles et al., 10 Dec 2025, Gao et al., 24 Mar 2025, Tharwat et al., 22 Sep 2025, Wang et al., 30 Oct 2025, Bi et al., 15 Dec 2025).

  • Action-free or passive video factorization (OLAF-World, Factored LAWM, SWIRL):

$$p(x_{1:T}, s_{1:T}, u_{1:T}) = \prod_{t=1}^{T} p(x_t \mid s_t)\, p(s_t \mid s_{t-1}, u_{t-1})\, p(u_t)$$

The latent action $u_t$ is inferred by an inverse dynamics model $q(u_t \mid x_t, x_{t+1})$, or at the factor level for slot-based models (Wang et al., 18 Feb 2026, Jiang et al., 10 Feb 2026, Qiu et al., 5 Feb 2026).

  • Sequence-level factorization and alignment (Olaf-World):

The expected effect of all latent actions over a sequence is aligned with the average semantic perceptual change in a frozen feature space (Jiang et al., 10 Feb 2026).
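Taken together, these factorizations describe ancestral sampling through a latent transition and a decoder. Below is a minimal numpy sketch of such a rollout; the linear maps `W_trans` and `W_dec`, the tanh nonlinearity, and all dimensions are illustrative stand-ins for the learned networks, not any cited architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
S, U, X = 8, 4, 16  # latent-state, latent-action, observation dims (illustrative)

# Stand-ins for learned networks (random linear maps for the sketch).
W_trans = rng.normal(scale=0.1, size=(S, S + U))  # transition p(s_t | s_{t-1}, u_{t-1})
W_dec = rng.normal(scale=0.1, size=(X, S))        # decoder    p(x_t | s_t)

def transition(s_prev, u_prev):
    """Mean of p(s_t | s_{t-1}, u_{t-1}); a real model adds stochasticity."""
    return np.tanh(W_trans @ np.concatenate([s_prev, u_prev]))

def decode(s):
    """Mean of the observation likelihood p(x_t | s_t)."""
    return W_dec @ s

def rollout(s0, latent_actions):
    """Ancestral sampling through the factorization: chain the latent
    transition over time and decode an observation at every step."""
    s, xs = s0, []
    for u in latent_actions:
        s = transition(s, u)
        xs.append(decode(s))
    return np.stack(xs)

x_pred = rollout(np.zeros(S), rng.normal(size=(5, U)))
print(x_pred.shape)  # (5, 16): T predicted observations
```

The same loop underlies both the action-conditioned case (where $u_t$ comes from $p(u_t \mid a_t)$) and the action-free case (where $u_t$ comes from an unconditional prior or an inverse dynamics encoder).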

Slot and Entity-wise Factorization

  • Factored models decompose both state and actions into independent per-entity or per-factor channels:

$$s_t = (s_t^1, \dots, s_t^K), \qquad u_t = (u_t^1, \dots, u_t^K)$$

with the transition factorizing as $p(s_t \mid s_{t-1}, u_{t-1}) = \prod_{k=1}^{K} p(s_t^k \mid s_{t-1}^k, u_{t-1}^k)$ (Wang et al., 18 Feb 2026).
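A per-slot transition of this kind can be sketched as follows; the shared linear map `W` and all dimensions are illustrative assumptions, and the final check demonstrates the defining independence property (each slot depends only on its own state and action).

```python
import numpy as np

rng = np.random.default_rng(1)
K, D_s, D_u = 3, 6, 2  # entities/slots, per-slot state and action dims (illustrative)

# One shared transition applied independently per slot, as in factored models.
W = rng.normal(scale=0.1, size=(D_s, D_s + D_u))

def factored_transition(slots, slot_actions):
    """p(s_t | s_{t-1}, u_{t-1}) = prod_k p(s_t^k | s_{t-1}^k, u_{t-1}^k):
    each slot k evolves from its own previous state and latent action only."""
    return np.stack([
        np.tanh(W @ np.concatenate([slots[k], slot_actions[k]]))
        for k in range(K)
    ])

slots = rng.normal(size=(K, D_s))
actions = rng.normal(size=(K, D_u))
next_slots = factored_transition(slots, actions)

# Independence check: perturbing slot 0's action leaves slots 1..K-1 unchanged.
actions2 = actions.copy()
actions2[0] += 1.0
next_slots2 = factored_transition(slots, actions2)
print(np.allclose(next_slots[1:], next_slots2[1:]))  # True
```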

2. Inference Models and Action Learning from Passive Data

Latent actions may be directly observed or inferred via an inverse dynamics encoder, often parameterized as a VAE, a spatio-temporal transformer, or a hybrid of the two:

  • Passive inverse-dynamics inference:

$$u_t \sim q_\phi(u_t \mid s_t, s_{t+1})$$

or, in video, $q_\phi(u_t \mid x_t, x_{t+1})$, learned to reconstruct $x_{t+1}$ or $s_{t+1}$ conditioned on $x_t$ (or $s_t$) and $u_t$ (Qiu et al., 5 Feb 2026, Gao et al., 24 Mar 2025, Tharwat et al., 22 Sep 2025, Wang et al., 30 Oct 2025, Wang et al., 18 Feb 2026, Jiang et al., 10 Feb 2026).

  • Action-conditioned inference (when $a_t$ is available):

$$u_t \sim q_\phi(u_t \mid s_t, a_t)$$

aligning observations with control labels; in the minimally labeled or self-supervised cases, the model bootstraps this correspondence from passive data (Alles et al., 10 Dec 2025, Gao et al., 24 Mar 2025, Wang et al., 18 Feb 2026).

  • Alignment objectives (e.g., sequence-level effect alignment):

Cosine similarity between the integrated latent action over a sequence and the observed feature-difference "effect direction":

$$\mathcal{L}_{\text{align}} = 1 - \cos\!\left(\bar{u},\, \Delta\phi\right)$$

where $\bar{u} = \frac{1}{T}\sum_{t=1}^{T} u_t$ and $\Delta\phi = \frac{1}{T}\sum_{t=1}^{T} \big(\phi(x_{t+1}) - \phi(x_t)\big)$ is the average change in frozen feature space (Jiang et al., 10 Feb 2026).
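The two central mechanisms of this section, passive inverse-dynamics inference and sequence-level effect alignment, can be sketched together. All maps (`W_inv`, `W_fwd`, a frozen feature projection `W_phi`) and dimensions below are illustrative assumptions standing in for learned or pretrained networks.

```python
import numpy as np

rng = np.random.default_rng(0)
X, U, T = 16, 4, 6  # observation dim, latent-action dim, clip length (illustrative)

W_inv = rng.normal(scale=0.1, size=(U, 2 * X))  # inverse dynamics q(u_t | x_t, x_{t+1})
W_fwd = rng.normal(scale=0.1, size=(X, X + U))  # forward model p(x_{t+1} | x_t, u_t)
W_phi = rng.normal(scale=0.1, size=(U, X))      # frozen feature map, projected to action dim

def infer_latent_action(x_t, x_next):
    """Posterior mean of q(u_t | x_t, x_{t+1}): read the action off a transition pair."""
    return np.tanh(W_inv @ np.concatenate([x_t, x_next]))

def recon_loss(x_t, u_t, x_next):
    """Self-supervised check: does u_t let the forward model reconstruct x_{t+1}?"""
    return float(np.mean((W_fwd @ np.concatenate([x_t, u_t]) - x_next) ** 2))

def alignment_loss(latent_actions, xs):
    """Sequence-level effect alignment: 1 - cos(mean_t u_t, mean_t delta-phi_t)."""
    u_bar = latent_actions.mean(axis=0)
    phi = xs @ W_phi.T                         # frozen features phi(x_1)..phi(x_{T+1})
    delta = np.diff(phi, axis=0).mean(axis=0)  # average perceptual change
    cos = u_bar @ delta / (np.linalg.norm(u_bar) * np.linalg.norm(delta) + 1e-8)
    return 1.0 - float(cos)

xs = rng.normal(size=(T + 1, X))  # a passive (action-free) clip
us = np.stack([infer_latent_action(xs[t], xs[t + 1]) for t in range(T)])
print(us.shape, alignment_loss(us, xs) >= 0.0)  # (6, 4) True
```

In a trained system, `recon_loss` would be minimized jointly with the alignment term so that inferred latent actions both predict the future and share a common "effect" coordinate system across videos.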

3. Variational Training Objectives and Regularization

Optimization typically follows a variational framework—maximizing evidence lower bounds (ELBO) or related objectives. Most formulations include:

  • ELBO-style objectives:

$$\mathcal{L}_{\text{ELBO}} = \mathbb{E}_{q}\!\left[\sum_{t} \log p(x_t \mid s_t)\right] - \sum_{t} D_{\mathrm{KL}}\!\big(q(s_t \mid \cdot) \,\|\, p(s_t \mid s_{t-1}, u_{t-1})\big) - \sum_{t} D_{\mathrm{KL}}\!\big(q(u_t \mid \cdot) \,\|\, p(u_t)\big)$$

where the form of $q$ and which KL terms are present depend on the data modality (action-conditioned, action-free, factorized). For example, β-VAE or "free-nats" constraints control information flow through the action channel and inhibit overfitting or autoencoding (Gao et al., 24 Mar 2025, Jiang et al., 10 Feb 2026, Wang et al., 18 Feb 2026).

  • Reconstruction terms:

Pixel-space or latent-space prediction losses supervise the model's simulation of future outcomes (Jiang et al., 10 Feb 2026, Gao et al., 24 Mar 2025, Tharwat et al., 22 Sep 2025, Bi et al., 15 Dec 2025).

  • Auxiliary alignment and regularization:

Additional objectives, such as sequence-level effect alignment on the action channel or clustering/codebook constraints on latent actions, regularize the latent space beyond the ELBO (Jiang et al., 10 Feb 2026, Gao et al., 24 Mar 2025).
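A toy numerical sketch of the negative ELBO with free-nats clipping on the state KL and a β weight on the action KL, using diagonal Gaussians; the specific means, variances, and weights are illustrative, not taken from any cited paper.

```python
import numpy as np

def gauss_kl(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians, summed over dims."""
    return 0.5 * np.sum(np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

def neg_elbo(recon_err, kl_state, kl_action, beta=1.0, free_nats=1.0):
    """Negative ELBO: reconstruction error, plus the state KL clipped by a
    free-nats budget (no penalty below that many nats), plus a beta-weighted
    action KL that throttles information through the latent-action channel."""
    return recon_err + max(kl_state - free_nats, 0.0) + beta * kl_action

kl_s = gauss_kl(np.array([0.5]), np.array([1.0]), np.array([0.0]), np.array([1.0]))
kl_u = gauss_kl(np.array([2.0]), np.array([0.5]), np.array([0.0]), np.array([1.0]))
small = neg_elbo(recon_err=1.0, kl_state=kl_s, kl_action=kl_u, beta=0.1)
large = neg_elbo(recon_err=1.0, kl_state=kl_s, kl_action=kl_u, beta=1.0)
print(small < large)  # True: a larger beta penalizes the action channel more
```

Raising β (or lowering the free-nats budget) forces the posterior toward the prior, which is how these objectives prevent the action channel from simply autoencoding the next frame.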

4. Architectural Parameterizations and Modeling Choices

Latent world models instantiate a diverse array of neural parameterizations.

Backbone and Encoder Designs

Encoders range from CNN/VAE backbones and VQ-VAE slot tokenizers to frozen transformer features (e.g., V-JEPA2) and BEV encoders for driving (see the table in Section 7).

Transition and Dynamics Models

Dynamics are parameterized as stochastic RSSMs (GRU + MLP), masked latent transformers (MaskGIT/TSSM), cross-attention transitions, or joint diffusion over future frames and latent actions.

Decoder and Output Heads

Outputs include pixel-space reconstructions, latent-feature predictions, and planning-oriented heads such as trajectory prediction.

5. Planning, Control, and Downstream Applications

Latent world model frameworks support planning and policy learning via several mechanisms:

Robust generalization and transfer result directly from the model’s ability to align or adapt latent actions, either via alignment to effect directions (as in Olaf-World), or via clustering/discretization (as in AdaWorld and CoLA-World) (Jiang et al., 10 Feb 2026, Gao et al., 24 Mar 2025, Wang et al., 30 Oct 2025).
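Planning with a latent world model typically means scoring imagined rollouts entirely in latent space. A minimal numpy sketch of random-shooting planning, the simplest member of the shooting/CEM family, over a stand-in latent transition (all maps, the goal, and dimensions are illustrative assumptions, not any cited system):

```python
import numpy as np

rng = np.random.default_rng(4)
S, U, H, N = 8, 4, 5, 64  # latent dim, action dim, horizon, candidate count (illustrative)

W = rng.normal(scale=0.1, size=(S, S + U))  # stand-in for the learned latent transition
goal = rng.normal(size=S)                   # plan toward a goal latent state

def step(s, u):
    return np.tanh(W @ np.concatenate([s, u]))

def rollout_cost(s0, plan):
    """Imagined rollout entirely in latent space; cost = distance to the goal state."""
    s = s0
    for u in plan:
        s = step(s, u)
    return float(np.sum((s - goal) ** 2))

def random_shooting(s0):
    """Sample N candidate latent-action sequences and keep the cheapest."""
    plans = rng.normal(size=(N, H, U))
    costs = [rollout_cost(s0, p) for p in plans]
    best = int(np.argmin(costs))
    return plans[best], costs[best]

best_plan, best_cost = random_shooting(np.zeros(S))
print(best_plan.shape, best_cost >= 0.0)  # (5, 4) True
```

Because the search runs over latent actions, the same planner applies whether those actions were learned from labeled control data or inferred from passive video; a grounding head then maps the chosen latent actions to physical controls when needed.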

6. Theoretical Properties, Identifiability, and Alignment

Latent world models feature several theoretical and empirical properties:

  • Identifiability:
    • Under suitable low-degree biases and multi-task setups, latent world models can provably recover the true latent variables of a data-generating process up to invertible degree-1 transforms, i.e., permutation and sign flips (Zhang et al., 13 Feb 2025).
    • Architecture must foster “basis compatibility” for correct recovery.
  • Causal factorization and non-triviality:
    • An internal representation $Z$ constitutes a genuine latent world model only when there exists a "ground-truth" function $f^*$ that factors through $Z$ and is not accessible via trivial baselines on the raw inputs or outputs (Li et al., 29 Jul 2025).
    • Empirical validation typically involves probing experiments, baseline comparisons on input/output, and direct interventions on latent space.
  • Alignment and identifiability in latent actions:
    • Seq-REPA and similar mechanisms enforce shared coordinate systems and semantics for the action channel even in large-scale, heterogeneous passive data (Jiang et al., 10 Feb 2026).
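The probing criterion above can be illustrated with a synthetic example: a target function is constructed to factor through an internal representation, and a linear probe on that representation is compared against a trivial baseline probe on the raw inputs. All data and maps here are synthetic illustrations, not results from any cited paper.

```python
import numpy as np

rng = np.random.default_rng(5)
N, D = 200, 6

# Synthetic setup: a "world" quantity f* is nonlinear in the raw input x,
# but exactly linear in the model's internal representation z (it factors through z).
x = rng.normal(size=(N, D))
z = np.tanh(x @ rng.normal(size=(D, D)))   # stand-in internal representation Z
f_true = z @ rng.normal(size=D)            # ground-truth function, factored through Z

def probe_r2(features, target):
    """Fit a linear probe by least squares; report R^2 on the fitted data."""
    A = np.c_[features, np.ones(len(features))]
    w, *_ = np.linalg.lstsq(A, target, rcond=None)
    resid = target - A @ w
    return 1.0 - resid.var() / target.var()

r2_latent = probe_r2(z, f_true)   # probe on the internal representation
r2_input = probe_r2(x, f_true)    # trivial baseline probe on the raw inputs
print(r2_latent > r2_input)  # True: f* is recoverable from Z, not trivially from x
```

The gap between the two probes is the empirical signature of a non-trivial world model; direct interventions on $Z$ play the analogous role for causal (rather than merely correlational) claims.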

7. Model Variants, Empirical Results, and Representative Designs

Several notable latent world model designs and their empirical regimes include:

| Model/Framework | Latent state | Latent action | Transition model | Training/data regime | Distinct properties | Reference |
|---|---|---|---|---|---|---|
| Dreamer V3, LAWM | RSSM (categorical/vector) | direct or IL model | GRU + MLP, stoch. RSSM | robot video, human video | Offline RL, action chunking, no action labels | (Tharwat et al., 22 Sep 2025) |
| EMERALD | spatial categorical tensor | vector (flat) | MaskGIT, TSSM | pixel-based RL, Crafter | Masked latent transformer, spatial ELBO | (Burchi et al., 5 Jul 2025) |
| AdaWorld, CoLA-World | video/diffusion latent | learned cont./disc. | VAE + diffusion (unconditional or joint) | action-free + few-shot action | Visual planning, action clustering/codebook | (Gao et al., 24 Mar 2025, Wang et al., 30 Oct 2025) |
| Olaf-World | V-JEPA2 transformer features | VAE-learned, cont. | conditional VAE + alignment (Seq-REPA) | large-scale passive video | Zero-shot action transfer, effect alignment | (Jiang et al., 10 Feb 2026) |
| FLAM | VQ-VAE slot-tokens | per-slot, cond. | cross-attention entity-wise, shared codebook | multi-entity video | Factored slot-wise policies | (Wang et al., 18 Feb 2026) |
| SWIRL | arbitrary encoder | reverse inferred | forward/inverse dynamics (coordinate ascent, RL) | state-only, LLM/VLM data | Info-max alternation, coordinate ascent theory | (Qiu et al., 5 Feb 2026) |
| Motus | optical-flow VAE | VAE-proj. of flow | joint diffusion over (future frames, latent actions) | 3-phase large-scale robot+human | Mixture-of-transformer, UniDiffuser-style scheduler | (Bi et al., 15 Dec 2025) |
| World4Drive, LAW, WorldRFT | BEV latent, spatial | intention/trajectory | Transformer-based, cross-attn latent transition | autonomous driving, self-sup. | Multi-modal, modular, planning-oriented | (Zheng et al., 1 Jul 2025, Li et al., 2024, Yang et al., 22 Dec 2025) |

Empirical studies confirm that self-supervised latent world models can achieve state-of-the-art planning, navigation, and policy performance with a fraction of the action-labeled data, and can efficiently bridge action-free and action-conditioned training (Gao et al., 24 Mar 2025, Alles et al., 10 Dec 2025, Wang et al., 30 Oct 2025, Jiang et al., 10 Feb 2026).
