Latent World Model Formulation
- A latent world model formulation learns compact representations of the world state and agent actions to support efficient planning.
- It utilizes structured probabilistic frameworks and variational objectives to decouple observations from control, enabling robust policy learning.
- The formulation supports diverse architectures—from CNNs to transformers—and demonstrates strong generalization in autonomous planning and control tasks.
A latent world model is a structured probabilistic system that encodes the dynamics of an environment in a learned latent space, decoupling observations and control from raw sensory inputs and actions. By inferring compact latent representations of both the world state and agent actions, such models enable efficient planning, prediction, and policy learning from high-dimensional, potentially unstructured data, including passive observations and limited action labels. Latent world model formulations encompass diverse graphical structures, learning objectives, and representational assumptions, supporting controllable generative modeling, autonomous decision-making, and robust generalization.
1. Core Mathematical Foundations and Generative Factorizations
Latent world models employ latent-variable graphical factorizations that encode the dynamics and observation processes underlying a system. The essential random variables and relationships are:
- Latent state $z_t$ (or $s_t$): compact, possibly disentangled embedding of the world at time $t$.
- Latent action $u_t$ (or $\tilde{a}_t$): abstract control code, potentially inferred from passive data, encoding the transformation from $z_t$ to $z_{t+1}$ (or generating $o_{t+1}$).
- Action $a_t$ (when observed): physical action label, used to align latent actions if available.
- Observation $o_t$ (or $x_t$): high-dimensional raw input (image, video frame, sensory vector).
Common Generative Model Structures
The following are representative generative model factorizations found in latent world model formulations:
- Action-conditioned VAE/RSSM factorization (Dreamer, LAWM, AdaWorld, CoLA-World, Motus):

  $$p(o_{1:T}, z_{1:T}, u_{1:T} \mid a_{1:T}) = \prod_{t=1}^{T} p(o_t \mid z_t)\, p(z_t \mid z_{t-1}, u_t)\, p(u_t \mid a_t)$$

  Here, $z_t$ encodes the latent state, $u_t$ is the latent action (abstract or grounded), and $a_t$ is the true action (when available) (Alles et al., 10 Dec 2025, Gao et al., 24 Mar 2025, Tharwat et al., 22 Sep 2025, Wang et al., 30 Oct 2025, Bi et al., 15 Dec 2025).
- Action-free or passive video factorization (Olaf-World, Factored LAWM, SWIRL):

  $$p(o_{1:T}, z_{1:T}, u_{1:T}) = \prod_{t=1}^{T} p(o_t \mid z_t)\, p(z_t \mid z_{t-1}, u_t)\, p(u_t)$$

  The latent action $u_t$ is inferred by an inverse dynamics model $q_\phi(u_t \mid o_t, o_{t+1})$, or at the factor level for slot-based models (Wang et al., 18 Feb 2026, Jiang et al., 10 Feb 2026, Qiu et al., 5 Feb 2026).
- Sequence-level factorization and alignment (Olaf-World):
The expected effect of all latent actions over a sequence is aligned with the average semantic perceptual change in a frozen feature space (Jiang et al., 10 Feb 2026).
Slot and Entity-wise Factorization
- Factored models decompose both state and actions into independent per-entity or per-factor channels:
  $$p(z_t, u_t \mid z_{t-1}) = \prod_{k=1}^{K} p(z_t^k \mid z_{t-1}^k, u_t^k)\, p(u_t^k)$$

  with $z_t = (z_t^1, \ldots, z_t^K)$ and $u_t = (u_t^1, \ldots, u_t^K)$ (Wang et al., 18 Feb 2026).
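As a toy illustration, an action-conditioned factorization of this kind can be rolled out by chaining its three conditionals: a latent-action prior grounded on the true action, a Markovian latent transition, and an emission model. The scalar linear-Gaussian components and coefficients below are illustrative assumptions, not values from any cited model.

```python
import random

random.seed(0)

# Toy rollout of p(o_t | z_t) p(z_t | z_{t-1}, u_t) p(u_t | a_t)
# with scalar linear-Gaussian stand-ins for each conditional.

def sample_latent_action(a_t):
    """p(u_t | a_t): ground the latent action on the true action."""
    return 0.9 * a_t + random.gauss(0.0, 0.1)

def transition(z_prev, u_t):
    """p(z_t | z_{t-1}, u_t): Markovian latent dynamics."""
    return 0.8 * z_prev + 0.5 * u_t + random.gauss(0.0, 0.05)

def emit(z_t):
    """p(o_t | z_t): decode the latent state into an observation."""
    return 2.0 * z_t + random.gauss(0.0, 0.01)

def rollout(actions, z0=0.0):
    """Sample (u_t, z_t, o_t) for each step of an action sequence."""
    z, traj = z0, []
    for a in actions:
        u = sample_latent_action(a)
        z = transition(z, u)
        traj.append((u, z, emit(z)))
    return traj

for t, (u, z, o) in enumerate(rollout([1.0, 0.0, -1.0, 0.0])):
    print(f"t={t}: u={u:+.2f}  z={z:+.2f}  o={o:+.2f}")
```

In a real model each conditional is a neural network trained jointly; the structure of the sampling loop is unchanged.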
2. Inference Models and Action Learning from Passive Data
Latent actions may be directly observed or inferred via an inverse dynamics encoder, often parameterized as a VAE, a spatio-temporal transformer, or a hybrid of the two:
- Passive inverse-dynamics inference:
  $$u_t \sim q_\phi(u_t \mid o_t, o_{t+1})$$

  or, in video, $u_t \sim q_\phi(u_t \mid o_{1:t+1})$, learned to reconstruct $o_{t+1}$ or $z_{t+1}$ conditioned on $o_t$ (or $z_t$) and $u_t$ (Qiu et al., 5 Feb 2026, Gao et al., 24 Mar 2025, Tharwat et al., 22 Sep 2025, Wang et al., 30 Oct 2025, Wang et al., 18 Feb 2026, Jiang et al., 10 Feb 2026).
- Action-conditioned inference (when $a_t$ is available):

  $$z_t \sim q_\phi(z_t \mid z_{t-1}, a_{t-1}, o_t)$$

  aligning observations with control labels; in the minimally-labeled or self-supervised cases, the model bootstraps this correspondence from passive data (Alles et al., 10 Dec 2025, Gao et al., 24 Mar 2025, Wang et al., 18 Feb 2026).
- Alignment objectives (e.g., sequence-level effect alignment):
Cosine similarity between the integrated latent action over a sequence and the observed feature-difference "effect direction":
  $$\mathcal{L}_{\text{align}} = 1 - \cos\!\left(\bar{u},\ \overline{\Delta f}\right)$$

  where $\bar{u} = \tfrac{1}{T}\sum_{t=1}^{T} u_t$ and $\overline{\Delta f} = \tfrac{1}{T}\sum_{t=1}^{T}\big(f(o_{t+1}) - f(o_t)\big)$ is the average change in frozen feature space (Jiang et al., 10 Feb 2026).
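The sequence-level alignment objective reduces to a few lines of plain Python: cosine distance between the mean latent action over a clip and the mean frame-to-frame change in a frozen feature space. The feature lists below stand in for outputs of a pretrained frozen encoder; all names and dimensions are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-8)

def effect_alignment_loss(latent_actions, frame_features):
    """1 - cos( mean_t u_t , mean_t [f(o_{t+1}) - f(o_t)] )."""
    dim = len(latent_actions[0])
    T = len(latent_actions)
    u_bar = [sum(u[d] for u in latent_actions) / T for d in range(dim)]
    n = len(frame_features) - 1
    df_bar = [
        sum(frame_features[t + 1][d] - frame_features[t][d] for t in range(n)) / n
        for d in range(dim)
    ]
    return 1.0 - cosine(u_bar, df_bar)

# Features drift along the first axis; actions along that axis align.
feats = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
aligned = effect_alignment_loss([[1.0, 0.0], [1.0, 0.0]], feats)
orthogonal = effect_alignment_loss([[0.0, 1.0], [0.0, 1.0]], feats)
print(f"aligned loss    = {aligned:.3f}")     # ~0.0
print(f"orthogonal loss = {orthogonal:.3f}")  # 1.0
```

Latent actions pointing along the observed feature drift incur near-zero loss; orthogonal ones are fully penalized.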
3. Variational Training Objectives and Regularization
Optimization typically follows a variational framework—maximizing evidence lower bounds (ELBO) or related objectives. Most formulations include:
- ELBO-style objectives:
  $$\mathcal{L}_{\text{ELBO}} = \mathbb{E}_{q}\!\left[\sum_{t}\log p(o_t \mid z_t)\right] - \sum_{t}\mathrm{KL}\big(q(z_t \mid \cdot)\,\|\,p(z_t \mid z_{t-1}, u_t)\big) - \sum_{t}\mathrm{KL}\big(q(u_t \mid \cdot)\,\|\,p(u_t)\big)$$

where the choice of posterior $q$ and which KL terms are present depend on the data modality (action-conditioned, action-free, factored); e.g.:
- For action-conditioned sequences: KL terms are imposed both on the latent states $q(z_t \mid \cdot)$ and on the inferred latent actions $q(u_t \mid \cdot)$ (Alles et al., 10 Dec 2025, Gao et al., 24 Mar 2025).
- For action-free data: $q(u_t \mid o_t, o_{t+1})$ is regularized against a prior $p(u_t)$ (Alles et al., 10 Dec 2025, Qiu et al., 5 Feb 2026, Jiang et al., 10 Feb 2026, Wang et al., 18 Feb 2026).
- KL regularization:
β-VAE or "free-nats" constraints control information flow through the action channel and inhibit overfitting or autoencoding (Gao et al., 24 Mar 2025, Jiang et al., 10 Feb 2026, Wang et al., 18 Feb 2026).
- Reconstruction terms:
Pixel-space or latent-space prediction losses supervise the model's simulation of future outcomes (Jiang et al., 10 Feb 2026, Gao et al., 24 Mar 2025, Tharwat et al., 22 Sep 2025, Bi et al., 15 Dec 2025).
- Auxiliary alignment and regularization:
- Sequence-level effect alignment (see above) (Jiang et al., 10 Feb 2026).
- Bisimulation-based distance regularization to ensure task-relevant features are preserved (Sun et al., 2024).
- Commitment and codebook losses for discrete or vector-quantized action spaces (Wang et al., 30 Oct 2025).
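The β-weighted KL with a "free nats" floor can be sketched for diagonal-Gaussian posteriors and priors; the `beta` and `free_nats` values below are illustrative, not taken from any cited configuration.

```python
import math

def kl_diag_gauss(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, diag var_q) || N(mu_p, diag var_p) ), summed over dims."""
    kl = 0.0
    for mq, vq, mp, vp in zip(mu_q, var_q, mu_p, var_p):
        kl += 0.5 * (math.log(vp / vq) + (vq + (mq - mp) ** 2) / vp - 1.0)
    return kl

def regularized_kl(mu_q, var_q, mu_p, var_p, beta=1.0, free_nats=1.0):
    """beta * max(KL - free_nats, 0): below the floor the penalty (and
    its gradient) vanishes, so the channel retains some capacity."""
    return beta * max(kl_diag_gauss(mu_q, var_q, mu_p, var_p) - free_nats, 0.0)

# Matching the prior costs nothing, small deviations within the
# free-nats budget are also free, and only large deviations are penalized.
print(regularized_kl([0.0], [1.0], [0.0], [1.0]))  # 0.0
print(regularized_kl([0.5], [1.0], [0.0], [1.0]))  # 0.0 (KL = 0.125 < 1 nat)
print(regularized_kl([3.0], [1.0], [0.0], [1.0]))  # 3.5 (KL = 4.5)
```

Raising `beta` tightens the information bottleneck through the action channel; the free-nats floor prevents posterior collapse.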
4. Architectural Parameterizations and Modeling Choices
Latent world models instantiate a diverse array of neural parameterizations.
Backbone and Encoder Designs
- CNN and VQ-VAE-based encoders for frame-level or patch-wise latent variables (Wang et al., 18 Feb 2026, Burchi et al., 5 Jul 2025, Tharwat et al., 22 Sep 2025).
- Spatio-temporal Transformers for temporal inference over video and action prediction (Jiang et al., 10 Feb 2026, Gao et al., 24 Mar 2025, Tharwat et al., 22 Sep 2025).
- Optical flow VAE for action embedding in Motus: the latent action $u_t$ is explicitly a function of flow fields, encoded and projected into a low-dimensional latent action space (Bi et al., 15 Dec 2025).
- Slot or entity factorization: slot-attention over tokens for multi-entity modeling (Wang et al., 18 Feb 2026).
Transition and Dynamics Models
- Simple Markovian or RSSM dynamics: latent state $z_t$ propagated via $p(z_t \mid z_{t-1}, u_{t-1})$ or Dreamer-style recurrent architectures (Alles et al., 10 Dec 2025, Burchi et al., 5 Jul 2025).
- Transformer-based (masked, latent, spatial) transition models: e.g., MaskGIT block for spatial parallelism (Burchi et al., 5 Jul 2025).
- Diffusion or flow-matching models: denoising video latents/frames conditioned on sequences of latent actions (Wang et al., 30 Oct 2025, Bi et al., 15 Dec 2025, Jiang et al., 10 Feb 2026).
- Factored per-entity models: per-slot inverse/forward cross-attention (Wang et al., 18 Feb 2026).
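To make the RSSM-style split between a deterministic recurrent path and a stochastic latent concrete, here is a scalar toy step; the weights are arbitrary illustrative constants, and real models replace them with GRUs/MLPs (Dreamer) or masked transformers (EMERALD).

```python
import math
import random

random.seed(0)

def rssm_step(h_prev, z_prev, u_t):
    """One toy RSSM step: deterministic h_t, then stochastic z_t ~ prior."""
    # Deterministic path: h_t = tanh(w_h * h_{t-1} + w_z * z_{t-1} + w_u * u_t)
    h_t = math.tanh(0.6 * h_prev + 0.3 * z_prev + 0.5 * u_t)
    # Stochastic path: prior p(z_t | h_t) = N(mu(h_t), sigma(h_t)^2)
    mu, sigma = 0.8 * h_t, 0.1 + 0.4 * abs(h_t)
    z_t = mu + sigma * random.gauss(0.0, 1.0)
    return h_t, z_t

h, z = 0.0, 0.0
for u in (1.0, 1.0, -1.0):
    h, z = rssm_step(h, z, u)
print(f"h={h:+.3f} z={z:+.3f}")
```

During training, a posterior q(z_t | h_t, o_t) replaces the prior sample, with the two tied by a KL term as in the ELBO objectives of Section 3.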
Decoder and Output Heads
- Pixel/image decoders: reconstructing predicted frames.
- Latent-space decoders: matching future latent state encodings as supervision, supporting efficient control and planning (Zheng et al., 1 Jul 2025, Zhang et al., 14 Nov 2025, Li et al., 2024).
5. Planning, Control, and Downstream Applications
Latent world model frameworks support planning and policy learning via several mechanisms:
- Model-predictive control (MPC) in latent space: roll out sequences of latent states/actions, optimize over trajectories via CEM or sampling-based methods (Zhang et al., 14 Nov 2025, Gao et al., 24 Mar 2025, Wang et al., 30 Oct 2025).
- Off-policy and reinforcement learning: Dreamer-style actor-critic or group-relative policy optimization (GRPO) directly in latent space (Burchi et al., 5 Jul 2025, Yang et al., 22 Dec 2025, Sun et al., 2024).
- Self-supervised or annotation-free planning: utilization of multi-modal world models to evaluate multiple predicted future trajectories in the latent space, with selector modules for decision making (Zheng et al., 1 Jul 2025, Li et al., 2024, Yang et al., 22 Dec 2025).
- Planning-oriented refinements: local-aware iterative update mechanisms and hierarchical decomposition with separate queries for target region, path, and trajectory prediction (e.g., WorldRFT) (Yang et al., 22 Dec 2025).
Robust generalization and transfer result directly from the model’s ability to align or adapt latent actions, either via alignment to effect directions (as in Olaf-World), or via clustering/discretization (as in AdaWorld and CoLA-World) (Jiang et al., 10 Feb 2026, Gao et al., 24 Mar 2025, Wang et al., 30 Oct 2025).
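The latent-space MPC loop with CEM can be sketched in a few dozen lines: sample candidate latent-action sequences, score them by rolling them through the transition model, and refit a Gaussian over the elites. The linear dynamics and quadratic terminal cost below are toy stand-ins for a learned model and reward head.

```python
import random

random.seed(0)

def transition(z, u):
    """Stand-in for a learned latent transition model."""
    return 0.9 * z + 0.5 * u

def cost(actions, z0, goal):
    """Roll out an action sequence; return squared distance to goal."""
    z = z0
    for u in actions:
        z = transition(z, u)
    return (z - goal) ** 2

def cem_plan(z0, goal, horizon=5, pop=64, elites=8, iters=20):
    """Cross-entropy method over latent-action sequences."""
    mu, sigma = [0.0] * horizon, [1.0] * horizon
    for _ in range(iters):
        cands = [[random.gauss(m, s) for m, s in zip(mu, sigma)]
                 for _ in range(pop)]
        cands.sort(key=lambda a: cost(a, z0, goal))
        top = cands[:elites]
        mu = [sum(a[t] for a in top) / elites for t in range(horizon)]
        sigma = [max(1e-3, (sum((a[t] - mu[t]) ** 2 for a in top) / elites) ** 0.5)
                 for t in range(horizon)]
    return mu  # mean elite action sequence

plan = cem_plan(z0=0.0, goal=1.0)
print(f"terminal cost = {cost(plan, 0.0, 1.0):.4f}")
```

In receding-horizon use, only the first action of `plan` would be executed before replanning from the new latent state.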
6. Theoretical Properties, Identifiability, and Alignment
Latent world models feature several theoretical and empirical properties:
- Identifiability:
- Under suitable low-degree biases and multi-task setups, latent world models can provably recover the true latent variables of a data-generating process up to invertible degree-1 transforms, i.e., permutation and sign flips (Zhang et al., 13 Feb 2025).
- Architecture must foster “basis compatibility” for correct recovery.
- Causal factorization and non-triviality:
- An internal representation $\phi(x)$ constitutes a genuine latent world model only when there exists a "ground-truth" function $g$ that factors through $\phi$ and is not accessible via trivial baselines on the inputs $x$ or outputs $y$ (Li et al., 29 Jul 2025).
- Empirical validation typically involves probing experiments, baseline comparisons on input/output, and direct interventions on latent space.
- Alignment and identifiability in latent actions:
- Seq-REPA and similar mechanisms enforce shared coordinate systems and semantics for the action channel, even in large-scale, heterogeneous passive data (Jiang et al., 10 Feb 2026).
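Recovery "up to permutation and sign flips" can be checked mechanically: recovered latents count as correct if some signed permutation maps them onto the ground-truth latents. A brute-force checker (feasible only for small latent dimension) might look like:

```python
from itertools import permutations

def equal_up_to_signed_perm(true_latents, recovered, tol=1e-6):
    """True if a permutation plus per-dimension sign flip maps `recovered`
    onto `true_latents`. Both are [n_samples][dim] lists; brute force in dim."""
    dim = len(true_latents[0])
    for perm in permutations(range(dim)):
        for signs in range(1 << dim):
            ok = all(
                abs(rec[perm[d]] * (1 if (signs >> d) & 1 else -1) - tru[d]) < tol
                for tru, rec in zip(true_latents, recovered)
                for d in range(dim)
            )
            if ok:
                return True
    return False

true = [[0.1, 2.0], [1.0, -0.5]]
rec = [[-2.0, 0.1], [0.5, 1.0]]  # dims swapped, one sign flipped
print(equal_up_to_signed_perm(true, rec))  # True
```

Evaluation protocols in practice use continuous scores such as the mean correlation coefficient rather than exact matching, but the equivalence class being tested is the same.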
7. Model Variants, Empirical Results, and Representative Designs
Several notable latent world model designs and their empirical regimes include:
| Model/Framework | Latent state | Latent action | Transition Model | Training/data regime | Distinct properties | Reference |
|---|---|---|---|---|---|---|
| Dreamer V3, LAWM | RSSM (categorical/vector) | direct or IL model | GRU + MLP, stoch. RSSM | robot video, human video | Offline RL, action chunking, no action labels | (Tharwat et al., 22 Sep 2025) |
| EMERALD | spatial categorical tensor | vector (flat) | MaskGIT, TSSM | pixel-based RL, Crafter | Masked latent transformer, spatial ELBO | (Burchi et al., 5 Jul 2025) |
| AdaWorld, CoLA-World | video/diffusion latent | learned cont./disc. | VAE+diffusion (unconditional or joint) | action-free+few-shot action | Visual planning, action clustering/codebook | (Gao et al., 24 Mar 2025, Wang et al., 30 Oct 2025) |
| Olaf-World | V-JEPA2 transformer features | VAE learned, cont. | conditional VAE + alignment (Seq-REPA) | large-scale passive video | Zero-shot action transfer, effect alignment | (Jiang et al., 10 Feb 2026) |
| FLAM | VQ-VAE slot-tokens | per-slot, cond. | Cross-attention entity-wise, shared codebook | multi-entity video | Factored slot-wise policies | (Wang et al., 18 Feb 2026) |
| SWIRL | arbitrary encoder | reverse inferred | forward/inverse dynamics (coordinate ascent, RL) | state-only, LLM/VLM data | Info-max alternation, coordinate ascent theory | (Qiu et al., 5 Feb 2026) |
| Motus | optical-flow VAE | VAE-proj. of flow | joint diffusion over (future frames, latent actions) | 3-phase large-scale robot+human | Mixture-of-transformer, UniDiffuser-style scheduler | (Bi et al., 15 Dec 2025) |
| World4Drive, LAW, WorldRFT | BEV latent, spatial | intention/trajectory | Transformer-based, cross-attn latent transition | autonomous driving, self-sup. | Multi-modal, modular, planning-oriented | (Zheng et al., 1 Jul 2025, Li et al., 2024, Yang et al., 22 Dec 2025) |
Empirical studies confirm that self-supervised latent world models can achieve state-of-the-art planning, navigation, and policy performance with a fraction of the action-labeled data, and can efficiently bridge action-free and action-conditioned training (Gao et al., 24 Mar 2025, Alles et al., 10 Dec 2025, Wang et al., 30 Oct 2025, Jiang et al., 10 Feb 2026).
References
- (Zheng et al., 1 Jul 2025) World4Drive: End-to-End Autonomous Driving via Intention-aware Physical Latent World Model
- (Gao et al., 24 Mar 2025) AdaWorld: Learning Adaptable World Models with Latent Actions
- (Tharwat et al., 22 Sep 2025) Latent Action Pretraining Through World Modeling
- (Burchi et al., 5 Jul 2025) Accurate and Efficient World Modeling with Masked Latent Transformers
- (Alles et al., 10 Dec 2025) Latent Action World Models for Control with Unlabeled Trajectories
- (Qiu et al., 5 Feb 2026) Self-Improving World Modelling with Latent Actions
- (Wang et al., 18 Feb 2026) Factored Latent Action World Models
- (Jiang et al., 10 Feb 2026) Olaf-World: Orienting Latent Actions for Video World Modeling
- (Wang et al., 30 Oct 2025) Co-Evolving Latent Action World Models
- (Yang et al., 22 Dec 2025) WorldRFT: Latent World Model Planning with Reinforcement Fine-Tuning for Autonomous Driving
- (Zhang et al., 13 Feb 2025) When do neural networks learn world models?
- (Li et al., 2024) Enhancing End-to-End Autonomous Driving with Latent World Model
- (Zhang et al., 14 Nov 2025) Latent-Space Autoregressive World Model for Efficient and Robust Image-Goal Navigation
- (Bi et al., 15 Dec 2025) Motus: A Unified Latent Action World Model
- (Li et al., 29 Jul 2025) What Does it Mean for a Neural Network to Learn a "World Model"?
- (Sun et al., 2024) Learning Latent Dynamic Robust Representations for World Models
- (Xiao et al., 2024) Learning Multiple Probabilistic Decisions from Latent World Model in Autonomous Driving