Latent World Model Formulation
- A latent world model formulation learns compact representations of the world state and agent actions to support efficient planning.
- It utilizes structured probabilistic frameworks and variational objectives to decouple observations from control, enabling robust policy learning.
- The formulation supports diverse architectures—from CNNs to transformers—and demonstrates strong generalization in autonomous planning and control tasks.
A latent world model is a structured probabilistic system that encodes the dynamics of an environment in a learned latent space, decoupling observations and control from raw sensory inputs and actions. By inferring compact latent representations of both the world state and agent actions, such models enable efficient planning, prediction, and policy learning from high-dimensional, potentially unstructured data, including passive observations and limited action labels. Latent world model formulations encompass diverse graphical structures, learning objectives, and representational assumptions, supporting controllable generative modeling, autonomous decision-making, and robust generalization.
1. Core Mathematical Foundations and Generative Factorizations
Latent world models employ latent-variable graphical factorizations that encode the dynamics and observation processes underlying a system. The essential random variables and relationships are:
- Latent state $z_t$ (or $s_t$): compact, possibly disentangled embedding of the world at time $t$.
- Latent action $u_t$ (or $\tilde{a}_t$): abstract control code, potentially inferred from passive data, encoding the transformation from $z_t$ to $z_{t+1}$ (or generating $o_{t+1}$).
- Action $a_t$ (when observed): physical action label, used to align latent actions if available.
- Observation $o_t$ (or $x_t$): high-dimensional raw input (image, video frame, sensory vector).
Common Generative Model Structures
The following are representative generative model factorizations found in latent world model formulations:
- Action-conditioned VAE/RSSM factorization (Dreamer, LAWM, AdaWorld, CoLA-World, Motus):

  $$p(o_{1:T}, z_{1:T}, u_{1:T} \mid a_{1:T}) = \prod_{t=1}^{T} p(o_t \mid z_t)\, p(z_t \mid z_{t-1}, u_t)\, p(u_t \mid a_t)$$

  Here, $z_t$ encodes the latent state, $u_t$ is the latent action (abstract or grounded), and $a_t$ is the true action (when available) (Alles et al., 10 Dec 2025, Gao et al., 24 Mar 2025, Tharwat et al., 22 Sep 2025, Wang et al., 30 Oct 2025, Bi et al., 15 Dec 2025).
- Action-free or passive video factorization (Olaf-World, Factored LAWM, SWIRL):

  $$p(o_{1:T}, z_{1:T}, u_{1:T}) = \prod_{t=1}^{T} p(o_t \mid z_t)\, p(z_t \mid z_{t-1}, u_t)\, p(u_t)$$

  The latent action $u_t$ is inferred by an inverse dynamics model $q_\phi(u_t \mid o_t, o_{t+1})$, or at the factor level for slot-based models (Wang et al., 18 Feb 2026, Jiang et al., 10 Feb 2026, Qiu et al., 5 Feb 2026).
- Sequence-level factorization and alignment (Olaf-World):
The expected effect of all latent actions over a sequence is aligned with the average semantic perceptual change in a frozen feature space (Jiang et al., 10 Feb 2026).
Slot and Entity-wise Factorization
- Factored models decompose both state and actions into independent per-entity or per-factor channels:
  $$p(z_t, u_t \mid z_{t-1}) = \prod_{k=1}^{K} p(z_t^k \mid z_{t-1}^k, u_t^k)\, p(u_t^k)$$

  with $z_t = (z_t^1, \ldots, z_t^K)$ and $u_t = (u_t^1, \ldots, u_t^K)$ (Wang et al., 18 Feb 2026).
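As a toy illustration, an action-conditioned factorization of this kind can be rolled out by chaining its three conditionals: a latent-action prior grounded on the true action, a Markovian latent transition, and an emission model. The scalar linear-Gaussian components and coefficients below are illustrative assumptions, not values from any cited model.

```python
import random

random.seed(0)

# Toy rollout of p(o_t | z_t) p(z_t | z_{t-1}, u_t) p(u_t | a_t)
# with scalar linear-Gaussian stand-ins for each conditional.

def sample_latent_action(a_t):
    """p(u_t | a_t): ground the latent action on the true action."""
    return 0.9 * a_t + random.gauss(0.0, 0.1)

def transition(z_prev, u_t):
    """p(z_t | z_{t-1}, u_t): Markovian latent dynamics."""
    return 0.8 * z_prev + 0.5 * u_t + random.gauss(0.0, 0.05)

def emit(z_t):
    """p(o_t | z_t): decode the latent state into an observation."""
    return 2.0 * z_t + random.gauss(0.0, 0.01)

def rollout(actions, z0=0.0):
    """Sample (u_t, z_t, o_t) for each step of an action sequence."""
    z, traj = z0, []
    for a in actions:
        u = sample_latent_action(a)
        z = transition(z, u)
        traj.append((u, z, emit(z)))
    return traj

for t, (u, z, o) in enumerate(rollout([1.0, 0.0, -1.0, 0.0])):
    print(f"t={t}: u={u:+.2f}  z={z:+.2f}  o={o:+.2f}")
```

In a real model each conditional is a neural network trained jointly; the structure of the sampling loop is unchanged.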
2. Inference Models and Action Learning from Passive Data
Latent actions may be directly observed or inferred via an inverse dynamics encoder, often parameterized as a VAE, a spatio-temporal transformer, or a hybrid of the two:
- Passive inverse-dynamics inference:
  $$u_t \sim q_\phi(u_t \mid o_t, o_{t+1})$$

  or, in video, $u_t \sim q_\phi(u_t \mid o_{1:t+1})$, learned to reconstruct $o_{t+1}$ or $z_{t+1}$ conditioned on $o_t$ (or $z_t$) and $u_t$ (Qiu et al., 5 Feb 2026, Gao et al., 24 Mar 2025, Tharwat et al., 22 Sep 2025, Wang et al., 30 Oct 2025, Wang et al., 18 Feb 2026, Jiang et al., 10 Feb 2026).
- Action-conditioned inference (when $a_t$ is available):

  $$z_t \sim q_\phi(z_t \mid z_{t-1}, a_{t-1}, o_t)$$

  aligning observations with control labels; in the minimally-labeled or self-supervised cases, the model bootstraps this correspondence from passive data (Alles et al., 10 Dec 2025, Gao et al., 24 Mar 2025, Wang et al., 18 Feb 2026).
- Alignment objectives (e.g., sequence-level effect alignment):
Cosine similarity between the integrated latent action over a sequence and the observed feature-difference "effect direction":
  $$\mathcal{L}_{\text{align}} = 1 - \cos\!\left(\bar{u},\ \overline{\Delta f}\right)$$

  where $\bar{u} = \tfrac{1}{T}\sum_{t=1}^{T} u_t$ and $\overline{\Delta f} = \tfrac{1}{T}\sum_{t=1}^{T}\big(f(o_{t+1}) - f(o_t)\big)$ is the average change in frozen feature space (Jiang et al., 10 Feb 2026).
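The sequence-level alignment objective reduces to a few lines of plain Python: cosine distance between the mean latent action over a clip and the mean frame-to-frame change in a frozen feature space. The feature lists below stand in for outputs of a pretrained frozen encoder; all names and dimensions are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-8)

def effect_alignment_loss(latent_actions, frame_features):
    """1 - cos( mean_t u_t , mean_t [f(o_{t+1}) - f(o_t)] )."""
    dim = len(latent_actions[0])
    T = len(latent_actions)
    u_bar = [sum(u[d] for u in latent_actions) / T for d in range(dim)]
    n = len(frame_features) - 1
    df_bar = [
        sum(frame_features[t + 1][d] - frame_features[t][d] for t in range(n)) / n
        for d in range(dim)
    ]
    return 1.0 - cosine(u_bar, df_bar)

# Features drift along the first axis; actions along that axis align.
feats = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
aligned = effect_alignment_loss([[1.0, 0.0], [1.0, 0.0]], feats)
orthogonal = effect_alignment_loss([[0.0, 1.0], [0.0, 1.0]], feats)
print(f"aligned loss    = {aligned:.3f}")     # ~0.0
print(f"orthogonal loss = {orthogonal:.3f}")  # 1.0
```

Latent actions pointing along the observed feature drift incur near-zero loss; orthogonal ones are fully penalized.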
3. Variational Training Objectives and Regularization
Optimization typically follows a variational framework—maximizing evidence lower bounds (ELBO) or related objectives. Most formulations include:
- ELBO-style objectives:
  $$\mathcal{L}_{\text{ELBO}} = \mathbb{E}_{q}\!\left[\sum_{t}\log p(o_t \mid z_t)\right] - \sum_{t}\mathrm{KL}\big(q(z_t \mid \cdot)\,\|\,p(z_t \mid z_{t-1}, u_t)\big) - \sum_{t}\mathrm{KL}\big(q(u_t \mid \cdot)\,\|\,p(u_t)\big)$$

where the choice of posterior $q$ and which KL terms are present depend on the data modality (action-conditioned, action-free, factored); e.g.:
- For action-conditioned sequences: KL terms are imposed both on the latent states $q(z_t \mid \cdot)$ and on the inferred latent actions $q(u_t \mid \cdot)$ (Alles et al., 10 Dec 2025, Gao et al., 24 Mar 2025).
- For action-free data: $q(u_t \mid o_t, o_{t+1})$ is regularized against a prior $p(u_t)$ (Alles et al., 10 Dec 2025, Qiu et al., 5 Feb 2026, Jiang et al., 10 Feb 2026, Wang et al., 18 Feb 2026).
- KL regularization:
β-VAE or "free-nats" constraints control information flow through the action channel and inhibit overfitting or autoencoding (Gao et al., 24 Mar 2025, Jiang et al., 10 Feb 2026, Wang et al., 18 Feb 2026).
- Reconstruction terms:
Pixel-space or latent-space prediction losses supervise the model's simulation of future outcomes (Jiang et al., 10 Feb 2026, Gao et al., 24 Mar 2025, Tharwat et al., 22 Sep 2025, Bi et al., 15 Dec 2025).
- Auxiliary alignment and regularization:
- Sequence-level effect alignment (see above) (Jiang et al., 10 Feb 2026).
- Bisimulation-based distance regularization to ensure task-relevant features are preserved (Sun et al., 2024).
- Commitment and codebook losses for discrete or vector-quantized action spaces (Wang et al., 30 Oct 2025).
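The β-weighted KL with a "free nats" floor can be sketched for diagonal-Gaussian posteriors and priors; the `beta` and `free_nats` values below are illustrative, not taken from any cited configuration.

```python
import math

def kl_diag_gauss(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, diag var_q) || N(mu_p, diag var_p) ), summed over dims."""
    kl = 0.0
    for mq, vq, mp, vp in zip(mu_q, var_q, mu_p, var_p):
        kl += 0.5 * (math.log(vp / vq) + (vq + (mq - mp) ** 2) / vp - 1.0)
    return kl

def regularized_kl(mu_q, var_q, mu_p, var_p, beta=1.0, free_nats=1.0):
    """beta * max(KL - free_nats, 0): below the floor the penalty (and
    its gradient) vanishes, so the channel retains some capacity."""
    return beta * max(kl_diag_gauss(mu_q, var_q, mu_p, var_p) - free_nats, 0.0)

# Matching the prior costs nothing, small deviations within the
# free-nats budget are also free, and only large deviations are penalized.
print(regularized_kl([0.0], [1.0], [0.0], [1.0]))  # 0.0
print(regularized_kl([0.5], [1.0], [0.0], [1.0]))  # 0.0 (KL = 0.125 < 1 nat)
print(regularized_kl([3.0], [1.0], [0.0], [1.0]))  # 3.5 (KL = 4.5)
```

Raising `beta` tightens the information bottleneck through the action channel; the free-nats floor prevents posterior collapse.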
4. Architectural Parameterizations and Modeling Choices
Latent world models instantiate a diverse array of neural parameterizations.
Backbone and Encoder Designs
- CNN and VQ-VAE-based encoders for frame-level or patch-wise latent variables (Wang et al., 18 Feb 2026, Burchi et al., 5 Jul 2025, Tharwat et al., 22 Sep 2025).
- Spatio-temporal Transformers for temporal inference over video and action prediction (Jiang et al., 10 Feb 2026, Gao et al., 24 Mar 2025, Tharwat et al., 22 Sep 2025).
- Optical flow VAE for action embedding in Motus: the latent action $u_t$ is explicitly a function of flow fields, encoded and projected into a low-dimensional latent action space (Bi et al., 15 Dec 2025).
- Slot or entity factorization: slot-attention over tokens for multi-entity modeling (Wang et al., 18 Feb 2026).
Transition and Dynamics Models
- Simple Markovian or RSSM dynamics: latent state $z_t$ propagated via $p(z_t \mid z_{t-1}, u_{t-1})$ or Dreamer-style recurrent architectures (Alles et al., 10 Dec 2025, Burchi et al., 5 Jul 2025).
- Transformer-based (masked, latent, spatial) transition models: e.g., MaskGIT block for spatial parallelism (Burchi et al., 5 Jul 2025).
- Diffusion or flow-matching models: denoising video latents/frames conditioned on sequences of latent actions (Wang et al., 30 Oct 2025, Bi et al., 15 Dec 2025, Jiang et al., 10 Feb 2026).
- Factored per-entity models: per-slot inverse/forward cross-attention (Wang et al., 18 Feb 2026).
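To make the RSSM-style split between a deterministic recurrent path and a stochastic latent concrete, here is a scalar toy step; the weights are arbitrary illustrative constants, and real models replace them with GRUs/MLPs (Dreamer) or masked transformers (EMERALD).

```python
import math
import random

random.seed(0)

def rssm_step(h_prev, z_prev, u_t):
    """One toy RSSM step: deterministic h_t, then stochastic z_t ~ prior."""
    # Deterministic path: h_t = tanh(w_h * h_{t-1} + w_z * z_{t-1} + w_u * u_t)
    h_t = math.tanh(0.6 * h_prev + 0.3 * z_prev + 0.5 * u_t)
    # Stochastic path: prior p(z_t | h_t) = N(mu(h_t), sigma(h_t)^2)
    mu, sigma = 0.8 * h_t, 0.1 + 0.4 * abs(h_t)
    z_t = mu + sigma * random.gauss(0.0, 1.0)
    return h_t, z_t

h, z = 0.0, 0.0
for u in (1.0, 1.0, -1.0):
    h, z = rssm_step(h, z, u)
print(f"h={h:+.3f} z={z:+.3f}")
```

During training, a posterior q(z_t | h_t, o_t) replaces the prior sample, with the two tied by a KL term as in the ELBO objectives of Section 3.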
Decoder and Output Heads
- Pixel/image decoders: reconstructing predicted frames.
- Latent-space decoders: matching future latent state encodings as supervision, supporting efficient control and planning (Zheng et al., 1 Jul 2025, Zhang et al., 14 Nov 2025, Li et al., 2024).
5. Planning, Control, and Downstream Applications
Latent world model frameworks support planning and policy learning via several mechanisms:
- Model-predictive control (MPC) in latent space: roll out sequences of latent states/actions, optimize over trajectories via CEM or sampling-based methods (Zhang et al., 14 Nov 2025, Gao et al., 24 Mar 2025, Wang et al., 30 Oct 2025).
- Off-policy and reinforcement learning: Dreamer-style actor-critic or group-relative policy optimization (GRPO) directly in latent space (Burchi et al., 5 Jul 2025, Yang et al., 22 Dec 2025, Sun et al., 2024).
- Self-supervised or annotation-free planning: utilization of multi-modal world models to evaluate multiple predicted future trajectories in the latent space, with selector modules for decision making (Zheng et al., 1 Jul 2025, Li et al., 2024, Yang et al., 22 Dec 2025).
- Planning-oriented refinements: local-aware iterative update mechanisms and hierarchical decomposition with separate queries for target region, path, and trajectory prediction (e.g., WorldRFT) (Yang et al., 22 Dec 2025).
Robust generalization and transfer result directly from the model’s ability to align or adapt latent actions, either via alignment to effect directions (as in Olaf-World), or via clustering/discretization (as in AdaWorld and CoLA-World) (Jiang et al., 10 Feb 2026, Gao et al., 24 Mar 2025, Wang et al., 30 Oct 2025).
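The latent-space MPC loop with CEM can be sketched in a few dozen lines: sample candidate latent-action sequences, score them by rolling them through the transition model, and refit a Gaussian over the elites. The linear dynamics and quadratic terminal cost below are toy stand-ins for a learned model and reward head.

```python
import random

random.seed(0)

def transition(z, u):
    """Stand-in for a learned latent transition model."""
    return 0.9 * z + 0.5 * u

def cost(actions, z0, goal):
    """Roll out an action sequence; return squared distance to goal."""
    z = z0
    for u in actions:
        z = transition(z, u)
    return (z - goal) ** 2

def cem_plan(z0, goal, horizon=5, pop=64, elites=8, iters=20):
    """Cross-entropy method over latent-action sequences."""
    mu, sigma = [0.0] * horizon, [1.0] * horizon
    for _ in range(iters):
        cands = [[random.gauss(m, s) for m, s in zip(mu, sigma)]
                 for _ in range(pop)]
        cands.sort(key=lambda a: cost(a, z0, goal))
        top = cands[:elites]
        mu = [sum(a[t] for a in top) / elites for t in range(horizon)]
        sigma = [max(1e-3, (sum((a[t] - mu[t]) ** 2 for a in top) / elites) ** 0.5)
                 for t in range(horizon)]
    return mu  # mean elite action sequence

plan = cem_plan(z0=0.0, goal=1.0)
print(f"terminal cost = {cost(plan, 0.0, 1.0):.4f}")
```

In receding-horizon use, only the first action of `plan` would be executed before replanning from the new latent state.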
6. Theoretical Properties, Identifiability, and Alignment
Latent world models feature several theoretical and empirical properties:
- Identifiability:
- Under suitable low-degree biases and multi-task setups, latent world models can provably recover the true latent variables of a data-generating process up to invertible degree-1 transforms, i.e., permutation and sign flips (Zhang et al., 13 Feb 2025).
- Architecture must foster “basis compatibility” for correct recovery.
- Causal factorization and non-triviality:
- An internal representation $\phi(x)$ constitutes a genuine latent world model only when there exists a "ground-truth" function $g$ that factors through $\phi$ and is not accessible via trivial baselines on the inputs $x$ or outputs $y$ (Li et al., 29 Jul 2025).
- Empirical validation typically involves probing experiments, baseline comparisons on input/output, and direct interventions on latent space.
- Alignment and identifiability in latent actions:
- Seq-REPA and similar mechanisms enforce shared coordinate systems and semantics for the action channel, even in large-scale, heterogeneous passive data (Jiang et al., 10 Feb 2026).
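Recovery "up to permutation and sign flips" can be checked mechanically: recovered latents count as correct if some signed permutation maps them onto the ground-truth latents. A brute-force checker (feasible only for small latent dimension) might look like:

```python
from itertools import permutations

def equal_up_to_signed_perm(true_latents, recovered, tol=1e-6):
    """True if a permutation plus per-dimension sign flip maps `recovered`
    onto `true_latents`. Both are [n_samples][dim] lists; brute force in dim."""
    dim = len(true_latents[0])
    for perm in permutations(range(dim)):
        for signs in range(1 << dim):
            ok = all(
                abs(rec[perm[d]] * (1 if (signs >> d) & 1 else -1) - tru[d]) < tol
                for tru, rec in zip(true_latents, recovered)
                for d in range(dim)
            )
            if ok:
                return True
    return False

true = [[0.1, 2.0], [1.0, -0.5]]
rec = [[-2.0, 0.1], [0.5, 1.0]]  # dims swapped, one sign flipped
print(equal_up_to_signed_perm(true, rec))  # True
```

Evaluation protocols in practice use continuous scores such as the mean correlation coefficient rather than exact matching, but the equivalence class being tested is the same.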
7. Model Variants, Empirical Results, and Representative Designs
Several notable latent world model designs and their empirical regimes include:
| Model/Framework | Latent state | Latent action | Transition Model | Training/data regime | Distinct properties | Reference |
|---|---|---|---|---|---|---|
| Dreamer V3, LAWM | RSSM (categorical/vector) | direct or IL model | GRU + MLP, stoch. RSSM | robot video, human video | Offline RL, action chunking, no action labels | (Tharwat et al., 22 Sep 2025) |
| EMERALD | spatial categorical tensor | vector (flat) | MaskGIT, TSSM | pixel-based RL, Crafter | Masked latent transformer, spatial ELBO | (Burchi et al., 5 Jul 2025) |
| AdaWorld, CoLA-World | video/diffusion latent | learned cont./disc. | VAE+diffusion (unconditional or joint) | action-free+few-shot action | Visual planning, action clustering/codebook | (Gao et al., 24 Mar 2025, Wang et al., 30 Oct 2025) |
| Olaf-World | V-JEPA2 transformer features | VAE learned, cont. | conditional VAE + alignment (Seq-REPA) | large-scale passive video | Zero-shot action transfer, effect alignment | (Jiang et al., 10 Feb 2026) |
| FLAM | VQ-VAE slot-tokens | per-slot, cond. | Cross-attention entity-wise, shared codebook | multi-entity video | Factored slot-wise policies | (Wang et al., 18 Feb 2026) |
| SWIRL | arbitrary encoder | reverse inferred | forward/inverse dynamics (coordinate ascent, RL) | state-only, LLM/VLM data | Info-max alternation, coordinate ascent theory | (Qiu et al., 5 Feb 2026) |
| Motus | optical-flow VAE | VAE-proj. of flow | joint diffusion over (future frames, latent actions) | 3-phase large-scale robot+human | Mixture-of-transformer, UniDiffuser-style scheduler | (Bi et al., 15 Dec 2025) |
| World4Drive, LAW, WorldRFT | BEV latent, spatial | intention/trajectory | Transformer-based, cross-attn latent transition | autonomous driving, self-sup. | Multi-modal, modular, planning-oriented | (Zheng et al., 1 Jul 2025, Li et al., 2024, Yang et al., 22 Dec 2025) |
Empirical studies confirm that self-supervised latent world models can achieve state-of-the-art planning, navigation, and policy performance with a fraction of the action-labeled data, and can efficiently bridge action-free and action-conditioned training (Gao et al., 24 Mar 2025, Alles et al., 10 Dec 2025, Wang et al., 30 Oct 2025, Jiang et al., 10 Feb 2026).
References
- (Zheng et al., 1 Jul 2025) World4Drive: End-to-End Autonomous Driving via Intention-aware Physical Latent World Model
- (Gao et al., 24 Mar 2025) AdaWorld: Learning Adaptable World Models with Latent Actions
- (Tharwat et al., 22 Sep 2025) Latent Action Pretraining Through World Modeling
- (Burchi et al., 5 Jul 2025) Accurate and Efficient World Modeling with Masked Latent Transformers
- (Alles et al., 10 Dec 2025) Latent Action World Models for Control with Unlabeled Trajectories
- (Qiu et al., 5 Feb 2026) Self-Improving World Modelling with Latent Actions
- (Wang et al., 18 Feb 2026) Factored Latent Action World Models
- (Jiang et al., 10 Feb 2026) Olaf-World: Orienting Latent Actions for Video World Modeling
- (Wang et al., 30 Oct 2025) Co-Evolving Latent Action World Models
- (Yang et al., 22 Dec 2025) WorldRFT: Latent World Model Planning with Reinforcement Fine-Tuning for Autonomous Driving
- (Zhang et al., 13 Feb 2025) When do neural networks learn world models?
- (Li et al., 2024) Enhancing End-to-End Autonomous Driving with Latent World Model
- (Zhang et al., 14 Nov 2025) Latent-Space Autoregressive World Model for Efficient and Robust Image-Goal Navigation
- (Bi et al., 15 Dec 2025) Motus: A Unified Latent Action World Model
- (Li et al., 29 Jul 2025) What Does it Mean for a Neural Network to Learn a "World Model"?
- (Sun et al., 2024) Learning Latent Dynamic Robust Representations for World Models
- (Xiao et al., 2024) Learning Multiple Probabilistic Decisions from Latent World Model in Autonomous Driving