World Models in Reinforcement Learning

Updated 26 September 2025
  • World Models are internal predictive systems that simulate future states, enabling robust decision-making in partially observable environments.
  • Comparative studies highlight trade-offs among RNN, transformer, and structured state space architectures in terms of memory, speed, and long-term accuracy.
  • State space models like S4WM demonstrate superior long-range prediction and efficient training, enhancing performance in complex, temporally extended tasks.

World models (WMs) are internal predictive models that enable artificial agents to learn the causal dynamics of their environment, supporting temporally extended reasoning, planning, and robust decision-making, especially in partially observable and complex domains. In model-based reinforcement learning (MBRL), world models serve as the generative core that simulates future environmental trajectories conditioned on observed history and action sequences, thereby allowing agents to evaluate potential plans before physical execution. The field of world modeling now encompasses a broad continuum of architectures—including recurrent neural networks (RNNs), transformers, structured state space models (SSMs), and diffusion models—each offering distinct capabilities with regard to memory, scalability, and the faithful simulation of environment dynamics.

1. Mathematical Foundations and Latent Variable Formulation

World models formalize the conditional distribution of future observable states given initial conditions and an action sequence. The canonical latent variable framework models high-dimensional observation sequences $x_{1:T}$ given an initial observation $x_0$ and actions $a_{1:T}$ using latent states $z_{0:T}$:

$$p(x_{1:T} \mid x_0, a_{1:T}) = \int p(z_0 \mid x_0) \prod_{t=1}^{T} p(x_t \mid z_{\leq t}, a_{\leq t})\, p(z_t \mid z_{<t}, a_{\leq t})\, dz_{0:T}$$

During training, standard practice is to maximize a variational evidence lower bound (ELBO), leveraging a variational posterior $q(z_t \mid x_t)$:

$$\log p(x_{1:T} \mid x_0, a_{1:T}) \geq \mathbb{E}_q\left[\sum_{t=1}^{T}\log p(x_t \mid z_{\leq t}, a_{\leq t}) - \mathrm{KL}\big(q(z_t \mid x_t)\,\|\,p(z_t \mid z_{<t}, a_{\leq t})\big)\right]$$

Architectural instantiations differ in how the latent transition and prior/policy conditioning mechanisms are parameterized and in the form of the memory backbone (sequential, attention-based, or state space).
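
As a concrete illustration of this objective, the following is a minimal sketch of ELBO training for a Gaussian latent world model in PyTorch. The GRU transition, linear encoder and decoder, and all dimensions are illustrative assumptions rather than any specific published architecture; the backbones discussed next differ chiefly in what replaces the recurrent transition.

```python
# Minimal sketch of the ELBO above for a Gaussian latent world model.
# The GRU transition, linear encoder/decoder, and all dimensions are
# illustrative assumptions, not a specific published architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentWorldModel(nn.Module):
    def __init__(self, obs_dim, act_dim, latent_dim=32, hidden_dim=128):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, 2 * latent_dim)        # q(z_t | x_t)
        self.transition = nn.GRUCell(latent_dim + act_dim, hidden_dim)
        self.prior_head = nn.Linear(hidden_dim, 2 * latent_dim)  # p(z_t | z_{<t}, a_{<=t})
        self.decoder = nn.Linear(latent_dim, obs_dim)            # p(x_t | z_{<=t}, a_{<=t})
        self.latent_dim, self.hidden_dim = latent_dim, hidden_dim

    def elbo(self, obs, actions):
        """obs: (T, B, obs_dim), actions: (T, B, act_dim). Returns scalar ELBO."""
        T, B, _ = obs.shape
        h = obs.new_zeros(B, self.hidden_dim)   # deterministic history summary
        z = obs.new_zeros(B, self.latent_dim)
        recon, kl = 0.0, 0.0
        for t in range(T):
            h = self.transition(torch.cat([z, actions[t]], -1), h)
            prior_mu, prior_logvar = self.prior_head(h).chunk(2, -1)
            post_mu, post_logvar = self.encoder(obs[t]).chunk(2, -1)
            # Reparameterized sample from the variational posterior q(z_t | x_t).
            z = post_mu + torch.randn_like(post_mu) * (0.5 * post_logvar).exp()
            # Gaussian reconstruction term (up to additive constants).
            recon = recon - F.mse_loss(self.decoder(z), obs[t], reduction="mean")
            # Closed-form KL between the two diagonal Gaussians.
            kl_t = 0.5 * (prior_logvar - post_logvar - 1.0
                          + (post_logvar.exp() + (post_mu - prior_mu) ** 2)
                          / prior_logvar.exp())
            kl = kl + kl_t.sum(-1).mean()
        return recon - kl   # maximize this bound; the training loss is its negation
```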

2. Backbone Architectures: RNNs, Transformers, and Structured State Space Models

Three major classes of world model architectures are compared on their capacity to model long-range dependencies and support efficient, high-fidelity future prediction:

  • Recurrent Neural Network-based (RNN, e.g., RSSM-TBTT): The RSSM variant employs recurrent, sequential updates, encoding history in a hidden state. RNNs offer high throughput in imagination (i.e., fast rollouts) thanks to cheap single-step updates, but they are inherently limited in memory capacity and inefficient to train on long sequences, since updates cannot be parallelized and gradients must be truncated during backpropagation through time.
  • Transformer-based (e.g., TSSM-XL): Memory and contextual reasoning are strengthened via self-attention over tokenized history, supporting long-range dependency modeling given sufficient context window (“cache” length). However, training and inference costs scale quadratically with sequence length, and empirical results indicate that transformers have limited robustness on extremely long input sequences unless architectural hyperparameters (e.g., cache size) are set aggressively.
  • Structured State Space Models (SSM, e.g., S4WM): The S4WM world model employs stacks of parallelizable state space model (PSSM) blocks, achieving sub-quadratic complexity while capturing arbitrarily long input contexts. History is first embedded via an MLP, then passed through SSM layers that maintain a hidden state vector through recurrent but parallelizable updates (see the sketch after this list). For long-horizon tasks and context-dependent recall, S4WM achieves superior long-term predictive accuracy and scaling efficiency. For example, on the Four Rooms task, S4WM achieves generation MSE ≈ 44.0, compared to ≈ 219.4 for RSSM-TBTT and ≈ 224.4 for TSSM-XL. S4WM also trains faster, owing to its parallel interface, while maintaining high accuracy during imagination rollouts.
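
The sketch below illustrates the duality the S4WM bullet relies on: a diagonal linear state space layer can be run step by step (as during imagination rollouts) or, equivalently, as a single convolution over the whole input sequence (as during parallel training). This is a toy NumPy simplification of an S4-style layer, not the actual S4WM implementation.

```python
# Toy diagonal linear SSM, showing the equivalence of its sequential
# (recurrent) and parallel (convolutional) modes. A simplification of
# S4-style layers, not the actual S4WM implementation.
import numpy as np

def ssm_recurrent(A, B, C, u):
    """Sequential mode: h_t = A h_{t-1} + B u_t, y_t = Re(C . h_t)."""
    h = np.zeros_like(A)
    y = np.empty(len(u))
    for t in range(len(u)):
        h = A * h + B * u[t]          # elementwise, since A is diagonal
        y[t] = np.real(C @ h)
    return y

def ssm_convolutional(A, B, C, u):
    """Parallel mode: y = K * u with kernel K_t = Re(C A^t B), which can be
    precomputed for all t and applied as one (FFT-able) causal convolution."""
    T = len(u)
    K = np.array([np.real(C @ (A ** t * B)) for t in range(T)])
    return np.convolve(u, K)[:T]

rng = np.random.default_rng(0)
n = 8                                             # state size
A = 0.9 * np.exp(2j * np.pi * rng.random(n))      # stable diagonal dynamics
B = rng.standard_normal(n).astype(complex)
C = rng.standard_normal(n).astype(complex)
u = rng.standard_normal(64)                       # input sequence

assert np.allclose(ssm_recurrent(A, B, C, u), ssm_convolutional(A, B, C, u))
```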

The following table compares key attributes:

| Backbone      | Memory Capacity | Parallelizability | Imagination Speed | Long-Term Accuracy |
|---------------|-----------------|-------------------|-------------------|--------------------|
| RNN (RSSM)    | Limited         | Sequential        | High              | Poor               |
| Transformer   | Configurable    | Parallelizable    | Moderate          | Moderate           |
| SSM (S4/S4WM) | Extensive       | Parallelizable    | High-Moderate     | Superior           |

3. Benchmarking Memory, Prediction, and Reasoning

A comprehensive evaluation is carried out across carefully designed tasks, each targeting a specific requirement for memory and temporal abstraction:

  • Long-Term Imagination: 3D memory mazes (e.g., Two/Four/Ten Rooms) require hundreds of future-step predictions.
  • Context-Dependent Recall: “Teleport” variants reset the agent to an earlier state, forcing reliance on internal memory/history tracking.
  • Reward Prediction: 2D environments with distractors demand the suppression of irrelevant input and retention of reward-relevant information.
  • Memory-Based Reasoning: Multi Doors Keys tasks evaluate an agent's ability to reason over acquired objects and use them in the correct temporal sequence.

Performance is quantified by reconstruction and generation MSE, with lower values indicating better performance. S4WM is consistently dominant in both standard and memory-intensive settings, and remains resilient as required memory lengths increase. Cache-augmented transformers can rival S4WM on short sequences but degrade more quickly as horizons extend.
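
As an illustration of this evaluation protocol, here is a hedged sketch: the model first observes a context prefix, then generates the remaining steps purely from its prior transition, and the generated frames are scored against ground truth. The `observe`/`imagine`/`decode` interface is hypothetical, standing in for whichever world-model API is in use.

```python
# Hedged sketch of generation-MSE evaluation: condition on a context
# prefix, roll out the rest in imagination, score against ground truth.
# The observe/imagine/decode interface is hypothetical.
import numpy as np

def generation_mse(model, obs, actions, context_len):
    """obs: (T, ...) ground-truth frames; actions: (T, act_dim)."""
    # Posterior state after observing the context prefix.
    state = model.observe(obs[:context_len], actions[:context_len])
    errors = []
    for t in range(context_len, len(obs)):
        state = model.imagine(state, actions[t])   # prior transition only
        pred = model.decode(state)                 # predicted frame
        errors.append(np.mean((pred - obs[t]) ** 2))
    return float(np.mean(errors))
```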

4. Implications for Model-Based Reinforcement Learning and Practical Deployment

Improvements in long-term memory and prediction translate directly into more sample-efficient and robust model-based RL agents. High-fidelity, consistent rollouts improve planning accuracy over extended time horizons and reduce compounding model errors. The scalability of PSSMs—especially S4/S5 variants—permits the modeling of extensive histories without prohibitive computational overhead, overcoming limitations of both sequential RNNs and transformer quadratic scaling.

S4WM achieves a favorable trade-off: faster training than RNNs (due to parallel updates), superior long-term prediction versus both transformer and RNN backbones, and only a modest decrease in imagination throughput relative to simple RNNs. These advantages make S4WM an attractive backbone for real-world deployment where diverse, persistent memory and scalable parallelization are paramount.
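
To make the planning benefit concrete, the following is an illustrative sketch of random-shooting model-predictive control on top of a trained world model: candidate action sequences are scored entirely in imagination, and only the best first action is executed. The `imagine`/`predict_reward` interface is again hypothetical, and practical pipelines typically refine this scheme with CEM or a learned policy.

```python
# Illustrative random-shooting MPC over a trained world model. Candidate
# plans are evaluated in latent imagination, with no environment steps.
# The imagine/predict_reward interface is hypothetical.
import numpy as np

def plan(model, state, horizon=15, n_candidates=256, act_dim=4, rng=None):
    rng = rng or np.random.default_rng()
    candidates = rng.uniform(-1, 1, size=(n_candidates, horizon, act_dim))
    returns = np.zeros(n_candidates)
    for i, seq in enumerate(candidates):
        s = state
        for a in seq:
            s = model.imagine(s, a)          # latent rollout only
            returns[i] += model.predict_reward(s)
    return candidates[np.argmax(returns), 0]  # first action of the best plan
```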

5. Implementation Considerations and Trade-Offs

  • Data Regimes: All compared architectures assume sufficient coverage of relevant temporal contexts in training data. For SSM- and transformer-based models, full benefit is realized only when the model context spans the longest dependencies present in the tasks.
  • Hardware and Efficiency: Truncation-based RNN training (TBTT) is a bottleneck for long-horizon data; parallel architectures (transformers, SSMs) can exploit modern hardware for speedups (see the TBTT sketch after this list).
  • Scalability: SSMs, and especially the S4/S4WM variants, exhibit sub-quadratic complexity, making them preferable for high-fidelity simulation in large, partially observable settings.
  • Limitations: The dominance of S4WM is established on domains with structured spatial and memory demands. Tasks that depend heavily on non-sequential, non-local attention may still favor large-transformer solutions if efficiency constraints are relaxed. Transformer-based models may offer better adaptability in heterogeneously structured environments if input contexts are carefully engineered.
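
As referenced in the hardware bullet above, here is a minimal sketch of why TBTT bottlenecks long-horizon RNN training: sequences are processed chunk by chunk, and detaching the hidden state at chunk boundaries truncates gradients, so credit assignment cannot span more than one chunk. The model interface (`initial_state`, `elbo_and_state`) is hypothetical.

```python
# Minimal sketch of truncated backpropagation through time (TBTT).
# Gradients stop at each detach(), so dependencies longer than
# chunk_len steps receive no learning signal. The model interface
# (initial_state, elbo_and_state) is hypothetical.
import torch

def tbtt_train(model, optimizer, obs, actions, chunk_len=50):
    # obs, actions: tensors of shape (T, B, ...), with T >> chunk_len.
    h = model.initial_state(obs.shape[1])
    for start in range(0, obs.shape[0], chunk_len):
        chunk_obs = obs[start:start + chunk_len]
        chunk_act = actions[start:start + chunk_len]
        loss, h = model.elbo_and_state(chunk_obs, chunk_act, h)
        optimizer.zero_grad()
        (-loss).backward()        # maximize the ELBO on this chunk
        optimizer.step()
        h = h.detach()            # truncate the gradient at the boundary
```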

6. Domain-Specific Applications and Extensions

The empirical findings for S4WM generalize to applications demanding reliable long-horizon modeling, such as robotic control, navigation in spatially extended environments, strategic planning with context-dependent state transitions, and high-frequency environment simulation for sample-efficient reinforcement learning. Incorporating S4WM into existing MBRL pipelines can enhance both learning speed (by enabling parallelized simulation) and planning competence (by permitting robust, accurate multi-step rollouts).

Downstream implications include improved policy exploration through foresight into long-range consequences, enhanced transferability via more expressive latent dynamics, and the ability to scale world modeling to domains demanding hundreds or thousands of context steps, as required in realistic, partially observed environments.


In summary, the systematic comparison demonstrates that structured state space models, as instantiated by the S4WM framework, set a new standard for world model backbones in MBRL, combining superior long-term memory, parallelizability, and computational efficiency for temporally extended simulation and reasoning. These features are critical for advancing both the theoretical and practical capabilities of model-based agents.
