Learning Local Causal World Models with State Space Models and Attention

Published 4 May 2025 in cs.LG and stat.ML | (2505.02074v1)

Abstract: World modelling, i.e. building a representation of the rules that govern the world so as to predict its evolution, is an essential ability for any agent interacting with the physical world. Despite their impressive performance, many solutions fail to learn a causal representation of the environment they are trying to model, which would be necessary to gain a deep enough understanding of the world to perform complex tasks. With this work, we aim to broaden the research at the intersection of causality theory and neural world modelling by assessing the potential for causal discovery of the State Space Model (SSM) architecture, which has been shown to have several advantages over the widespread Transformer. We show empirically that, compared to an equivalent Transformer, an SSM can model the dynamics of a simple environment and learn a causal model at the same time with equivalent or better performance, thus paving the way for further experiments that lean into the strengths of SSMs and further enhance them with causal awareness.

Authors (3)

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a single consolidated list of what remains missing, uncertain, or unexplored in the paper, framed to guide future research:

Reliance on provided object masks sidesteps the harder problem of learning object-centric slots from raw pixels; robustness of the causal graph when segmentation is learned (unsupervised or weakly supervised) is untested.
Evaluation is confined to a single small synthetic dataset (Interventional Pong) with few objects and simple physics; scalability to more complex, higher-resolution, or real-world environments is unknown.
The causal interpretation of attention weights is assumed but not theoretically justified; the conditions under which attention-derived edges reflect true causal relations (vs. correlations) are not established.
Acyclicity, causal directionality, and identifiability are not enforced or analyzed; the learned graphs may contain cycles or ambiguous directions, conflicting with DAG-based SCM assumptions.
The procedure for binarizing weighted attentions into edges for SHD (e.g., thresholds, per-layer aggregation, calibration) is not specified, leaving the causal evaluation pipeline underdetermined.
Multi-head attention handling in causal graph extraction is unspecified (e.g., per-head graphs, aggregation rules), limiting interpretability and reproducibility.
The “paths across layers” construction $\bar A = \prod_\ell (A^\ell + \mathbb{I})$ lacks a clear causal semantics, normalization, or sensitivity analysis; whether it faithfully captures indirect influences is unclear.
Only contemporaneous (within-step) edges are modeled; explicit time-lagged causal relationships (dynamic causal graphs over multiple time steps) are not considered.
In testing, only the environment slot is updated with a small few-shot set while the world model remains frozen; how to adapt dynamics when interventions require updating transition functions is left open.
The capacity, selection, and training dynamics of the environment codebook are under-specified (e.g., number of codes, assignment mechanism, failure modes with many or overlapping environments).
Claims that new environments can be added during training are not experimentally demonstrated; zero-shot or continual adaptation protocols are not assessed.
The loss omits mention of standard VAE regularization (e.g., KL divergence); it is unclear whether a KL term is used, how it is weighted, and how posterior collapse is avoided.
Sensitivity of performance and discovered graphs to the sparsity weight schedule (dynamic $\lambda$) and the baseline target $\tau$ is not explored; no ablation on schedule design is provided.
No analysis of variance across random seeds, confidence intervals, or statistical significance; robustness of reported improvements is uncertain.
Parameter parity and training budget parity between S2-SSM and Transformer baselines are not documented; fair comparison is uncertain.
The method allows edges from objects to the environment slot; whether the environment is treated as an exogenous cause (and whether enforcing that improves results) is unexplored.
Higher-order (non-pairwise) interactions are not modeled explicitly; whether cross-attention suffices for triadic or collective effects remains an open question.
Long-horizon rollouts and compounding error are not evaluated; benefits of SSM memory for multi-step prediction or planning are not demonstrated.
No assessment of inference speed, memory footprint, or hardware efficiency, despite SSMs’ touted advantages; practical trade-offs remain unquantified.
Robustness to occlusions, partial observability, missing objects, variable object counts, and dynamic appearance/disappearance is not tested.
Noise robustness (sensor noise, distractors, background clutter) and domain shift beyond the provided composite environments are not examined.
No counterfactual or interventional validation (e.g., do-predictions) is performed to test whether the learned graph supports causal reasoning beyond SHD.
Generalization to tasks beyond next-frame prediction (e.g., control, planning, counterfactual reasoning) is not investigated.
Comparison set is narrow (Transformer baseline and ablations); benchmarking against graph/relational baselines (e.g., NRI, GNN-based causal learners) is missing.
The approach assumes a fixed number of slots/objects; how to handle varying or unknown numbers of entities is not addressed.
Potential confounding by unobserved factors is only heuristically addressed via an environment slot; formal treatment of latent confounders and its limits is absent.
The mapping from attention to causal strength (sigmoid of $q_i^\top k_j$) and its reuse as an attention mask may introduce circularity; alternative metrics and their impact are not explored.
Ground-truth graph definitions for each environment and their directionality (especially with interventions) are not detailed, making SHD interpretability opaque.
Reproducibility is limited by missing implementation details (e.g., training hyperparameters, thresholds, code availability).
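
Several of the gaps above concern the graph-extraction pipeline: the sigmoid score on $q_i^\top k_j$, the cross-layer product $\bar A = \prod_\ell (A^\ell + \mathbb{I})$, the unspecified binarization step, and SHD. The following is a minimal NumPy sketch of one way these pieces could fit together, not the paper's implementation. The function names, the single-head setting, the left-to-right multiplication order over layers, the fixed `threshold=0.5`, and the entry-wise SHD (under which a reversed edge counts as two errors) are all assumptions made here for illustration, precisely because the source leaves them open.

```python
import numpy as np

def attention_strengths(q, k):
    """Pairwise scores sigmoid(q_i^T k_j) for q, k of shape (n_slots, d)."""
    logits = q @ k.T
    return 1.0 / (1.0 + np.exp(-logits))

def aggregate_paths(layer_attn):
    """Product over layers of (A^l + I), multiplied in list order (assumed)."""
    n = layer_attn[0].shape[0]
    a_bar = np.eye(n)
    for a in layer_attn:
        a_bar = a_bar @ (a + np.eye(n))
    return a_bar

def binarize(a_bar, threshold=0.5):
    """Hypothetical rule: keep off-diagonal entries strictly above threshold."""
    edges = (a_bar > threshold).astype(int)
    np.fill_diagonal(edges, 0)  # self-loops are dropped by assumption
    return edges

def shd(pred, true):
    """Entry-wise structural Hamming distance between two adjacency matrices."""
    return int(np.sum(np.asarray(pred) != np.asarray(true)))

# Toy example with three slots and two layers (values are made up).
layer_1 = np.array([[0.0, 1.0, 0.0],
                    [0.0, 0.0, 0.0],
                    [0.0, 0.0, 0.0]])
layer_2 = np.array([[0.0, 0.0, 0.0],
                    [0.0, 0.0, 1.0],
                    [0.0, 0.0, 0.0]])
true_graph = np.array([[0, 1, 0],
                       [0, 0, 1],
                       [0, 0, 0]])

a_bar = aggregate_paths([layer_1, layer_2])
pred_graph = binarize(a_bar)
print(pred_graph)
print("SHD:", shd(pred_graph, true_graph))
```

In this toy case the product introduces a nonzero entry at position (0, 2) that appears in neither layer, because it composes the (0, 1) entry of the first layer with the (1, 2) entry of the second. Whether such an entry should count as a causal edge or only as an indirect influence depends on the ground-truth definition and the threshold, which is the ambiguity the list above points out.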
