Learning Local Causal World Models with State Space Models and Attention
Abstract: World modelling, i.e. building a representation of the rules that govern the world so as to predict its evolution, is an essential ability for any agent interacting with the physical world. Despite their impressive performance, many solutions fail to learn a causal representation of the environment they are trying to model, which would be necessary to gain a deep enough understanding of the world to perform complex tasks. With this work, we aim to broaden the research at the intersection of causality theory and neural world modelling by assessing the potential for causal discovery of the State Space Model (SSM) architecture, which has been shown to have several advantages over the widespread Transformer. We show empirically that, compared to an equivalent Transformer, an SSM can model the dynamics of a simple environment and learn a causal model at the same time with equivalent or better performance, thus paving the way for further experiments that lean into the strengths of SSMs and further enhance them with causal awareness.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a single consolidated list of what remains missing, uncertain, or unexplored in the paper, framed to guide future research:
- Reliance on provided object masks sidesteps the harder problem of learning object-centric slots from raw pixels; robustness of the causal graph when segmentation is learned (unsupervised or weakly supervised) is untested.
- Evaluation is confined to a single small synthetic dataset (Interventional Pong) with few objects and simple physics; scalability to more complex, higher-resolution, or real-world environments is unknown.
- The causal interpretation of attention weights is assumed but not theoretically justified; the conditions under which attention-derived edges reflect true causal relations (vs. correlations) are not established.
- Acyclicity, causal directionality, and identifiability are not enforced or analyzed; the learned graphs may contain cycles or ambiguous directions, conflicting with DAG-based SCM assumptions.
- The procedure for binarizing weighted attentions into edges for SHD (e.g., thresholds, per-layer aggregation, calibration) is not specified, leaving the causal evaluation pipeline underdetermined.
- Multi-head attention handling in causal graph extraction is unspecified (e.g., per-head graphs, aggregation rules), limiting interpretability and reproducibility.
- The “paths across layers” construction lacks a clear causal semantics, normalization, or sensitivity analysis; whether it faithfully captures indirect influences is unclear.
- Only contemporaneous (within-step) edges are modeled; explicit time-lagged causal relationships (dynamic causal graphs over multiple time steps) are not considered.
- In testing, only the environment slot is updated with a small few-shot set while the world model remains frozen; how to adapt dynamics when interventions require updating transition functions is left open.
- The capacity, selection, and training dynamics of the environment codebook are under-specified (e.g., number of codes, assignment mechanism, failure modes with many or overlapping environments).
- Claims that new environments can be added during training are not experimentally demonstrated; zero-shot or continual adaptation protocols are not assessed.
- The loss omits mention of standard VAE regularization (e.g., KL divergence); it is unclear whether a KL term is used, how it is weighted, and how posterior collapse is avoided.
- Sensitivity of performance and discovered graphs to the dynamic sparsity weight schedule and the baseline target is not explored; no ablation on schedule design is provided.
- No analysis of variance across random seeds, confidence intervals, or statistical significance; robustness of reported improvements is uncertain.
- Parameter parity and training budget parity between S2-SSM and Transformer baselines are not documented; fair comparison is uncertain.
- The method allows edges from objects to the environment slot; whether the environment is treated as an exogenous cause (and whether enforcing that improves results) is unexplored.
- Higher-order (non-pairwise) interactions are not modeled explicitly; whether cross-attention suffices for triadic or collective effects remains an open question.
- Long-horizon rollouts and compounding error are not evaluated; benefits of SSM memory for multi-step prediction or planning are not demonstrated.
- No assessment of inference speed, memory footprint, or hardware efficiency, despite SSMs’ touted advantages; practical trade-offs remain unquantified.
- Robustness to occlusions, partial observability, missing objects, variable object counts, and dynamic appearance/disappearance is not tested.
- Noise robustness (sensor noise, distractors, background clutter) and domain shift beyond the provided composite environments are not examined.
- No counterfactual or interventional validation (e.g., do-predictions) is performed to test whether the learned graph supports causal reasoning beyond SHD.
- Generalization to tasks beyond next-frame prediction (e.g., control, planning, counterfactual reasoning) is not investigated.
- Comparison set is narrow (Transformer baseline and ablations); benchmarking against graph/relational baselines (e.g., NRI, GNN-based causal learners) is missing.
- The approach assumes a fixed number of slots/objects; how to handle varying or unknown numbers of entities is not addressed.
- Potential confounding by unobserved factors is only heuristically addressed via an environment slot; formal treatment of latent confounders and its limits is absent.
- The mapping from attention to causal strength (via a sigmoid) and its reuse as an attention mask may introduce circularity; alternative metrics and their impact are not explored.
- Ground-truth graph definitions for each environment and their directionality (especially with interventions) are not detailed, making SHD interpretability opaque.
- Reproducibility is limited by missing implementation details (e.g., training hyperparameters, thresholds, code availability).
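To make the gaps about binarization and multi-head aggregation concrete, the following is a minimal sketch of one possible evaluation pipeline, not the paper's procedure. It assumes attention weights are arranged so that `attn[h, i, j]` is the influence of slot `i` on slot `j` in head `h`, averages over heads, and applies a fixed threshold of 0.5. The function names, the averaging rule, and the threshold are all hypothetical choices made for illustration. SHD is computed in its common form, where a reversed edge counts as a single error.

```python
import numpy as np


def binarize_attention(attn, threshold=0.5):
    """Turn per-head attention weights into a binary adjacency matrix.

    attn: array of shape (heads, n_slots, n_slots); attn[h, i, j] is assumed
    to be the influence of slot i on slot j (an assumption of this sketch).
    Returns adj with adj[i, j] == 1 meaning a predicted edge i -> j.
    """
    mean_attn = attn.mean(axis=0)              # hypothetical rule: average over heads
    adj = (mean_attn > threshold).astype(int)  # hypothetical fixed threshold
    np.fill_diagonal(adj, 0)                   # ignore self-loops
    return adj


def structural_hamming_distance(pred, true):
    """Count edge additions, deletions, and reversals between two graphs.

    A reversed edge produces mismatches at both (i, j) and (j, i); symmetrizing
    and clipping makes each unordered pair contribute at most one error.
    """
    diff = np.abs(np.asarray(pred) - np.asarray(true))
    diff = diff + diff.T
    diff[diff > 1] = 1
    return int(np.triu(diff).sum())


# Toy example: ground truth has edges 0 -> 1 and 1 -> 2.
true_graph = np.array([[0, 1, 0],
                       [0, 0, 1],
                       [0, 0, 0]])

# Two attention heads over three slots (made-up numbers).
attn = np.array([
    [[0.9, 0.8, 0.7], [0.1, 0.2, 0.1], [0.0, 0.9, 0.3]],
    [[0.9, 0.6, 0.5], [0.1, 0.2, 0.3], [0.2, 0.7, 0.1]],
])

pred_graph = binarize_attention(attn, threshold=0.5)
print(structural_hamming_distance(pred_graph, true_graph))
```

In this toy case the predicted graph contains one spurious edge (0 -> 2) and one reversed edge (2 -> 1 instead of 1 -> 2). Raising the threshold to 0.65 removes the spurious edge, which illustrates why an unspecified threshold leaves the reported SHD underdetermined.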