Object-Centric World Models
- Object-centric world models are approaches that decompose environments into individual objects with attributes and dynamics, enabling interpretable and efficient control.
- They employ methods such as slot attention, keypoint grouping, and graph-based dynamics to explicitly model interactions among objects.
- Their application enhances performance in robotic manipulation, video prediction, and planning by boosting sample efficiency and generalization.
Object-centric world models (OCWMs) are a class of world models in model-based reinforcement learning, generative modeling, and perception that represent environments as sets of entities (objects) and explicitly model their attributes, dynamics, interactions, and relationships. In contrast to holistic or pixel-based world models, OCWMs decompose the latent state into object-level factors, which improves data efficiency, compositional generalization, and interpretability, particularly in visually complex, multi-object, or manipulation-centric environments. This paradigm encompasses a spectrum of techniques, including discrete slot-based factorization, keypoint/pixel grouping, latent object particles, dynamic relational graphs, and modular dynamics models. OCWMs can be trained in both supervised and self-supervised regimes and have demonstrated robust advantages across prediction, control, planning, and reasoning tasks.
1. Foundations and Motivation
Early approaches to world modeling in robotics and artificial agent domains relied on static scene descriptions or monolithic latent states, which often struggled with generalization, credit assignment, or sample efficiency in environments with multiple, interacting entities. Object-centric modeling posits that the environment can be factorized into entities (objects), each with its own state (e.g., pose, appearance, class label) and independent or interaction-driven transitions (Wong et al., 2015, Lin et al., 2020). This view aligns with cognitive and neuroscientific accounts of perception and facilitates the modularization of perception, prediction, and policy learning.
The motivation for OCWMs arises from the limitations of pixel-level world models, which often devote capacity to large static backgrounds while failing to represent the small or dynamic elements that are decision-relevant, as noted in visually complex tasks like video games or robotic manipulation (Zhang et al., 27 Jan 2025, Ferraro et al., 2023). By structuring the latent state into object-wise representations, agents can focus model capacity on task-relevant entities and interactions, yielding benefits for sample efficiency and robustness.
2. Architectural Principles of OCWMs
OCWMs can be taxonomized along several architectural axes:
- Object Decomposition: Mapping observations to object-level latents using entity extractors such as Slot Attention (Collu et al., 2024, Jeong et al., 8 Mar 2025), keypoint/particle detectors (Daniel et al., 4 Mar 2026, Ferraro et al., 8 Nov 2025), or clustering/tracking approaches (Wong et al., 2015).
- Latent Representation: Slots may encode position, scale, depth, transparency, appearance, and potentially object identity (Ferraro et al., 8 Nov 2025, Daniel et al., 4 Mar 2026). Some models enforce disentanglement of spatial and appearance factors or use permutation-invariant sets.
- Dynamics Model: Object transitions are modeled independently or via relational modules (e.g., graph neural networks, Transformers) that capture pairwise or higher-order interactions (Collu et al., 2024, Vakhitov et al., 10 Jan 2026, Feng et al., 4 Nov 2025).
- Generative Objective: Typical loss terms include reconstruction of object-masked pixels, prediction of future latents, cross-entropy or KL regularization on object attributes, and (in stochastic cases) variational lower bounds.
- Interaction Modeling: OCWMs may encode explicit interaction graphs (factored, learned sparseness) (Feng et al., 4 Nov 2025, Nishimoto et al., 18 Nov 2025), employ relational message passing (Collu et al., 2024, Vakhitov et al., 10 Jan 2026), or leverage masked attention mechanisms for causal discovery (Nam et al., 11 Feb 2026).
- Intrinsic Rewards and Exploration: Some OCWMs deploy slot-level entropy bonuses or curiosity-driven exploration to encourage interaction with novel objects (Ferraro et al., 2023).
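To make the object-decomposition axis concrete, the following is a minimal, dependency-free sketch of slot-attention-style grouping: slots compete for input features via a softmax taken over slots, then each slot updates to the attention-weighted mean of the features it wins. This is purely illustrative; it omits the learned projections, GRU update, and layer normalization of actual Slot Attention modules, and the temperature `tau` is an assumption added here to sharpen the competition.

```python
import math

def softmax(xs, tau=0.1):
    """Softmax with temperature tau; a low tau sharpens slot competition."""
    m = max(xs)
    exps = [math.exp((x - m) / tau) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def slot_attention(inputs, slots, iters=3):
    """Toy slot attention: K slot vectors compete for N input features.

    Each iteration: (1) for every input feature, compute a softmax over slots,
    so slots compete for features; (2) each slot becomes the attention-weighted
    mean of the features assigned to it.
    """
    dim = len(inputs[0])
    for _ in range(iters):
        # Attention: for each input feature, a distribution over slots.
        attn = [softmax([sum(f[d] * s[d] for d in range(dim)) for s in slots])
                for f in inputs]
        new_slots = []
        for k in range(len(slots)):
            weights = [attn[n][k] for n in range(len(inputs))]
            norm = sum(weights) + 1e-8
            new_slots.append([sum(w * f[d] for w, f in zip(weights, inputs)) / norm
                              for d in range(dim)])
        slots = new_slots
    return slots
```

Run on two clusters of 2-D features with two initial slots and each slot converges to one cluster's mean, mimicking how slots bind to distinct objects.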
The table below summarizes representative OCWM model components:
| Model/Family | Slot Extraction | Dynamics | Object Interaction |
|---|---|---|---|
| FOCUS (Ferraro et al., 2023) | One-hot identity | RSSM | Masked slot decoder |
| ObjectZero (Vakhitov et al., 10 Jan 2026) | Slot Attention | GNN | Fully-connected |
| LPWM (Daniel et al., 4 Mar 2026) | Keypoint/Particle | Transformer | Latent action module |
| SSWM (Collu et al., 2024) | Slot Attention | Latent GNN | Message-passing |
| FIOC-WM (Feng et al., 4 Nov 2025) | Slot Att. + VAE | Graph+GRU | Sparse adjacencies |
| STICA (Nishimoto et al., 18 Nov 2025) | Slot Att., AE | Transformer-XL | Causal attention |
See the cited papers for concrete architectural diagrams, code, and parameterizations.
3. Training Objectives and Optimization
OCWMs are typically trained using reconstruction and predictive objectives on object-level or slot-level latents. Common components include:
- Object-level Reconstruction: Masked pixel reconstruction losses or decoder likelihoods, often spatial-broadcast or per-slot (Collu et al., 2024, Ferraro et al., 2023).
- Future Prediction: Conditional prediction of subsequent slot states, typically with L2, cross-entropy, or KL-divergence losses, sometimes augmented with temporal-consistency terms (Vakhitov et al., 10 Jan 2026, Ferraro et al., 8 Nov 2025).
- Interaction Losses: Additional penalties or objectives to enforce correct relational reasoning, such as cross-entropy over interaction graphs (Feng et al., 4 Nov 2025), or object-level masking as in C-JEPA (Nam et al., 11 Feb 2026), which enforces that an object's state must be inferred from other object slots.
- Stochastic Objectives: Full variational lower bounds (ELBOs) for stochastic video generation and action-conditioned rollouts, as in LPWM (Daniel et al., 4 Mar 2026) and G-SWM (Lin et al., 2020).
- Exploration Rewards: Slot entropy or particle-based intrinsic rewards for exploration in sparse-reward regimes (Ferraro et al., 2023).
- Regularization: Slot entropy penalties, cross-slot diversity, or temporal consistency constraints to encourage disentanglement and object permanence (Nishimoto et al., 18 Nov 2025, Ferraro et al., 8 Nov 2025).
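The first two objective terms above can be combined into a single scalar loss. Below is a minimal, purely illustrative sketch with no learned components: a mask-composited pixel reconstruction term plus an L2 latent-prediction term. The function name, signature, and weights are assumptions for illustration; real models replace these terms with decoder likelihoods, ELBOs, and the regularizers listed above.

```python
def ocwm_loss(pred_pixels, true_pixels, masks, pred_next_slots, true_next_slots,
              recon_weight=1.0, dyn_weight=1.0):
    """Toy two-term OCWM objective: masked reconstruction + latent prediction.

    pred_pixels[k][i]: pixel i decoded from slot k; masks[k][i]: slot k's soft
    assignment of pixel i; slot states are flat lists of floats.
    """
    # Masked reconstruction: composite per-slot decoders via their masks,
    # then take the mean squared error against the observed pixels.
    n_pix = len(true_pixels)
    recon = 0.0
    for i in range(n_pix):
        composite = sum(masks[k][i] * pred_pixels[k][i] for k in range(len(masks)))
        recon += (composite - true_pixels[i]) ** 2
    recon /= n_pix
    # Latent dynamics: L2 between predicted and encoded next-step slot states.
    dyn = 0.0
    for p, t in zip(pred_next_slots, true_next_slots):
        dyn += sum((a - b) ** 2 for a, b in zip(p, t)) / len(p)
    dyn /= len(pred_next_slots)
    return recon_weight * recon + dyn_weight * dyn
```

With perfect per-slot decoding and perfect next-step prediction, both terms vanish and the loss is zero.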
Optimization follows standard routines, with end-to-end backpropagation over slot encoders, dynamics models, and decoders; some frameworks pre-train slot extractors or keep vision backbones fixed for stability (Zhang et al., 27 Jan 2025, Vakhitov et al., 10 Jan 2026).
4. Explicit Modeling of Object Interactions
A core strength of OCWMs is their explicit encoding and prediction of object-object interactions. Techniques include:
- Relational Message Passing: Action-conditional GNNs or Transformers compute next-slot embeddings by aggregating messages across all slot pairs (Collu et al., 2024, Vakhitov et al., 10 Jan 2026). In SSWM (Collu et al., 2024), K iterative GNN rounds enable modeling of chains of collisions.
- Sparse Interaction Discovery: Explicit learning or induction of sparse interaction edges via variational objectives, codebooks, or conditional-independence tests (Feng et al., 4 Nov 2025). The world model uses these graphs to modulate the state-transition dynamics.
- Causal Attention: Causality-aware attention layers compute token-level cause-effect masks and integrate causality scores in policy/value networks (Nishimoto et al., 18 Nov 2025).
- Object-level Masking (Latent Interventions): C-JEPA (Nam et al., 11 Feb 2026) applies latent masking across object slots, forcing the model to reason counterfactually about dynamics and learn stable influence neighborhoods.
- Slot Entropy for Exploration: FOCUS (Ferraro et al., 2023) directly incentivizes entropy of object-slot states, promoting coverage and diverse object interactions during exploration.
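To make the relational message-passing idea concrete, here is a minimal toy dynamics step, not any specific paper's architecture: each slot aggregates the mean of its neighbours' states over an adjacency matrix (fixed here, learned in sparse-discovery methods), then applies a residual update conditioned on a global action vector. The learned message and update MLPs of real models are replaced by fixed linear operations, and the `0.1` message gain is an arbitrary illustrative constant.

```python
def relational_step(slots, action, adj, iters=1):
    """One toy GNN-style dynamics step over K object slots.

    slots: list of K state vectors; action: global action vector (same dim);
    adj[i][j] == 1 means slot j sends a message to slot i.
    """
    dim = len(slots[0])
    for _ in range(iters):
        new_slots = []
        for i, s in enumerate(slots):
            # Message: mean of neighbour slot states, gated by adjacency.
            msg = [0.0] * dim
            deg = sum(adj[i]) or 1
            for j, t in enumerate(slots):
                if adj[i][j]:
                    for d in range(dim):
                        msg[d] += t[d] / deg
            # Update: residual combination of self state, messages, and action.
            new_slots.append([s[d] + 0.1 * msg[d] + action[d] for d in range(dim)])
        slots = new_slots
    return slots
```

Running several iterations corresponds to the iterative GNN rounds used to propagate chains of interactions (e.g., multi-object collisions) through the slot set.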
This explicit modeling is empirically linked to gains in sample efficiency, transfer to out-of-distribution scenes, and reasoning about counterfactuals and causal interventions (Nam et al., 11 Feb 2026, Feng et al., 4 Nov 2025).
5. Empirical Performance and Benchmarks
OCWMs have been evaluated across a diverse array of simulated and real-world domains, including:
- Atari and Video Games: OC-STORM (Zhang et al., 27 Jan 2025) outperforms pixel-based world models in 18/26 Atari games, especially in those where small objects are critical.
- Robotic Manipulation: FOCUS (Ferraro et al., 2023), ObjectZero (Vakhitov et al., 10 Jan 2026), and LPWM (Daniel et al., 4 Mar 2026) demonstrate faster convergence, higher success rates, and sharper reconstructions on Robosuite, ManiSkill2, BAIR, and LanguageTable tasks than monolithic models (e.g., DreamerV2/V3).
- Planning and Visual Reasoning: C-JEPA (Nam et al., 11 Feb 2026) yields a ∼20 point improvement in counterfactual reasoning on CLEVRER compared to OC-JEPA, and an 8× speedup in planning over patch-based models.
- Generalization: Structured object representations enable compositional transfer (e.g., unseen color-shape combinations, as in DLPWM (Ferraro et al., 8 Nov 2025) and LPWM (Daniel et al., 4 Mar 2026)), and improved task completion under zero-shot block and task splits (Jeong et al., 8 Mar 2025).
- Partial Observability and Tracking: Structured World Belief (Singh et al., 2021) outperforms both unstructured particle models and deterministic object trackers under occlusion and uncertainty.
A sample of comparative empirical results is provided below:
| Domain/Task | Baseline Method | OCWM Method | Main Result |
|---|---|---|---|
| Robosuite Lift | DreamerV2 | FOCUS | 2× faster convergence |
| Atari 100k (avg.) | STORM | OC-STORM | 114% → 134% human-normalized score |
| CLEVRER counterfactual | OC-JEPA | C-JEPA | +20% absolute accuracy |
| Block-Lifting | DreamerV3, ROCA | ObjectZero | Matches/surpasses baselines |
| BAIR video pred. (FVD) | Diffusion models (large) | LPWM (compact) | Comparable FVD, smaller model |
| OOD gen. (Robosuite) | DreamerV3 | DLPWM | Similar SSIM/LPIPS, better recon |
6. Practical Deployment and Limitations
Despite their advantages, OCWMs present several implementation and practical limitations:
- Supervision and Slot Discovery: Many models require either known object masks (e.g., FOCUS (Ferraro et al., 2023)) or rely on pre-trained, frozen slot extractors (ObjectZero (Vakhitov et al., 10 Jan 2026), OC-STORM (Zhang et al., 27 Jan 2025)), making unsupervised discovery in realistic, cluttered scenes an ongoing challenge (Ferraro et al., 8 Nov 2025).
- Slot Binding and Drift: Consistency of slot-object assignments across time and interactions can drift, especially under object contact or occlusion, leading to unstable policy learning (Ferraro et al., 8 Nov 2025).
- Fixed Object Cardinality: The number of slots/particles is often fixed a priori or determined via heuristics, complicating deployment in environments where the object count varies dynamically (Ferraro et al., 2023, Collu et al., 2024).
- Scaling to Rich Visuals: Processing and dynamics modeling over large numbers of slots (e.g., O(K²) pairwise messages for GNNs over K slots) can become computationally burdensome (Vakhitov et al., 10 Jan 2026).
- Background and Context Modeling: Non-object background features may be poorly captured by mask or slot-based approaches, requiring hybrid architectures or auxiliary pixel-level VAEs (Zhang et al., 27 Jan 2025).
- Partial Observability: Belief-state tracking via particles has been shown to yield substantial gains but is more complex algorithmically (Singh et al., 2021).
Future directions aimed at ameliorating these limitations include end-to-end joint training of slot extractors, unsupervised or few-shot object discovery, dynamic slot allocation, and improved robustness to slot drift and object identity permutations (Daniel et al., 4 Mar 2026, Nam et al., 11 Feb 2026).
7. Extensions: Causality, Reasoning, and Multi-modal Conditioning
Recent OCWM research has extended the paradigm to:
- Causal Reasoning: Inducing a causal inductive bias via object-level masking/interventions (Nam et al., 11 Feb 2026), explicit causality-aware attention layers (Nishimoto et al., 18 Nov 2025), or graph-based attribute factorization (Feng et al., 4 Nov 2025).
- Counterfactual and Relational Question Answering: C-JEPA delivers marked improvements on visual question answering tasks demanding counterfactual and predictive reasoning (Nam et al., 11 Feb 2026).
- Language and Multi-modal Inputs: Incorporating language-guided planning and manipulation via cross-attention between slots and language embeddings, with models like LSlotFormer demonstrating superior sample efficiency and zero-shot skill transfer in visuo-linguo-motor domains (Jeong et al., 8 Mar 2025).
- Stochastic Multimodal Dynamics: LPWM and G-SWM implement fully stochastic dynamics via per-object latent actions or hierarchical latent transitions, enabling multi-modal, action- or goal-conditioned video prediction (Daniel et al., 4 Mar 2026, Lin et al., 2020).
- Hierarchical and Modular Policies: FIOC-WM leverages explicit interaction primitives and hierarchical policy layers over object-centric world models to boost generalization and efficiency in multi-object control (Feng et al., 4 Nov 2025).
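A minimal sketch of the object-level masking idea, in the spirit of latent interventions rather than any paper's actual architecture: mask one slot and score how well a predictor recovers it from the remaining slots. The helper `mean_predictor` is a hypothetical stand-in for a learned predictor, used only to make the probe runnable.

```python
def masked_slot_error(slots, predict_fn, mask_idx):
    """Mask slot `mask_idx` and measure how well `predict_fn` recovers it
    from the remaining slots (mean squared error). Low error indicates the
    masked object's state is predictable from the other objects' states."""
    context = [s for i, s in enumerate(slots) if i != mask_idx]
    pred = predict_fn(context)
    target = slots[mask_idx]
    return sum((a - b) ** 2 for a, b in zip(pred, target)) / len(target)

def mean_predictor(context):
    """Hypothetical stand-in for a learned predictor: the slot-wise mean."""
    dim = len(context[0])
    return [sum(s[d] for s in context) / len(context) for d in range(dim)]
```

Training the predictor to minimize this error over random mask choices is what forces the model to encode cross-object influence, the intuition behind object-level masking objectives.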
Such augmentations underline the OCWM paradigm's flexibility for supporting reasoning, planning, compositional policy learning, and adaptation to complex embodied AI, simulated physics, and interactive visual environments.
References
- FOCUS: (Ferraro et al., 2023)
- ObjectZero: (Vakhitov et al., 10 Jan 2026)
- LPWM: (Daniel et al., 4 Mar 2026)
- SSWM: (Collu et al., 2024)
- OC-STORM: (Zhang et al., 27 Jan 2025)
- C-JEPA: (Nam et al., 11 Feb 2026)
- DLPWM: (Ferraro et al., 8 Nov 2025)
- STICA: (Nishimoto et al., 18 Nov 2025)
- FIOC-WM: (Feng et al., 4 Nov 2025)
- SlotFormer/LSlotFormer: (Jeong et al., 8 Mar 2025)
- G-SWM: (Lin et al., 2020)
- SWB (belief): (Singh et al., 2021)
- Dirichlet-process: (Wong et al., 2015)