
Object-Centric World Models

Updated 13 March 2026
  • Object-centric world models are approaches that decompose environments into individual objects with attributes and dynamics, enabling interpretable and efficient control.
  • They employ methods such as slot attention, keypoint grouping, and graph-based dynamics to explicitly model interactions among objects.
  • Their application enhances performance in robotic manipulation, video prediction, and planning by boosting sample efficiency and generalization.

Object-centric world models (OCWMs) are a class of world models in model-based reinforcement learning, generative modeling, and perception that represent environments as sets of entities (objects) and explicitly model their attributes, dynamics, interactions, and relationships. In contrast to holistic or pixel-based world models, OCWMs decompose the latent state into object-level factors, enabling data efficiency, compositional generalization, interpretable representations, and improved performance—especially in visually complex, multi-object, or manipulation-centric environments. This paradigm encompasses a spectrum of techniques, including discrete slot-based factorization, keypoint/pixel grouping, latent object particles, dynamic relational graphs, and modular dynamics models. OCWMs can be constructed in both supervised and self-supervised regimes and have demonstrated robust advantages across prediction, control, planning, and reasoning tasks.

1. Foundations and Motivation

Early approaches to world modeling in robotics and artificial agent domains relied on static scene descriptions or monolithic latent states, which often struggled with generalization, credit assignment, or sample efficiency in environments with multiple, interacting entities. Object-centric modeling posits that the environment can be factorized into entities (objects), each with its own state (e.g., pose, appearance, class label) and independent or interaction-driven transitions (Wong et al., 2015, Lin et al., 2020). This view aligns with cognitive and neuroscientific accounts of perception and facilitates the modularization of perception, prediction, and policy learning.
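The factorization described here can be made concrete with a toy sketch: each object carries its own state, and the world transition combines an independent per-object term with a pairwise interaction term. The class names and the 1-D repulsion dynamics below are purely illustrative assumptions, not drawn from any cited model.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ObjectState:
    pos: float   # 1-D pose for simplicity
    vel: float

def step(objects: List[ObjectState], dt: float = 0.1) -> List[ObjectState]:
    """Factorized transition: integrate each object's own velocity
    (independent dynamics) and add a pairwise repulsion term
    (interaction-driven dynamics)."""
    nxt = []
    for i, obj in enumerate(objects):
        acc = 0.0
        for j, other in enumerate(objects):
            if i != j:
                d = obj.pos - other.pos
                acc += 0.01 * d / (abs(d) ** 3 + 1e-6)  # toy repulsion
        nxt.append(ObjectState(pos=obj.pos + dt * obj.vel,
                               vel=obj.vel + dt * acc))
    return nxt

nxt = step([ObjectState(pos=0.0, vel=1.0), ObjectState(pos=1.0, vel=0.0)])
```

Because the transition decomposes per object, the same `step` applies unchanged to scenes with any number of entities, which is the compositional property OCWMs exploit.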

The motivation for OCWMs arises from the limitations of pixel-level world models, which often fail to represent small or dynamic decision-relevant elements and overfit large static backgrounds, as noted in visually complex tasks like video games or robotic manipulation (Zhang et al., 27 Jan 2025, Ferraro et al., 2023). By structuring the latent state into object-wise representations, agents can focus model capacity on task-relevant entities and interactions, yielding benefits for sample efficiency and robustness.

2. Architectural Principles of OCWMs

OCWMs can be taxonomized along several architectural axes: how object slots are extracted from observations, how per-object dynamics are modeled, and how object-object interactions are represented.

The table below summarizes representative OCWM model components:

Model/Family | Slot Extraction | Dynamics | Object Interaction
FOCUS (Ferraro et al., 2023) | One-hot identity | RSSM | Masked slot decoder
ObjectZero (Vakhitov et al., 10 Jan 2026) | Slot Attention | GNN | Fully-connected
LPWM (Daniel et al., 4 Mar 2026) | Keypoint/Particle | Transformer | Latent action module
SSWM (Collu et al., 2024) | Slot Attention | Latent GNN | Message-passing
FIOC-WM (Feng et al., 4 Nov 2025) | Slot Attention + VAE | Graph + GRU | Sparse adjacencies
STICA (Nishimoto et al., 18 Nov 2025) | Slot Attention, AE | Transformer-XL | Causal attention

See the cited papers for concrete architectural diagrams, code, and parameterizations.
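As a minimal illustration of the pipeline the table summarizes (slot extraction → interaction → dynamics), the numpy sketch below uses fixed random linear maps as stand-ins for the learned modules (Slot Attention, GNNs, Transformers) in the cited papers; all shapes and function names are assumptions for exposition.

```python
import numpy as np

rng = np.random.default_rng(0)
K, D = 4, 8          # number of slots, slot dimensionality

def extract_slots(image: np.ndarray) -> np.ndarray:
    """Stand-in for a slot extractor: project the image into K slot vectors."""
    W = rng.standard_normal((image.size, K * D)) / image.size
    return (image.reshape(-1) @ W).reshape(K, D)

def interact(slots: np.ndarray) -> np.ndarray:
    """Fully-connected interaction: each slot mixes in the mean of the others."""
    total = slots.sum(axis=0, keepdims=True)
    context = (total - slots) / (K - 1)      # mean over the other K-1 slots
    return slots + 0.1 * context

def predict_next(slots: np.ndarray, action: np.ndarray) -> np.ndarray:
    """Action-conditional dynamics applied slot-wise."""
    W = rng.standard_normal((D + action.size, D)) / (D + action.size)
    inp = np.concatenate([slots, np.tile(action, (K, 1))], axis=1)
    return inp @ W

slots = extract_slots(rng.standard_normal((16, 16)))
next_slots = predict_next(interact(slots), action=np.ones(2))
```

The key structural point is that both `interact` and `predict_next` operate on a set of K slot vectors rather than a single monolithic latent, so capacity is allocated per object.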

3. Training Objectives and Optimization

OCWMs are typically trained using reconstruction and predictive objectives on object-level or slot-level latents: decoded slots are matched against observations, and predicted latents are matched against encoded future states.

Optimization follows standard routines, with end-to-end backpropagation over slot encoders, dynamics models, and decoders; some frameworks pre-train slot extractors or keep vision backbones fixed for stability (Zhang et al., 27 Jan 2025, Vakhitov et al., 10 Jan 2026).
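A hedged sketch of that setup follows, with a frozen random projection standing in for the pre-trained vision backbone and a hand-derived gradient step training a linear, slot-wise dynamics model on a latent prediction loss; the toy "next observation" and all dimensions are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
K, D = 4, 8
frozen_encoder = rng.standard_normal((32, K * D)) / 32   # vision backbone kept fixed

def encode(obs: np.ndarray) -> np.ndarray:
    """Frozen slot extractor: observations -> K x D slot latents."""
    return (obs @ frozen_encoder).reshape(K, D)

W = np.zeros((D, D))          # trainable slot-wise linear dynamics
lr = 0.5
losses = []
for _ in range(300):
    obs = rng.standard_normal(32)
    s_t, s_next = encode(obs), encode(0.9 * obs)   # toy contracting dynamics
    err = s_t @ W - s_next                         # latent prediction error
    losses.append(float((err ** 2).mean()))
    W -= lr * s_t.T @ err / K   # gradient of 0.5*||err||^2 w.r.t. W, scaled by 1/K
```

Only the dynamics parameters `W` receive updates; the encoder stays fixed throughout, mirroring the frozen-backbone regime described above.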

4. Explicit Modeling of Object Interactions

A core strength of OCWMs is their explicit encoding and prediction of object-object interactions. Techniques include:

  • Relational Message Passing: Action-conditional GNNs or Transformers compute next-slot embeddings by aggregating messages across all slot pairs (Collu et al., 2024, Vakhitov et al., 10 Jan 2026). In SSWM (Collu et al., 2024), K iterative GNN rounds enable modeling of chains of collisions.
  • Sparse Interaction Discovery: Explicit learning or induction of sparse interaction edges using variational, codebook, or conditional dependence tests (Feng et al., 4 Nov 2025). The world model uses these graphs to modulate the state transition dynamics.
  • Causal Attention: Causality-aware attention layers compute token-level cause-effect masks and integrate causality scores in policy/value networks (Nishimoto et al., 18 Nov 2025).
  • Object-level Masking (Latent Interventions): C-JEPA (Nam et al., 11 Feb 2026) applies latent masking across object slots, forcing the model to reason counterfactually about dynamics and learn stable influence neighborhoods.
  • Slot Entropy for Exploration: FOCUS (Ferraro et al., 2023) directly incentivizes entropy of object-slot states, promoting coverage and diverse object interactions during exploration.
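As a toy analogue of the relational message passing in the first bullet, the sketch below runs R rounds in which each slot aggregates messages from every other slot, so effects can propagate along chains of R interactions; the message and update maps here are fixed random matrices rather than learned, action-conditional networks.

```python
import numpy as np

rng = np.random.default_rng(2)
K, D, R = 5, 6, 3
W_msg = rng.standard_normal((2 * D, D)) / np.sqrt(2 * D)
W_upd = rng.standard_normal((2 * D, D)) / np.sqrt(2 * D)

def message_passing_round(slots: np.ndarray) -> np.ndarray:
    """One relational round: every slot receives a message from each other slot."""
    msgs = np.zeros_like(slots)
    for i in range(K):
        for j in range(K):
            if i != j:                       # message over the (j -> i) pair edge
                pair = np.concatenate([slots[i], slots[j]])
                msgs[i] += np.tanh(pair @ W_msg)
    # update each slot from its own state plus the averaged incoming messages
    return np.tanh(np.concatenate([slots, msgs / (K - 1)], axis=1) @ W_upd)

slots = rng.standard_normal((K, D))
for _ in range(R):                           # R rounds -> chains of R interactions
    slots = message_passing_round(slots)
```

A single round only captures direct pairwise effects; stacking rounds, as in SSWM's iterative GNN, is what lets the model represent multi-step collision chains.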

This explicit modeling is empirically linked to gains in sample efficiency, transfer to out-of-distribution scenes, and reasoning about counterfactuals and causal interventions (Nam et al., 11 Feb 2026, Feng et al., 4 Nov 2025).

5. Empirical Performance and Benchmarks

OCWMs have been evaluated across a diverse array of simulated and real-world domains, including:

  • Atari and Video Games: OC-STORM (Zhang et al., 27 Jan 2025) outperforms pixel-based world models in 18/26 Atari games, especially in those where small objects are critical.
  • Robotic Manipulation: FOCUS (Ferraro et al., 2023), ObjectZero (Vakhitov et al., 10 Jan 2026), and LPWM (Daniel et al., 4 Mar 2026) demonstrate faster convergence, higher success rates, and sharper reconstructions on Robosuite, ManiSkill2, BAIR, and LanguageTable tasks than monolithic models (e.g., DreamerV2/V3).
  • Planning and Visual Reasoning: C-JEPA (Nam et al., 11 Feb 2026) yields a ∼20 point improvement in counterfactual reasoning on CLEVRER compared to OC-JEPA, and an 8× speedup in planning over patch-based models.
  • Generalization: Structured object representations enable compositional transfer (e.g., unseen color-shape combinations, as in DLPWM (Ferraro et al., 8 Nov 2025) and LPWM (Daniel et al., 4 Mar 2026)), and improved task completion under zero-shot block and task splits (Jeong et al., 8 Mar 2025).
  • Partial Observability and Tracking: Structured World Belief (Singh et al., 2021) outperforms both unstructured particle models and deterministic object trackers under occlusion and uncertainty.

A sample of comparative empirical results is provided below:

Domain/Task | Baseline Method | OCWM Method | Main Result
Robosuite Lift | DreamerV2 | FOCUS | 2× faster convergence
Atari 100k (avg.) | STORM | OC-STORM | 114% → 134% human-normalized score
CLEVRER counterfactual | OC-JEPA | C-JEPA | +20% absolute accuracy
Block-Lifting | DreamerV3, ROCA | ObjectZero | Matches/surpasses baselines
BAIR video pred. (FVD) | Diffusion models (large) | LPWM (compact) | Comparable FVD, smaller model
OOD gen. (Robosuite) | DreamerV3 | DLPWM | Similar SSIM/LPIPS, better reconstructions

6. Practical Deployment and Limitations

Despite their advantages, OCWMs present several implementation and practical limitations:

  • Supervision and Slot Discovery: Many models require either known object masks (e.g., FOCUS (Ferraro et al., 2023)) or rely on pre-trained, frozen slot extractors (ObjectZero (Vakhitov et al., 10 Jan 2026), OC-STORM (Zhang et al., 27 Jan 2025)), making unsupervised discovery in realistic, cluttered scenes an ongoing challenge (Ferraro et al., 8 Nov 2025).
  • Slot Binding and Drift: Consistency of slot-object assignments across time and interactions can drift, especially under object contact or occlusion, leading to unstable policy learning (Ferraro et al., 8 Nov 2025).
  • Fixed Object Cardinality: The number of slots/particles is often fixed a priori or determined via heuristics, complicating deployment in environments with dynamically varying object count (Ferraro et al., 2023, Collu et al., 2024).
  • Scaling to Rich Visuals: Processing and dynamics modeling over large numbers of slots can become computationally burdensome (e.g., the K² pairwise edges of fully-connected GNNs) (Vakhitov et al., 10 Jan 2026).
  • Background and Context Modeling: Non-object background features may be poorly captured by mask or slot-based approaches, requiring hybrid architectures or auxiliary pixel-level VAEs (Zhang et al., 27 Jan 2025).
  • Partial Observability: Belief-state tracking via particles has been shown to yield substantial gains but is more complex algorithmically (Singh et al., 2021).
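One common mitigation for the K² interaction cost noted above is restricting message passing to a sparse set of edges. The sketch below builds a hypothetical distance-based adjacency over slots and touches only those edges; the threshold rule and message function are illustrative assumptions, not the induction procedures of the cited works.

```python
import numpy as np

rng = np.random.default_rng(3)
K, D = 6, 4
slots = rng.standard_normal((K, D))
W = rng.standard_normal((2 * D, D)) / np.sqrt(2 * D)

# Hypothetical sparse adjacency: only slots closer than the median pairwise
# distance interact, so message passing visits |E| edges, not all K*(K-1).
dists = np.linalg.norm(slots[:, None, :] - slots[None, :, :], axis=-1)
adj = (dists < np.median(dists)) & ~np.eye(K, dtype=bool)

msgs = np.zeros_like(slots)
for i, j in zip(*np.nonzero(adj)):           # iterate sparse edges only
    msgs[i] += np.tanh(np.concatenate([slots[i], slots[j]]) @ W)
next_slots = slots + msgs
```

Learned variants replace the distance heuristic with variational edge inference or conditional-dependence tests, but the computational saving comes from the same sparse edge iteration.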

Future directions aimed at ameliorating these limitations include end-to-end joint training of slot extractors, unsupervised or few-shot object discovery, dynamic slot allocation, and improved robustness to slot drift and object identity permutations (Daniel et al., 4 Mar 2026, Nam et al., 11 Feb 2026).

7. Extensions: Causality, Reasoning, and Multi-modal Conditioning

Recent OCWM research has extended the paradigm to:

  • Causal Reasoning: Inducing a causal inductive bias via object-level masking/interventions (Nam et al., 11 Feb 2026), explicit causality-aware attention layers (Nishimoto et al., 18 Nov 2025), or graph-based attribute factorization (Feng et al., 4 Nov 2025).
  • Counterfactual and Relational Question Answering: C-JEPA delivers marked improvements on visual question answering tasks demanding counterfactual and predictive reasoning (Nam et al., 11 Feb 2026).
  • Language and Multi-modal Inputs: Incorporating language-guided planning and manipulation via cross-attention between slots and language embeddings, with models like LSlotFormer demonstrating superior sample efficiency and zero-shot skill transfer in visuo-linguo-motor domains (Jeong et al., 8 Mar 2025).
  • Stochastic Multimodal Dynamics: LPWM and G-SWM implement fully stochastic dynamics via per-object latent actions or hierarchical latent transitions, enabling multi-modal, action- or goal-conditioned video prediction (Daniel et al., 4 Mar 2026, Lin et al., 2020).
  • Hierarchical and Modular Policies: FIOC-WM leverages explicit interaction primitives and hierarchical policy layers over object-centric world models to boost generalization and efficiency in multi-object control (Feng et al., 4 Nov 2025).

Such augmentations underline the OCWM paradigm's flexibility for supporting reasoning, planning, compositional policy learning, and adaptation to complex embodied AI, simulated physics, and interactive visual environments.

