Object-Centric World Models
- Object-centric world models are structured models that decompose scenes into discrete object slots for enhanced compositional reasoning.
- They integrate object-centric encoders, dynamics modules, and decoders to effectively capture interactions and support predictive control.
- OCWMs boost reinforcement learning by providing low-dimensional, object-factorized state representations for efficient simulation and planning.
Object-centric world models (OCWMs) are structured generative or predictive models that factor environmental state into discrete entity representations—typically “object slots”—and explicitly model scene dynamics as transitions or interactions between those entities. Building on progress in unsupervised object segmentation, variational inference, and relational dynamics, OCWMs provide a compositional abstraction for reasoning, simulation, planning, and control from unstructured visual observations in both simulation and real-world domains.
1. Core Principles and Mathematical Foundations
OCWMs represent each scene as a set of object-centric latent variables. Formally, at each timestep $t$, the observation $o_t$ (e.g., an RGB image) is encoded to a set of slots $\{s_t^1, \dots, s_t^K\}$, where each slot $s_t^k$ parameterizes an individual entity (or object) with learned features such as position, appearance, and depth. The generative model factors the joint density over observation sequences and latent variables as:

$$p\big(o_{1:T}, s_{1:T}^{1:K}\big) \;=\; \prod_{t=1}^{T} \Big[\, p\big(o_t \mid s_t^{1:K}\big) \prod_{k=1}^{K} p\big(s_t^k \mid s_{t-1}^{1:K}, a_{t-1}\big) \Big]$$
Inference is performed by amortized variational posteriors $q_\phi\big(s_t^{1:K} \mid o_{\le t}\big)$, optimized via the evidence lower bound (ELBO) or maximum-likelihood objectives, often augmented by auxiliary predictive, reconstructive, or contrastive losses (Zadaianchuk et al., 2020, Janner et al., 2018, Collu et al., 8 Jan 2024, Ferraro et al., 8 Nov 2025).
Slot attention or related mechanisms are commonly employed to achieve permutation-invariant, instance-aware slot assignments. Dynamic models—graph neural networks, slot-wise recurrent modules, or transformers—propagate each object's latent state forward, with inter-object interactions (e.g., contact response, physics) modeled via message passing or explicit attention (Collu et al., 8 Jan 2024, Feng et al., 4 Nov 2025, Lin et al., 2020).
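The slot-competition mechanism behind Slot Attention can be sketched in a few lines. The version below is a minimal NumPy illustration with random, untrained projections standing in for learned q/k/v maps; the slot count, feature dimension, and iteration count are illustrative choices, not values from any cited model.

```python
import numpy as np

def slot_attention(inputs, num_slots=4, iters=3, seed=0):
    """Minimal slot-attention iteration: slots compete (softmax over the
    slot axis) for input features, then each slot takes a weighted mean."""
    n, dim = inputs.shape
    rng = np.random.default_rng(seed)
    slots = rng.normal(size=(num_slots, dim))
    # random projections stand in for learned query/key/value maps
    wq, wk, wv = (rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(3))
    k, v = inputs @ wk, inputs @ wv
    for _ in range(iters):
        logits = (k @ (slots @ wq).T) / np.sqrt(dim)       # (n, num_slots)
        attn = np.exp(logits - logits.max(axis=1, keepdims=True))
        attn /= attn.sum(axis=1, keepdims=True)            # softmax over slots
        weights = attn / (attn.sum(axis=0, keepdims=True) + 1e-8)
        slots = weights.T @ v                              # (num_slots, dim)
    return slots, attn

feats = np.random.default_rng(1).normal(size=(64, 16))     # e.g. CNN features
slots, attn = slot_attention(feats)
```

The softmax over the *slot* axis (rather than the input axis, as in standard attention) is what forces slots to compete for features, yielding permutation-invariant, instance-aware assignments.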
2. Model Architectures and Learning Paradigms
OCWM architectures share three fundamental modules:
- Object-centric encoder: Maps $o_t$ to slots $\{s_t^k\}$ using Slot Attention (Collu et al., 8 Jan 2024, Elsayed et al., 2022, Jeong et al., 8 Mar 2025), channel-wise feature maps (Ramakrishnan et al., 2023), or unsupervised segmentation (Janner et al., 2018, Wong et al., 2015). Each slot encodes object-level state, e.g., $z^{\text{what}}$ (appearance), $z^{\text{where}}$ (spatial pose), $z^{\text{pres}}$ (presence), $z^{\text{depth}}$ (occlusion).
- Object dynamics module: Predicts $s_{t+1}^k$ given $s_t^{1:K}$ and action $a_t$, often via a latent GNN:

$$\hat{s}_{t+1}^{k} \;=\; f_\theta\Big(s_t^k,\; \textstyle\sum_{j \neq k} g_\phi\big(s_t^k, s_t^j\big),\; a_t\Big),$$

or via slot-wise or relational transformers (Collu et al., 8 Jan 2024, Ferraro et al., 8 Nov 2025, Feng et al., 4 Nov 2025).
- Decoder (renderer): Maps object slots back to pixel space, using spatial transformers, slot-wise alpha compositing, or direct pixel regression (Zadaianchuk et al., 2020, Janner et al., 2018, Lin et al., 2020).
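A single slot-wise GNN transition can be sketched as pairwise message passing followed by a residual, action-conditioned update. The weights below are random and untrained, and the concatenation layout and dimensions are illustrative assumptions rather than any specific paper's parameterization.

```python
import numpy as np

def gnn_dynamics_step(slots, action, w_msg, w_upd):
    """One latent transition: aggregate pairwise messages per slot,
    then apply a residual update conditioned on the action."""
    k, d = slots.shape
    msgs = np.zeros((k, d))
    for i in range(k):
        for j in range(k):
            if i != j:   # message from slot j to slot i
                msgs[i] += np.tanh(np.concatenate([slots[i], slots[j]]) @ w_msg)
    inp = np.concatenate([slots, msgs, np.tile(action, (k, 1))], axis=1)
    return slots + np.tanh(inp @ w_upd)      # residual slot update

rng = np.random.default_rng(0)
d, a_dim = 8, 2
w_msg = rng.normal(size=(2 * d, d)) * 0.1
w_upd = rng.normal(size=(2 * d + a_dim, d)) * 0.1
next_slots = gnn_dynamics_step(rng.normal(size=(5, d)), np.ones(a_dim), w_msg, w_upd)
```

Because the message function is shared across all slot pairs, the same learned dynamics apply regardless of how many objects the scene contains, which is what supports compositional generalization over object count.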
Training can be unsupervised/object-agnostic, leveraging reconstruction, temporal prediction, and optionally reward/completion objectives (Zadaianchuk et al., 2020, Ferraro et al., 8 Nov 2025, Collu et al., 8 Jan 2024). Some methods employ supervised segmentation masks or object annotations for slot initialization in complex domains (Zhang et al., 27 Jan 2025, Ferraro et al., 2023).
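The training signals above can be combined into a simple composite objective. The sketch below uses plain MSE terms and a weighting `beta` as illustrative stand-ins for the ELBO, predictive, or contrastive objectives used in the cited works.

```python
import numpy as np

def ocwm_loss(obs, recon, slots_next, slots_pred, beta=1.0):
    """Pixel reconstruction term (decoder) plus latent prediction term
    (dynamics). Real models typically add KL or contrastive terms."""
    recon_loss = ((obs - recon) ** 2).mean()
    pred_loss = ((slots_next - slots_pred) ** 2).mean()
    return recon_loss + beta * pred_loss

rng = np.random.default_rng(0)
obs, recon = rng.normal(size=(32, 32, 3)), rng.normal(size=(32, 32, 3))
s_next, s_pred = rng.normal(size=(4, 16)), rng.normal(size=(4, 16))
loss = ocwm_loss(obs, recon, s_next, s_pred)
```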
3. Object Interactions and Compositional Dynamics
A defining feature of OCWMs is their explicit modeling of interactions:
- Physical and relational reasoning: Models like O2P2 and G-SWM employ object–object pairwise networks to propagate forces and contact across collections of objects (Janner et al., 2018, Lin et al., 2020). Graph-based latent dynamics allow force transmission and support long-horizon prediction in manipulation and physics reasoning.
- Modularity and dynamic graphs: Instantiations such as the Factored Interactive Object-Centric World Model (FIOC-WM) dynamically infer an interaction graph at each step, controlling message passing among slots and enabling learnable, environment-specific relational structure (Feng et al., 4 Nov 2025).
- Handling occlusion and object permanence: Sequential Monte Carlo OCWMs and depth-augmented slot models achieve robust tracking of objects through long occlusions and re-emergence events (Singh et al., 2021, Elsayed et al., 2022).
These modular dynamics enable compositional generalization—to novel object combinations, changed object cardinality, and transferred tasks—outperforming monolithic, pixel-based models in multi-object domains (Zadaianchuk et al., 2020, Nishimoto et al., 18 Nov 2025, Ferraro et al., 8 Nov 2025, Collu et al., 8 Jan 2024).
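Dynamic interaction-graph inference of the kind FIOC-WM performs can be sketched as scoring every ordered slot pair and thresholding, so that messages flow only along inferred edges. The linear scoring function and fixed threshold here are illustrative assumptions, not the paper's exact parameterization.

```python
import numpy as np

def infer_interaction_graph(slots, w_edge, thresh=0.5):
    """Score every ordered slot pair; keep edges whose sigmoid score
    exceeds the threshold. Message passing then uses only kept edges."""
    k, d = slots.shape
    scores = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            if i != j:
                scores[i, j] = np.concatenate([slots[i], slots[j]]) @ w_edge
    adj = 1.0 / (1.0 + np.exp(-scores)) > thresh
    np.fill_diagonal(adj, False)          # no self-interaction edges
    return adj                            # boolean (k, k) adjacency

rng = np.random.default_rng(0)
adj = infer_interaction_graph(rng.normal(size=(4, 8)), rng.normal(size=16))
```

Sparsifying the interaction graph this way both cuts the quadratic message-passing cost and biases the model toward environment-specific relational structure (e.g., only objects in contact exchange messages).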
4. Integration with Reinforcement Learning and Policy Learning
OCWMs facilitate model-based RL by providing policies with structured, low-dimensional, object-factorized state representations:
- Goal-conditioned control: Models such as SMORL encode both current and goal states in object slot space, enabling attention-based or transformer-based policies that act directly on objects, bypassing the binding problem of vectorized VAEs (Zadaianchuk et al., 2020, Nishimoto et al., 18 Nov 2025, Ferraro et al., 2023).
- Imagination and planning: Agents can perform multi-step imagination rollouts in slot space, optimizing policies via actor–critic or model-predictive control on trajectories imagined by the object-centric world model (Ferraro et al., 8 Nov 2025, Zhang et al., 27 Jan 2025, Ferraro et al., 2023).
- Hierarchical and causality-aware RL: Explicit object-interaction primitives or learned causality graphs (as in STICA and FIOC-WM) support hierarchical decomposition of multi-object tasks, improving sample efficiency and policy transfer (Nishimoto et al., 18 Nov 2025, Feng et al., 4 Nov 2025).
- Exploration: Object-space intrinsic exploration rewards, such as per-slot entropy bonuses or novelty metrics, drive agents toward diverse interaction and manipulation skills (Ferraro et al., 2023).
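Imagination-based planning in slot space reduces to iterating the learned transition model and scoring the imagined states, with no rendering back to pixels. The sketch below uses toy stand-ins for the learned dynamics and reward models; the function names and shapes are illustrative.

```python
import numpy as np

def imagine_rollout(slots, actions, dynamics, reward_fn):
    """Roll the world model forward in slot space and accumulate
    predicted reward; candidate action sequences can then be ranked."""
    traj, ret = [slots], 0.0
    for a in actions:
        slots = dynamics(slots, a)
        traj.append(slots)
        ret += reward_fn(slots)
    return traj, ret

# toy stand-ins: drift dynamics, distance-to-origin reward
toy_dynamics = lambda s, a: s + 0.1 * a
toy_reward = lambda s: -float(np.abs(s).sum())
traj, ret = imagine_rollout(np.ones((3, 4)), [np.full((3, 4), -1.0)] * 5,
                            toy_dynamics, toy_reward)
```

In actor–critic variants the imagined slot trajectory is differentiated through to update the policy; in model-predictive control, many candidate action sequences are rolled out and the best-scoring one's first action is executed.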
Empirical benchmarks demonstrate that OCWM-based RL agents outperform pixel-based and flat-latent approaches, especially in multi-object, compositional, and interaction-heavy tasks, including real-robot control and complex games (Zhang et al., 27 Jan 2025, Feng et al., 4 Nov 2025, Ferraro et al., 2023).
5. Robustness, Generalization, and Limitations
OCWMs promise compositional and representational generalization, but key challenges remain:
- Generalization to OOD conjunctions: Slot-based models without explicit disambiguation rapidly lose factorization when test objects have novel attribute combinations (shape, color, etc.), as shown by sharp declines in predictive accuracy in out-of-distribution settings (Ramakrishnan et al., 2023).
- Slot stability and drift: During rapid multi-object interaction or physical contact, slot identity can drift ("slot bleeding"), corrupting imagined trajectories and undermining policy control (Ferraro et al., 8 Nov 2025). Online filtering or temporal regularization (such as exponential moving averages of slot latents) can attenuate high-frequency drift but have not fully solved this instability.
- Handling variable object cardinality: Most slot models use a fixed maximum slot count, with explicit slot presence variables or matching networks required for dynamic object sets (Zadaianchuk et al., 2020, Singh et al., 2021).
- Role of supervision and foundation models: Approaches such as OC-STORM demonstrate the use of pretrained segmentation (e.g., Cutie, SAM) and vector-level object features to enhance slot formation and representation in visually complex domains, but at the cost of requiring some supervised mask annotation (Zhang et al., 27 Jan 2025).
- Occlusion handling and long-term tracking: Depth-aware slots (e.g., SAVi++) and SMC-based approaches improve object permanence and re-identification across long occlusions, but robust slot identity through severe occlusion or out-of-frame events is still an open problem (Elsayed et al., 2022, Singh et al., 2021).
- Scaling to real-world data: SAVi++ and related depth-supervised slot models enable segmentation and tracking in real-world multi-object videos (e.g., Waymo), but rely on auxiliary sensors; unsupervised generalization in cluttered natural videos remains unsolved (Elsayed et al., 2022).
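The exponential-moving-average smoothing of slot latents mentioned above as a partial remedy for slot drift can be sketched as follows; the smoothing factor is an illustrative choice.

```python
import numpy as np

def ema_smooth_slots(slot_seq, alpha=0.9):
    """Temporally smooth per-slot latents to damp high-frequency identity
    drift; alpha controls how much history is retained per step."""
    smoothed = [slot_seq[0]]
    for s in slot_seq[1:]:
        smoothed.append(alpha * smoothed[-1] + (1.0 - alpha) * s)
    return smoothed

rng = np.random.default_rng(0)
noisy = [np.ones((4, 8)) + 0.5 * rng.normal(size=(4, 8)) for _ in range(20)]
smooth = ema_smooth_slots(noisy)
```

This attenuates high-frequency jitter in slot identity, but as noted above it does not resolve the underlying binding instability during rapid multi-object contact.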
6. Practical Applications and Empirical Performance
OCWMs have demonstrated effectiveness across synthetic and real-world domains:
- Physical prediction and planning: O2P2 and G-SWM predict equilibrium and dynamic outcomes for block towers and interactive agent-based physics, supporting planning that generalizes beyond training configurations (Janner et al., 2018, Lin et al., 2020).
- Robot manipulation with language or vision guidance: Language-conditioned OCWMs efficiently translate text instructions into object-centric "slot rollouts" for manipulation tasks, outperforming diffusion-based pixel models in both efficiency and success rate on unseen tasks (Jeong et al., 8 Mar 2025).
- Atari and complex games: Object-centric RL agents (OC-STORM) achieve higher sample efficiency and final performance than pixel-based MBRL pipelines on Atari-100k and visually rich action games by directing modeling capacity to decision-relevant objects (Zhang et al., 27 Jan 2025).
- Real-world vision and autonomous driving: SAVi++ achieves strong unsupervised object segmentation/tracking on moving-camera driving datasets; object-centric models have been deployed on robots with only a handful of human-drawn object masks (Elsayed et al., 2022, Ferraro et al., 2023).
- Partially observed and multi-hypothesis domains: SMC-based object belief models provide robust tracking, filtering, and RL under partial observability and long-term object absence, improving RL and planning performance versus standard structured VAEs (Singh et al., 2021).
Empirical tables in these studies consistently show that object-centric variants outperform monolithic baselines in sample efficiency, generalization, and compositional task performance.
7. Prospects and Open Research Directions
OCWMs enable structured, compositional abstraction in prediction and control, but several research frontiers remain:
- Slot discovery and open-world generalization: Dynamic, open-ended slot models that can adapt to arbitrary object cardinality and attribute novelty, possibly guided by instance-level attention or domain randomization (Zadaianchuk et al., 2020, Ramakrishnan et al., 2023).
- Slot stability and interaction-aware update: Regularization and architectural innovations for maintaining slot identity and disentangled transition under heavy interaction and occlusion (Ferraro et al., 8 Nov 2025, Feng et al., 4 Nov 2025).
- Unified handling of action, language, and observation spaces: Integration of language, vision, and action at the slot/object level to support generalist agents across visuo-linguo-motor domains (Jeong et al., 8 Mar 2025).
- Causal reasoning and compositional skill acquisition: Causality-aware slot attention, hierarchical policy layers, and factorized planning over learned primitives to accelerate skill composition and task transfer (Nishimoto et al., 18 Nov 2025, Feng et al., 4 Nov 2025).
- Foundation model integration: Leveraging large, frozen vision-language segmenters for robust object extraction, with potential for improved zero-shot transfer but increased need for cross-domain adaptation (Zhang et al., 27 Jan 2025).
- Scalability and real-world robustness: Scaling unsupervised occlusion-handling, slot synchronization, and variable-cardinality to natural datasets in robotics, driving, and embodied-AI settings (Elsayed et al., 2022).
A plausible implication is that OCWM development will increasingly blend advances from object-centric unsupervised learning, foundation vision models, and relational causal modeling to achieve human-like generalization and sample efficiency in multi-object interactive domains.