
FIOC-WM: Object-Centric World Model

Updated 13 March 2026
  • The framework factorizes scenes into object-centric latents and employs GNNs for explicit modeling of object-object interactions.
  • Empirical results demonstrate improved zero-shot generalization, reduced compounding error, and enhanced sample efficiency across various benchmarks.
  • FIOC-WM integrates reconstruction, predictive, and contrastive losses to support efficient model-based planning and compositional transfer in complex environments.

The Factored Interactive Object-Centric World Model (FIOC-WM) is a general framework for learning, inference, and planning in environments that can be decomposed into objects and their interactions. FIOC-WM factorizes scene representations into object-centric latents (“object slots”) and employs explicit models of object-object interactions, typically implemented as Graph Neural Networks (GNNs), to capture underlying environment dynamics in a structured, permutation-equivariant, and sample-efficient manner. FIOC-WM has demonstrated strong empirical performance across a spectrum of benchmarks including simulated physics, robotic manipulation, and abstract planning environments, consistently outperforming non-object-centric and baseline structured approaches in multi-step, generalization, and transfer tasks (Collu et al., 2024, Biza et al., 2022, GX-Chen et al., 2024, Feng et al., 4 Nov 2025, Feng et al., 2023).

1. Object-Centric Factorization and Scene Encoding

FIOC-WM represents the environment at each time step as a set of $K$ object slots or object-centric latents, $\mathbf{z}_t = \{z_t^1, \ldots, z_t^K\}$, with each slot intended to capture the appearance, position, and other attributes of a distinct object or scene component. These slots are inferred by an object-centric encoder. In visual domains, the encoder often employs Slot Attention (Collu et al., 2024, Feng et al., 2023):

  • Slot Attention is an iterative attention-based module that transforms per-pixel or per-patch features $X \in \mathbb{R}^{N \times D}$ into $K$ slot vectors $S \in \mathbb{R}^{K \times d}$ through multiple rounds of competitive attention, feature aggregation, and recurrent update.
  • General factored representations can also be constructed by passing object-wise glimpses (RGB+coord crops) through shared encoders—this enforces permutation equivariance at the input stage and supports non-visual domains (Biza et al., 2022).
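The competitive-attention loop at the heart of Slot Attention can be sketched as follows. This is a minimal NumPy illustration with random projection matrices; real implementations learn the projections and use LayerNorm and a GRU update rather than the plain residual update shown here:

```python
import numpy as np

def slot_attention(X, K=4, d=16, iters=3, seed=0):
    """Minimal Slot Attention sketch: map (N, D) input features to (K, d) slots."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    Wk = rng.normal(size=(D, d)) / np.sqrt(D)   # key projection
    Wv = rng.normal(size=(D, d)) / np.sqrt(D)   # value projection
    Wq = rng.normal(size=(d, d)) / np.sqrt(d)   # query projection
    slots = rng.normal(size=(K, d))             # randomly initialized slots
    k, v = X @ Wk, X @ Wv
    for _ in range(iters):
        q = slots @ Wq
        logits = k @ q.T / np.sqrt(d)           # (N, K) attention logits
        # Softmax over the SLOT axis: slots compete for each input location.
        attn = np.exp(logits - logits.max(axis=1, keepdims=True))
        attn /= attn.sum(axis=1, keepdims=True)
        # Weighted-mean aggregation of values into each slot.
        w = attn / (attn.sum(axis=0, keepdims=True) + 1e-8)
        updates = w.T @ v                        # (K, d)
        slots = slots + updates                  # recurrent update (GRU in practice)
    return slots
```

The softmax over slots, rather than over inputs, is what makes the attention "competitive": each image patch must be explained by (mostly) one slot, which drives the decomposition into objects.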

In advanced variants, each object latent $z_t^i$ is decomposed into static (e.g., class, mass, color) and dynamic (e.g., position, velocity) components, with architectural and loss function choices (temporal constancy, slot-level contrastive losses) used to further disentangle factors of variation (Feng et al., 4 Nov 2025). Some formulations also introduce explicit attribute vectors and class inference mappings as part of the encoded slot state (Feng et al., 2023, GX-Chen et al., 2024).
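One simple way to realize the static/dynamic decomposition is to reserve a fixed prefix of each slot vector for static attributes and penalize its change over time. The split point and loss form below are illustrative assumptions, not details prescribed by the cited papers:

```python
import numpy as np

def split_slot(z, d_static):
    """Split a slot latent into static (class, mass, color) and dynamic
    (position, velocity) components. The split index d_static is an
    architectural choice assumed here for illustration."""
    return z[..., :d_static], z[..., d_static:]

def temporal_constancy_loss(z_seq, d_static):
    """Penalize change in the static component across a trajectory.
    z_seq: (T, K, d) slot latents over T time steps."""
    z_static, _ = split_slot(z_seq, d_static)
    return ((z_static[1:] - z_static[:-1]) ** 2).mean()
```

A loss of zero means the static component is perfectly constant over the trajectory; gradient pressure from this term pushes time-varying information into the dynamic component.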

2. Modeling Object Interactions via Graph Neural Networks

At the core of FIOC-WM is the use of GNN dynamics models to capture object-object interactions. The set of slots is interpreted as a fully connected (except self-edges) or sparsely inferred graph, and the GNN performs several rounds of message passing:

  • Latent message passing: For each GNN iteration, edge messages $e_{ij}$ between all slot pairs (i→j) are computed via edge networks (typically MLPs) taking as input the slot states (and optionally current slot updates), while node updates aggregate incoming messages and may integrate action information (Collu et al., 2024, Biza et al., 2022).
  • Residual connections: Each GNN layer is formulated in a residual form so that the model learns iterative refinement of slot dynamics across interaction steps (Biza et al., 2022).
  • Sparsity and structure discovery: Some FIOC-WM variants infer a dynamic, sparse interaction graph $G_t$ at each time, learning which pairs of objects interact at each timestep. Techniques used include variational masks with Gumbel-Softmax, learned codebooks, or conditional mutual information estimators (Feng et al., 4 Nov 2025, Feng et al., 2023).

Action input is injected either at specific layers or throughout the GNN, enabling efficient modeling of action-conditional object interactions. Permutation equivariance is maintained by sharing all GNN parameters across slots and by aggregating incoming messages symmetrically (Biza et al., 2022).
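One round of this message passing (shared edge and node networks, symmetric sum aggregation, residual slot update, action input) can be sketched in NumPy. The single tanh layers below stand in for the deeper MLPs used in practice; note that parameter sharing and sum aggregation make the update exactly permutation-equivariant:

```python
import numpy as np

def make_params(d, da, seed=0):
    """Shared edge and node network weights (one tanh layer each, a sketch)."""
    rng = np.random.default_rng(seed)
    return {
        "We": rng.normal(size=(2 * d, d)) * 0.1,       # edge network
        "Wn": rng.normal(size=(2 * d + da, d)) * 0.1,  # node-update network
    }

def gnn_step(slots, action, params):
    """One message-passing round over a fully connected graph (no self-edges).
    slots: (K, d) slot states; action: (da,) action vector."""
    K, d = slots.shape
    msgs = np.zeros((K, d))
    for i in range(K):
        for j in range(K):
            if i == j:
                continue  # no self-edges
            e_ij = np.tanh(np.concatenate([slots[i], slots[j]]) @ params["We"])
            msgs[j] += e_ij  # symmetric (sum) aggregation of incoming messages
    a = np.broadcast_to(action, (K, action.shape[0]))
    inp = np.concatenate([slots, msgs, a], axis=1)
    return slots + np.tanh(inp @ params["Wn"])  # residual slot update
```

Because every slot runs through the same edge and node networks and messages are summed, permuting the input slots permutes the output identically, which is the equivariance property described above.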

3. Learning Objectives and Training Procedures

FIOC-WMs are trained with objectives tailored to their modular decomposition:

  • Reconstruction loss: Object-centric encoders/decoders (e.g., Slot Attention) are often pretrained with reconstruction losses to ensure good slot representations (Collu et al., 2024, Feng et al., 2023).
  • Predictive loss: GNN transition models are typically trained to minimize object-wise L₂ loss between predicted and ground-truth slots over one or multiple steps (Collu et al., 2024).
  • Contrastive loss: Many approaches use a contrastive hinge loss, where the prediction is forced to stay close to the true future state in slot space and far from random (negative) samples, facilitating compact but discriminative representations (Biza et al., 2022).
  • Probabilistic modeling: Some implementations can be cast as latent state-space models with a natural ELBO loss over sequences. For instance:

$$\mathcal{L}_{\mathrm{ELBO}} = \sum_{t=1}^{T} \left[ \mathbb{E}_{q(z_t \mid x_t)}\!\left[\log p(x_t \mid z_t)\right] - \mathrm{KL}\!\left(q(z_t \mid x_t) \,\|\, p(z_t \mid z_{t-1}, a_{t-1})\right) \right]$$

(Collu et al., 2024).

  • Graph-based regularization: Sparsity penalties, KL divergences over edge distributions, and specialized auxiliary losses (e.g., slot-contrastive, static consistency) are employed in advanced FIOC-WMs (Feng et al., 4 Nov 2025, Feng et al., 2023).

Optimization is performed using Adam, typically with large batch sizes and for several hundred epochs (Collu et al., 2024).

4. Planning, Exploration, and Control

FIOC-WM supports model-based planning, exploration, and policy learning:

  • Rollout and imagination: The trained model can be unrolled over multiple steps for action-conditional predictions, enabling model-based planning in the latent slot space (with optional decoding to pixels for visualization or auxiliary loss) (Collu et al., 2024).
  • Closed-loop planning: FIOC-WM ensembles are used for multistep open-loop or closed-loop planning over large candidate action spaces, scoring actions based on expected distance to the goal state in latent space, and selecting actions that maximize the likelihood of desired object arrangements. Fast heuristic search yields efficient plans, even for sequential manipulation tasks with up to 12 actions (Biza et al., 2022).
  • Hierarchical control: FIOC-WM supports decomposition into high-level (task, interaction primitive selection) and low-level (execution of interaction sequences) policy layers. High-level policies select target interaction graphs or subgoals, and low-level controllers plan/control actions to realize them. Both MPC and PPO are used at different layers, and diversity bonuses prevent cycling (Feng et al., 4 Nov 2025).
  • Efficient exploration: Count-based intrinsic rewards on abstract (object,attribute)-state transitions, coupled with efficient discriminative model learning and MCTS, drive rapid exploration and coverage of possible state transitions. Goal-directed planning uses shortest path solvers over the learned transition graph (GX-Chen et al., 2024).
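The core idea of scoring candidate actions by expected latent distance to the goal can be illustrated with a minimal random-shooting planner. This is a toy sketch: the cited works use model ensembles and fast heuristic search rather than the brute-force sampling shown here, and `dynamics` stands for any learned one-step latent predictor:

```python
import numpy as np

def plan(z0, z_goal, dynamics, action_dim, horizon=5, n_cand=256, seed=0):
    """Random-shooting planner in latent space: sample candidate action
    sequences, roll each out through the dynamics model, and keep the
    sequence whose final latent lies closest to the goal."""
    rng = np.random.default_rng(seed)
    cands = rng.uniform(-1.0, 1.0, size=(n_cand, horizon, action_dim))
    best_seq, best_cost = None, np.inf
    for seq in cands:
        z = z0
        for a in seq:
            z = dynamics(z, a)                     # one-step latent prediction
        cost = float(np.linalg.norm(z - z_goal))   # distance to goal in latent space
        if cost < best_cost:
            best_seq, best_cost = seq, cost
    return best_seq, best_cost
```

With a trivial linear latent dynamics such as `lambda z, a: z + 0.1 * a`, the planner reliably finds action sequences that move the latent toward the goal; replacing `dynamics` with the slot-space GNN model yields the object-centric planning loop described above.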

5. Evaluation Benchmarks and Empirical Performance

FIOC-WM has been evaluated in a range of environments, with a focus on compositional generalization, zero-shot transfer, and long-range prediction:

| Benchmark | Domain | Metric(s) | Key Results |
|---|---|---|---|
| Spriteworld | Abstract 2D physics | H@1, MRR (slot correspondence) | FIOC-WM: H@1@10 = 68–88%; C-SWM: 19–24% |
| Block Stacking | Robotic simulation/real | RMSE (cm), Hits@1, plan success | FIOC-WM: RMSE 0.68–1.05 cm, up to 75% zero-shot |
| 2D Crafting/MiniHack | Abstract planning | Success rate, sample efficiency | 50–70% zero-shot, >80% compositional, 1–2 orders of magnitude more sample-efficient than baselines |
| i-Gibson/Libero | Embodied AI | LPIPS, policy success | FIOC-WM converges 2× faster, 0.8+ success rate |

(Collu et al., 2024, Biza et al., 2022, GX-Chen et al., 2024, Feng et al., 4 Nov 2025, Feng et al., 2023)

Empirically, FIOC-WM demonstrates:

  • Accurate single- and multi-step predictions with low compounding error.
  • Substantial improvement in combinatorial generalization: novel numbers, arrangements, and attributes of objects, as well as unseen interaction “skills.”
  • Strong zero-shot and few-shot transfer to new tasks in both symbolic and pixel domains.
  • Robust sample efficiency and generalization improvement over C-SWM, Dreamer-v3, PPO, and other strong baselines.

6. Extensions, Limitations, and Future Directions

  • Modularity and scalability: FIOC-WM’s modular design produces robust, scalable world models, but some variants currently require known object bounding boxes or good slot extractors as input (Biza et al., 2022, Feng et al., 2023).
  • Pairwise limitation: Current models largely operate with pairwise interactions; handling higher-order interactions remains an area for future development (Biza et al., 2022, Feng et al., 2023).
  • Slot discovery and parameterization: Integration of more powerful object discovery mechanisms (beyond Slot Attention, e.g., via unsupervised detection or attention modulation) is an open direction (Biza et al., 2022, Feng et al., 2023).
  • Attribute and class-factorization limitations: Some methods require pre-specified object classes or curated attribute sets, which limit open-world applicability (Feng et al., 2023, GX-Chen et al., 2024).
  • Generalization boundaries: While FIOC-WM achieves strong zero-shot and compositional transfer, performance may degrade for tasks involving non-local or continuous functional dependencies, highly entangled interactions, or where novel objects have fundamentally different affordances or dynamics (Biza et al., 2022, Feng et al., 4 Nov 2025).

Ongoing research explores richer relational dynamics, full end-to-end object extraction, transfer to physical robots, and integration with broader forms of symbolic reasoning.

7. Comparative Summary and Impact

FIOC-WM unifies object-centric perception and interaction-centric dynamics, leading to a structured, generalizable world-modeling paradigm. Compared to prior approaches, it achieves:

  • Consistent factorization of multi-object scenes,
  • Explicit modeling of object-object interactions with GNNs,
  • Permutation equivariance and transfer to new object configurations,
  • Strong empirical performance in zero-shot, few-shot, and compositional generalization benchmarks.

The approach has reshaped research into data-efficient reinforcement learning, structured planning, and model-based policy optimization by demonstrating the power of explicit object and interaction factorization (Collu et al., 2024, Biza et al., 2022, Feng et al., 4 Nov 2025, GX-Chen et al., 2024, Feng et al., 2023).
