SlotPi: A Physics-Informed Object-Centric Framework

Updated 11 July 2025
  • SlotPi is a physics-informed, object-centric reasoning framework that combines Hamiltonian mechanics with deep learning to model and forecast physical dynamics.
  • It uses an object-centric encoder along with two modules, a physics module that enforces conservative (Hamiltonian) dynamics and a spatio-temporal module that captures dissipative effects, to predict system states.
  • Empirical results show that SlotPi improves image fidelity and object-dynamics accuracy across synthetic datasets and real-world videos, outperforming prior object-centric models.

SlotPi is a physics-informed, object-centric reasoning framework designed for understanding and forecasting the dynamics of physical systems from visual observations. It explicitly integrates principles from Hamiltonian mechanics with learned deep spatio-temporal reasoning components, aiming to bridge the gap between traditional object-centric video models and the faithful simulation of real-world physics, particularly where both object interactions and complex fluid behaviors are present. The approach provides a composite architecture to jointly model conservative physical interactions and non-conservative, dissipative phenomena, and it demonstrates enhanced adaptability across both synthetic and real-world environments.

1. Architecture and Conceptual Foundations

SlotPi unifies classic physics principles and deep learning by combining an object-centric slot-based representation with physics-informed dynamic prediction. The framework is divided into two main modules:

  • Physical Module: Encodes and predicts object dynamics using Hamiltonian mechanics as an explicit inductive bias.
  • Spatio-temporal Reasoning Module: Captures non-conservative effects and complex interactions outside the strict Hamiltonian formalism via attention-based learning.

Given a sequence of video frames, an object-centric encoder such as SAVi or STATM-SAVi segments the scene into object-level “slots”: vector representations of spatially and semantically distinct entities. These slots serve as a mid-level abstraction that the physical and spatio-temporal modules process independently. Their predictions are then combined in a weighted sum to yield the forecast of the next system state.

2. Integration of Physics via Hamiltonian Principles

At the core of the physical module is an explicit Hamiltonian formalism implemented through neural attention blocks. At each time step $t$, the module computes from the slots $S_t$:

  • Generalized momentum $P_t$ via cross-attention over the current and historical slots: $P_t = \mathrm{CrossAtt}_P(S_t, \{S_0, \dots, S_{t-1}\})$
  • Generalized coordinates $Q_t$ via self-attention: $Q_t = \mathrm{SelfAtt}_Q(S_t)$
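The two attention calls in the list above can be illustrated with a minimal sketch. It assumes plain scaled dot-product attention with identity projections and toy two-dimensional slots; the helper names `softmax` and `attend`, and the choice to flatten all past slots into one key/value set, are illustrative assumptions rather than details confirmed by the source.

```python
import math

def softmax(xs):
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys_values):
    # Scaled dot-product attention with identity projections (a simplification:
    # a trained model would apply learned query/key/value maps first).
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys_values]
        w = softmax(scores)
        out.append([sum(wj * kv[i] for wj, kv in zip(w, keys_values))
                    for i in range(d)])
    return out

# Toy data: each frame holds K=2 slot vectors of dimension D=2.
history = [
    [[1.0, 0.0], [0.0, 1.0]],   # S_0
    [[0.9, 0.1], [0.1, 0.9]],   # S_1
]
current = [[0.8, 0.2], [0.2, 0.8]]  # S_t

# Assumption: the historical slots are flattened into one key/value set.
past_slots = [s for frame in history for s in frame]

P_t = attend(current, past_slots)  # cross-attention: queries from S_t, keys/values from the past
Q_t = attend(current, current)     # self-attention among the current slots
```

Both outputs keep one vector per slot, so $P_t$ and $Q_t$ can be paired slot by slot in the energy computation that follows.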

Subsequently, a self-attention block calculates per-slot energy contributions $SH_t$, which are passed through a linear layer and Softplus activation to ensure positivity:

$$SH_t = \mathrm{Linear}(\mathrm{SelfAtt}_H(Q_t, P_t))$$

The total system energy is the sum over all slots:

$$H_t = \sum SH_t$$
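As a rough illustration of the energy head, the sketch below applies a linear map followed by Softplus to each slot and sums the results. The feature vectors, weights, and bias are made-up stand-ins for the attention output and learned parameters, and `slot_energy` is a hypothetical helper name.

```python
import math

def softplus(x):
    # Numerically stable log(1 + exp(x)); positive for finite x
    # (up to floating-point underflow for very negative inputs).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def slot_energy(features, weights, bias):
    # Linear layer mapping one slot's feature vector to a scalar, then Softplus.
    pre_activation = sum(w * f for w, f in zip(weights, features)) + bias
    return softplus(pre_activation)

# Hypothetical per-slot features standing in for SelfAtt_H(Q_t, P_t).
slot_features = [[0.5, -1.0], [-2.0, 0.3], [1.5, 1.5]]
weights, bias = [1.0, -0.5], 0.1   # illustrative values, not learned

SH_t = [slot_energy(f, weights, bias) for f in slot_features]
H_t = sum(SH_t)   # total energy: sum of the per-slot contributions
```

Because every per-slot term is positive, the total $H_t$ is positive as well, whatever sign the linear layer's raw output takes.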

The temporal evolution is governed by the Hamiltonian equations:

$$\frac{dQ_t}{dt} = \frac{\partial H_t}{\partial P_t} \quad \text{and} \quad \frac{dP_t}{dt} = -\frac{\partial H_t}{\partial Q_t}$$

State updates are performed using Euler’s method with a fixed small time step $\Delta t$:

$$(Q_{t+1}, P_{t+1}) = (Q_t, P_t) + \Delta t \cdot \left( \frac{dQ_t}{dt}, \frac{dP_t}{dt} \right)$$
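To make the update rule concrete, here is a small sketch of Hamilton's equations integrated with explicit Euler steps. It swaps the learned energy for a toy harmonic-oscillator Hamiltonian and approximates the partial derivatives with central finite differences; a trained model would presumably use automatic differentiation, so the functions `hamiltonian`, `hamilton_rhs`, and `euler_step` are assumptions for illustration only.

```python
def hamiltonian(q, p):
    # Toy stand-in for the learned H_t: unit-mass, unit-stiffness oscillator.
    return 0.5 * p * p + 0.5 * q * q

def hamilton_rhs(H, q, p, eps=1e-5):
    # Central finite differences approximate the partial derivatives of H.
    dH_dq = (H(q + eps, p) - H(q - eps, p)) / (2 * eps)
    dH_dp = (H(q, p + eps) - H(q, p - eps)) / (2 * eps)
    return dH_dp, -dH_dq   # dq/dt = dH/dp, dp/dt = -dH/dq

def euler_step(H, q, p, dt):
    # Explicit Euler: advance both coordinates with derivatives at the current state.
    dq_dt, dp_dt = hamilton_rhs(H, q, p)
    return q + dt * dq_dt, p + dt * dp_dt

q, p, dt = 1.0, 0.0, 0.01
energy_start = hamiltonian(q, p)
for _ in range(100):
    q, p = euler_step(hamiltonian, q, p, dt)
energy_end = hamiltonian(q, p)
```

For this particular toy system, each explicit Euler step multiplies the energy by $1 + \Delta t^2$, so `energy_end` is slightly larger than `energy_start`. That drift is a property of the integrator rather than of the Hamiltonian, which is one reason a small $\Delta t$ matters.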

This method enables SlotPi to impose universal physical laws directly via learnable energy-based interactions among slots.

3. Spatio-temporal Reasoning Module

Real-world dynamics frequently involve energy dissipation and non-conservative processes not captured by pure Hamiltonian models. SlotPi addresses this by incorporating a secondary spatio-temporal reasoning module. This component features:

  • Temporal cross-attention: Each slot attends to its own past sequence, facilitating short- and long-term memory of temporal dynamics.
  • Spatial self-attention: Captures interactions among contemporaneous slots to model inter-object and object-fluid relations not easily captured by first-principles physics.
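The difference between the two attention axes can be sketched as follows. The example assumes identity projections, toy slot values, and a temporal-then-spatial ordering; none of these details are specified in the text above, and `attend` is a hypothetical helper.

```python
import math

def attend(query, memory):
    # Scaled dot-product attention of one query vector over a list of vectors.
    d = len(query)
    scores = [sum(a * b for a, b in zip(query, m)) / math.sqrt(d) for m in memory]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * m[i] for w, m in zip(weights, memory)) for i in range(d)]

# slots[t][k] is slot k at time t (T=3 frames, K=2 slots, D=2 features).
slots = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.9, 0.1], [0.1, 0.9]],
    [[0.8, 0.2], [0.2, 0.8]],
]
T, K = len(slots), len(slots[0])

# Temporal step: slot k at the last frame attends only to its own trajectory.
temporal = [attend(slots[-1][k], [slots[t][k] for t in range(T)])
            for k in range(K)]

# Spatial step: each temporally updated slot attends to all slots of that frame.
spatial = [attend(temporal[k], temporal) for k in range(K)]
```

The temporal step never mixes information between different objects, while the spatial step never looks at other time steps; stacking the two lets each slot see both its own history and its neighbours.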

The output of the reasoning module, denoted $\hat{S}_{T+1}$, is linearly blended with the Hamiltonian module’s prediction $\hat{Q}_{T+1}$ as:

$$\hat{S}_{T+1} = \lambda \cdot \hat{Q}_{T+1} + \hat{S}_{T+1}$$

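where $\lambda$ is a tunable parameter that scales the Hamiltonian pathway’s contribution relative to the reasoning module’s output; the $\hat{S}_{T+1}$ on the left-hand side denotes the final blended prediction. Ablation studies confirm the importance of both modules for robust performance.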
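A minimal sketch of the blending step, assuming both pathways produce slot vectors of the same shape. The function name `blend`, the toy values, and the choice of 0.1 for the weight are illustrative assumptions.

```python
def blend(q_hat, s_hat, lam):
    # Element-wise lam * q_hat + s_hat for every slot vector.
    return [[lam * qi + si for qi, si in zip(q_slot, s_slot)]
            for q_slot, s_slot in zip(q_hat, s_hat)]

q_hat = [[1.0, 2.0], [0.0, -1.0]]   # Hamiltonian-module prediction (toy values)
s_hat = [[0.5, 0.5], [1.0, 1.0]]    # reasoning-module output (toy values)

final = blend(q_hat, s_hat, lam=0.1)
```

Setting `lam` to zero reduces the prediction to the reasoning module alone, so under this reading the weight acts as a dial for how strongly the Hamiltonian pathway corrects the learned forecast.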

4. Generalization and Adaptability Across Physical Regimes

SlotPi’s modular design, with decoupled physical and learned reasoning components, affords substantial generalization. While the Hamiltonian module is tailored for conservative systems (e.g., rigid-body mechanics), the spatio-temporal module’s learning-based, attention-driven structure allows adaptation to phenomena such as fluid flow, viscosity, and other dissipative behaviors. Empirical results demonstrate SlotPi’s success across:

  • Synthetic object-centric datasets (e.g., OBJ3D, CLEVRER)
  • Fluid simulation datasets derived from downsampled numerical solutions of the Navier–Stokes equations
  • Real-world scene videos encompassing rigid bodies, fluids, and their interactions

A key outcome is robust prediction performance and accurate extrapolation to scenarios with combined fluids and objects—settings traditionally difficult for standard object-centric world models.

5. Empirical Evaluation and Benchmarking

SlotPi is extensively benchmarked, exhibiting improvements over prior object-centric prediction architectures such as STATM and SlotFormer. Quantitative evaluation is reported on both image quality and consistency measures:

| Metric | SlotPi Improvement | Context / Evaluated Domains |
| --- | --- | --- |
| PSNR, SSIM, LPIPS | Higher image fidelity | Video prediction (OBJ3D, CLEVRER) |
| FG-ARI, FG-mIoU, ARI | Better object-dynamics accuracy | CLEVRER, multi-object scenes |
| RMSE, MAE, High Correction Time (HCT) | Improved for fluid dynamics | Fluid datasets (Navier–Stokes) |
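For readers less familiar with the simpler metrics listed, the sketch below computes RMSE, MAE, and PSNR with their standard definitions on a flattened toy signal. The data values and the `max_val=1.0` intensity range are assumptions, and the evaluation code behind the reported results may differ in details such as per-frame averaging.

```python
import math

def mse(pred, target):
    # Mean squared error over paired values.
    return sum((a - b) ** 2 for a, b in zip(pred, target)) / len(pred)

def rmse(pred, target):
    return math.sqrt(mse(pred, target))

def mae(pred, target):
    # Mean absolute error over paired values.
    return sum(abs(a - b) for a, b in zip(pred, target)) / len(pred)

def psnr(pred, target, max_val=1.0):
    # Peak signal-to-noise ratio in dB; higher means closer to the target.
    err = mse(pred, target)
    return float("inf") if err == 0 else 10.0 * math.log10(max_val ** 2 / err)

target = [0.0, 0.5, 1.0, 0.5]   # toy ground-truth pixel intensities in [0, 1]
pred = [0.1, 0.5, 0.9, 0.5]     # toy prediction

scores = {"rmse": rmse(pred, target),
          "mae": mae(pred, target),
          "psnr": psnr(pred, target)}
```

Lower is better for RMSE and MAE, while higher is better for PSNR.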

In downstream tasks such as Visual Question Answering (VQA) on CLEVRER and Physion, SlotPi achieves higher predictive question accuracy scores than preceding object-centric models. Qualitative demonstrations include the correct simulation of subtle effects such as fluid-induced rotation and compound object-fluid interactions.

6. Real-world Dataset Construction

To empirically validate model generalization, SlotPi is tested on a custom real-world dataset incorporating both object and fluid dynamics. The dataset contains:

  • Diverse scenes: indoor containers, outdoor lakes, and varied liquid types
  • Objects: 3–5 per scene, varied in shape, size, color, and material
  • Controlled interventions: External forces applied to some objects
  • Annotations and structure: Videos recorded at 30 frames per second, standardized in duration and spatial resolution

The dataset is curated to enhance visual distinctness (e.g., distinct fluid dyeing) and facilitate slot-based object decomposition. Predictive results confirm that SlotPi can generalize learned principles from simulation to the complexities of natural environments.

7. Implications and Prospects for Hybrid World Modeling

SlotPi demonstrates that explicit integration of Hamiltonian constraints with deep neural spatio-temporal reasoning delivers versatile and physically consistent predictions across a spectrum of dynamic phenomena. This capability supports the development of more advanced world models that jointly learn from data and enforce global physical laws.

The empirical evidence that classical physics and attention-based learning can be synergistically combined suggests a promising trajectory for future research. A plausible implication is that further refinement of hybrid architectures could yield agents with enhanced capabilities for both “simulation” and “understanding” of physical scenarios across increasingly varied real-world contexts. As such, SlotPi provides a template for the unified treatment of physics-based reasoning and data-driven generalization in artificial intelligence.