
Embodied-SlotSSM: Scalable Object-Centric SSM

Updated 21 November 2025
  • Embodied-SlotSSM is a scalable object-centric state-space modeling framework that uses slot attention to persistently track objects in complex, partially observable environments.
  • It fuses per-slot state-space models with a relational cross-attention encoder to enable temporally aware action prediction and robust memory integration.
  • Empirical evaluations on LIBERO benchmarks show significant improvements in subgoal completion and object tracking accuracy in non-Markovian tasks.

Embodied-SlotSSM is a scalable object-centric state-space modeling framework designed for temporally grounded reasoning in complex, partially observable, and non-Markovian embodied settings. It integrates slot attention for persistently tracking object identities, per-slot state-space models (SSMs) for object-level memory, and a relational cross-attention encoder to support temporally aware action prediction conditioned on both visual and language goals (Chung et al., 14 Nov 2025).

1. Motivation and Problem Setting

Modern embodied agents, particularly in robotic manipulation, frequently encounter scenarios where the environment is only partially observable and task-relevant information is not immediately accessible in the current observation. In such non-Markovian settings, the agent's policy must access detailed per-object histories—such as which object has been interacted with and in what sequence—to disambiguate between visually similar scenes requiring different actions. The LIBERO-Mem benchmark, introduced in (Chung et al., 14 Nov 2025), formalizes this challenge with tasks that require long-horizon, object-specific memory and subgoal management, exposing the limitations of conventional vision-language-action (VLA) architectures. Existing VLA models exhibit marked performance degradation on these memory-intensive tasks (subgoal completion typically <5%) due to the inability of dense token-based representations to scale beyond a few hundred frames without intractable memory or attention costs.

2. Core Components of Embodied-SlotSSM

Embodied-SlotSSM fuses three architectural principles for spatio-temporal, object-centric memory:

  • Slot Attention-based Perception: At each timestep, a frozen or learned visual encoder extracts dense patch or feature-map tokens from the observation. Slot attention transforms the scene features into $K$ object-centric slot embeddings $s_t = [s_t^1, \dots, s_t^K]$, each intended to represent a persistent entity over long timescales.
  • Slot-State-Space Modeling (Slot-SSM): For each slot $k$, a dedicated state-space model maintains its own hidden state $h_t^k \in \mathbb{R}^H$, updating via input-conditioned, block-diagonal affine maps:

$$h_t^k = A_k(s_t^k)\, h_{t-1}^k + B_k(s_t^k)\, s_t^k, \qquad y_t^k = C_k(s_t^k)\, h_t^k$$

This mechanism imposes temporal persistence on each slot's memory trace, enabling tracking of both visible and occluded objects, and supports structured recall/regeneration of short-horizon slot trajectories.

  • Relational Encoder for Action Decoding: To exploit structured slot-level memory, a relational encoder applies $L$ cross-attention layers where slot-fused codes (which integrate current, predicted-next, and goal embeddings for each slot) act as queries and the current visual feature tokens as keys/values. The resulting relation tokens encode object-scene interactions vital for grounding language- or subgoal-conditioned action selection.
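The per-slot recurrence above can be sketched in a few lines of NumPy. The parameterization below (a diagonal $A$ and dense $B$, $C$ produced by linear maps of each slot embedding) is an illustrative assumption for how "input-conditioned, block-diagonal" maps might be realized, not the paper's exact design; dimensions are toy-sized.

```python
import numpy as np

rng = np.random.default_rng(0)

H, D = 8, 4          # hidden-state and slot-embedding dims (toy sizes)
K, T = 3, 5          # number of slots and timesteps

# Hypothetical input-conditioned parameterization: each slot embedding
# s is mapped linearly to the entries of its own A, B, C matrices.
W_a = rng.normal(scale=0.1, size=(D, H))        # -> diagonal of A(s)
W_b = rng.normal(scale=0.1, size=(D, H * D))    # -> entries of B(s)
W_c = rng.normal(scale=0.1, size=(D, D * H))    # -> entries of C(s)

def slot_ssm_step(h_prev, s):
    """One update: h_t^k = A(s) h_{t-1}^k + B(s) s ;  y_t^k = C(s) h_t^k."""
    A = np.diag(np.tanh(s @ W_a))    # diagonal per slot => block-diagonal overall
    B = (s @ W_b).reshape(H, D)
    C = (s @ W_c).reshape(D, H)
    h = A @ h_prev + B @ s
    y = C @ h
    return h, y

# Each slot's recurrence is independent: slot k never mixes with slot j
# inside the SSM, which is what preserves per-object memory traces.
h = np.zeros((K, H))
for t in range(T):
    s_t = rng.normal(size=(K, D))    # slot embeddings from slot attention
    for k in range(K):
        h[k], y_k = slot_ssm_step(h[k], s_t[k])

print(h.shape)  # (3, 8)
```

Because $A(s)$ is diagonal and bounded by `tanh`, each hidden dimension decays or persists independently, which is the standard way such recurrences remain stable over long horizons.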

3. Formal Algorithm and Dataflow

Given a video frame $o_t$ and a global language goal $\ell$:

  1. Visual Encoding: $v_t \leftarrow \mathrm{CNN}(o_t)$ (pretrained)
  2. Slot Attention: $s_t^{(0)}$ initialized (randomly or from $s_{t-1}^{(T)}$), then refined for $T_\text{slot}$ steps
  3. Slot-SSM Update: $\forall k:\ h_t^k = A_k(s_t^k)\, h_{t-1}^k + B_k(s_t^k)\, s_t^k$
  4. Short-term Reconstruction: Predict $(\hat z_{t+\delta}^k)_{\delta=-p}^{q}$ from $[s_{t+1}^k \parallel s_t^k]$ via a shared MLP
  5. SlotFusion & Relational Encoding: $d_t^k \leftarrow \mathrm{SlotFusion}(s_t^k, s_{t+1}^k, g_t^k)$; the relational encoder computes relation tokens $\{r_t^k\}_{k=1}^K$
  6. Action Decoding: $\hat a_t \leftarrow \mathrm{VLA\ head}(\{r_t^k\}, \{d_t^k\}, \ell)$
  7. Loss Computation: Cross-entropy for action prediction, temporal contrastive for slot consistency, and mean-squared error for SSM reconstruction
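Step 5 is the one piece of the dataflow not covered by the SSM equations, so a minimal single-head sketch of the relational cross-attention may help. The toy dimensions and weight sharing across layers are illustrative assumptions (the paper uses 3 layers with 4 heads and per-layer parameters).

```python
import numpy as np

rng = np.random.default_rng(1)
K, N, D = 4, 16, 8    # slots, visual tokens, model dim (toy sizes)

# Single shared projection set; a real implementation would use
# per-layer, multi-head parameters.
Wq, Wk, Wv = (rng.normal(scale=D ** -0.5, size=(D, D)) for _ in range(3))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def relational_layer(d, v):
    """Slot-fused codes d (K x D) attend over visual tokens v (N x D)."""
    q, k, val = d @ Wq, v @ Wk, v @ Wv
    attn = softmax(q @ k.T / np.sqrt(D))   # (K, N): each slot reads the scene
    return d + attn @ val                  # residual relation tokens r_t^k

d = rng.normal(size=(K, D))   # SlotFusion(current, predicted-next, goal) codes
v = rng.normal(size=(N, D))   # current frame's visual feature tokens
r = relational_layer(d, v)
for _ in range(2):            # stack to L = 3 layers in total
    r = relational_layer(r, v)
print(r.shape)  # (4, 8)
```

Note the asymmetry: slots query the scene, never the reverse, so the number of attention scores grows as $K \times N$ rather than with the full token count squared.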

Block diagram:

| Stage | Input(s) | Output(s) |
| --- | --- | --- |
| Observation | Frame $o_t$ | CNN features $v_t$ |
| Slot Attention | $v_t$ | Slots $s_t^k$ |
| Slot-SSM | $s_t^k$, $h_{t-1}^k$ | Updated states $h_t^k$ |
| Relational Enc. | $d_t^k$, $v_t$ | Relations $r_t^k$ |
| Action Decoder | $r_t^k$, $d_t^k$, $\ell$ | Action logits $\hat a_t$ |

4. Training, Loss Functions, and Hyperparameters

The total training objective is a weighted sum of:

  • Slot Temporal Contrastive Loss: Encourages consistent slot identity over small time horizons, crucial for avoiding slot collapse or identity drift.
  • Slot-SSM Windowed Reconstruction Loss: Enforces accurate reconstruction of per-slot latent trajectories over a local window.
  • Action Cross-Entropy Loss: Standard for supervised policy learning, with input tokens composed of relation and slot-fused codes.

Hyperparameters validated in (Chung et al., 14 Nov 2025) include: AdamW optimizer (lr $3 \times 10^{-4}$, weight decay $10^{-2}$), batch size of 16 trajectories, $K=16$ slots of dimension 64, hidden state dimension 128, window size $p=q=2$, 3 relational layers with 4 heads, and MLP dropout of 0.1.
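A toy sketch of how the three terms might combine is below. The InfoNCE-style form of the temporal contrastive term and the loss weights `lam_c`, `lam_r` are illustrative assumptions; the paper specifies a weighted sum but these particular choices are not from it.

```python
import numpy as np

rng = np.random.default_rng(2)
K, D = 4, 8   # slots, slot dim (toy sizes)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def slot_contrastive(s_t, s_next, tau=0.1):
    """InfoNCE over slots: slot k at time t should match slot k at t+1,
    discouraging identity drift / slot swapping."""
    a = s_t / np.linalg.norm(s_t, axis=1, keepdims=True)
    b = s_next / np.linalg.norm(s_next, axis=1, keepdims=True)
    logits = a @ b.T / tau                                  # (K, K) similarities
    return -np.log(softmax(logits)[np.arange(K), np.arange(K)]).mean()

def action_ce(logits, target):
    """Cross-entropy on discrete action logits."""
    return -np.log(softmax(logits)[target])

def ssm_recon_mse(z_hat, z):
    """MSE on the windowed per-slot latent reconstructions."""
    return np.mean((z_hat - z) ** 2)

lam_c, lam_r = 0.5, 1.0   # hypothetical weights, not from the paper
s_t, s_next = rng.normal(size=(K, D)), rng.normal(size=(K, D))
logits, target = rng.normal(size=10), 3
z_hat, z = rng.normal(size=(K, D)), rng.normal(size=(K, D))

loss = (action_ce(logits, target)
        + lam_c * slot_contrastive(s_t, s_next)
        + lam_r * ssm_recon_mse(z_hat, z))
print(float(loss) > 0)  # True
```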

5. Empirical Evaluation and Ablation

Experiments on LIBERO-Goal (general) and LIBERO-Mem (non-Markovian memory tasks) demonstrate:

  • Improvements over Dense/Slot VLA Models: On LIBERO-Goal, Embodied-SlotSSM achieves 80.1% average success (vs. 66.5% for SlotVLA, $h=8$) with consistent gains on Markovian and non-Markovian benchmarks.
  • Non-Markovian Task Superiority: Subgoal completion on LIBERO-Mem is 14.8% for Embodied-SlotSSM versus 5.0% for dense baselines and SlotVLA, with per-task completion as high as 50% on the easiest memory-demanding tasks.
  • Ablation Findings: Removing the Slot-SSM loss drops LIBERO-Mem performance by nearly half; removing the relational encoder also substantially reduces subgoal completion. Increasing the reconstruction window beyond $p=q=2$ yields diminishing returns due to increased optimization demand.
  • Qualitative Results: Visualization of slot heatmaps confirms sustained object identity through occlusion and long horizons; reconstructed slot trajectories closely match ground truth motion profiles.

6. Architectural Relations and Comparisons

Embodied-SlotSSM extends the Slot Structured World Model paradigm (Collu et al., 8 Jan 2024) by:

  • Substituting a classical slot-GNN with per-slot state-space models (block-diagonal, slot-wise dynamics)
  • Integrating a relational cross-attention encoder supporting vision-language-action fusion
  • Utilizing structured, multi-frame slot reconstruction to stabilize slot identity

Unlike prior Transformer-based video world models that use a small set of transformer slots but rely exclusively on cross- and self-attention for memory (Petri et al., 30 May 2024), Embodied-SlotSSM enforces explicit slot memory via SSMs and is directly evaluated in embodied, agent-in-the-loop settings (robotic manipulation). The absence of action-conditioning, compositional goal representations, and agent interactivity in earlier works delineates the distinct focus and novelty of Embodied-SlotSSM.

7. Limitations and Prospects

Current instantiations employ oracle subgoal embeddings ($g_t^k$), limiting autonomous subgoal inference. The approach is demonstrated in simulation and requires domain adaptation for real-world transfer. Scaling to larger numbers of objects ($K>16$) or longer horizons ($T>1000$) is constrained by token and memory bottlenecks, motivating future work on hierarchical SSMs and slot compression. Extensions to continuous control, integrated subgoal inference via LLMs, and broader foundation-model integration are identified as next steps (Chung et al., 14 Nov 2025).

The Slot Structured World Models (SSWM) of (Collu et al., 8 Jan 2024) combine a pixel-space slot attention encoder with a latent graph neural network (GNN) for object interaction and action-conditioned prediction. SSWM achieves markedly superior long-horizon object prediction compared to C-SWM, as measured by Hits@1 and MRR, by maintaining object-level factorization and using frozen slot attention for dynamics learning.

In contrast, (Petri et al., 30 May 2024) introduces a Transformer world model (FPTT) with token-based VQ encodings and slot-structured cross/self-attention, but without classic slot-attention or agent-action conditioning. FPTT improves sample efficiency and predictive stability but is not evaluated in embodied settings.

Embodied-SlotSSM brings together these architectural branches—persistently object-centric, memory-rich, and relationally compositional modeling—to address long-term, ambiguous, object-conditional reasoning and action prediction in real-world-inspired robotic domains.
