
SlotSSMs: Modular Sequence Modeling

Updated 5 February 2026
  • SlotSSMs are state-space models that decompose hidden states into parallel slots, each evolving independently with its own SSM kernel.
  • They employ sparse cross-slot self-attention to enable targeted interaction among slots while capturing long-range temporal dependencies.
  • Empirical evaluations show SlotSSMs outperform monolithic models in tasks like multi-object video prediction and long-context reasoning.

Slot State Space Models (SlotSSMs) constitute a modular sequence modeling framework that extends classical State Space Models (SSMs) by structuring hidden state as a set of parallel “slots,” each evolving largely independently but interacting through a controlled, sparse self-attention mechanism. This architecture is motivated by the prevalence of modular, object-centric structure in real-world dynamical systems, aiming to combine the advantages of SSMs’ long-range, parallel training capabilities with the compositionality characteristic of slot- or object-based models, and to avoid the entanglement and training inefficiencies of monolithic and recurrent architectures (Jiang et al., 2024).

1. Motivation and Conceptual Basis

Traditional SSMs such as S4, S5, and Mamba encode all past information into a single hidden vector $h_t$, yielding strong performance on long-range dependency tasks but entangling all factors in a monolithic fashion. Many natural processes—e.g., physical simulations with independently-evolving objects—are more naturally modeled as modular systems with loosely coupled interacting components.

SlotSSMs introduce $K$ parallel state vectors ("slots"), each intended to capture an independent mechanism or object. Each slot evolves via its own SSM kernel, encouraging temporal modularity. Interaction among slots is limited and mediated through a sparse, bottlenecked self-attention mechanism. This yields improved separation of information, better generalization, and greater interpretability in settings where modular structure is inherent (Jiang et al., 2024).

2. Mathematical Formulation

The SlotSSM organizes state at time $t$ as $S(t) = [s_1(t), \ldots, s_K(t)] \in \mathbb{R}^{K \times D_s}$, with each $s_k(t) \in \mathbb{R}^{D_s}$.

Independent Slot Transitions

Each slot $s_k$ is updated recurrently by a slot-specific SSM kernel:

$$s_k(t+1) = \mathrm{SSMKernel}(s_k(t), u(t+1)), \qquad k = 1, \ldots, K,$$

where $u(t)$ denotes the (optionally slot-encoded) input.

In line with recent SSM parameterizations (e.g., Mamba), the slot transitions are block-diagonal, with $\{\overline{A}^{(k)}(s_k(t)), \overline{B}^{(k)}(s_k(t))\}$ parameterized by the per-slot state.
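The independent per-slot recurrence can be sketched as follows. This is a minimal illustration assuming a diagonal linear kernel $s_k(t+1) = A_k \odot s_k(t) + B_k u(t+1)$; the parameter names, shapes, and diagonal form are illustrative stand-ins, not the paper's exact Mamba-style kernel.

```python
import numpy as np

# Hypothetical diagonal linear SSM kernel per slot; all parameters below
# are illustrative, not the paper's exact parameterization.
K, D_s, D_u = 4, 8, 8                         # slots, slot state dim, input dim
rng = np.random.default_rng(0)
A = rng.uniform(0.5, 0.99, size=(K, D_s))     # per-slot diagonal transitions
B = rng.normal(0.0, 0.1, size=(K, D_s, D_u))  # per-slot input projections

def ssm_step(S, u):
    """Advance all K slot states one step; no cross-slot mixing occurs here."""
    return A * S + np.einsum("kdu,u->kd", B, u)

S = np.zeros((K, D_s))
for _ in range(10):
    S = ssm_step(S, rng.normal(size=D_u))
```

Because the update is elementwise per slot, perturbing one slot's state leaves all other slots' next states unchanged, which is precisely the temporal modularity the formulation targets.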

Sparse Cross-Slot Self-Attention

After independent evolution, each slot's output $y_t^k$ is projected to query, key, and value vectors

$$Q_k = W^Q y_t^k, \quad K_k = W^K y_t^k, \quad V_k = W^V y_t^k \in \mathbb{R}^{d_{\mathrm{head}}}.$$

Self-attention computes the weights

$$a_{k\ell} = \mathrm{softmax}_\ell\!\left(\frac{Q_k K_\ell^T}{\sqrt{d_{\mathrm{head}}}}\right)$$

to form the sparse interaction

$$\widetilde{y}_t^k = y_t^k + W^O \sum_{\ell=1}^{K} a_{k\ell} V_\ell.$$

This attention mixing operates at $\mathcal{O}(K^2 d_{\mathrm{head}})$ cost per step, ensuring cross-slot communication is efficient and targeted.
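A single-head version of this mixer can be sketched directly from the equations above. Weight names follow the text; the random initialization and single attention head are illustrative assumptions.

```python
import numpy as np

# Slot mixer sketch: softmax self-attention over the K slot outputs y_t^k,
# added back residually through W^O. Single head; initialization is arbitrary.
K, D, d_head = 4, 8, 16
rng = np.random.default_rng(1)
W_Q = rng.normal(0.0, 0.1, size=(d_head, D))
W_K = rng.normal(0.0, 0.1, size=(d_head, D))
W_V = rng.normal(0.0, 0.1, size=(d_head, D))
W_O = rng.normal(0.0, 0.1, size=(D, d_head))

def slot_mixer(Y):
    """Y: (K, D) slot outputs -> (K, D) residually mixed outputs."""
    Q, K_, V = Y @ W_Q.T, Y @ W_K.T, Y @ W_V.T   # each (K, d_head)
    logits = Q @ K_.T / np.sqrt(d_head)          # (K, K) cross-slot scores
    w = np.exp(logits - logits.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)            # softmax over slots l
    return Y + (w @ V) @ W_O.T                   # residual mixing

Y = rng.normal(size=(K, D))
Y_mixed = slot_mixer(Y)
```

Note that the $K \times K$ attention matrix is tiny relative to sequence-length attention in transformers, since $K$ counts slots, not time steps.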

3. Model Architecture and Implementation

A typical SlotSSM layer consists of three stages:

  1. Slot Encoder: For unstructured inputs (e.g., image patches), K slot queries are extracted using cross-attention or an inverted attention mechanism (object-centric variant).
  2. SlotSSM Core: $K$ parallel Mamba- or S4-type SSM kernels update slot states independently.
  3. Slot Mixer: Two-block residual—self-attention over slots for sparse interaction, followed by a per-slot MLP.
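The three stages can be composed in a toy single-time-step sketch. Every module here is a simplified stand-in under stated assumptions: a fixed linear projection replaces the cross-attention slot encoder, a diagonal recurrence replaces the Mamba/S4 kernel, and uniform mean-mixing stands in for the sparse slot self-attention.

```python
import numpy as np

# Toy composition of one SlotSSM layer's three stages for one time step.
# All modules are simplified stand-ins, not the paper's implementation.
rng = np.random.default_rng(2)
K, D, P = 3, 6, 10                         # slots, slot dim, number of patches

W_enc = rng.normal(0.0, 0.1, size=(K, P))  # stand-in slot encoder weights
A = rng.uniform(0.5, 0.95, size=(K, D))    # per-slot diagonal SSM transition
W1 = rng.normal(0.0, 0.1, size=(D, D))     # per-slot MLP (weights shared here)
W2 = rng.normal(0.0, 0.1, size=(D, D))

def layer_step(S, patches):
    u = W_enc @ patches                      # 1. slot encoder: (K, D) slot inputs
    S = A * S + u                            # 2. SlotSSM core: independent updates
    mixed = S + S.mean(axis=0)               # 3a. stand-in cross-slot interaction
    return mixed + np.tanh(mixed @ W1) @ W2  # 3b. per-slot MLP with residual

S = np.zeros((K, D))
patches = rng.normal(size=(P, D))            # e.g., flattened patch features
S = layer_step(S, patches)
```

In the real architecture, stage 3a would be the sparse slot self-attention described in Section 2, and the recurrent state and mixed output are kept distinct across layers.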

Slot initialization involves learnable tokens (e.g., CLS) or tokens derived from CNN feature maps, augmented with spatial or temporal position embeddings. Slot re-encoding via cross-attention occurs only if the number of slots changes between layers.

Per time step, the SSM kernel update incurs $\mathcal{O}(K D_s N)$ compute (with $N$ as local SSM state size), while slot-mixer self-attention costs $\mathcal{O}(K^2 d_{\mathrm{head}})$. Memory is $\mathcal{O}(K D_s)$ for slot states, with negligible overhead for attention projections.
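A back-of-envelope multiply-add count from these asymptotic terms (taking unit constants, with illustrative sizes) shows the mixer contributing only a modest fraction of per-step work:

```python
# Rough per-step multiply-add counts from the O(.) terms above, with unit
# constants; the concrete sizes are illustrative, not from the paper.
K, D_s, N, d_head = 8, 64, 16, 16
ssm_cost = K * D_s * N       # K parallel SSM kernel updates -> 8192
mixer_cost = K * K * d_head  # cross-slot self-attention     -> 1024
overhead = mixer_cost / ssm_cost
print(ssm_cost, mixer_cost, overhead)  # 8192 1024 0.125
```

With these sizes the mixer adds about 12.5% on top of the kernel updates, in line with the 10–20% overhead reported in the ablations.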

4. Training Paradigms and Scalability

SlotSSMs are optimized with AdamW (typical learning rates $10^{-4}$ to $10^{-3}$, weight decay $\sim 0.1$, dropout unused in SSM blocks). Layer normalization is placed before attention modules and MLPs.

Loss functions are task-specific:

  • Mean-squared error (MSE) or binary cross-entropy for sequence prediction/reconstruction,
  • Cross-entropy for classification in long-context tasks,
  • Supervised losses for tasks such as localization.

Temporal scalability is enabled by constant-cost per-step recurrence and parallel training across time, facilitating modeling of thousands of steps without the quadratic memory/computation scaling typical of transformers or sequential RNNs.

5. Empirical Performance Across Domains

SlotSSMs have been evaluated on several benchmarks:

| Task | Baseline Comparison | SlotSSM Finding |
|---|---|---|
| Multi-object video prediction | Single-state SSM, Split, RIM, Transformer, SlotTransformer | Achieves lowest MSE; temporal modularity crucial for performance. |
| Long-context reasoning (Blinking Color Balls) | SlotTransformer, RNN-based slots | Outperforms competitors for up to 2560 steps; stable where others degrade. |
| Unsupervised object-centric learning (MOVi-A/B) | SAVi (slot-RNN), SlotAttn+RNN | Higher FG-ARI, tighter masks, better attribute inference, more stable. |
| 3D visual reasoning (CATER snitch localization) | Single-state SSM, SlotTransformer | OC-SlotSSM: Top-1/5 accuracy 61.6%/84.0% (no pretrain), 69.3%/90.5% (pretrain). |

These results show that SlotSSMs outperform both monolithic SSMs and slot-based RNNs or transformers, especially where processes are inherently modular, and demonstrate robustness for long-range sequence modeling (Jiang et al., 2024).

6. Ablation Studies and Analysis

  • Slot Count ($K$): Task performance improves as $K$ approaches the true number of independent objects/mechanisms, saturating or degrading slightly when $K$ is set much higher.
  • Attention Bottleneck ($d_{\mathrm{head}}$): Varying the head dimension modulates cross-slot expressivity and computational load. Moderate values (e.g., 4 heads of 16–32 dimensions) are sufficient.
  • Sparsity Patterns: While self-attention is efficient for small $K$ (e.g., $K \lesssim 12$), enforced top-k attention sparsity offers marginal, task-dependent gains.
  • Efficiency: Computation overhead is 10–20% relative to vanilla SSMs, with significant accuracy improvements in multi-object and long-sequence contexts.

7. Implications, Limitations, and Prospects

The SlotSSM architecture demonstrates that enforcing state modularity with controlled, sparse interactions leads to improved interpretability and generalization in tasks exhibiting compositional or object-centric structure. The combination of SSMs’ long-range, parallel training and slot-based compositionality achieves scalability not observed in either slot-Transformers or slot-RNNs. Emergent specialization of slots to objects is observed even without explicit spatial constraints, as revealed by decoder cross-attention analysis.

Current limitations include the absence of evaluation on large-scale language or audio datasets and untested applicability to highly textured video domains or industry-scale models. Potential future directions for SlotSSMs include application to text at sentence or paragraph level, hybridization with graph-structured cross-slot mechanisms, and adaptive slot count per layer or timestep (Jiang et al., 2024).

SlotSSMs represent a principled advance in sequence modeling by embedding modular inductive bias within state-space recursion, yielding models that are both efficient and effective when the underlying dynamics consist of loosely coupled, interacting subsystems.
