SlotSSMs: Modular Sequence Modeling
- SlotSSMs are state-space models that decompose hidden states into parallel slots, each evolving independently with its own SSM kernel.
- They employ sparse cross-slot self-attention to enable targeted interaction among slots while capturing long-range temporal dependencies.
- Empirical evaluations show SlotSSMs outperform monolithic models in tasks like multi-object video prediction and long-context reasoning.
Slot State Space Models (SlotSSMs) constitute a modular sequence modeling framework that extends classical State Space Models (SSMs) by structuring the hidden state as a set of parallel “slots,” each evolving largely independently but interacting through a controlled, sparse self-attention mechanism. The design is motivated by the modular, object-centric structure prevalent in real-world dynamical systems: it aims to combine the long-range, parallel-training advantages of SSMs with the compositionality characteristic of slot- or object-based models, while avoiding the entanglement and training inefficiencies of monolithic and recurrent architectures (Jiang et al., 2024).
1. Motivation and Conceptual Basis
Traditional SSMs such as S4, S5, and Mamba encode all past information into a single hidden vector $h_t$, yielding strong performance on long-range dependency tasks but entangling all factors in a monolithic fashion. Many natural processes—e.g., physical simulations with independently-evolving objects—are more naturally modeled as modular systems with loosely coupled interacting components.
SlotSSMs introduce $K$ parallel state vectors ("slots"), each intended to capture an independent mechanism or object. Each slot evolves via its own SSM kernel, encouraging temporal modularity. Interaction among slots is limited and mediated through a sparse, bottlenecked self-attention mechanism. This yields improved separation of information, better generalization, and greater interpretability in settings where modular structure is inherent (Jiang et al., 2024).
2. Mathematical Formulation
The SlotSSM organizes the state at time $t$ as a set of $K$ slots $H_t = \{h_t^{(1)}, \dots, h_t^{(K)}\}$, with each $h_t^{(k)} \in \mathbb{R}^{d}$.
Independent Slot Transitions
Each slot is updated recurrently by a slot-specific SSM kernel:

$$h_t^{(k)} = A^{(k)} h_{t-1}^{(k)} + B^{(k)} x_t^{(k)}, \qquad y_t^{(k)} = C^{(k)} h_t^{(k)},$$

where $x_t^{(k)}$ denotes the (optionally slot-encoded) input.
In line with recent SSM parameterizations (e.g., Mamba), each slot's transition is block-diagonalized, with $(A^{(k)}, B^{(k)}, C^{(k)})$ parametric in the per-slot state.
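As a concrete illustration, the independent per-slot recurrence above can be sketched in NumPy. This is a minimal sketch with assumed shapes and parameter names, not the authors' implementation; real models use learned, input-dependent parameters and a parallel scan rather than a Python loop.

```python
import numpy as np

# K slots, each with its own diagonal transition A^{(k)} and maps B^{(k)}, C^{(k)}.
K, d_state, d_in = 3, 8, 4
rng = np.random.default_rng(0)

A = rng.uniform(0.8, 0.99, size=(K, d_state))       # per-slot diagonal transitions
B = rng.standard_normal((K, d_state, d_in)) * 0.1   # per-slot input maps
C = rng.standard_normal((K, d_in, d_state)) * 0.1   # per-slot output maps

def slot_ssm_step(h, x):
    """One recurrent step: each slot k evolves independently of the others.

    h: (K, d_state) slot states; x: (K, d_in) slot-encoded inputs.
    Returns updated states and per-slot outputs y of shape (K, d_in).
    """
    h_new = A * h + np.einsum("kij,kj->ki", B, x)   # h_t = A h_{t-1} + B x_t
    y = np.einsum("kij,kj->ki", C, h_new)           # y_t = C h_t
    return h_new, y

h = np.zeros((K, d_state))
for t in range(5):
    x_t = rng.standard_normal((K, d_in))
    h, y = slot_ssm_step(h, x_t)

print(h.shape, y.shape)  # (3, 8) (3, 4)
```

Because no term couples slot $k$ to slot $j \neq k$, the $K$ recurrences can run fully in parallel; all cross-slot communication is deferred to the mixer stage.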
Sparse Cross-Slot Self-Attention
After independent evolution, each slot's output $y_t^{(k)}$ is projected to query, key, and value vectors $q_t^{(k)}, k_t^{(k)}, v_t^{(k)} \in \mathbb{R}^{d_{\text{head}}}$. Self-attention over the $K$ slots then forms the sparse interaction:

$$\tilde{y}_t^{(k)} = \sum_{j=1}^{K} \operatorname{softmax}_j\!\left(\frac{q_t^{(k)} \cdot k_t^{(j)}}{\sqrt{d_{\text{head}}}}\right) v_t^{(j)}.$$
This attention mixing operates at $O(K^2 d)$ cost per step, ensuring cross-slot communication is efficient and targeted.
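A single-head version of this slot-mixing attention can be sketched as follows (illustrative shapes; the projection matrices `Wq`, `Wk`, `Wv` are assumed names, not the paper's):

```python
import numpy as np

def slot_mixer_attention(y, Wq, Wk, Wv):
    """Self-attention over K slot outputs at one time step (sketch).

    y: (K, d) slot outputs; Wq/Wk/Wv: (d, d_head) projections.
    The (K, K) score matrix makes the O(K^2 * d_head) per-step cost explicit;
    it stays cheap because K (the slot count) is small.
    """
    q, k, v = y @ Wq, y @ Wk, y @ Wv            # (K, d_head) each
    scores = q @ k.T / np.sqrt(q.shape[-1])     # (K, K) slot-to-slot logits
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)    # row-wise softmax
    return attn @ v                             # (K, d_head) mixed slots

rng = np.random.default_rng(1)
K, d, d_head = 4, 16, 8
y = rng.standard_normal((K, d))
Wq, Wk, Wv = (rng.standard_normal((d, d_head)) * 0.1 for _ in range(3))
mixed = slot_mixer_attention(y, Wq, Wk, Wv)
print(mixed.shape)  # (4, 8)
```

Note that attention here is over the $K$ slots at one time step, not over the time axis; long-range temporal dependencies are carried entirely by the per-slot SSM recurrences.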
3. Model Architecture and Implementation
A typical SlotSSM layer consists of three stages:
- Slot Encoder: For unstructured inputs (e.g., image patches), K slot queries are extracted using cross-attention or an inverted attention mechanism (object-centric variant).
- SlotSSM Core: $K$ parallel Mamba- or S4-type SSM kernels update slot states independently.
- Slot Mixer: a two-block residual stage: self-attention over slots for sparse interaction, followed by a per-slot MLP.
Slot initialization involves learnable tokens (e.g., CLS) or tokens derived from CNN feature maps, augmented with spatial or temporal position embeddings. Slot re-encoding via cross-attention occurs only if the number of slots changes between layers.
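Putting the core and mixer stages together, one SlotSSM layer can be sketched end to end. This is a self-contained toy (random fixed weights, single attention head, inputs assumed already slot-encoded, so the encoder stage is omitted); all names and shapes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
K, d = 4, 16                                    # slot count and slot width

A = rng.uniform(0.8, 0.99, size=(K, d))         # per-slot diagonal SSM transitions
B = rng.standard_normal((K, d)) * 0.1           # per-slot input scalings
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
W1 = rng.standard_normal((d, 2 * d)) * 0.1      # per-slot MLP weights
W2 = rng.standard_normal((2 * d, d)) * 0.1

def slot_ssm_layer(h, x):
    """One time step of a SlotSSM layer. h, x: (K, d)."""
    # 1) SlotSSM core: each slot evolves independently (diagonal recurrence).
    h = A * h + B * x
    # 2) Slot mixer, block one: residual self-attention over the K slots.
    q, k, v = h @ Wq, h @ Wk, h @ Wv
    s = q @ k.T / np.sqrt(d)
    attn = np.exp(s - s.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    h = h + attn @ v
    # 3) Slot mixer, block two: residual per-slot MLP (no cross-slot mixing).
    h = h + np.maximum(h @ W1, 0.0) @ W2
    return h

h = np.zeros((K, d))
for t in range(8):
    h = slot_ssm_layer(h, rng.standard_normal((K, d)))
print(h.shape)  # (4, 16)
```

In the full architecture these layers are stacked, with layer normalization before the attention and MLP blocks and slot re-encoding only when the slot count changes.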
Per time step, the SSM kernel update incurs $O(KNd)$ compute (with $N$ the local SSM state size), while slot-mixer self-attention costs $O(K^2 d)$. Memory is $O(Kd)$ for slot states, with negligible overhead for attention projections.
4. Training Paradigms and Scalability
SlotSSMs are optimized with AdamW at typical learning rates, with weight decay 0.1 and no dropout in SSM blocks. Layer normalization is placed before attention modules and MLPs.
Loss functions are task-specific:
- Mean-squared error (MSE) or binary cross-entropy for sequence prediction/reconstruction,
- Cross-entropy for classification in long-context tasks,
- Supervised losses for tasks such as localization.
Temporal scalability is enabled by constant-cost per-step recurrence and parallel training across time, facilitating modeling of thousands of steps without the quadratic memory/computation scaling typical of transformers or sequential RNNs.
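The parallel-training property rests on the linearity of each slot's recurrence: $h_t = a\,h_{t-1} + b\,x_t$ admits the closed form $h_t = \sum_{s \le t} a^{t-s} b\, x_s$, so the whole sequence can be computed by an associative scan rather than a step-by-step loop. The sketch below verifies this equivalence for one scalar slot channel (real models apply it per state dimension, with a log-depth scan instead of the explicit $T \times T$ matrix used here for clarity):

```python
import numpy as np

def sequential(a, b, x):
    """Step-by-step recurrence h_t = a * h_{t-1} + b * x_t, starting from 0."""
    h, out = 0.0, []
    for x_t in x:
        h = a * h + b * x_t
        out.append(h)
    return np.array(out)

def parallel_form(a, b, x):
    """Closed form h_t = sum_{s<=t} a^(t-s) * b * x_s, computed all at once."""
    T = len(x)
    powers = a ** (np.arange(T)[:, None] - np.arange(T)[None, :])
    mask = np.tril(np.ones((T, T)))   # zero out future terms (s > t)
    return (powers * mask) @ (b * x)

x = np.arange(1.0, 6.0)
assert np.allclose(sequential(0.9, 0.5, x), parallel_form(0.9, 0.5, x))
```

This is what lets SlotSSMs train in parallel across thousands of steps while keeping inference cost constant per step.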
5. Empirical Performance Across Domains
SlotSSMs have been evaluated on several benchmarks:
| Task | Baseline Comparison | SlotSSM Finding |
|---|---|---|
| Multi-object video prediction | Single-state SSM, Split, RIM, Transformer, SlotTransformer | Achieves lowest MSE; temporal modularity crucial for performance. |
| Long-context reasoning (Blinking Color Balls) | SlotTransformer, RNN-based slots | Outperforms competitors for up to 2560 steps; stable where others degrade. |
| Unsupervised object-centric learning (MOVi-A/B) | SAVi (slot-RNN), SlotAttn+RNN | Higher FG-ARI, tighter masks, better attribute inference, more stable. |
| 3D visual reasoning (CATER snitch localization) | Single-state SSM, SlotTransformer | OC-SlotSSM: Top-1/5 accuracy 61.6%/84.0% (no pretrain), 69.3%/90.5% (pretrain). |
These results show that SlotSSMs outperform both monolithic SSMs and slot-based RNNs or transformers, especially where processes are inherently modular, and demonstrate robustness for long-range sequence modeling (Jiang et al., 2024).
6. Ablation Studies and Analysis
- Slot Count ($K$): Task performance improves as $K$ approaches the true number of independent objects/mechanisms, saturating or degrading slightly when $K$ is set much higher.
- Attention Bottleneck ($d_{\text{head}}$): Varying the head dimension modulates cross-slot expressivity and computational load. Moderate values (e.g., 4 heads of 16–32 dimensions) are sufficient.
- Sparsity Patterns: While dense self-attention is already efficient for small $K$, enforced top-$k$ attention sparsity offers marginal, task-dependent gains.
- Efficiency: Computation overhead is 10–20% relative to vanilla SSMs, with significant accuracy improvements in multi-object and long-sequence contexts.
7. Implications, Limitations, and Prospects
The SlotSSM architecture demonstrates that enforcing state modularity with controlled, sparse interactions leads to improved interpretability and generalization in tasks exhibiting compositional or object-centric structure. The combination of SSMs’ long-range, parallel training and slot-based compositionality achieves scalability not observed in either slot-Transformers or slot-RNNs. Emergent specialization of slots to objects is observed even without explicit spatial constraints, as revealed by decoder cross-attention analysis.
Current limitations include the absence of evaluation on large-scale language or audio datasets and untested applicability to highly textured video domains or industry-scale models. Potential future directions for SlotSSMs include application to text at sentence or paragraph level, hybridization with graph-structured cross-slot mechanisms, and adaptive slot count per layer or timestep (Jiang et al., 2024).
SlotSSMs represent a principled advance in sequence modeling by embedding modular inductive bias within state-space recursion, yielding models that are both efficient and effective when the underlying dynamics consist of loosely coupled, interacting subsystems.