
Per-Object Selective SSMs

Updated 3 May 2026
  • Per-object selective SSMs are advanced state-space models that use dynamic, object-specific gating to modulate memory propagation and ensure efficient processing.
  • They incorporate dual gating mechanisms at token and channel levels, enabling modular state decompositions and scalable, hardware-friendly implementations.
  • Empirical studies show that models like MambaMixer and SlotSSM achieve superior performance in vision, time-series, and video tasks through selective update strategies.

A per-object selective State Space Model (SSM) is a sequential modeling framework in which the state-space architecture is dynamically and selectively conditioned on individual objects (or tokens) as well as, in advanced instantiations, on individual feature channels. This paradigm enables targeted and contextually adaptive memory propagation, efficient handling of multi-object inputs, and selective routing of information through recurrent architectures. Recent research on selective SSMs emphasizes per-object gating mechanisms, modular state decompositions, content- and context-aware selectivity, and scalable hardware-oriented designs, culminating in models that combine the statistical power of classical dynamical systems with the data-dependent flexibility of modern deep learning. Below is an in-depth overview of the conceptual, mathematical, and practical structure of per-object selective SSMs, with reference to key contributions in the literature.

1. Core Mathematical Constructs of Per-Object Selective SSMs

The foundation of per-object selective SSMs is the standard continuous-time linear state-space model:

h'(t) = A h(t) + B x(t), \quad y(t) = C h(t),

where x(t) \in \mathbb{R}^{d} is the input, h(t) \in \mathbb{R}^{N} the state, and A, B, C are learnable or fixed matrices. Discretization yields the recurrence:

h_t = \overline{A} h_{t-1} + \overline{B} x_t, \quad y_t = C h_t,

where \overline{A}, \overline{B} encode the discretized dynamics (Behrouz et al., 2024).
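The discretized recurrence can be evaluated with a plain sequential scan. The following minimal NumPy sketch assumes fixed (non-selective) \overline{A}, \overline{B}, C; the function name is illustrative:

```python
import numpy as np

def ssm_scan(A_bar, B_bar, C, x):
    """Sequential evaluation of h_t = A_bar @ h_{t-1} + B_bar @ x_t, y_t = C @ h_t.
    A_bar: (N, N), B_bar: (N, d), C: (p, N), x: (T, d) -> y: (T, p)."""
    h = np.zeros(A_bar.shape[0])
    ys = []
    for x_t in x:
        h = A_bar @ h + B_bar @ x_t  # state update
        ys.append(C @ h)             # readout
    return np.stack(ys)

# Toy usage: a single decaying state accumulating its inputs.
y = ssm_scan(np.array([[0.9]]), np.array([[1.0]]), np.array([[1.0]]), np.ones((5, 1)))
```

Selective SSMs replace the fixed matrices with input-dependent ones, which is what the per-object extension below formalizes.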

Per-object selectivity augments this framework by associating each object ii in a collection (e.g., set of tokens, regions, or slots) with its own identity-dependent parameters and gate:

h_n^i = A^i h_{n-1}^i + g^i(x_n) \odot (B^i h_{n-1}^i) + C^i x_n^i,

where g^i(x_n) is a gating vector, possibly parametrized as a softplus or sigmoid nonlinearity on affine projections of x_n (Cirone et al., 2024). For sequence inputs structured as objects × channels (e.g., x \in \mathbb{R}^{B \times K \times D} for batch, objects/tokens, channels), this selectivity can be applied per-object, per-channel, or both.
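As an illustrative sketch (not the exact parametrization of any cited paper), the per-object gated update above can be vectorized over objects with NumPy; W_g and b_g are hypothetical parameters of the sigmoid gate:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def per_object_step(h_prev, x_n, A, B, C, W_g, b_g):
    """One step of h_n^i = A^i h_{n-1}^i + g^i(x_n) * (B^i h_{n-1}^i) + C^i x_n^i,
    vectorized over objects i along the leading axis.
    h_prev: (K, N); x_n: (K, d); A, B: (K, N, N); C: (K, N, d);
    W_g: (K, N, d), b_g: (K, N) parametrize the per-object sigmoid gate."""
    g = sigmoid(np.einsum('knd,kd->kn', W_g, x_n) + b_g)   # (K, N) gate per object
    return (np.einsum('knm,km->kn', A, h_prev)             # identity dynamics
            + g * np.einsum('knm,km->kn', B, h_prev)       # gated memory term
            + np.einsum('knd,kd->kn', C, x_n))             # input injection
```

Each object i carries its own parameters (A^i, B^i, C^i) and gate, so the states evolve in parallel without cross-object interaction at this level.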

Recent models, such as MambaMixer (Behrouz et al., 2024) and SlotSSM (Jiang et al., 2024), operationalize per-object selection as either:

  • Multiplicative masking of the SSM outputs (gating)
  • Slotwise state updates with minimal cross-slot mixing
  • Content- and context-driven Kalman-optimal selection (KOSS (Wang et al., 18 Dec 2025))

This per-object extension retains theoretical guarantees established for input-gated SSMs, including universality for path functionals via signature expansion in the rough path theory sense (Cirone et al., 2024).

2. Dual Selection Mechanisms: Per-Token and Per-Channel Gates

Per-object selective SSMs generalize selectivity by introducing gating mechanisms along multiple axes:

  • Token (per-object) gates: For each object (token), a gating score g_{\text{token}} is computed to determine its relevance at each step or layer. This is often realized as:

g_{\text{token}} = \sigma(\mathrm{Conv}(x)),

where \sigma denotes the sigmoid function and \mathrm{Conv} is a spatial (depthwise or 1D) convolution over objects/tokens.

  • Channel gates: For each feature channel, a score g_{\text{channel}} governs the activation or suppression of information flow:

g_{\text{channel}} = \sigma(\mathrm{Conv}(x^{\top})),

where the data is transposed such that channels are treated analogously to tokens (Behrouz et al., 2024).

These gates are then integrated multiplicatively with the SSM output, such that:

y = g_{\text{token}} \odot g_{\text{channel}} \odot (\overline{K} * x),

where \overline{K} denotes a data-dependent convolution kernel and * is the convolution operator. This design allows individual objects/channels to be dynamically activated or suppressed in response to the evolving input context.
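The dual token/channel gating can be sketched as follows; the kernel shapes and gate placement here are illustrative assumptions, not MambaMixer's exact design:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def depthwise_conv1d(x, kernel):
    """Causal depthwise 1D convolution along the first (sequence) axis.
    x: (L, D); kernel: (W, D), one length-W filter per channel."""
    W, D = kernel.shape
    pad = np.vstack([np.zeros((W - 1, D)), x])          # left-pad for causality
    return np.stack([np.sum(pad[t:t + W] * kernel, axis=0)
                     for t in range(x.shape[0])])

def dual_gate(x, k_tok, k_ch, ssm_out):
    """Token and channel gates combined multiplicatively with an SSM output.
    x, ssm_out: (L, D); k_tok: (W, D) convolves over tokens;
    k_ch: (W, L) convolves over channels (input transposed, as in the text)."""
    g_token = sigmoid(depthwise_conv1d(x, k_tok))        # (L, D) per-token gate
    g_channel = sigmoid(depthwise_conv1d(x.T, k_ch)).T   # (L, D) per-channel gate
    return g_token * g_channel * ssm_out
```

Transposing the input so that channels play the role of tokens lets the same convolution-plus-sigmoid machinery serve both axes.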

3. Architectural Realizations and Model Variants

Per-object selective SSM architectures vary in the complexity and explicitness of object modeling:

  • MambaMixer: Employs dual gating mechanisms (across tokens and channels) in conjunction with data-dependent dynamic kernels, and forms dense skip connections (weighted averages) across layers to stabilize training and promote feature reuse. Its core primitive interleaves “Selective Token” and “Selective Channel” mixers, supporting high hardware efficiency and linear memory scaling (Behrouz et al., 2024).
  • SlotSSM: Decomposes the state space into K independent slot vectors, each evolving via its own SSM update, with periodic sparse cross-slot mixing implemented via a multi-head self-attention “slot mixer.” This modularization induces per-object state updates, encourages information separation, and scales efficiently even as sequence length grows (Jiang et al., 2024).
  • KOSS: Introduces Kalman-optimal gating, whereby each latent dimension dynamically modulates information propagation using a context- and content-aware Kalman gain. The gating is learned as an MLP of the “innovation” (current input minus model prediction), yielding a closed-loop system that adaptively routes information based on signal uncertainty minimization. KOSS operates with a global FFT-based spectral differentiator and segment-wise prefix scans to maintain near-linear time complexity (Wang et al., 18 Dec 2025).
  • Foundational Theory: Universality and path-signature results demonstrate that, under suitable conditions (sufficient width, Lipschitz/gate regularity, stable discretization), per-object selective SSMs can approximate any functional built from the signature of object-wise input streams (Cirone et al., 2024).
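The SlotSSM-style decomposition can be sketched as independent per-slot updates plus an occasional attention-based slot mixer. This is a simplified toy, assuming A and B are shared across slots and single-head attention:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def slot_update(slots, x_t, A, B):
    """Independent per-slot linear SSM updates.
    slots: (K, N); x_t: (K, d); A: (N, N), B: (N, d) shared for simplicity."""
    return slots @ A.T + x_t @ B.T

def slot_mixer(slots, Wq, Wk, Wv):
    """Sparse cross-slot mixing via single-head self-attention over the K slots."""
    q, k, v = slots @ Wq, slots @ Wk, slots @ Wv
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]), axis=-1)  # (K, K)
    return slots + attn @ v                                  # residual mixing
```

In an actual model the mixer would be applied only every few steps, so most computation stays slot-local and per-step cost grows linearly in K.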

4. Implementation and Computational Scaling

Selective SSMs exploit architectural strategies for hardware efficiency:

  • Associative scan/prefix-sum: With data-dependent SSM parameters, standard convolutional acceleration cannot be used. Instead, efficient associative (prefix-sum) scan algorithms parallelize the SSM recurrence over sequence segments (Behrouz et al., 2024, Wang et al., 18 Dec 2025).
  • Contiguous axis packing: Tokens and channels are packed into contiguous memory blocks to optimize 1D/2D convolutional kernels (Behrouz et al., 2024).
  • Fused gating and computation: Sigmoid gating and output modulation are fused to minimize additional memory passes (Behrouz et al., 2024).
  • Spectral differentiation: KOSS introduces FFT-based spectral derivative estimation to stabilize input-difference computations under high-frequency noise (Wang et al., 18 Dec 2025).

For SlotSSM, per-step complexity scales linearly in the number of slots (objects) K, and memory usage depends linearly on K and the per-slot state size. These designs ensure suitability for long-sequence modeling and object-centric domains.
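The associative-scan strategy above rests on the fact that the gated recurrence h_t = a_t h_{t-1} + b_t admits an associative pair-combine rule, so segments can be reduced independently and merged. A minimal sequential sketch of the operator:

```python
def combine(e1, e2):
    """Associative operator for h_t = a_t * h_{t-1} + b_t:
    composing (a1, b1) then (a2, b2) yields (a1*a2, a2*b1 + b2)."""
    a1, b1 = e1
    a2, b2 = e2
    return a1 * a2, a2 * b1 + b2

def prefix_scan(elems):
    """Inclusive scan with the associative combine; in a parallel setting each
    segment is reduced independently and the partial results are merged."""
    out = [elems[0]]
    for e in elems[1:]:
        out.append(combine(out[-1], e))
    return out

# The scan reproduces the sequential recurrence with h_0 = 0:
a = [0.5, 0.8, 0.9]
b = [1.0, 2.0, 3.0]
h_scan = [hb for _, hb in prefix_scan(list(zip(a, b)))]
```

Because `combine` is associative, hardware implementations can parallelize this over sequence segments even when a_t, b_t are data-dependent, which is exactly the situation where convolutional acceleration fails.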

5. Empirical Evaluation and Applications

Per-object selective SSMs have been empirically validated across several modalities:

  • MambaMixer (vision, time series): ViM2 outperforms SSM-based models in detection/segmentation; TSM2 achieves superior time series forecasting with lower computational cost (Behrouz et al., 2024).
  • KOSS (long-term sequences, tracking): Achieves 79.2% accuracy on selective copying with distractors (vs. <20% for S4/Mamba); 2.9–36.2% MSE reduction on benchmarks; robust to noisy, irregular real-world SSR tracking (Wang et al., 18 Dec 2025).
  • SlotSSM (video, object reasoning): Substantial MSE and boundary-adherence improvements on multi-object video; robust long-context reasoning (sequence length up to 2560); superior unsupervised object-centric scores (FG-ARI, mIoU) (Jiang et al., 2024).

A common thread is that per-object selectivity provides strong modularity, improved interpretability (object- or token-level relevance tracing), reduced overfitting to spurious correlations (by suppressing uninformative tokens/channels), and efficient utilization of modern hardware.

6. Theoretical Guarantees and Expressivity

Theoretical analysis via rough path theory shows that selective SSMs, when equipped with object-wise gates and sufficient width, admit universal approximation properties for object-path functionals:

  • The hidden state of each object can approximate, up to arbitrary precision, any continuous signature functional of its own input stream, provided the gating dynamics are sufficiently expressive and stable (Cirone et al., 2024).
  • The selectivity mechanism (object/token and channel-wise gating) modulates the “control path” in the induced controlled differential equation, linking selective SSMs to broader theories of controlled dynamical systems and path signatures.
  • For context- and content-aware designs (e.g., KOSS), the closed-loop Kalman gain provides an optimal selection policy in the sense of posterior state uncertainty minimization, converging to an optimal steady-state solution under classical observability and noise conditions (Wang et al., 18 Dec 2025).
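The innovation-driven selection policy can be illustrated with a scalar toy: the Kalman gain weights the innovation (input minus prediction) by its relative uncertainty and converges to a steady state, as the classical theory predicts. This is a generic scalar Kalman-filter sketch, not KOSS's learned parametrization:

```python
def kalman_gate_step(h, P, x, A, Q, R):
    """One scalar step of an innovation-driven update.
    h: state estimate, P: its variance, x: observation,
    A: dynamics, Q: process noise, R: observation noise."""
    h_pred = A * h                 # model prediction
    P_pred = A * P * A + Q         # predicted uncertainty
    innovation = x - h_pred        # content/context signal
    K = P_pred / (P_pred + R)      # gain in (0, 1): the "gate"
    h_new = h_pred + K * innovation
    P_new = (1.0 - K) * P_pred    # posterior uncertainty shrinks
    return h_new, P_new, K
```

Under stable dynamics the gain K settles to a fixed point of the Riccati iteration, mirroring the steady-state optimality claim above.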

7. Extensions, Limitations, and Future Directions

Per-object selective SSMs extend naturally to a variety of object-centric and modular sequential modeling tasks:

  • Multi-object tracking: treating each tracked entity or detection as a “token” or “slot,” enabling dynamic memory allocation and selective update (Jiang et al., 2024, Wang et al., 18 Dec 2025).
  • Relational reasoning and scene graph propagation: using selective gates to sparsify and modularize inter-object and inter-channel interactions (Behrouz et al., 2024).
  • Improved robustness in noisy, distractor-laden environments: e.g., KOSS’s resilience in selective-copying with correlated distractors (Wang et al., 18 Dec 2025).

Potential challenges include balancing slot/object cardinality (especially in unsupervised settings), stabilizing training of deeply modular SSMs, and efficiently scaling cross-object interactions beyond narrow attention bottlenecks.

A plausible implication is that selective SSMs may enable a new generation of scalable, interpretable, and efficient models for domains where modular, object-wise information processing is intrinsically required, providing an alternative to the quadratic costs of Transformers and the rigidity of conventional RNNs.

