Dense Selective State-Space Models
- Dense Selective State-Space Models are neural architectures that fuse dense, input-driven transitions with gating mechanisms to capture high-order token interactions.
- They integrate shallow hidden states via lateral connections to preserve low-level features and enable superior performance in language modeling and formal sequence tasks.
- Empirical benchmarks reveal these models achieve lower perplexity and error rates while using memory efficiently, outperforming traditional RNNs and Transformers.
Dense Selective State-Space Models (SD-SSM) are advanced neural sequence modeling architectures that generalize classical state-space models by introducing dense, input-dependent transition mechanisms and explicit pathways for fine-grained information flow. SD-SSMs combine the structured recurrence of SSMs with selective, content-driven gating and dense inter-layer connections to achieve superior expressivity and computational efficiency compared to both RNNs and Transformers, excelling in both language modeling and formal sequence emulation tasks.
1. Architectural Foundations and Variants
Dense Selective State-Space Models are characterized by two core innovations relative to standard SSMs: (i) dense, selective (input- and state-dependent) transitions within each recurrent block, and (ii) dense hidden state integration across different layers or over a dictionary of transition matrices.
At the block level, an SD-SSM replaces the time-invariant recurrence

$$h_t = A\, h_{t-1} + B\, x_t$$

with a recurrence in which the transition becomes an input-driven, content-selective operator:

$$h_t = A(x_t)\, h_{t-1} + B(x_t)\, x_t,$$

or, equivalently, a multiplicative form in which a dense gate modulates the state update elementwise, where $A(\cdot)$ and $B(\cdot)$ are dense functions of the current input. This architecture, encompassing both dense internal mixing and multiplicative gating, achieves nontrivial high-order token interactions, in contrast to classical SSMs, which are limited to fixed convolutional memory and linear updates (Cirone et al., 2024, Terzić et al., 2024).
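A minimal NumPy sketch of one such input-selective recurrence step follows. The parameterization (producing $A(x_t)$ and $B(x_t)$ from the token via linear maps `W_A`, `W_B` squashed by `tanh`) is an illustrative assumption, not the exact construction from the cited papers:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 3  # hidden and input dimensions (illustrative)

# Hypothetical parameterization: the dense transition A(x_t) and input map
# B(x_t) are generated from the current token and squashed for stability.
W_A = rng.normal(scale=0.1, size=(d * d, n))
W_B = rng.normal(scale=0.1, size=(d * n, n))

def sd_ssm_step(h_prev, x):
    """One input-selective step: h_t = A(x_t) h_{t-1} + B(x_t) x_t."""
    A_t = np.tanh(W_A @ x).reshape(d, d)   # dense, input-dependent transition
    B_t = np.tanh(W_B @ x).reshape(d, n)   # dense, input-dependent input map
    return A_t @ h_prev + B_t @ x

h = np.zeros(d)
for x in rng.normal(size=(5, n)):          # run a short sequence
    h = sd_ssm_step(h, x)
print(h.shape)  # (4,)
```

Because $A_t$ is a full matrix that changes with every token, the hidden-state coordinates mix at each step, which is precisely what diagonal selective SSMs cannot do.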
When extended across depth, DenseSSMs (e.g., DenseMamba, DenseRetNet) insert explicit, gated lateral connections from shallow hidden states to deeper layers, enabling direct propagation of low-level features and mitigating state degradation. At each layer $l$, the hidden states of the $m$ shallower layers ($h^{l-1}, \dots, h^{l-m}$) are linearly projected and adaptively gated, then fused additively into the deeper state $h^l$ prior to output computation (He et al., 2024).
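The lateral fusion step can be sketched as follows. The weight names `W_proj`, `W_gate` and the single sigmoid gate are simplifications; He et al. use two-layer MLP gates:

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 8, 2  # state dimension; number of buffered shallow layers

# Hypothetical projection and gate weights (the papers use MLP gates;
# a single sigmoid layer keeps the sketch short).
W_proj = [rng.normal(scale=0.1, size=(d, d)) for _ in range(m)]
W_gate = [rng.normal(scale=0.1, size=(d, d)) for _ in range(m)]

def dense_fuse(h_deep, shallow_states):
    """Additively fuse gated projections of the last m shallow hidden states."""
    fused = h_deep.copy()
    for Wp, Wg, h_s in zip(W_proj, W_gate, shallow_states):
        proj = Wp @ h_s                           # linear projection
        gate = 1.0 / (1.0 + np.exp(-(Wg @ h_s)))  # elementwise sigmoid gate
        fused += gate * proj                      # additive fusion
    return fused

h_deep = rng.normal(size=d)
shallow = [rng.normal(size=d) for _ in range(m)]
out = dense_fuse(h_deep, shallow)
print(out.shape)  # (8,)
```

Additive fusion keeps the deep layer's residual path intact while letting low-level features bypass intermediate layers.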
2. Mathematical Formulation and Theoretical Guarantees
The core dynamical equation in a dense selective SSM is

$$h_t = A(x_t)\, h_{t-1} + B(x_t)\, x_t,$$

or, in the gating formulation,

$$h_t = G(x_t, h_{t-1}) \odot \big(A\, h_{t-1} + B\, x_t\big),$$

where $A$ and $B$ are dense matrices and the gating $G$ may depend on both the current input $x_t$ and the previous state $h_{t-1}$.
In modern implementations, the gating is performed by a deep, cross-dimensional nonlinear function, e.g.,

$$G(x_t, h_{t-1}) = \sigma\!\big(W_2\, \phi(W_1 [x_t; h_{t-1}])\big),$$

which produces dense, feature-level gates for coordinated selection (Bhat, 2024).
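A two-layer cross-dimensional gate of this shape can be sketched directly; the sizes and weight names below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, hdim = 6, 4, 16  # state, input, and gate-MLP hidden sizes (illustrative)

W1 = rng.normal(scale=0.1, size=(hdim, d + n))
W2 = rng.normal(scale=0.1, size=(d, hdim))

def dense_gate(x_t, h_prev):
    """G = sigmoid(W2 relu(W1 [x_t; h_{t-1}])): one gate per state feature."""
    z = np.concatenate([x_t, h_prev])
    hidden = np.maximum(W1 @ z, 0.0)             # cross-dimensional mixing
    return 1.0 / (1.0 + np.exp(-(W2 @ hidden)))  # dense gates in (0, 1)

g = dense_gate(rng.normal(size=n), rng.normal(size=d))
print(g.shape)  # (6,)
```

Because every gate coordinate sees all input and state features through `W1`, the selection is coordinated across dimensions rather than elementwise-independent.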
Theoretically, the expressive power of dense selective SSMs is characterized via Rough Path Theory: the hidden state at each time is a low-dimensional projection of the path signature of the (potentially continuous) input sequence, i.e., a universal feature set capable of representing all continuous functionals of the input trajectory. Dense gating captures arbitrary high-order interactions with linear parameter scaling in length, while diagonal gating captures only commutative (monomial) signature components (Cirone et al., 2024).
Strong universality theorems establish that (i) any continuous path-to-point functional can be approximated by a dense SD-SSM of sufficient width and (ii) finite-parameter random initializations suffice for this universality, with only the readout requiring training as width increases (Cirone et al., 2024). In the finite state domain, SD-SSMs can emulate any deterministic finite-state automaton, including non-commutative automata, with perfect length generalization in a single layer—a capability unattainable by diagonal or block-diagonal SSMs (Terzić et al., 2024).
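The automaton-emulation claim can be illustrated with a toy non-commutative DFA: one-hot states updated by dense, input-selected transition matrices. This mirrors the mechanism, not the trained SD-SSM itself:

```python
import numpy as np

# Two-state DFA over {a, b}: 'a' swaps the states, 'b' resets both to
# state 0. Since applying "ab" and "ba" end in different states, no set of
# commuting (diagonal) transitions can emulate it.
T = {
    "a": np.array([[0, 1], [1, 0]]),  # swap states
    "b": np.array([[1, 1], [0, 0]]),  # reset to state 0
}

def run_dfa(word, start=0):
    h = np.eye(2)[start]              # one-hot state vector
    for sym in word:
        h = T[sym] @ h                # dense, input-selected transition
    return int(np.argmax(h))

print(run_dfa("ab"), run_dfa("ba"))   # 0 1
```

A dense selective transition can realize exactly these per-symbol matrices, which is why a single layer suffices for perfect length generalization on such languages.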
3. Implementation Strategies and Computational Complexity
Several concrete SD-SSM constructions exist:
- DenseSSM/DenseMamba: Each higher layer aggregates (with linear projections and MLP-based gates) the hidden states from the last $m$ shallower layers. Gating is typically realized by two-layer MLPs, with parameter and memory overhead scaling as $O(m d^2)$ for $d$-dimensional states. Fusion is additive, and the approach is backbone-agnostic, applicable to Mamba, RetNet, and related models (He et al., 2024).
- Dictionary-based SD-SSM: Each time step selects a convex combination of $K$ learned transition matrices via a softmax mechanism on the embedded input, forming the new recurrence operator. Stability is ensured via column-wise normalization; outputs are produced by LayerNorm followed by a linear projection (Terzić et al., 2024).
- Selective Gating Compression: Dense gating enables memory and computation pruning by dynamically masking irrelevant hidden state dimensions per time step. The selective mask yields a variable effective dimension and reduces computational load to roughly $O(\rho\, d^2)$ per step, where $\rho \in (0, 1]$ is the average gate density (Bhat, 2024).
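The dictionary-based construction can be sketched as follows. The selector weights `W_sel` and the exact normalization are illustrative simplifications of the Terzić et al. recipe:

```python
import numpy as np

rng = np.random.default_rng(3)
d, n, K = 5, 3, 4  # state dim, input dim, dictionary size (illustrative)

# K learned transition matrices, column-normalized so convex combinations
# of them stay stable (simplified stand-in for the paper's normalization).
dictionary = np.abs(rng.normal(size=(K, d, d)))
dictionary /= dictionary.sum(axis=1, keepdims=True)  # columns sum to 1
W_sel = rng.normal(scale=0.5, size=(K, n))           # hypothetical selector
B = rng.normal(scale=0.1, size=(d, n))

def dict_step(h_prev, x):
    """A_t = softmax-weighted convex combination of dictionary matrices."""
    logits = W_sel @ x
    alpha = np.exp(logits - logits.max())
    alpha /= alpha.sum()                              # softmax over K entries
    A_t = np.tensordot(alpha, dictionary, axes=1)     # (d, d) mixture
    return A_t @ h_prev + B @ x

h = np.zeros(d)
for x in rng.normal(size=(6, n)):
    h = dict_step(h, x)
print(h.shape)  # (5,)
```

Restricting $A_t$ to the convex hull of a fixed dictionary keeps the per-step transition dense and non-commutative while bounding its norm.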
Parallelization is preserved in both layerwise integration and sequence processing. DenseSSM-style architectures retain $O(L)$ training complexity in sequence length $L$ and $O(1)$ per-token inference. Dictionary-based SD-SSMs leverage parallel scan algorithms for efficient GPU computation on long sequences, with significant wallclock advantages over sequential RNN-style updates (He et al., 2024, Terzić et al., 2024).
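The scan works because the affine recurrence $h_t = A_t h_{t-1} + b_t$ composes associatively, so prefix states can be computed with a parallel prefix over $(A_t, b_t)$ pairs. A sketch verifying the associative combine against the sequential recurrence:

```python
import numpy as np

rng = np.random.default_rng(4)
d, T = 3, 8

# Per-step pairs (A_t, b_t) for h_t = A_t h_{t-1} + b_t (illustrative values).
As = [np.eye(d) * 0.9 + rng.normal(scale=0.01, size=(d, d)) for _ in range(T)]
bs = [rng.normal(size=d) for _ in range(T)]

def combine(left, right):
    """Associative composition of two affine maps: apply left, then right."""
    A1, b1 = left
    A2, b2 = right
    return (A2 @ A1, A2 @ b1 + b2)

# Sequential reference.
h = np.zeros(d)
for A, b in zip(As, bs):
    h = A @ h + b

# Fold with the associative operator (a parallel scan evaluates the same
# operator in O(log T) depth on parallel hardware).
acc = (As[0], bs[0])
for A, b in zip(As[1:], bs[1:]):
    acc = combine(acc, (A, b))
h_scan = acc[0] @ np.zeros(d) + acc[1]

print(np.allclose(h, h_scan))  # True
```

The same combine rule underlies GPU scan implementations; only the schedule (tree-structured rather than left-to-right) changes.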
4. Information-Theoretic and Stability Analysis
SD-SSMs can be interpreted as dynamic, information-theoretic encoders that balance history retention against hidden-state compression. Let $X_{\le t} = (x_1, \dots, x_t)$ denote the input history and $h_t$ the hidden state. The mutual information $I(X_{\le t};\, h_t)$ quantifies the fidelity of historical information in the current hidden state. The gating function shapes the rate–distortion curve $R(D)$, which expresses the trade-off between compression rate and tolerable distortion (Bhat, 2024):

$$R(D) = \sum_i \tfrac{1}{2} \log \frac{\lambda_i}{D_i}, \qquad \sum_i D_i = D,$$

with $\lambda_i$ the eigenvalues of the hidden-state covariance.
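For a Gaussian source this curve is computed by reverse water-filling over the covariance spectrum. A sketch (the eigenvalues are illustrative, not measured from any model):

```python
import numpy as np

def rate_at_distortion(lam, D, tol=1e-9):
    """Gaussian rate-distortion via reverse water-filling:
    R(D) = sum_i 0.5 log2(lam_i / D_i), D_i = min(theta, lam_i), sum D_i = D."""
    lo, hi = tol, float(max(lam))
    while hi - lo > tol:                       # bisect for water level theta
        theta = 0.5 * (lo + hi)
        if np.minimum(theta, lam).sum() < D:
            lo = theta
        else:
            hi = theta
    Di = np.minimum(lo, lam)
    return float(np.sum(0.5 * np.log2(lam / np.maximum(Di, tol))))

lam = np.array([4.0, 2.0, 1.0, 0.5])           # hypothetical covariance spectrum
print(round(rate_at_distortion(lam, D=1.0), 3))  # 5.0
```

Tightening the distortion budget $D$ raises the required rate, mirroring how stronger gating (more compression) discards more of the history.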
Stability and mean-square convergence are guaranteed under spectral norm and Lipschitz conditions on the base transition and gating functions:
- If the spectral norm satisfies $\|A\|_2 < 1$ and the gating function is $L$-Lipschitz (so the composed update remains a contraction), the process converges to a unique stationary distribution, certified by Lyapunov arguments (Bhat, 2024).
- For controlled memory usage, only the last $m$ shallow states need to be buffered for dense lateral updates.
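The contraction argument can be checked numerically: under a shared input sequence, trajectories from different initial states collapse together at rate $\|A\|_2^t$. A sketch with an explicitly rescaled transition (the setup is illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
d = 6

# Contractive base transition: rescale so the spectral norm is exactly 0.9.
A = rng.normal(size=(d, d))
A *= 0.9 / np.linalg.norm(A, 2)

# Two trajectories from different initial states, driven by the same inputs,
# contract toward each other -- consistent with a unique stationary law.
h1, h2 = rng.normal(size=d), rng.normal(size=d)
for x in rng.normal(size=(200, d)):
    h1 = A @ h1 + 0.1 * x
    h2 = A @ h2 + 0.1 * x

print(np.linalg.norm(h1 - h2) < 1e-6)  # True: gap shrinks like 0.9**t
```

With input-dependent $A_t$, the same argument requires the bound to hold uniformly (or on average) over the gate's range, which is what the Lipschitz condition supplies.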
5. Empirical Performance and Expressiveness Benchmarks
Empirical studies demonstrate the advantage of SD-SSMs over both RNNs and diagonal SSMs:
- Language Modeling: DenseSSM models (DenseRetNet, DenseMamba) at 350M to 1.3B parameters yield perplexity reductions on WikiText-103 and LAMBADA benchmarks, with up to +5% accuracy over baseline RetNet/Mamba at efficiency comparable to the baselines (He et al., 2024). Ablations show $m = 2$ lateral links to be optimal under parameter and memory budgets.
- Formal Languages (FSA Emulation): SD-SSMs attain perfect or near-perfect accuracy on emulation of both commutative and non-commutative regular languages, generalizing out-of-distribution to arbitrarily long sequences—a feat not matched by diagonal or block-diagonal selective SSMs (Terzić et al., 2024).
- Time-Series Forecasting and NLP: On tasks such as electricity load prediction and Penn Treebank language modeling, SD-SSMs achieve lower error and perplexity while using 30–40% less memory than LSTM and GRU baselines (Bhat, 2024).
Table: SD-SSM Performance Highlights
| Task/Metric | SD-SSM | Baseline | Improvement |
|---|---|---|---|
| Language modeling (PPL) | 2.55 | 2.57 (RetNet) | Lower perplexity, +3–5% acc. |
| Time-series forecasting (RMSE) | 0.152 | 0.168 (LSTM) | Lower RMSE, less memory |
| Regular languages (avg. acc.) | >99.9% | ≤54% (diagonal SSM) | Perfect length generalization |
6. Design Considerations, Limitations, and Future Directions
SD-SSMs enable expressivity beyond monomial sequence statistics, are universal for sequence-to-point functionals, and scale subquadratically with sequence length. Distinctive advantages include:
- Universal emulation of regular languages (including non-commutative automata) in a single layer (Terzić et al., 2024).
- Full high-order path signature representation for arbitrary functionals (Cirone et al., 2024).
- Efficient memory compression and adaptive state selection via dense gating (Bhat, 2024).
Trade-offs and open directions:
- Dense gating incurs $O(d^2)$ per-step cost; diagonal or chained blocks offer $O(d)$ cost at a slight reduction in expressivity (Cirone et al., 2024).
- Parameter and compute overhead grow with the number $m$ of lateral dense links or the transition dictionary size $K$, but remain manageable (an additional 2–3% in DenseSSM) (He et al., 2024).
- Fixed connectivity or link count may be suboptimal; learning the connectivity or employing dynamic sparsity masks could further optimize the efficiency–accuracy trade-off.
- Application beyond language modeling, such as to S4/Hyena-style architectures or via cross-attentive fusion mechanisms, presents promising research directions (He et al., 2024).
A plausible implication is that by modulating gating and connection strategies, future SD-SSMs can be tuned precisely for the computational and expressivity needs of diverse domains, including long-context language, algorithmic reasoning, and memory-constrained sequence computation.