Dense Selective State-Space Models

Updated 23 February 2026
  • Dense Selective State-Space Models are neural architectures that fuse dense, input-driven transitions with gating mechanisms to capture high-order token interactions.
  • They integrate shallow hidden states via lateral connections to preserve low-level features and enable superior performance in language modeling and formal sequence tasks.
  • Empirical benchmarks reveal these models achieve lower perplexity and error rates while using memory efficiently, outperforming traditional RNNs and Transformers.

Dense Selective State-Space Models (SD-SSM) are advanced neural sequence modeling architectures that generalize classical state-space models by introducing dense, input-dependent transition mechanisms and explicit pathways for fine-grained information flow. SD-SSMs combine the structured recurrence of SSMs with selective, content-driven gating and dense inter-layer connections to achieve superior expressivity and computational efficiency compared to both RNNs and Transformers, excelling in both language modeling and formal sequence emulation tasks.

1. Architectural Foundations and Variants

Dense Selective State-Space Models are characterized by two core innovations relative to standard SSMs: (i) dense, selective (input- and state-dependent) transitions within each recurrent block, and (ii) dense hidden state integration across different layers or over a dictionary of transition matrices.

At the block level, an SD-SSM replaces the time-invariant recurrence

$$h_{t} = A\,h_{t-1} + B\,x_{t}$$

with a recurrence in which $A$ becomes an input-driven, content-selective operator:

$$h_{t} = A\,h_{t-1} + B\,x_{t} + \left(C\,h_{t-1}\right) \odot \left(D\,x_{t}\right)$$

or, equivalently,

$$h_t = \bar{A}(x_t)\,h_{t-1} + \bar{B}(x_t)\,x_t$$

where $\bar{A}(x_t)$ and $\bar{B}(x_t)$ are dense functions of the current input. This architecture, encompassing both dense internal mixing and multiplicative gating, achieves nontrivial high-order token interactions, in contrast to classical SSMs, which are limited to fixed convolutional memory and linear updates (Cirone et al., 2024, Terzić et al., 2024).
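
The following NumPy sketch runs one step of this recurrence and checks the equivalence of the gated additive form and the input-dependent $\bar{A}(x_t)$ form; the dimensions, random weights, and the particular factoring $\bar{A}(x_t) = A + \operatorname{diag}(D\,x_t)\,C$ are illustrative assumptions, not a published implementation.

```python
# One dense selective recurrence step (illustrative shapes and random weights).
import numpy as np

rng = np.random.default_rng(0)
d = 4                                                   # hidden-state dimension (assumed)
A, B, C, D = (rng.standard_normal((d, d)) * 0.1 for _ in range(4))

def step(h_prev, x_t):
    """h_t = A h_{t-1} + B x_t + (C h_{t-1}) * (D x_t)  (elementwise product)."""
    return A @ h_prev + B @ x_t + (C @ h_prev) * (D @ x_t)

def step_selective(h_prev, x_t):
    """Equivalent form h_t = A_bar(x_t) h_{t-1} + B x_t with A_bar(x_t) = A + diag(D x_t) C."""
    A_bar = A + np.diag(D @ x_t) @ C                    # dense, input-dependent transition
    return A_bar @ h_prev + B @ x_t

h_prev, x_t = rng.standard_normal(d), rng.standard_normal(d)
assert np.allclose(step(h_prev, x_t), step_selective(h_prev, x_t))
```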

When extended across depth, DenseSSMs (e.g., DenseMamba, DenseRetNet) insert explicit, gated lateral connections from shallow hidden states to deeper layers, enabling direct propagation of low-level features and mitigating state degradation. At each layer $\ell$, shallow hidden states $h_t^{(\ell-k)}$ ($k = 1, \dots, m$) are linearly projected and adaptively gated, then fused additively into the deeper $h_t^{(\ell)}$ prior to output computation (He et al., 2024).
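
A minimal sketch of this layerwise fusion, assuming a sigmoid two-layer MLP gate and one square projection per lateral link (the exact parameterization in DenseMamba/DenseRetNet may differ):

```python
# Gated lateral fusion of m shallow hidden states into a deeper layer (illustrative).
import numpy as np

rng = np.random.default_rng(1)
d, m = 8, 2                                             # state width and number of lateral links (assumed)

W_proj = [rng.standard_normal((d, d)) * 0.1 for _ in range(m)]   # linear projections
W1 = [rng.standard_normal((d, d)) * 0.1 for _ in range(m)]       # gate MLP, layer 1
W2 = [rng.standard_normal((d, d)) * 0.1 for _ in range(m)]       # gate MLP, layer 2

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dense_fuse(h_deep, shallow_states):
    """Additively fuse h_t^(l) with projected, gated h_t^(l-k), k = 1..m."""
    fused = h_deep.copy()
    for k, h_shallow in enumerate(shallow_states):
        proj = W_proj[k] @ h_shallow                    # project the shallow state
        gate = sigmoid(W2[k] @ np.maximum(W1[k] @ h_shallow, 0.0))  # two-layer MLP gate
        fused += gate * proj                            # element-wise gated addition
    return fused

h_l = rng.standard_normal(d)
laterals = [rng.standard_normal(d) for _ in range(m)]   # h_t^(l-1), h_t^(l-2)
fused = dense_fuse(h_l, laterals)                       # fed into the layer's output computation
```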

2. Mathematical Formulation and Theoretical Guarantees

The core dynamical equation in a dense selective SSM is

$$h_{t} = A\,h_{t-1} + B\,x_{t} + \left(C\,h_{t-1}\right) \odot \left(D\,x_{t}\right)$$

or, in the gating formulation,

$$\bar{A}(x_t) = A + \operatorname{diag}\left(D\,x_t\right) C$$

where $C$ and $D$ are dense matrices and the gating may depend on both $x_t$ and $h_{t-1}$.

In modern implementations, the gating is performed by a deep, cross-dimensional nonlinear function, e.g.,

$$g(x_t, u_t) = \sigma\left(W_x x_t + W_u u_t + W_{xu} \left(x_t \odot u_t\right) + b\right)$$

which produces dense, feature-level gates for coordinated selection (Bhat, 2024).
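
A direct NumPy transcription of this gate, taking $\sigma$ to be the logistic sigmoid and all weight matrices square (both illustrative assumptions):

```python
# Cross-dimensional gate g(x_t, u_t) = sigma(W_x x_t + W_u u_t + W_xu (x_t * u_t) + b).
import numpy as np

rng = np.random.default_rng(2)
d = 6                                                   # feature dimension (assumed)
W_x, W_u, W_xu = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
b = np.zeros(d)

def gate(x_t, u_t):
    z = W_x @ x_t + W_u @ u_t + W_xu @ (x_t * u_t) + b  # includes the multiplicative cross term
    return 1.0 / (1.0 + np.exp(-z))                     # logistic sigmoid: values in (0, 1)

x_t, u_t = rng.standard_normal(d), rng.standard_normal(d)
print(gate(x_t, u_t))                                   # per-feature selection strengths
```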

Theoretically, the expressive power of dense selective SSMs is characterized via Rough Path Theory: the hidden state at each time is a low-dimensional projection of the path signature of the (potentially continuous) input sequence, i.e., a universal feature set capable of representing all continuous functionals of the input trajectory. Dense gating captures arbitrary high-order interactions with linear parameter scaling in length, while diagonal gating captures only commutative (monomial) signature components (Cirone et al., 2024).

Strong universality theorems establish that (i) any continuous path-to-point functional can be approximated by a dense SD-SSM of sufficient width and (ii) finite-parameter random initializations suffice for this universality, with only the readout requiring training as width increases (Cirone et al., 2024). In the finite state domain, SD-SSMs can emulate any deterministic finite-state automaton, including non-commutative automata, with perfect length generalization in a single layer—a capability unattainable by diagonal or block-diagonal SSMs (Terzić et al., 2024).
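
The finite-state claim can be made concrete with a toy example: if the hidden state is kept one-hot and each input symbol selects an exact transition matrix, an input-dependent dense recurrence reproduces a non-commutative automaton at any length. The two-state automaton below is made up for illustration and is not drawn from the cited papers.

```python
# Toy non-commutative automaton emulated by an input-selected dense transition.
import numpy as np

# Over {a, b}: 'a' swaps the two states, 'b' resets to state 0 (so ab != ba).
TRANSITIONS = {
    "a": np.array([[0.0, 1.0], [1.0, 0.0]]),            # column j is the one-hot successor of state j
    "b": np.array([[1.0, 1.0], [0.0, 0.0]]),
}

def emulate(word, start=0):
    h = np.eye(2)[start]                                 # one-hot hidden state
    for sym in word:
        h = TRANSITIONS[sym] @ h                         # h_t = A(x_t) h_{t-1}, exact at every step
    return int(np.argmax(h))

print(emulate("ab"), emulate("ba"))                      # 0 1: order matters, with no length degradation
```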

3. Implementation Strategies and Computational Complexity

Several concrete SD-SSM constructions exist:

  • DenseSSM/DenseMamba: Each higher layer aggregates (with linear projections and MLP-based gates) the hidden states from the last $m$ shallower layers. Gating is typically realized by two-layer MLPs, with parameter and memory overhead scaling as $O(m d^2)$ for $d$-dimensional states. Fusion is additive, and the approach is backbone-agnostic, applicable to Mamba, RetNet, and related models (He et al., 2024).
  • Dictionary-based SD-SSM: Each time step selects a convex combination of $k$ learned $n \times n$ transition matrices via a softmax mechanism on the embedded input, forming the new recurrence operator (a minimal sketch follows this list). Stability is ensured via column-wise normalization; outputs are produced by LayerNorm followed by a linear projection (Terzić et al., 2024).
  • Selective Gating Compression: Dense gating enables memory and computation pruning by dynamically masking irrelevant hidden-state dimensions per time step. The selective mask $g(x_t, u_t)$ drives a variable effective dimension and reduces computational load to $O(\alpha^2 d^2 + \alpha d m)$, where $\alpha$ is the average gate density (Bhat, 2024).
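
The dictionary-based recurrence mentioned above can be sketched as a single step; the input embedding, dictionary size, and the softmax selector are illustrative assumptions rather than the published configuration.

```python
# One step of a dictionary-based selective recurrence (illustrative sizes).
import numpy as np

rng = np.random.default_rng(3)
n, k, d_in = 5, 4, 3                                     # state dim, dictionary size, input-embedding dim

dictionary = np.abs(rng.standard_normal((k, n, n)))
dictionary /= dictionary.sum(axis=1, keepdims=True)      # column-wise normalization for stability

W_sel = rng.standard_normal((k, d_in)) * 0.1             # maps the embedded input to k selection logits

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def step(h_prev, x_embed):
    alpha = softmax(W_sel @ x_embed)                     # convex combination weights over the dictionary
    A_t = np.einsum("k,kij->ij", alpha, dictionary)      # input-selected dense transition operator
    return A_t @ h_prev                                  # LayerNorm and a linear readout would follow

h = np.ones(n) / n
print(step(h, rng.standard_normal(d_in)))
```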

Parallelization is preserved in both layerwise integration and sequence processing. DenseSSM-style architectures retain $O(N d L)$ training complexity and $O(1)$ per-token inference. Dictionary-based SD-SSMs leverage parallel scan algorithms for efficient GPU computation on long sequences, with significant wallclock advantages over sequential RNN-style updates (He et al., 2024, Terzić et al., 2024).
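
The scan-friendliness rests on the fact that the per-step affine maps $h \mapsto \bar{A}_t h + \bar{b}_t$ compose associatively. The sketch below only verifies that combine rule against a sequential rollout; a real implementation would evaluate it with a parallel (Blelloch-style) scan rather than the serial fold shown here.

```python
# Associative combine rule behind parallel-scan evaluation of h_t = A_t h_{t-1} + b_t.
import numpy as np

def combine(first, second):
    """Compose affine maps: apply (A1, b1) first, then (A2, b2)."""
    A1, b1 = first
    A2, b2 = second
    return A2 @ A1, A2 @ b1 + b2

rng = np.random.default_rng(4)
n, T = 3, 8
steps = [(rng.standard_normal((n, n)) * 0.3, rng.standard_normal(n)) for _ in range(T)]

h = np.zeros(n)                                          # sequential reference rollout from h_0 = 0
for A_t, b_t in steps:
    h = A_t @ h + b_t

A_total, b_total = steps[0]                              # fold all steps into one affine map
for s in steps[1:]:
    A_total, b_total = combine((A_total, b_total), s)

assert np.allclose(h, A_total @ np.zeros(n) + b_total)   # same result as the sequential rollout
```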

4. Information-Theoretic and Stability Analysis

SD-SSMs can be interpreted as dynamic, information-theoretic encoders that balance history retention against hidden-state compression. Let $U = u_{1:t}$ and $H = h_t$. Mutual information $I(U; H)$ quantifies the fidelity of historical information in the current hidden state. The gating function shapes the rate–distortion curve $R(D)$, which expresses the trade-off between compression rate and tolerable distortion (Bhat, 2024):

$$R(D) = \frac{1}{2} \sum_{i=1}^{d} \max\left(0, \log\frac{\lambda_i}{D}\right)$$

with $\lambda_i$ the eigenvalues of the hidden-state covariance.
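
A small numeric illustration of this expression, with a fabricated hidden-state sample standing in for real activations:

```python
# Evaluate R(D) = 1/2 * sum_i max(0, log(lambda_i / D)) on a toy covariance spectrum.
import numpy as np

def rate(eigvals, D):
    return 0.5 * np.sum(np.maximum(0.0, np.log(eigvals / D)))

rng = np.random.default_rng(5)
H = rng.standard_normal((1000, 4)) * np.array([2.0, 1.0, 0.5, 0.1])   # toy hidden states, uneven variances
eigvals = np.linalg.eigvalsh(np.cov(H, rowvar=False))                 # hidden-state covariance eigenvalues

for D in (0.01, 0.1, 1.0):
    print(f"D = {D:5.2f}  ->  R(D) = {rate(eigvals, D):.3f} nats")     # rate falls as tolerated distortion grows
```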

Stability and mean-square convergence are guaranteed under spectral norm and Lipschitz conditions on the base transition and gating functions:

  • If $\|A\|_2 \leq \rho < 1/L_G$ and $g$ is $L_G$-Lipschitz, the process converges to a unique stationary distribution, certified by Lyapunov arguments (Bhat, 2024); a small numeric check is sketched after this list.
  • For controlled memory usage, only the last $m$ shallow states need to be buffered for dense lateral updates.
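
The spectral-norm condition in the first bullet can be checked numerically; the Lipschitz constant and the rescaling of $A$ below are assumptions chosen so the bound holds.

```python
# Toy check of ||A||_2 <= rho < 1/L_G for an assumed gate Lipschitz constant.
import numpy as np

rng = np.random.default_rng(6)
d, L_G = 4, 0.5                                          # assume the gate g is 0.5-Lipschitz
rho = 0.9 / L_G                                          # target spectral norm, with rho < 1/L_G

A = rng.standard_normal((d, d))
A *= rho / np.linalg.norm(A, 2)                          # rescale so ||A||_2 = rho

contraction = np.linalg.norm(A, 2) * L_G                 # the condition keeps this product below 1
assert contraction < 1.0
print(f"||A||_2 = {np.linalg.norm(A, 2):.3f}, ||A||_2 * L_G = {contraction:.3f}")
```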

5. Empirical Performance and Expressiveness Benchmarks

Empirical studies demonstrate the advantage of SD-SSMs over both RNNs and diagonal SSMs:

  • Language Modeling: DenseSSM models (DenseRetNet, DenseMamba) at 350M to 1.3B parameters yield perplexity reductions on WikiText-103 and LAMBADA benchmarks, with up to +5% accuracy over baseline RetNet/Mamba and efficiency comparable to smaller baselines (He et al., 2024). Ablations show $m = 2$ lateral links to be optimal under parameter and memory budgets.
  • Formal Languages (FSA Emulation): SD-SSMs attain perfect or near-perfect accuracy on emulation of both commutative and non-commutative regular languages, generalizing out-of-distribution to arbitrarily long sequences—a feat not matched by diagonal or block-diagonal selective SSMs (Terzić et al., 2024).
  • Time-Series Forecasting and NLP: On tasks such as electricity load prediction and Penn Treebank language modeling, SD-SSMs achieve lower error and perplexity while using 30–40% less memory than LSTM and GRU baselines (Bhat, 2024).

Table: SD-SSM Performance Highlights

Task / Metric | SD-SSM | Baseline | Improvement
Language modeling (perplexity) | 2.55 | 2.57 (RetNet) | Lower perplexity, +3–5% accuracy
Time-series forecasting (RMSE) | 0.152 | 0.168 (LSTM) | Lower RMSE, less memory
Regular languages (avg. accuracy) | >99.9% | ≤54% (diagonal SSM) | Perfect length generalization

6. Design Considerations, Limitations, and Future Directions

SD-SSMs enable expressivity beyond monomial sequence statistics, are universal for sequence-to-point functionals, and scale subquadratically with sequence length. Distinctive advantages include:

  • Universal emulation of regular languages (including non-commutative automata) in a single layer (Terzić et al., 2024).
  • Full high-order path signature representation for arbitrary functionals (Cirone et al., 2024).
  • Efficient memory compression and adaptive state selection via dense gating (Bhat, 2024).

Trade-offs and open directions:

  • Dense gating incurs $O(N^2 L)$ cost; diagonal or chained blocks offer lower cost at a slight reduction in expressivity (Cirone et al., 2024).
  • Parameter and compute overhead grow with the number of lateral dense links $m$ or the transition dictionary size $k$, but remain manageable (an additional 2–3% in DenseSSM) (He et al., 2024).
  • A fixed connectivity pattern or link count $m$ may be suboptimal; learning $m$ or employing dynamic sparsity masks could further improve the efficiency–accuracy trade-off.
  • Application beyond language modeling, such as to S4/Hyena-style architectures or via cross-attentive fusion mechanisms, presents a promising research direction (He et al., 2024).

A plausible implication is that by modulating gating and connection strategies, future SD-SSMs can be tuned precisely for the computational and expressivity needs of diverse domains, including long-context language, algorithmic reasoning, and memory-constrained sequence computation.
