Non-Causal Selective State Space Models
- Non-causal selective state space models are frameworks that remove standard causal restrictions, enabling bidirectional token interactions and global context fusion.
- They employ per-token parameter selection and selective aggregation mechanisms to achieve efficient linear time complexity across varied deep learning tasks.
- These models are applied in long-term forecasting and dense perception in vision, offering improved global receptive fields and superior performance.
A non-causal selective state space is a state space model framework in which token-wise (or time-step–wise) transitions, parameter selection, and aggregation mechanisms do not inherit the standard “past-to-future” causality restrictions, but rather allow bidirectional or global temporal and/or spatial context fusion. In selective state space models (SSMs), per-token parameters and explicit selection structure encode which past or present inputs can influence predictions. Non-causal selective SSMs remove the lower-triangular (causal) mask, yielding all-to-all (symmetric) dependency patterns while preserving the parameter selectivity and efficient linear time complexity that distinguish recent SSM-based deep learning models from both transformers and classical strictly-causal dynamical systems (Shi et al., 26 Jul 2024, Cai et al., 26 May 2024, Anand et al., 16 Nov 2025).
1. Selective State Space Models: Causal and Non-Causal Forms
Selective state space models generalize classical state-space systems by allowing each token a distinct set of transition parameters $(A_t, B_t, C_t)$, yielding token-varying filters and projections. In the standard causal organization, the hidden state at time $t$ follows

$$h_t = A_t h_{t-1} + B_t x_t, \qquad y_t = C_t^\top h_t,$$

where the output $y_t$ only depends on inputs $x_1$ through $x_t$, enforcing lower-triangular (causal) structure (Shi et al., 26 Jul 2024, Anand et al., 16 Nov 2025). In the quadratic form, the output aggregates causally:

$$y_t = \sum_{s=1}^{t} C_t^\top \Big( \prod_{j=s+1}^{t} A_j \Big) B_s\, x_s,$$

with the state transition chain $\prod_{j=s+1}^{t} A_j$ carrying the contribution of input $x_s$ forward to position $t$.
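The equivalence of the recurrent and quadratic (masked) forms can be checked numerically. Below is a minimal sketch assuming scalar per-token transitions $A_t$ and randomly drawn parameters; all names and shapes are illustrative, not any published implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_state = 6, 4  # sequence length, state dimension

# Per-token (selective) parameters: scalar A_t for simplicity,
# vector B_t and C_t, and scalar inputs x_t.
A = rng.uniform(0.5, 1.0, size=T)        # token-varying transitions
B = rng.standard_normal((T, d_state))
C = rng.standard_normal((T, d_state))
x = rng.standard_normal(T)

# Recurrent form: h_t = A_t h_{t-1} + B_t x_t, y_t = C_t . h_t
h = np.zeros(d_state)
y_rec = np.empty(T)
for t in range(T):
    h = A[t] * h + B[t] * x[t]
    y_rec[t] = C[t] @ h

# Quadratic form: y_t = sum_{s<=t} C_t . (prod_{j=s+1..t} A_j) B_s x_s
y_quad = np.empty(T)
for t in range(T):
    acc = 0.0
    for s in range(t + 1):
        chain = np.prod(A[s + 1:t + 1])  # state-transition chain (empty -> 1)
        acc += chain * (C[t] @ B[s]) * x[s]
    y_quad[t] = acc

assert np.allclose(y_rec, y_quad)
```

Both routes produce identical outputs; the quadratic form simply unrolls the recurrence into an explicit lower-triangular aggregation.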
In non-causal selective SSMs, the restriction to past inputs is lifted by eliminating or symmetrizing the causal mask. The selection operator $L_{ts}$, equal to $1$ for $s \le t$ and $0$ otherwise, is replaced by an all-ones matrix, and recurrences are replaced by global accumulation over all tokens (with per-token weights $a_s$ produced by the selection mechanism):

$$H = \sum_{s=1}^{N} a_s\, B_s\, x_s^\top,$$

and each output is computed via shared aggregation:

$$y_t = C_t^\top H.$$

Thus, every output token mixes global context, not just the historical prefix (Anand et al., 16 Nov 2025, Shi et al., 26 Jul 2024).
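Swapping the lower-triangular mask for an all-ones matrix collapses the computation onto a single shared state, which the sketch below verifies; the shapes and random weights are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
T, d_state, d = 5, 4, 3

a = rng.uniform(0.5, 1.0, size=T)   # per-token scalar selection weights
B = rng.standard_normal((T, d_state))
C = rng.standard_normal((T, d_state))
X = rng.standard_normal((T, d))     # value projections

def ssm_outputs(mask):
    """Masked aggregation: y_t = sum_s mask[t, s] * a_s * (C_t . B_s) * x_s."""
    Y = np.zeros((T, d))
    for t in range(T):
        for s in range(T):
            Y[t] += mask[t, s] * a[s] * (C[t] @ B[s]) * X[s]
    return Y

causal = ssm_outputs(np.tril(np.ones((T, T))))   # prefix-only context
noncausal = ssm_outputs(np.ones((T, T)))         # all-to-all context

# The non-causal case factors through one shared state H = sum_s a_s B_s x_s^T
H = np.einsum('s,sn,sd->nd', a, B, X)
Y_shared = np.einsum('tn,nd->td', C, H)
assert np.allclose(noncausal, Y_shared)
```

With the all-ones mask, every row of the output reads the same global state $H$, so the explicit double loop is never needed in practice.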
2. Mathematical Mechanisms for Achieving Non-Causality
Non-causality is instantiated algorithmically through bidirectional scan, multi-scan aggregation, or full matrix contraction over all sequence positions. The process involves:
- Computing per-token projections $(B_t, C_t, x_t)$ from the input
- Forming a weighting scalar $a_t$ per token, typically as the reciprocal of a learned parameter $\Delta_t$, i.e. $a_t = 1/\Delta_t$
- Aggregating globally, e.g. $H = \sum_{s=1}^{N} a_s\, B_s\, x_s^\top$
- Applying per-token post-projections: $y_t = C_t^\top H$
This approach discards the temporal (or spatial/pixel) ordering bias introduced by sequential or block-lower-triangular state update rules. In visual domains, information can flow between all pixel/patch tokens; in time series with interleaved variable scans, global mixing allows arbitrary context fusion (Shi et al., 26 Jul 2024, Anand et al., 16 Nov 2025, Cai et al., 26 May 2024).
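The steps above can be sketched end-to-end. The projection weights, the reciprocal weighting $a_t = 1/\Delta_t$, and all shapes below are assumptions for illustration, not a specific published implementation; the permutation check at the end demonstrates that the aggregation carries no ordering bias:

```python
import numpy as np

rng = np.random.default_rng(2)
T, d, d_state = 8, 16, 4

# Hypothetical learned weights (random stand-ins)
W_B = rng.standard_normal((d, d_state)) / np.sqrt(d)
W_C = rng.standard_normal((d, d_state)) / np.sqrt(d)
W_delta = rng.standard_normal(d) / np.sqrt(d)
W_out = rng.standard_normal((d, d)) / np.sqrt(d)

def noncausal_selective_ssm(X):
    """One non-causal selective SSM mixing step (illustrative shapes)."""
    B = X @ W_B                    # per-token projections B_t   (T, d_state)
    C = X @ W_C                    # per-token projections C_t   (T, d_state)
    delta = np.exp(X @ W_delta)    # positive learned parameter per token (T,)
    a = 1.0 / delta                # weighting as reciprocal of delta
    H = np.einsum('t,tn,td->nd', a, B, X)   # global aggregation over tokens
    Y = np.einsum('tn,nd->td', C, H)        # shared-state readout
    return Y @ W_out               # per-token post-projection

X = rng.standard_normal((T, d))
Y = noncausal_selective_ssm(X)

# Global mixing is order-free: permuting tokens permutes outputs identically.
assert np.allclose(noncausal_selective_ssm(X[::-1]), Y[::-1])
```

Because $H$ is a plain sum over tokens, reversing (or arbitrarily permuting) the input sequence simply permutes the outputs, confirming the absence of sequential ordering bias.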
Table: Causal vs Non-Causal Selective SSM Aggregation
| Property | Causal SSM | Non-Causal Selective SSM |
|---|---|---|
| Context Mask | Lower-triangular (prefix-only) | All-ones (global) |
| State Update | Recurrence $h_t = A_t h_{t-1} + B_t x_t$ | Global sum $H = \sum_s a_s B_s x_s^\top$ |
| Output Calculation | $y_t = C_t^\top h_t$ | $y_t = C_t^\top H$ |
3. The Temporal Mamba Block and Variable-Aware Strategies
In long-term sequence forecasting, the MambaTS model demonstrates the removal of causality in selective SSMs through the Temporal Mamba Block (TMB): the causal convolution of the standard Mamba block is omitted and replaced by a parameterized dropout, and the sequential scan is replaced by a non-causal SSM scan.
This structure allows bidirectional information flow, and no explicit time ordering is imposed on input aggregation (Cai et al., 26 May 2024).
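A heavily hedged schematic of such a block: the layer shapes, the simplified global scan, and the dropout placement below are illustrative stand-ins for exposition, not the MambaTS implementation:

```python
import numpy as np

rng = np.random.default_rng(3)
T, d, n = 10, 8, 4  # time steps, channels, state size (arbitrary)

# Hypothetical learned weights (random stand-ins)
W_in = rng.standard_normal((d, d)) / np.sqrt(d)
W_B = rng.standard_normal((d, n)) / np.sqrt(d)
W_C = rng.standard_normal((d, n)) / np.sqrt(d)
W_out = rng.standard_normal((d, d)) / np.sqrt(d)

def temporal_mamba_block(X, p_drop=0.1, training=True):
    """Schematic TMB: dropout stands in where a causal convolution
    would sit, and the scan is a non-causal global aggregation."""
    U = X @ W_in                              # input projection
    if training:                              # parameterized dropout in
        keep = rng.random(U.shape) > p_drop   # place of the causal conv
        U = U * keep / (1.0 - p_drop)
    B, C = U @ W_B, U @ W_C                   # per-token selection
    H = B.T @ U                               # global (non-causal) state
    return (C @ H) @ W_out                    # every step sees full window

X = rng.standard_normal((T, d))
Y = temporal_mamba_block(X, training=False)
assert Y.shape == (T, d)
```

In this sketch each output row depends on the whole window through the shared state $H$, so no time ordering is imposed on aggregation.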
The Variable-Aware Scan along Time (VAST) algorithm dynamically discovers the permutation of variables (channels) that minimizes predictive loss, optimizing the channel scan order via an asymmetric TSP solved through simulated annealing. This ensures that the non-causal context harnessed by the SSM is maximally beneficial for multi-variate time series without spurious ordering bias (Cai et al., 26 May 2024).
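The scan-order search can be sketched as simulated annealing over an asymmetric, TSP-style cost matrix. The cost construction, the 2-opt move set, and the cooling schedule below are illustrative assumptions rather than the published VAST procedure (where the cost would be estimated from validation loss):

```python
import numpy as np

def anneal_scan_order(cost, n_iter=4000, temp0=1.0, cooling=0.999, seed=0):
    """Simulated annealing over an asymmetric cost matrix.

    cost[i, j] is the penalty for scanning variable j right after i.
    """
    rng = np.random.default_rng(seed)
    V = cost.shape[0]

    def tour_cost(order):
        return float(sum(cost[order[k], order[k + 1]] for k in range(V - 1)))

    cur = rng.permutation(V)
    cur_cost = tour_cost(cur)
    best, best_cost = cur.copy(), cur_cost
    temp = temp0
    for _ in range(n_iter):
        i, j = sorted(rng.choice(V, size=2, replace=False))
        cand = cur.copy()
        cand[i:j + 1] = cand[i:j + 1][::-1]   # 2-opt style segment reversal
        c = tour_cost(cand)
        # Accept improvements always, uphill moves with Boltzmann probability
        if c < cur_cost or rng.random() < np.exp((cur_cost - c) / temp):
            cur, cur_cost = cand, c
            if c < best_cost:
                best, best_cost = cand.copy(), c
        temp *= cooling
    return best, best_cost

# A toy cost matrix whose cheapest chain is 0 -> 1 -> ... -> 5
cost = np.ones((6, 6))
cost[np.arange(5), np.arange(1, 6)] = 0.0
order, c = anneal_scan_order(cost)
assert sorted(order.tolist()) == list(range(6))
```

The returned permutation is then used as the channel scan order; because the cost matrix is asymmetric, scanning $j$ after $i$ need not cost the same as the reverse.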
4. Non-Causal State Space Models in Vision and Dense Perception
In computer vision, the lack of natural time ordering in spatial arrangements justifies fully non-causal SSMs. The VSSD and DensePercept-NCSSD architectures deploy non-causal selective state-space duality (SSD) blocks, in which all tokens (patches/pixels) attend symmetrically to global content. Implementation proceeds by:
- Per-token generation of scalars $a_t$, matrices $B_t$ and $C_t$, and value projections $x_t$
- Tensor contractions aggregating all tokens into a single global hidden state $H = \sum_s a_s B_s x_s^\top$
- Per-token output projections by the local $C_t$: $y_t = C_t^\top H$
The procedure is parallelizable, linear in sequence length $N$, and empirically yields superior accuracy, latency, and memory usage compared to causal SSMs and attention-based transformers in image classification, detection, and segmentation tasks (Shi et al., 26 Jul 2024, Anand et al., 16 Nov 2025).
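The linearity claim can be made concrete: the global-state contraction produces exactly the same outputs as explicitly materializing the $N \times N$ all-to-all mixing matrix, while never forming that matrix. Shapes and random weights below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)
N, d, n = 64, 32, 8   # tokens (patches), channels, state size

a = rng.uniform(size=N)            # per-token scalars
B = rng.standard_normal((N, n))    # per-token B_t
C = rng.standard_normal((N, n))    # per-token C_t
X = rng.standard_normal((N, d))    # value projections

# Linear route: contract tokens into one global state, O(N * n * d)
H = np.einsum('s,sn,sd->nd', a, B, X)
Y_linear = C @ H

# Quadratic route: materialize the N x N all-to-all mixing matrix
M = np.einsum('tn,s,sn->ts', C, a, B)   # O(N^2) memory and compute
Y_quadratic = M @ X

assert np.allclose(Y_linear, Y_quadratic)
```

The linear route is the one used in practice; the quadratic route exists only to show what dependency pattern the contraction implicitly realizes, mirroring the contrast with attention's $O(N^2)$ cost.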
5. Theoretical and Empirical Advantages
Replacing causal constraints with non-causal aggregation yields:
- Global receptive field: Every output accesses information from the entire sequence or spatial field in a single layer, removing context decay effects present in lower-triangular scan schemes (Shi et al., 26 Jul 2024).
- Efficient parallelism: All token aggregations are parallel tensor contractions, removing sequential dependencies and accelerating both training and inference (Shi et al., 26 Jul 2024, Anand et al., 16 Nov 2025).
- Memory and computational efficiency: Linear $O(N)$ scaling in sequence length is retained, in contrast to the quadratic scaling of attention mechanisms (Anand et al., 16 Nov 2025).
- State-of-the-art accuracy: Empirical results indicate consistent improvements (e.g., 3–10% lower MSE in forecasting, 1–4% higher top-1 accuracy in vision) over both causal SSM and transformer baselines (Cai et al., 26 May 2024, Shi et al., 26 Jul 2024).
6. Relation to Graph-Based Non-Causality and Selective Connectivity
Classical non-causality conditions, such as Granger non-causality, are reflected in the structure of the state-space transition matrices. In the block-triangular (coordinated/“selective”) innovation representation, the absence of Granger causality imposes zeros in the lower-left blocks of , , and output matrices, encoding precisely which interactions are permitted among subsystems (Jozsa et al., 2016). While contemporary non-causal selective SSMs mostly implement full (all-to-all) context integration, the underlying selective state-space formalism can represent sparse or structured non-causal graphs, where token-wise selection operators and system matrices define arbitrary context patterns beyond mere global aggregation. This generalizes the span of SSMs to encode both coordination and independence properties in large-scale multi-agent or multi-channel systems (Jozsa et al., 2016).
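The block-triangular condition can be checked numerically: with zero lower-left blocks in the transition and input matrices, and a readout on the second subsystem, perturbing the first subsystem's inputs leaves the second subsystem's outputs untouched. Dimensions and matrices below are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(6)
n1, n2, m, T = 3, 2, 2, 25  # subsystem state dims, input dim, horizon

# Block-triangular realization: zero lower-left blocks mean subsystem 1's
# state and inputs never enter subsystem 2 (1 does not Granger-cause 2).
A = np.block([
    [0.4 * rng.standard_normal((n1, n1)), 0.4 * rng.standard_normal((n1, n2))],
    [np.zeros((n2, n1)),                  0.4 * rng.standard_normal((n2, n2))],
])
B = np.block([
    [rng.standard_normal((n1, m)), rng.standard_normal((n1, m))],
    [np.zeros((n2, m)),            rng.standard_normal((n2, m))],
])
C2 = np.hstack([np.zeros((1, n1)), rng.standard_normal((1, n2))])  # reads h2

def run(u):
    """Simulate h_{t+1} = A h_t + B u_t and record subsystem 2's output."""
    h = np.zeros(n1 + n2)
    ys = []
    for t in range(T):
        h = A @ h + B @ u[t]
        ys.append(C2 @ h)
    return np.array(ys)

u = rng.standard_normal((T, 2 * m))
u_alt = u.copy()
u_alt[:, :m] = rng.standard_normal((T, m))   # perturb subsystem-1 inputs only

# Subsystem 2's trajectory is identical: channel 1 carries nothing into it.
assert np.allclose(run(u), run(u_alt))
```

Perturbing the second subsystem's inputs instead does change the output, so the zeros in the lower-left blocks are precisely what encodes the one-directional (non-causal) structure.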