Non-Causal State Space Duality (NC-SSD)
- NC-SSD is a framework that extends state-space duality by symmetrically aggregating token contributions, removing causal constraints and enabling global receptive fields.
- It leverages efficient linear-time algorithms and bidirectional scan methods to fuse information from all tokens, streamlining computations compared to classical causal models.
- NC-SSD achieves state-of-the-art results in vision benchmarks while ensuring stability through controlled eigenvalue constraints and low-rank semiseparable approximations.
Non-Causal State Space Duality (NC-SSD) generalizes the duality between state-space models (SSMs) and attention mechanisms, extending the applicability of SSM-inspired architectures to domains where causality is neither natural nor required, such as vision. Unlike classical or causal SSMs—where outputs depend only on current and past inputs—NC-SSD computes outputs that symmetrically aggregate contributions from all tokens, regardless of position, via efficient, linear-time algorithms. This position-agnostic property enables global receptive fields, enhances performance across various vision benchmarks, and streamlines computations relative to prior bidirectional or multi-path approaches.
1. Classical and Causal State Space Models
The canonical continuous-time SSM is given by

$$h'(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t),$$

where $A \in \mathbb{R}^{N \times N}$, $B \in \mathbb{R}^{N \times 1}$, and $C \in \mathbb{R}^{1 \times N}$. Discretizing with step size $\Delta$:

$$h_t = \bar{A}\,h_{t-1} + \bar{B}\,x_t, \qquad y_t = C\,h_t,$$

with $\bar{A} = \exp(\Delta A)$, $\bar{B} = (\Delta A)^{-1}\left(\exp(\Delta A) - I\right)\Delta B$, and $\Delta$, $A$, $B$, $C$ learnable.
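As a concrete illustration, the zero-order-hold discretization above can be written in a few lines of NumPy/SciPy. This is a minimal sketch with illustrative shapes (and it assumes $\Delta A$ is invertible), not code from either cited paper:

```python
import numpy as np
from scipy.linalg import expm

N = 4
rng = np.random.default_rng(0)
A = -np.eye(N) + 0.1 * rng.standard_normal((N, N))   # continuous-time state matrix (illustrative)
B = rng.standard_normal((N, 1))                      # input matrix
delta = 0.1                                          # step size (learnable in practice)

A_bar = expm(delta * A)                                               # exp(Delta*A)
B_bar = np.linalg.solve(delta * A, A_bar - np.eye(N)) @ (delta * B)   # (Delta*A)^-1 (exp(Delta*A) - I) Delta*B
```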
The unrolled recurrence defines a causal 1D convolution kernel $\bar{K} = \left(C\bar{B},\, C\bar{A}\bar{B},\, \dots,\, C\bar{A}^{L-1}\bar{B}\right)$, and the sequence output is computed as $y = x * \bar{K}$. In the State Space Duality (SSD) formulation (notably, Mamba2), $A_t$ is restricted to scalars, allowing the recursion to be interpreted as a particular form of masked (causal) attention, i.e., each $y_t$ depends only on $x_1, \dots, x_t$. This corresponds to a lower-triangular kernel matrix $M$ with $M_{ij} = 0$ for $j > i$.
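The equivalence between the scalar-$A_t$ recurrence and a lower-triangular (masked-attention) kernel can be checked numerically. The following is a hedged sketch with illustrative shapes and names, not code from the cited papers:

```python
import numpy as np

L, N, d = 6, 4, 3                      # tokens, state dim, channel dim (illustrative)
rng = np.random.default_rng(0)
a = rng.uniform(0.5, 1.0, size=L)      # scalar A_t per token
B = rng.standard_normal((L, N))        # B_t rows
C = rng.standard_normal((L, N))        # C_t rows
x = rng.standard_normal((L, d))        # input tokens

# Recurrent form: h_t = a_t * h_{t-1} + B_t x_t^T,  y_t = h_t^T C_t
h = np.zeros((N, d))
y_rec = np.zeros((L, d))
for t in range(L):
    h = a[t] * h + np.outer(B[t], x[t])
    y_rec[t] = h.T @ C[t]

# Masked-attention form: M[i, j] = prod(a[j+1..i]) * (C_i . B_j) for j <= i, zero above the diagonal
M = np.zeros((L, L))
for i in range(L):
    for j in range(i + 1):
        M[i, j] = np.prod(a[j + 1:i + 1]) * (C[i] @ B[j])
y_att = M @ x

assert np.allclose(y_rec, y_att)       # identical outputs: a strictly causal, triangular kernel
```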
2. Rationale for Non-Causality and Vision-Specific Challenges
In vision, non-causality is intrinsic; there is no temporal or sequential restriction preventing any image patch from influencing any other. Flattening a $2$D patch grid into $1$D destroys true spatial locality: adjacent patches in $2$D may be distant in $1$D, leading to decay patterns unaligned with spatial adjacency. Prior approaches (e.g., ViM, VMamba, LocalMamba) employ multiple scan paths (forward, backward, diagonal, etc.) to mitigate causality and aggregate outputs, but such fusion is both implementation-heavy and still fails to recover true non-causal semantics.
NC-SSD addresses this by discarding the causal dependence entirely, enabling each token’s output to be computed from a global, symmetric combination of all tokens, reflecting the true non-causal structure required for vision tasks (Shi et al., 26 Jul 2024).
3. Derivation and Mathematical Structure of NC-SSD
3.1. Reinterpretation of Interaction Coefficients
In causal SSD, the scalar $A_t$ modulates retention of the previous state $h_{t-1}$ against the update from the current token $x_t$:

$$h_t = A_t\,h_{t-1} + B_t\,x_t, \qquad y_t = C_t\,h_t,$$

where causality inherently biases the output toward earlier tokens. Removing this recurrence yields a single position-agnostic state

$$H = \sum_{i=1}^{L} m_i\,B_i\,x_i, \qquad y_t = C_t\,H,$$

so causality is absent; each token contributes directly via its coefficient $m_i$ (a per-token weight derived from $A_i$; cf. the pseudocode in Section 4) irrespective of position.
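A tiny numerical contrast makes the reinterpretation concrete: in the causal recurrence, a token's influence on a later state is attenuated by a product of decays that depends on its position, whereas the non-causal per-token coefficient (cf. the pseudocode in Section 4, where the weight is `1.0 / A`) does not depend on where the token sits. The values below are illustrative only:

```python
import numpy as np

a = np.array([0.9, 0.8, 0.7, 0.6])      # illustrative scalar A_t per token
for t in range(4):
    causal_w = np.prod(a[1:t + 1])      # contribution of token 0 to h_t: prod_{k=1..t} a_k, decays with t
    print(t, causal_w)
noncausal_w = 1.0 / a[0]                # per-token coefficient m_0, independent of position
print(noncausal_w)
```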
3.2. Multi-Scan Fusion
Performing both forward and backward scans gives

$$h_t^{\mathrm{f}} = \sum_{i \le t} m_i\,B_i\,x_i, \qquad h_t^{\mathrm{b}} = \sum_{i \ge t} m_i\,B_i\,x_i.$$

Summing (and omitting the double-counted self-term, a negligible bias) yields

$$h_t^{\mathrm{f}} + h_t^{\mathrm{b}} \approx \sum_{i=1}^{L} m_i\,B_i\,x_i = H,$$

a global state decoupled from position; no ordering information persists.
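The claim can be checked with a few lines of NumPy. This is an illustrative sketch (names `m`, `B`, `x` are placeholders with assumed shapes), not the paper's implementation:

```python
import numpy as np

L, N, d = 8, 4, 3
rng = np.random.default_rng(1)
m = rng.uniform(0.1, 1.0, size=L)            # per-token weights
B = rng.standard_normal((L, N))
x = rng.standard_normal((L, d))

contrib = m[:, None, None] * np.einsum('ln,ld->lnd', B, x)   # m_i * B_i x_i^T, shape (L, N, d)
h_fwd = np.cumsum(contrib, axis=0)                           # h_t^f = sum over i <= t
h_bwd = np.cumsum(contrib[::-1], axis=0)[::-1]               # h_t^b = sum over i >= t
H_global = contrib.sum(axis=0)                               # position-agnostic global state

# For every position t, forward + backward double-counts only the self term:
assert np.allclose(h_fwd + h_bwd - contrib, H_global[None])
```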
3.3. Tensor and Einsum Formulation
Let $X \in \mathbb{R}^{L \times d}$ (tokens), $B \in \mathbb{R}^{L \times N}$, $C \in \mathbb{R}^{L \times N}$, $m \in \mathbb{R}^{L}$:

$$H = \sum_{i=1}^{L} m_i\,B_i\,X_i^{\top} \in \mathbb{R}^{N \times d}, \qquad Y = C\,H \in \mathbb{R}^{L \times d},$$

or equivalently, $H = (m \odot B)^{\top} X$, where $\odot$ denotes row-wise scaling and juxtaposition denotes matrix multiplication.
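In NumPy, the whole non-causal aggregation reduces to two einsums. This is a minimal sketch under the assumed shapes above:

```python
import numpy as np

L, N, d = 8, 4, 3
rng = np.random.default_rng(2)
m = rng.uniform(0.1, 1.0, L)
B = rng.standard_normal((L, N))
C = rng.standard_normal((L, N))
X = rng.standard_normal((L, d))

H = np.einsum('l,ln,ld->nd', m, B, X)     # global aggregation: sum_i m_i B_i X_i^T
Y = np.einsum('ln,nd->ld', C, H)          # broadcast/project back to all L tokens

assert np.allclose(H, (m[:, None] * B).T @ X)   # equivalent row-wise-scaled matmul form
```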
3.4. Generalized (Structured) Convolutional Duality
From a classical perspective (Dao et al., 31 May 2024), the non-causal convolution matrix can be written as

$$M = M_{\mathrm{low}} + M_{\mathrm{low}}^{\top} + D\,I,$$

where $M_{\mathrm{low}}$ is strictly lower-triangular with entries $M_{ij} = C_i^{\top} A_{i:j}^{\times} B_j$, $A_{i:j}^{\times} = \prod_{k=j+1}^{i} A_k$, for $i > j$, $D$ is the direct (skip) term, and $I$ is the identity. Both $M_{\mathrm{low}}$ and its transpose have semiseparable structure, enabling efficient scan operations.
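The following short sketch constructs this matrix explicitly for the scalar-$A_t$ case (illustrative shapes; the dense construction is only for inspection, since in practice the matrix is never materialized):

```python
import numpy as np

L, N = 6, 4
rng = np.random.default_rng(3)
a = rng.uniform(0.5, 1.0, L)                   # scalar A_t per token
B = rng.standard_normal((L, N))
C = rng.standard_normal((L, N))
D = 0.7                                        # scalar direct (skip) term

M_low = np.zeros((L, L))
for i in range(L):
    for j in range(i):                         # strictly lower triangular: j < i
        M_low[i, j] = np.prod(a[j + 1:i + 1]) * (C[i] @ B[j])

M = M_low + M_low.T + D * np.eye(L)            # full non-causal convolution matrix
assert np.allclose(M, M.T)                     # symmetric by construction: no causal preference
```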
4. Efficient Algorithms and Complexity
For a sequence of length $L$, model dimension $d$, and hidden (state) dimension $N$:
- Computation: $O(LNd)$ for expanding tokens, $O(LNd)$ for global aggregation, and $O(LNd)$ for broadcasting the result back to all tokens, i.e., linear in $L$ overall. There is no quadratic attention matrix or sequential recurrence.
- Implementation: Matrix contractions (einsum or batched matmuls) enable high efficiency on modern accelerators. Explicit for-loops are avoidable except in reference pseudocode.
Pseudocode example (Shi et al., 26 Jul 2024):

```python
m = 1.0 / A                              # Step 1: per-token weights
Z = X @ B                                # Step 2: token expansion
H = sum(m[j] * Z[j] for j in range(L))   # Step 3: global aggregation
Y = H @ C                                # Step 4: project to output
Y = repeat(Y, L, axis=0)                 # Step 5: broadcast to all tokens
```
Bidirectional Scan Algorithm (Dao et al., 31 May 2024):
```python
import numpy as np

def NC_SSD(A, B, C, D, u):
    """Bidirectional (non-causal) scan. u: (T, P); A: (N, N); B: (N, P); C: (P, N); D: (P, P)."""
    T, P = u.shape
    N = A.shape[0]
    y_f = np.zeros((T, P))
    y_b = np.zeros((T, P))
    h_f = np.zeros(N)                  # Forward scan
    for i in range(T):
        h_f = A @ h_f + B @ u[i]
        y_f[i] = C @ h_f
    h_b = np.zeros(N)                  # Backward scan
    for i in reversed(range(T)):
        h_b = A.T @ h_b + B @ u[i]
        y_b[i] = C @ h_b
    return y_f + y_b - u @ D.T         # Merge and residual correction (removes the doubled direct term)
```
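A hypothetical usage of the reference implementation above, with illustrative shapes and a spectral-radius-bounded $A$ (all values are placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
T, P, N = 16, 8, 4
A = rng.uniform(-0.9, 0.9, (N, N)) / N     # row absolute sums <= 0.9, so the spectrum stays inside the unit disk
B = rng.standard_normal((N, P))
C = rng.standard_normal((P, N))
D = 0.5 * np.eye(P)                        # direct (skip) term
u = rng.standard_normal((T, P))

y = NC_SSD(A, B, C, D, u)                  # (T, P) non-causal output
print(y.shape)
```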
5. Empirical Performance and Domain Applications
Experimental results (Shi et al., 26 Jul 2024) substantiate the efficacy of NC-SSD in computer vision tasks.
Summary of Key Results:
| Task | Model / Setup | Accuracy / AP / mIoU | Baseline | Diff |
|---|---|---|---|---|
| ImageNet-1K Top-1 | VSSD-Micro (14M, 2.3G) | 82.5% | NAT-M 81.8% | +0.7% |
| ImageNet-1K Top-1 | VSSD-Tiny (24M, 4.5G) | 83.7% | VMambaV9-T 82.5% | +1.2% |
| ImageNet-1K Top-1 | VSSD-Small (40M, 7.4G) | 84.1% | LocalVMamba-S 83.7% | +0.4% |
| ImageNet-1K Top-1 | VSSD-Base (89M, 16.1G) | 84.7% | VMambaV9-B 83.9% | +0.8% |
| COCO Det/Seg | VSSD-Tiny | Box AP 46.9 / Mask AP 42.6 | Swin-T 42.7 / 39.3 | |
| COCO Det/Seg | VSSD-Tiny | Box AP 46.9 / Mask AP 42.6 | VMamba-T 46.5 / 42.1 | |
| COCO Det/Seg | VSSD-Small | Box AP 48.4 / Mask AP 43.5 | VMamba-S 48.2 / 43.0 | |
| ADE20K Segmentation | VSSD-Tiny | 47.9 mIoU (single-scale) | VMamba-T 47.3 | +0.6 |
| ADE20K Segmentation | VSSD-Tiny | 47.9 mIoU (single-scale) | Swin-T 44.4 | +3.5 |
| Efficiency | VSSD vs. vanilla SSD | +0.6% Top-1, +14% train throughput | | |
| Efficiency | VSSD vs. Bi-SSD | +0.2% Top-1, +50% train throughput | | |
These results demonstrate consistent improvements over, or parity with, prior SSM-based models and Transformer/CNN baselines, along with a notable increase in training efficiency.
6. Generalization, Expressivity, and Stability
Expressive Power: Any convolution whose kernel admits a low-rank state-space representation (semiseparable) can be expressed via NC-SSD. Symmetric kernels (Gaussian, Matérn) are especially amenable to concise (low-rank) representations.
Stability: Stability is ensured by constraining the eigenvalues of the discrete state matrix $\bar{A}$ to lie within the unit disk (discrete-time), or equivalently by requiring the eigenvalues of $A$ to have negative real parts (continuous-time). This holds identically for both forward and backward passes in NC-SSD.
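One common way to satisfy this constraint in practice (used, e.g., in Mamba-style parameterizations; shown here only as an illustrative sketch) is to parameterize the continuous-time $A$ with a strictly negative value, so the discretized $\bar{A}$ automatically lands inside the unit disk:

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))

rng = np.random.default_rng(0)
a_log = rng.standard_normal(16)            # unconstrained learnable parameter
delta = softplus(rng.standard_normal(16))  # positive step size
A_cont = -np.exp(a_log)                    # continuous-time A: strictly negative real part
A_bar = np.exp(delta * A_cont)             # discrete-time A_bar, guaranteed in (0, 1)
assert np.all((A_bar > 0) & (A_bar < 1))
```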
Boundary Handling: Unlike causal SSMs, both the initial and terminal states are zero-initialized. No additional padding or explicit path alternation is required: the global state computation is invariant to scan direction, so all possible scan orders yield the same result.
A plausible implication is that these properties render NC-SSD broadly suitable for any task where the underlying structure is best modeled by symmetric (non-directional) dependencies.
7. Relationship to Structured Attention and Semiseparable Matrices
NC-SSD can be viewed through the lens of semiseparable matrix theory (Dao et al., 31 May 2024). The full non-causal convolution matrix is decomposable into two N-sequentially-semiseparable (SSS) matrices and a diagonal, paralleling low-rank approximations used in efficient attention mechanisms. Specifically, for a value sequence $V$:

$$Y = M\,V = \left(M_{\mathrm{low}} + M_{\mathrm{low}}^{\top} + D\,I\right) V,$$

where each component is efficiently computable by a scan (a forward scan for $M_{\mathrm{low}} V$, a backward scan for $M_{\mathrm{low}}^{\top} V$, and an elementwise diagonal term for $D\,V$). This duality collapses the distinction between SSM convolution and attention, providing a unifying, computationally efficient framework for both.
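As a concrete check of this decomposition, the sketch below (scalar $A_t$, illustrative shapes and names) evaluates $Y = MV$ by two linear scans plus a diagonal term and compares against the explicitly materialized matrix:

```python
import numpy as np

L, N = 64, 4
rng = np.random.default_rng(5)
a = rng.uniform(0.5, 1.0, L)
B = rng.standard_normal((L, N))
C = rng.standard_normal((L, N))
D = 0.3
v = rng.standard_normal(L)

# Dense reference: M_low[i, j] = prod(a[j+1..i]) * (C_i . B_j) for j < i
M_low = np.zeros((L, L))
for i in range(L):
    for j in range(i):
        M_low[i, j] = np.prod(a[j + 1:i + 1]) * (C[i] @ B[j])
y_dense = (M_low + M_low.T + D * np.eye(L)) @ v

# Linear-time evaluation: diagonal term plus forward and backward scans
y_scan = D * v
s = np.zeros(N)                              # forward scan computes M_low @ v
for i in range(L):
    s = a[i] * s
    y_scan[i] += C[i] @ s
    s = s + B[i] * v[i]
r = np.zeros(N)                              # backward scan computes M_low.T @ v
for j in reversed(range(L)):
    y_scan[j] += B[j] @ r
    r = a[j] * (r + C[j] * v[j])

assert np.allclose(y_scan, y_dense)          # scans reproduce the dense non-causal kernel
```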
In summary, Non-Causal State Space Duality constitutes an efficient, theoretically grounded, and practically validated approach for non-causal sequence modeling, with particular advantages in vision. It maintains the linear scaling of SSMs while achieving position-agnostic global aggregation, outperforming bidirectional SSMs and multi-path scanning approaches in both accuracy and efficiency (Shi et al., 26 Jul 2024, Dao et al., 31 May 2024).