
Non-Causal State Space Duality (NC-SSD)

Updated 12 November 2025
  • NC-SSD is a framework that extends state-space duality by symmetrically aggregating token contributions, removing causal constraints and enabling global receptive fields.
  • It leverages efficient linear-time algorithms and bidirectional scan methods to fuse information from all tokens, streamlining computations compared to classical causal models.
  • NC-SSD achieves state-of-the-art results in vision benchmarks while ensuring stability through controlled eigenvalue constraints and low-rank semiseparable approximations.

Non-Causal State Space Duality (NC-SSD) generalizes the duality between state-space models (SSMs) and attention mechanisms, extending the applicability of SSM-inspired architectures to domains where causality is neither natural nor required, such as vision. Unlike classical or causal SSMs—where outputs depend only on current and past inputs—NC-SSD computes outputs that symmetrically aggregate contributions from all tokens, regardless of position, via efficient, linear-time algorithms. This position-agnostic property enables global receptive fields, enhances performance across various vision benchmarks, and streamlines computations relative to prior bidirectional or multi-path approaches.

1. Classical and Causal State Space Models

The canonical continuous-time SSM is given by

$$\frac{\mathrm{d}}{\mathrm{d}t} h(t) = A^\circ h(t) + B^\circ x(t), \qquad y(t) = C h(t) + D x(t),$$

where $h(t) \in \mathbb{R}^N$, $x(t) \in \mathbb{R}$, and $y(t) \in \mathbb{R}$. Discretizing with step size $\Delta$ gives
$$h_n = A h_{n-1} + B x_n, \qquad y_n = C h_n + D x_n,$$
with $A = e^{\Delta A^\circ}$, $B \approx \Delta B^\circ$, and $C, D$ learnable.

The unrolled recurrence defines a causal 1D convolution kernel $K = [CB, CAB, \ldots, CA^{L-1}B]$, and the sequence output is computed as $y = x \ast K$. In the State Space Duality (SSD) formulation (notably, Mamba2), $A_n$ is restricted to scalars, allowing the recursion to be interpreted as a particular form of masked (causal) attention, i.e., each $y_n$ depends only on $x_{1:n}$. This corresponds to a triangular kernel with $K_{i,j} = 0$ for $i < j$.
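
For concreteness, here is a minimal NumPy sketch (all names and parameter values are illustrative, not taken from the papers) verifying that the recurrent and convolutional forms agree:

import numpy as np

# Minimal sketch: discrete SSM recurrence vs. its causal convolution kernel.
L, N = 16, 4
rng = np.random.default_rng(0)
A = np.diag(rng.uniform(0.1, 0.9, N))   # stable diagonal A (eigenvalues in (0, 1))
B = rng.standard_normal((N, 1))
C = rng.standard_normal((1, N))
x = rng.standard_normal(L)

# Recurrent form: h_n = A h_{n-1} + B x_n, y_n = C h_n (D omitted, as in K above)
h = np.zeros((N, 1))
y_rec = np.zeros(L)
for n in range(L):
    h = A @ h + B * x[n]
    y_rec[n] = (C @ h).item()

# Convolutional form: K = [CB, CAB, ..., CA^{L-1}B], applied causally
K = np.array([(C @ np.linalg.matrix_power(A, n) @ B).item() for n in range(L)])
y_conv = np.array([sum(K[n - j] * x[j] for j in range(n + 1)) for n in range(L)])
assert np.allclose(y_rec, y_conv)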

2. Rationale for Non-Causality and Vision-Specific Challenges

In vision, non-causality is intrinsic: there is no temporal or sequential restriction preventing any image patch from influencing any other. Flattening a 2D patch grid into a 1D sequence destroys true spatial locality: patches adjacent in 2D may be distant in 1D, producing decay patterns unaligned with spatial adjacency. Prior approaches (e.g., ViM, VMamba, LocalMamba) employ multiple scan paths (forward, backward, diagonal, etc.) to mitigate causality and aggregate outputs, but such fusion is implementation-heavy and still fails to recover true non-causal semantics.

NC-SSD addresses this by discarding the causal dependence entirely, enabling each token’s output to be computed from a global, symmetric combination of all tokens, reflecting the true non-causal structure required for vision tasks (Shi et al., 26 Jul 2024).

3. Derivation and Mathematical Structure of NC-SSD

3.1. Reinterpretation of Interaction Coefficients

In causal SSD, the scalar $A_n$ modulates retention of $h_{n-1}$ against the update from $x_n$:
$$h_n = A_n h_{n-1} + B_n x_n,$$
where causality inherently biases the output toward earlier tokens. Removing the decay on $h_{n-1}$ (and rescaling the input by $1/A_n$) yields
$$h_n = h_{n-1} + \frac{1}{A_n} B_n x_n = \sum_{i=1}^{n} \frac{1}{A_i} B_i x_i;$$
causality is absent, and each token contributes directly via its coefficient $1/A_i$, irrespective of position.
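
A small sketch of this reduction (shapes illustrative; scalar inputs for simplicity): once the decay is removed, the state is a plain prefix sum of the weighted contributions.

import numpy as np

# h_n = sum_{i<=n} (1/A_i) B_i x_i as a cumulative sum over tokens.
L, Dp = 8, 4
rng = np.random.default_rng(1)
A = rng.uniform(0.5, 1.0, size=L)       # per-token scalars A_n
B = rng.standard_normal((L, Dp))        # per-token projections B_n
x = rng.standard_normal(L)              # scalar inputs x_n

contrib = (1.0 / A)[:, None] * B * x[:, None]   # (1/A_i) B_i x_i per token
h = np.cumsum(contrib, axis=0)                  # h_n for every n at once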

3.2. Multi-Scan Fusion

Performing both forward and backward scans:
$$H^{\rightarrow}_i = \sum_{j=1}^{i} m_j Z_j, \qquad H^{\leftarrow}_i = \sum_{j=i}^{L} m_j Z_j, \qquad m_j = 1/A_j, \quad Z_j = B_j x_j.$$
Summing the two (the double-counted self-term is omitted as a negligible bias) yields
$$H = \sum_{j=1}^{L} m_j Z_j,$$
a global state decoupled from position: no ordering information persists.
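
The following NumPy check (shapes illustrative) confirms that fusing the two scans reduces, after subtracting the double-counted self-term, to the same global state at every position:

import numpy as np

L, Dp = 8, 4
rng = np.random.default_rng(2)
m = rng.uniform(0.5, 2.0, size=L)          # m_j = 1 / A_j
Z = rng.standard_normal((L, Dp))           # Z_j = B_j x_j
mZ = m[:, None] * Z

H_fwd = np.cumsum(mZ, axis=0)              # H_i^-> = sum_{j<=i} m_j Z_j
H_bwd = np.cumsum(mZ[::-1], axis=0)[::-1]  # H_i^<- = sum_{j>=i} m_j Z_j
H = H_fwd + H_bwd - mZ                     # subtract double-counted j = i term
assert np.allclose(H, mZ.sum(axis=0))      # identical global state for every i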

3.3. Tensor and Einsum Formulation

Let $X \in \mathbb{R}^{L \times D}$ (tokens), $B \in \mathbb{R}^{L \times D \times D'}$, $C \in \mathbb{R}^{D \times D'}$, and $m \in \mathbb{R}^L$:
$$Z_{j,d'} = \sum_d X_{j,d}\, B_{j,d,d'}, \qquad H_{d'} = \sum_{j=1}^L m_j\, Z_{j,d'}, \qquad Y_{j,d} = \sum_{d'} C_{d,d'}\, H_{d'},$$
or equivalently, $Y = (X \odot m)\,@\,(BC)$, where $\odot$ denotes row-wise scaling and $@$ is matrix multiplication.
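
A direct NumPy transcription of these contractions (shapes illustrative) makes the position independence explicit: every token reads out the same global state, so all rows of Y coincide.

import numpy as np

L, D, Dp = 8, 6, 4
rng = np.random.default_rng(3)
X = rng.standard_normal((L, D))         # tokens
B = rng.standard_normal((L, D, Dp))     # per-token input projections
C = rng.standard_normal((D, Dp))        # shared output projection
m = rng.uniform(0.5, 2.0, size=L)       # m_j = 1 / A_j

Z = np.einsum("ld,ldp->lp", X, B)       # Z_{j,d'} = sum_d X_{j,d} B_{j,d,d'}
H = np.einsum("l,lp->p", m, Z)          # H_{d'}   = sum_j m_j Z_{j,d'}
Y = np.broadcast_to(C @ H, (L, D))      # Y_{j,d}  = sum_{d'} C_{d,d'} H_{d'}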

3.4. Generalized (Structured) Convolutional Duality

From a classical perspective (Dao et al., 31 May 2024), the non-causal convolution matrix can be written as
$$K^{\mathrm{nc}} = K^{\mathrm{forw}} + (K^{\mathrm{forw}})^{\top} - D I,$$
where $K^{\mathrm{forw}}$ is strictly lower-triangular with entries $K^{\mathrm{forw}}_{i,j} = g[i-j]\,\mathbb{1}_{i>j}$, $g[n] = C A^{n-1} B$ for $n \geq 1$, $D$ is the direct term, and $I$ is the identity. Both $K^{\mathrm{forw}}$ and its transpose have semiseparable structure, enabling efficient scan operations.
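
A scalar-SSM sketch ($N = 1$; parameter values illustrative) of the dense kernel this formula describes; in practice $K^{\mathrm{nc}}$ is never materialized, since scans exploit its semiseparable structure:

import numpy as np

L = 6
A, B, C, D = 0.8, 0.5, 1.1, 0.3
g = lambda n: C * A ** (n - 1) * B            # g[n] = C A^{n-1} B for n >= 1

K_forw = np.array([[g(i - j) if i > j else 0.0 for j in range(L)]
                   for i in range(L)])        # strictly lower-triangular
K_nc = K_forw + K_forw.T - D * np.eye(L)      # symmetric non-causal kernel
assert np.allclose(K_nc, K_nc.T)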

4. Efficient Algorithms and Complexity

For a sequence of length $L$, model dimension $D$, and hidden dimension $D'$:

  • Computation: $O(L D D')$ for expanding tokens, $O(L D')$ for global aggregation, and $O(L D D')$ for broadcasting to all tokens. There is no $L \times L$ quadratic attention matrix and no sequential recurrence.
  • Implementation: Matrix contractions (einsum or batched matmuls) enable high efficiency on modern accelerators. Explicit for-loops are avoidable except in reference pseudocode.

Pseudocode Example (as in (Shi et al., 26 Jul 2024)):

import numpy as np

# Given A: (L,), X: (L, D), B: (L, D, D'), C: (D, D'), as in Section 3.3.
m = 1.0 / A                             # Step 1: per-token weights m_j = 1/A_j
Z = np.einsum("ld,ldp->lp", X, B)       # Step 2: token expansion
H = np.einsum("l,lp->p", m, Z)          # Step 3: global aggregation
Y = np.broadcast_to(C @ H, X.shape)     # Step 4: project and broadcast to all tokens
In practice, all steps are batchable and memory-efficient.

Bidirectional Scan Algorithm (Dao et al., 31 May 2024):

import numpy as np

def NC_SSD(A, B, C, D, u):
    # u: (T, P) inputs; A: (N, N), B: (N, P), C: (P, N), D: (P, P) SSM params
    T, P = u.shape
    N = A.shape[0]
    y_f, y_b = np.zeros((T, P)), np.zeros((T, P))
    h = np.zeros(N)                     # Forward scan
    for i in range(T):
        h = A @ h + B @ u[i]
        y_f[i] = C @ h
    h = np.zeros(N)                     # Backward scan
    for i in reversed(range(T)):
        h = A.T @ h + B @ u[i]
        y_b[i] = C @ h
    return y_f + y_b - u @ D.T          # Merge; correct double-counted direct term
Both forms run in $O(L D D')$ time, equivalently $O(TNP)$.

5. Empirical Performance and Domain Applications

Experimental results (Shi et al., 26 Jul 2024) substantiate the efficacy of NC-SSD in computer vision tasks.

Summary of Key Results:

| Task | Model / Setup | Result | Baseline | Diff |
|------|---------------|--------|----------|------|
| ImageNet-1K Top-1 | VSSD-Micro (14M, 2.3G) | 82.5% | NAT-M, 81.8% | +0.7% |
| ImageNet-1K Top-1 | VSSD-Tiny (24M, 4.5G) | 83.7% | VMambaV9-T, 82.5% | +1.2% |
| ImageNet-1K Top-1 | VSSD-Small (40M, 7.4G) | 84.1% | LocalVMamba-S, 83.7% | +0.4% |
| ImageNet-1K Top-1 | VSSD-Base (89M, 16.1G) | 84.7% | VMambaV9-B, 83.9% | +0.8% |
| COCO Det/Seg | VSSD-Tiny | Box AP 46.9, Mask AP 42.6 | Swin-T 42.7/39.3; VMamba-T 46.5/42.1 | |
| COCO Det/Seg | VSSD-Small | Box AP 48.4, Mask AP 43.5 | VMamba-S 48.2/43.0 | |
| ADE20K Segmentation | VSSD-Tiny | 47.9 mIoU (single-scale) | VMamba-T, 47.3 | +0.6 |
| ADE20K Segmentation | VSSD-Tiny | 47.9 mIoU (single-scale) | Swin-T, 44.4 | +3.5 |
| Efficiency | VSSD vs. vanilla SSD | +0.6% Top-1, +14% training throughput | | |
| Efficiency | VSSD vs. Bi-SSD | +0.2% Top-1, +50% training throughput | | |

These results demonstrate consistent improvements over, or parity with, prior SSM-based models and Transformer/CNN baselines, together with notable efficiency gains.

6. Generalization, Expressivity, and Stability

Expressive Power: Any convolution whose kernel admits a low-rank state-space representation (semiseparable) can be expressed via NC-SSD. Symmetric kernels (Gaussian, Matérn) are especially amenable to concise (low-rank) representations.
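
As an illustration of why symmetric kernels are amenable, the numerical rank of a Gaussian kernel matrix over 1D positions is far below its size (a quick check, with illustrative values):

import numpy as np

L, sigma = 64, 4.0
pos = np.arange(L)
K = np.exp(-((pos[:, None] - pos[None, :]) ** 2) / (2 * sigma ** 2))
num_rank = int(np.sum(np.linalg.svd(K, compute_uv=False) > 1e-6 * L))
print(num_rank, "<<", L)   # numerical rank well below L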

Stability: Stability is ensured by constraining the eigenvalues of AA within the unit disk (discrete) or requiring negative real parts (continuous). This holds identically for both forward and backward passes in NC-SSD.
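
A minimal sketch of one standard way to enforce the discrete-time condition (this parameterization is an assumption for illustration, not necessarily what the cited papers use):

import numpy as np

def stable_A(raw):
    # Map unconstrained learnable values into (0, 1): softplus > 0, exp(-.) < 1,
    # so every (diagonal) eigenvalue lies strictly inside the unit disk.
    return np.exp(-np.log1p(np.exp(raw)))

A = stable_A(np.array([-2.0, 0.0, 3.0]))
assert np.all((A > 0.0) & (A < 1.0))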

Boundary Handling: Unlike causal SSMs, both the initial and terminal states are zero-initialized. Without additional padding or explicit path alternation, the global state computation is invariant to scan direction: every scan ordering yields the same result.

A plausible implication is that these properties render NC-SSD broadly suitable for any task where the underlying structure is best modeled by symmetric (non-directional) dependencies.

7. Relationship to Structured Attention and Semiseparable Matrices

NC-SSD can be viewed through the lens of semiseparable matrix theory (Dao et al., 31 May 2024). The full non-causal convolution matrix $K^{\mathrm{nc}}$ decomposes into two $N$-sequentially-semiseparable (SSS) matrices and a diagonal, paralleling the low-rank approximations used in efficient attention mechanisms. Specifically, for a value sequence $V$:
$$Y = K^{\mathrm{nc}} V = K^{\mathrm{forw}} V + (K^{\mathrm{forw}})^{\top} V - D V,$$
where each component is efficiently computable by a scan (forward, backward, or diagonal). This duality collapses the distinction between SSM convolution and attention, providing a unifying, computationally efficient framework for both.
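
A scalar-SSM check (illustrative values) that the scan components reproduce the dense product; the state is read out before the update, so the strictly triangular parts exclude the self-term:

import numpy as np

L = 6
A, B, C, D = 0.8, 0.5, 1.1, 0.3
V = np.random.default_rng(4).standard_normal(L)

# Dense reference: K^forw_{i,j} = C A^{i-j-1} B for i > j, else 0.
K_forw = np.array([[C * A ** (i - j - 1) * B if i > j else 0.0
                    for j in range(L)] for i in range(L)])
Y_ref = K_forw @ V + K_forw.T @ V - D * V

y_f, y_b = np.zeros(L), np.zeros(L)
h = 0.0
for i in range(L):                 # forward scan computes K^forw V
    y_f[i] = C * h
    h = A * h + B * V[i]
h = 0.0
for i in reversed(range(L)):       # backward scan computes (K^forw)^T V
    y_b[i] = C * h
    h = A * h + B * V[i]
assert np.allclose(y_f + y_b - D * V, Y_ref)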

In summary, Non-Causal State Space Duality constitutes an efficient, theoretically grounded, and practically validated approach for non-causal sequence modeling, with particular advantages in vision. It maintains the linear scaling of SSMs while achieving position-agnostic global aggregation, outperforming bidirectional SSMs and multi-path scanning approaches in both accuracy and efficiency (Shi et al., 26 Jul 2024, Dao et al., 31 May 2024).
