Non-Causal State Space Duality (NC-SSD)
- NC-SSD is a framework that extends state-space duality by symmetrically aggregating token contributions, removing causal constraints and enabling global receptive fields.
- It leverages efficient linear-time algorithms and bidirectional scan methods to fuse information from all tokens, streamlining computations compared to classical causal models.
- NC-SSD achieves state-of-the-art results in vision benchmarks while ensuring stability through controlled eigenvalue constraints and low-rank semiseparable approximations.
Non-Causal State Space Duality (NC-SSD) generalizes the duality between state-space models (SSMs) and attention mechanisms, extending the applicability of SSM-inspired architectures to domains where causality is neither natural nor required, such as vision. Unlike classical or causal SSMs—where outputs depend only on current and past inputs—NC-SSD computes outputs that symmetrically aggregate contributions from all tokens, regardless of position, via efficient, linear-time algorithms. This position-agnostic property enables global receptive fields, enhances performance across various vision benchmarks, and streamlines computations relative to prior bidirectional or multi-path approaches.
1. Classical and Causal State Space Models
The canonical continuous-time SSM is given by

$$h'(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t),$$

where $A \in \mathbb{R}^{N \times N}$, $B \in \mathbb{R}^{N \times 1}$, and $C \in \mathbb{R}^{1 \times N}$. Discretizing with step size $\Delta$:

$$h_t = \bar{A}\,h_{t-1} + \bar{B}\,x_t, \qquad y_t = C\,h_t,$$

with $\bar{A} = \exp(\Delta A)$, $\bar{B} = (\Delta A)^{-1}\left(\exp(\Delta A) - I\right)\Delta B$, and $\Delta$, $A$, $B$, $C$ learnable.
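As a concrete illustration, the zero-order-hold discretization above can be written in a few lines of NumPy/SciPy. This is a minimal sketch with illustrative shapes (and it assumes $\Delta A$ is invertible), not code from either cited paper:

```python
import numpy as np
from scipy.linalg import expm

N = 4
rng = np.random.default_rng(0)
A = -np.eye(N) + 0.1 * rng.standard_normal((N, N))   # continuous-time state matrix (illustrative)
B = rng.standard_normal((N, 1))                      # input matrix
delta = 0.1                                          # step size (learnable in practice)

A_bar = expm(delta * A)                                               # exp(Delta*A)
B_bar = np.linalg.solve(delta * A, A_bar - np.eye(N)) @ (delta * B)   # (Delta*A)^-1 (exp(Delta*A) - I) Delta*B
```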
The unrolled recurrence defines a causal 1D convolution kernel $\bar{K} = \left(C\bar{B},\, C\bar{A}\bar{B},\, \dots,\, C\bar{A}^{L-1}\bar{B}\right)$, and the sequence output is computed as $y = x * \bar{K}$. In the State Space Duality (SSD) formulation (notably, Mamba2), $A_t$ is restricted to scalars, allowing the recursion to be interpreted as a particular form of masked (causal) attention, i.e., each $y_t$ depends only on $x_1, \dots, x_t$. This corresponds to a lower-triangular kernel matrix $M$ with $M_{ij} = 0$ for $j > i$.
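The equivalence between the scalar-$A_t$ recurrence and a lower-triangular (masked-attention) kernel can be checked numerically. The following is a hedged sketch with illustrative shapes and names, not code from the cited papers:

```python
import numpy as np

L, N, d = 6, 4, 3                      # tokens, state dim, channel dim (illustrative)
rng = np.random.default_rng(0)
a = rng.uniform(0.5, 1.0, size=L)      # scalar A_t per token
B = rng.standard_normal((L, N))        # B_t rows
C = rng.standard_normal((L, N))        # C_t rows
x = rng.standard_normal((L, d))        # input tokens

# Recurrent form: h_t = a_t * h_{t-1} + B_t x_t^T,  y_t = h_t^T C_t
h = np.zeros((N, d))
y_rec = np.zeros((L, d))
for t in range(L):
    h = a[t] * h + np.outer(B[t], x[t])
    y_rec[t] = h.T @ C[t]

# Masked-attention form: M[i, j] = prod(a[j+1..i]) * (C_i . B_j) for j <= i, zero above the diagonal
M = np.zeros((L, L))
for i in range(L):
    for j in range(i + 1):
        M[i, j] = np.prod(a[j + 1:i + 1]) * (C[i] @ B[j])
y_att = M @ x

assert np.allclose(y_rec, y_att)       # identical outputs: a strictly causal, triangular kernel
```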
2. Rationale for Non-Causality and Vision-Specific Challenges
In vision, non-causality is intrinsic; there is no temporal or sequential restriction preventing any image patch from influencing any other. Flattening a $2$D patch grid into $1$D destroys true spatial locality: adjacent patches in $2$D may be distant in $1$D, leading to decay patterns unaligned with spatial adjacency. Prior approaches (e.g., ViM, VMamba, LocalMamba) employ multiple scan paths (forward, backward, diagonal, etc.) to mitigate causality and aggregate outputs, but such fusion is both implementation-heavy and still fails to recover true non-causal semantics.
NC-SSD addresses this by discarding the causal dependence entirely, enabling each token’s output to be computed from a global, symmetric combination of all tokens, reflecting the true non-causal structure required for vision tasks (Shi et al., 26 Jul 2024).
3. Derivation and Mathematical Structure of NC-SSD
3.1. Reinterpretation of Interaction Coefficients
In causal SSD, the scalar $A_t$ modulates retention of the previous state $h_{t-1}$ against the update from the current token $x_t$:

$$h_t = A_t\,h_{t-1} + B_t\,x_t, \qquad y_t = C_t\,h_t,$$

where causality inherently biases the output toward earlier tokens. Removing this recurrence yields a single position-agnostic state

$$H = \sum_{i=1}^{L} m_i\,B_i\,x_i, \qquad y_t = C_t\,H,$$

so causality is absent; each token contributes directly via its coefficient $m_i$ (a per-token weight derived from $A_i$; cf. the pseudocode in Section 4) irrespective of position.
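A tiny numerical contrast makes the reinterpretation concrete: in the causal recurrence, a token's influence on a later state is attenuated by a product of decays that depends on its position, whereas the non-causal per-token coefficient (cf. the pseudocode in Section 4, where the weight is `1.0 / A`) does not depend on where the token sits. The values below are illustrative only:

```python
import numpy as np

a = np.array([0.9, 0.8, 0.7, 0.6])      # illustrative scalar A_t per token
for t in range(4):
    causal_w = np.prod(a[1:t + 1])      # contribution of token 0 to h_t: prod_{k=1..t} a_k, decays with t
    print(t, causal_w)
noncausal_w = 1.0 / a[0]                # per-token coefficient m_0, independent of position
print(noncausal_w)
```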
3.2. Multi-Scan Fusion
Performing both forward and backward scans gives

$$h_t^{\mathrm{f}} = \sum_{i \le t} m_i\,B_i\,x_i, \qquad h_t^{\mathrm{b}} = \sum_{i \ge t} m_i\,B_i\,x_i.$$

Summing (and omitting the double-counted self-term, a negligible bias) yields

$$h_t^{\mathrm{f}} + h_t^{\mathrm{b}} \approx \sum_{i=1}^{L} m_i\,B_i\,x_i = H,$$

a global state decoupled from position; no ordering information persists.
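The claim can be checked with a few lines of NumPy. This is an illustrative sketch (names `m`, `B`, `x` are placeholders with assumed shapes), not the paper's implementation:

```python
import numpy as np

L, N, d = 8, 4, 3
rng = np.random.default_rng(1)
m = rng.uniform(0.1, 1.0, size=L)            # per-token weights
B = rng.standard_normal((L, N))
x = rng.standard_normal((L, d))

contrib = m[:, None, None] * np.einsum('ln,ld->lnd', B, x)   # m_i * B_i x_i^T, shape (L, N, d)
h_fwd = np.cumsum(contrib, axis=0)                           # h_t^f = sum over i <= t
h_bwd = np.cumsum(contrib[::-1], axis=0)[::-1]               # h_t^b = sum over i >= t
H_global = contrib.sum(axis=0)                               # position-agnostic global state

# For every position t, forward + backward double-counts only the self term:
assert np.allclose(h_fwd + h_bwd - contrib, H_global[None])
```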
3.3. Tensor and Einsum Formulation
Let $X \in \mathbb{R}^{L \times d}$ (tokens), $B \in \mathbb{R}^{L \times N}$, $C \in \mathbb{R}^{L \times N}$, $m \in \mathbb{R}^{L}$:

$$H = \sum_{i=1}^{L} m_i\,B_i\,X_i^{\top} \in \mathbb{R}^{N \times d}, \qquad Y = C\,H \in \mathbb{R}^{L \times d},$$

or equivalently, $H = (m \odot B)^{\top} X$, where $\odot$ denotes row-wise scaling and juxtaposition denotes matrix multiplication.
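In NumPy, the whole non-causal aggregation reduces to two einsums. This is a minimal sketch under the assumed shapes above:

```python
import numpy as np

L, N, d = 8, 4, 3
rng = np.random.default_rng(2)
m = rng.uniform(0.1, 1.0, L)
B = rng.standard_normal((L, N))
C = rng.standard_normal((L, N))
X = rng.standard_normal((L, d))

H = np.einsum('l,ln,ld->nd', m, B, X)     # global aggregation: sum_i m_i B_i X_i^T
Y = np.einsum('ln,nd->ld', C, H)          # broadcast/project back to all L tokens

assert np.allclose(H, (m[:, None] * B).T @ X)   # equivalent row-wise-scaled matmul form
```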
3.4. Generalized (Structured) Convolutional Duality
From a classical perspective (Dao et al., 31 May 2024), the non-causal convolution matrix can be written as

$$M = M_{\mathrm{low}} + M_{\mathrm{low}}^{\top} + D\,I,$$

where $M_{\mathrm{low}}$ is strictly lower-triangular with entries $M_{ij} = C_i^{\top} A_{i:j}^{\times} B_j$, $A_{i:j}^{\times} = \prod_{k=j+1}^{i} A_k$, for $i > j$, $D$ is the direct (skip) term, and $I$ is the identity. Both $M_{\mathrm{low}}$ and its transpose have semiseparable structure, enabling efficient scan operations.
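The following short sketch constructs this matrix explicitly for the scalar-$A_t$ case (illustrative shapes; the dense construction is only for inspection, since in practice the matrix is never materialized):

```python
import numpy as np

L, N = 6, 4
rng = np.random.default_rng(3)
a = rng.uniform(0.5, 1.0, L)                   # scalar A_t per token
B = rng.standard_normal((L, N))
C = rng.standard_normal((L, N))
D = 0.7                                        # scalar direct (skip) term

M_low = np.zeros((L, L))
for i in range(L):
    for j in range(i):                         # strictly lower triangular: j < i
        M_low[i, j] = np.prod(a[j + 1:i + 1]) * (C[i] @ B[j])

M = M_low + M_low.T + D * np.eye(L)            # full non-causal convolution matrix
assert np.allclose(M, M.T)                     # symmetric by construction: no causal preference
```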
4. Efficient Algorithms and Complexity
For a sequence of length $L$, model dimension $d$, and hidden (state) dimension $N$:
- Computation: $O(LNd)$ for expanding tokens, $O(LNd)$ for global aggregation, and $O(LNd)$ for broadcasting the result back to all tokens, i.e., linear in $L$ overall. There is no quadratic attention matrix or sequential recurrence.
- Implementation: Matrix contractions (einsum or batched matmuls) enable high efficiency on modern accelerators. Explicit for-loops are avoidable except in reference pseudocode.
Pseudocode example (Shi et al., 26 Jul 2024):

```python
m = 1.0 / A                              # Step 1: per-token weights
Z = X @ B                                # Step 2: token expansion
H = sum(m[j] * Z[j] for j in range(L))   # Step 3: global aggregation
Y = H @ C                                # Step 4: project to output
Y = repeat(Y, L, axis=0)                 # Step 5: broadcast to all tokens
```
Bidirectional Scan Algorithm (Dao et al., 31 May 2024):
```python
import numpy as np

def NC_SSD(A, B, C, D, u):
    """Bidirectional (non-causal) scan. u: (T, P); A: (N, N); B: (N, P); C: (P, N); D: (P, P)."""
    T, P = u.shape
    N = A.shape[0]
    y_f = np.zeros((T, P))
    y_b = np.zeros((T, P))
    h_f = np.zeros(N)                  # Forward scan
    for i in range(T):
        h_f = A @ h_f + B @ u[i]
        y_f[i] = C @ h_f
    h_b = np.zeros(N)                  # Backward scan
    for i in reversed(range(T)):
        h_b = A.T @ h_b + B @ u[i]
        y_b[i] = C @ h_b
    return y_f + y_b - u @ D.T         # Merge and residual correction (removes the doubled direct term)
```
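A hypothetical usage of the reference implementation above, with illustrative shapes and a spectral-radius-bounded $A$ (all values are placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
T, P, N = 16, 8, 4
A = rng.uniform(-0.9, 0.9, (N, N)) / N     # row absolute sums <= 0.9, so the spectrum stays inside the unit disk
B = rng.standard_normal((N, P))
C = rng.standard_normal((P, N))
D = 0.5 * np.eye(P)                        # direct (skip) term
u = rng.standard_normal((T, P))

y = NC_SSD(A, B, C, D, u)                  # (T, P) non-causal output
print(y.shape)
```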
5. Empirical Performance and Domain Applications
Experimental results (Shi et al., 26 Jul 2024) substantiate the efficacy of NC-SSD in computer vision tasks.
Summary of Key Results:
| Task | Model / Setup | Accuracy / AP / mIoU | Baseline | Diff |
|---|---|---|---|---|
| ImageNet-1K Top-1 | VSSD-Micro (14M, 2.3G) | 82.5% | NAT-M 81.8% | +0.7% |
| ImageNet-1K Top-1 | VSSD-Tiny (24M, 4.5G) | 83.7% | VMambaV9-T 82.5% | +1.2% |
| ImageNet-1K Top-1 | VSSD-Small (40M, 7.4G) | 84.1% | LocalVMamba-S 83.7% | +0.4% |
| ImageNet-1K Top-1 | VSSD-Base (89M, 16.1G) | 84.7% | VMambaV9-B 83.9% | +0.8% |
| COCO Det/Seg | VSSD-Tiny | Box AP 46.9 / Mask AP 42.6 | Swin-T 42.7 / 39.3 | |
| COCO Det/Seg | VSSD-Tiny | Box AP 46.9 / Mask AP 42.6 | VMamba-T 46.5 / 42.1 | |
| COCO Det/Seg | VSSD-Small | Box AP 48.4 / Mask AP 43.5 | VMamba-S 48.2 / 43.0 | |
| ADE20K Segmentation | VSSD-Tiny | 47.9 mIoU (single-scale) | VMamba-T 47.3 | +0.6 |
| ADE20K Segmentation | VSSD-Tiny | 47.9 mIoU (single-scale) | Swin-T 44.4 | +3.5 |
| Efficiency | VSSD vs. vanilla SSD | +0.6% Top-1, +14% train throughput | | |
| Efficiency | VSSD vs. Bi-SSD | +0.2% Top-1, +50% train throughput | | |
These results demonstrate consistent improvements over, or parity with, prior SSM-based models and Transformer/CNN baselines, along with a notable increase in training efficiency.
6. Generalization, Expressivity, and Stability
Expressive Power: Any convolution whose kernel admits a low-rank state-space representation (semiseparable) can be expressed via NC-SSD. Symmetric kernels (Gaussian, Matérn) are especially amenable to concise (low-rank) representations.
Stability: Stability is ensured by constraining the eigenvalues of the discrete state matrix $\bar{A}$ to lie within the unit disk (discrete-time), or equivalently by requiring the eigenvalues of $A$ to have negative real parts (continuous-time). This holds identically for both forward and backward passes in NC-SSD.
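One common way to satisfy this constraint in practice (used, e.g., in Mamba-style parameterizations; shown here only as an illustrative sketch) is to parameterize the continuous-time $A$ with a strictly negative value, so the discretized $\bar{A}$ automatically lands inside the unit disk:

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))

rng = np.random.default_rng(0)
a_log = rng.standard_normal(16)            # unconstrained learnable parameter
delta = softplus(rng.standard_normal(16))  # positive step size
A_cont = -np.exp(a_log)                    # continuous-time A: strictly negative real part
A_bar = np.exp(delta * A_cont)             # discrete-time A_bar, guaranteed in (0, 1)
assert np.all((A_bar > 0) & (A_bar < 1))
```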
Boundary Handling: Unlike causal SSMs, both the initial and terminal states are zero-initialized. No additional padding or explicit path alternation is required: the global state computation is invariant to scan direction, so all possible scan orders yield the same result.
A plausible implication is that these properties render NC-SSD broadly suitable for any task where the underlying structure is best modeled by symmetric (non-directional) dependencies.
7. Relationship to Structured Attention and Semiseparable Matrices
NC-SSD can be viewed through the lens of semiseparable matrix theory (Dao et al., 31 May 2024). The full non-causal convolution matrix is decomposable into two N-sequentially-semiseparable (SSS) matrices and a diagonal, paralleling low-rank approximations used in efficient attention mechanisms. Specifically, for a value sequence $V$:

$$Y = M\,V = \left(M_{\mathrm{low}} + M_{\mathrm{low}}^{\top} + D\,I\right) V,$$

where each component is efficiently computable by a scan (a forward scan for $M_{\mathrm{low}} V$, a backward scan for $M_{\mathrm{low}}^{\top} V$, and an elementwise diagonal term for $D\,V$). This duality collapses the distinction between SSM convolution and attention, providing a unifying, computationally efficient framework for both.
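As a concrete check of this decomposition, the sketch below (scalar $A_t$, illustrative shapes and names) evaluates $Y = MV$ by two linear scans plus a diagonal term and compares against the explicitly materialized matrix:

```python
import numpy as np

L, N = 64, 4
rng = np.random.default_rng(5)
a = rng.uniform(0.5, 1.0, L)
B = rng.standard_normal((L, N))
C = rng.standard_normal((L, N))
D = 0.3
v = rng.standard_normal(L)

# Dense reference: M_low[i, j] = prod(a[j+1..i]) * (C_i . B_j) for j < i
M_low = np.zeros((L, L))
for i in range(L):
    for j in range(i):
        M_low[i, j] = np.prod(a[j + 1:i + 1]) * (C[i] @ B[j])
y_dense = (M_low + M_low.T + D * np.eye(L)) @ v

# Linear-time evaluation: diagonal term plus forward and backward scans
y_scan = D * v
s = np.zeros(N)                              # forward scan computes M_low @ v
for i in range(L):
    s = a[i] * s
    y_scan[i] += C[i] @ s
    s = s + B[i] * v[i]
r = np.zeros(N)                              # backward scan computes M_low.T @ v
for j in reversed(range(L)):
    y_scan[j] += B[j] @ r
    r = a[j] * (r + C[j] * v[j])

assert np.allclose(y_scan, y_dense)          # scans reproduce the dense non-causal kernel
```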
In summary, Non-Causal State Space Duality constitutes an efficient, theoretically grounded, and practically validated approach for non-causal sequence modeling, with particular advantages in vision. It maintains the linear scaling of SSMs while achieving position-agnostic global aggregation, outperforming bidirectional SSMs and multi-path scanning approaches in both accuracy and efficiency (Shi et al., 26 Jul 2024, Dao et al., 31 May 2024).