
Bidirectional State-Space Models

Updated 4 March 2026
  • Bidirectional state-space models are neural architectures that combine forward and reverse state dynamics to capture comprehensive temporal and spatial context.
  • They employ dual-column and quasiseparable fusion mechanisms to maintain linear computational scaling, providing a computationally efficient alternative to transformers.
  • Empirical studies show these models excel in applications such as speech anti-spoofing, vision classification, and biomedical signal processing with significant accuracy improvements.

A bidirectional state-space model (SSM) is a neural sequence architecture that combines forward and reverse discrete-time state-space dynamics to aggregate information from both past and future inputs. This extension addresses the inherent uni-directionality of standard SSMs and is realized through a variety of architectural designs, including dual-column and quasiseparable matrix formulations, with demonstrated effectiveness across sequential, spatial, and spatio-temporal modeling tasks. Recent advances have established bidirectional SSMs as a computationally efficient, context-rich alternative to multi-head self-attention, achieving state-of-the-art results in applications such as speech anti-spoofing, large-scale vision, language modeling, pose estimation, and tabular inference (Xiao et al., 2024, Hwang et al., 2024, Huang et al., 2024).

1. Mathematical Foundations and Architectural Schemes

The core of a bidirectional SSM consists of two parallel recurrences derived from a discretized continuous-time state-space equation:

$$\dot{x}(t) = A\,x(t) + B\,u(t), \qquad y(t) = C\,x(t) + D\,u(t)$$

which after discretization gives

$$x_{n+1} = \bar{A}\,x_n + \bar{B}\,u_n, \qquad y_n = C\,x_n + D\,u_n$$
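The discretized matrices $\bar{A}$ and $\bar{B}$ are commonly obtained with a zero-order-hold (ZOH) rule, $\bar{A} = e^{A\Delta}$ and $\bar{B} = A^{-1}(e^{A\Delta} - I)B$. A minimal NumPy/SciPy sketch of this step, assuming an invertible $A$ (the function name and the toy system below are illustrative, not taken from any cited model):

```python
import numpy as np
from scipy.linalg import expm

def discretize_zoh(A, B, dt):
    """Zero-order-hold discretization of a continuous-time SSM.

    Returns (A_bar, B_bar) such that x_{n+1} = A_bar x_n + B_bar u_n.
    Assumes A is invertible; Mamba-style models apply an analogous rule
    to a diagonal A for efficiency.
    """
    A_bar = expm(A * dt)
    B_bar = np.linalg.solve(A, A_bar - np.eye(A.shape[0])) @ B
    return A_bar, B_bar

# Toy example: a stable 2-state diagonal system
A = np.array([[-1.0, 0.0], [0.0, -2.0]])
B = np.array([[1.0], [1.0]])
A_bar, B_bar = discretize_zoh(A, B, dt=0.1)
```

For a diagonal $A$ this reduces to elementwise exponentials, which is why practical SSM kernels avoid the general matrix exponential entirely.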

For bidirectional modeling, two streams process the input sequence in opposite directions:

  • Forward: runs from $n=1$ to $T$, $x_{n+1}^{(f)} = \bar{A}_n x_n^{(f)} + \bar{B}_n u_n$.
  • Reverse: runs from $n=T$ to $1$, $x_{n-1}^{(b)} = \bar{A}_n x_n^{(b)} + \bar{B}_n u_n$.

The outputs, $y_n^{(f)}$ and $y_n^{(b)}$, may be concatenated, summed, or fused by a gating MLP or Hadamard product, depending on the architectural variant (Xiao et al., 2024, Hwang et al., 2024, Schaller et al., 17 Nov 2025). Dual-column “stacked” architectures maintain distinct parameterizations for the forward and backward passes, while more parameter-efficient schemes, such as Hydra, implement a unified quasiseparable matrix structure in which a single low-rank representation supports subquadratic bidirectional mixing (Hwang et al., 2024). Variants exist for both sequence and 2D spatial structures, with modifications for token position-aware encoding.
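The dual-column scheme above can be sketched in a few lines of NumPy. For brevity this sketch ties $\bar{A}$, $\bar{B}$, $C$ across directions (dual-column models such as XLSR-Mamba keep separate parameters per direction) and implements the reverse recurrence by scanning the time-reversed sequence; all function names are illustrative:

```python
import numpy as np

def ssm_scan(A_bar, B_bar, C, u):
    """Run x_{n+1} = A_bar x_n + B_bar u_n over u of shape (T, input_dim)
    and return the outputs y_n = C x_n, stacked to shape (T, out_dim)."""
    x = np.zeros(A_bar.shape[0])
    ys = []
    for n in range(u.shape[0]):
        x = A_bar @ x + B_bar @ u[n]
        ys.append(C @ x)
    return np.stack(ys)

def bidirectional_ssm(A_bar, B_bar, C, u, fusion="concat"):
    """Forward scan on u, backward scan on the time-reversed sequence
    (re-reversed to align with forward time), then output fusion."""
    y_f = ssm_scan(A_bar, B_bar, C, u)
    y_b = ssm_scan(A_bar, B_bar, C, u[::-1])[::-1]
    if fusion == "concat":
        return np.concatenate([y_f, y_b], axis=-1)
    if fusion == "sum":
        return y_f + y_b
    raise ValueError(f"unknown fusion: {fusion}")

# Toy example with tied parameters and a constant input
A_bar = np.eye(2) * 0.5
B_bar = np.ones((2, 1))
C = np.ones((3, 2))
u = np.ones((8, 1))
y = bidirectional_ssm(A_bar, B_bar, C, u, fusion="concat")  # shape (8, 6)
```

With sum fusion and a time-symmetric input, the fused output is itself symmetric in time, which illustrates how the two streams restore access to future context.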

2. Computational Complexity and Efficiency

Bidirectional SSMs are formulated to preserve the linear (or near-linear) time and memory scaling inherent to efficient SSMs, even as they extend context to both past and future sub-sequences. For a sequence of length $T$ and hidden/channel dimension $d$, the time complexity is $O(Td^2)$ (or $O(Td)$ with optimized diagonal/low-rank implementations), and the memory footprint is $O(Td)$ per layer. This contrasts with the $O(T^2 d)$ scaling of transformer self-attention. When using quasiseparable matrices, as in Hydra, the parameter and computation overhead for bidirectionality is minimal, often just duplicating one core SSM kernel per block and adding a small set of fusion parameters (Hwang et al., 2024). Empirically, bidirectional SSMs have been shown to process contexts two orders of magnitude longer than transformers for the same compute budget in high-dimensional domains such as tabular prior-data-fitted networks (Koch et al., 16 Oct 2025).
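The asymptotic gap can be made concrete with back-of-the-envelope multiply counts per layer. This is a deliberate simplification that ignores input/output projections, heads, and constant factors; it captures only the $O(Td)$ vs. $O(T^2 d)$ terms discussed above:

```python
def scan_flops(T, d):
    """Rough multiply count for a diagonal bidirectional SSM layer:
    two directional scans, each with ~2*T*d multiplies
    (elementwise state update plus readout)."""
    return 2 * (2 * T * d)

def attention_flops(T, d):
    """Rough multiply count for the two matrix products of one
    self-attention layer (Q K^T and the attention-weighted V)."""
    return 2 * T * T * d

# The ratio grows linearly in T: attention / scan = T / 2
ratio = attention_flops(16384, 64) / scan_flops(16384, 64)  # 8192.0
```

At $T = 16384$ the attention term already exceeds the scan term by a factor of $T/2 = 8192$, which is the arithmetic behind the long-context advantage cited above.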

3. Empirical Performance Across Domains

Bidirectional SSM architectures have been applied across diverse modalities:

  • Speech and Audio: The XLSR-Mamba dual-column model achieves 0.93% EER (min t-DCF=0.208) for anti-spoofing on ASVspoof 2021 LA, outperforming transformers while being significantly faster (Xiao et al., 2024). In audio deepfake detection, bidirectional Mamba yields a 34.1% improvement in EER over transformer baselines (Chen et al., 2024). For speech separation, dual-path BiMamba surpasses both DPRNN and Sepformer in SI-SNRi under smaller model sizes (Jiang et al., 2024).
  • Vision: Vision Mamba matches or outperforms DeiT ViTs for classification, detection, and segmentation with up to 2.8× faster throughput and 86.8% less GPU memory at high resolutions (Zhu et al., 2024). For point cloud analysis, bi-SSM blocks yield a +3.7% accuracy gain over uni-directional SSMs (Chen et al., 2024). In human pose estimation, PoseMamba’s bidirectional global-local SSM blocks achieve state-of-the-art accuracy at one-fifth the MACs of self-attention (Huang et al., 2024).
  • Language and Tabular: Hydra delivers +0.8 GLUE points over BERT and +2% ImageNet Top-1 accuracy over ViT (Hwang et al., 2024). In tabular PFNs, bidirectional SSMs reduce sequence order sensitivity and approach transformer performance with linear scaling (Koch et al., 16 Oct 2025).
  • Biomedical and Time-Series: Bidirectional Mamba (EEGMamba) improves generalization and accuracy in multi-task EEG classification over attention models, especially on long signals (sequence lengths up to 10,000) while maintaining linear GPU memory scaling (Gui et al., 2024). In anomalous diffusion analysis, Bi-Mamba outperforms bidirectional RNNs in low-data, short-trajectory settings (Lavaud et al., 2024).
| Model/Domain | Main Fusion Type | Complexity | Empirical Highlight | Reference |
|---|---|---|---|---|
| XLSR-Mamba (speech) | Dual-column concat | $O(Td)$ | SOTA anti-spoof EER/min t-DCF | (Xiao et al., 2024) |
| Hydra (multimodal) | Quasiseparable fusion | $O(Td)$ | +0.8 GLUE, +2% ImageNet | (Hwang et al., 2024) |
| Vim (vision) | Summation/MLP fuse | $O(Ld)$ | 2.8× faster, 86.8% mem saving | (Zhu et al., 2024) |
| EEGMamba (EEG) | Sum+gating | $O(Td)$ | Consistent accuracy uplift | (Gui et al., 2024) |
| PoseMamba (3D pose) | Multi-directional concat | $O(TJ)$ | SOTA MPJPE, linear compute | (Huang et al., 2024) |

4. Training Procedures, Inductive Biases, and Extensions

Effective bidirectional SSM training may rely on pretraining (e.g., wav2vec 2.0 for speech (Xiao et al., 2024)), curriculum learning, or mixtures of objectives. Birdie introduces a hybrid prefix-LM regime, splitting the state/channel to support bidirectional encoding in the prefix and transitioning to causal decoding in the suffix, with dynamic objective weighting by reinforcement learning to enhance retrieval and copy tasks (Blouir et al., 2024). Inductive biases inherent to SSMs (controlled recurrences) are combined with bidirectional context, yielding improved generalization, especially in low-data or noisy settings. Gating mechanisms (e.g., GLU, SiLU) frequently regulate state fusion and information flow for training and stability.

Variants exist for complex input structures—such as Vedic-inspired Hadamard fusion (Schaller et al., 17 Nov 2025), global-local scans for spatial graphs (Huang et al., 2024), or spectral-domain filtering for sequence recommendation (Wang et al., 2024)—but the universal property is symmetric context aggregation without quadratic cost.

5. Architectural Variants and Fusion Mechanisms

Key fusion strategies for combining forward and backward SSM outputs include:

  • Concatenation — Outputs from both passes are stacked (as in XLSR-Mamba and PoseMamba), providing distinct representational channels for downstream heads.
  • Element-wise Sum/Superposition — Used for memory and compute efficiency or when SSM parameters are tied/shared (Hwang et al., 2024).
  • MLP/Gated Fusion — Learned gating weights or pointwise nonlinearities (e.g., SiLU, GLU) select or merge outputs, as in EEGMamba or HSIDMamba (Gui et al., 2024, Liu et al., 2024).
  • Hadamard Product/Bilinear — Vedic or bilinear encodings, as in Naga, permit expressive low-rank cross-time interactions and analytical interpretability (Schaller et al., 17 Nov 2025).

These fusions can be static (fixed per time index) or dynamic (conditioned on input state), and may vary by task, modality, or model size constraints.
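The four strategies above can be sketched in a few lines of NumPy. Shapes and the gating parameterization are illustrative stand-ins; real models implement gating with learned projections (e.g., GLU/SiLU blocks) inside fused kernels:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fuse(y_f, y_b, mode, W_gate=None):
    """Combine forward/backward SSM outputs, each of shape (T, d), using
    one of the fusion strategies listed above. W_gate, shape (2d, d), is
    an illustrative stand-in for a learned gating projection."""
    if mode == "concat":
        return np.concatenate([y_f, y_b], axis=-1)            # (T, 2d)
    if mode == "sum":
        return y_f + y_b                                      # (T, d)
    if mode == "gated":
        g = sigmoid(np.concatenate([y_f, y_b], axis=-1) @ W_gate)
        return g * y_f + (1.0 - g) * y_b                      # (T, d)
    if mode == "hadamard":
        return y_f * y_b                                      # (T, d)
    raise ValueError(f"unknown fusion mode: {mode}")
```

Note the trade-off visible in the shapes: concatenation doubles the channel count seen by downstream heads, while sum, gated, and Hadamard fusion keep it fixed.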

6. Empirical Ablations and Benchmark Results

Multiple works present controlled ablations distinguishing uni- vs. bidirectional SSM blocks. Across speech, EEG, hyperspectral, and vision settings, bidirectional SSMs consistently uplift accuracy, robustness, or regression metrics, with relative gains of 1–6% or higher, and sometimes dramatic reductions in order sensitivity, variance, and overfitting (Xiao et al., 2024, Huang et al., 2024, Gui et al., 2024). On resource benchmarks, SSMs (including their bidirectional extensions) regularly outperform or match transformer architectures in both latency and maximum sequence length, with empirically confirmed linear scaling of memory and compute (Zhu et al., 2024, Koch et al., 16 Oct 2025).

7. Limitations, Open Problems, and Future Directions

Despite their efficiency and generalization gains, bidirectional SSMs incur additional parameter/memory cost per block (roughly doubling for naïve dual-column variants). Parameter sharing, shared projection heads, or more sophisticated cross-directional interaction/fusion (e.g., mutual cross-attention) are proposed avenues for improvement (Xiao et al., 2024). While bidirectionality mitigates sequence order bias, absolute order invariance for table inputs still requires explicit token permutation averaging (Koch et al., 16 Oct 2025). Adaptive depth, mixture-of-experts block allocation, and hybrid architectures (e.g., combining bidirectional SSMs with a single self-attention layer for global context) are under investigation. Extending bidirectional SSMs to higher-order spatial/temporal modeling, structured scientific data, and prefix-masked or partially observed sequences constitutes an active research area (Blouir et al., 2024, Huang et al., 2024).
