Bidirectional State-Space Models (SSMs)
- Bidirectional SSMs are sequence modeling architectures that process inputs in both directions to deliver efficient global context modeling.
- They leverage input-conditioned state transitions, novel fusion strategies like gated or concatenation methods, and structured matrix parameterizations.
- These models achieve competitive or state-of-the-art results in audio, vision, language, recommendation, and graph tasks while reducing computational complexity.
Bidirectional State-Space Models (SSMs) are a family of sequence modeling architectures that generalize traditional recurrence by propagating information both forward and backward along an input sequence via structured state-space recurrences, eschewing the quadratic computational complexity of self-attention while retaining or improving global context modeling. These models have emerged as efficient and expressive alternatives to attention-based transformers across a wide range of domains—including audio, vision, language, graphs, documents, and recommendation—by leveraging bidirectional scans of selective, input-conditioned state transitions, novel fusion mechanisms, and carefully engineered matrix parameterizations.
1. Mathematical Foundations of State-Space Models
A classical (continuous-time) linear time-invariant SSM is defined by the system

$$\dot{h}(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t) + D\,x(t),$$

where $x(t)$ is the input, $h(t) \in \mathbb{R}^N$ is the hidden state, $y(t)$ is the output, and $A, B, C, D$ are system matrices of appropriate dimensions. Time-discrete implementations use standardized discretizations, typically zero-order hold or bilinear transforms, yielding:

$$h_k = \bar{A}\,h_{k-1} + \bar{B}\,x_k, \qquad y_k = \bar{C}\,h_k,$$

with $\bar{A} = \exp(\Delta A)$ and $\bar{B} = (\Delta A)^{-1}\bigl(\exp(\Delta A) - I\bigr)\,\Delta B$, for discretization step $\Delta$.
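For concreteness, the following is a minimal NumPy/SciPy sketch of these generic formulas (zero-order-hold discretization followed by the discrete recurrence). It assumes a single scalar input channel and makes no attempt to mirror any particular library's implementation:

```python
import numpy as np
from scipy.linalg import expm

def discretize_zoh(A, B, delta):
    """Zero-order-hold discretization:
    A_bar = exp(delta*A),  B_bar = (delta*A)^{-1} (exp(delta*A) - I) delta*B."""
    dA = delta * A
    A_bar = expm(dA)
    B_bar = np.linalg.solve(dA, A_bar - np.eye(A.shape[0])) @ (delta * B)
    return A_bar, B_bar

def ssm_scan(A_bar, B_bar, C, x):
    """Discrete recurrence h_k = A_bar h_{k-1} + B_bar x_k,  y_k = C h_k."""
    h = np.zeros(A_bar.shape[0])
    ys = []
    for x_k in x:                      # O(L) sequential scan over the input
        h = A_bar @ h + B_bar * x_k
        ys.append(C @ h)
    return np.array(ys)

# Toy usage: N = 4 hidden states, a single scalar input channel, L = 8 steps.
rng = np.random.default_rng(0)
A = -np.eye(4) + 0.1 * rng.standard_normal((4, 4))   # roughly stable continuous dynamics
B, C = rng.standard_normal(4), rng.standard_normal(4)
A_bar, B_bar = discretize_zoh(A, B, delta=0.1)
y = ssm_scan(A_bar, B_bar, C, rng.standard_normal(8))
```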
Recent advances, such as the "selective" parameterization in Mamba, predict $B$, $C$, and the step size $\Delta$ dynamically from each input token using small convolutional or linear networks (Erol et al., 2024, Hwang et al., 2024). This time-varying SSM enables input-dependent gating and adaptive memory, essential for learning complex sequence dependencies.
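A correspondingly small sketch of this input-conditioned parameterization is given below; the projection shapes and the softplus on the step size are illustrative assumptions in the spirit of Mamba-style selectivity, not the exact layout of any released model:

```python
import numpy as np

def selective_params(tokens, W_delta, W_B, W_C):
    """Predict per-token SSM parameters from the input itself.
    tokens: (L, D) sequence of token features."""
    delta = np.log1p(np.exp(tokens @ W_delta))  # softplus keeps step sizes positive, shape (L, 1)
    B = tokens @ W_B                            # (L, N): input-dependent B_k
    C = tokens @ W_C                            # (L, N): input-dependent C_k
    return delta, B, C

# Toy shapes: L = 8 tokens of width D = 16, state size N = 4.
L, D, N = 8, 16, 4
rng = np.random.default_rng(0)
delta, B, C = selective_params(rng.standard_normal((L, D)),
                               rng.standard_normal((D, 1)),
                               rng.standard_normal((D, N)),
                               rng.standard_normal((D, N)))
```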
2. Bidirectional Extension Formulations
The key innovation in bidirectional SSMs is the simultaneous propagation of state-space recurrences in both forward and backward directions along the input. For a sequence of length $L$, one computes:
- Forward recurrence:
  $$\overrightarrow{h}_k = \overrightarrow{A}_k\,\overrightarrow{h}_{k-1} + \overrightarrow{B}_k\,x_k, \qquad \overrightarrow{y}_k = \overrightarrow{C}_k\,\overrightarrow{h}_k$$
- Backward recurrence:
  $$\overleftarrow{h}_k = \overleftarrow{A}_k\,\overleftarrow{h}_{k+1} + \overleftarrow{B}_k\,x_k, \qquad \overleftarrow{y}_k = \overleftarrow{C}_k\,\overleftarrow{h}_k$$
These recurrences use separate parameter heads for forward and backward directions (parameters are often independent, though architecture-dependent sharing is possible).
The final output at each position fuses forward and backward outputs using mechanisms such as element-wise addition, concatenation followed by a linear projection, or gating (Erol et al., 2024, Zhu et al., 2024, Hwang et al., 2024). Boundary conditions are handled by zero-initializing the hidden state at the start or end of each scan.
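The NumPy sketch below ties these pieces together: two independent scans with zero-initialized boundary states, followed by one of the fusion options above (here concatenation plus a linear projection). It uses time-invariant parameters per direction for brevity, and the helper names are hypothetical rather than taken from any cited codebase:

```python
import numpy as np

def directional_scan(x, A_bar, B_bar, C, reverse=False):
    """One direction of an SSM scan; the hidden state is zero-initialized
    at the boundary where the scan starts."""
    seq = x[::-1] if reverse else x
    h = np.zeros(A_bar.shape[0])
    ys = []
    for x_k in seq:
        h = A_bar @ h + B_bar * x_k
        ys.append(C @ h)
    ys = np.array(ys)
    return ys[::-1] if reverse else ys   # re-align backward outputs with positions

def bidirectional_ssm(x, fwd_params, bwd_params, w_proj):
    """Separate forward/backward parameter heads, fused by concatenation + projection."""
    y_fwd = directional_scan(x, *fwd_params, reverse=False)
    y_bwd = directional_scan(x, *bwd_params, reverse=True)
    return np.stack([y_fwd, y_bwd], axis=-1) @ w_proj    # (L, 2) @ (2,) -> (L,)

# Toy usage: N = 4 state, scalar channel, L = 10 positions.
rng = np.random.default_rng(0)
head = lambda: (np.diag(rng.uniform(0.5, 0.9, 4)),   # A_bar (diagonal, stable)
                rng.standard_normal(4),               # B_bar
                rng.standard_normal(4))               # C
x = rng.standard_normal(10)
y = bidirectional_ssm(x, head(), head(), w_proj=rng.standard_normal(2))
```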
Matrix-mixer interpretations formalize this dual recurrence as a quasiseparable linear map:

$$y = M\,x, \qquad M = M_{\downarrow} + M_{\uparrow} + D,$$

with $M_{\downarrow}$ and $M_{\uparrow}$ corresponding to lower and upper triangular semiseparable matrices, and $D$ a diagonal (Hwang et al., 2024).
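For intuition, this structure can be materialized explicitly at toy sizes, as in the sketch below. Practical implementations never build $M$ densely (they run the two linear-time scans), and the exact Hydra parameterization differs in detail, so treat this purely as an illustration of the $M = M_{\downarrow} + M_{\uparrow} + D$ decomposition:

```python
import numpy as np

def strict_lower_ss(A_bar, B_bar, C, L):
    """Strictly lower-triangular semiseparable mixer of one causal scan:
    M[i, j] = C @ A_bar^(i-j) @ B_bar for j < i (the diagonal lives in D)."""
    M = np.zeros((L, L))
    for i in range(L):
        for j in range(i):
            M[i, j] = C @ np.linalg.matrix_power(A_bar, i - j) @ B_bar
    return M

# Materialize M = M_lower + M_upper + D at a toy size and apply it in one shot.
rng = np.random.default_rng(0)
N, L = 4, 6
head = lambda: (np.diag(rng.uniform(0.5, 0.9, N)), rng.standard_normal(N), rng.standard_normal(N))

M_lower = strict_lower_ss(*head(), L)        # forward (causal) contribution
M_upper = strict_lower_ss(*head(), L).T      # backward contribution, as an upper-triangular map
D = np.diag(rng.standard_normal(L))          # per-position diagonal (skip) weights
M = M_lower + M_upper + D                    # quasiseparable matrix mixer

x = rng.standard_normal(L)
y = M @ x                                    # equivalent to the two scans plus the diagonal term
```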
3. Architectural Realizations and Domain-Specific Variants
Bidirectional SSM modules are used as drop-in replacements for attention in deep encoders, forming the core of new architectures (a block-level sketch of this pattern follows the list below):
- Audio Mamba (AuM): Replaces all self-attention blocks in the spectrogram patch encoder with bidirectional SSM blocks. The architecture mimics ViT/AST in patchification, positional embeddings, and layer stacking, but achieves linear time/memory scaling in sequence length, enabling efficient long-context audio classification (Erol et al., 2024).
- Vision Mamba (Vim): Processes image patches via bidirectional SSM blocks stacked in transformer-like encoders. Forward and backward recurrences are run over the flattened sequence of patch embeddings, with fusion by concatenation or gating. The full backbone handles ImageNet-scale inputs with substantial speed and memory advantages over self-attention (Zhu et al., 2024).
- Hydra: Proposes a principled bidirectional SSM via quasiseparable matrix mixers, yielding a coherent generalization of Mamba to non-causal tasks in language and vision (GLUE, ImageNet). The matrix mixer formalism highlights that both SSMs and attention are special cases of structured sequence mixers (Hwang et al., 2024).
- Recommendation, Document, Graph, and Speech Models: EchoMamba4Rec (Wang et al., 2024) interleaves bidirectional SSMs with FFT-based spectral filtering and GLUs; DocMamba (Hu et al., 2024) uses a Segment-First Bidirectional Scan structured around document layout for token orderings; Graph Mamba (Behrouz et al., 2024) and XLSR-Mamba (Xiao et al., 2024) apply domain-adapted bidirectional SSMs to graph token sequences and speech features, respectively.
- Language Modeling: Birdie (Blouir et al., 2024) introduces bidirectional SSMs for prefix-LM style architectures, enabling dense context use for both retrieval and next-token prediction—critically advancing SSMs' ability to close the retrieval gap with transformers.
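As a structural illustration of the shared "drop-in replacement" pattern, the sketch below swaps the attention sublayer of a standard pre-norm encoder block for a bidirectional token mixer; the stand-in mixer and MLP are placeholders, not the actual AuM/Vim/Hydra modules:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Per-token layer normalization (learned scale/shift omitted for brevity)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def encoder_block(x, bidirectional_mixer, mlp):
    """Pre-norm residual block with the attention sublayer replaced by a
    bidirectional SSM token mixer; the surrounding structure mirrors a ViT/AST block."""
    x = x + bidirectional_mixer(layer_norm(x))   # token mixing (replaces self-attention)
    x = x + mlp(layer_norm(x))                   # channel mixing (unchanged MLP)
    return x

# Toy usage: L = 8 tokens, D = 16 channels, with stand-in sublayers.
rng = np.random.default_rng(0)
L, D = 8, 16
W_mix = 0.1 * rng.standard_normal((D, D))                 # placeholder for the bidirectional SSM
W1, W2 = rng.standard_normal((D, 4 * D)), rng.standard_normal((4 * D, D))
mixer = lambda z: z @ W_mix
mlp = lambda z: np.maximum(z @ W1, 0.0) @ W2              # two-layer ReLU MLP
y = encoder_block(rng.standard_normal((L, D)), mixer, mlp)
```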
4. Computational Scaling and Complexity
Bidirectional SSMs fundamentally exploit the recurrent nature of state-space updates to achieve linear time and memory scaling in sequence length ($O(LN)$, with $L$ the sequence length and $N$ the hidden state dimension), in contrast to the quadratic ($O(L^2)$) scaling of self-attention used in vanilla Transformers.
The following table summarizes scaling laws reported in the literature:
| Model | Time Complexity | Memory Complexity |
|---|---|---|
| Self-attention | $O(L^2)$ | $O(L^2)$ |
| Unidirectional SSM | $O(LN)$ (recurrent) or $O(L \log L)$ (convolutional) | $O(LN)$ or $O(L)$ |
| Bidirectional SSM | $O(LN)$ or $O(L \log L)$ | $O(LN)$ |

Here $L$ denotes sequence length and $N$ the SSM state dimension; constant factors in model width are omitted.
In multiple domains, parameter counts of bidirectional SSMs remain on par with equivalently sized transformers (e.g., AuM-Base/16: 86M, AST-Base/16: 88M), but SSM-based models yield dramatic reductions in runtime and activation memory for long sequences (Erol et al., 2024, Zhu et al., 2024, Hwang et al., 2024, Hu et al., 2024).
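As an illustrative back-of-the-envelope comparison (assumed values of $L = 16{,}384$ and $N = 16$, not figures from the cited papers): self-attention materializes an $L \times L$ interaction matrix with $L^2 \approx 2.7 \times 10^8$ entries per head per layer, whereas forward and backward SSM scans together carry only $2LN \approx 5.2 \times 10^5$ state values for the mixing step, roughly a 512× reduction in activation footprint before constant factors are considered.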
5. Fusion Strategies and Gating Mechanisms
Fusion of the forward and backward SSM outputs is crucial to maximize the utility of bidirectional context. Common fusion strategies include:
- Element-wise Addition: $y_k = \overrightarrow{y}_k + \overleftarrow{y}_k$ (straightforward, parameter-efficient) (Erol et al., 2024).
- Concatenation + Linear Projection: $y_k = W\,[\overrightarrow{y}_k \,;\, \overleftarrow{y}_k]$, with optional gating (Zhu et al., 2024, Erol et al., 2024, Hwang et al., 2024, Xiao et al., 2024).
- Gated Fusion: Compute a gate $g_k = \sigma\!\left(W_g\,[\overrightarrow{y}_k \,;\, \overleftarrow{y}_k]\right)$, then fuse via $y_k = g_k \odot \overrightarrow{y}_k + (1 - g_k) \odot \overleftarrow{y}_k$ (Hu et al., 2024, Behrouz et al., 2024).
- Matrix Mixer/Quasiseparable Parameterization: Hydra assigns independent learnable weights to forward, backward, and diagonal contributions, synthesizing a flexible bidirectional mixer (Hwang et al., 2024).
These designs enable each position to receive adaptive, context-aware information from both past and future through lightweight mechanisms compatible with hardware acceleration.
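A compact sketch of these three fusion variants follows; the exact gate parameterizations vary across the cited models, so the formulas below are representative rather than model-specific:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fuse_add(y_fwd, y_bwd):
    """Element-wise addition of the two directional outputs."""
    return y_fwd + y_bwd

def fuse_concat(y_fwd, y_bwd, W):
    """Concatenation followed by a linear projection back to model width."""
    return np.concatenate([y_fwd, y_bwd], axis=-1) @ W          # (L, 2D) @ (2D, D)

def fuse_gated(y_fwd, y_bwd, W_g):
    """Gated fusion: a learned, position-wise convex combination of the directions."""
    g = sigmoid(np.concatenate([y_fwd, y_bwd], axis=-1) @ W_g)  # gate, shape (L, D)
    return g * y_fwd + (1.0 - g) * y_bwd

# Toy usage: L = 8 positions, D = 16 channels per direction.
rng = np.random.default_rng(0)
L, D = 8, 16
y_fwd, y_bwd = rng.standard_normal((L, D)), rng.standard_normal((L, D))
out_add    = fuse_add(y_fwd, y_bwd)
out_concat = fuse_concat(y_fwd, y_bwd, rng.standard_normal((2 * D, D)))
out_gated  = fuse_gated(y_fwd, y_bwd, rng.standard_normal((2 * D, D)))
```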
6. Empirical Results Across Domains
Bidirectional SSMs have achieved state-of-the-art or competitive results against transformer baselines across a range of tasks, with major findings including:
- Audio: AuM-B/16 surpasses AST-B/16 on AudioSet Full (32.43 vs 29.10 mAP), AudioSet Balanced (+2.87 mAP), VGGSound (+5.33% accuracy), VoxCeleb (+5.90% accuracy), and SpeechCommands V2 (+6.32%) while requiring only a fraction of the time and memory (Erol et al., 2024).
- Vision: Vim-Base outperforms DeiT-Base on ImageNet (83.2% vs 79.8% Top-1), while slashing GPU memory usage by up to 86.8% at high resolutions (Zhu et al., 2024). Hydra improves ViT-Base Top-1 accuracy by +2.2 pts and BERT-Base GLUE average by +0.8 pts (Hwang et al., 2024).
- Recommendation: EchoMamba4Rec boosts HR@10 from 0.0781 (Bi‐Mamba4Rec) to 0.0833 on Amazon-Beauty; delivers 2.5× faster per-epoch training and uses ~1.6GB GPU memory (Wang et al., 2024).
- Document Understanding: DocMamba exceeds LayoutLMv3 on all downstream tasks (FUNSD F1 91.7% vs 90.3%, CORD 97.0%) and provides a 2.4× inference speedup with linear memory scaling, making it suitable for length extrapolation (Hu et al., 2024).
- Speech: XLSR-Mamba achieves state-of-the-art EER/min t-DCF on ASVspoof 2021 by embedding a dual-column bidirectional SSM block atop XLS-R features (Xiao et al., 2024).
- Graph Learning: Graph Mamba matches or surpasses transformer-based GNNs in long-range and heterophilic benchmarks, with linear scaling benefits and no reliance on costly positional/structural encodings (Behrouz et al., 2024).
- Language Modeling: Birdie brings SSMs much closer to transformers on long-range retrieval (phonebook, QA) and story infilling while preserving linear computational costs. Birdie's curriculum learning strongly outperforms fixed objective mixtures (Blouir et al., 2024).
7. Generalization, Limitations, and Research Directions
Bidirectional SSMs retain several strengths and present emerging challenges:
- Generalization and Flexibility: Empirical evidence indicates that such models generalize as well as, or better than, attention mechanisms to new domains and long inputs. Notably, DocMamba’s SSM can extrapolate document length well beyond pretraining caps (Hu et al., 2024).
- Model Stability: Training deep stacks of SSM blocks may exhibit decreased stability, mitigated by careful application of normalization and residuals (Erol et al., 2024).
- State Dimension Constraints: To maintain efficient scans and convolutions, the SSM state dimension is often constrained to a small value (e.g., $N \approx 16$ in common Mamba configurations), potentially limiting modeling capacity for extremely long dependencies (Erol et al., 2024, Hwang et al., 2024).
- Limitations on Fine-Grained Cross-Token Interaction: The structured, convolutional-style recurrences of SSMs—while hardware efficient—do not inherently provide the pairwise interaction flexibility of attention, possibly affecting tasks with ultrafine relational dynamics unless augmented with gating or dynamic fusion (Hwang et al., 2024, Behrouz et al., 2024).
- Architecture-Specific Considerations: For example, in Audio Mamba, correct placement of classification tokens is vital to achieve strong results (Erol et al., 2024). XLSR-Mamba shows that explicit dual-column fusion often outperforms uni- or single-column designs (Xiao et al., 2024).
A plausible implication is that bidirectional SSMs will continue to subsume an increasing range of formerly attention-dominated settings as implementations, training regimes, and fusion mechanisms advance. Open problems include scaling such mechanisms to very large models while retaining hardware efficiency, exploring new fusion or memory gating techniques, and designing specialized objectives or curricula to further narrow domain-specific performance gaps (Blouir et al., 2024).