Bidirectional State-Space Model (bi-SSM)
- Bidirectional SSM is an architectural paradigm that extends classical state-space models by processing sequences in both forward and backward directions to achieve full-context representation.
- It achieves linear computational complexity through dual scanning, merging the two passes with efficient fusion techniques, such as summation or gated mechanisms, to model long-range dependencies.
- Integration with Transformers and domain-specific architectures has enabled robust applications in vision, audio, and point cloud analysis with measurable performance improvements.
A Bidirectional State-Space Model (bi-SSM) is an architectural paradigm for sequence modeling that augments the classical state-space model by scanning input sequences both in forward (causal) and backward (anti-causal) directions. This design provides a full receptive field at every position, enabling context aggregation from both the past and future, and achieves this with linear computational complexity in sequence length. Modern bi-SSMs derive from recent advances exemplified by the Mamba architecture and are widely adopted in deep learning settings for vision, audio, and sequence analysis, offering a scalable alternative to attention-based models for tasks involving long dependencies and dense nonlocal structure (Chen et al., 2024).
1. Mathematical Structure and Bidirectional Recurrence
The bi-SSM architecture extends the classical discrete-time linear state-space representation

$$h_t = A h_{t-1} + B x_t, \qquad y_t = C h_t + D x_t,$$

where $A$ is the state transition matrix, $B$ is the input mapping, $C$ and $D$ are output mappings, and $h_t$, $x_t$, $y_t$ denote the hidden state, input, and output, respectively.
In the bidirectional formulation, two passes are computed:
- Forward SSM: Processes the sequence $(x_1, \dots, x_L)$ with initial state $h_0 = 0$, aggregating the past up to position $t$.
- Backward SSM: Processes the sequence in reverse $(x_L, \dots, x_1)$, with initial state $h_{L+1} = 0$, and thus aggregates future context from position $t+1$ forward.
The outputs are merged by simple addition, $y_t = y_t^{\mathrm{f}} + y_t^{\mathrm{b}}$, or, optionally, gated fusion,

$$y_t = g_t \odot y_t^{\mathrm{f}} + (1 - g_t) \odot y_t^{\mathrm{b}}, \qquad g_t = \sigma\!\left(W \big[\, y_t^{\mathrm{f}} \,;\, y_t^{\mathrm{b}} \,\big]\right),$$

where $\sigma$ is the sigmoid and $[\,\cdot\,;\,\cdot\,]$ denotes concatenation. This approach ensures that every token's feature is informed by both its left and right neighbors (past and future in the input chain). The computational cost remains linear in sequence length for each scan (Chen et al., 2024).
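The dual-scan recurrence can be sketched in a few lines of NumPy. This is an illustrative simplification: production Mamba-style SSMs use input-dependent, diagonal parameterizations and hardware-aware parallel scan kernels, and the dense matrices and Python loop here stand in for those.

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Causal linear SSM scan: h_t = A h_{t-1} + B x_t, y_t = C h_t.

    x: (L, d_in) input sequence; A: (n, n); B: (n, d_in); C: (d_out, n).
    The hidden state starts at zero. Returns y: (L, d_out).
    """
    h = np.zeros(A.shape[0])
    y = np.empty((x.shape[0], C.shape[0]))
    for t in range(x.shape[0]):
        h = A @ h + B @ x[t]
        y[t] = C @ h
    return y

def bidirectional_ssm(x, A, B, C):
    """Additive-fusion bi-SSM: forward scan plus a backward scan over the
    reversed sequence, re-reversed so positions align before summing."""
    y_fwd = ssm_scan(x, A, B, C)
    y_bwd = ssm_scan(x[::-1], A, B, C)[::-1]
    return y_fwd + y_bwd
```

A quick check of the full-receptive-field claim: perturbing the last input changes the bidirectional output at position 0, whereas the causal scan alone is unaffected there.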
2. Bidirectional SSM Blocks: Implementation and Pseudocode
A prototypical bi-SSM block, such as in PointABM, consists of:
- Input LayerNorm
- Forward SSM pass (executed left-to-right)
- Backward SSM pass (executed right-to-left)
- Fusion step (either sum or gating)
- Per-token nonlinearity and output linear projection
- Residual addition
This block pattern, as used in PointABM, provides a template for efficiently implementing and stacking bi-SSM layers (Chen et al., 2024).
3. Integration with Hybrid and Domain-specific Architectures
Bidirectional SSMs have been successfully integrated with Transformer-based architectures and specialized sequence models for various modalities:
- Point Cloud Analysis: In PointABM, a Transformer block first encodes point cloud patches, followed by a stack of bi-SSM blocks that enhance local and global feature extraction. The hybrid architecture leverages bidirectional context via SSM while Transformer heads model arbitrary spatial relations (Chen et al., 2024).
- Human Pose Estimation: Models such as PoseMamba and MV-SSM utilize bi-SSMs over spatial (joint) and temporal (frame) axes, often in combination with global-local permutations or projective-attention frontends. These designs systematically scan skeleton or keypoint structures in both directions, fusing the outputs to encode both kinematic chains and global pose information (Huang et al., 2024, Chharia et al., 31 Aug 2025).
- Audio and Video: Audio Mamba replaces quadratic-cost self-attention with SSM dual-pass blocks; video diffusion architectures similarly swap temporal attention for bi-SSM layers. In both cases, bidirectionality enables aggregation of information from both past and future events, crucial for representation learning in sequential data (Erol et al., 2024, Oshima et al., 2024).
- Hyperspectral Imaging: HSIDMamba employs an eight-directional bidirectional continuous scan mechanism, applying bi-SSMs along multiple flattenings (spatial-spectral paths) to enforce multi-view consistency without quadratic attention cost (Liu et al., 2024).
4. Efficiency, Complexity, and Training
Bidirectional SSMs exploit the inherent linearity of the SSM scan to ensure scalability:
- Each (forward or backward) SSM scan costs $O(L \cdot N)$ per feature channel, where $L$ is the sequence length and $N$ the state dimension, as opposed to attention's $O(L^2)$ cost.
- Bidirectional operation only doubles this cost, maintaining $O(L)$ scaling.
- Practical architectures (e.g., PointABM) stack multiple bi-SSM layers, with typical configurations fixing the SSM state and token/feature dimensions and using the AdamW optimizer, cosine decay schedules, and substantial pretraining (e.g., masked autoencoders on large datasets), as empirically verified on point cloud benchmarks (Chen et al., 2024).
Stacked bi-SSM blocks or complex variants (e.g., PoseMamba: four direction scans, global/local fuse) preserve linearity by performing all directional scans independently and summing, rather than entangling quadratic dependencies (Huang et al., 2024).
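The scaling argument above can be made concrete with back-of-the-envelope multiply counts. The formulas below are rough first-order estimates (constants and projection layers omitted), not a measured benchmark:

```python
def bi_ssm_cost(L, d, n):
    """Rough multiply count for one bi-SSM layer: two scans, each O(L * d * n)."""
    return 2 * L * d * n

def attention_cost(L, d):
    """Rough multiply count for one self-attention layer:
    O(L^2 * d) for the QK^T scores plus the attention-weighted values."""
    return 2 * L * L * d

# At L = 4096 tokens, d = 256 features, n = 16 states, the linear scan is
# orders of magnitude cheaper, and the gap widens as L grows.
print(bi_ssm_cost(4096, 256, 16), attention_cost(4096, 256))
```

Doubling $L$ doubles the bi-SSM estimate but quadruples the attention estimate, which is exactly the $O(L)$ versus $O(L^2)$ distinction the text describes.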
5. Empirical Impact and Comparative Performance
Task-specific empirical results confirm the utility of bidirectional context:
- 3D Point Cloud Analysis: PointABM (bi-SSM + Transformer) surpasses both pure-Mamba and pure-Transformer baselines (e.g., +1.6% accuracy on PB-T50-RS) (Chen et al., 2024).
- Multi-view Human Pose Estimation: MV-SSM demonstrates a +10.8 AP improvement over the state of the art on rare 3-camera setups; removing bidirectional GTBS scanning in ablations leads to marked drops (AP falls by ∼6 points, MPJPE rises by ∼3 mm) (Chharia et al., 31 Aug 2025).
- Video Diffusion: Linear-time bi-SSM blocks enable generation/training with hundreds of video frames (where full attention is memory-infeasible), matching or outperforming linear-attention and attention-based baselines in FVD (Oshima et al., 2024).
- Audio: AuM matches or exceeds Audio Spectrogram Transformer in multiple audio classification tasks while consistently requiring less GPU memory for long spectrograms (Erol et al., 2024).
- HSI Denoising: HSIDMamba (with bi-SSM blocks) reports a 30% speedup over Transformer baselines and the best PSNR/SSIM/SAM on the ICVL and CAVE benchmarks (Liu et al., 2024).
- Anomalous Diffusion: Bi-Mamba achieves best F1 (0.91) and improved MAE/MSLE on AnDi-2 challenge, with stable linear scaling in trajectory length (Lavaud et al., 2024).
Bidirectionality is consistently critical: ablations across domains show that unidirectional SSM is insufficient to match attention or achieve robust long-range generalization.
6. Theoretical and Practical Significance
Bidirectional SSMs offer several distinct theoretical and applied benefits:
- Full-context modeling: Aggregation of entire sequence context at each token, analogously to bidirectional RNNs but without the vanishing gradient or inefficiency issues.
- Linear complexity: Full-sequence receptive fields are achieved at linear cost, enabling deployment on long sequences where quadratic-cost attention is infeasible.
- Stability and generalization: Bidirectionally fused latent representations (forward/backward scans) improve supervision signal propagation and stabilize training dynamics. In pose and multi-view domains, bi-SSMs promote structural regularization and robustness to sampling arrangements (Chen et al., 2024, Chharia et al., 31 Aug 2025).
- Flexibility of scan definition: Variants such as Mamba3D scan along token and channel axes, or HSIDMamba along multiple spatial-spectral orders, enabling model adaptation to arbitrary data topology or sequence structure (Han et al., 2024, Liu et al., 2024).
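The flexibility of scan definition reduces, in practice, to choosing permutations of the token set before scanning. The sketch below generates four scan orders over a 2-D token grid (row-major and column-major, each in both directions); function names and the four-order choice are illustrative, echoing the multi-directional schemes cited above rather than reproducing any one of them:

```python
import numpy as np

def scan_orders(grid):
    """Four 1-D scan orders over an (H, W, d) token grid: row-major,
    reversed row-major, column-major, reversed column-major.
    A multi-directional SSM runs one scan per order, then realigns
    each output to the original grid positions and sums."""
    H, W, d = grid.shape
    rows = grid.reshape(H * W, d)
    cols = grid.transpose(1, 0, 2).reshape(H * W, d)
    return [rows, rows[::-1], cols, cols[::-1]]
```

Eight-directional variants such as HSIDMamba's extend the same idea with additional flattenings (e.g., spectral-axis orderings); since each scan remains $O(L)$, adding directions multiplies cost by a small constant only.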
Limitations are acknowledged: while bi-SSMs excel at sequence modeling with long-range dependencies, hybrid strategies or additional attention heads may further boost performance at higher spatial resolutions or for tasks requiring rich pairwise interactions.
7. Representative Implementations and Future Directions
Summary of bi-SSM block designs across domains:
| Application | Bidirectional Scan Axis | Fusion Mechanism | Reported Benefit |
|---|---|---|---|
| PointABM, Mamba3D | Token, Channel | Sum (or gate) | Linear complexity, full context |
| PoseMamba, MV-SSM | Space (joints), Time | Sum | Richer global/local dependencies |
| HSIDMamba | 8 spatial-spectral orders | Sum + normalization | Multi-view spectral coherence |
| AuM, Video Diffusion (SSM+GLU/MLP) | Time (forward/backward) | Sum + MLP | Enables very long sequence handling |
| Bi-Mamba (anomalous diffusion) | Time (forward/backward) | Concatenation + FFN | Stable regression/classification |
Across these, the emerging consensus is that bidirectional SSMs fill a crucial gap between expensive full-attention and RNNs, supporting linear-time global modeling with strong empirical performance. Future work is likely to focus on adaptive/learned scan orders, hybridization with attention or convolution, and expanding the SSM formalism to richer connectivity topologies (Chen et al., 2024, Huang et al., 2024, Liu et al., 2024, Chharia et al., 31 Aug 2025).