Bidirectional Mamba State-Space Modeling
- Bidirectional Mamba state-space modeling is a deep sequence modeling approach that incorporates both past and future context by fusing forward and backward hidden states.
- It uses small feed-forward networks to produce input-dependent (selective) SSM parameters, and fuses the forward and backward streams through schemes such as summation, concatenation with projection, and data-dependent gating.
- The architecture maintains linear compute and memory scaling, making it competitive with Transformers across modalities like speech, vision, time series, and graphs.
Bidirectional Mamba State-Space Modeling refers to a class of deep sequence models that extend the “Mamba” family of selective state-space models (SSMs) with parallel or alternating forward and backward state-propagation mechanisms. This bidirectional processing architecture allows hidden state updates at each sequence position to incorporate both past and future context, thereby overcoming the limitations of causality inherent in traditional unidirectional SSMs. Bidirectional Mamba architectures are instantiated across diverse modalities, including speech, vision, graph, time series, and multimodal molecular modeling, and are characterized by their hardware efficiency—scaling in compute and memory linearly with sequence length—while rivaling or surpassing Transformer performance on key tasks.
1. Core Mathematical Principles
The foundational principle of Bidirectional Mamba State-Space Modeling is the discrete-time SSM, which for an input sequence $x_{1:T}$ produces a hidden state $h_t$ and output $y_t$ by

$$h_t = \bar{A}_t h_{t-1} + \bar{B}_t x_t, \qquad y_t = C_t h_t,$$

with matrices $\bar{A}_t$, $\bar{B}_t$, $C_t$ (possibly vectors or even scalars per channel in compact implementations) that may depend on the input $x_t$ at time $t$. Bidirectional extensions introduce a simultaneous backward scan

$$\overleftarrow{h}_t = \bar{A}'_t\,\overleftarrow{h}_{t+1} + \bar{B}'_t x_t,$$

so that at each position both $\overrightarrow{h}_t$ (forward) and $\overleftarrow{h}_t$ (reverse) are available. These are fused, typically by summing, concatenation and projection, or data-dependent gating, before further layer processing or downstream prediction (Masuyama et al., 2024, Zhu et al., 2024, Lavaud et al., 2024).
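As a concrete, deliberately simplified illustration of these recurrences, the sketch below runs a sequential forward scan and a scan over the reversed sequence with diagonal per-channel dynamics and sum fusion; the function names (`selective_scan`, `bidirectional_scan`) and the toy parameterization are illustrative assumptions, not taken from any cited implementation:

```python
import torch

def selective_scan(x, A, B, C):
    """Sequential reference recurrence: h_t = A_t * h_{t-1} + B_t * x_t, y_t = C_t * h_t.

    x: (L, d) input sequence; A, B, C: (L, d) per-step, per-channel (diagonal) parameters.
    """
    L, d = x.shape
    h = torch.zeros(d)
    ys = []
    for t in range(L):
        h = A[t] * h + B[t] * x[t]
        ys.append(C[t] * h)
    return torch.stack(ys)                      # (L, d)

def bidirectional_scan(x, params_fwd, params_bwd, fusion="sum"):
    """Forward scan plus a scan over the reversed sequence (flipped back), then fusion."""
    y_fwd = selective_scan(x, *params_fwd)
    y_bwd = selective_scan(x.flip(0), *params_bwd).flip(0)
    if fusion == "sum":
        return y_fwd + y_bwd
    return torch.cat([y_fwd, y_bwd], dim=-1)    # concat variant; caller projects back to d

def random_params(L, d):
    # decay factors in [0, 0.9) keep the toy recurrence stable
    return torch.rand(L, d) * 0.9, torch.randn(L, d), torch.randn(L, d)

L, d = 16, 4
x = torch.randn(L, d)
y = bidirectional_scan(x, random_params(L, d), random_params(L, d))
print(y.shape)  # torch.Size([16, 4])
```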
Mamba’s “selective” property leverages input-dependent modulations through small feed-forward networks that produce SSM parameters (e.g., the step size $\Delta_t$ and projections $B_t$, $C_t$) conditioned on $x_t$, enabling flexible, context-sensitive dynamics at each step (Masuyama et al., 2024).
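A hedged sketch of this selection mechanism, assuming the common recipe of projecting the input to a positive step size $\Delta_t$ and to per-step $B_t$, $C_t$, then discretizing; the Euler-style approximation $\bar{B}_t \approx \Delta_t B_t$ is used for brevity, whereas exact zero-order-hold discretization differs:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveParams(nn.Module):
    """Produce input-dependent, discretized SSM parameters per time step and channel."""

    def __init__(self, d_model: int):
        super().__init__()
        self.A_log = nn.Parameter(torch.zeros(d_model))   # continuous-time diagonal A (learned)
        self.to_delta = nn.Linear(d_model, d_model)       # x_t -> step size
        self.to_B = nn.Linear(d_model, d_model)           # x_t -> input projection
        self.to_C = nn.Linear(d_model, d_model)           # x_t -> output projection

    def forward(self, x):                                  # x: (L, d_model)
        delta = F.softplus(self.to_delta(x))               # positive step sizes, (L, d)
        A = -torch.exp(self.A_log)                         # negative diagonal for stability, (d,)
        A_bar = torch.exp(delta * A)                       # discretized transition in (0, 1), (L, d)
        B_bar = delta * self.to_B(x)                       # simplified Euler discretization, (L, d)
        C = self.to_C(x)                                   # per-step readout, (L, d)
        return A_bar, B_bar, C
```

The per-step tensors returned here can be fed directly to the forward and backward scans sketched above.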
2. Architectural Variants and Fusion Schemes
Bidirectional Mamba can be implemented using several fusion paradigms:
- Serial and Parallel Blocks: In speech modeling (e.g., MADEON), both “serial” (forward SSM, then speech-token reversal, then backward SSM) and “parallel” (forward and backward SSMs applied to the same normalized, projected input, with the two outputs then fused) designs are used (Masuyama et al., 2024).
- Concatenation and Gating: Vision Mamba and Graph Mamba fuse forward and backward outputs by concatenation followed by linear or gated fusion. For example, a concatenation-based fusion computes $h_t = W\,[\overrightarrow{h}_t ; \overleftarrow{h}_t]$, while a data-dependent gate $g_t \in (0,1)$ yields $h_t = g_t \odot \overrightarrow{h}_t + (1 - g_t) \odot \overleftarrow{h}_t$ (Zhu et al., 2024); a compact sketch of these fusion options follows this list.
- Local and Global Bidirectionality: LBMamba (Locally Bi-directional Mamba) adds a backward recurrence only within local windows in each CUDA thread, combining the result with the forward hidden state, thus achieving local bidirectional context at minimal extra computational cost. Alternating scan direction every two layers recovers global receptive field while avoiding a full global backward pass (Zhang et al., 19 Jun 2025).
- Self-fusion with Residuals: Audio Mamba fuses forward and backward hidden states by summation, $h_t = \overrightarrow{h}_t + \overleftarrow{h}_t$, followed by residual connections, normalization, and feed-forward mixing (Erol et al., 2024); similar patterns appear in the graph and time-series domains.
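The fusion options named in this list can be compared in a single toy module; `BiFusion`, its modes, and the choice of computing the gate from the raw input are placeholder assumptions rather than any paper's exact API:

```python
import torch
import torch.nn as nn

class BiFusion(nn.Module):
    """Fuse forward and backward SSM outputs with one of three common schemes."""

    def __init__(self, d_model: int, mode: str = "gate"):
        super().__init__()
        self.mode = mode
        self.proj = nn.Linear(2 * d_model, d_model)        # used by the concat variant
        self.gate = nn.Linear(d_model, d_model)            # used by the gated variant

    def forward(self, h_fwd, h_bwd, x):                    # all tensors: (L, d_model)
        if self.mode == "sum":                             # summation (self-fusion)
            return h_fwd + h_bwd
        if self.mode == "concat":                          # concatenate, project back to d_model
            return self.proj(torch.cat([h_fwd, h_bwd], dim=-1))
        g = torch.sigmoid(self.gate(x))                    # data-dependent gate in (0, 1)
        return g * h_fwd + (1.0 - g) * h_bwd

# usage: fuse = BiFusion(d_model=64, mode="concat"); y = fuse(h_fwd, h_bwd, x)
```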
3. Applications Across Modalities
Bidirectional Mamba state-space modeling has demonstrated efficacy in a wide range of domains:
- Speech Recognition and Separation: MADEON deploys serial and parallel bidirectional speech prefixing over tokenized inputs, significantly reducing word error rates and outperforming non-selective or unidirectional SSMs while scaling sub-quadratically. Dual-path Mamba and SepMamba leverage dual-path (short-/long-term) processing and U-Net-style architectures, respectively, maintaining linear complexity with competitive SI-SNRi and SDRi on the WSJ0-2mix and LibriSpeech benchmarks (Masuyama et al., 2024, Jiang et al., 2024, Avenstrup et al., 2024).
- Vision and Medical Imaging: Vision Mamba (ViM), Surface Vision Mamba, and ABS-Mamba exploit spatial patch ordering and bidirectional SSM fusions for context aggregation, achieving superior accuracy, throughput, and memory efficiency compared to Transformer backbones in tasks spanning ImageNet, COCO, ADE20K, and spherical/cortical manifold segmentation (Zhu et al., 2024, He et al., 24 Jan 2025, Yuan et al., 12 May 2025).
- Time Series Forecasting: Bi-Mamba+ incorporates forward and reversed SSM passes and combines them with gating and adaptive tokenization strategies, yielding state-of-the-art performance over 8 real-world multivariate datasets (Liang et al., 2024).
- Graph Learning: Graph Mamba Networks (GMNs) employ bidirectional SSM “token” sequences constructed from local subgraph and random-walk encodings to systematically propagate context along both graph-theoretic “directions”, empirically mitigating the over-squashing phenomenon (Behrouz et al., 2024).
- Sequential Recommendation and Multimodal Fusion: EchoMamba4Rec applies spectral filtering and bidirectional Mamba for sequential recommendation, showing systematic gains in HR@10 and NDCG@10. CrossLLM-Mamba fuses embeddings from biological LLMs for RNA interaction prediction via bidirectional SSM “alignment”, outperforming static fusion schemes and setting new benchmarks on RPI and binding affinity (Wang et al., 2024, Sadia et al., 23 Feb 2026).
- Biomedical Signals and Video: SR-Mamba and UltraLBM-UNet deploy bidirectional selective SSMs within surgical phase recognition and lightweight U-Net architectures respectively for robust context modeling and efficiency (Cao et al., 2024, Fan et al., 25 Dec 2025).
4. Computational Complexity and Efficiency
A defining advantage of bidirectional Mamba models is that they preserve the base SSM's $O(L \cdot N)$ per-channel compute and memory scaling, where $L$ is the sequence length and $N$ the SSM state size, even under bidirectional extensions. Whereas Transformer attention incurs quadratic compute and memory in sequence length, and naively adding a second, global backward SSM sweep would double Mamba's own cost, bidirectional Mamba variants use either in-thread local backward passes (LBMamba) or parallel forward/backward scans that do not significantly expand parameter count or runtime (Zhu et al., 2024, Zhang et al., 19 Jun 2025, Avenstrup et al., 2024).
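A back-of-envelope comparison under the usual cost model (scan work proportional to $L \cdot N \cdot d$ per direction, attention work proportional to $L^2 \cdot d$) makes the scaling gap concrete; the dimensions below are illustrative assumptions only:

```python
def ssm_scan_cost(L: int, d: int, N: int) -> int:
    """Approximate multiply-adds for one bidirectional selective-scan layer:
    two directional scans, each ~ L * d * N state updates."""
    return 2 * L * d * N

def attention_cost(L: int, d: int) -> int:
    """Approximate multiply-adds for one self-attention layer:
    QK^T and (scores)V, each ~ L^2 * d."""
    return 2 * L * L * d

for L in (1_000, 10_000, 100_000):
    s, a = ssm_scan_cost(L, d=768, N=16), attention_cost(L, d=768)
    print(f"L={L:>7}: scan ~{s:.2e} MACs, attention ~{a:.2e} MACs, ratio {a / s:.0f}x")
```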
Benchmark results confirm these claims: MADEON-2SP trains faster (6 h vs. 8 h) and with roughly half the memory (20 GB vs. 40 GB) of Transformer baselines, with comparable parameter counts and test accuracy on LibriSpeech and GigaSpeech (Masuyama et al., 2024). Vision Mamba achieves higher throughput than DeiT while saving 86.8% GPU memory at high resolutions (Zhu et al., 2024). UltraLBM-UNet incurs zero parameter overhead for bidirectionality by sharing weights between the two directions (Fan et al., 25 Dec 2025).
5. Empirical Performance and Ablation
Bidirectional Mamba models consistently deliver superior or comparable accuracy to Transformers and nonlinear attention-free baselines:
| Architecture | Main Task | Gain from Bidirectionality | Source |
|---|---|---|---|
| MADEON-2SP | ASR (LibriSpeech) | –0.5% WER dev-clean vs uniSSMs | (Masuyama et al., 2024) |
| Vision Mamba (ViM) | ImageNet, COCO, ADE20k | – pts vs DeiT/backbones | (Zhu et al., 2024) |
| Dual-path Mamba | Speech Separation (WSJ0-2mix) | dB SI-SNRi over forward-only | (Jiang et al., 2024) |
| Graph Mamba Networks (GMN) | Node Classification (Long-Range) | pts accuracy on heterophilic | (Behrouz et al., 2024) |
| Bi-Mamba+ | Time-Series Forecasting | –3.25% MSE vs forward-only | (Liang et al., 2024) |
| EchoMamba4Rec | Sequential Recommendation | % HR@10 vs uniMamba | (Wang et al., 2024) |
| CrossLLM-Mamba | RNA–protein interaction | +4.4 pts MCC vs static fusion | (Sadia et al., 23 Feb 2026) |
| SR-Mamba | Surgical Phase Recognition | +4.4 pts accuracy vs uniMamba | (Cao et al., 2024) |
| UltraLBM-UNet | Skin Lesion Segmentation | +1.17 pt IoU, no extra params | (Fan et al., 25 Dec 2025) |
| LBMamba/LBVim | ImageNet, Pathology MIL | +0.8–1.6% accuracy under constraint | (Zhang et al., 19 Jun 2025) |
Ablation studies consistently demonstrate large degradations (up to 4.4 pp accuracy, 3.25% MSE, or multiple dB SI-SNRi) when replacing bidirectional blocks with single-directional SSMs (Masuyama et al., 2024, Liang et al., 2024, Jiang et al., 2024, Cao et al., 2024).
6. Algorithmic and Implementation Details
Bidirectional Mamba implementations share several algorithmic and engineering optimizations:
- Hardware-aware parallel scan: Both forward and backward recurrences are executed in a parallel “scan” pattern, exploiting GPU-friendly segmented prefix algorithms (a didactic version of the underlying associative scan is sketched after this list).
- Parameter sharing: Some implementations (e.g., UltraLBM-UNet) share weights between forward and backward passes, minimizing parameter count while maximizing performance (Fan et al., 25 Dec 2025).
- Context fusion: Fusion may be performed via summation, concatenation with projection, or data-adaptive gates (e.g., SiLU or sigmoid), with variations affecting empirical performance and stability.
- Efficient memory and compute: Localized bidirectionality (LBMamba) performs the backward sweep in registers within each thread, avoiding a second pass of off-chip memory traffic (Zhang et al., 19 Jun 2025).
- Spectral and gating augmentations: EchoMamba4Rec augments bidirectional SSMs with frequency-domain (FFT) learned filters and GLU units, further enhancing signal quality and convergence (Wang et al., 2024).
- Domain-specific tokenization: MADEON applies STR (speech-token reversal) to the prefix, while Graph Mamba Networks construct ordered subgraph “token” sequences for bidirectional SSM passes (Masuyama et al., 2024, Behrouz et al., 2024).
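The parallel scan mentioned in the first bullet exploits the fact that the affine recurrence $h_t = a_t h_{t-1} + b_t$ is associative under the combine rule $(a_1, b_1) \circ (a_2, b_2) = (a_1 a_2,\; a_2 b_1 + b_2)$. A didactic step-doubling version is sketched below; this illustrates the principle rather than the fused CUDA kernels used by real Mamba implementations:

```python
import torch

def parallel_linear_scan(a, b):
    """Compute h_t = a_t * h_{t-1} + b_t (with h_0 = b_0) in O(log L) vectorized steps
    using the associative combine (a1, b1) o (a2, b2) = (a1*a2, a2*b1 + b2).

    a, b: (L, d) coefficient and input sequences; returns h: (L, d).
    """
    a, b = a.clone(), b.clone()
    L = a.shape[0]
    step = 1
    while step < L:
        # each position combines with the partial result `step` positions to its left;
        # right-hand sides are materialized before assignment, so previous values are used
        b[step:] = a[step:] * b[:-step] + b[step:]
        a[step:] = a[step:] * a[:-step]
        step *= 2
    return b

# sanity check against the sequential recurrence
L, d = 8, 3
a, b = torch.rand(L, d) * 0.9, torch.randn(L, d)
h, hs = torch.zeros(d), []
for t in range(L):
    h = a[t] * h + b[t]
    hs.append(h)
print(torch.allclose(torch.stack(hs), parallel_linear_scan(a, b), atol=1e-5))  # True
```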
7. Limitations, Extensions, and Outlook
While Bidirectional Mamba architectures are empirically robust and efficient, some limitations and open directions remain:
- Double-sweep overhead: Naive global forward/backward sweeps double compute and memory traffic; local or alternating schemes avoid this, but may leave partial context gaps unless scan directions are carefully alternated (Zhang et al., 19 Jun 2025).
- Information flow topology: Graph-structured domains may lack a canonical sequence, making token ordering and permutation effects in bidirectional SSMs nontrivial (Behrouz et al., 2024).
- Task-specific fusion: Optimal fusion method (sum, concat, gate) may be domain or task-dependent and interacts with normalization, positional encoding, and depth.
- Uncertainty quantification: As current bidirectional Mamba models are discriminative and typically deterministic, additional machinery is needed for calibrated Bayesian or ensemble uncertainty estimates (Lavaud et al., 2024).
The bidirectional selective SSM paradigm is rapidly being adopted across domains requiring non-causal, context-rich long-sequence modeling, frequently at sub-quadratic complexity and without the architectural rigidity or inefficiency of self-attention layers. Ongoing developments focus on further optimizing bidirectional propagation at the hardware level, augmenting with spectral and graph-based regularization, and extending applications to emerging scientific and clinical data types.