Bidirectional Mamba Blocks
- Bidirectional Mamba Blocks are neural architectural primitives that extend state-space models to process sequential data in both forward and backward directions.
- They combine dual SSM recurrences with dynamic gating and fusion to effectively capture long-range dependencies while maintaining strict linear computational complexity.
- Empirical results across speech, vision, and time series applications demonstrate enhanced accuracy and hardware efficiency compared to traditional unidirectional or quadratic complexity models.
A Bidirectional Mamba Block is a neural architectural primitive that extends the hardware-optimized, selective state-space modeling mechanisms of the Mamba model to process sequential or tokenized data in both forward (causal) and backward (anti-causal) directions. By running two independent state-space model (SSM) recurrences—one left-to-right and one right-to-left—and then fusing their outputs, Bidirectional Mamba enables efficient global context integration analogous to bidirectional LSTMs or bidirectional Transformers, but with strict linear complexity in sequence length. This design improves long-range dependency modeling, context-aware inference, and computational efficiency across domains including speech, vision, time series, multimodal, and biomedical signal processing.
1. Core Principles and Mathematical Foundations
A Bidirectional Mamba Block consists of two parallel, parameter-tied or independent selective SSM streams:
- Forward SSM: At time $t$, updates the hidden state $\overrightarrow{h}_t = \bar{A}_t \overrightarrow{h}_{t-1} + \bar{B}_t x_t$, outputting $\overrightarrow{y}_t = C_t \overrightarrow{h}_t$.
- Backward SSM: Processes the time-reversed sequence, $\overleftarrow{h}_t = \bar{A}'_t \overleftarrow{h}_{t+1} + \bar{B}'_t x_t$, $\overleftarrow{y}_t = C'_t \overleftarrow{h}_t$.
- Input-dependent gating is typically realized by a small neural network: $(\Delta_t, B_t, C_t) = f_\theta(x_t)$; the matrix parameters $A$, $B$, $C$ are usually light-weight MLP-parameterized or diagonal.
Fusion is performed either by concatenation and linear projection, simple summation, or learned gating, yielding a unified representation:
$$y_t = W_f\,[\overrightarrow{y}_t \,;\, \overleftarrow{y}_t] \quad\text{or}\quad y_t = g_t \odot \overrightarrow{y}_t + (1 - g_t) \odot \overleftarrow{y}_t$$
Here, $W_f$ is a learnable matrix, and $g_t$ may be a dynamic gate.
A residual connection and normalization (e.g., LayerNorm or RMSNorm) are always applied:
$$z_t = \mathrm{Norm}(x_t + y_t)$$
Each Bidirectional Mamba Block is commonly stacked $N$ times to form a deep model.
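To make the construction concrete, the following is a minimal PyTorch sketch of one such block under simplifying assumptions: a diagonal state matrix, a naive Python-loop recurrence in place of Mamba's hardware-optimized parallel scan, concatenation-plus-linear fusion, and pre-normalization. The names (`BiMambaBlock`, `selective_scan`, `state_size`) are illustrative, not taken from any cited implementation.

```python
# Minimal sketch of a bidirectional selective-SSM block. Simplifications: diagonal state
# matrix, a naive Python-loop recurrence instead of the hardware-optimized parallel scan,
# concat + linear fusion, pre-normalization. Names are illustrative, not a cited codebase.
import torch
import torch.nn as nn
import torch.nn.functional as F


def selective_scan(x, A_log, proj_dt, proj_B, proj_C):
    """Naive left-to-right selective scan over x of shape (batch, length, dim)."""
    Bsz, L, D = x.shape
    N = proj_B.out_features                         # per-channel state size
    h = x.new_zeros(Bsz, D, N)
    ys = []
    for t in range(L):
        xt = x[:, t]                                # (B, D)
        dt = F.softplus(proj_dt(xt))                # (B, D)  input-dependent step size
        Bt, Ct = proj_B(xt), proj_C(xt)             # (B, N)  input-dependent B_t, C_t
        A_bar = torch.exp(-torch.exp(A_log) * dt.unsqueeze(-1))            # (B, D, N)
        h = A_bar * h + dt.unsqueeze(-1) * Bt.unsqueeze(1) * xt.unsqueeze(-1)
        ys.append((h * Ct.unsqueeze(1)).sum(-1))    # y_t = C_t h_t, shape (B, D)
    return torch.stack(ys, dim=1)                   # (B, L, D)


class BiMambaBlock(nn.Module):
    """Forward + backward scans, fused by concat + linear, with residual and LayerNorm."""

    def __init__(self, dim, state_size=16):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        # independent parameters per direction (they could instead be tied/shared)
        self.A_log_f = nn.Parameter(torch.zeros(dim, state_size))
        self.A_log_b = nn.Parameter(torch.zeros(dim, state_size))
        self.dt_f, self.B_f, self.C_f = nn.Linear(dim, dim), nn.Linear(dim, state_size), nn.Linear(dim, state_size)
        self.dt_b, self.B_b, self.C_b = nn.Linear(dim, dim), nn.Linear(dim, state_size), nn.Linear(dim, state_size)
        self.fuse = nn.Linear(2 * dim, dim)          # fusion: [fwd ; bwd] -> dim

    def forward(self, x):                            # x: (batch, length, dim)
        u = self.norm(x)
        fwd = selective_scan(u, self.A_log_f, self.dt_f, self.B_f, self.C_f)
        bwd = selective_scan(u.flip(1), self.A_log_b, self.dt_b, self.B_b, self.C_b).flip(1)
        return x + self.fuse(torch.cat([fwd, bwd], dim=-1))   # residual connection
```

In practice, published implementations replace the Python loop with a fused parallel-scan kernel and include the convolutional and gated branches of the full Mamba block, which this sketch omits.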
2. Algorithmic Variants and Architectural Design
Multiple Bidirectional Mamba block instantiations have been reported:
- Dual-Column (DuaBiMamba): Contains two independent Mamba columns (forward/backward), fuses by concatenation → fully connected layer → residual → layer norm, deployed atop pre-trained acoustic models for spoof detection (Xiao et al., 15 Nov 2024).
- Parallel Scan (BiMamba): Runs two SSMs on original and reversed input in parallel, concatenates or sums features elementwise, typical in time series, EEG, and diffusion models (Lavaud et al., 10 Dec 2024, Liang et al., 24 Apr 2024, Gao et al., 17 Oct 2024).
- Task-Axis, Cross-Feature, and Spiral Scans: Adaptive scan orderings in vision/multitask/2D biomedical applications, e.g., spiral scan in medical imaging or cross-task scan in dense prediction for improved spatial/contextual coverage (Cao et al., 28 Aug 2025, Yuan et al., 12 May 2025).
- Locally Bi-directional (LBMamba): Fuses a forward global scan with lightweight in-register local backward scans within each GPU thread, followed by alternating global direction reversal across blocks for full receptive field with minimal memory overhead (Zhang et al., 19 Jun 2025).
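Two of the wiring patterns above can be sketched around a generic unidirectional scan module: per-block bidirectionality obtained by flipping the input for a second (tied or independent) stream, and the direction alternation across stacked layers described for LBMamba/LBVim, here simplified to a plain global flip between layers rather than the in-register local backward scan of the original. This is an illustrative sketch, not the cited code.

```python
# Sketch of two stacking patterns around a generic unidirectional block `blk(x)` that maps
# (batch, length, dim) -> (batch, length, dim). Simplified illustrations, not the cited code.
import torch
import torch.nn as nn


class BiDirectional(nn.Module):
    """Run a block on x and on the flipped sequence, then sum (element-wise fusion)."""
    def __init__(self, block_fwd, block_bwd=None):
        super().__init__()
        self.fwd = block_fwd
        self.bwd = block_bwd if block_bwd is not None else block_fwd   # independent or tied
    def forward(self, x):
        return self.fwd(x) + self.bwd(x.flip(1)).flip(1)


class AlternatingStack(nn.Module):
    """Unidirectional blocks whose global scan direction reverses every other layer."""
    def __init__(self, blocks):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)
    def forward(self, x):
        for i, blk in enumerate(self.blocks):
            x = blk(x.flip(1)).flip(1) if i % 2 else blk(x)   # odd layers scan right-to-left
        return x
```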
Bidirectional Mamba blocks often integrate additional innovations:
- Attention and gating mechanisms: Temporal attention weighting, cross-channel/channel-mixing, and forget gates augment the basic bidirectional state-space construction (Gao et al., 17 Oct 2024, Liang et al., 24 Apr 2024).
- Integration in composite modules: Used with front-end CNNs, adaptive feature recalibration, spectral-temporal attention, and channel-wise transformations for high-dimensional or multivariate data (Zhou et al., 3 Nov 2024, Kheir et al., 20 May 2025).
3. Computational Complexity and Efficiency
Bidirectional Mamba blocks maintain computational and activation memory complexity that is strictly linear in the sequence length $L$ (for time-indexed models) or in the token count $HW$ (for spatially flattened images or patches), given a fixed state or channel size $d$:
- Forward/Backward scan cost: $O(Ld)$ per block.
- Fusion/Projection cost: $O(Ld^2)$ for the fully connected merge.
- Residual and normalization add negligible overhead.
- The parallel scan implementation yields $O(L)$ work (with $O(\log L)$ parallel depth) on modern hardware; dual-column/bidirectional designs double compute cost relative to single-path, but remain a small fraction of the $O(L^2)$ cost of self-attention (Xiao et al., 15 Nov 2024, Zhu et al., 17 Jan 2024, Zhang et al., 19 Jun 2025).
Empirical benchmarking demonstrates substantial improvement in throughput, memory footprint, and scalability compared to self-attention backbones, especially for long or high-resolution sequences (Zhu et al., 17 Jan 2024, Zhang et al., 19 Jun 2025, Mo et al., 24 May 2024).
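A back-of-the-envelope comparison of these scalings, with illustrative (not benchmarked) constants and an assumed channel size, shows how the linear and quadratic terms diverge as sequences grow:

```python
# Back-of-the-envelope per-layer multiply counts using the scalings above: two selective
# scans of cost ~L*d*N each plus an L*(2d -> d) fusion projection for the bidirectional
# block, versus ~2*L^2*d for self-attention (scores plus value mixing). Constants are
# illustrative, not measured throughput.
def bimamba_cost(L, d=768, N=16):
    return 2 * L * d * N + L * (2 * d) * d       # forward + backward scans + fusion

def attention_cost(L, d=768):
    return 2 * L * L * d                         # QK^T scores + attention-weighted values

for L in (1_000, 10_000, 100_000):
    ratio = attention_cost(L) / bimamba_cost(L)
    print(f"L={L:>7,}: attention / bidirectional-SSM ≈ {ratio:,.1f}x")
# The gap widens linearly with L: the SSM cost grows as O(L), attention as O(L^2).
```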
4. Applications Across Modalities
Bidirectional Mamba blocks have been integrated and validated in a range of domains:
- Speech and audio: Spoofing attack detection, speech enhancement, ASR, deepfake detection—DuaBiMamba, BiCrossMamba-ST, and generic BiMamba blocks enable context-rich classification and artifact localization with linear-time kernels (Xiao et al., 15 Nov 2024, Zhou et al., 3 Nov 2024, Kheir et al., 20 May 2025, Zhang et al., 21 May 2024).
- Vision: Vision Mamba adopts bidirectional state-space processing as the backbone for classification, detection, and segmentation with global-receptive-field efficiency (Zhu et al., 17 Jan 2024). Spiral/2D-bidirectional scans are key in medical image translation (ABS-Mamba) (Yuan et al., 12 May 2025); a sketch of such a scan ordering follows this list. LBMamba/LBVim introduces hardware-efficient local bidirectional context for ultra-high-resolution imagery (Zhang et al., 19 Jun 2025).
- Time Series and Forecasting: Efficient modeling of long-range dependencies, bidirectional context, and multi-channel interactions for forecasting, imputation, and anomalous diffusion inference (Lavaud et al., 10 Dec 2024, Gao et al., 17 Oct 2024, Liang et al., 24 Apr 2024, Xiong et al., 2 Apr 2025, Liu et al., 21 Aug 2024).
- Multitask and Multimodal: BIM employs task/position bidirectional scans for dense prediction; bidirectional Mamba enables robust cross-task interaction with linear scaling in number of tasks (Cao et al., 28 Aug 2025).
- Point Cloud and 3D Geometric Data: Hybrid Transformer + BiMamba models enhance global feature extraction under tight compute/memory constraints (Chen et al., 10 Jun 2024).
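As an illustration of how a 2D scan ordering turns spatial data into a sequence, the following sketch generates one plausible outside-in spiral over an H × W patch grid; the exact spiral and flattening used in ABS-Mamba may differ, so this is only a demonstration of the general pattern.

```python
# One plausible spiral ordering over an H x W patch grid, flattening 2D patches into a 1D
# sequence for the SSM scan; the exact spiral used in the cited work may differ.
import numpy as np

def spiral_order(H, W):
    """Return patch indices (row-major) visited in an outside-in clockwise spiral."""
    grid = np.arange(H * W).reshape(H, W)
    order = []
    top, bottom, left, right = 0, H - 1, 0, W - 1
    while top <= bottom and left <= right:
        order.extend(grid[top, left:right + 1])             # top row, left -> right
        order.extend(grid[top + 1:bottom + 1, right])       # right column, top -> bottom
        if top < bottom:
            order.extend(grid[bottom, left:right][::-1])    # bottom row, right -> left
        if left < right:
            order.extend(grid[top + 1:bottom, left][::-1])  # left column, bottom -> top
        top, bottom, left, right = top + 1, bottom - 1, left + 1, right - 1
    return np.array(order)

print(spiral_order(4, 4))   # e.g. [ 0  1  2  3  7 11 15 14 13 12  8  4  5  6 10  9]
# A flattened patch tensor of shape (batch, H*W, dim) can then be reordered with `order`
# before the forward scan, and with the reversed order for the backward scan.
```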
5. Empirical Evidence and Design Trade-offs
Ablation and benchmarking data demonstrate:
- Predictive Advantage: Bidirectionality consistently improves accuracy, as measured by mIoU, FID, minDCF, error rates, etc., across speech, vision, diffusion, and time series tasks (Xiao et al., 15 Nov 2024, Zhu et al., 17 Jan 2024, Yuan et al., 12 May 2025, Liang et al., 24 Apr 2024, Kheir et al., 20 May 2025). For example, the bidirectional configuration in Vision Mamba yields a top-1 ImageNet accuracy gain over the unidirectional configuration, and in XLSR-Mamba the dual-column bidirectional design leads on challenging “in-the-wild” benchmarks (Xiao et al., 15 Nov 2024, Zhu et al., 17 Jan 2024).
- Trade-offs: The computational cost is roughly $2\times$ that of a single unidirectional scan, but remains $O(L)$ or $O(HW)$ and is significantly lower than Transformer-style $O(L^2)$. Local bidirectional variants (LBMamba) recover most benefits with only a 2–3% throughput penalty and near-zero increase in off-chip memory traffic (Zhang et al., 19 Jun 2025).
- Parameterization Choices: Gate sharing across directions (InnBiMamba) is more parameter-efficient but less flexible than the path-separated variant (ExtBiMamba); gating and fusion types modulate expressivity for various end tasks (Zhang et al., 21 May 2024).
- Fusion Methods: Concatenation plus linear, summation, and learned dynamic gates all appear viable, with detailed design tuned empirically (Lavaud et al., 10 Dec 2024, Zhu et al., 17 Jan 2024, Kheir et al., 20 May 2025).
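The three fusion options above can be sketched as interchangeable modules operating on the forward and backward scan outputs; the module names and the gate parameterization are illustrative assumptions rather than a specific cited design.

```python
# Interchangeable fusion modules over forward/backward scan outputs of shape
# (batch, length, dim): concat + linear projection, summation, and a learned dynamic gate.
import torch
import torch.nn as nn


class FuseConcat(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(2 * dim, dim)
    def forward(self, fwd, bwd):
        return self.proj(torch.cat([fwd, bwd], dim=-1))


class FuseSum(nn.Module):
    def forward(self, fwd, bwd):
        return fwd + bwd


class FuseGated(nn.Module):
    """Per-token, per-channel gate in (0, 1) computed from both streams."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)
    def forward(self, fwd, bwd):
        g = torch.sigmoid(self.gate(torch.cat([fwd, bwd], dim=-1)))
        return g * fwd + (1.0 - g) * bwd
```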
6. Architectural Patterns and Theoretical Implications
Bidirectional Mamba blocks inherit the advantages of SSM kernels for long-range dependency, but supplement RNN-like history with explicit anti-causal modeling:
- Global context without quadratic attention: Each token or spatial location “sees” both past and future, enabling non-causal inference essential for generative modeling, detection, and segmentation (Zhu et al., 17 Jan 2024, Mo et al., 24 May 2024).
- Selective gating: Small MLPs choose which parts of the sequence or spatial structure to integrate at each step, enabling noise suppression, context mixing, and adaptation to heterogeneous data (Lavaud et al., 10 Dec 2024, Kheir et al., 20 May 2025).
- Fusion with locality and multi-scale context: Designs such as spiral scan and position/task-wise fusions allow hierarchical and global context mixing at any desired spatial or temporal granularity (Cao et al., 28 Aug 2025, Yuan et al., 12 May 2025).
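A tiny numeric illustration of the gating interpretation: with a negative state eigenvalue, the input-dependent step size behaves like a forget gate, with small steps preserving accumulated context and large steps overwriting it. The values below are illustrative only.

```python
# With a negative state eigenvalue a, the discretized decay exp(dt * a) acts like a forget
# gate driven by the input-dependent step size dt.
import math

a = -1.0                              # one (negative) diagonal entry of the state matrix A
for dt in (0.01, 0.1, 1.0, 5.0):
    retention = math.exp(dt * a)      # fraction of the previous state that survives
    print(f"dt={dt:4.2f} -> state retention {retention:.3f}")
# dt near 0  -> retention near 1: the token is largely ignored, context carried through.
# large dt   -> retention near 0: prior context is forgotten, the current token dominates.
```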
7. Limitations, Open Problems, and Prospective Directions
Despite strong empirical evidence for performance and efficiency, several design constraints persist:
- Double compute cost: Full bidirectional blocks incur $2\times$ the SSM compute per layer; local bidirectional variants (e.g., LBMamba) offer mitigation but at possible cost to the global receptive field unless augmented with alternating data order or hybrid architectures (Zhang et al., 19 Jun 2025).
- Expressiveness relative to full attention: Despite superior speed and memory, bidirectional SSMs lack pairwise content-based selection inherent to self-attention, which may still be desirable for certain modeling tasks (Zhu et al., 17 Jan 2024, Zhang et al., 21 May 2024).
- Parameter sharing and data efficiency: When applied to low-data or high-variance regimes (e.g., speech spoofing), careful integration of pre-training, regularization, and gate-sharing is required (Xiao et al., 15 Nov 2024).
- Application to complex or non-sequential modalities: Extending bidirectional Mamba to multi-modal, graph-structured, or hierarchical data remains an ongoing area of research.
Bidirectional Mamba Blocks represent a significant advancement in scalable, context-rich sequence and tensor modeling, combining the operational simplicity and linear scaling of SSMs with the full-context capabilities of bidirectional inference, yielding state-of-the-art results and enabling new application domains with practical hardware-awareness and algorithmic flexibility (Xiao et al., 15 Nov 2024, Lavaud et al., 10 Dec 2024, Zhu et al., 17 Jan 2024, Zhang et al., 19 Jun 2025, Cao et al., 28 Aug 2025, Mo et al., 24 May 2024, Zhou et al., 3 Nov 2024).