Bidirectional Mamba Block
- Bidirectional Mamba Block is a neural architecture that processes sequences in both forward and reverse directions to create enriched global context.
- It employs dual columns and selective state space models to maintain linear time complexity while approximating the benefits of quadratic attention.
- Empirical results demonstrate its effectiveness across modalities, achieving superior performance in speech, vision, time series, and biomedical applications.
A Bidirectional Mamba Block is a neural architectural construct that generalizes the selective State Space Model (SSM) known as Mamba by running parallel recurrence scans in both chronological (forward) and anti-chronological (backward) directions, then fusing their outputs to create richer, globally contextualized representations. This design preserves the linear time and memory complexity intrinsic to SSMs, while more fully leveraging bidirectional dependencies—thus closing the gap with quadratic-attention mechanisms but at a lower computational cost. It has been instantiated in various forms across speech, vision, time series, point cloud, and biomedical domains, with multiple empirical and implementational variants.
1. Architectural Foundations and Variants
Dual-Column and Parallel Designs
The canonical Bidirectional Mamba (e.g., DuaBiMamba (Xiao et al., 2024)) comprises two independent Mamba pipelines:
- Forward column: Processes sequence in temporal order using a selective, input-driven SSM recurrence.
- Backward column: Processes the same sequence in reverse order (x_T, ..., x_1). Each column consists of input projection, selective SSM, local Conv1D/gated mixing, and output projection to a shared embedding dimension d. The outputs Y_f and Y_b are then merged, commonly by concatenation along channels, with a linear projection to return to dimension d.
Bidirectionality generalizes to local-window (LBMamba (Zhang et al., 19 Jun 2025)) and cross-task (BIM (Cao et al., 28 Aug 2025)) settings:
- LBMamba embeds small-windowed local backward scans within the main forward scan, eliminating the need for a full global reverse sweep and reducing overhead to ~2% kernel time.
- Bidirectional Interaction Mamba (BIM) uses parallel forward/backward cross-task Mamba scans (BI-Scan) to couple task-specific features for multi-task dense prediction, integrating both task-first and position-first serialization patterns.
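The local-backward idea can be sketched with a scalar toy recurrence (a minimal illustration, not LBMamba's fused per-register kernel; the scalar decay parameter, window size, and summation merge are assumptions made for brevity):

```python
import numpy as np

def local_backward_scan(x, a, window=8):
    """Toy linear recurrence h_t = a*h_{t+-1} + x_t with a scalar state.

    The forward scan covers the whole sequence; backward scans run only
    inside fixed local windows, approximating LBMamba's scheme of
    embedding small backward sweeps inside the main forward pass.
    """
    T = len(x)
    h_f = np.zeros(T)
    acc = 0.0
    for t in range(T):                         # global forward scan
        acc = a * acc + x[t]
        h_f[t] = acc
    h_b = np.zeros(T)
    for start in range(0, T, window):          # windowed backward scans
        acc = 0.0
        for t in reversed(range(start, min(start + window, T))):
            acc = a * acc + x[t]
            h_b[t] = acc
    return h_f + h_b                           # merge by summation
```

With `a = 0` each scan degenerates to the identity, so the output is simply twice the input; nonzero `a` mixes left context globally but right context only within each window.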
2. Mathematical Formulation
At each position t, with input x_t:
- Forward SSM: h_t^f = A_t^f h_{t-1}^f + B_t^f x_t, with output y_t^f = C_t^f h_t^f
- Backward SSM: h_t^b = A_t^b h_{t+1}^b + B_t^b x_t, with output y_t^b = C_t^b h_t^b, where the matrices (A_t, B_t, C_t) are parameterized via gating networks, potentially selective per input, and recurrence steps may be pointwise (1D convolution) or channelwise (spatial, temporal, or task axes).
Outputs are merged:
- Concatenation: y_t = W_out [y_t^f ; y_t^b] (DuaBiMamba, VisionMamba, UltraLBM-UNet (Fan et al., 25 Dec 2025))
- Summation: y_t = y_t^f + y_t^b (PointABM (Chen et al., 2024), SAMBA (Mehrabian et al., 2024))
- Gated fusion: y_t = g_t ⊙ y_t^f + (1 − g_t) ⊙ y_t^b (MotionMamba BSM (Zhang et al., 2024), BiT-MamSleep (Zhou et al., 2024))
In bidirectional interaction or multi-task designs, scan axes and merge strategies generalize to both feature and task dimensions.
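The three merge strategies can be sketched in NumPy (shapes and the sigmoid gate parameterization are illustrative assumptions; real blocks learn the projection and gate weights):

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 4
y_f = rng.standard_normal((T, d))    # forward-column output
y_b = rng.standard_normal((T, d))    # backward-column output

# Concatenation + linear projection back to dimension d (DuaBiMamba-style)
W_out = rng.standard_normal((2 * d, d))
y_cat = np.concatenate([y_f, y_b], axis=-1) @ W_out   # (T, d)

# Summation (PointABM / SAMBA-style)
y_sum = y_f + y_b

# Gated interpolation; the sigmoid gate on y_f + y_b is an assumption
g = 1.0 / (1.0 + np.exp(-(y_f + y_b)))
y_gate = g * y_f + (1.0 - g) * y_b
```

All three return a (T, d) tensor; concatenation is the most expressive but adds a 2d-to-d projection, while summation is parameter-free.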
3. Computational Complexity and Scalability
Bidirectional Mamba blocks double the forward-pass cost relative to a unidirectional Mamba block but retain favorable scaling:
- Time/space complexity: O(T·d), where T is the sequence length and d the embedding dimension, versus O(T²·d) for Transformer attention.
- Local windowed variants: LBMamba fuses a local backward scan in per-thread registers, incurring only ~27% more FLOPs but no additional inter-thread or HBM traffic—substantially more efficient than full global backward passes (Zhang et al., 19 Jun 2025).
- BIM/BI-Scan: Cross-task bidirectional scan maintains cost linear in the number of tasks, compared to the quadratic cost of naïve full task-pair interaction (Cao et al., 28 Aug 2025).
Bidirectional SSMs parallelize naturally and are hardware-friendly, which underpins their success in long-sequence and high-resolution vision, speech, and biomedical applications.
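As a back-of-envelope illustration of this scaling (the sequence length, embedding dimension, and state size below are arbitrary choices for the arithmetic, not figures from the cited papers):

```python
def ssm_cost(T, d, state=16):
    """Rough per-layer multiply count for a selective scan: O(T * d * state)."""
    return T * d * state

def attention_cost(T, d):
    """Rough per-layer multiply count for self-attention: O(T**2 * d)."""
    return T * T * d

T, d = 16384, 256
bi_mamba = 2 * ssm_cost(T, d)     # two scans: forward + backward
attn = attention_cost(T, d)
print(attn / bi_mamba)            # prints 512.0
```

Because the ratio is T / (2 × state), the bidirectional block's relative advantage grows linearly with sequence length, which is why the gap is largest for gigapixel images and long time series.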
4. Empirical Performance and Comparative Benchmarks
Empirical studies uniformly demonstrate gains over both vanilla (unidirectional) Mamba and Transformer baselines:
- Spoofing attack detection (XLSR-Mamba): DuaBiMamba achieves the lowest EER (0.93% LA, 1.88% DF), exceeding XLSR+Conformer and other Mamba variants, with faster inference (lower real-time factor) (Xiao et al., 2024).
- Computer vision (LBVim with LBMamba): At constant throughput, gains of 0.8-1.6% in ImageNet-1K Top-1, ~2.7% ADE20K mIoU, and 0.9-1.1% in COCO detection APb/APm are observed over Vim-Ti and Vim-S while maintaining high efficiency (Zhang et al., 19 Jun 2025).
- Multi-task dense prediction (BIM): Gains +1.29 mIoU (NYUD-v2) over baseline Mamba, total +1.58 mIoU vs. MTMamba, with only a 26% parameter and 1% FLOP overhead (Cao et al., 28 Aug 2025).
- Time-series forecasting and financial data (Bi-Mamba+, SAMBA): Outperforms state-of-the-art Transformers on real-world long-term forecasting and stock price prediction tasks, with near-linear scaling and improved accuracy (Liang et al., 2024, Mehrabian et al., 2024).
- Biomedical and motion domains: UltraLBM-UNet yields state-of-the-art segmentation at ultra-low resource budget (<0.034M params, <0.06 GFLOPs) (Fan et al., 25 Dec 2025); Motion Mamba BSM block yields a 64% FID improvement and higher R-Precision for long-horizon motion modeling (Zhang et al., 2024).
Ablations confirm that bidirectionality, even when locally windowed, accounts for the majority of these gains; fusion method and gate design account for additional, but smaller, improvements.
5. Implementation Patterns and Pseudocode
A generic forward pass for a Dual-Column Bidirectional Mamba block adheres to the following pattern:
```python
def DuaBiMamba(X):                        # X: sequence of length T, dim d
    # Forward direction
    h_f = zeros(state_dim)
    Y_f = []
    for t in range(T):
        A_f, B_f, C_f = GatingNetworks_forward(X[t])
        h_f = A_f @ h_f + B_f @ X[t]
        Y_f.append(C_f @ h_f)
    # Backward direction
    h_b = zeros(state_dim)
    Y_b = []
    for t in reversed(range(T)):
        A_b, B_b, C_b = GatingNetworks_backward(X[t])
        h_b = A_b @ h_b + B_b @ X[t]
        Y_b.insert(0, C_b @ h_b)          # keep chronological order
    # Merge
    Z = concat_along_channel(Y_f, Y_b)    # shape: T x (2d)
    return Linear_out(Z)                  # shape: T x d
```
Windowed/local and cross-task variants (e.g., LBMamba, BIM) traverse either in per-thread or per-task windows and reverse sequence between blocks or task axes as context demands.
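A minimal runnable instantiation of the dual-column pattern, with fixed random matrices standing in for the input-dependent gating networks (purely illustrative; real blocks make A, B, C selective per token):

```python
import numpy as np

rng = np.random.default_rng(1)
T, d, n = 10, 4, 8                     # sequence length, model dim, state dim

# Fixed parameters replace the learned, input-dependent gating networks.
A = 0.9 * np.eye(n)                    # decay-style state matrix (assumption)
B = rng.standard_normal((n, d)) * 0.1
C = rng.standard_normal((d, n)) * 0.1
W_out = rng.standard_normal((2 * d, d)) * 0.1

def scan(X, reverse=False):
    """Run the linear recurrence over X in forward or backward order."""
    idx = range(T - 1, -1, -1) if reverse else range(T)
    h, ys = np.zeros(n), [None] * T
    for t in idx:
        h = A @ h + B @ X[t]
        ys[t] = C @ h                  # stored at position t, chronological
    return np.stack(ys)

X = rng.standard_normal((T, d))
Y = np.concatenate([scan(X), scan(X, reverse=True)], axis=-1) @ W_out
print(Y.shape)                         # prints (10, 4)
```

Note that both columns write their outputs back to chronological positions, so position t carries left context from the forward scan and right context from the backward scan before the merge.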
6. Domain-specific Extensions and Fusion Strategies
Bidirectional Mamba blocks serve as architectural primitives across modalities:
- Speech and audio: Serve as drop-in replacements for multi-head self-attention, delivering lower word and estimation error rates, improved speech intelligibility, and greater robustness (Zhang et al., 2024).
- EEG and biomedical: Serve as the core bidirectional temporal filter, yielding robust sleep-stage or lesion classification even under sequence imbalance and hardware constraints (Zhou et al., 2024, Fan et al., 25 Dec 2025).
- Multi-task and multimodal contexts: BI-Scan extends bidirectional recurrence to both spatial and task axes, allowing simultaneous cross-task and within-task attention at linear complexity (Cao et al., 28 Aug 2025).
- Vision/image generation: Bidirectional context across both token and channel axes underpins advances in segmentation (UltraLBM-UNet), masked image generation (MaskMamba), and efficient gigapixel whole-slide processing (Chen et al., 2024, Zhang et al., 19 Jun 2025, Fan et al., 25 Dec 2025).
- Partial bidirectionality: Partially Flipped (PF-) Mamba in SIGMA selectively flips sequence prefixes to adapt bidirectionality flexibly, with gating for per-branch weighting (Liu et al., 2024).
Fusion strategies vary: channelwise concatenation (with linear projection), summation, or learned gate-based interpolation.
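The prefix-flip idea behind PF-Mamba can be sketched as follows (a toy illustration only; SIGMA's actual flip schedule and per-branch gating are not reproduced here):

```python
import numpy as np

def partial_flip(X, k):
    """Reverse only the first k tokens of the sequence; keep the rest in order."""
    return np.concatenate([X[:k][::-1], X[k:]], axis=0)

X = np.arange(6).reshape(6, 1)         # token ids 0..5 as a toy sequence
print(partial_flip(X, 3).ravel())      # prints [2 1 0 3 4 5]
```

Scanning the partially flipped sequence forward gives the prefix a reversed (right-to-left) view while the suffix keeps its causal order, which is the flexible middle ground between unidirectional and fully bidirectional processing.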
7. Limitations, Trade-offs, and Research Directions
While doubling the recurrence cost relative to unidirectional Mamba, Bidirectional Mamba blocks remain vastly more efficient than quadratic attention, with no loss of global receptive field in principle, given sufficient layer depth, sequence reversal, or an appropriate windowing schedule (LBVim strategy (Zhang et al., 19 Jun 2025)). Empirical studies consistently report that alternating the scan direction (full sequence reversals) across layers is essential to fully restore global context.
A plausible implication is that trade-offs between window size, block count, and throughput can be tuned to suit hardware or memory constraints, especially in massive-scale vision or long-range bio applications. Continual improvements in local windowing, fusion gate design, and task-aware scan axes may further optimize expressive power for cross-modal dense prediction and domain-specific high-resolution modeling.
References:
- Xiao et al., 2024
- Zhang et al., 19 Jun 2025
- Cao et al., 28 Aug 2025
- Chen et al., 2024
- Ibrahim et al., 11 Feb 2025
- Lavaud et al., 2024
- Liang et al., 2024
- Mehrabian et al., 2024
- Zhang et al., 2024
- Zhou et al., 2024
- Fan et al., 25 Dec 2025
- Liu et al., 2024