
Bidirectional Caduceus (BiMamba)

Updated 3 March 2026
  • Bidirectional Caduceus (BiMamba) is a neural sequence modeling architecture that fuses two SSM recurrences to process past and future context.
  • It achieves near-linear scaling in sequence length and improved efficiency over multi-head self-attention for long data sequences.
  • BiMamba’s flexible design supports diverse modalities like speech, genomics, and multi-dimensional data with specialized engineering extensions.

Bidirectional Caduceus (BiMamba) is a neural sequence modeling architecture that generalizes the Mamba Selective State-Space Model (SSM) to bidirectional contexts. By combining forward and backward SSM recurrences, BiMamba achieves efficient modeling of long-term dependencies across a broad spectrum of modalities—audio, speech, genomics, time series, and multi-dimensional data—while maintaining near-linear scaling in sequence length. Originally conceived as a competitive alternative to multi-head self-attention (MHSA), BiMamba integrates content-aware, globally receptive convolutional dynamics in both temporal directions, frequently augmented with spatiotemporal, frequency, or multi-dimensional extensions depending on the application.

1. Core Architecture and Mathematical Foundations

BiMamba generalizes the Mamba SSM block to process inputs bi-directionally by coupling a forward SSM (processing the sequence from left-to-right) with a backward SSM (processing right-to-left), fusing their outputs at each layer. The underlying mechanism is grounded in continuous-time linear state-space models parameterized as

$h'(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t) + D\,x(t),$

with zero-order hold discretization: $\bar{A} = \exp(\Delta A), \quad \bar{B} = (\Delta A)^{-1}(\exp(\Delta A)-I)\,\Delta B, \quad h_k = \bar{A} h_{k-1} + \bar{B} x_k, \quad y_k = C^\top h_k.$ BiMamba instantiates two independent recurrences (forward and backward), operating as

$h^{f}_k = \bar{A}\, h^{f}_{k-1} + \bar{B}\, x_k, \qquad h^{b}_k = \bar{A}\, h^{b}_{k+1} + \bar{B}\, x_k,$

where sequence outputs are fused, typically via concatenation followed by a learned projection, averaging, or scalar mixing: $z_k = W_o \bigl[y^f_k \oplus y^b_k\bigr] \quad \text{or} \quad z_k = \beta\, y^f_k + (1-\beta)\, y^b_k.$ Variants exist where the forward/backward SSMs share parameters for parameter efficiency, or operate with independent (external) projections for flexibility (Schiff et al., 2024, Zhang et al., 2024, Gao et al., 13 Jul 2025).
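As a concrete illustration, the discretization and bidirectional recurrences above can be simulated directly. The sketch below uses a diagonal state matrix (so the ZOH formulas apply elementwise) and shares parameters across the two directions, a simplification that the cited variants relax; all numeric values are arbitrary toy choices.

```python
import numpy as np

delta = 0.1
A = -np.array([0.5, 1.0, 2.0, 4.0])   # continuous-time diagonal A (stable)
B = np.ones_like(A)
C = np.array([1.0, -1.0, 0.5, 0.25])

# Zero-order-hold discretization, elementwise since A is diagonal:
# A_bar = exp(delta*A),  B_bar = (delta*A)^{-1} (exp(delta*A) - 1) delta*B
A_bar = np.exp(delta * A)
B_bar = (np.exp(delta * A) - 1.0) / (delta * A) * (delta * B)

def scan(x, reverse=False):
    # h_k = A_bar h_{k-1} + B_bar x_k,  y_k = C^T h_k.
    # The backward recurrence h^b_k = A_bar h^b_{k+1} + B_bar x_k is just a
    # forward scan over the time-reversed input, un-reversed afterwards.
    seq = x[::-1] if reverse else x
    h = np.zeros_like(A)
    ys = np.empty(len(x))
    for k, xk in enumerate(seq):
        h = A_bar * h + B_bar * xk
        ys[k] = C @ h
    return ys[::-1] if reverse else ys

x = np.sin(np.linspace(0, 3, 32))      # toy input sequence
y_f, y_b = scan(x), scan(x, reverse=True)
beta = 0.5
z = beta * y_f + (1.0 - beta) * y_b    # scalar-mix fusion from the text
assert z.shape == x.shape
```

With $\beta = 0.5$ the fusion reduces to a simple average of the two directional outputs; the concatenation-plus-projection variant replaces the last line with a learned $W_o$.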

2. Complexity, Modeling Power, and Comparison to Self-Attention

BiMamba matches or surpasses self-attention in its capacity to capture long-range dependencies, but with critical computational advantages:

  • Computational cost: BiMamba’s SSM layers scale as $\mathcal{O}(LD)$ in token length $L$ and hidden dimension $D$, compared to $\mathcal{O}(L^2 D)$ for MHSA (Zhang et al., 2024, Gao et al., 13 Jul 2025). Efficient convolutional implementations (parallel scan or FFT) further accelerate inference.
  • Memory footprint: Memory demand is linear in $L$, avoiding the quadratic scaling bottleneck of attention-based models, especially for long sequences or high-dimensional data (Schiff et al., 2024, Liu, 2024).
  • Global context: Unlike strictly local convolutional nets, BiMamba SSMs compute global receptive fields via convolutional kernels dynamically synthesized from sequence content.
  • Bidirectional dependency: Incorporating both past and future context is vital for non-autoregressive tasks such as speech recognition, sound event detection, and genomics. Bidirectional SSM yields empirically significant performance boosts over strictly causal (unidirectional) designs (Gao et al., 13 Jul 2025, Zhang et al., 2024, Xuan et al., 12 Aug 2025, Zhang et al., 2024).
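The asymptotics in the first bullet can be made concrete with a back-of-the-envelope count; constant factors are omitted and the $L$, $D$ values below are arbitrary examples, not from any cited benchmark.

```python
# Asymptotic per-layer cost: an SSM layer scales as O(L*D),
# multi-head self-attention as O(L^2 * D). Constant factors omitted.
def ssm_cost(L, D):
    return L * D

def mhsa_cost(L, D):
    return L * L * D

L, D = 16_000, 512                            # e.g. a long audio sequence
speedup = mhsa_cost(L, D) / ssm_cost(L, D)    # ratio reduces to exactly L
assert speedup == L
```

The ratio reduces to $L$ itself, which is why the advantage grows without bound as sequences lengthen.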

3. Block Structure, Variants, and Engineering Extensions

Canonical BiMamba blocks instantiate the following pipeline:

  1. Preprocessing: Layer normalization and linear projection of the input tensor to the desired hidden/state dimensionality (typically $D=512$ in audio, $N=16$–$256$ for the SSM state).
  2. Bidirectional SSM: Parallel forward-time and backward-time SSM passes as described above.
  3. Fusion: Channel-wise concatenation plus projection, or scalar-weighted sum.
  4. Nonlinearity and Residual: Application of batch or layer normalization, nonlinearities (ReLU or SiLU), and residual addition.
  5. Additional Modules: Depending on the application, BiMamba often incorporates:
    • Asymmetric convolutions: Decomposition of mixing across time and frequency dimensions, e.g., 1D time (kernel $K_t$) and 1D frequency (kernel $K_f$) (Gao et al., 13 Jul 2025, Gao et al., 16 Jun 2025).
    • Channel attention (ECA): Per-channel reweighting for multimodal data (Zhang et al., 2024).
    • Dimension-agnostic convolutions: For generalizing to 1D, 2D, or 3D data, e.g., via Nd-BiMamba2 (Liu, 2024).
    • Gating: Dynamic gating of SSM updates, especially in speech and sleep staging models.
    • Specialized fusions: e.g., scalar mixing (learned $\beta$) or summation, as required by context-sensitive applications.
  6. Decoder/Head: Task-specific linear or MLP heads for classification or regression.
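The pipeline above can be sketched end-to-end in a few dozen lines. This is a simplified illustration only (diagonal per-channel SSM, concatenation fusion, SiLU, parameters shared across directions, no gating or task head), not any specific published implementation; all shapes and initializations are assumptions for the example.

```python
import numpy as np

def layernorm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def silu(x):
    return x / (1.0 + np.exp(-x))

def diag_scan(x, a, b, c, reverse=False):
    # Per-channel scalar SSM: h_k = a*h_{k-1} + b*x_k, y_k = c*h_k.
    # x: (L, D);  a, b, c: (D,) already-discretized parameters.
    seq = x[::-1] if reverse else x
    h = np.zeros(x.shape[1])
    ys = np.empty_like(x)
    for k, xk in enumerate(seq):
        h = a * h + b * xk
        ys[k] = c * h
    return ys[::-1] if reverse else ys

def bimamba_block(x, params):
    # 1. Preprocessing: LayerNorm + input projection
    u = layernorm(x) @ params["W_in"]
    # 2. Bidirectional SSM (shared parameters, for simplicity)
    y_f = diag_scan(u, params["a"], params["b"], params["c"])
    y_b = diag_scan(u, params["a"], params["b"], params["c"], reverse=True)
    # 3. Fusion: channel-wise concatenation + learned projection
    z = np.concatenate([y_f, y_b], axis=-1) @ params["W_out"]
    # 4. Nonlinearity and residual
    return x + silu(z)

rng = np.random.default_rng(0)
L, D = 32, 8
params = {
    "W_in": rng.standard_normal((D, D)) / np.sqrt(D),
    "W_out": rng.standard_normal((2 * D, D)) / np.sqrt(2 * D),
    "a": np.full(D, 0.9), "b": np.ones(D), "c": np.ones(D),
}
x = rng.standard_normal((L, D))
out = bimamba_block(x, params)
assert out.shape == (L, D)
```

In production implementations the sequential loop in `diag_scan` is replaced by a parallel scan or FFT-based convolution, and the parameters $a$, $b$ are synthesized from the input content (the "selective" mechanism) rather than fixed.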

A summary of architectural choices and their domain-specific instantiations appears below:

Domain               Key Extension                      Reference
Stereo SELD          Asymmetric conv (time/frequency)   (Gao et al., 13 Jul 2025)
Genomics             RC-equivariant fusion              (Schiff et al., 2024)
Sleep staging (PSG)  Channel attention (ECA)            (Zhang et al., 2024)
2D/3D data           Dimension-adaptive BiMamba2        (Liu, 2024)
Streaming ASR        Trans-Chunk BiMamba                (She et al., 12 Feb 2026)

4. Applications Across Modalities

BiMamba architectures have demonstrated empirical benefits across heterogeneous tasks:

  • Speech Recognition and Enhancement: BiMamba, as a replacement for MHSA in Transformer/Conformer backbones, delivers lower word error rate (WER) and better speech enhancement metrics (PESQ, ESTOI) compared to unidirectional Mamba or attention models, with linear-time inference (Zhang et al., 2024, She et al., 12 Feb 2026). The Trans-Chunk BiMamba further supports unified offline and streaming ASR with dynamic chunk sizing and full bidirectional context at every latency (She et al., 12 Feb 2026).
  • Sound Event Localization/Detection (SELD): Replacing Conformer decoders with BiMamba plus asymmetric convolutions improves F1 scores ($F_{20^\circ}$) and reduces parameter counts for stereo SELD (Gao et al., 13 Jul 2025, Gao et al., 16 Jun 2025).
  • Sleep Stage Classification: In multichannel PSG, BiMamba with ECA achieves higher accuracy and F1 versus Transformer or RNN baselines, while using only a fraction of the parameters (Zhang et al., 2024).
  • Trustworthy Speech Processing: Fake-Mamba demonstrates that BiMamba can supplant attention in real-time speech deepfake detection, obtaining lower EER than advanced attention-based detectors at lower computational cost (Xuan et al., 12 Aug 2025).
  • Genomics and DNA Modeling: BiMamba, as the bidirectional module in the Caduceus framework, supports efficient long-range sequence modeling over kilobase- to megabase-scale DNA. The RC-equivariant extension (MambaDNA) enables equivariant sequence modeling for regulatory genomics and variant effect prediction, setting state-of-the-art on several biological sequence benchmarks despite competitive models being orders-of-magnitude larger (Schiff et al., 2024).
  • Multi-Dimensional Data: Nd-BiMamba2 applies the BiMamba concept to any spatial or spatiotemporal dimension (1D/2D/3D), supporting vision, volumetric, or timeseries tasks in a unified, portable, and efficient module (Liu, 2024).

5. Quantitative Performance and Efficiency Gains

Rigorous benchmarks consistently corroborate the advantages of BiMamba:

  • SELD (DCASE2025 Task 3): BiMambaAC achieves $F_{20^\circ}=39.6\%$ (vs. Conformer’s $38.2\%$) using only 36% of the parameters (76M vs. 210M) and comparable MACs (4.63G vs. 4.69G) (Gao et al., 16 Jun 2025, Gao et al., 13 Jul 2025).
  • Real-time Speech Deepfake Detection: Fake-Mamba’s PN-BiMamba achieves EERs of 0.97%, 1.74%, and 5.85% on ASVspoof 21 LA, 21 DF, and In-the-Wild, outperforming attention-based XLSR-Conformer and XLSR-Mamba (Xuan et al., 12 Aug 2025).
  • Speech Enhancement (LibriSpeech): ExtBiMamba-6 reaches NB-PESQ 2.88 and ESTOI 76.69% (vs. Transformer-6’s 2.78 and 74.56%) (Zhang et al., 2024).
  • Sleep Staging (ISRUC-S3): CNN+ECA+1-BiMamba yields 0.852 accuracy and 0.824 F1 using just 0.47M parameters, outperforming DeepSleepNet and MixSleepNet (Zhang et al., 2024).
  • Streaming ASR: TC-BiMamba enables dynamic-latency streaming (offline Rescore WER 2.97%) at 1.3× training speedup, 50% less memory, equaling or surpassing fixed-chunk BiMamba and U2++ models at smaller model size (She et al., 12 Feb 2026).
  • Genomics (Regulatory Classification): Caduceus-PS achieves 0.900 accuracy on Human Enhancer classification (Ensembl), outperforming long-range HyenaDNA (0.849) at <2M parameters (Schiff et al., 2024).
  • Multi-Dimensional Data: Nd-BiMamba2 delivers up to 8% increased accuracy in vision tasks with only a 2× FLOPs increase over unidirectional processing; adaptive padding strategies yield 10–30% compute savings (Liu, 2024).

6. Implementation, Training Practices, and Trade-Offs

All practical BiMamba instantiations employ a series of architectural and training best practices:

  • LayerNorm before block entry; residual connections after each block.
  • Separate forward and backward linear projections for SSM input/output (unless sharing is imposed for efficiency (Schiff et al., 2024)).
  • Stacked block architectures, typically 4–7 layers deep for speech/audio, up to 16 for genomics.
  • Small kernel asymmetric convolutions for time/frequency decoupling in audio.
  • Adam optimizer, ReduceLROnPlateau scheduler, moderate to low learning rates (1e-3 to 3e-5), and batch sizes adjusted for hardware constraints.
  • Projection weight sharing (in Caduceus) enables deeper BiMamba stacks at low parameter cost without accuracy loss (Schiff et al., 2024).
  • Dynamic chunk-size sampling in TC-BiMamba for universal streaming/offline ASR (She et al., 12 Feb 2026).
  • ONNX and TorchScript compatibility for Nd-BiMamba2 ensures easy cross-hardware deployment (Liu, 2024).
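The projection weight-sharing trade-off noted above (tying the forward and backward projections, as in Caduceus) can be quantified with a toy parameter count; the dimensions below are illustrative, not taken from any cited model.

```python
# Per-block projection parameters with vs. without tying the
# forward/backward input and output projections (bias terms omitted).
def projection_params(d_model, d_inner, shared):
    per_direction = d_model * d_inner + d_inner * d_model  # in-proj + out-proj
    return per_direction if shared else 2 * per_direction

d_model, d_inner = 512, 1024
tied = projection_params(d_model, d_inner, shared=True)
untied = projection_params(d_model, d_inner, shared=False)
assert untied == 2 * tied    # tying halves the projection budget
```

This halving is what allows the deeper BiMamba stacks at low parameter cost mentioned in the bullet on Caduceus-style weight sharing.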

Bidirectionality in BiMamba roughly doubles compute per layer relative to unidirectional Mamba, but yields marked gains in downstream modeling tasks. Adaptive padding and modular interfaces mitigate memory/computation overheads in high-dimensional inputs (Liu, 2024). Ablation studies uniformly find bidirectionality, gating, nonlinearity, and residuals as indispensable for state-of-the-art performance (Zhang et al., 2024, Gao et al., 13 Jul 2025, Xuan et al., 12 Aug 2025).

7. Limitations, Extensions, and Ongoing Developments

Limitations of BiMamba reported in the literature include:

  • 2× computational overhead versus unidirectional Mamba (Liu, 2024).
  • Simple fusion of forward/backward representations (sum or scalar mix); richer cross-stream interactions are underexplored.
  • Parameter efficiency and representational capacity trade-off: Although shared projections and weight-tying admit deeper networks, some applications may require flexibility from independent per-direction parameterization (Schiff et al., 2024).
  • No explicit directional bias per spatial axis in Nd-BiMamba2.
  • For semantic-oriented tasks, the FFN and residual block must be retained to avoid degraded representational power (Zhang et al., 2024).

Current and future research continues to:

  • Investigate generalized fusion/attention across directions and tasks (Schiff et al., 2024, Liu, 2024).
  • Extend BiMamba blocks to RC-equivariant, symmetry-preserving contexts (e.g., genomics) (Schiff et al., 2024).
  • Explore modular, dimension-agnostic block designs for universal deployment in high-dimensional data domains (Liu, 2024).
  • Unify streaming and non-streaming processing without model bifurcation, as in Trans-Chunk BiMamba for ASR (She et al., 12 Feb 2026).
  • Apply BiMamba to new domains including vision and time-series beyond speech and language (Xuan et al., 12 Aug 2025, Liu, 2024).

References:

  • Schiff et al., 2024 (Caduceus / MambaDNA)
  • Zhang et al., 2024 (speech enhancement)
  • Zhang et al., 2024 (sleep staging)
  • Liu, 2024 (Nd-BiMamba2)
  • Gao et al., 16 Jun 2025
  • Gao et al., 13 Jul 2025 (stereo SELD)
  • Xuan et al., 12 Aug 2025 (Fake-Mamba)
  • She et al., 12 Feb 2026 (Trans-Chunk BiMamba)
