Bidirectional Attention Mamba
- BAM is a neural sequence modeling paradigm that combines linear-time structured state space scans with bidirectional attention to capture both forward and backward dependencies.
- It achieves linear computational scaling by integrating efficient SSM scans with localized self-attention, making it suitable for long-sequence modeling in video, speech, and point cloud tasks.
- The architecture offers domain-specific variants and demonstrable empirical gains in performance and complexity management across multiple modalities.
Bidirectional Attention Mamba (BAM) is a neural sequence modeling paradigm that unites the linear-time structured state space modeling of Mamba with bidirectional information aggregation, often augmented by localized or global self-attention. By efficiently capturing both forward and backward contextual dependencies across modalities such as video, vision, speech, time series, and 3D point clouds, BAM provides a scalable alternative to conventional self-attention architectures, enabling long-sequence modeling and global context propagation without incurring quadratic complexity.
1. Architectural Foundations and Mathematical Core
The core BAM block interleaves parallel forward and backward state-space model (SSM) scans, parameterized as (possibly input-dependent) linear recurrences, with optional local self-attention for fine-grained interactions. The canonical SSM recurrence for input $x_t$ at time/index $t$ is

$$h_t = \bar{A}_t\, h_{t-1} + \bar{B}_t\, x_t, \qquad y_t = C_t\, h_t,$$

where $\bar{A}_t$ and $\bar{B}_t$ are the discretized state and input matrices (obtained from continuous parameters $A$, $B$ and a step size $\Delta_t$) and $C_t$ is the output projection.
Bidirectionality is achieved by running a standard forward scan and an anti-causal backward scan, i.e., the same recurrence evaluated for decreasing $t$, yielding backward outputs $y_t^{\leftarrow}$ alongside the forward outputs $y_t^{\rightarrow}$. These are merged by summation, concatenation, or learned gating:

$$y_t = y_t^{\rightarrow} + y_t^{\leftarrow}, \qquad y_t = \big[\,y_t^{\rightarrow};\, y_t^{\leftarrow}\,\big], \qquad y_t = g_t \odot y_t^{\rightarrow} + (1 - g_t) \odot y_t^{\leftarrow}.$$
In practice, input-dependent ("selective") parameters $B_t$, $C_t$, and $\Delta_t$ are predicted per timestep with lightweight MLP/conv1d layers.
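A minimal PyTorch sketch of this bidirectional selective scan with gated fusion is given below. The module name `BiSSMScan`, the diagonal state parameterization, and the sequential Python loop (standing in for a fused parallel-scan kernel) are illustrative assumptions rather than any specific published implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BiSSMScan(nn.Module):
    """Illustrative bidirectional diagonal SSM scan with learned gated fusion.

    Simplifying assumptions: a diagonal state matrix, per-timestep selective
    B_t / C_t / Delta_t predicted from the input, and a plain Python loop
    instead of a fused parallel-scan kernel.
    """

    def __init__(self, d_model: int, d_state: int = 16):
        super().__init__()
        self.d_state = d_state
        # Negative-real diagonal A keeps the discretized recurrence stable.
        self.A_log = nn.Parameter(torch.log(torch.arange(1, d_state + 1).float()))
        self.in_proj = nn.Linear(d_model, d_state)
        self.to_bcd = nn.Linear(d_model, 2 * d_state + 1)   # selective B_t, C_t, Delta_t
        self.out_proj = nn.Linear(d_state, d_model)
        self.gate = nn.Linear(2 * d_model, d_model)          # learned fusion gate g_t

    def _scan(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, length, d_model) -> one direction's output (batch, length, d_model)
        bsz, L, _ = x.shape
        A = -torch.exp(self.A_log)                            # (d_state,)
        u = self.in_proj(x)                                   # (batch, length, d_state)
        B_t, C_t, dt = self.to_bcd(x).split([self.d_state, self.d_state, 1], dim=-1)
        dt = F.softplus(dt)                                   # positive step size Delta_t
        A_bar = torch.exp(dt * A)                             # discretized decay per step
        h = x.new_zeros(bsz, self.d_state)
        ys = []
        for t in range(L):                                    # sequential scan for clarity
            h = A_bar[:, t] * h + dt[:, t] * B_t[:, t] * u[:, t]
            ys.append(self.out_proj(C_t[:, t] * h))
        return torch.stack(ys, dim=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y_fwd = self._scan(x)                                 # causal (forward) scan
        y_bwd = self._scan(x.flip(1)).flip(1)                 # anti-causal (backward) scan
        g = torch.sigmoid(self.gate(torch.cat([y_fwd, y_bwd], dim=-1)))
        return g * y_fwd + (1 - g) * y_bwd                    # gated fusion


# Usage: out = BiSSMScan(64)(torch.randn(2, 128, 64))  # -> (2, 128, 64)
```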
When integrated with local self-attention, the sequence of operations for a BAM block in a transformer-like backbone is:
- Layer normalization;
- Spatial and/or temporal local attention (quadratic but restricted to small groups);
- Bidirectional SSM scan;
- Residual projection and possible nonlinearity.
This hybridization achieves both fine-grained local interaction and efficient global dependency modeling at overall linear cost in sequence length (Gao et al., 5 May 2024, Ibrahim et al., 11 Feb 2025, Jiang et al., 27 Mar 2024).
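The block-level composition above can be sketched as follows. The window partitioning, the exponential-moving-average stand-in for the selective bidirectional SSM scan, and the placement of the residual connections are simplifying assumptions for illustration, not a faithful reproduction of any single cited architecture.

```python
import torch
import torch.nn as nn


class WindowedSelfAttention(nn.Module):
    """Local MHA over non-overlapping windows: O(L * w * d) instead of O(L^2 * d)."""

    def __init__(self, d_model: int, n_heads: int, window: int):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        bsz, L, d = x.shape
        assert L % self.window == 0, "pad the sequence to a multiple of the window size"
        xw = x.reshape(bsz * L // self.window, self.window, d)   # group into windows
        out, _ = self.attn(xw, xw, xw)
        return out.reshape(bsz, L, d)


class SimpleBiScan(nn.Module):
    """Stand-in for the selective bidirectional SSM scan: a learned per-channel
    exponential moving average run forward and backward, fused by summation."""

    def __init__(self, d_model: int):
        super().__init__()
        self.decay_logit = nn.Parameter(torch.zeros(d_model))

    def _ema(self, x: torch.Tensor) -> torch.Tensor:
        a = torch.sigmoid(self.decay_logit)                      # decay in (0, 1)
        h = torch.zeros_like(x[:, 0])
        ys = []
        for t in range(x.shape[1]):
            h = a * h + (1 - a) * x[:, t]
            ys.append(h)
        return torch.stack(ys, dim=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self._ema(x) + self._ema(x.flip(1)).flip(1)


class BAMBlock(nn.Module):
    """LayerNorm -> local attention -> bidirectional scan -> residual projection."""

    def __init__(self, d_model: int = 64, n_heads: int = 4, window: int = 16):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.local_attn = WindowedSelfAttention(d_model, n_heads, window)
        self.bi_scan = SimpleBiScan(d_model)
        self.proj = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.local_attn(self.norm1(x))                   # fine-grained local interaction
        x = x + self.proj(self.bi_scan(self.norm2(x)))           # global bidirectional context
        return x


# Usage: y = BAMBlock()(torch.randn(2, 128, 64))   # -> (2, 128, 64)
```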
2. Complexity Analysis and Scalability Properties
The principal motivation for BAM is its $\mathcal{O}(L)$ time and memory scaling, with $L$ the input sequence length, versus $\mathcal{O}(L^2)$ for full self-attention. Each direction of the SSM scan is $\mathcal{O}(L N)$ with $N$ the SSM state size. Local windowed attention adds $\mathcal{O}(L w d)$ operations per head (with $w$ the window size and $d$ the head dimension). The overall complexity per BAM block is summarized as:
| Block component | Complexity | Context span |
|---|---|---|
| Full SA | $\mathcal{O}(L^2 d)$ | Global, pairwise |
| Local MHA (window $w$) | $\mathcal{O}(L w d)$ | Local window |
| Bi-SSM scan | $\mathcal{O}(L N)$ | Global, sequential (bi-dir) |
| BAM | $\mathcal{O}(L (N + w d))$ | Hybrid: global (bi-dir) + local |
For long sequences $L$ and moderate state size $N$, BAM remains tractable on workloads such as high-resolution video, gigapixel images, long speech/audio, or point cloud patches (Gao et al., 5 May 2024, Ibrahim et al., 11 Feb 2025, Chen et al., 10 Jun 2024).
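As a back-of-the-envelope check, the snippet below evaluates the per-block operation counts from the table for a long sequence; the chosen values of $L$, $N$, $w$, and $d$ are arbitrary example settings, not benchmarks from the cited papers.

```python
def bam_complexity(L: int, N: int = 16, w: int = 16, d: int = 64) -> dict:
    """Per-block operation counts (up to constant factors) from the table above."""
    return {
        "full_self_attention": L * L * d,        # O(L^2 d): all pairwise interactions
        "local_windowed_mha":  L * w * d,        # O(L w d): attention within windows
        "bi_ssm_scan":         2 * L * N,        # O(L N) per direction, two directions
        "bam_block":           L * (N + w * d),  # hybrid: bi-SSM plus local attention
    }


# Example: for a 16k-token sequence, full attention is roughly three orders of
# magnitude costlier than a BAM block under these illustrative constants.
print(bam_complexity(L=16_384))
```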
3. Variants and Contextual Implementations
BAM manifests in domain-specific architectures:
- Matten for Video Generation (Gao et al., 5 May 2024): BAM is structured as a composite residual block interleaving spatial (within-frame) and temporal (across time-steps) self-attention with a fully flattened bidirectional SSM scan, operating at the U-Net bottleneck. This enables local detail capture and global context propagation, yielding state-of-the-art FVD at lower FLOPs than full-attention baselines.
- Vision Mamba/VideoMamba (Ibrahim et al., 11 Feb 2025): BAM forms dual-stream parallel SSM layers, fused by elementwise SiLU-gated pointwise mixing, with hierarchical stacking and optional cross-scale modules. BAM provides the backbone for efficient long-sequence understanding in large-scale vision benchmarks.
- Dual-path and Speech Applications (Jiang et al., 27 Mar 2024, Zhang et al., 21 May 2024, Xuan et al., 12 Aug 2025): BAM blocks replace self-attention in transformers/conformers for both speech enhancement and ASR, as well as real-time deepfake detection. In these models, the BAM mechanism is either directly substituted in the MHSA position or interleaved at multiple network layers, with empirical gains observed in SI-SNRi, WER, and real-time factors.
- Point Cloud Analysis (Chen et al., 10 Jun 2024): In PointABM, initial Transformer-based global context encoding is followed by a stack of Bi-SSM layers; this hybridization achieves improvements on ModelNet40/ScanObjectNN.
- Time Series and Adaptive Pooling (Xiong et al., 2 Apr 2025): Attention Mamba further accelerates BAM by combining adaptive global pooling for Q/K (avoiding the $\mathcal{O}(L^2)$ attention map) with a bidirectional SSM for V, giving a model that achieves linear scaling while increasing accuracy and receptive field (see the sketch after this list).
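A hedged sketch of the pooling idea from the last bullet follows: the score matrix is kept at a fixed pooled width so it never grows quadratically in $L$. Note the divergences from the published Attention Mamba design, all of which are simplifying assumptions: keys and values (rather than queries and keys) are pooled so that per-token outputs are retained, a single head is used, and the bidirectional SSM value path is omitted.

```python
import torch
import torch.nn as nn


class PooledAttention(nn.Module):
    """Attention with adaptively pooled keys/values: the score matrix is
    (L x p) rather than (L x L), so cost grows as O(L * p * d) for fixed p."""

    def __init__(self, d_model: int = 64, pooled_len: int = 32):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.kv = nn.Linear(d_model, 2 * d_model)
        self.pool = nn.AdaptiveAvgPool1d(pooled_len)          # L tokens -> p tokens
        self.scale = d_model ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, L, d_model)
        q = self.q(x)                                          # (B, L, d)
        k, v = self.kv(x).chunk(2, dim=-1)                     # (B, L, d) each
        # Adaptive pooling over the sequence axis compresses K and V to p tokens.
        k = self.pool(k.transpose(1, 2)).transpose(1, 2)       # (B, p, d)
        v = self.pool(v.transpose(1, 2)).transpose(1, 2)       # (B, p, d)
        attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)  # (B, L, p)
        return attn @ v                                        # (B, L, d)


# Usage: out = PooledAttention()(torch.randn(2, 4096, 64))  # linear in L for fixed p
```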
4. Empirical Performance and Benchmarks
BAM and its derivatives demonstrate consistent empirical advantages across domains:
- Video Generation: In Matten, BAM reduces FVD by $10$–$15$ points against both pure Mamba and pure-attention baselines at equal or lower FLOPs (SkyTimelapse, Baidu-videogen). Increasing model complexity continues to reduce FVD, demonstrating strong scalability (Gao et al., 5 May 2024).
- Vision Tasks: VideoMambaPro, incorporating masked backward computation akin to BAM, achieves competitive detection mAP and top-1 action-recognition accuracy at high throughput (Ibrahim et al., 11 Feb 2025).
- Speech/Audio: In the dual-path speech separation regime, DPMamba-M (bi-directional) attains an SI-SNRi of $22.6$ dB, matching Sepformer at a fraction of the parameter count and with greatly reduced memory. Fake-Mamba (PN-BiMamba) achieves low EERs on ASVspoof 21 LA and DF with real-time inference (Xuan et al., 12 Aug 2025).
- Time Series: Attention Mamba (with BAM core) achieves the lowest MSE/MAE on $5/7$ major datasets and markedly better MSE on large-scale transport forecasting, with shorter training times than bidirectional S-Mamba (Xiong et al., 2 Apr 2025).
- Point Cloud: PointABM yields absolute accuracy gains over PointMamba, with further improvements under pretraining (Chen et al., 10 Jun 2024).
5. Limitations, Design Trade-offs, and Domain-specific Extensions
While BAM delivers substantial efficiency and context reach, several trade-offs and limitations are noted:
- Context Fusion Locality: Dual-stream fusion in BAM is typically local (element-wise) and does not model all cross-token interactions directly. Supplementing BAM with lightweight global self-attention modules or periodic full-attention windows can close this gap (Ibrahim et al., 11 Feb 2025).
- Parameter and Compute Overhead: Bidirectionality nearly doubles per-block computation versus standard unidirectional SSM, though localized variants (e.g. LBMamba) recover much of this cost by restricting the backward scan within thread-local segments and alternating scan direction layerwise (Zhang et al., 19 Jun 2025).
- Stability Requirements: Careful SSM kernel parameterization (notably the $\Delta$ step-size scaling) is crucial for numerical stability over long sequences, and the initialization scheme for the state matrix $A$ impacts convergence (Zhang et al., 21 May 2024).
- Residual Nonlinearity: Vanilla Mamba (with minimal nonlinear gating) is insufficient for high-level semantic modeling (e.g. ASR) unless augmented with feedforward residual modules and nonlinearity (Zhang et al., 21 May 2024).
- Extension Potential: BAM may be extended by hierarchical stacking (interleaving coarser and finer BAM/focal-attention layers), cross-directional attention at coarse scales, or learned time-dependent kernel modulations, with potential for increased global relational modeling (Ibrahim et al., 11 Feb 2025).
6. Summary of Key Variants and Implementation Strategies
The following table summarizes major published BAM variants and their core mechanics:
| Variant/Domain | BAM Construction | Bidirectionality/Fusion | Local vs. Global Context | Reference |
|---|---|---|---|---|
| Matten (video) | Spatial MHA → Temporal MHA → Bi-SSM | Sum of forward/backward global SSM | Local details + global context | (Gao et al., 5 May 2024) |
| VideoMamba/ViM | Parallel F/B SSM with SiLU gating | SiLU-gated sum | Linear SSM on tokens | (Ibrahim et al., 11 Feb 2025) |
| Dual-path Mamba (speech) | Chunked F/B selective SSM, both short- and long-term | Averaging, linear projection | Bidirectional at both chunk scales | (Jiang et al., 27 Mar 2024) |
| PointABM | Pre-attention Transformer, stack of F/B SSM | Sum, nonlinearity, residual | One-shot global context encoding + Bi-SSM | (Chen et al., 10 Jun 2024) |
| Attention Mamba (time series) | Adaptive pooling + BAM block (F/B Conv1D SSM) | RVS sum and re-reverse, elementwise fusion | Global weighted + local nonlinear | (Xiong et al., 2 Apr 2025) |
| LBMamba/LBVim | In-register local backward scan | Local sum in each window, alternate sequence reversal globally | Near-global context at lower cost | (Zhang et al., 19 Jun 2025) |
Implementation best practices for BAM, especially in large-scale or hardware-sensitive deployments, include confining backward passes to per-thread or local memory, tuning window sizes as a function of input length, and alternating scan direction across layers to achieve a global receptive field with minimal compute overhead (Zhang et al., 19 Jun 2025).
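The sketch below illustrates two of these practices: a backward scan confined to local segments and layerwise alternation of scan direction. The EMA-style recurrence, fixed segment size, fusion by summation, and all module names are simplifying assumptions for illustration, not the LBMamba implementation.

```python
import torch
import torch.nn as nn


class LocalBackwardScan(nn.Module):
    """Global forward scan plus a backward scan confined to local segments,
    approximating full bidirectionality at lower cost."""

    def __init__(self, d_model: int, segment: int = 32):
        super().__init__()
        self.segment = segment
        self.decay_logit = nn.Parameter(torch.zeros(d_model))

    def _fwd_scan(self, x: torch.Tensor) -> torch.Tensor:
        a = torch.sigmoid(self.decay_logit)                    # per-channel decay in (0, 1)
        h = torch.zeros_like(x[:, 0])
        ys = []
        for t in range(x.shape[1]):
            h = a * h + (1 - a) * x[:, t]
            ys.append(h)
        return torch.stack(ys, dim=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        bsz, L, d = x.shape
        assert L % self.segment == 0, "pad to a multiple of the segment size"
        y_fwd = self._fwd_scan(x)                              # global forward scan
        # Backward scan restricted to each segment: no state crosses segment boundaries.
        seg = x.reshape(bsz * L // self.segment, self.segment, d)
        y_bwd = self._fwd_scan(seg.flip(1)).flip(1).reshape(bsz, L, d)
        return y_fwd + y_bwd


class AlternatingStack(nn.Module):
    """Stack whose scan direction alternates layerwise, so that after two layers
    every token has received globally propagated context from both directions."""

    def __init__(self, d_model: int, depth: int = 4, segment: int = 32):
        super().__init__()
        self.layers = nn.ModuleList(
            [LocalBackwardScan(d_model, segment) for _ in range(depth)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for i, layer in enumerate(self.layers):
            if i % 2 == 1:                                     # reverse the sequence on odd layers
                x = layer(x.flip(1)).flip(1) + x
            else:
                x = layer(x) + x
        return x


# Usage: out = AlternatingStack(64)(torch.randn(2, 256, 64))
```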
7. Applications and Directions for Further Research
BAM demonstrates strong cross-domain viability:
- Vision: Efficient backbone for high-res video synthesis, detection, segmentation; superior throughput/accuracy trade-off on ImageNet-1K, ADE20K, and COCO (Gao et al., 5 May 2024, Ibrahim et al., 11 Feb 2025, Zhang et al., 19 Jun 2025).
- Speech and Audio: Real-time deepfake detection, high-fidelity speech separation, robust code-switching ASR, with reduced memory and latency over transformer baselines (Zhang et al., 21 May 2024, Xuan et al., 12 Aug 2025, Jiang et al., 27 Mar 2024).
- Time Series: Adaptive time series forecasting in industrial, weather, and transportation domains, scaling efficiently to long input horizons (Xiong et al., 2 Apr 2025).
- 3D Point Cloud: Hybrid perception backbones for object classification and segmentation (Chen et al., 10 Jun 2024).
Potential extensions include multi-stage hybridization (full self-attention in coarser layers, BAM in fine layers), learned adaptive kernel parameterizations for richer positional embeddings, and integration with domain-specific pretext tasks or cross-modal fusion mechanisms (Ibrahim et al., 11 Feb 2025, Gao et al., 5 May 2024, Xuan et al., 12 Aug 2025).
BAM’s efficiency, flexibility for hierarchical stacking, and demonstrated gains in context modeling position it as a compelling foundation for long-sequence modeling in computationally demanding environments. Further exploration of cross-directional fusion schemes, global context routing, and hardware-specific optimizations is ongoing in the field (Ibrahim et al., 11 Feb 2025, Zhang et al., 19 Jun 2025).