Bidirectional Attention Mamba
- BAM is a neural sequence modeling paradigm that combines linear-time structured state space scans with bidirectional attention to capture both forward and backward dependencies.
- It achieves linear computational scaling by integrating efficient SSM scans with localized self-attention, making it suitable for long-sequence modeling in video, speech, and point cloud tasks.
- The architecture offers domain-specific variants and demonstrable empirical gains in performance and complexity management across multiple modalities.
Bidirectional Attention Mamba (BAM) is a neural sequence modeling paradigm that unites the linear-time structured state space modeling of Mamba with bidirectional information aggregation, often augmented by localized or global self-attention. By efficiently capturing both forward and backward contextual dependencies across modalities such as video, vision, speech, time series, and 3D point clouds, BAM provides a scalable alternative to conventional self-attention architectures, enabling long-sequence modeling and global context propagation without incurring quadratic complexity.
1. Architectural Foundations and Mathematical Core
The core BAM block interleaves parallel forward and backward state-space model (SSM) scans, parameterized as (possibly input-dependent) linear recurrences, with optional local self-attention for fine-grained interactions. The canonical SSM recurrence for input $x_t$ at time/index $t$ is

$$h_t = \bar{A}_t\, h_{t-1} + \bar{B}_t\, x_t, \qquad y_t = C_t\, h_t,$$

where $\bar{A}_t$ and $\bar{B}_t$ are the discretized state and input matrices (obtained from continuous parameters $A$, $B$ and a step size $\Delta_t$) and $C_t$ is the output projection.
Bidirectionality is achieved by running a standard forward scan and an anti-causal backward scan, i.e., the same recurrence evaluated for decreasing $t$, yielding backward outputs $y_t^{\leftarrow}$ alongside the forward outputs $y_t^{\rightarrow}$. These are merged by summation, concatenation, or learned gating:

$$y_t = y_t^{\rightarrow} + y_t^{\leftarrow}, \qquad y_t = \big[\,y_t^{\rightarrow};\, y_t^{\leftarrow}\,\big], \qquad y_t = g_t \odot y_t^{\rightarrow} + (1 - g_t) \odot y_t^{\leftarrow}.$$
In practice, input-dependent ("selective") parameters $B_t$, $C_t$, and $\Delta_t$ are predicted per timestep with lightweight MLP/conv1d layers.
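A minimal PyTorch sketch of this bidirectional selective scan with gated fusion is given below. The module name `BiSSMScan`, the diagonal state parameterization, and the sequential Python loop (standing in for a fused parallel-scan kernel) are illustrative assumptions rather than any specific published implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BiSSMScan(nn.Module):
    """Illustrative bidirectional diagonal SSM scan with learned gated fusion.

    Simplifying assumptions: a diagonal state matrix, per-timestep selective
    B_t / C_t / Delta_t predicted from the input, and a plain Python loop
    instead of a fused parallel-scan kernel.
    """

    def __init__(self, d_model: int, d_state: int = 16):
        super().__init__()
        self.d_state = d_state
        # Negative-real diagonal A keeps the discretized recurrence stable.
        self.A_log = nn.Parameter(torch.log(torch.arange(1, d_state + 1).float()))
        self.in_proj = nn.Linear(d_model, d_state)
        self.to_bcd = nn.Linear(d_model, 2 * d_state + 1)   # selective B_t, C_t, Delta_t
        self.out_proj = nn.Linear(d_state, d_model)
        self.gate = nn.Linear(2 * d_model, d_model)          # learned fusion gate g_t

    def _scan(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, length, d_model) -> one direction's output (batch, length, d_model)
        bsz, L, _ = x.shape
        A = -torch.exp(self.A_log)                            # (d_state,)
        u = self.in_proj(x)                                   # (batch, length, d_state)
        B_t, C_t, dt = self.to_bcd(x).split([self.d_state, self.d_state, 1], dim=-1)
        dt = F.softplus(dt)                                   # positive step size Delta_t
        A_bar = torch.exp(dt * A)                             # discretized decay per step
        h = x.new_zeros(bsz, self.d_state)
        ys = []
        for t in range(L):                                    # sequential scan for clarity
            h = A_bar[:, t] * h + dt[:, t] * B_t[:, t] * u[:, t]
            ys.append(self.out_proj(C_t[:, t] * h))
        return torch.stack(ys, dim=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y_fwd = self._scan(x)                                 # causal (forward) scan
        y_bwd = self._scan(x.flip(1)).flip(1)                 # anti-causal (backward) scan
        g = torch.sigmoid(self.gate(torch.cat([y_fwd, y_bwd], dim=-1)))
        return g * y_fwd + (1 - g) * y_bwd                    # gated fusion


# Usage: out = BiSSMScan(64)(torch.randn(2, 128, 64))  # -> (2, 128, 64)
```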
When integrated with local self-attention, the sequence of operations for a BAM block in a transformer-like backbone is:
- Layer normalization;
- Spatial and/or temporal local attention (quadratic but restricted to small groups);
- Bidirectional SSM scan;
- Residual projection and possible nonlinearity.
This hybridization achieves both fine-grained local interaction and efficient global dependency modeling at overall linear cost in sequence length (Gao et al., 5 May 2024, Ibrahim et al., 11 Feb 2025, Jiang et al., 27 Mar 2024).
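The block-level composition above can be sketched as follows. The window partitioning, the exponential-moving-average stand-in for the selective bidirectional SSM scan, and the placement of the residual connections are simplifying assumptions for illustration, not a faithful reproduction of any single cited architecture.

```python
import torch
import torch.nn as nn


class WindowedSelfAttention(nn.Module):
    """Local MHA over non-overlapping windows: O(L * w * d) instead of O(L^2 * d)."""

    def __init__(self, d_model: int, n_heads: int, window: int):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        bsz, L, d = x.shape
        assert L % self.window == 0, "pad the sequence to a multiple of the window size"
        xw = x.reshape(bsz * L // self.window, self.window, d)   # group into windows
        out, _ = self.attn(xw, xw, xw)
        return out.reshape(bsz, L, d)


class SimpleBiScan(nn.Module):
    """Stand-in for the selective bidirectional SSM scan: a learned per-channel
    exponential moving average run forward and backward, fused by summation."""

    def __init__(self, d_model: int):
        super().__init__()
        self.decay_logit = nn.Parameter(torch.zeros(d_model))

    def _ema(self, x: torch.Tensor) -> torch.Tensor:
        a = torch.sigmoid(self.decay_logit)                      # decay in (0, 1)
        h = torch.zeros_like(x[:, 0])
        ys = []
        for t in range(x.shape[1]):
            h = a * h + (1 - a) * x[:, t]
            ys.append(h)
        return torch.stack(ys, dim=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self._ema(x) + self._ema(x.flip(1)).flip(1)


class BAMBlock(nn.Module):
    """LayerNorm -> local attention -> bidirectional scan -> residual projection."""

    def __init__(self, d_model: int = 64, n_heads: int = 4, window: int = 16):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.local_attn = WindowedSelfAttention(d_model, n_heads, window)
        self.bi_scan = SimpleBiScan(d_model)
        self.proj = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.local_attn(self.norm1(x))                   # fine-grained local interaction
        x = x + self.proj(self.bi_scan(self.norm2(x)))           # global bidirectional context
        return x


# Usage: y = BAMBlock()(torch.randn(2, 128, 64))   # -> (2, 128, 64)
```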
2. Complexity Analysis and Scalability Properties
The principal motivation for BAM is its $\mathcal{O}(L)$ time and memory scaling, with $L$ the input sequence length, versus $\mathcal{O}(L^2)$ for full self-attention. Each direction of the SSM scan is $\mathcal{O}(L N)$ with $N$ the SSM state size. Local windowed attention adds $\mathcal{O}(L w d)$ operations per head (with $w$ the window size and $d$ the head dimension). The overall complexity per BAM block is summarized as:
| Block component | Complexity | Context span |
|---|---|---|
| Full SA | $\mathcal{O}(L^2 d)$ | Global, pairwise |
| Local MHA (window $w$) | $\mathcal{O}(L w d)$ | Local window |
| Bi-SSM scan | $\mathcal{O}(L N)$ | Global, sequential (bi-dir) |
| BAM | $\mathcal{O}(L (N + w d))$ | Hybrid: global (bi-dir) + local |
For long sequences $L$ and moderate state size $N$, BAM remains tractable on workloads such as high-resolution video, gigapixel images, long speech/audio, or point cloud patches (Gao et al., 5 May 2024, Ibrahim et al., 11 Feb 2025, Chen et al., 10 Jun 2024).
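As a back-of-the-envelope check, the snippet below evaluates the per-block operation counts from the table for a long sequence; the chosen values of $L$, $N$, $w$, and $d$ are arbitrary example settings, not benchmarks from the cited papers.

```python
def bam_complexity(L: int, N: int = 16, w: int = 16, d: int = 64) -> dict:
    """Per-block operation counts (up to constant factors) from the table above."""
    return {
        "full_self_attention": L * L * d,        # O(L^2 d): all pairwise interactions
        "local_windowed_mha":  L * w * d,        # O(L w d): attention within windows
        "bi_ssm_scan":         2 * L * N,        # O(L N) per direction, two directions
        "bam_block":           L * (N + w * d),  # hybrid: bi-SSM plus local attention
    }


# Example: for a 16k-token sequence, full attention is roughly three orders of
# magnitude costlier than a BAM block under these illustrative constants.
print(bam_complexity(L=16_384))
```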
3. Variants and Contextual Implementations
BAM manifests in domain-specific architectures:
- Matten for Video Generation (Gao et al., 5 May 2024): BAM is structured as a composite residual block interleaving spatial (within-frame) and temporal (across time-steps) self-attention with a fully flattened bidirectional SSM scan, operating at the U-Net bottleneck. This enables local detail capture and global context propagation, yielding state-of-the-art FVD at lower FLOPs than full-attention baselines.
- Vision Mamba/VideoMamba (Ibrahim et al., 11 Feb 2025): BAM forms dual-stream parallel SSM layers, fused by elementwise SiLU-gated pointwise mixing, with hierarchical stacking and optional cross-scale modules. BAM provides the backbone for efficient long-sequence understanding in large-scale vision benchmarks.
- Dual-path and Speech Applications (Jiang et al., 27 Mar 2024, Zhang et al., 21 May 2024, Xuan et al., 12 Aug 2025): BAM blocks replace self-attention in transformers/conformers for both speech enhancement and ASR, as well as real-time deepfake detection. In these models, the BAM mechanism is either directly substituted in the MHSA position or interleaved at multiple network layers, with empirical gains observed in SI-SNRi, WER, and real-time factors.
- Point Cloud Analysis (Chen et al., 10 Jun 2024): In PointABM, initial Transformer-based global context encoding is followed by a stack of Bi-SSM layers; this hybridization achieves improvements on ModelNet40/ScanObjectNN.
- Time Series and Adaptive Pooling (Xiong et al., 2 Apr 2025): Attention Mamba further accelerates BAM by combining adaptive global pooling for Q/K (avoiding the $\mathcal{O}(L^2)$ attention map) with a bidirectional SSM for V, giving a model that achieves linear scaling while increasing accuracy and receptive field (see the sketch after this list).
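A hedged sketch of the pooling idea from the last bullet follows: the score matrix is kept at a fixed pooled width so it never grows quadratically in $L$. Note the divergences from the published Attention Mamba design, all of which are simplifying assumptions: keys and values (rather than queries and keys) are pooled so that per-token outputs are retained, a single head is used, and the bidirectional SSM value path is omitted.

```python
import torch
import torch.nn as nn


class PooledAttention(nn.Module):
    """Attention with adaptively pooled keys/values: the score matrix is
    (L x p) rather than (L x L), so cost grows as O(L * p * d) for fixed p."""

    def __init__(self, d_model: int = 64, pooled_len: int = 32):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.kv = nn.Linear(d_model, 2 * d_model)
        self.pool = nn.AdaptiveAvgPool1d(pooled_len)          # L tokens -> p tokens
        self.scale = d_model ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, L, d_model)
        q = self.q(x)                                          # (B, L, d)
        k, v = self.kv(x).chunk(2, dim=-1)                     # (B, L, d) each
        # Adaptive pooling over the sequence axis compresses K and V to p tokens.
        k = self.pool(k.transpose(1, 2)).transpose(1, 2)       # (B, p, d)
        v = self.pool(v.transpose(1, 2)).transpose(1, 2)       # (B, p, d)
        attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)  # (B, L, p)
        return attn @ v                                        # (B, L, d)


# Usage: out = PooledAttention()(torch.randn(2, 4096, 64))  # linear in L for fixed p
```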
4. Empirical Performance and Benchmarks
BAM and its derivatives demonstrate consistent empirical advantages across domains:
- Video Generation: In Matten, BAM reduces FVD by $10$–$15$ points against both pure Mamba and pure-attention baselines at equal or lower FLOPs (SkyTimelapse, Baidu-videogen). Increasing model complexity continues to reduce FVD, demonstrating strong scalability (Gao et al., 5 May 2024).
- Vision Tasks: VideoMambaPro, incorporating masked backward computation akin to BAM, achieves competitive detection mAP and top-1 action-recognition accuracy at high throughput (Ibrahim et al., 11 Feb 2025).
- Speech/Audio: In the dual-path speech separation regime, DPMamba-M (bi-directional) attains an SI-SNRi of $22.6$ dB, matching Sepformer at a fraction of the parameter count and with greatly reduced memory. Fake-Mamba (PN-BiMamba) achieves low EERs on ASVspoof 21 LA and DF with real-time inference (Xuan et al., 12 Aug 2025).
- Time Series: Attention Mamba (with BAM core) achieves the lowest MSE/MAE on $5/7$ major datasets and markedly better MSE on large-scale transport forecasting, with shorter training times than bidirectional S-Mamba (Xiong et al., 2 Apr 2025).
- Point Cloud: PointABM yields absolute accuracy gains over PointMamba, with further improvements under pretraining (Chen et al., 10 Jun 2024).
5. Limitations, Design Trade-offs, and Domain-specific Extensions
While BAM delivers substantial efficiency and context reach, several trade-offs and limitations are noted:
- Context Fusion Locality: Dual-stream fusion in BAM is typically local (element-wise) and does not model all cross-token interactions directly. Supplementing BAM with lightweight global self-attention modules or periodic full-attention windows can close this gap (Ibrahim et al., 11 Feb 2025).
- Parameter and Compute Overhead: Bidirectionality nearly doubles per-block computation versus standard unidirectional SSM, though localized variants (e.g. LBMamba) recover much of this cost by restricting the backward scan within thread-local segments and alternating scan direction layerwise (Zhang et al., 19 Jun 2025).
- Stability Requirements: Careful SSM kernel parameterization (notably the $\Delta$ step-size scaling) is crucial for numerical stability over long sequences, and the initialization scheme for the state matrix $A$ impacts convergence (Zhang et al., 21 May 2024).
- Residual Nonlinearity: Vanilla Mamba (with minimal nonlinear gating) is insufficient for high-level semantic modeling (e.g. ASR) unless augmented with feedforward residual modules and nonlinearity (Zhang et al., 21 May 2024).
- Extension Potential: BAM may be extended by hierarchical stacking (interleaving coarser and finer BAM/focal-attention layers), cross-directional attention at coarse scales, or learned time-dependent kernel modulations, with potential for increased global relational modeling (Ibrahim et al., 11 Feb 2025).
6. Summary of Key Variants and Implementation Strategies
The following table summarizes major published BAM variants and their core mechanics:
| Variant/Domain | BAM Construction | Bidirectionality/Fusion | Local vs. Global Context | Reference |
|---|---|---|---|---|
| Matten (video) | Spatial MHA → Temporal MHA → Bi-SSM | Sum of forward/backward global SSM | Local details + global context | (Gao et al., 5 May 2024) |
| VideoMamba/ViM | Parallel F/B SSM with SiLU gating | SiLU-gated sum | Linear SSM on tokens | (Ibrahim et al., 11 Feb 2025) |
| Dual-path Mamba (speech) | Chunked F/B selective SSM, both short- and long-term | Averaging, linear projection | Bidirectional at both chunk scales | (Jiang et al., 27 Mar 2024) |
| PointABM | Pre-attention Transformer, stack of F/B SSM | Sum, nonlinearity, residual | One-shot global context encoding + Bi-SSM | (Chen et al., 10 Jun 2024) |
| Attention Mamba (time series) | Adaptive pooling + BAM block (F/B Conv1D SSM) | RVS sum and re-reverse, elementwise fusion | Global weighted + local nonlinear | (Xiong et al., 2 Apr 2025) |
| LBMamba/LBVim | In-register local backward scan | Local sum in each window, alternate sequence reversal globally | Near-global context at lower cost | (Zhang et al., 19 Jun 2025) |
Implementation best practices for BAM, especially in large-scale or hardware-sensitive deployments, include confining backward passes to per-thread or local memory, tuning window sizes as a function of input length, and alternating scan direction across layers to achieve a global receptive field with minimal compute overhead (Zhang et al., 19 Jun 2025).
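The sketch below illustrates two of these practices: a backward scan confined to local segments and layerwise alternation of scan direction. The EMA-style recurrence, fixed segment size, fusion by summation, and all module names are simplifying assumptions for illustration, not the LBMamba implementation.

```python
import torch
import torch.nn as nn


class LocalBackwardScan(nn.Module):
    """Global forward scan plus a backward scan confined to local segments,
    approximating full bidirectionality at lower cost."""

    def __init__(self, d_model: int, segment: int = 32):
        super().__init__()
        self.segment = segment
        self.decay_logit = nn.Parameter(torch.zeros(d_model))

    def _fwd_scan(self, x: torch.Tensor) -> torch.Tensor:
        a = torch.sigmoid(self.decay_logit)                    # per-channel decay in (0, 1)
        h = torch.zeros_like(x[:, 0])
        ys = []
        for t in range(x.shape[1]):
            h = a * h + (1 - a) * x[:, t]
            ys.append(h)
        return torch.stack(ys, dim=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        bsz, L, d = x.shape
        assert L % self.segment == 0, "pad to a multiple of the segment size"
        y_fwd = self._fwd_scan(x)                              # global forward scan
        # Backward scan restricted to each segment: no state crosses segment boundaries.
        seg = x.reshape(bsz * L // self.segment, self.segment, d)
        y_bwd = self._fwd_scan(seg.flip(1)).flip(1).reshape(bsz, L, d)
        return y_fwd + y_bwd


class AlternatingStack(nn.Module):
    """Stack whose scan direction alternates layerwise, so that after two layers
    every token has received globally propagated context from both directions."""

    def __init__(self, d_model: int, depth: int = 4, segment: int = 32):
        super().__init__()
        self.layers = nn.ModuleList(
            [LocalBackwardScan(d_model, segment) for _ in range(depth)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for i, layer in enumerate(self.layers):
            if i % 2 == 1:                                     # reverse the sequence on odd layers
                x = layer(x.flip(1)).flip(1) + x
            else:
                x = layer(x) + x
        return x


# Usage: out = AlternatingStack(64)(torch.randn(2, 256, 64))
```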
7. Applications and Directions for Further Research
BAM demonstrates strong cross-domain viability:
- Vision: Efficient backbone for high-res video synthesis, detection, segmentation; superior throughput/accuracy trade-off on ImageNet-1K, ADE20K, and COCO (Gao et al., 5 May 2024, Ibrahim et al., 11 Feb 2025, Zhang et al., 19 Jun 2025).
- Speech and Audio: Real-time deepfake detection, high-fidelity speech separation, robust code-switching ASR, with reduced memory and latency over transformer baselines (Zhang et al., 21 May 2024, Xuan et al., 12 Aug 2025, Jiang et al., 27 Mar 2024).
- Time Series: Adaptive time series forecasting in industrial, weather, and transportation domains, scaling efficiently to long input horizons (Xiong et al., 2 Apr 2025).
- 3D Point Cloud: Hybrid perception backbones for object classification and segmentation (Chen et al., 10 Jun 2024).
Potential extensions include multi-stage hybridization (full self-attention in coarser layers, BAM in fine layers), learned adaptive kernel parameterizations for richer positional embeddings, and integration with domain-specific pretext tasks or cross-modal fusion mechanisms (Ibrahim et al., 11 Feb 2025, Gao et al., 5 May 2024, Xuan et al., 12 Aug 2025).
BAM’s efficiency, flexibility for hierarchical stacking, and demonstrated gains in context modeling position it as a compelling foundation for long-sequence modeling in computationally demanding environments. Further exploration of cross-directional fusion schemes, global context routing, and hardware-specific optimizations is ongoing in the field (Ibrahim et al., 11 Feb 2025, Zhang et al., 19 Jun 2025).