Bidirectional Attention Mamba

Updated 16 December 2025
  • BAM is a neural sequence modeling paradigm that combines linear-time structured state space scans with bidirectional attention to capture both forward and backward dependencies.
  • It achieves linear computational scaling by integrating efficient SSM scans with localized self-attention, making it suitable for long-sequence modeling in video, speech, and point cloud tasks.
  • The architecture offers domain-specific variants and demonstrable empirical gains in performance and complexity management across multiple modalities.

Bidirectional Attention Mamba (BAM) is a neural sequence modeling paradigm that unites the linear-time structured state space modeling of Mamba with bidirectional information aggregation, often augmented by localized or global self-attention. By efficiently capturing both forward and backward contextual dependencies across modalities such as video, vision, speech, time series, and 3D point clouds, BAM provides a scalable alternative to conventional self-attention architectures, enabling long-sequence modeling and global context propagation without incurring quadratic complexity.

1. Architectural Foundations and Mathematical Core

The core BAM block interleaves parallel forward and backward state-space model (SSM) scans—parameterized as (possibly input-dependent) linear recurrences—with optional local self-attention for fine-grained interactions. The canonical SSM recurrence for input $x_t$ at time/index $t$ is

$$h_t = \overline{A}_t h_{t-1} + \overline{B}_t x_t, \qquad y_t = C_t h_t + D_t x_t.$$

Bidirectionality is achieved by running a standard forward scan and an anti-causal backward scan, i.e., $h'_t = \overline{A}_t h'_{t+1} + \overline{B}_t x_t$ for $t$ decreasing, yielding backward outputs $y'_t$. These are merged by summation, concatenation, or learned gating:

$$y^{\mathrm{BAM}}_t = \mathrm{Merge}(y_t, y'_t).$$

In practice, the input-dependent ("selective") parameters $\{\overline{A}_t, \overline{B}_t, C_t\}$ are predicted per timestep by lightweight MLP/conv1d layers.
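The following sketch illustrates this bidirectional selective scan in PyTorch, using an explicit time loop for readability rather than the hardware-efficient parallel-scan kernels used in practice; the module name, the choice of linear projections for the selective parameters, and the learned concatenation-based merge are illustrative assumptions, not a specific published implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiSelectiveSSM(nn.Module):
    """Minimal bidirectional selective SSM (illustrative; not the optimized scan kernel)."""
    def __init__(self, d_model: int, d_state: int = 16):
        super().__init__()
        self.d_state = d_state
        # Input-dependent ("selective") parameters predicted per timestep.
        self.to_delta = nn.Linear(d_model, d_model)
        self.to_B = nn.Linear(d_model, d_state)
        self.to_C = nn.Linear(d_model, d_state)
        # Fixed negative-real A for stable decay; D is a skip connection.
        self.A_log = nn.Parameter(torch.zeros(d_model, d_state))
        self.D = nn.Parameter(torch.ones(d_model))
        self.merge = nn.Linear(2 * d_model, d_model)  # learned fusion of both directions

    def _scan(self, x, reverse: bool = False):
        B_, L, D = x.shape
        A = -torch.exp(self.A_log)                   # (D, N), negative real part
        delta = F.softplus(self.to_delta(x))         # (B, L, D), strictly positive step size
        Bt, Ct = self.to_B(x), self.to_C(x)          # (B, L, N) each
        h = x.new_zeros(B_, D, self.d_state)
        ys = []
        steps = range(L - 1, -1, -1) if reverse else range(L)
        for t in steps:
            A_bar = torch.exp(delta[:, t].unsqueeze(-1) * A)           # (B, D, N)
            B_bar = delta[:, t].unsqueeze(-1) * Bt[:, t].unsqueeze(1)  # (B, D, N)
            h = A_bar * h + B_bar * x[:, t].unsqueeze(-1)              # recurrence
            y = (h * Ct[:, t].unsqueeze(1)).sum(-1) + self.D * x[:, t] # readout
            ys.append(y)
        if reverse:
            ys = ys[::-1]                            # restore chronological order
        return torch.stack(ys, dim=1)                # (B, L, D)

    def forward(self, x):
        y_fwd = self._scan(x, reverse=False)
        y_bwd = self._scan(x, reverse=True)
        return self.merge(torch.cat([y_fwd, y_bwd], dim=-1))
```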

When integrated with local self-attention, the sequence of operations for a BAM block in a transformer-like backbone is:

  1. Layer normalization;
  2. Spatial and/or temporal local attention (quadratic but restricted to small groups);
  3. Bidirectional SSM scan;
  4. Residual projection and possible nonlinearity.

This hybridization achieves both fine-grained local interaction and efficient global dependency modeling at overall linear cost in sequence length (Gao et al., 5 May 2024, Ibrahim et al., 11 Feb 2025, Jiang et al., 27 Mar 2024).
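Continuing the sketch above, a minimal BAM block wired in this order might look as follows; the non-overlapping window partition, head count, and projection widths are assumptions for illustration rather than any specific published configuration.

```python
class BAMBlock(nn.Module):
    """Illustrative BAM block: norm -> local windowed attention -> bidirectional SSM -> residual projection."""
    def __init__(self, d_model: int, n_heads: int = 4, window: int = 16, d_state: int = 16):
        super().__init__()
        self.window = window
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.local_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.bi_ssm = BiSelectiveSSM(d_model, d_state)
        self.proj = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU())

    def _windowed_attention(self, x):
        # Quadratic attention restricted to non-overlapping windows of size `window`.
        B_, L, D = x.shape
        pad = (-L) % self.window
        x_p = F.pad(x, (0, 0, 0, pad))
        xw = x_p.view(B_ * (x_p.shape[1] // self.window), self.window, D)
        out, _ = self.local_attn(xw, xw, xw)
        return out.view(B_, -1, D)[:, :L]

    def forward(self, x):
        x = x + self._windowed_attention(self.norm1(x))  # fine-grained local interaction
        x = x + self.proj(self.bi_ssm(self.norm2(x)))    # global bidirectional context
        return x
```

Stacking such blocks keeps the per-block cost linear in sequence length, with the windowed attention handling short-range structure that the scan tends to compress away.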

2. Complexity Analysis and Scalability Properties

The principal motivation for BAM is its $O(L)$ time and memory scaling, with $L$ the input sequence length, versus $O(L^2)$ for full self-attention. Each direction of the SSM scan is $O(LN)$, with $N$ the SSM state size. Local windowed attention adds $O(Ldw)$ operations per head (with window size $w \ll L$ and head dimension $d$). The overall complexity per BAM block is summarized as:

| Block component | Complexity | Context span |
|---|---|---|
| Full SA | $O(L^2 D)$ | Global, pairwise |
| Local MHA (window) | $O(LdwH)$ | Local window |
| Bi-SSM scan | $O(LEN)$ | Global, sequential (bi-directional) |
| BAM | $O(LEN + LdwH)$ | Hybrid: global (bi-directional) + local |

Here $E$ denotes the (expanded) SSM channel dimension and $H$ the number of attention heads.

For long sequences and moderate $N$, BAM remains tractable on workloads such as high-resolution video, gigapixel images, long speech/audio, or point cloud patches (Gao et al., 5 May 2024, Ibrahim et al., 11 Feb 2025, Chen et al., 10 Jun 2024).
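As a rough back-of-the-envelope comparison (all hyperparameter values below are assumptions chosen for illustration, not figures from the cited papers), the gap between full attention and a BAM block at long sequence lengths spans orders of magnitude:

```python
# Illustrative per-block operation counts (hyperparameters are assumptions).
L = 16_384            # sequence length (e.g. flattened video or point-cloud tokens)
D = 512               # model width
E = 2 * D             # expanded SSM channel dimension
N = 16                # SSM state size
H, d, w = 8, 64, 64   # attention heads, head dim, local window size

full_attention = L * L * D      # O(L^2 D): pairwise global attention
bi_ssm         = 2 * L * E * N  # O(L E N): forward + backward scans
local_attn     = L * d * w * H  # O(L d w H): windowed attention
bam_block      = bi_ssm + local_attn

print(f"full attention   : {full_attention / 1e9:8.1f} GOPs")
print(f"BAM (bi-SSM + win): {bam_block / 1e9:8.1f} GOPs")
# full attention   :    137.4 GOPs
# BAM (bi-SSM + win):      1.1 GOPs
```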

3. Variants and Contextual Implementations

BAM manifests in domain-specific architectures:

  • Matten for Video Generation (Gao et al., 5 May 2024): BAM is structured as a composite residual block interleaving spatial (within-frame) and temporal (across time steps) self-attention with a fully flattened bidirectional SSM scan, operating at the U-Net bottleneck. This enables local detail capture and global context propagation, yielding state-of-the-art FVD at lower FLOPs than full-attention baselines.
  • Vision Mamba/VideoMamba (Ibrahim et al., 11 Feb 2025): BAM forms dual-stream parallel SSM layers, fused by elementwise SiLU-gated pointwise mixing, with hierarchical stacking and optional cross-scale modules. BAM provides the backbone for efficient long-sequence understanding in large-scale vision benchmarks.
  • Dual-path and Speech Applications (Jiang et al., 27 Mar 2024, Zhang et al., 21 May 2024, Xuan et al., 12 Aug 2025): BAM blocks replace self-attention in transformers/conformers for both speech enhancement and ASR, as well as real-time deepfake detection. In these models, the BAM mechanism is either directly substituted in the MHSA position or interleaved at multiple network layers, with empirical gains observed in SI-SNRi, WER, and real-time factors.
  • Point Cloud Analysis (Chen et al., 10 Jun 2024): In PointABM, initial Transformer-based global context encoding is followed by a stack of Bi-SSM layers; this hybridization achieves improvements on ModelNet40/ScanObjectNN.
  • Time Series and Adaptive Pooling (Xiong et al., 2 Apr 2025): Attention Mamba further accelerates BAM by combining adaptive global pooling for Q/K (avoiding the $O(N^2)$ attention matrix) with a bidirectional SSM for V, yielding a model that achieves linear scaling while increasing accuracy and receptive field.
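A hedged sketch of this pooled query/key idea, continuing with the BiSelectiveSSM module from the Section 1 sketch; the pooled-token count, the pooling of the value path, and the residual fusion are simplifying assumptions rather than the exact Attention Mamba design:

```python
class PooledAttentionBAM(nn.Module):
    """Illustrative: queries attend to a pooled set of keys (linear in L); values pass through a Bi-SSM."""
    def __init__(self, d_model: int, pooled_tokens: int = 32, d_state: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool1d(pooled_tokens)  # adaptive global pooling along the sequence
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v_ssm = BiSelectiveSSM(d_model, d_state)    # bidirectional SSM on the value path
        self.scale = d_model ** -0.5

    def forward(self, x):                                # x: (B, L, D)
        pooled = self.pool(x.transpose(1, 2)).transpose(1, 2)             # (B, k, D), k << L
        q, k = self.q(x), self.k(pooled)
        attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)  # (B, L, k), linear in L
        v = self.pool(self.v_ssm(x).transpose(1, 2)).transpose(1, 2)      # pooled Bi-SSM values, (B, k, D)
        return x + attn @ v                                               # (B, L, D)
```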

4. Empirical Performance and Benchmarks

BAM and its derivatives demonstrate consistent empirical advantages across domains:

  • Video Generation: In Matten, BAM reduces FVD by 10–15 points against both pure Mamba and pure-attention baselines at equal or lower FLOPs (SkyTimelapse, Baidu-videogen). Increasing model complexity continues to reduce FVD, demonstrating strong scalability (Gao et al., 5 May 2024).
  • Vision Tasks: VideoMambaPro, incorporating masked backward computation akin to BAM, achieves 49.2% mAP for object detection and 82.7% top-1 action recognition accuracy at high throughput (Ibrahim et al., 11 Feb 2025).
  • Speech/Audio: In the dual-path speech separation regime, DPMamba-M (bidirectional) attains an SI-SNRi of 22.6 dB, matching Sepformer at roughly 60% of the parameter count and with greatly reduced memory. Fake-Mamba (PN-BiMamba) achieves EERs of 0.97% (ASVspoof 21LA) and 1.74% (DF) with real-time inference (Xuan et al., 12 Aug 2025).
  • Time Series: Attention Mamba (with a BAM core) achieves the lowest MSE/MAE on 5 of 7 major datasets and up to 13% better MSE on large-scale transport forecasting, with training times 43% shorter than bidirectional S-Mamba (Xiong et al., 2 Apr 2025).
  • Point Cloud: PointABM yields up to +1.3% absolute gains over PointMamba and up to +2% with pretraining (Chen et al., 10 Jun 2024).

5. Limitations, Design Trade-offs, and Domain-specific Extensions

While BAM delivers substantial efficiency and context reach, several trade-offs and limitations are noted:

  • Context Fusion Locality: Dual-stream fusion in BAM is typically local (element-wise) and does not model all cross-token interactions directly. Supplementing BAM with lightweight global self-attention modules or periodic full-attention windows can close this gap (Ibrahim et al., 11 Feb 2025).
  • Parameter and Compute Overhead: Bidirectionality nearly doubles per-block computation versus standard unidirectional SSM, though localized variants (e.g. LBMamba) recover much of this cost by restricting the backward scan within thread-local segments and alternating scan direction layerwise (Zhang et al., 19 Jun 2025).
  • Stability Requirements: Careful SSM kernel parameterization (notably the $\Delta$ scaling) is crucial for numerical stability over long sequences, and initialization schemes for $A$ affect convergence (Zhang et al., 21 May 2024); a small sketch of this discretization follows this list.
  • Residual Nonlinearity: Vanilla Mamba (with minimal nonlinear gating) is insufficient for high-level semantic modeling (e.g. ASR) unless augmented with feedforward residual modules and nonlinearity (Zhang et al., 21 May 2024).
  • Extension Potential: BAM may be extended by hierarchical stacking (interleaving coarser and finer BAM/focal-attention layers), cross-directional attention at coarse scales, or learned time-dependent kernel modulations, with potential for increased global relational modeling (Ibrahim et al., 11 Feb 2025).
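Regarding the stability point above, here is a minimal sketch of the standard zero-order-hold-style discretization: with a softplus-parameterized step size $\Delta > 0$ and a negative-real $A$, the discrete transition factor $\overline{A}_t = \exp(\Delta_t A)$ stays strictly below one, keeping long scans contractive. Shapes and initializations are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# Zero-order-hold style discretization used in selective SSMs (illustrative shapes/values).
A = -torch.exp(torch.randn(64, 16))           # (D, N): negative real part => continuous-time decay
delta = F.softplus(torch.randn(2, 1024, 64))  # (B, L, D): per-timestep step size, strictly positive

A_bar = torch.exp(delta.unsqueeze(-1) * A)    # (B, L, D, N); 0 < A_bar < 1 since delta > 0 and A < 0
print(A_bar.min().item(), A_bar.max().item()) # recurrence h_t = A_bar_t * h_{t-1} + ... stays contractive
```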

6. Summary of Key Variants and Implementation Strategies

The following table summarizes major published BAM variants and their core mechanics:

| Variant/Domain | BAM Construction | Bidirectionality/Fusion | Local vs. Global Context | Reference |
|---|---|---|---|---|
| Matten (video) | Spatial MHA → Temporal MHA → Bi-SSM | Sum of forward/backward global SSM | Local details + global context | (Gao et al., 5 May 2024) |
| VideoMamba/ViM | Parallel F/B SSM with SiLU gating | SiLU-gated sum | Linear SSM on tokens | (Ibrahim et al., 11 Feb 2025) |
| Dual-path Mamba (speech) | Chunked F/B selective SSM, both short- and long-term | Averaging, linear projection | Bidirectional at both chunk scales | (Jiang et al., 27 Mar 2024) |
| PointABM | Pre-attention Transformer, stack of F/B SSM | Sum, nonlinearity, residual | One-shot global context encoding + Bi-SSM | (Chen et al., 10 Jun 2024) |
| Attention Mamba (time series) | Adaptive pooling + BAM block (F/B Conv1D SSM) | RVS sum and re-reverse, elementwise fusion | Global weighted + local nonlinear | (Xiong et al., 2 Apr 2025) |
| LBMamba/LBVim | In-register local backward scan | Local sum in each window, alternating sequence reversal globally | Near-global context at lower cost | (Zhang et al., 19 Jun 2025) |

Implementation best practices for BAM, especially in large-scale or hardware-sensitive deployments, include confining backward passes to per-thread or local memory, tuning window sizes as a function of input length, and alternating scan direction across layers to achieve global receptive field with minimal compute overhead (Zhang et al., 19 Jun 2025).
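For the layer-wise direction alternation mentioned above, a hedged sketch follows (simplified relative to the LBMamba/LBVim mechanism, which additionally performs a local backward scan in registers): each odd layer processes the token sequence reversed, so a stack of cheap forward-only scans accumulates a bidirectional receptive field across depth. The wrapper name and layer interface are assumptions.

```python
import torch
import torch.nn as nn

class AlternatingScanStack(nn.Module):
    """Illustrative: reverse token order on alternate layers so forward-only scans
    accumulate a bidirectional receptive field across depth."""
    def __init__(self, layers: nn.ModuleList):
        super().__init__()
        self.layers = layers  # each layer is assumed to map (B, L, D) -> (B, L, D) with a causal scan

    def forward(self, x):     # x: (B, L, D)
        for i, layer in enumerate(self.layers):
            if i % 2 == 1:    # odd layers see the sequence right-to-left
                x = torch.flip(layer(torch.flip(x, dims=[1])), dims=[1])
            else:
                x = layer(x)
        return x
```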

7. Applications and Directions for Further Research

BAM demonstrates strong cross-domain viability, with working instantiations in video generation, large-scale vision, speech enhancement and recognition, time-series forecasting, and point cloud analysis (see Section 3).

Potential extensions include multi-stage hybridization (full self-attention in coarser layers, BAM in fine layers), learned adaptive kernel parameterizations for richer positional embeddings, and integration with domain-specific pretext tasks or cross-modal fusion mechanisms (Ibrahim et al., 11 Feb 2025, Gao et al., 5 May 2024, Xuan et al., 12 Aug 2025).

BAM’s efficiency, flexibility for hierarchical stacking, and demonstrated gains in context modeling position it as a compelling foundation for long-sequence modeling in computationally demanding environments. Further exploration of cross-directional fusion schemes, global context routing, and hardware-specific optimizations is ongoing in the field (Ibrahim et al., 11 Feb 2025, Zhang et al., 19 Jun 2025).
