
Bidirectional Spatiotemporal Mamba

Updated 2 January 2026
  • Bidirectional Spatiotemporal Mamba is a class of state-space neural architectures that encode both global and local spatial-temporal patterns using bidirectional scanning.
  • It leverages linear-time complexity and selective scanning to overcome the quadratic cost and rigidity of transformer self-attention, with applications in motion prediction and video analysis.
  • The architecture employs dual-modality scanning, dynamic gating, and fusion strategies to achieve parameter efficiency and high performance in tasks like pose estimation and text-driven motion synthesis.

Bidirectional Spatiotemporal Mamba is a class of state-space neural architectures that efficiently encode global and local patterns in both spatial and temporal dimensions by leveraging bidirectional, selective scanning mechanisms. Designed to replace the quadratic cost and representational rigidity of self-attention in transformer-based frameworks, Bidirectional Spatiotemporal Mamba provides linear-time, parameter-efficient modeling of complex, non-local dependencies, with validated applications in multi-person motion prediction, video understanding, pose estimation, and text-driven motion synthesis (Yin et al., 25 Dec 2025, Huang et al., 2024, Park et al., 2024, Zhan et al., 10 Mar 2025).

1. Foundations of Bidirectional Spatiotemporal Mamba

Bidirectional Spatiotemporal Mamba architectures extend the Mamba state-space model—originally a selective, input-dependent SSM with linear computational complexity—by introducing bidirectional scans over spatial and temporal axes. In contrast to causal, unidirectional variants, bidirectional Mamba processes input sequences in both forward and reverse order, fusing both representations to eliminate causal bias and improve context aggregation. This mechanism is crucial when capturing long-range, cross-dimensional relationships inherent in spatial-temporal data, such as human joint trajectories, video token sequences, or multimodal sensor streams (Yin et al., 25 Dec 2025, Park et al., 2024).

The base SSM is a continuous- or discrete-time recurrence

$$h_n = \overline{A}_n\, h_{n-1} + \overline{B}_n\, x_n, \qquad y_n = C\, h_n + D\, x_n$$

with input-dependent dynamic gating and parameterization. Bidirectionality is realized by running this recurrence in both scan orders and summing or concatenating the outputs (Park et al., 2024, Lavaud et al., 2024).
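The recurrence above can be sketched in a few lines of NumPy. This is a minimal illustration, not any paper's implementation: the function names are hypothetical, and the state transition is assumed diagonal (elementwise), as in Mamba-style selective SSMs.

```python
import numpy as np

def selective_ssm_scan(x, A_bar, B_bar, C, D):
    """Run the discrete recurrence h_n = A_bar_n * h_{n-1} + B_bar_n * x_n,
    y_n = C . h_n + D * x_n, with per-step (input-dependent) A_bar, B_bar."""
    L, d_state = A_bar.shape
    h = np.zeros(d_state)
    y = np.empty(L)
    for n in range(L):
        h = A_bar[n] * h + B_bar[n] * x[n]   # elementwise: diagonal transition
        y[n] = C @ h + D * x[n]
    return y

def bidirectional_ssm(x, A_bar, B_bar, C, D):
    # Forward scan, plus a backward scan whose output is re-aligned, then summed.
    fwd = selective_ssm_scan(x, A_bar, B_bar, C, D)
    bwd = selective_ssm_scan(x[::-1], A_bar[::-1], B_bar[::-1], C, D)[::-1]
    return fwd + bwd
```

In practice both scans use hardware-efficient parallel scan kernels rather than a Python loop; the loop is shown only to make the recurrence explicit.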

2. Architectural Mechanisms and Scanning Strategies

In practice, bidirectional spatiotemporal scanning is embedded in modular blocks. Each block is responsible for either spatial (joint-dimension, grid position) or temporal (frame/step index) scanning. The scan order permutation can be purely forward, purely backward, or span both directions; in video and pose applications, “spatio-temporal reversal” is empirically optimal, as it maximizes long-range pairwise coupling (Park et al., 2024).
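The different scan-order permutations are easy to make concrete on a small frame-by-joint grid. The snippet below is a hypothetical illustration (the grid sizes and variable names are not from any cited paper), flattening tokens frame-major and comparing temporal-only, spatial-only, and spatio-temporal reversal:

```python
import numpy as np

# T frames x J joints, flattened frame-major (all joints of frame 0, then frame 1, ...).
T, J = 3, 2
tokens = np.arange(T * J).reshape(T, J)

forward      = tokens.reshape(-1)              # causal raster scan
temporal_rev = tokens[::-1, :].reshape(-1)     # reverse frame order only
spatial_rev  = tokens[:, ::-1].reshape(-1)     # reverse joint order only
st_rev       = tokens[::-1, ::-1].reshape(-1)  # spatio-temporal reversal
```

Note that spatio-temporal reversal is exactly the forward raster read backwards, so pairing the forward scan with it yields a fully bidirectional pass over the flattened grid.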

For motion and video:

  • Spatial Module: Scans across joints, body parts, or grid positions, often enhanced with global-local reordering (e.g., via kinematic chains in pose estimation).
  • Temporal Module: Scans forward and backward across frames, capturing both short-term transitions and long-term temporal structure.
  • Fusion: LayerNorm, residual connections, and non-linear gating (SiLU, elementwise product) are used to integrate bidirectional scan outputs. In multi-expert setups, outputs are weighted by routing logits and mixed accordingly (Yin et al., 25 Dec 2025).

Typical pseudocode for a bidirectional scan along one axis (note that the backward output must be reversed again before fusion, so that both scans are aligned index-by-index):

```
F_fwd = MambaForward(input_sequence)
F_bwd = reverse(MambaBackward(reverse(input_sequence)))
F_out = F_fwd + F_bwd + input_sequence
```
This representation may be stacked (e.g., Spatial→Temporal modules), split (e.g., part-based and whole-body branches (Zhan et al., 10 Mar 2025)), or fused via dynamic attention modules.
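A runnable version of this pattern can be sketched as follows; the `ssm_scan` stand-in uses a fixed linear recurrence purely for illustration, where a real block would use input-dependent, gated SSM parameters:

```python
import numpy as np

def ssm_scan(x, a=0.9, b=0.1):
    """Stand-in for a Mamba scan: linear recurrence h_n = a*h_{n-1} + b*x_n
    along axis 0 (a real block learns input-dependent parameters)."""
    h = np.zeros_like(x[0])
    out = []
    for x_n in x:
        h = a * h + b * x_n
        out.append(h)
    return np.stack(out)

def bidirectional_block(x):
    f_fwd = ssm_scan(x)                # forward scan
    f_bwd = ssm_scan(x[::-1])[::-1]    # backward scan, re-aligned to input order
    return f_fwd + f_bwd + x           # fuse with a residual connection
```

Because the backward output is flipped back before fusion, a symmetric input sequence produces a symmetric output, which is a quick sanity check that the two scans are aligned.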

3. Computational Complexity and Efficiency

Bidirectional Spatiotemporal Mamba inherits the linear computational profile of SSMs:

  • Mamba blocks: Run two scans (forward, backward) per axis: O(L·d), L = sequence or grid length, d = state dimension.
  • Comparison with self-attention: Standard transformers require O(L²·d) to form and apply the L×L attention matrix; bidirectional Mamba needs O(4·L·d) for dual-axis two-way scanning (Yin et al., 25 Dec 2025, Park et al., 2024).
  • Parameter economy: Key parameters are small (conv1d kernels, state matrices, linear projections); ST-MoE demonstrates 41.38% fewer parameters and 3.6× faster training versus transformer baselines (Yin et al., 25 Dec 2025). PoseMamba achieves SOTA accuracy with 2–6× fewer params and 4–10× lower FLOPs than MixSTE or MotionBERT (Huang et al., 2024).
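The asymptotic gap above can be made concrete with a back-of-envelope multiply-accumulate count (a rough sketch; constants and per-layer details are ignored):

```python
def attention_cost(L, d):
    # O(L^2 * d): pairwise score computation and value mixing
    return L * L * d

def bidirectional_mamba_cost(L, d, n_axes=2, n_dirs=2):
    # O(4 * L * d): two scan directions over two axes
    return n_axes * n_dirs * L * d

L, d = 4096, 64
speedup = attention_cost(L, d) // bidirectional_mamba_cost(L, d)
print(f"{speedup}x fewer MACs")  # ratio is L / 4 = 1024 at L = 4096
```

The ratio grows linearly with sequence length, which is why the advantage is most pronounced for long videos and dense joint-by-frame token grids.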

4. Variants and Domain-Specific Implementations

Several architectures instantiate bidirectional spatiotemporal Mamba:

| Model | Domain | Bidirectional Scan Axes | Additional Modules |
|---|---|---|---|
| ST-MoE (Yin et al., 25 Dec 2025) | Motion prediction | Spatial + temporal | Mixture of Experts, gating |
| PoseMamba (Huang et al., 2024) | 3D pose estimation | Global + local spatial, temporal | Kinematic chain reordering |
| VideoMamba (Park et al., 2024) | Video recognition | Spatio-temporal permutation | 3D conv tokenizer, PE |
| HiSTF Mamba (Zhan et al., 10 Mar 2025) | Text-to-motion | Bi-temporal, dual-spatial | Dynamic Spatiotemporal Fusion |
| GSMamba (Ko et al., 1 Oct 2025) | Video super-resolution | Temporal (frame window) | SWSA, gather-scatter, flow |
| Bi-Mamba (Lavaud et al., 2024) | Anomalous diffusion | Temporal | Fused regression & segmentation |

Contextual significance: Each architecture tailors scan permutations, fusion and routing mechanisms to encode structured dependencies—e.g., four expert chains in ST-MoE, global-local kinematic order in PoseMamba, part/whole branches in HiSTF Mamba, and alignment-aware propagation in GSMamba.

5. Empirical Performance and Benchmarks

Bidirectional Spatiotemporal Mamba modules consistently yield superior accuracy, parameter, and throughput trade-offs over transformer-based and RNN-based competitors:

  • Motion prediction (ST-MoE): Bi-ST-Mamba improves JPE by 4.3 mm vs forward-only and 5.1 mm vs transformer, and runs 26% faster per iteration (Yin et al., 25 Dec 2025).
  • 3D pose estimation (PoseMamba): PoseMamba-L reaches 38.1 mm P1 on Human3.6M using 6.7M params vs 42M for MotionBERT, outperforming MixSTE at a fraction of MACs (Huang et al., 2024).
  • Video understanding (VideoMamba): Spatio-temporal reversal yields +3–4% higher top-1 accuracy on HMDB51 and SSV2, with only ~1/4 the FLOPs of VideoSwin-T (Park et al., 2024).
  • Text-driven motion (HiSTF Mamba): FID improves by ~30% over Motion Mamba, with tight semantic alignment measured by R-Precision and reduced MM-Dist (Zhan et al., 10 Mar 2025).
  • Anomalous diffusion (Bi-Mamba): Bidirectionality boosts F1 and reduces MAE by 0.05 over Bi-RNN baselines (Lavaud et al., 2024).
  • Video super-resolution (GSMamba): Achieves 38.25 dB PSNR on Vimeo-90K-T with fewer FLOPs than BasicVSR++, and reduces occlusion artifacts via gather-scatter anchoring (Ko et al., 1 Oct 2025).

Empirical ablations attribute these gains to the ability of bidirectional scans to complete global context and avoid causal edge effects. Removing the backward scan increases error (e.g., by 0.05 MAE for anomalous-exponent inference (Lavaud et al., 2024)); global-local fusion further improves accuracy.

6. Applications, Limitations, and Design Insights

Bidirectional Spatiotemporal Mamba modules are now established across human motion analysis, monocular 3D vision, multimodal synthesis, sequential diffusion modeling, and video restoration.

Key application domains:

  • Multi-person motion prediction and human motion analysis
  • Monocular 3D pose estimation
  • Video understanding and recognition
  • Text-driven motion synthesis
  • Anomalous diffusion and sequential diffusion modeling
  • Video super-resolution and restoration

Limitations:

  • Outputs are typically point estimates with no built-in uncertainty quantification (Lavaud et al., 2024).
  • Performance depends on tailored ordering and fusion strategies for each domain.
  • Alignment methods (e.g., optical flow in GSMamba) may become inaccurate under large motion or blur (Ko et al., 1 Oct 2025).
  • Windowed propagation does not guarantee true global context for extremely long video sequences.

This suggests that while Bidirectional Spatiotemporal Mamba models deliver strong efficiency and accuracy, further research on alignment, uncertainty modeling, and integration with convolutional or U-Net architectures for segmentation is warranted in specialized settings.

7. Implementation and Practical Recommendations

Implementing a Bidirectional Spatiotemporal Mamba block requires:

  • Dynamic, input-dependent gating for SSM parameters;
  • Efficient scan routines (forward and reversed order) over arbitrary axes (joint, frame, part);
  • Fusion strategies tailored to the domain (elementwise sum, concatenation, attention-weighted mixing);
  • Optional expert mixing (as in ST-MoE), local-global split (as in PoseMamba), or part/whole representations (HiSTF Mamba).
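The checklist above can be assembled into a minimal end-to-end sketch. This is a toy composition under stated assumptions, not any paper's block: the scan is a fixed linear recurrence stand-in, the layout (frames × joints × channels) is hypothetical, and fusion uses a SiLU gate with elementwise product and a residual skip, per the fusion description above.

```python
import numpy as np

def silu(z):
    return z / (1.0 + np.exp(-z))

def scan(x, a=0.9):
    # Stand-in linear recurrence for a selective SSM scan along axis 0.
    h, out = np.zeros_like(x[0]), []
    for x_n in x:
        h = a * h + (1 - a) * x_n
        out.append(h)
    return np.stack(out)

def bi_scan(x):
    return scan(x) + scan(x[::-1])[::-1]     # forward + re-aligned backward pass

def st_block(x):
    """x: (T, J, d) -- frames x joints x channels (hypothetical layout)."""
    T, J, d = x.shape
    # Spatial module: bidirectional scan over joints, per frame.
    s = np.stack([bi_scan(x[t]) for t in range(T)])
    # Temporal module: bidirectional scan over frames, per joint.
    st = np.stack([bi_scan(s[:, j]) for j in range(J)], axis=1)
    # Fusion: SiLU gate, elementwise product, residual skip connection.
    return x + silu(st) * st
```

Stacking several such blocks, optionally with expert routing in place of the single `bi_scan`, recovers the spatial→temporal module composition described in Section 2.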

Practitioners may consult the official codebases for ST-MoE (https://github.com/alanyz106/ST-MoE), PoseMamba, GSMamba, and Bi-Mamba, as cited in the respective papers.

Bidirectional Spatiotemporal Mamba thus provides a generalizable, efficient architecture for flexible modeling of global-local dependencies in high-dimensional spatiotemporal tasks.
