Global Spatiotemporal Mamba
- Global Spatiotemporal Mamba is a neural architecture that decouples global context and local detail to efficiently model high-resolution spatiotemporal data.
- It utilizes a two-stream pipeline with a 6D selective space–time scan and SSM operators to capture multi-directional dependencies in video sequences.
- The model achieves linear computational complexity and superior accuracy in tasks like video-based human pose estimation compared to traditional convolutional and attention-based methods.
Global Spatiotemporal Mamba refers to a class of neural architectures that extend the core Mamba selective state space model (SSM) methodology to achieve efficient, expressive, and scalable representation learning for high-resolution spatiotemporal data. The approach was introduced to address the dual challenge of modeling both global long-range context and fine-grained local dynamics across space and time, with a specific focus on video-based human pose estimation (VHPE) and other densely structured temporal vision tasks (Feng et al., 13 Oct 2025). Distinct from prior convolutional or attention-based approaches—which typically unify all spatiotemporal modeling into a monolithic block and suffer from quadratic scaling—Global Spatiotemporal Mamba introduces an architecture that separates global and local representation learning, extends Mamba beyond 1D, and achieves linear computational complexity with respect to input sequence size.
1. Core Model Architecture
The foundation of the Global Spatiotemporal Mamba architecture is a two-stream pipeline capable of learning high-resolution, holistic spatiotemporal features from video sequences. The initial stage comprises a visual encoder that processes frame-wise RGB video input into high-resolution feature maps. Each frame is linearly projected into a D-dimensional latent space and augmented with spatial and temporal embeddings: fixed sine–cosine functions for space and learnable parameters for time, while maintaining spatial detail at, e.g., one quarter of the input resolution.
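A minimal PyTorch sketch of this tokenization step is given below: per-frame features are linearly projected to a D-dimensional space and augmented with fixed sine–cosine spatial embeddings and learnable temporal embeddings. Module and function names are illustrative assumptions, not identifiers from the reference implementation.

```python
import math
import torch
import torch.nn as nn

def sincos_spatial_embedding(h, w, dim):
    """Fixed 2D sine-cosine positional embedding of shape (h*w, dim)."""
    assert dim % 4 == 0, "dim must be divisible by 4 for a 2D sin-cos embedding"
    freqs = torch.exp(torch.arange(0, dim // 4) * (-math.log(10000.0) / (dim // 4)))
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    ys, xs = ys.flatten().float()[:, None], xs.flatten().float()[:, None]
    return torch.cat([torch.sin(xs * freqs), torch.cos(xs * freqs),
                      torch.sin(ys * freqs), torch.cos(ys * freqs)], dim=1)

class FrameTokenizer(nn.Module):
    """Projects per-frame feature maps to D dims and adds spatial/temporal embeddings."""
    def __init__(self, in_ch, dim, h, w, num_frames):
        super().__init__()
        self.proj = nn.Linear(in_ch, dim)                               # linear projection to latent space
        self.register_buffer("pos_spatial", sincos_spatial_embedding(h, w, dim))
        self.pos_temporal = nn.Parameter(torch.zeros(num_frames, dim))  # learnable temporal embedding

    def forward(self, x):                                   # x: (B, T, H, W, C)
        B, T, H, W, C = x.shape
        tokens = self.proj(x.reshape(B, T, H * W, C))       # (B, T, H*W, D)
        tokens = tokens + self.pos_spatial[None, None]      # add fixed spatial embedding
        tokens = tokens + self.pos_temporal[None, :, None]  # add learnable temporal embedding
        return tokens
```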
Downstream, a Sequential Channel Attention module highlights channels of global importance by applying spatiotemporal global average pooling followed by channel-wise MLPs and sigmoid gating. This selectively weights feature channels that contribute most strongly to global temporal and spatial context, acting before the core global scan and fusion modules.
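A minimal sketch of such a channel-gating module, with illustrative (not paper-specified) layer sizes, could look as follows:

```python
import torch
import torch.nn as nn

class SequentialChannelAttention(nn.Module):
    """Illustrative channel gating: spatiotemporal global average pooling,
    a channel-wise MLP bottleneck, and sigmoid gating of the input features."""
    def __init__(self, dim, reduction=4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim // reduction),
            nn.GELU(),
            nn.Linear(dim // reduction, dim),
            nn.Sigmoid(),
        )

    def forward(self, x):                        # x: (B, T, H*W, D)
        ctx = x.mean(dim=(1, 2))                 # global average pool over time and space -> (B, D)
        gate = self.mlp(ctx)                     # channel-wise weights in (0, 1)
        return x * gate[:, None, None, :]        # reweight channels carrying global context
```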
The central module is the Global Spatiotemporal Mamba block, composed of:
- 6D Selective Space–Time Scan (STS6D): Features are unfolded into six one-dimensional sequences, each corresponding to a unique spatial or temporal scanning route: spatial x-axis (horizontal), spatial y-axis (vertical), temporal axis (depth), alongside their respective reverse (backward) traversals. Each route exposes a specific structural perspective of the high-dimensional sequence.
- Selective State Space (S6) Block: Each 1D sequence from its respective route is independently processed by an S6 module, the core Mamba operator for efficient, dynamic modeling of sequential dependencies with input-adaptive parameters (e.g., $\Delta$, $B$, $C$ per route).
- Spatial- and Temporal-Modulated Scan Merging (STMM): After six-fold scanning, paired forward and backward scans are combined (e.g., $Y_u = Y^{\text{fwd}} + \mathcal{T}^{-1}(Y^{\text{bwd}})$, with $\mathcal{T}^{-1}$ denoting the inverse transformation for reversed scans) into three primary context groups: unified global dynamics ($Y_u$), spatial details ($Y_s$), and temporal motion tendencies ($Y_t$). These are adaptively merged using convolutional and deformable convolutional (DCN) layers, with spatially learned offsets and modulation weights that compensate for misalignments and reinforce key contextual cues (a minimal scan-and-merge sketch follows this list).
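The following sketch illustrates the scan-and-merge mechanism under simplifying assumptions: the per-route sequence operator is a placeholder standing in for the S6 block, and each forward/backward pair is merged by a plain additive inverse-transform merge rather than the full STMM fusion.

```python
import torch
import torch.nn as nn

class STS6DScan(nn.Module):
    """Sketch of a 6-route selective space-time scan: three axis orderings,
    each traversed forward and backward. The per-route operator is a
    stand-in (identity) for the Mamba S6 block."""
    def __init__(self, seq_ops=None):
        super().__init__()
        # Six per-route sequence operators; identity placeholders by default.
        self.seq_ops = seq_ops or nn.ModuleList([nn.Identity() for _ in range(6)])

    def forward(self, x):                                        # x: (B, T, H, W, D)
        B, T, H, W, D = x.shape
        routes = [
            x.reshape(B, T * H * W, D),                          # horizontal (row-major) route
            x.permute(0, 1, 3, 2, 4).reshape(B, T * H * W, D),   # vertical (column-major) route
            x.permute(0, 2, 3, 1, 4).reshape(B, H * W * T, D),   # temporal-first route
        ]
        merged = []
        for i, r in enumerate(routes):
            fwd = self.seq_ops[2 * i](r)                              # forward traversal
            bwd = self.seq_ops[2 * i + 1](torch.flip(r, dims=[1]))    # backward traversal
            merged.append(fwd + torch.flip(bwd, dims=[1]))            # undo reversal, additive pair merge
        y_x, y_y, y_t = merged
        # Restore the original (B, T, H, W, D) layout before downstream STMM fusion.
        y_x = y_x.reshape(B, T, H, W, D)
        y_y = y_y.reshape(B, T, W, H, D).permute(0, 1, 3, 2, 4)
        y_t = y_t.reshape(B, H, W, T, D).permute(0, 3, 1, 2, 4)
        return y_x, y_y, y_t
```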
A parallel gated attention stream performs depthwise convolution and normalization, acting as an element-wise modulator of information propagation. Both streams are finally aggregated and passed through a feedforward network, producing the fused global spatiotemporal representation.
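A simplified sketch of the gating stream and final aggregation, with illustrative layer choices that are assumptions rather than the paper's exact design, is shown below:

```python
import torch
import torch.nn as nn

class GatedStream(nn.Module):
    """Illustrative parallel gate: depthwise convolution and normalization
    produce an element-wise modulator for the scanned features, followed by
    a feedforward network that aggregates the two streams."""
    def __init__(self, dim, expansion=4):
        super().__init__()
        self.dwconv = nn.Conv3d(dim, dim, kernel_size=3, padding=1, groups=dim)  # depthwise over (T, H, W)
        self.norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * expansion), nn.GELU(),
                                 nn.Linear(dim * expansion, dim))

    def forward(self, x, scanned):                   # both: (B, T, H, W, D)
        g = self.dwconv(x.permute(0, 4, 1, 2, 3))    # (B, D, T, H, W)
        g = g.permute(0, 2, 3, 4, 1)                 # back to channel-last layout
        gate = torch.sigmoid(self.norm(g))           # element-wise modulator in (0, 1)
        return self.ffn(gate * scanned)              # fuse gate stream with scanned features
```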
Local, high-frequency detail refinement is handled in a separate Local Refinement Mamba module based on windowed space-time scans.
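As a rough illustration of a windowed space-time scan, the sketch below partitions features into non-overlapping space-time windows and applies a sequence operator (a stand-in for the local Mamba scan) inside each window; the window sizes and the pass-through operator are assumptions.

```python
import torch
import torch.nn as nn

def windowed_spacetime_scan(x, seq_op, wt=2, wh=8, ww=8):
    """Partition (B, T, H, W, D) features into non-overlapping space-time
    windows, run a sequence operator inside each window, and fold the result
    back. Assumes T, H, W are divisible by the window sizes."""
    B, T, H, W, D = x.shape
    x = x.reshape(B, T // wt, wt, H // wh, wh, W // ww, ww, D)
    x = x.permute(0, 1, 3, 5, 2, 4, 6, 7)            # (B, nT, nH, nW, wt, wh, ww, D)
    windows = x.reshape(-1, wt * wh * ww, D)         # one short sequence per window
    windows = seq_op(windows)                        # local scan within each window
    x = windows.reshape(B, T // wt, H // wh, W // ww, wt, wh, ww, D)
    x = x.permute(0, 1, 4, 2, 5, 3, 6, 7).reshape(B, T, H, W, D)
    return x

# Usage with a placeholder operator:
# refined = windowed_spacetime_scan(features, nn.Identity())
```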
2. Methodological Innovations
The Global Spatiotemporal Mamba advances prior state space model and transformer approaches through several key innovations:
- Decoupled Global–Local Modeling: It explicitly separates the modeling of holistic global dynamics (via the main Global Spatiotemporal Mamba block) and fine-grained local motion details (via a specialized Local Refinement Mamba block using windowed space–time scan), addressing suboptimal performance from undifferentiated, single-structure models found in previous literature. This division allows global modules to focus on long-range dependencies while local modules recover high-frequency spatial–temporal details.
- 6D Selective Space–Time Scan (STS6D): The architecture generalizes the Mamba SSM from 1D to high-dimensional space-time by constructing six scanning paths per video tensor. This ensures every pixel accesses context along all axes and their reverses, enabling multi-directional long-range information flow while preserving the mild inductive bias of spatial and temporal locality.
- Adaptive Modulated Fusion (STMM): Merging the outputs of paired scans is accomplished via deformable convolutions with spatial and temporal modulation, as opposed to naive concatenation or summation. This mechanism learns spatially-aware offsets and channel-wise weights, dynamically compensating for local misalignments caused by complex motions or spatial deformations (a minimal DCN-based sketch follows this list).
- Pure SSM (Mamba) Operator Extension: Previous approaches hybridize convolutions, attention, or state space operators, often reintroducing unfavorable scaling. The Global Spatiotemporal Mamba is a pure SSM-based framework, with Mamba extended beyond 1D to high-dimensional spatiotemporal domains without introducing super-linear scaling bottlenecks.
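A minimal sketch of DCNv2-style modulated fusion of two scan outputs, using torchvision's deform_conv2d with illustrative prediction layers (not the paper's exact STMM design), is given below:

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class ModulatedMerge(nn.Module):
    """Illustrative STMM-style fusion of two scan outputs with a modulated
    deformable convolution: offsets and modulation weights are predicted
    from the concatenated inputs (DCNv2-style), then applied to their sum."""
    def __init__(self, dim, k=3):
        super().__init__()
        self.k = k
        self.weight = nn.Parameter(torch.randn(dim, dim, k, k) * 0.01)
        # Predict 2*k*k offsets and k*k modulation masks per location.
        self.pred = nn.Conv2d(2 * dim, 3 * k * k, kernel_size=3, padding=1)

    def forward(self, a, b):                       # a, b: (B, D, H, W) per-frame feature maps
        off_mask = self.pred(torch.cat([a, b], dim=1))
        offset = off_mask[:, :2 * self.k * self.k]
        mask = torch.sigmoid(off_mask[:, 2 * self.k * self.k:])   # modulation weights in (0, 1)
        return deform_conv2d(a + b, offset, self.weight, padding=self.k // 2, mask=mask)
```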
3. Mathematical Formulation
The operator along each scan direction is characterized as a discretized state space model

$$h'(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t),$$

with discretized parameters (zero-order hold)

$$\bar{A} = \exp(\Delta A), \qquad \bar{B} = (\Delta A)^{-1}\bigl(\exp(\Delta A) - I\bigr)\,\Delta B \approx \Delta B,$$

such that for each scan sequence $x_{1:L}$, the evolution is given by

$$h_t = \bar{A}\,h_{t-1} + \bar{B}\,x_t, \qquad y_t = C\,h_t.$$

After all six scans, paired forward and backward outputs are merged additively with the inverse scan transformation, e.g. $Y_u = Y^{\text{fwd}} + \mathcal{T}^{-1}(Y^{\text{bwd}})$, yielding $Y_u$ (unified), $Y_s$ (spatial), and $Y_t$ (temporal). Modulated merging is then performed by a deformable convolution of the form

$$Y(p) = \sum_{k=1}^{K} w_k \cdot m_k \cdot X\!\left(p + p_k + \Delta p_k\right),$$

with a similar mechanism for temporal fusion, where $\Delta p_k$ and $m_k$ are spatially learned offsets and modulation weights.
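The recurrence above admits a short, non-optimized reference implementation. The sketch below is a generic S6-style selective scan for one route, using input-dependent $\Delta$, $B$, $C$ and the simplified discretization $\bar{B} \approx \Delta B$; it is not the paper's optimized kernel, and the layer names are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveScanRef(nn.Module):
    """Reference S6-style recurrence for one scan route:
    h_t = Abar_t * h_{t-1} + Bbar_t * x_t,  y_t = C_t h_t,
    with Delta, B, C predicted from the input (selective, input-adaptive)."""
    def __init__(self, dim, state=16):
        super().__init__()
        self.A_log = nn.Parameter(torch.zeros(dim, state))        # A = -exp(A_log) < 0 for stability
        self.to_delta = nn.Linear(dim, dim)
        self.to_B = nn.Linear(dim, state)
        self.to_C = nn.Linear(dim, state)

    def forward(self, x):                                         # x: (B, L, D)
        Bsz, L, D = x.shape
        A = -torch.exp(self.A_log)                                # (D, N)
        delta = F.softplus(self.to_delta(x))                      # (B, L, D), positive step sizes
        Bt, Ct = self.to_B(x), self.to_C(x)                       # (B, L, N)
        h = x.new_zeros(Bsz, D, A.shape[1])                       # hidden state (B, D, N)
        ys = []
        for t in range(L):
            dA = torch.exp(delta[:, t, :, None] * A)              # Abar_t, zero-order hold
            dB = delta[:, t, :, None] * Bt[:, t, None, :]         # Bbar_t ~ Delta * B
            h = dA * h + dB * x[:, t, :, None]                    # state update
            ys.append((h * Ct[:, t, None, :]).sum(-1))            # y_t = C_t h_t -> (B, D)
        return torch.stack(ys, dim=1)                             # (B, L, D)
```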
4. Complexity and Scalability
A salient feature of the Global Spatiotemporal Mamba is linear computational complexity in input sequence size, made possible by the application of SSM blocks to each 1D scan independently and the avoidance of any explicit global attention mechanism. Unlike transformer variants whose self-attention operations scale quadratically, $O(N^2)$, in the number of tokens (prohibiting practical high-resolution sequence processing), the Mamba-based scanning/merge design ensures that both memory footprint and compute scale linearly with the total number of pixels and frames.
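A back-of-envelope comparison illustrates the scaling argument; the frame count, feature resolution, and channel sizes below are illustrative assumptions rather than figures reported in the paper.

```python
# Illustrative scaling comparison for one layer (numbers are assumptions):
# a clip of 8 frames at 64x48 feature resolution gives L = 8 * 64 * 48 tokens.
L, D, N = 8 * 64 * 48, 256, 16
attn_ops = L * L * D            # pairwise token interactions in self-attention ~ O(L^2 * D)
ssm_ops = 6 * L * D * N         # six 1D selective scans ~ O(L * D * N), linear in L
print(f"attention ~{attn_ops:.2e} ops, 6-route SSM scan ~{ssm_ops:.2e} ops "
      f"(ratio ~{attn_ops / ssm_ops:.0f}x)")
```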
Empirical results show that, for high-resolution input (e.g., feature maps downsampled to $1/4$ of the original resolution), the proposed method requires fewer FLOPs than state-of-the-art transformer-based pose estimation architectures, while retaining superior accuracy.
5. Empirical Performance
The architecture was extensively validated on multiple video-based human pose estimation datasets:
- PoseTrack2017 & PoseTrack2018: The base model (GLSMamba-B, ViT-B backbone) achieves 86.9 mAP on PoseTrack2017 validation, outperforming both top-performing convolution-based models (TDMI, 85.7 mAP) and attention-based models (DiffPose, 86.4 mAP). The large variant (GLSMamba-H, ViT-H backbone) advances to 88.0 mAP.
- PoseTrack21 and Sub-JHMDB: The method demonstrates robust performance, generalizing across varying video lengths, occlusion, and motion-blur scenarios.
- Ablation Studies: Isolated addition of the Global Spatiotemporal Mamba module provides significant boosts in accuracy, with the inclusion of local refinement further improving results, validating the architectural decoupling and fusion design philosophy.
6. Broader Implications and Applicability
While the central application is video-based human pose estimation, the methodology demonstrates potential for broader domains, including:
- 3D human pose estimation
- Video segmentation
- General video understanding tasks requiring both global context and local detail
The underlying separation of global and local modeling, and the scalable application of state space models to high-dimensional spatiotemporal data, provide a general framework for future video analysis models, especially where high spatial resolution and long-range context are equally critical.
7. Limitations and Outlook
A plausible implication is that, although the Global Spatiotemporal Mamba delivers both efficiency and accuracy, challenges remain in further generalizing the SSM operators for more extreme spatiotemporal heterogeneity, optimizing the interaction between global and local modules for edge cases (e.g., extreme occlusions), and improving the interpretability of the multi-scan fusion mechanism. Future work may also investigate integration with domain adaptation modules, physics priors, or task-specific self-supervised objectives for further robustness and transferability.
The Global Spatiotemporal Mamba thus synthesizes pure state space modeling, multi-directional space-time scanning, and adaptive context fusion into a unified, computationally efficient, and highly effective framework for high-resolution video sequence understanding, advancing the modeling granularity and scalability of spatiotemporal networks (Feng et al., 13 Oct 2025).