Video Mamba Suite: Efficient SSM Video Modeling
- Video Mamba Suite is a collection of architectures based on selective state space models that deliver scalable, high-performance video understanding with linear time complexity.
- It integrates specialized modules like the Decomposed Bi-Directional Mamba block and hybrid attention-SSM to support tasks such as action recognition, dense captioning, and video-language interaction.
- The system employs hardware-aware techniques including fused selective scans and memory blocking, significantly reducing GPU memory usage and computational load compared to traditional self-attention models.
The Video Mamba Suite encompasses a collection of state-space-model (SSM)–based architectures and modules that position the Mamba framework as a scalable, high-performance alternative to Transformer and convolutional neural network (CNN) architectures for video understanding. Its foundation is the selective SSM, designed for linear time and memory complexity, global spatio-temporal modeling capacity, and hardware-aware implementation. The Suite includes a range of modules and approaches addressing short- and long-term action recognition, dense captioning, retrieval, video-language interaction, and more. Key innovations include the Decomposed Bi-directional Mamba (DBM) block for efficient 3D video modeling, robust empirical performance on standard benchmarks, and adaptability as backbones, adapters, or multi-modal fusion components (Chen et al., 14 Mar 2024, Zhang et al., 24 Apr 2024, Park et al., 11 Jul 2024, Li et al., 11 Mar 2024).
1. Foundation: Selective State Space Model (SSM)
The Video Mamba Suite is built upon the Mamba SSM, a continuous- and discrete-time dynamical system for sequential data:
- Continuous-time formulation:

  $$h'(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t),$$

  where $x(t)$ is the input, $h(t)$ the hidden state, and $A$, $B$, $C$ are system matrices.
- Discretization (Zero-Order Hold) yields:

  $$h_t = \bar{A}\,h_{t-1} + \bar{B}\,x_t, \qquad y_t = C\,h_t,$$

  with $\bar{A} = \exp(\Delta A)$ and $\bar{B} = (\Delta A)^{-1}\left(\exp(\Delta A) - I\right)\Delta B$ for step size $\Delta$.
- Global Convolutional Form:
  The sequence output is $y = x * \bar{K}$, where the kernel

  $$\bar{K} = \left(C\bar{B},\; C\bar{A}\bar{B},\; \dots,\; C\bar{A}^{L-1}\bar{B}\right)$$

  and $L$ is the sequence length, enabling global, linear-time context integration.
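As a concrete illustration of these formulas, the following NumPy sketch (illustrative code under the stated assumptions, not drawn from the cited implementations) applies the zero-order-hold discretization for a diagonal $A$, runs the sequential recurrence, builds the global convolution kernel $\bar{K}$, and checks that the two forms agree:

```python
# Minimal sketch: discretized SSM recurrence h_t = Abar*h_{t-1} + Bbar*x_t, y_t = C*h_t,
# and the equivalent global convolution y = x * Kbar with Kbar[k] = C Abar^k Bbar.
# A is taken diagonal (as in S4D/Mamba), so matrix exponentials are elementwise.
import numpy as np

def discretize(A_diag, B, delta):
    """Zero-order hold: Abar = exp(delta*A), Bbar = A^{-1}(exp(delta*A) - 1) * B (elementwise)."""
    Abar = np.exp(delta * A_diag)
    Bbar = (Abar - 1.0) / A_diag * B
    return Abar, Bbar

def ssm_recurrent(x, A_diag, B, C, delta):
    """Sequential scan over a length-L scalar input x."""
    Abar, Bbar = discretize(A_diag, B, delta)
    h = np.zeros_like(A_diag)
    y = np.empty_like(x)
    for t in range(len(x)):
        h = Abar * h + Bbar * x[t]      # state update
        y[t] = (C * h).sum()            # readout
    return y

def ssm_convolutional(x, A_diag, B, C, delta):
    """Equivalent global convolution with kernel Kbar[k] = C Abar^k Bbar."""
    L = len(x)
    Abar, Bbar = discretize(A_diag, B, delta)
    K = np.array([(C * Abar**k * Bbar).sum() for k in range(L)])
    # causal convolution: y_t = sum_{k<=t} K[k] * x[t-k]
    return np.array([np.dot(K[:t + 1], x[t::-1]) for t in range(L)])

L, N = 16, 4                                  # sequence length, state size
rng = np.random.default_rng(0)
A_diag = -np.abs(rng.standard_normal(N))      # stable (negative) diagonal A
B, C = rng.standard_normal(N), rng.standard_normal(N)
x, delta = rng.standard_normal(L), 0.1
assert np.allclose(ssm_recurrent(x, A_diag, B, C, delta),
                   ssm_convolutional(x, A_diag, B, C, delta))
```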
A pivotal innovation is the selection mechanism: the projection matrices $B$, $C$ and the step size $\Delta$ are made input-dependent via small feed-forward networks, allowing per-token adaptivity and breaking the limitations of time-invariant recurrence. This mechanism enables the SSM blocks to react dynamically to the evolving content of long video sequences, crucial for modeling non-stationary patterns and rapid temporal dynamics (Zhang et al., 24 Apr 2024, Park et al., 11 Jul 2024).
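A minimal PyTorch sketch of the selection mechanism follows; module and projection names here are illustrative assumptions, and the sequential loop stands in for the fused parallel scan used in practice:

```python
# Hedged sketch of selective parameters: Delta, B, C are produced per token by small
# linear projections of the input, so the recurrence becomes time-varying.
import torch
import torch.nn as nn

class SelectiveSSM(nn.Module):
    def __init__(self, d_model: int, d_state: int = 16):
        super().__init__()
        self.A_log = nn.Parameter(torch.log(torch.arange(1, d_state + 1).float())
                                  .repeat(d_model, 1))           # (d, N); A = -exp(A_log)
        self.to_delta = nn.Linear(d_model, d_model)               # per-token step size
        self.to_B = nn.Linear(d_model, d_state)                   # per-token input matrix
        self.to_C = nn.Linear(d_model, d_state)                   # per-token readout

    def forward(self, x):                                         # x: (batch, L, d)
        b, L, d = x.shape
        A = -torch.exp(self.A_log)                                 # (d, N), stable
        delta = torch.nn.functional.softplus(self.to_delta(x))    # (b, L, d), positive
        B, C = self.to_B(x), self.to_C(x)                          # (b, L, N)
        h = x.new_zeros(b, d, A.shape[-1])                         # (b, d, N) hidden state
        ys = []
        for t in range(L):                                         # sequential reference scan
            Abar = torch.exp(delta[:, t, :, None] * A)             # ZOH for A
            Bbar = delta[:, t, :, None] * B[:, t, None, :]         # simplified (Euler) B, as in Mamba
            h = Abar * h + Bbar * x[:, t, :, None]                 # state update
            ys.append(torch.einsum('bdn,bn->bd', h, C[:, t]))      # readout
        return torch.stack(ys, dim=1)                              # (b, L, d)

y = SelectiveSSM(d_model=64)(torch.randn(2, 32, 64))               # -> (2, 32, 64)
```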
2. Hardware-Aware and Efficient Implementation
The Video Mamba Suite exploits a hardware-conscious design:
- Fused Selective Scan: By recasting the tokenwise recurrence into a parallel prefix/suffix scan, multiple operations (gating, state update, convolution, nonlinearity) are fused in a single GPU kernel.
- Memory Blocking: The input is tiled so each block update fits into fast on-chip memory; intermediate activations can be recomputed to minimize memory footprint.
- No Softmax, Only 1D Convolutions: Unlike self-attention, no quadratic softmax attention map must be materialized; most heavy computation is handled via pointwise operations and 1D convolutions.
- Complexity: For $L$ tokens and hidden size $d$, time and memory scale linearly as $O(L \cdot d)$ (up to the small state-size factor $N$). This enables application to streams with thousands of video frames, unattainable for quadratic-complexity models (Zhang et al., 24 Apr 2024, Chen et al., 14 Mar 2024). The blocked-scan idea is sketched below.
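The memory-blocking idea can be illustrated as a chunked scan in which only the small recurrent state crosses tile boundaries; this is a schematic PyTorch illustration of the general technique, not the Suite's fused CUDA kernel:

```python
# Illustrative chunked (tiled) scan: the sequence is processed tile by tile, carrying only
# the hidden state across tiles, so peak activation memory is set by the tile length
# rather than the full sequence length. In the fused kernel, the inner loop lives on-chip.
import torch

def chunked_linear_scan(x, Abar, Bbar, chunk: int = 256):
    """x, Abar, Bbar: (L, d). Computes h_t = Abar_t * h_{t-1} + Bbar_t * x_t tile by tile."""
    L, d = x.shape
    h = x.new_zeros(d)
    out = []
    for start in range(0, L, chunk):
        xs = x[start:start + chunk]
        As = Abar[start:start + chunk]
        Bs = Bbar[start:start + chunk]
        hs = []
        for t in range(xs.shape[0]):        # inside one tile (fused in a single GPU kernel)
            h = As[t] * h + Bs[t] * xs[t]
            hs.append(h)
        out.append(torch.stack(hs))         # only the running state h crosses tile boundaries
    return torch.cat(out)                   # (L, d)

L, d = 4096, 64
x = torch.randn(L, d)
Abar = torch.rand(L, d) * 0.9               # contraction factors in (0, 0.9)
Bbar = torch.randn(L, d)
states = chunked_linear_scan(x, Abar, Bbar)
```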
3. Video-Specific Mamba Variants
The Suite includes several tailored modules for the video domain:
- Decomposed Bi-Directional Mamba (DBM) Block: Extends 1D ViM blocks to 3D by scanning along both the spatial and temporal axes, employing shared SSM parameters, separate input projections, and bidirectional passes (a forward and a backward scan), fused via gating and a SiLU nonlinearity (see the schematic sketch after this list).
- VideoMamba: Employs 3D bi-directional SSM scans interleaved with convolutions, combined with self-distillation for scalability.
- VMRNN: Hybridizes ConvLSTM with SSM layers for sequence modeling, replacing traditional convolutional cells with VSS-LSTM.
- Hybrid Attention-SSM: Mamba blocks can be combined with local/dilated self-attention (as in SSM-ViT), leveraging both mechanisms for spatial and temporal fusion (Chen et al., 14 Mar 2024, Zhang et al., 24 Apr 2024, Park et al., 11 Jul 2024).
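The decomposition behind the DBM block can be sketched schematically as follows; the projection names, gating form, and placeholder `ssm_core` are assumptions for illustration rather than the published block definition:

```python
# Schematic decomposed bi-directional scan over video tokens (illustrative only).
import torch
import torch.nn as nn

class DecomposedBiDirScan(nn.Module):
    def __init__(self, dim: int, ssm_core: nn.Module):
        super().__init__()
        self.ssm = ssm_core                       # shared SSM parameters for all scans
        self.proj_spatial = nn.Linear(dim, dim)   # separate input projections
        self.proj_temporal = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, dim)

    def _bidir(self, seq):                        # seq: (batch, length, dim)
        fwd = self.ssm(seq)
        bwd = self.ssm(seq.flip(1)).flip(1)       # backward pass via sequence reversal
        return fwd + bwd

    def forward(self, x):                         # x: (B, T, H, W, dim) video tokens
        B, T, H, W, d = x.shape
        # temporal scan: one sequence of length T per spatial location
        xt = self.proj_temporal(x).permute(0, 2, 3, 1, 4).reshape(B * H * W, T, d)
        yt = self._bidir(xt).reshape(B, H, W, T, d).permute(0, 3, 1, 2, 4)
        # spatial scan: one sequence of length H*W per frame
        xs = self.proj_spatial(x).reshape(B * T, H * W, d)
        ys = self._bidir(xs).reshape(B, T, H, W, d)
        # gated fusion with SiLU, as described in the text
        return torch.nn.functional.silu(self.gate(x)) * (yt + ys)

# usage with a trivial stand-in SSM; plug in a real Mamba-style block here
block = DecomposedBiDirScan(dim=64, ssm_core=nn.Identity())
out = block(torch.randn(2, 8, 14, 14, 64))        # -> (2, 8, 14, 14, 64)
```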
4. Complexity and Empirical Efficiency
The computational advantage of Mamba-based modules is pronounced:
- Comparison to Self-Attention:
- Vanilla attention: $O(L^2 \cdot d)$ time and memory
- Mamba: $O(L \cdot d \cdot N)$, with state size $N$ typically much smaller than $d$ (e.g., $N = 16$); a back-of-envelope comparison follows after this list
- Empirical Metrics:
- On long inputs (e.g., 64-frame clips), SSM runtime is substantially lower and peak GPU memory is roughly $4$–$8$ GB, compared to roughly $20$ GB for self-attention.
- Models such as VideoMamba-Ti, -S, and -M span 7–83 GFLOPs and 25–120M parameters, outperforming or matching ViT and TimeSformer at lower resource budgets.
- Throughput: VideoMamba-Ti processes clips roughly 4× longer per second than X3D-M, at a smaller memory footprint (Zhang et al., 24 Apr 2024, Chen et al., 14 Mar 2024, Li et al., 11 Mar 2024, Park et al., 11 Jul 2024).
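To make the asymptotic comparison above concrete, here is a back-of-envelope token-mixing cost for a hypothetical 64-frame clip with 14×14 patch tokens. The numbers are illustrative, count only the mixing term (no projections or MLPs), and are not figures from the cited papers:

```python
# Back-of-envelope comparison of token-mixing cost (illustrative numbers only).
L = 64 * 14 * 14          # spatio-temporal tokens: 64 frames x 14x14 patches
d = 768                   # hidden size
N = 16                    # SSM state size

attention_flops = 2 * L * L * d          # QK^T and attention-weighted V: O(L^2 d)
mamba_flops = 2 * L * d * N              # selective-scan update and readout: O(L d N)

print(f"L = {L}")
print(f"attention token mixing : {attention_flops / 1e9:.1f} GFLOPs")
print(f"mamba token mixing     : {mamba_flops / 1e9:.2f} GFLOPs")
print(f"ratio                  : {attention_flops / mamba_flops:.0f}x (= L / N)")
```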
5. Empirical Results Across Video Understanding Benchmarks
The Video Mamba Suite has been evaluated on a variety of tasks and datasets:
- Action Recognition (Kinetics-400, Something-Something V2):
- VideoMamba-S (28 GFLOPs): 75.3% top-1 on K400 (vs. 74.1% for ViT-Base at 27 GFLOPs)
- DBM-L: 66.8% on Sth-Sth V2 (vs. 65.4% for TimeSformer), halving memory usage
- Throughput and Scalability: Mamba models process substantially longer sequences per second than comparable baselines.
- Broader Applications: Mamba blocks have also been incorporated as adapters in video-language pipelines, as shown for multi-modal video understanding (via H-MBA) and in fine-grained medical video tasks (see Vivim) (Zhang et al., 24 Apr 2024, Chen et al., 14 Mar 2024, Yang et al., 25 Jan 2024, Chen et al., 8 Jan 2025).
Table: Representative Empirical Results
| Model | K400 Top-1 (%) | SthSthV2 Top-1 (%) | Memory | GFLOPs | Params (M) |
|---|---|---|---|---|---|
| VideoMamba-S | 75.3 | – | 4–8 GB | 28 | 60 |
| ViT-Base | 74.1 | – | ~20 GB | 27 | – |
| DBM-L (Video Mamba) | – | 66.8 | ~50% of TimeSformer | – | – |
| TimeSformer | – | 65.4 | ~2× DBM-L | – | – |
6. Applications and Task Spectrum
The Video Mamba Suite supports a broad spectrum of video processing tasks:
- General Visual Tasks: Action recognition, object detection, video classification, segmentation (2D/3D), image restoration, generation, and super-resolution.
- Medical and Scientific Video: Used in segmentation and registration for ultrasound, endoscopy, and other domain-specific datasets (see Vivim).
- Remote Sensing and Surveillance: Handles high-resolution and long-context input efficiently.
- Multi-Modal and Language Interaction: Serves as temporal or fusion backbone in video-language pipelines, e.g., as context and query adapters in H-MBA for autonomous driving scenarios (a generic adapter sketch follows this list).
- Dense Prediction and Video Diffusion: Owing to linear complexity and global context, SSM decoders are natural drop-in replacements for self-attention in video diffusion/generation models (Zhang et al., 24 Apr 2024, Chen et al., 14 Mar 2024, Yang et al., 25 Jan 2024, Chen et al., 8 Jan 2025).
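As a generic illustration of the adapter-style usage mentioned above (an assumed pattern, not the actual H-MBA or Video Mamba Suite code), an SSM block can be wrapped in a low-rank bottleneck and applied along the temporal axis of frozen per-frame features:

```python
# Generic temporal SSM adapter sketch for a frozen per-frame encoder (illustrative only).
import torch
import torch.nn as nn

class TemporalSSMAdapter(nn.Module):
    def __init__(self, dim: int, ssm_block: nn.Module, bottleneck: int = 128):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)    # low-rank bottleneck keeps the adapter cheap
        self.ssm = ssm_block                      # any Mamba-style temporal mixer
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)            # zero-init so the adapter starts as identity
        nn.init.zeros_(self.up.bias)

    def forward(self, tokens):                    # tokens: (batch*space, T, dim) frame features
        return tokens + self.up(self.ssm(self.down(tokens)))

# usage: features from a frozen per-frame encoder, mixed along time only
frames = torch.randn(4 * 196, 8, 768)             # (batch*patches, T=8 frames, dim)
adapter = TemporalSSMAdapter(dim=768, ssm_block=nn.Identity())
fused = adapter(frames)                           # same shape; temporal mixing with a real SSM core
```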
7. Challenges and Open Research Directions
Key unresolved issues include:
- Large-scale pre-training paradigms: How to design masked or predictive SSM training objectives for videos (e.g., masked token, frame prediction).
- Hardware kernel engineering: Efficient implementation of n-dimensional SSM scans customized for GPU/TPU architectures.
- Spatio-Temporal Fusion: Best practices for dynamically mixing SSM-based spatial and temporal features.
- Robustness and Interpretability: Analysis of selection gate dynamics and their focus within long, non-stationary videos.
- Multi-modality: Integrating audio, optical flow, and text streams into unified SSM frameworks for video-language and multi-modal understanding.
- Adaptive Depth and Structure: Dynamically modifying Suite block configurations per video complexity.
- Video Diffusion Architecture: Further study of SSM-based decoders for video generative models in place of transformer blocks.
A plausible implication is that continued research on selection mechanisms, kernel design, and combined SSM-attention hybrids may further enhance both efficiency and accuracy in future video foundation models (Zhang et al., 24 Apr 2024).
References:
- Chen et al., 14 Mar 2024
- Zhang et al., 24 Apr 2024
- Park et al., 11 Jul 2024
- Li et al., 11 Mar 2024
- Yang et al., 25 Jan 2024
- Chen et al., 8 Jan 2025