
Video Mamba: Efficient SSM Video Models

Updated 14 November 2025
  • Video Mamba models are defined by SSM-based architectures that harness selective state-space modeling to replace or augment self-attention, enabling linear complexity in processing long video sequences.
  • They employ bidirectional scans and hybrid designs to process spatial, temporal, and spatio-temporal video tokens, improving efficiency without compromising accuracy.
  • Empirical results demonstrate competitive performance in tasks like recognition, generation, and restoration, with significant throughput gains over transformer-based approaches.

Video Mamba refers to a class of video models leveraging structured state-space models (SSMs), specifically the Mamba selective state-space architecture, as a scalable alternative to self-attention for spatio-temporal video modeling. These models provide linear complexity in the number of video tokens and are increasingly adopted across diverse tasks in video understanding, generation, restoration, tokenization, and multi-modal integration, often matching or surpassing transformer-based baselines in efficiency and effectiveness.

1. Core Principles: Mamba State-Space Foundation in Video Modeling

The foundation of all “Video Mamba” models is the Mamba SSM layer, which replaces or augments attention mechanisms for sequence modeling. The continuous-time SSM is governed by the system

$h'(t) = A h(t) + B x(t), \qquad y(t) = C h(t) + D x(t),$

where $h(t)$ is the hidden state, $x(t)$ is the input, and $A$, $B$, $C$, $D$ are learned matrices or are dynamically generated from the inputs. Discretization (zero-order hold with time step $\Delta$) yields

$\overline{A} = e^{\Delta A}, \qquad \overline{B} = (\Delta A)^{-1}\left(e^{\Delta A} - I\right) \Delta B,$

with the recurrence

$h_k = \overline{A} h_{k-1} + \overline{B} x_k, \qquad y_k = C h_k + D x_k.$
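To make the discretized recurrence concrete, the following is a minimal PyTorch sketch, assuming a diagonal $A$ and scalar token inputs for readability; it is not the fused selective-scan kernel used by real Mamba implementations, and all shapes and values are illustrative.

```python
# Minimal sketch: zero-order-hold discretization of a diagonal SSM followed by
# the linear recurrence h_k = A_bar h_{k-1} + B_bar x_k (illustrative only).
import torch

def zoh_discretize(A_diag, B, delta):
    """A_bar = exp(Delta*A); B_bar = (Delta*A)^{-1}(exp(Delta*A) - I) * Delta*B."""
    A_bar = torch.exp(delta * A_diag)
    B_bar = (A_bar - 1.0) / A_diag * B   # simplification valid for diagonal A
    return A_bar, B_bar

def ssm_scan(x, A_diag, B, C, D, delta):
    """Run the recurrence over a 1-D sequence x of scalar tokens."""
    A_bar, B_bar = zoh_discretize(A_diag, B, delta)
    h, ys = torch.zeros_like(A_diag), []
    for x_k in x:
        h = A_bar * h + B_bar * x_k          # h_k = A_bar h_{k-1} + B_bar x_k
        ys.append((C * h).sum() + D * x_k)   # y_k = C h_k + D x_k
    return torch.stack(ys)

A_diag = -torch.rand(8) - 0.5                # stable (negative) diagonal A
B, C, D = torch.randn(8), torch.randn(8), torch.tensor(1.0)
y = ssm_scan(torch.randn(64), A_diag, B, C, D, delta=torch.tensor(0.1))
print(y.shape)                               # torch.Size([64])
```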

Mamba innovates by making SSM parameters input-dependent (“selective”) through small neural networks and deploying bidirectional scans (forward and backward passes) for richer feature extraction. This SSM “selective scan” operates with linear time and memory in sequence length, a property critical when modeling videos with tens of thousands of patches/frames (Li et al., 11 Mar 2024, Park et al., 11 Jul 2024).
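The selectivity and the bidirectional scan can be sketched as follows; the linear layers producing $\Delta$, $B$, $C$, the Euler-style simplification of $\overline{B}$, and all shapes are assumptions for illustration rather than the reference implementation, and the Python loop stands in for the fused parallel scan.

```python
# Hedged sketch of a "selective" bidirectional SSM block: Delta, B, C are
# produced per token by small linear layers, and the token sequence is scanned
# both forward and backward. Hyperparameters and layer names are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveSSM(nn.Module):
    def __init__(self, d_model: int, d_state: int = 16):
        super().__init__()
        self.A_log = nn.Parameter(torch.log(torch.arange(1, d_state + 1).float()))
        self.D = nn.Parameter(torch.ones(d_model))
        self.to_delta = nn.Linear(d_model, d_model)  # input-dependent step size
        self.to_B = nn.Linear(d_model, d_state)      # input-dependent B_k
        self.to_C = nn.Linear(d_model, d_state)      # input-dependent C_k

    def scan(self, x):                               # x: [batch, T, d_model]
        A = -torch.exp(self.A_log)                   # negative => stable dynamics
        delta = F.softplus(self.to_delta(x))         # [B, T, d_model]
        Bk, Ck = self.to_B(x), self.to_C(x)          # [B, T, d_state]
        A_bar = torch.exp(delta.unsqueeze(-1) * A)   # [B, T, d_model, d_state]
        B_bar = delta.unsqueeze(-1) * Bk.unsqueeze(2)  # Euler-style approximation
        h = x.new_zeros(x.shape[0], x.shape[2], A.shape[0])
        ys = []
        for t in range(x.shape[1]):                  # linear-time sequential scan
            h = A_bar[:, t] * h + B_bar[:, t] * x[:, t].unsqueeze(-1)
            ys.append((h * Ck[:, t].unsqueeze(1)).sum(-1) + self.D * x[:, t])
        return torch.stack(ys, dim=1)

    def forward(self, x):
        # bidirectional: forward scan plus a scan over the time-reversed sequence
        return self.scan(x) + self.scan(x.flip(1)).flip(1)

tokens = torch.randn(2, 128, 64)                     # [batch, tokens, channels]
print(SelectiveSSM(64)(tokens).shape)                # torch.Size([2, 128, 64])
```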

2. Video Mamba Architectures: Variants and Block Compositions

Across the literature, “Video Mamba” does not denote a single architecture but rather a family of designs in which SSM blocks are applied to spatial, temporal, or spatio-temporal orderings of video tokens. Representative instantiations include:

  • Pure SSM Backbones: E.g., VideoMamba (Li et al., 11 Mar 2024, Park et al., 11 Jul 2024) employs a ViT-style patch embedding and a stack of bidirectional Mamba blocks scanning a flattened spatio-temporal sequence.
  • Hybrid Attention-SSM Models: “Matten” (Gao et al., 5 May 2024) interleaves spatial and temporal self-attention for local context with global, bidirectional Mamba scans in each U-Net diffusion block, achieving a blend of fine-grained and global modeling.
  • Hierarchical/Multiscale Designs: Models such as Vivim (Yang et al., 25 Jan 2024) and MambaOVSR (Chang et al., 9 Nov 2025) use multi-stage encoders/decoders, injecting Mamba-based SSM blocks at multiple scales across feature hierarchies.
  • Spatio-Temporal SSM Innovations: VideoMamba (Park et al., 11 Jul 2024) introduces separate forward and backward SSMs with carefully designed flattening and reversal, capturing both sequential-temporal and non-sequential-spatial dependencies.

Typical pipeline: after patch/token embedding (often via 3D convolution with patch sizes such as $1\times16\times16$), tokens receive positional embeddings and are processed through stacks of SSM blocks, sometimes alternated with or augmented by attention or convolutional modules, depending on the task and architecture.
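As an illustration of this pipeline, here is a hedged PyTorch sketch of a plain SSM video backbone; the `SelectiveSSM` token mixer is the block sketched in Section 1, and the patch size, depth, width, and classification head are illustrative assumptions rather than any published configuration.

```python
# Sketch of a plain SSM video backbone: 3D-conv tubelet embedding (1x16x16),
# learned positional embeddings, a residual stack of bidirectional SSM blocks
# over the flattened spatio-temporal token sequence, then mean-pool + classify.
# Assumes the SelectiveSSM module from the earlier sketch is in scope.
import torch
import torch.nn as nn

class VideoSSMBackbone(nn.Module):
    def __init__(self, d_model=192, depth=4, frames=8, size=112, patch=16, n_classes=400):
        super().__init__()
        self.embed = nn.Conv3d(3, d_model, kernel_size=(1, patch, patch),
                               stride=(1, patch, patch))       # 1 x 16 x 16 tubelets
        n_tokens = frames * (size // patch) ** 2
        self.pos = nn.Parameter(torch.zeros(1, n_tokens, d_model))
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.LayerNorm(d_model), SelectiveSSM(d_model))
            for _ in range(depth)
        )
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, video):                   # video: [B, 3, T, H, W]
        x = self.embed(video)                   # [B, d_model, T, H/16, W/16]
        x = x.flatten(2).transpose(1, 2)        # flatten to [B, n_tokens, d_model]
        x = x + self.pos
        for blk in self.blocks:
            x = x + blk(x)                      # residual bidirectional SSM blocks
        return self.head(x.mean(dim=1))         # pooled clip-level prediction

clip = torch.randn(1, 3, 8, 112, 112)           # small toy clip: 8 frames, 112x112
print(VideoSSMBackbone()(clip).shape)           # torch.Size([1, 400])
```

In practice the Python loop inside the SSM block is replaced by a fused CUDA selective-scan kernel, and class tokens or hierarchical stages may be added; the sketch only shows how video tokens flow through the stack.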

3. Complexity Analysis and Scaling Properties

A defining feature of Video Mamba models is linear complexity. For sequence length $n$ and hidden dimension $d$, the SSM block cost is $O(n d^2)$ (with practical variants like $O(n d)$ under diagonal or low-rank parameterizations), compared to $O(n^2 d)$ for self-attention (Park et al., 11 Jul 2024):

Layer Type     | Complexity per Layer                             | Notes
Self-Attention | $O(n^2 d) + O(n d^2)$                            | Quadratic in sequence length $n$
Mamba SSM      | $O(n d^2)$ (basic); $O(n d)$ (diagonal/low-rank) | Linear in $n$, efficient for large $n$

This linear scaling enables Video Mamba models to process longer sequences (hundreds of frames, $>10$k tokens) on moderate compute budgets, outperforming Transformer-based analogues in resource-constrained scenarios. For example, Matten-Variant 3 achieves a 25% FLOPs reduction over the Latte transformer in high-resolution video generation (Gao et al., 5 May 2024), and VideoMamba exhibits 4–8× higher throughput than VideoSwin or TimeSformer on long/high-resolution clips (Park et al., 11 Jul 2024).
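The asymptotic gap can be illustrated with a back-of-the-envelope calculation; the cost formulas below are just the $O(n^2 d)$ and $O(n d^2)$ terms quoted above with all constant factors dropped, so the printed ratios indicate scaling behaviour rather than measured FLOPs of any implementation.

```python
# Rough comparison of per-layer costs: O(n^2 d) for self-attention vs
# O(n d^2) for an SSM block, with constant factors ignored.
def attention_cost(n: int, d: int) -> int:
    return n * n * d + n * d * d     # QK^T / AV terms plus projections

def ssm_cost(n: int, d: int) -> int:
    return n * d * d                 # linear in sequence length n

d = 768
for n in (1_000, 10_000, 100_000):   # e.g. long clips flattened to >10k tokens
    ratio = attention_cost(n, d) / ssm_cost(n, d)
    print(f"n={n:>7}: attention/SSM cost ratio ~ {ratio:.1f}x")
# n=   1000: attention/SSM cost ratio ~ 2.3x
# n=  10000: attention/SSM cost ratio ~ 14.0x
# n= 100000: attention/SSM cost ratio ~ 131.2x
```

Measured speedups additionally depend on kernel implementations, memory traffic, and hardware, which is why reported throughput gains (e.g., the 4–8× above) differ from these raw ratios.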

4. Empirical Results and Benchmarking

Video Mamba models have been extensively evaluated across recognition, generation, restoration, and segmentation:

  • Video Understanding: On Kinetics-400, VideoMamba achieves 76.1% top-1 accuracy (32 frames, IN-1K pretraining), outperforming comparably sized attention models (Park et al., 11 Jul 2024). Self-distillation and masked modeling further improve results to 83.3% at $64\times384$ input (Li et al., 11 Mar 2024). On temporally demanding benchmarks such as Something-Something V2 (SSv2), Mamba variants consistently close the performance gap to state-of-the-art ViT/Transformer architectures.
  • Video Generation: Matten (Gao et al., 5 May 2024) achieves competitive or superior Fréchet Video Distance (FVD) to StyleGAN-V, Latte, and prior SSMs on FaceForensics (FVD=45.01), SkyTimelapse (53.56), UCF101 (210.61), and Taichi-HD (158.56), with convincing qualitative sharpness and motion realism.
  • Efficiency: MVQA (Mi et al., 22 Apr 2025) attains SROCC 0.882 on LSVQ_test, with 0.028 s runtime and roughly $1/5$ of the GPU memory of the prior best FAST-VQA transformer, confirming the efficiency advantage in VQA settings.
  • Ablations: Across many benchmarks, ablation studies show that interleaving attention and Mamba outperforms pure attention or pure SSM designs (Gao et al., 5 May 2024), and that multi-scale SSMs and the choice of Mamba block type (bidirectional, direction-separated) have a significant effect on both efficiency and task accuracy (Chen et al., 14 Mar 2024, Park et al., 11 Jul 2024).

5. Advances, Extensions, and Design Innovations

Recent research has introduced several enhancements and application-driven variants:

  • Hybrid SSM-Transformer Backbones: VideoMAP (Liu et al., 16 Mar 2025) alternates four Mamba layers with one Transformer layer to balance global context with efficiency, scaling smoothly to 300M parameters with 88.3% Top-1 on K400 and improved sample efficiency over pure SSM or Transformer approaches (see the interleaving sketch after this list).
  • Hierarchical and Multiscale SSMs: Modules like the Multiscale Alternating Scanning Mechanism (MASM) in MambaOVSR (Chang et al., 9 Nov 2025), and the Temporal Mamba Block in Vivim (Yang et al., 25 Jan 2024), extend receptive fields and enable artifact-free long-range modeling for restoration and segmentation.
  • Plug-and-Play Adaptation: In multimodal contexts, H-MBA (Chen et al., 8 Jan 2025) acts as a trainable SSM video adapter for MLLMs, providing a $+5.5\%$ mIoU improvement over LCP for autonomous-driving risk detection while keeping the ViT and LLM weights frozen.
  • Fusion of SSM and Attention: Matten (Gao et al., 5 May 2024), MambaSCI (Pan et al., 18 Oct 2024), and various Video Mamba Suite modules (Chen et al., 14 Mar 2024) combine attention and SSM in blockwise or adapter configurations, achieving both local sensitivity and global recall.
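As a sketch of the 4:1 interleaving idea, the following PyTorch snippet builds a stack in which every fifth token-mixing layer is multi-head self-attention and the remaining layers are SSM blocks; the `SSMBlock` placeholder, depth, and widths are illustrative assumptions rather than the VideoMAP architecture.

```python
# Hedged sketch of a 4:1 hybrid SSM/attention stack: four SSM blocks per
# attention block, all operating on [batch, tokens, channels] sequences.
import torch
import torch.nn as nn

class SSMBlock(nn.Module):
    """Placeholder for a bidirectional Mamba block (see the Section 1 sketch)."""
    def __init__(self, d_model):
        super().__init__()
        self.norm, self.mix = nn.LayerNorm(d_model), nn.Linear(d_model, d_model)
    def forward(self, x):
        return x + self.mix(self.norm(x))

class AttnBlock(nn.Module):
    def __init__(self, d_model, n_heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
    def forward(self, x):
        h = self.norm(x)
        return x + self.attn(h, h, h, need_weights=False)[0]

def hybrid_stack(d_model=192, depth=20, ssm_per_attn=4):
    """Interleave ssm_per_attn SSM blocks before each attention block."""
    layers = [AttnBlock(d_model) if (i + 1) % (ssm_per_attn + 1) == 0
              else SSMBlock(d_model)
              for i in range(depth)]
    return nn.Sequential(*layers)

x = torch.randn(2, 256, 192)          # [batch, tokens, channels]
print(hybrid_stack()(x).shape)        # torch.Size([2, 256, 192])
```

The occasional attention layers supply global token mixing that a purely recurrent scan may miss, while the SSM layers keep the overall cost close to linear in sequence length.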

6. Applications Across Video Understanding, Generation, and Beyond

Video Mamba models have been deployed for:

  • Video Recognition/Classification: Action recognition on Kinetics, SSv2, HMDB51, event retrieval, and cross-modal retrieval tasks, consistently yielding competitive accuracy with superior scaling properties (Li et al., 11 Mar 2024, Park et al., 11 Jul 2024, Chen et al., 14 Mar 2024).
  • Video Generation: Matten implements a hybrid Mamba-attention latent diffusion U-Net for high-quality sample generation; M4V (Huang et al., 12 Jun 2025) uses Mamba SSMs for efficient text-to-video diffusion with a 45% FLOPs reduction versus attention-based alternatives.
  • Restoration, VQA, and Compression: VSRM (Tran et al., 28 Jun 2025), Vivim (Yang et al., 25 Jan 2024), and MambaSCI (Pan et al., 18 Oct 2024) target super-resolution, medical segmentation, and compressive video imaging, leveraging linear SSM scans to extend receptive fields, preserve high-frequency details, or operate under Bayer/quad-Bayer constraints.
  • Object Detection and Anomaly Detection: MAMBA (Sun et al., 18 Jan 2024) for long-context aggregation via large memory banks and lightweight attention; STNMamba (Li et al., 28 Dec 2024) for spatio-temporal anomaly detection with dedicated SSM fusions and memory-driven prototypical regularization.

A plausible implication is that SSM-based architectures, when properly adapted for video, can overcome the quadratic bottlenecks of self-attention across tasks with long-range dependencies, while maintaining or increasing accuracy.

7. Limitations, Open Problems, and Future Directions

Current limitations and opportunities highlighted in the literature:

  • Spatial Inductive Bias: Pure SSM designs may underperform hybrid CNN/Transformer backbones on purely spatial tasks; further hybridization or spatial-specific tokenization may be required (Chen et al., 14 Mar 2024).
  • Global Context and Overfitting: Overfitting in pure SSMs at large scale can be addressed via hybrid SSM/Transformer backbones (e.g., the 4:1 ratio in VideoMAP (Liu et al., 16 Mar 2025)).
  • Scaling to Extremely Long Sequences: VideoMamba is well positioned for modeling very long videos ($T \sim 1000$ frames), but stability and generalization in such regimes merit further investigation (Li et al., 11 Mar 2024).
  • Self-Supervised and Multimodal Pretraining: Self-distillation and masked modeling techniques enable strong results without extensive video pretraining, but larger, multi-task, and multi-modal pretraining pipelines are areas of ongoing research (Li et al., 11 Mar 2024, Chen et al., 14 Mar 2024).
  • Extending to Diverse Modalities: Mamba-based models demonstrate promising multimodal compatibility (video-language retrieval, captioning, autonomous driving, music-to-dance synthesis), suggesting potential for video–text–audio–sensor fusion architectures (Chen et al., 8 Jan 2025, Tang et al., 9 Jul 2025).

In sum, the Video Mamba paradigm encompasses a broad class of SSM-based architectures that deliver efficient, high-capacity spatio-temporal modeling for long-form and high-resolution video tasks, with lines of active research in hybridization, scalability, and cross-modal integration.
