
Temporal Mamba Block for Temporal Modeling

Updated 15 December 2025
  • Temporal Mamba Block is a hardware-efficient state-space module designed for temporal sequence modeling that integrates local dynamic enhancement with long-range dependency capture.
  • It employs bidirectional scanning, multi-scale sampling, and input-dependent parameter adaptation to maintain linear computational complexity while processing spatio-temporal data.
  • Empirical results show that its architecture improves performance in video, medical imaging, and time-series forecasting by reducing errors and enabling robust, scalable modeling.

The Temporal Mamba Block is a versatile, state-space-based architectural component for temporal sequence modeling that unifies local dynamic enhancement, long-range dependency capture, and hardware-efficient scan paradigms. Designed around the Mamba selective state-space model family, the block has proliferated across computer vision, time-series analysis, video understanding, medical imaging, and physiological signal processing applications. Its formal definition centers on input-dependent, continuous-to-discrete state-space recurrences and bidirectional scanning within multi-scale, spatial-temporal tensor ecosystems.

1. Core Mathematical Structure and Model Formulation

The Temporal Mamba Block operates on sequences or spatio-temporal volumes by applying a continuous-time linear state-space model that is subsequently discretized per input (or per spatio-temporal coordinate):

$$h'(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t)$$

with discretization (typically via zero-order hold):

$$\bar{A} = \exp(\Delta A), \qquad \bar{B} = (\Delta A)^{-1}\bigl(\exp(\Delta A) - I\bigr)\,\Delta B$$

and recurrent updates per time step:

$$h_t = \bar{A}\,h_{t-1} + \bar{B}\,x_t, \qquad y_t = C\,h_t$$

All parameters ($A$, $B$, $C$, $\Delta$) are subject to dynamic adaptation depending on the current input token or spatio-temporal feature, typically via learned linear projections.
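A minimal PyTorch sketch of this recurrence is given below. It assumes a diagonal state matrix $A$ and uses the common Mamba-style simplification $\bar{B} \approx \Delta B$ in place of the full zero-order-hold expression; the function name `selective_ssm_scan` and the tensor layout are illustrative, not taken from any of the cited implementations.

```python
import torch

def selective_ssm_scan(x, A, B, C, delta):
    """Sequential reference implementation of the input-dependent SSM recurrence.

    x:     (batch, length, d_model)   input sequence
    A:     (d_model, d_state)         state matrix (diagonal simplification)
    B:     (batch, length, d_state)   input-dependent input projection
    C:     (batch, length, d_state)   input-dependent output projection
    delta: (batch, length, d_model)   input-dependent step size
    """
    batch, length, d_model = x.shape
    d_state = A.shape[-1]
    h = x.new_zeros(batch, d_model, d_state)
    ys = []
    for t in range(length):
        dt = delta[:, t].unsqueeze(-1)                  # (batch, d_model, 1)
        # Per-step discretisation: A_bar via ZOH, B_bar via the Euler shortcut.
        A_bar = torch.exp(dt * A)                       # (batch, d_model, d_state)
        B_bar = dt * B[:, t].unsqueeze(1)               # (batch, d_model, d_state)
        h = A_bar * h + B_bar * x[:, t].unsqueeze(-1)   # h_t = A_bar h_{t-1} + B_bar x_t
        y = (h * C[:, t].unsqueeze(1)).sum(-1)          # y_t = C h_t, per channel
        ys.append(y)
    return torch.stack(ys, dim=1)                       # (batch, length, d_model)
```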

Bidirectionality is integral in most Temporal Mamba Blocks: both forward and backward scans are performed, yielding outputs $y_\text{fwd}$ and $y_\text{bwd}$, which are typically summed or merged and gated.
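A sketch of such a bidirectional wrapper, reusing `selective_ssm_scan` from above; the class name, the per-direction parameterization, and the sigmoid-gated fusion are assumptions for illustration, since the exact merge (sum, gate, or concatenate-and-project) varies across the cited works.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BidirectionalSSM(nn.Module):
    """Runs independent forward and backward selective scans and gates their sum."""

    def __init__(self, d_model, d_state=16):
        super().__init__()
        # Separate parameter sets per direction (A stored in log space for stability).
        self.log_A_fwd = nn.Parameter(torch.randn(d_model, d_state))
        self.log_A_bwd = nn.Parameter(torch.randn(d_model, d_state))
        self.proj_fwd = nn.Linear(d_model, 2 * d_state + d_model)  # produces B, C, delta
        self.proj_bwd = nn.Linear(d_model, 2 * d_state + d_model)
        self.gate = nn.Linear(d_model, d_model)

    def _scan(self, x, log_A, proj):
        d_state = log_A.shape[-1]
        B, C, delta = proj(x).split([d_state, d_state, x.shape[-1]], dim=-1)
        return selective_ssm_scan(x, -torch.exp(log_A), B, C, F.softplus(delta))

    def forward(self, x):                               # x: (batch, length, d_model)
        y_fwd = self._scan(x, self.log_A_fwd, self.proj_fwd)
        y_bwd = self._scan(x.flip(1), self.log_A_bwd, self.proj_bwd).flip(1)
        # Fuse the two directions by summation, then apply an input-conditioned gate.
        return torch.sigmoid(self.gate(x)) * (y_fwd + y_bwd)
```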

Many implementations further introduce gating or blending mechanisms such as complementary “forget” gates, multi-stream fusion (e.g., SlowFast), FiLM-style conditional modulation, and dropout regularization (Luo et al., 18 Sep 2024, Karadag et al., 10 Apr 2025, Liang et al., 24 Apr 2024).

2. Local Dynamic Enhancement: Temporal Difference Convolution and Short-Range Modules

For fine-grained temporal dynamics, several variants introduce a local gradient-enhancement stage before state-space modeling. Notable examples:

  • Temporal Difference Convolution (TDC):

$$\mathrm{TDC}(f)(p_0) = \sum_{p_n \in \mathcal{R}} w(p_n)\, f(p_0 + p_n) + \theta \cdot \Bigl(-f(p_0) \sum_{p_n \in \mathcal{R}'} w(p_n)\Bigr)$$

where $\theta$ balances local smoothing and central differencing over the temporal neighborhood $\mathcal{R}'$ (PhysMamba uses $\theta = 0.5$) (Luo et al., 18 Sep 2024); a simplified implementation sketch appears at the end of this section.

  • Temporal Convolutional Module (TCM): 1D (dilated) convolution for short-range pattern extraction, often with growing dilation rate per block in multi-scale encoders (Sinha et al., 10 Jan 2025).

These local modules precede flattening and input-dependent SSM, enhancing the block’s ability to capture subtle, immediate transitions or artifacts in temporal data.
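A simplified sketch of the temporal-difference term, restricted to a 1D (batch, channels, time) layout for clarity; PhysMamba's TDC operates on 3D spatio-temporal kernels, so the module below should be read as an illustration of the formula above rather than the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalDifferenceConv1d(nn.Module):
    """1D temporal-difference convolution: a vanilla convolution plus a
    theta-weighted central-difference term, following the TDC formula."""

    def __init__(self, channels, kernel_size=3, theta=0.5):
        super().__init__()
        self.theta = theta
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=kernel_size // 2, bias=False)

    def forward(self, x):                              # x: (batch, channels, time)
        out = self.conv(x)                             # sum_{p_n in R} w(p_n) f(p_0 + p_n)
        if self.theta == 0:
            return out
        # Central-difference term: -f(p_0) * sum_{p_n in R'} w(p_n),
        # implemented as a 1x1 convolution with the summed kernel weights.
        kernel_sum = self.conv.weight.sum(dim=2, keepdim=True)   # (out_c, in_c, 1)
        out_center = F.conv1d(x, kernel_sum)
        return out - self.theta * out_center
```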

3. Bidirectional and Multi-Scale State-Space Scanning

Long-range temporal modeling is achieved through bidirectional processing (forward and backward scans), multi-scale parallelization, and multi-orientational traversals:

  • Bidirectional SSM: Each direction uses its own set of SSM parameters and has its scan reversed as appropriate; outputs are fused by summation or gating (Luo et al., 18 Sep 2024, Liang et al., 24 Apr 2024).
  • Multi-Scale Sampling: ms-Mamba introduces multiple parallel blocks at distinct sampling rates ($\Delta_i$) to capture both slow and fast temporal features, with outputs averaged per layer (Karadag et al., 10 Apr 2025); a sketch follows this list.
  • Tetra-Orientated/3D Scans: In medical and video domains, tensors are flattened and scanned along forward/reverse (depth-time, inter-slice, time, spatial) axes (e.g., TetraMamba fuses four independent scans to cover longitudinal/volumetric dependencies) (Kim et al., 13 Apr 2025, Shi et al., 1 Jun 2025, Gong et al., 14 Jan 2025).
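A sketch of the multi-scale idea from the ms-Mamba bullet above, reusing the `BidirectionalSSM` sketch; here distinct sampling rates are emulated by strided subsampling and interpolation back to full length, whereas ms-Mamba varies the learned step size $\Delta_i$ directly, so the branch construction is an assumption.

```python
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleTemporalMamba(nn.Module):
    """Parallel SSM branches at distinct temporal sampling rates, averaged per layer."""

    def __init__(self, d_model, strides=(1, 2, 4)):
        super().__init__()
        self.strides = strides
        self.branches = nn.ModuleList(BidirectionalSSM(d_model) for _ in strides)

    def forward(self, x):                               # x: (batch, length, d_model)
        length = x.shape[1]
        outs = []
        for branch, s in zip(self.branches, self.strides):
            y = branch(x[:, ::s])                       # coarser effective sampling rate
            # Interpolate each branch back to the full temporal resolution.
            y = F.interpolate(y.transpose(1, 2), size=length,
                              mode="linear", align_corners=False).transpose(1, 2)
            outs.append(y)
        return sum(outs) / len(outs)                    # per-layer averaging across scales
```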

This design yields strictly linear runtime in sequence or spatio-temporal length, as opposed to quadratic bottlenecks in Transformer architectures.

4. Integration with Deep Network Topologies

Temporal Mamba Blocks are almost always embedded within larger frameworks:

  • Dual-Stream SlowFast Fusion: PhysMamba integrates TD-Mamba blocks into both slow and fast temporal branches, fusing after lateral residual connectors, achieving superior representation of multi-scale dynamics (Luo et al., 18 Sep 2024); a simplified sketch follows this list.
  • U-Net Architectures: Hierarchical Temporal Mamba modules ensemble variable-memory SSM scans at every encoder/decoder stage, with skip connections and block-level gating (Zhang et al., 12 Mar 2024).
  • Spatio-Temporal Pyramids and Fusion: Coupled with spatial encoders, temporal Mamba blocks are fused multiplicatively and additively, often with channel-attention mechanisms (SE-like) to boost salient motion features (Li et al., 28 Dec 2024, Luo et al., 18 Sep 2024).
  • Multi-modal Settings: Audio-visual segmentation and video restoration frameworks pair temporal Mamba blocks with vision blocks and cross-modality fusion layers (Gong et al., 14 Jan 2025, Xu et al., 20 Aug 2024).
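A loose sketch of the dual-stream idea from the SlowFast bullet above, reusing `BidirectionalSSM`; the temporal stride, the lateral connector, and the concatenation-based fusion are illustrative guesses rather than the PhysMamba configuration.

```python
import torch
import torch.nn as nn

class SlowFastTemporalFusion(nn.Module):
    """Dual-stream sketch: a fast (full frame-rate) branch and a slow (subsampled)
    branch of temporal Mamba blocks, with a lateral residual connection before fusion."""

    def __init__(self, d_model, alpha=4):
        super().__init__()
        self.alpha = alpha                              # temporal stride of the slow path
        self.fast = BidirectionalSSM(d_model)
        self.slow = BidirectionalSSM(d_model)
        self.lateral = nn.Linear(d_model, d_model)      # fast -> slow lateral connector
        self.fuse = nn.Linear(2 * d_model, d_model)

    def forward(self, x):                               # x: (batch, length, d_model)
        fast = self.fast(x)
        slow_in = x[:, ::self.alpha] + self.lateral(fast[:, ::self.alpha])  # lateral residual
        slow = self.slow(slow_in)
        # Broadcast the slow path back to full temporal resolution before fusion.
        slow_full = slow.repeat_interleave(self.alpha, dim=1)[:, :x.shape[1]]
        return self.fuse(torch.cat([fast, slow_full], dim=-1))
```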

Hyperparameters commonly tuned include the state dimension ($N$), expansion factor ($E$), kernel sizes (3×3×3, etc.), number of scan paths/dilations, learned or fixed sampling intervals ($\Delta$), and depth (number of blocks/layers).

5. Parameter Efficiency, Computational Complexity, and Empirical Results

Temporal Mamba Blocks maintain a strict $O(L \cdot N)$ complexity for sequence length $L$ and hidden size $N$, as opposed to the $O(L^2)$ cost of attention (Karadag et al., 10 Apr 2025, Zhou et al., 3 Nov 2024). In multi-way scan designs (OmniMamba4D, SCST-Mamba), memory and compute per block remain linear in the flattened spatio-temporal length for each scan direction.

Table: Representative Empirical Advantages from Recent Papers

| Task / Model | Metric | Temporal Mamba Variant | Gain vs. Baseline |
|---|---|---|---|
| rPPG (PhysMamba) | MAE | TD-Mamba + SlowFast | 0.25 bpm on PURE |
| CT segmentation | Dice (%) | Tetra-orientated Mamba | 68.23 vs. 68.86 (SOTA) |
| Time series (ms-Mamba) | MSE | Multi-scale Mamba | –4% avg. vs. single Δ |
| Video super-resolution (SCST) | Perceptual quality | Spatio-temporal Mamba | SOTA on real-world VSR benchmarks |
| Human motion (HTM) | FID, R-precision | Hierarchical / Multi-Mamba | FID ↓40%, ×4 speed |

Blocks are systematically ablated in recent studies. Removing the bidirectional scan, local difference, gating, or fusion mechanisms leads to consistent drops in accuracy or loss of state-of-the-art status, confirming their necessity (Luo et al., 18 Sep 2024, Karadag et al., 10 Apr 2025, Liang et al., 24 Apr 2024).

6. Practical Implementation, Hyperparameterization, and Application Notes

Implementation typically entails the following steps (a combined sketch follows the list):

  • Input normalization (BatchNorm, LayerNorm) after TDC/Conv stages.
  • Flattening/reshaping of high-dimensional inputs for state-space scan along desired axes.
  • SSM formulation, zero-order hold discretization, input-dependent parameter projections.
  • Layer-wise gating, residual connections, channel-attention mechanisms (SE, CAB).
  • Integration of hardware-parallelized CUDA selective scan kernels for both forward and backward passes (Liang et al., 24 Apr 2024).
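A combined sketch assembling the steps above into one block, reusing `TemporalDifferenceConv1d` and `BidirectionalSSM` from earlier sections; real implementations replace the Python-level recurrence with fused CUDA selective-scan kernels, and the ordering of normalization, gating, and attention differs between papers.

```python
import torch.nn as nn

class TemporalMambaBlock(nn.Module):
    """Sketch of a full block: local temporal conv -> norm -> bidirectional
    selective scan -> SE-style channel attention -> residual connection."""

    def __init__(self, d_model, d_state=16, dropout=0.2):
        super().__init__()
        self.local = TemporalDifferenceConv1d(d_model)    # local dynamic enhancement
        self.norm = nn.LayerNorm(d_model)
        self.ssm = BidirectionalSSM(d_model, d_state)     # long-range dependencies
        self.se = nn.Sequential(                          # SE-style channel attention
            nn.Linear(d_model, d_model // 4), nn.ReLU(),
            nn.Linear(d_model // 4, d_model), nn.Sigmoid(),
        )
        self.drop = nn.Dropout(dropout)

    def forward(self, x):                                 # x: (batch, length, d_model)
        local = self.local(x.transpose(1, 2)).transpose(1, 2)
        y = self.ssm(self.norm(local))
        y = y * self.se(y.mean(dim=1, keepdim=True))      # gate channels by global stats
        return x + self.drop(y)                           # residual connection
```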

Key hyperparameters (an example configuration follows the list):

  • State size $N$ in the SSM (typical range: 16–64 for physiological/video, 128–500 for skeleton/motion/time-series).
  • Expansion factor $E$ for projection layers.
  • Kernel sizes (Convs: 3×3×3 for local spatio-temporal, TDC, DWConv).
  • Number of scan paths/directions (K): often 4–8, up to 21 in hierarchical/multiscale setups.
  • Dropout rates on selective gates (TMB in MambaTS: 0.2–0.3 optimal) (Cai et al., 26 May 2024).
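A hypothetical configuration object collecting these hyperparameters, with values picked from the typical ranges listed above; none of the defaults are taken from a specific paper's released configuration.

```python
from dataclasses import dataclass

@dataclass
class TemporalMambaConfig:
    d_state: int = 32                # SSM state size N (16-64 typical for video/physiological)
    expand: int = 2                  # expansion factor E for the input projection
    conv_kernel: tuple = (3, 3, 3)   # local spatio-temporal kernel size
    num_scan_paths: int = 4          # K scan directions (4-8 common, more in hierarchical setups)
    gate_dropout: float = 0.2        # dropout on selective gates (0.2-0.3 reported optimal)
    num_blocks: int = 8              # depth of the temporal Mamba stack

config = TemporalMambaConfig()
```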

Reproducibility is facilitated by open-source codebases—most recent works provide repository links and exact module-level implementation notes (e.g., stem.py, td_mamba.py, slowfast.py in PhysMamba) (Luo et al., 18 Sep 2024). Hardware-aware parallel scan fuses convolution and gating to optimize runtime.

7. Empirical Impact, Benchmarking, and Limitations

Temporal Mamba Blocks have advanced the state of the art across a diverse set of domains, including physiological signal estimation, medical image segmentation, time-series forecasting, video restoration, and human motion generation (see the table in Section 5).

Systematic ablation studies confirm the necessity of local dynamic enhancement, bidirectional scanning, channel attention, and scale-aware fusion. Removing any key component typically results in statistically significant degradation in accuracy, speed, or generalization.

Limitations: while linear in theory and efficient on modern hardware, multi-path or multi-scale designs can accumulate significant parameter overhead at large scale. Careful balancing of the state dimension, number of scan directions, and fusion depth is required for optimal deployment efficiency.


In summary, the Temporal Mamba Block is a hardware-efficient, input-adaptive, bidirectional state-space module for temporal sequence modeling, integrating fine-scale local dynamic enhancement with robust long-range dependency extraction. Its modularity and versatility have established new SOTA across video, vision, medical imaging, physiological signal, and time-series forecasting tasks, verified via rigorous empirical benchmarking and ablation. Continued evolution is anticipated toward further scaling, multimodal fusion, and application-specific architectural integrations.
