
Temporal Mamba Block for Temporal Modeling

Updated 15 December 2025
  • Temporal Mamba Block is a hardware-efficient state-space module designed for temporal sequence modeling that integrates local dynamic enhancement with long-range dependency capture.
  • It employs bidirectional scanning, multi-scale sampling, and input-dependent parameter adaptation to maintain linear computational complexity while processing spatio-temporal data.
  • Empirical results show that its architecture improves performance in video, medical imaging, and time-series forecasting by reducing errors and enabling robust, scalable modeling.

The Temporal Mamba Block is a versatile, state-space-based architectural component for temporal sequence modeling that unifies local dynamic enhancement, long-range dependency capture, and hardware-efficient scan paradigms. Designed around the Mamba selective state-space model family, the block has proliferated across computer vision, time-series analysis, video understanding, medical imaging, and physiological signal processing applications. Its formal definition centers on input-dependent, continuous-to-discrete state-space recurrences and bidirectional scanning within multi-scale, spatial-temporal tensor ecosystems.

1. Core Mathematical Structure and Model Formulation

The Temporal Mamba Block operates on sequences or spatio-temporal volumes by applying a continuous-time linear state-space model that is subsequently discretized per input (or per spatio-temporal coordinate):

$$h'(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t)$$

with discretization (typically via zero-order hold):

$$\bar{A} = \exp(\Delta A), \qquad \bar{B} = (\Delta A)^{-1}\bigl(\exp(\Delta A) - I\bigr)\,\Delta B$$

and recurrent updates per time step:

$$h_t = \bar{A}\,h_{t-1} + \bar{B}\,x_t, \qquad y_t = C\,h_t$$

All parameters ($A$, $B$, $C$, $\Delta$) are subject to dynamic adaptation depending on the current input token or spatio-temporal feature, typically via learned linear projections.
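A minimal PyTorch sketch of this recurrence is given below. It assumes a diagonal state matrix $A$ and uses the common Mamba-style simplification $\bar{B} \approx \Delta B$ in place of the full zero-order-hold expression; the function name `selective_ssm_scan` and the tensor layout are illustrative, not taken from any of the cited implementations.

```python
import torch

def selective_ssm_scan(x, A, B, C, delta):
    """Sequential reference implementation of the input-dependent SSM recurrence.

    x:     (batch, length, d_model)   input sequence
    A:     (d_model, d_state)         state matrix (diagonal simplification)
    B:     (batch, length, d_state)   input-dependent input projection
    C:     (batch, length, d_state)   input-dependent output projection
    delta: (batch, length, d_model)   input-dependent step size
    """
    batch, length, d_model = x.shape
    d_state = A.shape[-1]
    h = x.new_zeros(batch, d_model, d_state)
    ys = []
    for t in range(length):
        dt = delta[:, t].unsqueeze(-1)                  # (batch, d_model, 1)
        # Per-step discretisation: A_bar via ZOH, B_bar via the Euler shortcut.
        A_bar = torch.exp(dt * A)                       # (batch, d_model, d_state)
        B_bar = dt * B[:, t].unsqueeze(1)               # (batch, d_model, d_state)
        h = A_bar * h + B_bar * x[:, t].unsqueeze(-1)   # h_t = A_bar h_{t-1} + B_bar x_t
        y = (h * C[:, t].unsqueeze(1)).sum(-1)          # y_t = C h_t, per channel
        ys.append(y)
    return torch.stack(ys, dim=1)                       # (batch, length, d_model)
```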

Bidirectionality is integral in most Temporal Mamba Blocks: both forward and backward scans are performed, yielding outputs $y_\text{fwd}$ and $y_\text{bwd}$, which are typically summed or merged and gated.
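A sketch of such a bidirectional wrapper, reusing `selective_ssm_scan` from above; the class name, the per-direction parameterization, and the sigmoid-gated fusion are assumptions for illustration, since the exact merge (sum, gate, or concatenate-and-project) varies across the cited works.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BidirectionalSSM(nn.Module):
    """Runs independent forward and backward selective scans and gates their sum."""

    def __init__(self, d_model, d_state=16):
        super().__init__()
        # Separate parameter sets per direction (A stored in log space for stability).
        self.log_A_fwd = nn.Parameter(torch.randn(d_model, d_state))
        self.log_A_bwd = nn.Parameter(torch.randn(d_model, d_state))
        self.proj_fwd = nn.Linear(d_model, 2 * d_state + d_model)  # produces B, C, delta
        self.proj_bwd = nn.Linear(d_model, 2 * d_state + d_model)
        self.gate = nn.Linear(d_model, d_model)

    def _scan(self, x, log_A, proj):
        d_state = log_A.shape[-1]
        B, C, delta = proj(x).split([d_state, d_state, x.shape[-1]], dim=-1)
        return selective_ssm_scan(x, -torch.exp(log_A), B, C, F.softplus(delta))

    def forward(self, x):                               # x: (batch, length, d_model)
        y_fwd = self._scan(x, self.log_A_fwd, self.proj_fwd)
        y_bwd = self._scan(x.flip(1), self.log_A_bwd, self.proj_bwd).flip(1)
        # Fuse the two directions by summation, then apply an input-conditioned gate.
        return torch.sigmoid(self.gate(x)) * (y_fwd + y_bwd)
```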

Many implementations further introduce gating or blending mechanisms such as complementary “forget” gates, multi-stream fusion (e.g., SlowFast), FiLM-style conditional modulation, and dropout regularization (Luo et al., 18 Sep 2024, Karadag et al., 10 Apr 2025, Liang et al., 24 Apr 2024).

2. Local Dynamic Enhancement: Temporal Difference Convolution and Short-Range Modules

For fine-grained temporal dynamics, several variants introduce a local gradient-enhancement stage before state-space modeling. Notable examples:

  • Temporal Difference Convolution (TDC):

$$\mathrm{TDC}(f)(p_0) = \sum_{p_n \in \mathcal{R}} w(p_n)\, f(p_0 + p_n) + \theta \cdot \Bigl(-f(p_0) \sum_{p_n \in \mathcal{R}'} w(p_n)\Bigr)$$

where $\theta$ balances local smoothing and central differencing over the temporal neighborhood $\mathcal{R}'$ (PhysMamba uses $\theta = 0.5$) (Luo et al., 18 Sep 2024); a simplified implementation sketch appears at the end of this section.

  • Temporal Convolutional Module (TCM): 1D (dilated) convolution for short-range pattern extraction, often with growing dilation rate per block in multi-scale encoders (Sinha et al., 10 Jan 2025).

These local modules precede flattening and input-dependent SSM, enhancing the block’s ability to capture subtle, immediate transitions or artifacts in temporal data.
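A simplified sketch of the temporal-difference term, restricted to a 1D (batch, channels, time) layout for clarity; PhysMamba's TDC operates on 3D spatio-temporal kernels, so the module below should be read as an illustration of the formula above rather than the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalDifferenceConv1d(nn.Module):
    """1D temporal-difference convolution: a vanilla convolution plus a
    theta-weighted central-difference term, following the TDC formula."""

    def __init__(self, channels, kernel_size=3, theta=0.5):
        super().__init__()
        self.theta = theta
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=kernel_size // 2, bias=False)

    def forward(self, x):                              # x: (batch, channels, time)
        out = self.conv(x)                             # sum_{p_n in R} w(p_n) f(p_0 + p_n)
        if self.theta == 0:
            return out
        # Central-difference term: -f(p_0) * sum_{p_n in R'} w(p_n),
        # implemented as a 1x1 convolution with the summed kernel weights.
        kernel_sum = self.conv.weight.sum(dim=2, keepdim=True)   # (out_c, in_c, 1)
        out_center = F.conv1d(x, kernel_sum)
        return out - self.theta * out_center
```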

3. Bidirectional and Multi-Scale State-Space Scanning

Long-range temporal modeling is achieved through bidirectional processing (forward and backward scans), multi-scale parallelization, and multi-orientational traversals:

  • Bidirectional SSM: Each direction uses its own set of SSM parameters and has its scan reversed as appropriate; outputs are fused by summation or gating (Luo et al., 18 Sep 2024, Liang et al., 24 Apr 2024).
  • Multi-Scale Sampling: ms-Mamba introduces multiple parallel blocks at distinct sampling rates ($\Delta_i$) to capture both slow and fast temporal features, with outputs averaged per layer (Karadag et al., 10 Apr 2025); a sketch follows this list.
  • Tetra-Orientated/3D Scans: In medical and video domains, tensors are flattened and scanned along forward/reverse (depth-time, inter-slice, time, spatial) axes (e.g., TetraMamba fuses four independent scans to cover longitudinal/volumetric dependencies) (Kim et al., 13 Apr 2025, Shi et al., 1 Jun 2025, Gong et al., 14 Jan 2025).
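A sketch of the multi-scale idea from the ms-Mamba bullet above, reusing the `BidirectionalSSM` sketch; here distinct sampling rates are emulated by strided subsampling and interpolation back to full length, whereas ms-Mamba varies the learned step size $\Delta_i$ directly, so the branch construction is an assumption.

```python
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleTemporalMamba(nn.Module):
    """Parallel SSM branches at distinct temporal sampling rates, averaged per layer."""

    def __init__(self, d_model, strides=(1, 2, 4)):
        super().__init__()
        self.strides = strides
        self.branches = nn.ModuleList(BidirectionalSSM(d_model) for _ in strides)

    def forward(self, x):                               # x: (batch, length, d_model)
        length = x.shape[1]
        outs = []
        for branch, s in zip(self.branches, self.strides):
            y = branch(x[:, ::s])                       # coarser effective sampling rate
            # Interpolate each branch back to the full temporal resolution.
            y = F.interpolate(y.transpose(1, 2), size=length,
                              mode="linear", align_corners=False).transpose(1, 2)
            outs.append(y)
        return sum(outs) / len(outs)                    # per-layer averaging across scales
```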

This design yields strictly linear runtime in sequence or spatio-temporal length, as opposed to quadratic bottlenecks in Transformer architectures.

4. Integration with Deep Network Topologies

Temporal Mamba Blocks are almost always embedded within larger frameworks:

  • Dual-Stream SlowFast Fusion: PhysMamba integrates TD-Mamba blocks into both slow and fast temporal branches, fusing after lateral residual connectors, achieving superior representation of multi-scale dynamics (Luo et al., 18 Sep 2024); a simplified sketch follows this list.
  • U-Net Architectures: Hierarchical Temporal Mamba modules ensemble variable-memory SSM scans at every encoder/decoder stage, with skip connections and block-level gating (Zhang et al., 12 Mar 2024).
  • Spatio-Temporal Pyramids and Fusion: Coupled with spatial encoders, temporal Mamba blocks are fused multiplicatively and additively, often with channel-attention mechanisms (SE-like) to boost salient motion features (Li et al., 28 Dec 2024, Luo et al., 18 Sep 2024).
  • Multi-modal Settings: Audio-visual segmentation and video restoration frameworks pair temporal Mamba blocks with vision blocks and cross-modality fusion layers (Gong et al., 14 Jan 2025, Xu et al., 20 Aug 2024).
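A loose sketch of the dual-stream idea from the SlowFast bullet above, reusing `BidirectionalSSM`; the temporal stride, the lateral connector, and the concatenation-based fusion are illustrative guesses rather than the PhysMamba configuration.

```python
import torch
import torch.nn as nn

class SlowFastTemporalFusion(nn.Module):
    """Dual-stream sketch: a fast (full frame-rate) branch and a slow (subsampled)
    branch of temporal Mamba blocks, with a lateral residual connection before fusion."""

    def __init__(self, d_model, alpha=4):
        super().__init__()
        self.alpha = alpha                              # temporal stride of the slow path
        self.fast = BidirectionalSSM(d_model)
        self.slow = BidirectionalSSM(d_model)
        self.lateral = nn.Linear(d_model, d_model)      # fast -> slow lateral connector
        self.fuse = nn.Linear(2 * d_model, d_model)

    def forward(self, x):                               # x: (batch, length, d_model)
        fast = self.fast(x)
        slow_in = x[:, ::self.alpha] + self.lateral(fast[:, ::self.alpha])  # lateral residual
        slow = self.slow(slow_in)
        # Broadcast the slow path back to full temporal resolution before fusion.
        slow_full = slow.repeat_interleave(self.alpha, dim=1)[:, :x.shape[1]]
        return self.fuse(torch.cat([fast, slow_full], dim=-1))
```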

Hyperparameters commonly tuned include the state dimension ($N$), expansion factor ($E$), kernel sizes (3×3×3, etc.), number of scan paths/dilations, learned or fixed sampling intervals ($\Delta$), and depth (number of blocks/layers).

5. Parameter Efficiency, Computational Complexity, and Empirical Results

Temporal Mamba Blocks maintain a strict $O(L \cdot N)$ complexity for sequence length $L$ and hidden size $N$, as opposed to the $O(L^2)$ cost of attention (Karadag et al., 10 Apr 2025, Zhou et al., 3 Nov 2024). In multi-way scan designs (OmniMamba4D, SCST-Mamba), memory and compute per block remain linear in the flattened spatio-temporal length for each scan direction.

Table: Representative Empirical Advantages from Recent Papers

| Task / Model | Metric | Temporal Mamba Variant | Gain vs. Baseline |
|---|---|---|---|
| rPPG (PhysMamba) | MAE | TD-Mamba + SlowFast | 0.25 bpm on PURE |
| CT segmentation | Dice (%) | Tetra-orientated Mamba | 68.23 vs. 68.86 (SOTA) |
| Time series (ms-Mamba) | MSE | Multi-scale Mamba | –4% avg. vs. single Δ |
| Video super-resolution (SCST) | Perceptual quality | Spatio-temporal Mamba | SOTA on real-world VSR benchmarks |
| Human motion (HTM) | FID, R-precision | Hierarchical / Multi-Mamba | FID ↓40%, ×4 speed |

Blocks are systematically ablated in recent studies. Removing the bidirectional scan, local difference, gating, or fusion mechanisms leads to consistent drops in accuracy or loss of state-of-the-art status, confirming their necessity (Luo et al., 18 Sep 2024, Karadag et al., 10 Apr 2025, Liang et al., 24 Apr 2024).

6. Practical Implementation, Hyperparameterization, and Application Notes

Implementation typically entails the following steps (a combined sketch follows the list):

  • Input normalization (BatchNorm, LayerNorm) after TDC/Conv stages.
  • Flattening/reshaping of high-dimensional inputs for state-space scan along desired axes.
  • SSM formulation, zero-order hold discretization, input-dependent parameter projections.
  • Layer-wise gating, residual connections, channel-attention mechanisms (SE, CAB).
  • Integration of hardware-parallelized CUDA selective scan kernels for both forward and backward passes (Liang et al., 24 Apr 2024).
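A combined sketch assembling the steps above into one block, reusing `TemporalDifferenceConv1d` and `BidirectionalSSM` from earlier sections; real implementations replace the Python-level recurrence with fused CUDA selective-scan kernels, and the ordering of normalization, gating, and attention differs between papers.

```python
import torch.nn as nn

class TemporalMambaBlock(nn.Module):
    """Sketch of a full block: local temporal conv -> norm -> bidirectional
    selective scan -> SE-style channel attention -> residual connection."""

    def __init__(self, d_model, d_state=16, dropout=0.2):
        super().__init__()
        self.local = TemporalDifferenceConv1d(d_model)    # local dynamic enhancement
        self.norm = nn.LayerNorm(d_model)
        self.ssm = BidirectionalSSM(d_model, d_state)     # long-range dependencies
        self.se = nn.Sequential(                          # SE-style channel attention
            nn.Linear(d_model, d_model // 4), nn.ReLU(),
            nn.Linear(d_model // 4, d_model), nn.Sigmoid(),
        )
        self.drop = nn.Dropout(dropout)

    def forward(self, x):                                 # x: (batch, length, d_model)
        local = self.local(x.transpose(1, 2)).transpose(1, 2)
        y = self.ssm(self.norm(local))
        y = y * self.se(y.mean(dim=1, keepdim=True))      # gate channels by global stats
        return x + self.drop(y)                           # residual connection
```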

Key hyperparameters (an example configuration follows the list):

  • State size $N$ in the SSM (typical range: 16–64 for physiological/video, 128–500 for skeleton/motion/time-series).
  • Expansion factor $E$ for projection layers.
  • Kernel sizes (Convs: 3×3×3 for local spatio-temporal, TDC, DWConv).
  • Number of scan paths/directions (K): often 4–8, up to 21 in hierarchical/multiscale setups.
  • Dropout rates on selective gates (TMB in MambaTS: 0.2–0.3 optimal) (Cai et al., 26 May 2024).
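A hypothetical configuration object collecting these hyperparameters, with values picked from the typical ranges listed above; none of the defaults are taken from a specific paper's released configuration.

```python
from dataclasses import dataclass

@dataclass
class TemporalMambaConfig:
    d_state: int = 32                # SSM state size N (16-64 typical for video/physiological)
    expand: int = 2                  # expansion factor E for the input projection
    conv_kernel: tuple = (3, 3, 3)   # local spatio-temporal kernel size
    num_scan_paths: int = 4          # K scan directions (4-8 common, more in hierarchical setups)
    gate_dropout: float = 0.2        # dropout on selective gates (0.2-0.3 reported optimal)
    num_blocks: int = 8              # depth of the temporal Mamba stack

config = TemporalMambaConfig()
```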

Reproducibility is facilitated by open-source codebases—most recent works provide repository links and exact module-level implementation notes (e.g., stem.py, td_mamba.py, slowfast.py in PhysMamba) (Luo et al., 18 Sep 2024). Hardware-aware parallel scan fuses convolution and gating to optimize runtime.

7. Empirical Impact, Benchmarking, and Limitations

Temporal Mamba Blocks have advanced the state of the art across a diverse set of domains, including physiological signal estimation, medical image segmentation, time-series forecasting, video restoration, and human motion generation (see the table in Section 5).

Systematic ablation studies confirm the necessity of local dynamic enhancement, bidirectional scanning, channel attention, and scale-aware fusion. Removing any key component typically results in statistically significant degradation in accuracy, speed, or generalization.

Limitations: while linear in theory and efficient on modern hardware, multi-path or multi-scale designs can accumulate significant parameter overhead at large scale. Careful balancing of the state dimension, number of scan directions, and fusion depth is required for optimal deployment efficiency.


In summary, the Temporal Mamba Block is a hardware-efficient, input-adaptive, bidirectional state-space module for temporal sequence modeling, integrating fine-scale local dynamic enhancement with robust long-range dependency extraction. Its modularity and versatility have established new SOTA across video, vision, medical imaging, physiological signal, and time-series forecasting tasks, verified via rigorous empirical benchmarking and ablation. Continued evolution is anticipated toward further scaling, multimodal fusion, and application-specific architectural integrations.
