Spatio-temporal Mamba Adapter (STMA)
- STMA is a modular neural operations layer that integrates input- and position-dependent state-space models to enable linear-time spatio-temporal sequence modeling.
- It replaces or augments self-attention in diverse architectures, offering significant gains in parameter efficiency and scalability and reductions in computational cost for tasks like video understanding and multimodal fusion.
- STMA adapts through specialized designs such as frequency mixing and multi-branch fusion, achieving state-of-the-art performance and real-time inference across varied application domains.
A Spatio-temporal Mamba Adapter (STMA) is a modular neural operations layer designed to inject linear-complexity, structured state-space sequence modeling into high-dimensional spatio-temporal data processing. Rooted in the Mamba state-space model (SSM) paradigm, STMA blocks replace or augment self-attention for video, multivariate sequence, graph, and multimodal fusion tasks, with the goal of capturing long-range spatial and temporal dependencies using lightweight, flexible recurrent or convolutional mechanisms. STMA modules are instantiated in diverse architectures ranging from vision transformers and video diffusion U-Nets to multi-lead time-series encoders, offering dramatic improvements in parameter count, efficiency, and scalability over quadratic-complexity attention, while providing domain-adaptive, multi-branch, or cross-modal extensions as needed.
1. Core Mathematical Principles and Adapter Design
STMA modules derive from the continuous-time linear SSM
$$h'(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t) + D\,x(t),$$
which, under zero-order hold discretization with dynamic gating, yields (per token/patch/segment/channel)
$$h_t = \Phi_t\,h_{t-1} + \Gamma_t\,x_t, \qquad y_t = C_t\,h_t + D\,x_t.$$
The STMA’s innovation lies in making the propagation parameters $(\Phi, \Gamma, C, D)$ input- and position-dependent via learned projections (often with selective gating, e.g. $\Delta_t = \operatorname{softplus}(W_\Delta x_t)$), and exploiting linear-time sequence convolution via scan or recurrence for $O(L)$ complexity. This generalizes to multi-branch, bidirectional, and spatial/temporal parallel scan variants, as well as pathwise 3D scan for video tensors.
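The selective recurrence can be sketched as a naive sequential reference implementation; all shapes and weight names here are illustrative (no hardware-aware parallel scan, and a simplified Euler discretization of the input matrix):

```python
import numpy as np

def selective_scan(x, A, W_dt, W_B, W_C, D):
    """Reference (sequential) selective SSM scan, O(L * d * n).

    Illustrative shapes:
      x    : (L, d)   token sequence
      A    : (d, n)   learned state decay (kept negative for stability)
      W_dt : (d, d)   projection producing the per-token step size Delta_t
      W_B  : (n, d)   projection producing the input-dependent input matrix
      W_C  : (n, d)   projection producing the input-dependent output matrix
      D    : (d,)     skip/feedthrough term
    """
    L, d = x.shape
    n = A.shape[1]
    h = np.zeros((d, n))
    y = np.empty_like(x)
    for t in range(L):
        dt = np.logaddexp(0.0, x[t] @ W_dt.T)   # softplus(W_dt x_t), (d,)
        B_t = W_B @ x[t]                        # input-dependent B, (n,)
        C_t = W_C @ x[t]                        # input-dependent C, (n,)
        Phi = np.exp(dt[:, None] * A)           # zero-order hold decay, (d, n)
        Gam = dt[:, None] * B_t[None, :]        # simplified Euler input, (d, n)
        h = Phi * h + Gam * x[t][:, None]       # state update
        y[t] = h @ C_t + D * x[t]               # readout plus skip
    return y
```

In production implementations this loop is replaced by an associative parallel scan, which is what makes the $O(L)$ work parallelizable on accelerators.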
Adapter designs include:
- Direct SSM block adapters (VideoMamba (Park et al., 2024), MoMa (Yang et al., 29 Jun 2025), S2M2ECG (Zhang et al., 3 Sep 2025))—state update and output projections, selective gating, convolutional kernel realization.
- Frequency or channel mixing modules (MultiFFT in UBATrack (Liang et al., 21 Jan 2026))—FFT/EMM/IFFT operations for spectral feature fusion inside the adapter.
- Multi-branch or cross-domain adapters—independent adapters per input modality, channel, or domain, integrated via fusion or cross-adaptive SSM (Damba-ST (An et al., 22 Jun 2025)).
- 3D Selective Scan adapters—continuous spatio-temporal scan paths through a tensor (ControlNet (Shi et al., 1 Jun 2025)), each passed through its own dynamic SSM and recombined.
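As an illustration of the 3D scan idea (a generic sketch, not any specific paper's exact path set), a video tensor can be flattened into multiple continuous 1D orderings, each of which would feed its own dynamic SSM before recombination:

```python
import numpy as np

def scan_paths(v):
    """Flatten a (T, H, W, C) video tensor into two continuous scan orders.

    spatial_first : raster scan within each frame, then advance in time
    temporal_first: full temporal trace per pixel, then spatial raster
    """
    T, H, W, C = v.shape
    spatial_first = v.reshape(T * H * W, C)                          # t, h, w
    temporal_first = v.transpose(1, 2, 0, 3).reshape(T * H * W, C)   # h, w, t
    return spatial_first, temporal_first
```

Bidirectional variants simply reverse each path; the per-path SSM outputs are then un-flattened and fused.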
2. Adapter Placement, Data Flow, and Architectural Variants
STMA blocks are flexible regarding insertion points in backbone models:
- ViT/CLIP transformers: Placed after selected attention layers (UBATrack (Liang et al., 21 Jan 2026), MoMa (Yang et al., 29 Jun 2025)), typically with frozen backbone parameters and only adapters/fusion blocks trained for parameter efficiency.
- Video U-Nets and Video Diffusion: Inserted at multi-scale resolutions to introduce global spatio-temporal context without quadratic cost (ControlNet (Shi et al., 1 Jun 2025)).
- 3D Vision or EEG/ECG pipelines: Deployed per sequence/channel/lead, often bi-directionally, possibly with multi-branch parameter sharing or domain-adaptive tokens (S2M2ECG (Zhang et al., 3 Sep 2025), Damba-ST (An et al., 22 Jun 2025)).
- Dual encoders/decoders: As part of cascaded spatial and temporal flows, e.g. parallel spatial (MS-VSSB) and temporal (CA-VSSB) adapters in video anomaly detection (STNMamba (Li et al., 2024)).
Data flow typically involves: normalization → local mixing (1D conv or patching) → one or two SSM scans (forward/backward, spatial/temporal) → nonlinear gating and fusion → up-projection or token recombination → residual addition.
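The pipeline above can be sketched end to end; this is a schematic NumPy rendering with hypothetical weight shapes, and the SSM scan is stood in by a fixed exponential-decay recurrence rather than a true selective scan:

```python
import numpy as np

def stma_block(x, W_in, conv_k, W_gate, W_out, decay=0.9):
    """Schematic STMA data flow: norm -> local conv mixing -> scan ->
    gating -> up-projection -> residual. All weights are illustrative.

    x: (L, d); W_in, W_gate: (d_inner, d); conv_k: (k,) array; W_out: (d, d_inner)
    """
    L, d = x.shape
    # 1. normalization over the channel axis
    z = (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + 1e-6)
    # 2. projection into the adapter's inner width
    u = z @ W_in.T
    # 3. local mixing: causal depthwise-style 1D convolution
    k = len(conv_k)
    pad = np.concatenate([np.zeros((k - 1, u.shape[1])), u], axis=0)
    u = np.stack([(conv_k[:, None] * pad[i:i + k]).sum(0) for i in range(L)])
    # 4. SSM scan stand-in: fixed exponential-decay recurrence over time
    h = np.empty_like(u)
    acc = np.zeros(u.shape[1])
    for t in range(L):
        acc = decay * acc + u[t]
        h[t] = acc
    # 5. nonlinear (sigmoid) gating, up-projection, residual addition
    g = 1.0 / (1.0 + np.exp(-(z @ W_gate.T)))
    return x + (h * g) @ W_out.T
```

Real adapters replace step 4 with the input-dependent selective scan and may run it twice (forward/backward, or spatial/temporal) before fusion.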
3. Specializations: Frequency Mixing, Domain Adaptation, and Fusion
Several STMA implementations augment the core SSM with additional domain or channel-wise operations:
- Frequency-domain mixing: FFT is used to decompose tokens along the channel axis, combining with a learnable (complex-valued) elementwise matrix multiplication (EMM) and nonlinearities to achieve cross-channel, cross-modal fusion while maintaining near-linear ($O(C \log C)$ per token) cost (UBATrack (Liang et al., 21 Jan 2026)).
- Multi-branch and domain-adaptive variants: Damba-ST partitions the latent state into shared vs. domain-specific subspaces, instantiating three types of Domain Adapters (spatial, temporal, delay) for cross-domain knowledge sharing via learnable tokens and adapter-wise SSMs (An et al., 22 Jun 2025).
- Squeeze-and-excitation and temporal convolution: S2M2ECG (Zhang et al., 3 Sep 2025) fuses per-lead outputs with SENet channel gating for spatial integration, while DMTrack (Li et al., 3 Aug 2025) uses per-modality 1D temporal adapters to “prompt” frozen ViT features.
- Spatial-temporal fusion modules: STNMamba (Li et al., 2024) fuses parallel spatial and temporal feature streams at multiple levels via custom blocks to enforce spatial-temporal consistency and memory-based normality.
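The frequency-mixing pattern reduces to three steps: real FFT along channels, learnable elementwise complex modulation, inverse FFT. A minimal sketch, with a hypothetical learned complex filter `w` standing in for the EMM:

```python
import numpy as np

def freq_mix(x, w):
    """FFT-based channel mixing: rFFT -> elementwise complex modulation -> irFFT.

    x: (L, C) real tokens; w: (C // 2 + 1,) learnable complex weights.
    """
    X = np.fft.rfft(x, axis=-1)                    # spectral view of channels
    X = X * w                                      # learnable elementwise mix
    return np.fft.irfft(X, n=x.shape[-1], axis=-1) # back to channel space
```

Because multiplication in the frequency domain is circular convolution in the channel domain, a single elementwise product couples every pair of channels at $O(C \log C)$ cost instead of $O(C^2)$.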
4. Training Protocols, Computational Complexity, and Integration
STMA blocks are constructed for efficient adapter training under a frozen backbone and to scale linearly in data dimensions:
- Parameter efficiency: Adapters inserted into ViT or CLIP backbones typically add orders of magnitude fewer parameters than the hundreds of millions in the backbone transformers (e.g. MoMa/ViT-B: 11 M of STMA parameters against an 86 M frozen backbone), alongside significant FLOPs reductions (Yang et al., 29 Jun 2025).
- Complexity: All core scan/fusion adapters operate at $O(L)$ or $O(L \cdot d)$ per-layer cost (with $L$ the sequence or patch length and $d$ the hidden width), compared to $O(L^2)$ for attention.
- Implementation: Adapters typically use AdamW with moderate learning rates ($10^{-4}$ to $3\times10^{-4}$), optimizer states restricted to the adapter/fusion modules, regularization (weight decay, dropout), and input normalization.
- Self-supervised or contrastive variants: STMA adapters can be used within multi-stage or multi-modal training, e.g. 3-stage HR/LR contrastive scheme in video super-resolution (Shi et al., 1 Jun 2025).
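The frozen-backbone protocol above amounts to partitioning parameters before building the optimizer. A minimal PyTorch sketch, where `backbone` and the adapter stack are hypothetical stand-ins rather than any paper's actual modules:

```python
import torch
from torch import nn, optim

# Hypothetical modules: a small frozen "backbone" and a residual adapter.
backbone = nn.Sequential(nn.Linear(16, 16), nn.GELU(), nn.Linear(16, 16))
adapter = nn.Sequential(nn.Linear(16, 4), nn.SiLU(), nn.Linear(4, 16))

for p in backbone.parameters():   # freeze the backbone entirely
    p.requires_grad_(False)

# AdamW state covers only the adapter, per the protocol above.
opt = optim.AdamW(adapter.parameters(), lr=3e-4, weight_decay=0.01)

x = torch.randn(8, 16)
y = backbone(x)
y = y + adapter(y)                # residual adapter insertion
loss = y.pow(2).mean()            # placeholder task loss
loss.backward()
opt.step()
```

Only adapter tensors accumulate gradients or optimizer state, which is what keeps both trainable-parameter count and optimizer memory small.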
5. Empirical Results, Ablations, and Task-Specific Findings
Across varied domains, insertion of STMA adapters has yielded state-of-the-art or highly competitive results with strong ablation evidence for their necessity and design:
- Multimodal object tracking (UBATrack): STMA alone improves LasHeR Success Rate by +5.0 points, DepthTrack F-score by +5.6 (Table V), outperforming 42.5M-param attention blocks with only 0.018M adapters (Liang et al., 21 Jan 2026).
- Video recognition (VideoMamba, MoMa): STMA-ViT-B/16 achieves 84.8% Top-1 accuracy on Kinetics-400 at reduced compute (902 GFLOPs vs 1214) (Yang et al., 29 Jun 2025); spatio-temporal scan variant yields +1.5% with temporal PEs (Park et al., 2024).
- ECG and time-series (S2M2ECG): STMA raises F1 on Chapman dataset to 0.918 using bidirectional, multi-branch fusion (0.705M params vs 12M for Transformers), exhibiting ~2-7% F1 improvement with key ablations (e.g. bi-directional scan, lead fusion) (Zhang et al., 3 Sep 2025).
- Video anomaly detection (STNMamba): Achieves real-time speed (40 FPS at 256×256), 7.2 M total parameters, competitive accuracy against 24–63M-param transformers (Li et al., 2024).
- Urban flow prediction: Damba-ST demonstrates best MAE on 9/12 settings, state-of-the-art zero-shot generalization to new cities, linear complexity in look-back length (An et al., 22 Jun 2025); ST-MambaSync and ST-Mamba establish new SOTA on six traffic benchmarks at lower compute than prior models (Shao et al., 2024, Yuan et al., 2024).
- Video Super-resolution (ControlNet+STMA): Achieves +0.68 dB PSNR and >0.03 LPIPS improvement on YouHQ relative to attention/conv baselines; 3-stage training with contrastive pretext found essential for robustness (Shi et al., 1 Jun 2025).
Empirical ablations consistently demonstrate:
- Best performance with moderate adapter depth or number (e.g., 6 STMA blocks in UBATrack).
- Sharply degraded performance if temporal branch or adaptive fusion disabled (InterMamba (Wu et al., 3 Jun 2025)).
- Crucial importance of preserving cross-modal or cross-domain separation and aligning fusion/interaction points, e.g., multi-branch and PMCA in DMTrack (Li et al., 3 Aug 2025).
6. Application Domains and Typical Workflows
STMA has been validated in tasks requiring efficient long-range modeling under tight parameter or compute budgets, including:
- Multi-modal tracking: Per-modality adapters facilitate prompt-based and memory-efficient fusion (UBATrack, DMTrack) (Liang et al., 21 Jan 2026, Li et al., 3 Aug 2025).
- Spatio-temporal video modeling: Plug-in blocks to replace self-attention modules in vision transformers, U-Nets (MoMa, VideoMamba) (Yang et al., 29 Jun 2025, Park et al., 2024).
- Biomedical multivariate signals: Multi-lead or multi-channel integration using parallel SSMs with spatial/temporal adapters (S2M2ECG) (Zhang et al., 3 Sep 2025).
- Urban and traffic forecasting: Transformer/SSM hybrid stacks for traffic flows; domain adapters enable transfer to unseen cities (Damba-ST, ST-MambaSync) (An et al., 22 Jun 2025, Shao et al., 2024).
- Preprocessing for domain adaptation: Cross-PSD optimal-transport alignment as a plug-in before any classifier; guarantees reduction in bias/variance (Monge alignment STMA) (Gnassounou et al., 2024).
- Video restoration/generation: 3D scan adapters implemented within diffusion or video restoration pipelines (ControlNet w/ STMA) (Shi et al., 1 Jun 2025).
7. Theoretical, Computational, and Practical Considerations
- Theoretical properties: Optimal-transport based STMA for multivariate signals admits tight non-asymptotic concentration bounds, with variance decaying as the number of aligned samples grows (Gnassounou et al., 2024).
- Linear scaling and real-time inference: All major STMA block designs sharply reduce memory and compute demands, e.g., 1–5 ms per 10 s ECG (S2M2ECG), 40 FPS video anomaly detection (STNMamba), sub-second urban forecasting (Damba-ST).
- Design trade-offs: Pure per-modality, per-branch adapters (DMTrack) facilitate rapid fusion but eschew global recurrence, trading a degree of long-range modeling for architectural simplicity.
A plausible implication is that as Mamba SSM-based adapters generalize to ever-larger modalities and more heterogeneous deployment environments, new forms of domain and modality adaptation, multi-branch fusion, and hierarchical scan patterns will become central for state-of-the-art, resource-efficient spatio-temporal AI systems.