MambaSTS: Mamba-based Spatio-Temporal Modeling

Updated 4 July 2026

MambaSTS is a term for Mamba-based spatio-temporal systems that fuse spatial, temporal, and semantic learning, notably in UAV tracking.
It replaces quadratic self-attention with selective state-space modeling to achieve linear-time sequence processing while retaining long-range dependencies.
Results on the UAV-Anti-UAV benchmark show robustness and efficiency, with the tracker reported at 54 FPS on a single NVIDIA A6000.

MambaSTS is an overloaded designation in recent arXiv literature for Mamba-based spatio-temporal systems rather than a single universally fixed architecture. In the most specific and explicit usage, it denotes a baseline tracker for the UAV-Anti-UAV benchmark that performs integrated spatial-temporal-semantic learning by combining a HiViT visual backbone, Mamba-based temporal propagation, and a Mamba language encoder (Zhang et al., 8 Dec 2025). In adjacent literature, the expression also appears as a broader label for Mamba-based spatio-temporal sequence modeling, and it is sometimes conflated with the distinct model name MambaTS for long-term time series forecasting (Cai et al., 2024). Across these usages, the unifying idea is the replacement or reduction of quadratic self-attention with selective state-space modeling, typically to obtain linear-time sequence processing while preserving long-range temporal structure.

1. Naming and scope

The term has at least three distinct functions in the cited literature: a concrete model name, a loose descriptive label, and a source of naming ambiguity.

Usage	Paper	Role
MambaSTS	(Zhang et al., 8 Dec 2025)	UAV-Anti-UAV tracking baseline
MambaTS	(Cai et al., 2024)	Distinct LTSF model; “MambaSTS” appears as a naming variant or typo
MambaSTS as a broad descriptor	(Hamad et al., 5 Jul 2025, Chen et al., 17 Aug 2025)	General Mamba-based spatio-temporal modeling pattern

In the UAV-Anti-UAV work, MambaSTS is a named method introduced together with a million-scale benchmark consisting of 1,810 videos, approximately 1.05M frames, total duration approximately 9.85 hours, bounding-box annotations, one language prompt per video, and 15 tracking attributes (Zhang et al., 8 Dec 2025). By contrast, the long-term forecasting paper explicitly states that the query “MambaSTS” appears to be a naming variant or typo and that the paper’s model is “MambaTS,” not “MambaSTS” (Cai et al., 2024).

This suggests that any encyclopedia treatment of MambaSTS must separate the specific tracker named MambaSTS from the wider family of Mamba-based spatio-temporal architectures to which the label is sometimes informally attached.

2. State-space foundations

The shared technical substrate is the selective state-space model. The continuous-time linear state-space form appears repeatedly in the cited works:

$h'(t) = A h(t) + B x(t), \qquad y(t) = C h(t)$

or, with an explicit direct term,

$\dot{x}(t) = A x(t) + B u(t), \qquad y(t) = C x(t) + D u(t).$

After zero-order-hold discretization with step $\Delta$, one obtains recurrent updates such as

$h_t = \bar{A} h_{t-1} + \bar{B} x_t, \qquad y_t = C h_t,$

or equivalently

$x_{t+1} = \bar{A} x_t + \bar{B} u_t, \qquad y_t = C x_t + D u_t.$

These forms are stated in the MambaSTS tracker, MambaST, MambaTS, MCST-Mamba, STM3, S-Mamba, and HIGSTM papers (Zhang et al., 8 Dec 2025).
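The discretized recurrence above can be illustrated with a minimal sketch. It assumes a diagonal state matrix (so the matrix exponential reduces to elementwise `exp`) and a scalar input sequence; the function names `zoh_discretize` and `run_recurrence` and all numeric values are hypothetical and are not taken from any of the cited papers.

```python
import math

def zoh_discretize(a_diag, b, delta):
    """Zero-order-hold discretization for a diagonal state matrix A.

    For each diagonal entry a:
        A_bar = exp(delta * a)
        B_bar = (exp(delta * a) - 1) / a * b   (tends to delta * b as a -> 0)
    """
    a_bar, b_bar = [], []
    for a, b_i in zip(a_diag, b):
        ea = math.exp(delta * a)
        a_bar.append(ea)
        b_bar.append(delta * b_i if abs(a) < 1e-12 else (ea - 1.0) / a * b_i)
    return a_bar, b_bar

def run_recurrence(a_bar, b_bar, c, xs):
    """Apply h_t = A_bar h_{t-1} + B_bar x_t and y_t = C h_t to a scalar sequence."""
    h = [0.0] * len(a_bar)
    ys = []
    for x in xs:
        h = [ab * h_i + bb * x for ab, bb, h_i in zip(a_bar, b_bar, h)]
        ys.append(sum(c_i * h_i for c_i, h_i in zip(c, h)))
    return ys

# Illustrative values only: two stable (negative) modes and a unit impulse input.
a_bar, b_bar = zoh_discretize(a_diag=[-1.0, -0.5], b=[1.0, 1.0], delta=0.1)
ys = run_recurrence(a_bar, b_bar, c=[1.0, 1.0], xs=[1.0, 0.0, 0.0, 0.0])
```

With negative diagonal entries the impulse response decays step by step, which is the fixed-parameter behavior that the selective variant discussed next makes input-dependent.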

The specifically Mamba-style extension is selectivity: the effective state-space parameters depend on the current input. In the traffic forecasting formulation, this appears as

$A_k = f_A(u_k), \quad B_k = f_B(u_k), \quad C_k = f_C(u_k),$

while the time-series forecasting formulation writes

$x_{t+1} = A(s_t)x_t + B(s_t)u_t, \qquad y_t = C(s_t)x_t + D(s_t)u_t,$

with $s_t = \sigma(W_s x_t + b_s)$ (Hamad et al., 5 Jul 2025). The practical consequence, stated across the papers, is that Mamba replaces quadratic self-attention with linear-time or near-linear scan-style sequence processing, making longer temporal windows and recurrent propagation more tractable than standard attention mechanisms (Gao et al., 2024).
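To make the idea of selectivity concrete, the following sketch runs a scan in which the step size and the input and output projections are recomputed from each input. It is an illustration under stated assumptions, not the formulation of any one cited paper: the state matrix is diagonal and fixed, the step size comes from a softplus of a scalar projection, the input matrix uses the simplified discretization $\bar{B} \approx \Delta B$, and the weights `w_delta`, `w_b`, and `w_c` are hypothetical scalar projections.

```python
import math

def softplus(v):
    """Smooth positive map used here for the input-dependent step size."""
    return math.log1p(math.exp(v))

def selective_scan(us, a_diag, w_delta, w_b, w_c):
    """Scan a scalar sequence with input-dependent (selective) parameters.

    Cost is one state update per element, so the loop is linear in len(us).
    """
    h = [0.0] * len(a_diag)
    ys = []
    for u in us:
        delta = softplus(w_delta * u)                    # step size depends on u
        b = [w * u for w in w_b]                         # B_k = f_B(u_k)
        c = [w * u for w in w_c]                         # C_k = f_C(u_k)
        a_bar = [math.exp(delta * a) for a in a_diag]    # effective A_k via delta
        # Simplified discretization of the input matrix: B_bar ~= delta * B.
        h = [ab * h_i + delta * b_i * u for ab, b_i, h_i in zip(a_bar, b, h)]
        ys.append(sum(c_i * h_i for c_i, h_i in zip(c, h)))
    return ys

# Illustrative values only.
ys = selective_scan(us=[1.0, 0.5, 0.0, 2.0], a_diag=[-1.0, -0.2],
                    w_delta=0.0, w_b=[1.0, 0.5], w_c=[1.0, 2.0])
```

Because every parameter is a function of the current input, the recurrence can attenuate or retain state depending on content, while the per-step cost stays constant.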

Within the MambaSTS tracker, this state-space mechanism is made causal and online. The paper describes a temporal token $T_{\mathrm{temp}}$ that carries compact video memory across frames, with the update

$T_{\mathrm{temp}}^i \leftarrow \mathrm{Mamba}(\{F_x^1, F_x^2, \ldots, F_x^i\}),$

so that long-term context is propagated without storing all prior frame tokens explicitly (Zhang et al., 8 Dec 2025).
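The storage argument can be sketched as follows. This example swaps in a simple exponential-decay recurrence for the Mamba block, purely as a stand-in, to show that a fixed-size state updated once per frame yields the same summary as reprocessing every stored frame. The class `TemporalToken`, the function `batch_summary`, and the `decay` parameter are hypothetical names introduced for illustration.

```python
def batch_summary(frames, decay=0.9):
    """Recompute the summary from scratch; requires keeping every frame feature."""
    state = [0.0] * len(frames[0])
    for f in frames:
        state = [decay * s + (1.0 - decay) * v for s, v in zip(state, f)]
    return state

class TemporalToken:
    """Fixed-size memory updated once per frame (constant storage).

    Stand-in recurrence only: the tracker described above uses a Mamba block.
    """

    def __init__(self, dim, decay=0.9):
        self.state = [0.0] * dim
        self.decay = decay

    def update(self, frame_feature):
        self.state = [self.decay * s + (1.0 - self.decay) * v
                      for s, v in zip(self.state, frame_feature)]
        return self.state

# Illustrative two-dimensional frame features.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
token = TemporalToken(dim=2)
for f in frames:
    online = token.update(f)   # only the current frame is needed at each step
```

The online path never holds more than one frame feature and one state vector, which is the property that makes causal, per-frame inference practical.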

3. MambaSTS as a UAV-Anti-UAV tracking architecture

In its explicit named form, MambaSTS is a multi-modal tracker designed for UAV-Anti-UAV, a setting in which a pursuer UAV tracks an adversarial target UAV from onboard video. The paper emphasizes severe dual-dynamic disturbances: rapid motion of both observer and target, rapid scale changes, a high proportion of small objects, strong motion blur and turbulence, frequent viewpoint switches, and rapidly changing backgrounds (Zhang et al., 8 Dec 2025).

The model takes three inputs: a template image cropped from the first frame, a search image cropped from the current frame, and a sequence-level text prompt. Visual features are extracted with HiViT-base, described as using a 4×4 patch embedding and two 2×2 merges, so the effective patch stride is $4 \times 2 \times 2 = 16$; the paper also specifies the token dimension. Language is encoded by a Mamba-130M encoder with a GPT-NeoX tokenizer, producing language tokens that include a [CLS] token. STS Mamba modules are inserted at multiple HiViT stages to fuse template tokens, search tokens, language tokens, and the temporal token (Zhang et al., 8 Dec 2025).

At each stage where an STS Mamba module is inserted, the unified input combines the temporal token with the template, search, and language tokens into a single token sequence, after an alignment or downsampling operator brings them to a common token scale. The STS Mamba block then applies unidirectional, causal state-space scanning. This causal modification is important because the tracker is intended for online inference rather than bidirectional offline sequence processing (Zhang et al., 8 Dec 2025).

The prediction head is anchor-free and fully convolutional, with three branches: a center heatmap, local offsets, and box sizes. No online template update is required; the paper attributes robustness under occlusion, out-of-view events, and scale variation to the temporal token memory and semantic grounding (Zhang et al., 8 Dec 2025).
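A three-branch center-based head is commonly decoded by taking the heatmap peak and reading the offset and size at that location. The sketch below follows that common convention and is an assumption, since the text above does not give the paper's exact decoding rule; the function `decode_box`, the offset units (fractions of a cell), and the normalized box sizes are all hypothetical choices.

```python
def decode_box(heatmap, offsets, sizes):
    """Decode one box from center-based head outputs.

    heatmap[r][c]  : center score for grid cell (r, c)
    offsets[r][c]  : (dx, dy) sub-cell offset in cell units (assumed convention)
    sizes[r][c]    : (w, h) normalized to the search image (assumed convention)
    Returns ((cx, cy, w, h), score) with coordinates normalized to [0, 1].
    """
    rows, cols = len(heatmap), len(heatmap[0])
    # Pick the grid cell with the highest center score.
    best_r, best_c = max(
        ((r, c) for r in range(rows) for c in range(cols)),
        key=lambda rc: heatmap[rc[0]][rc[1]],
    )
    dx, dy = offsets[best_r][best_c]
    w, h = sizes[best_r][best_c]
    cx = (best_c + dx) / cols   # refine the coarse cell index with the offset
    cy = (best_r + dy) / rows
    return (cx, cy, w, h), heatmap[best_r][best_c]

# Illustrative 2x2 grid.
heatmap = [[0.1, 0.2], [0.9, 0.3]]
offsets = [[(0.5, 0.5), (0.5, 0.5)], [(0.5, 0.5), (0.5, 0.5)]]
sizes = [[(0.2, 0.1), (0.2, 0.1)], [(0.2, 0.1), (0.2, 0.1)]]
box, score = decode_box(heatmap, offsets, sizes)
```

The offset branch compensates for the quantization introduced by the patch stride, which matters most for the small targets emphasized in this setting.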

Training uses a weighted focal loss for classification, together with $L_1$ and Generalized-IoU box losses:

$\mathcal{L} = \mathcal{L}_{\mathrm{cls}} + \lambda_{1} \mathcal{L}_{1} + \lambda_{\mathrm{giou}} \mathcal{L}_{\mathrm{giou}},$

where $\lambda_{1}$ and $\lambda_{\mathrm{giou}}$ are scalar weights set in the paper. The training data mix comprises GOT-10k, LaSOT, COCO, TrackingNet, and the training split of UAV-Anti-UAV. The reported training setup uses AdamW with weight decay, batch size 32, separate learning rates for the visual backbone and for the other parameters, 300 epochs, and a 10× learning-rate decay after epoch 240 (Zhang et al., 8 Dec 2025).
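The box-regression part of such an objective can be sketched with the standard Generalized-IoU definition plus an absolute-difference ($L_1$) term. The weights `w_l1` and `w_giou` below are placeholders, not the paper's values, and the function names and corner-format boxes are assumptions made for illustration.

```python
def giou(box_a, box_b):
    """Generalized IoU for two non-degenerate boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    # Smallest axis-aligned box enclosing both inputs.
    hull = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return inter / union - (hull - union) / hull

def box_loss(pred, target, w_l1=1.0, w_giou=1.0):
    """Weighted sum of an L1 term and a GIoU term (placeholder weights)."""
    l1 = sum(abs(p - t) for p, t in zip(pred, target))
    return w_l1 * l1 + w_giou * (1.0 - giou(pred, target))

# Illustrative boxes: identical, then disjoint.
same = giou((0.0, 0.0, 1.0, 1.0), (0.0, 0.0, 1.0, 1.0))
apart = giou((0.0, 0.0, 1.0, 1.0), (2.0, 0.0, 3.0, 1.0))
```

Unlike plain IoU, the GIoU term remains informative for non-overlapping boxes because the enclosing-box penalty still varies with their separation.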

4. Broader uses of the label in spatio-temporal modeling

Outside the UAV tracker, the designation is used more loosely for Mamba-based spatio-temporal modeling across perception and forecasting tasks. The cross-spectral pedestrian detector “MambaST” is not called MambaSTS, but it is architecturally adjacent: it fuses RGB and thermal video with a plug-and-play spatial-temporal fuser, uses recurrent Mamba blocks across frames, and introduces Multi-head Hierarchical Patching and Aggregation (MHHPA) to retain both fine-grained and coarse-grained information (Gao et al., 2024). The model is designed to insert into standard pipelines such as YOLOv5L, processes mid-level RGB and thermal features at three scales, and carries the last hidden state forward between frames to model temporal continuity (Gao et al., 2024).

In multivariate forecasting, MambaTS is a distinct name but addresses a related question: how to adapt selective state-space models to long-term time series forecasting. Its four principal modifications are Variable Scan along Time (VST), a Temporal Mamba Block (TMB) without causal convolution, dropout on selective parameters, and Variable Permutation Training plus Variable-Aware Scan along Time (VPT/VAST) for scan-order robustness (Cai et al., 2024). The paper explicitly warns that “MambaSTS” is a naming variant or typo here rather than the actual model name (Cai et al., 2024).
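The scan-order idea behind permutation-based training can be sketched generically: shuffle the variable order before the model sees a sample, then map the outputs back to the original order. This is a minimal illustration of that pattern under the assumption that the permutation is drawn independently per training step; it is not the MambaTS implementation, and `permute_variables` and `restore_order` are hypothetical helper names.

```python
import random

def permute_variables(series_by_variable, rng):
    """Return the variables in a random order plus the order that was used."""
    order = list(range(len(series_by_variable)))
    rng.shuffle(order)
    shuffled = [series_by_variable[i] for i in order]
    return shuffled, order

def restore_order(outputs, order):
    """Undo the shuffle so outputs line up with the original variable indices."""
    restored = [None] * len(order)
    for position, original_index in enumerate(order):
        restored[original_index] = outputs[position]
    return restored

# Illustrative data: three variables, each a short series.
rng = random.Random(0)
data = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
shuffled, order = permute_variables(data, rng)
recovered = restore_order(shuffled, order)   # identity model, for checking only
```

Training over many such permutations exposes the scan to different variable orders, which is the motivation given for making the model less sensitive to any single ordering.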

Other papers use the term as a general design pattern. MCST-Mamba separates temporal and spatial processing into two dedicated Mamba blocks for multichannel traffic prediction and fuses them with learned weights, while using adaptive spatio-temporal embeddings (Hamad et al., 5 Jul 2025). STM2 and STM3 define a multiscale spatio-temporal Mamba family in which Multiscale Mamba extracts several temporal scales simultaneously and AGCCN performs adaptive graph causal convolution together with causal cross-scale attention (Chen et al., 17 Aug 2025). HIGSTM applies the paradigm to stock forecasting by combining index-guided frequency filtering, time-varying and global stock graphs, and Information-Guided Mamba in which macro information is concatenated into selective parameter generation (Yan et al., 14 Mar 2025).

A plausible implication is that “MambaSTS” has become a shorthand for architectures that exploit selective state-space recurrence along one or more spatio-temporal axes, even when the exact model name differs across papers.

5. Empirical characteristics and reported results

For the UAV-Anti-UAV benchmark, MambaSTS is reported as the highest-performing method among 50 modern trackers under one-pass evaluation. It is scored on AUC, Precision, Normalized Precision, Complete Success, and mean Accuracy, with an AUC of 43.7%. The paper compares these scores against the second-best tracker, SUTrack-B224, and against the mean over all 50 trackers (Zhang et al., 8 Dec 2025).

The same paper reports strong attribute-wise robustness, especially on Similar Distractors, where MambaSTS achieves a higher success score than MambaTrack and ORTrack. It also reports strong results on Fast Motion, Motion Blur, Small Object, Out-of-View, Camera Motion, and Rotation. The authors nevertheless note that Illumination Variations and Full Occlusion remain hard for all trackers, with even top methods achieving low success scores (Zhang et al., 8 Dec 2025).

Efficiency is a central part of the MambaSTS proposition. The tracker is reported at 54 FPS on a single NVIDIA A6000, and the paper presents this as sufficient for typical onboard requirements of at least 30 FPS (Zhang et al., 8 Dec 2025). In the broader literature, comparable efficiency claims recur. MambaST reports fusion compute of 5.43 GFLOPs and 22.52M parameters, compared with more than 100 GFLOPs and more than 130M parameters for CFT variants, while still improving KAIST multispectral pedestrian detection to LAMR 6.67% and 6.32% under the two reported evaluation settings (Gao et al., 2024). MambaTS gives a per-layer complexity analysis after patching and reports new state-of-the-art performance on most of eight public long-term forecasting datasets (Cai et al., 2024).
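As a rough worked check on the tracker frame rates quoted above, each rate can be converted to a per-frame time budget. This is plain arithmetic on the two quoted figures and assumes nothing about onboard hardware, which may be slower than the A6000 used for the reported measurement:

$\frac{1000\ \text{ms}}{54} \approx 18.5\ \text{ms per frame}, \qquad \frac{1000\ \text{ms}}{30} \approx 33.3\ \text{ms per frame}.$

On the benchmark GPU the tracker therefore uses a little over half of the per-frame budget implied by a 30 FPS requirement.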

These results do not establish a single common benchmark across all “MambaSTS”-like systems, but they consistently support the narrower claim made in the papers: selective state-space modeling is being used to trade quadratic attention costs for linear-time sequence handling in settings where long horizons, large sensor sets, or video recurrence are operationally important.

6. Misconceptions, limitations, and development directions

A common misconception is that MambaSTS is a single canonical model family with one stable definition. The cited literature does not support that reading. The term denotes a specific tracker in UAV-Anti-UAV research, appears as a naming variant or typo for MambaTS in long-term forecasting, and is also used as a broader descriptive label for spatio-temporal Mamba systems in traffic and long-horizon prediction (Zhang et al., 8 Dec 2025).

Another misconception is that replacing attention with Mamba automatically removes the need for task-specific structure. The surveyed papers indicate the opposite. MambaST relies on order-aware concatenation, hierarchical patching, and residual fusion for small pedestrian preservation; MCST-Mamba separates temporal and spatial scans; STM3 adds multiscale decomposition, adaptive graph learning, top-1 MoE routing, and causal contrastive learning; HIGSTM adds index-guided decomposition and graph construction; MambaSTS for UAV tracking adds language prompts, hierarchical visual tokens, and explicit temporal-token propagation (Gao et al., 2024).

The limitations are correspondingly task-specific. The UAV-Anti-UAV paper notes that the dataset is primarily RGB and that more sensing modalities such as IR and LiDAR are future work; it also states that even MambaSTS reaches only 43.7% AUC on the benchmark, indicating substantial headroom for progress (Zhang et al., 8 Dec 2025). MambaST notes degradation under extreme occlusion, severe cross-spectral misalignment, and noisy annotations (Gao et al., 2024). MambaTS highlights scan-order sensitivity, heuristic ATSP decoding via simulated annealing, and possible difficulty when the optimal variable order is non-stationary under distribution shift (Cai et al., 2024).

Taken together, the literature presents MambaSTS less as a fixed recipe than as a research direction: use selective state-space recurrence as the temporal backbone, then specialize the surrounding tokenization, graph structure, multiscale decomposition, or multimodal fusion to the geometry of the target problem.