
Spatio-Temporal Semantic Mamba-Attention

Updated 16 January 2026
  • The paper introduces STS-MA, which integrates semantic graph convolution with temporal state-space modeling for enhanced spatio-temporal forecasting.
  • It employs dual temporal encoders—local masked attention and Mamba SSM—to effectively capture long-range dependencies and transfer semantic patterns.
  • Empirical results demonstrate improved prediction accuracy and robustness in urban accident risk and traffic flow forecasting under noisy conditions.

Spatio-Temporal Semantic Mamba-Attention (STS-MA) is an advanced architectural module designed to model complex, long-range dependencies in spatio-temporal data, combining selective state-space memory with attention-based mechanisms. Its origins trace to recent developments in multi-task urban accident risk forecasting and efficient traffic flow prediction, prominently within the MLA-STNet and ST-MambaSync frameworks. STS-MA’s central innovation is the fusion of semantic graph convolution with temporal Mamba-style state-space modeling and local masked attention, resulting in high interpretability, scalability, and empirical robustness in spatio-temporal forecasting tasks (Fang et al., 9 Jan 2026, Shao et al., 2024).

1. Principal Architecture and Pipeline Integration

STS-MA operates as a specialized branch in dual-stream spatio-temporal forecasting pipelines. In MLA-STNet, it complements the Spatio-Temporal Geographical Mamba-Attention (STG-MA) grid branch with a semantic node-centric approach. Specifically, STS-MA ingests per-node, per-timestep semantic features $X^{sem}\in\mathbb{R}^{T\times N\times F_{sem}}$ alongside multi-type, city-specific support graphs, assembling them into a block-diagonal global adjacency $A\in\mathbb{R}^{N\times N}$ with $N=\sum_k N_k$ across $C$ cities. This design enables multi-task training with shared weights while preserving individual semantic spaces—critical for cross-city accident risk prediction (Fang et al., 9 Jan 2026).
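The block-diagonal assembly can be sketched in a few lines of numpy. This is an illustrative helper (`block_diag_adjacency` is not from the paper's code); it shows how per-city adjacencies are stacked without introducing any cross-city edges:

```python
import numpy as np

def block_diag_adjacency(city_adjs):
    """Assemble per-city adjacency matrices into one block-diagonal
    global adjacency A in R^{N x N}, N = sum_k N_k. Off-diagonal
    blocks stay zero, so no cross-city edges are created."""
    sizes = [a.shape[0] for a in city_adjs]
    N = sum(sizes)
    A = np.zeros((N, N))
    offset = 0
    for a, n in zip(city_adjs, sizes):
        A[offset:offset + n, offset:offset + n] = a
        offset += n
    return A

# Two toy "cities" with 2 and 3 nodes each -> global 5x5 adjacency.
A = block_diag_adjacency([np.ones((2, 2)), np.ones((3, 3))])
```

The same pattern extends to the per-support case (road, risk, POI), one block-diagonal matrix per support type.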

In traffic flow prediction (ST-MambaSync), STS-MA functions as the "ST-Mamba block" after joint spatio-temporal Transformer attention, consuming mixed context matrices and enhancing sequence modeling via linear state-space recurrence (Shao et al., 2024).

2. Semantic Embedding and Multi-Graph Convolution

STS-MA’s input features undergo two successive $1\times1$ convolutions (nodewise MLPs) to produce latent embeddings $H^{sem}\in\mathbb{R}^{T\times N\times D}$:

$$H^{sem} = \mathrm{Conv}_{1\times1}^{(2)}\bigl(\mathrm{ReLU}\bigl(\mathrm{Conv}_{1\times1}^{(1)}(X^{sem})\bigr)\bigr)$$
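Since a $1\times1$ convolution applies the same linear map to every (timestep, node) feature vector, the embedding stage reduces to two shared matrix multiplications. A minimal numpy sketch (random weights stand in for trained parameters):

```python
import numpy as np

def conv1x1(X, W, b):
    """A 1x1 convolution over the node axis is a shared linear map
    applied independently at every (t, n) position."""
    return X @ W + b

rng = np.random.default_rng(0)
T, N, F_sem, D = 12, 5, 8, 16
X_sem = rng.normal(size=(T, N, F_sem))          # X^{sem}
W1, b1 = rng.normal(size=(F_sem, D)), np.zeros(D)
W2, b2 = rng.normal(size=(D, D)), np.zeros(D)

# H^{sem} = Conv2(ReLU(Conv1(X^{sem})))
H_sem = conv1x1(np.maximum(conv1x1(X_sem, W1, b1), 0.0), W2, b2)
```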

It aggregates structural information from heterogeneous semantic supports—road, risk (historical crash co-occurrence), POI (points-of-interest), and a learnable "adaptive" low-rank support:

$$A^{adp} = \mathrm{Softmax}_{\mathrm{row}}\bigl(\mathrm{ReLU}(E_1 E_2^{T})\bigr)$$
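The adaptive support is cheap to compute from two low-rank node embeddings. A sketch, with random embeddings standing in for the trained $E_1, E_2$:

```python
import numpy as np

def adaptive_adjacency(E1, E2):
    """Learnable low-rank support: row-wise softmax of ReLU(E1 E2^T),
    as in the equation above. E1, E2 in R^{N x r} would be trained."""
    logits = np.maximum(E1 @ E2.T, 0.0)
    # Numerically stable row softmax.
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)

rng = np.random.default_rng(1)
N, r = 6, 3
A_adp = adaptive_adjacency(rng.normal(size=(N, r)), rng.normal(size=(N, r)))
```

Each row of `A_adp` is a probability distribution over nodes, so it plugs directly into the graph convolution as a fourth normalized support.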

Normalized supports $\widehat{A}_j$ are input to $L$ layers of multi-graph GCN (weights shared across cities):

$$S^{(l+1)}_{:,t,:} = \sigma\!\Bigl(\sum_{j=1}^{4} \widehat{A}_j\, S^{(l)}_{:,t,:}\, W_j^{(l)}\Bigr),\qquad l=0,\ldots,L-1$$

This yields temporally stratified node embeddings $S\in\mathbb{R}^{N\times T\times D}$. Such multi-graph convolution allows transfer of risk-inducing semantic patterns while respecting local road structures via block-diagonal adjacencies (Fang et al., 9 Jan 2026).
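One multi-graph GCN layer at a fixed timestep is a sum of per-support propagations followed by a nonlinearity. A minimal sketch (identity supports and random weights are placeholders, and $\sigma$ is taken to be ReLU here):

```python
import numpy as np

def multi_graph_gcn_layer(S_t, supports, weights):
    """One layer of the multi-graph GCN at a fixed timestep t:
    sum_j A_hat_j @ S @ W_j, then a ReLU nonlinearity (sigma)."""
    out = sum(A_hat @ S_t @ W for A_hat, W in zip(supports, weights))
    return np.maximum(out, 0.0)

rng = np.random.default_rng(2)
N, D = 5, 4
supports = [np.eye(N) for _ in range(4)]            # stand-ins for A_hat_j
weights = [rng.normal(size=(D, D)) for _ in range(4)]  # W_j^{(l)}
S_t = rng.normal(size=(N, D))
S_next = multi_graph_gcn_layer(S_t, supports, weights)
```

Stacking $L$ such layers (with fresh weights per layer) and applying them at every timestep produces the stratified embedding tensor $S$.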

3. Temporal Modeling: Fusion of Local Attention and Mamba SSM

STS-MA applies two parallel temporal encoders to node sequences:

  • Local Masked Multi-Head Attention (LMA): For each node $n$ and time $t$, attention is restricted to a sliding causal window $w$:

$$\alpha_{n,t,s} = \frac{\exp\Bigl(\frac{Q_{n,t}\cdot K_{n,s}}{\sqrt{d}}\Bigr)}{\sum_{u=\max(1,t-w+1)}^{t} \exp\Bigl(\frac{Q_{n,t}\cdot K_{n,u}}{\sqrt{d}}\Bigr)}$$

$$L_{n,t}=\sum_{s=\max(1,t-w+1)}^{t}\alpha_{n,t,s}V_{n,s}$$

  • Spatio-Temporal Mamba State-Space Model (STM): Independently for each node, selective memory is computed via:

$$\begin{aligned} \widetilde{A}_{n,t} &= \exp(\Delta t\, A_{\log}) \odot \sigma(W_a S_{n,t,:}) \\ \widetilde{B}_{n,t} &= W_b S_{n,t,:} \\ h_{n,t} &= \widetilde{A}_{n,t}\odot h_{n,t-1}+\widetilde{B}_{n,t} \\ G_{n,t} &= W_c h_{n,t} \end{aligned}$$

  • Channel-wise Fusion: Local and global signals are fused with adaptive gating and layer normalization:

$$U_{n,t} = \mathrm{LayerNorm}\bigl(S_{n,t,:} + W_f\,[L_{n,t}\,\Vert\,G_{n,t}]\bigr)$$

Only the latest timestep $U_{n,T}$ is retained for spatial projection (Fang et al., 9 Jan 2026).
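The three steps above can be sketched end-to-end for a single node's sequence. This is a minimal numpy illustration, not the authors' implementation: random matrices stand in for trained weights, `A_log` is kept negative so the memory decays, and the LayerNorm in the fusion is omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(3)
T, D, w = 12, 8, 6
S = rng.normal(size=(T, D))              # one node's embedding sequence
Wq, Wk, Wv = (rng.normal(size=(D, D)) for _ in range(3))
Wa, Wb, Wc = (rng.normal(size=(D, D)) for _ in range(3))
A_log = -np.abs(rng.normal(size=D))      # negative log-decay -> stable memory
Wf = rng.normal(size=(2 * D, D))
dt = 1.0
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

# (1) Local masked attention: each step attends only to its causal window w.
Q, K, V = S @ Wq, S @ Wk, S @ Wv
L = np.zeros((T, D))
for t in range(T):
    lo = max(0, t - w + 1)
    scores = Q[t] @ K[lo:t + 1].T / np.sqrt(D)
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()
    L[t] = alpha @ V[lo:t + 1]

# (2) Mamba-style selective recurrence: gated exponential-decay memory.
h = np.zeros(D)
G = np.zeros((T, D))
for t in range(T):
    A_t = np.exp(dt * A_log) * sigmoid(Wa @ S[t])   # selective decay gate
    h = A_t * h + Wb @ S[t]                         # state update
    G[t] = Wc @ h

# (3) Channel-wise fusion with a residual connection (LayerNorm omitted).
U = S + np.concatenate([L, G], axis=1) @ Wf
U_T = U[-1]                              # only the last step is retained
```

In the full model this runs in parallel over all $N$ nodes with multiple attention heads; the single-node loop above is just to make the recurrences explicit.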

In ST-MambaSync, STS-MA efficiently models long-range dependencies by running a selective SSM recursion mapped via parameter projections from a linearly transformed context embedding. It is formally equivalent to an attention-weighted sum plus a residual skip connection (Shao et al., 2024).

4. Spatial Projection and Output Fusion

After temporal modeling, node-level features $U_{n,T}$ are projected back onto the spatial grid via a city-block-diagonal mapping $M\in\{0,1\}^{(W\cdot H)\times N}$:

$$Y_{sem} = \mathrm{Reshape}(M\,U_{:,T}) \in \mathbb{R}^{D\times W\times H}$$

For each city, $Y_{sem}^{(k)}$ is fused with the geographical branch output $Y_{geo}^{(k)}$ using a learned sigmoid gate, followed by a $1\times1$ convolutional output head to generate risk maps (Fang et al., 9 Jan 2026). In ST-MambaSync, the STS-MA output is projected back from sequence to grid shape, added to the residual, normalized, and passed to a regression head (Shao et al., 2024).
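The projection and gated fusion can be sketched with small toy shapes. This is an illustrative sketch: node-to-cell assignments and the gate values are random stand-ins for learned quantities, and the `(W*H)`-major flattening order is an assumption:

```python
import numpy as np

rng = np.random.default_rng(4)
W_grid, H_grid, N, D = 4, 3, 5, 6

# Binary mapping M: each node occupies one grid cell (toy assignment).
M = np.zeros((W_grid * H_grid, N))
for j, cell in enumerate(rng.choice(W_grid * H_grid, size=N, replace=False)):
    M[cell, j] = 1.0

U_T = rng.normal(size=(N, D))                      # U_{:,T}
# Scatter node features onto the grid, then reshape to (D, W, H).
Y_sem = (M @ U_T).T.reshape(D, W_grid, H_grid)

# Learned sigmoid gate fusing semantic and geographical branch outputs.
Y_geo = rng.normal(size=(D, W_grid, H_grid))
gate = 1.0 / (1.0 + np.exp(-rng.normal(size=(D, W_grid, H_grid))))
Y = gate * Y_sem + (1.0 - gate) * Y_geo
```

Grid cells with no assigned node receive all-zero semantic features, so the gate lets the geographical branch dominate there.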

5. Parameter Sharing and Semantic Separation

All primary learnable weights—including Mamba kernels, attention projections, GCN weights, fusion, and adaptive adjacency matrices—are shared globally. However, city-specific semantics are maintained via independent blocks for road-topology, crash, and POI graphs, and mapping matrices. No cross-city adjacency connections are created, thus each city retains localized semantic representation and structural heterogeneity, while benefiting from shared modeling of patterns (Fang et al., 9 Jan 2026).

6. Training Procedures and Hyperparameterization

STS-MA modules are trained using the Adam optimizer with L2 weight decay, a sliding-window input of $T=12$ (forecasting horizon $Q=1$), a masked MSE loss excluding missing grid cells, and dropout on MLP layers ($p \approx 0.1$). Typical hyperparameters are $D \approx 64$, adaptive adjacency rank $r \approx 16$, attention window $w=6$, GCN depth $L=2$, and $4$ attention heads. In ST-MambaSync, the input and output horizons are matched ($12$ timesteps), with hidden dimension $d_h=80$ and ST-Mamba inner dimension $2\times d_h$ (Fang et al., 9 Jan 2026, Shao et al., 2024).
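For reference, the MLA-STNet settings above can be collected into a single configuration dictionary (an illustrative layout, not the authors' config format):

```python
# Representative STS-MA hyperparameters from the section above
# (MLA-STNet values; key names are illustrative).
STS_MA_CONFIG = {
    "input_window_T": 12,        # sliding-window input length
    "forecast_horizon_Q": 1,     # one-step-ahead risk map
    "hidden_dim_D": 64,          # latent embedding width
    "adaptive_rank_r": 16,       # low-rank adaptive adjacency
    "attention_window_w": 6,     # local masked attention window
    "gcn_depth_L": 2,            # multi-graph GCN layers
    "num_heads": 4,              # attention heads
    "dropout": 0.1,              # on MLP layers
    "optimizer": "Adam",         # with L2 weight decay
}
```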

7. Empirical Impact, Robustness, and Applicability

Empirical studies demonstrate STS-MA’s substantial impact:

  • MLA-STNet: Removal of STS-MA yields degraded predictive accuracy: for Chicago, RMSE rises $8.585\rightarrow 9.525$, Recall drops by $5\%$, MAP by $0.0147$; for NYC, RMSE increases slightly (Table 6). Multi-city joint training with STS-MA achieves RMSE $6.88$ (NYC) and $8.59$ (Chicago) with robust Recall/MAP, and the cross-city performance gains disappear if STS-MA is omitted (Table 4b) (Fang et al., 9 Jan 2026).
  • STS-MA improves hotspot localization, suppressing spurious predictions characteristic of grid-only or ablated models (Fig 13).
  • Robustness: MLA-STNet with STS-MA maintains $<1\%$ variation in RMSE/Recall/MAP under $50\%$ Gaussian feature noise, whereas alternatives degrade steadily.
  • ST-MambaSync: Hybrid attention-Mamba models outperform pure attention (STAEformer) or pure Mamba (ST-SSMs) in accuracy and computational efficiency; the $1+1$ hybrid achieves the lowest RMSE/MAE/MAPE and cuts FLOPS by $\sim 65\%$ versus deep attention models (PEMS08: MAE $13.30$ vs. $13.49$, RMSE $23.14$ vs. $23.30$) (Shao et al., 2024).
  • Explainability: The explicit per-step weights in the Mamba recurrence, with interpretable attention and memory components, offer diagnostic transparency beyond conventional RNNs.
  • Applicability: STS-MA generalizes to modeling spatio-temporal processes in climate, video, biomedical signals, IoT sensors, and financial networks requiring efficient long-sequence handling and joint global-local context modeling (Shao et al., 2024).

STS-MA’s integration of multi-graph semantic modeling, efficient selective memory, and cross-domain transfer makes it a foundational component for robust, interpretable, and scalable spatio-temporal forecasting frameworks.
