Spatio-Temporal Semantic Mamba-Attention
- The paper introduces STS-MA, which integrates semantic graph convolution with temporal state-space modeling for enhanced spatio-temporal forecasting.
- It employs dual temporal encoders—local masked attention and Mamba SSM—to effectively capture long-range dependencies and transfer semantic patterns.
- Empirical results demonstrate improved prediction accuracy and robustness in urban accident risk and traffic flow forecasting under noisy conditions.
Spatio-Temporal Semantic Mamba-Attention (STS-MA) is an advanced architectural module designed to model complex, long-range dependencies in spatio-temporal data, combining selective state-space memory with attention-based mechanisms. Its origins trace to recent developments in multi-task urban accident risk forecasting and efficient traffic flow prediction, prominently within the MLA-STNet and ST-MambaSync frameworks. STS-MA’s central innovation is the fusion of semantic graph convolution with temporal Mamba-style state-space modeling and local masked attention, resulting in high interpretability, scalability, and empirical robustness in spatio-temporal forecasting tasks (Fang et al., 9 Jan 2026, Shao et al., 2024).
1. Principal Architecture and Pipeline Integration
STS-MA operates as a specialized branch in dual-stream spatio-temporal forecasting pipelines. In MLA-STNet, it complements the Spatio-Temporal Geographical Mamba-Attention (STG-MA) grid branch with a semantic node-centric approach. Specifically, STS-MA ingests per-node, per-timestep semantic features alongside multi-type, city-specific support graphs, assembling them into a block-diagonal global adjacency across cities. This design enables multi-task training with shared weights while preserving individual semantic spaces—critical for cross-city accident risk prediction (Fang et al., 9 Jan 2026).
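The block-diagonal assembly can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function name `block_diag_adjacency` is ours, and each city is assumed to contribute a square adjacency matrix:

```python
import numpy as np

def block_diag_adjacency(city_adjs):
    """Assemble per-city adjacency matrices into one block-diagonal
    global adjacency; off-diagonal blocks stay zero, so no cross-city
    edges are ever introduced. (Illustrative sketch.)"""
    n = sum(a.shape[0] for a in city_adjs)
    A = np.zeros((n, n))
    offset = 0
    for a in city_adjs:
        k = a.shape[0]
        A[offset:offset + k, offset:offset + k] = a
        offset += k
    return A

# Two toy cities with 2 and 3 nodes each
A_nyc = np.ones((2, 2))
A_chi = np.ones((3, 3))
A_global = block_diag_adjacency([A_nyc, A_chi])
assert A_global.shape == (5, 5)
assert A_global[:2, 2:].sum() == 0   # no cross-city connections
```

Because the off-diagonal blocks are zero, graph convolution over `A_global` never mixes nodes across cities, which is what lets shared weights coexist with per-city semantic spaces.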
In traffic flow prediction (ST-MambaSync), STS-MA functions as the "ST-Mamba block" after joint spatio-temporal Transformer attention, consuming mixed context matrices and enhancing sequence modeling via linear state-space recurrence (Shao et al., 2024).
2. Semantic Embedding and Multi-Graph Convolution
STS-MA’s input features undergo two successive convolutions (nodewise MLPs) that produce latent embeddings per node and timestep.
It aggregates structural information from heterogeneous semantic supports: road topology, risk (historical crash co-occurrence), POI (points-of-interest), and a learnable "adaptive" low-rank support.
The normalized supports are fed into stacked multi-graph GCN layers whose weights are shared across cities.
This yields temporally stratified node embeddings. Such multi-graph convolution allows transfer of risk-inducing semantic patterns while respecting local road structures via block-diagonal adjacencies (Fang et al., 9 Jan 2026).
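A single multi-graph GCN layer of this kind can be sketched as a sum over supports. The function names, the random-walk normalization, and the ReLU-based low-rank adaptive support are our assumptions for illustration, not the paper's exact formulation:

```python
import numpy as np

def normalize(A):
    """Row-normalize a support, D^{-1} A (simple random-walk
    normalization; the paper may use a different scheme)."""
    d = A.sum(axis=1, keepdims=True)
    return A / np.maximum(d, 1e-8)

def multi_graph_gcn(X, supports, weights):
    """One multi-graph GCN layer: ReLU( sum_s norm(A_s) X W_s )."""
    out = sum(normalize(A) @ X @ W for A, W in zip(supports, weights))
    return np.maximum(out, 0.0)

rng = np.random.default_rng(0)
N, d_in, d_out, r = 6, 4, 8, 2
X = rng.normal(size=(N, d_in))                      # node features
A_road = (rng.random((N, N)) > 0.5).astype(float)   # fixed road support
# Learnable low-rank "adaptive" support, sketched as relu(E1 @ E2.T)
E1, E2 = rng.normal(size=(N, r)), rng.normal(size=(N, r))
A_adp = np.maximum(E1 @ E2.T, 0.0)
weights = [rng.normal(size=(d_in, d_out)) for _ in range(2)]
H = multi_graph_gcn(X, [A_road, A_adp], weights)
assert H.shape == (N, d_out)
```

In the actual model this layer is applied per timestep with the road, risk, POI, and adaptive supports, and the weight list is shared across all cities.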
3. Temporal Modeling: Fusion of Local Attention and Mamba SSM
STS-MA applies two parallel temporal encoders to node sequences:
- Local Masked Multi-Head Attention (LMA): For each node and timestep, attention is restricted to a sliding causal window of recent timesteps.
- Spatio-Temporal Mamba State-Space Model (STM): Independently for each node, selective memory is computed via a Mamba-style state-space recurrence.
- Channel-wise Fusion: Local and global signals are fused with adaptive gating followed by layer normalization.
Only the latest timestep is retained for spatial projection (Fang et al., 9 Jan 2026).
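The sliding causal window of the LMA branch can be sketched with a boolean mask. This is a single-head, projection-free simplification (real LMA uses learned Q/K/V projections and multiple heads); the helper names are ours:

```python
import numpy as np

def sliding_causal_mask(T, w):
    """Boolean (T, T) mask: position t may attend only to positions
    max(0, t-w+1) .. t, i.e. a causal window of width w."""
    idx = np.arange(T)
    diff = idx[:, None] - idx[None, :]   # t - t'
    return (diff >= 0) & (diff < w)

def local_masked_attention(X, w):
    """Scaled dot-product attention over a causal sliding window
    (sketch without learned projections)."""
    T, d = X.shape
    scores = X @ X.T / np.sqrt(d)
    scores = np.where(sliding_causal_mask(T, w), scores, -1e9)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ X

X = np.random.default_rng(1).normal(size=(10, 4))   # one node's sequence
Y = local_masked_attention(X, w=3)
assert Y.shape == X.shape
```

The mask guarantees every timestep attends to itself, so the softmax rows are always well defined even at the sequence start.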
In ST-MambaSync, STS-MA efficiently models long-range dependencies by running a selective SSM recursion mapped via parameter projections from a linearly transformed context embedding. It is formally equivalent to an attention-weighted sum plus a residual skip connection (Shao et al., 2024).
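Under the standard discretized selective-SSM formulation (our notation, consistent with Mamba-style models but not copied from the paper), the per-node recurrence and its unrolled attention-like form read:

```latex
h_t = \bar{A}_t h_{t-1} + \bar{B}_t x_t, \qquad y_t = C_t h_t + D x_t,
\qquad\Longrightarrow\qquad
y_t = \sum_{s \le t} C_t \Big(\textstyle\prod_{k=s+1}^{t} \bar{A}_k\Big) \bar{B}_s \, x_s + D x_t,
```

where the input-dependent $\bar{A}_t$, $\bar{B}_t$, $C_t$ are the "selective" parameter projections. The unrolled sum is a data-dependent weighted aggregation over past inputs, and the $D x_t$ term is the residual skip, matching the stated equivalence to an attention-weighted sum plus skip connection.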
4. Spatial Projection and Output Fusion
After temporal modeling, node-level features are projected back onto the spatial grid via a city-specific block-diagonal mapping.
For each city, the projected semantic output is fused with the geographical branch output using a learned sigmoid gate, followed by a convolutional output head that generates risk maps (Fang et al., 9 Jan 2026). In ST-MambaSync, the STS-MA output is projected back from sequence to grid shape, added to the residual, normalized, and passed to a regression head (Shao et al., 2024).
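The sigmoid-gated fusion of the two branch outputs can be sketched as a convex combination. Shapes, parameter names, and the concatenation-based gate are our assumptions for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_fusion(sem, geo, Wg, bg):
    """Fuse semantic-branch and geographical-branch feature maps with a
    learned per-channel sigmoid gate: g * sem + (1 - g) * geo.
    (Sketch; the gate parameterization in the paper may differ.)"""
    g = sigmoid(np.concatenate([sem, geo], axis=-1) @ Wg + bg)
    return g * sem + (1.0 - g) * geo

rng = np.random.default_rng(2)
H, W, C = 4, 4, 3                       # toy grid with C channels
sem = rng.normal(size=(H, W, C))        # semantic (STS-MA) branch
geo = rng.normal(size=(H, W, C))        # geographical (STG-MA) branch
Wg = rng.normal(size=(2 * C, C)) * 0.1  # gate projection
bg = np.zeros(C)
fused = gated_fusion(sem, geo, Wg, bg)
assert fused.shape == (H, W, C)
```

Because the gate lies in $(0, 1)$, each fused value is elementwise bounded by the two branch values, so neither branch can be entirely discarded.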
5. Parameter Sharing and Semantic Separation
All primary learnable weights—including Mamba kernels, attention projections, GCN weights, fusion parameters, and the adaptive adjacency factors—are shared globally. City-specific semantics are nonetheless maintained via independent adjacency blocks for the road-topology, crash, and POI graphs and via per-city mapping matrices. Because no cross-city adjacency connections are created, each city retains its localized semantic representation and structural heterogeneity while benefiting from shared modeling of common patterns (Fang et al., 9 Jan 2026).
6. Training Procedures and Hyperparameterization
STS-MA modules are trained with the Adam optimizer and L2 weight decay, a sliding-window input with a fixed forecasting horizon, a masked MSE loss that excludes missing grid cells, and dropout on the MLP layers. Reported hyperparameters include the adaptive adjacency rank, the local attention window length, the GCN depth, and $4$ attention heads. In ST-MambaSync, the input and output horizons are matched ($12$ timesteps), with fixed hidden and ST-Mamba inner dimensions (Fang et al., 9 Jan 2026, Shao et al., 2024).
7. Empirical Impact, Robustness, and Applicability
Empirical studies demonstrate STS-MA’s substantial impact:
- MLA-STNet: Removing STS-MA degrades predictive accuracy: for Chicago, RMSE rises, Recall drops, and MAP falls by $0.0147$; for NYC, RMSE increases slightly (Table 6). Multi-city joint training with STS-MA achieves RMSE $6.88$ (NYC) and $8.59$ (Chicago) with robust Recall/MAP, and the cross-city performance gains disappear if STS-MA is omitted (Table 4b) (Fang et al., 9 Jan 2026).
- STS-MA improves hotspot localization, suppressing spurious predictions characteristic of grid-only or ablated models (Fig 13).
- Robustness: MLA-STNet with STS-MA shows only small variation in RMSE/Recall/MAP under Gaussian feature noise, whereas alternatives degrade steadily.
- ST-MambaSync: Hybrid attention-Mamba models outperform pure attention (STAEformer) and pure Mamba (ST-SSMs) in both accuracy and computational efficiency; the $1+1$ hybrid achieves the lowest RMSE/MAE/MAPE and cuts FLOPs relative to deeper attention models (PEMS08: MAE $13.30$ vs. $13.49$, RMSE $23.14$ vs. $23.30$) (Shao et al., 2024).
- Explainability: The explicit per-step weights in the Mamba recurrence, with interpretable attention and memory components, offer diagnostic transparency beyond conventional RNNs.
- Applicability: STS-MA generalizes to modeling spatio-temporal processes in climate, video, biomedical signals, IoT sensors, and financial networks requiring efficient long-sequence handling and joint global-local context modeling (Shao et al., 2024).
STS-MA’s integration of multi-graph semantic modeling, efficient selective memory, and cross-domain transfer makes it a foundational component for robust, interpretable, and scalable spatio-temporal forecasting frameworks.