
ST-MoE: Untrammelled Spatiotemporal Experts

Updated 2 January 2026
  • The paper demonstrates that ST-MoE overcomes quadratic scaling and rigid positional encodings by leveraging a mixture of experts for efficient spatiotemporal modeling.
  • It introduces specialized experts using bidirectional state-space modules (SMamba/TMamba) that decompose and fuse spatial and temporal features.
  • Applications in motion prediction and mobility forecasting show significant speedup and parameter reduction, underscoring practical efficiency gains.

The Spatiotemporal-Untrammelled Mixture of Experts (ST-MoE) is an architectural paradigm that addresses heterogeneity and efficiency bottlenecks in learning complex spatial-temporal dependencies within dynamic sequence data. Originating from multi-person motion prediction in human pose estimation, the ST-MoE framework flexibly exploits distinct spatiotemporal correlations through a mixture of neural “experts,” each specializing in a particular decomposition of spatial and temporal modeling. The untrammelled MoE design denotes unconstrained, data-driven routing of input representations to specialized experts, thereby overcoming architectural rigidity and quadratic scaling prevalent in prior transformer-based and attention-centric models (Yin et al., 25 Dec 2025).

1. Core Principles of Spatiotemporal-Untrammelled Mixture of Experts

ST-MoE fundamentally departs from conventional spatiotemporal modeling by eschewing fixed positional encodings and statically grouped attention. Instead, it introduces a pool of experts—each composed of bidirectional state-space modules, such as Spatial Mamba (SMamba) and Temporal Mamba (TMamba)—to adaptively mine intricate, cross-dimensional dependencies native to spatiotemporal data. The architecture’s untrammelled gating mechanism dynamically routes each input sample to a subset of experts based on a data-driven gating network, rather than fixed partitions by spatial or temporal locality.

The central innovation lies in the explicit composition of spatial (“S”) and temporal (“T”) aggregation modules: experts are instantiated as SS (spatial-spatial), TT (temporal-temporal), ST (spatial-temporal), and TS (temporal-spatial), each imposing a unique hierarchical filtering before expert outputs are fused (Yin et al., 25 Dec 2025). This flexibility yields superior coverage of underlying data-generating processes that exhibit nonuniform and entangled spatiotemporal correlations.

2. Formalization of Spatiotemporal Experts and Mamba Blocks

Each expert in ST-MoE applies a specific sequence of bidirectional SMamba and TMamba blocks. These blocks implement input-conditioned, linear-time state-space models for efficient and expressive sequence modeling—contrasting the quadratic time/space complexity of classical self-attention.
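The bidirectional construction can be sketched as a thin wrapper around any unidirectional sequence block. The following PyTorch sketch is illustrative rather than the authors' implementation: the inner `make_block` callable stands in for a concrete Mamba/SSM layer, and fusing the two scan directions by summation is an assumption.

```python
import torch.nn as nn


class BiSSM(nn.Module):
    """Bidirectional state-space block: one unidirectional block scans the
    sequence left-to-right, a second scans it right-to-left, and the two
    outputs are fused (here by summation, an assumed fusion choice)."""

    def __init__(self, make_block):
        super().__init__()
        self.fwd = make_block()   # e.g. a Mamba layer mapping (B, L, C) -> (B, L, C)
        self.bwd = make_block()

    def forward(self, x):                      # x: (B, L, C)
        y_fwd = self.fwd(x)                    # scan in sequence order
        y_bwd = self.bwd(x.flip(1)).flip(1)    # scan the reversed sequence, then flip back
        return y_fwd + y_bwd                   # fuse the two directions
```

An SMamba block would treat the spatial axis of $F_{\rm in}$ as the scan length, while a TMamba block scans the temporal axis after a rearrange.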

Expert definitions, for input $F_{\rm in}\in\mathbb{R}^{B\times D\times T}$:

  • SS Expert: $F' = \mathrm{Bi\text{-}SMamba}(F_{\rm in}),\quad F_{\rm out}^{\rm SS} = \mathrm{Bi\text{-}SMamba}(F')$.
  • TT Expert: $G' = \mathrm{Bi\text{-}TMamba}(\mathrm{rearrange}(F_{\rm in})),\quad F_{\rm out}^{\rm TT} = \mathrm{rearrange}(\mathrm{Bi\text{-}TMamba}(G'))$.
  • ST Expert: $F'' = \mathrm{rearrange}(\mathrm{Bi\text{-}SMamba}(F_{\rm in})),\quad F_{\rm out}^{\rm ST} = \mathrm{rearrange}(\mathrm{Bi\text{-}TMamba}(F''))$.
  • TS Expert: $G'' = \mathrm{Bi\text{-}TMamba}(\mathrm{rearrange}(F_{\rm in})),\quad F_{\rm out}^{\rm TS} = \mathrm{rearrange}(\mathrm{Bi\text{-}SMamba}(G''))$.

Each Mamba block operates via parameterized, input-dependent SSMs, with shared parameters $A, \Delta, B$ across all spatial (and, separately, temporal) experts, leading to significant parameter economy (Yin et al., 25 Dec 2025).
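Read literally, the four expert paths are simply compositions of the bidirectional spatial and temporal blocks with an axis rearrangement in between. A minimal sketch, assuming `smamba` and `tmamba` are bidirectional blocks such as the `BiSSM` wrapper above and that `rearrange` just swaps the spatial and temporal axes (layout handling inside the blocks is abstracted away):

```python
def st_experts(f_in, smamba, tmamba):
    """Literal transcription of the SS/TT/ST/TS expert definitions.

    f_in:   tensor of shape (B, D, T)
    smamba: bidirectional spatial block, e.g. BiSSM(...)
    tmamba: bidirectional temporal block, e.g. BiSSM(...)
    """
    def rearrange(x):
        # swap spatial and temporal axes: (B, D, T) <-> (B, T, D)
        return x.transpose(1, 2)

    out_ss = smamba(smamba(f_in))                          # spatial, then spatial again
    out_tt = rearrange(tmamba(tmamba(rearrange(f_in))))    # temporal twice, back to (B, D, T)
    out_st = rearrange(tmamba(rearrange(smamba(f_in))))    # spatial first, then temporal
    out_ts = rearrange(smamba(tmamba(rearrange(f_in))))    # temporal first, then spatial
    return out_ss, out_tt, out_st, out_ts
```

Sharing one `smamba` and one `tmamba` instance across all four paths mirrors the shared spatial and temporal parameters noted above, although the exact granularity of sharing in the paper is not reproduced here.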

3. Gating and Expert Aggregation

ST-MoE employs a lightweight MLP-based gating network $g$ that generates selection scores for all experts. The routing policy preserves efficiency and supports sparsity via TopK selection (the principal instance uses $k=4$, i.e., all four experts), enabling the aggregation:

$$E_{\rm output} = \sum_{e=1}^{N} p_e\, f_e(F_{\rm input}), \qquad p_e = \mathrm{softmax}\big(\mathrm{TopK}(g(F_{\rm input}), k)\big)_e$$

This gating is “untrammelled” in that it does not enforce assignment of tokens to experts along any pre-defined spatiotemporal axes, thus maximizing representational flexibility and facilitating data-adaptive specialization. No auxiliary regularization beyond standard load-balance constraints is required (Yin et al., 25 Dec 2025).
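A compact sketch of this gated aggregation follows. The gate here scores each sample from a globally pooled summary of the input; the pooling choice and the gate MLP width are assumptions, while the MLP gate, TopK selection, and softmax weighting follow the formulation above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class UntrammelledGate(nn.Module):
    """TopK-gated aggregation over spatiotemporal experts, following
    E_out = sum_e p_e * f_e(F_in) with p_e = softmax(TopK(g(F_in), k))_e."""

    def __init__(self, feat_dim, num_experts=4, k=4):
        super().__init__()
        self.gate = nn.Sequential(                 # lightweight MLP gate g
            nn.Linear(feat_dim, feat_dim), nn.GELU(),
            nn.Linear(feat_dim, num_experts),
        )
        self.k = k

    def forward(self, f_in, experts):              # experts: list of callables (SS, TT, ST, TS)
        scores = self.gate(f_in.mean(dim=-1))      # pool over time: (B, D, T) -> (B, D) -> (B, E)
        topk_vals, topk_idx = scores.topk(self.k, dim=-1)
        masked = torch.full_like(scores, float("-inf"))
        masked.scatter_(1, topk_idx, topk_vals)    # experts outside the TopK set get -inf
        p = F.softmax(masked, dim=-1)              # and therefore zero weight after softmax
        expert_outs = torch.stack([f(f_in) for f in experts], dim=1)   # (B, E, D, T)
        return torch.einsum("be,be...->b...", p, expert_outs)          # weighted fusion
```

With $k=4$ and four experts this reduces to a dense, softmax-weighted fusion; smaller $k$ yields sparse routing.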

4. Computational and Modeling Advantages

ST-MoE addresses two foundational bottlenecks. (i) Fixed positional encodings: unlike attention-based transformers, ST-MoE requires no explicit positional encodings to inject order information; instead it learns four specialized, data-adaptive transformation paths via its mixture of experts. (ii) Quadratic scaling of attention: replacing global self-attention with linear-time, input-conditioned SSMs (Mamba) in both the spatial and temporal subspaces reduces per-block complexity to $O(DTN)$, versus $O((DT)^2)$ for attention over the flattened input (Yin et al., 25 Dec 2025).
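To make the scale difference concrete, consider hypothetical sizes of $D = 45$ joint channels (three people with 15 joints each), $T = 50$ frames, and an SSM state size of $N = 16$; these numbers are illustrative only and do not come from the paper:

```latex
\[
  \underbrace{D\,T\,N}_{\text{linear SSM block}} = 45 \cdot 50 \cdot 16 = 3.6\times 10^{4},
  \qquad
  \underbrace{(D\,T)^{2}}_{\text{flattened self-attention}} = (45 \cdot 50)^{2} \approx 5.1\times 10^{6}.
\]
```

At this assumed problem size the linear-time path is roughly two orders of magnitude cheaper per block, and the gap widens as $D$ or $T$ grows.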

In multi-person motion prediction, this architecture demonstrated a 3.6× training speedup and a 41.38% reduction in model parameters while slightly improving mean per-joint position error (JPE) over the prior best method (95 mm vs. 96 mm on CMU-Mocap/UMPM). Relative to IAFormer, wall-clock per-iteration training time dropped from 1134 ms to 314 ms and model size from 2.61 MB to 1.53 MB (Yin et al., 25 Dec 2025).

5. Application in Spatiotemporal and Mobility Forecasting Domains

The ST-MoE paradigm generalizes across spatiotemporally structured sequence modeling tasks. In cross-city human mobility forecasting, ST-MoE layers are incorporated atop a BERT transformer backbone, as in ST-MoE-BERT (He et al., 2024), with a sparse MoE router (top-2 of $K=8$ experts per input) applied to the [CLS] global context token. The architecture’s strengths include enhanced expert specialization, the ability to flexibly learn from multi-source data, and robust performance under transfer learning. ST-MoE-BERT achieved improvements of up to +8.29% GEO-BLEU when augmented with transfer learning from source to target cities, exceeding previous baselines on mobility sequence prediction (He et al., 2024).
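A schematic of this routing pattern is sketched below. It assumes a standard BERT hidden size of 768, two-layer expert MLPs, and a residual combination; only the top-2-of-8 routing on the [CLS] token comes from the ST-MoE-BERT description, the rest are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ClsTokenMoE(nn.Module):
    """Sparse MoE head on the [CLS] representation of a BERT-style backbone:
    an MLP router picks the top-2 of 8 experts per sequence and mixes their
    outputs with softmax weights (expert width and residual are assumptions)."""

    def __init__(self, hidden=768, num_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(hidden, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden, 4 * hidden), nn.GELU(),
                          nn.Linear(4 * hidden, hidden))
            for _ in range(num_experts)
        )
        self.k = k

    def forward(self, cls_vec):                          # cls_vec: (B, hidden)
        logits = self.router(cls_vec)                    # (B, num_experts)
        topk_vals, topk_idx = logits.topk(self.k, dim=-1)
        gates = F.softmax(topk_vals, dim=-1)             # weights over the k selected experts
        out = torch.zeros_like(cls_vec)
        for slot in range(self.k):                       # for each selected-expert slot
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e            # samples routed to expert e in this slot
                if mask.any():
                    out[mask] += gates[mask, slot].unsqueeze(-1) * expert(cls_vec[mask])
        return cls_vec + out                             # residual combination (assumption)
```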

Beyond motion or mobility domains, the mixture-of-experts framework is instantiated in generic spatiotemporal transformers via soft, differentiable MoE blocks within spatial-temporal attention layers. For human motion prediction, this injected substantial model capacity while retaining real-time inference performance, e.g., reducing MAE@24 from 18.23 to 17.64 at nearly identical latency (Shieh et al., 2024).

6. Architectural Comparison and Efficiency Table

| Model | Key Innovation | Speedup / Efficiency Gains | Benchmark Result |
|---|---|---|---|
| IAFormer | Quadratic attention, positional encodings | None (baseline) | 96 mm JPE |
| ST-MoE (Yin et al., 25 Dec 2025) | Four-expert MoE, linear SSMs, shared parameters | 3.6× faster training, −41.38% parameters | 95 mm JPE |
| ST-MoE-BERT (He et al., 2024) | MoE on transformer global token, transfer learning | +8.29% GEO-BLEU, −15.5% DTW vs. BERT | 28.7% accuracy |
| ST-MoE block (Shieh et al., 2024) | Soft MoE in ST-Transformer layer | +1.5k× params at the same latency | 17.64 MAE@24 |

7. Limitations and Future Directions

ST-MoE models exhibit dependence on sufficient pretraining data in source domains for effective transfer and may benefit from further advances in expert routing schemes, such as dynamic-$K$ selection or entropy-based regularization. Temporal encodings could be further enriched using learned continuous representations. Extensions under consideration include the integration of large language models (LLMs) for prompt-guided sequence modeling and adversarial alignment for domain adaptation (Yin et al., 25 Dec 2025, He et al., 2024). A plausible implication is that the architectural and representational flexibility of “untrammelled” MoE gating could generalize further to other applications characterized by heterogeneous, entangled spatial and temporal dependencies.
