HSTMixer: Hierarchical Spatio-Temporal MLP
- HSTMixer is a novel all-MLP architecture for large-scale traffic forecasting that efficiently captures multi-resolution spatio-temporal dynamics.
- Its design features hierarchical spatio-temporal mixing blocks and adaptive region-specific MLPs to achieve linear computational complexity and scalable performance.
- Empirical evaluations on datasets such as CA demonstrate significant improvements in MAE, RMSE, and MAPE compared to transformer and GNN methods.
The Hierarchical Spatio-Temporal Mixer (HSTMixer) is an all-MLP architecture for large-scale traffic forecasting, designed to capture multi-resolution dynamics over spatio-temporal graphs with up to tens of thousands of sensor nodes. Its architecture is built around the hierarchical composition of spatio-temporal mixing blocks and adaptive region-specific MLP parameterizations, which yields state-of-the-art predictive accuracy at computational complexity linear in both the node and time dimensions (Wang et al., 26 Nov 2025).
1. Architectural Overview
HSTMixer is structured to address the prohibitive computational cost common to transformer and GNN-based spatiotemporal forecasting methods, replacing self-attention or message-passing with MLP-based mixing. Its design centers on two pillars:
- Hierarchical Spatio-Temporal Mixing Blocks (ST-blocks):
  - Each ST-block performs bottom-up (aggregative) compression of temporal and spatial features to coarser (macro) representations, followed by top-down propagation that reincorporates these macro features into finer (micro) resolutions.
  - The bottom-up path groups temporal input into windows and aggregates node-level features into regions at multiple spatial scales. The top-down path disseminates information from coarse to fine spatial and temporal resolutions.
- Adaptive Region Mixer:
  - At each spatial scale, an adaptive region mixer generates region-specific MLP weights from a small parameter pool, allowing semantically similar regions to share transformation matrices while preserving distinct treatments for dissimilar regions.
By stacking such ST-blocks, HSTMixer constructs a spatiotemporal feature pyramid over both time and space. Final forecast outputs are obtained by fusing all levels of the hierarchy.
2. Mathematical Formulation and Model Components
Key Notation
- $N$: Number of nodes (sensors)
- $T$: Input history length
- $T'$: Forecast horizon
- $D$: Hidden feature dimension
- $L$: Number of ST-blocks
- $w$: Temporal window length
- $S$: Number of spatial scales
- $N_s$: Number of regions at scale $s$ ($N_0 = N$)
Data Embedding
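The raw time series $X \in \mathbb{R}^{T \times N \times C}$ (with $C$ input channels) is embedded into the hidden dimension $D$ via
$$H = \mathrm{FC}(X) \in \mathbb{R}^{T \times N \times D},$$
accompanied by static spatial embeddings $E^{\mathrm{st}}$ (Node2Vec) and learnable dynamic embeddings $E^{\mathrm{dy}}$, summed to give $E^{\mathrm{sp}} = E^{\mathrm{st}} + E^{\mathrm{dy}}$. Temporal embeddings $E^{\mathrm{tp}}_1$ and $E^{\mathrm{tp}}_2$ are aggregated as $E^{\mathrm{tp}}$.
The input to the first ST-block, $H^{(1)}$, combines the data embedding $H$ with the spatial embedding $E^{\mathrm{sp}}$ and the temporal embedding $E^{\mathrm{tp}}$.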
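The embedding step can be illustrated with a minimal NumPy sketch. It assumes a single linear projection for the raw series and additive fusion of the spatial and temporal embeddings; the fusion operator, the function name `embed_inputs`, and all sizes are illustrative assumptions rather than details confirmed by the source.

```python
import numpy as np

rng = np.random.default_rng(0)
T, N, C, D = 12, 5, 1, 8  # history length, nodes, input channels, hidden size (illustrative)

def embed_inputs(x, w_in, b_in, e_static, e_dynamic, e_temporal):
    """Project raw readings to D dims and add spatial and temporal embeddings."""
    h = x @ w_in + b_in                 # (T, N, C) -> (T, N, D)
    e_spatial = e_static + e_dynamic    # (N, D): static (Node2Vec-style) + learnable
    # Additive fusion is an assumption; broadcasting aligns the node and time axes.
    return h + e_spatial[None, :, :] + e_temporal[:, None, :]

x = rng.normal(size=(T, N, C))
h1 = embed_inputs(
    x,
    w_in=rng.normal(size=(C, D)),
    b_in=np.zeros(D),
    e_static=rng.normal(size=(N, D)),
    e_dynamic=rng.normal(size=(N, D)),
    e_temporal=rng.normal(size=(T, D)),
)
print(h1.shape)  # (12, 5, 8)
```

The result keeps one $D$-dimensional vector per (time step, node) pair, which is the layout the mixing blocks operate on.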
Bottom-Up Aggregation
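An ST-block's input $H^{(l)} \in \mathbb{R}^{T_l \times N \times D}$ undergoes:
- Temporal Aggregation Mixer: Temporal frames are grouped into windows of length $w$. For block $l$,
$$\tilde{H}^{(l)} = \mathrm{Reshape}\big(H^{(l)}\big) \in \mathbb{R}^{(T_l / w) \times w \times N \times D}.$$
Two parallel window-mixing MLPs, each structured as FC → activation → FC with a positional embedding $P$, process every window; one branch gates the other, yielding a single aggregated feature per window and node, $Z_0^{(l)} \in \mathbb{R}^{(T_l / w) \times N \times D}$.
- Spatial Aggregation Path: For each scale $s \in \{1, \dots, S\}$, a learned FC along the node axis aggregates node (spatial) features to coarser regions:
$$Z_s^{(l)} = \mathrm{FC}^{\mathrm{agg}}_s\big(Z_0^{(l)}\big) \in \mathbb{R}^{(T_l / w) \times N_s \times D}.$$
Original (fine) node features $Z_0^{(l)}$ are retained for node-level mixing.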
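The bottom-up path can be sketched as follows. This is a hedged illustration, not the authors' implementation: the sigmoid gate, the ReLU activation, the flattening of each window into one vector, and the helper names (`mlp`, `temporal_aggregate`, `spatial_aggregate`, `make_params`) are all assumptions chosen to make the two steps concrete.

```python
import numpy as np

rng = np.random.default_rng(0)
T, N, D, w, n_regions = 12, 6, 4, 3, 2  # illustrative sizes

def mlp(x, w1, w2):
    """FC -> ReLU -> FC applied on the last axis."""
    return np.maximum(x @ w1, 0.0) @ w2

def temporal_aggregate(h, pos, params_value, params_gate):
    """Group frames into windows of length w and merge each window by gating."""
    t, n, d = h.shape
    win = h.reshape(t // w, w, n, d) + pos[None, :, None, :]    # add position within window
    flat = win.transpose(0, 2, 1, 3).reshape(t // w, n, w * d)  # one vector per window and node
    value = mlp(flat, *params_value)
    gate = 1.0 / (1.0 + np.exp(-mlp(flat, *params_gate)))       # sigmoid gate (assumed)
    return value * gate                                          # (t // w, n, d)

def spatial_aggregate(z, w_agg):
    """Learned FC along the node axis: N nodes -> n_regions regions."""
    return np.einsum("tnd,nr->trd", z, w_agg)

def make_params():
    """Random weights for one window-mixing MLP (hidden width 2 * D, illustrative)."""
    return rng.normal(size=(w * D, 2 * D)), rng.normal(size=(2 * D, D))

h = rng.normal(size=(T, N, D))
pos = rng.normal(size=(w, D))
z0 = temporal_aggregate(h, pos, make_params(), make_params())
z1 = spatial_aggregate(z0, rng.normal(size=(N, n_regions)))
print(z0.shape, z1.shape)  # (4, 6, 4) (4, 2, 4)
```

Both steps shrink one axis only: temporal aggregation reduces 12 frames to 4 windows, and spatial aggregation reduces 6 nodes to 2 regions while the node-level tensor `z0` is kept for later node-level mixing.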
Adaptive Region Mixer
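For $s \ge 1$, region-scale features are transformed by region-specific MLPs whose weights are generated adaptively:
- Parameter pool (per scale $s$, with $M$ entries):
  - Keys: $K_s \in \mathbb{R}^{M \times D}$
  - Base weights: $W_s = \{W_{s,m}\}_{m=1}^{M}$
- Region-to-key similarity (over time):
$$A_s = \mathrm{sim}\big(\bar{Z}_s, K_s\big) \in \mathbb{R}^{N_s \times M},$$
where $\bar{Z}_s$ denotes the region features at scale $s$ summarized over the time axis.
- Transformation matrices:
$$\Theta_{s,r} = \sum_{m=1}^{M} A_s[r, m]\, W_{s,m}.$$
Per-region features $Z_s[:, r]$ are transformed using $\Theta_{s,r}$ as the weights of a region-specific MLP:
$$O_s[:, r] = \mathrm{MLP}\big(Z_s[:, r];\ \Theta_{s,r}\big).$$
Outputs: $O_s$ for $s = 1, \dots, S$. Node-level outputs $O_0$ are produced by a standard (non-adaptive) MLP.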
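A small NumPy sketch makes the weight-generation idea concrete. It assumes mean pooling over time, a softmax over dot-product similarities, and a two-layer region MLP with one base-weight tensor per layer; these choices and the name `adaptive_region_mixer` are illustrative assumptions, not confirmed details of the method.

```python
import numpy as np

rng = np.random.default_rng(0)
T, R, D, M = 4, 3, 5, 2  # time steps, regions, hidden size, pool size (illustrative)

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def adaptive_region_mixer(z, keys, base_w1, base_w2):
    """z: (T, R, D); keys: (M, D); base_w1, base_w2: (M, D, D)."""
    summary = z.mean(axis=0)                          # (R, D): pool each region over time
    sim = softmax(summary @ keys.T)                   # (R, M): region-to-key weights
    w1 = np.einsum("rm,mde->rde", sim, base_w1)       # (R, D, D): per-region layer 1
    w2 = np.einsum("rm,mde->rde", sim, base_w2)       # (R, D, D): per-region layer 2
    hidden = np.maximum(np.einsum("trd,rde->tre", z, w1), 0.0)
    return np.einsum("trd,rde->tre", hidden, w2), sim

z = rng.normal(size=(T, R, D))
out, sim = adaptive_region_mixer(
    z,
    rng.normal(size=(M, D)),
    rng.normal(size=(M, D, D)),
    rng.normal(size=(M, D, D)),
)
print(out.shape, sim.sum(axis=1))  # (4, 3, 5), and each row of sim sums to 1
```

Because every region's matrices are convex combinations of the same $M$ base matrices, regions with similar time-pooled features receive similar transformations, while the parameter count grows with $M$ rather than with the number of regions.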
Top-Down Propagation
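Spatially, coarse region outputs are progressively merged into finer representations: starting from the coarsest scale, the merged features at scale $s$ are mapped back onto the $N_{s-1}$ units of scale $s-1$ and combined with that scale's own output, until the node level ($s = 0$) is reached.
The merged node-level representation forms the input to the next ST-block.
Temporally, final representations at multiple resolutions are successively upsampled, from the coarsest block back toward the input resolution, with each upsampled representation fused into the next finer one.
Prediction is produced from the resulting fused representations by a final stack of FC and activation layers.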
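The top-down path can be sketched under simple assumptions: a learned linear map spreads each coarse scale onto the next finer one, merging is additive, and temporal upsampling repeats each coarse frame. The source does not specify these operators, so `top_down_spatial`, `upsample_time`, and the sizes below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
T_c, N, D, w = 4, 6, 3, 3
sizes = [N, 3, 2]  # units per spatial scale, fine to coarse (illustrative)

def top_down_spatial(outputs, down_proj):
    """Merge coarse outputs into finer ones, coarsest scale first.

    outputs[s]: (T, sizes[s], D); down_proj[s]: (sizes[s], sizes[s - 1]).
    """
    merged = outputs[-1]
    for s in range(len(outputs) - 1, 0, -1):
        spread = np.einsum("trd,rn->tnd", merged, down_proj[s])  # regions -> finer units
        merged = outputs[s - 1] + spread                          # additive merge (assumed)
    return merged                                                 # node-level tensor

def upsample_time(coarse, factor):
    """Repeat every coarse frame `factor` times along the time axis."""
    return np.repeat(coarse, factor, axis=0)

outputs = [rng.normal(size=(T_c, n, D)) for n in sizes]
down_proj = [None] + [
    rng.normal(size=(sizes[s], sizes[s - 1])) for s in range(1, len(sizes))
]
node_level = top_down_spatial(outputs, down_proj)
fine = upsample_time(node_level, w)
print(node_level.shape, fine.shape)  # (4, 6, 3) (12, 6, 3)
```

The loop visits scales from coarse to fine, so information from the coarsest regions reaches every node after passing through each intermediate scale.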
3. Forward Pass Workflow
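The complete forward computation follows the order of the components described above: data embedding; then, within each ST-block, bottom-up temporal and spatial aggregation, adaptive region mixing, and top-down propagation; and finally fusion of all hierarchy levels to produce the forecast.
This structure allows efficient and scalable spatiotemporal forecasting aligned with the O(N·T) computational goal.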
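To make the resulting feature pyramid tangible, the following sketch tracks tensor shapes through stacked ST-blocks. It assumes that each block shortens the time axis by the window length and that every block uses the same region counts; the function `pyramid_shapes` and the example sizes are hypothetical.

```python
def pyramid_shapes(t_in, n_nodes, hidden, window, n_blocks, regions_per_scale):
    """Return, per ST-block, the feature shape at every spatial scale."""
    shapes = []
    t = t_in
    for _ in range(n_blocks):
        if t % window:
            raise ValueError("time length must be divisible by the window length")
        t //= window                                  # temporal aggregation shortens the time axis
        units = [n_nodes] + list(regions_per_scale)   # scale 0 is the node level
        shapes.append([(t, u, hidden) for u in units])
    return shapes

shapes = pyramid_shapes(12, 716, 32, 2, 2, [64, 8])
for level, block in enumerate(shapes, start=1):
    print(level, block)
# 1 [(6, 716, 32), (6, 64, 32), (6, 8, 32)]
# 2 [(3, 716, 32), (3, 64, 32), (3, 8, 32)]
```

Every tensor in the pyramid has at most `n_nodes` spatial units and at most `t_in` time steps, which is why the per-block cost stays proportional to the number of nodes times the history length.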
4. Computational Complexity and Scalability
Each fully connected operation within the mixing MLPs requires either $O(N T D^2)$ or $O(N T D D')$ per block, where $D'$ is the intermediate hidden dimension. The adaptive region mixer introduces an additional cost of $O(N_s M D)$ per scale $s$ for the similarity computation, with $N_s$ regions and a pool of $M$ keys, and $O(N_s T D^2)$ for application of transformation weights.
Summed over all blocks and scales, total complexity is
$$O\Big(L \Big[\, N T D\,(D + D') + \sum_{s=1}^{S} N_s D\,(M + T D) \Big]\Big)$$
assuming uniform $D$ and $D'$ across the $L$ blocks. Since $N_s \le N$, this scaling is strictly linear in the number of nodes $N$ and input length $T$, in contrast to the quadratic $O(N^2)$ complexity characteristic of transformer attention or $O(|\mathcal{E}|)$ in full-graph GNN propagation, where $|\mathcal{E}|$ is the number of edges. On the large-scale CA dataset ($N = 8{,}600$), HSTMixer completed training within hours on a 48 GB GPU, whereas transformer or GNN methods failed due to either memory or time constraints.
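The practical gap between linear and quadratic scaling can be checked with a rough operation count. The sketch below counts multiply-accumulates for a two-layer mixing MLP and for spatial self-attention; the hidden sizes, block count, and the two cost functions are illustrative assumptions, and the node counts are those of the four benchmark datasets.

```python
def mixer_cost(n, t, d, d_hidden, n_blocks):
    """Multiply-accumulates for FC -> FC mixing applied at every (node, time) position."""
    return n_blocks * n * t * (d * d_hidden + d_hidden * d)

def attention_cost(n, t, d, n_blocks):
    """Multiply-accumulates for spatial self-attention: score matrix plus value mixing."""
    return n_blocks * t * (2 * n * n * d)

d, d_hidden, blocks, t = 64, 128, 3, 12
for n in (716, 2352, 3834, 8600):
    ratio = attention_cost(n, t, d, blocks) / mixer_cost(n, t, d, d_hidden, blocks)
    print(f"N={n:5d}  attention/mixer cost ratio = {ratio:.1f}")
```

Under these assumptions the ratio simplifies to $N / D'$, so the relative cost of attention keeps growing with the number of sensors while the mixer's cost per node stays constant.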
5. Experimental Evaluation and Ablative Insights
HSTMixer was evaluated on four real-world large-scale datasets: SD ($N = 716$), GBA ($N = 2{,}352$), GLA ($N = 3{,}834$), and CA ($N = 8{,}600$), all with 15-minute intervals, using a 12-interval input for a 12-interval forecast.
Comparative Performance
| Dataset | HSTMixer (MAE / RMSE / MAPE %) | Next Best Method | Next Best (MAE / RMSE / MAPE %) |
|---|---|---|---|
| SD | 14.80 / 25.06 / 9.22 | DGCRN | 15.50 / 25.90 / 9.93 |
| GBA | 17.73 / 30.67 / 12.71 | LSTNN | 18.28 / 31.59 / 12.99 |
| GLA | 16.45 / 28.03 / 9.53 | LSTNN | 17.22 / 29.11 / 9.65 |
| CA | 15.55 / 27.05 / 10.55 | LSTNN | 16.48 / 28.24 / 10.90 |
Average improvements over the previous best were MAE ↓4.41%, RMSE ↓3.15%, and MAPE ↓2.03%. Ablation studies indicate that removing any primary component (the adaptive mixer, the temporal or spatial hierarchy, or either top-down propagation path) increases MAE. On GBA, disabling the adaptive mixer, temporal hierarchy, spatial hierarchy, temporal propagation, or spatial propagation increased MAE by 1.2%, 2.8%, 3.1%, 2.5%, and 2.9%, respectively.
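The reported average MAE improvement can be reproduced from the MAE values in the comparison table. The short calculation below takes the relative reduction per dataset and averages the four percentages; the helper name `relative_gain` is only illustrative.

```python
mae = {  # dataset: (HSTMixer, next best), copied from the comparison table
    "SD": (14.80, 15.50),
    "GBA": (17.73, 18.28),
    "GLA": (16.45, 17.22),
    "CA": (15.55, 16.48),
}

def relative_gain(ours, baseline):
    """Percentage reduction in error relative to the baseline."""
    return 100.0 * (baseline - ours) / baseline

gains = {name: relative_gain(*pair) for name, pair in mae.items()}
average_gain = sum(gains.values()) / len(gains)
print({k: round(v, 2) for k, v in gains.items()})
print(round(average_gain, 2))  # 4.41
```

The per-dataset reductions are largest on CA, the dataset with the most sensors.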
Regarding efficiency, HSTMixer trained on GBA in approximately 4.5 hours (compared to 2–3 hours for smaller MLP baselines), with inference requiring ≈30 seconds per epoch.
6. Significance and Practical Implications
HSTMixer demonstrates that scalable all-MLP methods with hierarchical and adaptive mixing mechanisms can achieve state-of-the-art accuracy for large-scale traffic forecasting under real-world constraints on graph size and temporal range. Its linear computational profile and memory footprint enable deployment on hardware where transformer or traditional GNN approaches are infeasible, suggesting practical utility for urban-scale traffic systems where computational efficiency is critical. The ablation results support the hierarchical bidirectional fusion of temporal and spatial context, along with adaptive region parameterization, as essential for accurate forecasting at scale (Wang et al., 26 Nov 2025).