HSTMixer: Hierarchical Spatio-Temporal MLP

Updated 2 July 2026
  • HSTMixer is a novel all-MLP architecture for large-scale traffic forecasting that efficiently captures multi-resolution spatio-temporal dynamics.
  • Its design features hierarchical spatio-temporal mixing blocks and adaptive region-specific MLPs to achieve linear computational complexity and scalable performance.
  • Empirical evaluations on datasets such as CA demonstrate significant improvements in MAE, RMSE, and MAPE compared to transformer and GNN methods.

The Hierarchical Spatio-Temporal Mixer (HSTMixer) is an all-MLP architecture for large-scale traffic forecasting, designed to efficiently and effectively capture multi-resolution dynamics over spatiotemporal graphs with up to tens of thousands of sensor nodes. HSTMixer’s architecture is built around the hierarchical composition of spatiotemporal mixing blocks and adaptive region-specific MLP parameterizations, enabling state-of-the-art predictive accuracy at linear computational complexity in both node and time dimensions (Wang et al., 26 Nov 2025).

1. Architectural Overview

HSTMixer is structured to address the prohibitive computational cost common to transformer and GNN-based spatiotemporal forecasting methods, replacing self-attention or message-passing with MLP-based mixing. Its design centers on two pillars:

  • Hierarchical Spatio-Temporal Mixing Blocks (ST-blocks):
    • Each ST-block performs bottom-up (aggregative) compression of temporal and spatial features to coarser (macro) representations, followed by top-down propagation that reincorporates these macro features into finer (micro) resolutions.
    • The bottom-up path groups temporal input into windows and aggregates node-level features into regions at multiple spatial scales. The top-down path disseminates information from coarse to fine spatial and temporal resolutions.
  • Adaptive Region Mixer:
    • At each spatial scale, an adaptive region mixer generates region-specific MLP weights from a small parameter pool, allowing semantically similar regions to share transformation matrices while preserving distinct treatments for dissimilar regions.

By stacking $L$ such ST-blocks, HSTMixer constructs a spatiotemporal feature pyramid over both time and space. Final forecast outputs are obtained by fusing all levels of the hierarchy.

2. Mathematical Formulation and Model Components

Key Notation

  • $N$: Number of nodes (sensors)
  • $T$: Input history length
  • $T'$: Forecast horizon
  • $d$: Hidden feature dimension
  • $L$: Number of ST-blocks
  • $p$: Temporal window length
  • $K$: Number of spatial scales
  • $S_k < N$: Number of regions at scale $k$

Data Embedding

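The raw time series is embedded into the $d$-dimensional hidden space. This data embedding is accompanied by static spatial embeddings (Node2Vec) and learnable dynamic embeddings, which are summed to give a single spatial embedding, and by two temporal embeddings, which are aggregated into a single temporal embedding.

The input to the first ST-block is formed from the data, spatial, and temporal embeddings.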
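
The embedding step can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the number of raw channels `C`, the single-FC form of the data embedding, summation of the two temporal embeddings, and the final broadcast-add are all assumptions made for the example, and random arrays stand in for learned parameters and precomputed Node2Vec vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
T, N, C, d = 12, 5, 1, 8           # history length, nodes, raw channels (assumed), hidden size

x = rng.normal(size=(T, N, C))     # raw time series

# Data embedding: one FC layer lifting raw channels to d (assumed form).
W_in, b_in = rng.normal(size=(C, d)), np.zeros(d)
e_data = x @ W_in + b_in           # (T, N, d)

# Spatial embedding: static part (stand-in for Node2Vec) plus learnable dynamic part.
e_static = rng.normal(size=(N, d))
e_dynamic = rng.normal(size=(N, d))
e_spatial = e_static + e_dynamic   # (N, d)

# Temporal embedding: two per-step embeddings, aggregated here by summation (assumption).
e_time_a = rng.normal(size=(T, d))
e_time_b = rng.normal(size=(T, d))
e_temporal = e_time_a + e_time_b   # (T, d)

# Assumed combination: broadcast-add into one (T, N, d) tensor for the first ST-block.
h0 = e_data + e_spatial[None, :, :] + e_temporal[:, None, :]
print(h0.shape)
```

Every node shares the temporal embedding of a given step and every step shares the spatial embedding of a given node, so the embedding tables grow with $N + T$ rather than $N \cdot T$.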

Bottom-Up Aggregation

Each ST-block applies two aggregation steps to its input:

  • Temporal Aggregation Mixer: Temporal frames are grouped into windows of length $p$. Two parallel window-mixing MLPs, each structured as FC → activation → FC with a positional embedding, process every window and produce gated outputs at the coarser temporal resolution.

  • Spatial Aggregation Path: For each scale $k$, a learned FC layer aggregates node (spatial) features into $S_k$ coarser regions.

Original (fine) node features are retained for node-level mixing.
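
A minimal NumPy sketch of the two bottom-up steps is given below. It assumes a specific gating form (one MLP output multiplied by the sigmoid of the other), a tanh-approximated GELU as the activation, and a dense matrix `A` as the learned node-to-region FC; none of these details are confirmed by the text above, and the names `window_mlp`, `hid`, and `A` are hypothetical.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU (activation choice is an assumption)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def window_mlp(w, W1, W2):
    # FC -> activation -> FC over the last axis (positions inside a window)
    return gelu(w @ W1) @ W2

rng = np.random.default_rng(0)
T, N, d, p, S, hid = 12, 6, 4, 3, 2, 8    # S regions at one scale, S < N
h = rng.normal(size=(T, N, d))            # ST-block input

# Temporal aggregation: group T frames into T // p windows of length p.
windows = h.reshape(T // p, p, N, d)
pos = rng.normal(size=(p, 1, d))          # positional embedding per in-window position
windows = windows + pos                   # broadcast over windows and nodes
w = windows.transpose(0, 2, 3, 1)         # (T/p, N, d, p): window axis last

# Two parallel window-mixing MLPs; each collapses a window to a single frame.
Wv1, Wv2 = rng.normal(size=(p, hid)), rng.normal(size=(hid, 1))
Wg1, Wg2 = rng.normal(size=(p, hid)), rng.normal(size=(hid, 1))
value = window_mlp(w, Wv1, Wv2)[..., 0]
gate = sigmoid(window_mlp(w, Wg1, Wg2)[..., 0])
h_coarse = value * gate                   # gated output, (T/p, N, d)

# Spatial aggregation: a learned FC over the node axis maps N nodes to S regions.
A = rng.normal(size=(S, N))
regions = np.einsum("sn,tnd->tsd", A, h_coarse)   # (T/p, S, d)
print(h_coarse.shape, regions.shape)
```

Both steps are plain matrix products over one axis at a time, which is what keeps the per-block cost linear in the number of nodes and time steps.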

Adaptive Region Mixer

At each region scale $k$, region-scale features are transformed by region-specific MLPs whose weights are generated adaptively:

  • Parameter pool (per scale $k$):
    • Keys
    • Base weights
  • Region-to-key similarity (over time): each region's features are compared with the pool keys.
  • Transformation matrices: the similarities combine the base weights into one matrix per region.

Per-region features are then transformed using these matrices as the weights of a region-specific MLP. Node-level outputs are produced by a standard (non-adaptive) MLP.
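
The idea of generating region-specific weights from a small shared pool can be sketched as follows. The details are assumptions chosen for illustration: each region is summarized by its mean over time, similarity is a softmax over dot products with the keys, and the region MLP is reduced to a single ReLU layer with a residual connection. The pool size `M` is a hypothetical parameter.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
Tw, S, d, M = 4, 3, 5, 2                  # time steps, regions, hidden size, pool size (hypothetical)
r = rng.normal(size=(Tw, S, d))           # region features at one scale
r[:, 2] = r[:, 0]                         # make regions 0 and 2 identical for the demo

keys = rng.normal(size=(M, d))            # parameter pool: keys
base_w = rng.normal(size=(M, d, d))       # parameter pool: base weights

# Region-to-key similarity, using a time-averaged region summary (assumption).
summary = r.mean(axis=0)                  # (S, d)
sim = softmax(summary @ keys.T, axis=-1)  # (S, M), each row sums to 1

# Transformation matrices: similarity-weighted mix of the base weights.
W_region = np.einsum("sm,mij->sij", sim, base_w)   # (S, d, d)

# Region-specific transform (single layer with residual, for brevity).
out = r + np.maximum(np.einsum("tsi,sij->tsj", r, W_region), 0.0)
print(np.allclose(W_region[0], W_region[2]))
```

Regions 0 and 2 carry identical features in this example, so they receive identical transformation matrices, while region 1 gets its own mix. The parameter count depends on the pool size `M`, not on the number of regions.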

Top-Down Propagation

Spatially, coarse region outputs are progressively merged into finer representations, from the coarsest scale down to the node level. The input to the next ST-block is built from this merged representation.

Temporally, final representations at multiple resolutions are successively upsampled from coarse to fine.

The prediction is produced from the resulting fused representations by a final stack of FC and activation layers.
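
One way to picture the coarse-to-fine direction is the sketch below. It assumes a learned linear map `B` from regions back to nodes followed by addition, and nearest-neighbour repetition for temporal upsampling; the actual operators used by HSTMixer are not recoverable from the text, so both choices are illustrative stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
Tw, N, S, d, p = 4, 6, 2, 3, 3                 # coarse steps, nodes, regions, hidden size, window length

node_out = rng.normal(size=(Tw, N, d))         # node-level outputs at the coarse time resolution
region_out = rng.normal(size=(Tw, S, d))       # region-level outputs at one coarse scale

# Spatial top-down: map S regions back onto N nodes and add to the node features.
B = rng.normal(size=(N, S))                    # hypothetical learned region-to-node weights
merged = node_out + np.einsum("ns,tsd->tnd", B, region_out)   # (Tw, N, d)

# Temporal top-down: upsample coarse frames by a factor p and add to the finer level.
fine = rng.normal(size=(Tw * p, N, d))         # representation at the finer time resolution
upsampled = np.repeat(merged, p, axis=0)       # each coarse frame repeated p times
fused = fine + upsampled                       # (Tw * p, N, d)
print(fused.shape)
```

Each fine-resolution frame thus receives the context of the coarse frame covering its window, and each node receives the context of the regions it contributes to.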

3. Forward Pass Workflow

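The complete forward computation, as described in the preceding sections, proceeds as follows:

  • Embed the raw time series and combine it with the spatial and temporal embeddings.
  • For each of the $L$ ST-blocks: aggregate temporally over windows of length $p$, aggregate spatially into regions at each of the $K$ scales, apply the adaptive region mixer at the region scales and a standard MLP at the node level, and merge coarse region outputs back into finer spatial representations.
  • Successively upsample the temporal representations from coarse to fine.
  • Produce the forecast with the final FC and activation layers.

This structure allows efficient and scalable spatiotemporal forecasting, in line with the $O(N \cdot T)$ computational goal.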

4. Computational Complexity and Scalability

Each fully connected operation within the mixing MLPs has a per-block cost that grows linearly with $N$ and $T$ and depends on the intermediate hidden dimension. The adaptive region mixer adds a per-scale cost for the similarity computation and for applying the transformation weights.

Summed over all blocks and scales, the total cost is strictly linear in the number of nodes $N$ and the input length $T$, in contrast to the quadratic complexity characteristic of transformer attention and the cost of full-graph GNN propagation. On the large-scale CA dataset, HSTMixer completed training within hours on a 48 GB GPU, whereas transformer or GNN methods failed due to either memory or time constraints.
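
The difference between linear and quadratic scaling in $N$ can be made concrete with a rough operation count. The two cost functions below are simplified, hypothetical estimates (multiply-accumulate counts for a two-layer MLP per token and for dense spatial self-attention per time step), not figures from the paper, and the sizes are arbitrary.

```python
def mlp_mixing_cost(n_nodes, n_steps, d, hidden):
    # Two FC layers (d -> hidden -> d) applied to each of the n_nodes * n_steps tokens.
    return 2 * n_nodes * n_steps * d * hidden

def spatial_attention_cost(n_nodes, n_steps, d):
    # Per time step: an N x N score matrix plus an N x N weighted sum, each costing N * N * d.
    return 2 * n_steps * n_nodes * n_nodes * d

T, d, hidden = 12, 64, 128                 # illustrative sizes only
for n in (1_000, 2_000, 4_000, 8_000):
    mlp = mlp_mixing_cost(n, T, d, hidden)
    attn = spatial_attention_cost(n, T, d)
    print(f"N={n:>5}  mlp={mlp:.2e}  attention={attn:.2e}  ratio={attn / mlp:.1f}")
```

Doubling $N$ doubles the MLP estimate but quadruples the attention estimate, so the gap between the two widens in proportion to the number of sensors.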

5. Experimental Evaluation and Ablative Insights

HSTMixer was evaluated on four real-world large-scale datasets: SD, GBA, GLA, and CA, all with 15-minute intervals, using a 12-interval input for a 12-interval forecast ($T = T' = 12$).

Comparative Performance

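| Dataset | HSTMixer (MAE / RMSE / MAPE) | Next best method | Next best (MAE / RMSE / MAPE) |
|---------|------------------------------|------------------|-------------------------------|
| SD      | 14.80 / 25.06 / 9.22         | DGCRN            | 15.50 / 25.90 / 9.93          |
| GBA     | 17.73 / 30.67 / 12.71        | LSTNN            | 18.28 / 31.59 / 12.99         |
| GLA     | 16.45 / 28.03 / 9.53         | LSTNN            | 17.22 / 29.11 / 9.65          |
| CA      | 15.55 / 27.05 / 10.55        | LSTNN            | 16.48 / 28.24 / 10.90         |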

Average improvements over previous best were MAE ↓4.41%, RMSE ↓3.15%, and MAPE ↓2.03%. Ablation studies indicate that removing any primary component (adaptive mixer, temporal or spatial hierarchies, or top-down propagations) leads to significant increases in MAE. On GBA: disabling the adaptive mixer, temporal hierarchy, spatial hierarchy, temporal propagation, or spatial propagation increased MAE by 1.2%, 2.8%, 3.1%, 2.5%, and 2.9% respectively.
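
The reported average MAE improvement can be checked directly from the table. The short script below computes the relative MAE reduction per dataset against the listed next best method and averages the four values; only the MAE column is used, and the variable names are illustrative.

```python
# (HSTMixer MAE, next best MAE) per dataset, taken from the comparison table.
mae = {
    "SD": (14.80, 15.50),
    "GBA": (17.73, 18.28),
    "GLA": (16.45, 17.22),
    "CA": (15.55, 16.48),
}

# Relative reduction in percent: 100 * (baseline - ours) / baseline.
improvements = {
    name: 100.0 * (baseline - ours) / baseline
    for name, (ours, baseline) in mae.items()
}
average = sum(improvements.values()) / len(improvements)

for name, value in improvements.items():
    print(f"{name}: {value:.2f}%")
print(f"average: {average:.2f}%")
```

The average comes out at 4.41%, matching the stated MAE figure, with the largest relative gain on CA.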

Regarding efficiency, HSTMixer trained on GBA in approximately 4.5 hours (compared to 2–3 hours for smaller MLP baselines), with inference requiring ≈30 seconds per epoch.

6. Significance and Practical Implications

HSTMixer demonstrates that scalable, all-MLP methods with hierarchical and adaptive mixing mechanisms can achieve state-of-the-art accuracy for large-scale traffic forecasting under real-world constraints of graph size and temporal range. Its linear computational profile and memory footprint enable deployment at scales that transformer or traditional GNN approaches cannot reach, suggesting practical utility for urban-scale traffic systems where computational efficiency is critical. The hierarchical bidirectional fusion of temporal and spatial context, along with adaptive region parametrization, is empirically validated as essential for accurate forecasting at scale (Wang et al., 26 Nov 2025).
