HSTMixer: Hierarchical Spatio-Temporal MLP

Updated 2 July 2026

HSTMixer is a novel all-MLP architecture for large-scale traffic forecasting that efficiently captures multi-resolution spatio-temporal dynamics.
Its design features hierarchical spatio-temporal mixing blocks and adaptive region-specific MLPs to achieve linear computational complexity and scalable performance.
Empirical evaluations on datasets such as CA demonstrate significant improvements in MAE, RMSE, and MAPE compared to transformer and GNN methods.

The Hierarchical Spatio-Temporal Mixer (HSTMixer) is an all-MLP architecture for large-scale traffic forecasting, designed to efficiently and effectively capture multi-resolution dynamics over spatiotemporal graphs with up to tens of thousands of sensor nodes. HSTMixer’s architecture is built around the hierarchical composition of spatiotemporal mixing blocks and adaptive region-specific MLP parameterizations, enabling state-of-the-art predictive accuracy at linear computational complexity in both node and time dimensions (Wang et al., 26 Nov 2025).

1. Architectural Overview

HSTMixer is structured to address the prohibitive computational cost common to transformer and GNN-based spatiotemporal forecasting methods, replacing self-attention or message-passing with MLP-based mixing. Its design centers on two pillars:

Hierarchical Spatio-Temporal Mixing Blocks (ST-blocks):
- Each ST-block performs bottom-up (aggregative) compression of temporal and spatial features to coarser (macro) representations, followed by top-down propagation that reincorporates these macro features into finer (micro) resolutions.
- The bottom-up path groups temporal input into windows and aggregates node-level features into regions at multiple spatial scales. The top-down path disseminates information from coarse to fine spatial and temporal resolutions.
Adaptive Region Mixer:
- At each spatial scale, an adaptive region mixer generates region-specific MLP weights from a small parameter pool, allowing semantically similar regions to share transformation matrices while preserving distinct treatments for dissimilar regions.

By stacking $L$ such ST-blocks, HSTMixer constructs a spatiotemporal feature pyramid over both time and space. Final forecast outputs are obtained by fusing all levels of the hierarchy.
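As a rough illustration of the temporal side of this pyramid, the sketch below lists the temporal length seen at each level when every ST-block compresses time by a factor of $p$. The function name `temporal_pyramid`, the non-overlapping windows, and the example values (T = 12, p = 2, L = 2) are assumptions for illustration, not settings taken from the paper.

```python
def temporal_pyramid(T: int, p: int, L: int) -> list[int]:
    """Temporal lengths at the input and after each of L ST-blocks."""
    lengths = [T]
    for _ in range(L):
        if lengths[-1] % p != 0:
            raise ValueError("temporal length must be divisible by p at every block")
        # Each block merges p consecutive frames into one coarser frame.
        lengths.append(lengths[-1] // p)
    return lengths


print(temporal_pyramid(T=12, p=2, L=2))  # [12, 6, 3]
```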

2. Mathematical Formulation and Model Components

Key Notation

$N$ : Number of nodes (sensors)
$T$ : Input history length
$T'$ : Forecast horizon
$d$ : Hidden feature dimension
$L$ : Number of ST-blocks
$p$ : Temporal window length
$K$ : Number of spatial scales
$S_k < N$ : Number of regions at scale $k$ ($k = 1, \dots, K$)

Data Embedding

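The raw time series, covering $T$ time steps over $N$ nodes, is first mapped to a data embedding. It is accompanied by static spatial embeddings (Node2Vec) and learnable dynamic spatial embeddings, which are summed into a single spatial embedding per node, and by two temporal embeddings, which are aggregated into a single temporal embedding per time step.

The input to the first ST-block combines the data, spatial, and temporal embeddings into one tensor with $T$ time steps, $N$ nodes, and $d$ features.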
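The embedding step can be sketched in NumPy as below. This is a minimal sketch under stated assumptions, not the paper's implementation: the FC lifting layer, the choice of time-of-day and day-of-week as the two temporal embeddings, the summation used to combine everything, and all variable names are illustrative guesses, and random arrays stand in for trained parameters and Node2Vec vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
T, N, C, d = 12, 5, 1, 8          # history length, nodes, input channels, hidden size

X = rng.normal(size=(T, N, C))    # raw time series
W_in = rng.normal(size=(C, d))    # assumed FC layer lifting raw channels to d features
b_in = np.zeros(d)
E_data = X @ W_in + b_in          # (T, N, d)

E_static = rng.normal(size=(N, d))    # stand-in for precomputed Node2Vec vectors
E_dynamic = rng.normal(size=(N, d))   # learnable in a real model
E_spatial = E_static + E_dynamic      # (N, d)

# Assumed temporal embeddings: time-of-day and day-of-week lookup tables.
steps_per_day = 96                    # number of 15-minute intervals in a day
tod_table = rng.normal(size=(steps_per_day, d))
dow_table = rng.normal(size=(7, d))
tod_idx = np.arange(T) % steps_per_day
dow_idx = np.zeros(T, dtype=int)      # all steps fall on the same day here
E_temporal = tod_table[tod_idx] + dow_table[dow_idx]   # (T, d)

# Assumed combination: broadcast-sum of the three embeddings.
H0 = E_data + E_spatial[None, :, :] + E_temporal[:, None, :]
print(H0.shape)  # (12, 5, 8)
```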

Bottom-Up Aggregation

An ST-block’s input undergoes:

Temporal Aggregation Mixer: Temporal frames are grouped into consecutive windows of length $p$, so each block reduces the temporal length of its input by a factor of $p$.

Two parallel window-mixing MLPs, each structured as FC → activation → FC and equipped with a positional embedding, are applied to every window. Their outputs are combined by gating to give one aggregated feature vector per window and node.
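A minimal NumPy sketch of this temporal aggregation step follows. It assumes non-overlapping windows, a sigmoid gate on one branch, a GELU activation, and a positional embedding added along the window axis; none of these choices is confirmed by the text, and the names `window_mlp` and `temporal_aggregation` are hypothetical.

```python
import numpy as np


def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def window_mlp(x, W1, b1, W2, b2):
    """FC -> activation -> FC applied along the last (window) axis."""
    return gelu(x @ W1 + b1) @ W2 + b2


def temporal_aggregation(H, p, params_value, params_gate, pos):
    """H: (T, N, d) -> (T // p, N, d) by mixing each window of p frames."""
    T, N, d = H.shape
    assert T % p == 0, "T must be divisible by the window length p"
    # (T, N, d) -> (T//p, p, N, d) -> (T//p, N, d, p): window axis last
    Hw = H.reshape(T // p, p, N, d).transpose(0, 2, 3, 1) + pos
    value = window_mlp(Hw, *params_value)          # (T//p, N, d, 1)
    gate = sigmoid(window_mlp(Hw, *params_gate))   # (T//p, N, d, 1)
    return (gate * value)[..., 0]                  # gated output, one frame per window


rng = np.random.default_rng(0)
T, N, d, p, h = 12, 5, 8, 2, 16


def init_mlp():
    # Random stand-ins for trained weights: p -> h -> 1 along the window axis.
    return (rng.normal(size=(p, h)) * 0.1, np.zeros(h),
            rng.normal(size=(h, 1)) * 0.1, np.zeros(1))


H = rng.normal(size=(T, N, d))
pos = rng.normal(size=(p,)) * 0.1      # positional embedding over window positions
out = temporal_aggregation(H, p, init_mlp(), init_mlp(), pos)
print(out.shape)  # (6, 5, 8)
```

Because the MLPs act only along the window axis of length $p$, the cost of this step grows linearly with both $N$ and $T$.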

Spatial Aggregation Path: For each scale $k = 1, \dots, K$, a learned FC layer aggregates the $N$ node (spatial) features into $S_k$ coarser region features.

The original (fine) node features are retained for node-level mixing.
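The spatial aggregation can be sketched as a linear map along the node axis. In this illustrative example the region counts, the random assignment matrices, and the choice to aggregate every scale directly from the nodes (rather than from the previous scale) are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
T, N, d = 6, 20, 8
region_counts = [8, 4]            # S_1, S_2: hypothetical region counts for K = 2 scales

H = rng.normal(size=(T, N, d))    # node-level features

# One learned (N x S_k) aggregation matrix per scale; random stand-ins here.
A = [rng.normal(size=(N, S)) / np.sqrt(N) for S in region_counts]

# FC along the node axis: regions[k][t, s, :] = sum_n A[k][n, s] * H[t, n, :]
regions = [np.einsum("tnd,ns->tsd", H, A_k) for A_k in A]

for k, R in enumerate(regions, start=1):
    print(f"scale {k}: {R.shape}")
```

With $S_k$ fixed, each aggregation costs on the order of $N \cdot S_k$ multiply-adds per time step and feature, which stays linear in $N$.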

Adaptive Region Mixer

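At each scale $k$, region-scale features are transformed by region-specific MLPs whose weights are generated adaptively:

Parameter pool (per scale $k$):
- Keys: a small set of learnable key vectors
- Base weights: a matching set of learnable weight matrices, one per key
Region-to-key similarity (over time): the features of each region, taken over the temporal axis, are compared with every key to produce one similarity score per region-key pair.

Transformation matrices: the transformation matrix of each region is a similarity-weighted combination of the base weights, so regions with similar scores receive similar matrices.

The per-region features are then transformed using these matrices as the weights of a region-specific MLP.

Outputs: one transformed feature tensor per region scale. Node-level outputs are produced by a standard (non-adaptive) MLP.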
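The mechanism can be illustrated with the NumPy sketch below: similarities between regions and keys select a mixture of pooled base weights, which then act as a per-region MLP. The time-averaging of region features, the softmax normalization, the ReLU activation, the pool size `M`, and the function name `adaptive_region_mixer` are all assumptions made for this sketch.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def adaptive_region_mixer(R, keys, base_W1, base_W2):
    """R: (T, S, d) region features; keys: (M, d);
    base_W1: (M, d, h); base_W2: (M, h, d)."""
    # Region-to-key similarity from time-averaged region features (assumption).
    summary = R.mean(axis=0)                        # (S, d)
    sim = softmax(summary @ keys.T, axis=-1)        # (S, M), rows sum to 1
    # Region-specific weights as similarity-weighted mixtures of the pool.
    W1 = np.einsum("sm,mdh->sdh", sim, base_W1)     # (S, d, h)
    W2 = np.einsum("sm,mhd->shd", sim, base_W2)     # (S, h, d)
    # Region-specific two-layer MLP (ReLU chosen for brevity).
    hidden = np.maximum(np.einsum("tsd,sdh->tsh", R, W1), 0.0)
    return np.einsum("tsh,shd->tsd", hidden, W2), sim


rng = np.random.default_rng(0)
T, S, d, h, M = 6, 4, 8, 16, 3
R = rng.normal(size=(T, S, d))
keys = rng.normal(size=(M, d))
base_W1 = rng.normal(size=(M, d, h)) * 0.1
base_W2 = rng.normal(size=(M, h, d)) * 0.1
out, sim = adaptive_region_mixer(R, keys, base_W1, base_W2)
print(out.shape, sim.shape)  # (6, 4, 8) (4, 3)
```

Only the `M` pooled matrices are stored as parameters, so the parameter count does not grow with the number of regions, while two regions with near-identical similarity rows end up with near-identical MLP weights.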

Top-Down Propagation

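Spatially, coarse region outputs are progressively merged into finer representations: starting from the coarsest scale, the output at each scale is propagated to the next finer scale and merged with the output already computed there, ending at the node level.

The merged node-level representation is the input to the next ST-block.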
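A coarse-to-fine spatial merge of this kind might look as follows. The learned projection between adjacent levels and the additive merge are assumptions of this sketch, as are the level sizes.

```python
import numpy as np

rng = np.random.default_rng(0)
T, N, d = 6, 20, 8
sizes = [N, 8, 4]                 # node level followed by S_1, S_2 (hypothetical)

# Mixer outputs at each level, finest first; random stand-ins here.
outputs = [rng.normal(size=(T, S, d)) for S in sizes]

# One learned map per adjacent pair of levels, coarse -> fine; random here.
up = [rng.normal(size=(sizes[i + 1], sizes[i])) / np.sqrt(sizes[i + 1])
      for i in range(len(sizes) - 1)]

merged = outputs[-1]                               # start from the coarsest scale
for i in range(len(sizes) - 2, -1, -1):
    # Project coarse features onto the finer level and add them to its own output.
    merged = outputs[i] + np.einsum("tcd,cf->tfd", merged, up[i])

print(merged.shape)  # (6, 20, 8)
```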

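Temporally, the final representations at multiple resolutions are successively upsampled, from the coarsest temporal resolution back to the finest, and fused along the way.

The prediction over the horizon $T'$ is produced from the fused representations by a final stack of FC and activation layers.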
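The temporal top-down path and prediction head can be sketched as below. Nearest-neighbor repetition as the upsampling operator, additive fusion, and the two-layer head are assumptions chosen for brevity; the model may well use learned upsampling instead.

```python
import numpy as np


def upsample_time(H, p):
    """Repeat each coarse frame p times: (T, N, d) -> (T * p, N, d)."""
    return np.repeat(H, p, axis=0)


rng = np.random.default_rng(0)
N, d, p = 5, 8, 2
# Block outputs at temporal lengths 12, 6, 3 (finest first); random stand-ins.
levels = [rng.normal(size=(T_l, N, d)) for T_l in (12, 6, 3)]

fused = levels[-1]
for H in reversed(levels[:-1]):
    fused = H + upsample_time(fused, p)   # coarse -> fine

# Prediction head: flatten time and features per node, then FC -> ReLU -> FC.
T_out, hidden = 12, 32
W1 = rng.normal(size=(12 * d, hidden)) * 0.1
W2 = rng.normal(size=(hidden, T_out)) * 0.1
z = fused.transpose(1, 0, 2).reshape(N, -1)      # (N, 12 * d)
y_hat = np.maximum(z @ W1, 0.0) @ W2              # (N, T_out)
print(fused.shape, y_hat.shape)  # (12, 5, 8) (5, 12)
```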

3. Forward Pass Workflow

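The complete forward computation proceeds as follows:

- Embed the raw series and combine it with the spatial and temporal embeddings to form the input to the first ST-block.
- In each of the $L$ ST-blocks, aggregate temporal frames into windows of length $p$ with the temporal aggregation mixer.
- In each block, aggregate node features into $S_k$ regions at each of the $K$ spatial scales.
- In each block, transform region features with the adaptive region mixer and node features with a standard MLP.
- In each block, propagate region outputs top-down from coarse to fine scales and pass the result to the next ST-block.
- After the last block, upsample the multi-resolution temporal representations from coarse to fine.
- Produce the forecast with the final FC and activation layers.

This structure allows efficient and scalable spatiotemporal forecasting aligned with the O(N·T) computational goal.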

4. Computational Complexity and Scalability

Each fully connected operation within the mixing MLPs has a per-block cost that grows with the size of the tensor it processes and with the intermediate hidden dimension. The adaptive region mixer introduces an additional cost per scale $k$, one part for the similarity computation and one for the application of the transformation weights.

Summed over all blocks and scales, assuming uniform settings across blocks and treating the hidden dimensions as constants, total complexity is

$O(N \cdot T)$

This scaling is strictly linear in the number of nodes $N$ and input length $T$, in contrast to the quadratic $O(N^2)$ complexity characteristic of transformer attention and the cost of full-graph GNN propagation. On the large-scale CA dataset ($N = 8600$), HSTMixer completed training within hours on a 48 GB GPU, whereas transformer or GNN methods failed due to either memory or time constraints.
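To give a sense of scale, the sketch below counts the spatial interaction terms per time step and feature channel for pairwise attention versus node-to-region mixing at the CA dataset's 8,600 sensors. The region counts (256 and 64) and the function name are hypothetical, and the count ignores hidden-dimension factors.

```python
def spatial_interactions(N: int, regions_per_scale: list[int]) -> dict[str, int]:
    """Count interaction terms per time step and feature channel."""
    return {
        # Every node attends to every node.
        "pairwise_attention": N * N,
        # Each node contributes once to every region at every scale.
        "region_mixing": sum(N * S for S in regions_per_scale),
    }


counts = spatial_interactions(N=8600, regions_per_scale=[256, 64])
print(counts)  # {'pairwise_attention': 73960000, 'region_mixing': 2752000}
print(counts["pairwise_attention"] / counts["region_mixing"])  # 26.875
```

Doubling $N$ doubles the region-mixing count but quadruples the pairwise count, which is the practical meaning of linear versus quadratic scaling here.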

5. Experimental Evaluation and Ablative Insights

HSTMixer was evaluated on four real-world large-scale datasets: SD ($N = 716$), GBA ($N = 2352$), GLA ($N = 3834$), and CA ($N = 8600$), all with 15-minute intervals, using a 12-interval input to produce a 12-interval forecast.
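The 12-in, 12-out setup corresponds to a standard sliding-window construction, sketched below. The function name `make_windows` and the toy series are illustrative; the actual preprocessing pipeline is not described in the text.

```python
import numpy as np


def make_windows(series, T_in=12, T_out=12):
    """series: (steps, N) -> inputs (samples, T_in, N), targets (samples, T_out, N)."""
    steps = series.shape[0]
    n = steps - T_in - T_out + 1
    if n <= 0:
        raise ValueError("series is too short for the requested window lengths")
    X = np.stack([series[i : i + T_in] for i in range(n)])
    Y = np.stack([series[i + T_in : i + T_in + T_out] for i in range(n)])
    return X, Y


# One day of 15-minute steps (96 rows) for 3 toy sensors.
series = np.arange(96 * 3, dtype=float).reshape(-1, 3)
X, Y = make_windows(series)
print(X.shape, Y.shape)  # (73, 12, 3) (73, 12, 3)
```

With 15-minute intervals, each sample uses three hours of history to forecast the following three hours.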

Comparative Performance

Dataset	HSTMixer (MAE / RMSE / MAPE %)	Next Best Method	Next Best (MAE / RMSE / MAPE %)
SD	14.80 / 25.06 / 9.22	DGCRN	15.50 / 25.90 / 9.93
GBA	17.73 / 30.67 / 12.71	LSTNN	18.28 / 31.59 / 12.99
GLA	16.45 / 28.03 / 9.53	LSTNN	17.22 / 29.11 / 9.65
CA	15.55 / 27.05 / 10.55	LSTNN	16.48 / 28.24 / 10.90
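For reference, the three metrics in the table can be computed as in the sketch below. Masking out zero-valued targets in MAPE is a common convention in traffic forecasting and is an assumption here, not something the text specifies.

```python
import numpy as np


def mae(y, y_hat):
    """Mean absolute error."""
    return float(np.mean(np.abs(y - y_hat)))


def rmse(y, y_hat):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))


def mape(y, y_hat, eps=1e-8):
    """Mean absolute percentage error over entries whose true value is non-zero."""
    mask = np.abs(y) > eps
    return float(100.0 * np.mean(np.abs((y[mask] - y_hat[mask]) / y[mask])))


y = np.array([100.0, 200.0, 0.0, 50.0])
y_hat = np.array([110.0, 190.0, 5.0, 45.0])
print(mae(y, y_hat), rmse(y, y_hat), mape(y, y_hat))
```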

Average improvements over previous best were MAE ↓4.41%, RMSE ↓3.15%, and MAPE ↓2.03%. Ablation studies indicate that removing any primary component (adaptive mixer, temporal or spatial hierarchies, or top-down propagations) leads to significant increases in MAE. On GBA: disabling the adaptive mixer, temporal hierarchy, spatial hierarchy, temporal propagation, or spatial propagation increased MAE by 1.2%, 2.8%, 3.1%, 2.5%, and 2.9% respectively.
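The averaged MAE improvement can be checked directly from the table rows. The sketch below computes the per-dataset relative reduction in MAE and its mean; the helper name `relative_reduction` is illustrative.

```python
results = {  # dataset: (HSTMixer MAE, next-best MAE), copied from the table
    "SD": (14.80, 15.50),
    "GBA": (17.73, 18.28),
    "GLA": (16.45, 17.22),
    "CA": (15.55, 16.48),
}


def relative_reduction(ours: float, baseline: float) -> float:
    """Percentage by which `ours` is lower than `baseline`."""
    return 100.0 * (baseline - ours) / baseline


per_dataset = {name: relative_reduction(*pair) for name, pair in results.items()}
mean_reduction = sum(per_dataset.values()) / len(per_dataset)
print({k: round(v, 2) for k, v in per_dataset.items()})
print(round(mean_reduction, 2))  # 4.41
```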

Regarding efficiency, HSTMixer trained on GBA in approximately 4.5 hours (compared to 2–3 hours for smaller MLP baselines), with inference requiring ≈30 seconds per epoch.

6. Significance and Practical Implications

HSTMixer demonstrates that scalable, all-MLP methods with hierarchical and adaptive mixing mechanisms can achieve state-of-the-art accuracy for large-scale traffic forecasting under real-world constraints of graph size and temporal range. Its linear computational profile and memory footprint enable deployment on hardware where transformer or traditional GNN approaches are infeasible, suggesting practical utility for urban-scale traffic systems where computational efficiency is critical. The ablation results indicate that the hierarchical bidirectional fusion of temporal and spatial context, together with adaptive region parametrization, contributes materially to forecasting accuracy at scale (Wang et al., 26 Nov 2025).


References

Wang et al. (2025). HSTMixer: A Hierarchical MLP-Mixer for Large-Scale Traffic Forecasting.
