Unified Urban Spatial–Temporal Benchmark

Updated 7 March 2026

The paper introduces a unified framework that standardizes urban spatial–temporal prediction by harmonizing data formats, task protocols, and evaluation metrics.
The benchmark offers reproducible data splits and standardized preprocessing pipelines for fair cross-city model evaluation and robust forecasting.
The framework enables integration of multiple urban datasets and modeling protocols, fostering transfer learning and extensibility in addressing real-world urban challenges.

A unified benchmark for urban spatial–temporal prediction is a standardized framework designed to enable rigorous, reproducible, and fair evaluation of models addressing urban phenomena that unfold across both space and time. Such benchmarks define unified data formats, task protocols, evaluation metrics, and model baselines, often accompanied by open-source libraries, to support the comparison and development of forecasting, imputation, and decision-making approaches in domains such as traffic, mobility, and population dynamics.

1. Motivation and Historical Context

The proliferation of domain-specific deep learning models and the accumulation of large-scale urban spatial–temporal datasets have historically led to fragmentation: data appear in heterogeneous formats, code and models are rarely standardized, and fair comparison across papers is infeasible (Jiang et al., 2023, Jiang et al., 2023). Early efforts focused on individual domains (e.g., road traffic) or specific cities and used ad hoc experimental protocols. This lack of unification impeded progress toward general, robust methods capable of handling diverse urban data and scenarios. Recent works, notably LibCity, the UrbanDiT benchmark, EvalST, ST-OOD, and USTBench, directly address this fragmentation, each providing a unified benchmark that harmonizes data, tasks, and evaluation for the urban spatial–temporal prediction community (Jiang et al., 2023, Yuan et al., 2024, Chen et al., 24 Feb 2026, Wang et al., 2024, 2505.17572).

2. Unified Data Formats and Preprocessing

Unified data representation is fundamental to benchmark interoperability. Multiple frameworks have converged on “atomic file” formats: minimal, fixed-schema CSV tables that encode spatial units (points, lines, polygons), their relations, dynamic time series, user traces, and external covariates (Jiang et al., 2023, Jiang et al., 2023). After ingestion, data are uniformly assembled into tensors in $\mathbb{R}^{T \times N \times D}$ (time steps $\times$ spatial units $\times$ features), or into higher-order grid tensors as needed.
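The assembly step described above can be sketched as follows. This is a minimal illustration, not the loader of any particular library: the record layout (`time`, `unit_id`, plus one key per feature) and the function name `assemble_tensor` are assumptions made for the example, and real atomic files carry more columns.

```python
import numpy as np

def assemble_tensor(records, feature_names):
    """Pivot long-format rows into a (T, N, D) array.

    `records` is a list of dicts with hypothetical keys "time" and "unit_id"
    plus one key per feature. Cells with no observation are left as NaN.
    """
    times = sorted({r["time"] for r in records})
    units = sorted({r["unit_id"] for r in records})
    t_index = {t: i for i, t in enumerate(times)}
    n_index = {u: j for j, u in enumerate(units)}

    x = np.full((len(times), len(units), len(feature_names)), np.nan)
    for r in records:
        for k, name in enumerate(feature_names):
            if name in r:
                x[t_index[r["time"]], n_index[r["unit_id"]], k] = r[name]
    return x, times, units

records = [
    {"time": "2024-01-01T00:00", "unit_id": 0, "flow": 12.0, "speed": 55.0},
    {"time": "2024-01-01T00:00", "unit_id": 1, "flow": 7.0, "speed": 61.0},
    {"time": "2024-01-01T00:05", "unit_id": 0, "flow": 15.0, "speed": 52.0},
    # unit 1 has no reading at 00:05, so that cell stays NaN
]
x, times, units = assemble_tensor(records, ["flow", "speed"])
print(x.shape)  # (2, 2, 2)
```

Keeping missing cells as NaN, rather than filling them with zeros at load time, leaves the choice of imputation to a later preprocessing stage.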

For benchmark-scale corpora (e.g., WorldST/EvalST), standardized pipelines perform re-sampling (typically to 5 min granularity), variance-based node filtering, outlier clipping, filling of missing values (e.g., by linear interpolation), and per-node normalization (e.g., RevIN) (Chen et al., 24 Feb 2026). This harmonization lets downstream models operate independently of the original data modality (sensor-based, grid-based, OD-matrix, etc.), facilitating cross-task and cross-city evaluation.
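A rough sketch of such a pipeline for a single feature is given below. It is written under several assumptions and is not the pipeline of any cited benchmark: the thresholds `min_var` and `clip_sigma` are illustrative, re-sampling is omitted, and a plain per-node z-score stands in for RevIN, which normalizes per input window inside the model.

```python
import numpy as np

def preprocess(x, min_var=1e-6, clip_sigma=3.0, eps=1e-5):
    """Clean a (T, N) array in which NaN marks missing readings.

    Returns the normalized array, the indices of kept nodes, and the
    per-node mean/std needed to undo the normalization later.
    """
    # 1. Variance-based node filtering: drop (near-)constant or empty nodes.
    keep = np.where(np.nanvar(x, axis=0) > min_var)[0]
    x = x[:, keep].astype(float).copy()

    # 2. Outlier clipping to mean +/- clip_sigma * std per node (NaN stays NaN).
    mu, sd = np.nanmean(x, axis=0), np.nanstd(x, axis=0)
    x = np.clip(x, mu - clip_sigma * sd, mu + clip_sigma * sd)

    # 3. Linear interpolation of gaps; np.interp holds edge values constant.
    t = np.arange(x.shape[0])
    for j in range(x.shape[1]):
        missing = np.isnan(x[:, j])
        if missing.any():
            x[missing, j] = np.interp(t[missing], t[~missing], x[~missing, j])

    # 4. Per-node z-score normalization (a simple stand-in for RevIN).
    mean, std = x.mean(axis=0), x.std(axis=0) + eps
    return (x - mean) / std, keep, mean, std

raw = np.array([
    [1.0, 5.0, 10.0],
    [2.0, 5.0, np.nan],   # gap in node 2
    [3.0, 5.0, 30.0],
    [4.0, 5.0, 40.0],
])                        # node 1 is constant and gets filtered out
z, keep, mean, std = preprocess(raw)
restored = z * std + mean  # undo normalization before reporting metrics
```

Returning `mean` and `std` alongside the data makes it straightforward to report errors in the original physical units.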

3. Benchmark Composition: Datasets, Tasks, and Splits

Comprehensive benchmarks curate diverse, heterogeneous datasets representative of real-world urban systems. For instance, LibCity collects 55 datasets comprising group and individual dynamics (traffic flow, speed, taxi/bike demand, OD matrices, POI trace, road networks); UrbanDiT covers grid and graph domains across multiple cities; EvalST aggregates sensor and grid data from 7 major cities worldwide (Jiang et al., 2023, Yuan et al., 2024, Chen et al., 24 Feb 2026). ST-OOD provides in-distribution and out-of-distribution splits to assess temporal generalization (Wang et al., 2024).

Benchmark tasks include:

Univariate/multivariate time series prediction (flow, demand, speed)
Next-location or category classification (POI, congestion levels)
Spatio-temporal imputation (random/masked gaps), interpolation, extrapolation
Multi-step and bi-directional forecasting
Decision-making in simulated urban environments (e.g., traffic signal control)
Foundation model zero-/few-shot transfer (Yuan et al., 2024, Chen et al., 24 Feb 2026, 2505.17572)
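For the imputation-style tasks in this list, evaluation usually hides part of an observed tensor and scores the model only on the hidden cells. The sketch below shows two generic masking schemes; the function names and the convention that `True` means hidden are assumptions for illustration, not the protocol of a specific benchmark.

```python
import numpy as np

def random_mask(shape, ratio, rng):
    """Hide each cell independently with probability `ratio` (True = hidden)."""
    return rng.random(shape) < ratio

def block_mask(shape, start, length):
    """Hide a contiguous run of time steps for all units and features."""
    mask = np.zeros(shape, dtype=bool)
    mask[start:start + length] = True
    return mask

rng = np.random.default_rng(0)
x = rng.normal(size=(24, 4, 1))          # (T, N, D) toy tensor

# Masking the tail of the window turns imputation into extrapolation.
mask = block_mask(x.shape, start=18, length=6)
x_observed = np.where(mask, np.nan, x)   # what the model is allowed to see
targets = x[mask]                        # what the model is scored on

# Random point-wise gaps, as in random-missing imputation.
point_mask = random_mask(x.shape, ratio=0.25, rng=rng)
```

Placing the block in the middle of the window instead of at the end would correspond to interpolation rather than extrapolation.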

Data are partitioned with reproducible splits such as 6:2:2 for train/val/test (chronological order), or via in-/out-of-distribution delineation (e.g., train on year A, test on year A+1) (Wang et al., 2024).
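Both split styles can be expressed in a few lines. The sketch below is illustrative only: the helper names are hypothetical, and the integer arithmetic is just one way to make a 6:2:2 split deterministic.

```python
def chronological_split(num_steps, parts=(6, 2, 2)):
    """Return train/val/test slices over the time axis, in temporal order."""
    total = sum(parts)
    train_end = num_steps * parts[0] // total
    val_end = train_end + num_steps * parts[1] // total
    return slice(0, train_end), slice(train_end, val_end), slice(val_end, num_steps)

def split_by_year(years, train_year):
    """Indices for an in-distribution year and the following (OOD) year."""
    in_dist = [i for i, y in enumerate(years) if y == train_year]
    ood = [i for i, y in enumerate(years) if y == train_year + 1]
    return in_dist, ood

train, val, test = chronological_split(100)
print(train, val, test)  # slice(0, 60, None) slice(60, 80, None) slice(80, 100, None)

years = [2018, 2018, 2019, 2019, 2020]
in_dist, ood = split_by_year(years, train_year=2018)
print(in_dist, ood)  # [0, 1] [2, 3]
```

Because the slices never shuffle time steps, no future observation can leak into the training portion.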

4. Modeling Protocols and Evaluation Metrics

Unified benchmarks prescribe fixed experimental workflows to eliminate confounders:

Model interfaces: standardized APIs for loading data, instantiating models, and training/evaluation loops (Jiang et al., 2023)
Normalization and batch construction: standardized data module logic
Comparison metrics: MAE, RMSE, MAPE, R² for regression; Precision@K, MRR, F1@K for classification; domain-specific metrics (e.g., map-matching error)
Sliding window and horizon configurations (e.g., $T_\mathrm{in}$ history, $T_\mathrm{out}$ forecast) (Jiang et al., 2023, Jiang et al., 2023, Yuan et al., 2024, Chen et al., 24 Feb 2026)
Repeated runs: each experiment is run multiple times so that results can be reported with statistical robustness
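The windowing and regression metrics listed above can be illustrated with a short sketch. The function names are assumptions, and masking near-zero targets in MAPE is a common convention rather than something the benchmarks are confirmed to share.

```python
import numpy as np

def make_windows(series, t_in, t_out):
    """Slice a (T, N, D) series into inputs (S, t_in, N, D) and targets (S, t_out, N, D)."""
    num = series.shape[0] - t_in - t_out + 1
    if num <= 0:
        raise ValueError("series is shorter than t_in + t_out")
    x = np.stack([series[i:i + t_in] for i in range(num)])
    y = np.stack([series[i + t_in:i + t_in + t_out] for i in range(num)])
    return x, y

def mae(y, p):
    return float(np.mean(np.abs(y - p)))

def rmse(y, p):
    return float(np.sqrt(np.mean((y - p) ** 2)))

def mape(y, p, eps=1e-8):
    # Assumption: targets with |y| <= eps are excluded to avoid division by zero.
    valid = np.abs(y) > eps
    return float(np.mean(np.abs((y[valid] - p[valid]) / y[valid])) * 100.0)

# Toy series 0, 1, ..., 19 for a single node and feature.
series = np.arange(20, dtype=float).reshape(20, 1, 1)
X, Y = make_windows(series, t_in=12, t_out=3)
print(X.shape, Y.shape)  # (6, 12, 1, 1) (6, 3, 1, 1)

# Naive baseline: repeat the last observed value over the whole horizon.
pred = np.repeat(X[:, -1:], 3, axis=1)
print(mae(Y, pred))  # 2.0
```

On this linearly increasing toy series the last-value baseline is off by 1, 2, and 3 at the three horizon steps, which gives the MAE of 2.0.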

For foundation models and LLM benchmarks, unified prompts, masking strategies, and process-level QA metrics (for reasoning steps) are introduced (Yuan et al., 2024, 2505.17572). Downstream decision tasks measure cumulative rewards, planning efficiency, and reflection adaptation.

5. Model Baselines, Leaderboards, and Comparative Insights

Benchmarks implement broad “model zoos,” re-implementing up to 65 state-of-the-art models in unified codebases. These span:

Recurrent nets (LSTM, GRU, Seq2Seq)
Temporal/spatial convolutional nets (ST-ResNet, TCN, DMVSTNet)
Graph-based models (STGCN, DCRNN, GWNET, AGCRN, MTGNN)
Attention-based models (GMAN, STTN, ASTGCN), transformers, and ODE-nets
Recent foundation models (UrbanDiT, UrbanFM), mixture-of-experts, and generalist time-series models (Jiang et al., 2023, Yuan et al., 2024, Chen et al., 24 Feb 2026).

Sample benchmark leaderboards are provided, e.g.:

Model	METR-LA MAE	NYCTaxi150103 MAE
D2STGNN	2.91	10.24
MTGNN	3.02	10.42
GWNET	3.07	10.24
AGCRN	3.17	10.02
GMAN	3.16	9.83

Key findings:

Methods that learn spatial structure (GCN/RNN hybrids, adaptive GCNs, WaveNet modules) decisively outperform temporal-only or spatially-agnostic baselines (Jiang et al., 2023, Jiang et al., 2023).
Attention-based models yield gains on long-horizon tasks but incur high computational cost.
Foundation models (UrbanDiT, UrbanFM) show strong zero-/few-shot generalization, sometimes surpassing supervised baselines across diverse cities without dataset-specific retraining (Yuan et al., 2024, Chen et al., 24 Feb 2026).
In out-of-distribution evaluation (ST-OOD), over-fitting to graph topology is common; simpler MLP-based models and dropout regularization can increase robustness to temporal shift (Wang et al., 2024).

6. Extensibility, Best Practices, and Future Directions

Unified benchmarks promote extensibility by specifying atomic file conversions and model/dataset registration protocols, allowing new datasets and models to be integrated with minimal effort (Jiang et al., 2023). Best practices include normalizing inputs before modeling and inverting the normalization on outputs before computing metrics, strict split enforcement, dropout/regularization for generalization, and public release of code and all configurations.
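A registration protocol of this kind is often built on a simple name-to-class registry. The sketch below is a generic pattern, not the API of LibCity or any other cited library; `MODEL_REGISTRY`, `register_model`, `build_model`, and the toy `HistoricalAverage` baseline are all hypothetical names.

```python
MODEL_REGISTRY = {}

def register_model(name):
    """Class decorator that makes a model constructible by name from a config."""
    def decorator(cls):
        if name in MODEL_REGISTRY:
            raise KeyError(f"model '{name}' is already registered")
        MODEL_REGISTRY[name] = cls
        return cls
    return decorator

@register_model("historical_average")
class HistoricalAverage:
    """Toy baseline: forecast the mean of the input history at every step."""
    def __init__(self, t_out):
        self.t_out = t_out

    def predict(self, history):
        mean = sum(history) / len(history)
        return [mean] * self.t_out

def build_model(config):
    """Instantiate a registered model from a plain dict configuration."""
    cls = MODEL_REGISTRY[config["model"]]
    return cls(**config.get("params", {}))

model = build_model({"model": "historical_average", "params": {"t_out": 3}})
print(model.predict([10.0, 12.0, 14.0]))  # [12.0, 12.0, 12.0]
```

With this pattern, adding a model means writing one decorated class, and every experiment remains fully described by its configuration dict, which supports the goal of releasing all configurations publicly.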

Emerging directions:

Scaling benchmarks to multi-modal signals and global city coverage (e.g., WorldST/EvalST’s >100-city corpus) (Chen et al., 24 Feb 2026)
Knowledge fusion: integrating Urban Knowledge Graph (UrbanKG) embeddings to encode domain hierarchies/cycles and improve performance on downstream urban spatiotemporal prediction (USTP) tasks (Ning et al., 2023)
Open-world and continual learning: structured evaluation protocols for lifelong adaptation, RL-based interactive tasks, and adaptive reflection in LLM-based agents (2505.17572)
Advanced architectural search/autotuning within standardized pipelines

7. Comparative Landscape and Benchmark Evolution

A comparative table synthesizing key unified benchmarks:

Benchmark	Data Formats	#Datasets	#Models/Baselines	Tasks	Foundation Model Support
LibCity	atomic CSV (5/7)	55	65	Forecasting, matching...	No
UrbanDiT/UrbanFM	patched tensor	9–12	25+	Forecasting, imputation, etc.	Yes
ST-OOD	atomic CSV/tensor	6	13	In-/OOD forecasting	No
EvalST	tensor/minipatch	12	10+	Forecasting, imputation	Yes
USTBench	structured QA	9 tasks	13 LLMs	Reasoning, planning	Yes (LLM focus)
UUKG	KG triplets+CSV	5	15 KGE, 9 ST	KG embedding+USTP fusion	No

Unified benchmarks for urban spatial–temporal prediction thus provide the foundation for reproducible, generalizable urban AI, catalyze transfer learning and foundation model innovation, and continue to evolve with advances in model design, dataset diversity, and evaluation methodology (Jiang et al., 2023, Yuan et al., 2024, Jiang et al., 2023, Wang et al., 2024, Chen et al., 24 Feb 2026, 2505.17572, Ning et al., 2023).