SpatialBench-UC: Urban GeoAI Benchmark
- SpatialBench-UC is a benchmark suite for GeoAI that integrates heterogeneous urban data and standardizes multi-task evaluations.
- It supports region-based regression and trajectory modeling using H3 hex grids to examine spatial granularity and information loss.
- By providing reproducible code, transparent baselines, and interpretable metrics, it enables fair comparisons of geospatial embedding architectures.
SpatialBench-UC is a multi-task, modality-agnostic benchmark suite designed to systematically evaluate spatial representations and reasoning in urban geospatial artificial intelligence (GeoAI). Its primary role is to offer a unified framework for assessing geospatial embedders, bridging fragmentation by integrating heterogeneous real-world data, defined downstream tasks, reproducible evaluation protocols, and highly interpretable metrics (Moska et al., 7 Oct 2025).
1. Purpose and Design Principles
SpatialBench-UC addresses longstanding limitations in GeoAI evaluation, such as task silos, single-modality datasets, and lack of standardized performance baselines. It provides:
- Task diversity: Five downstream tasks across region-level regression, classification, and trajectory modeling, each representative of a distinct real-world urban phenomenon (pricing, crime intensity, human mobility, and travel time).
- Modality-agnostic input: Benchmarked models can process a wide range of data—OpenStreetMap (OSM) vectors, satellite imagery, sensor streams, or fused embeddings—enforcing a level playing field for architectural comparisons.
- Spatial granularity: All data are aggregated onto H3 geospatial hex grids (resolutions 8, 9, 10), enabling controlled experiments on information loss and aggregation effects.
- Open, reproducible infrastructure: Code, data splits, configurations, and evaluation scripts are public and versioned via the SRAI library and HuggingFace repositories. Train/test splits are strictly non-overlapping in space to prevent leakage.
- Multi-resolution support: Simultaneously reports performance at several spatial scales, crucial for validating model robustness to aggregation and heterogeneity.
These design pillars ensure that any spatial representation learner evaluated under SpatialBench-UC can be rigorously and fairly compared to others on meaningful end-to-end tasks (Moska et al., 7 Oct 2025).
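The strictly non-overlapping train/test splits can be sketched as follows. This is a minimal illustration, not the benchmark's actual code: `cell_of` is a hypothetical stand-in for an H3 indexing call (e.g. a cell lookup from the `h3` library), and the split operates on whole cells so that train and test never share a location:

```python
import random

def spatial_split(points, cell_of, test_frac=0.2, seed=0):
    """Split points into train/test by spatial cell so the two sets
    never share a cell (prevents spatial leakage)."""
    cells = sorted({cell_of(p) for p in points})
    rng = random.Random(seed)
    rng.shuffle(cells)
    n_test = max(1, int(len(cells) * test_frac))
    test_cells = set(cells[:n_test])
    train = [p for p in points if cell_of(p) not in test_cells]
    test = [p for p in points if cell_of(p) in test_cells]
    return train, test

# Toy usage: short strings stand in for H3 cell indexes.
pts = [("a", 1), ("a", 2), ("b", 3), ("c", 4), ("c", 5)]
train, test = spatial_split(pts, cell_of=lambda p: p[0], test_frac=0.34)
```

Because entire cells are assigned to one side of the split, spatially autocorrelated neighbors of a test point cannot leak into training.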
2. Task Suite
SpatialBench-UC implements five downstream tasks, grouped by data aggregation and temporal dependency:
| Category | Task | Objective |
|---|---|---|
| Region-based regression | Short-Term Rental Price Prediction (STRPP) | Predict average Airbnb price per hex cell |
| Region-based regression | Housing Price Prediction (HPP) | Predict average home sale price per cell |
| Region-based regression | Crime Activity Prediction (CAP) | Estimate normalized crime intensity per cell |
| Trajectory-based modeling | Human Mobility Prediction (HMP) | Classify next H3 cell in human movement sequence |
| Trajectory-based modeling | Travel Time Estimation (TTE) | Regress total travel time for a trajectory |
- Region tasks: Input features are aggregated by H3 cell, with endogenous phenomena (crime, pricing) reflecting local urban form, socioeconomic environment, and access.
- Trajectory tasks: Inputs are ordered sequences of H3 cells (walk, bike, taxi, car). Models must capture time-evolving spatial relations and encoded connectivity.
The modality-agnostic input design means each embedder can be evaluated using only OSM vectors, fused multi-source features, or more sophisticated spatial signal representations (Moska et al., 7 Oct 2025).
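The trajectory-task setup (HMP) can be sketched as turning each ordered sequence of H3 cells into (context window, next cell) training pairs. A minimal sketch, with illustrative cell-id strings and a hypothetical `context` window size:

```python
def next_cell_examples(trajectory, context=3):
    """Turn an ordered sequence of cell ids into
    (context_window, next_cell) pairs for next-cell classification."""
    pairs = []
    for i in range(context, len(trajectory)):
        pairs.append((tuple(trajectory[i - context:i]), trajectory[i]))
    return pairs

# Illustrative cell ids; in practice these would be H3 indexes
# produced by mapping GPS fixes onto the hex grid.
traj = ["cell_A", "cell_B", "cell_C", "cell_D", "cell_E"]
pairs = next_cell_examples(traj, context=2)
```

A sequence model (such as the LSTM baseline described below) is then trained to classify the next cell from each context window.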
3. Dataset Portfolio
Benchmarking is performed across seven diverse, openly licensed urban datasets:
- Point-based (static):
- Airbnb listings (6 cities, Summer 2022–Spring 2023): location, price, room type, availability.
- King County House Sales (Seattle metro, 2014–2015): sale price, property attributes.
- City crime datasets: Chicago (2022), Philadelphia (2023), San Francisco (2018–2024).
- Trajectory-based (dynamic):
- Porto Taxi dataset (Portugal, 2013–2014): 400k complete trajectory traces.
- Geolife dataset (Beijing, 2007–2011): 25M GPS points, labeled modes (walk, cycle, drive).
All spatial indexing uses H3 hexagons at multiple resolutions. Preprocessing includes de-duplication, completeness filtering, temporal alignment, H3 mapping, shortest-path interpolation for trajectory gaps, and stratified train/test splitting to control for spatial autocorrelation (Moska et al., 7 Oct 2025).
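The shortest-path interpolation step for trajectory gaps can be sketched as a breadth-first search over the hex adjacency graph. This is a simplified illustration: a toy axial-coordinate hex grid stands in for H3 neighbor lookup (e.g. the `h3` library's neighbor functions), and `fill_gap` reconnects two non-adjacent cells with a shortest cell path:

```python
from collections import deque

def fill_gap(start, goal, neighbors):
    """Fill a gap between two cells with a shortest path (BFS)
    over the hex adjacency graph."""
    if start == goal:
        return [start]
    prev = {start: None}
    q = deque([start])
    while q:
        cur = q.popleft()
        for nxt in neighbors(cur):
            if nxt not in prev:
                prev[nxt] = cur
                if nxt == goal:
                    path = [goal]
                    while prev[path[-1]] is not None:
                        path.append(prev[path[-1]])
                    return path[::-1]
                q.append(nxt)
    return [start, goal]  # no connecting path found; keep endpoints

# Toy axial-coordinate hex grid: cells are (q, r) tuples with six neighbors.
HEX_DIRS = [(1, 0), (1, -1), (0, -1), (-1, 0), (-1, 1), (0, 1)]
neighbors = lambda c: [(c[0] + dq, c[1] + dr) for dq, dr in HEX_DIRS]
path = fill_gap((0, 0), (3, 0), neighbors)
```

BFS guarantees a minimum-length cell path, so interpolated segments stay as short as the grid allows.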
4. Evaluation Metrics
SpatialBench-UC adopts precise, mathematically-defined metrics for each task family:
- Regression tasks (STRPP, HPP, TTE):
- Mean Squared Error (MSE)
- Root Mean Squared Error (RMSE)
- Mean Absolute Error (MAE)
- Mean Absolute Percentage Error (MAPE)
- Symmetric Mean Absolute Percentage Error (sMAPE)
- Crime intensity (CAP):
  - Coefficient of determination (R²), reflecting variance explained (MAPE is unstable due to zero-mean and low-signal cells).
- Human mobility classification (HMP):
  - Sequence accuracy
- Average Haversine (great-circle) distance between predicted and ground truth centroids
- Dynamic Time Warping (DTW) distance for sequence alignment
Performance is always reported per resolution, exposing robustness or degradation under varying cell aggregation, and at multiple forecast horizons for sequence tasks—highlighting compounding error in multi-step prediction (Moska et al., 7 Oct 2025).
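The regression metrics and the Haversine distance used for mobility evaluation can be sketched in a few lines of plain Python (standard definitions; not the benchmark's own implementation):

```python
import math

def rmse(y, yhat):
    """Root Mean Squared Error."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y))

def mae(y, yhat):
    """Mean Absolute Error."""
    return sum(abs(a - b) for a, b in zip(y, yhat)) / len(y)

def smape(y, yhat):
    """Symmetric MAPE in percent; bounded even when targets are near zero."""
    return 100 / len(y) * sum(
        abs(a - b) / ((abs(a) + abs(b)) / 2) if (a or b) else 0.0
        for a, b in zip(y, yhat))

def haversine_km(p, q, r=6371.0):
    """Great-circle distance in km between (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(h))

err = rmse([100.0, 200.0], [110.0, 190.0])  # -> 10.0
```

sMAPE's symmetric denominator is what makes it usable on low-priced cells where plain MAPE would explode.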
5. Baseline Architectures and Embedding Strategies
The OBSR baseline implementations for SpatialBench-UC emphasize transparency and reproducibility:
- Region-based regression/classification: Three-layer feed-forward neural networks—ReLU activations for continuous tasks (prices), sigmoid for normalized crime rate.
- Trajectory tasks: Two-layer LSTM with multi-head self-attention for classification (HMP) or direct regression (TTE), with teacher forcing and a hybrid loss (cross-entropy + geodesic penalty).
- Embeddings: Four off-the-shelf OSM-based encoders:
- Hex2Vec (Skip-gram over H3/tag pairs),
- CountEmbedder (raw tag counts),
- ContextualCountEmbedder (counts aggregated over local neighborhoods),
- GeoVex (hex-convolutional autoencoder, zero-inflated Poisson loss).
This combination provides baseline points that allow any more complex end-to-end architecture to be evaluated for true incremental gain (Moska et al., 7 Oct 2025).
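The region-baseline forward pass can be sketched as follows. This is a minimal stdlib illustration of a three-layer feed-forward network, assuming illustrative layer sizes and random initialization (the benchmark's actual hyperparameters are not reproduced here); the output activation switches between identity (prices) and sigmoid (normalized crime rate) as described above:

```python
import math, random

def relu(v):
    return [max(0.0, x) for x in v]

def sigmoid(v):
    return [1 / (1 + math.exp(-x)) for x in v]

def linear(x, W, b):
    """Dense layer: W is out_dim x in_dim, b is out_dim."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def mlp_forward(x, layers, out_act):
    """Three-layer feed-forward pass: hidden ReLU layers,
    then a task-dependent output activation."""
    h = x
    for W, b in layers[:-1]:
        h = relu(linear(h, W, b))
    W, b = layers[-1]
    return out_act(linear(h, W, b))

rng = random.Random(0)
dims = [4, 8, 8, 1]  # embedding dim -> hidden -> hidden -> scalar target
layers = [([[rng.gauss(0, 0.1) for _ in range(i)] for _ in range(o)],
           [0.0] * o) for i, o in zip(dims, dims[1:])]
embedding = [0.2, 0.5, 0.1, 0.9]       # stand-in for a region embedding
price = mlp_forward(embedding, layers, out_act=lambda v: v)   # unbounded
crime = mlp_forward(embedding, layers, out_act=sigmoid)       # in (0, 1)
```

The sigmoid head keeps crime-rate predictions in the normalized range by construction, which is why the baseline uses it for CAP.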
6. Empirical Findings and Best Practices
Key conclusions and methodological recommendations include:
- Embedding scale sensitivity: Many embedders perform well at high resolution (res 10), but degrade or over-smooth at coarse scale (res 8), indicating the necessity of adaptive or hierarchical representations.
- Task imbalance and metric selection: In crime prediction, overall error minimization yields poor hotspot localization; R² and stratified validation splits are essential to avoid misleading error distributions.
- Trajectory realism: OSM-only embedders lack explicit encoding of road network topology; predicted paths sometimes cut across obstacles, evidencing limits of static vector features for dynamic or connectivity-dependent tasks.
- Multi-resolution reporting: Metrics must always be reported at multiple H3 scales to properly diagnose both generalization and overfitting.
- Baseline transparency: White-box architectures are favored to facilitate interpretable comparison, reproducibility, and benchmarking of more advanced (e.g., multimodal, pre-trained) approaches (Moska et al., 7 Oct 2025).
A plausible implication is that developing embedders capable of multi-scale reasoning and integrating explicit connectivity (e.g., graph convolutions over road networks) is crucial for further progress.
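The metric-selection point for crime prediction can be made concrete: on zero-heavy counts, MAPE divides by zero targets, while R² (variance explained) remains well defined. A small sketch using the standard definitions (not the benchmark's code):

```python
def r2(y, yhat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, yhat))
    ss_tot = sum((a - mean) ** 2 for a in y)
    return 1 - ss_res / ss_tot

def mape(y, yhat):
    """Plain MAPE: undefined whenever a target is exactly zero."""
    return 100 / len(y) * sum(abs(a - b) / abs(a) for a, b in zip(y, yhat))

# Zero-heavy crime counts: mape(y, yhat) would raise ZeroDivisionError
# on the zero cells, while R² still scores the predictions sensibly.
y    = [0.0, 0.0, 0.0, 1.0, 9.0]
yhat = [0.1, 0.0, 0.2, 1.5, 8.0]
score = r2(y, yhat)
```

This is the failure mode behind the benchmark's choice of R² over MAPE for CAP.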
7. Future Directions and Impact
Several open challenges are identified:
- Input expansion: The modality-agnostic API allows direct benchmarking of models operating on satellite imagery, textual features, or sensor data, but current baselines use only OSM. Systematic testing of multi-modal and learned fusion embedders remains a key next step.
- Task enrichment: The current task suite covers the core of urban spatial learning, but further expansion to environmental risks, real-time anomaly detection, or event forecasting is possible.
- Scalable evaluation: The multi-resolution paradigm and reproducible harness foster large-scale ablation and hyperparameter sweeps across urban systems.
- Broader integration: While developed for GeoAI, the structure and philosophy of SpatialBench-UC could inform other spatial benchmarking scenarios, including ecological, epidemiological, and infrastructure modeling domains.
SpatialBench-UC thus establishes a public, rigorously engineered basis for the systematic, reproducible benchmarking of urban spatial representations, providing the foundation for principled progress in spatial machine learning for urban environments (Moska et al., 7 Oct 2025).