GEO-Bench: Geospatial AI Benchmark Suite

Updated 13 April 2026

GEO-Bench is a benchmark suite designed to evaluate geospatial AI systems, featuring standardized datasets, protocols, and metrics for tasks like classification, segmentation, and cross-modal reasoning.
It enables robust model evaluation using normalized accuracy, mIoU, and bootstrapped metrics, addressing challenges from cross-band generalization to modality shifts.
Its design promotes reproducibility and extensibility, facilitating the development of versatile Earth observation models across diverse datasets and real-world applications.

GEO-Bench refers collectively to a suite of standardized benchmarks and codebases designed for rigorous evaluation of models and systems on a variety of geospatial tasks, including, but not limited to, remote sensing, cross-modal geo-localization, spatial representation learning, geometry estimation, foundation modeling for Earth observation, geometric reasoning, geospatial natural language processing, and spatiotemporal data management. Multiple benchmark series and their derivatives have been released under the GEO-Bench moniker, each addressing specific challenges in geospatial AI. This article focuses primarily on the technical design, coverage, and significance of the leading GEO-Bench instances, with particular emphasis on GEO-Bench for Earth Monitoring (Lacoste et al., 2023), its extensions (e.g., GEO-Bench-2 (Simumba et al., 19 Nov 2025), GeoCrossBench (Tamazyan et al., 4 Nov 2025)), and thematically related efforts (e.g., GeoX-Bench (Zheng et al., 17 Nov 2025), GEOBench-VLM (Danish et al., 2024), GeoGrid-Bench (Jiang et al., 15 May 2025), GeoBenchr (Rese et al., 10 Mar 2026)).

1. Foundations and Purpose

GEO-Bench benchmarks originated to address the lack of unified, reproducible, and comprehensive evaluation protocols for geospatial, remote sensing, and Earth observation foundation models. Early work on self-supervised visual pretraining and on model transferability to downstream tasks exposed the field’s need for large, diverse, and task-rich testbeds analogous to NLP’s GLUE/SuperGLUE. This motivated the curation of GEO-Bench (Lacoste et al., 2023), which initially focused on six classification and six semantic segmentation tasks using multispectral, SAR, hyperspectral, and RGB imagery spanning land cover, object detection, environmental monitoring, and infrastructure mapping.

GEO-Bench and its successors provide the following unifying properties:

Permissively licensed multi-modal datasets with standardized splits and harmonized class definitions;
Robust, quantifiable evaluation methodology aggregating cross-task performance via interquartile means and normalization;
Open-source implementation of training, fine-tuning, and evaluation protocols with strong model baselines and reproducible seed control.

The design pattern has influenced numerous downstream or related benchmarks targeting complementary problem domains like cross-view localization (Zheng et al., 17 Nov 2025), generalization to new bands or sensors (Tamazyan et al., 4 Nov 2025), multimodal VLM evaluation for geospatial applications (Danish et al., 2024), spatiotemporal database performance (Rese et al., 10 Mar 2026), and scientific grid-data reasoning (Jiang et al., 15 May 2025).

2. Dataset Curation and Task Coverage

GEO-Bench construction is characterized by a combination of scale, diversity, and careful class balancing to overcome the data heterogeneity endemic to remote sensing. For example, the original GEO-Bench (Lacoste et al., 2023) includes:

Classification: Datasets such as m-bigearthnet (Sentinel-2, 43 LULC classes, 20,000 train samples), m-so2sat (Sentinel-1/2, 17 local climate zones), m-brick-kiln (Sentinel-2, presence/absence), m-forestnet (Landsat-8, 12 deforestation drivers), m-eurosat, and m-pv4ger (PV array detection).
Segmentation: PV array masks, Chesapeake Land Cover, Cashew Plantation, South Africa Crop Type, NZ Cattle, NeonTree (hyperspectral+elevation).

Selection prioritized both real and synthetic imagery, spatial diversity, spectral heterogeneity (from RGB up to 80+ bands), and tasks historically considered challenging for model generalization.

GEO-Bench-2 (Simumba et al., 19 Nov 2025) and GeoCrossBench (Tamazyan et al., 4 Nov 2025) expand this foundation:

GEO-Bench-2 covers 19 datasets and 9 capability groups, spanning multiclass classification, semantic/instance segmentation, object detection, regression, and temporal analysis, with sub-splits by spatial/spectral/temporal resolution and modality (e.g., Sentinel-1, Sentinel-2, Landsat-8, Aerial RGB, DEM).
GeoCrossBench introduces explicit cross-band generalization, including in-distribution (ID), no-overlap, and superset band scenarios, and tasks such as scene classification, segmentation, and change detection using both Sentinel-1 and Sentinel-2.
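The three band scenarios named above can be made concrete with a small set comparison. The following is a minimal sketch, not GeoCrossBench's actual code: the function name `band_scenario` and the extra `partial-overlap` label are assumptions introduced here for illustration, and "superset" is read as "the test bands strictly contain the training bands".

```python
def band_scenario(train_bands, test_bands):
    """Label how the test-time band set relates to the training band set.

    Hypothetical helper: the scenario names follow the text above, while
    'partial-overlap' is an extra label added here to cover remaining cases.
    """
    train, test = set(train_bands), set(test_bands)
    if not train or not test:
        raise ValueError("band sets must be non-empty")
    if train == test:
        return "in-distribution"
    if train.isdisjoint(test):
        return "no-overlap"
    if train < test:  # strict subset: test adds bands unseen in training
        return "superset"
    return "partial-overlap"


# Sentinel-2 red/green/blue bands and Sentinel-1 polarizations as examples.
rgb = ["B04", "B03", "B02"]
print(band_scenario(rgb, rgb))
print(band_scenario(rgb, ["VV", "VH"]))
print(band_scenario(rgb, rgb + ["B08"]))
```

Keeping the scenario label next to each evaluation run makes it easier to report in-distribution and shifted results separately.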

A distinctive feature of these benchmarks is the explicit handling of high-dimensional spatial and spectral data, the engineering of train/val/test splits to prevent spatial leakage, and the subsampling of large datasets to avoid class and region imbalance.
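One common way to limit spatial leakage is to assign whole geographic blocks, rather than individual tiles, to a split. The sketch below illustrates that idea only; it is not the procedure used by any GEO-Bench release. The block size, split fractions, and function names are all assumptions.

```python
import hashlib
import math


def block_id(lat, lon, block_deg=1.0):
    """Index of the grid block (block_deg degrees per side) containing a point."""
    return (math.floor(lat / block_deg), math.floor(lon / block_deg))


def assign_split(lat, lon, block_deg=1.0, val_frac=0.1, test_frac=0.1):
    """Deterministically map a tile centre to train/val/test via its block.

    All tiles in the same block share a split, so nearby (and likely
    correlated) tiles cannot end up on both sides of the train/test boundary.
    """
    row, col = block_id(lat, lon, block_deg)
    digest = hashlib.sha256(f"{row}:{col}".encode()).digest()
    u = int.from_bytes(digest[:8], "big") / 2**64  # pseudo-uniform in [0, 1)
    if u < test_frac:
        return "test"
    if u < test_frac + val_frac:
        return "val"
    return "train"


# Two tiles inside the same 1-degree block always share a split.
print(assign_split(45.2, 7.3), assign_split(45.9, 7.8))
```

Tiles on either side of a block edge can still be close neighbours, so a stricter variant might also drop a buffer strip around block boundaries.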

3. Evaluation Protocols and Metrics

The evaluation framework across GEO-Bench variants is founded on rigorous, seed-controlled runs, careful train/val/test splits, and aggregate scoring to ensure robustness:

Classification: Raw accuracy per task, normalized to [0,1] using min/max reference baselines, with aggregation via interquartile mean (IQM) over random seeds.
Segmentation: Mean Intersection over Union (mIoU), with normalization and IQM procedures analogous to classification.
Regression: For pixel-wise tasks (e.g., biomass estimation), root mean squared error (RMSE) is employed.
Object/Instance Detection: Mean average precision (mAP) computed using standard COCO-style IoU thresholds.
Capability Scores: For GEO-Bench-2, per-capability aggregates are computed by bootstrapping IQMs and normalizing across datasets assigned to each capability group.
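The normalization and IQM steps listed above can be sketched as follows. This is an illustrative reading rather than the benchmark's reference implementation: the reference bounds `low` and `high` and the accuracy values are made up, and the IQM is computed as a 25% trimmed mean that truncates the cut to a whole number of items.

```python
def normalize(score, low, high):
    """Rescale a raw score so that `low` maps to 0 and `high` maps to 1.

    Scores outside the reference range fall outside [0, 1]; nothing is clipped.
    """
    if high <= low:
        raise ValueError("high must be greater than low")
    return (score - low) / (high - low)


def iqm(values):
    """Interquartile mean: average after dropping the lowest and highest 25%."""
    ordered = sorted(values)
    cut = int(0.25 * len(ordered))  # items trimmed from each end
    middle = ordered[cut:len(ordered) - cut]
    return sum(middle) / len(middle)


# Hypothetical accuracies for one task over eight seeds; 0.60 is a failed run.
raw_accuracy_by_seed = [0.81, 0.84, 0.79, 0.86, 0.83, 0.60, 0.85, 0.82]
normalized = [normalize(a, low=0.50, high=0.90) for a in raw_accuracy_by_seed]
print(round(iqm(normalized), 3))
```

Because the outlying seed falls in the trimmed quarter, it does not affect the aggregate, which is the main reason to prefer the IQM over a plain mean here.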

Models are required to use frozen backbones, with additional fine-tuning limited to adaptation heads (linear, UNet decoder, or R-CNN family, depending on the task). Hyperparameter optimization follows a fixed protocol, with a limited number of trials over predefined search spaces.

The table below captures core metrics as formally defined in (Lacoste et al., 2023) and (Simumba et al., 19 Nov 2025):

Task Type	Metric	Definition (LaTeX / notation)
Classification	Accuracy	$\mathrm{Acc} = \frac{1}{N} \sum_{i=1}^{N} 1\{\hat y_i = y_i\}$
Segmentation	Mean IoU (mIoU)	$\mathrm{mIoU} = \frac{1}{C}\sum_{c=1}^C \frac{\mathrm{TP}_c}{\mathrm{TP}_c + \mathrm{FP}_c + \mathrm{FN}_c}$
Regression	RMSE	$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^N (y_i - \hat y_i)^2}$
Detection	mAP	$\mathrm{mAP} = \frac{1}{|C|} \sum_{c \in C} \mathrm{AP}_c$, with $\mathrm{AP}$ evaluated at IoU thresholds
Pixelwise	AbsRel, $\delta_i$ (depth)	$\mathrm{AbsRel} = \frac{1}{N}\sum_{i=1}^N \frac{|d_i - d^*_i|}{d^*_i}$
Normals	Mean Angular Error	$\mathrm{Err} = \frac{1}{M}\sum_k \arccos(\mathbf n_k \cdot \mathbf n^*_k)$
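Several of the tabulated definitions translate directly into code. The sketch below implements accuracy, mIoU, RMSE, and AbsRel on flat Python lists for illustration; the function names are assumptions, and one convention differs from the table: classes absent from both prediction and ground truth are skipped in the mIoU average instead of contributing an undefined 0/0 term.

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_iou(y_true, y_pred, num_classes):
    """Mean over classes of TP / (TP + FP + FN), skipping absent classes."""
    ious = []
    for c in range(num_classes):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        if tp + fp + fn > 0:
            ious.append(tp / (tp + fp + fn))
    if not ious:
        raise ValueError("no class occurs in labels or predictions")
    return sum(ious) / len(ious)


def rmse(y_true, y_pred):
    """Root mean squared error between targets and predictions."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def abs_rel(d_pred, d_true):
    """Mean absolute error relative to the (positive) ground-truth value."""
    return sum(abs(p - t) / t for p, t in zip(d_pred, d_true)) / len(d_true)


labels = [0, 0, 1, 1, 2, 2]  # flattened pixel labels, illustrative only
preds = [0, 1, 1, 1, 2, 0]
print(accuracy(labels, preds), mean_iou(labels, preds, num_classes=3))
```

In practice these would be computed on arrays with a numerical library; plain lists are used here only to keep the example dependency-free.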

Per-task metrics are reported with confidence intervals via bootstrapping; overall model scores are derived as the IQM of normalized task IQMs.
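A percentile bootstrap is one simple way to attach a confidence interval to the overall score described above. The following is a sketch under stated assumptions, not the benchmarks' exact procedure: seeds are resampled with replacement within each task, the IQM is a truncating 25% trimmed mean, and the task names and scores are invented.

```python
import random


def iqm(values):
    """Interquartile mean: average after dropping the lowest and highest 25%."""
    ordered = sorted(values)
    cut = int(0.25 * len(ordered))
    middle = ordered[cut:len(ordered) - cut]
    return sum(middle) / len(middle)


def overall_score(task_scores):
    """IQM across tasks of each task's IQM across seeds."""
    return iqm([iqm(seed_scores) for seed_scores in task_scores.values()])


def bootstrap_ci(task_scores, n_boot=1000, alpha=0.05, seed=0):
    """Return (point estimate, lower, upper) using a percentile bootstrap."""
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    samples = []
    for _ in range(n_boot):
        resampled = {
            task: rng.choices(seed_scores, k=len(seed_scores))
            for task, seed_scores in task_scores.items()
        }
        samples.append(overall_score(resampled))
    samples.sort()
    lower = samples[int((alpha / 2) * n_boot)]
    upper = samples[int((1 - alpha / 2) * n_boot) - 1]
    return overall_score(task_scores), lower, upper


# Hypothetical normalized scores: four tasks, five seeds each.
scores = {
    "task_a": [0.71, 0.74, 0.69, 0.73, 0.72],
    "task_b": [0.55, 0.58, 0.52, 0.57, 0.56],
    "task_c": [0.88, 0.90, 0.87, 0.91, 0.89],
    "task_d": [0.63, 0.61, 0.66, 0.64, 0.62],
}
point, lower, upper = bootstrap_ci(scores, n_boot=500, seed=1)
print(f"overall={point:.3f}  95% CI=({lower:.3f}, {upper:.3f})")
```

With only a handful of seeds per task the interval is coarse, which is one reason multi-seed protocols matter for comparing models whose scores are close.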

4. Comparative Baselines and Key Findings

GEO-Bench baselines include standard CNNs (ResNet, ConvNeXt), vision transformers (ViT, Swin), and remote sensing–specific architectures trained from scratch, with ImageNet or remote sensing self-supervised pretraining as variables (Lacoste et al., 2023, Simumba et al., 19 Nov 2025). Major conclusions:

Pretraining on natural images (e.g., ConvNeXt-L), distilled web images (DINOv3), or EO-specialized data (Clay-V1, TerraMind, Prithvi) offers differing strengths depending on the spatial resolution, modality, and task.
For high-resolution RGB tasks, vision transformers and convolutional backbones pretrained on ImageNet or web images excel.
On multispectral and temporal tasks (esp. agriculture, disaster response), EO-tailored self-supervised models (Clay, TerraMind, Prithvi) outperform generic vision backbones.
No single foundation model dominates across all 9 capability groups; model choice should be dictated by downstream requirements and supported modality.
For geometry estimation tasks, the quality of the fine-tuning data can matter more than its scale or the model architecture: discriminative models with DINOv2 or EfficientNet backbones, fine-tuned on relatively small, high-quality synthetic datasets, surpass large-scale generative approaches in monocular depth and normal estimation (Ge et al., 2024).
Cross-band generalization remains a major challenge: performance drops 2–4× when evaluated on non-overlapping bands, and even small superset changes (adding a band at test time) can reduce performance by up to 25% (Tamazyan et al., 4 Nov 2025).

5. Extensions and Thematic Variants

Multiple derivatives expand the GEO-Bench paradigm beyond traditional classification/segmentation:

GeoX-Bench (Zheng et al., 17 Nov 2025): A large-scale cross-view geospatial reasoning benchmark pairing panoramic/street-view and satellite images for localization (top-1 accuracy) and pose estimation tasks. Standard LMMs achieve ~50–90% on localization, but <30% on heading prediction unless instruction-tuned.
GEO-Bench-2 (Simumba et al., 19 Nov 2025): Formalizes the evaluation of geospatial foundation models across multi-task, multi-modal datasets, emphasizing capability-based model selection and providing a harmonized evaluation protocol.
GeoCrossBench (Tamazyan et al., 4 Nov 2025): Evaluates cross-satellite generalization, with explicit cross-band (e.g., RGB→SAR) and test-time band mismatch protocols, showing no model family is robust to such distributional shifts.
GEOBench-VLM (Danish et al., 2024): Assesses vision-language model (VLM) performance on geospatial task suites (classification, detection, change analysis, entity reasoning) using satellite, multispectral, and SAR imagery; accuracy is substantially lower than on natural-image VLM benchmarks.
GeoGrid-Bench (Jiang et al., 15 May 2025): Examines foundation models’ spatial reasoning over dense gridded geo-data, especially for climate and hazard analytics. VLMs exhibit strong performance in trend detection but struggle with fine-grained spatial localization.
GeoBenchr (Rese et al., 10 Mar 2026): Benchmarks spatiotemporal database systems (e.g., PostGIS, MobilityDB, SedonaDB), focusing on real-world cycling/aviation/maritime queries, highlighting configuration effects and cross-system tradeoffs.

This proliferation of GEO-Bench variants reflects the spectrum of technical challenges in geospatial AI; while unified by terminology and often sharing principles of design, each addresses a distinct regime of data, reasoning, and performance limits.

6. Impact, Best Practices, and Future Directions

The emergence of GEO-Bench standards has catalyzed reproducible, interpretable progress in geospatial AI, enabling the field to move beyond piecemeal dataset-specific leaderboard reporting. Recommendations and open problems highlighted by the benchmarks include:

Best practice: Always match pretraining and adaptation strategies to the spatial, spectral, and temporal properties of the target dataset; a single model rarely excels across all domains (Simumba et al., 19 Nov 2025).
Cross-modal/temporal generalization: Novel architectural and pretraining strategies, especially those that explicitly align cross-band representations (e.g., ChannelViT/ChiViT (Tamazyan et al., 4 Nov 2025) or height-injection in remote sensing (Hu et al., 26 Mar 2026)), are needed for robust cross-instrument and cross-location transfer.
Complex reasoning: Chain-of-thought prompting, subgoal decomposition, and structured error localization have become diagnostic targets; benchmarks such as GeoX-Bench and the geometrical GeoBench emphasize modular evaluation of planning, perception, and deduction (Feng et al., 30 Dec 2025, Zheng et al., 17 Nov 2025).
Data efficiency: Simple discriminative models, when fine-tuned on high-quality synthetic data, can rival much more complex generative models in standard geometry tasks, suggesting that the field may overvalue architectural novelty over data curation (Ge et al., 2024).
Reproducibility and reporting: Confidence intervals, distributional summaries, and public leaderboards are essential; protocols enforce strict train/val/test splits, spatial non-overlap, and multiple seeds.
Benchmark extensibility: Modular codebases, open licensing, and consistent metadata standards enable the addition of new tasks/domains (e.g., stationary raster data, 3D, new application areas).
Unresolved: Complete cross-satellite, cross-band, and cross-modal robustness; reliable instance-level pose estimation; and comprehensive multi-modal fusion capabilities remain open challenges.

The continued evolution of GEO-Bench and its derivatives is expected to drive the development of the next generation of general-purpose, robust, and interpretable geospatial AI systems, with direct applications in environmental monitoring, agriculture, disaster response, urban planning, and beyond (Lacoste et al., 2023, Simumba et al., 19 Nov 2025, Tamazyan et al., 4 Nov 2025, Zheng et al., 17 Nov 2025).