GEO-Bench-2: EO Model Benchmark Suite
- GEO-Bench-2 is a comprehensive benchmark suite for evaluating geospatial foundation models (GeoFMs) on 19 permissively licensed Earth Observation datasets spanning diverse tasks.
- The framework supports tasks such as classification, segmentation, regression, object detection, and instance segmentation with modular adaptability.
- It employs fixed protocols, geographically stratified splits, and robust metrics to ensure rigorous, reproducible, and transparent model ranking.
GEO-Bench-2 is a standardized evaluation framework for Geospatial Foundation Models (GeoFMs) within Earth Observation (EO). It provides a multidimensional, capability-driven benchmarking suite encompassing classification, segmentation, regression, object detection, and instance segmentation across 19 permissively licensed datasets. GEO-Bench-2’s prescribed yet flexible protocol defines rigorous procedures for transparent, reproducible model ranking along axes tailored to EO’s heterogeneous data modalities, spatial resolutions, and task demands (Simumba et al., 19 Nov 2025).
1. Goals, Design Principles, and Distinctives
The framework advances three primary goals: (1) unify the evaluation of GeoFMs across all major EO task types, (2) introduce “capability groups” for fine-grained comparison on subsets of datasets sharing critical characteristics (such as spectral, spatial, or temporal features), and (3) prescribe an evaluation and adaptation protocol that is both reproducible and accommodating to innovation.
Key design facets are comprehensiveness (19 datasets spanning all EO modalities), modularity (users may submit for one or more capabilities), and reproducibility (geographically stratified splits, fixed normalization, and open-source tooling). Emphasis is placed on empirical discriminability between strong and weak models. GEO-Bench-2 uniquely mandates permissive dataset licenses—excluding GPL and non-commercial restrictions—and transparent leaderboard reporting.
The framework addresses shortcomings common to prior benchmarks, such as the original GEO-Bench (which lacks detection and instance segmentation tasks), as well as PANGAEA, Copernicus-Bench, and REOBench (limited modalities, restrictive licensing, or inconsistent adaptation protocols) (Simumba et al., 19 Nov 2025).
2. Task Suite and Dataset Spectrum
GEO-Bench-2 evaluates models across five canonical EO tasks:
- Classification: Single- and multi-label image tile assignment.
- Pixel-wise Regression: Per-pixel predictions of continuous quantities.
- Semantic Segmentation: Per-pixel, discrete label assignment.
- Object Detection: Bounding-box localization and class assignment of objects.
- Instance Segmentation: Per-object instance boundary delineation with pixel masks.
The 19 datasets originate from a variety of EO sub-domains, including land cover, agriculture, disaster mapping, species detection, and urban infrastructure. Modalities span Sentinel-1 SAR, multi-spectral imagery (Sentinel-2 MSI, Landsat, PlanetScope), digital elevation models (DEM), high-resolution RGB and NIR aerial imagery, and time series. Ground Sampling Distance (GSD) varies from 0.1 m (EverWatch, m-nzCattle) to 30 m (NASA Burn Scars), and tasks leverage both single-timestamp and temporally resolved imagery.
Below is a structured summary of selected datasets:
| Dataset | Modalities | Task Types |
|---|---|---|
| BigEarthNet V2 | Sentinel-1 SAR, Sentinel-2 MSI | Multi-label classification |
| FLAIR 2 | Aerial RGB+NIR, DEM | Semantic segmentation |
| PASTIS-R | Sentinel-1/2 time series | Semantic/Instance segmentation |
| EverWatch | Aerial RGB | Object detection |
| Substations | Sentinel-2 MSI | Instance segmentation |
A total of nine “capability groups” are defined (see Section 3), permitting targeted cross-model comparisons (e.g., multi-spectral dependence, <10 m resolution, multi-temporal input).
3. Capability Groups and Aggregation Methodology
Capability groups partition the 19 datasets along salient dimensions relevant for both research and operational deployment. Defined groups include "Core" (a representative, computationally efficient subset), "Classification," "Pixel-wise" (segmentation and regression), "Detection," "Multi-temporal" (time series), "<10 m GSD", "≥10 m GSD", "RGB/NIR", and "Multi-spectral-dependent." A formal indicator function $\mathbb{1}_{G_k}(d) \in \{0, 1\}$ is defined for each capability group $G_k$ and dataset $d$, with the score of model $m$ on $G_k$ given by

$$s(m, G_k) = \frac{\sum_{d} \mathbb{1}_{G_k}(d)\, s(m, d)}{\sum_{d} \mathbb{1}_{G_k}(d)},$$

where $s(m, d)$ denotes the model's score on dataset $d$.
This structure enables fine-grained performance summaries highlighting, for example, models suited to high-spectral or multi-temporal EO analytics.
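The aggregation itself is straightforward; the sketch below illustrates it in Python, with dataset names, scores, and group membership given purely as placeholders rather than the benchmark's actual configuration:

```python
from typing import Dict, Set

def group_score(scores: Dict[str, float], group: Set[str]) -> float:
    """Mean score over the datasets belonging to one capability group.

    `scores` maps dataset name -> model score on that dataset; `group` is the
    set of dataset names for which the group's indicator function equals 1.
    """
    member_scores = [s for d, s in scores.items() if d in group]
    return sum(member_scores) / len(member_scores)

# Illustrative example (dataset names, scores, and grouping are placeholders):
scores = {"BigEarthNetV2": 0.71, "PASTIS-R": 0.58, "EverWatch": 0.64}
multi_temporal = {"PASTIS-R"}          # hypothetical "Multi-temporal" group
print(group_score(scores, multi_temporal))
```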
4. Evaluation Protocol and Metrics
The protocol enforces rigor and comparability with strict controls on hyperparameter optimization (Bayesian search over a prescribed space, capped trials), adaptation head design (defined for each task type), and repeatability via five-seed fine-tuning. All datasets are split using geographically stratified or checkerboard schemes, and no test-set data influences preprocessing or augmentation.
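The following sketch illustrates how such a constrained search and five-seed repetition could be scripted. Optuna is shown only as one possible Bayesian optimizer, and the search space, trial cap, and `fine_tune` stub are illustrative placeholders rather than the benchmark's official tooling:

```python
import optuna

SEEDS = [0, 1, 2, 3, 4]      # five-seed repetition of the selected configuration
N_TRIALS = 16                # illustrative trial cap, not the official budget

def fine_tune(lr: float, weight_decay: float, seed: int) -> float:
    """Stand-in for a real fine-tuning run returning a validation score.
    A synthetic response keeps the sketch runnable; replace with actual training."""
    import math, random
    random.seed(seed)
    return 1.0 - 0.1 * abs(math.log10(lr) + 3.5) + random.gauss(0.0, 0.01)

def objective(trial: optuna.Trial) -> float:
    # Prescribed search space (placeholder ranges).
    lr = trial.suggest_float("lr", 1e-5, 1e-2, log=True)
    wd = trial.suggest_float("weight_decay", 1e-6, 1e-2, log=True)
    return fine_tune(lr=lr, weight_decay=wd, seed=0)

study = optuna.create_study(direction="maximize")   # TPE (Bayesian) sampler by default
study.optimize(objective, n_trials=N_TRIALS)

# Re-run the best configuration across five seeds; report mean and spread.
results = [fine_tune(**study.best_params, seed=s) for s in SEEDS]
```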
Key preprocessing steps include per-band z-score normalization and spatial augmentation (random flipping, cropping for large tiles). Multi-temporal inputs are encoded per timestamp and averaged prior to the decoding stage.
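A minimal NumPy sketch of these two steps is shown below; array shapes and statistic names are illustrative, and band statistics are assumed to come from the training split only:

```python
import numpy as np

def zscore_per_band(x: np.ndarray, band_mean: np.ndarray, band_std: np.ndarray) -> np.ndarray:
    """Per-band z-score normalization of a (C, H, W) tile.

    `band_mean` / `band_std` are per-channel statistics estimated on the
    training split, so no test-set information leaks into preprocessing.
    """
    return (x - band_mean[:, None, None]) / band_std[:, None, None]

def encode_time_series(frames: np.ndarray, encoder) -> np.ndarray:
    """Encode each timestamp of a (T, C, H, W) series independently, then
    average the per-timestamp features before passing them to the decoder."""
    feats = np.stack([encoder(frame) for frame in frames], axis=0)
    return feats.mean(axis=0)
```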
Task-specific adaptation heads include linear layers with softmax for classification, UNet decoders for segmentation and pixel-wise regression, and Faster R-CNN / Mask R-CNN heads with a Feature Pyramid Network (FPN) for object detection and instance segmentation.
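For concreteness, the simplest of these heads can be sketched in PyTorch as a pooled linear probe over backbone features; the embedding size and class count below are placeholders, and the segmentation, regression, and detection heads follow the decoder designs named above rather than this sketch:

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Linear adaptation head: pooled encoder features -> class logits.
    Softmax (single-label) or sigmoid (multi-label) is applied in the loss."""
    def __init__(self, embed_dim: int = 768, num_classes: int = 19):
        super().__init__()
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, tokens, embed_dim) from a ViT-style backbone
        pooled = features.mean(dim=1)          # global average pooling
        return self.fc(pooled)

head = ClassificationHead()
logits = head(torch.randn(2, 196, 768))        # -> shape (2, 19)
```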
Metrics used include overall accuracy for classification, multi-label $F_1$-score, mean Intersection over Union (mIoU) for segmentation, Root Mean Squared Error (RMSE) for regression, and mean Average Precision (mAP) for object-level tasks:

$$F_1 = \frac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}, \qquad \mathrm{mIoU} = \frac{1}{C}\sum_{c=1}^{C}\frac{\mathrm{TP}_c}{\mathrm{TP}_c + \mathrm{FP}_c + \mathrm{FN}_c}, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(\hat{y}_i - y_i)^2},$$

with mAP computed as average precision (the area under the precision–recall curve) averaged over classes and, in COCO-style evaluation, over IoU thresholds.
Final scores for each model and group are computed using interquartile mean bootstraps to mitigate outlier impact, and reported with associated uncertainty.
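A hedged sketch of such a bootstrapped interquartile-mean (IQM) aggregation over a flat array of per-seed, per-dataset scores follows; the resample count and confidence level are illustrative rather than the benchmark's exact settings:

```python
import numpy as np

def iqm(values: np.ndarray) -> float:
    """Interquartile mean: average of values between the 25th and 75th percentiles."""
    lo, hi = np.percentile(values, [25, 75])
    middle = values[(values >= lo) & (values <= hi)]
    return float(middle.mean())

def bootstrap_iqm(scores: np.ndarray, n_boot: int = 1000, seed: int = 0):
    """Bootstrap the IQM of a 1-D score array (e.g. seeds x datasets, flattened),
    returning the point estimate and a 95% percentile interval."""
    rng = np.random.default_rng(seed)
    estimates = [iqm(rng.choice(scores, size=scores.size, replace=True))
                 for _ in range(n_boot)]
    return iqm(scores), tuple(np.percentile(estimates, [2.5, 97.5]))
```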
5. Model Suite and Experimental Findings
The evaluation encompasses both EO-specific and large-scale natural image–pretrained models, including ConvNeXt (ImageNet-22k), DINOv3 (Maxar RGB, web), ResNet (ImageNet, DeCUR), DOFA-ViT, Clay-V1-ViT-B (EO, MAE), Satlas-SwinB, Prithvi-EO-V2 (MAE, HLS), and TerraMind-V1 (correlation-based).
Empirical results confirm that no single model dominates across all capability groups. Models pretrained on natural images (ConvNeXt, DINOv3) achieve leading performance for high-resolution, RGB-centered tasks. EO-specific models (TerraMind, Prithvi, Clay) are superior in multi-spectral and multi-temporal settings, which are crucial for agriculture, disaster response, and burn-scar detection. Model size generally correlates with capability score, but EO-specialized models of modest scale (Clay-V1-ViT-B, 86M parameters) can match or exceed much larger architectures, likely due to better alignment with EO-specific data and tasks (Simumba et al., 19 Nov 2025).
6. Limitations and Future Directions
Identified gaps include the absence of a dedicated SAR-only capability, geographic bias toward Europe and North America, and lack of uncertainty quantification in outcomes. Recommendations for advancing the framework and the field include development of SAR-specialized benchmarks, integration of predictive uncertainty metrics (such as expected calibration error), extension of dataset coverage to underrepresented regions, exploration of hybrid convolution–transformer architectures, and inclusion of operational metrics (e.g., latency, energy consumption) in deployment-aware benchmarks.
GEO-Bench-2 thus constitutes a reproducible, rigorously defined means for comparing GeoFMs along axes matched to the diversity and demands of EO. Comprehensive code, datasets, and leaderboard statistics are openly released to accelerate collective progress toward general-purpose geospatial models (Simumba et al., 19 Nov 2025).