Remote Sensing Benchmarks
- Remote sensing benchmarks are standardized datasets and protocols that rigorously assess AI models for tasks like classification, object detection, and 3D spatial understanding.
- They integrate advanced annotation techniques, including crowdsourcing and hierarchical labeling, to capture the diversity and complexity of Earth observation data.
- These benchmarks drive innovation by providing clear evaluation metrics, fostering methodological improvements in multi-modal and federated learning approaches.
Remote sensing benchmarks comprise standardized datasets and evaluation protocols designed to rigorously assess algorithms for land cover/land use classification, image retrieval, object detection, salient object segmentation, vision-language tasks, 3D spatial understanding, and other geospatial problems using Earth observation imagery. These resources enable systematic comparison of methods, facilitate the development of robust AI models tuned for remote sensing data, and are critical for recognizing shifts in task complexity, modality, and scale as the field evolves.
1. Evolution and Taxonomy of Remote Sensing Benchmarks
Contemporary remote sensing benchmarks are characterized by increasing scale, richer semantic diversity, multi-modality, and complex task protocols. Early datasets, such as UC Merced, focused on scene classification with relatively few classes and limited size. Later benchmarks, e.g., NWPU-RESISC45 (45 classes × 700 images) (Cheng et al., 2017), RSI-CB (60,000+ images, hierarchical land-use taxonomy via crowdsourced annotation) (Li et al., 2017), and BigEarthNet (590,326 multi-spectral patches with hierarchical multi-labels) (Sumbul et al., 2019), expanded class granularity, spatial diversity, and multi-labeling.
Key benchmark families now include:
| Benchmark / Suite | Target Task(s) | Distinctive Features |
|---|---|---|
| NWPU-RESISC45 | Scene Classification | 45 scene classes, high intra-/inter-class variation |
| RSI-CB | Scene Classification | Hierarchical city-scale taxonomy, crowdsourced POIs, 60,000+ images |
| PatternNet | Image Retrieval | 38 focused classes, 800 images/class, high resolution |
| BigEarthNet | Multi-label Classification | Sentinel-2, 12 bands, CLC-derived multi-labels, 590k+ patches |
| RSSOD, RSSOD-Bench | Small Object / SOD | Small-instance focus, VHR images, 22k+ instances |
| VRSBench | Vision-Language (V&L) | 29k images; captions, grounding, VQA; human-verified |
| XLRS-Bench | V&L Perception/Reasoning (UHR) | 8,500×8,500 px average, advanced reasoning, 16 subtasks |
| RS3DBench | 3D Spatial Perception | 54k+ RGB-DEM pairs, semantic text, global coverage |
| OpenEarthSensing | Open-world, Incremental Learning | 189 categories, 5 domains, OOD / covariate shift / hybrid splits |
| FedRS-Bench | Federated Learning | 135 clients, 8 sources, label/data heterogeneity |
This proliferation reflects remote sensing’s complexity and the need to evaluate models for domain shift, annotation richness, task transferability, and modality integration.
2. Construction Protocols, Annotation Strategies, and Scale
Benchmark creation has evolved from hand-curated, shallowly annotated archives to protocolized pipelines that integrate crowdsourcing, hierarchical ontology, and even LLM-guided generation. Notable strategies include:
- Crowdsourced Data Registration: RSI-CB leverages OSM POIs, spatially aligns them with VHR imagery, and screens for duplicates and mis-annotations (Li et al., 2017); see the sketch after this list.
- Multi-label and Semantic Hierarchy: BigEarthNet and OpenEarthSensing employ CLC or taxonomically-rich class sets, capturing real-world multi-class spatial mixing and hierarchical scene structure (Sumbul et al., 2019, Xiang et al., 28 Feb 2025).
- Textual and Visual Alignment: RS3DBench attaches GLM-v4-generated high-level terrain labels to each RGB-DEM pair (Wang et al., 23 Sep 2025). VRSBench combines an LLM (GPT-4V) pipeline with human annotation for detailed captions, question-answer pairs, and object-reference sentences, followed by secondary review from domain experts (Li et al., 18 Jun 2024).
- Scale Metrics: Recent benchmarks commonly exceed 50k images; the 840 UHR images in XLRS-Bench average 8,500×8,500 px (Wang et al., 31 Mar 2025), and FedRS-Bench simulates federated settings across 135 client splits (Zhao et al., 13 May 2025).
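As a concrete illustration of the crowdsourced registration step, the minimal Python sketch below assigns OSM-style POIs to georeferenced tiles and screens near-duplicates. All classes, field names, and thresholds here are illustrative stand-ins, not RSI-CB's actual pipeline, which additionally reprojects coordinates and validates labels against its land-use hierarchy.

```python
from dataclasses import dataclass

@dataclass
class Tile:
    """Georeferenced image tile with bounds (west, south, east, north) in degrees."""
    path: str
    bounds: tuple

@dataclass
class POI:
    """Crowdsourced point of interest with an OSM-style tag, e.g. 'landuse=forest'."""
    lon: float
    lat: float
    label: str

def register_pois(tiles, pois, min_sep_deg=1e-4):
    """Assign each POI to the tile containing it, skipping near-duplicates.

    A simplified stand-in for crowdsourced registration: drop POIs that
    fall within min_sep_deg of an already-kept POI with the same label
    (crude duplicate screening), then match survivors to tile bounds.
    """
    assigned, kept = {}, []
    for poi in pois:
        if any(p.label == poi.label
               and abs(p.lon - poi.lon) < min_sep_deg
               and abs(p.lat - poi.lat) < min_sep_deg for p in kept):
            continue
        kept.append(poi)
        for tile in tiles:
            w, s, e, n = tile.bounds
            if w <= poi.lon <= e and s <= poi.lat <= n:
                assigned.setdefault(tile.path, []).append(poi)
                break
    return assigned
```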
3. Benchmark Tasks and Evaluation Metrics
Remote sensing benchmarks now span tasks requiring not just image-level prediction but also object localization, retrieval, semantic segmentation, composite reasoning, and multi-modal fusion. Evaluation protocols are selected accordingly:
- Classification & Retrieval: Scene-level overall accuracy (OA); retrieval tasks use ANMRR (average normalized modified retrieval rank), mAP, and Precision@K (Cheng et al., 2017, Zhou et al., 2017).
- Multi-label/Hierarchical: Micro/macro F1, mean Average Precision (mAP); accuracy over multi-label sets (Sumbul et al., 2019). For multi-label F1, precision P and recall R are pooled over all labels (micro) or averaged per class (macro) and combined as F1 = 2·P·R / (P + R); see the sketch after this list.
- Object Detection/SOD: mAP@0.5 (IoU), MAE, F-measure, S-measure, E-measure (Wang et al., 2021, Xiong et al., 2023).
- Vision-Language: Captioning via BLEU-n, METEOR, CIDEr, ROUGE-L; visual grounding measured as Accuracy@τ (IoU ≥ τ), VQA as top-1 accuracy by question type (Li et al., 18 Jun 2024, Wang et al., 31 Mar 2025).
- Open-world/Incremental/Federated: AUROC for OOD, session-wise incremental accuracy, knowledge forgetting rates; federated setups use per-client and global accuracy, with dual test sets for stratified evaluation (Xiang et al., 28 Feb 2025, Zhao et al., 13 May 2025).
- 3D Spatial Perception: Depth estimation models are evaluated by MAE, RMSE, and the threshold accuracies δ_k, i.e., the fraction of pixels with max(d_pred/d_gt, d_gt/d_pred) < 1.25^k for k = 1, 2, 3 (Wang et al., 23 Sep 2025).
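Several of these metrics reduce to a few lines of code. The sketch below, assuming NumPy arrays with illustrative shapes, implements micro-averaged multi-label F1, grounding Accuracy@τ over box IoU, and the depth threshold accuracy δ_k; it is a minimal reference, not the benchmarks' official scoring code.

```python
import numpy as np

def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over a multi-label batch.

    y_true, y_pred: (N, C) binary indicator matrices.
    F1 = 2PR / (P + R), with precision P and recall R pooled over all labels.
    """
    tp = np.logical_and(y_true, y_pred).sum()
    p = tp / max(y_pred.sum(), 1)
    r = tp / max(y_true.sum(), 1)
    return 2 * p * r / max(p + r, 1e-12)

def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(ix2 - ix1, 0) * max(iy2 - iy1, 0)
    area = lambda t: (t[2] - t[0]) * (t[3] - t[1])
    return inter / (area(a) + area(b) - inter + 1e-12)

def grounding_accuracy(pred_boxes, gt_boxes, tau=0.5):
    """Accuracy@tau: fraction of predicted boxes with IoU >= tau."""
    hits = [box_iou(p, g) >= tau for p, g in zip(pred_boxes, gt_boxes)]
    return float(np.mean(hits))

def delta_accuracy(pred_depth, gt_depth, k=1):
    """Depth threshold accuracy delta_k: share of pixels with
    max(d_pred/d_gt, d_gt/d_pred) < 1.25**k (depths assumed positive)."""
    ratio = np.maximum(pred_depth / gt_depth, gt_depth / pred_depth)
    return float((ratio < 1.25 ** k).mean())
```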
4. Notable Advances and Empirical Baselines
Recent benchmarks have enabled robust empirical comparison of both traditional and advanced AI architectures:
- Deep CNN features (AlexNet, VGG, ResNet, etc.) substantially outperform handcrafted descriptors (LBP, SIFT, GIST), with fine-tuning on domain data boosting performance by up to 6 percentage points in OA (Cheng et al., 2017).
- Multi-modal and open-world models benefit from rich, hierarchical annotation and domain-specific pretraining (e.g., SSL4EO-L for Landsat) (Corley et al., 10 Jun 2025).
- Federated learning on authentic data splits (FedRS-Bench) consistently outperforms local training, but trade-offs in privacy and convergence emerge under real data heterogeneity (Zhao et al., 13 May 2025).
- Evaluations on VRSBench and XLRS-Bench reveal that general-purpose VLMs perform well on high-level recognition but fail at instance counting, grounding, and spatiotemporal reasoning, with accuracy on such tasks remaining below 50% (Li et al., 18 Jun 2024, Wang et al., 31 Mar 2025).
- RS3DBench's strongest depth-estimation baseline, which conditions a Stable Diffusion U-Net on both imagery and textual semantics via cross-attention, sets new state-of-the-art results for global terrain estimation (Wang et al., 23 Sep 2025); the mechanism is sketched below.
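The conditioning mechanism referenced above can be illustrated with a generic cross-attention layer in which spatial image tokens query text-embedding tokens. This PyTorch sketch uses illustrative dimensions and is not the RS3DBench baseline itself.

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Minimal cross-attention block: spatial (image) tokens query text
    embeddings, as in a text-conditioned diffusion U-Net. Dimensions are
    illustrative, not those of any specific published model."""

    def __init__(self, dim_img=320, dim_txt=768, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(
            embed_dim=dim_img, kdim=dim_txt, vdim=dim_txt,
            num_heads=n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim_img)

    def forward(self, img_tokens, txt_tokens):
        # img_tokens: (B, H*W, dim_img); txt_tokens: (B, T, dim_txt)
        attended, _ = self.attn(query=self.norm(img_tokens),
                                key=txt_tokens, value=txt_tokens)
        return img_tokens + attended  # residual connection

# Usage: fuse terrain-description embeddings into flattened U-Net features.
feats = torch.randn(2, 64 * 64, 320)   # (B, H*W, C) feature map
text = torch.randn(2, 77, 768)         # text encoder output
out = CrossAttention()(feats, text)    # -> (2, 4096, 320)
```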
5. Limitations, Domain Gaps, and Goal-Oriented Design
Key challenges persist for the benchmarking of remote sensing algorithms:
- Domain Gap: Results demonstrate that evaluation protocols and model preprocessing (e.g., resizing, normalization) significantly impact performance; strict alignment to pre-training configurations boosts accuracy by up to +32% OA on So2Sat (Corley et al., 2023). See the sketch after this list.
- Semantic and Covariate Shift: OpenEarthSensing systematically partitions classes and domains, quantifying resilience to both semantic and operational drift (Xiang et al., 28 Feb 2025).
- Instance Detail and Reasoning Complexity: Large-scale and UHR benchmarks (e.g., XLRS-Bench, RSMMVP) expose the inability of CLIP-based and VLM-based models to accurately ground or count objects—accuracy on fine-grained grounding remains as low as 1–3% at high IoU, while human baselines approach 90%+ (Adejumo et al., 20 Mar 2025, Wang et al., 31 Mar 2025).
- Federated Heterogeneity: Label imbalance, volume skew, and domain shifts challenge consistent performance in federated models, as documented in FedRS-Bench’s comparison of global, local, and centralized solutions across partition schemes (Zhao et al., 13 May 2025).
- Data Quality and Standardization: Human-in-the-loop, multi-stage verification is now routine for newer benchmarks (e.g., VRSBench), complementing automatic or LLM-generated meta-annotation (Li et al., 18 Jun 2024), but coverage and ecological extensibility remain ongoing goals (Lines et al., 2022).
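To make the domain-gap point concrete, the sketch below shows an evaluation transform aligned to a checkpoint's pretraining configuration, assuming torchvision and hypothetical per-band statistics. The constants are placeholders; in practice they must be taken from the actual pretraining recipe rather than ImageNet defaults.

```python
import torch
from torchvision import transforms

# Hypothetical per-band statistics and input size; substitute the *exact*
# values used when the checkpoint was pretrained.
PRETRAIN_MEAN = [0.370, 0.345, 0.310]
PRETRAIN_STD = [0.210, 0.195, 0.185]
PRETRAIN_SIZE = 224

eval_transform = transforms.Compose([
    transforms.Resize(PRETRAIN_SIZE, antialias=True),  # match input resolution
    transforms.CenterCrop(PRETRAIN_SIZE),
    transforms.Normalize(mean=PRETRAIN_MEAN, std=PRETRAIN_STD),
])

x = torch.rand(3, 264, 264)      # e.g. an RGB patch scaled to [0, 1]
x_aligned = eval_transform(x)    # preprocessing now matches pretraining
```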
6. Impact on the Remote Sensing Research Ecosystem
Comprehensive and protocolized benchmarks have transformed remote sensing AI by:
- Providing robust testbeds for model selection, ablation studies, and transfer learning paradigms.
- Facilitating the development of open-world, multi-modal, and foundation models explicitly tailored for geospatial, environmental, and urban contexts.
- Driving methodological innovation in data fusion, spectral band selection, progressive learning, federated adaptation, and semantic-grounded reasoning.
- Enabling community-driven, open science collaborations and standardized evaluation practices, which foster reproducibility and accelerate the transition of research advances to operational and decision-support systems.
7. Future Directions
Emerging lines of investigation indicated by the surveyed benchmarks include:
- Expansion of benchmarks to include more diverse modalities (SAR, LiDAR, hyperspectral), finer class ontologies, and ecological/geographic heterogeneity.
- Development of automated and community-verified pipelines for benchmark extension, emphasizing multi-ecosystem, multi-region representation (Bountos et al., 2023).
- Progress toward unified evaluation of spatiotemporal reasoning, 3D understanding, multimodal fusion, and federated settings that match real-world deployment and operational constraints (Wang et al., 23 Sep 2025, Zhao et al., 13 May 2025).
- Systematic measurement of foundation model capabilities and limitations via large-scale, hierarchical, and multi-task benchmarks, with research focused on reducing the gap between human-level and model-level performance in advanced perceptual and reasoning tasks (An et al., 27 Nov 2024, Wang et al., 31 Mar 2025).
In summary, remote sensing benchmarks are central to the quantitative evaluation of methodologies in geospatial AI, underpinning progress in algorithm design, model generalization, and the operational deployment of remote sensing analytics for observational, monitoring, and interpretive applications.