EarthReason: Geospatial Pixel Benchmark

Updated 18 May 2026

The EarthReason dataset is a large-scale benchmark for open-vocabulary geospatial pixel reasoning, linking implicit language queries with precise binary masks.
It comprises 5,434 high-resolution images and over 30,000 annotated question–answer pairs, with expert-vetted segmentation masks.
Evaluation with cIoU, gIoU, and accuracy supports model comparison, with SegEarth-R1 achieving top results on the benchmark.

The EarthReason dataset is a large-scale, manually annotated benchmark created to enable and rigorously evaluate geospatial pixel reasoning, in which remote sensing models must interpret natural-language queries that express implicit spatial, semantic, or functional intent and return pixel-level segmentation masks that localize the queried regions. It was introduced as the first such resource for open-vocabulary, instruction-driven remote sensing segmentation and is closely tied to the design and evaluation of the SegEarth-R1 model (Zhou et al., 9 Feb 2026, Li et al., 13 Apr 2025).

1. Purpose and Formal Definition

EarthReason was constructed to advance open-ended geospatial pixel reasoning in remote sensing, moving beyond closed-set segmentation and explicit object detection. The central task formalized by EarthReason is: given a high-resolution remote sensing image $I\in\mathbb{R}^{H\times W\times C}$ and a natural-language query $Q = \{q_1, \dots, q_L\}$, produce a binary mask $M\in\{0,1\}^{H\times W}$ that marks the region(s) corresponding to the answer identified by a spatial-reasoning operator $R(I, Q)$. The queries are implicit and often require models to resolve complex spatial and semantic relationships rather than follow direct referring expressions (Li et al., 13 Apr 2025).
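The input/output contract of this task can be illustrated with a minimal sketch. The function name `reason_and_segment`, the image size, and the example query are all hypothetical, and the body is a placeholder that returns an empty mask rather than any real model; only the shapes and value ranges follow the definition above.

```python
import numpy as np

def reason_and_segment(image: np.ndarray, query: str) -> np.ndarray:
    """Hypothetical stand-in for R(I, Q): returns a binary H x W mask.

    A real model would reason over the query; this placeholder
    predicts "no target" everywhere so that only the contract is shown.
    """
    if image.ndim != 3:
        raise ValueError("expected an H x W x C image")
    height, width, _ = image.shape
    return np.zeros((height, width), dtype=np.uint8)

# Illustrative inputs (made up, not taken from the dataset).
image = np.zeros((512, 512, 3), dtype=np.float32)   # I in R^{H x W x C}
query = "Which area would be used for recreation on the water?"
mask = reason_and_segment(image, query)              # M in {0,1}^{H x W}
```

The mask has the same height and width as the image and drops the channel axis, matching $M\in\{0,1\}^{H\times W}$.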

2. Dataset Construction and Annotation Protocol

Scope and Composition

Images: 5,434 high-resolution remote sensing images, sampled from established benchmarks (Million-AID and fMoW), with spatial resolutions ranging from 0.5 m to 153 m per pixel.
Masks: Each image is paired with a single, high-quality, manually annotated binary mask ($M$), drawn or corrected by expert annotators to reflect exact pixel-level ground truth.
Queries and Answers: More than 30,000 implicit question–answer pairs are provided, with each image associated with an average of six diverse queries and three acceptable answers per question.
Class Coverage: 28 land-use, land-cover, and functional scene categories are represented, spanning both urban and rural environments. Four challenging categories (“basketball court,” “island,” “lake,” “stadium”) are deliberately held out from the training split to support out-of-domain generalization evaluation.
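One way to picture a sample and the held-out-category rule is the sketch below. The record layout and field names are assumptions made for illustration and are not taken from the dataset release; only the four held-out category names come from the description above.

```python
from dataclasses import dataclass, field
from typing import List

# The four categories described above as held out from the training split.
HELD_OUT_CATEGORIES = frozenset({"basketball court", "island", "lake", "stadium"})

@dataclass
class EarthReasonSample:
    """Hypothetical record layout; field names are illustrative only."""
    image_path: str
    mask_path: str
    category: str
    questions: List[str] = field(default_factory=list)
    answers: List[str] = field(default_factory=list)

def is_held_out(sample: EarthReasonSample) -> bool:
    """True if the sample belongs to a category excluded from training."""
    return sample.category in HELD_OUT_CATEGORIES

def training_pool(samples: List[EarthReasonSample]) -> List[EarthReasonSample]:
    """Keep only samples whose category may appear in the training split."""
    return [s for s in samples if not is_held_out(s)]
```

For example, a sample with category "lake" would be excluded by `training_pool`, while a sample from any category outside the held-out set would be kept.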

Annotation Workflow

Image Selection: ∼200 images per category from Million-AID, supplemented by 800 from fMoW and 200 "empty-target" images (no annotated region).
Query Generation: For each image/scene, GPT-4o is prompted to generate one canonical question–answer pair, and GPT-3.5 rephrases each into six question variants and three alternative answers.
Mask Labeling: All mask annotations are created by experienced remote sensing experts, either from scratch (fine polygon marking) or semi-automatically using SAM-H as an assistant for object proposal, with all results cross-validated among annotators for consistency and correctness (Li et al., 13 Apr 2025).

Data Splits

Split	Images	Notes
Train	2,371	∼6 questions and 3 answers per image; no held-out categories
Validation	1,135
Test	1,928	Includes the four held-out categories, which appear only at test time
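As a quick consistency check, the split sizes in the table can be summed and converted to shares of the full image set. The snippet uses only the counts listed above; the variable names are illustrative.

```python
# Image counts per split, as listed in the table above.
splits = {"train": 2371, "validation": 1135, "test": 1928}

total = sum(splits.values())

# Share of each split as a percentage of all images, rounded to one decimal.
shares = {name: round(100 * count / total, 1) for name, count in splits.items()}

print(total)   # 5434
print(shares)  # {'train': 43.6, 'validation': 20.9, 'test': 35.5}
```

The total matches the 5,434 images reported for the dataset, and the test split is relatively large at roughly a third of all images.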

3. Task and Evaluation Protocol

The dataset is designed for evaluating models in the challenging setting of open-vocabulary, natural-language-driven pixel reasoning:

Input: $(I, Q)$ — remote sensing image and implicit natural-language question.
Output: $M$ — binary segmentation mask matching the semantic target.
Evaluation Metrics:
- Cumulative Intersection-over-Union (cIoU): Total intersection divided by total union, with both accumulated over all positive samples.
- Global Intersection-over-Union (gIoU): IoU computed per image and then averaged over all images.
- Accuracy (Acc): Fraction of correctly classified pixels.
These metrics are reported both in-domain and on the four held-out "out-of-domain" categories to test generalization.
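The two IoU metrics can be sketched in NumPy as follows. This assumes one common convention in reasoning-segmentation work: cIoU divides the summed intersection by the summed union, and gIoU averages per-image IoU. The function name `iou_metrics` and the handling of empty masks are assumptions, not the benchmark's official evaluation code.

```python
import numpy as np

def iou_metrics(preds, targets):
    """Return (cIoU, gIoU) for paired sequences of binary masks."""
    total_inter = 0
    total_union = 0
    per_image = []
    for pred, target in zip(preds, targets):
        pred = np.asarray(pred, dtype=bool)
        target = np.asarray(target, dtype=bool)
        inter = int(np.logical_and(pred, target).sum())
        union = int(np.logical_or(pred, target).sum())
        total_inter += inter
        total_union += union
        # Assumption: an empty prediction on an empty target is a perfect match.
        per_image.append(1.0 if union == 0 else inter / union)
    ciou = total_inter / total_union if total_union else 1.0
    giou = float(np.mean(per_image))
    return ciou, giou

# Two toy 2x2 examples: a half-correct mask and a perfect one.
preds = [[[1, 1], [0, 0]], [[1, 1], [1, 1]]]
targets = [[[1, 0], [0, 0]], [[1, 1], [1, 1]]]
ciou, giou = iou_metrics(preds, targets)
print(round(ciou, 4), round(giou, 4))  # 0.8333 0.75
```

In the toy example the larger mask dominates cIoU (5/6), whereas gIoU weights both images equally (0.75), which is why the two metrics can rank models differently when target sizes vary.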

Model	cIoU_test	gIoU_test
SegEarth-R1	68.25	70.75
PSALM	64.61	68.30
PixelLM	59.22	60.01
LISA	59.10	60.88

SegEarth-R1 achieves the highest test cIoU and gIoU on EarthReason, validating the dataset’s utility as a benchmark for instruction-driven, open-vocabulary pixel reasoning (Li et al., 13 Apr 2025).

4. Methodological Significance

EarthReason operationalizes open-vocabulary geospatial pixel reasoning, establishing it as a rigorous, large-scale task. Key methodological innovations include:

Implicit, Natural-Language Queries: Unlike referring expression or closed-set segmentation tasks, EarthReason’s queries are often open-ended, requiring models to interpret spatial, functional, and contextual intent rather than simple object labels.
Fine-Grained Manual Masks: The dataset emphasizes expert-vetted, pixel-level ground truth rather than weak box-level annotations or synthetic masks.
Cross-Validated Annotation: A multi-expert annotation and validation pipeline reduces annotation bias and yields high-fidelity targets.

5. Benchmark Impact and Role in Model Design

EarthReason’s release addresses crucial limitations of prior remote sensing benchmarks:

Open-Vocabulary Testing: The test split includes categories excluded from training, establishing a protocol for out-of-domain generalization.
Evaluation of Reasoning Models: The benchmark is designed for evaluating systems that interpret and act on implicit instructions requiring spatial reasoning, as exemplified by SegEarth-R1 (Li et al., 13 Apr 2025), GRASP (Jiang et al., 23 Aug 2025), RemoteReasoner (Yao et al., 25 Jul 2025), and others.
Downstream Use: The dataset is used both for training and as a rigorous testbed for comparing pixel-level accuracy, generalization to new classes, and robustness of reasoning algorithms (Li et al., 13 Apr 2025).

6. Limitations and Extension Directions

Single Mask per Image: The current annotation protocol provides a single mask per query/image; multi-query, multi-mask, and conversational extensions are not natively supported.
Query Complexity: While the dataset captures a variety of implicit queries, further increasing logical, compositional, or multi-step complexity (as explored in other datasets like SQuID (Massih et al., 19 Jan 2026) and Terra-CoT (Shu et al., 19 Mar 2026)) is a plausible direction.
Ancillary Data: EarthReason focuses on RGB/optical imagery; future work may integrate spectral bands, DEM, or temporal series for more complex geospatial reasoning.

EarthReason provides a foundational resource for benchmarking and advancing models in open-vocabulary, natural-language-driven geospatial pixel reasoning, uniquely combining high-resolution imagery, expert manual masks, and a diverse set of implicit queries (Li et al., 13 Apr 2025, Zhou et al., 9 Feb 2026). Its structure and evaluation protocols have influenced subsequent datasets and remain integral to current model development in this domain.