GeoBenchX: Geospatial AI Benchmark

Updated 13 April 2026

GeoBenchX is an open, comprehensive benchmark designed to evaluate LLM performance in executing complex geospatial workflows via standardized tool invocations.
It organizes about 200 tasks into distinct categories, testing both solvable and unsolvable cases to assess analytical reasoning and hallucination resistance.
The framework employs multiple LLM judges to compare semantic equivalence and token efficiency, providing actionable insights into GeoAI system capabilities.

GeoBenchX evaluates how well LLMs execute complex, multi-step geospatial workflows through standardized tool invocation. It targets the practical analytic and visualization requirements of geographic information systems (GIS) professionals, with tasks that exercise tabular, vector, and raster data handling, multi-stage spatial analyses, and rejection of unanswerable requests (hallucination resistance). The result is a reproducible, transparent framework for measuring LLM proficiency in GeoAI (Krechetova et al., 23 Mar 2025).

1. Benchmark Structure and Design

GeoBenchX employs a tool-calling agent framework, abstracting geospatial operations as callable Python-style functions. The agent accesses 23 distinct tools, each mapping to commonly required GIS procedures, including but not limited to: loading and merging datasets, spatial filtering, buffer creation, raster-vector analyses, and map production. All tools utilize established libraries (GeoPandas, Rasterio, GDAL, Shapely, Pandas, NumPy) and are exposed as discrete API calls.

The task set is systematically categorized to scale in complexity:

Category	Task Count	Typical Operations
Merge-visualize	36	Attribute merging, mapping
Process-merge-visualize	56	Preprocessing, merge, mapping
Spatial operations	53	Buffer, spatial join, aggregation
Heatmaps, contour lines	54	Visualization, raster/point fusion

Approximately 200 tasks are included, stratified into solvable and intentionally unsolvable variants. This dual design tests both the analytic reasoning and failure-mode awareness of candidate LLMs.

2. Task Taxonomy and Tool Interface

Each GeoBenchX task specifies a problem in natural language (e.g., “Map the relationship between GDP per capita and electric power consumption per capita globally”) and expects the agent to translate it into a sequence of tool calls. Provided functions include:

Data ingestion and discovery (load_data, load_geodata, get_raster_path)
Spatial analysis (create_buffer, select_features_by_spatial_relationship, analyze_raster_overlap)
Statistical computation (calculate_column_statistics, scale_column_by_value)
Visualization (make_choropleth_map, make_heatmap, plot_contour_lines)
Unsolvable task identification (reject_task)

Tasks range from simple attribute tabulation to compounded spatial and raster queries requiring multi-layer data integration and procedural chaining.

Half of the instances in each category are unsolvable due to limitations in the available data or tool set (e.g., requests for unavailable geographic layers or unsupported analysis types). The agent’s correct use of reject_task is foundational to benchmarking hallucination resistance.
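To make the idea of a tool-call sequence concrete, the following minimal sketch represents a solution as an ordered list of calls. The tool names are taken from the list above; the `ToolCall` class, the argument names, the dataset labels, and the helper `is_rejection` are hypothetical and are not part of the GeoBenchX code base.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToolCall:
    """One step of a solution: a tool name plus keyword arguments."""
    tool: str
    args: dict[str, Any] = field(default_factory=dict)


# Hypothetical solvable task: argument names and dataset labels are invented.
solvable_solution = [
    ToolCall("load_geodata", {"dataset": "countries"}),
    ToolCall("load_data", {"dataset": "gdp_per_capita"}),
    ToolCall("make_choropleth_map", {"column": "gdp_per_capita"}),
]

# An unsolvable task is answered with a single rejection call.
unsolvable_solution = [ToolCall("reject_task")]


def is_rejection(solution: list[ToolCall]) -> bool:
    """Return True if the solution declines the task via reject_task."""
    return any(step.tool == "reject_task" for step in solution)


def tool_names(solution: list[ToolCall]) -> list[str]:
    """Return the ordered tool names, ignoring arguments."""
    return [step.tool for step in solution]
```

Keeping each solution as plain data makes it straightforward to compare an agent's sequence with a reference sequence, or to check whether the agent rejected a task it should have rejected.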

3. Automated Evaluation Methodology

GeoBenchX utilizes an LLM-as-Judge framework. Both reference implementations and agent solutions are compiled as canonicalized sequences of function calls. Three LLM judges—Claude 3.5-sonnet, GPT-4o, Gemini 2.0—evaluate solution equivalence at the semantic (not merely syntactic) level, assigning scores:

2: perfect match
1: partial match
0: no match

Judges tolerate permissible variation in step order and in arguments. Internal validation found judge-human agreement of 91% (Sonnet 3.5), 86% (GPT-4o), and 82% (Gemini 2.0).
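As an illustration of how a judge-human agreement figure could be obtained, the sketch below assumes agreement means the fraction of tasks on which the judge's 0/1/2 score equals the human annotator's score. That definition, the function name, and the example scores are assumptions made for this example; the source does not spell out the exact procedure.

```python
def agreement_rate(judge_scores: list[int], human_scores: list[int]) -> float:
    """Fraction of tasks where judge and human assign the same 0/1/2 score."""
    if len(judge_scores) != len(human_scores):
        raise ValueError("score lists must have the same length")
    if not judge_scores:
        raise ValueError("at least one scored task is required")
    matches = sum(j == h for j, h in zip(judge_scores, human_scores))
    return matches / len(judge_scores)


# Invented scores for four tasks (2 = perfect, 1 = partial, 0 = no match).
judge = [2, 1, 0, 2]
human = [2, 1, 1, 2]
rate = agreement_rate(judge, human)  # 3 of 4 tasks agree -> 0.75
```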

Task-level and overall scores are computed as mean match frequencies. Standard metrics—Precision, Recall, F₁—are derived with “true positive” defined as a perfect match for a solvable task, “false positive” for matches in unsolvable situations, etc. GeoBenchX also reports the relative step-length difference $\Delta_\text{steps}$ between candidate and reference solutions.
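The metrics above can be sketched in a few lines. Precision, recall, and F₁ use their standard definitions over confusion counts; how each task outcome is mapped to a true positive, false positive, or false negative is left to the caller, because the source only partly specifies it. The formula for the relative step-length difference, (candidate − reference) / reference, is an assumption made for this example.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Standard precision, recall, and F1 from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def relative_step_difference(candidate_steps: int, reference_steps: int) -> float:
    """Assumed definition: (candidate - reference) / reference."""
    if reference_steps <= 0:
        raise ValueError("reference solution must contain at least one step")
    return (candidate_steps - reference_steps) / reference_steps


# Invented counts: 8 true positives, 2 false positives, 2 false negatives.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)

# A candidate that takes 6 steps where the reference takes 4 is 50% longer.
delta = relative_step_difference(candidate_steps=6, reference_steps=4)
```

Under this assumed definition, a positive value means the agent took more steps than the reference and a negative value means it took fewer.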

4. Experimental Results and Comparative Model Performance

Experiments deploy a LangGraph ReAct agent (temperature 0, max 25 steps) and seven commercial LLM APIs:

GPT-4o, GPT-4o-mini, o3-mini (OpenAI)
Gemini 2.0 (Google)
Claude 3.5-sonnet, Claude 3.7-sonnet, Claude 3.5-haiku (Anthropic)
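The step cap can be pictured with a plain-Python loop that alternates between choosing a tool call and observing its result. This is a conceptual sketch only: it does not use LangGraph's API, a scripted policy stands in for the LLM, and the `finish` action, the stub tools, and all argument names are hypothetical.

```python
from typing import Any, Callable

MAX_STEPS = 25  # step cap reported for the experiments

Action = tuple[str, dict[str, Any]]
Step = tuple[str, dict[str, Any], Any]


def run_agent(
    choose_action: Callable[[str, list[Step]], Action],
    tools: dict[str, Callable[..., Any]],
    task: str,
    max_steps: int = MAX_STEPS,
) -> list[Step]:
    """Run a choose-act-observe loop until the policy finishes or the cap is hit."""
    history: list[Step] = []
    for _ in range(max_steps):
        name, args = choose_action(task, history)
        if name == "finish":  # hypothetical terminal action
            break
        observation = tools[name](**args)
        history.append((name, args, observation))
    return history


# Stub tools that only return strings; real tools would touch geodata.
stub_tools = {
    "load_geodata": lambda name: f"loaded {name}",
    "make_choropleth_map": lambda layer: f"map of {layer}",
}

# A fixed plan stands in for the LLM's (temperature 0) decisions.
PLAN: list[Action] = [
    ("load_geodata", {"name": "countries"}),
    ("make_choropleth_map", {"layer": "countries"}),
]


def scripted_policy(task: str, history: list[Step]) -> Action:
    """Return the next planned call, then signal completion."""
    if len(history) < len(PLAN):
        return PLAN[len(history)]
    return ("finish", {})


trace = run_agent(scripted_policy, stub_tools, "demo task")
```

A policy that never finishes is cut off after `max_steps` calls, which is the role the 25-step limit plays in the experiments.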

All tests use locally held copies of 21 vector sets, 18 CSV tables, and 11 raster datasets for reproducibility. Token consumption and solution accuracy are compared across both solvable and unsolvable subsets:

Model	Solvable Match Rate	Unsolvable Reject Rate	Token Consumption*
Sonnet 3.5	53%	37%	12M in / 157K out
Sonnet 3.7	50%	14%	24M in / 269K out
GPT-4o	44%	59%	7.5M in / 57K out
Gemini 2.0	43%	46%	7M in / 40K out
Haiku 3.5	43%	20%	15M in / 203K out
o3-mini	21%	76%	3.5M in / 561K out
GPT-4o-mini	16%	62%	8M in / 55K out

*Token metrics report total for all 202 tasks (input/output, rounded).

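Anthropic models consume substantially more input tokens (12M to 24M across the full task set), and the two Sonnet models reach the highest match rates on solvable tasks. The OpenAI models reject unsolvable tasks most reliably (59% to 76%), although o3-mini and GPT-4o-mini match far fewer solvable tasks. o3-mini uses the fewest input tokens (3.5M) yet produces by far the most output tokens (561K), whereas GPT-4o and GPT-4o-mini generate comparatively short outputs.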
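For a sense of scale, the rounded input-token totals in the table can be converted to a per-task average. The sketch below assumes the totals are spread evenly over the 202 tasks named in the table footnote; because the totals are rounded, the results are only approximate.

```python
TOTAL_TASKS = 202  # task count stated in the table footnote

# Rounded input-token totals copied from the table above.
input_tokens = {
    "Sonnet 3.5": 12_000_000,
    "Sonnet 3.7": 24_000_000,
    "GPT-4o": 7_500_000,
    "Gemini 2.0": 7_000_000,
    "Haiku 3.5": 15_000_000,
    "o3-mini": 3_500_000,
    "GPT-4o-mini": 8_000_000,
}


def per_task(total_tokens: int, n_tasks: int = TOTAL_TASKS) -> float:
    """Average tokens per task, assuming an even spread across tasks."""
    return total_tokens / n_tasks


avg_input = {model: per_task(total) for model, total in input_tokens.items()}

# Models ordered from most to least input tokens per task.
ranking = sorted(avg_input, key=avg_input.get, reverse=True)
```

On these figures, Sonnet 3.5 averages roughly 59K input tokens per task and o3-mini roughly 17K.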

5. Error Analysis and Observed Limitations

LLMs consistently exhibit domain-specific error patterns:

Geometric reasoning failures: e.g., centroid vs. polygon confusion, improper handling of lines and polygons.
World knowledge errors: e.g., reliance on outdated country groupings or idiosyncratic spelling variations.
Inefficient data procedures: e.g., merging datasets prior to appropriate filtering, resulting in superfluous tool calls.
Incomplete procedural steps: omission of required attribute or spatial filters.
Unsolvable task mishandling: hallucination of answers when data are unavailable, or arbitrary revision of the user query to suit available information.

Documented mitigations include clarification in tool descriptions, post-inference sanity checking of coverage, and targeted instruction tuning for spatial reasoning.
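One way to read "sanity checking of coverage" is a check that every entity the task asks about is actually present in the loaded data. The sketch below illustrates that interpretation only; the function name and the example values are hypothetical, and the source does not describe how the check is implemented.

```python
def coverage_gaps(requested: list[str], available: list[str]) -> list[str]:
    """Return requested entities that are missing from the available data."""
    return sorted(set(requested) - set(available))


# Invented example: the task names three regions, the dataset holds three others.
requested_regions = ["Kenya", "Peru", "Atlantis"]
dataset_regions = ["Kenya", "Peru", "Chile"]

missing = coverage_gaps(requested_regions, dataset_regions)

# A non-empty result suggests the answer would rest on data that is not there.
should_flag = bool(missing)
```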

6. Open-Source Release and Community Adoption

All GeoBenchX artifacts are distributed under an open-source license, including:

The full suite of ~200 tasks with solvable/unsolvable labels and multiple reference solutions
Data snapshots (statistical tables, vector datasets, rasters)
LangGraph ReAct agent code and relevant prompts
LLM-as-Judge evaluation and scoring code
Data pipeline for task extension via LLM assistance

These resources are available at https://github.com/Solirinai/GeoBenchX (Krechetova et al., 23 Mar 2025).

7. Relevance and Impact in Geospatial AI

GeoBenchX establishes an unbiased, interpretable, and reproducible methodology for measuring LLMs’ intermediate and complex geospatial thinking, procedural tool use, and error management. It uniquely integrates solvability assessment into task evaluation—a core competency for GIS AI practitioners. The benchmark’s multi-faceted dataset, stratified task complexity, and automated judgment system enable precise comparison of LLMs’ reasoning, hallucination resistance, and procedural efficiency on real-world GIS workflows, contributing directly to the robust evaluation and ongoing development of LLM-based GeoAI systems.

References

Krechetova et al. (23 Mar 2025). GeoBenchX: Benchmarking LLMs for Multistep Geospatial Tasks.
