GeoBenchX: Geospatial AI Benchmark
- GeoBenchX is an open, comprehensive benchmark designed to evaluate LLM performance in executing complex geospatial workflows via standardized tool invocations.
- It organizes about 200 tasks into distinct categories, testing both solvable and unsolvable cases to assess analytical reasoning and hallucination resistance.
- The framework employs multiple LLM judges to assess the semantic equivalence of agent and reference solutions, and it reports token usage alongside accuracy, providing actionable insights into GeoAI system capabilities.
GeoBenchX is an open, comprehensive benchmark created to evaluate the capabilities of LLMs in executing complex, multi-step geospatial workflows through standardized tool invocation. Conceived to address the practical analytic and visualization requirements faced by geographic information systems (GIS) professionals, GeoBenchX encompasses a suite of tasks that exercise tabular, vector, and raster data handling, multi-stage spatial analyses, and the rejection of unsolvable requests (hallucination resistance). The result is a reproducible, transparent framework for measuring LLM proficiency in GeoAI (Krechetova et al., 23 Mar 2025).
1. Benchmark Structure and Design
GeoBenchX employs a tool-calling agent framework, abstracting geospatial operations as callable Python-style functions. The agent accesses 23 distinct tools, each mapping to commonly required GIS procedures, including but not limited to: loading and merging datasets, spatial filtering, buffer creation, raster-vector analyses, and map production. All tools utilize established libraries (GeoPandas, Rasterio, GDAL, Shapely, Pandas, NumPy) and are exposed as discrete API calls.
The task set is systematically categorized to scale in complexity:
| Category | Task Count | Typical Operations |
|---|---|---|
| Merge-visualize | 36 | Attribute merging, mapping |
| Process-merge-visualize | 56 | Preprocessing, merge, mapping |
| Spatial operations | 53 | Buffer, spatial join, aggregation |
| Heatmaps, contour lines | 54 | Visualization, raster/point fusion |
Approximately 200 tasks are included, stratified into solvable and intentionally unsolvable variants. This dual design tests both the analytic reasoning and failure-mode awareness of candidate LLMs.
2. Task Taxonomy and Tool Interface
Each GeoBenchX task states a problem in natural language, e.g., “Map the relationship between GDP per capita and electric power consumption per capita globally”, and expects the agent to translate it into a sequence of tool calls. The provided functions include:
- Data ingestion and discovery (load_data, load_geodata, get_raster_path)
- Spatial analysis (create_buffer, select_features_by_spatial_relationship, analyze_raster_overlap)
- Statistical computation (calculate_column_statistics, scale_column_by_value)
- Visualization (make_choropleth_map, make_heatmap, plot_contour_lines)
- Unsolvable task identification (reject_task)
Tasks range from simple attribute tabulation to compounded spatial and raster queries requiring multi-layer data integration and procedural chaining.
Half of the instances in each category are unsolvable due to limitations in the available data or tool set (e.g., requests for unavailable geographic layers or unsupported analysis types). The agent’s correct use of reject_task is foundational to benchmarking hallucination resistance.
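A solution, whether reference or candidate, can be pictured as an ordered list of tool calls. The sketch below is illustrative only: the tool names load_data, load_geodata, make_choropleth_map, and reject_task come from the list above, but the ToolCall class, the merge_dataframes step, and every argument name and file name are assumptions made for this example rather than the benchmark's actual schema.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToolCall:
    """One step of a solution: a tool name plus keyword arguments (hypothetical schema)."""
    name: str
    args: dict[str, Any] = field(default_factory=dict)

    def canonical(self) -> str:
        # Sort arguments by key so that equal calls always render identically.
        rendered = ", ".join(f"{k}={self.args[k]!r}" for k in sorted(self.args))
        return f"{self.name}({rendered})"


# Hypothetical solvable task: map a national indicator as a choropleth.
solvable = [
    ToolCall("load_data", {"dataset": "gdp_per_capita.csv", "output": "gdp"}),
    ToolCall("load_geodata", {"dataset": "countries.shp", "output": "countries"}),
    # "merge_dataframes" is an assumed tool name used only for illustration.
    ToolCall("merge_dataframes", {"left": "countries", "right": "gdp", "on": "iso3"}),
    ToolCall("make_choropleth_map", {"layer": "countries", "column": "gdp"}),
]

# Hypothetical unsolvable task: the agent should decline instead of guessing.
unsolvable = [ToolCall("reject_task")]


def is_rejection(solution: list[ToolCall]) -> bool:
    """True when the solution consists solely of a reject_task call."""
    return len(solution) == 1 and solution[0].name == "reject_task"


def canonicalize(solution: list[ToolCall]) -> list[str]:
    """Render a solution as a list of canonical call strings."""
    return [call.canonical() for call in solution]


for line in canonicalize(solvable):
    print(line)
print(is_rejection(unsolvable))
```

Rendering each call with sorted arguments gives two solutions a comparable textual form, which is one plausible way to prepare them for a judge; the benchmark's own canonicalization may differ.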
3. Automated Evaluation Methodology
GeoBenchX uses an LLM-as-Judge framework. Both reference implementations and agent solutions are represented as canonicalized sequences of function calls. Three LLM judges (Claude 3.5 Sonnet, GPT-4o, and Gemini 2.0) evaluate whether a candidate solution is semantically, not merely syntactically, equivalent to the reference, assigning one of three scores:
- 2: perfect match
- 1: partial match
- 0: no match
Judges tolerate permissible differences in step order and in argument values. Internal validation found judge-human agreement of 91% (Claude 3.5 Sonnet), 86% (GPT-4o), and 82% (Gemini 2.0).
Task-level and overall scores are computed as mean match frequencies. Standard metrics (precision, recall, and F₁) are derived with a “true positive” defined as a perfect match on a solvable task, a “false positive” as a match on an unsolvable task, and so on. GeoBenchX also reports the relative difference in step count between candidate and reference solutions.
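The aggregation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the benchmark's scoring code: the function names are invented here, the confusion counts are passed in by the caller because the text spells out only the true-positive and false-positive cases, and the step-length formula (candidate minus reference, divided by reference) is an assumed reading of "relative step-length difference". The numbers in the usage lines are made up.

```python
from collections import Counter


def score_frequencies(scores: list[int]) -> dict[int, float]:
    """Fraction of tasks receiving each judge score (2, 1, 0)."""
    counts = Counter(scores)
    total = len(scores)
    return {s: counts.get(s, 0) / total for s in (2, 1, 0)}


def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Standard definitions; returns 0.0 wherever a denominator would be zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def relative_step_difference(candidate_steps: int, reference_steps: int) -> float:
    """Assumed formula: (candidate - reference) / reference."""
    return (candidate_steps - reference_steps) / reference_steps


# Made-up judge scores for eight tasks, purely for illustration.
scores = [2, 2, 1, 0, 2, 1, 0, 0]
freq = score_frequencies(scores)
p, r, f1 = precision_recall_f1(tp=3, fp=1, fn=2)
extra = relative_step_difference(candidate_steps=6, reference_steps=4)
print(freq, p, r, round(f1, 3), extra)
```

A positive step difference means the candidate used more tool calls than the reference, which is one way to flag the superfluous steps discussed in the error analysis.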
4. Experimental Results and Comparative Model Performance
Experiments deploy a LangGraph ReAct agent (temperature 0, max 25 steps) and seven commercial LLM APIs:
- GPT-4o, GPT-4o-mini, o3-mini (OpenAI)
- Gemini 2.0 (Google)
- Claude 3.5 Sonnet, Claude 3.7 Sonnet, Claude 3.5 Haiku (Anthropic)
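The effect of the 25-step cap can be illustrated with a plain-Python loop. This is not LangGraph code and not the benchmark's agent: run_agent, scripted_policy, and the stub tools are hypothetical stand-ins, and the policy function takes the place of the LLM that would normally choose the next tool call.

```python
from typing import Any, Callable

MAX_STEPS = 25  # step cap stated in the text

# A policy maps (task, history) to the next (tool_name, kwargs) pair.
Policy = Callable[[str, list], tuple[str, dict[str, Any]]]


def run_agent(policy: Policy, tools: dict[str, Callable[..., Any]],
              task: str, max_steps: int = MAX_STEPS) -> list:
    """Alternate between choosing a tool and executing it, up to max_steps."""
    history: list = []
    for _ in range(max_steps):
        name, args = policy(task, history)      # "reason": pick the next action
        if name == "finish":
            break
        observation = tools[name](**args)       # "act": execute the chosen tool
        history.append((name, args, observation))
        if name == "reject_task":               # rejection ends the episode
            break
    return history


def scripted_policy(script: list) -> Policy:
    """Stand-in for an LLM: replays a fixed list of tool calls, then finishes."""
    def policy(task: str, history: list) -> tuple[str, dict[str, Any]]:
        step = len(history)
        return script[step] if step < len(script) else ("finish", {})
    return policy


# Stub tools that only return a description of what they would do.
tools = {
    "load_data": lambda dataset: f"loaded {dataset}",
    "reject_task": lambda: "task rejected",
}

trace = run_agent(
    scripted_policy([("load_data", {"dataset": "x.csv"}), ("reject_task", {})]),
    tools,
    "hypothetical task",
)

# A policy that never finishes is cut off by the step cap.
capped = run_agent(lambda task, history: ("load_data", {"dataset": "x.csv"}),
                   tools, "hypothetical task")
print(len(trace), len(capped))
```

The cap bounds the cost of a run when a model keeps calling tools without converging on an answer.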
All tests use locally stored copies of 21 vector datasets, 18 CSV tables, and 11 raster datasets for reproducibility. Token consumption and solution accuracy are compared across both solvable and unsolvable subsets:
| Model | Solvable Match Rate | Unsolvable Reject Rate | Token Consumption* |
|---|---|---|---|
| Sonnet 3.5 | 53% | 37% | 12M in / 157K out |
| Sonnet 3.7 | 50% | 14% | 24M in / 269K out |
| GPT-4o | 44% | 59% | 7.5M in / 57K out |
| Gemini 2.0 | 43% | 46% | 7M in / 40K out |
| Haiku 3.5 | 43% | 20% | 15M in / 203K out |
| o3-mini | 21% | 76% | 3.5M in / 561K out |
| GPT-4o-mini | 16% | 62% | 8M in / 55K out |
*Token figures are totals across all 202 tasks (input/output, rounded).
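Dividing the totals in the table by the 202 tasks gives a rough per-task average for each model. The short script below does only that arithmetic; because the table values are rounded, the results are approximate, and the variable names are chosen here for illustration.

```python
TASKS = 202  # task count given in the table footnote

# (input tokens, output tokens) totals copied from the table above.
totals = {
    "Sonnet 3.5": (12_000_000, 157_000),
    "Sonnet 3.7": (24_000_000, 269_000),
    "GPT-4o": (7_500_000, 57_000),
    "Gemini 2.0": (7_000_000, 40_000),
    "Haiku 3.5": (15_000_000, 203_000),
    "o3-mini": (3_500_000, 561_000),
    "GPT-4o-mini": (8_000_000, 55_000),
}

# Approximate average tokens per task for each model.
per_task = {
    model: (tokens_in / TASKS, tokens_out / TASKS)
    for model, (tokens_in, tokens_out) in totals.items()
}

lowest_input = min(per_task, key=lambda m: per_task[m][0])
highest_output = max(per_task, key=lambda m: per_task[m][1])

for model, (avg_in, avg_out) in per_task.items():
    print(f"{model:12s} {avg_in:10.0f} in / {avg_out:6.0f} out per task")
print("lowest input:", lowest_input, "| highest output:", highest_output)
```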
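The Anthropic models consume substantially more input tokens per run (12M to 24M, versus 3.5M to 8M for the other models), and the two Sonnet models achieve the highest match rates on solvable tasks. The OpenAI models, o3-mini and GPT-4o-mini in particular, are the most reliable at rejecting unsolvable tasks. o3-mini uses the fewest input tokens but produces by far the most output (561K tokens), whereas GPT-4o and GPT-4o-mini produce comparatively little output.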
5. Error Analysis and Observed Limitations
LLMs consistently exhibit domain-specific error patterns:
- Geometric reasoning failures: e.g., centroid vs. polygon confusion, improper handling of lines and polygons.
- World knowledge errors: e.g., reliance on outdated country groupings or idiosyncratic spelling variations.
- Inefficient data procedures: e.g., merging datasets prior to appropriate filtering, resulting in superfluous tool calls.
- Incomplete procedural steps: omission of required attribute or spatial filters.
- Unsolvable task mishandling: hallucination of answers when data are unavailable, or arbitrary revision of the user query to suit available information.
Documented mitigations include clarification in tool descriptions, post-inference sanity checking of coverage, and targeted instruction tuning for spatial reasoning.
6. Open-Source Release and Community Adoption
All GeoBenchX artifacts are distributed under an open-source license, including:
- The full suite of ~200 tasks with solvable/unsolvable labels and multiple reference solutions
- Data snapshots (statistical tables, vector datasets, rasters)
- LangGraph ReAct agent code and relevant prompts
- LLM-as-Judge evaluation and scoring code
- Data pipeline for task extension via LLM assistance
These resources are available at https://github.com/Solirinai/GeoBenchX (Krechetova et al., 23 Mar 2025).
7. Relevance and Impact in Geospatial AI
GeoBenchX establishes an interpretable and reproducible methodology for measuring how well LLMs handle intermediate and complex geospatial reasoning, procedural tool use, and error management. It builds solvability assessment into task evaluation, a core competency for GIS AI practitioners. The benchmark’s varied datasets, stratified task complexity, and automated judging enable direct comparison of LLMs’ reasoning, hallucination resistance, and procedural efficiency on real-world GIS workflows, supporting the evaluation and ongoing development of LLM-based GeoAI systems.