GeoAnalystBench: GeoAI Benchmark Suite
- GeoAnalystBench is a GeoAI benchmark suite comprising 50 Python-based spatial analysis tasks drawn from real-world GIS practice, providing a practical assessment of workflow synthesis.
- It employs multi-dimensional metrics—workflow validity, structural alignment, semantic similarity, and CodeBLEU—to quantify LLM performance in geoprocessing.
- The framework facilitates reproducible diagnosis of LLM limitations and promotes enhanced domain-specific training and interactive human-in-the-loop feedback.
GeoAnalystBench is a GeoAI benchmark suite that systematically evaluates LLMs for their capabilities in spatial analysis workflow synthesis and geoprocessing code generation. It is grounded in fifty Python-based tasks curated from real-world GIS practice and rigorously validated by domain experts. The framework supports multi-dimensional assessment criteria—including workflow validity, structural and semantic alignment, and code correctness—thereby enabling reproducible measurement and diagnosis of current LLM abilities and limitations in GIS automation.
1. Design and Composition of the Benchmark
GeoAnalystBench consists of fifty spatial analysis tasks extracted from authentic GIS tutorials, published academic literature, and canonical ESRI ModelBuilder workflows. Each task models a real-world geospatial analysis scenario and is decomposed into 3–10 explicit subtasks, such as data ingestion, attribute filtering, application of spatial tools (for example, buffering, spatial joins, density estimation), and result generation. Every task is associated with a "minimum deliverable product," establishing explicit expectations for both workflow structure and output.
Benchmark coverage spans classical and modern geoprocessing topics:
- Spatial relationship detection (e.g., “model home range using animal tracks” via convex hull, kernel density, and clustering).
- Spatial clustering and pattern analysis (e.g., hot spot detection using Getis-Ord Gi*).
- Environmental modeling and predictive habitat analysis (e.g., “predict sea grass habitats”).
- Network and site selection tasks (e.g., “find optimal corridors to connect mountain lion populations”).
Each benchmark entry includes full domain background, dataset descriptions, and exact scenario specifications. This ensures that LLMs are tested under tightly controlled, realistic conditions directly relevant to GIS practitioner challenges.
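To make this structure concrete, the sketch below shows one way a benchmark entry could be represented in Python. The field names and the elk-tracking example values are illustrative assumptions made for this summary, not the schema or contents of the actual GeoAnalystBench release.

```python
# Hypothetical representation of a single benchmark entry; field names and
# example values are illustrative, not the benchmark's published schema.
from dataclasses import dataclass

@dataclass
class BenchmarkTask:
    task_id: str                 # short identifier for the scenario
    domain_background: str       # narrative context for the analysis
    datasets: list[str]          # descriptions of the input data
    subtasks: list[str]          # 3-10 ordered workflow steps
    minimum_deliverable: str     # the "minimum deliverable product"
    reference_code: str          # expert-validated Python solution

example = BenchmarkTask(
    task_id="elk_home_range",
    domain_background="Model the home range of GPS-tracked elk.",
    datasets=["elk_gps_points.shp: GPS fixes with timestamps"],
    subtasks=[
        "Load GPS tracking points",
        "Compute a convex hull per animal",
        "Estimate a kernel density surface",
        "Cluster locations with DBSCAN",
        "Export the home-range map",
    ],
    minimum_deliverable="Map of movement area and migration patterns over time",
    reference_code="...",  # elided; real entries ship full expert code
)
```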
2. Evaluation Methodology and Metrics
GeoAnalystBench uses a multi-metric evaluation protocol:
- Workflow Validity: Proportion (%) of responses containing well-formed, domain-consistent geoprocessing pipeline steps.
- Structural Alignment: Mean Absolute Deviation (MAD) between the number of workflow steps in the LLM output and the human reference: $\mathrm{MAD} = \frac{1}{N}\sum_{i=1}^{N}\lvert s_i^{\mathrm{LLM}} - s_i^{\mathrm{ref}} \rvert$, where $s_i$ is the step count for task $i$ and $N$ is the number of tasks.
Lower values reflect closer stepwise agreement with expert-designed workflows.
- Semantic Similarity: Cosine similarity between sentence embeddings (e.g., all-MiniLM-L6-v2). Higher scores indicate contextual concordance between model-generated and human workflows.
- Code Quality (CodeBLEU): A composite metric integrating n-gram match, weighted n-gram match, syntactic AST match, and semantic dataflow match, aggregated as $\mathrm{CodeBLEU} = \alpha \cdot \mathrm{BLEU} + \beta \cdot \mathrm{BLEU}_{\mathrm{weight}} + \gamma \cdot \mathrm{Match}_{\mathrm{ast}} + \delta \cdot \mathrm{Match}_{\mathrm{df}}$.
CodeBLEU quantitatively assesses Python implementations for lexical, syntactic, and logical fidelity relative to the expert reference implementation.
This protocol facilitates comprehensive model comparison not only in code-level correctness but also in process reasoning, workflow completeness, and GIS-specific semantic logic; a minimal computation sketch of these metrics is given below.
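The snippet below is a minimal sketch of how these metrics can be computed. It assumes the sentence-transformers package for the embedding model; the CodeBLEU component scores are taken as precomputed inputs, and the equal weights shown are the conventional default rather than values confirmed by the paper.

```python
# Minimal metric sketch, assuming the sentence-transformers package; CodeBLEU
# components are precomputed inputs and the equal weights are a conventional
# default, not necessarily the paper's settings.
from sentence_transformers import SentenceTransformer, util

def structural_alignment_mad(llm_steps: list[int], ref_steps: list[int]) -> float:
    """Mean absolute deviation between generated and reference step counts."""
    return sum(abs(a - b) for a, b in zip(llm_steps, ref_steps)) / len(ref_steps)

def semantic_similarity(llm_workflow: str, ref_workflow: str) -> float:
    """Cosine similarity between sentence embeddings of the two workflow texts."""
    model = SentenceTransformer("all-MiniLM-L6-v2")
    emb = model.encode([llm_workflow, ref_workflow], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()

def codebleu(ngram, weighted_ngram, ast_match, dataflow_match,
             alpha=0.25, beta=0.25, gamma=0.25, delta=0.25) -> float:
    """Weighted aggregation of the four CodeBLEU components."""
    return alpha * ngram + beta * weighted_ngram + gamma * ast_match + delta * dataflow_match
```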
3. Representative Tasks and Spatial Reasoning Challenges
The benchmark covers a spectrum of problems requiring both mechanistic geoprocessing and advanced spatial reasoning:
- Modeling Animal Movement: For GPS-tracked elk data, the required workflow involves convex hull calculation, kernel density estimation, and cluster analysis via DBSCAN. The output must illustrate the area of movement and migration patterns over time.
- Crash Hot Spot Analysis: Example involves spatially filtering crash data by time, snapping crash points onto road networks, joining locations to a road dataset, calculating crash rates, and applying hot spot statistics (e.g., Getis-Ord Gi*).
Other examples encompass optimal spatial corridor selection, land use mapping, proximity filtering, and event prediction. These require not only basic data manipulation but also context-aware parameter tuning (e.g., choice of buffer sizes and kernel bandwidths) and coherent chaining of spatial operations in code, as the sketch below illustrates.
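As a rough illustration of this kind of chain, the fragment below sketches the hull, buffer, and clustering steps under the assumption that geopandas and scikit-learn are used; the file name, projection, and parameter values are placeholders rather than benchmark-specified settings.

```python
# Illustrative fragment only: file name, CRS, buffer distance, and DBSCAN
# parameters are placeholder assumptions, not values from the benchmark.
import geopandas as gpd
from sklearn.cluster import DBSCAN

points = gpd.read_file("elk_gps_points.shp").to_crs(epsg=32612)  # project to meters

home_range = points.unary_union.convex_hull        # convex hull of all GPS fixes
movement_area = points.buffer(500).unary_union     # 500 m buffer, dissolved

# Context-aware parameter tuning: eps and min_samples should reflect the
# species' movement range and the GPS sampling density.
coords = [(geom.x, geom.y) for geom in points.geometry]
points["cluster"] = DBSCAN(eps=1000, min_samples=10).fit_predict(coords)  # -1 = noise
```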
4. Comparative Performance of Proprietary and Open-Source Models
Model assessment using GeoAnalystBench establishes clear performance differences:
- Proprietary models (ChatGPT-4o-mini, Claude-3.5-Sonnet, Gemini-1.5-Flash) exhibit high workflow validity rates (∼95%), precise structural alignment, and robust CodeBLEU scores (ChatGPT-4o-mini: 0.390 mean).
- Open-source models (DeepSeek-R1-7B, CodeLlama-7B) lag markedly, with DeepSeek-R1-7B registering only 48.5% workflow validity and 0.272 CodeBLEU. These models tend to produce incomplete, structurally inconsistent, or semantically misaligned outputs.
Proprietary models show strength in capturing domain-specific steps and logic, whereas open-source models often omit critical workflow elements or apply incorrect toolchains. This suggests an urgent need for domain-oriented pretraining and high-quality geospatial corpora in open-source model development.
5. Diagnostic Insights and Limitations
GeoAnalystBench reveals persistent challenges for LLMs in GIS:
- Deep Spatial Reasoning: Tasks involving nuanced spatial relationships, parameter optimization (e.g., buffer distances, kernel sizes), and optimal selection under ambiguity present the highest failure rates.
- Logical Consistency: While syntactic code correctness is often achieved, models commonly falter in accurate chaining of workflow steps and dataflow dependencies.
- Parameter Assumptions: In the absence of explicit data or instructions, LLMs default to parameter values that are sometimes plausible and sometimes unsuitable, necessitating interactive refinement or retrieval-augmented guidance.
The benchmark helps diagnose these gaps by providing rigorous human-in-the-loop validation of its reference solutions and by measuring both high-level reasoning and low-level code accuracy.
6. Future Directions for GeoAI Automation
The paper identifies essential next steps:
- Enhanced Domain-Specific Training: Incorporation of GIS corpus data, tool documentation, and workflow case studies for LLM pre-training.
- Retrieval-Augmented Generation: Leveraging external knowledge sources for parameter selection and workflow refinement.
- Human-in-the-Loop Feedback: Systematic integration of expert judgments during LLM fine-tuning and post-processing.
- Interactive Systems: Development of systems supporting iterative user adjustments, interactive parameter queries, and explanatory feedback loops.
A plausible implication is that GeoAnalystBench will serve as a standard basis for evaluating forthcoming GeoAI platforms, promoting robust GIS workflow synthesis, and supporting reproducible spatial analysis automation in both proprietary and open-source contexts.
7. Visualization and Workflow Structuring
The benchmark emphasizes explicit workflow structuring and process visualization. A representative code excerpt from the paper's workflow analysis module:

```python
import networkx as nx
import matplotlib.pyplot as plt

# Chain the workflow steps into a directed graph and render it with Graphviz
# (the "dot" layout requires pydot and a Graphviz installation).
tasks = ["Task1", "Task2", "Task3"]  # placeholder workflow step labels
G = nx.DiGraph()
for i in range(len(tasks) - 1):
    G.add_edge(tasks[i], tasks[i + 1])

pos = nx.drawing.nx_pydot.graphviz_layout(G, prog="dot")
plt.figure(figsize=(15, 8))
nx.draw(G, pos, with_labels=True, node_size=3000, node_color='lightblue',
        font_size=10, font_weight='bold', arrowsize=20)
plt.title("Workflow for Analyzing Urban Heat Using Kriging Interpolation", fontsize=14)
plt.show()
```
GeoAnalystBench defines a rigorous, reproducible standard for measuring the capabilities and limitations of LLMs in automating spatial analysis workflows and code generation within the GIS domain. By combining task diversity, multi-dimensional evaluation metrics, and human-validated reference solutions, it both documents current model gaps and establishes the groundwork for advanced GeoAI research and development (Zhang et al., 7 Sep 2025).