- The paper introduces GeoBenchX, a benchmark evaluating LLMs for multi-step geospatial tasks using 23 specialized GIS tools and a tool-calling agent architecture.
- It employs over 200 tasks, ranging from simple mapping to complex spatial operations, to test both task resolution and proper rejection of unsolvable queries.
- Evaluation shows that Claude Sonnet 3.5 and GPT-4o best balance success on solvable tasks with accurate rejection of unsolvable ones, and that performance across all models degrades as task complexity increases.
This paper introduces GeoBenchX, a benchmark designed to evaluate the performance of LLMs on multi-step geospatial tasks relevant to Geographic Information System (GIS) practitioners (arXiv:2503.18129). The goal is to assess how well current commercial LLMs can function as GIS assistants for tasks like creating maps, generating reports, and performing socio-economic analysis.
The benchmark uses a simple tool-calling agent architecture (a LangGraph ReAct agent) equipped with 23 geospatial tools built on libraries such as GeoPandas, Rasterio, GDAL, Shapely, Pandas, and Matplotlib. The tools cover loading data (tabular, vector, raster), merging, filtering, spatial operations (joins, buffers, raster analysis), and visualization (choropleth maps, heatmaps, contour lines). A dedicated reject_task tool lets the agent explicitly indicate that a task is unsolvable. The agent operates on snapshots of real-world statistical, vector, and raster datasets.
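As a rough illustration of this architecture, the sketch below wires a LangGraph ReAct agent to two LangChain-style tools, including a reject_task stand-in. The tool bodies, model choice, and prompt are assumptions for illustration only, not the paper's actual implementation of its 23 tools.

```python
# Minimal sketch of a tool-calling GIS agent, assuming LangGraph's prebuilt
# ReAct helper and LangChain-style tools. Tool bodies and the model choice
# are illustrative stand-ins, not the benchmark's actual implementation.
import geopandas as gpd
from langchain_core.tools import tool
from langchain_anthropic import ChatAnthropic
from langgraph.prebuilt import create_react_agent


@tool
def load_vector_data(path: str) -> str:
    """Load a vector dataset (e.g., GeoJSON or shapefile)."""
    gdf = gpd.read_file(path)  # a real tool would register the layer for later steps
    return f"Loaded {len(gdf)} features from {path}"


@tool
def reject_task(reason: str) -> str:
    """Explicitly refuse a task that cannot be solved with the available data and tools."""
    return f"Task rejected: {reason}"


# The solver model is swappable; the benchmark ran all solvers at temperature 0.
model = ChatAnthropic(model="claude-3-5-sonnet-latest", temperature=0)
agent = create_react_agent(model, tools=[load_vector_data, reject_task])

result = agent.invoke(
    {"messages": [("user", "Create a choropleth map of population by province.")]}
)
```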
The benchmark comprises over 200 tasks divided into four categories of increasing complexity:
- Merge-visualize: Simple tasks requiring joining tabular and geographic data to create maps.
- Process-merge-visualize: Tasks involving data filtering or column operations before mapping.
- Spatial operations: Tasks needing spatial joins, buffers, raster calculations, or distance measurements (see the GeoPandas sketch after this list).
- Heatmaps, contour lines: Complex tasks requiring specific visualization tools like heatmaps or contour generation, often combined with other spatial operations.
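To make the "Spatial operations" category concrete, here is a minimal GeoPandas sketch of the kinds of operations such tools typically wrap (buffering, spatial joins, distance measurement). The file names, column names, and CRS choice are hypothetical.

```python
# Illustrative GeoPandas operations of the kind the benchmark's spatial tools
# wrap; file names, column names, and the CRS choice are hypothetical.
import geopandas as gpd

cities = gpd.read_file("cities.geojson")   # hypothetical point layer
rivers = gpd.read_file("rivers.geojson")   # hypothetical line layer

# Reproject to a metric CRS so buffers and distances are in meters
# (a local projected CRS would be more accurate than Web Mercator).
cities = cities.to_crs(epsg=3857)
rivers = rivers.to_crs(epsg=3857)

# Buffer: 10 km zones around each city.
zones = cities.copy()
zones["geometry"] = zones.geometry.buffer(10_000)

# Spatial join: rivers that intersect any city's buffer zone.
rivers_near_cities = gpd.sjoin(rivers, zones, predicate="intersects")

# Distance measurement: nearest-river distance per city, in meters.
cities["dist_to_river_m"] = cities.geometry.apply(lambda pt: rivers.distance(pt).min())
```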
Crucially, each category includes both solvable tasks and intentionally unsolvable ones (due to data or tool limitations), testing the LLMs' ability to recognize and reject impossible requests rather than hallucinate outputs. Tasks were phrased to mimic natural-language queries from GIS users and sometimes contain ambiguity.
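Purely as an illustration of this solvable/unsolvable split, a task record might pair a prompt with one or more expected tool-call sequences, with a lone reject_task call as the expected behavior for an unsolvable task. The field names and tool names below are assumptions, not the benchmark's actual schema.

```python
# Hypothetical task records illustrating the solvable/unsolvable split; the
# field names and tool names are assumptions, not the benchmark's actual schema.
tasks = [
    {
        "id": "merge_visualize_012",
        "category": "Merge-visualize",
        "prompt": "Map 2020 population by province as a choropleth.",
        "solvable": True,
        "reference_solutions": [
            ["load_tabular_data", "load_vector_data", "merge", "choropleth_map"],
        ],
    },
    {
        "id": "spatial_ops_087",
        "category": "Spatial operations",
        "prompt": "Map average commute time by district for 2030.",  # data does not exist
        "solvable": False,
        "reference_solutions": [
            ["reject_task"],  # the correct behavior is an explicit rejection
        ],
    },
]
```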
Evaluation employs an LLM-as-Judge framework. Seven commercial LLMs were tested as the "solver" agent (Claude Sonnet 3.5 and 3.7, Claude Haiku 3.5, Gemini 2.0, GPT-4o, GPT-4o-mini, and o3-mini), all run at temperature 0 for consistency. For each task, the agent's generated sequence of tool calls (the candidate solution) is compared against manually curated reference solutions. A reference solution specifies the expected tool calls, is an empty sequence when no tools are needed, or consists of a reject_task call for unsolvable tasks; ambiguous tasks may have multiple reference solutions. A panel of LLM judges (Sonnet 3.5, GPT-4o, Gemini 2.0) assesses the semantic equivalence of candidate and reference solutions, assigning a score of match, partial match, or no match. Judge performance was validated against human annotations, with Sonnet 3.5 proving the most aligned judge.
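A minimal sketch of the judge step is shown below: it asks a judge model to compare a candidate tool-call sequence against the reference solutions and return one of the three scores. The client, prompt wording, and output parsing are assumptions; only the match / partial match / no match scheme comes from the benchmark description.

```python
# Sketch of an LLM-as-Judge comparison between a candidate tool-call sequence
# and the reference solutions. The client, prompt wording, and parsing are
# assumptions; only the three-level scoring scheme comes from the paper.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def judge_solution(task_prompt: str,
                   candidate: list[dict],
                   references: list[list[dict]],
                   model: str = "gpt-4o") -> str:
    """Return 'match', 'partial match', or 'no match'."""
    system = (
        "You compare a candidate sequence of geospatial tool calls against "
        "reference solutions for the same task. Judge semantic equivalence, "
        "not exact string equality. For an unsolvable task the reference is a "
        "single reject_task call. Answer with exactly one of: "
        "match, partial match, no match."
    )
    user = json.dumps(
        {"task": task_prompt, "candidate": candidate, "references": references},
        indent=2,
    )
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": user}],
    )
    verdict = response.choices[0].message.content.strip().lower()
    return verdict if verdict in {"match", "partial match", "no match"} else "no match"
```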
Key findings include:
- Sonnet 3.5 and GPT-4o achieved the best overall performance, balancing success on solvable tasks with accurate rejection of unsolvable ones.
- Anthropic models (Claude Sonnet/Haiku) performed well on solvable tasks but struggled to reject unsolvable ones.
- OpenAI models (GPT-4o/4o-mini, o3-mini) were better at identifying unsolvable tasks, with GPT-4o being the most balanced among them. GPT-4o mini and o3-mini showed lower overall success rates.
- Performance generally decreased with task complexity, with "Heatmaps, contour lines" being the most challenging category.
- Anthropic models consumed substantially more input and output tokens than the OpenAI and Google models.
- Common errors included poor understanding of geometry (e.g., centroids), reliance on outdated world knowledge, inefficient data manipulation (e.g., merging large datasets before filtering), skipping necessary filtering steps, and attempting to solve tasks with partial data instead of rejecting them.
The paper concludes that while current LLMs show promise for geospatial tasks, performance varies significantly across models. GeoBenchX provides a standardized, open-source toolset (benchmark tasks, evaluation framework, datasets, and code, available on GitHub) for ongoing evaluation of LLM capabilities in the GeoAI domain, particularly for practical GIS applications.