- The paper introduces GeoBenchX, a benchmark evaluating LLMs for multi-step geospatial tasks using 23 specialized GIS tools and a tool-calling agent architecture.
- It employs over 200 tasks, ranging from simple mapping to complex spatial operations, to test both task resolution and proper rejection of unsolvable queries.
- Evaluation shows that Claude Sonnet 3.5 and GPT-4o best balance success on solvable tasks with accurate rejection of unsolvable ones, and that performance across all models degrades as task complexity increases.
This paper introduces GeoBenchX, a benchmark designed to evaluate the performance of LLMs on multi-step geospatial tasks relevant to Geographic Information System (GIS) practitioners (arXiv:2503.18129). The goal is to assess how well current commercial LLMs can function as GIS assistants for tasks like creating maps, generating reports, and performing socio-economic analysis.
The benchmark uses a simple tool-calling agent architecture (a LangGraph ReAct agent) equipped with 23 geospatial tools built on libraries such as GeoPandas, Rasterio, GDAL, Shapely, Pandas, and Matplotlib. The tools cover loading data (tabular, vector, raster), merging, filtering, spatial operations (joins, buffers, raster analysis), and visualization (choropleth maps, heatmaps, contour lines). A dedicated reject_task tool lets the agent explicitly indicate that a task is unsolvable. The agent operates on snapshots of real-world statistical, vector, and raster datasets.
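As a rough illustration of this architecture, the sketch below wires a LangGraph ReAct agent to two LangChain-style tools, including a reject_task stand-in. The tool bodies, model choice, and prompt are assumptions for illustration only, not the paper's actual implementation of its 23 tools.

```python
# Minimal sketch of a tool-calling GIS agent, assuming LangGraph's prebuilt
# ReAct helper and LangChain-style tools. Tool bodies and the model choice
# are illustrative stand-ins, not the benchmark's actual implementation.
import geopandas as gpd
from langchain_core.tools import tool
from langchain_anthropic import ChatAnthropic
from langgraph.prebuilt import create_react_agent


@tool
def load_vector_data(path: str) -> str:
    """Load a vector dataset (e.g., GeoJSON or shapefile)."""
    gdf = gpd.read_file(path)  # a real tool would register the layer for later steps
    return f"Loaded {len(gdf)} features from {path}"


@tool
def reject_task(reason: str) -> str:
    """Explicitly refuse a task that cannot be solved with the available data and tools."""
    return f"Task rejected: {reason}"


# The solver model is swappable; the benchmark ran all solvers at temperature 0.
model = ChatAnthropic(model="claude-3-5-sonnet-latest", temperature=0)
agent = create_react_agent(model, tools=[load_vector_data, reject_task])

result = agent.invoke(
    {"messages": [("user", "Create a choropleth map of population by province.")]}
)
```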
The benchmark comprises over 200 tasks divided into four categories of increasing complexity:
- Merge-visualize: Simple tasks requiring joining tabular and geographic data to create maps.
- Process-merge-visualize: Tasks involving data filtering or column operations before mapping.
- Spatial operations: Tasks needing spatial joins, buffers, raster calculations, or distance measurements (see the GeoPandas sketch after this list).
- Heatmaps, contour lines: Complex tasks requiring specific visualization tools like heatmaps or contour generation, often combined with other spatial operations.
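To make the "Spatial operations" category concrete, here is a minimal GeoPandas sketch of the kinds of operations such tools typically wrap (buffering, spatial joins, distance measurement). The file names, column names, and CRS choice are hypothetical.

```python
# Illustrative GeoPandas operations of the kind the benchmark's spatial tools
# wrap; file names, column names, and the CRS choice are hypothetical.
import geopandas as gpd

cities = gpd.read_file("cities.geojson")   # hypothetical point layer
rivers = gpd.read_file("rivers.geojson")   # hypothetical line layer

# Reproject to a metric CRS so buffers and distances are in meters
# (a local projected CRS would be more accurate than Web Mercator).
cities = cities.to_crs(epsg=3857)
rivers = rivers.to_crs(epsg=3857)

# Buffer: 10 km zones around each city.
zones = cities.copy()
zones["geometry"] = zones.geometry.buffer(10_000)

# Spatial join: rivers that intersect any city's buffer zone.
rivers_near_cities = gpd.sjoin(rivers, zones, predicate="intersects")

# Distance measurement: nearest-river distance per city, in meters.
cities["dist_to_river_m"] = cities.geometry.apply(lambda pt: rivers.distance(pt).min())
```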
Crucially, each category includes both solvable tasks and intentionally unsolvable ones (due to data or tool limitations), testing the LLMs' ability to recognize and reject impossible requests rather than hallucinate outputs. Tasks were phrased to mimic natural-language queries from GIS users and sometimes contain ambiguity.
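Purely as an illustration of this solvable/unsolvable split, a task record might pair a prompt with one or more expected tool-call sequences, with a lone reject_task call as the expected behavior for an unsolvable task. The field names and tool names below are assumptions, not the benchmark's actual schema.

```python
# Hypothetical task records illustrating the solvable/unsolvable split; the
# field names and tool names are assumptions, not the benchmark's actual schema.
tasks = [
    {
        "id": "merge_visualize_012",
        "category": "Merge-visualize",
        "prompt": "Map 2020 population by province as a choropleth.",
        "solvable": True,
        "reference_solutions": [
            ["load_tabular_data", "load_vector_data", "merge", "choropleth_map"],
        ],
    },
    {
        "id": "spatial_ops_087",
        "category": "Spatial operations",
        "prompt": "Map average commute time by district for 2030.",  # data does not exist
        "solvable": False,
        "reference_solutions": [
            ["reject_task"],  # the correct behavior is an explicit rejection
        ],
    },
]
```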
Evaluation employs an LLM-as-Judge framework. Seven commercial LLMs were tested as the "solver" agent (Claude Sonnet 3.5 and 3.7, Claude Haiku 3.5, Gemini 2.0, GPT-4o, GPT-4o-mini, and o3-mini), all run at temperature 0 for consistency. For each task, the agent's generated sequence of tool calls (the candidate solution) is compared against manually curated reference solutions. A reference solution specifies the expected tool calls, is an empty sequence when no tools are needed, or consists of a reject_task call for unsolvable tasks; ambiguous tasks may have multiple reference solutions. A panel of LLM judges (Sonnet 3.5, GPT-4o, Gemini 2.0) assesses the semantic equivalence of candidate and reference solutions, assigning a score of match, partial match, or no match. Judge performance was validated against human annotations, with Sonnet 3.5 proving the most aligned judge.
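A minimal sketch of the judge step is shown below: it asks a judge model to compare a candidate tool-call sequence against the reference solutions and return one of the three scores. The client, prompt wording, and output parsing are assumptions; only the match / partial match / no match scheme comes from the benchmark description.

```python
# Sketch of an LLM-as-Judge comparison between a candidate tool-call sequence
# and the reference solutions. The client, prompt wording, and parsing are
# assumptions; only the three-level scoring scheme comes from the paper.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def judge_solution(task_prompt: str,
                   candidate: list[dict],
                   references: list[list[dict]],
                   model: str = "gpt-4o") -> str:
    """Return 'match', 'partial match', or 'no match'."""
    system = (
        "You compare a candidate sequence of geospatial tool calls against "
        "reference solutions for the same task. Judge semantic equivalence, "
        "not exact string equality. For an unsolvable task the reference is a "
        "single reject_task call. Answer with exactly one of: "
        "match, partial match, no match."
    )
    user = json.dumps(
        {"task": task_prompt, "candidate": candidate, "references": references},
        indent=2,
    )
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": user}],
    )
    verdict = response.choices[0].message.content.strip().lower()
    return verdict if verdict in {"match", "partial match", "no match"} else "no match"
```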
Key findings include:
- Sonnet 3.5 and GPT-4o achieved the best overall performance, balancing success on solvable tasks with accurate rejection of unsolvable ones.
- Anthropic models (Claude Sonnet/Haiku) performed well on solvable tasks but struggled to reject unsolvable ones.
- OpenAI models (GPT-4o/4o-mini, o3-mini) were better at identifying unsolvable tasks, with GPT-4o being the most balanced among them. GPT-4o mini and o3-mini showed lower overall success rates.
- Performance generally decreased with task complexity, with "Heatmaps, contour lines" being the most challenging category.
- Anthropic models consumed substantially more input and output tokens than the OpenAI and Google models.
- Common errors included poor understanding of geometry (e.g., centroids), reliance on outdated world knowledge, inefficient data manipulation (e.g., merging large datasets before filtering), skipping necessary filtering steps, and attempting to solve tasks with partial data instead of rejecting them.
The paper concludes that while current LLMs show promise for geospatial tasks, performance varies significantly across models. GeoBenchX provides a standardized, open-source toolset (benchmark tasks, evaluation framework, datasets, and code, available on GitHub) for ongoing evaluation of LLM capabilities in the GeoAI domain, particularly for practical GIS applications.