GPSBench: Evaluating Geospatial LLM Reasoning

Updated 4 July 2026

The paper introduces GPSBench, a benchmark assessing whether LLMs understand GPS coordinates as objects for geospatial reasoning rather than relying solely on memorized facts.
It comprises 57,800 samples over 17 tasks across two tracks—Pure GPS for geometric operations and Applied for integrating real-world geographic knowledge.
Empirical results reveal a mean gap of 9.9 points between applied (67.7%) and pure coordinate tasks (57.8%), highlighting challenges in precise geodetic computing.

GPSBench is a benchmark for evaluating whether LLMs understand GPS coordinates as objects of geospatial reasoning rather than merely as triggers for surface-level geographic recall. It was introduced as a dataset of 57,800 samples across 17 tasks spanning geometric coordinate operations and reasoning that integrates coordinates with world knowledge, with the stated aim of separating intrinsic coordinate competence from tool-assisted performance (Truong et al., 18 Feb 2026). The benchmark is organized around two tracks, the Pure GPS Track and the Applied Track, and is motivated by the claim that applications in navigation, robotics, and mapping require robust reasoning over real-world coordinates, while such capability had remained underexplored in LLM evaluation (Truong et al., 18 Feb 2026).

1. Definition and benchmark scope

GPSBench is designed to answer a specific question: whether LLMs “actually understand GPS coordinates,” or whether their apparent competence is primarily a byproduct of geographic facts and generic arithmetic heuristics. The benchmark therefore distinguishes between tasks that require no external world knowledge and tasks that require integrating coordinates with real geography. In the benchmark’s terminology, the Pure GPS Track contains 9 tasks focused on coordinate operations and geometric reasoning, with no need for world knowledge, whereas the Applied Track contains 8 tasks that require integrating coordinates with real geographic knowledge, such as mapping a coordinate to a city or reasoning about continent and country relations (Truong et al., 18 Feb 2026).

This separation is central to the benchmark’s interpretation. It rejects the common conflation of coordinate manipulation with geographic understanding. The paper’s framing implies that geospatial competence is not monolithic: exact coordinate operations, coordinate-to-place grounding, and broader spatial inference may behave differently under evaluation. GPSBench operationalizes that distinction by combining spherical geodesy, coordinate-to-place mapping, and real-world spatial reasoning within a single benchmark structure (Truong et al., 18 Feb 2026).

2. Dataset composition and task taxonomy

GPSBench contains 57,800 samples across 17 tasks, with 3,400 samples for each task and a 60% train / 10% dev / 30% test split. It is built from GeoNames, and the final benchmark covers 18,196 unique locations across six continents (Truong et al., 18 Feb 2026).

The benchmark organizes its task design using the classic landmark–route–survey taxonomy from spatial cognition and adds a geometric category as a control. The four reasoning types are:

Landmark (L): recognizing places from coordinates.
Route (R): reasoning about paths or sequences.
Survey (S): map-like global spatial reasoning.
Geometric (G): pure coordinate computation (Truong et al., 18 Feb 2026).

The task inventory spans both mathematical and applied geospatial problems. The Pure GPS side includes Format conversion, Coordinate transformation, Distance calculation, Bearing computation, Coordinate interpolation along a great-circle path, Polygon area / perimeter, Bounding box, Route geometry simplification, and Relative position. The Applied side includes Place association, Name disambiguation, Relative position between cities, Proximity, Route analysis, Spatial patterns / outlier detection, Boundary analysis, and Terrain classification (Truong et al., 18 Feb 2026).

A concise summary of the dataset structure is given below.

Component	Specification
Total size	57,800 samples
Tracks	Pure GPS Track (9 tasks), Applied Track (8 tasks)
Samples per task	3,400
Split	60% train / 10% dev / 30% test
Source	GeoNames
Geographic coverage	18,196 unique locations across six continents

This design makes the benchmark broad enough to test both exact coordinate math and geographic reasoning over the world as represented in model parameters.
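The dataset figures above are mutually consistent, which a few lines of arithmetic can confirm. The following is a minimal sketch; the constant and function names are illustrative and not taken from the paper.

```python
# Sanity check of the dataset sizes reported for GPSBench.
# All names here are hypothetical; only the numbers come from the text.
TASKS_PURE = 9
TASKS_APPLIED = 8
SAMPLES_PER_TASK = 3_400
SPLIT_PERCENT = {"train": 60, "dev": 10, "test": 30}


def split_sizes(total, percents):
    """Return the number of samples in each split using integer arithmetic."""
    return {name: total * pct // 100 for name, pct in percents.items()}


total = (TASKS_PURE + TASKS_APPLIED) * SAMPLES_PER_TASK
sizes = split_sizes(total, SPLIT_PERCENT)

print(total)  # 57800
print(sizes)  # {'train': 34680, 'dev': 5780, 'test': 17340}
```

The 60% training split works out to 34,680 samples, which matches the training set size reported for the finetuning experiment.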

3. Evaluation protocol and measurement

GPSBench emphasizes intrinsic capability rather than tool use. The stated intention is to test what a model can do “from its parameters alone,” rather than through external GIS APIs, calculators, or geocoders. The paper identifies this distinction as especially relevant to settings where low latency, offline use, or privacy matters (Truong et al., 18 Feb 2026).

The authors evaluate 14 state-of-the-art LLMs in zero-shot mode, with standardized prompts, no few-shot examples, and no chain-of-thought prompting. All runs use temperature 0 for reproducibility. The model set comprises:

OpenAI: GPT-5.1, GPT-5-mini, GPT-5-nano, GPT-4.1, GPT-4.1-mini
Google: Gemini-2.5-Pro, Gemini-2.5-Flash
Anthropic: Claude-Haiku-4.5
Qwen3: 235B, 30B, 14B, 8B
Mistral: Large, Small (Truong et al., 18 Feb 2026)

The evaluation uses Accuracy for multiple-choice tasks and $1 - \text{MAPE}$ for numeric tasks, so that higher values are always better (Truong et al., 18 Feb 2026). This metric choice standardizes interpretation across qualitatively different task types, allowing aggregate comparisons between discrete classification problems and continuous geodetic computations.
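The two metrics can be sketched as follows. This is an illustration under stated assumptions rather than the paper's scoring code: the function names are hypothetical, targets are assumed to be non-zero, and clipping negative scores to zero is an assumption that the source does not confirm.

```python
def accuracy(predictions, answers):
    """Fraction of multiple-choice predictions that match the answer key."""
    if not answers:
        raise ValueError("answers must not be empty")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def one_minus_mape(predictions, targets, clip=True):
    """Score numeric predictions as 1 - mean absolute percentage error.

    Assumes every target is non-zero. With clip=True (an assumption, not
    confirmed by the source), scores below zero are reported as zero.
    """
    if not targets:
        raise ValueError("targets must not be empty")
    errors = [abs(p - t) / abs(t) for p, t in zip(predictions, targets)]
    score = 1.0 - sum(errors) / len(errors)
    return max(0.0, score) if clip else score


print(accuracy(["A", "B", "C", "D"], ["A", "B", "C", "A"]))  # 0.75
print(one_minus_mape([110.0, 90.0], [100.0, 100.0]))
```

Because both functions return values where higher is better, scores from discrete and continuous tasks can be averaged together, which is the property the text attributes to this metric choice.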

4. Empirical findings on LLM geospatial reasoning

A central result is that models are generally stronger on applied geographic reasoning than on pure coordinate geometry. Across all 14 models, the mean Applied Track score is 67.7%, whereas the mean Pure GPS score is 57.8%, producing a 9.9-point gap favoring Applied tasks (Truong et al., 18 Feb 2026). The strongest reported model on the Applied Track is GPT-5-mini at 74.1%, while the strongest on the Pure GPS Track is GPT-5.1 at 84.4%; Gemini-2.5-Pro reaches 76.7% on Pure GPS (Truong et al., 18 Feb 2026). Only two models perform better on Pure GPS than on Applied: GPT-5.1, for which Pure GPS exceeds Applied by 12.4 points, and Gemini-2.5-Pro, for which Pure GPS exceeds Applied by 5.0 points (Truong et al., 18 Feb 2026).

The task-level picture is uneven. The paper reports that Distance calculation and Bearing computation are often strong: GPT-5.1 scores 99.9% on both distance and bearing, and GPT-4.1 scores 99.1% on distance and 97.5% on bearing. By contrast, Coordinate interpolation is described as much harder, and Polygon area as extremely hard, with many models near zero (Truong et al., 18 Feb 2026). The qualitative interpretation given in the paper is that models may often state the correct formula yet fail to execute it correctly, especially for multi-step spherical computations.

A second major finding concerns spatial granularity. On Place Association, geographic knowledge degrades hierarchically:

Country-level accuracy: 59–97%
Province/state-level accuracy: 26–73%
Exact city-level accuracy: 1–23% (Truong et al., 18 Feb 2026)

The strongest city-level model is Gemini-2.5-Pro, with 23.0% city-level accuracy and 96.5% country-level accuracy (Truong et al., 18 Feb 2026). The benchmark therefore indicates strong coarse-grained knowledge but weak exact localization. The paper interprets this as evidence that model geography is encoded in a hierarchical, coarse-grained form rather than as dense coordinate-to-city mappings.

The noise-robustness experiments further refine that interpretation. The authors add Gaussian noise with $\sigma = 0, 10, 50, 100, 500, 1000$ meters and find that performance remains relatively stable: country-level accuracy stays around 79–82%, province-level around 46–52%, and city-level around 6–9%, with maximum changes of $\pm 1.6\%$, $\pm 5.8\%$, and $\pm 2.0\%$ respectively (Truong et al., 18 Feb 2026). The paper argues that this robustness suggests models are not merely retrieving memorized coordinate strings. The accompanying Missing Data probe supports that interpretation: when inferring a missing latitude or longitude from a city name, the best model reaches only 12.4%, and the mean is 8.3% (Truong et al., 18 Feb 2026).
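One way to perturb a coordinate by a noise level expressed in meters is sketched below. The source does not describe the exact procedure, so everything here is an assumption: independent Gaussian offsets in the north and east directions, a spherical Earth of radius 6,371 km, and a local flat approximation that breaks down near the poles. The function name is hypothetical.

```python
import math
import random

EARTH_RADIUS_M = 6_371_000.0  # assumed mean Earth radius in meters


def add_gaussian_noise(lat, lon, sigma_m, rng):
    """Shift a (lat, lon) pair in degrees by Gaussian offsets given in meters.

    Assumption: north and east offsets are drawn independently from
    N(0, sigma_m^2) and converted to degrees with a local flat approximation.
    Not valid at or very near the poles, where cos(lat) approaches zero.
    """
    if sigma_m == 0:
        return lat, lon
    north_m = rng.gauss(0.0, sigma_m)
    east_m = rng.gauss(0.0, sigma_m)
    dlat = math.degrees(north_m / EARTH_RADIUS_M)
    dlon = math.degrees(east_m / (EARTH_RADIUS_M * math.cos(math.radians(lat))))
    return lat + dlat, lon + dlon


rng = random.Random(0)  # fixed seed so the perturbation is reproducible
for sigma in (0, 10, 50, 100, 500, 1000):
    print(sigma, add_gaussian_noise(48.8566, 2.3522, sigma, rng))
```

Even at the largest noise level of 1000 meters, the perturbed point moves by roughly a hundredth of a degree, which is small relative to country and province extents and helps explain why coarse-grained accuracy barely changes.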

A common misconception addressed by these findings is that strong performance on geographic questions implies strong coordinate understanding. GPSBench shows that this inference is unwarranted: country-level recognition, coordinate arithmetic, and precise geodetic reasoning separate sharply under controlled evaluation.

5. Mathematical definitions and benchmark semantics

GPSBench specifies exact mathematical definitions for the geometric tasks, making the benchmark a test of geodetic correctness rather than an open-ended language exercise. For two coordinates $(\phi_1,\lambda_1)$ and $(\phi_2,\lambda_2)$, the benchmark uses the Haversine distance

$d = 2R \arcsin\sqrt{\sin^2\!\left(\frac{\Delta\phi}{2}\right) + \cos\phi_1 \cos\phi_2 \sin^2\!\left(\frac{\Delta\lambda}{2}\right)}$

with $\Delta\phi = \phi_2 - \phi_1$, $\Delta\lambda = \lambda_2 - \lambda_1$, and $R = 6371$ km (Truong et al., 18 Feb 2026).
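The Haversine formula translates directly into code. The sketch below assumes inputs in decimal degrees and a mean Earth radius of 6371 km; the function name and the default radius are illustrative choices rather than details confirmed by the benchmark.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2, radius_km=EARTH_RADIUS_KM):
    """Great-circle distance in kilometers between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # Clamp to 1.0 to guard against rounding error for near-antipodal points.
    return 2 * radius_km * math.asin(math.sqrt(min(1.0, a)))


# A quarter of the equator should be one quarter of the circumference.
print(haversine_km(0.0, 0.0, 0.0, 90.0))
# Paris to London, using approximate city-center coordinates.
print(haversine_km(48.8566, 2.3522, 51.5074, -0.1278))
```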

For the initial bearing, the paper uses

$\theta = \operatorname{atan2}\!\left(\sin\Delta\lambda \cos\phi_2,\; \cos\phi_1 \sin\phi_2 - \sin\phi_1 \cos\phi_2 \cos\Delta\lambda\right)$

with normalization to $[0^\circ, 360^\circ)$ and mapping into the eight directions N, NE, E, SE, S, SW, W, NW (Truong et al., 18 Feb 2026).
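A minimal sketch of the bearing computation uses the standard initial-bearing formula for a sphere. The mapping to compass labels assumes eight equal 45-degree sectors centered on each direction, which is a common convention but an assumption here; the function names are hypothetical.

```python
import math

COMPASS_8 = ("N", "NE", "E", "SE", "S", "SW", "W", "NW")


def initial_bearing_deg(lat1, lon1, lat2, lon2):
    """Initial great-circle bearing from point 1 to point 2, in [0, 360)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlam = math.radians(lon2 - lon1)
    y = math.sin(dlam) * math.cos(phi2)
    x = (math.cos(phi1) * math.sin(phi2)
         - math.sin(phi1) * math.cos(phi2) * math.cos(dlam))
    return math.degrees(math.atan2(y, x)) % 360.0


def compass_direction(bearing_deg):
    """Map a bearing to one of eight labels (assumed 45-degree sectors)."""
    index = int((bearing_deg % 360.0 + 22.5) // 45.0) % 8
    return COMPASS_8[index]


bearing = initial_bearing_deg(0.0, 0.0, 0.0, 10.0)
print(bearing, compass_direction(bearing))  # 90.0 E
```

Note that the initial bearing is the heading at the starting point only; along a great circle the heading generally changes, which is one reason multi-step spherical tasks are harder than a single formula evaluation.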

For great-circle interpolation, coordinates are first converted to 3D unit vectors

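$\mathbf{v}_i = (\cos\phi_i \cos\lambda_i,\; \cos\phi_i \sin\lambda_i,\; \sin\phi_i)$

then interpolated by spherical linear interpolation:

$\Omega = \arccos(\mathbf{v}_1 \cdot \mathbf{v}_2)$

$\mathbf{v}(t) = \frac{\sin((1-t)\Omega)}{\sin\Omega}\,\mathbf{v}_1 + \frac{\sin(t\Omega)}{\sin\Omega}\,\mathbf{v}_2$

and converted back via

$\phi = \arcsin(v_z), \quad \lambda = \operatorname{atan2}(v_y, v_x)$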

(Truong et al., 18 Feb 2026).
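The three interpolation steps (convert to unit vectors, apply spherical linear interpolation, convert back) can be sketched as follows. The function names are hypothetical, and the handling of coincident and antipodal endpoints is an assumption added to keep the example well defined.

```python
import math


def to_unit_vector(lat, lon):
    """Convert latitude and longitude in degrees to a 3D unit vector."""
    phi, lam = math.radians(lat), math.radians(lon)
    return (math.cos(phi) * math.cos(lam),
            math.cos(phi) * math.sin(lam),
            math.sin(phi))


def to_lat_lon(v):
    """Convert a unit vector back to latitude and longitude in degrees."""
    x, y, z = v
    lat = math.degrees(math.asin(max(-1.0, min(1.0, z))))
    lon = math.degrees(math.atan2(y, x))
    return lat, lon


def interpolate_great_circle(lat1, lon1, lat2, lon2, t):
    """Point at fraction t (0 to 1) along the great circle between two points."""
    v1 = to_unit_vector(lat1, lon1)
    v2 = to_unit_vector(lat2, lon2)
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(v1, v2))))
    omega = math.acos(dot)
    if omega < 1e-12:
        return lat1, lon1  # endpoints coincide, nothing to interpolate
    if math.pi - omega < 1e-12:
        raise ValueError("antipodal endpoints: the great circle is not unique")
    s = math.sin(omega)
    w1 = math.sin((1.0 - t) * omega) / s
    w2 = math.sin(t * omega) / s
    return to_lat_lon(tuple(w1 * a + w2 * b for a, b in zip(v1, v2)))


# Midpoint of a quarter of the equator.
print(interpolate_great_circle(0.0, 0.0, 0.0, 90.0, 0.5))
```

Averaging latitudes and longitudes directly gives a different answer on most routes, so a model that takes that shortcut will fail this task even if it can recite the formulas.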

For spherical polygon area, the benchmark uses L’Huilier’s theorem for a spherical triangle with sides $a, b, c$ and semi-perimeter $s = \frac{a+b+c}{2}$:

$\tan\frac{E}{4} = \sqrt{\tan\frac{s}{2}\,\tan\frac{s-a}{2}\,\tan\frac{s-b}{2}\,\tan\frac{s-c}{2}}$

and total polygon area

$A = R^2 \sum_i E_i$

(Truong et al., 18 Feb 2026).
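L’Huilier’s theorem gives the spherical excess of a triangle from its three side lengths, and the area follows by scaling with the squared radius. The sketch below assumes sides expressed as central angles in radians and a fan triangulation from the first vertex, which is only valid for convex polygons. The triangulation scheme, the radius of 6371 km, and all names are assumptions, not details confirmed by the source.

```python
import math


def spherical_excess(a, b, c):
    """Spherical excess (radians) of a triangle with sides a, b, c in radians."""
    s = (a + b + c) / 2.0
    product = (math.tan(s / 2.0) * math.tan((s - a) / 2.0)
               * math.tan((s - b) / 2.0) * math.tan((s - c) / 2.0))
    # Clamp tiny negative values caused by rounding in degenerate triangles.
    return 4.0 * math.atan(math.sqrt(max(0.0, product)))


def central_angle(p, q):
    """Angular distance in radians between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(p[0]), math.radians(q[0])
    dlam = math.radians(q[1] - p[1])
    h = (math.sin((phi2 - phi1) / 2.0) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2.0) ** 2)
    return 2.0 * math.asin(math.sqrt(min(1.0, h)))


def convex_polygon_area_km2(vertices, radius_km=6371.0):
    """Area of a convex spherical polygon by fan triangulation (assumption)."""
    anchor = vertices[0]
    total_excess = 0.0
    for p, q in zip(vertices[1:-1], vertices[2:]):
        total_excess += spherical_excess(central_angle(anchor, p),
                                         central_angle(p, q),
                                         central_angle(q, anchor))
    return radius_km ** 2 * total_excess


# One octant of the sphere: excess pi/2, i.e. one eighth of the surface area.
octant = [(0.0, 0.0), (0.0, 90.0), (90.0, 0.0)]
print(convex_polygon_area_km2(octant))
```

The computation chains several trigonometric evaluations per triangle, so a small error in any side length propagates into the final area, which is consistent with the near-zero scores reported for this task.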

The benchmark also defines UTM zone as

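$\text{zone} = \left\lfloor \frac{\lambda + 180}{6} \right\rfloor + 1$

with central meridian

$\lambda_0 = 6 \cdot \text{zone} - 183$

and gives the Web Mercator transformation

$x = R\lambda, \quad y = R \ln\tan\!\left(\frac{\pi}{4} + \frac{\phi}{2}\right)$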

(Truong et al., 18 Feb 2026).
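The UTM zone, its central meridian, and the Web Mercator projection can be sketched with the standard textbook formulas. Several details are assumptions: the radius of 6,378,137 m (the usual Web Mercator value), wrapping longitude 180 into zone 1, and ignoring the special zone exceptions around Norway and Svalbard. The function names are hypothetical.

```python
import math

WEB_MERCATOR_RADIUS_M = 6_378_137.0  # assumed sphere radius for Web Mercator


def utm_zone(lon):
    """UTM zone number (1 to 60) for a longitude in degrees.

    Ignores the Norway and Svalbard exceptions; wraps lon = 180 into zone 1.
    """
    return int(math.floor((lon + 180.0) / 6.0)) % 60 + 1


def utm_central_meridian(zone):
    """Central meridian of a UTM zone, in degrees of longitude."""
    return 6 * zone - 183


def web_mercator(lat, lon, radius_m=WEB_MERCATOR_RADIUS_M):
    """Project latitude and longitude in degrees to Web Mercator meters."""
    x = radius_m * math.radians(lon)
    y = radius_m * math.log(math.tan(math.pi / 4.0 + math.radians(lat) / 2.0))
    return x, y


zone = utm_zone(2.3522)
print(zone, utm_central_meridian(zone))  # 31 3
print(web_mercator(48.8566, 2.3522))
```

The Web Mercator formula diverges as latitude approaches 90 degrees, so inputs are normally restricted to about 85.05 degrees north or south.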

These formulas are not ancillary. They define the benchmark’s ground truth and clarify that GPSBench is evaluating exact geodetic targets, not approximate verbal plausibility.

6. Coordinate augmentation and finetuning trade-offs

Beyond static evaluation, GPSBench is used to probe whether explicit coordinate information improves downstream geospatial reasoning. On MapEval with 66 geocodable samples, adding GPS coordinates improves performance from 75.8% to 81.8%, a reported gain of 6.1 percentage points. The reported breakdown is:

Trip planning: 90.9% → 100.0%
POI queries: 72.5% → 76.5%
Nearby search: 66.7% → 100.0% (Truong et al., 18 Feb 2026)

On Hierarchical Spatial, coordinate augmentation produces a larger change, with overall performance improving from 40.9% to 63.6%, a gain of 22.7 percentage points. Scores on the Hierarchical bias items rise from 70% to 100% and on the Alignment bias items from 50% to 100%, whereas scores on the Proximity bias and Rotation bias items remain at 0% (Truong et al., 18 Feb 2026). The benchmark therefore suggests that explicit coordinates can supply missing location information without resolving all forms of spatial reasoning failure.

The paper also studies finetuning by applying LoRA to Qwen3-30B-A3B-Instruct on the GPSBench training split, using a learning rate of [value not recoverable], batch size 32, LoRA rank 64, max sequence length 16,384, 2 epochs, and a training set size of 34,680 samples (Truong et al., 18 Feb 2026). The outcome is explicitly mixed:

Applied Track: 52.3% → 50.7% (-1.6%)
Pure GPS Track: 53.1% → 57.4% (+4.3%)
Overall: 52.7% → 54.1% (+1.5%) (Truong et al., 18 Feb 2026)

The task-level changes show the same tension. Geometry-oriented tasks improve substantially—Spatial Patterns: +56.5%, Polygon Area: +25.9%, Interpolation: +18.9%, Route Geometry: +12.4%—while several world-knowledge tasks degrade, including Boundary Analysis: -25.2% and Name Disambiguation: -17.1%; Relative Position, Proximity, and Bearing also worsen in several cases (Truong et al., 18 Feb 2026). The paper’s interpretation is that geospatial competence is not a single skill and that training for coordinate computation can overwrite or weaken geographic world knowledge.

7. Significance and relation to benchmark design

GPSBench’s principal significance lies in its decomposition of geospatial reasoning into separable competencies: geometric computation, coordinate grounding, and world-knowledge integration. This decomposition allows the benchmark to show that current LLMs are often useful for broad spatial reasoning yet remain unreliable for fine-grained localization, spherical geometry, and precise geodetic reasoning (Truong et al., 18 Feb 2026). The benchmark also strengthens the claim that robustness to noisy coordinates does not imply exact lookup-style memorization, but rather a more diffuse form of geographic structure.

Within the broader landscape of benchmark design, GPSBench can plausibly be read as part of a general movement toward domain-specific, technically grounded evaluation. In other domains, benchmark suites have been modernized to track evolving hardware, workloads, and methodological needs—for example, Altis for modern GPGPU workloads (Hu et al., 2019), gSuite for framework-independent GNN inference benchmarking on GPUs (Tekdoğan et al., 2022), and BenchLink for resilient communication links in GPS-denied environments (Nivas et al., 24 Dec 2025). This suggests a broader benchmark philosophy: domain fidelity, explicit task decomposition, and reproducible evaluation are treated as essential when a capability is likely to be overestimated by generic benchmarks.

For geospatial AI specifically, GPSBench establishes that “GPS understanding” should not be reduced to a single scalar capability. The evidence reported in the benchmark points instead to a layered competence profile: strong country-level grounding, much weaker city-level localization, competence on some elementary calculations, and persistent failure on demanding geodetic operations (Truong et al., 18 Feb 2026). In that sense, GPSBench functions both as an evaluation resource and as a diagnostic framework for distinguishing what LLMs know about geographic space from what they can reliably compute within it.

References (4)

GPSBench: Do Large Language Models Understand GPS Coordinates? (2026)

ALTIS: Modernizing GPGPU Benchmarking (2019)

gSuite: A Flexible and Framework Independent Benchmark Suite for Graph Neural Network Inference on GPUs (2022)

BenchLink: An SoC-Based Benchmark for Resilient Communication Links in GPS-Denied Environments (2025)
