MapBench: Spatial Reasoning Benchmark
- MapBench is a suite of benchmarks and datasets that assess spatial reasoning, navigation, and HD map construction through human-curated map scenarios.
- It pairs human-curated map imagery with detailed scene graph annotations so that vision-language and multimodal LLMs can be evaluated on generating optimal route-planning strategies.
- Robust evaluation protocols in MapBench reveal limitations in current models under complex map conditions and sensor corruptions, guiding future AI enhancements.
MapBench is a suite of benchmarks and datasets for evaluating the spatial reasoning, navigation, and robustness capabilities of learning systems on human-readable maps and HD map construction tasks. It encompasses several distinct but thematically related resources across vision-LLMs, multimodal LLMs, and map-centric perception in autonomous driving, all unified by the aim of assessing map-based understanding at pixel, symbolic, or BEV (bird’s-eye-view) levels. MapBench offers comprehensive, human-curated map navigation scenarios, fine-grained route-tracing tasks, and an HD map corruption evaluation suite, enabling rigorous testing of both perceptual understanding and spatial planning competence in contemporary AI systems (Xing et al., 18 Mar 2025, Panagopoulou et al., 22 Dec 2025, Hao et al., 2024).
1. Dataset Composition and Task Definition
MapBench includes several domains of map-related benchmarks:
- Vision-Language Map Navigation (LVLMs): MapBench comprises 100 hand-curated, human-readable “tourist maps” spanning nine real-world scenarios (Zoo, Museum, National Park, Urban, Campus, Theme Park, Trail, Mall, Google-style Streets). Each map is a high-resolution RGB image annotated with a structured Map Space Scene Graph (MSSG) encoding landmarks and intersections. Navigation queries specify start and goal landmarks, totaling 1,649 path-planning tasks with a complexity stratification (easy/medium/hard) based on network-theoretic metrics (Xing et al., 18 Mar 2025).
- Fine-Grained Route Tracing (Multimodal LLMs): The MapTrace portion of MapBench comprises 96 commercial-style wayfinding maps (the test set contains no synthetic maps), with 1,573 start/end coordinate queries. Task input is the map image, a start pixel, and a goal pixel; the output is a traversable sequence of pixel coordinates optimizing connectivity and obstacle avoidance (Panagopoulou et al., 22 Dec 2025).
- HD Map Construction Robustness: For autonomous driving, MapBench introduces a simulation suite based on the nuScenes val split that systematically applies 29 sensor corruption types (adverse weather, lighting changes, sensor malfunction, beam cross-talk and misconfiguration) to multi-sensor inputs (camera, LiDAR), measuring the robustness of HD map vectorization under realistic degradation (Hao et al., 2024).
Each domain preserves real-world validity, requiring models to parse stylized symbols and text, reason about spatial topology, and plan routes under nontrivial semantic and geometric constraints. A schematic of the query formats used by these tasks is sketched below.
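For illustration, the three task families accept differently shaped queries. In the sketch below, the field names are assumptions for exposition and do not reflect the released data schema.

```python
from dataclasses import dataclass

@dataclass
class MapNavQuery:
    """LVLM navigation query: landmark-to-landmark routing on a tourist map.
    Field names are illustrative, not the released schema."""
    map_image: str        # path to the high-resolution RGB map
    scenario: str         # e.g. "Zoo", "Museum", "Campus"
    start_landmark: str   # start node in the Map Space Scene Graph
    goal_landmark: str    # goal node in the Map Space Scene Graph
    difficulty: str       # "easy" | "medium" | "hard", from the network-theoretic stratification

@dataclass
class MapTraceQuery:
    """MLLM route-tracing query: pixel-to-pixel routing on a wayfinding map.
    The expected answer is an ordered list of traversable (x, y) pixel coordinates."""
    map_image: str
    start_px: tuple[int, int]
    goal_px: tuple[int, int]
```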
2. Annotation Protocols and Scene Graph Integration
To bridge low-level image data and high-level spatial queries, MapBench leverages explicit graph-based annotation:
- Scene Graphs: Each map is manually labeled to produce an undirected scene graph in which nodes are either landmark locations or intersection points. Edges are labeled as “connect,” “adjacent,” “observable,” or “unrelated,” encoding both geometric and topological relationships (Xing et al., 18 Mar 2025).
- Map Space Scene Graph (MSSG): The MSSG enables bidirectional mapping between free-form navigation instructions and ground-truth graph paths. Two conversion algorithms are implemented:
- MSSG→Text translates graph paths into stepwise natural-language navigation,
- Text→MSSG parses model output into graph-edge sequences for evaluative comparison.
This annotation protocol ensures systematic, automatable assessment of instruction validity, route optimality, and path coverage, overcoming the heterogeneity of map imagery.
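To make the graph annotation concrete, the toy example below (using NetworkX) builds a three-node MSSG and converts the ground-truth path into stepwise text. Only edges labeled "connect" are treated as traversable, and the template-based MSSG→Text conversion is a simplification for illustration, not the benchmark's actual algorithm.

```python
import networkx as nx

# Toy Map Space Scene Graph: nodes are landmarks or intersections,
# edges carry one of the MapBench relation labels.
mssg = nx.Graph()
mssg.add_node("Entrance", kind="landmark")
mssg.add_node("I1", kind="intersection")
mssg.add_node("Aquarium", kind="landmark")
mssg.add_edge("Entrance", "I1", relation="connect")
mssg.add_edge("I1", "Aquarium", relation="connect")
mssg.add_edge("Entrance", "Aquarium", relation="observable")  # visible, but not walkable

def mssg_to_text(graph: nx.Graph, path: list[str]) -> str:
    """Simplified MSSG->Text conversion: template expansion over consecutive path edges."""
    steps = [
        f"Go from {a} to {b} (relation: {graph.edges[a, b]['relation']})."
        for a, b in zip(path, path[1:])
    ]
    return " ".join(steps)

# Route only over traversable ("connect") edges, then verbalize the ground-truth path.
walkable = mssg.edge_subgraph(
    (u, v) for u, v, d in mssg.edges(data=True) if d["relation"] == "connect"
)
gt_path = nx.shortest_path(walkable, "Entrance", "Aquarium")
print(mssg_to_text(mssg, gt_path))
# Go from Entrance to I1 (relation: connect). Go from I1 to Aquarium (relation: connect).
```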
3. Evaluation Metrics and Experimental Protocols
MapBench utilizes a diverse set of quantitative metrics tailored to task specification:
- Language-to-Route Evaluation (LVLMs): Instructions output by LVLMs are parsed into MSSG paths and compared against shortest feasible paths. The principal metric is the Path Quality Score (PQS), the ratio of the shortest feasible path length to the predicted route length, $\mathrm{PQS} = L_{\mathrm{opt}} / L_{\mathrm{pred}}$, with $\mathrm{PQS} = 1$ signifying optimality. Supplementary language metrics assess missing paths, incoherence, and compliance with format constraints (Xing et al., 18 Mar 2025).
- Pixel-Level Trace Evaluation (MLLMs): Route-tracing results are scored with Normalized Dynamic Time Warping, $\mathrm{NDTW}(P, G) = \mathrm{DTW}(P, G) / |G|$, computed between the predicted and ground-truth normalized coordinate sequences $P$ and $G$, where DTW is the minimum cumulative point-wise Euclidean distance over monotonic alignments; lower values indicate traces closer to the ground truth. Path “Success Rate” (SR) measures the proportion of queries yielding valid, parseable traces (Panagopoulou et al., 22 Dec 2025).
- HD Map Construction Robustness: Standard detection metrics (mAP, precision, recall, IoU) are computed for map elements. Robustness is assessed with the Corruption Error $\mathrm{CE}_i = \sum_s (1 - \mathrm{mAP}_{i,s}) / \sum_s (1 - \mathrm{mAP}^{\mathrm{base}}_{i,s})$ and the Resilience Rate $\mathrm{RR}_i = \sum_s \mathrm{mAP}_{i,s} / (S \cdot \mathrm{mAP}^{\mathrm{clean}})$, where $i$ indexes the corruption type, $s$ its $S$ severity levels, and the baseline is a fixed reference model. Lower CE and higher RR indicate greater resilience under sensor corruptions (Hao et al., 2024); a consolidated computational sketch of these metrics follows this list.
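The sketch below implements all three metric families under the definitions given above, on simple inputs (route lengths, coordinate arrays, per-severity mAP values). It illustrates the computations rather than reproducing the official evaluation scripts, and the exact normalizations may differ.

```python
import numpy as np

def path_quality_score(pred_len: float, shortest_len: float) -> float:
    """PQS: ratio of the shortest feasible route length to the predicted route length.
    1.0 means the predicted route is optimal; invalid or missing routes are handled by
    the separate language-level metrics."""
    return 0.0 if pred_len <= 0 else min(1.0, shortest_len / pred_len)

def ndtw(pred: np.ndarray, gt: np.ndarray) -> float:
    """Normalized DTW between a predicted trace and the ground-truth trace, both given
    as (N, 2) arrays of normalized coordinates. Lower is better."""
    n, m = len(pred), len(gt)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(pred[i - 1] - gt[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m] / m)

def corruption_error(map_corrupt: np.ndarray, map_baseline: np.ndarray) -> float:
    """CE: degradation relative to a reference baseline, aggregated over the severity
    levels of a single corruption type (lower is better)."""
    return float(np.sum(1.0 - map_corrupt) / np.sum(1.0 - map_baseline))

def resilience_rate(map_corrupt: np.ndarray, map_clean: float) -> float:
    """RR: mean corrupted performance as a fraction of clean performance (higher is better)."""
    return float(np.mean(map_corrupt) / map_clean)

# Examples:
print(path_quality_score(pred_len=500.0, shortest_len=400.0))         # 0.8: route 25% longer than optimal
gt = np.array([[0.1, 0.1], [0.2, 0.2], [0.3, 0.3]])
print(ndtw(gt, gt))                                                    # 0.0: identical traces
print(corruption_error(np.array([0.45, 0.35, 0.20]),
                       np.array([0.40, 0.28, 0.15])))                  # < 1.0: more robust than the baseline
print(resilience_rate(np.array([0.45, 0.35, 0.20]), map_clean=0.60))   # ~0.56
```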
Inference pipelines include both zero-shot and chain-of-thought (CoT) mechanisms, with CoT shown to improve consistency in multi-step spatial reasoning (Xing et al., 18 Mar 2025).
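As an illustration of the two inference modes, the snippet below constructs a zero-shot and a CoT-style prompt for a navigation query; the wording is hypothetical and does not reproduce the benchmark's released prompt templates.

```python
def build_nav_prompt(start: str, goal: str, use_cot: bool = False) -> str:
    """Hypothetical prompt construction for a MapBench navigation query."""
    prompt = (
        f"You are given a map image. Provide step-by-step walking directions from "
        f"'{start}' to '{goal}', referring only to landmarks and paths visible on the map."
    )
    if use_cot:
        # CoT variant: ask the model to reason about layout and candidate routes first.
        prompt += (
            " Think step by step: first describe the map layout and candidate routes, "
            "then output the final directions."
        )
    return prompt

print(build_nav_prompt("Entrance", "Aquarium", use_cot=True))
```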
4. Baseline Performance and Failure Modes
MapBench has exposed significant limitations in both open-source and proprietary models:
- LVLM Navigation (Language Instruction): GPT-4o achieved the best overall PQS (closest to 1.0) but still struggled on a large fraction of queries. Open-source models (e.g., Llama-3.2, Qwen2-VL) underperformed, especially in complex or cluttered map scenarios. Chain-of-Thought improved logical consistency but increased the rate of missing route outputs for open models. Typical failure patterns included detouring, landmark mislocalization, and incoherent directions (Xing et al., 18 Mar 2025).
- MLLM Route Tracing: Pretrained models, when evaluated on MapBench, suffered from invalid detours, short-circuiting through obstacles, and poor handling of long-range planning. Fine-tuning on synthetic MapTrace data improved mean SR by up to 6.4 points and reduced NDTW by up to 33% in Gemma-3 and Gemini-2.5-Flash, highlighting the transferability from synthetic to real map navigation (Panagopoulou et al., 22 Dec 2025).
- HD Map Construction Corruptions: Across 31 tested methods, all experienced major degradation under simulated sensor faults. Snow and sensor hardware failures (camera view loss, LiDAR echo loss, cross-sensor misconfiguration) were the most destructive (up to 80% mAP loss). Even BEV-fusion models only partly mitigated these effects; robustness remained critically limited (Hao et al., 2024).
5. Synthetic Data Generation and Domain Transfer
The MapTrace pipeline demonstrates that explicit, pixel-level synthetic annotation can teach MLLMs fine-grained spatial reasoning largely absent from pretraining. Synthetic dataset construction entails:
- Generating 4,000 synthetic maps in diverse categories using text-to-image models (Imagen-4).
- Extracting path masks via k-means color clustering and Mask-Critic LLM filtering of traversable regions (a simplified sketch of the clustering step follows this list).
- Sampling start/end pairs under geometric constraints, with subsequent path sampling and Path-Critic filtering.
- Producing 23,000 paired traces for model fine-tuning (Panagopoulou et al., 22 Dec 2025).
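A simplified sketch of the clustering step: pixel colors are grouped with k-means and the cluster nearest an assumed walkway color is kept as the traversable mask. The walkway color and cluster count here are illustrative assumptions; the actual pipeline additionally applies the LLM-based Mask-Critic filter.

```python
import numpy as np
from sklearn.cluster import KMeans

def extract_path_mask(img: np.ndarray, n_colors: int = 6,
                      walkway_rgb=(235, 235, 235)) -> np.ndarray:
    """Cluster pixel colors and keep the cluster closest to an assumed walkway color.

    img: HxWx3 uint8 synthetic map image.
    Returns a boolean HxW mask of (putatively) traversable pixels."""
    h, w, _ = img.shape
    pixels = img.reshape(-1, 3).astype(np.float32)
    km = KMeans(n_clusters=n_colors, n_init=10, random_state=0).fit(pixels)
    target = np.asarray(walkway_rgb, dtype=np.float32)
    # Cluster whose centroid is nearest to the assumed walkway color.
    path_cluster = int(np.argmin(np.linalg.norm(km.cluster_centers_ - target, axis=1)))
    return (km.labels_ == path_cluster).reshape(h, w)
```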
Fine-tuning yields improved adherence to map constraints, fewer detours, and better navigation of complex environments. However, domain shift remains a limitation: synthetic maps may not fully capture the stylistic diversity and graphical complexity of real-world maps.
6. Robustness Benchmarks and Improvement Strategies
The MapBench sensor corruption benchmark directly quantifies the susceptibility of HD map constructor models to environmental and sensor-based failures:
- Corruption Taxonomy: The benchmark spans 29 perturbation types, including photometric effects, weather phenomena, LiDAR beam/echo impairments, and multi-sensor failures (Hao et al., 2024); a minimal example of applying such perturbations is sketched after this list.
- Architectural Defense: BEV-level fusion and temporal aggregation offer measurable but incomplete resilience. Swin-Transformer backbones outperform ResNet-50 for domain shift. Transformer BEV encoders (e.g., BEVFormer, GKT) deliver more robust degradation curves than basic pooling. Photometric and LiDAR-specific augmentations yield further but moderate gains.
- Limitations: No architectural or data-centric approach fully closes the resilience gap induced by severe or multi-modal degradations. Data augmentation with adversarial and style-curated corruptions and collection of real adverse-condition datasets are recommended for future progress.
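As a minimal illustration of how such perturbations can be applied to camera inputs, the sketch below darkens a frame and adds Gaussian noise at a chosen severity, and blacks out one surround-view camera to mimic a sensor failure. These are simplified stand-ins, not the benchmark's exact corruption recipes.

```python
import numpy as np

def photometric_corruption(img: np.ndarray, severity: int) -> np.ndarray:
    """Simplified low-light + noise corruption on an HxWx3 uint8 camera frame.
    Severity in {1..5} scales both the darkening and the Gaussian noise."""
    rng = np.random.default_rng(0)
    darkened = img.astype(np.float32) * (1.0 - 0.15 * severity)
    noisy = darkened + rng.normal(0.0, 8.0 * severity, img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def drop_camera_view(frames: list[np.ndarray], view_idx: int) -> list[np.ndarray]:
    """Simplified camera-failure corruption: black out one surround-view frame."""
    frames = list(frames)
    frames[view_idx] = np.zeros_like(frames[view_idx])
    return frames
```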
7. Release, Impact, and Future Directions
MapBench resources, including datasets, annotation tools, evaluation scripts, and code, are publicly available (https://github.com/taco-group/MapBench) (Xing et al., 18 Mar 2025). The suite is referenced in multiple independent studies (Ariadne RLVR, the MapTrace synthetic pipeline, HD map constructor robustness) and is increasingly used as a shared reference for map-based spatial reasoning evaluation.
Anticipated research directions include scaling up the diversity and number of map domains, automating scene-graph extraction, developing hybrid neuro-symbolic reasoning agents, and extending robustness evaluation to embodied or real-time map navigation settings (Xing et al., 18 Mar 2025, Panagopoulou et al., 22 Dec 2025). A plausible implication is that as MapBench expands, it will underpin the evaluation and development of vision-language and multimodal models approaching or surpassing human-level map-reading fluency. Continued work is needed to bridge the domain gap, address the complexity ceiling in path planning, and design systems that gracefully degrade under real-world uncertainty (Shen et al., 1 Nov 2025, Hao et al., 2024).