MAPBench: Robust HD Map Benchmark

Updated 12 January 2026

MAPBench is a comprehensive benchmark that evaluates HD map construction by simulating 29 real-world sensor corruption scenarios for cameras and LiDAR.
It systematically measures model performance using metrics like mAP, Corruption Error (CE), and Resilience Rate (RR) under varying severity levels.
The framework informs best practices in using transformer backbones, geometry-aware encoders, and temporal fusion to improve robustness in safety-critical applications.

MAPBench refers to several notable datasets and benchmarking infrastructures spanning distinct application domains: robust HD map construction in autonomous driving, spatial reasoning over natural maps for LVLMs, visual map-based geolocalization, automated pixel-accurate route tracing, and extensible motion planning algorithm evaluation. Each MAPBench instance provides a rigorously designed and publicly accessible resource to enable reproducible, fine-grained, and domain-specific evaluation of machine learning or robotics methods.

1. MAPBench for HD Map Construction Robustness

MapBench is the first comprehensive robustness benchmark for bird’s-eye-view (BEV) high-definition (HD) map construction methods in the presence of real-world camera and LiDAR corruptions (Hao et al., 2024). Its main objectives are: (a) systematic evaluation of 31 state-of-the-art HD-map models under wide-ranging, realistic sensor corruptions; (b) principled quantification of degradation; (c) identification of practical robustness strategies.

Corruption Taxonomy

MapBench is implemented on the nuScenes validation split and injects 29 distinct corruption scenarios, each at three severity levels (Easy, Moderate, Hard):

Camera-only corruptions (8): Clean, Frame Lost, Camera Crash, Low-Light, Bright, Color Quantization, Snow, Fog.
LiDAR-only corruptions (8): Clean, Wet Ground, Snow, Motion Blur, Incomplete Echo, Fog, Crosstalk, Cross-Sensor.
Multi-sensor combinations (13): Joint or individual application of camera failures (Unavailable, Crash, Frame Lost) and LiDAR failures (Unavailable, Incomplete Echo, Crosstalk, Cross-Sensor).

Table: Corruption Types in MapBench

Modality	Example Corruptions	Count
Camera	Snow, Fog, Frame Lost	8
LiDAR	Crosstalk, Wet Ground	8
Multi-sensor	Camera+LiDAR failures	13

Benchmark Implementation

Corruptions are applied deterministically or stochastically to images or point clouds with tunable parameters (e.g., fog thickness, number of lost LiDAR beams). The benchmark includes well-defined augmentation pipelines (photometric for images, PolarMix for LiDAR) and supports architectural/temporal fusion ablations.
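As a minimal sketch of a severity-parameterised corruption, the snippet below drops whole LiDAR beams from a toy point cloud. The function name `drop_lidar_beams`, the `BEAMS_TO_DROP` table, and the per-level values are illustrative assumptions, not the parameters or API of the MapBench toolkit.

```python
import random

# Hypothetical severity settings: number of LiDAR beams removed per level.
# These values are illustrative only, not the ones used by MapBench.
BEAMS_TO_DROP = {"easy": 8, "moderate": 16, "hard": 24}


def drop_lidar_beams(points, severity, num_beams=32, seed=0):
    """Remove all points that belong to a randomly chosen subset of beams.

    `points` is a list of (x, y, z, beam_index) tuples. A fixed seed makes
    the corruption deterministic; pass seed=None for a stochastic variant.
    """
    rng = random.Random(seed)
    dropped = set(rng.sample(range(num_beams), BEAMS_TO_DROP[severity]))
    return [p for p in points if p[3] not in dropped]


# Toy point cloud: 10 points on each of 32 beams.
cloud = [(float(i), 0.0, 0.0, beam) for beam in range(32) for i in range(10)]
for level in ("easy", "moderate", "hard"):
    print(level, len(drop_lidar_beams(cloud, level)))
```

With 10 points per beam, the three levels leave 240, 160, and 80 points respectively, so the severity knob maps directly onto the amount of missing data.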

Evaluation pipeline (pseudocode):

for each sample in val_set:
    choose corruption_type ∈ {1…29}
    choose severity_level ℓ ∈ {Easy, Moderate, Hard}
    apply relevant corruption(s)
    feed corrupted inputs to model for evaluation
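
The pseudocode above can be read as a sweep over every corruption and severity pair, with each pair scored on the whole validation split. That reading is an assumption, and the sketch below follows it. The names `evaluate_grid`, `CORRUPTIONS`, and the toy stand-ins for the model, corruption function, and scorer are all hypothetical.

```python
from itertools import product

CORRUPTIONS = ["fog", "snow", "frame_lost"]  # placeholder subset, not all 29
SEVERITIES = ["easy", "moderate", "hard"]


def evaluate_grid(model, val_set, corrupt, score):
    """Return {(corruption, severity): score} for every combination."""
    results = {}
    for corruption, severity in product(CORRUPTIONS, SEVERITIES):
        predictions = [
            model(corrupt(sample, corruption, severity)) for sample in val_set
        ]
        results[(corruption, severity)] = score(predictions, val_set)
    return results


# Toy stand-ins so the sketch runs end to end; a real run would plug in an
# HD-map model, a corruption toolkit, and an mAP evaluator instead.
PENALTY = {"easy": 0.1, "moderate": 0.2, "hard": 0.3}
toy_val_set = [1.0, 1.0, 1.0, 1.0]


def toy_corrupt(sample, corruption, severity):
    return sample - PENALTY[severity]


def toy_model(x):
    return x


def toy_score(predictions, targets):
    return sum(predictions) / len(predictions)


grid = evaluate_grid(toy_model, toy_val_set, toy_corrupt, toy_score)
print(len(grid))  # one entry per (corruption, severity) pair
```

Keeping the results keyed by (corruption, severity) makes it straightforward to aggregate per-corruption metrics afterwards.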

2. Metrics and Evaluation Protocol

MapBench measures core accuracy via mean Average Precision (mAP) over element classes (pedestrian crossing, lane divider, road boundary) and robustness via Corruption Error (CE) and Resilience Rate (RR):

Corruption Error (CE):

$\mathrm{CE}_i = \frac{\sum_{\ell=1}^{3} \left[1 - \mathrm{mAP}_{i,\ell}\right]}{\sum_{\ell=1}^{3} \left[1 - \mathrm{mAP}_{i,\ell}^{\mathrm{base}}\right]}$

with mean Corruption Error (mCE) as the average of $\mathrm{CE}_i$ over all corruptions.

Resilience Rate (RR):

$\mathrm{RR}_i = \frac{\sum_{\ell=1}^{3} \mathrm{mAP}_{i,\ell}}{3 \times \mathrm{mAP}^{\mathrm{clean}}}$

and mean Resilience Rate (mRR) indicating the proportion of clean accuracy retained under corruption.
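
The two definitions translate directly into code. The sketch below assumes mAP values are expressed as fractions in [0, 1]; the function names and all numbers are made up for illustration and are not results from the benchmark.

```python
def corruption_error(map_model, map_base):
    """CE_i: summed error (1 - mAP) of the model over the severity levels,
    divided by the summed error of the baseline model."""
    return sum(1 - m for m in map_model) / sum(1 - b for b in map_base)


def resilience_rate(map_model, map_clean):
    """RR_i: mean corrupted mAP as a fraction of the clean mAP."""
    return sum(map_model) / (len(map_model) * map_clean)


# Illustrative mAP values at (Easy, Moderate, Hard); not measured results.
model_map = {"fog": [0.5, 0.4, 0.3], "snow": [0.3, 0.2, 0.1]}
base_map = {"fog": [0.4, 0.3, 0.2], "snow": [0.2, 0.1, 0.0]}
clean_map = 0.6

ce = {c: corruption_error(model_map[c], base_map[c]) for c in model_map}
rr = {c: resilience_rate(model_map[c], clean_map) for c in model_map}

mce = sum(ce.values()) / len(ce)  # mean over corruption types
mrr = sum(rr.values()) / len(rr)
print(f"mCE = {mce:.3f}, mRR = {mrr:.3f}")
```

A CE below 1 means the model accumulates less error than the baseline under that corruption, while RR measures how much of the model's own clean accuracy survives.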

For most corruptions, mAP degrades near-linearly with severity. The effect is most pronounced for snow and sensor failures, which can cause catastrophic mAP losses.

3. Architectural Design, Training, and Robustness Strategies

The benchmark evaluates 31 models including camera-only, LiDAR-only, and fusion approaches: HDMapNet, VectorMapNet, PivotNet, BeMapNet, MapTR (GKT), MapTRv2 (BEVPool), StreamMapNet, HIMap. Key architectural features assessed include:

Transformer Backbones (Swin-T vs ResNet50): Transformers yield ~20% lower mCE.
2D→BEV Encoder: Geometry-aware GKT is 2–3 mCE points more robust than BEVPool or BEVFormer.
Temporal Fusion: Adding temporal blocks (e.g., StreamMapNet) improves mRR by ~8% and decreases mCE by ~14%.
Extended Training: Longer schedules (30 to 110 epochs) yield ~7 pp higher clean mAP and 10–20 pp lower mCE.
Augmentation: Photometric jitter and PolarMix respectively improve robustness for camera and LiDAR modalities.

Fusion models, while higher-performing on clean samples, often suffer large drops if either modality fails, with combined corruptions being particularly detrimental.

4. Empirical Results and Safety Implications

All evaluated methods exhibit significant performance drops under real-world sensor corruptions. Notably:

Worst-case Degradation: Snow (on both modalities), Frame Lost (camera), Cross-Sensor (LiDAR) cause the steepest accuracy collapses.
Linear Degradation: For many corruptions, mAP decreases linearly with severity level $\ell$ (for snow, $\Delta_{\text{Snow}}(\ell) \approx k \cdot \ell$ with $k \approx 15$–$25$ mAP points per level).
Fusion Collapse: Models like HIMap see a ~40pp drop when one modality is absent, indicating poor graceful degradation in current fusion strategies.
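
One way to check the linear-degradation claim for a given model is to fit the slope k by least squares with the intercept fixed at zero. The sketch below does this for three severity levels. The helper name and the drop values are illustrative assumptions, not measurements from the benchmark.

```python
def fit_slope_through_origin(levels, drops):
    """Least-squares k for the model drop = k * level (no intercept)."""
    numerator = sum(l * d for l, d in zip(levels, drops))
    denominator = sum(l * l for l in levels)
    return numerator / denominator


levels = [1, 2, 3]          # Easy, Moderate, Hard
drops = [18.0, 41.0, 59.0]  # illustrative mAP drops in points, not measured

k = fit_slope_through_origin(levels, drops)
residuals = [d - k * l for l, d in zip(levels, drops)]
print(f"k = {k:.2f} mAP points per level")
print("residuals:", [round(r, 2) for r in residuals])
```

Small residuals relative to the drops indicate that a single slope describes the degradation well; large ones suggest a threshold or saturation effect instead.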

Safety-critical applications (autonomous driving) are acutely impacted by such vulnerabilities, given that corrupted sensor input can cause the total loss of map semantics, which are vital for planning and navigation.

5. Recommendations and Best Practices

Design guidelines for robust HD-map construction, as established from MapBench ablations (Hao et al., 2024):

Prefer large, pretrained transformer backbones to enhance feature robustness.
Use geometry-aware 2D-to-BEV encoders (e.g., GKT) in lieu of naïve pooling.
Employ explicit temporal fusion to mitigate intermittent sensor failures.
Train with curriculum-style corruption annealing schedules.
Regularly apply corruption-mimicking data augmentations.
Integrate modality-gating or dropout in fusion networks to enable operation with partial or missing sensor data.

Collectively, these recommendations provide a roadmap toward improved resilience and reliability in HD map construction pipelines.
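
The modality-dropout recommendation can be sketched as a training-time step that occasionally blanks one sensor's features, so the fusion network learns to operate on the other alone. This is a generic illustration with hypothetical names and a plain-list feature representation, not the mechanism used by any model in the benchmark.

```python
import random


def modality_dropout(camera_feat, lidar_feat, p_drop=0.2, rng=random):
    """Randomly zero out one modality's features during training.

    With probability p_drop the camera features are replaced by zeros, with
    probability p_drop the LiDAR features are, and otherwise both are kept.
    The two modalities are never dropped together. Requires p_drop <= 0.5.
    """
    u = rng.random()
    if u < p_drop:
        camera_feat = [0.0] * len(camera_feat)
    elif u < 2 * p_drop:
        lidar_feat = [0.0] * len(lidar_feat)
    return camera_feat, lidar_feat


# Count how often each branch fires over many draws with a fixed seed.
rng = random.Random(0)
counts = {"camera_dropped": 0, "lidar_dropped": 0, "both_kept": 0}
for _ in range(1000):
    cam, lid = modality_dropout([1.0, 2.0], [3.0, 4.0], p_drop=0.2, rng=rng)
    if not any(cam):
        counts["camera_dropped"] += 1
    elif not any(lid):
        counts["lidar_dropped"] += 1
    else:
        counts["both_kept"] += 1
print(counts)
```

Dropping at most one modality per step keeps every training sample informative while still exposing the network to the single-sensor condition described in the fusion-collapse finding.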

6. Public Release and Impact

All codebases, corruption toolkits, trained weights, and validation results of MapBench are publicly released, setting the stage for standardized, reproducible benchmarking in the autonomous driving research community (Hao et al., 2024). This resource facilitates direct comparison of robustness strategies, fosters best practices, and expedites the development of safety-critical infrastructure for HD mapping.

Editor’s note: MAPBench datasets and protocols in other domains cover spatial reasoning, route tracing, and geolocalization, including LVLM map navigation (Xing et al., 18 Mar 2025), spatial-reasoning RL (Shen et al., 1 Nov 2025), scalable pixel-accurate path annotation (Panagopoulou et al., 22 Dec 2025), and motion planning algorithm benchmarking (Moll et al., 2014). Readers are referred to the respective articles and repositories for their methodologies and benchmarking schemas.