
SatMap: Satellite-Aided HD Map Estimation

Updated 22 January 2026
  • SatMap is a high-definition map construction framework that fuses satellite imagery with multi-view camera data to produce vectorized HD maps essential for autonomous vehicles.
  • It employs a dual-stream BEV extraction pipeline, using a Swin-Tiny Transformer for satellite data and a ResNet-50 with geometry guidance for camera views.
  • Experimental results on nuScenes demonstrate marked improvements in long-range and adverse weather conditions over camera-only and camera-LiDAR fusion baselines.

SatMap is a high-definition (HD) map construction framework that leverages satellite maps as global priors for online vectorized HD map estimation, substantially advancing autonomous driving applications. By fusing lane-level semantic detail and texture from high-resolution bird’s-eye view (BEV) satellite imagery with multi-view camera observations, SatMap directly predicts vectorized HD maps suitable for downstream planning and prediction modules. Experimental findings on nuScenes demonstrate marked improvements over camera-only and camera-LiDAR fusion baselines, most notably in long-range and adverse weather scenarios (Mazumder et al., 15 Jan 2026).

1. Motivation and Conceptual Foundations

Traditional online HD map construction systems predominantly rely on onboard sensors such as cameras or LiDAR. These modalities suffer from depth ambiguity due to limited range perception, frequent occlusions from dynamic and static objects, and degradation caused by adverse lighting and weather conditions. Projected BEV features from monocular or multi-camera setups often exhibit distance-dependent spatial inconsistencies, resulting in incomplete or noisy map outputs—particularly at extended ranges or when road markings are obstructed (Mazumder et al., 15 Jan 2026).

Satellite imagery addresses these limitations through its true BEV acquisition: lane dividers, boundaries, and crosswalks are visible without perspective distortion or occlusion from traffic, offering global context over wide areas and enhancing long-range coverage. High-resolution satellite tiles (approximately zoom level 20) provide rich semantic and topological priors, which mitigate depth ambiguity and occlusion. This global prior, when fused with multi-view BEV camera features, enhances spatial consistency and completeness in map estimation (Mazumder et al., 15 Jan 2026).
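The roughly 0.15 m/pixel resolution of zoom-level-20 tiles follows from the standard Web Mercator ground-resolution formula. This is generic map-tile geometry, not anything specific to SatMap; a minimal sketch:

```python
import math

def ground_resolution(lat_deg: float, zoom: int) -> float:
    """Approximate Web-Mercator ground resolution in metres per pixel.

    Earth circumference, scaled by cos(latitude), divided by the pixel
    width of the world map at this zoom level (256 * 2**zoom pixels).
    """
    earth_circumference_m = 2 * math.pi * 6378137.0  # WGS84 equatorial radius
    return earth_circumference_m * math.cos(math.radians(lat_deg)) / (256 * 2 ** zoom)

# At the equator, zoom 20 gives ~0.149 m/pixel, consistent with the
# ~0.15 m/pixel figure commonly quoted for level-20 satellite tiles.
print(round(ground_resolution(0.0, 20), 3))  # → 0.149
```

Resolution degrades with latitude (the cosine factor), which is one reason geo-registration of tiles to the metric BEV grid matters.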

2. Architectural Overview and Feature Fusion Pipeline

The SatMap model comprises four principal stages:

  • Camera BEV Feature Extraction: Utilizes a ResNet-50 backbone (ImageNet-pretrained) with a geometry-guided kernel transformer (GKT) to lift multi-scale camera features into a unified BEV grid.
  • Satellite BEV Feature Extraction: Employs a Swin-Tiny Transformer with Generalized Feature Pyramid Network (GFPN) to generate BEV-aligned global features from geo-registered satellite tiles.
  • BEV Fusion Module ("ConvFuser"): Projects camera and satellite BEV features to a common channel dimension using 1×1 convolutions, concatenates them, and performs modal fusion via residual convolutional blocks. This approach is robust to small misalignments between modalities.
  • Map Decoding: Deploys a DETR-style transformer decoder maintaining learnable map queries. Each query predicts a vectorized HD map instance—semantically labelled polylines representing lanes, boundaries, and crosswalks.

Formally, satellite feature extraction operates on a cropped and geo-aligned satellite image $S \in \mathbb{R}^{H_s \times W_s \times 3}$, producing multi-scale features $F^\text{sat}_l = \text{SwinT}_l(S)$, which the GFPN aggregates into $F^\text{sat}_\text{bev}$. Camera BEV extraction is realized through geometry-guided sampling kernels projecting perspective features onto the BEV grid: $F^\text{cam}_\text{bev}(x,y) = \sum_{u,v} K_{(x,y),(u,v)} \cdot F^\text{cam}(u,v)$. The fusion stage concatenates both modalities: $F^\text{fused} = \text{ConvBlocks}(\text{concat}(F^\text{cam}_\text{proj},\, F^\text{sat}_\text{proj}))$ (Mazumder et al., 15 Jan 2026).
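The project-concatenate-mix pattern described for ConvFuser can be sketched as a small PyTorch module. Channel widths and the exact block layout below are illustrative assumptions, not the paper's configuration:

```python
import torch
import torch.nn as nn

class ConvFuser(nn.Module):
    """Sketch of the ConvFuser idea: project both BEV feature maps to a
    shared channel width with 1x1 convolutions, concatenate along channels,
    then mix with a stack of 3x3 convolutional blocks, residual where
    channel widths permit. All sizes here are illustrative."""

    def __init__(self, cam_ch=256, sat_ch=192, out_ch=256, num_blocks=4):
        super().__init__()
        self.cam_proj = nn.Conv2d(cam_ch, out_ch, kernel_size=1)
        self.sat_proj = nn.Conv2d(sat_ch, out_ch, kernel_size=1)
        self.blocks = nn.ModuleList(
            nn.Sequential(
                # First block reduces the concatenated 2*out_ch channels.
                nn.Conv2d(2 * out_ch if i == 0 else out_ch, out_ch, 3, padding=1),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
            )
            for i in range(num_blocks)
        )

    def forward(self, f_cam, f_sat):
        x = torch.cat([self.cam_proj(f_cam), self.sat_proj(f_sat)], dim=1)
        for i, block in enumerate(self.blocks):
            y = block(x)
            x = y if i == 0 else x + y  # residual once channel widths match
        return x
```

Because the 3×3 convolutions pool information from neighbouring BEV cells, small geo-registration errors between the two modalities are absorbed locally, which is the robustness property claimed for this design.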

3. Training Objectives and HD Map Representation

The map decoding stage outputs a set $\mathcal{M} = \{L_i\}_{i=1}^N$ of ordered 2D polylines $L_i \in \mathbb{R}^{M_i \times 2}$, each attached to a semantic class $c_i \in \{\text{lane\_divider},\, \text{boundary},\, \text{crosswalk}\}$. Decoder queries attend to the fused BEV features using multi-head attention, predicting both class logits and BEV point coordinates.

Training employs set-based Hungarian matching between predictions $(c_i, L_i)$ and ground truths, minimizing the loss
$$\mathcal{L} = \sum_{i=1}^{N} \Big[\, \alpha\, L_\text{cls}(c_i,\, \hat c_{\sigma(i)}) \;+\; \beta \sum_j L_{L1}(p_{i,j},\, \hat p_{\sigma(i),j}) \;+\; \gamma\, L_\text{Chamfer}(L_i,\, \hat L_{\sigma(i)}) \,\Big],$$
where $\sigma$ is the matching permutation and the Chamfer term enforces topological shape consistency between predicted and ground-truth map elements (Mazumder et al., 15 Jan 2026).
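The matching step can be illustrated with a toy assignment over a Chamfer cost matrix, solved with SciPy's `linear_sum_assignment`. This is a sketch of the general technique, not the paper's implementation, and it omits the classification and per-point L1 terms:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def chamfer(a: np.ndarray, b: np.ndarray) -> float:
    """Symmetric Chamfer distance between the points of two (M, 2) polylines."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # (M_a, M_b)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def hungarian_match(pred_lines, gt_lines):
    """Match predicted to ground-truth polylines at minimal total Chamfer cost.

    Builds the full prediction/ground-truth cost matrix and solves the
    assignment with the Hungarian algorithm; the resulting permutation is
    the sigma over which the set-based loss is evaluated.
    """
    cost = np.array([[chamfer(p, g) for g in gt_lines] for p in pred_lines])
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows, cols)), cost[rows, cols].sum()
```

Because the loss is evaluated only under the optimal permutation, the decoder's queries need not emit map elements in any particular order.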

4. Dataset, Preprocessing, and Experimental Validation

Data preparation entails synchronizing ego poses from nuScenes to WGS84 geocoordinates, retrieving satellite tiles (zoom 20) from sources such as Google Maps, Mapbox, or OpenSatMap, and cropping ego-centric regions. Landmark-based coarse alignment ensures BEV grids from the satellite and camera modalities are matched (Mazumder et al., 15 Jan 2026, Zhao et al., 2024).
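Tile retrieval from ego geocoordinates typically goes through the standard slippy-map tiling scheme shared by Google Maps and Mapbox. The following is a generic sketch of that lookup, not SatMap's exact pipeline; provider APIs differ in details:

```python
import math

def latlon_to_tile(lat_deg: float, lon_deg: float, zoom: int = 20):
    """Standard Web-Mercator (slippy-map) tile index for a WGS84 point.

    Longitude maps linearly to the x index; latitude goes through the
    Mercator projection (asinh of the tangent) for the y index.
    """
    n = 2 ** zoom  # tiles per axis at this zoom level
    x = int((lon_deg + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat_deg)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    return x, y
```

Each synchronized ego pose yields one such tile index, after which the ego-centric region is cropped from the fetched imagery and coarsely aligned to the camera BEV grid.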

Training details:

  • BEV resolution: 40 × 80 cells (0.75 m/cell, 60 m × 30 m coverage)
  • Satellite backbone: Swin-Tiny, pretrained
  • Fusion: 4 residual convolution blocks, 3 × 3 kernels
  • Decoder: 100 queries, 6 layers, 8 heads
  • Optimizer: AdamW, 24 epochs, batch size 32 on 4 × NVIDIA A40 GPUs
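The hyperparameters above can be gathered into a single configuration sketch; the values come from the text, but the dict structure and field names are illustrative, not the paper's:

```python
# Training configuration from the reported SatMap setup; field names are
# illustrative assumptions, values are as stated in the text.
SATMAP_CONFIG = {
    "bev": {"grid": (40, 80), "cell_m": 0.75, "range_m": (60, 30)},
    "satellite_backbone": "swin_tiny_pretrained",
    "fusion": {"residual_blocks": 4, "kernel": 3},
    "decoder": {"queries": 100, "layers": 6, "heads": 8},
    "optim": {"name": "AdamW", "epochs": 24, "batch_size": 32, "gpus": 4},
}

# Sanity check: grid extent times cell size recovers the metric coverage
# (80 cells * 0.75 m = 60 m long, 40 cells * 0.75 m = 30 m wide).
h, w = SATMAP_CONFIG["bev"]["grid"]
cell = SATMAP_CONFIG["bev"]["cell_m"]
assert (w * cell, h * cell) == SATMAP_CONFIG["bev"]["range_m"]
```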

Quantitative performance (nuScenes dataset, 60 m × 30 m, 24 epochs):

  • MapTR (camera only): 50.3% mAP
  • MapTR-Fusion (camera + LiDAR): 62.5% mAP
  • SatMap (camera + satellite): 67.8% mAP (a relative gain of +34.8% over camera-only and +8.5% over camera–LiDAR fusion)

In the long-range setting (110 epochs), SatMap reaches 72.7% mAP, versus 58.7% for MapTR and 60.6% for ScalableMap. Weather-specific results indicate SatMap enhances robustness: mAP during rain increases from 52.8% (MapTR) to 56.2% (SatMap), and average mAP rises from 58.4% to 70.8% (Mazumder et al., 15 Jan 2026).

Ablation demonstrates that the Swin-Tiny satellite encoder combined with ConvFuser delivers the highest gains (67.8% mAP), outperforming cross-attention designs or alternative backbones.

5. Relation to Prior Methods and Datasets

SatMap builds on prior map-fusion architectures, notably improving over SatforHDMap’s cross-attention fusion and P-MapNet’s static SD map priors. ConvFuser’s local fusion and robust satellite feature extraction allow superior geometry recovery and occlusion handling compared to generic concatenation and attention strategies. Compared with SATMapTR (Huang et al., 12 Dec 2025), SatMap opts for swappable transformer-based satellite encoders and convolutional fusion, whereas SATMapTR employs hierarchical gated denoising and strict grid-to-grid fusion for enhanced signal extraction.

The OpenSatMap dataset (Zhao et al., 2024) supports high-fidelity satellite patch provision with fine-grained instance-level annotation, true level-20 resolution (0.15 m/pixel), full nuScenes and Argoverse 2 coverage, and large-scale vectorized line annotations. These properties facilitate evaluation of satellite–camera fusion paradigms and catalyze method development for scalable urban mapping.

Dataset        Resolution [m/pixel]  Annotation Type  Coverage
OpenSatMap 19  0.30                  Instance/Vector  60 cities, 19 countries
OpenSatMap 20  0.15                  Instance/Vector  60 cities, 19 countries

6. Limitations, Failure Modes, and Prospective Directions

SatMap’s reliance on single-frame fusion and high-resolution up-to-date satellite imagery introduces limitations regarding temporal coherence and resilience to transient misalignment. Satellite occlusions (vegetation, parked vehicles) and stale imagery may misrepresent true lane layouts. Severe camera corruption (rain, lens contamination) can still impair BEV feature quality, limiting fusion efficacy in those settings.

A plausible implication is the necessity for temporal fusion architectures that ingest multi-frame camera and historic satellite data to enhance map stability and occlusion recovery. On-board deployment also motivates backbone pruning or distillation for real-time inference. Prospective multi-modal extensions would integrate radar or low-cost LiDAR signals, using satellite priors as anchors for further geometric refinement.

7. Applications and Future Work

SatMap’s satellite-anchored online HD map construction directly benefits autonomous driving by improving spatial consistency, long-range prediction, and occlusion-remediation. Use-cases include dynamic map updating, urban planning, continual HD map database construction, and cross-modal fusion for robust perception. Future work directions include temporal model extension, lightweight encoder development for automotive hardware, and the incorporation of dynamic, automatically-updated satellite priors for real-world change detection (Mazumder et al., 15 Jan 2026).

By exploiting rich top-down satellite information, SatMap sets a new standard for practical, vectorized HD map estimation under adverse and diverse environmental conditions, and provides a foundation for scalable, robust urban mapping solutions.
