HDMapNet: Online HD Semantic Mapping
- HDMapNet is an online semantic map learning framework that predicts lane boundaries, dividers, and crosswalks directly from camera and LiDAR inputs, eliminating the need for extensive manual annotation.
- It employs a multi-branch architecture with perspective image encoding, LiDAR pillar processing, and a BEV decoder that outputs semantic segmentation, instance embeddings, and direction fields.
- The fusion modality notably improves metrics such as IoU and mAP, offering scalable, robust mapping solutions suitable for dynamic urban driving environments.
HDMapNet is an online semantic map learning framework designed to dynamically construct high-definition (HD) vectorized road semantics from onboard sensor measurements in autonomous driving scenarios. Unlike traditional HD mapping pipelines reliant on extensive manual annotation and resource-intensive survey processes, HDMapNet predicts lane boundaries, dividers, and pedestrian crossings directly from multi-modal inputs, enabling scalable map generation in the vehicle’s local vicinity.
1. Formulation of HD Semantic Map Learning
HD semantic map learning targets the prediction of vectorized map elements (lane boundaries, dividers, crosswalks) in the bird's-eye view (BEV) frame from sensor observations consisting of multi-camera images $\mathcal{I}$ and/or LiDAR point clouds $\mathcal{P}$. The mapping function is:

$$\mathcal{M} = f_\theta(\mathcal{I}, \mathcal{P})$$

where $\mathcal{M}$ denotes the set of local map elements and $\theta$ the network parameters.
HDMapNet decomposes this mapping into four main neural modules:
- $\phi_{\mathrm{img}}$: Perspective-view image encoder
- $\phi_{\mathrm{view}}$: Neural view transformer (perspective to camera-frame BEV)
- $\phi_{\mathrm{lidar}}$: Pillar-based LiDAR encoder
- $\phi_{\mathrm{bev}}$: BEV decoder yielding semantic segmentation, instance embeddings, and direction fields
Supervised training employs losses on BEV semantics, instance clusters, and direction labels.
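The four-module decomposition can be sketched structurally as below. All module bodies here are trivial stand-ins (the real modules are learned networks), and the grid sizes, channel counts, and function names are illustrative placeholders, not the reference implementation:

```python
import numpy as np

H, W, C = 50, 100, 8  # illustrative BEV grid and channel sizes (placeholders)

def image_encoder(img):
    # Stand-in for the EfficientNet-B0 perspective encoder.
    return np.full((H, W, C), img.mean())

def view_transform(pv_feats):
    # Stand-in for the per-camera MLP + ego warp: average across cameras.
    return np.mean(pv_feats, axis=0)

def pillar_encoder(points):
    # Stand-in for PointNet-over-pillars: scatter point occupancy into the grid.
    bev = np.zeros((H, W, C))
    ix = np.clip(points[:, 0].astype(int), 0, H - 1)
    iy = np.clip(points[:, 1].astype(int), 0, W - 1)
    bev[ix, iy, 0] = 1.0  # occupancy only, for illustration
    return bev

def bev_decoder(bev):
    # Stand-in for the FCN decoder with its three parallel heads.
    return {
        "semantics": bev.sum(-1, keepdims=True),  # (H, W, 1)
        "embedding": bev[..., :4],                # (H, W, 4)
        "direction": bev[..., :2],                # (H, W, 2)
    }

def hdmapnet_forward(images, points):
    pv = [image_encoder(im) for im in images]  # phi_img, per camera
    bev_cam = view_transform(pv)               # phi_view -> camera BEV
    bev_lidar = pillar_encoder(points)         # phi_lidar -> LiDAR BEV
    fused = bev_cam + bev_lidar                # naive fusion (paper: concat)
    return bev_decoder(fused)                  # phi_bev -> three output heads

out = hdmapnet_forward(np.random.rand(6, 16, 32, 3), np.random.rand(1000, 3) * 40)
```

The point of the sketch is the data flow: both branches are brought into a shared BEV frame before a single decoder produces all three outputs.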
2. Network Architecture and Modalities
HDMapNet utilizes a multi-branch input structure supporting camera-only, LiDAR-only, or camera-LiDAR fusion modalities, all converging into a unified BEV decoder. The architecture encompasses:
- Camera branch: surround-view camera images processed via EfficientNet-B0 (pre-trained on ImageNet) into perspective-view feature maps.
- View transformation: for each camera, an MLP maps perspective-grid features to a small camera-frame BEV grid; the per-camera grids are then warped into the ego-vehicle frame using the camera extrinsics and averaged to form the camera BEV features.
- LiDAR branch: 32-beam LiDAR input, dynamically voxelized into pillars, each aggregated with a PointNet to yield pillar features, which a dedicated 2D CNN encodes into BEV feature maps.
- BEV decoder: fully convolutional (ResNet-style), with three simultaneous output heads: semantic segmentation (per-pixel softmax), instance embedding (a $d$-dimensional vector per pixel), and direction classification (angles discretized into $N_d$ bins).
Intermediate resolutions comprise downsampled perspective-view feature maps and a fixed-size BEV grid; the exact grid dimensions used in the experiments are given in Section 6.
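The per-camera view transformation can be sketched as follows. All sizes, the random MLP weights, and the nearest-neighbor scatter warp are illustrative placeholders; in the actual model the MLP is learned end to end and the warp uses the full camera extrinsics:

```python
import numpy as np

Hp, Wp, C = 8, 16, 4   # perspective feature map (placeholder sizes)
Hb, Wb = 10, 20        # per-camera BEV grid (placeholder sizes)
rng = np.random.default_rng(0)
W_mlp = rng.normal(0, 0.01, (Hp * Wp, Hb * Wb))  # learned in practice

def view_transform(feat_pv, yaw, t):
    """Map (Hp, Wp, C) perspective features to an ego-frame BEV grid."""
    # 1) channel-wise MLP from the perspective grid to a camera-frame BEV grid
    flat = feat_pv.reshape(Hp * Wp, C)              # (Hp*Wp, C)
    bev_cam = (W_mlp.T @ flat).reshape(Hb, Wb, C)   # (Hb, Wb, C)
    # 2) warp camera-frame BEV into the ego frame via the camera yaw/translation
    ys, xs = np.meshgrid(np.arange(Hb), np.arange(Wb), indexing="ij")
    pts = np.stack([xs - Wb / 2, ys - Hb / 2], -1).reshape(-1, 2)  # centered
    R = np.array([[np.cos(yaw), -np.sin(yaw)], [np.sin(yaw), np.cos(yaw)]])
    ego = pts @ R.T + t                             # rotate + translate
    gx = np.clip((ego[:, 0] + Wb / 2).round().astype(int), 0, Wb - 1)
    gy = np.clip((ego[:, 1] + Hb / 2).round().astype(int), 0, Hb - 1)
    out = np.zeros((Hb, Wb, C))
    out[gy, gx] = bev_cam.reshape(-1, C)            # nearest-neighbor scatter
    return out

# average the warped grids over six cameras to get the fused camera BEV
feats = [view_transform(rng.normal(size=(Hp, Wp, C)), yaw, np.zeros(2))
         for yaw in np.linspace(0, 2 * np.pi, 6, endpoint=False)]
bev_camera = np.mean(feats, axis=0)                 # (Hb, Wb, C)
```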
3. Training Objectives
HDMapNet employs a composite training loss comprising:
- Semantic segmentation loss (cross-entropy):

$$\mathcal{L}_{\mathrm{seg}} = -\frac{1}{|P|}\sum_{p \in P} \sum_{c} y_{p,c} \log \hat{y}_{p,c}$$

- Discriminative instance embedding loss [De Brabandere et al., 2017], with a cluster variance term and a separation term governed by margins $\delta_v$, $\delta_d$:

$$\mathcal{L}_{\mathrm{var}} = \frac{1}{C}\sum_{c=1}^{C}\frac{1}{N_c}\sum_{i=1}^{N_c}\big[\lVert \mu_c - e_i \rVert - \delta_v \big]_+^2, \qquad \mathcal{L}_{\mathrm{dist}} = \frac{1}{C(C-1)}\sum_{c_A \neq c_B}\big[2\delta_d - \lVert \mu_{c_A} - \mu_{c_B} \rVert \big]_+^2$$

- Direction classification loss $\mathcal{L}_{\mathrm{dir}}$: per-pixel softmax cross-entropy over direction bins, applied only at lane pixels.
Full training objective:

$$\mathcal{L} = \mathcal{L}_{\mathrm{seg}} + \lambda_{\mathrm{var}}\mathcal{L}_{\mathrm{var}} + \lambda_{\mathrm{dist}}\mathcal{L}_{\mathrm{dist}} + \lambda_{\mathrm{dir}}\mathcal{L}_{\mathrm{dir}}$$

The margins $\delta_v$, $\delta_d$ and the loss weights $\lambda_{\mathrm{var}}$, $\lambda_{\mathrm{dist}}$, $\lambda_{\mathrm{dir}}$ are tuned as hyperparameters.
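The discriminative embedding loss can be sketched in NumPy as below. The margin values `delta_v=0.5` and `delta_d=3.0` are common defaults from the De Brabandere et al. formulation, not necessarily the values used by HDMapNet:

```python
import numpy as np

def discriminative_loss(emb, inst, delta_v=0.5, delta_d=3.0):
    """Pull embeddings toward their instance mean; push means apart.

    emb:  (N, D) per-pixel embeddings; inst: (N,) instance ids (0 = background).
    Margins delta_v / delta_d are common defaults, assumed here for illustration.
    """
    ids = [i for i in np.unique(inst) if i != 0]
    means = [emb[inst == i].mean(axis=0) for i in ids]
    # variance term: hinged pull of each pixel toward its cluster mean
    l_var = np.mean([
        np.mean(np.maximum(np.linalg.norm(emb[inst == i] - mu, axis=1)
                           - delta_v, 0.0) ** 2)
        for i, mu in zip(ids, means)
    ])
    # distance term: hinged push between every pair of cluster means
    pairs = [(a, b) for k, a in enumerate(means) for b in means[k + 1:]]
    l_dist = np.mean([
        np.maximum(2 * delta_d - np.linalg.norm(a - b), 0.0) ** 2
        for a, b in pairs
    ]) if pairs else 0.0
    return l_var + l_dist
```

Two tight, well-separated clusters incur zero loss; clusters whose means fall within $2\delta_d$ of each other are penalized quadratically.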
4. Output Map Representation and Vectorization Workflow
The BEV decoder produces three grids; for each BEV pixel $p$:
- $S(p)$: semantic class probabilities
- $E(p)$: instance embedding vector
- $D(p)$: direction logits
Inference post-processing includes:
- Thresholding the semantic probabilities to select lane-mask pixels.
- Clustering per-pixel embeddings via DBSCAN to obtain instance pixel sets.
- Applying non-maximum suppression to instance confidences.
- For each instance, tracing the polyline by selecting a seed pixel and recursively following the predicted direction field.
- Simplification of polylines (e.g., Ramer–Douglas–Peucker) into ordered vectors of 2D meter-space coordinates.
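The final simplification step can be sketched with a minimal Ramer–Douglas–Peucker implementation; the tolerance `eps` and the example polyline are illustrative:

```python
import numpy as np

def rdp(points, eps):
    """Ramer-Douglas-Peucker simplification of an ordered polyline.

    points: (N, 2) array of BEV coordinates in meters; eps: maximum
    allowed perpendicular deviation from the simplified curve.
    """
    points = np.asarray(points, dtype=float)
    if len(points) < 3:
        return points
    start, end = points[0], points[-1]
    seg = end - start
    seg_len = np.linalg.norm(seg)
    d = points - start
    if seg_len == 0:
        dists = np.linalg.norm(d, axis=1)
    else:
        # perpendicular distance of each point to the start-end chord
        dists = np.abs(seg[0] * d[:, 1] - seg[1] * d[:, 0]) / seg_len
    i = int(np.argmax(dists))
    if dists[i] > eps:
        left = rdp(points[: i + 1], eps)   # keep the farthest point, recurse
        right = rdp(points[i:], eps)
        return np.vstack([left[:-1], right])
    return np.vstack([start, end])

# a noisy, nearly straight lane collapses to its two endpoints
line = np.array([[0, 0], [1, 0.02], [2, -0.01], [3, 0.03], [4, 0]])
simplified = rdp(line, eps=0.1)
```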
5. Evaluation Protocols and Metrics
Semantic-Level Metrics
- Intersection-over-Union (IoU) over the BEV grid:

$$\mathrm{IoU}(M_{\mathrm{pred}}, M_{\mathrm{gt}}) = \frac{|M_{\mathrm{pred}} \cap M_{\mathrm{gt}}|}{|M_{\mathrm{pred}} \cup M_{\mathrm{gt}}|}$$

- Chamfer Distance (CD) between predicted and ground-truth curve point sets $C_1$, $C_2$:

$$\mathrm{CD}(C_1, C_2) = \frac{1}{|C_1|}\sum_{x \in C_1} \min_{y \in C_2} \lVert x - y \rVert + \frac{1}{|C_2|}\sum_{y \in C_2} \min_{x \in C_1} \lVert x - y \rVert$$
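The symmetric Chamfer distance between two sampled curves can be sketched in a few lines of NumPy (the example point sets are illustrative):

```python
import numpy as np

def chamfer_distance(c1, c2):
    """Symmetric Chamfer distance between point sets c1 (N, 2) and c2 (M, 2):
    mean nearest-neighbor distance in each direction, summed."""
    d = np.linalg.norm(c1[:, None, :] - c2[None, :, :], axis=-1)  # (N, M)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

pred = np.array([[0.0, 0.0], [1.0, 0.0]])
gt = np.array([[0.0, 0.1], [1.0, 0.1]])
cd = chamfer_distance(pred, gt)  # 0.1 each way -> 0.2
```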
Instance-Level Metrics
Instance polylines are treated as "objects" for detection evaluation. Average Precision (AP) is computed at several Chamfer-distance thresholds, with each AP taken as the mean of interpolated precision at ten recall values between $0.1$ and $1.0$. Typical reporting: AP@$0.2$ m, AP@$0.5$ m, AP@$1.0$ m, and their mean (mAP).
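Given per-prediction match flags (predictions sorted by confidence, matched to ground truth within a Chamfer threshold by a separate procedure), the recall-averaged AP step can be sketched as:

```python
import numpy as np

def average_precision(matches, n_gt):
    """Mean interpolated precision at ten recall levels 0.1 ... 1.0.

    matches: boolean array over confidence-sorted predictions, True where a
    prediction matched a ground-truth instance; n_gt: number of GT instances.
    A sketch of the averaging step only; the matching itself is assumed done.
    """
    tp = np.cumsum(matches)
    fp = np.cumsum(~matches)
    recall = tp / n_gt
    precision = tp / (tp + fp)
    ap_points = []
    for r in np.linspace(0.1, 1.0, 10):
        mask = recall >= r
        # interpolated precision: best precision at or beyond this recall
        ap_points.append(precision[mask].max() if mask.any() else 0.0)
    return float(np.mean(ap_points))

# 3 confidence-sorted predictions, 2 correct, 2 GT instances in total
ap = average_precision(np.array([True, False, True]), n_gt=2)
```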
6. Experimental Setup and Results
Dataset
- nuScenes [Caesar et al., 2020]: 1000 urban driving scenes, each annotated for lane boundaries, dividers, pedestrian crossings in BEV crops around keyframes. Inputs: 6 surround cameras, 32-beam LiDAR. Splits correspond to standard nuScenes protocol.
Implementation
- Image branch: EfficientNet-B0 backbone for perspective-view features
- LiDAR branch: PointPillars ($64$-dim pillar features), followed by a pillar CNN that encodes the pillar features into the BEV grid
- BEV decoder: ResNet-style FCN with three parallel heads
- Training: Adam optimizer, batch size $16$, $30$ epochs, with step learning-rate decay every $10$ epochs
- BEV grid: $60\,\mathrm{m} \times 30\,\mathrm{m}$ around the ego vehicle at $0.15\,\mathrm{m}$/pixel ($400 \times 200$ pixels)
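Mapping ego-frame coordinates onto the BEV raster is a small but recurring detail; a sketch follows, assuming the common nuScenes HD-map setup of a $60 \times 30$ m ego-centric range at $0.15$ m/pixel (these extents are assumptions, not quoted from the source):

```python
import numpy as np

def ego_to_pixel(xy, x_range=(-30.0, 30.0), y_range=(-15.0, 15.0), res=0.15):
    """Map ego-frame (x, y) coordinates in meters to BEV (row, col) indices.

    Default extents and 0.15 m/pixel resolution follow the common nuScenes
    HD-map setup (assumed here), giving a 400 x 200 pixel grid.
    """
    xy = np.asarray(xy, dtype=float)
    col = ((xy[:, 0] - x_range[0]) / res).astype(int)  # x (longitudinal) -> col
    row = ((xy[:, 1] - y_range[0]) / res).astype(int)  # y (lateral)      -> row
    n_cols = int((x_range[1] - x_range[0]) / res)      # 400
    n_rows = int((y_range[1] - y_range[0]) / res)      # 200
    return np.clip(row, 0, n_rows - 1), np.clip(col, 0, n_cols - 1)

rows, cols = ego_to_pixel(np.array([[0.0, 0.0], [29.9, 14.9]]))
```

The ego origin lands at the grid center; points just inside the range map to the last valid row/column.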
Quantitative Results
| Variant | Divider IoU | Ped-Crossing IoU | Boundary IoU | All-Classes IoU | mAP ({0.2,0.5,1.0}m) |
|---|---|---|---|---|---|
| HDMapNet(Surr) | 40.6% | 18.7% | 39.5% | 32.9% | 22.7% |
| VPN | 36.5% | 15.8% | 35.6% | 29.3% | 17.5% |
| Lift-Splat | 38.3% | 14.9% | 39.3% | 30.8% | 17.4% |
| IPM (CB) | 38.6% | 19.3% | 39.3% | 32.4% | 19.7% |
| HDMapNet(Fusion) | 46.1% | 31.4% | 56.0% | 44.5% | 30.6% |
The fusion modality yields roughly a $35\%$ relative gain over the camera-only variant in all-classes semantic IoU ($32.9\% \to 44.5\%$) and about $11$ points of absolute mAP improvement over IPM ($19.7\% \to 30.6\%$).
Qualitative observations: HDMapNet produces visually clean, distortion-free BEV vector maps, robust to nighttime/rain conditions and temporally consistent across the local region around the ego vehicle.
7. Modalities, Advantages, Limitations, and Implications
Modality complementarity: Cameras provide enhanced discrimination of color-based semantics (dividers, crosswalks), while LiDAR excels in detecting geometric lane boundaries. Fusion synergistically captures the best features of each.
Advantages:
- Enables purely sensor-driven, online local mapping without reliance on global SLAM or manual annotation
- Direct output of vectorized elements is immediately utilizable in motion planning systems
Limitations and outlook:
- Absolute accuracy lags behind hand-annotated HD maps, though the scalability trade-off is advantageous
- Current fusion employs naive feature concatenation; more sophisticated strategies (attention mechanisms, gating, etc.) may yield further improvements
- Semantic coverage is restricted to three classes; generalization to additional road infrastructure (traffic signs, poles, curbs) is pending
- Highly dynamic urban environments pose challenges and motivate continual-learning approaches
A plausible implication is that HDMapNet serves as a foundation for scalable, sensor-driven semantic mapping, facilitating map generation without manual intervention and with extensible architecture for future expansion of both input modalities and map element classes (Li et al., 2021).