HDMapNet: Online HD Semantic Mapping
- The paper presents an online HD semantic map construction framework that generates vectorized BEV maps from multi-sensor data, eliminating the need for offline maps.
- It employs a neural view transformer and pillar-based LiDAR encoder to convert perspective and LiDAR inputs into precise road map polylines.
- The fusion mode substantially outperforms camera-only and LiDAR-only variants, reaching 44.5% IoU and 30.6 mAP, and the paper establishes unified evaluation protocols for semantic map learning in autonomous driving.
HDMapNet is an online high-definition semantic map construction and evaluation framework designed to support autonomous driving through scalable, real-time generation of BEV (bird’s-eye-view) vectorized road maps from onboard sensor observations. Unlike traditional mapping pipelines reliant on extensive offline SLAM and manual annotation, HDMapNet enables dynamic inference of road semantics, supporting downstream tasks such as path prediction and planning. The framework represents semantic map elements as polylines in the BEV domain, employs unified evaluation protocols, and demonstrates robust performance improvements over previous projection-based approaches (Li et al., 2021).
1. Problem Definition and Objectives
High-definition semantic map learning is formulated as an online estimation problem:
- Inputs: Surround-view camera images I_1, …, I_Nm from the onboard cameras and/or a 3D LiDAR point cloud P.
- Outputs: A local HD semantic map M composed of vectorized map elements, where each element is a polyline in the ego-vehicle BEV frame.
The main goals are:
- Elimination of costly pre-built global maps.
- Real-time, scalable local map construction from sensor data.
- Provision of unified semantic- and instance-level evaluation protocols.
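To make the input/output formulation concrete, here is a minimal sketch of the vectorized map representation, assuming a simple dataclass with a category string and a point array (the field names and class labels are illustrative, not the paper's exact schema):

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class MapElement:
    category: str       # e.g. "lane_divider", "road_boundary", "crosswalk" (illustrative labels)
    points: np.ndarray  # (N, 2) ordered polyline vertices in the ego-vehicle BEV frame (metres)


# A local HD semantic map is simply a set of such vectorized elements.
local_map = [
    MapElement("lane_divider", np.array([[0.0, -1.5], [10.0, -1.5], [20.0, -1.4]])),
    MapElement("road_boundary", np.array([[0.0, 5.0], [20.0, 5.0]])),
]
print(len(local_map))  # 2
```

Representing elements as ordered vertex lists (rather than dense raster masks) is what allows downstream planners to consume the map directly.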
2. Architectural Modules
HDMapNet comprises four key components:
| Module | Input Type | Function |
|---|---|---|
| Perspective-View Image Encoder | Camera | Multi-scale PV features |
| Neural View Transformer | PV features | PV BEV mapping |
| Pillar-based LiDAR Encoder | LiDAR | BEV pillar features |
| BEV Map Decoder | Fused BEV tensor | Vectorized map output |
- Camera-only: BEV features come solely from the neural view transformer.
- LiDAR-only: BEV features come solely from the pillar-based LiDAR encoder.
- Fusion: Camera-derived BEV and LiDAR BEV features are concatenated before the BEV decoder, maximizing information content.
The BEV Decoder includes three output heads for semantic segmentation, instance embedding, and direction classification.
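The module composition above can be sketched end to end. This is a shape-level mock, not the paper's implementation: the branch functions return random tensors, and the grid size, channel counts, and number of direction bins are illustrative placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, C = 200, 400, 64   # BEV grid size and per-branch channel count (illustrative values)


def camera_branch(images):
    """Stand-in for the PV image encoder + neural view transformer."""
    return rng.standard_normal((C, H, W))


def lidar_branch(point_cloud):
    """Stand-in for the pillar-based LiDAR BEV encoder."""
    return rng.standard_normal((C, H, W))


def bev_decoder(bev):
    """Stand-in decoder emitting the three output heads."""
    seg = rng.standard_normal((4, H, W))         # semantic logits (3 map classes + background)
    embed = rng.standard_normal((16, H, W))      # per-pixel instance embedding
    direction = rng.standard_normal((36, H, W))  # discretized direction bins
    return seg, embed, direction


# Fusion mode: concatenate camera and LiDAR BEV features along channels before decoding.
fused = np.concatenate([camera_branch(None), lidar_branch(None)], axis=0)
seg, embed, direction = bev_decoder(fused)
print(fused.shape)  # (128, 200, 400)
```

Channel-wise concatenation doubles the decoder's input width but lets it learn which sensor to trust per BEV cell.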
3. Neural View Transformation and Polyline Vectorization
The camera branch leverages a neural view transformer for perspective-to-BEV mapping:
- For each PV image, the shared image encoder extracts a perspective-view feature map.
- A multi-layer perceptron models the relation between PV pixels and BEV cells, mapping each camera's PV features to top-down features in that camera's coordinate frame.
- Camera extrinsics then warp each camera's top-down features into the ego-vehicle BEV frame.
- The per-camera BEV features are averaged over all N_m cameras into a single BEV feature map.
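The steps above can be sketched numerically. This toy version assumes a single learned weight matrix relating every flattened PV pixel to every top-down cell, shared across channels, and it omits the extrinsic warp; all sizes are toy values, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(1)
Hp, Wp, C = 8, 16, 32  # PV feature-map size (downsampled) and channels; toy values
Hb, Wb = 10, 20        # per-camera top-down grid size; toy values

# Dense PV-pixel -> top-down-cell relation, one (Hb*Wb, Hp*Wp) matrix shared across channels.
W_vt = rng.standard_normal((Hb * Wb, Hp * Wp)) / np.sqrt(Hp * Wp)


def view_transform(f_pv):
    """(C, Hp, Wp) PV features -> (C, Hb, Wb) camera-frame top-down features."""
    flat = f_pv.reshape(C, -1)                 # (C, Hp*Wp)
    return (flat @ W_vt.T).reshape(C, Hb, Wb)  # every cell sees every PV pixel


# Per-camera features would next be warped to the ego frame with extrinsics
# (omitted here) and averaged over the cameras, six in this toy setup.
cams = [view_transform(rng.standard_normal((C, Hp, Wp))) for _ in range(6)]
f_bev = np.mean(cams, axis=0)
print(f_bev.shape)  # (32, 10, 20)
```

The key idea is that the PV-to-BEV correspondence is learned rather than fixed by a flat-ground homography, which is what separates this from IPM-style projection.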
Instance polylines are constructed by clustering the predicted embedding map into instances and then greedily tracing each instance into an ordered polyline guided by the predicted direction bins. The resulting map representation is a vectorized set of polylines rather than a dense occupancy grid.
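A simplified version of the tracing step can be written as a greedy nearest-neighbour walk over one instance's BEV pixels. This is a stand-in for the paper's direction-guided tracing (it orders by spatial proximity instead of predicted direction bins):

```python
import numpy as np


def trace_polyline(pixels):
    """Greedily order one instance's BEV pixels into a polyline.

    Simplified stand-in: repeatedly append the remaining pixel
    nearest to the current endpoint.
    """
    pts = list(map(tuple, pixels))
    path = [pts.pop(0)]  # start from an arbitrary pixel of the instance
    while pts:
        last = np.array(path[-1])
        dists = [np.linalg.norm(last - np.array(p)) for p in pts]
        path.append(pts.pop(int(np.argmin(dists))))
    return np.array(path)


pixels = np.array([[0, 0], [3, 0], [1, 0], [2, 0]])
print(trace_polyline(pixels).tolist())  # [[0, 0], [1, 0], [2, 0], [3, 0]]
```

The real pipeline additionally uses the direction head to disambiguate branches and choose the walking direction, which plain nearest-neighbour ordering cannot do at junctions.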
4. Loss Functions and Training Protocol
HDMapNet's total loss function combines semantic, instance, and direction losses:
- Semantic segmentation: Pixel-wise cross-entropy.
- Instance embedding: Discriminative loss combining a variance term (pulling each pixel embedding toward its instance mean) and a distance term (pushing different instance means apart).
- Direction classification: Cross-entropy on direction classes, lane pixels only.
- Optimization: Adam with weight decay; the learning rate decays by a factor of 0.1 every 10 epochs.
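The instance-embedding term can be made concrete with a numpy sketch of the standard hinged discriminative loss; the margins `delta_v` and `delta_d` here are assumed values, not taken from the paper:

```python
import numpy as np


def discriminative_loss(embed, labels, delta_v=0.5, delta_d=3.0):
    """Variance and distance terms of the hinged, squared discriminative loss.

    embed: (P, D) pixel embeddings; labels: (P,) instance ids (0 = background).
    """
    ids = np.unique(labels[labels > 0])
    means = {i: embed[labels == i].mean(axis=0) for i in ids}
    # Variance term: penalize pixels farther than delta_v from their instance mean.
    l_var = np.mean([
        np.mean(np.maximum(
            np.linalg.norm(embed[labels == i] - means[i], axis=1) - delta_v, 0.0) ** 2)
        for i in ids
    ])
    # Distance term: penalize instance means closer than 2 * delta_d to each other.
    pairs = [(a, b) for a in ids for b in ids if a != b]
    l_dist = np.mean([
        np.maximum(2 * delta_d - np.linalg.norm(means[a] - means[b]), 0.0) ** 2
        for a, b in pairs
    ]) if pairs else 0.0
    return l_var, l_dist


# Two tight, well-separated clusters incur zero loss under these margins.
embed = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
labels = np.array([1, 1, 2, 2])
l_var, l_dist = discriminative_loss(embed, labels)
print(l_var, l_dist)  # 0.0 0.0
```

In training this would be weighted and summed with the segmentation and direction cross-entropy terms to form the total loss.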
5. Sensor Fusion and Performance
Three sensor integration modes are supported:
- HDMapNet(Surr): Camera-only, adept at lane dividers and crosswalks.
- HDMapNet(LiDAR): LiDAR-only, excels at geometry but less effective for lane markings.
- HDMapNet(Fusion): Concatenation of camera and LiDAR BEV features before the BEV decoder.
Fusion yields significant improvements:
| Method | IoU (All %) | CD (m) | mAP (All %) |
|---|---|---|---|
| IPM(CB) | 32.4 | 0.839 | 19.7 |
| Lift-Splat-Shoot | 30.8 | 0.968 | 17.4 |
| VPN | 29.3 | 1.337 | 17.5 |
| HDMapNet(Surr) | 32.9 | 0.834 | 22.7 |
| HDMapNet(LiDAR) | 29.5 | 1.101 | 11.6 |
| HDMapNet(Fusion) | 44.5 | 0.639 | 30.6 |
Fusion achieves a 12.1-point absolute IoU gain and a 10.9-point mAP gain over the strongest camera-based baseline, IPM(CB) (Li et al., 2021).
6. Evaluation Metrics and Temporal Consistency
HDMapNet employs both Eulerian and Lagrangian evaluation protocols:
- Semantic IoU (Eulerian): IoU(M, M') = |M ∩ M'| / |M ∪ M'| between predicted and ground-truth rasterized maps.
- Chamfer Distance (CD, Lagrangian) for vectorized curves: the sum of both directed Chamfer distances, where each directed term averages, over the points of one curve, the Euclidean distance to the nearest point on the other curve.
- Instance-level mAP: Average precision over recall thresholds, with true positives defined by Chamfer distance criteria.
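The bidirectional Chamfer distance between two sampled curves is short enough to write out directly; this is the standard definition, computed here on dense pairwise distances:

```python
import numpy as np


def chamfer(c1, c2):
    """Bidirectional Chamfer distance between point sets c1 (N, 2) and c2 (M, 2)."""
    d = np.linalg.norm(c1[:, None, :] - c2[None, :, :], axis=-1)  # (N, M) pairwise distances
    return d.min(axis=1).mean() + d.min(axis=0).mean()            # both directed terms


pred = np.array([[0.0, 0.0], [1.0, 0.0]])
gt = np.array([[0.0, 0.5], [1.0, 0.5]])
print(chamfer(pred, gt))  # 1.0  (each point is 0.5 m from its nearest match, both directions)
```

For instance-level mAP, a predicted polyline counts as a true positive when its Chamfer distance to a ground-truth element falls below a threshold.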
Temporal fusion via max-pooling BEV probabilities across ego poses supports locally consistent map accumulation, improving robustness to sensor variability and environmental changes.
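The temporal accumulation step reduces to an element-wise maximum once past BEV probability maps have been warped into the current ego frame (the warping itself, which uses the relative ego poses, is omitted here):

```python
import numpy as np


def fuse_temporal(prob_maps):
    """Element-wise max over BEV probability maps already aligned to the current ego frame."""
    return np.maximum.reduce(prob_maps)


t0 = np.array([[0.2, 0.9], [0.1, 0.4]])
t1 = np.array([[0.7, 0.3], [0.0, 0.8]])
print(fuse_temporal([t0, t1]).tolist())  # [[0.7, 0.9], [0.1, 0.8]]
```

Max-pooling keeps the most confident observation of each cell across frames, so map elements seen clearly in any recent frame survive momentary occlusion.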
7. Limitations, Extensions, and Related Frameworks
Key limitations include:
- Heuristic vectorization: Polyline tracing is greedy; learned graph-generation may improve topology.
- Simple fusion: Camera-LiDAR fusion is concatenation; uncertainty-aware fusion mechanisms could further enhance complementarity.
- Accuracy tradeoff: Online maps do not match offline map precision but offer scalability.
Future extensions should consider advanced fusion strategies, temporal sequence modeling, and expansion to richer semantic layers (e.g., curbs, signage). Related approaches such as the input-level raster fusion and online map prediction in HDNET (Yang et al., 2020), explicit height modeling and foreground-background masking in HeightMapNet (Qiu et al., 2024), and global vector map construction with GlobalMapNet (Shi et al., 2024) extend HDMapNet's methodology to 3D object detection, height-aware BEV learning, and global online mapping, respectively.
HDMapNet defines the formal problem of online HD semantic map learning, establishes comprehensive evaluation standards, and delivers substantial performance gains over prior BEV and projection-based semantic mapping strategies (Li et al., 2021).