HDMapNet: Online HD Semantic Mapping

Updated 8 January 2026
  • The paper presents an online HD semantic map construction framework that generates vectorized BEV maps from multi-sensor data, eliminating the need for offline maps.
  • It employs a neural view transformer and pillar-based LiDAR encoder to convert perspective and LiDAR inputs into precise road map polylines.
  • Fusion mode significantly improves performance, reaching 44.5% IoU and 30.6 mAP; the paper also establishes unified evaluation protocols for semantic map learning in autonomous driving.

HDMapNet is an online high-definition semantic map construction and evaluation framework designed to support autonomous driving through scalable, real-time generation of BEV (bird’s-eye-view) vectorized road maps from onboard sensor observations. Unlike traditional mapping pipelines reliant on extensive offline SLAM and manual annotation, HDMapNet enables dynamic inference of road semantics, supporting downstream tasks such as path prediction and planning. The framework represents semantic map elements as polylines in the BEV domain, employs unified evaluation protocols, and demonstrates robust performance improvements over previous projection-based approaches (Li et al., 2021).

1. Problem Definition and Objectives

High-definition semantic map learning is formulated as an online estimation problem:

  • Inputs: Surround-view camera images $\{I_i \in \mathbb{R}^{H_{pv} \times W_{pv} \times 3}\}_{i=1}^{N_m}$ and/or a 3D LiDAR sweep $P = \{p_n = (x_n, y_n, z_n)\}_{n=1}^{N}$.
  • Outputs: A local HD semantic map $M$, composed of vectorized map elements $\{C^k\}$, where each $C^k$ is a polyline in the ego-vehicle BEV frame.
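
As a concrete illustration, these outputs could be held in simple containers like the following minimal Python sketch (class and field names are our own, not from the paper; the example category strings follow the paper's three element types):

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class MapElement:
    """One vectorized map element C^k in the ego-vehicle BEV frame."""
    category: str         # e.g. "divider", "ped_crossing", "boundary"
    polyline: np.ndarray  # (L, 2) ordered (x, y) points in metres

@dataclass
class LocalSemanticMap:
    """The local HD semantic map M: a set of vectorized elements {C^k}."""
    elements: list = field(default_factory=list)  # list of MapElement
```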

The main goals are:

  • Elimination of costly pre-built global maps.
  • Real-time, scalable local map construction from sensor data.
  • Provision of unified semantic- and instance-level evaluation protocols.

2. Architectural Modules

HDMapNet comprises four key components:

| Module | Input Type | Function |
|---|---|---|
| Perspective-View Image Encoder $\phi_I$ | Camera images | Multi-scale PV features |
| Neural View Transformer $\phi_V$ | PV features | PV $\rightarrow$ BEV mapping |
| Pillar-based LiDAR Encoder $\phi_P$ | LiDAR point cloud | BEV pillar features |
| BEV Map Decoder $\phi_M$ | Fused BEV tensor | Vectorized map output |
  • Camera-only: $I \rightarrow \phi_I \rightarrow \phi_V \rightarrow \phi_M$
  • LiDAR-only: $P \rightarrow \phi_P \rightarrow \phi_M$
  • Fusion: Camera-derived BEV and LiDAR BEV features are concatenated before $\phi_M$, combining complementary appearance and geometry cues.

The BEV map decoder $\phi_M$ includes three output heads for semantic segmentation, instance embedding, and direction classification, as sketched below.
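
A minimal PyTorch sketch of the fusion path and the three-headed decoder (the channel widths, the two-layer conv backbone, and the default head sizes are our assumptions; the paper uses a fully convolutional BEV decoder):

```python
import torch
import torch.nn as nn

class BEVDecoder(nn.Module):
    """Sketch of phi_M: fuse BEV features, decode, emit three heads."""
    def __init__(self, in_ch, n_classes=4, embed_dim=16, n_dir_bins=36):
        super().__init__()
        self.backbone = nn.Sequential(          # stand-in for the FCN decoder
            nn.Conv2d(in_ch, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.seg_head = nn.Conv2d(128, n_classes, 1)    # semantic segmentation
        self.embed_head = nn.Conv2d(128, embed_dim, 1)  # instance embedding
        self.dir_head = nn.Conv2d(128, n_dir_bins, 1)   # direction classification

    def forward(self, bev_cam, bev_lidar=None):
        # Fusion mode: channel-wise concatenation of camera and LiDAR BEV features.
        x = bev_cam if bev_lidar is None else torch.cat([bev_cam, bev_lidar], dim=1)
        feats = self.backbone(x)
        return self.seg_head(feats), self.embed_head(feats), self.dir_head(feats)
```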

3. Neural View Transformation and Polyline Vectorization

The camera branch leverages a neural view transformer for perspective-to-BEV mapping:

  • For each PV image, $\phi_I$ extracts perspective-view features $F_{I_i}^{pv}$.
  • A multi-layer perceptron aggregates PV pixels per BEV cell (see the sketch after this list): $F_{I_i}^{c}[h,w] = \phi_{V_i}^{hw}(\{F_{I_i}^{pv}[u,v]\}_{u,v})$.
  • Camera extrinsics then warp the camera-frame top-view features into the shared ego BEV frame, yielding $F_{I_i}^{bev}$.
  • Multi-view BEV features are averaged: $F_{I}^{bev} = \frac{1}{N_m}\sum_{i=1}^{N_m} F_{I_i}^{bev}$.
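
A minimal sketch of the per-camera MLP mapping (feature-map shapes and channel count are assumptions; the extrinsics warp and multi-view averaging that follow are omitted for brevity):

```python
import torch
import torch.nn as nn

class NeuralViewTransformer(nn.Module):
    """Sketch of phi_V for one camera: an MLP relating every PV pixel to
    every BEV cell. Feature-map shapes and channel count are assumptions."""
    def __init__(self, pv_hw=(8, 22), bev_hw=(32, 88), ch=64):
        super().__init__()
        self.bev_hw, self.ch = bev_hw, ch
        # One learned weight per (BEV cell, PV pixel) pair, shared across channels.
        self.fc = nn.Linear(pv_hw[0] * pv_hw[1], bev_hw[0] * bev_hw[1])

    def forward(self, f_pv):                        # (B, ch, H_pv, W_pv)
        x = self.fc(f_pv.flatten(2))                # (B, ch, H_bev * W_bev)
        return x.view(-1, self.ch, *self.bev_hw)    # camera-frame BEV features
```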

Instance polylines are constructed by clustering the embedding map and applying greedy polyline tracing along the predicted direction bins (a code sketch follows below):

$$c_{t+1} = c_t + \Delta_{\text{step}} \cdot d(c_t), \quad d(c_t) \in \{\text{unit vectors in } N_d \text{ bins}\}$$

The map representation is a vectorized set of polylines $C = \{c_1, \ldots, c_L\}$ with $c_\ell \in \mathbb{R}^2$, rather than a dense occupancy grid.
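
A sketch of the greedy tracing step; `direction_at` and `inside` are hypothetical callables standing in for lookups into the predicted direction head and the clustered instance mask:

```python
import numpy as np

def trace_polyline(start, direction_at, inside, step=0.5, max_pts=200):
    """Greedy polyline tracing: c_{t+1} = c_t + step * d(c_t).

    start        : (2,) initial point inside one clustered instance.
    direction_at : callable p -> unit vector of the predicted direction bin at p.
    inside       : callable p -> True while p remains in the instance mask.
    """
    pts = [np.asarray(start, dtype=float)]
    for _ in range(max_pts):
        nxt = pts[-1] + step * direction_at(pts[-1])   # follow predicted bin
        if not inside(nxt):                            # stop at the instance edge
            break
        pts.append(nxt)
    return np.stack(pts)                               # (L, 2) vectorized polyline
```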

4. Loss Functions and Training Protocol

HDMapNet's total loss function combines semantic, instance, and direction losses:

$$L = L_{\text{seg}} + L_{\text{inst}} + L_{\text{dir}}$$

  • Semantic segmentation: Pixel-wise cross-entropy.
  • Instance embedding: Discriminative loss combining an intra-instance variance term and an inter-instance distance term:

$$L_{\text{inst}} = \alpha\, L_{\text{var}} + \beta\, L_{\text{dist}}, \quad \alpha = \beta = 1$$

  • Direction classification: Cross-entropy on direction classes, lane pixels only.
  • Optimization: Adam with learning rate $1 \times 10^{-3}$ and weight decay $1 \times 10^{-7}$; the learning rate decays by a factor of 0.1 every 10 epochs.
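
A sketch of the combined objective for a single sample (the discriminative-loss margins `delta_v`/`delta_d` and all tensor shapes are our assumptions, not values from the paper):

```python
import torch
import torch.nn.functional as F

def hdmapnet_loss(seg_logits, seg_gt, embed, inst_gt, dir_logits, dir_gt,
                  alpha=1.0, beta=1.0, delta_v=0.5, delta_d=3.0):
    """Combined loss L = L_seg + L_inst + L_dir for a single sample.

    seg_logits: (C, H, W); seg_gt: (H, W) long class ids (0 = background)
    embed:      (D, H, W); inst_gt: (H, W) long instance ids (0 = background)
    dir_logits: (N_d, H, W); dir_gt: (H, W) long direction-bin ids
    """
    # Semantic head: pixel-wise cross-entropy.
    l_seg = F.cross_entropy(seg_logits[None], seg_gt[None])

    # Instance head: discriminative loss. The variance term pulls embeddings
    # toward their instance mean; the distance term pushes means apart.
    ids = [int(k) for k in inst_gt.unique() if k != 0]
    means, l_var = [], seg_logits.new_zeros(())
    for k in ids:
        e = embed[:, inst_gt == k]                       # (D, N_k)
        mu = e.mean(dim=1)
        means.append(mu)
        l_var = l_var + F.relu((e - mu[:, None]).norm(dim=0) - delta_v).pow(2).mean()
    l_dist = seg_logits.new_zeros(())
    for i in range(len(means)):
        for j in range(i + 1, len(means)):
            l_dist = l_dist + F.relu(2 * delta_d - (means[i] - means[j]).norm()).pow(2)
    l_inst = alpha * (l_var / max(len(ids), 1)) + beta * l_dist

    # Direction head: cross-entropy on lane pixels only.
    lane = seg_gt > 0
    l_dir = F.cross_entropy(dir_logits[:, lane].T, dir_gt[lane]) if lane.any() else 0.0
    return l_seg + l_inst + l_dir
```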

5. Sensor Fusion and Performance

Three sensor integration modes are supported:

  • HDMapNet(Surr): Camera-only, adept at lane dividers and crosswalks.
  • HDMapNet(LiDAR): LiDAR-only, excels at geometric structure but less effective for painted lane markings.
  • HDMapNet(Fusion): Concatenation of camera and LiDAR BEV features before $\phi_M$.

Fusion yields significant improvements:

| Method | IoU (All, %) | CD (m) | mAP (All, %) |
|---|---|---|---|
| IPM(CB) | 32.4 | 0.839 | 19.7 |
| Lift-Splat-Shoot | 30.8 | 0.968 | 17.4 |
| VPN | 29.3 | 1.337 | 17.5 |
| HDMapNet(Surr) | 32.9 | 0.834 | 22.7 |
| HDMapNet(LiDAR) | 29.5 | 1.101 | 11.6 |
| HDMapNet(Fusion) | 44.5 | 0.639 | 30.6 |

Fusion achieves a 12.1-point absolute IoU gain and a 10.9-point mAP gain over the strongest prior camera-based baseline, IPM(CB) (Li et al., 2021).

6. Evaluation Metrics and Temporal Consistency

HDMapNet employs both Eulerian (dense, grid-based) and Lagrangian (curve-based) evaluation protocols; the IoU and Chamfer metrics are sketched in code after the list below:

  • Semantic IoU:

$$\mathrm{IoU} = \frac{|\hat{S} \cap S|}{|\hat{S} \cup S|}$$

  • Chamfer Distance (CD) for vectorized curves:

$$\mathrm{CD}(C^A, C^B) = \frac{1}{|C^A|} \sum_{x \in C^A} \min_{y \in C^B} \|x - y\| + \frac{1}{|C^B|} \sum_{y \in C^B} \min_{x \in C^A} \|y - x\|$$

  • Instance-level mAP: Average precision over recall thresholds, with true positives defined by Chamfer distance criteria.
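
A minimal sketch of the two core metrics under the conventions above (function names are our own):

```python
import numpy as np

def semantic_iou(pred, gt):
    """IoU between predicted and ground-truth BEV masks (boolean (H, W) arrays)."""
    union = np.logical_or(pred, gt).sum()
    return np.logical_and(pred, gt).sum() / union if union else 0.0

def chamfer_distance(c_a, c_b):
    """Symmetric Chamfer distance between two curves sampled as point sets.

    c_a: (N, 2) array, c_b: (M, 2) array of (x, y) points in metres.
    """
    d = np.linalg.norm(c_a[:, None, :] - c_b[None, :, :], axis=-1)  # (N, M) pairwise
    return d.min(axis=1).mean() + d.min(axis=0).mean()
```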

Temporal fusion via max-pooling BEV probabilities across ego poses supports locally consistent map accumulation, improving robustness to sensor variability and environmental changes.
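
A sketch of this accumulation step, assuming each frame's relative SE(2) pose to the current frame is given with translations already normalized to the $[-1, 1]$ grid-coordinate convention of `affine_grid` (the pose handling and sign conventions are our assumptions):

```python
import math
import torch
import torch.nn.functional as F

def temporal_max_pool(bev_probs, rel_poses):
    """Max-pool per-frame BEV semantic probabilities in the current ego frame.

    bev_probs : (T, C, H, W) tensor of per-frame class probabilities.
    rel_poses : list of T (tx, ty, yaw) tuples, each frame's pose relative
                to the current frame (translations in normalized grid units).
    """
    warped = []
    for prob, (tx, ty, yaw) in zip(bev_probs, rel_poses):
        c, s = math.cos(yaw), math.sin(yaw)
        # 2x3 affine mapping current-frame output coords to each frame's coords.
        theta = torch.tensor([[[c, -s, tx], [s, c, ty]]], dtype=prob.dtype)
        grid = F.affine_grid(theta, (1, *prob.shape), align_corners=False)
        warped.append(F.grid_sample(prob[None], grid, align_corners=False))
    return torch.cat(warped).amax(dim=0)  # (C, H, W) fused probabilities
```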

Key limitations include:

  • Heuristic vectorization: Polyline tracing is greedy; learned graph-generation may improve topology.
  • Simple fusion: Camera-LiDAR fusion is plain feature concatenation; uncertainty-aware fusion mechanisms could better exploit the sensors' complementary strengths.
  • Accuracy tradeoff: Online maps do not match offline map precision but offer scalability.

This suggests that future extensions should consider advanced fusion strategies, temporal sequence modeling, and expansion to richer semantic layers (e.g., curbs, signage). Related approaches extend HDMapNet's methodology in several directions: HDNET (Yang et al., 2020) applies input-level raster fusion and online map prediction to 3D object detection; HeightMapNet (Qiu et al., 2024) adds explicit height modeling and foreground-background masking for height-aware BEV learning; and GlobalMapNet (Shi et al., 2024) extends vectorized mapping to global online map construction.

HDMapNet defines the formal problem of online HD semantic map learning, establishes comprehensive evaluation standards, and delivers substantial performance gains over prior BEV and projection-based semantic mapping strategies (Li et al., 2021).
