
HDMapNet: Online HD Semantic Mapping

Updated 30 January 2026
  • HDMapNet is an online semantic map learning framework that predicts lane boundaries, dividers, and crosswalks directly from camera and LiDAR inputs, eliminating the need for extensive manual annotation.
  • It employs a multi-branch architecture with perspective image encoding, LiDAR pillar processing, and a BEV decoder that outputs semantic segmentation, instance embeddings, and direction fields.
  • The fusion modality notably improves metrics such as IoU and mAP, offering scalable, robust mapping solutions suitable for dynamic urban driving environments.

HDMapNet is an online semantic map learning framework designed to dynamically construct high-definition (HD) vectorized road semantics from onboard sensor measurements in autonomous driving scenarios. Unlike traditional HD mapping pipelines reliant on extensive manual annotation and resource-intensive survey processes, HDMapNet predicts lane boundaries, dividers, and pedestrian crossings directly from multi-modal inputs, enabling scalable map generation in the vehicle’s local vicinity.

1. Formulation of HD Semantic Map Learning

HD semantic map learning targets the prediction of vectorized map elements M (lane boundaries, dividers, crosswalks) in the bird's-eye-view (BEV) frame from sensor observations consisting of multi-camera images I and/or LiDAR point clouds P. The mapping function is:

\hat{M} = \text{HDMapNet}(I, P)

HDMapNet decomposes into four main neural modules:

  • \varphi_I: Perspective-view image encoder
  • \varphi_V: Neural view transformer (perspective \rightarrow BEV)
  • \varphi_P: Pillar-based LiDAR encoder
  • \varphi_M: BEV decoder yielding semantic segmentation, instance embeddings, and direction fields

Supervised training employs losses on BEV semantics, instance clusters, and direction labels.
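The decomposition above can be sketched at the shape level. This is a minimal illustration with random stand-in modules: the feature dimensions, class/bin counts, and numpy implementation are assumptions for the sketch, not the authors' code.

```python
import numpy as np

H_BEV, W_BEV = 200, 200   # BEV grid size (typical, per the text)
C, E, N_DIR = 4, 16, 36   # classes (3 map classes + background), embedding dim,
                          # direction bins -- all illustrative

def phi_I(images):
    """Perspective-view image encoder (stand-in): per-camera feature maps F_I^pv."""
    n_cam = images.shape[0]
    return np.random.rand(n_cam, 128, 8, 32)          # downsampled perspective features

def phi_V(feats_pv):
    """Neural view transformer (stand-in): per-camera MLPs map perspective
    features to BEV; the N_m camera maps are then averaged (collapsed here)."""
    return np.random.rand(128, H_BEV, W_BEV)          # F_I^bev

def phi_P(points):
    """Pillar-based LiDAR encoder (stand-in): point cloud -> BEV features F_P^bev."""
    return np.random.rand(128, H_BEV, W_BEV)

def phi_M(f_bev):
    """BEV decoder (stand-in): three parallel output heads."""
    return {
        "semantic":  np.random.rand(C, H_BEV, W_BEV),
        "embedding": np.random.rand(E, H_BEV, W_BEV),
        "direction": np.random.rand(N_DIR, H_BEV, W_BEV),
    }

def hdmapnet(images, points):
    # Naive fusion by channel concatenation, as described later in the text.
    f_bev = np.concatenate([phi_V(phi_I(images)), phi_P(points)], axis=0)
    return phi_M(f_bev)
```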

2. Network Architecture and Modalities

HDMapNet utilizes a multi-branch input structure supporting camera-only, LiDAR-only, or camera-LiDAR fusion modalities, all converging into a unified BEV decoder. The architecture encompasses:

  • Camera branch: N_m surround-camera images processed via EfficientNet-B0 (pre-trained on ImageNet) into perspective-view features F_I^{pv}.
  • View transformation: for each camera, an MLP \varphi_{V_i} maps perspective-grid locations to a small camera-coordinate grid, then warps features into the ego-vehicle BEV frame; features are averaged across the N_m cameras to give F_I^{bev}.
  • LiDAR branch: 32-beam LiDAR input, dynamically voxelized into pillars P_j, each aggregated with a PointNet to yield pillar features f_{\text{pillar}_j}, then encoded by a dedicated 2D CNN into BEV feature maps F_P^{bev}.
  • BEV decoder: fully convolutional (ResNet-style), with three simultaneous output heads: semantic segmentation (per-pixel softmax), instance embedding (in \mathbb{R}^E), and direction classification (discretized into N_d bins).

Intermediate resolutions include H_{pv}\times W_{pv} for perspective views (typically 64\times256) and H_{bev}\times W_{bev} for BEV grids (typically 200\times200).
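The learned view transformation can be illustrated as a fully connected map over flattened spatial locations: every cell of the small camera-frame BEV grid is a learned combination of all perspective-view cells, applied identically to each feature channel. This is a sketch only; the weight matrix here is random, whereas \varphi_{V_i} learns it end-to-end, and the subsequent extrinsic warp into the ego frame is omitted. All sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
C, H_pv, W_pv = 64, 8, 32       # channels / perspective feature-map size (illustrative)
H_c, W_c = 20, 20               # small per-camera BEV grid (illustrative)

# Learned view-transformer weights: (perspective cells) -> (BEV cells).
# Random here; trained end-to-end in the actual model.
W_vt = rng.normal(size=(H_pv * W_pv, H_c * W_c))

def view_transform(f_pv):
    """(C, H_pv, W_pv) perspective features -> (C, H_c, W_c) camera-frame BEV."""
    flat = f_pv.reshape(C, H_pv * W_pv)   # flatten spatial grid per channel
    return (flat @ W_vt).reshape(C, H_c, W_c)

f_bev = view_transform(rng.normal(size=(C, H_pv, W_pv)))
```

The per-camera outputs would then be warped into the ego-vehicle frame using the camera extrinsics and averaged across the N_m cameras.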

3. Training Objectives

HDMapNet employs a composite training loss comprising:

  • Semantic segmentation loss (cross-entropy):

L_{\text{sem}} = -\sum_{x\in\Omega}\sum_{c=1}^C y_c(x)\log p_c(x)

  • Discriminative instance embedding loss [De Brabandere et al., 2017], with cluster variance and separation terms and margin parameters \delta_v, \delta_d:

L_{\text{var}} = \frac{1}{C} \sum_{c=1}^C \frac{1}{N_c}\sum_{j=1}^{N_c}\left[\|f_j^{\text{inst}}-\mu_c\|-\delta_v\right]_+^2

L_{\text{dist}} = \frac{1}{C(C-1)}\sum_{c_A\neq c_B}\left[2\delta_d-\|\mu_{c_A}-\mu_{c_B}\|\right]_+^2

L_{\text{inst}} = \alpha L_{\text{var}} + \beta L_{\text{dist}}

  • Direction classification loss: per-pixel softmax cross-entropy over N_d direction bins, applied only at lane pixels.

Full training objective:

L = L_{\text{sem}} + L_{\text{inst}} + L_{\text{dir}}

Recommended hyperparameters: \alpha=1, \beta=1, \delta_v=0.5, \delta_d=3.0.
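The variance and distance terms translate directly from the formulas above. A numpy sketch, operating on a flattened BEV grid; the function name and the convention that instance id 0 marks background are assumptions:

```python
import numpy as np

def instance_loss(emb, inst_ids, delta_v=0.5, delta_d=3.0, alpha=1.0, beta=1.0):
    """Discriminative instance embedding loss (L_var + L_dist).

    emb:      (N, E) per-pixel embeddings f_j^inst
    inst_ids: (N,) integer instance label per pixel (0 = background, ignored)
    """
    ids = [i for i in np.unique(inst_ids) if i != 0]
    C = len(ids)
    mus, l_var = [], 0.0
    for i in ids:
        f = emb[inst_ids == i]                      # pixels of instance i
        mu = f.mean(axis=0)                         # cluster mean mu_c
        mus.append(mu)
        # Pull term: hinge on distance to the cluster mean, margin delta_v.
        hinge = np.maximum(np.linalg.norm(f - mu, axis=1) - delta_v, 0.0)
        l_var += np.mean(hinge ** 2)
    l_var /= C
    l_dist = 0.0
    if C > 1:
        # Push term: hinge on distance between cluster means, margin 2*delta_d.
        for a in range(C):
            for b in range(C):
                if a != b:
                    hinge = max(2 * delta_d - np.linalg.norm(mus[a] - mus[b]), 0.0)
                    l_dist += hinge ** 2
        l_dist /= C * (C - 1)
    return alpha * l_var + beta * l_dist
```

Two tight, well-separated clusters incur zero loss; clusters closer than 2\delta_d are pushed apart by the distance term.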

4. Output Map Representation and Vectorization Workflow

The BEV decoder produces three outputs for each pixel x:

  • S(x): semantic class probabilities
  • E(x)\in\mathbb{R}^E: instance embedding
  • D(x)\in\Delta^{N_d}: direction distribution over N_d bins

Inference post-processing includes:

  1. Thresholding S(x) to select lane-mask pixels.
  2. Clustering per-pixel embeddings E(x) via DBSCAN to form instance sets \{S_c\}.
  3. Applying non-maximum suppression to instance confidences.
  4. For each instance, tracing a polyline by selecting a seed pixel and recursively following the predicted direction D(x).
  5. Simplifying polylines (e.g., Ramer–Douglas–Peucker) into ordered vectors of 2D meter-space coordinates.
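Steps 1–2 can be sketched as follows. To keep the example self-contained, a greedy distance-based grouping stands in for DBSCAN; the function name and thresholds are illustrative.

```python
import numpy as np

def instance_pixels(sem_prob, emb, thresh=0.5, eps=0.5):
    """Step 1: threshold semantics; step 2: group lane pixels by embedding.

    sem_prob: (H, W) lane-class probability S(x)
    emb:      (E, H, W) per-pixel embeddings E(x)
    Returns pixel coordinates (ys, xs) and an instance label per pixel.
    """
    ys, xs = np.nonzero(sem_prob > thresh)          # step 1: lane mask
    feats = emb[:, ys, xs].T                        # (N, E) masked embeddings
    labels = -np.ones(len(feats), dtype=int)
    centers = []
    for i, f in enumerate(feats):                   # step 2: greedy grouping
        for c, mu in enumerate(centers):            # (DBSCAN stand-in)
            if np.linalg.norm(f - mu) < eps:
                labels[i] = c
                break
        else:
            centers.append(f)                       # open a new instance
            labels[i] = len(centers) - 1
    return ys, xs, labels
```

Each resulting label group would then go through NMS, direction-guided polyline tracing, and Ramer–Douglas–Peucker simplification.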

5. Evaluation Protocols and Metrics

Semantic-Level Metrics

  • Intersection-over-Union (IoU) over BEV grid:

\text{IoU}(D_1, D_2) = \frac{|D_1 \cap D_2|}{|D_1 \cup D_2|}

  • Chamfer Distance (CD) for predicted/ground-truth curve sets S_1, S_2:

\text{CD}_{\text{dir}}(S_1, S_2) = \frac{1}{|S_1|}\sum_{x\in S_1}\min_{y\in S_2}\|x-y\|_2

\text{CD}(S_1, S_2) = \text{CD}_{\text{dir}}(S_1, S_2) + \text{CD}_{\text{dir}}(S_2, S_1)
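The two Chamfer formulas translate directly to code. A small numpy sketch, suitable only for modest point counts since it materializes the full pairwise distance matrix:

```python
import numpy as np

def chamfer_dir(S1, S2):
    """Directed Chamfer distance: mean over S1 of nearest-neighbour L2 distance in S2."""
    # (|S1|, |S2|) pairwise distances via broadcasting.
    d = np.linalg.norm(S1[:, None, :] - S2[None, :, :], axis=-1)
    return d.min(axis=1).mean()

def chamfer(S1, S2):
    """Symmetric Chamfer distance between two point sets (curves sampled as points)."""
    return chamfer_dir(S1, S2) + chamfer_dir(S2, S1)
```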

Instance-Level Metrics

Instance polylines are treated as "objects" for detection-style evaluation. Average precision (AP) is computed at a Chamfer-distance matching threshold \tau as the mean precision over ten recall levels between 0.1 and 1.0; mAP averages AP over the thresholds. Typical reporting: AP at \tau = 0.2 m, 0.5 m, and 1.0 m, plus their mean (mAP).
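The AP computation can be sketched as follows, assuming the Chamfer-distance matching of predictions to ground-truth instances at threshold \tau has already been performed; the function name and input format are illustrative.

```python
import numpy as np

def ap_at_threshold(scores, matched, n_gt):
    """AP as mean precision over ten recall levels 0.1..1.0.

    scores:  (N,) per-prediction confidences
    matched: (N,) bool, True if the prediction's Chamfer distance to some
             ground-truth instance falls below the threshold tau
    n_gt:    number of ground-truth instances
    """
    order = np.argsort(-scores)                     # rank by confidence
    tp = np.cumsum(matched[order])
    precision = tp / (np.arange(len(scores)) + 1)
    recall = tp / n_gt
    ap = 0.0
    for r in np.linspace(0.1, 1.0, 10):
        mask = recall >= r                          # interpolated precision
        ap += precision[mask].max() if mask.any() else 0.0
    return ap / 10
```

mAP is then the mean of this quantity over the three \tau values.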

6. Experimental Setup and Results

Dataset

  • nuScenes [Caesar et al., 2020]: 1000 urban driving scenes, each annotated with lane boundaries, dividers, and pedestrian crossings in 30 m × 30 m BEV crops around keyframes. Inputs: 6 surround cameras and a 32-beam LiDAR. Splits follow the standard nuScenes protocol.

Implementation

  • Image branch: EfficientNet-B0 (K=128 channels)
  • LiDAR branch: PointPillars (64-dim pillar features), followed by a pillar CNN to upsample/concatenate to K=128
  • BEV decoder: ResNet-style FCN with three parallel heads
  • Training: Adam optimizer, learning rate 1\times10^{-3}, batch size 16, 30 epochs, \times0.1 lr decay every 10 epochs
  • BEV grid: 200\times200 at 0.3 m/pixel (60 m × 60 m)
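The stated schedule (base learning rate 1e-3, decayed by 0.1 every 10 epochs over 30 epochs) corresponds to a simple step decay; a sketch with a hypothetical helper name:

```python
def lr_at_epoch(epoch, base_lr=1e-3, decay=0.1, step=10):
    """Step learning-rate schedule: multiply by `decay` every `step` epochs."""
    return base_lr * decay ** (epoch // step)
```

So epochs 0–9 train at 1e-3, epochs 10–19 at 1e-4, and epochs 20–29 at 1e-5.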

Quantitative Results

Variant            | Divider IoU | Ped-Crossing IoU | Boundary IoU | All-Classes IoU | mAP ({0.2, 0.5, 1.0} m)
HDMapNet (Surr)    | 40.6%       | 18.7%            | 39.5%        | 32.9%           | 22.7%
VPN                | 36.5%       | 15.8%            | 35.6%        | 29.3%           | 17.5%
Lift-Splat         | 38.3%       | 14.9%            | 39.3%        | 30.8%           | 17.4%
IPM (CB)           | 38.6%       | 19.3%            | 39.3%        | 32.4%           | 19.7%
HDMapNet (Fusion)  | 46.1%       | 31.4%            | 56.0%        | 44.5%           | 30.6%

The fusion modality yields large relative gains in semantic IoU over the camera-only variant (e.g., all-classes IoU 32.9% → 44.5%) and roughly a 55% relative mAP improvement over IPM (19.7% → 30.6%).

Qualitative observations: HDMapNet produces visually clean, distortion-free BEV vector maps that are robust to nighttime and rain and temporally consistent over 100 m local regions.

7. Modalities, Advantages, Limitations, and Implications

Modality complementarity: cameras better discriminate color-based semantics (dividers, crosswalks), while LiDAR excels at detecting geometric lane boundaries; fusion combines the strengths of both.

Advantages:

  • Enables purely sensor-driven, online local mapping without reliance on global SLAM or manual annotation
  • Direct output of vectorized elements is immediately utilizable in motion planning systems

Limitations and outlook:

  • Absolute accuracy lags behind hand-annotated HD maps, though the scalability trade-off is advantageous
  • Current fusion employs naive feature concatenation; more sophisticated strategies (attention mechanisms, gating, etc.) may yield further improvements
  • Semantic coverage is restricted to three classes; generalization to additional road infrastructure (traffic signs, poles, curbs) is pending
  • Highly dynamic urban environments pose challenges and motivate continual-learning approaches

A plausible implication is that HDMapNet serves as a foundation for scalable, sensor-driven semantic mapping, facilitating map generation without manual intervention and with extensible architecture for future expansion of both input modalities and map element classes (Li et al., 2021).
