HDMapNet: Online HD Semantic Mapping
- The paper presents an online HD semantic map construction framework that generates vectorized BEV maps from multi-sensor data, eliminating the need for offline maps.
- It employs a neural view transformer and pillar-based LiDAR encoder to convert perspective and LiDAR inputs into precise road map polylines.
- The fusion mode substantially outperforms camera-only and LiDAR-only variants, reaching 44.5% IoU and 30.6 mAP, and the paper establishes unified evaluation protocols for semantic map learning in autonomous driving.
HDMapNet is an online high-definition semantic map construction and evaluation framework designed to support autonomous driving through scalable, real-time generation of BEV (bird’s-eye-view) vectorized road maps from onboard sensor observations. Unlike traditional mapping pipelines reliant on extensive offline SLAM and manual annotation, HDMapNet enables dynamic inference of road semantics, supporting downstream tasks such as path prediction and planning. The framework represents semantic map elements as polylines in the BEV domain, employs unified evaluation protocols, and demonstrates robust performance improvements over previous projection-based approaches (Li et al., 2021).
1. Problem Definition and Objectives
High-definition semantic map learning is formulated as an online estimation problem:
- Inputs: Surround-view camera images I_1, …, I_Nm from the onboard cameras and/or a 3D LiDAR point cloud P.
- Outputs: A local HD semantic map M composed of vectorized map elements, where each element is a polyline in the ego-vehicle BEV frame.
The main goals are:
- Elimination of costly pre-built global maps.
- Real-time, scalable local map construction from sensor data.
- Provision of unified semantic- and instance-level evaluation protocols.
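To make the input/output formulation concrete, here is a minimal sketch of the vectorized map representation, assuming a simple dataclass with a category string and a point array (the field names and class labels are illustrative, not the paper's exact schema):

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class MapElement:
    category: str       # e.g. "lane_divider", "road_boundary", "crosswalk" (illustrative labels)
    points: np.ndarray  # (N, 2) ordered polyline vertices in the ego-vehicle BEV frame (metres)


# A local HD semantic map is simply a set of such vectorized elements.
local_map = [
    MapElement("lane_divider", np.array([[0.0, -1.5], [10.0, -1.5], [20.0, -1.4]])),
    MapElement("road_boundary", np.array([[0.0, 5.0], [20.0, 5.0]])),
]
print(len(local_map))  # 2
```

Representing elements as ordered vertex lists (rather than dense raster masks) is what allows downstream planners to consume the map directly.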
2. Architectural Modules
HDMapNet comprises four key components:
| Module | Input Type | Function |
|---|---|---|
| Perspective-View Image Encoder | Camera | Multi-scale PV features |
| Neural View Transformer | PV features | PV BEV mapping |
| Pillar-based LiDAR Encoder | LiDAR | BEV pillar features |
| BEV Map Decoder | Fused BEV tensor | Vectorized map output |
- Camera-only: BEV features come solely from the neural view transformer.
- LiDAR-only: BEV features come solely from the pillar-based LiDAR encoder.
- Fusion: Camera-derived BEV and LiDAR BEV features are concatenated before the BEV decoder, maximizing information content.
The BEV Decoder includes three output heads for semantic segmentation, instance embedding, and direction classification.
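The module composition above can be sketched end to end. This is a shape-level mock, not the paper's implementation: the branch functions return random tensors, and the grid size, channel counts, and number of direction bins are illustrative placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, C = 200, 400, 64   # BEV grid size and per-branch channel count (illustrative values)


def camera_branch(images):
    """Stand-in for the PV image encoder + neural view transformer."""
    return rng.standard_normal((C, H, W))


def lidar_branch(point_cloud):
    """Stand-in for the pillar-based LiDAR BEV encoder."""
    return rng.standard_normal((C, H, W))


def bev_decoder(bev):
    """Stand-in decoder emitting the three output heads."""
    seg = rng.standard_normal((4, H, W))         # semantic logits (3 map classes + background)
    embed = rng.standard_normal((16, H, W))      # per-pixel instance embedding
    direction = rng.standard_normal((36, H, W))  # discretized direction bins
    return seg, embed, direction


# Fusion mode: concatenate camera and LiDAR BEV features along channels before decoding.
fused = np.concatenate([camera_branch(None), lidar_branch(None)], axis=0)
seg, embed, direction = bev_decoder(fused)
print(fused.shape)  # (128, 200, 400)
```

Channel-wise concatenation doubles the decoder's input width but lets it learn which sensor to trust per BEV cell.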
3. Neural View Transformation and Polyline Vectorization
The camera branch leverages a neural view transformer for perspective-to-BEV mapping:
- For each PV image, the shared image encoder extracts a perspective-view feature map.
- A multi-layer perceptron models the relation between PV pixels and BEV cells, mapping each camera's PV features to top-down features in that camera's coordinate frame.
- Camera extrinsics then warp each camera's top-down features into the ego-vehicle BEV frame.
- The per-camera BEV features are averaged over all N_m cameras into a single BEV feature map.
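The steps above can be sketched numerically. This toy version assumes a single learned weight matrix relating every flattened PV pixel to every top-down cell, shared across channels, and it omits the extrinsic warp; all sizes are toy values, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(1)
Hp, Wp, C = 8, 16, 32  # PV feature-map size (downsampled) and channels; toy values
Hb, Wb = 10, 20        # per-camera top-down grid size; toy values

# Dense PV-pixel -> top-down-cell relation, one (Hb*Wb, Hp*Wp) matrix shared across channels.
W_vt = rng.standard_normal((Hb * Wb, Hp * Wp)) / np.sqrt(Hp * Wp)


def view_transform(f_pv):
    """(C, Hp, Wp) PV features -> (C, Hb, Wb) camera-frame top-down features."""
    flat = f_pv.reshape(C, -1)                 # (C, Hp*Wp)
    return (flat @ W_vt.T).reshape(C, Hb, Wb)  # every cell sees every PV pixel


# Per-camera features would next be warped to the ego frame with extrinsics
# (omitted here) and averaged over the cameras, six in this toy setup.
cams = [view_transform(rng.standard_normal((C, Hp, Wp))) for _ in range(6)]
f_bev = np.mean(cams, axis=0)
print(f_bev.shape)  # (32, 10, 20)
```

The key idea is that the PV-to-BEV correspondence is learned rather than fixed by a flat-ground homography, which is what separates this from IPM-style projection.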
Instance polylines are constructed by clustering the predicted embedding map into instances and then greedily tracing each instance into an ordered polyline guided by the predicted direction bins. The resulting map representation is a vectorized set of polylines rather than a dense occupancy grid.
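A simplified version of the tracing step can be written as a greedy nearest-neighbour walk over one instance's BEV pixels. This is a stand-in for the paper's direction-guided tracing (it orders by spatial proximity instead of predicted direction bins):

```python
import numpy as np


def trace_polyline(pixels):
    """Greedily order one instance's BEV pixels into a polyline.

    Simplified stand-in: repeatedly append the remaining pixel
    nearest to the current endpoint.
    """
    pts = list(map(tuple, pixels))
    path = [pts.pop(0)]  # start from an arbitrary pixel of the instance
    while pts:
        last = np.array(path[-1])
        dists = [np.linalg.norm(last - np.array(p)) for p in pts]
        path.append(pts.pop(int(np.argmin(dists))))
    return np.array(path)


pixels = np.array([[0, 0], [3, 0], [1, 0], [2, 0]])
print(trace_polyline(pixels).tolist())  # [[0, 0], [1, 0], [2, 0], [3, 0]]
```

The real pipeline additionally uses the direction head to disambiguate branches and choose the walking direction, which plain nearest-neighbour ordering cannot do at junctions.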
4. Loss Functions and Training Protocol
HDMapNet's total loss function combines semantic, instance, and direction losses:
- Semantic segmentation: Pixel-wise cross-entropy.
- Instance embedding: Discriminative loss combining a variance term (pulling each pixel embedding toward its instance mean) and a distance term (pushing different instance means apart).
- Direction classification: Cross-entropy on direction classes, lane pixels only.
- Optimization: Adam with weight decay; the learning rate decays by a factor of 0.1 every 10 epochs.
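The instance-embedding term can be made concrete with a numpy sketch of the standard hinged discriminative loss; the margins `delta_v` and `delta_d` here are assumed values, not taken from the paper:

```python
import numpy as np


def discriminative_loss(embed, labels, delta_v=0.5, delta_d=3.0):
    """Variance and distance terms of the hinged, squared discriminative loss.

    embed: (P, D) pixel embeddings; labels: (P,) instance ids (0 = background).
    """
    ids = np.unique(labels[labels > 0])
    means = {i: embed[labels == i].mean(axis=0) for i in ids}
    # Variance term: penalize pixels farther than delta_v from their instance mean.
    l_var = np.mean([
        np.mean(np.maximum(
            np.linalg.norm(embed[labels == i] - means[i], axis=1) - delta_v, 0.0) ** 2)
        for i in ids
    ])
    # Distance term: penalize instance means closer than 2 * delta_d to each other.
    pairs = [(a, b) for a in ids for b in ids if a != b]
    l_dist = np.mean([
        np.maximum(2 * delta_d - np.linalg.norm(means[a] - means[b]), 0.0) ** 2
        for a, b in pairs
    ]) if pairs else 0.0
    return l_var, l_dist


# Two tight, well-separated clusters incur zero loss under these margins.
embed = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
labels = np.array([1, 1, 2, 2])
l_var, l_dist = discriminative_loss(embed, labels)
print(l_var, l_dist)  # 0.0 0.0
```

In training this would be weighted and summed with the segmentation and direction cross-entropy terms to form the total loss.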
5. Sensor Fusion and Performance
Three sensor integration modes are supported:
- HDMapNet(Surr): Camera-only, adept at lane dividers and crosswalks.
- HDMapNet(LiDAR): LiDAR-only, excels at geometry but less effective for lane markings.
- HDMapNet(Fusion): Concatenation of camera and LiDAR BEV features before the BEV decoder.
Fusion yields significant improvements:
| Method | IoU (All %) | CD (m) | mAP (All %) |
|---|---|---|---|
| IPM(CB) | 32.4 | 0.839 | 19.7 |
| Lift-Splat-Shoot | 30.8 | 0.968 | 17.4 |
| VPN | 29.3 | 1.337 | 17.5 |
| HDMapNet(Surr) | 32.9 | 0.834 | 22.7 |
| HDMapNet(LiDAR) | 29.5 | 1.101 | 11.6 |
| HDMapNet(Fusion) | 44.5 | 0.639 | 30.6 |
Fusion achieves a 12.1-point absolute IoU gain and a 10.9-point mAP gain over the strongest camera-based baseline, IPM(CB) (Li et al., 2021).
6. Evaluation Metrics and Temporal Consistency
HDMapNet employs both Eulerian and Lagrangian evaluation protocols:
- Semantic IoU (Eulerian): IoU(M, M') = |M ∩ M'| / |M ∪ M'| between predicted and ground-truth rasterized maps.
- Chamfer Distance (CD, Lagrangian) for vectorized curves: the sum of both directed Chamfer distances, where each directed term averages, over the points of one curve, the Euclidean distance to the nearest point on the other curve.
- Instance-level mAP: Average precision over recall thresholds, with true positives defined by Chamfer distance criteria.
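The bidirectional Chamfer distance between two sampled curves is short enough to write out directly; this is the standard definition, computed here on dense pairwise distances:

```python
import numpy as np


def chamfer(c1, c2):
    """Bidirectional Chamfer distance between point sets c1 (N, 2) and c2 (M, 2)."""
    d = np.linalg.norm(c1[:, None, :] - c2[None, :, :], axis=-1)  # (N, M) pairwise distances
    return d.min(axis=1).mean() + d.min(axis=0).mean()            # both directed terms


pred = np.array([[0.0, 0.0], [1.0, 0.0]])
gt = np.array([[0.0, 0.5], [1.0, 0.5]])
print(chamfer(pred, gt))  # 1.0  (each point is 0.5 m from its nearest match, both directions)
```

For instance-level mAP, a predicted polyline counts as a true positive when its Chamfer distance to a ground-truth element falls below a threshold.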
Temporal fusion via max-pooling BEV probabilities across ego poses supports locally consistent map accumulation, improving robustness to sensor variability and environmental changes.
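The temporal accumulation step reduces to an element-wise maximum once past BEV probability maps have been warped into the current ego frame (the warping itself, which uses the relative ego poses, is omitted here):

```python
import numpy as np


def fuse_temporal(prob_maps):
    """Element-wise max over BEV probability maps already aligned to the current ego frame."""
    return np.maximum.reduce(prob_maps)


t0 = np.array([[0.2, 0.9], [0.1, 0.4]])
t1 = np.array([[0.7, 0.3], [0.0, 0.8]])
print(fuse_temporal([t0, t1]).tolist())  # [[0.7, 0.9], [0.1, 0.8]]
```

Max-pooling keeps the most confident observation of each cell across frames, so map elements seen clearly in any recent frame survive momentary occlusion.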
7. Limitations, Extensions, and Related Frameworks
Key limitations include:
- Heuristic vectorization: Polyline tracing is greedy; learned graph-generation may improve topology.
- Simple fusion: Camera-LiDAR fusion is concatenation; uncertainty-aware fusion mechanisms could further enhance complementarity.
- Accuracy tradeoff: Online maps do not match offline map precision but offer scalability.
Future extensions should consider advanced fusion strategies, temporal sequence modeling, and expansion to richer semantic layers (e.g., curbs, signage). Related approaches such as the input-level raster fusion and online map prediction in HDNET (Yang et al., 2020), explicit height modeling and foreground-background masking in HeightMapNet (Qiu et al., 2024), and global vector map construction with GlobalMapNet (Shi et al., 2024) extend HDMapNet's methodology to 3D object detection, height-aware BEV learning, and global online mapping, respectively.
HDMapNet defines the formal problem of online HD semantic map learning, establishes comprehensive evaluation standards, and delivers substantial performance gains over prior BEV and projection-based semantic mapping strategies (Li et al., 2021).