VectorNet: GNN for Trajectory Forecasting
- The paper introduces VectorNet, a hierarchical GNN that leverages vectorized HD-map and trajectory representations for precise behavior prediction in autonomous vehicles.
- It employs a two-level hierarchical structure that combines local polyline encoding with global self-attention to effectively capture both geometric details and long-range interactions.
- Experimental results show that VectorNet matches or outperforms raster-based ConvNet baselines while using 70% fewer parameters and an order of magnitude fewer FLOPs, on both proprietary and public (Argoverse) datasets.
VectorNet is a hierarchical graph neural network (GNN) architecture for behavior prediction in dynamic, multi-agent systems, principally designed for autonomous vehicle trajectory forecasting. Unlike conventional rasterized image-based approaches that encode agent histories and map context as images and process them with convolutional neural networks (ConvNets), VectorNet leverages a vectorized polyline-based representation. This design enables precise modeling of high-definition (HD) maps and agent dynamics while efficiently capturing both local geometric and global interaction context. VectorNet was introduced by Gao et al. in 2020 and achieves state-of-the-art performance and efficiency in trajectory prediction tasks on both proprietary and public datasets (Gao et al., 2020).
1. Vectorized Input Representation
VectorNet directly operates on vectorized HD-map entities and agent trajectories, eschewing lossy image rasterization and associated pre-processing steps. The two major components of the input are:
- Map Features: HD maps comprise entities such as lanes, lane boundaries, crosswalks, and traffic signs, described as polylines formed by sequences of spatially sampled control points.
- Agent Trajectories: Historical agent motion footprints are sampled uniformly in time and encoded as time-indexed polylines.
Each directed line segment, termed a “vector”, is treated as a node in the graph. The raw feature of the $i$-th vector belonging to polyline $P_j$ is

$$v_i = [d_i^s,\ d_i^e,\ a_i,\ j],$$

where $d_i^s$ and $d_i^e$ are the start and end point coordinates, $a_i$ encodes type and attribute information (e.g., lane identifier, timestamp), and $j$ marks the polyline group ID. All coordinates are recentered (and can be rotated) so that the target agent’s last state is at the origin, ensuring translation invariance.
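The vectorization step above can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's implementation; 2-D points, a single scalar attribute per vector, and the function name `polyline_to_vectors` are assumptions for clarity:

```python
import numpy as np

def polyline_to_vectors(points, attr, polyline_id):
    """Build VectorNet-style node features v_i = [d_i^s, d_i^e, a_i, j].

    points: (N, 2) ordered control points of one polyline
    attr: scalar attribute (illustrative stand-in for type/timestamp features)
    polyline_id: integer group id j
    Returns an (N-1, 6) array, one row per directed segment (vector).
    """
    points = np.asarray(points, dtype=float)
    starts, ends = points[:-1], points[1:]           # consecutive segments
    n = len(starts)
    a = np.full((n, 1), attr, dtype=float)           # attribute column a_i
    j = np.full((n, 1), polyline_id, dtype=float)    # group-id column j
    return np.hstack([starts, ends, a, j])

# Recenter on the target agent's last observed position (translation invariance).
lane = np.array([[10.0, 5.0], [12.0, 5.5], [14.0, 6.0]])
target_last = np.array([10.0, 5.0])
vectors = polyline_to_vectors(lane - target_last, attr=1.0, polyline_id=0)
```

After recentering, the first vector starts at the origin, matching the agent-centric frame described above.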
2. Hierarchical Polyline Subgraph Encoding
The architecture introduces a two-level hierarchical GNN to respect both polyline topology and global context:
- Local Polyline Subgraphs: Each polyline induces a fully connected subgraph among its member vectors. The initial node features are embedded using an MLP encoder $g_{\mathrm{enc}}(\cdot)$.
Intra-polyline message passing is performed over $L_p$ layers, where each node exchanges messages with all other nodes in its polyline using a relation function and element-wise max pooling:

$$v_i^{(l+1)} = \varphi_{\mathrm{rel}}\left(g_{\mathrm{enc}}\big(v_i^{(l)}\big),\ \varphi_{\mathrm{agg}}\big(\{g_{\mathrm{enc}}(v_{i'}^{(l)})\}\big)\right),$$

where $\varphi_{\mathrm{agg}}$ is element-wise max pooling and $\varphi_{\mathrm{rel}}$ is concatenation. A single polyline embedding is then computed by pooling the final node states:

$$p = \varphi_{\mathrm{agg}}\left(\{v_i^{(L_p)}\}\right).$$
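One subgraph layer can be sketched as follows. This is a simplified single-layer NumPy sketch with illustrative shapes and random placeholder weights; the real encoder is a trained MLP with layer normalization:

```python
import numpy as np

rng = np.random.default_rng(0)

def subgraph_layer(nodes, W, b):
    """One polyline-subgraph layer on (N, d_in) vector features.

    g_enc: linear + ReLU (stand-in for the paper's MLP encoder)
    phi_agg: element-wise max pooling over the polyline's nodes
    phi_rel: concatenation of each node with the pooled context
    Returns (N, 2*d_out).
    """
    enc = np.maximum(nodes @ W + b, 0.0)             # g_enc
    pooled = enc.max(axis=0)                         # phi_agg
    context = np.broadcast_to(pooled, enc.shape)     # same context for every node
    return np.concatenate([enc, context], axis=1)    # phi_rel

def polyline_embedding(nodes):
    return nodes.max(axis=0)                         # final pooling -> p

x = rng.normal(size=(4, 6))                          # 4 vectors, 6 raw features
W, b = rng.normal(size=(6, 16)), np.zeros(16)
h = subgraph_layer(x, W, b)
p = polyline_embedding(h)
```

Stacking several such layers deepens the intra-polyline message passing before the final pooling produces the polyline embedding $p$.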
- Global Interaction Graph: The polyline-level embeddings for all map and agent polylines are aggregated in a fully-connected global interaction graph. This stage employs multi-head self-attention layers, following the transformer paradigm. The updated polyline embeddings capture high-order and long-range contextual dependencies.
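The global interaction stage reduces to standard scaled dot-product self-attention over polyline embeddings, i.e., $\mathrm{GNN}(P) = \mathrm{softmax}(P_Q P_K^{\top}) P_V$. A minimal single-head NumPy sketch (random placeholder weights; the paper's version is learned and may use multiple heads and layers):

```python
import numpy as np

rng = np.random.default_rng(1)

def self_attention(P, Wq, Wk, Wv):
    """Single-head self-attention over polyline embeddings P (n, d)."""
    Q, K, V = P @ Wq, P @ Wk, P @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[1])           # scaled dot products
    scores -= scores.max(axis=1, keepdims=True)      # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=1, keepdims=True)                # softmax rows sum to 1
    return A @ V, A

d = 32
P = rng.normal(size=(5, d))                          # 5 polyline embeddings
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out, attn = self_attention(P, Wq, Wk, Wv)
```

Because the graph is fully connected, every polyline (agent or map element) can attend to every other one, which is what supplies the long-range context.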
3. Auxiliary Masked-Node Recovery Task
To enhance the network’s ability to learn robust and informative context representations, VectorNet introduces a self-supervised auxiliary task:
- During training, a random subset of polyline embeddings is masked and replaced by learnable ID embeddings.
- The network is trained to recover (predict) the original embeddings of the masked polylines from their contextual neighbors after the global interaction stage: $\hat{p}_i = \varphi_{\mathrm{node}}\big(p_i^{(L_t)}\big)$, where $p_i^{(L_t)}$ is the masked polyline's post-attention embedding.
- The loss is computed as a Huber or $\ell_2$ distance between reconstructed and ground-truth embeddings: $L_{\mathrm{node}} = \ell_{\mathrm{Huber}}\big(\hat{p}_i,\ p_i^{\mathrm{gt}}\big)$.
This auxiliary task regularizes feature learning and encourages the exploitation of inter-polyline context.
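The mechanics of the masking task can be sketched as below. This is an illustrative NumPy setup, not the trained pipeline: the zero mask token, the Huber threshold, and the noisy "recovered" embeddings are stand-ins (in practice the mask token is learnable and the recovery comes from the global attention network):

```python
import numpy as np

rng = np.random.default_rng(2)

def huber(pred, target, delta=1.0):
    """Elementwise Huber loss, averaged: quadratic near 0, linear beyond delta."""
    err = np.abs(pred - target)
    quad = np.minimum(err, delta)
    return np.mean(0.5 * quad**2 + delta * (err - quad))

P = rng.normal(size=(6, 8))                  # 6 polyline embeddings
mask_token = np.zeros(8)                     # learnable ID embedding in practice
masked_idx = rng.choice(6, size=2, replace=False)
P_in = P.copy()
P_in[masked_idx] = mask_token                # hide the selected polylines

# Stand-in for the network's output after global interaction:
recovered = P_in + 0.1 * rng.normal(size=P.shape)
loss = huber(recovered[masked_idx], P[masked_idx])
```

Only the masked positions contribute to the auxiliary loss, which forces the model to infer a polyline's content from its neighbors.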
4. Trajectory Decoding and Training Objective
The decoded output for each target agent is its predicted future trajectory:
- Trajectory Decoder: Using the final global-graph embedding of the target agent, an MLP predicts the future waypoint offsets in the agent-centric frame: $\{\hat{y}_t\}_{t=1}^{T} = \mathrm{MLP}\big(p_{\mathrm{target}}^{(L_t)}\big)$.
- Loss Functions: The primary supervised loss is either a negative log-likelihood for a Gaussian prediction or the average displacement error ($\ell_2$ loss): $L_{\mathrm{traj}} = \frac{1}{T}\sum_{t=1}^{T} \big\lVert \hat{y}_t - y_t \big\rVert_2$.
- Combined Objective: The total training loss balances the trajectory loss and the mask-reconstruction loss, typically with $\alpha = 1.0$: $L = L_{\mathrm{traj}} + \alpha\, L_{\mathrm{node}}$.
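The decoder and combined objective above can be sketched as follows. Hidden sizes and weights here are random placeholders; the $\alpha = 1.0$ weight follows the text, and the auxiliary-loss value is a stand-in:

```python
import numpy as np

rng = np.random.default_rng(3)

def decode_trajectory(p, W1, b1, W2, b2, T):
    """Two-layer MLP mapping a target embedding to T (x, y) waypoint offsets."""
    h = np.maximum(p @ W1 + b1, 0.0)         # hidden layer with ReLU
    return (h @ W2 + b2).reshape(T, 2)       # T waypoints in the agent frame

def ade_loss(pred, gt):
    """Average displacement error: mean L2 distance per timestep."""
    return np.mean(np.linalg.norm(pred - gt, axis=1))

d, hdim, T = 32, 64, 30
p = rng.normal(size=d)                       # target agent's global embedding
W1, b1 = rng.normal(size=(d, hdim)), np.zeros(hdim)
W2, b2 = rng.normal(size=(hdim, 2 * T)), np.zeros(2 * T)

pred = decode_trajectory(p, W1, b1, W2, b2, T)
gt = rng.normal(size=(T, 2))
L_traj = ade_loss(pred, gt)
L_node = 0.5                                  # stand-in auxiliary loss value
alpha = 1.0                                   # weight from the paper
L_total = L_traj + alpha * L_node
```

Because both losses are scalars, the combined objective is trained end-to-end with a single backward pass in the real implementation.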
5. Empirical Results and Computational Efficiency
VectorNet was empirically evaluated on both an internal proprietary dataset (over 2.2 million training and 0.55 million test samples) and the Argoverse public benchmark (333,000 sequences). Metrics include average displacement error (ADE) and displacement error at 1, 2, and 3 seconds.
Quantitative Comparison (DE@ts = displacement error at t seconds):

In-house dataset:

| Model | DE@1s | DE@2s | DE@3s | ADE |
|---|---|---|---|---|
| ResNet-18 | 0.47 | 0.71 | 1.00 | 0.63 |
| VectorNet (no mask) | 0.55 | 0.78 | 1.05 | 0.70 |
| VectorNet (+ mask task) | 0.53 | 0.74 | 1.00 | 0.66 |

Argoverse benchmark:

| Model | DE@1s | DE@2s | DE@3s | ADE |
|---|---|---|---|---|
| ResNet-18 | 1.05 | 2.48 | 4.49 | 1.96 |
| VectorNet (no mask) | 0.94 | 2.14 | 3.84 | 1.72 |
| VectorNet (+ mask task) | 0.92 | 2.06 | 3.67 | 1.66 |
Efficiency (n denotes the number of target agents per scene):

| Model | FLOPs | #Params | DE@3s (In-House) | DE@3s (Argoverse) |
|---|---|---|---|---|
| ResNet-18 | 10.56 G | 246 K | 1.00 | 4.49 |
| VectorNet w/o mask | 0.041 G × n | 72 K | 1.05 | 3.84 |
| VectorNet w/ mask | 0.041 G × n | 72 K | 1.00 | 3.67 |
VectorNet uses roughly 70% fewer parameters than the ResNet-18 baseline and, for typical target-agent counts, reduces floating-point operations by an order of magnitude or more, while improving Argoverse DE@3s by about 18% relative to the ConvNet approach (Gao et al., 2020).
6. Architectural Properties and Practical Considerations
- Avoidance of Lossy Rasterization: Direct vector input retains geometric and semantic detail, sidestepping challenges of color-coding, discretization, and information loss inherent to rasterization.
- Hierarchical Structure: Fine-grained local polyline encoding precedes a global interaction stage, supporting precise local shape modeling and flexible high-order context aggregation.
- Lightweight Computation: Complexity scales with the number of vectors and polylines rather than with a ConvNet's input resolution and channel count, allowing efficient scaling to large maps and agent sets.
- Flexibility: The architecture admits variations in intra-polyline and inter-polyline aggregation mechanisms, supporting MLP, GRU, self-attention, and other GNN blocks.
Limitations include the requirement for vectorized HD-maps (standard in many autonomous vehicle stacks) and per-target recentering, which complicates batched multi-agent inference. The original implementation decodes only single-modal future trajectories; extension to multi-modal decoders is structurally straightforward but left for future work.
7. Synthesis and Future Directions
VectorNet provides a principled alternative to raster-based encoding for autonomous vehicle behavior prediction by unifying geometric fidelity, context modeling, and computational efficiency through a hierarchical graph structure. Its approach addresses both scalability and expressivity challenges in high-definition map and agent context reasoning.
Further research directions include integration of multi-modal output distributions (anchors, CVAE, generative flows) for capturing uncertainty, optimizing shared centering for efficient joint agent inference, and broader evaluation across diverse urban driving scenarios (Gao et al., 2020).