VectorNet: GNN for Trajectory Forecasting
- The paper introduces VectorNet, a hierarchical GNN that leverages vectorized HD-map and trajectory representations for precise behavior prediction in autonomous vehicles.
- It employs a two-level hierarchical structure that combines local polyline encoding with global self-attention to effectively capture both geometric details and long-range interactions.
- Experimental results show that VectorNet matches or outperforms raster-based ConvNet baselines while using 70% fewer parameters and an order of magnitude fewer FLOPs, on both proprietary and public (Argoverse) datasets.
VectorNet is a hierarchical graph neural network (GNN) architecture for behavior prediction in dynamic, multi-agent systems, principally designed for autonomous vehicle trajectory forecasting. Unlike conventional rasterized image-based approaches that encode agent histories and map context as images and process them with convolutional neural networks (ConvNets), VectorNet leverages a vectorized polyline-based representation. This design enables precise modeling of high-definition (HD) maps and agent dynamics while efficiently capturing both local geometric and global interaction context. VectorNet was introduced by Gao et al. in 2020 and achieves state-of-the-art performance and efficiency in trajectory prediction tasks on both proprietary and public datasets (Gao et al., 2020).
1. Vectorized Input Representation
VectorNet directly operates on vectorized HD-map entities and agent trajectories, eschewing lossy image rasterization and associated pre-processing steps. The two major components of the input are:
- Map Features: HD maps comprise entities such as lanes, lane boundaries, crosswalks, and traffic signs, described as polylines formed by sequences of spatially sampled control points.
- Agent Trajectories: Historical agent motion footprints are sampled uniformly in time and encoded as time-indexed polylines.
Each directed line segment, termed a “vector”, is treated as a node in the graph. The raw feature of the $i$-th vector belonging to polyline $P_j$ is

$$v_i = [d_i^s,\ d_i^e,\ a_i,\ j],$$

where $d_i^s$ and $d_i^e$ are the start and end point coordinates, $a_i$ encodes type and attribute information (e.g., lane identifier, timestamp), and $j$ marks the polyline group ID. All coordinates are recentered (and can be rotated) so that the target agent’s last state is at the origin, ensuring translation invariance.
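The vectorization step above can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's implementation; 2-D points, a single scalar attribute per vector, and the function name `polyline_to_vectors` are assumptions for clarity:

```python
import numpy as np

def polyline_to_vectors(points, attr, polyline_id):
    """Build VectorNet-style node features v_i = [d_i^s, d_i^e, a_i, j].

    points: (N, 2) ordered control points of one polyline
    attr: scalar attribute (illustrative stand-in for type/timestamp features)
    polyline_id: integer group id j
    Returns an (N-1, 6) array, one row per directed segment (vector).
    """
    points = np.asarray(points, dtype=float)
    starts, ends = points[:-1], points[1:]           # consecutive segments
    n = len(starts)
    a = np.full((n, 1), attr, dtype=float)           # attribute column a_i
    j = np.full((n, 1), polyline_id, dtype=float)    # group-id column j
    return np.hstack([starts, ends, a, j])

# Recenter on the target agent's last observed position (translation invariance).
lane = np.array([[10.0, 5.0], [12.0, 5.5], [14.0, 6.0]])
target_last = np.array([10.0, 5.0])
vectors = polyline_to_vectors(lane - target_last, attr=1.0, polyline_id=0)
```

After recentering, the first vector starts at the origin, matching the agent-centric frame described above.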
2. Hierarchical Polyline Subgraph Encoding
The architecture introduces a two-level hierarchical GNN to respect both polyline topology and global context:
- Local Polyline Subgraphs: Each polyline induces a fully connected subgraph among its member vectors. The initial node features are embedded using an MLP encoder $g_{\mathrm{enc}}(\cdot)$.
Intra-polyline message passing is performed over $L_p$ layers, where each node exchanges messages with all other nodes in its polyline using a relation function and element-wise max pooling:

$$v_i^{(l+1)} = \varphi_{\mathrm{rel}}\left(g_{\mathrm{enc}}\big(v_i^{(l)}\big),\ \varphi_{\mathrm{agg}}\big(\{g_{\mathrm{enc}}(v_{i'}^{(l)})\}\big)\right),$$

where $\varphi_{\mathrm{agg}}$ is element-wise max pooling and $\varphi_{\mathrm{rel}}$ is concatenation. A single polyline embedding is then computed by pooling the final node states:

$$p = \varphi_{\mathrm{agg}}\left(\{v_i^{(L_p)}\}\right).$$
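One subgraph layer can be sketched as follows. This is a simplified single-layer NumPy sketch with illustrative shapes and random placeholder weights; the real encoder is a trained MLP with layer normalization:

```python
import numpy as np

rng = np.random.default_rng(0)

def subgraph_layer(nodes, W, b):
    """One polyline-subgraph layer on (N, d_in) vector features.

    g_enc: linear + ReLU (stand-in for the paper's MLP encoder)
    phi_agg: element-wise max pooling over the polyline's nodes
    phi_rel: concatenation of each node with the pooled context
    Returns (N, 2*d_out).
    """
    enc = np.maximum(nodes @ W + b, 0.0)             # g_enc
    pooled = enc.max(axis=0)                         # phi_agg
    context = np.broadcast_to(pooled, enc.shape)     # same context for every node
    return np.concatenate([enc, context], axis=1)    # phi_rel

def polyline_embedding(nodes):
    return nodes.max(axis=0)                         # final pooling -> p

x = rng.normal(size=(4, 6))                          # 4 vectors, 6 raw features
W, b = rng.normal(size=(6, 16)), np.zeros(16)
h = subgraph_layer(x, W, b)
p = polyline_embedding(h)
```

Stacking several such layers deepens the intra-polyline message passing before the final pooling produces the polyline embedding $p$.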
- Global Interaction Graph: The polyline-level embeddings for all map and agent polylines are aggregated in a fully-connected global interaction graph. This stage employs multi-head self-attention layers, following the transformer paradigm. The updated polyline embeddings capture high-order and long-range contextual dependencies.
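The global interaction stage reduces to standard scaled dot-product self-attention over polyline embeddings, i.e., $\mathrm{GNN}(P) = \mathrm{softmax}(P_Q P_K^{\top}) P_V$. A minimal single-head NumPy sketch (random placeholder weights; the paper's version is learned and may use multiple heads and layers):

```python
import numpy as np

rng = np.random.default_rng(1)

def self_attention(P, Wq, Wk, Wv):
    """Single-head self-attention over polyline embeddings P (n, d)."""
    Q, K, V = P @ Wq, P @ Wk, P @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[1])           # scaled dot products
    scores -= scores.max(axis=1, keepdims=True)      # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=1, keepdims=True)                # softmax rows sum to 1
    return A @ V, A

d = 32
P = rng.normal(size=(5, d))                          # 5 polyline embeddings
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out, attn = self_attention(P, Wq, Wk, Wv)
```

Because the graph is fully connected, every polyline (agent or map element) can attend to every other one, which is what supplies the long-range context.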
3. Auxiliary Masked-Node Recovery Task
To enhance the network’s ability to learn robust and informative context representations, VectorNet introduces a self-supervised auxiliary task:
- During training, a random subset of polyline embeddings is masked and replaced by learnable ID embeddings.
- The network is trained to recover (predict) the original embeddings of the masked polylines from their contextual neighbors after the global interaction stage: $\hat{p}_i = \varphi_{\mathrm{node}}\big(p_i^{(L_t)}\big)$, where $p_i^{(L_t)}$ is the masked polyline's post-attention embedding.
- The loss is computed as a Huber or $\ell_2$ distance between reconstructed and ground-truth embeddings: $L_{\mathrm{node}} = \ell_{\mathrm{Huber}}\big(\hat{p}_i,\ p_i^{\mathrm{gt}}\big)$.
This auxiliary task regularizes feature learning and encourages the exploitation of inter-polyline context.
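The mechanics of the masking task can be sketched as below. This is an illustrative NumPy setup, not the trained pipeline: the zero mask token, the Huber threshold, and the noisy "recovered" embeddings are stand-ins (in practice the mask token is learnable and the recovery comes from the global attention network):

```python
import numpy as np

rng = np.random.default_rng(2)

def huber(pred, target, delta=1.0):
    """Elementwise Huber loss, averaged: quadratic near 0, linear beyond delta."""
    err = np.abs(pred - target)
    quad = np.minimum(err, delta)
    return np.mean(0.5 * quad**2 + delta * (err - quad))

P = rng.normal(size=(6, 8))                  # 6 polyline embeddings
mask_token = np.zeros(8)                     # learnable ID embedding in practice
masked_idx = rng.choice(6, size=2, replace=False)
P_in = P.copy()
P_in[masked_idx] = mask_token                # hide the selected polylines

# Stand-in for the network's output after global interaction:
recovered = P_in + 0.1 * rng.normal(size=P.shape)
loss = huber(recovered[masked_idx], P[masked_idx])
```

Only the masked positions contribute to the auxiliary loss, which forces the model to infer a polyline's content from its neighbors.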
4. Trajectory Decoding and Training Objective
The decoded output for each target agent is its predicted future trajectory:
- Trajectory Decoder: Using the final global-graph embedding of the target agent, an MLP predicts the future waypoint offsets in the agent-centric frame: $\{\hat{y}_t\}_{t=1}^{T} = \mathrm{MLP}\big(p_{\mathrm{target}}^{(L_t)}\big)$.
- Loss Functions: The primary supervised loss is either a negative log-likelihood for a Gaussian prediction or the average displacement error ($\ell_2$ loss): $L_{\mathrm{traj}} = \frac{1}{T}\sum_{t=1}^{T} \big\lVert \hat{y}_t - y_t \big\rVert_2$.
- Combined Objective: The total training loss balances the trajectory loss and the mask-reconstruction loss, typically with $\alpha = 1.0$: $L = L_{\mathrm{traj}} + \alpha\, L_{\mathrm{node}}$.
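The decoder and combined objective above can be sketched as follows. Hidden sizes and weights here are random placeholders; the $\alpha = 1.0$ weight follows the text, and the auxiliary-loss value is a stand-in:

```python
import numpy as np

rng = np.random.default_rng(3)

def decode_trajectory(p, W1, b1, W2, b2, T):
    """Two-layer MLP mapping a target embedding to T (x, y) waypoint offsets."""
    h = np.maximum(p @ W1 + b1, 0.0)         # hidden layer with ReLU
    return (h @ W2 + b2).reshape(T, 2)       # T waypoints in the agent frame

def ade_loss(pred, gt):
    """Average displacement error: mean L2 distance per timestep."""
    return np.mean(np.linalg.norm(pred - gt, axis=1))

d, hdim, T = 32, 64, 30
p = rng.normal(size=d)                       # target agent's global embedding
W1, b1 = rng.normal(size=(d, hdim)), np.zeros(hdim)
W2, b2 = rng.normal(size=(hdim, 2 * T)), np.zeros(2 * T)

pred = decode_trajectory(p, W1, b1, W2, b2, T)
gt = rng.normal(size=(T, 2))
L_traj = ade_loss(pred, gt)
L_node = 0.5                                  # stand-in auxiliary loss value
alpha = 1.0                                   # weight from the paper
L_total = L_traj + alpha * L_node
```

Because both losses are scalars, the combined objective is trained end-to-end with a single backward pass in the real implementation.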
5. Empirical Results and Computational Efficiency
VectorNet was empirically evaluated on both an internal proprietary dataset (over 2.2 million training and 0.55 million test samples) and the Argoverse public benchmark (333,000 sequences). Metrics include average displacement error (ADE) and displacement error at 1, 2, and 3 seconds.
Quantitative Comparison (DE@ts = displacement error at t seconds):

In-house dataset:

| Model | DE@1s | DE@2s | DE@3s | ADE |
|---|---|---|---|---|
| ResNet-18 | 0.47 | 0.71 | 1.00 | 0.63 |
| VectorNet (no mask) | 0.55 | 0.78 | 1.05 | 0.70 |
| VectorNet (+ mask task) | 0.53 | 0.74 | 1.00 | 0.66 |

Argoverse benchmark:

| Model | DE@1s | DE@2s | DE@3s | ADE |
|---|---|---|---|---|
| ResNet-18 | 1.05 | 2.48 | 4.49 | 1.96 |
| VectorNet (no mask) | 0.94 | 2.14 | 3.84 | 1.72 |
| VectorNet (+ mask task) | 0.92 | 2.06 | 3.67 | 1.66 |
Efficiency (n denotes the number of target agents per scene):

| Model | FLOPs | #Params | DE@3s (In-House) | DE@3s (Argoverse) |
|---|---|---|---|---|
| ResNet-18 | 10.56 G | 246 K | 1.00 | 4.49 |
| VectorNet w/o mask | 0.041 G × n | 72 K | 1.05 | 3.84 |
| VectorNet w/ mask | 0.041 G × n | 72 K | 1.00 | 3.67 |
VectorNet uses roughly 70% fewer parameters than the ResNet-18 baseline and, for typical target-agent counts, reduces floating-point operations by an order of magnitude or more, while improving Argoverse DE@3s by about 18% relative to the ConvNet approach (Gao et al., 2020).
6. Architectural Properties and Practical Considerations
- Avoidance of Lossy Rasterization: Direct vector input retains geometric and semantic detail, sidestepping challenges of color-coding, discretization, and information loss inherent to rasterization.
- Hierarchical Structure: Fine-grained local polyline encoding precedes a global interaction stage, supporting precise local shape modeling and flexible high-order context aggregation.
- Lightweight Computation: Complexity scales with the number of vectors and polylines rather than with a ConvNet's input resolution and channel count, allowing efficient scaling to large maps and agent sets.
- Flexibility: The architecture admits variations in intra-polyline and inter-polyline aggregation mechanisms, supporting MLP, GRU, self-attention, and other GNN blocks.
Limitations include the requirement for vectorized HD-maps (standard in many autonomous vehicle stacks) and per-target recentering, which complicates batched multi-agent inference. The original implementation decodes only single-modal future trajectories; extension to multi-modal decoders is structurally straightforward but left for future work.
7. Synthesis and Future Directions
VectorNet provides a principled alternative to raster-based encoding for autonomous vehicle behavior prediction by unifying geometric fidelity, context modeling, and computational efficiency through a hierarchical graph structure. Its approach addresses both scalability and expressivity challenges in high-definition map and agent context reasoning.
Further research directions include integration of multi-modal output distributions (anchors, CVAE, generative flows) for capturing uncertainty, optimizing shared centering for efficient joint agent inference, and broader evaluation across diverse urban driving scenarios (Gao et al., 2020).